
Publications


Featured research published by Zhuo Feng.


international conference on computer aided design | 2008

Multigrid on GPU: tackling power grid analysis on parallel SIMT platforms

Zhuo Feng; Peng Li

The challenging task of analyzing on-chip power (ground) distribution networks with multi-million-node complexity and beyond is key to today's large chip designs. For the first time, we show how to exploit recent massively parallel single-instruction multiple-thread (SIMT) based graphics processing unit (GPU) platforms to tackle power grid analysis with promising performance. Several key enablers, including GPU-specific algorithm design, circuit topology transformation, workload partitioning, and performance tuning, are embodied in our GPU-accelerated hybrid multigrid algorithm, GpuHMD, and its implementation. In particular, a proper interplay between algorithm design and SIMT architecture considerations is shown to be essential to achieving good runtime performance. Unlike standard CPU-based CAD development, care must be taken to balance computing against memory access, reduce random memory access patterns, and simplify flow control to achieve efficiency on the GPU platform. Extensive experiments on industrial and synthetic benchmarks have shown that the proposed GpuHMD engine can achieve 100× runtime speedup over a state-of-the-art direct solver and be more than 15× faster than the CPU-based multigrid implementation. The DC analysis of a 1.6 million-node industrial power grid benchmark can be accurately solved in three seconds with less than 50 MB of memory on a commodity GPU. It is observed that the proposed approach scales favorably with the circuit complexity, at a rate of about one second per million nodes.
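As a rough illustration of the multigrid idea behind such solvers, here is a minimal sketch of a geometric multigrid V-cycle on a 1D Poisson model problem (weighted-Jacobi smoothing, full-weighting restriction, linear-interpolation prolongation). This is not the GpuHMD implementation; all function names are illustrative.

```python
import numpy as np

def smooth(u, f, h, iters=3, w=2.0 / 3.0):
    # Weighted-Jacobi relaxation for the 1D Poisson problem -u'' = f.
    for _ in range(iters):
        u[1:-1] = (1 - w) * u[1:-1] + w * 0.5 * (u[:-2] + u[2:] + h * h * f[1:-1])
    return u

def residual(u, f, h):
    r = np.zeros_like(u)
    r[1:-1] = f[1:-1] - (2 * u[1:-1] - u[:-2] - u[2:]) / (h * h)
    return r

def v_cycle(u, f, h):
    # One V-cycle: pre-smooth, coarse-grid correction, post-smooth.
    u = smooth(u, f, h)
    if len(u) <= 3:
        return u
    r = residual(u, f, h)
    rc = np.zeros((len(r) + 1) // 2)          # full-weighting restriction
    rc[1:-1] = 0.25 * r[1:-2:2] + 0.5 * r[2:-1:2] + 0.25 * r[3::2]
    ec = v_cycle(np.zeros_like(rc), rc, 2 * h)
    e = np.zeros_like(u)                      # linear-interpolation prolongation
    e[::2] = ec
    e[1:-1:2] = 0.5 * (ec[:-1] + ec[1:])
    u += e
    return smooth(u, f, h)
```

On a real power grid, the same cycle operates on an irregular conductance network rather than a 1D line, and the smoothing and grid-transfer operators are exactly the components that must be tuned for the SIMT architecture.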


design automation conference | 2010

Tradeoff analysis and optimization of power delivery networks with on-chip voltage regulation

Zhiyu Zeng; Xiaoji Ye; Zhuo Feng; Peng Li

Integrating a large number of on-chip voltage regulators holds the promise of solving many power delivery challenges through strong local load regulation while facilitating system-level power management. The quantitative understanding of such complex power delivery networks (PDNs) is hampered by the large network complexity and the interactions between passive on-die/package-level circuits and a multitude of nonlinear active regulators. We develop a fast combined GPU-CPU analysis engine encompassing several simulation strategies, each optimized for a different subcomponent of the network. Using accurate quantitative analysis, we demonstrate the significant performance improvement brought by on-chip low-dropout regulators (LDOs) in terms of suppressing high-frequency local voltage droops and avoiding the mid-frequency resonance caused by off-chip inductive parasitics. We perform comprehensive analysis of the tradeoffs among the overhead of on-chip LDOs, maximum voltage droop, and overall power efficiency. We conduct systematic design optimization by developing a simulation-based nonlinear optimization strategy that determines the optimal number of on-chip LDOs, the on-board input voltage, and the corresponding voltage droop and power efficiency for PDNs with multiple power domains.


international conference on computer aided design | 2006

Combinatorial algorithms for fast clock mesh optimization

Ganesh Venkataraman; Zhuo Feng; Jiang Hu; Peng Li

We present a fast and efficient combinatorial algorithm to simultaneously identify the candidate locations as well as the sizes of the buffers driving a clock mesh. Due to its high redundancy, a mesh architecture offers high tolerance to variation in the clock skew. However, such redundancy comes at the expense of mesh wire length and power dissipation. Based on survivable network theory, we formulate the problem of reducing the clock mesh by retaining only those edges that are critical to maintaining redundancy. Such a formulation offers the designer the option to trade off between power and tolerance to process variations. Experimental results indicate that our techniques can yield power savings of up to 28% with less than 4% delay penalty.


international conference on computer aided design | 2006

Performance-oriented statistical parameter reduction of parameterized systems via reduced rank regression

Zhuo Feng; Peng Li

Process variations in modern VLSI technologies are growing in both magnitude and dimensionality. To assess performance variability, complex simulation and performance models parameterized in a high-dimensional process variation space are desired. However, the high parameter dimensionality imposed by the large number of variation sources encountered in modern technologies can introduce significant complications in circuit analysis and may even render performance variability analysis completely intractable. We address the challenge brought by high-dimensional process variations via a new performance-oriented parameter dimension reduction technique. The basic premise behind our approach is that the dimensionality of performance variability is determined not only by the statistical characteristics of the underlying process variables, but also by the structural information imposed by a given design. Using the powerful reduced rank regression (RRR) and its extension as a vehicle for variability modeling, we are able to systematically identify statistically significant reduced parameter sets and compute not only reduced-parameter but also reduced-parameter-order models that are far more efficient than was previously possible. For a variety of interconnect modeling problems, it is shown that the proposed parameter reduction technique can provide more than one order of magnitude reduction in parameter dimensionality. Such parameter reduction immediately leads to reduced simulation cost in sampling-based performance analysis and, more importantly, to highly efficient parameterized interconnect reduced-order models. As a general parameter dimension reduction methodology, the proposed technique is anticipated to be broadly applicable to a variety of statistical circuit modeling problems, thereby offering a useful framework for controlling the complexity of statistical circuit analysis.
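To make the reduced-rank-regression idea concrete, the following sketch runs plain RRR on synthetic data: fit by ordinary least squares, then project the fitted responses onto their top-r singular directions to obtain a rank-r coefficient matrix. The synthetic model, variable names, and dimensions are all illustrative; the paper's extended formulation is not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, q, r = 200, 20, 6, 2          # samples, parameters, outputs, target rank

# Synthetic data: performance Y depends on X only through r linear combinations.
A = rng.standard_normal((p, r))
C = rng.standard_normal((r, q))
X = rng.standard_normal((n, p))
Y = X @ A @ C + 0.01 * rng.standard_normal((n, q))

# Ordinary least squares, then project fitted values onto top-r directions.
B_ols, *_ = np.linalg.lstsq(X, Y, rcond=None)
_, _, Vt = np.linalg.svd(X @ B_ols, full_matrices=False)
Vr = Vt[:r].T                        # top-r right singular vectors of the fit
B_rrr = B_ols @ Vr @ Vr.T            # rank-r coefficient matrix

# The r columns of B_ols @ Vr define a reduced parameter set z = X @ (B_ols @ Vr),
# replacing the original p-dimensional parameter space with an r-dimensional one.
rel_err = np.linalg.norm(Y - X @ B_rrr) / np.linalg.norm(Y)
```

In the statistical-modeling setting above, X plays the role of the process variables and Y the circuit performances; the rank-r structure is what makes the reduced parameter set statistically significant.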


design automation conference | 2010

Parallel multigrid preconditioning on graphics processing units (GPUs) for robust power grid analysis

Zhuo Feng; Zhiyu Zeng

Leveraging the power of today's graphics processing units (GPUs) for robust power grid simulation remains a challenging task. Existing preconditioned iterative methods that require incomplete matrix factorizations cannot be effectively accelerated on the GPU due to its limited per-thread hardware resources and data-parallel computing model. This work presents an efficient GPU-based multigrid preconditioning algorithm for robust power grid analysis. An ELL-like sparse matrix data structure is adopted and implemented specifically for power grid analysis to ensure coalesced GPU device memory access and high arithmetic intensity. By combining a fast geometric multigrid solver with a robust Krylov-subspace iterative solver, power grid DC and transient analysis can be performed efficiently on the GPU without loss of accuracy (largest errors < 0.5 mV). Unlike previous GPU-based algorithms that rely on good power grid regularity, the proposed algorithm can be applied to more general power grid structures. Experimental results show that DC and transient analysis on the GPU achieves more than 25X speedups over the best available CPU-based solvers. An industrial power grid with 10.5 million nodes can be accurately solved in 12 seconds.
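The ELL (ELLPACK) layout mentioned above can be sketched as follows. This is a generic CPU-side illustration (function names are ours, not from the paper) of the padding scheme: every row is padded to the same width, so that on a GPU, with the arrays transposed to column-major order, threads handling adjacent rows make coalesced memory accesses.

```python
import numpy as np

def to_ell(dense):
    # Pad every row to the maximum nonzeros-per-row. On a GPU the cols/vals
    # arrays would be stored column-major so that one-thread-per-row SpMV
    # reads consecutive addresses across a warp (coalesced access).
    n = dense.shape[0]
    row_cols = [np.nonzero(dense[i])[0] for i in range(n)]
    width = max(len(c) for c in row_cols)
    cols = np.zeros((n, width), dtype=np.int64)
    vals = np.zeros((n, width))
    for i, c in enumerate(row_cols):
        cols[i, :len(c)] = c
        vals[i, :len(c)] = dense[i, c]
    return cols, vals

def ell_matvec(cols, vals, x):
    # y[i] = sum over the padded row; padded entries have vals == 0
    # and therefore contribute nothing, whatever column index they carry.
    return (vals * x[cols]).sum(axis=1)
```

Power grid matrices are a good fit for ELL because every node connects to at most a handful of neighbors, so the rows have nearly uniform nonzero counts and the padding wastes little memory.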


design automation conference | 2007

Fast second-order statistical static timing analysis using parameter dimension reduction

Zhuo Feng; Peng Li; Yaping Zhan

The ability to account for the growing impact of multiple process variations in modern technologies is becoming an integral part of nanometer VLSI design. In the context of timing analysis, the need to combat process variations has sparked a growing body of statistical static timing analysis (SSTA) techniques. While first-order SSTA techniques enjoy the good runtime efficiency desired for tackling large industrial designs, more accurate second-order SSTA techniques have been proposed to improve analysis accuracy, but at the cost of high computational complexity. Although many sources of variation may impact circuit performance, considering a large number of inter-die and intra-die variations in traditional SSTA is very challenging. In this paper, we address the analysis complexity brought by high parameter dimensionality in static timing analysis and propose an accurate yet fast second-order SSTA algorithm based upon novel parameter dimension reduction. By developing reduced-rank-regression-based parameter reduction algorithms within a block-based SSTA flow, we demonstrate that accurate second-order SSTA can be extended to a much higher parameter dimensionality than was previously possible. Our experimental results show that the proposed parameter reduction can achieve up to 10X parameter dimension reduction and lead to significantly improved second-order SSTA under a large set of process variations.


IEEE Transactions on Very Large Scale Integration Systems | 2011

Parallel On-Chip Power Distribution Network Analysis on Multi-Core-Multi-GPU Platforms

Zhuo Feng; Zhiyu Zeng; Peng Li

The challenging task of analyzing on-chip power (ground) distribution networks with multimillion-node complexity and beyond is key to today's large chip designs. For the first time, we show how to exploit recent massively parallel single-instruction multiple-thread (SIMT)-based graphics processing unit (GPU) platforms to tackle large-scale power grid analysis with promising performance. Several key enablers, including GPU-specific algorithm design, circuit topology transformation, workload partitioning, and performance tuning, are embodied in our GPU-accelerated hybrid multigrid (HMD) algorithm (GpuHMD) and its implementation. We also demonstrate that, using the HMD solver as a preconditioner, the conjugate gradient solver can converge much faster to the true solution with good robustness. Extensive experiments on industrial and synthetic benchmarks have shown that for DC power grid analysis using one GPU, the proposed simulation engine achieves up to 100× runtime speedup over a state-of-the-art direct solver and more than 50× speedup over the CPU-based multigrid implementation, while on a four-core, four-GPU system a grid with eight million nodes can be solved within about 1 s. It is observed that the proposed approach scales favorably with the circuit complexity, at a rate of about 1 s per two million nodes on a single GPU card.
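The preconditioned conjugate gradient loop in which the HMD solver plays the preconditioner role can be sketched as below. This is the textbook PCG iteration, not the paper's code, and a simple Jacobi (diagonal) preconditioner stands in for the multigrid solver.

```python
import numpy as np

def pcg(A, b, apply_M_inv, tol=1e-8, maxiter=200):
    # Preconditioned conjugate gradient for SPD A; apply_M_inv(r) should
    # cheaply approximate A^{-1} r (here: one multigrid cycle in the paper,
    # a diagonal scaling in this sketch).
    x = np.zeros_like(b)
    r = b - A @ x
    z = apply_M_inv(r)
    p = z.copy()
    rz = r @ z
    for _ in range(maxiter):
        Ap = A @ p
        alpha = rz / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        if np.linalg.norm(r) < tol * np.linalg.norm(b):
            break
        z = apply_M_inv(r)
        rz_new = r @ z
        p = z + (rz_new / rz) * p
        rz = rz_new
    return x
```

The design point the abstract makes is that the better apply_M_inv approximates the grid's inverse, the fewer iterations CG needs, while CG in turn adds robustness that a stand-alone multigrid cycle may lack on irregular grids.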


international conference on computer aided design | 2011

Power grid analysis with hierarchical support graphs

Xueqian Zhao; Jia Wang; Zhuo Feng; Shiyan Hu

It is increasingly challenging to analyze present-day large-scale power delivery networks (PDNs) due to the drastically growing complexity of power grid designs. To achieve greater runtime and memory efficiency, a variety of preconditioned iterative algorithms have been investigated in the past few decades with promising performance, while incremental power grid analysis has also become popular for fast re-simulation of corrected designs. Although existing preconditioned solvers, such as incomplete-matrix-factor-based preconditioners, usually exhibit high efficiency in memory usage, their convergence behavior is not always satisfactory. In this work, we present a novel hierarchical support-graph preconditioned iterative algorithm that constructs preconditioners by generating spanning trees in power supply networks for fast power grid analysis. The support-graph preconditioner is efficient for handling complex power grid structures (regular or irregular grids), and can facilitate very fast incremental analysis. Our experimental results on IBM power grid benchmarks show that, compared with the best direct or iterative solvers, the proposed support-graph preconditioned iterative solver achieves up to 3.6X speedups for DC analysis and up to 22X speedups for incremental analysis, while reducing memory consumption by a factor of four.


international conference on computer aided design | 2010

Fast thermal analysis on GPU for 3D-ICs with integrated microchannel cooling

Zhuo Feng; Peng Li

While effective thermal management for 3D-ICs is becoming increasingly challenging due to ever-increasing power density and chip design complexity, traditional heat sinks are expected to quickly reach their limits in meeting the cooling needs of 3D-ICs. Alternatively, the integrated liquid-cooled microchannel heat sink has become one of the most effective solutions. For the first time, we present fast GPU-based thermal simulation methods for 3D-ICs with integrated microchannel cooling. Based on the physical heat dissipation paths of 3D-ICs with integrated microchannels, we propose novel preconditioned iterative methods that can be efficiently accelerated on GPUs' massively parallel computing platforms. Unlike the CPU-based solver development environment, in which many existing sophisticated numerical simulation methods (matrix solvers) can be readily adopted and implemented, GPU-based thermal simulation demands more effort in the algorithm and data structure design phase, and requires careful consideration of the GPU's thread/memory organization, data access/communication patterns, arithmetic intensity, as well as hardware occupancy. As shown in various experimental results, our GPU-based 3D thermal simulation solvers can achieve up to 360X speedups over the best available direct solvers and more than 35X speedups over CPU-based iterative solvers, without loss of accuracy.


IEEE Transactions on Very Large Scale Integration Systems | 2007

Characterizing Multistage Nonlinear Drivers and Variability for Accurate Timing and Noise Analysis

Peng Li; Zhuo Feng; Emrah Acar

Nanoscale device characteristics and noise coupling have rendered traditional waveform-based gate delay models increasingly difficult to adopt. While the widely adopted delay models are built upon the assumption of simple ramp-like signal waveforms, realistic signal shapes in nanoscale designs can be far more complex. The need to consider process-voltage-temperature (PVT) variations imposes further accuracy requirements on gate models. We present a parameterizable waveform-independent gate model (PWiM) in which no assumption is made about the input waveforms. The PWiM model is constructed by encapsulating the driver's intrinsic nonlinear dc and dynamic characteristics, which are important to model for complex signal waveforms, via novel yet easy-to-implement characterization steps. As such, PWiM can provide near-SPICE accuracy for input signals that deviate significantly from simple ramps. While recently developed current-based models can only be applied to a single channel-connected component, PWiM works for multistage cells, leading to improved library compactness and analysis efficiency. Our experiments indicate that the proposed driver model not only provides up to two orders of magnitude speedup over SPICE for delay and noise analysis, but also offers accurate assessment of the performance variability introduced by process and environmental variations.
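A generic nonlinear current-source driver model, the family PWiM belongs to, can be sketched as follows. The dc characteristic here is a made-up inverter-like function standing in for a characterized lookup table, and none of the names or constants come from the paper; the point is only that once the driver is reduced to i(v_in, v_out) plus a load capacitance, any input waveform shape is handled by the same time-stepping loop.

```python
import numpy as np

VDD = 1.0

def dc_current(v_in, v_out):
    # Hypothetical dc output current of an inverting driver (a stand-in for a
    # characterized table): pulls the output up when the input is low and
    # down when the input is high.
    g = 5e-3  # illustrative transconductance scale, in A/V
    return g * ((VDD - v_in) * (VDD - v_out) - v_in * v_out)

def simulate(v_in_wave, dt, c_load=10e-15):
    # Waveform-independent evaluation: integrate c_load * dv/dt = i(v_in, v_out)
    # with forward Euler, making no assumption about the input's shape.
    v_out = np.empty_like(v_in_wave)
    v = VDD
    for k, v_in in enumerate(v_in_wave):
        v += dt * dc_current(v_in, v) / c_load
        v_out[k] = v
    return v_out
```

In a real model, the dynamic (capacitive) characteristics would also be characterized rather than lumped into a fixed c_load, and the table would be parameterized in the PVT variables to capture variability.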

Collaboration

Top co-authors of Zhuo Feng:

- Xueqian Zhao (Michigan Technological University)
- Lengfei Han (Michigan Technological University)
- Shiyan Hu (Michigan Technological University)
- Yaping Zhan (Advanced Micro Devices)
- Yonghe Guo (Michigan Technological University)