Publication


Featured research published by Deming Chen.


field programmable gate arrays | 2003

Architecture evaluation for power-efficient FPGAs

Fei Li; Deming Chen; Lei He; Jason Cong

This paper presents a flexible FPGA architecture evaluation framework, named fpgaEVA-LP, for power efficiency analysis of LUT-based FPGA architectures. Our work has several contributions: (i) We develop a mixed-level FPGA power model that combines switch-level models for interconnects and macromodels for LUTs; (ii) We develop a tool that automatically generates a back-annotated gate-level netlist with post-layout extracted capacitances and delays; (iii) We develop a cycle-accurate power simulator based on our power model. It carries out gate-level simulation under a real delay model and is able to capture glitch power; (iv) Using the framework fpgaEVA-LP, we study the power efficiency of FPGAs, in 0.10um technology, under various settings of architecture parameters such as LUT sizes, cluster sizes and wire segmentation schemes and reach several important conclusions. We also present the detailed power consumption distribution among different FPGA components and shed light on the potential opportunities of power optimization for future FPGA designs (e.g., ≤ 0.10um technology).


international conference on computer aided design | 2004

DAOmap: a depth-optimal area optimization mapping algorithm for FPGA designs

Deming Chen; Jason Cong

In this work we study the technology mapping problem for FPGA architectures to minimize chip area, or the total number of lookup tables (LUTs) of the mapped design, under the chip performance constraint. This is a well-studied topic and a very difficult task (NP-hard). The contributions of this work are as follows: (i) we consider the potential node duplications during the cut enumeration/generation procedure so the mapping costs encoded in the cuts drive the area-optimization objective more effectively; (ii) after the timing constraint is determined, we will relax the non-critical paths by searching the solution space considering both local and global optimality information to minimize mapping area; (iii) an iterative cut selection procedure is carried out that further explores and perturbs the solution space to improve solution quality. We guarantee optimal mapping depth under the unit delay model. Experimental results show that our mapping algorithm, named DAOmap, produces significant quality and runtime improvements. Compared to the state-of-the-art depth-optimal, area minimization mapping algorithm CutMap (Cong and Hwang, 1995), DAOmap is 16.02% better on area and runs 24.2X faster on average when both algorithms are mapping to FPGAs using LUTs of five inputs. LUTs of other input sizes are also used for comparisons.
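To make the cut enumeration step mentioned in the abstract more concrete, here is a minimal sketch of plain K-feasible cut enumeration on a small netlist. It is not DAOmap itself: the duplication-aware cost functions and the iterative cut selection are omitted, and the function name `enumerate_cuts`, the dictionary-based netlist format, and the example circuit are all assumptions made for illustration.

```python
from itertools import product


def enumerate_cuts(fanins, k):
    """Return {node: set of cuts}, where each cut is a frozenset of nodes.

    fanins maps each node to the list of its fanin nodes (hypothetical format).
    Primary inputs have an empty list. Nodes must appear in topological order.
    """
    cuts = {}
    for node, ins in fanins.items():
        node_cuts = {frozenset([node])}  # the trivial cut: the node itself
        if ins:
            # Combine one cut from every fanin and keep the K-feasible unions.
            for combo in product(*(cuts[i] for i in ins)):
                merged = frozenset().union(*combo)
                if len(merged) <= k:
                    node_cuts.add(merged)
        cuts[node] = node_cuts
    return cuts


# Example netlist: g3 = f(g1, g2), g1 = f(a, b), g2 = f(c, d).
netlist = {
    "a": [], "b": [], "c": [], "d": [],
    "g1": ["a", "b"],
    "g2": ["c", "d"],
    "g3": ["g1", "g2"],
}
cuts_k3 = enumerate_cuts(netlist, 3)
```

With `k = 3` the cut `{a, b, c, d}` of `g3` is rejected because it needs four LUT inputs; with `k = 4` it becomes feasible. A mapper would then attach a cost to each cut and choose one cut per mapped node.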


IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems | 2005

Power modeling and characteristics of field programmable gate arrays

Fei Li; Yizhou Lin; Lei He; Deming Chen; Jason Cong

This paper studies power modeling for field programmable gate arrays (FPGAs) and investigates FPGA power characteristics in nanometer technologies. Considering both dynamic and leakage power, a mixed-level power model that combines switch-level models for interconnects and macromodels for look-up tables (LUTs) is developed. Gate-level netlists back-annotated with postlayout capacitances and delays are generated and cycle-accurate power simulation is performed using the mixed-level power model. The resulting power analysis framework is named fpgaEVA-LP2. Experiments show that fpgaEVA-LP2 achieves high fidelity compared to SPICE simulation, and the absolute error is merely 8% on average. fpgaEVA-LP2 can be used to examine the power impact of FPGA circuits, architectures, and CAD algorithms, and it is used to study the power characteristics of existing FPGA architectures in this paper. It is shown that interconnect power is dominant and leakage power is significant in nanometer technologies. In addition, tuning cluster and LUT sizes leads to a 1.7× energy difference and a 0.8× delay difference between the resulting min-energy and min-delay FPGA architectures, and FPGA area and power are reduced at the same time by tuning the cluster and LUT sizes. The existing commercial architectures are similar to the min-energy (and min-area at the same time) architecture according to this study. Therefore, innovative FPGA circuits, architectures, and CAD algorithms, for example, considering programmable power supply voltage, are needed to further reduce FPGA power.


symposium on application specific processors | 2009

FCUDA: Enabling efficient compilation of CUDA kernels onto FPGAs

Alexandros Papakonstantinou; Karthik Gururaj; John A. Stratton; Deming Chen; Jason Cong; Wen-mei W. Hwu

As growing power dissipation and thermal effects disrupted the rising clock frequency trend and threatened to annul Moore's law, the computing industry has switched its route to higher performance through parallel processing. The rise of multi-core systems in all domains of computing has opened the door to heterogeneous multi-processors, where processors of different compute characteristics can be combined to effectively boost the performance per watt of different application kernels. GPUs and FPGAs are becoming very popular in PC-based heterogeneous systems for speeding up compute-intensive kernels of scientific, imaging and simulation applications. GPUs can execute hundreds of concurrent threads, while FPGAs provide customized concurrency for highly parallel kernels. However, exploiting the parallelism available in these applications is currently not a push-button task. Often the programmer has to expose the application's fine- and coarse-grained parallelism by using special APIs. CUDA is such a parallel-computing API that is driven by the GPU industry and is gaining significant popularity. In this work, we adapt the CUDA programming model into a new FPGA design flow called FCUDA, which efficiently maps the coarse- and fine-grained parallelism exposed in CUDA onto the reconfigurable fabric. Our CUDA-to-FPGA flow employs AutoPilot, an advanced high-level synthesis tool which enables high-abstraction FPGA programming. FCUDA is based on a source-to-source compilation that transforms the SPMD CUDA thread blocks into parallel C code for AutoPilot. We describe the details of our CUDA-to-FPGA flow and demonstrate the highly competitive performance of the resulting customized FPGA multi-core accelerators. To the best of our knowledge, this is the first CUDA-to-FPGA flow to demonstrate the applicability and potential advantage of using the CUDA programming model for high-performance computing in FPGAs.


IEEE Transactions on Circuits and Systems | 2007

3-D nFPGA: A Reconfigurable Architecture for 3-D CMOS/Nanomaterial Hybrid Digital Circuits

Chen Dong; Deming Chen; Sansiri Haruehanroengra; Wei Wang

In this paper, we introduce a novel reconfigurable architecture, named 3D field-programmable gate array (3D nFPGA), which utilizes 3D integration techniques and new nanoscale materials synergistically. The proposed architecture is based on CMOS nanohybrid techniques that incorporate nanomaterials such as carbon nanotube bundles and nanowire crossbars into the CMOS fabrication process. This architecture also has built-in features for fault tolerance and heat alleviation. Using unique features of FPGAs and a novel 3D stacking method enabled by the application of nanomaterials, 3D nFPGA obtains a 4× footprint reduction compared to the traditional CMOS-based 2D FPGAs. With a customized design automation flow, we evaluate the performance and power of 3D nFPGA driven by the 20 largest MCNC benchmarks. Results demonstrate that 3D nFPGA is able to provide a performance gain of 2.6× with a small power overhead compared to the traditional 2D FPGA architecture.


high-performance computer architecture | 2009

Blueshift: Designing processors for timing speculation from the ground up.

Brian Greskamp; Lu Wan; Ulya R. Karpuzcu; Jeffrey J. Cook; Josep Torrellas; Deming Chen; Craig B. Zilles

Several recent processor designs have proposed to enhance performance by increasing the clock frequency to the point where timing faults occur, and by adding error-correcting support to guarantee correctness. However, such Timing Speculation (TS) proposals are limited in that they assume traditional design methodologies that are suboptimal under TS. In this paper, we present a new approach where the processor itself is designed from the ground up for TS. The idea is to identify and optimize the most frequently-exercised critical paths in the design, at the expense of the majority of the static critical paths, which are allowed to suffer timing errors. Our approach and design optimization algorithm are called BlueShift. We also introduce two techniques that, when applied under BlueShift, improve processor performance: On-demand Selective Biasing (OSB) and Path Constraint Tuning (PCT). Our evaluation with modules from the OpenSPARC T1 processor shows that, compared to conventional TS, BlueShift with OSB speeds up applications by an average of 8% while increasing the processor power by an average of 12%. Moreover, compared to a high-performance TS design, BlueShift with PCT speeds up applications by an average of 6% with an average processor power overhead of 23%, providing a way to speed up logic modules that is orthogonal to voltage scaling.


asia and south pacific design automation conference | 2004

Register binding and port assignment for multiplexer optimization

Deming Chen; Jason Cong

Data path connection elements, such as multiplexers, consume a significant amount of area on a VLSI chip, especially for FPGA designs. Multiplexer optimization is a difficult problem because both register binding and port assignment to reduce total multiplexer connectivity during high-level synthesis are NP-complete problems. In this paper, we first formulate a k-cofamily-based register binding algorithm targeting the multiplexer optimization problem. We then further reduce the multiplexer width through an efficient port assignment algorithm. Experimental results show that we are 44% better overall than the left-edge register binding algorithm on the total usage of multiplexer inputs and 7% better than a bipartite graph-based algorithm. For large designs, we are able to achieve significantly better results consistently. After technology mapping, placement and routing for an FPGA architecture, it shows considerably positive impacts on chip area, delay and power consumption.
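For context, the left-edge register binding algorithm that the abstract uses as a baseline can be sketched as follows. This is only the classic baseline, not the k-cofamily-based algorithm the paper proposes, and it ignores multiplexer cost entirely. The function name `left_edge_binding`, the half-open lifetime intervals, and the example variables are assumptions for illustration.

```python
def left_edge_binding(lifetimes):
    """Bind variables to registers with the classic left-edge heuristic.

    lifetimes: {variable: (start, end)} using half-open intervals [start, end).
    Returns {variable: register_index}.
    """
    # Visit variables by increasing start time (their "left edge").
    order = sorted(lifetimes, key=lambda v: lifetimes[v])
    free_at = []  # free_at[r] is the time at which register r becomes free
    binding = {}
    for var in order:
        start, end = lifetimes[var]
        for reg, t in enumerate(free_at):
            if t <= start:  # register r is free again, so reuse it
                free_at[reg] = end
                binding[var] = reg
                break
        else:
            # No existing register is free: allocate a new one.
            free_at.append(end)
            binding[var] = len(free_at) - 1
    return binding


# Hypothetical variable lifetimes taken from some schedule.
example = {"a": (0, 3), "b": (1, 4), "c": (3, 6), "d": (4, 7), "e": (2, 5)}
binding = left_edge_binding(example)
```

In this example three variables (`a`, `b`, `e`) are alive at the same time, so three registers are the minimum, and the heuristic uses exactly three. Because it minimizes only the register count, the resulting multiplexer connectivity can be poor, which is the gap the paper's approach targets.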


Bioinformatics | 2014

BLESS: Bloom filter-based error correction solution for high-throughput sequencing reads

Yun Heo; Xiao Long Wu; Deming Chen; Jian Ma; Wen-mei W. Hwu

Motivation: Rapid advances in next-generation sequencing (NGS) technology have led to an exponential increase in the amount of genomic information. However, NGS reads contain far more errors than data from traditional sequencing methods, and downstream genomic analysis results can be improved by correcting the errors. Unfortunately, all the previous error correction methods required a large amount of memory, making them unsuitable for processing reads from large genomes with commodity computers.
Results: We present a novel algorithm that produces accurate correction results with much less memory compared with previous solutions. The algorithm, named BLoom-filter-based Error correction Solution for high-throughput Sequencing reads (BLESS), uses a single minimum-sized Bloom filter, and is also able to tolerate a higher false-positive rate, thus allowing us to correct errors with a 40× memory usage reduction on average compared with previous methods. Meanwhile, BLESS can extend reads like DNA assemblers to correct errors at the end of reads. Evaluations using real and simulated reads showed that BLESS could generate more accurate results than existing solutions. After errors were corrected using BLESS, 69% of initially unaligned reads could be aligned correctly. Additionally, de novo assembly results became 50% longer with 66% fewer assembly errors.
Availability and implementation: Freely available at http://sourceforge.net/p/bless-ec
Contact: [email protected]
Supplementary information: Supplementary data are available at Bioinformatics online.
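The Bloom filter at the core of the abstract above can be illustrated with a short sketch. This is not the BLESS implementation: the salted SHA-256 hashing, the filter size, the in-memory dictionary used for counting, and the `min_count` threshold are all assumptions chosen to keep the example small. In particular, counting k-mers in a dictionary would defeat the memory savings the paper reports.

```python
import hashlib


class BloomFilter:
    """Minimal Bloom filter: a bit array probed by several hash functions."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Derive the bit positions by salting one hash function (illustrative choice).
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        # May return a false positive, but never a false negative.
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))


def kmers(read, k):
    """Yield every length-k substring of a read."""
    for i in range(len(read) - k + 1):
        yield read[i:i + k]


def build_kmer_filter(reads, k, min_count, num_bits=1 << 16, num_hashes=4):
    """Store k-mers seen at least min_count times (hypothetical threshold)."""
    counts = {}
    for read in reads:
        for kmer in kmers(read, k):
            counts[kmer] = counts.get(kmer, 0) + 1
    bf = BloomFilter(num_bits, num_hashes)
    for kmer, count in counts.items():
        if count >= min_count:
            bf.add(kmer)
    return bf


reads = ["ACGTACGT", "ACGTACGT", "ACGTACGA"]
trusted = build_kmer_filter(reads, k=4, min_count=2)
```

Here `"ACGT" in trusted` holds because the k-mer occurs in every read, while the rare k-mer `ACGA` (seen once, at the end of the third read) is not stored, which marks it as a candidate sequencing error. An error corrector would then try substitutions that turn such k-mers into stored ones.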


IEEE Journal of Selected Topics in Signal Processing | 2009

A Fast Digital Predistortion Algorithm for Radio-Frequency Power Amplifier Linearization With Loop Delay Compensation

Hao Li; Dae Hyun Kwon; Deming Chen; Yun Chiu

An adaptive, digital, baseband predistortion (PD) algorithm that compensates for the memoryless nonlinearities of radio-frequency (RF) power amplifiers (PAs) for wireless systems using non-constant-envelope modulation schemes is presented. Compared with the conventional, complex-gain predistorters based on lookup tables (LUTs), the proposed direct-learning, multilevel lookup table (ML-LUT) approach assisted by a hardware-efficient loop delay compensation scheme achieves a significant reduction in convergence time and an improvement in linearization accuracy in the presence of an unknown loopback delay. The experimental results in an FPGA prototyping platform show that the fast adaptation speed enables the predistorter to track time-varying PA nonlinearities as fast as in the tens of kilohertz range, constituting a potential solution for highly efficient PAs in mobile handsets.


international conference on computer aided design | 2013

An efficient compiler framework for cache bypassing on GPUs

Xiaolong Xie; Yun Liang; Guangyu Sun; Deming Chen

Graphics Processing Units (GPUs) have become ubiquitous for general purpose applications due to their tremendous computing power. Initially, GPUs only employed scratchpad memory as on-chip memory. Though scratchpad memory benefits many applications, it is not ideal for those general purpose applications with irregular memory accesses. Hence, GPU vendors have introduced caches in conjunction with scratchpad memory in the recent generations of GPUs. The caches on GPUs are highly configurable. The programmer or the compiler can explicitly control cache access or bypass for global load instructions. This highly configurable feature of GPU caches opens up opportunities for optimizing cache performance. In this paper, we propose an efficient compiler framework for cache bypassing on GPUs. Our objective is to efficiently utilize the configurable cache and improve the overall performance for general purpose GPU applications. In order to achieve this goal, we first characterize GPU cache utilization and develop performance metrics to estimate the cache reuses and memory traffic. Next, we present efficient algorithms that judiciously select global load instructions for cache access or bypass. Finally, we integrate our techniques into an automatic compiler framework that leverages the PTX instruction set architecture. Experimental evaluation demonstrates that compared to cache-all and bypass-all solutions, our techniques can achieve considerable performance improvement.

Collaboration


Dive into Deming Chen's collaborations.

Top Co-Authors

Jason Cong

University of California

Kyle Rupnow

University of Wisconsin-Madison

Liwei Yang

Nanyang Technological University

Yiping Fan

University of California
