Is this you? Create Your Porfile

Jason Cong

University of California, Los Angeles

Archive Network Publication Hotspot Collaboration

Network

Latest external collaboration on country level. Dive into details by clicking on the dots.

Explore More

Hotspot

Dive into the research topics where Jason Cong is active.

Explore More

Publication

Featured researches published by Jason Cong.

IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems | 1994

FlowMap: an optimal technology mapping algorithm for delay optimization in lookup-table based FPGA designs

Jason Cong; Yuzheng Ding

The field programmable gate-array (FPGA) has become an important technology in VLSI ASIC designs. In the past few years, a number of heuristic algorithms have been proposed for technology mapping in lookup-table (LUT) based FPGA designs, but none of them guarantees optimal solutions for general Boolean networks and little is known about how far their solutions are away from the optimal ones. This paper presents a theoretical breakthrough which shows that the LUT-based FPGA technology mapping problem for depth minimization can be solved optimally in polynomial time. A key step in our algorithm is to compute a minimum height K-feasible cut in a network, which is solved optimally in polynomial time based on network flow computation. Our algorithm also effectively minimizes the number of LUTs by maximizing the volume of each cut and by several post-processing operations. Based on these results, we have implemented an LUT-based FPGA mapping package called FlowMap. We have tested FlowMap on a large set of benchmark examples and compared it with other LUT-based FPGA mapping algorithms for delay optimization, including Chortle-d, MIS-pga-delay, and DAG-Map. FlowMap reduces the LUT network depth by up to 7% and reduces the number of LUTs by up to 50% compared to the three previous methods. >

IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems | 2011

High-Level Synthesis for FPGAs: From Prototyping to Deployment

Jason Cong; Bin Liu; Stephen Neuendorffer; Juanjo Noguera; Kees A. Vissers; Zhiru Zhang

Escalating system-on-chip design complexity is pushing the design community to raise the level of abstraction beyond register transfer level. Despite the unsuccessful adoptions of early generations of commercial high-level synthesis (HLS) systems, we believe that the tipping point for transitioning to HLS msystem-on-chip design complexityethodology is happening now, especially for field-programmable gate array (FPGA) designs. The latest generation of HLS tools has made significant progress in providing wide language coverage and robust compilation technology, platform-based modeling, advancement in core HLS algorithms, and a domain-specific approach. In this paper, we use AutoESLs AutoPilot HLS tool coupled with domain-specific system-level implementation platforms developed by Xilinx as an example to demonstrate the effectiveness of state-of-art C-to-FPGA synthesis solutions targeting multiple application domains. Complex industrial designs targeting Xilinx FPGAs are also presented as case studies, including comparison of HLS solutions versus optimized manual designs. In particular, the experiment on a sphere decoder shows that the HLS solution can achieve an 11-31% reduction in FPGA resource usage with improved design productivity compared to hand-coded design.

international conference on computer aided design | 2004

A thermal-driven floorplanning algorithm for 3D ICs

Jason Cong; Jie Wei; Yan Zhang

As the technology progresses, interconnect delays have become bottlenecks of chip performance. 3D integrated circuits are proposed as one way to address this problem. However, thermal problem is a critical challenge for 3D IC circuit design. We propose a thermal-driven 3D floorplanning algorithm. Our contributions include: (1) a new 3D floorplan representation, CBA and new interlayer local operations to more efficiently exploit the solution space; (2) an efficient thermal-driven 3D floorplanning algorithm with an integrated compact resistive network thermal model (CBA-T); (3) two fast thermal-driven 3D floorplanning algorithms using two different thermal models with different runtime and quality (CBA-T-Fast and CBA-T-Hybrid). Our experiments show that the proposed 3D floorplan algorithm with CBA representation can reduce the wirelength by 29% compared with a recent published result from (Hsiu et al., 2004). In addition, compared to a nonthermal-driven 3D floorplanning algorithm, the thermal-driven 3D floorplanning algorithm can reduce the maximum on-chip temperature by 56%.

field programmable gate arrays | 2015

Optimizing FPGA-based Accelerator Design for Deep Convolutional Neural Networks

Chen Zhang; Peng Li; Guangyu Sun; Yijin Guan; Bingjun Xiao; Jason Cong

Convolutional neural network (CNN) has been widely employed for image recognition because it can achieve high accuracy by emulating behavior of optic nerves in living creatures. Recently, rapid growth of modern applications based on deep learning algorithms has further improved research and implementations. Especially, various accelerators for deep CNN have been proposed based on FPGA platform because it has advantages of high performance, reconfigurability, and fast development round, etc. Although current FPGA accelerators have demonstrated better performance over generic processors, the accelerator design space has not been well exploited. One critical problem is that the computation throughput may not well match the memory bandwidth provided an FPGA platform. Consequently, existing approaches cannot achieve best performance due to under-utilization of either logic resource or memory bandwidth. At the same time, the increasing complexity and scalability of deep learning applications aggravate this problem. In order to overcome this problem, we propose an analytical design scheme using the roofline model. For any solution of a CNN design, we quantitatively analyze its computing throughput and required memory bandwidth using various optimization techniques, such as loop tiling and transformation. Then, with the help of rooine model, we can identify the solution with best performance and lowest FPGA resource requirement. As a case study, we implement a CNN accelerator on a VC707 FPGA board and compare it to previous approaches. Our implementation achieves a peak performance of 61.62 GFLOPS under 100MHz working frequency, which outperform previous approaches significantly.

Integration | 1996

Performance optimization of VLSI interconnect layout

Jason Cong; Lei He; Cheng-Kok Koh; Patrick H. Madden

Abstract This paper presents a comprehensive survey of existing techniques for interconnect optimization during the VLSI physical design process, with emphasis on recent studies on interconnect design and optimization for high-performance VLSI circuit design under the deep submicron fabrication technologies. First, we present a number of interconnect delay models and driver/gate delay models of various degrees of accuracy and efficiency which are most useful to guide the circuit design and interconnect optimization process. Then, we classify the existing work on optimization of VLSI interconnect into the following three categories and discuss the results in each category in detail: (i) topology optimization for high-performance interconnects, including the algorithms for total wire length minimization, critical path length minimization, and delay minimization; (ii) device and interconnect sizing, including techniques for efficient driver, gate, and transistor sizing, optimal wire sizing, and simultaneous topology construction, buffer insertion, buffer and wire sizing; (iii) high-performance clock routing, including abstract clock net topology generation and embedding, planar clock routing, buffer and wire sizing for clock nets, non-tree clock routing, and clock schedule optimization. For each method, we discuss its effectiveness, its advantages and limitations, as well as its computational efficiency. We group the related techniques according to either their optimization techniques or optimization objectives so that the reader can easily compare the quality and efficiency of different solutions.

field programmable gate arrays | 2004

Application-specific instruction generation for configurable processor architectures

Jason Cong; Yiping Fan; Guoling Han; Zhiru Zhang

Designing an application-specific embedded system in nanometer technologies has become more difficult than ever due to the rapid increase in design complexity and manufacturing cost. Efficiency and flexibility must be carefully balanced to meet different application requirements. The recently emerged configurable and extensible processor architectures offer a favorable tradeoff between efficiency and flexibility, and a promising way to minimize certain important metrics (e.g., execution time, code size, etc.) of the embedded processors. This paper addresses the problem of generating the application-specific instructions to improve the execution speed for configurable processors. A set of algorithms, including pattern generation, pattern selection, and application mapping, are proposed to efficiently utilize the instruction set extensibility of the target configurable processor. Applications of our approach to several real-life benchmarks on the Altera Nios processor show encouraging performance speedup (2.75X on average and up to 3.73X in some cases).

high-performance computer architecture | 2008

CMP network-on-chip overlaid with multi-band RF-interconnect

Mau-Chung Frank Chang; Jason Cong; Adam Kaplan; Mishali Naik; Glenn Reinman; Eran Socher; Sai-Wang Tam

In this paper, we explore the use of multi-band radio frequency interconnect (or RF-I) with signal propagation at the speed of light to provide shortcuts in a many core network-on-chip (NoC) mesh topology. We investigate the costs associated with this technology, and examine the latency and bandwidth benefits that it can provide. Assuming a 400 mm2 die, we demonstrate that in exchange for 0.13% of area overhead on the active layer, RF-I can provide an average 13% (max 18%) boost in application performance, corresponding to an average 22% (max 24%) reduction in packet latency. We observe that RF access points may become traffic bottlenecks when many packets try to use the RF at once, and conclude by proposing strategies that adapt RF-I utilization at runtime to actively combat this congestion.

ACM Transactions on Design Automation of Electronic Systems | 1996

Combinational logic synthesis for LUT based field programmable gate arrays

Jason Cong; Yuzheng Ding

The increasing popularity of the field programmable gate-array (FPGA) technology has generated a great deal of interest in the algorithmic study and tool development for FPGA-specific design automation problems. The most widely used FPGAs are LUT based FPGAs, in which the basic logic element is a K-input one-output lookup-table (LUT) that can implement any Boolean function of up to K variables. This unique feature of the LUT has brought new challenges to logic synthesis and optimization, resulting in many new techniques reported in recent years. This article summarizes the research results on combinational logic synthesis for LUT based FPGAs under a coherent framework. These results were dispersed in various conference proceedings and journals and under various formulations and terminologies. We first present general problem formulations, various optimization objectives and measurements, then focus on a set of commonly used basic concepts and techniques, and finally summarize existing synthesis algorithms and systems. We classify and summarize the basic techniques into two categories, namely, logic optimization and technology mapping, and describe the existing algorithms and systems in terms of how they use the classified basic techniques. A comprehensive list of references is compiled in the attached bibliography.

IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems | 1992

Provably good performance-driven global routing

Jason Cong; Andrew B. Kahng; Gabriel Robins; Majid Sarrafzadeh; C. K. Wong

The authors propose a provably good performance-driven global routing algorithm for both cell-based and building-block design. The approach is based on a new bounded-radius minimum routing tree formulation. The authors first present several heuristics with good performance, based on an analog of Prims minimum spanning tree construction. Next, they give an algorithm which simultaneously minimizes both routing cost and the longest interconnection path, so that both are bounded by small constant factors away from optimal. They also show that geometry helps in routing: in the Manhattan plane, the total wire length for Steiner routing improves to 3/2*(1+(1/ epsilon )) times the optimal Steiner tree cost, while in the Euclidean plane, the total cost is further reduced to (2/ square root 3)*(1+(1/ epsilon )) times optimal. The method generalizes to the case where varying wire length bounds are prescribed for different source-sink paths. Extensive simulations confirm that this approach works well. >

international conference on computer aided design | 1997

Interconnect design for deep submicron ICs

Jason Cong; Zhigang Pan; Lei He; Cheng-Kok Koh; Kei-Yong Khoo

In this paper, we study the interconnect layout optimization problem under a higher-order RLC model to optimize not just delay, but also waveform for RLC circuits with non-monotone signal response. We propose a unified approach that considers topology optimization, wiresizing optimization, and waveform optimization simultaneously. Our algorithm considers a large class of routing topologies, ranging from shortest-path Steiner trees to bounded-radius Steiner trees and Steiner routings. We construct a set of required-arrival-time Steiner trees or RATS-trees, providing a smooth trade-off among signal delay, waveform, and routing area. Using a new incremental moment computation algorithm, we interleave topology construction with moment computation to facilitate accurate delay calculation and evaluation of waveform quality. Experimental results show that our algorithm is able to construct a set of topologies providing a smooth trade-off among signal delay, signal settling time, voltage overshoot, and routing cost.

Explore More