Zhiru Zhang | Researchain

Archive Network Publication Hotspot Collaboration

Network

Latest external collaboration on country level. Dive into details by clicking on the dots.

Explore More

Hotspot

Dive into the research topics where Zhiru Zhang is active.

Explore More

Publication

Featured researches published by Zhiru Zhang.

IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems | 2011

High-Level Synthesis for FPGAs: From Prototyping to Deployment

Jason Cong; Bin Liu; Stephen Neuendorffer; Juanjo Noguera; Kees A. Vissers; Zhiru Zhang

Escalating system-on-chip design complexity is pushing the design community to raise the level of abstraction beyond register transfer level. Despite the unsuccessful adoptions of early generations of commercial high-level synthesis (HLS) systems, we believe that the tipping point for transitioning to HLS msystem-on-chip design complexityethodology is happening now, especially for field-programmable gate array (FPGA) designs. The latest generation of HLS tools has made significant progress in providing wide language coverage and robust compilation technology, platform-based modeling, advancement in core HLS algorithms, and a domain-specific approach. In this paper, we use AutoESLs AutoPilot HLS tool coupled with domain-specific system-level implementation platforms developed by Xilinx as an example to demonstrate the effectiveness of state-of-art C-to-FPGA synthesis solutions targeting multiple application domains. Complex industrial designs targeting Xilinx FPGAs are also presented as case studies, including comparison of HLS solutions versus optimized manual designs. In particular, the experiment on a sphere decoder shows that the HLS solution can achieve an 11-31% reduction in FPGA resource usage with improved design productivity compared to hand-coded design.

field programmable gate arrays | 2004

Application-specific instruction generation for configurable processor architectures

Jason Cong; Yiping Fan; Guoling Han; Zhiru Zhang

Designing an application-specific embedded system in nanometer technologies has become more difficult than ever due to the rapid increase in design complexity and manufacturing cost. Efficiency and flexibility must be carefully balanced to meet different application requirements. The recently emerged configurable and extensible processor architectures offer a favorable tradeoff between efficiency and flexibility, and a promising way to minimize certain important metrics (e.g., execution time, code size, etc.) of the embedded processors. This paper addresses the problem of generating the application-specific instructions to improve the execution speed for configurable processors. A set of algorithms, including pattern generation, pattern selection, and application mapping, are proposed to efficiently utilize the instruction set extensibility of the target configurable processor. Applications of our approach to several real-life benchmarks on the Altera Nios processor show encouraging performance speedup (2.75X on average and up to 3.73X in some cases).

design automation conference | 2006

An efficient and versatile scheduling algorithm based on SDC formulation

Jason Cong; Zhiru Zhang

Scheduling plays a central role in the behavioral synthesis process, which automatically compiles high-level specifications into optimized hardware implementations. However, most of the existing behavior-level scheduling heuristics either have a limited efficiency in a specific class of applications or lack general support of various design constraints. In this paper we describe a new scheduler that converts a rich set of scheduling constraints into a system of difference constraints (SDC) and performs a variety of powerful optimizations under a unified mathematical programming framework. In particular, we show that our SDC-based scheduling algorithm can efficiently support resource constraints, frequency constraints, latency constraints, and relative timing constraints, and effectively optimize longest path latency, expected overall latency, and the slack distribution. Experiments demonstrate that our proposed technique provides efficient solutions for a broader range of applications with higher quality of results (in terms of system performance) when compared to the state-of-the-art scheduling heuristics

Archive | 2008

AutoPilot: A Platform-Based ESL Synthesis System

Zhiru Zhang; Yiping Fan; Wei Jiang; Guoling Han; Changqi Yang; Jason Cong

The rapid increase of complexity in System-on-a-Chip design urges the design community to raise the level of abstraction beyond RTL. Automated behavior-level and system-level synthesis are naturally identified as next steps to replace RTL synthesis and will greatly boost the adoption of electronic system-level (ESL) design. High-level executable specifications, such as C, C++, or SystemC, are also preferred for system-level verification and hardware/software co-design.

IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems | 2004

Architecture and synthesis for on-chip multicycle communication

Jason Cong; Yiping Fan; Guoling Han; Xun Yang; Zhiru Zhang

For multigigahertz designs in nanometer technologies, data transfers on global interconnects take multiple clock cycles. In this paper, we propose a regular distributed register (RDR) microarchitecture, which offers high regularity and direct support of multicycle on-chip communication. The RDR microarchitecture divides the entire chip into an array of islands so that all local computation and communication within an island can be performed in a single clock cycle. Each island contains a cluster of computational elements, local registers, and a local controller. On top of the RDR microarchitecture, novel layout-driven architectural synthesis algorithms have been developed for multicycle communication, including scheduling-driven placement, placement-driven simultaneous scheduling with rebinding, and distributed control generation, etc. The experimentation on a number of real-life examples demonstrates promising results. For data flow intensive examples, we obtain a 44% improvement on average in terms of the clock period and a 37% improvement on average in terms of the final latency, over the traditional flow. For designs with control flow, our approach achieves a 28% clock-period reduction and a 23% latency reduction on average.

symposium on cloud computing | 2006

Platform-Based Behavior-Level and System-Level Synthesis

Jason Cong; Yiping Fan; Guoling Han; Wei Jiang; Zhiru Zhang

With the rapid increase of complexity in system-on-a-chip (SoC) design, the electronic design automation (EDA) community is moving from RTL (Register Transfer Level) synthesis to behavioral-level and system-level synthesis. The needs of system-level verification and software/hardware co-design also prefer behavior-level executable specifications, such as C or SystemC. In this paper we present the platform-based synthesis system, named xPilot, being developed at UCLA. The first objective of xPilot is to provide novel behavioral synthesis capability for automatically generating efficient RTL code from a C or SystemC description for a given system platform and optimizing the logic, interconnects, performance, and power simultaneously. The second objective of xPilot is to provide a platform-based system-level synthesis capability, including both synthesis for application-specific configurable processors and heterogeneous multi-core systems. Preliminary experiments on FPGAs demonstrate the efficacy of our approach on a wide range of applications and its value in exploring various design tradeoffs.

field programmable gate arrays | 2005

Instruction set extension with shadow registers for configurable processors

Jason Cong; Yiping Fan; Guoling Han; Ashok Jagannathan; Glenn Reinman; Zhiru Zhang

Configurable processors are becoming increasingly popular for modern embedded systems (especially for the field-programmable system-on-a-chip). While steady progress has been made in the tools and methodologies of automatic instruction set extension for configurable processors, the limited data bandwidth available in the core processor (e.g., the number of simultaneous accesses to the register file) becomes a potential performance bottleneck. In this paper we first present a quantitative analysis of the data bandwidth limitation in configurable processors, and then propose a novel low-cost architectural extension and associated compilation techniques to address the problem. The application of our approach results in a promising performance improvement.

asia and south pacific design automation conference | 2007

High-Level Power Estimation and Low-Power Design Space Exploration for FPGAs

Deming Chen; Jason Cong; Yiping Fan; Zhiru Zhang

In this paper, we present a simultaneous resource allocation and binding algorithm for FPGA power minimization. To fully validate our methodology and result, our work targets a real FPGA architecture - Altera Stratix FPGA, which includes generic logic elements, DSP cores, and memories, etc. We design a high-level power estimator for this architecture and evaluate its estimation accuracy against a commercial gate-level power estimator - Quartus II PowerPlay Analyzer. During the synthesis stage, we pay special attention to interconnections and multiplexers. We concentrate on resource allocation and binding tasks because they are the key steps to determine the interconnections. We use a novel approach to explore the design space. Experimental results show that our high-level power estimator is 8.7% away from PowerPlay Analyzer. Meanwhile, we are able to achieve a significant amount of power reduction (32%) with better circuit speed (16%) compared to a traditional resource allocation and binding algorithm.

field programmable gate arrays | 2017

Accelerating Binarized Convolutional Neural Networks with Software-Programmable FPGAs

Ritchie Zhao; Weinan Song; Wentao Zhang; Tianwei Xing; Jeng-Hau Lin; Mani B. Srivastava; Rajesh K. Gupta; Zhiru Zhang

Convolutional neural networks (CNN) are the current stateof-the-art for many computer vision tasks. CNNs outperform older methods in accuracy, but require vast amounts of computation and memory. As a result, existing CNN applications are typically run on clusters of CPUs or GPUs. Studies into the FPGA acceleration of CNN workloads has achieved reductions in power and energy consumption. However, large GPUs outperform modern FPGAs in throughput, and the existence of compatible deep learning frameworks give GPUs a significant advantage in programmability. Recent research in machine learning demonstrates the potential of very low precision CNNs -- i.e., CNNs with binarized weights and activations. Such binarized neural networks (BNNs) appear well suited for FPGA implementation, as their dominant computations are bitwise logic operations and their memory requirements are reduced. A combination of low-precision networks and high-level design methodology may help address the performance and productivity gap between FPGAs and GPUs. In this paper, we present the design of a BNN accelerator that is synthesized from C++ to FPGA-targeted Verilog. The accelerator outperforms existing FPGA-based CNN accelerators in GOPS as well as energy and resource efficiency.

asia and south pacific design automation conference | 2005

Bitwidth-aware scheduling and binding in high-level synthesis

Jason Cong; Yiping Fan; Guoling Han; Yizhou Lin; Junjuan Xu; Zhiru Zhang; Xu Cheng

Many high-level description languages, such as C/C++ or Java, lack the capability to specify the bitwidth information for variables and operations. Synthesis from these specifications without bitwidth analysis may introduce wasted resources. Furthermore, conventional high-level synthesis techniques usually focus on uniform-width resources, thus they cannot obtain the full resource savings even with bitwidth information. This work develops a bitwidth-aware synthesis flow, including bitwidth analysis, scheduling and binding, and register allocation and binding, to exploit the multi-bitwidth nature of operations and variables for area-efficient designs. We also develop lower bound estimation to evaluate the efficiency of our proposed solutions for register allocation and binding. The flow is implemented in the MCAS synthesis system (Cong et al., 2004). Experimental results show that our proposed bitwidth-aware synthesis flow reduces area by 36% and wire-length by 52% on average compared to the uniform-width MCAS flow, while achieving the same performance.

Explore More