Ritchie Zhao
Cornell University
Publication
Featured research published by Ritchie Zhao.
Field Programmable Gate Arrays | 2017
Ritchie Zhao; Weinan Song; Wentao Zhang; Tianwei Xing; Jeng-Hau Lin; Mani B. Srivastava; Rajesh K. Gupta; Zhiru Zhang
Convolutional neural networks (CNN) are the current state-of-the-art for many computer vision tasks. CNNs outperform older methods in accuracy, but require vast amounts of computation and memory. As a result, existing CNN applications are typically run on clusters of CPUs or GPUs. Studies into the FPGA acceleration of CNN workloads have achieved reductions in power and energy consumption. However, large GPUs outperform modern FPGAs in throughput, and the existence of compatible deep learning frameworks gives GPUs a significant advantage in programmability. Recent research in machine learning demonstrates the potential of very low precision CNNs -- i.e., CNNs with binarized weights and activations. Such binarized neural networks (BNNs) appear well suited for FPGA implementation, as their dominant computations are bitwise logic operations and their memory requirements are reduced. A combination of low-precision networks and high-level design methodology may help address the performance and productivity gap between FPGAs and GPUs. In this paper, we present the design of a BNN accelerator that is synthesized from C++ to FPGA-targeted Verilog. The accelerator outperforms existing FPGA-based CNN accelerators in GOPS as well as energy and resource efficiency.
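The "bitwise logic operations" that dominate BNN compute reduce to XNOR and popcount. A minimal sketch of the idea (illustrative only, not the accelerator's actual implementation): encoding each {-1, +1} element as one bit turns a dot product into n - 2 * popcount(a XOR w).

```python
def bin_dot(a_bits, w_bits, n):
    """Dot product of two {-1, +1} vectors packed as n-bit integers.

    Bit value 1 encodes +1 and bit 0 encodes -1. An XNOR of two bits is 1
    exactly when the corresponding product is +1, so the dot product is
    (#matches) - (#mismatches) = n - 2 * popcount(a XOR w).
    """
    return n - 2 * bin(a_bits ^ w_bits).count("1")

# LSB-first: a = 0b1011 -> [+1, +1, -1, +1], w = 0b1101 -> [+1, -1, +1, +1]
# Dot product: (+1) + (-1) + (-1) + (+1) = 0
print(bin_dot(0b1011, 0b1101, 4))  # prints 0
```

On an FPGA the XOR and popcount map directly onto LUTs, which is why binarization suits the fabric so well.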
International Conference on Computer-Aided Design | 2015
Mingxing Tan; Gai Liu; Ritchie Zhao; Steve Dai; Zhiru Zhang
Modern high-level synthesis (HLS) tools commonly employ pipelining to achieve efficient loop acceleration by overlapping the execution of successive loop iterations. However, existing HLS techniques provide inadequate support for pipelining irregular loop nests that contain dynamic-bound inner loops, where unrolling is either very expensive or not even applicable. To overcome this major limitation, we propose ElasticFlow, a novel architectural synthesis approach capable of dynamically distributing inner loops to an array of loop processing units (LPUs) in a complexity-effective manner. These LPUs can be either specialized to execute an individual loop or shared amongst multiple inner loops for area reduction. We evaluate ElasticFlow using a variety of real-life applications and demonstrate significant performance improvements over a widely used commercial HLS tool for Xilinx FPGAs.
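The benefit of distributing dynamic-bound inner loops across LPUs can be sketched with a toy scheduling model (the trip counts, LPU count, and earliest-free dispatch policy here are assumptions for illustration, not ElasticFlow's actual scheduler):

```python
import heapq

def elasticflow_schedule(inner_trip_counts, num_lpus):
    """Schematic model of distributing dynamic-bound inner loops to LPUs.

    Each outer-loop iteration spawns an inner loop whose trip count is only
    known at run time. Dispatching each inner loop to the earliest-free LPU
    overlaps their execution; the makespan approximates total latency.
    """
    lpus = [0] * num_lpus            # finish time of each LPU
    heapq.heapify(lpus)
    for trips in inner_trip_counts:
        start = heapq.heappop(lpus)  # earliest-available LPU
        heapq.heappush(lpus, start + trips)
    return max(lpus)

work = [8, 1, 1, 1, 8, 1, 1, 1]      # irregular inner-loop trip counts
print(sum(work), elasticflow_schedule(work, 4))  # sequential 22 vs 9 cycles
```

Note that unrolling cannot express this schedule, because the trip counts (8 vs 1) are not known until run time.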
Design Automation Conference | 2015
Ritchie Zhao; Mingxing Tan; Steve Dai; Zhiru Zhang
Traditional techniques for pipeline scheduling in high-level synthesis for FPGAs assume an additive delay model where each operation incurs a pre-characterized delay. While a good approximation for some operation types, this fails to consider technology mapping, where a group of logic operations can be mapped to a single look-up table (LUT) and together incur one LUT worth of delay. We propose an exact formulation of the throughput-constrained, mapping-aware pipeline scheduling problem for FPGA-targeted high-level synthesis with area minimization being a primary objective. By taking this cross-layered approach, our technique is able to mitigate the pessimism inherent in static delay estimates and reduce the usage of LUTs and pipeline registers. Experimental results using our method demonstrate improved resource utilization for a number of logic-intensive, real-life benchmarks compared to a state-of-the-art commercial HLS tool for Xilinx FPGAs.
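The pessimism of the additive delay model can be illustrated with a toy estimator (the delay values, LUT packing factor, and clock period below are assumptions; the paper's exact formulation is an ILP, not this heuristic):

```python
import math

def stages_additive(op_delays, clock):
    """Pipeline stages under an additive delay model: each op contributes its
    pre-characterized delay; a register is inserted when the period is exceeded."""
    stages, acc = 1, 0.0
    for d in op_delays:
        if acc + d > clock:
            stages, acc = stages + 1, d
        else:
            acc += d
    return stages

def stages_mapped(num_logic_ops, ops_per_lut, lut_delay, clock):
    """Mapping-aware estimate: a chain of bitwise logic ops packs into
    ceil(n / k) LUTs, each costing one LUT worth of delay."""
    luts = math.ceil(num_logic_ops / ops_per_lut)
    return stages_additive([lut_delay] * luts, clock)

# 8 chained logic ops, 1.0 ns each, 2.0 ns clock period:
chain = [1.0] * 8
print(stages_additive(chain, 2.0))    # additive model: 4 stages
print(stages_mapped(8, 4, 1.0, 2.0))  # packed into 2 LUTs: 1 stage
```

Fewer stages directly translate into fewer pipeline registers, which is the area saving the paper targets.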
Field Programmable Gate Arrays | 2017
Steve Dai; Ritchie Zhao; Gai Liu; Shreesha Srinath; Udit Gupta; Christopher Batten; Zhiru Zhang
Current pipelining approaches in high-level synthesis (HLS) achieve high performance for applications with regular and statically analyzable memory access patterns. However, they cannot effectively handle infrequent data-dependent structural and data hazards because these are conservatively assumed to always occur in the synthesized pipeline. To enable high-throughput pipelining of irregular loops, we study the problem of augmenting HLS with application-specific dynamic hazard resolution, and examine its implications on scheduling and quality of results. We propose to generate an aggressive pipeline at compile-time while resolving hazards with memory port arbitration and squash-and-replay at run-time. Our experiments targeting a Xilinx FPGA demonstrate promising performance improvement across a suite of representative benchmarks.
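The payoff of paying for hazards only when they occur can be sketched with a back-of-the-envelope cycle model (the II values, replay penalty, and hazard rate are illustrative assumptions, not measured results):

```python
def pipeline_cycles(n_iters, ii_aggressive, replay_penalty, hazard_iters):
    """Schematic cycle count for an aggressive pipeline with squash-and-replay.

    The pipeline issues one iteration every ii_aggressive cycles and pays
    replay_penalty extra cycles only when a data-dependent hazard actually
    occurs, rather than assuming the hazard on every iteration.
    """
    return n_iters * ii_aggressive + len(hazard_iters) * replay_penalty

# 1000 iterations, aggressive II = 1, 5-cycle replay, hazards on 2% of iters:
hazards = set(range(0, 1000, 50))       # 20 actual hazards
aggressive = pipeline_cycles(1000, 1, 5, hazards)
conservative = 1000 * 6                 # conservative II assumes hazard always
print(aggressive, conservative)         # prints 1100 6000
```

When the hazard is rare, the occasional squash-and-replay penalty is far cheaper than a conservatively inflated initiation interval.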
Design Automation Conference | 2016
Ritchie Zhao; Gai Liu; Shreesha Srinath; Christopher Batten; Zhiru Zhang
Existing high-level synthesis (HLS) tools are mostly effective on algorithm-dominated programs that only use primitive data structures such as fixed size arrays and queues. However, many widely used data structures such as priority queues, heaps, and trees feature complex member methods with data-dependent work and irregular memory access patterns. These methods can be inlined to their call sites, but this does not address the aforementioned issues and may further complicate conventional HLS optimizations, resulting in a low-performance hardware implementation. To overcome this deficiency, we propose a novel HLS architectural template in which complex data structures are decoupled from the algorithm using a latency-insensitive interface. This enables overlapped execution of the algorithm and data structure methods, as well as parallel and out-of-order execution of independent methods on multiple decoupled lanes. Experimental results across a variety of real-life benchmarks show our approach is capable of achieving very promising speedups without causing significant area overhead.
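The decoupling idea can be modeled in software: the algorithm enqueues method requests and consumes responses later, instead of stalling for each call. The class below is a schematic sketch (the method names and cycle model are assumptions for illustration; in hardware the deques would be FIFOs on a latency-insensitive interface):

```python
from collections import deque
import heapq

class DecoupledPriorityQueue:
    """Sketch of a data structure decoupled from the algorithm.

    Requests and responses flow through queues, so the algorithm never
    blocks on the data structure's data-dependent method latency.
    """
    def __init__(self):
        self.req, self.resp, self.heap = deque(), deque(), []

    def call(self, method, arg=None):
        self.req.append((method, arg))   # non-blocking request

    def step(self):                      # one "cycle" of the data-structure lane
        if self.req:
            method, arg = self.req.popleft()
            if method == "push":
                heapq.heappush(self.heap, arg)
            elif method == "pop":
                self.resp.append(heapq.heappop(self.heap))

    def result(self):
        return self.resp.popleft() if self.resp else None

pq = DecoupledPriorityQueue()
for v in [5, 1, 3]:
    pq.call("push", v)
pq.call("pop")
for _ in range(4):                       # lane drains requests over 4 cycles
    pq.step()
print(pq.result())                       # smallest element: prints 1
```

Because independent method calls only interact through the queues, several such lanes can serve requests in parallel and out of order, which is the source of the speedup described above.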
Field Programmable Gate Arrays | 2017
Chang Xu; Gai Liu; Ritchie Zhao; Stephen Yang; Guojie Luo; Zhiru Zhang
Mainstream FPGA CAD tools provide an extensive collection of optimization options that have a significant impact on the quality of the final design. These options together create an enormous and complex design space that cannot effectively be explored by human effort alone. Instead, we propose to search this parameter space using autotuning, which is a popular approach in the compiler optimization domain. Specifically, we study the effectiveness of applying the multi-armed bandit (MAB) technique to automatically tune the options for a complete FPGA compilation flow from RTL to bitstream, including RTL/logic synthesis, technology mapping, placement, and routing. To mitigate the high runtime cost incurred by the complex FPGA implementation process, we devise an efficient parallelization scheme that enables multiple MAB-based autotuners to explore the design space simultaneously. In particular, we propose a dynamic solution space partitioning and resource allocation technique that intelligently allocates computing resources to promising search regions based on the runtime information of search quality from previous iterations. Experiments on academic and commercial FPGA CAD tools demonstrate promising improvements in quality and convergence rate across a variety of real-life designs.
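The core of such an autotuner can be sketched as an epsilon-greedy bandit over option configurations. Everything below is an illustrative stand-in: the option names and the `qor` scoring function are hypothetical, and the paper's actual MAB formulation and parallel partitioning scheme are more sophisticated.

```python
import random

def mab_autotune(arms, evaluate, trials, eps=0.2, seed=0):
    """Epsilon-greedy multi-armed-bandit search over tool-option settings.

    Each "arm" is one candidate option configuration; evaluate(arm) returns
    a quality-of-result score (higher is better), standing in for a full
    RTL-to-bitstream compile-and-measure run.
    """
    rng = random.Random(seed)
    counts = {a: 1 for a in arms}
    means = {a: evaluate(a) for a in arms}   # pull each arm once to initialize
    for _ in range(trials):
        if rng.random() < eps:
            arm = rng.choice(arms)           # explore a random configuration
        else:
            arm = max(arms, key=lambda a: means[a])  # exploit the best so far
        score = evaluate(arm)
        counts[arm] += 1
        means[arm] += (score - means[arm]) / counts[arm]
    return max(arms, key=lambda a: means[a])

# Hypothetical option space: placement effort x retiming on/off.
arms = [("low", False), ("low", True), ("high", False), ("high", True)]
def qor(arm):                  # toy stand-in for an FPGA compile + timing report
    effort, retime = arm
    return (2.0 if effort == "high" else 1.0) + (0.5 if retime else 0.0)

print(mab_autotune(arms, qor, trials=60))    # prints ('high', True)
```

In the real setting each `evaluate` call costs hours of CAD runtime, which is exactly why the paper invests in parallel tuners and dynamic partitioning of the search space.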
IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems | 2017
Gai Liu; Mingxing Tan; Steve Dai; Ritchie Zhao; Zhiru Zhang
Modern high-level synthesis (HLS) tools commonly employ pipelining to achieve efficient loop acceleration by overlapping the execution of successive loop iterations. While existing HLS pipelining techniques obtain good performance with low complexity for regular loop nests, they provide inadequate support for effectively synthesizing irregular loop nests. For loop nests with dynamic-bound inner loops, current pipelining techniques require unrolling of the inner loops, which is either very expensive in resource or even inapplicable due to dynamic loop bounds. To address this major limitation, this paper proposes ElasticFlow, a novel architecture capable of dynamically distributing inner loops to an array of loop processing units (LPUs) in an area-efficient manner. The proposed LPUs can be either specialized to execute an individual inner loop or shared among multiple inner loops to balance the tradeoff between performance and area. A customized banked memory architecture is proposed to coordinate memory accesses among different LPUs to maximize memory bandwidth without significantly increasing memory footprint. We evaluate ElasticFlow using a variety of real-life applications and demonstrate significant performance improvements over a state-of-the-art commercial HLS tool for Xilinx FPGAs.
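The role of memory banking can be sketched with low-order address interleaving (the mapping and conflict model here are a generic illustration, not the paper's customized banked architecture):

```python
from collections import Counter

def bank_of(addr, num_banks):
    """Low-order interleaving: consecutive addresses land in different banks,
    so LPUs streaming through adjacent elements rarely collide."""
    return addr % num_banks

def conflicts(addrs, num_banks):
    """Extra serialized cycles when each LPU issues one access in the same
    step: accesses targeting the same bank must take turns."""
    per_bank = Counter(bank_of(a, num_banks) for a in addrs)
    return max(per_bank.values()) - 1

print(conflicts([0, 1, 2, 3], 4))   # unit-stride accesses: prints 0
print(conflicts([0, 4, 8, 12], 4))  # all hit bank 0: prints 3
```

Keeping concurrent LPU accesses in distinct banks delivers the aggregate bandwidth without duplicating the arrays, which is how footprint stays small.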
field programmable gate arrays | 2018
Yuan Zhou; Udit Gupta; Steve Dai; Ritchie Zhao; Nitish Kumar Srivastava; Hanchen Jin; Joseph Featherston; Yi-Hsiang Lai; Gai Liu; Gustavo Velásquez; Wenping Wang; Zhiru Zhang
Modern high-level synthesis (HLS) tools greatly reduce the turn-around time of designing and implementing complex FPGA-based accelerators. They also expose various optimization opportunities, which cannot be easily explored at the register-transfer level. With the increasing adoption of the HLS design methodology and continued advances of synthesis optimization, there is a growing need for realistic benchmarks to (1) facilitate comparisons between tools, (2) evaluate and stress-test new synthesis techniques, and (3) establish meaningful performance baselines to track progress of the HLS technology. While several HLS benchmark suites already exist, they are primarily comprised of small textbook-style function kernels, instead of complete and complex applications. To address this limitation, we introduce Rosetta, a realistic benchmark suite for software programmable FPGAs. Designs in Rosetta are fully-developed applications. They are associated with realistic performance constraints, and optimized with advanced features of modern HLS tools. We believe that Rosetta is not only useful for the HLS research community, but can also serve as a set of design tutorials for non-expert HLS users. In this paper we describe the characteristics of our benchmarks and the optimization techniques applied to them. We further report experimental results on an embedded FPGA device as well as a cloud FPGA platform.
Computer Vision and Pattern Recognition | 2017
Jeng-Hau Lin; Tianwei Xing; Ritchie Zhao; Zhiru Zhang; Mani B. Srivastava; Zhuowen Tu; Rajesh K. Gupta
State-of-the-art convolutional neural networks are enormously costly in both compute and memory, demanding massively parallel GPUs for execution. Such networks strain the computational capabilities and energy available to embedded and mobile processing platforms, restricting their use in many important applications. In this paper, we propose BCNN with Separable Filters (BCNNw/SF), which applies Singular Value Decomposition (SVD) on BCNN kernels to further reduce computational and storage complexity. We provide a closed form of the gradient over SVD to calculate the exact gradient with respect to every binarized weight in backward propagation. We verify BCNNw/SF on the MNIST, CIFAR-10, and SVHN datasets, and implement an accelerator for CIFAR-10 on FPGA hardware. Our BCNNw/SF accelerator realizes memory savings of 17% and execution time reduction of 31.3% compared to BCNN with only minor accuracy sacrifices.
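The separable-filter idea can be demonstrated on a real-valued kernel (the paper applies it to binarized kernels with a closed-form gradient; this sketch only shows why SVD-factored filters cut cost):

```python
import numpy as np

# A separable (rank-1) filter factors exactly into two 1-D filters via SVD:
# a KxK convolution (K*K MACs per pixel) becomes two 1-D passes (2*K MACs).
k = np.outer([1, 2, 1], [1, 0, -1]).astype(float)   # Sobel-like 3x3 kernel
u, s, vt = np.linalg.svd(k)
col = u[:, 0] * np.sqrt(s[0])       # vertical 1-D filter
row = vt[0] * np.sqrt(s[0])         # horizontal 1-D filter
assert np.allclose(np.outer(col, row), k)           # exact: k has rank 1
print(k.size, 2 * k.shape[0])       # MACs per pixel: prints 9 6
```

For a general (non-separable) kernel, truncating the SVD to its leading singular values gives an approximation with the same cost structure, trading a small accuracy loss for the compute and storage savings the abstract reports.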
IEEE Micro | 2018
Eric S. Chung; Jeremy Fowers; Kalin Ovtcharov; Michael Papamichael; Adrian M. Caulfield; Todd Massengill; Ming Liu; Daniel Lo; Shlomi Alkalay; Michael Haselman; Maleen Abeydeera; Logan Adams; Hari Angepat; Christian Boehn; Derek Chiou; Oren Firestein; Alessandro Forin; Kang Su Gatlin; Mahdi Ghandi; Stephen Heil; Kyle Holohan; Ahmad M. El Husseini; Tamás Juhász; Kara Kagi; Ratna Kovvuri; Sitaram Lanka; Friedel van Megen; Dima Mukhortov; Prerak Patel; Brandon Perez