Zhenman Fang
University of California, Los Angeles
Publications
Featured research published by Zhenman Fang.
International Conference on Computer-Aided Design | 2016
Chen Zhang; Zhenman Fang; Peipei Zhou; Peichen Pan; Jason Cong
With the recent advancement of multilayer convolutional neural networks (CNNs), deep learning has achieved remarkable success in many areas, especially in visual content understanding and classification. To improve the performance and energy efficiency of computation-demanding CNNs, FPGA-based acceleration has emerged as one of the most attractive alternatives. In this paper we design and implement Caffeine, a hardware/software co-designed library to efficiently accelerate the entire CNN on FPGAs. First, we propose a unified convolutional matrix-multiplication representation for both the computation-intensive convolutional layers and the communication-intensive fully connected (FCN) layers. Second, we design Caffeine to maximize the utilization of the underlying FPGA computing and bandwidth resources, with a key focus on bandwidth optimization through memory access reorganization, which has not been studied in prior work. Moreover, we implement Caffeine in portable high-level synthesis (HLS) and provide various hardware/software definable parameters for user configuration. Finally, we also integrate Caffeine into the industry-standard software deep learning framework Caffe. We evaluate Caffeine and its integration with Caffe by implementing the VGG16 and AlexNet networks on multiple FPGA platforms. Caffeine achieves a peak performance of 365 GOPS on the Xilinx KU060 FPGA and 636 GOPS on the Virtex-7 690T FPGA; to the best of our knowledge, these are the best published results. We achieve more than 100× speedup on FCN layers over previous FPGA accelerators. An end-to-end evaluation with the Caffe integration shows up to 7.3× performance and 43.5× energy gains over Caffe on a 12-core Xeon server, and 1.5× better energy efficiency than a GPU implementation on a medium-sized FPGA (KU060). Performance projections for a system with a high-end FPGA (Virtex-7 690T) show even higher gains.
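To make the unified representation concrete, here is a minimal C++ sketch (not Caffeine's HLS code; all names are our own): an im2col-style transformation lowers a convolutional layer onto the same dense matrix multiply that a fully connected layer already uses, so one GEMM engine can serve both layer types.

```cpp
// Minimal illustration (not Caffeine itself): lowering a convolution to a
// matrix multiply via im2col, so conv and FC layers share one GEMM kernel.
#include <cstddef>
#include <vector>

// C = A (m x k) * B (k x n), row-major. Both layer types call this.
void gemm(const std::vector<float>& A, const std::vector<float>& B,
          std::vector<float>& C, std::size_t m, std::size_t k, std::size_t n) {
    for (std::size_t i = 0; i < m; ++i)
        for (std::size_t j = 0; j < n; ++j) {
            float acc = 0.0f;
            for (std::size_t p = 0; p < k; ++p)
                acc += A[i * k + p] * B[p * n + j];
            C[i * n + j] = acc;
        }
}

// im2col: unroll each KxK input patch into one column, so a (C_out x C_in*K*K)
// weight matrix times the column matrix computes the convolution.
std::vector<float> im2col(const std::vector<float>& in, std::size_t C_in,
                          std::size_t H, std::size_t W, std::size_t K) {
    std::size_t H_out = H - K + 1, W_out = W - K + 1;
    std::vector<float> cols(C_in * K * K * H_out * W_out);
    std::size_t col = 0;
    for (std::size_t y = 0; y < H_out; ++y)
        for (std::size_t x = 0; x < W_out; ++x, ++col)
            for (std::size_t c = 0; c < C_in; ++c)
                for (std::size_t ky = 0; ky < K; ++ky)
                    for (std::size_t kx = 0; kx < K; ++kx)
                        cols[((c * K + ky) * K + kx) * (H_out * W_out) + col] =
                            in[(c * H + y + ky) * W + (x + kx)];
    return cols;
}

int main() {
    // One 4x4 input channel, one 3x3 filter -> a 2x2 output map via GEMM.
    std::vector<float> img(16, 1.0f), w(9, 0.5f), out(4);
    auto cols = im2col(img, 1, 4, 4, 3);  // 9 x 4 column matrix
    gemm(w, cols, out, 1, 9, 4);          // (1x9) * (9x4) = 1x4 output
}
```

A fully connected layer is the degenerate case where the column matrix is just the flattened input vector, which is why a single bandwidth-optimized GEMM engine can cover both layer types.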
Design Automation Conference | 2016
Young-kyu Choi; Jason Cong; Zhenman Fang; Yuchen Hao; Glenn Reinman; Peng Wei
CPU-FPGA heterogeneous acceleration platforms have shown great potential for continued performance and energy-efficiency improvement in modern data centers, and have attracted great attention from both academia and industry. However, it is nontrivial for users to choose the right platform among the various PCIe- and QPI-based CPU-FPGA platforms from different vendors. This paper aims to identify which microarchitectural characteristics affect performance, and how. We conduct a quantitative comparison and in-depth analysis of two representative platforms: the QPI-based Intel-Altera HARP with coherent shared memory, and the PCIe-based Alpha Data board with private device memory. We provide multiple insights for both application developers and platform designers.
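Platform comparisons of this kind rest on microbenchmarks that time CPU-FPGA transfers at increasing sizes to expose effective bandwidth. The sketch below is a hypothetical stand-alone illustration; fpga_write() stands in for the vendor's copy API (for example, an OpenCL buffer write) and is simulated here with a host-side copy.

```cpp
// Hypothetical bandwidth microbenchmark. fpga_write() is NOT a real vendor
// API; it stands in for a host-to-device copy and is simulated in software.
#include <chrono>
#include <cstddef>
#include <cstdio>
#include <vector>

static std::vector<char> device_mem;  // stand-in for FPGA device memory
void fpga_write(const char* src, std::size_t bytes) {
    device_mem.assign(src, src + bytes);  // replace with the real copy call
}

int main() {
    for (std::size_t bytes = 4096; bytes <= (64u << 20); bytes *= 4) {
        std::vector<char> buf(bytes, 1);
        auto t0 = std::chrono::steady_clock::now();
        fpga_write(buf.data(), bytes);
        auto t1 = std::chrono::steady_clock::now();
        double s = std::chrono::duration<double>(t1 - t0).count();
        std::printf("%10zu B  %8.2f MB/s\n", bytes, bytes / s / 1e6);
    }
}
```

Sweeping the transfer size like this exposes both the small-transfer latency floor and the large-transfer bandwidth ceiling that differentiate PCIe- and QPI-based platforms.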
Symposium on Cloud Computing | 2016
Muhuan Huang; Di Wu; Cody Hao Yu; Zhenman Fang; Matteo Interlandi; Tyson Condie; Jason Cong
With the end of CPU core scaling due to dark silicon limitations, customized accelerators on FPGAs have gained increased attention in modern datacenters due to their low power, high performance, and energy efficiency. Evidenced by Microsoft's FPGA deployment in its Bing search engine and Intel's $16.7 billion acquisition of Altera, integrating FPGAs into datacenters is considered one of the most promising approaches to sustaining future datacenter growth. However, it is quite challenging for existing big data computing systems, like Apache Spark and Hadoop, to access the performance and energy benefits of FPGA accelerators. In this paper we design and implement Blaze to provide programming and runtime support for easy and efficient deployment of FPGA accelerators in datacenters. In particular, Blaze abstracts FPGA accelerators as a service (FaaS) and provides a set of clean programming APIs for big data processing applications to easily utilize those accelerators. Our Blaze runtime implements an FaaS framework to efficiently share FPGA accelerators among multiple heterogeneous threads on a single node, and extends Hadoop YARN with accelerator-centric scheduling to efficiently share them among multiple computing tasks in the cluster. Experimental results using four representative big data applications demonstrate that Blaze greatly reduces the programming effort required to access FPGA accelerators in systems like Apache Spark and YARN, and improves system throughput by 1.7× to 3× (and energy efficiency by 1.5× to 2.7×) compared to a conventional CPU-only cluster.
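The FaaS idea can be pictured with a small hypothetical client sketch; the real Blaze APIs target Spark and YARN and differ in detail, and every name below (FaasService, acquire, the accelerator name) is invented for illustration.

```cpp
// Hypothetical FaaS-style client in the spirit of Blaze; the real Blaze APIs
// (Spark/YARN) differ. FaasService and acquire() are invented names.
#include <cstddef>
#include <cstdio>
#include <functional>
#include <string>
#include <vector>

using Kernel = std::function<std::vector<float>(const std::vector<float>&)>;

struct FaasService {
    // Ask the node-local accelerator manager for a named accelerator; an
    // empty function means no FPGA instance is free. Stubbed for this sketch.
    Kernel acquire(const std::string& acc_name) {
        (void)acc_name;
        return nullptr;
    }
};

std::vector<float> process(FaasService& faas, const std::vector<float>& in) {
    if (Kernel acc = faas.acquire("gradient"))
        return acc(in);                 // offload to a shared FPGA accelerator
    std::vector<float> out(in.size());  // CPU fallback keeps the job portable
    for (std::size_t i = 0; i < in.size(); ++i) out[i] = in[i] * 0.5f;
    return out;
}

int main() {
    FaasService faas;
    auto out = process(faas, {1.0f, 2.0f});
    std::printf("%f %f\n", out[0], out[1]);
}
```

The key design point this sketch reflects is that the application never owns the FPGA: it requests a named accelerator from a node-local service, which lets the runtime multiplex one device across many threads and tasks.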
High-Performance Computer Architecture | 2017
Yuchen Hao; Zhenman Fang; Glenn Reinman; Jason Cong
While emerging accelerator-centric architectures offer orders-of-magnitude performance and energy improvements, their use cases and adoption can be limited by a rigid programming model. A unified virtual address space between the host CPU cores and customized accelerators can largely improve programmability, which necessitates hardware support for address translation. However, supporting address translation for customized accelerators with low overhead is nontrivial. Prior studies either assume an infinite-sized TLB and zero page walk latency, or rely on a slow IOMMU for correctness and safety, which penalizes overall system performance. To provide efficient address translation support for accelerator-centric architectures, we examine the memory access behavior of customized accelerators to drive our TLB augmentation and MMU designs. First, to support bulk transfers of consecutive data between the scratchpad memory of customized accelerators and the memory system, we present a relatively small private TLB design that provides low-latency caching of translations for each accelerator. Second, to compensate for the effects of the widely used data-tiling techniques, we design a shared level-two TLB to serve private-TLB misses on common virtual pages, eliminating duplicate page walks from accelerators working on neighboring data tiles that map to the same physical page. This two-level TLB design reduces page walks by 75.8% on average. Finally, instead of implementing a dedicated MMU, which would introduce additional hardware complexity, we propose simply leveraging the host per-core MMU for efficient page walk handling. This mechanism is based on our insight that the existing MMU cache in the CPU MMU satisfies the demands of customized accelerators with minimal overhead. Our evaluation demonstrates that the combined approach incurs only a 6.4% performance overhead compared to ideal address translation.
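A behavioral model of the two-level TLB design fits in a few lines; the sketch below is our own illustration (not the paper's simulator), with unbounded TLB capacity and a fake page walk.

```cpp
// Behavioral sketch of the two-level TLB (not the paper's simulator):
// per-accelerator private TLBs backed by one shared TLB; a page walk happens
// only when both levels miss. Capacity limits are ignored for brevity.
#include <cstdint>
#include <unordered_map>

constexpr uint64_t kPageBits = 12;  // 4 KiB pages

struct Tlb {
    std::unordered_map<uint64_t, uint64_t> map;  // VPN -> PPN
    bool lookup(uint64_t vpn, uint64_t& ppn) const {
        auto it = map.find(vpn);
        if (it == map.end()) return false;
        ppn = it->second;
        return true;
    }
    void fill(uint64_t vpn, uint64_t ppn) { map[vpn] = ppn; }
};

uint64_t page_walk(uint64_t vpn) { return vpn ^ 0x5u; }  // fake translation

// Translate on behalf of one accelerator: private TLB, then shared, then walk.
uint64_t translate(Tlb& priv, Tlb& shared, uint64_t vaddr) {
    uint64_t vpn = vaddr >> kPageBits, ppn;
    if (!priv.lookup(vpn, ppn)) {
        if (!shared.lookup(vpn, ppn)) {  // accelerators on neighboring tiles of
            ppn = page_walk(vpn);        // the same page hit here instead of
            shared.fill(vpn, ppn);       // issuing duplicate page walks
        }
        priv.fill(vpn, ppn);
    }
    return (ppn << kPageBits) | (vaddr & ((1ull << kPageBits) - 1));
}

int main() {
    Tlb priv_a, priv_b, shared;
    translate(priv_a, shared, 0x1234);  // both levels miss -> one page walk
    translate(priv_b, shared, 0x1000);  // same page: shared hit, no second walk
}
```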
International Conference on Computer-Aided Design | 2015
Jason Cong; Zhenman Fang; Michael Gill; Glenn Reinman
The power wall and utilization wall in today's processors have led to a focus on accelerator-rich architectures, which include a sea of accelerators that can achieve orders-of-magnitude performance and energy gains. The emerging accelerator-rich architecture is still in its early stage, and many design issues, such as efficient accelerator resource management and communication between accelerators and CPU cores, remain unclear. Therefore, a research platform that enables such design explorations is extremely useful. This paper presents the first cycle-accurate full-system simulation Platform for Accelerator-Rich Architectural Design and Exploration (PARADE). PARADE can automatically generate dedicated or composable accelerator simulation modules, and it simulates global accelerator management, a coherent cache/scratchpad with shared memory, and a customizable network-on-chip, all at the cycle level. In addition, PARADE provides visualization support to assist architects with design space exploration. Finally, a few case studies confirm that PARADE enables various system-level design space explorations in the accelerator-rich architecture.
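The flavor of cycle-level simulation that such a platform provides can be illustrated with a generic skeleton; PARADE itself builds on full-system simulation infrastructure and models far more detail, and all names below are invented.

```cpp
// Generic cycle-level simulation skeleton, illustrative only; PARADE itself
// is a full-system simulator and these names are invented.
#include <cstdint>
#include <memory>
#include <vector>

struct Module {  // every simulated component advances one cycle per tick
    virtual void tick(uint64_t cycle) = 0;
    virtual ~Module() = default;
};

struct Accelerator : Module {
    uint64_t busy_until = 0;
    void tick(uint64_t cycle) override {
        if (cycle >= busy_until) busy_until = cycle + 100;  // a 100-cycle task
    }
};

int main() {
    std::vector<std::unique_ptr<Module>> modules;
    modules.push_back(std::make_unique<Accelerator>());
    for (uint64_t cycle = 0; cycle < 1000; ++cycle)  // global cycle loop
        for (auto& m : modules) m->tick(cycle);
}
```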
Field-Programmable Custom Computing Machines | 2016
Yu-Ting Chen; Jason Cong; Zhenman Fang; Jie Lei; Peng Wei
FPGA-enabled datacenters have shown great potential for performance and energy-efficiency improvement, and have attracted a great deal of attention from both academia and industry. In this paper we aim to answer one key question: how can we efficiently integrate FPGAs into state-of-the-art big-data computing frameworks? Although very important, this problem has not been well studied, especially for the integration of fine-grained FPGA accelerators that have short execution times but are invoked many times. To provide a generalized methodology and insights for efficient integration, we conduct an in-depth analysis of the challenges, and corresponding solutions, at the single-thread, single-node multi-thread, and multi-node levels. With a step-by-step case study of the next-generation DNA sequencing application, we demonstrate how a straightforward integration with a 1000× slowdown can be tuned into an efficient integration with a 2.6× overall system speedup and a 2.4× energy-efficiency improvement.
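A recurring fix for fine-grained accelerators, which have short execution times but many invocations, is to batch requests so that one offload amortizes the per-invocation overhead. The sketch below illustrates that idea only, with invented names; fpga_align_batch() is a stub, not a real API.

```cpp
// Illustrative batching wrapper with invented names; fpga_align_batch() is a
// stub standing in for the real accelerator invocation.
#include <cstddef>
#include <string>
#include <vector>

void fpga_align_batch(const std::vector<std::string>& reads) { (void)reads; }

struct Batcher {
    std::vector<std::string> pending;
    std::size_t batch_size;
    explicit Batcher(std::size_t n) : batch_size(n) {}
    void submit(std::string read) {  // cheap: just queues the request
        pending.push_back(std::move(read));
        if (pending.size() == batch_size) flush();
    }
    void flush() {  // one offload amortizes the setup cost of many requests
        if (!pending.empty()) {
            fpga_align_batch(pending);
            pending.clear();
        }
    }
};

int main() {
    Batcher b(32);
    for (int i = 0; i < 100; ++i) b.submit("ACGT");  // 100 short requests
    b.flush();                                       // drain the partial batch
}
```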
Field-Programmable Gate Arrays | 2017
Jason Cong; Zhenman Fang; Muhuan Huang; Libo Wang; Di Wu
To efficiently process a tremendous amount of data, today's big data applications tend to distribute their datasets into multiple partitions, such that each partition can fit into memory and be processed by a separate core/server in parallel. Meanwhile, given the limited scaling of general-purpose CPUs, FPGAs have emerged as an attractive alternative for accelerating big data applications due to their low power, high performance, and energy efficiency. In this paper we aim to answer one key question: how should the multicore CPU and the FPGA coordinate to optimize the performance of big data applications? To address this question, we conduct a step-by-step case study of CPU and FPGA co-optimization for in-memory Samtool sorting in genomic data processing, one of the most important big data applications for personalized healthcare. First, to accelerate the time-consuming compression algorithm and its associated cyclic redundancy check (CRC) in Samtool sorting, we implement a portable and maintainable FPGA accelerator using high-level synthesis (HLS). Although FPGAs are traditionally known to be well suited to compression and CRC, we find that a straightforward integration of this FPGA accelerator into the multi-threaded Samtool sorting achieves only a marginal system throughput improvement over the software baseline running on a 12-core CPU. To improve system performance, we propose a dataflow execution model to effectively orchestrate the computation between the multi-threaded CPU and the FPGA. Experimental results show that our co-optimized CPU-FPGA system achieves a 2.6× speedup for in-memory Samtool sorting.
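A dataflow execution model of this kind can be sketched as a producer-consumer pipeline in which CPU threads keep producing data blocks while a separate stage feeds the FPGA; the code below is our own illustration with a stubbed accelerator call, not the paper's implementation.

```cpp
// Sketch of a dataflow-style pipeline with invented names: CPU producers keep
// generating data blocks while a consumer stage drains them to the FPGA, so
// neither side idles. compress_on_fpga() is a stub for the accelerator call.
#include <condition_variable>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

std::queue<std::vector<char>> blocks;
std::mutex m;
std::condition_variable cv;
bool done = false;

void compress_on_fpga(const std::vector<char>&) {}  // stand-in for the FPGA

void consumer() {  // the FPGA-feeding stage
    std::unique_lock<std::mutex> lk(m);
    while (!done || !blocks.empty()) {
        cv.wait(lk, [] { return done || !blocks.empty(); });
        while (!blocks.empty()) {
            auto block = std::move(blocks.front());
            blocks.pop();
            lk.unlock();
            compress_on_fpga(block);  // overlaps with the CPU producers
            lk.lock();
        }
    }
}

int main() {
    std::thread t(consumer);
    for (int i = 0; i < 64; ++i) {  // CPU threads would produce blocks here
        { std::lock_guard<std::mutex> lg(m); blocks.emplace(1 << 16, 'x'); }
        cv.notify_one();
    }
    { std::lock_guard<std::mutex> lg(m); done = true; }
    cv.notify_one();
    t.join();
}
```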
Field-Programmable Gate Arrays | 2016
Yu-Ting Chen; Jason Cong; Zhenman Fang; Peipei Zhou
Compared to conventional general-purpose processors, accelerator-rich architectures (ARAs) can provide orders-of-magnitude performance and energy gains. In this paper we design and implement ARAPrototyper to enable rapid design space exploration for ARAs in real silicon and to reduce tedious prototyping effort. First, ARAPrototyper provides a reusable baseline prototype with a highly customizable memory system, including the interconnect between accelerators and buffers, the interconnect between buffers and the last-level cache (LLC) or DRAM, the coherency choice at the LLC or DRAM, and address translation support. To provide more insight for performance analysis, ARAPrototyper adds several performance counters on the accelerator side and leverages existing performance counters on the CPU side. Second, ARAPrototyper provides a clean interface to quickly integrate a user's own accelerators written in high-level synthesis (HLS) code. An ARA prototype can then be automatically generated and mapped to a Xilinx Zynq SoC. To quickly develop applications that run seamlessly on the ARA prototype, ARAPrototyper provides a system software stack and abstracts the accelerators as software libraries for application developers. Our results demonstrate that ARAPrototyper enables a wide range of design space explorations for ARAs with manageable prototyping effort and 4,000× to 10,000× faster evaluation time than full-system simulations. We believe ARAPrototyper can be an attractive alternative for ARA design and evaluation.
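Abstracting an accelerator as a software library typically means hiding the register-level control sequence behind a plain function call; the sketch below illustrates that pattern with an invented register layout and a stubbed, always-done status register, not ARAPrototyper's actual interface.

```cpp
// Hypothetical library wrapper of the kind described above: the application
// calls a plain function and the wrapper pokes control registers. The register
// layout is invented, and the status register is pre-set so the sketch runs.
#include <cstdint>

static volatile uint32_t fake_regs[8] = {0, 0, 0, 0, 1};  // STAT starts "done"
static volatile uint32_t* acc_regs = fake_regs;  // really an mmap()ed region

enum { REG_SRC, REG_DST, REG_LEN, REG_CTRL, REG_STAT };

// The generated library hides every register poke behind one plain call.
void acc_copy(uint32_t src_addr, uint32_t dst_addr, uint32_t len) {
    acc_regs[REG_SRC]  = src_addr;
    acc_regs[REG_DST]  = dst_addr;
    acc_regs[REG_LEN]  = len;
    acc_regs[REG_CTRL] = 1;                   // kick off the accelerator
    while ((acc_regs[REG_STAT] & 1) == 0) {}  // poll for completion
}

int main() { acc_copy(0x1000, 0x2000, 256); }
```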
Field-Programmable Gate Arrays | 2018
Jason Cong; Zhenman Fang; Yao Hu; Di Wu
With the slowing down of Moore's law, major cloud service providers, such as Amazon Web Services, Microsoft Azure, and Alibaba Cloud, have all started deploying FPGAs in their cloud platforms to improve performance and energy efficiency. From the perspective of performance per unit cost in the cloud, it is essential to efficiently utilize all available CPU and FPGA resources within a requested computing instance. However, most prior studies either overlook CPU-FPGA co-optimization or require a considerable amount of manual effort to achieve it. In this poster, we present a framework called K-Flow, which enables easy FPGA accelerator integration and efficient CPU-FPGA co-scheduling for big data applications. K-Flow abstracts an application as a directed acyclic graph (DAG), a widely used representation, and dynamically schedules a number of CPU threads and/or FPGA accelerator processing elements (PEs) to execute the dataflow tasks on each DAG node. Moreover, K-Flow provides user-friendly interfaces to program each DAG node and automates the tedious process of FPGA accelerator integration and CPU-FPGA co-optimization. Using the genomic read alignment application BWA-MEM as a case study, we show that K-Flow achieves a throughput that is on average 94.5% of the theoretical upper bound and 1.4× better than a straightforward FPGA integration.
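A DAG-based scheduler of this kind can be outlined in a few lines; the sketch below uses invented names (DagNode, run) and a serial dispatch loop standing in for K-Flow's actual runtime, which schedules CPU threads and FPGA PEs concurrently.

```cpp
// Illustrative DAG-node sketch with invented names (not the K-Flow API); only
// the per-node placement decision between CPU and FPGA PE is shown.
#include <cstdio>
#include <functional>
#include <string>
#include <vector>

struct DagNode {
    std::string name;
    std::vector<int> deps;            // a real scheduler tracks these; this
                                      // sketch relies on topological order
    std::function<void()> cpu_impl;   // always present
    std::function<void()> fpga_impl;  // empty when no accelerator PE exists
};

void run(std::vector<DagNode>& dag, int free_fpga_pes) {
    for (auto& node : dag) {          // nodes assumed in topological order
        if (node.fpga_impl && free_fpga_pes > 0) {
            --free_fpga_pes;
            node.fpga_impl();         // run on an FPGA processing element
            ++free_fpga_pes;
        } else {
            node.cpu_impl();          // fall back to a CPU thread
        }
    }
}

int main() {
    std::vector<DagNode> dag = {
        {"parse", {}, [] { std::puts("parse on CPU"); }, nullptr},
        {"align", {0}, [] { std::puts("align on CPU"); },
         [] { std::puts("align on FPGA PE"); }},
    };
    run(dag, 1);  // one free FPGA PE: "align" is offloaded
}
```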
Field-Programmable Gate Arrays | 2018
Weikang Qiao; Jieqiong Du; Zhenman Fang; Libo Wang; Michael Lo; Mau-Chung Frank Chang; Jason Cong
Data compression techniques have been widely used to reduce data storage and movement overhead, especially in the big data era. While FPGAs are well suited to accelerating computation-intensive lossless compression algorithms, big data compression with parallel requests intrinsically poses two challenges to overall system throughput. First, scaling existing single-engine FPGA compression accelerator designs already encounters bottlenecks that result in lower clock frequency, saturated throughput, and lower area efficiency. Second, when such FPGA compression accelerators are integrated with the processors, the overall system throughput is typically limited by the communication between the CPU and the FPGA. We propose a novel multi-way parallel and fully pipelined architecture to achieve high-throughput lossless compression on modern Intel-Altera HARPv2 platforms. To compensate for the compression-ratio loss of a multi-way design, we implement novel techniques such as a better data feeding method and a hash chain that extends the hash dictionary history. Our accelerator kernel itself achieves a compression throughput of 12.8 GB/s (2.3× better than the current record throughput) and a comparable compression ratio of 2.03 on standard benchmark data. Our approach enables design scalability without a reduction in clock frequency and also improves performance per unit area (by up to 1.5×). Moreover, we exploit the high CPU-FPGA communication bandwidth of HARPv2 platforms to improve the compression throughput of the overall system, achieving an average practical end-to-end throughput of 10.0 GB/s (up to 12 GB/s for larger input files) on HARPv2.
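The hash-chain technique mentioned above is a classic LZ77-style match finder: each hash bucket heads a linked chain of earlier positions, so the search can reach deeper into the dictionary history. The sketch below is a software illustration of the idea, not the accelerator's HLS/RTL.

```cpp
// Software illustration of a hash-chain match finder (LZ77 style): head[]
// points at the most recent position with a given hash and prev[] links back
// to earlier ones, deepening the reachable dictionary history.
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

constexpr int kHashBits = 15;

uint32_t hash4(const uint8_t* p) {  // hash of 4 bytes; caller guarantees them
    uint32_t v;
    std::memcpy(&v, p, 4);
    return (v * 2654435761u) >> (32 - kHashBits);
}

int best_match(const std::vector<uint8_t>& data, std::size_t pos,
               std::vector<int>& head, std::vector<int>& prev, int max_depth) {
    uint32_t h = hash4(&data[pos]);
    int best_len = 0;
    for (int cand = head[h]; cand >= 0 && max_depth-- > 0; cand = prev[cand]) {
        int len = 0;  // walk the chain into progressively older history
        while (pos + len < data.size() && data[cand + len] == data[pos + len])
            ++len;
        if (len > best_len) best_len = len;
    }
    prev[pos] = head[h];  // link the current position into the chain
    head[h] = static_cast<int>(pos);
    return best_len;
}

int main() {
    std::vector<uint8_t> data = {'a','b','c','d','a','b','c','d','x','x','x','x'};
    std::vector<int> head(1 << kHashBits, -1), prev(data.size(), -1);
    for (std::size_t pos = 0; pos + 4 <= data.size(); ++pos)
        best_match(data, pos, head, prev, 8);  // depth limit = 8 chain links
}
```

A deeper chain walk trades search time for compression ratio, which is why the hardware design bounds the depth while still widening the reachable history across its parallel engines.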