Network


Latest external collaborations at the country level.

Hotspot


Dive into the research topics where Peipei Zhou is active.

Publication


Featured research published by Peipei Zhou.


international conference on computer aided design | 2016

Caffeine: towards uniformed representation and acceleration for deep convolutional neural networks

Chen Zhang; Zhenman Fang; Peipei Zhou; Peichen Pan; Jason Cong

With the recent advancement of multilayer convolutional neural networks (CNN), deep learning has achieved amazing success in many areas, especially in visual content understanding and classification. To improve the performance and energy efficiency of the computation-demanding CNN, FPGA-based acceleration emerges as one of the most attractive alternatives. In this paper we design and implement Caffeine, a hardware/software co-designed library to efficiently accelerate the entire CNN on FPGAs. First, we propose a uniformed convolutional matrix-multiplication representation for both computation-intensive convolutional layers and communication-intensive fully connected (FCN) layers. Second, we design Caffeine to maximize the underlying FPGA computing and bandwidth resource utilization, with a key focus on bandwidth optimization through memory access reorganization, which was not studied in prior work. Moreover, we implement Caffeine in portable high-level synthesis and provide various hardware/software definable parameters for user configuration. Finally, we integrate Caffeine into the industry-standard deep learning framework Caffe. We evaluate Caffeine and its integration with Caffe by implementing the VGG16 and AlexNet networks on multiple FPGA platforms. Caffeine achieves a peak performance of 365 GOPS on a Xilinx KU060 FPGA and 636 GOPS on a Virtex7 690t FPGA, which is, to the best of our knowledge, the best published result. We achieve more than 100× speedup on FCN layers over previous FPGA accelerators. An end-to-end evaluation with Caffe integration shows up to 7.3× and 43.5× performance and energy gains over Caffe on a 12-core Xeon server, and 1.5× better energy efficiency than the GPU implementation on a medium-sized FPGA (KU060). Performance projections to a system with a high-end FPGA (Virtex7 690t) show even higher gains.
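
The abstract's central idea, expressing both convolutional and fully connected layers through one matrix-multiplication kernel, can be illustrated with a minimal, generic sketch. The functions, sizes, and single-channel, stride-1 simplification below are assumptions for illustration only and do not reflect Caffeine's actual representation or code.

```cpp
// Sketch: convolution reduces to matrix multiplication via im2col flattening;
// a fully connected layer is already a matrix product, so both layer types
// share one matmul kernel. All names and sizes are hypothetical.
#include <vector>

// C = A (M x K) * B (K x N): the single kernel both layer types map onto.
static void matmul(const std::vector<float>& A, const std::vector<float>& B,
                   std::vector<float>& C, int M, int K, int N) {
    for (int m = 0; m < M; ++m)
        for (int n = 0; n < N; ++n) {
            float acc = 0.0f;
            for (int k = 0; k < K; ++k)
                acc += A[m * K + k] * B[k * N + n];
            C[m * N + n] = acc;
        }
}

// im2col: unroll KxK patches of an HxW single-channel image into columns so
// that convolution with F filters becomes an (F x K*K) * (K*K x P) product,
// where P is the number of output positions.
static std::vector<float> im2col(const std::vector<float>& img,
                                 int H, int W, int K) {
    int OH = H - K + 1, OW = W - K + 1;
    std::vector<float> cols(K * K * OH * OW);
    for (int oh = 0; oh < OH; ++oh)
        for (int ow = 0; ow < OW; ++ow)
            for (int kh = 0; kh < K; ++kh)
                for (int kw = 0; kw < K; ++kw)
                    cols[(kh * K + kw) * (OH * OW) + oh * OW + ow] =
                        img[(oh + kh) * W + (ow + kw)];
    return cols;
}

// A fully connected layer (y = W * x, or a batched matrix-matrix product)
// calls matmul() directly, with no flattening step needed.
```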


field programmable custom computing machines | 2014

A Fully Pipelined and Dynamically Composable Architecture of CGRA

Jason Cong; Hui Huang; Chiyuan Ma; Bingjun Xiao; Peipei Zhou

Future processors will be constrained mainly by energy efficiency rather than by transistor resources. Reconfigurable architectures offer higher energy efficiency than CPUs through customized hardware, and more flexibility than ASICs. FPGAs allow configurability at the bit level to keep both efficiency and flexibility.
However, in many computation-intensive applications only word-level customizations are necessary, which inspires coarse-grained reconfigurable arrays (CGRAs) to raise configurability to the word level, reduce configuration information, and enable on-the-fly customization. Traditional CGRAs were designed in an era when transistor resources were scarce; previous CGRA work shares hardware resources among different operations via modulo scheduling and time-multiplexed processing elements. In the emerging scenario where transistor resources are rich, we develop a novel CGRA architecture that features full pipelining and dynamic composition to improve energy efficiency, and we implement a prototype on a Xilinx Virtex-6 FPGA board. Experiments show that the fully pipelined and dynamically composable architecture (FPCA) can exploit the energy benefits of customization for user applications when transistor resources are rich.
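
The contrast the abstract draws between time-multiplexed processing elements and a fully pipelined, composed datapath can be sketched in HLS-style C++. The operator chain and all names below are hypothetical; the pragma only illustrates full pipelining at initiation interval 1, not the FPCA microarchitecture itself.

```cpp
// Sketch: instead of time-multiplexing one ALU across the mul, add, and shift
// over several cycles (as in modulo-scheduled CGRAs), each word-level
// operation gets dedicated hardware and the composed chain is fully
// pipelined, so a new input can enter every cycle once the pipeline fills.
extern "C" void composed_kernel(const int* in, int* out, int n) {
    for (int i = 0; i < n; ++i) {
#pragma HLS PIPELINE II=1
        int a = in[i] * 3;   // stage 1: dedicated multiplier
        int b = a + 17;      // stage 2: dedicated adder
        out[i] = b >> 1;     // stage 3: dedicated shifter
    }
}
```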


field programmable gate arrays | 2016

ARAPrototyper: Enabling Rapid Prototyping and Evaluation for Accelerator-Rich Architecture (Abstract Only)

Yu-Ting Chen; Jason Cong; Zhenman Fang; Peipei Zhou

Compared to conventional general-purpose processors, accelerator-rich architectures (ARAs) can provide orders-of-magnitude performance and energy gains. In this paper we design and implement ARAPrototyper to enable rapid design space exploration for ARAs in real silicon and to reduce tedious prototyping efforts. First, ARAPrototyper provides a reusable baseline prototype with a highly customizable memory system, including the interconnect between accelerators and buffers, the interconnect between buffers and the last-level cache (LLC) or DRAM, the coherency choice at the LLC or DRAM, and address translation support. To provide more insight into performance analysis, ARAPrototyper adds several performance counters on the accelerator side and leverages existing performance counters on the CPU side. Second, ARAPrototyper provides a clean interface to quickly integrate a user's own accelerators written in high-level synthesis (HLS) code. An ARA prototype can then be automatically generated and mapped to a Xilinx Zynq SoC. To quickly develop applications that run seamlessly on the ARA prototype, ARAPrototyper provides a system software stack and abstracts the accelerators as software libraries for application developers. Our results demonstrate that ARAPrototyper enables a wide range of design space explorations for ARAs at manageable prototyping effort, with 4,000X to 10,000X faster evaluation time than full-system simulations. We believe ARAPrototyper can be an attractive alternative for ARA design and evaluation.
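
As a rough illustration of "abstracting the accelerators as software libraries", here is a hypothetical, software-only sketch. Every type and function below (AraDevice, AraBuffer, ara_invoke, ...) is invented for this example, and the accelerator is mocked in software; ARAPrototyper's real interface is not described in this profile.

```cpp
// Sketch of an accelerator wrapped as an ordinary software library: the
// application sees only alloc/invoke calls, with no hardware registers, DMA,
// or address translation exposed. All names are hypothetical; the "invoke"
// is a software mock standing in for launching an HLS accelerator.
#include <cstddef>
#include <cstring>
#include <string>
#include <vector>

struct AraBuffer { std::vector<float> data; };   // stand-in for a device buffer

struct AraDevice {
    explicit AraDevice(const char* name) : accel(name) {}  // would open the real accelerator
    std::string accel;
};

static AraBuffer ara_alloc(AraDevice&, size_t n) { return AraBuffer{std::vector<float>(n)}; }

// Mock launch: the real library would drive the FPGA; here we just double values.
static void ara_invoke(AraDevice&, const AraBuffer& in, AraBuffer& out) {
    for (size_t i = 0; i < in.data.size(); ++i) out.data[i] = 2.0f * in.data[i];
}

// Application-side usage looks like a plain library call.
void run_once(const float* input, float* output, size_t n) {
    AraDevice dev("my_hls_accelerator");
    AraBuffer in = ara_alloc(dev, n), out = ara_alloc(dev, n);
    std::memcpy(in.data.data(), input, n * sizeof(float));
    ara_invoke(dev, in, out);
    std::memcpy(output, out.data.data(), n * sizeof(float));
}
```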


field programmable gate arrays | 2018

An Optimal Microarchitecture for Stencil Computation with Data Reuse and Fine-Grained Parallelism (Abstract Only)

Yuze Chi; Peipei Zhou; Jason Cong

Stencil computation is one of the most important kernels in many applications such as image processing, solving partial differential equations, and cellular automata. Nevertheless, implementing a high-throughput stencil kernel is not trivial due to its high memory access load and low operational intensity. In this work we adopt data reuse and fine-grained parallelism and present an optimal microarchitecture for stencil computation. The data reuse line buffers not only fully utilize the external memory bandwidth and fully reuse the input data, but also minimize the size of the data reuse buffer for a given number of fine-grained, fully pipelined PEs. With the proposed microarchitecture, the number of PEs can be increased to saturate all available off-chip memory bandwidth. We implement this microarchitecture with a high-level synthesis (HLS) based template instead of register transfer level (RTL) specifications, which provides great programmability. To guide the system design, we propose a performance model in addition to detailed model evaluation and optimization analysis. Experimental results from on-board execution show that our design provides an average 6.5x speedup over a line-buffer-only design with only 2.4x resource overhead. Compared with a loop-transformation-only design, our design can implement a fully pipelined accelerator for applications that loop transformation alone cannot handle due to high memory conflicts and low design flexibility. Furthermore, our FPGA implementation provides 83% of the throughput of a 14-core CPU with 4x the energy efficiency.
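
The line-buffer data reuse described above can be sketched in HLS-style C++: each pixel is read from external memory exactly once, and two on-chip line buffers plus a 3x3 window feed a fully pipelined processing element. The image size, the averaging kernel, and the single-PE simplification are assumptions for illustration, not the paper's optimal microarchitecture.

```cpp
// Sketch: 3x3 stencil with two line buffers. Every input pixel is fetched
// once; the window registers provide the nine operands the PE needs each
// cycle, so the loop pipelines at II=1. Sizes and the kernel are hypothetical.
#define W 640   // image width  (assumed)
#define H 480   // image height (assumed)

extern "C" void stencil3x3(const float* in, float* out) {
    static float line0[W], line1[W];   // hold the two previous image rows on chip
    float win[3][3];                   // sliding 3x3 window registers
#pragma HLS ARRAY_PARTITION variable=win complete dim=0

    for (int y = 0; y < H; ++y) {
        for (int x = 0; x < W; ++x) {
#pragma HLS PIPELINE II=1
            float px = in[y * W + x];          // single external read per pixel
            // shift the window left and fill the new right column
            for (int r = 0; r < 3; ++r)
                for (int c = 0; c < 2; ++c) win[r][c] = win[r][c + 1];
            win[0][2] = line0[x];              // row y-2
            win[1][2] = line1[x];              // row y-1
            win[2][2] = px;                    // row y
            // rotate the line buffers for the next row
            line0[x] = line1[x];
            line1[x] = px;
            // one PE: 3x3 average, emitted once the window holds valid data
            if (y >= 2 && x >= 2) {
                float acc = 0.0f;
                for (int r = 0; r < 3; ++r)
                    for (int c = 0; c < 3; ++c) acc += win[r][c];
                out[(y - 1) * W + (x - 1)] = acc / 9.0f;
            }
        }
    }
}
```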


field programmable custom computing machines | 2016

Energy Efficiency of Full Pipelining: A Case Study for Matrix Multiplication

Peipei Zhou; Hyunseok Park; Zhenman Fang; Jason Cong; André DeHon

Customized pipeline designs that minimize the pipeline initiation interval (II) maximize the throughput of FPGA accelerators designed with high-level synthesis (HLS). What is the impact of minimizing II on energy efficiency? Using a matrix-multiply accelerator, we show that matrix multiplies with II>1 can sometimes reduce dynamic energy below II=1 due to interconnect savings, but II=1 always achieves energy close to the minimum. We also identify sources of inefficient mapping in the commercial tool flow.
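
To make the initiation interval (II) notion concrete, here is an HLS-style sketch of the same dot-product loop pipelined at II=1 and at II=2; integer data is used so the loop-carried accumulation can close in one cycle. The function names and sizes are illustrative and unrelated to the paper's matrix-multiply accelerator.

```cpp
// Sketch: II is the number of cycles between successive loop iterations.
// II=1 issues one multiply-accumulate per cycle; II=2 issues one every other
// cycle, roughly halving throughput but relaxing the scheduled hardware.
#define K 256

int dot_ii1(const int a[K], const int b[K]) {
    int acc = 0;
    for (int k = 0; k < K; ++k) {
#pragma HLS PIPELINE II=1
        acc += a[k] * b[k];   // one MAC issued per cycle
    }
    return acc;
}

int dot_ii2(const int a[K], const int b[K]) {
    int acc = 0;
    for (int k = 0; k < K; ++k) {
#pragma HLS PIPELINE II=2
        acc += a[k] * b[k];   // half the issue rate, potentially cheaper hardware
    }
    return acc;
}
```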


design automation conference | 2017

Bandwidth Optimization Through On-Chip Memory Restructuring for HLS

Jason Cong; Peng Wei; Cody Hao Yu; Peipei Zhou


international symposium on performance analysis of systems and software | 2018

Doppio: I/O-Aware Performance Analysis, Modeling and Optimization for In-memory Computing Framework

Peipei Zhou; Zhenyuan Ruan; Zhenman Fang; Megan Shand; David Roazen; Jason Cong


field programmable custom computing machines | 2018

ST-Accel: A High-Level Programming Platform for Streaming Applications on FPGA

Zhenyuan Ruan; Tong He; Bojie Li; Peipei Zhou; Jason Cong


field programmable custom computing machines | 2018

Latte: Locality Aware Transformation for High-Level Synthesis

Jason Cong; Peng Wei; Cody Hao Yu; Peipei Zhou


arXiv: Hardware Architecture | 2018

Best-Effort FPGA Programming: A Few Steps Can Go a Long Way.

Jason Cong; Zhenman Fang; Yuchen Hao; Peng Wei; Cody Hao Yu; Chen Zhang; Peipei Zhou

Collaboration


Dive into Peipei Zhou's collaborations.

Top Co-Authors

Jason Cong
University of California

Zhenman Fang
University of California

Cody Hao Yu
University of California

Peng Wei
University of California

Bingjun Xiao
University of California

Yu-Ting Chen
University of California

André DeHon
University of Pennsylvania

Chiyuan Ma
University of California

Hui Huang
University of California