Cody Hao Yu
University of California, Los Angeles
Publication
Featured research published by Cody Hao Yu.
Design Automation Conference | 2017
Xuechao Wei; Cody Hao Yu; Peng Zhang; Youxiang Chen; Yuxin Wang; Han Hu; Yun Liang; Jason Cong
Convolutional neural networks (CNNs) have been widely applied in many deep learning applications. In recent years, FPGA implementations of CNNs have attracted much attention because of their high performance and energy efficiency. However, existing implementations have difficulty fully leveraging the computational power of the latest FPGAs. In this paper we implement a CNN on an FPGA using a systolic array architecture, which can achieve a high clock frequency under high resource utilization. We provide an analytical model for performance and resource utilization and develop an automatic design space exploration framework, as well as a source-to-source code transformation from a C program to a systolic-array CNN implementation. The experimental results show that our framework is able to generate accelerators for real-life CNN models, achieving up to 461 GFlops for the floating-point data type and 1.2 Tops for 8–16-bit fixed point.
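The analytical-model-driven design space exploration described in this abstract can be sketched as follows. The resource and performance formulas below are illustrative placeholders (assumed DSPs per processing element, a fixed clock), not the paper's actual model:

```python
# Toy analytical-model DSE: enumerate systolic array shapes, estimate
# throughput and DSP usage, and keep the fastest design within budget.

def estimate(pe_rows, pe_cols, freq_mhz=250):
    """Estimate throughput (GFLOPS) and DSP usage for a systolic array.
    The per-PE DSP count (5) is a made-up placeholder."""
    dsps = pe_rows * pe_cols * 5
    gflops = 2 * pe_rows * pe_cols * freq_mhz / 1000.0  # 1 MAC = 2 FLOPs
    return gflops, dsps

def explore(dsp_budget):
    """Enumerate array shapes and keep the fastest design that fits."""
    best = None
    for rows in range(1, 65):
        for cols in range(1, 65):
            gflops, dsps = estimate(rows, cols)
            if dsps <= dsp_budget and (best is None or gflops > best[0]):
                best = (gflops, rows, cols)
    return best
```

Because the model is closed-form, the whole space can be swept in milliseconds, which is the practical advantage over synthesizing each candidate design.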
Symposium on Cloud Computing | 2016
Muhuan Huang; Di Wu; Cody Hao Yu; Zhenman Fang; Matteo Interlandi; Tyson Condie; Jason Cong
With the end of CPU core scaling due to dark silicon limitations, customized accelerators on FPGAs have gained increased attention in modern datacenters due to their lower power, higher performance, and energy efficiency. Evidenced by Microsoft's FPGA deployment in its Bing search engine and Intel's $16.7 billion acquisition of Altera, integrating FPGAs into datacenters is considered one of the most promising approaches to sustaining future datacenter growth. However, it is quite challenging for existing big data computing systems, such as Apache Spark and Hadoop, to access the performance and energy benefits of FPGA accelerators. In this paper we design and implement Blaze to provide programming and runtime support that enables easy and efficient deployment of FPGA accelerators in datacenters. In particular, Blaze abstracts FPGA accelerators as a service (FaaS) and provides a set of clean programming APIs for big data processing applications to easily utilize those accelerators. Our Blaze runtime implements an FaaS framework to efficiently share FPGA accelerators among multiple heterogeneous threads on a single node, and extends Hadoop YARN with accelerator-centric scheduling to efficiently share them among multiple computing tasks in the cluster. Experimental results using four representative big data applications demonstrate that Blaze greatly reduces the programming effort required to access FPGA accelerators in systems like Apache Spark and YARN, and improves system throughput by 1.7× to 3× (and energy efficiency by 1.5× to 2.7×) compared to a conventional CPU-only cluster.
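The accelerator-as-a-service idea can be illustrated with a toy registry: clients request a named accelerator and transparently fall back to a CPU implementation when none is available. This is a hypothetical sketch of the concept, not Blaze's actual API (the names `FaaSRegistry`, `register`, and `run` are invented for illustration):

```python
class FaaSRegistry:
    """Toy accelerator-as-a-service registry (hypothetical, not Blaze's API)."""
    def __init__(self):
        self._accels = {}

    def register(self, name, fn, slots=1):
        # 'fn' stands in for an FPGA kernel invocation; 'slots' models how
        # many clients may share the accelerator concurrently.
        self._accels[name] = {"fn": fn, "slots": slots, "in_use": 0}

    def run(self, name, data, cpu_fallback):
        """Dispatch to the accelerator if one is registered and free,
        otherwise run the pure-software fallback."""
        acc = self._accels.get(name)
        if acc and acc["in_use"] < acc["slots"]:
            acc["in_use"] += 1
            try:
                return acc["fn"](data)
            finally:
                acc["in_use"] -= 1
        return cpu_fallback(data)
```

The key property being modeled is that application code stays the same whether or not an accelerator is present; only the dispatch decision changes.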
IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems | 2014
Cody Hao Yu; Chiao-Ling Lung; Yi-Lun Ho; Ruei-Siang Hsu; Ding-Ming Kwai; Shih-Chieh Chang
The 3-D many-core processor (3-D MCP) has emerged as a technology to tackle the power wall problem caused by the rapidly increasing number of transistors. However, when maximizing the throughput of a 3-D MCP, which is expressed as a weighted sum of core speeds, thermal issues must be taken into consideration due to the inherent heat-removal limitation. Since the temperature of a core strongly depends on its location in the 3-D IC, proper task allocation can alleviate the thermal problem and improve throughput. Nevertheless, conventional techniques require computationally intensive thermal simulation, which prohibits their use in online applications. In this paper, we propose an efficient online task allocation and task migration algorithm that attempts to maximize the throughput of a 3-D MCP while simultaneously considering unfinished tasks left from the last scheduling interval and new tasks arriving in the current interval. The results of our experiments show that our proposed method achieves a 20.82X runtime speedup, with results comparable to the exhaustive solutions obtained from the optimization-modeling software LINGO. On average, our throughput results, with and without consideration of unfinished tasks, are only 4.39% and 0.69% worse, respectively, than those of the exhaustive method. For 128 task-to-core allocations, our method takes only 0.951 ms, which is 59.39 times faster than the previous work.
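The intuition behind thermal-aware task allocation can be sketched with a common greedy heuristic: place the most power-hungry tasks on the coolest cores. This illustrates why core location matters in a 3-D stack; it is a toy, not the paper's algorithm:

```python
# Toy thermal-aware allocation: heaviest tasks go to the coolest cores.

def allocate(task_powers, core_temps):
    """Return a mapping {task index -> core index}, pairing tasks sorted by
    descending power with cores sorted by ascending temperature."""
    tasks = sorted(range(len(task_powers)), key=lambda t: -task_powers[t])
    cores = sorted(range(len(core_temps)), key=lambda c: core_temps[c])
    return {t: c for t, c in zip(tasks, cores)}
```

A real online scheduler, as in the paper, additionally has to account for tasks carried over from the previous interval and avoid costly thermal simulation in the loop.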
Design Automation Conference | 2016
Jason Cong; Muhuan Huang; Di Wu; Cody Hao Yu
In this paper we present our ongoing study and deployment efforts for enabling FPGAs in datacenters. An important focus is to provide a quantitative evaluation of a wide range of heterogeneous system designs and integration options, from low-power field-programmable SoCs to server-class compute nodes with high-capacity FPGAs, through real system prototyping and implementation on real-life applications. In the meantime, we develop a cloud-friendly programming interface and a runtime environment that provide efficient accelerator deployment, scheduling, and transparent resource management across different system integration platforms, enabling "write once, execute everywhere" for large-scale acceleration.
Field-Programmable Custom Computing Machines | 2016
Mau-Chung Frank Chang; Yu-Ting Chen; Jason Cong; Po-Tsang Huang; Chun-Liang Kuo; Cody Hao Yu
The advance of next-generation sequencing technology has dramatically reduced the cost of genome sequencing. However, processing and analyzing the huge amounts of data collected from sequencers introduces significant computation challenges, which have become the bottleneck in many research and clinical applications. For such applications, read alignment is usually one of the most compute-intensive steps: billions of reads generated from the sequencer need to be aligned to the long reference genome. Recent state-of-the-art software read aligners follow the seed-and-extend model. In this paper we focus on accelerating the first seeding stage, which generates seeds using the super-maximal exact match (SMEM) seeding algorithm. The two main challenges for accelerating this process are 1) how to process a huge number of short reads with high throughput, and 2) how to hide the frequent and long random memory accesses incurred when fetching values from the reference genome. In this paper, we propose a scalable array-based architecture composed of many processing engines (PEs) to process large amounts of data simultaneously to meet the demand for high throughput. Furthermore, we provide a tight software/hardware integration that realizes the proposed architecture on the Intel-Altera HARP system. With a 16-PE accelerator engine, we accelerate the SMEM algorithm by 4x, and the overall SMEM seeding stage by 26%, when compared with 16-thread CPU execution. We further analyze the performance bottleneck of the design due to extensive DRAM accesses and discuss possible improvements worth exploring in the future.
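The seeding idea can be illustrated with a deliberately naive sketch: for each read position, find the longest substring that occurs exactly in the reference, and emit it as a seed. Real SMEM seeding uses an FM-index over the reference rather than this O(n·m) substring search; this toy only shows the seed-and-extend intuition:

```python
# Naive exact-match seeding (illustrative only; not the FM-index-based
# SMEM algorithm accelerated in the paper).

def longest_exact_match(read, ref, pos):
    """Length of the longest substring of `read` starting at `pos`
    that occurs anywhere in `ref`. Extending a non-matching substring
    can never match, so a simple grow-while-found loop suffices."""
    hi = pos
    best = 0
    while hi < len(read) and read[pos:hi + 1] in ref:
        hi += 1
        best = hi - pos
    return best

def seeds(read, ref, min_len=3):
    """Collect (position, length) seeds of at least `min_len` bases."""
    out = []
    i = 0
    while i < len(read):
        l = longest_exact_match(read, ref, i)
        if l >= min_len:
            out.append((i, l))
            i += l  # jump past the match before seeding again
        else:
            i += 1
    return out
```

The hardware challenge the paper addresses is that each membership test above becomes a long-latency random memory access into the indexed reference, which is why the design hides latency across many PEs.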
International Conference on Computer-Aided Design | 2015
Jason Cong; Cody Hao Yu
Application-level correctness is a useful and widely accepted concept for many kinds of applications in the modern world. The results of some applications, such as multimedia, may be incorrect due to transient hardware faults or soft errors, but they are still acceptable from a user's perspective. Thus, it is worthwhile to develop approaches that guarantee application-level correctness in software, instead of hardware, to reduce cost and save energy. Many previous research efforts presented solutions that identify the parts of programs that may potentially cause unacceptable results, and placed error detectors there to improve reliability. On the other hand, we observe that loop transformations also have the ability to improve reliability: by applying suitable loop transformations, some critical instructions may become non-critical. In this paper we propose a metric to analyze the reliability impact of each loop transformation, so that a compiler can be guided to optimize programs not only for reliability improvement but also for energy saving. The experimental results show that our analysis matches the results of fault injection, and achieves a 39.72% energy saving while improving performance by 52.16% when compared with [1]. To our knowledge, this is the first work that improves software reliability through loop transformations.
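The notion of application-level correctness, as opposed to bit-exact correctness, can be made concrete with a small fault-injection sketch: flip one bit of an output value and accept the result if the aggregate error stays below a tolerance. The tolerance and the mean-relative-error check are illustrative choices, not the paper's metric:

```python
# Toy application-level acceptability check under injected soft errors.

def acceptable(golden, faulty, tol=0.05):
    """Accept a faulty output if its mean relative error versus the
    golden output is within `tol`, instead of requiring bit equality."""
    denom = sum(abs(g) for g in golden) or 1
    err = sum(abs(g - f) for g, f in zip(golden, faulty)) / denom
    return err <= tol

def inject_bit_flip(values, index, bit):
    """Emulate a soft error by flipping one bit of an integer value."""
    out = list(values)
    out[index] ^= (1 << bit)
    return out
```

Under such a check, a low-order bit flip in a pixel is tolerable while a high-order flip is not, which is exactly why identifying which instructions are "critical" matters.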
Design Automation Conference | 2018
Jason Cong; Peng Wei; Cody Hao Yu; Peng Zhang
CPU-FPGA heterogeneous architectures feature flexible acceleration of many workloads to advance computational capability and energy efficiency in today's datacenters. This advantage, however, is often overshadowed by the poor programmability of FPGAs. Although recent advances in high-level synthesis (HLS) significantly improve FPGA programmability, they still leave programmers facing the challenge of identifying the optimal design configuration in a tremendous design space. In this paper we propose the composable, parallel and pipeline (CPP) microarchitecture as an accelerator design template to substantially reduce the design space. Also, by introducing the CPP analytical model to capture the performance-resource trade-offs, we achieve efficient, analytical-model-based design space exploration. Furthermore, we develop the AutoAccel framework to automate the entire accelerator generation process. Our experiments show that the AutoAccel-generated accelerators outperform their corresponding software implementations by an average of 72x for a broad class of computation kernels.
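One way an analytical model shrinks the design space is by discarding dominated design points before any synthesis runs. The sketch below keeps only Pareto-optimal (performance, resource) points; the numbers and the dominance rule are illustrative, not AutoAccel's model:

```python
# Illustrative Pareto pruning over (performance, resource) design points.

def pareto(points):
    """Keep designs not dominated by any other point, where a dominator
    has performance >= and resource usage <= the candidate."""
    keep = []
    for p in points:
        dominated = any(q[0] >= p[0] and q[1] <= p[1] and q != p
                        for q in points)
        if not dominated:
            keep.append(p)
    return keep
```

Only the surviving points would then be evaluated more carefully, which is how model-based exploration avoids synthesizing every candidate.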
Design Automation Conference | 2018
Cody Hao Yu; Peng Wei; Max Grossman; Peng Zhang; Vivek Sarkar; Jason Cong
Big data analytics using the JVM-based MapReduce framework has become a popular approach to address the explosive growth of data sizes. Adopting FPGAs in datacenters as accelerators to improve performance and energy efficiency also attracts increasing attention. However, the integration of FPGAs into such JVM-based frameworks raises the challenge of poor programmability. Programmers must not only rewrite Java/Scala programs to C/C++ or OpenCL, but, to achieve high performance, they must also take into consideration the intricacies of FPGAs. To address this challenge, we present S2FA (Spark-to-FPGA-Accelerator), an automation framework that generates FPGA accelerator designs from Apache Spark programs written in Scala. S2FA bridges the semantic gap between object-oriented languages and HLS C while achieving high performance using learning-based design space exploration. Evaluation results show that our generated FPGA designs achieve up to 49.9× performance improvement for several machine learning applications compared to their corresponding implementations on the JVM.
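Design space exploration by search, as opposed to exhaustive enumeration, can be sketched with a simple greedy descent over design parameters. S2FA's actual exploration is learning-based; the cost function, the (unroll factor, buffer depth) space, and the descent strategy below are all made up for illustration:

```python
# Toy search-based DSE: greedy descent over a two-parameter design space.

def greedy_descent(cost, start, neighbors, iters=100):
    """Repeatedly move to the best neighboring configuration until no
    neighbor improves the cost."""
    cur = start
    for _ in range(iters):
        cand = min(neighbors(cur), key=cost)
        if cost(cand) >= cost(cur):
            break
        cur = cand
    return cur

def mock_cost(cfg):
    """Made-up convex cost with its optimum at unroll=8, depth=16."""
    u, b = cfg
    return (u - 8) ** 2 + (b - 16) ** 2

def nbrs(cfg):
    """Neighboring configurations: step each parameter by +/-1 (min 1)."""
    u, b = cfg
    return [(max(1, u + du), max(1, b + db))
            for du in (-1, 1) for db in (-1, 1)]
```

A learning-based explorer replaces `mock_cost` with a model trained on measured designs, so each candidate can be scored without a full HLS run.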
Design Automation Conference | 2017
Jason Cong; Peng Wei; Cody Hao Yu; Peipei Zhou
USENIX Conference on Hot Topics in Cloud Computing | 2018
Jason Cong; Peng Wei; Cody Hao Yu