Kyle Rupnow
Agency for Science, Technology and Research
Publication
Featured research published by Kyle Rupnow.
Journal of Electrical and Computer Engineering | 2012
Yun Liang; Kyle Rupnow; Yinan Li; Dongbo Min; Minh N. Do; Deming Chen
FPGAs are an attractive platform for applications with high computation demand and low energy consumption requirements. However, design effort for FPGA implementations remains high, often an order of magnitude larger than design effort using high-level languages. Instead of this time-consuming process, high-level synthesis (HLS) tools generate hardware implementations from algorithm descriptions in languages such as C/C++ and SystemC. Such tools reduce design effort: high-level descriptions are more compact and less error prone. HLS tools promise hardware development abstracted from software designer knowledge of the implementation platform. In this paper, we present an unbiased study of the performance, usability and productivity of HLS using AutoPilot (a state-of-the-art HLS tool). In particular, we first evaluate AutoPilot using popular embedded benchmark kernels. Then, to evaluate the suitability of HLS on real-world applications, we perform a case study of stereo matching, an active area of computer vision research that uses techniques also common for image denoising, image retrieval, feature matching, and face recognition. Based on our study, we provide insights on current limitations of mapping general-purpose software to hardware using HLS and some future directions for HLS tool development. We also offer several guidelines for hardware-friendly software design. For popular embedded benchmark kernels, the designs produced by HLS achieve 4× to 126× speedup over the software version. The stereo matching algorithms achieve between 3.5× and 67.9× speedup over software (but still less than manual RTL design) with a fivefold reduction in design effort versus manual RTL design.
International Parallel and Distributed Processing Symposium | 2012
Zheng Cui; Yun Liang; Kyle Rupnow; Deming Chen
Graphics processing units (GPUs) are increasingly critical for general-purpose parallel processing performance. GPU hardware is composed of many streaming multiprocessors, each of which employs the single-instruction multiple-data (SIMD) execution style. This massively parallel architecture allows GPUs to execute tens of thousands of threads in parallel. Thus, GPU architectures efficiently execute heavily data-parallel applications. However, due to this SIMD execution style, resource utilization and thus overall performance can be significantly affected if computation threads must take diverging control paths. Control flow divergence in GPUs is a well-known problem: prior approaches have attempted to reduce control flow divergence through code transformations, memory access indirection, and input data reorganization. However, as we will demonstrate, the utility of these transformations is seriously affected by the lack of a guiding metric that properly estimates how control flow divergence affects application performance. In this paper, we introduce a metric that simply and accurately estimates performance of computation-bound GPU kernels with control flow divergence, and use the metric as a value function for thread re-grouping algorithms. We measure the performance on an NVIDIA GTS250 GPU. For the tested set of applications, our experiments demonstrate that the proposed metric correlates well with actual GPU application performance. Through thread re-grouping guided by our metric, control flow divergence optimization can improve application performance by up to 3.19X.
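To illustrate why thread grouping matters under SIMD execution, here is a minimal Python sketch. It is not the metric proposed in the paper. It assumes a deliberately simplified cost model in which a warp serially executes every distinct control path taken by any of its threads. The names `warp_cost` and `total_cost`, the warp size of 4, and the path costs are all hypothetical.

```python
from typing import Dict, List


def warp_cost(paths: List[str], path_cost: Dict[str, int]) -> int:
    """Cost of one warp: every distinct path taken by its threads runs serially."""
    return sum(path_cost[p] for p in set(paths))


def total_cost(thread_paths: List[str], path_cost: Dict[str, int], warp_size: int) -> int:
    """Sum of warp costs over consecutive groups of `warp_size` threads."""
    return sum(
        warp_cost(thread_paths[i:i + warp_size], path_cost)
        for i in range(0, len(thread_paths), warp_size)
    )


# Hypothetical workload: 8 threads, each taking branch "A" or branch "B".
path_cost = {"A": 10, "B": 30}
original = ["A", "B", "A", "B", "A", "B", "A", "B"]
regrouped = sorted(original)  # threads on the same path share a warp

cost_before = total_cost(original, path_cost, warp_size=4)
cost_after = total_cost(regrouped, path_cost, warp_size=4)
print(cost_before, cost_after)
```

Under this toy model, interleaved threads force both warps to execute both branches, while the regrouped order lets each warp execute a single branch. A real cost function would also need to account for memory behavior and the overhead of re-grouping itself.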
Field Programmable Gate Arrays | 2014
Hongbin Zheng; Swathi T. Gurumani; Kyle Rupnow; Deming Chen
Achievable frequency (fmax) is a widely used input constraint for designs targeting Field-Programmable Gate Arrays (FPGAs), because of its impact on design latency and throughput. Fmax is limited by critical path delay, which is highly influenced by lower-level details of the circuit implementation such as technology mapping, placement and routing. However, for high-level synthesis (HLS) design flows, it is challenging to evaluate the real critical delay at the behavioral level. Current HLS flows typically use module pre-characterization for delay estimates. However, we will demonstrate that such delay estimates are not sufficient to obtain high fmax and also minimize total execution latency. In this paper, we introduce a new HLS flow that integrates with Altera's Quartus synthesis and fast placement and routing (PAR) tool to obtain realistic post-PAR delay estimates. This integration enables an iterative flow that improves the performance of the design with both behavioral-level and circuit-level optimizations using realistic delay information. We demonstrate that our HLS flow produces up to 24% (on average 20%) improvement in fmax and up to 22% (on average 20%) improvement in execution latency. Furthermore, results demonstrate that our flow is able to achieve from 65% to 91% of the theoretical fmax on Stratix IV devices (550MHz).
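The tension between fmax and execution latency can be made concrete with simple arithmetic: wall-clock time is cycle count divided by clock frequency, so a higher fmax helps only if it does not cost too many extra cycles. The sketch below uses made-up design points purely for illustration. None of the numbers come from the paper, and `execution_time_us` is a hypothetical helper.

```python
def execution_time_us(cycles: int, fmax_mhz: float) -> float:
    """Wall-clock execution time in microseconds (cycles divided by MHz)."""
    return cycles / fmax_mhz


# Hypothetical design points (not from the paper).
shallow = execution_time_us(cycles=1000, fmax_mhz=200.0)  # fewer cycles, slower clock
deep = execution_time_us(cycles=1500, fmax_mhz=250.0)     # more pipeline stages, faster clock
print(shallow, deep)
```

In this invented example the more deeply pipelined design reaches a higher fmax yet finishes later, which is why both quantities have to be evaluated together.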
International Conference on ASIC | 2011
Kyle Rupnow; Yun Liang; Yinan Li; Deming Chen
A wide variety of application domains such as networking, computer vision, and cryptography target FPGA platforms to meet computation demand and energy consumption constraints. However, design effort for FPGA implementations in hardware description languages (HDLs) remains high, often an order of magnitude larger than design effort using high level languages (HLLs). Instead of development in HDLs, high level synthesis (HLS) tools generate hardware implementations from algorithm descriptions in HLLs such as C/C++/SystemC. HLS tools promise reduced design effort and hardware development without detailed knowledge of the implementation platform. In this paper, we study AutoPilot, a state-of-the-art HLS tool, and examine the suitability of using HLS for a variety of application domains. Based on our study of application code not originally written for HLS, we provide guidelines for software design, limitations of mapping general purpose software to hardware using HLS, and future directions for HLS tool development. For the examined applications, we demonstrate speedup from 4X to over 126X, with a five-fold reduction in design effort vs. manual design in HDLs.
Field Programmable Gate Arrays | 2016
Xinheng Liu; Yao Chen; Tan Nguyen; Swathi T. Gurumani; Kyle Rupnow; Deming Chen
High level synthesis (HLS) is gaining wider acceptance for hardware design due to its higher productivity and better design space exploration features. In recent years, HLS techniques and design flows have also advanced significantly, and as a result, many new FPGA designs are developed with HLS. However, despite many studies using HLS, the size and complexity of such applications remain generally small, and it is not well understood how to design and optimize for HLS with large, complex reference code. Typical HLS benchmark applications contain between 100 and 1400 lines of code and about 20 sub-functions, but typical input applications may contain many times more code and functions. To study such complex applications, we present a case study using HLS for a full H.264 decoder: an application with over 6000 lines of code and over 100 functions. We share our experience on code conversion for synthesizability, various HLS optimizations, HLS limitations while dealing with complex input code, and general design insights. Through our optimization process, we achieve 34 frames/s at 640x480 resolution (480p). To enable future study and benefit the research community, we open-source our synthesizable H.264 implementation.
IEEE Transactions on Very Large Scale Integration Systems | 2016
Yao Chen; Swathi T. Gurumani; Yun Liang; Guofeng Li; Donghui Guo; Kyle Rupnow; Deming Chen
High-level synthesis (HLS) of data-parallel input languages, such as the Compute Unified Device Architecture (CUDA), enables efficient description and implementation of independent computation cores. HLS tools can effectively translate the many threads of computation present in the parallel descriptions into independent, optimized cores. The generated hardware cores often heavily share input data and produce outputs independently. As the number of instantiated cores grows, the off-chip memory bandwidth may be insufficient to meet the demand. Hence, a scalable system architecture and a data-sharing mechanism become necessary for improving system performance. The network-on-chip (NoC) paradigm for intrachip communication has proved to be an efficient alternative to a hierarchical bus or crossbar interconnect, since it can reduce wire routing congestion, and has higher operating frequencies and better scalability for adding new nodes. In this paper, we present a customizable NoC architecture along with a directory-based data-sharing mechanism for an existing CUDA-to-FPGA (FCUDA) flow to enable scalability of our system and improve overall system performance. We build a fully automated FCUDA-NoC generator that takes in CUDA code and custom network parameters as inputs and produces synthesizable register transfer level (RTL) code for the entire NoC system. We implement the NoC system on a VC709 Xilinx evaluation board and evaluate our architecture with a set of benchmarks. The results demonstrate that our FCUDA-NoC design is scalable and efficient and we improve the system execution time by up to 63× and reduce external memory reads by up to 81% compared with a single hardware core implementation.
Field-Programmable Technology | 2015
Liwei Yang; Magzhan Ikram; Swathi T. Gurumani; Suhaib A. Fahmy; Deming Chen; Kyle Rupnow
High level synthesis (HLS) tools are increasingly adopted for hardware design as the quality of tools consistently improves. Concerted development effort on HLS tools represents significant software development effort, and debugging and validation represent a significant portion of that effort. However, HLS tools are different from typical large-scale software systems; HLS tool output must be subsequently verified through functional verification of the generated RTL implementation. Debugging machine-generated, functionally incorrect RTL is time-consuming and cumbersome, requiring back-tracing through hundreds of signals and simulation cycles to determine the underlying error. This challenging process requires a support framework in the HLS flow to enable fast and efficient pinpointing of the incorrectness in the tool. In this paper, we present a debug framework that uses just-in-time (JIT) traces and automated insertion of verification code into the generated RTL to assist in debugging an HLS tool. This framework aids the user by quickly pinpointing the earliest instance of execution mismatch, paired with detailed information on the faulty signal and the corresponding instruction from the application source. Using CHStone benchmarks, we demonstrate that this technique can significantly reduce bug detection latency, often with zero-cycle detection.
Field Programmable Gate Arrays | 2017
Sitao Huang; Gowthami Jayashri Manikandan; Kyle Rupnow; Wen-mei W. Hwu; Deming Chen
In this project, we propose an SoC solution to accelerate the Pair-HMM forward algorithm, which is the key performance bottleneck in GATK's HaplotypeCaller tool for DNA variant calling. We develop two versions of the Pair-HMM accelerator: one using High Level Synthesis (HLS), and another ring-based manual RTL implementation. We investigate the performance of the manual RTL design and HLS design in terms of design flexibility and overall run-time. We achieve a significant speed-up of up to 19x through the HLS implementation and speed-up of up to 95x through the RTL implementation of the algorithm.
IET Cyber-Physical Systems: Theory & Applications | 2016
Deming Chen; Jason Cong; Swathi T. Gurumani; Wen-mei W. Hwu; Kyle Rupnow; Zhiru Zhang
The rise of the Internet of Things has led to an explosion of sensor computing platforms. The complexity and applications of IoT devices range from simple devices in vending machines to complex, interactive artificial intelligence in smart vehicles and drones. Developers target more aggressive objectives and protect market share through feature differentiation; they must choose between low-cost, low-performance CPU-based systems and high-performance custom platforms with hardware accelerators including GPUs and FPGAs. Both CPU-based and custom designs introduce a variety of design challenges: extreme pressure on time-to-market, design cost, and development risk drive a voracious demand for new CAD technologies to enable rapid, low-cost design of effective IoT platforms with smaller design teams and lower risk. In this article, we present a generic IoT device design flow and discuss platform choices for IoT devices to efficiently trade off cost, power, performance and volume constraints: CPU-based systems and custom platforms that contain hardware accelerators including embedded GPUs and FPGAs. We demonstrate this design process through a driving application in computer vision. We also present current critical design automation needs for IoT development and demonstrate how our prior work in CAD for FPGAs and SoCs begins to address these needs.
IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems | 2014
Hongbin Zheng; Swathi T. Gurumani; Liwei Yang; Deming Chen; Kyle Rupnow
High-level synthesis (HLS) tools generate register-transfer level (RTL) hardware descriptions from behavioral-level specifications through resource allocation, scheduling and binding. Traditionally, HLS tools build datapath pipelines by inserting pipeline registers to break combinational logic into single-cycle segments; accurately analyzing the number of available cycles for signal propagation is proven to be infeasible at the RT-level. Thus, RT-level timing analyses must pessimistically assume each path has at most one cycle for signal propagation. This leads to false positives in critical-path analyses, prevents RTL synthesis tools from optimizing real critical paths, and forces HLS flows to insert pipeline registers without improving hardware quality. In this paper, we present an efficient behavioral-level multicycle path analysis (BL-MCPA) algorithm that leverages control-data flow information to reduce the time complexity of multicycle path analysis from exponential to polynomial. BL-MCPA helps eliminate false positives in timing analysis, and improves the reported fmax by 15% on average. With BL-MCPA, we avoid unnecessary pipeline register insertion, and reduce execution latency by 25% and register usage by 29% under a user fmax constraint of 300 MHz. Using BL-MCPA, we replace large multiplexers (MUXs) by pipelined MUX-trees and reduce execution latency of hardware by up to 67% on designs whose performance is limited by the large MUXs.