Publication


Featured research published by Swathi T. Gurumani.


Field-Programmable Gate Arrays | 2014

Fast and effective placement and routing directed high-level synthesis for FPGAs

Hongbin Zheng; Swathi T. Gurumani; Kyle Rupnow; Deming Chen

Achievable frequency (fmax) is a widely used input constraint for designs targeting Field-Programmable Gate Arrays (FPGAs) because of its impact on design latency and throughput. Fmax is limited by critical path delay, which is highly influenced by lower-level details of the circuit implementation such as technology mapping, placement, and routing. However, for high-level synthesis (HLS) design flows, it is challenging to evaluate the real critical delay at the behavioral level. Current HLS flows typically use module pre-characterization for delay estimates. However, we demonstrate that such delay estimates are not sufficient to obtain high fmax while also minimizing total execution latency. In this paper, we introduce a new HLS flow that integrates with Altera's Quartus synthesis and fast placement and routing (PAR) tools to obtain realistic post-PAR delay estimates. This integration enables an iterative flow that improves the performance of the design with both behavioral-level and circuit-level optimizations using realistic delay information. We demonstrate that our HLS flow produces up to 24% (on average 20%) improvement in fmax and up to 22% (on average 20%) improvement in execution latency. Furthermore, results demonstrate that our flow achieves from 65% to 91% of the theoretical fmax of Stratix IV devices (550 MHz).
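
The paper describes an iterative loop between HLS and fast placement and routing. The sketch below is a minimal illustration of that kind of delay-feedback loop, not the paper's implementation; run_hls, run_fast_par, and extract_path_delays are hypothetical callbacks the caller would supply as wrappers around the actual tools.

```python
# Minimal sketch (assumptions noted above): iterate HLS with realistic
# post-PAR delay feedback until the fmax target is met or iterations run out.
def iterative_hls_flow(source, run_hls, run_fast_par, extract_path_delays,
                       fmax_target_mhz, max_iters=5):
    delay_feedback = {}                 # per-path post-PAR delay estimates (ns)
    best_rtl, best_fmax = None, 0.0

    for _ in range(max_iters):
        rtl = run_hls(source, delay_feedback)        # behavioral-level optimization
        netlist = run_fast_par(rtl)                  # circuit-level synthesis + fast PAR
        path_delays = extract_path_delays(netlist)   # realistic post-PAR delays

        fmax = 1000.0 / max(path_delays.values())    # MHz from critical-path delay
        if fmax > best_fmax:
            best_rtl, best_fmax = rtl, fmax
        if fmax >= fmax_target_mhz:
            break
        # Feed realistic delays back so the next HLS pass can re-schedule or
        # re-bind the operations on the slowest paths.
        delay_feedback.update(path_delays)

    return best_rtl, best_fmax
```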


Field-Programmable Logic and Applications | 2013

High-level synthesis with behavioral level multi-cycle path analysis

Hongbin Zheng; Swathi T. Gurumani; Liwei Yang; Deming Chen; Kyle Rupnow

High-level synthesis (HLS) tools generate register transfer level (RTL) hardware descriptions through a process of resource allocation, scheduling, and binding. Intuitively, RTL quality influences logic synthesis quality: the achievable clock rate, area, and latency in clock cycles are determined by the RTL description. However, not all paths should receive equal logic synthesis effort; multi-cycle paths represent an opportunity to spend logic synthesis effort elsewhere to achieve better design quality. In this paper, we perform multi-cycle optimization on chained functional operations. We couple HLS and logic synthesis synergistically so that multi-cycle paths can be identified and optimized coherently across both the behavioral and logic levels. In addition, we perform multi-cycle path analysis efficiently at the behavioral level. We prove that our technique examines all reachable circuit states and finds multi-cycle paths, including those involving control flow and guarding conditions, which improves the flexibility and power of the technique. Compared to LegUp, we achieve an average 55% execution-time improvement, 29% area improvement, and 68% time-area product improvement when targeting an FPGA architecture.
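
As a rough illustration of the behavioral-level idea (not the paper's algorithm), a multi-cycle path can be spotted directly from the HLS schedule: when a value is produced in one control step and not consumed until several steps later, the register-to-register path between them has more than one clock cycle available. The toy schedule and data-flow edges below are invented for the example.

```python
# Toy sketch: find register-to-register paths with more than one cycle of
# slack from an HLS schedule.  Such paths can be relaxed (e.g. with
# multicycle timing constraints) so logic synthesis effort goes to the true
# single-cycle critical paths instead.
schedule = {"load_a": 0, "load_b": 0, "mul": 1, "add": 4, "store": 5}   # op -> control step
uses = {"mul": ["load_a", "load_b"], "add": ["mul"], "store": ["add"]}  # consumer -> producers

def multicycle_paths(schedule, uses):
    paths = []
    for consumer, producers in uses.items():
        for producer in producers:
            cycles = schedule[consumer] - schedule[producer]
            if cycles > 1:                       # more than one cycle available
                paths.append((producer, consumer, cycles))
    return paths

for src, dst, cycles in multicycle_paths(schedule, uses):
    print(f"{src} -> {dst}: {cycles} cycles available (multi-cycle path)")
# Prints: mul -> add: 3 cycles available (multi-cycle path)
```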


Field-Programmable Gate Arrays | 2016

High Level Synthesis of Complex Applications: An H.264 Video Decoder

Xinheng Liu; Yao Chen; Tan Nguyen; Swathi T. Gurumani; Kyle Rupnow; Deming Chen

High-level synthesis (HLS) is gaining wider acceptance for hardware design due to its higher productivity and better design space exploration features. In recent years, HLS techniques and design flows have also advanced significantly, and as a result, many new FPGA designs are developed with HLS. However, despite many studies using HLS, the size and complexity of such applications remain generally small, and it is not well understood how to design and optimize for HLS with large, complex reference code. Typical HLS benchmark applications contain between 100 and 1400 lines of code and about 20 sub-functions, but typical input applications may contain many times more code and functions. To study such complex applications, we present a case study using HLS for a full H.264 decoder: an application with over 6000 lines of code and over 100 functions. We share our experience on code conversion for synthesizability, various HLS optimizations, HLS limitations when dealing with complex input code, and general design insights. Through our optimization process, we achieve 34 frames/s at 640x480 resolution (480p). To enable future study and benefit the research community, we open-source our synthesizable H.264 implementation.


IEEE Transactions on Very Large Scale Integration (VLSI) Systems | 2016

FCUDA-NoC: A Scalable and Efficient Network-on-Chip Implementation for the CUDA-to-FPGA Flow

Yao Chen; Swathi T. Gurumani; Yun Liang; Guofeng Li; Donghui Guo; Kyle Rupnow; Deming Chen

High-level synthesis (HLS) of data-parallel input languages, such as the Compute Unified Device Architecture (CUDA), enables efficient description and implementation of independent computation cores. HLS tools can effectively translate the many threads of computation present in the parallel descriptions into independent, optimized cores. The generated hardware cores often heavily share input data and produce outputs independently. As the number of instantiated cores grows, the off-chip memory bandwidth may be insufficient to meet the demand. Hence, a scalable system architecture and a data-sharing mechanism become necessary for improving system performance. The network-on-chip (NoC) paradigm for intrachip communication has proved to be an efficient alternative to a hierarchical bus or crossbar interconnect, since it reduces wire routing congestion and offers higher operating frequencies and better scalability for adding new nodes. In this paper, we present a customizable NoC architecture along with a directory-based data-sharing mechanism for an existing CUDA-to-FPGA (FCUDA) flow to enable scalability of our system and improve overall system performance. We build a fully automated FCUDA-NoC generator that takes in CUDA code and custom network parameters as inputs and produces synthesizable register transfer level (RTL) code for the entire NoC system. We implement the NoC system on a VC709 Xilinx evaluation board and evaluate our architecture with a set of benchmarks. The results demonstrate that our FCUDA-NoC design is scalable and efficient: we improve system execution time by up to 63× and reduce external memory reads by up to 81% compared with a single-hardware-core implementation.
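
To convey the directory idea in miniature (this is an invented illustration, not the FCUDA-NoC implementation), the sketch below shows a core consulting a shared directory before issuing an off-chip read: if another node already holds the data, it can be forwarded over the network instead of being re-read from external memory.

```python
# Invented sketch of directory-based data sharing among NoC nodes: the
# directory remembers which node holds each address, so repeated reads are
# served on-chip instead of from external memory.
class Directory:
    def __init__(self):
        self.owner = {}                     # address -> node id holding the data

    def lookup(self, addr):
        return self.owner.get(addr)

    def record(self, addr, node_id):
        self.owner[addr] = node_id

class Node:
    def __init__(self, node_id, directory):
        self.node_id, self.directory = node_id, directory
        self.local_cache, self.offchip_reads = {}, 0

    def read(self, addr):
        if addr in self.local_cache:
            return self.local_cache[addr]
        owner = self.directory.lookup(addr)
        if owner is not None:               # on-chip transfer over the NoC
            data = f"data@{addr:#x} forwarded from node {owner}"
        else:                               # nobody has it yet: go off-chip
            self.offchip_reads += 1
            data = f"data@{addr:#x} from external memory"
        self.local_cache[addr] = data
        self.directory.record(addr, self.node_id)
        return data

directory = Directory()
nodes = [Node(i, directory) for i in range(4)]
print(nodes[0].read(0x100))                 # miss everywhere -> external memory
print(nodes[1].read(0x100))                 # served on-chip via the directory
```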


Field-Programmable Technology | 2015

JIT trace-based verification for high-level synthesis

Liwei Yang; Magzhan Ikram; Swathi T. Gurumani; Suhaib A. Fahmy; Deming Chen; Kyle Rupnow

High-level synthesis (HLS) tools are increasingly adopted for hardware design as the quality of the tools consistently improves. Developing HLS tools represents a significant software engineering effort, and debugging and validation represent a significant portion of that effort. However, HLS tools differ from typical large-scale software systems: HLS tool output must subsequently be verified through functional verification of the generated RTL implementation. Debugging machine-generated, functionally incorrect RTL is time-consuming and cumbersome, requiring back-tracing through hundreds of signals and simulation cycles to determine the underlying error. This challenging process requires a support framework in the HLS flow to enable fast and efficient pinpointing of incorrectness in the tool. In this paper, we present a debug framework that uses just-in-time (JIT) traces and automated insertion of verification code into the generated RTL to assist in debugging an HLS tool. This framework aids the user by quickly pinpointing the earliest instance of an execution mismatch, paired with detailed information on the faulty signal and the corresponding instruction from the application source. Using the CHStone benchmarks, we demonstrate that this technique can significantly reduce bug detection latency, often with zero-cycle detection.
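
To give a flavor of what trace-based mismatch detection looks like (a simplified illustration, not the paper's framework), the sketch below compares a software execution trace against the values observed for the corresponding operations in RTL simulation and reports the earliest point of divergence; the trace format is invented for the example.

```python
# Simplified sketch: compare a software (JIT) execution trace against an RTL
# simulation trace and report the earliest mismatching entry.
def find_first_mismatch(sw_trace, rtl_trace):
    """Each trace is a list of (instruction_id, value) in execution order."""
    for idx, ((sw_op, sw_val), (hw_op, hw_val)) in enumerate(zip(sw_trace, rtl_trace)):
        if sw_op != hw_op or sw_val != hw_val:
            return idx, sw_op, sw_val, hw_val
    return None

sw_trace  = [("add_1", 3), ("mul_2", 12), ("sub_3", 7)]
rtl_trace = [("add_1", 3), ("mul_2", 12), ("sub_3", 9)]   # hardware diverges here

mismatch = find_first_mismatch(sw_trace, rtl_trace)
if mismatch is not None:
    idx, op, sw_val, hw_val = mismatch
    print(f"Earliest mismatch at trace entry {idx}: {op} "
          f"(software={sw_val}, hardware={hw_val})")
```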


IET Cyber-Physical Systems: Theory & Applications | 2016

Platform choices and design demands for IoT platforms: cost, power, and performance tradeoffs

Deming Chen; Jason Cong; Swathi T. Gurumani; Wen-mei W. Hwu; Kyle Rupnow; Zhiru Zhang

The rise of the Internet of Things (IoT) has led to an explosion of sensor computing platforms. The complexity and applications of IoT devices range from simple devices in vending machines to complex, interactive artificial intelligence in smart vehicles and drones. Developers target more aggressive objectives and protect market share through feature differentiation; they must choose between low-cost, low-performance CPU-based systems and high-performance custom platforms with hardware accelerators including GPUs and FPGAs. Both CPU-based and custom designs introduce a variety of design challenges: extreme pressure on time-to-market, design cost, and development risk drive a voracious demand for new CAD technologies to enable rapid, low-cost design of effective IoT platforms with smaller design teams and lower risk. In this article, we present a generic IoT device design flow and discuss platform choices for IoT devices that efficiently trade off cost, power, performance, and volume constraints: CPU-based systems and custom platforms that contain hardware accelerators including embedded GPUs and FPGAs. We demonstrate this design process through a driving application in computer vision. We also present current critical design automation needs for IoT development and demonstrate how our prior work in CAD for FPGAs and SoCs begins to address these needs.


IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems | 2014

High-Level Synthesis With Behavioral-Level Multicycle Path Analysis

Hongbin Zheng; Swathi T. Gurumani; Liwei Yang; Deming Chen; Kyle Rupnow

High-level synthesis (HLS) tools generate register-transfer level (RTL) hardware descriptions from behavioral-level specifications through resource allocation, scheduling, and binding. Traditionally, HLS tools build datapath pipelines by inserting pipeline registers to break combinational logic into single-cycle segments; accurately analyzing the number of cycles available for signal propagation has proven infeasible at the RT level. Thus, RT-level timing analyses must pessimistically assume each path has at most one cycle for signal propagation. This leads to false positives in critical-path analyses, prevents RTL synthesis tools from optimizing the real critical paths, and forces HLS flows to insert pipeline registers without improving hardware quality. In this paper, we present an efficient behavioral-level multicycle path analysis (BL-MCPA) algorithm that leverages control-data flow information to reduce the time complexity of multicycle path analysis from exponential to polynomial. BL-MCPA helps eliminate false positives in timing analysis and improves the reported fmax by 15% on average. With BL-MCPA, we avoid unnecessary pipeline register insertion and reduce execution latency by 25% and register usage by 29% under a user fmax constraint of 300 MHz. Using BL-MCPA, we replace large multiplexers (MUXes) with pipelined MUX trees and reduce the execution latency of the hardware by up to 67% on designs whose performance is limited by the large MUXes.
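
As a back-of-the-envelope illustration of the MUX-tree point (not taken from the paper), replacing a wide single-cycle multiplexer with a registered tree of small multiplexers trades one long combinational path for a few short pipeline stages, each of which only pays the delay of a small MUX:

```python
# Back-of-the-envelope sketch: an N-to-1 MUX built as a tree of k-to-1 MUXes
# needs ceil(log_k(N)) levels; registering each level shortens the per-cycle
# combinational path at the cost of a few extra cycles of selection latency.
def mux_tree_stages(n_inputs, fan_in=4):
    stages, remaining = 0, n_inputs
    while remaining > 1:
        remaining = -(-remaining // fan_in)     # ceiling division
        stages += 1
    return stages

for n in (16, 64, 256):
    print(f"{n:>3}-to-1 MUX -> {mux_tree_stages(n)} registered 4-to-1 stages")
```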


Field-Programmable Gate Arrays | 2016

FCUDA-SoC: Platform Integration for Field-Programmable SoC with the CUDA-to-FPGA Compiler

Tan Nguyen; Swathi T. Gurumani; Kyle Rupnow; Deming Chen

Throughput-oriented high-level synthesis (HLS) allows efficient design and optimization using parallel input languages. Parallel languages offer the benefit of parallelism extraction at multiple levels of granularity, enabling effective design space exploration to select efficient single-core implementations and easy scaling of parallelism through multiple core instantiations. However, studies of HLS for parallel languages have concentrated on optimization of cores and on-chip communication, while neglecting platform integration, which can have a significant impact on achieved performance. In this paper, we create an automated flow to perform efficient platform integration for an existing CUDA-to-RTL throughput-oriented HLS tool, and we open-source the FCUDA tool, platform integration, and benchmark applications. We demonstrate platform integration of 16 benchmarks on two Zynq-based systems in bare-metal and OS modes. We study implementation optimization for platform integration, compare against an embedded GPU (Tegra TK1), and verify designs on a ZedBoard Zynq 7020 (bare-metal) and an Omnitek Zynq 7045 (OS).


Field-Programmable Custom Computing Machines | 2016

AutoSLIDE: Automatic Source-Level Instrumentation and Debugging for HLS

Liwei Yang; Swathi T. Gurumani; Deming Chen; Kyle Rupnow

Improved quality of results from high-level synthesis (HLS) tools has led to their increased adoption in hardware design. However, functional verification of HLS-produced designs remains a major challenge. Once a bug is exposed, designers must backtrace thousands of signals and simulation cycles to determine the underlying cause. The challenge is further exacerbated by the non-human-readable RTL that HLS produces. In this paper, we present AutoSLIDE, an automated cross-layer verification framework that instruments critical operations, detects discrepancies between software and hardware execution, and traces the suspect datapath tree to identify the bug source for each detected discrepancy. AutoSLIDE also maintains mappings between RTL datapath operations, LLVM-IR operations, and C/C++ source code to precisely pinpoint the root cause of bugs to the exact line/operation in source code, substantially reducing the user effort needed to localize bugs. We demonstrate its effectiveness by detecting and localizing bugs from earlier versions of the CHStone benchmark suite. Furthermore, we demonstrate the efficiency of AutoSLIDE, with low overhead in HLS time (27%) and software trace gathering (10%), and significantly reduced trace size and simulation time compared to exhaustive instrumentation.
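
The central mechanism described is the mapping between RTL operations, LLVM-IR operations, and source lines. The sketch below (invented for illustration, not the AutoSLIDE implementation) shows how such a cross-layer map lets a mismatching RTL signal be reported directly as a source-code location.

```python
# Invented sketch of a cross-layer map: RTL datapath signal -> (LLVM-IR op,
# source file, source line).  A mismatch found in RTL simulation is then
# reported at the exact source location it implements.
crosslayer_map = {
    "add_result_reg": ("%sum = add i32 ...",  "fir.c", 42),
    "mul_result_reg": ("%prod = mul i32 ...", "fir.c", 45),
}

def report_mismatch(rtl_signal, expected, observed, crosslayer_map):
    ir_op, src_file, src_line = crosslayer_map[rtl_signal]
    print(f"Mismatch on RTL signal '{rtl_signal}': expected {expected}, got {observed}")
    print(f"  implements IR op: {ir_op}")
    print(f"  source location : {src_file}:{src_line}")

report_mismatch("mul_result_reg", expected=120, observed=0, crosslayer_map=crosslayer_map)
```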


Design, Automation, and Test in Europe | 2016

Real-time system-level implementation of a telepresence robot using an embedded GPU platform

Muhammad Teguh Satria; Swathi T. Gurumani; Wang Zheng; Keng Peng Tee; Augustine Koh; Pan Yu; Kyle Rupnow; Deming Chen

Real-time applications such as telepresence systems present an opportunity to use embedded GPUs for compute acceleration to meet platform goals. In this paper, we develop a prototype of a portable, standalone telepresence robot that performs real-time attention-directed control using an NVIDIA Jetson TK1 embedded platform. We perform platform-specific optimizations to improve thread occupancy, optimize the computation workload, and improve the accuracy of face detection on the embedded GPU, achieving real-time performance of 30 frames per second on the Jetson TK1 and an overall speedup of 10× compared with the ARM CPU version.
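
The thread-occupancy tuning mentioned above is usually reasoned about as the ratio of warps that can be resident on a streaming multiprocessor, given a kernel's per-thread resource usage, to the hardware maximum. The numbers in the sketch below are hypothetical example limits, not the Jetson TK1's specification or the paper's data.

```python
# Rough occupancy model with hypothetical per-SM limits (not the TK1's specs):
# fewer registers per thread allows more resident warps, which helps hide
# memory latency.
def occupancy(threads_per_block, regs_per_thread,
              max_threads_per_sm=2048, max_regs_per_sm=65536, warp_size=32):
    blocks_by_threads = max_threads_per_sm // threads_per_block
    blocks_by_regs = max_regs_per_sm // (regs_per_thread * threads_per_block)
    resident_blocks = min(blocks_by_threads, blocks_by_regs)
    resident_warps = resident_blocks * threads_per_block // warp_size
    return resident_warps / (max_threads_per_sm // warp_size)

print(occupancy(threads_per_block=256, regs_per_thread=64))   # 0.5
print(occupancy(threads_per_block=256, regs_per_thread=32))   # 1.0
```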

Collaboration


Top co-authors and collaborators of Swathi T. Gurumani.

Top Co-Authors

Liwei Yang

Nanyang Technological University

Jason Cong

University of California, Los Angeles
