Naveen Suda | Researchain

Archive Network Publication Hotspot Collaboration

Network

Latest external collaboration on country level. Dive into details by clicking on the dots.

Explore More

Hotspot

Dive into the research topics where Naveen Suda is active.

Explore More

Publication

Featured researches published by Naveen Suda.

field programmable gate arrays | 2016

Throughput-Optimized OpenCL-based FPGA Accelerator for Large-Scale Convolutional Neural Networks

Naveen Suda; Vikas Chandra; Ganesh Dasika; Abinash Mohanty; Yufei Ma; Sarma B. K. Vrudhula; Jae-sun Seo; Yu Cao

Convolutional Neural Networks (CNNs) have gained popularity in many computer vision applications such as image classification, face detection, and video analysis, because of their ability to train and classify with high accuracy. Due to multiple convolution and fully-connected layers that are compute-/memory-intensive, it is difficult to perform real-time classification with low power consumption on today?s computing systems. FPGAs have been widely explored as hardware accelerators for CNNs because of their reconfigurability and energy efficiency, as well as fast turn-around-time, especially with high-level synthesis methodologies. Previous FPGA-based CNN accelerators, however, typically implemented generic accelerators agnostic to the CNN configuration, where the reconfigurable capabilities of FPGAs are not fully leveraged to maximize the overall system throughput. In this work, we present a systematic design space exploration methodology to maximize the throughput of an OpenCL-based FPGA accelerator for a given CNN model, considering the FPGA resource constraints such as on-chip memory, registers, computational resources and external memory bandwidth. The proposed methodology is demonstrated by optimizing two representative large-scale CNNs, AlexNet and VGG, on two Altera Stratix-V FPGA platforms, DE5-Net and P395-D8 boards, which have different hardware resources. We achieve a peak performance of 136.5 GOPS for convolution operation, and 117.8 GOPS for the entire VGG network that performs ImageNet classification on P395-D8 board.

field programmable logic and applications | 2016

Scalable and modularized RTL compilation of Convolutional Neural Networks onto FPGA

Yufei Ma; Naveen Suda; Yu Cao; Jae-sun Seo; Sarma B. K. Vrudhula

Despite its popularity, deploying Convolutional Neural Networks (CNNs) on a portable system is still challenging due to large data volume, intensive computation and frequent memory access. Although previous FPGA acceleration schemes generated by high-level synthesis tools (i.e., HLS, OpenCL) have allowed for fast design optimization, hardware inefficiency still exists when allocating FPGA resources to maximize parallelism and throughput. A direct hardware-level design (i.e., RTL) can improve the efficiency and achieve greater acceleration. However, this requires an in-depth understanding of both the algorithm structure and the FPGA system architecture. In this work, we present a scalable solution that integrates the flexibility of high-level synthesis and the finer level optimization of an RTL implementation. The cornerstone is a compiler that analyzes the CNN structure and parameters, and automatically generates a set of modular and scalable computing primitives that can accelerate various deep learning algorithms. Integrating these modules together for end-to-end CNN implementations, this work quantitatively analyzes the compliers design strategy to optimize the throughput of a given CNN model with the FPGA resource constraints. The proposed methodology is demonstrated on Altera Stratix-V GXA7 FPGA for AlexNet and NIN CNN models, achieving 114.5 GOPS and 117.3 GOPS, respectively. This represents a 1.9× improvement in throughput when compared to the OpenCL-based design. The results illustrate the promise of the automatic compiler solution for modularized and scalable hardware acceleration of deep learning.

IEEE Transactions on Circuits and Systems | 2013

Programmable ANalog Device Array (PANDA): A Methodology for Transistor-Level Analog Emulation

Jounghyuk Suh; Naveen Suda; Cheng Xu; Nagib Hakim; Yu Cao; Bertan Bakkaloglu

The design and development of analog/mixed-signal (AMS) integrated circuits (ICs) is becoming increasingly expensive, complex, and lengthy. Rapid prototyping and emulation of analog ICs will be significant in the design and testing of complex analog systems. A new approach, Programmable ANalog Device Array (PANDA) that maps any AMS design problem to a transistor-level programmable hardware, is proposed. This approach enables fast system level validation and a reduction in post-silicon bugs, minimizing design risk and cost. The unique features of the approach include: 1) transistor-level programmability that emulates each transistor behavior in an analog design, achieving very fine granularity of reconfiguration; 2) programmable switches that are treated as a design component during analog transistor emulating, and optimized with the reconfiguration matrix; and 3) compensation of ac performance degradation through boosting the bias current. Based on these principles, a digitally controlled PANDA platform is designed at 45 nm node that can map AMS modules across 22-90 nm technology nodes. A systematic emulation approach to map any analog transistor to 45 nm PANDA cell is proposed, which achieves transistor level matching accuracy of less than 5% for ID and less than 10% for Rout and Gm. Circuit level analog metrics of a voltage-controlled oscillator (VCO) emulated by PANDA, match to those of the original designs in 22 and 90 nm nodes with less than a 5% error. Several other 90 and 22 nm analog blocks are successfully emulated by the 45 nm PANDA platform, including a folded-cascode operational amplifier and a sample-and-hold module (S/H).

Integration | 2018

ALAMO: FPGA acceleration of deep learning algorithms with a modularized RTL compiler

Yufei Ma; Naveen Suda; Yu Cao; Sarma B. K. Vrudhula; Jae-sun Seo

Abstract Deploying Convolutional Neural Networks (CNNs) on a portable system is still challenging due to the large volume of data, the extensive amount of computation and frequent memory accesses. Although existing high-level synthesis tools (e.g. HLS, OpenCL) for FPGAs dramatically reduce the design time, the resulting implementations are still inefficient with respect to resource allocation for maximizing parallelism and throughput. Manual hardware-level design (i.e., RTL) can improve the efficiency and achieve greater acceleration but that requires an in-depth understanding of both the algorithm structure and the FPGA system architecture. This work presents a scalable solution that achieves the flexibility and reduced design time of high-level synthesis and the near-optimality of an RTL implementation. The proposed solution is a compiler that analyzes the algorithm structure and parameters, and automatically integrates a set of modular and scalable computing primitives to accelerate the operation of various deep learning algorithms on an FPGA. Integrating these modules together for end-to-end CNN implementations, this work quantitatively analyzes the compliers design strategy to optimize the throughput of a given CNN model under the FPGA resource constraints. The proposed RTL compiler, named ALAMO, is demonstrated on Altera Stratix-V GXA7 FPGA for the inference tasks of AlexNet and NiN CNN models, achieving 114.5 GOPS and 117.3 GOPS, respectively. This represents a 1.9X improvement in throughput when compared to the OpenCL-based design. The results illustrate the promise of the automatic compiler solution for modularized and scalable hardware acceleration of deep learning.

IEEE Transactions on Circuits and Systems | 2016

A 65 nm Programmable ANalog Device Array (PANDA) for Analog Circuit Emulation

Naveen Suda; Jounghyuk Suh; Nagib Hakim; Yu Cao; Bertan Bakkaloglu

Reconfigurable analog/mixed signal (AMS) platforms in scaled CMOS technology nodes are gaining importance due to the increased design cost, effort and shrinking time-to-market. Similar to field programmable gate arrays (FPGA) for digital designs, a Programmable ANalog Device Array (PANDA) provides a flexible and versatile solution with transistor-level granularity and reconfiguration capability for rapid prototyping and validation of analog circuits. This paper presents design and synthesis methodology of a PANDA design on 65 nm CMOS technology, consisting of a 24 × 25 cell array, reconfigurable interconnect, configuration memory and serial programming interface. To implement AMS circuits on the PANDA platform, this paper further proposes a CAD tool for technology mapping, placement, routing and configuration bit-stream generation. Several representative building blocks of AMS circuits, such as amplifiers, voltage and current references, filters, are successfully implemented on the PANDA platform. Dynamic reconfiguration capability of PANDA is demonstrated through input offset cancellation of an operational amplifier using an FPGA in a closed loop. Initial measurement results of PANDA implemented circuits demonstrate the potential of the methodology for rapid prototyping and hardware validation of analog circuits.

international symposium on circuits and systems | 2016

High-performance face detection with CPU-FPGA acceleration

Abinash Mohanty; Naveen Suda; Minkyu Kim; Sarma B. K. Vrudhula; Jae-sun Seo; Yu Cao

Face detection is a critical function in many embedded applications, such as computer vision and security. Although face detection has been well studied, detecting a large number of faces with different scales and excessive variations (pose, expression, or illumination) usually involves computationally expensive classification algorithms. These algorithms may divide an image into sub-windows at different scales, evaluate a large set of features for each sub-window, and determine the presence and location of a face. Even with state-of-the-art CPUs, it is still challenging to perform real-time face detection with sufficiently high energy efficiency and accuracy. In this paper, we propose a suite of acceleration techniques to enable such a capability on the CPU-FPGA platform, based on a state-of-the-art face detection algorithm that employs a large number of simple classifiers. We first map the algorithm using the integrated OpenCL environment for FPGA. Matching the structure of the algorithm, a nested architecture is proposed to speed up both memory access and the computing iterations. This multi-layer architecture distributes parallel computing cores with the memory. The physical aspects of the nested architecture, such as the core size and the number of cores, are further optimized to achieve real-time face detection, under realistic hardware constraints.

asia and south pacific design automation conference | 2017

A real-time 17-scale object detection accelerator with adaptive 2000-stage classification in 65nm CMOS

Minkyu Kim; Abinash Mohanty; Deepak Kadetotad; Naveen Suda; Luning Wei; Pooja Saseendran; Xiaofei He; Yu Cao; Jae-sun Seo

This paper presents an object detection accelerator that features many-scale (17), many-object (up to 50), multi-class (e.g., face, traffic sign), and high accuracy (average precision of 0.79/0.65 for AFW/BTSD datasets). Employing 10 gradient/color channels, integral features are extracted, and the results of 2,000 simple classifiers for rigid boosted templates are adaptively combined to make a strong classification. By jointly optimizing the algorithm and the hardware architecture, the prototype chip implemented in 65nm CMOS demonstrates real-time object detection of 13–35 frames per second with low power consumption of 22–160mW at 0.58–1.0V supply.

custom integrated circuits conference | 2015

A 14.8μV RMS integrated noise output capacitor-less low dropout regulator with a switched-RC bandgap reference

Raveesh Magod; Naveen Suda; Vadim V. Ivanov; Ravi Balasingam; Bertan Bakkaloglu

Achieving low noise is becoming an important requirement in linear supply regulators for RF and mixed-signal SoC applications. A low-noise, low dropout regulator using switched-RC bandgap reference and a multi-loop, unconditionally stable error amplifier for output capacitor-less operation is presented. Switched-RC sample-and-hold filtered bandgap reference and current-mode chopped error amplifier techniques are used for reducing output noise of the LDO. A switched capacitor notch filter is used to ensure chopping ripple free output voltage. The proposed techniques reduce the 10Hz to 100kHz integrated output noise of the LDO from 95.3uVrms to 14.8μVrms. The LDO delivers a maximum load current of 100mA with a dropout voltage of 230mV and quiescent current consumption of 40μA. It achieves a PSR of 50dB at 10kHz for programmable output voltage range of 1V-3.3V. Fabricated in a 0.25μm CMOS process, the LDO core occupies an area of 0.18mm2.

arXiv: Learning | 2017