Network


Latest external collaborations at the country level.

Hotspot


Dive into the research topics where Abinash Mohanty is active.

Publication


Featured research published by Abinash Mohanty.


International Symposium on Field-Programmable Gate Arrays | 2016

Throughput-Optimized OpenCL-based FPGA Accelerator for Large-Scale Convolutional Neural Networks

Naveen Suda; Vikas Chandra; Ganesh Dasika; Abinash Mohanty; Yufei Ma; Sarma B. K. Vrudhula; Jae-sun Seo; Yu Cao

Convolutional Neural Networks (CNNs) have gained popularity in many computer vision applications such as image classification, face detection, and video analysis, because of their ability to train and classify with high accuracy. Due to multiple convolution and fully-connected layers that are compute-/memory-intensive, it is difficult to perform real-time classification with low power consumption on today's computing systems. FPGAs have been widely explored as hardware accelerators for CNNs because of their reconfigurability and energy efficiency, as well as fast turn-around time, especially with high-level synthesis methodologies. Previous FPGA-based CNN accelerators, however, typically implemented generic accelerators agnostic to the CNN configuration, where the reconfigurable capabilities of FPGAs are not fully leveraged to maximize the overall system throughput. In this work, we present a systematic design space exploration methodology to maximize the throughput of an OpenCL-based FPGA accelerator for a given CNN model, considering FPGA resource constraints such as on-chip memory, registers, computational resources, and external memory bandwidth. The proposed methodology is demonstrated by optimizing two representative large-scale CNNs, AlexNet and VGG, on two Altera Stratix-V FPGA platforms, the DE5-Net and P395-D8 boards, which have different hardware resources. We achieve a peak performance of 136.5 GOPS for the convolution operation, and 117.8 GOPS for the entire VGG network that performs ImageNet classification on the P395-D8 board.
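
The paper's exploration flow is not reproduced here, but the core idea, enumerating candidate parallelism and buffering factors and keeping only configurations that fit the DSP, on-chip memory, and external-bandwidth budgets, can be sketched in a few lines. All numbers, names, and the roofline-style throughput estimate below are illustrative assumptions, not the authors' analytical model.

```python
import itertools

# Illustrative FPGA budgets (hypothetical values, not the DE5-Net/P395-D8 figures).
DSP_BUDGET  = 2048    # parallel multiply-accumulate lanes available
BRAM_KB     = 4096    # on-chip buffer capacity
EXT_BW_GBPS = 12.8    # external memory bandwidth
FREQ_MHZ    = 200

def throughput_gops(pe, tile_kb):
    """Roofline-style estimate: the smaller of compute-bound and bandwidth-bound rates.

    'pe' is the number of parallel MAC lanes; 'tile_kb' is the on-chip tile size.
    We assume (illustratively) that data reuse grows with the tile size, so a
    larger tile lowers the external traffic per operation."""
    compute_bound   = 2 * pe * FREQ_MHZ / 1e3     # GOP/s if never starved for data
    ops_per_byte    = tile_kb / 8.0               # made-up reuse model
    bandwidth_bound = EXT_BW_GBPS * ops_per_byte  # GOP/s sustainable from DRAM
    return min(compute_bound, bandwidth_bound)

candidates = (
    (throughput_gops(pe, tile), pe, tile)
    for pe, tile in itertools.product([64, 128, 256, 512, 1024, 2048],
                                      [64, 128, 256, 512, 1024, 2048, 4096])
    if pe <= DSP_BUDGET and tile <= BRAM_KB
)
print("best illustrative config (GOP/s, lanes, tile KB):", max(candidates))
```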


Design, Automation & Test in Europe | 2015

Technology-design co-optimization of resistive cross-point array for accelerating learning algorithms on chip

Pai-Yu Chen; Deepak Kadetotad; Zihan Xu; Abinash Mohanty; Binbin Lin; Jieping Ye; Sarma B. K. Vrudhula; Jae-sun Seo; Yu Cao; Shimeng Yu

Technology-design co-optimization methodologies of the resistive cross-point array are proposed for implementing machine learning algorithms on a chip. A novel read and write scheme is designed to accelerate the training process, realizing fully parallel operations of the weighted sum and the weight update. Furthermore, technology and design parameters of the resistive cross-point array are co-optimized to improve learning accuracy, latency, and energy consumption. In contrast to conventional memory design, a set of reverse scaling rules is proposed for the resistive cross-point array to achieve high learning accuracy. These include 1) a larger wire width to reduce the IR drop on interconnects, thereby increasing learning accuracy; and 2) the use of multiple cells for each weight element to alleviate the impact of device variations, at an affordable cost in area, energy, and latency. The optimized resistive cross-point array with peripheral circuitry is implemented at the 65 nm node. Its performance is benchmarked for handwritten digit recognition on the MNIST database using gradient-based sparse coding. Compared to a state-of-the-art software approach running on a CPU, it achieves >10³ speedup and >10⁶ energy-efficiency improvement, enabling real-time image feature extraction and learning.
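
For a purely numerical picture of the two parallel operations the array provides (this is not the authors' circuit, and all values are illustrative), conductances can stand in for the weights: the weighted sum is a matrix-vector product read out on the columns, and the weight update is a rank-one outer-product change applied to the whole array at once. Wire resistance, the IR drop the reverse scaling rules address, would perturb the row voltages in a real array and is omitted here.

```python
import numpy as np

rng = np.random.default_rng(0)

# Conductance matrix G stands in for the stored weights (rows: inputs, columns: outputs).
G_MIN, G_MAX = 1e-6, 1e-4                          # siemens, illustrative device range
G = rng.uniform(G_MIN, G_MAX, size=(784, 256))
v_in = rng.uniform(0.0, 0.2, size=784)             # read voltages applied to the rows

# Weighted sum: each column current is the dot product of the row voltages with
# that column's conductances (Kirchhoff's current law), all columns at once.
i_out = v_in @ G

# Weight update: a rank-one outer-product change applied in parallel, as in
# gradient-style training/dictionary updates.
learning_rate = 1e-3
error = rng.standard_normal(256)                   # stand-in for the update signal
G = np.clip(G + learning_rate * np.outer(v_in, error), G_MIN, G_MAX)

print(i_out.shape, G.shape)   # (256,) (784, 256)
```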


IEEE Journal on Emerging and Selected Topics in Circuits and Systems | 2015

Parallel Architecture With Resistive Crosspoint Array for Dictionary Learning Acceleration

Deepak Kadetotad; Zihan Xu; Abinash Mohanty; Pai-Yu Chen; Binbin Lin; Jieping Ye; Sarma B. K. Vrudhula; Shimeng Yu; Yu Cao; Jae-sun Seo

This paper proposes a parallel architecture with a resistive crosspoint array. The design of its two essential operations, read and write, is inspired by the biophysical behavior of a neural system, such as integrate-and-fire and local synapse weight update. The proposed hardware consists of an array of resistive random access memory (RRAM) and CMOS peripheral circuits, which perform matrix-vector multiplication and dictionary update in a fully parallel fashion, at a speed that is independent of the matrix dimension. The read and write circuits are implemented in 65 nm CMOS technology and verified together with an array of RRAM device models built from experimental data. The overall system exploits array-level parallelism and is demonstrated for accelerated dictionary learning tasks. Compared to a software implementation running on an 8-core CPU, the proposed hardware achieves more than 3000× speedup, enabling high-speed feature extraction on a single chip.


IEEE Transactions on Nanotechnology | 2015

On-Chip Sparse Learning Acceleration With CMOS and Resistive Synaptic Devices

Jae-sun Seo; Binbin Lin; Minkyu Kim; Pai Yu Chen; Deepak Kadetotad; Zihan Xu; Abinash Mohanty; Sarma B. K. Vrudhula; Shimeng Yu; Jieping Ye; Yu Cao

Many recent advances in sparse coding have led to its wide adoption in signal processing, pattern classification, and object recognition applications. Even with improved performance in state-of-the-art algorithms and the hardware platform of CPUs/GPUs, solving a sparse coding problem still requires expensive computations, making real-time large-scale learning a very challenging problem. In this paper, we co-optimize algorithm, architecture, circuit, and device for real-time, energy-efficient, on-chip hardware acceleration of sparse coding. The principle of hardware acceleration is to recognize the properties of learning algorithms, which involve many parallel operations of data fetch and matrix/vector multiplication/addition. Today's von Neumann architecture, however, is not suitable for such parallelization, due to the separation of memory and the computing unit that makes sequential operations inevitable. This principle drives both the selection of algorithms and the design evolution from CPU to a CMOS application-specific integrated circuit (ASIC) to the parallel architecture with resistive crosspoint array (PARCA) that we propose. The CMOS ASIC scheme implements sparse coding with SRAM dictionaries and all-digital circuits, and PARCA employs resistive-RAM dictionaries with special read and write circuits. We show that 65 nm implementations of the CMOS ASIC and PARCA schemes accelerate sparse coding computation by 394× and 2140×, respectively, compared to software running on an eight-core CPU. Simulated power for both hardware schemes lies in the milliwatt range, making it viable for portable single-chip learning applications.
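
The sparse coding workload itself is dominated by exactly these matrix/vector operations. As a point of reference only, a minimal software solver using ISTA (a common gradient-based sparse-coding method, chosen here for illustration; the paper does not necessarily use this exact update rule) looks like this:

```python
import numpy as np

def ista(D, x, lam=0.1, step=None, iters=100):
    """Solve min_a 0.5*||x - D a||^2 + lam*||a||_1 by iterative soft-thresholding."""
    if step is None:
        step = 1.0 / np.linalg.norm(D, 2) ** 2    # 1 / Lipschitz constant of the gradient
    a = np.zeros(D.shape[1])
    for _ in range(iters):
        grad = D.T @ (D @ a - x)                  # the matrix-vector work a crossbar parallelizes
        a = a - step * grad
        a = np.sign(a) * np.maximum(np.abs(a) - step * lam, 0.0)  # soft threshold
    return a

rng = np.random.default_rng(1)
D = rng.standard_normal((64, 256))
D /= np.linalg.norm(D, axis=0)                    # unit-norm dictionary atoms
x = rng.standard_normal(64)
a = ista(D, x)
print("nonzero coefficients:", np.count_nonzero(a))
```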


Biomedical Circuits and Systems Conference | 2014

Neurophysics-inspired parallel architecture with resistive crosspoint array for dictionary learning

Deepak Kadetotad; Zihan Xu; Abinash Mohanty; Pai-Yu Chen; Binbin Lin; Jieping Ye; Sarma B. K. Vrudhula; Shimeng Yu; Yu Cao; Jae-sun Seo

This paper proposes a parallel architecture with a resistive crosspoint array. The design of its two essential operations, read and write, is inspired by the biophysical behavior of a neural system, such as integrate-and-fire and time-dependent synaptic plasticity. The proposed hardware consists of an array of resistive random access memory (RRAM) and CMOS peripheral circuits, which perform matrix products and dictionary updates in a fully parallel fashion, at a speed that is independent of the matrix dimension. The entire system is implemented in 65nm CMOS technology with RRAM to realize high-speed unsupervised dictionary learning. Compared to the state-of-the-art software approach, it achieves more than 3000× speedup, enabling real-time feature extraction on a single chip.


IEEE Transactions on Device and Materials Reliability | 2015

Accelerated Aging in Analog and Digital Circuits With Feedback

Ketul B. Sutaria; Abinash Mohanty; Runsheng Wang; Ru Huang; Yu Cao

Aging mechanisms such as Bias Temperature Instability (BTI) and Channel Hot Carrier (CHC) are key limiting factors of circuit lifetime in CMOS design. The threshold-voltage shift of a device due to degradation is usually a gradual process, causing only a moderate increase in the failure rate of CMOS designs. Conventional analog circuits typically employ feedback control for system stability, and digital circuits use Dynamic Voltage Scaling (DVS) to optimize power and performance. For such closed-loop topologies, the degradation rate can be dramatically accelerated, leading to destructive consequences. To identify such catastrophic phenomena, this work (1) presents an accurate simulation framework and aging models for BTI and CHC that account for the underlying physics, with the complete methodology validated against 28 nm and 65 nm silicon data; (2) investigates bias runaway, a rapid increase of the gate-drain voltage of a bias circuit in analog/mixed-signal (AMS) circuits; along with silicon evidence, the critical boundary condition and design trade-offs for bias runaway are explored with technology scaling; and (3) demonstrates the DVS-induced acceleration in the failure rate of logic circuits under NBTI and PBTI. Overall, this work identifies key issues affecting the stability of feedback systems, which is vitally important for reliable IC design.
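
For intuition only, and not the calibrated 28 nm/65 nm models from the paper: BTI-style degradation is often summarized as a gradual power-law threshold-voltage shift, roughly ΔVth ≈ A·t^n, while the danger in a closed loop is that the degraded operating point feeds back into the stress itself. The toy loop below, with made-up constants and a made-up feedback gain, shows how such feedback can turn a gradual drift into a much larger shift.

```python
# Toy comparison of open-loop vs. closed-loop aging. Constants are illustrative
# and do not correspond to the paper's calibrated BTI/CHC models.
A, n = 5e-3, 0.2            # hypothetical power-law prefactor (V) and time exponent
dt, steps = 1.0, 10_000     # time step and total stress time (arbitrary "hours")

# Open loop: the stress condition never changes, so the shift follows A * t**n.
open_loop_shift = A * (steps * dt) ** n

# Closed loop: the effective stress grows with the accumulated shift
# (illustrative gain), so each increment of degradation is amplified.
shift, gain = 0.0, 100.0
for k in range(1, steps + 1):
    stress = 1.0 + gain * shift
    # incremental shift over this step, from the derivative of the power law
    shift += A * stress * n * (k * dt) ** (n - 1) * dt

print(f"open-loop shift  : {open_loop_shift * 1e3:6.1f} mV")
print(f"closed-loop shift: {shift * 1e3:6.1f} mV")
```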


International Symposium on Circuits and Systems | 2016

High-performance face detection with CPU-FPGA acceleration

Abinash Mohanty; Naveen Suda; Minkyu Kim; Sarma B. K. Vrudhula; Jae-sun Seo; Yu Cao

Face detection is a critical function in many embedded applications, such as computer vision and security. Although face detection has been well studied, detecting a large number of faces with different scales and excessive variations (pose, expression, or illumination) usually involves computationally expensive classification algorithms. These algorithms may divide an image into sub-windows at different scales, evaluate a large set of features for each sub-window, and determine the presence and location of a face. Even with state-of-the-art CPUs, it is still challenging to perform real-time face detection with sufficiently high energy efficiency and accuracy. In this paper, we propose a suite of acceleration techniques to enable such a capability on a CPU-FPGA platform, based on a state-of-the-art face detection algorithm that employs a large number of simple classifiers. We first map the algorithm using the integrated OpenCL environment for FPGAs. Matching the structure of the algorithm, a nested architecture is proposed to speed up both memory access and the computing iterations. This multi-layer architecture distributes parallel computing cores alongside the memory. The physical aspects of the nested architecture, such as the core size and the number of cores, are further optimized to achieve real-time face detection under realistic hardware constraints.
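
A compressed software analogue of this workload (not the authors' classifier, which is taken from prior work, and with made-up window sizes, scales, and thresholds) is the usual scale/window/classifier triple loop; the sheer iteration count of this structure is what the nested, memory-distributed FPGA architecture parallelizes.

```python
import numpy as np

rng = np.random.default_rng(2)
frame = rng.random((480, 640))                      # stand-in grayscale frame

WINDOW, STRIDE = 24, 4                              # illustrative detector geometry
SCALES = [1.0, 1.25, 1.5625, 1.953]                 # illustrative image pyramid

def simple_classifier_bank(window, n_classifiers=64):
    """Stand-in for a bank of simple classifiers; a real detector would compare
    per-classifier feature responses against learned thresholds and combine them."""
    responses = window.reshape(n_classifiers, -1).mean(axis=1)
    return responses.sum()

flagged = 0
for s in SCALES:
    h, w = int(frame.shape[0] / s), int(frame.shape[1] / s)
    scaled = frame[:h, :w]                          # crude stand-in for resampling
    for y in range(0, h - WINDOW, STRIDE):
        for x in range(0, w - WINDOW, STRIDE):
            score = simple_classifier_bank(scaled[y:y + WINDOW, x:x + WINDOW])
            flagged += score > 32.1                 # arbitrary illustrative cutoff
print("windows flagged:", flagged)
```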


Asia and South Pacific Design Automation Conference | 2017

A real-time 17-scale object detection accelerator with adaptive 2000-stage classification in 65nm CMOS

Minkyu Kim; Abinash Mohanty; Deepak Kadetotad; Naveen Suda; Luning Wei; Pooja Saseendran; Xiaofei He; Yu Cao; Jae-sun Seo

This paper presents an object detection accelerator that features many-scale (17), many-object (up to 50), multi-class (e.g., face, traffic sign) detection with high accuracy (average precision of 0.79/0.65 on the AFW/BTSD datasets). Employing 10 gradient/color channels, integral features are extracted, and the results of 2,000 simple classifiers for rigid boosted templates are adaptively combined to make a strong classification. By jointly optimizing the algorithm and the hardware architecture, the prototype chip implemented in 65nm CMOS demonstrates real-time object detection at 13–35 frames per second with low power consumption of 22–160mW at a 0.58–1.0V supply.
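
The "integral features" mentioned above are typically built on integral images, which let any rectangular sum over a channel be read with four lookups. A generic sketch is below; the chip's exact channel set and its adaptive combination of the 2,000 classifiers are not reproduced.

```python
import numpy as np

def integral_image(channel):
    """Cumulative sum over both axes, zero-padded so ii[y, x] = sum of channel[:y, :x]."""
    return np.pad(channel.cumsum(0).cumsum(1), ((1, 0), (1, 0)))

def rect_sum(ii, y, x, h, w):
    """Sum over channel[y:y+h, x:x+w] using four corner lookups."""
    return ii[y + h, x + w] - ii[y, x + w] - ii[y + h, x] + ii[y, x]

rng = np.random.default_rng(3)
channel = rng.random((120, 160))          # one of the gradient/color channels
ii = integral_image(channel)

# Verify against the direct sum for an arbitrary rectangle.
assert np.isclose(rect_sum(ii, 10, 20, 16, 8), channel[10:26, 20:28].sum())
print("rectangle sum:", rect_sum(ii, 10, 20, 16, 8))
```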


IEEE Transactions on Very Large Scale Integration (VLSI) Systems | 2017

RTN in Scaled Transistors for On-Chip Random Seed Generation

Abinash Mohanty; Ketul B. Sutaria; Hiromitsu Awano; Takashi Sato; Yu Cao

Random numbers play a vital role in cryptography, where they are used to generate keys, nonces, one-time pads, and initialization vectors for symmetric encryption. The quality of the random number generator (RNG) has significant implications for the vulnerability and performance of these algorithms. A pseudo-RNG uses a deterministic algorithm to produce numbers with a distribution very close to uniform. True RNGs (TRNGs), on the other hand, use some natural phenomenon or process to generate random bits. They are nondeterministic, because the next number to be generated cannot be determined in advance. In this paper, a novel on-chip noise source, random telegraph noise (RTN), is exploited for a simple and reliable TRNG. RTN, a microscopic process of stochastic trapping/detrapping of charges, is usually treated as noise and mitigated in design. Through physical modeling and silicon measurement, we demonstrate that RTN is appropriate for a TRNG, especially in highly scaled MOSFETs. Due to the slow speed of RTN, we propose the system for on-chip seed generation for random number generators. Our contributions are: 1) physical model calibration of RTN with comprehensive 65- and 180-nm transistor measurements; 2) the scaling trend of RTN, validated with silicon data down to 28 nm; 3) design principles to achieve 50% signal probability by using intrinsic RTN physical properties; without traditional postprocessing algorithms, the generated sequence passes the National Institute of Standards and Technology (NIST) tests; and 4) solutions to manage realistic issues in practice, including the multilevel RTN signal, robustness to voltage and temperature fluctuations, and the operation speed.
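
Behaviorally, and only as an illustration (the actual circuit samples device current, and the paper's point 3 is precisely that the 50% balance comes from the RTN physics rather than postprocessing), a two-level RTN signal can be simulated with exponential capture/emission dwell times and turned into bits by sampling; the software debiasing step at the end is what the device-level balancing makes unnecessary.

```python
import numpy as np

rng = np.random.default_rng(4)

# Two-state RTN: the trap captures/emits a carrier with exponential dwell times.
# All constants are illustrative, not fitted to the paper's 65/180/28 nm data.
TAU_CAPTURE, TAU_EMISSION = 0.6e-3, 1.4e-3     # mean dwell times in seconds
DT, N = 1e-4, 200_000                          # simulation step and length

state = 0                                      # 0 = trap empty, 1 = trap occupied
samples = np.empty(N, dtype=np.int8)
for i in range(N):
    tau = TAU_EMISSION if state else TAU_CAPTURE
    if rng.random() < DT / tau:                # memoryless switching per step
        state ^= 1
    samples[i] = state

# Sample well below the RTN corner frequency so successive bits decorrelate,
# then debias in software. The paper instead balances the 0/1 probability at
# the device level and skips this postprocessing step.
bits = samples[::200]
pairs = bits[: len(bits) // 2 * 2].reshape(-1, 2)
unbiased = pairs[pairs[:, 0] != pairs[:, 1]][:, 0]     # von Neumann extractor
print(f"raw ones fraction: {bits.mean():.2f}, debiased bits kept: {unbiased.size}")
```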


International Reliability Physics Symposium | 2015

Duty cycle shift under static/dynamic aging in 28nm HK-MG technology

Ketul B. Sutaria; Pengpeng Ren; Abinash Mohanty; Xixiang Feng; Runsheng Wang; Ru Huang; Yu Cao

Aging due to bias temperature instability (BTI) is the dominant cause of functional failure in large-scale logic circuits. Power-efficient techniques such as clock gating or dynamic voltage scaling exacerbate the problem of asymmetric aging. Traditional analysis of synchronous circuits focuses on the shift in data-path delay and neglects the change in duty cycle. This work highlights the impact of NBTI and PBTI at an advanced technology node on duty cycle shift, which is important for edge-triggered designs such as latch-based circuits. The contributions of this work are: (1) characterization, decoupling, and model calibration of NBTI, PBTI, and CHC data at the 28nm HK-MG technology; (2) demonstration of a monotonic shift of the duty cycle under static stress conditions and a non-monotonic shift under dynamic stress, in which the duty cycle converges to 50%; the additional PBTI component at 28nm HK-MG causes a faster shift in duty cycle than conventional NBTI aging alone; and (3) the sensitivity of long-term aging to the ratio between static and dynamic stress conditions. With PBTI, the duty cycle shift is effectively reduced by dynamic stress.
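
A one-line model conveys the mechanism (this is a simplification with made-up numbers, not the paper's 28nm HK-MG calibration): the high phase of a buffered clock lengthens with the fall-edge delay shift and shortens with the rise-edge delay shift, so static stress that ages only one edge moves the duty cycle monotonically away from 50%, while dynamic stress that ages both edges pulls it back toward 50%.

```python
# Toy duty-cycle model: illustrative numbers only.
T = 1.0                       # clock period, arbitrary units
D_RISE0 = D_FALL0 = 0.05      # fresh rise/fall delays of the clock buffer

def duty_cycle(d_rise, d_fall, duty_in=0.5):
    """High time grows with extra fall-edge delay and shrinks with extra rise-edge delay."""
    high = duty_in * T + (d_fall - D_FALL0) - (d_rise - D_RISE0)
    return high / T

# Static stress: only one edge keeps slowing down (e.g., NBTI on the pull-up).
print("static stress :", duty_cycle(d_rise=0.08, d_fall=0.05))    # drifts to 0.47
# Dynamic stress: NBTI and PBTI both age the buffer, the delay shifts partially
# cancel, and the duty cycle heads back toward 50%.
print("dynamic stress:", duty_cycle(d_rise=0.08, d_fall=0.075))   # about 0.495
```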

Collaboration


Dive into Abinash Mohanty's collaborations.

Top Co-Authors

Yu Cao, Arizona State University
Jae-sun Seo, Arizona State University
Shimeng Yu, Arizona State University
Binbin Lin, Arizona State University
Jieping Ye, Arizona State University
Zihan Xu, Arizona State University
Pai-Yu Chen, Arizona State University