Network


Latest external collaborations at the country level.

Hotspot


Dive into the research topics where Deepak Kadetotad is active.

Publication


Featured research published by Deepak Kadetotad.


Design, Automation, and Test in Europe | 2015

Technology-design co-optimization of resistive cross-point array for accelerating learning algorithms on chip

Pai-Yu Chen; Deepak Kadetotad; Zihan Xu; Abinash Mohanty; Binbin Lin; Jieping Ye; Sarma B. K. Vrudhula; Jae-sun Seo; Yu Cao; Shimeng Yu

Technology-design co-optimization methodologies for the resistive cross-point array are proposed for implementing machine learning algorithms on a chip. A novel read and write scheme is designed to accelerate the training process, realizing fully parallel operations of the weighted sum and the weight update. Furthermore, technology and design parameters of the resistive cross-point array are co-optimized for learning accuracy, latency, and energy consumption. In contrast to conventional memory design, a set of reverse scaling rules is proposed for the resistive cross-point array to achieve high learning accuracy. These include 1) larger wire width to reduce the IR drop on interconnects, thereby increasing the learning accuracy; and 2) use of multiple cells for each weight element to alleviate the impact of device variations, at an affordable expense of area, energy, and latency. The optimized resistive cross-point array with peripheral circuitry is implemented at the 65 nm node. Its performance is benchmarked for handwritten digit recognition on the MNIST database using gradient-based sparse coding. Compared to a state-of-the-art software approach running on a CPU, it achieves >10^3 speed-up and >10^6 energy-efficiency improvement, enabling real-time image feature extraction and learning.
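The weighted-sum operation above maps each stored weight to a cell conductance, so the array computes a full matrix-vector product in a single read by Kirchhoff's current law (each column current is the sum of conductance-weighted row voltages). A minimal numerical sketch of this idea, with purely illustrative conductance and voltage values:

```python
import numpy as np

# Hypothetical 4x3 crossbar: G[i][j] is the conductance (siemens) of the
# cell at row i, column j; each conductance encodes one weight value.
G = np.array([[1.0e-6, 2.0e-6, 0.5e-6],
              [3.0e-6, 1.0e-6, 2.0e-6],
              [0.5e-6, 0.5e-6, 1.0e-6],
              [2.0e-6, 3.0e-6, 0.5e-6]])

# Input vector applied as row voltages (volts).
V = np.array([0.2, 0.0, 0.1, 0.2])

# By Kirchhoff's current law, each column current is the weighted sum of
# the row voltages -- the whole matrix-vector product in one array read.
I = G.T @ V  # column currents, in amperes
print(I)
```

The same sum is evaluated in every column simultaneously, which is why the latency is independent of the matrix dimension.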


IEEE Journal on Emerging and Selected Topics in Circuits and Systems | 2015

Parallel Architecture With Resistive Crosspoint Array for Dictionary Learning Acceleration

Deepak Kadetotad; Zihan Xu; Abinash Mohanty; Pai-Yu Chen; Binbin Lin; Jieping Ye; Sarma B. K. Vrudhula; Shimeng Yu; Yu Cao; Jae-sun Seo

This paper proposes a parallel architecture with a resistive crosspoint array. The design of its two essential operations, read and write, is inspired by the biophysical behavior of a neural system, such as integrate-and-fire and local synapse weight update. The proposed hardware consists of a resistive random access memory (RRAM) array and CMOS peripheral circuits, which perform matrix-vector multiplication and dictionary update in a fully parallel fashion, at a speed independent of the matrix dimension. The read and write circuits are implemented in 65 nm CMOS technology and verified together with an RRAM array model built from experimental data. The overall system exploits array-level parallelism and is demonstrated on accelerated dictionary learning tasks. Compared to a software implementation running on an 8-core CPU, the proposed hardware achieves more than 3000× speedup, enabling high-speed feature extraction on a single chip.


IEEE Transactions on Nanotechnology | 2015

On-Chip Sparse Learning Acceleration With CMOS and Resistive Synaptic Devices

Jae-sun Seo; Binbin Lin; Minkyu Kim; Pai Yu Chen; Deepak Kadetotad; Zihan Xu; Abinash Mohanty; Sarma B. K. Vrudhula; Shimeng Yu; Jieping Ye; Yu Cao

Many recent advances in sparse coding have led to its wide adoption in signal processing, pattern classification, and object recognition applications. Even with improved performance in state-of-the-art algorithms and the hardware platform of CPUs/GPUs, solving a sparse coding problem still requires expensive computations, making real-time large-scale learning a very challenging problem. In this paper, we co-optimize algorithm, architecture, circuit, and device for real-time, energy-efficient on-chip hardware acceleration of sparse coding. The principle of hardware acceleration is to exploit the properties of learning algorithms, which involve many parallel operations of data fetch and matrix/vector multiplication/addition. Today's von Neumann architecture, however, is not suitable for such parallelization, because the separation of memory and the computing unit makes sequential operations inevitable. This principle drives both the selection of algorithms and the design evolution from CPU to CMOS application-specific integrated circuits (ASIC) to the parallel architecture with resistive crosspoint array (PARCA) that we propose. The CMOS ASIC scheme implements sparse coding with SRAM dictionaries and all-digital circuits, while PARCA employs resistive-RAM dictionaries with special read and write circuits. We show that 65 nm implementations of the CMOS ASIC and PARCA schemes accelerate sparse coding computation by 394× and 2140×, respectively, compared to software running on an eight-core CPU. Simulated power for both hardware schemes lies in the milliwatt range, making them viable for portable single-chip learning applications.
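The sparse coding problem these accelerators target minimizes a reconstruction error plus an L1 penalty over a dictionary D. A common software baseline is the iterative shrinkage-thresholding algorithm (ISTA), which alternates a gradient step with elementwise soft-thresholding; the sketch below uses random illustrative data and is not the paper's implementation:

```python
import numpy as np

def soft_threshold(v, t):
    """Elementwise shrinkage: the proximal operator of the L1 norm."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def ista(x, D, lam=0.05, n_iter=200):
    """ISTA for min_z 0.5*||x - D z||^2 + lam*||z||_1."""
    L = np.linalg.norm(D, 2) ** 2          # Lipschitz constant of the gradient
    z = np.zeros(D.shape[1])
    for _ in range(n_iter):
        grad = D.T @ (D @ z - x)           # gradient of the quadratic term
        z = soft_threshold(z - grad / L, lam / L)
    return z

rng = np.random.default_rng(0)
D = rng.standard_normal((16, 32))          # random dictionary (illustrative)
D /= np.linalg.norm(D, axis=0)             # unit-norm atoms
z_true = np.zeros(32)
z_true[[3, 17]] = [1.0, -0.5]              # a 2-sparse ground-truth code
x = D @ z_true
z = ista(x, D)
print(np.count_nonzero(np.abs(z) > 0.2))   # typically only a few atoms active
```

The matrix-vector products inside the loop (`D @ z`, `D.T @ ...`) are exactly the operations the crossbar array parallelizes.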


Biomedical Circuits and Systems Conference | 2014

Neurophysics-inspired parallel architecture with resistive crosspoint array for dictionary learning

Deepak Kadetotad; Zihan Xu; Abinash Mohanty; Pai-Yu Chen; Binbin Lin; Jieping Ye; Sarma B. K. Vrudhula; Shimeng Yu; Yu Cao; Jae-sun Seo

This paper proposes a parallel architecture with a resistive crosspoint array. The design of its two essential operations, read and write, is inspired by the biophysical behavior of a neural system, such as integrate-and-fire and time-dependent synaptic plasticity. The proposed hardware consists of a resistive random access memory (RRAM) array and CMOS peripheral circuits, which perform matrix product and dictionary update in a fully parallel fashion, at a speed independent of the matrix dimension. The entire system is implemented in 65 nm CMOS technology with RRAM to realize high-speed unsupervised dictionary learning. Compared to a state-of-the-art software approach, it achieves more than 3000× speedup, enabling real-time feature extraction on a single chip.


International Conference on Computer-Aided Design | 2016

Efficient memory compression in deep neural networks using coarse-grain sparsification for speech applications

Deepak Kadetotad; Sairam Arunachalam; Chaitali Chakrabarti; Jae-sun Seo

Recent breakthroughs in deep neural networks have led to the proliferation of their use in image and speech applications. Conventional deep neural networks (DNNs) are fully connected multi-layer networks with hundreds or thousands of neurons in each layer. Such a network requires a very large weight memory to store the connectivity between neurons. In this paper, we propose a hardware-centric methodology to design low-power neural networks with a significantly smaller memory footprint and computation resource requirements. We achieve this by judiciously dropping connections in large blocks of weights. The corresponding technique, termed coarse-grain sparsification (CGS), introduces hardware-aware sparsity during DNN training, which leads to efficient weight memory compression and significant computation reduction during classification without losing accuracy. We apply the proposed approach to DNN design for keyword detection and speech recognition. When the two DNNs are trained with 75% of the weights dropped and classified with 5–6 bit weight precision, the weight memory requirement is reduced by 95% compared to their fully connected counterparts with double precision, while maintaining similar performance in keyword detection accuracy, word error rate, and sentence error rate. To validate this technique in real hardware, a time-multiplexed architecture using a shared multiply-and-accumulate (MAC) engine was implemented in 65 nm and 40 nm low-power (LP) CMOS. In 40 nm at 0.6 V, the keyword detection network consumes 36 µW and the speech recognition network consumes 552 µW, making this technique highly suitable for mobile and wearable devices.
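Coarse-grain sparsification drops weights in fixed-size blocks rather than one at a time, so only per-block indices need to be stored alongside the surviving weights. A toy sketch of block-wise pruning by tile magnitude (the block size, keep ratio, and magnitude-based selection are illustrative stand-ins, not the paper's exact training procedure):

```python
import numpy as np

def coarse_grain_sparsify(W, block=8, keep_ratio=0.25):
    """Zero out whole block x block tiles of W, keeping only the tiles
    with the largest L1 norm (a simple stand-in for training-time CGS)."""
    rows, cols = W.shape
    br, bc = rows // block, cols // block
    tiles = W.reshape(br, block, bc, block)         # [tile_row, r, tile_col, c]
    norms = np.abs(tiles).sum(axis=(1, 3))          # L1 norm per tile
    k = max(1, int(keep_ratio * br * bc))           # number of tiles to keep
    thresh = np.sort(norms.ravel())[-k]             # k-th largest tile norm
    mask = (norms >= thresh)[:, None, :, None]      # broadcast to tile shape
    return (tiles * mask).reshape(rows, cols)

rng = np.random.default_rng(1)
W = rng.standard_normal((64, 64))
Ws = coarse_grain_sparsify(W, block=8, keep_ratio=0.25)
density = np.count_nonzero(Ws) / Ws.size
print(f"fraction of weights kept: {density:.2f}")
```

Because pruning happens at tile granularity, the index overhead is one entry per kept tile instead of one per kept weight, which is what makes the memory compression hardware-friendly.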


Symposium on VLSI Circuits | 2017

A 1.06 μW smart ECG processor in 65 nm CMOS for real-time biometric authentication and personal cardiac monitoring

Shihui Yin; Minkyu Kim; Deepak Kadetotad; Yang Liu; Chisung Bae; Sang Joon Kim; Yu Cao; Jae-sun Seo

A smart wearable electrocardiographic (ECG) processor is presented for secure ECG-based biometric authentication and cardiac monitoring, including arrhythmia and anomaly detection. Data-driven Lasso regression and low-precision techniques are developed to compress the neural networks by 24.4×. The prototype chip fabricated in 65 nm LP CMOS consumes 1.06 μW at 0.55 V for real-time ECG authentication. Equal error rates of 0.74% and 1.7% are achieved on the ECG-ID database and an in-house 645-subject database, respectively.


International Symposium on Low Power Electronics and Design | 2017

Monolithic 3D IC designs for low-power deep neural networks targeting speech recognition

Kyungwook Chang; Deepak Kadetotad; Yu Cao; Jae-sun Seo; Sung Kyu Lim

In recent years, deep learning has become widespread for various real-world recognition tasks. In addition to recognition accuracy, energy efficiency is another grand challenge for enabling local intelligence in edge devices. In this paper, we investigate the adoption of monolithic 3D IC (M3D) technology for deep learning hardware design, using speech recognition as a test vehicle. M3D has recently proven to be one of the leading contenders for addressing the power, performance, and area (PPA) scaling challenges in advanced technology nodes. Our study encompasses the influence of key parameters of DNN hardware implementations on energy efficiency, including DNN architectural choices, underlying workloads, and tier partitioning choices in M3D. Our post-layout M3D designs, together with hardware-efficient sparse algorithms, produce power savings beyond what can be achieved using conventional 2D ICs. Experimental results show that M3D offers a 22.3% iso-performance power saving, convincingly demonstrating its promise as a solution for DNN ASICs. We further present architectural guidelines for M3D DNNs to maximize the power saving.


International Conference on Artificial Intelligence and Soft Computing | 2017

Comprehensive Evaluation of OpenCL-Based CNN Implementations for FPGAs

Ricardo Tapiador-Morales; Antonio Rios-Navarro; Alejandro Linares-Barranco; Minkyu Kim; Deepak Kadetotad; Jae-sun Seo

Deep learning has significantly advanced the state of the art in artificial intelligence, gaining wide popularity in both industry and academia. Special interest surrounds Convolutional Neural Networks (CNNs), which take inspiration from the hierarchical structure of the visual cortex to form deep layers of convolutional operations, along with fully connected classifiers. Hardware implementations of these deep CNN architectures face memory bottlenecks: the many convolutional and fully connected layers demand a large amount of communication for parallel computation. Multi-core CPU-based solutions have proven inadequate for this problem due to the memory wall and low parallelism. Many-core GPU architectures show superior performance, but they consume considerable power and also have memory constraints due to inconsistencies between cache and main memory. OpenCL is commonly used to describe these architectures for execution on GPGPUs or FPGAs. FPGA design solutions are also actively being explored, since they allow implementing the memory hierarchy using embedded parallel BlockRAMs. This boosts the parallel use of shared memory elements between multiple processing units, avoiding data replication and inconsistencies, and makes FPGAs potentially powerful solutions for real-time classification with CNNs. In this paper, the OpenCL co-design frameworks for pseudo-automatic development adopted by both Altera and Xilinx are evaluated. A comprehensive evaluation and comparison for a 5-layer deep CNN is presented, covering hardware resources, temporal performance, and the OpenCL architecture for CNNs. Xilinx demonstrates faster synthesis, better FPGA resource utilization, and more compact boards. Altera provides multi-platform tools, a mature design community, and better execution times.


Asia and South Pacific Design Automation Conference | 2017

A real-time 17-scale object detection accelerator with adaptive 2000-stage classification in 65nm CMOS

Minkyu Kim; Abinash Mohanty; Deepak Kadetotad; Naveen Suda; Luning Wei; Pooja Saseendran; Xiaofei He; Yu Cao; Jae-sun Seo

This paper presents an object detection accelerator that features many scales (17), many objects (up to 50), multiple classes (e.g., face, traffic sign), and high accuracy (average precision of 0.79/0.65 on the AFW/BTSD datasets). Employing 10 gradient/color channels, integral features are extracted, and the results of 2,000 simple classifiers for rigid boosted templates are adaptively combined to make a strong classification. By jointly optimizing the algorithm and the hardware architecture, the prototype chip implemented in 65 nm CMOS demonstrates real-time object detection at 13–35 frames per second with low power consumption of 22–160 mW at 0.58–1.0 V supply.
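The adaptive combination of the 2,000 simple classifiers can be read as a soft cascade: weak-classifier scores are accumulated stage by stage, and a window is rejected early once the running sum falls below a stage threshold, so easy negatives cost only a fraction of the evaluations. A schematic sketch with made-up scores and thresholds (not the chip's actual parameters):

```python
import random

def adaptive_classify(weak_scores, stage_thresholds, stage_size=100):
    """Sum weak-classifier scores in stages of `stage_size`; reject as soon
    as the running sum falls below that stage's threshold (early exit).
    Returns (is_positive, number_of_weak_classifiers_evaluated)."""
    total = 0.0
    for stage, thresh in enumerate(stage_thresholds):
        total += sum(weak_scores[stage * stage_size:(stage + 1) * stage_size])
        if total < thresh:
            return False, (stage + 1) * stage_size   # rejected early
    return total > 0, len(weak_scores)               # survived all stages

# Illustrative: 2000 weak scores, 20 stages of 100 classifiers each.
random.seed(0)
negative = [random.uniform(-0.02, 0.01) for _ in range(2000)]  # mostly negative evidence
positive = [random.uniform(-0.005, 0.02) for _ in range(2000)]
thresholds = [-0.5] * 20
print(adaptive_classify(negative, thresholds))  # likely rejected after a few stages
print(adaptive_classify(positive, thresholds))
```

The early-exit behavior is what keeps the average per-window workload, and hence the power, low even with 2,000 stages available.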


Asia and South Pacific Design Automation Conference | 2017

Low-power neuromorphic speech recognition engine with coarse-grain sparsity

Shihui Yin; Deepak Kadetotad; Bonan Yan; Chang Song; Yiran Chen; Chaitali Chakrabarti; Jae-sun Seo

In recent years, we have seen a surge of interest in neuromorphic computing and its hardware design for cognitive applications. In this work, we present new neuromorphic architecture, circuit, and device co-designs that enable spike-based classification for a speech recognition task. The proposed neuromorphic speech recognition engine supports a sparsely connected deep spiking network with coarse granularity, leading to a large memory reduction with minimal index information. Simulation results show that the proposed deep spiking neural network accelerator achieves a phoneme error rate (PER) of 20.5% on the TIMIT database and consumes 2.57 mW in 40 nm CMOS for real-time performance. To alleviate the memory bottleneck, the usage of non-volatile memory is also evaluated and discussed.
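The spike-based classification above rests on integrate-and-fire dynamics: a neuron accumulates weighted input into a membrane potential and emits a spike when the potential crosses a threshold. A minimal leaky integrate-and-fire sketch with illustrative parameters:

```python
def lif_neuron(input_current, threshold=1.0, leak=0.9, v_reset=0.0):
    """Leaky integrate-and-fire: decay the membrane potential by `leak`
    each step, add the input, then spike and reset on threshold crossing."""
    v = 0.0
    spikes = []
    for i in input_current:
        v = leak * v + i          # leaky integration of input
        if v >= threshold:
            spikes.append(1)      # emit a spike
            v = v_reset           # reset the membrane potential
        else:
            spikes.append(0)
    return spikes

# Constant drive of 0.3 per step: the neuron fires periodically.
out = lif_neuron([0.3] * 20)
print(out)
```

In a spiking accelerator, the weighted inputs driving each neuron come from the sparse synaptic connections, so the coarse-grain sparsity directly reduces both the weight memory and the number of accumulations per spike.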

Collaboration


Dive into Deepak Kadetotad's collaborations.

Top Co-Authors

Jae-sun Seo, Arizona State University
Yu Cao, Arizona State University
Shimeng Yu, Arizona State University
Binbin Lin, Arizona State University
Jieping Ye, Arizona State University
Minkyu Kim, Arizona State University
Pai-Yu Chen, Arizona State University
Zihan Xu, Arizona State University