Network


Latest external collaborations at the country level.

Hotspot


Dive into the research topics where Swagath Venkataramani is active.

Publication


Featured research published by Swagath Venkataramani.


international symposium on computer architecture | 2017

ScaleDeep: A Scalable Compute Architecture for Learning and Evaluating Deep Networks

Swagath Venkataramani; Ashish Ranjan; Subarno Banerjee; Dipankar Das; Sasikanth Avancha; Ashok Jagannathan; Ajaya V. Durg; Dheemanth Nagaraj; Bharat Kaul; Pradeep Dubey; Anand Raghunathan

Deep Neural Networks (DNNs) have demonstrated state-of-the-art performance on a broad range of tasks involving natural language, speech, image, and video processing, and are deployed in many real-world applications. However, DNNs impose significant computational challenges owing to the complexity of the networks and the amount of data they process, both of which are projected to grow in the future. To improve the efficiency of DNNs, we propose SCALEDEEP, a dense, scalable server architecture whose processing, memory and interconnect subsystems are specialized to leverage the compute and communication characteristics of DNNs. While several DNN accelerator designs have been proposed in recent years, the key difference is that SCALEDEEP primarily targets DNN training, as opposed to only inference or evaluation. The key architectural features from which SCALEDEEP derives its efficiency are: (i) heterogeneous processing tiles and chips to match the wide diversity in computational characteristics (FLOPs and Bytes/FLOP ratio) that manifest at different levels of granularity in DNNs, (ii) a memory hierarchy and 3-tiered interconnect topology that is suited to the memory access and communication patterns in DNNs, (iii) a low-overhead synchronization mechanism based on hardware data-flow trackers, and (iv) methods to map DNNs to the proposed architecture that minimize data movement and improve core utilization through nested pipelining. We have developed a compiler to allow any DNN topology to be programmed onto SCALEDEEP, and a detailed architectural simulator to estimate performance and energy. The simulator incorporates timing and power models of SCALEDEEP's components based on synthesis to Intel's 14nm technology. We evaluate an embodiment of SCALEDEEP with 7032 processing tiles that operates at 600 MHz and has a peak performance of 680 TFLOPs (single precision) and 1.35 PFLOPs (half-precision) at 1.4 kW. Across 11 state-of-the-art DNNs containing 0.65M-14.9M neurons and 6.8M-145.9M weights, including winners from 5 years of the ImageNet competition, SCALEDEEP demonstrates 6×-28× speedup at iso-power over the state-of-the-art performance on GPUs.
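The layer-level heterogeneity that motivates SCALEDEEP's heterogeneous tiles (FLOPs and Bytes/FLOP varying widely across layers) can be illustrated with a short sketch. The layer shapes and the reuse-free traffic model below are assumptions chosen for illustration, not figures from the paper.

```python
# Sketch: per-layer FLOPs and Bytes/FLOP for conv and fully-connected layers.
# Layer shapes below are illustrative, not taken from the paper.

def conv_stats(cin, cout, k, h, w, bytes_per_elem=4):
    """FLOPs and rough memory traffic for one conv layer (no data reuse modeled)."""
    flops = 2 * cin * cout * k * k * h * w           # one multiply + one add per MAC
    weights = cin * cout * k * k * bytes_per_elem
    acts = (cin + cout) * h * w * bytes_per_elem     # input + output feature maps
    return flops, (weights + acts) / flops           # (FLOPs, Bytes/FLOP)

def fc_stats(nin, nout, bytes_per_elem=4):
    flops = 2 * nin * nout
    traffic = (nin * nout + nin + nout) * bytes_per_elem
    return flops, traffic / flops

for name, (f, bpf) in {
    "conv3x3 256->256 @ 14x14": conv_stats(256, 256, 3, 14, 14),
    "fc 4096->4096":            fc_stats(4096, 4096),
}.items():
    print(f"{name}: {f/1e6:.1f} MFLOPs, {bpf:.3f} Bytes/FLOP")
```

Even this toy model shows roughly two orders of magnitude difference in Bytes/FLOP between a convolutional and a fully-connected layer, which is the kind of spread heterogeneous tiles are meant to absorb.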


international symposium on low power electronics and design | 2018

Across the Stack Opportunities for Deep Learning Acceleration

Vijayalakshmi Srinivasan; Bruce M. Fleischer; Sunil Shukla; Matthew M. Ziegler; Joel Abraham Silberman; Jinwook Oh; Jungwook Choi; Silvia Melitta Mueller; Ankur Agrawal; Tina Babinsky; Nianzheng Cao; Chia-Yu Chen; Pierce Chuang; Thomas W. Fox; George D. Gristede; Michael A. Guillorn; Howard M. Haynie; Michael Klaiber; Dongsoo Lee; Shih-Hsien Lo; Gary W. Maier; Michael R. Scheuermann; Swagath Venkataramani; Christos Vezyrtzis; Naigang Wang; Fanchieh Yee; Ching Zhou; Pong-Fei Lu; Brian W. Curran; Leland Chang

The combination of growth in compute capabilities and availability of large datasets has led to a re-birth of deep learning. Deep Neural Networks (DNNs) have become state-of-the-art in a variety of machine learning tasks spanning domains across vision, speech, and machine translation. Deep Learning (DL) achieves high accuracy in these tasks at the expense of 100s of ExaOps of computation, posing significant challenges to efficient large-scale deployment in both resource-constrained environments and data centers. One of the key enablers to improve the operational efficiency of DNNs is the observation that, when extracting deep insight from vast quantities of structured and unstructured data, the exactness imposed by traditional computing is not required. Relaxing the exactness constraint enables exploiting opportunities for approximate computing across all layers of the system stack. In this talk we present a multi-TOPS AI core [3] for acceleration of deep learning training and inference in systems from edge devices to data centers. We demonstrate that deriving high sustained utilization and energy efficiency from the AI core requires ground-up re-thinking to exploit approximate computing across the stack, including algorithms, architecture, programmability, and hardware. Model accuracy is the fundamental measure of deep learning quality. The compute engine precision in our AI core is carefully calibrated to realize a significant reduction in area and power while not compromising numerical accuracy. Our research at the DL algorithms/applications level [2] shows that it is possible to carefully tune the precision of both weights and activations to as low as 2 bits for inference; this guided the choices of compute precision supported in the architecture and hardware for both training and inference. Similarly, the scalability of distributed DL training is impacted by the communication overhead of exchanging gradients and weights after each mini-batch. Our research on gradient compression [1] shows that by selectively sending gradients larger than a threshold, and by further choosing the threshold based on the importance of the gradient, we achieve a compression ratio of 40X for convolutional layers, and up to 200X for fully-connected layers of the network, without losing model accuracy. These results guide the choice of interconnection network topology for a system of accelerators built using the AI core. Overall, our work shows how the benefits of approximation, obtained by exploiting the robustness of algorithms and applications to reduced precision and by compressing data communication, can be combined effectively with an accelerator architecture and hardware designed to support reduced-precision computation and compressed data communication. Our results demonstrate improved end-to-end efficiency of the DL accelerator across different metrics such as high sustained TOPS, high TOPS/watt and TOPS/mm², catering to different operating environments for both training and inference.
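The threshold-based gradient compression mentioned above can be sketched as follows. The residual-accumulation step (carrying dropped gradients forward to later iterations) is a common companion technique and is an assumption here, not necessarily the exact scheme of the cited work.

```python
import numpy as np

def compress_gradients(grad, residual, threshold):
    """Send only entries whose (gradient + carried residual) exceed a threshold.

    Assumed sketch: values that are not sent accumulate in `residual` and are
    retried on later steps; the cited work may choose thresholds differently.
    """
    g = grad + residual
    mask = np.abs(g) > threshold
    sent = np.where(mask, g, 0.0)           # sparse update to communicate
    new_residual = np.where(mask, 0.0, g)   # keep what was not sent
    ratio = g.size / max(mask.sum(), 1)
    return sent, new_residual, ratio

rng = np.random.default_rng(0)
grad = rng.normal(scale=1e-3, size=1_000_000)
sent, res, ratio = compress_gradients(grad, np.zeros_like(grad), threshold=2.5e-3)
print(f"compression ratio ~{ratio:.0f}x")
```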


international symposium on low power electronics and design | 2018

Taming the beast: Programming Peta-FLOP class Deep Learning Systems

Swagath Venkataramani; Vijayalakshmi Srinivasan; Jungwook Choi; Kailash Gopalakrishnan; Leland Chang

The field of Artificial Intelligence (AI) has witnessed tremendous growth in recent years with the advent of Deep Neural Networks (DNNs), which have achieved state-of-the-art performance on challenging cognitive tasks involving images, videos, text and natural language. They are being increasingly deployed in many real-world services and products, and have pervaded the spectrum of computing devices from mobile/IoT devices to server-class platforms. However, DNNs are highly compute- and data-intensive workloads, far outstripping the capabilities of today's computing platforms. For example, state-of-the-art image recognition DNNs require billions of operations to classify a single image. On the other hand, training DNN models demands exa-flops of compute and uses massive datasets requiring 100s of giga-bytes of memory. One approach to address the computational challenges imposed by DNNs is through the design of hardware accelerators, whose compute cores, memory hierarchy and interconnect topology are specialized to match the DNN's compute and communication characteristics. Several such designs, ranging from low-power IP cores to large-scale accelerator systems, have been proposed in the literature. Some factors that enable the design of specialized systems for DNNs are: (i) their computations can be expressed as static data-flow graphs, (ii) their computation patterns are regular with no data-dependent control flows and offer abundant opportunities for data reuse, and (iii) their functionality can be encapsulated within a small set (tens) of basic functions (e.g. convolution, matrix multiplication, etc.). That said, DNNs also exhibit abundant heterogeneity at various levels. Across layers, the number of input and output channels and the dimensions of each feature are substantially different. Further, each layer comprises operations whose Bytes/FLOP requirements vary by over two orders of magnitude. This heterogeneity in compute characteristics engenders a wide range of possibilities for spatiotemporally mapping DNNs on accelerator platforms, defined in terms of how computations are split across the different compute elements in the architecture and how the computations assigned to a compute element are temporally sequenced. We are therefore led to ask: is it possible to systematically explore the design space of mapping configurations to maximize a DNN's performance on a given accelerator architecture using a variety of different dataflows? How will the computations be partitioned and sequenced across the processing elements?
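The spatiotemporal-mapping question raised above (how to split a layer's work across compute elements and sequence it in time) can be illustrated with a tiny exhaustive search over work partitions of a single matrix-multiply-like layer. The accelerator parameters and the cycle/buffer cost model are assumptions for illustration, not the authors' framework.

```python
from itertools import product

# Hypothetical accelerator: a grid of processing elements (PEs), each with a
# fixed-size local buffer. We search how to split an M x N output across PEs.
# All parameters are illustrative assumptions, not the authors' framework.
PES, BUF_BYTES = 64, 256 * 1024

def mapping_cost(M, N, K, pm, pn, bytes_per_elem=2):
    """Rough cycle count for partitioning an MxNxK matmul into pm x pn PE tiles."""
    if pm * pn > PES:
        return None                                  # not enough PEs
    tm, tn = -(-M // pm), -(-N // pn)                # ceil-divided tile sizes
    tile_bytes = (tm * K + K * tn + tm * tn) * bytes_per_elem
    if tile_bytes > BUF_BYTES:
        return None                                  # tile does not fit on-chip
    return tm * tn * K                               # cycles at 1 MAC/cycle per PE

M, N, K = 512, 512, 256
candidates = []
for pm, pn in product([1, 2, 4, 8, 16, 32, 64], repeat=2):
    cycles = mapping_cost(M, N, K, pm, pn)
    if cycles is not None:
        candidates.append((cycles, pm, pn))
print("best (cycles, pm, pn):", min(candidates))
```

Even in this toy setting, some partitions are infeasible (their tiles overflow the local buffer) and the feasible ones differ in utilization, which is exactly the design space a systematic exploration has to navigate.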


design automation conference | 2018

Compensated-DNN: energy efficient low-precision deep neural networks by compensating quantization errors

Shubham Jain; Swagath Venkataramani; Vijayalakshmi Srinivasan; Jungwook Choi; Pierce Chuang; Leland Chang

Deep Neural Networks (DNNs) represent the state-of-the-art in many Artificial Intelligence (AI) tasks involving images, videos, text, and natural language. Their ubiquitous adoption is limited by the high computation and storage requirements of DNNs, especially for energy-constrained inference tasks at the edge using wearable and IoT devices. One promising approach to alleviate the computational challenges is implementing DNNs using a low-precision fixed-point (<16 bits) representation. However, the quantization error inherent in any Fixed Point (FxP) implementation limits the choice of bit-widths needed to maintain application-level accuracy. Prior efforts recommend increasing the network size and/or re-training the DNN to minimize the loss due to quantization, albeit with limited success. Complementary to the above approaches, we present Compensated-DNN, wherein we propose to dynamically compensate the error introduced due to quantization during execution. To this end, we introduce a new fixed-point representation, Fixed Point with Error Compensation (FPEC). The bits in FPEC are split between computation bits and compensation bits. The computation bits use conventional FxP notation to represent the number at low precision. The compensation bits (1 or 2 bits at most) explicitly capture an estimate (direction and magnitude) of the quantization error in the representation. For a given word length, since FPEC uses fewer computation bits than an FxP representation, we achieve a near-quadratic improvement in energy in the multiply-and-accumulate (MAC) operations. The compensation bits are simultaneously used by a low-overhead sparse compensation scheme to estimate the error accrued during MAC operations, which is then added to the MAC output to minimize the impact of quantization. We build Compensated-DNNs for 7 popular image recognition benchmarks with 0.05–20.5 million neurons and 0.01–15.5 billion connections. Based on gate-level analysis at 14nm technology, we achieve 2.65×–4.88× and 1.13×–1.7× improvements in energy compared to 16-bit and 8-bit FxP implementations respectively, while maintaining <0.5% loss in classification accuracy.
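The FPEC idea (low-precision computation bits plus one or two compensation bits that capture the direction and a rough magnitude of the quantization error) can be sketched with a scalar quantizer. The bit-field layout and the error-magnitude estimate below are assumptions for illustration, not the exact format defined in the paper.

```python
import numpy as np

def fpec_quantize(x, frac_bits):
    """Quantize to fixed point and record a crude per-element error estimate.

    Returns (q, sign_bit, mag_bit): q is the low-precision value, sign_bit gives
    the direction of the quantization error, and mag_bit flags whether the error
    exceeds a quarter of the quantization step. This layout is an illustrative
    assumption, not the paper's exact FPEC encoding.
    """
    step = 2.0 ** -frac_bits
    q = np.round(x / step) * step
    err = x - q
    sign_bit = (err >= 0).astype(np.int8)             # direction of the error
    mag_bit = (np.abs(err) > step / 4).astype(np.int8)
    return q, sign_bit, mag_bit

def fpec_compensated_dot(qa, sa, ma, qb, frac_bits):
    """Dot product on low-precision values plus a first-order error correction."""
    step = 2.0 ** -frac_bits
    est_err_a = np.where(sa == 1, 1.0, -1.0) * np.where(ma == 1, step / 2, step / 8)
    return np.dot(qa, qb) + np.dot(est_err_a, qb)     # compensate a's error only

x = np.random.default_rng(1).uniform(-1, 1, 256)
w = np.random.default_rng(2).uniform(-1, 1, 256)
qx, sx, mx = fpec_quantize(x, frac_bits=4)
print("exact:", np.dot(x, w))
print("fxp  :", np.dot(qx, w))
print("fpec :", fpec_compensated_dot(qx, sx, mx, w, frac_bits=4))
```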


design automation conference | 2018

Dyhard-DNN: even more DNN acceleration with dynamic hardware reconfiguration

Mateja Putic; Alper Buyuktosunoglu; Swagath Venkataramani; Pradip Bose; Schuyler Eldridge; Mircea R. Stan

Deep Neural Networks (DNNs) have demonstrated their utility across a wide range of input data types and diverse computing substrates, from edge devices to datacenters. This broad utility has resulted in myriad hardware accelerator architectures. However, DNNs exhibit significant heterogeneity in their computational characteristics, e.g., feature and kernel dimensions, and dramatic variation in computational intensity, even between adjacent layers of one DNN. Consequently, accelerators with static hardware parameters run sub-optimally and leave energy-efficiency margins unclaimed. We propose DyHard-DNNs, where accelerator microarchitectural parameters are dynamically reconfigured during DNN execution to significantly improve metrics of interest. We demonstrate the effectiveness of this approach on a configurable SIMD 2D systolic array and show a 15–65% performance improvement (at iso-power) and 25–90% energy improvement (at iso-latency) over the best static configuration across six mainstream DNN workloads.
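The per-layer reconfiguration described above can be sketched as choosing a systolic-array aspect ratio for each layer so that the layer's channel dimensions map with the fewest idle PEs. The candidate array shapes, layer sizes, and utilization model are assumptions for illustration, not the DyHard-DNN microarchitecture.

```python
# Sketch: pick a (rows, cols) array shape per layer so the layer's
# output-channel x input-channel work maps with the least idle PEs.
# Candidate shapes and layer sizes below are illustrative assumptions.

CANDIDATE_SHAPES = [(8, 128), (16, 64), (32, 32), (64, 16), (128, 8)]  # 1024 PEs each

def utilization(rows, cols, out_ch, in_ch):
    """Fraction of PEs doing useful work when tiling out_ch x in_ch onto the array."""
    tiles_r = -(-out_ch // rows)    # ceil division
    tiles_c = -(-in_ch // cols)
    used = out_ch * in_ch
    provisioned = tiles_r * rows * tiles_c * cols
    return used / provisioned

layers = {"conv1": (64, 3), "conv5": (512, 512), "fc": (1000, 4096)}
for name, (oc, ic) in layers.items():
    best = max(CANDIDATE_SHAPES, key=lambda s: utilization(*s, oc, ic))
    print(name, "->", best, f"util={utilization(*best, oc, ic):.2f}")
```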


international conference on parallel architectures and compilation techniques | 2017

POSTER: Design Space Exploration for Performance Optimization of Deep Neural Networks on Shared Memory Accelerators

Swagath Venkataramani; Jungwook Choi; Vijayalakshmi Srinivasan; Kailash Gopalakrishnan; Leland Chang

The growing prominence of Deep Neural Networks (DNNs) and the computational challenges they impose have fueled the design of specialized accelerator architectures and associated dataflows to improve their implementation efficiency. Each of these solutions serves as a data point on the throughput vs. energy trade-off for a given DNN and a set of architectural constraints. In this paper, we set out to explore whether it is possible to systematically explore the design space so as to estimate a given DNN's performance (both inference and training) on a shared-memory architecture specification using a variety of dataflows. To this end, we have developed a framework, DEEPMATRIX, which, given a description of a DNN and a hardware architecture, automatically identifies how the computations of the DNN's layers need to be partitioned and mapped onto the architecture such that the overall performance is maximized, while meeting the constraints imposed by the hardware (processing power, memory capacity, bandwidth, etc.). We demonstrate DEEPMATRIX's effectiveness for the VGG DNN benchmark, showing the trade-offs and sensitivity of utilization under different architecture constraints.
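A roofline-style estimate is one simple way to reason about the compute and bandwidth constraints such an exploration must respect. The hardware numbers and per-layer traffic below are assumptions for illustration, not DEEPMATRIX's model or the VGG results in the paper.

```python
# Sketch: roofline bound for one layer given peak compute and memory bandwidth.
# Hardware numbers and layer traffic below are illustrative assumptions.

PEAK_TFLOPS = 4.0      # peak compute of the hypothetical accelerator
PEAK_BW_GBS = 200.0    # off-chip bandwidth

def layer_time_s(flops, bytes_moved):
    """Lower bound on layer time: limited by compute or by memory traffic."""
    t_compute = flops / (PEAK_TFLOPS * 1e12)
    t_memory = bytes_moved / (PEAK_BW_GBS * 1e9)
    return max(t_compute, t_memory)

# Two VGG-like layers (shapes assumed): a conv layer and a fully-connected layer.
for name, f, b in [("conv", 3.7e9, 25e6), ("fc", 0.2e9, 400e6)]:
    bound = "memory" if b / (PEAK_BW_GBS * 1e9) > f / (PEAK_TFLOPS * 1e12) else "compute"
    print(f"{name}: {layer_time_s(f, b)*1e3:.2f} ms ({bound}-bound)")
```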


international conference on computer design | 2017

Very Low Voltage (VLV) Design

Ramon Bertran; Pradip Bose; David M. Brooks; Jeff Burns; Alper Buyuktosunoglu; Nandhini Chandramoorthy; Eric Cheng; Martin Cochet; Schuyler Eldridge; Daniel J. Friedman; Hans M. Jacobson; Rajiv V. Joshi; Subhasish Mitra; Robert K. Montoye; Arun Paidimarri; Pritish R. Parida; Kevin Skadron; Mircea R. Stan; Karthik Swaminathan; Augusto Vega; Swagath Venkataramani; Christos Vezyrtzis; Gu-Yeon Wei; John-David Wellman; Matthew M. Ziegler

This paper is a tutorial-style introduction to a special session on Effective Voltage Scaling in the Late CMOS Era. It covers the fundamental challenges and associated solution strategies in pursuing very low voltage (VLV) designs. We discuss the performance and system reliability constraints that are key impediments to VLV. The associated trade-offs across power, performance and reliability are helpful in inferring the optimal operational voltage-frequency point. This work was performed under the auspices of an ongoing DARPA program (named PERFECT) that is focused on maximizing system-level energy efficiency.
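The power/performance trade-off behind choosing an operating voltage-frequency point can be sketched with a textbook first-order model (dynamic CV² energy plus leakage energy accrued over a cycle whose length grows as the voltage approaches threshold). The constants are illustrative assumptions, not measurements from the work described here.

```python
import numpy as np

# Sketch: first-order energy-per-operation model vs. supply voltage.
# All constants are illustrative assumptions, not data from this paper.
VTH = 0.30       # threshold voltage (V)
C_EFF = 1e-9     # effective switched capacitance per operation (F)
I_LEAK = 0.1     # leakage current (A), exaggerated to make the trade-off visible

def energy_per_op(vdd):
    freq = 1e9 * (vdd - VTH) / vdd     # frequency roughly linear in (Vdd - Vth)
    e_dyn = C_EFF * vdd ** 2           # dynamic (switching) energy per op
    e_leak = I_LEAK * vdd / freq       # leakage energy accrued while the op runs
    return e_dyn + e_leak

vdds = np.linspace(0.35, 1.0, 66)
energies = [energy_per_op(v) for v in vdds]
best = vdds[int(np.argmin(energies))]
print(f"energy-optimal Vdd under this model: ~{best:.2f} V")
```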


design automation conference | 2017

Accelerator Design for Deep Learning Training: Extended Abstract: Invited

Ankur Agrawal; Chia-Yu Chen; Jungwook Choi; Kailash Gopalakrishnan; Jinwook Oh; Sunil Shukla; Viji Srinivasan; Swagath Venkataramani; Wei Zhang

Deep Neural Networks (DNNs) have emerged as a powerful and versatile set of techniques showing successes on challenging artificial intelligence (AI) problems. Applications in domains such as image/video processing, autonomous cars, natural language processing, speech synthesis and recognition, genomics and many others have embraced deep learning as their foundation. DNNs achieve superior accuracy for these applications with high computational complexity, using very large models that require 100s of MBs of data storage, exaops of computation and high bandwidth for data movement. In spite of these impressive advances, it still takes days to weeks to train state-of-the-art deep networks on large datasets, which directly limits the pace of innovation and adoption. In this paper, we present a multi-pronged approach to address the challenges in meeting both the throughput and the energy efficiency goals for DNN training.


design, automation, and test in europe | 2018

Exploiting approximate computing for deep learning acceleration

Chia-Yu Chen; Jungwook Choi; Kailash Gopalakrishnan; Viji Srinivasan; Swagath Venkataramani


Archive | 2018

Bridging the Accuracy Gap for 2-bit Quantized Neural Networks (QNN).

Jungwook Choi; Pierce I-Jen Chuang; Zhuo Wang; Swagath Venkataramani; Vijayalakshmi Srinivasan; Kailash Gopalakrishnan

Collaboration


Dive into Swagath Venkataramani's collaborations.
