
Publication


Featured research published by Rangharajan Venkatesan.


International Conference on Computer-Aided Design | 2011

MACACO: modeling and analysis of circuits for approximate computing

Rangharajan Venkatesan; Amit Agarwal; Kaushik Roy; Anand Raghunathan

Approximate computing, which refers to a class of techniques that relax the requirement of exact equivalence between the specification and implementation of a computing system, has attracted significant interest in recent years. We propose a systematic methodology, called MACACO, for the Modeling and Analysis of Circuits for Approximate Computing. The proposed methodology can be utilized to analyze how an approximate circuit behaves with reference to a conventional correct implementation, by computing metrics such as worst-case error, average-case error, error probability, and error distribution. The methodology applies to both timing-induced approximations such as voltage over-scaling or over-clocking, and functional approximations based on logic complexity reduction. The first step in MACACO is the construction of an equivalent untimed circuit that represents the behavior of the approximate circuit at a given voltage and clock period. Next, we construct a virtual error circuit that represents the error in the approximate circuit's output for any given input or input sequence. Finally, we apply conventional Boolean analysis techniques (SAT solvers, BDDs) and statistical techniques (Monte-Carlo simulation) in order to compute the various metrics of interest. We have applied the proposed methodology to analyze a range of approximate designs for datapath building blocks. Our results show that MACACO can help a designer to systematically evaluate the impact of approximate circuits, and to choose between different approximate implementations, thereby facilitating the adoption of such circuits for approximate computing.
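The Monte-Carlo portion of this kind of analysis can be sketched in a few lines. The truncation-based approximate adder below is a hypothetical stand-in for a real approximate circuit (not one of the designs studied in the paper), used here only to show how worst-case error, average error, and error probability are estimated by simulation:

```python
import random

def exact_add(a, b):
    return a + b

def approx_add(a, b, k=2):
    # Hypothetical approximation: ignore the k least-significant bits
    # of both operands (a simple stand-in for logic-complexity reduction).
    mask = ~((1 << k) - 1)
    return (a & mask) + (b & mask)

def error_metrics(trials=100_000, bits=8, seed=0):
    # Monte-Carlo estimation of the error metrics over random inputs.
    rng = random.Random(seed)
    worst = 0
    total = 0
    errors = 0
    for _ in range(trials):
        a = rng.randrange(1 << bits)
        b = rng.randrange(1 << bits)
        e = abs(exact_add(a, b) - approx_add(a, b))
        worst = max(worst, e)
        total += e
        errors += e != 0
    return {"worst_case": worst,
            "average": total / trials,
            "error_prob": errors / trials}
```

For the exhaustive guarantees MACACO targets (true worst-case error), the paper replaces this sampling step with SAT/BDD analysis of the virtual error circuit.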


Design, Automation and Test in Europe | 2013

DWM-TAPESTRI - an energy efficient all-spin cache using domain wall shift based writes

Rangharajan Venkatesan; Mrigank Sharad; Kaushik Roy; Anand Raghunathan

Spin-based memories are promising candidates for future on-chip memories due to their high density, non-volatility, and very low leakage. However, the high energy and latency of write operations in these memories is a major challenge. In this work, we explore a new approach - shift based write - that offers a fast and energy-efficient alternative to performing writes in spin-based memories. We propose DWM-TAPESTRI, a new all-spin cache design that utilizes Domain Wall Memory (DWM) with shift based writes at all levels of the cache hierarchy. The proposed write scheme enables DWM to be used, for the first time, in L1 caches and in tag arrays, where the inefficiency of writes in spin memories has traditionally precluded their use. At the circuit level, we propose bit-cell designs utilizing shift-based writes, which are tailored to the differing requirements of different levels in the cache hierarchy. We also propose pre-shifting as an architectural technique to hide the latency of shift operations that is inherent to DWM. We performed a systematic device-circuit-architecture evaluation of the proposed design. Over a wide range of SPEC 2006 benchmarks, DWM-TAPESTRI achieves 8.2X improvement in energy and 4X improvement in area, with virtually identical performance, compared to an iso-capacity SRAM cache. Compared to an iso-capacity STT-MRAM cache, the proposed design achieves around 1.6X improvement in both area and energy under iso-performance conditions.
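The latency cost inherent to domain wall memory, and the pre-shifting technique used to hide it, can be illustrated with a toy tape model (the class and its parameters are illustrative, not the paper's circuit-level design):

```python
class DWMTape:
    """Toy model of a domain-wall 'tape': bits sit under a fixed
    read/write head, so accessing a bit costs a number of shift
    operations proportional to its distance from the head."""

    def __init__(self, length=64):
        self.length = length
        self.head = 0  # current head position on the tape

    def access(self, pos):
        shifts = abs(pos - self.head)  # latency in shift operations
        self.head = pos
        return shifts

    def pre_shift(self, predicted_pos):
        # Architectural pre-shifting: move the head toward the predicted
        # next access while the tape is idle, hiding the shift latency.
        self.head = predicted_pos
```

If the predictor is right, a pre-shifted access completes with zero visible shift latency; a misprediction just pays the usual distance-dependent cost.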


International Symposium on Computer Architecture | 2014

STAG: spintronic-tape architecture for GPGPU cache hierarchies

Rangharajan Venkatesan; Shankar Ganesh Ramasubramanian; Swagath Venkataramani; Kaushik Roy; Anand Raghunathan

General-purpose Graphics Processing Units (GPGPUs) are widely used for executing massively parallel workloads from various application domains. Feeding data to the hundreds to thousands of cores that current GPGPUs integrate places great demands on the memory hierarchy, fueling an ever-increasing demand for on-chip memory. In this work, we propose STAG, a high density, energy-efficient GPGPU cache hierarchy design using a new spintronic memory technology called Domain Wall Memory (DWM). DWMs inherently offer unprecedented benefits in density by storing multiple bits in the domains of a ferromagnetic nanowire, which logically resembles a bit-serial tape. However, this structure also leads to a unique challenge that the bits must be sequentially accessed by performing “shift” operations, resulting in variable and potentially higher access latencies. To address this challenge, STAG utilizes a number of architectural techniques: (i) a hybrid cache organization that employs different DWM bit-cells to realize the different memory arrays within the GPGPU cache hierarchy, (ii) a clustered, bit-interleaved organization, in which the bits in a cache block are spread across a cluster of DWM tapes, allowing parallel access, (iii) tape head management policies that predictively configure DWM arrays to reduce the expected number of shift operations for subsequent accesses, and (iv) a shift aware promotion buffer (SaPB), in which accesses to the DWM cache are predicted based on intra-warp locality, and locations that would incur a large shift penalty are promoted to a smaller buffer. Over a wide range of benchmarks from the Rodinia, ISPASS and Parboil suites, STAG achieves significant benefits in performance (12.1% over SRAM and 5.8% over STT-MRAM) and energy (3.3X over SRAM and 2.6X over STT-MRAM).
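The clustered, bit-interleaved organization of technique (ii) can be sketched as follows. Spreading a block's bits round-robin across a cluster of tapes lets every tape shift once and deliver part of the block in parallel, shrinking the worst-case shift distance by the cluster factor (this is a simplified, hypothetical layout, not STAG's exact mapping):

```python
def interleave(block_bits, n_tapes):
    # Spread the bits of one cache block round-robin across a cluster
    # of tapes; each tape ends up holding len(block_bits)/n_tapes bits.
    tapes = [[] for _ in range(n_tapes)]
    for i, bit in enumerate(block_bits):
        tapes[i % n_tapes].append(bit)
    return tapes

def read_parallel(tapes, pos):
    # One shift to position `pos` on every tape yields n_tapes
    # consecutive bits of the block in a single parallel access.
    return [tape[pos] for tape in tapes]
```

With 16 bits over 4 tapes, each tape holds only 4 positions, so the worst-case shift distance drops from 15 to 3 while 4 bits arrive per access.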


International Symposium on Computer Architecture | 2017

SCNN: An Accelerator for Compressed-sparse Convolutional Neural Networks

Angshuman Parashar; Minsoo Rhu; Anurag Mukkara; Antonio Puglielli; Rangharajan Venkatesan; Brucek Khailany; Joel S. Emer; Stephen W. Keckler; William J. Dally

Convolutional Neural Networks (CNNs) have emerged as a fundamental technology for machine learning. High performance and extreme energy efficiency are critical for deployments of CNNs, especially in mobile platforms such as autonomous vehicles, cameras, and electronic personal assistants. This paper introduces the Sparse CNN (SCNN) accelerator architecture, which improves performance and energy efficiency by exploiting the zero-valued weights that stem from network pruning during training and zero-valued activations that arise from the common ReLU operator. Specifically, SCNN employs a novel dataflow that enables maintaining the sparse weights and activations in a compressed encoding, which eliminates unnecessary data transfers and reduces storage requirements. Furthermore, the SCNN dataflow facilitates efficient delivery of those weights and activations to a multiplier array, where they are extensively reused; product accumulation is performed in a novel accumulator array. On contemporary neural networks, SCNN can improve both performance and energy by a factor of 2.7× and 2.3×, respectively, over a comparably provisioned dense CNN accelerator.
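The core idea of the dataflow — forming products only between nonzero weights and nonzero activations, then scattering them to output coordinates — can be shown with a 1-D simplification (this sketch is illustrative; SCNN's actual dataflow operates on 2-D planar tiles with a hardware accumulator array):

```python
def compress(vec):
    # Keep only nonzero entries as (index, value) pairs,
    # i.e. a compressed-sparse encoding of one operand.
    return [(i, v) for i, v in enumerate(vec) if v != 0]

def sparse_conv1d(weights, acts, out_len):
    # All-pairs products of nonzero weights and nonzero activations,
    # scattered into an accumulator by output coordinate. Zero-valued
    # operands never reach the multiplier loop at all.
    out = [0] * out_len
    for wi, wv in compress(weights):
        for ai, av in compress(acts):
            pos = ai - wi          # 'valid'-style output coordinate
            if 0 <= pos < out_len:
                out[pos] += wv * av
    return out
```

The inner loops run once per nonzero-times-nonzero pair, so compute scales with the density of both operands rather than with their full sizes.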


International Symposium on Low Power Electronics and Design | 2014

SPINDLE: SPINtronic deep learning engine for large-scale neuromorphic computing

Shankar Ganesh Ramasubramanian; Rangharajan Venkatesan; Mrigank Sharad; Kaushik Roy; Anand Raghunathan

Deep Learning Networks (DLNs) are bio-inspired large-scale neural networks that are widely used in emerging vision, analytics, and search applications. The high computation and storage requirements of DLNs have led to the exploration of various avenues for their efficient realization. Concurrently, the ability of emerging post-CMOS devices to efficiently mimic neurons and synapses has led to great interest in their use for neuromorphic computing. We describe SPINDLE, a programmable processor for deep learning based on spintronic devices. SPINDLE exploits the unique ability of spintronic devices to realize highly dense and energy-efficient neurons and memory, which form the fundamental building blocks of DLNs. SPINDLE consists of a three-tier hierarchy of processing elements to capture the nested parallelism present in DLNs, and a two-level memory hierarchy to facilitate data reuse. It can be programmed to execute DLNs with widely varying topologies for different applications. SPINDLE employs techniques to limit the overheads of spin-to-charge conversion, and utilizes output and weight quantization to enhance the efficiency of spin-neurons. We evaluate SPINDLE using a device-to-architecture modeling framework and a set of widely used DLN applications (handwriting recognition, face detection, and object recognition). Our results indicate that SPINDLE achieves 14.4X reduction in energy consumption and 20.4X reduction in EDP over the CMOS baseline under iso-area conditions.
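The weight and output quantization used to enhance spin-neuron efficiency can be illustrated with a generic uniform quantizer; the bit-widths and range below are illustrative placeholders, not the paper's calibrated parameters:

```python
def quantize(x, bits, x_max=1.0):
    # Uniform symmetric quantization of x to `bits` bits over
    # [-x_max, x_max]: snap to the nearest of 2**(bits-1) - 1 levels
    # per side, clipping values outside the range.
    levels = (1 << (bits - 1)) - 1   # e.g. 127 for 8 bits
    step = x_max / levels
    q = max(-levels, min(levels, round(x / step)))
    return q * step
```

Coarser quantization shrinks the precision a spin-neuron must resolve, trading a bounded rounding error for lower conversion and storage cost.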


Proceedings of the IEEE | 2016

Spin-Transfer Torque Memories: Devices, Circuits, and Systems

Xuanyao Fong; Yusung Kim; Rangharajan Venkatesan; Sri Harsha Choday; Anand Raghunathan; Kaushik Roy

Spin-transfer torque magnetic memory (STT-MRAM) has gained significant research interest due to its nonvolatility and zero standby leakage, near unlimited endurance, excellent integration density, acceptable read and write performance, and compatibility with CMOS process technology. However, several obstacles need to be overcome for STT-MRAM to become the universal memory technology. This paper first reviews the fundamentals of STT-MRAM and discusses key experimental breakthroughs. The state of the art in STT-MRAM is then discussed, beginning with the device design concepts and challenges. The corresponding bit-cell design solutions are also presented, followed by the STT-MRAM cache architectures suitable for on-chip applications.


International Symposium on Low Power Electronics and Design | 2013

Multi-level magnetic RAM using domain wall shift for energy-efficient, high-density caches

Mrigank Sharad; Rangharajan Venkatesan; Anand Raghunathan; Kaushik Roy

Spin-based devices promise to revolutionize computing platforms by enabling high-density, low-leakage memories. However, stringent tradeoffs between critical design metrics such as read and write stability, reliability, density, performance and energy-efficiency limit the efficiency of conventional spin-transfer-torque devices and bit-cells. We propose a new multi-level cell design with domain wall magnets (DWM-MLC) that significantly improves upon the read/write performance, density, and write energy consumption of conventional spin memories. The fundamental design tradeoff between read and write operations is addressed in DWM-MLC by decoupling the read and write paths, thereby allowing separate optimization for reads and writes. A thicker tunneling oxide is used for higher readability, while a domain-wall-shift (DWS) based write mechanism is used to improve write speed and energy. The storage of multiple bits per cell and the ability to use smaller transistors lead to a net improvement in density compared to conventional spin memories. We perform a systematic evaluation of DWM-MLC at different levels of design abstraction. At the circuit level, DWM-MLC achieves 2X improvement in density, read energy and read latency over its 1-bit counterpart. We evaluate an “all-spin” cache hierarchy that uses DWM-MLC for both L1 and L2, resulting in 4.4X (1.7X) area improvement and 10X (2X) energy reduction at iso-performance over SRAM (STT-MRAM).
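A multi-level cell with shift-based writes can be modeled abstractly as follows; the class, its level encoding, and its shift cost are illustrative abstractions, not the paper's device model:

```python
class MLCell:
    """Toy multi-level cell: k bits are stored as one of 2**k
    domain-wall positions (levels), so one cell replaces k
    single-bit cells. A write is a shift from the current level
    to the target level, costing |delta| shift steps."""

    def __init__(self, k=2):
        self.k = k
        self.level = 0  # current domain-wall position

    def write(self, value):
        assert 0 <= value < (1 << self.k), "value exceeds cell capacity"
        shifts = abs(value - self.level)
        self.level = value
        return shifts

    def read(self):
        return self.level
```

Storing k bits per cell is where the density gain over 1-bit spin cells comes from; the shift-based write is what keeps the multi-level write fast and low-energy.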


Design, Automation and Test in Europe | 2015

Spintastic: spin-based stochastic logic for energy-efficient computing

Rangharajan Venkatesan; Swagath Venkataramani; Xuanyao Fong; Kaushik Roy; Anand Raghunathan

Spintronics is one of the leading technologies under consideration for the post-CMOS era. While spintronic memories have demonstrated great promise due to their density, non-volatility and low leakage, efforts to realize spintronic logic have been much less fruitful. Recent studies project the performance and energy efficiency of spintronic logic to be considerably inferior to CMOS. In this work, we explore Stochastic Computing (SC) as a new direction for the realization of energy-efficient logic using spintronic devices. We establish the synergy between stochastic computing and spintronics by demonstrating that (i) the peripheral circuits required for SC to convert to/from stochastic domains, which incur significant energy overheads in CMOS, can be efficiently realized by exploiting the characteristics of spintronic devices, and (ii) the low logic complexity and fine-grained parallelism in SC circuits can be leveraged to alleviate the shortcomings of spintronic logic. We propose Spintastic, a new design approach in which all the components of stochastic circuits - stochastic number generators, stochastic arithmetic units, and stochastic-to-binary converters - are realized using spintronic devices. Our experiments on a range of benchmarks from different application domains demonstrate that Spintastic achieves 2.8X improvement in energy over CMOS stochastic implementations and 1.9X over a CMOS binary baseline.
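The arithmetic side of stochastic computing is strikingly simple: a value p in [0, 1] is encoded as a random bitstream whose bits are 1 with probability p, and multiplication reduces to a bitwise AND of two independent streams. The sketch below shows only this arithmetic in software; in Spintastic, the stream generation and stochastic-to-binary conversion are what get mapped onto spintronic devices:

```python
import random

def to_stream(p, n, rng):
    # Encode probability p as a length-n random bitstream (1 w.p. p);
    # this plays the role of a stochastic number generator.
    return [1 if rng.random() < p else 0 for _ in range(n)]

def sc_multiply(p, q, n=10_000, seed=0):
    # Stochastic multiplication: ANDing two independent streams that
    # encode p and q yields a stream encoding p*q; counting its ones
    # converts the result back to binary.
    rng = random.Random(seed)
    a = to_stream(p, n, rng)
    b = to_stream(q, n, rng)
    return sum(x & y for x, y in zip(a, b)) / n
```

The result is only an estimate whose accuracy grows with stream length n, which is exactly the low-complexity, error-tolerant tradeoff SC exploits.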


International Symposium on Nanoscale Architectures | 2011

Energy efficient many-core processor for recognition and mining using spin-based memory

Rangharajan Venkatesan; Vinay K. Chippa; Charles Augustine; Kaushik Roy; Anand Raghunathan

Emerging workloads such as Recognition, Mining and Synthesis present great opportunities for many-core parallel computing, but also place significant demands on the memory system. Spin-based devices have shown great promise in enabling high-density, energy-efficient memory. In this paper, we present the design and evaluation of a many-core domain-specific processor for Recognition and Data Mining (RM) using spin-based memory. The RM processor has a two-level on-chip memory hierarchy consisting of a streaming access first-level memory and a random access second-level memory. Based on the memory access characteristics, we suggest the use of Domain Wall Memory (DWM) and Spin Transfer Torque Magnetic RAM (STT-MRAM) to realize the first and second levels, respectively. We develop architectural models of DWM and STT-MRAM, and use them to evaluate the proposed design and explore various architectural tradeoffs in the RM processor. We evaluate the proposed design by comparing it to a CMOS based design at the same 45nm technology node. For three representative RM algorithms (Support Vector Machines, k-means clustering, and GLVQ classification), the iso-area spin-memory-based design achieves an energy-delay product improvement of 1.5X-3X. Our results suggest that spin-based memory technologies can enable significant improvements in energy efficiency and performance for highly parallel, data-intensive workloads.


IEEE Transactions on Magnetics | 2014

Non-Volatile Complementary Polarizer Spin-Transfer Torque On-Chip Caches: A Device/Circuit/Systems Perspective

Xuanyao Fong; Rangharajan Venkatesan; Anand Raghunathan; Kaushik Roy

In this paper, we propose a new spin-transfer torque magnetic random access memory (STT-MRAM) bit-cell structure (with complementary polarizers) that is suitable for on-chip caches. Our proposed structure requires a lower average critical write current than standard STT-MRAM, with improved write-ability, readability, and reliability. A cache array based on our proposed structure is studied using a device/circuit simulation framework, which we developed for this paper. Simulation results show that at the bit-cell level, our proposed structure can achieve subnanosecond sensing delay and lower read disturb torque using a self-referenced differential READ operation. Sensing and disturb margins of our proposed cell are 1.8× and 2.4× better than standard STT-MRAM, respectively. Furthermore, near disturb-free READ operation at ≥1.5 GHz is achieved using a latch-based sense amplifier and verified in circuit simulations. In addition, content addressable memory may also be efficiently implemented using complementary polarizer spin-transfer torque (CPSTT). Transient SPICE simulations show that CPSTT may be suitable for L1 cache, with a read energy of 14 fJ/bit. System level simulation shows that a CPSTT-based L2 cache can achieve ~9% lower energy consumption and >9% improvement in instructions per cycle over a standard STT-MRAM-based cache.
