
Publication


Featured research published by Angshuman Parashar.


High-Performance Computer Architecture | 2011

HAsim: FPGA-based high-detail multicore simulation using time-division multiplexing

Michael Pellauer; Michael Adler; Michel A. Kinsy; Angshuman Parashar; Joel S. Emer

In this paper we present the HAsim FPGA-accelerated simulator. HAsim is able to model a shared-memory multicore system including detailed core pipelines, cache hierarchy, and on-chip network, using a single FPGA. We describe the scaling techniques that make this possible, including novel uses of time-multiplexing in the core pipeline and on-chip network. We compare our time-multiplexed approach to a direct implementation, and present a case study that motivates why high-detail simulations should continue to play a role in the architectural exploration process.
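The core scaling idea above, time-division multiplexing, can be illustrated with a small sketch (this is an assumption-laden toy model, not HAsim's actual implementation): one physical model pipeline is shared across N virtual cores by rotating per-core state through it round-robin, so adding cores costs state storage rather than duplicated logic.

```python
# Toy sketch of time-division multiplexing a core model: the pipeline
# logic exists once; only per-core architectural state is replicated.

class TimeMultiplexedSim:
    def __init__(self, num_cores):
        # One state record per virtual core (names are illustrative).
        self.state = [{"pc": 0, "cycle": 0} for _ in range(num_cores)]
        self.next_core = 0  # round-robin pointer

    def step(self):
        """One physical cycle: advance exactly one virtual core."""
        core = self.state[self.next_core]
        core["pc"] += 4          # stand-in for real pipeline work
        core["cycle"] += 1       # this core's simulated cycle count
        self.next_core = (self.next_core + 1) % len(self.state)

sim = TimeMultiplexedSim(num_cores=4)
for _ in range(8):               # 8 physical cycles...
    sim.step()
# ...advance each of the 4 virtual cores by 2 simulated cycles.
print([c["cycle"] for c in sim.state])  # [2, 2, 2, 2]
```

The physical-to-simulated cycle ratio degrades gracefully as cores are added, which is what makes single-FPGA multicore modeling feasible.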


Field-Programmable Gate Arrays | 2011

Leap scratchpads: automatic memory and cache management for reconfigurable logic

Michael Adler; Kermin Fleming; Angshuman Parashar; Michael Pellauer; Joel S. Emer

Developers accelerating applications on FPGAs or other reconfigurable logic have nothing but raw memory devices in their standard toolkits. Each project typically includes tedious development of single-use memory management. Software developers expect a programming environment to include automatic memory management. Virtual memory provides the illusion of very large arrays and processor caches reduce access latency without explicit programmer instructions. LEAP scratchpads for reconfigurable logic dynamically allocate and manage multiple, independent, memory arrays in a large backing store. Scratchpad accesses are cached automatically in multiple levels, ranging from shared on-board, RAM-based, set-associative caches to private caches stored in FPGA RAM blocks. In the LEAP framework, scratchpads share the same interface as on-die RAM blocks and are plug-in replacements. Additional libraries support heap management within a storage set. Like software developers, accelerator authors using scratchpads may focus more on core algorithms and less on memory management.
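The scratchpad abstraction can be sketched as follows (hypothetical class and method names, not the LEAP API): an object that presents the same read/write interface as an on-die RAM block, but is backed by a much larger store with a small automatically managed cache in front of it.

```python
# Illustrative scratchpad sketch: RAM-block interface, large backing
# store, and a tiny automatically managed cache (FIFO eviction here;
# the real design uses multi-level set-associative caches).

class Scratchpad:
    def __init__(self, size, cache_lines=4):
        self.backing = [0] * size            # large backing store
        self.cache = {}                      # addr -> cached value
        self.cache_lines = cache_lines
        self.hits = self.misses = 0

    def read(self, addr):
        if addr in self.cache:
            self.hits += 1
        else:
            self.misses += 1
            if len(self.cache) >= self.cache_lines:
                self.cache.pop(next(iter(self.cache)))  # evict oldest
            self.cache[addr] = self.backing[addr]       # fill
        return self.cache[addr]

    def write(self, addr, value):
        self.backing[addr] = value           # write-through
        if addr in self.cache:
            self.cache[addr] = value

sp = Scratchpad(size=1 << 16)   # far larger than an on-die RAM block
sp.write(1000, 42)
assert sp.read(1000) == 42      # miss: fills the cache
assert sp.read(1000) == 42      # hit: served from the cache
print(sp.hits, sp.misses)       # 1 1
```

Because reads and writes look identical to RAM-block accesses, a design can swap a scratchpad in as a plug-in replacement when its working set outgrows on-die memory.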


International Symposium on Computer Architecture | 2013

Triggered instructions: a control paradigm for spatially-programmed architectures

Angshuman Parashar; Michael Pellauer; Michael Adler; Bushra Ahsan; Neal Clayton Crago; Daniel Lustig; Vladimir Pavlov; Antonia Zhai; Mohit Gambhir; Aamer Jaleel; Randy L. Allmon; Rachid Rayess; Stephen Maresh; Joel S. Emer

In this paper, we present triggered instructions, a novel control paradigm for arrays of processing elements (PEs) aimed at exploiting spatial parallelism. Triggered instructions completely eliminate the program counter and allow programs to transition concisely between states without explicit branch instructions. They also allow efficient reactivity to inter-PE communication traffic. The approach provides a unified mechanism to avoid over-serialized execution, essentially achieving the effect of techniques such as dynamic instruction reordering and multithreading, which each require distinct hardware mechanisms in a traditional sequential architecture. Our analysis shows that a triggered-instruction based spatial accelerator can achieve 8X greater area-normalized performance than a traditional general-purpose processor. Further analysis shows that triggered control reduces the number of static and dynamic instructions in the critical paths by 62% and 64% respectively over a program-counter style spatial baseline, resulting in a speedup of 2.0X.
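The control paradigm can be sketched in a few lines (a hypothetical encoding, not the paper's ISA): each instruction carries a trigger, a predicate over predicate-register state and input-channel status, and fires whenever its trigger holds. There is no program counter; state transitions happen by flipping predicate bits rather than by branching.

```python
# Toy triggered-instruction PE: a scheduler repeatedly fires any
# instruction whose trigger predicate is currently satisfied.

from collections import deque

def run_pe(instructions, in_chan, preds, max_steps=100):
    out = []
    for _ in range(max_steps):
        # No PC: select among instructions by trigger readiness.
        ready = [i for i in instructions if i["trigger"](preds, in_chan)]
        if not ready:
            break                # nothing can fire; PE is idle
        ready[0]["action"](preds, in_chan, out)
    return out

# Two "states" expressed via a predicate bit, with no branch
# instructions: while the input channel has data, emit the doubled
# value; once it drains, emit a sentinel and stop.
prog = [
    {"trigger": lambda p, c: p["running"] and len(c) > 0,
     "action":  lambda p, c, o: o.append(2 * c.popleft())},
    {"trigger": lambda p, c: p["running"] and len(c) == 0,
     "action":  lambda p, c, o: (o.append(-1),
                                 p.__setitem__("running", False))},
]

result = run_pe(prog, deque([1, 2, 3]), {"running": True})
print(result)  # [2, 4, 6, -1]
```

Note how reactivity to channel occupancy is built into the triggers themselves, which is what lets a PE respond to inter-PE traffic without polling or branch sequences.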


International Symposium on Computer Architecture | 2017

SCNN: An Accelerator for Compressed-sparse Convolutional Neural Networks

Angshuman Parashar; Minsoo Rhu; Anurag Mukkara; Antonio Puglielli; Rangharajan Venkatesan; Brucek Khailany; Joel S. Emer; Stephen W. Keckler; William J. Dally

Convolutional Neural Networks (CNNs) have emerged as a fundamental technology for machine learning. High performance and extreme energy efficiency are critical for deployments of CNNs, especially in mobile platforms such as autonomous vehicles, cameras, and electronic personal assistants. This paper introduces the Sparse CNN (SCNN) accelerator architecture, which improves performance and energy efficiency by exploiting the zero-valued weights that stem from network pruning during training and zero-valued activations that arise from the common ReLU operator. Specifically, SCNN employs a novel dataflow that enables maintaining the sparse weights and activations in a compressed encoding, which eliminates unnecessary data transfers and reduces storage requirements. Furthermore, the SCNN dataflow facilitates efficient delivery of those weights and activations to a multiplier array, where they are extensively reused; product accumulation is performed in a novel accumulator array. On contemporary neural networks, SCNN can improve both performance and energy by a factor of 2.7× and 2.3×, respectively, over a comparably provisioned dense CNN accelerator.
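The dataflow idea can be illustrated on a 1-D convolution (a deliberate simplification; the real architecture operates on full CNN tensors with a multiplier and accumulator array): only nonzero weights and activations are stored, every nonzero activation-weight pair is multiplied, and each product is scatter-accumulated into the output position its coordinates imply, so zeros never consume a multiply or a transfer.

```python
# Sketch of sparse convolution in the SCNN spirit: all-pairs multiply
# of compressed nonzeros, then coordinate-based scatter-accumulation.

def sparse_conv1d(activations, weights):
    # Compressed encodings: (index, value) pairs for nonzeros only.
    nz_act = [(i, a) for i, a in enumerate(activations) if a != 0]
    nz_wgt = [(j, w) for j, w in enumerate(weights) if w != 0]

    out = [0] * (len(activations) - len(weights) + 1)  # "valid" conv
    for i, a in nz_act:
        for j, w in nz_wgt:
            pos = i - j                  # output coordinate of a * w
            if 0 <= pos < len(out):
                out[pos] += a * w        # scatter-accumulate
    return out

acts = [0, 3, 0, 0, 5, 0, 0, 2]   # sparse activations (post-ReLU)
wgts = [1, 0, 2]                  # sparse weights (post-pruning)
print(sparse_conv1d(acts, wgts))  # [0, 3, 10, 0, 5, 4]
```

Here only 6 multiplies run instead of the 18 a dense loop would issue, mirroring how weight and activation sparsity compound in the accelerator.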


Field-Programmable Gate Arrays | 2012

Leveraging latency-insensitivity to ease multiple FPGA design

Kermin Fleming; Michael Adler; Michael Pellauer; Angshuman Parashar; Arvind Mithal; Joel S. Emer

Traditionally, hardware designs partitioned across multiple FPGAs have had low performance due to the inefficiency of maintaining cycle-by-cycle timing among discrete FPGAs. In this paper, we present a mechanism by which complex designs may be efficiently and automatically partitioned among multiple FPGAs using explicitly programmed latency-insensitive links. We describe the automatic synthesis of an area efficient, high performance network for routing these inter-FPGA links. By mapping a diverse set of large research prototypes onto a multiple FPGA platform, we demonstrate that our tool obtains significant gains in design feasibility, compilation time, and even wall-clock performance.
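The key property of a latency-insensitive link can be sketched with a toy FIFO channel (hypothetical classes, not the authors' tool): the consumer stalls until data is valid, so the partitioned design remains functionally correct no matter how many cycles the inter-FPGA network inserts.

```python
# Toy latency-insensitive channel: a FIFO whose delivery latency can
# vary freely without changing what the consumer ultimately observes.

from collections import deque

class Channel:
    def __init__(self, latency):
        self.latency = latency           # cycles added by the network
        self.in_flight = deque()         # (arrival_cycle, value)

    def send(self, now, value):
        self.in_flight.append((now + self.latency, value))

    def recv(self, now):
        """Return a value once it has arrived, else None (not ready)."""
        if self.in_flight and self.in_flight[0][0] <= now:
            return self.in_flight.popleft()[1]
        return None

def run(latency):
    ch = Channel(latency)
    received = []
    for cycle in range(20):
        if cycle < 3:
            ch.send(cycle, cycle * 10)   # producer partition
        v = ch.recv(cycle)               # consumer stalls until valid
        if v is not None:
            received.append(v)
    return received

# Identical results regardless of link latency: only timing shifts.
assert run(1) == run(7) == [0, 10, 20]
```

This decoupling is what allows a partitioning tool to place links across FPGA boundaries automatically without re-verifying cycle-by-cycle timing.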


IEEE Micro | 2014

Efficient Spatial Processing Element Control via Triggered Instructions

Angshuman Parashar; Michael Pellauer; Michael Adler; Bushra Ahsan; Neal Clayton Crago; Daniel Lustig; Vladimir Pavlov; Antonia Zhai; Mohit Gambhir; Aamer Jaleel; Randy L. Allmon; Rachid Rayess; Stephen Maresh; Joel S. Emer

In this article, the authors present triggered instructions, a novel control paradigm for arrays of processing elements (PEs) aimed at exploiting spatial parallelism. Triggered instructions completely eliminate the program counter and allow programs to transition concisely between states without explicit branch instructions. They also allow efficient reactivity to inter-PE communication traffic. The approach provides a unified mechanism to avoid overserialized execution, essentially achieving the effect of techniques such as dynamic instruction reordering and multithreading, which each require distinct hardware mechanisms in a traditional sequential architecture.


ACM Transactions on Computer Systems | 2015

Efficient Control and Communication Paradigms for Coarse-Grained Spatial Architectures

Michael Pellauer; Angshuman Parashar; Michael Adler; Bushra Ahsan; Randy L. Allmon; Neal Clayton Crago; Kermin Fleming; Mohit Gambhir; Aamer Jaleel; Tushar Krishna; Daniel Lustig; Stephen Maresh; Vladimir Pavlov; Rachid Rayess; Antonia Zhai; Joel S. Emer

There has been recent interest in exploring the acceleration of nonvectorizable workloads with spatially programmed architectures that are designed to efficiently exploit pipeline parallelism. Such an architecture faces two main problems: how to efficiently control each processing element (PE) in the system, and how to facilitate inter-PE communication without the overheads of traditional shared-memory coherent memory. In this article, we explore solving these problems using triggered instructions and latency-insensitive channels. Triggered instructions completely eliminate the program counter (PC) and allow programs to transition concisely between states without explicit branch instructions. Latency-insensitive channels allow efficient communication of inter-PE control information while simultaneously enabling flexible code placement and improving tolerance for variable events such as cache accesses. Together, these approaches provide a unified mechanism to avoid overserialized execution, essentially achieving the effect of techniques such as dynamic instruction reordering and multithreading. Our analysis shows that a spatial accelerator using triggered instructions and latency-insensitive channels can achieve 8× greater area-normalized performance than a traditional general-purpose processor. Further analysis shows that triggered control reduces the number of static and dynamic instructions in the critical paths by 62% and 64%, respectively, over a PC-style baseline, increasing the performance of the spatial programming approach by 2.0×.


Archive | 2010

Accelerating Simulation with FPGAs

Michael Pellauer; Michael Adler; Angshuman Parashar; Joel S. Emer

This chapter presents an approach to accelerating processor simulation using FPGAs. This is distinguished from traditional uses of FPGAs, and the increased development effort from using FPGAs is discussed. Techniques are presented to efficiently control a highly distributed simulation on an FPGA. Time-multiplexing the simulator is presented in order to simulate numerous virtual cores while improving utilization of functional units.


Archive | 2013

Distributed Memory Operations

Bushra Ahsan; Michael Adler; Neal Clayton Crago; Joel S. Emer; Aamer Jaleel; Angshuman Parashar; Michael Pellauer


Archive | 2011

Method for determining instruction order using triggers

Angshuman Parashar; Michael Pellauer; Michael Adler; Joel S. Emer

Collaboration


Dive into Angshuman Parashar's collaborations.

Top Co-Authors

Michael Pellauer
Massachusetts Institute of Technology

Antonia Zhai
University of Minnesota