Aviral Shrivastava
Arizona State University
Publications
Featured research published by Aviral Shrivastava.
Design Automation Conference | 2012
Mahdi Hamzeh; Aviral Shrivastava; Sarma B. K. Vrudhula
Coarse-Grained Reconfigurable Architectures (CGRAs) are an attractive platform that promises simultaneously high performance and high power efficiency. One of the primary challenges in using CGRAs is to develop efficient compilers that can automatically and efficiently map applications to the CGRA. To this end, this paper makes several contributions: i) Using Re-computation for Resource Limitations: For the first time in CGRA compilers, we propose the use of re-computation as a solution to the resource limitation problem. This extends the solution space and enables better mappings. ii) General Problem Formulation: A precise and general formulation of the application mapping problem on a CGRA is presented, and its computational complexity is established. iii) Extracting an Efficient Heuristic: Using the insights from the problem formulation, we design an effective global heuristic called EPIMap. EPIMap transforms the input specification (a directed graph) into an epimorphic equivalent graph that satisfies the necessary conditions for mapping onto a CGRA, reducing the search space. Experimental results on 14 important kernels extracted from well-known benchmark programs show that using EPIMap can improve the performance of the kernels on CGRAs by more than 2.8X on average, as compared to one of the best existing mapping algorithms, EMS. EPIMap was able to achieve the theoretical best performance for 9 out of 14 benchmarks, while EMS could not achieve the theoretical best performance for any of the benchmarks. EPIMap achieves better mappings at an acceptable increase in compilation time.
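The re-computation idea above can be illustrated with a small sketch. This is an assumption for illustration only, not the EPIMap algorithm: when an operation's result feeds more consumers than a processing element can route, the operation is duplicated (re-computed) so that each copy stays within the fan-out limit. The function name and graph representation are hypothetical.

```python
# Illustrative sketch (not EPIMap): use re-computation to satisfy a
# fan-out limit. If a node's result feeds more consumers than
# `max_fanout`, duplicate (re-compute) the node so each copy serves at
# most `max_fanout` consumers.

def recompute_for_fanout(dfg, max_fanout):
    """dfg: dict mapping node -> list of consumer nodes.
    Returns a new graph in which every node has out-degree <= max_fanout,
    using duplicated nodes named '<node>_dupN'."""
    out = {}
    for node, consumers in dfg.items():
        if len(consumers) <= max_fanout:
            out[node] = list(consumers)
        else:
            # Split consumers into chunks; each chunk is served by a copy
            # of the node (the copy re-computes the same value).
            for i in range(0, len(consumers), max_fanout):
                copy = node if i == 0 else f"{node}_dup{i // max_fanout}"
                out[copy] = consumers[i:i + max_fanout]
    return out
```

For example, a node with three consumers and a fan-out limit of two is split into the original node plus one re-computed duplicate, enlarging the solution space at the cost of one extra operation.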
Compilers, Architecture, and Synthesis for Embedded Systems | 2006
Kyoungwoo Lee; Aviral Shrivastava; Ilya Issenin; Nikil D. Dutt; Nalini Venkatasubramanian
With advances in process technology, soft errors (SEs) are becoming an increasingly critical design concern. Due to their large area and high density, caches are hit worst by soft errors. Although Error Correction Code based mechanisms protect the data in caches, they have high performance and power overheads. Since multimedia applications are increasingly being used in mission-critical embedded systems where both reliability and energy are major concerns, there is a definite need to improve reliability in embedded systems without too much energy overhead. We observe that while a soft error in multimedia data may only result in a minor loss in QoS, a soft error in a variable that controls the execution flow of the program may be fatal. Consequently, we propose to partition the data space into failure-critical and failure-non-critical data, and provide a high degree of soft error protection only to the failure-critical data in Horizontally Partitioned Caches. Experimental results demonstrate that our selective data protection can achieve a failure rate close to that of a soft-error-protected cache system, while retaining performance and energy consumption similar to those of a traditional cache system, with some degradation in QoS. For example, for a conventional configuration as in the Intel XScale, our approach achieves the same failure rate while improving performance by 28% and reducing energy consumption by 29% in comparison with a soft-error-protected cache.
Compilers, Architecture, and Synthesis for Embedded Systems | 2005
Aviral Shrivastava; Ilya Issenin; Nikil D. Dutt
Horizontally partitioned data caches are a popular architectural feature in which the processor maintains two or more data caches at the same level of hierarchy. Horizontally partitioned caches help reduce cache pollution and thereby improve performance. Consequently, most previous research has focused on exploiting horizontally partitioned data caches to improve performance, achieving energy reduction only as a byproduct of performance improvement. In contrast, in this paper we show that optimizing for performance trades off several opportunities for energy reduction. Our experiments on an HP iPAQ h4300-like memory subsystem demonstrate that optimizing for energy consumption results in up to 127% less memory subsystem energy consumption than the performance-optimal solution. Furthermore, we show that the energy-optimal solution incurs on average only a 1.7% performance penalty. Therefore, with energy consumption becoming a first-class design constraint, there is a need for compilation techniques aimed at energy reduction. To achieve the aforementioned energy savings we propose and explore several low-complexity algorithms aimed at reducing energy consumption, and show that very simple greedy heuristics achieve 97% of the possible memory subsystem energy savings.
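A very simple greedy heuristic of the kind the abstract mentions can be sketched as follows. This is an assumed, simplified stand-in rather than the paper's actual algorithm: the most-accessed pages are mapped to the small, low-energy "mini" cache partition until its capacity is exhausted, and the rest go to the main cache. The function name, the capacity model (a fixed number of pages), and the per-access energy costs are all hypothetical.

```python
# Hedged sketch of a greedy page-to-partition heuristic for horizontally
# partitioned caches (illustrative; not the paper's exact algorithm).

def greedy_page_mapping(page_accesses, mini_capacity, e_mini, e_main):
    """page_accesses: dict page -> access count.
    mini_capacity: how many pages the low-energy mini cache can serve.
    e_mini, e_main: assumed energy per access (arbitrary units).
    Returns (mapping, total_energy): the hottest pages are placed in the
    mini cache, the rest in the main cache."""
    ranked = sorted(page_accesses.items(), key=lambda kv: -kv[1])
    mapping, total = {}, 0.0
    for i, (page, accesses) in enumerate(ranked):
        cache = "mini" if i < mini_capacity else "main"
        mapping[page] = cache
        total += accesses * (e_mini if cache == "mini" else e_main)
    return mapping, total
```

For instance, with pages accessed 100, 10, and 1 times, a one-page mini cache at half the per-access energy serves only the hottest page, which already captures most of the achievable savings; this is the kind of low-complexity heuristic whose results the paper compares against the energy-optimal mapping.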
Archive | 2010
Preeti Ranjan Panda; B. V. N. Silpa; Aviral Shrivastava; Krishnaiah Gummidipudi
This book addresses power optimization in modern electronic and computer systems. Several forces aligned in the past decade to drive contemporary computing in the direction of low power and energy-awareness: the mobile revolution took the world by storm; power budgets forced mainstream processor designers to abandon the quest for higher clock frequency; and large data centers with overwhelming power costs began to play vital roles in our daily lives. Power optimization was elevated to a first class design concern, forcing everyone from the process engineer, circuit designer, processor architect, software developer, system builder, and even data center maintainer to make conscious efforts to reduce power consumption using myriad techniques and tools. This book explores power optimization opportunities and their exploitation at various levels of abstraction. Fundamental power optimizations are covered at each level of abstraction, concluding in a case study illustrating the application of the major techniques to a graphics processor. This book covers a comprehensive range of disparate power optimizations and is designed to be accessible to students, researchers, and practitioners alike.
Asia and South Pacific Design Automation Conference | 2008
Jonghee W. Yoon; Aviral Shrivastava; Sang-Hyun Park; Minwook Ahn; Yunheung Paek
Recently, coarse-grained reconfigurable architectures (CGRAs) have drawn increasing attention due to their efficiency and flexibility. While many CGRAs have demonstrated impressive performance improvements, the effectiveness of CGRA platforms ultimately hinges on the compiler. Existing CGRA compilers do not model the details of the CGRA architecture, as a result of which they are i) unable to map some applications even though a mapping exists, and ii) prone to using too many PEs to map an application. In this paper, we model several CGRA details in our compiler and develop a graph-mapping-based approach (SPKM) for mapping applications onto CGRAs. On randomly generated graphs our technique can map on average 4.5X more applications than previous approaches, while using fewer CGRA rows in 62% of cases, without any penalty in mapping time. We observe similar results on a suite of benchmarks collected from the Livermore Loops, Multimedia, and DSPStone benchmarks.
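The core constraint such a CGRA mapper must respect can be made concrete with a small check. This is an illustrative simplification, not part of SPKM: in a single-cycle view of a 2D mesh CGRA, every data-flow edge must connect operations placed on the same or Manhattan-adjacent PEs. The function name and graph encoding are assumptions.

```python
# Illustrative validity check for a CGRA placement (simplified,
# single-cycle view of a 2D mesh; not the SPKM algorithm).

def is_valid_placement(edges, placement, rows, cols):
    """edges: list of (producer, consumer) operation pairs.
    placement: dict op -> (row, col) PE coordinate on a rows x cols mesh.
    Returns True iff every op sits on the mesh and every edge spans
    identical or Manhattan-adjacent PEs."""
    for u, v in edges:
        (r1, c1), (r2, c2) = placement[u], placement[v]
        if not (0 <= r1 < rows and 0 <= c1 < cols
                and 0 <= r2 < rows and 0 <= c2 < cols):
            return False  # op placed off the CGRA fabric
        if abs(r1 - r2) + abs(c1 - c2) > 1:
            return False  # producer cannot route to consumer directly
    return True
```

A compiler that models such interconnect details can reject a diagonal producer-consumer placement (distance 2 on the mesh) that a detail-unaware mapper would wrongly accept, which is exactly the failure mode the abstract describes.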
International Conference on Hardware/Software Codesign and System Synthesis | 2010
Ke Bai; Aviral Shrivastava
This paper presents a scheme to manage heap data in the local memory present in each core of a limited local memory (LLM) multi-core processor. While it is possible to manage heap data semi-automatically using a software cache, managing the heap data of a core through a software cache may require changing the code of the other threads. Cross-thread modifications are difficult to code and debug, and only become more difficult as we scale the number of cores. We propose a semi-automatic and scalable scheme for heap data management that hides this complexity in a library with a more natural programming interface. Furthermore, for embedded applications, where the maximum heap size can be known at compile time, we propose optimizations on the heap management to significantly improve application performance. Experiments on several benchmarks from MiBench executing on the Sony PlayStation 3 show that our scheme is easier to use, and that if the maximum size of heap data is known, our optimizations can improve application performance by an average of 14%.
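The library-style interface described above can be sketched as a toy model. All names here are hypothetical and the eviction policy is a deliberate simplification, not the paper's scheme: the local store holds a fixed number of heap blocks, allocation and access go through library calls, and blocks are transparently evicted to (and fetched back from) global memory behind that interface.

```python
# Toy model of a heap-management library for a limited-local-memory core
# (illustrative sketch; names and FIFO eviction policy are assumptions).

class LocalHeap:
    def __init__(self, local_slots):
        self.local_slots = local_slots
        self.local = []            # block ids resident in local memory
        self.global_store = set()  # block ids evicted to global memory
        self.next_id = 0

    def lm_malloc(self):
        """Allocate a heap block in local memory, evicting the oldest
        resident block to global memory if the local store is full."""
        if len(self.local) == self.local_slots:
            self.global_store.add(self.local.pop(0))
        bid = self.next_id
        self.next_id += 1
        self.local.append(bid)
        return bid

    def lm_access(self, bid):
        """Touch a block: if it was evicted, fetch it back into local
        memory (evicting the oldest resident block if necessary).
        Returns True iff the block is now resident locally."""
        if bid in self.global_store:
            if len(self.local) == self.local_slots:
                self.global_store.add(self.local.pop(0))
            self.global_store.discard(bid)
            self.local.append(bid)
        return bid in self.local
```

The point of the sketch is that the application only ever sees `lm_malloc` and `lm_access`; the eviction traffic to global memory stays inside the library, so no other thread's code needs to change as the number of cores grows.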
Application-Specific Systems, Architectures and Processors | 2010
Seung Chul Jung; Aviral Shrivastava; Ke Bai
This paper presents heuristics for dynamic management of application code on the limited local memories present in high-performance multi-core processors. Previous techniques formulate the problem using call graphs, which do not capture the temporal ordering of functions. In addition, they use only a conservative estimate of the interference cost between functions to obtain a mapping. As a result, previous techniques are unable to achieve efficient code mapping. The techniques proposed in this paper overcome both these limitations and achieve superior code mapping. Experimental results from executing MiBench benchmarks on the Cell processor in the Sony PlayStation 3 demonstrate up to 29% and on average 12% performance improvement, at tolerable compile-time overhead.
Design Automation Conference | 2011
Yooseong Kim; Aviral Shrivastava
The CUDA programming model provides a simple interface for programming GPUs, but tuning GPGPU applications for high performance is still quite challenging. Programmers need to consider several architectural details, and small changes in source code, especially in memory access patterns, affect performance significantly. This paper presents CuMAPz, a tool to compare the memory performance of CUDA programs. CuMAPz can help programmers explore different ways of using shared and global memories, and optimize their program for memory behavior. CuMAPz models several memory effects, e.g., data reuse, global memory access coalescing, shared memory bank conflicts, channel skew, and branch divergence. By using CuMAPz to explore the memory access design space, we could improve the performance of our benchmarks by 62% over the naive cases, and 32% over a previous approach [8].
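One of the effects listed above, shared memory bank conflicts, can be modeled with a few lines. This is a textbook simplification offered for illustration, not CuMAPz's actual model: on a GPU whose shared memory is divided into word-wide banks, the accesses of one warp are serialized by the maximum number of threads that hit the same bank.

```python
# Toy bank-conflict model (illustrative; not CuMAPz's internal model).
from collections import Counter

def bank_conflict_degree(word_indices, n_banks=32):
    """word_indices: the shared-memory word index accessed by each thread
    in a warp. Returns the serialization factor: 1 means conflict-free,
    k means the access is replayed k times."""
    counts = Counter(idx % n_banks for idx in word_indices)
    return max(counts.values())
```

A stride-1 access pattern touches 32 distinct banks (factor 1), while a stride-32 pattern sends all 32 threads to bank 0 (factor 32); this is exactly the kind of source-level change the abstract says can swing performance dramatically.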
IEEE International Conference on High Performance Computing, Data, and Analytics | 2008
Amit Pabalkar; Aviral Shrivastava; Arun Kannan; Jongeun Lee
Many programmable embedded systems feature low-power processors coupled with fast, compiler-controlled on-chip scratchpad memories (SPMs) to reduce their energy consumption. SPMs are more efficient than caches in terms of energy consumption, performance, area, and timing predictability. However, unlike caches, SPMs need explicit management by software, the quality of which can impact the performance of SPM-based systems. In this paper, we present a fully automated, dynamic code overlaying technique for SPMs based on pure static analysis. Static analysis is less restrictive than profiling and can be easily extended to a general compiler framework where the time-consuming and expensive task of profiling may not be feasible. The SPM code mapping problem is harder than the bin packing problem, which is NP-complete. Therefore we formulate SPM code mapping as a binary integer linear programming problem and also propose a heuristic, determining simultaneously the region (bin) sizes as well as the function-to-region mapping. To the best of our knowledge, this is the first heuristic which simultaneously solves the interdependent problems of region size determination and function-to-region mapping. We evaluate our approach for a set of MiBench applications on a horizontally split I-cache and SPM architecture (HSA). Compared to a cache-only architecture (COA), the HSA gives an average energy reduction of 35%, with minimal performance degradation. For the HSA, we also compare the energy results from our proposed SDRM heuristic against a previous static-analysis-based mapping heuristic and observe an average 27% energy reduction.
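The joint region-sizing and function-to-region mapping problem can be sketched with a small greedy packer. This is a hedged illustration, not the SDRM heuristic: each overlay region must be as large as its largest function, the summed region sizes must fit the SPM, and each merge greedily picks the pair of regions whose union has the smallest resulting size (a crude stand-in for the interference cost a real heuristic would use).

```python
# Hedged greedy sketch of simultaneous region sizing and
# function-to-region mapping for an SPM overlay (not the SDRM heuristic).

def pack_regions(func_sizes, spm_budget):
    """func_sizes: dict function name -> code size.
    Returns a list of regions (lists of function names) whose max-sizes
    sum to <= spm_budget, or None if no single region fits."""
    regions = [[f] for f in func_sizes]
    size = lambda r: max(func_sizes[f] for f in r)
    while sum(size(r) for r in regions) > spm_budget:
        if len(regions) == 1:
            return None  # even one all-encompassing region overflows
        # Merge the pair of regions whose union is smallest; functions in
        # one region overlay (replace) each other at run time.
        i, j = min(((i, j) for i in range(len(regions))
                    for j in range(i + 1, len(regions))),
                   key=lambda ij: max(size(regions[ij[0]]),
                                      size(regions[ij[1]])))
        regions[i] += regions.pop(j)
    return regions
```

With functions of sizes 4, 3, and 2 and a 7-unit SPM, the sketch keeps the largest function in its own region and overlays the two smaller ones, so region sizes 4 + 3 fit the budget; a real heuristic would additionally weigh how often the co-located functions evict each other.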
IEEE Transactions on Very Large Scale Integration (VLSI) Systems | 2009
Jonghee W. Yoon; Aviral Shrivastava; Sang-Hyun Park; Minwook Ahn; Yunheung Paek
Recently, coarse-grained reconfigurable architectures (CGRAs) have drawn increasing attention due to their efficiency and flexibility. While many CGRAs have demonstrated impressive performance improvements, the effectiveness of CGRA platforms ultimately hinges on the compiler. Existing CGRA compilers do not model the details of the CGRA, and thus they are i) unable to map some applications even though a mapping exists, and ii) prone to using too many processing elements (PEs) to map an application. In this paper, we model several CGRA details, e.g., irregular CGRA topologies, shared resources, and routing PEs, in our compiler and develop a graph-drawing-based approach, split-push kernel mapping (SPKM), for mapping applications onto CGRAs. On randomly generated graphs our technique can map on average 4.5X more applications than the previous approach, while generating mappings of better quality in terms of utilized CGRA resources. Utilizing fewer resources translates directly into increased opportunities for novel power and performance optimization techniques. Our technique shows lower power consumption in 71 cases and shorter execution cycles in 66 cases out of 100 synthetic applications, with minimal mapping time overhead. We observe similar results on a suite of benchmarks collected from the Livermore Loops, MediaBench, Multimedia, Wavelet, and DSPStone benchmarks. SPKM is not customized to a specific CGRA template; we demonstrate this by exploring various PE interconnection topologies and shared-resource configurations with SPKM.