Publication


Featured research published by Viji Srinivasan.


High Performance Computer Architecture | 2001

Branch history guided instruction prefetching

Viji Srinivasan; Edward S. Davidson; Gary S. Tyson; Mark J. Charney; Thomas R. Puzak

Instruction cache misses stall the fetch stage of the processor pipeline and hence affect instruction supply to the processor. Instruction prefetching has been proposed as a mechanism to reduce instruction cache (I-cache) misses. However, a prefetch is effective only if accurate and initiated sufficiently early to cover the miss penalty. This paper presents a new hardware-based instruction prefetching mechanism, Branch History Guided Prefetching (BHGP), to improve the timeliness of instruction prefetches. BHGP correlates the execution of a branch instruction with I-cache misses and uses branch instructions to trigger prefetches of instructions that occur (N-1) branches later in the program execution, for a given N>1. Evaluations on commercial applications, Windows NT applications, and some CPU2000 applications show an average reduction of 66% in miss rate over all applications. BHGP improved the IPC by 12 to 14% for the CPU2000 applications studied; on average 80% of the BHGP prefetches arrived in cache before their next use, even on a 4-wide issue machine with a 15 cycle L2 access penalty.
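The BHGP correlation idea can be sketched as a small software model. This is an illustrative toy, not the paper's hardware: the table organization, its unbounded size, and the training rule (associate a miss with the branch seen N-1 branches earlier) are assumptions for clarity.

```python
from collections import deque

class BHGPrefetcher:
    """Toy model of Branch History Guided Prefetching (BHGP).

    When an I-cache miss occurs, it is correlated with the branch that
    executed N-1 branches earlier; the next time that branch executes,
    the previously missed line is prefetched.
    """

    def __init__(self, n=4):
        self.n = n
        self.recent_branches = deque(maxlen=n - 1)  # last N-1 branch PCs
        self.table = {}  # trigger branch PC -> set of lines to prefetch

    def on_branch(self, branch_pc):
        """Branch executes: issue any learned prefetches, then record it."""
        prefetches = sorted(self.table.get(branch_pc, ()))
        self.recent_branches.append(branch_pc)
        return prefetches

    def on_icache_miss(self, miss_line):
        """Correlate the miss with the branch N-1 branches back."""
        if len(self.recent_branches) == self.n - 1:
            trigger = self.recent_branches[0]
            self.table.setdefault(trigger, set()).add(miss_line)
```

With N=2, a miss observed right after branch B trains the table so that B's next execution triggers a prefetch of the missed line.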


International Symposium on Microarchitecture | 2002

Optimizing pipelines for power and performance

Viji Srinivasan; David M. Brooks; Michael Karl Gschwind; Pradip Bose; Victor Zyuban; Philip N. Strenski; Philip G. Emma

During the concept phase and definition of next generation high-end processors, power and performance will need to be weighed appropriately to deliver competitive cost/performance. It is not enough to adopt a CPI-centric view alone in early-stage definition studies. One of the fundamental issues confronting the architect at this stage is the choice of pipeline depth and target frequency. In this paper we present an optimization methodology that starts with an analytical power-performance model to derive optimal pipeline depth for a superscalar processor. The results are validated and further refined using detailed simulation based analysis. As part of the power-modeling methodology, we have developed equations that model the variation of energy as a function of pipeline depth. Our results using a set of SPEC2000 applications show that when both power and performance are considered for optimization, the optimal clock period is around 18 FO4. We also provide a detailed sensitivity analysis of the optimal pipeline depth against key assumptions of these energy models.
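The shape of this optimization can be illustrated with a first-order scan over pipeline depth. All constants below are illustrative assumptions (chosen so the toy lands near the paper's ballpark), not the paper's fitted models: clock period = logic depth / stages + latch overhead (in FO4), CPI grows with depth through hazard penalties, and per-instruction energy grows with the latch count.

```python
def optimal_pipeline_depth(total_logic_fo4=90.0, latch_fo4=3.0,
                           base_cpi=1.0, hazard_frac=0.2,
                           e_fixed=1.0, e_per_stage=0.35):
    """Scan pipeline depths and pick the power-performance optimum.

    Uses a BIPS^3/W-style metric (perf cubed over energy), a common
    voltage-invariant power-performance figure of merit.
    """
    best = None
    for stages in range(4, 40):
        period = total_logic_fo4 / stages + latch_fo4   # FO4 per cycle
        cpi = base_cpi + hazard_frac * (stages - 1)     # deeper -> more stalls
        perf = 1.0 / (period * cpi)                     # instructions per FO4
        energy = e_fixed + e_per_stage * stages         # per-instruction energy
        metric = perf ** 3 / energy
        if best is None or metric > best[1]:
            best = (stages, metric, period)
    return best  # (stages, metric, clock period in FO4)
```

With these toy constants the scan settles at a clock period of 18 FO4; the point of the sketch is the trade-off structure (deeper pipelines shorten the cycle but raise CPI and latch energy), not the specific numbers.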


IBM Journal of Research and Development | 2003

New methodology for early-stage, microarchitecture-level power-performance analysis of microprocessors

David M. Brooks; Pradip Bose; Viji Srinivasan; Michael Karl Gschwind; Philip G. Emma; Michael G. Rosenfield

The PowerTimer toolset has been developed for use in early-stage, microarchitecture-level power-performance analysis of microprocessors. The key component of the toolset is a parameterized set of energy functions that can be used in conjunction with any given cycle-accurate microarchitectural simulator. The energy functions model the power consumption of primitive and hierarchically composed building blocks which are used in microarchitecture-level performance models. Examples of structures modeled are pipeline stage latches, queues, buffers and component read/write multiplexers, local clock buffers, register files, and cache array macros. The energy functions can be derived using purely analytical equations that are driven by organizational, circuit, and technology parameters or behavioral equations that are derived from empirical, circuit-level simulation experiments. After describing the modeling methodology, we present analysis results in the context of a current-generation superscalar processor simulator to illustrate the use and effectiveness of such early-stage models. In addition to average power and performance tradeoff analysis, PowerTimer is useful in assessing the typical and worst-case power (or current) swings that occur between successive cycle windows in a given workload execution. Such a characterization of workloads at the early stage of microarchitecture definition helps pinpoint potential inductive noise problems on the voltage rail that can be addressed by designing an appropriate package or by suitably tuning the dynamic power management controls within the processor.
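A minimal sketch of such a parameterized energy function follows. The specific form and the gating factor are illustrative assumptions, not PowerTimer's actual equations: with clock gating, only part of a structure's unconstrained power scales with its per-cycle utilization, while the rest is always-on overhead.

```python
def structure_power(unconstrained_power, utilization, gating_factor=0.85):
    """Toy energy function for one microarchitectural structure.

    gating_factor is the (assumed) fraction of the structure's power
    that clock gating can scale with utilization; the remainder burns
    regardless of activity.
    """
    ungated = (1.0 - gating_factor) * unconstrained_power
    gated = gating_factor * unconstrained_power * utilization
    return ungated + gated
```

Driven cycle by cycle from a performance simulator's utilization counts, functions of this shape also expose the cycle-to-cycle power swings the abstract mentions.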


IEEE Transactions on Computers | 2004

Integrated analysis of power and performance for pipelined microprocessors

Victor Zyuban; David M. Brooks; Viji Srinivasan; Michael Karl Gschwind; Pradip Bose; Philip N. Strenski; Philip G. Emma

Choosing the pipeline depth of a microprocessor is one of the most critical design decisions that an architect must make in the concept phase of a microprocessor design. To be successful in today's cost/performance marketplace, modern CPU designs must effectively balance both performance and power dissipation. The choice of pipeline depth and target clock frequency has a critical impact on both of these metrics. We describe an optimization methodology based on both analytical models and detailed simulations for power and performance as a function of pipeline depth. Our results for a set of SPEC2000 applications show that, when both power and performance are considered for optimization, the optimal clock period is around 18 FO4. We also provide a detailed sensitivity analysis of the optimal pipeline depth against key assumptions of our energy models. Finally, we discuss the potential risks in design quality for overly aggressive or conservative choices of pipeline depth.


Computing Frontiers | 2006

Cache miss behavior: is it √2?

Allan M. Hartstein; Viji Srinivasan; Thomas R. Puzak; Philip G. Emma

It has long been empirically observed that the cache miss rate decreases as a power law of cache size, where the power is approximately -1/2. In this paper, we examine the dependence of the cache miss rate on cache size both theoretically and through simulation. By combining the observed time dependence of the cache reference pattern with a statistical treatment of cache entry replacement, we predict that the cache miss rate should vary with cache size as an inverse power law for a first level cache. The exponent in the power law is directly related to the time dependence of cache references, and lies between -0.3 and -0.7. Results are presented for both direct mapped and set associative caches, and for various levels of the cache hierarchy. Our results demonstrate that the dependence of cache miss rate on cache size arises from the temporal dependence of the cache access pattern.
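The power law is easy to state concretely. The fit coefficient below is a made-up placeholder; the exponent -1/2 is the classical value the title alludes to, with the paper's measured range being -0.3 to -0.7.

```python
def miss_rate(cache_size_kb, coeff=0.04, exponent=-0.5):
    """Empirical power-law miss model: m = A * C**alpha.

    coeff (A) is an illustrative fit constant; exponent (alpha) is
    approximately -1/2 for the classical observation.
    """
    return coeff * cache_size_kb ** exponent
```

With alpha = -1/2, doubling the cache size multiplies the miss rate by 2**(-1/2) = 1/sqrt(2), which is where the sqrt(2) of the title comes from.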


PACS'02: Proceedings of the 2nd International Conference on Power-Aware Computer Systems | 2002

Early-stage definition of LPX: a low power issue-execute processor

Pradip Bose; David M. Brooks; Alper Buyuktosunoglu; Peter W. Cook; Koushik K. Das; Philip G. Emma; Michael Karl Gschwind; Hans M. Jacobson; Tejas Karkhanis; Prabhakar Kudva; Stanley E. Schuster; James E. Smith; Viji Srinivasan; Victor Zyuban; David H. Albonesi; Sandhya Dwarkadas

We present the high-level microarchitecture of LPX: a low-power issue-execute processor prototype that is being designed by a joint industry-academia research team. LPX implements a very small subset of a RISC architecture, with a primary focus on a vector (SIMD) multimedia extension. The objective of this project is to validate some key new ideas in power-aware microarchitecture techniques, supported by recent advances in circuit design and clocking.


Computing Frontiers | 2005

When prefetching improves/degrades performance

Thomas R. Puzak; Allan M. Hartstein; Philip G. Emma; Viji Srinivasan

We formulate a new method for evaluating any prefetching algorithm (real or hypothetical). This method allows researchers to analyze the potential improvements prefetching can bring to an application independent of any known prefetching algorithm. We characterize prefetching with the metrics: timeliness, coverage, and accuracy. We demonstrate the usefulness of this method using a Markov prefetch algorithm. Under ideal conditions, prefetching can remove nearly all of the pipeline stalls associated with a cache miss. However, in today's processors, we show that nearly all of the performance benefits derived from prefetching are eroded and, in many cases, prefetching loses performance. We present a quantitative analysis of these trade-offs and show that there are linear relationships between overall performance and coverage, accuracy, and bandwidth.
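The three characterization metrics can be computed from counts a simulator would collect. The input names below are hypothetical labels for those counters, not the paper's notation; the ratios follow the standard definitions of coverage, accuracy, and timeliness.

```python
def prefetch_metrics(baseline_misses, misses_removed,
                     prefetches_issued, prefetches_used,
                     prefetches_on_time):
    """Compute standard prefetch characterization metrics.

    coverage   = fraction of baseline misses removed by prefetching
    accuracy   = fraction of issued prefetches that are actually used
    timeliness = fraction of useful prefetches arriving before use
    """
    return {
        "coverage": misses_removed / baseline_misses,
        "accuracy": prefetches_used / prefetches_issued,
        "timeliness": prefetches_on_time / prefetches_used,
    }
```

High coverage with low accuracy is the eroding case the abstract describes: useless prefetches consume bandwidth and pollute the cache, so overall performance can drop even as misses are removed.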


Design Automation Conference | 2017

Accelerator Design for Deep Learning Training: Extended Abstract: Invited

Ankur Agrawal; Chia-Yu Chen; Jungwook Choi; Kailash Gopalakrishnan; Jinwook Oh; Sunil Shukla; Viji Srinivasan; Swagath Venkataramani; Wei Zhang

Deep Neural Networks (DNNs) have emerged as a powerful and versatile set of techniques showing successes on challenging artificial intelligence (AI) problems. Applications in domains such as image/video processing, autonomous cars, natural language processing, speech synthesis and recognition, genomics and many others have embraced deep learning as the foundation. DNNs achieve superior accuracy for these applications with high computational complexity using very large models which require 100s of MBs of data storage, exaops of computation and high bandwidth for data movement. In spite of these impressive advances, it still takes days to weeks to train state-of-the-art deep networks on large datasets, which directly limits the pace of innovation and adoption. In this paper, we present a multi-pronged approach to address the challenges in meeting both the throughput and the energy efficiency goals for DNN training.


IEEE Transactions on Computers | 2004

A prefetch taxonomy

Viji Srinivasan; Edward S. Davidson; Gary S. Tyson

