Network


Latest external collaborations at the country level.

Hotspot


Research topics where Ravishankar R. Iyer is active.

Publication


Featured research published by Ravishankar R. Iyer.


Design Automation Conference | 2012

Cache revive: architecting volatile STT-RAM caches for enhanced performance in CMPs

Adwait Jog; Asit K. Mishra; Cong Xu; Yuan Xie; Vijaykrishnan Narayanan; Ravishankar R. Iyer; Chita R. Das

High density, low leakage, and non-volatility are the attractive features of Spin-Transfer Torque RAM (STT-RAM), which have made it a strong competitor against SRAM as a universal memory replacement in multi-core systems. However, STT-RAM suffers from high write latency and energy, which has impeded its widespread adoption. To this end, we look at trading off STT-RAM's non-volatility property (data retention time) to overcome these problems. We formulate the relationship between retention time and write latency, and find the optimal retention time for architecting an efficient cache hierarchy using STT-RAM. Our results show that, compared to an SRAM-based design, our proposal can improve performance by 18% and reduce energy consumption by 60%.
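
The abstract states the idea but not the mechanism, so the following is only an illustrative sketch: every parameter (retention time, write latency, refresh window) is an assumed value, not one from the paper. It models a volatile STT-RAM cache in which relaxed retention buys a faster write, and lines close to expiry are rewritten if recently used, otherwise allowed to expire.

```python
# Hypothetical parameters chosen for illustration only.
RETENTION_CYCLES = 10_000   # assumed relaxed retention time
WRITE_LATENCY    = 6        # assumed fast write enabled by relaxed retention
REFRESH_WINDOW   = 500      # refresh lines this close to expiry

class SttRamLine:
    def __init__(self, tag, now):
        self.tag = tag
        self.expires_at = now + RETENTION_CYCLES
        self.last_access = now

class VolatileSttRamCache:
    def __init__(self):
        self.lines = {}     # tag -> SttRamLine
        self.cycle = 0

    def access(self, tag):
        self.cycle += 1
        self._expire_or_refresh()
        line = self.lines.get(tag)
        if line is None:                       # miss: allocate with fast write
            self.cycle += WRITE_LATENCY
            self.lines[tag] = SttRamLine(tag, self.cycle)
            return False
        line.last_access = self.cycle          # hit
        return True

    def _expire_or_refresh(self):
        for tag, line in list(self.lines.items()):
            if line.expires_at - self.cycle > REFRESH_WINDOW:
                continue
            if self.cycle - line.last_access < RETENTION_CYCLES // 2:
                self.cycle += WRITE_LATENCY    # rewrite to restore retention
                line.expires_at = self.cycle + RETENTION_CYCLES
            else:
                del self.lines[tag]            # let stale data expire
```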


Architectural Support for Programming Languages and Operating Systems | 2013

OWL: cooperative thread array aware scheduling techniques for improving GPGPU performance

Adwait Jog; Onur Kayiran; Nachiappan Chidambaram Nachiappan; Asit K. Mishra; Mahmut T. Kandemir; Onur Mutlu; Ravishankar R. Iyer; Chita R. Das

Emerging GPGPU architectures, along with programming models like CUDA and OpenCL, offer a cost-effective platform for many applications by providing high thread-level parallelism at lower energy budgets. Unfortunately, for many general-purpose applications, available hardware resources of a GPGPU are not efficiently utilized, leading to lost opportunities for improving performance. A major cause of this is the inefficiency of current warp scheduling policies in tolerating long memory latencies. In this paper, we identify that the scheduling decisions made by such policies are agnostic to thread-block, or cooperative thread array (CTA), behavior and are, as a result, inefficient. We present a coordinated CTA-aware scheduling policy that utilizes four schemes to minimize the impact of long memory latencies. The first two schemes, CTA-aware two-level warp scheduling and locality-aware warp scheduling, enhance per-core performance by effectively reducing cache contention and improving latency-hiding capability. The third scheme, bank-level-parallelism-aware warp scheduling, improves overall GPGPU performance by enhancing DRAM bank-level parallelism. The fourth scheme employs opportunistic memory-side prefetching to further enhance performance by taking advantage of open DRAM rows. Evaluations on a 28-core GPGPU platform with highly memory-intensive applications indicate that our proposed mechanism can provide a 33% average performance improvement compared to the commonly employed round-robin warp scheduling policy.
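
As a rough illustration of the CTA-aware two-level scheduling idea (not the paper's exact policy; the group size and the warp interface below are assumptions), the sketch keeps only a small group of CTAs active, round-robins among their ready warps, and moves on to the next group when the entire active group is stalled on memory.

```python
from collections import deque

GROUP_SIZE = 2   # number of CTAs scheduled together (assumed)

class CtaAwareScheduler:
    def __init__(self, warps_by_cta):
        # warps_by_cta: {cta_id: [warp, ...]}; each warp exposes .ready()
        cta_ids = sorted(warps_by_cta)
        self.groups = deque(
            [w for cta in cta_ids[i:i + GROUP_SIZE] for w in warps_by_cta[cta]]
            for i in range(0, len(cta_ids), GROUP_SIZE)
        )

    def pick_next_warp(self):
        # Prefer warps from the current group; rotate to the next group only
        # when the whole group is stalled, so its CTAs keep cache locality.
        for _ in range(len(self.groups)):
            group = self.groups[0]
            for warp in group:
                if warp.ready():
                    group.remove(warp)
                    group.append(warp)   # round-robin inside the group
                    return warp
            self.groups.rotate(-1)       # whole group stalled: switch group
        return None                      # every warp is stalled this cycle
```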


International Symposium on Microarchitecture | 2007

A Framework for Providing Quality of Service in Chip Multi-Processors

Fei Guo; Yan Solihin; Li Zhao; Ravishankar R. Iyer

The trends in enterprise IT toward service-oriented computing, server consolidation, and virtual computing point to a future in which workloads are becoming increasingly diverse in terms of performance, reliability, and availability requirements. It can be expected that more and more applications with diverse requirements will run on a CMP and share platform resources such as the lowest-level cache and off-chip bandwidth. In this environment, it is desirable to have microarchitecture and software support that can provide a guarantee of a certain level of performance, which we refer to as performance Quality of Service. In this paper, we investigate a framework that would be needed for a CMP to fully provide QoS. We found that the ability of a CMP to partition platform resources alone is not sufficient for fully providing QoS; we also need an appropriate way to specify a QoS target and an admission-control policy that accepts jobs only when their QoS targets can be satisfied. We also found that providing strict QoS often leads to a significant reduction in throughput due to resource fragmentation. We propose novel throughput optimization techniques that include: (1) exploiting various QoS execution modes, and (2) a microarchitecture technique that steals excess resources from a job while still meeting its QoS target. We evaluated our QoS framework with a full-system simulation of a 4-core CMP and a recent version of the Linux operating system. We found that, compared to an unoptimized scheme, throughput can be improved by up to 47%, bringing it significantly closer to that of a non-QoS CMP.
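
The admission-control idea can be sketched as follows. The interface and the choice of last-level-cache ways as the partitioned resource are assumptions made for illustration, not the paper's implementation.

```python
class QosAdmissionController:
    """A job states its QoS target as a share of a partitionable resource
    (here, cache ways) and is admitted only if that share can still be
    guaranteed out of the unallocated capacity."""

    def __init__(self, total_cache_ways):
        self.total = total_cache_ways
        self.allocated = 0
        self.jobs = {}                       # job_id -> ways reserved

    def admit(self, job_id, required_ways):
        if self.allocated + required_ways > self.total:
            return False                     # target cannot be guaranteed
        self.jobs[job_id] = required_ways
        self.allocated += required_ways
        return True

    def finish(self, job_id):
        self.allocated -= self.jobs.pop(job_id)

ctrl = QosAdmissionController(total_cache_ways=16)
assert ctrl.admit("db", 8) and ctrl.admit("web", 6)
assert not ctrl.admit("batch", 4)            # would violate existing guarantees
```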


High-Performance Computer Architecture | 2010

CHOP: Adaptive filter-based DRAM caching for CMP server platforms

Xiaowei Jiang; Niti Madan; Li Zhao; Mike Upton; Ravishankar R. Iyer; Srihari Makineni; Donald Newell; Yan Solihin; Rajeev Balasubramonian

As manycore architectures enable a large number of cores on the die, a key challenge that emerges is the availability of memory bandwidth with conventional DRAM solutions. To address this challenge, integration of large DRAM caches that provide as much as 5× higher bandwidth and as low as one third of the latency (compared to conventional DRAM) is very promising. However, organizing and implementing a large DRAM cache is challenging because of two primary tradeoffs: (a) DRAM caches at cache-line granularity require an on-chip tag area that is too large to be practical, and (b) DRAM caches at page granularity require too much bandwidth because the miss rate does not decrease enough to offset the bandwidth increase. In this paper, we propose CHOP (Caching HOt Pages) in DRAM caches to address these challenges. We study several filter-based DRAM caching techniques: (a) a filter cache (CHOP-FC) that profiles pages and determines the hot subset of pages to allocate into the DRAM cache, (b) a memory-based filter cache (CHOP-MFC) that spills and fills filter state to improve accuracy and reduce the size of the filter cache, and (c) an adaptive DRAM caching technique (CHOP-AFC) that determines when the filter cache should be enabled and disabled for DRAM caching. We conduct detailed simulations with server workloads to show that our filter-based DRAM caching techniques achieve the following: (a) on average, over 30% performance improvement over previous solutions, (b) a tag-space area overhead several orders of magnitude lower than that of cache-line-based DRAM caches, and (c) significantly lower memory bandwidth consumption compared to page-granular DRAM caches.
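
A minimal sketch of the CHOP-FC filter idea follows; the hotness threshold, filter capacity, and replacement policy are assumed values chosen for illustration, not those used in the paper.

```python
from collections import OrderedDict

HOT_THRESHOLD  = 32      # accesses before a page is considered hot (assumed)
FILTER_ENTRIES = 1024    # capacity of the filter cache (assumed)

class HotPageFilter:
    def __init__(self):
        self.counters = OrderedDict()   # page -> access count, LRU-ordered
        self.dram_cache = set()         # pages allocated in the DRAM cache

    def access(self, addr, page_size=4096):
        page = addr // page_size
        if page in self.dram_cache:
            return "dram-cache hit"
        count = self.counters.pop(page, 0) + 1
        if count >= HOT_THRESHOLD:
            self.dram_cache.add(page)   # promote hot page into the DRAM cache
            return "promoted to DRAM cache"
        self.counters[page] = count     # re-insert as most recently touched
        if len(self.counters) > FILTER_ENTRIES:
            self.counters.popitem(last=False)   # evict least-recently-touched entry
        return "filtered (cold page)"
```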


International Symposium on Microarchitecture | 2009

A case for dynamic frequency tuning in on-chip networks

Asit K. Mishra; Reetuparna Das; Soumya Eachempati; Ravishankar R. Iyer; Narayanan Vijaykrishnan; Chita R. Das

Performance and power are the first-order design metrics for networks-on-chip (NoCs), which have become the de facto standard for providing scalable communication backbones for multicores/CMPs. However, NoCs can be plagued by higher power consumption and degraded throughput if the network and router are not designed properly. To this end, this paper proposes a novel router architecture in which we tune the frequency of a router in response to network load to manage both performance and power. We propose three dynamic frequency tuning techniques, FreqBoost, FreqThrtl, and FreqTune, targeted at congestion and power management in NoCs. As enablers for these techniques, we exploit Dynamic Voltage and Frequency Scaling (DVFS) and the imbalance in a generic router pipeline through time stealing. Experiments using synthetic workloads on an 8x8 wormhole-switched mesh interconnect show that FreqBoost is the better choice for reducing average latency (by up to 40%), while FreqThrtl provides the maximum benefits in terms of power savings and energy-delay product (EDP). The FreqTune scheme is the better candidate for optimizing both performance and power, achieving on average a 36% reduction in latency, 13% savings in power (up to 24% at high load), and 40% savings in EDP (up to 70% at high load). With application benchmarks, we observe IPC improvements of up to 23% using our design. The performance and power benefits also scale for larger NoCs.
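
A toy version of load-driven frequency tuning in the spirit of these schemes could look like the following; the occupancy thresholds and scaling factors are assumptions, not the paper's DVFS settings.

```python
NOMINAL_FREQ = 1.0        # normalized router frequency
BOOST_FACTOR = 1.5        # assumed boost ratio
THRTL_FACTOR = 0.5        # assumed throttle ratio

def tune_router_frequency(buffer_occupancy, buffer_capacity):
    """Pick a router frequency from its current input-buffer load."""
    load = buffer_occupancy / buffer_capacity
    if load > 0.75:                   # congested: trade power for latency
        return NOMINAL_FREQ * BOOST_FACTOR
    if load < 0.25:                   # lightly loaded: save power
        return NOMINAL_FREQ * THRTL_FACTOR
    return NOMINAL_FREQ

# Example: a router with 13 of 16 buffer slots occupied gets boosted.
assert tune_router_frequency(13, 16) == 1.5
```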


ACM SIGARCH Computer Architecture News | 2005

Exploring the cache design space for large scale CMPs

Lisa R. Hsu; Ravishankar R. Iyer; Srihari Makineni; Steven K. Reinhardt; Donald Newell

With the advent of dual-core chips in the marketplace, small-scale CMP (chip multiprocessor) architectures are becoming commonplace. We expect a continuing trend of increasing the number of cores on a die to maximize the performance/power efficiency of a single chip. We believe an era of large-scale CMPs (LCMPs) with several tens to hundreds of cores is on the way, but as of now architects have little understanding of how best to build a cache hierarchy given such a large number of cores/threads to support. With this in mind, our initial goals are to prune the cache design space for LCMPs by characterizing basic server workload behavior in such an environment. In this paper, we describe the range of methodologies that we are developing to overcome the challenges of exploring the cache design space for LCMP platforms. We then focus on employing a trace-driven approach to characterizing one key server workload (OLTP) in both a homogeneous and a heterogeneous workload environment. We study the effect of increasing threads (from 1 to 128) on a three-level cache hierarchy with emphasis on second and third level caches. We study the effect of varying sizes at these cache levels and show the effects of threads contending for cache space, the effects of prefetching instruction addresses, and the effects of inclusion. We make initial observations and conclusions about the factors on which LCMP cache hierarchy design decisions should be based and discuss future work.
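
A trace-driven sweep of this kind can be sketched in a few lines; the direct-mapped two-level model, the parameter values, and the synthetic trace below are simplifications for illustration, not the study's simulator or workload.

```python
import itertools

def simulate(trace, l2_kb, l3_mb, line=64):
    """Replay an address trace against direct-mapped L2/L3 models and
    return (L2 misses, L3 misses)."""
    l2_sets = l2_kb * 1024 // line
    l3_sets = l3_mb * 1024 * 1024 // line
    l2, l3 = {}, {}                 # set index -> resident block
    l2_miss = l3_miss = 0
    for addr in trace:
        blk = addr // line
        if l2.get(blk % l2_sets) != blk:
            l2_miss += 1
            if l3.get(blk % l3_sets) != blk:
                l3_miss += 1
                l3[blk % l3_sets] = blk
            l2[blk % l2_sets] = blk
    return l2_miss, l3_miss

trace = list(range(0, 1 << 22, 64)) * 2   # stand-in for a real OLTP trace
for l2_kb, l3_mb in itertools.product([256, 512], [4, 8, 16]):
    print(l2_kb, "KB L2 /", l3_mb, "MB L3 ->", simulate(trace, l2_kb, l3_mb))
```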


High-Performance Computer Architecture | 2008

Performance and power optimization through data compression in Network-on-Chip architectures

Reetuparna Das; Asit K. Mishra; Chrysostomos Nicopoulos; Dongkook Park; Vijaykrishnan Narayanan; Ravishankar R. Iyer; Mazin S. Yousif; Chita R. Das

The trend towards integrating multiple cores on the same die has accentuated the need for larger on-chip caches. Such large caches are constructed as a multitude of smaller cache banks interconnected through a packet-based network-on-chip (NoC) communication fabric. Thus, the NoC plays a critical role in optimizing the performance and power consumption of such non-uniform cache-based multicore architectures. While almost all prior NoC studies have focused on the design of router microarchitectures for achieving this goal, in this paper we explore the role of data compression on NoC performance and energy behavior. In this context, we examine two different configurations that explore combinations of storage and communication compression: (1) cache compression (CC) and (2) compression in the NIC (NC). We also address techniques to hide the decompression latency by overlapping it with NoC communication latency. Our simulation results with a diverse set of scientific and commercial benchmark traces reveal that CC can provide up to a 33% reduction in network latency and up to 23% power savings. Even in the case of NC, where data is compressed only while passing through the NoC fabric of the NUCA architecture and is stored uncompressed, performance and power savings of up to 32% and 21%, respectively, can be obtained. These interconnect benefits translate into up to a 17% reduction in CPI. These benefits are orthogonal to any router architecture and make a strong case for utilizing compression to optimize the performance and power envelope of NoC architectures. In addition, the study demonstrates the criticality of designing faster routers in shaping the performance behavior.
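
To illustrate how compression shrinks packets, the sketch below uses a simple zero-word elision scheme as a stand-in for whichever compression algorithm CC or NC would actually employ; the flit width and encoding are assumptions.

```python
FLIT_BYTES = 16   # assumed flit width

def compress_line(words):
    """Encode a cache line (list of 32-bit words) as (presence bitmask,
    list of nonzero words)."""
    mask, payload = 0, []
    for i, w in enumerate(words):
        if w != 0:
            mask |= 1 << i
            payload.append(w)
    return mask, payload

def flits_needed(words, compressed=True):
    if compressed:
        mask, payload = compress_line(words)
        size = 4 + 4 * len(payload)        # one mask word + nonzero words
    else:
        size = 4 * len(words)
    return -(-size // FLIT_BYTES)          # ceiling division

line = [0, 0, 7, 0, 0, 0, 42, 0, 0, 0, 0, 0, 0, 0, 0, 0]   # sparse 64 B line
print(flits_needed(line, compressed=False), flits_needed(line))   # 4 vs 1
```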


High-Performance Computer Architecture | 2009

Optimizing communication and capacity in a 3D stacked reconfigurable cache hierarchy

Niti Madan; Li Zhao; Naveen Muralimanohar; Aniruddha N. Udipi; Rajeev Balasubramonian; Ravishankar R. Iyer; Srihari Makineni; Donald Newell

Cache hierarchies in future many-core processors are expected to grow in size and contribute a large fraction of overall processor power and performance. In this paper, we postulate a 3D chip design that stacks SRAM and DRAM upon processing cores and employs OS-based page coloring to minimize horizontal communication of cache data. We then propose a heterogeneous reconfigurable cache design that takes advantage of the high density of DRAM and the superior power/delay characteristics of SRAM to efficiently meet the working set demands of each individual core. Finally, we analyze the communication patterns for such a processor and show that a tree topology is an ideal fit that significantly reduces the power and latency requirements of the on-chip network. The above proposals are synergistic: each proposal is made more compelling because of its combination with the other innovations described in this paper. The proposed reconfigurable cache model improves performance by up to 19% along with 48% savings in network power.
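
The page-coloring component can be illustrated with a small sketch; the page size, bank count, and core-to-bank mapping below are assumptions chosen for clarity rather than the paper's configuration.

```python
PAGE_SHIFT = 12          # 4 KB pages (assumed)
NUM_BANKS  = 16          # one stacked bank per core (assumed)

def page_color(phys_addr):
    """The color bits of the physical page number select the cache bank."""
    return (phys_addr >> PAGE_SHIFT) % NUM_BANKS

def pick_frame_for_core(core_id, free_frames):
    """Prefer a free frame whose color maps to the bank stacked directly
    above the requesting core, so the data avoids horizontal NoC hops."""
    for frame in free_frames:
        if page_color(frame << PAGE_SHIFT) == core_id:
            free_frames.remove(frame)
            return frame
    return free_frames.pop()     # no matching color: fall back to any frame
```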


High Performance Interconnects | 2007

Design of a Dynamic Priority-Based Fast Path Architecture for On-Chip Interconnects

Dongkook Park; Reetuparna Das; Chrysostomos Nicopoulos; Jongman Kim; Narayanan Vijaykrishnan; Ravishankar R. Iyer; Chita R. Das

In modern multi-core system-on-chip (SoC) architectures, the design of innovative interconnection fabrics is indispensable. The network-on-chip (NoC) architecture has recently been proposed to meet this requirement. In particular, the router architecture has a significant effect on the overall performance and energy consumption of the chip. We propose a dynamic path management scheme that exploits network traffic information during switch arbitration. Consequently, flits transferred across frequently used paths are expedited by traversing a reduced router pipeline. This technique, based on pipeline bypassing, is simulated and evaluated in terms of network latency and average power consumption. Simulation results with real-world application traces show that the architecture improves performance by up to 30% while incurring only minimal area and power overhead.
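
A simplified view of the dynamic fast-path decision is sketched below; the sampling window, threshold, and bookkeeping are assumptions rather than details from the paper.

```python
from collections import Counter

FAST_PATH_MIN_SHARE = 0.6     # assumed share of recent traffic

class InputPort:
    """Each input port tracks which output port has dominated recent
    traffic; flits heading to that port take a pre-arbitrated fast path."""

    def __init__(self, window=128):
        self.history = Counter()
        self.window = window

    def route_flit(self, out_port):
        self.history[out_port] += 1
        total = sum(self.history.values())
        top, hits = self.history.most_common(1)[0]
        decision = ("fast path (bypass arbitration)"
                    if out_port == top and hits / total >= FAST_PATH_MIN_SHARE
                    else "regular pipeline")
        if total >= self.window:
            self.history.clear()      # start a new sampling window
        return decision
```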


Operating Systems Review | 2011

Efficient interaction between OS and architecture in heterogeneous platforms

Sadagopan Srinivasan; Li Zhao; Ramesh Illikkal; Ravishankar R. Iyer

Almost all hardware platforms to date have been homogeneous, with one or more identical processors managed by the operating system (OS). Recently, however, it has been recognized that power constraints and the need for domain-specific high-performance computing may lead architects towards building heterogeneous architectures and platforms in the near future. In this paper, we consider three types of heterogeneous core architectures: (a) virtual asymmetric cores: multiple processors with identical core microarchitectures and ISA, but each running at a different frequency or perhaps having a different cache size; (b) physically asymmetric cores: heterogeneous cores, each with a fundamentally different microarchitecture (for instance, in-order vs. out-of-order), running at similar or different frequencies, with an identical ISA; and (c) hybrid cores: multiple cores, where some cores have tightly coupled hardware accelerators or special functional units. We show case studies that highlight why existing OS and hardware interaction in such heterogeneous architectures is inefficient, causing losses in application performance, reduced throughput efficiency, and a lack of quality of service. We then discuss the hardware and software support needed to address these challenges and establish efficient heterogeneous environments for platforms in the next decade. In particular, we outline a monitoring and prediction framework for heterogeneity, along with software support to take advantage of this information. Based on measurements on real platforms, we show that the proposed techniques can provide significant advantages in terms of performance and power efficiency in heterogeneous platforms.
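
The monitoring-and-prediction loop can be illustrated with a toy scheduler; the counter fields, classification thresholds, and core names below are hypothetical and do not describe the paper's framework.

```python
BIG_CORES   = ["big0"]
SMALL_CORES = ["small0", "small1"]

def classify(sample):
    # sample: metrics derived from hardware counters (hypothetical fields)
    if sample["ipc"] > 1.0 and sample["llc_misses_per_kinstr"] < 5:
        return "compute-bound"
    return "memory-bound"

def schedule(threads):
    """threads: {tid: counter sample} -> {tid: core}; compute-bound threads
    go to the big core, memory-bound threads to the small cores."""
    placement, big, small = {}, list(BIG_CORES), list(SMALL_CORES)
    for tid, sample in threads.items():
        if classify(sample) == "compute-bound" and big:
            placement[tid] = big.pop()
        else:
            placement[tid] = small.pop() if small else BIG_CORES[0]
        # A real OS would migrate the thread here and keep sampling.
    return placement

print(schedule({1: {"ipc": 1.4, "llc_misses_per_kinstr": 2},
                2: {"ipc": 0.4, "llc_misses_per_kinstr": 30}}))
```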

Collaboration


An overview of Ravishankar R. Iyer's collaborations.

Top Co-Authors

Chita R. Das (Pennsylvania State University)

Mahmut T. Kandemir (Pennsylvania State University)

Yan Solihin (North Carolina State University)