Nagendra Gulur
Texas Instruments
Publications
Featured research published by Nagendra Gulur.
international symposium on microarchitecture | 2014
Nagendra Gulur; Mahesh Mehendale; R. Manikantan; R. Govindarajan
In this paper, we present Bi-Modal Cache - a flexible stacked DRAM cache organization which simultaneously achieves several objectives: (i) improved cache hit ratio, (ii) moving the tag storage overhead to DRAM, (iii) lower cache hit latency than tags-in-SRAM, and (iv) reduction in off-chip bandwidth wastage. The Bi-Modal Cache addresses the miss rate versus off-chip bandwidth dilemma by organizing the data in a bi-modal fashion - blocks with high spatial locality are organized as large blocks and those with little spatial locality as small blocks. By adaptively selecting the right granularity of storage for individual blocks at run-time, the proposed DRAM cache organization is able to make judicious use of the available DRAM cache capacity as well as reduce the off-chip memory bandwidth consumption. The Bi-Modal Cache improves cache hit latency despite moving the metadata to DRAM by means of a small SRAM based Way Locator. Further by leveraging the tremendous internal bandwidth and capacity that stacked DRAM organizations provide, the Bi-Modal Cache enables efficient concurrent accesses to tags and data to reduce hit time. Through detailed simulations, we demonstrate that the Bi-Modal Cache achieves overall performance improvement (in terms of Average Normalized Turnaround Time (ANTT)) of 10.8%, 13.8% and 14.0% in 4-core, 8-core and 16-core workloads respectively.
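To make the granularity-selection idea concrete, here is a minimal Python sketch assuming a simple per-region spatial-locality predictor; the class name, region size, block sizes and threshold are illustrative assumptions, not the paper's actual mechanism.

```python
# Hypothetical sketch (not the Bi-Modal Cache implementation): track how many
# distinct small blocks of a region have been touched, and fill the DRAM cache
# with a large block only when that count suggests high spatial locality.

class SpatialLocalityPredictor:
    def __init__(self, region_size=2048, large_block=512, small_block=64,
                 promote_threshold=4):
        self.region_size = region_size
        self.large_block = large_block
        self.small_block = small_block
        self.promote_threshold = promote_threshold
        self.touched = {}                      # region id -> set of touched sub-blocks

    def record_access(self, addr):
        region = addr // self.region_size
        sub_block = (addr % self.region_size) // self.small_block
        self.touched.setdefault(region, set()).add(sub_block)

    def fill_granularity(self, addr):
        """Block size to use when filling 'addr' into the DRAM cache."""
        region = addr // self.region_size
        count = len(self.touched.get(region, ()))
        return self.large_block if count >= self.promote_threshold else self.small_block


predictor = SpatialLocalityPredictor()
for a in (0x1000, 0x1040, 0x1080, 0x10C0, 0x9000):
    predictor.record_access(a)
print(predictor.fill_granularity(0x1000))   # 512: dense region, fetch a large block
print(predictor.fill_granularity(0x9000))   # 64: sparse region, fetch a small block
```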
international conference on supercomputing | 2012
Nagendra Gulur; R. Manikantan; Mahesh Mehendale; R. Govindarajan
Multicore architectures place twin demands of energy efficiency and higher performance on DRAM. A variety of schemes have been proposed to address either the latency or the energy consumption of DRAMs. These schemes typically require non-trivial hardware changes and end up improving latency at the cost of energy or vice-versa. One specific DRAM performance problem in multicores is that interleaved accesses from different cores can degrade row-buffer locality. In this paper, based on the temporal and spatial locality characteristics of memory accesses, we propose a reorganization of the existing single large row-buffer in a DRAM bank into multiple sub-row buffers (MSRB). This reorganization not only improves row hit rates, and hence the average memory latency, but also brings down the energy consumed by the DRAM. The first major contribution of this work is proposing such a reorganization without requiring any significant changes to the existing widely accepted DRAM specifications. Our proposed reorganization improves weighted speedup by 35.8%, 14.5% and 21.6% in quad-, eight- and sixteen-core workloads, along with a 42%, 28% and 31% reduction in DRAM energy. The proposed MSRB organization enables the management of multiple row-buffers at the memory controller level. Because the memory controller is aware of the behaviour of individual cores, it can implement coordinated buffer allocation schemes for different cores that take program behaviour into account. We demonstrate two such schemes, namely Fairness Oriented Allocation and Performance Oriented Allocation, which show the flexibility that memory controllers can now exploit in our MSRB organization to improve overall performance and/or fairness. Further, the MSRB organization enables additional opportunities for DRAM intra-bank parallelism and selective early precharging of the LRU row-buffer to further improve memory access latencies. These two optimizations together provide an additional 5.9% performance improvement.
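As a hedged illustration of the sub-row-buffer idea (the segment size, buffer count and LRU policy below are assumptions, not the paper's parameters), the following Python sketch models one bank: a request hits if its row segment is resident in any sub-row buffer, and a miss activates only that segment, evicting the least-recently-used one.

```python
# Minimal sketch of a DRAM bank with multiple sub-row buffers (MSRB).
# Names and sizes are illustrative, not taken from the paper.

from collections import OrderedDict

class MSRBBank:
    def __init__(self, num_sub_buffers=4, segment_bytes=2048):
        self.num_sub_buffers = num_sub_buffers
        self.segment_bytes = segment_bytes
        self.resident = OrderedDict()        # (row, segment) -> None, kept in LRU order
        self.hits = self.misses = 0

    def access(self, row, column_byte):
        segment = column_byte // self.segment_bytes
        key = (row, segment)
        if key in self.resident:
            self.resident.move_to_end(key)   # row-buffer hit: refresh LRU position
            self.hits += 1
        else:
            self.misses += 1                 # activate only this segment of the row
            if len(self.resident) >= self.num_sub_buffers:
                self.resident.popitem(last=False)   # precharge/evict the LRU segment
            self.resident[key] = None


bank = MSRBBank()
for row, col in [(10, 0), (22, 0), (10, 64), (35, 0), (10, 128)]:
    bank.access(row, col)
print(bank.hits, bank.misses)   # 2 3; a single shared row buffer sees 0 hits on this pattern
```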
measurement and modeling of computer systems | 2014
Nagendra Gulur; Mahesh Mehendale; Raman Manikantan; R. Govindarajan
Memory system design increasingly influences modern multi-core architectures from both performance and power perspectives. However, predicting the performance of memory systems is complex, compounded by the myriad design choices and parameters along multiple dimensions, namely (i) technology, (ii) design and (iii) architectural choices. In this work, we construct an analytical model of the memory system to comprehend this diverse space and to study the impact of memory system parameters from latency and bandwidth perspectives. Our model, called ANATOMY, consists of two key components that are coupled with each other to model the memory system accurately. The first component is a queuing model of memory which models various design choices in detail and captures the impact of technological choices in memory systems. The second component is an analytical model that summarizes key workload characteristics, namely row buffer hit rate (RBH), bank-level parallelism (BLP), and request spread (S), which are used as inputs to the queuing model to estimate memory performance. We validate the model across a wide variety of memory configurations on 4, 8 and 16 cores using a total of 44 workloads. ANATOMY is able to predict memory latency with an average error of 8.1%, 4.1% and 9.7% over 4-, 8- and 16-core configurations. We demonstrate the extensibility and applicability of our model by exploring a variety of memory design choices such as the impact of clock speed, the benefit of multiple memory controllers, the role of banks and channel width, and so on. We also demonstrate ANATOMY's ability to capture architectural elements such as scheduling mechanisms (using FR_FCFS and PAR_BS) and the impact of DRAM refresh cycles. In all of these studies, ANATOMY provides insight into sources of memory performance bottlenecks and is able to quantitatively predict the benefit of redressing them.
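As a rough illustration of this style of reasoning (these are not ANATOMY's actual equations), the sketch below blends row-hit and row-miss latencies using RBH, scales service capacity by BLP, and adds an M/M/1-style queuing term; all timing constants are assumed values.

```python
# Back-of-the-envelope memory latency estimate in the spirit of an analytical
# queuing model.  Every constant and the M/M/1 approximation are assumptions.

def estimated_memory_latency(rbh, blp, request_rate,
                             t_row_hit=15.0, t_row_miss=45.0):
    """Approximate average memory latency in nanoseconds.

    rbh: row buffer hit rate (0..1), blp: bank-level parallelism,
    request_rate: requests arriving per nanosecond.
    """
    service_time = rbh * t_row_hit + (1.0 - rbh) * t_row_miss   # per-request service
    effective_rate = blp / service_time                          # requests served per ns
    utilization = request_rate / effective_rate
    if utilization >= 1.0:
        raise ValueError("memory system saturated under these parameters")
    queuing_delay = (utilization / (1.0 - utilization)) * service_time
    return service_time + queuing_delay


print(estimated_memory_latency(rbh=0.6, blp=8, request_rate=0.05))
```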
international symposium on computer architecture | 2017
Jee Ho Ryoo; Nagendra Gulur; Shuang Song; Lizy Kurian John
With increasing deployment of virtual machines for cloud services and server applications, memory address translation overheads in virtualized environments have received great attention. In the radix-4 page tables used in x86 architectures, a TLB miss necessitates up to 24 memory references for a single guest-to-host translation. While dedicated page walk caches and other such enhancements eliminate many of these memory references, our measurements on Intel Skylake processors indicate that many programs in virtualized mode of execution still spend hundreds of cycles on translations that do not hit in the TLBs. This paper presents an innovative scheme to reduce the cost of address translations by using a very large Translation Lookaside Buffer that is part of memory, the POM-TLB. In the POM-TLB, only one access is required instead of the up to 24 accesses required in commonly used 2D walks with radix-4 page tables. Even if many of the 24 accesses hit in the page walk caches, the aggregated cost of the many hits plus the overhead of occasional misses from page walk caches still exceeds the cost of one access to the POM-TLB. Since the POM-TLB is part of the memory space, TLB entries (as opposed to multiple page table entries) can be cached in large L2 and L3 data caches, yielding significant benefits. Through detailed evaluation running SPEC, PARSEC and graph workloads, we demonstrate that the proposed POM-TLB improves performance by approximately 10% on average. The improvement is more than 16% for 5 of the benchmarks. It is further seen that a POM-TLB of 16MB size can eliminate nearly all TLB misses in 8-core systems.
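The sketch below caricatures this translation path, with dictionaries standing in for the hardware structures; the function name and the per-step costs other than the 24-reference worst case from the abstract are assumptions.

```python
# Hedged sketch of the lookup order the abstract describes: on an L1/L2 TLB
# miss, one access to a very large in-memory TLB (POM-TLB) replaces the
# up-to-24-reference nested (2D) page walk.  Dicts model hardware structures.

def translate(vpn, l1_tlb, l2_tlb, pom_tlb, page_table):
    """Return (ppn, memory_references_used) for a guest virtual page number."""
    if vpn in l1_tlb:
        return l1_tlb[vpn], 0
    if vpn in l2_tlb:
        return l2_tlb[vpn], 0
    if vpn in pom_tlb:                 # one memory (or data-cache) reference
        return pom_tlb[vpn], 1
    # Fall back to the nested guest->host walk: up to 24 references with radix-4 tables.
    return page_table[vpn], 24


l1, l2 = {}, {}
pom = {0x1234: 0xBEEF}
pt = {0x1234: 0xBEEF, 0x9999: 0xCAFE}
print(translate(0x1234, l1, l2, pom, pt))   # (0xBEEF, 1)  - served by the POM-TLB
print(translate(0x9999, l1, l2, pom, pt))   # (0xCAFE, 24) - full 2D page walk
```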
Proceedings of the Second International Symposium on Memory Systems | 2016
Nagendra Gulur; R. Govindarajan; Mahesh Mehendale
DRAM memory systems require periodic recharging to avoid loss of data from leaky capacitors. These refresh operations consume energy and reduce the time for which the DRAM banks are available to service memory requests. Higher DRAM density and 3D-stacking aggravate the refresh overheads, incurring even higher energy and performance costs. 3D-stacked DRAM and other emerging on-chip High Bandwidth Memory (HBM) technologies, which are widely considered to be changing the landscape of the memory hierarchy in future heterogeneous and many-core architectures, could suffer significantly from refresh overheads. Such large on-chip memory, when used as a very large last-level cache, however, provides opportunities for addressing the refresh overheads. In this work, we propose MicroRefresh, a scheme for almost eliminating the refresh overhead in DRAM caches. MicroRefresh eliminates unnecessary refresh of recently accessed DRAM pages; it exploits the latency difference between on-chip and off-chip DRAM and balances the use of system resources by opportunistically eliminating refresh of older DRAM pages. It tolerates any resulting increase in cache misses by leveraging the under-utilized main memory bandwidth. The resulting organization eliminates the energy and performance overhead of refresh operations in the DRAM cache to achieve overall performance and energy improvement. Across both 4-core and 8-core workloads, MicroRefresh eliminates 92% of the refresh energy consumed by the baseline periodic refresh mechanism. Further, this is accompanied by performance improvements of up to 10%, with average improvements of 3.9% and 3.4% in 4-core and 8-core workloads respectively.
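A minimal sketch of the underlying idea follows, under assumed parameters and policy (the window length and the dirty/clean handling are not taken from the paper): recently activated rows skip refresh, old dirty rows are refreshed, and old clean rows are invalidated since main memory still holds a copy.

```python
# Hypothetical refresh-planning pass for a DRAM cache, in the spirit of
# MicroRefresh.  The 64 ms window and the three-way policy are assumptions.

REFRESH_WINDOW_MS = 64.0

def plan_refresh(rows, now_ms):
    """rows: dict row_id -> {'last_access_ms': float, 'dirty': bool}."""
    refresh, skip, invalidate = [], [], []
    for row_id, meta in rows.items():
        age = now_ms - meta['last_access_ms']
        if age < REFRESH_WINDOW_MS:
            skip.append(row_id)          # recent activation already restored the charge
        elif meta['dirty']:
            refresh.append(row_id)       # must refresh: the data exists only in the cache
        else:
            invalidate.append(row_id)    # clean and old: drop it, main memory has a copy
    return refresh, skip, invalidate


rows = {
    1: {'last_access_ms': 950.0, 'dirty': False},
    2: {'last_access_ms': 100.0, 'dirty': True},
    3: {'last_access_ms': 100.0, 'dirty': False},
}
print(plan_refresh(rows, now_ms=1000.0))   # ([2], [1], [3])
```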
international conference on performance engineering | 2015
Nagendra Gulur; Mahesh Mehendale; R. Govindarajan
Stacked DRAM promises to offer unprecedented capacity and bandwidth to multi-core processors at moderately lower latency than off-chip DRAMs. A typical use of this abundant DRAM is as a large last-level cache. Prior work is divided on how to organize this cache, and the proposed organizations fall into one of two categories: (i) a Tags-In-DRAM organization with the cache organized as small blocks (typically 64B) and metadata (tags, valid, dirty, recency and coherence bits) stored in DRAM, and (ii) a Tags-In-SRAM organization with the cache organized as larger blocks (typically 512B or larger) and metadata stored in SRAM. Tags-In-DRAM organizations tend to incur higher latency but conserve off-chip bandwidth, while Tags-In-SRAM organizations incur lower latency at the cost of some additional bandwidth. In this work, we develop a unified performance model of the DRAM cache that captures these different organizational styles. The model is validated against detailed architecture simulations and shown to have latency estimation errors of 10.7% and 8.8% on average in 4-core and 8-core processors respectively. We also explore two insights from the model: (i) the need for achieving very high hit rates in the metadata cache/predictor (commonly employed in Tags-In-DRAM designs) in order to reduce latency, and (ii) opportunities for reducing latency by load-balancing the DRAM cache and main memory.
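To illustrate the first insight, the simplified comparison below shows why a Tags-In-DRAM organization needs a very accurate metadata predictor before its average latency can match Tags-In-SRAM; the timing constants and the serialized second access on a predictor miss are assumptions, not the paper's unified model.

```python
# Toy expected-latency comparison of the two organizational styles.
# All latencies (ns) and the cost structure are illustrative assumptions.

def tags_in_sram_latency(hit_rate, t_tag_sram=2.0, t_dram_access=40.0,
                         t_miss_penalty=200.0):
    # Tag check in SRAM, then one DRAM-cache access on a hit.
    return t_tag_sram + hit_rate * t_dram_access + (1 - hit_rate) * t_miss_penalty

def tags_in_dram_latency(hit_rate, predictor_hit_rate,
                         t_dram_access=40.0, t_miss_penalty=200.0):
    # With a metadata cache/predictor: a predictor hit needs one combined
    # tag+data access, a predictor miss serializes a tag access before the data.
    hit_cost = (predictor_hit_rate * t_dram_access
                + (1 - predictor_hit_rate) * 2 * t_dram_access)
    return hit_rate * hit_cost + (1 - hit_rate) * t_miss_penalty


for p in (0.80, 0.95, 0.99):
    print(p, tags_in_dram_latency(0.7, p), tags_in_sram_latency(0.7))
```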
international symposium on electronic system design | 2014
Vasant Easwaran; Nagendra Gulur; Sushaanth Srirangapathi; Mihir Mody; Rahul Gulati; Prashant Karandikar; Prithvi Shankar
As silicon architectures get more and more complex, the cost of development has scaled exponentially. IC design and embedded software are significant contributors to the overall cost of development. To enable cost-effective solutions, the most commonly used approach is to define a platform with a family of devices to address different market segments. This approach enables hardware and software reuse. The major impediment in this approach is the lack of any single metric to quantify the extent of compatibility across the family of devices. This paper presents a novel method for producing a detailed summary of the degree of architecture incompatibility. The method uses machine-readable specifications created for a family of devices and reports register-, bit-field- and enumeration-level differences between two architectures to quantify the effort involved in software development and migration. The method has been used to determine architecture incompatibility for a derivative device spun off from a platform device, which was found to be 34% incompatible.
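A hypothetical sketch of such a comparison follows, using invented specification dictionaries; it reports per-field differences and a single incompatibility percentage.

```python
# Illustrative register-level diff between two machine-readable device specs.
# The spec layout and the example devices are invented for this sketch.

def incompatibility(spec_a, spec_b):
    """Each spec maps register name -> {field name -> (bit offset, width)}."""
    mismatches, total, diffs = 0, 0, []
    for reg in sorted(set(spec_a) | set(spec_b)):
        fields_a = spec_a.get(reg, {})
        fields_b = spec_b.get(reg, {})
        for field in sorted(set(fields_a) | set(fields_b)):
            total += 1
            if fields_a.get(field) != fields_b.get(field):
                mismatches += 1
                diffs.append((reg, field, fields_a.get(field), fields_b.get(field)))
    return 100.0 * mismatches / total, diffs


platform   = {'CTRL': {'EN': (0, 1), 'MODE': (1, 2)}, 'STATUS': {'BUSY': (0, 1)}}
derivative = {'CTRL': {'EN': (0, 1), 'MODE': (2, 3)}, 'IRQ':    {'MASK': (0, 8)}}
pct, diffs = incompatibility(platform, derivative)
print(f"{pct:.0f}% incompatible")
for d in diffs:
    print(d)
```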
international conference on parallel architectures and compilation techniques | 2011
Nagendra Gulur; R. Manikantan; R. Govindarajan; Mahesh Mehendale
In this paper, based on the temporal and spatial locality characteristics of memory accesses in multicores, we propose a reorganization of the existing single large row buffer in a DRAM bank into multiple smaller row-buffers. The proposed configuration helps improve row hit rates and also brings down the energy required for row activations. The major contribution of this work is proposing such a reorganization without requiring any significant changes to the existing widely accepted DRAM specifications. Our proposed reorganization improves performance by 35.8%, 14.5% and 21.6% in quad-, eight- and sixteen-core workloads, along with a 42%, 28% and 31% reduction in DRAM energy. Additionally, we introduce a Need Based Allocation scheme for buffer management that yields further performance improvement.
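The Need Based Allocation scheme is only named in the abstract; the following hypothetical sketch shows one way a memory controller could apportion a bank's sub-row buffers across cores in proportion to their recent row-miss counts (the policy and names are assumptions, not the paper's scheme).

```python
# Hypothetical need-based apportioning of a bank's sub-row buffers across cores.

def allocate_sub_buffers(row_misses_per_core, total_sub_buffers):
    total_misses = sum(row_misses_per_core.values())
    if total_misses == 0:
        share = total_sub_buffers // max(len(row_misses_per_core), 1)
        return {core: share for core in row_misses_per_core}
    alloc = {core: max(1, round(total_sub_buffers * m / total_misses))
             for core, m in row_misses_per_core.items()}
    # Trim any rounding overshoot from the cores holding the most buffers.
    while sum(alloc.values()) > total_sub_buffers:
        alloc[max(alloc, key=alloc.get)] -= 1
    return alloc


print(allocate_sub_buffers({'core0': 120, 'core1': 40, 'core2': 40}, 8))
# e.g. {'core0': 4, 'core1': 2, 'core2': 2}
```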
international symposium on microarchitecture | 2017
Yashwant Marathe; Nagendra Gulur; Jee Ho Ryoo; Shuang Song; Lizy Kurian John
Computing in virtualized environments has become common practice for many businesses. Typically, hosting companies aim for lower operational costs by targeting high utilization of host machines, maintaining just enough machines to meet demand. In this scenario, frequent virtual machine context switches are common, resulting in increased TLB miss rates (often by over 5X when contexts are doubled) and subsequent expensive page walks. Since each TLB miss in a virtual environment initiates a 2D page walk, the data caches get filled with a large fraction of page table entries (often in excess of 50%), thereby evicting potentially more useful data contents. In this work, we propose CSALT - a Context-Switch Aware Large TLB - to address the problem of increased TLB miss rates and their adverse impact on data caches. First, we demonstrate that the CSALT architecture can effectively cope with the demands of increased context switches through its capacity to store a very large number of TLB entries. Next, we show that CSALT mitigates data cache contention caused by conflicts between data and translation entries by employing a novel TLB-Aware Cache Partitioning scheme. On 8-core systems that switch between two virtual machine contexts executing multi-threaded workloads, CSALT achieves an average performance improvement of 85% over a baseline with conventional L1-L2 TLBs and 25% over a baseline which has a large L3 TLB.
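The sketch below is a hedged illustration of TLB-aware way partitioning: cache ways are assigned greedily to whichever class (data blocks or translation entries) shows the higher marginal hit gain. The utility curves are invented inputs, and this is not claimed to be CSALT's actual mechanism.

```python
# Greedy way-partitioning between data blocks and translation (TLB) entries,
# driven by per-way marginal-utility curves.  Curves and counts are invented.

def partition_ways(total_ways, data_gain_per_way, tlb_gain_per_way,
                   min_ways_per_class=1):
    """gain_per_way[i] = extra hits gained by granting that class its (i+1)-th way."""
    data_ways, tlb_ways = min_ways_per_class, min_ways_per_class
    for _ in range(total_ways - 2 * min_ways_per_class):
        gain_data = data_gain_per_way[data_ways] if data_ways < len(data_gain_per_way) else 0
        gain_tlb = tlb_gain_per_way[tlb_ways] if tlb_ways < len(tlb_gain_per_way) else 0
        if gain_data >= gain_tlb:
            data_ways += 1
        else:
            tlb_ways += 1
    return data_ways, tlb_ways


data_curve = [900, 700, 400, 200, 100, 50, 20, 10]
tlb_curve  = [800, 500, 100, 30, 10, 5, 2, 1]
print(partition_ways(8, data_curve, tlb_curve))   # (5, 3) under these example curves
```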
ieee international conference on high performance computing data and analytics | 2015
Nagendra Gulur; L Suriya Narayanan