Network


Latest external collaborations at the country level.

Hotspot


Dive into the research topics where Sreenivas Subramoney is active.

Publication


Featured research published by Sreenivas Subramoney.


programming language design and implementation | 2004

Prefetch injection based on hardware monitoring and object metadata

Ali-Reza Adl-Tabatabai; Richard L. Hudson; Mauricio J. Serrano; Sreenivas Subramoney

Cache miss stalls hurt performance because of the large gap between memory and processor speeds - for example, the popular server benchmark SPEC JBB2000 spends 45% of its cycles stalled waiting for memory requests on the Itanium® 2 processor. Traversing linked data structures causes a large portion of these stalls. Prefetching for linked data structures remains a major challenge because serial data dependencies between elements in a linked data structure preclude the timely materialization of prefetch addresses. This paper presents Mississippi Delta (MS Delta), a novel technique for prefetching linked data structures that closely integrates the hardware performance monitor (HPM), the garbage collector's global view of heap and object layout, the type-level metadata inherent in type-safe programs, and JIT compiler analysis. The garbage collector uses the HPM's data cache miss information to identify cache-miss-intensive traversal paths through linked data structures, and then discovers regular distances (deltas) between these linked objects. JIT compiler analysis injects prefetch instructions using deltas to materialize prefetch addresses. We have implemented MS Delta in a fully dynamic profile-guided optimization system: the StarJIT dynamic compiler [1] and the ORP Java virtual machine [9]. We demonstrate a 28-29% reduction in stall cycles attributable to the high-latency cache misses targeted by MS Delta and a speedup of 11-14% on the cache-miss-intensive SPEC JBB2000 benchmark.
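
As a rough illustration of the delta-discovery and injection steps described above, here is a minimal Python sketch; the Node layout, the prefetch callback, and the recurrence threshold are hypothetical stand-ins for the GC/HPM plumbing and the JIT-injected prefetch instructions, not the paper's implementation.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional

@dataclass
class Node:
    """Hypothetical heap object; 'address' stands in for its heap address."""
    address: int
    next: Optional["Node"] = None

def discover_deltas(hot_addresses, top_k=2):
    """Find recurring address deltas between consecutive miss-intensive
    objects on a traversal path, as the GC phase of MS Delta does."""
    deltas = Counter(b - a for a, b in zip(hot_addresses, hot_addresses[1:]))
    return [d for d, n in deltas.most_common(top_k) if n > 1]

def traverse_with_prefetch(head, deltas, prefetch):
    """Traversal loop as the JIT might rewrite it: issue prefetch hints for
    the addresses predicted by the deltas while visiting each node."""
    node = head
    while node is not None:
        for d in deltas:
            prefetch(node.address + d)   # hint only; harmless if wrong
        node = node.next
```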


international symposium on computer architecture | 2011

Bypass and insertion algorithms for exclusive last-level caches

Jayesh Gaur; Mainak Chaudhuri; Sreenivas Subramoney

Inclusive last-level caches (LLCs) waste precious silicon estate due to cross-level replication of cache blocks. As the industry moves toward cache hierarchies with larger inner levels, this wasted cache space leads to bigger performance losses compared to exclusive LLCs. However, exclusive LLCs make the design of replacement policies more challenging. While in an inclusive LLC a block can gather a filtered access history, this is not possible in an exclusive design because the block is de-allocated from the LLC on a hit. As a result, the popular least-recently-used replacement policy and its approximations are rendered ineffective and proper choice of insertion ages of cache blocks becomes even more important in exclusive designs. On the other hand, it is not necessary to fill every block into an exclusive LLC. This is known as selective cache bypassing and is not possible to implement in an inclusive LLC because that would violate inclusion. This paper explores insertion and bypass algorithms for exclusive LLCs. Our detailed execution-driven simulation results show that a combination of our best insertion and bypass policies delivers an improvement of up to 61.2% and on average (geometric mean) 3.4% in terms of instructions retired per cycle (IPC) for 97 single-threaded dynamic instruction traces spanning selected SPEC 2006 and server applications, running on a 2 MB 16-way exclusive LLC compared to a baseline exclusive design in the presence of well-tuned multi-stream hardware prefetchers. The corresponding improvements in throughput for 35 4-way multi-programmed workloads running with an 8 MB 16-way shared exclusive LLC are 20.6% (maximum) and 2.5% (geometric mean).
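
The insertion-age and bypass decision can be pictured with a small sketch like the following; the per-class counters, thresholds, and age values are assumptions chosen for illustration, not the specific algorithms evaluated in the paper.

```python
def llc_fill_decision(class_id, reuse_counters):
    """Decide what to do with a block evicted from the L2 and offered to an
    exclusive LLC. 'reuse_counters' maps a block class to (llc_hits, llc_fills)
    observed for that class; thresholds and ages are illustrative."""
    hits, fills = reuse_counters.get(class_id, (0, 0))
    reuse_prob = hits / fills if fills else 0.5   # unknown class: stay neutral
    if reuse_prob < 0.05:
        return "bypass"        # do not allocate the block in the LLC at all
    # Insert with a younger age (evicted later) when the class reuses well.
    return "insert_age_0" if reuse_prob > 0.5 else "insert_age_2"
```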


international conference on parallel architectures and compilation techniques | 2012

Introducing hierarchy-awareness in replacement and bypass algorithms for last-level caches

Mainak Chaudhuri; Jayesh Gaur; Nithiyanandan Bashyam; Sreenivas Subramoney; Joseph Nuzman

The replacement policies for the last-level caches (LLCs) are usually designed based on the access information available locally at the LLC. These policies are inherently sub-optimal due to lack of information about the activities in the inner levels of the hierarchy. This paper introduces cache hierarchy-aware replacement (CHAR) algorithms for inclusive LLCs (or L3 caches) and applies the same algorithms to implement efficient bypass techniques for exclusive LLCs in a three-level hierarchy. In a hierarchy with an inclusive LLC, these algorithms mine the L2 cache eviction stream and decide if a block evicted from the L2 cache should be made a victim candidate in the LLC based on the access pattern of the evicted block. Ours is the first proposal that explores the possibility of using a subset of L2 cache eviction hints to improve the replacement algorithms of an inclusive LLC. The CHAR algorithm classifies the blocks residing in the L2 cache based on their reuse patterns and dynamically estimates the reuse probability of each class of blocks to generate selective replacement hints to the LLC. Compared to the static re-reference interval prediction (SRRIP) policy, our proposal offers an average reduction of 10.9% in LLC misses and an average improvement of 3.8% in instructions retired per cycle (IPC) for twelve single-threaded applications. The corresponding reduction in LLC misses for one hundred 4-way multi-programmed workloads is 6.8% leading to an average improvement of 3.9% in throughput. Finally, our proposal achieves an 11.1% reduction in LLC misses and a 4.2% reduction in parallel execution cycles for six 8-way threaded shared memory applications compared to the SRRIP policy. In a cache hierarchy with an exclusive LLC, our CHAR proposal offers an effective algorithm for selecting the subset of blocks (clean or dirty) evicted from the L2 cache that need not be written to the LLC and can be bypassed. Compared to the TC-AGE policy (analogue of SRRIP for exclusive LLC), our best exclusive LLC proposal improves average throughput by 3.2% while saving an average of 66.6% of data transactions from the L2 cache to the on-die interconnect for one hundred 4-way multi-programmed workloads. Compared to an inclusive LLC design with an identical hierarchy, this corresponds to an average throughput improvement of 8.2% with only 17% more data write transactions originating from the L2 cache.
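
A toy rendition of the hint-generation step, assuming blocks have already been grouped into reuse classes; the counters and the 0.5 threshold are illustrative rather than CHAR's exact bookkeeping.

```python
from collections import defaultdict

class CharHints:
    """Per reuse class, track how often blocks evicted from the L2 are
    referenced again, and emit a hint with each eviction telling the LLC
    whether the block is a good victim candidate (inclusive LLC) or can be
    bypassed (exclusive LLC)."""
    def __init__(self):
        self.evictions = defaultdict(int)
        self.reuses = defaultdict(int)

    def on_l2_eviction(self, class_id):
        self.evictions[class_id] += 1
        reuse_prob = self.reuses[class_id] / self.evictions[class_id]
        return "likely_dead" if reuse_prob < 0.5 else "likely_live"

    def on_reuse_after_eviction(self, class_id):
        self.reuses[class_id] += 1
```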


international symposium on memory management | 2000

Cycles to recycle: garbage collection to the IA-64

Richard L. Hudson; J. Eliot B. Moss; Sreenivas Subramoney; Weldon Washburn

The IA-64, Intel's 64-bit instruction set architecture, exhibits a number of interesting architectural features. Here we consider those features as they relate to supporting garbage collection (GC). We aim to assist GC and compiler implementors by describing how one may exploit features of the IA-64. Along the way, we record some previously unpublished object scanning techniques, and offer novel ones for object allocation (suggesting some simple operating system support that would simplify it) and the Java “jsr problem”. We also discuss ordering of memory accesses and how the IA-64 can achieve publication safety efficiently. While our focus is not on any particular GC implementation or programming language, we draw on our experience designing and implementing GC for the Intel Java Virtual Machine for the IA-64.
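
To make the object-allocation discussion concrete, here is a generic bump-pointer allocator sketch; it shows the common allocation fast path that such techniques streamline, not the IA-64-specific sequence or the operating system support proposed in the paper.

```python
class BumpAllocator:
    """Generic bump-pointer allocation of the kind a garbage-collected
    runtime uses; shown for illustration only."""
    def __init__(self, start, size):
        self.free = start
        self.limit = start + size

    def allocate(self, nbytes, align=8):
        addr = (self.free + align - 1) & ~(align - 1)   # align the bump pointer
        if addr + nbytes > self.limit:
            return None        # slow path: grab a new chunk or trigger a GC
        self.free = addr + nbytes
        return addr
```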


international symposium on microarchitecture | 2013

Efficient management of last-level caches in graphics processors for 3D scene rendering workloads

Jayesh Gaur; Raghuram Srinivasan; Sreenivas Subramoney; Mainak Chaudhuri

Three-dimensional (3D) scene rendering is implemented in the form of a pipeline in graphics processing units (GPUs). In different stages of the pipeline, different types of data get accessed. These include, for instance, vertex, depth, stencil, render target (same as pixel color), and texture sampler data. The GPUs traditionally include small caches for vertex, render target, depth, and stencil data as well as multi-level caches for the texture sampler units. Recent introduction of reasonably large last-level caches (LLCs) shared among these data streams in discrete as well as integrated graphics hardware architectures has opened up new opportunities for improving 3D rendering. The GPUs equipped with such large LLCs can enjoy far-flung intra- and inter-stream reuses. However, there is no comprehensive study that can help graphics cache architects understand how to effectively manage a large multi-megabyte LLC shared between different 3D graphics streams. In this paper, we characterize the intra-stream and inter-stream reuses in 52 frames captured from eight DirectX game titles and four DirectX benchmark applications spanning three different frame resolutions. Based on this characterization, we propose graphics stream-aware probabilistic caching (GSPC) that dynamically learns the reuse probabilities and accordingly manages the LLC of the GPU. Our detailed trace-driven simulation of a typical GPU equipped with 768 shader thread contexts, twelve fixed-function texture samplers, and an 8 MB 16-way LLC shows that GSPC saves up to 29.6% and on average 13.1% LLC misses across 52 frames compared to the baseline state-of-the-art two-bit dynamic re-reference interval prediction (DRRIP) policy. These savings in the LLC misses result in a speedup of up to 18.2% and on average 8.0%. On a 16 MB LLC, the average speedup achieved by GSPC further improves to 11.8% compared to DRRIP.
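
The core idea can be sketched as per-stream probabilistic insertion; the reuse estimate below (a plain hit-to-fill ratio) is an assumption standing in for GSPC's learned probabilities.

```python
import random

class StreamAwareLLC:
    """Each graphics stream (vertex, depth, stencil, color, texture) gets a
    learned reuse estimate, and fills from a stream are accepted into the LLC
    with that probability."""
    def __init__(self, streams):
        self.hits = {s: 1 for s in streams}    # optimistic initial estimate
        self.fills = {s: 2 for s in streams}

    def on_hit(self, stream):
        self.hits[stream] += 1

    def should_insert(self, stream):
        self.fills[stream] += 1
        reuse_prob = self.hits[stream] / self.fills[stream]
        return random.random() < reuse_prob    # probabilistic allocation
```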


international symposium on computer architecture | 2016

Base-victim compression: an opportunistic cache compression architecture

Jayesh Gaur; Alaa R. Alameldeen; Sreenivas Subramoney

The memory wall has motivated many enhancements to cache management policies aimed at reducing misses. Cache compression has been proposed to increase effective cache capacity, which potentially reduces capacity and conflict misses. However, complexity in cache compression implementations could increase cache power and access latency. On the other hand, advanced cache replacement mechanisms use heuristics to reduce misses, leading to significant performance gains. Both cache compression and replacement policies should collaborate to improve performance. In this paper, we demonstrate that cache compression and replacement policies can interact negatively. In many workloads, performance gains from replacement policies are lost due to the need to alter the replacement policy to accommodate compression. This leads to sub-optimal replacement policies that could lose performance compared to an uncompressed cache. We introduce a novel, opportunistic cache compression mechanism, Base-Victim, based on an efficient cache design. Our compression architecture improves performance on top of advanced cache replacement policies, and guarantees a hit rate at least as high as that of an uncompressed cache. For cache-sensitive applications, Base-Victim achieves an average 7.3% performance gain for single-threaded workloads, and 8.7% gain for four-thread multi-program workload mixes.
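
A heavily simplified sketch of the hit-rate guarantee: the baseline-managed ways are untouched by compression, and would-be victims are only kept opportunistically. The FIFO stand-in for the replacement policy and the assumption that every victim fits in the compressed leftover space are illustrative, not the Base-Victim hardware design.

```python
class BaseVictimSet:
    """One cache set: 'base' is managed exactly as an uncompressed cache
    would manage it, so its hit rate is never hurt by compression; blocks the
    baseline policy evicts are opportunistically kept in 'victims', which
    stands in for compressed space freed up by compression."""
    def __init__(self, ways):
        self.ways = ways
        self.base = []       # what the baseline policy would keep
        self.victims = []    # compressed lines living in freed-up space

    def lookup(self, tag):
        return tag in self.base or tag in self.victims

    def fill(self, tag):
        if tag in self.victims:
            self.victims.remove(tag)
        if len(self.base) >= self.ways:
            evicted = self.base.pop(0)       # baseline (FIFO here) victim
            self.victims.append(evicted)     # keep it if compression allows
            while len(self.victims) > self.ways:
                self.victims.pop(0)
        self.base.append(tag)
```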


design, automation, and test in europe | 2016

Machine Learned Machines: Adaptive co-optimization of caches, cores, and On-chip Network

Rahul Jain; Preeti Ranjan Panda; Sreenivas Subramoney

Modern multicore architectures require runtime optimization techniques to address the mismatch between the dynamic resource requirements of different processes and the runtime allocation. Choosing between multiple optimizations at runtime is complex because their effects are not additive, which makes the adaptiveness of machine learning techniques useful. We present a novel method, Machine Learned Machines (MLM), that uses Online Reinforcement Learning (RL) to perform dynamic partitioning of the last level cache (LLC), along with dynamic voltage and frequency scaling (DVFS) of the core and uncore (interconnection network and LLC). We show that the co-optimization results in a much lower energy-delay product (EDP) than any of the techniques applied individually. The results show an average EDP improvement of 19.6% and an execution time improvement of 2.6% over the baseline.
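
The RL loop can be pictured as plain epsilon-greedy Q-learning over discrete joint settings, with negative EDP as the reward; the state encoding, action set, and learning constants below are assumptions, not the paper's formulation.

```python
import random
from collections import defaultdict

# Minimal epsilon-greedy Q-learning over joint (LLC partition, core DVFS,
# uncore DVFS) settings; the reward would be the negative energy-delay
# product measured over the last control interval.
q_table = defaultdict(float)

def choose_action(state, actions, eps=0.1):
    if random.random() < eps:
        return random.choice(actions)                       # explore
    return max(actions, key=lambda a: q_table[(state, a)])  # exploit

def q_update(state, action, reward, next_state, actions, alpha=0.1, gamma=0.9):
    best_next = max(q_table[(next_state, a)] for a in actions)
    q_table[(state, action)] += alpha * (reward + gamma * best_next
                                         - q_table[(state, action)])
```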


high-performance computer architecture | 2017

Near-Optimal Access Partitioning for Memory Hierarchies with Multiple Heterogeneous Bandwidth Sources

Jayesh Gaur; Mainak Chaudhuri; Sreenivas Subramoney

The memory wall continues to be a major performance bottleneck. While small on-die caches have been effective so far in hiding this bottleneck, the ever-increasing footprint of modern applications renders such caches ineffective. Recent advances in memory technologies like embedded DRAM (eDRAM) and High Bandwidth Memory (HBM) have enabled the integration of large memories on the CPU package as an additional source of bandwidth other than the DDR main memory. Because of limited capacity, these memories are typically implemented as a memory-side cache. Driven by traditional wisdom, many optimizations that target system performance have tried to maximize the hit rate of the memory-side cache. A higher hit rate enables better utilization of the cache, and is therefore believed to result in higher performance. In this paper, we challenge this traditional wisdom and present DAP, a Dynamic Access Partitioning algorithm that sacrifices cache hit rates to exploit under-utilized bandwidth available at main memory. DAP achieves a near-optimal bandwidth partitioning between the memory-side cache and main memory by using a light-weight learning mechanism that needs just sixteen bytes of additional hardware. Simulation results show a 13% average performance gain when DAP is implemented on top of a die-stacked memory-side DRAM cache. We also show that DAP delivers large performance benefits across different implementations, bandwidth points, and capacity points of the memory-side cache, making it a valuable addition to any current or future systems based on multiple heterogeneous bandwidth sources beyond the on-chip SRAM cache hierarchy.
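
A minimal sketch of the partitioning idea, assuming queue occupancies are available as a proxy for bandwidth headroom; the probability-nudging rule shown is illustrative and much cruder than DAP's actual sixteen-byte learning mechanism.

```python
import random

class AccessPartitioner:
    """A single steering probability decides what fraction of requests that
    would have been served by the memory-side cache are sent to main memory
    instead; it is nudged toward whichever side has spare bandwidth."""
    def __init__(self):
        self.p = 0.0

    def steer_to_main_memory(self):
        return random.random() < self.p

    def adapt(self, cache_queue_occupancy, ddr_queue_occupancy, step=0.02):
        if ddr_queue_occupancy < cache_queue_occupancy:
            self.p = min(1.0, self.p + step)   # DDR has headroom: steer more
        else:
            self.p = max(0.0, self.p - step)   # DDR is the bottleneck: back off
```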


asia and south pacific design automation conference | 2014

Array scalarization in high level synthesis

Preeti Ranjan Panda; Namita Sharma; Arun Kumar Pilania; Gummidipudi Krishnaiah; Sreenivas Subramoney; Ashok Jagannathan

Parallelism across loop iterations present in behavioral specifications can typically be exposed and optimized using well-known techniques such as Loop Unrolling. However, since behavioral arrays are usually mapped to memories (SRAM) during synthesis, performance bottlenecks arise due to memory port constraints. We study array scalarization, the transformation of an array into a group of scalar variables. We propose a technique for selectively scalarizing arrays to improve the performance of synthesized designs, taking into consideration both the latency benefits and the area overhead of storing array elements in discrete registers instead of denser SRAM. Our experiments on several benchmark examples indicate promising speedups of more than 10x for several designs due to scalarization.
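
The selective-scalarization decision can be caricatured as a small cost model trading cycles saved against register area; every parameter and the linear area model below are assumptions made purely for illustration.

```python
import math

def should_scalarize(n_elems, elem_bits, parallel_accesses, mem_ports,
                     reg_area_per_bit, sram_area_per_bit, max_area_overhead):
    """Estimate the cycles saved by removing memory-port serialization and
    weigh them against the area cost of holding the array in discrete
    registers rather than SRAM."""
    cycles_with_memory = math.ceil(parallel_accesses / mem_ports)
    cycles_scalarized = 1                      # accesses no longer serialized
    saves_cycles = cycles_with_memory > cycles_scalarized
    area_overhead = n_elems * elem_bits * (reg_area_per_bit - sram_area_per_bit)
    return saves_cycles and area_overhead <= max_area_overhead
```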


international symposium on computer architecture | 2018

Criticality aware tiered cache hierarchy: a fundamental relook at multi-level cache hierarchies

Anant V. Nori; Jayesh Gaur; Siddharth Rai; Sreenivas Subramoney; Hong Wang

On-die caches are a popular method to help hide the main memory latency. However, it is difficult to build large caches without substantially increasing their access latency, which in turn hurts performance. To overcome this difficulty, on-die caches are typically built as a multi-level cache hierarchy. One such popular hierarchy that has been adopted by modern microprocessors is the three level cache hierarchy. Building a three level cache hierarchy enables a low average hit latency since most requests are serviced from faster inner level caches. This has motivated recent microprocessors to deploy large level-2 (L2) caches that can help further reduce the average hit latency. In this paper, we do a fundamental analysis of the popular three level cache hierarchy and understand its performance delivery using program criticality. Through our detailed analysis we show that the current trend of increasing L2 cache sizes to reduce average hit latency is, in fact, an inefficient design choice. We instead propose Criticality Aware Tiered Cache Hierarchy (CATCH) that utilizes an accurate detection of program criticality in hardware and, using a novel set of inter-cache prefetchers, ensures that on-die data accesses that lie on the critical path of execution are served at the latency of the fastest level-1 (L1) cache. The last level cache (LLC) serves the purpose of reducing slow memory accesses, thereby making the large L2 cache redundant for most applications. The area saved by eliminating the L2 cache can then be used to create more efficient processor configurations. Our simulation results show that CATCH outperforms the three level cache hierarchy with a large 1MB L2 and exclusive LLC by an average of 8.4%, and a baseline with 256KB L2 and inclusive LLC by 10.3%. We also show that CATCH enables a powerful framework to explore broad chip-level area, performance and power trade-offs in cache hierarchy design. Supported by CATCH, we evaluate radical architecture directions such as eliminating the L2 altogether and show that such architectures can yield a 4.5% performance gain over the baseline with nearly 30% less area, or improve performance by 7.3% at the same area while reducing energy consumption by 11%.
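
One way to picture the mechanism is a table of critical load PCs that triggers promotion of their data toward the L1; how criticality is detected (here, an externally supplied flag) and the structure sizes below are placeholders, not CATCH's hardware design.

```python
class CriticalLoadTable:
    """PCs of loads found to lie on the critical path are remembered, and data
    filled into the LLC for those PCs is immediately promoted toward the L1 so
    the critical access sees L1-like latency."""
    def __init__(self, capacity=1024):
        self.capacity = capacity
        self.critical_pcs = set()

    def mark_critical(self, pc):
        if len(self.critical_pcs) < self.capacity:
            self.critical_pcs.add(pc)

    def on_llc_fill(self, pc, line_addr, prefetch_into_l1):
        if pc in self.critical_pcs:
            prefetch_into_l1(line_addr)   # move the line up the hierarchy early
```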

Collaboration


Dive into Sreenivas Subramoney's collaborations.

Top Co-Authors

Mainak Chaudhuri

Indian Institute of Technology Kanpur

Preeti Ranjan Panda

Indian Institute of Technology Delhi
