Publication


Featured research published by Lavanya Subramanian.


International Symposium on Computer Architecture | 2012

Staged memory scheduling: achieving high performance and scalability in heterogeneous systems

Rachata Ausavarungnirun; Kevin Kai-Wei Chang; Lavanya Subramanian; Gabriel H. Loh; Onur Mutlu

When multiple processor (CPU) cores and a GPU integrated together on the same chip share the off-chip main memory, requests from the GPU can heavily interfere with requests from the CPU cores, leading to low system performance and starvation of CPU cores. Unfortunately, state-of-the-art application-aware memory scheduling algorithms are ineffective at solving this problem at low complexity due to the large amount of GPU traffic. A large and costly request buffer is needed to provide these algorithms with enough visibility across the global request stream, requiring relatively complex hardware implementations. This paper proposes a fundamentally new approach that decouples the memory controller's three primary tasks into three significantly simpler structures that together improve system performance and fairness, especially in integrated CPU-GPU systems. Our three-stage memory controller first groups requests based on row-buffer locality. This grouping allows the second stage to focus only on inter-application request scheduling. These two stages enforce high-level policies regarding performance and fairness, and therefore the last stage consists of simple per-bank FIFO queues (no further command reordering within each bank) and straightforward logic that deals only with low-level DRAM commands and timing. We evaluate the design trade-offs involved in our Staged Memory Scheduler (SMS) and compare it against three state-of-the-art memory controller designs. Our evaluations show that SMS improves CPU performance without degrading GPU frame rate beyond a generally acceptable level, while being significantly less complex to implement than previous application-aware schedulers. Furthermore, SMS can be configured by the system software to prioritize the CPU or the GPU at varying levels to address different performance needs.
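The staged decomposition above maps naturally onto a small software model. The sketch below is illustrative only: the Request fields, the StagedScheduler class, and the batch_cap parameter are assumptions made for this example, and Stage 2 is reduced to plain FIFO batch selection rather than the paper's application-aware policy.

```python
# Minimal sketch of the three-stage Staged Memory Scheduler (SMS) idea.
from collections import deque
from dataclasses import dataclass

@dataclass
class Request:
    source: str   # e.g. "cpu0", "gpu" (illustrative)
    row: int      # DRAM row targeted by the request
    bank: int     # destination bank

class StagedScheduler:
    def __init__(self, num_banks, batch_cap=8):
        self.batch_cap = batch_cap
        # Stage 1: per-source batch formation (groups by row-buffer locality)
        self.forming = {}            # source -> requests to the same row
        self.ready_batches = deque() # completed batches awaiting scheduling
        # Stage 3: simple in-order FIFO per bank (no command reordering)
        self.bank_fifos = [deque() for _ in range(num_banks)]

    def enqueue(self, req):
        """Stage 1: add to the source's current batch; close the batch when
        the row changes or the batch reaches its size cap."""
        batch = self.forming.setdefault(req.source, [])
        if batch and (batch[0].row != req.row or len(batch) >= self.batch_cap):
            self.ready_batches.append(batch)
            batch = self.forming[req.source] = []
        batch.append(req)

    def schedule_batch(self):
        """Stage 2: pick the next ready batch (plain FIFO here; the real SMS
        applies an application-aware policy with configurable CPU/GPU
        prioritization) and hand its requests to the per-bank FIFOs."""
        if self.ready_batches:
            for req in self.ready_batches.popleft():
                self.bank_fifos[req.bank].append(req)
```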


High-Performance Computer Architecture | 2013

Tiered-latency DRAM: A low latency and low cost DRAM architecture

Donghyuk Lee; Yoongu Kim; Vivek Seshadri; Jamie Liu; Lavanya Subramanian; Onur Mutlu

The capacity and cost-per-bit of DRAM have historically scaled to satisfy the needs of increasingly large and complex computer systems. However, DRAM latency has remained almost constant, making memory latency the performance bottleneck in today's systems. We observe that the high access latency is not intrinsic to DRAM, but a trade-off made to decrease cost-per-bit. To mitigate the high area overhead of DRAM sensing structures, commodity DRAMs connect many DRAM cells to each sense-amplifier through a wire called a bitline. These bitlines have a high parasitic capacitance due to their long length, and this bitline capacitance is the dominant source of DRAM latency. Specialized low-latency DRAMs use shorter bitlines with fewer cells, but have a higher cost-per-bit due to greater sense-amplifier area overhead. In this work, we introduce Tiered-Latency DRAM (TL-DRAM), which achieves both low latency and low cost-per-bit. In TL-DRAM, each long bitline is split into two shorter segments by an isolation transistor, allowing one segment to be accessed with the latency of a short-bitline DRAM without incurring high cost-per-bit. We propose mechanisms that use the low-latency segment as a hardware-managed or software-managed cache. Evaluations show that our proposed mechanisms improve both performance and energy-efficiency for both single-core and multi-programmed workloads.
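As a rough illustration of using the near segment as a hardware-managed cache, here is a toy latency model. The latency constants, the TieredBank class, and the LRU replacement policy are assumptions for this sketch, not values or mechanisms taken from the paper.

```python
# Toy model of the TL-DRAM idea: each bitline is split into a fast near segment
# and a slower far segment, and the near segment caches hot rows.
NEAR_LATENCY_NS = 8    # hypothetical access latency in the near segment
FAR_LATENCY_NS = 14    # hypothetical access latency in the far segment

class TieredBank:
    def __init__(self, near_rows):
        self.near_rows = near_rows   # capacity of the near segment
        self.near_cache = []         # rows currently cached near (LRU order)

    def access(self, row):
        """Return the access latency and manage the near segment as a cache."""
        if row in self.near_cache:
            self.near_cache.remove(row)
            self.near_cache.append(row)   # refresh LRU position
            return NEAR_LATENCY_NS
        # Miss: serve from the far segment, then cache the row in the near segment.
        if len(self.near_cache) >= self.near_rows:
            self.near_cache.pop(0)        # evict the least recently used row
        self.near_cache.append(row)
        return FAR_LATENCY_NS
```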


International Conference on Computer Design | 2014

The Blacklisting Memory Scheduler: Achieving high performance and fairness at low cost

Lavanya Subramanian; Donghyuk Lee; Vivek Seshadri; Harsha Rastogi; Onur Mutlu

In a multicore system, applications running on different cores interfere at main memory. This inter-application interference degrades overall system performance and unfairly slows down applications. Prior works have developed application-aware memory request schedulers to tackle this problem. State-of-the-art application-aware memory request schedulers prioritize memory requests of applications that are vulnerable to interference, by ranking individual applications based on their memory access characteristics and enforcing a total rank order. In this paper, we observe that state-of-the-art application-aware memory schedulers have two major shortcomings. First, ranking applications individually with a total order based on memory access characteristics leads to high hardware cost and complexity. Second, ranking can unfairly slow down applications that are at the bottom of the ranking stack. To overcome these shortcomings, we propose the Blacklisting Memory Scheduler (BLISS), which achieves high system performance and fairness while incurring low hardware cost and complexity. The BLISS design is based on two new observations. First, we find that, to mitigate interference, it is sufficient to separate applications into only two groups, one containing applications that cause interference and another containing applications vulnerable to interference, instead of ranking individual applications with a total order. The vulnerable-to-interference group is prioritized over the interference-causing group. Second, we show that this grouping can be efficiently performed by simply counting the number of consecutive requests served from each application - an application that has a large number of consecutive requests served is dynamically classified as interference-causing. We evaluate BLISS across a wide variety of workloads and system configurations and compare its performance and complexity with five state-of-the-art memory schedulers. Our evaluations show that BLISS achieves 5% better system performance and 25% better fairness than the best-performing previous memory scheduler while greatly reducing critical path latency and hardware area cost of the memory scheduler (by 79% and 43%, respectively).
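The grouping mechanism lends itself to a compact sketch. In the code below, the streak threshold, the clearing interval, and the request attributes (app_id, row_hit, arrival_time) are illustrative assumptions; the actual scheduler implements this logic in hardware.

```python
# Minimal sketch of the BLISS grouping idea: count consecutive requests served
# from the same application and blacklist it once the streak exceeds a threshold;
# blacklisted (interference-causing) applications get lower priority.
class BlacklistingScheduler:
    def __init__(self, streak_threshold=4, clear_interval=10_000):
        self.streak_threshold = streak_threshold
        self.clear_interval = clear_interval   # cycles between blacklist resets
        self.last_app = None
        self.streak = 0
        self.blacklist = set()

    def on_request_served(self, app_id):
        """Update the consecutive-service counter after each served request."""
        if app_id == self.last_app:
            self.streak += 1
        else:
            self.last_app, self.streak = app_id, 1
        if self.streak >= self.streak_threshold:
            self.blacklist.add(app_id)          # classified as interference-causing

    def pick(self, pending):
        """Prefer requests from non-blacklisted (vulnerable) applications,
        then row-buffer hits, then older requests."""
        return min(pending,
                   key=lambda r: (r.app_id in self.blacklist,
                                  not r.row_hit,
                                  r.arrival_time))

    def maybe_clear(self, cycle):
        """Periodically clear the blacklist so classifications stay current."""
        if cycle % self.clear_interval == 0:
            self.blacklist.clear()
```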


International Symposium on Microarchitecture | 2015

The application slowdown model: quantifying and controlling the impact of inter-application interference at shared caches and main memory

Lavanya Subramanian; Vivek Seshadri; Arnab Ghosh; Samira Manabi Khan; Onur Mutlu

In a multi-core system, interference at shared resources (such as caches and main memory) slows down applications running on different cores. Accurately estimating the slowdown of each application has several benefits: e.g., it can enable shared resource allocation in a manner that avoids unfair application slowdowns or provides slowdown guarantees. Unfortunately, prior works on estimating slowdowns either lead to inaccurate estimates, do not take into account shared caches, or rely on a priori application knowledge. This severely limits their applicability. In this work, we propose the Application Slowdown Model (ASM), a new technique that accurately estimates application slowdowns due to interference at both the shared cache and main memory, in the absence of a priori application knowledge. ASM is based on the observation that the performance of each application is strongly correlated with the rate at which the application accesses the shared cache. Thus, ASM reduces the problem of estimating slowdown to that of estimating the shared cache access rate of the application had it been run alone on the system. To estimate this for each application, ASM periodically 1) minimizes interference for the application at the main memory, and 2) quantifies the interference the application receives at the shared cache, in an aggregate manner for a large set of requests. Our evaluations across 100 workloads show that ASM has an average slowdown estimation error of only 9.9%, a 2.97× improvement over the best previous mechanism. We present several use cases of ASM that leverage its slowdown estimates to improve fairness and performance, and to provide slowdown guarantees. We provide detailed evaluations of three such use cases: slowdown-aware cache partitioning, slowdown-aware memory bandwidth partitioning, and an example scheme to provide soft slowdown guarantees. Our evaluations show that these new schemes perform significantly better than state-of-the-art cache partitioning and memory scheduling schemes.
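The core estimate reduces to a simple ratio, sketched below with hypothetical numbers; the function name and the example access rates are assumptions made for illustration.

```python
# Sketch of ASM's central estimate: slowdown is approximated by the ratio of an
# application's shared-cache access rate when run alone to its access rate when
# sharing the system. The alone rate is itself estimated during brief phases in
# which the application's memory requests are prioritized.
def estimated_slowdown(cache_access_rate_alone, cache_access_rate_shared):
    """Slowdown ~= CAR_alone / CAR_shared, per the ASM observation that
    performance tracks the shared-cache access rate."""
    if cache_access_rate_shared <= 0:
        raise ValueError("shared access rate must be positive")
    return cache_access_rate_alone / cache_access_rate_shared

# Example (hypothetical numbers): an application issuing 90 shared-cache accesses
# per kilocycle alone but only 60 when co-running is estimated to slow down 1.5x.
print(estimated_slowdown(90, 60))   # -> 1.5
```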


ACM Transactions on Architecture and Code Optimization | 2016

DASH: Deadline-Aware High-Performance Memory Scheduler for Heterogeneous Systems with Hardware Accelerators

Hiroyuki Usui; Lavanya Subramanian; Kevin Kai-Wei Chang; Onur Mutlu

Modern SoCs integrate multiple CPU cores and hardware accelerators (HWAs) that share the same main memory system, causing interference among memory requests from different agents. The result of this interference, if it is not controlled well, is missed deadlines for HWAs and low CPU performance. Few previous works have tackled this problem. State-of-the-art mechanisms designed for CPU-GPU systems strive to meet a target frame rate for GPUs by prioritizing the GPU close to the time when it has to complete a frame. We observe two major problems when such an approach is adapted to a heterogeneous CPU-HWA system. First, HWAs miss deadlines because they are prioritized only when close to their deadlines. Second, such an approach does not consider the diverse memory access characteristics of different applications running on CPUs and HWAs, leading to low performance for latency-sensitive CPU applications and deadline misses for some HWAs, including GPUs. In this article, we propose a Deadline-Aware memory Scheduler for Heterogeneous systems (DASH), which overcomes these problems using three key ideas, with the goal of meeting HWAs’ deadlines while providing high CPU performance. First, DASH prioritizes an HWA when it is not on track to meet its deadline any time during a deadline period, instead of prioritizing it only when close to a deadline. Second, DASH prioritizes HWAs over memory-intensive CPU applications based on the observation that memory-intensive applications’ performance is not sensitive to memory latency. Third, DASH treats short-deadline HWAs differently as they are more likely to miss their deadlines and schedules their requests based on worst-case memory access time estimates. Extensive evaluations across a wide variety of different workloads and systems show that DASH achieves significantly better CPU performance than the best previous scheduler while always meeting the deadlines for all HWAs, including GPUs, thereby largely improving frame rates.
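DASH's first idea, prioritizing an accelerator whenever it falls behind its expected progress rather than only near its deadline, can be sketched as a simple check. The progress metric below (fraction of work done versus fraction of the deadline period elapsed) is a simplified stand-in for the paper's mechanism, and all names and numbers are illustrative.

```python
# Sketch of DASH's "on track" test for a hardware accelerator (HWA).
def hwa_needs_priority(work_done, total_work, time_elapsed, deadline_period):
    """Return True when the HWA's completed fraction of work lags the elapsed
    fraction of its deadline period, i.e. it is falling behind."""
    expected_progress = time_elapsed / deadline_period
    actual_progress = work_done / total_work
    return actual_progress < expected_progress

# Example: 30% of a frame computed with 50% of the period elapsed -> prioritize.
print(hwa_needs_priority(work_done=30, total_work=100,
                         time_elapsed=5_000, deadline_period=10_000))  # True
```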


International Conference on Parallel Architectures and Compilation Techniques | 2015

Decoupled Direct Memory Access: Isolating CPU and IO Traffic by Leveraging a Dual-Data-Port DRAM

Donghyuk Lee; Lavanya Subramanian; Rachata Ausavarungnirun; Jongmoo Choi; Onur Mutlu

Memory channel contention is a critical performance bottleneck in modern systems that have highly parallelized processing units operating on large data sets. The memory channel is contended not only by requests from different user applications (CPU access) but also by system requests for peripheral data (IO access), usually controlled by Direct Memory Access (DMA) engines. Our goal, in this work, is to improve system performance by eliminating memory channel contention between CPU accesses and IO accesses. To this end, we propose a hardware-software cooperative data transfer mechanism, Decoupled DMA (DDMA), that provides a specialized low-cost memory channel for IO accesses. In our DDMA design, main memory has two independent data channels, of which one is connected to the processor (CPU channel) and the other to the IO devices (IO channel), enabling CPU and IO accesses to be served on different channels. System software or the compiler identifies which requests should be handled on the IO channel and communicates this to the DDMA engine, which then initiates the transfers on the IO channel. By doing so, our proposal increases the effective memory channel bandwidth, thereby either accelerating data transfers between system components, or providing opportunities to employ IO performance enhancement techniques (e.g., aggressive IO prefetching) without interfering with CPU accesses. We demonstrate the effectiveness of our DDMA framework in two scenarios: (i) CPU-GPU communication and (ii) in-memory communication (bulk data copy/initialization within the main memory). By effectively decoupling accesses for CPU-GPU communication and in-memory communication from CPU accesses, our DDMA-based design achieves significant performance improvement across a wide variety of system configurations (e.g., 20% average performance improvement on a typical 2-channel 2-rank memory system).
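A minimal sketch of the request-steering idea follows, assuming a software model with two independent queues; the class and method names are illustrative and the per-cycle service model is greatly simplified relative to real DRAM timing.

```python
# Sketch of the DDMA idea: system software tags IO-style bulk transfers, and the
# memory system steers them to a separate IO data channel so they do not contend
# with CPU demand accesses.
from collections import deque

class DualChannelMemory:
    def __init__(self):
        self.cpu_channel = deque()   # demand requests from the processor cores
        self.io_channel = deque()    # bulk/IO transfers driven by the DDMA engine

    def submit(self, request, is_io_transfer):
        """System software (or the compiler) decides is_io_transfer; the DDMA
        engine then drives such transfers on the dedicated IO channel."""
        (self.io_channel if is_io_transfer else self.cpu_channel).append(request)

    def tick(self):
        """Each cycle, both channels can transfer data independently, which is
        where the decoupling (and the extra effective bandwidth) comes from."""
        served = []
        if self.cpu_channel:
            served.append(self.cpu_channel.popleft())
        if self.io_channel:
            served.append(self.io_channel.popleft())
        return served
```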


Virtual Execution Environments | 2015

A-DRM: Architecture-aware Distributed Resource Management of Virtualized Clusters

Hui Wang; Canturk Isci; Lavanya Subramanian; Jongmoo Choi; Depei Qian; Onur Mutlu

Virtualization technologies have been widely adopted by large-scale cloud computing platforms. These virtualized systems employ distributed resource management (DRM) to achieve high resource utilization and energy savings by dynamically migrating and consolidating virtual machines. DRM schemes usually use operating-system-level metrics, such as CPU utilization, memory capacity demand and I/O utilization, to detect and balance resource contention. However, they are oblivious to microarchitecture-level resource interference (e.g., memory bandwidth contention between different VMs running on a host), which is currently not exposed to the operating system. We observe that the lack of visibility into microarchitecture-level resource interference significantly impacts the performance of virtualized systems. Motivated by this observation, we propose a novel architecture-aware DRM scheme (A-DRM), which takes into account microarchitecture-level resource interference when making migration decisions in a virtualized cluster. A-DRM makes use of three core techniques: 1) a profiler to monitor the microarchitecture-level resource usage behavior online for each physical host, 2) a memory bandwidth interference model to assess the interference degree among virtual machines on a host, and 3) a cost-benefit analysis to determine a candidate virtual machine and a host for migration. Real system experiments on thirty randomly selected combinations of applications from the CPU2006, PARSEC, STREAM, and NAS Parallel Benchmark suites in a four-host virtualized cluster show that A-DRM can improve performance by up to 26.55%, with an average of 9.67%, compared to traditional DRM schemes that lack visibility into microarchitecture-level resource utilization and contention.
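The cost-benefit step can be sketched as a simple heuristic over per-host bandwidth measurements. The data layout, the interference proxy (summed memory bandwidth), and the cost model below are assumptions made for illustration, not the paper's actual profiler or analysis.

```python
# Sketch of an A-DRM-style migration decision: pick the most memory-contended
# host, choose the VM contributing most to that contention, and migrate it only
# if the reduction in peak contention outweighs a migration cost.
def choose_migration(hosts, migration_cost):
    """hosts: dict host_id -> {vm_id: measured memory bandwidth demand}.
    Returns (vm, source_host, target_host) or None if migration is not worth it."""
    # Interference proxy: total memory bandwidth demand on each host.
    load = {h: sum(vms.values()) for h, vms in hosts.items()}
    src = max(load, key=load.get)            # most contended host
    dst = min(load, key=load.get)            # least contended host
    if src == dst or not hosts[src]:
        return None
    # Candidate: the VM contributing most to contention on the source host.
    vm = max(hosts[src], key=hosts[src].get)
    bw = hosts[src][vm]
    new_peak = max(load[src] - bw, load[dst] + bw)
    benefit = load[src] - new_peak           # reduction in peak contention
    return (vm, src, dst) if benefit > migration_cost else None
```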


Measurement and Modeling of Computer Systems | 2017

Design-Induced Latency Variation in Modern DRAM Chips: Characterization, Analysis, and Latency Reduction Mechanisms

Donghyuk Lee; Samira Manabi Khan; Lavanya Subramanian; Saugata Ghose; Rachata Ausavarungnirun; Gennady Pekhimenko; Vivek Seshadri; Onur Mutlu

Variation has been shown to exist across the cells within a modern DRAM chip. Prior work has studied and exploited several forms of variation, such as manufacturing-process- or temperature-induced variation. We empirically demonstrate a new form of variation that exists within a real DRAM chip, induced by the design and placement of different components in the DRAM chip: different regions in DRAM, based on their relative distances from the peripheral structures, require different minimum access latencies for reliable operation. In particular, we show that in most real DRAM chips, cells closer to the peripheral structures can be accessed much faster than cells that are farther. We call this phenomenon design-induced variation in DRAM. Our goals are to i) understand design-induced variation that exists in real, state-of-the-art DRAM chips, ii) exploit it to develop low-cost mechanisms that can dynamically find and use the lowest latency at which to operate a DRAM chip reliably, and, thus, iii) improve overall system performance while ensuring reliable system operation. To this end, we first experimentally demonstrate and analyze design-induced variation in modern DRAM devices by testing and characterizing 96 DIMMs (768 DRAM chips). Our experimental study shows that i) modern DRAM chips exhibit design-induced latency variation in both row and column directions, ii) access latency gradually increases in the row direction within a DRAM cell array (mat) and this pattern repeats in every mat, and iii) some columns require higher latency than others due to the internal hierarchical organization of the DRAM chip. Our characterization identifies DRAM regions that are vulnerable to errors, if operated at lower latency, and finds consistency in their locations across a given DRAM chip generation, due to design-induced variation. Variations in the vertical and horizontal dimensions, together, divide the cell array into heterogeneous-latency regions, where cells in some regions require longer access latencies for reliable operation. Reducing the latency uniformly across all regions in DRAM would improve performance, but can introduce failures in the inherently slower regions that require longer access latencies for correct operation. We refer to these inherently slower regions of DRAM as design-induced vulnerable regions. Based on our extensive experimental analysis, we develop two mechanisms that reliably reduce DRAM latency. First, DIVA Profiling uses runtime profiling to dynamically identify the lowest DRAM latency that does not introduce failures. DIVA Profiling exploits design-induced variation and periodically profiles only the vulnerable regions to determine the lowest DRAM latency at low cost. It is the first mechanism to dynamically determine the lowest latency that can be used to operate DRAM reliably. DIVA Profiling reduces the latency of read/write requests by 35.1%/57.8%, respectively, at 55°C. Our second mechanism, DIVA Shuffling, shuffles data such that values stored in vulnerable regions are mapped to multiple error-correcting code (ECC) codewords. As a result, DIVA Shuffling can correct 26% more multi-bit errors than conventional ECC. Combined, our two mechanisms reduce read/write latency by 40.0%/60.5%, which translates to an overall system performance improvement of 14.7%/13.7%/13.8% (in 2-/4-/8-core systems) over a variety of workloads, while ensuring reliable operation.
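A sketch of the DIVA Profiling search loop follows; the test_region hook and the candidate-latency list are hypothetical stand-ins for the actual online test, which would issue reads and writes to the known vulnerable rows and columns at each trial timing.

```python
# Sketch of the DIVA Profiling idea: because design-induced variation makes the
# slowest ("vulnerable") regions predictable for a chip generation, only those
# regions need to be tested to find the lowest latency at which the whole chip
# operates reliably.
def diva_profile(vulnerable_regions, candidate_latencies_ns, test_region):
    """candidate_latencies_ns: trial latencies, tried from lowest to highest.
    test_region(region, latency) -> True if the region passes at that latency.
    Returns the lowest latency at which every vulnerable region passes."""
    for latency in sorted(candidate_latencies_ns):
        if all(test_region(region, latency) for region in vulnerable_regions):
            return latency   # lowest reliable latency found
    raise RuntimeError("no candidate latency was reliable; fall back to default timing")
```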


International Symposium on Microarchitecture | 2011

Reducing memory interference in multicore systems via application-aware memory channel partitioning

Sai Prashanth Muralidhara; Lavanya Subramanian; Onur Mutlu; Mahmut T. Kandemir; Thomas Moscibroda


High-Performance Computer Architecture | 2013

MISE: Providing performance predictability and improving fairness in shared main memory systems

Lavanya Subramanian; Vivek Seshadri; Yoongu Kim; Ben Jaiyen; Onur Mutlu

Collaboration


Dive into Lavanya Subramanian's collaborations.

Top Co-Authors

Vivek Seshadri
Carnegie Mellon University

Donghyuk Lee
Carnegie Mellon University

Yoongu Kim
Carnegie Mellon University

Gabriel H. Loh
Georgia Institute of Technology

Jamie Liu
Carnegie Mellon University