Publications


Featured research published by Vijayalakshmi Srinivasan.


International Symposium on Computer Architecture | 2009

Scalable high performance main memory system using phase-change memory technology

Moinuddin K. Qureshi; Vijayalakshmi Srinivasan; Jude A. Rivers

The memory subsystem accounts for a significant portion of the cost and power budget of a computer system, and current DRAM-based main memory systems are starting to hit power and cost limits. An alternative memory technology that uses resistance contrast in phase-change materials is being actively investigated in the circuits community. Phase Change Memory (PCM) devices offer higher density than DRAM and can help increase the main memory capacity of future systems while remaining within cost and power constraints. In this paper, we analyze a PCM-based hybrid main memory system using an architecture-level model of PCM. We explore the trade-offs for a main memory system consisting of PCM storage coupled with a small DRAM buffer. Such an architecture has the latency benefits of DRAM and the capacity benefits of PCM. Our evaluations for a baseline 16-core system with 8GB of DRAM show that, on average, PCM can reduce page faults by 5X and provide a speedup of 3X. As PCM is projected to have limited write endurance, we also propose simple organizational and management solutions for the hybrid memory that reduce the write traffic to PCM, boosting its lifetime from 3 years to 9.7 years.
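
The organization can be pictured as a small DRAM buffer caching pages of a large PCM store, with dirty pages written back to PCM only on eviction so that PCM write traffic is coalesced. A minimal behavioral sketch, assuming an LRU page buffer and page-granularity write-back (illustrative choices, not necessarily the paper's exact design):

    from collections import OrderedDict

    class HybridMemory:
        """Sketch: a small DRAM buffer in front of a large PCM store.
        Dirty pages reach PCM only on eviction, which coalesces writes
        and reduces PCM wear."""

        def __init__(self, dram_pages=1024):
            self.dram = OrderedDict()      # page -> (data, dirty); LRU order
            self.dram_pages = dram_pages
            self.pcm = {}                  # page -> data (large backing store)
            self.pcm_writes = 0            # endurance-relevant write count

        def access(self, page, data=None):
            if page in self.dram:
                self.dram.move_to_end(page)            # DRAM hit
            else:
                if len(self.dram) >= self.dram_pages:  # evict LRU page
                    victim, (vdata, dirty) = self.dram.popitem(last=False)
                    if dirty:
                        self.pcm[victim] = vdata       # one PCM write
                        self.pcm_writes += 1
                self.dram[page] = (self.pcm.get(page), False)  # fill from PCM
            if data is not None:                       # writes stay in DRAM
                self.dram[page] = (data, True)
                self.dram.move_to_end(page)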


International Symposium on Microarchitecture | 2009

Enhancing lifetime and security of PCM-based main memory with start-gap wear leveling

Moinuddin K. Qureshi; John P. Karidis; Michele M. Franceschini; Vijayalakshmi Srinivasan; Luis A. Lastras; Bulent Abali

Phase Change Memory (PCM) is an emerging memory technology that can increase main memory capacity in a cost-effective and power-efficient manner. However, PCM cells can endure only a maximum of 10^7-10^8 writes, limiting a PCM-based system to a lifetime of only a few years even under ideal conditions. Furthermore, we show that non-uniformity in writes to different cells reduces the achievable lifetime of a PCM system by 20×. Writes to PCM cells can be made uniform with wear-leveling. Unfortunately, existing wear-leveling techniques require large storage tables and indirection, resulting in significant area and latency overheads. We propose Start-Gap, a simple, novel, and effective wear-leveling technique that uses only two registers. By combining Start-Gap with simple address-space randomization techniques, we show that the achievable lifetime of the baseline 16 GB PCM-based system is boosted from 5% (with no wear-leveling) to 97% of the theoretical maximum, while incurring a total storage overhead of less than 13 bytes and obviating the latency overhead of accessing large tables. We also analyze the security vulnerabilities of memory systems that have limited write endurance, showing that under adversarial settings a PCM-based system can fail in less than one minute. We provide a simple extension to Start-Gap that makes PCM-based systems robust to such malicious attacks.
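
The mechanics behind the two registers can be sketched as follows: N logical lines live in N + 1 physical lines, a spare "gap" line migrates through memory one slot at a time, and a start register records how many full rotations have occurred. The move interval psi and the move_line helper are illustrative assumptions, and the address-space randomization the paper layers on top is omitted:

    class StartGap:
        """Behavioral sketch of Start-Gap wear leveling: the whole
        logical-to-physical mapping is captured by two registers."""

        def __init__(self, n_lines, psi=100):
            self.n = n_lines
            self.start = 0           # how far the address space has rotated
            self.gap = n_lines       # physical index of the spare (gap) line
            self.psi = psi           # move the gap once every psi writes
            self.writes = 0

        def translate(self, la):
            """Logical address -> physical address."""
            pa = (la + self.start) % self.n
            return pa + 1 if pa >= self.gap else pa   # step over the gap

        def on_write(self, move_line):
            """move_line(dst, src) copies one physical line in memory."""
            self.writes += 1
            if self.writes % self.psi != 0:
                return
            if self.gap == 0:            # gap wraps: line N moves to line 0
                move_line(0, self.n)     # and the address space rotates by one
                self.gap = self.n
                self.start = (self.start + 1) % self.n
            else:
                move_line(self.gap, self.gap - 1)     # slide the gap down
                self.gap -= 1

Because translation is two registers and an adder rather than a table lookup, the remapping adds essentially no access latency, which is the property the abstract highlights.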


International Symposium on Microarchitecture | 2010

SAFER: Stuck-At-Fault Error Recovery for Memories

Nak Hee Seong; Dong Hyuk Woo; Vijayalakshmi Srinivasan; Jude A. Rivers; Hsien-Hsin S. Lee

As technology scaling poses a threat to DRAM scaling due to physical limitations such as limited charge, alternative memory technologies, including several emerging non-volatile memories, are being explored as possible DRAM replacements. One main roadblock to wider adoption of these new memories is their limited write endurance, which leads to wear-out-related permanent failures. Furthermore, technology scaling increases the variation in cell lifetime, resulting in early failures of many cells. Existing error-correcting techniques are primarily devised for recovering from transient faults and are not suitable for recovering from permanent stuck-at faults, which tend to increase gradually with repeated write cycles. In this paper, we propose SAFER, a novel hardware-efficient multi-bit stuck-at-fault error recovery scheme for resistive memories, which can function in conjunction with existing wear-leveling techniques. SAFER exploits the key attribute that a failed cell with a stuck-at value is still readable, making it possible to continue using the failed cell to store data and thereby reducing the hardware overhead of error recovery. SAFER partitions a data block dynamically while ensuring that there is at most one failed bit per partition, and uses single-error-correction techniques per partition for fail recovery. SAFER increases the number of recoverable fails and achieves better lifetime improvement with smaller hardware overhead relative to the recently proposed Error-Correcting Pointers scheme and even an ideal Hamming code.
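
The core trick is that a stuck-at cell is still readable, so a partition with at most one failed bit can always be stored either directly or inverted. A sketch of write and read under that invariant follows; the partition layout and fault map are plain Python structures, and the hardware's dynamic repartitioning logic is omitted:

    def safer_write(data_bits, partitions, faults):
        """data_bits:  list of 0/1 values to store
        partitions: list of lists of bit indices, one list per partition
        faults:     dict {bit_index: stuck_value}
        Returns (stored_bits, per-partition invert flags)."""
        stored = list(data_bits)
        flags = []
        for group in partitions:
            stuck = [i for i in group if i in faults]
            assert len(stuck) <= 1, "SAFER invariant: <= 1 fault per partition"
            # Invert the whole partition if the data bit destined for the
            # stuck-at cell disagrees with the cell's stuck value.
            invert = bool(stuck) and data_bits[stuck[0]] != faults[stuck[0]]
            if invert:
                for i in group:
                    stored[i] ^= 1
            flags.append(invert)
            for i in stuck:                  # stuck cells keep their value
                stored[i] = faults[i]
        return stored, flags

    def safer_read(stored_bits, partitions, flags):
        """Invert flagged partitions back to recover the original data."""
        data = list(stored_bits)
        for group, invert in zip(partitions, flags):
            if invert:
                for i in group:
                    data[i] ^= 1
        return data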


International Symposium on Microarchitecture | 2009

A tagless coherence directory

Jason Zebchuk; Moinuddin K. Qureshi; Vijayalakshmi Srinivasan; Andreas Moshovos

A key challenge in architecting a CMP with many cores is maintaining cache coherence in an efficient manner. Directory-based protocols avoid the bandwidth overhead of snoop-based protocols, and therefore scale to a large number of cores. Unfortunately, conventional directory structures incur significant area overheads in larger CMPs. The tagless coherence directory (TL) is a scalable coherence solution that uses an implicit, conservative representation of sharing information. Conceptually, TL consists of a grid of small Bloom filters. The grid has one column per core and one row per cache set. TL uses 48% less area, 57% less leakage power, and 44% less dynamic energy than a conventional coherence directory for a 16-core CMP with 1MB private L2 caches. Simulations of commercial and scientific workloads indicate that TL has no statistically significant impact on performance, and incurs only a 2.5% increase in bandwidth utilization. Analytical modelling predicts that TL continues to scale well up to at least 1024 cores.
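
Conceptually, a TL lookup tests one small Bloom filter per core in the row selected by the block's set index, yielding a conservative sharer list with no false negatives. A sketch, with filter size and hash functions as illustrative choices (a real design also needs a mechanism, such as counting filters or rebuilds, to handle evictions, which is omitted here):

    import hashlib

    class TaglessDirectory:
        """Sketch of TL: a grid of small Bloom filters, one column per
        core and one row per cache set."""

        def __init__(self, n_cores, n_sets, filter_bits=64, n_hashes=2):
            self.filter_bits = filter_bits
            self.n_hashes = n_hashes
            self.n_sets = n_sets
            # grid[set][core] is one Bloom filter, stored as an int bitmap
            self.grid = [[0] * n_cores for _ in range(n_sets)]

        def _bits(self, tag):
            for i in range(self.n_hashes):
                h = hashlib.blake2b(f"{tag}:{i}".encode(), digest_size=4)
                yield int.from_bytes(h.digest(), "little") % self.filter_bits

        def insert(self, core, block_addr):
            s, tag = block_addr % self.n_sets, block_addr // self.n_sets
            for b in self._bits(tag):
                self.grid[s][core] |= 1 << b

        def sharers(self, block_addr):
            """Conservative sharer set: no false negatives, rare false
            positives (extra invalidations, never missed ones)."""
            s, tag = block_addr % self.n_sets, block_addr // self.n_sets
            return [c for c, f in enumerate(self.grid[s])
                    if all((f >> b) & 1 for b in self._bits(tag))]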


International Symposium on Performance Analysis of Systems and Software | 2014

NDC: Analyzing the impact of 3D-stacked memory+logic devices on MapReduce workloads

Seth H. Pugsley; Jeffrey Jestes; Huihui Zhang; Rajeev Balasubramonian; Vijayalakshmi Srinivasan; Alper Buyuktosunoglu; Al Davis; Feifei Li

While Processing-in-Memory has been investigated for decades, it has not been embraced commercially. A number of emerging technologies have renewed interest in this topic. In particular, the emergence of 3D stacking and the imminent release of Micron's Hybrid Memory Cube device have made it more practical to move computation near memory. However, the literature is missing a detailed analysis of a killer application that can leverage a Near Data Computing (NDC) architecture. This paper focuses on in-memory MapReduce workloads that are commercially important and especially suitable for NDC because of their embarrassing parallelism and largely localized memory accesses. The NDC architecture incorporates several simple processing cores on a separate, non-memory die in a 3D-stacked memory package; these cores can perform Map operations with efficient memory access and without hitting the bandwidth wall. This paper describes and evaluates a number of key elements necessary to realizing efficient NDC operation: (i) low-EPI cores, (ii) long daisy chains of memory devices, and (iii) the dynamic activation of cores and SerDes links. Compared to a baseline that is heavily optimized for MapReduce execution, the NDC design yields up to a 15X reduction in execution time and an 18X reduction in system energy.
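
The locality argument can be made concrete with a toy Map phase: if input records are partitioned across the stacked DRAM vaults, each near-memory core applies the map function only to its own vault's data, so the Map phase never crosses the processor-memory link. A deliberately simplified sketch (the vault partitioning and map_fn are illustrative assumptions):

    def ndc_map(vaults, map_fn):
        """Each in-stack core runs map_fn over the records resident in its
        own vault; only the much smaller intermediate results ever leave
        the memory package."""
        return [[map_fn(record) for record in vault] for vault in vaults]

    # Word-count-style Map over per-vault text records:
    # intermediate = ndc_map(vaults, lambda line: [(w, 1) for w in line.split()])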


International Conference on Parallel Architectures and Compilation Techniques | 2011

SPATL: Honey, I Shrunk the Coherence Directory

Hongzhou Zhao; Arrvindh Shriraman; Sandhya Dwarkadas; Vijayalakshmi Srinivasan

One of the key scalability challenges of on-chip coherence in a multicore chip is the coherence directory, which provides information on the sharing of cache blocks. Shadow tags that duplicate entire private cache tag arrays are widely used to minimize area overhead, but they require an energy-intensive associative search to obtain the sharing information. Recent research proposed a Tagless directory, which uses Bloom filters to summarize the tags in a cache set. The Tagless directory associates the sharing vector with the Bloom filter buckets to completely eliminate the associative lookup and reduce the directory overhead. However, Tagless still uses a full-map sharing vector to represent the sharing information, resulting in remaining area and energy challenges with increasing core counts. In this paper, we first show that due to the regular nature of applications, many Bloom filters essentially replicate the same sharing pattern. We exploit this pattern commonality and propose SPATL (Sharing-pattern based Tagless Directory), which decouples the sharing patterns from the Bloom filters and eliminates the redundant copies of sharing patterns. SPATL works with both inclusive and non-inclusive shared caches and provides 34% storage savings over Tagless, previously the most storage-efficient directory, at 16 cores. We study multiple strategies to periodically eliminate the false sharing that comes from combining sharing-pattern compression with Tagless, and demonstrate that SPATL can achieve the same level of false sharers as Tagless with 5% extra bandwidth. Finally, we demonstrate that SPATL scales even better than an idealized directory and can support 1024-core chips with less than 1% of the private cache space for data-parallel applications.
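
The decoupling can be sketched as a small interning table: every Bloom-filter bucket stores a short index into a table of distinct sharing vectors, so a pattern shared by thousands of buckets is stored exactly once. Reference counting for reclaiming dead patterns is an assumed bookkeeping choice, not necessarily the paper's mechanism:

    class PatternTable:
        """Sketch of SPATL's pattern table: buckets hold indices into this
        table instead of full sharing vectors, deduplicating identical
        sharing patterns across buckets."""

        def __init__(self):
            self.index_of = {}    # frozenset of sharer core ids -> index
            self.entries = {}     # index -> [pattern, refcount]
            self.next_idx = 0

        def intern(self, sharers):
            """Return the (possibly new) table index for this pattern."""
            key = frozenset(sharers)
            if key not in self.index_of:
                self.index_of[key] = self.next_idx
                self.entries[self.next_idx] = [key, 0]
                self.next_idx += 1
            idx = self.index_of[key]
            self.entries[idx][1] += 1        # one more bucket points here
            return idx

        def release(self, idx):
            """Called when a bucket stops pointing at this pattern."""
            entry = self.entries[idx]
            entry[1] -= 1
            if entry[1] == 0:                # last reference: reclaim entry
                del self.index_of[entry[0]]
                del self.entries[idx]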


IEEE Micro | 2014

Comparing Implementations of Near-Data Computing with In-Memory MapReduce Workloads

Seth H. Pugsley; Jeffrey Jestes; Rajeev Balasubramonian; Vijayalakshmi Srinivasan; Alper Buyuktosunoglu; Al Davis; Feifei Li

The emergence of 3D stacking and the imminent release of Micron's Hybrid Memory Cube (HMC) device have made it more practical to move computation near memory. This work presents a detailed analysis of in-memory MapReduce in the context of near-data computing (NDC). MapReduce is a good fit for NDC because it is embarrassingly parallel and has highly localized memory accesses. This article considers two NDC architectures: one that exploits HMC devices and one that does not. It thus provides insight into the benefits of different NDC approaches and quantifies the potential for improvement for an important emerging big-data workload.


Proceedings of the 2012 ACM Workshop on Relaxing Synchronization for Multicore and Manycore Scalability | 2012

Programming with relaxed synchronization

Lakshminarayanan Renganarayana; Vijayalakshmi Srinivasan; Ravi Nair; Daniel A. Prener

Synchronization overhead is a major bottleneck in scaling parallel applications to a large number of cores, and this continues to be true in spite of the various synchronization-reduction techniques that have been proposed. Previously studied synchronization-reduction techniques tacitly assume that all synchronizations specified in a source program are essential to guarantee the quality of the results produced by the program. Recently there have been proposals to relax the synchronizations in a parallel program and compute approximate results. A fundamental challenge in using relaxed synchronization is guaranteeing that the relaxed program always produces results of a specified quality. We propose a methodology that addresses this challenge in programming with relaxed synchronization. Using our methodology, programmers can systematically relax synchronization while always producing results of the same quality as the original (un-relaxed) program. We demonstrate significant speedups using our methodology on a variety of benchmarks (e.g., up to 15x on the KMeans benchmark, and up to 3x on an already highly tuned kernel from the Graph500 benchmark).
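
As a runtime pattern, the methodology amounts to: run the version with relaxed synchronization, evaluate a programmer-supplied quality metric on its output, and fall back to the fully synchronized version only when the quality bound is violated. A minimal sketch, with all function names and the scalar metric as illustrative assumptions:

    def run_with_relaxation(relaxed_kernel, exact_kernel, quality, threshold, data):
        """Execute the relaxed (less-synchronized) kernel first; keep its
        result only if it meets the quality bound, otherwise rerun the
        fully synchronized version to guarantee result quality."""
        result = relaxed_kernel(data)    # e.g., KMeans with unlocked centroid updates
        if quality(result, data) >= threshold:
            return result                # accept the approximate result
        return exact_kernel(data)        # rare fallback preserves the bound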


International Symposium on Microarchitecture | 2016

Co-designing accelerators and SoC interfaces using gem5-aladdin

Yakun Sophia Shao; Sam Likun Xi; Vijayalakshmi Srinivasan; Gu-Yeon Wei; David M. Brooks

Increasing demand for power-efficient, high-performance computing has spurred a growing number and diversity of hardware accelerators in mobile and server Systems on Chip (SoCs). This paper makes the case that co-designing the accelerator microarchitecture with the system to which it belongs is critical to balanced, efficient accelerator microarchitectures. We find that data movement and coherence management for accelerators are significant yet often unaccounted-for components of total accelerator runtime, resulting in misleading performance predictions and inefficient accelerator designs. To explore the design space of accelerator-system co-design, we develop gem5-Aladdin, an SoC simulator that captures dynamic interactions between accelerators and the SoC platform, and validate it to within 6% against real hardware. Our co-design studies show that the optimal energy-delay product (EDP) of an accelerator microarchitecture can improve by up to 7.4× when system-level effects are considered, compared to optimizing accelerators in isolation.
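
The headline metric is energy-delay product, EDP = energy x delay, so the 7.4× figure means the co-designed point wins on the product of both terms. A toy calculation with invented numbers (not from the paper) shows how ignoring data movement and coherence management skews the estimate:

    def edp(energy_joules, delay_seconds):
        """Energy-delay product, the figure of merit in the co-design study."""
        return energy_joules * delay_seconds

    # Purely illustrative numbers, not measurements from the paper:
    isolated    = edp(1.0e-3, 2.0e-4)    # accelerator modeled in isolation
    with_system = edp(1.4e-3, 3.5e-4)    # plus DMA and coherence costs
    print(with_system / isolated)        # 2.45: the isolated estimate
                                         # understates EDP by ~2.4x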


PACS'03: Proceedings of the Third International Conference on Power-Aware Computer Systems | 2003

Hot-and-Cold: using criticality in the design of energy-efficient caches

Rajeev Balasubramonian; Vijayalakshmi Srinivasan; Sandhya Dwarkadas; Alper Buyuktosunoglu

As technology scales and processor speeds improve, power has become a first-order design constraint in all aspects of processor design. In this paper, we explore the use of criticality metrics to reduce dynamic and leakage energy within data caches. We leverage the ability to predict whether an access is on the application's critical path to partition accesses into multiple streams. Accesses on the critical path are serviced by a high-performance (hot) cache bank; accesses not on the critical path are serviced by a lower-energy but lower-performance (cold) cache bank. The resulting organization is a physically banked cache with a different level of energy consumption and performance in each bank. Our results demonstrate that such a classification of instructions and data across two streams can be achieved with high accuracy. Each additional cycle in the cold cache access time slows performance down by only 0.8%. However, such a partition can increase contention for cache banks and entails non-negligible hardware overhead. While prior research has effectively employed criticality metrics to reduce power in arithmetic units, our analysis shows that the success of these techniques is limited when applied to data caches.
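
The partitioning reduces to a predictor that steers each access to one of two banks. A sketch, where a PC-indexed set stands in for the paper's criticality metrics and the latencies are illustrative values:

    class HotColdCache:
        """Sketch: accesses predicted to be on the critical path go to the
        fast, high-energy 'hot' bank; all others go to the slower,
        low-energy 'cold' bank (the abstract reports each extra cold-bank
        cycle costs only ~0.8% performance)."""

        HOT_LATENCY_CYCLES = 2     # illustrative
        COLD_LATENCY_CYCLES = 4    # illustrative

        def __init__(self):
            self.critical_pcs = set()   # PCs of accesses predicted critical

        def access(self, pc):
            """Route an access; returns (bank, latency in cycles)."""
            if pc in self.critical_pcs:
                return "hot", self.HOT_LATENCY_CYCLES
            return "cold", self.COLD_LATENCY_CYCLES

        def train(self, pc, was_critical):
            """Update the predictor from observed criticality."""
            if was_critical:
                self.critical_pcs.add(pc)
            else:
                self.critical_pcs.discard(pc)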
