Publication


Featured research published by Vilas Sridharan.


IEEE International Conference on High Performance Computing, Data, and Analytics | 2012

A study of DRAM failures in the field

Vilas Sridharan; Dean A. Liberty

Most modern computer systems use dynamic random access memory (DRAM) as a main memory store. Recent publications have confirmed that DRAM errors are a common source of failures in the field. Therefore, further attention to the faults experienced by DRAM sub-systems is warranted. In this paper, we present a study of 11 months of DRAM errors in a large high-performance computing cluster. Our goal is to understand the failure modes, rates, and fault types experienced by DRAM in production settings. We identify several unique DRAM failure modes, including single-bit, multi-bit, and multi-chip failures. We also provide a deterministic bound on the rate of transient faults in the DRAM array, by exploiting the presence of a hardware scrubber on our nodes. We draw several conclusions from our study. First, DRAM failures are dominated by permanent, rather than transient, faults, although not to the extent found by previous publications. Second, DRAMs are susceptible to large multi-bit failures, such as failures that affect an entire DRAM row or column, indicating faults in shared internal circuitry. Third, we identify a DRAM failure mode that disrupts access to other DRAM devices that share the same board-level circuitry. Finally, we find that chipkill error-correcting codes (ECC) are extremely effective, reducing the node failure rate from uncorrected DRAM errors by 42x compared to single-error correct/double-error detect (SEC-DED) ECC.
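
The failure-mode breakdown above rests on grouping correctable-error reports by their physical location in the DRAM. As a rough, hedged illustration of that kind of classification, the Python sketch below groups logged errors by chip and labels single-bit, row, column, multi-bit, and multi-chip patterns; the record fields, sample data, and decision rules are assumptions for this sketch, not the paper's methodology.

```python
from collections import defaultdict

# Hypothetical correctable-error records: (node, chip, row, column, bit).
# Field names and decision rules are illustrative assumptions, not the
# methodology used in the paper.
def classify_faults(error_log):
    by_chip = defaultdict(list)
    for node, chip, row, col, bit in error_log:
        by_chip[(node, chip)].append((row, col, bit))

    modes = defaultdict(int)
    chips_per_node = defaultdict(set)
    for (node, chip), errs in by_chip.items():
        chips_per_node[node].add(chip)
        rows = {r for r, _, _ in errs}
        cols = {c for _, c, _ in errs}
        if len(errs) == 1:
            modes["single-bit"] += 1
        elif len(rows) == 1 and len(cols) > 1:
            modes["row"] += 1              # many columns within one row
        elif len(cols) == 1 and len(rows) > 1:
            modes["column"] += 1           # many rows within one column
        else:
            modes["multi-bit"] += 1
    # Errors spanning several chips on one node hint at faults in shared circuitry.
    modes["multi-chip nodes"] = sum(1 for c in chips_per_node.values() if len(c) > 1)
    return dict(modes)

print(classify_faults([(0, 3, 12, 7, 1), (0, 3, 12, 9, 0), (1, 5, 4, 2, 3)]))
```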


International Symposium on Performance Analysis of Systems and Software | 2005

Balancing Performance and Reliability in the Memory Hierarchy

Ghazanfar-Hossein Asadi; Vilas Sridharan; Mehdi Baradaran Tahoori; David R. Kaeli

Cosmic-ray induced soft errors in cache memories are becoming a major threat to the reliability of microprocessor-based systems. In this paper, we present a new method to accurately estimate the reliability of cache memories. We have measured the MTTF (mean-time-to-failure) of unprotected first-level (L1) caches for twenty programs taken from the SPEC2000 benchmark suite. Our results show that a 16 KB first-level cache possesses an MTTF of at least 400 years (for a raw error rate of 0.002 FIT/bit). However, this MTTF is significantly reduced for higher error rates and larger cache sizes. Our results show that for selected programs, a 64 KB first-level cache is more than 10 times as vulnerable to soft errors as a 16 KB cache. Our work also illustrates that the reliability of cache memories is highly application-dependent. Finally, we present three different techniques to reduce the susceptibility of first-level caches to soft errors by two orders of magnitude. Our analysis shows how to achieve a balance between performance and reliability.
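
The 400-year figure can be sanity-checked directly from the raw error rate and the cache size. A minimal back-of-the-envelope sketch, assuming the worst case in which every bit flip counts as a failure:

```python
# Back-of-the-envelope MTTF from the raw soft-error rate, assuming the worst
# case in which every bit flip causes a failure (masking only raises MTTF).
FIT_PER_BIT = 0.002              # failures per 10^9 device-hours, per bit
CACHE_BITS = 16 * 1024 * 8       # 16 KB L1 cache

fit_total = FIT_PER_BIT * CACHE_BITS          # ~262 FIT for the whole array
mttf_years = 1e9 / fit_total / 8760
print(f"MTTF ~ {mttf_years:.0f} years")       # ~435 years, consistent with "at least 400"
```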


IEEE International Conference on High Performance Computing, Data, and Analytics | 2013

Feng shui of supercomputer memory: positional effects in DRAM and SRAM faults

Vilas Sridharan; Jon Stearley; Nathan DeBardeleben; Sean Blanchard; Sudhanva Gurumurthi

Several recent publications confirm that faults are common in high-performance computing systems. Therefore, further attention to the faults experienced by such computing systems is warranted. In this paper, we present a study of DRAM and SRAM faults in large high-performance computing systems. Our goal is to understand the factors that influence faults in production settings. We examine the impact of aging on DRAM, finding a marked shift from permanent to transient faults in the first two years of DRAM lifetime. We examine the impact of DRAM vendor, finding that fault rates vary by more than 4x among vendors. We examine the physical location of faults in a DRAM device and in a data center; contrary to prior studies, we find no correlations with either. Finally, we study the impact of altitude and rack placement on SRAM faults, finding that, as expected, altitude has a substantial impact on SRAM faults, and that top-of-rack placement correlates with a 20% higher fault rate.
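
Comparisons such as the 4x vendor spread only make sense once fault counts are normalized by device population and observation time. A hedged sketch of that normalization; the records, counts, and observation window below are made-up placeholders, not data from the study.

```python
from collections import defaultdict

# Hypothetical fault records and DIMM populations; every number below is a
# made-up placeholder used only to show the normalization step.
faults = [("vendorA", "permanent"), ("vendorA", "transient"), ("vendorB", "transient")]
dimm_count = {"vendorA": 10_000, "vendorB": 40_000}
hours_observed = 2 * 365 * 24                  # roughly two years per DIMM

fit_per_dimm = defaultdict(float)
for vendor, _fault_type in faults:
    fit_per_dimm[vendor] += 1
for vendor in fit_per_dimm:
    # faults per 10^9 DIMM-hours, i.e. FIT per DIMM
    fit_per_dimm[vendor] *= 1e9 / (dimm_count[vendor] * hours_observed)
print(dict(fit_per_dimm))
```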


High-Performance Computer Architecture | 2009

Eliminating microarchitectural dependency from Architectural Vulnerability

Vilas Sridharan; David R. Kaeli

The Architectural Vulnerability Factor (AVF) of a hardware structure is the probability that a fault in the structure will affect the output of a program. AVF captures both microarchitectural and architectural fault masking effects; therefore, AVF measurements cannot generate insight into the vulnerability of software independent of hardware. To evaluate the behavior of software in the presence of hardware faults, we must isolate the software-dependent (architecture-level masking) portion of AVF from the hardware-dependent (microarchitecture-level masking) portion, providing a quantitative basis to make reliability decisions about software independent of hardware. In this work, we demonstrate that the new Program Vulnerability Factor (PVF) metric provides such a basis: PVF captures the architecture-level fault masking inherent in a program, allowing software designers to make quantitative statements about a program's tolerance to soft errors. PVF can also explain the AVF behavior of a program when executed on hardware; PVF captures the workload-driven changes in AVF for all structures. Finally, we demonstrate two practical uses for PVF: choosing algorithms and compiler optimizations to reduce a program's failure rate.
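
PVF is grounded in architecture-level ACE analysis: a register's value is vulnerable from the cycle it is written until the last cycle it is read, independent of the microarchitecture underneath. The sketch below does that bookkeeping over a toy instruction trace; the trace format, register count, and granularity are illustrative assumptions, not the paper's tooling.

```python
# Minimal ACE-interval bookkeeping for architectural registers: a value is
# vulnerable (ACE) from the cycle it is written until the last cycle it is
# read. PVF ~ ACE register-cycles / total register-cycles (a per-bit
# accounting would refine this). Trace format and register count are
# illustrative assumptions.
def register_pvf(trace, num_regs=4):
    last_write = {}          # reg -> cycle of the most recent write
    last_read = {}           # reg -> cycle of the last read of that value
    ace_cycles = 0

    def retire(reg):
        nonlocal ace_cycles
        if reg in last_write and reg in last_read:
            ace_cycles += max(0, last_read[reg] - last_write[reg])

    for cycle, (dst, srcs) in enumerate(trace):
        for s in srcs:
            last_read[s] = cycle
        retire(dst)                   # previous value of dst dies when overwritten
        last_write[dst] = cycle
        last_read.pop(dst, None)
    for reg in list(last_write):      # account for values still live at the end
        retire(reg)

    total_cycles = num_regs * len(trace)
    return ace_cycles / total_cycles if total_cycles else 0.0

# Toy trace: r1 = ...; r2 = f(r1); r1 = g(r2). Two values are each ACE for one
# cycle, so PVF ~ 2 / (4 registers * 3 cycles) ~ 0.17.
print(register_pvf([(1, []), (2, [1]), (1, [2])]))
```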


Architectural Support for Programming Languages and Operating Systems | 2015

Memory Errors in Modern Systems: The Good, The Bad, and The Ugly

Vilas Sridharan; Nathan DeBardeleben; Sean Blanchard; Kurt Brian Ferreira; Jon Stearley; John Shalf; Sudhanva Gurumurthi

Several recent publications have shown that hardware faults in the memory subsystem are commonplace. These faults are predicted to become more frequent in future systems that contain orders of magnitude more DRAM and SRAM than found in current memory subsystems. These memory subsystems will need to provide resilience techniques to tolerate these faults when deployed in high-performance computing systems and data centers containing tens of thousands of nodes. Therefore, it is critical to understand the efficacy of current hardware resilience techniques to determine whether they will be suitable for future systems. In this paper, we present a study of DRAM and SRAM faults and errors from the field. We use data from two leadership-class high-performance computer systems to analyze the reliability impact of hardware resilience schemes that are deployed in current systems. Our study has several key findings about the efficacy of many currently deployed reliability techniques such as DRAM ECC, DDR address/command parity, and SRAM ECC and parity. We also perform a methodological study, and find that counting errors instead of faults, a common practice among researchers and data center operators, can lead to incorrect conclusions about system reliability. Finally, we use our data to project the needs of future large-scale systems. We find that SRAM faults are unlikely to pose a significantly larger reliability threat in the future, while DRAM faults will be a major concern and stronger DRAM resilience schemes will be needed to maintain acceptable failure rates similar to those found on today's systems.
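
The errors-versus-faults distinction matters because a single permanent fault can log the same correctable error thousands of times. A minimal sketch of the de-duplication step, with made-up field names and records:

```python
# One permanent fault can produce a flood of repeat correctable errors, so
# rates computed from raw error counts overweight a handful of bad devices.
# Field names and records below are illustrative assumptions.
error_log = [
    ("node7", "dimm2", 0x1A3), ("node7", "dimm2", 0x1A3),   # same faulty row, logged repeatedly
    ("node7", "dimm2", 0x1A3), ("node9", "dimm0", 0x044),
]

error_count = len(error_log)
fault_count = len(set(error_log))     # collapse repeats at the same location
print(error_count, fault_count)       # 4 errors, but only 2 distinct faults
```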


IEEE Transactions on Dependable and Secure Computing | 2006

Reducing Data Cache Susceptibility to Soft Errors

Vilas Sridharan; Hossein Asadi; Mehdi Baradaran Tahoori; David R. Kaeli

Data caches are a fundamental component of most modern microprocessors. They provide for efficient read/write access to data memory. Errors occurring in the data cache can corrupt data values or state, and can easily propagate throughout the memory hierarchy. One of the main threats to data cache reliability is soft (transient, nonreproducible) errors. These errors can occur more often than hard (permanent) errors, and most often arise from single event upsets (SEUs) caused by strikes from energetic particles such as neutrons and alpha particles. Many protection techniques exist for data caches; the most common are ECC (error correcting codes) and parity. These protection techniques detect all single bit errors and, in the case of ECC, correct them. To make proper design decisions about which protection technique to use, accurate design-time modeling of cache reliability is crucial. In addition, as caches increase in storage capacity, another important goal is to reduce the failure rate of a cache, to limit disruption to normal system operation. In this paper, we present our modeling approach for assessing the impact of soft errors using architectural simulators. We also describe a new technique for reducing the vulnerability of data caches: refetching. By selectively refetching cache lines from the ECC-protected L2 cache, we can significantly reduce the vulnerability of the L1 data cache. We discuss and present results for two different algorithms that perform selective refetch. Experimental results show that we can obtain an 85 percent decrease in vulnerability when running the SPEC2K benchmark suite while only experiencing a slight decrease in performance. Our results demonstrate that selective refetch can cost-effectively decrease the error rate of an L1 data cache.
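
A toy model can make the selective-refetch idea concrete: clean L1 lines that have sat idle past a threshold are re-read from the ECC-protected L2, discarding any latent bit flips before they can be consumed. The class names, structure, and threshold below are illustrative assumptions, not the paper's exact policy.

```python
# Toy model of selective refetch: clean L1 lines that have sat idle past a
# threshold are re-read from the ECC-protected L2, discarding any latent bit
# flips before they can be consumed. Names and the threshold are assumptions.
class L1Cache:
    REFETCH_THRESHOLD = 1000          # idle cycles before a clean line is refetched

    def __init__(self, l2_read):
        self.l2_read = l2_read        # callback: address -> ECC-corrected data
        self.lines = {}               # addr -> {"data", "dirty", "last_access"}

    def read(self, addr, now):
        if addr not in self.lines:
            self.lines[addr] = {"data": self.l2_read(addr), "dirty": False, "last_access": now}
        line = self.lines[addr]
        line["last_access"] = now
        return line["data"]

    def scrub(self, now):
        """Refetch stale clean lines; dirty lines hold the only copy and cannot be refetched."""
        for addr, line in self.lines.items():
            if not line["dirty"] and now - line["last_access"] > self.REFETCH_THRESHOLD:
                line["data"] = self.l2_read(addr)   # overwrite a possibly corrupted copy
                line["last_access"] = now

cache = L1Cache(l2_read=lambda addr: addr * 2)
cache.read(0x40, now=0)
cache.scrub(now=5000)                 # 0x40 is clean and stale, so it gets refetched
```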


Design, Automation, and Test in Europe | 2006

Vulnerability Analysis of L2 Cache Elements to Single Event Upsets

Hossein Asadi; Vilas Sridharan; Mehdi Baradaran Tahoori; David R. Kaeli

Memory elements are the most vulnerable system component to soft errors. Since memory elements in cache arrays consume a large fraction of the die in modern microprocessors, the probability of particle strikes in these elements is high and can significantly impact overall processor reliability. Previous work (Asadi et al., 2005) has developed effective metrics to accurately measure the vulnerability of cache memory elements. Based on these metrics, we have developed a reliability-performance evaluation framework, which has been built upon the Simplescalar simulator. In this work, we focus on the reliability aspects of L1 and L2 caches. Specifically, we present algorithms for tag vulnerability computation and investigate and report in detail on the vulnerability of data, tag, and status bits in the L2 array. Experiments on SPECint2K and SPECfp2K benchmarks show that one class of error, replacement error, makes up almost 85% of the total tag vulnerability of a 1MB write-back L2 cache. In addition, the vulnerability of L2 tag-addresses significantly increases as the size of the memory address space increases. Results show that the L2 tag array can be as susceptible as first-level instruction and data caches (IL1/DL1) to soft errors.
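
A heavily simplified sketch of one ingredient of tag-vulnerability accounting in a write-back cache: while a line is dirty, a flipped tag bit can send the eventual writeback to the wrong address, so dirty residency time is counted as vulnerable. This is a simplification for illustration only, not the paper's full classification of tag errors.

```python
# Simplified accounting of tag vulnerability in a write-back cache: dirty
# residency time is counted as vulnerable, because a flipped tag bit can
# redirect the eventual writeback. Illustration only, not the paper's metric.
def dirty_tag_vulnerable_cycles(events):
    """events: (cycle, line, kind) tuples with kind in {'fill', 'write', 'evict'}."""
    dirty_since = {}                  # line -> cycle at which it became dirty
    vulnerable = 0
    for cycle, line, kind in events:
        if kind == "write" and line not in dirty_since:
            dirty_since[line] = cycle
        elif kind == "evict" and line in dirty_since:
            vulnerable += cycle - dirty_since.pop(line)
    return vulnerable

events = [(0, "A", "fill"), (10, "A", "write"), (500, "A", "evict")]
print(dirty_tag_vulnerable_cycles(events))   # 490 vulnerable tag cycles for line A
```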


International Symposium on Computer Architecture | 2010

Using hardware vulnerability factors to enhance AVF analysis

Vilas Sridharan; David R. Kaeli

Fault tolerance is now a primary design constraint for all major microprocessors. One step in determining a processor's compliance to its failure rate target is measuring the Architectural Vulnerability Factor (AVF) of each on-chip structure. The AVF of a hardware structure is the probability that a fault in the structure will affect the output of a program. While AVF generates meaningful insight into system behavior, it cannot quantify the vulnerability of an individual system component (hardware, user program, etc.), limiting the amount of insight that can be generated. To address this, prior work has introduced the Program Vulnerability Factor (PVF) to quantify the vulnerability of software. In this paper, we introduce and analyze the Hardware Vulnerability Factor (HVF) to quantify the vulnerability of hardware. HVF has three concrete benefits which we examine in this paper. First, HVF analysis can provide insight to hardware designers beyond that gained from AVF analysis alone. Second, separating AVF analysis into HVF and PVF steps can accelerate the AVF measurement process. Finally, HVF measurement enables runtime AVF estimation that combines compile-time PVF estimates with runtime HVF measurements. A key benefit of this technique is that it allows software developers to influence the runtime AVF estimates. We demonstrate that this technique can estimate AVF at runtime with an average absolute error of less than 3%.
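
The runtime estimation scheme combines a compile-time PVF estimate with a runtime HVF measurement for each structure. The sketch below treats the two factors as independent and simply multiplies them, which is a first-order simplification rather than the paper's exact formulation; the structure names and numbers are made up.

```python
# First-order runtime AVF estimate: combine a compile-time PVF estimate with a
# runtime HVF measurement per structure. Treating the factors as independent
# (AVF ~ HVF * PVF) is a simplifying assumption; the values below are made up.
def estimate_avf(hvf_runtime, pvf_compile_time):
    return {s: round(hvf_runtime[s] * pvf_compile_time.get(s, 1.0), 3)
            for s in hvf_runtime}

hvf = {"issue_queue": 0.35, "reorder_buffer": 0.20}    # measured online (hypothetical)
pvf = {"issue_queue": 0.60, "reorder_buffer": 0.45}    # estimated at compile time (hypothetical)
print(estimate_avf(hvf, pvf))    # {'issue_queue': 0.21, 'reorder_buffer': 0.09}
```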


Proceedings of the 2008 Workshop on Radiation Effects and Fault Tolerance in Nanometer Technologies | 2008

Quantifying software vulnerability

Vilas Sridharan; David R. Kaeli

The technique known as ACE Analysis allows researchers to quantify a hardware structure's Architectural Vulnerability Factor (AVF) using simulation. This allows researchers to understand a hardware structure's vulnerability to soft errors and consider design tradeoffs when running specific workloads. AVF is only applicable to hardware, however, and no corresponding concept has yet been introduced for software. Quantifying vulnerability to hardware faults at a software, or program, level would allow researchers to gain a better understanding of the reliability of a program as run on a particular architecture (e.g., X86, PowerPC), independent of the micro-architecture on which it is executed. This ability can provide a basis for future research into reliability techniques at a software level. In this work, we adapt the techniques of ACE Analysis to develop a new software-level vulnerability metric called the Program Vulnerability Factor (PVF). This metric allows insight into the vulnerability of a software resource to hardware faults in a micro-architecture independent way, and can be used to make judgments about the relative reliability of different programs. We describe in detail how to calculate the PVF of a software resource, and show that the PVF of the architectural register file closely correlates with the AVF of the underlying physical register file and can serve as a good predictor of relative AVF when comparing the AVF of two different programs.
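
The reported result is that per-interval PVF of the architectural register file tracks the AVF of the underlying physical register file. A minimal way to test that kind of claim is a correlation over execution intervals; the two series in the sketch below are made-up placeholders, not measured data.

```python
import numpy as np

# Correlation check between PVF and AVF sampled over execution intervals.
# Both series are made-up placeholders for illustration, not measured data.
pvf_intervals = np.array([0.42, 0.51, 0.38, 0.47, 0.55])
avf_intervals = np.array([0.30, 0.37, 0.27, 0.34, 0.41])
print(np.corrcoef(pvf_intervals, avf_intervals)[0, 1])   # near 1.0 when the series track each other
```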


International Symposium on Computer Architecture | 2013

Resilient die-stacked DRAM caches

Jaewoong Sim; Gabriel H. Loh; Vilas Sridharan; Mike O'Connor

Die-stacked DRAM can provide large amounts of in-package, high-bandwidth cache storage. For server and high-performance computing markets, however, such DRAM caches must also provide sufficient support for reliability and fault tolerance. While conventional off-chip memory provides ECC support by adding one or more extra chips, this may not be practical in a 3D stack. In this paper, we present a DRAM cache organization that uses error-correcting codes (ECCs), strong checksums (CRCs), and dirty data duplication to detect and correct a wide range of stacked DRAM failures, from traditional bit errors to large-scale row, column, bank, and channel failures. With only a modest performance degradation compared to a DRAM cache with no ECC support, our proposal can correct all single-bit failures, and 99.9993% of all row, column, and bank failures, providing more than a 54,000x improvement in the FIT rate of silent-data corruptions compared to basic SECDED ECC protection.
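
The core detect-then-recover idea for dirty lines can be illustrated in a few lines: a strong checksum catches large-scale damage that ECC alone might miss, and a duplicate copy of the dirty data supplies the repair. The class below is a conceptual sketch using CRC32 and an in-object duplicate; it is not the paper's exact cache organization.

```python
import zlib

# Sketch of detect-then-recover for a dirty die-stacked DRAM cache line:
# a strong checksum (CRC here) catches row/column/bank-scale damage, and a
# duplicate copy of the dirty data supplies the repair. Concept only, not the
# paper's exact organization.
class ProtectedDirtyLine:
    def __init__(self, data: bytes):
        self.data = data
        self.crc = zlib.crc32(data)
        self.duplicate = data            # second copy, e.g. in another bank or channel

    def read(self) -> bytes:
        if zlib.crc32(self.data) != self.crc:            # large-scale failure detected
            assert zlib.crc32(self.duplicate) == self.crc, "both copies lost"
            self.data = self.duplicate                   # recover from the duplicate copy
        return self.data

line = ProtectedDirtyLine(b"dirty cache line contents")
line.data = b"corrupted by a row failure"                # simulate a large-scale DRAM failure
print(line.read())                                       # returns the original dirty data
```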

Collaboration


Dive into Vilas Sridharan's collaborations.

Top Co-Authors

Nathan DeBardeleben, Los Alamos National Laboratory

Kurt Brian Ferreira, Sandia National Laboratories

Manish Gupta, University of California

Sean Blanchard, Los Alamos National Laboratory