Network


Latest external collaborations at the country level.

Hotspot


Research topics in which Prashant J. Nair is active.

Publications


Featured research published by Prashant J. Nair.


International Symposium on Computer Architecture (ISCA) | 2013

ArchShield: architectural framework for assisting DRAM scaling by tolerating high error rates

Prashant J. Nair; Dae-Hyun Kim; Moinuddin K. Qureshi

DRAM scaling has been the prime driver for increasing the capacity of main memory systems over the past three decades. Unfortunately, scaling DRAM to smaller technology nodes has become challenging due to the inherent difficulty in designing smaller geometries, coupled with the problems of device variation and leakage. Future DRAM devices are likely to experience significantly higher error rates. Techniques that can tolerate errors efficiently can enable DRAM to scale to smaller technology nodes. However, existing techniques such as row/column sparing and ECC become prohibitive at high error rates. To develop cost-effective solutions for tolerating high error rates, this paper advocates a cross-layer approach. Rather than hiding the faulty-cell information within the DRAM chips, we expose it to the architectural level. We propose ArchShield, an architectural framework that employs runtime testing to identify faulty DRAM cells. ArchShield tolerates these faults using two components: a Fault Map that keeps information about faulty words in a cache line, and Selective Word-Level Replication (SWLR) that replicates faulty words for error resilience. Both the Fault Map and SWLR are integrated into a reserved area of DRAM memory. Our evaluations with an 8GB DRAM DIMM show that ArchShield can efficiently tolerate error rates as high as 10^-4 (100x higher than ECC alone), causes less than 2% performance degradation, and still maintains 1-bit error tolerance against soft errors.
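A minimal behavioral sketch of the ArchShield mechanism described above, written as a software model for illustration (the names and data structures are assumptions, not the paper's hardware design). Faulty words found by runtime testing are recorded in a Fault Map; writes to faulty words are mirrored into a replica area, and reads of faulty words are served from the replicas (Selective Word-Level Replication).

class ArchShieldModel:
    def __init__(self):
        self.fault_map = {}   # line address -> set of faulty word indices
        self.replicas = {}    # (line, word) -> replicated value
        self.memory = {}      # (line, word) -> value in the (faulty) cells

    def mark_faulty(self, line, word):
        self.fault_map.setdefault(line, set()).add(word)

    def write_word(self, line, word, value):
        self.memory[(line, word)] = value
        if word in self.fault_map.get(line, set()):
            self.replicas[(line, word)] = value   # SWLR: replicate faulty word

    def read_word(self, line, word):
        if word in self.fault_map.get(line, set()):
            return self.replicas[(line, word)]    # serve from the replica
        return self.memory[(line, word)]

m = ArchShieldModel()
m.mark_faulty(line=0x40, word=3)        # discovered by runtime testing
m.write_word(0x40, 3, 0xDEADBEEF)
assert m.read_word(0x40, 3) == 0xDEADBEEF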


Dependable Systems and Networks (DSN) | 2015

AVATAR: A Variable-Retention-Time (VRT) Aware Refresh for DRAM Systems

Moinuddin K. Qureshi; Dae-Hyun Kim; Samira Manabi Khan; Prashant J. Nair; Onur Mutlu

Multirate refresh techniques exploit the non-uniformity in retention times of DRAM cells to reduce DRAM refresh overheads. Such techniques rely on accurate profiling of the retention times of cells, and perform faster refresh only for the few rows that have cells with low retention times. Unfortunately, the retention times of some cells can change at runtime due to Variable Retention Time (VRT), which makes it impractical to reliably deploy multirate refresh. Based on experimental data from 24 DRAM chips, we develop architecture-level models for analyzing the impact of VRT. We show that simply relying on ECC DIMMs to correct VRT failures is not viable, as it causes a data error once every few months. We propose AVATAR, a VRT-aware multirate refresh scheme that adaptively changes the refresh rate for different rows at runtime based on current VRT failures. AVATAR provides a time to failure in the regime of several tens of years while reducing refresh operations by 62%-72%.
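A sketch of an AVATAR-style adaptive policy (the interval values and the upgrade rule are assumptions drawn from the abstract, not the paper's exact mechanism). Rows start in a refresh bin chosen by retention profiling; any row where ECC corrects an error at runtime is treated as a VRT suspect and is permanently upgraded to the fast refresh rate.

class AvatarRefreshModel:
    def __init__(self, profiled_weak_rows):
        self.fast_rows = set(profiled_weak_rows)   # refreshed every 64 ms

    def on_ecc_correction(self, row):
        # A corrected error suggests VRT: upgrade this row to fast refresh.
        self.fast_rows.add(row)

    def refresh_interval_ms(self, row):
        return 64 if row in self.fast_rows else 256   # assumed slow bin

policy = AvatarRefreshModel(profiled_weak_rows={17, 912})
policy.on_ecc_correction(row=5)            # VRT flips row 5 at runtime
assert policy.refresh_interval_ms(5) == 64
assert policy.refresh_interval_ms(6) == 256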


High-Performance Computer Architecture (HPCA) | 2013

A case for Refresh Pausing in DRAM memory systems

Prashant J. Nair; Chia-Chen Chou; Moinuddin K. Qureshi

DRAM cells rely on periodic refresh operations to maintain data integrity. As the capacity of DRAM memories has increased, so has the amount of time consumed in doing refresh. Refresh operations contend with read operations, which increases read latency and reduces system performance. We show that eliminating the latency penalty due to refresh can improve average performance by 7.2%. However, simply doing intelligent scheduling of refresh operations is ineffective at obtaining significant performance improvement. This paper provides an alternative and scalable option to reduce the latency penalty due to refresh. It exploits the property that each refresh operation in a typical DRAM device internally refreshes multiple DRAM rows in JEDEC-based distributed refresh mode. Therefore, a refresh operation has well-defined points at which it can potentially be Paused to service a pending read request. Leveraging this property, we propose Refresh Pausing, a solution that is highly effective at alleviating the contention from refresh operations. It provides an average performance improvement of 5.1% for 8Gb devices, and becomes even more effective for future high-density technologies. We also show that Refresh Pausing significantly outperforms the recently proposed Elastic Refresh scheme.
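An illustrative model of the pause points described above (the rows-per-refresh count and function names are assumptions, not the paper's controller design). A single refresh command internally refreshes several rows, and the boundary after each row is a well-defined point at which the controller can pause to service pending reads before resuming.

ROWS_PER_REFRESH = 4          # assumed rows covered by one refresh command

def refresh_with_pausing(refresh_one_row, pending_reads, service_read):
    for i in range(ROWS_PER_REFRESH):
        refresh_one_row(i)
        while pending_reads:              # pause point: drain waiting reads
            service_read(pending_reads.pop(0))

trace = []
queue = ["read A", "read B"]
refresh_with_pausing(lambda i: trace.append(f"refreshed row group {i}"),
                     queue, lambda r: trace.append(f"paused to serve {r}"))
print("\n".join(trace))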


High-Performance Computer Architecture (HPCA) | 2016

Low-Cost Inter-Linked Subarrays (LISA): Enabling fast inter-subarray data movement in DRAM

Kevin Kai-Wei Chang; Prashant J. Nair; Donghyuk Lee; Saugata Ghose; Moinuddin K. Qureshi; Onur Mutlu

This paper introduces a new DRAM design that enables fast and energy-efficient bulk data movement across subarrays in a DRAM chip. While bulk data movement is a key operation in many applications and operating systems, contemporary systems perform this movement inefficiently, by transferring data from DRAM to the processor, and then back to DRAM, across a narrow off-chip channel. The use of this narrow channel for bulk data movement results in high latency and energy consumption. Prior work proposed to avoid these high costs by exploiting the existing wide internal DRAM bandwidth for bulk data movement, but the limited connectivity of wires within DRAM allows fast data movement within only a single DRAM subarray. Each subarray is only a few megabytes in size, greatly restricting the range over which fast bulk data movement can happen within DRAM. We propose a new DRAM substrate, Low-Cost Inter-Linked Subarrays (LISA), whose goal is to enable fast and efficient data movement across a large range of memory at low cost. LISA adds low-cost connections between adjacent subarrays. By using these connections to interconnect the existing internal wires (bitlines) of adjacent subarrays, LISA enables wide-bandwidth data transfer across multiple subarrays with little (only 0.8%) DRAM area overhead. As a DRAM substrate, LISA is versatile, enabling an array of new applications. We describe and evaluate three such applications in detail: (1) fast inter-subarray bulk data copy, (2) in-DRAM caching using a DRAM architecture whose rows have heterogeneous access latencies, and (3) accelerated bitline precharging by linking multiple precharge units together. Our extensive evaluations show that each of LISA's three applications significantly improves performance and memory energy efficiency, and their combined benefit is higher than the benefit of each alone, on a variety of workloads and system configurations.
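A toy latency model for LISA-style inter-subarray copy (all timing values and the hop-based cost model are assumptions for illustration, not measured DRAM parameters). Data hops between adjacent subarrays over the added links, so a copy across k subarray boundaries costs k row-buffer-movement steps instead of a round trip over the narrow off-chip channel.

T_ACTIVATE_NS = 35    # assumed row activation time
T_HOP_NS = 8          # assumed latency of one inter-subarray hop
T_PRECHARGE_NS = 15   # assumed precharge time

def lisa_copy_latency_ns(src_subarray, dst_subarray):
    hops = abs(dst_subarray - src_subarray)
    return T_ACTIVATE_NS + hops * T_HOP_NS + T_PRECHARGE_NS

print(lisa_copy_latency_ns(0, 3))   # 35 + 3*8 + 15 = 74 (model only)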


IEEE Computer Architecture Letters | 2015

Architectural Support for Mitigating Row Hammering in DRAM Memories

Dae-Hyun Kim; Prashant J. Nair; Moinuddin K. Qureshi

DRAM scaling has been the prime driver of the increasing capacity of main memory systems. Unfortunately, lower technology nodes worsen cell reliability, as they increase the coupling between adjacent DRAM cells, thereby exacerbating different failure modes. This paper investigates the reliability problem due to Row Hammering, whereby frequent activations of a given row can cause data loss for its neighboring rows. As DRAM scales to lower technology nodes, the threshold number of row activations that causes data loss for the neighboring rows decreases, making Row Hammering a challenging problem for future DRAM chips. To overcome Row Hammering, we propose two architectural solutions. First, Counter-Based Row Activation (CRA) uses a counter with each row to count the number of row activations; if the count exceeds the row hammering threshold, a dummy activation is sent to neighboring rows proactively to refresh the data. Second, Probabilistic Row Activation (PRA) obviates the storage overhead of tracking and simply allows the memory controller to proactively issue dummy activations to neighboring rows with a small probability on every memory access. Our evaluations show that these solutions are effective at mitigating Row Hammering while causing negligible performance loss (< 1 percent).
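A sketch of the PRA scheme from the paragraph above (a behavioral model; the probability value is an assumed parameter). On every activation, the controller also activates the two neighboring rows with a small probability, refreshing their data before enough hammering activations can accumulate to cause loss.

import random

P_DUMMY = 0.001   # assumed; chosen relative to the row hammering threshold

def on_activate(row, activate):
    activate(row)
    if random.random() < P_DUMMY:
        activate(row - 1)    # dummy activations refresh the neighbors
        activate(row + 1)

hits = []
for _ in range(100_000):
    on_activate(42, hits.append)
print(f"{hits.count(41)} dummy activations of row 41")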


Architectural Support for Programming Languages and Operating Systems (ASPLOS) | 2015

DEUCE: Write-Efficient Encryption for Non-Volatile Memories

Vinson Young; Prashant J. Nair; Moinuddin K. Qureshi

Phase Change Memory (PCM) is an emerging Non-Volatile Memory (NVM) technology that has the potential to provide scalable high-density memory systems. While the non-volatility of PCM is a desirable property for saving leakage power, it also has the undesirable effect of making PCM main memories susceptible to new modes of security vulnerabilities, for example, accessibility to sensitive data if a PCM DIMM gets stolen. PCM memories can be made secure by encrypting the data. Unfortunately, such encryption comes with a significant overhead in terms of bits written to PCM memory, causing half of the bits in the line to change on every write, even if the actual number of bits being written to memory is small. Our studies show that a typical writeback modifies, on average, only 12% of the bits in the cache line. Thus, encryption causes almost a 4x increase in the number of bits written to PCM memories. Such extraneous bit writes cause a significant increase in write power, a reduction in write endurance, and a reduction in write bandwidth. To provide the benefit of secure memory in a write-efficient manner, this paper proposes Dual Counter Encryption (DEUCE). DEUCE is based on the observation that a typical writeback changes only a few words, so DEUCE re-encrypts only the words that have changed. We show that DEUCE reduces the number of modified bits per writeback for a secure memory from 50% to 24%, which improves performance by 27% and increases lifetime by 2x.
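A toy sketch of the dual-counter idea behind DEUCE (a hash-based keystream stands in for the real cipher, and the names, widths, and counter handling are simplifying assumptions). Words unchanged since the last full re-encryption stay encrypted under the trailing counter; only the words modified by this writeback are encrypted under the advancing leading counter, so the ciphertext bits of untouched words do not flip.

import hashlib

def pad(line, ctr, word):
    digest = hashlib.sha256(f"{line}:{ctr}:{word}".encode()).digest()
    return int.from_bytes(digest[:8], "big")      # 64-bit keystream word

def writeback(line, old_words, new_words, trailing_ctr, leading_ctr):
    cipher, modified = [], []
    for i, (old, new) in enumerate(zip(old_words, new_words)):
        changed = old != new
        ctr = leading_ctr if changed else trailing_ctr
        cipher.append(new ^ pad(line, ctr, i))    # XOR with keystream
        modified.append(changed)
    return cipher, modified

old = [0x11, 0x22, 0x33, 0x44]
new = [0x11, 0x22, 0x99, 0x44]                    # one word changed
_, mod = writeback(0x1000, old, new, trailing_ctr=7, leading_ctr=8)
print(mod)   # only index 2 re-encrypted: [False, False, True, False]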


High-Performance Computer Architecture (HPCA) | 2015

Reducing read latency of phase change memory via early read and Turbo Read

Prashant J. Nair; Chia-Chen Chou; Bipin Rajendran; Moinuddin K. Qureshi

Phase Change Memory (PCM) is an emerging memory technology that can enable scalable high-density main memory systems. Unfortunately, PCM has higher read latency than DRAM, resulting in lower system performance. This paper investigates architectural techniques to improve the read latency of PCM. We observe that there is a wide distribution in cell resistance in both the SET state and the RESET state, and that the read latency of PCM is designed conservatively to handle the worst-case cell. If PCM sensing can be tuned to exploit the variability in cell resistance, then we can get reduced read latency. We propose two schemes to enable better-than-worst-case read latency for PCM systems. Our first proposal, Early Read, reads the data earlier than the specified time period. Our key observation that Early Read causes only unidirectional errors (SET being read as RESET) allows us to efficiently detect data errors using Berger codes. In the uncommon case that Early Read causes data error(s), we simply retry the read operation with the original latency. Our evaluations show that Early Read can reduce the read latency by 25% while incurring a storage overhead of only 10 bits per 64-byte line. Our second proposal, Turbo Read, reduces the sensing time for read operations by pumping higher current, at the expense of a small probability of accidentally switching the PCM cell during the read operation. We analyze Error Correction Codes (ECC) and Probabilistic Row Scrubbing (PRS) for maintaining data integrity under Turbo Read. We show that a combination of Early Read and Turbo Read can reduce the PCM read latency by 30%, improve system performance by 21%, and reduce the Energy Delay Product (EDP) by 28%, while requiring minimal changes to the memory system.
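A small check showing why Berger codes fit Early Read (illustrative; the line width matches the 64-byte example in the abstract). A Berger code stores the count of zero bits in the data; 512 bits need a 10-bit count, which matches the quoted 10-bit overhead. Reading a SET cell too early flips a 1 to a 0, which always raises the zero count, so every such unidirectional error is detected and the read can be retried at the original latency.

DATA_BITS = 512   # one 64-byte line

def zero_count(data):
    return DATA_BITS - bin(data).count("1")

line = ((1 << DATA_BITS) - 1) ^ 0b1010     # a 512-bit pattern with two zeros
check = zero_count(line)                   # stored at write time (10 bits)
assert zero_count(line) == check           # clean early read passes

early = line & ~(1 << 100)                 # bit 100 misread as RESET
assert zero_count(early) != check          # unidirectional error detected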


ACM Transactions on Architecture and Code Optimization | 2014

Refresh pausing in DRAM memory systems

Prashant J. Nair; Chia-Chen Chou; Moinuddin K. Qureshi

Dynamic Random Access Memory (DRAM) cells rely on periodic refresh operations to maintain data integrity. As the capacity of DRAM memories has increased, so has the amount of time consumed in doing refresh. Refresh operations contend with read operations, which increases read latency and reduces system performance. We show that eliminating the latency penalty due to refresh can improve average performance by 7.2%. However, simply doing intelligent scheduling of refresh operations is ineffective at obtaining significant performance improvement. This article provides an alternative and scalable option to reduce the latency penalty due to refresh. It exploits the property that each refresh operation in a typical DRAM device internally refreshes multiple DRAM rows in JEDEC-based distributed refresh mode. Therefore, a refresh operation has well-defined points at which it can potentially be Paused to service a pending read request. Leveraging this property, we propose Refresh Pausing, a solution that is highly effective at alleviating the contention from refresh operations. It provides an average performance improvement of 5.1% for 8Gb devices and becomes even more effective for future high-density technologies. We also show that Refresh Pausing significantly outperforms the recently proposed Elastic Refresh scheme.


International Symposium on Microarchitecture (MICRO) | 2014

Citadel: Efficiently Protecting Stacked Memory from Large Granularity Failures

Prashant J. Nair; David A. Roberts; Moinuddin K. Qureshi

Stacked memory modules are likely to be tightly integrated with the processor. It is vital that these memory modules operate reliably, as memory failure can require the replacement of the entire socket. To make matters worse, stacked memory designs are susceptible to newer failure modes (for example, due to faulty through-silicon vias, or TSVs) that can cause large portions of memory, such as a bank, to become faulty. To avoid data loss from large-granularity failures, the memory system may use symbol-based codes that stripe the data for a cache line across several banks (or channels). Unfortunately, such data striping reduces memory-level parallelism, causing significant slowdown and higher power consumption. This paper proposes Citadel, a robust memory architecture that allows the memory system to retain each cache line within one bank, thus enabling high performance and low power while efficiently protecting the stacked memory from large-granularity failures. Citadel consists of three components: TSV-Swap, which can tolerate both faulty data TSVs and faulty address TSVs; Tri-Dimensional Parity (3DP), which can tolerate column failures, row failures, and bank failures; and Dynamic Dual-Granularity Sparing (DDS), which can mitigate permanent faults by dynamically sparing faulty memory regions either at a row granularity or at a bank granularity. Our evaluations with real-world data for DRAM failures show that Citadel provides performance and power similar to maintaining the entire cache line in the same bank, and yet provides 700x higher reliability than ChipKill-like ECC codes.
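A toy model of one axis of Tri-Dimensional Parity (3DP); the cube dimensions and data values are assumptions for illustration, not Citadel's actual organization. XOR parity kept across banks lets every word of a failed bank be rebuilt from the surviving banks plus the parity, without striping any single cache line across banks.

from functools import reduce
from operator import xor

BANKS, ROWS, COLS = 4, 4, 4
cube = [[[(b * 17 + r * 5 + c) & 0xFF for c in range(COLS)]
         for r in range(ROWS)] for b in range(BANKS)]

# Parity along the bank axis, computed at write time.
parity = [[reduce(xor, (cube[b][r][c] for b in range(BANKS)))
           for c in range(COLS)] for r in range(ROWS)]

dead = 2                                    # suppose bank 2 fails entirely
for r in range(ROWS):
    for c in range(COLS):
        survivors = reduce(xor, (cube[b][r][c] for b in range(BANKS) if b != dead))
        assert parity[r][c] ^ survivors == cube[dead][r][c]
print("failed bank reconstructed from cross-bank parity")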


Dependable Systems and Networks (DSN) | 2015

Reducing Refresh Power in Mobile Devices with Morphable ECC

Chia-Chen Chou; Prashant J. Nair; Moinuddin K. Qureshi

Energy consumption is a primary consideration that determines the usability of emerging mobile computing devices such as smartphones. Refresh operations for main memory account for a significant fraction of the overall energy consumption, especially during idle periods: while the processor can be switched off quickly, memory contents must continue to be refreshed to avoid data loss. Given that mobile devices are idle most of the time, reducing refresh power in idle mode is critical to maximizing the duration for which the device remains usable. The frequency of refresh operations in memory can be reduced significantly by using strong multi-bit error correction codes (ECC). Unfortunately, strong ECC codes incur high latency, which causes significant performance degradation (as high as 21%, and on average 10%). To obtain both low refresh power in idle periods and high performance in active periods, this paper proposes Morphable ECC (MECC). During idle periods, MECC keeps the memory protected with 6-bit ECC (ECC-6) and employs a refresh period of 1 second, instead of the typical refresh period of 64ms. During active operation, MECC reduces the refresh interval to 64ms and converts memory from ECC-6 to a weaker ECC (single-bit error correction) on demand, thus avoiding the high latency of ECC-6 except for the first access during the active mode. Our proposal reduces refresh operations during idle mode by 16x and memory power in idle mode by 2x, while retaining performance within 2% of a system that does not use any ECC.
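A behavioral sketch of the Morphable ECC mode switch (the line count, refresh periods, and class names are assumptions drawn from the abstract). Entering idle re-encodes all lines to strong ECC-6 and stretches the refresh period to 1 second; in active mode the refresh period returns to 64ms and each line morphs back to fast single-bit ECC on its first demand access.

class MeccModel:
    def __init__(self, num_lines):
        self.num_lines = num_lines
        self.ecc6_lines = set()          # lines currently under ECC-6
        self.refresh_period_ms = 64

    def enter_idle(self):
        self.ecc6_lines = set(range(self.num_lines))
        self.refresh_period_ms = 1000    # strong ECC permits slow refresh

    def access(self, line):
        """Returns True if this access pays the slow ECC-6 decode."""
        self.refresh_period_ms = 64      # active mode: fast refresh again
        slow = line in self.ecc6_lines
        self.ecc6_lines.discard(line)    # morph to single-bit ECC on demand
        return slow

m = MeccModel(num_lines=1024)
m.enter_idle()
assert m.access(7) is True               # first touch after idle: slow
assert m.access(7) is False              # later touches: fast ECC-1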

Collaboration


An overview of Prashant J. Nair's collaborations.

Top Co-Authors

Moinuddin K. Qureshi, Georgia Institute of Technology
Chia-Chen Chou, Georgia Institute of Technology
Dae-Hyun Kim, Georgia Institute of Technology
Donghyuk Lee, Carnegie Mellon University
Saugata Ghose, Carnegie Mellon University
Vinson Young, Georgia Institute of Technology
Bipin Rajendran, New Jersey Institute of Technology