Is this you? Create Your Porfile

Yu Cai

Carnegie Mellon University

Archive Network Publication Hotspot Collaboration

Network

Latest external collaboration on country level. Dive into details by clicking on the dots.

Explore More

Hotspot

Dive into the research topics where Yu Cai is active.

Explore More

Publication

Featured researches published by Yu Cai.

design, automation, and test in europe | 2012

Error patterns in MLC NAND flash memory: measurement, characterization, and analysis

Yu Cai; Erich F. Haratsch; Onur Mutlu; Ken Mai

As NAND flash memory manufacturers scale down to smaller process technology nodes and store more bits per cell, reliability and endurance of flash memory reduce. Wear-leveling and error correction coding can improve both reliability and endurance, but finding effective algorithms requires a strong understanding of flash memory error patterns. To enable such understanding, we have designed and implemented a framework for fast and accurate characterization of flash memory throughout its lifetime. This paper examines the complex flash errors that occur at 30-40nm flash technologies. We demonstrate distinct error patterns, such as cycle-dependency, location-dependency and value-dependency, for various types of flash operations. We analyze the discovered error patterns and explain why they exist from a circuit and device standpoint. Our hope is that the understanding developed from this characterization serves as a building block for new error tolerance algorithms for flash memory.

design, automation, and test in europe | 2013

Threshold voltage distribution in MLC NAND flash memory: characterization, analysis, and modeling

Yu Cai; Erich F. Haratsch; Onur Mutlu; Ken Mai

With continued scaling of NAND flash memory process technology and multiple bits programmed per cell, NAND flash reliability and endurance are degrading. Understanding, characterizing, and modeling the distribution of the threshold voltages across different cells in a modern multi-level cell (MLC) flash memory can enable the design of more effective and efficient error correction mechanisms to combat this degradation. We show the first published experimental measurement-based characterization of the threshold voltage distribution of flash memory. To accomplish this, we develop a testing infrastructure that uses the read retry feature present in some 2Y-nm (i.e., 20–24nm) flash chips. We devise a model of the threshold voltage distributions taking into account program/erase (P/E) cycle effects, analyze the noise in the distributions, and evaluate the accuracy of our model. A key result is that the threshold voltage distribution can be modeled, with more than 95% accuracy, as a Gaussian distribution with additive white noise, which shifts to the right and widens as P/E cycles increase. The novel characterization and models provided in this paper can enable the design of more effective error tolerance mechanisms for future flash memories.

international conference on computer design | 2012

Flash correct-and-refresh: Retention-aware error management for increased flash memory lifetime

Yu Cai; Gulay Yalcin; Onur Mutlu; Erich F. Haratsch; Adrián Cristal; Osman S. Unsal; Ken Mai

With the continued scaling of NAND flash and multi-level cell technology, flash-based storage has gained widespread use in systems ranging from mobile platforms to enterprise servers. However, the robustness of NAND flash cells is an increasing concern, especially at nanometer-regime process geometries. NAND flash memory bit error rate increases exponentially with the number of program/erase cycles. Stronger error correcting codes (ECC) can be used to tolerate higher error rates, but these have diminishing returns with increasing P/E cycles and can have prohibitively high power, area, and latency overheads. The goal of this paper is to develop new techniques that can tolerate high bit error rates without requiring prohibitively strong ECC. Our techniques, called Flash Correct-and-Refresh (FCR) exploit the observation that the dominant error source in NAND flash memory is retention errors, caused by flash cells losing charge over time. The key idea is to periodically read, correct, and reprogram (in-place) or remap the stored data before it accumulates more retention errors than can be corrected by simple ECC. Detailed simulations of a solid-state drive (SSD) storage system driven by measured experimental data from error characterization on real flash memory chips show that our techniques provide 46× average lifetime improvement on a variety of workloads at no additional hardware cost. We also find that our techniques achieve lifetime improvements that cannot feasibly be achieved with stronger ECC.

international conference on computer design | 2013

Program interference in MLC NAND flash memory: Characterization, modeling, and mitigation

Yu Cai; Onur Mutlu; Erich F. Haratsch; Ken Mai

As NAND flash memory continues to scale down to smaller process technology nodes, its reliability and endurance are degrading. One important source of reduced reliability is the phenomenon of program interference: when a flash cell is programmed to a value, the programming operation affects the threshold voltage of not only that cell, but also the other cells surrounding it. This interference potentially causes a surrounding cell to move to a logical state (i.e., a threshold voltage range) that is different from its original state, leading to an error when the cell is read. Understanding, characterizing, and modeling of program interference, i.e., how much the threshold voltage of a cell shifts when another cell is programmed, can enable the design of mechanisms that can effectively and efficiently predict and/or tolerate such errors. In this paper, we provide the first experimental characterization of and a realistic model for program interference in modern MLC NAND flash memory. To this end, we utilize the read-retry mechanism present in some state-of-the-art 2Y-nm (i.e., 20-24nm) flash chips to measure the changes in threshold voltage distributions of cells when a particular cell is programmed. Our results show that the amount of program interference received by a cell depends on 1) the location of the programmed cells, 2) the order in which cells are programmed, and 3) the data values of the cell that is being programmed as well as the cells surrounding it. Based on our experimental characterization, we develop a new model that predicts the amount of program interference as a function of threshold voltage values and changes in neighboring cells. We devise and evaluate one application of this model that adjusts the read reference voltage to the predicted threshold voltage distribution with the goal of minimizing erroneous reads. Our analysis shows that this new technique can reduce the raw flash bit error rate by 64% and thereby improve flash lifetime by 30%. We hope that the understanding and models developed in this paper lead to other error tolerance mechanisms for future flash memories.

high-performance computer architecture | 2015

Data retention in MLC NAND flash memory: Characterization, optimization, and recovery

Yu Cai; Yixin Luo; Erich F. Haratsch; Ken Mai; Onur Mutlu

Retention errors, caused by charge leakage over time, are the dominant source of flash memory errors. Understanding, characterizing, and reducing retention errors can significantly improve NAND flash memory reliability and endurance. In this paper, we first characterize, with real 2y-nm MLC NAND flash chips, how the threshold voltage distribution of flash memory changes with different retention age - the length of time since a flash cell was programmed. We observe from our characterization results that 1) the optimal read reference voltage of a flash cell, using which the data can be read with the lowest raw bit error rate (RBER), systematically changes with its retention age, and 2) different regions of flash memory can have different retention ages, and hence different optimal read reference voltages. Based on our findings, we propose two new techniques. First, Retention Optimized Reading (ROR) adaptively learns and applies the optimal read reference voltage for each flash memory block online. The key idea of ROR is to periodically learn a tight upper bound, and from there approach the optimal read reference voltage. Our evaluations show that ROR can extend flash memory lifetime by 64% and reduce average error correction latency by 10.1%, with only 768 KB storage overhead in flash memory for a 512 GB flash-based SSD. Second, Retention Failure Recovery (RFR) recovers data with uncorrectable errors offline by identifying and probabilistically correcting flash cells with retention errors. Our evaluation shows that RFR reduces RBER by 50%, which essentially doubles the error correction capability, and thus can effectively recover data from otherwise uncorrectable flash errors.

dependable systems and networks | 2015

Read Disturb Errors in MLC NAND Flash Memory: Characterization, Mitigation, and Recovery

Yu Cai; Yixin Luo; Saugata Ghose; Onur Mutlu

NAND flash memory reliability continues to degrade as the memory is scaled down and more bits are programmed per cell. A key contributor to this reduced reliability is read disturb, where a read to one row of cells impacts the threshold voltages of unread flash cells in different rows of the same block. Such disturbances may shift the threshold voltages of these unread cells to different logical states than originally programmed, leading to read errors that hurt endurance. For the first time in open literature, this paper experimentally characterizes read disturb errors on state-of-the-art 2Y-nm (i.e., 20-24 nm) MLC NAND flash memory chips. Our findings (1) correlate the magnitude of threshold voltage shifts with read operation counts, (2) demonstrate how program/erase cycle count and retention age affect the read-disturb-induced error rate, and (3) identify that lowering pass-through voltage levels reduces the impact of read disturb and extend flash lifetime. Particularly, we find that the probability of read disturb errors increases with both higher wear-out and higher pass-through voltage levels. We leverage these findings to develop two new techniques. The first technique mitigates read disturb errors by dynamically tuning the pass-through voltage on a per-block basis. Using real workload traces, our evaluations show that this technique increases flash memory endurance by an average of 21%. The second technique recovers from previously-uncorrectable flash errors by identifying and probabilistically correcting cells susceptible to read disturb errors. Our evaluations show that this recovery technique reduces the raw bit error rate by 36%.

measurement and modeling of computer systems | 2014

Neighbor-cell assisted error correction for MLC NAND flash memories

Yu Cai; Gulay Yalcin; Onur Mutlu; Erich F. Haratsch; Osman S. Unsal; Adrián Cristal; Ken Mai

Continued scaling of NAND flash memory to smaller process technology nodes decreases its reliability, necessitating more sophisticated mechanisms to correctly read stored data values. To distinguish between different potential stored values, conventional techniques to read data from flash memory employ a single set of reference voltage values, which are determined based on the overall threshold voltage distribution of flash cells. Unfortunately, the phenomenon of program interference, in which a cells threshold voltage unintentionally changes when a neighboring cell is programmed, makes this conventional approach increasingly inaccurate in determining the values of cells.n This paper makes the new empirical observation that identifying the value stored in the immediate-neighbor cell makes it easier to determine the data value stored in the cell that is being read. We provide a detailed statistical and experimental characterization of threshold voltage distribution of flash memory cells conditional upon the immediate-neighbor cell values, and show that such conditional distributions can be used to determine a set of read reference voltages that lead to error rates much lower than when a single set of reference voltage values based on the overall distribution are used. Based on our analyses, we propose a new method for correcting errors in a flash memory page, neighbor-cell assisted correction (NAC). The key idea is to re-read a flash memory page that fails error correction codes (ECC) with the set of read reference voltage values corresponding to the conditional threshold voltage distribution assuming a neighbor cell value and use the re-read values to correct the cells that have neighbors with that value. Our simulations show that NAC effectively improves flash memory lifetime by 33% while having no (at nominal lifetime) or very modest (less than 5% at extended lifetime) performance overhead.

ieee conference on mass storage systems and technologies | 2015

WARM: Improving NAND flash memory lifetime with write-hotness aware retention management

Yixin Luo; Yu Cai; Saugata Ghose; Jongmoo Choi; Onur Mutlu

Increased NAND flash memory density has come at the cost of lifetime reductions. Flash lifetime can be extended by relaxing internal data retention time, the duration for which a flash cell correctly holds data. Such relaxation cannot be exposed externally to avoid altering the expected data integrity property of a flash device. Reliability mechanisms, most prominently refresh, restore the duration of data integrity, but greatly reduce the lifetime improvements from retention time relaxation by performing a large number of write operations. We find that retention time relaxation can be achieved more efficiently by exploiting heterogeneity in write-hotness, i.e., the frequency at which each page is written. We propose WARM, a write-hotness aware retention management policy for flash memory, which identifies and physically groups together write-hot data within the flash device, allowing the flash controller to selectively perform retention time relaxation with little cost. When applied alone, WARM improves overall flash lifetime by an average of 3.24× over a conventional management policy without refresh, across a variety of real I/O workload traces. When WARM is applied together with an adaptive refresh mechanism, the average lifetime improves by 12.9×, 1.21× over adaptive refresh alone.

field-programmable custom computing machines | 2011

FPGA-Based Solid-State Drive Prototyping Platform

Yu Cai; Erich F. Haratsch; Mark P. McCartney; Ken Mai

NAND flash memory has been widely used for data storage due to its high density, high throughput, low cost, and low power. However, as flash memory manufacturers scale to smaller process technologies and store more bits per cell, the reliability and endurance of flash memory are decreasing. Wear-leveling and error correction coding can significantly improve both reliability and endurance, but finding effective algorithms requires quick and accurate characterization of flash memory error patterns. To this end, we have designed and implemented an FPGA-based open framework for quick, accurate, and comprehensive characterization of flash memories. Using this framework, we characterized detailed error patterns of flash memory throughout its entire lifetime. Our implementation uses an error accelerator block to decrease the test time by 20x. Based on these results, we propose and evaluate a smart bad block management policy in flash translation layer to increase SSD lifetime by up to 51%.

arXiv: Hardware Architecture | 2017

Error Characterization, Mitigation, and Recovery in Flash-Memory-Based Solid-State Drives

Yu Cai; Saugata Ghose; Erich F. Haratsch; Yixin Luo; Onur Mutlu

NAND flash memory is ubiquitous in everyday life today because its capacity has continuously increased and cost has continuously decreased over decades. This positive growth is a result of two key trends: 1) effective process technology scaling; and 2) multi-level (e.g., MLC, TLC) cell data coding. Unfortunately, the reliability of raw data stored in flash memory has also continued to become more difficult to ensure, because these two trends lead to 1) fewer electrons in the flash memory cell floating gate to represent the data; and 2) larger cell-to-cell interference and disturbance effects. Without mitigation, worsening reliability can reduce the lifetime of NAND flash memory. As a result, flash memory controllers in solid-state drives (SSDs) have become much more sophisticated: they incorporate many effective techniques to ensure the correct interpretation of noisy data stored in flash memory cells. In this article, we review recent advances in SSD error characterization, mitigation, and data recovery techniques for reliability and lifetime improvement. We provide rigorous experimental data from state-of-the-art MLC and TLC NAND flash devices on various types of flash memory errors, to motivate the need for such techniques. Based on the understanding developed by the experimental characterization, we describe several mitigation and recovery techniques, including 1) cell-to-cell interference mitigation; 2) optimal multi-level cell sensing; 3) error correction using state-of-the-art algorithms and methods; and 4) data recovery when error correction fails. We quantify the reliability improvement provided by each of these techniques. Looking forward, we briefly discuss how flash memory and these techniques could evolve into the future.

Explore More