Bit-Exact ECC Recovery (BEER): Determining DRAM On-Die ECC Functions by Exploiting DRAM Data Retention Characteristics

Abstract

Increasing single-cell DRAM error rates have pushed DRAM manufacturers to adopt on-die error-correction coding (ECC), which operates entirely within a DRAM chip to improve factory yield. The on-die ECC function and its effects on DRAM reliability are considered trade secrets, so only the manufacturer knows precisely how on-die ECC alters the externally-visible reliability characteristics. Consequently, on-die ECC obstructs third-party DRAM customers (e.g., test engineers, experimental researchers), who typically design, test, and validate systems based on these characteristics. To give third parties insight into precisely how on-die ECC transforms DRAM error patterns during error correction, we introduce Bit-Exact ECC Recovery (BEER), a new methodology for determining the full DRAM on-die ECC function (i.e., its parity-check matrix) without hardware tools, prerequisite knowledge about the DRAM chip or on-die ECC mechanism, or access to ECC metadata (e.g., error syndromes, parity information). BEER exploits the key insight that non-intrusively inducing data-retention errors with carefully-crafted test patterns reveals behavior that is unique to a specific ECC function. We use BEER to identify the ECC functions of 80 real LPDDR4 DRAM chips with on-die ECC from three major DRAM manufacturers. We evaluate BEER's correctness in simulation and performance on a real system to show that BEER is effective and practical across a wide range of on-die ECC functions. To demonstrate BEER's value, we propose and discuss several ways that third parties can use BEER to improve their design and testing practices. As a concrete example, we introduce and evaluate BEEP, the first error profiling methodology that uses the known on-die ECC function to recover the number and bit-exact locations of unobservable raw bit errors responsible for observable post-correction errors.



Minesh Patel † Jeremie S. Kim ‡† Taha Shahroodi † Hasan Hassan † Onur Mutlu †‡

† ETH Zürich ‡ Carnegie Mellon University


1. Introduction

Dynamic random access memory (DRAM) is the predominant choice for system main memory across a wide variety of computing platforms due to its favorable cost-per-bit relative to other memory technologies. DRAM manufacturers maintain a competitive advantage by improving raw storage densities across device generations. Unfortunately, these improvements largely rely on process technology scaling, which causes serious reliability issues that reduce factory yield. DRAM manufacturers traditionally mitigate yield loss using post-manufacturing repair techniques such as row/column sparing [51]. However, continued technology scaling in modern DRAM chips requires stronger error-mitigation mechanisms to remain viable because of random single-bit errors that are increasingly frequent at smaller process technology nodes [39, 76, 89, 99, 109, 119, 120, 124, 127, 129, 133, 160]. Therefore, DRAM manufacturers have begun to use on-die error correction coding (on-die ECC), which silently corrects single-bit errors entirely within the DRAM chip [39, 76, 120, 129, 138]. On-die ECC is completely invisible outside of the DRAM chip, so ECC metadata (i.e., parity-check bits, error syndromes) that is used to correct errors is hidden from the rest of the system.

Prior works [60, 97, 98, 120, 129, 133, 138, 147] indicate that existing on-die ECC codes are 64- or 128-bit single-error correction (SEC) Hamming codes [44]. However, each DRAM manufacturer considers their on-die ECC mechanism's design and implementation to be highly proprietary and takes care not to reveal its details in any public documentation, including DRAM standards [68, 69], DRAM datasheets [63, 121, 149, 158], publications [76, 97, 98, 133], and industry whitepapers [120, 147].

Because the unknown on-die ECC function is encapsulated within the DRAM chip, it obfuscates raw bit errors (i.e., pre-correction errors) in an ECC-function-specific manner.
Therefore, the locations of software-visible uncorrectable errors (i.e., post-correction errors) often no longer match those of the pre-correction errors that were caused by physical DRAM error mechanisms. While this behavior appears desirable from a black-box perspective, it poses serious problems for third-party DRAM customers who study, test and validate, and/or design systems based on the reliability characteristics of the DRAM chips that they buy and use. Section 2.2 describes these customers and the problems they face in detail, including, but not limited to, three important groups: (1) system designers who need to ensure that supplementary error-mitigation mechanisms (e.g., rank-level ECC within the DRAM controller) are carefully designed to cooperate with the on-die ECC function [40, 129, 160], (2) large-scale industries (e.g., computing system providers such as Microsoft [33], HP [47], and Intel [59], DRAM module manufacturers [4, 92, 159]) or government entities (e.g., national labs [131, 150]) who must understand DRAM reliability characteristics when validating DRAM chips they buy and use, and (3) researchers who need full visibility into physical device characteristics to study and model DRAM reliability [17, 20, 31, 42, 43, 46, 72, 78–86, 109, 138, 139, 172, 178].

For each of these third parties, merely knowing or reverse-engineering the type of ECC code (e.g., n-bit Hamming code) based on existing industry [60, 97, 98, 120, 133, 147] and academic [129, 138] publications is not enough to determine exactly how the ECC mechanism obfuscates specific error patterns. This is because an ECC code of a given type can have many different implementations based on how its ECC function (i.e., its parity-check matrix) is designed, and different designs lead to different reliability characteristics.
For example, Figure 1 shows the relative probability of observing errors in different bit positions for three different ECC codes of the same type (i.e., single-error correction Hamming code with 32 data bits plus parity-check bits) but that use different ECC functions. We obtain this data by simulating 10 ECC words using the EINSim simulator [2, 138] and show medians and 95% confidence intervals calculated via statistical bootstrapping [32] over 1000 samples. We simulate a test pattern with uniform-random pre-correction errors at a raw bit error rate of 10^-4 (e.g., as often seen in experimental studies [17, 20, 43, 46, 76, 102, 109, 139, 157]).

Footnote: We use the term "error" to refer to any bit-flip event, whether observed (e.g., uncorrectable bit-flips) or unobserved (e.g., corrected by ECC).

arXiv [cs.AR], Sep

[Figure 1 plot. Y-axis: Relative Error Probability. Series: Pre-Correction; Post-Correction (ECC Function 0); Post-Correction (ECC Function 1); Post-Correction (ECC Function 2).]

Figure 1: Relative error probabilities in different bit positions for different ECC functions with uniform-randomly distributed pre-correction (i.e., raw) bit errors.

The data demonstrates that ECC codes of the same type can have vastly different post-correction error characteristics. This is because each ECC mechanism acts differently when faced with more errors than it can correct (i.e., uncorrectable errors), causing it to mistakenly perform ECC-function-specific "corrections" to bits that did not experience errors (i.e., miscorrections, which Section 3.3 expands upon). Therefore, a researcher or engineer who studies two DRAM chips that use the same type of ECC code but different ECC functions may find that the chips' software-visible reliability characteristics are quite different even if the physical DRAM cells' reliability characteristics are identical. On the other hand, if we know the full ECC function (i.e., its parity-check matrix), we can calculate exactly which pre-correction error pattern(s) result in a set of observed errors. Figure 1 is a result of aggregating such calculations across 10 error patterns, and Section 7.1 demonstrates how we can use the ECC function to infer pre-correction error counts and locations using only observed post-correction errors.

Knowing the precise transformation between pre- and post-correction errors benefits all of the aforementioned third-party use cases because it provides system designers, test engineers, and researchers with a way to isolate the error characteristics of the memory itself from the effects of a particular ECC function.
Section 2.2 provides several example use cases and describes the benefits of knowing the ECC function in detail. While specialized, possibly intrusive methods (e.g., chip teardown [66, 164], advanced imaging techniques [48, 164]) can theoretically extract the ECC function, such techniques are typically inaccessible to or infeasible for many third-party users.

To enable third parties to reconstruct pre-correction DRAM reliability characteristics, our goal is to develop a methodology that can reliably and accurately determine the full on-die ECC function without requiring hardware tools, prerequisite knowledge about the DRAM chip or on-die ECC mechanism, or access to ECC metadata (e.g., error syndromes, parity information). To this end, we develop Bit-Exact ECC Recovery (BEER), a new methodology for determining a DRAM chip's full on-die ECC function simply by studying the software-visible post-correction error patterns that it generates. Thus, BEER requires no hardware support, hardware intrusion, or access to internal ECC metadata (e.g., error syndromes, parity information). BEER exploits the key insight that forcing the ECC function to act upon carefully-crafted uncorrectable error patterns reveals ECC-function-specific behavior that disambiguates different ECC functions. BEER comprises three key steps: (1) deliberately inducing uncorrectable data-retention errors by pausing DRAM refresh while using carefully-crafted test patterns to control the errors' bit-locations, which is done by leveraging data-retention errors' intrinsic data-pattern asymmetry (discussed in Section 3.2), (2) enumerating the bit positions where the ECC mechanism causes miscorrections, and (3) using a SAT solver [28] to solve for the unique parity-check matrix that causes the observed set of miscorrections.

We experimentally apply BEER to 80 real LPDDR4 DRAM chips with on-die ECC from three major DRAM manufacturers to determine the chips' on-die ECC functions. We describe the experimental steps required to apply BEER to any DRAM chip with on-die ECC and show that BEER tolerates observed experimental noise. We show that different manufacturers appear to use different on-die ECC functions while chips from the same manufacturer and model number appear to use the same on-die ECC function (Section 5.1.3). Unfortunately, our experimental studies with real DRAM chips have two limitations against further validation: (1) because the on-die ECC function is considered a trade secret by each manufacturer, we are unable to obtain a groundtruth to compare BEER's results against, even when considering non-disclosure agreements with DRAM manufacturers, and (2) we are unable to publish the final ECC functions that we uncover using BEER for confidentiality reasons (discussed in Section 2.1).

To overcome the limitations of experimental studies with real DRAM chips, we rigorously evaluate BEER's correctness in simulation (Section 6).

Footnote: Other patterns show similar behavior, including RANDOM data.

Footnote: Capturing approximately 10 of the 2 ≈ × unique patterns.
We show that BEER correctly recovers the on-die ECC function for 115300 single-error correction Hamming codes, which are representative of on-die ECC, with ECC word lengths ranging from 4 to 247 bits. We evaluate our BEER implementation's runtime and memory consumption using a real system to demonstrate that BEER is practical and the SAT problem that BEER requires is realistically solvable.

Footnote: This irregular number arises from evaluating a different number of ECC functions for different code lengths because longer codes require exponentially more simulation time (discussed in Section 6.1).

To demonstrate how BEER is useful in practice, we propose and discuss several ways that third parties can leverage the ECC function that BEER reveals to more effectively design, study, and test systems that use DRAM chips with on-die ECC (Section 7). As a concrete example, we introduce and evaluate Bit-Exact Error Profiling (BEEP), a new DRAM data-retention error profiling methodology that reconstructs pre-correction error counts and locations purely from observed post-correction errors. Using the ECC function revealed by BEER, BEEP infers precisely which unobservable raw bit errors correspond to observed post-correction errors at a given set of testing conditions. We show that BEEP enables characterizing pre-correction errors across a wide range of ECC functions, ECC word lengths, error patterns, and error rates. We publicly release our tools as open-source software: (1) a new tool [1] for applying BEER to experimental data from real DRAM chips and (2) enhancements to EINSim [2] for evaluating BEER and BEEP in simulation.

This paper makes the following key contributions:
1. We provide Bit-Exact ECC Recovery (BEER), the first methodology that determines the full DRAM on-die ECC function (i.e., its parity-check matrix) without requiring …

2. Challenges of Unknown On-Die ECCs

This section discusses why on-die ECC is considered proprietary, how its secrecy causes difficulties for third-party consumers, and how the BEER methodology helps overcome these difficulties by identifying the full on-die ECC function.

On-die ECC silently mitigates increasing single-bit errors that reduce factory yield [39, 76, 89, 99, 109, 119, 120, 124, 127, 129, 133, 160]. Because on-die ECC is invisible to the external DRAM chip interface, older DRAM standards [68, 69] place no restrictions on the on-die ECC mechanism while newer standards [70] specify only a high-level description for on-die ECC to support new (albeit limited) DDR5 features, e.g., on-die ECC scrubbing. In particular, there are no restrictions on the design or implementation of the on-die ECC function itself.

This means that knowing an on-die ECC mechanism's details could reveal information about its manufacturer's factory yield rates, which are highly proprietary [23, 55] due to their direct connection with business interests, potential legal concerns, and competitiveness in a USD 45+ billion DRAM market [143, 170]. Therefore, manufacturers consider their on-die ECC designs and implementations to be trade secrets that they are unwilling to disclose. In our experience, DRAM manufacturers will not reveal on-die ECC details under confidentiality agreements, even for large-scale industry board vendors for whom knowing the details stands to be mutually beneficial.

Footnote: Even if such agreements were possible, industry teams and academics without major business relations with DRAM manufacturers (i.e., an overwhelming majority of the potentially interested scientists and engineers) will likely be unable to secure disclosure.

This raises two challenges for our experiments with real DRAM chips: (1) we do not have access to "groundtruth" ECC functions to validate BEER's results against and (2) we cannot publish the final ECC functions that we determine using BEER for confidentiality reasons based on our relationships with the DRAM manufacturers. However, this does not prevent third-party consumers from applying BEER to their own devices, and we hope that our work encourages DRAM manufacturers to be more open with their designs going forward.

On-die ECC alters a DRAM chip's software-visible reliability characteristics so that they are no longer determined solely by how errors physically occur within the DRAM chip. Figure 1 illustrates this by showing how using different on-die ECC functions changes how the same underlying DRAM errors appear to the end user. Instead of following the pre-correction error distribution (i.e., uniform-random errors), the post-correction errors exhibit ECC-function-specific shapes that are difficult to predict without knowing precisely which ECC function is used in each case. This means that two commodity DRAM chips with different on-die ECC functions may show similar or different reliability characteristics irrespective of how the underlying DRAM technology and error mechanisms behave. Therefore, the physical error mechanisms' behavior alone can no longer explain a DRAM chip's post-correction error characteristics.

Unfortunately, this poses a serious problem for third-party DRAM consumers (e.g., system designers, testers, and researchers), who can no longer accurately understand a DRAM chip's reliability characteristics by studying its software-visible errors.
This lack of understanding prevents third parties from both (1) making informed design decisions, e.g., when building memory-controller based error-mitigation mechanisms to complement on-die ECC and (2) developing new ideas that rely on leveraging predictable aspects of a DRAM chip's reliability characteristics, e.g., physical error mechanisms that are fundamental to all DRAM technology. As error rates worsen with continued technology scaling [39, 76, 86, 89, 90, 99, 119, 120, 124, 127, 129, 133], manufacturers will likely resort to stronger codes that further distort the post-correction reliability characteristics. The remainder of this section describes three key ways in which an unknown on-die ECC function hinders third parties and how determining the function helps mitigate the problem.

Designing High-Reliability Systems.

System designers often seek to improve memory reliability beyond that which the DRAM provides alone (e.g., by including rank-level ECC within the memory controllers of server-class machines or ECC within on-chip caches). In particular, rank-level ECCs are carefully designed to mitigate common DRAM failure modes [21] (e.g., chip failure [129], burst errors [29, 116]) in order to correct as many errors as possible. However, designing for key failure modes requires knowing a DRAM chip's reliability characteristics, including the effects of any underlying ECC function (e.g., on-die ECC) [40, 160]. For example, Son et al. [160] show that if on-die ECC suffers an uncorrectable error and mistakenly "corrects" a non-erroneous bit (i.e., introduces a miscorrection), the stronger rank-level ECC may no longer be able to even detect what would otherwise be a detectable (possibly correctable) error. To prevent this scenario, both levels of ECC must be carefully co-designed to complement each other's weaknesses. In …

Footnote: While full disclosure would be ideal, a more realistic scenario could be more flexible on-die ECC confidentiality agreements. As recent work [35] shows, security or protection by obscurity is likely a poor strategy in practice.

Testing, Validation, and Quality Assurance.

Large-scale computing system providers (e.g., Microsoft [33], HP [47], Intel [59]), DRAM module manufacturers [4, 92, 159], and government entities (e.g., national labs [131, 150]) typically perform extensive third-party testing of the DRAM chips they purchase in order to ensure that the chips meet internal performance/energy/reliability targets. These tests validate that DRAM chips operate as expected and that there are well-understood, convincing root-causes (e.g., fundamental DRAM error mechanisms) for any observed errors. Unfortunately, on-die ECC interferes with two key components of such testing. First, it obfuscates the number and bit-exact locations of pre-correction errors, so diagnosing the root cause for any observed error becomes challenging. Second, on-die ECC encodes all written data into ECC codewords, so the values written into the physical cells likely do not match the values observed at the DRAM chip interface. The encoding process defeats carefully-constructed test patterns that target specific circuit-level phenomena (e.g., exacerbating interference between bitlines [3, 79, 123]) because the encoded data may no longer have the intended effect. Unfortunately, constructing such patterns is crucial for efficient testing since it minimizes the testing time required to achieve high error coverage [3, 51]. In both cases, the full on-die ECC function determined by BEER describes exactly how on-die ECC transforms pre-correction error patterns into post-correction ones. This enables users to infer pre-correction error locations (demonstrated in Section 7.1) and design test patterns that result in codewords with desired properties (discussed in Section 7.2).

Scientific Error-Characterization Studies.

Scientific error-characterization studies explore physical DRAM error mechanisms (e.g., data retention [42, 43, 46, 74, 75, 78–81, 109, 139, 157, 172, 173], reduced access-latency [16, 17, 20, 37, 83–85, 102, 104], circuit disturbance [35, 79, 81, 86, 90, 135, 136]) by deliberately exacerbating the error mechanism and analyzing the resulting errors' statistical properties (e.g., frequency, spatial distribution). These studies help build error models [20, 31, 43, 83, 94, 104, 157, 178], leading to new DRAM designs and operating points that improve upon the state-of-the-art. Unfortunately, on-die ECC complicates error analysis and modeling by (1) obscuring the physical pre-correction errors that are the object of study and (2) preventing direct access to parity-check bits, thereby precluding comprehensive testing of all DRAM cells in a given chip. Although prior work [138] enables inferring high-level statistical characteristics of the pre-correction errors, it does not provide a precise mapping between pre-correction and post-correction errors, which is only possible knowing the full ECC function. Knowing the full ECC function, via our new BEER methodology, enables recovering the bit-exact locations of pre-correction errors throughout the entire ECC word (as we demonstrate in Section 7.1) so that error-characterization studies can separate the effects of DRAM error mechanisms from those of on-die ECC. Section 7 provides a detailed discussion of several key characterization studies that BEER enables.

3. Background

This section provides a basic overview of DRAM, coding theory, and satisfiability (SAT) solvers as pertinent to this manuscript. For further detail, we refer the reader to comprehensive texts on DRAM design and operation [17–20, 45, 54, 61, 64, 65, 77, 102, 103, 106, 111, 153–155, 180], coding theory [25, 53, 108, 115, 122, 146, 148], and SAT solvers [8, 24, 28, 30].

A DRAM chip stores each data bit in its own storage cell using the charge level of a storage capacitor. Because the capacitor is susceptible to charge leakage [26, 42, 90, 95, 109, 110, 138, 139, 169], the stored value may eventually degrade to the point of data loss, resulting in a data-retention error. During normal DRAM operation, a refresh operation restores the data value stored in each cell every refresh window (tREFw), e.g., 32 ms or 64 ms [67–69, 109, 110, 139], to prevent data-retention errors.

Depending on a given chip's circuit design, each cell may store data using one of two encoding conventions: a true-cell encodes data '1' as a fully-charged storage capacitor (i.e., the CHARGED state), and an anti-cell encodes data '1' as a fully-discharged capacitor (i.e., the DISCHARGED state). Although a cell's encoding scheme is transparent to the rest of the system during normal operation, it becomes evident in the presence of data-retention errors because DRAM cells typically decay only from their CHARGED to their DISCHARGED state, as shown experimentally by prior work [26, 90, 95, 109, 110, 138, 139].
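The asymmetry described above can be illustrated with a toy model. This is a minimal sketch, not a model of any real chip: the function names and the per-cell failure probability `p_fail` are assumptions introduced here for illustration.

```python
import random

CHARGED, DISCHARGED = 1, 0

def write_cell(bit, is_true_cell):
    """Capacitor state that stores `bit` under the cell's encoding convention."""
    # True-cell: data 1 -> CHARGED. Anti-cell: data 1 -> DISCHARGED.
    return CHARGED if (bit == 1) == is_true_cell else DISCHARGED

def read_cell(state, is_true_cell):
    """Translate a capacitor state back into a data bit."""
    return 1 if (state == CHARGED) == is_true_cell else 0

def leak(state, p_fail, rng):
    """Hypothetical paused-refresh interval: only CHARGED cells may decay."""
    if state == CHARGED and rng.random() < p_fail:
        return DISCHARGED
    return state

def retention_test(data, is_true_cell, p_fail, seed=0):
    """Write `data`, let cells leak once, and read the data back."""
    rng = random.Random(seed)
    states = [write_cell(b, is_true_cell) for b in data]
    states = [leak(s, p_fail, rng) for s in states]
    return [read_cell(s, is_true_cell) for s in states]

data = [1, 0, 1, 1]
true_cell_readout = retention_test(data, is_true_cell=True, p_fail=1.0)
anti_cell_readout = retention_test(data, is_true_cell=False, p_fail=1.0)
```

With every CHARGED cell forced to fail, true-cells can only show 1-to-0 flips and anti-cells only 0-to-1 flips, which is why the written data pattern controls where retention errors can appear.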

Deliberately inducing DRAM errors (e.g., by violating default timing parameters) reveals detailed information about a DRAM chip's internal design through the resulting errors' statistical characteristics. Prior works use custom memory testing platforms (e.g., FPGA-based [46]) and commodity CPUs [6, 57] (e.g., by changing CPU configuration registers via the BIOS [56]) to study a variety of DRAM error mechanisms, including data-retention [26, 90, 95, 109, 110, 138, 139], circuit timing violations [17, 83–85, 102, 104], and RowHammer [86, 90, 125, 126, 135, 136]. Our work focuses on data-retention errors because they exhibit well-studied properties that are helpful for our purposes:
1. They are easily induced and controlled by manipulating the refresh window (tREFw) and ambient temperature.
2. They are repeatable [139, 163] and their spatial distribution is uniform random [7, 43, 84, 138, 157].
3. They fail unidirectionally from the CHARGED state to the DISCHARGED state [26, 90, 95, 109, 110, 138, 139].

Off-DRAM-Chip Errors.

Software-visible memory errors often occur due to failures in components outside the DRAM chip (e.g., sockets, buses) [119]. However, our work focuses on errors that occur within a DRAM chip, which are a serious and growing concern at modern technology node sizes [39, 76, 89, 99, 119, 120, 124, 127, 129, 133, 160]. These errors are the primary motivation for on-die ECC, which attempts to correct them before they are ever observed outside the DRAM chip.

As manufacturers continue to increase DRAM storage density, unwanted single-bit errors appear more frequently [39, 41, 50, 76, 89, 99, 105, 114, 119, 120, 128, 129, 133, 138, 151, 161, 162] and reduce factory yield. To combat these errors, manufacturers use on-die ECC [39, 76, 120, 128, 129, 133, 138], which is an error-correction code implemented directly in the DRAM chip.

Figure 2 shows how a system might interface with a memory chip that uses on-die ECC. The system writes k-bit datawords (d) to the chip, which internally maintains an expanded n-bit representation of the data called a codeword (c), created by the ECC encoding of d. The stored codeword may experience errors, resulting in a potentially erroneous codeword (c′). If more errors occur than ECC can correct, e.g., two errors in a single-error correction (SEC) code, the final dataword read out after ECC decoding (d′) may also contain errors. The encoding and decoding functions are labeled F_encode and F_decode.

[Figure 2 diagram. The SYSTEM exchanges datawords (d, d′) with the MEMORY CHIP over the DRAM BUS. Inside the chip, the ECC Encoder applies F_encode(d) to produce the codeword (c); error(s) turn it into codeword′ (c′); the ECC Decoder applies F_decode(c′) to produce dataword′ (d′).]

Figure 2: Interfacing a memory chip that uses on-die ECC.

For all linear codes (e.g., SEC Hamming codes [44]), F_encode and F_decode can be represented using matrix transformations. As a demonstrative example throughout this paper, we use the (7, 4, 3) Hamming code [44] shown in Equation 1. F_encode represents a generator matrix G such that the codeword c is computed from the dataword d as c = G · d.

$$F_{encode} = G^{T} = \begin{bmatrix} \cdots \end{bmatrix} \qquad F_{decode} = H = \begin{bmatrix} \cdots \end{bmatrix} \tag{1}$$

Decoding.

The most common decoding algorithm is known as syndrome decoding, which simply computes an error syndrome s = H · c′ that describes if and where an error exists:
• s = 0: no error detected.
• s ≠ 0: error detected, and s describes its bit-exact location.

Note that the error syndrome computation is unaware of the true error count; it blindly computes the error syndrome(s) assuming a low probability of uncorrectable errors. If, however, an uncorrectable error is present (e.g., deliberately induced during testing), one of three possibilities may occur:
• Silent data corruption: syndrome is zero; no error.
• Partial correction: syndrome points to one of the errors.
• Miscorrection: syndrome points to a non-erroneous bit.

When a nonzero error syndrome occurs, the ECC decoding logic simply flips the bit pointed to by the error syndrome, potentially exacerbating the overall number of errors.
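The three outcomes listed above can be reproduced with a small syndrome decoder. The sketch below is illustrative only: the parity-check matrix `H` is one valid (7, 4) Hamming code in standard form chosen for this example (an assumption, not necessarily the matrix of Equation 1), and the helper names are hypothetical.

```python
# Hypothetical (7, 4) SEC Hamming code in standard form H = [P | I].
H = [
    [1, 1, 1, 0, 1, 0, 0],
    [1, 1, 0, 1, 0, 1, 0],
    [1, 0, 1, 1, 0, 0, 1],
]
K, N = 4, 7
# Column j of H is the syndrome produced by a single error at position j.
COLUMNS = [tuple(row[j] for row in H) for j in range(N)]

def encode(data):
    """Append parity bits computed from the P sub-matrix (systematic encoding)."""
    parity = [sum(row[j] * data[j] for j in range(K)) % 2 for row in H]
    return list(data) + parity

def syndrome(word):
    """s = H * c' over GF(2)."""
    return tuple(sum(h * w for h, w in zip(row, word)) % 2 for row in H)

def decode(word):
    """Return (word after correction, flipped position or None)."""
    s = syndrome(word)
    if not any(s):
        return list(word), None
    pos = COLUMNS.index(s)  # every nonzero syndrome matches a column in a (7, 4) code
    fixed = list(word)
    fixed[pos] ^= 1
    return fixed, pos

def classify(error_positions):
    """Name the decoder's reaction to bit flips injected at `error_positions`."""
    word = encode([0, 0, 0, 0])
    for p in error_positions:
        word[p] ^= 1
    _, flipped = decode(word)
    if flipped is None:
        return "silent data corruption" if error_positions else "no error"
    if flipped in error_positions:
        return "correction" if len(error_positions) == 1 else "partial correction"
    return "miscorrection"
```

For this particular matrix, `classify([0, 1])` yields a miscorrection (the syndrome points at bit 6), `classify([0, 1, 6])` is silent, and `classify([0, 1, 2, 6])` is a partial correction. A different column arrangement would map the same error patterns to different outcomes.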

Design Space.

Each manufacturer can freely select F_encode and F_decode functions, whose implementations can help to meet a set of design constraints (e.g., circuit area, reliability, power consumption). The space of functions that a designer can choose from is quantified by the number of arrangements of columns of H. This means that for an n-bit code with k data bits, there are $\binom{2^{n-k}-1}{n}$ possible ECC functions. Section 4.2 formalizes this space of possible functions in the context of our work.

Satisfiability (SAT) solvers [8, 24, 28, 30, 38, 140] find possible solutions to logic equation(s) with one or more unknown Boolean variables. A SAT solver accepts one or more such equations as inputs, which effectively act as constraints over the unknown variables. The SAT solver then attempts to determine a set of values for the unknown variables such that the equations are satisfied (i.e., the constraints are met). The SAT solver will return either (1) one (of possibly many) solutions or (2) no solution if the Boolean equation is unsolvable.
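To make the solver's input/output contract concrete, the sketch below searches all assignments exhaustively. This is only an illustration of the interface described above: production SAT solvers use far more efficient search algorithms, and the function name `brute_force_sat` and the example constraints are assumptions introduced here.

```python
from itertools import product

def brute_force_sat(num_vars, constraints):
    """Return the first assignment that satisfies every constraint, or None."""
    for assignment in product([False, True], repeat=num_vars):
        if all(check(assignment) for check in constraints):
            return assignment
    return None

# Example constraints over x0, x1, x2: (x0 XOR x1), (x1 OR x2), (NOT x0).
constraints = [
    lambda v: v[0] != v[1],
    lambda v: v[1] or v[2],
    lambda v: not v[0],
]
solution = brute_force_sat(3, constraints)

# Contradictory constraints (x0 AND NOT x0) have no solution.
unsat = brute_force_sat(1, [lambda v: v[0], lambda v: not v[0]])
```

The first call returns one of possibly several satisfying assignments, and the second returns `None`, mirroring the two outcomes a SAT solver can report.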

4. Determining the ECC Function

BEER identifies an unknown ECC function by systematically reconstructing its parity-check matrix based on the error syndromes that the ECC logic generates while correcting errors. Different ECC functions compute different error syndromes for a given error pattern, and by constructing and analyzing carefully-crafted test cases, BEER uniquely identifies which ECC function a particular implementation uses. This section describes how and why this process works. Section 5 describes how BEER accomplishes this in practice for on-die ECC.

DRAM ECCs are linear block codes, e.g., Hamming codes [44] for on-die ECC [60, 97, 98, 120, 129, 133, 138, 147], BCH [9, 49] or Reed-Solomon [145] codes for rank-level ECC [26, 87], whose encoding and decoding operations are described by linear transformations of their respective inputs (i.e., G and H matrices, respectively). We can therefore determine the full ECC function by independently determining each of its linear components.

We can isolate each linear component of the ECC function by injecting errors in each codeword bit position and observing the resulting error syndromes. For example, an n-bit Hamming code's parity-check matrix can be systematically determined by injecting a single-bit error in each of the n bit positions: the error syndrome that the ECC decoder computes for each pattern is exactly equal to the column of the parity-check matrix that corresponds to the position of the injected error. As an example, Equation 2 shows how injecting an error at position 2 (i.e., adding error pattern $e_2$ to codeword c) extracts the corresponding column of the parity-check matrix H in the error syndrome s. By the definition of a block code, H · c = 0 for all codewords [27, 53], so $e_2$ isolates column 2 of H (i.e., $H_{*,2}$).

$$s = H \cdot c' = H \cdot (c + e_2) = H \cdot c + H \cdot e_2 = 0 + H_{*,2} = H_{*,2} \tag{2}$$

Thus, the entire parity-check matrix can be fully determined by testing across all 1-hot error patterns. Cojocar et al. [26] use this approach on DRAM rank-level ECC, injecting errors into codewords on the DDR bus and reading the resulting error syndromes provided by the memory controller.
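The column-extraction argument above can be checked numerically. The following is a minimal sketch that assumes the decoder exposes its syndromes, which holds for the rank-level ECC setting but not for on-die ECC; the matrix `H_HIDDEN` and the function names are hypothetical.

```python
# Hypothetical "unknown" parity-check matrix that the procedure should recover.
H_HIDDEN = [
    [1, 1, 1, 0, 1, 0, 0],
    [1, 1, 0, 1, 0, 1, 0],
    [1, 0, 1, 1, 0, 0, 1],
]

def syndrome(H, word):
    """s = H * c' over GF(2)."""
    return [sum(h * w for h, w in zip(row, word)) % 2 for row in H]

def recover_parity_check(codeword, get_syndrome):
    """Rebuild H column by column from the syndromes of all 1-hot error patterns."""
    columns = []
    for pos in range(len(codeword)):
        corrupted = list(codeword)
        corrupted[pos] ^= 1          # c' = c + e_pos
        columns.append(get_syndrome(corrupted))  # equals column `pos` of H
    # Transpose the list of columns back into row-major form.
    return [list(row) for row in zip(*columns)]

codeword = [1, 0, 1, 1, 0, 0, 1]     # a valid codeword of H_HIDDEN
recovered = recover_parity_check(codeword, lambda w: syndrome(H_HIDDEN, w))
```

Because H · c = 0 for any valid codeword, the choice of `codeword` does not affect the result: each observed syndrome depends only on the injected error position.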

Unfortunately, systematically determining an ECC function as described in Section 4.1 is not possible with on-die ECC for two key reasons. First, on-die ECC's parity-check bits cannot be accessed directly, so we have no easy way to inject an error within them. Second, on-die ECC does not signal an error-correction event or report error syndromes (i.e., s). Therefore, even if specialized methods (e.g., chip teardown [66, 164], advanced imaging techniques [48, 164]) could inject errors within a DRAM chip package where the on-die ECC mechanism resides, the error syndromes would remain invisible, so the approach taken by Cojocar et al. [26] cannot be applied to on-die ECC. (Such methods may reveal the exact on-die ECC circuitry. However, they are typically inaccessible to or infeasible for many third-party consumers.) To determine the on-die ECC function using the approach of Section 4.1, we first formalize the unknown on-die ECC function and then determine how we can infer error syndromes within the constraints of the formalized problem.

We assume that on-die ECC uses a systematic encoding, which means that the ECC function stores data bits unmodified. This is a reasonable assumption for real hardware since it greatly simplifies data access [181] and is consistent with our experimental results in Section 5.1.2. Furthermore, because the DRAM chip interface exposes only data bits, the relative ordering of parity-check bits within the codeword is irrelevant from the system's perspective. Mathematically, the different choices of bit positions represent equivalent codes that all have identical error-correction properties and differ only in their internal representations [146, 148], which on-die ECC does not expose. Therefore, we are free to arbitrarily choose the parity-check bit positions and express the code in standard form, where we express the parity-check matrix for an (n, k) code as a partitioned matrix $H_{(n-k) \times n} = [P_{(n-k) \times k} \mid I_{(n-k) \times (n-k)}]$.
P is a conventional notation for the sub-matrix that corresponds to information bit positions and I is an identity matrix that corresponds to parity-check bit positions. Note that the example ECC code of Equation 1 is in standard form. With this representation, all codewords take the form $c_{1 \times n} = [d_0\ d_1\ \ldots\ d_{k-1} \mid p_0\ p_1\ \ldots\ p_{n-k-1}]$, where d and p are data and parity-check symbols, respectively.

Given that on-die ECC conceals error syndromes, we develop a new approach for determining the on-die ECC function that indirectly determines error syndromes based on how the ECC mechanism responds when faced with uncorrectable errors. To induce uncorrectable errors, we deliberately pause normal DRAM refresh operations long enough (e.g., several minutes at 80 °C) to cause a large number of data-retention errors (e.g., BER > $10^{-4}$) throughout a chip. These errors expose a significant number of miscorrections in different ECC words, and the sheer number of data-retention errors dominates any unwanted interference from other possible error mechanisms (e.g., particle strikes [117]).

To control which data-retention errors occur, we write carefully-crafted test patterns that restrict the errors to specific bit locations. This is possible because only cells programmed to the CHARGED state can experience data-retention errors as discussed in Section 3.2. By restricting pre-correction errors to certain cells, if a post-correction error is observed in an unexpected location, it must be an artifact of error correction, i.e., a miscorrection.
Such a miscorrection is significant since it: (1) signals an error-correction event, (2) is purely a function of the ECC decoding logic, and (3) indirectly reveals the error syndrome generated by the pre-correction error pattern. The indirection occurs because, although the miscorrection does not expose the raw error syndrome, it does reveal that whichever error syndrome is generated internally by the ECC logic exactly matches the parity-check matrix column that corresponds to the position of the miscorrected bit.

These three properties mean that miscorrections are a reliable tool for analyzing ECC functions: for a given pre-correction error pattern, different ECC functions will generate different error syndromes, and therefore miscorrections, depending on how the functions' parity-check matrices are organized. This means that a given ECC function causes miscorrections only within certain bits, and the locations of miscorrection-susceptible bits differ between functions. Therefore, we can differentiate ECC functions by identifying which miscorrections are possible for different test patterns.

To construct a set of test patterns that suffice to uniquely identify an ECC function, we observe that a miscorrection is possible in a DISCHARGED data bit only if the bit's error syndrome can be produced by some linear combination of the parity-check matrix columns that correspond to CHARGED bit locations. For example, consider the 1-CHARGED patterns that each set one data bit to the CHARGED state and all others to the DISCHARGED state. In these patterns, data-retention errors may only occur in either (1) the CHARGED bit or (2) any parity-check bits that the ECC function also sets to the CHARGED state. With these restrictions, observable miscorrections may only occur within data bits whose error syndromes can be created by some linear combination of the parity-check matrix columns that correspond to the CHARGED cells within the codeword.

As a concrete example, consider the codeword of Equation 3. C and D represent that the corresponding cell is programmed to the CHARGED and DISCHARGED states, respectively.

\[
c = [\,D\ D\ C\ D \mid D\ C\ C\,] \tag{3}
\]

Because only CHARGED cells can experience data-retention errors, there are $2^3 = 8$ possible error syndromes that correspond to the unique combinations of CHARGED cells failing. Table 1 illustrates these eight possibilities.

Pre-Correction Error Pattern    Error Syndrome                        Post-Correction Outcome
[0 0 0 0 | 0 0 0]               0                                     No error
[0 0 0 0 | 0 0 1]               $H_{*,6}$                             Correctable
[0 0 0 0 | 0 1 0]               $H_{*,5}$                             Correctable
[0 0 0 0 | 0 1 1]               $H_{*,5} + H_{*,6}$                   Uncorrectable
[0 0 1 0 | 0 0 0]               $H_{*,2}$                             Correctable
[0 0 1 0 | 0 1 0]               $H_{*,2} + H_{*,5}$                   Uncorrectable
[0 0 1 0 | 0 0 1]               $H_{*,2} + H_{*,6}$                   Uncorrectable
[0 0 1 0 | 0 1 1]               $H_{*,2} + H_{*,5} + H_{*,6}$         Uncorrectable

Table 1: Possible data-retention error patterns, their syndromes, and their outcomes for the codeword of Equation 3.
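The enumeration behind Table 1 can be reproduced programmatically. The sketch below is illustrative only: `H` is a hypothetical (7,4) standard-form matrix that we picked so that programming data bit 2 to CHARGED also sets parity bits 5 and 6 (matching Equation 3), and the `flipped` field models a single-error-correcting decoder that flips whichever bit's column equals the syndrome.

```python
from itertools import combinations

# Hypothetical standard-form H = [P | I]; an assumption for illustration only.
H = [
    [1, 1, 0, 1, 1, 0, 0],
    [1, 0, 1, 1, 0, 1, 0],
    [1, 1, 1, 0, 0, 0, 1],
]
charged = [2, 5, 6]   # CHARGED cell positions in the codeword of Equation 3

def enumerate_outcomes(H, charged):
    """List (failing cells, syndrome, outcome, flipped bit) for each subset."""
    cols = [tuple(row[j] for row in H) for j in range(len(H[0]))]
    rows = []
    for num_errors in range(len(charged) + 1):
        for failing in combinations(charged, num_errors):
            s = (0,) * len(H)
            for j in failing:            # syndrome = GF(2) sum of failing columns
                s = tuple(a ^ b for a, b in zip(s, cols[j]))
            if num_errors == 0:
                outcome = "No error"
            elif num_errors == 1:
                outcome = "Correctable"
            else:
                outcome = "Uncorrectable"
            # The decoder flips the bit whose column matches the syndrome, if any.
            flipped = cols.index(s) if s in cols else None
            rows.append((failing, s, outcome, flipped))
    return rows

table = enumerate_outcomes(H, charged)
for failing, s, outcome, flipped in table:
    print(failing, s, outcome, flipped)
```

For this particular matrix, every uncorrectable pattern either maps to a CHARGED or parity-check bit or produces a zero syndrome, so no miscorrection is observable in a DISCHARGED data bit for this test pattern.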

A miscorrection occurs whenever the error syndrome of an uncorrectable error pattern matches the parity-check matrix column of a non-erroneous data bit. In this case, the column's location would then correspond to the bit position of the miscorrection. However, a miscorrection only reveals information if it occurs within one of the DISCHARGED data bits, for only then are we certain that the observed bit flip is unambiguously a miscorrection rather than an uncorrected data-retention error. Therefore, the test patterns we use should maximize the number of DISCHARGED bits so as to increase the number of miscorrections that yield information about the ECC function.

To determine which test patterns to use, we expand upon the approach of injecting 1-hot errors described in Section 4.1. Although we would need to write data to all codeword bits in order to test every 1-hot error pattern, on-die ECC does not allow writing directly to the parity-check bits. This leads to two challenges. First, we cannot test 1-hot error patterns for which the 1-hot error is within the parity-check bits, which means that we cannot differentiate ECC functions that differ only within their parity-check bit positions. Fortunately, this is not a problem because, as Section 4.2.1 discusses in detail, all such functions are equivalent codes with identical externally-visible error-correction properties. Therefore, we are free to assume that the parity-check matrix is in standard form, which specifies parity-check bits' error syndromes (i.e., $I_{(n-k) \times (n-k)}$) and obviates the need to experimentally determine them.

Second, writing the k bits of the dataword with a single CHARGED cell results in a codeword with an unknown number of CHARGED cells because the ECC function independently determines the values of the remaining n − k parity-check bits. As a result, the final codeword may contain anywhere from 1 to n − k + 1 CHARGED cells, and the number of CHARGED cells will vary for different test patterns. Because we cannot directly access the parity-check bits' values, we do not know which cells are CHARGED for a given test pattern, and therefore, we cannot tie post-correction errors back to particular pre-correction error patterns. Fortunately, we can work around this problem by considering all possible error patterns that a given codeword can experience, which amounts to examining all combinations of errors that the CHARGED cells can experience. Table 1 illustrates this for when the dataword is programmed with a 1-CHARGED test pattern (as shown in Equation 3). In this example, the encoded codeword contains three CHARGED cells, which may experience any of $2^3$ possible error patterns. Section 5.1.3 discusses how we can accomplish testing all possible error patterns in practice by exploiting the fact that data-retention errors occur uniform-randomly, so testing across many different codewords provides samples from many different error patterns at once.

Linear block codes can be either full-length, if all possible error syndromes are present within the parity-check matrix (e.g., all $2^p - 1$ error syndromes for a Hamming code with p parity-check bits, as is the case for the code shown in Equation 1), or shortened, if one or more information symbols are truncated while retaining the same number of parity-check symbols [27, 53]. This distinction is crucial for determining appropriate test patterns because, for full-length codes, the 1-CHARGED patterns identify the miscorrection-susceptible bits for all possible error syndromes. In this case, testing additional patterns that have more than one CHARGED bit provides no new information because any resulting error syndromes are already tested using the 1-CHARGED patterns.

However, for shortened codes, the 1-CHARGED patterns may not provide enough information to uniquely identify the ECC function because the 1-CHARGED patterns can no longer test for the missing error syndromes. Fortunately, we can recover the missing information by reconstructing the truncated error syndromes using pairwise combinations of the 1-CHARGED patterns. For example, asserting two CHARGED bits effectively tests an error syndrome that is the linear combination of the bits' corresponding parity-check matrix columns. Therefore, by supplementing the 1-CHARGED patterns with the 2-CHARGED patterns, we effectively encompass the error syndromes that were shortened. Section 6.1 evaluates BEER's sensitivity to code length, showing that the 1-CHARGED patterns are indeed sufficient for full-length codes and the {1,2}-CHARGED patterns for shortened codes that we evaluate with dataword lengths between 4 and 247.

5. Bit-Exact Error Recovery (BEER)

Our goal in this work is to develop a methodology that reliably and accurately determines the full ECC function (i.e., its parity-check matrix) for any DRAM on-die ECC implementation without requiring hardware tools, prerequisite knowledge about the DRAM chip or on-die ECC mechanism, or access to ECC metadata (e.g., error syndromes, parity information). To this end, we present BEER, which systematically determines the ECC function by observing how it reacts when subjected to carefully-crafted uncorrectable error patterns. BEER implements the ideas developed throughout Section 4 and consists of three key steps: (1) experimentally inducing miscorrections, (2) analyzing observed post-correction errors, and (3) solving for the ECC function.

This section describes each of these steps in detail in the context of experiments using 32, 20, and 28 real LPDDR4 DRAM chips from three major manufacturers, whom we anonymize for confidentiality reasons as A, B, and C, respectively. We perform all tests using a temperature-controlled infrastructure with precise control over the timings of refresh and other DRAM bus commands.

To induce miscorrections as discussed in Section 4.2.3, we must first identify the (1) CHARGED and DISCHARGED encodings of each cell and (2) layout of individual datawords within the address space. This section describes how we determine these in a way that is applicable to any DRAM chip.

CHARGED and DISCHARGED States. We determine the encodings of the CHARGED and DISCHARGED states by experimentally measuring the layout of true- and anti-cells throughout the address space as done in prior works [90, 95, 138]. We write data '0' and data '1' test patterns to the entire chip while pausing DRAM refresh for 30 minutes at temperatures between 30 and 80 °C. The resulting data-retention error patterns reveal the true- and anti-cell layout since each test pattern isolates one of the cell types. We find that chips from manufacturers A and B use exclusively true-cells, and chips from manufacturer C use 50%/50% true-/anti-cells organized in alternating blocks of rows with block lengths of 800, 824, and 1224 rows. These observations are consistent with the results of similar experiments performed by prior work [138].

To determine which addresses correspond to individual ECC datawords, we program one cell per row to the CHARGED state with all other cells DISCHARGED. We then sweep the refresh window $t_{REFw}$ from 10 seconds to 10 minutes at 80 °C to induce uncorrectable errors. Because only CHARGED cells can fail, post-correction errors may only occur in bit positions corresponding to either (1) the CHARGED cell itself or (2) DISCHARGED cells due to a miscorrection. By sweeping the bit position of the CHARGED cell within the dataword, we observe miscorrections that are restricted exclusively to within the same ECC dataword.

We find that chips from all three manufacturers use identical ECC word layouts: each contiguous 32B region of DRAM comprises two 16B ECC words that are interleaved at byte granularity. A 128-bit dataword is consistent with prior industry and academic works on on-die ECC [97, 98, 120, 138].

CHARGED Patterns. To test each of the 1- or 2-CHARGED patterns, we program an equal number of datawords with each test pattern. For example, a 128-bit dataword yields $\binom{128}{1} = 128$ and $\binom{128}{2} = 8128$ 1- and 2-CHARGED test patterns, respectively. As Section 4.2.3 discusses, BEER must identify all possible miscorrections for each test pattern. To do so, BEER must exercise all possible error patterns that a codeword programmed with a given test pattern can experience (e.g., up to $2^{10} = 1024$ unique error patterns for a (136, 128) Hamming code using a 2-CHARGED pattern).

Fortunately, although BEER must test a large number of error patterns, even a single DRAM chip typically contains millions of ECC words (e.g., 2 […] CHARGED cells, and running multiple […] dramatically increases the sample size, making the probability of not observing a given error pattern exceedingly low. We analyze experimental runtime in Section 6.3.

Footnote: We assume that ECC words do not straddle row boundaries since accesses would then require reading two rows simultaneously. However, one cell per bank can be tested to accommodate this case if required.

Table 2 illustrates testing the 1-CHARGED patterns using the ECC function given by Equation 1. There are four test patterns, and Table 2 shows the miscorrections that are possible for each one assuming that all cells are true cells. For this ECC function, miscorrections are possible only for test pattern 0, and no pre-correction error pattern exists that can cause miscorrections for the other test patterns. Note that, for errors in the CHARGED-bit positions, we cannot be certain whether a post-correction error is a miscorrection or simply a data-retention error, so we label it using '?'. We refer to the cumulative pattern-miscorrection pairs as a miscorrection profile. Thus, Table 2 shows the miscorrection profile of the ECC function given by Equation 1.

1-CHARGED Pattern ID              1-CHARGED Pattern    Possible Miscorrections
(Bit-Index of CHARGED Cell)
3                                 [D D D C]            [– – – ?]
2                                 [D D C D]            [– – ? –]
1                                 [D C D D]            [– ? – –]
0                                 [C D D D]            [? ]

Table 2: Example miscorrection profile for the ECC function given in Equation 1.
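To make the notion of a miscorrection profile concrete, the following sketch computes one for a known parity-check matrix by exhaustively enumerating the error patterns that each test pattern permits. It assumes true cells (CHARGED encodes '1') and a single-error-correcting decoder. The matrix `H` is a hypothetical (7,4) code in standard form that we chose for illustration, and `miscorrection_profile` is our own helper rather than part of the BEER tool.

```python
from itertools import combinations

# Hypothetical standard-form H = [P | I]; an assumption for illustration only.
H = [
    [1, 1, 0, 1, 1, 0, 0],
    [1, 0, 1, 1, 0, 1, 0],
    [1, 1, 1, 0, 0, 0, 1],
]

def miscorrection_profile(H, num_charged=1):
    """Map each CHARGED data pattern to the DISCHARGED data bits it can miscorrect."""
    r, n = len(H), len(H[0])
    k = n - r
    cols = [tuple(row[j] for row in H) for j in range(n)]
    profile = {}
    for charged_data in combinations(range(k), num_charged):
        data = [1 if j in charged_data else 0 for j in range(k)]
        parity = [sum(H[i][j] * data[j] for j in range(k)) % 2 for i in range(r)]
        # With true cells, every '1' in the codeword is a CHARGED cell.
        charged = [j for j, bit in enumerate(data + parity) if bit]
        susceptible = set()
        for size in range(1, len(charged) + 1):
            for failing in combinations(charged, size):
                s = (0,) * r
                for j in failing:
                    s = tuple(a ^ b for a, b in zip(s, cols[j]))
                if s in cols:
                    flipped = cols.index(s)   # bit the decoder would flip
                    if flipped < k and flipped not in charged_data:
                        susceptible.add(flipped)
        profile[charged_data] = susceptible
    return profile

profile_1 = miscorrection_profile(H, num_charged=1)
profile_2 = miscorrection_profile(H, num_charged=2)
```

For this example matrix, only the pattern that charges data bit 0 can produce miscorrections, because its column is the only one whose CHARGED parity-check bits can combine into other data bits' columns.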

To obtain the miscorrection profile of the on-die ECC function within each DRAM chip that we test, we lengthen the refresh window $t_{REFw}$ to between 2 minutes, where uncorrectable errors begin to occur frequently (BER ≈ $10^{-7}$), and 22 minutes, where nearly all ECC words exhibit uncorrectable errors (BER ≈ $10^{-3}$), in 1 minute intervals at 80 °C. During each experiment, we record which bits are susceptible to miscorrections for each test pattern (analogous to Table 2). Figure 3 shows this information graphically, giving the logarithm of the number of errors observed in each bit position (X-axis) for each 1-CHARGED test pattern (Y-axis). The data is taken from the true-cell regions of a single representative chip from each manufacturer. Errors in the CHARGED bit positions (i.e., where Y = X) stand out clearly because they occur alongside all miscorrections as uncorrectable errors.

[Figure 3 plot omitted. Panels: manufacturers A, B, and C. X-axis: Bit Index Within ECC Dataword. Y-axis: 1-CHARGED Pattern ID (CHARGED Bit Index). Color scale: rarely-observed error to frequently-observed error.]

Figure 3: Errors observed in a single representative chip from each manufacturer using the 1-CHARGED test patterns, showing that manufacturers appear to use different ECC functions.

The data shows that miscorrection profiles vary significantly between different manufacturers. This is likely because each manufacturer uses a different parity-check matrix: the possible miscorrections for a given test pattern depend on which parity-check matrix columns are used to construct error syndromes. With different matrices, different columns combine to form different error syndromes. The miscorrection profiles of manufacturers B and C exhibit repeating patterns, which likely occur due to regularities in how syndromes are organized in the parity-check matrix, whereas the matrix of manufacturer A appears to be relatively unstructured. We suspect that manufacturers use different ECC functions because each manufacturer employs their own circuit design, and specific parity-check matrix organizations lead to more favorable circuit-level tradeoffs (e.g., layout area, critical path lengths).

Footnote: Assuming chips of the same model use the same on-die ECC mechanism, which our experimental results in Section 5.1.3 support.

We find that chips of the same model number from the same manufacturer yield identical miscorrection profiles, which (1) validates that we are observing design-dependent data and (2) confirms that chips from the same manufacturer and product generation appear to use the same ECC functions. To sanity-check our results, we use EINSim [2, 138] to simulate the miscorrection profiles of the final parity-check matrices we obtain from our experiments with real chips, and we observe that the miscorrection profiles obtained via simulation match those measured via real chip experiments.

In practice, BEER may either (1) fail to observe a possible miscorrection or (2) misidentify a miscorrection due to unpredictable transient errors (e.g., soft errors from particle strikes, variable-retention time errors, voltage fluctuations). These events can theoretically pollute the miscorrection profile with incorrect data, potentially resulting in an illegal miscorrection profile, i.e., one that does not match any ECC function.

Fortunately, case (1) is unlikely given the sheer number of ECC words even a single chip provides for testing (discussed in Section 5.1.3). While it is possible that different ECC words throughout a chip use different ECC functions, we believe that this is unlikely because it complicates the design with no clear benefits. Even if a chip does use more than one ECC function, the different functions will likely follow patterns aligning with DRAM substructures (e.g., alternating between DRAM rows or subarrays [83, 91]), and we can test each region individually.

Similarly, case (2) is unlikely because transient errors occur randomly and rarely [141] as compared with the data-retention error rates that we induce for BEER (> $10^{-7}$), so transient error occurrence counts are far lower than those of real miscorrections that are observed frequently in miscorrection-susceptible bit positions. Therefore, we apply a simple threshold filter to remove rarely-observed post-correction errors from the miscorrection profile. Figure 4 shows the relative probability of observing a miscorrection in each bit position aggregated across all 1-CHARGED test patterns for a representative chip from manufacturer B. Each data point is a boxplot that shows the full distribution of probability values, i.e., min, median, max, and interquartile-range (IQR), observed when sweeping the refresh window from 2 to 22 minutes (i.e., the same experiments described in Section 5.1.3).

We see that zero and nonzero probabilities are distinctly separated, so we can robustly resolve miscorrections for each bit. Furthermore, each distribution is extremely tight, meaning that any of the individual experiments (i.e., any single component of the distributions) is suitable for identifying miscorrections. Therefore, a simple threshold filter (illustrated in Figure 4) distinctly separates post-correction errors that occur near-zero times from miscorrections that occur significantly more often.
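A threshold filter of this kind can be sketched as follows. The function name, the normalization by total error count, and the example counts are assumptions made for illustration; the default threshold of 1e-3 mirrors the example threshold shown in Figure 4.

```python
def susceptible_bits(error_counts, threshold=1e-3):
    """Return bit positions whose share of all observed errors exceeds threshold.

    error_counts[i] is the number of post-correction errors observed in bit i,
    aggregated over all tested ECC words for one set of test patterns.
    """
    total = sum(error_counts)
    if total == 0:
        return set()
    return {bit for bit, count in enumerate(error_counts)
            if count / total > threshold}

# Hypothetical counts: bits 1, 3, and 5 are miscorrection-susceptible, while
# bits 2 and 6 each saw a handful of sporadic (e.g., transient) errors.
counts = [0, 5120, 2, 4980, 0, 5075, 1, 0]
kept = susceptible_bits(counts)
```

Because the zero and nonzero probabilities are reported to be distinctly separated, the exact threshold value should matter little as long as it falls inside the gap between them.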

We use the Z3 SAT solver [28] (described in Section 3.4) to identify the exact ECC function given a miscorrection profile. To determine the encoding ($F_{encode}$) and decoding ($F_{decode}$) functions, we express them as unknown generator (G) and parity-check (H) matrices, respectively. We then add the following constraints to the SAT solver for G and H:

1. Basic linear code properties (e.g., unique H columns).
2. Standard form matrices, as described in Section 4.2.1.
3. Information contained within the miscorrection profile (i.e., pattern i can(not) yield a miscorrection in bit j).

[Figure 4 plot omitted. X-axis: Bit Index in ECC Word. Y-axis: Miscorrection Probability Mass. Dashed line: Example Threshold (1e-3).]

Figure 4: Relative probability of observing a miscorrection in each bit position aggregated across all 1-CHARGED test patterns for a representative chip of manufacturer B. The dashed line shows a threshold filter separating zero and nonzero values.

Upon evaluating the SAT solver with these constraints, the resulting G and H matrices represent the ECC encoding and decoding functions, respectively, that cause the observed miscorrection profile. To verify that no other ECC function may result in the same miscorrection profile, we simply repeat the SAT solver evaluation with the additional constraint that the already discovered G and H matrices are invalid. If the SAT solver finds another ECC function that satisfies the new constraints, the solution is not unique.

To seamlessly apply BEER to the DRAM chips that we test, we develop an open-source C++ application [1] that incorporates the SAT solver and determines the ECC function corresponding to an arbitrary miscorrection profile. The tool exhaustively searches for all possible ECC functions that satisfy the aforementioned constraints and therefore will generate the input miscorrection profile. Using this tool, we apply BEER to miscorrection profiles that we experimentally measure across all chips using refresh windows up to 30 minutes and temperatures up to 80 °C. We find that BEER uniquely identifies the ECC function for all manufacturers. Unfortunately, we are unable to publish the resulting ECC functions for confidentiality reasons as set out in Section 2.1.
Although we are confident in our results because our SAT solver tool identifies a unique ECC function that explains the observed miscorrection profiles for each chip, we have no way to validate BEER's results against a ground truth. To overcome this limitation, we demonstrate BEER's correctness using simulation in Section 6.1.
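For very small codes, the search that the SAT solver performs can be imitated by brute force, which helps build intuition for the three constraints. The sketch below deliberately swaps the Z3-based formulation for plain enumeration and is not the BEER tool: it enumerates every standard-form matrix with unique, non-unit columns, keeps those that reproduce a target 1-CHARGED miscorrection profile, and treats matrices that differ only by a reordering of parity-check bits as the same (equivalent) code. All names and the example matrix are our own assumptions.

```python
from itertools import combinations, permutations, product

def profile_of(P):
    """1-CHARGED miscorrection profile of H = [P | I]; P is a tuple of columns."""
    k, r = len(P), len(P[0])
    unit = [tuple(int(i == j) for i in range(r)) for j in range(r)]
    cols = list(P) + unit
    profile = []
    for i in range(k):
        # CHARGED cells: data bit i plus the parity bits that column i sets.
        charged = [i] + [k + j for j in range(r) if P[i][j]]
        hits = set()
        for size in range(1, len(charged) + 1):
            for failing in combinations(charged, size):
                s = tuple(sum(cols[j][b] for j in failing) % 2 for b in range(r))
                if s in cols[:k] and cols.index(s) != i:
                    hits.add(cols.index(s))
        profile.append(frozenset(hits))
    return tuple(profile)

def canonical(P):
    """Representative of P up to reordering of parity-check bits (rows of P)."""
    rows = sorted(zip(*P))          # zip(*P) turns the columns into rows
    return tuple(zip(*rows))        # and back into a tuple of columns

def solve(target_profile, k, r):
    """Return all inequivalent standard-form codes that match the profile."""
    # Constraints 1 and 2: data columns are unique, nonzero, and not unit vectors.
    candidates = [c for c in product((0, 1), repeat=r) if sum(c) >= 2]
    solutions = set()
    for P in permutations(candidates, k):
        if profile_of(P) == target_profile:     # constraint 3
            solutions.add(canonical(P))
    return solutions

# Hypothetical "unknown" (7,4) code, used only to generate a target profile.
P_true = ((1, 1, 1), (1, 0, 1), (0, 1, 1), (1, 1, 0))
solutions = solve(profile_of(P_true), k=4, r=3)
unique = len(solutions) == 1
```

Enumerating all solutions at once plays the role of the uniqueness check, whereas the SAT-based approach re-runs the solver with the discovered matrices excluded. The enumeration grows combinatorially with code length, so it is only practical for toy codes such as this one.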

Although we demonstrate BEER's effectiveness using both experiment and simulation, BEER has several testing requirements and limitations that we review in this section.

Testing Requirements

• Single-level ECC: BEER assumes that there is no second level of ECC (e.g., rank-level ECC in the DRAM controller) present during testing. This is reasonable since system-level ECCs can typically be bypassed (e.g., via FPGA-based testing or disabling through the BIOS) or reverse-engineered [26], even in the presence of on-die ECC, before applying BEER. (We can potentially extend BEER to multiple levels of ECC by extending the SAT problem to the concatenated code formed by the combined ECCs and constructing test patterns that target each level sequentially, but we leave this direction to future work.)

• Inducing data-retention errors: BEER requires finding a refresh window (i.e., $t_{REFw}$) for each chip that is long enough to induce data-retention errors and expose miscorrections. Fortunately, we find that refresh windows between 1 and 30 minutes at 80 °C reveal more than enough miscorrections to apply BEER. In general, the refresh window can be easily modified (discussed in Section 3.2), and because data-retention errors are fundamental to DRAM technology, BEER applies to all DDRx DRAM families regardless of their data access protocols and will likely hold for future DRAM chips, whose data-retention error rates will likely be even more prominent [39, 76, 89, 99, 109, 119, 120, 129, 133, 160].

Limitations

• ECC code type: BEER works on systematic linear block codes, which are commonly employed for latency-sensitive main memory chips since: (i) they allow the data to be directly accessed without additional operations [181] and (ii) stronger codes (e.g., LDPC [36], concatenated codes [34]) cost considerably more area and latency [11, 132].

• No ground truth: BEER alone cannot confirm whether the ECC function that it identifies is the correct answer. However, if BEER finds exactly one ECC function that explains the experimentally observed miscorrection profile, it is very likely that the ECC function is correct.

• Disambiguating equivalent codes: On-die ECC does not expose the parity-check bits, so BEER can only determine the ECC function to an equivalent code (discussed in Sections 4.2.1 and 4.2.3). Fortunately, equivalent codes differ only in their internal metadata representations, so this limitation should not hinder most third-party studies. In general, we are unaware of any way to disambiguate equivalent codes without accessing the ECC mechanism's internals.

6. BEER Evaluation

We evaluate BEER's correctness in simulation, SAT solver performance on a real system, and experimental runtime analytically. Our evaluations both (1) show that BEER is practical and correctly identifies the ECC function within our simulation-based analyses, and (2) provide intuition for how the SAT problem's complexity scales for longer ECC codewords.

We simulate applying BEER to DRAM chips with on-die ECC using a modified version of the EINSim [2, 138] open-source DRAM error-correction simulator that we also publicly release [2]. We simulate 115300 single-error correction Hamming code functions that are representative of those used for on-die ECC [60, 97, 98, 120, 129, 133, 138, 147]: 2000 each for dataword lengths between 4 and 57 bits, 100 each between 58 and 120 bits, and 100 each for selected values between 121 and 247 bits because longer codes require significantly more simulation time. For each ECC function, we simulate inducing data-retention errors within the 1-, 2-, and 3-CHARGED test patterns according to the data-retention error properties outlined in Section 3.2. (We include the 3-CHARGED patterns to show that they fail to uniquely identify all ECC functions despite comprising combinatorially more test patterns than the combined 1- and 2-CHARGED patterns.) For each test pattern, we model a real experiment by simulating 10 ECC words and data-retention error rates ranging from $10^{-5}$ to $10^{-2}$ to obtain a miscorrection profile. Then, we apply BEER to the miscorrection profiles and show that BEER correctly recovers the original ECC functions.

Figure 5 shows how many unique ECC functions BEER finds when using different test patterns to generate miscorrection profiles. […] {1,2}-CHARGED configuration that uses both the 1-CHARGED and 2-CHARGED test patterns. For full-length codes (i.e., with dataword lengths k ∈ {4, 11, 26, 57, 120, 247, …}) that contain all possible error syndromes within the parity-check matrix by construction, all test patterns uniquely determine the ECC function, including the 1-CHARGED patterns alone.

[Figure 5 plot omitted. X-axis: dataword length (k). Y-axis: Number of Unique ECC Functions.]

Figure 5: Number of ECC functions that match miscorrection profiles created using different test patterns.

On the other hand, the individual 1-, 2-, and 3-CHARGED patterns sometimes identify multiple ECC functions for shortened codes, with more solutions identified both for (1) shorter codes and (2) codes with more aggressive shortening. However, the data shows that BEER often still uniquely identifies the ECC function even using only the 1-CHARGED patterns (i.e., for 87.7% of all codes simulated) and always does so with the {1,2}-CHARGED patterns. This is consistent with the fact that shortened codes expose fewer error syndromes to test (discussed in Section 4.2.3). It is important to note that, even if BEER identifies multiple solutions, it still narrows a combinatorial-sized search space to a tractable number of ECC functions that are well suited to more expensive analyses (e.g., intrusive error-injection, die imaging techniques, or manual inspection).

While our simulations do not model interference from transient errors, such errors are rare events [141] when compared with the amount of uncorrectable data-retention errors that BEER induces. Even if sporadic transient errors were to occur, Section 5.2 discusses in detail how BEER mitigates their impact on the miscorrection profile using a simple thresholding filter.

We evaluate BEER's performance and memory usage using ten servers with 24-core 2.30 GHz Intel Xeon(R) Gold 5118 CPUs [58] and 192 GiB 2666 MHz DDR4 DRAM [68] each. All measurements are taken with Hyper-Threading [58] enabled and all cores fully occupied. Figure 6 shows overall runtime and memory usage when running BEER with the 1-CHARGED patterns for different ECC code lengths on a log-log plot along with the time required to (1) solve for the ECC function ("Determine Function") and (2) verify the uniqueness of the solution ("Check Uniqueness"). Each data point gives the minimum, median, and maximum values observed across our simulated ECC functions (described in Section 6.1). We see that the total runtime and memory usage are negligible for short codes and grow as large as 62 hours and 11.4 GiB of memory for large codes. For a representative dataword length of 128 bits, the median total runtime and memory usage are 57.1 hours and 6.3 GiB, respectively. At each code length where we add an additional parity-check bit, the runtime and memory usage jump accordingly since the complexity of the SAT evaluation problem increases by an extra dimension.

[Figure 6 plot omitted. X-axis: dataword length (k). Left Y-axis: Time (s). Right Y-axis: Memory Usage (MiB). Series: Total Runtime, Check Uniqueness, Determine Function(s), Memory Usage.]

Figure 6: Measured BEER runtime (left y-axis) and memory usage (right y-axis) for different ECC codeword lengths.

The total runtime is quickly dominated by the SAT solver checking uniqueness, which requires exhaustively exploring the entire search space of a given ECC function. However, simply determining the solution ECC function(s) is much faster, requiring less than 2.7 minutes even for the longest codes evaluated and for shortened codes that potentially have multiple solutions using only the 1-CHARGED patterns. From this data, we conclude that BEER is practical for reasonable-length codes used for on-die ECC (e.g., k = 64, 128). However, our BEER implementation has room for optimization, e.g., using dedicated GF(2) BLAS libraries (e.g., LELA [52]) or advanced SAT solver theories (e.g., SMT bitvectors [10]), and an optimized implementation would likely improve performance, enabling BEER's application to an even greater range of on-die ECC functions. Section 7.3 discusses such optimizations in greater detail. Nevertheless, BEER is a one-time offline process, so it need not be aggressively performant in most use-cases.

Our experimental runtime is overwhelmingly bound by waiting for data-retention errors to occur during a lengthened refresh window (e.g., 10 minutes) while interfacing with the DRAM chip requires only on the order of milliseconds (e.g., 168 ms to read an entire 2 GiB LPDDR4-3200 chip [69]). Therefore, we estimate total experimental runtime as the sum of the refresh windows that we individually test. For the data we present in Section 5.1.3, testing each refresh window between 2 and 22 minutes in 1 minute increments requires a combined 4.2 hours of testing for a single chip. However, if chips of the same model number use the same ECC functions (as our data supports in Section 5.1.3), we can reduce overall testing latency by parallelizing individual tests across different chips. Furthermore, because BEER is likely a one-time exercise for a given DRAM chip, it is sufficient that BEER is practical offline.
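The 4.2-hour estimate follows directly from summing the tested refresh windows, as the short calculation below shows. The function name is ours, and the calculation ignores the millisecond-scale chip access time, as the text does.

```python
def experiment_hours(first_min, last_min, step_min=1):
    """Total wait time, in hours, when testing each refresh window in turn."""
    windows = range(first_min, last_min + 1, step_min)   # window lengths in minutes
    return sum(windows) / 60

# Refresh windows of 2, 3, ..., 22 minutes: 21 experiments, 252 minutes total.
total_hours = experiment_hours(2, 22)
```

If the individual windows were instead distributed across several chips of the same model, the wall-clock time would be bounded below by the longest single window (22 minutes here) rather than by the sum.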

7. Example Practical Use-Cases

BEER empowers third-party DRAM users to decouple the reliability characteristics of modern DRAM chips from any particular on-die ECC function that a chip implements. This section discusses five concrete analyses that BEER enables. To our knowledge, BEER is the first work capable of inferring this information without bypassing the on-die ECC mechanism. We hope that end users and future works find more ways to extend and apply BEER in practice.

We introduce Bit-Exact Error Profiling (BEEP), a new data-retention error profiling algorithm enabled by BEER that infers the number and bit-exact locations of pre-correction error-prone cells when given a set of operating conditions that cause uncorrectable errors in an ECC word. To our knowledge, BEEP is the first DRAM error profiling methodology capable of identifying bit-exact error locations throughout the entire on-die ECC codeword, including within the parity bits.

7.1.1. BEEP: Inference Based on Miscorrections.

Because miscorrections are purely a function of the ECC logic (discussed in Section 4.2.2), an observed miscorrection indicates that a specific pre-correction error pattern has occurred. Although several such patterns can map to the same miscorrection, BEEP narrows down the possible pre-correction error locations by using the known parity-check matrix (after applying BEER) to construct test patterns for additional experiments that disambiguate the possibilities. At a high level, BEEP crafts test patterns to reveal errors as it incrementally traverses each codeword bit, possibly using multiple passes to capture low-probability errors. As BEEP iterates over the codeword, it builds up a list of suspected error-prone cells.

BEEP comprises three phases: crafting suitable test patterns, experimental testing with crafted patterns, and calculating pre-correction error locations from observed miscorrections. Figure 7 illustrates these three phases in an example where BEEP profiles for pre-correction errors in a 128-bit ECC dataword. The following sections explain each of the three phases and refer to Figure 7 as a running example.

Conventional DRAM error profilers (e.g., [22, 46, 71, 79, 81, 95, 104, 109, 110, 139, 165, 169]) use carefully designed test patterns that induce worst-case circuit conditions in order to maximize their coverage of potential errors [3, 123]. Unfortunately, on-die ECC encodes all data into codewords, so the intended software-level test patterns likely do not maintain their carefully-designed properties when written to the physical DRAM cells. BEEP circumvents these ECC-imposed restrictions by using a SAT solver along with the known ECC function (via BEER) to craft test patterns that both (1) locally induce the worst-case circuit conditions and (2) result in observable miscorrections if suspected error-prone cells do indeed fail.

Without loss of generality, we assume that the worst-case conditions for a given bit occur when its neighbors are programmed with the opposite charge states, which prior work shows to exacerbate circuit-level coupling effects and increase error rates [3, 5, 79, 93, 107, 109, 123, 130, 144, 156, 166]. If the design of a worst-case pattern is not known, or if it has a different structure than we assume, BEEP can be adapted by simply modifying the relevant SAT solver constraints (described below). To ensure that BEEP observes a miscorrection when a given error occurs, BEEP crafts a pattern that will suffer a miscorrection if the error occurs alongside an already-discovered error. We express these conditions to the SAT solver using the following constraints:

1. Bits adjacent to the target bit have opposing charge states.
2. One or more miscorrections is possible using some combination of the already-identified data-retention errors.

Several such patterns typically exist, and BEEP simply uses the first one that the SAT solver returns (although a different BEEP implementation could test multiple patterns to help identify low-probability errors). Figure 7 illustrates how such a test pattern appears physically within the cells of a codeword: the target cell is CHARGED, its neighbors are DISCHARGED, and the SAT solver freely determines the states of the remaining cells to increase the likelihood of a miscorrection if the target cell fails. If the SAT solver fails to find such a test pattern, BEEP attempts to craft a pattern using constraint 2 alone, which, unlike constraint 1, is essential to observing miscorrections. Failing that, BEEP simply skips the bit until more error-prone cells are identified that could facilitate causing miscorrections. We evaluate how successfully BEEP identifies errors in Section 7.1.4, finding that a second pass over the codeword helps in cases of few or low-probability errors.
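The two constraints above can be illustrated on a toy code. The following is a minimal sketch, not our implementation: it replaces the SAT solver with a brute-force search over datawords, uses a hypothetical (7,4) Hamming code in systematic form, assumes CHARGED cells store '1', assumes codeword bits are physically adjacent in index order, and treats a miscorrection as observable only when it flips a DISCHARGED data bit. All names (`craft_pattern`, `miscorrected_bit`, and so on) are illustrative.

```python
from itertools import combinations, product

# Hypothetical (7,4) Hamming code in systematic form, H = [P | I]. This small
# matrix is an illustrative assumption, not one recovered from a real chip.
P = [[1, 1, 0, 1],
     [1, 0, 1, 1],
     [0, 1, 1, 1]]
R, K = len(P), len(P[0])
N = K + R
H = [P[r] + [int(r == i) for i in range(R)] for r in range(R)]


def encode(d):
    """Systematic encoding: codeword = data bits followed by parity bits."""
    parity = [sum(P[r][i] & d[i] for i in range(K)) % 2 for r in range(R)]
    return list(d) + parity


def syndrome(error_positions):
    """Syndrome of an error pattern = XOR of the matching columns of H."""
    return tuple(sum(H[r][j] for j in error_positions) % 2 for r in range(R))


# Single-error-correcting decoder: a syndrome equal to column j flips bit j.
COLUMN_TO_BIT = {syndrome([j]): j for j in range(N)}


def miscorrected_bit(c, errors):
    """Return the data bit the decoder would wrongly flip, or None.

    Assumption: a flip counts as observable only if it lands on a DISCHARGED
    (0) data bit that is not itself one of the failing cells.
    """
    j = COLUMN_TO_BIT.get(syndrome(errors))
    if j is not None and j < K and j not in errors and c[j] == 0:
        return j
    return None


def craft_pattern(target, known_errors, enforce_neighbors=True):
    """Brute-force stand-in for the SAT query; returns (dataword, codeword)."""
    neighbors = [i for i in (target - 1, target + 1) if 0 <= i < N]
    for d in product([0, 1], repeat=K):
        c = encode(d)
        if c[target] != 1:
            continue  # the target cell must be CHARGED so that it can fail
        if enforce_neighbors and any(c[i] for i in neighbors):
            continue  # constraint 1: neighbors hold the opposite charge state
        # Only CHARGED cells can experience data-retention errors.
        usable = [e for e in known_errors if e != target and c[e] == 1]
        # Constraint 2: target plus some known errors causes a miscorrection.
        for size in range(1, len(usable) + 1):
            for combo in combinations(usable, size):
                if miscorrected_bit(c, (target,) + combo) is not None:
                    return d, c
    return None  # caller may retry with enforce_neighbors=False or skip the bit


print(craft_pattern(target=0, known_errors=[2]))
print(craft_pattern(target=1, known_errors=[0]))
print(craft_pattern(target=1, known_errors=[0], enforce_neighbors=False))
```

In this toy setting, the second call finds no pattern because constraint 1 forces the only known error-prone cell to be DISCHARGED; relaxing constraint 1 (third call) mirrors the fallback described above. A real implementation would express the same conditions as SAT constraints rather than enumerate datawords, which is infeasible for 64- or 128-bit datawords.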

BEEP tests a pattern by writing it to the target ECC word, inducing errors by lengthening the refresh window, and reading out the post-correction data. Figure 7 shows examples of post-correction error patterns that might be observed during an experiment.

Each miscorrection indicates that an uncorrectable number of pre-correction errors exists, and BEEP uses the parity-check matrix $H$ to calculate their precise locations. This is possible because each miscorrection reveals an error syndrome $s$ for the (unknown) erroneous pre-correction codeword $c'$ that caused the miscorrection. Therefore, we can directly solve for $c'$ as shown in Equation 4.

$$s = H \cdot c' = c'_1 \cdot H_{*,1} + c'_2 \cdot H_{*,2} + \dots + c'_n \cdot H_{*,n} \tag{4}$$

This is a system of equations with one equation for each of $n - k$ unknowns, i.e., one each for the $n - k$ inaccessible parity bits. There is guaranteed to be exactly one solution for $c'$ since the parity-check matrix always has full rank (i.e., $\mathrm{rank}(H) = n - k$). Since we also know the original codeword ($c = F_{encode}(d) = G \cdot d$), we can simply compare the two (i.e., $c \oplus c'$) to determine the bit-exact error pattern that led to the observed miscorrection. Figure 7 shows how BEEP updates a list of learned pre-correction error locations, which the SAT solver then uses to construct test patterns for subsequent bits. Once all bits are tested, the list of pre-correction errors yields the number and bit-locations of all identified error-prone cells.

To understand how BEEP performs in practice, we evaluate its success rate, i.e., the likelihood that BEEP correctly identifies errors within a codeword. We use a modified version of EINSim [2] to perform Monte-Carlo simulation across 100 codewords per measurement. To keep our analysis independent of any particular bit-error rate model, we subdivide experiments by the number of errors ($N$) injected per codeword. In this way, we can flexibly evaluate the success rate for a specific error distribution using the law of total probability over the $N$s, i.e., $P[\text{success}] = \sum_N P[\text{success} \mid N] \cdot P[N]$.

Number of Passes.

Figure 8 shows BEEP’s success rate when using one and two passes over the codeword for different codeword lengths. Each bar shows the median value over the […] even with a single pass. Longer codewords perform better in part because BEEP uses one test pattern per bit, which means that longer codes lead to more patterns. However, longer codewords perform better even with comparable test-pattern counts (e.g., 2 passes with 31-bit vs 1 pass with 63-bit codewords) because longer codewords simply have more bits (and therefore, error syndromes) for the SAT solver to consider when crafting a miscorrection-prone test pattern. On the other hand, miscorrection-prone test patterns are more difficult to construct for shorter codes that provide fewer bits to work with, so BEEP fails more often when testing shorter codes.

[Figure 7 graphic omitted. For each of Test Bit[0], Test Bit[1], …, Test Bit[135], BEEP (i) generates a test pattern (craft test pattern using SAT solver), (ii) tests for miscorrections (run experiments on ECC words with different pre-correction error patterns), and (iii) learns pre-correction errors (calculate with ECC function), feeding newly identified pre-correction error locations into the next test; the whole sequence may optionally be repeated for a second pass to find low-probability errors. Legend: C = CHARGED cell; D = DISCHARGED cell; - = free for SAT solver to choose (test pattern) or no error observed (experiment); M = miscorrection observed; E = known error; ? = unknown.]

Figure 7: Example of running BEEP on a single 136-bit ECC codeword to identify locations of pre-correction errors.

[Figure 8 plot omitted. x-axis: Number of Errors Injected per Codeword. y-axis: BEEP Success Rate.]

Figure 8: BEEP success rate for 1 vs. 2 passes and different codeword lengths and numbers of errors injected.
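Returning to the calculation phase, the syndrome-based recovery in Equation 4 can be sketched for a small code. This is a minimal illustration under stated assumptions rather than our implementation: it uses a hypothetical (7,4) Hamming code in systematic form ($H = [P \mid I]$), assumes CHARGED cells store '1' so that data-retention errors only flip 1 to 0, and identifies a miscorrection as a single 0-to-1 flip in the post-correction data. Because $H$ is systematic here, the parity unknowns are solved directly instead of by general Gaussian elimination over GF(2). All function names are illustrative.

```python
# Hypothetical (7,4) Hamming code in systematic form, H = [P | I]; this matrix
# is an illustrative assumption, not one recovered from a real chip.
P = [[1, 1, 0, 1],
     [1, 0, 1, 1],
     [0, 1, 1, 1]]
R, K = len(P), len(P[0])
N = K + R
H = [P[r] + [int(r == i) for i in range(R)] for r in range(R)]
COLUMNS = [tuple(H[r][j] for r in range(R)) for j in range(N)]


def encode(d):
    """Systematic encoding: codeword = data bits followed by parity bits."""
    parity = [sum(P[r][i] & d[i] for i in range(K)) % 2 for r in range(R)]
    return list(d) + parity


def decode(c_pre):
    """Simulated on-die ECC read: correct per the syndrome, expose data only."""
    s = tuple(sum(H[r][i] & c_pre[i] for i in range(N)) % 2 for r in range(R))
    fixed = list(c_pre)
    if s in COLUMNS:  # nonzero syndrome matching column j flips bit j
        fixed[COLUMNS.index(s)] ^= 1
    return fixed[:K]


def recover_errors(d_written, d_observed):
    """Infer pre-correction error positions (data and parity) from a miscorrection."""
    c = encode(d_written)
    # Retention errors only discharge cells (1 -> 0), so a 0 -> 1 flip in the
    # observed data is attributed to a miscorrection by the ECC decoder.
    flips_up = [i for i in range(K) if d_written[i] == 0 and d_observed[i] == 1]
    if len(flips_up) != 1:
        return None  # no unambiguous miscorrection to work from
    j = flips_up[0]
    s = COLUMNS[j]            # the decoder flipped bit j, so s = H[:, j]
    data_pre = list(d_observed)
    data_pre[j] = 0           # undo the miscorrection
    # Equation 4 with H = [P | I]: s = P*data_pre + parity_pre (mod 2).
    parity_pre = [(s[r] + sum(P[r][i] & data_pre[i] for i in range(K))) % 2
                  for r in range(R)]
    c_pre = data_pre + parity_pre
    return [i for i in range(N) if c[i] != c_pre[i]]  # positions where c XOR c' = 1


d = (1, 0, 1, 0)
c_pre = encode(d)
c_pre[0] = 0  # inject a retention error in a data bit
c_pre[6] = 0  # inject a retention error in a parity bit
observed = decode(c_pre)
print(observed, recover_errors(d, observed))
```

In this example, the two injected errors (one in a data bit, one in an inaccessible parity bit) cause a miscorrection at data bit 3, from which both error positions, 0 and 6, are recovered.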

Per-Bit Error Probabilities.

Figure 9 shows how BEEP’s success rate changes using a single pass when the injected errors have different per-bit probabilities of error (P[error]). This experiment represents a more realistic scenario where some DRAM cells probabilistically experience data-retention errors. We see that BEEP remains effective (i.e., has a near-100% success rate) for realistic 63- and 127-bit codeword lengths, especially at higher bit-error probabilities and error counts. BEEP generally has a higher success rate with longer codes compared to shorter ones, and for shorter codewords at low error probabilities, the data shows that BEEP may require more test patterns (e.g., multiple passes) to reliably identify all errors.

[Figure 9 plot omitted. x-axis: Number of Errors Injected per Codeword. y-axis: BEEP Success Rate.]

Figure 9: BEEP success rate for different single-bit error probabilities using different ECC codeword lengths for different numbers of errors injected in the codeword.

It is important to note that, while evaluating low error probabilities is demonstrative, it represents a pessimistic scenario since a real DRAM chip exhibits a mix of low and high per-bit error probabilities. Although any error-profiling mechanism that identifies errors based on when they manifest might miss low-probability errors, the data shows that BEEP is resilient to low error probabilities, especially for longer, more realistic codewords. Therefore, our evaluations demonstrate that BEEP effectively enables a new profiling methodology that uses the ECC function determined by BEER to infer pre-correction errors from observed post-correction error patterns.

Footnote: Patel et al. [139] provide a preliminary exploration of how per-bit error probabilities are distributed throughout a DRAM chip, but formulating a detailed error model for accurate simulation is beyond the scope of our work.

Although we demonstrate BEEP solely for data-retention errors, BEEP can potentially be extended to identify errors that occur due to other DRAM error mechanisms (e.g., stuck-at faults, circuit timing failures). However, simultaneously diagnosing multiple error models is a very difficult problem since different types of faults can be nearly indistinguishable (e.g., data-retention errors and stuck-at-DISCHARGED errors). Profiling for arbitrary error types is a separate problem from what we tackle in this work, and we intend BEEP as a simple, intuitive demonstration of how knowing the ECC function is practically useful. Therefore, we leave extending BEEP to alternative DRAM error mechanisms to future work.

We identify four additional use cases for which BEER mitigates on-die ECC’s interference with third-party studies by revealing the full ECC function (i.e., its parity-check matrix).

If the on-die ECC function is known, a system architect can design a second level of error mitigation (e.g., rank-level ECC) that better suits the error characteristics of a DRAM chip with on-die ECC. Figure 1 provides a simple example of how different ECC functions cause different data bits to be more error-prone even though the pre-correction errors are uniformly distributed. This means that on-die ECC changes the DRAM chip’s software-visible error characteristics in a way that depends on the particular ECC function it employs. If the on-die ECC function is known, we can calculate the expected post-correction error characteristics and build an error model that accounts for the transformative effects of on-die ECC. Using this error model, the system architect can make an informed decision when selecting a secondary mitigation mechanism to complement on-die ECC. For example, architects could modify a traditional rank-level ECC scheme to asymmetrically protect certain data bits that are more prone to errors than others as a result of on-die ECC’s behavior [95, 174]. In general, BEER enables system designers to better design secondary error-mitigation mechanisms to suit the expected DRAM reliability characteristics, thereby improving overall system reliability.

Several DRAM error mechanisms are highly pattern sensitive, including RowHammer [86, 90, 125, 126], data-retention [43, 78, 79, 81, 88, 109, 110, 139], and reduced-access-latency [17, 20, 83, 102, 104]. Different test patterns affect error rates by orders of magnitude [79–81, 86, 100, 109, 139] because each pattern exercises different static and dynamic circuit-level effects. Therefore, test patterns are typically designed carefully to induce the worst-case circuit conditions for the error mechanism under test (e.g., marching ‘1’s [3, 46, 109, 123, 139]). As Section 7.1.2 discusses in greater detail, on-die ECC restricts the possible test patterns to only the ECC function’s codewords. Fortunately, the SAT-

Footnote: Patel et al. [139] increase error coverage by exacerbating the bit-error probability, and their approach (REAPER) can be used alongside BEEP to help identify low-probability errors.

Footnote: By assuming a given data value distribution, e.g., fixed values for a predictable software application, uniform-random data for a general system.
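For the secondary-mitigation use case above, the expected post-correction error characteristics can be computed directly once the parity-check matrix is known. The sketch below is a simplified illustration, not our evaluation methodology: it assumes a systematic single-error-correcting code with $H = [\text{data columns} \mid I]$, assumes every codeword bit flips independently with the same probability `p` (a data-independent simplification, unlike real data-retention errors), and exhaustively enumerates error patterns, which is only feasible for toy code lengths. The two column sets `code_a` and `code_b` are hypothetical shortened Hamming codes chosen for illustration.

```python
from itertools import product


def post_correction_error_rates(data_cols, p):
    """Exact per-data-bit post-correction error probability for a toy SEC code.

    data_cols: columns of H for the data bits; the parity columns are assumed
    to form an identity matrix (systematic form).
    p: assumed independent per-bit pre-correction flip probability.
    """
    r = len(data_cols[0])
    k = len(data_cols)
    n = k + r
    parity_cols = [tuple(int(i == j) for i in range(r)) for j in range(r)]
    cols = [tuple(col) for col in data_cols] + parity_cols
    rates = [0.0] * k
    for e in product([0, 1], repeat=n):  # every pre-correction error pattern
        weight = sum(e)
        prob = p ** weight * (1 - p) ** (n - weight)
        # Syndrome = XOR of the H columns at the failing positions.
        s = tuple(sum(cols[j][i] for j in range(n) if e[j]) % 2 for i in range(r))
        post = list(e)
        if s in cols:  # decoder flips the bit whose column matches the syndrome
            post[cols.index(s)] ^= 1
        for i in range(k):
            rates[i] += prob * post[i]  # bit i is wrong after correction
    return rates


# Two hypothetical (6,3) shortened Hamming codes with different data columns.
code_a = [(1, 1, 0), (1, 0, 1), (0, 1, 1)]
code_b = [(1, 1, 0), (1, 0, 1), (1, 1, 1)]
for name, cols in (("A", code_a), ("B", code_b)):
    print(name, post_correction_error_rates(cols, p=0.01))
```

Whether and by how much the per-bit rates differ depends on the code and the error model; for realistic code lengths and data-dependent errors, the same calculation would be done by simulation (e.g., with EINSim [2]) rather than by enumeration.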

Numerous prior works [17, 20, 83, 90, 104, 136, 157] experimentally study the spatial distributions of errors throughout the DRAM chip in order to gain insight into how the chip operates and how its performance, energy, and/or reliability can be improved. These studies rely on inducing errors at relatively high error rates so that many errors occur that can leak information about a device’s underlying structure. With on-die ECC, studying spatial error distributions requires identifying pre-correction errors throughout the codeword, including within the inaccessible parity bits. BEEP demonstrates one possible concrete way by which BEER enables these studies for chips with on-die ECC.

A third-party tester may want to determine the physical reason(s) behind an observed error. For example, a system integrator who is validating a DRAM chip’s worst-case operating conditions may observe unexpected errors due to an unforeseen defect (e.g., at a precise DQ-pin position). Unfortunately, on-die ECC obscures both the number and locations of pre-correction errors, so the observed errors no longer provide insight into the underlying physical error mechanism responsible. Using BEEP, such errors can be more easily diagnosed because the revealed pre-correction errors directly result from the error mechanism.

Our work demonstrates that on-die ECC is not an insurmountable problem for third-party system design and testing. To further explore how tools like BEER can help clarify a DRAM chip’s core reliability characteristics, we identify several ways in which future studies can build upon our work. We believe these are promising directions to explore and will further facilitate studying the reliability characteristics of current and future devices with on-die ECC.

Extension to Other Devices.

BEER theoretically applies to any memory device that uses a linear block code in which we can exploit data-dependent errors (e.g., CHARGED-to-DISCHARGED) to control which miscorrections occur. A concrete example is DRAM with rank-level ECC, where BEER can be applied as is. However, BEER may be extensible to other memory devices (e.g., Flash memory [11–14, 112, 113, 118], STT-MRAM [62, 96, 182], PCM [101, 142, 152, 177], Racetrack [137, 179], RRAM [134, 171, 176]) if its core principles can be adapted for their error models and ECC functions. These memories all exhibit reliability challenges that BEER can help third-party scientists and engineers better tackle and overcome.

Further Constraining the SAT Problem.

We believe there are several ways to further constrain the SAT problem, including (i) prioritizing more likely hardware ECC implementations, (ii) adding additional SAT constraints for obvious or trivial cases, and (iii) further constraining the set of test patterns.

Improving SAT Solver Efficiency.

Our implementations of BEER and BEEP express ECC arithmetic (e.g., GF(2) matrix operations, SAT constraints) using simple Boolean logic equations. An optimized implementation that leverages native GF(2) BLAS libraries (e.g., LELA [52]) and advanced SAT solver theories (e.g., SMT bitvectors [10]) could drastically improve BEER’s performance, enabling BEER for a wider variety of ECC functions. Taking this a step further, future work could reformulate BEER’s SAT problem mathematically in order to directly solve for the parity-check matrix that can produce a given miscorrection profile. Such an approach could identify the solution significantly faster than using a SAT solver to perform a brute-force exploration of the entire solution space.

Footnote: There may be no need to infer error syndromes from miscorrections if the CPU directly exposes them [26].

8. Related Work

To our knowledge, this is the first work to (i) determine the full on-die ECC function and (ii) recover the number and bit-exact error locations of pre-correction errors in DRAM chips with on-die ECC without any insight into the ECC mechanism or any hardware modification. We distinguish BEER from related works that study on-die ECC, techniques for reverse-engineering DRAM ECC functions, and DRAM error profiling.

On-Die ECC.

Several works study on-die ECC [15, 40, 129, 138], but only Patel et al. [138] attempt to identify pre-correction error characteristics without bypassing or modifying the on-die ECC mechanism. Although Patel et al. [138] statistically infer high-level characteristics about the ECC mechanism and pre-correction errors, their approach has several key limitations (discussed in Section 1). BEER overcomes these limitations and identifies (1) the full ECC function and (2) the bit-exact locations of pre-correction errors without requiring any prerequisite knowledge about the errors being studied.

Determining ECC Functions.

Prior works reverse-engineer ECC characteristics in Flash memories [167, 168, 175], DRAM with rank-level ECC [26], and on-die ECC [138]. However, none of these works can identify the full ECC function by studying data only at the external DRAM chip interface because they either require (1) examining the encoded data [167, 168, 175], (2) injecting errors directly into the codeword [26], or (3) knowing when an ECC correction is performed and obtaining the resulting error syndrome [26]. On-die ECC provides no insight into the error-correction process and does not report if or when a correction is performed.

DRAM Error Profiling.

Prior work proposes many DRAM error profiling methodologies [17, 20, 26, 37, 42, 43, 46, 74, 75, 78–80, 83–86, 90, 95, 102, 104, 109, 110, 138, 139, 141, 157, 169, 172, 173]. Unfortunately, none of these approaches are capable of identifying pre-correction error locations throughout the entire codeword (i.e., including within parity-check bits).

9. Conclusion

We introduce Bit-Exact ECC Recovery (BEER), a new methodology for determining the full DRAM on-die ECC function (i.e., its parity-check matrix) without requiring hardware support, prerequisite knowledge about the DRAM chip or on-die ECC mechanism, or access to ECC metadata (e.g., parity-check bits, error syndromes). We use BEER to determine the on-die ECC functions of 80 real LPDDR4 DRAM chips and show that BEER is both effective and practical using rigorous simulations. We discuss five concrete use-cases for BEER, including BEEP, a new DRAM error profiling methodology capable of inferring exact pre-correction error counts and locations. We believe that BEER takes an important step towards enabling effective third-party design and testing around DRAM chips with on-die ECC and are hopeful that BEER will enable many new studies going forward.

Acknowledgments

We thank the SAFARI Research Group members for the valuable input and stimulating intellectual environment they provide, Karthik Sethuraman for his expertise in nonparametric statistics, and the anonymous reviewers for their feedback.

References

[1] “BEER Source Code,” https://github.com/CMU-SAFARI/BEER.
[2] “EINSim Source Code,” https://github.com/CMU-SAFARI/EINSim.
[3] R. D. Adams, High Performance Memory Testing: Design Principles, Fault Modeling and Self-Test. Springer SBM, 2002.
[4] ADATA, “ADATA XPG DDR4 Officially Validated by AMD as AM4/Ryzen Compatible,” ADATA, Tech. Rep., 2017.
[5] Z. Al-Ars, S. Hamdioui, and A. J. van de Goor, “Effects of Bit Line Coupling on the Faulty Behavior of DRAMs,” in VTS, 2004.
[6] AMD, “AMD Opteron 4300 Series Processors,” 2018.
[7] S. Baek, S. Cho, and R. Melhem, “Refresh Now and Then,” in TC, 2014.
[8] N. Bjørner, A.-D. Phan, and L. Fleckenstein, “nu-Z: An Optimizing SMT Solver,” in TACAS, 2015.
[9] R. C. Bose and D. K. Ray-Chaudhuri, “On a Class of Error Correcting Binary Group Codes,” Information and Control, 1960.
[10] R. Brummayer and A. Biere, “Boolector: An Efficient SMT Solver for Bit-Vectors and Arrays,” in International Conference on Tools and Algorithms for the Construction and Analysis of Systems, 2009.
[11] Y. Cai, S. Ghose, E. F. Haratsch, Y. Luo, and O. Mutlu, “Error Characterization, Mitigation, and Recovery In Flash-Memory-Based Solid-State Drives,” Proc. IEEE, 2017.
[12] Y. Cai, S. Ghose, E. F. Haratsch, Y. Luo, and O. Mutlu, “Errors in Flash-Memory-Based Solid-State Drives: Analysis, Mitigation, and Recovery,” Inside Solid State Drives, 2018.
[13] Y. Cai, E. F. Haratsch, O. Mutlu, and K. Mai, “Error Patterns in MLC NAND Flash Memory: Measurement, Characterization, and Analysis,” in DATE, 2012.
[14] Y. Cai, G. Yalcin, O. Mutlu, E. F. Haratsch, A. Cristal, O. S. Unsal, and K. Mai, “Error Analysis and Retention-Aware Error Management for NAND Flash Memory,” in ITJ, 2013.
[15] S. Cha et al., “Defect Analysis and Cost-Effective Resilience Architecture for Future DRAM Devices,” in HPCA, 2017.
[16] K. Chandrasekar, S. Goossens, C. Weis, M. Koedam, B. Akesson, N. Wehn, and K. Goossens, “Exploiting Expendable Process-Margins in DRAMs for Run-Time Performance Optimization,” in DATE, 2014.
[17] K. K. Chang, A. Kashyap, H. Hassan, S. Ghose, K. Hsieh, D. Lee, T. Li, G. Pekhimenko, S. Khan, and O. Mutlu, “Understanding Latency Variation in Modern DRAM Chips: Experimental Characterization, Analysis, and Optimization,” in SIGMETRICS, 2016.
[18] K. K. Chang, D. Lee, Z. Chishti, A. R. Alameldeen, C. Wilkerson, Y. Kim, and O. Mutlu, “Improving DRAM Performance by Parallelizing Refreshes with Accesses,” in HPCA, 2014.
[19] K. K. Chang, P. J. Nair, D. Lee, S. Ghose, M. K. Qureshi, and O. Mutlu, “Low-Cost Inter-Linked Subarrays (LISA): Enabling Fast Inter-Subarray Data Movement in DRAM,” in HPCA, 2016.
[20] K. K. Chang, A. G. Yağlıkçı, S. Ghose, A. Agrawal, N. Chatterjee, A. Kashyap, D. Lee, M. O’Connor, H. Hassan, and O. Mutlu, “Understanding Reduced-Voltage Operation in Modern DRAM Devices: Experimental Characterization, Analysis, and Mechanisms,” in SIGMETRICS, 2017.
[21] H.-M. Chen, S.-Y. Lee, T. Mudge, C.-J. Wu, and C. Chakrabarti, “Configurable-ECC: Architecting a Flexible ECC Scheme to Support Different Sized Accesses in High Bandwidth Memory Systems,” TC, 2018.
[22] K.-L. Cheng, M.-F. Tsai, and C.-W. Wu, “Neighborhood Pattern-Sensitive Fault Testing and Diagnostics for Random-Access Memories,” TCAD, 2002.
[23] B. R. Childers, J. Yang, and Y. Zhang, “Achieving Yield, Density and Performance Effective DRAM at Extreme Technology Sizes,” in MEMSYS, 2015.
[24] A. Cimatti, A. Franzén, A. Griggio, R. Sebastiani, and C. Stenico, “Satisfiability Modulo The Theory of Costs: Foundations and Applications,” in TACAS, 2010.
[25] G. C. Clark Jr and J. B. Cain, Error-Correction Coding for Digital Communications. Springer SBM, 2013.
[26] L. Cojocar, K. Razavi, C. Giuffrida, and H. Bos, “Exploiting Correcting Codes: On the Effectiveness of ECC Memory Against Rowhammer Attacks,” in S&P, 2019.
[27] D. J. Costello and S. Lin, Error Control Coding: Fundamentals and Applications. Prentice Hall, 1982.
[28] L. De Moura and N. Bjørner, “Z3: An Efficient SMT Solver,” in TACAS, 2008.
[29] T. J. Dell, “A White Paper on the Benefits of Chipkill-Correct ECC for PC Server Main Memory,” IBM Microelectronics Division, 1997.
[30] I. Dillig, T. Dillig, K. L. McMillan, and A. Aiken, “Minimum Satisfying Assignments for SMT,” in CAV, 2012.
[31] N. Edri, P. Meinerzhagen, A. Teman, A. Burg, and A. Fish, “Silicon-Proven, Per-Cell Retention Time Distribution Model for Gain-Cell Based eDRAMs,” IEEE TOCS, 2016.
[32] B. Efron, “Bootstrap Methods: Another Look at the Jackknife,” in Breakthroughs in Statistics, 1992.
[33] S. Field, “Microsoft Azure uses Error-Correcting Code Memory for Enhanced Reliability and Security,” https://azure.microsoft.com/en-us/blog/microsoft-azure-uses-error-correcting-code-memory-for-enhanced-reliability-and-security, 2015.
[34] G. D. Forney, “Concatenated Codes,” MIT Press, 1965.
[35] P. Frigo, E. Vannacci, H. Hassan, V. van der Veen, O. Mutlu, C. Giuffrida, H. Bos, and K. Razavi, “TRRespass: Exploiting the Many Sides of Target Row Refresh,” in IEEE S&P, 2020.
[36] R. G. Gallager, “Low density parity check codes,” Ph.D. dissertation, Massachusetts Institute of Technology, 1963.
[37] F. Gao, G. Tziantzioulis, and D. Wentzlaff, “ComputeDRAM: In-Memory Compute using Off-the-Shelf DRAMs,” in MICRO, 2019.
[38] C. P. Gomes, H. Kautz, A. Sabharwal, and B. Selman, “Satisfiability Solvers,” Foundations of Artificial Intelligence, 2008.
[39] S.-L. Gong, J. Kim, and M. Erez, “DRAM Scaling Error Evaluation Model Using Various Retention Time,” in DSN-W, 2017.
[40] S.-L. Gong, J. Kim, S. Lym, M. Sullivan, H. David, and M. Erez, “DUO: Exposing On-Chip Redundancy to Rank-Level ECC for High Reliability,” in HPCA, 2018.
[41] B. Gu, T. Coughlin, B. Maxwell, J. Griffith, J. Lee, J. Cordingley, S. Johnson, E. Karaginiannis, and J. Ehmann, “Challenges and Future Directions of Laser Fuse Processing in Memory Repair,” Proc. Semicon China, 2003.
[42] T. Hamamoto, S. Sugiura, and S. Sawada, “Well Concentration: A Novel Scaling Limitation Factor Derived From DRAM Retention Time and Its Modeling,” in IEDM, 1995.
[43] T. Hamamoto, S. Sugiura, and S. Sawada, “On the Retention Time Distribution of Dynamic Random Access Memory (DRAM),” in TED, 1998.
[44] R. W. Hamming, “Error Detecting and Error Correcting Codes,” in Bell Labs Technical Journal, 1950.
[45] H. Hassan, M. Patel, J. S. Kim, A. G. Yağlıkçı, N. Vijaykumar, N. M. Ghiasi, S. Ghose, and O. Mutlu, “CROW: A Low-Cost Substrate for Improving DRAM Performance, Energy Efficiency, and Reliability,” in ISCA, 2019.
[46] H. Hassan, N. Vijaykumar, S. Khan, S. Ghose, K. Chang, G. Pekhimenko, D. Lee, O. Ergin, and O. Mutlu, “SoftMC: A Flexible and Practical Open-Source Infrastructure for Enabling Experimental DRAM Studies,” in HPCA, 2017.
[47] Hewlett-Packard Development Company, L.P., “Why Buy HP Qualified Memory?” Hewlett-Packard Development Company, L.P., Tech. Rep., 2011, 3rd Edition.
[48] M.-J. Ho, “Method of Analyzing DRAM Redundancy Repair,” 2003, US Patent 6,573,524.
[49] A. Hocquenghem, “Codes Correcteurs D’erreurs,” Chiffres, 1959.
[50] S. Hong, “Memory Technology Trend and Future Challenges,” in IEDM, 2010.
[51] M. Horiguchi and K. Itoh, Nanoscale Memory Repair […]
[…] Fundamentals of Error-Correcting Codes. Cambridge University Press, 2003.
[54] K. Iniewski, Nano-Semiconductors: Devices and Technology. CRC Press, 2011.
[55] Integrated Circuit Engineering Corporation, Cost Effective IC Manufacturing […]
[61] E. Ipek, O. Mutlu, J. F. Martínez, and R. Caruana, “Self-Optimizing Memory Controllers: A Reinforcement Learning Approach,” in ISCA, 2008.
[62] T. Ishigaki, T. Kawahara, R. Takemura, K. Ono, K. Ito, H. Matsuoka, and H. Ohno, “A Multi-Level-Cell Spin-Transfer Torque Memory with Series-Stacked Magnetotunnel Junctions,” in VLSI, 2010.
[63] ISSI, “8Gb (x16 x 2 Channel) Mobile LPDDR4/LPDDR4X,” 2020.
[64] K. Itoh, VLSI Memory Chip Design. Springer Science & Business Media, 2013, vol. 5.
[65] B. Jacob, S. Ng, and D. Wang, Memory Systems: Cache, DRAM, Disk. Morgan Kaufmann, 2010.
[66] D. James, “Silicon Chip Teardown to the Atomic Scale–Challenges Facing the Reverse Engineering of Semiconductors,” Microscopy and Microanalysis, 2010.
[67] JEDEC, DDR3 SDRAM Specification, 2008.
[68] JEDEC, DDR4 SDRAM Specification, 2012.
[69] JEDEC, “Low Power Double Data Rate 4 (LPDDR4) SDRAM Specification,” JEDEC Standard JESD209–4B, 2014.
[70] JEDEC, DDR5 SDRAM Specification, 2020.
[71] N. K. Jha and S. Gupta, Testing of Digital Systems. Cambridge University Press, 2003.
[72] S. Jin, J.-H. Yi, Y. J. Park, H. S. Min, J. H. Choi, and D. G. Kang, “Modeling of Retention Time Distribution of DRAM Cell Using a Monte-Carlo Method,” in IEDM, 2004.
[73] M. Jung, C. C. Rheinländer, C. Weis, and N. Wehn, “Reverse Engineering of DRAMs: Row Hammer with Crosshair,” in MEMSYS, 2016.
[74] M. Jung, C. Weis, N. Wehn, M. Sadri, and L. Benini, “Optimized Active and Power-Down Mode Refresh Control in 3D-DRAMs,” in VLSI-SoC, 2014.
[75] M. Jung, É. Zulian, D. M. Mathew, M. Herrmann, C. Brugger, C. Weis, and N. Wehn, “Omitting Refresh: A Case Study for Commodity and Wide I/O DRAMs,” in MEMSYS, 2015.
[76] U. Kang, H.-s. Yu, C. Park, H. Zheng, J. Halbert, K. Bains, S. Jang, and J. S. Choi, “Co-Architecting Controllers and DRAM to Enhance DRAM Process Scaling,” in The Memory Forum, 2014.
[77] B. Keeth, R. J. Baker, B. Johnson, and F. Lin, DRAM Circuit Design: Fundamental and High-Speed Topics. John Wiley & Sons, 2007.
[78] S. Khan, D. Lee, Y. Kim, A. R. Alameldeen, C. Wilkerson, and O. Mutlu, “The Efficacy of Error Mitigation Techniques for DRAM Retention Failures: A Comparative Experimental Study,” in SIGMETRICS, 2014.
[79] S. Khan, D. Lee, and O. Mutlu, “PARBOR: An Efficient System-Level Technique to Detect Data-Dependent Failures in DRAM,” in DSN, 2016.
[80] S. Khan, C. Wilkerson, D. Lee, A. R. Alameldeen, and O. Mutlu, “A Case for Memory Content-Based Detection and Mitigation of Data-Dependent Failures in DRAM,” in IEEE CAL, 2016.
[81] S. Khan, C. Wilkerson, Z. Wang, A. R. Alameldeen, D. Lee, and O. Mutlu, “Detecting and Mitigating Data-Dependent DRAM Failures by Exploiting Current Memory Content,” in MICRO, 2017.
[82] D.-H. Kim, S. Cha, and L. S. Milor, “AVERT: An Elaborate Model for Simulating Variable Retention Time in DRAMs,” Microelectronics Reliability, 2015.
[83] J. S. Kim, M. Patel, H. Hassan, and O. Mutlu, “Solar-DRAM: Reducing DRAM Access Latency by Exploiting the Variation in Local Bitlines,” in ICCD, 2018.
[84] J. S. Kim, M. Patel, H. Hassan, and O. Mutlu, “The DRAM Latency PUF: Quickly Evaluating Physical Unclonable Functions by Exploiting the Latency-Reliability Tradeoff in Modern Commodity DRAM Devices,” in HPCA, 2018.
[85] J. S. Kim, M. Patel, H. Hassan, L. Orosa, and O. Mutlu, “D-RaNGe: Using Commodity DRAM Devices to Generate True Random Numbers With Low Latency And High Throughput,” in HPCA, 2019.
[86] J. S. Kim, M. Patel, A. G. Yağlıkçı, H. Hassan, R. Azizi, L. Orosa, and O. Mutlu, “Revisiting RowHammer: An Experimental Analysis of Modern Devices and Mitigation Techniques,” in ISCA, 2020.
[87] J. Kim, M. Sullivan, S. Lym, and M. Erez, “All-Inclusive ECC: Thorough End-to-End Protection for Reliable Computer Memory,” in ISCA, 2016.
[88] K. Kim and J. Lee, “A New Investigation of Data Retention Time in Truly Nanoscaled DRAMs,” in EDL, 2009.
[89] S.-H. Kim et al., “A Low Power and Highly Reliable 400Mbps Mobile DDR SDRAM With On-Chip Distributed ECC,” in ASSCC, 2007.
[90] Y. Kim, R. Daly, J. Kim, C. Fallin, J. H. Lee, D. Lee, C. Wilkerson, K. Lai, and O. Mutlu, “Flipping Bits in Memory Without Accessing Them: An Experimental Study of DRAM Disturbance Errors,” in ISCA, 2014.
[91] Y. Kim, V. Seshadri, D. Lee, J. Liu, and O. Mutlu, “A Case for Exploiting Subarray-Level Parallelism (SALP) in DRAM,” in ISCA, 2012.
[92] Kingston Technology Corporation, “Kingston Testing Overview,” Kingston Technology Corporation, Tech. Rep., 2012.
[93] Y. Konishi, M. Kumanoya, H. Yamasaki, K. Dosaka, and T. Yoshihara, “Analysis of Coupling Noise Between Adjacent Bit Lines in Megabit DRAMs,” JSSC, 1989.
[94] S. Koppula, L. Orosa, A. G. Yağlıkçı, R. Azizi, T. Shahroodi, K. Kanellopoulos, and O. Mutlu, “EDEN: Enabling Energy-Efficient, High-Performance Deep Neural Network Inference Using Approximate DRAM,” in MICRO, 2019.
[95] K. Kraft, C. Sudarshan, D. M. Mathew, C. Weis, N. Wehn, and M. Jung, “Improving the Error Behavior of DRAM by Exploiting its Z-Channel Property,” in DATE, 2018.
[96] E. Kültürsay, M. Kandemir, A. Sivasubramaniam, and O. Mutlu, “Evaluating STT-RAM as an Energy-Efficient Main Memory Alternative,” in ISPASS, 2013.
[97] N. Kwak et al., “A 4.8 Gb/s/pin 2Gb LPDDR4 SDRAM with Sub-100 µA Self-Refresh Current for IoT Applications,” in ISSCC, 2017.
[98] H.-J. Kwon et al., “An Extremely Low-Standby-Power 3.733 Gb/s/pin 2Gb LPDDR4 SDRAM for Wearable Devices,” in ISSCC, 2017.
[99] S. Kwon, Y. H. Son, and J. H. Ahn, “Understanding DDR4 in Pursuit of In-DRAM ECC,” in ISOCC, 2014.
[100] M. Lanteigne, “How Rowhammer Could Be Used to Exploit Weaknesses in Computer Hardware,” Tech. Rep., 2016.
[101] B. C. Lee, E. Ipek, O. Mutlu, and D. Burger, “Architecting Phase Change Memory as a Scalable DRAM Alternative,” in ISCA, 2009.
[102] D. Lee, Y. Kim, G. Pekhimenko, S. Khan, V. Seshadri, K. Chang, and O. Mutlu, “Adaptive-Latency DRAM: Optimizing DRAM Timing for the Common-Case,” in HPCA, 2015.
[103] D. Lee, S. Ghose, G. Pekhimenko, S. Khan, and O. Mutlu, “Simultaneous Multi-Layer Access: Improving 3D-Stacked Memory Bandwidth at Low Cost,” in

TACO , 2016.[104] D. Lee, S. Khan, L. Subramanian, S. Ghose, R. Ausavarungnirun,G. Pekhimenko, V. Seshadri, and O. Mutlu, “Design-Induced LatencyVariation in Modern DRAM Chips: Characterization, Analysis, andLatency Reduction Mechanisms,” in

SIGMETRICS , 2017.[105] D. Lee, Y. Kim, V. Seshadri, J. Liu, L. Subramanian, and O. Mutlu,“Tiered-Latency DRAM: A Low Latency and Low Cost DRAM Archi-tecture,” in

HPCA , 2013.[106] D. Lee, L. Subramanian, R. Ausavarungnirun, J. Choi, and O. Mutlu,“Decoupled Direct Memory Access: Isolating CPU and IO Traffic byLeveraging a Dual-Data-Port DRAM,” in

PACT , 2015.[107] Y. Li, H. Schneider, F. Schnabel, R. Thewes, and D. Schmitt-Landsiedel,“DRAM Yield Analysis and Optimization by a Statistical Design Ap-proach,” in

CSI , 2011.[108] S. Lin and D. J. Costello,

Error Control Coding: Fundamentals andApplications , 2004.[109] J. Liu, B. Jaiyen, Y. Kim, C. Wilkerson, and O. Mutlu, “An Experi-mental Study of Data Retention Behavior in Modern DRAM Devices:Implications for Retention Time Profiling Mechanisms,” in

ISCA , 2013.[110] J. Liu, B. Jaiyen, R. Veras, and O. Mutlu, “RAIDR: Retention-AwareIntelligent DRAM Refresh,” in

ISCA , 2012.[111] H. Luo, T. Shahroodi, H. Hassan, M. Patel, A. Giray Ya˘glıkc¸ı, L. Orosa,J. Park, and O. Mutlu, “CLR-DRAM: A Low-Cost DRAM ArchitectureEnabling Dynamic Capacity-Latency Trade-Off,” in

ISCA , 2020.[112] Y. Luo, S. Ghose, Y. Cai, E. F. Haratsch, and O. Mutlu, “HeatWatch:Improving 3D NAND Flash Memory Device Reliability by ExploitingSelf-Recovery and Temperature Awareness,” in

HPCA , 2018.[113] Y. Luo, S. Ghose, Y. Cai, E. F. Haratsch, and O. Mutlu, “Improving 3DNAND Flash Memory Lifetime by Tolerating Early Retention Loss andProcess Variation,”

SIGMETRICS , 2018.[114] Y. Luo, S. Govindan, B. Sharma, M. Santaniello, J. Meza, A. Kansal, J. Liu,B. Khessib, K. Vaid, and O. Mutlu, “Characterizing Application MemoryError Vulnerability to Optimize Datacenter Cost via Heterogeneous-Reliability Memory,” in

DSN , 2014.[115] F. J. MacWilliams and N. J. A. Sloane,

The Theory of Error-CorrectingCodes . Elsevier, 1977.[116] J. Maiz, S. Hareland, K. Zhang, and P. Armstrong, “Characterization ofMulti-Bit Soft Error Events in Advanced SRAMs,” in

IEDM , 2003.[117] T. C. May and M. H. Woods, “Alpha-Particle-Induced Soft Errors inDynamic Memories,”

TED , 1979.[118] J. Meza et al. , “A Large-Scale Study of Flash Memory Errors in theField,” in

SIGMETRICS , 2015.[119] J. Meza, Q. Wu, S. Kumar, and O. Mutlu, “Revisiting Memory Errors inLarge-Scale Production Data Centers: Analysis and Modeling of NewTrends from the Field,” in

DSN , 2015.[120] Micron Technology Inc., “ECC Brings Reliability and Power Efficiencyto Mobile Devices,” Micron Technology Inc., Tech. Rep., 2017. Error Correction Coding: Mathematical Methods and Algo-rithms . John Wiley & Sons, 2005.[123] I. Mrozek,

Multi-Run Memory Tests for Pattern Sensitive Faults .Springer, 2019.[124] O. Mutlu, “Memory Scaling: A Systems Architecture Perspective,” in

IMW , 2013.[125] O. Mutlu, “The RowHammer Problem and Other Issues we may Faceas Memory Becomes Denser,” in

DATE , 2017.[126] O. Mutlu and J. Kim, “RowHammer: A Retrospective,” in

TCAD , 2019.[127] O. Mutlu and L. Subramanian, “Research Problems and Opportunitiesin Memory Systems,” in

SUPERFRI , 2014.[128] P. J. Nair, D.-H. Kim, and M. K. Qureshi, “ArchShield: ArchitecturalFramework for Assisting DRAM Scaling by Tolerating High ErrorRates,” in

ISCA , 2013.[129] P. J. Nair, V. Sridharan, and M. K. Qureshi, “XED: Exposing On-DieError Detection Information for Strong Memory Reliability,” in

ISCA ,2016.[130] Y. Nakagome, M. Aoki, S. Ikenaga, M. Horiguchi, S. Kimura,Y. Kawamoto, and K. Itoh, “The Impact of Data-Line Interference Noiseon DRAM Scaling,” in

JSSC , 1988.[131] NASA, “NASA NEPP Program Memory Technology - Testing, Anal-ysis, and Roadmap,” https://radhome.gsfc.nasa.gov/radhome/papers/radecs05 sc.pdf, 2016.[132] Y. Nishi and B. Magyari-Kope,

Advances in Non-Volatile Memory andStorage Technology . Woodhead Publishing, 2019.[133] T.-Y. Oh et al. , “A 3.2Gbps/pin 8Gb 1.0V LPDDR4 SDRAM with In-tegrated ECC Engine for Sub-1V DRAM Core Operation,” in

ISSCC ,2014.[134] S. Pal, S. Bose, W.-H. Ki, and A. Islam, “Design of Power-and Variability-Aware Nonvolatile RRAM Cell Using Memristor as a Memory Element,”

J-EDS , 2019.[135] K. Park, C. Lim, D. Yun, and S. Baeg, “Experiments and Root CauseAnalysis for Active-Precharge Hammering Fault In DDR3 SDRAMUnder 3 × Nm Technology,”

Microelectronics Reliability , 2016.[136] K. Park, D. Yun, and S. Baeg, “Statistical Distributions of Row-Hammering Induced Failures in DDR3 Components,”

MicroelectronicsReliability , 2016.[137] S. Parkin and S.-H. Yang, “Memory on the Racetrack,”

Nature Nan-otechnology , 2015.[138] M. Patel, J. S. Kim, H. Hassan, and O. Mutlu, “Understanding and Mod-eling On-Die Error Correction in Modern DRAM: An ExperimentalStudy Using Real Devices,” in

DSN , 2019.[139] M. Patel, J. S. Kim, and O. Mutlu, “The Reach Profiler (REAPER): En-abling the Mitigation of DRAM Retention Failures via Profiling atAggressive Conditions,” in

ISCA , 2017.[140] M. R. Prasad, A. Biere, and A. Gupta, “A Survey of Recent Advancesin SAT-based Formal Verification,”

STTT , 2005.[141] M. K. Qureshi, D.-H. Kim, S. Khan, P. J. Nair, and O. Mutlu, “AVATAR:A Variable-Retention-Time (VRT) Aware Refresh for DRAM Systems,”in

DSN , 2015.[142] M. K. Qureshi, V. Srinivasan, and J. A. Rivers, “Scalable High Perfor-mance Main Memory System Using Phase-change Memory Technol-ogy,” in

ISCA , 2009.[143] QY Research, “Global DRAM Market Professional Survey Re-port,” https://garnerinsights.com/Global-DRAM-Market-Professional-Survey-Report-2019, 2019.[144] M. Redeker, B. F. Cockburn, and D. G. Elliott, “An Investigation IntoCrosstalk Noise in DRAM Structures,” in

MTDT , 2002.[145] I. S. Reed and G. Solomon, “Polynomial Codes Over Certain FiniteFields,”

SIAM , 1960.[146] T. Richardson and R. Urbanke,

Modern Coding Theory . CambridgeUniversity Press, 2008.[147] R. Rooney and N. Koyle, “Micron DDR5 SDRAM: New Features,” Mi-cron Technology Inc., Tech. Rep., 2019.[148] R. M. Roth,

Introduction to Coding Theory

SIGMETRICS , 2009.[152] N. H. Seong, S. Yeo, and H.-H. S. Lee, “Tri-Level-Cell Phase ChangeMemory: Toward an Efficient and Reliable Memory System,” in

ISCA ,2013. [153] V. Seshadri et al. , “RowClone: Fast and Energy-Efficient In-DRAMBulk Data Copy and Initialization,” in

MICRO , 2013.[154] V. Seshadri, D. Lee, T. Mullins, H. Hassan, A. Boroumand, J. Kim, M. A.Kozuch, O. Mutlu, P. B. Gibbons, and T. C. Mowry, “Ambit: In-MemoryAccelerator for Bulk Bitwise Operations Using Commodity DRAMTechnology,” in

MICRO , 2017.[155] V. Seshadri and O. Mutlu, “In-DRAM Bulk Bitwise Execution Engine,” arXiv preprint arXiv:1905.09822 , 2019.[156] S. M. Seyedzadeh, D. Kline Jr, A. K. Jones, and R. Melhem, “MitigatingBitline Crosstalk Noise in DRAM Memories,” in

ISMS , 2017.[157] C. G. Shirley and W. R. Daasch, “Copula Models of Correlation: ADRAM Case Study,” in TC , 2014.[158] SK Hnyix, “366ball FBGA Specification 32Gb LPDDR4 (x16, 4 Channel),”2015.[159] SMART Modular Technologies, “SMART Press Release 415,” SMARTModular Technologies, Tech. Rep., 2017.[160] Y. H. Son, S. Lee, O. Seongil, S. Kwon, N. S. Kim, and J. H. Ahn, “CiDRA:A cache-Inspired DRAM resilience architecture,” in HPCA , 2015.[161] V. Sridharan, N. DeBardeleben, S. Blanchard, K. B. Ferreira, J. Stearley,J. Shalf, and S. Gurumurthi, “Memory Errors in Modern Systems: TheGood, the Bad, and the Ugly,” in

ASPLOS , 2015.[162] V. Sridharan and D. Liberty, “A Study of DRAM Failures in the Field,”in SC , 2012.[163] S. Sutar, A. Raha, and V. Raghunathan, “D-PUF: An Intrinsically Recon-figurable DRAM PUF for Device Authentication in Embedded Systems,”in CASES , 2016.[164] R. Torrance and D. James, “The State-of-the-Art in IC Reverse Engi-neering,” in

CHES , 2009.[165] A. J. Van de Goor,

Testing Semiconductor Memories: Theory and Practice .John Wiley & Sons, Inc., 1991.[166] A. J. Van De Goor and I. Schanstra, “Address and Data Scrambling:Causes and Impact on Memory Tests,” in

DELTA , 2002.[167] J. P. van Zandwijk, “A Mathematical Approach to NAND Flash-Memory Descrambling and Decoding,”

Digital Investigation , 2015.[168] J. P. van Zandwijk, “Bit-Errors as a Source of Forensic Information inNAND-Flash Memory,”

Digital Investigation , 2017.[169] R. K. Venkatesan, S. Herr, and E. Rotenberg, “Retention-Aware Place-ment in DRAM (RAPID): Software Methods for Quasi-Non-VolatileDRAM,” in

HPCA

NVMTS , 2015.[172] C. Weis, M. Jung, P. Ehses, C. Santos, P. Vivet, S. Goossens, M. Koedam,and N. Wehn, “Retention Time Measurements and Modelling of BitError Rates of Wide I/O DRAM in MPSoCs,” in

DATE , 2015.[173] C. Weis, M. Jung, O. Naji, C. Santos, P. Vivet, and A. Hansson, “ThermalAspects and High-Level Explorations of 3D Stacked DRAMs,” in

ISVLSI ,2015.[174] W. Wen, M. Mao, X. Zhu, S. H. Kang, D. Wang, and Y. Chen, “CD-ECC:Content-Dependent Error Correction Codes for Combating Asymmet-ric Nonvolatile Memory Operation Errors,” in

ICCAD , 2013.[175] J. Wise, “Reverse Engineering a NAND Flash Device Management Al-gorithm,” https://joshuawise.com/projects/ndfrecovery

Proc. IEEE , 2012.[177] H.-S. P. Wong, S. Raoux, S. Kim, J. Liang, J. P. Reifenberg, B. Rajendran,M. Asheghi, and K. E. Goodson, “Phase Change Memory,”

Proc. IEEE ,2010.[178] D. S. Yaney, C.-Y. Lu, R. A. Kohler, M. J. Kelly, and J. T. Nelson, “AMeta-Stable Leakage Phenomenon in DRAM Charge Storage-VariableHold Time,” in

IEDM , 1987.[179] C. Zhang, G. Sun, X. Zhang, W. Zhang, W. Zhao, T. Wang, Y. Liang,Y. Liu, Y. Wang, and J. Shu, “Hi-Fi Playback: Tolerating Position Errorsin Shift Operations of Racetrack Memory,” in

ISCA , 2015.[180] T. Zhang, K. Chen, C. Xu, G. Sun, T. Wang, and Y. Xie, “Half-DRAM:A High-Bandwidth and Low-Power DRAM Architecture from theRethinking of Fine-Grained Activation,” in

ISCA , 2014.[181] X. Zhang,

VLSI Architectures for Modern Error-Correcting Codes . CRCPress, 2015.[182] Y. Zhang, L. Zhang, W. Wen, G. Sun, and Y. Chen, “Multi-Level CellSTT-RAM: Is it Realistic or Just a Dream?” in

ICCAD , 2012., 2012.