Addressing multiple bit/symbol errors in DRAM subsystem

Abstract

As DRAM technology continues to evolve towards smaller feature sizes and increased densities, faults in the DRAM subsystem are becoming more severe. Current servers mostly use CHIPKILL-based schemes to tolerate up to one or two symbol errors per DRAM beat. Multi-symbol errors arising from faults in multiple data buses and chips may not be detected by these schemes. In this paper, we introduce Single Symbol Correction Multiple Symbol Detection (SSCMSD), a novel error handling scheme that corrects single-symbol errors and detects multi-symbol errors. Our scheme uses a hash in combination with an Error Correcting Code (ECC) to avoid silent data corruptions (SDCs). SSCMSD can also enhance the capability of detecting errors in address bits. We employ a 32-bit CRC along with a Reed-Solomon code to implement SSCMSD for an x4-based DDRx system. Our simulations show that the proposed scheme effectively prevents SDCs in the presence of multiple symbol errors. Our novel design enabled us to achieve this without introducing additional READ latency. The design needs 19 chips per rank (storage overhead of 18.75 percent), 76 data bus-lines and additional hash logic at the memory controller.



Ravikiran Yeleswarapu and Arun K. Somani

Abstract: As DRAM technology continues to evolve towards smaller feature sizes and increased densities, faults in DRAM subsystem are becoming more severe. Current servers mostly use CHIPKILL based schemes to tolerate up to one/two symbol errors per DRAM beat. Multi-symbol errors arising due to faults in multiple data buses and chips may not be detected by these schemes. In this paper, we introduce Single Symbol Correction Multiple Symbol Detection (SSCMSD) - a novel error handling scheme to correct single-symbol errors and detect multi-symbol errors. Our scheme makes use of a hash in combination with Error Correcting Code (ECC) to avoid silent data corruptions (SDCs). SSCMSD can also enhance the capability of detecting errors in address bits. We employ 32-bit CRC along with Reed-Solomon code (ECC) to implement SSCMSD for a x4 based DDRx system. Our simulations show that the proposed scheme effectively prevents SDCs in the presence of multiple symbol errors. We are able to achieve this improvement in reliability with similar READ latency as compared to existing ECC. For this design, we need 19 chips per rank (storage overhead of 18.75 percent), 76 data bus-lines and additional hash-logic at the memory controller.

Index Terms: DRAM Reliability, Reed Solomon Code, Hash, Chipkill, Silent Data Corruption, Error Correcting Code, multiple bit errors

Ravikiran Yeleswarapu and Arun K. Somani are with the Department of Electrical and Computer Engineering, Iowa State University, Ames, IA, 50010.

1 Introduction

Failures in the DRAM subsystem are one of the major sources of crashes due to hardware errors in computing systems [2]. As DRAM technology continues to evolve towards smaller feature sizes and increased densities, faults in DRAM devices are predicted to become more severe. Small cell dimensions limit the charge that can be stored in them, which results in lower noise margins. As cell density increases, coupling (or crosstalk) effects come into the picture. In fact, researchers have recently identified "disturbance errors" [19] in newer DRAM devices. This error has a cross-device correlation and hence leads to multi-bit errors across different devices in a rank.

Each generation of the DDRx family has doubled the transfer rates and reduced I/O voltages; therefore, transmission errors in the memory controller-DIMM interface are on the rise [1, 26]. Most servers use CHIPKILL [4] based reliability schemes, which can tolerate only one or two symbol errors per beat. Multiple bit errors spread across the chip boundaries of a rank may not be detected by these schemes. Errors in the bus, along with growing device failures, increase the frequency of multi-bit errors in the data fetched from DRAM subsystems.

Numerous field studies such as [1,3] studied large scale data-centers and predict that future exascale systems may require stronger reliability schemes than CHIPKILL. These studies base their analysis on limited protection mechanisms/logging capabilities, and therefore the actual failure rates might be greater than their assessments.

We first describe our error model, which captures the effect of various types of faults that may occur in DRAM devices and the data-bus. This model complements recent efforts such as AIECC [13], which focus on faults in address, command and control signals. We then propose a new error handling mechanism - Single Symbol Correction Multiple Symbol Detection (SSCMSD). As single symbol errors per beat are more frequent [3], our mechanism uses ECC to correct them. In addition, we use a hash function to detect the less frequently occurring multi-bit (or multi-symbol) errors. A hash function will detect multi-symbol errors with a high probability. It is the judicious combination of the two, i.e., ECC and hash, that makes our scheme effective.

We believe that SSCMSD is a very effective reliability mechanism for HPC/data-centers. More frequently occurring single symbol errors are corrected to achieve low recovery time. On the other hand, relatively infrequent multi-symbol errors are detected by SSCMSD. Our scheme is also suitable for Selective Error Protection (SEP [31]), as we can enable/disable the enhanced detection capability provided by the hash for certain applications.

The rest of the paper is organized as follows. Section 2 introduces the DDRx subsystem. Section 3 describes prior work in the field of memory reliability. In Section 4, we perform preliminary experiments to understand the impact of multi-symbol errors on the Single Symbol Correcting Reed Solomon (SSC-RS) code and to validate our simulation framework. Our error model is described in Section 5. Section 6 details our SSCMSD scheme. In Section 7, we evaluate our scheme against other mechanisms. In Section 8, we summarize our work with a conclusion.

2 DDRx Memory Organization

A DDRx [6,30] based memory is organized into hierarchical groups to enable designers to trade bandwidth, power, cost and latency while designing memory subsystems. At the topmost level, the subsystem comprises one or more channels. Each channel is made up of one or more DDRx DIMMs, a shared Data-bus, an Address/Command bus, and Control and Clock signals. Each DIMM includes multiple DRAM chips which are grouped into multiple "ranks". Typically, each DIMM has one, two, four or eight ranks. Furthermore, each chip has multiple independent banks. Each bank is composed of multiple sub-arrays [21] and a global sense amplifier. Each sub-array is further organized into a matrix of rows and columns with a sense amplifier. Figure 1 shows the organization of a channel which is composed of two x4 (transfer width - 4 bits) based DIMMs. The data bus is organized into sixteen groups or "lanes"; each lane is shared by DRAM devices (or chips) across a channel. Address/command and control buses drive all the devices in the channel.

Fig. 1. Memory channel - Memory controller is connected to DRAM modules (DIMMs) through shared bus.

The memory controller (MC) handles memory requests from the processor/cache/IO devices. As shown in Figure 1, the MC communicates address, commands, and data to the DRAM ranks over the channels. Typically, a read/write cache miss requires 64 bytes of data to be transferred between the MC and the DRAM memory subsystem. In this paper, we refer to this 64-byte data (plus additional redundancy, if any) as a cache-line. It is communicated in eight "beats" (eight half bus cycles). For a DRAM subsystem composed of DDR4 x4 devices, each beat activates an entire rank [33] (sixteen devices) and the MC fetches/sends 64 bits of data per beat.
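The transfer sizes quoted above follow from simple arithmetic. The sketch below only restates the figures from the text (x4 devices, sixteen data devices per rank, eight beats); the function and constant names are illustrative and not part of the paper.

```python
# Figures taken from the text: x4 DDR4 devices, sixteen data devices per rank,
# eight beats per cache-line transfer. Names are illustrative only.
BITS_PER_DEVICE = 4
DATA_DEVICES_PER_RANK = 16
BEATS_PER_CACHELINE = 8


def bits_per_beat(devices, width=BITS_PER_DEVICE):
    """Number of bits the rank drives onto the data bus in one beat."""
    return devices * width


def bytes_per_transfer(devices, beats=BEATS_PER_CACHELINE, width=BITS_PER_DEVICE):
    """Total bytes moved over all beats of one cache-line transfer."""
    return bits_per_beat(devices, width) * beats // 8


print(bits_per_beat(DATA_DEVICES_PER_RANK))       # 64 bits per beat
print(bytes_per_transfer(DATA_DEVICES_PER_RANK))  # 64 bytes per cache-line
```

Adding redundant devices to the rank widens each beat by 4 bits per device, which is why the redundancy of a scheme can be read directly from its device count.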

3 Prior Work

This section summarizes schemes currently used by the industry and recent academic efforts to improve the reliability of the DRAM subsystem. SECDED [7] and CHIPKILL [4] mechanisms were developed to address DRAM device errors. JEDEC introduced four schemes in DDR4 [14] to partially address signal integrity errors. MEMGUARD [23], Bamboo-ECC [12] and AIECC [13] are recent academic efforts which are closely related to our work.

During the 1990s, memory modules in servers were protected using SECDED codes. These codes make use of redundant (or check) bits to correct single-bit errors or detect double-bit errors in a beat. For a typical beat size of 64 bits, the SECDED code [7] makes use of eight redundant bits. A SECDED design can correct a 1-bit error or detect 2-bit errors in 64 bits (per beat) with 12.5% redundancy and 8 additional bus lines per channel. In practice, it can also detect or mis-correct some multi-bit errors [12].

As the demand for larger, high-density memory modules increased in the server industry, there was a need to protect against a single device failure. IBM introduced the "CHIPKILL Correct" error model to tolerate the failure of a single DRAM device in a rank.

CHIPKILL implementations make use of Reed Solomon (RS) codes. RS codes use Galois "symbol" (set of bits) based arithmetic [8] and, like SECDED, use additional logic to generate codewords (sets of data and check symbols) from data symbols. The circuit complexity of an RS code increases with the symbol size. Therefore, RS codes with small symbol sizes, such as 4 bits and 8 bits, are more commonly used. There are two popular versions of chipkill:

1) SSCDSD (Single Symbol Correct, Double Symbol Detect) CHIPKILL: AMD's 2007 design [10] and Sun UltraSPARC [11] provide SSCDSD capability for x4 DRAM devices by using a 4-bit symbol RS code with four check symbols. To maintain redundancy at 12.5%, this design uses 32 data symbols (128 bits) and 4 check symbols (16 bits) per beat, with a 144-bit data bus and 36 devices per rank. The nature of the design is such that it "over fetches", i.e., two cache lines are accessed during a memory transaction (8 beats * 32 data devices/rank * 4 bits = 128 Bytes). Therefore, it may result in increased energy consumption.

2) SSC (Single Symbol Correction) CHIPKILL: To reduce cache access granularity, in 2013 AMD developed an SSC based 8-bit symbol RS code [5] for x4 DRAM devices. This scheme uses the 72-bit data bus and 18 devices per rank (64 data + 8 redundant bits per beat). In this design, bits from two successive beats are interleaved to form one codeword with "Chipkill" functionality [5, 40]. For 8 beats from x4 devices, each cache request makes use of four codewords. Each codeword comprises 16 data and two check symbols, with a redundancy of 12.5%. This design is used as our baseline for comparison.

When there is more than one symbol error per codeword (mostly due to multiple chip failures), AMD uses a history based hardware-software approach to cover these scenarios [5].
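To make the interleaving in the SSC design concrete, the following sketch pairs the 4-bit words that each chip drives in two successive beats into one 8-bit symbol, so that a whole-chip failure touches exactly one symbol in each of the four codewords. The nibble order inside a symbol and all names are assumptions made for illustration; the text only states that two successive beats are interleaved.

```python
BEATS, CHIPS = 8, 18  # 16 data chips + 2 check chips, x4 each (baseline)


def beats_to_codewords(beats):
    """beats[b][c] is the 4-bit word chip c drives in beat b.

    Returns 4 codewords of 18 eight-bit symbols. Symbol c of codeword w holds
    chip c's words from beats 2w and 2w+1 (the nibble order is an assumption).
    """
    codewords = []
    for w in range(BEATS // 2):
        first, second = beats[2 * w], beats[2 * w + 1]
        codewords.append([(first[c] << 4) | second[c] for c in range(CHIPS)])
    return codewords


def symbols_touched(clean, faulty):
    """Number of differing symbols in each codeword."""
    return [sum(a != b for a, b in zip(cw_c, cw_f))
            for cw_c, cw_f in zip(clean, faulty)]


# Arbitrary test pattern, then a hypothetical failure of chip 5 that inverts
# every 4-bit word it drives.
beats = [[(b * CHIPS + c) & 0xF for c in range(CHIPS)] for b in range(BEATS)]
broken = [[v ^ 0xF if c == 5 else v for c, v in enumerate(beat)] for beat in beats]

clean_cws = beats_to_codewords(beats)
print(symbols_touched(clean_cws, beats_to_codewords(broken)))  # [1, 1, 1, 1]
```

Because the damage from one chip stays within one symbol per codeword, a single-symbol-correcting code is enough to recover from it; faults in two chips or lanes produce two-symbol errors, which is the case the rest of the paper is concerned with.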

1) WRITECRC: WRITECRC is designed to detect transmission errors in data during a WRITE operation. In this design, the memory controller generates an 8-bit CRC checksum for the entire write data burst (8 beats) to each chip/data-lane [14] of the rank. These 8 bits are sent over two additional beats after the data is sent to the individual chips. Each DRAM chip also has logic to recompute the CRC checksum and compare it with the checksum sent by the controller. Such a design allows the chips to detect errors before storing the data and provides an option to retry the transmission. However, transmission errors during READs (not covered by WRITECRC) may also lead to SDCs with the baseline scheme.

2) CA (Command/Address) parity: CA parity uses an additional pin (or bus-line) to transfer the even parity of the CMD/ADD signals to every DRAM chip. It cannot detect an even number of bit errors on the CMD/ADD signals.

3) Data Bus Inversion (DBI): DBI is designed to protect against Simultaneous Switching Output (SSO) noise [20] during data transmission. This scheme is only available for x8 and x16 DDR4 chips. With 8 data bits/pins and an additional 9th pin per data-lane, DBI ensures that at least 5 out of 9 pins are "1"s. This avoids the situation where all bits go from 0 to 1 or from 1 to 0, which improves the signal integrity of the data bus.

4) Gear Down Mode: Gear-down mode allows the MC to lower the transmission rate of command/address and control signals, trading off latency and command bandwidth for signal quality while maintaining high data rates/bandwidth.
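The first three mechanisms can be illustrated with a few lines each. This is a minimal sketch under stated assumptions, not the JEDEC definition: the CRC polynomial (0x07), the zero initial value, the byte packing of the burst, and the polarity of the ninth DBI pin are all assumptions chosen for illustration.

```python
def crc8(data, poly=0x07):
    """Bitwise CRC-8, MSB first, zero initial value (parameters are assumptions)."""
    crc = 0
    for byte in data:
        crc ^= byte
        for _ in range(8):
            crc = ((crc << 1) ^ poly) & 0xFF if crc & 0x80 else (crc << 1) & 0xFF
    return crc


def even_parity(bits):
    """Parity bit that makes the total number of 1s even."""
    return bin(bits).count("1") & 1


def dbi_encode(byte):
    """Return (data_pins, dbi_pin) such that at least 5 of the 9 pins are 1.

    dbi_pin == 0 signals that the data pins carry the inverted byte.
    """
    zeros = 8 - bin(byte).count("1")
    if zeros > 4:
        return (~byte) & 0xFF, 0
    return byte, 1


def dbi_decode(data_pins, dbi_pin):
    return data_pins if dbi_pin else (~data_pins) & 0xFF


# WRITECRC: one x4 chip receives 8 beats * 4 bits = 4 bytes per burst.
burst = bytes([0x12, 0x34, 0x56, 0x78])
flipped = bytes([0x12, 0x34, 0x57, 0x78])           # one bit corrupted in flight
print(crc8(burst) != crc8(flipped))                  # True: the chip can ask for a retry

# CA parity: a double-bit error leaves the parity unchanged and goes unnoticed.
command = 0b1011001
print(even_parity(command) == even_parity(command ^ 0b11))  # True

# DBI: a mostly-zero byte is sent inverted.
pins, dbi = dbi_encode(0b00000001)
print(bin(pins), dbi, dbi_decode(pins, dbi))
```

The parity example shows the limitation noted in item 2: flipping two bits changes the number of 1s by an even amount, so the receiver sees a consistent parity bit.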

Memguard is a reliability scheme designed to detect multi-bit errors in DRAMs without using redundant storage. It makes use of two registers (READHASH, WRITEHASH) and custom logic at the memory controller (MC). Whenever there is a memory transaction between the last level cache and the DRAM, the logic at the MC computes a hash value for this transaction and the READHASH/WRITEHASH registers are updated. This scheme does not store the hash values in memory, as it uses an incremental multi-set hashing technique [35]. By periodically synchronizing the two hash registers at the MC, this scheme detects errors in data. Memguard relies on OS checkpointing for error recovery.

Although this scheme can detect multi-bit (or multi-symbol) errors, on its own it is not suitable for HPC/datacenters due to the high recovery time associated with checkpointing and synchronization. Also, Memguard is effective only against soft errors. Although our design is motivated by Memguard's scheme, we do not use the incremental multi-set hashing technique which is at the core of Memguard's design; instead, we store the hash along with the data and ECC bits in the DRAM (i.e., we use redundancy). Thus, unlike Memguard, we employ ECC and a hash to provide correction/detection capability for each cacheline and do not require any synchronization for error detection. This ensures faster recovery and effectiveness against both permanent and soft errors, and is therefore suitable for HPC/datacenters/servers.

QPC Bamboo-ECC is an 8-bit symbol RS based scheme designed to target more frequently occurring error patterns. It provides CHIPKILL capability with 12.5% redundancy for x4 based memory systems, and its authors show that it performs better than AMD's RS-SSC CHIPKILL in reducing SDCs for certain types of faults. Since it uses one codeword for the entire cache-line, the design leads to increased READ latency.

Our goal in this paper is to consider a more realistic error model based on the nature of faults and to develop an appropriate scheme to protect against them. We demonstrate that Bamboo-ECC and extended Bamboo-ECC (same overhead as ours) are prone to Silent Data Corruptions when faults are spread across multiple chips/buses.

AIECC is a suite of mechanisms designed to protect against CCCA (clock, control, command, address) faults without additional redundant storage or new signals to and from memory.

Our work is orthogonal to AIECC. We focus on improving reliability against device and bus errors, while AIECC focuses on CCCA errors. The reliability of future memory systems can be improved by incorporating our scheme along with AIECC.

4 Preliminary Experiments

In the presence of an error, a generic reliability scheme reports it as either a Correctable Error (CE) or a Detectable but Uncorrectable Error (DUE). When an error is outside the coverage of the scheme, it can result in a Detectable but Mis-corrected Error (DME) or an Undetectable Error (UE). DMEs and UEs are collectively called Silent Data Corruptions (SDCs), as the scheme inadvertently forwards corrupt data without raising an alarm.

The baseline scheme uses an RS (18, 16, 8) systematic SSC code. An RS (n, k, m) codeword has k data symbols and n-k check symbols with m bits per symbol. The minimum Hamming distance between any two codewords is n-k+1 (3 in this case). The code can correct $\lfloor (n-k)/2 \rfloor$ (1 for the baseline scheme) symbol errors. When there is an error across multiple symbols of a codeword, this RS decoder can either identify it as an uncorrectable error (DUE), "mis-correct" it to another codeword, treating it as a single symbol error of that codeword (DME), or fail to detect the presence of the error (UE). This can result in Silent Data Corruptions in the baseline scheme. We therefore devised a simple set of experiments to assess the amount of Silent Data Corruption in the baseline in the presence of multi-symbol errors.

We developed an in-house simulator to perform our experiments. We used open source software [16], [36], [41] to develop the Galois (symbol-based) arithmetic, RS encoder and decoder. We use the generator polynomial $G(x) = (x - a)(x - a^2) \cdots (x - a^N)$ (where N is the number of ECC symbols per codeword) to construct the RS code. Our decoder uses the Berlekamp-Massey algorithm for correcting/detecting errors. Due to the simplicity of hardware design, most hardware implementations of Reed Solomon decoders use either the Euclidean approach or Berlekamp-Massey. We also verified that the simulation results are similar with a Euclidean based RS decoder.

For each iteration of the experiment, we fed a random 16-byte dataword to the RS encoder and stored the 18-symbol codeword in an array. With the help of an 18-symbol error mask, we inserted errors into the stored codeword. We then decoded the stored codeword (with errors) using the RS decoder. The decoder flagged whether each codeword had "No Errors", "Detectable but Uncorrectable Errors" or "Correctable Errors". If the decoder detected a correctable error in a codeword, it corrected the corresponding stored codeword. Next, we retrieved the stored data word processed by the RS decoder and compared it with the original data word to identify silent data corruptions.

We executed three experiments, introducing random 2, 3 and 4 symbol errors per codeword. Each experiment was run for ten iterations, and each iteration handled 1 billion random datawords. Table 1 lists the mean percentage across the 10 iterations for miscorrections, detected but uncorrectable errors and undetected errors. The standard deviation for each of the experiments (except for undetected errors with random 2 symbol errors) was up to 13,000. For undetected errors, we also show the mean count alongside the mean percentage to give a sense of the actual number of undetected errors we encountered.

TABLE 1: Experiments/Results of random multi-symbol data errors for RS (18,16,8).

| Experiment | Miscorrected | Detected but Uncorrected | Undetected |
|---|---|---|---|
| 2 Symbol Errors/CW | 6.3% | 93.7% | 0% (0) |
| 3 Symbol Errors/CW | 6.9% | 93.1% | ~0% (~10,000) |
| 4 Symbol Errors/CW | 7.0% | 93% | ~0% (~10,000) |
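The experiment can be reproduced at a much smaller scale with a few dozen lines. The sketch below is not the authors' simulator: it assumes GF(2^8) with the primitive polynomial 0x11D, generator roots a and a^2, and a closed-form single-symbol decoder in place of Berlekamp-Massey (for two check symbols both locate the single error from the ratio of the two syndromes). All function names are illustrative.

```python
import random

# GF(2^8) antilog/log tables. The primitive polynomial 0x11D is an assumption.
EXP, LOG = [0] * 510, [0] * 256
value = 1
for power in range(255):
    EXP[power], LOG[value] = value, power
    value <<= 1
    if value & 0x100:
        value ^= 0x11D
for power in range(255, 510):
    EXP[power] = EXP[power - 255]

N, K = 18, 16  # RS(18, 16, 8): 16 data symbols, 2 check symbols


def gf_mul(a, b):
    return 0 if a == 0 or b == 0 else EXP[LOG[a] + LOG[b]]


def gf_div(a, b):  # b must be non-zero
    return 0 if a == 0 else EXP[(LOG[a] - LOG[b]) % 255]


def syndromes(word):
    """Evaluate word(x) at a and a^2, where word[p] is the coefficient of x^p."""
    s1 = s2 = 0
    for p, coeff in enumerate(word):
        s1 ^= gf_mul(coeff, EXP[p])
        s2 ^= gf_mul(coeff, EXP[2 * p])
    return s1, s2


def encode(data):
    """Systematic encoding: pick the two check symbols so both syndromes vanish."""
    word = [0, 0] + list(data)
    s1, s2 = syndromes(word)
    p1 = gf_div(s1 ^ s2, EXP[1] ^ EXP[2])
    word[0], word[1] = s1 ^ gf_mul(p1, EXP[1]), p1
    return word


def decode(word):
    """Single-symbol correction in place. Returns 'NE', 'CE' or 'DUE'."""
    s1, s2 = syndromes(word)
    if s1 == 0 and s2 == 0:
        return "NE"
    if s1 == 0 or s2 == 0:
        return "DUE"
    location = (LOG[s2] - LOG[s1]) % 255
    if location >= N:  # error locator points outside the 18-symbol codeword
        return "DUE"
    word[location] ^= gf_div(gf_mul(s1, s1), s2)
    return "CE"


def run_experiment(trials, symbol_errors, rng):
    counts = {"corrected": 0, "miscorrected": 0, "detected": 0, "undetected": 0}
    for _ in range(trials):
        clean = encode([rng.randrange(256) for _ in range(K)])
        received = clean[:]
        for pos in rng.sample(range(N), symbol_errors):
            received[pos] ^= rng.randrange(1, 256)  # non-zero error value
        status = decode(received)
        if status == "DUE":
            counts["detected"] += 1
        elif received == clean:
            counts["corrected"] += 1
        else:  # silent data corruption
            counts["miscorrected" if status == "CE" else "undetected"] += 1
    return {name: c / trials for name, c in counts.items()}


print(run_experiment(20000, 2, random.Random(1)))
```

With random 2-symbol errors the miscorrected fraction should come out in the neighbourhood of the 6.3% reported in Table 1, and the undetected fraction is exactly zero because no two codewords are closer than Hamming distance 3.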

We can explain these results of our experiments with the help of analytical methods. Figure 2 depicts the codespace of the baseline RS (18, 16, 8) code. In the figure, stars represent valid codewords and diamonds represent non-codewords. Due to errors, a particular codeword (say CW1) gets corrupted and may be detected by the RS decoder as a non-codeword (diamond) or as another codeword in the space [12]. The dotted hypersphere which is HD = 1 away from the codeword represents the correction range of the SSC. All the words on this sphere will be corrected to the codeword at the center of the sphere (in this case CW1). Words on the HD = 2 hypersphere (solid line in green) are either detected as errors or miscorrected to the adjacent codeword. Words on the dashed sphere (HD = 3) are either correctly detected as errors, undetected (falsely detected as an adjacent codeword) or miscorrected as another codeword.

Fig. 2. N-tuple space representation of Reed Solomon SSC code. Stars are codewords which are at least 3 Hamming Distance apart.

Fig. 3. CW A, CW B are two 18-symbol codewords which are HD = e apart. Notice that dataword "CW A + e-symbol error" is HD = 1 (differ in 1 symbol) from CW B.

For a generic RS (n, k, m) code, the total n-tuple space available is $2^{n*m}$. Out of this space, the number of codewords is $2^{k*m}$. Assuming that the space is uniformly distributed among the codewords, we can say that the space around (or owned by) each codeword is $2^{n*m} / 2^{k*m}$.

If we introduce "e" symbol errors from a given codeword (say CW1), all such words lie on a hypersphere at HD = e from the codeword. If "e" is greater than the minimum HD of "n-k+1", this sphere may also contain other codewords. For example, as shown in Figure 2, the RS code has two codewords (CW1 and CW2) which are HD = 3 apart. If we introduce 4 symbol errors from CW1, the hypersphere centered on CW1 with radius 4 also contains CW2. On average, the number of such codewords $C_e$ on or inside a hypersphere HD = e away is approximately given by dividing the total number of words inside the sphere by the number of words "owned" by each codeword:

$$C_e = \frac{\sum_{\alpha=1}^{e} \binom{n}{\alpha} (2^m - 1)^{\alpha}}{(2^m)^{n-k}} \tag{1}$$

The RS decoder "mis-corrects" such an "e" (where e > (n − k + 1)) symbol error from a given codeword when the e-symbol error also: 1) falls on the HD = 1 sphere of another CW which is HD = e away, OR 2) falls on the HD = 1 sphere of another CW which is HD = e-1 away, OR 3) falls on the HD = 1 sphere of another CW which is HD = e+1 away. For example, Figure 2 shows a hypersphere at HD = 4 away from CW1. This sphere represents all the 4-symbol errors from CW1. A few words on this sphere get mis-corrected to CW3, which is at HD = 4 away from CW1. Due to the presence of CW2 at HD = 3 away from CW1, a few other words on the HD = 4 sphere also fall on the HD = 1 sphere of CW2 and therefore get mis-corrected. Similarly, a few other words on this HD = 4 sphere also fall on the HD = 1 sphere of CW4, which is at HD = 5 away from CW1.

Using eq (1), the number of CWs at HD = e from a given codeword is given by $C_e - C_{e-1}$, which is equal to

$$\frac{\binom{n}{e} (2^m - 1)^{e}}{(2^m)^{n-k}} \tag{2}$$

Now, due to the presence of one CW at HD = e from a given CW (say CW A), more than one "e" symbol error is mis-corrected. Figure 3 shows two codewords CW A and CW B which are HD = e away. There will be exactly $\binom{e}{e-1} \cdot (2^m - 2)$ "e" symbol errors from CW A which are HD = 1 away from CW B and hence will be miscorrected to CW B. Combining this with eq (2), we get the expression for the total number of "e" symbol errors from a given codeword CW A that will be miscorrected due to the presence of codewords at HD = e from CW A:

$$m_e = \frac{\binom{n}{e} (2^m - 1)^{e} \cdot \binom{e}{e-1} \cdot (2^m - 2)}{(2^m)^{n-k}} \tag{3}$$

Similarly, we can calculate the number of "e" symbol errors that will be miscorrected due to the presence of codewords at HD = e-1 and HD = e+1, given by (4) and (5) respectively:

$$m_{e-1} = \frac{\binom{n}{e-1} (2^m - 1)^{e-1} \cdot \binom{n-e+1}{1} \cdot (2^m - 1)}{(2^m)^{n-k}} \tag{4}$$

$$m_{e+1} = \frac{\binom{n}{e+1} (2^m - 1)^{e+1} \cdot \binom{e+1}{e}}{(2^m)^{n-k}} \tag{5}$$

The total number of "e" symbol errors from a CW is given by $\binom{n}{e} \cdot (2^m - 1)^{e}$. Therefore, the fraction of miscorrections in the total set of "e" symbol errors from a CW is given by:

$$m_{total} = \frac{1}{\binom{n}{e} \cdot (2^m - 1)^{e}} \cdot (m_e + m_{e-1} + m_{e+1}) \tag{6}$$

Using (6), we calculate the fraction of mis-corrections for the experiments. For the first experiment (random 2 symbol errors), as the RS (18, 16, 8) code has a minimum HD of 3, there are no codewords at HD = 2 or HD = 1; therefore $m_e$ and $m_{e-1}$ do not contribute to the expression in eq (6). We calculate the fraction of miscorrections for this experiment to be 6.3%. This value corroborates the experimental results shown in Table 1. The total information space available for the single symbol correcting RS (18,16,8) is $2^{18*8}$ ($2^{n*m}$). Out of this, $2^{16*8}$ ($2^{k*m}$) are to be used as codewords. As the fraction of codewords over the total space is only $2^{-16}$ ($2^{16*8} / 2^{18*8}$), the code-space is sparsely populated, and 93.7% of random errors on the HD = 2 sphere do not fall on the HD = 1 spheres of other codewords. Also, as expected, we do not observe any undetected errors in this experiment, as there are no codewords at HD = 2.
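The miscorrection fraction of eq (6) is easy to evaluate numerically. The sketch below follows one reading of eqs (2) to (6): the number of codewords at distance h is taken from eq (2), and it is set to zero for h below the minimum distance n-k+1, as the text does for the 2-symbol case. The function name is illustrative.

```python
from math import comb


def miscorrection_fraction(n, k, m, e):
    """Fraction of e-symbol errors that land within HD = 1 of another codeword."""
    q = 2 ** m
    d_min = n - k + 1
    owned = q ** (n - k)  # words "owned" by each codeword

    def codewords_at(h):
        # eq (2); there are no codewords closer than the minimum distance
        return comb(n, h) * (q - 1) ** h / owned if h >= d_min else 0.0

    m_e = codewords_at(e) * comb(e, e - 1) * (q - 2)                # eq (3)
    m_prev = codewords_at(e - 1) * comb(n - e + 1, 1) * (q - 1)     # eq (4)
    m_next = codewords_at(e + 1) * comb(e + 1, e)                   # eq (5)
    total = comb(n, e) * (q - 1) ** e
    return (m_e + m_prev + m_next) / total                          # eq (6)


for errors in (2, 3, 4):
    print(errors, round(100 * miscorrection_fraction(18, 16, 8, errors), 2), "%")
```

For RS (18, 16, 8) the three values land between roughly 6% and 7%, within a few tenths of a percentage point of the measured miscorrection rates in Table 1.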
Similarly, we calculate the fraction of mis-corrections for the second and third experiments and find that these also corroborate the experimental results in Table 1.

As we are able to corroborate the experimental results with our analytical model, we have confidence that our experimental framework is able to accurately simulate the Reed Solomon decoder and random error injection. Also, these results further motivated us to develop a solution to tackle SDCs in current/future DRAM subsystems.

TABLE 2: Error Model

| Fault mode | Source | Error pattern per cacheline |
|---|---|---|
| 1 bit fault | Particle strikes OR cell failure in a sub-array | 1 bit error in a cacheline |
| 1 pin fault | Fault in 1 DQ of a bus lane OR sub-array failure | 1 pin stuck at 0 or 1 for all beats |
| Row/Chip/Bank fault | Failure of sub-array row drivers/address decoding circuit | 1 word related to faulty chip in all beats is stuck at 1 or 0 |
| Column fault | Failure of single column in a sub-array | 1 bit stuck at 0/1 in a cacheline |
| Bus fault | Fault in 1 bus lane | Errors in random beat positions of a bus |
| Correlated Bus fault | External noise or coupling between two consecutive bus-lanes | Bus faults in consecutive bus-lanes |
| 1 bit/pin + other faults | Combine 1 bit/pin fault with pin/row/chip/bus fault | 1 bit/pin faults which lead to 2 symbol errors |
| Chip + Chip fault | Failure of 2 different chips | 2 specific words in all beats stuck at 0 or 1 |
| 3 fault mode | Combine 3 of the above mentioned faults | Random errors in 3 words in all beats |

5 Error Model

To represent the possible fault modes that may occur in current/future DRAM systems, we first describe our error model. This model covers various types of faults that arise in DRAM devices and the data-bus.

Faults in DRAM subsystems are caused by a variety of sources such as cosmic rays [29], circuit failure, signal integrity issues, etc. These faults can be broadly categorized as transient or permanent. Transient phenomena corrupt memory locations temporarily; once rewritten, these locations are free from errors. Permanent faults cause the memory locations to consistently return erroneous values.

Field studies [1,3] help us understand the trends of errors in the DRAM subsystem up to a certain extent. We use this information along with the nature of faults in the DRAM subsystem to develop our error model (Table 2). Here, we describe the sources of these faults and the corresponding errors perceived per cache-line due to a particular fault type. Single bit faults are mainly due to failures in DRAM cells. Due to a failure in a sub-array or one DQ pin (one bus line in a bus-lane), bits transferred over a single DQ pin are corrupted. Failures in circuitry inside chips, such as sense amplifiers, address decoders, etc., cause particular rows/columns/banks/chips to malfunction. For example, if a local row buffer (sense amplifier) in a bank of a chip is stuck at 1, then all the bits fetched from the chip for a particular READ request are read as "1". Therefore, each word (bits provided by a chip in one beat) fetched from this chip will have all 1's for this particular READ.

Bus faults are another source of errors. According to first-order analysis, bus lines act as a low pass filter. Since digital signals are composed of numerous frequencies, distinct components of these signals experience attenuation to a different degree, giving rise to signal degradation. Reflection is another first-order effect which results in signal degradation. Table 3 describes other sources of transmission faults [26] and their impact on the signal integrity of the data bus. As most of the errors associated with bus faults are data-dependent or random, we expect random errors in different beats of a faulty bus. To simulate this behavior for a single bus fault, we use a random number to identify the erroneous beat positions among the eight beats. We then inject random errors in these positions. We also consider a correlated bus fault due to the presence of external noise or coupling between two bus lanes. In this fault mode, we expect two consecutive bus lanes to be faulty. Similar to the single bus fault, we first identify erroneous beat positions and then inject random errors for these two consecutive bus lanes.

TABLE 3: Summary of Data Transmission faults

| Transmission Fault | Description / Cause | Impact on Signal Integrity |
|---|---|---|
| Dielectric loss | Signals attenuate as a function of trace length and frequency | All data bits are affected, results in signal attenuation |
| Skin effect | Resistance of conductor varies non-uniformly with frequency | All data bits are affected, results in signal attenuation |
| Electromagnetic interference | Electromagnetic/capacitive coupling of closely packed lines | Few bus lines/lanes are affected at one point of time |
| Skew | Path length variations result in timing variations | Random |
| Jitter | Fluctuations in voltage, temperature and crosstalk between successive cycles of a given signal impact the propagation time of the signal | Difficult to model/characterize |
| Inter symbol interference | Past signals on a line have residual effects on subsequent signals of the same line | No. of bit lines affected: random, data dependent |
| Simultaneously switching output | When many signals in a bus-lane switch, they induce coupling on other signals | Data dependent |

We combine single bit faults (as they occur with higher frequency [28]) with other fault types to model 2-symbol/chip errors per codeword. With the increase in the possibilities of fault occurrence, especially in exascale systems, there is a higher possibility for other faults to occur simultaneously across three different chips/bus lanes. To cover such scenarios, we include a 3-fault mode (a fault which leads to random errors in three random chips/bus-lanes).
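The bus-fault injection procedure described in this section can be sketched as follows. The cache-line layout (a list of beats, each a list of 4-bit words, one per lane), the lane count of 18, the use of non-zero error values, and all names are assumptions made for illustration; the text only specifies that random beat positions are chosen and random errors are injected there.

```python
import random

BEATS, LANES, LANE_BITS = 8, 18, 4  # baseline rank: 18 x4 lanes (assumed layout)


def pick_faulty_beats(rng):
    """Randomly choose which of the eight beats are hit (at least one)."""
    return rng.sample(range(BEATS), rng.randint(1, BEATS))


def inject_bus_fault(cacheline, lane, rng):
    """Single bus fault: random errors in random beat positions of one lane."""
    faulty = [beat[:] for beat in cacheline]
    for beat in pick_faulty_beats(rng):
        # XOR with a non-zero value so the word is guaranteed to change
        faulty[beat][lane] ^= rng.randint(1, (1 << LANE_BITS) - 1)
    return faulty


def inject_correlated_bus_fault(cacheline, lane, rng):
    """Correlated bus fault: the same beat positions on two consecutive lanes."""
    faulty = [beat[:] for beat in cacheline]
    for beat in pick_faulty_beats(rng):
        for hit in (lane, lane + 1):
            faulty[beat][hit] ^= rng.randint(1, (1 << LANE_BITS) - 1)
    return faulty


def inject_chip_fault(cacheline, lane, stuck_at_one):
    """Chip fault: every word from one chip is stuck at all 0s or all 1s."""
    word = (1 << LANE_BITS) - 1 if stuck_at_one else 0
    return [[word if l == lane else v for l, v in enumerate(beat)]
            for beat in cacheline]


rng = random.Random(0)
clean = [[rng.randrange(16) for _ in range(LANES)] for _ in range(BEATS)]
faulty = inject_correlated_bus_fault(clean, 3, rng)
changed = sorted({l for b in range(BEATS) for l in range(LANES)
                  if clean[b][l] != faulty[b][l]})
print(changed)  # [3, 4]
```

Combined fault modes from Table 2 (for example chip + chip, or the 3-fault mode) can be modelled by applying several of these injectors to the same cache-line.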

6 A Novel Architectural Solution for Multi-bit/Multiple Symbol Errors

We first carried out the set of experiments detailed in our error model to study the behavior of the baseline (SSC-RS (18,16,8)) scheme. The results are shown in the 2nd column of Table 5 in the Evaluation section. As described earlier, because the code-space is sparsely populated, this scheme can detect many multi-symbol errors as well. However, as shown by the experimental results, the baseline is still prone to SDCs with multiple device and bus faults. Inspired by this observation, we chose to further decrease this SDC rate by improving the ECC scheme at the memory controller with a minimal increase in redundancy (1 more redundant chip and 1 more bus lane).

With one more chip and bus lane at our disposal, one can simply extend the baseline scheme. This extended baseline scheme uses three check symbols (instead of the two used in the baseline) per codeword to provide SSC capability. The fourth column of Table 5 shows the performance of this scheme with our error model. This scheme has a lower SDC rate than the baseline, but it is still prone to SDCs with multiple symbol errors.

An interesting point to note from these results is that the SDC rate depends on the type of error pattern a fault generates rather than on the number of bits/symbols being corrupted. For example, with the baseline scheme, a 1-bit + chip fault corrupts 9 bits per CW and has a 6% SDC rate, while a 1-bit fault + 1-pin fault corrupts 3 bits of a particular CW and has an SDC rate of 7.6%. Although we do not show the breakdown of SDCs into UEs and DMEs for the baseline in Table 5, our evaluation shows that for all the experiments with the baseline and extended-baseline schemes, SDCs occur mostly (99%) due to mis-corrections (DMEs) from the SSC-RS decoder. Therefore, the stored information is subject to errors from faults and to errors induced by the decoder. These observations inspired us to use a hash function, as the hash value allows us to identify such arbitrary corruption in the data. Taking a cue from the design of Memguard [23], we use a non-cryptographic hash function to compute a signature of the data. We use this signature to detect multi-bit errors with high probability. By combining hash and CHIPKILL, we develop our new error handling scheme, called Single Symbol Correct, Multiple Symbol Detect (SSCMSD) CHIPKILL.

As shown in Figure 4, during a WRITE operation, we can combine the hash and the baseline CHIPKILL scheme in three possible ways:

Scheme A: Compute the hash of the data and then use the SSC encoder to encode the data and the hash.

Scheme B: Encode the data and then compute the hash of the encoded data.

Scheme C : Encode the data and compute the hash of datain parallel.As shown in Table 5 the baseline and extended base-line provide CHIPKILL(SSC) correction capability, but withmultiple symbol errors, they result up-to 8% SDC rate. Thepurpose of using the hash is to further reduce this SDCrate without impacting the existing reliability provided bySSC code. Therefore, while retrieving the data from theDRAM (READ operation), we use a simple, straight forwarddesign to build upon the existing SCC capabilities. First,we perform the SSC decoding, in this process the decoderwill tag each retrieved codeword to have NO Error ORCorrectable Error OR Un-correctable Error. We then use thehash to validate the ﬁndings of the decoder.On analyzing Scheme B and Scheme C with this simpleretrieval mechanism we ﬁnd that there is a possibility of afalse positive i.e report data which was correctly handled bySSC decoder to be erroneous. This happens when the hashgets corrupted (erroneous). In this scenario, when there isa single symbol error or no error in the data/ECC symbolsof a codeword, the decoder corrects it or reports that theretrieved data is free from errors, respectively. But, as thehash is corrupted in this scenario, the second step of theretrieval process reports that the data is erroneous. WithScheme A there is no scope for such false positives as hash isalso correctable by SSC decoder. At the minimum, SchemeA guarantees to provide the reliability already offered bybaseline (SSC decoder). In addition, it also provides capabil-ity to detect miscorrections OR undetected errors missed bythe SSC decoder. Hence we identify Scheme A to be mostsuitable for our purpose.As described earlier, Scheme A generates the hash ofthe data before encoding it with the RS-SSC encoder. Thisencoded data, hash pair (codeword) is stored in the memoryduring WRITE. When this stored codeword is retrievedfrom the memory during READ, we ﬁrst employ the RS-SSC decoder to correct/detect errors. 
The RS-SSC decoder corrects up to one symbol error in each codeword to retrieve the (data, hash) pair. As noted earlier, there is a possibility of silent data corruption in the retrieved (data, hash) pair if there are multiple symbol errors in the codeword. To detect this scenario, we recompute the hash of the data retrieved from the RS-SSC decoder and compare it with the retrieved hash. If the hashes match, then with high probability we can conclude that there are no SDCs in the retrieved data. When the two hash values do not match, this indicates the presence of multiple symbol errors. Thus, we can effectively avoid silent data corruptions.

When there is up to one symbol error per codeword, this combined scheme (Scheme A, SSC decoding + hash validation) corrects the codeword (similar to the baseline scheme) and passes the requested 64-byte cache-line on to the processor. Hence, applications waiting for this cache-line can resume their execution almost immediately. But if there is a multi-symbol error in any of the codewords, our scheme detects it with high probability and prevents silent data corruption. This is an improvement over the baseline scheme.

Fig. 4. Possible hash and SSC combinations.

As shown in Figure 5, during a WRITE operation, we use a hash function to generate a 32-bit output (4 symbols) from the entire cacheline (64 bytes). Similar to the baseline SSC-RS scheme, the 64-byte data is divided into 4 blocks, each composed of 16 symbols. We distribute the 4-symbol hash output across the 4 data blocks by combining each 16-symbol data block with 1 hash symbol to obtain a dataword. The size of our "extended" dataword is 17 symbols, as opposed to the 16-symbol dataword used in the baseline design. Each dataword is encoded using the RS (19, 17, 8) code to obtain a 19-symbol codeword. This 19-symbol codeword is interleaved across 2 beats, as in the baseline design. Therefore, we need a total of three additional chips (storage overhead of 18.75%) per rank and 12 redundant bus-lines in every channel.

Fig. 5. SSCMSD Design. The 32-bit hash is split into 4 symbols (red). Each hash symbol is combined with a 16-symbol data block (blue) to form a dataword.

Similar to the baseline scheme, during a READ request (or MISS), two consecutive incoming data beats at the memory controller are combined to obtain a 19-symbol codeword. As shown in Figure 6, for DDRx systems, the codewords of this READ request are obtained in four consecutive bus cycles. We need to employ the SSC decoder on each codeword to obtain the 64-byte data and then validate this data with the help of the hash function. As this two-step approach introduces additional latency to the READ MISS, in the following paragraphs we describe our novel design to minimize this latency.

The SSC-Reed Solomon decoding of the received codewords is typically done in two phases. In the first phase, the syndrome is computed to identify whether there are any errors. Error correction (the second phase) is computationally more expensive and is therefore triggered only when the syndrome computation block detects errors. Since errors are relatively rare, the average delay incurred due to decoding will be close to that of the error-free case, where only the syndrome computation is performed. Study [34] mentions that the delay of SSC-RS syndrome calculation is about 0.48 ns with a 45 nm VLSI design library. For DDR4 [4] with a memory clock frequency of 1200 MHz (a cycle time of about 0.83 ns), syndrome computation can be implemented within one memory cycle.

The detection capability of our scheme depends on hash function properties such as length, collision resistance, avalanche behavior, distribution, etc. [24]. Also, a non-cryptographic hash is suitable for our design, as cryptographic hash functions are more complex in terms of computation time, which increases the memory latency. Studies [24,32] show that the non-cryptographic hashes CityHash, MurmurHash, Lookup3 and SpookyHash have good properties with respect to avalanche behavior, collision resistance and even distribution. CRC-Hash is also widely used due to its simple hardware design and its linear property. We analyzed the hardware designs of Lookup3, SpookyHash [25, 40] and CRC-Hash [37] and found that these can be implemented using combinational logic. Therefore these hash functions can be easily implemented within four memory cycles.

Since we are using a systematic SSC-Reed Solomon code, RS syndrome calculation and hash computation can be done in parallel. As shown in Figure 6, DDRx provides two beats of data per memory clock cycle, hence SSC-RS syndrome calculation (shown in the figure as RS) and hash computation can start at the second cycle. Both these operations can be completed in five memory cycles. Each codeword received at the memory controller for decoding has 16 data symbols, one hash symbol and two ECC symbols.
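
The dataword layout used on the WRITE path (16 data symbols plus one hash symbol per dataword, four datawords per cache-line) can be illustrated with a minimal sketch. The function names are hypothetical, `zlib.crc32` (the IEEE 802.3 polynomial) is only a stand-in for whichever 32-bit hash is chosen, and the pairing of hash symbol i with data block i is an assumption; the RS (19, 17, 8) encoding step is omitted.

```python
import zlib
from typing import List

BLOCKS = 4           # datawords (and codewords) per 64-byte cache-line
DATA_SYMBOLS = 16    # 8-bit data symbols per block


def build_datawords(cacheline: bytes) -> List[bytes]:
    """Split a 64-byte cache-line into four 17-symbol datawords."""
    if len(cacheline) != BLOCKS * DATA_SYMBOLS:
        raise ValueError("expected a 64-byte cache-line")
    # Stand-in 32-bit hash of the whole cache-line, split into 4 symbols.
    hash_symbols = zlib.crc32(cacheline).to_bytes(4, "big")
    datawords = []
    for i in range(BLOCKS):
        block = cacheline[i * DATA_SYMBOLS:(i + 1) * DATA_SYMBOLS]
        # Assumed pairing: hash symbol i is appended to data block i.
        datawords.append(block + hash_symbols[i:i + 1])
    return datawords


def storage_overhead(total_chips: int = 19, data_chips: int = 16) -> float:
    """Redundant chips relative to data chips (3 / 16 for SSCMSD)."""
    return (total_chips - data_chips) / data_chips
```

With 19 x4 chips per rank this gives 19 * 4 = 76 data bus-lines and a storage overhead of 3/16 = 18.75%, matching the figures quoted in the text.
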
We denote the 64 data symbols and four hash symbols obtained from all the codewords which are not yet decoded by the RS decoder as D’ and H’ respectively. We first compute the hash (H1) of the 64 data symbols (D’) and compare it with H’. The retrieved hash H’ can match the computed hash H1 when:

Case A1 : There is no error in H’, D’ OR

Case A2 : D’ != D (the original 64-byte data stored/written in the memory) due to some error and H’ = H (the original hash stored/written in the memory) but due to hash aliasing, H’ = H1 OR

Case A3 : H’ != H due to some error and D’ != D due to error, but H1 (function of D’) = H’.

The retrieved hash H’ does not match H1 when there is an error in the hash, in the 64-byte data, or in both.

In parallel, the RS decoder calculates the syndrome S_i for each codeword CW_i. S_i can be equal to 0 when:

Case B1 : There is no error in CW_i OR

Case B2 : There is an undetected error in CW_i.

Similarly, the syndrome is non-zero when there is an error in the codeword.

Based on the comparison of H1 and H’ and the four values of S_i for i = 1 to 4, we come up with a decision table (Table 4). In the scenario where both the hashes match and the syndrome is zero for all four codewords (Scenario 1), we declare the cache-line to be free from errors. Theoretically, there is scope for silent data corruption here, as this outcome could be caused by case A2 or A3 together with case B2 for all four codewords. From our preliminary experiments in Section 5, we can see that the probability of undetected errors is very low (0.001%) for each codeword. The probability reduces further when considering this scenario over all four codewords. Therefore, we declare this scenario to be free from errors. For Scenarios 2 and 4, where at least one of the syndromes S_i is not zero, we check whether we can correct the error with the help of SSC-RS and verify again with the hash. In Scenario 3, where the hashes do not match and all S_i are 0, we declare the cache-line to have an error that is undetectable by the SSC-RS decoder, caused by errors in the data or in both the data and the hash.

As the error-free scenario is more common than the erroneous scenarios, we design our READ operation in a way that minimises latency in the error-free scenario. Therefore, as shown in Figure 6, we check for Scenario 1 at the end of five cycles and declare the cache-line to be free of errors if Scenario 1 is found to be true. Otherwise, there are two possibilities: either at least one S_i != 0 (Scenarios 2 and 4) or Scenario 3. For Scenario 3, we declare the cacheline to have an uncorrectable error.
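
The first-step decision described above (the four scenarios of Table 4) can be written as a small function. This is only an illustrative sketch; the names `Decision` and `first_step_decision` are hypothetical and not taken from the paper.

```python
from enum import Enum
from typing import Sequence


class Decision(Enum):
    ERROR_FREE = "declare error free"                    # Scenario 1
    TRY_CORRECT = "try SSC-RS correction, re-check hash"  # Scenarios 2 and 4
    UNCORRECTABLE = "declare uncorrectable error"        # Scenario 3


def first_step_decision(hash_matches: bool, syndromes: Sequence[int]) -> Decision:
    """Combine the hash check (H1 == H') with the four syndromes S_1..S_4."""
    if all(s == 0 for s in syndromes):
        # No codeword reports an error: trust the data only if the hash agrees.
        return Decision.ERROR_FREE if hash_matches else Decision.UNCORRECTABLE
    # At least one non-zero syndrome: attempt correction regardless of the hash.
    return Decision.TRY_CORRECT
```

Only the `ERROR_FREE` path sits on the common-case latency path; the other two outcomes occur when an error has already been observed.
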
In the case of Scenarios 2 and 4, we use the RS-SSC correction logic on the codewords whose S_i != 0 to determine whether each such codeword has a "correctable error" (CE) or a "detectable-uncorrectable error" (DUE). If any one of them is an uncorrectable codeword, we declare the entire cache-line to be uncorrectable. Otherwise we correct all such codewords to obtain the corrected 64-byte data (D") and 32-bit hash (H"). In this case, there is scope for Silent Data Corruptions (SDCs); therefore, we compute the hash H2 from D" and compare H2 with H". If these hashes match, then with high probability there is no silent data corruption. If the hashes do not match, we conclude that an SDC occurred.

Thus, we are able to reduce SDCs with our novel approach. On average, the additional latency introduced per READ miss is expected to be one memory clock cycle. Note that this is similar to the latency of the baseline (SSC-RS) scheme.

Fig. 6. SSCMSD Design. During the READ operation, syndrome computation and hash calculation are done in parallel. If any of the syndromes are non-zero AND there are no uncorrectable errors, SSC correction is employed and the hash is computed to detect Silent Data Corruptions.

Scenario | Hash check | Syndrome calculation     | Decision
1        | H1 = H’    | S_i = 0 for i = 1 to 4   | Declare error free
2        | H1 = H’    | At least one of S_i != 0 | Error; try to correct it with SSC-RS and check back with hash
3        | H1 != H’   | S_i = 0 for i = 1 to 4   | Declare error
4        | H1 != H’   | At least one of S_i != 0 | Error; try to correct it with SSC-RS and check back with hash

TABLE 4. Possible scenarios after the initial step of hash computation and syndrome calculation.

There is scope for false negatives (reporting no error although the SSC decoder fails in the presence of multiple symbol errors) due to hash collisions. The probability of a false negative is estimated using the upper bound on the SDC rate of the baseline SSC decoder (8%) and the collision probability of an N-bit hash, which is estimated by the birthday paradox as $2^{-N/2}$. For our 32-bit hash, the upper bound on false negatives for our scheme is given by:

$$\text{Upper bound on } P(\text{false negative}) = 0.08 \times 2^{-16} \qquad (7)$$

Errors in the address bus during memory WRITEs lead to memory data corruption [13]. To prevent this scenario, JEDEC introduced CAP (Command Address Parity) [14] in DDR4. Another recent work, AIECC [13], proposed a stronger protection mechanism called eWRITECRC to address this concern.

With the weaker CAP, errors in the address bus during memory READs result in reading codewords from an incorrect address. As the baseline CHIPKILL scheme does not keep track of the address associated with the data, it decodes the codewords and inadvertently passes the data from the incorrect address location to the entity (I/O or processor) that initiated the READ request. This results in silent data corruption. To provide stronger protection for up to 32 address bits, eDECC was introduced in [13].

The 32-bit hash we used in the SSCMSD design can also be used to detect multi-bit errors in the address bus during READs. We can hash all the address bits (8 bytes) along with the data during the WRITE operation shown in Figure 5.
This hash (H), which is stored in the form of 4 symbols in the DDRx memory, protects against both data corruption and address corruption (during READs). During the READ operation, the memory controller generates the address, so it already has the correct address. The hashes H1 and H2 described in Section 6.2 therefore become a function of both the data (D’/D") and the correct address. When a transmission fault corrupts the address bits during a particular READ request (address A) [13], the memory controller in our design receives the hash and corresponding data stored at address A’ (the corrupted address). At location A’, the stored hash was computed over the data and the address A’, so the RS decoder will not be able to detect errors, but the hashes H’/H" will, with high probability, not be equal to H1/H2, and hence silent data corruption is prevented.
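
The address-protection idea can be sketched in a few lines: the hash is computed over the data concatenated with the 8-byte address at WRITE time, and recomputed at READ time with the address the controller intended to read. Everything below is an illustrative assumption rather than the paper's implementation: the toy `memory` dictionary, the function names, and the use of `zlib.crc32` (IEEE 802.3 polynomial) in place of the selected CRC-32 polynomial. The ECC layer is left out.

```python
import zlib
from typing import Dict, Tuple

# Toy DRAM model: address -> (data, stored hash). Hypothetical, for illustration.
memory: Dict[int, Tuple[bytes, int]] = {}


def hash_with_address(data: bytes, address: int) -> int:
    """32-bit hash over the cache-line followed by the 8-byte address."""
    return zlib.crc32(data + address.to_bytes(8, "big"))


def write(address: int, data: bytes) -> None:
    memory[address] = (data, hash_with_address(data, address))


def read(requested_address: int, bus_address: int) -> Tuple[bytes, bool]:
    """Read via bus_address (possibly corrupted), validate against requested_address."""
    data, stored_hash = memory[bus_address]
    # The controller knows the address it asked for, so it hashes with that one.
    ok = stored_hash == hash_with_address(data, requested_address)
    return data, ok
```

When `bus_address` equals `requested_address` the check passes; when a bus fault redirects the READ to another location, the recomputed hash uses a different address than the stored one and the mismatch flags the error.
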

As described earlier, we consider the non-cryptographic hash functions SpookyHash, Lookup3 (hashlittle [38]) and CRC-32 for use in SSCMSD. Jenkins [15] recommends the "short" version of SpookyHash for key sizes of less than 192 bytes. We use this "short" SpookyHash for our evaluations, as our key size with data and address is 72 bytes.

Minimum Hamming distance (HD) and parity are important parameters for deciding the generating polynomial for CRC-32. For a key size of 72 bytes, CRC-32 polynomials such as Castagnoli (1,31), koopman32k (1,3,28) and koopman32k (1,1,30) provide a minimum HD of 6 [39]. Therefore, errors of up to 5 (6-1) random bit flips are guaranteed to be detected by these polynomials. Errors which result in 6 or more bit flips are not guaranteed to be detected by these polynomials. Also, the above-mentioned HD=6 polynomials have even parity, hence they guarantee detection of all odd-bit errors. The IEEE 802.3 (32) polynomial provides a minimum HD of 5 for our key size and has odd parity.

Fig. 7. Simulation methodology.
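
The parity property mentioned for the CRC-32 generating polynomials can be checked directly: a polynomial over GF(2) with an even number of terms is divisible by (x + 1), so every codeword has even weight and any odd number of bit flips is detected. The sketch below uses the commonly published normal-form constants for CRC-32C (Castagnoli) and IEEE 802.3 with the x^32 term made explicit; the function name is hypothetical.

```python
# Full 33-bit generating polynomials (leading x^32 term included).
CRC32C_CASTAGNOLI = 0x11EDC6F41
CRC32_IEEE_802_3 = 0x104C11DB7


def has_even_parity(poly: int) -> bool:
    """True if the polynomial has an even number of terms, i.e. (x + 1) divides it."""
    return bin(poly).count("1") % 2 == 0


castagnoli_even = has_even_parity(CRC32C_CASTAGNOLI)
ieee_even = has_even_parity(CRC32_IEEE_802_3)
```

Here `castagnoli_even` evaluates to true and `ieee_even` to false, consistent with the even-parity and odd-parity classification given in the text.
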

We used the SpookyHash, Lookup3, CRC-32-Castagnoli (as a representative of the HD=6, even-parity polynomials) and CRC-32-IEEE 802.3 (as a representative of the odd-parity polynomials) hash functions in our simulations shown in Table 5 for the SSCMSD design. Across all the fault modes, we did not find any difference in the performance of these hash functions. Hence, we can employ one of the HD=6 CRC-32 polynomials (Castagnoli / koopman32k / koopman32k) for our SSCMSD design, as they are simple, provide minimum HD=6 codewords with even parity, and enable us to compute the hash in parts (due to the linear property) during the READ operation.

EVALUATION

We evaluate our scheme and compare it with the existing schemes and with their extensions that have the same overhead as our scheme. Baseline (RS-SSC(18,16,8)) and Bamboo-ECC (8 ECC symbols) use 18 chips per rank and provide CHIPKILL capability. SSCMSD uses 19 chips per rank (storage overhead of 18.75%). Therefore, we extend both the baseline and Bamboo-ECC with additional redundancy, using their own methodology of incorporating redundancy, to create equal-overhead conditions. RS-SSC(19,16,8) uses 3 ECC symbols; the correction capability of this code is still one symbol ($\lfloor (n-k)/2 \rfloor$). The 12-ECC-symbol version of Bamboo-ECC is capable of correcting up to 6 error symbols.

The goal of our experiments is to compare the number of silent data corruptions across all the schemes for our error model. We classify fault modes as causing up to one, two, or three symbols per codeword (CW) to be erroneous for the baseline, extended baseline and SSCMSD schemes (see Section 4). As Bamboo and extended Bamboo use vertically aligned codewords, our error model effectively translates to 2 to 12 erroneous symbols for them. For the rest of the discussion, we use the terminology of the error model relative to the baseline scheme.

The following mechanisms are used to introduce errors in the encoded cacheline stored in the DRAM subsystem:

1) Single bit fault: Flip a random bit in the cacheline.
2) Single pin fault: As two beats are interleaved in the baseline scheme to form one codeword, each 8-bit symbol is composed of four 2-bit pairs. As each chip has 4 data pins, each 2-bit pair of this symbol is transferred via one pin. We therefore choose a DQ pin randomly and flip the two corresponding consecutive bits of a symbol.
3) Single memory chip fault/failure: Choose a chip randomly and replace the data in the chip with a random pattern, with all 0s, or with all 1s.
4) Single bus fault: Choose a bus lane randomly and use an 8-bit random number to identify the erroneous beat positions among the eight beats.
As each bus-lane transfers eight beats, we then inject random errors in these positions. But, we ensure that at least one word of this faulty bus lane is corrupted.

1-bit, 1-pin, Row/Chip/Bank, Column and Bus faults cause errors within 1 chip or bus lane. A correlated bus fault affects two consecutive bus lanes. As discussed in the error model, we evaluate the following 2-chip/symbol fault modes: 1 bit fault + 1 bus fault, 1 bit fault + 1 row/bank/chip fault, 1 bit fault + 1 pin fault, 1 pin fault + 1 pin fault, and chip + chip fault. A 3-fault mode is also included in our evaluation.

As shown in Figure 7, for each run we generate 64 bytes of random data (representing a cache-line). The cache-line is then encoded with the specific scheme and the appropriate errors are injected as per the fault mode. The corrupted encoded cacheline is fed to the corresponding decoder logic. As described earlier, Baseline, Extended Baseline and SSCMSD use four codewords per cache-line, whereas Bamboo and Extended Bamboo use only one codeword. Accordingly, the Baseline and Extended Baseline decoder logic uses four RS decoders. Bamboo and Extended Bamboo employ only one RS decoder in their decoder logic. For SSCMSD, we use the decoder logic described in Section 6.2 (Read Operation). The decoder logic then determines whether this cache-line has Detectable Uncorrectable Error(s) (DUE), Detectable Correctable Error(s) (DCE) or no error(s). If the logic flagged any one of the codewords of the cache-line as a DUE, we do not suspect the decoder to be wrong, as our error model has multiple symbol errors (2, 3) beyond the SSC-RS correction range. In this case, the entire cacheline has to be a DUE, as this cacheline cannot be consumed, and we report the whole cacheline as correctly flagged (CF) by the decoder. For the remaining non-DUE cachelines, we compare the original (non-corrupted) 64-byte cacheline with the cumulative output of the decoder logic. If they do not match, we report it as a Silent Data Corruption (SDC).
Otherwise we report that the scheme (decoder) correctly flagged (CF) the cacheline.

We generate one billion runs for every iteration and execute each simulation (or experiment) for 10 iterations. Table 5 lists the mean % for these statistics across the 10 iterations. The standard deviation for each of the experiments (except for SSCMSD) was up to 10,000 (for 1 billion cachelines).

As 1-bit, 1-pin, Row/Chip/Bank, Column and Bus faults result in errors confined within 1 symbol, they are corrected by all the schemes. Faults which lead to 2 or 3 symbol errors in at least one of the codewords lead to SDC rates ranging from 0 to 7.6% in both Baseline and Extended Baseline. The 1-bit + 1-pin fault and 1-pin + 1-pin fault modes result in two symbol errors for Bamboo-ECC and Extended Bamboo-ECC, hence they are corrected by them. As Extended Bamboo-ECC can correct up to six symbol errors, it can provide 100% correction with the 1-bit fault + (row/chip/bank) fault and 1-bit + 1-bus fault modes. On the other hand, SSCMSD is able to avoid SDCs in all of the above fault modes. We also executed these simulations for 10 iterations, with 10 billion runs per iteration, for the SSCMSD scheme to understand the impact of hash aliasing. We found that there were up to 5 SDCs for each iteration across all the fault modes.

Comparison                      | Stats | Baseline RS(18,16,8) | Bamboo-ECC (8 ECC symbols) | Extended Baseline RS-SSC(19,16,8) | Extended Bamboo (12 ECC symbols) | SSCMSD RS(19,17,8) & 32-bit hash
Storage Overhead                |       | 12.5%                | 12.5%                      | 18.75%                            | 18.75%                           | 18.75%
ECC Symbols/CW                  |       | 2                    | 8                          | 3                                 | 12                               | 2
Up to 1-Chip/Bus Fault          | CF    | 100                  | 100                        | 100                               | 100                              | 100
Correlated Bus fault            | SDC   | 2                    | 11.4                       | 0.9                               | 12.3                             | 0
                                | CF    | 98                   | 88.6                       | 99.1                              | 87.7                             | 100
1 bit fault + 1 bus fault       | SDC   | 4                    | 11.1                       | 2                                 | 0                                | 0
                                | CF    | 96                   | 88.9                       | 98                                | 100                              | 100
1 bit fault + (row/bank/chip)   | SDC   | 6                    | 11                         | 3.2                               | 0                                | 0
                                | CF    | 94                   | 89                         | 96.8                              | 100                              | 100
1 bit fault + 1 pin fault       | SDC   | 7.6                  | 0                          | 3                                 | 0                                | 0
                                | CF    | 92.4                 | 100                        | 97                                | 100                              | 100
1 pin fault + 1 pin fault       | SDC   | 3.5                  | 0                          | 1.6                               | 0                                | 0
                                | CF    | 96.5                 | 100                        | 98.4                              | 100                              | 100
Chip fault + Chip fault         | SDC   | < 0.1                | 11                         | < 0.1                             | 12                               | 0
                                | CF    | > 99.9               | 89                         | > 99.9                            | 88                               | 100
3 fault mode                    | SDC   | < 0.1                | 11                         | < 0.1                             | 12                               | 0
                                | CF    | > 99.9               | 89                         | > 99.9                            | 88                               | 100

TABLE 5. Simulation results of SSC-RS, Bamboo-ECC and SSCMSD.
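
As a rough sanity check, the hash-aliasing observation (up to 5 SDCs per 10 billion runs) can be compared with the analytical upper bound on false negatives. The sketch below assumes the bound is the 8% baseline SDC rate multiplied by a birthday-paradox collision estimate of 2^(-N/2) for an N = 32 bit hash; the function and variable names are hypothetical.

```python
def false_negative_upper_bound(baseline_sdc_rate: float = 0.08,
                               hash_bits: int = 32) -> float:
    """Assumed bound: baseline SDC rate times the collision estimate 2^(-N/2)."""
    return baseline_sdc_rate * 2.0 ** (-hash_bits / 2)


def observed_sdc_rate(sdc_count: int, runs: int) -> float:
    """Fraction of runs that ended in a silent data corruption."""
    return sdc_count / runs


bound = false_negative_upper_bound()             # roughly 1.2e-6
observed = observed_sdc_rate(5, 10_000_000_000)  # 5e-10
```

Under these assumptions the observed rate sits well below the bound, which is what one would expect from an upper bound built on the worst-case baseline SDC rate.
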

We also executed the simulations described in Table 5 for SSCMSD with protection for address bits included. Instead of using a random 64-byte cache-line for each experiment (shown in Figure 7), we used 72 random bytes each time, to include 8 bytes of random address along with the cache-line. The results were identical to the ones shown in Table 5 for SSCMSD.

In addition, we executed simulations to verify the effectiveness of SSCMSD in the presence of address errors during READs. As noted in Section 6.3, SSCMSD can provide protection against address errors during READs. Our scheme prevents SDCs due to errors in address bits during READs, provided there was no address corruption during a prior WRITE operation. If data is written to an unintended location due to address corruption, there is no way to detect such errors unless the address is also stored along with the data in DRAM. CAP [14] and eWRITECRC [13] can take care of address corruption during WRITEs. So, in these simulations, for each run, we generated 72 random bytes (representing the cache-line data and the 8-byte address) and computed the hash (HA) of this (data, address) pair. Then, we used an 8-byte error mask to introduce random errors in the address bits. Next, we computed the hash (HB) of this (data, corrupted address) pair and compared HA and HB. If they differed, we declared that our scheme detected the errors (correctly flagged); otherwise we declared that there was silent data corruption. We executed this simulation for 10 iterations. Each iteration comprised 100 billion runs. Across these 10 iterations, the mean number of SDCs was 24.5 runs, with a standard deviation of 4.3. The remaining runs were correctly flagged (detected as errors) by our scheme.
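
The reported mean of 24.5 SDCs per 100 billion runs can be put in context with a simple estimate. Assuming (as a simplification, not a claim from the paper) that the hash of a corrupted address behaves like a uniformly random 32-bit value, the expected number of accidental matches per iteration is the number of runs divided by 2^32. The function name is hypothetical.

```python
def expected_aliases(runs: int, hash_bits: int = 32) -> float:
    """Expected accidental hash matches if mismatched inputs hash uniformly."""
    return runs / 2 ** hash_bits


expected = expected_aliases(100_000_000_000)  # about 23.3 per iteration
```

The estimate of about 23.3 is close to the observed mean of 24.5, and the gap is well within the reported standard deviation of 4.3.
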

CONCLUSION

We motivate the need for addressing multiple symbol errors in CHIPKILL based DRAM subsystems, given the trend of increasing failures in these systems. Based on the nature of these failures, we analyzed possible errors and then developed a new error-handling scheme called Single Symbol Correction, Multi Symbol Detection (SSCMSD).

We implemented SSCMSD using CRC-32 and a single symbol correcting Reed-Solomon (SSC-RS) code. By leveraging a systematic SSC-RS code and a simple CRC-32 hash, our novel design has a negligible impact on the READ latency. Our simulations compare the SSCMSD scheme with the baseline (SSC-RS) and Bamboo-ECC. The results clearly demonstrate that SSCMSD is effective in avoiding Silent Data Corruptions (SDCs) in the presence of multiple symbol errors.

ACKNOWLEDGMENTS

The research reported in this paper is partially supported by the NSF award 1618104 and the Philip and Virginia Sproul Professorship at Iowa State University. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the funding agencies.

REFERENCES

[1] J. Meza, Q. Wu, S. Kumar, and O. Mutlu, “Revisiting memory errors in large-scale production data centers: Analysis and modeling of new trends from the field,” in DSN, 2015, pp. 415–426.
[2] Reliability data sets. Los Alamos National Laboratory. [Online]. Available: http://institutes.lanl.gov/data/fdata/
[3] V. Sridharan, N. DeBardeleben, S. Blanchard, K. B. Ferreira, J. Stearley, J. Shalf, and S. Gurumurthi, “Memory errors in modern systems: The good, the bad, and the ugly,” in ASPLOS’15, Istanbul, Turkey, March 14-18, 2015, 2015, pp. 297–310.
[4] T. J. Dell, “A white paper on the benefits of chipkill-correct ecc for pc server main memory,” in IBM Microelectronics Division, 1997, pp. 1–23.
[5] “Bios and kernel developers guide (bkdg) for amd family 15h models 00h-0fh processors.” AMD Inc., 2013.
[6] B. Jacob, The Memory System: You Can’t Avoid It, You Can’t Ignore It, You Can’t Fake It. Morgan and Claypool Publishers, 2009.
[7] B. Jacob, S. Ng, and D. Wang, Memory Systems: Cache, DRAM, Disk. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc., 2007, ch. 30.
[8] W. A. Geisel, “Tutorial on reed-solomon error correction coding.” National Aeronautics and Space Administration, Lyndon B. Johnson Space Center, 1990.
[9] A. N. Udipi, N. Muralimanohar, R. Balsubramonian, A. Davis, and N. P. Jouppi, “Lot-ecc: Localized and tiered reliability mechanisms for commodity memory systems,” ser. ISCA ’12, 2012.
[10] “Bios and kernel developers guide for amd npt family 15h processors.” AMD Inc., 2007.
[11] “Opensparc t2 system-on-chip (soc) microarchitecture specification.” Sun Microsystems, 2008.
[12] J. Kim, M. Sullivan, and M. Erez, “Bamboo ECC: strong, safe, and flexible codes for reliable computer memory,” in HPCA, 2015, pp. 101–112.
[13] J. Kim, M. Sullivan, S. Lym, and M. Erez, “All-inclusive ECC: thorough end-to-end protection for reliable computer memory,” in ISCA, 2016, pp. 622–633.
[14] “Ddr4 sdram standard, jesd79-4, joint electron device engineering council, sep. 2012.”
[15] B. Jenkins. Spookyhash. [Online]. Available: https://burtleburtle.net/bob/hash/spooky.html
[16] H. Minsky. A c library for reed solomon code. [Online]. Available: http://rscode.sourceforge.net
[17] A. Kleen. A c library for spookyhash. [Online]. Available: https://github.com/andikleen/spooky-c
[18] B. Sklar. Reed-solomon codes. [Online]. Available: http://ptgmedia.pearsoncmg.com/images/art_sklar7_reed-solomon/elementLinks/art_sklar7_reed-solomon.pdf
[19] Y. Kim, R. Daly, J. Kim, C. Fallin, J. H. Lee, D. Lee, C. Wilkerson, K. Lai, and O. Mutlu, “Flipping bits in memory without accessing them: An experimental study of dram disturbance errors,” in ACM SIGARCH Computer Architecture News [...]
[20] [...] mentor.com/pcb/blog/post/simultaneously-switching-noise-an-overview-dff75b6d-6b41-4d47-a231-1aafb29c07ad?cmpid=9049
[21] Y. Kim, V. Seshadri, D. Lee, J. Liu, and O. Mutlu, “A case for exploiting subarray-level parallelism (salp) in dram,” ser. ISCA ’12, 2012.
[22] Intel® Xeon® [...] intel. [...] .pdf
[23] L. Chen and Z. Zhang, “Memguard: A low cost and energy efficient design to support and enhance memory system reliability,” ser. ISCA ’14, 2014.
[24] C. Estébanez, Y. Saez, G. Recio, and P. Isasi, “Performance of the most common non-cryptographic hash functions,” Software: Practice and Experience, vol. 44, no. 6, pp. 681–698, 2014.
[25] C. Nelson, K. R. Townsend, O. G. Attia, P. H. Jones, and J. Zambreno, “Ramps: A reconfigurable architecture for minimal perfect sequencing,” IEEE Transactions on Parallel and Distributed Systems, vol. 27, no. 10, pp. 3029–3043, 2016.
[26] B. Jacob, S. Ng, and D. Wang, Memory Systems: Cache, DRAM, Disk. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc., 2007, ch. 9.
[27] T. Siddiqua, A. Papathanasiou, A. Biswas, and S. Gurumurti, “Analysis of memory errors from large-scale field data collection,” in IEEE Workshop on Silicon Errors in Logic - System Effects (SELSE), 2013.
[28] V. Sridharan and D. Liberty, “A study of dram failures in the field,” in International Conference on High Performance Computing, Networking, Storage and Analysis (SC), 2012.
[29] R. Baumann, “Soft errors in advanced computer systems,” in IEEE Design and Test of Computers, 2005, pp. 258–266.
[30] J. B. Kotra, N. Shahidi, Z. A. Chisthi, and M. T. Kandemir, “Hardware-software co-design to mitigate dram refresh overheads,” in ASPLOS’15, 2017.
[31] M. Mehrara and T. Austin, “Exploiting selective placement for low-cost memory protection,” in ACM Transactions on Architecture and Code Optimization, 2008.
[32] G. Cheng and Y. Yan, “Evaluation and design of non-cryptographic hash functions for network data stream algorithms,” in [...], Aug 2017, pp. 239–244.
[33] X. Tang, M. Kandemir, P. Yedlapalli, and J. Kotra, “Improving bank-level parallelism for irregular applications,” ser. MICRO-49, 2016.
[34] S. Pontarelli, P. Reviriego, M. Ottavi, and J. A. Maestro, “Low delay single symbol error correction codes based on reed solomon codes,” IEEE Transactions on Computers, vol. 64, no. 5, pp. 1497–1501, May 2015.
[35] M. v. D. B. G. G. E. S. Dwaine Clarke, Srinivas Devadas, “Incremental multiset hash functions and their application to memory integrity checking,” in Advances in Cryptology - Asiacrypt 2003 Proceedings [...]
[36] [...] eccpage.com/rs.c
[37] Mytsko, Evgeniy, Malchukov, Andrey, Ryzova, Svetlana, and Kim, Valeriy, “A study of hardware implementations of the crc computation algorithms,” MATEC Web of Conferences, vol. 48, p. 04001, 2016. [Online]. Available: https://doi.org/10. [...]
[38] [...] net/bob/c/lookup3.c
[39] P. Koopman, “32-bit cyclic redundancy codes for internet applications,” in [...], 2002, pp. 459–472.
[40] R. Yeleswarapu and A. K. Somani, “SSCMSD - single-symbol correction multi-symbol detection for DRAM subsystem,” in [...], 2018, pp. 15–24.
[41] N. Eruchalu. Reed-solomon (rs) encoder/decoder + channel simulation using euclidean algorithm. [Online]. Available: https://github.com/nceruchalu/reed_solomon

Ravikiran Yeleswarapu is a Ph.D candidate at Iowa State University, Ames, Iowa. He worked at Qualcomm’s WLAN division from 2010-2014. He received the bachelor’s of engineering degree in electrical and electronics engineering and the MSc degree in physics from Birla Institute of Technology and Science-Pilani, India, in 2010. His research interests include computer system design and architecture, memory, reliability and new computing paradigms.