Addressing Resiliency of In-Memory Floating Point Computation
Sina Sayyah Ensan, Swaroop Ghosh, Seyedhamidreza Motaman, Derek Weast
School of Electrical Engineering and Computer Science, Pennsylvania State University, University Park, PA 16802 USA (sxs2541, szg212, sxm884, dqw5347)@psu.edu
Abstract—In-memory computing (IMC) can eliminate the data movement between processor and memory, which is a barrier to energy efficiency and performance in Von Neumann computing. Resistive RAM (RRAM) is one of the promising devices for IMC applications (e.g., integer and Floating Point (FP) operations and random logic implementation) due to its low power consumption, fast operation, and small footprint in a crossbar architecture. In this paper, we propose FAME, a pipelined FP arithmetic (adder/subtractor) using RRAM crossbar based IMC. A novel shift circuitry is proposed to lower the shift overhead during FP operations. Since most of the RRAMs used in our architecture are in the High Resistance State (HRS), we propose two approaches, namely Shift-At-The-Output (SATO) and Force To VDD (FTV) (and its dual, Force To Ground (FTG)), to mitigate Stuck-at-1 (SA1) failures. In both techniques, the fault-free RRAMs are exploited to perform the computation by using an extra clock cycle. Although performance degrades by 50%, SATO can handle 50% of the faults whereas FTV can handle 99% of the faults in the RRAM-based compute array at low power and area overhead. Simulation results show that the proposed single precision FP adder consumes pJ and pJ for NAND-NAND and NOR-NOR based implementations, respectively. The area overheads of SATO and FTV are 28.5% and 9.5%, respectively.
Index Terms—In-Memory Computing, Floating Point, RRAM, Crossbar, Resiliency.

I. INTRODUCTION
In the big data era, conventional CMOS-based Von Neumann architecture platforms are unable to meet real-time data processing requirements [1]. Memory and computing elements are decoupled from each other in the Von Neumann architecture [2], which requires frequent communication between memory and computing cores [3]. With transistor scaling, compute energy has scaled asymmetrically compared to data transport energy, and data movement in modern computing systems now dominates energy efficiency and performance [4]. In-Memory Computing (IMC) is one of the promising compute models to fully or partially eliminate the need to transport data between processors and memory. The main concept of IMC is to infuse compute capability into the memory cells [5]. IMC is achievable by using emerging Non-Volatile Memories (NVM), e.g., RRAM, Spin Transfer Torque (STT) RAM, and Phase Change Memory (PCM) [5], [6], [7], [8]. Near-memory processing [9] and logic-in-memory, which employ NVMs in the logic space [10], [11] to preserve states between powering sequences, have been proposed in the literature. However, they cannot solve the problem of separation between logic and memory. IMC modifies memory cells and/or peripheral circuits/access mechanisms to infuse compute capability into memory cells. IMC can solve specific tasks such as dot-products for recognition [8], search [12], and classification [6]. It also supports a wide range of logic and arithmetic operations [10], [13], [14], [15]. NVM-based IMC using STTRAM [16], RRAM [17], Ferroelectric FET (FeFET), and Phase-Change Memory [18] is becoming popular.

Due to immature fabrication technology, manufacturing yield is still a serious concern for NVMs such as the RRAM crossbar. Faults in RRAM crossbar arrays are categorized into hard and soft faults [1]. Previous studies have predominantly focused on soft faults [19], whereas few attempts have been made to recover crossbar arrays from hard faults.
The soft faults (e.g., read disturb) can be recovered by calibrating the resistance [19], [20]. However, hard faults are recovered through mapping algorithms (i.e., by assigning inputs of faulty RRAMs to redundant rows or columns) [1], [21], [22]. A stuck-at fault is defined as a situation where the RRAM is permanently stuck at the High Resistance State (HRS) or the Low Resistance State (LRS). It has been reported [23] that only 63% of HfO2-based RRAM devices in a 4Mb crossbar array are fault-free and about 10% of RRAM devices contain stuck-at faults. Retention failure, which is similar to resistive switching due to the generation or recovery of oxygen vacancies, is another type of hard fault in RRAMs. In the proposed IMC architecture, only 4% of the RRAMs are in LRS and the other 96% are in HRS. Therefore, in this paper we focus on the HRS retention failure and stuck-at-1 (i.e., stuck-at HRS) faults. If the yield of a single RRAM device is 99%, there is only a small probability for a column of a 64*32 array to be fault-free. The stuck-at failures and HRS-to-LRS switching [24] can be fixed by employing a few redundant rows/columns when the RRAM array is used as a memory. However, the whole array is needed for the IMC application. Consequently, computations will fail due to errors in the absence of fault tolerance schemes.

We have considered Floating Point (FP) operations to evaluate the proposed resilience techniques. This is motivated by the fact that emerging applications, e.g., mission-critical systems like autonomous cars, require a huge amount of data processing in real time at low power (to make timely decisions). Autonomous cars make complex decisions within tight deadlines using algorithms such as Kalman filters for data fusion, ray tracing for path planning, and edge detection and deep neural networks for classification. Most of these algorithms require FP vector operations involving transpose, inverse, and addition/multiplication. Therefore, the capability to perform these tasks quickly and accurately is of utmost importance to enable safe and energy-efficient autonomous systems. Conventionally, FP architectures are implemented as full-custom VLSI or in FPGA. Although fast and power efficient, these custom designs impose cost and complexity. In this paper, we propose FAME (Single Precision Floating Point Arithmetic using In-Memory Computing) implemented on crossbar RRAM.
We employ a modified version (Section ??) of the Dynamic Computing In Memory (DCIM) [7] based architecture as our baseline compute substrate for FAME. Additionally, two approaches, namely Shift-At-The-Output (SATO) and Force To VDD (GND) (FTV(G)), are proposed to enable in-memory computing in the presence of HRS-to-LRS retention failures. We focus on this failure mechanism for two reasons: (i) HRS-to-LRS switching is more common in RRAM [25]; (ii) the majority of the RRAMs (96%) are in HRS for both NAND-NAND and NOR-NOR arrays. A Carry Select Adder (CSA) based on the DCIM implementation is used for the demonstration. We add extra peripheral circuits to each array to implement the proposed techniques.

In particular, we make the following contributions in this paper:
1) Alternative low-overhead realization of DCIM for FP computation;
2) In-memory shift circuit embedded in the peripherals, e.g., the sense amplifier (SA);
3) Enabling a pipelined architecture using the latch embedded in the SA;
4) Fault mitigation approaches, namely SATO and FTV/FTG, for the DCIM architecture;
5) Process variation (PV) analysis of the RRAM array to check the integrity of SATO and FTV/FTG.

The rest of the paper is organized as follows. Section II introduces related work on IMC. Section ?? explains the proposed FAME circuit and architecture. Section III presents the simulation results of FAME and a comparison with other IMC logic implementations. Section IV explains the proposed approaches to overcome SA1 faults in the IMC architecture. Section V presents the simulation results of the proposed fault tolerance approaches. Section VI draws the conclusion.

II. RELATED WORK AND BACKGROUND
A. Memristor Aided Logic (MAGIC)
MAGIC [26] (shown in Fig. 1) is an IMC architecture in which the logic state of a gate is represented by the memristor (RRAM in this paper) resistance, where high (low) resistance is considered as logic '1' ('0'). The inputs to a MAGIC gate are the logic states stored in the input memristors and the output is the final state of the output memristor. MAGIC executes operations in two steps: 1) setting the output memristor to a known logic state (e.g., for a NOR operation the output is initialized to LRS); 2) applying a known voltage (V) to the input memristors, which causes current to flow through the input and output memristors. The output memristor's state changes if the current passing through it is higher than the set/reset current. MAGIC is capable of implementing Boolean functions such as NAND, NOR, AND, OR, and NOT.

Fig. 1: MAGIC NOR gate.

B. Dynamic Computing In Memory (DCIM)
DCIM [7] is an RRAM crossbar based architecture in which each memory cell is composed of an RRAM device connected in series with a selector diode (Fig. 2a). In-memory computation is accomplished by implementing the functions in the form of Sum-of-Products (SoP). Thus, both AND and OR operations are required to implement the logical functions. In DCIM, wordlines (WL) serve as the inputs and bitlines (BL) serve as the outputs of the arrays. Separate pre-programmed AND and OR arrays are dedicated to implement the desired function. For instance, in order to implement in1.in2', the bitcells connected to in1 and in2' are programmed to LRS while the bitcells connected to in1' and in2 are programmed to HRS (Fig. 2a). All bitcells which are not part of the AND gate inputs are programmed to HRS (e.g., the bitcells connected to inputs in_n and in_n').

Fig. 2 shows the implementation of the XOR function using DCIM. Initially, the Pre signal is activated to pre-charge the BLs of the AND array. Next, the inputs (in1 and in2) are applied by asserting EN_AND. As shown in Fig. 2b, both BL0 and BL1 drop below the reference voltage (V_Ref-AND) when in1 = in2 = 1. As a result, the SA outputs, which determine the results of the in1.in2' and in1'.in2 functions, are pulled down to '0' at the edge of SE_AND. Next, the AND array SA outputs are provided as inputs to the OR array. Since the inputs of the OR array are '0', the BL (BL_OR) remains discharged, which results in in1 xor in2 = 0. If in1 = 0, in2 = 1 (in1.in2' = 0 and in1'.in2 = 1), BL0 discharges while BL1 remains pre-charged. Therefore, BL_OR starts charging at the edge of EN_OR. Finally, the voltage of BL_OR is compared against V_Ref-OR at the edge of SE_OR, which produces '1' at the output of the SA.

C. FP Addition/Subtraction
In the IEEE 754 standard, a single precision FP number is represented by 1 sign bit, 8 exponent bits, and 23 fraction bits. A negative (positive) number is represented with a sign bit equal to '1' ('0'). In order to represent negative exponents, IEEE 754 uses a bias of 127 for single precision (e.g., an exponent of -1 is represented by -1+127=126). The general representation of an FP number is given by:

  Value = (-1)^Sign * (1 + Fraction) * 2^(Exponent - Bias)   (1)

The flowchart for FP addition/subtraction as per the IEEE 754 standard is shown in Fig. 3.

Fig. 2: (a) XOR implementation using DCIM architecture in RRAM crossbar array; and, (b) timing diagram of the logical XOR operation.

Fig. 3: IEEE 754 standard FP addition/subtraction flowchart [27]: subtract the exponents to find their difference; shift the smaller number's fraction to the right; add the fractions; normalize the sum (shift right or left as needed); check for overflow/underflow exceptions; round the fraction to the appropriate number of bits; repeat normalization if the result is not yet normalized.

III. FAME SIMULATION RESULTS
The simulations are carried out in 65nm PTM [28] technology by employing the ASU RRAM model [29] and a bi-directional selector diode model [30]. Worst-case Sense Margin (SM), BL-delay, average delay, average power, and energy consumption (Table III) are calculated to evaluate the FAME architecture. Key device parameters for the simulations are listed in Table I. SM is obtained by performing 1000-point Monte Carlo simulations at various temperatures with the parameters listed in Table II to mimic process variations. The worst-case SM is obtained under process variation at 25 C for the worst-case compute array (i.e., the fraction addition array). The BL-delay is the time when a 100 mV SM is achieved. The proposed FP adder/subtractor implementations with both NAND-NAND and NOR-NOR architectures are compared against MAGIC and an ASIC design.

TABLE I: Simulation parameters
Parameter                                          | Value
MOSFET Gate Length                                 | 65 nm
NMOS/PMOS Threshold Voltage                        | 423/-365 mV
BL Capacitance                                     | 30 fF
RRAM Gap Min/Max/Oxide Thickness                   | 0.1/1.7/5 nm
Atomic Energy for Vacancy Generation/Recombination | -

TABLE II: Monte Carlo simulation parameters
Parameter          | Real Value | Variation | Std. Deviation
RRAM LRS Gap       | 0.1 nm     | 7%        | sigma
RRAM HRS Gap       | 1.7 nm     | 7%        | sigma
MOS Oxide Thickness| 1.2 nm     | 10%       | sigma
MOS Gate Length    | 65 nm      | 10%       | sigma

The write latency is obtained by performing 1000-point MC simulations. The worst-case write latency for low-to-high and high-to-low switching under process variation is 20 ns. FAME achieves 828X, 3.2X, and 3.7X improvement in latency, power, and energy, respectively, compared to MAGIC. The higher energy associated with MAGIC is attributed to the need to write into the RRAMs when an operation is done. Furthermore, compared to the power, energy consumption, and delay imposed by transferring data between main memory and processing units (e.g., CPU, GPU, and FPGA), FAME reduces power consumption, energy consumption, and delay.

TABLE III: Simulation results
Characteristics      | NAND    | NOR     | MAGIC   | CPU [32]
BL Delay (ns)        | 1.42    | 1.23    | N/A     | N/A
SA Sense Delay (ps)  | 24.52   | 69.1    | N/A     | N/A
Average Delay        | 25 ns   | 23 ns   | 20 us   | 84 ns
Exp. Subt. Pow. (uW) | 443.31  | 448     | 2808.92 | N/A
Fr. Add. Pow. (uW)   | 1068.52 | 1123.19 | 2142.84 | N/A
Shift Pow. (uW)      | 443.24  | 452.93  | 982.31  | N/A
Avg. Power (mW)      | 0.7     | 0.71    | 2.3     | 61
Energy (nJ)          | 0.33    | 0.32    | 1.2     | 5.1

1000-point MC simulations are performed at three temperatures at the nominal supply voltage to obtain the mean SM (Table IV). The V_NAND0 (NAND array BL voltage when the input is '0'), V_NAND1, V_NOR0, and V_NOR1 distributions at the worst-case temperature are shown in Fig. 4. In order to obtain the read access pass yield (RAPY) [31], [7], we have performed SA offset voltage analysis. The SA offset voltage can be modeled by a Gaussian distribution with sigma = 16 mV and mu = 8 mV. To obtain RAPY we assume that V_Ref is produced by a voltage regulator with negligible variation. We assigned V_Ref so as to maximize RAPY. Based on the Monte Carlo simulation, the RAPY of the NAND and NOR operations is found to be .sigma and .sigma, respectively.

TABLE IV: SM (mV) at different temperatures for NAND and NOR arrays.

IV. RESILIENCE TO STUCK-AT FAULT
In this section, we describe SATO and FTV, two fault mitigation techniques proposed for the DCIM architecture. In the following we use (i) faulty BL to denote a BL with an undesired stuck-at-1 (SA1) RRAM; and (ii) faulty WL to denote a WL with an undesired SA1 RRAM.

TABLE V: FAME area (Block, Array Size).

Fig. 4: SM distribution (mu = 767, sigma = 1.33; mu = 932, sigma = 3.69; mu = 196, sigma = 3.57; mu = 361, sigma = 2.10).

Computations are performed in two cycles when the proposed fault mitigation techniques are applied (Fig. 5). The computations of the fault-free BLs are performed in the first cycle and the computations of the faulty BLs are performed in the second cycle (Fig. 5(a)). In FTV, the WLs corresponding to faulty RRAMs in a NAND (AND) array are forced to VDD to mask the faulty bits. In the dual Force-to-Ground (FTG) technique, the faulty BLs are forced to 0V for NOR (OR) arrays. FTV/FTG tolerates 99% of stuck-at faults (SAF) while reducing the power consumption of the array. In the SATO approach, the operations of the fault-free BLs are executed in the first cycle and then the outputs are shifted in the SAs. Then, the operations of the faulty BLs are computed using fault-free BLs (the operation of a faulty BL is carried out on a fault-free BL). SATO covers 50% of SAFs without affecting power consumption. The high-level timings of FTV and SATO are illustrated in Fig. 5(b) and (c), respectively.

A. Shifting-At-The-Output (SATO)
As described before, in this technique the normal operation of the fault-free BLs is performed in the first cycle and the computation of the faulty BLs is performed in the second cycle. SATO does not use faulty BLs to perform an operation; it executes all operations on the fault-free BLs. SATO shifts the data stored in the SA latches of the fault-free BLs to prevent overwriting. When the computation of the first cycle is completed, the data are shifted in the SAs (three shifts are needed if an adder/subtractor is implemented). As shown in Fig. 6, the inputs on the WLs are shifted as well, so the computation is performed using fault-free BLs. The peripherals of SATO incur 28.5% area overhead.
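The two-cycle remapping described above can be sketched behaviourally (this is an illustrative functional model, not a circuit-level one; the names `nand_bl` and `sato_compute` are ours, and each BL is abstracted as the NAND of the inputs wired to it via LRS cells):

```python
# Behavioural sketch of SATO's two-cycle operation (our abstraction).

def nand_bl(inputs, lrs_rows):
    """A NAND-array bitline: NAND of the inputs connected via LRS cells."""
    return 0 if all(inputs[r] for r in lrs_rows) else 1

def sato_compute(inputs, columns, faulty):
    """columns: LRS row-set per BL; faulty: indices of BLs with SA1 cells."""
    latches = {}
    # Cycle 1: normal operation on the fault-free BLs; results are latched
    # in the SAs and shifted so they are not overwritten in cycle 2.
    for bl, rows in enumerate(columns):
        if bl not in faulty:
            latches[bl] = nand_bl(inputs, rows)
    # Cycle 2: a faulty BL's function is re-evaluated on a fault-free
    # neighbour (the WL inputs are steered there by multiplexers);
    # functionally the result equals the faulty BL's intended output.
    for bl in faulty:
        latches[bl] = nand_bl(inputs, columns[bl])
    return [latches[bl] for bl in range(len(columns))]
```

In hardware, the cycle-2 evaluation happens on the physically adjacent fault-free BL, with the shift circuitry in the SAs aligning the latched results.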
1) Non-fixable Faults:
SATO cannot handle faults that appear on two consecutive sets of BLs (each group of three consecutive BLs forms a set if an adder/subtractor is implemented). More multiplexers are needed per WL to handle faults on consecutive sets of BLs; the number of multiplexers per WL increases linearly with the number of consecutive faulty sets of BLs to be handled by SATO. For example, if faults occur on two consecutive sets of BLs (e.g., if a neighbouring BL in Fig. 6 also contains a fault), SATO cannot handle them unless two or more multiplexers are dedicated to each WL. The probability of two faults occurring on two consecutive BLs is less than 3% for a 64*32 crossbar array. However, SATO is able to handle less than 50% of the faults if a yield of 99.5% is assumed for the crossbar array.
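The single-multiplexer-per-WL restriction above amounts to a simple rule over BL-set indices; a minimal sketch (our formulation, with `set_width=3` for the adder/subtractor case):

```python
# Sketch of SATO's fixability rule with one multiplexer per WL:
# a fault pattern is fixable only if no two consecutive BL sets
# both contain a fault.

def sato_fixable(faulty_bls, set_width=3):
    """True if no two consecutive BL sets are both faulty."""
    faulty_sets = sorted({bl // set_width for bl in faulty_bls})
    return all(b - a > 1 for a, b in zip(faulty_sets, faulty_sets[1:]))
```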
2) Handling multiple faults:
SATO's efficiency degrades as the number of faults increases. In this paper, we considered a yield of 99.5% in a 64*32 crossbar array for the SATO simulations. This corresponds to 11 randomly distributed faults throughout the array. SATO is able to mitigate around 50% of the faults in the array. The faults were distributed over the memory cells using the rand function provided by the C++ programming language.
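The fault-injection experiment above can be sketched as follows (an illustrative re-implementation in Python rather than C++; `mitigated_fraction` is our own per-fault approximation of SATO's consecutive-set restriction, not the paper's exact criterion):

```python
# Monte-Carlo-style sketch: scatter random SA1 faults over a 64*32 array
# and estimate what fraction of them SATO can mitigate.
import random

def mitigated_fraction(fault_bls, set_width=3):
    """Fraction of faults whose BL set has no faulty neighbouring set."""
    sets = [bl // set_width for bl in fault_bls]
    if not sets:
        return 1.0
    occupied = set(sets)
    good = sum(1 for s in sets
               if (s - 1) not in occupied and (s + 1) not in occupied)
    return good / len(sets)

def sato_coverage(trials=1000, rows=64, cols=32, n_faults=11, seed=0):
    """Average mitigated fraction over random SA1 fault placements."""
    rng = random.Random(seed)
    cells = [(r, c) for r in range(rows) for c in range(cols)]
    total = 0.0
    for _ in range(trials):
        faults = rng.sample(cells, n_faults)
        total += mitigated_fraction([c for _, c in faults])
    return total / trials
```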
Fig. 7: FTV: (a) fault mitigation for undesired LRS RRAMs; (b) timing diagram.

Fig. 5: (a) 4*4 RRAM crossbar array; (b) FTV and SATO timing.

Fig. 6: SATO fault mitigation technique.
B. Forcing to VDD (FTV)

FTV performs the operations of the fault-free and the faulty BLs in the first and second cycle, respectively. The inputs of faulty RRAMs are forced to VDD in the second cycle. To apply FTV to NAND arrays, we follow simple NAND logic where, for example, A.B.C is replaced with A.B.1, where C is the input of the SA1 RRAM. Therefore, the NAND is performed with an extra '1' input, which does not affect the logic. However, an increased number of RRAMs in a BL reduces the SM. If the faults are located on different BLs, they do not affect the SM.

FTV uses a multiplexer for the enable signal of the SAs to ensure that the array is capable of working in two cycles. F'.CS' and F.CS are the inputs of the multiplexer, where F is '0' if the BL is fault-free and '1' if the BL is faulty, and CS is the clock sequence, initialized to '0' in the first cycle and set to '1' in the second cycle. The enable signals of SAs connected to fault-free BLs are asserted in the first cycle while the enable signals of SAs connected to faulty BLs are asserted in the second cycle, to save power and maintain the correct logic. Furthermore, FTV uses 4 additional transistors compared to DCIM at the WL input to enable the test procedure (explained in Section IV-C) that finds faulty RRAMs.

As shown in Fig. 7a, FTV uses two transmission gates to connect each input and its complement to the WLs. Also, one PMOS transistor is added to each WL to force the WL to VDD when needed in the second cycle. In the second cycle, the SC signal of faulty WLs is activated to force the faulty WLs to VDD. FTV employs a fault signal (F_W) for each WL to track the faulty WLs and set them to VDD in the second cycle. FTV also defines a fault signal (F_B) for each BL to keep track of faulty BLs. The additional circuitry and peripherals needed to apply FTV to DCIM increase the area by 9.5%.
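The masking idea and the SA-enable multiplexing can be sketched behaviourally (our naming; a functional model, not the circuit): forcing a stuck-at-1 cell's wordline to VDD replaces its input by logic '1', which is the identity element for AND/NAND, and the SA enable is the mux function F'.CS' + F.CS.

```python
# Behavioural sketch of FTV for a NAND (AND) array.

def nand_row(inputs, forced_high=frozenset()):
    """NAND of inputs; indices in forced_high are masked to '1' (FTV)."""
    vals = [1 if i in forced_high else v for i, v in enumerate(inputs)]
    return 0 if all(vals) else 1

def sa_enable(F, CS):
    """SA fires in cycle 0 for fault-free BLs (F=0), cycle 1 for faulty BLs."""
    return (not F and not CS) or (F and CS)
```

For example, with the third operand masked, `nand_row([1, 1, 0], forced_high={2})` evaluates NAND(1, 1, 1) = 0, i.e., the same result as the two-input NAND(1, 1) that the faulty cell was supposed to leave intact.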
1) Forcing-to-Ground (FTG):
The basic concept of FTG is similar to FTV but it applies to NOR (OR) arrays. FTG follows the simple logic that the number of '0's does not matter in a NOR (OR) operation. Therefore, FTG forces the inputs of faulty RRAMs to ground. The peripherals and the rest of FTG's operation are the same as in FTV.

Fig. 8: Faults that cannot be handled by FTV.

TABLE VI: Comparison between SATO and FTV
Characteristics | SATO         | FTV
Coverage        | 50%          | 90%
SM (w/ diode)   | Not affected | Not affected
SM (w/o diode)  | Not affected | Lower
Test circuitry  | Needed       | Included
Area overhead   | 28.5%        | 9.5%
Power           | Not affected | Lower
Energy          | Higher       | Slightly lower
Performance     | 50%          | 50%
2) Non-fixable Faults:
Although FTV can fix most of the faults in a crossbar array, it is unable to handle some rare situations, for example, when there are two faulty BLs and the faulty RRAM on one of the BLs drives an operand of the other BL. As shown in Fig. 8, BL_n and BL_m are faulty and their operations must be done in the second cycle. In this case, the logic of BL_m is lost if FTV forces the input (in_m) of the faulty RRAM (RRAM1) on BL_n to VDD, since that input also feeds BL_m. The NAND operation for BL_m is then incorrectly performed between in_n and '1' instead of between in_n and in_m. The probability of such a fault occurring for fabrication yields of more than 99% is less than 1%. We randomly distributed the faults 100 times using the C++ rand function in order to estimate the percentage of such faults occurring in an array.
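This corner case amounts to a dependency check between faulty BLs; a minimal sketch (our formalisation, with hypothetical dictionaries mapping each faulty BL to its faulty wordlines and each BL to its operand wordlines):

```python
# Sketch of the FTV non-fixable condition: a wordline forced to VDD for
# one faulty BL must not be a live operand of another faulty BL, because
# both faulty BLs evaluate together in the second cycle.

def ftv_fixable(faults, operands):
    """faults: BL -> set of its SA1 wordlines; operands: BL -> operand WLs."""
    for bl, bad_wls in faults.items():
        for other in faults:
            if other != bl and operands[other] & bad_wls:
                return False  # forcing bad_wls to VDD corrupts 'other'
    return True
```

Fault-free BLs are unaffected because they evaluate in the first cycle, when no wordline is forced.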
3) Handling multiple faults:
As long as the faulty RRAMs in the crossbar array are independent of each other, FTV can handle an arbitrary number of faults. For our simulations we inserted 30 faults in a 64*32 crossbar array, and FTV was able to resolve more than 99% of the fault distributions over the array.
C. Finding Faults using FTV Peripherals
It is required to find the faulty RRAMs in order to set the fault signals of the BLs and WLs. Faults can be found using the peripherals that are included in FTV. However, the BLs must be tested one at a time. To find the faults in a NAND array, the input of each RRAM that is set to LRS in a BL is forced to VDD and the rest are forced to '0'. The output of the SA indicates whether a BL is faulty ('1') or fault-free ('0'). If the BL is faulty, we need to find out which RRAM on that BL is faulty. A divide-and-conquer approach cannot be used in this architecture since there might be more than one faulty RRAM per BL, so we use a brute-force algorithm to find the faulty RRAMs. The input of the RRAM-under-test is set to '0' while the inputs of all other RRAMs are set to VDD. If the SA output is '0', the RRAM-under-test is deemed faulty and its flag is set to '1'. All faulty RRAMs can be found by repeating this operation for each RRAM in each BL sequentially.

Fig. 9: Process variation analysis of SM for various numbers of failures on a single BL with a selector diode in the bitcell (i.e., selector diode-RRAM crossbar) at (a) low temperature; (b) high temperature.

Fig. 10: Process variation analysis of SM for various numbers of failures on a single BL while the bitcell consists of an RRAM only, at (a) low temperature; (b) high temperature.

D. Usage and Limitations of SATO/FTV/FTG
SATO/FTV/FTG should be enabled only when a fault has been detected in the test process. Therefore, a fault-free array incurs only area overhead but no performance loss, while a faulty array is salvaged at the cost of performance overhead. Note that SATO/FTV/FTG are only applicable to DCIM-based IMC; they cannot be applied to MAGIC or RRAM-based static IMC in their current form.

V. SIMULATION RESULTS

To evaluate SATO and FTV, we compute performance metrics that include worst-case SM, BL-delay, average delay, average power, and energy consumption of a 64*32 DCIM RRAM crossbar array (Table VI). Based on the simulation results, FTV is more efficient than SATO.
A. SATO Simulation Results
Applying SATO to DCIM increases power and energy consumption by 12% and 127%, respectively (in the worst case), and performance is reduced by more than 50%. However, SATO is able to handle ~50% of the SA1 faults. SAs are costly and occupy a large area, so using the SAs to shift data increases power consumption and leads to higher energy consumption. The power and energy consumption of SATO with different numbers of SA1 faults is reported in Table VII.

TABLE VII: SATO power (uW) and energy (pJ) consumption for different numbers of SA1 faults.
B. FTV/FTG Simulation Results
SM is the most important parameter when FTV is applied to DCIM. An increased number of LRS RRAMs connected to VDD (ground) on a BL worsens the SM when '0' ('1') is the output. Considering the NAND operation, the worst-case '0' occurs when one of the operands is '0' and the other operand is '1'. In this case, there is a voltage division between one LRS RRAM connected to '0' and one LRS RRAM connected to VDD. When there is a faulty LRS RRAM on the BL and it is forced to VDD, the worst-case '0' occurs when two LRS RRAMs are connected to VDD and one LRS RRAM is connected to '0', which leads to an increased output-'0' voltage on the BL. The increased BL voltage for the worst-case '0' degrades the SM, as shown in Fig. 11(a).

The degradation in worst-case SM happens when the bitcell is made of only an RRAM (i.e., no selector diode). However, DCIM employs a bidirectional diode in series with the RRAM. This series-connected bidirectional diode is included to reduce power consumption by dropping 0.5V across its two terminals. When the voltage difference between a BL and a WL is less than the selector diode threshold voltage, there is no current between the BL and the WL. So, an increased number of LRS inputs connected to VDD (ground) does not affect the SM of DCIM (Fig. 11(b)). The current of HRS RRAMs increases with temperature, which results in higher sneak path currents. In an AND (OR) array, the higher sneak path currents pull up (down) the BL voltage and degrade the SM. However, when the number of faults increases, the sneak path currents become negligible compared to those of the LRS RRAMs connected to VDD (ground). Simulation results (Fig. 11(a)) show that the SMs at different temperatures become equal when the number of faults is more than 20.

Compared to the fault-free situation, FTV reduces power and energy consumption by >54% and >7%, respectively (since SAs consume a lot of power and, in the case of FTV, SAs connected to faulty BLs are deactivated in the first cycle and SAs connected to fault-free BLs are deactivated in the second cycle). This is due to inactive BLs and SAs and a longer time of operation. However, performance reduces by 50% due to the two-cycle operation. Average power and energy for the four consecutive AND operations in the 64*64 array are reported in Table VIII.

Fig. 11: SM for different numbers of faults on one BL at various temperatures: (a) diode+RRAM in the bitcell; (b) pure RRAM in the bitcell.

TABLE VIII: FTV power (uW) and energy (pJ) consumption.
C. Process Variation Simulations
The most important parameter to consider in a crossbar array under process variation is the SM. We ran 1000-point MC simulations at low and high temperature on DCIM, considering a bitcell consisting of only an RRAM as well as a bitcell with an RRAM and a selector diode, with different numbers of SA1 faults. Simulation results for the bitcell with an RRAM and a selector diode and for the RRAM-only bitcell are shown in Fig. 9 and Fig. 10, respectively.

As shown in Fig. 9, variations do not affect the SM significantly due to the presence of the selector diode, which stabilizes the BL voltage. However, as demonstrated in Fig. 10, variations affect the SM when only an RRAM is used in the crossbar. This is due to large changes in the RRAM resistance for a small change in the RRAM gap when 1.2V is applied across it. The worst-case SM versus the number of failures, both with and without the selector diode, is reported in Table IX.

VI. CONCLUSIONS
We proposed FAME for in-memory FP arithmetic computation. FAME implements a single precision FP adder/subtractor using an RRAM crossbar, and we evaluated two flavors with NAND-NAND and NOR-NOR compute arrays. We also proposed a novel SA-based shift circuit for the frequent shifting needed in FP operations. Compared to a MAGIC-based implementation, FAME achieves 828X and 3.7X latency and energy improvement, and compared to processing units (e.g., CPU, FPGA, GPU) it also reduces energy consumption and delay. FAME achieves lower power and energy consumption compared to MAGIC and processing units at low area overhead to the memory arrays. FAME uses 3KB of memory to implement single precision FP operations (Table V). Furthermore, two approaches to mitigate HRS-to-LRS retention and stuck-at-1 failures in RRAM-based compute memories are proposed, along with a test approach to identify faulty RRAMs. Forcing-to-VDD (FTV) can mitigate 99% of the faults while reducing the power consumption by >50% and also reducing the energy consumption.

TABLE IX: SM for different numbers of failures
Failures | SM (Selector diode) | SM (Without selector diode)
0        | 91.3 mV             | 118 mV
1        | 91.2 mV             | 95 mV
10       | 91.3 mV             | 44 mV
30       | 91.4 mV             | 19 mV

Acknowledgement:
This work is supported by SRC (2847.001) and NSF (CNS-1722557, CCF-1718474, CNS-1814710, DGE-1723687 and DGE-1821766).

REFERENCES
[1] Huangfu, W., Xia, L., Cheng, M., Yin, X., Tang, T., Li, B., Chakrabarty, K., Xie, Y., Wang, Y. and Yang, H., "Computation-oriented fault-tolerance schemes for RRAM computing systems," JAN 2017.
[2] Haj-Ali, A., Ben-Hur, R., Wald, N., Ronen, R. and Kvatinsky, S., "Imaging-in-memory algorithms for image processing," IEEE Transactions on Circuits and Systems I: Regular Papers (TCAS-I), JUN 2018.
[3] Linn, E., Rosezin, R., Tappertzhofen, S., Böttger, U. and Waser, R., "Beyond von Neumann—logic operations in passive crossbar arrays alongside memory operations," Nanotechnology, JUL 2012.
[4] Agrawal, A., Jaiswal, A., Lee, C. and Roy, K., "X-SRAM: Enabling in-memory Boolean computations in CMOS static random access memories," IEEE Transactions on Circuits and Systems I: Regular Papers (TCAS-I), JUL 2018.
[5] Imani, M., Gupta, S. and Rosing, T., "Ultra-efficient processing in-memory for data intensive applications," Proceedings of the 54th Annual Design Automation Conference (DAC), JUN 2017.
[6] Zhang, J., Wang, Z. and Verma, N., "In-memory computation of a machine-learning classifier in a standard 6T SRAM array," IEEE Journal of Solid-State Circuits (JSSC), APR 2017.
[7] Motaman, S. and Ghosh, S., "Dynamic computing in memory (DCIM) in resistive crossbar arrays," ICCD, OCT 2019.
[8] Kang, M., Keel, M.S., Shanbhag, N.R., Eilert, S. and Curewitz, K., "An energy-efficient VLSI architecture for pattern recognition via deep embedding of computation in SRAM," IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8326-8330, MAY 2014.
[9] Patterson, D., Anderson, T., Cardwell, N., Fromm, R., Keeton, K., Kozyrakis, C., Thomas, R. and Yelick, K., "Intelligent RAM (IRAM): Chips that remember and compute," IEEE International Solid-State Circuits Conference, Digest of Technical Papers, FEB 1997.
[10] Yin, X., Aziz, A., Nahas, J., Datta, S., Gupta, S., Niemier, M. and Hu, X.S., "Exploiting ferroelectric FETs for low-power non-volatile logic-in-memory circuits," IEEE/ACM International Conference on Computer-Aided Design (ICCAD), NOV 2016.
[11] Iyengar, A.S., Ghosh, S. and Jang, J.W., "MTJ-based state retentive flip-flop with enhanced-scan capability to sustain sudden power failure," IEEE Transactions on Circuits and Systems I: Regular Papers, vol. 62, no. 8, pp. 2062-2068, AUG 2015.
[12] Yin, X., Niemier, M. and Hu, X.S., "Design and benchmarking of ferroelectric FET based TCAM," Design, Automation & Test in Europe Conference & Exhibition (DATE), MAR 2017.
[13] Imani, M., Kim, Y. and Rosing, T., "MPIM: Multi-purpose in-memory processing using configurable resistive memory," Asia and South Pacific Design Automation Conference (ASP-DAC), JAN 2017.
[14] Seshadri, V., Lee, D., Mullins, T., Hassan, H., Boroumand, A., Kim, J., Kozuch, M.A., Mutlu, O., Gibbons, P.B. and Mowry, T.C., "Buddy-RAM: Improving the performance and efficiency of bulk bitwise operations using DRAM," arXiv preprint arXiv:1611.09988, 2016.
[15] Sayyah Ensan, S. and Ghosh, S., "FPCAS: In-memory floating point computations for autonomous systems," The International Joint Conference on Neural Networks (IJCNN), JUL 2019.
[16] Kang, W., Wang, H., Wang, Z., Zhang, Y. and Zhao, W., "In-memory processing paradigm for bitwise logic operations in STT-MRAM," IEEE Transactions on Magnetics, vol. 53, no. 11, MAY 2017.
[17] Talati, N., Gupta, S., Mane, P. and Kvatinsky, S., "Logic design within memristive memories using memristor-aided logic (MAGIC)," IEEE Transactions on Nanotechnology, vol. 15, no. 4, pp. 635-650, MAY 2016.
[18] Li, S., Xu, C., Zou, Q., Zhao, J., Lu, Y. and Xie, Y., "Pinatubo: A processing-in-memory architecture for bulk bitwise operations in emerging non-volatile memories," ACM/EDAC/IEEE Design Automation Conference (DAC), JUN 2016.
[19] Li, B., Wang, Y., Chen, Y., Li, H.H. and Yang, H., "ICE: Inline calibration for memristor crossbar-based computing engine,"
Design, Automation &Test in Europe Conference & Exhibition (DATE) , MAR 2014.[20] Xia, L., Gu, P., Li, B., Tang, T., Yin, X., Huangfu, W., Yu, S., Cao,Y., Wang, Y. and Yang, H, “Technological exploration of rram crossbararray for matrix-vector multiplication,”
Journal of Computer Scienceand Technology , JAN 2016.[21] Xia, L., Huangfu, W., Tang, T., Yin, X., Chakrabarty, K., Xie, Y., Wang,Y. and Yang, H., “Stuck-at fault tolerance in rram computing systems,”
IEEE Journal on Emerging and Selected Topics in Circuits and Systems ,MAR 2018.[22] Zhang, B., Uysal, N., Fan, D. and Ewetz, R., “Handling stuck-at-faultsin memristor crossbar arrays using matrix transformations,”
Proceedingsof the 24th Asia and South Pacific Design Automation Conference(ASPDAC) , JAN 2019.[23] Chen, C.Y., Shih, H.C., Wu, C.W., Lin, C.H., Chiu, P.F., Sheu, S.S.and Chen, F.T, “Rram defect modeling and failure analysis based onmarch test and a novel squeeze-search scheme,”
IEEE Transactions onComputers , JAN 2015.[24] Kannan, S., Karimi, N., Karri, R. and Sinanoglu, O., “Detection,diagnosis, and repair of faults in memristor-based memories,”
IEEE 32ndVLSI Test Symposium (VTS) , APR 2014.[25] B. Gao, H. Zhang, B. Chen, L. Liu, X. Liu, R. Han, J. Kang, Z. Fang,H. Yu, B. Yu et al. , “Modeling of retention failure behavior in bipolaroxide-based resistive switching memory,”
IEEE Electron Device Letters ,vol. 32, no. 3, 2011.[26] Kvatinsky, S., Belousov, D., Liman, S., Satat, G., Wald, N., Friedman,E.G., Kolodny, A. and Weiser, U.C., “Magic—memristor-aided logic,”
IEEE Transactions on Circuits and Systems II: Express Briefs , vol. 61,no. 11, pp. 895–899, SEP 2014.[27] Patterson, D.A. and Hennessy, J.L.,
Computer organization and design .Morgan Kaufmann, 2007.[28] Predictive technology model. [Online]. Available: http://ptm.asu.edu/[29] Arizona state university rram model. [Online]. Available: http://nimo.asu.edu/memory/[30] Huang, Jiun-Jia, Yi-Ming Tseng, Wun-Cheng Luo, Chung-Wei Hsu, andTuo-Hung Hou., “One selector-one resistor (1s1r) crossbar array forhigh-density flexible memory applications,”
Electron Devices Meeting(IEDM) , 2011.[31] Nho, H., Yoon, S.S., Wong, S.S. and Jung, S.O., “Numerical estimationof yield in sub-100-nm sram design using monte carlo simulation,”
IEEETransactions on Circuits and Systems II: Express Briefs , 2008.[32] Malladi et al, “Towards energy-proportional datacenter memory withmobile dram,”