Addressing Resiliency of In-Memory Floating Point Computation
Sina Sayyah Ensan, Swaroop Ghosh, Seyedhamidreza Motaman, Derek Weast
School of Electrical Engineering and Computer Science, Pennsylvania State University, University Park, PA 16802 USA (sxs2541, szg212, sxm884, dqw5347)@psu.edu
Abstract—In-memory computing (IMC) can eliminate the data movement between processor and memory, which is a barrier to energy efficiency and performance in Von Neumann computing. Resistive RAM (RRAM) is one of the promising devices for IMC applications (e.g., integer and Floating Point (FP) operations and random logic implementation) due to its low power consumption, fast operation, and small footprint in a crossbar architecture. In this paper, we propose FAME, a pipelined FP arithmetic (adder/subtractor) using RRAM crossbar based IMC. A novel shift circuitry is proposed to lower the shift overhead during FP operations. Since most of the RRAMs used in our architecture are in the High Resistance State (HRS), we propose two approaches, namely Shift-At-The-Output (SATO) and Force To VDD (FTV) (and its dual, Force To Ground (FTG)), to mitigate Stuck-at-1 (SA1) failures. In both techniques, the fault-free RRAMs are exploited to perform the computation by using an extra clock cycle. Although performance degrades by 50%, SATO can handle 50% of the faults whereas FTV can handle 99% of the faults in the RRAM-based compute array at low power and area overhead. Simulation results show that the proposed single precision FP adder consumes pJ and pJ for NAND-NAND and NOR-NOR based implementations, respectively. The area overheads of SATO and FTV are 28.5% and 9.5%, respectively.
Index Terms—In-Memory Computing, Floating Point, RRAM, Crossbar, Resiliency.

I. INTRODUCTION
In the big data era, conventional CMOS-based Von Neumann architecture platforms are unable to meet real-time data processing requirements [1]. Memory and computing elements are decoupled from each other in the Von Neumann architecture [2], which requires frequent communication between memory and computing cores [3]. With transistor scaling, compute energy has scaled asymmetrically compared to data transport energy, and data movement in modern computing systems now dominates energy efficiency and performance [4]. In-Memory Computing (IMC) is one of the promising compute models to fully or partially eliminate the need to transport data between processors and memory. The main concept of IMC is to infuse compute capability into the memory cells [5]. IMC is achievable by using emerging Non-Volatile Memories (NVM), e.g., RRAM, Spin Transfer Torque (STT) RAM, and Phase Change Memory (PCM) [5], [6], [7], [8]. Near-memory processing [9] and logic-in-memory, which employ NVMs in the logic space [10], [11] to preserve states between powering sequences, have been proposed in the literature. However, they cannot solve the problem of separation between logic and memory. IMC modifies memory cells and/or peripheral circuits/access mechanisms to infuse compute capability into memory cells. IMC can solve specific tasks such as dot-products for recognition [8], search [12], and classification [6]. It also supports a wide range of logic and arithmetic operations [10], [13], [14], [15]. NVM-based IMC using STTRAM [16], RRAM [17], Ferroelectric FET (FeFET), and Phase-Change Memory [18] is becoming popular.

Due to immature fabrication technology, manufacturing yield is still a serious concern for NVMs such as the RRAM crossbar. Faults in RRAM crossbar arrays are categorized into hard and soft faults [1]. Previous studies have predominantly focused on soft faults [19], whereas few attempts have been made to recover crossbar arrays from hard faults.
The soft faults (e.g., read disturb) can be recovered by calibrating the resistance [19], [20]. However, hard faults are recovered through mapping algorithms (i.e., by assigning inputs of faulty RRAMs to redundant rows or columns) [1], [21], [22]. A stuck-at fault is defined as a situation where the RRAM is permanently stuck at the High Resistance State (HRS) or the Low Resistance State (LRS). It has been reported [23] that only 63% of HfO2-based RRAM devices in a 4Mb crossbar array are fault-free and about 10% of RRAM devices contain stuck-at faults. Retention failure, which is similar to resistive switching due to the generation or recovery of oxygen vacancies, is another type of hard fault in RRAMs. In the proposed IMC architecture, only 4% of the RRAMs are in LRS and the other 96% are in HRS. Therefore, in this paper we focus on the HRS retention failure and stuck-at-1 (i.e., stuck-at HRS) faults. If the yield of a single RRAM device is 99%, there is only a small probability for a column of a 64*32 array to be fault-free. The stuck-at failures and HRS-to-LRS switching [24] can be fixed by employing a few redundant rows/columns when the RRAM array is used as a memory. However, the whole array is needed for the IMC application. Consequently, computations will fail due to errors in the absence of fault tolerance schemes.

We have considered Floating Point (FP) operations to evaluate the proposed resilience techniques. This is motivated by the fact that emerging applications, e.g., mission-critical systems like autonomous cars, require a huge amount of data processing in real time at low power (to make timely decisions). Autonomous cars make complex decisions within tight deadlines using algorithms such as Kalman filters for data fusion, ray tracing for path planning, and edge detection and deep neural networks for classification. Most of these algorithms require FP vector operations involving transpose, inverse, and addition/multiplication. Therefore, the capability to perform these tasks quickly and accurately is of utmost importance to enable safe and energy-efficient autonomous systems. Conventionally, FP architectures are implemented as full-custom VLSI or in FPGA. Although fast and power efficient, these custom designs impose cost and complexity. In this paper, we propose FAME (Single Precision Floating Point Arithmetic using In-Memory Computing) implemented on crossbar RRAM.
We employ a modified version (Section ??) of the Dynamic Computing In Memory (DCIM) [7] based architecture as our baseline compute substrate for FAME. Additionally, two approaches, namely Shift-At-The-Output (SATO) and Force To VDD (GND) (FTV(G)), are proposed to enable in-memory computing in the presence of HRS-to-LRS retention failures. We focus on this failure mechanism for two reasons: (i) HRS-to-LRS switching is more common in RRAM [25]; (ii) the majority of the RRAMs (96%) are in HRS for both NAND-NAND and NOR-NOR arrays. A Carry Select Adder (CSA) based on the DCIM implementation is used for the demonstration. We add extra peripheral circuits to each array to implement the proposed techniques.

In particular, we make the following contributions in this paper:
1) Alternative low-overhead realization of DCIM for FP computation;
2) In-memory shift circuit embedded in the peripherals, e.g., the sense amplifier (SA);
3) Enabling a pipelined architecture using the latch embedded in the SA;
4) Fault mitigation approaches, namely SATO and FTV/FTG, for the DCIM architecture;
5) Process variation (PV) analysis of the RRAM array to check the integrity of SATO and FTV/FTG.

The rest of the paper is organized as follows. Section II introduces related work on IMC. Section ?? explains the proposed FAME circuit and architecture. Section III presents the simulation results of FAME and a comparison with other IMC logic implementations. Section IV explains the proposed approaches to overcome SA1 faults in the IMC architecture. Section V presents the simulation results of the proposed fault tolerance approaches. Section VI draws the conclusion.

II. RELATED WORK AND BACKGROUND
A. Memristor Aided Logic (MAGIC)
MAGIC [26] (shown in Fig. 1) is an IMC architecture in which the logic state of a gate is represented by the memristor (RRAM in this paper) resistance, where high (low) resistance is considered as logic '1' ('0'). The inputs to a MAGIC gate are the logic states stored in the input memristors and the output is the final state of the output memristor. MAGIC executes operations in two steps: 1) setting the output memristor to a known logic state (e.g., for a NOR operation the output is initialized to LRS); 2) applying a known voltage (V) to the input memristors, which causes current to flow through the input and output memristors. The output memristor's state changes if the current passing through it is higher than the set/reset current. MAGIC is capable of implementing Boolean functions such as NAND, NOR, AND, OR, and NOT.

Fig. 1: MAGIC NOR gate.

B. Dynamic Computing In Memory (DCIM)
DCIM [7] is an RRAM crossbar based architecture in which each memory cell is composed of an RRAM device connected in series with a selector diode (Fig. 2a). In-memory computation is accomplished by implementing the functions in the form of Sum-of-Products (SoP). Thus, both AND and OR operations are required to implement the logical functions. In DCIM, wordlines (WL) serve as the inputs and bitlines (BL) serve as the outputs of the arrays. Separate pre-programmed AND and OR arrays are dedicated to implement the desired function. For instance, in order to implement in1.in2', the bitcells connected to in1 and in2' are programmed to LRS while the bitcells connected to in1' and in2 are programmed to HRS (Fig. 2a). All bitcells which are not part of the AND gate inputs are programmed to HRS (e.g., the bitcells connected to inputs in_n and in_n').

Fig. 2 shows the implementation of the XOR function using DCIM. Initially, the Pre signal is activated to pre-charge the BLs of the AND array. Next, the inputs (in1 and in2) are applied by asserting EN_AND. As shown in Fig. 2b, both BL0 and BL1 drop below the reference voltage (V_Ref-AND) when in1 = in2 = 1. As a result, the SA outputs, which determine the results of the in1.in2' and in1'.in2 functions, are pulled down to '0' at the edge of SE_AND. Next, the AND array SA outputs are provided as inputs to the OR array. Since the inputs of the OR array are '0', the BL (BL_OR) remains discharged, which results in in1 xor in2 = 0. If in1 = 0, in2 = 1 (in1.in2' = 0 and in1'.in2 = 1), BL0 discharges while BL1 remains pre-charged. Therefore, BL_OR starts charging at the edge of EN_OR. Finally, the voltage of BL_OR is compared against V_Ref-OR at the edge of SE_OR, which produces '1' at the output of the SA.

C. FP Addition/Subtraction
In the IEEE 754 standard, a single precision FP number is represented by 1 sign bit, 8 exponent bits, and 23 fraction bits. A negative (positive) number is represented with a sign bit equal to '1' ('0'). In order to represent negative exponents, IEEE 754 uses a bias of 127 for single precision (e.g., an exponent of -1 is represented by -1+127=126). The general representation of an FP number is given by:

  Value = (-1)^Sign * (1 + Fraction) * 2^(Exponent - Bias)   (1)

The flowchart for FP addition/subtraction as per the IEEE 754 standard is shown in Fig. 3.

Fig. 2: (a) XOR implementation using DCIM architecture in RRAM crossbar array; and, (b) timing diagram of the logical XOR operation.

Fig. 3: IEEE 754 standard FP addition/subtraction flowchart [27]: subtract the exponents to find their difference; shift the smaller number's fraction to the right; add the fractions; normalize the sum (shift right or left as needed); check for overflow/underflow exceptions; round the fraction to the appropriate number of bits; repeat normalization if the result is not yet normalized.

III. FAME SIMULATION RESULTS
The simulations are carried out in 65nm PTM [28] technology by employing the ASU RRAM model [29] and a bi-directional selector diode model [30]. Worst-case Sense Margin (SM), BL-delay, average delay, average power, and energy consumption (Table III) are calculated to evaluate the FAME architecture. Key device parameters for the simulations are listed in Table I. SM is obtained by performing 1000-point Monte Carlo simulations at various temperatures with the parameters listed in Table II to mimic process variations. The worst-case SM is obtained under process variation at 25 C for the worst-case compute array (i.e., the fraction addition array). The BL-delay is the time when a 100 mV SM is achieved. The proposed FP adder/subtractor implementations with both NAND-NAND and NOR-NOR architectures are compared against MAGIC and an ASIC design.

TABLE I: Simulation parameters
Parameter                                          | Value
MOSFET Gate Length                                 | 65 nm
NMOS/PMOS Threshold Voltage                        | 423/-365 mV
BL Capacitance                                     | 30 fF
RRAM Gap Min/Max/Oxide Thickness                   | 0.1/1.7/5 nm
Atomic Energy for Vacancy Generation/Recombination | -

TABLE II: Monte Carlo simulation parameters
Parameter          | Real Value | Variation | Std. Deviation
RRAM LRS Gap       | 0.1 nm     | 7%        | sigma
RRAM HRS Gap       | 1.7 nm     | 7%        | sigma
MOS Oxide Thickness| 1.2 nm     | 10%       | sigma
MOS Gate Length    | 65 nm      | 10%       | sigma

The write latency is obtained by performing 1000-point MC simulations. The worst-case write latency for low-to-high and high-to-low switching under process variation is 20 ns. FAME achieves 828X, 3.2X, and 3.7X improvement in latency, power, and energy, respectively, compared to MAGIC. The higher energy associated with MAGIC is attributed to the need to write into the RRAMs when an operation is done. Furthermore, compared to the power, energy consumption, and delay imposed by transferring data between main memory and processing units (e.g., CPU, GPU, and FPGA), FAME reduces power consumption, energy consumption, and delay.

TABLE III: Simulation results
Characteristics      | NAND    | NOR     | MAGIC   | CPU [32]
BL Delay (ns)        | 1.42    | 1.23    | N/A     | N/A
SA Sense Delay (ps)  | 24.52   | 69.1    | N/A     | N/A
Average Delay        | 25 ns   | 23 ns   | 20 us   | 84 ns
Exp. Subt. Pow. (uW) | 443.31  | 448     | 2808.92 | N/A
Fr. Add. Pow. (uW)   | 1068.52 | 1123.19 | 2142.84 | N/A
Shift Pow. (uW)      | 443.24  | 452.93  | 982.31  | N/A
Avg. Power (mW)      | 0.7     | 0.71    | 2.3     | 61
Energy (nJ)          | 0.33    | 0.32    | 1.2     | 5.1

1000-point MC simulations are performed at three temperatures at the nominal supply voltage to obtain the mean SM (Table IV). The V_NAND0 (NAND array BL voltage when the input is '0'), V_NAND1, V_NOR0, and V_NOR1 distributions at the worst-case temperature are shown in Fig. 4. In order to obtain the read access pass yield (RAPY) [31], [7], we have performed SA offset voltage analysis. The SA offset voltage can be modeled by a Gaussian distribution with sigma = 16 mV and mu = 8 mV. To obtain RAPY we assume that V_Ref is produced by a voltage regulator with negligible variation. We assigned V_Ref so as to maximize RAPY. Based on the Monte Carlo simulation, the RAPY of the NAND and NOR operations is found to be .sigma and .sigma, respectively.

TABLE IV: SM (mV) at different temperatures for NAND and NOR arrays.

IV. RESILIENCE TO STUCK-AT FAULT
In this section, we describe SATO and FTV, two fault mitigation techniques proposed for the DCIM architecture. In the following we use (i) faulty BL to denote a BL with an undesired stuck-at-1 (SA1) RRAM; and (ii) faulty WL to denote a WL with an undesired SA1 RRAM.

TABLE V: FAME area (Block, Array Size).

Fig. 4: SM distribution (mu = 767, sigma = 1.33; mu = 932, sigma = 3.69; mu = 196, sigma = 3.57; mu = 361, sigma = 2.10).

Computations are performed in two cycles when the proposed fault mitigation techniques are applied (Fig. 5). The computations of the fault-free BLs are performed in the first cycle and the computations of the faulty BLs are performed in the second cycle (Fig. 5(a)). In FTV, the WLs corresponding to faulty RRAMs in a NAND (AND) array are forced to VDD to mask the faulty bits. In the dual Force-to-Ground (FTG) technique, the faulty BLs are forced to 0V for NOR (OR) arrays. FTV/FTG tolerates 99% of stuck-at faults (SAF) while reducing the power consumption of the array. In the SATO approach, the operations of the fault-free BLs are executed in the first cycle and then the outputs are shifted in the SAs. Then, the operations of the faulty BLs are computed using fault-free BLs (the operation of a faulty BL is carried out on a fault-free BL). SATO covers 50% of SAFs without affecting power consumption. The high-level timings of FTV and SATO are illustrated in Fig. 5(b) and (c), respectively.

A. Shifting-At-The-Output (SATO)
As described before, in this technique the normal operation of the fault-free BLs is performed in the first cycle and the computation of the faulty BLs is performed in the second cycle. SATO does not use faulty BLs to perform an operation; it executes all operations on the fault-free BLs. SATO shifts the data stored in the SA latches of the fault-free BLs to prevent overwriting. When the computation of the first cycle is completed, the data are shifted in the SAs (three shifts are needed if an adder/subtractor is implemented). As shown in Fig. 6, the inputs on the WLs are shifted as well, so the computation is performed using fault-free BLs. The peripherals of SATO incur 28.5% area overhead.
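The two-cycle remapping described above can be sketched behaviourally (this is an illustrative functional model, not a circuit-level one; the names `nand_bl` and `sato_compute` are ours, and each BL is abstracted as the NAND of the inputs wired to it via LRS cells):

```python
# Behavioural sketch of SATO's two-cycle operation (our abstraction).

def nand_bl(inputs, lrs_rows):
    """A NAND-array bitline: NAND of the inputs connected via LRS cells."""
    return 0 if all(inputs[r] for r in lrs_rows) else 1

def sato_compute(inputs, columns, faulty):
    """columns: LRS row-set per BL; faulty: indices of BLs with SA1 cells."""
    latches = {}
    # Cycle 1: normal operation on the fault-free BLs; results are latched
    # in the SAs and shifted so they are not overwritten in cycle 2.
    for bl, rows in enumerate(columns):
        if bl not in faulty:
            latches[bl] = nand_bl(inputs, rows)
    # Cycle 2: a faulty BL's function is re-evaluated on a fault-free
    # neighbour (the WL inputs are steered there by multiplexers);
    # functionally the result equals the faulty BL's intended output.
    for bl in faulty:
        latches[bl] = nand_bl(inputs, columns[bl])
    return [latches[bl] for bl in range(len(columns))]
```

In hardware, the cycle-2 evaluation happens on the physically adjacent fault-free BL, with the shift circuitry in the SAs aligning the latched results.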
1) Non-fixable Faults:
SATO cannot handle faults that appear on two consecutive sets of BLs (each group of three consecutive BLs forms a set if an adder/subtractor is implemented). More multiplexers are needed per WL to handle faults on consecutive sets of BLs; the number of multiplexers per WL increases linearly with the number of consecutive faulty sets of BLs to be handled by SATO. For example, if faults occur on two consecutive sets of BLs (e.g., if a neighbouring BL in Fig. 6 also contains a fault), SATO cannot handle them unless two or more multiplexers are dedicated to each WL. The probability of two faults occurring on two consecutive BLs is less than 3% for a 64*32 crossbar array. However, SATO is able to handle less than 50% of the faults if a yield of 99.5% is assumed for the crossbar array.
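The single-multiplexer-per-WL restriction above amounts to a simple rule over BL-set indices; a minimal sketch (our formulation, with `set_width=3` for the adder/subtractor case):

```python
# Sketch of SATO's fixability rule with one multiplexer per WL:
# a fault pattern is fixable only if no two consecutive BL sets
# both contain a fault.

def sato_fixable(faulty_bls, set_width=3):
    """True if no two consecutive BL sets are both faulty."""
    faulty_sets = sorted({bl // set_width for bl in faulty_bls})
    return all(b - a > 1 for a, b in zip(faulty_sets, faulty_sets[1:]))
```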
2) Handling multiple faults:
SATO's efficiency degrades as the number of faults increases. In this paper, we considered a yield of 99.5% in a 64*32 crossbar array for the SATO simulations. This corresponds to 11 randomly distributed faults throughout the array. SATO is able to mitigate around 50% of the faults in the array. The faults were distributed over the memory cells using the rand function provided by the C++ programming language.
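The fault-injection experiment above can be sketched as follows (an illustrative re-implementation in Python rather than C++; `mitigated_fraction` is our own per-fault approximation of SATO's consecutive-set restriction, not the paper's exact criterion):

```python
# Monte-Carlo-style sketch: scatter random SA1 faults over a 64*32 array
# and estimate what fraction of them SATO can mitigate.
import random

def mitigated_fraction(fault_bls, set_width=3):
    """Fraction of faults whose BL set has no faulty neighbouring set."""
    sets = [bl // set_width for bl in fault_bls]
    if not sets:
        return 1.0
    occupied = set(sets)
    good = sum(1 for s in sets
               if (s - 1) not in occupied and (s + 1) not in occupied)
    return good / len(sets)

def sato_coverage(trials=1000, rows=64, cols=32, n_faults=11, seed=0):
    """Average mitigated fraction over random SA1 fault placements."""
    rng = random.Random(seed)
    cells = [(r, c) for r in range(rows) for c in range(cols)]
    total = 0.0
    for _ in range(trials):
        faults = rng.sample(cells, n_faults)
        total += mitigated_fraction([c for _, c in faults])
    return total / trials
```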
Fig. 7: FTV: (a) fault mitigation for undesired LRS RRAMs; (b) timing diagram.

Fig. 5: (a) 4*4 RRAM crossbar array; (b) FTV and SATO timing.

Fig. 6: SATO fault mitigation technique.
B. Forcing to VDD (FTV)

FTV performs the operations of the fault-free and the faulty BLs in the first and second cycle, respectively. The inputs of faulty RRAMs are forced to VDD in the second cycle. To apply FTV to NAND arrays, we follow simple NAND logic where, for example, A.B.C is replaced with A.B.1, where C is the input of the SA1 RRAM. Therefore, the NAND is performed with an extra '1' input, which does not affect the logic. However, an increased number of RRAMs in a BL reduces the SM. If the faults are located on different BLs, they do not affect the SM.

FTV uses a multiplexer for the enable signal of the SAs to ensure that the array is capable of working in two cycles. F'.CS' and F.CS are the inputs of the multiplexer, where F is '0' if the BL is fault-free and '1' if the BL is faulty, and CS is the clock sequence, initialized to '0' in the first cycle and set to '1' in the second cycle. The enable signals of SAs connected to fault-free BLs are asserted in the first cycle while the enable signals of SAs connected to faulty BLs are asserted in the second cycle, to save power and maintain the correct logic. Furthermore, FTV uses 4 additional transistors compared to DCIM at the WL input to enable the test procedure (explained in Section IV-C) that finds faulty RRAMs.

As shown in Fig. 7a, FTV uses two transmission gates to connect each input and its complement to the WLs. Also, one PMOS transistor is added to each WL to force the WL to VDD when needed in the second cycle. In the second cycle, the SC signal of faulty WLs is activated to force the faulty WLs to VDD. FTV employs a fault signal (F_W) for each WL to track the faulty WLs and set them to VDD in the second cycle. FTV also defines a fault signal (F_B) for each BL to keep track of faulty BLs. The additional circuitry and peripherals needed to apply FTV to DCIM increase the area by 9.5%.
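The masking idea and the SA-enable multiplexing can be sketched behaviourally (our naming; a functional model, not the circuit): forcing a stuck-at-1 cell's wordline to VDD replaces its input by logic '1', which is the identity element for AND/NAND, and the SA enable is the mux function F'.CS' + F.CS.

```python
# Behavioural sketch of FTV for a NAND (AND) array.

def nand_row(inputs, forced_high=frozenset()):
    """NAND of inputs; indices in forced_high are masked to '1' (FTV)."""
    vals = [1 if i in forced_high else v for i, v in enumerate(inputs)]
    return 0 if all(vals) else 1

def sa_enable(F, CS):
    """SA fires in cycle 0 for fault-free BLs (F=0), cycle 1 for faulty BLs."""
    return (not F and not CS) or (F and CS)
```

For example, with the third operand masked, `nand_row([1, 1, 0], forced_high={2})` evaluates NAND(1, 1, 1) = 0, i.e., the same result as the two-input NAND(1, 1) that the faulty cell was supposed to leave intact.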
1) Forcing-to-Ground (FTG):
The basic concept of FTG is similar to FTV but it applies to NOR (OR) arrays. FTG follows the simple logic that the number of '0's does not matter in a NOR (OR) operation. Therefore, FTG forces the inputs of faulty RRAMs to ground. The peripherals and the rest of FTG's operation are the same as in FTV.

Fig. 8: Faults that cannot be handled by FTV.

TABLE VI: Comparison between SATO and FTV
Characteristics | SATO         | FTV
Coverage        | 50%          | 90%
SM (w/ diode)   | Not affected | Not affected
SM (w/o diode)  | Not affected | Lower
Test circuitry  | Needed       | Included
Area overhead   | 28.5%        | 9.5%
Power           | Not affected | Lower
Energy          | Higher       | Slightly lower
Performance     | 50%          | 50%
2) Non-fixable Faults:
Although FTV can fix most of the faults in a crossbar array, it is unable to handle some rare situations, for example, when there are two faulty BLs and the faulty RRAM on one of the BLs drives an operand of the other BL. As shown in Fig. 8, BL_n and BL_m are faulty and their operations must be done in the second cycle. In this case, the logic of BL_m is lost if FTV forces the input (in_m) of the faulty RRAM (RRAM1) on BL_n to VDD, since that input also feeds BL_m. The NAND operation for BL_m is then incorrectly performed between in_n and '1' instead of between in_n and in_m. The probability of such a fault occurring for fabrication yields of more than 99% is less than 1%. We randomly distributed the faults 100 times using the C++ rand function in order to estimate the percentage of such faults occurring in an array.
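This corner case amounts to a dependency check between faulty BLs; a minimal sketch (our formalisation, with hypothetical dictionaries mapping each faulty BL to its faulty wordlines and each BL to its operand wordlines):

```python
# Sketch of the FTV non-fixable condition: a wordline forced to VDD for
# one faulty BL must not be a live operand of another faulty BL, because
# both faulty BLs evaluate together in the second cycle.

def ftv_fixable(faults, operands):
    """faults: BL -> set of its SA1 wordlines; operands: BL -> operand WLs."""
    for bl, bad_wls in faults.items():
        for other in faults:
            if other != bl and operands[other] & bad_wls:
                return False  # forcing bad_wls to VDD corrupts 'other'
    return True
```

Fault-free BLs are unaffected because they evaluate in the first cycle, when no wordline is forced.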
3) Handling multiple faults:
As long as the faulty RRAMs in the crossbar array are independent of each other, FTV can handle an arbitrary number of faults. For our simulations we inserted 30 faults in a 64*32 crossbar array, and FTV was able to resolve more than 99% of the fault distributions over the array.
C. Finding Faults using FTV Peripherals
It is required to find the faulty RRAMs in order to set the fault signals of the BLs and WLs. Faults can be found using the peripherals that are included in FTV. However, the BLs must be tested one at a time. To find the faults in a NAND array, the input of each RRAM that is set to LRS in a BL is forced to VDD and the rest are forced to '0'. The output of the SA indicates whether a BL is faulty ('1') or fault-free ('0'). If the BL is faulty, we need to find out which RRAM on that BL is faulty. A divide-and-conquer approach cannot be used in this architecture since there might be more than one faulty RRAM per BL, so we use a brute-force algorithm to find the faulty RRAMs. The input of the RRAM-under-test is set to '0' while the inputs of all other RRAMs are set to VDD. If the SA output is '0', the RRAM-under-test is deemed faulty and its flag is set to '1'. All faulty RRAMs can be found by repeating this operation for each RRAM in each BL sequentially.

Fig. 9: Process variation analysis of SM for various numbers of failures on a single BL with a selector diode in the bitcell (i.e., selector diode-RRAM crossbar) at (a) low temperature; (b) high temperature.

Fig. 10: Process variation analysis of SM for various numbers of failures on a single BL while the bitcell consists of an RRAM only, at (a) low temperature; (b) high temperature.

D. Usage and Limitations of SATO/FTV/FTG
SATO/FTV/FTG should be enabled only when a fault has been detected in the test process. Therefore, a fault-free array incurs only area overhead but no performance loss, while a faulty array is salvaged at the cost of performance overhead. Note that SATO/FTV/FTG are only applicable to DCIM-based IMC; they cannot be applied to MAGIC or RRAM-based static IMC in their current form.

V. SIMULATION RESULTS

To evaluate SATO and FTV, we compute performance metrics that include worst-case SM, BL-delay, average delay, average power, and energy consumption of a 64*32 DCIM RRAM crossbar array (Table VI). Based on the simulation results, FTV is more efficient than SATO.
A. SATO Simulation Results
Applying SATO to DCIM increases power and energy consumption by 12% and 127%, respectively (in the worst case), and performance is reduced by more than 50%. However, SATO is able to handle ~50% of the SA1 faults. SAs are costly and occupy a large area, so using the SAs to shift data increases power consumption and leads to higher energy consumption. The power and energy consumption of SATO with different numbers of SA1 faults is reported in Table VII.

TABLE VII: SATO power (uW) and energy (pJ) consumption for different numbers of SA1 faults.
B. FTV/FTG Simulation Results
SM is the most important parameter when FTV is applied to DCIM. An increased number of LRS RRAMs connected to VDD (ground) on a BL worsens the SM when '0' ('1') is the output. Considering the NAND operation, the worst-case '0' occurs when one of the operands is '0' and the other operand is '1'. In this case, there is a voltage division between one LRS RRAM connected to '0' and one LRS RRAM connected to VDD. When there is a faulty LRS RRAM on the BL and it is forced to VDD, the worst-case '0' occurs when two LRS RRAMs are connected to VDD and one LRS RRAM is connected to '0', which leads to an increased output-'0' voltage on the BL. The increased BL voltage for the worst-case '0' degrades the SM, as shown in Fig. 11(a).

The degradation in worst-case SM happens when the bitcell is made of only an RRAM (i.e., no selector diode). However, DCIM employs a bidirectional diode in series with the RRAM. This series-connected bidirectional diode is included to reduce power consumption by dropping 0.5V across its two terminals. When the voltage difference between a BL and a WL is less than the selector diode threshold voltage, there is no current between the BL and the WL. So, an increased number of LRS inputs connected to VDD (ground) does not affect the SM of DCIM (Fig. 11(b)). The current of HRS RRAMs increases with temperature, which results in higher sneak path currents. In an AND (OR) array, the higher sneak path currents pull up (down) the BL voltage and degrade the SM. However, when the number of faults increases, the sneak path currents become negligible compared to those of the LRS RRAMs connected to VDD (ground). Simulation results (Fig. 11(a)) show that the SMs at different temperatures become equal when the number of faults is more than 20.

Compared to the fault-free situation, FTV reduces power and energy consumption by >54% and >7%, respectively (since SAs consume a lot of power and, in the case of FTV, SAs connected to faulty BLs are deactivated in the first cycle and SAs connected to fault-free BLs are deactivated in the second cycle). This is due to inactive BLs and SAs and a longer time of operation. However, performance reduces by 50% due to the two-cycle operation. Average power and energy for the four consecutive AND operations in the 64*64 array are reported in Table VIII.

Fig. 11: SM for different numbers of faults on one BL at various temperatures: (a) diode+RRAM in the bitcell; (b) pure RRAM in the bitcell.

TABLE VIII: FTV power (uW) and energy (pJ) consumption.
C. Process Variation Simulations
The most important parameter to consider in a crossbar array under process variation is the SM. We ran 1000-point MC simulations at low and high temperature on DCIM, considering a bitcell consisting of only an RRAM as well as a bitcell with an RRAM and a selector diode, with different numbers of SA1 faults. Simulation results for the bitcell with an RRAM and a selector diode and for the RRAM-only bitcell are shown in Fig. 9 and Fig. 10, respectively.

As shown in Fig. 9, variations do not affect the SM significantly due to the presence of the selector diode, which stabilizes the BL voltage. However, as demonstrated in Fig. 10, variations affect the SM when only an RRAM is used in the crossbar. This is due to large changes in the RRAM resistance for a small change in the RRAM gap when 1.2V is applied across it. The worst-case SM versus the number of failures, both with and without the selector diode, is reported in Table IX.

VI. CONCLUSIONS
We proposed FAME for in-memory FP arithmetic computation. FAME implements a single precision FP adder/subtractor using an RRAM crossbar, and we evaluated two flavors with NAND-NAND and NOR-NOR compute arrays. We also proposed a novel SA-based shift circuit for the frequent shifting needed in FP operations. Compared to a MAGIC-based implementation, FAME achieves 828X and 3.7X latency and energy improvement, and compared to processing units (e.g., CPU, FPGA, GPU) it also reduces energy consumption and delay. FAME achieves lower power and energy consumption compared to MAGIC and processing units at low area overhead to the memory arrays. FAME uses 3KB of memory to implement single precision FP operations (Table V). Furthermore, two approaches to mitigate HRS-to-LRS retention and stuck-at-1 failures in RRAM-based compute memories are proposed, along with a test approach to identify faulty RRAMs. Forcing-to-VDD (FTV) can mitigate 99% of the faults while reducing the power consumption by >50% and also reducing the energy consumption.

TABLE IX: SM for different numbers of failures
Failures | SM (Selector diode) | SM (Without selector diode)
0        | 91.3 mV             | 118 mV
1        | 91.2 mV             | 95 mV
10       | 91.3 mV             | 44 mV
30       | 91.4 mV             | 19 mV

Acknowledgement:
This work is supported by SRC (2847.001) and NSF (CNS-1722557, CCF-1718474, CNS-1814710, DGE-1723687 and DGE-1821766).

REFERENCES
[1] Huangfu, W., Xia, L., Cheng, M., Yin, X., Tang, T., Li, B., Chakrabarty, K., Xie, Y., Wang, Y. and Yang, H., "Computation-oriented fault-tolerance schemes for RRAM computing systems," JAN 2017.
[2] Haj-Ali, A., Ben-Hur, R., Wald, N., Ronen, R. and Kvatinsky, S., "Imaging-in-memory algorithms for image processing," IEEE Transactions on Circuits and Systems I: Regular Papers (TCAS-I), JUN 2018.
[3] Linn, E., Rosezin, R., Tappertzhofen, S., Böttger, U. and Waser, R., "Beyond von Neumann—logic operations in passive crossbar arrays alongside memory operations," Nanotechnology, JUL 2012.
[4] Agrawal, A., Jaiswal, A., Lee, C. and Roy, K., "X-SRAM: Enabling in-memory Boolean computations in CMOS static random access memories," IEEE Transactions on Circuits and Systems I: Regular Papers (TCAS-I), JUL 2018.
[5] Imani, M., Gupta, S. and Rosing, T., "Ultra-efficient processing in-memory for data intensive applications," Proceedings of the 54th Annual Design Automation Conference (DAC), JUN 2017.
[6] Zhang, J., Wang, Z. and Verma, N., "In-memory computation of a machine-learning classifier in a standard 6T SRAM array," IEEE Journal of Solid-State Circuits (JSSC), APR 2017.
[7] Motaman, S. and Ghosh, S., "Dynamic computing in memory (DCIM) in resistive crossbar arrays," ICCD, OCT 2019.
[8] Kang, M., Keel, M.S., Shanbhag, N.R., Eilert, S. and Curewitz, K., "An energy-efficient VLSI architecture for pattern recognition via deep embedding of computation in SRAM," IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8326-8330, MAY 2014.
[9] Patterson, D., Anderson, T., Cardwell, N., Fromm, R., Keeton, K., Kozyrakis, C., Thomas, R. and Yelick, K., "Intelligent RAM (IRAM): Chips that remember and compute," IEEE International Solid-State Circuits Conference, Digest of Technical Papers, FEB 1997.
[10] Yin, X., Aziz, A., Nahas, J., Datta, S., Gupta, S., Niemier, M. and Hu, X.S., "Exploiting ferroelectric FETs for low-power non-volatile logic-in-memory circuits," IEEE/ACM International Conference on Computer-Aided Design (ICCAD), NOV 2016.
[11] Iyengar, A.S., Ghosh, S. and Jang, J.W., "MTJ-based state retentive flip-flop with enhanced-scan capability to sustain sudden power failure," IEEE Transactions on Circuits and Systems I: Regular Papers, vol. 62, no. 8, pp. 2062-2068, AUG 2015.
[12] Yin, X., Niemier, M. and Hu, X.S., "Design and benchmarking of ferroelectric FET based TCAM," Design, Automation & Test in Europe Conference & Exhibition (DATE), MAR 2017.
[13] Imani, M., Kim, Y. and Rosing, T., "MPIM: Multi-purpose in-memory processing using configurable resistive memory," Asia and South Pacific Design Automation Conference (ASP-DAC), JAN 2017.
[14] Seshadri, V., Lee, D., Mullins, T., Hassan, H., Boroumand, A., Kim, J., Kozuch, M.A., Mutlu, O., Gibbons, P.B. and Mowry, T.C., "Buddy-RAM: Improving the performance and efficiency of bulk bitwise operations using DRAM," arXiv preprint arXiv:1611.09988, 2016.
[15] Sayyah Ensan, S. and Ghosh, S., "FPCAS: In-memory floating point computations for autonomous systems," The International Joint Conference on Neural Networks (IJCNN), JUL 2019.
[16] Kang, W., Wang, H., Wang, Z., Zhang, Y. and Zhao, W., "In-memory processing paradigm for bitwise logic operations in STT-MRAM," IEEE Transactions on Magnetics, vol. 53, no. 11, MAY 2017.
[17] Talati, N., Gupta, S., Mane, P. and Kvatinsky, S., "Logic design within memristive memories using memristor-aided logic (MAGIC)," IEEE Transactions on Nanotechnology, vol. 15, no. 4, pp. 635-650, MAY 2016.
[18] Li, S., Xu, C., Zou, Q., Zhao, J., Lu, Y. and Xie, Y., "Pinatubo: A processing-in-memory architecture for bulk bitwise operations in emerging non-volatile memories," ACM/EDAC/IEEE Design Automation Conference (DAC), JUN 2016.
[19] Li, B., Wang, Y., Chen, Y., Li, H.H. and Yang, H., "ICE: Inline calibration for memristor crossbar-based computing engine,"
Design, Automation &Test in Europe Conference & Exhibition (DATE) , MAR 2014.[20] Xia, L., Gu, P., Li, B., Tang, T., Yin, X., Huangfu, W., Yu, S., Cao,Y., Wang, Y. and Yang, H, “Technological exploration of rram crossbararray for matrix-vector multiplication,”
Journal of Computer Scienceand Technology , JAN 2016.[21] Xia, L., Huangfu, W., Tang, T., Yin, X., Chakrabarty, K., Xie, Y., Wang,Y. and Yang, H., “Stuck-at fault tolerance in rram computing systems,”
IEEE Journal on Emerging and Selected Topics in Circuits and Systems ,MAR 2018.[22] Zhang, B., Uysal, N., Fan, D. and Ewetz, R., “Handling stuck-at-faultsin memristor crossbar arrays using matrix transformations,”
Proceedingsof the 24th Asia and South Pacific Design Automation Conference(ASPDAC) , JAN 2019.[23] Chen, C.Y., Shih, H.C., Wu, C.W., Lin, C.H., Chiu, P.F., Sheu, S.S.and Chen, F.T, “Rram defect modeling and failure analysis based onmarch test and a novel squeeze-search scheme,”
IEEE Transactions onComputers , JAN 2015.[24] Kannan, S., Karimi, N., Karri, R. and Sinanoglu, O., “Detection,diagnosis, and repair of faults in memristor-based memories,”
IEEE 32ndVLSI Test Symposium (VTS) , APR 2014.[25] B. Gao, H. Zhang, B. Chen, L. Liu, X. Liu, R. Han, J. Kang, Z. Fang,H. Yu, B. Yu et al. , “Modeling of retention failure behavior in bipolaroxide-based resistive switching memory,”
IEEE Electron Device Letters ,vol. 32, no. 3, 2011.[26] Kvatinsky, S., Belousov, D., Liman, S., Satat, G., Wald, N., Friedman,E.G., Kolodny, A. and Weiser, U.C., “Magic—memristor-aided logic,”
IEEE Transactions on Circuits and Systems II: Express Briefs , vol. 61,no. 11, pp. 895–899, SEP 2014.[27] Patterson, D.A. and Hennessy, J.L.,
Computer organization and design .Morgan Kaufmann, 2007.[28] Predictive technology model. [Online]. Available: http://ptm.asu.edu/[29] Arizona state university rram model. [Online]. Available: http://nimo.asu.edu/memory/[30] Huang, Jiun-Jia, Yi-Ming Tseng, Wun-Cheng Luo, Chung-Wei Hsu, andTuo-Hung Hou., “One selector-one resistor (1s1r) crossbar array forhigh-density flexible memory applications,”
Electron Devices Meeting(IEDM) , 2011.[31] Nho, H., Yoon, S.S., Wong, S.S. and Jung, S.O., “Numerical estimationof yield in sub-100-nm sram design using monte carlo simulation,”
IEEETransactions on Circuits and Systems II: Express Briefs , 2008.[32] Malladi et al, “Towards energy-proportional datacenter memory withmobile dram,”