CrossStack: A 3-D Reconfigurable RRAM Crossbar Inference Engine
Jason K. Eshraghian, Kyoungrok Cho, and Sung Mo Kang
School of Electrical, Electronic and Computer Engineering, University of Michigan, Ann Arbor, MI 48109 USA
College of Electrical and Computer Engineering, Chungbuk National University, Cheongju 362763, South Korea
Jack Baskin School of Engineering, University of California, Santa Cruz, Santa Cruz, CA 95064 USA
Abstract—Deep neural network inference accelerators are rapidly growing in importance as we turn to massively parallelized processing beyond GPUs and ASICs. The dominant operation in feedforward inference is the multiply-and-accumulate process, where each column in a crossbar generates the current response of a single neuron. As a result, memristor crossbar arrays parallelize inference and image processing tasks very efficiently. In this brief, we present a 3-D active memristor crossbar array 'CrossStack', which adopts stacked pairs of Al/TiO2/TiOx/Al devices with common middle electrodes. By designing the CMOS-memristor hybrid cells used in the layout of the array, CrossStack can operate in one of two user-configurable modes as a reconfigurable inference engine: 1) expansion mode and 2) deep-net mode. In expansion mode, the resolution of the network is doubled by increasing the number of inputs for a given chip area, reducing IR drop by 22%. In deep-net mode, inference speed per 10-bit convolution is improved by 29% by simultaneously using one TiO2/TiOx layer for read processes and the other for write processes. We experimentally verify both modes on our fabricated array.

Index Terms—deep learning, in-memory computing, memristors, neural network, RRAM.
I. INTRODUCTION
Increasing the sizes of artificial neural networks (ANNs) has been the most common response to the copious amount of data being continuously generated. Where training sets are in excess of billions of inputs processed through hundreds of millions of parameters in a neural network [1], new ways to speed up the processing of all this information must be developed. Since 2012, the training runtime of neural networks has doubled every 3–4 months. It is equally important to develop hardware that is not only optimized for running very large scale networks, but is adaptable to the unceasing wave of emerging ANN topologies.

Memristors are now ubiquitous in the neuromorphic computing literature due to their long retention [2]–[4], excellent scalability [5], [6], fast read and write speeds [7], [8], compatibility with CMOS technology [9]–[12], and precise weight updates [13], [14]. The development of dense integrated structures with 3-D stacked crossbar arrays enables an increase in throughput for a given chip area, but thus far, target applications of 3-D RRAM have mostly been limited to digital memory [16]–[23].

It can be difficult for ASIC designs to keep pace with the latest developments in machine learning due to the lag time between algorithm development and the full IC design cycle. With popular machine learning methods in a rapidly evolving state, reconfigurability is of paramount importance to ensure hardware does not become obsolete the moment new network architectures and topologies are introduced. In response to this, we present a reconfigurable stacked pair of memristor crossbars dubbed 'CrossStack' that can be operated in one of two modes: 1) expansion mode, and 2) deep-net mode. In expansion mode, the resolution of the network is doubled by increasing the number of inputs for a given chip area, thus reducing IR drop by 22% relative to an equivalent planar array.
In deep-net mode, inference speed per 10-bit convolution is improved by 29% by simultaneously using one array for read processes and the other for write processes. This brief will demonstrate how to selectively isolate and couple the two layers using CMOS cell design. We experimentally verify this on our in-house fabricated crossbar stack, using separately controlled CMOS circuitry in the SK Hynix 180-nm process.

II. MATRIX-VECTOR MULTIPLICATION
To perform analog-domain multiply-and-accumulate (MAC) using RRAM arrays, a voltage vector V_i is applied at the input and multiplied by a conductance matrix G, generating a current vector i = V_i G in accordance with Ohm's law and Kirchhoff's current law. On a pre-trained network, the conductance of each memristor is programmed prior to read-out. For further detail on MAC on a crossbar, we recommend referring to [26], [27]. Where the number of parameters exceeds the memory resources available, RRAM cells must be reprogrammed while the data flow of the activations is stalled. CrossStack avoids this possible stalling by adding the option to pipeline the read and write processes simultaneously across the two layers. How this is achieved is described in the following section.

III. CIRCUIT OPERATION MODES
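The MAC operation above reduces to an ordinary vector-matrix product; a minimal software sketch of that arithmetic, with hypothetical voltage and conductance values chosen purely for illustration:

```python
def crossbar_mac(v, g):
    """i_j = sum_k v[k] * g[k][j]: Ohm's law per device, KCL per column."""
    n, m = len(g), len(g[0])
    return [sum(v[k] * g[k][j] for k in range(n)) for j in range(m)]

# Hypothetical 2-input, 3-column crossbar: voltages in volts, conductances in siemens
v = [0.5, 0.2]                      # read voltages applied to the two rows
g = [[1e-5, 2e-5, 1e-4],            # conductances of row 0
     [1e-4, 1e-5, 2e-5]]            # conductances of row 1
currents = crossbar_mac(v, g)       # per-column output currents (amperes)
```

Each column current is the response of one neuron, which is what lets a crossbar compute an entire layer's MAC in one read step.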
A simplified structure of CrossStack is depicted in Fig. 1(a), and the memristor-CMOS cell schematic is given in Fig. 1(b). This work presents two modes in which CrossStack may operate. These modes are controlled by an active-high read-enable signal RE.

A. Expansion Mode
Expansion mode enables access to a shared column line from memristors both above and below the wire.

Fig. 1. (a) Simplified 3-D memristor crossbar array in expansion mode with cumulative current through the shared electrode (b) Memristor-CMOS cell (c) Current flow in read mode when read-enable is set high (d) Current flow in write mode when read-enable is set low (e) A stacked pair of cells during the read cycle in expansion mode (f) A stacked pair of cells in deep-net mode.

Vertical stacking of memristors doubles the number of possible inputs and weights to each neuron for a given length of column wire, as illustrated in Fig. 1(a) when compared to conventional crossbars. This can be formalized by the following equation:

$$
\begin{bmatrix} i_1 & i_2 & \cdots & i_m \end{bmatrix} =
\begin{bmatrix} V_{i1} & V_{i2} & \cdots & V_{in} \end{bmatrix}
\begin{bmatrix}
G_{1,1} & G_{1,2} & \cdots & G_{1,m} \\
G_{2,1} & G_{2,2} & \cdots & G_{2,m} \\
\vdots & \vdots & \ddots & \vdots \\
G_{n,1} & G_{n,2} & \cdots & G_{n,m}
\end{bmatrix}
\quad (1)
$$

where m is the number of columns in the crossbar and n is the number of rows. In expansion mode, n is double that of a 2-D array, as there are rows both above and below the shared column contributing to the output current.

To activate a cell in expansion mode, the read-enable signal of all cells must be identical. The current pathway from input to output of a single cell is depicted in Fig. 1(c), and a pair of stacked cells is shown in Fig. 1(e). To generate a read-out current at each shared column, RE must be set high (in our case, V ≥ V_Th = 0.4 V; described in further detail in our experimental results). If transistors N1 and N2 are treated as switches, then N2 would be off and N1 would be on. Therefore, a current pathway is formed from both upper and lower crossbar arrays to the column line. To program a memristor, RE is set low as in Fig. 1(d). This causes N2 to switch on and N1 off, which forms a pathway from the memristor to ground and prevents current from flowing to the shared column.
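In this read configuration, the shared column sums currents from the upper and lower arrays, so the effective n in (1) doubles. A behavioral sketch of that shared-column summation, with hypothetical conductance values:

```python
def expansion_mode_read(v_up, g_up, v_dn, g_dn):
    """Shared-column read-out: upper and lower row currents add on the
    same column line (KCL), doubling n for a fixed column length."""
    m = len(g_up[0])
    col = lambda v, g, j: sum(v[k] * g[k][j] for k in range(len(g)))
    return [col(v_up, g_up, j) + col(v_dn, g_dn, j) for j in range(m)]

# One upper row and one lower row feeding two shared columns (illustrative values)
i_shared = expansion_mode_read([0.5], [[1e-5, 2e-5]],   # upper array
                               [0.5], [[3e-5, 1e-5]])   # lower array
```

The same column wire now carries the contributions of twice as many inputs, which is also why the wire must tolerate twice the current of an equivalent 2-D array.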
Therefore, the two crossbars can be programmed independently of one another by isolating them. The transistors are sized to ensure a negligible leakage current, and to sustain a sufficiently low ON resistance in comparison to the memristance while operating in the linear region, such that each transistor behaves as an ideal switch.

B. Deep-net Mode
Deep-net mode ensures both pairs of arrays are isolated from one another at all times. Isolation biasing enables each layer to operate independently. The two arrays must have complementary RE signals, as distinct from expansion mode, depicted in Fig. 1(f). This means that while one array generates a read-out current, the other array of memristors is being programmed (in write mode). As described above, when RE is low the cell does not contribute to read-out current, and the input voltage V_i for the write layer is applied such that the conductances written to the memristors correspond to the weights of the next hidden layer in the neural network. Once the analog output current is digitized as a voltage, the write layer has already been pre-programmed, and there is no need to buffer the current or store it in memory, as is required by most other pipelines [27], [29], [30]. This process is repeated, but now the roles of the crossbars are reversed: the original read layer is programmed to the next hidden layer of weights, and the original write layer generates the read-out current. Thus, read-write processes run in parallel. One layer is programmed in anticipation of the output from the other layer, which enables a novel in-situ pipeline.

In describing how to program a memristor, the key difference between expansion and deep-net modes is that in expansion mode, all cells are identically biased for either read or write at any given time. In deep-net mode, 50% of the cells are biased for read processes and the remaining 50% are biased for write processes, only switching once each operation is complete. The shorter read-out time is subsumed within the programming cycle, but at the expense of half the number of inputs n in (1). We quantitatively demonstrate this in our experimental results.

Fig. 2. A prototype of CrossStack, with a cross-sectional view of the active layer taken using a focused ion beam analyzer.

TABLE I
CROSSSTACK CHARACTERISTICS
Symbol    Parameter                               Value
R_s       static resistance of set                10 kΩ
R_r       static resistance of reset              100 kΩ
V_DD      supply voltage                          1.8 V
V_read    read voltage                            0.5 V
V_write   write voltage                           1.2 V
t_read    current read-out time                   10 ns
t_write   programming time                        250 ns
n         number of memristors                    200
V_Th      threshold voltage                       0.4 V
P         critical worst-case power consumption   2.9 mW
R_wire    wire resistance                         3.2 Ω per cell
A_cell    cell area                               20 µm × 20 µm
W/L       transistor sizing                       450 nm / 180 nm = 2.5
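The read/write ping-pong of deep-net mode (Section III-B) amounts to the schedule sketched below. This is a behavioral illustration with hypothetical helper names, not a circuit model: one crossbar is read while the other is programmed with the next layer's weights, then the complementary RE biasing swaps their roles.

```python
def deep_net_pipeline(layers, x):
    """Ping-pong two crossbars: read one while the other is programmed
    with the next layer's weights, then swap roles (no weight buffering)."""
    matvec = lambda v, g: [sum(v[k] * g[k][j] for k in range(len(g)))
                           for j in range(len(g[0]))]
    arrays = {0: layers[0], 1: None}    # crossbar 0 pre-programmed with layer 0
    rd = 0                              # index of the crossbar biased for read
    for d in range(len(layers)):
        if d + 1 < len(layers):
            arrays[1 - rd] = layers[d + 1]  # write next weights in parallel
        x = matvec(x, arrays[rd])           # read-out overlaps the write above
        rd = 1 - rd                         # complementary RE: swap roles
    return x

# Two hypothetical 2x2 weight layers: identity, then a diagonal scaling
out = deep_net_pipeline([[[1, 0], [0, 1]], [[2, 0], [0, 3]]], [1.0, 2.0])
```

By the time the read-out of one layer is digitized, the other crossbar already holds the next layer's weights, which is the source of the pipeline speedup.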
IV. EXPERIMENTAL RESULTS
A. Crossbar Fabrication
CrossStack was constructed from two monolithically integrated crossbar arrays with a shared central electrode, which makes up the column line, based on a sandwich structure of Al/TiO2/TiOx/Al layers. A layer of Al (200-nm-thick and 20-µm-wide) was deposited using photolithography on a glass wafer as the bottom electrode (irradiated using mask alignment for 100 s, subsequently developed at 23 °C for 120 s). Any excess Al outside of the channel region was removed via wet etching (H3PO4 : HNO3 : CH3COOH : H2O = 80 ml : 5 ml : 5 ml : 10 ml) at a rate of Δd/t = 300 nm/min. TiO2 (5-nm-thick) and TiOx (15-nm-thick) thin films were formed by atomic layer deposition and magnetron sputtering to fabricate the memristor. Another 200-nm-thick layer of Al was sputtered as the top electrode using photolithography to create a 20 µm × 20 µm mask. After a planarization step, the top stack of active layer and metal was also deposited. Note that the polarities of the pair of active layers are mirrored, as distinguishable from [31]. This allows for identical input voltages to be applied when programming the memristors. Fig. 2 shows a working prototype of CrossStack, and a cross-sectional view of the memristor taken using a focused ion beam analyzer.
The CMOS cell was designed in the SK Hynix 180-nm process, where V_DD = 1.8 V and V_Th = 0.4 V; our parameters are summarized in Table I. We used a read voltage of up to 0.5 V and a write voltage of 1.2 V, both applied at V_IN, with measurements taken with a Micromanipulator tungsten probe tip. The pinched hysteresis loop measured with a 50 Hz source is shown in Fig. 3(a).

First, we tested the circuit in expansion mode. In the critical case of a write voltage V_write = 1.2 V applied to all devices with RE set HIGH (1.8 V), we show that IR drop is decreased by approximately 22% compared to a similar planar inference engine in [26]. This is shown by the slower decline of current output across columns in Fig. 3(b) for CrossStack, where the gold standard would be a perfectly straight line. Therefore, we verified that expansion mode reduces line losses for a given number of inputs due to the shorter length of column wire required. The trade-off is that column wires must handle twice the current capacity of an equivalent 2-D array. This demands wide column lines to handle such current capacity without risk of electromigration. But given that RRAM is integrated in the back end of the line, the minimum thickness of higher-layer routing wires may mitigate this.

Designing for deep-net mode opens up susceptibility to leakage currents through N1 during write mode, concurrently with N2 in read mode (see Fig. 1). The worst-case leakage occurs along the shared column line, when there is a minimal read current and a maximal write current. The read-array memristors will all be OFF, R_r = 100 kΩ, and all write-array memristors will be ON, R_s = 10 kΩ. To calculate the minimum value of V_read, we use 0.5 V as the maximum read voltage and assume an input of 7-bit resolution, which requires increments of approximately 500 mV/2^7 ≈ 4 mV. The output current of a single cell under these conditions was measured to be 39.6 nA, which is 1% off the ideal 40 nA.
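The resolution figures above follow from the Table I values by simple arithmetic, reproduced here as a check (the 4 mV and 40 nA figures in the text are rounded):

```python
V_READ_MAX = 0.5        # V, maximum read voltage (Table I)
BITS = 7                # assumed input resolution
R_RESET = 100e3         # ohm, OFF-state (reset) resistance (Table I)

lsb = V_READ_MAX / 2**BITS   # minimum read increment: ~3.9 mV (quoted as ~4 mV)
i_ideal = lsb / R_RESET      # ideal single-cell read current: ~39 nA (~40 nA)
```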
The accumulated leakage current through a column of 10 memristors being programmed (V_IN = 1.2 V; RE = LOW) was negligible in our experiments, and simulated to be approximately 2.5 pA per cell (i.e., 2.5 pA × 10 cells = 25 pA column current). This is approximately 6.3 × 10^-2 % of the worst-case read current, and so in the 180-nm process used, we are able to employ minimum transistor dimensions (W = 450 nm, L = 180 nm, W/L = 2.5). Leakage from a single cell as a function of a DC sweep at the input is measured in Fig. 3(c) and (d), with a Monte Carlo parametric sweep of resistance overlaid (R_s = 10 kΩ nominal).

V. DISCUSSION AND CONCLUSION
With two different modes available to the user, how would one decide to use one over the other? In short, deep-net mode is suited to tasks where speed is paramount, and expansion mode to tasks involving a large number of inputs, be it for increased resolution or to process a very large number of parameters in a shallow network.

Fig. 3. Experimental results (a) Pinched hysteresis loop of a single memristor (b) IR loss comparison in expansion mode (c) Worst-case leakage current through transistor N1 during a write cycle in deep-net mode with Monte Carlo parameter sweep of memristance (d) Extreme test case under large input write signal resulting in a nonlinear voltage drop across the memristor in deep-net mode.

Fig. 4. Transient analysis of a single cell current output during deep-net mode in a read cycle.

There are three primary motivations for the use of expansion mode: 1) fully-connected networks typically have a far larger number of connections than convolutional layers, and expansion mode doubles the number of possible inputs for a fixed area; 2) a larger number of inputs requires a greater length of column wire, and by using expansion mode to double up on inputs, we halve the wire resistance for a given number of inputs. More inputs per unit length of wire makes our crossbar resilient to write failures arising from line-resistance IR drops, reducing line losses by 22%; 3) the lack of reliable analog memory technology makes it hard to perform hardware multiplexing in the analog domain, and transmitting analog values over long distances or at high speed is not efficient. Restricting each memristor to one of two conductance values (i.e., single-bit memristors) means that one would require log2(n) memristors to represent n levels of precision.
For crossbars that use the conductance states of memristors conservatively, digital computing is more desirable, but it requires more memristors in a crossbar than analog computing for the same precision. Expansion mode facilitates this increase in devices whilst halving crossbar area. The drawback is that a larger current must be carried through the wire with more vias.

Deep-net mode is a novel processing scheme where the two crossbar layers are isolated from one another by appropriately biasing RE, in order to parallelize read and write operations. It is engaged where speed is of greater importance than precision. In the most simplistic view (ignoring max-pooling and dropout), a crossbar performs inference in the following way:
1) Write weights to memristor conductances,
2) Apply a sub-threshold read voltage,
3) Buffer or store the read-out signal in memory,
4) Write the next hidden layer weights to the crossbar,
5) Repeat steps 2–5 until the output is generated.
In deep-net mode, CrossStack performs steps 2 and 4 simultaneously, which enables read and write operations to occur together. By the time an output is generated from step 3, it is ready to be processed by the next hidden layer in step 4, for a speed increase of 29% over an equivalent 2-D array.

The most prevailing issues in realizing large-scale crossbar arrays beyond our working prototype are mismatch and endurance. At this stage, our current error rate was 8%, which limits the number of bits that can be represented by a single cell. Subthreshold current was hardly an issue in our design, but Fig. 3(d) shows the nonlinearity of the voltage across the memristor under high write voltages (V_IN above the nominal write voltage), and leakages will become more prevalent at shorter channel lengths. Capacitive coupling as a result of high programming voltages into bit-lines that are reading out poses a degree of risk, but should be tolerable for higher metal layers given the wider spacing.
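One plausible accounting for the 29% figure, consistent with the Table I timings and assuming ten bit-serial reads per 10-bit convolution (our assumption for illustration, not stated explicitly in the text): the reads are hidden inside the concurrent write cycle.

```python
T_WRITE = 250e-9   # s, programming time per layer (Table I)
T_READ = 10e-9     # s, per-bit current read-out time (Table I)
BITS = 10          # assumed bit-serial reads per 10-bit convolution

sequential = T_WRITE + BITS * T_READ     # 2-D array: write, then read
pipelined = max(T_WRITE, BITS * T_READ)  # deep-net: reads overlap the write
speedup_pct = 100 * (sequential - pipelined) / sequential  # ~28.6%, i.e. ~29%
```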
Allowing for heat dissipation in a 3-D structure is an ongoing challenge and the subject of significant process-related research, often calling for external heat sinks. In general, the advantages outweigh the drawbacks, and CrossStack presents a promising methodology for reconfigurable inference acceleration to adapt to the various types of ANNs being deployed.

ACKNOWLEDGMENT
This work was supported by the National Research Foundation of Korea grant funded by the Korean government (MSIT) (No. 2020R1F1A1069381).
REFERENCES
[1] D. Mahajan, R. Girshick, V. Ramanathan, K. He, M. Paluri, Y. Li, A. Bharambe, and L. van der Maaten, "Exploring the limits of weakly supervised pretraining," Proc. of the European Conf. on Computer Vision (ECCV), pp. 181–196, 2018.
A. G. Howard et al., "MobileNets: Efficient convolutional neural networks for mobile vision applications," arXiv preprint, arXiv:1704.04861.
[2] K. H. Kim, S. Jo, S. Gaba, and W. Lu, "Nanoscale resistive memory with intrinsic diode characteristics and long endurance," Applied Physics Letters, vol. 96, no. 5, p. 053106, Feb. 2010.
[3] A. Kumar, M. Das, V. Garg, B. S. Sengar, M. T. Htay, S. Kumar, A. Kranti, and S. Mukherjee, "Forming-free high-endurance Al/ZnO/Al memristor fabricated by dual ion beam sputtering," Applied Physics Letters, vol. 110, no. 25, p. 253509, Jun. 2017.
[4] C.-Y. Lin et al., "Adaptive synaptic memory via lithium ion modulation in RRAM devices," Small, vol. 16, no. 42, p. 2003964, Oct. 2020.
[5] S. Pi, C. Li, H. Jiang, W. Xia, H. Xin, J. J. Yang, and Q. Xia, "Memristor crossbar arrays with 6-nm half-pitch and 2-nm critical dimension," Nature Nanotechnology, vol. 14, no. 1, p. 35, Jan. 2019.
[6] E. J. Fuller, S. T. Keene, A. Melianas, Z. Wang, S. Agarwal, Y. Li, Y. Tuchman, C. D. James, M. J. Marinella, J. J. Yang, A. Salleo, and A. A. Talin, "Parallel programming of an ionic floating-gate memory array for scalable neuromorphic computing," Science, vol. 364, no. 6440, pp. 570–574, May 2019.
[7] E. J. Merced-Grafals, N. Dávila, N. Ge, R. S. Williams, and J. P. Strachan, "Repeatable, accurate, and high speed multi-level programming of memristor 1T1R arrays for power efficient analog computing applications," Nanotechnology, vol. 27, no. 36, p. 365202, Aug. 2016.
[8] C.-Y. Lin et al., "A high-speed MIM resistive memory cell with an inherent vanadium selector," Applied Materials Today, vol. 21, p. 100848, Dec. 2020.
[9] J. K. Eshraghian, K. Cho, C. Zheng, M. Nam, H. H. C. Iu, W. Lei, and K. Eshraghian, "Neuromorphic vision hybrid RRAM-CMOS architecture," IEEE Trans. Very Large Scale Integration (VLSI) Systems, vol. 26, no. 12, pp. 2816–2829, Dec. 2018.
[10] B. Chakrabarti, M. A. Lastras-Montaño, G. Adam, M. Prezioso, B. Hoskins, M. Payvand, A. Madhavan, A. Ghofrani, L. Theogarajan, K. T. Cheng, and D. B. Strukov, "A multiply-add engine with monolithically integrated 3D memristor crossbar/CMOS hybrid circuit," Scientific Reports, vol. 7, p. 42429, Feb. 2017.
[11] F. Cai, J. M. Correll, S. H. Lee, Y. Lim, V. Bothra, Z. Zhang, M. P. Flynn, and W. D. Lu, "A fully integrated reprogrammable memristor-CMOS system for efficient multiply-accumulate operations," Nature Electronics, vol. 2, Jul. 2019.
[12] M. R. Azghadi et al., "Complementary metal-oxide semiconductor and memristive hardware for neuromorphic computing," Advanced Intelligent Systems, vol. 2, no. 5, Mar. 2020.
[13] C. Lammie and M. R. Azghadi, "MemTorch: A simulation framework for deep memristive cross-bar architectures."
[14] O. Krestinskaya, K. N. Salama, and A. P. James, "Analog backpropagation learning circuits for memristive crossbar neural networks."
[15] A. Serb, J. Bill, A. Khiat, R. Berdan, R. Legenstein, and T. Prodromakis, "Unsupervised learning in probabilistic neural networks with multi-state metal-oxide memristive synapses," Nature Communications, vol. 7, p. 12611, Sep. 2016.
[16] C. J. Chevallier, C. H. Siau, S. F. Lim, S. R. Namala, M. Matsuoka, B. L. Bateman, and D. A. Rinerson, "A 0.13 µm 64Mb multi-layered conductive metal-oxide memory," pp. 260–261, Feb. 2010.
[17] C. H. Wang, Y. H. Tsai, K. C. Lin, M. F. Chang, Y. C. King, C. J. Lin, S. S. Sheu, Y. S. Chen, Y. H. Lee, F. T. Chen, and M. J. Tsai, "Three-dimensional 4F² ReRAM cell with vertical BJT driver by CMOS logic compatible process," IEEE Trans. on Electron Devices, vol. 58, no. 8, pp. 2466–2472, Jul. 2011.
[18] S. H. Jo, T. Kumar, S. Narayanan, W. D. Lu, and H. Nazarian, "3D-stackable crossbar resistive memory based on field assisted superlinear threshold (FAST) selector," Washington, DC, USA, pp. 6–7, Dec. 2014.
[19] J. Hong, M. Stone, B. Navarrete, K. Luongo, Q. Zheng, Z. Yuan, K. Xia, N. Xu, J. Bokor, L. You, and S. Khizroev, "3D multilevel spin transfer torque devices," Applied Physics Letters, vol. 112, no. 11, p. 112402, Mar. 2018.
[20] M. R. Azghadi et al., "Hardware implementation of deep network accelerators towards healthcare and biomedical applications," IEEE Trans. Biomedical Circuits and Systems, vol. 14, no. 6, pp. 1138–1159, Dec. 2020.
[21] S. Baek, J. K. Eshraghian, S.-H. Ahn, A. James, and K. Cho, "A memristor-CMOS Braun multiplier array for arithmetic pipelining."
[22] R. Fastow, K. Hasnat, P. Majhi, and O. Jungroth, "Three-dimensional (3D) memory with shared control circuitry using wafer-to-wafer bonding," United States Patent Application US 16/011,139, Feb. 2019.
[23] J. K. Eshraghian, K. R. Cho, H. H. C. Iu, T. Fernando, N. Iannella, S. M. Kang, and K. Eshraghian, "Maximization of crossbar array memory using fundamental memristor theory," IEEE Trans. on Circuits and Syst. II: Express Briefs, vol. 64, no. 12, pp. 1402–1406, Oct. 2017.
[24] C. Li, M. Hu, Y. Li, H. Jiang, N. Ge, E. Montgomery, J. Zhang, W. Song, N. Davila, C. E. Graves, Z. Li, J. P. Strachan, P. Lin, Z. Wang, M. Barnell, Q. Wu, R. S. Williams, J. J. Yang, and Q. Xia, "Analogue signal and image processing with large memristor crossbars," Nature Electronics, vol. 1, no. 1, pp. 52–59, Jan. 2018.
[25] J. K. Eshraghian, S. M. Kang, S. Baek, G. Orchard, H. H. C. Iu, and W. Lei, "Analog weights in ReRAM DNN accelerators," Mar. 2019.
[26] M. Hu, J. P. Strachan, Z. Li, E. M. Grafals, N. Davila, C. Graves, S. Lam, N. Ge, J. J. Yang, and R. S. Williams, "Dot-product engine for neuromorphic computing: Programming 1T1M crossbar to accelerate matrix-vector multiplication," Proc. of the 53rd Annual Design Automation Conference, p. 19, Jun. 2016.
[27] A. Shafiee, A. Nag, N. Muralimanohar, R. Balasubramonian, J. P. Strachan, M. Hu, R. S. Williams, and V. Srikumar, "ISAAC: A convolutional neural network accelerator with in-situ analog arithmetic in crossbars," ACM SIGARCH Computer Architecture News, vol. 44, no. 3, pp. 14–26, Oct. 2016.
[28] S. Stathopoulos, A. Khiat, M. Trapatseli, S. Cortese, A. Serb, I. Valov, and T. Prodromakis, "Multibit memory operation of metal-oxide bi-layer memristors," Scientific Reports, vol. 7, no. 17532, Dec. 2017.
[29] L. Song, X. Qian, H. Li, and Y. Chen, "PipeLayer: A pipelined ReRAM-based accelerator for deep learning," pp. 541–552, Feb. 2017.
[30] H. Valavi, P. J. Ramadge, E. Nestler, and N. Verma, "A 64-tile 2.4-Mb in-memory-computing CNN accelerator employing charge-domain compute," IEEE Journal of Solid-State Circuits, vol. 54, no. 6, pp. 1789–1799, Mar. 2019.
[31] G. C. Adam, B. D. Hoskins, M. Prezioso, F. Merrikh-Bayat, B. Chakrabarti, and D. B. Strukov, "3-D memristor crossbars for analog and neuromorphic computing applications."