A Device Non-Ideality Resilient Approach for Mapping Neural Networks to Crossbar Arrays
Arman Kazemi*, Cristobal Alessandri†, Alan C. Seabaugh†, X. Sharon Hu*, Michael Niemier*, Siddharth Joshi*
*Department of Computer Science and Engineering, University of Notre Dame
†Department of Electrical Engineering, University of Notre Dame
[email protected]
Abstract—We propose a technology-independent method, referred to as the adjacent connection matrix (ACM), to efficiently map signed weight matrices to non-negative crossbar arrays. When compared to same-hardware-overhead mapping methods, using ACM leads to improvements of up to 20% in training accuracy for ResNet-20 with the CIFAR-10 dataset when training with 5-bit precision crossbar arrays or lower. When compared with strategies that use two elements to represent a weight, ACM achieves comparable training accuracies, while also offering area and read energy reductions of 2.3× and 7×, respectively. ACM also has a mild regularization effect that improves inference accuracy in crossbar arrays without any retraining or costly device/variation-aware training.

I. INTRODUCTION
Automated analysis of vast amounts of data can potentially revolutionize governance, manufacturing, medicine, and many other fields. Over the past decade, increasingly complex deep neural network (DNN) models have been proposed as a means to perform such automated analysis. The cost to train and deploy such models has grown along with model complexity, leading to a need for hardware platforms that are energy-efficient and low-latency [1].

A promising avenue for hardware research that addresses these challenges is based on analog domain computation of matrix-vector multiplication (MVM), a critical kernel in the forward pass of DNN training as well as inference. One family of accelerators uses analog crossbar (XBar) arrays, which could be composed of emerging devices such as resistive random access memories (RRAM) [2], phase change memories (PCM) [3], and ferroelectric field-effect transistors (FeFET) [4], for highly parallel MVMs. A XBar array represents the input vector as analog voltages. Applying these voltages to the rows of the XBar array (where weights are stored as conductances at row/column crosspoints) induces a current along the XBar columns. The current of each column represents the dot product of the input vector and the weight vector represented by the synapse devices on the column [3].

While XBar arrays efficiently implement MVMs and offer many performance advantages in energy and latency, their use poses many practical challenges. One such challenge is inherent in representing weights as conductance values. This constrains XBar arrays to use non-negative conductance values to implement arbitrary signed MVMs. As shown in Fig. 1, two approaches have been widely adopted: (i) differential encoding, where two elements are used to represent one weight (a double element (DE) approach) [5], [6], and (ii) a constant bias to remap values in the range (−w, w) to (0, w) (a bias column (BC) approach) [7], [8]. Another challenge with performing MVMs via XBar arrays arises due to the limitations of devices employed for synapse elements, e.g., RRAMs [2], PCMs [3], FeFETs [4], etc., with respect to achievable weight resolution and weight update linearity, which in turn can adversely impact training accuracy. Furthermore, methods to overcome these issues, e.g., using multiple synapse elements for a single weight [9], further reduce the energy and area savings that might otherwise be obtained from XBar arrays. Moreover, the accuracy of models deployed on XBar arrays for inference is further degraded by device variation [7], [10].

As stated in [11], to capitalize on the benefits offered by XBar arrays, breakthroughs in material development or architecture design are needed. Within this context, this paper presents a method, referred to as the adjacent connection matrix (ACM), for efficiently mapping signed MVMs to XBar arrays. By learning the most effective representation of a weight through a combination of a XBar array column and its immediate neighbor, ACM increases the effective dynamic range of weight representations. This nearest-neighbor coupling also introduces a mild regularization effect that improves resilience to device variation. ACM has been evaluated with the MNIST [12] and CIFAR-10 [13] datasets, and results indicate that using ACM can lead to (i) up to 20% improvements in training accuracy when compared to other strategies under equivalent resource constraints (i.e., the BC approach), (ii) comparable training accuracies coupled with area reductions of 2.3× and read energy reductions of 7× when compared to the DE approach, and (iii) a 10% average improvement compared to both DE and BC in the inference accuracy of a VGG network trained on the CIFAR-10 dataset with 3-bit precision XBar arrays, assuming a 15% device variation.

The remainder of the paper is organized as follows: Section II reviews strategies for mapping signed MVMs to XBar arrays. Section III presents ACM and a method to evaluate and compare it with other mappings. Section IV discusses the results of our simulations and quantifies DNN accuracy improvements under resource constraints and device variation; system-level evaluations of energy, area, and delay are also presented. Section V concludes by summarizing our findings.

II. BACKGROUND
There are two approaches typically used to map an MVM to a XBar array; these can be implemented in both the analog and the digital domain.

Fig. 1: Prior approaches to map DNN models to XBar arrays; (a) Double Element (DE): using two resistive elements to represent one weight; (b) Bias Column (BC): using a single column of resistive elements as a reference.

The first approach, i.e., DE, shown in Fig. 1a, uses two XBar array columns to represent one weight column [14], [15]. With this approach, the difference between column-pairs in the XBar array represents the equivalent signed weighted sum. The second case is an input-dependent bias approach, i.e., BC [7], [8]. As illustrated in Fig. 1b, a single XBar array column is used as a reference to implement the bias. The conductance of each element in this column is fixed to the middle of the conductance range. The output from this column is subtracted from the output of all other columns to compute the signed weighted sum.

In order to perform an MVM with XBar arrays with these mapping methods, the outputs of the MVM are digitized in the periphery of the XBar array [3]. The operational overhead of the mappings is the additions and subtractions performed after digitization. Since both mappings require a single subtraction per weight, this overhead is the same for both approaches. The hardware overhead due to the number of elements used for each mapping is evaluated in Section IV.

Note that, if the conductance values of the XBar array elements are limited to the range [G_min, G_max] (for simplicity, we assume G_min = 0), the weights of BC will be in the range [−G_max/2, G_max/2], with the conductance values of the bias column elements fixed to G_max/2. For DE, the range of weights will be [−G_max, G_max], at the expense of using twice as many weight elements, while representing twice as many weight values as BC. In short, DE utilizes 2× the synapse elements and gains a 2× wider dynamic range of weight representation compared to the BC approach. However, in both the DE and BC approaches, an MVM with non-negative weights is performed in the XBar array, followed by a simple (linear) combination of the outputs from its columns to obtain an equivalent MVM with signed weights. A critical aspect of the effectiveness of these mapping solutions is the simplicity of the linear transforms implemented. These transforms consist of the addition and subtraction of values at the XBar periphery.
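To make the two mappings concrete, the following minimal NumPy sketch (ours, not from the paper) shows how the non-negative column outputs of a XBar array are combined at the periphery under DE and BC; the array sizes and conductance values are arbitrary placeholders.

```python
import numpy as np

n_in, n_out, G_max = 8, 4, 1.0
x = np.random.rand(n_in)                      # input voltages applied to the rows

# DE: two non-negative columns (G_pos, G_neg) per signed output column;
# the periphery subtracts the paired column outputs.
G_pos = np.random.rand(n_in, n_out) * G_max
G_neg = np.random.rand(n_in, n_out) * G_max
y_de = x @ G_pos - x @ G_neg                  # signed weights span [-G_max, G_max]

# BC: one data column per output plus a single reference column whose
# elements are fixed to the middle of the conductance range (G_max / 2).
G_bc   = np.random.rand(n_in, n_out) * G_max
g_bias = np.full(n_in, G_max / 2)
y_bc = x @ G_bc - x @ g_bias                  # signed weights span [-G_max/2, G_max/2]
```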
Fig. 2: ACM computes the outputs of a signed MVM as a combination of the outputs of adjacent columns with alternating signs.

III. ADJACENT CONNECTION MATRIX

We extend the idea of using simple linear transforms in the periphery of the XBar array circuit to map an MVM onto XBar arrays and present a new method called the adjacent connection matrix (ACM). After a brief qualitative introduction of the ACM and its benefits over other mapping methods, i.e., DE and BC, we provide a formal definition of the different mapping techniques (DE, BC, and ACM) and show how a signed MVM can be realized with a non-negative matrix representing the XBar array and a constant signed matrix representing each mapping. Finally, we show the regularization effect of ACM using its formal definition.
A. Adjacent Connection Matrix Concept
In contrast with (i) the DE approach, which uses two elements per weight (each a reference for the other), and (ii) the BC approach, which uses a fixed reference column per MVM, the ACM approach uses each column as a reference for its immediate neighbor. ACM computes the outputs of a signed MVM as a combination of the outputs of neighboring columns of the XBar array with alternating signs, as shown in Fig. 2. Like BC, ACM requires only one additional column. This compares favorably against DE, which requires two elements per weight, and therefore requires almost double the number of synapse elements in large XBar arrays. It is noteworthy that the operational overhead of ACM is the same as that of BC and DE, since it requires a single subtraction for each weight. Furthermore, ACM provides a mild regularization effect which increases resilience against device variation.
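Under the adjacent-difference reading of Fig. 2, the periphery simply subtracts each column output from that of its immediate neighbor. A minimal sketch (ours, with placeholder sizes):

```python
import numpy as np

n_in, n_out = 8, 4
G = np.random.rand(n_in, n_out + 1)     # non-negative conductances, one extra column
x = np.random.rand(n_in)                # input voltages

col_out = x @ G                         # per-column dot products computed in the array
y_acm = col_out[:-1] - col_out[1:]      # ACM periphery: adjacent columns subtracted,
                                        # so each internal column enters two outputs
                                        # with alternating signs
```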
B. Formal Definition of Different Mappings
ACM, DE, and BC all combine the outputs from the columns of the XBar array in a fixed and predefined way. This combination of columns is comprised of additions and subtractions only and can be thought of as a matrix with its non-zero entries limited to ±1. We refer to this matrix as the periphery matrix. The three different mappings and their corresponding unique periphery matrices are presented in Figs. 1 and 2.

Fig. 3: An example in the context of fully-connected layers, where the original matrix is decomposed as a sequence of a non-negative matrix M, followed by a periphery matrix S; (a) signed matrix W of a fully-connected layer with N_I input nodes and N_O output nodes; (b) equivalent non-negative matrix M (with N_D dummy nodes) and the periphery matrix S.

In order to verify that the periphery matrix holds as a general technique to map MVMs onto XBar arrays, we first demonstrate the decomposition of a signed MVM into (i) a non-negative matrix that is stored on the XBar array and (ii) a fixed, signed matrix. We then characterize the requirements of this matrix and demonstrate that it can be implemented through addition and subtraction operations at the periphery of the XBar array.

Consider an arbitrary signed matrix W with dimensions N_O × N_I. To constrain the multiplication to non-negative weights only, matrix W is factored into a matrix with non-negative elements M, followed by a periphery matrix S, i.e.,

\[ S M = W, \quad M \geq 0, \tag{1} \]

where M ≥ 0 means that all elements of M are non-negative. Within the context of DNN layers, W is the weight matrix of a fully-connected layer with N_I inputs and N_O outputs, as shown in Fig. 3a. For ease of visualization, we define a dummy layer Y with N_D neurons. Fig. 3b shows the layer mapped to a XBar array, where M is the non-negative matrix with dimensions N_D × N_I, and S is a periphery matrix of dimensions N_O × N_D. While we have used a fully-connected layer from a DNN in our example, all linear transforms, including convolutions, are possible through ACM.

There are two properties we desire from S: (i) a fixed S must guarantee that any multiplication using a signed matrix W can be realized using a non-negative matrix M and a fixed signed matrix S, and (ii) S must be of a form that does not impose large hardware implementation costs.
C. Sufficient Conditions of a Periphery Matrix

As before, given a W with dimensions N_O × N_I, S can be assigned dimensions N_O × N_D, and M the dimensions N_D × N_I. Formulated independently for each column of W and M, this can be expressed as:

\[ S m_k = w_k, \quad m_k \geq 0, \tag{2} \]

where m_k and w_k are the k-th columns of M and W, respectively, and k ∈ {1, 2, . . . , N_I}. In this section we examine the transpose of the matrices to simplify the solution of the equations and the explanations.
A necessary condition for the existence of a solution to Eq. (2) is that w_k is in the column space of S. This condition will be satisfied for any arbitrary w_k if and only if rank(S) = N_O. The sufficient condition for the existence of a non-negative solution is the existence of a vector x_h in the null space of S with strictly positive elements. This guarantees that any particular solution x_p to the system S m_k = w_k can be shifted as x′_p = x_p + α x_h to be non-negative. The sufficient conditions are summarized as:

\[ \text{1. } \mathrm{rank}(S) = N_O, \qquad \text{2. } \exists\, x_h > 0,\; x_h \in \mathbb{R}^{N_D}, \text{ s.t. } S x_h = 0. \tag{3} \]

If these conditions are met, the signed matrix W can be decomposed into a non-negative matrix M and a periphery matrix S, such that W = S M. The relations N_D = rank(S) + nullity(S) and nullity(S) ≥ 1 hold if there is at least one element (x_h) in the null space of S. Therefore, M with N_D columns has at least one more column than W with N_O columns. A particular case that satisfies the second condition of Eq. (3) is x_h = 1 (the all-ones vector), which implies that the elements of the rows of S add up to 0. Thus, an S such that neighboring columns are subtracted from each other, introduced earlier as the ACM, satisfies both conditions, i.e., one extra column in M and the sum of the rows of S equal to 0. Note that S also has all the properties listed as desirable in Section III-B.
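The construction above can be turned directly into a decomposition routine. The sketch below (our illustration, not the authors' code) builds the ACM periphery matrix, obtains a particular solution by a reverse cumulative sum, and uses the all-ones null-space vector x_h to shift each column of M into the non-negative orthant:

```python
import numpy as np

def acm_periphery(n_out):
    # S has shape (n_out, n_out + 1); row j is +1 at column j and -1 at column j + 1,
    # so rank(S) = n_out and S @ ones == 0 (the conditions of Eq. (3)).
    S = np.zeros((n_out, n_out + 1))
    idx = np.arange(n_out)
    S[idx, idx], S[idx, idx + 1] = 1.0, -1.0
    return S

def decompose_acm(W):
    """Factor a signed W (n_out x n_in) as W = S @ M with M >= 0."""
    n_out, n_in = W.shape
    # Particular solution: reverse cumulative sum down each column, padded with a zero row.
    M = np.vstack([np.flipud(np.cumsum(np.flipud(W), axis=0)), np.zeros((1, n_in))])
    # Shift by a (column-wise) multiple of the all-ones null-space vector x_h.
    M -= np.minimum(M.min(axis=0, keepdims=True), 0.0)
    return M

W = np.random.randn(4, 3)
M = decompose_acm(W)
assert np.all(M >= 0) and np.allclose(acm_periphery(4) @ M, W)
```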
D. Analysis of Different Mappings

Using the periphery matrix decomposition discussed above, we can derive not only the ACM mapping but also the DE and BC mappings. We can observe in Figs. 1 and 2 that all three approaches satisfy the conditions stated in Eq. (3). In all cases, each row in the periphery matrix has two nonzero elements (1 and −1), hence the elements of each row add up to 0. Furthermore, N_D ≥ N_O + 1 holds for all three cases. However, DE has N_D = 2N_O columns, whereas BC and ACM have the minimum number of columns, N_D = N_O + 1. Therefore, BC and ACM require minimal additional hardware resources.

Furthermore, we assume that the elements of M have a conductance range of [G_min, G_max] (again, for simplicity, we assume G_min = 0). Thus, by using the ACM approach, addition and subtraction of neighboring elements can result in the representation of weights over the range [−G_max, G_max] while using the same hardware resources as BC. That said, while DE can always demonstrate the full range of weights, ACM is limited by having to balance DNN accuracy and weight range, as neighboring columns are not guaranteed to have a large disparity in weights. Section IV quantifies the effect of these different mappings on system-level DNN training and inference accuracy in the presence of non-ideal XBar array synapse devices.
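As a concrete illustration (ours, consistent with the descriptions of Figs. 1 and 2 above), for N_O = 2 outputs the three periphery matrices take the form

\[
S_{DE} = \begin{bmatrix} 1 & -1 & 0 & 0 \\ 0 & 0 & 1 & -1 \end{bmatrix}, \qquad
S_{BC} = \begin{bmatrix} 1 & 0 & -1 \\ 0 & 1 & -1 \end{bmatrix}, \qquad
S_{ACM} = \begin{bmatrix} 1 & -1 & 0 \\ 0 & 1 & -1 \end{bmatrix},
\]

where each row sums to zero and each matrix has rank N_O = 2; DE uses N_D = 4 XBar columns, while BC and ACM use the minimum N_D = 3.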
E. Regularization Effect of Adjacent Connection Matrix

The nearest-neighbor coupling induced by the ACM approach naturally leads to the question: how does the ACM approach constrain the neighboring weights, and what is the consequence of such a constraint? In this subsection we examine the effects of these constraints through the lens of regularization. Let us denote the sum of all the elements of the j-th column of the XBar array (i.e., the j-th row of M) as M̄_j = Σ_{i=1}^{N_I} M_{ji}. Inserting this expansion into Eq. (1) and explicitly writing out the values in the periphery matrix S of ACM (Fig. 2) leads to the following expression:

\[ \sum_{i=1}^{N_I}\sum_{j=1}^{N_O} W_{ij} = (\bar{M}_1 - \bar{M}_2) + (\bar{M}_2 - \bar{M}_3) + \cdots + (\bar{M}_{N_O} - \bar{M}_{N_O+1}) = \bar{M}_1 - \bar{M}_{N_O+1}. \tag{4} \]

Eq. (4) demonstrates that, for any non-negative matrix M trained with the ACM approach, the total sum of the signed weights is determined entirely by the first and last columns of M. For a quantized matrix whose elements are restricted to B representable states, each element can only be assigned one of B values. Thus, for any index j, the column sum M̄_j can take one of approximately N_I × B different values. Consequently, for quantized matrices, Eq. (4) constrains Σ_{i=1}^{N_I} Σ_{j=1}^{N_O} W_ij to approximately 2 × N_I × B − 1 values. When each element in M has a small set of possible values (i.e., when B is smaller), this constraint is tighter, leading to a regularization effect when training with ACM. As the results in Section IV-B indicate, the nearest-neighbor coupling of ACM increases device variation resilience in inverse relation with B (the smaller B, the higher the resilience). However, ACM-based training is not meant to replace standard regularization methods, e.g., L2 or dropout, which have a much stronger regularization effect.
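A quick numerical check of Eq. (4) (our sketch, with arbitrary sizes) confirms the telescoping behavior: for any non-negative M, the total sum of the decoded signed weights depends only on the first and last XBar columns.

```python
import numpy as np

n_in, n_out = 6, 4
M = np.random.rand(n_out + 1, n_in)                            # non-negative, one extra column
S = np.eye(n_out, n_out + 1) - np.eye(n_out, n_out + 1, k=1)   # ACM periphery matrix
W = S @ M                                                      # decoded signed weights

# Eq. (4): the sum of all W entries telescopes to M_bar_1 - M_bar_(N_O + 1).
assert np.isclose(W.sum(), M[0].sum() - M[-1].sum())
```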
IV. EVALUATION AND RESULTS

In this section, we first evaluate the training accuracy of the ACM method and compare it with alternative mappings. We follow this with an evaluation of inference accuracy on a pre-trained network with different mappings when the weights are subject to device variation. We have developed a model for neural network training using TensorFlow [16] that incorporates the non-idealities of the synapse devices. While training, matrix M is constrained to be non-negative and is followed by a periphery matrix that is defined as a fixed layer with values in {−1, 0, +1}, as depicted in Figs. 1 and 2. In our studies, we consider two non-ideal device characteristics that exist in virtually all physical synapse devices used for XBar arrays and impact the accuracy of DNNs trained on XBar arrays: (i) limited weight precision (i.e., the number of representable states) and (ii) non-linear weight update of XBar array synapse devices. For the former, we quantize the weights similar to [17], and for the latter, we present results for devices with symmetric increase/decrease steps [4], [18] (Fig. 4a) to isolate the effect of the non-linear weight update on ACM from the effect of the non-linearity on the learning rule. Since ACM is a linear transform, it is also compatible with learning rules tailored for devices with asymmetric weight update non-linearity [19]. We present results for activations quantized to 8 bits of resolution; results for 6-bit quantized activations followed the same trend and were omitted for brevity.

Fig. 4: (a) Up/down symmetric non-linearity observed in [4], [18] and assumed in training with non-linear weight update (conductance vs. pulse number for potentiation and depression); (b) Gaussian distribution used to model device variation for a 1-bit (2 state) device, observed in [10].

To evaluate DNN training accuracy when using ACM, we train a variant of LeNet [20] with the MNIST dataset; we also train a VGG-9 [21] network (with 6 convolutional layers and 3 fully-connected layers) and a ResNet20 network [22] with the CIFAR-10 dataset using vanilla stochastic gradient descent. We train four types of models for the above networks: (i) a baseline model, i.e., the original network trained with signed weights, (ii) DE, i.e., a network trained using non-negative weights and the periphery matrix in Fig. 1a, (iii) BC, i.e., a network trained using non-negative weights and the periphery matrix in Fig. 1b, and (iv) ACM, i.e., the network trained using non-negative weights and the periphery matrix in Fig. 2. We also evaluate the impact of device variation on the inference accuracy of the VGG-9 network trained with the CIFAR-10 dataset when different mappings are used. Variation is modeled as a zero-mean normal distribution [10], as depicted in Fig. 4b, and is added to the desired conductance value. Finally, we investigate the system-level characteristics of a XBar-based accelerator that uses different mappings with the NeuroSim+ [7] tool in terms of area, delay, and energy.
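The training model described above can be sketched as a custom Keras layer: a non-negative matrix M (enforced with a NonNeg constraint) followed by a constant periphery matrix. This is our minimal illustration of the idea, not the authors' implementation; the ACMDense class name is hypothetical, and the sketch omits weight quantization, the non-linear update model, and device variation.

```python
import numpy as np
import tensorflow as tf

def acm_periphery(n_out):
    # Constant periphery matrix with entries in {-1, 0, +1}, laid out as
    # (n_out + 1, n_out) so that the layer output is (x @ M) @ S.
    S = np.zeros((n_out + 1, n_out), dtype=np.float32)
    idx = np.arange(n_out)
    S[idx, idx], S[idx + 1, idx] = 1.0, -1.0
    return S

class ACMDense(tf.keras.layers.Layer):
    """Fully-connected layer factored as a non-negative M followed by a fixed S."""
    def __init__(self, units, **kwargs):
        super().__init__(**kwargs)
        self.units = units
        self.S = tf.constant(acm_periphery(units))       # fixed, untrainable periphery

    def build(self, input_shape):
        self.M = self.add_weight(
            name="M",
            shape=(int(input_shape[-1]), self.units + 1),  # one extra XBar column
            initializer="random_uniform",
            constraint=tf.keras.constraints.NonNeg(),      # M >= 0, as in Eq. (1)
            trainable=True)

    def call(self, inputs):
        return tf.matmul(tf.matmul(inputs, self.M), self.S)

# Example usage: y = ACMDense(10)(tf.random.uniform((2, 64)))   # output shape (2, 10)
```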
A. Neural Network Training Accuracy
Figs. 5a and 5e show training and test accuracies for the LeNet and ResNet20 networks with single-precision floating-point (FP32) weights, respectively. The three mapping schemes achieve results equivalent to the baseline model. Furthermore, the training and test errors follow similar trajectories as a function of the number of epochs. This is consistently observed for different networks and training conditions for the two datasets. These observations validate our analysis in Section III, experimentally showing that the three decompositions (DE, BC, ACM) can achieve similar accuracy in DNN training when there are no constraints on weight precision during training. Note that while ACM provides identical test accuracy with FP32 weights, its training accuracy is lower than that of DE and BC in Fig. 5e. This is due to the mild regularization effect of ACM discussed in Section III-E.

Fig. 5: Training results with the MNIST and CIFAR-10 datasets. (a) and (e) are results from training with FP32 weights and activations; the solid and dashed lines are test and training accuracies, respectively. (b), (c), and (d) are results from training with limited weight precision with linear weight update and 8-bit activations. (f), (g), and (h) show results for training with limited weight precision with non-linear weight update and 8-bit activations. The grey area illustrates the bit precisions that have not been demonstrated experimentally at the array scale for synapse devices.

Figs. 5b, 5c, and 5d show the effect of limited XBar array weight precision on training DNNs on the MNIST and CIFAR-10 tasks. The test errors are shown as a function of the number of bits of weight precision. Since there exist no array-level experiments demonstrating XBar array synapse devices with weight precision higher than 5 bits, we focus our study on weights from 2-6 bits. For precision lower than 6 bits, it can be observed that the error of DE is lower than that of the other mappings. As one would expect, this is because DE uses twice the number of elements as BC and ACM and consequently has twice the range in weight representation. When using ACM, some of the resolution lost through BC is recovered, placing its accuracy between BC and DE. The ACM encoding distributes the weight value over two columns to provide better tolerance to limited resolution compared to the BC approach. Thus, at resource parity, ACM provides a resolution advantage over the BC approach. At precision higher than 5 bits, for DNNs trained on the MNIST dataset, training accuracy saturates to the values achieved through FP32 training. However, results on the CIFAR-10 dataset (Figs. 5c and 5d) start to diverge from the FP32 results for precision lower than 8 bits due to the higher complexity of the task.

When training with non-linear weight update (Figs. 5f, 5g, and 5h), the differences between the DE and BC approaches become even more apparent. As before, the error when training with ACM is lower than the error obtained when training with BC, approaching the error of DE. However, the accuracy improvement is much more dramatic due to the disparate accuracy impact of the non-linear weight update. These results show that ACM consistently improves upon the accuracy of BC while using the same hardware resources. The VGG-9 network is overparameterized and better offsets the non-linearity compared to the ResNet20 network; therefore, the accuracy decline starts at 5 bits in Fig. 5g rather than 6 bits in Fig. 5h. The largest gain is obtained for non-linear weights when training with 5-bit precision XBar arrays or lower: e.g., for ResNet20, effectively 2 bits of weight resolution are recovered, leading to a 20% improvement in accuracy.
B. Effects of Device Variation on Neural Network Inference

We evaluate the inference accuracy of a VGG-9 network trained on CIFAR-10 under device variation when operating with different weight precisions. After training, variation is added to the trained model weights and the inference accuracy is evaluated without any fine-tuning. Fig. 6 shows the results averaged over 25 samples per data point for 4 different precision values.

Fig. 6: Effects of device variation (sigma of variation, %) on the inference accuracy of a VGG-like network trained with different mapping approaches and bit precisions on the CIFAR-10 dataset.

There is a disparity in DNN inference accuracy even when not subjected to any device variation (e.g., see Fig. 5e for the difference in quantized DNNs trained on CIFAR-10 with DE, BC, and ACM). Results on inference with added variation show that this initial disparity is dramatically increased and that BC consistently performs worse than the other two mappings regardless of the bit precision. Due to limited space, we only show results for 1-bit, 3-bit, 4-bit, and 6-bit weights; the 2-bit and 5-bit weights follow the expected trends. ACM performs better than DE and BC for 1-, 2-, and 3-bit weights, and DE outperforms the other mapping methods for 4-, 5-, and 6-bit weights. The improvement in inference accuracy when using ACM at lower bit precision may appear counterintuitive. However, it can be explained by the regularization effect of ACM discussed in Section III-E. At lower bit precision, the constraint is tighter, while it is relaxed at higher bit precision. This constraint strengthens the network against device variation.
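The evaluation procedure described above can be sketched as a simple Monte Carlo loop over perturbed copies of the trained conductances. This is our illustration, not the authors' code; it assumes sigma is specified as a percentage of the maximum conductance and that perturbed values are clipped back to the physical range, neither of which is stated explicitly in the paper.

```python
import numpy as np

def add_device_variation(M, sigma_pct, g_max=1.0, rng=np.random.default_rng()):
    """Add zero-mean Gaussian conductance variation (Fig. 4b) to a trained non-negative M."""
    noisy = M + rng.normal(0.0, sigma_pct / 100.0 * g_max, size=M.shape)
    return np.clip(noisy, 0.0, g_max)       # assumption: conductances stay within [0, G_max]

# Hypothetical evaluation loop, averaging accuracy over 25 noise samples per data point:
# accs = [evaluate(set_weights(model, add_device_variation(M, 15.0))) for _ in range(25)]
# print(np.mean(accs))
```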
C. System-Level Evaluation
Table I shows system-level results for the three mapping approaches generated using the NeuroSim+ [7] tool with its default parameters at the 14nm technology node. The assumed peripherals include the MUX, ADC, word-line decoder, bit-line and select-line switch matrices, adders, and shift registers. The read energy and latency values are for one epoch of training a two-layered multi-layer perceptron (MLP) network on a XBar-based hardware accelerator. Read energy, area, and read delay values for the BC and ACM approaches are exactly the same, as there is practically no difference in their hardware resource utilization. The read energy of DE is 7× more than that of ACM due to the longer wires for the rows of the XBar array. DE uses 2.3× the XBar area of ACM, as it requires twice as many elements; its peripherals are also larger and require more area. Furthermore, DE has a 1.33× higher read delay due to the additional columns that need to be multiplexed by the associated peripherals.

TABLE I: System-level results of the three mapping approaches for training a two-layered MLP on XBar arrays.

Mapping              | BC    | DE     | ACM
XBar Area (µm²)      | 914   | 2088   | 914
Periphery Area (µm²) | 157   | 246    | 157
Read Energy (µJ)     | 2.402 | 14.408 | 2.402
Read Delay (ms)      | 0.240 | 0.318  | 0.240

V. CONCLUSION

We introduced the ACM mapping method to mitigate the effects of limited weight resolution and non-linearity on neural network training while incurring minimal hardware overhead. We demonstrated, both mathematically and by simulation, that ACM is a general approach and represents the same MVMs as previous mapping approaches. Neural network training evaluations with limited resolution and non-linearity show that ACM consistently improved upon the accuracy of BC while using the same hardware resources. The largest gain was obtained for non-linear weights when training with 5-bit precision XBar arrays or lower: effectively 2 bits of weight resolution were recovered, leading to a 20% accuracy improvement. Compared to DE, ACM achieves comparable training accuracies while reducing the read energy consumption by 7× and the area by 2.3×. Furthermore, the regularization effect of ACM makes it resilient to device variation. Assuming a 15% device variation, ACM improves the inference accuracy of a VGG network trained on the CIFAR-10 dataset with 3-bit precision XBar arrays by an average of 10% compared to other mappings.

ACKNOWLEDGMENT

This work was supported in part by ASCENT, one of six centers in JUMP, a SRC program sponsored by DARPA under task ID 2776.043.

REFERENCES
[1] S. Han et al., "EIE: Efficient inference engine on compressed deep neural network," in ISCA, 2016.
[2] H. P. Wong et al., "Metal–oxide RRAM," Proceedings of the IEEE, 2012.
[3] G. W. Burr et al., "Neuromorphic computing using non-volatile memory," Advances in Physics: X, 2017.
[4] M. Jerry et al., "A ferroelectric field effect transistor based synaptic weight cell," Journal of Physics D: Applied Physics, 2018.
[5] S. Ambrogio et al., "Equivalent-accuracy accelerated neural-network training using analogue memory," Nature, 2018.
[6] T. Gokmen and Y. Vlasov, "Acceleration of deep neural network training with resistive cross-point devices," Frontiers in Neuroscience, 2016.
[7] P.-Y. Chen et al., "NeuroSim+: An integrated device-to-algorithm framework for benchmarking synaptic devices and array architectures," in IEDM, 2016.
[8] C. C. Chang et al., "Mitigating asymmetric nonlinear weight update effects in hardware neural network based on analog resistive synapse," IEEE Trans. Emerg. Sel. Topics Circuits Syst., 2017.
[9] P. Chi et al., "PRIME: A novel processing-in-memory architecture for neural network computation in ReRAM-based main memory," in ACM SIGARCH Computer Architecture News, 2016.
[10] Y. Lin et al., "Demonstration of generative adversarial network by intrinsic random noises of analog RRAM devices," in IEDM, 2018.
[11] W. Haensch, T. Gokmen, and R. Puri, "The next generation of deep learning hardware: Analog computing," Proceedings of the IEEE, 2018.
[12] Y. LeCun and C. Cortes, "MNIST handwritten digit database," 2010.
[13] A. Krizhevsky and G. Hinton, "Learning multiple layers of features from tiny images," tech. rep., Citeseer, 2009.
[14] G. W. Burr et al., "Experimental demonstration," IEEE T-ED.
[15] P. Narayanan et al., "Reducing circuit design complexity for neuromorphic systems based on non-volatile memory," in ISCAS, 2017.
[16] M. Abadi et al., "TensorFlow: A system for large-scale machine learning," in OSDI, 2016.
[17] S. Zhou et al., "DoReFa-Net: Training low bitwidth convolutional neural networks with low bitwidth gradients," arXiv, 2016.
[18] J. Woo et al., "Resistive memory-based analog synapse: The pursuit for linear and symmetric weight update," IEEE Nanotechnol. Mag., 2018.
[19] M. Fouda et al., "Independent component analysis using RRAMs," IEEE Transactions on Nanotechnology, 2018.
[20] Y. LeCun et al., "LeNet-5, convolutional neural networks," 2015.
[21] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," arXiv, 2014.
[22] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in CVPR, 2016.