Low Power In-Memory Implementation of Ternary Neural Networks with Resistive RAM-Based Synapse
Axel Laborieux∗, Marc Bocquet†, Tifenn Hirtzlin∗, Jacques-Olivier Klein∗, Liza Herrera Diez∗, Etienne Nowak‡, Elisa Vianello‡, Jean-Michel Portal† and Damien Querlioz∗
∗Université Paris-Saclay, CNRS, C2N, 91120 Palaiseau, France. Email: [email protected]
†IM2NP, Univ. Aix-Marseille et Toulon, CNRS, France.
‡CEA, LETI, Grenoble, France.
Abstract—The design of systems implementing low precision neural networks with emerging memories such as resistive random access memory (RRAM) is a major lead for reducing the energy consumption of artificial intelligence (AI). Multiple works have, for example, proposed in-memory architectures to implement low power binarized neural networks. These simple neural networks, where synaptic weights and neuronal activations assume binary values, can indeed approach state-of-the-art performance on vision tasks. In this work, we revisit one of these architectures, where synapses are implemented in a differential fashion to reduce bit errors, and synaptic weights are read using precharge sense amplifiers. Based on experimental measurements on a hybrid 130 nm CMOS/RRAM chip and on circuit simulation, we show that the same memory array architecture can be used to implement ternary weights instead of binary weights, and that this technique is particularly appropriate if the sense amplifier is operated in the near-threshold regime. We also show, based on neural network simulation on the CIFAR-10 image recognition task, that going from binary to ternary neural networks significantly increases neural network performance. These results highlight that the function of AI circuits may sometimes be revisited when they are operated in low power regimes.
I. INTRODUCTION
Artificial intelligence has made tremendous progress in recent years due to the development of deep neural networks. Its deployment at the edge, however, is currently limited by the high power consumption of the associated algorithms [1]. Low precision neural networks are currently emerging as a solution, as they allow the development of low power consumption hardware specialized for deep learning inference [2]. The most extreme case of low precision neural network, the Binarized Neural Network (BNN), also called XNOR-NET, is receiving special attention as it is particularly efficient for hardware implementation: both synaptic weights and neuronal activations assume only binary values [3], [4]. Remarkably, this type of neural network can approach near-state-of-the-art performance on vision tasks [5]. One particularly investigated lead is to fabricate hardware BNNs with emerging memories such as resistive RAM or memristors [6]–[13]. The low memory requirements of BNNs, as well as their reliance on simple arithmetic operations, make them indeed particularly adapted to "in-memory" or "near-memory" computing approaches, which achieve superior energy efficiency by avoiding the von Neumann bottleneck entirely.

In this work, we revisit one of these hardware architectures, developed for the energy-efficient in-memory implementation of BNNs [6], where the synaptic weights are implemented in a differential fashion. We show that it can be extended to a more complex form of low precision neural network, the ternary neural network [14] (TNN, also called Gated XNOR-NET, or GXNOR-NET [15]), where both synaptic weights and neuronal activations assume ternary values. The contributions of this work are as follows. After presenting the background of the work (section II):
• We demonstrate experimentally, on a fabricated 130 nm RRAM/CMOS hybrid chip, a strategy for implementing ternary weights using a precharge sense amplifier, which is particularly appropriate when the sense amplifier is operated in the near-threshold regime (sec. III).
• We carry out simulations to show the superiority of TNNs over BNNs on the canonical CIFAR-10 vision task, and evidence the error resilience of hardware TNNs (sec. IV).

This work was supported by the ERC Grant NANOINFER (715872) and the ANR grant NEURONIC (ANR-18-CE24-0009).

II. BACKGROUND
The main equation in conventional neural networks is the computation of the neuronal activation $A_j = f\left(\sum_i W_{ji} X_i\right)$, where $A_j$, the synaptic weights $W_{ji}$ and input neuronal activations $X_i$ assume real values, and $f$ is a non-linear activation function. BNNs are a considerable simplification of conventional neural networks, in which all neuronal activations ($A_j$, $X_i$) and synaptic weights $W_{ji}$ can only take the binary values $+1$ and $-1$. The neuronal activation then becomes:

$$A_j = \mathrm{sign}\left(\sum_i \mathrm{XNOR}(W_{ji}, X_i) - T_j\right), \quad (1)$$

where sign is the sign function, $T_j$ is a threshold associated with the neuron, and the XNOR operation is defined in Table I. Training BNNs is a relatively sophisticated operation, during which synapses need to be associated with a real value in addition to their binary value. Once training is finished, these real values can be discarded and the neural network is entirely binarized.

TABLE I
TRUTH TABLES OF THE XNOR AND GXNOR GATES

  W_ji   X_i   XNOR(W_ji, X_i)
   -1    -1        +1
   -1    +1        -1
   +1    -1        -1
   +1    +1        +1

  W_ji   X_i   GXNOR(W_ji, X_i)
   -1    -1        +1
   -1    +1        -1
   +1    -1        -1
   +1    +1        +1
    0     X         0
    X     0         0
(X denotes any value in {-1, 0, +1}.)

Due to their reduced memory requirements and their reliance on simple arithmetic operations, BNNs are especially appropriate for in- or near-memory implementations. In particular, multiple groups investigate the implementation of BNN inference with resistive memory tightly integrated at the core of CMOS [6]–[13]. Usually, the resistive memory stores the synaptic weights $W_{ji}$. However, this comes with a significant challenge: resistive memory is prone to bit errors, and in digital applications, it is typically used with strong error correcting codes (ECC). ECC, which requires a large decoding circuit [16], goes against the principles of in- or near-memory computing. For this reason, [6] proposes a two-transistor/two-resistor (2T2R) structure, which reduces resistive memory bit errors without the need of an ECC decoding circuit, by storing synaptic weights in a differential fashion. This allows for extremely efficient implementation of BNNs, and for using the resistive memory devices in very favorable programming conditions (low energy, high endurance).

In this work, we show that the same architecture may be used for a generalization of BNNs, TNNs¹, where the neuronal activations and synaptic weights $A_j$, $X_i$ and $W_{ji}$ can now assume three values: $+1$, $-1$ and $0$. Equation (1) now becomes:

$$A_j = \varphi\left(\sum_i \mathrm{GXNOR}(W_{ji}, X_i) - T_j\right); \quad (2)$$

GXNOR is the "gated" XNOR operation that realizes the product between numbers with values $+1$, $-1$ and $0$ (Table I). $\varphi$ is an activation function that outputs $+1$ if its input is greater than a threshold $\Delta$, $-1$ if the input is less than $-\Delta$, and $0$ otherwise. We show experimentally and by circuit simulation in sec. III how the 2T2R BNN architecture can be extended to TNNs with practically no overhead, and in sec. IV the corresponding benefits in terms of neural network accuracy.

¹In the literature, the name "Ternary Neural Networks" is sometimes also used to refer to neural networks where the synaptic weights are ternarized, but the neuronal activations remain real or integer [17], [18].
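To make the ternary arithmetic concrete, here is a minimal NumPy sketch of Eq. (2) (our own illustration; the helper names and toy values are assumptions, not code from this work):

```python
import numpy as np

def gxnor(w, x):
    """Gated XNOR of Table I. For values restricted to {-1, 0, +1},
    GXNOR coincides with the ordinary product: any 0 operand gives 0,
    and nonzero operands behave like XNOR of their signs."""
    return w * x

def phi(pre_activation, delta):
    """Ternary activation of Eq. (2): +1 above delta, -1 below -delta, 0 otherwise."""
    return np.where(pre_activation > delta, 1,
                    np.where(pre_activation < -delta, -1, 0))

# Toy layer: 4 ternary inputs X_i, 2 neurons with ternary weights W_ji.
X = np.array([1, -1, 0, 1])
W = np.array([[1, -1, 0, 1],
              [-1, 0, 1, -1]])
T = np.array([0.0, 0.0])  # per-neuron thresholds T_j
A = phi(gxnor(W, X).sum(axis=1) - T, delta=0.5)
print(A)  # [ 1 -1 ]
```

Note that restricting the values to {-1, +1} and setting delta to 0 essentially recovers the BNN rule of Eq. (1).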
III. THE OPERATION OF A PRECHARGE SENSE AMPLIFIER CAN PROVIDE TERNARY WEIGHTS
In this work, we use the architecture of [6], where synaptic weights are stored in a differential fashion. Each bit is implemented using two devices, programmed either in the low resistance state (LRS)/high resistance state (HRS) configuration to mean weight +1, or in the HRS/LRS configuration to mean weight −1. Fig. 1 presents the test chip used for the experiments. This chip cointegrates CMOS and resistive memory in the back-end-of-line, between levels four and five of metal. The resistive memory cells are based on hafnium oxide (Fig. 1(a)). Our experiments are based on a kilobit array incorporating all sense and periphery circuitry (Fig. 1(b-c)).

Fig. 1. (a) Electron microscopy image of a hafnium oxide resistive memory cell (RRAM) integrated in the backend-of-line of a 130 nm CMOS process. (b) Photograph and (c) simplified schematic of the hybrid CMOS/RRAM test chip characterized in this work.
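As a minimal sketch of this differential encoding (our own illustration, not the chip's programming interface; the HRS/HRS entry anticipates the ternary extension proposed below):

```python
# Each synaptic weight is stored as a (BL, BLb) pair of resistive states:
# LRS = low resistance state, HRS = high resistance state.
# The LRS/LRS combination is never used.
WEIGHT_TO_STATES = {
    +1: ("LRS", "HRS"),
    -1: ("HRS", "LRS"),
     0: ("HRS", "HRS"),  # ternary extension (introduced in this section)
}

def program_weight(w):
    """Return the (BL, BLb) device states encoding weight w."""
    return WEIGHT_TO_STATES[w]

print(program_weight(+1))  # ('LRS', 'HRS')
```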
Fig. 2. Schematic of the precharge sense amplifier fabricated in the test chip.

The synaptic weights are read using the on-chip precharge sense amplifier (PCSA) presented in Fig. 2, initially proposed in [19]. Fig. 3(a) shows an electrical simulation of this circuit to explain its working principle. These simulations are performed in the commercial 130 nm technology used in our test chip, with a near-threshold supply voltage of 0.6 V [20] (the nominal voltage in this process is 1.2 V). In a first phase (SEN = 0), the outputs Q and Qb are precharged to the supply voltage V_DD. In the second phase (SEN = V_DD), each branch starts to discharge to the ground. The branch whose resistive memory (BL or BLb) has the lowest electrical resistance discharges faster, and causes its associated inverter to drive the output of the other inverter to the supply voltage. At the end of the process, the two outputs are therefore complementary, and can be used to tell which resistive memory has the highest resistance, and therefore the synaptic weight. We observed that the convergence speed of a PCSA depends heavily on the resistance states of the two resistive memories. This effect is particularly magnified when the PCSA is used in near-threshold operation, as presented here. Fig. 3(b) shows a simulation where the two devices BL and BLb were both programmed in the HRS. We see that the two outputs take considerably longer to converge to complementary values than in Fig. 3(a), where the devices are programmed in complementary LRS/HRS states.

Fig. 3. Circuit simulation of the precharge sense amplifier of Fig. 2 with a near-threshold supply voltage of 0.6 V, if the two devices are programmed in an (a) LRS/HRS or (b) HRS/HRS configuration.

These first simulations suggest a technique for implementing ternary weights using the memory array of our test chip. Similarly to when this array is used to implement a BNN, we propose to program the devices in the LRS/HRS configuration to mean the synaptic weight +1, and in the HRS/LRS configuration to mean the synaptic weight −1. Additionally, we use the HRS/HRS configuration to mean the synaptic weight 0, while the LRS/LRS configuration is avoided. The sense operation is performed during a fixed duration. If, at the end of this period, the outputs Q and Qb have differentiated, the output of the XOR gate is 1, and output Q determines the synaptic weight (+1 or −1). Otherwise, the output of the XOR gate is 0 and the weight is determined to be 0.
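The read-out decision can be summarized by a small behavioral model (a sketch under our own assumptions: the resistance-contrast criterion is a simplified stand-in for the transistor-level discharge dynamics, and the threshold value is illustrative):

```python
def read_ternary_weight(r_bl, r_blb, contrast=2.0):
    """Behavioral model of the ternary read scheme of the PCSA.

    The branch with the lower resistance discharges faster; if neither
    branch wins clearly within the fixed read duration, the XOR flag
    stays at 0 and the weight is read as 0. `contrast` (an assumption
    of this sketch) models the minimum resistance ratio needed for the
    outputs Q and Qb to differentiate in time.
    """
    ratio = max(r_bl, r_blb) / min(r_bl, r_blb)
    xor_converged = ratio >= contrast  # XOR gate output after the read time
    if not xor_converged:
        return 0                        # HRS/HRS: weight 0
    return +1 if r_bl < r_blb else -1   # LRS/HRS -> +1, HRS/LRS -> -1

# Illustrative resistance values only (not measured ones):
print(read_ternary_weight(10e3, 300e3))   # +1 (LRS/HRS)
print(read_ternary_weight(300e3, 10e3))   # -1 (HRS/LRS)
print(read_ternary_weight(300e3, 250e3))  # 0  (both HRS)
```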
Experimental measurements on our test chip confirm that the PCSA can be used in this fashion. We first focus on one synapse of the memory array. We program one of the two devices (BLb) to a fixed resistance, then program its complementary device BL to several resistance values, and for each of them perform 100 read operations using an on-chip PCSA operated in near-threshold. Fig. 4 plots the probability that the sense amplifier has converged during the read time. The read operation only converges if the resistance of the BL device is significantly lower than that of the BLb device. To evaluate this behavior in a wider range of programming conditions, we repeated the experiment on 109 devices and their complementary devices of the memory array, each programmed 14 times with various resistance values, and performed a read operation with an on-chip PCSA. Fig. 5 shows, for each couple of resistance values R_BL and R_BLb, whether the read operation converged with Q = V_DD (blue), meaning a weight of +1, converged with Q = 0 (red), meaning a weight of −1, or did not converge (grey), meaning a weight of 0. The results confirm that the LRS/HRS and HRS/LRS configurations may be used to mean weights +1 and −1, and HRS/HRS to mean weight 0. Relatively high values of the HRS should be targeted: the separation between the +1 (or −1) and 0 regions is not strict, and for intermediate resistance values we see that the read operation may or may not converge within the read duration.

Fig. 4. Two devices have been programmed in four distinct programming conditions, presented in (a), and measured using an on-chip sense amplifier. (b) Proportion of read operations that have converged within the read duration, over 100 trials.

Fig. 5. For 109 device pairs programmed with multiple R_BL/R_BLb configurations, value of the synaptic weight measured by the on-chip sense amplifier using the strategy and reading time described in the body text.

Extensive circuit simulations in the technology of our test chip allow us to evaluate this behavior in the general case. Fig. 6 shows the switching time of the PCSA as a function of the resistances of the two resistive memories BL and BLb, with the nominal supply voltage (1.2 V) and the near-threshold supply voltage (0.6 V). We see that for both supply voltages, the HRS/HRS configuration leads to longer switching times. In our technology, HRS states are typically characterized by high resistances. We see that the operation in near-threshold exhibits a larger area of HRS/HRS values with a switching time above the read duration, corresponding to a 0 state. This implies a more robust detection of the 0 state in near-threshold, compliant with the HRS variability.

Fig. 6. Switching time of a precharge sense amplifier, extracted from circuit simulations using the design kit of a commercial 130 nm technology, as a function of the resistances of the BL and BLb complementary resistive memories. Simulations performed using (a) near-threshold 0.6 V supply voltage and (b) nominal 1.2 V supply voltage.

IV. NETWORK LEVEL BENEFITS OF TERNARIZATION
We now investigate the accuracy gain obtained when using ternarized instead of binarized networks. We trained BNN and TNN versions of networks with VGG-type architectures [21] on the CIFAR-10 image recognition task, which consists in classifying color images among ten classes. The architecture of our networks consists of six convolutional layers with kernel size three. The number of filters at the first layer is called N, and is multiplied by two every two layers. Maximum-value pooling with kernel size two is used every two layers, and batch normalization [22] at every layer. The classifier consists of one hidden layer of 512 units. For the TNN, the activation function has a fixed threshold ∆ (as defined in section II). BNNs were trained following the methodology described in the Appendix of [23], adapted from [3]. TNNs were trained with the methodology introduced in [2]. The training is performed using the AdamW optimizer [24], [25], with the minibatch size and learning rate schedule used in [25], [26]. Data is augmented using random horizontal flips, and a random choice between cropping after padding and random small rotations.

Fig. 7. Maximum test accuracy reached during one training, averaged over five runs, for BNNs and TNNs with various model sizes on the CIFAR-10 dataset. Error bars are one standard deviation.

Fig. 8. Impact of the bit error rate on the test accuracy at inference time, for BNNs and TNNs of model size N = 100. Type 1 errors are sign switches (e.g., +1 mistaken for −1) and Type 2 errors are 0 mistaken for +1 or −1, or −1 and +1 mistaken for 0.

Fig. 7 shows the maximum test accuracies for different sizes of the model. TNNs always outperform BNNs with the same model size (and the same number of synapses). The largest difference is seen for smaller model sizes, but a significant gap remains even for large models. Besides, the difference in the number of parameters required to reach a given accuracy for TNNs and BNNs increases with higher accuracies. There is therefore a clear advantage to using TNNs instead of BNNs.

We then investigate the impact of bit errors in BNNs and TNNs, to see if the advantage provided by TNNs remains when errors are taken into account. Two types of errors are investigated: Type 1 errors are sign switches, while Type 2 errors are only defined for TNNs and correspond to 0 mistaken for +1 or −1, and +1 or −1 mistaken for 0. Type 1 errors are less likely than Type 2 errors thanks to the differential 2T2R reading scheme, as seen in Fig. 5. Fig. 8 shows the impact of errors on the test accuracy for different values of the bit error rate, averaged over five runs. Though Type 2 errors are more likely to occur with TNNs, their effect is not as serious as that of Type 1 errors: the drop in accuracy for Type 2 errors occurs for bit error rates one order of magnitude higher than for Type 1 errors. Therefore, TNNs still outperform BNNs when device imperfections are included.
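For concreteness, here is a minimal sketch of how such errors can be injected into a ternary weight tensor (our own reconstruction in NumPy; the function names and the independent-error model are assumptions, not the authors' published code):

```python
import numpy as np

rng = np.random.default_rng(0)

def inject_type1(weights, ber):
    """Type 1 errors: sign switches (+1 <-> -1); zero weights are untouched."""
    w = weights.copy()
    flips = (w != 0) & (rng.random(w.shape) < ber)
    w[flips] = -w[flips]
    return w

def inject_type2(weights, ber):
    """Type 2 errors: 0 read as +1 or -1, and +1 or -1 read as 0."""
    w = weights.copy()
    hits = rng.random(w.shape) < ber
    zero_hits = hits & (w == 0)
    nonzero_hits = hits & (w != 0)
    w[nonzero_hits] = 0                                       # +1/-1 -> 0
    w[zero_hits] = rng.choice([-1, 1], size=zero_hits.sum())  # 0 -> +/-1
    return w

W = rng.choice([-1, 0, 1], size=(4, 4))  # toy ternary weight matrix
print(inject_type1(W, ber=0.1))
print(inject_type2(W, ber=0.1))
```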
V. CONCLUSION

In this work, we revisited a differential in-memory computing architecture for BNNs. We showed, experimentally on a hybrid CMOS/RRAM chip and by circuit simulation, that its sense amplifier is able to differentiate not only the LRS/HRS and HRS/LRS states, but also the HRS/HRS state. This allows the architecture to store ternary weights, and to provide a building block for TNNs. We showed by neural network simulation on the CIFAR-10 task the benefits of using ternary instead of binary networks, and the high resilience of TNNs to weight errors. As this behavior is magnified in the slow but low power near-threshold operation regime, our approach especially targets extremely energy-conscious applications such as uses within wireless sensors or medical applications. This work opens the way to increasing edge intelligence in such contexts, and also highlights that near-threshold operation of circuits may sometimes provide opportunities for new functionalities.
REFERENCES

[1] X. Xu, Y. Ding, S. X. Hu, M. Niemier, J. Cong, Y. Hu, and Y. Shi, "Scaling for edge inference of deep neural networks," Nature Electronics, vol. 1, no. 4, p. 216, 2018.
[2] I. Hubara, M. Courbariaux, D. Soudry, R. El-Yaniv, and Y. Bengio, "Quantized neural networks: Training neural networks with low precision weights and activations," The Journal of Machine Learning Research, vol. 18, no. 1, pp. 6869–6898, 2017.
[3] M. Courbariaux, I. Hubara, D. Soudry, R. El-Yaniv, and Y. Bengio, "Binarized neural networks: Training deep neural networks with weights and activations constrained to +1 or −1," arXiv preprint arXiv:1602.02830, 2016.
[4] M. Rastegari, V. Ordonez, J. Redmon, and A. Farhadi, "XNOR-Net: ImageNet classification using binary convolutional neural networks," in Proc. ECCV. Springer, 2016, pp. 525–542.
[5] X. Lin, C. Zhao, and W. Pan, "Towards accurate binary convolutional neural network," in Advances in Neural Information Processing Systems, 2017, pp. 345–353.
[6] M. Bocquet, T. Hirtzlin, J.-O. Klein, E. Nowak, E. Vianello, J.-M. Portal, and D. Querlioz, "In-memory and error-immune differential RRAM implementation of binarized deep neural networks," in IEDM Tech. Dig. IEEE, 2018, p. 20.6.1.
[7] S. Yu, Z. Li, P.-Y. Chen, H. Wu, B. Gao, D. Wang, W. Wu, and H. Qian, "Binary neural network with 16 Mb RRAM macro chip for classification and online training," in IEDM Tech. Dig. IEEE, 2016, pp. 16–2.
[8] E. Giacomin, T. Greenberg-Toledo, S. Kvatinsky, and P.-E. Gaillardon, "A robust digital RRAM-based convolutional block for low-power image processing and learning applications," IEEE Transactions on Circuits and Systems I: Regular Papers, vol. 66, no. 2, pp. 643–654, 2019.
[9] X. Sun, S. Yin, X. Peng, R. Liu, J.-s. Seo, and S. Yu, "XNOR-RRAM: A scalable and parallel resistive synaptic architecture for binary neural networks," in Proc. DATE. IEEE, 2018.
[10] Z. Zhou, P. Huang, Y. Xiang, W. Shen, Y. Zhao, Y. Feng, B. Gao, H. Wu, H. Qian, L. Liu et al., "A new hardware implementation approach of BNNs based on nonlinear 2T2R synaptic cell," in IEDM Tech. Dig. IEEE, 2018, pp. 20–7.
[11] M. Natsui, T. Chiba, and T. Hanyu, "Design of MTJ-based nonvolatile logic gates for quantized neural networks," Microelectronics Journal, vol. 82, pp. 13–21, 2018.
[12] T. Tang, L. Xia, B. Li, Y. Wang, and H. Yang, "Binary convolutional neural network on RRAM," in Proc. ASP-DAC. IEEE, 2017, pp. 782–787.
[13] J. Lee, J. K. Eshraghian, K. Cho, and K. Eshraghian, "Adaptive precision CNN accelerator using radix-X parallel connected memristor crossbars," arXiv preprint arXiv:1906.09395, 2019.
[14] H. Alemdar, V. Leroy, A. Prost-Boucle, and F. Pétrot, "Ternary neural networks for resource-efficient AI applications," in Proc. IJCNN. IEEE, 2017, pp. 2547–2554.
[15] L. Deng, P. Jiao, J. Pei, Z. Wu, and G. Li, "GXNOR-Net: Training deep neural networks with ternary weights and activations without full-precision memory under a unified discretization framework," Neural Networks, vol. 100, pp. 49–58, 2018.
[16] S. Gregori, A. Cabrini, O. Khouri, and G. Torelli, "On-chip error correcting techniques for new-generation flash memories," Proc. IEEE, vol. 91, no. 4, pp. 602–616, 2003.
[17] N. Mellempudi, A. Kundu, D. Mudigere, D. Das, B. Kaul, and P. Dubey, "Ternary neural networks with fine-grained quantization," arXiv preprint arXiv:1705.01462, 2017.
[18] E. Nurvitadhi, G. Venkatesh, J. Sim, D. Marr, R. Huang, J. Ong Gee Hock, Y. T. Liew, K. Srivatsan, D. Moss, S. Subhaschandra et al., "Can FPGAs beat GPUs in accelerating next-generation deep neural networks?" in Proc. ACM/SIGDA Int. Symp. Field-Programmable Gate Arrays. ACM, 2017, pp. 5–14.
[19] W. Zhao, C. Chappert, V. Javerliac, and J.-P. Noziere, "High speed, high stability and low power sensing amplifier for MTJ/CMOS hybrid logic circuits," IEEE Transactions on Magnetics, vol. 45, no. 10, pp. 3784–3787, 2009.
[20] R. G. Dreslinski, M. Wieckowski, D. Blaauw, D. Sylvester, and T. Mudge, "Near-threshold computing: Reclaiming Moore's law through energy efficient integrated circuits," Proc. IEEE, vol. 98, no. 2, pp. 253–266, Feb. 2010.
[21] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," arXiv preprint arXiv:1409.1556, 2014.
[22] S. Ioffe and C. Szegedy, "Batch normalization: Accelerating deep network training by reducing internal covariate shift," arXiv preprint arXiv:1502.03167, 2015.
[23] T. Hirtzlin, B. Penkovsky, M. Bocquet, J.-O. Klein, J.-M. Portal, and D. Querlioz, "Stochastic computing for hardware implementation of binarized neural networks," IEEE Access, vol. 7, p. 76394, 2019.
[24] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," arXiv preprint arXiv:1412.6980, 2014.
[25] I. Loshchilov and F. Hutter, "Fixing weight decay regularization in Adam," arXiv preprint arXiv:1711.05101, 2017.
[26] I. Loshchilov and F. Hutter, "SGDR: Stochastic gradient descent with warm restarts," arXiv preprint arXiv:1608.03983, 2016.