Hybrid In-memory Computing Architecture for the Training of Deep Neural Networks
Vinay Joshi∗, Wangxin He†, Jae-sun Seo†, and Bipin Rajendran∗
Email: [email protected], {wangxinh, jseo28}@asu.edu, [email protected]
∗King's College London, Strand, London WC2R 2LS, United Kingdom
†Arizona State University, AZ, USA
Abstract—The cost involved in training deep neural networks (DNNs) on von-Neumann architectures has motivated the development of novel solutions for efficient DNN training accelerators. We propose a hybrid in-memory computing (HIC) architecture for the training of DNNs on hardware accelerators that results in memory-efficient inference and outperforms baseline software accuracy in benchmark tasks. We introduce a weight representation technique that exploits both binary and multi-level phase-change memory (PCM) devices, which leads to a memory-efficient inference accelerator. Unlike previous in-memory computing-based implementations, we use a low-precision weight update accumulator that results in additional memory savings. We trained the ResNet-32 network to classify CIFAR-10 images using HIC. For a comparable model size, HIC-based training outperforms the baseline network, trained in floating-point 32-bit (FP32) precision, by leveraging an appropriate network width multiplier. Furthermore, we observe that HIC-based training requires about 50% less inference model size to achieve accuracy comparable to the baseline. We also show that the temporal drift in PCM devices has a negligible effect on post-training inference accuracy for extended periods (up to one year). Finally, our simulations indicate that HIC-based training naturally ensures that the number of write-erase cycles seen by the devices is a small fraction of the endurance limit of PCM, demonstrating the feasibility of this architecture for achieving hardware platforms that can learn in the field.
I. INTRODUCTION
Numerous emerging smart applications (e.g., IoT, wearables, drones) demand on-chip continuous learning, compelling the development of application-specific memories and architectures. These applications often require the implementation of learning algorithms for large network models in an energy-efficient manner. Conventional digital memory solutions based on SRAM or DRAM cannot meet the required density and energy targets due to their large area and restrictive off-chip memory access costs. High-end, expensive graphics processing units (GPUs) have been the default choice for DNN training, but the energy and time required to train state-of-the-art DNN architectures on GPUs is high [1]. This necessitates the development of more energy- and area-efficient custom hardware accelerators for deep learning training workloads. A few ASIC processors have recently been reported for DNN training [2], [3], [4], but they rely on conventional SRAM for on-chip storage, which requires a large amount of memory access with the associated density and leakage power constraints.

The multi-level storage offered by non-volatile memories such as phase-change memory (PCM) [5] and resistive RAM (RRAM) [6], when combined with in-memory computing, forms the basis of analog DNN training accelerators [7], [8], [9], [10], [11], [12], [13]. In-memory computing with PCM devices has shown promise in training deep neural networks (DNNs) on several cognitive tasks [8], [11], [14]. However, in these solutions the weight gradients are stored in off-chip memory, which again inherits the density and power constraints noted above. In [15], a hybrid precision synapse for DNN training is proposed. In our implementation, unlike in [15], the higher significant bits are programmed only if there is an overflow event on the lower significant bits. Furthermore, our implementation ensures that the conductance decay due to PCM device drift (similar to capacitor leakage in [15]) does not affect the network training accuracy.

Here, we propose a hybrid in-memory computing (HIC) architecture for training DNNs on hardware accelerators with memory-efficient inference. Specifically, we propose to map the higher significant bits of the weight values using PCM devices with multi-level storage capability and the lower significant bits using PCM devices with binary-level storage capability. In the following sections, we show that this scheme can outperform the baseline network trained in floating-point 32-bit (FP32) precision by leveraging an appropriate network width multiplier, and that it requires about 50% less inference model size to achieve accuracy similar to the baseline.

II. DNN TRAINING USING HYBRID IN-MEMORY COMPUTING
In this section, we describe the weight representation strategy across the two memory arrays used in the HIC architecture. We also discuss a set of hardware-aware operations that form an essential basis of the HIC architecture.
A. Weight representation strategy
An overview of the weight representation strategy in the HIC architecture is shown in Fig. 1. In the HIC architecture, the MSB part of the weight values is stored on an array of multi-level PCM cells denoted as the MSB array. The LSB part of the weight values is stored on a memory array of binary-valued PCM devices denoted as the LSB array. This hybrid design is motivated by the fact that only the MSBs of the weights are needed for forward/backward propagation in DNN training that targets low-precision inference, while weight updates are typically small values that mostly modify only the LSBs of the weights.
Fig. 1. Hybrid in-memory computing (HIC) architecture for DNN training: the MSB part of the synaptic weight is stored on a multi-level PCM cell that offers the equivalent of 4-bit precision, while the LSB part is stored in a 7-bit memory formed by seven binary PCM devices. This choice of precision in the MSB and LSB parts is found to be optimal for the DNNs studied in this paper.

The MSB values of the weights are programmed and read using the statistically accurate PCM models proposed in [16]. The LSB part of the weights is mapped using its binary representation on multiple binary-level PCM devices. Notably, the write operation on a memory location in the LSB array is performed by simply reading and flipping the binary state of the appropriate PCM device. This reduces the write operation cost on the LSB array, but also implies that some PCM devices will drift more than their neighbors in the same memory location (as device drift depends on the time of the last programming event). However, we show empirically that this does not adversely affect the network training accuracy even for extended periods of time (up to one year).

The MSB array is a crossbar array formed by a differential pair of multi-level PCM devices at each cross point. We perform our experiments using the PCM models proposed for DNN training in [16]. These models were developed by gathering read/write statistics from 10K PCM devices and show a strong statistical match with the measured data; hence, their use in this study is well justified. The proposed PCM model consists of four different non-ideal components: (1) stochastic write, (2) stochastic read, (3) temporal variation (drift) in the device conductance, and (4) nonlinearity of the programming curve. For the LSB array, we use a binary-level PCM model with a stochastic write of the high-conductance state, and a read operation that combines conductance drift with stochastic read as proposed in [16]. For a binary PCM device, the stochastic write is simulated by adding zero-mean Gaussian noise with a fixed standard deviation to the expected high-state conductance value. A multi-level PCM device can also be used as binary storage by utilizing only its lowest conductance state and a desired high conductance state.
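To make the weight representation concrete, the following minimal NumPy sketch (our own simplified illustration, not the simulator used in this work) models a hybrid weight as a 4-bit signed MSB value held by the multi-level differential pair, together with a 7-bit signed LSB accumulator held by the binary devices; only the MSB part contributes to the value used for propagation, and the binary write is modeled as a bit flip with Gaussian noise on the high-conductance state. The class name, scales, and noise level are illustrative assumptions.

import numpy as np

MSB_BITS, LSB_BITS = 4, 7
MSB_MIN, MSB_MAX = -2**(MSB_BITS - 1), 2**(MSB_BITS - 1) - 1   # -8 .. +7
LSB_MIN, LSB_MAX = -2**(LSB_BITS - 1), 2**(LSB_BITS - 1) - 1   # -64 .. +63

class HybridWeight:
    """Hybrid MSB/LSB weight container (simplified, software-only view)."""
    def __init__(self, shape, msb_step=1.0 / 8):
        self.msb = np.zeros(shape, dtype=np.int8)   # held on the multi-level PCM pair
        self.lsb = np.zeros(shape, dtype=np.int8)   # held on seven binary PCM devices
        self.msb_step = msb_step                    # real-valued size of one MSB level

    def effective(self):
        # Only the MSB part is used in the forward and backward passes.
        return self.msb.astype(np.float32) * self.msb_step

def binary_pcm_write(bit, g_high=10.0, g_low=0.1, sigma=1.0, rng=np.random):
    """Write one binary PCM device; the high state carries Gaussian write noise."""
    return g_high + sigma * rng.randn() if bit else g_low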
B. Implementation of hybrid in-memory computing
The HIC architecture performs the vector-matrix multiplication (VMM) operation using an analog crossbar array of PCM devices. We assume that all other operations required in DNN training are performed in digital CMOS circuits. A digital-to-analog converter (DAC) is required to apply input voltages to read or program the analog crossbar array, and an analog-to-digital converter (ADC) is required to read the crossbar array output.
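As a rough software analogue of this data path, the sketch below (our own simplified model; the resolutions and signal ranges are assumptions, not values from the paper) quantizes the input vector as a DAC would, computes the multiply-accumulate with a differential pair of conductance matrices, and re-quantizes the analog output as an ADC would.

import numpy as np

def quantize(x, n_bits, x_max):
    # Uniform symmetric quantization over [-x_max, x_max], mimicking a DAC or ADC.
    levels = 2**(n_bits - 1) - 1
    return np.round(np.clip(x, -x_max, x_max) / x_max * levels) / levels * x_max

def crossbar_vmm(x, g_pos, g_neg, dac_bits=8, adc_bits=8, x_max=1.0, y_max=8.0):
    """Vector-matrix multiply on a differential PCM crossbar (idealized)."""
    v = quantize(x, dac_bits, x_max)          # DAC: input voltages applied to the rows
    i_out = v @ (g_pos - g_neg)               # analog MAC: differential column currents
    return quantize(i_out, adc_bits, y_max)   # ADC: digitized layer output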
Fig. 2. Overview of a DNN layer operation during HIC-based training: the MSB array is used to perform the vector-matrix multiplication operations involved in the forward and backpropagation phases of DNN training, while the weight updates are quantized and accumulated in the LSB array. An overflow event in the LSB array triggers the programming of the corresponding weight elements in the MSB array.

A convolution operation is essentially a matrix-matrix multiplication [17] and can therefore be implemented on a crossbar as a vector-matrix multiplication. A recurrent layer usually performs a vector-matrix multiplication followed by elementwise non-linear activation functions [18]. Hence, the operation layout shown in Fig. 2 is valid for the three most commonly used DNN layer types: fully-connected, convolution, and recurrent layers.

Fig. 2 shows a typical operation layout of a DNN layer in HIC-based training. A transposable crossbar array performs the VMMs required in the forward and backpropagation phases of DNN training. A digital-to-analog converter (DAC) applies the input (activations or error gradients) to the crossbar array, and an analog-to-digital converter reads the output current of the crossbar array. A network-specific normalization function such as batch normalization [19] or group normalization [20] is computed on the ADC output, followed by an activation function such as ReLU, Sigmoid, or Tanh. The gradients are backpropagated by applying voltages proportional to the error gradients on the columns of the transposable crossbar array. During the weight update phase, an outer product is computed on the input X and the output error gradients ∆Y_A to obtain the weight gradients ∆W. The weight gradients are quantized and used to update the LSB array; the values in the LSB array are updated by simply flipping the binary states of the devices where required. If an overflow occurs in the LSB array, the MSB array is programmed with the corresponding update value; there are no other programming events on the MSB array.
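A minimal NumPy sketch of this update rule for a fully-connected layer is given below (our own illustration; the learning-rate handling, step sizes, and the carry convention on overflow are assumptions, and the real system performs these steps with on-chip digital logic and PCM programming).

import numpy as np

MSB_BITS, LSB_BITS = 4, 7
MSB_MIN, MSB_MAX = -2**(MSB_BITS - 1), 2**(MSB_BITS - 1) - 1
LSB_MIN, LSB_MAX = -2**(LSB_BITS - 1), 2**(LSB_BITS - 1) - 1

def hic_weight_update(msb, lsb, x, dy, lr, msb_step=1.0 / 8):
    """Accumulate a quantized update in the LSB array; program the MSB only on overflow."""
    grad = np.outer(x, dy)                             # weight gradient, shape (in, out)
    lsb_step = msb_step / 2**LSB_BITS                  # value of one LSB increment
    delta = np.round(-lr * grad / lsb_step).astype(np.int32)
    acc = lsb.astype(np.int32) + delta                 # accumulate in the LSB array

    up = (acc > LSB_MAX).astype(np.int32)              # positive overflow events
    dn = (acc < LSB_MIN).astype(np.int32)              # negative overflow events
    msb = np.clip(msb.astype(np.int32) + up - dn, MSB_MIN, MSB_MAX)   # MSB programmed on overflow
    acc = acc - up * 2**LSB_BITS + dn * 2**LSB_BITS    # wrap the LSB accumulator

    return msb.astype(np.int8), np.clip(acc, LSB_MIN, LSB_MAX).astype(np.int8)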
III. NUMERICAL VALIDATION

In this section, we discuss the experimental setup used for our simulations and the results from four different studies.
A. Experimental setup and network details
Fig. 3. The impact of different non-ideal aspects of the PCM devices on HIC-based training accuracy. PCM device non-linearity, as well as the stochasticity associated with the write and read operations, results in a notable drop in the training accuracy, while the temporal drift in the device conductance results in an accuracy improvement, as it has the same effect on DNN training as weight decay regularization. The results are based on the average of five distinct training runs.

We have chosen the ResNet-32 network for all our evaluations, as it is representative of the large networks used in machine learning research [21]. The ResNet-32 network has a total of 31 convolution layers and one fully-connected layer for classification, with about 0.46 million trainable parameters; all the weights and updates are stored on PCM-based memory arrays. All the layers of the ResNet-32 network are trained using the HIC architecture. The network is trained to classify the images from the CIFAR-10 dataset [22].

The scheme used to program PCM devices in the MSB array can only increment the device conductance, so the application of several programming pulses to the devices in a differential pair can result in saturation of their conductance levels. After a fixed number of training batches, we therefore perform a refresh operation on the PCM devices in the MSB array to avoid saturation of the device conductance [23]. A differential pair of PCM devices used in the MSB array offers an equivalent precision of approximately 4 bits [24]. In the LSB array, we use seven PCM devices for a 7-bit signed fixed-point representation. All the DACs and ADCs operate at a fixed low precision that has been reported to offer a good trade-off between precision and energy consumption [25]. We use the same hyperparameter settings for the baseline network and for the preprocessing of the CIFAR-10 images as given in [21]. The hyperparameter setting in the HIC implementation is the same as [21], except that the learning rate is 0.05 with a decay factor of 0.45 and a batch size of 100. All our simulations are performed using TensorFlow [26].
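The refresh operation can be sketched as follows (our own simplified view of the scheme in [23]; the threshold and conductance range are illustrative assumptions): when either device of a differential pair approaches its maximum conductance, both devices are RESET and the signed weight difference is reprogrammed onto a single device, preserving the stored value while restoring programming headroom.

import numpy as np

G_MIN, G_MAX = 0.1, 25.0          # illustrative conductance range (uS)
REFRESH_THRESHOLD = 0.9 * G_MAX   # refresh when either device nears saturation

def refresh_pair(g_pos, g_neg):
    """RESET a saturating differential pair and reprogram the net weight."""
    need_refresh = (g_pos > REFRESH_THRESHOLD) | (g_neg > REFRESH_THRESHOLD)
    diff = g_pos - g_neg                          # signed weight stored by the pair
    # Reprogram the difference onto one device; the other is left near RESET.
    new_pos = np.where(diff >= 0, G_MIN + diff, G_MIN)
    new_neg = np.where(diff >= 0, G_MIN, G_MIN - diff)
    g_pos = np.where(need_refresh, np.clip(new_pos, G_MIN, G_MAX), g_pos)
    g_neg = np.where(need_refresh, np.clip(new_neg, G_MIN, G_MAX), g_neg)
    return g_pos, g_neg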
B. Effect of individual non-idealities

Our simulations are based on the PCM model proposed in [16]. This model is composed of four non-ideal components: (1) stochastic write, (2) stochastic read, (3) temporal variation in the device conductance, and (4) a nonlinear programming curve. We study the effect of each non-ideal component on the network training accuracy by performing an ablation of the PCM model. In Fig. 3, the light blue bars indicate networks trained with the linear PCM model and at most one other non-ideal component (see the text on the bars), the colorless bars indicate networks trained with the nonlinear PCM model and at most two other non-ideal components, and the right-most colored bar (denoted as "Full-model") indicates a network trained with a PCM model that includes all the non-ideal components.
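The qualitative behavior of these four components can be captured with a toy model like the one below (our own sketch; the functional forms follow standard PCM behavior, but all constants are illustrative and are not the calibrated values from [16]): the conductance increment shrinks with the number of applied pulses (nonlinearity), every write and read carries Gaussian noise, and the conductance decays as a power law of the time since programming (drift).

import numpy as np

def program_pulse(g, n_pulses, dg0=1.0, decay=0.3, write_sigma=0.2, rng=np.random):
    """One SET pulse: the expected increment shrinks with the pulse count (nonlinearity)."""
    dg = dg0 / (1.0 + decay * n_pulses)            # nonlinear programming curve
    return g + dg + write_sigma * rng.randn()      # stochastic write noise

def read_conductance(g0, t, t0=1.0, nu=0.05, read_sigma=0.1, rng=np.random):
    """Read at time t after programming: power-law drift plus stochastic read noise."""
    g_drifted = g0 * (t / t0) ** (-nu)             # temporal conductance drift
    return g_drifted + read_sigma * rng.randn()    # stochastic read noise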
We observe that the nonlinearity of the programming curve causes a notable drop in the network training accuracy compared to the linear PCM model. This is attributed to the fact that the expected conductance increment of a PCM device is an inverse function of the number of applied programming pulses [16]. The stochastic read and write components of the PCM model have an even stronger negative effect on the network training accuracy, owing to the large amount of write noise present in the PCM programming process. Interestingly, the implementation that incorporates device conductance drift achieves higher accuracy than the implementations with the other non-ideal components. This accuracy improvement can be attributed to the fact that device drift acts similarly to the weight decay regularization technique commonly used in deep learning [27], [28]. Like weight decay regularization, PCM conductance drift causes a larger decay in weight values that do not contribute to the network training, while weights that are regularly updated (or trained) are subjected to smaller drift in their conductance values.

Although the network training accuracy in the full-model case, including all the non-ideal components, is about 4.4% lower than that of the baseline network trained in FP32 precision, note that HIC uses only 9 PCM devices per weight. We show in the next section that HIC-based training is more memory-efficient and achieves better accuracy than the baseline.

Fig. 4. With network width multiplication, HIC-based training of the ResNet-32 network outperforms the FP32 software baseline for a comparable inference model size and also shows a better accuracy improvement with increasing model size. Note that HIC achieves accuracy comparable to the baseline with about 50% less inference model size. The network width multiplier (values near the markers) compensates for the loss in accuracy due to the non-idealities of the PCM devices. The results are based on the average of five distinct training runs.

C. Effect of the model size on network accuracy
Now, we discuss the network training accuracy as a function of the model size required for inference. We increase the network size by increasing the number of neurons in each layer by a desired network width multiplier (see the values near the markers in Fig. 4), as proposed in [29]. We observe that the increase in network width compensates for the loss in training accuracy due to non-ideal PCM devices. Fig. 4 shows the network training accuracy as a function of the inference model size, defined as the amount of memory required to store the weights during the inference phase.
The HIC-based implementation requires substantially fewer bits for storing each weight than the 32 bits used by the FP32 baseline. Like markers in Fig. 4 indicate the same network architecture, i.e., the same number of neurons in each layer; the model size for HIC is lower because of the low-precision representation of the weights.

We observe that the accuracy improvement of HIC-based training per additional neuron (and hence per unit of model size) is better than that of the baseline. Furthermore, for a comparable inference model size, HIC-based training achieves at least 1% better accuracy than the baseline. Notably, for a comparable network accuracy, HIC-based training requires about 50% less inference model size. This suggests that HIC-based training enables memory-efficient inference on hardware without compromising accuracy compared to the baseline.
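The width-multiplier bookkeeping behind Fig. 4 can be reproduced with a short calculation such as the one below (our own sketch; the baseline parameter count is the approximate ResNet-32 CIFAR configuration, and the bits-per-weight values are free parameters rather than figures from the paper): convolution-layer parameters grow roughly quadratically with the width multiplier, and the inference model size is the parameter count times the per-weight storage.

BASE_PARAMS = 0.46e6   # approximate ResNet-32 parameter count (CIFAR configuration)

def inference_model_size_mb(width_multiplier, bits_per_weight):
    """Approximate inference model size for a width-scaled ResNet-32."""
    # Channel widths scale linearly, so conv-layer parameters scale roughly quadratically.
    params = BASE_PARAMS * width_multiplier ** 2
    return params * bits_per_weight / 8 / 1e6   # bytes -> MB

# Example: FP32 baseline at unit width vs. a hypothetical wider, low-precision deployment.
print(inference_model_size_mb(1.0, 32))
print(inference_model_size_mb(1.5, 8))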
D. Effect of Device Drift on Post-training Inference Accuracy

Fig. 5. Post-training inference accuracy of the ResNet-32 network as affected by the temporal drift in PCM devices: for a period of about 11.7 days there is no observable drop in the inference accuracy. The AdaBS compensation technique [9] is shown to be effective in compensating for the weight degradation due to PCM drift, maintaining the inference accuracy close to the baseline for over a year. The plots shown are averaged over 10 distinct training runs and 10 distinct inference runs per training (a total of 100 runs).

Now, we discuss the effect of PCM conductance drift on the inference accuracy, computed as a function of time after network training. As shown in Fig. 5, we compute the inference on a version of the HIC network whose width multiplier yields an equivalent model size of 1.84 MB (the FP32 baseline model size is 3.72 MB). In these simulations, we first train the network while incorporating the drift in the PCM devices, which causes the weights to degrade throughout training. We then perform inference on the trained network at time points extending to about one year after training, which causes further degradation of the weight values.

In [9], the AdaBS technique was proposed to compensate for the weight decay caused by PCM conductance drift. AdaBS infrequently performs a calibration phase that recomputes the global mean and variance of all the batch normalization layers in the network; this calibration requires only a small fraction of the images in the training set.

The inference accuracy of HIC is unaffected for roughly the first 11.7 days when no compensation is applied to the weights; beyond this point, the drift has a more adverse impact on the inference accuracy. The AdaBS compensation technique helps to recover this degradation in network accuracy. For a year-long simulation, the inference accuracy of HIC drops only marginally relative to the accuracy at the start of the inference period when AdaBS compensation is applied to the weights, whereas the drop is substantially larger when no compensation is applied; AdaBS recovers most of this loss. Note that AdaBS provides no significant gain in inference accuracy during the initial drift-tolerant period, but afterwards it plays a crucial role in maintaining the inference accuracy close to the baseline.
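The spirit of the AdaBS calibration can be sketched in a few lines of TensorFlow (our own approximation, not the implementation from [9]: here the batch-normalization moving statistics are simply refreshed by forward passes over a small calibration set, whereas AdaBS computes exact global statistics; the function name is ours, and calib_dataset is assumed to yield (image, label) batches).

import tensorflow as tf

def adabs_style_recalibration(model, calib_dataset, num_batches=50):
    """Refresh batch-norm statistics of a drifted Keras model using a small calibration set."""
    for x_batch, _ in calib_dataset.take(num_batches):
        # Running the model in training mode updates each BatchNormalization layer's
        # moving mean/variance; no optimizer step is taken, so the (drifted) weights
        # themselves are left untouched.
        model(x_batch, training=True)
    return model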
E. Write-Erase Cycle Estimation

Fig. 6. The number of write-erase cycles applied to the PCM devices in the MSB or LSB array during the training of the ResNet-32 network is well within the reported endurance limit of PCM devices.

During our training simulations, we also tracked the number of write-erase cycles applied to the devices in the MSB and LSB arrays throughout training. Following the definition in [30], we count a write-erase cycle as a bounded sequence of SET pulses followed by a RESET pulse. Fig. 6 shows the distribution of write-erase cycles applied to all devices during one full training run of the ResNet-32 network. The number of write-erase cycles applied to any PCM device in the HIC-based training of the ResNet-32 network is small: the devices in the MSB array, which are programmed only on LSB overflow, see far fewer cycles than those in the LSB array, whose counts remain in the range of a few thousand cycles. Since PCM device endurance is several orders of magnitude higher [30], the number of write-erase cycles seen by any device in the HIC implementation is only a small fraction of the endurance limit.
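A simple way to perform this bookkeeping in simulation is sketched below (our own tracking code, not the authors' tooling): each device carries a counter, SET pulses open a programming burst, and the RESET pulse that terminates the burst closes one write-erase cycle.

import numpy as np

class EnduranceTracker:
    """Per-device write-erase cycle counter (a SET burst closed by a RESET)."""
    def __init__(self, num_devices):
        self.cycles = np.zeros(num_devices, dtype=np.int64)
        self.in_burst = np.zeros(num_devices, dtype=bool)

    def record_set(self, device_ids):
        # A SET pulse opens (or continues) a programming burst on these devices.
        self.in_burst[device_ids] = True

    def record_reset(self, device_ids):
        # A RESET pulse closes the burst and completes one write-erase cycle.
        self.cycles[device_ids] += 1
        self.in_burst[device_ids] = False

# Example: device 3 receives two SET pulses and one RESET, so one cycle is counted.
tracker = EnduranceTracker(num_devices=8)
tracker.record_set([3]); tracker.record_set([3]); tracker.record_reset([3])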
IV. CONCLUSION

We proposed a hybrid in-memory computing (HIC) architecture for memory-efficient training of deep neural networks on hardware accelerators. Based on simulations of training the ResNet-32 network on the CIFAR-10 dataset, the HIC architecture shows promising memory savings and higher training accuracy compared to the baseline network trained in floating-point 32-bit precision. Specifically, the HIC implementation outperformed the baseline software accuracy by at least 1% for a comparable inference model size by leveraging a network width multiplier, and it achieved accuracy similar to the baseline with about 50% less inference model size. With a suitable PCM device drift compensation technique, we showed that the post-training inference accuracy suffers only a negligible drop over time. Finally, HIC-based training incurred write-erase cycles that are a small fraction of the PCM endurance, demonstrating the usefulness of this architecture for memory-constrained deep learning applications such as edge computing, IoT, and wearable technology.

ACKNOWLEDGMENT
This work was in part supported by NSF grants 1652866 and 1715443, the SRC AIHW program, and C-BRIC, one of six centers in JUMP, an SRC program sponsored by DARPA.
REFERENCES

[1] E. Strubell, A. Ganesh, and A. McCallum, "Energy and policy considerations for deep learning in NLP," CoRR, vol. abs/1906.02243, 2019. [Online]. Available: http://arxiv.org/abs/1906.02243
[2] J. Lee, J. Lee, D. Han, J. Lee, G. Park, and H. Yoo, "7.7 LNPU: A 25.3TFLOPS/W sparse deep-neural-network learning processor with fine-grained mixed precision of FP8-FP16," in IEEE International Solid-State Circuits Conference (ISSCC), 2019, pp. 142–144.
[3] C. Kim, S. Kang, D. Shin, S. Choi, Y. Kim, and H. Yoo, "A 2.1TFLOPS/W mobile deep RL accelerator with transposable PE array and experience compression," in IEEE International Solid-State Circuits Conference (ISSCC), 2019, pp. 136–138.
[4] S. Yin and J. Seo, "A 2.6 TOPS/W 16-bit fixed-point convolutional neural network learning processor in 65-nm CMOS," IEEE Solid-State Circuits Letters, vol. 3, pp. 13–16, 2020.
[5] M. Le Gallo and A. Sebastian, "An overview of phase-change memory device physics," Journal of Physics D: Applied Physics, vol. 53, no. 21, p. 213002, Mar. 2020.
[6] F. Zahoor, T. Z. Azni Zulkifli, and F. A. Khanday, "Resistive random access memory (RRAM): an overview of materials, switching mechanism, performance, multilevel cell (MLC) storage, modeling, and applications," Nanoscale Research Letters, vol. 15, no. 1, p. 90, Apr. 2020.
[7] A. Sebastian, M. Le Gallo, G. W. Burr, S. Kim, M. BrightSky, and E. Eleftheriou, "Tutorial: Brain-inspired computing using phase-change memory devices," Journal of Applied Physics, vol. 124, no. 11, p. 111101, 2018.
[8] G. W. Burr, R. M. Shelby, A. Sebastian, S. Kim, S. Kim, S. Sidler, K. Virwani, M. Ishii, P. Narayanan, A. Fumarola et al., "Neuromorphic computing using non-volatile memory," Advances in Physics: X, vol. 2, no. 1, pp. 89–124, 2017.
[9] V. Joshi, M. Le Gallo, S. Haefeli, I. Boybat, S. R. Nandakumar, C. Piveteau, M. Dazzi, B. Rajendran, A. Sebastian, and E. Eleftheriou, "Accurate deep neural network inference using computational phase-change memory," Nature Communications, vol. 11, no. 1, p. 2473, May 2020. [Online]. Available: https://doi.org/10.1038/s41467-020-16108-9
[10] A. Shafiee, A. Nag, N. Muralimanohar, R. Balasubramonian, J. P. Strachan, M. Hu, R. S. Williams, and V. Srikumar, "ISAAC: A convolutional neural network accelerator with in-situ analog arithmetic in crossbars," in Proceedings of the 43rd International Symposium on Computer Architecture (ISCA), June 2016, pp. 14–26.
[11] S. R. Nandakumar, M. Le Gallo, C. Piveteau, V. Joshi, G. Mariani, I. Boybat, G. Karunaratne, R. Khaddam-Aljameh, U. Egger, A. Petropoulos, T. Antonakopoulos, B. Rajendran, A. Sebastian, and E. Eleftheriou, "Mixed-precision deep learning based on computational memory," Frontiers in Neuroscience, vol. 14, 2020.
[12] P. Chi, S. Li, C. Xu, T. Zhang, J. Zhao, Y. Liu, Y. Wang, and Y. Xie, "PRIME: A novel processing-in-memory architecture for neural network computation in ReRAM-based main memory," in Proceedings of the 43rd International Symposium on Computer Architecture, ser. ISCA '16. Piscataway, NJ, USA: IEEE Press, 2016, pp. 27–39. [Online]. Available: https://doi.org/10.1109/ISCA.2016.13
[13] L. Song, X. Qian, H. Li, and Y. Chen, "PipeLayer: A pipelined ReRAM-based accelerator for deep learning," in IEEE International Symposium on High Performance Computer Architecture (HPCA), Feb. 2017, pp. 541–552.
[14] T. Gokmen and Y. Vlasov, "Acceleration of deep neural network training with resistive cross-point devices," CoRR, vol. abs/1603.07341, 2016. [Online]. Available: http://arxiv.org/abs/1603.07341
[15] Y. Luo and S. Yu, "Accelerating deep neural network in-situ training with non-volatile and volatile memory based hybrid precision synapses," IEEE Transactions on Computers, vol. 69, no. 8, pp. 1113–1127, 2020.
[16] S. R. Nandakumar, M. Le Gallo, I. Boybat, B. Rajendran, A. Sebastian, and E. Eleftheriou, "A phase-change memory model for neuromorphic computing," Journal of Applied Physics, vol. 124, no. 15, p. 152135, 2018. [Online]. Available: https://doi.org/10.1063/1.5042408
[17] T. Gokmen, O. M. Onen, and W. Haensch, "Training deep convolutional neural networks with resistive cross-point devices," CoRR, vol. abs/1705.08014, 2017. [Online]. Available: http://arxiv.org/abs/1705.08014
[18] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Computation, vol. 9, no. 8, pp. 1735–1780, Nov. 1997. [Online]. Available: https://doi.org/10.1162/neco.1997.9.8.1735
[19] S. Ioffe and C. Szegedy, "Batch normalization: Accelerating deep network training by reducing internal covariate shift," in Proceedings of the 32nd International Conference on Machine Learning, ser. Proceedings of Machine Learning Research, F. Bach and D. Blei, Eds., vol. 37. Lille, France: PMLR, Jul. 2015, pp. 448–456. [Online]. Available: http://proceedings.mlr.press/v37/ioffe15.html
[20] Y. Wu and K. He, "Group normalization," International Journal of Computer Vision, vol. 128, no. 3, pp. 742–755, Mar. 2020. [Online]. Available: https://doi.org/10.1007/s11263-019-01198-w
[21] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," CoRR, vol. abs/1512.03385, 2015. [Online]. Available: http://arxiv.org/abs/1512.03385
[22] A. Krizhevsky, "Learning multiple layers of features from tiny images," University of Toronto, 2012.
[23] I. Boybat, M. Le Gallo, S. R. Nandakumar, T. Moraitis, T. Parnell, T. Tuma, B. Rajendran, Y. Leblebici, A. Sebastian, and E. Eleftheriou, "Neuromorphic computing with multi-memristive synapses," Nature Communications, vol. 9, no. 1, p. 2514, Jun. 2018. [Online]. Available: https://doi.org/10.1038/s41467-018-04933-y
[24] M. Le Gallo, A. Sebastian, G. Cherubini, H. Giefers, and E. Eleftheriou, "Compressed sensing with approximate message passing using in-memory computing," IEEE Transactions on Electron Devices, vol. 65, no. 10, pp. 4304–4312, 2018.
[25] A. S. Rekhi, B. Zimmer, N. Nedovic, N. Liu, R. Venkatesan, M. Wang, B. Khailany, W. J. Dally, and C. T. Gray, "Analog/mixed-signal hardware error modeling for deep learning inference," in Proceedings of the 56th Annual Design Automation Conference 2019. ACM, Jun. 2019. [Online]. Available: https://doi.org/10.1145/3316781.3317770
[26] M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard, M. Kudlur, J. Levenberg, R. Monga, S. Moore, D. G. Murray, B. Steiner, P. Tucker, V. Vasudevan, P. Warden, M. Wicke, Y. Yu, and X. Zheng, "TensorFlow: A system for large-scale machine learning," in 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), 2016, pp. 265–283.
[27] A. Krogh and J. A. Hertz, "A simple weight decay can improve generalization," in Advances in Neural Information Processing Systems 4, J. E. Moody, S. J. Hanson, and R. P. Lippmann, Eds. Morgan-Kaufmann, 1992, pp. 950–957. [Online]. Available: http://papers.nips.cc/paper/563-a-simple-weight-decay-can-improve-generalization.pdf
[28] S. Bos and E. Chug, "Using weight decay to optimize the generalization ability of a perceptron," in Proceedings of International Conference on Neural Networks (ICNN'96), vol. 1, 1996, pp. 241–246.
[29] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam, "MobileNets: Efficient convolutional neural networks for mobile vision applications," CoRR, vol. abs/1704.04861, 2017. [Online]. Available: http://arxiv.org/abs/1704.04861
[30] T. Tuma, A. Pantazi, M. Le Gallo, A. Sebastian, and E. Eleftheriou, "Stochastic phase-change neurons," Nature Nanotechnology, vol. 11, pp. 693–699, 2016.