Accurate deep neural network inference using computational phase-change memory
Vinay Joshi, Manuel Le Gallo, Simon Haefeli, Irem Boybat, S.R. Nandakumar, Christophe Piveteau, Martino Dazzi, Bipin Rajendran, Abu Sebastian, Evangelos Eleftheriou
This is a pre-print of an article accepted for publication in Nature Communications.

Affiliations: IBM Research - Zurich, Säumerstrasse 4, 8803 Rüschlikon, Switzerland; King's College London, Strand, London WC2R 2LS, United Kingdom; ETH Zurich, Rämistrasse 101, 8092 Zurich, Switzerland; Ecole Polytechnique Federale de Lausanne (EPFL), 1015 Lausanne, Switzerland.

Corresponding authors: Manuel Le Gallo ([email protected]); Abu Sebastian ([email protected])

(Dated: 14 April 2020)
In-memory computing is a promising non-von Neumann approach for making energy-efficient deep learning inference hardware. Crossbar arrays of resistive memory devices can be used to encode the network weights and perform efficient analog matrix-vector multiplications without intermediate movements of data. However, due to device variability and noise, the network needs to be trained in a specific way so that transferring the digitally trained weights to the analog resistive memory devices will not result in significant loss of accuracy. Here, we introduce a methodology to train ResNet-type convolutional neural networks that results in no appreciable accuracy loss when transferring weights to in-memory computing hardware based on phase-change memory (PCM). We also propose a compensation technique that exploits the batch normalization parameters to improve the accuracy retention over time. We achieve a classification accuracy of 93.7% on the CIFAR-10 dataset and a top-1 accuracy on the ImageNet benchmark of 71.6% after mapping the trained weights to PCM. Our hardware results on CIFAR-10 with ResNet-32 demonstrate an accuracy above 93.5% retained over a one-day period, where each of the 361,722 synaptic weights of the network is programmed on just two PCM devices organized in a differential configuration.
I. INTRODUCTION
Deep neural networks (DNNs) have revolutionized the field of artificial intelligence and have achieved unprecedented success in cognitive tasks such as image and speech recognition. Platforms for deploying the trained model of such networks and performing inference in an energy-efficient manner are highly attractive for edge computing applications. In particular, battery-powered internet-of-things devices and autonomous cars could especially benefit from fast, low-power, and reliably accurate DNN inference engines. Significant progress in this direction has been made with the introduction of specialized hardware for inference operating at reduced digital precision (4 to 8-bit), such as Google's tensor processing unit (TPU) and low-power graphical processing units (GPUs) such as NVIDIA T4. While these platforms are very flexible, they are based on architectures where there is a physical separation between memory and processing units. The models are typically stored in off-chip memory, leading to constant shuttling of data between memory and processing units, which limits the maximum achievable energy efficiency.

In order to reduce the data transfers to a minimum in inference accelerators, a promising avenue is to employ in-memory computing using non-volatile memory devices. Both charge-based storage devices, such as Flash memory, and resistance-based (memristive) storage devices, such as metal-oxide resistive random-access memory (ReRAM) and phase-change memory (PCM), are being investigated for this. In this approach, the network weights are encoded as the analog charge state or conductance state of these devices organized in crossbar arrays, and the matrix-vector multiplications during inference can be performed in-situ in a single time step by exploiting Kirchhoff's circuit laws. The fact that these devices are non-volatile (the weights will be retained when the power supply is turned off) and have multi-level storage capability (a single device can encode an analog range of values as opposed to 1 bit) is very attractive for inference applications. However, due to the analog nature of the weights programmed in these devices, only limited precision can be achieved in the matrix-vector multiplications, and this could limit the achievable inference accuracy of the accelerator.

One potential solution to this problem is to train the network fully on hardware, such that all hardware non-idealities would be de facto included as constraints during training. Another similar approach is to perform partial optimizations of the hardware weights after transferring a trained model to the chip. The drawback of these approaches is that every neural network would have to be trained on each individual chip before deployment. Off-line variation-aware training schemes have also been proposed, where hardware non-idealities such as device-to-device variations, defective devices, or IR drop are first characterized and then fed into the training algorithm running in software. However, these approaches would require a detailed characterization of every chip on which the network is deployed. A more generic alternative is to inject noise during training, for instance to the inputs, synaptic weights, and pre-activations. However, previous demonstrations have generally been limited to rather simple and shallow networks, and experimental validations of the effectiveness of the various approaches have been missing.
We are aware of one recent work that analyzed more complex problems such as ImageNet classification; however, the hardware model used was rather abstract and no experimental validation was presented.

In this work, we explore injecting noise to the synaptic weights during the training of DNNs in software as a generic method to improve the network resilience against analog in-memory computing hardware non-idealities. We focus on the ResNet convolutional neural network (CNN) architecture, and introduce a number of techniques that allow us to achieve a classification accuracy of 93.7% on the CIFAR-10 dataset and a top-1 accuracy of 71.6% on the ImageNet benchmark after mapping the trained weights to PCM synapses. In contrast to previous approaches, the noise injected during training is crudely estimated from a one-time all-around hardware characterization, and captures the combined effect of read and write noise without introducing additional noise-related training hyperparameters. We validate the training approach through hardware/software experiments, where each of the 361,722 weights of ResNet-32 is programmed on two PCM devices of a prototype chip, and the rest of the network functionality is simulated in software. We achieve an experimental accuracy of 93.75% after programming, which stays above 92.6% over a period of 1 day. To improve the accuracy retention further, we develop a method to periodically calibrate the batch normalization parameters to correct the activation distributions during inference. We demonstrate a significant improvement in the accuracy retention with this method (up to 93.5% on hardware for CIFAR-10) compared with a simple global scaling of the layers' outputs, at the cost of additional digital computations during calibration. Finally, we discuss our training approach with respect to other methods and quantify the tradeoffs in terms of accuracy and ease of training.
II. RESULTS

A. Problem statement
For our experiments, we consider two residual networks on two different datasets: ResNet-32 on the CIFAR-10 dataset, and ResNet-34 on the ImageNet dataset. As shown in Fig. 1a, ResNet-32 consists of 3 different ResNet blocks, each comprising ten convolution layers with 3 × 3 kernels. The weights of each layer are mapped onto memristive crossbar arrays as illustrated in Fig. 1b. Each synaptic weight can be mapped on a differential pair of memristive devices that are located on two different columns. For a given layer l, the synaptic weight W^l_ij of the (i, j)-th synaptic element is represented by the effective synaptic conductance G^l_ij given by

$G^{l}_{ij} = G^{l,+}_{ij} - G^{l,-}_{ij}$,   (1)

where G^{l,+}_ij and G^{l,-}_ij are the conductance values of the two devices forming the differential pair. Those device conductance values are defined as the effective conductance perceived in the operation of a non-ideal memristive crossbar array, and therefore include all the circuit non-idealities from the crossbar and peripheral circuitry.

The mapping between the synaptic weight W^l_ij obtained after software training and the corresponding synaptic conductance is given by

$G^{l}_{ij} = W^{l}_{ij} \times \frac{G_{max}}{W^{l}_{max}} + \delta G^{l}_{ij} = G^{l}_{T,ij} + \delta G^{l}_{ij}$,   (2)

where G_max is the maximum reliably programmable device conductance and W^l_max is the maximum absolute synaptic weight value of layer l. δG^l_ij represents the synaptic conductance error from the ideal target conductance value G^l_{T,ij} = W^l_ij × G_max / W^l_max. δG^l_ij is a time-varying random variable that describes the effects of non-ideal device programming (inaccuracies associated with write) and conductance fluctuations over time (inaccuracies associated with read). Possible factors leading to such conductance errors include inaccuracies in programming the synaptic conductance to G^l_{T,ij}, 1/f noise from memristive devices and circuits, temporal conductance drift, device-to-device variations, defective (stuck) devices, and circuit non-idealities (e.g. IR drop).
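To make the mapping of Eqs. (1) and (2) concrete, the following sketch converts a trained weight matrix into target conductances for a differential device pair, with the positive part of the scaled weight assigned to the G+ device and the negative part to the G- device (the single-device-per-sign convention used later in Section II D). This is a minimal NumPy illustration, not the authors' code; the function names and the example G_max value are assumptions.

```python
import numpy as np

def weights_to_differential_conductances(W, G_max):
    """Map a layer's weights to target conductances of a differential PCM pair,
    following G_T = W * G_max / W_max (Eq. 2)."""
    W_max = np.max(np.abs(W))                 # maximum absolute weight of the layer
    G_T = W * G_max / W_max                   # ideal target effective conductance
    G_plus = np.where(G_T > 0, G_T, 0.0)      # positive part -> G+ device
    G_minus = np.where(G_T < 0, -G_T, 0.0)    # negative part -> G- device
    return G_plus, G_minus

# Example: the effective conductance G+ - G- (Eq. 1) is proportional to W.
W = np.random.randn(144, 16) * 0.1            # hypothetical flattened 3x3x16 filters
G_plus, G_minus = weights_to_differential_conductances(W, G_max=25.0)  # G_max in uS (assumed value)
G_eff = G_plus - G_minus
```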
FIG. 1. a, ResNet-32 network architecture for CIFAR-10 image classification. The architecture of ResNet-32 used in this study is a slightly modified version of the original implementation with fewer input and output channels in ResNet blocks 2 and 3. b, Training and inference of an example layer of ResNet-32 according to the methodology proposed in this paper. Software training is performed by injecting a random noise term to the weights used during the forward propagation, δW^l_tr, which is representative of the combined read and write noise of the memristive devices used during inference (see Section II B). When transferring the weights of a convolution layer to memristive crossbars for inference, they are flattened into a 2-D matrix by collapsing each filter into a single vector programmed on a crossbar column, and stacking all filters on separate columns. The weights are then programmed as the differential conductance of two memristive devices. Input activations a^l are applied as voltages on the crossbar rows. The output current from the column containing the G- devices is subtracted from the one from the column containing the G+ devices. The differential current output I from the crossbar then routes to a digital circuitry that performs batch normalization and the corresponding rectified linear unit (ReLU) activation function, in order to obtain the input activations for the next layer, a^{l+1}. The final softmax activation function can be performed off-chip if required. An optional correction of the batch normalization parameters can be periodically performed to improve the accuracy retention over time (see Sections II D and II E). The input image is padded with zero values at the border to ensure that the convolution operation preserves the height and width of the image. Therefore, considering an input image of size n × n, the convolution operation can be performed in n^2 matrix-vector multiplication cycles.

Clearly, a direct mapping of the synaptic weights of a DNN trained with 32-bit floating point (FP32) precision to the same DNN with memristive synapses is expected to degrade the network accuracy due to the added error in the weights arising from δG^l_ij. For existing memristive technologies, the magnitude of δG^l_ij may range from 1-10% of the magnitude of G^l_{T,ij}, which in general is not tolerable by DNNs trained with FP32 without any constraints. Imposing such errors as constraints during training can be beneficial in improving the network accuracy. In fact, quantization of the weights or activations, and injecting noise on the weights, activations or gradients have been widely used as DNN regularizers during training to reduce overfitting on the training dataset. These techniques can improve the accuracy of DNN inference when it is performed with the same model precision as during training. However, achieving baseline accuracy while performing DNN inference on a model which is inevitably different from the one obtained after training, as is the case for any analog in-memory computing hardware, is a more difficult problem and requires additional investigations.

Although a large body of efficient techniques to train DNNs with reduced digital precision has been reported, it is unlikely that such procedures can generally be applied as-is to analog in-memory computing hardware due to the random nature of δG^l_ij. Since quantization errors coming from rounding to reduced fixed-point precision are not random, DNNs trained in this way are not a priori expected to be suitable for deployment on analog in-memory computing hardware. Techniques that inject random noise during training are therefore more relevant for this purpose. Recent works have also proposed to apply noise to the layer inputs or pre-activations in order to improve the network tolerance to hardware noise. In this work, we follow the original approach of Murray et al. of injecting Gaussian noise to the synaptic weights during training. Next, we discuss different techniques that we employed together with synaptic weight noise in order to improve the accuracy of inference on ResNet and achieve close to software-equivalent accuracy after transferring the weights to PCM hardware.

B. Training procedure
When performing inference with analog in-memory computing hardware, the DNN experiences errors primarily due to (i) inaccurate programming of the network weights onto the devices (write noise) and (ii) temporal fluctuations of the hardware weights (read noise). We can cast the effect of these errors into a single error term δG^l_ij that distorts each synaptic weight when performing forward propagation during inference. Hence, we propose to add random noise that corresponds to the error induced by δG^l_ij to the synaptic weights at each forward pass during training (see Fig. 1b). The backward pass and weight updates are performed with weights that did not experience this noise. We found that adding noise to the weights only in the forward propagation is sufficient to achieve close to baseline accuracy for a noise magnitude comparable to that of our hardware, and adding noise during the backward propagation did not improve the results further. For simplicity, we assume that δG^l_ij is Gaussian distributed, which is usually the case for analog memristive hardware. Weights are linearly mapped to the entire conductance range G_max of the hardware, hence the standard deviation σ^l_δWtr of the Gaussian noise on weights to be applied during training, for a layer l, can be computed as

$\frac{\sigma^{l}_{\delta W_{tr}}}{W^{l}_{max}} \equiv \eta_{tr} = \frac{\sigma_{\delta G}}{G_{max}}$,   (3)

where σ_δG is a representative standard deviation of the combined read and write noise measured from hardware. During training, the weight distribution of every layer and hence W^l_max changes, therefore σ^l_δWtr is recomputed after every weight update so that η_tr stays constant throughout training. We found this to be especially important in achieving good training convergence with this method.

Weight initialization can have a significant effect on DNN training. Two different weight initializations can lead to completely different minima when optimizing the network objective function. The network optimum when training with additive noise could be closer to the FP32 training optimum than to a completely random initialization. So it can be beneficial to initialize weights from a pretrained baseline network and then retrain this network by injecting noise. A similar observation was reported for training ResNet with reduced digital precision. For achieving high classification accuracy in our experiments, we found this strategy more helpful than random initialization.

The noise injected during training according to Eq. (3) is closely related to the maximum weight of a layer, and can thus grow uncontrollably with outlier weight values. Controlling the weight distribution in a desirable range can improve the network training convergence and makes the mapping of weights to hardware with limited conductance range easier. We therefore clip the synaptic weights at layer l after every weight update in the range [-α × σ_Wl, α × σ_Wl], where σ_Wl is the standard deviation of the weights in layer l and α is a tunable hyperparameter (α = 2 for the CIFAR-10 experiments, see Methods). In the following, we report the accuracy obtained for different amounts of relative noise η_tr injected during training. We also show how the accuracy is affected when the inference weights are perturbed by a certain amount of relative noise η_inf ≡ σ^l_δWinf / W^l_max, where σ^l_δWinf is the standard deviation of the noise injected to the weights of layer l before performing inference on the test dataset.
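For illustration, the snippet below sketches one training step in which Gaussian noise with standard deviation η_tr · W^l_max (Eq. (3)) is added to each layer's weights for the forward pass only, the gradients are computed through the noisy forward pass, and the update and subsequent clipping to [-α·σ_W, α·σ_W] are applied to the noise-free weights. This is a minimal TensorFlow sketch with our own naming conventions and a plain SGD-style update; it approximates, but is not, the authors' training code, and selecting layers by the substring "kernel" is an assumption about the model definition.

```python
import tensorflow as tf

def noisy_train_step(model, x, y, loss_fn, optimizer, eta_tr, alpha):
    """One training step with forward-pass-only weight noise (Eq. 3) and weight clipping."""
    kernels = [v for v in model.trainable_variables if "kernel" in v.name]
    clean = [tf.identity(w) for w in kernels]            # keep noise-free copies
    for w in kernels:                                     # perturb weights for the forward pass
        sigma = eta_tr * tf.reduce_max(tf.abs(w))         # sigma_dWtr = eta_tr * W_max, per layer
        w.assign_add(tf.random.normal(tf.shape(w), stddev=sigma))
    with tf.GradientTape() as tape:
        loss = loss_fn(y, model(x, training=True))        # forward pass with noisy weights
    grads = tape.gradient(loss, model.trainable_variables)
    for w, w_clean in zip(kernels, clean):                # restore noise-free weights before the update
        w.assign(w_clean)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    for w in kernels:                                     # clip to [-alpha*std, alpha*std] after the update
        std = tf.math.reduce_std(w)
        w.assign(tf.clip_by_value(w, -alpha * std, alpha * std))
    return loss
```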
The test accuracy of ResNet-32 on CIFAR-10 obtained for different amounts of noise injected during training, without inducing any perturbation during inference (η_inf = 0), is shown in Fig. 2a. It can be seen that the training algorithm is able to achieve a test accuracy close to the software baseline of 93.87% for values of η_tr up to several percent.
FIG. 2. a, Test accuracy on CIFAR-10 obtained for different amounts of relative injected weight noise during training η_tr, without inducing any perturbation during inference (η_inf = 0). b, Test accuracy on CIFAR-10 obtained for the networks trained with different amounts of relative weight noise (η_tr = 2.6%, 3.77%, 5.65%, 7.53%, and 11.3%) as a function of the weight noise injected during inference η_inf. In most cases, η_tr can be increased above η_inf up to a certain point and still lead to comparable or slightly higher test accuracy than for η_tr = η_inf. However, when η_tr becomes much higher than η_inf, the test accuracy decreases due to the inability of the network to achieve baseline accuracy when trained with large η_tr. c, Test accuracy on CIFAR-10 as a function of η_tr = η_inf. The error bars represent the standard deviation over 100 inference runs averaged over 10 training runs. d, Top-1 accuracy on ImageNet as a function of η_tr = η_inf, with noise on all layers and with no noise on the first and last layers. The error bars represent the standard deviation over 10 inference runs on a single training run.

The resilience of networks trained with different amounts of η_tr to weight perturbations during inference, η_inf, is shown in Fig. 2b. For a given value of η_inf, in general, the highest test accuracy can be obtained for the network that has been trained with a comparable amount of synaptic weight noise, i.e. for η_tr ≈ η_inf. The test accuracy for η_tr = η_inf is shown in Fig. 2c.
It can be seen that, up to a moderate amount of relative noise, a test accuracy within 0.5% of the software baseline is achievable. The impact of the weight initialization, clipping, and learning rate scheduling on the accuracy is shown in Supplementary Fig. 2. Not incorporating any of those three techniques results in at least a 1% drop in test accuracy at the corresponding η_tr = η_inf.

The top-1 accuracy on ImageNet as a function of η_tr = η_inf is shown in Fig. 2d. Consistent with previous observations, we found that the network recovers high accuracy extremely quickly when retraining with additive noise due to quick updates of the batch normalization parameters (see Supplementary Note 1), and obtained satisfactory convergence after only 8 epochs. The accuracy on ImageNet is much more sensitive to the noise injected during training than for CIFAR-10: when noise is injected on all layers, there is more than a 0.5% accuracy drop from the baseline even at relative noise levels of only a few percent. In the literature, many network compression techniques allow higher precision for the first and last layers, which are more sensitive to noise. We applied the same simplification to our problem, which means that we removed the noise during training on the first convolutional layer and the last dense layer, and performed inference with the first and last layer without noise. The obtained accuracy after training, by injecting the same training and inference noise as previously, can be increased by more than 1% with this technique (see Fig. 2d).
FIG. 3. a, Cumulative distributions of 11 representative iteratively programmed conductance levels on 10,000 PCM devices per level. The vertical dashed lines denote the target conductance for each level. b, Conductance standard deviation of the 11 levels as a function of target conductance. The inset shows a representative conductance distribution of one level. c, Test accuracy on CIFAR-10 after software training and after weight transfer to PCM synapses for different training schemes. d, Top-1 accuracy on ImageNet after software training and after weight transfer to PCM synapses for different training schemes. In c and d, the error bars represent the standard deviation over 10 inference runs.
C. Weight transfer to PCM-based synapses
In order to experimentally validate the effectiveness of the above training methodology, we performed experiments on a prototype multi-level PCM chip comprising 1 million PCM devices fabricated in 90 nm CMOS baseline technology. PCM is a memristive technology which records data in a nanometric volume of phase-change material sandwiched between two electrodes. The phase-change material is in the low-resistive crystalline phase in an as-fabricated device. By applying a current pulse of sufficient amplitude (typically referred to as the RESET pulse), an amorphous region around the narrow bottom electrode is created via a melt-quench process. The device will be in a low conductance state if the high-resistive amorphous region blocks the current path between the two electrodes. The size of the amorphous region can be modulated in an almost completely analog manner by the application of suitable electrical pulses. Hence, a continuum of conductance values can be programmed in a single PCM device over a range of more than two orders of magnitude.

An optimized iterative programming algorithm was developed to program the conductance values in the PCM devices with high accuracy (see Methods). The experimental cumulative distributions of conductance values for 11 representative programmed levels, measured approximately 25 seconds after programming, are shown in Fig. 3a. The standard deviation of these distributions is extracted and fitted with a polynomial function of the target conductance (dashed lines in Fig. 3a) as shown in Fig. 3b. For all levels, we achieve a standard deviation of less than 1.2 µS, which is more than 2 times lower than that reported in previous works on nanoscale PCM arrays for a similar conductance range.

To evaluate the expected accuracy after weight transfer, δG^l_ij is modeled as a Gaussian distributed random variable with 0 mean and standard deviation given by the fitted curve of Fig. 3b for the corresponding target conductance |G^l_{T,ij}|, computed with the maximum conductance G_max used for the hardware mapping. The resulting test accuracy obtained after software training and after weight transfer to PCM synapses for ResNet-32 on CIFAR-10 is shown in Fig. 3c for different training procedures. It can be seen that standard FP32 training without constraints performs the worst after transfer to PCM synapses. Training with 4-bit precision weights (using the method described in Ref. 28), which is roughly the effective precision of our PCM devices, improves the performance after transfer with respect to FP32, but nevertheless the accuracy decreases by more than 1% after transferring the 4-bit weights to PCM. Training ternary digital weights leads to a smaller performance drop after transfer. When training with additive noise injection, with η_tr set according to Eq. (3) using the median σ_δG of the 11 values reported in Fig. 3b, the best overall performance after transfer to PCM is obtained. The resulting accuracy of 93.7% is less than 0.2% below the FP32 baseline. A rather broad range of values of η_tr leads to a similar resulting accuracy (see Supplementary Fig. 3), demonstrating that η_tr does not have to be very precisely determined for obtaining satisfactory results on PCM. The accuracy obtained without perturbing the weights after training by injecting noise is slightly higher than the FP32 baseline, which could be attributed to improved generalization resulting from the additive noise training.

The top-1 accuracy for ResNet-34 on ImageNet after transfer to PCM synapses for different training procedures is shown in Fig. 3d. Training with additive noise increases the accuracy by approximately 6% on PCM compared with FP32 and 4-bit training. The accuracy of 71.6% achieved with additive noise training on PCM is significantly higher than that reported in Fig. 2d for a comparable amount of inference noise.
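The effect of transferring trained weights to PCM synapses can be simulated directly from the data of Fig. 3b: each target conductance receives an additive Gaussian programming error whose standard deviation follows the polynomial fit of the measured level statistics. The sketch below illustrates this procedure; the polynomial coefficients, the G_max value, and the function names are placeholders for illustration, not the measured fit.

```python
import numpy as np

# Placeholder polynomial for sigma(|G_T|) in microsiemens; the actual coefficients
# come from fitting the measured level statistics of Fig. 3b.
SIGMA_POLY = np.poly1d([-0.002, 0.05, 0.7])      # hypothetical fit, roughly 0.7-1.2 uS over the range

def transfer_to_pcm(W, G_max=25.0, rng=np.random.default_rng()):
    """Simulate one weight transfer: weights -> target conductances -> noisy
    programmed conductances -> weights used for software inference."""
    W_max = np.max(np.abs(W))
    G_T = W * G_max / W_max                       # target effective conductance (uS), Eq. (2)
    sigma = SIGMA_POLY(np.abs(G_T))               # conductance-dependent programming error std (uS)
    G = G_T + rng.normal(0.0, sigma)              # programmed conductance with write noise
    return G * W_max / G_max                      # map back to weight units

# Usage: evaluate the network with W_pcm = transfer_to_pcm(W) in place of W, repeating
# over several random draws to estimate the mean and spread of the resulting accuracy.
```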
D. Hardware/software inference experiment on CIFAR-10

Although we could achieve good test accuracy after weight transfer to PCM synapses as shown in the previous section, an important challenge for any analog in-memory computing hardware is to be able to retain this accuracy over time. This is especially true for PCM due to the high 1/f noise experienced in these devices as well as temporal conductance drift. The conductance values in PCM drift over time t according to the relation $G(t) = G(t_0)(t/t_0)^{-\nu}$, where G(t_0) is the conductance measured at time t_0 after programming and ν is the drift exponent, which depends on the device, phase-change material, and phase configuration of the PCM (ν is higher for the amorphous than the crystalline phase). In our PCM devices, ν also depends on the programmed conductance state (see Supplementary Note 2).

The 361,722 synaptic weights of ResNet-32, trained with the additive noise injection described in the previous section, are programmed individually on two PCM devices of the chip. Depending on the sign of G^l_{T,ij}, either G^{l,+}_ij or G^{l,-}_ij is iteratively programmed to |G^l_{T,ij}|, and the other device is RESET close to 0 µS with a single pulse of 450 µA amplitude and 50 ns width. The iterative programming algorithm converged on 99.1% of the devices programmed to nonzero conductance, and no screening for defective devices on the chip was performed prior to the experiments. The scatter plot of the PCM weights measured approximately 25 seconds after programming versus the target weights W^l_ij is shown in Fig. 4a. After programming, the PCM analog conductance values were periodically read from hardware over a period of 1 day, scaled to the network weights, and reported to the software that performed inference on the test dataset (see Methods).

In addition to the experiment, we developed an advanced behavioral model of the hardware in order to precisely capture the conductance evolution over time during inference (see Supplementary Note 2). The model is built based on an extensive experimental characterization of the array-level statistics of hardware noise and drift. Conductance drift is modeled using a Gaussian distributed drift exponent across devices, whose mean and standard deviation both depend on the target conductance state |G^l_{T,ij}|. Conductance noise with the experimentally observed 1/f frequency dependence is also incorporated, with a magnitude that depends on the target conductance state and time. The model is able to accurately reproduce both the array-level statistics (see Fig. 4b) and individual device behavior (see Fig. 4c) observed over the duration of the experiment. Accurate modeling of all the complex dependencies of noise and drift as a function of time and conductance state was found to be very critical in being able to reproduce the experimental evolution of the accuracy on ResNet.

The resulting accuracy on CIFAR-10 over time is shown in Fig. 4d. The test accuracy measured 25 seconds after programming is 93.75%. However, if nothing is done to compensate for conductance drift, the accuracy quickly decreases down to 10% (random guessing) within approximately 1000 seconds. This is because the magnitude of the PCM weights gradually reduces over time due to drift and this prevents the activations from properly propagating throughout the network. A simple global scaling calibration procedure can be used to compensate for the effect of drift on the matrix-vector multiplications performed with PCM crossbars. As proposed in Ref. 34, the summed current of a subset of the columns in the array can be periodically read over time at a constant voltage.
FIG. 4. a, Scatter plot of weights programmed in the PCM chip versus target weights obtained after training. The inset shows the distribution of the relative error between programmed and target synaptic conductance, (G^l_ij - G^l_{T,ij}) / G_max, and its standard deviation σ. b, Distributions of programmed devices whose target conductances fall within a small window around 5 representative G_T values. The distributions are shown at four different times spanning the experiment duration of one day. The filled bars are the measured hardware data, the black lines are the PCM model. c, Individual device conductance evolution over time of four arbitrarily picked devices from the chip programmed to four distinct G_T values, along with one PCM model realization for the same G_T values. d, Measured test accuracy over time from the weights of the PCM chip, along with the corresponding PCM model match. A global drift compensation (GDC) procedure is performed for every layer before performing inference on the test set. The filled areas from the PCM model correspond to one standard deviation over 25 inference runs.

The resulting total current is then divided by the summed current of the same columns read at time t_0, i.e. shortly after programming. This results in a single scaling factor that can be applied to the output of the entire crossbar in order to compensate for a global conductance shift (see Methods and Supplementary Fig. 4). Since this factor can be combined with the batch normalization parameters, it does not incur any additional overhead when performing inference. This simple global drift compensation (GDC) procedure was implemented for every layer before carrying out inference on the test set, and the results are shown in Fig. 4d. It can be seen that GDC allows the retention of a test accuracy above 92.6% for 1 day on the PCM chip, and effectively prevents the effect of global weight decay over time, as illustrated in Supplementary Fig. 4. A good agreement of the accuracy evolution between model and experiment is obtained, hence validating its use for extrapolating results over a longer period of time and for assessing the accuracy of larger networks that cannot fit on our current hardware.
E. Adapting batch normalization statistics to improve the accuracy retention
Although GDC can compensate for a global conductance shift across the array, it cannot mitigate the effect of 1/f noise and drift variability across devices. From the model, we observe that 1/f noise is responsible for the random accuracy fluctuations, whereas drift variability and its dependence on the target conductance state cause the monotonic accuracy decrease over time (see Supplementary Fig. 5).

FIG. 5. a, The AdaBS calibration procedure consists in updating the running mean µ and variance σ² parameters of the batch normalization performed in the digital unit of the in-memory computing hardware. The calibration is performed periodically when the device is idle, and after calibration the values of µ and σ² of every layer are updated for subsequent inference. Note that during the calibration phase, batch normalization is performed using the mini-batch mean µ_B and variance σ²_B instead of µ and σ² (see Methods and Supplementary Note 3). b, Test accuracy of ResNet-32 on CIFAR-10 with GDC and AdaBS using the PCM model, along with the experimental test accuracy obtained by applying AdaBS on ResNet-32 with the measured weights from the PCM chip. The filled areas from the PCM model correspond to one standard deviation over 25 inference runs. c, Top-1 accuracy of ResNet-34 on ImageNet with GDC and AdaBS computed using the PCM model. Implementations using PCM synapses for all layers, as well as with the first and last layers in digital FP32 and PCM synapses for all other layers, are shown. In the latter, no noise is applied on the first and last layers during training. The filled areas correspond to one standard deviation over 10 inference runs.

In order to improve the accuracy retention further, we propose to leverage the batch normalization parameters to correct the activation distributions during inference such that their mean and variance match those that were optimally learned during training. During inference, batch normalization is performed by normalizing the preactivations by their corresponding running mean µ and variance σ² computed during training. Then, scale and shift factors (γ and β) that were learned through backpropagation are applied to the normalized preactivations. Since γ and β are learnable parameters, it is not desirable to change them, as that would require retraining the model on the PCM devices. However, updating µ and σ² is more intuitive, since the mean and variance of the preactivations are directly affected by noise and drift. Leveraging this idea, we introduce a new compensation technique called adaptive batch normalization statistics update (AdaBS), which improves the accuracy retention beyond GDC at the cost of additional computations during the calibration phase.

As described in Fig. 5a, the calibration phase consists in sending multiple mini-batches from a set of calibration images that come from the same distribution as the images seen during inference. In this study, we use images from the training dataset as calibration images. The running mean and variance of the preactivations are computed across the entire calibration dataset. The new values of µ and σ² computed during calibration are then used for subsequent inference. The main advantage of this technique is that it does not incur additional digital computations nor weight programming during inference, since we are only updating the batch normalization parameters µ and σ² when the calibration is performed. However, injecting the entire training dataset to recompute µ and σ² during the calibration phase would bring significant overhead. When reducing the amount of injected images, the number of updates of the running statistics becomes smaller, and if the momentum used for computing µ and σ² is not properly tuned to account for this, the network accuracy heavily decreases. To tackle this issue, we developed a procedure to obtain the optimal momentum as a function of the number of mini-batches used for calibration (see Methods and Supplementary Note 3). With this method, we were able to reduce the number of calibration images down to 5.2% of the CIFAR-10 training dataset (2,600 images) without affecting the accuracy. With that number of images, the overhead in terms of digital computations of the AdaBS calibration is about 52% of performing batch normalization during inference on the whole CIFAR-10 test set (see Supplementary Note 3). It may appear cumbersome to send so many images to the device to perform the calibration; however, since it is only performed periodically over time when the device is idle, and not every time an image is inferred by the network, the high calibration cost can be amortized. The calibration overhead can be further reduced by using more efficient variants of batch normalization such as the L1-norm version (see Supplementary Note 3). Moreover, although we used AdaBS (and GDC) to compensate solely for the drift of the PCM devices, the same procedure can be applied to mitigate conductance changes due to ambient temperature variations, a critical issue for any analog in-memory computing hardware.

The resulting accuracy when performing AdaBS on ResNet-32 with hardware weights before carrying out inference on the test set is shown in Fig. 5b. AdaBS allows the retention of a test accuracy above 93.5% over one day, an improvement of 0.9% compared with GDC. This improvement becomes 1.8% for one year when extrapolating the results using the PCM model. We also applied AdaBS on the ImageNet classification task with ResNet-34 trained with additive noise (Fig. 5c). When the first and last layers are implemented in digital FP32, the initial accuracy increases to 71.9% and the retention is significantly improved. This technique, combined with AdaBS, allows the retention of an accuracy above 71% for one year. Drawbacks in efficiency when performing inference on hardware in this way have to be mentioned, but they stay limited given the small number of parameters and input size of the first and last layers. [Footnote: The first layer's input is a large 224 × 224 image, but it has only 3 channels. For the last layer, the input is flattened to a single 512-dimensional vector (assuming a batch size of 1). The first and last layers contain less than 3% of the network weights.]
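A minimal sketch of the AdaBS calibration loop is given below: a small number of training mini-batches is propagated through the model with batch normalization in training mode, so that the running mean and variance are re-estimated while γ, β and all weights remain frozen. It assumes a Keras-style model and a plain exponential-moving-average update; the momentum value and the function names are illustrative and do not correspond to the optimized momentum derived in Supplementary Note 3.

```python
import tensorflow as tf

def adabs_calibrate(model, calibration_batches, momentum=0.99):
    """Re-estimate batch-norm running statistics (mu, sigma^2) on calibration data.
    Learned scale/shift (gamma, beta) and all weights are left untouched."""
    bn_layers = [m for m in model.submodules
                 if isinstance(m, tf.keras.layers.BatchNormalization)]
    for bn in bn_layers:
        bn.momentum = momentum          # tuned to the number of calibration mini-batches
    for x in calibration_batches:       # n randomly sampled mini-batches from the training set
        # training=True makes BN use mini-batch statistics and update the running mu / sigma^2,
        # but no gradients are computed, so gamma, beta and the weights are not modified.
        model(x, training=True)
    # Subsequent inference calls (training=False) use the updated running statistics.
```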
III. DISCUSSION

Combined together, the strategies developed in this study allow us to achieve the highest accuracies reported so far with analog resistive memory on the CIFAR-10 and ImageNet benchmarks with residual networks close to their original implementation. Although there is still room for improvement, especially on ImageNet, those accuracies are already comparable or higher than those reported on ternary weight networks, for example 71.6% top-1 accuracy of ResNet-34 on ImageNet with the first layer in FP32. Importantly, the accuracies we report are achieved with just a single nanoscale PCM device encoding the absolute value of a weight. A common approach that could improve the accuracy further is to use multiple devices to encode different bits of a weight, at the expense of an area and energy penalty, and additional support required by the peripheral circuitry. Aligned with previous observations, we notice that retraining ResNet with additive noise results mainly in adapting the batch normalization parameters, whereas the weights stay close to the full-precision weights trained without noise. Hence, retraining by injecting noise from a pretrained baseline network rather than from scratch is very effective, since the network recovers high accuracy very quickly, especially for ImageNet. Although our experiments are not done on a fully-integrated chip that supports all functions of deep learning inference, the most critical effects of array-level variability, noise, and drift are fully accounted for, because each weight of the network is programmed on individual PCM devices of our array. Aspects of a fully-integrated chip that are not entirely captured in our experiments, such as IR drop and additional circuit non-idealities such as offsets and noise, have been studied in previous works and could be mitigated by additional retraining methods. Additional errors due to quantization coming from the crossbar data converters are analyzed further below.

There exist many different methods of training a neural network with noise that aim to improve the resilience of the model to analog mixed-signal hardware. These include injecting additive noise on the inputs of every layer, on the preactivations, or just adding noise on the input data. Moreover, injecting multiplicative Gaussian noise to the weights (σ^l_{δWtr,ij} ∝ |W^l_ij|) is also defensible regarding the observed noise on the hardware. We analyzed the four aforementioned methods, attempting to reach the same accuracy demonstrated previously after weight transfer to PCM devices, to identify their possible benefits and drawbacks (see Supplementary Note 4). We found that it is possible to adjust the training procedure of all four methods to achieve a similar accuracy on CIFAR-10 after transferring the weights to PCM synapses. Somewhat surprisingly, even adding noise on the input data during training, which is just a simple form of data augmentation, leads to a model which is more resilient to weight perturbations during inference. This shows that it is not necessary to train a model with very complicated noise models that imitate the observed hardware noise precisely. As long as the data propagated through the network is corrupted with a magnitude comparable to that of the hardware noise, the network becomes more resilient. An advantage of injecting noise on the weights, however, is that the magnitude of the injected noise can be obtained directly, through η_tr, from a simple hardware characterization, avoiding any hyperparameter search for noise scaling factors. The value of η_tr does not have to be very precise either, because there is a range of values that lead to similar accuracy after transfer to PCM (see Supplementary Fig. 3). Moreover, we found that injecting noise on the weights achieves better accuracy retention over time (see Supplementary Note 4), which suggests that weight noise mimics the behavior of the PCM hardware better.

A critical issue for in-memory computing hardware is the need for digital-to-analog (analog-to-digital) conversion every time data goes in (out of) the crossbar arrays. These data conversions lead to quantization of the activations and preactivations, respectively, which introduces additional errors in the forward propagation. Based on a recent ADC survey, 8-bit data conversion is a good tradeoff between precision and energy consumption. Hence, we analyzed the effect of quantizing the input and output of every layer of ResNet-32 and ResNet-34 to 8-bit on the inference accuracy. We set the input/output quantization ranges based on a high percentile (99th or above) of the corresponding activation distributions, and found that the resulting accuracy degradation is small for both ResNet-32 on CIFAR-10 (< 0.05% drop) and ResNet-34 on ImageNet (< 0.15% drop) after weight transfer to PCM synapses. The accuracy evolution over time, retaining the same quantization ranges, does not degrade significantly further and stays well within one standard deviation of that obtained without quantization. The small accuracy deviations could potentially be overcome by including the quantization in the retraining process, which will likely be necessary if less than 8-bit resolution is desired for higher energy efficiency.

Although a computational memory accelerates the matrix-vector multiplication operations in a DNN, communicating activations between computational memory cores executing different layers can become a bottleneck. This bottleneck depends upon two factors: (i) the way different layers are connected to each other and (ii) the latency of the hardware implementation to transfer activations from one core to another. Designing optimal interconnectivity between the cores for state-of-the-art deep CNNs is an open research problem. Indeed, having the network weights stationary during execution in a computational memory puts limits on what portion of the computation can be forwarded to different cores and what cannot. This ultimately results in long-established hardware communication fabrics being ill-fit for the task. One topology for communication fabrics that is well-suited for computational memory is proposed by Dazzi et al. It is based on a 5 parallel prism (5PP) graph topology and facilitates inter-layer pipelined execution of CNNs. The proposed 5PP topology allows the mapping of all the primary connectivities of state-of-the-art neural networks, including ResNet, DenseNet and Inception-style networks. As discussed in Ref. 40, the ResNet-32 implementation with 5PP can result in a potential 2× improvement in pipeline stage latency with similar bandwidth requirements compared with a standard 2D-mesh. Assuming 8-bit activations, communication links with a data rate of 5 Gbps, and a crossbar computational cycle time of 100 ns, a single-image inference latency of 52 µs and a frame rate of 38,600 frames per second (FPS) for ResNet-32 on CIFAR-10 are estimated. As an approximate comparison, YodaNN, a digital DNN inference accelerator for binary weight networks with an ultra-low power budget, achieves 434.8 FPS in high-throughput mode for a 9-layer CNN (BinaryConnect) on CIFAR-10. Although not a direct comparison, the proposed topology and pipelined execution of ResNet-32 could result in an 88× speedup, with a deeper network than the digital solution.

In summary, we introduced strategies for training ResNet-type CNNs for deployment on analog in-memory computing hardware, as well as for improving the accuracy retention on such hardware. We proposed to inject noise to the synaptic weights during the forward pass of training, with a magnitude proportional to the combined read and write conductance noise of the hardware. This approach, combined with judicious weight initialization, clipping, and learning rate scheduling, allowed us to achieve an accuracy of 93.7% on the CIFAR-10 dataset and a top-1 accuracy on the ImageNet benchmark of 71.6% after mapping the trained weights to analog PCM synapses. Our methods introduce only a single additional hyperparameter during training, the weight clip scale α, since the magnitude of the injected noise can be easily deduced from a one-time hardware characterization. After programming the trained weights of ResNet-32 on 723,444 PCM devices of a prototype chip, the accuracy computed from the measured hardware weights stayed above 92.6% over a period of 1 day, which is, to the best of our knowledge, the highest accuracy experimentally reported to date on the CIFAR-10 dataset by any analog resistive memory hardware. A global scaling procedure was used to compensate for the conductance drift of the PCM devices, which was found to be critical in improving the accuracy retention. However, global scaling could not mitigate the effect of 1/f noise and drift variability across devices, which led to accuracy fluctuations and a monotonic accuracy decrease over time, respectively. Periodically calibrating the batch normalization parameters before inference allowed us to alleviate those issues at the cost of additional digital computations, increasing the 1-day accuracy to 93.5% on hardware. These results demonstrate the feasibility of realizing accurate inference on complex DNNs through analog in-memory computing using existing PCM devices.

METHODS

A. Experiments on PCM hardware platform
The experimental platform is built around a prototype PCM chip that comprises 3 million PCM devices. The PCM array is organized as a matrix of word lines (WL) and bit lines (BL). In addition to the PCM devices, the prototype chip integrates the circuitry for device addressing and for write and read operations. The PCM chip is interfaced to a hardware platform comprising two field-programmable gate array (FPGA) boards and an analog front-end (AFE) board. The AFE board provides the power supplies as well as the voltage and current reference sources for the PCM chip. The FPGA boards are used to implement overall system control and data management as well as the interface with the data processing unit. The experimental platform is operated from a host computer, and a Matlab environment is used to coordinate the experiments. The PCM devices were integrated into the chip in 90-nm CMOS technology using the key-hole process described in Ref. 44. The phase-change material is doped Ge2Sb2Te5. The bottom electrode has a radius of ~20 nm and a length of ~65 nm. The phase-change material is ~100 nm thick and extends to the top electrode, whose radius is ~100 nm. All experiments performed in this work were done on an array containing 1 million devices accessed via transistors, which is organized as a matrix of 512 WL and 2048 BL.

A PCM device is selected by serially addressing a WL and a BL. To read a PCM device, the selected BL is biased to a constant voltage (300 mV) by a voltage regulator via a voltage generated off chip. The sensed current is integrated by a capacitor, and the resulting voltage is then digitized by the on-chip 8-bit cyclic analog-to-digital converter (ADC). The total duration of applying the read pulse and converting the data with the ADC is 1 µs. The readout characteristic is calibrated via on-chip reference polysilicon resistors. To program a PCM device, a voltage generated off chip is converted on chip into a programming current. This current is then mirrored into the selected BL for the desired duration of the programming pulse. Iterative programming involving a sequence of program-and-verify steps is used to program the PCM devices to the desired conductance values. The devices are initialized to a high-conductance state via a staircase-pulse sequence. The sequence starts with a RESET pulse of amplitude 450 µA and width 50 ns, followed by 6 pulses of amplitude decreasing regularly from 160 µA to 60 µA and with a constant width of 1000 ns. After initialization, each device is set to a desired conductance value through a program-and-verify scheme. The conductance of all devices in the array is read 5 times consecutively at a voltage of 0.3 V, and the mean conductance of these reads is used for verification. If the read conductance of a specific device does not fall within 0.25 µS of its target conductance, it receives a programming pulse whose amplitude is incremented or decremented proportionally to the difference between the read and target conductance. The pulse amplitude ranges between 80 µA and 400 µA. This program-and-verify scheme is repeated for a maximum of 55 iterations.

In the hardware/software inference experiments, the analog conductance values of the PCM devices encoding the network weights, G^{l,+}_ij and G^{l,-}_ij, are serially read individually with the 8-bit on-chip ADC at predefined timestamps spaced over a period of one day. The read conductance values at every timestamp are reported to a TensorFlow-based software. This software performs the forward propagation of the CIFAR-10 test set on the weights read from hardware and computes the resulting classification accuracy. The drift compensation techniques, GDC and AdaBS, are performed entirely in software at every timestamp based on the conductance values read from hardware.
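The program-and-verify scheme described above can be summarized in pseudocode form. The sketch below is an illustrative Python rendering with simplified pulse-amplitude bookkeeping and hypothetical device-access functions (read_conductance, apply_pulse); it is not the chip's firmware, and the proportional-gain constant and initial amplitude are assumptions.

```python
def iterative_program(device, g_target_uS, read_conductance, apply_pulse,
                      tol_uS=0.25, max_iter=55, gain_uA_per_uS=10.0):
    """Simplified program-and-verify loop: the pulse amplitude is adjusted
    proportionally to the difference between read and target conductance."""
    amplitude_uA = 200.0                              # initial pulse amplitude (assumed)
    for _ in range(max_iter):
        # Verify: average of 5 consecutive reads at 0.3 V.
        g_read = sum(read_conductance(device, v_read=0.3) for _ in range(5)) / 5.0
        error = g_read - g_target_uS
        if abs(error) <= tol_uS:
            return True                               # converged within 0.25 uS of the target
        # Program: larger-amplitude pulses lower the conductance (partial RESET),
        # smaller-amplitude pulses raise it; amplitude is kept within 80-400 uA.
        amplitude_uA = min(400.0, max(80.0, amplitude_uA + gain_uA_per_uS * error))
        apply_pulse(device, amplitude_uA)
    return False                                      # not converged after max_iter attempts
```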
B. PCM-based deep learning inference simulator

We developed a simulation framework to test the efficacy of DNN inference using PCM devices. We chose Google's TensorFlow deep learning framework for the simulator development. The large library of algorithms in TensorFlow enables us to use native implementations of the required activation functions and batch normalization. Moreover, any regular TensorFlow code of a DNN can easily be ported to our simulator. As shown in Supplementary Fig. 7, custom-made TensorFlow operations are implemented that generate PCM conductance values from the behavioral model of the hardware PCM devices that was developed (see Supplementary Note 2). All the non-idealities, including conductance range, programming noise, read noise, and conductance drift, are implemented in TensorFlow following the equations given in Supplementary Note 2. The simulator can also take the PCM conductance data measured from hardware as input, in order to perform inference on the hardware data. Data converters that simulate digital quantization of data at the input and output of the crossbars are also implemented, with tunable quantization ranges and precision. In this study, the data converters were turned off for all simulations except those presented in Supplementary Fig. 6. The drift correction techniques are applied after quantization of the crossbar output.
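As a rough indication of how such a behavioral model can be expressed, the snippet below generates time-dependent conductances from target values by combining programming noise, state-dependent drift, and read noise. The drift term follows the relation G(t) = G(t0)(t/t0)^(-ν) from Section II D, but all numerical constants and the specific noise parameterization are placeholders; the actual model and its fitted dependencies are described in Supplementary Note 2.

```python
import numpy as np

def pcm_conductance_at_time(g_target_uS, t_s, t0_s=25.0, rng=np.random.default_rng()):
    """Toy PCM behavioral model: programming noise + state-dependent drift + read noise.
    All numerical constants below are illustrative placeholders, not the fitted values."""
    # Programming (write) noise, with a conductance-dependent standard deviation.
    sigma_prog = 0.5 + 0.02 * np.abs(g_target_uS)            # placeholder, in uS
    g0 = g_target_uS + rng.normal(0.0, sigma_prog)
    # Drift: G(t) = G(t0) * (t / t0)^(-nu), with nu varying across devices.
    nu = rng.normal(0.03, 0.01, size=np.shape(g_target_uS))  # placeholder mean / std
    g_t = g0 * (t_s / t0_s) ** (-np.clip(nu, 0.0, None))
    # Read noise, growing slowly with time (stand-in for the 1/f contribution).
    sigma_read = 0.1 * np.sqrt(np.log((t_s + t0_s) / (2.0 * t0_s)) + 1.0)  # placeholder
    return g_t + rng.normal(0.0, sigma_read, size=np.shape(g_t))
```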
C. Training implementation of ResNet-32 on CIFAR-10

ResNet-32 has 31 convolution layers with 3 × 3 kernels, each followed by batch normalization. ReLU activation is used after every batch normalization except in the case of residual connections, where the ReLU activation is computed after the summation. The output of the last convolution layer is then downsampled using global average pooling, which is followed by a single fully-connected layer. For the last fully-connected layer, no batch normalization is performed. The architecture of ResNet-32 used in this study is a slightly modified version of the original implementation with fewer input and output channels in ResNet blocks 2 and 3. This network is trained on the well-known CIFAR-10 classification dataset, which consists of 32 × 32-pixel RGB images that belong to one of 10 classes. The network is trained on the 50,000 images of the training set, and evaluation is performed on the 10,000 images of the test set.

The training is performed using stochastic gradient descent with a momentum of 0.9. The network objective is the categorical cross-entropy over the 10 classes of the input image. Learning rate scheduling is performed to reduce the learning rate by 90% at every 50th training epoch. The initial learning rate for the baseline network is 0.1 and training converges in 200 epochs with a mini-batch size of 128. Weights of all convolution and fully-connected layers of the baseline network are initialized using He Normal initialization. The baseline network is retrained by injecting Gaussian noise for up to 150 epochs with a weight clip scale α = 2. We preprocess the training images by randomly cropping a 32 × 32 patch after padding 2 pixels along the height and width of the image. We also apply a random horizontal flip on the images from the training set. Additionally, we apply cutout on the training set images. For both training and test sets, we apply channel-wise normalization for 0 mean and unit standard deviation.
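The preprocessing and learning-rate schedule described above can be sketched as follows. This is a minimal TensorFlow illustration with our own function names; zero padding is assumed, cutout is only indicated by a comment, and the authors' actual pipeline may differ in detail.

```python
import tensorflow as tf

def preprocess_train(image, label, mean, std):
    """Random 32x32 crop after 2-pixel padding, horizontal flip, channel-wise normalization."""
    image = tf.pad(image, [[2, 2], [2, 2], [0, 0]])          # zero padding (assumed)
    image = tf.image.random_crop(image, size=[32, 32, 3])
    image = tf.image.random_flip_left_right(image)
    image = (tf.cast(image, tf.float32) / 255.0 - mean) / std
    # Cutout (masking a random square patch) would be applied here as an extra step.
    return image, label

def learning_rate(epoch, base_lr=0.1):
    """Reduce the learning rate by 90% every 50 epochs (baseline schedule)."""
    return base_lr * (0.1 ** (epoch // 50))
```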
D. Training implementation of ResNet-34 on ImageNet

The architecture of the ResNet-34 network for ImageNet classification is derived from Ref. 19. It has 32 convolution layers with 3 × 3 kernels, preceded by an initial convolution layer with 7 × 7 kernels and a max-pooling stage that reduce the input to feature maps of 56 × 56 pixels. Residual connections that change the feature map dimensions are implemented with 1 × 1 convolutions. The network is trained on the ImageNet classification dataset. The ImageNet dataset has 1.3M images in the training set and 50k images in the test set. Images in the ImageNet dataset are preprocessed by following the same preprocessing steps as those of the PyTorch baseline model. Training images are randomly cropped to a 224 × 224 patch and then a random horizontal flip is applied to the images. Channel-wise normalization to zero mean and unit standard deviation is performed on the images in both the training and test sets. For the test set only, images are first resized to 256 × 256 using bilinear interpolation and then a center crop is performed to obtain the 224 × 224 image patch. The network objective function is the softmax cross-entropy computed on the network output and the corresponding 1,000 labels. The network objective is minimized using the stochastic gradient descent algorithm with a momentum of 0.9. We obtained our baseline network architecture and its parameters from the PyTorch model zoo. We use this network to perform additive noise training by injecting Gaussian noise for a total of 10 training epochs. In contrast to ResNet-32 on CIFAR-10, no learning rate scheduling was performed, since the network was trained for only 10 epochs with additive noise. We use a mini-batch size of 400 and a learning rate of 0.001 for the additive noise training simulations. We also use an L2 weight decay of 0.0001 and apply weight clipping with clip scale α.
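For both networks, the additive noise retraining perturbs the clipped layer weights with zero-mean Gaussian noise during training. The sketch below shows one way to implement such a perturbation; the clipping rule (α times the standard deviation of the weights) and the noise magnitude (a fraction η of the clipped range) are illustrative assumptions and are not the exact definitions used for the networks above.

```python
import tensorflow as tf

def clip_and_perturb(weights, alpha=2.0, eta=0.1):
    """Clip a weight tensor to [-alpha*std(w), +alpha*std(w)] and add zero-mean
    Gaussian noise whose standard deviation is eta times the clipped range.
    Both the clipping rule and eta are illustrative assumptions for this sketch."""
    w_max = alpha * tf.math.reduce_std(weights)
    w_clipped = tf.clip_by_value(weights, -w_max, w_max)
    noise = tf.random.normal(tf.shape(w_clipped), stddev=eta * w_max)
    return w_clipped + noise

# During retraining, the perturbed weights are used in the forward pass of each
# convolution or fully-connected layer, e.g.:
#   y = tf.nn.conv2d(x, clip_and_perturb(kernel), strides=1, padding="SAME")
```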
E. Global drift compensation (GDC) method

The GDC calibration phase consists of computing the summed current of L columns in each array encoding a network layer (see Supplementary Fig. 4). Those L columns contain devices initially programmed to known conductance values $G_{mn}(t_0)$, where $t_0$ denotes the time at which they were programmed. By reading those column currents, $I_m$, periodically with an applied voltage $V_{\mathrm{cal}}$ on all rows, we can compensate for a global conductance shift in the array during inference. When input data is processed by the crossbar during inference, the crossbar output can be scaled by $1/\hat{\alpha}$, where

$$\hat{\alpha} = \frac{\sum_{m=1}^{L} I_m}{V_{\mathrm{cal}} \sum_{n=1}^{N} \sum_{m=1}^{L} G_{mn}(t_0)}.$$

This procedure is especially simple because L can be chosen to be small, but large enough to obtain sufficient statistics. Moreover, $\hat{\alpha}$ is computed from the device data itself, without resorting to any assumption on how the conductance changes and without requiring extra timing information. The term $V_{\mathrm{cal}} \sum_{n=1}^{N} \sum_{m=1}^{L} G_{mn}(t_0)$ needs to be computed only once and stored in the digital memory of the chip. The periodic reading of the L columns of the crossbar can be done while the PCM array is idle, i.e., when there are no incoming images to be processed by the device. Performing the L current summations can be implemented either with on-chip digital circuitry or in the control unit of the chip. At the end of the calibration phase, $1/\hat{\alpha}$ is computed and stored locally in the digital unit of the crossbar. The output scaling by $1/\hat{\alpha}$ during inference can be combined with batch normalization because it is a linear operation. In our experiments, the calibration procedure was performed using all columns of each layer (i.e., L is equal to two times the number of output channels) every time before inference is performed on the whole test set.
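The calibration and scaling steps can be summarized with the following sketch, which assumes direct access to the conductances of the calibration columns (on chip, only the L column currents are measured) and an illustrative read voltage.

```python
import numpy as np

def gdc_calibration(g_cal_t, g_cal_t0_sum, v_cal=0.2):
    """Estimate the global scaling factor alpha_hat for one crossbar array.

    g_cal_t      : N x L conductances of the calibration columns at inference
                   time t; in hardware only the L column currents
                   I_m = v_cal * sum_n G_mn(t) are measured.
    g_cal_t0_sum : sum over all N x L initially programmed conductances G_mn(t0),
                   computed once and stored digitally.
    v_cal        : assumed calibration read voltage."""
    column_currents = v_cal * g_cal_t.sum(axis=0)           # I_m, m = 1..L
    return column_currents.sum() / (v_cal * g_cal_t0_sum)   # alpha_hat

def apply_gdc(crossbar_output, alpha_hat):
    """Scale the crossbar output by 1 / alpha_hat; being linear, this scaling
    can also be folded into the batch normalization that follows the layer."""
    return crossbar_output / alpha_hat
```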
F. Adaptive batch normalization statistics update (AdaBS) technique

Batch normalization is performed differently in the training and inference phases of a DNN. During the training of a DNN, batch normalization normalizes its input to zero mean and unit variance by computing the mean ($\mu_B$) and variance ($\sigma_B^2$) over a mini-batch of m images:

$$\mu_B = \frac{1}{m}\sum_{i=1}^{m} x_i, \qquad (4)$$
$$\sigma_B^2 = \frac{1}{m}\sum_{i=1}^{m} (x_i - \mu_B)^2. \qquad (5)$$

The normalized input is then scaled and shifted by γ and β. During the training phase, γ and β are learned through backpropagation. In parallel, a global running mean (μ) and variance (σ²) are computed by exponentially averaging $\mu_B$ and $\sigma_B^2$, respectively, over all the training batches:

$$\mu = p \cdot \mu + (1-p) \cdot \mu_B, \qquad (6)$$
$$\sigma^2 = p \cdot \sigma^2 + (1-p) \cdot \sigma_B^2, \qquad (7)$$

where p is the momentum. After training, the estimates of the global mean and variance, μ and σ², are used during the inference phase. When performing forward propagation during inference, the batch normalization coefficients μ, σ², γ, and β are used for normalization, scale, and shift.

The calibration phase of AdaBS consists in recomputing and updating μ and σ² for every layer where batch normalization is present. We recompute μ and σ² by feeding a randomly sampled set of mini-batches from the training dataset. In recomputing μ and σ², hyper-parameters such as the mini-batch size (m) and the momentum (p) need to be carefully tuned to achieve the best network accuracy. For AdaBS calibration, we observed that using an optimal value of the momentum is necessary to achieve a good inference accuracy evolution over time. For this, we have developed an algorithm to estimate the optimal value of the momentum through an empirical analysis, which is explained in Supplementary Note 3. Based on this analysis, the optimal momentum as a function of the number of injected mini-batches n is computed as

$$p = c_1^{(c_2/n)}, \qquad (8)$$

where the constants $c_1$ and $c_2$ result from that analysis. Using Eq. (8) to compute the momentum, we found that with a fixed mini-batch size of m = 200 images, it is sufficient to inject n = 13 mini-batches for the AdaBS calibration of the ResNet-32 network, that is, approximately 5% of the CIFAR-10 training set (2,600 images). The sensitivity of the accuracy to the number of images used for AdaBS calibration is shown in Supplementary Note 3. For ResNet-34 on ImageNet, we used a mini-batch size of m = 50 and n = 26 mini-batches, that is, 0.1% of the ImageNet training set (1,300 images). In the experiments presented in Fig. 5, AdaBS calibration was performed for every layer before performing inference on the test set, except for the last layer, because it does not have batch normalization.
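In a TensorFlow/Keras setting, the AdaBS calibration pass can be sketched as follows, assuming the model weights already include the PCM non-idealities and that the training data is available as a tf.data.Dataset of (image, label) pairs. Calling the batch-normalization layers in training mode updates their running statistics according to Eqs. (6) and (7) without modifying any trainable parameters; the momentum argument should be set according to Eq. (8).

```python
import tensorflow as tf

def adabs_calibrate(model, train_dataset, momentum, n_batches=13, batch_size=200):
    """Recompute the batch-normalization running statistics (mu, sigma^2) of a
    model whose weights already include the PCM non-idealities.

    The defaults correspond to the ResNet-32 / CIFAR-10 setting; momentum is the
    calibration momentum p of Eq. (8)."""
    # Set the calibration momentum p on every batch-normalization layer,
    # including those nested inside residual blocks.
    for layer in model.submodules:
        if isinstance(layer, tf.keras.layers.BatchNormalization):
            layer.momentum = momentum
    # Stream a few randomly sampled training mini-batches through the model.
    calibration = train_dataset.shuffle(10_000).batch(batch_size).take(n_batches)
    for images, _ in calibration:
        # Forward pass in training mode updates the BN moving mean/variance;
        # no gradients are computed and no weights are changed.
        model(images, training=True)
```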
ACKNOWLEDGMENTS
We thank O. Hilliges for discussions, and our colleagues at IBM TJ Watson Research Center, in particular M. BrightSky, for help with fabricating the PCM prototype chip used in this work. This work was partially funded by the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation programme (grant agreement number 682675).

AUTHOR CONTRIBUTIONS
V.J., M.L., S.H., C.P. and A.S. conceived the training methodology. V.J., M.L., S.H., I.B. and A.S. conceived the drift correction techniques. V.J. and S.H. performed the software training and inference simulations under the guidance of M.L. I.B. performed the PCM hardware experiments with the support of V.J. S.R.N. and V.J. developed the PCM model. V.J. and C.P. developed the PCM deep learning inference TensorFlow-based software. M.D. provided critical in-memory computing hardware insights and performed the ResNet-32 performance estimation. M.L. wrote the manuscript with input from all authors. M.L., A.S., B.R. and E.E. supervised the project.

REFERENCES

1. N. P. Jouppi, C. Young, N. Patil, D. Patterson, G. Agrawal, R. Bajwa, S. Bates, S. Bhatia, N. Boden, A. Borchers, et al., "In-datacenter performance analysis of a tensor processing unit," in Proceedings of the 44th Annual International Symposium on Computer Architecture (ISCA) (IEEE, 2017) pp. 1–12.
2. Z. Jia, M. Maggioni, J. Smith, and D. P. Scarpazza, "Dissecting the NVIDIA Turing T4 GPU via microbenchmarking," arXiv preprint arXiv:1903.07486 (2019).
3. A. Shafiee, A. Nag, N. Muralimanohar, R. Balasubramonian, J. P. Strachan, M. Hu, R. S. Williams, and V. Srikumar, "ISAAC: A convolutional neural network accelerator with in-situ analog arithmetic in crossbars," in Proceedings of the 43rd Annual International Symposium on Computer Architecture (ISCA) (2016) pp. 14–26.
4. F. Merrikh-Bayat, X. Guo, M. Klachko, M. Prezioso, K. K. Likharev, and D. B. Strukov, "High-performance mixed-signal neurocomputing with nanoscale floating-gate memory cell arrays," IEEE Transactions on Neural Networks and Learning Systems, 4782–4790 (2018).
5. W.-H. Chen, C. Dou, K.-X. Li, W.-Y. Lin, P.-Y. Li, J.-H. Huang, J.-H. Wang, W.-C. Wei, C.-X. Xue, Y.-C. Chiu, et al., "CMOS-integrated memristive non-volatile computing-in-memory for AI edge processors," Nature Electronics, 420–428 (2019).
6. M. Hu, C. E. Graves, C. Li, Y. Li, N. Ge, E. Montgomery, N. Davila, H. Jiang, R. S. Williams, J. J. Yang, et al., "Memristor-based analog computation and neural network classification with a dot product engine," Advanced Materials, 1705914 (2018).
7. M. Le Gallo, A. Sebastian, R. Mathis, M. Manica, H. Giefers, T. Tuma, C. Bekas, A. Curioni, and E. Eleftheriou, "Mixed-precision in-memory computing," Nature Electronics, 246 (2018).
8. I. Boybat, M. Le Gallo, S. Nandakumar, T. Moraitis, T. Parnell, T. Tuma, B. Rajendran, Y. Leblebici, A. Sebastian, and E. Eleftheriou, "Neuromorphic computing with multi-memristive synapses," Nature Communications, 2514 (2018).
9. S. Ambrogio, P. Narayanan, H. Tsai, R. M. Shelby, I. Boybat, C. Nolfo, S. Sidler, M. Giordano, M. Bodini, N. C. Farinha, et al., "Equivalent-accuracy accelerated neural-network training using analogue memory," Nature, 60 (2018).
10. S. Nandakumar, M. Le Gallo, I. Boybat, B. Rajendran, A. Sebastian, and E. Eleftheriou, "Mixed-precision architecture based on computational memory for training deep neural networks," in International Symposium on Circuits and Systems (ISCAS) (IEEE, 2018) pp. 1–5.
11. A. Mohanty, X. Du, P.-Y. Chen, J.-s. Seo, S. Yu, and Y. Cao, "Random sparse adaptation for accurate inference with inaccurate multi-level RRAM arrays," in (IEEE, 2017) pp. 6–3.
12. S. K. Gonugondla, M. Kang, and N. R. Shanbhag, "A variation-tolerant in-memory machine learning classifier via on-chip training," IEEE Journal of Solid-State Circuits, 3163–3173 (2018).
13. B. Liu, H. Li, Y. Chen, X. Li, Q. Wu, and T. Huang, "Vortex: variation-aware training for memristor X-bar," in Proceedings of the 52nd Annual Design Automation Conference (ACM, 2015) p. 15.
14. L. Chen, J. Li, Y. Chen, Q. Deng, J. Shen, X. Liang, and L. Jiang, "Accelerator-friendly neural-network training: Learning variations and defects in RRAM crossbar," in Proceedings of the Conference on Design, Automation & Test in Europe (European Design and Automation Association, 2017) pp. 19–24.
15. S. Moon, K. Shin, and D. Jeon, "Enhancing reliability of analog neural network processors," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 1455–1459 (2019).
16. D. Miyashita, S. Kousai, T. Suzuki, and J. Deguchi, "A neuromorphic chip optimized for deep learning and CMOS technology with time-domain analog and digital mixed-signal processing," IEEE Journal of Solid-State Circuits, 2679–2689 (2017).
17. M. Klachko, M. R. Mahmoodi, and D. B. Strukov, "Improving noise tolerance of mixed-signal neural networks," arXiv preprint arXiv:1904.01705 (2019).
18. A. S. Rekhi, B. Zimmer, N. Nedovic, N. Liu, R. Venkatesan, M. Wang, B. Khailany, W. J. Dally, and C. T. Gray, "Analog/mixed-signal hardware error modeling for deep learning inference," in Proceedings of the 56th Annual Design Automation Conference (ACM, 2019) pp. 81:1–81:6.
19. K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2016) pp. 770–778.
20. T. Gokmen, M. Onen, and W. Haensch, "Training Deep Convolutional Neural Networks with Resistive Cross-Point Devices," Frontiers in Neuroscience, 1–22 (2017), 1705.08014.
21. P. Merolla, R. Appuswamy, J. Arthur, S. K. Esser, and D. Modha, "Deep neural networks are robust to weight binarization and other non-linear distortions," arXiv preprint arXiv:1606.01981 (2016).
22. C. Blundell, J. Cornebise, K. Kavukcuoglu, and D. Wierstra, "Weight uncertainty in neural networks," arXiv preprint arXiv:1505.05424 (2015).
23. C. Gulcehre, M. Moczulski, M. Denil, and Y. Bengio, "Noisy activation functions," in International Conference on Machine Learning (2016) pp. 3059–3068.
24. A. Neelakantan, L. Vilnis, Q. V. Le, I. Sutskever, L. Kaiser, K. Kurach, and J. Martens, "Adding gradient noise improves learning for very deep networks," arXiv preprint arXiv:1511.06807 (2015).
25. G. An, "The effects of adding noise during backpropagation training on a generalization performance," Neural Computation, 643–674 (1996).
26. K. Jim, B. G. Horne, and C. L. Giles, "Effects of noise on convergence and generalization in recurrent networks," in Proceedings of the 7th International Conference on Neural Information Processing Systems, NIPS'94 (MIT Press, Cambridge, MA, USA, 1994) pp. 649–656.
27. S. Gupta, A. Agrawal, K. Gopalakrishnan, and P. Narayanan, "Deep learning with limited numerical precision," in Proceedings of the 32nd International Conference on Machine Learning (ICML-15) (2015) pp. 1737–1746.
28. J. L. McKinstry, S. K. Esser, R. Appuswamy, D. Bablani, J. V. Arthur, I. B. Yildiz, and D. S. Modha, "Discovering low-precision networks close to full-precision networks for efficient embedded inference," CoRR abs/1809.04191 (2018), arXiv:1809.04191.
29. A. F. Murray and P. J. Edwards, "Enhanced MLP performance and fault tolerance resulting from synaptic weight noise during training," IEEE Transactions on Neural Networks, 792–802 (1994).
30. K. He, X. Zhang, S. Ren, and J. Sun, "Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification," CoRR abs/1502.01852 (2015), arXiv:1502.01852.
31. M. Rastegari, V. Ordonez, J. Redmon, and A. Farhadi, "XNOR-Net: ImageNet classification using binary convolutional neural networks," in European Conference on Computer Vision (Springer, 2016) pp. 525–542.
32. G. Close, U. Frey, M. Breitwisch, H. Lung, C. Lam, C. Hagleitner, and E. Eleftheriou, "Device, circuit and system-level analysis of noise in multi-bit phase-change memory," in (IEEE, 2010) pp. 29–5.
33. G. W. Burr, M. J. Brightsky, A. Sebastian, H.-Y. Cheng, J.-Y. Wu, S. Kim, N. E. Sosa, N. Papandreou, H.-L. Lung, H. Pozidis, et al., "Recent progress in phase-change memory technology," IEEE Journal on Emerging and Selected Topics in Circuits and Systems, 146–162 (2016).
34. M. Le Gallo, A. Sebastian, G. Cherubini, H. Giefers, and E. Eleftheriou, "Compressed sensing with approximate message passing using in-memory computing," IEEE Transactions on Electron Devices, 4304–4312 (2018).
35. H. Tsai, S. Ambrogio, C. Mackin, P. Narayanan, R. M. Shelby, K. Rocki, A. Chen, and G. W. Burr, "Inference of long-short term memory networks at software-equivalent accuracy using 2.5M analog phase change memory devices," in (2019) pp. T82–T83.
36. F. Li, B. Zhang, and B. Liu, "Ternary weight networks," arXiv preprint arXiv:1605.04711 (2016).
37. M. Le Gallo, D. Krebs, F. Zipoli, M. Salinga, and A. Sebastian, "Collective structural relaxation in phase-change memory devices," Advanced Electronic Materials, 1700627 (2018).
38. G. Venkatesh, E. Nurvitadhi, and D. Marr, "Accelerating deep convolutional networks using low-precision and sparsity," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (2017) pp. 2861–2865.
39. C. M. Bishop, "Training with noise is equivalent to Tikhonov regularization," Neural Computation, 108–116 (1995).
40. M. Dazzi, A. Sebastian, P. A. Francese, T. Parnell, L. Benini, and E. Eleftheriou, "5 parallel prism: A topology for pipelined implementations of convolutional neural networks using computational memory," arXiv preprint arXiv:1906.03474 (2019).
41. E. Sacco, P. A. Francese, M. Brändli, C. Menolfi, T. Morf, A. Cevrero, I. Ozkaya, M. Kossel, L. Kull, D. Luu, H. Yueksel, G. Gielen, and T. Toifl, "A 5 Gb/s 7.1 fJ/b/mm 8x multi-drop on-chip 10 mm data link in 14 nm FinFET CMOS SOI at 0.5 V," in (2017) pp. C54–C55.
42. R. Andri, L. Cavigelli, D. Rossi, and L. Benini, "YodaNN: An architecture for ultralow power binary-weight CNN acceleration," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 48–60 (2017).
43. M. Courbariaux, Y. Bengio, and J.-P. David, "BinaryConnect: Training deep neural networks with binary weights during propagations," in Advances in Neural Information Processing Systems (2015) pp. 3123–3131.
44. M. Breitwisch, T. Nirschl, C. Chen, Y. Zhu, M. Lee, M. Lamorey, G. Burr, E. Joseph, A. Schrott, J. Philipp, et al., "Novel lithography-independent pore phase-change memory," in Proc. IEEE Symposium on VLSI Technology (2007) pp. 100–101.
45. N. Papandreou, H. Pozidis, A. Pantazi, A. Sebastian, M. Breitwisch, C. Lam, and E. Eleftheriou, "Programming algorithms for multilevel phase-change memory," in Proc. International Symposium on Circuits and Systems (ISCAS) (2011) pp. 329–332.
46. M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, S. Ghemawat, I. Goodfellow, A. Harp, G. Irving, M. Isard, Y. Jia, R. Jozefowicz, L. Kaiser, M. Kudlur, J. Levenberg, D. Mané, R. Monga, S. Moore, D. Murray, C. Olah, M. Schuster, J. Shlens, B. Steiner, I. Sutskever, K. Talwar, P. Tucker, V. Vanhoucke, V. Vasudevan, F. Viégas, O. Vinyals, P. Warden, M. Wattenberg, M. Wicke, Y. Yu, and X. Zheng, "TensorFlow: Large-scale machine learning on heterogeneous systems," (2015), software available from tensorflow.org.
47. S. Ioffe and C. Szegedy, "Batch normalization: Accelerating deep network training by reducing internal covariate shift," CoRR abs/1502.03167 (2015), arXiv:1502.03167.
48. B. Zhou, A. Khosla, À. Lapedriza, A. Oliva, and A. Torralba, "Learning deep features for discriminative localization," CoRR abs/1512.04150 (2015), arXiv:1512.04150.
49. A. Krizhevsky, V. Nair, and G. Hinton, "CIFAR-10 (Canadian Institute for Advanced Research)."
50. T. DeVries and G. W. Taylor, "Improved regularization of convolutional neural networks with cutout," CoRR abs/1708.04552 (2017), arXiv:1708.04552.
51. O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei, "ImageNet large scale visual recognition challenge," Int. J. Comput. Vision, 211–252 (2015).
52. "Torchvision.models," https://pytorch.org/docs/stable/torchvision/models.html