Robust Processing-In-Memory Neural Networks via Noise-Aware Normalization
Li-Huang Tsai
National Tsing-Hua University [email protected]
Shih-Chieh Chang
National Tsing-Hua University [email protected]
Yu-Ting Chen
Google Research [email protected]
Jia-Yu Pan
Google Research [email protected]
Wei Wei
Google Research [email protected]
Da-Cheng Juan
Google Research [email protected]
Abstract
Analog computing hardware has received growing attention from researchers in recent years for accelerating neural network computations. However, analog accelerators often suffer from undesirable intrinsic noise caused by their physical components, making it challenging for neural networks to achieve the same performance as on digital hardware. We suppose that the performance drop of noisy neural networks is due to distribution shifts in the network activations. In this paper, we propose to recalculate the statistics of the batch normalization layers to calibrate the biased distributions during the inference phase. Without needing to know the attributes of the noise beforehand, our approach is able to align the distributions of the activations under the variational noise inherent in analog environments. To validate our assumptions, we conduct quantitative experiments and apply our method to several computer vision tasks, including classification, object detection, and semantic segmentation. The results demonstrate its effectiveness in achieving noise-agnostic robust networks and advance the development of analog computing devices for neural networks.
1 Introduction

The rapid progress of deep neural networks has aroused interest in finding suitable hardware for neural network inference, which heavily demands computational resources and energy. With the widespread deployment of network models on a variety of edge devices, it is urgent to design adequate hardware to satisfy these needs. In addition to the widely used digital circuits (e.g., GPUs), which are already well developed, analog computing has attracted more attention in recent years since non-volatile memory devices are favourable for accelerating the inference of neural networks (Shafiee et al., 2016; Nomura et al., 2018; Angizi et al., 2019). In comparison to digital platforms, processing-in-memory (PIM) analog computing has demonstrated orders of magnitude of speed acceleration and lower power consumption, which makes it a reasonable choice for neural network inference. Nevertheless, imprecise analog computations cause an intolerable performance drop in neural networks and make the replacement of digital circuits impractical, despite the advantages of analog computing in speed and power consumption.
Figure 1: Distribution distance between the activations in ResNet-34 before and after noise injection, measured by (a) KL divergence and (b) JS divergence. The blue and the green bars represent the divergence of the activations with and without applying our approach, respectively, while the red line illustrates the ratio between them. When the BatchNorm statistics are not calibrated, the distributions of the intermediate representations with noise injected are far from the noise-free ones, which leads to the degradation of the final performance. In contrast, our approach calibrates the BatchNorm statistics and shifts the distributions close to the clean ones under both (a) KL divergence and (b) JS divergence.

The performance degradation is due to the undesirable intrinsic noise inherent in the physical components of analog devices. Previous work (Joshi et al., 2019) proposed to finetune the neural networks after the conventional training phase: the model weights are injected with Gaussian noise during finetuning to simulate the analog computing scheme. This noisy training allows the networks to be robust to the noise caused by analog devices during inference. However, it is unrealistic to acquire the attributes of the noise beforehand and simulate the injected noise from such prior knowledge. In addition, noisy-trained networks are exclusively fit to a certain noise scale, and further finetuning is required for different ones. The finetuning process is inefficient, as it demands additional training time and computing resources on top of the original training phase. As a result, the inefficiency and cost of noisy training are unsatisfactory for putting analog computing into practice.

We assume the noise shifts the distributions of the activations away from the "clean" ones obtained without any noise, causing the performance drop in the neural networks. To verify our hypothesis, we show the distance between the clean and the noisy activation distributions as the green bars in Figure 1, where the distance is measured by the Kullback-Leibler and Jensen-Shannon divergences. The noisy activations are clearly disturbed by the noise and shifted far away from the clean ones. Additionally, since BatchNorm (Ioffe & Szegedy, 2015) is capable of normalizing mismatched activation distributions across mini-batches, we find it appropriate and advantageous for alleviating the noise interference. In this paper, we propose to calibrate the BatchNorm statistics to rectify the disturbed activation distributions. The noise makes the running estimates of the mean and variance inaccurate and therefore deactivates the normalization effect of the BatchNorm layers. By recalculating the BatchNorm statistics, the calibrated BatchNorm regains its effectiveness in normalizing the mismatched distributions, as depicted by the blue bars in Figure 1. Unlike noisy training, which requires additional resources for finetuning, the cost of keeping track of the running estimates is negligible. Furthermore, our method adapts easily to various scales of noise, since calibration only requires estimating the mean and variance of the activations.
Therefore, calibrating the BatchNorm statistics is a practical approach to mitigating noise interference during neural network inference on analog hardware. We validate the calibrated BatchNorm statistics on a variety of computer vision tasks, including image classification, object detection, and semantic segmentation. We demonstrate that our approach is able to alleviate the disturbing noise and improve network robustness against noisy weights. Additionally, we analyze the effectiveness of our approach quantitatively and qualitatively on several representative models under variational noise. The contributions of this paper are summarized as follows:
• We propose to recalculate the BatchNorm statistics to calibrate the shifted distributions caused by the noise during analog computing.
• Our approach requires negligible additional cost for the BatchNorm statistics calibration, compared to the unaffordable cost introduced by noisy training.
• Our approach adapts to variational noise, and only minor adjustments are needed for different scales of noise.
• Its effectiveness and efficiency make calibrating the BatchNorm statistics a practical approach for developing analog computing devices and putting them into practice.

The remainder of this paper is organized as follows. Section 2 discusses the research works related to this paper. Section 3 reviews BatchNorm and the analog computing noise model. Section 4 walks through the proposed methodology. Section 5 presents the experimental results and the ablation analysis. Section 6 concludes the paper.
2 Related Work

Applying analog computation to accelerate neural network inference has been an active field in recent years (Haensch et al., 2019). Besides the works mentioned above, which are dedicated to designing and improving the hardware architecture, Klachko et al. (2019) explored the effect of common DNN components and training regularization techniques, e.g., the activation function, weight decay, Dropout (Srivastava et al., 2014), and BatchNorm, on noise tolerance. Joshi et al. (2019) trained neural networks with noise injection to make the model weights less sensitive to noise variation. Zhou et al. (2020) integrated knowledge distillation with noise-injection training, which takes advantage of the additional information provided by a teacher model.
BatchNorm (Ioffe & Szegedy, 2015) is a widely used normalization technique that accelerates and stabilizes the training of neural networks by normalizing intermediate representations. However, the behaviour of BatchNorm is not yet fully understood, and recent works have investigated its properties to better understand its usage under different circumstances. Guo et al. (2018), Singh & Shrivastava (2019), and Summers & Dinneen (2019) propose methods that re-weigh the statistics between the exponential moving average (EMA) and the current inference batch to mitigate the discrepancy between training and testing data that occurs with traditional BatchNorm. Xie et al. (2019) advanced the perspective that the distribution mismatch between clean and adversarial examples is a key factor causing performance degradation in modern neural networks containing BatchNorm layers. Frankle et al. (2020) investigated the expressive power of BatchNorm by training only the BatchNorm parameters, which account for merely 0.5% of the total number of model parameters; such a model can only shift and rescale random features, yet still achieves fairly high accuracy.

3 Background

Traditional BatchNorm contains two statistical and two learnable components: the mean, variance, scale, and bias. In the training phase, BatchNorm computes the mean E[x] and variance Var[x] of the input batch x, then normalizes each scalar feature independently so that it has zero mean and unit variance. Finally, it scales and shifts the normalized value x̂ with the learnable parameters γ and β:

x̂ = (x − E[x]) / √(Var[x] + ε) · γ + β    (1)

Meanwhile, BatchNorm records an exponential moving average (EMA) of the mean and variance, which represents the training data distribution and is used to normalize the input batch in the inference phase.

The process of analog in-memory computing is limited by non-idealities originating from three main factors: (1) the quality of wafer manufacturing, (2) the stability of the supply voltage, and (3) temperature changes; their combination leads to fluctuations in the computation. Moreover, the NVM cell, a key component of PIM-based DNN accelerators used to store the weights of the neural network (Nomura et al., 2018; Balaji et al., 2019; Joshi et al., 2019), suffers from electro/thermo-dynamic variation during read and write operations, so the stored value tends to fluctuate over time due to temperature changes and conductance drift of the device. The long-term fluctuation can be adjusted by refreshing the hardware periodically, but the time-varying noise that exists at smaller time granularity should be addressed. The simulated noise is described by two factors, and its severity is controlled by the noise scale η.

Temporal fluctuation. During the analog computing process, the instability of the supply voltage and the rise and fall of the temperature may cause different degrees of computation error. This time-varying noise is described by N_T ∼ N(η, σ_T), which is randomly sampled for each inference batch; a temporal fluctuation level of 10% means that σ_T = 0.1η.

Spatial fluctuation. Due to defects in the transistor manufacturing process, the analog computing noise may vary between different parts of the chip. This spatially varying noise is introduced by sampling N_S ∼ N(η, σ_S) once when the neural network is instantiated.
A spatial fluctuation level of 10% means that σ_S = 0.1η. The model weights after injecting the two noise sources, W_noisy, are given by Eq. (2):

W_noisy = W_orig + W_orig · N_T · N_S    (2)

4 Methodology

In the analog computing noise setting, the noise is injected into the model weights, which makes the distribution of the internal activations differ greatly from the outputs of the clean weights. However, traditional BatchNorm performs normalization with statistics of the training data that were recorded without any injected noise. Such statistics cannot successfully normalize the noisy outputs, so BatchNorm loses its ability to adjust the output distribution of the previous layer; the noise keeps propagating and finally leads to wrong predictions. To mitigate the distribution mismatch between the noisy and the original outputs, we take a perspective different from the traditional paradigm of reusing the training statistics, and from works such as Guo et al. (2018), Singh & Shrivastava (2019), and Summers & Dinneen (2019), which investigate the discrepancy between training and testing datasets and integrate the training statistics with the inference batch for additional improvement. When there is an obvious distribution shift, the statistics recorded in the training phase should not be used; the mean and variance need to be re-computed. We therefore perform calibration on the training dataset: we run inference over the training data while the noise is injected into the model, so the statistics of the unstable analog computing environment are recorded implicitly, without knowing the actual setting of the injected noise.
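To make the simulated noise model of Eq. (2) concrete, the following is a minimal PyTorch sketch of the weight perturbation. The helper name inject_weight_noise, the default fluctuation levels, and the restriction to convolutional and fully connected weights are our assumptions rather than details given above; in a full simulation the temporal term N_T would be re-drawn for every inference batch.

```python
import torch
import torch.nn as nn


def inject_weight_noise(model: nn.Module, eta: float,
                        temporal_level: float = 0.2,
                        spatial_level: float = 0.1) -> None:
    """In-place weight perturbation following Eq. (2):
        W_noisy = W_orig + W_orig * N_T * N_S,
    with N_T ~ N(eta, temporal_level * eta) and N_S ~ N(eta, spatial_level * eta).
    Both terms are drawn together here for a single simulated inference pass."""
    with torch.no_grad():
        for module in model.modules():
            # Assumption: only conv/linear weights are stored in the analog arrays.
            if isinstance(module, (nn.Conv2d, nn.Linear)):
                w = module.weight
                n_s = eta + spatial_level * eta * torch.randn_like(w)   # spatial: fixed per chip
                n_t = eta + temporal_level * eta * torch.randn_like(w)  # temporal: per batch
                w.add_(w * n_t * n_s)
```

Note that the perturbation is applied in place, so the original weights should be saved (or the model reloaded) before repeating an experiment with a different noise scale η.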
Algorithm 1: Calibrate the statistics of BatchNorm

for x_i in X_NoisyTrain do
    B_σ ← m · B_σ + (1 − m) · x_i^σ
    B_µ ← m · B_µ + (1 − m) · x_i^µ

Here B_µ and B_σ are the running estimates of the mean and variance, x_i^µ and x_i^σ are the statistics of the current batch x_i computed under noisy weights, and m is the EMA momentum.
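Below is a minimal PyTorch sketch of Algorithm 1. It assumes the noise has already been injected into the model weights and that loader iterates over training images; the helper name and the decision to switch only the BatchNorm layers into training mode (so dropout and all weights stay untouched) are our choices. Since the EMA momentum m = 0.999 puts most of the weight on the running estimate, the loop should cover a large number of batches, e.g., a full pass over the training set.

```python
import torch
import torch.nn as nn


def calibrate_batchnorm(model: nn.Module, loader,
                        ema_momentum: float = 0.999,
                        device: str = "cuda") -> None:
    """Re-estimate the BatchNorm running mean/variance under noisy weights.
    Only the running statistics change; the weights and the affine gamma/beta
    parameters are left as they were after training."""
    model.to(device).eval()
    for module in model.modules():
        if isinstance(module, nn.modules.batchnorm._BatchNorm):
            # Paper EMA: B <- m * B + (1 - m) * batch_stat.  PyTorch's `momentum`
            # weights the *batch* statistic, hence 1 - m here.
            module.momentum = 1.0 - ema_momentum
            module.train()              # let BN update its running estimates
    with torch.no_grad():               # no gradients, no parameter updates
        for images, _ in loader:        # calibration uses the images only
            model(images.to(device))
    model.eval()
```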
5 Experiments

Image Classification

Experiments are conducted on the widely used ImageNet-2012 dataset (Krizhevsky et al., 2012) with different architectures. We show the performance degradation when noise is injected into the model weights, and demonstrate that the proposed method eases the distribution mismatch and leads to favorable performance.
Experimental Setup.
ImageNet-2012 is a large-scale dataset for image classification with 1k categories. It contains roughly 1.3 million training images and 50k validation images of 256x256 pixels. We use the PyTorch official pretrained models and compare the performance of Calibrated BatchNorm with EMA (batch size = 256, momentum = 0.999) against the original models under noise injection with a temporal level of σ_T = 20% and a spatial level of σ_S = 10%.
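For illustration, a hypothetical end-to-end wiring of this setup, reusing the inject_weight_noise and calibrate_batchnorm sketches from above, might look as follows; the dataset path, preprocessing, and loader settings are placeholders rather than the exact pipeline used here.

```python
import torch
import torchvision
import torchvision.transforms as T

# Simulate a noisy analog deployment of a pretrained classifier (eta = 0.06,
# temporal level 20%, spatial level 10%), then recalibrate the BatchNorm
# statistics on training images with EMA momentum 0.999 and batch size 256.
model = torchvision.models.resnet50(pretrained=True)
inject_weight_noise(model, eta=0.06, temporal_level=0.2, spatial_level=0.1)

transform = T.Compose([
    T.Resize(256),
    T.CenterCrop(224),
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])
train_set = torchvision.datasets.ImageNet("/path/to/imagenet", split="train",
                                          transform=transform)
train_loader = torch.utils.data.DataLoader(train_set, batch_size=256,
                                           shuffle=True, num_workers=8)

calibrate_batchnorm(model, train_loader, ema_momentum=0.999)
# The calibrated model is then evaluated on the validation split as usual.
```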
Table 1: Validation accuracy (%) on ImageNet-2012.

               η=0.0   η=0.02  η=0.04  η=0.06  η=0.08  η=0.10
ResNet-34      73.31   71.53   65.73   49.92   20.55    3.66
  ours         73.31   72.34   72.28   72.07   71.97   71.64
ResNet-50      76.13   72.85   56.88   19.49    2.73    0.40
  ours         76.13   74.81   74.79   74.71   74.57   74.54
ResNet-101     77.37   73.80   52.89    9.32    0.38    0.08
  ours         77.37   76.30   76.27   76.25   76.12   75.94
WideResNet     78.47   75.09   59.24   23.36    3.93    0.59
  ours         78.47   76.10   76.08   75.95   75.77   75.72
MobileNet-v2   71.88   49.61    1.96    0.19    0.13    0.12
  ours         71.88   68.76   67.97   66.90   65.26   63.41

Results. Table 1 shows that even on this large-scale dataset, the effect of Calibrated BatchNorm is notable. Besides, we find that the deeper the network, the more susceptible it is to noise due to noise propagation: under a noise scale of 0.06, ResNet-34, ResNet-50, and ResNet-101 reach 49.92%, 19.49%, and 9.32%, respectively. Moreover, in the comparison between ResNet-50 and WideResNet-50-2, increasing the width of the network, which is regarded as making models more robust to adversarial examples (Gao et al., 2019), does not seem to bring any benefit in terms of noise resistance in the analog computing setting, because the additional parameters are affected by the noise as well.
Object Detection

Experimental Setup. We use the official pretrained YOLO-v3 (Redmon & Farhadi, 2018) implementation as our detector and conduct all experiments on the COCO-2014 dataset (Lin et al., 2014) with 80 object classes. The detector is trained on the trainval35k set with around 75k images and evaluated on a held-out set of 5k images. We report the standard COCO evaluation metrics, mean average precision (mAP) and mean average recall (mAR) at an IoU threshold of 0.5, comparing the performance of Calibrated BatchNorm with EMA (batch size = 8, momentum = 0.999) against the original model under noise injection with a temporal level of σ_T = 20% and a spatial level of σ_S = 10%.
Results. Table 2 shows that even when the injected noise is severe, our method still works on this more difficult computer vision task.

Table 2: Validation mAP and mAR (%) at 0.5 IoU on the COCO-2014 5k validation set.

                        η=0            η=0.05         η=0.10
                      mAP   mAR      mAP   mAR      mAP   mAR
Original model        51.5  75.1     30.5  58.4      0.0   0.0
Calibrated BatchNorm  51.5  75.1     49.5  73.3     48.6  71.3

Semantic Segmentation

Experimental Setup. We use the PyTorch official FCN-ResNet101 (Long et al., 2014), pretrained on the part of the COCO dataset that shares the same classes as PASCAL VOC-2012 (Everingham et al.), and conduct the segmentation task on PASCAL VOC-2012 with 21 classes, including background. We compare the performance of Calibrated BatchNorm with EMA (batch size = 1, momentum = 0.999) against the original model under noise injection with a temporal level of σ_T = 20% and a spatial level of σ_S = 10%.
Results. Table 3 shows the results of semantic segmentation. The mIoU metric is more reliable than pixel accuracy (PixAcc) because mIoU focuses on the classes excluding background. The experiment demonstrates that the mIoU of the original model drops drastically as the noise scale increases, whereas Calibrated BatchNorm can handle the distribution shift even when the data arrive in a streaming manner and in the task of semantic segmentation.

Table 3: Validation mIoU and PixAcc (%) on PASCAL VOC-2012.

                        η=0             η=0.05          η=0.10
                      mIoU  PixAcc    mIoU  PixAcc    mIoU  PixAcc
Original model        63.7   91.9     25.9   80.8      2.95  62.2
Calibrated BatchNorm  63.7   91.9     61.1   90.3     60.6   90.1

6 Conclusion

We formulate the imprecision of processing-in-memory computing as a distribution shift problem and propose an effective way to calibrate the mismatched distributions. We conduct extensive experiments on important vision tasks, including classification, object detection, and semantic segmentation; the results demonstrate significant improvements and further promote the development of neural networks in the analog computing field.

References
S. Angizi, Z. He, D. Reis, X. S. Hu, W. Tsai, S. J. Lin, and D. Fan. Accelerating deep neural networks in processing-in-memory platforms: Analog or digital approach? In AAAI, 2018.

W. Haensch, T. Gokmen, and R. Puri. The next generation of deep learning hardware: Analog computing. Proceedings of the IEEE, 107(1):108–122, 2019.

Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift, 2015.

Vinay Joshi, Manuel Le Gallo, Irem Boybat, Simon Haefeli, Christophe Piveteau, Martino Dazzi, Bipin Rajendran, Abu Sebastian, and Evangelos Eleftheriou. Accurate deep neural network inference using computational phase-change memory, 2019.

Michael Klachko, Mohammad Reza Mahmoodi, and Dmitri B. Strukov. Improving noise tolerance of mixed-signal neural networks, 2019.

Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. ImageNet classification with deep convolutional neural networks. In NIPS'12, 2012.

Tsung-Yi Lin, Michael Maire, Serge Belongie, Lubomir Bourdev, Ross Girshick, James Hays, Pietro Perona, Deva Ramanan, C. Lawrence Zitnick, and Piotr Dollár. Microsoft COCO: Common objects in context, 2014.

Jonathan Long, Evan Shelhamer, and Trevor Darrell. Fully convolutional networks for semantic segmentation, 2014.

Akiyo Nomura, Megumi Ito, Atsuya Okazaki, Masatoshi Ishii, Sangbum Kim, Junka Okazawa, Kohji Hosokawa, and Wilfried Haensch. NVM weight variation impact on analog spiking neural network chip. In Long Cheng, Andrew Chi Sing Leung, and Seiichi Ozawa (eds.), Neural Information Processing, pp. 676–685, Cham, 2018. Springer International Publishing. ISBN 978-3-030-04239-4.

Joseph Redmon and Ali Farhadi. YOLOv3: An incremental improvement, 2018.

A. Shafiee, A. Nag, N. Muralimanohar, R. Balasubramonian, J. P. Strachan, M. Hu, R. S. Williams, and V. Srikumar. ISAAC: A convolutional neural network accelerator with in-situ analog arithmetic in crossbars. In ISCA, pp. 14–26, 2016.

Saurabh Singh and Abhinav Shrivastava. EvalNorm: Estimating batch normalization statistics for evaluation, 2019.

Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15(56):1929–1958, 2014. URL http://jmlr.org/papers/v15/srivastava14a.html.