Robust Processing-In-Memory Neural Networks via Noise-Aware Normalization
Li-Huang Tsai
National Tsing-Hua University [email protected]
Shih-Chieh Chang
National Tsing-Hua University [email protected]
Yu-Ting Chen
Google Research [email protected]
Jia-Yu Pan
Google Research [email protected]
Wei Wei
Google Research [email protected]
Da-Cheng Juan
Google Research [email protected]
Abstract
Analog computing hardware has received growing attention from researchers in recent years for accelerating neural network computations. However, analog accelerators often suffer from undesirable intrinsic noise caused by their physical components, making it challenging for neural networks to achieve the same performance as on digital hardware. We suppose that the performance drop of noisy neural networks is due to distribution shifts in the network activations. In this paper, we propose to recalculate the statistics of the batch normalization layers to calibrate the biased distributions during the inference phase. Without needing to know the attributes of the noise beforehand, our approach is able to align the distributions of the activations under the variational noise inherent in analog environments. To validate our assumptions, we conduct quantitative experiments and apply our method to several computer vision tasks, including classification, object detection, and semantic segmentation. The results demonstrate its effectiveness in achieving noise-agnostic robust networks and advance the development of analog computing devices for neural networks.
1 Introduction

The rapid progress of deep neural networks has aroused interest in finding suitable hardware for neural network inference, which heavily demands computational resources and energy. With the widespread deployment of network models on a variety of edge devices, it is urgent to design adequate hardware to satisfy these needs. In addition to the widely used digital circuits (e.g., GPUs), which are already well developed, analog computing has attracted more attention in recent years since non-volatile memory devices are favourable for accelerating the inference of neural networks (Shafiee et al., 2016; Nomura et al., 2018; Angizi et al., 2019). In comparison to digital platforms, processing-in-memory (PIM) analog computing has demonstrated orders of magnitude of speed acceleration and lower power consumption, which makes it a reasonable choice for neural network inference. Nevertheless, imprecise analog computations cause an intolerable performance drop in neural networks and make the replacement of digital circuits impractical, despite the advantages of analog computing in speed and power consumption.
Figure 1: Distribution distance between the activations in ResNet-34 before and after noise injection, measured by (a) KL divergence and (b) JS divergence. The blue and the green bars represent the divergence of the activations with and without applying our approach, respectively, while the red line illustrates the ratio between them. When the BatchNorm statistics are not calibrated, the distributions of the intermediate representations with noise injected are far from the noise-free ones, which leads to the degradation of the final performance. In contrast, our approach calibrates the BatchNorm statistics and shifts the distributions close to the clean ones under both (a) KL divergence and (b) JS divergence.

The performance degradation is due to the undesirable intrinsic noise inherent in the physical components of analog devices. Previous work (Joshi et al., 2019) proposed to finetune the neural networks after the conventional training phase: the model weights are injected with Gaussian noise during finetuning to simulate the analog computing scheme. This noisy training allows the networks to be robust to the noise caused by analog devices during inference. However, it is unrealistic to acquire the attributes of the noise beforehand and simulate the injected noise from such prior knowledge. In addition, noisy-trained networks are exclusively fit to a certain noise scale, and further finetuning is required for different ones. The finetuning process is inefficient, as it demands additional training time and computing resources on top of the original training phase. As a result, the inefficiency and cost of noisy training are unsatisfactory for putting analog computing into practice.

We assume the noise shifts the distributions of the activations away from the "clean" ones obtained without any noise, causing the performance drop in the neural networks. To verify our hypothesis, we show the distance between the clean and the noisy activation distributions as the green bars in Figure 1, where the distance is measured by the Kullback-Leibler and Jensen-Shannon divergences. The noisy activations are clearly disturbed by the noise and shifted far away from the clean ones. Additionally, since BatchNorm (Ioffe & Szegedy, 2015) is capable of normalizing mismatched activation distributions across mini-batches, we find it appropriate and advantageous for alleviating the noise interference. In this paper, we propose to calibrate the BatchNorm statistics to rectify the disturbed activation distributions. The noise makes the running estimates of the mean and variance inaccurate and therefore deactivates the normalization effect of the BatchNorm layers. By recalculating the BatchNorm statistics, the calibrated BatchNorm regains its effectiveness in normalizing the mismatched distributions, as depicted by the blue bars in Figure 1. Unlike noisy training, which requires additional resources for finetuning, the cost of keeping track of the running estimates is negligible. Furthermore, our method adapts easily to various scales of noise, since calibration only requires estimating the mean and variance of the activations.
Therefore, calibrating the BatchNorm statistics is a practical approach to mitigating noise interference during neural network inference on analog hardware. We validate the calibrated BatchNorm statistics on a variety of computer vision tasks, including image classification, object detection, and semantic segmentation. We demonstrate that our approach is able to alleviate the disturbing noise and improve network robustness against noisy weights. Additionally, we analyze the effectiveness of our approach quantitatively and qualitatively on several representative models under variational noise. The contributions of this paper are summarized as follows:
• We propose to recalculate the BatchNorm statistics to calibrate the shifted distributions caused by the noise during analog computing.
• Our approach requires negligible additional cost for the BatchNorm statistics calibration, compared to the unaffordable cost introduced by noisy training.
• Our approach adapts to variational noise, and only minor adjustments are needed for different scales of noise.
• Its effectiveness and efficiency make calibrating the BatchNorm statistics a practical approach for developing analog computing devices and putting them into practice.

The remainder of this paper is organized as follows. Section 2 discusses the research works related to this paper. Section 3 reviews BatchNorm and the analog computing noise model. Section 4 walks through the proposed methodology. Section 5 presents the experimental results and the ablation analysis. Section 6 concludes the paper.
2 Related Work

Applying analog computation to accelerate neural network inference has been an active field in recent years (Haensch et al., 2019). Besides the works mentioned above, which are dedicated to designing and improving the hardware architecture, Klachko et al. (2019) explored the effect of common DNN components and training regularization techniques, e.g., the activation function, weight decay, Dropout (Srivastava et al., 2014), and BatchNorm, on noise tolerance. Joshi et al. (2019) trained neural networks with noise injection to make the model weights less sensitive to noise variation. Zhou et al. (2020) integrated knowledge distillation with noise-injection training, which takes advantage of the additional information provided by a teacher model.
BatchNorm (Ioffe & Szegedy, 2015) is a widely used normalization technique that accelerates and stabilizes the training of neural networks by normalizing intermediate representations. However, the behaviour of BatchNorm is not yet fully understood, and recent works have investigated its properties to better understand its usage under different circumstances. Guo et al. (2018), Singh & Shrivastava (2019), and Summers & Dinneen (2019) propose methods that re-weigh the statistics between the exponential moving average (EMA) and the current inference batch to mitigate the discrepancy between training and testing data that occurs with traditional BatchNorm. Xie et al. (2019) advanced the perspective that the distribution mismatch between clean and adversarial examples is a key factor causing performance degradation in modern neural networks containing BatchNorm layers. Frankle et al. (2020) investigated the expressive power of BatchNorm by training only the BatchNorm parameters, which account for merely 0.5% of the total number of model parameters; such a model can only shift and rescale random features, yet still achieves fairly high accuracy.

3 Background

Traditional BatchNorm contains two statistical and two learnable components: the mean, variance, scale, and bias. In the training phase, BatchNorm computes the mean E[x] and variance Var[x] of the input batch x, then normalizes each scalar feature independently so that it has zero mean and unit variance. Finally, it scales and shifts the normalized value x̂ with the learnable parameters γ and β:

x̂ = (x − E[x]) / √(Var[x] + ε) · γ + β    (1)

Meanwhile, BatchNorm records an exponential moving average (EMA) of the mean and variance, which represents the training data distribution and is used to normalize the input batch in the inference phase.

The process of analog in-memory computing is limited by non-idealities originating from three main factors: (1) the quality of wafer manufacturing, (2) the stability of the supply voltage, and (3) temperature changes; their combination leads to fluctuations in the computation. Moreover, the NVM cell, a key component of PIM-based DNN accelerators used to store the weights of the neural network (Nomura et al., 2018; Balaji et al., 2019; Joshi et al., 2019), suffers from electro/thermo-dynamic variation during read and write operations, so the stored value tends to fluctuate over time due to temperature changes and conductance drift of the device. The long-term fluctuation can be adjusted by refreshing the hardware periodically, but the time-varying noise that exists at smaller time granularity should be addressed. The simulated noise is described by two factors, and its severity is controlled by the noise scale η.

Temporal fluctuation. During the analog computing process, the instability of the supply voltage and the rise and fall of the temperature may cause different degrees of computation error. This time-varying noise is described by N_T ∼ N(η, σ_T), which is randomly sampled for each inference batch; a temporal fluctuation level of 10% means that σ_T = 0.1η.

Spatial fluctuation. Due to defects in the transistor manufacturing process, the analog computing noise may vary between different parts of the chip. This spatially varying noise is introduced by sampling N_S ∼ N(η, σ_S) once when the neural network is instantiated.
A spatial fluctuation level of 10% means that σ_S = 0.1η. The model weights after injecting the two noise sources, W_noisy, are given by Eq. (2):

W_noisy = W_orig + W_orig · N_T · N_S    (2)

4 Methodology

In the analog computing noise setting, the noise is injected into the model weights, which makes the distribution of the internal activations differ greatly from the outputs of the clean weights. However, traditional BatchNorm performs normalization with statistics of the training data that were recorded without any injected noise. Such statistics cannot successfully normalize the noisy outputs, so BatchNorm loses its ability to adjust the output distribution of the previous layer; the noise keeps propagating and finally leads to wrong predictions. To mitigate the distribution mismatch between the noisy and the original outputs, we take a perspective different from the traditional paradigm of reusing the training statistics, and from works such as Guo et al. (2018), Singh & Shrivastava (2019), and Summers & Dinneen (2019), which investigate the discrepancy between training and testing datasets and integrate the training statistics with the inference batch for additional improvement. When there is an obvious distribution shift, the statistics recorded in the training phase should not be used; the mean and variance need to be re-computed. We therefore perform calibration on the training dataset: we run inference over the training data while the noise is injected into the model, so the statistics of the unstable analog computing environment are recorded implicitly, without knowing the actual setting of the injected noise.
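To make the simulated noise model of Eq. (2) concrete, the following is a minimal PyTorch sketch of the weight perturbation. The helper name inject_weight_noise, the default fluctuation levels, and the restriction to convolutional and fully connected weights are our assumptions rather than details given above; in a full simulation the temporal term N_T would be re-drawn for every inference batch.

```python
import torch
import torch.nn as nn


def inject_weight_noise(model: nn.Module, eta: float,
                        temporal_level: float = 0.2,
                        spatial_level: float = 0.1) -> None:
    """In-place weight perturbation following Eq. (2):
        W_noisy = W_orig + W_orig * N_T * N_S,
    with N_T ~ N(eta, temporal_level * eta) and N_S ~ N(eta, spatial_level * eta).
    Both terms are drawn together here for a single simulated inference pass."""
    with torch.no_grad():
        for module in model.modules():
            # Assumption: only conv/linear weights are stored in the analog arrays.
            if isinstance(module, (nn.Conv2d, nn.Linear)):
                w = module.weight
                n_s = eta + spatial_level * eta * torch.randn_like(w)   # spatial: fixed per chip
                n_t = eta + temporal_level * eta * torch.randn_like(w)  # temporal: per batch
                w.add_(w * n_t * n_s)
```

Note that the perturbation is applied in place, so the original weights should be saved (or the model reloaded) before repeating an experiment with a different noise scale η.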
Algorithm 1: Calibrate the statistics of BatchNorm

for x_i in X_NoisyTrain do
    B_σ ← m · B_σ + (1 − m) · x_i^σ
    B_µ ← m · B_µ + (1 − m) · x_i^µ

Here B_µ and B_σ are the running estimates of the mean and variance, x_i^µ and x_i^σ are the statistics of the current batch x_i computed under noisy weights, and m is the EMA momentum.
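Below is a minimal PyTorch sketch of Algorithm 1. It assumes the noise has already been injected into the model weights and that loader iterates over training images; the helper name and the decision to switch only the BatchNorm layers into training mode (so dropout and all weights stay untouched) are our choices. Since the EMA momentum m = 0.999 puts most of the weight on the running estimate, the loop should cover a large number of batches, e.g., a full pass over the training set.

```python
import torch
import torch.nn as nn


def calibrate_batchnorm(model: nn.Module, loader,
                        ema_momentum: float = 0.999,
                        device: str = "cuda") -> None:
    """Re-estimate the BatchNorm running mean/variance under noisy weights.
    Only the running statistics change; the weights and the affine gamma/beta
    parameters are left as they were after training."""
    model.to(device).eval()
    for module in model.modules():
        if isinstance(module, nn.modules.batchnorm._BatchNorm):
            # Paper EMA: B <- m * B + (1 - m) * batch_stat.  PyTorch's `momentum`
            # weights the *batch* statistic, hence 1 - m here.
            module.momentum = 1.0 - ema_momentum
            module.train()              # let BN update its running estimates
    with torch.no_grad():               # no gradients, no parameter updates
        for images, _ in loader:        # calibration uses the images only
            model(images.to(device))
    model.eval()
```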
5 Experiments

Image Classification

Experiments are conducted on the widely used ImageNet-2012 dataset (Krizhevsky et al., 2012) with different architectures. We show the performance degradation when noise is injected into the model weights, and demonstrate that the proposed method eases the distribution mismatch and leads to favorable performance.
Experimental Setup.
ImageNet-2012 is a large-scale dataset for image classification with 1k categories. It contains roughly 1.3 million training images and 50k validation images of 256x256 pixels. We use the PyTorch official pretrained models and compare the performance of Calibrated BatchNorm with EMA (batch size = 256, momentum = 0.999) against the original models under noise injection with a temporal level of σ_T = 20% and a spatial level of σ_S = 10%.
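For illustration, a hypothetical end-to-end wiring of this setup, reusing the inject_weight_noise and calibrate_batchnorm sketches from above, might look as follows; the dataset path, preprocessing, and loader settings are placeholders rather than the exact pipeline used here.

```python
import torch
import torchvision
import torchvision.transforms as T

# Simulate a noisy analog deployment of a pretrained classifier (eta = 0.06,
# temporal level 20%, spatial level 10%), then recalibrate the BatchNorm
# statistics on training images with EMA momentum 0.999 and batch size 256.
model = torchvision.models.resnet50(pretrained=True)
inject_weight_noise(model, eta=0.06, temporal_level=0.2, spatial_level=0.1)

transform = T.Compose([
    T.Resize(256),
    T.CenterCrop(224),
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])
train_set = torchvision.datasets.ImageNet("/path/to/imagenet", split="train",
                                          transform=transform)
train_loader = torch.utils.data.DataLoader(train_set, batch_size=256,
                                           shuffle=True, num_workers=8)

calibrate_batchnorm(model, train_loader, ema_momentum=0.999)
# The calibrated model is then evaluated on the validation split as usual.
```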
Table 1: Validation accuracy (%) on ImageNet-2012.

               η=0.0   η=0.02  η=0.04  η=0.06  η=0.08  η=0.10
ResNet-34      73.31   71.53   65.73   49.92   20.55    3.66
  ours         73.31   72.34   72.28   72.07   71.97   71.64
ResNet-50      76.13   72.85   56.88   19.49    2.73    0.40
  ours         76.13   74.81   74.79   74.71   74.57   74.54
ResNet-101     77.37   73.80   52.89    9.32    0.38    0.08
  ours         77.37   76.30   76.27   76.25   76.12   75.94
WideResNet     78.47   75.09   59.24   23.36    3.93    0.59
  ours         78.47   76.10   76.08   75.95   75.77   75.72
MobileNet-v2   71.88   49.61    1.96    0.19    0.13    0.12
  ours         71.88   68.76   67.97   66.90   65.26   63.41

Results. Table 1 shows that even on this large-scale dataset, the effect of Calibrated BatchNorm is notable. Besides, we find that the deeper the network, the more susceptible it is to noise due to noise propagation: under a noise scale of 0.06, ResNet-34, ResNet-50, and ResNet-101 reach 49.92%, 19.49%, and 9.32%, respectively. Moreover, in the comparison between ResNet-50 and WideResNet-50-2, increasing the width of the network, which is regarded as making models more robust to adversarial examples (Gao et al., 2019), does not seem to bring any benefit in terms of noise resistance in the analog computing setting, because the additional parameters are affected by the noise as well.
Object Detection

Experimental Setup. We use the official pretrained YOLO-v3 (Redmon & Farhadi, 2018) implementation as our detector and conduct all experiments on the COCO-2014 dataset (Lin et al., 2014) with 80 object classes. The detector is trained on the trainval35k set with around 75k images and evaluated on a held-out set of 5k images. We report the standard COCO evaluation metrics, mean average precision (mAP) and mean average recall (mAR) at an IoU threshold of 0.5, comparing the performance of Calibrated BatchNorm with EMA (batch size = 8, momentum = 0.999) against the original model under noise injection with a temporal level of σ_T = 20% and a spatial level of σ_S = 10%.
Results. Table 2 shows that even when the injected noise is severe, our method still works on this more difficult computer vision task.

Table 2: Validation mAP and mAR (%) at 0.5 IoU on the COCO-2014 5k validation set.

                        η=0            η=0.05         η=0.10
                      mAP   mAR      mAP   mAR      mAP   mAR
Original model        51.5  75.1     30.5  58.4      0.0   0.0
Calibrated BatchNorm  51.5  75.1     49.5  73.3     48.6  71.3

Semantic Segmentation

Experimental Setup. We use the PyTorch official FCN-ResNet101 (Long et al., 2014), pretrained on the part of the COCO dataset that shares the same classes as PASCAL VOC-2012 (Everingham et al.), and conduct the segmentation task on PASCAL VOC-2012 with 21 classes, including background. We compare the performance of Calibrated BatchNorm with EMA (batch size = 1, momentum = 0.999) against the original model under noise injection with a temporal level of σ_T = 20% and a spatial level of σ_S = 10%.
Results. Table 3 shows the results of semantic segmentation. The mIoU metric is more reliable than pixel accuracy (PixAcc) because mIoU focuses on the classes excluding background. The experiment demonstrates that the mIoU of the original model drops drastically as the noise scale increases, whereas Calibrated BatchNorm can handle the distribution shift even when the data arrive in a streaming manner and in the task of semantic segmentation.

Table 3: Validation mIoU and PixAcc (%) on PASCAL VOC-2012.

                        η=0             η=0.05          η=0.10
                      mIoU  PixAcc    mIoU  PixAcc    mIoU  PixAcc
Original model        63.7   91.9     25.9   80.8      2.95  62.2
Calibrated BatchNorm  63.7   91.9     61.1   90.3     60.6   90.1

6 Conclusion

We formulate the imprecision of processing-in-memory computing as a distribution shift problem and propose an effective way to calibrate the mismatched distributions. We conduct extensive experiments on important vision tasks, including classification, object detection, and semantic segmentation; the results demonstrate significant improvements and further promote the development of neural networks in the analog computing field.

References
S. Angizi, Z. He, D. Reis, X. S. Hu, W. Tsai, S. J. Lin, and D. Fan. Accelerating deep neural networks in processing-in-memory platforms: Analog or digital approach? In AAAI, 2018.

W. Haensch, T. Gokmen, and R. Puri. The next generation of deep learning hardware: Analog computing. Proceedings of the IEEE, 107(1):108–122, 2019.

Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift, 2015.

Vinay Joshi, Manuel Le Gallo, Irem Boybat, Simon Haefeli, Christophe Piveteau, Martino Dazzi, Bipin Rajendran, Abu Sebastian, and Evangelos Eleftheriou. Accurate deep neural network inference using computational phase-change memory, 2019.

Michael Klachko, Mohammad Reza Mahmoodi, and Dmitri B. Strukov. Improving noise tolerance of mixed-signal neural networks, 2019.

Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. ImageNet classification with deep convolutional neural networks. In NIPS'12, 2012.

Tsung-Yi Lin, Michael Maire, Serge Belongie, Lubomir Bourdev, Ross Girshick, James Hays, Pietro Perona, Deva Ramanan, C. Lawrence Zitnick, and Piotr Dollár. Microsoft COCO: Common objects in context, 2014.

Jonathan Long, Evan Shelhamer, and Trevor Darrell. Fully convolutional networks for semantic segmentation, 2014.

Akiyo Nomura, Megumi Ito, Atsuya Okazaki, Masatoshi Ishii, Sangbum Kim, Junka Okazawa, Kohji Hosokawa, and Wilfried Haensch. NVM weight variation impact on analog spiking neural network chip. In Long Cheng, Andrew Chi Sing Leung, and Seiichi Ozawa (eds.), Neural Information Processing, pp. 676–685, Cham, 2018. Springer International Publishing. ISBN 978-3-030-04239-4.

Joseph Redmon and Ali Farhadi. YOLOv3: An incremental improvement, 2018.

A. Shafiee, A. Nag, N. Muralimanohar, R. Balasubramonian, J. P. Strachan, M. Hu, R. S. Williams, and V. Srikumar. ISAAC: A convolutional neural network accelerator with in-situ analog arithmetic in crossbars. In ISCA, pp. 14–26, 2016.

Saurabh Singh and Abhinav Shrivastava. EvalNorm: Estimating batch normalization statistics for evaluation, 2019.

Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15(56):1929–1958, 2014. URL http://jmlr.org/papers/v15/srivastava14a.html.