A New Look at Ghost Normalization
A Preprint
Neofytos Dimitriou
School of Computer Science, University of St Andrews, St Andrews, United Kingdom
[email protected]
Ognjen Arandjelović
School of Computer Science, University of St Andrews, St Andrews, United Kingdom
[email protected]
July 20, 2020

Abstract

Batch normalization (BatchNorm) is an effective yet poorly understood technique for neural network optimization. It is often assumed that the degradation in BatchNorm performance at smaller batch sizes stems from it having to estimate layer statistics using smaller sample sizes. However, recently, Ghost normalization (GhostNorm), a variant of BatchNorm that explicitly uses smaller sample sizes for normalization, has been shown to improve upon BatchNorm on some datasets. Our contributions are: (i) we uncover a source of regularization that is unique to GhostNorm, and not simply an extension of BatchNorm, (ii) three types of GhostNorm implementations are described, two of which employ BatchNorm as the underlying normalization technique, (iii) by visualizing the loss landscape of GhostNorm, we observe that GhostNorm consistently decreases the smoothness when compared to BatchNorm, (iv) we introduce Sequential Normalization (SeqNorm), and report superior performance over state-of-the-art methodologies on both the CIFAR–10 and CIFAR–100 datasets.

Keywords: Group normalization · Sequential normalization · Loss landscape · Accumulating gradients · Image classification
1 Introduction

The effectiveness of Batch Normalization (BatchNorm), a technique first introduced by Ioffe and Szegedy [1], on neural network optimization has been demonstrated over the years on a variety of tasks, including computer vision [2, 3, 4], speech recognition [5], and others [6, 7, 8]. BatchNorm is typically embedded at each neural network (NN) layer, either before or after the activation function, normalizing and projecting the input features to match a Gaussian-like distribution. Consequently, the activation values of each layer maintain more stable distributions during NN training, which in turn is thought to enable faster convergence and better generalization performance [1, 9, 10].

Despite the wide adoption and practical success of BatchNorm, its underlying mechanics within the context of NN optimization have yet to be fully understood. Initially, Ioffe and Szegedy suggested that its effectiveness came from reducing the so-called internal covariate shift [1]. At a high level, internal covariate shift refers to the change in the distribution of the inputs of each NN layer that is caused by updates to the previous layers. This continual change throughout training was conjectured to negatively affect optimization [1, 9]. However, recent research disputes that with compelling evidence that demonstrates how BatchNorm may in fact be increasing internal covariate shift [9]. Instead, the effectiveness of BatchNorm is argued to be a consequence of a smoother loss landscape [9].

Following the effectiveness of BatchNorm on NN optimization, a number of different normalization techniques have been introduced [11, 12, 13, 14, 15]. Their main inspiration was to provide different ways of normalizing the activations without being inherently affected by the batch size. In particular, it is often observed that BatchNorm performs worse with smaller batch sizes [11, 14, 16]. This degradation has been widely attributed to BatchNorm computing poorer estimates of mean and variance due to having a smaller sample size. However, the recently demonstrated effectiveness of GhostNorm stands in antithesis to the above belief [17]. GhostNorm explicitly divides the mini-batch into smaller batches and normalizes over them independently [18]. Nevertheless, when compared to other normalization techniques [11, 12, 13, 14, 15], the adoption of GhostNorm has been rather scarce, and narrow to large batch size training regimes [17].

In this work, we take a new look at GhostNorm, and contribute in the following ways: (i) identifying a source of regularization that is unique to GhostNorm, and discussing the difference against other normalization techniques, (ii) providing a direct way of implementing GhostNorm, as well as implementations through the use of accumulating gradients and multiple GPUs, (iii) visualizing the loss landscape of GhostNorm under vastly different experimental setups, and observing that GhostNorm consistently decreases the smoothness of the loss landscape, especially in the later epochs of training, (iv) introducing SeqNorm as a new normalization technique, (v) surpassing the performance of baselines that are based on state-of-the-art (SOTA) methodologies on CIFAR–10 and CIFAR–100 for both GhostNorm and SeqNorm, with the latter even surpassing the current SOTA on CIFAR–100 that employs a data augmentation strategy.
1.1 Related Work

Ghost Normalization is a technique originally introduced by Hoffer et al. [18]. Over the years, the primary use of GhostNorm has been to optimize NNs with large batch sizes and multiple GPUs [17]. However, when compared to other normalization techniques [11, 12, 13, 14, 15], the adoption of GhostNorm has been rather scarce.

In parallel to our work, we were able to identify one recently published work that has in fact experimented with GhostNorm on both small and medium batch size training regimes. Summers and Dinneen [17] tuned the number of groups within GhostNorm (see section 2.1) on CIFAR–100, Caltech–256, and SVHN, and reported positive results on the first two datasets. More results are reported on other datasets through transfer learning; however, the use of other new optimization methods confounds the contribution made by GhostNorm.

The closest line of work to SeqNorm is, again, found in the work of Summers and Dinneen [17]. Therein they employ a normalization technique which, although at first glance it may appear similar to SeqNorm, is at a fundamental level rather different. This stems from the vastly different goals of our works, i.e. they try to increase the available information when small batch sizes are used [17], whereas we strive to improve GhostNorm in the more general setting. At a high level, where SeqNorm performs GroupNorm and GhostNorm sequentially, their normalization method applies both simultaneously. At a fundamental level, the latter embeds the stochastic nature of GhostNorm (see section 2.2) into that of GroupNorm, thereby potentially disrupting the learning of channel grouping within NNs. Switchable normalization is also of some relevance to SeqNorm as it enables the NN to learn which normalization techniques to employ at different layers [19]. However, similar to the previous work, simultaneously applying different normalization techniques has a fundamentally different effect than SeqNorm.

Related to our work is also research geared towards exploring the effects of BatchNorm on optimization [10, 9, 20]. Finally, of some relevance is the large body of work on improving BatchNorm in the small batch training regime [11, 14, 21, 16].
2 Normalization Techniques

2.1 Formulation

Given a fully-connected or convolutional neural network, the parameters of a typical layer l with normalization, Norm, are the weights W^l as well as the scale and shift parameters γ^l and β^l. For brevity, we omit the l superscript. Given an input tensor X, the activation values A of layer l are computed as

$$A = g\left(\mathrm{Norm}(X \odot W) \otimes \gamma + \beta\right) \quad (1)$$

where g(·) is the activation function, ⊙ corresponds to either matrix multiplication or convolution for fully-connected and convolutional layers respectively, and ⊗ denotes element-wise multiplication.

Most normalization techniques differ in how they normalize the product X ⊙ W. Let the product be a tensor with (M, C, F) dimensions, where M is the so-called mini-batch size, or just batch size, C is the channels dimension, and F is the spatial dimension.

In BatchNorm, the given tensor is normalized across the channels dimension. In particular, the mean and variance are computed across C slices of (M, F) dimensions (see Figure 1), which are subsequently used to normalize each channel c ∈ C independently. In LayerNorm, statistics are computed over M slices of (C, F) dimensions, normalizing the values of each data sample m ∈ M independently. InstanceNorm normalizes the values of the tensor over both M and C, i.e. it computes statistics across M × C slices of F dimensions.

GroupNorm can be thought of as an extension to LayerNorm wherein the C dimension is divided into G_C groups, i.e. (M, G_C, C/G_C, F). Statistics are calculated over M × G_C slices of (C/G_C, F) dimensions.

Figure 1: The input tensor is divided into a number of line (1D) or plane (2D) slices. Each normalization technique slices the input tensor differently, and each slice is normalized independently of the other slices.

Similarly, GhostNorm can be thought of as an extension to BatchNorm wherein the M dimension is divided into G_M groups, normalizing over C × G_M slices of (M/G_M, F) dimensions. Both G_C and G_M are hyperparameters that can be tuned based on a validation set. All of the aforementioned normalization techniques are illustrated in Figure 1.

SeqNorm employs both GroupNorm and GhostNorm in a sequential manner. Initially, the input tensor is divided into (M, G_C, C/G_C, F) dimensions, normalizing across M × G_C slices, i.e. the same as GroupNorm. Then, once the G_C and C/G_C dimensions are collapsed back together, the input tensor is divided into (G_M, M/G_M, C, F) dimensions for normalizing over C × G_M slices of (M/G_M, F) dimensions.

Each of the slices described above is treated as a one-dimensional set of values S. The mean and variance of S are computed in the traditional way (see Equation 2), and the values of S are then normalized as shown in Equation 3:

$$\mu = \frac{1}{|S|} \sum_{x \in S} x \quad \text{and} \quad \sigma^2 = \frac{1}{|S|} \sum_{x \in S} (x - \mu)^2 \quad (2)$$

$$\forall x \in S, \quad \hat{x} = \frac{x - \mu}{\sqrt{\sigma^2 + \epsilon}} \quad (3)$$

Once all slices are normalized, the output of the Norm layer is simply the concatenation of all slices back into the initial tensor shape.
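To make the slicing concrete, the sketch below (ours, for illustration) expresses the statistics of each technique as reductions over different dimensions of a PyTorch tensor with (M, C, F) dimensions; the helper name normalize_over is our own, and the tensor sizes are arbitrary.

import torch

def normalize_over(X, dims, eps=1e-05):
    # Normalize each slice of X independently, where a slice is obtained
    # by fixing all dimensions other than those in dims.
    mu = X.mean(dim=dims, keepdim=True)
    var = X.var(dim=dims, unbiased=False, keepdim=True)
    return (X - mu) / torch.sqrt(var + eps)

M, C, F = 8, 4, 16
G_C, G_M = 2, 2
X = torch.randn(M, C, F)

batch_norm    = normalize_over(X, dims=(0, 2))   # C slices of (M, F)
layer_norm    = normalize_over(X, dims=(1, 2))   # M slices of (C, F)
instance_norm = normalize_over(X, dims=(2,))     # M x C slices of F
group_norm = normalize_over(X.view(M, G_C, C // G_C, F), dims=(2, 3)).view(M, C, F)
ghost_norm = normalize_over(X.view(G_M, M // G_M, C, F), dims=(1, 3)).view(M, C, F)

SeqNorm then corresponds to the group_norm line followed by the ghost_norm line, with the latter applied to the output of the former.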
2.2 A Source of Regularization

There is only one other published work which has investigated the effectiveness of Ghost Normalization for small and medium mini-batch sizes [17]. Therein, the authors hypothesize that GhostNorm offers stronger regularization than BatchNorm as it computes the normalization statistics on smaller sample sizes [17]. In this section, we support that hypothesis by providing insights into a particular source of regularization, unique to GhostNorm, that stems from normalizing groups of activations during a forward pass.

Consider as an example an 8-tuple X of activation values, which can be thought of as an input tensor with (8, 1, 1) dimensions. Given to a BatchNorm layer, every value is normalized using the mean and variance of the whole tuple, yielding the normalized version X̄. Although the values change, the ranking order of the activation values remains the same, e.g. if the 2nd value is larger than the 5th value in X, it is also larger in X̄. More formally, the following holds true:

Given n-tuples $X = (x_1, x_2, \dots, x_n)$ and $\bar{X} = (\bar{x}_1, \bar{x}_2, \dots, \bar{x}_n)$,

$$\forall i, j \in I: \quad \bar{x}_i > \bar{x}_j \iff x_i > x_j, \qquad \bar{x}_i < \bar{x}_j \iff x_i < x_j, \qquad \bar{x}_i = \bar{x}_j \iff x_i = x_j \quad (4)$$

On the other hand, given X to a GhostNorm layer with G_M = 2, the first four and last four values are normalized independently, each group with its own mean and variance. After normalization, a value that was larger than another in X can become smaller in X̄ whenever the two values fall into different groups. Where BatchNorm preserves the ranking order, GhostNorm can modify the importance of each sample, and hence alter the course of optimization. Our experimental results demonstrate how GhostNorm improves upon BatchNorm, supporting the hypothesis that the above type of regularization can be beneficial to optimization. Note that for BatchNorm the condition in Equation 4 only holds true across the M × F dimensions of the input tensor, whereas for GhostNorm it cannot be guaranteed for any given dimension.
GhostNorm to BatchNorm. One can argue that the same type of regularization is found in BatchNorm over different mini-batches, since the same activation values are normalized with different statistics depending on the mini-batch in which they appear. However, GhostNorm introduces the above effect during each forward pass rather than between forward passes. Hence, it is a regularization that is embedded during learning (GhostNorm), rather than across learning (BatchNorm).
GhostNorm to GroupNorm. Despite the visual symmetry between GhostNorm and GroupNorm, there is one major difference. Grouping has been employed extensively in classical feature engineering, such as SIFT, HOG, and GIST, wherein independent normalization is often performed over these groups [14]. At a high level, GroupNorm can be thought of as motivating the network to group similar features together [14]. On the other hand, that learning behaviour is not possible with GhostNorm due to the random sampling, and random arrangement, of data within each mini-batch. Therefore, we hypothesize that the effects of these two normalization techniques could be combined for their benefits to be accumulated. Specifically, we propose SeqNorm, a normalization technique that employs both GroupNorm and GhostNorm in a sequential manner.
2.3 Implementation

The direct approach to implementing GhostNorm is shown in Figure 2. Although the exponential moving averages are omitted for brevity, it is worth mentioning that they are accumulated in the same way as in BatchNorm. In addition to the above direct implementation, GhostNorm can be effectively employed while using BatchNorm as the underlying normalization technique.

When the desired batch size exceeds the memory capacity of the available GPUs, practitioners often resort to the use of accumulating gradients. That is, instead of having a single forward pass with M examples through the network, n_fp forward passes are made with M/n_fp examples each. Most of the time, gradients computed using a smaller number of training examples, i.e. M/n_fp, and accumulated over n_fp forward passes are identical to those computed using a single forward pass of M training examples. However, it turns out that when BatchNorm is employed in the neural network, the gradients can be substantially different in the above two cases. This is a consequence of the mean and variance calculation (see Equation 2), since each forwarded smaller batch of M/n_fp data will have a different mean and variance than if all M examples were present. Accumulating gradients with BatchNorm can thus be thought of as an alternative way of using GhostNorm, with the number of forward passes n_fp corresponding to the number of groups G_M. A PyTorch implementation of accumulating gradients is shown in Figure 3.

Finally, the most popular implementation of GhostNorm via BatchNorm, albeit typically an unintentional one, comes as a consequence of using multiple GPUs. Given n_g GPUs and M training examples, M/n_g examples are forwarded to each GPU. If the BatchNorm statistics are not synchronized across the GPUs, i.e. synchronized BatchNorm is not employed, which is often the case for image classification, then n_g corresponds to the number of groups G_M.

A practitioner who would like to use GhostNorm should employ the implementation shown in Figure 2. Nevertheless, under the discussed circumstances, one could explore GhostNorm through the use of the other implementations.
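As a minimal sketch of the second implementation (ours, for illustration; the function name is hypothetical), GhostNorm can be emulated by passing each of the G_M chunks of the mini-batch through the same BatchNorm module:

import torch
import torch.nn as nn

def ghost_norm_via_batchnorm(X, bn, groups_m):
    # X: (M, C, F) tensor; bn: a shared nn.BatchNorm1d(C) module.
    # Each chunk of M / groups_m examples is normalized with its own batch
    # statistics, as in GhostNorm; the running averages of bn are updated
    # once per chunk.
    return torch.cat([bn(chunk) for chunk in X.chunk(groups_m, dim=0)], dim=0)

bn = nn.BatchNorm1d(4)
out = ghost_norm_via_batchnorm(torch.randn(8, 4, 16), bn, groups_m=2)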
Sequential Normalization. The implementation of SeqNorm is straightforward since it combines GroupNorm, a widely implemented normalization technique, and GhostNorm, for which we have discussed three possible implementations; a sketch combining the two steps is given after Figure 3. The full implementation will be provided in a code repository.
import torch

def GhostNorm(X, groupsM, eps=1e-05):
    """
    X: Input Tensor with (M, C, F) dimensions
    groupsM: Number of groups for the mini-batch dimension
    eps: A small value to prevent division by zero
    """
    M, C, F = X.shape
    # Split the batch dimension into groupsM groups and normalize each
    # (M / groupsM, F) slice, per group and channel, independently.
    X = X.view(groupsM, M // groupsM, C, F)
    mu = X.mean(dim=(1, 3), keepdim=True)
    var = X.var(dim=(1, 3), unbiased=False, keepdim=True)
    X = (X - mu) / torch.sqrt(var + eps)
    return X.view(M, C, F)
Figure 2: Python code for GhostNorm in PyTorch.

def train_for_an_epoch():
    model.train()
    model.zero_grad()
    for i, (X, y) in enumerate(train_loader):
        outputs = model(X)
        loss = loss_function(outputs, y)
        loss = loss / acc_steps
        loss.backward()
        if (i + 1) % acc_steps == 0:
            optimizer.step()
            model.zero_grad()
Figure 3: Python code for accumulating gradients in PyTorch.
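For completeness, below is a minimal sketch of SeqNorm (our illustration, not the authors' released code), composing a GroupNorm step with the direct GhostNorm implementation of Figure 2:

import torch

def SeqNorm(X, groupsC, groupsM, eps=1e-05):
    M, C, F = X.shape
    # GroupNorm step: statistics over M x groupsC slices of (C / groupsC, F).
    Xg = X.view(M, groupsC, C // groupsC, F)
    mu = Xg.mean(dim=(2, 3), keepdim=True)
    var = Xg.var(dim=(2, 3), unbiased=False, keepdim=True)
    X = ((Xg - mu) / torch.sqrt(var + eps)).view(M, C, F)
    # GhostNorm step: statistics over C x groupsM slices of (M / groupsM, F).
    return GhostNorm(X, groupsM, eps)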
3 Experiments

In this section, we first strive to take a closer look into GhostNorm by visualizing the smoothness of the loss landscape during training, a component which has been described as the primary reason behind the effectiveness of BatchNorm. Then, we evaluate the effectiveness of both GhostNorm and SeqNorm on the standard image classification datasets of CIFAR–10 and CIFAR–100. Note that in all of our experiments, the smallest M/G_M ratio we employ for both SeqNorm and GhostNorm is 4. A ratio of 1 would be undefined for normalization, whereas a ratio of 2 results in large information corruption, i.e. all activation values are reduced to either 1 or −1.

On MNIST, we train a fully-connected neural network (SimpleNet) with two fully-connected layers. The input images are flattened into one-dimensional vectors of 784 values, and are normalized based on the mean and variance of the training set. Training is performed on a single GPU with a fixed learning rate and batch size. In addition to training SimpleNet with BatchNorm and GhostNorm, we also train a SimpleNet baseline without any normalization technique.

A residual convolutional network with 56 layers (ResNet–56) [4] is employed for CIFAR–10. We achieve super-convergence by using the one cycle learning policy described in the work of Smith and Topin [22]. Horizontal flipping and pad-and-crop transformations are used for data augmentation. We use a triangular learning rate schedule that anneals between a minimum and a maximum value, decaying the learning rate further for the last few epochs. In order to train ResNet–56 without a normalization technique (baseline), we had to adjust the range of the cyclical learning rate schedule.

For both datasets, we train the networks on the full set of training images, and evaluate on the 10,000 testing images. For completeness, further implementation details are provided in the Appendix.

Figure 4: Comparison of the loss landscape on MNIST between BatchNorm, GhostNorm, and the baseline.
Loss landscape. We visualize the loss landscape during optimization using an approach described by Santurkar et al. [9]. Each time the network parameters are to be updated, we walk in the gradient direction and compute the loss at multiple points. This enables us to visualize the smoothness of the loss landscape by observing how predictive the computed gradients are. In particular, at each step of updating the network parameters, we compute the loss at a range of learning rates, and store both the minimum and maximum loss (see the sketch at the end of this section). For MNIST, we compute the loss over a range of fixed learning rates, whereas for CIFAR–10 we do so over a range of cyclical learning rate schedules, and analogously for the baseline.

For both datasets and networks, we observe that the smoothness of the loss landscape deteriorates when GhostNorm is employed. In fact, for MNIST, as seen in Figure 4, the loss landscape of GhostNorm bears closer resemblance to our baseline which did not use any normalization technique. For CIFAR–10, this is only observable towards the last epochs of training. In spite of the above observation, we have consistently witnessed better generalization performance with GhostNorm in almost all of our experiments, even at the extreme wherein each normalization group contains only 4 samples. We include further experimental results and discussion in the Appendix.

Our experimental results challenge the often established correlation between a smoother loss landscape and a better generalization performance [9, 13]. Although beyond the scope of our work, a theoretical analysis of the implications of GhostNorm when compared to BatchNorm could potentially uncover further insights into the optimization mechanisms of both normalization techniques.
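For reference, the probing procedure can be sketched as follows (our reconstruction of the approach described above; model, loss_fn, and the grid of learning rates lrs are placeholders):

import torch

def loss_range_along_gradient(model, loss_fn, X, y, lrs):
    # Compute the gradients once, then evaluate the loss at several points
    # along the gradient direction, one point per candidate learning rate.
    loss = loss_fn(model(X), y)
    grads = torch.autograd.grad(loss, list(model.parameters()))
    losses = []
    for lr in lrs:
        with torch.no_grad():
            for p, g in zip(model.parameters(), grads):
                p -= lr * g          # step to the probe point
            losses.append(loss_fn(model(X), y).item())
            for p, g in zip(model.parameters(), grads):
                p += lr * g          # undo the step
    return min(losses), max(losses)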
Figure 5: Comparison of the loss landscape on CIFAR–10 between BatchNorm, GhostNorm, and the baseline.
4 State-of-the-Art Comparison

In this section, we further explore the effectiveness of both GhostNorm and SeqNorm on optimization using state-of-the-art (SOTA) methodologies for image classification. Note that the maximum G_C is, by architectural design, set to 16, i.e. the smallest layer in the network has 16 channels.
Implementation details. For both CIFAR–10 and CIFAR–100, we employ a validation set that is randomly stratified from the 50,000 training images, and the standard testing set of 10,000 images. The input data are stochastically augmented with horizontal flips, pad-and-crop, as well as Cutout [23]. We use the same hyperparameter configurations as Cubuk et al. [24]. However, in order to speed up optimization, we increase the batch size from 128 to 512, and apply a warmup scheme [25] that increases the initial learning rate by four times over the first few epochs; thereafter we use the cosine learning rate schedule (a sketch is given below). Based on the above experimental settings, we train Wide-ResNet models [26]. Note that since 8 GPUs are employed, our BatchNorm baselines are equivalent to using GhostNorm with G_M = 8. Nevertheless, to avoid any confusion, we refer to them as BatchNorm. It is worth mentioning that setting G_M to g on n GPUs is equivalent to setting it to g × n on a single GPU.
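The resulting schedule can be sketched as follows (our illustration; the base learning rate and warmup length are hypothetical placeholders, not the values used in the experiments):

import math

def learning_rate(epoch, total_epochs, base_lr=0.1, warmup_epochs=5):
    # Linear warmup to 4 x base_lr (matching the 4 x batch size increase),
    # followed by cosine annealing over the remaining epochs.
    peak = 4 * base_lr
    if epoch < warmup_epochs:
        return base_lr + (peak - base_lr) * epoch / warmup_epochs
    t = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
    return 0.5 * peak * (1 + math.cos(math.pi * t))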
CIFAR–100. Initially, we turn to CIFAR–100, and tune the hyperparameters G_C and G_M of SeqNorm in a grid-search fashion. The results are shown in Table 1. Both GhostNorm and SeqNorm improve upon the baseline by a large margin. Moreover, SeqNorm surpasses the current SOTA performance on CIFAR–100, which uses a data augmentation strategy [24]. These results support our hypothesis that sequentially applying GhostNorm and GroupNorm can have an additive effect on improving NN optimization.

However, the grid-search approach to tuning G_C and G_M is rather time consuming (time complexity: Θ(G_C × G_M)). Hence, we attempt to identify a less demanding hyperparameter tuning approach. The most obvious, and the one we actually adopt for the next experiment, is to tune G_C and G_M sequentially. In particular, we find that first tuning G_M, then selecting the largest g_M ∈ G_M for which the network performs well, and finally tuning G_C with g_M fixed, is an effective approach (time complexity: Θ(G_C + G_M)); a sketch of this procedure is given below. Note that by following this approach on CIFAR–100, we still end up with the same best SeqNorm configuration, i.e. G_C = 4 and G_M = 8.
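A compact rendering of this sequential tuning strategy (our pseudocode; train_and_validate is a hypothetical helper that trains a model with the given groups and returns its validation accuracy):

def tune_seqnorm(train_and_validate, gm_candidates, gc_candidates, tol=0.1):
    # Step 1: tune G_M for GhostNorm alone (no GroupNorm step).
    scores = {gm: train_and_validate(g_c=None, g_m=gm) for gm in gm_candidates}
    best = max(scores.values())
    # Step 2: keep the largest G_M that performs within tol of the best.
    g_m = max(gm for gm, acc in scores.items() if acc >= best - tol)
    # Step 3: tune G_C with the chosen G_M fixed.
    g_c = max(gc_candidates, key=lambda gc: train_and_validate(g_c=gc, g_m=g_m))
    return g_c, g_m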
Table 1: Results on CIFAR–100 by tuning both G_C and G_M in a grid-search fashion. For SeqNorm, we only show the best results for each G_C. Both validation and testing performances are averaged over two different runs. Given the same mean performance between two hyperparameter configurations, the one exhibiting less performance variance was adopted.

Normalization                    Validation accuracy    Testing accuracy
BatchNorm                        .6%                    82.
GhostNorm (G_M = 2)              .                      -
GhostNorm (G_M = 4)              .                      -
GhostNorm (G_M = 8)              .4%                    82.
GhostNorm (G_M = 16)             .                      -
SeqNorm (G_C = 1, G_M = 4)       .                      -
SeqNorm (G_C = 2, G_M = 4)       .                      -
SeqNorm (G_C = 4, G_M = 8)       .5%                    83.
SeqNorm (G_C = 8, G_M = 8)       .                      -
SeqNorm (G_C = 16, G_M = 8)      .                      -

Table 2: Results on CIFAR–10 based on the sequential tuning of G_C and G_M. Both validation and testing performances are averaged over two different runs.

Normalization                    Validation accuracy    Testing accuracy
BatchNorm                        .6%                    97.
GhostNorm (G_M = 4)              .7%                    97.
SeqNorm (G_C = 16, G_M = 8)      .8%                    97.
CIFAR–10. As the first step, we tune G_M for GhostNorm. We observe that for several values of G_M, the network performs similarly on the validation set; we choose G_M = 4 for GhostNorm since it exhibits slightly higher accuracy than the rest.

Based on the tuning strategy described in the previous section, we adopt G_M = 8 and tune G_C. Although the network again performs similarly for several values of G_C, we choose G_C = 16 as it achieves slightly higher accuracy than the rest. Using the above configuration, SeqNorm is able to match the current SOTA on the testing set [27].

A number of other approaches were adopted in conjunction with GhostNorm and SeqNorm. These preliminary experiments did not surpass the BatchNorm baseline performances on the validation sets (more often than not by a large margin), and are therefore not included in detail. Note that given a more elaborate hyperparameter tuning phase, these approaches might have otherwise succeeded.

In particular, we have also experimented with placing GhostNorm and GroupNorm in the reverse order within SeqNorm (in retrospect, this could have been expected given what we describe in Section 2.2), and with augmenting SeqNorm and GhostNorm with weight standardization [13] as well as with computing the variance of batch statistics over the whole input tensor. Finally, for all experiments, we have attempted to tune networks with only GroupNorm [14], but the networks were either unable to converge or achieved worse performances than the BatchNorm baselines.
5 Conclusion

It is generally believed that the performance deterioration of BatchNorm with smaller batch sizes stems from it having to estimate layer statistics using smaller sample sizes [11, 14, 16]. In this work we challenged this belief by demonstrating the effectiveness of GhostNorm on a number of different networks, learning policies, and datasets. For instance, when using super-convergence on CIFAR–10, GhostNorm performs better than BatchNorm, even though the former normalizes the input activations using only a fraction of the mini-batch whereas the latter uses all of its samples. By providing novel insight into the source of regularization in GhostNorm, and by introducing a number of possible implementations, we hope to inspire further research into GhostNorm.

Moreover, based on the understanding developed while investigating GhostNorm, we introduced SeqNorm and followed up with empirical analysis. Surprisingly, SeqNorm not only surpasses the performances of BatchNorm and GhostNorm, but even challenges the current SOTA methodologies on both CIFAR–10 and CIFAR–100 that employ data augmentation strategies [24, 27]. Finally, we also described a tuning strategy for SeqNorm that provides a faster alternative to the traditional grid-search approach.
References

[1] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In F. Bach and D. Blei, editors, International Conference on Machine Learning, volume 37, pages 448–456, July 2015.
[2] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. ImageNet classification with deep convolutional neural networks. Communications of the ACM, 60(6):84–90, 2017.
[3] G. Huang, Z. Liu, and K. Q. Weinberger. Densely connected convolutional networks. CoRR, abs/1608.06993, 2016.
[4] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 770–778, 2016.
[5] Alex Graves, Abdel-rahman Mohamed, and Geoffrey Hinton. Speech recognition with deep recurrent neural networks. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2013.
[6] Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. Sequence to sequence learning with neural networks. In Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 27, pages 3104–3112. Curran Associates, Inc., 2014.
[7] David Silver, Aja Huang, Chris J. Maddison, Arthur Guez, Laurent Sifre, George van den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, Sander Dieleman, Dominik Grewe, John Nham, Nal Kalchbrenner, Ilya Sutskever, Timothy Lillicrap, Madeleine Leach, Koray Kavukcuoglu, Thore Graepel, and Demis Hassabis. Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484–489, 2016.
[8] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness, Marc G. Bellemare, Alex Graves, Martin Riedmiller, Andreas K. Fidjeland, Georg Ostrovski, Stig Petersen, Charles Beattie, Amir Sadik, Ioannis Antonoglou, Helen King, Dharshan Kumaran, Daan Wierstra, Shane Legg, and Demis Hassabis. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.
[9] S. Santurkar, D. Tsipras, A. Ilyas, and A. Madry. How does batch normalization help optimization? In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors, Advances in Neural Information Processing Systems, pages 2483–2493, 2018.
[10] Nils Bjorck, Carla P. Gomes, Bart Selman, and Kilian Q. Weinberger. Understanding batch normalization. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors, Advances in Neural Information Processing Systems 31, pages 7694–7705. Curran Associates, Inc., 2018.
[11] Sergey Ioffe. Batch renormalization: Towards reducing minibatch dependence in batch-normalized models. In Proceedings of the 31st International Conference on Neural Information Processing Systems, pages 1942–1950. Curran Associates Inc., 2017.
[12] Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E. Hinton. Layer normalization. arXiv e-prints, arXiv:1607.06450, 2016.
[13] Siyuan Qiao, Huiyu Wang, Chenxi Liu, Wei Shen, and Alan Yuille. Weight standardization. arXiv e-prints, arXiv:1903.10520, 2019.
[14] Y. Wu and K. He. Group normalization. In European Conference on Computer Vision (ECCV), 2018.
[15] Dmitry Ulyanov, Andrea Vedaldi, and Victor Lempitsky. Instance normalization: The missing ingredient for fast stylization. arXiv e-prints, arXiv:1607.08022, July 2016.
[16] Junjie Yan, Ruosi Wan, Xiangyu Zhang, Wei Zhang, Yichen Wei, and Jian Sun. Towards stabilizing batch statistics in backward propagation of batch normalization. In International Conference on Learning Representations, 2020.
[17] Cecilia Summers and Michael J. Dinneen. Four things everyone should know to improve batch normalization. In International Conference on Learning Representations, 2020.
[18] Elad Hoffer, Itay Hubara, and Daniel Soudry. Train longer, generalize better: Closing the generalization gap in large batch training of neural networks. In Proceedings of the 31st International Conference on Neural Information Processing Systems, NIPS'17, pages 1729–1739. Curran Associates Inc., 2017.
[19] Ping Luo, Jiamin Ren, Zhanglin Peng, Ruimao Zhang, and Jingyu Li. Differentiable learning-to-normalize via switchable normalization. In International Conference on Learning Representations, 2019.
[20] Jonas Kohler, Hadi Daneshmand, Aurelien Lucchi, Thomas Hofmann, Ming Zhou, and Klaus Neymeyr. Exponential convergence rates for batch normalization: The power of length-direction decoupling in non-convex optimization. In Kamalika Chaudhuri and Masashi Sugiyama, editors, Proceedings of Machine Learning Research, volume 89, pages 806–815. PMLR, 2019.
[21] Chunjie Luo, Jianfeng Zhan, Lei Wang, and Wanling Gao. Extended batch normalization. arXiv e-prints, arXiv:2003.05569, March 2020.
[22] Leslie N. Smith and Nicholay Topin. Super-convergence: Very fast training of neural networks using large learning rates. arXiv e-prints, arXiv:1708.07120, August 2017.
[23] Terrance DeVries and Graham W. Taylor. Improved regularization of convolutional neural networks with Cutout. arXiv e-prints, arXiv:1708.04552, 2017.
[24] Ekin D. Cubuk, Barret Zoph, Jonathon Shlens, and Quoc V. Le. RandAugment: Practical automated data augmentation with a reduced search space. arXiv e-prints, arXiv:1909.13719, 2019.
[25] Priya Goyal, Piotr Dollár, Ross Girshick, Pieter Noordhuis, Lukasz Wesolowski, Aapo Kyrola, Andrew Tulloch, Yangqing Jia, and Kaiming He. Accurate, large minibatch SGD: Training ImageNet in 1 hour. arXiv e-prints, arXiv:1706.02677, 2017.
[26] S. Zagoruyko and N. Komodakis. Wide residual networks. CoRR, abs/1605.07146, 2016.
[27] Ekin D. Cubuk, Barret Zoph, Dandelion Mane, Vijay Vasudevan, and Quoc V. Le. AutoAugment: Learning augmentation strategies from data. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019.