Revisit Batch Normalization: New Understanding from an Optimization View and a Refinement via Composition Optimization
Xiangru Lian and Ji Liu [email protected], [email protected]
University of Rochester

October 16, 2018
Abstract
Batch Normalization (BN) has been used extensively in deep learning to achieve a faster training process and better resulting models. However, whether BN works strongly depends on how the batches are constructed during training, and it may not converge to a desired solution if the statistics on a batch are not close to the statistics over the whole dataset. In this paper, we try to understand BN from an optimization perspective by formulating the optimization problem which motivates BN. We show when BN works and when BN does not work by analyzing the optimization problem. We then propose a refinement of BN based on compositional optimization techniques, called Full Normalization (FN), to alleviate the issues of BN when the batches are not constructed ideally. We provide a convergence analysis for FN and empirically study its effectiveness in refining BN.
Batch Normalization (BN) [Ioffe and Szegedy, 2015] has been used extensively in deep learning [Szegedy et al., 2016, He et al., 2016, Silver et al., 2017, Huang et al., 2017, Hubara et al., 2017] to accelerate the training process. During the training process, a BN layer normalizes its input by the mean and variance computed within a mini-batch. Many state-of-the-art deep learning models are based on BN, such as ResNet [He et al., 2016, Xie et al., 2017] and Inception [Szegedy et al., 2016, 2017]. It is often believed that BN can mitigate the exploding/vanishing gradients problem [Cooijmans et al., 2016] or reduce internal covariate shift [Ioffe and Szegedy, 2015]. Therefore, BN has become a standard tool implemented in almost all deep learning frameworks, such as TensorFlow [Abadi et al., 2015], MXNet [Chen et al., 2015], and PyTorch [Paszke et al., 2017]. Despite the great success of BN in quite a few scenarios, practitioners with rich experience may also notice some issues with BN. Some examples are provided below.
BN fails/overfits when the mini-batch size is 1 as shown in Figure 1
We construct a simple network with 3 layers for a classification task. It fails to learn a reasonable classifier on a dataset with only 3 samples, as seen in Figure 1.
Figure 1: (Fail when batch size is 1) Given a dataset comprised of three samples: [0, 0, 0] with label 0, [1, 1, 1] with label 1, and [2, 2, 2] with label 2, use the following simple network including one batch normalization layer, where the numbers in the parentheses are the input and output dimensions of a layer: linear layer (3 → 3) ⇒ batch normalization ⇒ relu ⇒ linear layer (3 → 3) ⇒ nll loss. Train with batch size 1, and test on the same dataset. The test loss increases while the training loss decreases. (The plot shows the training and test loss with BN over 100 epochs.)
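To make the setup concrete, here is a minimal sketch of the Figure 1 experiment. The dataset values and layer sizes follow the reconstructed caption above, and `NaiveBN` is our own transcription of the training/inference procedures reviewed later as Algorithms 1 and 2 (PyTorch's built-in `nn.BatchNorm1d` refuses training batches of size 1, so we normalize manually); the learning rate and averaging constant are illustrative assumptions:

```python
# Sketch of the Figure 1 experiment: 3 samples, batch size 1, one BN layer.
# NaiveBN normalizes by the current batch's statistics during training and by
# running averages at test time (alpha, eps, lr are illustrative assumptions).
import torch
import torch.nn as nn
import torch.nn.functional as F

class NaiveBN(nn.Module):
    def __init__(self, dim, alpha=0.1, eps=1e-5):
        super().__init__()
        self.alpha, self.eps = alpha, eps
        self.register_buffer("mu", torch.zeros(dim))   # running mean
        self.register_buffer("nu", torch.ones(dim))    # running variance

    def forward(self, x):
        if self.training:
            m = x.mean(dim=0)
            v = x.var(dim=0, unbiased=False)           # zero for a 1-sample batch
            self.mu = (1 - self.alpha) * self.mu + self.alpha * m.detach()
            self.nu = (1 - self.alpha) * self.nu + self.alpha * v.detach()
            return (x - m) / torch.sqrt(v + self.eps)
        return (x - self.mu) / torch.sqrt(self.nu + self.eps)

X = torch.tensor([[0., 0., 0.], [1., 1., 1.], [2., 2., 2.]])
y = torch.tensor([0, 1, 2])
model = nn.Sequential(nn.Linear(3, 3), NaiveBN(3), nn.ReLU(), nn.Linear(3, 3))
opt = torch.optim.SGD(model.parameters(), lr=0.1)

for epoch in range(100):
    model.train()
    for i in range(3):                                  # batch size 1
        opt.zero_grad()
        F.nll_loss(F.log_softmax(model(X[i:i+1]), dim=1), y[i:i+1]).backward()
        opt.step()
    model.eval()                                        # test on the same data
    test_loss = F.nll_loss(F.log_softmax(model(X), dim=1), y).item()
```

With batch size 1, each training sample is normalized against itself, so the training-time and test-time behavior of the normalization layer diverge, which is the failure mode Figure 1 illustrates.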
BN's solution is sensitive to the mini-batch size as shown in Figure 2

The test conducted in Figure 2 uses ResNet18 on the CIFAR10 dataset. When the batch size changes, the neural network trained with BN tends to converge to a different solution. Indeed, this observation is true even for solving convex optimization problems. It can also be observed that the smaller the batch size, the worse the performance of the solution.
BN fails when the data have large variation, as shown in Figure 3
BN breaks convergence on simple convex logistic regression problems if the variance of the dataset is large. Figure 3 shows the first 20 epochs of such training on a synthesized dataset. This phenomenon also often appears when using distributed training algorithms where each worker only has its local dataset, and the local datasets are very different.

Therefore, these observations may remind people to ask some fundamental questions:

1. Does BN always converge and/or where does it converge to?
2. Using BN to train the model, why does severe over-fitting sometimes happen?
3. Why can BN often accelerate the training process?
4. Is BN a trustable "optimization" algorithm?

Figure 2: (Sensitive to the size of mini-batch) The test accuracy for ResNet18 on the CIFAR10 dataset trained for 10 epochs with batch sizes 1, 4, 16, and 64. With BN layers in ResNet, the smaller the batch size, the worse the convergence result.
In this paper, we aim at understanding the BN algorithm from a rigorous optimization perspective to show:

1. BN always converges, but it solves neither the objective in our common sense, nor the optimization problem that originally motivated it.
2. The result of BN heavily depends on how the batches are constructed, and it can overfit predefined batches. BN treats training and inference differently, which makes the situation worse.
3. BN is not always trustable, especially when the batches are not constructed with randomly selected samples or the batch size is small.

Besides these, we also provide the explicit form of the original objective BN aims to optimize (but does not in fact), and propose a Multilayer Compositional Stochastic Gradient Descent (MCSGD) algorithm based on compositional optimization technology to solve the original objective. We prove the convergence of the proposed MCSGD algorithm and empirically study its effectiveness in refining the state-of-the-art BN algorithm.

Figure 3: (Fail if data have large variation) The training and test accuracy for a simple logistic regression problem on a synthesized dataset with mini-batch size 20. We synthesize 10,000 samples, where each sample is a vector of 1000 elements. There are 1000 classes. Each sample is generated from a zero vector by first randomly assigning a class to the sample (say the $i$-th class), then setting the $i$-th element of the vector to a random number from 0 to 50, and finally adding noise (generated from a standard normal distribution) to each element. We generate 100 test samples in the same way. A logistic regression classifier should be able to classify this dataset easily. However, if we add a BN layer to the classifier, the model no longer converges.

Related work

We first review traditional normalization techniques. LeCun et al. [1998] showed that normalizing the input dataset makes training faster. In deep neural networks, normalization was used before the invention of BN, for example Local Response Normalization (LRN) [Lyu and Simoncelli, 2008, Jarrett et al., 2009, Krizhevsky et al., 2012], which computes the statistics of the neighborhoods around each pixel.

We then review batch normalization techniques. Ioffe and Szegedy [2015] proposed the Batch Normalization (BN) algorithm, which performs normalization along the batch dimension. It is more global than LRN and can be applied in the middle of a neural network. Since during inference there is no "batch", BN uses the running average of the statistics collected during training to perform inference, which introduces unwanted bias between training and testing. Ioffe [2017] proposed Batch Renormalization (BR), which introduces two extra parameters to reduce the drift of the estimation of mean and variance. However, both BN and BR are based on the assumption that the statistics on a mini-batch approximate the statistics on the whole dataset.

Next we review instance-based normalization, which normalizes the input sample-wise instead of using the batch information. This includes Layer Normalization [Ba et al., 2016], Instance Normalization [Ulyanov et al., 2016], Weight Normalization [Salimans and Kingma, 2016], and the recently proposed Group Normalization [Wu and He, 2018]. They all normalize on a single-sample basis and are less global than BN. They avoid computing statistics on a batch, which works well for some vision tasks where the outputs of layers can have hundreds of channels and the statistics of a single sample suffice. However, we still prefer BN when single-sample statistics are not enough and more global statistics are needed to increase accuracy.

Finally we review compositional optimization, on which our refinement of BN is based. Wang and Liu [2016] proposed a compositional optimization algorithm which optimizes the nested expectation problem $\min_x \mathbb{E} f(\mathbb{E} g(x))$, and the convergence rate was later improved in Wang et al. [2016]. The convergence of compositional optimization on nonsmooth regularized problems was shown in Huo et al. [2017], and a variance-reduced variant solving strongly convex problems was analyzed in Lian et al. [2016]. A sketch of the two-level method is given below.
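To ground the compositional setting before we formalize it, here is a minimal sketch of a two-level stochastic compositional gradient method in the spirit of Wang and Liu [2016] on a toy problem (the problem, step sizes, and variable names are illustrative assumptions, not the cited algorithm verbatim). The auxiliary variable `y` tracks the inner expectation by a moving average, the same device our MCSGD algorithm later extends to many layers:

```python
# Minimal sketch of two-level stochastic compositional optimization for
# min_x f(E_eta[g(x; eta)]) on a toy problem; the moving average y estimates
# the inner expectation E[g(x)], as in SCGD-style methods.
import numpy as np

rng = np.random.default_rng(0)

def g(x, eta):            # noisy inner map, E_eta[g(x; eta)] = x
    return x + eta

def grad_g(x, eta):       # Jacobian of g w.r.t. x (identity here)
    return 1.0

def grad_f(y):            # outer function f(y) = 0.5 * y**2
    return y

x, y = 5.0, 0.0
for k in range(1, 2001):
    gamma, alpha = 0.1 / k ** 0.75, 1.0 / k ** 0.5   # step and averaging rates
    eta = rng.normal()
    y = (1 - alpha) * y + alpha * g(x, eta)          # track E[g(x)]
    x -= gamma * grad_g(x, eta) * grad_f(y)          # chain rule with estimate y
print(x)   # approaches the minimizer x* = 0
```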
Review of the BN algorithm

In this section we review the Batch Normalization (BN) algorithm [Ioffe and Szegedy, 2015]. BN is usually implemented as an additional layer in neural networks. In each iteration of the training process, the BN layer normalizes its input using the mean and variance of each channel of the input batch, so that its output has zero mean and unit variance. The mean and variance on the input batch are expected to be similar to the mean and variance over the whole dataset. In the inference process, the layer normalizes each channel of its input using the saved mean and variance, which are the running averages of the mean and variance calculated during training. This is described in Algorithm 1 and Algorithm 2. With BN, the input at the BN layer is normalized so that the next layer in the network accepts inputs that are easier to train on. In practice it has been observed in many applications that the speed of training is improved by using BN.

Algorithm 1
Batch Normalization Layer (Training)
Require:
Input $B_{in}$, which is a batch of input. Estimated mean $\mu$ and variance $\nu$, and averaging constant $\alpha$.
$\mu \leftarrow (1-\alpha)\cdot\mu + \alpha\cdot\mathrm{mean}(B_{in})$
$\nu \leftarrow (1-\alpha)\cdot\nu + \alpha\cdot\mathrm{var}(B_{in})$ ▷ $\mathrm{mean}(B_{in})$ and $\mathrm{var}(B_{in})$ calculate the mean and variance of $B_{in}$ respectively.
Output $B_{out} \leftarrow \dfrac{B_{in} - \mathrm{mean}(B_{in})}{\sqrt{\mathrm{var}(B_{in}) + \epsilon}}$. ▷ $\epsilon$ is a small constant for numerical stability.

Algorithm 2
Batch Normalization Layer (Inference)
Require:
Input $B_{in}$, estimated mean $\mu$ and variance $\nu$. A small constant $\epsilon$ for numerical stability.
Output $B_{out} \leftarrow \dfrac{B_{in} - \mu}{\sqrt{\nu + \epsilon}}$.

In the original paper proposing the BN algorithm [Ioffe and Szegedy, 2015], the authors do not provide the explicit optimization objective that BN targets. Therefore, many people may naturally think that BN is an optimization trick that accelerates the training process while still solving the original objective in our common sense. Unfortunately, this is not true! The actual objective BN solves is different from the objective in our common sense and also nontrivial. A direct transcription of the two procedures is given below.
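For concreteness, here is a direct NumPy transcription of Algorithms 1 and 2 (the $\alpha$ and $\epsilon$ values are illustrative assumptions). Note that the training-time output of a batch generally differs from its inference-time output, which is exactly the train/test gap this paper formalizes:

```python
# Direct transcription of Algorithms 1 and 2 in NumPy (per-channel statistics,
# batch along axis 0); alpha and eps values are illustrative assumptions.
import numpy as np

def bn_train(B_in, mu, nu, alpha=0.1, eps=1e-5):
    """Algorithm 1: normalize by batch statistics, update running estimates."""
    m, v = B_in.mean(axis=0), B_in.var(axis=0)
    mu = (1 - alpha) * mu + alpha * m      # running mean
    nu = (1 - alpha) * nu + alpha * v      # running variance
    return (B_in - m) / np.sqrt(v + eps), mu, nu

def bn_infer(B_in, mu, nu, eps=1e-5):
    """Algorithm 2: normalize by the running estimates saved during training."""
    return (B_in - mu) / np.sqrt(nu + eps)

B = np.array([[0., 0., 0.], [1., 1., 1.], [2., 2., 2.]])
mu, nu = np.zeros(3), np.ones(3)
B_out, mu, nu = bn_train(B, mu, nu)
print(bn_infer(B, mu, nu))   # differs from B_out: train/inference mismatch
```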
Rigorous mathematical description of BN layer
To define the objective of BN in a precise way, we need to define the normalization operator $f^{B,\sigma}_W$ that maps a function to a function, associated with a mini-batch $B$, an activation function $\sigma$, and parameters $W$. (A linear transformation is often added after applying BN to compensate for the normalization.) Let $g(\cdot)$ be a function $g(\cdot) = (g_1(\cdot)\; g_2(\cdot)\; \cdots\; g_n(\cdot)\; 1)^\top$, where $g_1(\cdot), \ldots, g_n(\cdot)$ are functions mapping a vector to a number. The operator $f^{B,\sigma}_W$, whose argument $g$ is a function and whose value $f^{B,\sigma}_W(g)$ is another function, is defined by

$$ f^{B,\sigma}_W(g)(\cdot) := \sigma\left( W \begin{pmatrix} \frac{g_1(\cdot)-\mathrm{mean}(g_1,B)}{\sqrt{\mathrm{var}(g_1,B)}} \\ \frac{g_2(\cdot)-\mathrm{mean}(g_2,B)}{\sqrt{\mathrm{var}(g_2,B)}} \\ \vdots \\ \frac{g_n(\cdot)-\mathrm{mean}(g_n,B)}{\sqrt{\mathrm{var}(g_n,B)}} \\ 1 \end{pmatrix} \right), \qquad (1) $$

where $\mathrm{mean}(r,B)$ is defined by

$$ \mathrm{mean}(r, B) := \frac{1}{|B|}\sum_{b\in B} r(b) $$

and $\mathrm{var}(r,B)$ is defined by

$$ \mathrm{var}(r, B) := \mathrm{mean}(r^2, B) - \mathrm{mean}(r, B)^2. $$

Note that the first argument of $\mathrm{mean}(\cdot,\cdot)$ and $\mathrm{var}(\cdot,\cdot)$ is a function, and the second argument is a set of samples. An $m$-layer neural network with BN can be represented by a function

$$ F^B_{\{W_i\}_{i=1}^m}(\cdot) := f^{B,\sigma_m}_{W_m}\big(f^{B,\sigma_{m-1}}_{W_{m-1}}(\cdots(f^{B,\sigma_1}_{W_1}(I)))\big)(\cdot) \quad\text{or}\quad f^{B,\sigma_m}_{W_m}\circ f^{B,\sigma_{m-1}}_{W_{m-1}}\circ\cdots\circ f^{B,\sigma_1}_{W_1}(I)(\cdot), $$

where $I(\cdot)$ is the identity mapping from a vector to itself, that is, $I(x) = x$.

BN's objective is very different from the objective in our common sense
In the same way, we can represent a fully connected network in our common sense (without BN) by

$$ F_{\{W_i\}_{i=1}^m}(\cdot) := \bar f^{\sigma_m}_{W_m}\circ \bar f^{\sigma_{m-1}}_{W_{m-1}}\circ\cdots\circ \bar f^{\sigma_1}_{W_1}(I)(\cdot), $$

where the operator $\bar f^{\sigma}_W$ is defined by

$$ \bar f^{\sigma}_W(g)(\cdot) := \sigma(Wg(\cdot)). \qquad (2) $$

Besides the difference in the operator definitions, the ultimate objectives are also different. Given the training dataset $\mathcal{D}$, the objective without BN (or the objective in our common sense) is defined by

$$ \min_{\{W_j\}_{j=1}^m} \frac{1}{|\mathcal{D}|}\sum_{(x,y)\in\mathcal{D}} l\big(F_{\{W_j\}_{j=1}^m}(x),\, y\big), \qquad (3) $$

where $l(\cdot,\cdot)$ is a predefined loss function, while the objective with BN (or the objective BN is actually solving) is

$$ (\mathrm{BN})\qquad \min_{\{W_j\}_{j=1}^m} \frac{1}{|\mathcal{B}|}\sum_{B\in\mathcal{B}} \frac{1}{|B|}\sum_{(x,y)\in B} l\big(F^B_{\{W_j\}_{j=1}^m}(x),\, y\big), \qquad (4) $$

where $\mathcal{B}$ is the set of batches. Therefore, the objective of BN could in general be very different from the objective in our common sense.

BN could be very sensitive to the sampling strategy and the mini-batch size
The super set $\mathcal{B}$ has different forms depending on how mini-batches are defined. For example:

• People can choose $b$ as the size of the mini-batch, and form $\mathcal{B}$ by $\mathcal{B} := \{B \subset \mathcal{D} : |B| = b\}$. Choosing $\mathcal{B}$ in this way implicitly assumes that all nodes can access the same dataset, which may not be true in practice.

• If data is distributed ($\mathcal{D}_i$ is the local dataset on the $i$-th worker, disjoint from the others, satisfying $\mathcal{D} = \mathcal{D}_1 \cup \cdots \cup \mathcal{D}_n$), a typical $\mathcal{B}$ is defined by $\mathcal{B} = \mathcal{B}_1 \cup \mathcal{B}_2 \cup \cdots \cup \mathcal{B}_n$ with $\mathcal{B}_i$ defined by $\mathcal{B}_i := \{B \subset \mathcal{D}_i : |B| = b\}$.

• When the mini-batch is chosen to be the whole dataset, $\mathcal{B}$ contains only one element, $\mathcal{D}$.

After figuring out the implicit objective BN optimizes in (4), it is not difficult to make the following key observations:

• The BN objective varies a lot when different sampling strategies are applied. This explains why the convergent solution could be very different when we change the sampling strategy.

• For the same sampling strategy, BN's objective function also varies if the size of the mini-batch changes. This explains why BN could be sensitive to the batch size.

These observations raise a serious question of how to appropriately choose parameters for BN, since it does not optimize the objective in our common sense, and could be sensitive to the sampling strategy and the size of the mini-batch.
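A tiny illustration of this sensitivity (the numbers are arbitrary assumptions): the same sample is normalized to different values depending on which batch it is grouped with, so the inner function in (4), and hence the objective itself, changes with the construction of $\mathcal{B}$:

```python
# The same sample normalized within two different batches gives two different
# outputs, so the "with BN" network is a different function for each batch.
import numpy as np

def bn(batch, eps=1e-5):
    return (batch - batch.mean(0)) / np.sqrt(batch.var(0) + eps)

x = np.array([1.0, 2.0])
batch_a = np.stack([x, np.array([0.0, 0.0]), np.array([1.0, 1.0])])
batch_b = np.stack([x, np.array([9.0, 5.0]), np.array([4.0, 7.0])])

print(bn(batch_a)[0])   # x normalized among small-valued neighbors
print(bn(batch_b)[0])   # the same x, very different output
```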
BN's objective (4) with $\mathcal{B} = \{\mathcal{D}\}$ has the same optimal value as the original objective (3)

When the batch size is equal to the size of the whole dataset in (4), the objective becomes

$$ (\mathrm{FN})\qquad \min_{\{W_j\}_{j=1}^m} \frac{1}{|\mathcal{D}|}\sum_{(x,y)\in\mathcal{D}} l\big(F^{\mathcal{D}}_{\{W_j\}_{j=1}^m}(x),\, y\big), \qquad (5) $$

which differs from the original objective (3) only in the first argument of $l$. Note that the only difference between (1) and (2) when $B$ is the constant $\mathcal{D}$ is a linear transformation, which can be absorbed into $W$:

$$ f^{\mathcal{D},\sigma}_W(g)(\cdot) = \sigma\left( W \begin{pmatrix} \frac{g_1(\cdot)-\mathrm{mean}(g_1,\mathcal{D})}{\sqrt{\mathrm{var}(g_1,\mathcal{D})}} \\ \vdots \\ \frac{g_n(\cdot)-\mathrm{mean}(g_n,\mathcal{D})}{\sqrt{\mathrm{var}(g_n,\mathcal{D})}} \\ 1 \end{pmatrix} \right) = \sigma\left( \underbrace{W \begin{pmatrix} \frac{1}{\sqrt{\mathrm{var}(g_1,\mathcal{D})}} & & & -\frac{\mathrm{mean}(g_1,\mathcal{D})}{\sqrt{\mathrm{var}(g_1,\mathcal{D})}} \\ & \ddots & & \vdots \\ & & \frac{1}{\sqrt{\mathrm{var}(g_n,\mathcal{D})}} & -\frac{\mathrm{mean}(g_n,\mathcal{D})}{\sqrt{\mathrm{var}(g_n,\mathcal{D})}} \\ & & & 1 \end{pmatrix}}_{=:\,W'} \begin{pmatrix} g_1(\cdot) \\ \vdots \\ g_n(\cdot) \\ 1 \end{pmatrix} \right) = \sigma\Big( W' \big(g_1(\cdot)\; \cdots\; g_n(\cdot)\; 1\big)^\top \Big). $$

If we use this $W'$ as our new $W$, (3) and (5) have the same form, and the two objectives have the same optimal value, which means that with $\mathcal{B} = \{\mathcal{D}\}$ adding BN does not hurt the expressiveness of the network. However, since each layer's input has been normalized, the condition number of the objective is in general reduced, and the objective is thus easier to train. A numerical check of this absorption argument is given below.

Solving the FN formulation (5) via compositional optimization

The BN objective contains the normalization operator, which mostly reduces the condition number of the objective and can thereby accelerate the training process. While the FN formulation in (5) provides a more stable and trustable way to define the BN formulation, it is very challenging to solve in practice. If we follow the standard SGD used for solving BN to solve (5), then every single iteration needs to involve the whole dataset, since $B = \mathcal{D}$, which is almost impossible in practice. The key difficulty in solving (5) lies in the fact that there is an "expectation" in each layer, which is very expensive to compute even if we just need a stochastic gradient in the standard SGD method.

To solve (5) efficiently, we follow the spirit of compositional optimization to develop a new algorithm, namely Multilayer Compositional Stochastic Gradient Descent (Algorithm 3). The proposed algorithm does not have any requirement on the size of the mini-batch. The key idea is to estimate the per-layer expectations from the current samples and the historical record.
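Before developing the algorithm, here is a small numerical sanity check of the absorption argument above (a sketch with arbitrary data; the helper names are ours): normalizing by fixed dataset statistics is an affine map, so it can be folded into the weight matrix as $W'$.

```python
# Numerically verify that normalizing by dataset statistics is a linear map
# that can be absorbed into the weight matrix (the W' construction above).
import numpy as np

rng = np.random.default_rng(1)
G = rng.normal(size=(100, 3))               # rows: g(x) for samples x in D
W = rng.normal(size=(2, 4))                 # last column multiplies the constant 1

mu, sd = G.mean(0), G.std(0)
A = np.zeros((4, 4))                        # the affine matrix defining W'
A[:3, :3] = np.diag(1.0 / sd)
A[:3, 3] = -mu / sd
A[3, 3] = 1.0
W_prime = W @ A

G1 = np.hstack([G, np.ones((100, 1))])      # augmented inputs (g(x), 1)
normalized = np.hstack([(G - mu) / sd, np.ones((100, 1))])
assert np.allclose(normalized @ W.T, G1 @ W_prime.T)   # same pre-activations
```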
To propose our solution to (5), let us define the objective in a more general but neater way. When $\mathcal{B} = \{\mathcal{D}\}$, (4) is a special case of the following general objective:

$$ \min_w f(w) := \mathbb{E}_\xi\left[ F^{w_1,\mathcal{D}}_1 \circ F^{w_2,\mathcal{D}}_2 \circ \cdots \circ F^{w_n,\mathcal{D}}_n \circ I(\xi) \right], \qquad (6) $$

where $\xi$ represents a sample in the dataset $\mathcal{D}$ (for example, it can be the pair $(x, y)$ in (4)), $w_i$ represents the parameters of the $i$-th layer, and $w$ represents all parameters of the layers: $w := (w_1, \ldots, w_n)$. Each $F_i$ is an operator of the form

$$ F^{w_i,\mathcal{D}}_i(g)(\cdot) := F_i\big(w_i;\, g(\cdot);\, \mathbb{E}_{\xi\in\mathcal{D}}\, e_i(g(\xi))\big), $$

where $e_i$ is used to compute statistics over the dataset. For example, with $e_i(x) = [x, x^2]$, the mean and second moment of the layer's input over the dataset can be calculated, and $F^{w_i,\mathcal{D}}_i$ can use that information, for example, to perform normalization.

There exist compositional optimization algorithms [Wang and Liu, 2016, Wang et al., 2016] for solving (6) when $n = 2$, but for $n > 2$ we still do not have a good algorithm. We follow the same spirit to extend the compositional optimization algorithms to the general optimization problem (6), as shown in Algorithm 3. See Section 6 for an implementation in the normalization setting.

To simplify notation, given $w = (w_1, \ldots, w_n)$ we define

$$ e_i(w; \xi) := e_i\big( F^{w_{i+1},\mathcal{D}}_{i+1} \circ F^{w_{i+2},\mathcal{D}}_{i+2} \circ \cdots \circ F^{w_n,\mathcal{D}}_n \circ I(\xi) \big). $$

We define the following operator if we already have an estimation, say $\hat e_i$, of $\mathbb{E}_{\xi\in\mathcal{D}}\, e_i(w;\xi)$:

$$ \hat F^{w_i,\hat e_i}_i(g)(\cdot) := F_i\big(w_i;\, g(\cdot);\, \hat e_i\big). $$

Given $w = (w_1, \ldots, w_n)$ and $\hat e = (\hat e_1, \ldots, \hat e_n)$, we define

$$ \hat e_i(w; \xi; \hat e) := e_i\big( \hat F^{w_{i+1},\hat e_{i+1}}_{i+1} \circ \hat F^{w_{i+2},\hat e_{i+2}}_{i+2} \circ \cdots \circ \hat F^{w_n,\hat e_n}_n \circ I(\xi) \big). $$
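To see that the FN formulation is an instance of (6), here is a worked specialization in the notation above (the affine scale-and-shift of practical BN is omitted; $W_i$ denotes the weight part of $w_i$):

```latex
% FN as a special case of (6): choose e_i(x) = [x, x^2], so that the dataset
% statistic is the pair of per-channel first and second moments,
%   E_{xi in D} e_i(g(xi)) = [mu_i, s_i],
% and let the layer operator normalize with these moments before the usual
% linear map and activation (note var = s_i - mu_i^2):
\[
  F_i\big(w_i;\, x;\, [\mu_i, s_i]\big)
    := \sigma_i\!\left( W_i\, \frac{x - \mu_i}{\sqrt{s_i - \mu_i^2}} \right).
\]
```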
Algorithm 3
Multilayer Compositional Stochastic Gradient Descent (MCSGD) algorithm
Require:
Learning rate $\{\gamma_k\}_{k=0}^K$, approximation rate $\{\alpha_k\}_{k=0}^K$, dataset $\mathcal{D}$, initial point $w^{(0)} := (w^{(0)}_1, \ldots, w^{(0)}_n)$, and initial estimations $\hat e^{(0)} := (\hat e^{(0)}_1, \ldots, \hat e^{(0)}_n)$.
for $k = 0, 1, 2, \ldots, K$ do
  For each $i$, randomly select a sample $\xi_k$ and estimate $\mathbb{E}_\xi[e_i(w^{(k)};\xi)]$ by
  $$ \hat e^{(k+1)}_i \leftarrow (1-\alpha_k)\,\hat e^{(k)}_i + \alpha_k\, \hat e_i(w^{(k)}; \xi_k; \hat e^{(k)}). $$
  Ask the oracle $\mathcal{O}$, using $w^{(k)}$ and the estimated statistics $\hat e^{(k+1)}$, for the approximated gradient at $w^{(k)}$: $g^{(k)}$. ▷ See Remark 1 for discussion on the oracle.
  $w^{(k+1)} \leftarrow w^{(k)} - \gamma_k g^{(k)}$.
end for

Remark 1 (MCSGD oracle). In MCSGD, the gradient oracle takes the current parameters of the model and the current estimation of each $\mathbb{E}_\xi[e_i(w^{(k)};\xi)]$, and outputs an estimation of a stochastic gradient. For an objective like (6), the derivative of the loss function w.r.t. the $i$-th layer's parameters is

$$ \partial_{w_i} f(w) = \partial_{w_i}\Big( \mathbb{E}_\xi\big[ F^{w_1,\mathcal{D}}_1 \circ F^{w_2,\mathcal{D}}_2 \circ \cdots \circ F^{w_n,\mathcal{D}}_n \circ I(\xi) \big] \Big) = \mathbb{E}_\xi\Big[ \big[\partial_x (F^{w_1,\mathcal{D}}_1(x)(\xi))\big]_{x = F^{w_2,\mathcal{D}}_2 \circ\cdots\circ F^{w_n,\mathcal{D}}_n \circ I} \cdot \big[\partial_x (F^{w_2,\mathcal{D}}_2(x)(\xi))\big]_{x = F^{w_3,\mathcal{D}}_3 \circ\cdots\circ F^{w_n,\mathcal{D}}_n \circ I} \cdots \big[\partial_x (F^{w_{i-1},\mathcal{D}}_{i-1}(x)(\xi))\big]_{x = F^{w_i,\mathcal{D}}_i \circ\cdots\circ F^{w_n,\mathcal{D}}_n \circ I} \cdot \big[\partial_{w_i} (F^{w_i,\mathcal{D}}_i(x)(\xi))\big]_{x = F^{w_{i+1},\mathcal{D}}_{i+1} \circ\cdots\circ F^{w_n,\mathcal{D}}_n \circ I} \Big], \qquad (7) $$

where for any $j \in [1, \ldots, i-1]$:

$$ \big[\partial_x F^{w_j,\mathcal{D}}_j(x)(\xi)\big]_{x = F^{w_{j+1},\mathcal{D}}_{j+1} \circ\cdots\circ F^{w_n,\mathcal{D}}_n \circ I} = \Big[ \partial_x F_j(w_j; x; y) + \partial_y F_j(w_j; x; y) \cdot \mathbb{E}_{\xi'\sim\mathcal{D}}\, D_j(\xi') \Big] $$

with

$$ x = F^{w_{j+1},\mathcal{D}}_{j+1}\circ\cdots\circ F^{w_n,\mathcal{D}}_n\circ I(\xi), \qquad y = \mathbb{E}_\xi[e_j(w;\xi)], \qquad D_j(\xi') = \big[\partial_z e_j(z)\big]_{z = F^{w_{j+1},\mathcal{D}}_{j+1}\circ\cdots\circ F^{w_n,\mathcal{D}}_n\circ I(\xi')}. $$

The oracle samples a $\xi'$ for each layer and calculates this derivative with $y$ set to an estimated value $\hat e_j$, and returns the stochastic gradient. The expectation of this stochastic gradient equals (7) up to an error caused by the difference between the estimated $\hat e_j$ and the true expectation $\mathbb{E}_\xi[e_j(w;\xi)]$. The more accurate the estimation is, the closer the expectation of this stochastic gradient and the true gradient (7) will be. In practice, we often use the same $\xi'$ for each layer to save computation; this introduces some bias into the expectation of the estimated stochastic gradient. One such implementation can be found in Algorithm 4, and a minimal sketch of the update loop follows.
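Below is a minimal sketch of the MCSGD loop on a toy two-layer model (the model, statistics, and step-size choices are illustrative assumptions, not the paper's experimental setup). Step 2 refreshes the per-layer statistics estimates with a moving average, and Step 3 queries the oracle by differentiating the network with the statistics frozen at their current estimates:

```python
# Sketch of MCSGD (Algorithm 3) on a toy two-layer model with one "mean"
# statistic per layer; names and the toy model are illustrative assumptions.
import torch

torch.manual_seed(0)
data = torch.randn(512, 4) * 3 + 1          # the dataset D
w1 = torch.randn(4, 4, requires_grad=True)
w2 = torch.randn(4, 4, requires_grad=True)
e1_hat = torch.zeros(4)                     # estimate of layer-1 input mean
e2_hat = torch.zeros(4)                     # estimate of layer-2 input mean

for k in range(2000):
    gamma, alpha = 0.05 / (k + 2) ** 0.75, 1.0 / (k + 1) ** 0.5
    xi = data[torch.randint(len(data), (1,))]

    # Step 2: refresh statistics estimates with the current sample, using the
    # previous estimates to evaluate the deeper layers (the hat-e_i update).
    with torch.no_grad():
        h2 = xi                              # input of layer 2
        e2_hat = (1 - alpha) * e2_hat + alpha * h2.squeeze(0)
        h1 = torch.relu((h2 - e2_hat) @ w2)  # input of layer 1 under hat-e
        e1_hat = (1 - alpha) * e1_hat + alpha * h1.squeeze(0)

    # Step 3: oracle gradient; forward with statistics frozen at hat-e.
    out = (torch.relu((xi - e2_hat) @ w2) - e1_hat) @ w1
    loss = (out ** 2).mean()                 # toy objective
    loss.backward()
    with torch.no_grad():
        w1 -= gamma * w1.grad; w2 -= gamma * w2.grad
        w1.grad = None; w2.grad = None
```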
Theoretical analysis

In this section we analyze Algorithm 3 to show its convergence when applied to (6) (a generic version of the FN formulation in (5)). The detailed proof is provided in the supplementary material. We use the following assumptions, stated in Assumption 1, for the analysis.

Assumption 1.
1. The gradients $g^{(k)}$ are bounded: $\|g^{(k)}\| \leq \mathbb{G},\ \forall k$, for some constant $\mathbb{G}$.

2. The variances of all $e_i$'s are bounded: $\mathbb{E}_\xi \|e_i(w;\xi) - \mathbb{E}_\xi e_i(w;\xi)\|^2 \leq \sigma^2,\ \forall i, w$, and $\mathbb{E}_\xi \|\hat e_i(w;\xi;\hat e) - \mathbb{E}_\xi \hat e_i(w;\xi;\hat e)\|^2 \leq \sigma^2,\ \forall i, w, \hat e$, for some constant $\sigma$.

3. The error of the approximated gradient $\mathbb{E}[g^{(k)}]$ is proportional to the error of the approximation of the $\mathbb{E}[e_i]$'s: $\|\mathbb{E}_{\xi_k}[g^{(k)}] - \nabla f(w^{(k)})\| \leq L_g \sum_{i=1}^n \|\hat e^{(k+1)}_i - \mathbb{E}_{\xi_k}[e_i(w^{(k)};\xi_k)]\|,\ \forall k$, for some constant $L_g$.

4. All functions and their first-order derivatives are Lipschitzian with Lipschitz constant $L$.

5. The minimum of the objective $f(w)$ is finite.

6. $\gamma_k, \alpha_k$ are monotonically decreasing, with $\gamma_k = O(k^{-\gamma})$ and $\alpha_k = O(k^{-a})$ for some constants $\gamma > a > 0$.

It can be shown that under these assumptions the approximation errors vanish, as stated in Lemma 1.
Lemma 1 (Approximation error). Choose the learning rate $\gamma_k$ and $\alpha_k$ in Algorithm 3 in the form defined in Assumption 1-6 with parameters $\gamma$ and $a$. Under Assumption 1, the sequence generated by Algorithm 3 satisfies

$$ \mathbb{E}\|\hat e^{(k+1)}_i - \mathbb{E}_\xi e_i(w^{(k)};\xi)\|^2 \leq \mathfrak{E}\,(k^{-\gamma+2a+\varepsilon} + k^{-a+\varepsilon}), \quad \forall i, \forall k, \qquad (8) $$

$$ \mathbb{E}\|\hat e^{(k+1)}_i - \mathbb{E}_\xi e_i(w^{(k)};\xi)\|^2 \leq \Big(1 - \frac{\alpha_k}{2}\Big)\,\mathbb{E}\|\hat e^{(k)}_i - \mathbb{E}_\xi e_i(w^{(k-1)};\xi)\|^2 + \mathfrak{C}\,(k^{-\gamma+a+\varepsilon} + k^{-2a+\varepsilon}), \quad \forall i, \forall k, \qquad (9) $$

for any $\varepsilon$ satisfying $1 - a > \varepsilon > 0$, where $\mathfrak{E}$ and $\mathfrak{C}$ are two constants independent of $k$.

It can then be shown that Algorithm 3 converges on (6), as seen in Theorem 2 and Corollary 3. It is worth noting that the convergence rate in Corollary 3 is slower than the $1/\sqrt{K}$ convergence rate of SGD without any normalization. This is due to the estimation error. If the estimation error is small (for example, when the samples in a batch are randomly selected and the batch size is large), the convergence will be fast.

Theorem 2 (Convergence). Choose the learning rate $\gamma_k$ and $\alpha_k$ in Algorithm 3 in the form defined in Assumption 1-6 with parameters $\gamma$ and $a$ satisfying $a < \gamma - \frac{1}{2}$, $a < \frac{1}{2}$, and $\frac{\gamma_k L_g}{\alpha_{k+1}} \leq \frac{1}{4}$. Under Assumption 1, for any integer $K > 0$ the sequence generated by Algorithm 3 satisfies

$$ \frac{\sum_{k=0}^K \gamma_k\, \mathbb{E}\|\partial f(w^{(k)})\|^2}{\sum_{k=0}^K \gamma_k} \leq \frac{\mathfrak{H}}{\sum_{k=0}^K \gamma_k}, $$

where $\mathfrak{H}$ is a constant independent of $K$.

We next specify the choice of $\gamma$ and $a$ to give a more explicit convergence rate for our MCSGD algorithm in Corollary 3.

Figure 4: The training and testing error for the given model trained on MNIST with batch size 64, learning rate 0.01 and momentum 0.5, using BN or FN, for the case where all samples in a batch are of a single label. The approximation rate $\alpha_k$ in FN is $(k+1)^{-0.5}$, where $k$ is the iteration number.

Corollary 3.
Choose the learning rate $\gamma_k$ and $\alpha_k$ in Algorithm 3 in the form defined in Assumption 1-6; more specifically, $\gamma_k = \frac{1}{4L_g}(k+2)^{-3/4}$ and $\alpha_k = (k+1)^{-1/5}$. Under Assumption 1, for any integer $K > 0$ the sequence generated by Algorithm 3 satisfies

$$ \frac{\sum_{k=0}^K \mathbb{E}\|\partial f(w^{(k)})\|^2}{K+2} \leq \frac{\mathfrak{H}}{(K+2)^{1/4}}, $$

where $\mathfrak{H}$ is a constant independent of $K$.

Experiments

Experiments are conducted to validate the effectiveness of our MCSGD algorithm for solving the FN formulation. We consider two settings. The first one, in Section 6.1, shows that BN's convergence highly depends on the size of the batches, while FN is more robust to different batch sizes. The second one, in Section 6.2, shows that FN is more robust to different constructions of the mini-batches.

We use a simple neural network as the testing network, whose architecture is shown in Figure 6. Steps 2 and 3 in Algorithm 3 are implemented in Algorithm 4. The forward pass essentially performs Step 2 of Algorithm 3, which estimates the mean and variance of layer inputs over the dataset. The backward pass essentially performs Step 3 of Algorithm 3, which gives the approximated gradient based on the current network parameters and the estimations. Note that, as discussed in Remark 1, for efficiency, in each iteration of this implementation we use the same samples to do the estimation in all normalization layers, which saves computational cost but brings additional bias.

Figure 5: The training and testing error for the given model trained on CIFAR10 with batch size 64, learning rate 0.01 and momentum 0.9, using BN and FN, for the case where all samples in a batch are of no more than 3 labels. The learning rate is decreased by a factor of 5 every 20 epochs. The approximation rate $\alpha_k$ in FN is $(k+1)^{-0.5}$, where $k$ is the iteration number.

Figure 6: The simple neural network used in the experiments. The "Normalization" between layers can be BN or FN. The activation function is relu and the loss function is negative log likelihood.

Figure 7: The training and testing error for the given model trained on MNIST with batch size 64, learning rate 0.01 and momentum 0.5, using BN or FN, for the case where samples in a batch are randomly selected. The approximation rate $\alpha_k$ in FN is $(k+1)^{-0.5}$, where $k$ is the iteration number.

Figure 8: The training and testing error for the given model trained on CIFAR10 with batch size 64, learning rate 0.01 and momentum 0.9, using BN or FN, for the case where samples in a batch are randomly selected. The learning rate is decreased by a factor of 5 every 20 epochs. The approximation rate $\alpha_k$ in FN is $(k+1)^{-0.5}$, where $k$ is the iteration number.

Figure 9: The testing error for the given model trained on MNIST with (a) BN layers and (b) FN layers, with batch size 1 and batch size 16, where each sample is multiplied by a random number in $(-0.5, 0.5)$. The optimization algorithm is SGD with learning rate 0.01 and momentum 0.5. FN gives more consistent results under different batch sizes. The approximation rate $\alpha_k$ in FN is $(k+1)^{-0.5}$, where $k$ is the iteration number.

Robustness to the mini-batch size

MNIST is used as the testing dataset, with some modification: each sample is multiplied by a random number in $(-0.5, 0.5)$ to make the samples more diverse. In this case, as shown in Figure 9, with batch size 1 and batch size 16, the convergent points are different for BN, while the convergent results of FN are much closer for the different batch sizes. Therefore, FN is more robust to the size of the mini-batch.

Algorithm 4 Full Normalization (FN) layer
Forward pass
Require:
Learning rate $\gamma$, approximation rate $\alpha$, input $B_{in} \in \mathbb{R}^{b \times d}$, mean estimation $\mu$, and mean-of-square estimation $\nu$. ▷ $b$ is the batch size, and $d$ is the dimension of each sample.
if training then
  $\mu \leftarrow (1-\alpha)\mu + \alpha\,\mathrm{mean}(B_{in})$
  $\nu \leftarrow (1-\alpha)\nu + \alpha\,\mathrm{mean\_of\_square}(B_{in})$
end if
return layer output: $B_{out} \leftarrow \dfrac{B_{in} - \mu}{\sqrt{\max\{\nu - \mu^2, \epsilon\}}}$. ▷ $\epsilon$ is a small constant for numerical stability.

Backward pass
Require:
Mean estimation $\mu$, mean-of-square estimation $\nu$, the gradient at the output $g_{out}$, and input $B_{in} \in \mathbb{R}^{b \times d}$.
Define $f(\mu, \nu, B_{in}) := \dfrac{B_{in} - \mu}{\sqrt{\max\{\nu - \mu^2, \epsilon\}}}$.
return the gradient at the input: $g_{in} \leftarrow g_{out} \cdot \left( \partial_{B_{in}} f + \dfrac{\partial_\mu f + 2 B_{in}\, \partial_\nu f}{bd} \right)$. ▷ $\partial_a f$ denotes the derivative of $f$ w.r.t. $a$ evaluated at $(\mu, \nu, B_{in})$. A sketch implementation of this layer is given below.
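Here is a sketch of this layer as a PyTorch autograd function, a minimal transcription of Algorithm 4 under stated assumptions (scalar running statistics, illustrative $\alpha$ and $\epsilon$); the statistics update of the forward pass lives outside the function, and the backward pass implements the $g_{in}$ formula above:

```python
import torch

class FNFunction(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, mu, nu):
        eps = 1e-5
        denom = torch.sqrt(torch.clamp(nu - mu ** 2, min=eps))
        ctx.save_for_backward(x, mu, denom)
        return (x - mu) / denom

    @staticmethod
    def backward(ctx, g_out):
        x, mu, denom = ctx.saved_tensors
        b, d = x.shape
        df_dx = 1.0 / denom                                  # df/dB_in
        df_dmu = -1.0 / denom + (x - mu) * mu / denom ** 3   # df/dmu
        df_dnu = -0.5 * (x - mu) / denom ** 3                # df/dnu
        # g_in <- g_out * (df/dB_in + (df/dmu + 2 B_in df/dnu) / (bd))
        g_in = g_out * (df_dx + (df_dmu + 2 * x * df_dnu) / (b * d))
        return g_in, None, None

# Forward pass of Algorithm 4: refresh the running statistics, then normalize.
mu, nu, alpha = torch.zeros(()), torch.ones(()), 0.1
x = torch.randn(8, 4, requires_grad=True)
with torch.no_grad():
    mu = (1 - alpha) * mu + alpha * x.mean()
    nu = (1 - alpha) * nu + alpha * (x ** 2).mean()
FNFunction.apply(x, mu, nu).sum().backward()                 # fills x.grad
```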
Robustness to the construction of mini-batches

We study two cases: the shuffled case and the unshuffled case. In the shuffled case, the samples in a batch are randomly sampled, and we expect the performance of FN to match BN's. In the unshuffled case, a batch contains samples from only a few categories, so the mean and variance are very different from batch to batch; in this case we observe that FN outperforms BN.

Unshuffled case
We show that FN has advantages over BN when the data variation is large among mini-batches. In the unshuffled setting, we do not shuffle the dataset, which means that the samples in the same mini-batch mostly have the same labels. The comparison uses two datasets (MNIST and CIFAR10), with the batch construction sketched after the list:

• On MNIST, the batch size is chosen to be 64 and each batch only contains a single label. The convergence results are shown in Figure 4. We can observe from these figures that for the training loss, BN and FN converge equally fast (BN might be slightly faster). However, since the statistics on any batch are very different from those of the whole dataset, the estimated mean and variance in BN are very different from the true mean and variance on the whole dataset, resulting in a very high test loss for BN.

• On CIFAR10, we observe similar results, as shown in Figure 5. In this case, we restrict the number of labels in every batch to be no more than 3; thus BN's performance on CIFAR10 is slightly better than on MNIST. We see that the convergence efficiency of both methods is still comparable in terms of the training loss. However, the testing error for BN is still far behind FN.
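For reference, here is a minimal sketch of how such label-homogeneous batches can be constructed (a torchvision-style MNIST loader is assumed):

```python
# Build batches that each contain a single label by sorting the sample
# indices by label before chunking; random shuffling gives the "shuffled" case.
import torch
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

train = datasets.MNIST("data", train=True, download=True,
                       transform=transforms.ToTensor())
order = torch.argsort(train.targets)            # group indices label by label
unshuffled_batches = torch.split(order, 64)     # 64 samples, (almost) one label each

loader = DataLoader(train, batch_sampler=[b.tolist() for b in unshuffled_batches])
```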
Shuffled case

For the shuffled case, where the samples in each batch are selected randomly, we expect BN to perform similarly to FN, since the statistics on a batch are close to the statistics on the whole dataset. The results for MNIST are shown in Figure 7 and the results for CIFAR10 in Figure 8. We observe that the convergence curves of BN and FN match well for both training and testing loss.
Conclusion

We provide a new understanding of BN from an optimization perspective by rigorously defining the optimization objective for BN. BN essentially optimizes an objective different from the one in our common sense. The objective implicitly targeted by BN depends on the sampling strategy as well as the mini-batch size, which explains why BN becomes unstable and sensitive in some scenarios. The most stable objective for BN (called the FN formulation) uses the full dataset as the mini-batch, since it is equivalent to the objective in our common sense, but such a formulation is very challenging to solve. To solve the FN objective, we follow the spirit of compositional optimization and develop the MCSGD algorithm to solve it efficiently. Experiments are also conducted to validate the proposed method.
References
M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, S. Ghemawat, I. Goodfellow, A. Harp, G. Irving, M. Isard, Y. Jia, R. Jozefowicz, L. Kaiser, M. Kudlur, J. Levenberg, D. Mané, R. Monga, S. Moore, D. Murray, C. Olah, M. Schuster, J. Shlens, B. Steiner, I. Sutskever, K. Talwar, P. Tucker, V. Vanhoucke, V. Vasudevan, F. Viégas, O. Vinyals, P. Warden, M. Wattenberg, M. Wicke, Y. Yu, and X. Zheng. TensorFlow: Large-scale machine learning on heterogeneous systems, 2015. Software available from tensorflow.org.

J. L. Ba, J. R. Kiros, and G. E. Hinton. Layer normalization. arXiv preprint arXiv:1607.06450, 2016.

T. Chen, M. Li, Y. Li, M. Lin, N. Wang, M. Wang, T. Xiao, B. Xu, C. Zhang, and Z. Zhang. MXNet: A flexible and efficient machine learning library for heterogeneous distributed systems. arXiv preprint arXiv:1512.01274, 2015.

T. Cooijmans, N. Ballas, C. Laurent, Ç. Gülçehre, and A. Courville. Recurrent batch normalization. arXiv preprint arXiv:1603.09025, 2016.

K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.

G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger. Densely connected convolutional networks. In CVPR, volume 1, page 3, 2017.

I. Hubara, M. Courbariaux, D. Soudry, R. El-Yaniv, and Y. Bengio. Quantized neural networks: Training neural networks with low precision weights and activations. Journal of Machine Learning Research, 18(187):1–30, 2017.

Z. Huo, B. Gu, and H. Huang. Accelerated method for stochastic composition optimization with nonsmooth regularization. arXiv preprint arXiv:1711.03937, 2017.

S. Ioffe. Batch renormalization: Towards reducing minibatch dependence in batch-normalized models. In Advances in Neural Information Processing Systems, pages 1942–1950, 2017.

S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015.

K. Jarrett, K. Kavukcuoglu, M. Ranzato, and Y. LeCun. What is the best multi-stage architecture for object recognition? In Proceedings of the 2009 IEEE 12th International Conference on Computer Vision, Kyoto, Japan, pages 2146–2153, 2009.

A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097–1105, 2012.

Y. LeCun, L. Bottou, G. B. Orr, and K.-R. Müller. Efficient backprop. In Neural Networks: Tricks of the Trade, pages 9–50. Springer, 1998.

X. Lian, M. Wang, and J. Liu. Finite-sum composition optimization via variance reduced gradient descent. arXiv preprint arXiv:1610.04674, 2016.

S. Lyu and E. P. Simoncelli. Nonlinear image representation using divisive normalization. In Computer Vision and Pattern Recognition, 2008. CVPR 2008. IEEE Conference on, pages 1–8. IEEE, 2008.

A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer. Automatic differentiation in PyTorch. In NIPS-W, 2017.

T. Salimans and D. P. Kingma. Weight normalization: A simple reparameterization to accelerate training of deep neural networks. In Advances in Neural Information Processing Systems, pages 901–909, 2016.

D. Silver, J. Schrittwieser, K. Simonyan, I. Antonoglou, A. Huang, A. Guez, T. Hubert, L. Baker, M. Lai, A. Bolton, et al. Mastering the game of go without human knowledge. Nature, 550(7676):354, 2017.

C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna. Rethinking the inception architecture for computer vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2818–2826, 2016.

C. Szegedy, S. Ioffe, V. Vanhoucke, and A. A. Alemi. Inception-v4, Inception-ResNet and the impact of residual connections on learning. In AAAI, volume 4, page 12, 2017.

D. Ulyanov, A. Vedaldi, and V. Lempitsky. Instance normalization: The missing ingredient for fast stylization. arXiv preprint arXiv:1607.08022, 2016.

M. Wang and J. Liu. A stochastic compositional gradient method using Markov samples. In Proceedings of the 2016 Winter Simulation Conference, pages 702–713. IEEE Press, 2016.

M. Wang, J. Liu, and E. Fang. Accelerating stochastic composition optimization. In Advances in Neural Information Processing Systems, pages 1714–1722, 2016.

Y. Wu and K. He. Group normalization. arXiv preprint arXiv:1803.08494, 2018.

S. Xie, R. Girshick, P. Dollár, Z. Tu, and K. He. Aggregated residual transformations for deep neural networks. In Computer Vision and Pattern Recognition (CVPR), 2017 IEEE Conference on, pages 5987–5995. IEEE, 2017.

Supplementary Materials

A Proofs
Lemma 4. If

$$ r_{k+1} \leq (1 - C_1 k^{-a})\, r_k + C_2\, (k^{-2a+\varepsilon} + k^{-\gamma+a+\varepsilon}), \qquad (10) $$

for constants $1 > C_1 > 0$, $C_2 > 0$, $1 > a > 0$, $1 - a > \varepsilon > 0$, then there exists a constant $\mathfrak{E}$ such that

$$ r_{k+1} \leq (k^{-\gamma+2a+\varepsilon} + k^{-a+\varepsilon})\, \mathfrak{E}. $$

Proof to Lemma 4.
First we expand the recursion relation (10) down to $k = 1$:

$$ r_{k+1} \leq (1 - C_1 k^{-a})\, r_k + C_2\, (k^{-2a+\varepsilon} + k^{-\gamma+a+\varepsilon}) \leq \left(\prod_{\kappa=1}^{k} (1 - C_1 \kappa^{-a})\right) r_1 + C_2 \left( \sum_{\kappa=1}^{k} (\kappa^{-2a+\varepsilon} + \kappa^{-\gamma+a+\varepsilon}) \prod_{\varkappa=\kappa+1}^{k} (1 - C_1 \varkappa^{-a}) \right). \qquad (11) $$

As the first step to bound (11), we use the monotonicity of the $\ln(\cdot)$ function to bound products of the form $\prod_{\varkappa=m}^{k} (1 - C_1 \varkappa^{-a})$ for some $m$. Notice that

$$ \ln\left( \prod_{\varkappa=m}^{k} (1 - C_1 \varkappa^{-a}) \right) = \sum_{\varkappa=m}^{k} \ln(1 - C_1 \varkappa^{-a}) \leq -C_1 \sum_{\varkappa=m}^{k} \varkappa^{-a}. $$

Then it follows from the monotonicity of $\ln(\cdot)$ that

$$ \prod_{\varkappa=m}^{k} (1 - C_1 \varkappa^{-a}) \leq \exp\left( -C_1 \sum_{\varkappa=m}^{k} \varkappa^{-a} \right) \leq \exp\big(-C_1 (k-m+1)\, k^{-a}\big). \qquad (12) $$

Using this inequality, (11) can be bounded as follows:

$$ r_{k+1} \overset{(12)}{\leq} \exp(-C_1 k^{1-a})\, r_1 + \underbrace{C_2 \sum_{\kappa=1}^{k} (\kappa^{-2a+\varepsilon} + \kappa^{-\gamma+a+\varepsilon}) \exp\big(-C_1 (k-\kappa)\, k^{-a}\big)}_{=:\,T_k}. \qquad (13) $$

The final step is bounding $T_k$, which can be bounded by a power of a reciprocal function. For any $1 - a > \varepsilon > 0$ we have, splitting the sum at $\kappa = k - \lfloor k^{a+\varepsilon} \rfloor$,

$$ T_k \leq C_2 \sum_{\kappa=1}^{k - \lfloor k^{a+\varepsilon} \rfloor} (\kappa^{-2a+\varepsilon} + \kappa^{-\gamma+a+\varepsilon}) \exp\big(-C_1 \lfloor k^{a+\varepsilon} \rfloor k^{-a}\big) + C_2 \sum_{\kappa = k - \lfloor k^{a+\varepsilon} \rfloor + 1}^{k} (\kappa^{-2a+\varepsilon} + \kappa^{-\gamma+a+\varepsilon}) = O\big(\exp(-C_1 \lfloor k^{a+\varepsilon} \rfloor k^{-a})\big) + O\big(k^{a+\varepsilon} (k^{-2a+\varepsilon} + k^{-\gamma+a+\varepsilon})\big) = O(k^{-\gamma+2a+2\varepsilon} + k^{-a+2\varepsilon}), $$

where the $O(\cdot)$ notation hides constant factors; since $\varepsilon$ is arbitrary in $(0, 1-a)$, it immediately follows from (13) that there exists a constant $\mathfrak{E}$ such that for any $1 - a > \varepsilon > 0$,

$$ r_{k+1} \leq (k^{-\gamma+2a+\varepsilon} + k^{-a+\varepsilon})\, \mathfrak{E}. $$

Proof to Lemma 1.
First note that for any $\alpha_k \in [0,1]$, the following identity holds for the difference between the estimation and the true expectation; the first step comes from the update rule in Algorithm 3:

$$ \hat e^{(k+1)}_n - \mathbb{E}_\xi e_n(x^{(k)};\xi) = (1-\alpha_k)\,\hat e^{(k)}_n + \alpha_k\, \hat e_n(x^{(k)};\xi_k;\hat e^{(k)}) - \mathbb{E}_\xi e_n(x^{(k)};\xi) = (1-\alpha_k)\big(\hat e^{(k)}_n - \mathbb{E}_\xi e_n(x^{(k-1)};\xi)\big) + \alpha_k\big(\hat e_n(x^{(k)};\xi_k;\hat e^{(k)}) - \mathbb{E}_\xi e_n(x^{(k)};\xi)\big) - (1-\alpha_k)\big(\mathbb{E}_\xi e_n(x^{(k)};\xi) - \mathbb{E}_\xi e_n(x^{(k-1)};\xi)\big). \qquad (14) $$

Then take the squared $\ell_2$ norm on both sides. Because $n$ is the last layer's index, $e_n(x^{(k)};\xi) = \hat e_n(x^{(k)};\xi;\hat e^{(k)})$, so the sampling noise $e_n(x^{(k)};\xi_k) - \mathbb{E}_\xi e_n(x^{(k)};\xi)$ has zero mean and the cross term vanishes:

$$ \mathbb{E}\|\hat e^{(k+1)}_n - \mathbb{E}_\xi e_n(x^{(k)};\xi)\|^2 = (1-\alpha_k)^2\, \mathbb{E}\big\| \hat e^{(k)}_n - \mathbb{E}_\xi e_n(x^{(k-1)};\xi) - \big(\mathbb{E}_\xi e_n(x^{(k)};\xi) - \mathbb{E}_\xi e_n(x^{(k-1)};\xi)\big) \big\|^2 + \alpha_k^2\, \mathbb{E}\big\| e_n(x^{(k)};\xi_k) - \mathbb{E}_\xi e_n(x^{(k)};\xi) \big\|^2 \overset{\text{Assumption 1-2}}{\leq} \underbrace{(1-\alpha_k)^2\, \mathbb{E}\big\| \hat e^{(k)}_n - \mathbb{E}_\xi e_n(x^{(k-1)};\xi) - \big(\mathbb{E}_\xi e_n(x^{(k)};\xi) - \mathbb{E}_\xi e_n(x^{(k-1)};\xi)\big) \big\|^2}_{=:\,\mathfrak{T}_1} + \alpha_k^2 \sigma^2, \qquad (15) $$

where the second term is bounded by the bounded-variance assumption, and we denote the first term by $\mathfrak{T}_1$ and bound it separately. With the fact that $\|x+y\|^2 \leq (1+\mathfrak{d})\|x\|^2 + (1+\mathfrak{d}^{-1})\|y\|^2$ for all $\mathfrak{d} \in (0,1)$ and all $x, y$, we can bound $\mathfrak{T}_1$ as follows (choosing $\mathfrak{d} \leftarrow \alpha_k/2$):

$$ \mathfrak{T}_1 \leq (1-\alpha_k)^2 \left( \Big(1+\frac{\alpha_k}{2}\Big)\, \mathbb{E}\|\hat e^{(k)}_n - \mathbb{E}_\xi e_n(x^{(k-1)};\xi)\|^2 + \Big(1+\frac{2}{\alpha_k}\Big)\, \mathbb{E}\|\mathbb{E}_\xi e_n(x^{(k)};\xi) - \mathbb{E}_\xi e_n(x^{(k-1)};\xi)\|^2 \right) \leq (1-\alpha_k)\, \mathbb{E}\|\hat e^{(k)}_n - \mathbb{E}_\xi e_n(x^{(k-1)};\xi)\|^2 + \frac{2}{\alpha_k}\, \mathbb{E}\|\mathbb{E}_\xi e_n(x^{(k)};\xi) - \mathbb{E}_\xi e_n(x^{(k-1)};\xi)\|^2 \overset{\text{Assumption 1-4}}{\leq} (1-\alpha_k)\, \mathbb{E}\|\hat e^{(k)}_n - \mathbb{E}_\xi e_n(x^{(k-1)};\xi)\|^2 + \frac{2L^2}{\alpha_k}\, \mathbb{E}\|x^{(k)} - x^{(k-1)}\|^2, \qquad (16) $$

where the last step comes from the Lipschitzian assumption on the functions. Putting (16) back into (15) we obtain

$$ \mathbb{E}\|\hat e^{(k+1)}_n - \mathbb{E}_\xi e_n(x^{(k)};\xi)\|^2 \leq (1-\alpha_k)\, \mathbb{E}\|\hat e^{(k)}_n - \mathbb{E}_\xi e_n(x^{(k-1)};\xi)\|^2 + \underbrace{\frac{2L^2}{\alpha_k}\, \mathbb{E}\|x^{(k)} - x^{(k-1)}\|^2}_{=\,O(\gamma_{k-1}^2/\alpha_k)} + \alpha_k^2 \sigma^2, \qquad (17) $$

where the second term is of order $O(\gamma_{k-1}^2/\alpha_k)$ because the step length is $\gamma_{k-1}$ at the $(k-1)$-th step and the gradient is bounded according to Assumption 1-1. For the case $i = n$, (8) directly follows from combining (17), Lemma 4, and Assumption 1-6.

We then prove (8) for all $i$ by induction. Assume that for all $p > i$ and any $1 - a > \varepsilon > 0$ there exists a constant $\mathfrak{E}$ such that (8) holds:

$$ \mathbb{E}\|\hat e^{(k+1)}_p - \mathbb{E}_\xi e_p(x^{(k)};\xi)\|^2 \leq \mathfrak{E}\,(k^{-\gamma+2a+\varepsilon} + k^{-a+\varepsilon}). \qquad (18) $$

Similar to (14), we can split the difference between the estimation and the expectation into three parts:

$$ \hat e^{(k+1)}_i - \mathbb{E}_\xi e_i(x^{(k)};\xi) = (1-\alpha_k)\big(\hat e^{(k)}_i - \mathbb{E}_\xi e_i(x^{(k-1)};\xi)\big) + (1-\alpha_k)\big(\mathbb{E}_\xi e_i(x^{(k-1)};\xi) - \mathbb{E}_\xi e_i(x^{(k)};\xi)\big) + \alpha_k\big(\hat e_i(x^{(k)};\xi_k;\hat e^{(k)}) - \mathbb{E}_\xi e_i(x^{(k)};\xi)\big). $$

Taking the squared $\ell_2$ norm on both sides we obtain

$$ \mathbb{E}\|\hat e^{(k+1)}_i - \mathbb{E}_\xi e_i(x^{(k)};\xi)\|^2 \leq \underbrace{(1-\alpha_k)^2\, \mathbb{E}\big\| \hat e^{(k)}_i - \mathbb{E}_\xi e_i(x^{(k-1)};\xi) + \big(\mathbb{E}_\xi e_i(x^{(k-1)};\xi) - \mathbb{E}_\xi e_i(x^{(k)};\xi)\big) \big\|^2}_{=:\,\mathfrak{T}_2} + \underbrace{2(1-\alpha_k)\alpha_k\, \mathbb{E}\Big\langle \hat e^{(k)}_i - \mathbb{E}_\xi e_i(x^{(k-1)};\xi) + \mathbb{E}_\xi e_i(x^{(k-1)};\xi) - \mathbb{E}_\xi e_i(x^{(k)};\xi),\ \hat e_i(x^{(k)};\xi_k;\hat e^{(k)}) - \mathbb{E}_\xi e_i(x^{(k)};\xi) \Big\rangle}_{=:\,\mathfrak{T}_3} + \underbrace{\alpha_k^2\, \mathbb{E}\|\hat e_i(x^{(k)};\xi_k;\hat e^{(k)}) - \mathbb{E}_\xi e_i(x^{(k)};\xi)\|^2}_{=:\,\mathfrak{T}_4}. $$

We define the three parts in the above inequality as $\mathfrak{T}_2$, $\mathfrak{T}_3$ and $\mathfrak{T}_4$ so that we can bound them one by one. First, $\mathfrak{T}_4$ can easily be bounded using (18), Assumption 1-2 and Assumption 1-4:

$$ \mathfrak{T}_4 \leq 2\alpha_k^2\, \mathbb{E}\|\hat e_i(x^{(k)};\xi_k;\hat e^{(k)}) - \mathbb{E}_\xi \hat e_i(x^{(k)};\xi;\hat e^{(k)})\|^2 + 2\alpha_k^2\, \mathbb{E}\|\mathbb{E}_\xi \hat e_i(x^{(k)};\xi;\hat e^{(k)}) - \mathbb{E}_\xi e_i(x^{(k)};\xi)\|^2 \overset{\text{Assumption 1-2}}{\leq} 2\alpha_k^2 \sigma^2 + 2\alpha_k^2\, \mathbb{E}\|\mathbb{E}_\xi \hat e_i(x^{(k)};\xi;\hat e^{(k)}) - \mathbb{E}_\xi e_i(x^{(k)};\xi)\|^2 \overset{\text{Assumption 1-4}}{\leq} 2\alpha_k^2 \sigma^2 + 2\alpha_k^2 L^2 \sum_{j=i+1}^{n} \mathbb{E}\|\hat e^{(k)}_j - \mathbb{E}_\xi e_j(x^{(k)};\xi)\|^2 \overset{(18)}{\leq} 2\alpha_k^2 \sigma^2 + 2n\alpha_k^2 L^2 \mathfrak{E}. \qquad (19) $$

Next, $\mathfrak{T}_2$ can be bounded in the same way as $\mathfrak{T}_1$ in (16):

$$ \mathfrak{T}_2 \leq (1-\alpha_k)\, \mathbb{E}\|\hat e^{(k)}_i - \mathbb{E}_\xi e_i(x^{(k-1)};\xi)\|^2 + \frac{2L^2}{\alpha_k}\, \mathbb{E}\|x^{(k)} - x^{(k-1)}\|^2. \qquad (20) $$

Finally we need to bound $\mathfrak{T}_3$, which is a little harder than $\mathfrak{T}_2$ and $\mathfrak{T}_4$. Different from (15), we no longer have the nice property $e_i(x^{(k)};\xi) = \hat e_i(x^{(k)};\xi;\hat e^{(k)})$, since here $i \neq n$; the inner product does not vanish, so we further split it and bound each part separately:

$$ \frac{\mathfrak{T}_3}{2(1-\alpha_k)\alpha_k} = \underbrace{\mathbb{E}\big\langle \hat e^{(k)}_i - \mathbb{E}_\xi e_i(x^{(k-1)};\xi),\ \mathbb{E}_\xi \hat e_i(x^{(k)};\xi;\hat e^{(k)}) - \mathbb{E}_\xi e_i(x^{(k)};\xi) \big\rangle}_{=:\,\mathfrak{T}_5} + \underbrace{\mathbb{E}\big\langle \mathbb{E}_\xi e_i(x^{(k-1)};\xi) - \mathbb{E}_\xi e_i(x^{(k)};\xi),\ \mathbb{E}_\xi \hat e_i(x^{(k)};\xi;\hat e^{(k)}) - \mathbb{E}_\xi e_i(x^{(k)};\xi) \big\rangle}_{=:\,\mathfrak{T}_6}. \qquad (21) $$

We first bound $\mathfrak{T}_6$:

$$ \mathfrak{T}_6 \leq \underbrace{\mathbb{E}\|\mathbb{E}_\xi e_i(x^{(k-1)};\xi) - \mathbb{E}_\xi e_i(x^{(k)};\xi)\|}_{=\,O(\gamma_{k-1})}\, \mathbb{E}\|\mathbb{E}_\xi \hat e_i(x^{(k)};\xi;\hat e^{(k)}) - \mathbb{E}_\xi e_i(x^{(k)};\xi)\| = O\big( \gamma_{k-1}\, \mathbb{E}\|\mathbb{E}_\xi \hat e_i(x^{(k)};\xi;\hat e^{(k)}) - \mathbb{E}_\xi e_i(x^{(k)};\xi)\| \big) \qquad (22) $$

$$ \overset{\text{Assumption 1-4}}{\leq} O\left( \gamma_{k-1} \sqrt{ \sum_{j=i+1}^{n} \mathbb{E}\|\hat e^{(k)}_j - \mathbb{E}_\xi e_j(x^{(k)};\xi)\|^2 } \right) \overset{(18)}{=} O\big( \gamma_{k-1} \sqrt{k^{-\gamma+2a+\varepsilon} + k^{-a+\varepsilon}} \big) = O\big( \gamma_{k-1} (k^{-\gamma/2+a+\varepsilon/2} + k^{-a/2+\varepsilon/2}) \big) = O(k^{-3\gamma/2+a+\varepsilon/2} + k^{-\gamma-a/2+\varepsilon/2}), \qquad (23) $$

where in the second step $\|\mathbb{E}_\xi e_i(x^{(k-1)};\xi) - \mathbb{E}_\xi e_i(x^{(k)};\xi)\| = O(\gamma_{k-1})$, since by Assumption 1-4 we obtain $\|\mathbb{E}_\xi e_i(x^{(k-1)};\xi) - \mathbb{E}_\xi e_i(x^{(k)};\xi)\| \leq L\|x^{(k-1)} - x^{(k)}\|$, and then the same argument as in (17) applies.

After bounding $\mathfrak{T}_6$, we investigate $\mathfrak{T}_5$; the last step follows the same procedure as in (22):

$$ \mathfrak{T}_5 \leq \frac{\mathfrak{d}_k}{2}\, \mathbb{E}\|\hat e^{(k)}_i - \mathbb{E}_\xi e_i(x^{(k-1)};\xi)\|^2 + \frac{1}{2\mathfrak{d}_k}\, \mathbb{E}\|\mathbb{E}_\xi \hat e_i(x^{(k)};\xi;\hat e^{(k)}) - \mathbb{E}_\xi e_i(x^{(k)};\xi)\|^2 = \frac{\mathfrak{d}_k}{2}\, \mathbb{E}\|\hat e^{(k)}_i - \mathbb{E}_\xi e_i(x^{(k-1)};\xi)\|^2 + \frac{1}{2\mathfrak{d}_k}\, O(k^{-\gamma+2a+\varepsilon} + k^{-a+\varepsilon}), \quad \forall \mathfrak{d}_k > 0. \qquad (24) $$

Finally, plug (19), (20), (24) and (23) back into (21). By choosing $\mathfrak{d}_k$ in (24) to be $\frac{1}{2}$ we obtain

$$ \mathbb{E}\|\hat e^{(k+1)}_i - \mathbb{E}_\xi e_i(x^{(k)};\xi)\|^2 \leq \Big( (1-\alpha_k) + (1-\alpha_k)\frac{\alpha_k}{2} \Big)\, \mathbb{E}\|\hat e^{(k)}_i - \mathbb{E}_\xi e_i(x^{(k-1)};\xi)\|^2 + 2\alpha_k^2 \sigma^2 + 2n\alpha_k^2 L^2 \mathfrak{E} + \frac{2L^2 \gamma_{k-1}^2 \mathbb{G}^2}{\alpha_k} + 2(1-\alpha_k)\alpha_k\, O(k^{-3\gamma/2+a+\varepsilon/2} + k^{-\gamma-a/2+\varepsilon/2}) + 2(1-\alpha_k)\alpha_k\, O(k^{-\gamma+2a+\varepsilon} + k^{-a+\varepsilon}) \leq \Big(1 - \frac{\alpha_k}{2}\Big)\, \mathbb{E}\|\hat e^{(k)}_i - \mathbb{E}_\xi e_i(x^{(k-1)};\xi)\|^2 + O(k^{-\gamma+a+\varepsilon} + k^{-2a+\varepsilon}). \qquad (25) $$

By combining (25) and Lemma 4 we obtain (8); (9) follows from (17) and (25).

Proof to Theorem 2.
It directly follows from the Lipschitz condition on $f$ that

$$ f(x^{(k+1)}) \overset{\text{Assumption 1-4}}{\leq} f(x^{(k)}) + \langle \partial f(x^{(k)}),\, x^{(k+1)} - x^{(k)} \rangle + \frac{L}{2}\|x^{(k+1)} - x^{(k)}\|^2 = f(x^{(k)}) + \langle \partial f(x^{(k)}),\, -\gamma_k g^{(k)} \rangle + \frac{L}{2}\|x^{(k+1)} - x^{(k)}\|^2 = f(x^{(k)}) - \gamma_k \|\partial f(x^{(k)})\|^2 + \underbrace{\langle \partial f(x^{(k)}),\, -\gamma_k (g^{(k)} - \partial f(x^{(k)})) \rangle}_{=:\,\mathfrak{T}_{\mathrm{cross}}} + \underbrace{\frac{L}{2}\|x^{(k+1)} - x^{(k)}\|^2}_{=:\,\mathfrak{T}_{\mathrm{progress}}}. \qquad (26) $$

Here we define two new terms and bound the expectations of $\mathfrak{T}_{\mathrm{cross}}$ and $\mathfrak{T}_{\mathrm{progress}}$ separately, as shown below:

$$ \mathbb{E}\mathfrak{T}_{\mathrm{cross}} = \mathbb{E}\langle \partial f(x^{(k)}),\, -\gamma_k (g^{(k)} - \partial f(x^{(k)})) \rangle \leq \frac{\gamma_k^2 L_g}{\alpha_{k+1}}\, \mathbb{E}\|\partial f(x^{(k)})\|^2 + \frac{\alpha_{k+1}}{4L_g}\, \mathbb{E}\|\mathbb{E}_{\xi_k}[g^{(k)}] - \partial f(x^{(k)})\|^2 \overset{\text{Assumption 1-3}}{\leq} \frac{\gamma_k^2 L_g}{\alpha_{k+1}}\, \mathbb{E}\|\partial f(x^{(k)})\|^2 + \frac{n \alpha_{k+1} L_g}{4} \sum_{i=1}^{n} \mathbb{E}\|\hat e^{(k+1)}_i - \mathbb{E}_{\xi_k}[e_i(x^{(k)};\xi_k)]\|^2, \qquad (27) $$

and

$$ \mathbb{E}\mathfrak{T}_{\mathrm{progress}} = \frac{L}{2}\, \mathbb{E}\|x^{(k+1)} - x^{(k)}\|^2 \overset{\text{Assumption 1-1}}{\leq} \frac{L}{2}\gamma_k^2\, \mathbb{G}^2. \qquad (28) $$

With the help of (27) and (28), (26) becomes

$$ \mathbb{E}f(x^{(k+1)}) \leq \mathbb{E}f(x^{(k)}) - \gamma_k\, \mathbb{E}\|\partial f(x^{(k)})\|^2 + \frac{\gamma_k^2 L_g}{\alpha_{k+1}}\, \mathbb{E}\|\partial f(x^{(k)})\|^2 + \frac{n \alpha_{k+1} L_g}{4} \sum_{i=1}^{n} \mathbb{E}\|\hat e^{(k+1)}_i - \mathbb{E}_{\xi_k}[e_i(x^{(k)};\xi_k)]\|^2 + \frac{L}{2}\gamma_k^2\, \mathbb{G}^2. \qquad (29) $$

To show how this converges, we define the following quantity to derive a recursive relation for (29):

$$ \mathfrak{d}_k := f(x^{(k)}) + \sum_{i=1}^{n} \|\hat e^{(k+1)}_i - \mathbb{E}_{\xi_k}[e_i(x^{(k)};\xi_k)]\|^2. $$

Then the recursive relation is derived by combining (29) with Lemma 1 (inequality (9) at step $k+1$), which absorbs the statistics-error terms:

$$ \mathbb{E}\mathfrak{d}_{k+1} \leq \mathbb{E}\mathfrak{d}_k - \left( \gamma_k - \frac{\gamma_k^2 L_g}{\alpha_{k+1}} \right) \mathbb{E}\|\partial f(x^{(k)})\|^2 + O(k^{-\gamma+a+\varepsilon} + k^{-2a+\varepsilon}). $$

Thus, as long as $a - \gamma < -\frac{1}{2}$ and $a < \frac{1}{2}$, we can choose $\varepsilon$ such that there exists a constant $\mathfrak{R}$ with

$$ \mathbb{E}\mathfrak{d}_{K+1} \leq \mathbb{E}\mathfrak{d}_0 - \sum_{k=0}^{K} \left( \gamma_k - \frac{\gamma_k^2 L_g}{\alpha_{k+1}} \right) \mathbb{E}\|\partial f(x^{(k)})\|^2 + \mathfrak{R}. $$

Since $\frac{\gamma_k L_g}{\alpha_{k+1}} \leq \frac{1}{4}$, we have $\gamma_k - \frac{\gamma_k^2 L_g}{\alpha_{k+1}} \geq \frac{3}{4}\gamma_k$, and hence

$$ \frac{\sum_{k=0}^{K} \gamma_k\, \mathbb{E}\|\partial f(x^{(k)})\|^2}{\sum_{k=0}^{K} \gamma_k} \leq \frac{\frac{4}{3}\big(\mathfrak{R} + \mathbb{E}\mathfrak{d}_0 - \mathbb{E}\mathfrak{d}_{K+1}\big)}{\sum_{k=0}^{K} \gamma_k}. $$

By Assumption 1-5, $\mathbb{E}\mathfrak{d}_{K+1}$ is bounded from below, and we complete the proof.
Proof to Corollary 3
With the given choice of $\alpha_k$ and $\gamma_k$, the prerequisites of Theorem 2 are satisfied. Then by Theorem 2 there exists a constant $\mathfrak{H}$ (which may differ from the constant in Theorem 2) such that

$$ \sum_{k=0}^{K} (k+2)^{-3/4}\, \mathbb{E}\|\partial f(x^{(k)})\|^2 \leq \mathfrak{H} \quad\Longrightarrow\quad \sum_{k=0}^{K} \mathbb{E}\|\partial f(x^{(k)})\|^2 \leq \mathfrak{H}\,(K+2)^{3/4} \quad\Longrightarrow\quad \frac{\sum_{k=0}^{K} \mathbb{E}\|\partial f(x^{(k)})\|^2}{K+2} \leq \frac{\mathfrak{H}}{(K+2)^{1/4}}, $$

which completes the proof.