Understanding Why Neural Networks Generalize Well Through GSNR of Parameters
Jinlong Liu, Guoqing Jiang, Yunzhi Bai, Ting Chen, Huayan Wang
Published as a conference paper at ICLR 2020

Ytech, KWAI Inc.: {liujinlong,jiangguoqing,baiyunzhi,wanghuayan}@kuaishou.com
Samsung Research China, Beijing (SRC-B): [email protected]
Jinlong Liu is the corresponding author.

ABSTRACT
As deep neural networks (DNNs) achieve tremendous success across many application domains, researchers have explored from many angles why they generalize well. In this paper, we provide a novel perspective on these issues using the gradient signal to noise ratio (GSNR) of parameters during the training process of DNNs. The GSNR of a parameter is defined as the ratio between its gradient's squared mean and variance over the data distribution. Based on several approximations, we establish a quantitative relationship between model parameters' GSNR and the generalization gap. This relationship indicates that larger GSNR during the training process leads to better generalization performance. Moreover, we show that, unlike shallow models (e.g., logistic regression, support vector machines), the gradient descent optimization dynamics of DNNs naturally produces large GSNR during training, which is probably the key to DNNs' remarkable generalization ability.
1 INTRODUCTION
Deep neural networks typically contain far more trainable parameters than training samples, which would seem to invite poor generalization. In fact, however, they usually exhibit remarkably small generalization gaps. Traditional generalization theories such as VC dimension (Vapnik & Chervonenkis, 1991) or Rademacher complexity (Bartlett & Mendelson, 2002) cannot explain this mechanism. Extensive research focuses on the generalization ability of DNNs (Neyshabur et al., 2017; Arora et al., 2018; Keskar et al., 2016; Dinh et al., 2017; Hoffer et al., 2017; Novak et al., 2018; Dziugaite & Roy, 2017; Jakubovitz et al., 2019; Kawaguchi et al., 2017; Advani & Saxe, 2017). Unlike that of shallow models such as logistic regression or support vector machines, the global minimum of high-dimensional and non-convex DNNs cannot be found analytically, but can only be approximated by gradient descent and its variants (Zeiler, 2012; Kingma & Ba, 2014; Graves, 2013). Previous work (Zhang et al., 2016; Hardt et al., 2015; Dziugaite & Roy, 2017) suggests that the generalization ability of DNNs is closely related to gradient descent optimization. For example, Hardt et al. (2015) claim that any model trained with stochastic gradient descent (SGD) for a reasonable number of epochs exhibits a small generalization error; their analysis is based on the smoothness of the loss function. In this work, we attempt to understand the generalization behavior of DNNs through GSNR and reveal how GSNR affects the training dynamics of gradient descent. Stanislav Fort et al. (2019) studied a new gradient alignment measure called stiffness in order to better understand generalization; stiffness is related to our work.

The GSNR of a parameter is defined as the ratio between its gradient's squared mean and variance over the data distribution. Previous work has used GSNR for theoretical analysis of deep learning. For example, Rainforth et al. (2018) used GSNR to analyze variational bounds in unsupervised DNNs such as the variational auto-encoder (VAE). Here we focus on analyzing the relation between GSNR and the generalization gap.

Figure 1: Schematic diagram of the sample-wise parameter gradient distribution corresponding to greater (Left) and smaller (Right) GSNR. Pink arrows denote the gradient vectors for each sample, while the blue arrow indicates their mean.

Intuitively, GSNR measures the similarity of a parameter's gradients among different training samples. Large GSNR implies that most training samples agree on the optimization direction of this parameter; the parameter is thus more likely to be associated with a meaningful "pattern", and we assume that updating it leads to better generalization. In this work, we prove that GSNR is strongly related to generalization performance, with larger GSNR meaning better generalization.

To reveal the mechanism of DNNs' good generalization ability, we show that the gradient descent optimization dynamics of DNNs naturally leads to large GSNR of model parameters and therefore to good generalization. Furthermore, we give a complete analysis and a detailed interpretation of this phenomenon. We believe this is probably the key to DNNs' remarkable generalization ability.

In the remainder of this paper we first analyze the relation between GSNR and generalization (Section 2). We then show, experimentally and analytically, how the training dynamics lead to large GSNR of model parameters (Section 3).
2 LARGER GSNR LEADS TO BETTER GENERALIZATION
In this section, we establish a quantitative relation between the GSNR of model parameters and the generalization gap, showing that larger GSNR during training leads to better generalization.

2.1 GRADIENT SIGNAL TO NOISE RATIO
Consider a data distribution $\mathcal{Z} = \mathcal{X} \times \mathcal{Y}$, from which each sample $(x, y)$ is drawn; a model $\hat{y} = f(x, \theta)$ parameterized by $\theta$; and a loss function $L$. The parameters' gradient w.r.t. $L$ and sample $(x_i, y_i)$ is denoted by

$$g(x_i, y_i, \theta) \;\text{or}\; g_i(\theta) := \frac{\partial L(y_i, f(x_i, \theta))}{\partial \theta}, \qquad (1)$$

whose $j$-th element is $g_i(\theta_j)$. Note that throughout this paper we always use $i$ to index data examples and $j$ to index model parameters.

Given the data distribution $\mathcal{Z}$, we have the (sample-wise) mean and variance of $g_i(\theta)$, which we denote as $\tilde{g}(\theta) = E_{(x,y)\sim\mathcal{Z}}(g(x, y, \theta))$ and $\rho^2(\theta) = \mathrm{Var}_{(x,y)\sim\mathcal{Z}}(g(x, y, \theta))$, respectively.

The gradient signal to noise ratio (GSNR) of one model parameter $\theta_j$ is defined as

$$r(\theta_j) := \frac{\tilde{g}^2(\theta_j)}{\rho^2(\theta_j)}. \qquad (2)$$

At a particular point of the parameter space, GSNR measures the consistency of a parameter's gradients across different data samples. Figure 1 shows the intuition: if GSNR is large, the per-sample gradients tend to point in similar directions; if GSNR is small, the gradient vectors are scattered.
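As a concrete reference, the following is a minimal PyTorch sketch (ours, not the authors' code) of estimating the per-parameter GSNR of eq. (2) from per-sample gradients; the function and argument names are illustrative.

```python
import torch

def gsnr(model, loss_fn, samples, eps=1e-12):
    """Estimate r(theta_j) = g~(theta_j)^2 / rho^2(theta_j) for every parameter
    by accumulating per-sample gradient statistics over `samples`."""
    params = [p for p in model.parameters() if p.requires_grad]
    s1 = [torch.zeros_like(p) for p in params]   # running sum of gradients
    s2 = [torch.zeros_like(p) for p in params]   # running sum of squared gradients
    n = 0
    for x, y in samples:                         # one (x, y) pair at a time
        model.zero_grad()
        loss_fn(model(x.unsqueeze(0)), y.unsqueeze(0)).backward()
        for a, b, p in zip(s1, s2, params):
            a += p.grad
            b += p.grad ** 2
        n += 1
    out = []
    for a, b in zip(s1, s2):
        mean = a / n                               # sample-wise gradient mean
        var = (b / n - mean ** 2).clamp_min(0.0)   # sample-wise gradient variance
        out.append(mean ** 2 / (var + eps))        # elementwise GSNR, eq. (2)
    return out
```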
Figure 2: Schematic diagram of training behavior with $\mathrm{OSGR}(t) = 0$ (Left), $0 < \mathrm{OSGR}(t) < 1$ (Middle), and $\mathrm{OSGR}(t) \approx 1$ (Right). Note that the Middle scenario is the one most commonly seen in regular tasks.

2.2 ONE-STEP GENERALIZATION RATIO

In this section we introduce a new concept to help measure the generalization performance during gradient descent optimization, which we call the one-step generalization ratio (OSGR). Consider a training set $D = \{(x_1, y_1), ..., (x_n, y_n)\} \sim \mathcal{Z}^n$ with $n$ samples drawn from $\mathcal{Z}$, and a test set $D' = \{(x'_1, y'_1), ..., (x'_{n'}, y'_{n'})\} \sim \mathcal{Z}^{n'}$. In practice we use the loss on $D'$ to measure generalization. For simplicity, we assume the sizes of the training and test datasets are equal, i.e. $n = n'$. We denote the empirical training and test loss as

$$L[D] = \frac{1}{n}\sum_{i=1}^{n} L(y_i, f(x_i, \theta)), \qquad L[D'] = \frac{1}{n}\sum_{i=1}^{n} L(y'_i, f(x'_i, \theta)), \qquad (3)$$

respectively. The empirical generalization gap is then $L[D'] - L[D]$.

In gradient descent optimization, both the training and test loss decrease step by step. We use $\Delta L[D]$ and $\Delta L[D']$ to denote the one-step training and test loss decrease during training, respectively. Consider the ratio between the expectations of $\Delta L[D']$ and $\Delta L[D]$ for one single training step, which we denote as $R(\mathcal{Z}, n)$:

$$R(\mathcal{Z}, n) := \frac{E_{D,D'\sim\mathcal{Z}^n}(\Delta L[D'])}{E_{D\sim\mathcal{Z}^n}(\Delta L[D])}. \qquad (4)$$

Note that this ratio also depends on the current model parameters $\theta$ and the learning rate $\lambda$. We do not include them in the above notation because we will not explicitly model these dependencies, but rather try to quantitatively characterize $R$ for very small $\lambda$ and for $\theta$ at the early stage of training (satisfying Assumption 2.3.1). Also note that the expectation of $\Delta L[D']$ is over both $D$ and $D'$; this is because the optimization step is performed on $D$. We refer to $R(\mathcal{Z}, n)$ as the OSGR of gradient descent optimization. Statistically the training loss decreases faster than the test loss, so $0 < \mathrm{OSGR}(t) < 1$ (Middle panel of Figure 2), which usually results in a non-zero generalization gap at the end of training. If $\mathrm{OSGR}(t)$ is large ($\approx 1$) during the whole training process (Right panel of Figure 2), the generalization gap will be small when training completes, implying good generalization ability of the model. If $\mathrm{OSGR}(t)$ is small ($= 0$), the test loss does not decrease while the training loss drops normally (Left panel of Figure 2), corresponding to a large generalization gap.

2.3 RELATION BETWEEN GSNR AND OSGR

In this section, we derive a relation between the OSGR during training and the GSNR of model parameters. To our knowledge, this is the first time the sample-wise gradient distribution of parameters has been related to the generalization performance of gradient descent optimization.

In gradient descent optimization, we take the average gradient over the training set $D$, which we denote as $g_D(\theta)$. Note that we have used $g_i(\theta)$ to denote the gradient evaluated on one data sample and $\tilde{g}(\theta)$ to denote its expectation over the entire data distribution. Similarly we define $g_{D'}(\theta)$ to be the average gradient over the test set $D'$:

$$g_D(\theta) = \frac{1}{n}\sum_{i=1}^{n} g(x_i, y_i, \theta) = \frac{\partial L[D]}{\partial \theta}, \qquad g_{D'}(\theta) = \frac{1}{n}\sum_{i=1}^{n} g(x'_i, y'_i, \theta) = \frac{\partial L[D']}{\partial \theta}. \qquad (5)$$

Both the training and test datasets are randomly drawn from the same distribution $\mathcal{Z}^n$, so we can treat $g_D(\theta)$ and $g_{D'}(\theta)$ as random variables. At the beginning of the optimization process, $\theta$ is randomly initialized and thus independent of $D$, so $g_D(\theta)$ and $g_{D'}(\theta)$ obey the same distribution. After a period of training, the model parameters begin to fit the training dataset and become a function of $D$, i.e. $\theta = \theta(D)$; the distributions of $g_D(\theta(D))$ and $g_{D'}(\theta(D))$ then become different. However, we choose not to model this dependency and make the following assumption for our analysis:

Assumption 2.3.1 (Non-overfitting limit approximation) The average gradients over the training dataset and the test dataset, $g_D(\theta)$ and $g_{D'}(\theta)$, obey the same distribution.

Obviously the mean of $g_D(\theta)$ and $g_{D'}(\theta)$ is just the mean gradient over the data distribution, $\tilde{g}(\theta)$:

$$E_{D\sim\mathcal{Z}^n}[g_D(\theta)] = E_{D,D'\sim\mathcal{Z}^n}[g_{D'}(\theta)] = \tilde{g}(\theta). \qquad (6)$$

We denote their variance as $\sigma^2(\theta)$, i.e.

$$\mathrm{Var}_{D\sim\mathcal{Z}^n}[g_D(\theta)] = \mathrm{Var}_{D,D'\sim\mathcal{Z}^n}[g_{D'}(\theta)] = \sigma^2(\theta). \qquad (7)$$

Since the per-sample gradients $g_i(\theta)$ are i.i.d., it is straightforward to show that

$$\sigma^2(\theta) = \mathrm{Var}_{D\sim\mathcal{Z}^n}\Big[\frac{1}{n}\sum_{i=1}^{n} g_i(\theta)\Big] = \frac{1}{n}\rho^2(\theta), \qquad (8)$$

where $\sigma^2(\theta)$ is the variance of the average gradient over a dataset of size $n$, and $\rho^2(\theta)$ is the variance of the gradient of a single data sample.
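Eq. (8) is the standard variance-of-the-mean identity; a quick Monte Carlo check (ours, with an arbitrary stand-in gradient distribution) confirms it numerically:

```python
import numpy as np

rng = np.random.default_rng(0)

# Check eq. (8): for i.i.d. per-sample gradients with variance rho^2, the
# dataset-average gradient has variance sigma^2 = rho^2 / n.
rho2, n, trials = 4.0, 50, 100_000
g = rng.normal(loc=1.0, scale=np.sqrt(rho2), size=(trials, n))  # g_i(theta_j)
g_D = g.mean(axis=1)                                            # average over D
print(g_D.var(), rho2 / n)  # both approximately 0.08
```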
In one gradient descent step, the model parameters are updated by $\Delta\theta = \theta_{t+1} - \theta_t = -\lambda g_D(\theta)$, where $\lambda$ is the learning rate. If $\lambda$ is small enough, the one-step training and test loss decreases can be approximated by

$$\Delta L[D] \approx -\Delta\theta \cdot \frac{\partial L[D]}{\partial \theta} + O(\lambda^2) = \lambda\, g_D(\theta) \cdot g_D(\theta) + O(\lambda^2), \qquad (9)$$

$$\Delta L[D'] \approx -\Delta\theta \cdot \frac{\partial L[D']}{\partial \theta} + O(\lambda^2) = \lambda\, g_D(\theta) \cdot g_{D'}(\theta) + O(\lambda^2). \qquad (10)$$

Usually there are some differences between the directions of $g_D(\theta)$ and $g_{D'}(\theta)$, so statistically $\Delta L[D]$ tends to be larger than $\Delta L[D']$ and the generalization gap grows during training. When $\lambda \to 0$, in one single training step the empirical generalization gap increases by $\Delta L[D] - \Delta L[D']$; for simplicity we denote this quantity as $\nabla$:

$$\nabla := \Delta L[D] - \Delta L[D'] \approx \lambda\, g_D(\theta) \cdot g_D(\theta) - \lambda\, g_D(\theta) \cdot g_{D'}(\theta) \qquad (11)$$
$$= \lambda(\tilde{g}(\theta) + \epsilon)(\tilde{g}(\theta) + \epsilon - \tilde{g}(\theta) - \epsilon') \qquad (12)$$
$$= \lambda(\tilde{g}(\theta) + \epsilon)(\epsilon - \epsilon'). \qquad (13)$$

Here we replaced the random variables by $g_D(\theta) = \tilde{g}(\theta) + \epsilon$ and $g_{D'}(\theta) = \tilde{g}(\theta) + \epsilon'$, where $\epsilon$ and $\epsilon'$ are random variables with zero mean and variance $\sigma^2(\theta)$. Since $E(\epsilon') = E(\epsilon) = 0$ and $\epsilon$, $\epsilon'$ are independent, the expectation of $\nabla$ is

$$E_{D,D'\sim\mathcal{Z}^n}(\nabla) = E(\lambda\,\epsilon \cdot \epsilon) + O(\lambda^2) = \lambda\sum_j \sigma^2(\theta_j) + O(\lambda^2), \qquad (14)$$

where $\sigma^2(\theta_j)$ is the variance of the average gradient of the parameter $\theta_j$. For simplicity, when a single model parameter $\theta_j$ is involved, we will use only a subscript $j$ instead of the full notation; for example, we use $\sigma^2_j$, $r_j$, and $g_{D,j}$ to denote $\sigma^2(\theta_j)$, $r(\theta_j)$, and $g_D(\theta_j)$, respectively.

Consider the expectations of $\Delta L[D]$ and $\Delta L[D']$ when $\lambda \to 0$:

$$E_{D\sim\mathcal{Z}^n}(\Delta L[D]) \approx \lambda E_{D\sim\mathcal{Z}^n}(g_D(\theta) \cdot g_D(\theta)) = \lambda\sum_j E_{D\sim\mathcal{Z}^n}(g^2_{D,j}), \qquad (15)$$

$$E_{D,D'\sim\mathcal{Z}^n}(\Delta L[D']) = E_{D,D'\sim\mathcal{Z}^n}(\Delta L[D] - \nabla) \qquad (16)$$
$$\approx \lambda\sum_j \big(E_{D\sim\mathcal{Z}^n}(g^2_{D,j}) - \sigma^2_j\big) \qquad (17)$$
$$= \lambda\sum_j \big(E_{D\sim\mathcal{Z}^n}(g^2_{D,j}) - \rho^2_j/n\big). \qquad (18)$$

Substituting (18) and (15) into (4), we have

$$R(\mathcal{Z}, n) = 1 - \frac{\sum_j \rho^2_j}{n\sum_j E_{D\sim\mathcal{Z}^n}(g^2_{D,j})}. \qquad (19)$$

Although we derived eq. (19) from simplified assumptions, we can empirically verify it by estimating the two sides of the equation on real data. We elaborate on this estimation method in Section 2.4. We can rewrite eq. (19) as

$$R(\mathcal{Z}, n) = 1 - \frac{1}{n}\sum_j \frac{E_{D\sim\mathcal{Z}^n}(g^2_{D,j})}{\sum_{j'} E_{D\sim\mathcal{Z}^n}(g^2_{D,j'})} \cdot \frac{\rho^2_j}{E_{D\sim\mathcal{Z}^n}(g^2_{D,j})} \qquad (20)$$

$$= 1 - \frac{1}{n}\sum_j \frac{E_{D\sim\mathcal{Z}^n}(g^2_{D,j})}{\sum_{j'} E_{D\sim\mathcal{Z}^n}(g^2_{D,j'})} \cdot \frac{1}{r_j + \frac{1}{n}}, \qquad (21)$$

where $E_{D\sim\mathcal{Z}^n}(g^2_{D,j}) = \mathrm{Var}_{D\sim\mathcal{Z}^n}(g_{D,j}) + E^2_{D\sim\mathcal{Z}^n}(g_{D,j}) = \frac{1}{n}\rho^2_j + \tilde{g}^2_j$.

We define $\Delta L_j[D]$ to be the training loss decrease caused by updating $\theta_j$. One can show that when $\lambda$ is very small, $\Delta L_j[D] = \lambda g^2_{D,j} + O(\lambda^2)$. Therefore when $\lambda \to 0$, we have

$$R(\mathcal{Z}, n) = 1 - \frac{1}{n}\sum_j \frac{W_j}{r_j + \frac{1}{n}}, \quad \text{where } W_j := \frac{E_{D\sim\mathcal{Z}^n}(\Delta L_j[D])}{E_{D\sim\mathcal{Z}^n}(\Delta L[D])} \text{ with } \sum_j W_j = 1. \qquad (22)$$

Eq. (22) shows that the GSNR $r_j$ plays a crucial role in the model's generalization ability: the one-step generalization ratio of gradient descent equals one minus the weighted average of $\frac{1}{r_j + 1/n}$ over all model parameters, divided by $n$, where each parameter's weight is proportional to the expected training loss decrease resulting from updating it. This implies that larger GSNR of model parameters during training leads to smaller generalization gap growth and thus better generalization performance of the trained model. Also note that when $n \to \infty$, we have $R(\mathcal{Z}, n) \to 1$, meaning that training on more data helps generalization.
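To make eq. (22) concrete, here is a small numeric sketch (our illustration, not the paper's code) of how the OSGR responds to the GSNR values $r_j$ for fixed weights $W_j$:

```python
import numpy as np

def osgr_from_gsnr(W, r, n):
    """Eq. (22): R(Z, n) = 1 - (1/n) * sum_j W_j / (r_j + 1/n), sum_j W_j = 1."""
    W, r = np.asarray(W, float), np.asarray(r, float)
    return 1.0 - np.sum((W / W.sum()) / (r + 1.0 / n)) / n

print(osgr_from_gsnr([0.5, 0.5], [10.0, 10.0], n=100))  # high GSNR -> ~0.999
print(osgr_from_gsnr([0.5, 0.5], [0.01, 0.01], n=100))  # low GSNR  -> 0.5
```

With large $r_j$ the weighted average is tiny and $R \approx 1$ (test loss tracks training loss); with small $r_j$ a sizable fraction of each training-loss decrease fails to transfer to the test loss.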
2.4 EXPERIMENTAL VERIFICATION OF THE RELATION BETWEEN GSNR AND OSGR

The relation between GSNR and OSGR, i.e. eq. (19) or (22), can be empirically verified on any dataset if (1) the dataset includes enough samples to construct many training sets and a large enough test set, so that we can reliably estimate $\rho^2_j$, $E_{D\sim\mathcal{Z}^n}(g^2_{D,j})$ and the OSGR; (2) the learning rate is small enough; and (3) we are in the early training stage of gradient descent.

To empirically verify eq. (19), we show how to estimate its left and right hand sides, i.e. the OSGR by definition and the OSGR as a function of GSNR. Suppose we have $M$ training sets, each of size $n$, and a test set of size $n'$. We initialize a model, train it separately on the $M$ training sets, and test it on the same test set. For the $t$-th training iteration, we denote the training loss and test loss of the model trained on the $m$-th training dataset as $L^{(m)}_t$ and $L'^{(m)}_t$, respectively. The left hand side, i.e. the OSGR by definition, at the $t$-th iteration can then be estimated by

$$R_t(\mathcal{Z}, n) \approx \frac{\sum_{m=1}^{M} \big(L'^{(m)}_{t+1} - L'^{(m)}_t\big)}{\sum_{m=1}^{M} \big(L^{(m)}_{t+1} - L^{(m)}_t\big)}. \qquad (23)$$

For the model trained on the $m$-th training set, we can compute the $t$-th step average gradient and sample-wise gradient variance of $\theta_j$ on the corresponding training set, denoted as $g_{m,j,t}$ and $\rho^2_{m,j,t}$, respectively. The right hand side of eq. (19) can therefore be estimated by

$$E_{D\sim\mathcal{Z}^n}(g^2_{D,j,t}) \approx \frac{1}{M}\sum_{m=1}^{M} g^2_{m,j,t}, \qquad \rho^2_{j,t} \approx \frac{1}{M}\sum_{m=1}^{M} \rho^2_{m,j,t}. \qquad (24)$$
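A minimal sketch of both estimators, assuming the per-run loss curves and per-run gradient statistics have already been collected into arrays (all names are ours):

```python
import numpy as np

def osgr_lhs(train_loss, test_loss, t):
    """Eq. (23): OSGR by definition at iteration t.
    train_loss, test_loss: (M, T) arrays of L^(m)_t and L'^(m)_t over M runs."""
    num = (test_loss[:, t + 1] - test_loss[:, t]).sum()
    den = (train_loss[:, t + 1] - train_loss[:, t]).sum()
    return num / den

def osgr_rhs(g, rho2, n):
    """Eq. (24) plugged into eq. (19): OSGR as a function of GSNR statistics.
    g, rho2: (M, P) arrays of g_{m,j,t} and rho^2_{m,j,t} for one iteration t."""
    Eg2 = (g ** 2).mean(axis=0)    # estimate of E_D(g_{D,j,t}^2)
    rho2_bar = rho2.mean(axis=0)   # estimate of rho_{j,t}^2
    return 1.0 - rho2_bar.sum() / (n * Eg2.sum())
```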
We performed the above estimations on MNIST with a simple CNN consisting of two Conv-ReLU-MaxPooling blocks and two fully-connected layers. First, to estimate eq. (24) with $M = 10$, we randomly sample 10 training sets of size $n$ and a test set of size 10,000. To cover different conditions, we (1) vary the training set size $n$; (2) inject noise by randomly changing the labels with probability $p_{random}$; and (3) change the model structure by varying the number of channels $ch$ in the layers. See Appendix A for more details of the setup. We use full-batch gradient descent (not SGD) with a small learning rate. The left and right hand sides of eq. (19) at different epochs are shown in Figure 3, where each point represents one specific choice of the above settings.

Figure 3: Left hand side (LHS, or OSGR by definition) and right hand side (RHS, or OSGR as a function of GSNR) of eq. (19). Points are drawn under different experiment settings. Left: LHS vs. RHS at epochs 20, 100, 500, 2500. Each point is computed at the given epoch under a different model structure (number of channels) or training data size; the red dotted line is the least-squares line of best fit; the blue dotted line is the reference line LHS = RHS; the value of c in each title is the Pearson correlation coefficient between LHS and RHS over the plotted points. Right: the legend. Different symbols and colors stand for different numbers of channels and training data sizes; different random noise levels are not distinguished.

At the beginning of training, the data points are closely distributed along the dashed line corresponding to LHS = RHS. This shows that eq. (19) fits quite well under a variety of different settings. As training proceeds, the points become more scattered because the non-overfitting limit approximation no longer holds, but the correlation between the LHS and RHS remains high even when training converges (at epoch 2,500). We also conducted the same experiment on CIFAR10 (Appendix A.2) and a toy dataset (Appendix A.3) and observed the same behavior.

The empirical evidence, together with our derivation of eq. (19), clearly shows the relation between GSNR and OSGR and its implication for the model's generalization ability.

3 TRAINING DYNAMICS OF DNNS NATURALLY LEADS TO LARGE GSNR

In this section, we analyze and explain an interesting phenomenon: the parameters' GSNR of DNNs rises in the early stage of training, whereas the GSNR of shallow models such as logistic regression or support vector machines declines during the entire training process. This difference keeps GSNR large during training, which in turn is associated with good generalization. We analyze the dynamics behind this phenomenon both experimentally and theoretically.

3.1 GSNR BEHAVIOR OF DNNS TRAINING

For shallow models, the GSNR of parameters decreases over the whole training process because gradients become small as learning converges. For DNNs this is not the case. We trained DNNs on the CIFAR datasets and computed the GSNR averaged over all model parameters. Because $E_{D\sim\mathcal{Z}^n}(g^2_{D,j}) = \frac{1}{n}\rho^2_j + \tilde{g}^2_j$ and we assume $n$ is large, $E_{D\sim\mathcal{Z}^n}(g^2_{D,j}) \approx \tilde{g}^2_j$. In the case of a single large training dataset, we estimate the GSNR at the $t$-th iteration by

$$r_{j,t} \approx g^2_{D,j,t} / \rho^2_{D,j,t}. \qquad (25)$$

As shown in Figure 4, the GSNR starts out low with randomly initialized parameters. As learning progresses, the GSNR increases in the early training stage and stays at a high level throughout the learning process. For each model parameter, we also computed the proportion of samples with the same gradient sign, denoted $p_{same\_sign}$. In Figure 4c, we plot the mean over all parameters of this proportion as a time series. It increases from about 50% (half positive, half negative, due to random initialization) to about 56%, which indicates that for most parameters the gradient signs on different samples become more consistent. This is because meaningful features begin to emerge during learning, and the gradients of the weights on these features tend to have the same sign across different samples.

Previous research (Zhang et al., 2016) showed that DNNs can achieve zero training loss by memorizing training samples even when the labels are randomized. We also plot the average GSNR for a model trained on data with randomized labels in Figure 4 and find that the GSNR stays at a low level throughout training. Although the training losses with both the original and the randomized labels go to zero (not shown), the GSNR curves clearly distinguish the two cases and reveal the lack of meaningful patterns in the latter. We believe this is why DNNs trained on real versus random data exhibit completely different generalization behaviors.

Figure 4: (a): GSNR curves of a simple network on real and random data; an obvious upward trend in the early training stage is observed for real data only. (b): The same plot for ResNet18. (c): Average of $p_{same\_sign}$ for the same model as in (a).
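For reference, a short sketch (our helper, assuming the per-sample gradients of one step have been stacked into a matrix beforehand) of computing $p_{same\_sign}$:

```python
import torch

def p_same_sign(per_sample_grads):
    """Fraction of samples sharing the majority gradient sign, per parameter.
    per_sample_grads: (num_samples, num_params) tensor of g_i(theta_j)."""
    pos = (per_sample_grads > 0).float().mean(dim=0)
    return torch.maximum(pos, 1.0 - pos)  # ~0.5 at init, rises as gradients align
```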
3.2 TRAINING DYNAMICS BEHIND THE GSNR BEHAVIOR

In this section we show that the feature learning ability of DNNs is the key reason why the GSNR curves of DNNs behave differently from those of shallow models during gradient descent training. To demonstrate this, we construct a simple two-layer perceptron regression model. A synthetic dataset is generated as follows: each data point is constructed i.i.d. using $y = x_1 x_2 + \epsilon$, where $x_1$ and $x_2$ are drawn from a uniform distribution on $[-1, 1]$ and $\epsilon$ is drawn from a narrow zero-centered uniform noise distribution. The training set and test set sizes are 200 and 10,000, respectively. We use a very simple two-layer MLP with 2 inputs, 20 hidden neurons and 1 output.

We randomly initialize the model parameters and train the model on the synthetic training dataset. As a control setup, we also freeze the model weights of the first layer to prevent it from learning features. Note that a two-layer MLP with the first layer frozen is equivalent to a linear regression model: regression weights are learned on the second layer using fixed features extracted by the first layer.
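A minimal PyTorch sketch of this setup (our re-creation; the exact noise scale is an assumption, as the original value is not given above):

```python
import torch
import torch.nn as nn

def make_data(n, noise=0.05):
    """y = x1 * x2 + eps with x1, x2 ~ U[-1, 1]; the noise scale is our guess."""
    x = torch.rand(n, 2) * 2 - 1
    y = x[:, :1] * x[:, 1:] + noise * (torch.rand(n, 1) * 2 - 1)
    return x, y

def make_mlp(freeze_first=False):
    """2-20-1 MLP; freezing the first layer disables feature learning, making
    the model a linear regression on fixed random features."""
    net = nn.Sequential(nn.Linear(2, 20), nn.ReLU(), nn.Linear(20, 1))
    if freeze_first:
        for p in net[0].parameters():
            p.requires_grad = False
    return net
```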
We plot the average GSNR of the second-layer parameters for both the frozen and non-frozen cases. Figure 5 shows that in the non-frozen case, the average GSNR over the parameters of the second layer exhibits a significant upward trend, whereas in the frozen case the average GSNR decreases at the beginning and remains at a low level during the whole training process.

GSNR curves of individual second-layer parameters in the non-frozen case are also shown in Figure 5; for some parameters the GSNR shows a significant upward trend. To measure the quality of the learned features, we computed the Pearson correlation between them and the target output $y$, both at the beginning of training and at the maximum point of their GSNR curves. As shown in Table 1, the learning process produces "good" features (with stronger correlation with $y$) from randomly initialized ones. This shows that the GSNR increase is related to feature learning.

Figure 5: Average GSNR (a) and loss (b) curves for the frozen and non-frozen cases. (c): GSNR curves of individual parameters in the non-frozen case.

3.3 ANALYSIS OF TRAINING DYNAMICS BEHIND DNNS' GSNR BEHAVIOR

In this section, we investigate the training dynamics behind the GSNR curve behavior. For fully connected network structures, we can analytically show that the numerator of the GSNR, i.e. the squared gradient mean of the model parameters, tends to increase in the early training stage through feature learning.

Consider a fully connected network with parameters $\theta = \{W^{(1)}, b^{(1)}, ..., W^{(l_{max})}, b^{(l_{max})}\}$, where $W^{(1)}, b^{(1)}$ are the weight matrix and bias of the first layer, and so on. We denote the activations of the $l$-th layer as $a^{(l)} = \{a^{(l)}_s(\theta^{(l-)})\}$, where $s$ indexes the nodes/channels of this layer and $\theta^{(l-)}$ is the collection of model parameters in the layers before $l$, i.e. $\theta^{(l-)} = \{W^{(1)}, b^{(1)}, ..., W^{(l-1)}, b^{(l-1)}\}$. In the forward pass on data sample $i$, $\{a^{(l)}_s(\theta^{(l-)})\}$ is multiplied by the weight matrix $W^{(l)}$:

$$o^{(l)}_{i,c} = \sum_s W^{(l)}_{s,c}\, a^{(l)}_{i,s}(\theta^{(l-)}), \qquad (26)$$

where $o^{(l)} = \{o^{(l)}_{i,c}\}$ is the output of the matrix multiplication for the $i$-th data sample at the $l$-th layer, and $c = \{1, 2, ..., C\}$ indexes the nodes/channels of the $(l+1)$-th layer. We use $g^{(l)}_D$ to denote the average gradient of the weights of the $l$-th layer $W^{(l)}$, i.e. $g^{(l)}_D = \frac{1}{n}\sum_{i=1}^{n} \frac{\partial L_i}{\partial W^{(l)}}$, where $L_i$ is the loss of the $i$-th sample.

Here we show that the feature learning ability of DNNs plays a crucial role in the GSNR increase. More precisely, we show that the learning of the features $a^{(l)}(\theta^{(l-)})$, i.e. the learning of the parameters $\theta^{(l-)}$, tends to increase the absolute value of $g^{(l)}_D$. Consider the one-step change of the gradient mean $\Delta g^{(l)}_D = g^{(l)}_{D,t+1} - g^{(l)}_{D,t}$ with learning rate $\lambda \to 0$. In one training step, $\theta$ is updated by $\Delta\theta = \theta_{t+1} - \theta_t = -\lambda g_D(\theta)$. Using a linear approximation with $\lambda \to 0$, we have

$$\Delta g^{(l)}_{D,s,c} \approx \sum_j \frac{\partial g^{(l)}_{D,s,c}}{\partial \theta_j} \Delta\theta_j = \sum_{\theta_j \in \theta^{(l-)}} \frac{\partial g^{(l)}_{D,s,c}}{\partial \theta_j} \Delta\theta_j + \sum_{\theta_j \in \theta^{(l+)}} \frac{\partial g^{(l)}_{D,s,c}}{\partial \theta_j} \Delta\theta_j, \qquad (27)$$

where $\theta^{(l-)}$ and $\theta^{(l+)}$ denote the model parameters before and after the $l$-th layer (including the $l$-th), respectively.

We focus on the first term of eq. (27), i.e. the one-step change of $g^{(l)}_D$ caused by learning $\theta^{(l-)}$. Substituting $g^{(l)}_D = \frac{1}{n}\sum_{i=1}^{n} \frac{\partial L_i}{\partial W^{(l)}}$ and $\Delta\theta_j = -\frac{\lambda}{n}\sum_{i=1}^{n} \frac{\partial L_i}{\partial \theta_j}$ into eq. (27), we have

$$\Delta g^{(l)}_{D,s,c} = -\frac{\lambda}{n^2} \sum_{\theta_j \in \theta^{(l-)}} W^{(l)}_{s,c} \Big(\sum_{i=1}^{n} \frac{\partial L_i}{\partial o^{(l)}_{i,c}} \frac{\partial a^{(l)}_{i,s}}{\partial \theta_j}\Big)^2 + \text{other terms}. \qquad (28)$$

The detailed derivation of eq. (28) can be found in Appendix B. The first term (a summation over the parameters in $\theta^{(l-)}$) in eq. (28) has the opposite sign to $W^{(l)}_{s,c}$, and therefore makes $\Delta g^{(l)}_{D,s,c}$ negatively correlated with $W^{(l)}_{s,c}$. We plot the correlation between $\Delta g^{(l)}_{D,s,c}$ and $W^{(l)}_{s,c}$ for a model trained on MNIST for 200 epochs in Figure 6a. In the early training stage they are indeed negatively correlated; for the top-10% of weights with the largest absolute values, the negative correlation is even more significant.
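A sketch of measuring this correlation, assuming snapshots of a layer's weight matrix and of its mean gradient before and after one step (all names are ours):

```python
import torch

def corr_dg_w(w, g_before, g_after, top_frac=None):
    """Pearson correlation between Delta g = g_after - g_before and the
    weights W of the same layer (cf. Figure 6a). Optionally restrict the
    statistic to the top fraction of weights by magnitude."""
    dg = (g_after - g_before).flatten()
    w = w.flatten()
    if top_frac is not None:
        idx = w.abs().topk(int(top_frac * w.numel())).indices
        dg, w = dg[idx], w[idx]
    dg, w = dg - dg.mean(), w - w.mean()           # center both variables
    return (dg * w).sum() / (dg.norm() * w.norm() + 1e-12)
```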
This negative correlation between $\Delta g^{(l)}_{D,s,c}$ and $W^{(l)}_{s,c}$ tends to increase the absolute value of $g^{(l)}_D$ through an interesting mechanism. Consider the weights $W^{(l)}_{s,c}$ with $\{W^{(l)}_{s,c} > 0,\, g^{(l)}_{D,s,c} < 0\}$. Learning $\theta^{(l-)}$ decreases $g^{(l)}_{D,s,c}$ and thus increases its absolute value, because the first term in eq. (28) is negative. On the other hand, learning $W^{(l)}_{s,c}$ increases $W^{(l)}_{s,c}$ and its absolute value, because $\Delta W^{(l)}_{s,c} = -\lambda g^{(l)}_{D,s,c}$ is positive. This forms a positive feedback loop in which the numerator of the GSNR, $(g^{(l)}_{D,s,c})^2$, increases, and so does the GSNR. A similar analysis applies to the case $\{W^{(l)}_{s,c} < 0,\, g^{(l)}_{D,s,c} > 0\}$.

On the other hand, when $\{W^{(l)}_{s,c}\, g^{(l)}_{D,s,c} > 0\}$, the weights tend to move into the earlier case $\{W^{(l)}_{s,c}\, g^{(l)}_{D,s,c} < 0\}$ during training. Consider the case $\{W^{(l)}_{s,c} > 0,\, g^{(l)}_{D,s,c} > 0\}$: the first term in eq. (28) is negative, so learning $\theta^{(l-)}$ tends to decrease $g^{(l)}_{D,s,c}$ or even flip its sign. Another possibility is that learning $W^{(l)}_{s,c}$ flips the sign of $W^{(l)}_{s,c}$, because $\Delta W^{(l)}_{s,c} = -\lambda g^{(l)}_{D,s,c}$ is negative. In both cases the weights move into the earlier case $\{W^{(l)}_{s,c}\, g^{(l)}_{D,s,c} < 0\}$. A similar analysis applies to the case $\{W^{(l)}_{s,c} < 0,\, g^{(l)}_{D,s,c} < 0\}$.

Therefore $\{W^{(l)}_{s,c}\, g^{(l)}_{D,s,c} < 0\}$ is the more stable state in the training process. For a simple model trained on MNIST, we plot the proportion of weights satisfying $\{W^{(l)}_{s,c}\, g^{(l)}_{D,s,c} < 0\}$ in Figure 6b and find that there are indeed more weights with $\{W^{(l)}_{s,c}\, g^{(l)}_{D,s,c} < 0\}$ than the opposite. Because weights with small absolute values easily flip sign during training, we also plot this proportion for the top-10% of weights with the largest absolute values. For these large weights, nearly 80% have a sign opposite to their gradient mean, confirming our earlier analysis, and the numerator of the GSNR, $(g^{(l)}_{D,s,c})^2$, tends to increase through the positive feedback process discussed above.
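A small helper (ours) for reproducing the Figure 6b statistic, i.e. the fraction of weights whose sign is opposite to that of their mean gradient, optionally restricted to the largest weights:

```python
import torch

def frac_opposite_sign(weight, grad_mean, top_frac=None):
    """Proportion of weights with sign(W) != sign(g_D), cf. Figure 6b."""
    w, g = weight.flatten(), grad_mean.flatten()
    if top_frac is not None:
        idx = w.abs().topk(int(top_frac * w.numel())).indices
        w, g = w[idx], g[idx]
    return (w * g < 0).float().mean()
```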
Figure 6: MNIST experiments. Left: correlation between $\Delta g^{(l)}_{D,s,c}$ and $W^{(l)}_{s,c}$. Right: ratio of weights whose sign is opposite to that of their gradient mean.

Table 1: Pearson correlation between features and the target output $y$, where $c_t$ and $c_{t_{max}}$ are the correlations at the beginning of training and at the maximum of the GSNR curve, respectively.

4 SUMMARY

In this paper, we analyzed the role of model parameters' GSNR in deep neural networks' generalization ability. We showed that large GSNR is a key to a small generalization gap, and that gradient descent training naturally incurs and exploits large GSNR as the model discovers useful features during learning.

REFERENCES

Madhu S. Advani and Andrew M. Saxe. High-dimensional dynamics of generalization error in neural networks. arXiv preprint arXiv:1710.03667, 2017.

Sanjeev Arora, Rong Ge, Behnam Neyshabur, and Yi Zhang. Stronger generalization bounds for deep nets via a compression approach. arXiv preprint arXiv:1802.05296, 2018.

Peter L. Bartlett and Shahar Mendelson. Rademacher and Gaussian complexities: Risk bounds and structural results. Journal of Machine Learning Research, 3:463–482, 2002.

Laurent Dinh, Razvan Pascanu, Samy Bengio, and Yoshua Bengio. Sharp minima can generalize for deep nets. In Proceedings of the 34th International Conference on Machine Learning, Volume 70, pp. 1019–1028. JMLR.org, 2017.

Gintare Karolina Dziugaite and Daniel M. Roy. Computing nonvacuous generalization bounds for deep (stochastic) neural networks with many more parameters than training data. arXiv preprint arXiv:1703.11008, 2017.

Stanislav Fort, Paweł Krzysztof Nowak, Stanislaw Jastrzebski, and Srini Narayanan. Stiffness: A new perspective on generalization in neural networks. arXiv preprint arXiv:1901.09491, 2019.

Alex Graves. Generating sequences with recurrent neural networks. arXiv preprint arXiv:1308.0850, 2013.

Moritz Hardt, Benjamin Recht, and Yoram Singer. Train faster, generalize better: Stability of stochastic gradient descent. arXiv preprint arXiv:1509.01240, 2015.

Elad Hoffer, Itay Hubara, and Daniel Soudry. Train longer, generalize better: Closing the generalization gap in large batch training of neural networks. In Advances in Neural Information Processing Systems, pp. 1731–1741, 2017.

Daniel Jakubovitz, Raja Giryes, and Miguel R. D. Rodrigues. Generalization error in deep learning. 2019.

Kenji Kawaguchi, Leslie Pack Kaelbling, and Yoshua Bengio. Generalization in deep learning. arXiv preprint arXiv:1710.05468, 2017.

Nitish Shirish Keskar, Dheevatsa Mudigere, Jorge Nocedal, Mikhail Smelyanskiy, and Ping Tak Peter Tang. On large-batch training for deep learning: Generalization gap and sharp minima. arXiv preprint arXiv:1609.04836, 2016.

Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

Behnam Neyshabur, Srinadh Bhojanapalli, David McAllester, and Nati Srebro. Exploring generalization in deep learning. In Advances in Neural Information Processing Systems, pp. 5947–5956, 2017.

Roman Novak, Yasaman Bahri, Daniel A. Abolafia, Jeffrey Pennington, and Jascha Sohl-Dickstein. Sensitivity and generalization in neural networks: An empirical study. arXiv preprint arXiv:1802.08760, 2018.

Tom Rainforth, Adam R. Kosiorek, Tuan Anh Le, Chris J. Maddison, Maximilian Igl, Frank Wood, and Yee Whye Teh. Tighter variational bounds are not necessarily better. arXiv preprint arXiv:1802.04537, 2018.

Vladimir N. Vapnik and A. Ja. Chervonenkis. The necessary and sufficient conditions for consistency of the method of empirical risk. Pattern Recognition and Image Analysis, 1(3):284–305, 1991.

Matthew D. Zeiler. ADADELTA: An adaptive learning rate method. arXiv preprint arXiv:1212.5701, 2012.

Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep learning requires rethinking generalization. arXiv preprint arXiv:1611.03530, 2016.

A APPENDIX A

A.1 MODEL STRUCTURE IN SECTION 2.4

The number of channels p takes the values described in Section 2.4.

Table 2: Model structure on MNIST in Section 2.4; p is the number of channels and q a proportionally larger channel count.

Layer                    | Input | Output
conv + relu + maxpooling | 1     | p
conv + relu + maxpooling | p     | q
flatten                  | -     | -
fc + relu                | 16*q  | 10*q
fc + relu                | 10*q  | 10
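Assuming 5x5 convolutions without padding (an inference consistent with the 16*q flattened size for 28x28 MNIST inputs), a sketch of the Table 2 network could look like the following; the kernel size and the 10-way output are our inferences, not stated in the table:

```python
import torch.nn as nn

def mnist_cnn(p, q, num_classes=10):
    """Sketch of the Table 2 model: two Conv-ReLU-MaxPool blocks and two FC
    layers. Spatial sizes: 28 -> 24 -> 12 -> 8 -> 4, so flatten gives 16*q."""
    return nn.Sequential(
        nn.Conv2d(1, p, kernel_size=5), nn.ReLU(), nn.MaxPool2d(2),
        nn.Conv2d(p, q, kernel_size=5), nn.ReLU(), nn.MaxPool2d(2),
        nn.Flatten(),
        nn.Linear(16 * q, 10 * q), nn.ReLU(),
        nn.Linear(10 * q, num_classes),
    )
```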
A.2 EXPERIMENT ON CIFAR10

Unlike the experiment on MNIST, we use a deeper network on CIFAR10. We also include Batch Normalization (BN) layers, because we find it difficult for the network to converge without them. The network consists of four Conv-BN-ReLU-Conv-BN-ReLU-MaxPooling blocks (the first convolution maps the 3 input channels to p channels, where p is the number of channels) followed by flatten and three fully-connected layers; see Table 3.

We vary the training set size n, the label noise probability $p_{random}$, and the number of channels ch, as in the MNIST experiment. We use full-batch gradient descent (not SGD) with a small learning rate. The left and right hand sides of eq. (19) at different epochs are shown in Figure 7, where each point represents one specific combination of the above settings. Note that at the evaluation step of every epoch, we use the same means and variances inside the BN layers as for the training dataset, to ensure that the network and loss function are consistent between training and test.

Figure 7: Left hand side (LHS) and right hand side (RHS) of eq. (19) on CIFAR10. Points are drawn under different experiment settings; LHS vs. RHS at epochs 20, 100, 500, 1000.

At the beginning of training, compared to MNIST, the data points no longer lie perfectly on the diagonal dashed line. We suppose this is because of the BN layers, whose internal parameters, i.e. the running mean and running variance, are not regular learnable parameters in the optimization process but change their values in a different way. Their change affects the OSGR, yet we could not include them in the estimation of the OSGR. However, a strong positive correlation between the left and right hand sides of eq. (19) is observed until the training begins to converge.

A.3 EXPERIMENT ON TOY DATASET

In this section we use a simple two-layer regression model consisting of an FC-ReLU structure with only 2 inputs, one hidden layer with N neurons and 1 output. A synthetic dataset similar to the training data used in the experiment of Section 3.2 is generated as follows: each data point is constructed i.i.d. using $y = x_1 x_2 + \epsilon$, where $x_1$ and $x_2$ are drawn from a uniform distribution on $[-1, 1]$ and $\epsilon$ is drawn from a uniform distribution on $[-\eta_{noise}, \eta_{noise}]$.

To estimate eq. (24), we randomly generate 100 training sets with n samples each, i.e. M = 100, and a test set with 20,000 samples. To cover different conditions, we (1) vary the training set size n; (2) inject noise by varying $\eta_{noise}$; and (3) perturb the model structure by varying the hidden width N. We use gradient descent with a learning rate of 0.001.

Figure 8: Similar to Fig. 3, but for the toy regression model discussed in Appendix A.3.

Figure 8 shows a behavior similar to Fig. 3. During the early training stage, the LHS and RHS of eq. (19) are very close. Their high correlation persists until training converges, after which the RHS of eq. (19) decreases significantly.

B APPENDIX B

Derivation of eq. (28):

$$\Delta g^{(l)}_{D,s,c} = \sum_{\theta_j \in \theta^{(l-)}} \frac{\partial g^{(l)}_{D,s,c}}{\partial \theta_j} \Delta\theta_j + \text{other terms} \qquad (29)$$

$$= \sum_{\theta_j \in \theta^{(l-)}} \frac{\partial \big(\frac{1}{n}\sum_{i=1}^{n} \frac{\partial L_i}{\partial W^{(l)}_{s,c}}\big)}{\partial \theta_j} \Big(-\frac{\lambda}{n}\sum_{i=1}^{n} \frac{\partial L_i}{\partial \theta_j}\Big) + \text{other terms} \qquad (30)$$

$$= \sum_{\theta_j \in \theta^{(l-)}} \frac{\partial \big(\frac{1}{n}\sum_{i=1}^{n} \frac{\partial L_i}{\partial o^{(l)}_{i,c}} \frac{\partial o^{(l)}_{i,c}}{\partial W^{(l)}_{s,c}}\big)}{\partial \theta_j} \Big(-\frac{\lambda}{n}\sum_{i=1}^{n} \sum_{s',c'} \frac{\partial L_i}{\partial o^{(l)}_{i,c'}} \frac{\partial o^{(l)}_{i,c'}}{\partial a^{(l)}_{i,s'}} \frac{\partial a^{(l)}_{i,s'}}{\partial \theta_j}\Big) + \text{other terms} \qquad (31)$$

$$= -\frac{\lambda}{n^2} \sum_{\theta_j \in \theta^{(l-)}} \frac{\partial \big(\sum_{i=1}^{n} \frac{\partial L_i}{\partial o^{(l)}_{i,c}} a^{(l)}_{i,s}\big)}{\partial \theta_j} \Big(\sum_{i=1}^{n} \sum_{s',c'} \frac{\partial L_i}{\partial o^{(l)}_{i,c'}} W^{(l)}_{s',c'} \frac{\partial a^{(l)}_{i,s'}}{\partial \theta_j}\Big) + \text{other terms} \qquad (32)$$

$$= -\frac{\lambda}{n^2} \sum_{\theta_j \in \theta^{(l-)}} \sum_{i=1}^{n}\Big(\frac{\partial L_i}{\partial o^{(l)}_{i,c}} \frac{\partial a^{(l)}_{i,s}}{\partial \theta_j} + \frac{\partial^2 L_i}{\partial o^{(l)}_{i,c}\, \partial \theta_j}\, a^{(l)}_{i,s}\Big)\Big(\sum_{s',c'} W^{(l)}_{s',c'} \sum_{i=1}^{n} \frac{\partial L_i}{\partial o^{(l)}_{i,c'}} \frac{\partial a^{(l)}_{i,s'}}{\partial \theta_j}\Big) + \text{other terms} \qquad (33)$$

Above we used $\frac{\partial o^{(l)}_{i,c'}}{\partial a^{(l)}_{i,s'}} = W^{(l)}_{s',c'}$ and $\frac{\partial o^{(l)}_{i,c}}{\partial W^{(l)}_{s,c}} = a^{(l)}_{i,s}$, which both follow from eq. (26). Consider the first term of eq. (33). Collecting the contribution with $s' = s$, $c' = c$, we have

$$\Delta g^{(l)}_{D,s,c} = -\frac{\lambda}{n^2} \sum_{\theta_j \in \theta^{(l-)}} W^{(l)}_{s,c} \Big(\sum_{i=1}^{n} \frac{\partial L_i}{\partial o^{(l)}_{i,c}} \frac{\partial a^{(l)}_{i,s}}{\partial \theta_j}\Big)^2 + \text{other terms}. \qquad (34)$$

Note that the term involving $\frac{\partial^2 L_i}{\partial o^{(l)}_{i,c}\, \partial \theta_j}\, a^{(l)}_{i,s}$ and the terms with $s' \neq s$ or $c' \neq c$ in eq. (33) are merged into the "other terms" of eq. (34).

C APPENDIX C

Notations
$\mathcal{Z}$: a data distribution on $\mathcal{X} \times \mathcal{Y}$
$s$ or $(x, y)$: a single data sample
$D$: training set consisting of $n$ samples drawn from $\mathcal{Z}$
$D'$: test set consisting of $n'$ samples drawn from $\mathcal{Z}$
$\theta$: model parameters, whose components are denoted $\theta_j$
$g_s(\theta)$ or $g_i(\theta)$: parameters' gradient w.r.t. a single data sample $s$ or $(x_i, y_i)$
$\tilde{g}(\theta)$: mean of the parameters' gradient over the whole data distribution, i.e. $E_{s\sim\mathcal{Z}}(g_s(\theta))$
$g_D(\theta)$: average gradient over the training dataset, i.e. $\frac{1}{n}\sum_{i=1}^{n} g_i(\theta)$
$g_{D'}(\theta)$: average gradient over the test dataset, i.e. $\frac{1}{n'}\sum_{i=1}^{n'} g'_i(\theta)$; note that in eq. (5) we assume $n' = n$
$g_{D,j}$: same as $g_D(\theta_j)$
$\rho^2(\theta)$: variance of the parameters' gradient over single samples, i.e. $\mathrm{Var}_{s\sim\mathcal{Z}}(g_s(\theta))$
$\rho^2_j$: same as $\rho^2(\theta_j)$
$\sigma^2(\theta)$: variance of the average gradient over a training dataset of size $n$, i.e. $\mathrm{Var}_{D\sim\mathcal{Z}^n}[g_D(\theta)]$
$\sigma^2_j$: same as $\sigma^2(\theta_j)$
$r_j$ or $r(\theta_j)$: gradient signal to noise ratio (GSNR) of model parameter $\theta_j$
$L[D]$: empirical training loss, i.e. $\frac{1}{n}\sum_{i=1}^{n} L(y_i, f(x_i, \theta))$
$L[D']$: empirical test loss, i.e. $\frac{1}{n'}\sum_{i=1}^{n'} L(y'_i, f(x'_i, \theta))$
$\Delta L[D]$: one-step training loss decrease
$\Delta L_j[D]$: one-step training loss decrease caused by updating one parameter $\theta_j$
$R(\mathcal{Z}, n)$: one-step generalization ratio (OSGR) for training and test sets of size $n$ sampled from $\mathcal{Z}$, i.e. $E_{D,D'\sim\mathcal{Z}^n}(\Delta L[D']) / E_{D\sim\mathcal{Z}^n}(\Delta L[D])$
$\lambda$: learning rate
$\nabla$: one-step generalization gap increment, i.e. $\Delta L[D] - \Delta L[D']$
$\epsilon$, $\epsilon'$: random variables with zero mean and variance $\sigma^2(\theta)$
$W^{(l)}$ and $b^{(l)}$: model parameters (weight matrix and bias) of the $l$-th layer
$\theta^{(l-)}$: collection of model parameters over all layers before the $l$-th layer
$\theta^{(l+)}$: collection of model parameters over all layers after the $l$-th layer, including the $l$-th layer
$g^{(l)}_D$: average gradient of $W^{(l)}$ over the training dataset
$a^{(l)} = \{a^{(l)}_s(\theta^{(l-)})\}$: activations of the $l$-th layer, where $s = \{1, 2, ..., S\}$ indexes the nodes/channels of the $l$-th layer
$o^{(l)} = \{o^{(l)}_c\}$: outputs of the matrix multiplication of the $l$-th layer, where $c = \{1, 2, ..., C\}$ indexes the nodes/channels of the $(l+1)$-th layer
$a^{(l)}_{i,s}$ and $o^{(l)}_{i,c}$: $a^{(l)}_s$ and $o^{(l)}_c$ evaluated on data sample $i$