Understanding Why Neural Networks Generalize Well Through GSNR of Parameters
Jinlong Liu, Guoqing Jiang, Yunzhi Bai, Ting Chen, Huayan Wang
Published as a conference paper at ICLR 2020

Ytech, KWAI Inc.: {liujinlong,jiangguoqing,baiyunzhi,wanghuayan}@kuaishou.com
Samsung Research China, Beijing (SRC-B): [email protected]
Jinlong Liu is the corresponding author.

ABSTRACT
As deep neural networks (DNNs) achieve tremendous success across many application domains, researchers have explored from many angles why they generalize well. In this paper, we provide a novel perspective on these issues using the gradient signal to noise ratio (GSNR) of parameters during the training process of DNNs. The GSNR of a parameter is defined as the ratio between its gradient's squared mean and variance over the data distribution. Based on several approximations, we establish a quantitative relationship between model parameters' GSNR and the generalization gap. This relationship indicates that larger GSNR during the training process leads to better generalization performance. Moreover, we show that, unlike shallow models (e.g., logistic regression, support vector machines), the gradient descent optimization dynamics of DNNs naturally produces large GSNR during training, which is probably the key to DNNs' remarkable generalization ability.
1 INTRODUCTION
Deep neural networks typically contain far more trainable parameters than training samples, which would seem to invite poor generalization. In fact, however, they usually exhibit remarkably small generalization gaps. Traditional generalization theories such as VC dimension (Vapnik & Chervonenkis, 1991) or Rademacher complexity (Bartlett & Mendelson, 2002) cannot explain this mechanism. Extensive research focuses on the generalization ability of DNNs (Neyshabur et al., 2017; Arora et al., 2018; Keskar et al., 2016; Dinh et al., 2017; Hoffer et al., 2017; Novak et al., 2018; Dziugaite & Roy, 2017; Jakubovitz et al., 2019; Kawaguchi et al., 2017; Advani & Saxe, 2017). Unlike that of shallow models such as logistic regression or support vector machines, the global minimum of high-dimensional and non-convex DNNs cannot be found analytically, but can only be approximated by gradient descent and its variants (Zeiler, 2012; Kingma & Ba, 2014; Graves, 2013). Previous work (Zhang et al., 2016; Hardt et al., 2015; Dziugaite & Roy, 2017) suggests that the generalization ability of DNNs is closely related to gradient descent optimization. For example, Hardt et al. (2015) claim that any model trained with stochastic gradient descent (SGD) for a reasonable number of epochs exhibits a small generalization error; their analysis is based on the smoothness of the loss function. In this work, we attempt to understand the generalization behavior of DNNs through GSNR and reveal how GSNR affects the training dynamics of gradient descent. Stanislav Fort et al. (2019) studied a new gradient alignment measure called stiffness in order to better understand generalization; stiffness is related to our work.

The GSNR of a parameter is defined as the ratio between its gradient's squared mean and variance over the data distribution. Previous work has used GSNR for theoretical analysis of deep learning. For example, Rainforth et al. (2018) used GSNR to analyze variational bounds in unsupervised DNNs such as the variational auto-encoder (VAE). Here we focus on analyzing the relation between GSNR and the generalization gap.

Figure 1: Schematic diagram of the sample-wise parameter gradient distribution corresponding to greater (Left) and smaller (Right) GSNR. Pink arrows denote the gradient vectors for each sample, while the blue arrow indicates their mean.

Intuitively, GSNR measures the similarity of a parameter's gradients among different training samples. Large GSNR implies that most training samples agree on the optimization direction of this parameter; the parameter is thus more likely to be associated with a meaningful "pattern", and we assume that updating it leads to better generalization. In this work, we prove that GSNR is strongly related to generalization performance, with larger GSNR meaning better generalization.

To reveal the mechanism of DNNs' good generalization ability, we show that the gradient descent optimization dynamics of DNNs naturally leads to large GSNR of model parameters and therefore to good generalization. Furthermore, we give a complete analysis and a detailed interpretation of this phenomenon. We believe this is probably the key to DNNs' remarkable generalization ability.

In the remainder of this paper we first analyze the relation between GSNR and generalization (Section 2). We then show, experimentally and analytically, how the training dynamics lead to large GSNR of model parameters (Section 3).
2 LARGER GSNR LEADS TO BETTER GENERALIZATION
In this section, we establish a quantitative relation between the GSNR of model parameters and the generalization gap, showing that larger GSNR during training leads to better generalization.

2.1 GRADIENT SIGNAL TO NOISE RATIO
Consider a data distribution $\mathcal{Z} = \mathcal{X} \times \mathcal{Y}$, from which each sample $(x, y)$ is drawn; a model $\hat{y} = f(x, \theta)$ parameterized by $\theta$; and a loss function $L$. The parameters' gradient w.r.t. $L$ and sample $(x_i, y_i)$ is denoted by

$$g(x_i, y_i, \theta) \;\text{or}\; g_i(\theta) := \frac{\partial L(y_i, f(x_i, \theta))}{\partial \theta}, \qquad (1)$$

whose $j$-th element is $g_i(\theta_j)$. Note that throughout this paper we always use $i$ to index data examples and $j$ to index model parameters.

Given the data distribution $\mathcal{Z}$, we have the (sample-wise) mean and variance of $g_i(\theta)$, which we denote as $\tilde{g}(\theta) = E_{(x,y)\sim\mathcal{Z}}(g(x, y, \theta))$ and $\rho^2(\theta) = \mathrm{Var}_{(x,y)\sim\mathcal{Z}}(g(x, y, \theta))$, respectively.

The gradient signal to noise ratio (GSNR) of one model parameter $\theta_j$ is defined as

$$r(\theta_j) := \frac{\tilde{g}^2(\theta_j)}{\rho^2(\theta_j)}. \qquad (2)$$

At a particular point of the parameter space, GSNR measures the consistency of a parameter's gradients across different data samples. Figure 1 shows the intuition: if GSNR is large, the per-sample gradients tend to point in similar directions; if GSNR is small, the gradient vectors are scattered.
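As a concrete reference, the following is a minimal PyTorch sketch (ours, not the authors' code) of estimating the per-parameter GSNR of eq. (2) from per-sample gradients; the function and argument names are illustrative.

```python
import torch

def gsnr(model, loss_fn, samples, eps=1e-12):
    """Estimate r(theta_j) = g~(theta_j)^2 / rho^2(theta_j) for every parameter
    by accumulating per-sample gradient statistics over `samples`."""
    params = [p for p in model.parameters() if p.requires_grad]
    s1 = [torch.zeros_like(p) for p in params]   # running sum of gradients
    s2 = [torch.zeros_like(p) for p in params]   # running sum of squared gradients
    n = 0
    for x, y in samples:                         # one (x, y) pair at a time
        model.zero_grad()
        loss_fn(model(x.unsqueeze(0)), y.unsqueeze(0)).backward()
        for a, b, p in zip(s1, s2, params):
            a += p.grad
            b += p.grad ** 2
        n += 1
    out = []
    for a, b in zip(s1, s2):
        mean = a / n                               # sample-wise gradient mean
        var = (b / n - mean ** 2).clamp_min(0.0)   # sample-wise gradient variance
        out.append(mean ** 2 / (var + eps))        # elementwise GSNR, eq. (2)
    return out
```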
Figure 2: Schematic diagram of training behavior with $\mathrm{OSGR}(t) = 0$ (Left), $0 < \mathrm{OSGR}(t) < 1$ (Middle), and $\mathrm{OSGR}(t) \approx 1$ (Right). Note that the Middle scenario is the one most commonly seen in regular tasks.

2.2 ONE-STEP GENERALIZATION RATIO

In this section we introduce a new concept to help measure the generalization performance during gradient descent optimization, which we call the one-step generalization ratio (OSGR). Consider a training set $D = \{(x_1, y_1), ..., (x_n, y_n)\} \sim \mathcal{Z}^n$ with $n$ samples drawn from $\mathcal{Z}$, and a test set $D' = \{(x'_1, y'_1), ..., (x'_{n'}, y'_{n'})\} \sim \mathcal{Z}^{n'}$. In practice we use the loss on $D'$ to measure generalization. For simplicity, we assume the sizes of the training and test datasets are equal, i.e. $n = n'$. We denote the empirical training and test loss as

$$L[D] = \frac{1}{n}\sum_{i=1}^{n} L(y_i, f(x_i, \theta)), \qquad L[D'] = \frac{1}{n}\sum_{i=1}^{n} L(y'_i, f(x'_i, \theta)), \qquad (3)$$

respectively. The empirical generalization gap is then $L[D'] - L[D]$.

In gradient descent optimization, both the training and test loss decrease step by step. We use $\Delta L[D]$ and $\Delta L[D']$ to denote the one-step training and test loss decrease during training, respectively. Consider the ratio between the expectations of $\Delta L[D']$ and $\Delta L[D]$ for one single training step, which we denote as $R(\mathcal{Z}, n)$:

$$R(\mathcal{Z}, n) := \frac{E_{D,D'\sim\mathcal{Z}^n}(\Delta L[D'])}{E_{D\sim\mathcal{Z}^n}(\Delta L[D])}. \qquad (4)$$

Note that this ratio also depends on the current model parameters $\theta$ and the learning rate $\lambda$. We do not include them in the above notation because we will not explicitly model these dependencies, but rather try to quantitatively characterize $R$ for very small $\lambda$ and for $\theta$ at the early stage of training (satisfying Assumption 2.3.1). Also note that the expectation of $\Delta L[D']$ is over both $D$ and $D'$; this is because the optimization step is performed on $D$. We refer to $R(\mathcal{Z}, n)$ as the OSGR of gradient descent optimization. Statistically the training loss decreases faster than the test loss, so $0 < \mathrm{OSGR}(t) < 1$ (Middle panel of Figure 2), which usually results in a non-zero generalization gap at the end of training. If $\mathrm{OSGR}(t)$ is large ($\approx 1$) during the whole training process (Right panel of Figure 2), the generalization gap will be small when training completes, implying good generalization ability of the model. If $\mathrm{OSGR}(t)$ is small ($= 0$), the test loss does not decrease while the training loss drops normally (Left panel of Figure 2), corresponding to a large generalization gap.

2.3 RELATION BETWEEN GSNR AND OSGR

In this section, we derive a relation between the OSGR during training and the GSNR of model parameters. To our knowledge, this is the first time the sample-wise gradient distribution of parameters has been related to the generalization performance of gradient descent optimization.

In gradient descent optimization, we take the average gradient over the training set $D$, which we denote as $g_D(\theta)$. Note that we have used $g_i(\theta)$ to denote the gradient evaluated on one data sample and $\tilde{g}(\theta)$ to denote its expectation over the entire data distribution. Similarly we define $g_{D'}(\theta)$ to be the average gradient over the test set $D'$:

$$g_D(\theta) = \frac{1}{n}\sum_{i=1}^{n} g(x_i, y_i, \theta) = \frac{\partial L[D]}{\partial \theta}, \qquad g_{D'}(\theta) = \frac{1}{n}\sum_{i=1}^{n} g(x'_i, y'_i, \theta) = \frac{\partial L[D']}{\partial \theta}. \qquad (5)$$

Both the training and test datasets are randomly drawn from the same distribution $\mathcal{Z}^n$, so we can treat $g_D(\theta)$ and $g_{D'}(\theta)$ as random variables. At the beginning of the optimization process, $\theta$ is randomly initialized and thus independent of $D$, so $g_D(\theta)$ and $g_{D'}(\theta)$ obey the same distribution. After a period of training, the model parameters begin to fit the training dataset and become a function of $D$, i.e. $\theta = \theta(D)$; the distributions of $g_D(\theta(D))$ and $g_{D'}(\theta(D))$ then become different. However, we choose not to model this dependency and make the following assumption for our analysis:

Assumption 2.3.1 (Non-overfitting limit approximation) The average gradients over the training dataset and the test dataset, $g_D(\theta)$ and $g_{D'}(\theta)$, obey the same distribution.

Obviously the mean of $g_D(\theta)$ and $g_{D'}(\theta)$ is just the mean gradient over the data distribution, $\tilde{g}(\theta)$:

$$E_{D\sim\mathcal{Z}^n}[g_D(\theta)] = E_{D,D'\sim\mathcal{Z}^n}[g_{D'}(\theta)] = \tilde{g}(\theta). \qquad (6)$$

We denote their variance as $\sigma^2(\theta)$, i.e.

$$\mathrm{Var}_{D\sim\mathcal{Z}^n}[g_D(\theta)] = \mathrm{Var}_{D,D'\sim\mathcal{Z}^n}[g_{D'}(\theta)] = \sigma^2(\theta). \qquad (7)$$

Since the per-sample gradients $g_i(\theta)$ are i.i.d., it is straightforward to show that

$$\sigma^2(\theta) = \mathrm{Var}_{D\sim\mathcal{Z}^n}\Big[\frac{1}{n}\sum_{i=1}^{n} g_i(\theta)\Big] = \frac{1}{n}\rho^2(\theta), \qquad (8)$$

where $\sigma^2(\theta)$ is the variance of the average gradient over a dataset of size $n$, and $\rho^2(\theta)$ is the variance of the gradient of a single data sample.
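Eq. (8) is the standard variance-of-the-mean identity; a quick Monte Carlo check (ours, with an arbitrary stand-in gradient distribution) confirms it numerically:

```python
import numpy as np

rng = np.random.default_rng(0)

# Check eq. (8): for i.i.d. per-sample gradients with variance rho^2, the
# dataset-average gradient has variance sigma^2 = rho^2 / n.
rho2, n, trials = 4.0, 50, 100_000
g = rng.normal(loc=1.0, scale=np.sqrt(rho2), size=(trials, n))  # g_i(theta_j)
g_D = g.mean(axis=1)                                            # average over D
print(g_D.var(), rho2 / n)  # both approximately 0.08
```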
In one gradient descent step, the model parameters are updated by $\Delta\theta = \theta_{t+1} - \theta_t = -\lambda g_D(\theta)$, where $\lambda$ is the learning rate. If $\lambda$ is small enough, the one-step training and test loss decreases can be approximated by

$$\Delta L[D] \approx -\Delta\theta \cdot \frac{\partial L[D]}{\partial \theta} + O(\lambda^2) = \lambda\, g_D(\theta) \cdot g_D(\theta) + O(\lambda^2), \qquad (9)$$

$$\Delta L[D'] \approx -\Delta\theta \cdot \frac{\partial L[D']}{\partial \theta} + O(\lambda^2) = \lambda\, g_D(\theta) \cdot g_{D'}(\theta) + O(\lambda^2). \qquad (10)$$

Usually there are some differences between the directions of $g_D(\theta)$ and $g_{D'}(\theta)$, so statistically $\Delta L[D]$ tends to be larger than $\Delta L[D']$ and the generalization gap grows during training. When $\lambda \to 0$, in one single training step the empirical generalization gap increases by $\Delta L[D] - \Delta L[D']$; for simplicity we denote this quantity as $\nabla$:

$$\nabla := \Delta L[D] - \Delta L[D'] \approx \lambda\, g_D(\theta) \cdot g_D(\theta) - \lambda\, g_D(\theta) \cdot g_{D'}(\theta) \qquad (11)$$
$$= \lambda(\tilde{g}(\theta) + \epsilon)(\tilde{g}(\theta) + \epsilon - \tilde{g}(\theta) - \epsilon') \qquad (12)$$
$$= \lambda(\tilde{g}(\theta) + \epsilon)(\epsilon - \epsilon'). \qquad (13)$$

Here we replaced the random variables by $g_D(\theta) = \tilde{g}(\theta) + \epsilon$ and $g_{D'}(\theta) = \tilde{g}(\theta) + \epsilon'$, where $\epsilon$ and $\epsilon'$ are random variables with zero mean and variance $\sigma^2(\theta)$. Since $E(\epsilon') = E(\epsilon) = 0$ and $\epsilon$, $\epsilon'$ are independent, the expectation of $\nabla$ is

$$E_{D,D'\sim\mathcal{Z}^n}(\nabla) = E(\lambda\,\epsilon \cdot \epsilon) + O(\lambda^2) = \lambda\sum_j \sigma^2(\theta_j) + O(\lambda^2), \qquad (14)$$

where $\sigma^2(\theta_j)$ is the variance of the average gradient of the parameter $\theta_j$. For simplicity, when a single model parameter $\theta_j$ is involved, we will use only a subscript $j$ instead of the full notation; for example, we use $\sigma^2_j$, $r_j$, and $g_{D,j}$ to denote $\sigma^2(\theta_j)$, $r(\theta_j)$, and $g_D(\theta_j)$, respectively.

Consider the expectations of $\Delta L[D]$ and $\Delta L[D']$ when $\lambda \to 0$:

$$E_{D\sim\mathcal{Z}^n}(\Delta L[D]) \approx \lambda E_{D\sim\mathcal{Z}^n}(g_D(\theta) \cdot g_D(\theta)) = \lambda\sum_j E_{D\sim\mathcal{Z}^n}(g^2_{D,j}), \qquad (15)$$

$$E_{D,D'\sim\mathcal{Z}^n}(\Delta L[D']) = E_{D,D'\sim\mathcal{Z}^n}(\Delta L[D] - \nabla) \qquad (16)$$
$$\approx \lambda\sum_j \big(E_{D\sim\mathcal{Z}^n}(g^2_{D,j}) - \sigma^2_j\big) \qquad (17)$$
$$= \lambda\sum_j \big(E_{D\sim\mathcal{Z}^n}(g^2_{D,j}) - \rho^2_j/n\big). \qquad (18)$$

Substituting (18) and (15) into (4), we have

$$R(\mathcal{Z}, n) = 1 - \frac{\sum_j \rho^2_j}{n\sum_j E_{D\sim\mathcal{Z}^n}(g^2_{D,j})}. \qquad (19)$$

Although we derived eq. (19) from simplified assumptions, we can empirically verify it by estimating the two sides of the equation on real data. We elaborate on this estimation method in Section 2.4. We can rewrite eq. (19) as

$$R(\mathcal{Z}, n) = 1 - \frac{1}{n}\sum_j \frac{E_{D\sim\mathcal{Z}^n}(g^2_{D,j})}{\sum_{j'} E_{D\sim\mathcal{Z}^n}(g^2_{D,j'})} \cdot \frac{\rho^2_j}{E_{D\sim\mathcal{Z}^n}(g^2_{D,j})} \qquad (20)$$

$$= 1 - \frac{1}{n}\sum_j \frac{E_{D\sim\mathcal{Z}^n}(g^2_{D,j})}{\sum_{j'} E_{D\sim\mathcal{Z}^n}(g^2_{D,j'})} \cdot \frac{1}{r_j + \frac{1}{n}}, \qquad (21)$$

where $E_{D\sim\mathcal{Z}^n}(g^2_{D,j}) = \mathrm{Var}_{D\sim\mathcal{Z}^n}(g_{D,j}) + E^2_{D\sim\mathcal{Z}^n}(g_{D,j}) = \frac{1}{n}\rho^2_j + \tilde{g}^2_j$.

We define $\Delta L_j[D]$ to be the training loss decrease caused by updating $\theta_j$. One can show that when $\lambda$ is very small, $\Delta L_j[D] = \lambda g^2_{D,j} + O(\lambda^2)$. Therefore when $\lambda \to 0$, we have

$$R(\mathcal{Z}, n) = 1 - \frac{1}{n}\sum_j \frac{W_j}{r_j + \frac{1}{n}}, \quad \text{where } W_j := \frac{E_{D\sim\mathcal{Z}^n}(\Delta L_j[D])}{E_{D\sim\mathcal{Z}^n}(\Delta L[D])} \text{ with } \sum_j W_j = 1. \qquad (22)$$

Eq. (22) shows that the GSNR $r_j$ plays a crucial role in the model's generalization ability: the one-step generalization ratio of gradient descent equals one minus the weighted average of $\frac{1}{r_j + 1/n}$ over all model parameters, divided by $n$, where each parameter's weight is proportional to the expected training loss decrease resulting from updating it. This implies that larger GSNR of model parameters during training leads to smaller generalization gap growth and thus better generalization performance of the trained model. Also note that when $n \to \infty$, we have $R(\mathcal{Z}, n) \to 1$, meaning that training on more data helps generalization.
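To make eq. (22) concrete, here is a small numeric sketch (our illustration, not the paper's code) of how the OSGR responds to the GSNR values $r_j$ for fixed weights $W_j$:

```python
import numpy as np

def osgr_from_gsnr(W, r, n):
    """Eq. (22): R(Z, n) = 1 - (1/n) * sum_j W_j / (r_j + 1/n), sum_j W_j = 1."""
    W, r = np.asarray(W, float), np.asarray(r, float)
    return 1.0 - np.sum((W / W.sum()) / (r + 1.0 / n)) / n

print(osgr_from_gsnr([0.5, 0.5], [10.0, 10.0], n=100))  # high GSNR -> ~0.999
print(osgr_from_gsnr([0.5, 0.5], [0.01, 0.01], n=100))  # low GSNR  -> 0.5
```

With large $r_j$ the weighted average is tiny and $R \approx 1$ (test loss tracks training loss); with small $r_j$ a sizable fraction of each training-loss decrease fails to transfer to the test loss.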
2.4 EXPERIMENTAL VERIFICATION OF THE RELATION BETWEEN GSNR AND OSGR

The relation between GSNR and OSGR, i.e. eq. (19) or (22), can be empirically verified on any dataset if (1) the dataset includes enough samples to construct many training sets and a large enough test set, so that we can reliably estimate $\rho^2_j$, $E_{D\sim\mathcal{Z}^n}(g^2_{D,j})$ and the OSGR; (2) the learning rate is small enough; and (3) we are in the early training stage of gradient descent.

To empirically verify eq. (19), we show how to estimate its left and right hand sides, i.e. the OSGR by definition and the OSGR as a function of GSNR. Suppose we have $M$ training sets, each of size $n$, and a test set of size $n'$. We initialize a model, train it separately on the $M$ training sets, and test it on the same test set. For the $t$-th training iteration, we denote the training loss and test loss of the model trained on the $m$-th training dataset as $L^{(m)}_t$ and $L'^{(m)}_t$, respectively. The left hand side, i.e. the OSGR by definition, at the $t$-th iteration can then be estimated by

$$R_t(\mathcal{Z}, n) \approx \frac{\sum_{m=1}^{M} \big(L'^{(m)}_{t+1} - L'^{(m)}_t\big)}{\sum_{m=1}^{M} \big(L^{(m)}_{t+1} - L^{(m)}_t\big)}. \qquad (23)$$

For the model trained on the $m$-th training set, we can compute the $t$-th step average gradient and sample-wise gradient variance of $\theta_j$ on the corresponding training set, denoted as $g_{m,j,t}$ and $\rho^2_{m,j,t}$, respectively. The right hand side of eq. (19) can therefore be estimated by

$$E_{D\sim\mathcal{Z}^n}(g^2_{D,j,t}) \approx \frac{1}{M}\sum_{m=1}^{M} g^2_{m,j,t}, \qquad \rho^2_{j,t} \approx \frac{1}{M}\sum_{m=1}^{M} \rho^2_{m,j,t}. \qquad (24)$$
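A minimal sketch of both estimators, assuming the per-run loss curves and per-run gradient statistics have already been collected into arrays (all names are ours):

```python
import numpy as np

def osgr_lhs(train_loss, test_loss, t):
    """Eq. (23): OSGR by definition at iteration t.
    train_loss, test_loss: (M, T) arrays of L^(m)_t and L'^(m)_t over M runs."""
    num = (test_loss[:, t + 1] - test_loss[:, t]).sum()
    den = (train_loss[:, t + 1] - train_loss[:, t]).sum()
    return num / den

def osgr_rhs(g, rho2, n):
    """Eq. (24) plugged into eq. (19): OSGR as a function of GSNR statistics.
    g, rho2: (M, P) arrays of g_{m,j,t} and rho^2_{m,j,t} for one iteration t."""
    Eg2 = (g ** 2).mean(axis=0)    # estimate of E_D(g_{D,j,t}^2)
    rho2_bar = rho2.mean(axis=0)   # estimate of rho_{j,t}^2
    return 1.0 - rho2_bar.sum() / (n * Eg2.sum())
```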
We performed the above estimations on MNIST with a simple CNN consisting of two Conv-ReLU-MaxPooling blocks and two fully-connected layers. First, to estimate eq. (24) with $M = 10$, we randomly sample 10 training sets of size $n$ and a test set of size 10,000. To cover different conditions, we (1) vary the training set size $n$; (2) inject noise by randomly changing the labels with probability $p_{random}$; and (3) change the model structure by varying the number of channels $ch$ in the layers. See Appendix A for more details of the setup. We use full-batch gradient descent (not SGD) with a small learning rate. The left and right hand sides of eq. (19) at different epochs are shown in Figure 3, where each point represents one specific choice of the above settings.

Figure 3: Left hand side (LHS, or OSGR by definition) and right hand side (RHS, or OSGR as a function of GSNR) of eq. (19). Points are drawn under different experiment settings. Left: LHS vs. RHS at epochs 20, 100, 500, 2500. Each point is computed at the given epoch under a different model structure (number of channels) or training data size; the red dotted line is the least-squares line of best fit; the blue dotted line is the reference line LHS = RHS; the value of c in each title is the Pearson correlation coefficient between LHS and RHS over the plotted points. Right: the legend. Different symbols and colors stand for different numbers of channels and training data sizes; different random noise levels are not distinguished.

At the beginning of training, the data points are closely distributed along the dashed line corresponding to LHS = RHS. This shows that eq. (19) fits quite well under a variety of different settings. As training proceeds, the points become more scattered because the non-overfitting limit approximation no longer holds, but the correlation between the LHS and RHS remains high even when training converges (at epoch 2,500). We also conducted the same experiment on CIFAR10 (Appendix A.2) and a toy dataset (Appendix A.3) and observed the same behavior.

The empirical evidence, together with our derivation of eq. (19), clearly shows the relation between GSNR and OSGR and its implication for the model's generalization ability.

3 TRAINING DYNAMICS OF DNNS NATURALLY LEADS TO LARGE GSNR

In this section, we analyze and explain an interesting phenomenon: the parameters' GSNR of DNNs rises in the early stage of training, whereas the GSNR of shallow models such as logistic regression or support vector machines declines during the entire training process. This difference keeps GSNR large during training, which in turn is associated with good generalization. We analyze the dynamics behind this phenomenon both experimentally and theoretically.

3.1 GSNR BEHAVIOR OF DNNS TRAINING

For shallow models, the GSNR of parameters decreases over the whole training process because gradients become small as learning converges. For DNNs this is not the case. We trained DNNs on the CIFAR datasets and computed the GSNR averaged over all model parameters. Because $E_{D\sim\mathcal{Z}^n}(g^2_{D,j}) = \frac{1}{n}\rho^2_j + \tilde{g}^2_j$ and we assume $n$ is large, $E_{D\sim\mathcal{Z}^n}(g^2_{D,j}) \approx \tilde{g}^2_j$. In the case of a single large training dataset, we estimate the GSNR at the $t$-th iteration by

$$r_{j,t} \approx g^2_{D,j,t} / \rho^2_{D,j,t}. \qquad (25)$$

As shown in Figure 4, the GSNR starts out low with randomly initialized parameters. As learning progresses, the GSNR increases in the early training stage and stays at a high level throughout the learning process. For each model parameter, we also computed the proportion of samples with the same gradient sign, denoted $p_{same\_sign}$. In Figure 4c, we plot the mean over all parameters of this proportion as a time series. It increases from about 50% (half positive, half negative, due to random initialization) to about 56%, which indicates that for most parameters the gradient signs on different samples become more consistent. This is because meaningful features begin to emerge during learning, and the gradients of the weights on these features tend to have the same sign across different samples.

Previous research (Zhang et al., 2016) showed that DNNs can achieve zero training loss by memorizing training samples even when the labels are randomized. We also plot the average GSNR for a model trained on data with randomized labels in Figure 4 and find that the GSNR stays at a low level throughout training. Although the training losses with both the original and the randomized labels go to zero (not shown), the GSNR curves clearly distinguish the two cases and reveal the lack of meaningful patterns in the latter. We believe this is why DNNs trained on real versus random data exhibit completely different generalization behaviors.

Figure 4: (a): GSNR curves of a simple network on real and random data; an obvious upward trend in the early training stage is observed for real data only. (b): The same plot for ResNet18. (c): Average of $p_{same\_sign}$ for the same model as in (a).
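For reference, a short sketch (our helper, assuming the per-sample gradients of one step have been stacked into a matrix beforehand) of computing $p_{same\_sign}$:

```python
import torch

def p_same_sign(per_sample_grads):
    """Fraction of samples sharing the majority gradient sign, per parameter.
    per_sample_grads: (num_samples, num_params) tensor of g_i(theta_j)."""
    pos = (per_sample_grads > 0).float().mean(dim=0)
    return torch.maximum(pos, 1.0 - pos)  # ~0.5 at init, rises as gradients align
```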
3.2 TRAINING DYNAMICS BEHIND THE GSNR BEHAVIOR

In this section we show that the feature learning ability of DNNs is the key reason why the GSNR curves of DNNs behave differently from those of shallow models during gradient descent training. To demonstrate this, we construct a simple two-layer perceptron regression model. A synthetic dataset is generated as follows: each data point is constructed i.i.d. using $y = x_1 x_2 + \epsilon$, where $x_1$ and $x_2$ are drawn from a uniform distribution on $[-1, 1]$ and $\epsilon$ is drawn from a narrow zero-centered uniform noise distribution. The training set and test set sizes are 200 and 10,000, respectively. We use a very simple two-layer MLP with 2 inputs, 20 hidden neurons and 1 output.

We randomly initialize the model parameters and train the model on the synthetic training dataset. As a control setup, we also freeze the model weights of the first layer to prevent it from learning features. Note that a two-layer MLP with the first layer frozen is equivalent to a linear regression model: regression weights are learned on the second layer using fixed features extracted by the first layer.
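A minimal PyTorch sketch of this setup (our re-creation; the exact noise scale is an assumption, as the original value is not given above):

```python
import torch
import torch.nn as nn

def make_data(n, noise=0.05):
    """y = x1 * x2 + eps with x1, x2 ~ U[-1, 1]; the noise scale is our guess."""
    x = torch.rand(n, 2) * 2 - 1
    y = x[:, :1] * x[:, 1:] + noise * (torch.rand(n, 1) * 2 - 1)
    return x, y

def make_mlp(freeze_first=False):
    """2-20-1 MLP; freezing the first layer disables feature learning, making
    the model a linear regression on fixed random features."""
    net = nn.Sequential(nn.Linear(2, 20), nn.ReLU(), nn.Linear(20, 1))
    if freeze_first:
        for p in net[0].parameters():
            p.requires_grad = False
    return net
```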
We plot the average GSNR of the second-layer parameters for both the frozen and non-frozen cases. Figure 5 shows that in the non-frozen case, the average GSNR over the parameters of the second layer exhibits a significant upward trend, whereas in the frozen case the average GSNR decreases at the beginning and remains at a low level during the whole training process.

GSNR curves of individual second-layer parameters in the non-frozen case are also shown in Figure 5; for some parameters the GSNR shows a significant upward trend. To measure the quality of the learned features, we computed the Pearson correlation between them and the target output $y$, both at the beginning of training and at the maximum point of their GSNR curves. As shown in Table 1, the learning process produces "good" features (with stronger correlation with $y$) from randomly initialized ones. This shows that the GSNR increase is related to feature learning.

Figure 5: Average GSNR (a) and loss (b) curves for the frozen and non-frozen cases. (c): GSNR curves of individual parameters in the non-frozen case.

3.3 ANALYSIS OF TRAINING DYNAMICS BEHIND DNNS' GSNR BEHAVIOR

In this section, we investigate the training dynamics behind the GSNR curve behavior. For fully connected network structures, we can analytically show that the numerator of the GSNR, i.e. the squared gradient mean of the model parameters, tends to increase in the early training stage through feature learning.

Consider a fully connected network with parameters $\theta = \{W^{(1)}, b^{(1)}, ..., W^{(l_{max})}, b^{(l_{max})}\}$, where $W^{(1)}, b^{(1)}$ are the weight matrix and bias of the first layer, and so on. We denote the activations of the $l$-th layer as $a^{(l)} = \{a^{(l)}_s(\theta^{(l-)})\}$, where $s$ indexes the nodes/channels of this layer and $\theta^{(l-)}$ is the collection of model parameters in the layers before $l$, i.e. $\theta^{(l-)} = \{W^{(1)}, b^{(1)}, ..., W^{(l-1)}, b^{(l-1)}\}$. In the forward pass on data sample $i$, $\{a^{(l)}_s(\theta^{(l-)})\}$ is multiplied by the weight matrix $W^{(l)}$:

$$o^{(l)}_{i,c} = \sum_s W^{(l)}_{s,c}\, a^{(l)}_{i,s}(\theta^{(l-)}), \qquad (26)$$

where $o^{(l)} = \{o^{(l)}_{i,c}\}$ is the output of the matrix multiplication for the $i$-th data sample at the $l$-th layer, and $c = \{1, 2, ..., C\}$ indexes the nodes/channels of the $(l+1)$-th layer. We use $g^{(l)}_D$ to denote the average gradient of the weights of the $l$-th layer $W^{(l)}$, i.e. $g^{(l)}_D = \frac{1}{n}\sum_{i=1}^{n} \frac{\partial L_i}{\partial W^{(l)}}$, where $L_i$ is the loss of the $i$-th sample.

Here we show that the feature learning ability of DNNs plays a crucial role in the GSNR increase. More precisely, we show that the learning of the features $a^{(l)}(\theta^{(l-)})$, i.e. the learning of the parameters $\theta^{(l-)}$, tends to increase the absolute value of $g^{(l)}_D$. Consider the one-step change of the gradient mean $\Delta g^{(l)}_D = g^{(l)}_{D,t+1} - g^{(l)}_{D,t}$ with learning rate $\lambda \to 0$. In one training step, $\theta$ is updated by $\Delta\theta = \theta_{t+1} - \theta_t = -\lambda g_D(\theta)$. Using a linear approximation with $\lambda \to 0$, we have

$$\Delta g^{(l)}_{D,s,c} \approx \sum_j \frac{\partial g^{(l)}_{D,s,c}}{\partial \theta_j} \Delta\theta_j = \sum_{\theta_j \in \theta^{(l-)}} \frac{\partial g^{(l)}_{D,s,c}}{\partial \theta_j} \Delta\theta_j + \sum_{\theta_j \in \theta^{(l+)}} \frac{\partial g^{(l)}_{D,s,c}}{\partial \theta_j} \Delta\theta_j, \qquad (27)$$

where $\theta^{(l-)}$ and $\theta^{(l+)}$ denote the model parameters before and after the $l$-th layer (including the $l$-th), respectively.

We focus on the first term of eq. (27), i.e. the one-step change of $g^{(l)}_D$ caused by learning $\theta^{(l-)}$. Substituting $g^{(l)}_D = \frac{1}{n}\sum_{i=1}^{n} \frac{\partial L_i}{\partial W^{(l)}}$ and $\Delta\theta_j = -\frac{\lambda}{n}\sum_{i=1}^{n} \frac{\partial L_i}{\partial \theta_j}$ into eq. (27), we have

$$\Delta g^{(l)}_{D,s,c} = -\frac{\lambda}{n^2} \sum_{\theta_j \in \theta^{(l-)}} W^{(l)}_{s,c} \Big(\sum_{i=1}^{n} \frac{\partial L_i}{\partial o^{(l)}_{i,c}} \frac{\partial a^{(l)}_{i,s}}{\partial \theta_j}\Big)^2 + \text{other terms}. \qquad (28)$$

The detailed derivation of eq. (28) can be found in Appendix B. The first term (a summation over the parameters in $\theta^{(l-)}$) in eq. (28) has the opposite sign to $W^{(l)}_{s,c}$, and therefore makes $\Delta g^{(l)}_{D,s,c}$ negatively correlated with $W^{(l)}_{s,c}$. We plot the correlation between $\Delta g^{(l)}_{D,s,c}$ and $W^{(l)}_{s,c}$ for a model trained on MNIST for 200 epochs in Figure 6a. In the early training stage they are indeed negatively correlated; for the top-10% of weights with the largest absolute values, the negative correlation is even more significant.
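A sketch of measuring this correlation, assuming snapshots of a layer's weight matrix and of its mean gradient before and after one step (all names are ours):

```python
import torch

def corr_dg_w(w, g_before, g_after, top_frac=None):
    """Pearson correlation between Delta g = g_after - g_before and the
    weights W of the same layer (cf. Figure 6a). Optionally restrict the
    statistic to the top fraction of weights by magnitude."""
    dg = (g_after - g_before).flatten()
    w = w.flatten()
    if top_frac is not None:
        idx = w.abs().topk(int(top_frac * w.numel())).indices
        dg, w = dg[idx], w[idx]
    dg, w = dg - dg.mean(), w - w.mean()           # center both variables
    return (dg * w).sum() / (dg.norm() * w.norm() + 1e-12)
```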
This negative correlation between $\Delta g^{(l)}_{D,s,c}$ and $W^{(l)}_{s,c}$ tends to increase the absolute value of $g^{(l)}_D$ through an interesting mechanism. Consider the weights $W^{(l)}_{s,c}$ with $\{W^{(l)}_{s,c} > 0,\, g^{(l)}_{D,s,c} < 0\}$. Learning $\theta^{(l-)}$ decreases $g^{(l)}_{D,s,c}$ and thus increases its absolute value, because the first term in eq. (28) is negative. On the other hand, learning $W^{(l)}_{s,c}$ increases $W^{(l)}_{s,c}$ and its absolute value, because $\Delta W^{(l)}_{s,c} = -\lambda g^{(l)}_{D,s,c}$ is positive. This forms a positive feedback loop in which the numerator of the GSNR, $(g^{(l)}_{D,s,c})^2$, increases, and so does the GSNR. A similar analysis applies to the case $\{W^{(l)}_{s,c} < 0,\, g^{(l)}_{D,s,c} > 0\}$.

On the other hand, when $\{W^{(l)}_{s,c}\, g^{(l)}_{D,s,c} > 0\}$, the weights tend to move into the earlier case $\{W^{(l)}_{s,c}\, g^{(l)}_{D,s,c} < 0\}$ during training. Consider the case $\{W^{(l)}_{s,c} > 0,\, g^{(l)}_{D,s,c} > 0\}$: the first term in eq. (28) is negative, so learning $\theta^{(l-)}$ tends to decrease $g^{(l)}_{D,s,c}$ or even flip its sign. Another possibility is that learning $W^{(l)}_{s,c}$ flips the sign of $W^{(l)}_{s,c}$, because $\Delta W^{(l)}_{s,c} = -\lambda g^{(l)}_{D,s,c}$ is negative. In both cases the weights move into the earlier case $\{W^{(l)}_{s,c}\, g^{(l)}_{D,s,c} < 0\}$. A similar analysis applies to the case $\{W^{(l)}_{s,c} < 0,\, g^{(l)}_{D,s,c} < 0\}$.

Therefore $\{W^{(l)}_{s,c}\, g^{(l)}_{D,s,c} < 0\}$ is the more stable state in the training process. For a simple model trained on MNIST, we plot the proportion of weights satisfying $\{W^{(l)}_{s,c}\, g^{(l)}_{D,s,c} < 0\}$ in Figure 6b and find that there are indeed more weights with $\{W^{(l)}_{s,c}\, g^{(l)}_{D,s,c} < 0\}$ than the opposite. Because weights with small absolute values easily flip sign during training, we also plot this proportion for the top-10% of weights with the largest absolute values. For these large weights, nearly 80% have a sign opposite to their gradient mean, confirming our earlier analysis, and the numerator of the GSNR, $(g^{(l)}_{D,s,c})^2$, tends to increase through the positive feedback process discussed above.
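A small helper (ours) for reproducing the Figure 6b statistic, i.e. the fraction of weights whose sign is opposite to that of their mean gradient, optionally restricted to the largest weights:

```python
import torch

def frac_opposite_sign(weight, grad_mean, top_frac=None):
    """Proportion of weights with sign(W) != sign(g_D), cf. Figure 6b."""
    w, g = weight.flatten(), grad_mean.flatten()
    if top_frac is not None:
        idx = w.abs().topk(int(top_frac * w.numel())).indices
        w, g = w[idx], g[idx]
    return (w * g < 0).float().mean()
```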
Figure 6: MNIST experiments. Left: correlation between $\Delta g^{(l)}_{D,s,c}$ and $W^{(l)}_{s,c}$. Right: ratio of weights whose sign is opposite to that of their gradient mean.

Table 1: Pearson correlation between features and the target output $y$, where $c_t$ and $c_{t_{max}}$ are the correlations at the beginning of training and at the maximum of the GSNR curve, respectively.

4 SUMMARY

In this paper, we analyzed the role of model parameters' GSNR in deep neural networks' generalization ability. We showed that large GSNR is a key to a small generalization gap, and that gradient descent training naturally incurs and exploits large GSNR as the model discovers useful features during learning.

REFERENCES

Madhu S. Advani and Andrew M. Saxe. High-dimensional dynamics of generalization error in neural networks. arXiv preprint arXiv:1710.03667, 2017.

Sanjeev Arora, Rong Ge, Behnam Neyshabur, and Yi Zhang. Stronger generalization bounds for deep nets via a compression approach. arXiv preprint arXiv:1802.05296, 2018.

Peter L. Bartlett and Shahar Mendelson. Rademacher and Gaussian complexities: Risk bounds and structural results. Journal of Machine Learning Research, 3:463–482, 2002.

Laurent Dinh, Razvan Pascanu, Samy Bengio, and Yoshua Bengio. Sharp minima can generalize for deep nets. In Proceedings of the 34th International Conference on Machine Learning, Volume 70, pp. 1019–1028. JMLR.org, 2017.

Gintare Karolina Dziugaite and Daniel M. Roy. Computing nonvacuous generalization bounds for deep (stochastic) neural networks with many more parameters than training data. arXiv preprint arXiv:1703.11008, 2017.

Stanislav Fort, Paweł Krzysztof Nowak, Stanislaw Jastrzebski, and Srini Narayanan. Stiffness: A new perspective on generalization in neural networks. arXiv preprint arXiv:1901.09491, 2019.

Alex Graves. Generating sequences with recurrent neural networks. arXiv preprint arXiv:1308.0850, 2013.

Moritz Hardt, Benjamin Recht, and Yoram Singer. Train faster, generalize better: Stability of stochastic gradient descent. arXiv preprint arXiv:1509.01240, 2015.

Elad Hoffer, Itay Hubara, and Daniel Soudry. Train longer, generalize better: Closing the generalization gap in large batch training of neural networks. In Advances in Neural Information Processing Systems, pp. 1731–1741, 2017.

Daniel Jakubovitz, Raja Giryes, and Miguel R. D. Rodrigues. Generalization error in deep learning. 2019.

Kenji Kawaguchi, Leslie Pack Kaelbling, and Yoshua Bengio. Generalization in deep learning. arXiv preprint arXiv:1710.05468, 2017.

Nitish Shirish Keskar, Dheevatsa Mudigere, Jorge Nocedal, Mikhail Smelyanskiy, and Ping Tak Peter Tang. On large-batch training for deep learning: Generalization gap and sharp minima. arXiv preprint arXiv:1609.04836, 2016.

Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

Behnam Neyshabur, Srinadh Bhojanapalli, David McAllester, and Nati Srebro. Exploring generalization in deep learning. In Advances in Neural Information Processing Systems, pp. 5947–5956, 2017.

Roman Novak, Yasaman Bahri, Daniel A. Abolafia, Jeffrey Pennington, and Jascha Sohl-Dickstein. Sensitivity and generalization in neural networks: An empirical study. arXiv preprint arXiv:1802.08760, 2018.

Tom Rainforth, Adam R. Kosiorek, Tuan Anh Le, Chris J. Maddison, Maximilian Igl, Frank Wood, and Yee Whye Teh. Tighter variational bounds are not necessarily better. arXiv preprint arXiv:1802.04537, 2018.

Vladimir N. Vapnik and A. Ja. Chervonenkis. The necessary and sufficient conditions for consistency of the method of empirical risk. Pattern Recognition and Image Analysis, 1(3):284–305, 1991.

Matthew D. Zeiler. ADADELTA: An adaptive learning rate method. arXiv preprint arXiv:1212.5701, 2012.

Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep learning requires rethinking generalization. arXiv preprint arXiv:1611.03530, 2016.

A APPENDIX A

A.1 MODEL STRUCTURE IN SECTION 2.4

The number of channels p takes the values described in Section 2.4.

Table 2: Model structure on MNIST in Section 2.4; p is the number of channels and q a proportionally larger channel count.

Layer                    | Input | Output
conv + relu + maxpooling | 1     | p
conv + relu + maxpooling | p     | q
flatten                  | -     | -
fc + relu                | 16*q  | 10*q
fc + relu                | 10*q  | 10
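Assuming 5x5 convolutions without padding (an inference consistent with the 16*q flattened size for 28x28 MNIST inputs), a sketch of the Table 2 network could look like the following; the kernel size and the 10-way output are our inferences, not stated in the table:

```python
import torch.nn as nn

def mnist_cnn(p, q, num_classes=10):
    """Sketch of the Table 2 model: two Conv-ReLU-MaxPool blocks and two FC
    layers. Spatial sizes: 28 -> 24 -> 12 -> 8 -> 4, so flatten gives 16*q."""
    return nn.Sequential(
        nn.Conv2d(1, p, kernel_size=5), nn.ReLU(), nn.MaxPool2d(2),
        nn.Conv2d(p, q, kernel_size=5), nn.ReLU(), nn.MaxPool2d(2),
        nn.Flatten(),
        nn.Linear(16 * q, 10 * q), nn.ReLU(),
        nn.Linear(10 * q, num_classes),
    )
```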
A.2 EXPERIMENT ON CIFAR10

Unlike the experiment on MNIST, we use a deeper network on CIFAR10. We also include Batch Normalization (BN) layers, because we find it difficult for the network to converge without them. The network consists of four Conv-BN-ReLU-Conv-BN-ReLU-MaxPooling blocks (the first convolution maps the 3 input channels to p channels, where p is the number of channels) followed by flatten and three fully-connected layers; see Table 3.

We vary the training set size n, the label noise probability $p_{random}$, and the number of channels ch, as in the MNIST experiment. We use full-batch gradient descent (not SGD) with a small learning rate. The left and right hand sides of eq. (19) at different epochs are shown in Figure 7, where each point represents one specific combination of the above settings. Note that at the evaluation step of every epoch, we use the same means and variances inside the BN layers as for the training dataset, to ensure that the network and loss function are consistent between training and test.

Figure 7: Left hand side (LHS) and right hand side (RHS) of eq. (19) on CIFAR10. Points are drawn under different experiment settings; LHS vs. RHS at epochs 20, 100, 500, 1000.

At the beginning of training, compared to MNIST, the data points no longer lie perfectly on the diagonal dashed line. We suppose this is because of the BN layers, whose internal parameters, i.e. the running mean and running variance, are not regular learnable parameters in the optimization process but change their values in a different way. Their change affects the OSGR, yet we could not include them in the estimation of the OSGR. However, a strong positive correlation between the left and right hand sides of eq. (19) is observed until the training begins to converge.

A.3 EXPERIMENT ON TOY DATASET

In this section we use a simple two-layer regression model consisting of an FC-ReLU structure with only 2 inputs, one hidden layer with N neurons and 1 output. A synthetic dataset similar to the training data used in the experiment of Section 3.2 is generated as follows: each data point is constructed i.i.d. using $y = x_1 x_2 + \epsilon$, where $x_1$ and $x_2$ are drawn from a uniform distribution on $[-1, 1]$ and $\epsilon$ is drawn from a uniform distribution on $[-\eta_{noise}, \eta_{noise}]$.

To estimate eq. (24), we randomly generate 100 training sets with n samples each, i.e. M = 100, and a test set with 20,000 samples. To cover different conditions, we (1) vary the training set size n; (2) inject noise by varying $\eta_{noise}$; and (3) perturb the model structure by varying the hidden width N. We use gradient descent with a learning rate of 0.001.

Figure 8: Similar to Fig. 3, but for the toy regression model discussed in Appendix A.3.

Figure 8 shows a behavior similar to Fig. 3. During the early training stage, the LHS and RHS of eq. (19) are very close. Their high correlation persists until training converges, after which the RHS of eq. (19) decreases significantly.

B APPENDIX B

Derivation of eq. (28):

$$\Delta g^{(l)}_{D,s,c} = \sum_{\theta_j \in \theta^{(l-)}} \frac{\partial g^{(l)}_{D,s,c}}{\partial \theta_j} \Delta\theta_j + \text{other terms} \qquad (29)$$

$$= \sum_{\theta_j \in \theta^{(l-)}} \frac{\partial \big(\frac{1}{n}\sum_{i=1}^{n} \frac{\partial L_i}{\partial W^{(l)}_{s,c}}\big)}{\partial \theta_j} \Big(-\frac{\lambda}{n}\sum_{i=1}^{n} \frac{\partial L_i}{\partial \theta_j}\Big) + \text{other terms} \qquad (30)$$

$$= \sum_{\theta_j \in \theta^{(l-)}} \frac{\partial \big(\frac{1}{n}\sum_{i=1}^{n} \frac{\partial L_i}{\partial o^{(l)}_{i,c}} \frac{\partial o^{(l)}_{i,c}}{\partial W^{(l)}_{s,c}}\big)}{\partial \theta_j} \Big(-\frac{\lambda}{n}\sum_{i=1}^{n} \sum_{s',c'} \frac{\partial L_i}{\partial o^{(l)}_{i,c'}} \frac{\partial o^{(l)}_{i,c'}}{\partial a^{(l)}_{i,s'}} \frac{\partial a^{(l)}_{i,s'}}{\partial \theta_j}\Big) + \text{other terms} \qquad (31)$$

$$= -\frac{\lambda}{n^2} \sum_{\theta_j \in \theta^{(l-)}} \frac{\partial \big(\sum_{i=1}^{n} \frac{\partial L_i}{\partial o^{(l)}_{i,c}} a^{(l)}_{i,s}\big)}{\partial \theta_j} \Big(\sum_{i=1}^{n} \sum_{s',c'} \frac{\partial L_i}{\partial o^{(l)}_{i,c'}} W^{(l)}_{s',c'} \frac{\partial a^{(l)}_{i,s'}}{\partial \theta_j}\Big) + \text{other terms} \qquad (32)$$

$$= -\frac{\lambda}{n^2} \sum_{\theta_j \in \theta^{(l-)}} \sum_{i=1}^{n}\Big(\frac{\partial L_i}{\partial o^{(l)}_{i,c}} \frac{\partial a^{(l)}_{i,s}}{\partial \theta_j} + \frac{\partial^2 L_i}{\partial o^{(l)}_{i,c}\, \partial \theta_j}\, a^{(l)}_{i,s}\Big)\Big(\sum_{s',c'} W^{(l)}_{s',c'} \sum_{i=1}^{n} \frac{\partial L_i}{\partial o^{(l)}_{i,c'}} \frac{\partial a^{(l)}_{i,s'}}{\partial \theta_j}\Big) + \text{other terms} \qquad (33)$$

Above we used $\frac{\partial o^{(l)}_{i,c'}}{\partial a^{(l)}_{i,s'}} = W^{(l)}_{s',c'}$ and $\frac{\partial o^{(l)}_{i,c}}{\partial W^{(l)}_{s,c}} = a^{(l)}_{i,s}$, which both follow from eq. (26). Consider the first term of eq. (33). Collecting the contribution with $s' = s$, $c' = c$, we have

$$\Delta g^{(l)}_{D,s,c} = -\frac{\lambda}{n^2} \sum_{\theta_j \in \theta^{(l-)}} W^{(l)}_{s,c} \Big(\sum_{i=1}^{n} \frac{\partial L_i}{\partial o^{(l)}_{i,c}} \frac{\partial a^{(l)}_{i,s}}{\partial \theta_j}\Big)^2 + \text{other terms}. \qquad (34)$$

Note that the term involving $\frac{\partial^2 L_i}{\partial o^{(l)}_{i,c}\, \partial \theta_j}\, a^{(l)}_{i,s}$ and the terms with $s' \neq s$ or $c' \neq c$ in eq. (33) are merged into the "other terms" of eq. (34).

C APPENDIX C

Notations
$\mathcal{Z}$: a data distribution on $\mathcal{X} \times \mathcal{Y}$
$s$ or $(x, y)$: a single data sample
$D$: training set consisting of $n$ samples drawn from $\mathcal{Z}$
$D'$: test set consisting of $n'$ samples drawn from $\mathcal{Z}$
$\theta$: model parameters, whose components are denoted $\theta_j$
$g_s(\theta)$ or $g_i(\theta)$: parameters' gradient w.r.t. a single data sample $s$ or $(x_i, y_i)$
$\tilde{g}(\theta)$: mean of the parameters' gradient over the whole data distribution, i.e. $E_{s\sim\mathcal{Z}}(g_s(\theta))$
$g_D(\theta)$: average gradient over the training dataset, i.e. $\frac{1}{n}\sum_{i=1}^{n} g_i(\theta)$
$g_{D'}(\theta)$: average gradient over the test dataset, i.e. $\frac{1}{n'}\sum_{i=1}^{n'} g'_i(\theta)$; note that in eq. (5) we assume $n' = n$
$g_{D,j}$: same as $g_D(\theta_j)$
$\rho^2(\theta)$: variance of the parameters' gradient over single samples, i.e. $\mathrm{Var}_{s\sim\mathcal{Z}}(g_s(\theta))$
$\rho^2_j$: same as $\rho^2(\theta_j)$
$\sigma^2(\theta)$: variance of the average gradient over a training dataset of size $n$, i.e. $\mathrm{Var}_{D\sim\mathcal{Z}^n}[g_D(\theta)]$
$\sigma^2_j$: same as $\sigma^2(\theta_j)$
$r_j$ or $r(\theta_j)$: gradient signal to noise ratio (GSNR) of model parameter $\theta_j$
$L[D]$: empirical training loss, i.e. $\frac{1}{n}\sum_{i=1}^{n} L(y_i, f(x_i, \theta))$
$L[D']$: empirical test loss, i.e. $\frac{1}{n'}\sum_{i=1}^{n'} L(y'_i, f(x'_i, \theta))$
$\Delta L[D]$: one-step training loss decrease
$\Delta L_j[D]$: one-step training loss decrease caused by updating one parameter $\theta_j$
$R(\mathcal{Z}, n)$: one-step generalization ratio (OSGR) for training and test sets of size $n$ sampled from $\mathcal{Z}$, i.e. $E_{D,D'\sim\mathcal{Z}^n}(\Delta L[D']) / E_{D\sim\mathcal{Z}^n}(\Delta L[D])$
$\lambda$: learning rate
$\nabla$: one-step generalization gap increment, i.e. $\Delta L[D] - \Delta L[D']$
$\epsilon$, $\epsilon'$: random variables with zero mean and variance $\sigma^2(\theta)$
$W^{(l)}$ and $b^{(l)}$: model parameters (weight matrix and bias) of the $l$-th layer
$\theta^{(l-)}$: collection of model parameters over all layers before the $l$-th layer
$\theta^{(l+)}$: collection of model parameters over all layers after the $l$-th layer, including the $l$-th layer
$g^{(l)}_D$: average gradient of $W^{(l)}$ over the training dataset
$a^{(l)} = \{a^{(l)}_s(\theta^{(l-)})\}$: activations of the $l$-th layer, where $s = \{1, 2, ..., S\}$ indexes the nodes/channels of the $l$-th layer
$o^{(l)} = \{o^{(l)}_c\}$: outputs of the matrix multiplication of the $l$-th layer, where $c = \{1, 2, ..., C\}$ indexes the nodes/channels of the $(l+1)$-th layer
$a^{(l)}_{i,s}$ and $o^{(l)}_{i,c}$: $a^{(l)}_s$ and $o^{(l)}_c$ evaluated on data sample $i$