How Many Factors Influence Minima in SGD?
Victor Luo and Yazhen Wang
Department of Statistics, University of Wisconsin-Madison
Madison, WI 53706, USA.
Email: [email protected], [email protected]
September 25, 2020
Abstract
Stochastic gradient descent (SGD) is often applied to train deep neural networks (DNNs), and research efforts have been devoted to investigating the convergent dynamics of SGD and the minima found by SGD. The influencing factors identified in the literature include the learning rate, batch size, Hessian, and gradient covariance, and stochastic differential equations are used to model SGD and establish the relationships among these factors for characterizing the minima found by SGD. It has been found that the ratio of batch size to learning rate is a main factor in highlighting the underlying SGD dynamics; however, the influence of other important factors such as the Hessian and gradient covariance is not entirely agreed upon. This paper describes the factors and relationships in the recent literature and presents numerical findings on the relationships. In particular, it confirms the four-factor and general relationship results obtained in Wang (2019), while the three-factor and associated relationship results found in Jastrzębski et al. (2018) may not hold beyond the considered special case.
With the rise of big data analytics, multi-layer deep neural networks (DNNs) have surfaced as one of the most powerful machine learning methods. Training a deep network requires optimizing an empirical risk or average loss function, typically done using stochastic gradient descent (SGD). There is a large volume of literature on DNNs. This paper focuses on two recent papers, Wang (2019) and Jastrzębski et al. (2018), which show that the SGD dynamics are captured by the ratio of the learning rate (LR) to batch size (BS) along with other factors. We will refer to this ratio as LR/BS. Jastrzębski et al. (2018) considered a special setting to identify three factors and established a relationship among the three factors to characterize minima in SGD. Wang (2019) studied a general setting to find four factors and derived a general relationship to describe minima in SGD. Moreover, Jastrzębski et al. (2018) can be treated as a special case of Wang (2019). We detail the two approaches and their settings in the following sections, and refer to the specific equations of each paper involving relationships with the LR/BS. Their differences are investigated numerically.

The rest of the paper is organized as follows. Sections 2 and 3 provide brief reviews of Jastrzębski et al. (2018) and Wang (2019), respectively. Section 4 describes the neural network model setup, data sets, and software packages. Section 5 presents numerical results.
This section provides an overview of the main results in Jastrzębski et al. (2018) describing the SGD behavior. Consider a model parameterized by θ, where the θ_i for i ∈ {1, . . . , q} are the components, with q denoting the number of parameters. For n training examples x_j, j ∈ {1, . . . , n}, the average loss function is defined as L(θ) = (1/n) Σ_{j=1}^n l(θ, x_j). The corresponding gradient is g(θ) = ∂L/∂θ; it is based on the sum of the loss values over all training examples.

Stochastic gradient descent with learning rate δ is considered, where the update rule is given by

    θ_{k+1} = θ_k − δ g^(m)(θ_k),   (1)

where k indexes the discrete update steps, and g^(m) are the stochastic gradients that arise when considering a minibatch B of size m < n of random indices drawn uniformly from {1, . . . , n}. The stochastic gradient g^(m) is an unbiased estimate of the gradient based on the corresponding subset of training examples, g^(m)(θ) = (1/m) Σ_{j∈B} ∂l(θ, x_j)/∂θ.

If we consider the loss gradient at a randomly chosen data point, g_j(θ) = ∂l(θ, x_j)/∂θ, as a random variable induced by the random sampling of the data items, then g_j(θ) is an unbiased estimator of the gradient g(θ). This estimator g_j(θ) has finite covariance C(θ) for typical loss functions. With a large enough data set, the items in a batch are approximately IID samples. When the batch size is sufficiently large, g^(m)(θ) is the mean of m terms of the form g_j(θ); thus, g^(m)(θ) approximately follows a normal distribution with mean close to g(θ) and covariance Σ(θ) = (1/m) C(θ).
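The scaling Σ(θ) = (1/m) C(θ) is easy to check empirically. Below is a minimal sketch with a toy one-dimensional quadratic loss and made-up constants (not the paper's setup): minibatch gradients are unbiased for g(θ), and their variance shrinks by the factor m relative to the single-example gradient covariance C(θ).

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, trials = 100_000, 64, 4000   # data size, batch size, sampled batches

# Toy loss l(theta, x_j) = 0.5*(theta - x_j)**2, so g_j(theta) = theta - x_j.
# At theta = 0 the per-example gradients are -x_j, with covariance C = Var(x_j).
x = rng.normal(loc=1.0, scale=2.0, size=n)
per_example_grads = -x
C = per_example_grads.var()

# Minibatch gradients g^(m): means over m indices drawn uniformly at random.
batches = rng.choice(per_example_grads, size=(trials, m))
g_m = batches.mean(axis=1)

print(g_m.mean())           # close to the full gradient -1.0 (unbiased)
print(g_m.var() * m / C)    # close to 1, illustrating Var(g^(m)) = C/m
```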
Therefore, (1) can be rewritten as

    θ_{k+1} = θ_k − δ g(θ_k) + δ (g^(m)(θ_k) − g(θ_k)),   (2)

where (g^(m)(θ_k) − g(θ_k)) is an additive zero-mean Gaussian random noise with covariance Σ(θ) = (1/m) C(θ). Then, (2) can be rewritten as

    θ_{k+1} = θ_k − δ g(θ_k) + (δ/√m) ε,   (3)

with ε being a zero-mean Gaussian random variable with covariance C(θ).

Next, we may model (3) by a stochastic differential equation (SDE) of the form

    dθ = −g(θ) dt + √(δ/m) R(θ) dW(t),   (4)

where R(θ) R(θ)^T = C(θ). It is noted that R(θ) = U(θ) Λ(θ)^{1/2}, where the eigendecomposition of C(θ) is given as U(θ) Λ(θ) U(θ)^T, with Λ(θ) the diagonal matrix of eigenvalues and U(θ) the orthonormal matrix of eigenvectors of C(θ).

Two assumptions are made:

Assumption 1.
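To connect (3) and (4) concretely, the SDE can be simulated with the Euler–Maruyama scheme, whose update step has exactly the form of the SGD-like recursion (3). A hedged one-dimensional sketch, with toy constants of our own choosing (g(θ) = hθ and constant C(θ) = c), checks that the simulated paths settle at the OU stationary variance δc/(2mh):

```python
import numpy as np

rng = np.random.default_rng(1)
delta, m = 0.01, 32          # learning rate and batch size
h, c = 2.0, 1.0              # toy setting: g(theta) = h*theta, C(theta) = c
paths, steps = 2000, 500     # 2000 Euler-Maruyama paths over t in [0, 5]

theta = np.ones(paths)
for _ in range(steps):
    dW = rng.normal(scale=np.sqrt(delta), size=paths)   # Brownian increments
    theta += -h * theta * delta + np.sqrt(delta * c / m) * dW

# For dtheta = -h*theta dt + sqrt(delta*c/m) dW, the stationary variance
# is delta*c/(2*m*h); the sample variance across paths should be close.
print(theta.var(), delta * c / (2 * m * h))
```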
As we expect the training to have arrived in a local minimum, the loss surface can be approximated by a quadratic bowl, with minimum at zero loss (reflecting the ability of networks to fully fit the training data). Given this, the training can be approximated by an Ornstein–Uhlenbeck process.
Assumption 2.
Assume C = H, that is, the covariance of the gradients and the Hessian of the loss are equal. Under these assumptions, the Hessian is positive definite and matches the covariance C. Thus, its eigendecomposition is H = C = V Λ V^T, where Λ is the diagonal matrix of positive eigenvalues and V is an orthonormal matrix. Let z be defined as z = V^T (θ − θ*), where θ* are the parameters at the minimum. Now, starting from (4) and approximating the average loss L(θ) as (1/2)(θ − θ*)^T H (θ − θ*), we obtain an Ornstein–Uhlenbeck (OU) process for z given as

    dz = −Λ z dt + √(δ/m) Λ^{1/2} dW(t).   (5)

The stationary distribution of an OU process of this form is Gaussian with mean zero and covariance cov(z) = E(z z^T) = (δ/(2m)) I. Furthermore, Jastrzębski et al. (2018, Equation 9) obtain the expected loss, which can be written as

    E(L) = (1/2) Σ_{i=1}^q λ_i E(z_i^2) = (δ/(4m)) Tr(Λ) = (δ/(4m)) Tr(H),   (6)

with the expectation being over the stationary distribution of the OU process, and the second equality following from the expression for the OU covariance. We will refer to (6) as Equation [4](9) for the rest of this paper. We can see from [4](9) that the learning-rate-to-batch-size ratio determines the trade-off between width and expected loss associated with SGD dynamics within a minimum centered at a point of zero loss, with E(L)/Tr(H) ∝ δ/m.

This section reviews the relevant SGD results in Wang (2019). We consider the minimization problem min_{θ∈Θ} f(θ), where the objective function f(θ) is defined on a parameter space Θ ⊂ R^p and assumed to have L-Lipschitz continuous gradients. The plain gradient descent algorithm is defined as x_k = x_{k−1} − δ ∇f(x_{k−1}), where ∇ denotes the gradient operator and δ is the learning rate. We can model {x_k, k = 0, 1, . . .} by a smooth curve X(t) with the ansatz x_k ≈ X(kδ).
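The ansatz x_k ≈ X(kδ) can be illustrated on a one-dimensional quadratic objective, where the gradient flow has the closed form X(t) = X(0) e^{−ht}; the constants below are arbitrary choices for this sketch:

```python
import math

# Gradient descent on f(x) = 0.5*h*x**2 versus its gradient flow
# X(t) = X(0)*exp(-h*t); h, delta, and x0 are illustrative constants.
h, delta, x0 = 1.5, 1e-3, 2.0
k = 1000                                  # compare at t = k*delta = 1.0

x = x0
for _ in range(k):
    x -= delta * h * x                    # x_k = x_{k-1} - delta * grad f(x_{k-1})

X_t = x0 * math.exp(-h * k * delta)       # gradient flow value at t = 1.0
print(x, X_t)                             # the two nearly coincide for small delta
```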
Define a step function x_δ(t) = x_k for kδ ≤ t < (k + 1)δ; as δ → 0, x_δ(t) approaches X(t) satisfying

    Ẋ(t) + ∇f(X(t)) = 0,   (7)

where Ẋ(t) denotes the derivative of X(t), with initial value X(0) = x_0. We see that X(t) is a gradient flow associated with the objective function f(·).

Now, let θ = (θ_1, . . . , θ_p)′ be the parameters of interest, and U a relevant random element on a probability space with a given distribution Q. Consider an objective function ℓ(θ; u) and its corresponding expectation E[ℓ(θ; U)] = f(θ). We can treat ℓ(θ; u) as a loss function and f(θ) = E[ℓ(θ; U)] as its corresponding risk. Because f(θ) is usually unavailable, we consider the stochastic optimization problem min_{θ∈Θ} L_n(θ; U_n), where L_n(θ; U_n) = (1/n) Σ_{i=1}^n ℓ(θ; U_i), U_n = (U_1, . . . , U_n)′ is a sample, and we assume that U_1, . . . , U_n are iid and follow the distribution Q.

We make some smoothness assumptions on the loss function ℓ(θ; u): E[ℓ(θ; U)] = f(θ), E[∇ℓ(θ; U)] = ∇f(θ), and E[H ℓ(θ; U)] = H f(θ), where ∇ is the gradient operator (the first-order partial derivatives) and H is the Hessian operator (the second-order partial derivatives). We further assume that √n [∇L_n(θ; U_n) − ∇f(θ)] = (1/√n) Σ_{i=1}^n [∇ℓ(θ; U_i) − ∇f(θ)] weakly converges to a Gaussian process with mean zero and auto-covariance ς(θ, υ) defined below.

Define the cross auto-covariance ς(θ, υ) = (ς_ij(θ, υ))_{1≤i,j≤p}, θ, υ ∈ Θ, where the ς_ij(θ, υ) = Cov[∂ℓ(θ; U)/∂θ_i, ∂ℓ(υ; U)/∂υ_j] are assumed to be continuously differentiable and L-Lipschitz.
Let σ_ij(θ) = Cov[∂ℓ(θ; U)/∂θ_i, ∂ℓ(θ; U)/∂θ_j] = ς_ij(θ, θ), and assume σ(θ) = Var[∇ℓ(θ; U)] = (σ_ij(θ))_{1≤i,j≤p} = ς(θ, θ) is positive definite.

The stochastic gradient scheme can then be viewed as

    x_{mk} = x_{m,k−1} − δ ∇L̂_m(x_{m,k−1}; U*_{mk}),   (8)

where U*_{mk}, k = 1, 2, . . ., are independent mini-batches of size m.

As an analog of the gradient flow given by the ODE (7) for the gradient descent algorithm, a continuous-time process X_{mδ}(t) is derived to approximate the stochastic gradient descent algorithm by the following stochastic differential equation,

    dX_{mδ}(t) = −∇f(X_{mδ}(t)) dt − (δ/m)^{1/2} σ^{1/2}(X_{mδ}(t)) dB(t).   (9)

Let V_{mδ}(t) = (m/δ)^{1/2} [X_{mδ}(t) − X(t)], and treat V_{mδ} as random elements in C([0, T]), the space of all continuous functions on [0, T] with the uniform metric max{|b_1(t) − b_2(t)| : t ∈ [0, T]} between functions b_1(t) and b_2(t). Similarly, we use the SGD iterates x_{mk} from (8) to define an empirical process x_{mδ}(t). Set v_{mδ}(t) = (m/δ)^{1/2} [x_{mδ}(t) − X(t)], and treat v_{mδ} as random elements in D([0, T]), the Skorokhod space of all càdlàg functions on [0, T], equipped with the Skorokhod metric. Wang (2019) established the weak convergence limits of V_{mδ}(t) and v_{mδ}(t) as follows.

Result 3.1. (Gradient flow central limit theorem in Wang (2019).) As δ → 0 and m → ∞, V_{mδ}(t) and v_{mδ}(t), t ∈ [0, T], weakly converge to V(t), which is a time-dependent Ornstein–Uhlenbeck process satisfying

    dV(t) = −[H f(X(t))] V(t) dt − σ^{1/2}(X(t)) dB(t),  V(0) = 0,   (10)

where H denotes the Hessian operator.

Suppose that the stochastic gradient processes converge to the critical point θ̌.
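When X(t) has settled at a minimum with a constant positive definite Hessian H and gradient covariance σ, the stationary covariance Γ of the OU process (10) satisfies the Lyapunov-type equation ΓH + HΓ = σ, which can be solved directly in the eigenbasis of H. A small numpy sketch with illustrative 2×2 matrices of our own choosing:

```python
import numpy as np

# Illustrative 2x2 Hessian H (positive definite) and gradient covariance sigma.
H = np.array([[3.0, 1.0],
              [1.0, 2.0]])
sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])

# Solve Gamma*H + H*Gamma = sigma in the eigenbasis of H: with
# H = Q diag(lam) Q^T, the transformed solution is sigma_tilde_ij/(lam_i+lam_j).
lam, Q = np.linalg.eigh(H)
sigma_tilde = Q.T @ sigma @ Q
Gamma = Q @ (sigma_tilde / (lam[:, None] + lam[None, :])) @ Q.T

print(np.allclose(Gamma @ H + H @ Gamma, sigma))   # True
```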
From a limiting distribution point of view, Result 3.1 indicates that the SGD iterates x_{mk} from (8) and the continuous process X_{mδ}(t) generated from the SDE (9) are asymptotically the same as the deterministic solution X(t) of the ordinary differential equation (7) plus (δ/m)^{1/2} V(t), where V(t) is the solution of the SDE (10). The limiting process V(t) is a time-dependent Ornstein–Uhlenbeck process given by (10). Then, we have the following result for the behavior of f(X_{mδ}(t)) and f(x_{mδ}(t)) around the critical point θ̌.

Result 3.2. (Minima in SGD in Wang (2019).) Suppose the gradient descent process X(t) given by the ordinary differential equation (7) converges to a critical point θ̌ of f(·). If θ̌ = X(∞) is a local minimizer with positive definite H f(θ̌), then as t → ∞, V(t) has a limiting stationary distribution with mean zero and covariance matrix Γ(∞) satisfying the following algebraic Riccati equation,

    Γ(∞) H f(X(∞)) + H f(X(∞)) Γ(∞) = σ(X(∞));   (11)

moreover, we have

    E[f(X_{mδ}(t))] = f(X(t)) + (δ/(4m)) tr[σ(X(∞))] + o(δ/m),   (12)

    E[|∇f(X_{mδ}(t))|^2] = |∇f(X(t))|^2 + (δ/(2m)) tr[σ(X(∞)) H f(X(∞))] + o(δ/m).   (13)

If θ̌ is a saddle point, V(t) diverges and thus does not have any limiting distribution.

Since the above equations (12) and (13) correspond to Equations (4.45) and (4.46) in Wang (2019), respectively, we will refer to them as Equations [8](4.45) and [8](4.46) for the rest of the paper. Note that under Assumption 2 in Section 2, σ(X(∞)) = H f(X(∞)); thus, (6) can be recovered from (12). Without that assumption, however, there may be a significant difference between (6) and (12); namely, Equation [8](4.45) can significantly differ from Equation [4](9).

The numerical analysis in the rest of the paper was conducted in R coupled with Python, using the packages keras and tensorflow.
Keras is a deep learning API written in Python, which runs on top of tensorflow. Tensorflow is a symbolic math library used for machine learning applications, especially neural networks. We specifically utilize the GPU versions of the packages, with an NVIDIA GeForce GTX 1050 Ti graphics card.

Figure 1: ResNet code utilized in R.

The model considered was ResNet-56, where 56 represents the depth of the model. ResNet stacks convolution layers but introduces the key idea of identity shortcut connections, allowing the model to skip one or more layers. Using a residual block allows the stacked layers to fit an easier residual mapping rather than directly fitting the desired underlying mapping. The code is given in Figure 1.

The data set considered was CIFAR-10, which consists of 60,000 32x32 color images in 10 classes (airplane, automobile, bird, cat, deer, dog, frog, horse, ship, and truck), with 6,000 images per class. The data set is divided into 50,000 training images and 10,000 test images. The model is fit on the training images and validated using the test images. The training is run for 125 epochs, and the batch size and learning rate are varied. In addition, the learning rate is reduced by a factor of 10 at checkpoints of 60 and 100 epochs to help prevent the loss from plateauing. SGD and batch normalization are used as well. We will explore the Kullback-Leibler divergence loss, which is given as Σ y_true · log(y_true / y_pred) in tensorflow, and the gradient of the standard logistic loss (the difference between y_true and y_pred), where y_true is the true value and y_pred is the value predicted by the model. The log loss function was explored in Wang (2019).

In this section, we will first justify Equations [8](4.45) and [8](4.46) through numerical experiments, then highlight the differences between [8](4.45) and [4](9), namely, the differences between tr[σ(X(∞))] and Tr(H), respectively.
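For concreteness, the two loss quantities just described can be written out in a few lines. This is only a sketch of the formulas, not the training code of Figure 1; the helper names, the clipping constant, and the sign convention for the logistic-loss gradient (y_pred minus y_true, as in softmax cross-entropy) are our own choices:

```python
import numpy as np

def kld_loss(y_true, y_pred, eps=1e-7):
    """Kullback-Leibler divergence sum(y_true * log(y_true / y_pred)),
    with probabilities clipped to [eps, 1] for numerical stability."""
    y_true = np.clip(y_true, eps, 1.0)
    y_pred = np.clip(y_pred, eps, 1.0)
    return float(np.sum(y_true * np.log(y_true / y_pred)))

def logistic_loss_gradient(y_true, y_pred):
    """Gradient of the standard logistic (log) loss with respect to the
    logits, i.e., the difference between y_pred and y_true."""
    return y_pred - y_true

y_true = np.array([0.0, 1.0, 0.0])        # one-hot label
y_pred = np.array([0.1, 0.8, 0.1])        # predicted class probabilities
print(kld_loss(y_true, y_pred))           # approximately -log(0.8) = 0.223
print(logistic_loss_gradient(y_true, y_pred))
```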
Figure 2: Evolution of Training and Validation Accuracy for varying BS/LR using ResNet-56 on CIFAR-10 with KLD Loss.

In Figure 2, we use the Kullback-Leibler divergence (KLD) as our loss function, scale both the learning rate and batch size, and compare them against rescaling the batch size exactly with the learning rate. The blue lines represent experiments where the BS/LR is kept the same, i.e., rescaling the batch size and learning rate by the same amount. We notice that the evolution of their curves in both the training and validation accuracy very closely match. The green line represents a BS/LR ratio that is relatively close to those of the blue lines, and we can see that the evolution of its curve is somewhat close to that of the blue lines. Finally, the red lines represent BS/LR ratios that are relatively far from those of the blue and green lines; we can see that the evolution of the learning curves of those red lines is noticeably different from that of the blue and green lines. These results numerically confirm the validity of Equation [8](4.45).

Figure 3: Evolution of Training and Validation Accuracy for varying BS/LR using ResNet-56 on CIFAR-10 with Gradient Log Loss.

Experiment   BS/LR   BS    LR     Tr(H)        tr[σ(X(∞))]   Magnitude Difference
1            1,250   100   0.08   22,398,330    9,528,207     2.4
2              500   100   0.2    44,428,697    5,501,901     8.1
3              200   100   0.5    45,588,319    4,575,870     9.9
4            2,500   200   0.08   31,897,024   10,304,980     3.1
5            1,000   200   0.2    33,932,327    4,084,640     8.3
6              400   200   0.5    41,058,846    3,486,877    11.8
7            6,250   500   0.08   63,343,355    9,530,501     6.6
8            2,500   500   0.2    35,252,033    4,261,116     8.3
9            1,000   500   0.5    54,212,278    5,380,719    10.1

Table 1: Values of tr[σ(X(∞))] and Tr(H) for varying batch sizes and learning rates under KLD loss, averaged over 5 runs for each experiment. Magnitude Difference represents Tr(H) / tr[σ(X(∞))].
In Figure 3, we use the gradient of the log loss as our loss function, scale both the learning rate and batch size, and compare them against rescaling the batch size exactly with the learning rate. The blue lines represent experiments where the BS/LR is kept the same, i.e., rescaling the batch size and learning rate by the same amount. We notice that the evolution of their curves in both the training and validation accuracy very closely match. The green line represents a BS/LR ratio that is relatively close to those of the blue lines, and we can see that the evolution of its curve is somewhat close to that of the blue lines. Finally, the red lines represent BS/LR ratios that are relatively far from those of the blue and green lines; we can see that the evolution of the learning curves of those red lines is noticeably different from that of the blue and green lines. These results numerically confirm the validity of Equation [8](4.46).
We next analyze the values of tr[σ(X(∞))] and Tr(H) from Equations [8](4.45) and [4](9), respectively, for varying batch sizes and learning rates under the KLD loss. We take 5 runs for each of the experiments and average the resulting values of tr[σ(X(∞))] and Tr(H). The results are presented in Table 1.

Generally, it seems that if the batch size is held constant, a higher learning rate causes Tr(H) to increase, while tr[σ(X(∞))] decreases. If the learning rate is held constant, as the batch size increases, tr[σ(X(∞))] seems to stay roughly constant, while Tr(H) increases. In addition, as the BS/LR ratio decreases, it appears that the magnitude difference between Tr(H) and tr[σ(X(∞))] generally increases. Changes in the learning rate appear to affect the magnitude difference much more strongly than changes in the batch size: holding the batch size constant while increasing the learning rate leads to larger magnitude differences, while holding the learning rate constant and increasing the batch size does not affect the magnitude difference as much. These observations are illustrated in Figure 4. In all cases, we notice that tr[σ(X(∞))] is several times smaller than Tr(H), suggesting that Equation [8](4.45) is a more robust formulation of the relationship between expected loss and the BS/LR ratio compared to Equation [4](9).

We note that similar experiments were conducted under the log loss, and Equations [8](4.45) and [4](9) seem to generally agree in that case, due to the assumptions made in Jastrzębski et al. (2018).

Acknowledgments
We would like to acknowledge support for this project from the National Science Foundation (NSF grants DMS-1707605 and DMS-1913149).

Figure 4: Evolution of Magnitude Difference in Tr(H) and tr[σ(X(∞))] for varying Batch Sizes and Learning Rates.

References

[1] A. A. Brown and M. C. Bartholomew-Biggs, Some Effective Methods for Unconstrained Optimization Based on the Solution of Systems of Ordinary Differential Equations, Journal of Optimization Theory and Applications, 62(2):211-224, 1989.

[2] D. J. Foster, A. Sekhari, O. Shamir, N. Srebro, K. Sridharan, and B. Woodworth, The Complexity of Making the Gradient Small in Stochastic Convex Optimization, Proceedings of Machine Learning Research 99, 1-27, 2019.

[3] C. W. Gardiner, Stochastic Methods: A Handbook for the Natural and Social Sciences, Springer, 4th edition, 2009.

[4] S. Jastrzębski, Z. Kenton, D. Arpit, N. Ballas, A. Fischer, Y. Bengio, and A. Storkey, Three Factors Influencing Minima in SGD, arXiv:1711.04623v3 [cs.LG], September 14, 2018.

[5] K. Kawaguchi, Deep Learning Without Poor Local Minima, In Advances in Neural Information Processing Systems, pages 586-594, 2016.

[6] Y. Nesterov, Gradient Methods for Minimizing Composite Functions, Mathematical Programming, 140(1):125-161, 2013.

[7] S. Ruder, An Overview of Gradient Descent Optimization Algorithms, arXiv:1609.04747v1 [cs.LG], 2016.

[8] Y. Wang,