How neural networks find generalizable solutions: Self-tuned annealing in deep learning
Yu Feng 1,2 and Yuhai Tu 1
1 IBM T. J. Watson Research Center, Yorktown Heights, NY 10598
2 Department of Physics, Duke University, Durham, NC 27710
Despite the tremendous success of the Stochastic Gradient Descent (SGD) algorithm in deep learning, little is known about how SGD finds generalizable solutions in the high-dimensional weight space. By analyzing the learning dynamics and the loss function landscape, we discover a robust inverse relation between the weight variance and the landscape flatness (inverse of curvature) for all SGD-based learning algorithms. To explain the inverse variance-flatness relation, we develop a random landscape theory, which shows that the SGD noise strength (effective temperature) depends inversely on the landscape flatness. Our study indicates that SGD attains a self-tuned, landscape-dependent annealing strategy to find generalizable solutions at the flat minima of the landscape. Finally, we demonstrate how these new theoretical insights lead to more efficient algorithms, e.g., for avoiding catastrophic forgetting.

I. INTRODUCTION
One key ingredient of the powerful deep neural network (DNN) based machine learning paradigm – Deep Learning [1] – is a relatively simple iterative method called stochastic gradient descent (SGD) [2, 3]. However, despite the tremendous successes of Deep Learning, the reason why SGD is so effective at learning in a high-dimensional nonconvex loss function (energy) landscape remains poorly understood. The random element seems key for SGD, yet makes it harder to understand. Fortunately, many physical systems include such a random element, e.g., Brownian motion, and powerful tools have been developed for understanding collective behaviors in stochastic systems with many degrees of freedom. Here, we use concepts and methods from statistical physics to investigate the SGD dynamics, the loss function landscape, and, more importantly, their relationship.

We start by introducing the SGD-based learning process as a stochastic dynamical system. A learning system such as a neural network (NN), especially a DNN, has a large number (N) of weight parameters w_i (i = 1, 2, ..., N). For supervised learning, there is a set of M training samples, each with an input ~X_k and a correct output ~Z_k for k = 1, 2, ..., M. For each input ~X_k, the learning system predicts an output ~Y_k = G(~X_k, ~w), where the output function G depends on the architecture of the NN as well as on its weights ~w. The goal of learning is to find the weight parameters that minimize the difference between the predicted and correct outputs, characterized by an overall loss function (or energy function):

L(~w) = M^{-1} Σ_{k=1}^{M} d(~Y_k, ~Z_k),   (1)

where d(~Y_k, ~Z_k) is a measure of the distance between ~Y_k and ~Z_k.
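The overall loss of Eq. 1 can be sketched in a few lines. This is a minimal illustration, not the authors' code; the helper names (`cross_entropy`, `overall_loss`) and the toy network `G` in the test are our own, and the per-sample distance shown is the cross-entropy loss used later in the paper's SM.

```python
import numpy as np

def cross_entropy(Y, Z):
    """Per-sample distance d(Y_k, Z_k): cross-entropy between softmax(Y)
    and the one-hot correct output Z (the per-sample loss used in the SM)."""
    p = np.exp(Y - Y.max())      # numerically stable softmax
    p /= p.sum()
    return -np.log(p[np.argmax(Z)])

def overall_loss(w, X, Z, G):
    """Eq. 1: L(w) = M^{-1} sum_k d(Y_k, Z_k), with Y_k = G(X_k, w)."""
    return np.mean([cross_entropy(G(x, w), z) for x, z in zip(X, Z)])
```

For uniform logits over n_c classes the per-sample loss is ln(n_c), and a confident correct prediction drives it toward zero.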
In our study, a cross-entropy loss for d (see Supplementary Material (SM) for its expression) is used. One learning strategy is to update the weights by following the gradient of L directly. However, this batch learning scheme is computationally prohibitive for large datasets, and it also has the obvious shortfall of being trapped by local minima. SGD was first introduced to circumvent the large-dataset problem by updating the weights according to a subset (minibatch) of samples randomly chosen at each iteration [2]. Specifically, the change of weight w_i (i = 1, 2, ..., N) for iteration t in SGD is given by:

Δw_i(t) = −α ∂L^{µ(t)}(~w)/∂w_i,   (2)

where α is the learning rate and µ(t) represents the random minibatch used for iteration t. The mini loss function (MLF) for minibatch µ of size B is defined as:

L^µ(~w) = B^{-1} Σ_{l=1}^{B} d(~Y_{µ_l}, ~Z_{µ_l}),   (3)

where µ_l (l = 1, 2, ..., B) labels the B randomly chosen samples.

Here, we introduce the key concept of a MLF ensemble {L^µ(~w)}, i.e., an ensemble of energy landscapes, each from a random minibatch. The overall loss function L(~w) is just the ensemble average of the MLF: L ≡ ⟨L^µ⟩_µ. The SGD noise comes from the variation between a MLF and its ensemble average: δL^µ ≡ L^µ − L. By taking the continuum time limit in Eq. 2, we obtain the following stochastic differential equation for SGD:

∂~w/∂t = −α ∂L/∂~w + ~η(~w),   (4)

where we have taken the learning step as the unit of time. This equation is analogous to the Langevin equation in statistical physics.
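Eqs. 2 and 3 amount to averaging per-sample gradients over a freshly drawn minibatch at each iteration. A minimal sketch (our own illustration, not the paper's code; `grad_sample`, a hypothetical per-sample gradient callable, is an assumption):

```python
import numpy as np

def sgd_step(w, X, Z, grad_sample, alpha=0.1, B=50, rng=None):
    """One SGD iteration (Eq. 2): step along the gradient of the mini loss
    function L^mu (Eq. 3), i.e., the per-sample gradients averaged over a
    random minibatch mu(t) of size B."""
    rng = rng or np.random.default_rng()
    idx = rng.choice(len(X), size=B, replace=False)   # random minibatch mu(t)
    g_mu = np.mean([grad_sample(w, X[k], Z[k]) for k in idx], axis=0)
    return w - alpha * g_mu
```

Because the minibatch is redrawn at every step, the gradient of δL^µ = L^µ − L acts as the weight-dependent noise ~η of Eq. 4.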
The first term, −α ∂L/∂~w, is the deterministic gradient descent governed by the overall loss function L, analogous to the energy function in physics. The second term is the SGD noise term ~η ≡ −α∇_{~w} δL^µ(~w), with zero mean ⟨~η⟩ = 0 and equal-time correlation

C_{ij}(~w) ≡ ⟨η_i η_j⟩ = α² ⟨(∂δL^µ/∂w_i)(∂δL^µ/∂w_j)⟩_µ,

which depends explicitly on ~w. As first pointed out by Chaudhari and Soatto [4], unlike equilibrium physical systems, where the noise has a constant strength given by the thermal temperature, the SGD dynamics is highly nonequilibrium: the SGD noise is anisotropic and varies in the weight space. In the rest of the paper, we study the SGD-based machine learning process by adopting the nonequilibrium stochastic dynamics framework. Our working hypothesis is that SGD may serve as an efficient annealing strategy that varies the noise (or effective temperature) "intelligently" according to the loss function landscape in order to find the shallow (flat) minima where "good" (generalizable) solutions are believed to exist [5–10].

II. LEARNING VIA A LOW-DIMENSIONAL DRIFT-DIFFUSION DYNAMICS IN SGD
In general, SGD-based DNN learning dynamics can be roughly divided into an initial fast learning phase, where L decreases rapidly, followed by an "exploration" phase, where the weights keep evolving but the loss L only changes very slowly (see Fig. S1 in SM). The weight vectors sampled in the exploration phase are treated as solutions to the problem given their low losses.

We used principal component analysis (PCA) to study the stochastic motion of the weight vector during the SGD learning process, in particular during the exploration phase, when the system reaches a quasi-steady-state (see Methods section for details of the PCA). Within a large time window t ∈ [t_0, t_0 + T], where t_0 is a time in the exploration phase and T (= 10 epochs used here) is a large time window, the weight dynamics can be decomposed into its variations along different principal components:

~w(t) = ⟨~w⟩_T + Σ_{i=1}^{N} θ_i(t) ~p_i,   (5)

where ⟨~w⟩_T = T^{-1} ∫_{t_0}^{t_0+T} ~w(t) dt is the average weight vector in the time window T, and ~p_i is the i-th principal component base vector with ~p_i · ~p_j = δ_{ij}. The projection of the weight vector along the PCA direction ~p_i is given by θ_i(t), which is a linear combination of the individual weights; ~θ is the weight vector in the PCA coordinates.

The results reported here are for a simple NN with 2 hidden layers, each with 50 neurons, for classification tasks using the MNIST database (see Methods for details and other NN architectures used). The PCA was done for the N = 2500 weights between the two hidden layers (results for other NN architectures and databases are included in the SM). In Fig. 1A, we show the PCA spectrum, i.e., the variance σ_i² ≡ T^{-1} ∫_{t_0}^{t_0+T} θ_i²(t) dt versus its rank i in descending order (σ_{i+1}² < σ_i²). We found that the variance in the first PCA direction (~p_1) is much larger than the variances in the other directions because the motion along ~p_1 has a net drift velocity (see discussion below and Fig. 1C).
For the other PCA directions, after a small number of leading PCA directions (i ≥ 2), σ_i² decays rapidly with its rank: σ_i² ∼ i^{-γ} for 20 < i < 200, with a large exponent γ ∼ 3. Excluding i = 1, more than 90% of the total variance occurs in the first 35 PCA modes, much smaller than the total number of weights N = 2500. (Each epoch has M/B iterations, which covers all training samples once.)
FIG. 1: The PCA results and the drift-diffusion motion in SGD. (A) The rank-ordered variance σ_i² in different principal component (PC) directions i. For i ≥ 2, σ_i² decreases with i as a power law i^{-γ} with γ ∼ 3. (B) The normalized cumulative variance of the top (n − 1) PCA components excluding i = 1. It reaches ∼90% at n = 35, much smaller than the total number of weights N = 2500 between the two hidden layers. (C) The SGD weight trajectory projected onto the (θ_1, θ_2) plane. The persistent drift motion in θ_1 and the diffusive random motion in θ_2 are clearly shown. (D) The diffusive motion in the (θ_i, θ_j) plane with j > i (≠ 1) randomly chosen (i = 49 and j = 50 shown here). Unless otherwise stated, the hyperparameters used are minibatch size B = 50 and a fixed learning rate α.

Along the first PCA direction ~p_1, there is a net drift velocity dθ_1/dt with a persistence time much longer than 1 epoch, as clearly shown in Fig. 1C, where the SGD dynamics projected onto the (θ_1, θ_2) space is shown. For all other PCA directions, the dynamics are random walks with a short velocity correlation time (shorter than 1 epoch), as clearly demonstrated in Fig. 1D, where the SGD dynamics projected onto a randomly chosen pair of PCA directions (θ_49, θ_50) is shown.

The persistent drift in the first PCA direction ~p_1 can be understood by moving a solution ~w_0 found by SGD along ~p_1 by θ_1 to a new weight vector ~w = ~w_0 + θ_1 ~p_1. We find that such a translation results roughly in an overall amplification of the difference between the outputs for the right class and the wrong classes, which leads to a change in the cross-entropy loss function L(~w) ≈ L(~w_0) × exp(−βθ_1), with β a constant parameter. This dependence of L on θ_1 leads to the persistent motion along the ~p_1 direction with a low speed proportional to L, which slowly decreases with time itself (see SM for details). In the next section, we focus on the majority diffusive PCA modes (i ≥ 2) and their relation to the loss function landscape.
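The decomposition of Eq. 5 can be reproduced with a plain SVD-based PCA of the sampled weight trajectory. This is our own sketch, numerically equivalent to the sklearn PCA mentioned in Methods; the function name is an assumption:

```python
import numpy as np

def pca_of_trajectory(W):
    """PCA of a weight trajectory W of shape (T, N): rows are flattened
    weight vectors w(t) sampled over the exploration window.
    Returns the rank-ordered variances sigma_i^2, the principal directions
    p_i (rows of Vt), and the projections theta_i(t) of Eq. 5."""
    Wc = W - W.mean(axis=0)                    # subtract <w>_T
    U, s, Vt = np.linalg.svd(Wc, full_matrices=False)
    var = s**2 / len(W)                        # sigma_i^2 along each p_i
    theta = Wc @ Vt.T                          # theta[t, i] = (w(t) - <w>_T) . p_i
    return var, Vt, theta
```

The PCA spectrum of Fig. 1A is then just `var` plotted against its rank.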
III. THE LOSS FUNCTION LANDSCAPE AND THE INVERSE VARIANCE-FLATNESS RELATION
In the exploration phase, the loss function is small, and all the weight vectors along the SGD trajectory can be considered as valid solutions. However, the solutions found by a single SGD trajectory only represent a small subset of valid solutions. To gain insight into the full solution space, we study the loss function landscape around a specific solution ~w_0 reached by SGD. Specifically, we compute the loss function profile L_i along the i-th PCA direction ~p_i determined from the PCA of the SGD dynamics:

L_i(δθ) ≡ L(~w_0 + δθ ~p_i).   (6)

In Fig. 2A, we show the loss function landscape profiles L_i(δθ) for several diffusive PCA directions. In Fig. 2B, we plot ln(L_i(δθ)) versus δθ, which shows that ln(L_i(δθ)) can be fitted well by a quadratic function of δθ near δθ = 0 for all the diffusive directions (i ≥ 2):

ln(L_i(δθ)) ≈ ln(L_0) + 4δθ²/F_i²,   (7)

where L_0 ≡ L(~w_0) is the loss at ~w_0, and F_i, inversely related to the curvature, defines a flatness parameter of the loss function landscape in the i-th PCA direction. In general, F_i can be defined as the width of the region within which L_i ≤ e × L_0. A larger value of F_i means a flatter landscape in the i-th PCA direction.
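Given a way to evaluate the loss along a PCA direction (Eq. 6), the flatness F_i of Eq. 7 follows from a quadratic fit of ln L_i versus δθ. A sketch with our own function names; `loss_along`, a callable returning L(~w_0 + δθ ~p_i), is a hypothetical stand-in for the actual network evaluation:

```python
import numpy as np

def flatness(loss_along, half_width=0.5, n=41):
    """Estimate F_i by fitting ln L_i(dtheta) ~ ln L0 + 4 dtheta^2 / F_i^2
    (Eq. 7) on a symmetric grid of displacements around dtheta = 0."""
    dtheta = np.linspace(-half_width, half_width, n)
    logL = np.log([loss_along(d) for d in dtheta])
    slope, _ = np.polyfit(dtheta**2, logL, 1)   # slope = 4 / F_i^2
    return np.sqrt(4.0 / slope)
```

With this parameterization, F_i is exactly the width where L_i ≤ e × L_0: at δθ = ±F_i/2 the exponent 4δθ²/F_i² equals 1.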
FIG. 2: The loss function landscape and the inverse variance-flatness relation. (A) The loss function profile L_i along the i-th PCA direction. (B) The ln(L_i) profile. It can be fitted by a quadratic function (the red dotted line). The definition of the flatness F_i is also shown. (C) The flatness F_i for different PCA directions i. (D) The inverse relation between the variance σ_i² and the flatness F_i for different choices of minibatch size B and learning rate α.

Quantitatively, we found that the flatness F_i increases with i, as shown in Fig. 2C. Given that the SGD variance σ_i² decreases with i, as shown in Fig. 1A, this immediately suggests an inverse relationship between the loss function landscape flatness and the SGD variance. Indeed, as shown in Fig. 2D, the inverse variance-flatness relation holds true for different choices of minibatch size B and learning rate α. Quantitatively, the variance-flatness relation follows approximately a power law:

σ_i² ∼ F_i^{−ψ},   (8)

where the exponent ψ is roughly independent of B and α.

The inverse variance-flatness relation is counter-intuitive. In equilibrium systems, the fluctuation of a variable around its equilibrium value is proportional to the change of the variable in response to an external perturbation, which is known as the fluctuation-response (or fluctuation-dissipation) relation [12]. Since the response is proportional to the flatness of the energy function, or equivalently the loss function used in machine learning, the fluctuation-response relation in equilibrium systems means that the variance of a variable should be larger for a flatter landscape, which is the opposite of the inverse relation shown in Fig. 2D for SGD.

What is the reason for the inverse variance-flatness relation in SGD? Unlike equilibrium systems, where the noise strength (temperature) is a constant, the SGD noise comes from the difference between the gradient of a random MLF and that of the overall (mean) loss function, and thus it varies in the weight space and in time.
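The exponent ψ of Eq. 8 can be extracted by a standard linear regression in log-log coordinates, sketched here with our own function name:

```python
import numpy as np

def variance_flatness_exponent(var, F):
    """Fit sigma_i^2 ~ F_i^(-psi) (Eq. 8): regress log(sigma_i^2)
    against log(F_i) and return the exponent psi = -slope."""
    slope, _ = np.polyfit(np.log(F), np.log(var), 1)
    return -slope
```

Feeding in the measured (σ_i², F_i) pairs for the diffusive modes yields the power-law exponent shown in Fig. 2D.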
In the next section, we explain the inverse variance-flatness relation based on the dependence of the SGD noise on the statistical properties of the MLF ensemble.

IV. THE RANDOM LANDSCAPE THEORY AND ORIGIN OF THE INVERSE VARIANCE-FLATNESS RELATION
The most distinctive feature of SGD is that at any given iteration (time), the learning dynamics is driven by a random minibatch out of an ensemble of minibatches, each with its own random mini loss function (MLF). To understand the SGD dynamics, we develop a random landscape theory to describe the statistical properties of the MLF ensemble near a solution ~w_0 (we set ~w_0 = 0 for convenience).

As shown in Fig. 3A, we found that L^µ has roughly the same general shape as L near its own minimum. Therefore, we approximate L^µ by an inverse Gaussian function with random parameters:

L^µ(~θ) ≈ L^µ_min exp[Σ_{i,j=1}^{N} M^µ_{ij}(θ_i − θ^µ_i)(θ_j − θ^µ_j)],   (9)

where L^µ_min is the minimum value of the MLF, ~θ^µ is its random minimum position, and M^µ is its random Hessian matrix. Over the MLF ensemble, the mean minimum position defines θ^{(0)}_i ≡ ⟨θ^µ_i⟩_µ with variance σ²_{θ,i}; the mean diagonal Hessian element is M^{(0)}_{ii} ≡ ⟨M^µ_{ii}⟩_µ, with σ²_{ij} the variance of the off-diagonal elements M^µ_{ij}; and the combined variance is σ̃²_{θ,i} ≡ σ²_{θ,i} + (M^{(0)}_{ii})^{−2} Σ_{j≠i} σ²_{ij} σ²_{θ,j} (Eq. 11). The gradient-descent dynamics in a given minibatch follows from Eq. 9:

dθ_i/dt = −α ∂L^µ/∂θ_i ∝ −αL^µ Σ_j M^µ_{ij}(θ_j − θ^µ_j).   (12)

FIG. 3: Statistical properties of the MLF ensemble. (A) Profiles of the overall loss function ln(L_i) (red line) and a set of randomly chosen MLFs ln(L^µ_i) (blue dashed lines) in a given PCA direction i. (B) The variance σ̃²_{θ,i} of the minimum positions and the average diagonal element M^{(0)}_{ii} of the Hessian matrices of the MLF ensemble versus the flatness F_i of the overall loss function. The combination (M^{(0)}_{ii})² σ̃²_{θ,i} versus F_i is also shown. i = 30 used here.

Each MLF is positioned at its own random minimum ~θ^µ. The average relaxation time, τ_i ≡ 1/(αL_0 M^{(0)}_{ii}), is long given that L_0 is small. For t ≪ τ_i, the variation of θ_j is much smaller than that of the minimum position θ^µ_j, i.e., |θ_j| ≪ σ_{θ,j}. Therefore, we can neglect θ_j (≪ θ^µ_j) in Eq. 12, and we have:

dθ_i/dt ≡ v_i = αL^µ Σ_j M^µ_{ij} θ^µ_j,   (13)

which explains the drift-diffusion motion observed in our numerical simulations. For the first PCA direction (i = 1), since ⟨θ^µ_1⟩_µ = θ^{(0)}_1 ≠ 0, there is a net drift velocity v^{(0)}_1 ≡ ⟨v_1⟩_µ ≈ αL_0 M^{(0)}_{11} θ^{(0)}_1. For all the other PCA directions (i ≥ 2), the motion is diffusive with ⟨v_i⟩_µ = 0 and a diffusion constant D_i given by:

D_i ≡ ⟨v_i²⟩Δt ≈ α²⟨(L^µ)²⟩_µ [(M^{(0)}_{ii})² σ²_{θ,i} + Σ_{j≠i} σ²_{ij} σ²_{θ,j}] = α²⟨(L^µ)²⟩_µ (M^{(0)}_{ii})² σ̃²_{θ,i},   (14)

where we have used the expression for σ̃²_{θ,i} from Eq.
11. The above result, Eq. 14, provides a clear explanation for the inverse variance-flatness relation for the diffusive modes (i ≥ 2): a flatter direction (larger F_i) has a smaller mean curvature M^{(0)}_{ii}, which leads to a smaller diffusion constant D_i and thus a smaller variance, since σ_i² ∝ D_i ∝ (M^{(0)}_{ii})² σ̃²_{θ,i}. Quantitatively, both M^{(0)}_{ii} and σ̃²_{θ,i} can be determined from the MLF ensemble statistics. As shown in Fig. 3B, σ̃_{θ,i} scales almost linearly with F_i, and M^{(0)}_{ii} scales inversely with F_i, approximately as F_i^{−2}. As a result, we have σ_i² ∝ (M^{(0)}_{ii})² σ̃²_{θ,i} ∝ F_i^{−2} (see Fig. 3B), which is consistent with the inverse power-law dependence shown in Fig. 2D, although the theoretically obtained exponent 2 is smaller than the exponent ψ measured directly from the SGD dynamics (Fig. 2D).

V. PREVENTING CATASTROPHIC FORGETTING BY USING LANDSCAPE-DEPENDENT CONSTRAINTS

To demonstrate the utility of the theoretical insights gained so far, we tackle a long-standing challenge in machine learning, i.e., how to prevent catastrophic forgetting (CF) [13, 14]. After a deep neural network (DNN) learns to perform a particular task, it is trained for another task. Though the DNN can readjust its weights to perform well on the new task, it may forget the previous task and thus fail catastrophically on it. To prevent forgetting, a recent study by Kirkpatrick et al. [15] proposed the elastic weight consolidation (EWC) algorithm, which trains a new task while enforcing constraints on individual weights based on their effects on the performance of the previous task.
Here, armed with the new insights on the loss landscape and SGD dynamics, we propose a landscape-dependent constraints (LDC) algorithm that trains for the new task with constraints applied to the collective PCA modes from the previous task, with constraint strengths determined by the loss landscape flatness of the previous task.

Explicitly, when we find a solution ~w_1 in the weight space for the first task (task-1) by SGD, we can also determine the loss function landscape around ~w_1 in terms of the flatness parameter F_i along the i-th PCA direction ~p_i for task-1, which can be obtained directly (cheaply) from the weight variance by using the variance-flatness relation, Eq. 8. When learning the new task, we can use a modified loss function for the second task (task-2) by introducing an additional cost term that penalizes the network for going out of the attraction basin of the task-1 solutions:

L̃(~w) = L_2(~w) + λ Σ_{i=1}^{N_c} ((~w − ~w_1) · ~p_i)² / F_i²,   (15)

where L_2 is the original loss function for task-2 and N_c (≤ N) is the number of constrained PCA modes from task-1. In general, constraints can be included for all previous tasks for sequential learning of more than two tasks. The overall strength of the constraints from task-1 is parameterized by λ. The strength of the constraint for a specific PCA mode depends inversely on the flatness of the loss function in that specific PCA direction.

Based on the large attraction basin for a given task, as evidenced by the large flatness parameters shown in Fig. 2, we expect that solutions for task-2 exist within the basin of solutions for task-1, so the performance on task-1 should not be degraded significantly after learning task-2. We tested this idea in the MNIST database with task-1 and task-2 corresponding to classifying two disjoint subsets of digits, e.g., (0,1) for task-1 and (2,3) for task-2 (see Methods for details). As shown in Fig.
4A, in the absence of the constraints (λ = 0), starting with a task-1 solution ~w_1, the weights evolve quickly to a solution ~w_2 for task-2 with a small task-2 test error ε_2 (red line). However, the performance on task-1 deteriorates quickly, with a fast-increasing task-1 test error ε_1 (blue line). The reason can be understood from Fig. 4B, where q_i = ||(~w − ~w_1) · ~p_i||, the projections of the weight displacement vector onto different PCA directions of task-1, are shown. Without constraints, the displacement becomes much larger than a threshold ξ_i set by the landscape flatness (ξ_i ∝ F_i), i.e., q_i ≫ ξ_i, for many high-ranking PCA modes (smaller i), which leads to the large task-1 test error after learning task-2, i.e., catastrophic forgetting.

The situation improves significantly when task-2 is learnt with the modified loss function L̃ given in Eq. 15. As shown in Fig. 4C, in the presence of the constraints (λ = 10) for the top N_c = 200 modes, even though the learning process for task-2 is slightly slower, the system is able to learn a solution for task-2 with an error ε_2 comparable to before (λ = 0). The significant advantage here is that the performance on task-1 (blue line) remains roughly the same as before, i.e., the system has avoided catastrophic forgetting. The reason is explained in Fig. 4D, which shows that q_i is bounded by the threshold, i.e., q_i ≲ ξ_i, for the top modes (i ≤ N_c) due to the constraints.

There is a tradeoff between the two test errors (ε_1, ε_2) when varying λ. As shown in Fig. 4E, the performance of LDC is better than that of EWC. This is not surprising, as LDC uses the full landscape information, whereas EWC only uses the diagonal elements of the Fisher information matrix (effectively the Hessian matrix). More interestingly, as shown in Fig.
4F, the overall performance (ε_1 + ε_2) of LDC reaches its optimal level when a relatively small number (N*_c ≪ N) of the top PCA modes are constrained.

FIG. 4: The landscape-dependent constraints for avoiding catastrophic forgetting. (A) The test errors for task-1 (ε_1) and task-2 (ε_2) versus training time for task-2 in the absence of the constraints (λ = 0). (B) The weight displacements q_i in different PCA directions ~p_i from task-1 in the absence of the constraints (λ = 0). The threshold ξ_i ∝ F_i is shown by the red dotted line. (C)&(D) are the same as (A)&(B) but in the presence of the constraints with λ = 10. (E) The tradeoff between the saturated test errors (ε_1 and ε_2) when varying λ for the LDC (blue circles) and EWC (red squares) algorithms. (F) The overall performance (ε_1 + ε_2) versus the number of constraints N_c for the LDC (blue circles) and EWC (red squares) algorithms. The two tasks are classifying two separate digit pairs [(0,1) for task-1 and (2,3) for task-2] in MNIST. For EWC, the constraints are applied to individual weights out of the N = 2,500 weights.

The results from the LDC algorithm suggest that memory of the previous task is encoded in the top N*_c PCA modes, and N*_c can be used to estimate the capacity of the network for sequential learning.

VI. SGD AS A SELF-TUNED (LANDSCAPE-DEPENDENT) ANNEALING STRATEGY FOR LEARNING

In the final section of our paper, we go back to evaluate our initial working hypothesis on the learning strategy deployed in SGD. In an equilibrium system with state variables ~θ and a free energy function L(~θ), the statistics of ~θ follow the Boltzmann distribution P(~θ) ∝ exp[−L(~θ)/T], where T is the constant temperature that characterizes the strength of the thermal fluctuations (we set the Boltzmann constant k_B = 1 here).
By expanding the loss function to second order around a minimum (~w_0 = 0), L ≈ L_min(1 + Σ_i 4(~θ · ~p_i)²/F_i²), it is easy to show that the variance of θ_i would be proportional to the squared flatness F_i² and the temperature T:

σ_i² ∝ T × F_i²,   (16)

which is a direct consequence of the fluctuation-response (aka fluctuation-dissipation) relation in equilibrium statistical physics.

Remarkably, for the SGD-based learning dynamics, we found an inverse relation between fluctuations of the variables and the flatness of the loss function landscape, Eq. 8, which is the opposite of the fluctuation-response relation in equilibrium systems given in Eq. 16. We have tested it with different variants of the SGD algorithms, such as ADAM and momentum-based algorithms; different databases (MNIST and CIFAR-10); and different DNN architectures (see SM and Fig. S3 for details). In all cases we studied, the inverse variance-flatness relation holds, suggesting that it is a universal property of SGD-based learning algorithms.

Unlike thermal noise in equilibrium systems, which represents a passive random driving force with a constant strength (temperature), the SGD "noise" ~η = −α ∂δL^µ/∂~θ represents an active learning/searching process that varies in "space" (~θ).

FIG. 5: Profiles and dynamics of the anisotropic active temperature. (A) The active temperature profile T_i(δθ, t) in the i-th PCA direction at t = 200. (B) The minimum active temperature T_i(0) in different PCA directions i. The inverse dependence of T_i on the flatness F_i is shown in the inset. (C) The active temperature profiles T_i(δθ, t) at different times for i = 10. (D) The active temperature decreases with time in sync with the loss function (red line) dynamics.
The shaded region highlights the transition between the fast learning phase and the exploration phase. The correlation between T_i and L is shown in the inset.

The intensity of this learning (search) activity along the i-th PCA direction ~p_i can be characterized by an active local temperature T_i(δθ, t):

T_i(δθ, t) ≡ α² ⟨ ||∂δL^µ(~w_0 + δθ ~p_i)/∂(δθ)||² ⟩_µ,   (17)

where δθ is the displacement from ~w_0 along the ~p_i direction.

As shown in Fig. 5A, the active temperature T_i(δθ, t) has a spatial profile similar to that of the loss function, with the active temperature higher away from the minimum. In regions of weight space where the overall loss function is high, the active temperature is also high, which drives the system away from regions of the weight space with high losses. The learning intensity also depends on the PCA direction, as shown in Fig. 5B. For a flatter direction with a larger value of F_i, T_i is lower (see the inset in Fig. 5B), as the basin of solutions is wide and thus no strong active learning is needed. However, for a steeper direction with a smaller value of F_i, the solutions exist only in smaller regions, and thus more intensive learning (a higher active temperature) is required. Therefore, the MLF ensemble can sense both the local loss and the non-local flatness of the landscape in different directions and use this information to drive active learning.

The active temperature also varies with time. As learning progresses, the active temperature profile decreases with time, as shown in Fig. 5C. In Fig. 5D, the dynamics of the active temperature and the overall loss function along a SGD trajectory are shown together. It is clear that the active temperature and the overall loss function are highly correlated, as shown directly in the inset of Fig. 5D, which means that the SGD system cools down as it learns. This reminds us of the well-known simulated annealing algorithm for optimization [16], where temperature is decreased from a high value to zero with some prescribed cooling schedule.
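The active temperature of Eq. 17 can be estimated by averaging the squared projection of the SGD noise gradient over minibatches. This is a sketch under our own conventions: the α² prefactor follows the Langevin-type noise correlation written after Eq. 4 and should be treated as a convention, and `grad_minibatch`/`grad_full` are hypothetical gradient callables, not the paper's code:

```python
import numpy as np

def active_temperature(w0, p_i, dtheta, grad_minibatch, grad_full,
                       minibatches, alpha=0.1):
    """Eq. 17 (sketch): T_i(dtheta) = alpha^2 <|d(deltaL^mu)/d(dtheta)|^2>_mu,
    the SGD noise strength along PCA direction p_i at displacement dtheta."""
    w = w0 + dtheta * p_i
    # gradient of deltaL^mu = L^mu - L, projected onto the direction p_i
    proj = [(grad_minibatch(w, mu) - grad_full(w)) @ p_i for mu in minibatches]
    return alpha**2 * np.mean(np.square(proj))
```

Scanning `dtheta` reproduces the spatial profiles of Fig. 5A; evaluating at `dtheta = 0` for each direction gives the minimum active temperature of Fig. 5B.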
However, the SGD algorithm seems to deploy a more intelligent, landscape-dependent annealing strategy, where the active temperature (learning intensity), driven by the MLF ensemble, is self-tuned according to the local and non-local properties of the loss landscape that are sensed by the MLF ensemble. This landscape-dependent annealing strategy drives the system towards the flat minima of the loss function landscape and keeps it there by lowering the active temperature once the flat minima are reached.

To summarize, a careful study of the SGD dynamics and the loss function landscape in this paper reveals a robust inverse relation between fluctuations in SGD and flatness of the loss landscape, which is critical for deciphering the learning strategy in deep learning and for designing more efficient algorithms. The statistical physics based approach provides a general theoretical framework for studying machine learning. We are currently using this approach to address other fundamental questions in machine learning, such as generalization [17, 18], the relation between task complexity and network architecture, transfer learning [19], and continual learning [20–22].

VII. METHODS

Neural network architecture and simulations. Two types of DNNs are studied. (1) Two fully-connected neural networks were used for classifying digits in the MNIST database, one with two hidden layers (784 × 50 × 50 × 10, Main Text) and the other with four hidden layers (SM). The response of the hidden-layer neurons is ReLU, the activation of the output neurons is Softmax, and there is no bias neuron. We also studied convolutional neural networks (CNN). (2) Two LeNets were used in our experiments. One is trained on the MNIST dataset (see Fig. S1 in SM); it has two convolution layers (with 16 and 32 output channels, respectively) and one fully-connected layer with size 1568 × 10.
(Here we represent a convolution layer as input channel number × kernel size × kernel size × output channel number.) The stride of the convolutions is 1, and zero padding is used to keep the data dimensions unchanged. After each convolution layer, there is a 2 × 2 max pooling layer. The other LeNet has two convolution layers (the second with 16 output channels) and three fully-connected layers, the first with input size 400 and the last with size 84 × 10. The stride of the convolutions is 1 and the size of max pooling is 2 × 2; we do not use zero padding in this network. All numerical experiments are done in the PyTorch neural network framework.

Principal component analysis in the exploration phase. For a given time in the exploration phase, we extract the weight matrix (50 × 50) between the two hidden layers and reshape it into a 1 × 2500 vector; weight vectors are collected over a window of T = 10 epochs. The PCA is applied using the sklearn package provided with Python 3.7.

Multi-task learning. We divided MNIST into 5 groups, each containing only 2 digits. We call these groups task-1, task-2, etc. We use the fully-connected neural network with two hidden layers (784 × 50 × 50 × 10).

VIII. ACKNOWLEDGMENTS

We thank Mattia Rigotti, Irina Rish, Matthew Riemer, Robert Ajemian, Yunfei Teng, and Alberto Sassi for discussions. We also thank Jerry Tersoff, Tom Theis, and Youssef Mroueh for comments on the manuscript. The work by Yu Feng was done while he was an IBM intern.

[1] LeCun, Y., Bengio, Y. & Hinton, G. Deep learning. Nature 521, 436–444 (2015). URL https://doi.org/10.1038/nature14539.
[2] Robbins, H. & Monro, S. A stochastic approximation method. The Annals of Mathematical Statistics 22, 400–407 (1951). URL http://dx.doi.org/10.1214/aoms/1177729586.
[3] Bottou, L. Large-scale machine learning with stochastic gradient descent. In Lechevallier, Y. & Saporta, G. (eds.) Proceedings of COMPSTAT'2010, 177–186 (Physica-Verlag HD, Heidelberg, 2010).
[4] Chaudhari, P. & Soatto, S. Stochastic gradient descent performs variational inference, converges to limit cycles for deep networks (2018). URL http://dx.doi.org/10.1109/ita.2018.8503224.
[5] Hinton, G. E. & van Camp, D.
Keeping the neural networks simple by minimizing the description length of the weights. In Proceedings of the Sixth Annual Conference on Computational Learning Theory, COLT '93, 5–13 (ACM, New York, NY, USA, 1993). URL http://doi.acm.org/10.1145/168304.168306.
[6] Hochreiter, S. & Schmidhuber, J. Flat minima. Neural Computation 9, 1–42 (1997).
[7] Baldassi, C. et al. Unreasonable effectiveness of learning neural networks: From accessible states and robust ensembles to basic algorithmic schemes. Proceedings of the National Academy of Sciences 113, E7655–E7662 (2016).
[8] Chaudhari, P. et al. Entropy-SGD: Biasing gradient descent into wide valleys (2016). 1611.01838.
[9] Zhang, Y., Saxe, A. M., Advani, M. S. & Lee, A. A. Energy-entropy competition and the effectiveness of stochastic gradient descent in machine learning. Molecular Physics 116, 3214–3223 (2018). URL http://dx.doi.org/10.1080/00268976.2018.1483535.
[10] Mei, S., Montanari, A. & Nguyen, P.-M. A mean field view of the landscape of two-layer neural networks. Proceedings of the National Academy of Sciences of the United States of America 115, E7665–E7671 (2018).
[11] Gur-Ari, G., Roberts, D. A. & Dyer, E. Gradient descent happens in a tiny subspace. arXiv e-prints arXiv:1812.04754 (2018). 1812.04754.
[12] Forster, D. Hydrodynamic fluctuations, broken symmetry, and correlation functions (CRC Press, 2018).
[13] McCloskey, M. & Cohen, N. J. Catastrophic interference in connectionist networks: The sequential learning problem. Psychology of Learning and Motivation 24, 109–165 (1989).
[14] Robins, A. Catastrophic forgetting, rehearsal and pseudorehearsal. Connection Science 7, 123–146 (1995).
[15] Kirkpatrick, J. et al. Overcoming catastrophic forgetting in neural networks. Proceedings of the National Academy of Sciences 114, 3521–3526 (2017).
[16] Kirkpatrick, S., Gelatt, C. D. & Vecchi, M. P. Optimization by simulated annealing. Science 220, 671–680 (1983). URL https://science.sciencemag.org/content/220/4598/671.
[17] Neyshabur, B., Bhojanapalli, S., McAllester, D. & Srebro, N.
Exploring generalization in deep learning. In NIPS (2017).
[18] Advani, M. S. & Saxe, A. M. High-dimensional dynamics of generalization error in neural networks (2017). 1710.03667.
[19] Yosinski, J., Clune, J., Bengio, Y. & Lipson, H. How transferable are features in deep neural networks? (2014). 1411.1792.
[20] Ring, M. B. Continual learning in reinforcement environments. Ph.D. thesis, University of Texas at Austin, Austin, Texas 78712 (1994).
[21] Lopez-Paz, D. & Ranzato, M. Gradient episodic memory for continual learning. NIPS (2017).
[22] Riemer, M. et al. Learning to learn without forgetting by maximizing transfer and minimizing interference (2018). 1810.11910.

Supplementary Material for "How neural networks find generalizable solutions: Self-tuned annealing in deep learning"

I. THE TWO PHASES OF LEARNING: THE INITIAL FAST LEARNING PHASE AND THE SLOW EXPLORATION PHASE

The dynamics of L versus iteration time are shown in Fig. S1, which shows that there are two phases of learning in DNNs. There is an initial fast learning phase, where the overall loss function decreases quickly and sometimes abruptly, followed by an exploration phase, where the overall loss L decreases much more slowly and gradually. Due to its slow time scale, the exploration phase can be considered as being in a quasi-steady state. The weights reached in the exploration phase can be considered solutions of the problem, given their small loss function values. These two phases exist independent of the hyperparameters and the network architecture, as shown in Fig. S1, where the transition region between the two phases is highlighted. II.
PROPERTIES OF THE FIRST PRINCIPAL COMPONENT MODE

In our study, we used the cross-entropy loss function for each sample:
\[ d(\vec{Y}_k, \vec{Z}_k) = -\ln\left[\frac{\exp(\vec{Y}_k \cdot \vec{Z}_k)}{\sum_{n=1}^{n_c} \exp(Y_{k,n})}\right] = -\ln\left[\frac{\exp(Y_{k,n(k)})}{\sum_{n=1}^{n_c} \exp(Y_{k,n})}\right], \tag{S1} \]
where $n_c$ is the total number of classes and also the dimension of the network output vector $\vec{Y}_k$ and the correct output vector $\vec{Z}_k$. The correct class for sample $k$ is $n(k)$, and $\vec{Z}_k$ is just a unit vector in the correct ($n(k)$) direction. Let $m(k)$ be the incorrect output component with the largest output value; we can then define $\Delta Y_k \equiv Y_{k,n(k)} - Y_{k,m(k)}$. The cross-entropy loss for sample $k$ can then be written as:
\[ l_k \equiv d(\vec{Y}_k, \vec{Z}_k) = \ln[1 + a_k \exp(-\Delta Y_k)] \approx a_k \exp(-\Delta Y_k), \tag{S2} \]
where $a_k = \sum_{n \neq n(k)} \exp(Y_{k,n} - Y_{k,m(k)})$ is an order $O(1)$ number given that $Y_{k,m(k)}$ is the largest among all the incorrect outputs ($n \neq n(k)$).

As we described in the main text, the persistent drift in the first PC direction $\vec{p}_1$ can be understood by translating a solution $\vec{w}_0$ found by SGD along $\vec{p}_1$ by $\theta$ to a new weight vector $\vec{w} = \vec{w}_0 + \theta \vec{p}_1$. We studied how such a change of weights affects $\Delta Y_k$ for a given sample $k$. We computed $\Delta Y_k$ as a function of $\theta$ for different samples, and found that the change in $\Delta Y_k$ depends approximately linearly on $\theta$:
\[ \Delta Y_k(\theta) \approx \Delta Y_k(0) + \beta_k \theta, \tag{S3} \]
where $\beta_k$ is a sample-dependent constant. This linear dependence is shown clearly in Fig. S2A for different samples covering different correct classes.

By using Eq. S3 in the expression for $l_k$, i.e., Eq. S2, we have:
\[ l_k(\theta) \approx l_k(0) \exp(-\beta_k \theta), \tag{S4} \]
which depends exponentially on $\theta$ for all samples $k$. As a result, the overall loss function along the first PC direction is:
\[ L(\theta) \equiv \langle l_k(\theta) \rangle_k \approx L(0) \exp(-\beta_1 \theta + \beta_2 \theta^2), \tag{S5} \]
where $\beta_1$ is the average of $\beta_k$, $\beta_1 = \langle \beta_k \rangle_k > 0$, and $\beta_2$ depends on the variation of $\beta_k$ among different samples.
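As a quick numerical sanity check of the decomposition above, the exact per-sample cross-entropy loss can be compared with the exact rewritten form $\ln[1 + a_k \exp(-\Delta Y_k)]$ and with the approximation $a_k \exp(-\Delta Y_k)$. The sketch below uses a made-up output vector, and the function name `cross_entropy_parts` is ours, not from the paper:

```python
import numpy as np

def cross_entropy_parts(Y, correct):
    """Return (exact loss l_k, prefactor a_k, margin Delta Y_k) for one output vector."""
    incorrect = np.delete(np.arange(Y.size), correct)
    m = incorrect[np.argmax(Y[incorrect])]      # index m(k) of largest incorrect output
    dY = Y[correct] - Y[m]                      # Delta Y_k = Y_{k,n(k)} - Y_{k,m(k)}
    a = np.sum(np.exp(Y[incorrect] - Y[m]))     # a_k, an O(1) prefactor
    exact = -np.log(np.exp(Y[correct]) / np.sum(np.exp(Y)))  # cross-entropy loss
    return exact, a, dY

Y = np.array([0.1, 5.0, -0.3, 1.2, 0.4])        # hypothetical outputs; class 1 is correct
exact, a, dY = cross_entropy_parts(Y, correct=1)
approx = a * np.exp(-dY)

# The rewritten form l_k = ln[1 + a_k exp(-Delta Y_k)] holds identically (Eq. S2).
print(np.isclose(exact, np.log(1.0 + a * np.exp(-dY))))   # True
# The approximation l_k ~ a_k exp(-Delta Y_k) is good when Delta Y_k is large.
print(abs(exact - approx) / exact < 0.05)                 # True
```

The identity follows because $1 + a_k \exp(-\Delta Y_k) = \sum_n \exp(Y_{k,n} - Y_{k,n(k)})$; the final approximation then only drops the $\ln(1+x) \approx x$ correction, which is small once the correct output clearly dominates.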
The expression for $L(\theta)$ agrees with the one obtained from the MLF ensemble average derived in the main text, with the correspondences $\beta_1 = M^{(0)}_{11} \theta^{(0)}_1$ and $\beta_2 = M^{(0)}_{11}/2$. The dependence of $\ln(L(\theta))$ expressed in Eq. S5, i.e., $\ln(L(\theta)) \approx \ln(L(0)) - \beta_1 \theta + \beta_2 \theta^2$, agrees with our numerical results shown in Fig. S2B, which shows that $\ln(L)$ has a finite slope at $\theta = 0$. It is this finite slope that drives the drift motion of $\theta$ observed in Fig. 1C in the main text.

III. ROBUSTNESS AND GENERALITY OF THE INVERSE VARIANCE-FLATNESS RELATION

As shown in the main text (Fig. 2D), the inverse relation between the SGD variance and the flatness of the loss function landscape is the same for different choices of the hyperparameters, batch size B and learning rate α. Since all the results reported in the main text are from a simple network with 2 hidden layers applied to the MNIST dataset, we have tested the robustness of this inverse variance-flatness relation in different networks, for different algorithms, and for different datasets. We first tested the variance-flatness relation in a DNN with a larger number of hidden layers, e.g., a 4-hidden-layer network as shown in Fig. S3A. We found the same inverse variance-flatness relation for weights in all the different layers, as shown in Fig. S3B. We have also tested the inverse variance-flatness relation in other DNN architectures such as LeNet (illustrated in Fig. S3C), with different learning algorithms such as ADAM, and with different datasets such as CIFAR. As shown in Fig. S3D, the inverse variance-flatness relation holds true in all the cases we studied.

FIG. S1: Two different phases in learning. (A) shows the phase transition in a multi-layer fully-connected neural network. Here we use a network with four hidden layers, and each hidden layer has 50 units. The experiment is done on the MNIST dataset. We can see that there is an obvious fast drop of the cross-entropy loss around 50 epochs, after which the system enters the exploration phase with slowly changing loss.
This phase transition holds true for different combinations of learning rate and batch size. (B) shows the phase transition in a LeNet, which has two convolution layers with 16 and 32 output channels, respectively, and one fully-connected layer of size 1586 × 10 (see Fig. S3C). We can see that there is also a drop of the cross-entropy loss around 30 epochs.

FIG. S2: The effect of changing $\theta$ along the first PC direction $\vec{p}_1$. (A) The dependence of $\Delta Y_k$, the difference between the correct output and the maximum incorrect output, on the displacement $\theta$ along the first PC for different samples $k$. $\Delta Y_k$ increases linearly with $\theta$. Each symbol corresponds to an average over 5 samples of the same digit. (B) The landscape of $\ln L_i(\theta_i)$ along the $i$-th PC direction. Only $\ln(L_1)$ has a finite gradient at $\theta_1 = 0$, which indicates a drift motion in $\vec{p}_1$. In all other PC directions ($i \geq 2$), $\ln(L_i)$ has a zero gradient at $\theta_i = 0$, which indicates a diffusive motion in these directions.

FIG. S3: The variance-flatness relation for different networks, datasets, and optimization methods. (A) A four-hidden-layer neural network is used on the MNIST dataset. (B) The inverse relation between variance and flatness holds between any two adjacent hidden layers. (C) Illustration of LeNet, which has two convolution layers with sizes of 3 × 5 × 5 × 6 and 6 × 5 × 5 × 16, and three fully-connected layers with sizes of 400 × 120, 120 × 84, and 84 × 10.
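The variance-flatness measurements discussed above can be illustrated with a minimal sketch. Here flatness along a direction is taken as the width of the region around the minimum where the loss stays below a fixed multiple of its minimum value; the toy quadratic losses, the ten-dimensional weight vector, and the threshold factor $e$ are our illustrative choices, not necessarily the paper's exact definitions:

```python
import numpy as np

def flatness(loss_fn, w0, direction, thetas, factor=np.e):
    """Width of the region along `direction` where the loss profile
    L(w0 + theta * direction) stays below factor * min(L)."""
    profile = np.array([loss_fn(w0 + t * direction) for t in thetas])
    inside = thetas[profile <= factor * profile.min()]
    return inside.max() - inside.min()

w0 = np.zeros(10)                               # a toy "solution" in weight space
thetas = np.linspace(-20.0, 20.0, 2001)         # scan range along the direction
widths = []
for c in (0.1, 1.0, 10.0):                      # increasing curvature of the minimum
    def loss_fn(w, c=c):
        return 1.0 + 0.5 * c * w[0] ** 2        # toy quadratic loss along e_1
    e1 = np.zeros(10)
    e1[0] = 1.0                                 # scan direction (stand-in for a PC)
    widths.append(flatness(loss_fn, w0, e1, thetas))

# A sharper minimum (larger curvature) gives a smaller flatness, and vice versa.
print(widths[0] > widths[1] > widths[2])        # True
```

In an actual measurement, `loss_fn` would be the network loss evaluated on the training set and `direction` a principal component of the weight trajectory; the inverse variance-flatness relation then compares these widths with the SGD weight variance along the same directions.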