How neural networks find generalizable solutions: Self-tuned annealing in deep learning
Yu Feng 1,2 and Yuhai Tu 1
1 IBM T. J. Watson Research Center, Yorktown Heights, NY 10598
2 Department of Physics, Duke University, Durham, NC 27710
Despite the tremendous success of the Stochastic Gradient Descent (SGD) algorithm in deep learning, little is known about how SGD finds generalizable solutions in the high-dimensional weight space. By analyzing the learning dynamics and the loss function landscape, we discover a robust inverse relation between the weight variance and the landscape flatness (inverse of curvature) for all SGD-based learning algorithms. To explain the inverse variance-flatness relation, we develop a random landscape theory, which shows that the SGD noise strength (effective temperature) depends inversely on the landscape flatness. Our study indicates that SGD attains a self-tuned, landscape-dependent annealing strategy to find generalizable solutions at the flat minima of the landscape. Finally, we demonstrate how these new theoretical insights lead to more efficient algorithms, e.g., for avoiding catastrophic forgetting.

I. INTRODUCTION
One key ingredient of the powerful deep neural network (DNN) based machine learning paradigm – Deep Learning [1] – is a relatively simple iterative method called stochastic gradient descent (SGD) [2, 3]. However, despite the tremendous successes of Deep Learning, the reason why SGD is so effective at learning in a high-dimensional nonconvex loss function (energy) landscape remains poorly understood. The random element seems key for SGD, yet makes it harder to understand. Fortunately, many physical systems include such a random element, e.g., Brownian motion, and powerful tools have been developed for understanding collective behaviors in stochastic systems with many degrees of freedom. Here, we use concepts and methods from statistical physics to investigate the SGD dynamics, the loss function landscape, and, more importantly, their relationship.

We start by introducing the SGD-based learning process as a stochastic dynamical system. A learning system such as a neural network (NN), especially a DNN, has a large number (N) of weight parameters w_i (i = 1, 2, ..., N). For supervised learning, there is a set of M training samples, each with an input ~X_k and a correct output ~Z_k for k = 1, 2, ..., M. For each input ~X_k, the learning system predicts an output ~Y_k = G(~X_k, ~w), where the output function G depends on the architecture of the NN as well as on its weights ~w. The goal of learning is to find the weight parameters that minimize the difference between the predicted and correct outputs, characterized by an overall loss function (or energy function):

L(~w) = M^{-1} Σ_{k=1}^{M} d(~Y_k, ~Z_k),   (1)

where d(~Y_k, ~Z_k) is a measure of the distance between ~Y_k and ~Z_k.
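The overall loss of Eq. 1 can be sketched in a few lines. This is a minimal illustration, not the authors' code; the helper names (`cross_entropy`, `overall_loss`) and the toy network `G` in the test are our own, and the per-sample distance shown is the cross-entropy loss used later in the paper's SM.

```python
import numpy as np

def cross_entropy(Y, Z):
    """Per-sample distance d(Y_k, Z_k): cross-entropy between softmax(Y)
    and the one-hot correct output Z (the per-sample loss used in the SM)."""
    p = np.exp(Y - Y.max())      # numerically stable softmax
    p /= p.sum()
    return -np.log(p[np.argmax(Z)])

def overall_loss(w, X, Z, G):
    """Eq. 1: L(w) = M^{-1} sum_k d(Y_k, Z_k), with Y_k = G(X_k, w)."""
    return np.mean([cross_entropy(G(x, w), z) for x, z in zip(X, Z)])
```

For uniform logits over n_c classes the per-sample loss is ln(n_c), and a confident correct prediction drives it toward zero.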
In our study, a cross-entropy loss for d (see Supplementary Material (SM) for its expression) is used. One learning strategy is to update the weights by following the gradient of L directly. However, this batch learning scheme is computationally prohibitive for large datasets, and it also has the obvious shortfall of being trapped by local minima. SGD was first introduced to circumvent the large-dataset problem by updating the weights according to a subset (minibatch) of samples randomly chosen at each iteration [2]. Specifically, the change of weight w_i (i = 1, 2, ..., N) for iteration t in SGD is given by:

Δw_i(t) = −α ∂L^{µ(t)}(~w)/∂w_i,   (2)

where α is the learning rate and µ(t) represents the random minibatch used for iteration t. The mini loss function (MLF) for minibatch µ of size B is defined as:

L^µ(~w) = B^{-1} Σ_{l=1}^{B} d(~Y_{µ_l}, ~Z_{µ_l}),   (3)

where µ_l (l = 1, 2, ..., B) labels the B randomly chosen samples.

Here, we introduce the key concept of a MLF ensemble {L^µ(~w)}, i.e., an ensemble of energy landscapes, each from a random minibatch. The overall loss function L(~w) is just the ensemble average of the MLF: L ≡ ⟨L^µ⟩_µ. The SGD noise comes from the variation between a MLF and its ensemble average: δL^µ ≡ L^µ − L. By taking the continuum time limit in Eq. 2, we obtain the following stochastic differential equation for SGD:

∂~w/∂t = −α ∂L/∂~w + ~η(~w),   (4)

where we have taken the learning step as the unit of time. This equation is analogous to the Langevin equation in statistical physics.
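Eqs. 2 and 3 amount to averaging per-sample gradients over a freshly drawn minibatch at each iteration. A minimal sketch (our own illustration, not the paper's code; `grad_sample`, a hypothetical per-sample gradient callable, is an assumption):

```python
import numpy as np

def sgd_step(w, X, Z, grad_sample, alpha=0.1, B=50, rng=None):
    """One SGD iteration (Eq. 2): step along the gradient of the mini loss
    function L^mu (Eq. 3), i.e., the per-sample gradients averaged over a
    random minibatch mu(t) of size B."""
    rng = rng or np.random.default_rng()
    idx = rng.choice(len(X), size=B, replace=False)   # random minibatch mu(t)
    g_mu = np.mean([grad_sample(w, X[k], Z[k]) for k in idx], axis=0)
    return w - alpha * g_mu
```

Because the minibatch is redrawn at every step, the gradient of δL^µ = L^µ − L acts as the weight-dependent noise ~η of Eq. 4.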
The first term, −α ∂L/∂~w, is the deterministic gradient descent governed by the overall loss function L, analogous to the energy function in physics. The second term is the SGD noise term ~η ≡ −α∇_{~w} δL^µ(~w), with zero mean ⟨~η⟩ = 0 and equal-time correlation

C_{ij}(~w) ≡ ⟨η_i η_j⟩ = α² ⟨(∂δL^µ/∂w_i)(∂δL^µ/∂w_j)⟩_µ,

which depends explicitly on ~w. As first pointed out by Chaudhari and Soatto [4], unlike equilibrium physical systems, where the noise has a constant strength given by the thermal temperature, the SGD dynamics is highly nonequilibrium: the SGD noise is anisotropic and varies in the weight space. In the rest of the paper, we study the SGD-based machine learning process by adopting the nonequilibrium stochastic dynamics framework. Our working hypothesis is that SGD may serve as an efficient annealing strategy that varies the noise (or effective temperature) "intelligently" according to the loss function landscape in order to find the shallow (flat) minima where "good" (generalizable) solutions are believed to exist [5–10].

II. LEARNING VIA A LOW-DIMENSIONAL DRIFT-DIFFUSION DYNAMICS IN SGD
In general, SGD-based DNN learning dynamics can be roughly divided into an initial fast learning phase, where L decreases rapidly, followed by an "exploration" phase, where the weights keep evolving but the loss L only changes very slowly (see Fig. S1 in SM). The weight vectors sampled in the exploration phase are treated as solutions to the problem given their low losses.

We used principal component analysis (PCA) to study the stochastic motion of the weight vector during the SGD learning process, in particular during the exploration phase, when the system reaches a quasi-steady-state (see Methods section for details of the PCA). Within a large time window t ∈ [t_0, t_0 + T], where t_0 is a time in the exploration phase and T (= 10 epochs used here) is a large time window, the weight dynamics can be decomposed into its variations along different principal components:

~w(t) = ⟨~w⟩_T + Σ_{i=1}^{N} θ_i(t) ~p_i,   (5)

where ⟨~w⟩_T = T^{-1} ∫_{t_0}^{t_0+T} ~w(t) dt is the average weight vector in the time window T, and ~p_i is the i-th principal component base vector with ~p_i · ~p_j = δ_{ij}. The projection of the weight vector along the PCA direction ~p_i is given by θ_i(t), which is a linear combination of the individual weights; ~θ is the weight vector in the PCA coordinates.

The results reported here are for a simple NN with 2 hidden layers, each with 50 neurons, for classification tasks using the MNIST database (see Methods for details and other NN architectures used). The PCA was done for the N = 2500 weights between the two hidden layers (results for other NN architectures and databases are included in the SM). In Fig. 1A, we show the PCA spectrum, i.e., the variance σ_i² ≡ T^{-1} ∫_{t_0}^{t_0+T} θ_i²(t) dt versus its rank i in descending order (σ_{i+1}² < σ_i²). We found that the variance in the first PCA direction (~p_1) is much larger than the variances in the other directions because the motion along ~p_1 has a net drift velocity (see discussion below and Fig. 1C).
For the other PCA directions, after a small number of leading PCA directions (i ≥ 2), σ_i² decays rapidly with its rank: σ_i² ∼ i^{-γ} for 20 < i < 200, with a large exponent γ ∼ 3. Excluding i = 1, more than 90% of the total variance occurs in the first 35 PCA modes, much smaller than the total number of weights N = 2500. (Each epoch has M/B iterations, which covers all training samples once.)
FIG. 1: The PCA results and the drift-diffusion motion in SGD. (A) The rank-ordered variance σ_i² in different principal component (PC) directions i. For i ≥ 2, σ_i² decreases with i as a power law i^{-γ} with γ ∼ 3. (B) The normalized cumulative variance of the top (n − 1) PCA components excluding i = 1. It reaches ∼90% at n = 35, much smaller than the total number of weights N = 2500 between the two hidden layers. (C) The SGD weight trajectory projected onto the (θ_1, θ_2) plane. The persistent drift motion in θ_1 and the diffusive random motion in θ_2 are clearly shown. (D) The diffusive motion in the (θ_i, θ_j) plane with j > i (≠ 1) randomly chosen (i = 49 and j = 50 shown here). Unless otherwise stated, the hyperparameters used are minibatch size B = 50 and a fixed learning rate α.

Along the first PCA direction ~p_1, there is a net drift velocity dθ_1/dt with a persistence time much longer than 1 epoch, as clearly shown in Fig. 1C, where the SGD dynamics projected onto the (θ_1, θ_2) space is shown. For all other PCA directions, the dynamics are random walks with a short velocity correlation time (shorter than 1 epoch), as clearly demonstrated in Fig. 1D, where the SGD dynamics projected onto a randomly chosen pair of PCA directions (θ_49, θ_50) is shown.

The persistent drift in the first PCA direction ~p_1 can be understood by moving a solution ~w_0 found by SGD along ~p_1 by θ_1 to a new weight vector ~w = ~w_0 + θ_1 ~p_1. We find that such a translation results roughly in an overall amplification of the difference between the outputs for the right class and the wrong classes, which leads to a change in the cross-entropy loss function L(~w) ≈ L(~w_0) × exp(−βθ_1), with β a constant parameter. This dependence of L on θ_1 leads to the persistent motion along the ~p_1 direction with a low speed proportional to L, which slowly decreases with time itself (see SM for details). In the next section, we focus on the majority diffusive PCA modes (i ≥ 2) and their relation to the loss function landscape.
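The decomposition of Eq. 5 can be reproduced with a plain SVD-based PCA of the sampled weight trajectory. This is our own sketch, numerically equivalent to the sklearn PCA mentioned in Methods; the function name is an assumption:

```python
import numpy as np

def pca_of_trajectory(W):
    """PCA of a weight trajectory W of shape (T, N): rows are flattened
    weight vectors w(t) sampled over the exploration window.
    Returns the rank-ordered variances sigma_i^2, the principal directions
    p_i (rows of Vt), and the projections theta_i(t) of Eq. 5."""
    Wc = W - W.mean(axis=0)                    # subtract <w>_T
    U, s, Vt = np.linalg.svd(Wc, full_matrices=False)
    var = s**2 / len(W)                        # sigma_i^2 along each p_i
    theta = Wc @ Vt.T                          # theta[t, i] = (w(t) - <w>_T) . p_i
    return var, Vt, theta
```

The PCA spectrum of Fig. 1A is then just `var` plotted against its rank.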
III. THE LOSS FUNCTION LANDSCAPE AND THE INVERSE VARIANCE-FLATNESS RELATION
In the exploration phase, the loss function is small, and all the weight vectors along the SGD trajectory can be considered as valid solutions. However, the solutions found by a single SGD trajectory only represent a small subset of valid solutions. To gain insight into the full solution space, we study the loss function landscape around a specific solution ~w_0 reached by SGD. Specifically, we compute the loss function profile L_i along the i-th PCA direction ~p_i determined from the PCA of the SGD dynamics:

L_i(δθ) ≡ L(~w_0 + δθ ~p_i).   (6)

In Fig. 2A, we show the loss function landscape profiles L_i(δθ) for several diffusive PCA directions. In Fig. 2B, we plot ln(L_i(δθ)) versus δθ, which shows that ln(L_i(δθ)) can be fitted well by a quadratic function of δθ near δθ = 0 for all the diffusive directions (i ≥ 2):

ln(L_i(δθ)) ≈ ln(L_0) + 4δθ²/F_i²,   (7)

where L_0 ≡ L(~w_0) is the loss at ~w_0, and F_i, inversely related to the curvature, defines a flatness parameter of the loss function landscape in the i-th PCA direction. In general, F_i can be defined as the width of the region within which L_i ≤ e × L_0. A larger value of F_i means a flatter landscape in the i-th PCA direction.
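Given a way to evaluate the loss along a PCA direction (Eq. 6), the flatness F_i of Eq. 7 follows from a quadratic fit of ln L_i versus δθ. A sketch with our own function names; `loss_along`, a callable returning L(~w_0 + δθ ~p_i), is a hypothetical stand-in for the actual network evaluation:

```python
import numpy as np

def flatness(loss_along, half_width=0.5, n=41):
    """Estimate F_i by fitting ln L_i(dtheta) ~ ln L0 + 4 dtheta^2 / F_i^2
    (Eq. 7) on a symmetric grid of displacements around dtheta = 0."""
    dtheta = np.linspace(-half_width, half_width, n)
    logL = np.log([loss_along(d) for d in dtheta])
    slope, _ = np.polyfit(dtheta**2, logL, 1)   # slope = 4 / F_i^2
    return np.sqrt(4.0 / slope)
```

With this parameterization, F_i is exactly the width where L_i ≤ e × L_0: at δθ = ±F_i/2 the exponent 4δθ²/F_i² equals 1.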
FIG. 2: The loss function landscape and the inverse variance-flatness relation. (A) The loss function profile L_i along the i-th PCA direction. (B) The ln(L_i) profile. It can be fitted by a quadratic function (the red dotted line). The definition of the flatness F_i is also shown. (C) The flatness F_i for different PCA directions i. (D) The inverse relation between the variance σ_i² and the flatness F_i for different choices of minibatch size B and learning rate α.

Quantitatively, we found that the flatness F_i increases with i, as shown in Fig. 2C. Given that the SGD variance σ_i² decreases with i, as shown in Fig. 1A, this immediately suggests an inverse relationship between the loss function landscape flatness and the SGD variance. Indeed, as shown in Fig. 2D, the inverse variance-flatness relation holds true for different choices of minibatch size B and learning rate α. Quantitatively, the variance-flatness relation follows approximately a power law:

σ_i² ∼ F_i^{−ψ},   (8)

where the exponent ψ is roughly independent of B and α.

The inverse variance-flatness relation is counter-intuitive. In equilibrium systems, the fluctuation of a variable around its equilibrium value is proportional to the change of the variable in response to an external perturbation, which is known as the fluctuation-response (or fluctuation-dissipation) relation [12]. Since the response is proportional to the flatness of the energy function, or equivalently the loss function used in machine learning, the fluctuation-response relation in equilibrium systems means that the variance of a variable should be larger for a flatter landscape, which is the opposite of the inverse relation shown in Fig. 2D for SGD.

What is the reason for the inverse variance-flatness relation in SGD? Unlike equilibrium systems, where the noise strength (temperature) is a constant, the SGD noise comes from the difference between the gradient of a random MLF and that of the overall (mean) loss function, and thus it varies in the weight space and in time.
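The exponent ψ of Eq. 8 can be extracted by a standard linear regression in log-log coordinates, sketched here with our own function name:

```python
import numpy as np

def variance_flatness_exponent(var, F):
    """Fit sigma_i^2 ~ F_i^(-psi) (Eq. 8): regress log(sigma_i^2)
    against log(F_i) and return the exponent psi = -slope."""
    slope, _ = np.polyfit(np.log(F), np.log(var), 1)
    return -slope
```

Feeding in the measured (σ_i², F_i) pairs for the diffusive modes yields the power-law exponent shown in Fig. 2D.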
In the next section, we explain the inverse variance-flatness relation based on the dependence of the SGD noise on the statistical properties of the MLF ensemble.

IV. THE RANDOM LANDSCAPE THEORY AND ORIGIN OF THE INVERSE VARIANCE-FLATNESS RELATION
The most distinctive feature of SGD is that at any given iteration (time), the learning dynamics is driven by a random minibatch out of an ensemble of minibatches, each with its own random mini loss function (MLF). To understand the SGD dynamics, we develop a random landscape theory to describe the statistical properties of the MLF ensemble near a solution ~w_0 (we set ~w_0 = 0 for convenience).

As shown in Fig. 3A, we found that L^µ has roughly the same general shape as L near its own minimum. Therefore, we approximate L^µ by an inverse Gaussian function with random parameters:

L^µ(~θ) ≈ L^µ_min exp[Σ_{i,j=1}^{N} M^µ_{ij}(θ_i − θ^µ_i)(θ_j − θ^µ_j)],   (9)

where L^µ_min is the minimum value of the MLF, ~θ^µ is its random minimum position, and M^µ is its random Hessian matrix. Over the MLF ensemble, the mean minimum position defines θ^{(0)}_i ≡ ⟨θ^µ_i⟩_µ with variance σ²_{θ,i}; the mean diagonal Hessian element is M^{(0)}_{ii} ≡ ⟨M^µ_{ii}⟩_µ, with σ²_{ij} the variance of the off-diagonal elements M^µ_{ij}; and the combined variance is σ̃²_{θ,i} ≡ σ²_{θ,i} + (M^{(0)}_{ii})^{−2} Σ_{j≠i} σ²_{ij} σ²_{θ,j} (Eq. 11). The gradient-descent dynamics in a given minibatch follows from Eq. 9:

dθ_i/dt = −α ∂L^µ/∂θ_i ∝ −αL^µ Σ_j M^µ_{ij}(θ_j − θ^µ_j).   (12)

FIG. 3: Statistical properties of the MLF ensemble. (A) Profiles of the overall loss function ln(L_i) (red line) and a set of randomly chosen MLFs ln(L^µ_i) (blue dashed lines) in a given PCA direction i. (B) The variance σ̃²_{θ,i} of the minimum positions and the average diagonal element M^{(0)}_{ii} of the Hessian matrices of the MLF ensemble versus the flatness F_i of the overall loss function. The combination (M^{(0)}_{ii})² σ̃²_{θ,i} versus F_i is also shown. i = 30 used here.

Each MLF is positioned at its own random minimum ~θ^µ. The average relaxation time, τ_i ≡ 1/(αL_0 M^{(0)}_{ii}), is long given that L_0 is small. For t ≪ τ_i, the variation of θ_j is much smaller than that of the minimum position θ^µ_j, i.e., |θ_j| ≪ σ_{θ,j}. Therefore, we can neglect θ_j (≪ θ^µ_j) in Eq. 12, and we have:

dθ_i/dt ≡ v_i = αL^µ Σ_j M^µ_{ij} θ^µ_j,   (13)

which explains the drift-diffusion motion observed in our numerical simulations. For the first PCA direction (i = 1), since ⟨θ^µ_1⟩_µ = θ^{(0)}_1 ≠ 0, there is a net drift velocity v^{(0)}_1 ≡ ⟨v_1⟩_µ ≈ αL_0 M^{(0)}_{11} θ^{(0)}_1. For all the other PCA directions (i ≥ 2), the motion is diffusive with ⟨v_i⟩_µ = 0 and a diffusion constant D_i given by:

D_i ≡ ⟨v_i²⟩Δt ≈ α²⟨(L^µ)²⟩_µ [(M^{(0)}_{ii})² σ²_{θ,i} + Σ_{j≠i} σ²_{ij} σ²_{θ,j}] = α²⟨(L^µ)²⟩_µ (M^{(0)}_{ii})² σ̃²_{θ,i},   (14)

where we have used the expression for σ̃²_{θ,i} from Eq.
11. The above result, Eq. 14, provides a clear explanation for the inverse variance-flatness relation for the diffusive modes (i ≥ 2): a flatter direction (larger F_i) has a smaller mean curvature M^{(0)}_{ii}, which leads to a smaller diffusion constant D_i and thus a smaller variance, since σ_i² ∝ D_i ∝ (M^{(0)}_{ii})² σ̃²_{θ,i}. Quantitatively, both M^{(0)}_{ii} and σ̃²_{θ,i} can be determined from the MLF ensemble statistics. As shown in Fig. 3B, σ̃_{θ,i} scales almost linearly with F_i, and M^{(0)}_{ii} scales inversely with F_i, approximately as F_i^{−2}. As a result, we have σ_i² ∝ (M^{(0)}_{ii})² σ̃²_{θ,i} ∝ F_i^{−2} (see Fig. 3B), which is consistent with the inverse power-law dependence shown in Fig. 2D, although the theoretically obtained exponent 2 is smaller than the exponent ψ measured directly from the SGD dynamics (Fig. 2D).

V. PREVENTING CATASTROPHIC FORGETTING BY USING LANDSCAPE-DEPENDENT CONSTRAINTS

To demonstrate the utility of the theoretical insights gained so far, we tackle a long-standing challenge in machine learning, i.e., how to prevent catastrophic forgetting (CF) [13, 14]. After a deep neural network (DNN) learns to perform a particular task, it is trained for another task. Though the DNN can readjust its weights to perform well on the new task, it may forget the previous task and thus fail catastrophically on it. To prevent forgetting, a recent study by Kirkpatrick et al. [15] proposed the elastic weight consolidation (EWC) algorithm, which trains a new task while enforcing constraints on individual weights based on their effects on the performance of the previous task.
Here, armed with the new insights on the loss landscape and SGD dynamics, we propose a landscape-dependent constraints (LDC) algorithm that trains for the new task with constraints applied to the collective PCA modes from the previous task, with constraint strengths determined by the loss landscape flatness of the previous task.

Explicitly, when we find a solution ~w_1 in the weight space for the first task (task-1) by SGD, we can also determine the loss function landscape around ~w_1 in terms of the flatness parameter F_i along the i-th PCA direction ~p_i for task-1, which can be obtained directly (cheaply) from the weight variance by using the variance-flatness relation, Eq. 8. When learning the new task, we can use a modified loss function for the second task (task-2) by introducing an additional cost term that penalizes the network for going out of the attraction basin of the task-1 solutions:

L̃(~w) = L_2(~w) + λ Σ_{i=1}^{N_c} ((~w − ~w_1) · ~p_i)² / F_i²,   (15)

where L_2 is the original loss function for task-2 and N_c (≤ N) is the number of constrained PCA modes from task-1. In general, constraints can be included for all previous tasks for sequential learning of more than two tasks. The overall strength of the constraints from task-1 is parameterized by λ. The strength of the constraint for a specific PCA mode depends inversely on the flatness of the loss function in that specific PCA direction.

Based on the large attraction basin for a given task, as evidenced by the large flatness parameters shown in Fig. 2, we expect that solutions for task-2 exist within the basin of solutions for task-1, so the performance on task-1 should not be degraded significantly after learning task-2. We tested this idea in the MNIST database with task-1 and task-2 corresponding to classifying two disjoint subsets of digits, e.g., (0,1) for task-1 and (2,3) for task-2 (see Methods for details). As shown in Fig.
4A, in the absence of the constraints (λ = 0), starting with a task-1 solution ~w_1, the weights evolve quickly to a solution ~w_2 for task-2 with a small task-2 test error ε_2 (red line). However, the performance on task-1 deteriorates quickly, with a fast-increasing task-1 test error ε_1 (blue line). The reason can be understood from Fig. 4B, where q_i = ||(~w − ~w_1) · ~p_i||, the projections of the weight displacement vector onto different PCA directions of task-1, are shown. Without constraints, the displacement becomes much larger than a threshold ξ_i set by the landscape flatness (ξ_i ∝ F_i), i.e., q_i ≫ ξ_i, for many high-ranking PCA modes (smaller i), which leads to the large task-1 test error after learning task-2, i.e., catastrophic forgetting.

The situation improves significantly when task-2 is learnt with the modified loss function L̃ given in Eq. 15. As shown in Fig. 4C, in the presence of the constraints (λ = 10) for the top N_c = 200 modes, even though the learning process for task-2 is slightly slower, the system is able to learn a solution for task-2 with an error ε_2 comparable to before (λ = 0). The significant advantage here is that the performance on task-1 (blue line) remains roughly the same as before, i.e., the system has avoided catastrophic forgetting. The reason is explained in Fig. 4D, which shows that q_i is bounded by the threshold, i.e., q_i ≲ ξ_i, for the top modes (i ≤ N_c) due to the constraints.

There is a tradeoff between the two test errors (ε_1, ε_2) when varying λ. As shown in Fig. 4E, the performance of LDC is better than that of EWC. This is not surprising, as LDC uses the full landscape information, whereas EWC only uses the diagonal elements of the Fisher information matrix (effectively the Hessian matrix). More interestingly, as shown in Fig.
4F, the overall performance (ε_1 + ε_2) of LDC reaches its optimal level when a relatively small number (N*_c ≪ N) of the top PCA modes are constrained.

FIG. 4: The landscape-dependent constraints for avoiding catastrophic forgetting. (A) The test errors for task-1 (ε_1) and task-2 (ε_2) versus training time for task-2 in the absence of the constraints (λ = 0). (B) The weight displacements q_i in different PCA directions ~p_i from task-1 in the absence of the constraints (λ = 0). The threshold ξ_i ∝ F_i is shown by the red dotted line. (C)&(D) are the same as (A)&(B) but in the presence of the constraints with λ = 10. (E) The tradeoff between the saturated test errors (ε_1 and ε_2) when varying λ for the LDC (blue circles) and EWC (red squares) algorithms. (F) The overall performance (ε_1 + ε_2) versus the number of constraints N_c for the LDC (blue circles) and EWC (red squares) algorithms. The two tasks are classifying two separate digit pairs [(0,1) for task-1 and (2,3) for task-2] in MNIST. For EWC, the constraints are applied to individual weights out of the N = 2,500 weights.

The results from the LDC algorithm suggest that memory of the previous task is encoded in the top N*_c PCA modes, and N*_c can be used to estimate the capacity of the network for sequential learning.

VI. SGD AS A SELF-TUNED (LANDSCAPE-DEPENDENT) ANNEALING STRATEGY FOR LEARNING

In the final section of our paper, we go back to evaluate our initial working hypothesis on the learning strategy deployed in SGD. In an equilibrium system with state variables ~θ and a free energy function L(~θ), the statistics of ~θ follow the Boltzmann distribution P(~θ) ∝ exp[−L(~θ)/T], where T is the constant temperature that characterizes the strength of the thermal fluctuations (we set the Boltzmann constant k_B = 1 here).
By expanding the loss function to second order around a minimum (~w_0 = 0), L ≈ L_min(1 + Σ_i 4(~θ · ~p_i)²/F_i²), it is easy to show that the variance of θ_i would be proportional to the squared flatness F_i² and the temperature T:

σ_i² ∝ T × F_i²,   (16)

which is a direct consequence of the fluctuation-response (aka fluctuation-dissipation) relation in equilibrium statistical physics.

Remarkably, for the SGD-based learning dynamics, we found an inverse relation between fluctuations of the variables and the flatness of the loss function landscape, Eq. 8, which is the opposite of the fluctuation-response relation in equilibrium systems given in Eq. 16. We have tested it with different variants of the SGD algorithms, such as ADAM and momentum-based algorithms; different databases (MNIST and CIFAR-10); and different DNN architectures (see SM and Fig. S3 for details). In all cases we studied, the inverse variance-flatness relation holds, suggesting that it is a universal property of SGD-based learning algorithms.

Unlike thermal noise in equilibrium systems, which represents a passive random driving force with a constant strength (temperature), the SGD "noise" ~η = −α ∂δL^µ/∂~θ represents an active learning/searching process that varies in "space" (~θ).

FIG. 5: Profiles and dynamics of the anisotropic active temperature. (A) The active temperature profile T_i(δθ, t) in the i-th PCA direction at t = 200. (B) The minimum active temperature T_i(0) in different PCA directions i. The inverse dependence of T_i on the flatness F_i is shown in the inset. (C) The active temperature profiles T_i(δθ, t) at different times for i = 10. (D) The active temperature decreases with time in sync with the loss function (red line) dynamics.
The shaded region highlights the transition between the fast learning phase and the exploration phase. The correlation between T_i and L is shown in the inset.

The intensity of this learning (search) activity along the i-th PCA direction ~p_i can be characterized by an active local temperature T_i(δθ, t):

T_i(δθ, t) ≡ α² ⟨ ||∂δL^µ(~w_0 + δθ ~p_i)/∂(δθ)||² ⟩_µ,   (17)

where δθ is the displacement from ~w_0 along the ~p_i direction.

As shown in Fig. 5A, the active temperature T_i(δθ, t) has a spatial profile similar to that of the loss function, with the active temperature higher away from the minimum. In regions of weight space where the overall loss function is high, the active temperature is also high, which drives the system away from regions of the weight space with high losses. The learning intensity also depends on the PCA direction, as shown in Fig. 5B. For a flatter direction with a larger value of F_i, T_i is lower (see the inset in Fig. 5B), as the basin of solutions is wide and thus no strong active learning is needed. However, for a steeper direction with a smaller value of F_i, the solutions exist only in smaller regions, and thus more intensive learning (a higher active temperature) is required. Therefore, the MLF ensemble can sense both the local loss and the non-local flatness of the landscape in different directions and use this information to drive active learning.

The active temperature also varies with time. As learning progresses, the active temperature profile decreases with time, as shown in Fig. 5C. In Fig. 5D, the dynamics of the active temperature and the overall loss function along a SGD trajectory are shown together. It is clear that the active temperature and the overall loss function are highly correlated, as shown directly in the inset of Fig. 5D, which means that the SGD system cools down as it learns. This reminds us of the well-known simulated annealing algorithm for optimization [16], where temperature is decreased from a high value to zero with some prescribed cooling schedule.
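The active temperature of Eq. 17 can be estimated by averaging the squared projection of the SGD noise gradient over minibatches. This is a sketch under our own conventions: the α² prefactor follows the Langevin-type noise correlation written after Eq. 4 and should be treated as a convention, and `grad_minibatch`/`grad_full` are hypothetical gradient callables, not the paper's code:

```python
import numpy as np

def active_temperature(w0, p_i, dtheta, grad_minibatch, grad_full,
                       minibatches, alpha=0.1):
    """Eq. 17 (sketch): T_i(dtheta) = alpha^2 <|d(deltaL^mu)/d(dtheta)|^2>_mu,
    the SGD noise strength along PCA direction p_i at displacement dtheta."""
    w = w0 + dtheta * p_i
    # gradient of deltaL^mu = L^mu - L, projected onto the direction p_i
    proj = [(grad_minibatch(w, mu) - grad_full(w)) @ p_i for mu in minibatches]
    return alpha**2 * np.mean(np.square(proj))
```

Scanning `dtheta` reproduces the spatial profiles of Fig. 5A; evaluating at `dtheta = 0` for each direction gives the minimum active temperature of Fig. 5B.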
However, the SGD algorithm seems to deploy a more intelligent, landscape-dependent annealing strategy, where the active temperature (learning intensity), driven by the MLF ensemble, is self-tuned according to the local and non-local properties of the loss landscape that are sensed by the MLF ensemble. This landscape-dependent annealing strategy drives the system towards the flat minima of the loss function landscape and keeps it there by lowering the active temperature once the flat minima are reached.

To summarize, a careful study of the SGD dynamics and the loss function landscape in this paper reveals a robust inverse relation between fluctuations in SGD and flatness of the loss landscape, which is critical for deciphering the learning strategy in deep learning and for designing more efficient algorithms. The statistical physics based approach provides a general theoretical framework for studying machine learning. We are currently using this approach to address other fundamental questions in machine learning, such as generalization [17, 18], the relation between task complexity and network architecture, transfer learning [19], and continual learning [20–22].

VII. METHODS

Neural network architecture and simulations. Two types of DNNs are studied. (1) Two fully-connected neural networks were used for classifying digits in the MNIST database, one with two hidden layers (784 × 50 × 50 × 10, Main Text) and the other with four hidden layers (SM). The response of the hidden-layer neurons is ReLU, the activation of the output neurons is Softmax, and there is no bias neuron. We also studied convolutional neural networks (CNN). (2) Two LeNets were used in our experiments. One is trained on the MNIST dataset (see Fig. S1 in SM); it has two convolution layers (with 16 and 32 output channels, respectively) and one fully-connected layer with size 1568 × 10.
(Here we represent a convolution layer as input channel number × kernel size × kernel size × output channel number.) The stride of the convolutions is 1, and zero padding is used to keep the data dimensions unchanged. After each convolution layer, there is a 2 × 2 max pooling layer. The other LeNet has two convolution layers (the second with 16 output channels) and three fully-connected layers, the first with input size 400 and the last with size 84 × 10. The stride of the convolutions is 1 and the size of max pooling is 2 × 2; we do not use zero padding in this network. All numerical experiments are done in the PyTorch neural network framework.

Principal component analysis in the exploration phase. For a given time in the exploration phase, we extract the weight matrix (50 × 50) between the two hidden layers and reshape it into a 1 × 2500 vector; weight vectors are collected over a window of T = 10 epochs. The PCA is applied using the sklearn package provided with Python 3.7.

Multi-task learning. We divided MNIST into 5 groups, each containing only 2 digits. We call these groups task-1, task-2, etc. We use the fully-connected neural network with two hidden layers (784 × 50 × 50 × 10).

VIII. ACKNOWLEDGMENTS

We thank Mattia Rigotti, Irina Rish, Matthew Riemer, Robert Ajemian, Yunfei Teng, and Alberto Sassi for discussions. We also thank Jerry Tersoff, Tom Theis, and Youssef Mroueh for comments on the manuscript. The work by Yu Feng was done while he was an IBM intern.

[1] LeCun, Y., Bengio, Y. & Hinton, G. Deep learning. Nature 521, 436–444 (2015). URL https://doi.org/10.1038/nature14539.
[2] Robbins, H. & Monro, S. A stochastic approximation method. The Annals of Mathematical Statistics 22, 400–407 (1951). URL http://dx.doi.org/10.1214/aoms/1177729586.
[3] Bottou, L. Large-scale machine learning with stochastic gradient descent. In Lechevallier, Y. & Saporta, G. (eds.) Proceedings of COMPSTAT'2010, 177–186 (Physica-Verlag HD, Heidelberg, 2010).
[4] Chaudhari, P. & Soatto, S. Stochastic gradient descent performs variational inference, converges to limit cycles for deep networks (2018). URL http://dx.doi.org/10.1109/ita.2018.8503224.
[5] Hinton, G. E. & van Camp, D.
Keeping the neural networks simple by minimizing the description length of the weights. In Proceedings of the Sixth Annual Conference on Computational Learning Theory, COLT '93, 5–13 (ACM, New York, NY, USA, 1993). URL http://doi.acm.org/10.1145/168304.168306.
[6] Hochreiter, S. & Schmidhuber, J. Flat minima. Neural Computation 9, 1–42 (1997).
[7] Baldassi, C. et al. Unreasonable effectiveness of learning neural networks: From accessible states and robust ensembles to basic algorithmic schemes. Proceedings of the National Academy of Sciences 113, E7655–E7662 (2016).
[8] Chaudhari, P. et al. Entropy-SGD: Biasing gradient descent into wide valleys (2016). 1611.01838.
[9] Zhang, Y., Saxe, A. M., Advani, M. S. & Lee, A. A. Energy-entropy competition and the effectiveness of stochastic gradient descent in machine learning. Molecular Physics 116, 3214–3223 (2018). URL http://dx.doi.org/10.1080/00268976.2018.1483535.
[10] Mei, S., Montanari, A. & Nguyen, P.-M. A mean field view of the landscape of two-layer neural networks. Proceedings of the National Academy of Sciences of the United States of America 115, E7665–E7671 (2018).
[11] Gur-Ari, G., Roberts, D. A. & Dyer, E. Gradient descent happens in a tiny subspace. arXiv e-prints arXiv:1812.04754 (2018). 1812.04754.
[12] Forster, D. Hydrodynamic fluctuations, broken symmetry, and correlation functions (CRC Press, 2018).
[13] McCloskey, M. & Cohen, N. J. Catastrophic interference in connectionist networks: The sequential learning problem. Psychology of Learning and Motivation 24, 109–165 (1989).
[14] Robins, A. Catastrophic forgetting, rehearsal and pseudorehearsal. Connection Science 7, 123–146 (1995).
[15] Kirkpatrick, J. et al. Overcoming catastrophic forgetting in neural networks. Proceedings of the National Academy of Sciences 114, 3521–3526 (2017).
[16] Kirkpatrick, S., Gelatt, C. D. & Vecchi, M. P. Optimization by simulated annealing. Science 220, 671–680 (1983). URL https://science.sciencemag.org/content/220/4598/671.
[17] Neyshabur, B., Bhojanapalli, S., McAllester, D. & Srebro, N.
Exploring generalization in deep learning. In NIPS (2017).
[18] Advani, M. S. & Saxe, A. M. High-dimensional dynamics of generalization error in neural networks (2017). 1710.03667.
[19] Yosinski, J., Clune, J., Bengio, Y. & Lipson, H. How transferable are features in deep neural networks? (2014). 1411.1792.
[20] Ring, M. B. Continual learning in reinforcement environments. Ph.D. thesis, University of Texas at Austin, Austin, Texas 78712 (1994).
[21] Lopez-Paz, D. & Ranzato, M. Gradient episodic memory for continual learning. NIPS (2017).
[22] Riemer, M. et al. Learning to learn without forgetting by maximizing transfer and minimizing interference (2018). 1810.11910.

Supplementary Material for "How neural networks find generalizable solutions: Self-tuned annealing in deep learning"

I. THE TWO PHASES OF LEARNING: THE INITIAL FAST LEARNING PHASE AND THE SLOW EXPLORATION PHASE

The dynamics of L versus iteration time are shown in Fig. S1, which shows that there are two phases of learning in DNNs. There is an initial fast learning phase, where the overall loss function decreases quickly and sometimes abruptly, followed by an exploration phase, where the overall loss L decreases much more slowly and gradually. Due to its slow time scale, the exploration phase can be considered as being in a quasi-steady state. The weights reached in the exploration phase can be considered solutions of the problem, given their small loss function values. These two phases exist independent of the hyperparameters and the network architecture, as shown in Fig. S1, where the transition region between the two phases is highlighted. II.
PROPERTIES OF THE FIRST PRINCIPAL COMPONENT MODE

In our study, we used the cross-entropy loss function for each sample:
\[ d(\vec{Y}_k, \vec{Z}_k) = -\ln\left[\frac{\exp(\vec{Y}_k \cdot \vec{Z}_k)}{\sum_{n=1}^{n_c} \exp(Y_{k,n})}\right] = -\ln\left[\frac{\exp(Y_{k,n(k)})}{\sum_{n=1}^{n_c} \exp(Y_{k,n})}\right], \tag{S1} \]
where $n_c$ is the total number of classes and also the dimension of the network output vector $\vec{Y}_k$ and the correct output vector $\vec{Z}_k$. The correct class for sample $k$ is $n(k)$, and $\vec{Z}_k$ is just a unit vector in the correct ($n(k)$) direction. Let $m(k)$ be the incorrect output component with the largest output value; we can then define $\Delta Y_k \equiv Y_{k,n(k)} - Y_{k,m(k)}$. The cross-entropy loss for sample $k$ can then be written as:
\[ l_k \equiv d(\vec{Y}_k, \vec{Z}_k) = \ln[1 + a_k \exp(-\Delta Y_k)] \approx a_k \exp(-\Delta Y_k), \tag{S2} \]
where $a_k = \sum_{n \neq n(k)} \exp(Y_{k,n} - Y_{k,m(k)})$ is an order $O(1)$ number given that $Y_{k,m(k)}$ is the largest among all the incorrect outputs ($n \neq n(k)$).

As we described in the main text, the persistent drift in the first PC direction $\vec{p}_1$ can be understood by translating a solution $\vec{w}_0$ found by SGD along $\vec{p}_1$ by $\theta$ to a new weight vector $\vec{w} = \vec{w}_0 + \theta \vec{p}_1$. We studied how such a change of weights affects $\Delta Y_k$ for a given sample $k$. We computed $\Delta Y_k$ as a function of $\theta$ for different samples, and found that the change in $\Delta Y_k$ depends approximately linearly on $\theta$:
\[ \Delta Y_k(\theta) \approx \Delta Y_k(0) + \beta_k \theta, \tag{S3} \]
where $\beta_k$ is a sample-dependent constant. This linear dependence is shown clearly in Fig. S2A for different samples covering different correct classes.

By using Eq. S3 in the expression for $l_k$, i.e., Eq. S2, we have:
\[ l_k(\theta) \approx l_k(0) \exp(-\beta_k \theta), \tag{S4} \]
which depends exponentially on $\theta$ for all samples $k$. As a result, the overall loss function along the first PC direction is:
\[ L(\theta) \equiv \langle l_k(\theta) \rangle_k \approx L(0) \exp(-\beta_1 \theta + \beta_2 \theta^2), \tag{S5} \]
where $\beta_1$ is the average of $\beta_k$, $\beta_1 = \langle \beta_k \rangle_k > 0$, and $\beta_2$ depends on the variation of $\beta_k$ among different samples.
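As a quick numerical sanity check of the decomposition above, the exact per-sample cross-entropy loss can be compared with the exact rewritten form $\ln[1 + a_k \exp(-\Delta Y_k)]$ and with the approximation $a_k \exp(-\Delta Y_k)$. The sketch below uses a made-up output vector, and the function name `cross_entropy_parts` is ours, not from the paper:

```python
import numpy as np

def cross_entropy_parts(Y, correct):
    """Return (exact loss l_k, prefactor a_k, margin Delta Y_k) for one output vector."""
    incorrect = np.delete(np.arange(Y.size), correct)
    m = incorrect[np.argmax(Y[incorrect])]      # index m(k) of largest incorrect output
    dY = Y[correct] - Y[m]                      # Delta Y_k = Y_{k,n(k)} - Y_{k,m(k)}
    a = np.sum(np.exp(Y[incorrect] - Y[m]))     # a_k, an O(1) prefactor
    exact = -np.log(np.exp(Y[correct]) / np.sum(np.exp(Y)))  # cross-entropy loss
    return exact, a, dY

Y = np.array([0.1, 5.0, -0.3, 1.2, 0.4])        # hypothetical outputs; class 1 is correct
exact, a, dY = cross_entropy_parts(Y, correct=1)
approx = a * np.exp(-dY)

# The rewritten form l_k = ln[1 + a_k exp(-Delta Y_k)] holds identically (Eq. S2).
print(np.isclose(exact, np.log(1.0 + a * np.exp(-dY))))   # True
# The approximation l_k ~ a_k exp(-Delta Y_k) is good when Delta Y_k is large.
print(abs(exact - approx) / exact < 0.05)                 # True
```

The identity follows because $1 + a_k \exp(-\Delta Y_k) = \sum_n \exp(Y_{k,n} - Y_{k,n(k)})$; the final approximation then only drops the $\ln(1+x) \approx x$ correction, which is small once the correct output clearly dominates.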
The expression for $L(\theta)$ agrees with the one obtained from the MLF ensemble average derived in the main text, with the correspondences $\beta_1 = M^{(0)}_{11} \theta^{(0)}_1$ and $\beta_2 = M^{(0)}_{11}/2$. The dependence of $\ln(L(\theta))$ expressed in Eq. S5, i.e., $\ln(L(\theta)) \approx \ln(L(0)) - \beta_1 \theta + \beta_2 \theta^2$, agrees with our numerical results shown in Fig. S2B, which shows that $\ln(L)$ has a finite slope at $\theta = 0$. It is this finite slope that drives the drift motion of $\theta$ observed in Fig. 1C in the main text.

III. ROBUSTNESS AND GENERALITY OF THE INVERSE VARIANCE-FLATNESS RELATION

As shown in the main text (Fig. 2D), the inverse relation between the SGD variance and the flatness of the loss function landscape is the same for different choices of the hyperparameters, batch size B and learning rate α. Since all the results reported in the main text are from a simple network with 2 hidden layers applied to the MNIST dataset, we have tested the robustness of this inverse variance-flatness relation in different networks, for different algorithms, and for different datasets. We first tested the variance-flatness relation in a DNN with a larger number of hidden layers, e.g., a 4-hidden-layer network as shown in Fig. S3A. We found the same inverse variance-flatness relation for weights in all the different layers, as shown in Fig. S3B. We have also tested the inverse variance-flatness relation in other DNN architectures such as LeNet (illustrated in Fig. S3C), with different learning algorithms such as ADAM, and with different datasets such as CIFAR. As shown in Fig. S3D, the inverse variance-flatness relation holds true in all the cases we studied.

FIG. S1: Two different phases in learning. (A) shows the phase transition in a multi-layer fully-connected neural network. Here we use a network with four hidden layers, and each hidden layer has 50 units. The experiment is done on the MNIST dataset. We can see that there is an obvious fast drop of the cross-entropy loss around 50 epochs, after which the system enters the exploration phase with slowly changing loss.
This phase transition holds true for different combinations of learning rate and batch size. (B) shows the phase transition in a LeNet, which has two convolution layers with 16 and 32 output channels, respectively, and one fully-connected layer of size 1586 × 10 (see Fig. S3C). We can see that there is also a drop of the cross-entropy loss around 30 epochs.

FIG. S2: The effect of changing $\theta$ along the first PC direction $\vec{p}_1$. (A) The dependence of $\Delta Y_k$, the difference between the correct output and the maximum incorrect output, on the displacement $\theta$ along the first PC for different samples $k$. $\Delta Y_k$ increases linearly with $\theta$. Each symbol corresponds to an average over 5 samples of the same digit. (B) The landscape of $\ln L_i(\theta_i)$ along the $i$-th PC direction. Only $\ln(L_1)$ has a finite gradient at $\theta_1 = 0$, which indicates a drift motion in $\vec{p}_1$. In all other PC directions ($i \geq 2$), $\ln(L_i)$ has a zero gradient at $\theta_i = 0$, which indicates a diffusive motion in these directions.

FIG. S3: The variance-flatness relation for different networks, datasets, and optimization methods. (A) A four-hidden-layer neural network is used on the MNIST dataset. (B) The inverse relation between variance and flatness holds between any two adjacent hidden layers. (C) Illustration of LeNet, which has two convolution layers with sizes of 3 × 5 × 5 × 6 and 6 × 5 × 5 × 16, and three fully-connected layers with sizes of 400 × 120, 120 × 84, and 84 × 10.
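The variance-flatness measurements discussed above can be illustrated with a minimal sketch. Here flatness along a direction is taken as the width of the region around the minimum where the loss stays below a fixed multiple of its minimum value; the toy quadratic losses, the ten-dimensional weight vector, and the threshold factor $e$ are our illustrative choices, not necessarily the paper's exact definitions:

```python
import numpy as np

def flatness(loss_fn, w0, direction, thetas, factor=np.e):
    """Width of the region along `direction` where the loss profile
    L(w0 + theta * direction) stays below factor * min(L)."""
    profile = np.array([loss_fn(w0 + t * direction) for t in thetas])
    inside = thetas[profile <= factor * profile.min()]
    return inside.max() - inside.min()

w0 = np.zeros(10)                               # a toy "solution" in weight space
thetas = np.linspace(-20.0, 20.0, 2001)         # scan range along the direction
widths = []
for c in (0.1, 1.0, 10.0):                      # increasing curvature of the minimum
    def loss_fn(w, c=c):
        return 1.0 + 0.5 * c * w[0] ** 2        # toy quadratic loss along e_1
    e1 = np.zeros(10)
    e1[0] = 1.0                                 # scan direction (stand-in for a PC)
    widths.append(flatness(loss_fn, w0, e1, thetas))

# A sharper minimum (larger curvature) gives a smaller flatness, and vice versa.
print(widths[0] > widths[1] > widths[2])        # True
```

In an actual measurement, `loss_fn` would be the network loss evaluated on the training set and `direction` a principal component of the weight trajectory; the inverse variance-flatness relation then compares these widths with the SGD weight variance along the same directions.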