Eliminating Sharp Minima from SGD with Truncated Heavy-tailed Noise
Xingyu Wang
Department of Industrial Engineering and Management Sciences
Northwestern University, Evanston, IL 60208
[email protected]

Sewoong Oh
Allen School of Computer Science & Engineering
University of Washington, Seattle, WA 98195
[email protected]

Chang-Han Rhee
Department of Industrial Engineering and Management Sciences
Northwestern University, Evanston, IL 60208
[email protected]
February 9, 2021
Abstract
The empirical success of deep learning is often attributed to SGD's mysterious ability to avoid sharp local minima in the loss landscape, which is well known to lead to poor generalization. Recently, empirical evidence of heavy-tailed gradient noise was reported in many deep learning tasks; under the presence of such heavy-tailed gradient noise, it can be shown that SGD can escape sharp local minima, providing a partial solution to the mystery. In this work, we analyze a popular variant of SGD where gradients are truncated above a fixed threshold. We show that it achieves a stronger notion of avoiding sharp minima: it can effectively eliminate sharp local minima entirely from its training trajectory. We characterize the dynamics of truncated SGD driven by heavy-tailed noises. First, we show that the truncation threshold and the width of the attraction field dictate the order of the first exit time from the associated local minimum. Moreover, when the objective function satisfies appropriate structural conditions, we prove that as the learning rate decreases, the dynamics of heavy-tailed truncated SGD closely resemble those of a special continuous-time Markov chain which never visits any sharp minima. We verify our theoretical results with numerical experiments and discuss the implications for the generalizability of SGD in deep learning.
Stochastic gradient descent (SGD) and its variants have seen unprecedented empirical successes in training deep neural networks. The training of deep neural networks is typically posed as a non-convex optimization problem without explicit regularization, but the solutions obtained by SGD often perform surprisingly well on test data. Such an unexpected generalization performance of SGD in deep neural networks is often attributed to SGD's ability to avoid sharp local minima in the loss landscape, which is well known to lead to poor generalization; there have been significant efforts to explain such phenomena theoretically, but how SGD avoids sharp local minima still remains one of the central mysteries of deep learning. Recently, it was suggested that the previous theoretical efforts failed to provide a satisfactory explanation due to the assumption that the stochastic gradient noise is light-tailed: [1] and [2] report empirical evidence of heavy tails in the stochastic gradient noise of common deep learning architectures and show that SGD can escape sharp local minima under the presence of heavy-tailed gradient noise. More specifically, they view heavy-tailed SGD as a discrete approximation of a Lévy-driven Langevin equation and argue that the amount of time the SGD trajectory spends in each local minimum is proportional to the width of the minimum, according to the metastability theory [3, 4, 5] for such heavy-tailed processes. Figure 1 (a) illustrates this with the histogram of a sample trajectory of SGD.

[Figure 1: Histograms of the time SGD spent at different locations, under four settings: (a) heavy-tailed noises, no gradient clipping; (b) heavy-tailed noises, gradient clipping; (c) light-tailed noises, no gradient clipping; (d) light-tailed noises, gradient clipping. With the help of clipped heavy-tailed noises, SGD hardly ever visits the two sharp minima $m_2$ and $m_3$. The function $f$ is plotted at the bottom over $[-1.5, 1.5]$, and dashed lines are added as references for the locations of the local minima.]

In this paper, we consider a popular variant of SGD where the stochastic gradient is truncated above a fixed threshold. Such a truncation scheme is often called gradient clipping and is employed by default in various contexts [6, 7, 8, 9]. We prove that under this scheme, the long-run behavior of SGD is fundamentally different from that of the pure form of SGD: in particular, under a suitable structural condition on the geometry of the loss landscape, gradient clipping can completely eliminate sharp minima from the trajectory of SGD.

Figure 1 clearly illustrates these points with the histograms of the sample trajectories of SGD. Note first that SGD with light-tailed gradient noise—Figure 1 (c) and (d)—never manages to escape a (sharp) minimum, regardless of gradient clipping. In contrast, SGD with heavy-tailed gradient noise easily escapes from local minima. Moreover, there is a clear difference between SGD with and without gradient clipping. In Figure 1 (a), SGD without gradient clipping stays at each of the four local minima ($\{m_1, m_2, m_3, m_4\}$) for a significant amount of time, although it spends more time around the wide ones ($\{m_1, m_4\}$) than the sharp ones ($\{m_2, m_3\}$). On the other hand, in Figure 1 (b), SGD with gradient clipping not only escapes from local minima but also avoids the sharp minima ($\{m_2, m_3\}$) almost completely. This means that if SGD with gradient clipping terminates at an arbitrary time point, it is almost guaranteed not to be at a sharp minimum, effectively eliminating sharp minima from its trajectory. The next section characterizes such long-run dynamics of truncated SGD driven by heavy-tailed noises in detail.

We provide a complete characterization of the dynamics of SGD iterates with gradient clipping when the SGD algorithm is applied to a non-convex function $f$ on $\mathbb{R}$. We first introduce the assumptions on the function. Let $f \in C^2(\mathbb{R})$ be a function satisfying the following assumption.
Assumption 1. There exist a positive integer $n_{\min}$, a real number $L \in (0, \infty)$, and an ordered sequence of real numbers $m_1, s_1, m_2, s_2, \cdots, s_{n_{\min}-1}, m_{n_{\min}}$ such that
• $-L < m_1 < s_1 < m_2 < s_2 < \cdots < s_{n_{\min}-1} < m_{n_{\min}} < L$;
• $f'(x) = 0$ iff $x \in \{m_1, s_1, \cdots, s_{n_{\min}-1}, m_{n_{\min}}\}$;
• For any $x \in \{m_1, m_2, \cdots, m_{n_{\min}}\}$, $f''(x) > 0$;
• For any $x \in \{s_1, s_2, \cdots, s_{n_{\min}-1}\}$, $f''(x) < 0$.

As illustrated in Figure 2, the assumption above requires that $f$ has finitely many local minima (to be specific, the count is $n_{\min}$) and that they all lie in the interval $[-L, L]$. Moreover, the points $s_1, \cdots, s_{n_{\min}-1}$ naturally partition the entire real line into regions $\Omega_i = (s_{i-1}, s_i)$ (here we adopt the convention that $s_0 = -\infty$, $s_{n_{\min}} = +\infty$). We call each region $\Omega_i$ the attraction field of the local minimum $m_i$, as the gradient flow in any $\Omega_i$ always points to $m_i$.

Throughout the optimization procedure, given any location $x \in \mathbb{R}$ we assume that we only have access to the noisy estimator $f'(x) - Z_n$ instead of the true gradient. Specifically, in this work we are interested in the case where the iid sequence of noises $(Z_n)_{n \geq 1}$ is heavy-tailed. Typically, heavy-tailedness of a distribution is captured by the concept of regular variation, and we work with the following assumption. Let
$$H_+(x) = P(Z_1 > x), \qquad H_-(x) = P(Z_1 < -x), \qquad H(x) = H_+(x) + H_-(x) = P(|Z_1| > x).$$

Assumption 2. $E Z_1 = 0$. Furthermore, there exists some $\alpha \in (1, \infty)$ such that the function $H(x)$ is regularly varying (at $+\infty$) with index $-\alpha$. Besides, regarding the positive and negative tails of the distribution of the noises, we have
$$\lim_{x \to \infty} \frac{H_+(x)}{H(x)} = p_+, \qquad \lim_{x \to \infty} \frac{H_-(x)}{H(x)} = p_- = 1 - p_+,$$
where $p_+$ and $p_-$ are constants in the interval $(0, 1)$.

[Figure 2: Illustration of a function satisfying Assumption 1, with local minima $m_1, \cdots, m_4$ and attraction fields $\Omega_1, \cdots, \Omega_4$ on $[-L, L]$.]

For a measurable function $\phi: \mathbb{R}_+ \to \mathbb{R}_+$, we say that $\phi$ is regularly varying at $+\infty$ with index $\beta$ (denoted as $\phi \in \mathcal{RV}_\beta$) if, for any $t > 0$, $\lim_{x \to \infty} \phi(tx)/\phi(x) = t^\beta$. For details on the definition and properties of regularly varying functions, see Chapter 2 of [10]. Roughly speaking, Assumption 2 tells us that the shape of the tail of the distribution of the noises $Z_n$ resembles the polynomial function $x^{-\alpha}$, which is much heavier than the $\exp(-x^2)$ tail of a Gaussian distribution. Therefore, a large value of $Z_n$ is much more likely under heavy-tailed noises than in the light-tailed case.

Our work concerns a popular variant of SGD where the gradient clipping technique is employed. Specifically, when updating the SGD iterates with learning rate $\eta$, rather than using the original noisy gradient step $\eta(f'(X_n) - Z_n)$, we truncate it above some threshold $b > 0$ and use $\varphi\big(\eta(f'(X_n) - Z_n),\, b\big)$ instead. Here the truncation operator $\varphi$ is defined as
$$\varphi(w, c) \triangleq (w \wedge c) \vee (-c) \qquad \forall w \in \mathbb{R},\ c > 0. \tag{1}$$
Note that $u \wedge v = \min\{u, v\}$ and $u \vee v = \max\{u, v\}$. For technical reasons, we need the following assumption about the clipping threshold $b > 0$. Fortunately, Assumption 3 is a very mild one, as it is satisfied by (Lebesgue) almost every $b > 0$.

Assumption 3. For any $i = 1, 2, \cdots, n_{\min}$, $\min\{|s_i - m_i|, |s_{i-1} - m_i|\}/b$ is NOT an integer.

Aside from the gradient clipping technique, we also reflect the SGD iterates at $\pm L$ (recall that $L$ is the constant in Assumption 1). Formally speaking, when the learning rate (namely, the step size) is $\eta > 0$, we generate the iterates $(X^\eta_n)_{n \geq 0}$ using the following recursive updates:
$$X^{\mathrm{clip},\eta}_n \triangleq X^\eta_n - \varphi\big(\eta(f'(X^\eta_n) - Z_n),\, b\big), \tag{2}$$
(SGD update with gradient clipping)
$$X^\eta_{n+1} \triangleq \varphi\big(X^{\mathrm{clip},\eta}_n,\, L\big). \tag{3}$$
(Reflection at $\pm L$)
When there is no ambiguity about the learning rate $\eta$, we simply use $X_n$ to denote the SGD iterates produced by the recursive updates defined above. We stress that, despite the introduction of the reflection operator, the generality of the algorithm studied here is not compromised, nor would the behavior of $(X^\eta_n)_{n \geq 0}$ differ noticeably from the vanilla SGD iterates. First of all, it is a common trick in practical statistical learning or optimization tasks to truncate the weights and ensure that they do not explode and drift to infinity. Besides, preventing the iterates from drifting to infinity (explosion of weights) is also a must in theoretical analyses of SGD or Langevin-type stochastic differential equations, and thanks to the reflection operator we do not need to introduce sophisticated assumptions on the tail behavior of $f$ that are commonly seen in existing works (see, for instance, the dissipativity conditions in [11]).

We first introduce a few concepts that are crucial to our main results. For any attraction field $\Omega_i$, define (note that $\lceil x \rceil = \min\{n \in \mathbb{Z}: n \geq x\}$ and $\lfloor x \rfloor = \max\{n \in \mathbb{Z}: n \leq x\}$)
$$r_i \triangleq \min\{|m_i - s_{i-1}|,\, |s_i - m_i|\}, \tag{4}$$
$$l^*_i \triangleq \lceil r_i / b \rceil. \tag{5}$$
Note that $l^*_i$ is in fact influenced by the value of the gradient clipping threshold $b$, even though this dependency is not highlighted by the notation. Here $r_i$ can be interpreted as the radius, or the effective width, of the attraction field, and $l^*_i$ is the minimum number of jumps required to escape $\Omega_i$ when starting from $m_i$: indeed, the gradient clipping threshold $b$ dictates that any SGD update step can travel no more than $b$, so to exit $\Omega_i$ when starting from $m_i$ (which requires the length of travel to be at least $r_i$), at least $\lceil r_i / b \rceil$ steps are required. The moral of the results in this section is as follows. First, for a fixed $b$, the minimum number of jumps $l^*_i$ is an indicator of the width of the attraction field $\Omega_i$, hence of the minimum effort required to exit $\Omega_i$. More importantly, as will be shown in Theorem 1, $l^*_i$ dictates the order of the first exit time as well as where the iterates $X^\eta_n$ land when leaving $\Omega_i$.

Specifically, we are now interested in the behavior of SGD regarding the following stopping time for the first exit from $\Omega_i$:
$$\sigma_i(\eta) \triangleq \min\{n \geq 0: X^\eta_n \notin \Omega_i\}.$$
Besides, for any attraction field $\Omega_i$, define a scaling function
$$\lambda_i(\eta) \triangleq H(1/\eta)\Big(\frac{H(1/\eta)}{\eta}\Big)^{l^*_i - 1}. \tag{6}$$
To stress the initial condition, we either use $P_x$ to denote the probability law induced by $X^\eta_0 = x$, or simply write $X^\eta_n(x)$.
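To make the update rule concrete, here is a minimal Python sketch of one trajectory of the clipped, reflected recursion (2)–(3). The two-sided Pareto noise matches the spirit of Assumption 2, but the particular scale, seed, and test function are hypothetical choices for illustration, not the ones used in our experiments.

```python
import numpy as np

def phi(w, c):
    """Truncation operator (1): clip w to the interval [-c, c]."""
    return np.minimum(np.maximum(w, -c), c)

def clipped_sgd_path(f_prime, x0, eta, b, L, n_steps, alpha=1.2, rng=None):
    """Run the recursion (2)-(3): clipped SGD step, then reflection at +/-L.

    The gradient noise Z_n is a symmetric two-sided Pareto (Lomax) variable
    with tail index alpha, so P(|Z| > x) ~ x^{-alpha} as in Assumption 2,
    and E[Z] = 0 by symmetry.
    """
    rng = np.random.default_rng(rng)
    x = np.empty(n_steps + 1)
    x[0] = x0
    for n in range(n_steps):
        # Heavy-tailed magnitude with a random sign.
        z = rng.pareto(alpha) * rng.choice([-1.0, 1.0])
        step = phi(eta * (f_prime(x[n]) - z), b)   # (2): clipped update
        x[n + 1] = phi(x[n] - step, L)             # (3): reflection at +/-L
    return x

# Example with a hypothetical double-well landscape on [-L, L].
if __name__ == "__main__":
    f_prime = lambda x: 4 * x ** 3 - 2 * x         # gradient of x^4 - x^2
    path = clipped_sgd_path(f_prime, x0=0.7, eta=0.01, b=0.5, L=1.5,
                            n_steps=10_000, rng=0)
    print(path[-5:])
```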
Theorem 1. Under Assumptions 1–3, given $x \in [-L, L]$ and $i \in \{1, 2, \cdots, n_{\min}\}$ such that $x \in \Omega_i$, we have:
1) Under $P_x$, $q_i \lambda_i(\eta)\, \sigma_i(\eta)$ converges in distribution to an Exponential random variable with rate 1 as $\eta \downarrow 0$;
2) For any $j = 1, 2, \cdots, n_{\min}$ such that $j \neq i$, $\lim_{\eta \downarrow 0} P_x\big(X^\eta_{\sigma_i(\eta)} \in \Omega_j\big) = q_{i,j}/q_i$,
where the constants satisfy $q_i > 0$ for all $i$ and $q_{i,j} \geq 0$ for all $j \neq i$.

For the complete proof and explicit formulas for the constants $q_i, q_{i,j}$, we refer the readers to the supplementary materials. We add a few remarks to provide intuitive interpretations.

[Figure 3: Typical transition graphs $G$ under different gradient clipping thresholds $b$. (Left) The function $f$ illustrated here has 3 attraction fields; for the second one, $\Omega_2 = (s_1, s_2)$, we have $s_2 - m_2 = 0.9$ and $m_2 - s_1 = 0.6$. (Middle) The typical transition graph induced by $b = 0.5$. The entire graph $G$ is irreducible, since all nodes communicate with each other. (Right) The typical transition graph induced by $b = 0.4$. When $b = 0.4$, since $0.6 < 2b$ and $0.9 > 2b$, starting from $m_2$ and with only 2 jumps, the iterates can exit $\Omega_2$ only from the left. Therefore, the graph $G$ has two communicating classes: $G_1 = \{m_1, m_2\}$ and $G_2 = \{m_3\}$; $G_1$ is absorbing while $G_2$ is transient.]
• As is made clear by the proof, Theorem 1 is an immediate consequence of the following general principle governing the dynamics of $X^\eta_n$. Fix some $\epsilon > 0$ so that $B_\epsilon(m_i) = [m_i - \epsilon, m_i + \epsilon]$ is a small neighborhood of the local minimum $m_i$. We say that $X^\eta_n$ makes an attempt to escape if it leaves $B_\epsilon(m_i)$, and the attempt fails if $X^\eta_n$ returns to $B_\epsilon(m_i)$ without exiting $\Omega_i$. As the learning rate $\eta$ tends to 0, one can show that (1) it is very rare that an attempt succeeds, and (2) when an attempt does succeed (so that $X^\eta_n$ finally escapes $\Omega_i$), it is almost always because $X^\eta_n$ makes exactly $l^*_i$ large jumps during this attempt, which let $X^\eta_n$ overcome the distance $r_i$ and escape. By large jumps, we mean SGD steps where the noise $Z_n$ is so large that, even when modulated by the small learning rate $\eta$, it still causes a noticeable perturbation (say, $\eta |Z_n| > \epsilon$).

• Given this $l^*_i$-jump principle, one would expect that $X^\eta_{\sigma_i(\eta)}$, the location right at the time of exit, will hardly ever be more than $l^*_i b$ away from $m_i$: the length of each update is clipped at $b$, and there will most likely be only $l^*_i$ large SGD steps during the successful attempt. In fact, from the proof one can see that $q_{i,j} > 0$ if and only if there exists $y \in \Omega_j$ with $|y - m_i| < l^*_i b$.

• Part (1) of Theorem 1 tells us that the first exit time is roughly of order $1/\lambda_i(\eta) \approx (1/\eta)^{\alpha + (\alpha - 1)(l^*_i - 1)}$.

Summarizing the three bullet points, we see that the minimum number of jumps $l^*_i$ dictates how heavy-tailed SGD escapes an attraction field, where it exits to, and when the exit occurs.
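To spell out the one step of arithmetic behind the exit-time order in the last bullet: Assumption 2 gives $H(1/\eta) \approx \eta^\alpha$ up to a slowly varying factor, so the scaling function (6) satisfies
$$\frac{1}{\lambda_i(\eta)} = \frac{1}{H(1/\eta)} \left(\frac{\eta}{H(1/\eta)}\right)^{l^*_i - 1} \approx \frac{1}{\eta^{\alpha}} \left(\frac{\eta}{\eta^{\alpha}}\right)^{l^*_i - 1} = \left(\frac{1}{\eta}\right)^{\alpha + (\alpha - 1)(l^*_i - 1)}.$$
In particular, each additional required jump multiplies the expected exit time by a factor of order $(1/\eta)^{\alpha - 1}$.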
In this section, we show that, under proper structural conditions on $f$ and the gradient clipping scheme, sharp minima can be effectively eliminated from the trajectory of heavy-tailed SGD. Again, we first introduce new concepts that are crucial to the results of this section. Similar to the minimum number of jumps for escape $l^*_i$ defined in (5), for any $j \neq i$ we define the minimum number of jumps to reach $\Omega_j$ from $m_i$:
$$l_{i,j} = \begin{cases} \lceil (s_{j-1} - m_i)/b \rceil & \text{if } j > i, \\ \lceil (m_i - s_j)/b \rceil & \text{if } j < i. \end{cases} \tag{7}$$
From the remarks after Theorem 1, we see that when exiting $\Omega_i$, $X^\eta_n$ will most likely move to somewhere reachable within $l^*_i$ jumps. Therefore, the typical transitions the SGD iterates make are those from $\Omega_i$ to $\Omega_j$ with $l_{i,j} = l^*_i$. Now we can define the following directed graph that only includes these typical transitions.

Definition 1 (Typical Transition Graph). Given a function $f$ satisfying Assumption 1 and a gradient clipping threshold $b > 0$ satisfying Assumption 3, a directed graph $G = (V, E)$ is the corresponding typical transition graph if
• $V = \{m_1, \cdots, m_{n_{\min}}\}$;
• An edge $(m_i \to m_j)$ is in $E$ if and only if $l_{i,j} = l^*_i$.

Naturally, the typical transition graph $G$ can be decomposed into several mutually exclusive communicating classes $G_1, \cdots, G_K$. Recall that a communicating class $G$ on $G$ is an equivalence class with the property that
• For $i \neq j$, we have $m_i \in G$ and $m_j \in G$ if and only if there exist a path $(m_i, m_{k_1}, \cdots, m_{k_n}, m_j)$ as well as a path $(m_j, m_{k'_1}, \cdots, m_{k'_{n'}}, m_i)$ on $G$; in other words, by travelling through edges of $G$, $m_i$ and $m_j$ can communicate with each other.

We say a communicating class $G$ is absorbing if there does not exist any edge $(m_i \to m_j) \in E$ such that $m_i \in G$ yet $m_j \notin G$. Otherwise, we say $G$ is transient. In the case that all $m_i$ communicate with each other on the graph $G$, we say $G$ is irreducible. See Figure 3 (Middle) for an illustration of an irreducible case. When $G$ is irreducible, the set of largest attraction fields of $f$ can be defined as $M_{\text{large}} = \{m_i: i = 1, 2, \cdots, n_{\min},\ l^*_i = l_{\text{large}}\}$ with $l_{\text{large}} = \max_j l^*_j$: recall that, given the clipping threshold $b$, $l^*_i$ also indicates the width of $\Omega_i$, hence the name largest attraction fields for all $\Omega_i$ with $l^*_i = l_{\text{large}}$. Also, we define the time scaling $\lambda_{\text{large}}(\eta) = H(1/\eta)\big(H(1/\eta)/\eta\big)^{l_{\text{large}} - 1}$ which, as indicated in Theorem 1, is exactly the time scale of the first exit times of the largest attraction fields of $f$.
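The quantities (5) and (7) and Definition 1 are directly computable from the landmarks $(m_i, s_i)$ and the threshold $b$. The following sketch builds the typical transition graph and classifies its communicating classes; the landmark values are hypothetical, merely patterned after Figure 3 ($m_2 - s_1 = 0.6$, $s_2 - m_2 = 0.9$).

```python
import math
from itertools import product

def typical_transition_graph(m, s, b):
    """Edge set of Definition 1 for minima m[0..n-1] and maxima s[0..n-2]
    (0-based indices), under clipping threshold b."""
    n = len(m)

    def l_star(i):  # (5): minimum number of jumps needed to escape Omega_i
        r = min(m[i] - s[i - 1] if i > 0 else math.inf,
                s[i] - m[i] if i < n - 1 else math.inf)
        return math.ceil(r / b)

    def l_jump(i, j):  # (7): minimum number of jumps to reach Omega_j from m_i
        return (math.ceil((s[j - 1] - m[i]) / b) if j > i
                else math.ceil((m[i] - s[j]) / b))

    return {(i, j) for i, j in product(range(n), repeat=2)
            if i != j and l_jump(i, j) == l_star(i)}

def communicating_classes(n, edges):
    """Classes of mutual reachability; each flagged absorbing (True) or not."""
    reach = {i: {i} for i in range(n)}
    changed = True
    while changed:  # transitive closure by fixed-point iteration
        changed = False
        for i, j in edges:
            if not reach[j] <= reach[i]:
                reach[i] |= reach[j]
                changed = True
    classes = []
    for i in range(n):
        cls = {j for j in range(n) if j in reach[i] and i in reach[j]}
        if cls not in classes:
            classes.append(cls)
    return [(sorted(c), not any(i in c and j not in c for i, j in edges))
            for c in classes]

# Hypothetical landmarks patterned after Figure 3.
m, s = [-1.0, -0.1, 1.25], [-0.7, 0.8]
for b in (0.5, 0.4):
    edges = typical_transition_graph(m, s, b)
    print(b, sorted(edges), communicating_classes(len(m), edges))
# b = 0.5: one irreducible class; b = 0.4: {m1, m2} absorbing, {m3} transient.
```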
The following result gives a taste of the unique benefit of heavy-tailed SGD with gradient clipping: all the small attraction fields are eliminated from its trajectory.

Theorem 2. Let Assumptions 1–3 hold and assume that the graph $G$ is irreducible. Given any $t > 0$, $\beta > l_{\text{large}}(\alpha - 1) + 1$, and $x \in [-L, L]$, the following random variables (indexed by $\eta$)
$$\frac{1}{\lfloor t/\eta^\beta \rfloor} \int_0^{\lfloor t/\eta^\beta \rfloor} \mathbb{1}\Big\{X^\eta_{\lfloor u \rfloor}(x) \in \bigcup_{j:\, m_j \notin M_{\text{large}}} \Omega_j\Big\}\, du \tag{8}$$
converge in probability to $0$ as $\eta \downarrow 0$.

The proof can be found in the supplementary materials. Here we provide an intuitive interpretation of the result. Suppose that we decide to terminate the SGD training at some reasonably long time, for instance after $\lfloor t/\eta^\beta \rfloor$ iterations; then the random variable defined in (8) is exactly the proportion of time the iterates $X^\eta_n$ spent in attraction fields that are not among the largest ones of $f$. In other words, in this irreducible case only the largest attraction fields survive on the trajectory of SGD under truncated heavy-tailed noises, and all the sharp minima are eliminated from the trajectory as the learning rate approaches 0.
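A minimal sketch of how the occupation fraction (8) can be estimated from a simulated trajectory; `clipped_sgd_path` is the sketch given earlier, and the small-field intervals below are hypothetical stand-ins for $\bigcup_{j:\, m_j \notin M_{\text{large}}} \Omega_j$.

```python
import numpy as np

def occupation_fraction(path, small_fields):
    """Fraction of iterations spent in the given (non-largest) attraction
    fields, i.e. a discrete version of the time average in (8)."""
    in_small = np.zeros(len(path), dtype=bool)
    for lo, hi in small_fields:
        in_small |= (path > lo) & (path < hi)
    return in_small.mean()

# Hypothetical usage: treat (s_1, s_2) as the lone small attraction field.
# frac = occupation_fraction(path, small_fields=[(-0.7, 0.8)])
# Theorem 2 predicts frac -> 0 as eta -> 0 when G is irreducible.
```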
Interestingly enough, Theorem 2 is merely one manifestation of the global dynamics of heavy-tailed SGD, and much more can be said even in the non-irreducible cases. The moral of the results in this section can be summarized as follows: (a) the gradient clipping scheme naturally partitions the entire landscape into different regions; (b) on each region, the dynamics of $X^\eta_n$ closely resemble those of a continuous-time Markov chain that only visits local minima when the learning rate is small; (c) in particular, any sharp minimum within a region is almost always avoided by SGD.

First, astute readers may have noticed already that different gradient clipping thresholds $b$ may induce different structures on $G$. For instance, consider the function depicted in Figure 3 (Left): for its attraction field $\Omega_2 = (s_1, s_2)$, starting from the local minimum $m_2$ we need to travel $0.6$ to exit from the left but $0.9$ to exit from the right, and the threshold $b$ determines both $l^*_i$ in (5) and the minimum jump number $l_{i,j}$ for the $(i \to j)$ transition in (7). When $b = 0.5$, we have $l_{2,1} = l_{2,3} = l^*_2 = 2$, so the entire graph is irreducible. When we use $b = 0.4$, however, we have $l_{2,1} = 2 = l^*_2 < l_{2,3} = 3$. As illustrated in Figure 3 (Right), the graph induced by $b = 0.4$ has two communicating classes: $G_1 = \{m_1, m_2\}$ is absorbing and $G_2 = \{m_3\}$ is transient.

From now on, we zoom in on a specific communicating class $G$ among $G_1, \cdots, G_K$. For this communicating class $G$, define $l^*_G \triangleq \max\{l^*_i: i = 1, 2, \cdots, n_{\min};\ m_i \in G\}$. For each local minimum $m_i \in G$, based on its minimum jump number, we call its attraction field $\Omega_i$ a large attraction field if $l^*_i = l^*_G$, and a small attraction field if $l^*_i < l^*_G$. We have thus classified all $m_i$ in $G$ into two groups: the ones in large attraction fields, $m^{\text{large}}_1, \cdots, m^{\text{large}}_{i_G}$, and the ones in small attraction fields, $m^{\text{small}}_1, \cdots, m^{\text{small}}_{i'_G}$. Also, define a scaling function $\lambda_G$ on this communicating class $G$ as
$$\lambda_G(\eta) \triangleq H(1/\eta)\Big(\frac{H(1/\eta)}{\eta}\Big)^{l^*_G - 1}. \tag{9}$$
Lastly, we consider a version of $X^\eta_n$ that is killed when $X^\eta_n$ leaves $G$. Define the stopping time
$$\tau_G(\eta) \triangleq \min\Big\{n \geq 0: X^\eta_n \notin \bigcup_{i:\, m_i \in G} \Omega_i\Big\}$$
as the first time the iterates leave all attraction fields in $G$, and use a cemetery state $\dagger$ to construct the following process $X^{\dagger,\eta}_n$ as a version of $X^\eta_n$ killed at $\tau_G$:
$$X^{\dagger,\eta}_n = \begin{cases} X^\eta_n & \text{if } n < \tau_G(\eta), \\ \dagger & \text{if } n \geq \tau_G(\eta). \end{cases}$$
Theorem 3. Under Assumptions 1–3, if $G$ is absorbing, then there exist a continuous-time Markov chain $Y$ on $\{m^{\text{large}}_1, \cdots, m^{\text{large}}_{i_G}\}$ with some generator matrix $Q$, as well as a random mapping $\pi_G$ satisfying
• if $m \in \{m^{\text{large}}_1, \cdots, m^{\text{large}}_{i_G}\}$, then $\pi_G(m) \equiv m$;
• if $m \in \{m^{\text{small}}_1, \cdots, m^{\text{small}}_{i'_G}\}$, then the distribution of $\pi_G(m)$ only takes values in $\{m^{\text{large}}_1, \cdots, m^{\text{large}}_{i_G}\}$;
such that for any $x \in \Omega_i$ with $|x| \leq L$ (where $i \in \{1, 2, \cdots, n_{\min}\}$) and $m_i \in G$, we have $X^\eta_{\lfloor t/\lambda_G(\eta) \rfloor}(x) \to Y_t(\pi_G(m_i))$ as $\eta \downarrow 0$ in the sense of finite-dimensional distributions.
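To make the limit object concrete, here is a minimal sketch of simulating a continuous-time Markov chain with a given generator matrix $Q$. The specific two-state $Q$ below is a hypothetical example, not one computed from the theorem.

```python
import numpy as np

def simulate_ctmc(Q, states, x0, t_end, rng=None):
    """Simulate a CTMC with generator Q: hold state i for an Exp(-Q[i][i])
    amount of time, then jump to j != i with probability Q[i][j]/(-Q[i][i])."""
    rng = np.random.default_rng(rng)
    i, t, history = states.index(x0), 0.0, []
    while t < t_end:
        history.append((t, states[i]))
        rate = -Q[i][i]
        if rate == 0.0:          # absorbing state: stay put until t_end
            break
        t += rng.exponential(1.0 / rate)
        probs = [Q[i][j] / rate if j != i else 0.0 for j in range(len(states))]
        i = rng.choice(len(states), p=probs)
    return history

# Hypothetical generator on two large minima {m1, m4}.
Q = [[-1.0, 1.0],
     [0.5, -0.5]]
print(simulate_ctmc(Q, ["m1", "m4"], "m1", t_end=10.0, rng=0)[:5])
```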
Theorem 4. Under Assumptions 1–3, if $G$ is transient, then there exist a continuous-time Markov chain $Y$ with killing that has state space $\{m^{\text{large}}_1, \cdots, m^{\text{large}}_{i_G}, \dagger\}$ (we say the Markov chain $Y$ is killed when it enters the absorbing cemetery state $\dagger$) and some generator matrix $Q$, as well as a random mapping $\pi_G$ satisfying
• if $m \in \{m^{\text{large}}_1, \cdots, m^{\text{large}}_{i_G}\}$, then $\pi_G(m) \equiv m$;
• if $m \in \{m^{\text{small}}_1, \cdots, m^{\text{small}}_{i'_G}\}$, then the distribution of $\pi_G(m)$ only takes values in $\{m^{\text{large}}_1, \cdots, m^{\text{large}}_{i_G}, \dagger\}$;
such that for any $x \in \Omega_i$ with $|x| \leq L$ (where $i \in \{1, 2, \cdots, n_{\min}\}$) and $m_i \in G$, we have $X^{\dagger,\eta}_{\lfloor t/\lambda_G(\eta) \rfloor}(x) \to Y_t(\pi_G(m_i))$ as $\eta \downarrow 0$ in the sense of finite-dimensional distributions.

The generator matrix $Q$ as well as the distribution of the random mapping $\pi_G(\cdot)$ are detailed in the proof. Here we add some remarks. First, intuitively speaking, the two results above tell us that the gradient clipping scheme naturally partitions the entire landscape of $f$ into several regions $G_1, \cdots, G_K$; whether a region $G$ is absorbing or not, when visiting $G$ the dynamics of heavy-tailed SGD converge to those of a continuous-time Markov chain avoiding any local minimum that is not in the largest attraction fields of $G$. Second, under a small learning rate $\eta > 0$, if $X^\eta_n(x)$ is initialized at $x \in \Omega_i$ where $\Omega_i$ is not one of the largest attraction fields on $G$, then the iterates will quickly escape the small attraction field and arrive at one of the largest attraction fields on $G$; such a transition is so quick that, under the time scaling $\lambda_G(\eta)$, it is almost instantaneous, as if $X^\eta_n(x)$ were actually initialized randomly at one of the flat minima. We compress the law of this random initialization into the random mapping $\pi_G$.

Before we conclude this section, we state a stronger version of the results that is an immediate corollary of Theorem 3. If $G$ is irreducible, then Theorem 3 applies to the entire optimization landscape. Recall that $\lambda_{\text{large}}(\cdot)$ is the time scale defined before Theorem 2, which corresponds to the first exit times of the largest attraction fields of $f$.
Theorem 5. Under Assumptions 1–3, if $G$ is irreducible, then there exist a continuous-time Markov chain $Y$ on $M_{\text{large}}$ as well as a random mapping $\pi$ such that for any $i = 1, 2, \cdots, n_{\min}$ and any $x \in \Omega_i$ with $|x| \leq L$, $X^\eta_{\lfloor t/\lambda_{\text{large}}(\eta) \rfloor}(x) \to Y_t(\pi(m_i))$ as $\eta \downarrow 0$ in the sense of finite-dimensional distributions.

Heavy-tailedness and the wide-minima folklore in SGD. Aside from the theoretical interest in characterizing heavy-tailed SGD and the effects of gradient clipping therein, our results also have crucial practical implications for statistical learning tasks, especially in light of the following empirical observations. First, heavy-tailed gradient noise is ubiquitous in modern deep learning contexts, including image classification tasks [1], generative models [12], and deep reinforcement learning tasks [13]. This empirical evidence strongly challenges the traditional light-tailed paradigm in theoretical analyses of SGD and necessitates a much deeper understanding of heavy-tailed SGD. Besides, it remains a mystery in the field of deep learning how the surprisingly good generalizability of trained models is achieved using first-order optimization methods. The well-known conjecture is that sharp local minima generalize poorly at test time, while flat and wide ones tend to generalize better [14]. Aligned with this conjecture is the correlation, observed in various experiments [15, 16], between the test accuracy of deep neural nets and the flatness or width of the local minimum reached at the end of training. While there exist multiple ways of characterizing a wide minimizer (eigenvalues of the Hessian, width of the attraction field, etc.), the general belief is that a sharp local minimum in a small attraction field may lead to poor generalization performance. Our theoretical results are therefore of great practical importance, as they describe how, under truncated heavy-tailed noises, SGD spends much longer in wider attraction fields, and the sharp minima are practically eliminated from its global dynamics.
Systematic control of exit times from attraction fields. In light of the wide-minima folklore, one may want techniques to modify the sojourn time of SGD in each attraction field. Theorem 1 shows that the order of the first exit time (with respect to the learning rate $\eta$) is directly controlled by the gradient clipping threshold $b$. Recall that $H(1/\eta) \approx O(\eta^\alpha)$, so for an attraction field with minimum jump number $l^*$, Theorem 1 tells us that the exit time from this attraction field is roughly of order $(1/\eta)^{\alpha + (\alpha - 1)(l^* - 1)}$. Given the width of the attraction field, its minimum jump number $l^*$ is dictated by the gradient clipping threshold $b$. Therefore, gradient clipping provides a very systematic method to control the exit time from each attraction field. For instance, given a clipping threshold $b$, the exit time from an attraction field with width less than $b$ is of order $(1/\eta)^\alpha$, while the exit time from one wider than $b$ is at least of order $(1/\eta)^{2\alpha - 1}$, which dominates the exit time from the smaller ones.

Ideal structures of $G$ and $f$. In order to apply the strongest result, Theorem 5, irreducibility of $G$ is required, and the shape of the function $f$ may be a deciding factor here aside from the choice of $b$. For instance, we say that $G$ is symmetric if for any attraction field $\Omega_i$ with $i = 2, \cdots, n_{\min} - 1$ (i.e., $\Omega_i$ is not the leftmost or rightmost one at the boundary), we have $q_{i,i-1} > 0$ and $q_{i,i+1} > 0$. One can see that $G$ is symmetric if and only if, for any $i = 2, \cdots, n_{\min} - 1$, $|s_i - m_i| \vee |m_i - s_{i-1}| < l^*_i b$, and symmetry is a sufficient condition for irreducibility of $G$. The graph illustrated in Figure 3 (Middle) is symmetric, while the one in Figure 3 (Right) is not. As the name suggests, in the $\mathbb{R}$ case the symmetry of $G$ is more likely to hold if each attraction field of $f$ is also nearly symmetric around its local minimum. If not, then, as illustrated in Figure 3, the symmetry (as well as the irreducibility) of $G$ is easily violated, especially when a small gradient clipping threshold $b$ is used. Generally speaking, our results imply that, even with the help of truncated heavy-tailed noises, one should expect the structure of the function $f$ to satisfy certain regularity conditions to ensure that the SGD iterates avoid unfavorable minima. This is in the same vein as the observations of [16]: the deep neural nets that are more trainable under SGD tend to have a much more regular structure in terms of the number and shape of local minima.

Heavy-tailed SGD without gradient clipping. It is worth mentioning that our results also characterize the dynamics of heavy-tailed SGD when there is no gradient clipping. Since the reflection operation at $\pm L$ restricts the iterates to the compact set $[-L, L]$, if we use a truncation threshold $b$ larger than $2L$, then any SGD update that moves more than $b$ will in any case be pushed back to $\pm L$ by the reflection. Therefore, the dynamics are identical to those of the following iterates without gradient clipping:
$$X^{\eta,\text{unclipped}}_n = \varphi\Big(X^{\eta,\text{unclipped}}_{n-1} - \eta f'\big(X^{\eta,\text{unclipped}}_{n-1}\big) + \eta Z_n,\ L\Big). \tag{10}$$
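A tiny sketch of the "systematic control" observation above: the polynomial order of the first exit time as a function of the clipping threshold. The formula is the one from Theorem 1; the widths and tail index below are hypothetical inputs.

```python
import math

def exit_time_order(alpha, r, b):
    """Polynomial order beta such that the first exit time is ~ (1/eta)^beta,
    for an attraction field of effective width r under clipping threshold b."""
    l_star = math.ceil(r / b)            # minimum number of jumps, eq. (5)
    return alpha + (alpha - 1) * (l_star - 1)

# With alpha = 1.2: a field narrower than b exits at order (1/eta)^1.2,
# while a field wider than b needs at least two jumps: order >= (1/eta)^1.4.
print(exit_time_order(1.2, r=0.3, b=0.5))   # 1.2
print(exit_time_order(1.2, r=0.6, b=0.5))   # 1.4
```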
The next result follows immediately from Theorems 1 and 3.

Corollary 6. There exist constants $q_i > 0$ for all $i$ and $q_{i,j} > 0$ for all $j \neq i$ such that the following claims hold for any $i$ and any $x \in \Omega_i$ with $|x| \leq L$:
1) Under $P_x$, $q_i H(1/\eta)\, \sigma_i(\eta)$ converges in distribution to an Exponential random variable with rate 1 as $\eta \downarrow 0$;
2) For any $j = 1, 2, \cdots, n_{\min}$ such that $j \neq i$, $\lim_{\eta \downarrow 0} P_x\big(X^\eta_{\sigma_i(\eta)} \in \Omega_j\big) = q_{i,j}/q_i$;
3) Let $Y$ be a continuous-time Markov chain on $\{m_1, \cdots, m_{n_{\min}}\}$ with generator matrix $Q$ parametrized by $Q_{i,i} = -q_i$, $Q_{i,j} = q_{i,j}$. Then $X^{\eta,\text{unclipped}}_{\lfloor t/H(1/\eta) \rfloor}(x) \to Y_t(m_i)$ as $\eta \downarrow 0$ in the sense of finite-dimensional distributions.

At first glance, the form of Corollary 6 is similar to the results in [3] and [1]. However, the object studied in [3] is different: they study the following Langevin-type stochastic differential equation (SDE) driven by a regularly varying Lévy process $L_t$ with magnitude $\sigma > 0$:
$$dX^\sigma_t = -f'(X^\sigma_{t-})\,dt + \sigma\, dL_t.$$
In particular, they study the limiting behavior of $X^\sigma$ as $\sigma \downarrow 0$. On the other hand, we directly analyze the stochastic gradient descent iterates.
[Figure 4: First exit time from $\Omega_1$ versus the learning rate, with and without gradient clipping. Each dot represents the average of 20 samples of the first exit time. Each dashed line shows a polynomial function $c/\eta^\beta$, where $\beta$ is predicted by Theorem 1 (1.4 with gradient clipping, 1.2 without) and the coefficient $c$ is chosen to fit the dots.]

We use numerical experiments to corroborate our theoretical results and demonstrate that: (a) as indicated by Theorem 1, the minimum jump number defined in (5) accurately characterizes the first exit of heavy-tailed SGD when gradient clipping is applied; (b) with the help of gradient clipping, sharp minima can be effectively eliminated from heavy-tailed SGD; (c) the properties studied in this paper are exclusive to heavy-tailed noises; under light-tailed noises, SGD is easily trapped at a sharp minimum for an extremely long time.

The function $f \in C^2(\mathbb{R})$ we consider in this section is exactly the one depicted in Figure 2. Note that $m_2$ and $m_3$ are sharp minima in small attraction fields, while $m_1$ and $m_4$ are much flatter and live in much wider attraction fields. The explicit expression for $f$ is given in the supplementary materials.

First, we study the first exit time of $X^\eta_n$ from the wide attraction field $\Omega_1$ when initialized at $-0.7$. The heavy-tailed noises we use in the experiment are $Z_n = U_n W_n$ (up to a constant scale factor), where the $W_n$ follow a Pareto Type II distribution (aka Lomax distribution) with shape parameter $\alpha = 1.2$, and the signs $U_n$ are iid random variables such that $U_n = 1$ with probability $1/2$ and $U_n = -1$ with probability $1/2$. We test seven different learning rates $\eta$ ranging from $0.1$ down to $0.0001$. As for the gradient clipping scheme, we test a gradient clipping case, where the SGD iterates are produced by the updates (2)–(3) with $b = 0.5$, and a no gradient clipping case, where we remove the gradient clipping scheme so that the SGD updates are generated by (10). For each case, we run the simulation 20 times and plot the average of the 20 exit times in Figure 4. According to Theorem 1 and Corollary 6, when there is no gradient clipping, the first exit time is roughly of order $1/H(1/\eta) \approx (1/\eta)^\alpha = (1/\eta)^{1.2}$ for small $\eta$; when gradient clipping is applied with $b = 0.5$, the minimum jump number is $l^* = 2$ for the attraction field $\Omega_1$, and the first exit time is roughly of order $1/\big(H(1/\eta) \cdot H(1/\eta)/\eta\big) \approx (1/\eta)^{2\alpha - 1} = (1/\eta)^{1.4}$. As demonstrated in Figure 4, our results accurately predict how the first exit time varies with the learning rate $\eta$ and the gradient clipping scheme.
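A self-contained sketch of this first-exit experiment. The test function, initial point, and field boundaries below are hypothetical stand-ins (the exact landscape is specified in the supplementary materials), but the recursion and the measurement follow (2)–(3) and the definition of $\sigma_i(\eta)$.

```python
import numpy as np

def first_exit_time(f_prime, x0, omega, eta, b, L, alpha=1.2, rng=None,
                    max_iter=10**8):
    """Iterate the clipped, reflected SGD recursion (2)-(3) until the iterate
    leaves the interval omega = (lo, hi); return the number of steps taken."""
    rng = np.random.default_rng(rng)
    lo, hi = omega
    x = x0
    for n in range(max_iter):
        if not (lo < x < hi):
            return n                                 # sigma_i(eta), cf. Theorem 1
        z = rng.pareto(alpha) * rng.choice([-1.0, 1.0])
        step = min(max(eta * (f_prime(x) - z), -b), b)   # phi(., b)
        x = min(max(x - step, -L), L)                    # phi(., L)
    return max_iter

# Hypothetical double-well stand-in: average exit times over 20 runs per eta,
# then compare the log-log slope against the predicted orders 1.4 and 1.2.
f_prime = lambda x: 4 * x ** 3 - 2 * x
for eta in (0.1, 0.01, 0.001):
    times = [first_exit_time(f_prime, -0.7, (-1.5, 0.0), eta, b=0.5, L=1.5,
                             rng=k) for k in range(20)]
    print(eta, np.mean(times))
```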
[Figure 5: Typical trajectories of SGD in different cases: (a) heavy-tailed noises, no gradient clipping; (b) heavy-tailed noises, gradient clipping at $b = 0.5$; (c) light-tailed noises, no gradient clipping; (d) light-tailed noises, gradient clipping at $b = 0.5$. The function $f$ is plotted at the right of each panel, and dashed lines are added as references for the locations of the local minima. For readability of the figures, $X_n$ is plotted once every 250 iterations.]

In the next experiment, we study the global dynamics of heavy-tailed SGD. We initialize the SGD iterates at a point in the sharp attraction field $\Omega_3$, fix the learning rate $\eta = 0.001$, and test both the gradient clipping case ($b = 0.5$) and the no gradient clipping case.
Moreover, aside from the Pareto heavy-tailed noises, we also test light-tailed noises, where we use $N(0, 1)$ as the distribution of the noises $Z_n$. For each case, we obtain 20 sample paths of $X_n$, with each run terminated after 3,000,000 iterations ($3{,}000{,}000 \times 20 = 6 \times 10^7$ SGD iterates in total). First, under heavy-tailed noises, for most of the SGD trajectory the iterate is somewhere around a local minimum. When there is no gradient clipping, $X_n$ still visits the sharp minima $m_2, m_3$. Under gradient clipping, the sharp minima are almost eliminated from the trajectory, as the time spent at $m_2, m_3$ is negligible compared to the time $X_n$ spends at $m_1, m_4$, which lie in much wider attraction fields. This is further demonstrated by the corresponding sample paths. Without gradient clipping, $X_n$ jumps frequently between all the local minima (Figure 5 (a)), while under gradient clipping, even though $X_n$ may still jump to the sharp minima occasionally, the time spent there is, relatively speaking, so short that the trajectory is almost always at the minima in the wider attraction fields (Figure 5 (b)). This is exactly the type of dynamics predicted by Theorems 2–5, and it demonstrates the elimination of sharp minima by truncated heavy-tailed noises when the learning rate is small.

Lastly, we stress that the said properties are exclusive to heavy-tailed SGD. As shown in Figure 1 and Figure 5 (c)(d), under the learning rate $\eta = 0.001$ the light-tailed SGD is trapped at the very first minimum close to the initialization point for the entire 3,000,000 iterations. In contrast, heavy-tailed SGD always easily escapes this sharp minimum $m_3$ and starts exploring the entire landscape of $f$. These results imply that the efficient escape from sharp minima and the preference for wide, flat minima is a unique benefit of heavy-tailed noises.

In this work, we characterized the dynamics of SGD under clipped heavy-tailed gradients in the $\mathbb{R}$ case and verified our theoretical results with numerical experiments. For future work, the following open questions are worth pursuing. First, it is natural to consider the extension of the results to the general $\mathbb{R}^d$ case. Second, while we used conditions such as irreducibility or symmetry of the graph $G$ to ensure the elimination of all sharp minima, it is very likely that a more relaxed and general set of geometric conditions would suffice for eliminating most sharp minima; in particular, it is worth exploring how the structure of state-of-the-art deep neural nets lends itself to such desirable geometries and achieves better performance after training.

References

[1] Umut Şimşekli, Mert Gürbüzbalaban, Thanh Huy Nguyen, Gaël Richard, and Levent Sagun. On the heavy-tailed theory of stochastic gradient descent for deep neural networks. arXiv preprint arXiv:1912.00018, 2019.

[2] Umut Şimşekli, Levent Sagun, and Mert Gurbuzbalaban. A tail-index analysis of stochastic gradient noise in deep neural networks. In International Conference on Machine Learning, pages 5827–5837. PMLR, 2019.
[3] Ilya Pavlyukevich. Cooling down Lévy flights. Journal of Physics A: Mathematical and Theoretical, 40(41):12299, 2007.

[4] Peter Imkeller, Ilya Pavlyukevich, and Michael Stauch. First exit times of non-linear dynamical systems in $\mathbb{R}^d$ perturbed by multifractal Lévy noise. Journal of Statistical Physics, 141(1):94–119, 2010.

[5] Peter Imkeller, Ilya Pavlyukevich, and Torsten Wetzel. The hierarchy of exit times of Lévy-driven Langevin equations. The European Physical Journal Special Topics, 191(1):211–222, 2010.

[6] Logan Engstrom, Andrew Ilyas, Shibani Santurkar, Dimitris Tsipras, Firdaus Janoos, Larry Rudolph, and Aleksander Madry. Implementation matters in deep RL: A case study on PPO and TRPO. In International Conference on Learning Representations, 2020.

[7] Stephen Merity, Nitish Shirish Keskar, and Richard Socher. Regularizing and optimizing LSTM language models. In International Conference on Learning Representations, 2018.

[8] Alex Graves. Generating sequences with recurrent neural networks. arXiv preprint arXiv:1308.0850, 2013.

[9] Razvan Pascanu, Tomas Mikolov, and Yoshua Bengio. On the difficulty of training recurrent neural networks. In International Conference on Machine Learning, pages 1310–1318. PMLR, 2013.

[10] Sidney I. Resnick. Heavy-Tail Phenomena: Probabilistic and Statistical Modeling. Springer Science & Business Media, 2007.

[11] Thanh Huy Nguyen, Umut Simsekli, and Gaël Richard. Non-asymptotic analysis of fractional Langevin Monte Carlo for non-convex optimization. In International Conference on Machine Learning, pages 4810–4819. PMLR, 2019.

[12] Vishwak Srinivasan, Adarsh Prasad, Sivaraman Balakrishnan, and Pradeep Kumar Ravikumar. Efficient estimators for heavy-tailed machine learning, 2021.

[13] Saurabh Garg, Joshua Zhanson, Emilio Parisotto, Adarsh Prasad, J. Zico Kolter, Zachary Chase Lipton, Sivaraman Balakrishnan, Ruslan Salakhutdinov, and Pradeep Kumar Ravikumar. On proximal policy optimization's heavy-tailed gradients, 2021.

[14] Sepp Hochreiter and Jürgen Schmidhuber. Flat minima. Neural Computation, 9(1):1–42, 1997.

[15] Nitish Shirish Keskar, Dheevatsa Mudigere, Jorge Nocedal, Mikhail Smelyanskiy, and Ping Tak Peter Tang. On large-batch training for deep learning: Generalization gap and sharp minima. arXiv preprint arXiv:1609.04836, 2016.

[16] Hao Li, Zheng Xu, Gavin Taylor, Christoph Studer, and Tom Goldstein. Visualizing the loss landscape of neural nets. In Proceedings of the 32nd International Conference on Neural Information Processing Systems, pages 6391–6401, 2018.