Eliminating Sharp Minima from SGD with Truncated Heavy-tailed Noise
Xingyu Wang
Department of Industrial Engineering and Management Sciences
Northwestern University, Evanston, IL 60208
[email protected]

Sewoong Oh
Allen School of Computer Science & Engineering
University of Washington, Seattle, WA 98195
[email protected]

Chang-Han Rhee
Department of Industrial Engineering and Management Sciences
Northwestern University, Evanston, IL 60208
[email protected]
February 9, 2021
Abstract
The empirical success of deep learning is often attributed to SGD's mysterious ability to avoid sharp local minima in the loss landscape, which is well known to lead to poor generalization. Recently, empirical evidence of heavy-tailed gradient noise was reported in many deep learning tasks; under the presence of such heavy-tailed gradient noise, it can be shown that SGD can escape sharp local minima, providing a partial solution to the mystery. In this work, we analyze a popular variant of SGD where gradients are truncated above a fixed threshold. We show that it achieves a stronger notion of avoiding sharp minima: it can effectively eliminate sharp local minima entirely from its training trajectory. We characterize the dynamics of truncated SGD driven by heavy-tailed noises. First, we show that the truncation threshold and the width of the attraction field dictate the order of the first exit time from the associated local minimum. Moreover, when the objective function satisfies appropriate structural conditions, we prove that as the learning rate decreases, the dynamics of heavy-tailed truncated SGD closely resemble those of a special continuous-time Markov chain which never visits any sharp minima. We verify our theoretical results with numerical experiments and discuss the implications for the generalizability of SGD in deep learning.
Stochastic gradient descent (SGD) and its variants have seen unprecedented empirical successes in training deep neural networks. The training of deep neural networks is typically posed as a non-convex optimization problem without explicit regularization, but the solutions obtained by SGD often perform surprisingly well on test data. Such an unexpected generalization performance of SGD in deep neural networks is often attributed to SGD's ability to avoid sharp local minima in the loss landscape, which is well known to lead to poor generalization; there have been significant efforts to explain such phenomena theoretically, but how SGD avoids sharp local minima still remains one of the central mysteries of deep learning. Recently, it was suggested that the previous theoretical efforts failed to provide a satisfactory explanation due to the assumption that the stochastic gradient noise is light-tailed: [1] and [2] report empirical evidence of heavy tails in the stochastic gradient noise of common deep learning architectures and show that SGD can escape sharp local minima under the presence of heavy-tailed gradient noise. More specifically, they view heavy-tailed SGD as a discrete approximation of a Lévy-driven Langevin equation and argue that the amount of time the SGD trajectory spends in each local minimum is proportional to the width of the minimum, according to the metastability theory [3, 4, 5] for such heavy-tailed processes. Figure 1 (a) illustrates this with the histogram of a sample trajectory of SGD.

[Figure 1: Histograms of the time SGD spent at different locations, under four settings: (a) heavy-tailed noises, no gradient clipping; (b) heavy-tailed noises, gradient clipping; (c) light-tailed noises, no gradient clipping; (d) light-tailed noises, gradient clipping. With the help of clipped heavy-tailed noises, SGD hardly ever visits the two sharp minima $m_2$ and $m_3$. The function $f$ is plotted at the bottom over $[-1.5, 1.5]$, and dashed lines are added as references for the locations of the local minima.]

In this paper, we consider a popular variant of SGD where the stochastic gradient is truncated above a fixed threshold. Such a truncation scheme is often called gradient clipping and is employed by default in various contexts [6, 7, 8, 9]. We prove that under this scheme, the long-run behavior of SGD is fundamentally different from that of the pure form of SGD: in particular, under a suitable structural condition on the geometry of the loss landscape, gradient clipping can completely eliminate sharp minima from the trajectory of SGD.

Figure 1 clearly illustrates these points with the histograms of the sample trajectories of SGD. Note first that SGD with light-tailed gradient noise—Figure 1 (c) and (d)—never manages to escape a (sharp) minimum, regardless of gradient clipping. In contrast, SGD with heavy-tailed gradient noise easily escapes from local minima. Moreover, there is a clear difference between SGD with and without gradient clipping. In Figure 1 (a), SGD without gradient clipping stays at each of the four local minima ($\{m_1, m_2, m_3, m_4\}$) for a significant amount of time, although it spends more time around the wide ones ($\{m_1, m_4\}$) than the sharp ones ($\{m_2, m_3\}$). On the other hand, in Figure 1 (b), SGD with gradient clipping not only escapes from local minima but also avoids the sharp minima ($\{m_2, m_3\}$) almost completely. This means that if SGD with gradient clipping terminates at an arbitrary time point, it is almost guaranteed not to be at a sharp minimum, effectively eliminating sharp minima from its trajectory. The next section characterizes such long-run dynamics of truncated SGD driven by heavy-tailed noises in detail.

We provide a complete characterization of the dynamics of SGD iterates with gradient clipping when the SGD algorithm is applied to a non-convex function $f$ on $\mathbb{R}$. We first introduce the assumptions on the function. Let $f \in C^2(\mathbb{R})$ be a function satisfying the following assumption.
Assumption 1. There exist a positive integer $n_{\min}$, a real number $L \in (0, \infty)$, and an ordered sequence of real numbers $m_1, s_1, m_2, s_2, \cdots, s_{n_{\min}-1}, m_{n_{\min}}$ such that
• $-L < m_1 < s_1 < m_2 < s_2 < \cdots < s_{n_{\min}-1} < m_{n_{\min}} < L$;
• $f'(x) = 0$ iff $x \in \{m_1, s_1, \cdots, s_{n_{\min}-1}, m_{n_{\min}}\}$;
• For any $x \in \{m_1, m_2, \cdots, m_{n_{\min}}\}$, $f''(x) > 0$;
• For any $x \in \{s_1, s_2, \cdots, s_{n_{\min}-1}\}$, $f''(x) < 0$.

As illustrated in Figure 2, the assumption above requires that $f$ has finitely many local minima (to be specific, the count is $n_{\min}$) and that they all lie in the interval $[-L, L]$. Moreover, the points $s_1, \cdots, s_{n_{\min}-1}$ naturally partition the entire real line into regions $\Omega_i = (s_{i-1}, s_i)$ (here we adopt the convention that $s_0 = -\infty$, $s_{n_{\min}} = +\infty$). We call each region $\Omega_i$ the attraction field of the local minimum $m_i$, as the gradient flow in any $\Omega_i$ always points to $m_i$.

Throughout the optimization procedure, given any location $x \in \mathbb{R}$ we assume that we only have access to the noisy estimator $f'(x) - Z_n$ instead of the true gradient. Specifically, in this work we are interested in the case where the iid sequence of noises $(Z_n)_{n \geq 1}$ is heavy-tailed. Typically, heavy-tailedness of a distribution is captured by the concept of regular variation, and we work with the following assumption. Let
$$H_+(x) = P(Z_1 > x), \qquad H_-(x) = P(Z_1 < -x), \qquad H(x) = H_+(x) + H_-(x) = P(|Z_1| > x).$$

Assumption 2. $E Z_1 = 0$. Furthermore, there exists some $\alpha \in (1, \infty)$ such that the function $H(x)$ is regularly varying (at $+\infty$) with index $-\alpha$. Besides, regarding the positive and negative tails of the distribution of the noises, we have
$$\lim_{x \to \infty} \frac{H_+(x)}{H(x)} = p_+, \qquad \lim_{x \to \infty} \frac{H_-(x)}{H(x)} = p_- = 1 - p_+,$$
where $p_+$ and $p_-$ are constants in the interval $(0, 1)$.

[Figure 2: Illustration of a function satisfying Assumption 1, with local minima $m_1, \cdots, m_4$ and attraction fields $\Omega_1, \cdots, \Omega_4$ on $[-L, L]$.]

For a measurable function $\phi: \mathbb{R}_+ \to \mathbb{R}_+$, we say that $\phi$ is regularly varying at $+\infty$ with index $\beta$ (denoted as $\phi \in \mathcal{RV}_\beta$) if, for any $t > 0$, $\lim_{x \to \infty} \phi(tx)/\phi(x) = t^\beta$. For details on the definition and properties of regularly varying functions, see Chapter 2 of [10]. Roughly speaking, Assumption 2 tells us that the shape of the tail of the distribution of the noises $Z_n$ resembles the polynomial function $x^{-\alpha}$, which is much heavier than the $\exp(-x^2)$ tail of a Gaussian distribution. Therefore, a large value of $Z_n$ is much more likely under heavy-tailed noises than in the light-tailed case.

Our work concerns a popular variant of SGD where the gradient clipping technique is employed. Specifically, when updating the SGD iterates with learning rate $\eta$, rather than using the original noisy gradient step $\eta(f'(X_n) - Z_n)$, we truncate it above some threshold $b > 0$ and use $\varphi\big(\eta(f'(X_n) - Z_n),\, b\big)$ instead. Here the truncation operator $\varphi$ is defined as
$$\varphi(w, c) \triangleq (w \wedge c) \vee (-c) \qquad \forall w \in \mathbb{R},\ c > 0. \tag{1}$$
Note that $u \wedge v = \min\{u, v\}$ and $u \vee v = \max\{u, v\}$. For technical reasons, we need the following assumption about the clipping threshold $b > 0$. Fortunately, Assumption 3 is a very mild one, as it is satisfied by (Lebesgue) almost every $b > 0$.

Assumption 3. For any $i = 1, 2, \cdots, n_{\min}$, $\min\{|s_i - m_i|, |s_{i-1} - m_i|\}/b$ is NOT an integer.

Aside from the gradient clipping technique, we also reflect the SGD iterates at $\pm L$ (recall that $L$ is the constant in Assumption 1). Formally speaking, when the learning rate (namely, the step size) is $\eta > 0$, we generate the iterates $(X^\eta_n)_{n \geq 0}$ using the following recursive updates:
$$X^{\mathrm{clip},\eta}_n \triangleq X^\eta_n - \varphi\big(\eta(f'(X^\eta_n) - Z_n),\, b\big), \tag{2}$$
(SGD update with gradient clipping)
$$X^\eta_{n+1} \triangleq \varphi\big(X^{\mathrm{clip},\eta}_n,\, L\big). \tag{3}$$
(Reflection at $\pm L$)
When there is no ambiguity about the learning rate $\eta$, we simply use $X_n$ to denote the SGD iterates produced by the recursive updates defined above. We stress that, despite the introduction of the reflection operator, the generality of the algorithm studied here is not compromised, nor would the behavior of $(X^\eta_n)_{n \geq 0}$ differ noticeably from the vanilla SGD iterates. First of all, it is a common trick in practical statistical learning or optimization tasks to truncate the weights and ensure that they do not explode and drift to infinity. Besides, preventing the iterates from drifting to infinity (explosion of weights) is also a must in theoretical analyses of SGD or Langevin-type stochastic differential equations, and thanks to the reflection operator we do not need to introduce sophisticated assumptions on the tail behavior of $f$ that are commonly seen in existing works (see, for instance, the dissipativity conditions in [11]).

We first introduce a few concepts that are crucial to our main results. For any attraction field $\Omega_i$, define (note that $\lceil x \rceil = \min\{n \in \mathbb{Z}: n \geq x\}$ and $\lfloor x \rfloor = \max\{n \in \mathbb{Z}: n \leq x\}$)
$$r_i \triangleq \min\{|m_i - s_{i-1}|,\, |s_i - m_i|\}, \tag{4}$$
$$l^*_i \triangleq \lceil r_i / b \rceil. \tag{5}$$
Note that $l^*_i$ is in fact influenced by the value of the gradient clipping threshold $b$, even though this dependency is not highlighted by the notation. Here $r_i$ can be interpreted as the radius, or the effective width, of the attraction field, and $l^*_i$ is the minimum number of jumps required to escape $\Omega_i$ when starting from $m_i$: indeed, the gradient clipping threshold $b$ dictates that any SGD update step can travel no more than $b$, so to exit $\Omega_i$ when starting from $m_i$ (which requires the length of travel to be at least $r_i$), at least $\lceil r_i / b \rceil$ steps are required. The moral of the results in this section is as follows. First, for a fixed $b$, the minimum number of jumps $l^*_i$ is an indicator of the width of the attraction field $\Omega_i$, hence of the minimum effort required to exit $\Omega_i$. More importantly, as will be shown in Theorem 1, $l^*_i$ dictates the order of the first exit time as well as where the iterates $X^\eta_n$ land when leaving $\Omega_i$.

Specifically, we are now interested in the behavior of SGD regarding the following stopping time for the first exit from $\Omega_i$:
$$\sigma_i(\eta) \triangleq \min\{n \geq 0: X^\eta_n \notin \Omega_i\}.$$
Besides, for any attraction field $\Omega_i$, define a scaling function
$$\lambda_i(\eta) \triangleq H(1/\eta)\Big(\frac{H(1/\eta)}{\eta}\Big)^{l^*_i - 1}. \tag{6}$$
To stress the initial condition, we either use $P_x$ to denote the probability law induced by $X^\eta_0 = x$, or simply write $X^\eta_n(x)$.
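To make the update rule concrete, here is a minimal Python sketch of one trajectory of the clipped, reflected recursion (2)–(3). The two-sided Pareto noise matches the spirit of Assumption 2, but the particular scale, seed, and test function are hypothetical choices for illustration, not the ones used in our experiments.

```python
import numpy as np

def phi(w, c):
    """Truncation operator (1): clip w to the interval [-c, c]."""
    return np.minimum(np.maximum(w, -c), c)

def clipped_sgd_path(f_prime, x0, eta, b, L, n_steps, alpha=1.2, rng=None):
    """Run the recursion (2)-(3): clipped SGD step, then reflection at +/-L.

    The gradient noise Z_n is a symmetric two-sided Pareto (Lomax) variable
    with tail index alpha, so P(|Z| > x) ~ x^{-alpha} as in Assumption 2,
    and E[Z] = 0 by symmetry.
    """
    rng = np.random.default_rng(rng)
    x = np.empty(n_steps + 1)
    x[0] = x0
    for n in range(n_steps):
        # Heavy-tailed magnitude with a random sign.
        z = rng.pareto(alpha) * rng.choice([-1.0, 1.0])
        step = phi(eta * (f_prime(x[n]) - z), b)   # (2): clipped update
        x[n + 1] = phi(x[n] - step, L)             # (3): reflection at +/-L
    return x

# Example with a hypothetical double-well landscape on [-L, L].
if __name__ == "__main__":
    f_prime = lambda x: 4 * x ** 3 - 2 * x         # gradient of x^4 - x^2
    path = clipped_sgd_path(f_prime, x0=0.7, eta=0.01, b=0.5, L=1.5,
                            n_steps=10_000, rng=0)
    print(path[-5:])
```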
Theorem 1. Under Assumptions 1–3, given $x \in [-L, L]$ and $i \in \{1, 2, \cdots, n_{\min}\}$ such that $x \in \Omega_i$, we have:
1) Under $P_x$, $q_i \lambda_i(\eta)\, \sigma_i(\eta)$ converges in distribution to an Exponential random variable with rate 1 as $\eta \downarrow 0$;
2) For any $j = 1, 2, \cdots, n_{\min}$ such that $j \neq i$, $\lim_{\eta \downarrow 0} P_x\big(X^\eta_{\sigma_i(\eta)} \in \Omega_j\big) = q_{i,j}/q_i$,
where the constants satisfy $q_i > 0$ for all $i$ and $q_{i,j} \geq 0$ for all $j \neq i$.

For the complete proof and explicit formulas for the constants $q_i, q_{i,j}$, we refer the readers to the supplementary materials. We add a few remarks to provide intuitive interpretations.

[Figure 3: Typical transition graphs $G$ under different gradient clipping thresholds $b$. (Left) The function $f$ illustrated here has 3 attraction fields; for the second one, $\Omega_2 = (s_1, s_2)$, we have $s_2 - m_2 = 0.9$ and $m_2 - s_1 = 0.6$. (Middle) The typical transition graph induced by $b = 0.5$. The entire graph $G$ is irreducible, since all nodes communicate with each other. (Right) The typical transition graph induced by $b = 0.4$. When $b = 0.4$, since $0.6 < 2b$ and $0.9 > 2b$, starting from $m_2$ and with only 2 jumps, the iterates can exit $\Omega_2$ only from the left. Therefore, the graph $G$ has two communicating classes: $G_1 = \{m_1, m_2\}$ and $G_2 = \{m_3\}$; $G_1$ is absorbing while $G_2$ is transient.]
• As is made clear by the proof, Theorem 1 is an immediate consequence of the following general principle governing the dynamics of $X^\eta_n$. Fix some $\epsilon > 0$ so that $B_\epsilon(m_i) = [m_i - \epsilon, m_i + \epsilon]$ is a small neighborhood of the local minimum $m_i$. We say that $X^\eta_n$ makes an attempt to escape if it leaves $B_\epsilon(m_i)$, and the attempt fails if $X^\eta_n$ returns to $B_\epsilon(m_i)$ without exiting $\Omega_i$. As the learning rate $\eta$ tends to 0, one can show that (1) it is very rare that an attempt succeeds, and (2) when an attempt does succeed (so that $X^\eta_n$ finally escapes $\Omega_i$), it is almost always because $X^\eta_n$ makes exactly $l^*_i$ large jumps during this attempt, which let $X^\eta_n$ overcome the distance $r_i$ and escape. By large jumps, we mean SGD steps where the noise $Z_n$ is so large that, even when modulated by the small learning rate $\eta$, it still causes a noticeable perturbation (say, $\eta |Z_n| > \epsilon$).

• Given this $l^*_i$-jump principle, one would expect that $X^\eta_{\sigma_i(\eta)}$, the location right at the time of exit, will hardly ever be more than $l^*_i b$ away from $m_i$: the length of each update is clipped at $b$, and there will most likely be only $l^*_i$ large SGD steps during the successful attempt. In fact, from the proof one can see that $q_{i,j} > 0$ if and only if there exists $y \in \Omega_j$ with $|y - m_i| < l^*_i b$.

• Part (1) of Theorem 1 tells us that the first exit time is roughly of order $1/\lambda_i(\eta) \approx (1/\eta)^{\alpha + (\alpha - 1)(l^*_i - 1)}$.

Summarizing the three bullet points, we see that the minimum number of jumps $l^*_i$ dictates how heavy-tailed SGD escapes an attraction field, where it exits to, and when the exit occurs.
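To spell out the one step of arithmetic behind the exit-time order in the last bullet: Assumption 2 gives $H(1/\eta) \approx \eta^\alpha$ up to a slowly varying factor, so the scaling function (6) satisfies
$$\frac{1}{\lambda_i(\eta)} = \frac{1}{H(1/\eta)} \left(\frac{\eta}{H(1/\eta)}\right)^{l^*_i - 1} \approx \frac{1}{\eta^{\alpha}} \left(\frac{\eta}{\eta^{\alpha}}\right)^{l^*_i - 1} = \left(\frac{1}{\eta}\right)^{\alpha + (\alpha - 1)(l^*_i - 1)}.$$
In particular, each additional required jump multiplies the expected exit time by a factor of order $(1/\eta)^{\alpha - 1}$.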
In this section, we show that, under proper structural conditions on $f$ and the gradient clipping scheme, sharp minima can be effectively eliminated from the trajectory of heavy-tailed SGD. Again, we first introduce new concepts that are crucial to the results of this section. Similar to the minimum number of jumps for escape $l^*_i$ defined in (5), for any $j \neq i$ we define the minimum number of jumps to reach $\Omega_j$ from $m_i$:
$$l_{i,j} = \begin{cases} \lceil (s_{j-1} - m_i)/b \rceil & \text{if } j > i, \\ \lceil (m_i - s_j)/b \rceil & \text{if } j < i. \end{cases} \tag{7}$$
From the remarks after Theorem 1, we see that when exiting $\Omega_i$, $X^\eta_n$ will most likely move to somewhere reachable within $l^*_i$ jumps. Therefore, the typical transitions the SGD iterates make are those from $\Omega_i$ to $\Omega_j$ with $l_{i,j} = l^*_i$. Now we can define the following directed graph that only includes these typical transitions.

Definition 1 (Typical Transition Graph). Given a function $f$ satisfying Assumption 1 and a gradient clipping threshold $b > 0$ satisfying Assumption 3, a directed graph $G = (V, E)$ is the corresponding typical transition graph if
• $V = \{m_1, \cdots, m_{n_{\min}}\}$;
• An edge $(m_i \to m_j)$ is in $E$ if and only if $l_{i,j} = l^*_i$.

Naturally, the typical transition graph $G$ can be decomposed into several mutually exclusive communicating classes $G_1, \cdots, G_K$. Recall that a communicating class $G$ on $G$ is an equivalence class with the property that
• For $i \neq j$, we have $m_i \in G$ and $m_j \in G$ if and only if there exist a path $(m_i, m_{k_1}, \cdots, m_{k_n}, m_j)$ as well as a path $(m_j, m_{k'_1}, \cdots, m_{k'_{n'}}, m_i)$ on $G$; in other words, by travelling through edges of $G$, $m_i$ and $m_j$ can communicate with each other.

We say a communicating class $G$ is absorbing if there does not exist any edge $(m_i \to m_j) \in E$ such that $m_i \in G$ yet $m_j \notin G$. Otherwise, we say $G$ is transient. In the case that all $m_i$ communicate with each other on the graph $G$, we say $G$ is irreducible. See Figure 3 (Middle) for an illustration of an irreducible case. When $G$ is irreducible, the set of largest attraction fields of $f$ can be defined as $M_{\text{large}} = \{m_i: i = 1, 2, \cdots, n_{\min},\ l^*_i = l_{\text{large}}\}$ with $l_{\text{large}} = \max_j l^*_j$: recall that, given the clipping threshold $b$, $l^*_i$ also indicates the width of $\Omega_i$, hence the name largest attraction fields for all $\Omega_i$ with $l^*_i = l_{\text{large}}$. Also, we define the time scaling $\lambda_{\text{large}}(\eta) = H(1/\eta)\big(H(1/\eta)/\eta\big)^{l_{\text{large}} - 1}$ which, as indicated in Theorem 1, is exactly the time scale of the first exit times of the largest attraction fields of $f$.
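The quantities (5) and (7) and Definition 1 are directly computable from the landmarks $(m_i, s_i)$ and the threshold $b$. The following sketch builds the typical transition graph and classifies its communicating classes; the landmark values are hypothetical, merely patterned after Figure 3 ($m_2 - s_1 = 0.6$, $s_2 - m_2 = 0.9$).

```python
import math
from itertools import product

def typical_transition_graph(m, s, b):
    """Edge set of Definition 1 for minima m[0..n-1] and maxima s[0..n-2]
    (0-based indices), under clipping threshold b."""
    n = len(m)

    def l_star(i):  # (5): minimum number of jumps needed to escape Omega_i
        r = min(m[i] - s[i - 1] if i > 0 else math.inf,
                s[i] - m[i] if i < n - 1 else math.inf)
        return math.ceil(r / b)

    def l_jump(i, j):  # (7): minimum number of jumps to reach Omega_j from m_i
        return (math.ceil((s[j - 1] - m[i]) / b) if j > i
                else math.ceil((m[i] - s[j]) / b))

    return {(i, j) for i, j in product(range(n), repeat=2)
            if i != j and l_jump(i, j) == l_star(i)}

def communicating_classes(n, edges):
    """Classes of mutual reachability; each flagged absorbing (True) or not."""
    reach = {i: {i} for i in range(n)}
    changed = True
    while changed:  # transitive closure by fixed-point iteration
        changed = False
        for i, j in edges:
            if not reach[j] <= reach[i]:
                reach[i] |= reach[j]
                changed = True
    classes = []
    for i in range(n):
        cls = {j for j in range(n) if j in reach[i] and i in reach[j]}
        if cls not in classes:
            classes.append(cls)
    return [(sorted(c), not any(i in c and j not in c for i, j in edges))
            for c in classes]

# Hypothetical landmarks patterned after Figure 3.
m, s = [-1.0, -0.1, 1.25], [-0.7, 0.8]
for b in (0.5, 0.4):
    edges = typical_transition_graph(m, s, b)
    print(b, sorted(edges), communicating_classes(len(m), edges))
# b = 0.5: one irreducible class; b = 0.4: {m1, m2} absorbing, {m3} transient.
```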
The following result gives a taste of the unique benefit of heavy-tailed SGD with gradient clipping: all the small attraction fields are eliminated from its trajectory.

Theorem 2. Let Assumptions 1–3 hold and assume that the graph $G$ is irreducible. Given any $t > 0$, $\beta > l_{\text{large}}(\alpha - 1) + 1$, and $x \in [-L, L]$, the following random variables (indexed by $\eta$)
$$\frac{1}{\lfloor t/\eta^\beta \rfloor} \int_0^{\lfloor t/\eta^\beta \rfloor} \mathbb{1}\Big\{X^\eta_{\lfloor u \rfloor}(x) \in \bigcup_{j:\, m_j \notin M_{\text{large}}} \Omega_j\Big\}\, du \tag{8}$$
converge in probability to $0$ as $\eta \downarrow 0$.

The proof can be found in the supplementary materials. Here we provide an intuitive interpretation of the result. Suppose that we decide to terminate the SGD training at some reasonably long time, for instance after $\lfloor t/\eta^\beta \rfloor$ iterations; then the random variable defined in (8) is exactly the proportion of time the iterates $X^\eta_n$ spent in attraction fields that are not among the largest ones of $f$. In other words, in this irreducible case only the largest attraction fields survive on the trajectory of SGD under truncated heavy-tailed noises, and all the sharp minima are eliminated from the trajectory as the learning rate approaches 0.
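A minimal sketch of how the occupation fraction (8) can be estimated from a simulated trajectory; `clipped_sgd_path` is the sketch given earlier, and the small-field intervals below are hypothetical stand-ins for $\bigcup_{j:\, m_j \notin M_{\text{large}}} \Omega_j$.

```python
import numpy as np

def occupation_fraction(path, small_fields):
    """Fraction of iterations spent in the given (non-largest) attraction
    fields, i.e. a discrete version of the time average in (8)."""
    in_small = np.zeros(len(path), dtype=bool)
    for lo, hi in small_fields:
        in_small |= (path > lo) & (path < hi)
    return in_small.mean()

# Hypothetical usage: treat (s_1, s_2) as the lone small attraction field.
# frac = occupation_fraction(path, small_fields=[(-0.7, 0.8)])
# Theorem 2 predicts frac -> 0 as eta -> 0 when G is irreducible.
```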
Interestingly enough, Theorem 2 is merely one manifestation of the global dynamics of heavy-tailed SGD, and much more can be said even in the non-irreducible cases. The moral of the results in this section can be summarized as follows: (a) the gradient clipping scheme naturally partitions the entire landscape into different regions; (b) on each region, the dynamics of $X^\eta_n$ closely resemble those of a continuous-time Markov chain that only visits local minima when the learning rate is small; (c) in particular, any sharp minimum within a region is almost always avoided by SGD.

First, astute readers may have noticed already that different gradient clipping thresholds $b$ may induce different structures on $G$. For instance, consider the function depicted in Figure 3 (Left): for its attraction field $\Omega_2 = (s_1, s_2)$, starting from the local minimum $m_2$ we need to travel $0.6$ to exit from the left but $0.9$ to exit from the right, and the threshold $b$ determines both $l^*_i$ in (5) and the minimum jump number $l_{i,j}$ for the $(i \to j)$ transition in (7). When $b = 0.5$, we have $l_{2,1} = l_{2,3} = l^*_2 = 2$, so the entire graph is irreducible. When we use $b = 0.4$, however, we have $l_{2,1} = 2 = l^*_2 < l_{2,3} = 3$. As illustrated in Figure 3 (Right), the graph induced by $b = 0.4$ has two communicating classes: $G_1 = \{m_1, m_2\}$ is absorbing and $G_2 = \{m_3\}$ is transient.

From now on, we zoom in on a specific communicating class $G$ among $G_1, \cdots, G_K$. For this communicating class $G$, define $l^*_G \triangleq \max\{l^*_i: i = 1, 2, \cdots, n_{\min};\ m_i \in G\}$. For each local minimum $m_i \in G$, based on its minimum jump number, we call its attraction field $\Omega_i$ a large attraction field if $l^*_i = l^*_G$, and a small attraction field if $l^*_i < l^*_G$. We have thus classified all $m_i$ in $G$ into two groups: the ones in large attraction fields, $m^{\text{large}}_1, \cdots, m^{\text{large}}_{i_G}$, and the ones in small attraction fields, $m^{\text{small}}_1, \cdots, m^{\text{small}}_{i'_G}$. Also, define a scaling function $\lambda_G$ on this communicating class $G$ as
$$\lambda_G(\eta) \triangleq H(1/\eta)\Big(\frac{H(1/\eta)}{\eta}\Big)^{l^*_G - 1}. \tag{9}$$
Lastly, we consider a version of $X^\eta_n$ that is killed when $X^\eta_n$ leaves $G$. Define the stopping time
$$\tau_G(\eta) \triangleq \min\Big\{n \geq 0: X^\eta_n \notin \bigcup_{i:\, m_i \in G} \Omega_i\Big\}$$
as the first time the iterates leave all attraction fields in $G$, and use a cemetery state $\dagger$ to construct the following process $X^{\dagger,\eta}_n$ as a version of $X^\eta_n$ killed at $\tau_G$:
$$X^{\dagger,\eta}_n = \begin{cases} X^\eta_n & \text{if } n < \tau_G(\eta), \\ \dagger & \text{if } n \geq \tau_G(\eta). \end{cases}$$
Theorem 3. Under Assumptions 1–3, if $G$ is absorbing, then there exist a continuous-time Markov chain $Y$ on $\{m^{\text{large}}_1, \cdots, m^{\text{large}}_{i_G}\}$ with some generator matrix $Q$, as well as a random mapping $\pi_G$ satisfying
• if $m \in \{m^{\text{large}}_1, \cdots, m^{\text{large}}_{i_G}\}$, then $\pi_G(m) \equiv m$;
• if $m \in \{m^{\text{small}}_1, \cdots, m^{\text{small}}_{i'_G}\}$, then the distribution of $\pi_G(m)$ only takes values in $\{m^{\text{large}}_1, \cdots, m^{\text{large}}_{i_G}\}$;
such that for any $x \in \Omega_i$ with $|x| \leq L$ (where $i \in \{1, 2, \cdots, n_{\min}\}$) and $m_i \in G$, we have $X^\eta_{\lfloor t/\lambda_G(\eta) \rfloor}(x) \to Y_t(\pi_G(m_i))$ as $\eta \downarrow 0$ in the sense of finite-dimensional distributions.
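To make the limit object concrete, here is a minimal sketch of simulating a continuous-time Markov chain with a given generator matrix $Q$. The specific two-state $Q$ below is a hypothetical example, not one computed from the theorem.

```python
import numpy as np

def simulate_ctmc(Q, states, x0, t_end, rng=None):
    """Simulate a CTMC with generator Q: hold state i for an Exp(-Q[i][i])
    amount of time, then jump to j != i with probability Q[i][j]/(-Q[i][i])."""
    rng = np.random.default_rng(rng)
    i, t, history = states.index(x0), 0.0, []
    while t < t_end:
        history.append((t, states[i]))
        rate = -Q[i][i]
        if rate == 0.0:          # absorbing state: stay put until t_end
            break
        t += rng.exponential(1.0 / rate)
        probs = [Q[i][j] / rate if j != i else 0.0 for j in range(len(states))]
        i = rng.choice(len(states), p=probs)
    return history

# Hypothetical generator on two large minima {m1, m4}.
Q = [[-1.0, 1.0],
     [0.5, -0.5]]
print(simulate_ctmc(Q, ["m1", "m4"], "m1", t_end=10.0, rng=0)[:5])
```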
Theorem 4. Under Assumptions 1–3, if $G$ is transient, then there exist a continuous-time Markov chain $Y$ with killing that has state space $\{m^{\text{large}}_1, \cdots, m^{\text{large}}_{i_G}, \dagger\}$ (we say the Markov chain $Y$ is killed when it enters the absorbing cemetery state $\dagger$) and some generator matrix $Q$, as well as a random mapping $\pi_G$ satisfying
• if $m \in \{m^{\text{large}}_1, \cdots, m^{\text{large}}_{i_G}\}$, then $\pi_G(m) \equiv m$;
• if $m \in \{m^{\text{small}}_1, \cdots, m^{\text{small}}_{i'_G}\}$, then the distribution of $\pi_G(m)$ only takes values in $\{m^{\text{large}}_1, \cdots, m^{\text{large}}_{i_G}, \dagger\}$;
such that for any $x \in \Omega_i$ with $|x| \leq L$ (where $i \in \{1, 2, \cdots, n_{\min}\}$) and $m_i \in G$, we have $X^{\dagger,\eta}_{\lfloor t/\lambda_G(\eta) \rfloor}(x) \to Y_t(\pi_G(m_i))$ as $\eta \downarrow 0$ in the sense of finite-dimensional distributions.

The generator matrix $Q$ as well as the distribution of the random mapping $\pi_G(\cdot)$ are detailed in the proof. Here we add some remarks. First, intuitively speaking, the two results above tell us that the gradient clipping scheme naturally partitions the entire landscape of $f$ into several regions $G_1, \cdots, G_K$; whether a region $G$ is absorbing or not, when visiting $G$ the dynamics of heavy-tailed SGD converge to those of a continuous-time Markov chain avoiding any local minimum that is not in the largest attraction fields of $G$. Second, under a small learning rate $\eta > 0$, if $X^\eta_n(x)$ is initialized at $x \in \Omega_i$ where $\Omega_i$ is not one of the largest attraction fields on $G$, then the iterates will quickly escape the small attraction field and arrive at one of the largest attraction fields on $G$; such a transition is so quick that, under the time scaling $\lambda_G(\eta)$, it is almost instantaneous, as if $X^\eta_n(x)$ were actually initialized randomly at one of the flat minima. We compress the law of this random initialization into the random mapping $\pi_G$.

Before we conclude this section, we state a stronger version of the results that is an immediate corollary of Theorem 3. If $G$ is irreducible, then Theorem 3 applies to the entire optimization landscape. Recall that $\lambda_{\text{large}}(\cdot)$ is the time scale defined before Theorem 2, which corresponds to the first exit times of the largest attraction fields of $f$.
Theorem 5. Under Assumptions 1–3, if $G$ is irreducible, then there exist a continuous-time Markov chain $Y$ on $M_{\text{large}}$ as well as a random mapping $\pi$ such that for any $i = 1, 2, \cdots, n_{\min}$ and any $x \in \Omega_i$ with $|x| \leq L$, $X^\eta_{\lfloor t/\lambda_{\text{large}}(\eta) \rfloor}(x) \to Y_t(\pi(m_i))$ as $\eta \downarrow 0$ in the sense of finite-dimensional distributions.

Heavy-tailedness and the wide-minima folklore in SGD. Aside from the theoretical interest in characterizing heavy-tailed SGD and the effects of gradient clipping therein, our results also have crucial practical implications for statistical learning tasks, especially in light of the following empirical observations. First, heavy-tailed gradient noise is ubiquitous in modern deep learning contexts, including image classification tasks [1], generative models [12], and deep reinforcement learning tasks [13]. This empirical evidence strongly challenges the traditional light-tailed paradigm in theoretical analyses of SGD and necessitates a much deeper understanding of heavy-tailed SGD. Besides, it remains a mystery in the field of deep learning how the surprisingly good generalizability of trained models is achieved using first-order optimization methods. The well-known conjecture is that sharp local minima generalize poorly at test time, while flat and wide ones tend to generalize better [14]. Aligned with this conjecture is the correlation, observed in various experiments [15, 16], between the test accuracy of deep neural nets and the flatness or width of the local minimum reached at the end of training. While there exist multiple ways of characterizing a wide minimizer (eigenvalues of the Hessian, width of the attraction field, etc.), the general belief is that a sharp local minimum in a small attraction field may lead to poor generalization performance. Our theoretical results are therefore of great practical importance, as they describe how, under truncated heavy-tailed noises, SGD spends much longer in wider attraction fields, and the sharp minima are practically eliminated from its global dynamics.
Systematic control of exit times from attraction fields. In light of the wide-minima folklore, one may want techniques to modify the sojourn time of SGD in each attraction field. Theorem 1 shows that the order of the first exit time (with respect to the learning rate $\eta$) is directly controlled by the gradient clipping threshold $b$. Recall that $H(1/\eta) \approx O(\eta^\alpha)$, so for an attraction field with minimum jump number $l^*$, Theorem 1 tells us that the exit time from this attraction field is roughly of order $(1/\eta)^{\alpha + (\alpha - 1)(l^* - 1)}$. Given the width of the attraction field, its minimum jump number $l^*$ is dictated by the gradient clipping threshold $b$. Therefore, gradient clipping provides a very systematic method to control the exit time from each attraction field. For instance, given a clipping threshold $b$, the exit time from an attraction field with width less than $b$ is of order $(1/\eta)^\alpha$, while the exit time from one wider than $b$ is at least of order $(1/\eta)^{2\alpha - 1}$, which dominates the exit time from the smaller ones.

Ideal structures of $G$ and $f$. In order to apply the strongest result, Theorem 5, irreducibility of $G$ is required, and the shape of the function $f$ may be a deciding factor here aside from the choice of $b$. For instance, we say that $G$ is symmetric if for any attraction field $\Omega_i$ with $i = 2, \cdots, n_{\min} - 1$ (i.e., $\Omega_i$ is not the leftmost or rightmost one at the boundary), we have $q_{i,i-1} > 0$ and $q_{i,i+1} > 0$. One can see that $G$ is symmetric if and only if, for any $i = 2, \cdots, n_{\min} - 1$, $|s_i - m_i| \vee |m_i - s_{i-1}| < l^*_i b$, and symmetry is a sufficient condition for irreducibility of $G$. The graph illustrated in Figure 3 (Middle) is symmetric, while the one in Figure 3 (Right) is not. As the name suggests, in the $\mathbb{R}$ case the symmetry of $G$ is more likely to hold if each attraction field of $f$ is also nearly symmetric around its local minimum. If not, then, as illustrated in Figure 3, the symmetry (as well as the irreducibility) of $G$ is easily violated, especially when a small gradient clipping threshold $b$ is used. Generally speaking, our results imply that, even with the help of truncated heavy-tailed noises, one should expect the structure of the function $f$ to satisfy certain regularity conditions to ensure that the SGD iterates avoid unfavorable minima. This is in the same vein as the observations of [16]: the deep neural nets that are more trainable under SGD tend to have a much more regular structure in terms of the number and shape of local minima.

Heavy-tailed SGD without gradient clipping. It is worth mentioning that our results also characterize the dynamics of heavy-tailed SGD when there is no gradient clipping. Since the reflection operation at $\pm L$ restricts the iterates to the compact set $[-L, L]$, if we use a truncation threshold $b$ larger than $2L$, then any SGD update that moves more than $b$ will in any case be pushed back to $\pm L$ by the reflection. Therefore, the dynamics are identical to those of the following iterates without gradient clipping:
$$X^{\eta,\text{unclipped}}_n = \varphi\Big(X^{\eta,\text{unclipped}}_{n-1} - \eta f'\big(X^{\eta,\text{unclipped}}_{n-1}\big) + \eta Z_n,\ L\Big). \tag{10}$$
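A tiny sketch of the "systematic control" observation above: the polynomial order of the first exit time as a function of the clipping threshold. The formula is the one from Theorem 1; the widths and tail index below are hypothetical inputs.

```python
import math

def exit_time_order(alpha, r, b):
    """Polynomial order beta such that the first exit time is ~ (1/eta)^beta,
    for an attraction field of effective width r under clipping threshold b."""
    l_star = math.ceil(r / b)            # minimum number of jumps, eq. (5)
    return alpha + (alpha - 1) * (l_star - 1)

# With alpha = 1.2: a field narrower than b exits at order (1/eta)^1.2,
# while a field wider than b needs at least two jumps: order >= (1/eta)^1.4.
print(exit_time_order(1.2, r=0.3, b=0.5))   # 1.2
print(exit_time_order(1.2, r=0.6, b=0.5))   # 1.4
```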
The next result follows immediately from Theorems 1 and 3.

Corollary 6. There exist constants $q_i > 0$ for all $i$ and $q_{i,j} > 0$ for all $j \neq i$ such that the following claims hold for any $i$ and any $x \in \Omega_i$ with $|x| \leq L$:
1) Under $P_x$, $q_i H(1/\eta)\, \sigma_i(\eta)$ converges in distribution to an Exponential random variable with rate 1 as $\eta \downarrow 0$;
2) For any $j = 1, 2, \cdots, n_{\min}$ such that $j \neq i$, $\lim_{\eta \downarrow 0} P_x\big(X^\eta_{\sigma_i(\eta)} \in \Omega_j\big) = q_{i,j}/q_i$;
3) Let $Y$ be a continuous-time Markov chain on $\{m_1, \cdots, m_{n_{\min}}\}$ with generator matrix $Q$ parametrized by $Q_{i,i} = -q_i$, $Q_{i,j} = q_{i,j}$. Then $X^{\eta,\text{unclipped}}_{\lfloor t/H(1/\eta) \rfloor}(x) \to Y_t(m_i)$ as $\eta \downarrow 0$ in the sense of finite-dimensional distributions.

At first glance, the form of Corollary 6 is similar to the results in [3] and [1]. However, the object studied in [3] is different: they study the following Langevin-type stochastic differential equation (SDE) driven by a regularly varying Lévy process $L_t$ with magnitude $\sigma > 0$:
$$dX^\sigma_t = -f'(X^\sigma_{t-})\,dt + \sigma\, dL_t.$$
In particular, they study the limiting behavior of $X^\sigma$ as $\sigma \downarrow 0$. On the other hand, we directly analyze the stochastic gradient descent iterates.
[Figure 4: First exit time from $\Omega_1$ versus the learning rate, with and without gradient clipping. Each dot represents the average of 20 samples of the first exit time. Each dashed line shows a polynomial function $c/\eta^\beta$, where $\beta$ is predicted by Theorem 1 (1.4 with gradient clipping, 1.2 without) and the coefficient $c$ is chosen to fit the dots.]

We use numerical experiments to corroborate our theoretical results and demonstrate that: (a) as indicated by Theorem 1, the minimum jump number defined in (5) accurately characterizes the first exit of heavy-tailed SGD when gradient clipping is applied; (b) with the help of gradient clipping, sharp minima can be effectively eliminated from heavy-tailed SGD; (c) the properties studied in this paper are exclusive to heavy-tailed noises; under light-tailed noises, SGD is easily trapped at a sharp minimum for an extremely long time.

The function $f \in C^2(\mathbb{R})$ we consider in this section is exactly the one depicted in Figure 2. Note that $m_2$ and $m_3$ are sharp minima in small attraction fields, while $m_1$ and $m_4$ are much flatter and live in much wider attraction fields. The explicit expression for $f$ is given in the supplementary materials.

First, we study the first exit time of $X^\eta_n$ from the wide attraction field $\Omega_1$ when initialized at $-0.7$. The heavy-tailed noises we use in the experiment are $Z_n = U_n W_n$ (up to a constant scale factor), where the $W_n$ follow a Pareto Type II distribution (aka Lomax distribution) with shape parameter $\alpha = 1.2$, and the signs $U_n$ are iid random variables such that $U_n = 1$ with probability $1/2$ and $U_n = -1$ with probability $1/2$. We test seven different learning rates $\eta$ ranging from $0.1$ down to $0.0001$. As for the gradient clipping scheme, we test a gradient clipping case, where the SGD iterates are produced by the updates (2)–(3) with $b = 0.5$, and a no gradient clipping case, where we remove the gradient clipping scheme so that the SGD updates are generated by (10). For each case, we run the simulation 20 times and plot the average of the 20 exit times in Figure 4. According to Theorem 1 and Corollary 6, when there is no gradient clipping, the first exit time is roughly of order $1/H(1/\eta) \approx (1/\eta)^\alpha = (1/\eta)^{1.2}$ for small $\eta$; when gradient clipping is applied with $b = 0.5$, the minimum jump number is $l^* = 2$ for the attraction field $\Omega_1$, and the first exit time is roughly of order $1/\big(H(1/\eta) \cdot H(1/\eta)/\eta\big) \approx (1/\eta)^{2\alpha - 1} = (1/\eta)^{1.4}$. As demonstrated in Figure 4, our results accurately predict how the first exit time varies with the learning rate $\eta$ and the gradient clipping scheme.
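A self-contained sketch of this first-exit experiment. The test function, initial point, and field boundaries below are hypothetical stand-ins (the exact landscape is specified in the supplementary materials), but the recursion and the measurement follow (2)–(3) and the definition of $\sigma_i(\eta)$.

```python
import numpy as np

def first_exit_time(f_prime, x0, omega, eta, b, L, alpha=1.2, rng=None,
                    max_iter=10**8):
    """Iterate the clipped, reflected SGD recursion (2)-(3) until the iterate
    leaves the interval omega = (lo, hi); return the number of steps taken."""
    rng = np.random.default_rng(rng)
    lo, hi = omega
    x = x0
    for n in range(max_iter):
        if not (lo < x < hi):
            return n                                 # sigma_i(eta), cf. Theorem 1
        z = rng.pareto(alpha) * rng.choice([-1.0, 1.0])
        step = min(max(eta * (f_prime(x) - z), -b), b)   # phi(., b)
        x = min(max(x - step, -L), L)                    # phi(., L)
    return max_iter

# Hypothetical double-well stand-in: average exit times over 20 runs per eta,
# then compare the log-log slope against the predicted orders 1.4 and 1.2.
f_prime = lambda x: 4 * x ** 3 - 2 * x
for eta in (0.1, 0.01, 0.001):
    times = [first_exit_time(f_prime, -0.7, (-1.5, 0.0), eta, b=0.5, L=1.5,
                             rng=k) for k in range(20)]
    print(eta, np.mean(times))
```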
[Figure 5: Typical trajectories of SGD in different cases: (a) heavy-tailed noises, no gradient clipping; (b) heavy-tailed noises, gradient clipping at $b = 0.5$; (c) light-tailed noises, no gradient clipping; (d) light-tailed noises, gradient clipping at $b = 0.5$. The function $f$ is plotted at the right of each panel, and dashed lines are added as references for the locations of the local minima. For readability of the figures, $X_n$ is plotted once every 250 iterations.]

In the next experiment, we study the global dynamics of heavy-tailed SGD. We initialize the SGD iterates at a point in the sharp attraction field $\Omega_3$, fix the learning rate $\eta = 0.001$, and test both the gradient clipping case ($b = 0.5$) and the no gradient clipping case.
Moreover, aside from the Pareto heavy-tailed noises, we also test light-tailed noises, where we use $N(0, 1)$ as the distribution of the noises $Z_n$. For each case, we obtain 20 sample paths of $X_n$, with each run terminated after 3,000,000 iterations ($3{,}000{,}000 \times 20 = 6 \times 10^7$ SGD iterates in total). First, under heavy-tailed noises, for most of the SGD trajectory the iterate is somewhere around a local minimum. When there is no gradient clipping, $X_n$ still visits the sharp minima $m_2, m_3$. Under gradient clipping, the sharp minima are almost eliminated from the trajectory, as the time spent at $m_2, m_3$ is negligible compared to the time $X_n$ spends at $m_1, m_4$, which lie in much wider attraction fields. This is further demonstrated by the corresponding sample paths. Without gradient clipping, $X_n$ jumps frequently between all the local minima (Figure 5 (a)), while under gradient clipping, even though $X_n$ may still jump to the sharp minima occasionally, the time spent there is, relatively speaking, so short that the trajectory is almost always at the minima in the wider attraction fields (Figure 5 (b)). This is exactly the type of dynamics predicted by Theorems 2–5, and it demonstrates the elimination of sharp minima by truncated heavy-tailed noises when the learning rate is small.

Lastly, we stress that the said properties are exclusive to heavy-tailed SGD. As shown in Figure 1 and Figure 5 (c)(d), under the learning rate $\eta = 0.001$ the light-tailed SGD is trapped at the very first minimum close to the initialization point for the entire 3,000,000 iterations. In contrast, heavy-tailed SGD always easily escapes this sharp minimum $m_3$ and starts exploring the entire landscape of $f$. These results imply that the efficient escape from sharp minima and the preference for wide, flat minima is a unique benefit of heavy-tailed noises.

In this work, we characterized the dynamics of SGD under clipped heavy-tailed gradients in the $\mathbb{R}$ case and verified our theoretical results with numerical experiments. For future work, the following open questions are worth pursuing. First, it is natural to consider the extension of the results to the general $\mathbb{R}^d$ case. Second, while we used conditions such as irreducibility or symmetry of the graph $G$ to ensure the elimination of all sharp minima, it is very likely that a more relaxed and general set of geometric conditions would suffice for eliminating most sharp minima; in particular, it is worth exploring how the structure of state-of-the-art deep neural nets lends itself to such desirable geometries and achieves better performance after training.

References

[1] Umut Şimşekli, Mert Gürbüzbalaban, Thanh Huy Nguyen, Gaël Richard, and Levent Sagun. On the heavy-tailed theory of stochastic gradient descent for deep neural networks. arXiv preprint arXiv:1912.00018, 2019.

[2] Umut Şimşekli, Levent Sagun, and Mert Gurbuzbalaban. A tail-index analysis of stochastic gradient noise in deep neural networks. In International Conference on Machine Learning, pages 5827–5837. PMLR, 2019.
[3] Ilya Pavlyukevich. Cooling down Lévy flights. Journal of Physics A: Mathematical and Theoretical, 40(41):12299, 2007.

[4] Peter Imkeller, Ilya Pavlyukevich, and Michael Stauch. First exit times of non-linear dynamical systems in $\mathbb{R}^d$ perturbed by multifractal Lévy noise. Journal of Statistical Physics, 141(1):94–119, 2010.

[5] Peter Imkeller, Ilya Pavlyukevich, and Torsten Wetzel. The hierarchy of exit times of Lévy-driven Langevin equations. The European Physical Journal Special Topics, 191(1):211–222, 2010.

[6] Logan Engstrom, Andrew Ilyas, Shibani Santurkar, Dimitris Tsipras, Firdaus Janoos, Larry Rudolph, and Aleksander Madry. Implementation matters in deep RL: A case study on PPO and TRPO. In International Conference on Learning Representations, 2020.

[7] Stephen Merity, Nitish Shirish Keskar, and Richard Socher. Regularizing and optimizing LSTM language models. In International Conference on Learning Representations, 2018.

[8] Alex Graves. Generating sequences with recurrent neural networks. arXiv preprint arXiv:1308.0850, 2013.

[9] Razvan Pascanu, Tomas Mikolov, and Yoshua Bengio. On the difficulty of training recurrent neural networks. In International Conference on Machine Learning, pages 1310–1318. PMLR, 2013.

[10] Sidney I. Resnick. Heavy-Tail Phenomena: Probabilistic and Statistical Modeling. Springer Science & Business Media, 2007.

[11] Thanh Huy Nguyen, Umut Simsekli, and Gaël Richard. Non-asymptotic analysis of fractional Langevin Monte Carlo for non-convex optimization. In International Conference on Machine Learning, pages 4810–4819. PMLR, 2019.

[12] Vishwak Srinivasan, Adarsh Prasad, Sivaraman Balakrishnan, and Pradeep Kumar Ravikumar. Efficient estimators for heavy-tailed machine learning, 2021.

[13] Saurabh Garg, Joshua Zhanson, Emilio Parisotto, Adarsh Prasad, J. Zico Kolter, Zachary Chase Lipton, Sivaraman Balakrishnan, Ruslan Salakhutdinov, and Pradeep Kumar Ravikumar. On proximal policy optimization's heavy-tailed gradients, 2021.

[14] Sepp Hochreiter and Jürgen Schmidhuber. Flat minima. Neural Computation, 9(1):1–42, 1997.

[15] Nitish Shirish Keskar, Dheevatsa Mudigere, Jorge Nocedal, Mikhail Smelyanskiy, and Ping Tak Peter Tang. On large-batch training for deep learning: Generalization gap and sharp minima. arXiv preprint arXiv:1609.04836, 2016.

[16] Hao Li, Zheng Xu, Gavin Taylor, Christoph Studer, and Tom Goldstein. Visualizing the loss landscape of neural nets. In Proceedings of the 32nd International Conference on Neural Information Processing Systems, pages 6391–6401, 2018.