Convergence Rates of Stochastic Gradient Descent under Infinite Noise Variance
Hongjian Wang, Mert Gürbüzbalaban, Lingjiong Zhu, Umut Şimşekli, Murat A. Erdogdu
February 23, 2021
Abstract
Recent studies have provided both empirical and theoretical evidence illustrating that heavy tails can emerge in stochastic gradient descent (SGD) in various scenarios. Such heavy tails potentially result in iterates with diverging variance, which hinders the use of conventional convergence analysis techniques that rely on the existence of second-order moments. In this paper, we provide convergence guarantees for SGD under a state-dependent and heavy-tailed noise with a potentially infinite variance, for a class of strongly convex objectives. In the case where the p-th moment of the noise exists for some p ∈ [1, 2), we identify a condition on the Hessian, coined 'p-positive (semi-)definiteness', that leads to an interesting interpolation between positive semi-definite matrices (p = 2) and diagonally dominant matrices with non-negative diagonal entries (p = 1). Under this condition, we then provide a convergence rate for the distance to the global optimum in L_p. Furthermore, we provide a generalized central limit theorem, which shows that the properly scaled Polyak-Ruppert averaging converges weakly to a multivariate α-stable random vector. Our results indicate that even under heavy-tailed noise with infinite variance, SGD can converge to the global optimum without necessitating any modification either to the loss function or to the algorithm itself, as typically required in robust statistics. We demonstrate the implications of our results for applications such as linear regression and generalized linear models subject to heavy-tailed data.

1 Introduction

We consider the unconstrained minimization problem

    minimize_{x ∈ R^d} f(x),   (1.1)

using the stochastic gradient descent (SGD) algorithm. Initialized at x_0 ∈ R^d, the SGD algorithm is given by the iterations

    x_{t+1} = x_t − γ_{t+1} ( ∇f(x_t) + ξ_{t+1}(x_t) ),   t = 0, 1, 2, . . .
(1.2)

where {γ_t}_{t∈N+} denotes the step-size sequence, and {ξ_t}_{t∈N+} is a martingale difference sequence adapted to a filtration {F_t}_{t∈N}, characterizing the noise in the gradient (the sequence {x_t}_{t∈N} is also adapted to the same filtration, if we assume x_0 is F_0-measurable). Our focus is on the case where the noise is state dependent, and its variance is infinite, i.e., E[ ‖ξ_t‖² ] = ∞.

∗ Department of Computer Science and Technology at Tsinghua University, [email protected]
† Department of Management Science and Information Systems at Rutgers Business School, [email protected]
‡ Department of Mathematics at Florida State University, [email protected]
§ INRIA - Dépt. d'Informatique de l'École Normale Supérieure - PSL Research University, [email protected]
‖ Department of Computer Science and Department of Statistical Sciences at University of Toronto, and Vector Institute, [email protected]

Many problems in modern statistical learning can be written in the form (1.1), where f(x) typically corresponds to the population risk, that is, f(x) := E_{z∼ν}[ℓ(x, z)] for a given loss function ℓ and an unknown data distribution ν. In practice, one observes independent and identically distributed (i.i.d.) samples z_i ∼ ν for i ∈ [n], and estimates the population gradient ∇f(x) with a noisy gradient at each iteration, which is based on an empirical average over a subset of the samples {z_i}_{i∈[n]}. Due to its simplicity, superior generalization performance, and well-understood theoretical guarantees, SGD has been the method of choice for minimization problems arising in statistical machine learning.

Starting from the pioneering works of Robbins and Monro [1951], Chung [1954], Sacks [1958], Fabian [1968], Ruppert [1988], Shapiro [1989], Polyak and Juditsky [1992], the theoretical properties of the SGD algorithm and its variants have been receiving growing attention under different scenarios. Recent works, for example Tripuraneni et al. [2018], Su and Zhu [2018], Duchi and Ruan [2016], Toulis and Airoldi [2017], Fang et al. [2018], Anastasiou et al. [2019], Yu et al. [2020], establish convergence rates for SGD in various settings, and build on the analysis of Polyak and Juditsky [1992] to prove a central limit theorem (CLT) for the Polyak-Ruppert averaging, which leads to novel methodologies to compute confidence intervals using SGD.
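As a purely illustrative sketch (not from the paper), the recursion (1.2) with a polynomially decaying step size γ_t = c·t^{−ρ} can be simulated as follows; the quadratic objective and the Student-t gradient noise (whose variance is infinite for degrees of freedom ≤ 2) are our own choices for demonstration.

```python
import numpy as np

def sgd(grad, noise, x0, n_steps, rho=0.7, c=0.5):
    """Run the recursion x_{t+1} = x_t - gamma_{t+1} * (grad(x_t) + noise(x_t, t)),
    with step size gamma_t = c * t**(-rho), and return all iterates."""
    x = np.asarray(x0, dtype=float)
    iterates = [x.copy()]
    for t in range(1, n_steps + 1):
        gamma = c * t ** (-rho)
        x = x - gamma * (grad(x) + noise(x, t))
        iterates.append(x.copy())
    return np.array(iterates)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    grad = lambda x: x  # f(x) = ||x||^2 / 2, with minimizer x* = 0
    # Student-t noise with df = 1.5 has infinite variance (heavy tails)
    heavy = lambda x, t: rng.standard_t(df=1.5, size=x.shape)
    xs = sgd(grad, heavy, x0=[1.0, 1.0], n_steps=2000)
    print(np.linalg.norm(xs[-1]))
```

Even under this infinite-variance noise, the iterates hover around the optimum; the noiseless recursion converges outright, which is the sanity check one would run first.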
However, a recurring assumption in this line of work is the finite noise variance, which may be violated frequently in modern frameworks.

Heavy-tailed behavior in statistical methodology may naturally arise from the underlying model, or through the iterative optimization algorithm used during model training. In robust statistics, one often encounters heavy-tailed noise behavior in data, which in conjunction with standard loss functions leads to infinite noise variance in SGD. Very recently, heavy-tailed behavior has been shown to emerge from the multiplicative noise in SGD, when the step-size is large and/or the batch-size is small [Hodgkinson and Mahoney, 2020, Gürbüzbalaban et al., 2020]. On the other hand, there is strong empirical evidence in modern machine learning that the gradient noise often exhibits a heavy-tailed behavior, which indicates an infinite variance. For example, this is observed in fully connected and convolutional neural networks [Şimşekli et al., 2019, Gürbüzbalaban and Hu, 2020] as well as recurrent neural networks [Zhang et al., 2019]. Thus, understanding the behavior of SGD under infinite noise variance becomes extremely important for at least two reasons. A computational complexity reason: modern machine learning and robust statistics frameworks lead to heavy-tailed behavior in SGD; thus, understanding the performance of this algorithm in terms of precise convergence rates, as well as the required conditions on the step-size sequence as a function of the 'heaviness' of the tail, becomes crucial in this setup. A statistical reason: many inference methods that rely on Polyak-Ruppert averaging utilize a CLT that holds under finite noise variance (see e.g. online bootstrap and variance estimation approaches [Fang et al., 2018, Su and Zhu, 2018, Chen et al., 2020]). Using the same methodology in the aforementioned modern framework (under heavy-tailed noise) will ultimately result in incorrect confidence intervals, jeopardizing the statistical procedure.
Thus, establishing the limit distribution in this setting is of great importance.

In this work, we study the behavior of the SGD algorithm with diminishing step-sizes for a class of strongly convex problems when the noise variance is infinite. We establish the convergence rates of the SGD iterates towards the global minimum, and identify a sufficient condition on the Hessian of f, which interpolates between positive semi-definiteness and diagonal dominance with non-negative diagonal entries. We further study the Polyak-Ruppert averaging of the SGD iterates, and show that the limit distribution is a multivariate α-stable distribution. We illustrate our theory on linear regression and generalized linear models, demonstrating how to verify the conditions of our theorems. Perhaps surprisingly, our results show that even under heavy-tailed noise with infinite variance, SGD with diminishing step-sizes can converge to the global optimum without requiring any modification either to the loss function or to the algorithm itself, as opposed to the conventional techniques used in robust statistics [Huber, 2004]. Finally, we argue that our work has potential implications for constructing confidence intervals in the infinite noise variance setting.

Notational Conventions. By N, N_+ and R we denote the set of non-negative integers, positive integers, and real numbers, respectively. For m ∈ N_+, we define [m] = {1, . . . , m}. We use italic letters (e.g. x, ξ) to denote scalars and scalar-valued functions, bold face italic letters (e.g. x, ξ) to denote vectors and vector-valued functions, and bold face upper case letters (e.g. A) to denote matrices. We use |x| and ‖x‖_p to denote the 2-norm and p-norm of a vector x, and ‖A‖ and ‖A‖_p the operator 2-norm and operator p-norm of a matrix A. The transposes of a matrix A and a vector x (viewed as a matrix with one column) are denoted by A^T and x^T.
If {A_i}_{i∈N} is a sequence of matrices and k > ℓ, the empty product ∏_{i=k}^{ℓ} A_i is understood to be the identity matrix I. The asymptotic notations are defined in the usual way: for two sequences of real numbers {a_t}_{t∈N}, {b_t}_{t∈N}, we write a_t = O(b_t) if lim sup_{t→∞} |a_t|/|b_t| < ∞, a_t = o(b_t) if lim sup_{t→∞} |a_t|/|b_t| = 0, a_t = Θ(b_t) if both a_t = O(b_t) and b_t = O(a_t) hold, and a_t ≍ b_t if lim_{t→∞} |a_t|/|b_t| exists and is in (0, ∞). If a_t = O(b_t t^ε) for any ε > 0, we say a_t = Õ(b_t). Sufficiently large or sufficiently small positive constants whose values do not matter are written as C, C_1, C_2, . . ., sometimes without prior introduction. If X_1, X_2, . . . is a sequence of random vectors taking values in R^n and μ is a probability measure on R^n, we write X_t →_D μ as t → ∞ if {X_t}_{t∈N+} converges in distribution (also called 'converges weakly') to μ.

Stochastic Approximation.
In the SGD recursion (1.2), we can replace ∇f with an arbitrary continuous function R : R^n → R^n, and consider the same iterations that stochastically approximate the zero x* of R,

    x_{t+1} = x_t − γ_{t+1} ( R(x_t) + ξ_{t+1}(x_t) ).   (2.1)

This is called the stochastic approximation process [Robbins and Monro, 1951], which is a predecessor of stochastic gradient descent and describes a larger family of iterative algorithms (Kushner and Yin [2003], Chapters 2 and 3). Theoretical investigation into the recursion (2.1) has been active ever since its invention, especially under the finite noise variance assumption: Robbins and Monro [1951] prove the recursion (2.1) can lead to the L_2 convergence lim_{t→∞} E[ |x_t − x*|² ] = 0; Chung [1954] further calculates an exact convergence rate (see (3.6)); Blum [1954] presents an elegant proof that the convergence of x_t to x* can hold almost surely. The asymptotic distribution of (2.1) is also the discovery of Chung [1954], whose Theorem 6 states that γ_t^{−1/2}(x_t − x*) converges weakly to a normal distribution; Polyak and Juditsky [1992] and Ruppert [1988] independently introduce the concept of 'averaging the iterates',

    x̄_t = (x_0 + . . . + x_{t−1}) / t,

showing the striking result that √t (x̄_t − x*) converges weakly to a fixed normal distribution regardless of the choice of the step-size {γ_t}_{t∈N+}.
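As a small illustration (our own, not from the paper), the Polyak-Ruppert average x̄_t = (x_0 + · · · + x_{t−1})/t can be computed online alongside any iterate sequence:

```python
import numpy as np

def polyak_ruppert(iterates):
    """Running Polyak-Ruppert averages: the t-th entry is (x_0 + ... + x_{t-1}) / t."""
    iterates = np.asarray(iterates, dtype=float)
    # divide each prefix sum by the number of iterates it contains
    t = np.arange(1, len(iterates) + 1).reshape(-1, *([1] * (iterates.ndim - 1)))
    return np.cumsum(iterates, axis=0) / t

if __name__ == "__main__":
    xs = np.array([[4.0], [2.0], [0.0]])
    print(polyak_ruppert(xs))  # prefix averages: 4, 3, 2
```

A single pass with a running sum suffices in practice; the vectorized form above is just the compact way to state it.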
Recently, optimization algorithms that can handle heavy-tailed ξ have been proposed [Davis et al., 2019, Nazin et al., 2019, Gorbunov et al., 2020]; however, they still rely on a uniformly bounded variance assumption, hence do not cover our setting.

Compared with the copious collection of theoretical studies on stochastic approximation with finite variance mentioned above, papers on infinite variance stochastic approximation are extremely scarce, and we shall summarize the only four papers known to us. Krasulina [1969] is the first to consider such problems, proving almost sure and L_p convergence for the one-dimensional stochastic approximation process without finite variance. The weak convergence of the iterates (without averaging) t^{1/α}(x_t − x*) is also considered by Krasulina [1969], but only for the fastest-decaying step-size γ_t = 1/t. Goodsell and Hanson [1976] discuss how x_t → x* in probability can imply x_t → x* almost surely when no finite variance is assumed, and Li [1994] provides a necessary and sufficient condition for almost sure convergence of x_t → x*, stating that a faster-decaying step-size γ_t = o(t^{−1/p}) is required when moments of lower orders E[|ξ_t|^p] are not in place. Anantharam and Borkar [2012] show that although a step-size that decays slower than t^{−1/p} cannot yield almost sure convergence, L_p convergence can still hold under what they call the 'stability assumption', but their analysis technique provides no convergence rate. Recently, Şimşekli et al. [2019] and Zhang et al. [2019] considered SGD with heavy-tailed ξ having uniformly bounded p-th order moments. Besides not being able to handle state-dependent noise due to this uniform moment condition, Şimşekli et al. [2019] imposed further conditions on R = ∇f such as global Hölder continuity for a non-convex f, whereas Zhang et al. [2019] modified SGD with 'gradient clipping' in order to compensate for the effects of the heavy-tailed noise.

Finally, we shall mention that a class of stochastic recursions similar to (2.1) has been considered in dynamical systems theory [Mirek, 2011, Buraczewski et al., 2012, 2016], for which generalized central limit theorems with α-stable limits have been proven. However, such techniques typically require R to be (asymptotically) linear and the step-sizes to be constant, as they heavily rely on the theory of time-homogeneous Markov processes. Hence, their approach does not readily generalize to the setting of our interest, i.e., non-linear R and diminishing step-sizes, where the latter is crucial for ensuring convergence towards the global optimum.

Stable Distributions.
In probability theory, a random variable X is stable if its distribution is non-degenerate and satisfies the following property: let X_1 and X_2 be independent copies of X; then, for any constants a, b > 0, the random variable aX_1 + bX_2 has the same distribution as cX + d for some constants c > 0 and d ∈ R (see e.g. [Samorodnitsky and Taqqu, 1994]). The stable distribution is also referred to as the α-stable distribution, first proposed by Lévy [1937], where α ∈ (0, 2] denotes the stability parameter. The case α = 2 corresponds to the normal distribution, and the variance of a stable distribution is undefined for any α < 2. The multivariate α-stable distribution dates back to Feldheim [1937]; it is a multivariate generalization of the univariate α-stable distribution, and is also uniquely characterized by its characteristic function. In particular, an R^d-valued random vector X has a multivariate α-stable distribution, denoted as X ∼ S(α, Λ, δ), if the joint characteristic function of X is given by

    E[ exp( i u^T X ) ] = exp{ − ∫_{s∈S} ( |u^T s|^α + i ν(u^T s, α) ) Λ(ds) + i u^T δ },

for any u ∈ R^d and 0 < α ≤ 2. Here, α is the tail-index, Λ is a finite measure on S known as the spectral measure, δ ∈ R^d is a shift vector, ν(y, α) := −sgn(y) tan(πα/2) |y|^α for α ≠ 1 and ν(y, α) := (2/π) y log |y| for α = 1, for any y ∈ R, and S denotes the unit sphere in R^d; i.e. S = {s ∈ R^d : ‖s‖ = 1}. Stable distributions also appear as the limit in the Generalized Central Limit Theorem (GCLT) [Gnedenko and Kolmogorov, 1954], which states that for a sequence of i.i.d. random variables whose distribution has a power-law tail with index 0 < α < 2, the normalized sum converges to an α-stable distribution as the number of summands grows.

Domains of Normal Attraction of Stable Distributions.
Let X_1, X_2, . . . , X_n be an i.i.d. sequence of random vectors in R^d with a common distribution function F(x). If there exist some constant a > 0 and vectors b_n ∈ R^d such that

    (X_1 + · · · + X_n) / (a n^{1/α}) − b_n →_D μ as n → ∞,   (2.2)

then F(x) is said to belong to the domain of normal attraction of the law μ, and α is the characteristic exponent of the law μ [Gnedenko and Kolmogorov, 1954, page 181]. If μ is an α-stable distribution, then we say F(x) belongs to the domain of normal attraction of an α-stable distribution. For example, the Pareto distribution belongs to the domain of normal attraction of an α-stable law. In Appendix C, we provide more details as well as a sufficient and necessary condition for being in the domain of normal attraction of an α-stable law.

In this section, we identify sufficient conditions for the convergence of SGD under heavy-tailed gradient noise, and derive explicit rate estimates. In the standard setting when the noise variance is finite, some notion of a positive definite Hessian assumption is frequently utilized to achieve convergence (see for example Polyak and Juditsky [1992], Tripuraneni et al. [2018], Su and Zhu [2018], Duchi and Ruan [2016], Toulis and Airoldi [2017], Fang et al. [2018], Anastasiou et al. [2019]). When the noise variance is infinite, but the noise has a finite p-th moment for some p ∈ [1, 2), an analogous role is played by the notion of p-positive definiteness introduced below, which recovers the standard notion as p → 2 and diagonal dominance with non-negative diagonal entries at p = 1.

p-Positive Definiteness. First, we introduce a signed power of vectors which will be used when defining a family of matrices.

Figure 1: Geometry of p-PSD matrices. The D_+ cone refers to the cone of diagonally dominant matrices with non-negative diagonal entries. Their inclusion relationship is given in Propositions 13 and 14.

For v = (v_1, . . . , v_n)^T ∈ R^n and q ≥ 0, we let

    v^⟨q⟩ = ( sgn(v_1)|v_1|^q, . . . , sgn(v_n)|v_n|^q )^T.   (3.1)

Denoting the n-dimensional ℓ_p unit sphere by S_p = {v ∈ R^n : ‖v‖_p = 1}, and the set of n × n symmetric matrices by S, we now define the following subset of S.

Definition 1 (p-positive definiteness). Let p ≥ 1 and Q be a symmetric matrix. We say that Q is p-positive definite if for all v ∈ S_p, v^T Q v^⟨p−1⟩ > 0. Similarly, we call Q p-positive semi-definite if for all v ∈ S_p, v^T Q v^⟨p−1⟩ ≥ 0.

It is not hard to see that the set of p-positive semi-definite matrices (p-PSD) defines a closed pointed cone, which we denote by S^p_+, whose interior is the set of p-positive definite matrices (p-PD), denoted by S^p_++. We are mainly interested in the case 1 ≤ p < 2. Note that S²_+ coincides with the standard PSD cone, and we show in Section A.2 that S¹_+ is exactly the cone of diagonally dominant matrices with non-negative diagonal entries, denoted by D_+. For any p ∈ [1, 2], we have D_+ = S¹_+ ⊆ S^p_+ ⊆ S²_+. Every norm ‖·‖_p induces the same topology on the set of n-dimensional matrices, which is just the usual topology on R^{n×n}. Further, the set of symmetric matrices S, as the set of zeros of the continuous function X ↦ X − X^T, is a closed set. Hence for a set M ⊆ S, denoting its topological closure by M̄, we also have M̄ ⊆ S. We are interested in the case where M̄ is bounded.

Definition 2 (uniform p-PD). Let p ≥ 1 and M ⊂ S be a non-empty set of symmetric matrices. We say that M is uniformly p-PD if for all Q ∈ M̄, we have Q ∈ S^p_++.

Notice that M is uniformly 2-PD if and only if the eigenvalues of the symmetric matrices in the set M̄ are all lower bounded by a positive real number. Notice also that a finite subset of symmetric matrices is uniformly p-PD if and only if each element of the set is p-PD. The p-PSD cone emerges naturally when analyzing the SGD algorithm in the heavy-tailed setting, interpolating between the standard PSD cone and the cone of diagonally dominant matrices with non-negative diagonal entries. To the best of our knowledge, we are the first to study such families of matrices and their application in stochastic optimization. For further details about these cones, we refer the interested reader to Appendix A.2.

We make the following uniform smoothness and curvature assumptions on the Hessian of the objective function.

Assumption 1.
The set of matrices {∇²f(x) : x ∈ R^n} is bounded and uniformly p-PD.

Convergence in L_p. We fix a probability space (Ω, F, P) with filtration {F_t}_{t∈N}, and make the following assumption on the gradient noise sequence.

Assumption 2.
Let x_0 be F_0-measurable. The gradient noise sequence {ξ_t}_{t∈N+} is given as

    ξ_{t+1}(x_t) = m_{t+1}(x_t) + ζ_{t+1},   (3.2)

where {ζ_t}_{t∈N+} is an i.i.d. sequence with E[ζ_t] = 0 and E[|ζ_t|^p] < ∞ for some p, and {m_t}_{t∈N+} is a martingale difference sequence, and both sequences are adapted to the filtration {F_t}_{t∈N}. Further, the state-dependent component of the noise satisfies, for some K > 0,

    E[ |m_{t+1}(x_t)|² | F_t ] ≤ K ( 1 + |x_t|² ).   (3.3)

We note that the above assumptions also imply that both the gradient noise sequence {ξ_t}_{t∈N+} as well as the SGD iterates {x_t}_{t∈N} are adapted to the same filtration {F_t}_{t∈N}. We call m_t the state-dependent component of the gradient noise, which naturally has a state-dependent conditional second moment. The variance of this component of the noise can be arbitrarily large depending on the state; yet, for a given state x_t, it is guaranteed to be finite. The heavy-tailed noise behavior is due to ζ_t, which may have an infinite variance for p < 2. We now establish the convergence, in L_p, of the SGD algorithm to the unique minimizer x* of the objective function f with uniformly p-PD Hessian, when the noise sequence {ξ_t}_{t∈N+} has potentially an infinite variance.

Theorem 3. Suppose Assumptions 1 and 2 hold for some 1 < p ≤ 2. For a step-size satisfying γ_t ≍ t^{−ρ} with ρ ∈ (0, 1), the error of the SGD iterates {x_t}_{t∈N} from the minimizer x* satisfies

    E[ |x_t − x*|^p ] = O( t^{−ρ(p−1)} ).   (3.4)

Consequently, we have sup_{t∈N+} E[|ξ_t|^p] < ∞.

The proof of Theorem 3 is provided in Appendix B. We observe that the convergence rate of SGD depends on the highest finite moment p of the noise sequence, and faster rates are achieved for larger values of p. The fastest rate implied by our result is near O(t^{−(p−1)}), which is achieved for ρ ≈ 1; yet, SGD converges even for very slowly decaying step-size sequences with ρ closer to 0. If the noise has further integrability properties with a finite p-th moment for all p ∈ [q, α) for some q < α, and if the uniform p-PD assumption holds, then faster rates are achievable. In particular, the following result is a consequence of Theorem 3, and its proof is provided in Appendix B.

Corollary 4.
For constants q, α satisfying 1 < q < α ≤ 2, suppose that Assumptions 1 and 2 hold for every p ∈ [q, α). For a step-size satisfying γ_t ≍ t^{−ρ} with ρ ∈ (0, 1), the error of the SGD iterates {x_t}_{t∈N} from the minimizer x* satisfies

    E[ |x_t − x*|^q ] = Õ( t^{−ρq(α−1)/α} ).   (3.5)

Remark.
The additional integrability assumption yields faster rates for any feasible step-size sequence, since p(α−1)/α ≥ p − 1 for p ∈ (1, 2]. It is also worth comparing our rates to the classical finite-variance setting: Chung [1954] states that

    E[ |x_t − x*|^r ] = Θ( t^{−ρr/2} ),   (3.6)

where r ≥ 2 is such that the r-th moment exists for the stochastic approximation process, and this is achieved for strongly convex objective functions in one dimension (whose second derivative {f″(x) : x ∈ R} satisfies the uniformly 2-PD property) with a step-size choice γ_t ≍ t^{−ρ} for some ρ ∈ (1/2, 1]. Our Theorem 3 recovers this rate for r = 2, and extends it further to the case 1 ≤ r < 2.

In this section, we establish the limit distribution of the Polyak-Ruppert averaging under infinite noise variance, extending the asymptotic normality result given by Polyak and Juditsky [1992] to α-stable distributions. Let us fix an α ∈ (1, 2] and assume the following throughout this subsection.
Assumption 3.
Let x_0 be F_0-measurable. The gradient noise sequence {ξ_t}_{t∈N+} is given as

    ξ_{t+1}(x_t) = m_{t+1}(x_t) + ζ_{t+1},   (4.1)

where {ζ_t}_{t∈N+} is an i.i.d. sequence with E[ζ_t] = 0, and it is in the domain of normal attraction of an n-dimensional symmetric α-stable distribution μ, i.e.,

    (ζ_1 + . . . + ζ_t) / t^{1/α} →_D μ as t → ∞.

(Chung's result above, like many other similar studies in the 1950s, concerns only the one-dimensional case; but such results generalize easily to higher dimensions.)

The state-dependent component {m_t}_{t∈N+} is a martingale difference sequence with a second moment satisfying (3.3), and both sequences are adapted to the filtration {F_t}_{t∈N}.

The above assumption also implies that E[|ζ_t|^p] < ∞ for every p ∈ [1, α), i.e., the moment condition on the i.i.d. heavy-tailed component of the noise in Assumption 2 holds for every p ∈ [1, α). Denoting the Polyak-Ruppert averaging by x̄_t := (x_0 + . . . + x_{t−1})/t, we are interested in the asymptotic behavior of

    t^{1−1/α} (x̄_t − x*) = ( (x_0 + . . . + x_{t−1}) − t x* ) / t^{1/α},

for α ∈ (1, 2]. When α = 2, it is known that this limit converges to a multivariate normal distribution (which is a 2-stable distribution), a result proven in the seminal work by Polyak and Juditsky [1992]. Similarly, we begin with a result that considers a quadratic objective where the function ∇f(x) is linear in x, and then, building on this result, we establish the limit distribution of Polyak-Ruppert averaging also in the more general non-linear case.

Theorem 5 (linear case). Suppose the function ∇f(x) is affine, i.e. ∇f(x) = A x − b for a real matrix A ∈ R^{n×n} and a real vector b ∈ R^n, and there exist scalars p, ρ satisfying

    max( (α + αρ − 1)/(αρ), 1/ρ ) ≤ p ≤ α,

such that A is p-PD and ρ ∈ (0, 1). If the noise sequence satisfies Assumption 3, then for the step-size satisfying γ_t ≍ t^{−ρ}, the normalized average t^{1−1/α}(x̄_t − x*) converges weakly to an n-dimensional α-stable distribution.
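A toy numerical sketch of the linear setting of Theorem 5 (our own illustration; the symmetrized-Pareto noise, dimensions, and constants are arbitrary choices): run the affine recursion with γ_t = c·t^{−ρ}, form the Polyak-Ruppert average, and scale it by t^{1−1/α}.

```python
import numpy as np

def normalized_average(A, b, zeta, x0, n_steps, alpha, rho, c=0.5):
    """Run x_t = x_{t-1} - gamma_t (A x_{t-1} - b + zeta(t)), gamma_t = c * t**(-rho),
    and return t^{1-1/alpha} * (xbar_t - x*), with xbar_t = (x_0 + ... + x_{t-1}) / t."""
    x_star = np.linalg.solve(A, b)       # minimizer of the quadratic objective
    x = np.asarray(x0, dtype=float)
    running_sum = np.zeros_like(x)
    for t in range(1, n_steps + 1):
        running_sum += x                 # accumulates x_0 + ... + x_{t-1}
        gamma = c * t ** (-rho)
        x = x - gamma * (A @ x - b + zeta(t))
    xbar = running_sum / n_steps
    return n_steps ** (1.0 - 1.0 / alpha) * (xbar - x_star)

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    alpha = 1.5
    # symmetrized Pareto noise with tail index 1.5: mean zero, infinite variance,
    # in the domain of normal attraction of an alpha-stable law
    zeta = lambda t: rng.choice([-1.0, 1.0], size=2) * rng.pareto(alpha, size=2)
    A = np.array([[2.0, 0.0], [0.0, 1.0]])
    b = np.array([1.0, -1.0])
    print(normalized_average(A, b, zeta, [0.0, 0.0], 5000, alpha, rho=0.8))
```

Repeating the run over many seeds and histogramming the output would exhibit the heavy-tailed (non-Gaussian) limit the theorem describes; a single draw only shows one sample of it.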
We observe from the above theorem that the α-stable limit is achieved for Polyak-Ruppert averaging for any step-size sequence with index ρ ∈ (1/α, 1). Moreover, the α-stable limit of the averaged iterates does not depend on the index ρ. Non-asymptotic rates would be required to see the effect of the step-size more clearly. The next result generalizes Theorem 5 to the setting where ∇f(x) is non-linear.

Theorem 6 (non-linear case). Let 1 < 1/ρ < q < α and suppose Assumption 1 holds for every p ∈ [q, α). Assume further that the gradient ∇f(x) can be approximated using the Hessian matrix ∇²f(x*) around the minimizer x* as

    | ∇f(x) − ∇²f(x*)(x − x*) | ≤ K |x − x*|^q.   (4.2)

If the noise sequence satisfies Assumption 3, then for the step-size satisfying γ_t ≍ t^{−ρ}, the normalized average t^{1−1/α}(x̄_t − x*) converges weakly to an n-dimensional α-stable distribution.

The additional assumption (4.2) is standard (see e.g. Polyak and Juditsky [1992, Assumption 3.2]), and simply imposes a linearity condition on the gradient of f with an order-q polynomial error term. We notice that the size of the interval of feasible indices ρ ∈ (1/α, 1) is smaller this time compared to the light-tailed case, where Polyak and Juditsky [1992, Theorem 2] allows ρ ∈ (1/2, 1). Perhaps the most important distinction from the finite-variance setting is that the limit is an α-stable distribution rather than the Gaussian limit of a standard CLT. This result has potential implications for statistical inference in the presence of heavy-tailed data. Inference procedures that take into account the computational part of the training procedure (instead of drawing conclusions for the minimizer of the empirical risk) typically rely on variations of Polyak-Ruppert averaging and the CLT they admit [Fang et al., 2018, Su and Zhu, 2018, Chen et al., 2020]. The above theorems imply that this CLT does not hold under heavy-tailed gradient noise. Therefore, many of these procedures require further adaptation if the gradient has undefined variance. Finally, it is well-known that Polyak-Ruppert averaging achieves the Cramér-Rao lower bound [Polyak and Juditsky, 1992, Gadat and Panloup, 2017], which is a lower bound on the variance of an unbiased estimator. However, it is not clear what this type of optimality means when the variance is not defined. These are important directions that require thorough investigation, and they will be studied elsewhere.

In this section, we demonstrate how the stochastic approximation framework discussed in our paper covers several interesting examples, most notably linear regression and generalized linear models (GLMs), such that the heavy-tailed behavior naturally arises and the assumptions we proposed for Theorems 3, 5, and 6 are all met.
Linear Regression. Let us first consider the following linear model,

    y = z^T β + ε,

where β ∈ R^n is the vector of true coefficients, y ∈ R is the response, the random vector z ∈ R^n denotes the covariates with a positive-definite second moment 0 ≺ E[zz^T] < ∞, and ε is a noise with zero conditional mean E[ε | z] = 0. In the classical setting, the noise ε is assumed to be Gaussian, whose variance is well defined. In this case, the population version of the maximum likelihood estimation (MLE) problem corresponds to minimizing f(x) = E[(y − z^T x)²]/2 (where the expectation is over the (y, z) pair), or equivalently solving the following normal equations

    ∇f(x) := E[zz^T] x − E[zy] = 0.

It can be easily verified that the vector of true coefficients β is the unique zero of the above equation, i.e. we have x* = β.

Now, suppose we are given access to a stream of i.i.d. drawn instances of the pair (y, z), denoted by {y_t, z_t}_{t∈N+}. In large-scale settings, one generally runs the following stochastic approximation process, which is simply online SGD on the population MLE objective f(x):

    x_t = x_{t−1} − γ_t ( z_t z_t^T x_{t−1} − z_t y_t ).   (5.1)

Manifestly, (5.1) is a special case of (2.1), with the gradient noise admitting the decomposition ξ_t = ζ_t + m_t, for an i.i.d. component ζ_t and a state-dependent component m_t (see (4.1)),

    ζ_t = E[zy] − z_t y_t,
    m_t = ( z_t z_t^T − E[zz^T] ) x_{t−1}.

In the presence of heavy-tailed noise, i.e., when ε has possibly infinite variance, the population MLE objective f(x) may not be finite, and one would typically resort to methods from M-estimation and choose an appropriate loss function within the robust statistics framework [Huber, 2004, Van der Vaart, 2000]. However, the SGD iterations (5.1) may still be employed to estimate the true coefficients β (potentially under model misspecification), as we demonstrate below.

First, notice that the noise sequence can be decomposed into two parts, and the i.i.d. component {ζ_t}_{t∈N} exhibits the heavy-tailed behavior. Assume that this component has the highest defined moment order 1 ≤ p < 2, i.e., E[|ζ_t|^p] < ∞. Further, the state-dependent component m_t defines a martingale difference sequence, and the condition (3.3) is met since the covariates z have a finite fourth moment, i.e.,

    E[ |m_t|² | x_{t−1} ] ≤ C |x_{t−1}|².

Hence, Assumption 2 is satisfied. Next, assuming that the second moment of the covariates, ∇²f(x) = E[zz^T], is p-PD, one can guarantee that Assumption 1 is satisfied. Therefore, our convergence results can be invoked. We emphasize that this assumption is always satisfied if E[zz^T] is diagonally dominant, but the condition is milder for p > 1.

Generalized Linear Models. In this section, we consider the problem of estimating the coefficients in generalized linear models (GLMs) in the presence of heavy-tailed noise. GLMs play a crucial role in numerous statistics problems, and provide a versatile framework for many regression and classification tasks, with many applications [McCullagh and Nelder, 1989, Nelder and Wedderburn, 1972].

For a response y ∈ R and random covariates z ∈ R^n, the population version of an ℓ_2-regularized MLE problem in the canonical GLM framework reads

    minimize_x f(x) := E[ ψ(x^T z) − y x^T z ] + (λ/2)|x|², for λ > 0.   (5.2)

Here, ψ : R → R is referred to as the cumulant generating function (CGF) and is assumed to be convex. Notable examples include ψ(x) = x²/2 yielding linear regression, ψ(x) = log(1 + e^x) yielding logistic regression, and ψ(x) = e^x yielding Poisson regression. The gradient of the above objective (5.2) is given by

    ∇f(x) = E[ z ψ′(z^T x) ] − E[zy] + λ x.   (5.3)

We define the unique solution of the population GLM problem as the unique zero of (5.3), which we denote by x*. Note that we do not assume a model on the data, allowing for model misspecification similar to Erdogdu et al. [2016, 2019].
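The online recursion (5.1) for linear regression can be sketched as follows (illustrative only; the data stream, heavy-tailed response noise, and step-size constants are our own choices):

```python
import numpy as np

def linreg_sgd(stream, d, rho=0.6, c=0.1):
    """Online SGD (5.1): x_t = x_{t-1} - gamma_t (z_t z_t^T x_{t-1} - z_t y_t),
    with step size gamma_t = c * t**(-rho), over a stream of (y_t, z_t) pairs."""
    x = np.zeros(d)
    for t, (y_t, z_t) in enumerate(stream, start=1):
        gamma = c * t ** (-rho)
        # z_t z_t^T x_{t-1} computed as z_t * (z_t . x) to avoid forming the outer product
        x = x - gamma * (z_t * (z_t @ x) - z_t * y_t)
    return x

if __name__ == "__main__":
    rng = np.random.default_rng(2)
    beta = np.array([1.0, -2.0, 0.5])
    def stream(n):
        for _ in range(n):
            z = rng.normal(size=3)
            # symmetrized Pareto noise, tail index 1.5: zero mean, infinite variance
            eps = rng.choice([-1.0, 1.0]) * rng.pareto(1.5)
            yield z @ beta + eps, z
    print(linreg_sgd(stream(20000), d=3))
```

Despite the infinite-variance response noise, the estimate tracks β, in line with Theorem 3 applied to this model.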
As in the previous section, we assume that the covariates have finite fourth moment and that the response is contaminated with heavy-tailed noise with infinite variance. In this setting, the objective function is always well-defined, even if the response has infinite variance.

We are given access to a stream of i.i.d. instances of the pair $(y, z)$, denoted by $\{y_t, z_t\}_{t\in\mathbb{N}_+}$, and we solve the above non-linear problem using the following stochastic process,
\[
x_t = x_{t-1} - \gamma_t \Big( z_t\, \psi'\big(z_t^T x_{t-1}\big) - z_t y_t + \lambda x_{t-1} \Big),
\]
with gradient noise admitting the decomposition $\xi_t = \zeta_t + m_t$, where
\[
\zeta_t = \mathbb{E}[z y] - z_t y_t, \qquad
m_t = z_t\, \psi'\big(z_t^T x_{t-1}\big) - \mathbb{E}\Big[z_t\, \psi'\big(z_t^T x_{t-1}\big) \,\Big|\, x_{t-1}\Big].
\]
In what follows, we verify our assumptions for a CGF satisfying $|\psi'(x)| \leq C(1 + |x|)$ and $\psi''(x) \geq 0$ for all $x \in \mathbb{R}$. These assumptions are easily verified for any convex CGF that grows at most linearly (e.g. $\psi(x) = \log(1 + e^x)$). The $\zeta_t$ are i.i.d. and contain the entire heavy-tailed part of the gradient noise. Assume that this component has highest defined moment order $1 \leq p < 2$, i.e., $\mathbb{E}[|\zeta_t|^p] < \infty$. Further observe that the state-dependent component defines a martingale difference sequence and satisfies the condition (3.3), since the covariates $z$ have finite fourth moment and $|\psi'|$ grows at most linearly. Therefore, Assumption 2 is satisfied.

We note that the Hessian of the objective $f$ is given by
\[
\nabla^2 f(x) = \mathbb{E}\Big[z z^T \psi''\big(z^T x\big)\Big] + \lambda I.
\]
Since $\psi''(x) \geq 0$, the Hessian $\nabla^2 f(x)$ is clearly PD for all $\lambda > 0$. For sufficiently large $\lambda$, this matrix can also be made diagonally dominant, which implies that it is $p$-PD for any $p \geq 1$, further implying Assumption 1. Therefore, for an appropriate step-size sequence, our convergence results on SGD can be applied to this framework.
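To make the recursion above concrete, the following sketch (our own illustration, not from the paper; all names, constants, and modelling choices are hypothetical) runs the GLM SGD with the logistic-link CGF $\psi(x) = \log(1+e^x)$, step size $\gamma_t \propto t^{-\rho}$, and a response contaminated by symmetric Pareto noise with tail index $\alpha = 1.5$, so the noise has a finite $p$-th moment only for $p < 1.5$ and infinite variance:

```python
import numpy as np

def run_glm_sgd(d=5, T=20_000, lam=0.1, rho=0.8, alpha=1.5, seed=0):
    """Sketch of the GLM SGD recursion under heavy-tailed response noise.

    All modelling choices here (normal covariates, logistic link, symmetric
    Pareto noise, step-size constants) are our own illustrative assumptions.
    """
    rng = np.random.default_rng(seed)
    x_star = rng.normal(size=d)                    # "true" coefficients

    def psi_prime(u):                              # psi'(u) for psi = log(1+e^u)
        return 1.0 / (1.0 + np.exp(-np.clip(u, -500, 500)))

    x = 5.0 * np.ones(d)                           # far-away initialization
    x_bar = np.zeros(d)                            # Polyak-Ruppert average
    err0 = np.linalg.norm(x - x_star)
    for t in range(1, T + 1):
        z = rng.normal(size=d)
        # response = mean model + symmetric Pareto(alpha) noise: the noise
        # has finite p-th moment only for p < alpha, so variance is infinite
        noise = (1.0 - rng.uniform()) ** (-1.0 / alpha) * rng.choice([-1.0, 1.0])
        y = psi_prime(z @ x_star) + noise
        gamma = 0.5 * t ** (-rho)                  # step size gamma_t ~ t^(-rho)
        x = x - gamma * (z * psi_prime(z @ x) - z * y + lam * x)
        x_bar += (x - x_bar) / t                   # running average of iterates
    return err0, np.linalg.norm(x_bar - x_star)
```

Despite the infinite-variance noise, the averaged iterate ends up much closer to $x^*$ than the starting point; the residual bias is due to the $\lambda$-regularization, under which the population optimum of (5.2) is a shrunk version of the data-generating coefficients.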
6 Conclusion

In this paper, we considered SGD subject to state-dependent and heavy-tailed noise with a potentially infinite variance, when the objective belongs to a class of strongly convex functions. We provided a convergence rate for the distance to the optimizer in $L^p$ under appropriate assumptions. Furthermore, we provided a generalized central limit theorem showing that the averaged iterates converge to a multivariate $\alpha$-stable distribution. We also discussed the implications of our results for applications such as linear regression and generalized linear models subject to heavy-tailed input data. Finally, while we leave it for a future study, we emphasize the importance of adapting existing statistical inference techniques that rely on the averaged SGD iterates to the presence of heavy-tailed gradient noise, which arises naturally in modern statistical learning applications.

Acknowledgements
MAE is partially funded by the CIFAR AI Chairs program and a CIFAR AI Catalyst grant. MG's research is supported in part by the grants NSF DMS-1723085 and NSF CCF-1814888. LZ is grateful for the support from a Simons Foundation Collaboration Grant.

References
V. Anantharam and V. S. Borkar. Stochastic approximation with long range dependent and heavy tailed noise. Queueing Systems, 71(1-2):221–242, 2012.
A. Anastasiou, K. Balasubramanian, and M. A. Erdogdu. Normal approximation for stochastic gradient descent via non-asymptotic rates of martingale CLT. In Conference on Learning Theory, pages 115–137, 2019.
J. R. Blum. Approximation methods which converge with probability one. The Annals of Mathematical Statistics, 25(2):382–386, 1954.
D. Buraczewski, E. Damek, and M. Mirek. Asymptotics of stationary solutions of multivariate stochastic recursions with heavy tailed inputs and related limit theorems. Stochastic Processes and their Applications, 122(1):42–67, 2012.
D. Buraczewski, E. Damek, and T. Mikosch. Stochastic Models with Power-Law Tails. Springer, 2016.
X. Chen, J. D. Lee, X. T. Tong, and Y. Zhang. Statistical inference for model parameters in stochastic gradient descent. The Annals of Statistics, 48(1):251–273, 2020.
Y. Cherapanamjeri, N. Tripuraneni, P. L. Bartlett, and M. I. Jordan. Optimal mean estimation without a variance. arXiv preprint arXiv:2011.12433, 2020.
K. L. Chung. On a stochastic approximation method. The Annals of Mathematical Statistics, 25(3):463–483, 1954.
D. Davis, D. Drusvyatskiy, L. Xiao, and J. Zhang. From low probability to high confidence in stochastic convex optimization. arXiv preprint arXiv:1907.13307, 2019.
J. Duchi and F. Ruan. Asymptotic optimality in stochastic optimization. arXiv preprint arXiv:1612.05612, 2016.
M. A. Erdogdu, M. Bayati, and L. H. Dicker. Scaled least squares estimator for GLMs in large-scale problems. In Proceedings of the 30th International Conference on Neural Information Processing Systems, pages 3332–3340, 2016.
M. A. Erdogdu, M. Bayati, and L. H. Dicker. Scalable approximations for generalized linear problems. The Journal of Machine Learning Research, 20(1):231–275, 2019.
V. Fabian. Stochastic approximation of minima with improved asymptotic speed. The Annals of Mathematical Statistics, 38(1):191–200, 1967.
V. Fabian. On asymptotic normality in stochastic approximation. The Annals of Mathematical Statistics, 39(4):1327–1332, 1968.
Y. Fang, J. Xu, and L. Yang. Online bootstrap confidence intervals for the stochastic gradient descent estimator. The Journal of Machine Learning Research, 19(1):3053–3073, 2018.
N. Farsad, W. Guo, C. B. Chae, and A. Eckford. Stable distributions as noise models for molecular communication. 2015.
E. Feldheim. Étude de la stabilité des lois de probabilité. PhD thesis, Faculté des Sciences de Paris, Paris, 1937.
W. Feller. An Introduction to Probability Theory and Its Applications. Wiley, New York, 2nd edition, 1971.
A. Fiche, J. C. Cexus, A. Martin, and A. Khenchaf. Features modeling with an α-stable distribution: Application to pattern recognition based on continuous belief functions. Information Fusion, 14(4):504–520, 2013.
S. Gadat and F. Panloup. Optimal non-asymptotic bound of the Ruppert-Polyak averaging without strong convexity. arXiv preprint arXiv:1709.03342, 2017.
J. L. Geluk and L. de Haan. Stable probability distributions and their domains of attraction: A direct approach. Probability and Mathematical Statistics, 20:169–188, 2000.
B. V. Gnedenko and A. Kolmogorov. Limit Distributions for Sums of Independent Random Variables. Addison-Wesley, Cambridge, MA, 1954. Translated by Kai Lai Chung.
C. Goodsell and D. Hanson. Almost sure convergence for the Robbins-Monro process. The Annals of Probability, 4(6):890–901, 1976.
E. Gorbunov, M. Danilova, and A. Gasnikov. Stochastic optimization with heavy-tailed noise via accelerated gradient clipping. In Advances in Neural Information Processing Systems, volume 33, 2020.
M. Gürbüzbalaban and Y. Hu. Fractional moment-preserving initialization schemes for training fully-connected neural networks. arXiv preprint arXiv:2005.11878, 2020.
M. Gürbüzbalaban, U. Şimşekli, and L. Zhu. The heavy-tail phenomenon in SGD. arXiv preprint arXiv:2006.04740, 2020.
L. Hodgkinson and M. W. Mahoney. Multiplicative noise and heavy tails in stochastic optimization. arXiv preprint arXiv:2006.06293, 2020.
P. J. Huber. Robust Statistics, volume 523. John Wiley & Sons, 2004.
T. P. Krasulina. On stochastic approximation processes with infinite variance. Theory of Probability & Its Applications, 14(3):522–526, 1969.
H. Kushner and G. G. Yin. Stochastic Approximation and Recursive Algorithms and Applications, volume 35. Springer Science & Business Media, 2003.
P. Lévy. Théorie de l'addition des variables aléatoires. Gauthiers-Villars, Paris, 1937.
G. Li. Almost sure convergence of stochastic approximation procedures. Statistica Sinica, 4(1):361–372, 1994.
P. McCullagh and J. A. Nelder. Generalized Linear Models. Chapman and Hall, 2nd edition, 1989.
M. Mirek. Heavy tail phenomenon and convergence to stable laws for iterated Lipschitz maps. Probability Theory and Related Fields, 151(3):705–734, 2011.
A. V. Nazin, A. S. Nemirovsky, A. B. Tsybakov, and A. B. Juditsky. Algorithms of robust stochastic optimization based on mirror descent method. Automation and Remote Control, 80(9):1607–1627, 2019.
J. A. Nelder and R. W. Wedderburn. Generalized linear models. Journal of the Royal Statistical Society: Series A (General), 135(3):370–384, 1972.
J. Neveu. Discrete-Parameter Martingales, volume 10. North-Holland, Amsterdam, 1975.
B. T. Polyak and A. B. Juditsky. Acceleration of stochastic approximation by averaging. SIAM Journal on Control and Optimization, 30(4):838–855, 1992.
H. Robbins and S. Monro. A stochastic approximation method. The Annals of Mathematical Statistics, 22(3):400–407, 1951.
D. Ruppert. Efficient estimations from a slowly convergent Robbins-Monro process. Technical report, Cornell University Operations Research and Industrial Engineering, 1988.
J. Sacks. Asymptotic distribution of stochastic approximation procedures. The Annals of Mathematical Statistics, 29(2):373–405, 1958.
G. Samorodnitsky and M. S. Taqqu. Stable Non-Gaussian Random Processes: Stochastic Models with Infinite Variance. Chapman & Hall, New York, 1994.
K. Sarafrazi and M. Yazdi. Skewed alpha-stable distribution for natural texture modeling and segmentation in contourlet domain. EURASIP Journal on Image and Video Processing, 2019(1):1–12, 2019.
A. Shapiro. Asymptotic properties of statistical estimators in stochastic programming. The Annals of Statistics, 17(2):841–858, 1989.
U. Şimşekli, M. Gürbüzbalaban, T. H. Nguyen, G. Richard, and L. Sagun. On the heavy-tailed theory of stochastic gradient descent for deep neural networks. arXiv preprint arXiv:1912.00018, 2019.
U. Şimşekli, L. Sagun, and M. Gürbüzbalaban. A tail-index analysis of stochastic gradient noise in deep neural networks. In International Conference on Machine Learning, pages 5827–5837, 2019.
W. Su and Y. Zhu. Statistical inference for online learning and stochastic approximation via hierarchical incremental gradient descent. arXiv preprint arXiv:1802.04876, 2018.
P. Toulis and E. M. Airoldi. Asymptotic and finite-sample properties of estimators based on stochastic gradients. The Annals of Statistics, 45(4):1694–1727, 2017.
N. Tripuraneni, N. Flammarion, F. Bach, and M. I. Jordan. Averaging stochastic gradient descent on Riemannian manifolds. In Proceedings of the 31st Conference On Learning Theory, 2018.
A. W. Van der Vaart. Asymptotic Statistics, volume 3. Cambridge University Press, 2000.
L. Yu, K. Balasubramanian, S. Volgushev, and M. A. Erdogdu. An analysis of constant step size SGD in the non-convex regime: Asymptotic normality and bias. arXiv preprint arXiv:2006.07904, 2020.
J. Zhang, S. P. Karimireddy, A. Veit, S. Kim, S. J. Reddi, S. Kumar, and S. Sra. Why are adaptive methods good for attention models? arXiv preprint arXiv:1912.03194, 2019.

A Lemmas and Discussions
A.1 Key Lemmas
In this subsection, we present some key lemmas used in the proofs of our main theorems, which are helpful when considering stochastic problems with infinite variance.

The concept of uncorrelatedness has long been used by probabilists as a device for computing and estimating variances. For example, consider a sequence of uncorrelated random vectors $\{X_t\}_{t\in\mathbb{N}_+}$ (e.g. a square-integrable martingale difference sequence). Then
\[
\mathbb{E}\big[|X_1 + \dots + X_t|^2\big] = \mathbb{E}\big[|X_1|^2\big] + \dots + \mathbb{E}\big[|X_t|^2\big]. \tag{A.1}
\]
Indeed, this type of expansion is used in Polyak and Juditsky [1992] to show $L^2$ convergence in the normality analysis of stochastic approximation problems. However, uncorrelatedness is only defined when the random elements have finite variance. The following lemma provides an infinite-variance version of the expansion (A.1), stating that the $p$-th moment ($p < 2$) of a martingale without any square-integrability assumption can also be bounded simpliciter by the sum of the $p$-th moments of its differences, at the cost of a multiplicative constant that depends only on $p$ and the dimension $n$. It is a generalization of the recent study Cherapanamjeri et al. [2020, Lemma 4.2].

Lemma 7. Suppose $p \in [1, 2]$ and let $\{S_t\}_{t\in\mathbb{N}}$ be an $n$-dimensional martingale adapted to the filtration $\{\mathcal{F}_t\}_{t\in\mathbb{N}}$, with $\mathbb{E}[|S_t|^p] < \infty$ for every $t$ and $S_0 = 0$. Let $X_i = S_i - S_{i-1}$. Then
\[
\mathbb{E}\big[|S_t|^p\big] \leq 2^{2-p}\, n^{1-p/2} \sum_{i=1}^{t} \mathbb{E}\big[|X_i|^p\big].
\]

Next, we present a Taylor-expansion-type inequality for the function $\|x\|_p^p$. Recall that we have defined the signed power of a vector in (3.1).

Lemma 8. Let $p \in [1, 2]$. For any $x, y \in \mathbb{R}^n$,
\[
\|x + y\|_p^p \leq \|x\|_p^p + 4\|y\|_p^p + p\, y^T x^{\langle p-1\rangle}.
\]

This inequality traces back to Krasulina [1969], where the one-dimensional version $|x+y|^p \leq |x|^p + C|y|^p + p\,y\,|x|^{p-1}\operatorname{sgn}(x)$ is used to derive an $L^p$ rate of convergence for the one-dimensional stochastic approximation process with step-size $1/t$.¹ In our current study, this lemma is used not only to derive the $L^p$ rate of convergence for a general infinite-variance process in $\mathbb{R}^n$ with a variable step-size scheme (Theorem 3), but also in the proof of the equivalent definitions of $p$-PD (Theorem 10).

Finally, we quote Fabian [1967, Lemma 4.2], which we shall use to calculate the exact convergence rate (see also Chung [1954]).

Lemma 9 (Fabian [1967], Lemma 4.2). Let $\{b_t\}_{t\in\mathbb{N}}, A, B, \alpha, \beta$ be real numbers such that $0 < \alpha < 1$ and $A > 0$, and suppose the recursion $b_{t+1} = b_t(1 - At^{-\alpha}) + Bt^{-\alpha-\beta}$ holds. Then $b_t = \Theta(t^{-\beta})$.

¹ The paper Krasulina [1969] contains a minor error in ignoring the signum function $\operatorname{sgn}(x)$ in this inequality. Our proof of Theorem 3 can be thought of as its correction as well as an extension.

A.2 Discussions on p-Positive Definiteness and Uniform p-Positive Definiteness

Let us now focus on the $p$-PD and uniform $p$-PD assumptions, which we defined back in Definition 1 and Definition 2. Our next theorem provides several equivalent characterizations of $p$-PD.

Theorem 10 (Equivalent definitions of p-PD). Let $Q$ be a symmetric matrix. The following are equivalent when $p \in [1, 2]$.

i) There exist $\delta, L > 0$ such that $\|I - tQ\|_p^p \leq 1 - Lt$ for all $t \in [0, \delta)$.
ii) There exists $\lambda > 0$ such that for all $v \in \mathbb{R}^n$, $v^T Q v^{\langle p-1\rangle} \geq \lambda \|v\|_p^p$.
iii) For all $v \in S_p$, $v^T Q v^{\langle p-1\rangle} > 0$.
iv) For all $v \in S_p$, there exists $t > 0$ such that $\|v - tQv\|_p < 1$.
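The signed power and the form appearing in characterizations ii) and iii) are easy to probe numerically. The following sketch (our own illustration; the matrices `Q1`, `Q2` and the helper names are hypothetical) checks that a diagonally dominant matrix passes the $p = 1$ test, while a positive definite matrix that is not diagonally dominant fails it — consistent with Proposition 13 below, by which uniform 1-PD is equivalent to uniform diagonal dominance:

```python
import numpy as np

def signed_power(v, q):
    """Elementwise signed power v^<q> = sign(v_i) * |v_i|^q (cf. (3.1))."""
    return np.sign(v) * np.abs(v) ** q

def ppd_form(Q, v, p):
    """The quantity v^T Q v^<p-1> from the p-PD characterization ii)/iii)."""
    return v @ Q @ signed_power(v, p - 1.0)

# Q1 is diagonally dominant (2 > 0.5), hence p-PD for every p in [1, 2].
Q1 = np.array([[2.0, 0.5], [0.5, 2.0]])
# Q2 is positive definite (both eigenvalues > 0) but not diagonally dominant.
Q2 = np.array([[1.0, 1.2], [1.2, 2.0]])

rng = np.random.default_rng(0)
vs = rng.normal(size=(1000, 2))            # random test directions
assert min(ppd_form(Q1, v, 1.0) for v in vs) > 0
assert np.all(np.linalg.eigvalsh(Q2) > 0)  # Q2 is PD ...
v_bad = np.array([1.0, -0.01])
assert ppd_form(Q2, v_bad, 1.0) < 0        # ... yet fails the p = 1 test
```

For `Q1` the positivity even holds analytically: $v^T Q_1 v^{\langle 0\rangle} \geq 2|v_1| + 2|v_2| - 0.5|v_1| - 0.5|v_2| = 1.5\|v\|_1$; for `Q2`, the direction $v = (1, -0.01)$ gives $-0.2 + 0.008 < 0$.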
Next, we provide several equivalent characterizations of uniform $p$-PD.

Theorem 11 (Equivalent definitions of uniform p-PD). Let $\mathcal{M}$ be a bounded set of symmetric matrices, with closure $\overline{\mathcal{M}}$. The following are equivalent when $p \in [1, 2]$.

i) There exist $\delta, L > 0$ such that $\|I - tQ\|_p^p \leq 1 - Lt$ for all $t \in [0, \delta)$ and $Q \in \mathcal{M}$.
ii) There exists $\lambda > 0$ such that for all $v \in \mathbb{R}^n$ and $Q \in \mathcal{M}$, $v^T Q v^{\langle p-1\rangle} \geq \lambda \|v\|_p^p$.
iii) For all $v \in S_p$ and $Q \in \overline{\mathcal{M}}$, $v^T Q v^{\langle p-1\rangle} > 0$.
iv) For all $v \in S_p$ and $Q \in \overline{\mathcal{M}}$, there exists $t > 0$ such that $\|v - tQv\|_p < 1$.

We notice that some mild assumptions can indeed imply $p$-PD; for example, we will show that diagonal dominance implies $p$-PD. Recall that a symmetric matrix $Q = (q_{ij})_{n\times n}$ is called diagonally dominant (with non-negative diagonal) if for every $i \in [n]$,
\[
q_{ii} - \sum_{j\in[n]\setminus\{i\}} |q_{ij}| > 0.
\]
Further, we say that a non-empty set $\mathcal{M}$ of symmetric matrices is uniformly diagonally dominant (with non-negative diagonal) if
\[
\inf_{(q_{ij})_{n\times n}\in\mathcal{M}} \; \min_{i\in[n]} \Big( q_{ii} - \sum_{j\in[n]\setminus\{i\}} |q_{ij}| \Big) > 0.
\]
We have the following observations, which we shall prove in Appendix B. First, we observe that the uniform $p$-PD assumption is weaker than the notion of uniform diagonal dominance (with non-negative diagonal).

Proposition 12. A uniformly diagonally dominant (with non-negative diagonal) set of symmetric matrices is uniformly $p$-PD for every $p \in [1, 2]$.

Next, we notice that the result in Proposition 12 is tight for $p = 1$.

Proposition 13. Uniform $1$-PD is equivalent to uniform diagonal dominance (with non-negative diagonal).

Finally, we observe that the notion of uniform $2$-PD is weaker than uniform $p$-PD for any $p \in [1, 2]$.

Proposition 14. Let $p \in [1, 2]$. Uniform $p$-PD implies uniform $2$-PD.

B Omitted Proofs
In this appendix, we first prove the lemmas, theorems, and propositions in Section A, and then prove the theorems in Sections 3 and 4. Throughout this appendix, we denote by $\delta_t$ the error of the approximation $x_t - x^*$, and by $\bar\delta_t$ the averaged error $(\delta_0 + \dots + \delta_{t-1})/t$. The gradient $\nabla f(x)$ and the Hessian $\nabla^2 f(x)$ will be written as $R(x)$ and $\nabla R(x)$ respectively, not only for notational simplicity, but also to stress the fact that our results can be applied to any instance of the stochastic approximation scheme (2.1), including SGD.

Proof of Lemma 7. We first consider the case $n = 1$. Suppose $\{S_t\}$ is a one-dimensional martingale and $X_i = S_i - S_{i-1}$. Notice that the function $g(x) = |x|^p$ satisfies the inequality (see e.g. Cherapanamjeri et al. [2020, Lemma A.3]):
\[
|g'(x) - g'(y)| \leq 2^{2-p}\, g'(|x - y|),
\]
where the weak derivative $g'(x) = \operatorname{sgn}(x)$ is used in the case $p = 1$, with
\[
\operatorname{sgn}(x) := \begin{cases} 1, & x > 0, \\ -1, & x < 0, \\ 0, & x = 0. \end{cases}
\]
Furthermore, since $\mathbb{E}[X_i\, g'(S_{i-1}) \mid \mathcal{F}_{i-1}] = g'(S_{i-1})\, \mathbb{E}[X_i \mid \mathcal{F}_{i-1}] = 0$, we have
\[
\mathbb{E}[g(S_t)] = \sum_{i=1}^{t} \mathbb{E}\Big[\int_{S_{i-1}}^{S_i} g'(x)\,\mathrm{d}x\Big] = \sum_{i=1}^{t} \mathbb{E}\Big[X_i\, g'(S_{i-1}) + \int_{S_{i-1}}^{S_i} \big(g'(x) - g'(S_{i-1})\big)\,\mathrm{d}x\Big]
\]
\[
= \sum_{i=1}^{t} \mathbb{E}\Big[\int_{0}^{X_i} \big(g'(S_{i-1} + \tau) - g'(S_{i-1})\big)\,\mathrm{d}\tau\Big] \leq \sum_{i=1}^{t} \mathbb{E}\Big[\int_{0}^{|X_i|} \big|g'(S_{i-1} + \operatorname{sgn}(X_i)\tau) - g'(S_{i-1})\big|\,\mathrm{d}\tau\Big]
\]
\[
\leq 2^{2-p} \sum_{i=1}^{t} \mathbb{E}\Big[\int_{0}^{|X_i|} g'(\tau)\,\mathrm{d}\tau\Big] = 2^{2-p} \sum_{i=1}^{t} \mathbb{E}\big[g(|X_i|)\big]. \tag{B.1}
\]
Next, for dimension $n > 1$, we denote by $S_i^j$ (resp. $X_i^j$) the $j$-th entry of the vector $S_i$ (resp. $X_i$). Applying the inequality (B.1) to each coordinate $S_t^j$ and summing in the $p$-norm,
\[
\mathbb{E}\big[\|S_t\|_p^p\big] = \sum_{j=1}^{n} \mathbb{E}\big[|S_t^j|^p\big] \leq \sum_{j=1}^{n} 2^{2-p} \sum_{i=1}^{t} \mathbb{E}\big[|X_i^j|^p\big] = 2^{2-p} \sum_{i=1}^{t} \mathbb{E}\big[\|X_i\|_p^p\big].
\]
Finally, the inequalities $|x| \leq \|x\|_p \leq n^{1/p - 1/2}|x|$ give our desired result:
\[
\mathbb{E}\big[|S_t|^p\big] \leq 2^{2-p}\, n^{1-p/2} \sum_{i=1}^{t} \mathbb{E}\big[|X_i|^p\big].
\]
The proof is complete.
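Lemma 7 can be sanity-checked by simulation. In the sketch below (our own construction, not from the paper), the martingale is a random walk with i.i.d. symmetric Pareto-type increments of tail index $\alpha = 1.8$, so the increments have a finite $p$-th moment for $p = 1.2 < \alpha$ but infinite variance; both sides of the bound are estimated by Monte Carlo:

```python
import numpy as np

def lemma7_check(n=3, t=50, p=1.2, alpha=1.8, trials=20_000, seed=0):
    """Monte Carlo estimate of both sides of Lemma 7's bound
        E|S_t|^p  <=  2^(2-p) * n^(1-p/2) * sum_{i<=t} E|X_i|^p
    for S_t = X_1 + ... + X_t with i.i.d. symmetric Pareto-type increments
    (tail index alpha, so E|X_i|^p < infinity for p < alpha, but Var = infinity)."""
    rng = np.random.default_rng(seed)
    u = 1.0 - rng.uniform(size=(trials, t, n))             # u in (0, 1]
    X = u ** (-1.0 / alpha) * rng.choice([-1.0, 1.0], size=(trials, t, n))
    S_t = X.sum(axis=1)                                    # martingale at time t
    lhs = np.mean(np.linalg.norm(S_t, axis=1) ** p)        # estimates E|S_t|^p
    # i.i.d. increments, so sum_i E|X_i|^p = t * E|X_1|^p
    rhs = 2 ** (2 - p) * n ** (1 - p / 2) * t * np.mean(
        np.linalg.norm(X[:, 0, :], axis=1) ** p)
    return lhs, rhs
```

The left-hand side grows like $t^{p/\alpha}$ rather than $t$, so the linear-in-$t$ bound is loose but clearly valid in the simulation.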
Proof of Lemma 8. Starting from the elementary scalar inequality $|1 + a|^p \leq 1 + pa + 4|a|^p$, valid for any $p \in [1, 2]$ and $a \in \mathbb{R}$, we have that for any $p \in [1, 2]$ and $x, y \in \mathbb{R}$,
\[
|x + y|^p \leq |x|^p + p\, y\, |x|^{p-1}\operatorname{sgn}(x) + 4|y|^p. \tag{B.2}
\]
Next, for any $x = (x_1, \dots, x_n)^T$, $y = (y_1, \dots, y_n)^T \in \mathbb{R}^n$, taking the $p$-norm and applying the inequality (B.2) coordinatewise, we obtain
\[
\|x + y\|_p^p = \sum_{i=1}^{n} |x_i + y_i|^p \leq \sum_{i=1}^{n} \Big( |x_i|^p + p\, y_i\, |x_i|^{p-1}\operatorname{sgn}(x_i) + 4|y_i|^p \Big) = \|x\|_p^p + 4\|y\|_p^p + p\, y^T x^{\langle p-1\rangle},
\]
which completes the proof.

Since Theorem 10 is just a special case of Theorem 11, we will only prove the latter. Before we proceed, let us first state a useful technical lemma.

Lemma 15.
Let $u, v \in \mathbb{R}^n$ and consider the function $\varphi(t) = \|u + tv\|_p^p = \sum_{i=1}^{n} |u_i + v_i t|^p$. The function $\varphi$ is convex and has the following derivative (when $1 < p \leq 2$) or subderivative (when $p = 1$):
\[
\varphi'(t) = \sum_{i=1}^{n} p\, |u_i + v_i t|^{p-1}\operatorname{sgn}(u_i + v_i t)\, v_i = p\, v^T (u + tv)^{\langle p-1\rangle}.
\]

Proof of Theorem 11. We shall show that i) $\Rightarrow$ iv) $\Rightarrow$ iii) $\Rightarrow$ ii) $\Rightarrow$ i).

i) $\Rightarrow$ iv): For $Q \in \overline{\mathcal{M}}$, take a sequence $\{Q_1, Q_2, \dots\} \subseteq \mathcal{M}$ such that $\lim_{m\to\infty} Q_m = Q$. Then iv) follows by letting $m \to \infty$ in $\|I - (\delta/2)Q_m\|_p^p \leq 1 - L\delta/2 < 1$.

iv) $\Rightarrow$ iii): For all $v \in S_p$ and $Q \in \overline{\mathcal{M}}$, consider the function $\varphi(t) = \|v - tQv\|_p^p$. According to Lemma 15, $\varphi$ is convex, and by iv) there exists $t_0 > 0$ with $\varphi(t_0) < \varphi(0) = 1$. Hence it follows that $\varphi'(0) < 0$; that is, $v^T Q v^{\langle p-1\rangle} > 0$.

iii) $\Rightarrow$ ii): Since the function $(v, Q) \mapsto v^T Q v^{\langle p-1\rangle}$ is continuous, it maps the compact set $S_p \times \overline{\mathcal{M}}$ to a compact set. Hence there exists some $\lambda > 0$ such that for all $v \in S_p$ and $Q \in \overline{\mathcal{M}}$, $v^T Q v^{\langle p-1\rangle} \geq \lambda$. Now, for every $u \in \mathbb{R}^n \setminus \{0\}$, by setting $v = u/\|u\|_p$, we get $u^T Q u^{\langle p-1\rangle} \geq \lambda \|u\|_p^p$.

ii) $\Rightarrow$ i): For arbitrary $v \in \mathbb{R}^n$ and $Q \in \mathcal{M}$, by Lemma 8 we have
\[
\|(I - tQ)v\|_p^p = \|v - tQv\|_p^p \leq \|v\|_p^p + 4t^p \|Qv\|_p^p - pt\, v^T Q v^{\langle p-1\rangle} \leq \|v\|_p^p + 4t^p \|Q\|_p^p \|v\|_p^p - pt\lambda \|v\|_p^p.
\]
This implies i). The proof is complete.

Proof of Proposition
12. Let $Q \in \mathcal{M}$ and $v \in \mathbb{R}^n$. Then
\[
v^T Q v^{\langle p-1\rangle} = \sum_{i=1}^{n} q_{ii}|v_i|^p + \sum_{i\neq j} q_{ij}\, v_i\, |v_j|^{p-1}\operatorname{sgn}(v_j) \geq \sum_{i=1}^{n} q_{ii}|v_i|^p - \sum_{i<j} |q_{ij}|\big(|v_i|\,|v_j|^{p-1} + |v_j|\,|v_i|^{p-1}\big).
\]
To bound the cross terms, notice that for any $p \geq 1$ and $x, y \geq 0$,
\[
x^p + y^p - x^{p-1}y - y^{p-1}x = (x^{p-1} - y^{p-1})(x - y) \geq 0.
\]
Hence
\[
v^T Q v^{\langle p-1\rangle} \geq \sum_{i=1}^{n} q_{ii}|v_i|^p - \sum_{i<j} |q_{ij}|\big(|v_i|^p + |v_j|^p\big) = \sum_{i=1}^{n} \Big( q_{ii} - \sum_{j\in[n]\setminus\{i\}} |q_{ij}| \Big)|v_i|^p \geq \lambda \|v\|_p^p,
\]
where $\lambda > 0$ is the uniform diagonal dominance gap of $\mathcal{M}$. By the item ii) of Theorem 11, $\mathcal{M}$ is uniformly $p$-PD. The proof is complete.

Proof of Proposition 13. One direction is the case $p = 1$ of Proposition 12. Conversely, suppose $\mathcal{M}$ is uniformly $1$-PD. By the item i) of Theorem 11, there exist $\delta, L > 0$ such that $\|I - tQ\|_1 \leq 1 - Lt$ for all $t \in [0, \delta)$ and $Q \in \mathcal{M}$. Let $Q = (q_{ij})_{n\times n}$ and notice that
\[
\|I - tQ\|_1 = \max_{i\in[n]} \Big( |1 - t q_{ii}| + \sum_{j\in[n]\setminus\{i\}} t\,|q_{ij}| \Big).
\]
It follows that
\[
\min_{i\in[n]} \Big( q_{ii} - \sum_{j\in[n]\setminus\{i\}} |q_{ij}| \Big) \geq L > 0,
\]
so that $\mathcal{M}$ is uniformly diagonally dominant (with non-negative diagonal). The proof is complete.

Proof of Proposition 14. Suppose, for contradiction, that $\mathcal{M}$ is uniformly $p$-PD but not uniformly $2$-PD. Then there exists a sequence $\{Q_1, Q_2, \dots\} \subseteq \mathcal{M}$ such that the smallest eigenvalues $\lambda_m$ of $Q_m$ satisfy
\[
\lim_{m\to\infty} \lambda_m \leq 0. \tag{B.3}
\]
For each $m \in \mathbb{N}_+$, there exists $v_m \in \mathbb{R}^n \setminus \{0\}$ such that $Q_m v_m = \lambda_m v_m$. Hence
\[
v_m^T Q_m v_m^{\langle p-1\rangle} = \lambda_m\, v_m^T v_m^{\langle p-1\rangle} = \lambda_m \|v_m\|_p^p.
\]
But by the item ii) of Theorem 11, there exists $\lambda > 0$ such that $\lambda_m \geq \lambda$ for all $m$. This contradicts (B.3). The proof is complete.

Proof of Theorem 3. Define
\[
T_t(x) = \big(T_t^1(x), \dots, T_t^n(x)\big)^T = x - x^* - \gamma_{t+1} R(x).
\]
An $n$-dimensional (and corrected) version of the first inequality in the proof of Krasulina [1969, Theorem 2] can be obtained by applying Lemma 8 to our stochastic approximation scheme,
\[
\|x_{t+1} - x^*\|_p^p = \|T_t(x_t) - \gamma_{t+1}\xi_{t+1}\|_p^p \leq \|T_t(x_t)\|_p^p + 4\gamma_{t+1}^p \|\xi_{t+1}\|_p^p - p\gamma_{t+1} \sum_{i=1}^{n} \xi_{t+1}^i\, |T_t^i(x_t)|^{p-1}\operatorname{sgn} T_t^i(x_t). \tag{B.4}
\]
Since $\mathbb{E}[\xi_{t+1}^i\,|T_t^i(x_t)|^{p-1}\operatorname{sgn} T_t^i(x_t) \mid x_t] = |T_t^i(x_t)|^{p-1}\operatorname{sgn} T_t^i(x_t)\, \mathbb{E}[\xi_{t+1}^i \mid x_t] = 0$, by taking expectations in (B.4), we get
\[
\mathbb{E}\big[\|\delta_{t+1}\|_p^p\big] = \mathbb{E}\big[\|x_{t+1} - x^*\|_p^p\big] \leq \mathbb{E}\big[\|(x_t - x^*) - \gamma_{t+1} R(x_t)\|_p^p\big] + 4\gamma_{t+1}^p\, \mathbb{E}\big[\|\xi_{t+1}\|_p^p\big].
\]
By the mean value theorem, there exists $x_t^\flat \in \{x^* + \tau(x_t - x^*) : 0 \leq \tau \leq 1\}$ such that $R(x_t) = \nabla R(x_t^\flat)(x_t - x^*)$, and then
\[
\mathbb{E}\big[\|(x_t - x^*) - \gamma_{t+1} R(x_t)\|_p^p\big] + 4\gamma_{t+1}^p\, \mathbb{E}\big[\|\xi_{t+1}\|_p^p\big] = \mathbb{E}\Big[\big\|\big(I - \gamma_{t+1}\nabla R(x_t^\flat)\big)(x_t - x^*)\big\|_p^p\Big] + 4\gamma_{t+1}^p\, \mathbb{E}\big[\|\xi_{t+1}\|_p^p\big]
\]
\[
\leq \big\|I - \gamma_{t+1}\nabla R(x_t^\flat)\big\|_p^p \cdot \mathbb{E}\big[\|\delta_t\|_p^p\big] + C_0 \gamma_{t+1}^p \big(1 + \mathbb{E}\big[\|\delta_t\|_p^p\big]\big),
\]
where the last inequality follows from
\[
\mathbb{E}\big[|m_{t+1}|^p \mid \mathcal{F}_t\big] \leq \mathbb{E}\big[|m_{t+1}|^2 \mid \mathcal{F}_t\big]^{p/2} \leq \big[K(1 + |x_t|^2)\big]^{p/2} \leq K^{p/2}\big(1 + |x_t|^p\big) \leq K^{p/2}\big(1 + 2^{p-1}(|\delta_t|^p + |x^*|^p)\big), \tag{B.5}
\]
using $(x + y)^r \leq x^r + y^r$ for any $x, y \geq 0$, $0 \leq r \leq 1$, together with $\mathbb{E}[|\zeta_1|^p] < \infty$.

Note that $\|I - \gamma_{t+1}\nabla R(x_t^\flat)\|_p^p$ can be estimated by the uniform $p$-PD assumption (see item i) of Theorem 11) since $\gamma_t \to 0$: for $t$ sufficiently large,
\[
\big\|I - \gamma_{t+1}\nabla R(x_t^\flat)\big\|_p^p \leq 1 - L\gamma_{t+1}.
\]
Moreover, there is a positive constant $C_1$ such that $1 - L\gamma_{t+1} + C_0\gamma_{t+1}^p \leq 1 - C_1\gamma_{t+1}$ for $t$ sufficiently large. Hence, we arrive at the following iterative bound for $t$ sufficiently large:
\[
\mathbb{E}\big[\|\delta_{t+1}\|_p^p\big] \leq (1 - C_1\gamma_{t+1})\, \mathbb{E}\big[\|\delta_t\|_p^p\big] + C_2 \gamma_{t+1}^p. \tag{B.6}
\]
Next, let us substitute $\gamma_{t+1} = t^{-\rho}$ with $0 < \rho < 1$, and consider the iteration
\[
\mu_{t+1} = (1 - t^{-\rho} C_1)\, \mu_t + C_2\, t^{-\rho p}, \tag{B.7}
\]
so that by (B.6), $\mathbb{E}[\|\delta_t\|_p^p] = O(\mu_t)$. By virtue of Lemma 9, we get
\[
\mu_t = \Theta\big(t^{-\rho(p-1)}\big). \tag{B.8}
\]
Therefore, by (B.6), (B.7), and (B.8), we obtain the rate of convergence $\mathbb{E}[\|\delta_t\|_p^p] = O(t^{-\rho(p-1)})$. Since all $p$-norms on $\mathbb{R}^n$ are equivalent, we can drop the subscript in $\|\cdot\|_p$ and obtain
\[
\mathbb{E}\big[|\delta_t|^p\big] = O\big(t^{-\rho(p-1)}\big).
\]
Finally, by (B.5), we see that $\sup_{t\in\mathbb{N}_+} \mathbb{E}[|\xi_t|^p] \leq \sup_{t\in\mathbb{N}_+} \mathbb{E}\big[2^{p-1}(|m_t|^p + |\zeta_t|^p)\big] < \infty$. The proof is complete.

Proof of Corollary 4. By Theorem 3, $\mathbb{E}[|\delta_t|^p] = O(t^{-\rho(p-1)})$ holds for every $p \in [q, \alpha)$, and we can apply Jensen's inequality to strengthen it. By Jensen's inequality and (3.4), we get
\[
\mathbb{E}\big[|\delta_t|^q\big] \leq \mathbb{E}\big[|\delta_t|^p\big]^{q/p} = O\big(t^{-\rho(p-1)q/p}\big).
\]
By letting $p \nearrow \alpha$, we conclude that for every $\varepsilon > 0$,
\[
\mathbb{E}\big[|\delta_t|^q\big] = o\big(t^{-\rho q \frac{\alpha-1}{\alpha} + \varepsilon}\big).
\]
The proof is complete.

Next, we state a series of technical lemmas together with their proofs, which will be used in the proofs of Theorems 5 and 6.

Lemma 16. If $\gamma_t \asymp t^{-\rho}$ with $0 < \rho < \kappa \leq 1$, then for all $\lambda > 0$,
\[
\lim_{t\to\infty} \frac{1}{t^\kappa} \sum_{j=1}^{t-1} \exp\Big(-\lambda \sum_{i=j}^{t-1} \gamma_i\Big) = 0.
\]

Proof.
Notice that there exists some constant $B > 0$ such that
\[
\lambda \sum_{i=j}^{t-1} \gamma_i \geq B\lambda\big(t^{1-\rho} - j^{1-\rho}\big).
\]
It follows that, writing $\tilde B = B\lambda$,
\[
\frac{1}{t^\kappa} \sum_{j=1}^{t-1} \exp\Big(-\lambda \sum_{i=j}^{t-1} \gamma_i\Big) \leq \frac{1}{t^\kappa} \sum_{j=0}^{t-1} \exp\big(-\tilde B t^{1-\rho} + \tilde B j^{1-\rho}\big) = \frac{\sum_{j=0}^{t-1} \exp(\tilde B j^{1-\rho})}{t^\kappa \exp(\tilde B t^{1-\rho})}.
\]
By the Stolz–Cesàro theorem, we have
\[
\frac{\sum_{j=0}^{t-1} \exp(\tilde B j^{1-\rho})}{t^\kappa \exp(\tilde B t^{1-\rho})} \asymp \frac{\exp(\tilde B t^{1-\rho})}{(t+1)^\kappa \exp(\tilde B (t+1)^{1-\rho}) - t^\kappa \exp(\tilde B t^{1-\rho})} = \frac{1}{(t+1)^\kappa \exp\big[\tilde B\big((t+1)^{1-\rho} - t^{1-\rho}\big)\big] - t^\kappa}
\]
\[
= \frac{1}{(t+1)^\kappa \exp\big[\tilde B(1-\rho)(t+1)^{-\rho} + o(t^{-\rho})\big] - t^\kappa} = \frac{1}{(t+1)^\kappa \big[1 + \tilde B(1-\rho)(t+1)^{-\rho} + o(t^{-\rho})\big] - t^\kappa} = \frac{1}{\tilde B(1-\rho)(t+1)^{\kappa-\rho} + o\big((t+1)^{\kappa-\rho}\big)} \to 0,
\]
as $t \to \infty$. The proof is complete.

Lemma 17. Suppose $\gamma_t \asymp t^{-\rho}$ and $0 < \rho < \kappa \leq 1$; let $A$ be a positive definite symmetric matrix. Consider the matrix recursion in [Polyak and Juditsky, 1992, Lemma 1],
\[
X_j^j = I, \qquad X_j^{t+1} = X_j^t - \gamma_t A X_j^t \quad (j \in \mathbb{N}_+),
\]
and define
\[
\bar X_j^t = \gamma_j \sum_{i=j}^{t-1} X_j^i, \qquad \Phi_j^t = A^{-1} - \bar X_j^t.
\]
Then the following limit holds:
\[
\lim_{t\to\infty} \frac{1}{t^\kappa} \sum_{j=1}^{t-1} \|\Phi_j^t\| = 0.
\]

Remark. Lemma 17 recovers [Polyak and Juditsky, 1992, Lemma 1] as the special case $\kappa = 1$.

Proof of Lemma 17. Modeling after Polyak and Juditsky [1992]'s proof of their Lemma 1, we define $S_j^t = \sum_{i=j}^{t-1} (\gamma_i - \gamma_j) X_j^i$, so that
\[
\Phi_j^t = S_j^t + A^{-1} X_j^t.
\]
We will split the proof into two parts: in the first part we prove $t^{-\kappa} \sum_{j=1}^{t-1} \|S_j^t\| \to 0$, and in the second $t^{-\kappa} \sum_{j=1}^{t-1} \|X_j^t\| \to 0$.

Part I. We first prove that $t^{-\kappa} \sum_{j=1}^{t-1} \|S_j^t\| \to 0$. Since $A$ is positive definite, there exist some $\lambda > 0$ and $K < \infty$ such that
\[
\|X_j^t\| \leq K \exp\Big(-\lambda \sum_{i=j}^{t-1} \gamma_i\Big) = K e^{-\lambda m_j^t}, \tag{B.9}
\]
where $m_k^\ell$ stands for $\sum_{i=k}^{\ell-1} \gamma_i$.
Now we have
\[
\|S_j^t\| = \Big\|\sum_{i=j}^{t-1} (\gamma_i - \gamma_j) X_j^i\Big\| = \Big\|\sum_{i=j}^{t-1} \sum_{k=j}^{i-1} (\gamma_{k+1} - \gamma_k) X_j^i\Big\| \leq C_1 \sum_{i=j}^{t-1} \sum_{k=j}^{i-1} k^{-\rho-1} \exp\big(-\lambda m_j^i\big)
\]
\[
\leq C_1\, j^{-1} \sum_{i=j}^{t-1} \sum_{k=j}^{i-1} k^{-\rho} \exp\big(-\lambda m_j^i\big) \leq C_2\, j^{-1} \sum_{i=j}^{t-1} m_j^i \exp\big(-\lambda m_j^i\big) = C_2\, j^{-1} \sum_{i=j}^{t-1} \frac{m_j^i e^{-\lambda m_j^i}}{\gamma_i}\big(m_j^{i+1} - m_j^i\big), \tag{B.10}
\]
where $C_1, C_2$ are some positive constants; here we used $|\gamma_{k+1} - \gamma_k| \leq C k^{-\rho-1}$ (which we may, since our step-size assumption is $\gamma_t \asymp t^{-\rho}$ with $0 < \rho < 1$), $\sum_{k=j}^{i-1} k^{-\rho} \leq C m_j^i$, and $m_j^{i+1} - m_j^i = \gamma_i$. Since the function $f_w(x) = x^\rho \exp(-w x^{1-\rho})$ is bounded on $x \in [1, \infty)$ for every $w > 0$, we have
\[
j^{-\rho}\, \gamma_i^{-1} \exp\big(-\tfrac{\lambda}{2} m_j^i\big) \leq C_3\, i^\rho j^{-\rho} \exp\big(-C_4(i^{1-\rho} - j^{1-\rho})\big) = C_3\, f_{C_4}(i)/f_{C_4}(j) \leq C_5,
\]
for some positive constants $C_3$, $C_4$, and $C_5$. Hence, continuing upon (B.10),
\[
\|S_j^t\| \leq C_2 C_5\, j^{\rho-1} \sum_{i=j}^{t-1} m_j^i\, e^{-\frac{\lambda}{2} m_j^i} \big(m_j^{i+1} - m_j^i\big).
\]
Since the summation $\sum_{i=j}^{t-1} m_j^i\, e^{-\frac{\lambda}{2} m_j^i} (m_j^{i+1} - m_j^i)$ approximates $\int_0^{m_j^t} m\, e^{-\frac{\lambda}{2} m}\,\mathrm{d}m$, it is bounded. Hence, for some positive constant $C_6$,
\[
\|S_j^t\| \leq C_6\, j^{\rho-1},
\]
and since $\rho < \kappa$,
\[
\lim_{t\to\infty} \frac{1}{t^\kappa} \sum_{j=1}^{t-1} \|S_j^t\| = 0.
\]

Part II. It remains to prove that $t^{-\kappa} \sum_{j=1}^{t-1} \|X_j^t\| \to 0$. This follows from (B.9) and Lemma 16, since $t^{-\kappa} \sum_{j=1}^{t-1} \|X_j^t\| \leq K t^{-\kappa} \sum_{j=1}^{t-1} e^{-\lambda m_j^t} \to 0$. Hence the proof of this lemma is complete.

Lemma 18. Given the assumption of Theorem 5 or Theorem 6,
\[
\frac{\xi_1 + \dots + \xi_t}{t^{1/\alpha}} \xrightarrow[t\to\infty]{\mathcal{D}} \mu.
\]

Proof. We recall the decomposition $\xi_t = \zeta_t + m_t$, where the $\{\zeta_t\}$ are i.i.d. and $\zeta_1$ is in the domain of normal attraction of an $n$-dimensional centered $\alpha$-stable distribution, so that
\[
\frac{\zeta_1 + \dots + \zeta_t}{t^{1/\alpha}} \xrightarrow[t\to\infty]{\mathcal{D}} \mu.
\]
Hence, it suffices to show that $t^{-1/\alpha}(m_1 + \dots + m_t) \to 0$ in $L^r$ for some $r \geq 1$. Note that there exists $C > 0$ such that
\[
\mathbb{E}\big[|m_{t+1}(x_t)|^2 \mid \mathcal{F}_t\big] \leq K(1 + |x_t|^2) \leq K(1 + 2|x^*|^2 + 2|\delta_t|^2) \leq C(1 + |\delta_t|^2).
\]
Hence, by using the "Remark" on p. 151 of Neveu [1975] (cf. inequalities (20) of Anantharam and Borkar [2012]), we get
\[
\mathbb{E}\bigg[\Big|\frac{m_1 + \dots + m_t}{t^{1/\alpha}}\Big|^r\bigg] \leq \frac{C_r}{t^{r/\alpha}}\, \mathbb{E}\bigg[\Big(\sum_{i=1}^{t} \mathbb{E}\big[|m_i|^2 \mid \mathcal{F}_{i-1}\big]\Big)^{r/2}\bigg] \leq \frac{C_r'}{t^{r/\alpha}}\, \mathbb{E}\bigg[\Big(\sum_{i=1}^{t} \big(1 + |\delta_{i-1}|^2\big)\Big)^{r/2}\bigg] \leq \frac{C_r''}{t^{r/\alpha}}\, \mathbb{E}\bigg[t^{r/2} + \sum_{i=1}^{t} |\delta_{i-1}|^r\bigg], \tag{B.11}
\]
where, for the last inequality, we use the fact that $(x + y)^s \leq x^s + y^s$ for any $x, y \geq 0$ and $0 \leq s \leq 1$.

If the assumption of Theorem 5 holds, take $r = p > (\alpha + \alpha\rho)/(1 + \alpha\rho)$ in the inequalities (B.11) above. Then, by Theorem 3, $\mathbb{E}[|\delta_t|^r] = O(t^{-\rho(r-1)}) = o(t^{r/\alpha - 1})$. If the assumption of Theorem 6 holds, take $r = q > 1/\rho > \alpha/(1 + \rho(\alpha - 1))$; then by Corollary 4, $\mathbb{E}[|\delta_t|^r] = \tilde O(t^{-\rho r(\alpha-1)/\alpha}) = o(t^{r/\alpha - 1})$.

In both cases, $t^{-1/\alpha}(m_1 + \dots + m_t) \to 0$ in $L^r$. The proof is complete.

Finally, we are ready to prove Theorems 5 and 6.

Proof of Theorem 5. We have the decomposition
\[
\frac{t}{t^{1/\alpha}}\, \bar\delta_t = \underbrace{\frac{1}{t^{1/\alpha}} F_t \delta_0}_{I_t^{(1)}} - \underbrace{\frac{1}{t^{1/\alpha}} \sum_{j=1}^{t-1} A^{-1} \xi_j}_{I_t^{(2)}} - \underbrace{\frac{1}{t^{1/\alpha}} \sum_{j=1}^{t-1} W_j^t\, \xi_j}_{I_t^{(3)}}, \tag{B.12}
\]
where $F_t$ and $W_j^t$ are deterministic matrices with uniformly bounded operator $2$-norms, defined as
\[
F_t = \sum_{i=0}^{t-1} \prod_{k=1}^{i} (I - \gamma_k A), \tag{B.13}
\]
\[
W_j^t = \gamma_j \sum_{i=j}^{t-1} \prod_{k=j+1}^{i} (I - \gamma_k A) \; - \; A^{-1}. \tag{B.14}
\]
We have $I_t^{(1)} \to 0$, since the norms $\|F_t\|$ are uniformly bounded. Next, take some $\kappa$ such that
\[
\max(\rho, 1/\alpha) < \kappa \leq p/\alpha. \tag{B.15}
\]
We shall prove that $I_t^{(3)} \to 0$ in $L^{\alpha\kappa}$ (notice that $1 < \alpha\kappa \leq p < \alpha$; cf. Polyak and Juditsky [1992, Proof of Theorem 1], where convergence in $L^2$ is proven). By Theorem 3, $\sup_j \mathbb{E}[|\xi_j|^p] < \infty$.
Hence we can compute, by virtue of Lemma 7, that
$$
E\big[|I^{(3)}_t|^{\alpha\kappa}\big]
= E\left[\left|\frac{1}{t^{1/\alpha}} \sum_{j=1}^{t-1} W_{tj}\xi_j\right|^{\alpha\kappa}\right]
\le \frac{C}{t^{\kappa}} \sum_{j=1}^{t-1} E\big[|W_{tj}\xi_j|^{\alpha\kappa}\big]
\le \frac{C}{t^{\kappa}} \sum_{j=1}^{t-1} \|W_{tj}\|^{\alpha\kappa} \sup_j E\big[|\xi_j|^{\alpha\kappa}\big]
\le \frac{C}{t^{\kappa}} \sum_{j=1}^{t-1} \|W_{tj}\| \sup_j E\big[|\xi_j|^{\alpha\kappa}\big]
\le \frac{C}{t^{\kappa}} \sum_{j=1}^{t-1} \|W_{tj}\|.
$$
Notice that the matrices $W_{tj}$ defined above correspond to $-\Phi_{tj}$ in Lemma 17. This implies that
$$
E\big[|I^{(3)}_t|^{\alpha\kappa}\big] \le \frac{K}{t^{\kappa}} \sum_{j=1}^{t-1} \|W_{tj}\| \to 0 \quad \text{as } t \to \infty.
$$
Finally, Lemma 18 states that $I^{(2)}_t$ converges weakly to an $\alpha$-stable distribution. Hence we conclude the proof.

Proof of Theorem 6. Denote by $A$ the Hessian matrix $\nabla R(x^*) = \nabla^2 f(x^*)$. Consider a corresponding linear SA process with the same noise,
$$
\tilde x_{t+1} = \tilde x_t - \gamma_{t+1}\big(A(\tilde x_t - x^*) + \xi_{t+1}(x_t)\big), \qquad \text{(B.16)}
$$
with $\tilde x_0 = x_0$. We further define $\tilde\delta_t = \tilde x_t - x^*$ and the averaging process $\bar{\tilde\delta}_t = (\tilde\delta_0 + \dots + \tilde\delta_{t-1})/t$.

Part I. We first prove that $t^{1-1/\alpha}\big(\bar\delta_t - \bar{\tilde\delta}_t\big) \to 0$ almost surely. For the linear process (B.16), the decomposition (B.12) reads
$$
\frac{t}{t^{1/\alpha}}\,\bar{\tilde\delta}_t = \frac{1}{t^{1/\alpha}} F_t \delta_0 - \frac{1}{t^{1/\alpha}} \sum_{j=1}^{t-1} \big(A^{-1} + W_{tj}\big) \xi_j, \qquad \text{(B.17)}
$$
where the matrices $F_t$ and $W_{tj}$ are defined back in (B.13) and (B.14). The non-linear process (2.1) can be viewed as a linear process with the $j$-th noise term being $\xi_j + R(x_{j-1}) - A\delta_{j-1}$. Hence, by (B.12), we have
$$
\frac{t}{t^{1/\alpha}}\,\bar\delta_t = \frac{1}{t^{1/\alpha}} F_t \delta_0 - \frac{1}{t^{1/\alpha}} \sum_{j=1}^{t-1} \big(A^{-1} + W_{tj}\big)\big(\xi_j + R(x_{j-1}) - A\delta_{j-1}\big). \qquad \text{(B.18)}
$$
Combining (B.17) and (B.18) yields the difference (cf.
Part 4 of Polyak and Juditsky [1992, Proof of Theorem 2])
$$
\frac{t}{t^{1/\alpha}}\big(\bar\delta_t - \bar{\tilde\delta}_t\big) = \frac{1}{t^{1/\alpha}} \sum_{j=1}^{t-1} \big(A^{-1} + W_{tj}\big)\big(R(x_{j-1}) - A\delta_{j-1}\big). \qquad \text{(B.19)}
$$
We also recall the assumption that $|R(x_j) - A\delta_j| \le K|\delta_j|^q$. Hence, since the norms $\|A^{-1} + W_{tj}\|$ are uniformly bounded, it suffices to show that the following term vanishes almost surely as $t \to \infty$:
$$
J_t = \frac{1}{t^{1/\alpha}} \sum_{j=1}^{t-1} |\delta_j|^q.
$$
To show this, first, by our calculation of the rate of convergence in Corollary 4,
$$
E\left[\sum_{j=1}^{t-1} j^{-1/\alpha} |\delta_j|^q\right] = \sum_{j=1}^{t-1} \tilde O\big(j^{-(\rho q(\alpha-1)+1)/\alpha}\big) = O(1).
$$
The last equality holds since $-(\rho q(\alpha-1)+1)/\alpha < -1$, which is equivalent to $q > 1/\rho$. Hence, we have
$$
P\left[\sum_{j=1}^{\infty} j^{-1/\alpha} |\delta_j|^q < \infty\right] = 1. \qquad \text{(B.20)}
$$
By Kronecker's lemma, (B.20) implies that $P[\lim_{t\to\infty} J_t = 0] = 1$. This further implies that the left hand side of (B.19), $t^{1-1/\alpha}\big(\bar\delta_t - \bar{\tilde\delta}_t\big)$, converges to 0 almost surely.

Part II. It remains to show that $t^{1-1/\alpha}\,\bar{\tilde\delta}_t$ converges weakly to an $\alpha$-stable distribution. Define $\bar x_t = (x_0 + \dots + x_{t-1})/t$ and $\bar{\tilde x}_t = (\tilde x_0 + \dots + \tilde x_{t-1})/t$. Since $t^{1-1/\alpha}\big(\bar x_t - \bar{\tilde x}_t\big) = t^{1-1/\alpha}\big(\bar\delta_t - \bar{\tilde\delta}_t\big) \to 0$ almost surely, we know a fortiori that $\bar x_t - \bar{\tilde x}_t \to 0$ almost surely. In the decomposition $\xi_{t+1}(x_t) = \zeta_{t+1} + m_{t+1}(x_t)$, the state-dependent component $m_{t+1}(x_t)$ satisfies not only (3.3), i.e.,
$$
E\big[|m_{t+1}(x_t)|^2 \mid \mathcal{F}_t\big] \le K(1 + |x_t|^2),
$$
but also
$$
E\big[|m_{t+1}(x_t)|^2 \mid \mathcal{F}_t\big] \le C(1 + |\tilde x_t|^2).
$$
Hence, combining the discussion above and Lemma 18, we know that the linear recursion (B.16) defines a process that satisfies Theorem 5. (The only difference is that $\kappa$, instead of (B.15), can be taken from the range $(\rho, 1)$ under the assumption of the current theorem, since by Theorem 3, $\sup_{t\in\mathbb{N}_+} E[|\xi_t|^p] < \infty$ for every $1 \le p < \alpha$.) It then follows from Theorem 5 that $t^{1-1/\alpha}\,\bar{\tilde\delta}_t$ converges weakly to an $\alpha$-stable distribution.

The proof is complete.
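The averaging behavior established in Theorems 5 and 6 can be illustrated numerically. The sketch below is our own illustration, not part of the paper: it runs SGD on the scalar quadratic $f(x) = x^2/2$ (so $x^* = 0$ and $A = 1$) with i.i.d. symmetric Pareto gradient noise of tail index $\alpha = 1.5$ (infinite variance, in the domain of attraction of an $\alpha$-stable law) and step size $\gamma_t \asymp t^{-\rho}$; the values of $\alpha$, $\rho$, $T$ and $x_0$ are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

alpha, rho = 1.5, 0.6     # illustrative tail index and step-size exponent
T = 20_000

def heavy_noise(size):
    # Symmetric Pareto noise with tail index alpha: P(|xi| > x) = x^(-alpha)
    # for x >= 1, which lies in the domain of attraction of an alpha-stable
    # law (cf. the Pareto example in Appendix C.2).
    u = rng.uniform(size=size)
    signs = rng.choice([-1.0, 1.0], size=size)
    return signs * u ** (-1.0 / alpha)

noise = heavy_noise(T)
x = 5.0                   # initialization x_0, so delta_0 = 5
iterates = np.empty(T)
for t in range(T):
    gamma = (t + 1) ** (-rho)        # gamma_t ~ t^(-rho)
    x = x - gamma * (x + noise[t])   # SGD step: grad f(x) = x, A = 1
    iterates[t] = x

# Polyak-Ruppert average of delta_t = x_t - x* and its CLT scaling:
# t^(1 - 1/alpha) * delta_bar_t is approximately alpha-stable for large t.
delta_bar = iterates.mean()
scaled = T ** (1.0 - 1.0 / alpha) * delta_bar
print(scaled)
```

Despite the infinite noise variance, the late iterates concentrate around $x^* = 0$, while `scaled` behaves like one draw from a heavy-tailed (stable-like) limit rather than a Gaussian.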
C Additional Technical Background

C.1 Properties of α-Stable Distributions

An $\alpha$-stable random variable $X$ is denoted by $X \sim \mathcal{S}_\alpha(\sigma, \theta, \mu)$, where $\alpha \in (0,2]$ is the tail-index, $\theta \in [-1,1]$ is the skewness parameter, $\sigma \ge 0$ is the scale parameter, and $\mu \in \mathbb{R}$ is called the location parameter. An $\alpha$-stable random variable $X$ is uniquely characterized by its characteristic function:
$$
E[\exp(iuX)] = e^{-\sigma^\alpha |u|^\alpha \left(1 - i\theta\, \mathrm{sgn}(u) \tan(\pi\alpha/2)\right) + i\mu u}, \quad \text{if } \alpha \ne 1,
$$
and
$$
E[\exp(iuX)] = e^{-\sigma |u| \left(1 + i\theta \frac{2}{\pi} \mathrm{sgn}(u) \log|u|\right) + i\mu u}, \quad \text{if } \alpha = 1,
$$
for any $u \in \mathbb{R}$. The mean of $X$ coincides with $\mu$ if $\alpha > 1$; otherwise the mean of $X$ is undefined. The skewness parameter $\theta$ is a measure of asymmetry. We say that $X$ follows a symmetric $\alpha$-stable distribution, denoted as $\mathcal{S}\alpha\mathcal{S}(\sigma) = \mathcal{S}_\alpha(\sigma, 0, 0)$, if $\theta = 0$ (and $\mu = 0$). The tail-index parameter $\alpha \in (0,2]$ determines the tail thickness of the distribution, and $\sigma > 0$ measures the spread of $X$ around its mode. When $\alpha < 2$, $\alpha$-stable distributions have heavy tails, so that their moments are finite only up to the order $\alpha$. More precisely, let $X \sim \mathcal{S}_\alpha(\sigma, \theta, \mu)$ with $0 < \alpha < 2$. Then $E[|X|^p] < \infty$ for any $0 < p < \alpha$ and $E[|X|^p] = \infty$ for any $p \ge \alpha$, which implies infinite variance (see e.g. [Samorodnitsky and Taqqu, 1994, Property 1.2.16]). When $0 < \alpha < 2$, the tail probabilities of $X$ are described by the formulas:
$$
\lim_{x\to\infty} x^\alpha\, P(X > x) = \frac{1+\theta}{2}\, C_\alpha \sigma^\alpha, \qquad \lim_{x\to\infty} x^\alpha\, P(X < -x) = \frac{1-\theta}{2}\, C_\alpha \sigma^\alpha,
$$
where $C_\alpha := (1-\alpha)/(\Gamma(2-\alpha)\cos(\pi\alpha/2))$ if $\alpha \ne 1$ and $C_\alpha := 2/\pi$ if $\alpha = 1$ (see e.g. [Samorodnitsky and Taqqu, 1994, Property 1.2.15]). The family of $\alpha$-stable distributions includes the normal, Lévy and Cauchy distributions as special cases, and can be used to model many complex stochastic phenomena [Sarafrazi and Yazdi, 2019, Fiche et al., 2013, Farsad et al., 2015].

C.2 Domains of Attraction of Stable Distributions

Let $X_i$ be an i.i.d. sequence with a common distribution whose distribution function is denoted as $F$, and let $S_n := X_1 + X_2 + \dots + X_n$.
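The special cases mentioned above can be checked numerically. The snippet below is our own sketch, not part of the paper: it draws $\mathcal{S}\alpha\mathcal{S}(1)$ samples with the Chambers–Mallows–Stuck method, a standard simulation technique we assume here, and verifies that $\alpha = 2$ recovers a Gaussian with variance $2\sigma^2$ and that $\alpha = 1$ recovers the Cauchy distribution.

```python
import numpy as np

rng = np.random.default_rng(1)

def sas_sample(alpha, size, rng):
    # Symmetric alpha-stable S_alpha S(1) samples via the
    # Chambers-Mallows-Stuck method: U uniform on (-pi/2, pi/2),
    # W standard exponential, independent.
    u = rng.uniform(-np.pi / 2, np.pi / 2, size)
    w = rng.exponential(1.0, size)
    if alpha == 1.0:
        return np.tan(u)  # alpha = 1 reduces to the standard Cauchy
    return (np.sin(alpha * u) / np.cos(u) ** (1.0 / alpha)
            * (np.cos((1.0 - alpha) * u) / w) ** ((1.0 - alpha) / alpha))

n = 200_000
gauss = sas_sample(2.0, n, rng)    # alpha = 2: N(0, 2 sigma^2)
cauchy = sas_sample(1.0, n, rng)   # alpha = 1: Cauchy
heavy = sas_sample(1.5, n, rng)    # 1 < alpha < 2: finite mean, infinite variance

print(gauss.std())                  # ~ sqrt(2), since Var = 2 sigma^2 with sigma = 1
print(np.median(np.abs(cauchy)))    # ~ 1, since the Cauchy quartiles sit at +/- 1
print(np.mean(np.abs(heavy) ** 0.5))  # finite p-th moment for p = 0.5 < alpha
```

For `heavy`, the sample mean of $|X|^{1/2}$ stabilizes (a moment of order $p < \alpha$ exists), whereas the sample variance would keep growing with $n$.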
Suppose that, for some normalizing constants $a_n > 0$ and $b_n$, the sequence $S_n/a_n - b_n$ has a non-degenerate limit distribution with distribution function $G$, i.e.
$$
\lim_{n\to\infty} P\big(S_n/a_n - b_n \le x\big) = G(x), \qquad \text{(C.1)}
$$
for all continuity points $x$ of $G$. Then such limit distributions $G$ are stable distributions, and the set of distribution functions $F$ such that $S_n/a_n - b_n$ converges to a particular stable distribution is called its domain of attraction.

Next, let us provide a sufficient and necessary condition for being in the domain of attraction of a stable distribution. The class of distribution functions $F$ for which $S_n/a_n - b_n$ converges to $\mathcal{S}\alpha\mathcal{S}(\sigma)$ is called the $\alpha$-stable domain of attraction, and we denote it as $F \in \mathcal{D}_\alpha$. Before we proceed, let us recall that a positive measurable function $f$ is regularly varying if there exists a constant $\gamma \in \mathbb{R}$ such that $\lim_{t\to\infty} f(tx)/f(t) = x^\gamma$ for every $x > 0$. In this case, we denote $f \in RV_\gamma$, and we say a function $f$ is slowly varying if $f \in RV_0$.

Define the characteristic functions $\phi(u) := \int_{-\infty}^{\infty} e^{iux}\, \mathrm{d}F(x)$ and $\psi(u) := \int_{-\infty}^{\infty} e^{iux}\, \mathrm{d}G(x)$, and also define $\lambda(u) := \phi(1/u)$ and $g(u) := \psi(1/u)$ for $u \in [-\infty, \infty] \setminus \{0\}$. We also denote $U(x) := \mathrm{Re}\,\lambda(x)$ and $V(x) := \mathrm{Im}\,\lambda(x)$. By Lévy's continuity theorem for characteristic functions (see e.g. Feller [1971, Chapter XV.3]), the convergence in (C.1) is equivalent to $\lim_{n\to\infty} \exp(-i b_n/u)\, \lambda^n(a_n u) = g(u)$, $u \in [-\infty, \infty] \setminus \{0\}$, uniformly on neighborhoods of $\pm\infty$. Based on this, one can show that if (C.1) holds, then $|g(u)| = \exp(-c|u|^{-\alpha})$ for some $\alpha \in (0,2]$ and $c > 0$, and $-\log|\lambda| \in RV_{-\alpha}$, i.e. $-\log|\lambda|$ is regularly varying with index $-\alpha$. Next, we state a sufficient and necessary condition for being in the $\alpha$-stable domain of attraction.

Theorem 19 (Geluk and de Haan [2000], Theorem 1). Suppose $0 < \alpha < 2$.
Every $\alpha$-stable random variable $X$ has a characteristic function of the form:
$$
E[\exp(iuX)] = \exp\left(-\left\{|u|^\alpha + iu(2p-1)\big\{(1-\alpha)\tan(\alpha\pi/2)\big\}\, \frac{|u|^{\alpha-1} - 1}{1-\alpha}\right\}\right),
$$
for some $0 \le p \le 1$, with $(1-\alpha)\tan(\alpha\pi/2)$ defined to be $2/\pi$ at $\alpha = 1$. The following statements are equivalent:

(i) $F \in \mathcal{D}_\alpha$.

(ii) $1 - F(x) + F(-x) \in RV_{-\alpha}$ and there exists a constant $p \in [0,1]$ such that
$$
\lim_{x\to\infty} \frac{1 - F(x)}{1 - F(x) + F(-x)} = p.
$$

(iii) $1 - U(x) \in RV_{-\alpha}$ and there exists a constant $p \in [0,1]$ such that
$$
\lim_{x\to\infty} \frac{xuV(xu) - xV(x)}{x(1 - U(x))} = (2p-1)\big\{(1-\alpha)\tan(\alpha\pi/2)\big\}\, \frac{|u|^{1-\alpha} - 1}{1-\alpha}, \qquad u \in \mathbb{R} \setminus \{0\}.
$$

Furthermore, [Geluk and de Haan, 2000, Theorem 1] showed that if any of (i), (ii), (iii) holds, then
$$
\lim_{x\to\infty} \frac{1 - U(x)}{1 - F(x) + F(-x)} = \Gamma(1-\alpha)\cos(\alpha\pi/2)
$$
and
$$
\lim_{x\to\infty} \frac{V(x) - x^{-1}\int_0^x \big(1 - F(y) - F(-y)\big)\, \mathrm{d}y}{1 - F(x) + F(-x)} = (2p-1)\left(\Gamma(1-\alpha)\sin(\alpha\pi/2) - \frac{1}{1-\alpha}\right).
$$

Let us illustrate [Geluk and de Haan, 2000, Theorem 1] with the example of the Pareto distribution, which is a power-law distribution widely applied in various fields. A random variable $X$ is said to follow a Pareto distribution (of type I) if there exists some $c > 0$ such that $P(X > x) = (x/c)^{-\alpha}$ for any $x \ge c$ and $P(X > x) = 1$ for any $x < c$. In this case, $F(x) = 1 - (x/c)^{-\alpha}$ for any $x \ge c$ and $F(x) = 0$ for any $x < c$. It follows that $1 - F(x) + F(-x) \in RV_{-\alpha}$ and $\lim_{x\to\infty} \frac{1 - F(x)}{1 - F(x) + F(-x)} = 1$. Therefore, $F \in \mathcal{D}_\alpha$, and the Pareto distribution is in the $\alpha$-stable domain of attraction.

When the tail-index $\alpha \in (0,2)$, the logarithm of the characteristic function $E[e^{iuX}]$ of an $\alpha$-stable random variable $X$ is of the form (see [Gnedenko and Kolmogorov, 1954, equation (12), page 168]):
$$
i\gamma u + c_1 \int_{-\infty}^{0} \left[e^{iux} - 1 - \frac{iux}{1+x^2}\right] \frac{\mathrm{d}x}{|x|^{1+\alpha}} + c_2 \int_{0}^{\infty} \left[e^{iux} - 1 - \frac{iux}{1+x^2}\right] \frac{\mathrm{d}x}{x^{1+\alpha}}, \qquad \text{(C.2)}
$$
where $c_1, c_2 \ge 0$ and $\gamma \in \mathbb{R}$.
Since the characteristic function uniquely characterizes a probability distribution, the triplet $(c_1, c_2, \alpha)$ uniquely determines an $\alpha$-stable law up to a constant shift $\gamma \in \mathbb{R}$ when $0 < \alpha < 2$. [Gnedenko and Kolmogorov, 1954, Theorem 2, page 175] provides another sufficient and necessary condition for being in the domain of attraction of an $\alpha$-stable distribution, which complements [Geluk and de Haan, 2000, Theorem 1]. Suppose $0 < \alpha < 2$. Then the distribution function $F(x)$ belongs to the domain of attraction of an $\alpha$-stable distribution if and only if the following conditions hold:

(i) $\lim_{x\to\infty} \frac{F(-x)}{1 - F(x)} = \frac{c_1}{c_2}$.

(ii) For every constant $\kappa > 0$, $\lim_{x\to\infty} \frac{1 - F(x) + F(-x)}{1 - F(\kappa x) + F(-\kappa x)} = \kappa^\alpha$.

In the case of a Pareto distribution (of type I), for some $c > 0$, we have $F(x) = 1 - (x/c)^{-\alpha}$ for any $x \ge c$ and $F(x) = 0$ for any $x < c$. Then we can check that $\lim_{x\to\infty} \frac{F(-x)}{1 - F(x)} = 0$ and, for every constant $\kappa > 0$,
$$
\lim_{x\to\infty} \frac{1 - F(x) + F(-x)}{1 - F(\kappa x) + F(-\kappa x)} = \lim_{x\to\infty} \frac{(x/c)^{-\alpha}}{(\kappa x/c)^{-\alpha}} = \kappa^\alpha.
$$
Thus, the Pareto distribution belongs to the domain of attraction of an $\alpha$-stable distribution. Finally, let us provide a sufficient and necessary condition for being in the domain of normal attraction of a stable distribution.

Theorem 20 (Gnedenko and Kolmogorov [1954], Theorem 5, page 181). Suppose $0 < \alpha < 2$. The distribution function $F(x)$ belongs to the domain of normal attraction of an $\alpha$-stable distribution characterized by (C.2) if and only if
$$
F(x) = \big(c_1 a^\alpha + \alpha_1(x)\big)\, \frac{1}{|x|^\alpha}, \quad \text{for } x < 0, \qquad \text{(C.3)}
$$
$$
F(x) = 1 - \big(c_2 a^\alpha + \alpha_2(x)\big)\, \frac{1}{x^\alpha}, \quad \text{for } x > 0, \qquad \text{(C.4)}
$$
where $a > 0$ is a positive constant and $\alpha_1(x)$, $\alpha_2(x)$ are functions satisfying $\lim_{x\to-\infty} \alpha_1(x) = \lim_{x\to\infty} \alpha_2(x) = 0$. Indeed, the constant $a$ in (2.2), (C.3) and (C.4) is the same.

In the case of a Pareto distribution (of type I), for some $c > 0$, we have $F(x) = 1 - (x/c)^{-\alpha}$ for any $x \ge c$ and $F(x) = 0$ for any $x < c$.
Then we can check that (C.3) and (C.4) hold with $c_1 = 0$, $\alpha_1(x) \equiv 0$, $c_2 = 1$, $\alpha_2(x) \equiv 0$, and $a = c$. Thus, the Pareto distribution belongs to the domain of normal attraction of an $\alpha$-stable distribution.
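The normal-attraction conditions of Theorem 20 can likewise be checked numerically for the Pareto example. The snippet below is our own sketch, not part of the paper: solving (C.4) for $\alpha_2$ with the choices $c_2 = 1$ and $a = c$ gives $\alpha_2(x) = x^\alpha(1 - F(x)) - a^\alpha$, which vanishes identically for $x \ge c$ (up to floating-point rounding); the values of $\alpha$ and $c$ are illustrative assumptions.

```python
import numpy as np

alpha, c = 1.7, 3.0   # illustrative tail index and scale

def F(x):
    # Pareto (type I) CDF, as in the text.
    x = np.atleast_1d(np.asarray(x, dtype=float))
    out = np.zeros_like(x)
    tail = x >= c
    out[tail] = 1.0 - (x[tail] / c) ** (-alpha)
    return out

# alpha_2(x) = x^alpha * (1 - F(x)) - a^alpha with c_2 = 1 and a = c;
# for the Pareto distribution this is zero for every x >= c.
xs = np.array([c, 10.0, 100.0, 1e4])
alpha2 = xs ** alpha * (1.0 - F(xs)) - c ** alpha
print(alpha2)

# For x < 0 we have F(x) = 0, so (C.3) holds with c_1 = 0 and alpha_1 = 0.
print(F(-5.0).item())
```

Since $\alpha_2$ and $\alpha_1$ are not merely vanishing in the limit but identically zero, the Pareto distribution sits in the domain of *normal* attraction, with the norming constant $a$ equal to its scale $c$.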