SGD in the Large: Average-case Analysis, Asymptotics, and Stepsize Criticality
Courtney Paquette*† Kiwon Lee† Fabian Pedregosa* Elliot Paquette†

February 9, 2021
Abstract
We propose a new framework, inspired by random matrix theory, for analyzing the dynamics of stochastic gradient descent (SGD) when both the number of samples and dimensions are large. This framework applies to any fixed stepsize and to the finite sum setting. Using this new framework, we show that the dynamics of SGD on a least squares problem with random data become deterministic in the large sample and dimensional limit. Furthermore, the limiting dynamics are governed by a Volterra integral equation. This model predicts that SGD undergoes a phase transition at an explicitly given critical stepsize that ultimately affects its convergence rate, which we also verify experimentally. Finally, when the input data is isotropic, we provide explicit expressions for the dynamics and average-case convergence rates (i.e., the complexity of an algorithm averaged over all possible inputs). These rates show significant improvement over the worst-case complexities.
Stochastic gradient descent (SGD) [Robbins and Monro, 1951] is one of the most popular and important stochastic optimization methods for use in large-scale problems. There are well-established worst-case convergence rates, but SGD lacks a detailed theory that encompasses both its successes and its empirically observed peculiarities. For example, the solutions to which SGD converges have qualitative differences that seemingly depend on how SGD is tuned [Jastrzebski et al., 2017, Keskar et al., 2016]. Furthermore, the dependence of the runtime of SGD on its stepsize is complicated, and stepsize selection is an active area of research [Bollapragada et al., 2018, Friedlander and Schmidt, 2012, Mahsereci and Hennig, 2017, Schaul et al., 2013, Vaswani et al., 2019]. Beyond the confines of SGD, the behavior of other stochastic optimization algorithms is even more poorly understood [Sutskever et al., 2013]. Because of these challenges, making good quantitative predictions for the dynamics of stochastic algorithms remains a difficult, broad and deep problem.

A prolific technique for analyzing optimization methods, both stochastic and deterministic, is the stochastic differential equations (SDE) paradigm [Chaudhari and Soatto, 2018, Hu et al., 2017, Jastrzebski et al., 2017, Kushner and Yin, 2003, Li et al., 2017, Ljung, 1977, Mandt et al., 2016, Su et al., 2016]. These SDEs relate to the dynamics of the optimization method by taking the limit as the stepsize goes to zero, so that the trajectory of the objective function over the lifetime of the algorithm converges to the solution of an SDE. Naturally, in practice, the stepsize is taken as large as possible, which limits the predictive power of the SDE method.

A related popular paradigm for analyzing the behavior of SGD is the noisy gradient model. Often used in conjunction with the SDE approach, in this model one supposes that the stochastic gradient estimators in SGD are the true gradient plus some independent noise (typically assumed to be Gaussian with some covariance structure) [Jastrzebski et al., 2017, Li et al., 2017, Mandt et al., 2016, Simsekli et al., 2019] or, more generally, that the gradient estimators are independent with a common distribution; see, e.g.,
Huang et al. [2020]. The latter is equivalent to the "streaming setting" [Jain et al., 2018] or the "one-pass" assumption on the data [Gurbuzbalaban et al., 2020].

*Google Research, Brain Team. †Department of Mathematics and Statistics, McGill University, Montreal, QC, Canada, H3A 0B9. CP is a CIFAR AI chair; https://cypaquette.github.io/. Research by EP was supported by a Discovery Grant from the Natural Science and Engineering Research Council (NSERC) of Canada; https://elliotpaquette.github.io/.
Figure 1: The proposed Volterra equation model accurately tracks SGD on a random least-squares problem for any choice of stepsize. Other models introduce biases that substantially impact model fidelity.

Here one generates a new sample at each iteration and does not reuse any past data. In practice, SGD is typically implemented on a finite dataset with multiple passes over the data. Such modeling assumptions on the stochastic gradient estimators cannot capture the full dynamics of SGD (see Figure 1).

We offer a new alternative, inspired by the phenomenology of random matrix theory. We prove that SGD with a fixed stepsize γ has deterministic dynamics when run on the least squares problem with high-dimensional random data, and we analyze these dynamics to provide stepsize selection and convergence properties (see Figure 1 for a comparison). We neither impose assumptions on the gradient estimators nor take the stepsize to 0, and we work in the non-streaming or finite sum setting (a.k.a. incremental gradient). These deterministic dynamics are governed by a Volterra integral equation; that is, the function values converge to the solution ψ of

ψ(t) = z(t) + rγ² ∫₀ᵗ h₂(t − s) ψ(s) ds, where h₂(t) = ∫₀^∞ x² e^{−2γtx} dμ(x).  (1)

Here, r is the ratio of the number of parameters to the sample size, and μ is the distribution of the eigenvalues of the objective's Hessian. The function z is an explicit forcing function, which depends on all parts of the problem, including the initialization x₀ and the target b. See Theorem 1.1 for the precise statement. The value of the stepsize γ can be as large as the convergence threshold, which we explicitly provide. This Volterra equation (1) has rich behavior; the asymptotic suboptimality of ψ has a discontinuity in γ at a critical stepsize (see Theorem 1.2).
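The Volterra equation (1) is also straightforward to solve numerically. The following is a minimal sketch (our own illustration, not the authors' code) that discretizes the convolution on a uniform time grid, approximates μ by the empirical eigenvalue distribution of a sampled H, and uses the full forcing function from Theorem 1.1; the helper name `volterra_psi` and all parameter defaults are our choices.

```python
import numpy as np

def volterra_psi(lam, r, gamma, R2, Rtil2, T=10.0, m=2000):
    """Solve psi(t) = z(t) + r*gamma^2 * int_0^t h2(t-s) psi(s) ds by a
    left-endpoint discretization, with mu approximated by the empirical
    eigenvalues `lam` of H = A^T A / n (cf. Theorem 1.1 for z and h_k)."""
    t = np.linspace(0.0, T, m)
    dt = t[1] - t[0]
    # h_k(t) = int x^k exp(-2*gamma*t*x) dmu(x), as an empirical average
    h = lambda k: np.mean(lam**k * np.exp(-2 * gamma * np.outer(t, lam)), axis=1)
    z = 0.5 * R2 * h(1) + 0.5 * Rtil2 * (r * h(0) + (1 - r))  # forcing z(t)
    K = gamma**2 * r * h(2)                                   # convolution kernel
    psi = np.empty(m)
    psi[0] = z[0]
    for i in range(1, m):
        psi[i] = z[i] + dt * np.dot(K[1:i + 1][::-1], psi[:i])
    return t, psi
```

For the Gaussian case one can take `lam = np.linalg.eigvalsh(A.T @ A / n)`; the loop is O(m²) but cheap at this grid size.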
Notation. We denote vectors in lowercase boldface (x) and matrices in uppercase boldface (H). A sequence of random variables {y_d}_{d=0}^∞ converges in probability to y, indicated by y_d →Pr y as d → ∞, if for any ε > 0, lim_{d→∞} Pr(|y_d − y| > ε) = 0. Probability measures are denoted by μ and their densities by dμ. We say a sequence of random measures μ_d converges to μ weakly in probability if for any bounded continuous function f, we have ∫ f dμ_d → ∫ f dμ in probability. For two random variables x and y we write x =^{law} y to mean they have the same distribution.

We consider the least-squares problem when the number of samples (n) and features (d) are large:

argmin_{x∈ℝ^d} { f(x) := (1/(2n)) Σ_{i=1}^n (a_iᵀx − b_i)² }, with b := Ax̃ + √n η,  (2)

where A ∈ ℝ^{n×d} is a random data matrix whose i-th row is denoted by a_i ∈ ℝ^d, x̃ ∈ ℝ^d is the signal vector, and η ∈ ℝ^n is a source of noise. The target b = Ax̃ + √n η comes from a generative model corrupted by noise.

We apply SGD (incremental gradient) to the finite sum, quadratic problem above. On the k-th iteration it selects a uniformly random subset B_k ⊂ {1, 2, ..., n} of batch size β and makes the update

x_{k+1} = x_k − (γ/n) Σ_{i∈B_k} ∇f_i(x_k) = x_k − (γ/n) AᵀP_k(Ax_k − b), where P_k := Σ_{i∈B_k} e_i e_iᵀ.  (3)

Here P_k is a random orthogonal projection matrix with e_i the i-th standard basis vector, β ∈ ℕ is a batch-size parameter, which we will allow to depend on n, γ > 0 is a stepsize parameter, and the function f_i is the i-th element of the sum in (2). Typical stepsizes for SGD (see e.g. [Bottou et al., 2018, Thm 4.6]) involve the second moment of the stochastic gradients, which under our problem setting grows like n. This explains the dependence on n in (3). We remark that β can equal 1, in which case (3) reduces to the simple SGD setting.

To perform our analysis we make the following explicit assumptions on the signal x̃, the noise η, and the data matrix A.

Assumption 1.1 (Initialization, signal, and noise). The initial vector x₀ ∈ ℝ^d, the signal x̃ ∈ ℝ^d, and noise vector η ∈ ℝ^n satisfy the following conditions:

1. The difference x₀ − x̃ is any deterministic vector such that ‖x₀ − x̃‖² = R².
2. The entries of the noise vector η are i.i.d. random variables that verify, for some constant R̃ > 0,

E[η₁] = 0, E[‖η‖²] = R̃², and E[‖η‖_∞^p] = O(n^{ε−p/2}) for any ε, p > 0.  (4)

Any subexponential law for the entries of η (say, uniform or Gaussian with variance R̃²/n) will satisfy (4). The scalings of the vectors x₀ − x̃ and η arise as a result of preserving a constant signal-to-noise ratio in the generative model. Such generative models with this scaling have been used in numerous works [Gerbelot et al., 2020, Hastie et al., 2019, Mei and Montanari, 2019].

Next we state an assumption on the eigenvalue and eigenvector distribution of the data matrix A. We then review practical scenarios in which it is verified.

Assumption 1.2 (Data matrix). Let A be a random n × d matrix such that the number of features, d, tends to infinity proportionally to the size of the data set, n, so that d/n → r ∈ (0, ∞). Let H := (1/n)AᵀA with eigenvalues λ₁ ≤ ... ≤ λ_d, and let δ_{λᵢ} denote the Dirac delta with mass at λᵢ. We make the following assumptions on the eigenvalue distribution of this matrix:

1. The eigenvalue distribution of H converges to a deterministic limit μ with compact support. Formally, the empirical spectral measure (ESM) satisfies

μ_H = (1/d) Σ_{i=1}^d δ_{λᵢ} → μ weakly in probability.  (5)
2. The largest eigenvalue λ⁺_H of H converges in probability to the largest element λ⁺ in the support of μ, i.e.,

λ⁺_H →Pr λ⁺ as d → ∞.  (6)
3. (Orthogonal invariance) Let U ∈ ℝ^{d×d} and O ∈ ℝ^{n×n} be orthogonal matrices. The matrix A is orthogonally invariant in the sense that

AU =^{law} A and OA =^{law} A.  (7)

Figure 2: The ESM of matrices (1/n)AᵀA with i.i.d. entries converges as n, d → ∞ to the Marchenko-Pastur distribution, shown here for different values of r = d/n.

Assumption 1.2 characterizes the distribution of eigenvalues for the random matrix H, which approximately equals the distribution μ. The ESM and its convergence to the limiting spectral distribution μ are well studied in random matrix theory, and for many random matrix ensembles the limiting spectral distribution is known. In the machine learning literature, it has been shown that the spectra of the Hessians of neural networks share many characteristics with the limiting spectral distributions found in classical random matrix theory [Behrooz et al., 2019, Dauphin et al., 2014, Martin and Mahoney, 2018, Papyan, 2018, Sagun et al., 2016].

The last assumption, orthogonal invariance, is a rather strong condition as it implies that the singular vectors of A are uniformly distributed on the sphere. The classic example of a matrix which satisfies this property is a matrix whose entries are generated from standard Gaussians. There is, however, a large body of literature [Cipolloni et al., 2020, Knowles and Yin, 2017] showing that other classes of large dimensional random matrices behave like orthogonally invariant ensembles; weakening the orthogonal invariance assumption is an interesting future direction of research which is beyond the scope of this paper. Moreover, our numerical simulations suggest that (7) is unnecessary, as our Volterra equation holds for ensembles without this orthogonal invariance property (see one-hidden layer networks in Section 4). For a more thorough review of random matrix theory see [Bai and Silverstein, 2010, Tao, 2012].
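Before turning to concrete data distributions, the generative model (2) and the SGD update (3) can be written down directly. The following sketch is our illustration (sizes, stepsize, and noise level are arbitrary choices) of an isotropic-features instance with mini-batch SGD exactly as in (3):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, beta, gamma = 1000, 800, 8, 0.3            # so r = d/n = 0.8 (illustrative)
A = rng.standard_normal((n, d))                  # isotropic features (Example 1)
x_tilde = rng.standard_normal(d) / np.sqrt(d)    # signal vector
eta = 0.5 * rng.standard_normal(n) / np.sqrt(n)  # noise: E||eta||^2 = Rtil^2 = 0.25
b = A @ x_tilde + np.sqrt(n) * eta               # generative model (2)

def f(x):                                        # least-squares objective in (2)
    return 0.5 / n * np.linalg.norm(A @ x - b) ** 2

x = np.zeros(d)                                  # x_0; R^2 = ||x_0 - x_tilde||^2
for k in range(20 * n // beta):                  # roughly 20 passes over the data
    B = rng.choice(n, size=beta, replace=False)  # uniformly random batch B_k
    # SGD update (3): x_{k+1} = x_k - (gamma/n) A^T P_k (A x_k - b)
    x -= (gamma / n) * A[B].T @ (A[B] @ x - b[B])
```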
Examples of data distributions. In this section we review examples of data-generating distributions that verify Assumption 1.2.
Example 1: Isotropic features with Gaussian entries.
The first model we consider has entries of A which are i.i.d. standard Gaussian random variables, that is, A_ij ∼ N(0, 1) for all i, j. This ensemble has a rich history in random matrix theory. When the number of features d tends to infinity proportionally to the size of the data set n, d/n → r ∈ (0, ∞), the seminal work of Marčenko and Pastur [1967] showed that the spectrum of H = (1/n)AᵀA asymptotically approaches a deterministic measure μ_MP, verifying Assumption 1.2. This measure, μ_MP, is given by the Marchenko-Pastur law:

dμ_MP(λ) := δ₀(λ) max{1 − 1/r, 0} + (√((λ − λ⁻)(λ⁺ − λ)) / (2πλr)) 1_{[λ⁻, λ⁺]} dλ, where λ⁻ := (1 − √r)² and λ⁺ := (1 + √r)².  (8)
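As a quick empirical check of (8) (our sketch, not from the paper), one can sample a Gaussian data matrix, compute the spectrum of H, and compare its histogram against the Marchenko-Pastur density:

```python
import numpy as np

rng = np.random.default_rng(1)
n, r = 2000, 0.5
d = int(r * n)
A = rng.standard_normal((n, d))
lam = np.linalg.eigvalsh(A.T @ A / n)            # eigenvalues of H

# Marchenko-Pastur density on [lam_minus, lam_plus]; no atom at 0 since r < 1
lm, lp = (1 - np.sqrt(r)) ** 2, (1 + np.sqrt(r)) ** 2
x = np.linspace(lm + 1e-6, lp - 1e-6, 400)
density = np.sqrt((x - lm) * (lp - x)) / (2 * np.pi * x * r)

hist, edges = np.histogram(lam, bins=50, density=True)
centers = 0.5 * (edges[:-1] + edges[1:])
# `hist` should track `density` evaluated near `centers` as n, d grow.
```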
Example 2: Planted spectrum. One may wonder if there are limits to what singular value distributions can appear for orthogonally invariant random matrices, but as it turns out, any singular value distribution is attainable. Suppose that

A = UΣVᵀ,  (9)

where V ∈ ℝ^{d×d} and U ∈ ℝ^{n×n} are random matrices, uniformly chosen from the orthogonal group, and Σ ∈ ℝ^{n×d} is any deterministic matrix such that the squared singular values of Σ have an empirical distribution that converges to a desired limit μ. Then A is orthogonally invariant. As in the previous case, we assume that the dimensions of the matrix A grow at a comparable rate given by d/n → r ∈ (0, ∞). Constructions like this appear in neural networks initialized with random orthogonal weight matrices, and they produce exotic singular value distributions [Saxe et al., 2013, Figure 7].

Example 3: Linear neural networks.
This model encompasses linear neural networks with a squared loss, where the m layers have random weights (W_i with i = 1, ..., m) and the final layer's weights are given by the regression coefficient x. The entries of these random weight matrices are sampled i.i.d. from a standard Gaussian. The optimization problem in (2) becomes

min_x { f(x) = (1/(2n)) ‖Ax − b‖² }, where A = W₁W₂ ⋯ W_m.  (10)

It is known that products of Gaussian matrices satisfy (7) with a limiting spectral measure, in the large n limit and fixed depth, given by the Fuss-Catalan law [Alexeev et al., 2010, Liu et al., 2011].
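A numerical sketch of this example (ours; square layers for simplicity, whereas the text allows rectangular W_i): the squared singular values of a product of independent scaled Gaussian matrices follow the Fuss-Catalan law in the large-n limit.

```python
import numpy as np

rng = np.random.default_rng(2)
n, depth = 500, 3
A = np.eye(n)
for _ in range(depth):                           # A = W_1 W_2 ... W_m
    A = A @ (rng.standard_normal((n, n)) / np.sqrt(n))
sq_singvals = np.linalg.svd(A, compute_uv=False) ** 2
# The histogram of sq_singvals approaches the Fuss-Catalan law of order
# `depth` as n grows (depth = 1 recovers Marchenko-Pastur with r = 1).
```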
A new paradigm for analyzing the dynamics of SGD. We propose a framework for the analysis of SGD that exploits the fact that, when the problem size increases (i.e., n and d large), statistics that are driven by the full population converge to deterministic processes; the same spirit is behind the law of large numbers and concentration of measure. A practical outcome of this framework is a new expression for the function values of SGD as a Volterra equation:

Theorem 1.1 (Concentration of SGD). Suppose the stepsize satisfies γ < 2(r ∫₀^∞ x dμ(x))^{-1} and the batch size satisfies β(n) ≤ n^{1/2−δ} for some δ > 0. Under Assumptions 1.1 and 1.2,

sup_{0≤t≤T} | f(x_{⌊nt/β⌋}) − ψ(t) | →Pr 0 as n → ∞,  (11)

where the function ψ is the solution to the Volterra equation

ψ(t) = (R²/2) h₁(t) + (R̃²/2)(r h₀(t) + (1 − r)) + ∫₀ᵗ γ²r h₂(t − s) ψ(s) ds, and h_k(t) = ∫₀^∞ x^k e^{−2γtx} dμ(x) for all k ≥ 0.  (12)
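Theorem 1.1 can be tested directly by overlaying an SGD trajectory on the numerical solution ψ of (12). A sketch, reusing the illustrative `volterra_psi` helper and the problem instance (`A`, `b`, etc.) from the earlier snippets:

```python
import numpy as np

rng = np.random.default_rng(3)
r = d / n
lam = np.linalg.eigvalsh(A.T @ A / n)
R2 = np.sum(x_tilde ** 2)        # ||x_0 - x_tilde||^2 with x_0 = 0
Rtil2 = np.sum(eta ** 2)         # concentrates around E||eta||^2

T = 10.0
t, psi = volterra_psi(lam, r, gamma, R2, Rtil2, T=T)

x, fvals = np.zeros(d), []
for k in range(int(T * n / beta)):               # iteration k sits at t = k*beta/n
    fvals.append(0.5 / n * np.linalg.norm(A @ x - b) ** 2)
    B = rng.choice(n, size=beta, replace=False)
    x -= (gamma / n) * A[B].T @ (A[B] @ x - b[B])
# By (11), fvals sampled at iteration floor(n*t/beta) tracks psi(t) as n grows.
```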
Figure 3: Phase transition of the convergence rate (y-axis) as a function of the stepsize (x-axis, γ) for the isotropic features model. Smaller stepsizes (dotted) yield convergence rates which depend linearly on γ, with a slope that is always frozen on λ⁻. The convergence rate abruptly changes behavior once it hits the critical stepsize (solid gray, γ*), becoming a non-linear function of γ. The critical stepsize appears to be a good predictor for the optimal stepsize. In addition, the more over-parameterized the data matrix is (r → ∞), the smaller the window of convergent stepsizes, and as H becomes ill-conditioned (r → 1), the linear rate degenerates and the high temperature phase disappears.

The expression highlights how the algorithm, stepsize, signal and noise levels interact with each other to produce different dynamics. For instance, our framework allows one to see the effect of the entire spectrum of the data matrix on the dynamics. Also, we note that the batch size β does not appear in the limiting Volterra equation. Numerical simulations in Section 4 confirm that ψ accurately predicts the behavior of SGD.

Phase transition of SGD dynamics and critical stepsize.
We prove a surprising dichotomy in the dynamics of SGD for a general measure: SGD undergoes a phase transition at a critical stepsize, which we denote by γ*,

γ* := 2 / (r ∫₀^∞ (x² / (x − λ⁻)) dμ(x)).  (13)

Starting at small stepsizes, we see that the linear rate of convergence for SGD freezes on the smallest eigenvalue of H, that is, f(x_{⌊nt/β⌋}) decreases like e^{−2γλ⁻t}. However, when γ passes the transition point γ*, the dynamics of SGD have a more complicated dependency on the stepsize (in particular, it is no longer log-linear in γ). This is strongly reminiscent of a freezing transition, often seen in the free energies of random energy models (see Derrida [1981]), with γ playing the role of temperature. This is summarized in our second main contribution – the asymptotic rates for SGD under a general spectral measure μ (see Appendix E.1).

Theorem 1.2 (Critical stepsize, asymptotic rates). Suppose r ≠ 1 (i.e., the strongly convex regime). For γ* < γ < 2(r ∫₀^∞ x dμ(x))^{-1}, the value of λ*(γ) is given as the unique solution to

rγ² ∫₀^∞ e^{2γλ*t} h₂(t) dt = 1.  (14)

The function ψ(t) satisfies, for some explicit constant c_γ(R², R̃², μ) > 0,

ψ(t) − R̃² (rμ({0}) + (1 − r)) / (2(1 − (γr/2) ∫₀^∞ x dμ(x))) ∼ c_γ e^{−2γtλ*(γ)}.  (15)

If in addition γ* > 0, and μ([λ⁻, λ⁻ + t]) ∼ c_μ t^α as t → 0, then there is a constant c(R², R̃², γ, μ) > 0 so that for 0 < γ < γ*,

ψ(t) − R̃² (rμ({0}) + (1 − r)) / (2(1 − (γr/2) ∫₀^∞ x dμ(x))) ∼ c t^{−α} e^{−2γtλ⁻}.  (16)

We also give rates for the case of r = 1 in Thm E.3. See App. E.1 for further discussion and the derivation.

            | Strongly convex, γ < γ*                          | Strongly convex, γ = γ*
    Worst   | exp(−2γtλ⁻ + γ²(λ⁺)²t)                           | exp(−2γtλ⁻ + γ²(λ⁺)²t)
    Average | exp(−2γλ⁻t) · t^{−3/2}                           | exp(−2γλ⁻t) · t^{−1/2}

            | Strongly convex, γ > γ*                          | Non-strongly convex w/ noise, r = 1, R̃² > 0
    Worst   | exp(−2γtλ⁻ + γ²(λ⁺)²t)                           | (R² + R̃² · d) · t^{−1}
    Average | exp[−γt(1 − rγ/2)((1 + r) + √((1 + r)² − 8/γ))]  | R² · t^{−3/2} + R̃² · t^{−1/2}
Table 1: Asymptotic convergence guarantees for f(x_{⌊nt/β⌋}) − R̃² max{0, 1 − r} / (2(1 − rγ/2)) on the isotropic features model. Stepsizes smaller than γ* have linear rates based only on λ⁻, multiplied by a polynomial term (t^{−α} in Theorem 1.2). Larger stepsizes have linear rates with the factor λ* made explicit here. Average-case complexities are strictly better than the worst-case complexities, in some cases by a factor of γ versus γ². Note also how the rates highlight the freezing transition in the strongly convex regime. For worst-case rates, see [Bottou et al., 2018, Theorem 4.6], [Ghadimi and Lan, 2013, Theorem 2.1]; λ⁺ can be replaced by the max-ℓ₂-row-norm in the worst-case bounds.
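Both γ* in (13) and λ*(γ) in (14) are computable numerically. Using the identity ∫₀^∞ e^{2γλt} h₂(t) dt = ∫ x²/(2γ(x − λ)) dμ(x) to evaluate the Laplace transform in closed form, the sketch below (our helper names) finds λ*(γ) by root-finding; the empirical-spectrum approximation of μ is unreliable right at the edge and assumes the spectrum is bounded away from zero.

```python
import numpy as np
from scipy.optimize import brentq

def malthus_lambda(lam, r, gamma, tol=1e-8):
    """Solve (14) for lambda*(gamma), i.e. (r*gamma/2)*int x^2/(x-s) dmu = 1,
    with mu approximated by the empirical spectrum `lam` (assumed bounded
    away from 0, i.e. r < 1).  Meaningful for gamma above the critical
    stepsize; the discrete spectrum distorts the answer near the edge."""
    lam_minus = lam.min()
    F = lambda s: 0.5 * r * gamma * np.mean(lam**2 / (lam - s)) - 1.0
    # F increases in s on (0, lam_minus) and F(0) < 0 for convergent stepsizes
    return brentq(F, tol, lam_minus - tol)

# For Marchenko-Pastur, the critical stepsize has the closed form of (23):
gamma_star = lambda r: 2 * np.sqrt(r) / (r * (r - np.sqrt(r) + 1))
```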
Average-case complexity for SGD. Our last contribution is one of the first average-case complexity results for any stochastic optimization algorithm. The value ψ(t) is the average function value at iteration t after first taking the model size to infinity. Consequently, this yields a notion of average complexity for SGD to reach a neighborhood. When the data matrix A satisfies the isotropic features model, we give an explicit formula for the expected function values ψ(t), the critical stepsize γ*, and the corresponding λ* (Appendix E.2, Theorem E.5 and Section 3, Theorem 3.1). Table 1 summarizes our average rates.

The average-case complexity in the strongly convex case has significantly better linear rates than the worst-case guarantees and, in particular, there is no dependence on λ⁺. We additionally capture a second-order behavior, the polynomial correction term (the polynomial factors in Table 1). This polynomial term has little effect on the complexity compared to the linear rate. However, as the matrix H becomes ill-conditioned (r → 1), the polynomial correction starts to dominate the average-case complexity. The sublinear rates in Table 1 for r = 1 show this effect, and it accounts for the improved average-case rates in the convex setting. This improvement in the average rate indeed highlights that the support of the spectrum does not fully determine the rate. Many eigenvalues contribute meaningfully to the average rate. Hence, our results are not and cannot be purely explained by the support of the spectrum. As noted in Paquette et al. [2020], the worst-case rates when r = 1 have dimension-dependent constants due to the distance to the optimum, ‖x₀ − x*‖² ≈ d, which appears in the bounds.

Related work.
Average-case versus worst-case complexity.
Traditional worst-case analyses of optimization algorithms provide complexity bounds that hold no matter how unlikely the worst case [Nemirovski, 1995, Nesterov, 2004]. There is a plethora of results on the worst-case analysis of SGD [Bertsekas and Tsitsiklis, 2000, Bottou et al., 2018, Ghadimi and Lan, 2013, Gower et al., 2019, Robbins and Monro, 1951] and, in particular, specific results for SGD applied to the least squares problem (see e.g.
Bertsekas [1997], Jain et al. [2018]). Worst-case analysis gives convergence guarantees, but the bounds are not always representative of the typical runtime.

Average-case analysis, in contrast, gives sharper runtime estimates when some or all of the inputs are random. This type of analysis has a long history in computer science and numerical analysis, and it is often used to justify the superior performance of algorithms such as QuickSort [Hoare, 1962] and the simplex method; see, e.g., [Borgwardt, 1986, Smale, 1983, Spielman and Teng, 2004]. Despite its rich history, average-case analysis is rarely used in optimization due to the ill-defined notion of a typical objective function. Recently, Lacotte and Pilanci [2020], Pedregosa and Scieur [2020] derived a framework for average-case analysis of gradient-based methods on the least-squares problem with vanishing noise, and it was later extended by Paquette et al. [2020]. Similar results for the conjugate gradient method were derived in Deift and Trogdon [2020], Paquette and Trogdon [2020]. Our work is in the same line of research, providing the first average-case complexity for SGD.

For stochastic algorithms, Sagun et al. [2017] showed empirical evidence that SGD on neural networks exhibits concentration of the function values. Other works [Gurbuzbalaban et al., 2020, Huang et al., 2020, Mei et al., 2018, 2019, Sirignano and Spiliopoulos, 2020] have used random matrix theory to analyze stochastic algorithms, but only in online or one-pass settings (n ≫ d). We emphasize that our work applies to the finite sum setting, as we allow for multiple passes over the data.

Continuous time processes.
A popular approach [An et al., 2018, Jastrzebski et al., 2017, Li et al., 2017, Mandt et al., 2016, Nguyen et al., 2019, Zhu et al., 2019] is to model the dynamics of SGD by imposing some structure on the noise and, by sending the stepsize to 0, relating the iterates of SGD to the stochastic differential equation (SDE):

dX_t = −∇f(X_t) dt + (γΣ(X_t))^{1/2} dB_t.  (17)

Here one typically assumes the stochastic gradient noise ∇f_i(x) − ∇f(x) is normally distributed (but not necessarily [Simsekli et al., 2019]) with some specific covariance structure Σ(X). A common choice, called the stochastic modified equation (SME) [Li et al., 2017, Mandt et al., 2016], matches the covariance matrix Σ(X) of the Gaussian noise with the actual covariance of the stochastic gradients at x (i.e., Σ(X) = (1/n) Σ_{i=1}^n (∇f(x) − ∇f_i(x))(∇f(x) − ∇f_i(x))ᵀ). This covariance makes the SME have the correct mean behavior, so the expected function values of the SME model are good approximations for the expected function values of SGD. Li et al. [2017] show that by taking the stepsize γ small, the behavior of SGD and the SME align. They and Mandt et al. [2016] also give a modified SME which gives even higher order accuracy for SGD as the stepsize goes to 0.

These SDEs have been used to study numerous properties of SGD, including the dynamics of regularized loss functions [Kunin et al., 2020] and generalization [Jastrzebski et al., 2017, Pflug, 1986, Simsekli et al., 2019, Zhu et al., 2019]. Despite their wide use, it has been observed that there is no small-stepsize limit of SGD that converges to an SDE [Yaida, 2019]. Our approach, instead, looks at the large-n limit and shows, in fact, that SGD concentrates while maintaining a fixed stepsize. Moreover, our Volterra equation is relatively easy to analyze. We note that the SME has the same mean behavior as SGD, so when n → ∞, the mean behavior of the SME and our Volterra equation match. However, the SME does not capture this concentration effect and greatly overestimates the fluctuations of the suboptimality.
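For comparison with Figure 1, an SME-type simulation of (17) for the least squares problem can be sketched with Euler-Maruyama, matching Σ(X) to the empirical covariance of the stochastic gradients. This is our own sketch (reusing `A`, `b`, `n`, `d`, `gamma` from the earlier snippets), with time parameterized in epochs at batch size 1 to match the Volterra equation, rather than the stepsize-time units of Li et al. [2017]:

```python
import numpy as np

rng = np.random.default_rng(4)
dt, T = 0.05, 10.0
X = np.zeros(d)
for _ in range(int(T / dt)):
    res = A @ X - b                              # residuals a_i^T X - b_i
    grad = A.T @ res / n                         # full gradient of f
    G = A * res[:, None]                         # row i is grad f_i(X)
    C = G.T @ G / n - np.outer(grad, grad)       # covariance of a sampled gradient
    w, V = np.linalg.eigh(C)                     # symmetric square root of C
    sqrtC = (V * np.sqrt(np.clip(w, 0.0, None))) @ V.T
    noise = sqrtC @ rng.standard_normal(d)
    X += -gamma * grad * dt + (gamma / np.sqrt(n)) * np.sqrt(dt) * noise
```

The eigendecomposition at every step makes this slow for large d; shrinking d makes it quick to experiment with.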
In this section, we develop the framework for the dynamics of SGD and sketch the argument of our main result (Theorem 1.1). Full proofs can be found in Appendix B.
Step 1: Change of basis.
A key feature of the SGD least squares iteration (3) is that the projection of x_k onto a singular vector v_j of A with singular value σ_j decreases in expectation exponentially in the number of iterations, at a rate proportional to the squared singular value σ_j² [Steinerberger, 2020, Strohmer and Vershynin, 2009]. This observation suggests the following change of basis. Consider the singular value decomposition of (1/√n)A = UΣVᵀ, where U and V are orthogonal matrices, i.e., VVᵀ = VᵀV = I, and Σ is the n × d singular value matrix with diagonal entries diag(σ_j), j = 1, ..., d. We define the spectral weight vector ν̂_k := Vᵀ(x_k − x̃), which therefore evolves like

ν̂_{k+1} = ν̂_k − γΣᵀUᵀP_k(UΣν̂_k − η).  (18)

From this point on, we consider the evolution of ν̂. We note our above observation on the singular vectors only holds on average for individual coordinates of x_k, and it does not alone explain the emergence of the Volterra equation dynamics. It also guarantees nothing about the concentration of the suboptimality.
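The change of variables is a purely algebraic identity, which a one-step numerical check makes concrete (our sketch; small sizes so the dense projection matrix is cheap):

```python
import numpy as np

rng = np.random.default_rng(5)
n, d, beta, gamma = 200, 120, 4, 0.3
A = rng.standard_normal((n, d))
x_tilde = rng.standard_normal(d) / np.sqrt(d)
eta = rng.standard_normal(n) / np.sqrt(n)
b = A @ x_tilde + np.sqrt(n) * eta

U, s, Vt = np.linalg.svd(A / np.sqrt(n))     # (1/sqrt(n)) A = U Sigma V^T
Sigma = np.zeros((n, d))
np.fill_diagonal(Sigma, s)

x = rng.standard_normal(d)
nu = Vt @ (x - x_tilde)                      # spectral weights nu_k
B = rng.choice(n, size=beta, replace=False)
P = np.zeros((n, n))
P[B, B] = 1.0                                # P_k = sum of e_i e_i^T over B_k

x_next = x - (gamma / n) * A.T @ P @ (A @ x - b)                   # update (3)
nu_next = nu - gamma * Sigma.T @ U.T @ P @ (U @ Sigma @ nu - eta)  # update (18)
assert np.allclose(Vt @ (x_next - x_tilde), nu_next)
```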
Step 2: Embedding into continuous time. We next consider an embedding of ν̂ into continuous time. This is done to simplify the analysis, and it does not change the underlying behavior of SGD. We let N_t be a standard univariate Poisson process with rate n/β, so that for any t > 0, E(N_t) = nt/β. We embed the spectral weights ν̂ into continuous time by taking ν_t = ν̂_{N_t}. We note that we have scaled time (by choosing the rate of the Poisson process) so that in a single unit of time t, the algorithm has done one complete pass (in expectation) over the data set.

We then show that f(x_{N_t}) is well approximated by ψ(t). As the mean of N_t is large for any fixed t > 0, the Poisson process concentrates around nt/β, and it follows as an immediate corollary that f(x_{⌊nt/β⌋}) is also well approximated by ψ(t).

Step 3: Doob–Meyer decomposition & the approximate Volterra equation. Under this continuous-time scaling, we can write the function values at x_t in terms of ν_t as

ψ_ε(t) := f(x_{N_t}) = (1/2)‖Σν_t − Uᵀη‖² = (1/2) Σ_{j=1}^d σ_j²ν²_{t,j} − Σ_{j=1}^{n∧d} σ_jν_{t,j}(Uᵀη)_j + (1/2)‖η‖²,  (19)

where ν_{t,j} is the j-th coordinate of the vector ν_t. Hence the dynamics of ψ_ε(t) are governed by the behaviors of the ν_{t,j} and ν²_{t,j} processes. Using (18) and the Doob decomposition for quasi-martingales [Protter, 2005, Thm 18, Chpt. 3], we have expressions for ν_{t,j} and ν²_{t,j}; that is, if we let F_t be the σ-algebra of the information available to the process at time t ≥ 0, we get

ν_{t,j} = ν_{0,j} + ∫₀ᵗ B_{s,j} ds + M̃_{t,j} and ν²_{t,j} = ν²_{0,j} + ∫₀ᵗ A_{s,j} ds + M_{t,j}, where

B_{t,j} := ∂_t E[ν_{t,j} | F_t] = −γσ_j²ν_{t,j} + γσ_j(Uᵀη)_j,  (20)

A_{t,j} := ∂_t E[ν²_{t,j} | F_t] = 2ν_{t,j}B_{t,j} + ((β−1)/(n−1))(B_{t,j})² + γ²σ_j²(1 − (β−1)/(n−1)) Σ_{i=1}^n (e_jᵀUᵀe_i)² (e_iᵀ(UΣν_t − η))²,

and (M_{t,j}, M̃_{t,j} : t ≥ 0) are F_t-adapted martingales. The last identities for B_{t,j} and A_{t,j} are derived in Lemma B.1 in Appendix B.

We will now see how the terms A_{t,j} and B_{t,j} can be simplified in the large-n limit. In this regime, sums of spectral quantities converge to integrals against the limiting spectral measure μ as a direct consequence of Assumptions 1.1 and 1.2. Since we are working in the regime where β = o(n), the terms with (β−1)/(n−1) vanish in the large-n limit, disappearing entirely when β = 1, which explains why β = o(n) does not affect the limiting dynamics of SGD. Our key lemma, which explains the Volterra dynamics of the mean of f(x_{N_t}) (Lemma B.5, App. B.6.2), is that (e_jᵀUᵀe_i)² self-averages to 1/n, and this is the point where we leverage orthogonal invariance of A most heavily. These simplifications can be summarized as

A_{t,j} ≈ Â_{t,j} := −2γσ_j²ν²_{t,j} + 2γσ_jν_{t,j}(Uᵀη)_j + (2γ²σ_j²/n) ψ_ε(t).  (21)

The expression Â_{t,j} explains the limiting Volterra dynamics for ψ_ε(t), and why the mean "gradient flow" term does not correctly describe the dynamics of SGD. Due to the gradient flow term, the squared spectral weights ν²_{t,j} tend to decay linearly with rate 2γσ_j². On the other hand, coordinates cannot decay too quickly, as there is a mass redistribution term, which describes the rate at which mass from other spectral weights is added to ν²_{t,j} and which is due to SGD updates being noisy analogues of gradient flow.
Finally, there is a noise term which in principle depends on ν_{t,j}, which would greatly complicate the limiting dynamics. However, when averaged in j, the independence of the noise η leads to a concentration effect, due to which only the mean behavior of ν_{t,j} survives. As this mean is just gradient flow, this leads to a simple deterministic forcing term in the Volterra equation.

Plugging (21) into (19) and (20), we can produce a perturbed Volterra equation for ψ_ε(t). For any t > 0 we have

ψ_ε(t) = (R²h₁(t))/2 + (R̃²(rh₀(t) + (1 − r)))/2 + ε₁^{(n)}(t) + ∫₀ᵗ (γ²rh₂(s) + ε₂^{(n)}(s)) ψ_ε(t − s) ds,  (22)

for error terms ε_i^{(n)} (see Appendix B.6 for a precise definition of the errors). The h_k(t) are defined in Theorem 1.1 as the Laplace transforms of the measure μ, and arise naturally due to the presence of the gradient flow generator.

Step 4: Control of the errors and stability of the Volterra equation.
The expression (22) is a Volterra equation of convolution type, a well-studied equation (see e.g. Gripenberg et al. [1990]) with established stability and existence/uniqueness theorems. In particular, we can summarily conclude that (see Proposition B.1 in Appendix B.5)

max_{i=1,2} sup_{0≤t≤T} |ε_i^{(n)}(t)| →Pr 0 as n → ∞ ⟹ sup_{0≤t≤T} |ψ_ε(t) − ψ(t)| →Pr 0 as n → ∞.

Thus Theorem 1.1, the dynamics of SGD, immediately follows, provided control of the errors in (22).

Beyond controlling the error of A_{t,j} − Â_{t,j}, we must separately control the fluctuations of the martingale terms in (20), which represent the randomness of SGD. A central challenge here is to show, in a suitable sense, that the entries of √n ν_t remain bounded on compact sets of time (see the discussion in Appendix B.6 for a detailed overview), which in turn can be seen as a consequence of the updates of SGD being very nearly orthogonal to any fixed row of U. Here again we use the orthogonal invariance of A, but in a weaker way, in that we only need that the maxima of the entries of U are under control. Such results are well-developed for other random matrix ensembles.

We solve the Volterra equation and derive exact expressions for the average-case analysis, the critical stepsize γ*, and the rate λ* (Thm 1.2) under the isotropic features model. In this case, the empirical spectral measure converges to the Marchenko-Pastur measure μ_MP (8). Volterra equations of convolution type can be solved using Laplace transforms, which, conveniently, for Marchenko-Pastur are explicit due to a connection with the Stieltjes transform. This leads us to our next main result.

Theorem 3.1 (Dynamics of SGD in the noiseless setting). Suppose R̃² = 0 and the stepsize γ < 2/r. Define the constants ϱ and ω and the critical stepsize

ϱ = ((1 + r)/2)(1 − rγ/2), ω = (1/4)(1 − rγ/2)²(8/γ − (1 + r)²), and γ* = 2√r / (r(r − √r + 1)).  (23)

The iterates of SGD satisfy, if γ ≤ γ*,

f(x_{⌊nt/β⌋}) →Pr (R²/γ)(1 − rγ/2) ∫₀^∞ (x e^{−2γxt} / ((x − ϱ)² + ω)) dμ_MP(x) as n → ∞,

and if γ > γ*, for some explicit constant c(γ, r), the iterates of SGD follow

f(x_{⌊nt/β⌋}) →Pr (R²/γ)(1 − rγ/2) ∫₀^∞ (x e^{−2γxt} / ((x − ϱ)² + ω)) dμ_MP(x) + R² · c(γ, r) · e^{−2γ(ϱ + √|ω|)t} as n → ∞.

We only record the dynamics for SGD in the noiseless regime and refer the reader to Appendix E.2, Theorem E.5 for the noisy setting. We first observe the freezing transition as predicted by renewal theory – a jamming term appears for γ > γ* that slows convergence. We note that when the ratio of features to samples r does not equal 1, the least squares problem in (2) is (almost surely) strongly convex, as dμ_MP has a gap between the first non-zero eigenvalue and zero (see Figure 2). As r approaches 1, the smallest non-zero eigenvalue becomes arbitrarily close to 0. This phenomenon suggests different convergence rates in the regimes r = 1 and r ≠ 1. Moreover, we see the explicit value of λ*, ϱ + √|ω|, which vanishes when r = 1. We present our average-case rates in Table 1.
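The limits in Theorem 3.1 can be evaluated by direct quadrature against the Marchenko-Pastur density. The sketch below is our reading of the γ ≤ γ* branch (function name ours; it assumes r < 1 so μ_MP has no atom at zero):

```python
import numpy as np

def sgd_limit_noiseless(t, r, gamma, R2, n_grid=4000):
    """Riemann-sum evaluation of the gamma <= gamma* limit in Theorem 3.1,
    for r < 1 (no atom at 0 in the Marchenko-Pastur measure)."""
    rho = 0.5 * (1 + r) * (1 - 0.5 * r * gamma)
    omega = 0.25 * (1 - 0.5 * r * gamma) ** 2 * (8.0 / gamma - (1 + r) ** 2)
    lm, lp = (1 - np.sqrt(r)) ** 2, (1 + np.sqrt(r)) ** 2
    x = np.linspace(lm, lp, n_grid)
    dens = np.sqrt(np.maximum((x - lm) * (lp - x), 0.0)) / (2 * np.pi * r * x)
    dx = x[1] - x[0]
    vals = x * np.exp(-2 * gamma * np.outer(np.atleast_1d(t), x))
    vals = vals / ((x - rho) ** 2 + omega)
    return (R2 / gamma) * (1 - 0.5 * r * gamma) * (vals * dens).sum(axis=1) * dx
```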
We compare models of SGD's dynamics on two data distributions for moderately-sized problems (n = 1000): the isotropic features model (see Section 1.1) and a one-hidden layer network with random weights. In the latter model, the entries of A are the result of a matrix multiplication composed with an activation function g : ℝ → ℝ:

A_ij := g([WY]_ij / √m), where W ∈ ℝ^{n×m}, Y ∈ ℝ^{m×d} are random matrices.  (24)

For the simulations, we took this activation function g to be a shifted ReLU function; the shift makes E[A] = 0 (see Appendix A for details). This model encompasses two-layer neural networks with a squared loss, where the first layer has random weights and the second layer's weights are given by the regression coefficients x. Note that while the isotropic features model satisfies our assumptions, the one-hidden layer model does not. For all these approaches, we compute the objective suboptimality as a function of the number of passes over the dataset (epochs) for the models: (1) SDE (i.e., Σ(X) a constant multiple of I in (17)); (2) SME (i.e., Σ(X) matches the covariance of the stochastic gradients); (3) streaming (regenerate a_i at each step); and (4) our Volterra equation. See Appendix F for full details on the setup as well as experiments with other values of r. The outcome is displayed in Figure 4 and discussed in the caption.
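For concreteness, the one-hidden layer data matrix (24) with a shifted ReLU can be generated as follows (our sketch: subtracting E[max(Z, 0)] = 1/√(2π) for Z ∼ N(0, 1) gives zero Gaussian mean at unit variance; the exact shift used in the paper's experiments is detailed in Appendix F):

```python
import numpy as np

rng = np.random.default_rng(6)
n, m, d = 1000, 1200, 800
W = rng.standard_normal((n, m))
Y = rng.standard_normal((m, d))
c = 1.0 / np.sqrt(2 * np.pi)            # Gaussian mean of ReLU at unit variance
g = lambda z: np.maximum(z, 0.0) - c    # shifted ReLU, so E[A] is approximately 0
A = g(W @ Y / np.sqrt(m))               # one-hidden layer features (24)
```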
[Figure 4 shows eight panels: isotropic features and one-hidden layer, each at γ = γ_max/8.0, γ_max/4.0, γ_max/2.0, and γ_max/1.2, with curves for SME, SDE, empirical SGD runs, streaming, and the Volterra model.]
Figure 4: Comparison of different SGD models: isotropic features (top) and one-hidden layer network (bottom). Across all stepsizes the Volterra prediction overlaps the objective suboptimality of the empirical SGD runs (orange). The SME (teal) fits SGD for small stepsizes, whereas streaming (blue) and SDE (pink) show noticeable divergences from SGD for all stepsizes. Stochastic methods were averaged across 10 runs, with the filled area representing the standard deviation. The parameter γ_max is the largest stepsize which still yields convergence of SGD, γ_max = 2(r · tr(H)/d)^{-1} from Theorem 1.2.

The fit of the Volterra equation to SGD is extremely accurate across different stepsizes and data distributions (some not covered by our assumptions), and even for medium-sized problems (n = 1000). We also note that while the SME is often a good approximation, obtaining convergence rates from it is an open problem. On the other hand, the proposed Volterra equation can be analyzed through its link with renewal theory.

Conclusion and future work
We have shown that the SGD method on least squares objectives admits a tight analysis in the large n and d limit. We described the dynamics of this algorithm through a Volterra integral equation and characterized its average-case convergence rate as well as its stepsize regimes. Although our results only hold in the large-n limit, the Volterra equation is remarkably accurate for relatively small dimensions (see e.g. Figure 4).

While our theoretical results focus on problems with an isotropic data matrix A, Figure 4 shows that the Volterra equation also predicts remarkably well the dynamics on data generated from a one-hidden layer network model. This suggests that the Volterra prediction might hold in even greater generality, a conjecture that is left for future work. Another direction of future work consists in extending the framework to other algorithms and problems. We believe the framework presented here should apply to methods like SGD with momentum, RMSprop or ADAM, and to problems such as PCA.

Acknowledgements
The authors would like to thank our colleagues Nicolas Le Roux, Bart van Merriënboer, Zaid Harchaoui, Manuela Girotti, Gauthier Gidel, and Dmitriy Drusvyatskiy for their feedback on this manuscript.

References
N. Alexeev, F. Götze, and A. Tikhomirov. Asymptotic distribution of singular values of powers of random matrices. Lith. Math. J., 50(2):121–132, 2010.
J. An, J. Lu, and L. Ying. Stochastic modified equations for the asynchronous stochastic gradient descent. arXiv preprint arXiv:1805.08244, 2018.
S. Asmussen. Applied probability and queues, volume 51 of Applications of Mathematics (New York). Springer-Verlag, New York, second edition, 2003. Stochastic Modelling and Applied Probability.
S. Asmussen, S. Foss, and D. Korshunov. Asymptotics for sums of random variables with local subexponential behaviour. J. Theoret. Probab., 16(2):489–518, 2003.
Z. Bai and J. Silverstein. Spectral analysis of large dimensional random matrices, volume 20. Springer, 2010.
G. Behrooz, S. Krishnan, and Y. Xiao. An Investigation into Neural Net Optimization via Hessian Eigenvalue Density. Proceedings of the 36th International Conference on Machine Learning (ICML), 2019.
L. Benigni and S. Péché. Eigenvalue distribution of nonlinear models of random matrices. arXiv preprint arXiv:1904.03090, 2019.
D. Bertsekas. A new class of incremental gradient methods for least squares problems. SIAM J. Optim., 7(4):913–926, 1997.
D. Bertsekas and J. Tsitsiklis. Gradient convergence in gradient methods with errors. SIAM J. Optim., 10(3):627–642, 2000.
R. Bollapragada, R. Byrd, and J. Nocedal. Adaptive sampling strategies for stochastic optimization. SIAM J. Optim., 28(4):3312–3343, 2018.
K. Borgwardt. A Probabilistic Analysis of the Simplex Method. Springer-Verlag, Berlin, Heidelberg, 1986.
L. Bottou, F.E. Curtis, and J. Nocedal. Optimization methods for large-scale machine learning. SIAM Review, 60(2):223–311, 2018.
P. Chaudhari and S. Soatto. Stochastic gradient descent performs variational inference, converges to limit cycles for deep networks. In International Conference on Learning Representations (ICLR), 2018.
G. Cipolloni, L. Erdős, and D. Schröder. Eigenstate Thermalization Hypothesis for Wigner Matrices. arXiv e-prints, art. arXiv:2012.13215, 2020.
Y.N. Dauphin, R. Pascanu, C. Gulcehre, K. Cho, S. Ganguli, and Y. Bengio. Identifying and attacking the saddle point problem in high-dimensional non-convex optimization. In Advances in Neural Information Processing Systems (NeurIPS), 2014.
P.A. Deift and T. Trogdon. The conjugate gradient algorithm on well-conditioned Wishart matrices is almost deterministic. Quarterly of Applied Mathematics, 2020.
B. Derrida. Random-energy model: An exactly solvable model of disordered systems. Phys. Rev. B, 24:2613–2626, Sep 1981.
M. Friedlander and M. Schmidt. Hybrid deterministic-stochastic methods for data fitting. SIAM J. Sci. Comput., 34(3):A1380–A1405, 2012.
Y. Fyodorov and J. Bouchaud. Freezing and extreme-value statistics in a random energy model with logarithmically correlated potential. J. Phys. A, 41(37):372001, 12, 2008.
C. Gerbelot, A. Abbara, and F. Krzakala. Asymptotic Errors for High-Dimensional Convex Penalized Linear Regression beyond Gaussian Matrices. In Jacob Abernethy and Shivani Agarwal, editors, Proceedings of Thirty Third Conference on Learning Theory (COLT), volume 125 of Proceedings of Machine Learning Research, pages 1682–1713. PMLR, 2020.
S. Ghadimi and G. Lan. Stochastic first- and zeroth-order methods for nonconvex stochastic programming. SIAM J. Optim., 23(4):2341–2368, 2013.
R. Gower, N. Loizou, X. Qian, A. Sailanbayev, E. Shulgin, and P. Richtárik. SGD: General analysis and improved rates. In International Conference on Machine Learning (ICML). PMLR, 2019.
G. Gripenberg, S.O. Londen, and O. Staffans. Volterra Integral and Functional Equations. Encyclopedia of Mathematics and its Applications. Cambridge University Press, 1990.
M. Gurbuzbalaban, U. Simsekli, and L. Zhu. The Heavy-Tail Phenomenon in SGD. arXiv preprint arXiv:2006.04740, 2020.
T. Hastie, A. Montanari, S. Rosset, and R.J. Tibshirani. Surprises in high-dimensional ridgeless least squares interpolation. arXiv preprint arXiv:1903.08560, 2019.
C.A.R. Hoare. Quicksort. The Computer Journal, 5(1):10–16, 01 1962.
W. Hu, C. Junchi Li, L. Li, and J. Liu. On the diffusion approximation of nonconvex stochastic gradient descent. arXiv preprint arXiv:1705.07562, 2017.
D. Huang, J. Niles-Weed, J. Tropp, and R. Ward. Matrix Concentration for Products. arXiv preprint arXiv:2003.05437, 2020.
P. Jain, S. Kakade, R. Kidambi, P. Netrapalli, and A. Sidford. Accelerating Stochastic Gradient Descent for Least Squares Regression. In Proceedings of the 31st Conference On Learning Theory (COLT), volume 75, pages 545–604, 2018.
S. Jastrzebski, Z. Kenton, D. Arpit, N. Ballas, A. Fischer, Y. Bengio, and A. Storkey. Three Factors Influencing Minima in SGD. arXiv preprint arXiv:1711.04623, 2017.
N. Keskar, D. Mudigere, J. Nocedal, M. Smelyanskiy, and P. Tang. On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima. arXiv preprint arXiv:1609.04836, 2016.
J.F.C. Kingman. Poisson Processes. Clarendon Press, Oxford, 1993.
B. Klar. Bounds on Tail Probabilities of Discrete Distributions. Probability in the Engineering and Informational Sciences, 14:161–171, 2000.
A. Knowles and J. Yin. Anisotropic local laws for random matrices. Probab. Theory Related Fields, 169(1-2):257–352, 2017.
D. Kunin, J. Sagastuy-Brena, S. Ganguli, D.L.K. Yamins, and H. Tanaka. Neural Mechanics: Symmetry and Broken Conservation Laws in Deep Learning Dynamics. arXiv preprint arXiv:2012.04728, 2020.
H. Kushner and G.G. Yin. Stochastic approximation and recursive algorithms and applications, volume 35. Springer Science & Business Media, 2003.
J. Lacotte and M. Pilanci. Optimal Randomized First-Order Methods for Least-Squares Problems. arXiv preprint arXiv:2002.09488, 2020.
D. Lépingle. Sur le comportement asymptotique des martingales locales. In Séminaire de Probabilités, XII (Univ. Strasbourg, Strasbourg, 1976/1977), volume 649 of Lecture Notes in Math., pages 148–161. Springer, Berlin, Heidelberg, 1978.
Q. Li, C. Tai, and W. E. Stochastic Modified Equations and Adaptive Stochastic Gradient Algorithms. In Proceedings of the 34th International Conference on Machine Learning (ICML), volume 70, pages 2101–2110, 2017.
D. Liu, C. Song, and Z. Wang. On explicit probability densities associated with Fuss-Catalan numbers. Proc. Amer. Math. Soc., 139(10):3735–3738, 2011.
L. Ljung. Analysis of recursive stochastic algorithms. IEEE Trans. Automatic Control, AC-22(4):551–575, 1977.
M. Mahsereci and P. Hennig. Probabilistic Line Searches for Stochastic Optimization. Journal of Machine Learning Research, 18(119):1–59, 2017.
S. Mandt, M. Hoffman, and D. Blei. A variational analysis of stochastic gradient algorithms. In International Conference on Machine Learning (ICML), 2016.
V.A. Marčenko and L.A. Pastur. Distribution of eigenvalues for some sets of random matrices. Mathematics of the USSR-Sbornik, 1967.
C.H. Martin and M.W. Mahoney. Implicit self-regularization in deep neural networks: Evidence from random matrix theory and implications for learning. arXiv preprint arXiv:1810.01075, 2018.
E. Meckes. The random matrix theory of the classical compact groups, volume 218 of Cambridge Tracts in Mathematics. Cambridge University Press, Cambridge, 2019.
S. Mei and A. Montanari. The generalization error of random features regression: Precise asymptotics and double descent curve. arXiv preprint arXiv:1908.05355, 2019.
S. Mei, A. Montanari, and P. Nguyen. A mean field view of the landscape of two-layer neural networks. Proc. Natl. Acad. Sci. USA, 115(33):E7665–E7671, 2018.
S. Mei, T. Misiakiewicz, and A. Montanari. Mean-field theory of two-layers neural networks: dimension-free bounds and kernel limit. In Alina Beygelzimer and Daniel Hsu, editors, Proceedings of the Thirty-Second Conference on Learning Theory (COLT), volume 99, pages 2388–2464, 2019.
A. Nemirovski. Information-based complexity of convex programming. Lecture Notes, 1995.
Y. Nesterov. Introductory lectures on convex optimization. Springer, 2004.
T. Nguyen, U. Simsekli, M. Gurbuzbalaban, and G. Richard. First exit time analysis of stochastic gradient descent under heavy-tailed gradient noise. In Advances in Neural Information Processing Systems (NeurIPS), 2019.
V. Papyan. The full spectrum of deepnet hessians at scale: Dynamics with SGD Training and Sample Size. arXiv preprint arXiv:1811.07062, 2018.
C. Paquette, B. van Merriënboer, and F. Pedregosa. Halting Time is Predictable for Large Models: A Universality Property and Average-case Analysis. arXiv preprint arXiv:2006.04299, 2020.
E. Paquette and T. Trogdon. Universality for the conjugate gradient and MINRES algorithms on sample covariance matrices. arXiv preprint arXiv:2007.00640, 2020.
F. Pedregosa and D. Scieur. Average-case Acceleration Through Spectral Density Estimation. In Proceedings of the 37th International Conference on Machine Learning (ICML), 2020.
J. Pennington and P. Worah. Nonlinear random matrix theory for deep learning. In Advances in Neural Information Processing Systems (NeurIPS), 2017.
G.C. Pflug. Stochastic minimization with constant step-size: asymptotic laws. SIAM J. Control Optim., 24(4):655–666, 1986.
P.E. Protter. Stochastic integration and differential equations, volume 21 of Stochastic Modelling and Applied Probability. Springer-Verlag, Berlin, 2005.
A. Rahimi and B. Recht. Random features for large-scale kernel machines. In Advances in Neural Information Processing Systems (NeurIPS), pages 1177–1184, 2008.
S. Resnick. Adventures in stochastic processes. Birkhäuser Boston, Inc., Boston, MA, 1992.
H. Robbins and S. Monro. A Stochastic Approximation Method. Ann. Math. Statist., 1951.
L. Sagun, L. Bottou, and Y. LeCun. Eigenvalues of the Hessian in Deep Learning: Singularity and Beyond. arXiv preprint arXiv:1611.07476, 2016.
L. Sagun, T. Trogdon, and Y. LeCun. Universal halting times in optimization and machine learning. Quarterly of Applied Mathematics, 76:1, 09 2017.
A. Saxe, J. McClelland, and S. Ganguli. Exact solutions to the nonlinear dynamics of learning in deep linear neural networks. arXiv preprint arXiv:1312.6120, 2013.
T. Schaul, S. Zhang, and Y. LeCun. No more pesky learning rates. In Proceedings of the 30th International Conference on Machine Learning (ICML), volume 28 of Proceedings of Machine Learning Research, pages 343–351, 2013.
G. Shorack and J. Wellner. Empirical processes with applications to statistics. Wiley Series in Probability and Mathematical Statistics. John Wiley & Sons, Inc., New York, 1986.
U. Simsekli, L. Sagun, and M. Gurbuzbalaban. A Tail-Index Analysis of Stochastic Gradient Noise in Deep Neural Networks. In Proceedings of the 36th International Conference on Machine Learning (ICML), volume 97, pages 5827–5837, 2019.
J. Sirignano and K. Spiliopoulos. Mean field analysis of neural networks: a law of large numbers. SIAM J. Appl. Math., 80(2):725–752, 2020.
S. Smale. On the average number of steps of the simplex method of linear programming. Mathematical Programming, 27(3):241–262, 1983.
D. Spielman and S. Teng. Smoothed Analysis of Algorithms: Why the Simplex Algorithm Usually Takes Polynomial Time. J. ACM, 51(3):385–463, 2004.
S. Steinerberger. Randomized Kaczmarz converges along small singular vectors. arXiv preprint arXiv:2006.16978, 2020.
T. Strohmer and R. Vershynin. A randomized Kaczmarz algorithm with exponential convergence. J. Fourier Anal. Appl., 15(2):262–278, 2009.
W. Su, S. Boyd, and E.J. Candès. A Differential Equation for Modeling Nesterov's Accelerated Gradient Method: Theory and Insights. Journal of Machine Learning Research, 17(153):1–43, 2016.
I. Sutskever, J. Martens, G. Dahl, and G. Hinton. On the importance of initialization and momentum in deep learning. In Proceedings of the 30th International Conference on Machine Learning (ICML), volume 28, pages 1139–1147, 2013.
T. Tao. Topics in random matrix theory, volume 132. American Mathematical Soc., 2012.
S. Vaswani, A. Mishkin, I. Laradji, M. Schmidt, G. Gidel, and S. Lacoste-Julien. Painless Stochastic Gradient: Interpolation, Line-Search, and Convergence Rates. In Advances in Neural Information Processing Systems (NeurIPS), volume 32, pages 3732–3745, 2019.
R. Vershynin. High-dimensional probability: An introduction with applications in data science. Cambridge University Press, 2018.
S. Yaida. Fluctuation-dissipation relations for stochastic gradient descent. In International Conference on Learning Representations (ICLR), 2019.
M. Yor. On optional stochastic integrals and a remarkable series of exponential formulas. Strasbourg probability seminar, 10:481–500, 1976.
Z. Zhu, J. Wu, B. Yu, L. Wu, and J. Ma. The Anisotropic Noise in Stochastic Gradient Descent: Its Behavior of Escaping from Sharp Minima and Regularization Effects. In Proceedings of the 36th International Conference on Machine Learning (ICML), volume 97, pages 7654–7663, 2019.
SGD in the Large: Average-case Analysis, Asymptotics, and Stepsize Criticality

Supplementary material
The appendix is organized into six sections as follows:

1. Appendix A expands upon the data examples in Section 1.1.
2. Appendix B derives the Volterra equation and proves the main concentration result for the dynamics of SGD (Theorem 1.1).
3. Appendix C shows that the error terms associated with concentration of measure on the high-dimensional orthogonal group disappear in the large-n limit. This includes the key lemma, Proposition B.5.
4. Appendix D shows the error terms which vanish due to martingale concentration results.
5. Appendix E derives the average-case complexity results from Section 3 and provides a proof of the Malthusian exponent (Theorem 1.2).
6. Appendix F contains details on the simulations.

Unless otherwise stated, all the results hold under Assumptions 1.1 and 1.2. We include all statements from the previous sections for clarity.

Notation.
All stochastic quantities defined hereafter live on a probability space denoted by (Ω, F, Pr) with probability measure Pr and the σ-algebra F containing subsets of Ω. A random variable (vector) is a measurable map from Ω to ℝ (ℝ^d), respectively. Let X : (Ω, F) → (ℝ, B) be a random variable mapping into the Borel σ-algebra B, and let B ∈ B. We use the standard shorthand for the event {X ∈ B} = {ω : X(ω) ∈ B}. We denote the minimum of a and b by min{a, b} = a ∧ b. An event E that occurs with high probability (shortened to w.h.p.) is one whose probability depends on n, related to the matrix dimension in our paper, and whose complementary event has probability going to 0 as n → ∞. An event E is said to occur with overwhelming probability (w.o.p.) if the probability of its complementary event goes to 0 faster than any polynomial order of n as n → ∞, i.e., for any k > 0,

lim_{n→∞} Pr(E^c) n^k = 0.

Throughout the paper, β = β(n), which denotes the size of B_k, a uniformly random subset of {1, ..., n} at the k-th iteration of SGD, is assumed to satisfy β ≤ n^{1/2−δ} for some δ > 0.

A Data distributions
A.1 Elaboration on isotropy
We recall that in Assumption 1.2 we have assumed that the data matrix A is orthogonally invariant. This is a strong form of isotropy, under which the matrix looks the same in any orthogonal basis. On a technical level, we work in this setting as it leads to a singular value decomposition with especially simple statistics.

To state this property, we recall that the set of n × n orthogonal matrices forms a group O(n) under multiplication, and that this group naturally admits a Lie group structure. In particular, there is a probability measure on this group, the Haar measure, which is invariant under left and right multiplication by fixed orthogonal matrices. To refer to a random matrix whose law is the Haar measure, we will simply say a Haar-distributed orthogonal matrix. While it may appear unwieldy, there are many exceptionally nice tools that exist for working with this measure. We will elaborate on many of them in Section C. We also refer to Meckes [2019] for a rich exposition on the intrinsic properties of this group. The main feature that we will need is the following:
Lemma A.1. Suppose that A is an n × d orthogonally invariant random matrix, in that AO_d =^{law} A and O_nA =^{law} A for any orthogonal matrices O_n ∈ O(n) and O_d ∈ O(d). Then there is a singular value decomposition A = UΣV with Σ an n × d random matrix having Σ₁₁ ≥ Σ₂₂ ≥ ⋯ ≥ Σ_mm, where m = min{n, d}, so that (U, Σ, V) are independent, and U and V are Haar orthogonally distributed.

Proof. The key observation is that if we introduce a new, independent Haar-distributed random matrix U₁ ∈ O(n), then U₁ᵀA has the same law as A and moreover (U₁, U₁ᵀA) are independent. To see that U₁ᵀA has the same law as A, we just observe that conditionally on U₁ᵀ, U₁ᵀA =^{law} A by assumption. As the conditional law does not depend on U₁, it follows that U₁ᵀA =^{law} A and U₁ is independent of U₁ᵀA. Extending this, if we introduce two new independent Haar-distributed random matrices U₁ ∈ O(n) and V₁ ∈ O(d), it follows that (U₁, U₁ᵀAV₁ᵀ, V₁) is a triple of independent random matrices. Let U₁ᵀAV₁ᵀ = ŨΣṼ be the singular value decomposition with Σ having the properties stated in the lemma. Then

A = U₁(U₁ᵀAV₁ᵀ)V₁ = (U₁Ũ)Σ(ṼV₁),

with U₁, V₁ and (Ũ, Σ, Ṽ) independent. By invariance of the Haar measure, the triple (U₁Ũ, ṼV₁, Σ) remains independent, with the first two uniformly distributed on O(n) and O(d), respectively.

As a consequence, we will frequently condition on the singular values Σ of A, and most estimates we need are estimates that hold conditionally on Σ.
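Haar-distributed orthogonal matrices are also easy to sample in practice, via the QR decomposition of a Gaussian matrix with a sign correction, a standard recipe sketched here in our own code; it can be used, for instance, to build the planted-spectrum matrices of Example 2:

```python
import numpy as np

def haar_orthogonal(n, rng):
    """Sample a Haar-distributed matrix from O(n): take the QR factorization
    of an i.i.d. Gaussian matrix and fix the signs so the law is exactly Haar."""
    Z = rng.standard_normal((n, n))
    Q, R = np.linalg.qr(Z)
    return Q * np.sign(np.diag(R))      # rescale column j by sign(R_jj)

rng = np.random.default_rng(7)
U = haar_orthogonal(5, rng)
assert np.allclose(U @ U.T, np.eye(5), atol=1e-10)
```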
A.2 Isotropic features and Random features

In this section, we expand upon Assumptions 1.1 and 1.2 in the main text of the paper. We discuss in detail two examples: isotropic features and one-hidden layer networks.
A.2.1 Isotropic features
In their seminal work, Marčenko and Pastur [1967] show that the spectrum of H = (1/n)AᵀA under the isotropic features model converges to a deterministic measure. Subsequent work then characterized the convergence of the largest eigenvalue of H. We summarize these results below.

Lemma A.2 (Isotropic features). (Bai and Silverstein [2010, Theorem 5.8]) Suppose the matrix A ∈ ℝ^{n×d} is generated using the isotropic features model. Then the empirical spectral measure (ESM) μ_H converges weakly almost surely to the Marchenko-Pastur measure μ_MP, and the largest eigenvalue of H, λ⁺_H, converges in probability to λ⁺, where λ⁺ = (1 + √r)² is the top edge of the support of the Marchenko-Pastur measure.

The results stated so far do not require that the entries of A are normally distributed, and hold equally well for any i.i.d. matrices with mean 0, entry variance 1, and bounded fourth moment. Under the additional assumption that the entries of A are normally distributed, it is easily checked by a covariance computation that for fixed orthogonal matrices U and V the entries of UA and AV remain independent, mean 0, and variance 1. We summarize this claim below.
Lemma A.3. For an $n \times d$ matrix $A$ of i.i.d. standard normal random variables, $UA$ and $AV$ are again $n \times d$ matrices of independent standard normals, for fixed orthogonal matrices $U$ and $V$.

We emphasize that while this is not true for matrices $A$ whose entries are merely independent with mean $0$ and variance $1$, there are many senses in which it is approximately true (see Knowles and Yin [2017]).

A.2.2 One-hidden layer networks

One-hidden layer network with random weights. In this model, the entries of $A$ are the result of a matrix multiplication composed with a (potentially non-linear) activation function $g : \mathbb{R} \to \mathbb{R}$:
\[ A_{ij} \overset{\text{def}}{=} g\Big( \frac{[W Y]_{ij}}{\sqrt{m}} \Big), \quad \text{where } W \in \mathbb{R}^{n \times m},\ Y \in \mathbb{R}^{m \times d} \text{ are random matrices.} \tag{25} \]
The entries of $W$ and $Y$ are i.i.d. with zero mean, isotropic variances $\mathbb{E}[W_{ij}^2] = \sigma_w^2$ and $\mathbb{E}[Y_{ij}^2] = \sigma_y^2$, and light tails (see (29) below for details). As in the previous case, to study the large dimensional limit we assume that the different dimensions grow at comparable rates, $\frac{m}{n} \to r_1 \in (0,\infty)$ and $\frac{m}{d} \to r_2 \in (0,\infty)$. This model encompasses two-layer neural networks with a squared loss, where the first layer has random weights and the second layer's weights are given by the regression coefficients $x$. In particular, the optimization problem in (2) becomes
\[ \min_x \Big\{ f(x) = \frac{1}{2n} \Big\| g\Big( \tfrac{1}{\sqrt{m}} W Y \Big) x - b \Big\|^2 \Big\}. \tag{26} \]
The model was introduced by Rahimi and Recht [2008] as a randomized approach for scaling kernel methods to large datasets, and has seen a surge of interest in recent years as a way to study the generalization properties of neural networks [Hastie et al., 2019, Mei and Montanari, 2019, Pennington and Worah, 2017].

The difference between this and the isotropic features model is the activation function $g$. We assume $g$ to be entire with a growth condition and to have zero Gaussian mean (see (27) below). These assumptions hold for common activation functions such as the sigmoid $g(z) = (1 + e^{-z})^{-1}$ and the softplus $g(z) = \log(1 + e^z)$, a smoothed variant of the ReLU. Benigni and Péché [2019] recently showed that the empirical spectral measure and largest eigenvalue of $H$ converge to a deterministic measure and the largest element of its support, respectively. This implies that this model verifies Assumption 1.2. However, contrary to the isotropic features model, the limiting measure does not have an explicit expression, except for some specific instances of $g$ in which it is known to coincide with the Marchenko-Pastur distribution.

For the one-hidden layer network, following Benigni and Péché [2019], we assume that the activation function $g$ is an entire function with a growth condition, satisfying the following zero Gaussian mean property:
\[ \text{(Gaussian mean)} \qquad \int_{\mathbb{R}} g(\sigma_w \sigma_y z)\, \frac{e^{-z^2/2}}{\sqrt{2\pi}}\, \mathrm{d}z = 0. \tag{27} \]
The growth condition on $g$ is precisely that there exist positive constants $C_g$, $c_g$, $A_0 > 0$ such that for any $A \geq A_0$ and any $n \in \mathbb{N}$,
\[ \sup_{z \in [-A, A]} \big| g^{(n)}(z) \big| \leq C_g\, A^{c_g n}. \tag{28} \]
This growth condition is verified by polynomials, which can approximate to arbitrary precision common activation functions such as the sigmoid $g(z) = (1+e^{-z})^{-1}$ and the softplus $g(z) = \log(1+e^z)$, a smoothed approximation to the ReLU.
The Gaussian mean assumption (27) can always be satisfied by incorporating a translation into the activation function. In addition to the i.i.d., mean zero, isotropic-variance entries, we also require an assumption on the tails of $W$ and $Y$: there exist constants $\theta_w, \theta_y > 0$ and $\alpha > 0$ such that for any $t > 0$,
\[ \Pr( |W_{11}| > t ) \leq \exp( -\theta_w t^{\alpha} ) \qquad \text{and} \qquad \Pr( |Y_{11}| > t ) \leq \exp( -\theta_y t^{\alpha} ). \tag{29} \]
Although stronger than bounded fourth moments, this assumption holds for any sub-Gaussian random variables (e.g., Gaussian, Bernoulli, etc.). Under these hypotheses, Assumption 1.2 is verified.

Lemma A.4 (One-hidden layer network). (Benigni and Péché [2019, Theorems 2.2 and 5.1]) Suppose the matrix $A \in \mathbb{R}^{n \times d}$ is generated using the random features model. Then there exists a deterministic, compactly supported measure $\mu$ such that $\mu_H \to \mu$ weakly almost surely as $d \to \infty$. Moreover, $\lambda_H^+ \to \lambda^+$ in probability as $d \to \infty$, where $\lambda^+$ is the top edge of the support of $\mu$.
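The following sketch (ours) generates the random features matrix of (25) with the activation recentred so that the Gaussian-mean condition (27) holds; the softplus activation and the quadrature-based recentring constant are illustrative choices, not prescribed by the paper.

```python
import numpy as np

def softplus(z):
    return np.log1p(np.exp(z))

def random_features(n, d, m, g=softplus, sw=1.0, sy=1.0, rng=None):
    """Sample A_ij = g([W Y]_ij / sqrt(m)) - c, with c chosen so that (27) holds."""
    rng = np.random.default_rng(rng)
    W = sw * rng.standard_normal((n, m))
    Y = sy * rng.standard_normal((m, d))
    # Gauss-Hermite quadrature for c = E[g(sw * sy * Z)], Z ~ N(0,1).
    z, w = np.polynomial.hermite_e.hermegauss(64)
    c = np.dot(w, g(sw * sy * z)) / np.sqrt(2 * np.pi)
    return g(W @ Y / np.sqrt(m)) - c

A = random_features(n=1000, d=500, m=750)
print(abs(A.mean()))  # near zero thanks to the recentring
```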
B Derivation of the dynamics of SGD

In this section, we derive the Volterra equation from (12), that is,
\[ \psi(t) = \frac{R}{2}\, h_1(t) + \frac{\widetilde{R}}{2} \big( r\, h_0(t) + (1-r) \big) + \int_0^t \gamma^2 r\, h_2(t-s)\, \psi(s)\, \mathrm{d}s, \qquad h_k(t) = \int_0^{\infty} x^k e^{-2\gamma t x}\, \mathrm{d}\mu(x), \tag{30} \]
and prove Theorem 1.1:
\[ \sup_{0 \leq t \leq T} \big| f\big( x_{\lfloor (n/\beta) t \rfloor} \big) - \psi(t) \big| \xrightarrow[n \to \infty]{\Pr} 0, \tag{31} \]
provided that the error terms go to zero. We begin by setting up the tools to derive an approximate Volterra equation.
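Before the derivation, it may help to see the target object numerically. The sketch below (ours, not the paper's code) solves the Volterra equation (30) with a left-endpoint discretization, using the empirical eigenvalues of a sampled $H$ as a stand-in for $\mu$; the values of $\gamma$, $R$ and $\widetilde{R}$ (written `Rtld`) are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 2000, 1000
r = d / n
A = rng.standard_normal((n, d))
eigs = np.linalg.eigvalsh(A.T @ A / n)       # empirical stand-in for mu
gamma, R, Rtld = 0.3, 1.0, 0.5               # illustrative constants

def h(k, t):
    """h_k(t) of (30), averaged over the empirical spectrum."""
    return np.mean(eigs**k * np.exp(-2.0 * gamma * np.outer(t, eigs)), axis=1)

T, steps = 10.0, 2000
t = np.linspace(0.0, T, steps)
dt = t[1] - t[0]
forcing = 0.5 * R * h(1, t) + 0.5 * Rtld * (r * h(0, t) + (1 - r))
kernel = gamma**2 * r * h(2, t)
psi = np.empty(steps)
for i in range(steps):
    # psi(t_i) = forcing(t_i) + dt * sum_{j < i} kernel(t_i - t_j) psi(t_j)
    psi[i] = forcing[i] + dt * np.dot(kernel[i:0:-1], psi[:i])
print(psi[0], psi[-1])   # psi(0) = f(x_0); psi(T) approaches the limiting value
```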
B.1 Change of basis

Recall that the iterates of SGD satisfy
\[ x_{k+1} = x_k - \gamma \sum_{i \in B_k} \nabla f_i(x_k) = x_k - \frac{\gamma}{n}\, A^T P_k (A x_k - b), \qquad \text{where} \quad P_k \overset{\text{def}}{=} \sum_{i \in B_k} e_i e_i^T. \tag{32} \]
Here $P_k$ is a random orthogonal projection matrix with $e_i$ the $i$-th standard basis vector, $\beta \in \mathbb{N}$ is a batch-size parameter which we allow to depend on $n$, $\gamma > 0$ is a stepsize parameter, and the function $f_i$ is the $i$-th element of the sum in (2). We recall that $b$ has the representation $b = A \widetilde{x} + \sqrt{n}\, \eta$, and both $\widetilde{x}$ and $\eta$ have norms bounded independently of $n$. Hence we can represent the SGD updates (3) in matrix form as
\[ x_{k+1} = x_k - \frac{\gamma}{n}\, A^T P_k \big( A(x_k - \widetilde{x}) - \sqrt{n}\, \eta \big). \]
We will consider the singular value decomposition guaranteed by Lemma A.1 of $A/\sqrt{n} = U \Sigma V^T$, where $U$ and $V$ are Haar-distributed orthogonal matrices, i.e. $V V^T = V^T V = I$, and $\Sigma$ is the $n \times d$ singular value matrix with diagonal entries $\operatorname{diag}(\sigma_i)$, $i = 1, \ldots, n \wedge d$. Our analysis will use a different choice of variables, so we define the vector
\[ \widehat{\nu}_k \overset{\text{def}}{=} V^T (x_k - \widetilde{x}), \]
which therefore evolves like
\[ \widehat{\nu}_{k+1} = \widehat{\nu}_k - \gamma\, \Sigma^T U^T P_k \big( U \Sigma\, \widehat{\nu}_k - \eta \big). \tag{33} \]
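For reference, a direct simulation of the update (32) on synthetic data can be written as follows (a sketch with illustrative constants; under the time scaling introduced in the next subsection, one epoch of $n/\beta$ steps corresponds to one unit of Volterra time, so $f(x_{\lfloor (n/\beta)t \rfloor})$ can be compared against $\psi(t)$ computed above).

```python
import numpy as np

rng = np.random.default_rng(2)
n, d, beta, gamma = 2000, 1000, 10, 0.3
A = rng.standard_normal((n, d))
x_tilde = rng.standard_normal(d) / np.sqrt(d)
eta = rng.standard_normal(n) / np.sqrt(n)          # ||eta|| = O(1)
b = A @ x_tilde + np.sqrt(n) * eta

f = lambda x: np.sum((A @ x - b)**2) / (2 * n)
x = np.zeros(d)
for k in range(5 * (n // beta)):                   # 5 epochs of n/beta steps
    B = rng.choice(n, size=beta, replace=False)    # the batch B_k
    x -= (gamma / n) * A[B].T @ (A[B] @ x - b[B])  # update (32)
    # f(x) at step k sits at Volterra time t = k * beta / n
print(f(x))
```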
B.2 Embedding into continuous time

We next consider an embedding of the process $\widehat{\nu}_k$ into continuous time. This is done to simplify the analysis; it does not change the underlying behavior of SGD. Let $\mathbb{N}_0 = \mathbb{N} \cup \{0\}$. We define an infinite random sequence $\{\tau_k : k \in \mathbb{N}_0\} \subset [0, \infty)$ with $\tau_0 < \tau_1 < \tau_2 < \cdots$, which records the time at which the $k$-th update of SGD occurs. The distribution of $\{\tau_k : k \in \mathbb{N}_0\}$ follows a standard rate-$(n/\beta)$ Poisson process. This means that the family of interarrival times $\{\tau_k - \tau_{k-1} : k \in \mathbb{N}\}$ are i.i.d. $\operatorname{Exp}(n/\beta)$ random variables, i.e., those with mean $\beta/n$, and we note that this randomization is independent of the SGD, of $A$ and of $b$. The function $N_t$ counts the number of arrivals of this Poisson process before time $t$, that is,
\[ N_t = \sup\{ k \in \mathbb{N}_0 : \tau_k \leq t \}. \]
Then for any $t > 0$, $N_t$ has the distribution of $\operatorname{Poisson}(\tfrac{n}{\beta} t)$. We embed the process $\widehat{\nu}$ into continuous time by taking $\nu_t = \widehat{\nu}_{N_t}$. We note that we have scaled time (by choosing the rate of the Poisson process) so that in a single unit of time $t$, the algorithm has done one complete pass (in expectation) over the data set, i.e. SGD has completed one epoch.

B.3 Doob-Meyer decomposition

We compute the Doob decomposition for quasi-martingales [Protter, 2005, Thm. 18, Chpt. 3] of $\nu_t$ and of $\nu_{t,j}^2$, where $\nu_{t,j}$ is the $j$-th coordinate of $\nu_t$ and $j$ ranges over $j = 1, \ldots, d$. Here we let $\mathcal{F}_t$ be the $\sigma$-algebra of information available to the process at time $t \geq 0$. So we take, for any $j \in [d]$,
\[ B_t \overset{\text{def}}{=} \partial_t\, \mathbb{E}[\nu_t \mid \mathcal{F}_t] = \lim_{\epsilon \downarrow 0} \epsilon^{-1}\, \mathbb{E}[\nu_{t+\epsilon} - \nu_t \mid \mathcal{F}_t], \qquad A_{t,j} \overset{\text{def}}{=} \partial_t\, \mathbb{E}[\nu_{t,j}^2 \mid \mathcal{F}_t] = \lim_{\epsilon \downarrow 0} \epsilon^{-1}\, \mathbb{E}[\nu_{t+\epsilon,j}^2 - \nu_{t,j}^2 \mid \mathcal{F}_t]. \]
In terms of these random variables, we have the decomposition
\[ \nu_t = \nu_0 + \int_0^t B_s\, \mathrm{d}s + \widetilde{M}_t, \qquad \nu_{t,j}^2 = \nu_{0,j}^2 + \int_0^t A_{s,j}\, \mathrm{d}s + M_{t,j}, \tag{34} \]
where $(M_{t,j}, \widetilde{M}_t : t \geq 0)$ are $\mathcal{F}_t$-adapted martingales. For the computation of $A_{t,j}$, we observe that as $\epsilon \to 0$, $A_{t,j}$ is dominated by the contribution of a single Poisson point arrival, since in time $\epsilon$ the probability of having multiple Poisson point arrivals is $O(\beta^{-2} n^2 \epsilon^2)$, whereas the probability of having a single arrival is $1 - e^{-\beta^{-1} n \epsilon} \sim \beta^{-1} n \epsilon$ as $\epsilon \to 0$. For notational simplicity, we let the projection matrix $P \in \mathbb{R}^{n \times n}$ be an i.i.d. copy of $P_0$, independent of all the randomness so far, and we let $B$ be the corresponding random subset of $\{1, 2, \ldots, n\}$ that defines $P$. It follows that
\[ B_t = \frac{n}{\beta}\, \mathbb{E}\Big[ \nu_t - \gamma\, \Sigma^T U^T P ( U \Sigma \nu_t - \eta ) - \nu_t \,\Big|\, \mathcal{F}_t \Big], \qquad A_{t,j} = \frac{n}{\beta}\, \mathbb{E}\Big[ \big( \nu_{t,j} - \gamma\, e_j^T \Sigma^T U^T P ( U \Sigma \nu_t - \eta ) \big)^2 - \nu_{t,j}^2 \,\Big|\, \mathcal{F}_t \Big]. \]
The mean term of $\nu_t$ simplifies significantly, and by no accident: by construction, the SGD update rule has a conditional expectation which is proportional to the gradient of the objective function. Observe that since
\[ \mathbb{E}[P] = \frac{\beta}{n} \sum_{i=1}^n e_i e_i^T = \frac{\beta}{n}\, I, \]
the previous equation simplifies to
\[ B_t = -\gamma\, \Sigma^T ( \Sigma \nu_t - U^T \eta ) \qquad \text{and} \qquad B_{t,j} = -\gamma \sigma_j^2 \nu_{t,j} + \gamma \sigma_j (U^T \eta)_j. \tag{35} \]
We now turn to the evaluation of $A_{t,j}$. If we let $P = \sum_{i \in B} e_i e_i^T$, then
\[ A_{t,j} = -2 \nu_{t,j}\, \gamma\, e_j^T \Sigma^T U^T ( U \Sigma \nu_t - \eta ) + \gamma^2\, \frac{n}{\beta}\, \mathbb{E}\Big[ \big( e_j^T \Sigma^T U^T P ( U \Sigma \nu_t - \eta ) \big)^2 \,\Big|\, \mathcal{F}_t \Big] = 2 \nu_{t,j} B_{t,j} + \gamma^2\, \frac{n \sigma_j^2}{\beta}\, \mathbb{E}\Big[ \Big( \sum_{i \in B} \big( e_j^T U^T e_i \big) \big( e_i^T ( U \Sigma \nu_t - \eta ) \big) \Big)^2 \,\Big|\, \mathcal{F}_t \Big]. \tag{36} \]
To compute this conditional expectation, we record the following lemma.

Lemma B.1.
Suppose that $u$ and $v$ are fixed vectors in $\mathbb{R}^n$. Then
\[ \mathbb{E}\Big( \sum_{i \in B} u_i v_i \Big)^2 = \frac{\beta(\beta-1)}{n(n-1)}\, (u^T v)^2 + \Big( \frac{\beta}{n} - \frac{\beta(\beta-1)}{n(n-1)} \Big) \sum_{i=1}^n (u_i v_i)^2. \]

Proof. This reduces to the two probabilities
\[ \Pr( i \in B ) = \frac{\beta}{n} \qquad \text{and} \qquad \Pr( i, \ell \in B ) = \frac{\beta(\beta-1)}{n(n-1)}, \]
where $i \neq \ell$ are any fixed numbers in $\{1, 2, \ldots, n\}$. The proof now follows by expanding both sides.

Using Lemma B.1, we can therefore simplify (36) by writing
\[ A_{t,j} = 2 \nu_{t,j} B_{t,j} + \frac{\beta-1}{n-1} \big( B_{t,j} \big)^2 + \gamma^2 \sigma_j^2 \Big( 1 - \frac{\beta-1}{n-1} \Big) \sum_{i=1}^n \big( e_j^T U^T e_i \big)^2 \big( e_i^T ( U \Sigma \nu_t - \eta ) \big)^2. \tag{37} \]
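Lemma B.1 is elementary to check by simulation; the following Monte Carlo snippet (ours) compares the empirical second moment against the identity.

```python
import numpy as np

rng = np.random.default_rng(3)
n, beta, trials = 50, 7, 100_000
u, v = rng.standard_normal(n), rng.standard_normal(n)
emp = np.mean([(u[B] @ v[B])**2
               for B in (rng.choice(n, beta, replace=False) for _ in range(trials))])
p2 = beta * (beta - 1) / (n * (n - 1))         # Pr(i, l in B) for i != l
pred = p2 * (u @ v)**2 + (beta / n - p2) * np.sum((u * v)**2)
print(emp, pred)                               # agree to Monte Carlo accuracy
```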
B.4 Constructing the approximate Volterra equation

In this section, we derive an approximate Volterra equation. First, we can write the function values at the iterates $x_{N_t}$ in terms of $\nu_t$:
\[ \psi_{\varepsilon}(t) \overset{\text{def}}{=} f( x_{N_t} ) = \frac{1}{2n} \big\| A( x_{N_t} - \widetilde{x} ) - \sqrt{n}\, \eta \big\|^2 = \frac{1}{2} \big\| \Sigma \nu_t - U^T \eta \big\|^2 = \frac{1}{2} \sum_{j=1}^d \sigma_j^2 \nu_{t,j}^2 - \sum_{j=1}^{n \wedge d} \sigma_j \nu_{t,j} ( U^T \eta )_j + \frac{1}{2} \| \eta \|^2. \tag{38} \]
Hence the dynamics of $\psi_{\varepsilon}(t)$ are governed by the behavior of the $\nu_{t,j}$ and $\nu_{t,j}^2$ processes. We now return to $A_{t,j}$. The key lemma to simplifying (37) is that the $( e_j^T U^T e_i )^2$ expression in (37) self-averages to $\tfrac{1}{n}$. Furthermore, we are working in the regime where $\beta = o(n)$, and hence the terms with $\tfrac{\beta}{n}$ vanish in the large-$n$ limit. Thus we define
\[ \widehat{A}_{t,j} \overset{\text{def}}{=} 2 \nu_{t,j} B_{t,j} + \gamma^2 \sigma_j^2 \sum_{i=1}^n \frac{1}{n} \big( e_i^T ( U \Sigma \nu_t - \eta ) \big)^2 = -2 \gamma \sigma_j^2 \nu_{t,j}^2 + 2 \gamma \sigma_j \nu_{t,j} ( U^T \eta )_j + \frac{2 \gamma^2 \sigma_j^2\, \psi_{\varepsilon}(t)}{n}. \tag{39} \]
We will show that $\widehat{A}_{t,j}$ is a good approximation of $A_{t,j}$, in a suitably strong sense, so that we can derive a deterministic Volterra equation description of $\psi_{\varepsilon}(t)$ in the large-$n$ limit. For the moment, let us group these error terms together. Define the càdlàg process
\[ E_{t,j} \overset{\text{def}}{=} \nu_{t,j}^2 - \nu_{0,j}^2 - \int_0^t \widehat{A}_{s,j}\, \mathrm{d}s. \tag{40} \]
In the following lemma, we get expressions for $\nu_{t,j}^2$ and $\nu_{t,j}$.

Lemma B.2.
For any $t \geq 0$ and for any $1 \leq j \leq d$,
\[ \nu_{t,j}^2 = e^{-2 t \gamma \sigma_j^2}\, \nu_{0,j}^2 + \int_0^t e^{-2(t-s)\gamma\sigma_j^2} \Big( \big( 2 \gamma \sigma_j \nu_{s,j} ( U^T \eta )_j + \tfrac{2 \gamma^2 \sigma_j^2\, \psi_{\varepsilon}(s)}{n} \big)\, \mathrm{d}s + \mathrm{d}E_{s,j} \Big) \]
and
\[ \nu_{t,j} = e^{-\gamma \sigma_j^2 t}\, \nu_{0,j} + \int_0^t e^{-\gamma \sigma_j^2 (t-s)} \big( \gamma \sigma_j ( U^T \eta )_j\, \mathrm{d}s + \mathrm{d}\widetilde{M}_{s,j} \big). \]

Proof.
We show the first equation; the second follows by a similar argument. Using the definition of $E_{t,j}$, the following holds:
\[ \nu_{t,j}^2 = \nu_{0,j}^2 + \int_0^t \Big( -2 \gamma \sigma_j^2 \nu_{s,j}^2 + 2 \gamma \sigma_j \nu_{s,j} ( U^T \eta )_j + \tfrac{2 \gamma^2 \sigma_j^2\, \psi_{\varepsilon}(s)}{n} \Big)\, \mathrm{d}s + E_{t,j}. \]
Using càdlàg differentiation, we get that
\[ \mathrm{d}\big( e^{2 t \gamma \sigma_j^2}\, \nu_{t,j}^2 \big) = e^{2 t \gamma \sigma_j^2} \Big( 2 \gamma \sigma_j \nu_{t,j} ( U^T \eta )_j + \tfrac{2 \gamma^2 \sigma_j^2\, \psi_{\varepsilon}(t)}{n} \Big)\, \mathrm{d}t + e^{2 t \gamma \sigma_j^2}\, \mathrm{d}E_{t,j}. \]
Hence, integrating both sides, one obtains
\[ \nu_{t,j}^2 = e^{-2 t \gamma \sigma_j^2} \Big( \nu_{0,j}^2 + \int_0^t e^{2 s \gamma \sigma_j^2} \Big( 2 \gamma \sigma_j \nu_{s,j} ( U^T \eta )_j + \tfrac{2 \gamma^2 \sigma_j^2\, \psi_{\varepsilon}(s)}{n} \Big)\, \mathrm{d}s + \int_0^t e^{2 s \gamma \sigma_j^2}\, \mathrm{d}E_{s,j} \Big), \]
which completes the proof.

We then apply Lemma B.2 to (38) and derive the approximate Volterra equation
\[ \psi_{\varepsilon}(t) = \frac{1}{2} \sum_{j=1}^d \sigma_j^2 \Big( e^{-2 t \gamma \sigma_j^2}\, \nu_{0,j}^2 + \int_0^t e^{-2(t-s)\gamma\sigma_j^2}\, \frac{2 \gamma^2 \sigma_j^2\, \psi_{\varepsilon}(s)}{n}\, \mathrm{d}s \Big) + \sum_{j=1}^d \int_0^t e^{-2(t-s)\gamma\sigma_j^2}\, \gamma \sigma_j^3\, \nu_{s,j} ( U^T \eta )_j\, \mathrm{d}s + \frac{1}{2} \| \eta \|^2 - \sum_{j=1}^{n \wedge d} \sigma_j \nu_{t,j} ( U^T \eta )_j + \frac{1}{2} \sum_{j=1}^d \sigma_j^2 \int_0^t e^{-2(t-s)\gamma\sigma_j^2}\, \mathrm{d}E_{s,j}. \tag{41} \]
In this expression, we have gathered together the groups of terms that have different limiting behaviors. The first group consists of the terms which, due to the convergence of the empirical measure of singular values (Assumption 1.2), have continuum limits. The second group consists of the terms that survive in the limit due to the effect of the noise $\eta$. The final term is an error that vanishes in the limit. We make the convergence of the first two groups explicit in the following lemmata.
B.5 Stability of the Volterra equation

We begin by defining some Laplace transforms of the limiting spectral measure: for integers $k \geq 0$,
\[ h_k(t) \overset{\text{def}}{=} \int_0^{\infty} x^k e^{-2 \gamma t x}\, \mathrm{d}\mu(x). \tag{42} \]
We first show that, under Assumption 1.2, the terms in the first group of (41) converge to a finite limit; this is the content of Lemma B.3 below.
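As an illustration of these transforms (ours, not part of the paper's argument): in the isotropic case, where $\mu$ is the Marchenko-Pastur law with ratio $r < 1$, the $h_k$ can be evaluated by direct quadrature against the explicit density; the rectangle rule below is an illustrative choice.

```python
import numpy as np

def h_mp(k, t, gamma, r, num=4000):
    """h_k(t) by a rectangle rule against the Marchenko-Pastur density (r < 1)."""
    lo, hi = (1 - np.sqrt(r))**2, (1 + np.sqrt(r))**2
    x = np.linspace(lo, hi, num)
    dx = x[1] - x[0]
    density = np.sqrt((hi - x) * (x - lo)) / (2 * np.pi * r * x)
    vals = x**k * np.exp(-2 * gamma * np.outer(np.atleast_1d(t), x)) * density
    return vals.sum(axis=1) * dx

print(h_mp(0, 0.0, gamma=0.3, r=0.5))   # ~1: mu is a probability measure
print(h_mp(1, 0.0, gamma=0.3, r=0.5))   # ~1: the mean of mu_MP is 1
```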
Lemma B.3. Locally uniformly on compact sets of time,
\[ \frac{1}{2} \sum_{j=1}^d \sigma_j^2\, e^{-2 t \gamma \sigma_j^2}\, \nu_{0,j}^2 \xrightarrow[n \to \infty]{\Pr} \frac{R\, h_1(t)}{2} \tag{43} \]
and
\[ \sum_{j=1}^d \frac{\gamma^2 \sigma_j^4}{n}\, e^{-2 t \gamma \sigma_j^2} \xrightarrow[n \to \infty]{\Pr} \gamma^2 r\, h_2(t). \tag{44} \]

Proof.
We begin by showing that each term in (43) and (44) converges pointwise in probability. The convergence of (44) is immediate from Assumption 1.2, so it remains only to show the pointwise convergence of (43). Under Assumption 1.1, and using the uniform distribution of $V$, we have that $\nu_0 = V^T( x_0 - \widetilde{x} )$ is a uniformly distributed vector on the sphere of radius $\sqrt{R}$. Observe first that, conditioned on $\Sigma$, the conditional expectation of the LHS of (43) is given by
\[ \mathbb{E}\Big[ \frac{1}{2} \sum_{j=1}^d \sigma_j^2 e^{-2 t \gamma \sigma_j^2}\, \nu_{0,j}^2 \,\Big|\, \Sigma \Big] = \frac{R}{2d} \sum_{j=1}^d \sigma_j^2 e^{-2 t \gamma \sigma_j^2}. \]
The vector $( \nu_{0,j}^2 / R )_{j=1}^d$ follows the Dirichlet distribution, which is negatively associated. In particular, $\mathbb{E}( \nu_{0,j_1}^2 \nu_{0,j_2}^2 ) \leq \mathbb{E}( \nu_{0,j_1}^2 )\, \mathbb{E}( \nu_{0,j_2}^2 )$ for $j_1 \neq j_2$. Further, $\mathbb{E}( \nu_{0,j}^4 ) \leq 3 R^2 d^{-2}$, as the moments are bounded by the corresponding normal moments. Hence, the cross terms in the variance are nonpositive, and the variance is bounded by
\[ \operatorname{Var}\Big( \frac{1}{2} \sum_{j=1}^d \sigma_j^2 e^{-2 t \gamma \sigma_j^2}\, \nu_{0,j}^2 \,\Big|\, \Sigma \Big) = \mathbb{E}\Big[ \Big( \frac{1}{2d} \sum_{j=1}^d \sigma_j^2 e^{-2 t \gamma \sigma_j^2} \big( d \nu_{0,j}^2 - R \big) \Big)^2 \,\Big|\, \Sigma \Big] \leq \frac{1}{4 d^2} \sum_{j=1}^d \big( \sigma_j^2 e^{-2 t \gamma \sigma_j^2} \big)^2\, \mathbb{E}\big[ ( d \nu_{0,j}^2 - R )^2 \mid \Sigma \big] = \frac{1}{4d} \Big[ \frac{1}{d} \sum_{j=1}^d \sigma_j^4 e^{-2 \cdot 2 t \gamma \sigma_j^2} \Big] \big( d^2\, \mathbb{E}[ \nu_{0,j}^4 ] - R^2 \big) = O\Big( \frac{1}{d} \Big). \]
Therefore, for $\epsilon > 0$, the conditional Chebyshev inequality gives
\[ \Pr\Big( \Big| \frac12 \sum_{j=1}^d \sigma_j^2 e^{-2 t \gamma \sigma_j^2} \nu_{0,j}^2 - \frac{R}{2d} \sum_{j=1}^d \sigma_j^2 e^{-2 t \gamma \sigma_j^2} \Big| > \epsilon \,\Big|\, \Sigma \Big) \leq \frac{1}{\epsilon^2}\, \operatorname{Var}\Big( \frac12 \sum_{j=1}^d \sigma_j^2 e^{-2 t \gamma \sigma_j^2} \nu_{0,j}^2 \,\Big|\, \Sigma \Big) \xrightarrow[n \to \infty]{} 0. \]
Applying the law of total probability to this, and combining it with the weak convergence in probability of the ESM (Assumption 1.2), gives
\[ \Pr\Big( \Big| \frac12 \sum_{j=1}^d \sigma_j^2 e^{-2 t \gamma \sigma_j^2} \nu_{0,j}^2 - \frac{R\, h_1(t)}{2} \Big| > \epsilon \Big) \leq \Pr\Big( \Big| \frac12 \sum_{j=1}^d \sigma_j^2 e^{-2 t \gamma \sigma_j^2} \nu_{0,j}^2 - \frac{R}{2d} \sum_{j=1}^d \sigma_j^2 e^{-2 t \gamma \sigma_j^2} \Big| > \frac{\epsilon}{2} \Big) + \Pr\Big( \Big| \frac{R}{2d} \sum_{j=1}^d \sigma_j^2 e^{-2 t \gamma \sigma_j^2} - \frac{R\, h_1(t)}{2} \Big| > \frac{\epsilon}{2} \Big) \xrightarrow[n \to \infty]{} 0. \]
Now we show the uniform convergence of (43) on a time interval $[0,T]$ for fixed $T > 0$ (the same argument applies to (44)). Considering mesh points on $[0,T]$ with spacing, say, $\lambda > 0$, the pointwise convergence holds on those mesh points, and hence so does convergence of the supremum over them. For an arbitrary time $t \in [0,T]$, there exists a mesh point $t_0$ such that $|t - t_0| \leq \lambda$. Then, since $t \mapsto e^{-2 t \gamma \sigma_j^2}$ is Lipschitz on $[0,T]$ with some Lipschitz constant $C > 0$ (uniform in $j$, by the boundedness of the $\sigma_j$), we have
\[ \sup_{0 \le t \le T} \Big| \frac12 \sum_{j=1}^d \sigma_j^2 \big( e^{-2 t \gamma \sigma_j^2} - e^{-2 t_0 \gamma \sigma_j^2} \big) \nu_{0,j}^2 \Big| \leq \frac{C \lambda}{2} \sum_{j=1}^d \sigma_j^2 \nu_{0,j}^2. \]
Note that $\sum_{j=1}^d \sigma_j^2 \nu_{0,j}^2 \xrightarrow[n \to \infty]{\Pr} R \int_0^{\infty} x\, \mathrm{d}\mu(x) < \infty$, using a similar idea: conditioning on $\Sigma$ and applying Assumption 1.2. Then, applying the triangle inequality and taking the supremum over $t \in [0,T]$,
\[ \sup_{0 \le t \le T} \Big| \frac12 \sum_{j=1}^d \sigma_j^2 e^{-2 t \gamma \sigma_j^2} \nu_{0,j}^2 - \frac{R\, h_1(t)}{2} \Big| \leq \sup_{t_0} \Big| \frac12 \sum_{j=1}^d \sigma_j^2 e^{-2 t_0 \gamma \sigma_j^2} \nu_{0,j}^2 - \frac{R\, h_1(t_0)}{2} \Big| + \sup_{0 \le t \le T} \Big| \frac12 \sum_{j=1}^d \sigma_j^2 \big( e^{-2 t \gamma \sigma_j^2} - e^{-2 t_0 \gamma \sigma_j^2} \big) \nu_{0,j}^2 \Big| + \sup_{|t - t_0| \le \lambda} \frac{R}{2} \big| h_1(t) - h_1(t_0) \big|. \]
Given that $\mu$ has finite support, we have $\sup_{|t - t_0| \le \lambda} | h_1(t) - h_1(t_0) | \leq C' \lambda$ for some $C' > 0$. The claim now follows, as $\lambda$ can be chosen as small as we like.

We can now recast (41) as an approximate Volterra-type integral equation,
\[ \psi_{\varepsilon}(t) = \frac{R\, h_1(t)}{2} + \widetilde{R}\, \frac{r\, h_0(t) + (1-r)}{2} + \varepsilon_1^{(n)}(t) + \int_0^t \Big( \gamma^2 r\, h_2(t-s) + \varepsilon_2^{(n)}(t-s) \Big)\, \psi_{\varepsilon}(s)\, \mathrm{d}s, \tag{45} \]
where the $\varepsilon_i^{(n)}$ are defined implicitly by comparison with (41). In particular, $\varepsilon_2^{(n)}(t)$ is given by the difference
\[ \varepsilon_2^{(n)}(t) = \sum_{j=1}^d \frac{\gamma^2 \sigma_j^4}{n}\, e^{-2 t \gamma \sigma_j^2} - \gamma^2 r\, h_2(t), \]
which is therefore guaranteed to converge to $0$ by Lemma B.3. The other error, $\varepsilon_1^{(n)}$, is substantially more complicated; we discuss it fully in (54), Section B.6. However, all we need to show is that this error tends to $0$, as Volterra equations are stable:

Proposition B.1 (Stability of the Volterra equation). Fix a constant
$T > 0$, and suppose $\psi_{\varepsilon}$ solves (45) with error terms satisfying
\[ \sup_{0 \le t \le T} | \varepsilon_1^{(n)}(t) | \xrightarrow[n \to \infty]{\Pr} 0 \qquad \text{and} \qquad \sup_{0 \le t \le T} | \varepsilon_2^{(n)}(t) | \xrightarrow[n \to \infty]{\Pr} 0. \]
Suppose that $\psi$ solves (45) with $\varepsilon_1^{(n)} = \varepsilon_2^{(n)} = 0$. Then for any fixed length of time $T$, the perturbed solution $\psi_{\varepsilon}$ of the Volterra equation converges uniformly, in probability, to the unperturbed solution $\psi(t)$:
\[ \sup_{0 \le t \le T} | \psi_{\varepsilon}(t) - \psi(t) | \xrightarrow[n \to \infty]{\Pr} 0. \]

Proof.
Throughout this proof, we set $\mathbb{R}_+ = \{ x \in \mathbb{R} : x \geq 0 \}$, and we write $L^1_{\mathrm{loc}}(\mathbb{R}_+)$ for the locally integrable functions on $\mathbb{R}_+$. We suppress the $n$-dependence in the error terms $\varepsilon_1^{(n)}$ and $\varepsilon_2^{(n)}$. We begin by defining some notation for solving Volterra equations, namely the kernel and the forcing function, respectively:
\[ k_{\varepsilon}(t) \overset{\text{def}}{=} -\big( \gamma^2 r\, h_2(t) + \varepsilon_2(t) \big) \qquad \text{and} \qquad f_{\varepsilon}(t) \overset{\text{def}}{=} \frac{R\, h_1(t)}{2} + \widetilde{R}\, \frac{r\, h_0(t) + 1 - r}{2} + \varepsilon_1(t). \tag{46} \]
Here we use the convention that $k_0$ and $f_0$ correspond to the case $\varepsilon_1(t) = \varepsilon_2(t) = 0$. Under this notation, the Volterra equation (45) becomes
\[ \psi_{\varepsilon}(t) + \int_0^t k_{\varepsilon}(t-s)\, \psi_{\varepsilon}(s)\, \mathrm{d}s = f_{\varepsilon}(t). \tag{47} \]
Now we check that $k_{\varepsilon}(t) \in L^1_{\mathrm{loc}}(\mathbb{R}_+)$ with high probability. To see this, we only need that $h_2(t) \in L^1_{\mathrm{loc}}(\mathbb{R}_+)$, as the supremum condition on $\varepsilon_2(t)$ guarantees that the error term in $k_{\varepsilon}$ is bounded with high probability, and is therefore in $L^1_{\mathrm{loc}}(\mathbb{R}_+)$ with high probability. Since $h_2(t) \geq 0$, we can apply Tonelli's theorem:
\[ \int_0^{\infty} h_2(t)\, \mathrm{d}t = \int_0^{\infty} \int_0^{\infty} x^2 e^{-2 \gamma t x}\, \mathrm{d}t\, \mathrm{d}\mu(x) = \int_0^{\infty} \frac{x}{2\gamma}\, \mathrm{d}\mu(x) < \infty. \]
Here we used that $\mu$ is compactly supported to conclude that the last integral is finite. Hence it follows that $h_2(t) \in L^1(\mathbb{R}_+)$, which shows that $k_{\varepsilon}(t) \in L^1_{\mathrm{loc}}(\mathbb{R}_+)$. To prove the conclusion of the proposition, we will use a stability theorem together with existence and uniqueness for Volterra equations with convolution-type kernels. The solutions of convolution-kernel Volterra equations rely on a function defined through the kernel $k$, called the resolvent of the kernel $k$. We define this resolvent as the function $r_{\varepsilon} : \mathbb{R}_+ \to \mathbb{R}$ such that
\[ r_{\varepsilon}(t) = \sum_{j=1}^{\infty} (-1)^{j-1}\, k_{\varepsilon}^{*j}(t), \tag{48} \]
where the function $k_{\varepsilon}^{*j}(t)$, $j \geq 1$, is the $(j-1)$-fold convolution of the kernel $k_{\varepsilon}$ with itself. We want to show that a perturbation of the kernel $k_{\varepsilon}$ results in a perturbation of the resolvent. Since $k_{\varepsilon} \in L^1_{\mathrm{loc}}(\mathbb{R}_+)$, the stability theorem for kernels, Theorem 3.1 in Gripenberg et al. [1990], says that the resolvent $r_{\varepsilon} \in L^1_{\mathrm{loc}}(\mathbb{R}_+)$ is unique and depends continuously on $k_{\varepsilon}$ in the $L^1_{\mathrm{loc}}(\mathbb{R}_+)$ topology. As $\sup_{0 \le t \le T} | \varepsilon_2(t) | \to 0$ in probability, we have that
\[ \int_0^T | k_{\varepsilon}(t) - k_0(t) |\, \mathrm{d}t = \int_0^T | \varepsilon_2(t) |\, \mathrm{d}t \xrightarrow[n \to \infty]{\Pr} 0, \]
so by continuity in $L^1_{\mathrm{loc}}$, we get that
\[ \int_0^T | r_{\varepsilon}(t) - r_0(t) |\, \mathrm{d}t \xrightarrow[n \to \infty]{\Pr} 0. \]
Using this resolvent, the unique solution to the Volterra equation (47) [Gripenberg et al., 1990, Theorem 3.5] is given by
\[ \psi_{\varepsilon}(t) = f_{\varepsilon}(t) - ( r_{\varepsilon} * f_{\varepsilon} )(t). \tag{49} \]
A simple computation yields that
\[ | \psi_{\varepsilon}(t) - \psi(t) | \leq | f_{\varepsilon}(t) - f_0(t) | + \Big| \int_0^t ( r_0 - r_{\varepsilon} )(t-s)\, f_0(s)\, \mathrm{d}s \Big| + \Big| \int_0^t r_{\varepsilon}(t-s) \big[ f_0(s) - f_{\varepsilon}(s) \big]\, \mathrm{d}s \Big| \leq | f_{\varepsilon}(t) - f_0(t) | + \sup_{0 \le s \le t} | f_0(s) | \int_0^t | r_0 - r_{\varepsilon} |(t-s)\, \mathrm{d}s + \sup_{0 \le s \le t} | f_0(s) - f_{\varepsilon}(s) | \int_0^t | r_{\varepsilon}(t-s) |\, \mathrm{d}s. \]
Since the $h_k(t)$ are bounded, we clearly have that $\sup_{0 \le t \le T} | f_0(t) |$ is bounded. Working on the event that $r_{\varepsilon}(t) \in L^1_{\mathrm{loc}}(\mathbb{R}_+)$ (i.e. that $\int_0^T | r_{\varepsilon}(t) |\, \mathrm{d}t$ is bounded), that $\int_0^T | r_{\varepsilon}(t) - r_0(t) |\, \mathrm{d}t$ is small, and that $\sup_{0 \le t \le T} | \varepsilon_1(t) |$ is small, it follows that
\[ \sup_{0 \le t \le T} | \psi_{\varepsilon}(t) - \psi(t) | \leq \sup_{0 \le t \le T} | \varepsilon_1(t) | + \sup_{0 \le t \le T} | f_0(t) | \int_0^T | r_0 - r_{\varepsilon} |(t)\, \mathrm{d}t + \sup_{0 \le t \le T} | \varepsilon_1(t) | \int_0^T | r_{\varepsilon}(t) |\, \mathrm{d}t. \]
Since every term on the RHS is small, and the complement of the event on which we proved the inequality above has small probability, the result follows.

This yields one of the main theorems of this paper, which we restate for clarity (Theorem 1.1):
Theorem B.1 (Concentration of SGD). Suppose $\beta \in \mathbb{N}$ is a batch-size parameter such that $0 < \beta \leq n^{1/2 - \delta}$ for some $\delta > 0$, and the stepsize satisfies $\gamma < 2 \big( r \int_0^{\infty} x\, \mathrm{d}\mu(x) \big)^{-1}$. Let $T > 0$ be a constant. Under Assumptions 1.1 and 1.2, the function values at the iterates of SGD converge:
\[ \sup_{0 \le t \le T} \big| f\big( x_{\lfloor (n/\beta) t \rfloor} \big) - \psi(t) \big| \xrightarrow[n \to \infty]{\Pr} 0, \tag{50} \]
where the function $\psi$ is the solution to the Volterra equation
\[ \psi(t) = \frac{R}{2}\, h_1(t) + \frac{\widetilde{R}}{2} \big( r\, h_0(t) + (1-r) \big) + \int_0^t \gamma^2 r\, h_2(t-s)\, \psi(s)\, \mathrm{d}s, \qquad h_k(t) = \int_0^{\infty} x^k e^{-2 \gamma t x}\, \mathrm{d}\mu(x). \tag{51} \]

Proof. By the definitions of $N_t$ and $\psi_{\varepsilon}(t)$, we have $f\big( x_{\lfloor (n/\beta) t \rfloor} \big) = f\big( x_{N_{\tau_{\lfloor (n/\beta) t \rfloor}}} \big) = \psi_{\varepsilon}\big( \tau_{\lfloor (n/\beta) t \rfloor} \big)$. Also, Proposition B.1 gives $\sup_{0 \le t \le T} | \psi_{\varepsilon}(t) - \psi(t) | \to 0$ in probability. Therefore, the triangle inequality gives
\[ \sup_{0 \le t \le T} \big| f\big( x_{\lfloor (n/\beta) t \rfloor} \big) - \psi(t) \big| \leq \sup_{0 \le t \le T} \big| \psi_{\varepsilon}\big( \tau_{\lfloor (n/\beta) t \rfloor} \big) - \psi\big( \tau_{\lfloor (n/\beta) t \rfloor} \big) \big| + \sup_{0 \le t \le T} \big| \psi\big( \tau_{\lfloor (n/\beta) t \rfloor} \big) - \psi(t) \big|, \]
and by the continuity of $\psi(t)$, it suffices to show
\[ \sup_{0 \le t \le T} \big| \tau_{\lfloor (n/\beta) t \rfloor} - t \big| \xrightarrow[n \to \infty]{\Pr} 0. \tag{52} \]
First, note that
\[ \sup_{0 \le s \le T} \Big| N_s\, \frac{\beta}{n} - s \Big| \xrightarrow[n \to \infty]{\Pr} 0 \tag{53} \]
holds. For a fixed time $s \in [0,T]$, this comes from the strong law of large numbers, see [Kingman, 1993, (4.18)], and the result for the supremum over $[0,T]$ follows using the monotonicity of $N_t$ on $[0,T]$ and the meshing argument used in proving Lemma B.3. Now for $t > 0$, let $s > 0$ be such that $\lfloor (n/\beta) t \rfloor = N_s$, or $t = N_s \frac{\beta}{n} + \frac{\beta}{n} r_0$ for some $0 \leq r_0 < 1$. Therefore, observe
\[ \big| \tau_{\lfloor (n/\beta) t \rfloor} - t \big| = \Big| \tau_{N_s} - N_s\, \tfrac{\beta}{n} - \tfrac{\beta}{n} r_0 \Big| \leq \big| \tau_{N_s} - s \big| + \Big| N_s\, \tfrac{\beta}{n} - s \Big| + \tfrac{\beta}{n} r_0. \]
The last term converges to $0$ as $n \to \infty$, and so does the second term, in probability, by (53). So it remains to show the convergence in probability of the first term. By the definition of $N_s$, we have $s - \Delta \leq \tau_{N_s} \leq s$, where $\Delta$ denotes the largest spacing between adjacent jumps in $[0,T]$. Note that $N_s \leq N_T \leq 2T \frac{n}{\beta}$ with overwhelming probability [Klar, 2000, Prop. 1]. Recalling again that the $\tau_k - \tau_{k-1}$, $k \in \mathbb{N}$, are i.i.d. $\operatorname{Exp}(n/\beta)$, we have for $u > 0$
\[ \Pr\big( ( \Delta > u ) \cap ( N_T \leq 2T\tfrac{n}{\beta} ) \big) = 1 - \Pr\big( ( \Delta \leq u ) \cap ( N_T \leq 2T\tfrac{n}{\beta} ) \big) \leq 1 - \big( 1 - e^{-\frac{n}{\beta} u} \big)^{N_T} \leq N_T\, e^{-\frac{n}{\beta} u} \leq 2T\, \tfrac{n}{\beta}\, e^{-\frac{n}{\beta} u}. \]
This implies
\[ \Pr( \Delta > u ) \leq \Pr\big( ( \Delta > u ) \cap ( N_T \leq 2T\tfrac{n}{\beta} ) \big) + \Pr\big( N_T > 2T\tfrac{n}{\beta} \big) \to 0 \]
as $n \to \infty$, and we obtain the claim.

B.6 Bounding the errors $\varepsilon_1^{(n)}$

In this section, we give a high-level overview of the errors and how they converge to $0$. We will have the following error pieces:
\[ \varepsilon_1^{(n)}(t) \overset{\text{def}}{=} \varepsilon_{\mathrm{IC}}^{(n)}(t) + \varepsilon_{\mathrm{KL}}^{(n)}(t) + \varepsilon_{\mathrm{M}}^{(n)}(t) + \varepsilon_{\mathrm{beta}}^{(n)}(t) + \varepsilon_{\mathrm{eta}}^{(n)}(t). \tag{54} \]
We define these terms momentarily, and we verify that $\varepsilon_1^{(n)}$ is indeed equal to the sum of these pieces in Lemma B.6. We remark that before controlling the errors, we will need an a priori estimate which (effectively) shows that the function values remain bounded. Thus, we define the stopping time, for any fixed $\theta > 0$, by
\[ \vartheta \overset{\text{def}}{=} \inf\big\{ t \geq 0 : \| U \Sigma \nu_t - \eta \|^2 > n^{\theta} \big\}. \tag{55} \]
We then show:

Lemma B.4.
For any $\theta > 0$ and for any $T > 0$, $\vartheta > T$ with high probability.

This is achieved by a simple martingale-type estimate, similar to standard convergence arguments for SGD. The proof is given in Section B.7. We will need it in what follows.
We will also condition on $\Sigma$ going forward.

B.6.1 Errors from the convergence of the initial conditions
The error $\varepsilon_{\mathrm{IC}}^{(n)}(t)$ arises from convergence errors in the signal and initialization. It was already essentially discussed in Lemma B.3. We define it by
\[ \varepsilon_{\mathrm{IC}}^{(n)}(t) \overset{\text{def}}{=} \frac{1}{2} \sum_{j=1}^d \sigma_j^2 e^{-2 t \gamma \sigma_j^2}\, \nu_{0,j}^2 - \frac{R\, h_1(t)}{2}. \tag{56} \]
It accounts for the convergence of the initialization in the large-$d$ limit, and relies on the convergence of the empirical spectral distribution. By Lemma B.3, we have already shown that it converges to $0$.

B.6.2 Errors which vanish due to the key lemma
The vanishing of the error $\varepsilon_{\mathrm{KL}}^{(n)}(t)$ is the key lemma. To explain why we call it this: let us specialize to the case $\eta = 0$ and $\beta = 1$. If we were content to evaluate the expected function values, averaging over the randomness inherent in the SGD algorithm, then this would be the only error we would need to control. Thus, in some sense, it can be viewed as the minimal estimate needed to prove that the Volterra equation holds. This error is given by
\[ \varepsilon_{\mathrm{KL}}^{(n)}(t) \overset{\text{def}}{=} \frac{1}{2} \sum_{j=1}^d \gamma^2 \sigma_j^4 \int_0^t e^{-2(t-s)\gamma\sigma_j^2} \Big( \sum_{i=1}^n \Big( \big( e_j^T U^T e_i \big)^2 - \frac{1}{n} \Big) \big( e_i^T ( U \Sigma \nu_s - \eta ) \big)^2 \Big)\, \mathrm{d}s. \tag{57} \]
After interchanging the order of summation, it suffices to show:

Lemma B.5 (Key lemma). For any
$T > 0$ and for any $\epsilon > 0$, with overwhelming probability,
\[ \max_{1 \le i \le n}\, \max_{0 \le t \le T} \Big| \sum_{j=1}^d \sigma_j^4\, e^{-2 t \gamma \sigma_j^2} \Big( \big( e_j^T U^T e_i \big)^2 - \frac{1}{n} \Big) \Big| \leq n^{\epsilon - 1/2}. \]
This we show in Section C.2. Note that by combining this with (57) and Lemma B.4, we conclude that for any $\epsilon > 0$ and any $T$, with high probability,
\[ \max_{0 \le t \le T} | \varepsilon_{\mathrm{KL}}^{(n)}(t) | \leq n^{\epsilon + \theta - 1/2} \int_0^T \frac{\gamma^2}{2}\, \mathrm{d}s \xrightarrow[n \to \infty]{} 0. \]
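The self-averaging phenomenon behind the key lemma is easy to observe numerically. The snippet below (ours; sizes illustrative) measures the worst row error for Gaussian data, for which $U$ is exactly Haar-distributed, and compares it to the $n^{-1/2}$ scale.

```python
import numpy as np

rng = np.random.default_rng(4)
n, d, gamma, t = 1000, 500, 0.3, 1.0
A = rng.standard_normal((n, d)) / np.sqrt(n)
U, sig, _ = np.linalg.svd(A)                   # U is Haar for Gaussian A
w = sig**4 * np.exp(-2 * t * gamma * sig**2)   # the weights in Lemma B.5
err = U[:, :d]**2 @ w - w.sum() / n            # centered row sums, all i at once
print(np.abs(err).max(), n**-0.5)              # same order of magnitude
```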
B.6.3 Martingale errors

The martingale errors are due to the randomness in the algorithm itself. They are in part small because the singular vector matrix $U$ is delocalized, in that its entries in any fixed orthogonal basis are $O(n^{\epsilon - 1/2})$ with overwhelming probability. The martingale errors are given by
\[ \varepsilon_{\mathrm{M}}^{(n)}(t) \overset{\text{def}}{=} \frac{1}{2} \sum_{j=1}^d \sigma_j^2 \int_0^t e^{-2(t-s)\gamma\sigma_j^2}\, \mathrm{d}M_{s,j}. \tag{58} \]
Estimating this error requires a substantial build-up. The most important technical input, which we will use in multiple places, is that the function values do not concentrate too heavily in any coordinate direction. As an input, we will use Lemma B.4, and so we work with the stopped process defined for any $t \geq 0$ by $\nu_t^{\vartheta} \overset{\text{def}}{=} \nu_{t \wedge \vartheta}$. In some sense, this is the most challenging and important technical statement that we prove:

Proposition B.2.
For any $T > 0$ and any $\epsilon > 0$, there is a sufficiently small $\theta > 0$ so that
\[ \sup_{0 \le t \le T}\, \max_{1 \le i \le n} \big( e_i^T ( U \Sigma \nu_t^{\vartheta} - \eta ) \big)^2 \leq \beta\, n^{\epsilon - 1} \]
with overwhelming probability.
We expect that the upper bound on $\beta$ in Theorem B.1 is a limitation of our method, and that similar statements should hold for larger $\beta$. This proposition is proven in Section D.2. With Proposition B.2 in hand, we can then bound the martingale errors.

Proposition B.3.
For any $T > 0$,
\[ \sup_{0 \le t \le T} | \varepsilon_{\mathrm{M}}^{(n)}( t \wedge \vartheta ) | \xrightarrow[n \to \infty]{\Pr} 0. \]
This is proven in Section D.2. Given Proposition B.2, the proof of Proposition B.3 reduces to standard martingale techniques.
B.6.4 Errors due to minibatching
The Volterra equation that we prove, (12), importantly does not depend on the minibatch size. Naturally, the dynamics do depend on $\beta$, and so there are error terms which must be controlled, and which are in part small because the minibatch parameter $\beta$ satisfies $\beta/n \to 0$. These errors are given by
\[ \varepsilon_{\mathrm{beta}}^{(n)}(t) \overset{\text{def}}{=} \frac{1}{2} \sum_{j=1}^d \sigma_j^2 \int_0^t e^{-2(t-s)\gamma\sigma_j^2} \Big( \frac{\beta-1}{n-1}\, B_{s,j}^2 - \frac{\beta-1}{n-1}\, \gamma^2 \sigma_j^2 \sum_{i=1}^n \big( e_j^T U^T e_i \big)^2 \big( e_i^T ( U \Sigma \nu_s - \eta ) \big)^2 \Big)\, \mathrm{d}s. \tag{59} \]
Note in particular that when $\beta = 1$, this vanishes identically. Much of this error term is controlled using the delocalization of $U$ and Lemma B.4. However, there is one term, $\frac{\beta-1}{n-1} ( B_{s,j} )^2$, which requires extra work. We would like this to tend to $0$ in a sufficiently strong sense. On consideration of (35), we see that this in turn requires that $\nu_s$ itself be delocalized, in the sense that $| \nu_{s,j} | \leq n^{\epsilon - 1/2}$ with overwhelming probability.

Proposition B.4.
For any $\epsilon > 0$ and $T > 0$, with overwhelming probability,
\[ \sup_{0 \le t \le T}\, \max_{1 \le j \le d} | \nu_{t,j}^{\vartheta} | \leq \beta\, n^{\epsilon - 1/2}. \]
The dependence on $\beta$ is only through Proposition B.2, on which this relies. The proof is found in Section D.2. Proposition B.4, combined with eigenvector delocalization, gives the following proposition.

Proposition B.5.
For any $\epsilon > 0$ and any $T > 0$, with overwhelming probability,
\[ \sup_{0 \le t \le T} | \varepsilon_{\mathrm{beta}}^{(n)}(t) | \leq n^{\epsilon - 1/2}. \]
This is proven in Section C.3.
B.6.5 Errors due to the model noise
Finally, there are errors that arise due to the noise $\eta$. The model noise $\eta$ in fact induces a change in the dynamics of the algorithm. This change is reflected in an additional forcing term that appears in the Volterra equation. This forcing term is controlled (in some sense) by the mean behavior of $\nu_{t,j}$. The model noise error is defined by
\[ \varepsilon_{\mathrm{eta}}^{(n)}(t) \overset{\text{def}}{=} \sum_{j=1}^d \int_0^t e^{-2(t-s)\gamma\sigma_j^2}\, \gamma \sigma_j^3\, \nu_{s,j} ( U^T \eta )_j\, \mathrm{d}s + \frac{1}{2} \| \eta \|^2 - \sum_{j=1}^{n \wedge d} \sigma_j \nu_{t,j} ( U^T \eta )_j - \widetilde{R}\, \frac{r\, h_0(t) + (1-r)}{2}. \tag{60} \]
The fundamental identity that needs to be shown here is that averages of $\nu_{s,j} ( U^T \eta )_j$ converge. Within $\varepsilon_{\mathrm{eta}}^{(n)}(t)$ there are many such averages, and so we formulate a general claim to this effect.

Proposition B.6.
Let $\{ c_j \}_1^n$ be a deterministic sequence with $| c_j | \leq 1$ for all $j$, and define $\vartheta$ as in Lemma B.4. Then for any $t > 0$ and any $\epsilon > 0$,
\[ \Big| \sum_{j=1}^n c_j\, \sigma_j\, \nu_{t,j}^{\vartheta}\, ( U^T \eta )_j - \frac{\| \eta \|^2}{n} \sum_{j=1}^n c_j \big( 1 - e^{-(t \wedge \vartheta) \gamma \sigma_j^2} \big) \Big| \leq \| \eta \|^2 \sqrt{\beta}\, n^{\epsilon - 1/2} \]
with overwhelming probability. From this, $\varepsilon_{\mathrm{eta}}^{(n)}(t)$ tends to $0$.

Proposition B.7.
For any $T > 0$,
\[ \max_{0 \le t \le T} | \varepsilon_{\mathrm{eta}}^{(n)}(t) | \xrightarrow[n \to \infty]{\Pr} 0. \]
This is proven in Section C.4.
B.6.6 Verification of (54)

The combination of Propositions B.3, B.5, and B.7, together with Lemmas B.3 and B.5, shows that all the errors in (54) are small. Before proceeding, we connect (54) to the previous sections to demonstrate that it is truly the sum of errors that must be controlled.
Lemma B.6.
Equation (54) holds.

Proof.
We recall the approximate Volterra equation (41). The error $\varepsilon_1^{(n)}(t)$ is defined implicitly in (45). Using (56) and (60), we conclude that
\[ \varepsilon_1^{(n)}(t) - \varepsilon_{\mathrm{IC}}^{(n)}(t) - \varepsilon_{\mathrm{eta}}^{(n)}(t) = \frac{1}{2} \sum_{j=1}^d \sigma_j^2 \int_0^t e^{-2(t-s)\gamma\sigma_j^2}\, \mathrm{d}E_{s,j}. \tag{61} \]
Recall from (40) and (34) that
\[ E_{t,j} = \int_0^t A_{s,j}\, \mathrm{d}s + M_{t,j} - \int_0^t \widehat{A}_{s,j}\, \mathrm{d}s. \]
Using (37) and (39), we can express
\[ \mathrm{d}E_{t,j} = \mathrm{d}M_{t,j} + \Big( \frac{\beta-1}{n-1} \big( B_{t,j} \big)^2 - \frac{\beta-1}{n-1}\, \gamma^2 \sigma_j^2 \sum_{i=1}^n \big( e_j^T U^T e_i \big)^2 \big( e_i^T ( U \Sigma \nu_t - \eta ) \big)^2 \Big)\, \mathrm{d}t + \gamma^2 \sigma_j^2 \Big( \sum_{i=1}^n \Big( \big( e_j^T U^T e_i \big)^2 - \frac{1}{n} \Big) \big( e_i^T ( U \Sigma \nu_t - \eta ) \big)^2 \Big)\, \mathrm{d}t. \]
Each of these three pieces, on substitution into (61), produces $\varepsilon_{\mathrm{M}}^{(n)}(t)$, $\varepsilon_{\mathrm{beta}}^{(n)}(t)$, and $\varepsilon_{\mathrm{KL}}^{(n)}(t)$, respectively.

B.6.7 Proof organization
We organize the remainder of the proof as follows. We begin by proving Lemma B.4 in Section B.7 using standard martingale techniques. Arguments along this line are well known in the context of the analysis of SGD, and this argument is similar to (and in fact easier than) convergence arguments for SGD. In Section C, we introduce standard machinery for the concentration of Lipschitz functions on the orthogonal group, and we then make the error estimates that follow from this type of estimate. In particular, we prove that the key lemma, Lemma B.5, holds. We also show that Proposition B.5 and Proposition B.7 hold; note that these latter propositions depend on some estimates that require other martingale techniques. In Section D, we give the estimates that depend heavily on martingale concentration techniques. In Section D.1, we outline the general martingale concentration techniques we need; these extend general martingale techniques in ways appropriate to our setting. In Section D.2, we prove the remaining propositions, beginning with the main technical proposition, Proposition B.2. We then give the bounds that prove Propositions B.3, B.4, and B.6.

B.7 An a priori bound for the objective function values
Here we combine some of the estimates already developed to give a simple starting bound for the function values $\psi_{\varepsilon}(t)$, proving Lemma B.4. We will need this starting bound in many of our future estimates. We proceed by constructing an appropriate supermartingale, which we then use to control the evolution of $\psi_{\varepsilon}$.

Proof of Lemma B.4.
For convenience, we will write $\psi_{\varepsilon} = \psi$. We recall from (38) that
\[ \psi(t) = \frac{1}{2} \sum_{j=1}^d \sigma_j^2 \nu_{t,j}^2 - \sum_{j=1}^{n \wedge d} \sigma_j \nu_{t,j} ( U^T \eta )_j + \frac{1}{2} \| \eta \|^2. \]
Hence, using (36) and (35),
\[ \lim_{\epsilon \downarrow 0} \epsilon^{-1}\, \mathbb{E}[ \psi(t+\epsilon) - \psi(t) \mid \mathcal{F}_t ] = \frac{1}{2} \sum_{j=1}^d \sigma_j^2 A_{t,j} - \sum_{j=1}^{n \wedge d} \sigma_j B_{t,j} ( U^T \eta )_j \]
\[ = \frac{1}{2} \sum_{j=1}^d \sigma_j^2 \Big( -2 \nu_{t,j}\, \gamma\, e_j^T \Sigma^T U^T ( U \Sigma \nu_t - \eta ) + \gamma^2\, \frac{n}{\beta}\, \mathbb{E}\big[ ( e_j^T \Sigma^T U^T P ( U \Sigma \nu_t - \eta ) )^2 \mid \mathcal{F}_t \big] \Big) + \gamma \sum_{j=1}^{n \wedge d} ( \eta^T U \Sigma e_j )\, e_j^T \Sigma^T ( \Sigma \nu_t - U^T \eta ) \]
\[ = -\gamma\, ( \Sigma \nu_t - U^T \eta )^T \Sigma \Sigma^T ( \Sigma \nu_t - U^T \eta ) + \frac{\gamma^2 n}{2 \beta} \sum_{j=1}^d \mathbb{E}\Big[ \big( e_j^T \Sigma \Sigma^T U^T P ( U \Sigma \nu_t - \eta ) \big)^2 \,\Big|\, \mathcal{F}_t \Big]. \]
Using Lemma B.1, we can give the expression
\[ \lim_{\epsilon \downarrow 0} \epsilon^{-1}\, \mathbb{E}[ \psi(t+\epsilon) - \psi(t) \mid \mathcal{F}_t ] = -\gamma\, ( \Sigma \nu_t - U^T \eta )^T \Sigma \Sigma^T ( \Sigma \nu_t - U^T \eta ) + \frac{\gamma^2 (\beta-1)}{2(n-1)} \sum_{j=1}^d \big( e_j^T \Sigma \Sigma^T U^T ( U \Sigma \nu_t - \eta ) \big)^2 + \frac{\gamma^2}{2} \Big( 1 - \frac{\beta-1}{n-1} \Big) \sum_{j=1}^d \sum_{i=1}^n \big( e_j^T \Sigma \Sigma^T U^T e_i \big)^2 \big( e_i^T ( U \Sigma \nu_t - \eta ) \big)^2. \]
All three terms have the interpretation of a quadratic form $x^T \widehat{A} x$ for some matrix $\widehat{A}$ and the vector $x = U \Sigma \nu_t - \eta$. Specifically, we have
\[ \widehat{A} = -\gamma\, U \Sigma \Sigma^T U^T + \frac{\gamma^2 (\beta-1)}{2(n-1)}\, U \Sigma \Sigma^T \Sigma \Sigma^T U^T + \frac{\gamma^2}{2} \Big( 1 - \frac{\beta-1}{n-1} \Big) \sum_{i=1}^n e_i e_i^T\, \| \Sigma \Sigma^T U^T e_i \|^2. \]
As $\Sigma$ is bounded, we can let $\rho^{\star}$ be twice the largest eigenvalue of $\widehat{A}$, which is symmetric and can be bounded solely in terms of the norm of $\Sigma$. Since $\| x \|^2 = 2 \psi(t)$, we conclude that
\[ \lim_{\epsilon \downarrow 0} \epsilon^{-1}\, \mathbb{E}[ \psi(t+\epsilon) - \psi(t) \mid \mathcal{F}_t ] \leq \rho^{\star}\, \psi(t). \]
It follows immediately that $X_t \overset{\text{def}}{=} e^{-\rho^{\star} t}\, \psi(t)$ is a positive supermartingale. Hence, by optional stopping, for any $T > 0$,
\[ \Pr\Big[ \sup_{0 \le t \le T} X_t \geq \lambda \,\Big|\, \mathcal{F}_0 \Big] \leq \frac{\psi(0)}{\lambda}. \]
Hence,
\[ \Pr\Big[ \sup_{0 \le t \le T} \psi(t) \geq \lambda \,\Big|\, \mathcal{F}_0 \Big] \leq \frac{\psi(0)\, e^{\rho^{\star} T}}{\lambda}. \]
Taking $\lambda = n^{\theta}/2$ completes the proof.

C Estimates based on concentration of measure on the high-dimensional orthogonal group
C.1 Generalities
We recall a few properties of Haar measure on the orthogonal group. We endow the orthogonal group $O(n)$ with the metric given by the Frobenius norm, so that $d(O, U) = \| O - U \|_F$. We say that a function $F : O(n) \to \mathbb{R}$ is Lipschitz with constant $L$ if
\[ | F(O) - F(U) | \leq L\, \| O - U \|_F. \]
Recall that the orthogonal group can be partitioned into two disconnected copies of the special orthogonal group
$\mathrm{SO}(n)$, which we endow with the same metric. These are given as the preimages of $\{\pm 1\}$ under the determinant map. Haar measure on the special orthogonal group enjoys a strong concentration of measure property.

Theorem C.1.
Suppose that $F : \mathrm{SO}(n) \to \mathbb{R}$ is Lipschitz with constant $L$. Then for all $t \geq 0$,
\[ \Pr\big[ | F(U) - \mathbb{E}[F(U)] | > t \big] \leq 2 e^{-c n t^2 / L^2}, \]
where $c > 0$ is an absolute constant.

See [Vershynin, 2018, Theorem 5.2.7] or [Meckes, 2019, Theorem 5.17] for precise constants. We can derive concentration for even functions on the orthogonal group automatically:
Corollary C.1.
Suppose that $F : O(n) \to \mathbb{R}$ is Lipschitz with constant $L$ and suppose that $\mathbb{E}[ F(U) \det U ] = \mathbb{E}[ F(U) ] = 0$. Then for all $t \geq 0$,
\[ \Pr\big[ | F(U) - \mathbb{E}[F(U)] | > t \big] \leq 2 e^{-c n t^2 / L^2}, \]
where $c > 0$ is an absolute constant.

Proof. Under the assumption, the mean of $F$ conditioned on either $\det U = 1$ or $\det U = -1$ is $0$. Hence, by conditioning, we achieve the desired concentration around $0$, which is the mean $\mathbb{E}[F(U)]$.

As a useful illustration, an entry of a Haar-distributed random matrix $U$ on $O(n)$ is concentrated.

Corollary C.2.
For any $n > 1$, there is an absolute constant $c > 0$ so that for all $t \geq 0$ and all $i, j$ in $\{1, 2, \ldots, n\}$,
\[ \Pr[ | U_{ij} | > t ] \leq 2 e^{-c n t^2}. \]
The same statement holds for any generalized entry $x^T U y$ for fixed unit vectors $x$, $y$, i.e. $\Pr[ | x^T U y | > t ] \leq 2 e^{-c n t^2}$.

Proof.
The entry map $U \mapsto U_{ij}$ is $1$-Lipschitz. Moreover, it has mean $0$ when restricted to either component, as
\[ \mathbb{E}[ U_{ij} \det(U) ] = 0 = \mathbb{E}[ U_{ij} ]. \]
Note that by distributional invariance, negating row $i$ of $U$ leaves the distribution of $U$ invariant; doing so shows the second equality. Negating any row except for $i$ (which exists as $n > 1$) shows the first equality. For the generalized entry, by the linearity of the expectation,
\[ \mathbb{E}[ x^T U y \det(U) ] = \sum_{i,j} x_i y_j\, \mathbb{E}[ U_{ij} \det(U) ] = 0 = \sum_{i,j} x_i y_j\, \mathbb{E}[ U_{ij} ] = \mathbb{E}[ x^T U y ]. \]
Using that $U \mapsto x^T U y$ is $1$-Lipschitz, the proof follows.
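A quick empirical check of Corollary C.2 (ours; sizes illustrative): generalized entries $x^T U y$ of a Haar matrix fluctuate at scale $n^{-1/2}$.

```python
import numpy as np

def haar(n, rng):
    Z = rng.standard_normal((n, n))
    Q, R = np.linalg.qr(Z)
    return Q * np.sign(np.diag(R))

rng = np.random.default_rng(5)
n, trials = 200, 1000
x = np.zeros(n); x[0] = 1.0                  # fixed unit vectors x and y
y = np.ones(n) / np.sqrt(n)
vals = np.array([x @ haar(n, rng) @ y for _ in range(trials)])
print(vals.std() * np.sqrt(n))               # ~1: fluctuations at scale n^{-1/2}
```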
C.2 Applications to the Volterra equation errors

More to the point, we need concentration of random combinations of functions weighted by entries of $U$.

Lemma C.1. Let $T > 0$ be fixed, and suppose that $\{ g_j \}$ are functions from $[0,T] \to \mathbb{R}$ which are bounded by $L > 0$ and have Lipschitz constant $1$. Then, for any $\epsilon > 0$ and any fixed unit vectors $a$ and $b$ in $\mathbb{R}^n$, with overwhelming probability,
\[ \max_{0 \le t \le T} \Big| \sum_{j=1}^n g_j(t) \big( ( a^T U )_j ( b^T U )_j - \mathbb{E}[ ( a^T U )_j ( b^T U )_j ] \big) \Big| \leq n^{\epsilon - 1/2}. \]

Proof.
We first prove the claim for a fixed $t \in [0,T]$, and later generalize the result to all of $[0,T]$ using a meshing argument. In proving this, we can take advantage of Corollary C.1. For $t \in [0,T]$, let $F_t : O(n) \to \mathbb{R}$ be
\[ F_t(U) \overset{\text{def}}{=} \sum_{j=1}^n g_j(t) \big( ( a^T U )_j ( b^T U )_j - \mathbb{E}[ ( a^T U )_j ( b^T U )_j ] \big). \tag{62} \]
We can show that $F_t$ is a Lipschitz function on $O(n)$. Indeed, for $U, V \in O(n)$, $\mathbb{E}[ ( a^T U )_j ( b^T U )_j ] = \mathbb{E}[ ( a^T V )_j ( b^T V )_j ] = \frac{1}{n} \langle a, b \rangle$ and
\[ | F_t(U) - F_t(V) | = \Big| \sum_{j=1}^n g_j(t) \big( ( a^T U )_j - ( a^T V )_j \big) ( b^T U )_j + \sum_{j=1}^n g_j(t) ( a^T V )_j \big( ( b^T U )_j - ( b^T V )_j \big) \Big| \]
\[ \leq \sqrt{ \sum_{j=1}^n ( a^T (U - V) )_j^2 }\, \sqrt{ \sum_{j=1}^n g_j^2(t) ( b^T U )_j^2 } + \sqrt{ \sum_{j=1}^n ( b^T (U - V) )_j^2 }\, \sqrt{ \sum_{j=1}^n g_j^2(t) ( a^T V )_j^2 } \leq L\, \| a^T (U - V) \|_2\, \| b^T U \|_2 + L\, \| b^T (U - V) \|_2\, \| a^T V \|_2 \leq 2 L\, \| U - V \|_F. \]
Therefore, we conclude that $F_t$ is a Lipschitz function with Lipschitz constant $2L$. For $j \in \{1, \cdots, n\}$, let
\[ f_j \overset{\text{def}}{=} ( a^T U )_j ( b^T U )_j - \mathbb{E}[ ( a^T U )_j ( b^T U )_j ]. \]
Then, conditioning on $\det U = 1$ or $\det U = -1$, negating any column of $U$ leaves the distribution of $f_j$ invariant, which gives $\mathbb{E}[ f_j(U) \det U ] = \mathbb{E}[ f_j(U) ] = 0$, and thus, using linearity, $\mathbb{E}[ F_t(U) \det U ] = \mathbb{E}[ F_t(U) ] = 0$. Now Corollary C.1 gives, for $s \geq 0$,
\[ \Pr[ | F_t(U) - \mathbb{E}[ F_t(U) ] | > s ] \leq 2 e^{-c n s^2 / (2L)^2}, \]
where $c > 0$ is an absolute constant. Replacing $s = n^{\epsilon - 1/2}$ gives the claim for a fixed time $t \in [0,T]$.

Now we generalize the result to all times in $[0,T]$. Assume that the claim holds at mesh points of $[0,T]$ with arbitrarily small spacing, say $\lambda$. Then for any $t \in [0,T]$, there exists a mesh point $t_0 \in [0,T]$ such that $| t - t_0 | \leq \lambda$. The assumption that the $\{ g_j \}$ are Lipschitz with Lipschitz constant $1$ implies that $| g_j(t) - g_j(t_0) | \leq | t - t_0 | \leq \lambda$. Then we see
\[ \Big| \sum_{j=1}^n ( g_j(t) - g_j(t_0) ) \big( ( a^T U )_j ( b^T U )_j - \mathbb{E}[ ( a^T U )_j ( b^T U )_j ] \big) \Big| \leq \sum_{j=1}^n | g_j(t) - g_j(t_0) |\, \big| ( a^T U )_j ( b^T U )_j - \mathbb{E}[ ( a^T U )_j ( b^T U )_j ] \big| \leq 2 \lambda n, \]
and $\lambda$ can be taken arbitrarily small. Thus we have
\[ \Big| \sum_{j=1}^n g_j(t) \big( ( a^T U )_j ( b^T U )_j - \mathbb{E}[\cdot] \big) \Big| \leq \Big| \sum_{j=1}^n g_j(t_0) \big( ( a^T U )_j ( b^T U )_j - \mathbb{E}[\cdot] \big) \Big| + \Big| \sum_{j=1}^n ( g_j(t) - g_j(t_0) ) \big( ( a^T U )_j ( b^T U )_j - \mathbb{E}[\cdot] \big) \Big| \leq n^{\epsilon' - 1/2} + 2 n \lambda < n^{\epsilon - 1/2} \]
with overwhelming probability, taking $\epsilon' < \epsilon$ and $\lambda$ small enough in the last step. All in all, we have with overwhelming probability
\[ \max_{0 \le t \le T} \Big| \sum_{j=1}^n g_j(t) \big( ( a^T U )_j ( b^T U )_j - \mathbb{E}[ ( a^T U )_j ( b^T U )_j ] \big) \Big| \leq n^{\epsilon - 1/2}. \]
As $\epsilon > 0$ is arbitrary, the claim follows, and Lemma B.5 follows almost immediately.

Proof of Lemma B.5.
We just need to apply Lemma C.1 with $g_j(t) = \sigma_j^4\, e^{-2 \gamma \sigma_j^2 t}$ for $j \leq d$ (and $g_j \equiv 0$ otherwise) and $a = b = e_i$. Note that we are conditioning on $\Sigma$, so that the expectation in the statement of Lemma C.1 is taken only over $U$. By the boundedness of the $\sigma_j$, after dividing by a sufficiently large constant depending on $\Sigma$, the lemma applies.

C.3 Control of the beta errors
Next, we prove Proposition B.5 provided that Proposition B.4 holds.
Proof of Proposition B.5.
For $t \in [0,T]$, by equation (35), we have
\[ ( B_{s,j} )^2 = \big( -\gamma \sigma_j^2 \nu_{s,j}^{\vartheta} + \gamma \sigma_j ( U^T \eta )_j \big)^2 \leq \big( 2 \gamma^2 \sigma_j^4 \beta^2 + 2 \gamma^2 \sigma_j^2 \big)\, n^{2\epsilon - 1} \]
with overwhelming probability. Here, Proposition B.4 was used to bound $\nu_{s,j}^{\vartheta}$, and Assumption 1.1 and Corollary C.2 imply $| ( U^T \eta )_j | \leq n^{\epsilon - 1/2}$ w.o.p., by observing that
\[ \big( ( U^T \eta )_j : 1 \leq j \leq n \big) \overset{\text{law}}{=} \| \eta \|\, \big( U_{1,j} : 1 \leq j \leq n \big). \]
We recall the definition of $\vartheta$ from Lemma B.4,
\[ \vartheta = \inf\big\{ t \geq 0 : \| U \Sigma \nu_t - \eta \|^2 > n^{\theta} \big\}, \]
where $\theta < \epsilon/2$. By applying Corollary C.2 and the definition of $\vartheta$, we get
\[ \sum_{i=1}^n \big( e_j^T U^T e_i \big)^2 \big( e_i^T ( U \Sigma \nu_t^{\vartheta} - \eta ) \big)^2 \leq n^{2\epsilon - 1}\, \| U \Sigma \nu_t^{\vartheta} - \eta \|^2 \leq n^{3\epsilon - 1}, \]
so that
\[ | \varepsilon_{\mathrm{beta}}^{(n)}(t) | = \frac{1}{2} \Big| \sum_{j=1}^d \sigma_j^2 \int_0^t e^{-2(t-s)\gamma\sigma_j^2} \Big( \frac{\beta-1}{n-1} ( B_{s,j} )^2 - \frac{\beta-1}{n-1}\, \gamma^2 \sigma_j^2 \sum_{i=1}^n \big( e_j^T U^T e_i \big)^2 \big( e_i^T ( U \Sigma \nu_s - \eta ) \big)^2 \Big)\, \mathrm{d}s \Big| \]
\[ \leq \frac{1}{2} \sum_{j=1}^d \sigma_j^2 \Big( \int_0^t e^{-2(t-s)\gamma\sigma_j^2}\, \mathrm{d}s \Big) \Big( \frac{\beta}{n} \big( 2 \gamma^2 \sigma_j^4 \beta^2 + 2 \gamma^2 \sigma_j^2 \big)\, n^{2\epsilon - 1} + \frac{\beta}{n}\, \gamma^2 \sigma_j^2\, n^{3\epsilon - 1} \Big) \leq n^{\epsilon - 1/2}, \]
after decreasing $\epsilon$ in the last line, given our assumption $\beta \leq n^{1/2 - \delta}$.

C.4 Control of the eta errors
We now give the proof of Proposition B.7 provided that Proposition B.6 holds.
Proof of Proposition B.7.
Note that the proof of Proposition B.6 is based on conditioning on $\Sigma$. Therefore, Proposition B.6, applied (for each fixed $s$, and then integrated) with $c_j$ proportional to $e^{-2(t-s)\gamma\sigma_j^2}\, \gamma \sigma_j^2$ and with $c_j = 1$, respectively, implies that with overwhelming probability
\[ \Big| \sum_{j=1}^d \int_0^t e^{-2(t-s)\gamma\sigma_j^2}\, \gamma \sigma_j^3\, \nu_{s,j}^{\vartheta} ( U^T \eta )_j\, \mathrm{d}s - \frac{\| \eta \|^2}{n} \sum_{j=1}^d \int_0^t e^{-2(t-s)\gamma\sigma_j^2}\, \gamma \sigma_j^2 \big( 1 - e^{-(s \wedge \vartheta) \gamma \sigma_j^2} \big)\, \mathrm{d}s \Big| \leq C\, \| \eta \|^2 \sqrt{\beta}\, n^{\epsilon - 1/2} \]
and
\[ \Big| \sum_{j=1}^{n \wedge d} \sigma_j \nu_{t,j}^{\vartheta} ( U^T \eta )_j - \frac{\| \eta \|^2}{n} \sum_{j=1}^{n \wedge d} \big( 1 - e^{-(t \wedge \vartheta) \gamma \sigma_j^2} \big) \Big| \leq \| \eta \|^2 \sqrt{\beta}\, n^{\epsilon - 1/2}. \]
Therefore, on the event $\{ \vartheta > T \}$ of Lemma B.4, it suffices to show that
\[ \frac{\| \eta \|^2}{n} \sum_{j=1}^d \int_0^t e^{-2(t-s)\gamma\sigma_j^2}\, \gamma \sigma_j^2 \big( 1 - e^{-s \gamma \sigma_j^2} \big)\, \mathrm{d}s + \frac{1}{2} \| \eta \|^2 - \frac{\| \eta \|^2}{n} \sum_{j=1}^{n \wedge d} \big( 1 - e^{-t \gamma \sigma_j^2} \big) - \widetilde{R}\, \frac{r\, h_0(t) + (1-r)}{2} \tag{63} \]
converges to $0$ in probability as $n \to \infty$. Observe
\[ \frac{\| \eta \|^2}{n} \sum_{j=1}^d \int_0^t e^{-2(t-s)\gamma\sigma_j^2}\, \gamma \sigma_j^2 \big( 1 - e^{-s \gamma \sigma_j^2} \big)\, \mathrm{d}s = \frac{\| \eta \|^2}{n} \sum_{j=1}^d \gamma \sigma_j^2\, e^{-2 t \gamma \sigma_j^2} \int_0^t e^{2 s \gamma \sigma_j^2} \big( 1 - e^{-s \gamma \sigma_j^2} \big)\, \mathrm{d}s \]
\[ = \frac{\| \eta \|^2}{n} \sum_{j=1}^d e^{-2 t \gamma \sigma_j^2} \Big[ \frac{1}{2} \big( e^{2 t \gamma \sigma_j^2} - 1 \big) - \big( e^{t \gamma \sigma_j^2} - 1 \big) \Big] = \frac{\| \eta \|^2}{n} \Big[ \frac{d}{2} + \frac{1}{2} \sum_{j=1}^d e^{-2 t \gamma \sigma_j^2} - \sum_{j=1}^d e^{-t \gamma \sigma_j^2} \Big]. \]
Note that $\| \eta \|^2 \to \widetilde{R}$ and $\frac{d\, \| \eta \|^2}{n} \xrightarrow[n \to \infty]{\Pr} r \widetilde{R}$ by Assumption 1.1, and $\frac{\| \eta \|^2}{n} \sum_{j=1}^d e^{-2 t \gamma \sigma_j^2} \xrightarrow[n \to \infty]{\Pr} \widetilde{R}\, r\, h_0(t)$ by Assumption 1.2. Moreover, $\sum_{j=1}^{n \wedge d} ( 1 - e^{-t \gamma \sigma_j^2} ) = \sum_{j=1}^d ( 1 - e^{-t \gamma \sigma_j^2} )$ always holds, because $\sigma_j = 0$ for $j > n \wedge d$; the term $\frac{\| \eta \|^2}{n} \sum_{j=1}^d e^{-t \gamma \sigma_j^2}$ arising from this sum cancels the corresponding sum above. Collecting terms, (63) converges in probability to
\[ \Big( -\frac{r \widetilde{R}}{2} + \frac{r \widetilde{R}\, h_0(t)}{2} + \frac{\widetilde{R}}{2} \Big) - \widetilde{R}\, \frac{r\, h_0(t) + (1-r)}{2} = 0, \]
as desired.

D Estimates based on martingale concentration
D.1 General techniques
We recall that the martingales $M_{t,j}$ and $\widetilde{M}_{t,j}$ are defined in (34). Both of these martingales need to be controlled, but only after summing them in a specific way. First, we do not need these martingales directly, but only certain integrals against them. These are defined for all $t \geq 0$ and all $j \in \{1, 2, \ldots, d\}$ by
\[ \widetilde{X}_{t,j} = \int_0^t e^{s \gamma \sigma_j^2}\, \mathrm{d}\widetilde{M}_{s,j}, \qquad X_{t,j} = \int_0^t e^{2 s \gamma \sigma_j^2}\, \mathrm{d}M_{s,j}, \tag{64} \]
which are again càdlàg, finite-variation martingales. We will need to show concentration for sums of these martingales, such as $\sum_{j=1}^d c_j \widetilde{X}_{t,j}$ and $\sum_{j=1}^d c_j X_{t,j}$, for bounded coefficients $\{ c_j \}$. We formulate some general concentration lemmas for càdlàg, finite-variation martingales $Y_t$ with jumps given exactly by $\{ \tau_k : k \geq 1 \}$. For such a process, the jumps entirely determine its fluctuations. We define, for any càdlàg process $Y$,
\[ \Delta Y_t \overset{\text{def}}{=} Y_t - Y_{t-}, \]
which is $0$ for all $t$ except $\{ \tau_k : k \geq 1 \}$. For concreteness and for reference, we record that the jumps of $\widetilde{X}$ and $X$ are given by
\[ \Delta \widetilde{X}_{\tau_k, j} = -e^{\gamma \sigma_j^2 \tau_k}\, \big( e_j^T\, \gamma\, \Sigma^T U^T P_{k-1} ( U \Sigma \nu_{\tau_k -} - \eta ) \big), \qquad \Delta X_{\tau_k, j} = -e^{2 \gamma \sigma_j^2 \tau_k}\, \big( \nu_{\tau_k, j} + \nu_{\tau_k -, j} \big) \big( e_j^T\, \gamma\, \Sigma^T U^T P_{k-1} ( U \Sigma \nu_{\tau_k -} - \eta ) \big). \tag{65} \]
To control the fluctuations of these martingales, we need to control their quadratic variations. The quadratic variation $[Y]_t$ is the sum of squares of all jumps of the process, so that
\[ [Y]_t \overset{\text{def}}{=} \sum_{k=1}^{N_t} ( \Delta Y_{\tau_k} )^2. \]
Likewise, the predictable quadratic variation $\langle Y \rangle_t$ is
\[ \langle Y \rangle_t \overset{\text{def}}{=} \sum_{k=1}^{N_t} \mathbb{E}\big[ ( \Delta Y_{\tau_k} )^2 \mid \mathcal{F}_{\tau_k -} \big]. \]
Moreover, for some of the martingales we consider here, it is possible to find good events on which the quadratic variation or the predictable quadratic variation is under control. It is then a relatively standard fact that the fluctuations of these processes are under control:
Lemma D.1.
Suppose that $( Y_t : t \geq 0 )$ is a càdlàg, finite-variation martingale. Suppose there is an event $G$, measurable with respect to $\mathcal{F}$, which holds with overwhelming probability, and so that for some $T > 0$, either
\[ (i)\ \ [Y]_T\, \mathbb{1}_G \leq \frac{\beta}{nT}\, N_T, \qquad \text{or} \qquad (ii)\ \ \langle Y \rangle_T\, \mathbb{1}_G \leq \frac{\beta}{nT}\, N_T \quad \text{and} \quad \max_{0 \le t \le T} | Y_t - Y_{t-} |\, \mathbb{1}_G \leq 1. \]
Then for any $\epsilon > 0$, with overwhelming probability, $\sup_{0 \le t \le T} | Y_t | \leq n^{\epsilon}$.

Proof.
We begin with the proof of (i). Using the Burkholder-Davis-Gundy inequalities (see [Protter, 2005, Theorem IV.49]), for any $p > 1$ there is a constant $C_p$ so that
\[ \mathbb{E}\Big( \sup_{0 \le t \le T} | Y_t |\, \mathbb{1}_G \Big)^{2p} \leq C_p\, \mathbb{E}\big[ [Y]_T^p\, \mathbb{1}_G \big] \leq C_p \Big( \frac{\beta}{nT} \Big)^p\, \mathbb{E}[ N_T^p ]. \]
There is an absolute constant $C > 0$ so that $\mathbb{E}( N_T^p ) \leq C p!\, ( \mathbb{E} N_T )^p$, and so we conclude that
\[ \mathbb{E}\Big( \sup_{0 \le t \le T} | Y_t |\, \mathbb{1}_G \Big)^{2p} \leq C C_p\, p!. \]
Using Markov's inequality, we conclude that
\[ \Pr\Big( \Big\{ \sup_{0 \le t \le T} | Y_t | \geq n^{\epsilon} \Big\} \Big) \leq \Pr( G^c ) + C C_p\, p!\, n^{-2\epsilon p}. \]
Hence, letting $p$ tend slowly to infinity with $n$, this concludes the proof of (i).

We turn to the proof of (ii). We need a tail bound for martingales (see [Shorack and Wellner, 1986, Appendix B.6, Inequality 1]), which states that
\[ \Pr\Big( \Big\{ \sup_{0 \le t \le T} | Y_t | > s \Big\} \cap \big\{ \langle Y \rangle_T \leq r \big\} \cap G \Big) \leq 2 \exp\Big( \frac{-s^2}{2s + 2r} \Big). \]
Taking $s = r = n^{\epsilon}$, this vanishes faster than any power of $n$. The probability that $N_T > n^{\epsilon}\, \mathbb{E}[ N_T ]$ additionally decays faster than any power of $n$, so that we conclude that on $G$, $\sup_{0 \le t \le T} | Y_t | \leq n^{\epsilon}$ with overwhelming probability.

We will need an extension of this standard type of concentration, which allows for exceptional jumps. Suppose we can decompose the jumps $\{ \tau_k \}$ of $( Y_t : t \geq 0 )$ into two types, $\{ \tau_{k,1} \}$ and $\{ \tau_{k,2} \}$. In our application, we shall take the jumps of the second type to be those for which a fixed coordinate $i \in B_k$, and the first type to be all that remain. By properties of the Poisson process, the two processes $\{ \tau_{k,1} \}$ and $\{ \tau_{k,2} \}$ are independent Poisson processes.

Lemma D.2.
Suppose that $( Y_t : t \geq 0 )$ is a càdlàg, finite-variation martingale with jumps given by $\{ \tau_k \}$. Suppose these jumps are divided into two groups, $\{ \tau_{k,1} \}$ and $\{ \tau_{k,2} \}$, by a rule depending only on $( k, P_k )$. Let $N_{t,1}$ and $N_{t,2}$ be the counting functions of the number of jumps of either type. Suppose that the jumps of $Y_t$ of type $1$ (the typical ones) satisfy
\[ \mathbb{E}\big[ ( \Delta Y_{\tau_{k,1}} )^2 \mid \mathcal{F}_{\tau_{k,1} -} \big] \leq \frac{\beta}{n} \qquad \text{and} \qquad | \Delta Y_{\tau_{k,1}} | \leq 1. \]
For the jumps of the second type, suppose that for some $T > 0$ there is a constant $C > 0$ so that $\mathbb{E}[ N_{T,2} ] \leq C$, and a constant $\delta \in (0,1)$ so that either
\[ (i)\ \ | \Delta Y_{\tau_{k,2}} | \leq \delta\, | Y_{\tau_{k,2} -} | + 1 \qquad \text{or} \qquad (ii)\ \ | Y_{\tau_{k,2}} | \leq \delta\, | Y_{\tau_{k,2} -} | + 1. \]
Then for any $\epsilon > 0$, with overwhelming probability, $\sup_{0 \le t \le T} | Y_t | \leq n^{\epsilon}$.

Proof.
Let $n_{t,1}$ and $n_{t,2}$ be the Lévy measures for the jumps of $Y$ of types $1$ and $2$, i.e. the measures so that for any bounded continuous function $f$ and $\ell \in \{1, 2\}$,
\[ \sum_{k=1}^{N_{t,\ell}} f\big( \Delta Y_{\tau_{k,\ell}} \big) - \int f(x)\, n_{t,\ell}(\mathrm{d}x) \]
is a martingale. We decompose the martingale $( Y_t : t \geq 0 )$ into pieces. Define
\[ Y_{t,\ell} = \sum_{k=1}^{N_{t,\ell}} \Delta Y_{\tau_{k,\ell}} - \int x\, n_{t,\ell}(\mathrm{d}x). \]
Then $Y_t = Y_{t,1} + Y_{t,2}$ for all $t \geq 0$. We use two different versions of the exponential martingale. The first, which we believe originates with Yor [1976] (cf. [Lépingle, 1978, Lemme 2]), is
\[ \widehat{Z}_{t,1} \overset{\text{def}}{=} \exp\Big( \lambda Y_{t,1} - \int \big( e^{\lambda x} - 1 - \lambda x \big)\, n_{t,1}(\mathrm{d}x) \Big), \]
which is a martingale. The second is the Doléans exponential, which is the more commonly cited ([Protter, 2005, II. Theorem 37], Yor [1976]), and which shows that
\[ \widehat{Z}_{t,2} \overset{\text{def}}{=} \exp\big( \lambda Y_{t,2} \big) \prod_{k=1}^{N_{t,2}} f\big( \lambda\, \Delta Y_{\tau_{k,2}} \big), \qquad \text{where } f(x) = (1+x)\, e^{-x}, \]
is a martingale. As both processes are of finite variation and have no common jumps, their product remains a martingale. Thus
\[ \widehat{Z}_t \overset{\text{def}}{=} \exp\Big( \lambda Y_{t,1} + \lambda Y_{t,2} - \int \big( e^{\lambda x} - 1 - \lambda x \big)\, n_{t,1}(\mathrm{d}x) \Big) \prod_{k=1}^{N_{t,2}} f\big( \lambda\, \Delta Y_{\tau_{k,2}} \big) \]
is a martingale. Note that the two martingales combine to form $Y_t$. By assumption, the jumps of $Y_{t,1}$ are bounded by $1$; hence the measure $n_{t,1}(\mathrm{d}x)$ is supported on $[-1, 1]$. For $| u | \leq 1$,
\[ e^{u} - 1 - u \leq \frac{u^2 e}{2}. \]
So we define a supermartingale $( Z_t : t \geq 0 )$, for any $| \lambda | \leq 1$, by
\[ \widehat{Z}_t \geq Z_t \overset{\text{def}}{=} \exp\Big( \lambda Y_t - \int \frac{\lambda^2 x^2 e}{2}\, n_{t,1}(\mathrm{d}x) \Big) \prod_{k=1}^{N_{t,2}} f\big( \lambda\, \Delta Y_{\tau_{k,2}} \big). \]
The integral $\int x^2\, n_{t,1}(\mathrm{d}x)$ is the predictable quadratic variation
\[ \langle Y_{\cdot,1} \rangle_t = \sum_{k=1}^{N_{t,1}} \mathbb{E}\big[ ( \Delta Y_{\tau_{k,1}} )^2 \mid \mathcal{F}_{\tau_{k,1} -} \big] \leq \frac{\beta}{n}\, N_{t,1}. \]
Now we fix a parameter $r > \frac{2}{1-\delta}$ and let
\[ \vartheta = \inf\{ t \geq 0 : | Y_t | \geq r \}. \]
By optional stopping, for any bounded stopping time $\rho \geq 0$,
\[ \mathbb{E}\big[ Z_{\vartheta \wedge \rho}\, \mathbb{1}_{\{ \vartheta \leq \rho \}} \big] \leq \mathbb{E}[ Z_{\vartheta \wedge \rho} ] \leq \mathbb{E}[ Z_0 ] = 1. \tag{66} \]
So, for $\lambda \in (0, 1]$,
\[ Z_{\vartheta \wedge \rho}\, \mathbb{1}_{\{ \vartheta \leq \rho \}} \geq \exp\Big( \lambda r - \frac{\lambda^2 e \beta}{2n}\, N_{\rho,1} \Big) \prod_{k=1}^{N_{\rho,2}} f\big( \lambda\, \Delta Y_{\tau_{k,2}} \big)\, \mathbb{1}_{\{ Y_{\vartheta \wedge \rho} \geq r \}} - \exp\Big( -\lambda r - \frac{\lambda^2 e \beta}{2n}\, N_{\rho,1} \Big) \prod_{k=1}^{N_{\rho,2}} \big| f\big( \lambda\, \Delta Y_{\tau_{k,2}} \big) \big|\, \mathbb{1}_{\{ Y_{\vartheta \wedge \rho} \leq -r \}}. \]
We produce a similar bound on taking $-\lambda \in (0, 1]$, although with the roles reversed. The product may in principle be negative or $0$. So we consider taking $\rho = T \wedge \tau_{1,2}$, for some fixed $T > 0$. Then if $\vartheta < \tau_{1,2}$, we have an empty product. Otherwise, we have $\vartheta = \tau_{1,2}$, in which case the product contains a single term. If assumption (ii) is in force, then either the jump decreases the absolute value of $Y_{\tau_{1,2}}$ (being of opposite sign from $Y_{\tau_{1,2} -}$ and not crossing $0$), or the second condition is in force. In that case, since $| Y_{\tau_{1,2} -} | \leq r$ and since $r \leq | Y_{\tau_{1,2}} | \leq \delta r + 1$, we would have $r \leq \frac{1}{1-\delta}$. However, we have chosen $r$ large enough that this is not the case. So we conclude that, when assumption (ii) is in force, we could not have had $\vartheta = \tau_{1,2}$. We conclude, in the case of assumption (ii), that
\[ Z_{\vartheta \wedge \rho}\, \mathbb{1}_{\{ \vartheta \le \rho \}} \geq \exp\Big( \lambda r - \frac{\lambda^2 e \beta}{2n}\, N_{\rho,1} \Big)\, \mathbb{1}_{\{ Y_{\vartheta \wedge \rho} \geq r \}} - \exp\Big( -\lambda r - \frac{\lambda^2 e \beta}{2n}\, N_{\rho,1} \Big)\, \mathbb{1}_{\{ Y_{\vartheta \wedge \rho} \leq -r \}}. \tag{67} \]
If assumption (i) is in force, then if $Y_{\vartheta} \geq r$, the jump of $Y$ at $\tau_{1,2}$ is necessarily positive, as this is the first time the martingale jumped above this level. As assumption (i) is in force, $Y_{\tau_{1,2} -} > 0$ as well, and so the jump of type $2$ must satisfy $\Delta Y_{\tau_{1,2}} \leq \delta Y_{\tau_{1,2} -} + 1 \leq \delta r + 1$. We conclude that when $Y_{\vartheta} \geq r$ and assumption (i) holds,
\[ f\big( \lambda\, \Delta Y_{\tau_{1,2}} \big) \geq e^{-\lambda \Delta Y_{\tau_{1,2}}} \geq e^{-\lambda ( \delta r + 1 )}. \]
If, on the other hand, $Y_{\vartheta} \leq -r$, then the jump must have been negative, and we conclude similarly that
\[ \big| f\big( \lambda\, \Delta Y_{\tau_{1,2}} \big) \big| \leq \big( 1 - \lambda\, \Delta Y_{\tau_{1,2}} \big)\, e^{-\lambda \Delta Y_{\tau_{1,2}}} \leq \big( 1 + \lambda ( \delta r + 1 ) \big)\, e^{\lambda ( \delta r + 1 )}. \]
Hence
\[ Z_{\vartheta \wedge \rho}\, \mathbb{1}_{\{ \vartheta \le \rho \}} \geq \exp\Big( \lambda (1-\delta) r - \lambda - \frac{\lambda^2 e \beta}{2n}\, N_{\rho,1} \Big)\, \mathbb{1}_{\{ Y_{\vartheta \wedge \rho} \geq r \}} - \big( 1 + \lambda ( \delta r + 1 ) \big) \exp\Big( -\lambda (1-\delta) r + \lambda - \frac{\lambda^2 e \beta}{2n}\, N_{\rho,1} \Big)\, \mathbb{1}_{\{ Y_{\vartheta \wedge \rho} \leq -r \}}. \tag{68} \]
In either case of (67) or (68), using (66) and the boundedness of $r \mapsto \lambda ( \delta r + 1 )\, e^{-\lambda (1-\delta) r}$, there is a constant $C_{\delta} > 0$ so that
\[ \mathbb{E}\Big[ \exp\Big( \lambda (1-\delta) r - \lambda - \frac{\lambda^2 e \beta}{2n}\, N_{\rho,1} \Big)\, \mathbb{1}_{\{ Y_{\vartheta \wedge \rho} \geq r \}} \Big] \leq C_{\delta}. \]
With overwhelming probability $\frac{\beta}{n} N_{\rho,1} \leq \frac{\beta}{n} N_T \leq 2T$, and hence, on the event $E$ that $\frac{\beta}{n} N_{\rho,1} \leq 2T$,
\[ \mathbb{E}\Big[ \exp\big( \lambda (1-\delta) r - \lambda - \lambda^2 e\, T \big)\, \mathbb{1}_{\{ Y_{\vartheta \wedge \rho} \geq r \} \cap E} \Big] \leq C_{\delta}. \tag{69} \]
Thus, taking $\lambda = 1$ and $r = 2 (1-\delta)^{-1} (\log n)^2$, we conclude that $e^{(\log n)^2}\, \Pr\big( \{ Y_{\vartheta \wedge \rho} \geq r \} \cap E \big) = O(1)$, and hence $Y_{\vartheta \wedge \rho} \leq 2 (1-\delta)^{-1} (\log n)^2$ with overwhelming probability. By applying the same argument to $-Y_t$, which is again a martingale satisfying the same assumptions, we conclude with overwhelming probability that
\[ \sup\big\{ | Y_t | : 0 \leq t \leq ( T \wedge \tau_{1,2} ) \big\} \leq 2 (1-\delta)^{-1} (\log n)^2. \]
Now suppose that, with overwhelming probability, we have shown for some $\ell \in \mathbb{N}$ that
\[ \sup\big\{ | Y_t | : 0 \leq t \leq ( T \wedge \tau_{\ell,2} ) \big\} \leq 2^{\ell} (1-\delta)^{-\ell} (\log n)^2. \]
We now apply the same bounds to
\[ Z_t / Z_{t \wedge \tau_{\ell,2}} \overset{\text{def}}{=} \exp\Big( \lambda \big( Y_t - Y_{t \wedge \tau_{\ell,2}} \big) - \int \frac{\lambda^2 x^2 e}{2}\, \big( n_{t,1} - n_{t \wedge \tau_{\ell,2},1} \big)(\mathrm{d}x) \Big) \prod_{k = \ell + 1}^{N_{t,2}} f\big( \lambda\, \Delta Y_{\tau_{k,2}} \big). \]
In particular, taking the conditional expectation, with the same $\vartheta$ and with $\rho = T \wedge \tau_{\ell+1,2}$,
\[ \mathbb{E}\big[ Z_{\vartheta \wedge \rho} / Z_{\vartheta \wedge \rho \wedge \tau_{\ell,2}}\, \mathbb{1}_{\{ \vartheta \le \rho \}} \,\big|\, \mathcal{F}_{\tau_{\ell,2}} \big] \leq 1, \]
and hence, following the same line of argument that leads to (69),
\[ \mathbb{E}\Big[ \exp\big( \lambda (1-\delta) r - \lambda - \lambda^2 e\, T \big)\, \mathbb{1}_{\{ Y_{\vartheta \wedge \rho} \geq r \} \cap E} \,\Big|\, \mathcal{F}_{\tau_{\ell,2}} \Big] \leq C_{\delta} \exp\big( \lambda\, Y_{\vartheta \wedge \rho \wedge \tau_{\ell,2}} \big). \]
Taking $\lambda = 1$ and $r = 2^{\ell+1} (1-\delta)^{-\ell-1} (\log n)^2$, and restricting to the event in the inductive hypothesis,
\[ \mathbb{E}\Big[ \exp\big( 2^{\ell+1} (1-\delta)^{-\ell} (\log n)^2 \big)\, \mathbb{1}_{\{ Y_{\vartheta \wedge \rho} \geq r \} \cap E} \,\Big|\, \mathcal{F}_{\tau_{\ell,2}} \Big] \leq C_{\delta} \exp\Big( 2^{\ell} (1-\delta)^{-\ell} (\log n)^2 \Big). \]
In particular, with overwhelming probability,
\[ \sup\big\{ | Y_t | : 0 \leq t \leq ( T \wedge \tau_{\ell+1,2} ) \big\} \leq 2^{\ell+1} (1-\delta)^{-\ell-1} (\log n)^2. \]
The number of type-2 jumps before $T$ is $N_{T,2}$, which is Poisson with mean at most $C$.
Hence with overwhelming proba-bility, for any (cid:15) > , N T, ≤ (cid:0) log − δ (cid:1) − (cid:15) log n. Hence, we conclude that with overwhelming probability, sup (cid:8) | Y t | : 0 ≤ t ≤ T (cid:9) ≤ (cid:0) − δ (cid:1) N T, (log n ) ≤ n (cid:15) (log n ) . As (cid:15) > may be picked as small as desired, the proof follows. D.2 Applications to the control of errors in the Volterra equation
D.2.1 Delocalization of the function values: the proof of Proposition B.2
Proof of Proposition B.2.
It suffices to prove that for a fixed i and for any T > and any (cid:15) > , sup ≤ t ≤ T (cid:0) e Ti ( U Σ ν ϑt − η ) (cid:1) ≤ βn (cid:15) − with overwhelming probability.Using Lemma B.2, we have the representation ν t,j = e − γσ j t ν ,j + (cid:90) t e − γσ j ( t − s ) γσ j ( U T η ) j d s + e − γσ j t (cid:101) X t,j . Note that ν ϑt,j has the same representation by replacing t → t ∧ ϑ . Observe that each of the first two terms has thecontribution of O ( n (cid:15) − / ) to | e Ti ( U Σ ν ϑt − η ) | with overwhelming probability. Indeed, we have e Ti ( U Σ ν ϑt − η ) − d (cid:88) j =1 U ij σ j e − γtσ j (cid:101) X t,j = − η i + d (cid:88) j =1 U ij σ j e − γσ j t ν ,j + d (cid:88) j =1 U ij σ j (cid:90) t e − γσ j ( t − s ) γσ j ( U T η ) j d s = O ( n (cid:15) − / ) . (70)Here the order of first term comes from Assumption 1.1. Corollary C.1 gives the order of the second term, by defining F ( U ) := (cid:80) dj =1 U ij σ j e − γσ j t ν ,j with conditioning on ν . The order of the last term is obtained from Lemma C.1with setting a = e i , b = η and conditioning on Σ . Indeed, note that E [ U ij ( U T η ) j ] = n η i so that w.o.p., E (cid:20) d (cid:88) j =1 γU ij σ j ( U T η ) j (cid:90) t e − ( t − s ) γσ j d s (cid:21) = (cid:18) n d (cid:88) j =1 γσ j (cid:90) t e − ( t − s ) γσ j (cid:19) η i = O ( n (cid:15) − / ) . Y t def = d (cid:88) j =1 U ij σ j e − γqσ j (cid:101) X t,j ≤ t ≤ q, (71)for some fixed q ≥ . Note that as we did in Lemma B.3, showing the bound for a fixed q ∈ [0 , T ] should be sufficient,considering mesh points on [0 , T ] with spacing, let us say, λ = λ ( n ) > , which depends on n . Since the process ν t isconstant between jumps, the only cases which cannot be covered by mesh points are having multiple jumps betweentwo adjacent mesh points. However, as the possibility of such events is given by O ( β − n λ ) , which can be smallerthan any power of polynomial of n , we conclude that every jump can be covered by mesh points with overwhelmingprobability. Each jump for the process Y ( · ) is given by ∆ Y τ k +1 def = Y τ k +1 − Y τ ( k +1) − = − d (cid:88) j =1 U ij σ j e − ( q − τ k +1 ) γσ j e Tj γ Σ T U T P k ( U Σ ν ϑτ ( k +1) − − η ) . (72)Note that there are two different types of jumps, i.e.1. B k does not include the index i .2. B k includes the index i .Replacing P k by (cid:80) l ∈ B k e l e Tl , the jump ∆ Y τ k +1 can be translated as ∆ Y τ k +1 = − (cid:88) l ∈ B k (cid:2) d (cid:88) j =1 γU ij U lj σ j e − ( q − τ k +1 ) γσ j (cid:3) e Tl ( U Σ ν ϑτ ( k +1) − − η ) . (73)Let Φ i,l ( t ) def = d (cid:88) j =1 γU ij U lj σ j e − ( q − t ) γσ j , (74)and let G = G ( θ ) for θ > be the event defined as G def = (cid:26) sup ≤ l ≤ n,l (cid:54) = i max ≤ t ≤ T | Φ i,l ( t ) | ≤ n θ − / , max ≤ t ≤ T | Φ i,i ( t ) | < (cid:27) . (75)Note that this holds with overwhelming probability, by Lemma C.1 and condition on the stepsize γ , see Theorem 1.2.Furthermore, in order to apply the bootstrap argument, let us define, for ℵ ∈ [ − (cid:15)/ , / , (cid:126) def = inf { t ≤ ϑ : max ≤ l ≤ n | e Tl ( U Σ ν t − η ) | > n −ℵ } . (76)Now we are ready to apply Lemma D.2 to prove the claim. Case 1.
When B k does not include the index i : we need to control E [(∆ Y (cid:126) τ k +1 , ) |F τ ( k +1) − ] and | ∆ Y (cid:126) τ k +1 , | .Observe, | ∆ Y (cid:126) τ k +1 , | = | (cid:88) l ∈ B k Φ i,l e Tl ( U Σ ν (cid:126) τ k +1 , − − η ) | ≤ βn θ − / −ℵ . (77)On the other hand, E [(∆ Y (cid:126) τ k +1 , ) |F τ ( k +1) − ] = β ( β − n − n − (cid:2) n (cid:88) l =1 Φ i,l e Tl ( U Σ ν (cid:126) τ k +1 , − − η ) (cid:3) + (cid:18) βn − − β ( β − n − n − (cid:19) n (cid:88) l =1 Φ i,l ( e Tl ( U Σ ν (cid:126) τ k +1 , − − η )) ≤ β n n θ + βn n θ − ≤ βn ( βn θ − + n θ − ) . (78)39ere Cauchy-Schwarz inequality as well as the definition of ϑ were used for the inequality. Case 2.
When B k includes the index i : In this case, we want to have the following: for some T > , there is a constant C > so that E N T, ≤ C and a constant δ ∈ (0 , so that ( i ) . | ∆ Y (cid:126) τ k +1 , | ≤ δ | Y (cid:126) τ k +1 , − | + 1 or ( ii ) . | Y (cid:126) τ k +1 , | ≤ δ | Y (cid:126) τ k +1 , − | + 1 . First recall that N t has the distribution of Poisson ( nβ t ) . Since B k contains the index i with probability (cid:0) n − β − (cid:1) / (cid:0) nβ (cid:1) = βn ,we have E [ N T, ] = βn nTβ = T < ∞ . Now observe, with t def = τ k +1 , ∧ (cid:126) , ∆ Y (cid:126) τ k +1 , = − (cid:88) l ∈ B k (cid:2) d (cid:88) j =1 γU ij U lj σ j e − ( q − t ) γσ j (cid:3) e Tl ( U Σ ν (cid:126) τ k +1 , − − η )= − d (cid:88) j =1 γU ij σ j e − ( q − t ) γσ j e Ti ( U Σ ν (cid:126) τ k +1 , − − η ) − (cid:88) l ∈ B k ,l (cid:54) = i (cid:2) d (cid:88) j =1 γU ij U lj σ j e − ( q − t ) γσ j (cid:3) e Tl ( U Σ ν (cid:126) τ k +1 , − − η ) . (79)We will see that the first term will lead to the one including Y (cid:126) τ k +1 , − with errors. From Lemma B.2, we have ν t,j = e − γσ j t ν ,j + (cid:90) t e − γσ j ( t − s ) γσ j ( U T η ) j d s + e − γσ j t (cid:101) X t,j , and this gives with overwhelming probability Y (cid:126) τ k +1 , − = d (cid:88) j =1 U ij σ j e − t γσ j (cid:101) X (cid:126) ( τ k +1 , − ) ,j = d (cid:88) j =1 U ij σ j (cid:20) ν (cid:126) ( τ k +1 , − ) ,j − e − t γσ j ν ,j − (cid:90) t e − ( t − s ) γσ j γσ j ( U T η ) j d s (cid:21) = e Ti U Σ ν (cid:126) τ k +1 , − − d (cid:88) j =1 U ij σ j e − t γσ j ν ,j − d (cid:88) j =1 γU ij σ j ( U T η ) j (cid:90) t e − ( t − s ) γσ j d s = e Ti U Σ ν (cid:126) τ k +1 , − + O ( n θ − / ) . In the last line to get the order, we used Corollary C.1 for the second term and Lemma C.1 for the last term withsetting a = e i , b = η and conditioning on Σ . See the arguments after (70) for detail. Thus, from (79) we have withoverwhelming probability ∆ Y (cid:126) τ k +1 , = − d (cid:88) j =1 γU ij σ j e − ( q − t ) γσ j ( Y (cid:126) τ k +1 , − + O ( n θ − / )) − (cid:88) l ∈ B k ,l (cid:54) = i (cid:2) d (cid:88) j =1 γU ij U lj σ j e − ( q − t ) γσ j (cid:3) e Tl ( U Σ ν (cid:126) τ k − η )= − d (cid:88) j =1 γU ij σ j e − ( q − t ) γσ j Y (cid:126) τ k +1 , − + O ( n θ − / ) + O ( βn θ − / −ℵ ) , (80)40here Lemma C.1 was used again in the last line; when l (cid:54) = i , E (cid:2) d (cid:88) j =1 γU ij U lj σ j e − ( q − t ) γσ j (cid:3) = 0 . Note that condition ( ii ) is satisfied from (80) after some appropriate scaling of Y (cid:126) t , since Y (cid:126) τ k +1 , = Y (cid:126) τ k +1 , − + ∆ Y (cid:126) τ k +1 , = (cid:18) − d (cid:88) j =1 γU ij σ j e − ( q − t ) γσ j (cid:19) Y (cid:126) τ k +1 , − + O ( n θ − / ) + O ( βn θ − / −ℵ ) , (81)and (cid:12)(cid:12)(cid:12)(cid:12) − (cid:80) dj =1 γU ij σ j e − ( q − t ) γσ j (cid:12)(cid:12)(cid:12)(cid:12) < on G .Now, in view of (77), (78) and (81), scaling Y (cid:126) t by max {√ βn θ − / , βn θ − / −ℵ , n θ − / } makes every conditionfor cases 1 and 2 valid, so Lemma D.2 gives sup ≤ t ≤ T | Y (cid:126) t | ≤ n θ − / max { (cid:112) βn θ , βn −ℵ , } . (82)We summarize the following conclusion: if we let, for any (cid:15) > , ψ ( T ) i def = max ≤ t ≤ T | e Ti ( U Σ ν t − η ) | , ψ ( T ) i ≤ n −ℵ w.o.p. = ⇒ ψ ( T ) i ≤ n θ − / max { (cid:112) βn θ , βn −ℵ , } w.o.p.Thus under the assumption that β ≤ n / − δ ≤ n / − δ , and picking θ < δ/ we conclude ψ ( T ) i ≤ n −ℵ w.o.p. 
= ⇒ ψ ( T ) i ≤ max { (cid:112) βn θ − / , n −ℵ− δ/ , n θ − / } w.o.p.Hence by iterating this inequality finitely many times, max {√ βn θ − / , n −ℵ− δ/ , n θ − / } becomes √ βn θ − / and the conclusion follows with the choice of θ < min { δ/ , (cid:15)/ } . The only thing left to check is to bound ψ ( T ) i withthe initial condition ℵ = − (cid:15)/ , i.e. , max ≤ t ≤ T | e Ti ( U Σ ν ϑt − η ) | ≤ n (cid:15)/ with overwhelming probability. But this was already given by Lemma B.4. D.2.2 Delocalization of the spectral weights: the proof of Proposition B.4
Proof of Proposition B.4.
It is sufficient to prove the same claim for a fixed j . To take advantage of the main technicalassumption Proposition B.2, we will introduce a stopping time (cid:126) , defined as (for some α ∈ (0 , ) ) (cid:126) def = inf (cid:26) t ≤ ϑ : max ≤ i ≤ n (cid:0) e Ti ( U Σ ν t − η ) (cid:1) > βn − α (cid:27) . (83)As with overwhelming probability this does not occur, it suffices to show a bound for the stopped processes ν (cid:126) t,j def = ν t ∧ (cid:126) ,j . Using Lemma B.2, we have the representation ν t,j = e − γσ j t ν ,j + (cid:90) t e − γσ j ( t − s ) γσ j ( U T η ) j d s + e − γσ j t (cid:101) X t,j . By replacing t → t ∧ (cid:126) , we have the same representation for ν (cid:126) t,j . We let G be the event that | ( U T η ) j | ≤ n (cid:15)/ − / and max i (cid:12)(cid:12) e Tj γ Σ T U T e i (cid:12)(cid:12) ≤ n (cid:15)/ − / . (84)41y Corollary C.2 this holds with overwhelming probability. The first two terms is n (cid:15) − / with overwhelming proba-bility, using Assumption 1.1. Hence it suffices to show that for any (cid:15) > , sup ≤ t ≤ T | (cid:101) X (cid:126) t,j | ≤ βn (cid:15) − / with overwhelming probability.The quadratic variation is, from (65) [ (cid:101) X (cid:126) t,j ] = N t ∧ (cid:126) (cid:88) k =1 e τ k +1 γσ j (cid:0) e Tj γ Σ T U T P k − ( U Σ ν (cid:126) τ k − η ) (cid:1) . We observe that for τ k ≤ (cid:126) on G , (cid:0) e Tj Σ T U T P k − ( U Σ ν τ k − η ) (cid:1) ≤ β max i (cid:12)(cid:12) e Tj γ Σ T U T e i (cid:12)(cid:12) n − α ≤ C ( T, Σ ) β n (cid:15) − − α . Using part (i) of Lemma D.1, we have that max ≤ t ≤ T | ν (cid:126) t,j | ≤ βn (cid:15) − α with overwhelming probability. D.2.3 Concentration of the function values: the proof of Proposition B.3
Proof of Proposition B.3.
Recall that for fixed q ∈ [0 , T ] ε ( n )M ( q ) def = 12 d (cid:88) j =1 σ j (cid:90) q e − q − s ) γσ j d M s,j . Hence we can write this as ε ( n )M ( q ) = 12 d (cid:88) j =1 σ j e − qγσ j X q,j . We consider the martingale Y t def = 12 d (cid:88) j =1 σ j e − qγσ j X t,j , and we show concentration for Y t , ≤ t ≤ q . As in the proof of Proposition B.2, it would suffice to bound Y q forfixed q ∈ [0 , T ] using the mesh arguments because the probability of having multiple jumps between two adjacentmesh points converges to zero faster than any polynomial order. Also, Proposition B.2 allows us to adopt a stoppingtime (cid:126) , defined as (for some α ∈ (0 , ) ) (cid:126) def = inf (cid:26) t ≤ ϑ : max ≤ i ≤ n (cid:0) e Ti ( U Σ ν t − η ) (cid:1) > βn − α (cid:27) . (85)As with overwhelming probability this does not occur, it suffices to show a bound for the stopped processes ν (cid:126) t,j def = ν t ∧ (cid:126) ,j . The jumps of this martingale are given by (see (65)) ∆ Y (cid:126) τ k = 12 d (cid:88) j =1 σ j e − γ ( q − τ k ∧ (cid:126) ) σ j (cid:0) ν (cid:126) τ k ,j + ν (cid:126) τ k − ,j (cid:1)(cid:0) e Tj γ Σ T U T P k − ( U Σ ν (cid:126) τ k − − η ) (cid:1) . [ Y (cid:126) q ] = N q (cid:88) k =1 (∆ Y (cid:126) τ k ) = N q (cid:88) k =1 (cid:20) d (cid:88) j =1 σ j e − γ ( q − τ k ∧ (cid:126) ) σ j (cid:0) ν (cid:126) τ k ,j + ν (cid:126) τ k − ,j (cid:1)(cid:0) e Tj γ Σ T U T P k − ( U Σ ν (cid:126) τ k − − η ) (cid:1)(cid:21) = 14 N q (cid:88) k =1 (cid:20) d (cid:88) j =1 σ j e − γ ( q − τ k ∧ (cid:126) ) σ j (cid:0) ν (cid:126) τ k − ,j + e Tj γ Σ T U T P k − ( U Σ ν (cid:126) τ k − − η ) (cid:1) · (cid:0) e Tj γ Σ T U T P k − ( U Σ ν (cid:126) τ k − − η ) (cid:1)(cid:21) ≤ N q (cid:88) k =1 (cid:20) (cid:88) i ∈ B k − (cid:18) d (cid:88) j =1 σ j e − γ ( q − τ k ∧ (cid:126) ) σ j ν (cid:126) τ k − ,j ( e Tj γ Σ T U T e i ) (cid:19)(cid:0) e Ti ( U Σ ν (cid:126) τ k − − η ) (cid:1)(cid:21) + 12 N q (cid:88) k =1 (cid:20) d (cid:88) j =1 σ j e − γ ( q − τ k ∧ (cid:126) ) σ j (cid:0) e Tj γ Σ T U T P k − ( U Σ ν (cid:126) τ k − − η ) (cid:1) (cid:21) . Note that the second term is bounded as N q (cid:88) k =1 (cid:20) d (cid:88) j =1 σ j e − γ ( q − τ k ∧ (cid:126) ) σ j (cid:0) (cid:88) i ∈ B k − (cid:0) e Tj γ Σ T U T e i (cid:1)(cid:0) e Ti ( U Σ ν (cid:126) τ k − − η ) (cid:1) (cid:21) ≤ N q (cid:2) n · ( βn (cid:15) − / (cid:112) βn − α ) ] = N q β n β n (cid:15) − α +1 . Note that Corollary C.2 was used to bound e Tj γ Σ T U T e i . As for the first term, define W q,v,s,i def = d (cid:88) j =1 ν (cid:126) s,j e − γ ( q − v ) σ j e Tj Σ T U T e i , (86)for ≤ v ≤ q, ≤ s ≤ v . Note that it suffices to bound max ≤ s ≤ v W q,v,s,i for a fixed v , because we can apply theunion bound to ≤ v ≤ q using the same meshing arguments as the ones used after (71).Observe, W q,v,s,i = d (cid:88) j =1 e − γ ( q − v ) σ j σ j U ij (cid:0) e − γσ j s ν ,j + (cid:90) s e − γσ j ( s − u ) γσ j ( U T η ) j d u + e − γσ j s (cid:101) X (cid:126) s,j (cid:1) = d (cid:88) j =1 e − γ (2 q − v + s ) σ j σ j U ij ν ,j + d (cid:88) j =1 (cid:90) s e − γ (2 q − v + s − u ) σ j γσ j U ij ( U T η ) j d u + d (cid:88) j =1 e − γ q − v ) σ j e − γsσ σ j U ij (cid:101) X (cid:126) s,j . Note that the first and second terms are of O ( n (cid:15) − / ) with overwhelming probability by the arguments after (70).Also, the last term is bounded by √ βn (cid:15) − / w.o.p. by the same arguments used in showing the bound for Y t definedin (71). 
It is crucial that the additional coefficients e − γ q − v ) σ j are less than 1 so that the same arguments from (72)to (82) work.Then the quadratic variation is bounded by [ Y (cid:126) q ] ≤ CN q (cid:2) β (cid:112) βn (cid:15) − / (cid:112) βn − α (cid:3) + N q β n β n (cid:15) − α +1 ≤ CN q βnT β n (cid:15) − α +1 , C > and small enough (cid:15) > in the last part, which will be an enough bound to apply Lemma D.1. Hencewe conclude, with the same meshing arguments used in the proof of Proposition B.2 and assumption on β ≤ n / − δ , sup ≤ t ≤ T | Y (cid:126) t | ≤ β / n (cid:15) − α +1 / ≤ n (cid:15) − δ/ − α ) , and the claim follows by choosing α < / sufficiently close to / . D.2.4 Concentration of cross-variation of the model noise with the spectral weights: the proof of PropositionB.6
Proof of Proposition B.6.
We recall from the assumptions of the Proposition that we let { c j } n be a deterministicsequence with | c j | ≤ for all j . We should show that for any t > and for some (cid:15) > (cid:12)(cid:12)(cid:12)(cid:12) (cid:107) η (cid:107) n (cid:88) j =1 c j σ j ν ϑt,j ( U T η ) j − n n (cid:88) j =1 c j (cid:0) − e − γσ j ( t ∧ ϑ ) (cid:1)(cid:12)(cid:12)(cid:12)(cid:12) ≤ (cid:112) βn (cid:15) − / with overwhelming probability.We begin again by using Lemma B.2, due to which we have the representation ν t,j = e − γσ j t ν ,j + (cid:90) t e − γσ j ( t − s ) γσ j ( U T η ) j d s + e − γσ j t (cid:101) X t,j . We replace this expression into the sum we wish to control, and observe that n (cid:88) j =1 c j σ j ν ϑt,j ( U T η ) j = n (cid:88) j =1 c j σ j ( U T η ) j (cid:18) e − γσ j t ν ,j + (cid:90) t e − γσ j ( t − s ) γσ j ( U T η ) j d s + e − γσ j t (cid:101) X t,j (cid:19) . Under Assumption 1.1 and Corollary C.1, the first sum vanishes with overwhelming probability. By independence of η from U , we have that (cid:0) ( U T η ) j : 1 ≤ j ≤ n (cid:1) law = (cid:107) η (cid:107) (cid:0) U ,j : 1 ≤ j ≤ n (cid:1) . Hence, by Lemma C.1, for any (cid:15) > , with overwhelming probability, (cid:12)(cid:12)(cid:12)(cid:12) n (cid:88) j =1 c j ( U T η ) j (cid:107) η (cid:107) (cid:90) t e − γσ j ( t − s ) γσ j ds − n (cid:88) j =1 c j n (cid:90) t e − γσ j ( t − s ) γσ j ds (cid:12)(cid:12)(cid:12)(cid:12) ≤ n (cid:15) − / . Using that (cid:82) t e − γσ j ( t − s ) γσ j ds = (1 − e − γσ j t ) , we have reduced the problem to showing that for any (cid:15) > withoverwhelming probability (cid:12)(cid:12) Y t (cid:12)(cid:12) ≤ (cid:112) βn (cid:15) − / where Y s def = 1 (cid:107) η (cid:107) n (cid:88) j =1 c j σ j ( U T η ) j e − γσ j t (cid:101) X s,j for all s ≤ t. To leverage Proposition B.2, we again use the stopping time (cid:126) (83). We will again apply Lemma D.1. The jumps of Y (cid:126) t are given by, for any τ k ≤ s, ∆ Y (cid:126) τ k = − (cid:107) η (cid:107) n (cid:88) j =1 c j σ j ( U T η ) j e − γσ j t (cid:0) e Tj γ Σ T U T P k − ( U Σ ν (cid:126) τ k − − η ) (cid:1) . If we let D be the diagonal matrix with entries − γc j e − γσ j t , then we have the representation ∆ Y (cid:126) τ k = 1 (cid:107) η (cid:107) η T U D ΣΣ T U T P k − ( U Σ ν (cid:126) τ k − − η ) . Using that (cid:107) η (cid:107) η T U D ΣΣ T U T has a norm bounded only in terms of t, Σ and γ, it follows that | ∆ Y (cid:126) τ k | ≤ C ( t, Σ , γ ) βn − α . (87)44e turn to bounding the predictable quadratic variation. Using Lemma B.1, E ((∆ Y (cid:126) τ k ) | F τ k − ) = β ( β − n ( n − (cid:18) (cid:107) η (cid:107) η T U D ΣΣ T U T ( U Σ ν (cid:126) τ k − − η ) (cid:19) + (cid:18) βn − β ( β − n ( n − (cid:19) n (cid:88) i =1 (cid:18) (cid:107) η (cid:107) η T U D ΣΣ T U T e i (cid:19) (cid:18) e Ti ( U Σ ν (cid:126) τ k − − η ) (cid:19) . The first line we bound using that U Σ ν (cid:126) τ k − − η is norm at most n (cid:15) and that U D ΣΣ T U T has a norm bounded onlyby some C ( t, Σ , γ ) . The second line we bound using that ( e Ti ( U Σ ν (cid:126) τ k − − η )) ≤ βn − α . Together, these boundsgive that E ((∆ Y (cid:126) τ k ) | F τ k − ) ≤ C ( t, Σ , γ ) (cid:18) β n − (cid:15) + β n − − α (cid:19) . Hence the conclusion follows using Lemma D.1.
E Analyzing the Volterra equation
E.1 General analysis of the Volterra equation
In this section, we analyze the solution of the Volterra equation (12) and give some basic properties of its solution forgeneral limiting spectral measures µ which is a compactly supported measure on [0 , ∞ ) . We recall for conveniencethat the Volterra equation is given by ψ ( t ) = R h ( t ) + (cid:101) R (cid:0) rh ( t ) + (1 − r ) (cid:1) + γ r (cid:90) t h ( t − s ) ψ ( s ) d s, and h k ( t ) = (cid:90) ∞ x k e − γtx d µ ( x ) , (88)where γ > is a stepsize parameter. When convenient, we will simply write z ( t ) for the forcing function z ( t ) def = R h ( t ) + (cid:101) R (cid:0) rh ( t ) + (1 − r ) (cid:1) . (89)The parameter r ∈ (0 , ∞ ) is fixed, but we may consider limits of the Volterra equation under various limits. Theparameters R and (cid:101) R are both non-negative, but to avoid trivialities, we should assume at least one is positive. As theVolterra equation is linear, we may without loss of generality assume R + (cid:101) R = 1 .The equation (88) appears frequently in the probability literature as the renewal equation [Resnick, 1992, (3.5.1)],[Asmussen, 2003, (2.1)]; it appears naturally in renewal theory and in the Lotka population model, amongst others,which are neatly described in the references just mentioned. It allows, for example, ψ to be given the amusinginterpretation as the expected size of a population which evolves in times (c.f. [Resnick, 1992, Example 3.5.2] or[Asmussen, 2003, Example 2.2]). Much of the behavior of the equation is determined by the properties of the function γ rh ( t ) . We will let λ − be the leftmost endpoint of the support of µ restricted to (0 , ∞ ) , and we record the followingelementary computation. For any α ∈ R , (cid:90) ∞ e γαt γ rh ( t ) d t = (cid:40) γr (cid:82) ∞ x x − α d µ ( x ) , if α ≤ λ − ∞ otherwise. (90)We begin by observing some elementary properties of the equation: Theorem E.1.
There is a unique, positive solution to (12) which exists for all time. The solution is bounded if andonly if γ < r (cid:0)(cid:82) ∞ x d µ ( x ) (cid:1) − in which case ψ ( ∞ ) def = lim t →∞ ψ ( t ) = (cid:101) R · rµ ( { } ) + (1 − r )1 − γr (cid:0)(cid:82) ∞ x d µ ( x ) (cid:1) . roof. In standard renewal notation (c.f. [Resnick, 1992, (3.5.1)], [Asmussen, 2003, (2.1)]), we would write (88) ψ = z + ψ ∗ F where F is the function F ( t ) = (cid:90) t γ rh ( s ) d s. The existence and uniqueness is now standard (compare with Proposition B.1), see [Resnick, 1992, Theorem 3.5.1] or[Asmussen, 2003, Theorem 2.4]. By (90), F ( ∞ ) = γr (cid:90) ∞ x d µ ( x ) . Hence when this is bigger than , the solution ψ ( t ) tends to infinity exponentially fast [Asmussen, 2003, Theorem 7.1]or [Resnick, 1992, Proposition 3.11.1]. In the case that γr (cid:82) ∞ x d µ ( x ) = 1 , by Blackwell’s Renewal theorem, ψ ( t ) is asymptotic to a positive multiple of t (see [Resnick, 1992, Theorem 3.10.1] or [Asmussen, 2003, Theorem 4.4])and hence still diverges. Finally, in the case that γr (cid:82) ∞ x d µ ( x ) < by [Resnick, 1992, Section 3.11] or [Asmussen,2003, Proposition 7.4], lim t →∞ ψ ( t ) = lim t →∞ z ( t )1 − γr (cid:82) ∞ x d µ ( x ) , (91)which is the claimed result. Two phases
Hence, we will assume going forward that γ < γ def = (cid:0) r (cid:82) ∞ x d µ ( x ) (cid:1) − . We shall see that it is possible to say moreabout the rate of convergence in general. Define the Malthusian exponent λ ∗ as the solution of (cid:90) ∞ e γλ ∗ t γ rh ( t ) d t = 1 , (92)when it exists. Note by virtue of (90), if this exponent exists, it can just as well be defined as the solution of r (cid:90) ∞ x x − λ ∗ d µ ( x ) = 1 γ , (93)and we necessarily have that λ ∗ ≤ λ − . Define γ ∗ = 1 r (cid:82) ∞ x x − λ − d µ ( x ) (94)which exists and is positive exactly when (cid:82) ∞ x x − λ − d µ ( x ) < ∞ . Note that γ ∗ is strictly less than γ if and only if λ − > . Moreover, we can completely give the asymptotic behavior of the Volterra equation on either side of thecritical point. Although, to do this for γ < γ ∗ , we will need some further assumptions on µ .Recall that a function f : (0 , ∞ ) → R is slowly varying if f ( tx ) /f ( x ) → as t → ∞ for any x > . Afunction f : (0 , ∞ ) → R is regularly varying if f ( t ) = g ( t ) t α for a slowly varying function g. We will say that µ is left-edge-regular if there exists a regularly varying function L and α > so that t (cid:55)→ µ (( λ − , λ − + t ]) ∼ t α L ( t ) , as t → ∞ , (95)which for example is satisfied by Marchenko-Pastur (8). We show the following: Theorem E.2.
For γ ∈ ( γ ∗ , γ ) , the Malthusian exponent λ ∗ exists and is the unique solution of (93) . The function ψ ( t ) satisfies that for some explicit constant c ( R, (cid:101) R, µ ) , ψ ( t ) − ψ ( ∞ ) ∼ c ( R, (cid:101) R, µ ) γ e − γ ( λ ∗ ) t . If in addition γ ∗ > and µ is left-edge-regular, then for γ ∈ (0 , γ ∗ ) ψ ( t ) − ψ ( ∞ ) ∼ e − γ ( λ − ) t g ( t ) where g ( t ) is some explicit regularly varying function. γ from up to γ ∗ the process undergoes a transition in behavior when γ = γ ∗ . For small γ, the exponential rate of change is frozen on the smallest eigenvalue of the Hessian λ − . Howeveras γ passes the transition point γ ∗ , the logarithm of the rate becomes a smooth function. This is strongly reminiscentof a freezing transition , which is often seen in the free energies of random energy models. See for example Fyodorovand Bouchaud [2008].The proof is essentially an automatic consequence of established theory for Volterra equations. As an input to thecase of γ < γ ∗ , we need the following asymptotics of the functions h k , which are the main way in which left-edge-regularity: Lemma E.1.
Suppose that µ has left-edge-regularity, meaning that there is an α ≥ and slowly varying function L so that µ (( λ − , λ − + t ]) ∼ t α L ( t ) as t → . If λ − > then for any k ≥ h k ( t ) − ( λ − ) k e − γλ − t µ ( { λ − } ) ∼ e − γλ − t t − α L ( t )Γ( α + 1)( λ − ) k . If λ − = 0 then for any k ≥ , h k ( t ) − k =0 µ ( { } ) ∼ t − k − α L ( t )Γ( k + α + 1) . This is a standard exercise, and we do not show its proof.
Proof of Theorem E.2.
The case of γ ∈ ( γ ∗ , γ ) . We follow the notation of [Asmussen, 2003, Theorem 7.1] (seealso [Resnick, 1992, Proposition 3.11.1]). Before beginning, we observe that the Malthusian exponent does existfor this region, as the function α (cid:55)→ (cid:82) ∞ x x − α d µ ( x ) , is an increasing continuous function on ( −∞ , λ − ) . Hence bythe definition of γ ∗ , the image of this function applied to [0 , γλ − ) is all of [ γ − , γ − ∗ ) . From [Asmussen, 2003,Proposition 7.6] lim t →∞ e γλ ∗ t ( ψ ( t ) − ψ ( ∞ )) = (cid:82) ∞ e γλ ∗ t ( z ( t ) − z ( ∞ )) d t − z ( ∞ ) β (cid:82) ∞ te γλ ∗ t γ rh ( t ) d t . We evaluate these two integrals for convenience. Using the definition of z in (89) (cid:90) ∞ e γλ ∗ t ( z ( t ) − z ( ∞ )) d t − z ( ∞ ) β = R γ (cid:90) ∞ x d µ ( x ) x − λ ∗ + (cid:101) Rr γ (cid:90) ∞ d µ ( x ) x − λ ∗ − (cid:101) R γλ ∗ (1 − r ) . For the denominator, (cid:90) ∞ te γλ ∗ t γ rh ( t ) d t = γ r (cid:90) ∞ (cid:90) ∞ x te γ ( λ ∗ − x ) t d t d µ ( x )= r (cid:90) ∞ x ( x − λ ∗ ) d µ ( x ) . The case of γ ∈ (0 , γ ∗ ) . By assumption we have that γ ∗ > , and hence (cid:90) ∞ x x − λ − d µ ( x ) < ∞ . Set F ( t ) = (cid:82) t γ rh ( s ) d s . Using that ψ ( ∞ ) = z ( ∞ )1 − F ( ∞ ) (see (91)), ψ ( ∞ ) = z ( ∞ ) (cid:18) − F ( t )1 − F ( ∞ ) (cid:19) + F ( t ) ψ ( ∞ ) , and hence the constant function ψ solves the Volterra equation (88) with forcing function z ( ∞ ) (cid:18) − F ( t )1 − F ( ∞ ) (cid:19) . It followsthat ψ ( t ) − ψ ( ∞ ) = z ( t ) − z ( ∞ ) (cid:18) − F ( t )1 − F ( ∞ ) (cid:19) + γ r (cid:90) t h ( t − s )( ψ ( s ) − ψ ( ∞ )) d s. (96)47efine (cid:98) Z ( t ) = e γλ − t (cid:0) ψ ( t ) − ψ ( ∞ ) (cid:1) , (cid:98) z ( t ) = e γλ − t (cid:0) z ( t ) − z ( ∞ ) − F ( t )1 − F ( ∞ ) (cid:1) and d (cid:98) F ( t )d t = γ re γλ − t h ( t ) . Then using (96), (cid:98) Z ( t ) = (cid:98) z ( t ) + (cid:90) t d (cid:98) F ( s )d s (cid:98) Z ( t − s ) d s. By the assumption that γ < γ ∗ , we have that θ def = (cid:90) ∞ γ re γλ − t h ( t ) d t < . In preparation to apply [Asmussen et al., 2003, Theorem 5], we need to evaluate the ratio of the limits of the densities lim t →∞ (cid:98) z ( t ) (cid:98) F (cid:48) ( t ) = lim t →∞ e γλ − t (cid:0) z ( t ) − z ( ∞ ) − F ( t )1 − F ( ∞ ) (cid:1) γ re γλ − t h ( t ) = lim t →∞ (cid:18) z ( t ) − z ( ∞ ) γ rh ( t ) + z ( ∞ ) γ r F ( t ) − F ( ∞ ) h ( t )(1 − F ( ∞ )) (cid:19) . (97)To simplify this, we observe that F ( ∞ ) − F ( t ) = γr (cid:90) ∞ xe − γtx d µ ( x ) = γr h ( t ) . We also observe that z ( t ) − z ( ∞ ) = R h ( t ) + (cid:101) Rr h + ( t ) where h + ( t ) = lim (cid:15) ↓ (cid:90) ∞ (cid:15) e − γtx d µ ( x ) . We conclude that lim t →∞ (cid:98) z ( t ) (cid:98) F (cid:48) ( t ) = lim t →∞ (cid:18) Rh ( t ) + (cid:101) Rrh + ( t )2 γ rh ( t ) − ψ ( ∞ )2 γ h ( t ) h ( t ) (cid:19) . (98)Hence we have from (98) lim t →∞ (cid:98) z ( t ) (cid:98) F (cid:48) ( t ) = ( R − γrψ ( ∞ )) λ − + (cid:101) Rr γ r ( λ − ) def = c ∗ . (99)By the assumption on α, it can be checked that (cid:82) ∞ (cid:98) z ( t ) dt < ∞ . Moreover it follows that (cid:98) z ( t ) is subexponentialas α > (see [Asmussen et al., 2003, Section 3]), and hence by [Asmussen et al., 2003, Theorem 5 (ii)], (cid:98) Z ( t ) ∼ (cid:18) (cid:82) ∞ (cid:98) z ( t ) d t (1 − θ ) + c ∗ − θ (cid:19) (cid:98) F (cid:48) ( t ) ∼ (cid:18) (cid:82) ∞ (cid:98) z ( t ) d t (1 − θ ) + c ∗ − θ (cid:19) γ re γλ − t h ( t ) . 
∼ (cid:18) (cid:82) ∞ (cid:98) z ( t ) d t (1 − θ ) + c ∗ − θ (cid:19) γ r (cid:18) ( λ − ) µ ( { λ − } ) + t − α L ( t )Γ( α + 1)( λ − ) (cid:19) . The second line follows from Lemma E.1.We finish by observing that when λ − = 0 , another behavior takes hold. Theorem E.3.
Suppose that λ − = 0 , and that the measure µ has left-edge-regularity, meaning that there is an α > and slowly varying function L so that µ (( λ − , λ − + t ]) ∼ t α L ( t ) as t → . Then ψ ( t ) − ψ ( ∞ ) ∼ − γr (cid:82) ∞ x d µ ( x ) (cid:40) (cid:101) Rr t − α L ( t )Γ(1 + α ) if (cid:101) R > , R t − − α L ( t )Γ(2 + α ) if (cid:101) R = 0 . roof. We again apply [Asmussen et al., 2003, Theorem 5], and so we use the same change of variables as in TheoremE.2. We once more must compute (98), which by Lemma E.1 is now equal to ∞ . Since γ < γ ,θ def = (cid:90) ∞ γ rh ( t ) d t = γr (cid:90) ∞ x d µ ( x ) < . Hence by [Asmussen et al., 2003, Theorem 5 (iii)], (cid:98) Z ( t ) ∼ − θ (cid:98) z ( t ) ∼ − θ (cid:40) (cid:101) Rr (cid:82) ∞ e − γtx d µ ( x ) if (cid:101) R > , R (cid:82) ∞ xe − γtx d µ ( x ) if (cid:101) R = 0 . Then by Lemma E.1, the proof is complete.
E.2 Explicit solution of the Volterra equation for Isotropic Features
In this section, we solve the Volterra equation for ψ ( t ) in (12) when d µ satisfies the Marchenko-Pastur law in (8).Throughout this section we use the following change of variables (cid:98) ψ ( t ) def = 2 ψ (cid:0) t γ (cid:1) . Under this change of variables, the Volterra equation in (12) becomes (cid:98) ψ ( t ) = R · h (cid:0) t γ (cid:1) + (cid:101) R (cid:0) rh (cid:0) t γ (cid:1) + (1 − r ) (cid:1) + rγ (cid:90) t h (cid:0) γ ( t − s ) (cid:1) (cid:98) ψ ( s ) d s = R · (cid:98) h ( t ) + (cid:101) R (cid:0) r (cid:98) h ( t ) + (1 − r ) (cid:1) + (cid:90) t k ( t − s ) (cid:98) ψ ( s ) d s, (100)where we set (cid:98) h k a scaled version of h k and the kernel k as (cid:98) h k ( t ) def = h k (cid:0) t γ (cid:1) and k ( t ) def = rγ h (cid:0) t γ (cid:1) . (101)Volterra equations of convolution type can. be solved trivially by using Laplace transforms which conveniently inthe case of Marchenko-Pastur, we do have. Explicit formulas for the Laplace transforms of h k ( t ) via the Stieltjestransform of µ MP exist.We now solve for (cid:98) ψ using Laplace transforms. We let Ψ( p ) and K ( p ) be the Laplace transforms of (cid:98) ψ ( t ) and k ( t ) respectively. We can relate (cid:98) h ( t ) in (101) to the function k and hence it’s Laplace transform by the following ∂ t (cid:98) h ( t ) = − R (cid:90) ∞ x e − tx d µ MP ( x ) = − Rrγ k ( t ) and L{ (cid:98) h ( t ) } = R (cid:16) − rγ K ( p ) (cid:17) p , (102)where we used that the first moment of µ MP is [Bai and Silverstein, 2010]. We now define the function T ( t ) to bethe Laplace transform of Marchenko-Pastur and the Laplace transform of T ( i.e. , the Laplace transform of the Laplacetransform of µ µ MP ), otherwise known as the Stieltjes transform, as the following T ( t ) def = (cid:90) ∞ e − xt d µ MP and L{ T ( t ) } ( p ) = − p + r − − (cid:112) ( − p − r − − r rp . (103)It is clear that the Laplace transform for (cid:98) h is given by using L{ T ( t ) } . From the Volterra equation (100) and thefunction (cid:98) h ( t ) (101), we get the following expression for Ψ( p ) :Ψ( p ) = R (cid:16) − rγ K ( p ) (cid:17) p + K ( p )Ψ( p ) + (cid:101) R (cid:18) r L{ T ( t ) } ( p ) + (1 − r ) p (cid:19) Ψ( p ) = R ( − rγ K ( p ) ) p + (cid:101) R (cid:16) r L{ T ( t ) } ( p ) + (1 − r ) p (cid:17) − K ( p ) . (104)49e now turn to giving an explicit expression for Ψ( p ) . We begin with the following lemma relating the integral, (cid:82) xx + p d µ MP , to the Stieltjes transform of the semi-circle law. We let m be the Stieltjes transform for semi-circle law m ( z ) def = 12 π (cid:90) − (cid:112) − y y − z d y, (105)and we record a few well-known properties of this Stieltjes transform: Lemma E.2.
The Stieltjes transform m can be expressed as m ( z ) = − z + √ z − , for z ∈ C with (cid:61) z > , and it maps to the upper half plane (in fact to the upper half-disk). This can be extended to z ∈ R \ [ − , by continuity, and to the lower half plane by conjugation symmetry. The function m is the solution ofthe functional equation m ( z ) + 1 m ( z ) = − z for all (cid:61) z > . (106) Hence, we define the conjugate of m as (cid:98) m ( z ) def = 1 m ( z ) = − z − √ z − . (107) Moreover the Stieltjes transform of m is related to the Marchenko-Pastur by the identity for p ∈ C \ [ λ − , λ + ] (cid:90) ∞ xx + p d µ MP ( x ) = m ( q ) √ r , (108) where we set q def = − p +1+ r √ r .Proof. The results regarding the Stieltjes transform m are well-known and we refer the reader to [Bai and Silverstein,2010]. It remains to prove (108) relating m to the Marchenko-Pastur. First we observe that (cid:90) ∞ xx + p d µ MP ( x ) = 12 πr (cid:90) λ + λ − (cid:112) ( x − λ − )( λ + − x ) x + p d x. We recenter and rescale by sending x = λ − + λ + + λ + − λ − y , so that the follow holds (cid:112) ( x − λ − )( λ + − x ) = √ r · (cid:112) − y . Using this change of variables and noting that λ + − λ − = √ r , d x = √ r d y , and λ + + λ − = 1 + r , we deduce that (cid:90) ∞ xx + p d µ MP ( x ) = 12 π (cid:90) − (cid:112) − y p + (cid:16) λ − + λ + + λ + − λ − y (cid:17) d y = 12 π √ r (cid:90) − (cid:112) − y y − (cid:16) − p − (1+ r ) √ r (cid:17) d y. The result follows after noting the definition of m ( z ) and q .By exploiting the relationship between Marchenko-Pastur and the Stieltjes transform m defined in (105), we canevaluate some expressions against Marchenko-Pastur. To do so, it will be important to define the following quantities (cid:37) = 1 + r (cid:16) − rγ (cid:17) and ω = 14 (cid:16) − rγ (cid:17) (cid:18) γ − (1 + r ) (cid:19) . (109)50 emma E.3 (Marchenko-Pastur Integrals) . Suppose the constants (cid:37) and ω are as in (109) and fix the stepsize <γ < r . Define a critical stepsize γ ∗ as γ ∗ def = 2 √ r ( r − √ r + 1) . (110) It then follows that (cid:90) ∞ xx + p ∗ d µ MP ( x ) = (cid:0) − rγ (cid:1) − γ ( (cid:37) + i √ ω )2 , if p ∗ = − (cid:37) − i √ ω and γ < r (cid:0) − rγ (cid:1) − γ ( (cid:37) − i √ ω )2 , if p ∗ = − (cid:37) + i √ ω and γ ≤ γ ∗ (cid:0) − rγ (cid:1) (cid:37) + i √ ω ) rγ ( (cid:37) + ω ) , if p ∗ = − (cid:37) + i √ ω and γ ∗ < γ < r . (111) Proof.
First suppose that ω < , then the follow holds (cid:37) + (cid:112) | ω | = 12 (cid:16) − rγ (cid:17) (cid:16) r + (cid:113) (1 + r ) − γ (cid:17) . We wish to show exactly when this quantity is equal to (1 − √ r ) as this will give us the critical γ ∗ . Let x = (cid:113) (1 + r ) − γ and observe that r − x ≥ . Hence we have that (cid:37) + (cid:112) | ω | − (1 − √ r ) )(1 + r − x ) = (cid:16) γ − r (cid:17) − − √ r ) (1 + r − x )= (cid:0) (1 + r ) − r − x (cid:1) − − √ r ) (1 + r − x )= − ( x − (1 − √ r ) ) . Thus (cid:37) + (cid:112) | ω | < (1 − √ r ) except at a single value of γ at which (1 + r ) − γ = (1 − √ r ) ⇐⇒ γ = γ ∗ def = 2 √ r ( r − √ r + 1) . (112)Now we let p ∗ be either of − (cid:37) ± i √ ω, where we take the branch of the square root continuous in the closed upperhalf plane (where ω ∈ C ). We use the identity for p in C \ [ λ − , λ + ] , (cid:90) ∞ xx + p d µ MP ( x ) = m ( q ) √ r , where we recall − q = p +1+ r √ r . We will apply this at p ∗ , and we set − q ∗ = p ∗ +1+ r √ r .We use the identity γ p + (cid:37) ) + ω ] = (cid:0) − rγ rγp √ r m ( q ) (cid:1)(cid:0) − rγ rγp √ r (cid:98) m ( q ) (cid:1) . In particular, evaluating this identity at p = p ∗ the left-hand-side is , and we conclude that either (cid:0) − rγ rγp ∗ √ r m ( q ∗ ) (cid:1) = 0 or (cid:0) − rγ rγp ∗ √ r m ( q ∗ ) (cid:1) = 0 (113)When ω > , then as − rγ > , the correct choice is dictated by having either p ∗ m ( q ∗ ) < or p ∗ m ( q ∗ ) < . If (cid:61) p ∗ = −(cid:61) q ∗ < , we have (cid:61) m ( q ∗ ) > , and so it must be the second of these two identities in (113). Likewise, if (cid:61) p ∗ = −(cid:61) q ∗ > , then (cid:61) m ( q ∗ ) < , and it is again the second of these identities.For ω < , we argue by continuity. For p ∗ = − (cid:37) − i √ ω, the mapping γ (cid:55)→ p ∗ is a continuous function for γ ≤ r and we have that p ∗ = − (cid:37) + (cid:112) | ω | . For ω ≥ (i.e. γ ≥ (1 + r ) ), the second identity in (113) holds. If for somevalue of γ, there were a transition in which identity in (113) holds, then by continuity there would need to be a valueof γ so that both hold, which occurs if and only if (cid:0) − rγ rγp ∗ √ r m ( q ∗ ) (cid:1) = 0 = (cid:0) − rγ rγp ∗ √ r m ( q ∗ ) (cid:1) ⇐⇒ m ( q ∗ ) = 1 or p ∗ = 0 . m ( q ∗ ) is positive for ω ≤ , we must therefore have m ( q ∗ ) = 1 which occurs if and only if − p ∗ = (1 − √ r ) .From (112) this does not occur for p ∗ = − (cid:37) − i √ ω, and in conclusion for this branch of p ∗ we are always in thesecond case of (113).On the other hand for p ∗ = − (cid:37) + i √ ω, when γ < γ ∗ we must still be in the second case of (113). By continuity,it suffices to check a single value of γ > γ ∗ to determine in which case (113) we are in. The most convenient value of γ = r , but at this point, both are as p ∗ = 0. If we parameterize the first equation in terms of t = rγ , then we canwrite the second case of (113) as − t + tp ∗ ( t ) √ rm ( q ∗ ( t )) = 0 . Differentiating in t, at t = 1 , and observing p ∗ (1) = 0 , we arrive at − √ rm ( q ∗ (1)) dp ∗ ( t ) dt (cid:12)(cid:12)(cid:12)(cid:12) t =1 = − − max { r, }√ rm ( q ∗ (1)) = − − max { r, } min { r, } (cid:54) = 0 , except when r = 1 . Note that when r = 1 , no ω < is possible.We summarize the outcome of this argument as follows (cid:0) − rγ rγp ∗ √ r m ( q ∗ ) (cid:1) = 0 , for p ∗ = − (cid:37) − i √ ω and γ < r , (cid:0) − rγ rγp ∗ √ r m ( q ∗ ) (cid:1) = 0 , for p ∗ = − (cid:37) + i √ ω and γ ≤ γ ∗ , (cid:0) − rγ rγp ∗ √ r m ( q ∗ ) (cid:1) = 0 , for p ∗ = − (cid:37) + i √ ω and γ ∗ < γ < r . 
(114)Suppose we are in the first two cases of (114), in particular, we have − rγ + rγp ∗ √ r m ( q ∗ ) = 0 , then the followingholds m ( q ∗ ) = − (cid:18) − rγ (cid:19) − rγp ∗ √ r . This then implies that (cid:90) ∞ xx + p ∗ d µ MP ( x ) = − (cid:0) − rγ (cid:1) − γp ∗ . (115)On the other hand, when we are in the second case of (114), namely p ∗ = − (cid:37) + i √ ω and γ ∗ < γ < r , then m ( q ∗ ) = − √ rrγp ∗ (cid:0) − rγ (cid:1) and consequently, (cid:90) ∞ xx + p ∗ d µ MP ( x ) = − (cid:18) − rγ (cid:19) rγp ∗ . (116)The result follows. E.2.1 Noiseless setting.
With these items in place we can start deriving an expression (104) in the noiseless setting, namely when (cid:101) R = 0 . Inorder to do so, we need to compute the Laplace transform of k ( t ) : K ( p ) = (cid:90) ∞ e − pt (cid:18)(cid:90) ∞ rγ e − xt x d µ MP ( x ) (cid:19) d t = rγ (cid:90) ∞ (cid:18)(cid:90) ∞ e − ( x + p ) t d t (cid:19) x d µ MP ( x )= rγ (cid:90) ∞ x x + p d µ MP ( x ) . (117)Using this definition for the Marchenko-Pastur measure (8), we deduce from (117) that K ( p ) = γ π (cid:90) λ + λ − x + p − px + p (cid:112) ( x − λ − )( λ + − x ) d x = γ π (cid:90) λ + λ − (cid:112) ( x − λ − )( λ + − x ) dx − r · γ · p (cid:90) ∞ xx + p d µ MP ( x ) .
52y applying Lemma E.2, we can connect the Stieltjes transform m to K . Recalling q = − p − (1+ r ) √ r , we deduce K ( p ) = rγ − √ r · γ · p · m ( q ) (118)Consequently a simple string computations give the following identity R (cid:16) − rγ K ( p ) (cid:17) (1 − K ( p )) p = R · p √ r m ( q ) p (1 − rγ + rγp √ r m ( q ))= R · √ r m ( q )1 − rγ + rγp √ r m ( q ) · − rγ + rγp √ r ˆ m ( q )1 − rγ + rγp √ r ˆ m ( q )= R · √ r m ( q ) (cid:16) − rγ + rγp √ r ˆ m ( q ) (cid:17)(cid:0) − rγ (cid:1) + rγp √ r (cid:0) − rγ (cid:1) (cid:16) p +1+ r √ r (cid:17) + r (cid:0) rγp (cid:1) = R · √ r m ( q ) (cid:16) − rγ + rγp √ r ˆ m ( q ) (cid:17)(cid:0) − rγ (cid:1) + prγ (1+ r )2 r (cid:0) − rγ (cid:1) + rγp r = R · √ r m ( q ) (cid:0) − rγ (cid:1) + pγ (cid:0) − rγ (cid:1) + pγ (1+ r )2 (cid:0) − rγ (cid:1) + p γ . (119) Lemma E.4.
Fix the stepsize < γ < r and set the following constants γ ∗ = 2 √ r ( r − √ r + 1) , (cid:37) = 1 + r (cid:16) − rγ (cid:17) , and ω = 14 (cid:16) − rγ (cid:17) (cid:18) γ − (1 + r ) (cid:19) . If γ ≤ γ ∗ , then the following holds L − (cid:16) − rγ K ( p ) (cid:17) (1 − K ( p )) p = 2 γ (cid:18) − rγ (cid:19) (cid:90) ∞ xe − xt ( x − (cid:37) ) + ω d µ MP ( x ) , and in the case that γ ∗ < γ < r , one gets that L − (cid:16) − rγ K ( p ) (cid:17) (1 − K ( p )) p = 2 γ (cid:18) − rγ (cid:19) (cid:90) ∞ xe − xt ( x − (cid:37) ) + ω d µ MP ( x )+ 2 i √ ω ω · (cid:20) (cid:37) − i √ ω − (cid:18) γ (cid:19) (cid:16) − rγ (cid:17) (cid:37) + i √ ωr ( (cid:37) + ω ) (cid:21) e ( − (cid:37) + i √ ω ) t . Proof.
We first suppose that γ ≤ γ ∗ and define the function y ( t ) def = 2 γ (cid:18) − rγ (cid:19) (cid:90) ∞ xe − xt ( x − (cid:37) ) + ω d µ MP ( x ) . Then the Laplace transform of this function is given by the equation Y ( p ) def = L{ y ( t ) } ( p ) = 2 γ (cid:18) − rγ (cid:19) (cid:90) ∞ x ( x + p )(( x − (cid:37) ) + ω ) d µ MP ( x ) . The roots of ( x − (cid:37) ) + ω are precisely (cid:37) ± i √ ω so by partial fractions, we have that x ( x + p )(( x − (cid:37) ) + ω ) = 1( p + (cid:37) ) + ω · xx + p − i √ ω ω ( p + (cid:37) + i √ ω ) · xx − (cid:37) − i √ ω + 2 i √ ω ω ( p + (cid:37) − i √ ω ) · xx − (cid:37) + i √ ω (120)53rovided that √ ω (cid:54) = 0 . Using Lemmas E.2 and E.3, we have that γ (cid:18) − rγ (cid:19) (cid:90) ∞ xx − (cid:37) − i √ ω d µ MP ( x ) = (cid:37) + i √ ω, (121) γ (cid:18) − rγ (cid:19) (cid:90) ∞ xx − (cid:37) + i √ ω d µ MP ( x ) = (cid:37) − i √ ω, and (cid:90) ∞ xx + p d µ MP ( x ) = m ( q ) √ r , where the function m ( q ) and the point q = − p +1+ r √ r are defined in Lemma E.2. A simple calculation using (121)shows that γ (cid:18) − rγ (cid:19) (cid:90) ∞ (cid:18) i √ ω ω ( p + (cid:37) − i √ ω ) · xx − (cid:37) + i √ ω − i √ ω ω ( p + (cid:37) + i √ ω ) · xx − (cid:37) − i √ ω (cid:19) d µ MP ( x )= 2 i √ ω ω ( p + (cid:37) − i √ ω ) · ( (cid:37) − i √ ω ) − i √ ω ω ( p + (cid:37) + i √ ω ) · ( (cid:37) + i √ ω )= p ( p + (cid:37) ) + ω . (122)Combining the partial fractions decomposition of Y ( s ) in (120) and (121), we deduce that Y ( s ) = γ (cid:0) − rγ (cid:1) m ( q ) √ r + p ( p + (cid:37) ) + ω . Since both Y ( s ) and the RHS make sense when ω = 0 by continuity this result holds when ω = 0 . After noting that γ (cid:2) ( p + (cid:37) ) + ω (cid:3) = (cid:18) − rγ (cid:19) + pγ (1 + r )2 (cid:18) − rγ (cid:19) + γp , the result follows for γ ≤ γ ∗ from comparing with (119).Next, we consider the setting where γ > γ ∗ . In this case, we have that ω < . Let A and A be indeterminatesand define w ( t ) def = 2 γ (cid:18) − rγ (cid:19) (cid:90) ∞ xe − xt ( x − (cid:37) ) + ω d µ MP ( x ) + A i √ ω ω e ( − (cid:37) + i √ ω ) t + A i √ ω ω e ( − (cid:37) − i √ ω ) t = y ( t ) + A i √ ω ω e ( − (cid:37) + i √ ω ) t + A i √ ω ω e ( − (cid:37) − i √ ω ) t . Particularly the Laplace transform of w ( t ) is W ( p ) def = L{ w ( t ) } = Y ( p ) + 2 i √ ω ω A p + (cid:37) − i √ ω + 2 i √ ω ω A p + (cid:37) + i √ ω . (123)As in the previous case, we have the partial fraction decomposition in (120) for Y ( p ) = L{ y ( t ) } . However in thiscase, using Lemmas E.2 and E.3, we get different formulas for (121): γ (cid:18) − rγ (cid:19) (cid:90) ∞ xx − (cid:37) + i √ ω d µ MP ( x ) = (cid:18) γ (cid:19) (cid:16) − rγ (cid:17) (cid:37) + i √ ωr ( (cid:37) + ω ) , γ (cid:18) − rγ (cid:19) (cid:90) ∞ xx − (cid:37) − i √ ω d µ MP ( x ) = (cid:37) + i √ ω, and (cid:90) ∞ xx + p d µ MP ( x ) = m ( q ) √ r . (124)With these equations, we have that γ (cid:18) − rγ (cid:19) (cid:90) ∞ (cid:18) i √ ω ω ( p + (cid:37) − i √ ω ) · xx − (cid:37) + i √ ω − i √ ω ω ( p + (cid:37) + i √ ω ) · xx − (cid:37) − i √ ω (cid:19) d µ MP ( x )= 2 i √ ω ω ( p + (cid:37) − i √ ω ) (cid:18) γ (cid:19) (cid:16) − rγ (cid:17) (cid:37) + i √ ωr ( (cid:37) + ω ) − i √ ω ω ( p + (cid:37) + i √ ω ) · ( (cid:37) + i √ ω ) . (125)54e observe that if (cid:0) γ (cid:1) (cid:0) − rγ (cid:1) (cid:37) + i √ ωr ( (cid:37) + ω ) = (cid:37) − i √ ω , then by (122) we would be done. Therefore we can use A tomake this term equal to (cid:37) − i √ ω . 
Hence, we should choose A such that A + (cid:18) γ (cid:19) (cid:16) − rγ (cid:17) (cid:37) + i √ ωr ( (cid:37) + ω ) = (cid:37) − i √ ω ⇒ A = (cid:37) − i √ ω − (cid:18) γ (cid:19) (cid:16) − rγ (cid:17) (cid:37) + i √ ωr ( (cid:37) + ω ) and A = 0 . If we define Z ( p ) def = 2 γ (cid:18) − rγ (cid:19) (cid:90) ∞ (cid:18) i √ ω ω ( p + (cid:37) − i √ ω ) · xx − (cid:37) + i √ ω − i √ ω ω ( p + (cid:37) + i √ ω ) · xx − (cid:37) − i √ ω (cid:19) d µ MP ( x ) , then by the construction of A and A , we deduce that Z ( p ) + 2 i √ ω ω A p + (cid:37) − i √ ω + 2 i √ ω ω A p + (cid:37) + i √ ω = p ( p + (cid:37) ) + ω . from (122). Putting together this with (124) and the definition of W ( p ) in (123), we get that W ( p ) = γ (cid:0) − rγ (cid:1) m ( q ) √ r + p ( p + (cid:37) ) + ω . The result then follows.
Theorem E.4 (Dynamics of SGD, noiseless setting) . Suppose (cid:101) R = 0 and the batchsize satisfies β ( n ) ≤ n / − δ forsome δ > , and the stepsize is < γ < r . Define the critical stepsize γ ∗ ∈ R and constants (cid:37) > and ω ∈ C , γ ∗ = 2 √ r ( r − √ r + 1) , (cid:37) = 1 + r (cid:16) − rγ (cid:17) , and ω = 14 (cid:16) − rγ (cid:17) (cid:18) γ − (1 + r ) (cid:19) . The iterates of SGD satsify if γ ≤ γ ∗ f ( x (cid:98) nβ t (cid:99) ) Pr −−−−→ n →∞ R · γ (cid:18) − rγ (cid:19) (cid:90) ∞ xe − γxt ( x − (cid:37) ) + ω d µ MP ( x ) and if γ > γ ∗ , the iterates of SGD satisfy f ( x (cid:98) nβ t (cid:99) ) Pr −−−−→ n →∞ R · γ (cid:18) − rγ (cid:19) (cid:90) ∞ xe − γxt ( x − (cid:37) ) + ω d µ MP ( x )+ R · (cid:112) | ω | · (cid:20) (cid:37) + (cid:112) | ω | − (cid:18) γ (cid:19) (cid:16) − rγ (cid:17) (cid:37) − (cid:112) | ω | r ( (cid:37) − | ω | ) (cid:21) e − γ ( (cid:37) + √ | ω | ) t . Here the convergence is locally uniformly.Proof.
The result follows immediately from Lemma E.4 after noting that when γ ∗ < γ we have ω < and that f ( x (cid:98) nβ t (cid:99) ) Pr −−−−→ n →∞ ψ ( t ) = (cid:98) ψ (2 γt ) . E.2.2 Noisy term
We now turn to solving the noisy term in (104) ( i.e. the term with (cid:101) R ), namely (cid:101) R · r L{ T ( t ) } ( p ) + − rp − K ( p ) . T in terms of the point q = − p − − r √ r . Recall the function m ( z ) in (105) as inLemma E.2 by m ( z ) = − z + √ z − so that the Laplace transform of T becomes r L{ T ( t ) } ( p ) = √ rp · q − (cid:112) q −
42 + rp = rp − √ rp m ( q ) . We note again from Lemma E.2 that m ( q ) ˆ m ( q ) = 1 and ˆ m ( q ) = − q − m ( q ) . Using the definition of K ( p ) from(118), we get the following equality for the noisy term (cid:101) R · r L{ T ( t ) } ( p ) + − rp − K ( p ) = (cid:101) R · −√ r · m ( q ) + 1 p (1 − K ( p )) · − rγ + rγp √ r ˆ m ( q )1 − rγ + rγp √ r ˆ m ( q )= (cid:101) R · −√ r (cid:0) − rγ (cid:1) m ( q ) − rγp + 1 − rγ + rγp √ r ˆ m ( q ) p (cid:2)(cid:0) − rγ (cid:1) + rpγ (1+ r )2 r (cid:0) − rγ (cid:1) + rγp r (cid:3) = (cid:101) R · − (cid:2) γ √ r (cid:0) − rγ (cid:1) + √ rp (cid:3) m ( q ) + p + p + γ (cid:0) − rγ (cid:1) p (( p + (cid:37) ) + ω ) , (126)where (cid:37) and ω are defined in (109). With this, we able to conclude derive an expression for the noisy term in (cid:98) ψ ( t ) of(100). Lemma E.5.
Fix the stepsize < γ < r and set the following constants γ ∗ = 2 √ r ( r − √ r + 1) , (cid:37) = 1 + r (cid:16) − rγ (cid:17) , and ω = 14 (cid:16) − rγ (cid:17) (cid:18) γ − (1 + r ) (cid:19) . If γ ≤ γ ∗ , then the following holds L − (cid:40) r L{ T ( t ) } + − rp − K ( p ) (cid:41) = γ (1 − r ) (cid:0) − rγ (cid:1) (cid:37) + ω + (cid:90) ∞ − rx + γ r (cid:0) − rγ (cid:1) ( x − (cid:37) ) + ω e − xt d µ MP ( x ) and in the case that γ ∗ < γ < r , one gets that L − (cid:40) r L{ T ( t ) } + − rp − K ( p ) (cid:41) = γ (1 − r ) (cid:0) − rγ (cid:1) (cid:37) + ω + (cid:90) ∞ − rx + γ r (cid:0) − rγ (cid:1) ( x − (cid:37) ) + ω e − xt d µ MP ( x )+ (2 i √ ω ) r (cid:2) γ (cid:0) − rγ (cid:1) − ( (cid:37) − i √ ω ) (cid:3) ω ( (cid:37) − i √ ω ) · (cid:20) (cid:37) − i √ ω γ (cid:0) − rγ (cid:1) − γ ( (cid:37) + i √ ω ) (cid:0) − rγ (cid:1) r ( (cid:37) + ω ) (cid:21) e ( − (cid:37) + i √ ω ) t . Proof.
We first consider the setting where γ ≤ γ ∗ and we define the functions y ( t ) def = (cid:90) ∞ − rx + rγ (cid:0) − rγ (cid:1) ( x − (cid:37) ) + ω e − xt d µ MP ( x ) and j ( t ) def = γ (1 − r ) (cid:0) − rγ (cid:1) (cid:37) + ω . Then the Laplace transform of this function is given by the equation Y ( p ) def = L{ y ( t ) } = (cid:90) ∞ − rx + rγ (cid:0) − rγ (cid:1) x ( x + p )[( x − (cid:37) ) + ω ] x d µ MP ( x ) . (127)The roots of ( x − (cid:37) ) + ω are precisely (cid:37) ± i √ ω so by partial fractions, we have that − rx + rγ (cid:0) − rγ (cid:1) x ( x + p )[( x − (cid:37) ) + ω ] · x = rγ (cid:0) − rγ (cid:1) p ( (cid:37) + ω ) − rp + rγ (cid:0) − rγ (cid:1)(cid:2) ( p + (cid:37) ) + ω (cid:3) p · xx + p + rγ (cid:0) − rγ (cid:1) − r ( (cid:37) + i √ ω )( p + (cid:37) + i √ ω )( (cid:37) + i √ ω )(2 i √ ω ) · xx − (cid:37) − i √ ω + rγ (cid:0) − rγ (cid:1) − r ( (cid:37) − i √ ω )( p + (cid:37) − i √ ω )( (cid:37) − i √ ω )( − i √ ω ) · xx − (cid:37) + i √ ω . (128)56e will consider each term individual. By using Lemma E.2, we deduce that (cid:90) ∞ rp + rγ (cid:0) − rγ (cid:1)(cid:2) ( p + (cid:37) ) + ω (cid:3) p · xx + p d µ MP ( x ) = m ( q ) √ r · rp + rγ (cid:0) − rγ (cid:1) p (( p + (cid:37) ) + ω ) . (129)This matches the term in front of the m ( q ) in (126). For the last two terms, we apply Lemma E.3 and some simplecomputations to conclude that L ( p ) def = (cid:90) ∞ rγ (cid:0) − rγ (cid:1) − r ( (cid:37) + i √ ω )( p + (cid:37) + i √ ω )( (cid:37) + i √ ω )(2 i √ ω ) · xx − (cid:37) − i √ ω d µ MP ( x )+ (cid:90) ∞ rγ (cid:0) − rγ (cid:1) − r ( (cid:37) − i √ ω )( p + (cid:37) − i √ ω )( (cid:37) − i √ ω )( − i √ ω ) · xx − (cid:37) + i √ ω d µ MP ( x )= rγ (cid:0) − rγ (cid:1) − r ( (cid:37) + i √ ω )( p + (cid:37) + i √ ω )(2 i √ ω ) γ (cid:0) − rγ (cid:1) + rγ (cid:0) − rγ (cid:1) − r ( (cid:37) − i √ ω )( p + (cid:37) − i √ ω )( − i √ ω ) γ (cid:0) − rγ (cid:1) = − rγ ( γ + p − r )2(( p + (cid:37) ) + ω ) (cid:0) − rγ (cid:1) . (130)We next observe that L{ } = p . This observation applied to the function j ( t ) together with the terms not associatedwith m ( q ) in (128) gives the following result − r ) (cid:0) − rγ (cid:1) pγ ( (cid:37) + ω ) + L ( p ) + 2 r (cid:0) − rγ (cid:1) pγ ( (cid:37) + ω ) = p + p + γ (cid:0) − rγ (cid:1) p (( p + (cid:37) ) + ω ) . This combined with (129) and (126) shows that result holds when γ ≤ γ ∗ .Next, we consider the setting where γ > γ ∗ . In this case, we always have that ω < . Let A be an indeterminateand define w ( t ) def = y ( t ) + j ( t ) + rγ (cid:0) − rγ (cid:1) − r ( (cid:37) − i √ ω )( (cid:37) − i √ ω )( − i √ ω ) A e ( − (cid:37) + i √ ω ) t . The Laplace transform of w ( t ) is W ( p ) def = L{ w ( t ) } = Y ( p ) + 2(1 − r ) (cid:0) − rγ (cid:1) γ ( (cid:37) + ω ) · p + rγ (cid:0) − rγ (cid:1) − r ( (cid:37) − i √ ω )( (cid:37) − i √ ω )( − i √ ω ) A · p + (cid:37) − i √ ω . As in the previous case, we have that (127), (128), and (129) all still hold. The only difference occurs in the secondterm in the function L ( p ) in (130). 
In this case using Lemma E.3, we get that (cid:98) L ( p ) def = (cid:90) ∞ rγ (cid:0) − rγ (cid:1) − r ( (cid:37) + i √ ω )( p + (cid:37) + i √ ω )( (cid:37) + i √ ω )(2 i √ ω ) · xx − (cid:37) − i √ ω d µ MP ( x )+ (cid:90) ∞ rγ (cid:0) − rγ (cid:1) − r ( (cid:37) − i √ ω )( p + (cid:37) − i √ ω )( (cid:37) − i √ ω )( − i √ ω ) · xx − (cid:37) + i √ ω d µ MP ( x )= rγ (cid:0) − rγ (cid:1) − r ( (cid:37) + i √ ω )( p + (cid:37) + i √ ω )(2 i √ ω ) γ (cid:0) − rγ (cid:1) + rγ (cid:0) − rγ (cid:1) − r ( (cid:37) − i √ ω )( p + (cid:37) − i √ ω )( (cid:37) − i √ ω )( − i √ ω ) 2( (cid:37) + i √ ω ) (cid:0) − rγ ) rγ ( (cid:37) + ω ) . (131)Using the term A , we are able to make (cid:98) L ( p ) exactly equal to the RHS of L ( p ) . In particular, we choose the constant A as A = (cid:37) − i √ ω γ (cid:0) − rγ (cid:1) − (cid:37) + i √ ω ) (cid:0) − rγ (cid:1) rγ ( (cid:37) + ω ) . (132)57y this choice of A , we guarantee that the sum of (cid:98) L ( p ) and the A term equals the RHS of L ( p ) : (cid:98) L ( p ) + rγ (cid:0) − rγ (cid:1) − r ( (cid:37) − i √ ω )( (cid:37) − i √ ω )( − i √ ω ) A · p + (cid:37) − i √ ω = − rγ ( γ + p − r )2(( p + (cid:37) ) + ω ) (cid:0) − rγ (cid:1) . The result then immediately follows from the previous case when γ > γ ∗ .From this lemma, we can now derive the main result which shows that the function values concentrate. Theorem E.5 (Dynamics of SGD, noisy setting) . Suppose the batchsize satisfies β ( n ) ≤ n / − δ for some δ > andthe stepsize is < γ < r . Define the critical stepsize γ ∗ and constants (cid:37) > and ω ∈ C by γ ∗ = 2 √ r ( r − √ r + 1) , (cid:37) = 1 + r (cid:16) − rγ (cid:17) , and ω = 14 (cid:16) − rγ (cid:17) (cid:18) γ − (1 + r ) (cid:19) . The iterates of SGD satisfy if γ ≤ γ ∗ f ( x (cid:98) nβ t (cid:99) ) Pr −−−→ d →∞ R · γ (cid:18) − rγ (cid:19) (cid:90) ∞ xe − γxt ( x − (cid:37) ) + ω d µ MP ( x )+ (cid:101) R · · (cid:20) γ (1 − r ) (cid:0) − rγ (cid:1) (cid:37) + ω + (cid:90) ∞ − rx + rγ (cid:0) − rγ (cid:1) ( x − (cid:37) ) + ω e − γxt d µ MP ( x ) (cid:21) . and if γ > γ ∗ , the iterates of SGD satisfy f ( x (cid:98) nβ t (cid:99) ) Pr −−−→ d →∞ R · γ (cid:18) − rγ (cid:19) (cid:90) ∞ xe − γxt ( x − (cid:37) ) + ω d µ MP ( x )+ R · (cid:112) | ω | · (cid:20) (cid:37) + (cid:112) | ω | − γ (cid:16) − rγ (cid:17) (cid:37) − (cid:112) | ω | r ( (cid:37) − | ω | ) (cid:21) e − γ ( (cid:37) + √ | ω | ) t + (cid:101) R · (cid:20) γ (1 − r ) (cid:0) − rγ (cid:1) (cid:37) + ω + (cid:90) ∞ − rx + rγ (cid:0) − rγ (cid:1) ( x − (cid:37) ) + ω e − γxt d µ MP ( x )+ ( − (cid:112) | ω | ) r (cid:2) γ (cid:0) − rγ (cid:1) − ( (cid:37) + (cid:112) | ω | ) (cid:3) ω ( (cid:37) + (cid:112) | ω | ) · (cid:18) (cid:37) + (cid:112) | ω | γ (cid:0) − rγ (cid:1) − γ ( (cid:37) − (cid:112) | ω | )(1 − rγ ) r ( (cid:37) + ω ) (cid:19) e − γ ( (cid:37) + √ | ω | ) t (cid:21) . Here the convergence is locally uniformly.Proof.
The result follows immediately from Lemma E.5 after noting that when γ ∗ < γ we have ω < and that f ( x (cid:98) nβ t (cid:99) ) Pr −−−→ d →∞ ψ ( t ) = (cid:98) ψ (2 γt ) . E.2.3 Computing average-complexity
With our explicit expressions for ψ , we can now derive the complexity results. Theorem E.6 (Asymptotic convergence rates isotropic features) . Suppose Assumptions 1.1 and 1.2 hold with dµ ( x ) = dµ MP ( x ) . Fix the stepsize < γ < r and let the batchsize satisfies β ( n ) ≤ n / − δ for some δ > . Define theconstants (cid:37) > , ω ∈ C , and critical stepsize γ ∗ ∈ R as in (E.5) . If the ratio r = 1 , the iterates of SGD satisfy lim n →∞ f ( x (cid:98) nβ t (cid:99) ) Pr ∼ γ / · (cid:16) − rγ (cid:17) · √ λ + √ π · (cid:37) + ω (cid:20) R · γ · t / + (cid:101) R · r · t / (cid:21) . (133) If the ratio r (cid:54) = 1 and γ < γ ∗ , the iterates of SGD satisfy lim n →∞ f ( x (cid:98) nβ t (cid:99) ) Pr ∼ (cid:101) R · γ (cid:0) − rγ (cid:1) ( (cid:37) + ω ) max { , − r } + 18 √ πr · ( λ + − λ − ) / γ / (cid:2) ( λ − − (cid:37) ) + ω (cid:3) (cid:20) R · γ (cid:18) − rγ (cid:19) + (cid:101) R (cid:18) rγλ − (cid:18) − rγ (cid:19) − r (cid:19) (cid:21) · e − γλ − t · t / . f the ratio r (cid:54) = 1 and γ = γ ∗ , the iterates of SGD satisfy lim n →∞ f ( x (cid:98) nβ t (cid:99) ) Pr ∼ (cid:101) R · γ (cid:0) − rγ (cid:1) ( (cid:37) + ω ) max { , − r } + 12 √ πr · γ / r ( λ + − λ − ) / (cid:20) R · γ (cid:18) − rγ (cid:19) + (cid:101) R · (cid:18) rγλ − (cid:18) − rγ (cid:19) − r (cid:19)(cid:21) · e − γλ − t · t / . If the ratio r (cid:54) = 1 and γ > γ ∗ , the iterates of SGD satisfy lim n →∞ f ( x (cid:98) nβ t (cid:99) ) Pr ∼ (cid:101) R · γ (cid:0) − rγ (cid:1) ( (cid:37) + ω ) max { , − r } + R · (cid:112) | ω | · (cid:20) (cid:37) + (cid:112) | ω | − γ (cid:16) − rγ (cid:17) (cid:37) − (cid:112) | ω | r ( (cid:37) − | ω | ) (cid:21) e − γ ( (cid:37) + √ | ω | ) t + (cid:101) R · ( − (cid:112) | ω | ) r (cid:2) γ (cid:0) − rγ (cid:1) − ( (cid:37) + (cid:112) | ω | ) (cid:3) ω ( (cid:37) + (cid:112) | ω | ) · (cid:18) (cid:37) + (cid:112) | ω | γ (cid:0) − rγ (cid:1) − γ ( (cid:37) − (cid:112) | ω | )(1 − rγ ) r ( (cid:37) + ω ) (cid:19) e − γ ( (cid:37) + √ | ω | ) t . Here t = 1 corresponds to computing n stochastic gradients, the convergence is locally uniformly, and the notation Pr ∼ means that you take the limit in probability and then compute the asymptotic.Proof. Throughout the proof we use γt (cid:55)→ t and we define (cid:37) and ω as in Theorem E.5. Suppose we have that r = 1 .Because − γ > , we know that γ < γ ∗ . By construction of γ ∗ in (112), we have that (1 + r ) − γ − < (1 + r ) − γ ∗ = (1 − √ r ) = 0 . In particular, this means that ω > and the roots of ( x − (cid:37) ) + ω are precisely x = (cid:37) ± √− ω . As ω > , these rootsare complex with positive imaginary part and thus, ( x − (cid:37) ) + ω (cid:54) = 0 for any x ∈ [0 , λ + ] . So there exists a constant C > such that ( x − (cid:37) ) + ω > C for all x ∈ [0 , λ + ] and consequently, (cid:90) ∞ x ( x − (cid:37) ) + ω d µ MP ( x ) is bounded. (134)We begin by computing (cid:82) ∞ xe − xt ( x − (cid:37) ) + ω d µ MP . If x ≥ log ( t ) /t , then it is clear that e − tx decays faster than anypolynomial in t . Combining this with (134), we get that (cid:90) ∞ log ( t ) /t xe − xt ( x − (cid:37) ) + ω d µ MP ( x ) decays faster than any polynomial in t .Now we suppose that ≤ x < log ( t ) /t . Using a simple change of variables, we deduce that (cid:90) log ( t ) /t xe − tx ( x − (cid:37) ) + ω d µ MP ( x ) = 12 π · t / (cid:90) min { log ( t ) ,tλ + } √ u · e − u ( t u − (cid:37) ) + ω (cid:113) λ + − ut d u. 
We have that $\sqrt{\lambda^+ - \tfrac{u}{t}} \le \sqrt{\lambda^+}$ and $0 < \omega \le (x-\varrho)^2+\omega$, so the dominated convergence theorem applies:
$$\lim_{t\to\infty}\int_0^{\min\{\log^2(t),\, t\lambda^+\}} \frac{\sqrt{u}\cdot e^{-u}}{(\tfrac{u}{t}-\varrho)^2+\omega}\,\sqrt{\lambda^+-\tfrac{u}{t}}\;\mathrm{d}u = \int_0^\infty \frac{\sqrt{u}\cdot e^{-u}}{\varrho^2+\omega}\,\sqrt{\lambda^+}\;\mathrm{d}u = \frac{\sqrt{\pi\lambda^+}}{2\,(\varrho^2+\omega)}.$$
Consequently, we have that
$$\int_0^\infty \frac{x e^{-tx}}{(x-\varrho)^2+\omega}\,\mathrm{d}\mu_{\mathrm{MP}}(x) \;\sim\; \frac{1}{4\sqrt{\pi}}\cdot\frac{\sqrt{\lambda^+}}{\varrho^2+\omega}\cdot t^{-3/2}. \qquad (135)$$
Now we turn to $\int_0^\infty \frac{e^{-xt}}{(x-\varrho)^2+\omega}\,\mathrm{d}\mu_{\mathrm{MP}}(x)$. As before, we know that
$$\int_{\log^2(t)/t}^\infty \frac{e^{-xt}}{(x-\varrho)^2+\omega}\,\mathrm{d}\mu_{\mathrm{MP}}(x) \quad \text{decays faster than any polynomial in } t.$$
Again using a simple change of variables, we deduce that
$$\int_0^{\log^2(t)/t} \frac{e^{-tx}}{(x-\varrho)^2+\omega}\,\mathrm{d}\mu_{\mathrm{MP}}(x) = \frac{1}{2\pi}\cdot t^{-1/2}\int_0^{\min\{\log^2(t),\, t\lambda^+\}} \frac{e^{-u}\,\sqrt{\lambda^+-\tfrac{u}{t}}}{\sqrt{u}\,\bigl((\tfrac{u}{t}-\varrho)^2+\omega\bigr)}\;\mathrm{d}u.$$
By the dominated convergence theorem, we know that
$$\lim_{t\to\infty}\int_0^{\min\{\log^2(t),\, \lambda^+ t\}} \frac{e^{-u}\,\sqrt{\lambda^+-\tfrac{u}{t}}}{\sqrt{u}\,\bigl((\tfrac{u}{t}-\varrho)^2+\omega\bigr)}\;\mathrm{d}u = \int_0^\infty \frac{e^{-u}\,\sqrt{\lambda^+}}{\sqrt{u}\,(\varrho^2+\omega)}\;\mathrm{d}u = \frac{\sqrt{\pi\lambda^+}}{\varrho^2+\omega}.$$
Consequently, we have that
$$\int_0^\infty \frac{e^{-xt}}{(x-\varrho)^2+\omega}\,\mathrm{d}\mu_{\mathrm{MP}}(x) \;\sim\; \frac{1}{2\sqrt{\pi}}\cdot\frac{\sqrt{\lambda^+}}{\varrho^2+\omega}\cdot t^{-1/2}. \qquad (136)$$
After noting that $\bar{t} = 2\gamma t$ and applying Theorem E.5, the result for the case $r = 1$ in (133) immediately follows.

Next we consider the setting where $r \ne 1$. Suppose $\ell \in \{0, 1\}$. By the simple change of variables $x = \lambda^- + (\lambda^+-\lambda^-)u$, we have
$$\frac{1}{2\pi r}\int_{\lambda^-}^{\lambda^+} \frac{e^{-xt}\, x^{-\ell}}{(x-\varrho)^2+\omega}\,\sqrt{(x-\lambda^-)(\lambda^+-x)}\;\mathrm{d}x \qquad (137)$$
$$= \frac{(\lambda^+-\lambda^-)^2}{2\pi r}\left(\int_0^{\log^2(t)/t} + \int_{\log^2(t)/t}^{1}\right) \frac{e^{-t\lambda^-}\, e^{-(\lambda^+-\lambda^-)ut}\,\sqrt{u(1-u)}\,\bigl(\lambda^-+(\lambda^+-\lambda^-)u\bigr)^{-\ell}}{\bigl(\lambda^-+(\lambda^+-\lambda^-)u-\varrho\bigr)^2+\omega}\;\mathrm{d}u.$$
Let us first consider the region $u \ge \log^2(t)/t$. As $r \ne 1$, we have $\lambda^- > 0$ and therefore $\bigl(\lambda^-+(\lambda^+-\lambda^-)u\bigr)^{-\ell} \le (\lambda^-)^{-\ell}$. If $\omega > 0$, then $0 < \omega \le \bigl(\lambda^-+(\lambda^+-\lambda^-)u-\varrho\bigr)^2+\omega$, so that
$$\int_{\log^2(t)/t}^1 \frac{\bigl(\lambda^-+(\lambda^+-\lambda^-)u\bigr)^{-\ell}\,\sqrt{u(1-u)}}{\bigl(\lambda^-+(\lambda^+-\lambda^-)u-\varrho\bigr)^2+\omega}\;\mathrm{d}u \le C, \qquad (138)$$
that is, this integral is bounded by some $C > 0$. Now we suppose that $\omega \le 0$. The roots of $\bigl(\lambda^-+(\lambda^+-\lambda^-)u-\varrho\bigr)^2+\omega$ are precisely given by
$$u = \frac{\varrho-\lambda^- \pm \sqrt{-\omega}}{\lambda^+-\lambda^-}.$$
By (112), if $\gamma \ne \gamma^*$ we have that $\varrho + \sqrt{-\omega} < (1-\sqrt{r})^2 = \lambda^-$. Hence there exists a constant $\widehat{C} > 0$ such that $\widehat{C} < \bigl(\lambda^-+(\lambda^+-\lambda^-)u-\varrho\bigr)^2+\omega$ for all $u \in [0,1]$. It immediately follows that (138) holds. Now suppose that $\gamma = \gamma^*$. Then $\varrho + \sqrt{|\omega|} = (1-\sqrt{r})^2 = \lambda^-$, so the polynomial $\bigl(\lambda^-+(\lambda^+-\lambda^-)u-\varrho\bigr)^2+\omega$ vanishes at $u = 0$. We observe that $u = 0$ is not a double root, since there does not exist an $r$ with $\gamma = \gamma^*$, $\omega = 0$, and $\gamma^* < \tfrac{2}{r}$ (here $\gamma = \gamma^*$ and $\omega = 0$ imply that $r = 1$, but then $\gamma^* = 2$, which violates $\gamma^* < \tfrac{2}{r}$). Consequently, we can write
$$\bigl(\lambda^-+(\lambda^+-\lambda^-)u-\varrho\bigr)^2+\omega = (\lambda^+-\lambda^-)^2\, u\,(u+r_2),$$
where $r_2 > 0$, as the second root is negative. Since the Marchenko–Pastur measure has $\sqrt{u}$-behavior near the left edge, it immediately follows that
$$\int_{\log^2(t)/t}^1 \frac{\bigl(\lambda^-+(\lambda^+-\lambda^-)u\bigr)^{-\ell}\,\sqrt{u(1-u)}}{\bigl(\lambda^-+(\lambda^+-\lambda^-)u-\varrho\bigr)^2+\omega}\;\mathrm{d}u \le \frac{(\lambda^-)^{-\ell}}{r_2\,(\lambda^+-\lambda^-)^2}\int_0^1 \frac{\sqrt{1-u}}{\sqrt{u}}\;\mathrm{d}u < C.$$
Hence in all cases we have that
$$\frac{(\lambda^+-\lambda^-)^2}{2\pi r}\int_{\log^2(t)/t}^1 \frac{e^{-(\lambda^+-\lambda^-)ut}\,\bigl(\lambda^-+(\lambda^+-\lambda^-)u\bigr)^{-\ell}\,\sqrt{u(1-u)}}{\bigl(\lambda^-+(\lambda^+-\lambda^-)u-\varrho\bigr)^2+\omega}\;\mathrm{d}u \le e^{-(\lambda^+-\lambda^-)\log^2(t)}\, C,$$
where $C > 0$ is some constant. Since $e^{-(\lambda^+-\lambda^-)\log^2(t)}$ decays faster than any polynomial, this part of the integral does not contribute to the leading-order asymptotic. Now returning to (137), the interesting asymptotic occurs near $u = 0$. First we consider the setting where $\gamma \ne \gamma^*$. As we saw for $u \in [\log^2(t)/t, 1]$, we also know that there exists a constant $\widehat{C} > 0$ such that
$$\widehat{C} < \bigl(\lambda^-+(\lambda^+-\lambda^-)u\bigr)^{\ell}\Bigl[\bigl(\lambda^-+(\lambda^+-\lambda^-)u-\varrho\bigr)^2+\omega\Bigr] \quad \text{for all } u \in [0,1].$$
Using a change of variables $u = v/t$, we deduce that
$$\int_0^{\log^2(t)/t} \frac{e^{-(\lambda^+-\lambda^-)ut}\,\bigl(\lambda^-+(\lambda^+-\lambda^-)u\bigr)^{-\ell}\,\sqrt{u(1-u)}}{\bigl(\lambda^-+(\lambda^+-\lambda^-)u-\varrho\bigr)^2+\omega}\;\mathrm{d}u \le \frac{1}{\widehat{C}\, t^{3/2}}\int_0^{\log^2(t)} e^{-(\lambda^+-\lambda^-)v}\sqrt{v\bigl(1-\tfrac{v}{t}\bigr)}\;\mathrm{d}v \le \frac{1}{\widehat{C}\, t^{3/2}}\int_0^\infty e^{-(\lambda^+-\lambda^-)v}\,\sqrt{v}\;\mathrm{d}v.$$
Since the last integral is bounded, we can apply the dominated convergence theorem. Using the change of variables $u = v/t$, we have that
$$\int_0^{\log^2(t)/t} \frac{e^{-(\lambda^+-\lambda^-)ut}\,\bigl(\lambda^-+(\lambda^+-\lambda^-)u\bigr)^{-\ell}\,\sqrt{u(1-u)}}{\bigl(\lambda^-+(\lambda^+-\lambda^-)u-\varrho\bigr)^2+\omega}\;\mathrm{d}u = \frac{1}{t^{3/2}}\int_0^{\log^2(t)} \frac{e^{-(\lambda^+-\lambda^-)v}\,\bigl(\lambda^-+(\lambda^+-\lambda^-)\tfrac{v}{t}\bigr)^{-\ell}\,\sqrt{v\bigl(1-\tfrac{v}{t}\bigr)}}{\bigl(\lambda^-+(\lambda^+-\lambda^-)\tfrac{v}{t}-\varrho\bigr)^2+\omega}\;\mathrm{d}v$$
$$\sim \frac{1}{t^{3/2}}\int_0^\infty \frac{e^{-(\lambda^+-\lambda^-)v}\,(\lambda^-)^{-\ell}\,\sqrt{v}}{(\lambda^--\varrho)^2+\omega}\;\mathrm{d}v \;\sim\; \frac{1}{t^{3/2}}\cdot\frac{\sqrt{\pi}}{2\,(\lambda^+-\lambda^-)^{3/2}}\cdot\frac{(\lambda^-)^{-\ell}}{(\lambda^--\varrho)^2+\omega}. \qquad (139)$$
Consequently, we get for any $\ell \in \{0,1\}$
$$\frac{1}{2\pi r}\int_{\lambda^-}^{\lambda^+} \frac{e^{-xt}\, x^{-\ell}}{(x-\varrho)^2+\omega}\,\sqrt{(x-\lambda^-)(\lambda^+-x)}\;\mathrm{d}x \;\sim\; \frac{1}{4\sqrt{\pi}\, r}\cdot\frac{(\lambda^+-\lambda^-)^{1/2}\,(\lambda^-)^{-\ell}}{(\lambda^--\varrho)^2+\omega}\cdot e^{-\lambda^- t}\cdot t^{-3/2}.$$
Now consider the setting where $\gamma = \gamma^*$. As we saw for $u \in [\log^2(t)/t, 1]$, we know that the polynomial has a root at $u = 0$ (not a double root). In particular, we get that
$$\frac{\bigl(\lambda^-+(\lambda^+-\lambda^-)u\bigr)^{-\ell}}{\bigl(\lambda^-+(\lambda^+-\lambda^-)u-\varrho\bigr)^2+\omega} = \frac{\bigl(\lambda^-+(\lambda^+-\lambda^-)u\bigr)^{-\ell}}{(\lambda^+-\lambda^-)^2\, u\,(u+r_2)},$$
where $-r_2$ is the second root of the quadratic $\bigl(\lambda^-+(\lambda^+-\lambda^-)u-\varrho\bigr)^2+\omega$. We know this root is negative (i.e., $r_2 > 0$). Using a change of variables $u = v/t$ and a simple upper bound $\bigl(\lambda^-+(\lambda^+-\lambda^-)u\bigr)^{-\ell} \le (\lambda^-)^{-\ell}$, we get that
$$\int_0^{\log^2(t)/t} \frac{e^{-(\lambda^+-\lambda^-)ut}\,\bigl(\lambda^-+(\lambda^+-\lambda^-)u\bigr)^{-\ell}\,\sqrt{u(1-u)}}{\bigl(\lambda^-+(\lambda^+-\lambda^-)u-\varrho\bigr)^2+\omega}\;\mathrm{d}u \le \frac{(\lambda^-)^{-\ell}}{(\lambda^+-\lambda^-)^2\, r_2\, t^{1/2}}\int_0^{\log^2(t)} \frac{e^{-(\lambda^+-\lambda^-)v}\,\sqrt{1-\tfrac{v}{t}}}{\sqrt{v}}\;\mathrm{d}v \le \frac{(\lambda^-)^{-\ell}}{(\lambda^+-\lambda^-)^2\, r_2\, t^{1/2}}\int_0^\infty \frac{e^{-(\lambda^+-\lambda^-)v}}{\sqrt{v}}\;\mathrm{d}v.$$
By dominated convergence, with $u = v/t$ we have that
$$\int_0^{\log^2(t)/t} \frac{e^{-(\lambda^+-\lambda^-)ut}\,\bigl(\lambda^-+(\lambda^+-\lambda^-)u\bigr)^{-\ell}\,\sqrt{u(1-u)}}{(\lambda^+-\lambda^-)^2\, u\,(u+r_2)}\;\mathrm{d}u = \frac{1}{t^{1/2}}\int_0^{\log^2(t)} \frac{e^{-(\lambda^+-\lambda^-)v}\,\bigl(\lambda^-+(\lambda^+-\lambda^-)\tfrac{v}{t}\bigr)^{-\ell}\,\sqrt{1-\tfrac{v}{t}}}{(\lambda^+-\lambda^-)^2\,\sqrt{v}\,\bigl(\tfrac{v}{t}+r_2\bigr)}\;\mathrm{d}v$$
$$\sim \frac{1}{t^{1/2}}\int_0^\infty \frac{e^{-(\lambda^+-\lambda^-)v}\,(\lambda^-)^{-\ell}}{(\lambda^+-\lambda^-)^2\, r_2\,\sqrt{v}}\;\mathrm{d}v \;\sim\; \frac{1}{t^{1/2}}\cdot\frac{\sqrt{\pi}\,(\lambda^-)^{-\ell}}{(\lambda^+-\lambda^-)^{5/2}\, r_2}. \qquad (140)$$
Consequently, we have when $\gamma = \gamma^*$ that
$$\frac{1}{2\pi r}\int_{\lambda^-}^{\lambda^+} \frac{e^{-xt}\, x^{-\ell}}{(x-\varrho)^2+\omega}\,\sqrt{(x-\lambda^-)(\lambda^+-x)}\;\mathrm{d}x \;\sim\; \frac{1}{2\sqrt{\pi}\, r}\cdot\frac{(\lambda^-)^{-\ell}}{r_2\,(\lambda^+-\lambda^-)^{1/2}}\cdot e^{-\lambda^- t}\cdot t^{-1/2}.$$
We note that the Dirac delta from the Marchenko–Pastur measure, combined with the constant term, yields
$$\frac{\gamma\,(1-r)}{(2-r\gamma)\,(\varrho^2+\omega)} + \frac{r\gamma\,\max\bigl\{0,\,1-\tfrac{1}{r}\bigr\}}{(2-r\gamma)\,(\varrho^2+\omega)} = \frac{\gamma\,\max\{0,\,1-r\}}{(2-r\gamma)\,(\varrho^2+\omega)}.$$
The result immediately follows.
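The edge asymptotics (135) and (136) are easy to sanity-check numerically. The following minimal Python sketch compares the left-hand side of (135) at $r = 1$ against the predicted $\frac{\sqrt{\lambda^+}}{4\sqrt{\pi}(\varrho^2+\omega)}\, t^{-3/2}$; the values $\varrho = 0.5$ and $\omega = 1.0$ are illustrative placeholders, not the constants defined in (E.5).

```python
# Sanity check of the edge asymptotic (135) for r = 1.
# rho and omega are placeholder values (with omega > 0), not the
# constants of (E.5); lam_plus = (1 + sqrt(r))^2 = 4 is the MP right edge.
import numpy as np
from scipy.integrate import quad

rho, omega = 0.5, 1.0
lam_plus = 4.0

def integrand(x, t):
    # x e^{-xt} / ((x - rho)^2 + omega) against the MP density at r = 1:
    # d(mu_MP)(x) = sqrt(lam_plus - x) / (2 pi sqrt(x)) dx on [0, lam_plus]
    density = np.sqrt(lam_plus - x) / (2.0 * np.pi * np.sqrt(x))
    return x * np.exp(-x * t) / ((x - rho) ** 2 + omega) * density

def predicted(t):
    # right-hand side of (135)
    return np.sqrt(lam_plus) / (4.0 * np.sqrt(np.pi) * (rho ** 2 + omega)) * t ** (-1.5)

for t in (50.0, 200.0, 800.0):
    val, _ = quad(integrand, 0.0, lam_plus, args=(t,))
    print(f"t = {t:5.0f}: integral / predicted = {val / predicted(t):.4f}")
```

The printed ratios approach 1 as $t$ grows, which is precisely the $t^{-3/2}$ edge rate driving the $r = 1$ case of Theorem E.6.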
F Numerical simulation details and extra experiments
Problem setup.
The vectors $x_0$, $\widetilde{x}$, and the noise vector in (2) are sampled i.i.d. from the Gaussian $N(0, \tfrac{1}{d}I)$. The objective function on which we run SGD is in all cases the least squares objective $f(x) = \tfrac{1}{2n}\|Ax-b\|^2$, where $b$ is generated as in (2). The definition of $A$ differs depending on the data-generating process considered:
• For the isotropic features model, the rows of $A$ are generated i.i.d. from a standard Gaussian distribution.
• In the one-hidden layer model, the rows are generated following (25), with both $W$ and $Y$ i.i.d. from a standard Gaussian, fixed ratios $n/m$ and $m/d$, and $g$ a shifted hinge function (a data-generation sketch is given below):
$$g(z) = \max(z, 0) - \tfrac{1}{\sqrt{2\pi}}. \qquad (141)$$
The subtraction of $\tfrac{1}{\sqrt{2\pi}}$ ensures that the zero Gaussian mean assumption (27) is satisfied.
Throughout the experiments the ratio $r$ is fixed and $\widetilde{R} = 0$. We experimented with different values, and always obtained very similar qualitative results.
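For concreteness, the following Python sketch generates data for both models. The matrix shapes and the $1/\sqrt{m}$ normalization inside $g$ are our assumptions for illustration; the precise normalization is the one fixed by (25).

```python
# Data generation for the isotropic features and one-hidden layer models
# (a sketch; the 1/sqrt(m) scaling is an assumption, not taken from (25)).
import numpy as np

rng = np.random.default_rng(0)
n, m, d = 1000, 800, 600   # illustrative sizes; here r = d/n = 0.6

# Isotropic features: rows of A i.i.d. standard Gaussian.
A_iso = rng.standard_normal((n, d))

def g(z):
    # Shifted hinge (141): E[max(Z, 0)] = 1/sqrt(2*pi) for Z ~ N(0, 1),
    # so the shift enforces the zero Gaussian mean assumption (27).
    return np.maximum(z, 0.0) - 1.0 / np.sqrt(2.0 * np.pi)

# One-hidden layer model: W and Y i.i.d. standard Gaussian.
Y = rng.standard_normal((n, m))
W = rng.standard_normal((m, d))
A_1hl = g(Y @ W / np.sqrt(m))   # entries of YW/sqrt(m) are approximately N(0, 1)

# Targets as in (2) with Rtilde = 0 (no additive noise).
x_tilde = rng.standard_normal(d) / np.sqrt(d)   # x_tilde ~ N(0, I/d)
b = A_iso @ x_tilde
```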
Algorithms.
We simulate the SME and SDE models (see (17) for a description) using an Euler–Maruyama discretization with a small constant stepsize. The SDE model discretizes the same equation as the SME model (Eq. (17)) but with a scaled identity covariance ($\Sigma = \sigma^2 I$). In this model $\sigma$ is a free parameter, which for the experiments we set to $0.1$, as this value gave the closest fit to SGD across the log-spaced grid of parameters $10^{-i}$, $i = 0, 1, \ldots$. For the streaming model, we use SGD updates and regenerate $a_i$ (following the same model as SGD) at every step; a minimal sketch of both simulations follows.
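The sketch below assumes the SME/SDE of Eq. (17) reduces to a gradient flow on the least-squares loss plus noise, which is our simplified reading; $\sigma = 0.1$ is the value quoted above, and the helper names are ours.

```python
# Euler-Maruyama discretization of the SDE model (Sigma = sigma^2 I) and
# the streaming model (a sketch; the drift is our simplified reading of (17)).
import numpy as np

def sde_path(A, b, x0, sigma=0.1, h=1e-3, steps=20_000, seed=0):
    rng = np.random.default_rng(seed)
    n, d = A.shape
    x = x0.copy()
    for _ in range(steps):
        grad = A.T @ (A @ x - b) / n                    # gradient of (1/2n)||Ax - b||^2
        x = x - h * grad + sigma * np.sqrt(h) * rng.standard_normal(d)
    return x

def streaming_sgd(x_tilde, d, gamma, beta=1, steps=20_000, seed=1):
    # Streaming model: a fresh Gaussian batch of rows is drawn at every step.
    rng = np.random.default_rng(seed)
    x = np.zeros(d)
    for _ in range(steps):
        a = rng.standard_normal((beta, d))
        resid = a @ (x - x_tilde)                       # noiseless targets (Rtilde = 0)
        x = x - (gamma / beta) * (a.T @ resid)
    return x
```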
Extra experiments.
We provide extra experiments following the same setting as in Figure 4, but with different choices of the ratio $r = d/n$.
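A compact way to reproduce this sweep is sketched below. We take $\gamma_{\max} = 2/r$, the stability threshold suggested by the stepsize condition of Theorem E.6, and a $\gamma/n$-scaled batchsize-one update; both are our assumptions, not the paper's exact normalization.

```python
# Sweep the ratio r = d/n as in Figure 5 (a sketch; gamma_max = 2/r and the
# gamma/n step scaling are assumptions, not the paper's exact choices).
import numpy as np

rng = np.random.default_rng(3)
n = 1000

def sgd_final_loss(A, b, gamma, steps):
    n_loc, d = A.shape
    x = np.zeros(d)
    for _ in range(steps):
        i = rng.integers(n_loc)
        a = A[i]
        x = x - (gamma / n_loc) * (a @ x - b[i]) * a   # batchsize-1 SGD step
    return np.linalg.norm(A @ x - b) ** 2 / (2 * n_loc)

for r in (0.5, 1.5):
    d = int(r * n)
    A = rng.standard_normal((n, d))                 # isotropic features
    x_tilde = rng.standard_normal(d) / np.sqrt(d)
    b = A @ x_tilde                                 # Rtilde = 0
    gamma_max = 2.0 / r
    for k in (8.0, 4.0, 2.0, 1.2):
        loss = sgd_final_loss(A, b, gamma_max / k, steps=5 * n)
        print(f"r = {r}, gamma = gamma_max/{k}: loss = {loss:.3e}")
```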
Figure 5: Comparison of the different SGD models (SME, SDE, empirical SGD runs, streaming, and our Volterra model) on the isotropic features and one-hidden layer problems at stepsizes $\gamma = \gamma_{\max}/8.0$, $\gamma_{\max}/4.0$, $\gamma_{\max}/2.0$, and $\gamma_{\max}/1.2$, for two further choices of the ratio $r = d/n$.