[PDF] From Averaging to Acceleration, There is Only a Step-size

Abstract

We show that accelerated gradient descent, averaged gradient descent and the heavy-ball method for non-strongly-convex problems may be reformulated as constant parameter second-order difference equation algorithms, where stability of the system is equivalent to convergence at rate O(1/n 2), where n is the number of iterations. We provide a detailed analysis of the eigenvalues of the corresponding linear dynamical system , showing various oscillatory and non-oscillatory behaviors, together with a sharp stability result with explicit constants. We also consider the situation where noisy gradients are available, where we extend our general convergence result, which suggests an alternative algorithm (i.e., with different step sizes) that exhibits the good aspects of both averaging and acceleration.

Full PDF

FFrom Averaging to Acceleration, There is Only a Step-size

Nicolas Flammarion and Francis BachINRIA - Sierra project-teamD´epartement d’Informatique de l’Ecole Normale Sup´erieureParis, France [email protected] , [email protected] April 8, 2015

Abstract

We show that accelerated gradient descent, averaged gradient descent and the heavy-ball method for non-strongly-convex problems may be reformulated as constant pa-rameter second-order diﬀerence equation algorithms, where stability of the system isequivalent to convergence at rate O (1 /n ), where n is the number of iterations. Weprovide a detailed analysis of the eigenvalues of the corresponding linear dynamical sys-tem, showing various oscillatory and non-oscillatory behaviors, together with a sharpstability result with explicit constants. We also consider the situation where noisy gra-dients are available, where we extend our general convergence result, which suggests analternative algorithm (i.e., with diﬀerent step sizes) that exhibits the good aspects ofboth averaging and acceleration. Many problems in machine learning are naturally cast as convex optimization problemsover a Euclidean space; for supervised learning this includes least-squares regression, logisticregression, and the support vector machine. Faced with large amounts of data, practitionersoften favor ﬁrst-order techniques based on gradient descent, leading to algorithms withmany cheap iterations. For smooth problems, two extensions of gradient descent have hadimportant theoretical and practical impacts: acceleration and averaging.Acceleration techniques date back to Nesterov (1983) and have their roots in momentumtechniques and conjugate gradient (Polyak, 1987). For convex problems, with an appropri-ately weighted momentum term which requires to store two iterates, Nesterov (1983) showedthat the traditional convergence rate of O (1 /n ) for the function values after n iterations ofgradient descent goes down to O (1 /n ) for accelerated gradient descent, such a rate beingoptimal among ﬁrst-order techniques that can access only sequences of gradients (Nesterov,2004). Like conjugate gradient methods for solving linear systems, these methods are how-ever more sensitive to noise in the gradients; that is, to preserve their improved convergencerates, signiﬁcantly less noise may be tolerated (d’Aspremont, 2008; Schmidt et al., 2011;Devolder et al., 2014). 1 a r X i v : . [ s t a t . M L ] A p r veraging techniques which consist in replacing the iterates by the average of all iterateshave also been thoroughly considered, either because they sometimes lead to simpler proofs,or because they lead to improved behavior. In the noiseless case where gradients are exactlyavailable, they do not improve the convergence rate in the convex case; worse, for strongly-convex problems, they are not linearly convergent while regular gradient descent is. Theirmain advantage comes with random unbiased gradients, where it has been shown that theylead to better convergence rates than the unaveraged counterparts, in particular becausethey allow larger step-sizes (Polyak and Juditsky, 1992; Bach and Moulines, 2011). Forexample, for least-squares regression with stochastic gradients, they lead to convergencerates of O (1 /n ), even in the non-strongly convex case (Bach and Moulines, 2013).In this paper, we show that for quadratic problems, both averaging and acceleration are twoinstances of the same second-order ﬁnite diﬀerence equation, with diﬀerent step-sizes. Theymay thus be analyzed jointly, together with a non-strongly convex version of the heavy-ball method (Polyak, 1987, Section 3.2). In presence of random zero-mean noise on thegradients, this joint analysis allows to design a novel intermediate algorithm that exhibitsthe good aspects of both acceleration (quick forgetting of initial conditions) and averaging(robustness to noise).In this paper, we make the following contributions:– We show in Section 2 that accelerated gradient descent, averaged gradient descentand the heavy-ball method for non-strongly-convex problems may be reformulated asconstant parameter second-order diﬀerence equation algorithms, where stability of thesystem is equivalent to convergence at rate O (1 /n ).– In Section 3, we provide a detailed analysis of the eigenvalues of the corresponding lineardynamical system, showing various oscillatory and non-oscillatory behaviors, togetherwith a sharp stability result with explicit constants.– In Section 4, we consider the situation where noisy gradients are available, where weextend our general convergence result, which suggests an alternative algorithm (i.e., withdiﬀerent step sizes) that exhibits the good aspects of both averaging and acceleration.– In Section 5, we illustrate our results with simulations on synthetic examples. Throughout this paper, we consider minimizing a convex quadratic function f : R d → R deﬁned as: f ( θ ) = 12 (cid:104) θ, Hθ (cid:105) − (cid:104) q, θ (cid:105) , (1)with H ∈ R d × d a symmetric positive semi-deﬁnite matrix and q ∈ R d . Without loss ofgenerality, H is assumed invertible (by projecting onto the orthogonal of its null space),though its eigenvalues could be arbitrarily small. The solution is known to be θ ∗ = H − q ,2ut the inverse of the Hessian is often too expensive to compute when d is large. The excesscost function may be simply expressed as f ( θ n ) − f ( θ ∗ ) = (cid:104) θ n − θ ∗ , H ( θ n − θ ∗ ) (cid:105) . In this paper we study second-order iterative algorithms of the form: θ n +1 = A n θ n + B n θ n − + c n , (2)started with θ = θ in R d , with A n ∈ R d × d , B n ∈ R d × d and c n ∈ R d for all n ∈ N ∗ . Weimpose the natural restriction that the optimum θ ∗ is a stationary point of this recursion,that is, for all n ∈ N ∗ : θ ∗ = A n θ ∗ + B n θ ∗ + c n . ( θ ∗ -stationarity)By letting φ n = θ n − θ ∗ we then have φ n +1 = A n φ n + B n φ n − , started from φ = φ = θ − θ ∗ .Thus, we restrict our problem to the study of the convergence of an iterative system to 0.In connection with accelerated methods, we are interested in algorithms for which f ( θ n ) − f ( θ ∗ ) = (cid:104) φ n , Hφ n (cid:105) converges to 0 at a speed of O (cid:0) /n (cid:1) . Within this context we imposethat A n and B n have the form : A n = nn + 1 A and B n = n − n + 1 B ∀ n ∈ N with A, B ∈ R d × d . (n-scalability)By letting η n = nφ n = n ( θ n − θ ∗ ), we can now study the simple iterative system with constant terms η n +1 = Aη n + Bη n − , started at η = 0 and η = θ − θ ∗ . Showing thatthe function values remain bounded, we directly have the convergence of f ( θ n ) to f ( θ ∗ ) atthe speed O (cid:0) /n (cid:1) . Thus the n-scalability property allows to switch from a convergenceproblem to a stability problem.For feasibility concerns the method can only access H through matrix-vector products.Therefore A and B should be polynomials in H and c a polynomial in H times q , if possibleof low degree. The following theorem clariﬁes the general form of iterative systems whichshare these three properties (see proof in Appendix B). Theorem 1.

Let ( P n , Q n , R n ) ∈ ( R [ X ]) for all n ∈ N , be a sequence of polynomials. If theiterative algorithm deﬁned by Eq. (2) with A n = P n ( H ) , B n = Q n ( H ) and c n = R ( H ) q sat-isﬁes the θ ∗ -stationarity and n-scalability properties, there are polynomials ( ¯ A, ¯ B ) ∈ ( R [ X ]) such that: A n = 2 nn + 1 (cid:18) I − (cid:18) ¯ A ( H ) + ¯ B ( H )2 (cid:19) H (cid:19) ,B n = − n − n + 1 (cid:18) I − ¯ B ( H ) H (cid:19) and c n = (cid:18) n ¯ A ( H ) + ¯ B ( H ) n + 1 (cid:19) q. Note that our result prevents A n and B n from being zero, thus requiring the algorithm tostrictly be of second order. This illustrates the fact that ﬁrst-order algorithms as gradientdescent do not have the convergence rate in O (1 /n ).3e now restrict our class of algorithms to lowest possible order polynomials, that is, ¯ A = αI and ¯ B = βI with ( α, β ) ∈ R , which correspond to the fewest matrix-vector products periteration, leading to the constant-coeﬃcient recursion for η n = nφ n = n ( θ n − θ ∗ ): η n +1 = ( I − αH ) η n + ( I − βH ) ( η n − η n − ) . (3) Expression with gradients of f . The recursion in Eq. (3) may be written with gradientsof f in multiple ways. In order to preserve the parallel with accelerated techniques, werewrite it as: θ n +1 = 2 nn + 1 θ n − n − n + 1 θ n − − nα + βn + 1 f (cid:48) (cid:18) n ( α + β ) nα + β θ n − ( n − βnα + β θ n − (cid:19) . (4)It may be interpreted as a modiﬁed gradient recursion with two potentially diﬀerent aﬃne(i.e., with coeﬃcients that sum to one) combinations of the two past iterates. This refor-mulation will also be crucial when using noisy gradients. The allowed values for ( α, β ) ∈ R will be determined in the following sections. Averaged gradient descent.

We consider averaged gradient descent (referred to fromnow on as “Av-GD”) (Polyak and Juditsky, 1992) with step-size γ ∈ R deﬁned by: ψ n +1 = ψ n − γf (cid:48) ( ψ n ) , θ n +1 = 1 n + 1 n +1 (cid:88) i =1 ψ i . When computing the average online as θ n +1 = θ n + n +1 ( ψ n +1 − θ n ) and seeing the averageas the main iterate, the algorithm becomes (see proof in Appendix B.2): θ n +1 = 2 nn + 1 θ n − n − n + 1 θ n − − γn + 1 f (cid:48) (cid:0) nθ n − ( n − θ n − (cid:1) . This corresponds to Eq. (4) with α = 0 and β = γ . Accelerated gradient descent.

We consider the accelerated gradient descent (referredto from now on as “Acc-GD”) (Nesterov, 1983) with step-sizes ( γ, δ n ) ∈ R : θ n +1 = ω n − γf (cid:48) ( ω n ) , ω n = θ n + δ n ( θ n − θ n − ) . For smooth optimization the accelerated literature (Nesterov, 2004; Beck and Teboulle,2009) uses the step-size δ n = 1 − n +1 and their results are not valid for bigger step-size δ n . However δ n = 1 − n +1 is compatible with the framework of Lan (2012) and is moreconvenient for our set-up. This corresponds to Eq. (4) with α = γ and β = γ . Note thataccelerated techniques are more generally applicable, e.g., to composite optimization withsmooth functions (Nesterov, 2013; Beck and Teboulle, 2009).4 eavy ball. We consider the heavy-ball algorithm (referred to from now on as “HB”)(Polyak, 1964) with step-sizes ( γ, δ n ) ∈ R : θ n +1 = θ n − γf (cid:48) ( θ n ) + δ n ( θ n − θ n − ) , when δ n = 1 − n +1 . We note that typically δ n is constant for strongly-convex problems.This corresponds to Eq. (4) with α = γ and β = 0. We study the convergence of the iterates deﬁned by: η n +1 = ( I − αH ) η n +( I − βH ) ( η n − η n − ).This is a second-order iterative system with constant coeﬃcients that it is standard to castin a linear framework (see, e.g., Ortega and Rheinboldt, 2000). We may rewrite it as:Θ n = F Θ n − , with Θ n = (cid:18) η n η n − (cid:19) and F = (cid:18) I − ( α + β ) H βH − II (cid:19) ∈ R d × d . Thus Θ n = F n Θ . Following O’Donoghue and Candes (2013), if we consider an eigenvaluedecomposition of H , i.e., H = P Diag( h ) P (cid:62) with P an orthogonal matrix and ( h i ) theeigenvalues of H , sorted in decreasing order: h d = L ≥ h d − ≥ · · · ≥ h ≥ h = µ >

0, thenEq. (3) may be rewritten as: P (cid:62) η n +1 = ( I − α Diag ( h )) P (cid:62) η n + ( I − β Diag ( h )) (cid:0) P (cid:62) η n − P (cid:62) η n − (cid:1) . (5)Thus there is no interaction between the diﬀerent eigenspaces and we may consider, for theanalysis only, d diﬀerent recursions with η in = p (cid:62) i η n , i ∈ { , ..., d } , where p i ∈ R d is the i -thcolumn of P : η in +1 = (1 − αh i ) η in + (1 − βh i ) (cid:0) η in − η in − (cid:1) . (6) In this section, we consider a ﬁxed i ∈ { , . . . , d } and study the stability in the correspondingeigenspace. This linear dynamical system may be analyzed by studying the eigenvalues ofthe 2 × F i = (cid:18) − ( α + β ) h i βh i −

11 0 (cid:19) . These eigenvalues are the roots of itscharacteristic polynomial which is:det( XI − F i ) = det ( X ( X − α + β ) h i ) + 1 − βh i ) = X − X (cid:16) − (cid:16) α + β (cid:17) h i (cid:17) +1 − βh i . To compute the roots of the second-order polynomial, we compute its reduced discriminant:∆ i = (cid:16) − (cid:16) α + β (cid:17) h i (cid:17) − βh i = h i (cid:16)(cid:16) α + β (cid:17) h i − α (cid:17) . Depending on the sign of the discriminant ∆ i , there will be two real distinct eigenvalues(∆ i > i <

0) or a single real eigenvalue (∆ i = 0).5 1 2 3 4012 AvGDAccGDHB Real Complex βh i αh i Figure 1: Area of stability of the algorithm, with the three traditional algorithms repre-sented. In the interior of the triangle, the convergence is linear.We will now study the sign of ∆ i . In each diﬀerent case, we will determine under whatconditions on α and β the modulus of the eigenvalues is less than one, which means that theiterates ( η in ) n remain bounded and the iterates ( θ n ) n converge to θ ∗ . We may then computefunction values as f ( θ n ) − f ( θ ∗ ) = n (cid:80) di =1 ( η in ) h i = (cid:80) di =1 ( φ in ) h i .The various regimes are summarized in Figure 1: there is a triangle of values of ( αh i , βh i )for which the algorithm remains stable (i.e., the iterates ( η n ) n do not diverge), with eithercomplex or real eigenvalues. In the following lemmas (see proof in Appendix C), we providea detailed analysis that leads to Figure 1. Lemma 1 (Real eigenvalues) . The discriminant ∆ i is strictly positive and the algorithm isstable if and only if α ≥ , α + 2 β ≤ /h i , α + β > (cid:112) α/h i . We then have two real roots r ± i = r i ± √ ∆ i , with r i = 1 − ( α + β ) h i . Moreover, we have: ( φ in ) h i = ( φ i ) h i n (cid:2) ( r i + √ ∆ i ) n − ( r i − √ ∆ i ) n (cid:3) ∆ i . (7)Therefore, for real eigenvalues, (( φ in ) h i ) n will converge to 0 at a speed of O (1 /n ) howeverthe constant ∆ i may be arbitrarily small (and thus the scaling factor arbitrarily large).Furthermore we have linear convergence if the inequalities in the lemmas are strict. Lemma 2 (Complex eigenvalues) . The discriminant ∆ i is stricly negative and the algorithmis stable if and only if α ≥ , β ≥ , α + β < (cid:112) α/h i . We then have two complex conjugate eigenvalues: r ± i = r i ± √− √− ∆ i . Moreover, wehave: ( φ in ) h i = ( φ i ) n sin ( ω i n ) (cid:0) α − ( α + β ) h i (cid:1) ρ n . (8) with ρ i = √ − βh i , and ω i deﬁned through sin( ω i ) = √− ∆ i /ρ i and cos( ω i ) = r i /ρ i . φ in ) h i ) n oscillates to 0 at a speed of O (1 /n ) even if h i isarbitrarily small. Coalescing eigenvalues.

When the discriminant goes to zero in the explicit formulas ofthe real and complex cases, both the denominator and numerator of (( φ in ) h i ) n will go tozero. In the limit case, when the discriminant is equal to zero, we will have a double realeigenvalue. This happens for β = 2 (cid:112) α/h i − α . Then the eigenvalue is r i = 1 −√ αh i , and thealgorithm is stable for 0 < α < /h i , we then have ( φ in ) h i = h i ( φ i ) (1 − √ αh i ) n − . Thiscan be obtained by letting ∆ i goes to 0 in the real and complex cases (see also Appendix C.3). Summary.

To conclude the iterate ( η in ) n = ( n ( θ in − θ i ∗ )) n will be stable for α ∈ [0 , /h i ]and β ∈ [0 , /h i − α/ α and β this iterate will have a diﬀerentbehavior. In the complex case, the roots are complex conjugate with magnitude √ − βh i .Thus, when β >

0, ( η in ) n will converge to 0, oscillating, at rate √ − βh i . In the realcase, the two roots are real and distinct. However the product of the two roots is equalto √ − βh i , thus one will have a higher magnitude and ( η in ) n will converges to 0 at ratehigher than in the complex case (as long as α and β belong to the interior of the stabilityregion).Finally, for a given quadratic function f , all the d iterates ( η in ) n should be bounded, thereforewe must have α ∈ [0 , /L ] and β ∈ [0 , /L − α/ h i , someeigenvalues may be complex or real. For particular choices of α and β , displayed in Figure 1, the eigenvalues are either all realor all complex, as shown in the table below.Av-GD Acc-GD Heavy ball α γ γβ γ γ i ( γh i ) − γh i (1 − γh i ) − γh i (1 − γh i ) r ± i

1, 1 − γh i √ − γh i e ± iω i e ± iω i cos( ω i ) √ − γh i − γ h i ρ i √ − γh i r + i = 1 for all eigensubspaces. Similarly, the heavy ball method is not adaptive to strongconvexity because ρ i = 1. However, accelerated gradient descent, although designed fornon-strongly-convex problems, is adaptive because ρ i = √ − γh i depends on h i while α and β do not. These last two algorithms have an oscillatory behavior which can be observedin practice and has been already studied (Su et al., 2014).7ote that all the classical methods choose step-sizes α and β either having all the eigenvaluesreal either complex; whereas we will see in Section 4, that it is signiﬁcant to combine bothbehaviors in presence of noise. Even if the exact formulas in Lemmas 1 and 2 are computable, they are not easily inter-pretable. In particular when the two roots become close, the denominator will go to zero,which prevents from bounding them easily. When we further restrict the domain of ( α, β ),we can always bound the iterate by the general bound (see proof in Appendix D):

Theorem 2.

For α ≤ /h i and ≤ β ≤ /h i − α , we have ( η in ) ≤ min (cid:26) η i ) αh i , η i ) n ( α + β ) h i , η i ) ( α + β ) h i (cid:27) . (9)These bounds are shown by dividing the set of ( α, β ) in three regions where we obtainspeciﬁc bounds. They do not depend on the regime of the eigenvalues (complex or real);this enables us to get the following general bound on the function values, our main resultfor the deterministic case. Corollary 1.

For α ≤ /L and ≤ β ≤ /L − α : f ( θ n ) − f ( θ ∗ ) ≤ min (cid:26) (cid:107) θ − θ ∗ (cid:107) αn , (cid:107) θ − θ ∗ (cid:107) ( α + β ) n (cid:27) . (10)We can make the following observations:– The ﬁrst bound (cid:107) θ − θ ∗ (cid:107) αn corresponds to the traditional acceleration result, and is onlyrelevant for α > (cid:107) θ n − θ ∗ (cid:107) typically does not go to zero (althoughit remains bounded in our situation).– When α = 0 (averaged gradient descent), then the second bound (cid:107) θ − θ ∗ (cid:107) ( α + β ) n provides aconvergence rate of O (1 /n ) if no assumption is made regarding the starting point θ ,while the last bound of Theorem 2 would lead to a bound (cid:107) H − / ( θ − θ ∗ ) (cid:107) ( α + β ) n , that is arate of O (1 /n ), only for some starting points.– As shown in Appendix E by exhibiting explicit sequences of quadratic functions, theinverse dependence in αn and ( α + β ) n in Eq. (10) is not improvable.8 1 2 3 4012 AvGDAccGDOur Algorithms βh i αh i Figure 2: Trade-oﬀ between averaged and accelerated methods for noisy gradients.

In many practical situations, the gradient of f is not available for the recursion in Eq. (4),but only a noisy version. In this paper, we only consider additive uncorrelated noise withﬁnite variance. We now assume that the true gradient is not available and we rather have access to a noisyoracle for the gradient of f . In Eq. (4), we assume that the oracle outputs a noisy gradient f (cid:48) (cid:0) n ( α + β ) nα + β θ n − ( n − βnα + β θ n − (cid:1) − ε n +1 . The noise ( ε n ) is assumed to be uncorrelated zero-meanwith bounded covariance, i.e., E [ ε n ⊗ ε m ] = 0 for all n (cid:54) = m and E [ ε n ⊗ ε n ] (cid:52) C , where A (cid:52) B means that B − A is positive semi-deﬁnite.For quadratic functions, for the reduced variable η n = nφ n = n ( θ n − θ ∗ ), we get: η n +1 = ( I − αH ) η n + ( I − βH )( η n − η n − ) + [ nα + β ] ε n +1 . (11)Note that algorithms with α (cid:54) = 0 will have an important level of noise because of the term nαε n +1 . We denote by ξ n +1 = (cid:18) [ nα + β ] ε n +1 (cid:19) and we now have the recursion:Θ n +1 = F Θ n + ξ n +1 , (12)which is a standard noisy linear dynamical system (see, e.g., Arnold, 1998) with uncorrelatednoise process ( ξ n ). We may thus express Θ n directly as Θ n = F n Θ + (cid:80) nk =1 F n − k ξ k , and itsexpected second-order moment as, E (cid:0) Θ n Θ n (cid:1) (cid:62) = F n Θ Θ (cid:62) ( F n ) (cid:62) + (cid:80) nk =1 F n − k E (cid:0) ξ k ξ (cid:62) k (cid:1) ( F n − k ) (cid:62) . In order to obtain the expected excess cost function, we simply need to compute tr (cid:18) H (cid:19) E (cid:0) Θ n Θ n (cid:1) (cid:62) ,which thus decomposes as a term that only depends on initial conditions (which is exactlythe one computed and studied in Section 3.3), and a new term that depends on the noise.9 .2 Convergence result For a quadratic function f with arbitrarily small eigenvalues and uncorrelated noise withﬁnite covariance, we obtain the following convergence result (see proof in Appendix F);since we will allow the parameters α and β to depend on the time we stop the algorithm,we introduce the horizon N : Theorem 3 (Convergence rates with noisy gradients) . With E [ ε n ⊗ ε n ] = C for all n ∈ N ,for α ≤ L and ≤ β ≤ L − α . Then for any N ∈ N , we have: E f ( θ N ) − f ( θ ∗ ) ≤ min (cid:26) (cid:107) θ − θ ∗ (cid:107) αN + ( αN + β ) αN tr( C ) , (cid:107) θ − θ ∗ (cid:107) ( α + β ) N + 4( αN + β ) α + β tr( C ) (cid:27) . (13)We can make the following observations:– Although we only provide an upper-bound, the proof technique relies on direct momentcomputations in each eigensubspace with few inequalities, and we conjecture that thescalings with respect to n are tight.– For α = 0 and β = 1 /L (which corresponds to averaged gradient descent), the secondbound leads to L (cid:107) θ − θ ∗ (cid:107) N + C ) L , which is bounded but not converging to zero. Werecover a result from Bach and Moulines (2011, Theorem 1).– For α = β = 1 /L (which corresponds to Nesterov’s acceleration), the ﬁrst bound leadsto L (cid:107) θ − θ ∗ (cid:107) N + ( N +1) tr( C ) L , and our bound suggests that the algorithm diverges, whichwe have observed in our experiments in Appendix A.– For α = 0 and β = 1 /L √ N , the second bound leads to L (cid:107) θ − θ ∗ (cid:107) √ N + C ) L √ N , and werecover the traditional rate of 1 / √ N for stochastic gradient in the non-strongly-convexcase.– When the values of the bias and the variance are known we can choose α and β suchthat the trade-oﬀ between the bias and the variance is optimal in our bound, as thefollowing corrollary shows. Note that in the bound below, taking a non zero β enablesthe bias term to be adaptive to hidden strong-convexity. Corollary 2.

For α = min (cid:110) (cid:107) θ − θ ∗ (cid:107) √ tr CN / , /L (cid:111) and β ∈ [0 , min { N α, /L } ] , we have: E f ( θ N ) − f ( θ ∗ ) ≤ L (cid:107) θ − θ ∗ (cid:107) N + 4 √ tr C (cid:107) θ − θ ∗ (cid:107)√ N .

When only the noise total variance tr( C ) is considered, as shown in Section 4.4, Corollary2 recover existing (more general) results. Our framework however leads to improved resultfor structured noise processes frequent in machine learning, in particular in least-squaresregression which we now consider but this goes beyond (see, e.g. Bach and Moulines, 2013).10ssume we observe independent and identically distributed pairs ( x n , y n ) ∈ R d × R andwe want to minimize the expected loss f ( θ ) = E [( y n − (cid:104) θ, x n (cid:105) ) ]. We denote by H = E ( x n ⊗ x n ) the covariance matrix which is assumed invertible. The global minimum of f is attained at θ ∗ ∈ R d deﬁned as before and we denote by r n = y n − (cid:104) θ ∗ , x n (cid:105) the statisticalnoise, which we assume bounded by σ . We have E [ r n x n ] = 0. In an online setting, weobserve the gradient ( x n ⊗ x n )( θ − θ ∗ ) − r n x n , whose expectation is the gradient f (cid:48) ( θ ). Thiscorresponds to a noise in the gradient of ε n = ( H − x n ⊗ x n )( θ − θ ∗ ) + r n x n . Given θ , ifthe data ( x n , y n ) are almost surely bounded, the covariance matrix of this noise is boundedby a constant times H . This suggests to characterize the noise convergence by tr( CH − ),which is bounded even though H has arbitrarily small eigenvalues.However, our result will not apply to stochastic gradient descent (SGD) for least-squares,because of the term ( H − x n ⊗ x n )( θ − θ ∗ ) which depends on θ , but to a “semi-stochastic”recursion where the noisy gradient is H ( θ − θ ∗ ) − r n x n , with a noise process ε n = r n x n ,which is such that E [ ε n ⊗ ε n ] (cid:52) σ H , and has been used by Bach and Moulines (2011) andDieuleveut and Bach (2014) to prove results on regular stochastic gradient descent. Weconjecture that our algorithm (and results) also applies in the regular SGD case, and weprovide encouraging experiments in Section 5.For this particular structured noise we can take advantage of a large β : Theorem 4 (Convergence rates with structured noisy gradients) . Let α ≤ L and ≤ β ≤ L − α . For any N ∈ N , E f ( θ N ) − f ( θ ∗ ) is upper-bounded by: min (cid:26) (cid:107) θ − θ ∗ (cid:107) N α + ( αN + β ) αβN tr( CH − ) , L (cid:107) θ − θ ∗ (cid:107) ( α + β ) N + 8( αN + β ) tr( CH − )( α + β ) N (cid:27) . (14)We can make the following observations:– For α = 0 and β = 1 /L (which corresponds to averaged gradient descent), the secondbound leads to L (cid:107) θ − θ ∗ (cid:107) N + CH − ) N . We recover a result from Bach and Moulines(2013, Theorem 1). Note that when C (cid:52) σ H , tr( CH − ) (cid:54) σ d .– For α = β = 1 /L (which corresponds to Nesterov’s acceleration), the ﬁrst bound leadsto L (cid:107) θ − θ ∗ (cid:107) N + tr( CH − ), which is bounded but not converging to zero (as opposed tothe the unstructured noise where the algorithm may diverge).– For α = 1 / ( LN a ) with 0 ≤ a ≤ β = 1 /L , the ﬁrst bound leads to L (cid:107) θ − θ ∗ (cid:107) N − a + tr( CH − ) N a . We thus obtain an explicit bias-variance trade-oﬀ by changing the value of a .– When the values of the bias and the variance are known we can choose α and β withan optimized trade-oﬀ, as the following corrollary shows: Corollary 3.

For α = min (cid:26) (cid:107) θ − θ ∗ (cid:107) √ L tr( CH − ) N , /L (cid:27) and β = min { N α, /L } we have: E f ( θ N ) − f ( θ ∗ ) ≤ max (cid:26) CH − ) N , (cid:112) tr( CH − ) L (cid:107) θ − θ ∗ (cid:107) N , (cid:107) θ − θ ∗ (cid:107) LN (cid:27) . (15)11 .4 Related work Acceleration and noisy gradients.

Several authors (Lan, 2012; Hu et al., 2009; Xiao,2010) have shown that using a step-size proportional to 1 /N / accelerated methods withnoisy gradients lead to the same convergence rate of O (cid:0) L (cid:107) θ − θ ∗ (cid:107) N + (cid:107) θ − θ ∗ (cid:107) √ tr( C ) √ N (cid:1) thanin Corollary 2, for smooth functions. Thus, for unstructured noise, our analysis providesinsights in the behavior of second-order algorithms, without improving bounds. We getsigniﬁcant improvements for structured noises. Least-squares regression.

When the noise is structured as in least-square regression andmore generally in linear supervised learning, Bach and Moulines (2011) have shown thatusing averaged stochastic gradient descent with constant step-size leads to the convergencerate of O (cid:0) L (cid:107) θ − θ (cid:107) N + σ dN (cid:1) . It has been highlighted by D´efossez and Bach (2014) that the biasterm L (cid:107) θ − θ ∗ (cid:107) N may often be the dominant one in practice. Our result in Corollary 3 leadsto an improved bias term in O (1 /N ) with the price of a potentially slightly worse constantin the variance term. However, with optimal constants in Corollary 3, the new algorithmis always an improvement over averaged stochastic gradient descent in all situations. Ifconstants are unknown, we may use α = 1 / ( LN a ) with 0 ≤ a ≤ β = 1 /L and wechoose a depending on the emphasis we want to put on bias or variance. Minimax convergence rates.

For noisy quadratic problems, the convergence rate nicelydecomposes into two terms, a bias term which corresponds to the noiseless problem andthe variance term which corresponds to a problem started at θ ∗ . For each of these twoterms, lower bounds are known. For the bias term, if N ≤ d , then the lower bound is,up to constants, L (cid:107) θ − θ ∗ (cid:107) /N (Nesterov, 2004, Theorem 2.1.7). For the variance term,for the general noisy gradient situation, we show in Appendix H that for N ≤ d , it is(tr C ) / ( L √ N ), while for least-squares regression, it is σ d/N (Tsybakov, 2003). Thus, forthe two situations, we attain the two lower bounds simultaneously for situations whererespectively L (cid:107) θ − θ ∗ (cid:107) ≤ (tr C ) /L and L (cid:107) θ − θ ∗ (cid:107) ≤ dσ . It remains an open problemto achieve the two minimax terms in all situations. Other algorithms as special cases.

We also note as shown in Appendix G that inthe special case of quadratic functions, the algorithms of Lan (2012); Hu et al. (2009);Xiao (2010) could be uniﬁed into our framework (although they have signiﬁcantly diﬀerentformulations and justiﬁcations in the smooth case).

In this section, we illustrate our theoretical results on synthetic examples. We considera matrix H that has random eigenvectors and eigenvalues 1 /k m , for k = 1 , . . . , d and m ∈ N . We take a random optimum θ ∗ and a random starting point θ such that r = (cid:107) θ − θ ∗ (cid:107) = 1 (unless otherwise speciﬁed). In Appendix A, we illustrate the noiseless results12 (n) l og [f ( θ ) − f ( θ * ) ] Structured noisy gradients, d=20

OAOA−atAGDAccGDAC−SASAGEAccRDA 0 1 2 3 4 5−4−20 log (n) l og [f ( θ ) − f ( θ * ) ] Structured noisy gradients, d=20

OAOA−atAGDAccGDAC−SASAGEAccRDA

Figure 3: Quadratic optimization with regression noise. Left σ = 1, r = 1. Right σ = 0 . r = 10.of Section 3, in particular the oscillatory behaviors and the inﬂuence of all eigenvalues,as well as unstructured noisy gradients. In this section, we focus on noisy gradients withstructured noise (as described in Section 4.3), where our new algorithms show signiﬁcantimprovements.We compare our algorithm to other stochastic accelerated algorithms, that is, AC-SA (Lan,2012), SAGE (Hu et al., 2009) and Acc-RDA (Xiao, 2010) which are presented in Ap-pendix G. For all these algorithms (and ours) we take the optimal step-sizes deﬁned inthese papers. We show results averaged over 10 replications. Homoscedastic noise.

We ﬁrst consider an i.i.d. zero mean noise whose covariance ma-trix is proportional to H . We also consider a variant of our algorithm with an any-timestep-size function of n rather than N (for which we currently have no proof of convergence).In Figure 3, we take into account two diﬀerent set-ups. In the left plot, the variance domi-nates the bias (with r = (cid:107) θ − θ ∗ (cid:107) = σ ). We see that (a) Acc-GD does not converge to theoptimum but does not diverge either, (b) Av-GD and our algorithms achieve the optimalrate of convergence of O ( σ d/n ), whereas (c) other accelerated algorithms only converge atrate O (1 / √ n ). In the right plot, the bias dominates the variance ( r = 10 and σ = 0 . Application to least-squares regression.

We now see how these algorithms behave forleast-squares regressions and the regular (non-homoscedastic) stochastic gradients describedin Section 4.3. We consider normally distributed inputs. The covariance matrix H is thesame as before. The outputs are generated from a linear function with homoscedatic noisewith a signal-to-noise ratio of σ . We consider d = 20. We show results averaged over 10replications. In Figure 4, we consider again a situation where the bias dominates (left)and vice versa (right). We see that our algorithm has the same good behavior than in thehomoscedastic noise case and we conjecture that our bounds also hold in this situation.13 (n) l og [f ( θ ) − f ( θ * ) ] Least−Square Regression, d=20

OAOA−atAGDAC−SASAGEAccRDA 0 1 2 3 4 5−4−20 log (n) l og [f ( θ ) − f ( θ * ) ] Least−Square Regression, d=20

OAOA−atAGDAC−SASAGEAccRDA

Figure 4: Least-Square Regression. Left σ = 1, r = 1. Right σ = 0 . r = 10. We have provided a joint analysis of averaging and acceleration for non-strongly-convexquadratic functions in a single framework, both with noiseless and noisy gradients. Thisallows to deﬁne a class of algorithms that can beneﬁt simultaneously of the known improve-ments of averaging and accelerations: faster forgetting of initial conditions (for acceleration),and better robustness to noise when the noise covariance is proportional to the Hessian (foraveraging).Our current analysis of our class of algorithms in Eq. (4), that considers two diﬀerent aﬃnecombinations of previous iterates (instead of one for traditional acceleration), is limitedto quadratic functions; an extension of its analysis to all smooth or self-concordant-likefunctions would widen its applicability. Similarly, an extension to least-squares regressionwith natural heteroscedastic stochastic gradient, as suggested by our simulations, would bean interesting development.

Acknowledgements

This work was partially supported by the MSR-Inria Joint Centre and a grant by theEuropean Research Council (SIERRA project 239993). The authors would like to thankAymeric Dieuleveut for helpful discussions.

References

A. Agarwal, P. L. Bartlett, P. Ravikumar, and M. J. Wainwright. Information-theoreticlower bounds on the oracle complexity of stochastic convex optimization.

InformationTheory, IEEE Transactions on , 58(5):3235–3249, 2012.L. Arnold.

Random dynamical systems . Springer Monographs in Mathematics. Springer-Verlag, 1998. 14. Bach and E. Moulines. Non-Asymptotic Analysis of Stochastic Approximation Algo-rithms for Machine Learning. In

Advances in Neural Information Processing Systems ,2011.F. Bach and E. Moulines. Non-strongly-convex smooth stochastic approximation with con-vergence rate O (1 /n ). In Advances in Neural Information Processing Systems , December2013.A. Beck and M. Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverseproblems.

SIAM J. Imaging Sci. , 2(1):183–202, 2009.A. d’Aspremont. Smooth optimization with approximate gradient.

SIAM J. Optim. , 19(3):1171–1183, 2008.A. D´efossez and F. Bach. Constant step size least-mean-square: Bias-variance trade-oﬀsand optimal sampling distributions. Technical Report 1412.0156, arXiv, 2014.O. Devolder, F. Glineur, and Y. Nesterov. First-order methods of smooth convex optimiza-tion with inexact oracle.

Math. Program. , 146(1-2, Ser. A):37–75, 2014.A. Dieuleveut and F. Bach. Non-parametric Stochastic Approximation with Large Stepsizes. Technical Report 1408.0361, arXiv, August 2014.C. Hu, W. Pan, and J. T. Kwok. Accelerated gradient methods for stochastic optimizationand online learning. In

Advances in Neural Information Processing Systems , 2009.G. Lan. An optimal method for stochastic composite optimization.

Math. Program. , 133(1-2, Ser. A):365–397, 2012.Y. Nesterov. A method of solving a convex programming problem with convergence rate O (1 /k ). Soviet Mathematics Doklady , 27(2):372–376, 1983.Y. Nesterov.

Introductory Lectures on Convex Optimization , volume 87 of

Applied Opti-mization . Kluwer Academic Publishers, Boston, MA, 2004. A basic course.Y. Nesterov. Gradient methods for minimizing composite functions.

Math. Program. , 140(1, Ser. B):125–161, 2013.B. O’Donoghue and E. Candes. Adaptive restart for accelerated gradient schemes.

Foun-dations of Computational Mathematics , pages 1–18, 2013.J. M. Ortega and W. C. Rheinboldt.

Iterative solution of nonlinear equations in severalvariables , volume 30 of

Classics in Applied Mathematics . Society for Industrial andApplied Mathematics (SIAM), Philadelphia, PA, 2000.B. T. Polyak. Some methods of speeding up the convergence of iteration methods. { USSR } Computational Mathematics and Mathematical Physics , 4(5):1–17, 1964.B. T. Polyak.

Introduction to Optimization . Translations Series in Mathematics and Engi-neering. Optimization Software, Inc., Publications Division, New York, 1987.15. T. Polyak and A. B. Juditsky. Acceleration of stochastic approximation by averaging.

SIAM J. Control Optim. , 30(4):838–855, 1992.M. Schmidt, N. Le Roux, and F. Bach. Convergence Rates of Inexact Proximal-GradientMethods for Convex Optimization. In

Advances in Neural Information Processing Sys-tems , December 2011.W. Su, S. Boyd, and E. Candes. A Diﬀerential Equation for Modeling Nesterov’s AcceleratedGradient Method: Theory and Insights. In

Advances in Neural Information ProcessingSystems , 2014.A. B. Tsybakov. Optimal rates of aggregation. In

Proceedings of the Annual Conference onComputational Learning Theory , 2003.L. Xiao. Dual averaging methods for regularized stochastic learning and online optimization.

J. Mach. Learn. Res. , 11:2543–2596, 2010.

A Additional experimental results

In this appendix, we provide additional experimental results to illustrate our theoreticalresults.

A.1 Deterministic convergence

Comparaison for d = 1 . In Figure 5, we minimize a one-dimensional quadratic function f ( θ ) = θ for a ﬁxed step-size α = 1 /

10 and diﬀerent step-sizes β . In the left plot, wecompare Acc-GD, HB and Av-GD. We see that HB and Acc-GD both oscillate and thatAcc-GD leverages strong convexity to converge faster. In the right plot, we compare thebehavior of the algorithm for diﬀerent values of β . We see that the optimal rate is achievedfor β = β ∗ deﬁned to be the one for which there is a double coalescent eigenvalue, wherethe convergence is linear at speed O (1 − √ αL ) n . When β > β ∗ , we are in the real case andwhen β < β ∗ the algorithm oscillates to the solution. Comparison between the diﬀerent eigenspaces.

Figure 6 shows interactions betweendiﬀerent eigenspaces. In the left plot, we optimize a quadratic function of dimension d = 2.The ﬁrst eigenvalue is L = 1 and the second is µ = 2 − . For Av-GD the convergence isof order O (1 /n ) since the problem is “not” strongly convex (i.e., not appearing as stronglyconvex since nµ remains small). The convergence is at the beginning the same for HBand Acc-GD, with oscillation at speed O (1 /n ), since the small eigenvalue prevents Acc-GD from having a linear convergence. Then for large n , the convergence becomes linearfor Acc-GD, since µn becomes large. In the right plot, we optimize a quadratic functionin dimension d = 5 with eigenvalues from 1 to 0 .

1. We show the function values of theprojections of the iterates η n on the diﬀerent eigenspaces. We see that high eigenvalues ﬁrstdominate, but converge quickly to zero, whereas small ones keep oscillating, and convergemore slowly. 16

50 100−30−20−100 n l og [f ( θ ) − f ( θ * ) ] Minimization of f( θ )= θ /2 AccGDHBAvGD l og [f ( θ ) − f ( θ * ) ] Minimization of f( θ )= θ /2 β =1.5 β * β =0.5 β * β = β * Figure 5: Deterministic case for d = 1 and α = 1 /

10. Left: classical algorithms, right:diﬀerent oscillatory behaviors. (n) l og [f ( θ ) − f ( θ * ) ] d=2, hi=i −8 AccGDHBAvGD f ( η i ) − f ( η * ) d=5, α =0.5, β =0.5 h1=0.1h2=0.3h3=0.6h4=0.8h5=1.0 Figure 6: Left: Deterministic quadratic optimization for d = 2. Right: Function value ofthe projection of the iterate on the diﬀerent eigenspaces ( d = 5). (n) l og [f ( θ ) − f ( θ * ) ] d=20, hi=i −2 AccGDHBAvGD (n) l og [f ( θ ) − f ( θ * ) ] d=20, hi=i −8 AccGDHBAvGD

Figure 7: Deterministic case for d = 20 and γ = 1 /

10. Left: m = 2. Right: m = 8. Comparison for d = 20 . In Figure 7, we optimize two 20-dimensional quadratic functionswith diﬀerent eigenvalues with Av-GD, HB and Acc-GD for a ﬁxed step-size γ = 1 /

10. Inthe left plot, the eigenvalues are 1 /k and in the right one, they are 1 /k , for k = 1 , . . . , d .We see that in both cases, Av-GD converges at a rate of O (1 /n ) and HB at a rate of17 (1 /n ). For Acc-GD the convergence is linear when µ is large (left plot) and becomessublinear at a rate of O (1 /n ) when µ becomes small (right plot). A.2 Noisy convergence with unstructured additive noise

We optimize the same quadratic function, but now with noisy gradients. We compare ouralgorithm to other stochastic accelerated algorithms, that is, AC-SA (Lan, 2012), SAGE(Hu et al., 2009) and Acc-RDA (Xiao, 2010), which are presented in Appendix G. For allthese algorithms (and ours) we take the optimal step-sizes deﬁned in these papers. We plotthe results averaged over 10 replications.We consider in Figure 8 an i.i.d. zero mean noise of variance C = I . We see that all theaccelerated algorithms achieve the same precision whereas Av-GD with constant step-sizedoes not converge and Acc-Gd diverges. However SAGE and AC-SA are anytime algorithmsand are faster at the beginning since their step-sizes are decreasing and not a constant (withrespect to n ) function of the horizon N . (n) l og [f ( θ ) − f ( θ * ) ] Noisy gradients, d=20, hi=i OAAvGDAccGDAC−SASAGEAccRDA

Figure 8: Quadratic optimization with additive noise.

B Proofs of Section 2

B.1 Proof of Theorem 1

Let ( P n , Q n , R n ) ∈ ( R [ X ]) for all n ∈ N be a sequence of polynomials. We consider theiterates deﬁned for all n ∈ N ∗ by θ n +1 = P n ( H ) θ n + Q n ( H ) θ n − + R ( H ) q, started from θ = θ ∈ R d . The θ ∗ -stationarity property gives for n ∈ N ∗ : θ ∗ = P n ( H ) θ ∗ + Q n ( H ) θ ∗ + R n ( H ) q. θ ∗ = H − q we get for all q ∈ R d H − q = P n ( H ) H − q + Q n ( H ) H − q + R n ( H ) q. For all ˜ q ∈ R d we apply this relation to vectors q = H ˜ q :˜ q = P n ( H )˜ q + Q n ( H )˜ q + R n ( H ) H ˜ q ∀ ˜ q ∈ R d , and we get I = P n ( H ) + Q n ( H ) + R n ( H ) H ∀ n ∈ N ∗ . Therefore there are polynomials ( ¯ P n , ¯ Q n ) ∈ ( R [ X ]) and q n ∈ R for all n ∈ N ∗ such that wehave for all n ∈ N : P n ( X ) = (1 − q n ) I + X ¯ P n ( X ) Q n ( X ) = q n I + X ¯ Q n ( X ) R n ( X ) = − ( ¯ P n ( X ) + ¯ Q n ( X )) . (16)The n-scalability property means that there are polynomials ( P, Q ) ∈ ( R [ X ]) independentof n such that: P n ( X ) = nn + 1 P ( X ) ,Q n ( X ) = n − n + 1 Q ( X ) . And in connection with Eq. (16) we can rewrite P and Q as: P ( X ) = ¯ p + X ¯ P ( X ) ,Q ( X ) = ¯ q + X ¯ Q ( X ) , with (¯ p, ¯ q ) ∈ R and ( ¯ P , ¯ Q ) ∈ ( R [ X ]) . Thus for all n ∈ N : q n = n − n + 1 ¯ q (17)¯ Q n ( X ) = n − n Q ( X ) nn + 1 ¯ p = (1 − q n ) (18)¯ P n ( X ) = nn + 1 P ( X ) . Eq. (17) and Eq. (18) give: nn + 1 ¯ p = (cid:18) − n − n + 1 ¯ q (cid:19) . Thus for n = 1, we have ¯ p = 2. Then − n − n +1 ¯ q = nn +1 − n − n +1 and ¯ q = −

1. Therefore P n ( H ) = 2 nn + 1 I + nn + 1 ¯ P ( H ) HQ n ( H ) = − n − n I + ¯ Q ( H ) HR n ( H ) = − (cid:18) n ¯ P ( H ) + ( n −

1) ¯ Q ( H ) n + 1 (cid:19) .

19e let ¯ A = − ( ¯ P + ¯ Q ) and ¯ B = ¯ Q so that we have: P n ( H ) = 2 nn + 1 (cid:18) I − (cid:18) ¯ A ( H ) + ¯ B ( H )2 (cid:19) H (cid:19) Q n ( H ) = − n − n (cid:0) I − ¯ B ( H ) H (cid:1) R n ( H ) = (cid:18) n ¯ A ( H ) + ¯ B ( H ) n + 1 (cid:19) , and with φ n = θ n − θ ∗ for all n ∈ N , the algorithm can be written under the form: φ n +1 = (cid:20) I − (cid:18) nn + 1 ¯ A ( H ) + 1 n + 1 ¯ B ( H ) (cid:19) H (cid:21) φ n + (cid:18) − n + 1 (cid:19) (cid:2) I − ¯ B ( H ) H (cid:3) ( φ n − φ n − ) . B.2 Av-GD as two steps-algorithm

We show now that when the averaged iterate of Av-GD is seen as the main iterate we havethat Av-GD with step-size γ ∈ R is equivalent to: θ n +1 = 2 nn + 1 θ n − n − n + 1 θ n − − γn + 1 f (cid:48) (cid:0) nθ n − ( n − θ n − (cid:1) . We remind ψ n +1 = ψ n − γf (cid:48) ( ψ n ) ,θ n +1 = θ n + 1 n + 1 ( ψ n +1 − θ n ) . Thus, we have: θ n +1 = θ n + 1 n + 1 ( ψ n +1 − θ n )= θ n + 1 n + 1 ( ψ n − γf (cid:48) ( ψ n ) − θ n )= θ n + 1 n + 1 ( θ n + ( n − θ n − θ n − ) − γf (cid:48) ( θ n + ( n − θ n − θ n − )) − θ n )= 2 nn + 1 θ n − n − n + 1 θ n − − γn + 1 f (cid:48) ( nθ n − ( n − θ n − ) . C Proof of Section 3

C.1 Proof of Lemma 1

The discriminant ∆ i is strictly positive when (cid:0) α + β (cid:1) h i − α >

0. This is always true for α strictly negative. For α positive and for h i (cid:54) = 0, this is true for | α + β | > (cid:112) α/h i . Thus thediscriminant ∆ i is strictly positive for α < α ≥ (cid:110) β < − α − (cid:112) α/h i or β > − α + 2 (cid:112) α/h i (cid:111) .

20 1 2 3 4012 βh i αh i r + i ≤ r − i ≥ − i > r + i = 1 r − i = − i = 0Figure 9: Stability in the real case, with all contraints plotted.Then we determine when the modulus of the eigenvalues is less than one (which correspondsto − ≤ r − i ≤ r + i ≤ r + i ≤ ⇔ (cid:118)(cid:117)(cid:117)(cid:116) h i (cid:32)(cid:18) α + β (cid:19) h i − α (cid:33) ≤ (cid:18) β + α (cid:19) h i ⇔ h i (cid:32)(cid:18) β + α (cid:19) h i − α (cid:33) ≤ (cid:20)(cid:18) β + α (cid:19) h i (cid:21) and α + β ≥ ⇔ h i α ≥ α + β ≥ ⇔ α ≥ α + β ≥ . Moreover, we have : r − i ≥ − ⇔ (cid:118)(cid:117)(cid:117)(cid:116) h i (cid:32)(cid:18) β + α (cid:19) h i − α (cid:33) ≤ − (cid:18) β + α (cid:19) h i ⇔ h i (cid:32)(cid:18) β + α (cid:19) h i − α (cid:33) ≤ (cid:20) − (cid:18) β + α (cid:19) h i (cid:21) and 2 − (cid:18) β + α (cid:19) h i ≥ ⇔ h i (cid:32)(cid:18) β + α (cid:19) h i − α (cid:33) ≤ − (cid:18) β + α (cid:19) h i + (cid:20)(cid:18) β + α (cid:19) h i (cid:21) and (cid:18) β + α (cid:19) ≤ /h i ⇔ − h i α ≤ − (cid:18) β + α (cid:19) h i and β ≤ /h i − α ⇔ β ≤ /h i − α/ β ≤ /h i − α. Figure 9 (where we plot all the constraints we have so far) enables to conclude that thediscriminant ∆ i is strictly positive and the algorithm is stable when the following three21onditions are satisﬁed: α ≥ α + 2 β ≤ /h i α + β ≥ (cid:112) α/h . For any of those α et β we will have: η in = c ( r − i ) n + c ( r + i ) n . Since η i = 0, c + c = 0 and for n = 1, c = η i / ( r − i − r + i ); we thus have: η in = η i r + i ) n − ( r − i ) n √ ∆ i . Thus, we get the ﬁnal expression:( φ in ) h i = ( φ i ) n (cid:8)(cid:2) r i + √ ∆ i (cid:3) n − (cid:2) r i − √ ∆ i (cid:3) n (cid:9) ∆ i /h i . C.2 Proof of Lemma 2 βh i αh i ∆ < | r ± i | ≤ | r ± i | = 1∆ i = 0Figure 10: Stability in the complex case, with all constraints plotted.The discriminant ∆ i is strictly negative if and only if (cid:0) α + β (cid:1) h i − α <

0. This implies | α + β | < (cid:112) α/h i . The modulus of the eigenvalues is | r ± i | = 1 − βh i . Thus the discriminant∆ i is strictly negative and the algorithm is stable for α, β ≥ α + β < (cid:112) α/h i , as shown in Figure 10. 22or any of those α et β we have: η in = [ c cos( ω i n ) + c sin( ω i n )] ρ ni , with ρ i = √ − βh i , sin( ω i ) = √− ∆ i /ρ i and cos( ω i ) = r i /ρ i . Since η i = 0, c = 0 and wehave for n = 1, c = η i / (sin( ω i ) ρ i ). Therefore η in = η i sin( ω i n ) √− ∆ i (1 − βh i ) n/ , and ( φ in ) h i = ( φ i ) n sin ( ω i n )sin ( ω i ) /h i (1 − βh i ) n − . C.3 Coalescing eigenvalues

When β = 2 (cid:112) α/h i − α , the discriminant ∆ i is equal to zero and we have a double realeigenvalue: r i = 1 − (cid:112) αh i . Thus the algorithm is stable for α < h i . For any of those α et β we have: η in = ( c + nc ) r n . This gives with η i = 0, c = 0 and c = η i /r . Therefore η in = nη i (1 − (cid:112) αh i ) n − , and: ( φ in ) h i = h i ( φ i ) (1 − (cid:112) αh i ) n − . In the presence of coalescing eigenvalues the convergence is linear if 0 < α < /h i and h i >

0, however one might worry about the behavior of (( φ in ) h i ) n when h i becomes small.Using the bound x exp( − x ) ≤ x ≤

1, we have for α < /h i : h i (1 − (cid:112) αh i ) n = h i exp(2 n log( | − (cid:112) αh i | )) ≤ hi exp( − n min { (cid:112) αh i , − (cid:112) αh i } ) ≤ h i min {√ αh i , − √ αh i } ≤ max (cid:26) α , h i (2 − √ αh i ) (cid:27) . Therefore we always have the following bound for α < /h i :( φ in ) h i ≤ ( φ i ) n max (cid:26) α , h i (2 − √ αh i ) (cid:27) . Thus for αh i ≤ φ in ) h i ≤ ( φ i ) n α . Proof of Theorem 2

D.1 Sketch of the proof βh i αh i Lemma 3Figure 11: Validity of Lemma 3 0 1 2 3 4012 βh i αh i Lemma 4Figure 12: Validity of Lemma 40 1 2 3 4012 βh i αh i Lemma 5Figure 13: Validity of Lemma 5 0 1 2 3 4012 βh i αh i Lemma 5Lemma 4Lemma 3Theorem 2Figure 14: Area of Theorem 2We divide the domain of validity of Theorem 2 in three subdomains as explained in Figure14. On the domain described in Figure 11 we have a ﬁrst bound on the iterate η in : Lemma 3.

For ≤ α ≤ /h i and − √ − αh i < βh i < √ − αh i , we have: ( η in ) ≤ ( η i ) αh i . And on the domain described Figure 12 we also have:

Lemma 4.

For ≤ α ≤ /h i and β ≤ α we have: ( η in ) ≤ η i ) αh i . These two lemmas enable us to prove the ﬁrst bound of Theorem 2 since the domain ofthis theorem is included in the intersection of the two domains of these lemmas as shownin Figure 14.Then we have the following bound on domain described in Figure 13:24 emma 5.

For ≤ α ≤ /h i and ≤ β ≤ /h i − α , we have: | η in | ≤ min (cid:40) √ n (cid:112) ( α + β ) h i , α + β ) h i (cid:41) . Since the domain of deﬁnition of Theorem 2 is included in the domain of deﬁnition of Lemma5 (as shown in Figure 14), this lemma proves the last two bounds of the theorem.

D.2 Outline of the proofs of the Lemmas – We ﬁnd a Lyapunov function G from R to R such that the sequence ( G ( η in , η in − ))decrease along the iterates.– We also prove that G ( η in , η in − ) dominates c (cid:107) η in (cid:107) when we want to have a bound on (cid:107) η in (cid:107) of the form c G ( η i , η i ) = c G ( θ i − θ i ∗ , i and take h i = 1 without loss of generality. D.3 Proof of Lemma 3

We ﬁrst consider a quadratic Lyapunov function (cid:18) η n η n − (cid:19) (cid:62) G (cid:18) η n η n − (cid:19) with G = (cid:18) α − α − − α (cid:19) .We note that G is symmetric positive semi-deﬁnite for α ≤

1. We recall F i = (cid:18) − ( α + β ) β −

11 0 (cid:19) .For the result to be true we need for 0 ≤ α ≤ − √ − α < β < √ − α twoproperties: F (cid:62) i G F i (cid:52) G , (19)and α (cid:18) (cid:19) (cid:52) G . (20) Proof of Eq. (20).

We have: G − α (cid:18) (cid:19) = (1 − α ) (cid:18) (cid:19) (cid:60) α ≤ . Proof of Eq. (19).

Since β (cid:55)→ F i ( β ) (cid:62) G F i ( β ) − G is convex in β ( G is symmetricpositive semi-deﬁnite), we only have to show Eq. (19) for the boundaries of the interval in β . For x ∈ R ∗ + : (cid:18) x − x x (cid:19) (cid:62) (cid:18) − x − x x (cid:19) (cid:18) x − x x (cid:19) − (cid:18) − x − x x (cid:19) = − (1 − x ) (cid:18) (cid:19) (cid:52) . This especially shows Eq. (19) for the boundaries of the interval with x = ±√ − α .25 ound. Thus, because η = 0, we have αη n +1 ≤ Θ (cid:62) n G Θ n ≤ Θ (cid:62) n − G Θ n − ≤ Θ (cid:62) G Θ ≤ η . This shows that for 0 ≤ α ≤ /h i and 1 − √ − αh i < βh i < √ − αh i :( η in ) ≤ ( η i ) αh i . D.4 Proof of Lemma 4

We consider now a second Lyapunov function G ( η n , η n − ) = ( η n − rη n − ) − ∆( η n − ) .We have: G ( η n , η n − ) = ( η n − rη n − ) − ∆ η n − = ( rη n − − (1 − β ) η n − ) − ∆ η n − = ( r − ∆) η n − + (1 − β ) η n − − − β ) rη n − η n − = ((1 − β ) η n − + (1 − β )( r − ∆) η n − − − β ) rη n − η n − = (1 − β )[( η n − − rη n − ) − ∆( η n − ) ] . = (1 − β ) G ( η n − , η n − ) . Where we have used twice r − ∆ = (1 − β ) and η n = 2 rη n − − (1 − β ) η n − . Moreover G ( η n , η n − ) can be rewritten as: G ( η n , η n − ) = (1 − α + β η n − η n − ) + α − β η n − ) + α + β η n ) . Thus for α + β ≤ β ≤ α we have: α η n ) ≤ G ( η n , η n − ) = (1 − β ) n − G ( η , η ) = (1 − β ) n − ( η ) . Therefore for α + β ≤ /h i and β ≤ α , we have:( η in ) ≤ η i ) αh i . D.5 Proof of Lemma 5

We may write η n as η n = rη n − + ( r + ) n + ( r − ) n . Moreover, we have: | ( r + ) n + ( r − ) n | ≤ , therefore for α + β ≤ | η n | ≤ r | η n − | + 2 ≤ − r n − r ≤ − (1 − ( α + β )) n ( α + β ) . | η n | ≤ α + β ) h . Moreover for all u ∈ [0 ,

1] and n ≥ − (1 − u ) n ≤ √ nu , since 1 − (1 − u ) n ≤ − (1 − u ) n = u (cid:80) (1 − u ) k ≤ nu . Thus | η n | ≤ √ n (cid:113) ( α + β ) . Therefore for 0 ≤ α ≤ /h i and α + β ≤ /h i we have: | η in | ≤ min (cid:40) √ n (cid:112) ( α + β ) h i , α + β ) h i (cid:41) . E Lower bound

We have the following lower-bound for the bound shown in Corollary 1, which shows thatdepending on which of the two terms dominates, we may always ﬁnd a sequence of functionsthat makes it tight.

Proposition 1.

Let L ≥ . For all sequences ≤ α n ≤ /L and ≤ β n ≤ /L − α n , suchthat α n + β n = o ( nα n ) there exists a sequence of one-dimensional quadratic functions ( f n ) n with second-derivative less than L such that: lim α n n ( f n ( θ n ) − f n ( θ ∗ )) = (cid:107) θ − θ ∗ (cid:107) . For all sequences ≤ α n ≤ /L and ≤ β n ≤ /L − α n , such that nα n = o ( α n + β n ) , thereexists a sequence of one-dimensional quadratic functions ( g n ) n with second-derivative lessthan L such that: lim n ( α n + β n )( g n ( θ n ) − g n ( θ ∗ )) = (1 − exp( − (cid:107) θ − θ ∗ (cid:107) . Proof of the ﬁrst lower-bound.

For the ﬁrst lower bound we consider 0 ≤ α n ≤ /L and 0 ≤ β n ≤ /L − α , such that α n + β n = o ( nα n ). We deﬁne f n = π / (4 α n n ) and weconsider the sequence of quadratic functions f n ( θ ) = f n θ . We consider the iterate ( η n ) n deﬁned by our algorithm. We will show thatlim α n f n ( η n ) = η . We have, from Lemma 2, f n ( η n ) = η n f n η sin ( ω n n ) ρ nn α n (1 − π ( α n + β n ) (4 α n n ) ) . ρ nn = (cid:18) − β n π α n n (cid:19) n = exp (cid:18) n log (cid:18) − β n π α n n (cid:19)(cid:19) = 1 + o (1) , since β n α n n = o (1). Also, 1 − π ( α n + β n ) (4 α n n ) = 1 + o (1), since α n + β n = o ( nα n ). Moreoversin( ω n ) = √− ∆ n ρ n = √ f n (cid:113) α n − ( α n + β n ) f n √ − β n f n = π/ (2 n ) + o (1 /n ) , thus ω n = π/ (2 n ) + o (1 /n ) and sin( nω n ) = 1 + o (1). Proof of the second lower-bound.

We consider now the situation where the secondbound is active. Thus we take sequences ( α n ) and ( β n ), such that nα n = o ( α n + β n ). Wedeﬁne g n = n ( α n + β n ) + α n ( α n + β n ) and consider the sequence of quadratic functions g n ( θ ) = g n θ . We will show for the iterate ( η n ) deﬁned by our algorithm that:lim n ( α n + β n )( g n ( θ n ) − g n ( θ ∗ )) = (1 − exp( − (cid:107) θ − θ ∗ (cid:107) . We will use Lemma 1. We ﬁrst have∆ n = (cid:18) α n + β n (cid:19) g n − α n g n = g n (cid:18) α n + β n (cid:19) n . Thus ( n ∆ n ) /g n = (cid:16) α n + β n (cid:17) and (cid:112) ∆ n = (cid:115)(cid:18) n (cid:19) + 2 α n n ( α n + β n )= 1 n (cid:114) α n nα n + β n = 1 n + α n α n + β n + o (cid:18) α n α n + β n (cid:19) . Moreover r n = 1 − α n + β n g n = 1 − n − α n α n + β n . Thus r + = 1 − α n α n + β n + o (cid:18) α n α n + β n (cid:19) , and r n + = exp( n log( r + )) = exp (cid:18) − nα n α n + β n (cid:19) + o (cid:18) nα n α n + β n (cid:19) = 1 + o (1) . Furthermore r − = 1 − n − α n α n + β n + o (cid:18) α n α n + β n (cid:19) , r n − = exp( n log( r + )) = exp (cid:18) − − α n nα n + β n (cid:19) + o (cid:18) nα n α n + β n (cid:19) = exp( −

2) + o (1) . Thus ( r n + − r n − ) = (1 − exp( − + o (1) . Finally, we have:( α n + β n ) n [ g n ( θ n ) − g n ( θ ∗ )] = α n + β n n (cid:107) θ − θ ∗ (cid:107) [ r n + − r n − ] n /g n = (cid:107) θ − θ ∗ (cid:107) r n + − r n − ] = (cid:107) θ − θ ∗ (cid:107) − exp( − + o (1) . F Proofs of Section 4

F.1 Proofs of Theorem 3 and Theorem 4

We decompose again vectors in an eigenvector basis of H with η in = p (cid:62) i η n and ε in = p (cid:62) i ε n : η in +1 = (1 − αh i ) η in + (1 − βh i )( η in − η in − ) + ( nα + β ) ε in +1 . We denote by ξ in +1 = (cid:18) [ nα + β ] ε in +1 (cid:19) and we have the reduced equation:Θ in +1 = F i Θ in + ξ in +1 . Unfortunately F i is not Hermitian and this formulation will not be convenient for calculus.Without loss of generality, we assume r − i (cid:54) = r + i even if it means having r − i − r + i goes to 0 inthe ﬁnal bound. Let Q i = (cid:18) r − i r + i (cid:19) be the transfer matrix of F i , i.e., F i = Q i D i Q − i with D i = (cid:18) r − i r + i (cid:19) and Q − i = r − i − r + i (cid:18) − r + i − r − i (cid:19) . We can reparametrize the problem in thefollowing way: Q − i Θ in +1 = Q − i F i Θ in + Q − i ξ in +1 = Q − i F i Q i Q − i Θ in + Q − i ξ in +1 = D i ( Q − i Θ in ) + Q − i ξ in +1 . With ˜Θ in = Q − i Θ in and ˜ ξ in = Q − i ξ in we now have:˜Θ in +1 = D i ˜Θ in + ˜ ξ in +1 , (21)with now D i Hermitian (even diagonal). 29hus it is easier to tackle using standard techniques for stochastic approximation (see, e.g.,Polyak and Juditsky, 1992; Bach and Moulines, 2011):˜Θ in = D ni ˜Θ i + n (cid:88) k =1 D n − ki ˜ ξ ik . Let M i = (cid:32) h / i h / i (cid:33) , we then get using standard martingale square moment inequalities,since for n (cid:54) = m , ε in and ε im are uncorrelated (i.e., E [ ε in ε im ] = 0): E (cid:107) M i ˜Θ in (cid:107) = (cid:107) M i D ni ˜Θ i (cid:107) + E n (cid:88) k =1 (cid:107) M i D n − ki ˜ ξ ik (cid:107) . This is a bias-variance decomposition; the left term only depends on the initial conditionand the right term only depends on the noise process.We have with M i = (cid:32) h / i h / i (cid:33) , M i Q − i = (cid:32) h / i (cid:33) , and M i ˜Θ in = (cid:18) √ h i η in (cid:19) . Thus,we have access to the function values through: (cid:107) M i ˜Θ in (cid:107) = h i ( η in ) . Moreover we have Θ i = (cid:18) φ i / ( r − i − r + i ) − φ i / ( r − i − r + i ) (cid:19) . Thus (cid:107) M i D ni ˜Θ i (cid:107) = ( φ i ) h i (cid:0) ( r + i ) n − ( r − i ) n (cid:1) ( r + i − r − i ) . This is the bias term we have studied in Section 3.3 which we bound with Theorem 2. Thevariance term is controlled by the next proposition.

Proposition 2.

With E [( ε in ) ] = c i for all n ∈ N , for α ≤ /h i and ≤ β ≤ /h i − α , wehave n E n (cid:88) k =1 (cid:107) M i D n − ki ˜ ξ ik (cid:107) ≤ min (cid:26) αn + β ) αβ (4 − ( α + 2 β ) h i ) n c i h i , nα + β ) n ( α + β ) c i h i , αn + β ) nα c i , nα + β ) α + β c i (cid:27) . The last two bounds prove Theorem 3.We note that if we restrict β to β ≤ / (2 h i ) − α/

2, then 4 − ( α + 2 β ) h i ≥ αn + β ) αβn c i h i . This allows to conclude to proveTheorem 4. 30 .2 Proof of Corollary 3 We let ν = (cid:107) θ − θ ∗ (cid:107) √ L tr( CH − ) and consider three diﬀerent regimes depending on ν and L .If ν < /L , we have ν/N < /L and thus α = ν/N and β = ν . Therefore (cid:107) θ − θ ∗ (cid:107) N α + ( αN + β ) αβN tr( CH − ) = (cid:107) θ − θ ∗ (cid:107) νN + 4 tr( CH − ) N ≤ (cid:112) L tr( CH − ) (cid:107) θ − θ ∗ (cid:107) N + 4 tr( CH − ) N ≤ CH − ) N , where we have used √ L (cid:107) θ − θ ∗ (cid:107) < (cid:112) tr( CH − ) since ν < /L .If ν > /L and ν < N/L , we have α = ν/N and β = 1 /L . Therefore (cid:107) θ − θ ∗ (cid:107) N α + ( αN + β ) αβN tr( CH − ) ≤ (cid:107) θ − θ ∗ (cid:107) νN + 4 tr( CH − ) LνN ≤ (cid:112) L tr( CH − ) (cid:107) θ − θ ∗ (cid:107) N + 4 tr( CH − ) N ≤ (cid:112) L tr( CH − ) (cid:107) θ − θ ∗ (cid:107) N , where we have used √ L (cid:107) θ − θ ∗ (cid:107) > (cid:112) tr( CH − ) since ν > /L .If ν > N/L , we have α = 1 /L and β = 1 /L . Therefore (cid:107) θ − θ ∗ (cid:107) N α + ( α ( N −

1) + β ) αβN tr( CH − ) = L (cid:107) θ − θ ∗ (cid:107) N + tr( CH − ) ≤ L (cid:107) θ − θ ∗ (cid:107) N + L (cid:107) θ − θ ∗ (cid:107) N ≤ L (cid:107) θ − θ ∗ (cid:107) N , where we have used that the real bound in Proposition 2 is in fact in ( N − α + β , (seeLemma 6) and that tr( CH − ) < L (cid:107) θ − θ ∗ (cid:107) N since ν > N/L . F.3 Proof of Proposition 2

F.3.1 Proof outline

To prove Proposition 2 we will use Lemmas 6, 7 and 8, that are stated and proved inSection F.3.2.We want to bound E [ (cid:80) nk =1 (cid:107) M i D n − ki ˜ ξ ik (cid:107) ] and according to Lemma 6, we have an explicitexpansion using the roots of the characteristic polynomial: E (cid:107) M i D ki ˜ ξ ik (cid:107) = h i (( k − α + β ) E [( ε i ) ] [( r − i ) n − k − ( r + i ) n − k ] ( r − i − r + i ) . k − α + β by ( n − α + β , we get E n (cid:88) k =1 (cid:107) M i D n − ki ˜ ξ ik (cid:107) ≤ h i (( n − α + β ) E [ ε i ] n (cid:88) k =1 [( r − i ) n − k − ( r + i ) n − k ] ( r − i − r + i ) . (22)Then, we have from Lemma 7 the inequality: n (cid:88) k =1 [( r − i ) k − ( r + i ) k ] [( r − i ) − ( r + i )] ≤ − βh i αβh i (1 − ( α + β ) h i ) . Therefore E n (cid:88) k =1 (cid:107) M / i D n − ki ˜ ξ ik (cid:107) ≤ E [ ε i ] h i (( n − α + β ) αβ − βh i (1 − ( α + β ) h i ) . This allows to prove the ﬁrst part of the bound. The other parts are much simpler and aredone in Lemma 8. Thus, adding these bounds gives for α ≤ /h i and 0 ≤ β ≤ /h i − α :1 n E n (cid:88) k =1 (cid:107) M i D n − ki ˜ ξ ik (cid:107) ≤ min (cid:26) α ( n −

1) + β ) αβn (4 − ( α + 2 β ) h i ) ch i , n − α + β ) n ( α + β ) ch i , α ( n −

1) + β ) nα c i , n − α + β ) α + β c i (cid:27) . F.3.2 Some technical Lemmas

We ﬁrst compute an explicit expansion of the noise term as a function of the eigenvalues ofthe dynamical system.

Lemma 6.

For all α ≤ /h i and ≤ β ≤ /h i − α we have E (cid:107) M i D ki ˜ ξ ik (cid:107) = h i (( k − α + β ) E [( ε i ) ] [( r − i ) n − k − ( r + i ) n − k ] ( r − i − r + i ) . Proof.

We ﬁrst turn the Euclidean norm into a trace, using that tr[ AB ] = tr[ BA ] for twomatrices A and B and that tr[ x ] = x for a real x . E (cid:107) M i D n − ki ˜ ξ ik (cid:107) = Tr D n − ki M i (cid:62) M i D n − ki E [ ˜ ξ ik ( ˜ ξ ik ) (cid:62) ] , (23)This enables us to separate the noise term from the rest of the formula. Then we computethe latter from the deﬁnition of ˜ ξ ik in Eq. (21) : E [ ˜ ξ ik ( ˜ ξ ik ) (cid:62) ] = (( k − α + β ) ( r − i − r + i ) E [( ε i ) ] (cid:18) − − (cid:19) . And the ﬁrst part of Eq. (23) is equal to: D n − ki M i (cid:62) M i D n − ki = h i (cid:32) ( r − i ) n − k ) ( r − i ) ( n − k ) − ( r + i ) ( n − k ) ( r − i ) ( n − k ) − ( r + i ) ( n − k ) ( r + i ) n − k ) (cid:33) , D i = (cid:18) r − i r + i (cid:19) and M i = (cid:32) h / i h / i (cid:33) . Therefore: E (cid:107) M i D n − ki ˜ ξ ik (cid:107) = h i (( k − α + β ) ( r − i − r + i ) E [ ε i ][( r − i ) n − k − ( r + i ) n − k ] . In the following leamma, we bound a certain sum of powers of the roots.

Lemma 7.

For all α ≤ /h i and ≤ β ≤ /h i − α we have n (cid:88) k =1 (cid:2) ( r − i ) k − ( r + i ) k ] [( r − i ) − ( r + i ) (cid:3) ≤ − βh i αβh i (1 − ( α + β ) h i ) . We ﬁrst note that when the two roots become close, the denominator and the numeratorwill go to zero, which prevents from bounding the numerator easily. We also note that thisbound is very tight since the diﬀerence between the two terms goes to zero when n goes toinﬁnity. Proof.

We ﬁrst expand the square of the diﬀerence of the powers of the roots and computetheir sums. n (cid:88) k =1 (cid:2) ( r − i ) k − ( r + i ) k (cid:3) = n (cid:88) k =1 (cid:2) r + i k + r − i k − r + i r − i ) k (cid:3) = 1 − r + i n − r + i + 1 − r − i n − r − i − − ( r + i r − i ) n − ( r + i r − i )= 11 − r + i + 11 − r − i − − ( r + i r − i ) − (cid:20) r + i n − r + i + r − i n − r − i − r + i r − i ) n − ( r + i r − i ) (cid:21) = 11 − r + i + 11 − r − i − − ( r + i r − i ) − I n , with I n = (cid:20) r + i n − r + i + r − i n − r − i − ( r + i r − i ) n − ( r + i r − i ) (cid:21) .This sum is therefore equal to the sum of one term we will compute explicitly and one otherterm which will go to zero. We have for the ﬁrst term:11 − r + i + 11 − r − i − − ( r + i r − i ) = (1 − r − i )(1 − ( r + i r − i )) − (1 − r − i )(1 − r + i )(1 − r + i )(1 − r − i )(1 − ( r + i r − i ))+ (1 − r + i )(1 − ( r + i r − i )) − (1 − r − i )(1 − r + i )(1 − r + i )(1 − r − i )(1 − ( r + i r − i )) , − r − i )(1 − ( r + i r − i )) − (1 − r − i )(1 − r + i ) = (1 − r − i )[(1 − ( r + i r − i )) − (1 − r + i )]= r + i (1 − r − i )( r + i − r − i ) , and (1 − r + i )(1 − ( r + i r − i )) − (1 − r − i )(1 − r + i ) = − r − i (1 − r − i )( r + i − r − i ) , and r + i (1 − r − i )( r + i − r − i ) − r − i (1 − r − i )( r + i − r − i ) = ( r + i − r − i )[ r + i (1 − r − i ) − r − i (1 − r − i )]= ( r + i − r − i )[ r + i − r − i + r + i r − i ( r + i − r − i )]= ( r + i − r − i ) [1 + r + i r − i ] . Therefore the ﬁrst term is equal to:11 − r + i + 11 − r − i − − ( r + i r − i ) = ( r + i − r − i ) [1 + r + i r − i ](1 − r + i )(1 − r − i )(1 − ( r + i r − i )) , and the sum can be expanded as: n (cid:88) k =1 [( r − i ) k − ( r + i ) k ] [ r − i − r + i ] = [1 + r + i r − i ](1 − r + i )(1 − r − i )(1 − ( r + i r − i )) − J n , with J n = I n [( r − i ) − ( r + i )] .Then we simplify the ﬁrst term of this sum using the explicit values of the roots. We recall r ± i = r i ± √ ∆ i = 1 − α + β h i ± (cid:114)(cid:16) α + β (cid:17) h i − αh i , therefore r + i r − i = r i − ∆ i = (cid:18) − (cid:18) α + β (cid:19) h i (cid:19) − (cid:20)(cid:18) α + β (cid:19) h i (cid:21) + αh i = 1 − βh i , and(1 − r + i )(1 − r − i ) = [(1 − r − i )(1 − r + i )][(1 + r + i )(1 + r + i )]= [(1 − r i + (cid:112) ∆ i )((1 − r i − (cid:112) ∆ i )][(1 + r i + (cid:112) ∆ i )((1 + r i − (cid:112) ∆ i )]= [(1 − r i ) − ∆ i ][(1 + r i ) − ∆ i ][(1 − r i ) − ∆ i ]= 4 αh i (cid:18) − (cid:18) α + 12 β (cid:19) h i (cid:19) . Thus n (cid:88) k =1 [( r − i ) k − ( r + i ) k ] [( r − i ) − ( r + i )] = 2 − βh i αβh i (1 − ( α + β ) h i ) − J n . J n will be asymptotically small, we want a non-asymptotic bound, thus we willshow that J n is always positive.In the real case [( r − i ) − ( r + i )] ≥ a + b ≥ ab , for all ( a, b ) ∈ R , we have r + i n − r + i + r − i n − r − i ≥ r + i r − i ) n (cid:113) (1 − r + i )(1 − r − i ) , and using r + i + r − i ≥ r − i r − i we have (cid:113) (1 − r + i )(1 − r − i ) ≤ − ( r + i r − i ) , since(1 − r + i )(1 − r − i ) − [1 − ( r + i r − i )] = 1 − r + i − r − i + ( r + i r − i ) − r + i r − i − ( r + i r − i ) = 2 r + i r − i − r + i − r − i ≤ . Thus r + i n − r + i + r − i n − r − i − r + i r − i ) n − ( r + i r − i ) ≥ . and J n ≥ r − i ) − ( r + i )] ≤

0, and using z + ¯ z ≤ z ¯ z for all z ∈ C , we have r + i n − r + i + r − i n − r − i ≤ r + i r − i ) n (cid:113) (1 − r + i )(1 − r − i ) , and using r + i + r − i ≤ r − i r − i we have (cid:113) (1 − r + i )(1 − r − i ) ≥ − ( r + i r − i ) . Thus r + i n − r + i + r − i n − r − i − r + i r − i ) n − ( r + i r − i ) ≤ . and J n ≥ J n ≥ , and n (cid:88) k =1 [( r − i ) k − ( r + i ) k ] [( r − i ) − ( r + i )] ≤ − βh i αβh i (1 − ( α + β ) h i ) . η in = [( r − i ) n − k − ( r + i ) n − k ] ( r − i − r + i ) . This gives us the following lemma which enables to prove thesecond part of Proposition 2. Lemma 8.

For all α ≤ /h i and ≤ β ≤ /h i − α we have E n (cid:88) k =1 (cid:107) M / i D n − ki ˜ ξ ik (cid:107) ≤ E [( ε i ) ] n (( n − α + β ) min (cid:26) α , nα + β , h i ( α + β ) (cid:27) . Proof.

From Lemma 6, we get E n (cid:88) k =1 (cid:107) M / i D n − ki ˜ ξ ik (cid:107) = h i E [( ε i ) ] n (cid:88) k =1 (( k − α + β ) [( r − i ) n − k − ( r + i ) n − k ] ( r − i − r + i ) ≤ h i E [( ε i ) ](( n − α + β ) n min (cid:26) αh i , n ( α + β ) h i , α + β ) h i (cid:27) ≤ E [( ε i ) ] n (( n − α + β ) min (cid:26) α , nα + β , h i ( α + β ) (cid:27) . G Comparison with additional other algorithms

G.1 Summary

When the objective function f is quadratic and for correct choices of step-sizes, the AC-SAalgorithm of Lan (2012), the SAGE algorithm of Hu et al. (2009) and the Accelerated RDAalgorithm of Xiao (2010) are all equivalent to: θ n +1 = [ I − δ n +1 H n +1 ] θ n + n − n + 1 [ I − δ n +1 H n +1 ]( θ n − θ n − ) + δ n +1 ε n +1 , where we use H n θ + ε n as an unbiased estimate of the gradient and δ n as step-size whichvalues will be speciﬁed later.Lan (2012) and Hu et al. (2009) only consider bounded cases by projecting their iterateson a bounded space. Xiao (2010) deals with the unbounded case and prove the followingconvergence result: Theorem 5. (Xiao, 2010, Theorem 6). With E [ ε n ⊗ ε n ] = C , for step-size δ n ≤ n − n γ with γ ≤ /L , we have E f ( θ n ) − f ( θ ∗ ) ≤ (cid:107) θ − θ ∗ (cid:107) n γ + nγσ tr C . This result is signiﬁcantly more general than ours since it is valid for composite optimizationand general noise on the gradients.We now present the diﬀerent algorithms and show they all share the same form.36 .2 AC-SA

Lemma 9.

AC-SA algorithm with step size γ n and β n and gradient estimate H n +1 θ n + ε n +1 is equivalent to: θ n +1 = ( I − γ n β n H n +1 ) θ n + β n − − β n ( I − γ n β n H n +1 )( θ n − θ n − ) + γ n β n ε n +1 . Proof.

We recall the general

AC-SA algorithm : • Let the initial points x ag = x , and the step-sizes { β n } n ≤ and { γ n } n ≤ be given.Set n = 1 • Step 1 . Set x mdn = β − n x n + (1 − β − n ) x agn , • Step 2 . Call the Oracle for computing G ( x mdn , ξ n ) where E [ G ( x mdn , ξ n )] = f (cid:48) ( x mdn ).Set x n +1 = x n − γ n G ( x mdn , ξ n ) ,x agn +1 = β − n x n +1 + (1 − β − n ) x agn , • Step 3 . Set n → n + 1 and go to step 1.When f is quadratic we will have G ( x mdn , ξ n ) = H n +1 x mdn − ε n +1 , thus x n +1 = x n − γ n H n +1 x mdn + γ n ε n +1 , and: x agn +1 = β − n x n +1 + (1 − β − n ) x agn = β − n ( x n − γ n H n +1 x mdn + γ n ε n +1 ) + (1 − β − n ) x agn = β − n ( β n x mdn + (1 − β n ) x agn − γ n H n +1 x mdn + γ n ε n +1 ) + (1 − β − n ) x agn = x mdn − γ n β n H n +1 x mdn + γ n β n ε n +1 , and x mdn = β − n x n + (1 − β − n ) x agn = β − n β n − x agn + β − n (1 − β n − ) x agn − + (1 − β − n ) x agn = x agn + β n − − β n [ x agn − x agn − ] . These give the result for θ n = x agn . G.3 SAGE

Lemma 10.

The algorithm SAGE with step-sizes L n and α n is equivalent to: θ n +1 = ( I − /L n +1 H n +1 ) θ n + (1 − α n ) α n +1 α n [ I − /L n +1 H n +1 ]( θ n − θ n − ) + 1 /L n +1 ε n +1 . roof. We recall the general

SAGE algorithm : • Let the initial points x = z = 0, and the step-sizes { β n } n ≤ and { L n } n ≤ be given.Set n = 1 • Step 1 . Set x n = (1 − α n ) y n − + α n z n − , • Step 2 . Call the Oracle for computing G ( x n , ξ n ) where E [ G ( x n , ξ n )] = f (cid:48) ( x n ). Set y n = x n − /L n G ( x n , ξ n ) ,z n = z n − − α − n ( x n − y n ) • Step 3 . Set n → n + 1 and go to step 1.We have y n = ( I − /L n H n ) x n + γ n ε n , and z n = z n − − α − n ( x n − y n )= z n − − α − n [(1 − α n ) y n − + α n z n − − y n ]= α − n y n − α − n (1 − α n ) y n − . Thus x n = (1 − α n ) y n − + α n z n − = (1 − α n ) y n − + α n [ α − n − y n − − α − n − (1 − α n − ) y n − ]= y n − + (1 − α n − ) α n α n − [ y n − − y n − ] . These give the result for θ n = y n . G.4 Accelerated RDA method

Lemma 11.

The algorithm AccRDA with step-sizes β and α n is equivalent to: θ n +1 = ( I − γ n +1 H n +1 ) θ n + (1 − α n ) α n +1 α n [ I − γ n +1 H n +1 ]( θ n − θ n − ) + γ n +1 ε n +1 , with γ n = α n θ n L + β .Proof. We recall the general

Accelerated RDA method : • Let the initial points w = v , A = 0, ˜ g = 0 and the step-sizes { α n } n ≤ and { β n } n ≤ be given.Set n = 1 38 Step 1 . Set A n = A n − + α n and θ n = α n A n . • Step 2 . Compute the query point u n = (1 − θ n ) w n − + θ n v n − • Step 3 . Call the Oracle for computing g n = G ( u n , ξ n ) where E [ G ( u n , ξ n )] = f (cid:48) ( u n ),and update the weighted average ˜ g n ˜ g n = (1 − θ n )˜ g n − + θ n g n . • Step 4 . Set v n = v − A n L + β n ˜ g n . • Step 5 . Set w n = (1 − θ n ) w n − + θ n v n . • Step 6 . Set n → n + 1 and go to step 1.First we have v n = v − A n L + β n ˜ g n = v − A n L + β n [(1 − θ n )˜ g n − + θ n g n ]= v − A n L + β n [(1 − θ n )˜ g n − + θ n ( H n +1 u n + ε n +1 )]= v + (1 − θ n ) A n ( L + β n − )( L + β n ) A n − v n − − A n L + β n θ n ( H n +1 u n + ε n +1 )]= v + (1 − θ n ) A n ( L + β n − )( L + β n ) A n − v n − − α n L + β n ( H n +1 u n + ε n +1 )] . With β n = β we have v n = v n − − α n L + β ( H n +1 u n + ε n +1 )] and w n = ( I − α n θ n L + β H n +1 ) u n + α n θ n L + β ε n +1 . Since v n − = θ − n − w n − − θ − n − (1 − θ n − ) w n − , then u n = (1 − θ n ) w n − + θ n ( θ − n − w n − − θ − n − (1 − θ n − ) w n − ) , and u n = w n − + α n A n − α n − A n [ w n − − w n − ] . H Lower bound for stochastic optimization for least-squares

In this section, we show a lower bound for optimization of quadratic functions with noisyaccess to gradients. We follow very closely the framework of Agarwal et al. (2012) and use39heir notations. The only diﬀerence with their Theorem 1 in the diﬀerent choice of twofunctions f + i and f − i , which we choose to be: f ± i ( x ) = c i ( x i ± r , with a non-increasing sequence ( c i ) to be chosen later. The function g α that is optimizedis thus: g α ( x ) = 1 d d (cid:88) i =1 (cid:8) ( 12 + α i δ ) f + i ( x ) + ( 12 − α i δ ) f − i ( x ) (cid:9) . This function is quadratic and its Hessian has eigenvalues equal to 2 c i /d . Thus, its largesteigenvalue is 2 c /d , which we choose equal to L .Noisy gradients are obtained by sampling d independent Bernoulli random variables b i , i = 1 , . . . , d , with parameters ( + α i δ ) and using the gradient of the random function d (cid:80) di =1 (cid:8) b i f + i ( x ) + (1 − b i ) f − i ( x ) (cid:9) . The variance of the random gradient is equal to V = d (cid:88) i =1 d var (cid:16) b i (cid:2) c i ( x i + r/ − c i ( x i − r/ (cid:3)(cid:17) = 1 d d (cid:88) i =1 c i r (1 / − δ ) . The function g α is minimized for x = − αδr , and the discrepancy measure between twofunctions g α and g β is greater than1 d d (cid:88) i =1 (cid:26) inf x (cid:8) f + i ( x )+ f − i ( x ) (cid:9) − inf x f + i ( x ) − inf x f − i ( x ) (cid:27) α i (cid:54) = β i (cid:62) d d (cid:88) i =1 c i r δ α i (cid:54) = β i (cid:62) d c d r δ α, β ) . Since the vectors α, β ∈ {− , } d are so that their Hamming distance ∆( α, β ) (cid:62) d/ α (cid:54) = β , we have a discrepancy measure greater than c d r δ . Thus, for a an approxi-mate optimality of ε = c d r δ , we have, following the proof of Theorem 1 (equation (29))from Agarwal et al. (2012), for N iterations of any method that accesses a random gradient,we have: 1 / (cid:62) − N dδ + log 2 d log(2 / √ e ) . Thus, for d large, we get, up to constants, δ (cid:62) /N and thus ε (cid:62) r c d N .For c = 2 Ld and c i = L √ d for the remaining ones, we get (up to constants): ε (cid:62) VL √ dN . This leads to the desired result for N (cid:54) dd