A Continuized View on Nesterov Acceleration
Raphaël Berthier, Francis Bach, Nicolas Flammarion, Pierre Gaillard, Adrien Taylor
Inria, Département d’informatique de l’ENS, PSL Research University, Paris, France
School of Computer and Communication Sciences, École Polytechnique Fédérale de Lausanne
Univ. Grenoble Alpes, Inria, CNRS, Grenoble INP, LJK, 38000 Grenoble, France
Abstract.
We introduce the “continuized” Nesterov acceleration, a close variant of Nesterov acceleration whose variables are indexed by a continuous time parameter. The two variables continuously mix following a linear ordinary differential equation and take gradient steps at random times. This continuized variant benefits from the best of the continuous and the discrete frameworks: as a continuous process, one can use differential calculus to analyze convergence and obtain analytical expressions for the parameters; but a discretization of the continuized process can be computed exactly, with convergence rates similar to those of Nesterov's original acceleration. We show that the discretization has the same structure as Nesterov acceleration, but with random parameters.
Introduction

In the last decades, the emergence of numerous applications in statistics, machine learning and signal processing has led to a renewed interest in first-order optimization methods (Bottou et al., 2018). They enjoy a low computational complexity, which is necessary for the analysis of large datasets. The performance of first-order methods was largely improved thanks to acceleration techniques (see the review by d’Aspremont et al., 2021, and the many references therein), starting with the seminal work of Nesterov (1983).

Let $f : \mathbb{R}^d \to \mathbb{R}$ be a convex and differentiable function, minimized at $x_* \in \mathbb{R}^d$. We assume throughout the paper that $f$ is $L$-smooth, i.e.,
$\forall x, y \in \mathbb{R}^d, \quad f(y) \leq f(x) + \langle \nabla f(x), y - x \rangle + \frac{L}{2} \|y - x\|^2 \, .$
In addition, we sometimes assume that $f$ is $\mu$-strongly convex for some $\mu > 0$, i.e.,
$\forall x, y \in \mathbb{R}^d, \quad f(y) \geq f(x) + \langle \nabla f(x), y - x \rangle + \frac{\mu}{2} \|y - x\|^2 \, .$

For the problem of minimizing $f$, gradient descent is well known to achieve a rate $f(x_k) - f(x_*) = O(k^{-1})$ in the smooth case, and a rate $f(x_k) - f(x_*) = O((1 - \mu/L)^k)$ in the smooth and strongly convex case. In both cases, Nesterov introduced an alternative method with essentially the same running-time complexity that achieves faster rates: it converges at the rate $O(k^{-2})$ in the smooth convex case and at the rate $O((1 - \sqrt{\mu/L})^k)$ in the smooth and strongly convex case (Nesterov, 2003). These rates are optimal among all methods that access gradients and linearly combine them (Nesterov, 2003; Nemirovskij and Yudin, 1983).

Nesterov acceleration introduces several sequences of iterates—two or three, depending on the formulation—and relies on a clever blend of gradient steps and mixing steps between the iterates. Many works contributed to interpret and motivate the precise structure of the iteration that leads to the success of the method, see for instance (Bubeck et al., 2015; Flammarion and Bach, 2015; Arjevani et al., 2016; Kim and Fessler, 2016; Allen-Zhu and Orecchia, 2017). A large number of these works found it useful to study continuous-time equivalents of Nesterov acceleration, obtained by taking the limit when stepsizes vanish, or from a variational framework. The continuous time index $t$ of the limit allows the use of differential calculus to study the convergence of these equivalents. For examples of studies that use continuous time, see (Su et al., 2014; Krichene et al., 2015; Wilson et al., 2016; Wibisono et al., 2016; Betancourt et al., 2018; Diakonikolas and Orecchia, 2019; Shi et al., 2018, 2019; Attouch et al., 2018, 2019; Zhang et al., 2018; Siegel, 2019; Muehlebach and Jordan, 2019; Sanz-Serna and Zygalakis, 2020).

In this paper, we propose another way to obtain a continuous-time equivalent of Nesterov acceleration, which we call the continuized version of Nesterov acceleration, and which does not have vanishing stepsizes. It is built by considering two variables $x_t, z_t \in \mathbb{R}^d$, $t \in \mathbb{R}_{\geq 0}$, that continuously mix following a linear ordinary differential equation and take gradient steps at random times $T_1, T_2, T_3, \dots$. Thus, in this modeling, mixing and gradient steps alternate randomly. Thanks to the continuous index $t$ and some stochastic calculus, one can differentiate averaged quantities (expectations) with respect to $t$.
In particular, this leads to simple analytical expressions for the optimal parameters as functions of $t$, while the optimal parameters of Nesterov acceleration are defined by recurrence relations that are complicated to solve.

The discretization $\tilde{x}_k = x_{T_k}$, $\tilde{z}_k = z_{T_k}$, $k \in \mathbb{N}$, of the continuized process can be computed directly and exactly: the result is a recursion of the same form as Nesterov's iteration, but with randomized parameters, that performs similarly to Nesterov's original deterministic version both in theory and in simulations.

There are particular situations where Nesterov acceleration cannot be implemented while the continuized acceleration can. First, a major advantage of the continuized acceleration over Nesterov acceleration is that the parameters of the algorithm depend only on the time $t \in \mathbb{R}_{\geq 0}$, and not on the number of past gradient steps $k$. This is useful in distributed implementations, where the total number of gradient steps taken in the network may not be known to a particular node. Second, the continuized modeling can be relevant when gradient steps arrive at random times, as in asynchronous parallel computing for instance. Gossip algorithms are another example where both features are present: the total number of past communications in the network at a given time is unknown to all nodes, and communications between nodes occur at random times. This motivated Even et al. (2020) to consider a similar continuized procedure, for communication steps instead of gradient steps, in order to accelerate gossip algorithms; their work is the source of inspiration for the present paper.

Beyond these particular situations, the continuized acceleration should be seen as a close approximation of Nesterov acceleration that features both an insightful and convenient expression as a continuous-time process and a direct implementation as a discrete iteration. We thus hope to contribute to the understanding of Nesterov acceleration. We believe that the continuized framework can be adapted to various settings and extensions of Nesterov acceleration; as an illustration of this statement, we study how the continuized acceleration behaves in the presence of additive noise on the gradients.
Notations. The index $k$ always denotes a non-negative integer, while the indices $t, s$ always denote non-negative reals.
Structure of the paper. In Section 2, we recall gradient descent and Nesterov acceleration, its choice of parameters, and its convergence rate as a function of the number of iterations $k$. In Section 3, we introduce our continuized variant of Nesterov acceleration, its choice of parameters and its convergence rate as functions of $t$. In Section 4, we show that the discretization of the continuized acceleration leads to an iteration with the same structure as Nesterov acceleration, with random parameters. We give the expressions of the parameters and the convergence rate in terms of the number of iterations $k$. Finally, in Section 5, we study the robustness of the continuized acceleration to additive noise.
Reminders on gradient descent and Nesterov acceleration

For the sake of comparison, let us first recall classical results of convex optimization. Consider the iterates of gradient descent with stepsize $\gamma$,
$x_{k+1} = x_k - \gamma \nabla f(x_k) \, .$
We have the following convergence of the function values $f(x_k)$, depending on whether the function $f$ is (1) convex, or (2) strongly convex.

Theorem 1 (Convergence of gradient descent). Choose the stepsize $\gamma = 1/L$.
(1) Then
$f(x_k) - f(x_*) \leq \frac{2L \|x_0 - x_*\|^2}{k+4} \, .$
(2) Assume further that $f$ is $\mu$-strongly convex, $\mu > 0$. Then
$f(x_k) - f(x_*) \leq \frac{L}{2} \left(1 - \frac{\mu}{L}\right)^k \|x_0 - x_*\|^2 \, .$

These results (or similar bounds) can be found in many places in the literature; for instance the first bound is in (Nesterov, 2003, Corollary 2.1.2) and the second bound is a simple consequence of (Nesterov, 2003, Theorem 2.1.15). See also the recent book of Nesterov (2018).

To accelerate these rates of convergence, Nesterov introduced iterations with three sequences, parametrized by $\tau_k, \tau'_k, \gamma_k, \gamma'_k$, $k \geq 0$, of the form
$y_k = x_k + \tau_k (z_k - x_k) \, ,$   (1)
$x_{k+1} = y_k - \gamma_k \nabla f(y_k) \, ,$   (2)
$z_{k+1} = z_k + \tau'_k (y_k - z_k) - \gamma'_k \nabla f(y_k) \, .$   (3)
Depending on whether the function $f$ is known to be (1) simply convex, or (2) strongly convex with a known strong convexity parameter, Nesterov gave choices of parameters leading to accelerated convergence rates.

Theorem 2 (Convergence of accelerated gradient descent). (1) Choose the parameters $\tau_k = 1 - \frac{A_k}{A_{k+1}}$, $\tau'_k = 0$, $\gamma_k = \frac{1}{L}$, $\gamma'_k = \frac{A_{k+1} - A_k}{L}$, $k \geq 0$, where the sequence $A_k$, $k \geq 0$, is defined by the recurrence relation $A_0 = 0$, $A_{k+1} = A_k + \frac{1}{2}\big(1 + \sqrt{4 A_k + 1}\big)$. Then
$f(x_k) - f(x_*) \leq \frac{2L \|x_0 - x_*\|^2}{k^2} \, .$
(2) Assume further that $f$ is $\mu$-strongly convex, $\mu > 0$. Choose the constant parameters $\tau_k \equiv \frac{\sqrt{\mu/L}}{1 + \sqrt{\mu/L}}$, $\tau'_k \equiv \sqrt{\mu/L}$, $\gamma_k \equiv \frac{1}{L}$, $\gamma'_k \equiv \frac{1}{\sqrt{\mu L}}$, $k \geq 0$. Then
$f(x_k) - f(x_*) \leq \left( f(x_0) - f(x_*) + \frac{\mu}{2} \|z_0 - x_*\|^2 \right) \left(1 - \sqrt{\frac{\mu}{L}}\right)^k \, .$

This result, in this exact form, is proven by d’Aspremont et al. (2021, Sections 4.4.1 and 4.5.3). From a high-level perspective, Nesterov acceleration iterates over several variables, alternating between gradient steps (always with respect to the gradient at $y_k$) and mixing steps, where the running value of a variable is replaced by a linear combination of the other variables. However, the precise way gradient and mixing steps are coupled is rather mysterious, and the success of the proof of Theorem 2 relies heavily on the detailed structure of the iterations. In the next section, we try to gain perspective on this structure by developing a continuized version of the acceleration.
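Before moving on, here is a minimal sketch of the iteration (1)-(3) with the convex-case parameters of Theorem 2.1, assuming only a generic gradient oracle; the quadratic objective in the usage lines is a placeholder for illustration, not the function used in the experiments below.

import numpy as np

def nesterov_convex(grad, x0, L, n_steps):
    """Sketch of iteration (1)-(3) with the parameters of Theorem 2.1:
    tau_k = 1 - A_k / A_{k+1}, tau'_k = 0, gamma_k = 1/L,
    gamma'_k = (A_{k+1} - A_k) / L, with A_0 = 0 and
    A_{k+1} = A_k + (1 + sqrt(4 A_k + 1)) / 2."""
    x, z = x0.copy(), x0.copy()
    A = 0.0
    for _ in range(n_steps):
        A_next = A + 0.5 * (1.0 + np.sqrt(4.0 * A + 1.0))
        tau = 1.0 - A / A_next
        y = x + tau * (z - x)            # (1), mixing step
        g = grad(y)
        x = y - g / L                    # (2), gradient step
        z = z - (A_next - A) / L * g     # (3), with tau'_k = 0
        A = A_next
    return x

# Placeholder usage on f(x) = 0.5 * ||x||^2, so that L = 1 and x_* = 0.
x_out = nesterov_convex(lambda x: x, np.ones(5), L=1.0, n_steps=100)
print(np.linalg.norm(x_out))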
Continuized version of Nesterov acceleration

In this section and the following ones, we use several mathematical notions related to random processes. It should be possible to understand the paper with only a heuristic understanding of these notions; the rigorous definitions are provided in Appendix A.

We argue that the accelerated iteration becomes more natural if we consider two variables $x_t, z_t$ indexed by a continuous time $t \geq 0$, that are continuously mixing and that take gradient steps at random times. More precisely, let $T_1, T_2, T_3, \dots \geq 0$ be random times such that $T_1, T_2 - T_1, T_3 - T_2, \dots$ are independent identically distributed (i.i.d.), of law exponential with rate $1$ (any constant rate would do, but we choose $1$ to make the comparison with the discrete time $k$ straightforward). By convention, we choose our stochastic processes $t \mapsto x_t$, $t \mapsto z_t$ to be càdlàg almost surely, i.e., right continuous with well-defined left limits $x_{t-}, z_{t-}$ (see Definition 5 in Appendix A). Our dynamics are parametrized by functions $\gamma_t, \gamma'_t, \eta_t, \eta'_t$, $t \geq 0$. At the random times $T_1, T_2, \dots$, our sequences take gradient steps
$x_{T_k} = x_{T_k-} - \gamma_{T_k} \nabla f(x_{T_k-}) \, ,$   (4)
$z_{T_k} = z_{T_k-} - \gamma'_{T_k} \nabla f(x_{T_k-}) \, .$   (5)
Because of the memoryless property of the exponential distribution, in an infinitesimal time interval $[t, t+\mathrm{d}t]$, the variables take gradient steps with probability $\mathrm{d}t$, independently of the past. Between these random times, the variables mix through a linear ordinary differential equation (ODE)
$\mathrm{d}x_t = \eta_t (z_t - x_t) \, \mathrm{d}t \, ,$   (6)
$\mathrm{d}z_t = \eta'_t (x_t - z_t) \, \mathrm{d}t \, .$   (7)
Following the notation of stochastic calculus, we can write the process more compactly in terms of the Poisson point measure $\mathrm{d}N(t) = \sum_{k \geq 1} \delta_{T_k}(\mathrm{d}t)$, whose intensity is the Lebesgue measure $\mathrm{d}t$:
$\mathrm{d}x_t = \eta_t (z_t - x_t) \, \mathrm{d}t - \gamma_t \nabla f(x_t) \, \mathrm{d}N(t) \, ,$   (8)
$\mathrm{d}z_t = \eta'_t (x_t - z_t) \, \mathrm{d}t - \gamma'_t \nabla f(x_t) \, \mathrm{d}N(t) \, .$   (9)

Before giving convergence guarantees for such processes, let us digress quickly on why one can expect an iteration of this form to be mathematically appealing.

First, to a Markov chain indexed by a discrete time index $k$, one can associate the so-called continuized Markov chain, indexed by a continuous time $t$, that makes transitions with the same Markov kernel, but at random times, with independent exponential time intervals (Aldous and Fill, 2002). Following this terminology, we refer to our acceleration (8)-(9) as the continuized acceleration. The continuized Markov chain is appreciated for its continuous time parameter $t$, while keeping many properties of the original Markov chain; similarly, the continuized acceleration is arguably simpler to analyze, while performing similarly to Nesterov acceleration.

Second, it is also interesting to compare with coordinate gradient descent methods, which are easier to analyze when coordinates are selected randomly rather than in an ordered way (Wright, 2015). Similarly, the continuized acceleration is simpler to analyze because the gradient steps (4)-(5) and the mixing steps (6)-(7) alternate randomly, due to the randomness of $T_1, T_2, \dots$

In analogy with Theorem 2, we give choices of parameters that lead to accelerated convergence rates, in the convex case (1) and in the strongly convex case (2). Convergence is analyzed as a function of $t$. As $\mathrm{d}N(t)$ is a Poisson point process with rate $1$, $t$ is the expected number of gradient steps taken by the algorithm up to time $t$. Thus $t$ is analogous to $k$ in Theorem 2.
Theorem 3 (Convergence of continuized Nesterov acceleration). (1) Choose the parameters $\eta_t = \frac{2}{t}$, $\eta'_t = 0$, $\gamma_t = \frac{1}{L}$, $\gamma'_t = \frac{t}{2L}$. Then
$\mathbb{E} f(x_t) - f(x_*) \leq \frac{2L \|z_0 - x_*\|^2}{t^2} \, .$
(2) Assume further that $f$ is $\mu$-strongly convex, $\mu > 0$. Choose the constant parameters $\eta_t = \eta'_t \equiv \sqrt{\mu/L}$, $\gamma_t \equiv \frac{1}{L}$, $\gamma'_t \equiv \frac{1}{\sqrt{\mu L}}$. Then
$\mathbb{E} f(x_t) - f(x_*) \leq \left( f(x_0) - f(x_*) + \frac{\mu}{2} \|z_0 - x_*\|^2 \right) \exp\left( - \sqrt{\frac{\mu}{L}} \, t \right) \, .$
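Although not stated explicitly in the text, a direct consequence of part (2) is worth noting: writing $\phi_0 = f(x_0) - f(x_*) + \frac{\mu}{2}\|z_0 - x_*\|^2$, the expected suboptimality is below a target accuracy $\varepsilon > 0$ as soon as
$t \geq \sqrt{\frac{L}{\mu}} \, \log\left(\frac{\phi_0}{\varepsilon}\right) \, ,$
i.e., the expected number of gradient steps to reach accuracy $\varepsilon$ scales as $\sqrt{L/\mu}\,\log(1/\varepsilon)$, the accelerated complexity, instead of the $(L/\mu)\log(1/\varepsilon)$ scaling of gradient descent (Theorem 1.2).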
Sketch of proof. A complete and rigorous proof is given in Appendix B.1. Here, we only provide a heuristic for the main lines of the proof.

The proof is similar to the one of Nesterov acceleration: we prove that for some choices of parameters $\eta_t, \eta'_t, \gamma_t, \gamma'_t$, $t \geq 0$, and for some functions $A_t, B_t$, $t \geq 0$,
$\phi_t = A_t \left( f(x_t) - f(x_*) \right) + B_t \|z_t - x_*\|^2$
is a supermartingale. In particular, this implies that $\mathbb{E}\phi_t$ is a Lyapunov function, i.e., a non-increasing function of $t$.

To prove that $\phi_t$ is a supermartingale, it is sufficient to prove that for all infinitesimal time intervals $[t, t+\mathrm{d}t]$, $\mathbb{E}_t \phi_{t+\mathrm{d}t} \leq \phi_t$, where $\mathbb{E}_t$ denotes the conditional expectation given all the past of the Poisson process up to time $t$. Thus we would like to compute the first-order variation of $\mathbb{E}_t \phi_{t+\mathrm{d}t}$, and in particular the first-order variation of $\mathbb{E}_t f(x_{t+\mathrm{d}t})$.

From (8), we see that $f(x_t)$ evolves for two reasons between $t$ and $t+\mathrm{d}t$:
• $x_t$ follows the linear ODE (6), which results in the infinitesimal variation $f(x_t) \to f(x_t) + \eta_t \langle \nabla f(x_t), z_t - x_t \rangle \mathrm{d}t$, and
• with probability $\mathrm{d}t$, $x_t$ takes a gradient step, which results in the macroscopic variation $f(x_t) \to f(x_t - \gamma_t \nabla f(x_t))$.
Combining both variations, we obtain
$\mathbb{E}_t f(x_{t+\mathrm{d}t}) \approx f(x_t) + \eta_t \langle \nabla f(x_t), z_t - x_t \rangle \mathrm{d}t + \mathrm{d}t \left( f(x_t - \gamma_t \nabla f(x_t)) - f(x_t) \right) \, ,$
where the $\mathrm{d}t$ in the second term corresponds to the probability that a gradient step happens; note that the latter event is independent of the past up to time $t$.

A similar computation can be done for $\mathbb{E}_t \|z_{t+\mathrm{d}t} - x_*\|^2$. Putting things together, we obtain
$\mathbb{E}_t \phi_{t+\mathrm{d}t} - \phi_t \approx \mathrm{d}t \, \Big( \frac{\mathrm{d}A_t}{\mathrm{d}t} (f(x_t) - f(x_*)) + A_t \eta_t \langle \nabla f(x_t), z_t - x_t \rangle + A_t \big( f(x_t - \gamma_t \nabla f(x_t)) - f(x_t) \big)$
$\qquad + \frac{\mathrm{d}B_t}{\mathrm{d}t} \|z_t - x_*\|^2 + 2 B_t \eta'_t \langle z_t - x_*, x_t - z_t \rangle + B_t \big( \|z_t - \gamma'_t \nabla f(x_t) - x_*\|^2 - \|z_t - x_*\|^2 \big) \Big) \, .$

Using convexity and strong convexity inequalities, and a few computations, we obtain the following upper bound:
$\mathbb{E}_t \phi_{t+\mathrm{d}t} - \phi_t \lesssim \mathrm{d}t \, \Big( \Big( \frac{\mathrm{d}A_t}{\mathrm{d}t} - A_t \eta_t \Big) \langle \nabla f(x_t), x_t - x_* \rangle + \Big( \frac{\mathrm{d}B_t}{\mathrm{d}t} - B_t \eta'_t \Big) \|z_t - x_*\|^2$
$\qquad + \big( A_t \eta_t - 2 B_t \gamma'_t \big) \langle \nabla f(x_t), z_t - x_* \rangle + \Big( B_t \eta'_t - \frac{\mathrm{d}A_t}{\mathrm{d}t} \frac{\mu}{2} \Big) \|x_t - x_*\|^2$
$\qquad + \Big( B_t {\gamma'_t}^2 - A_t \gamma_t \Big( 1 - \frac{L \gamma_t}{2} \Big) \Big) \|\nabla f(x_t)\|^2 \Big) \, .$

We want this infinitesimal variation to be non-positive. Here, we choose the parameters so that $\gamma_t = 1/L$ and all prefactors in the above expression are zero. This gives some constraints on the choices of parameters. We show that only one degree of freedom is left: the choice of the function $A_t$, which must satisfy the ODE
$\frac{\mathrm{d}^2}{\mathrm{d}t^2} \left( \sqrt{A_t} \right) = \frac{\mu}{4L} \sqrt{A_t} \, ,$
but whose initialization remains free. Once the initialization of the function $A_t$ is chosen, this determines the full function $A_t$ and, through the constraints, all parameters of the algorithm. As $\phi_t$ is a supermartingale (by design), a bound on the performance of the algorithm is given by
$\mathbb{E} f(x_t) - f(x_*) \leq \frac{\mathbb{E} \phi_t}{A_t} \leq \frac{\phi_0}{A_t} \, .$
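For concreteness, the explicit solutions of these constraints worked out in Appendix B.1, which yield the two parameter choices of Theorem 3, can be summarized as follows (this only restates the appendix, it is not an additional result). In the convex case ($\mu = 0$), taking $A_0 = 0$ and $B_0 = 1$ gives
$A_t = \frac{t^2}{2L} \, , \quad B_t \equiv 1 \, , \quad \text{hence} \quad \eta_t = \frac{2}{t} \, , \quad \eta'_t = 0 \, , \quad \gamma_t = \frac{1}{L} \, , \quad \gamma'_t = \frac{t}{2L} \, ,$
and $\phi_0 / A_t = 2L \|z_0 - x_*\|^2 / t^2$, which is the bound of Theorem 3.1. In the strongly convex case ($\mu > 0$), taking the exponential solution gives
$A_t = A_0 \exp\Big( \sqrt{\tfrac{\mu}{L}} \, t \Big) \, , \quad B_t = \frac{\mu}{2} A_t \, , \quad \text{hence} \quad \eta_t = \eta'_t \equiv \sqrt{\tfrac{\mu}{L}} \, , \quad \gamma_t \equiv \frac{1}{L} \, , \quad \gamma'_t \equiv \frac{1}{\sqrt{\mu L}} \, ,$
and $\phi_0 / A_t$ gives the exponential bound of Theorem 3.2.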
The results presented in Theorem 3 correspond to one special choice of initialization for the function $A_t$.

In this sketch of proof, our derivation of the infinitesimal variation is intuitive and elementary; however, it can be made more rigorous and concise—albeit more technical—using classical results from stochastic calculus, namely Proposition 2. This is our approach in Appendix B.1. □

Many authors have proposed continuous-time equivalents in order to better understand Nesterov acceleration using differential calculus, see the numerous references in the introduction. For instance, in the seminal work of Su et al. (2014), the equivalent is obtained from Nesterov acceleration by taking the joint asymptotic where the stepsizes vanish and the number of iterates is rescaled. The resulting limit is an ODE that must be discretized to be implemented; choosing the right discretization is not straightforward, as it introduces stability and approximation errors that must be controlled, see (Zhang et al., 2018; Shi et al., 2019; Sanz-Serna and Zygalakis, 2020).

On the contrary, our continuous-time equivalent (8)-(9) does not correspond to a limit where the stepsizes vanish. However, in Appendix D, we check that the continuized acceleration has the same ODE scaling limit as Nesterov acceleration. This sanity check emphasizes that the continuized acceleration is not fundamentally different from previous continuous-time equivalents.
Figure 1. Comparison between gradient descent, Nesterov acceleration, and the continuized version of Nesterov acceleration, on a convex function (left) and a strongly convex function (right). Both panels show the function-value gap $f(x_k) - f(x_*)$ as a function of the number of gradient steps $k$. For the continuized acceleration, which is randomized, the results shown correspond to a single run. (Results were stable across runs.)

Discrete implementation of the continuized acceleration with random parameters
In this section, we show that the continuized acceleration can be implemented exactly as a discrete algorithm. Denote
$\tilde{x}_k = x_{T_k} \, , \qquad \tilde{y}_k = x_{T_{k+1}-} \, , \qquad \tilde{z}_k = z_{T_k} \, .$
The three sequences $\tilde{x}_k, \tilde{y}_k, \tilde{z}_k$, $k \geq 0$, satisfy a recurrence relation with the same structure as Nesterov acceleration, but with random weights.

Theorem 4 (Discrete version of the continuized acceleration). For any stochastic process of the form (8)-(9), we have
$\tilde{y}_k = \tilde{x}_k + \tau_k (\tilde{z}_k - \tilde{x}_k) \, ,$   (10)
$\tilde{x}_{k+1} = \tilde{y}_k - \tilde{\gamma}_k \nabla f(\tilde{y}_k) \, ,$   (11)
$\tilde{z}_{k+1} = \tilde{z}_k + \tau'_k (\tilde{y}_k - \tilde{z}_k) - \tilde{\gamma}'_k \nabla f(\tilde{y}_k) \, ,$   (12)
for some random parameters $\tau_k, \tau'_k, \tilde{\gamma}_k, \tilde{\gamma}'_k$ (that are functions of $T_k$, $T_{k+1}$, and of the functions $\eta_t, \eta'_t, \gamma_t, \gamma'_t$).
(1) For the parameters of Theorem 3.1, $\tau_k = 1 - \left( \frac{T_k}{T_{k+1}} \right)^2$, $\tau'_k = 0$, $\tilde{\gamma}_k = \frac{1}{L}$, and $\tilde{\gamma}'_k = \frac{T_{k+1}}{2L}$.
(2) For the parameters of Theorem 3.2, $\tau_k = \frac{1}{2} \left( 1 - \exp\left( -2\sqrt{\frac{\mu}{L}} (T_{k+1} - T_k) \right) \right)$, $\tau'_k = \tanh\left( \sqrt{\frac{\mu}{L}} (T_{k+1} - T_k) \right)$, $\tilde{\gamma}_k = \frac{1}{L}$, and $\tilde{\gamma}'_k = \frac{1}{\sqrt{\mu L}}$.

This theorem is proved in Appendix C.

In Figure 1, we compare this continuized Nesterov acceleration (10)-(12) with the classical Nesterov acceleration (1)-(3) and with gradient descent. In the strongly convex case (right), we run the algorithms with the parameters of Theorems 2.2 and 4.2 on a three-dimensional quadratic function with curvatures $\mu$, $3\mu$ and $L$, with $\mu$ much smaller than $L = 1$. In the convex case, we run the algorithms with the parameters of Theorems 2.1 and 4.1 on a poorly conditioned convex quadratic function whose strong convexity parameter is negligible. All iterations are initialized at $x_0 = z_0 = 0$.

In order to have a straightforward theoretical comparison with Nesterov acceleration, we describe the performance $f(\tilde{x}_k) - f(x_*) = f(x_{T_k}) - f(x_*)$ of the continuized acceleration in terms of the number $k$ of gradient operations.

Theorem 5 (Convergence of the discretized version). The discrete implementation (10)-(12), with random weights, of the continuized acceleration satisfies:
(1) For the parameters of Theorem 4.1,
$\mathbb{E}\left[ T_k^2 \left( f(\tilde{x}_k) - f(x_*) \right) \right] \leq 2L \|z_0 - x_*\|^2 \, .$
(2) Assume further that $f$ is $\mu$-strongly convex, $\mu > 0$. For the parameters of Theorem 4.2,
$\mathbb{E}\left[ \exp\left( \sqrt{\frac{\mu}{L}} \, T_k \right) \left( f(\tilde{x}_k) - f(x_*) \right) \right] \leq f(x_0) - f(x_*) + \frac{\mu}{2} \|z_0 - x_*\|^2 \, .$

This theorem is proved in Appendix B.1. The law of $T_k$ is well known: it is the law of a sum of $k$ i.i.d. random variables of exponential law with rate $1$; this is called an Erlang or Gamma distribution with shape parameter $k$ and rate $1$. One can use well-known properties of this law, such as its concentration around its expectation $\mathbb{E} T_k = k$, to derive corollaries of Theorem 5.
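As an illustration, here is a minimal sketch (not the authors' code) of the discrete implementation of Theorem 4 in the strongly convex case; the toy quadratic objective and the values of mu and L in the usage lines are placeholders chosen for the example.

import numpy as np

def continuized_nesterov_strongly_convex(grad, x0, mu, L, n_steps, rng):
    """Sketch of the discrete implementation (10)-(12) with the random
    parameters of Theorem 4.2 (strongly convex case)."""
    x, z = x0.copy(), x0.copy()
    rho = np.sqrt(mu / L)
    for _ in range(n_steps):
        dT = rng.exponential(1.0)                      # T_{k+1} - T_k ~ Exp(1)
        tau = 0.5 * (1.0 - np.exp(-2.0 * rho * dT))    # random mixing weight for y
        tau_p = np.tanh(rho * dT)                      # random mixing weight for z
        y = x + tau * (z - x)                          # (10)
        g = grad(y)
        x = y - g / L                                  # (11), gamma_k = 1/L
        z = z + tau_p * (y - z) - g / np.sqrt(mu * L)  # (12), gamma'_k = 1/sqrt(mu L)
    return x

# Placeholder usage on a quadratic f(x) = 0.5 * sum(h_i * x_i^2), so that x_* = 0.
rng = np.random.default_rng(0)
h = np.array([1e-2, 3e-2, 1.0])                        # placeholder curvatures in [mu, L]
grad = lambda x: h * x
x_final = continuized_nesterov_strongly_convex(grad, np.ones(3), mu=1e-2, L=1.0,
                                               n_steps=200, rng=rng)
print(float(0.5 * np.sum(h * x_final ** 2)))           # f(x_k) - f(x_*)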
Robustness of the continuized Nesterov acceleration to additive noise

We now investigate how the continuized version of Nesterov acceleration performs under stochastic noise. We should emphasize that a similar study has been done on Nesterov acceleration directly (Lan, 2012; Hu et al., 2009; Xiao, 2010; Devolder, 2011; Cohen et al., 2018; Aybat et al., 2020). However, in the continuized framework, the randomness of the stochastic gradients and the randomness of the gradient times mix in a particularly convenient way.

We assume that we do not have direct access to the gradient $\nabla f(x)$ but to a random estimate $\nabla f(x, \xi)$, where $\xi \in \Xi$ is random of law $P$. We assume that our estimate is unbiased, i.e.,
$\forall x \in \mathbb{R}^d, \quad \mathbb{E}_\xi \nabla f(x, \xi) = \nabla f(x) \, ,$   (13)
and has a uniformly bounded variance, i.e., there exists $\sigma^2 \geq 0$ such that
$\forall x \in \mathbb{R}^d, \quad \mathbb{E}_\xi \|\nabla f(x, \xi) - \nabla f(x)\|^2 \leq \sigma^2 \, .$   (14)
These assumptions typically hold in the additive noise model, where $\nabla f(x, \xi) = \nabla f(x) + \xi$ with $\xi \in \mathbb{R}^d$ satisfying $\mathbb{E}\xi = 0$ and $\mathbb{E}\|\xi\|^2 \leq \sigma^2$. By an abuse of terminology, we say that our stochastic gradients have “additive noise” when (13) and (14) hold.

We keep the same algorithms, replacing gradients by stochastic gradients. Let $\xi_1, \xi_2, \dots$ be i.i.d. random variables of law $P$. We take stochastic gradient steps at the random times $T_1, T_2, \dots$,
$x_{T_k} = x_{T_k-} - \gamma_{T_k} \nabla f(x_{T_k-}, \xi_k) \, , \qquad z_{T_k} = z_{T_k-} - \gamma'_{T_k} \nabla f(x_{T_k-}, \xi_k) \, .$
Between these random times, the variables mix through the same ODE
$\mathrm{d}x_t = \eta_t (z_t - x_t) \, \mathrm{d}t \, , \qquad \mathrm{d}z_t = \eta'_t (x_t - z_t) \, \mathrm{d}t \, .$
This can be written more compactly in terms of the Poisson point measure $\mathrm{d}N(t, \xi) = \sum_{k \geq 1} \delta_{(T_k, \xi_k)}(\mathrm{d}t, \mathrm{d}\xi)$ on $\mathbb{R}_{\geq 0} \times \Xi$, which has intensity $\mathrm{d}t \otimes P$:
$\mathrm{d}x_t = \eta_t (z_t - x_t) \, \mathrm{d}t - \gamma_t \int_\Xi \nabla f(x_t, \xi) \, \mathrm{d}N(t, \xi) \, ,$   (15)
$\mathrm{d}z_t = \eta'_t (x_t - z_t) \, \mathrm{d}t - \gamma'_t \int_\Xi \nabla f(x_t, \xi) \, \mathrm{d}N(t, \xi) \, .$   (16)

Figure 2. Effect of additive noise on gradient descent, Nesterov acceleration, and the continuized version of Nesterov acceleration, on a convex function (left) and a strongly convex function (right). Both panels show the function-value gap $f(x_k) - f(x_*)$ as a function of the number of gradient steps $k$. The results shown correspond to a single run. (Results were stable across runs.)
Theorem 6 (Continuized acceleration with noise). Assume that the stochastic gradients are unbiased (13) and have a variance uniformly bounded by $\sigma^2$ (14). Then the continuized acceleration (15)-(16) satisfies the following.
(1) For the parameters of Theorem 3.1,
$\mathbb{E} f(x_t) - f(x_*) \leq \frac{2L \|z_0 - x_*\|^2}{t^2} + \frac{\sigma^2 t}{3L} \, .$
(2) Assume further that $f$ is $\mu$-strongly convex, $\mu > 0$. For the parameters of Theorem 3.2,
$\mathbb{E} f(x_t) - f(x_*) \leq \left( f(x_0) - f(x_*) + \frac{\mu}{2} \|z_0 - x_*\|^2 \right) \exp\left( - \sqrt{\frac{\mu}{L}} \, t \right) + \frac{\sigma^2}{\sqrt{\mu L}} \, .$

This theorem is proved in Appendix B.2.

In the above bounds, $L$ is a parameter of the algorithm that can be taken greater than the best known smoothness constant of the function $f$. Increasing $L$ reduces the stepsizes of the algorithm and performs some variance reduction. If the bound $\sigma^2$ on the variance is known, one can choose $L$ by optimizing the above bounds, in order to obtain algorithms that adapt to additive noise.

In Figure 2, we run the same simulations as in Figure 1, with two differences: (1) we add isotropic Gaussian noise on the gradients, with covariance proportional to the identity, and (2) we initialize the algorithms at the optimum, i.e., $x_0 = z_0 = x_*$. Initializing at the optimum enables us to isolate the effect of the additive noise only. These simulations confirm Theorem 6: the noise term is (sub-)linearly increasing in the convex case and constant in the strongly convex case.

Note that, similarly to Theorem 5, one could obtain convergence bounds for the discrete implementation in the presence of additive noise.
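As a small illustration of the additive-noise model (13)-(14), here is a sketch of a noisy gradient oracle plugged into the discrete implementation sketched after Theorem 5; the Gaussian noise level and the reuse of that routine are assumptions made for the example, not the setup used for Figure 2.

import numpy as np

# Additive-noise oracle: grad(x) + xi with E[xi] = 0 and E||xi||^2 <= sigma^2,
# so that assumptions (13) and (14) hold.
def make_noisy_grad(grad, sigma, dim, rng):
    def noisy_grad(x):
        xi = rng.normal(scale=sigma / np.sqrt(dim), size=dim)  # isotropic Gaussian noise
        return grad(x) + xi
    return noisy_grad

# Placeholder usage: initialize at the optimum x_* = 0 of a toy quadratic,
# so that the residual error reflects the noise term of Theorem 6(2).
rng = np.random.default_rng(1)
h = np.array([1e-2, 3e-2, 1.0])
noisy_grad = make_noisy_grad(lambda x: h * x, sigma=1e-2, dim=3, rng=rng)
x_out = continuized_nesterov_strongly_convex(noisy_grad, np.zeros(3), mu=1e-2, L=1.0,
                                             n_steps=500, rng=rng)
print(float(0.5 * np.sum(h * x_out ** 2)))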
Conclusion

In this work, we introduced a continuized version of Nesterov's accelerated gradient method. In a nutshell, the method has two sequences of iterates, from which gradient steps are taken at random times. In between gradient steps, the two sequences mix following a simple ordinary differential equation, whose parameters are picked to ensure good convergence properties of the method.

As compared to other continuous-time models of Nesterov acceleration, a key feature of this approach is that the method can be implemented without any approximation step, as the differential equation governing the mixing procedure has a simple analytical solution. When discretized, the continuized method corresponds to an accelerated gradient method with random parameters.

Continuization strategies were introduced in the context of Markov chains (Aldous and Fill, 2002). Here, they allow using acceleration mechanisms in asynchronous distributed optimization, where agents are usually not aware of the total number of iterations taken so far, as showcased in the context of asynchronous gossip algorithms by Even et al. (2020). Possible future research directions include extending to constrained and non-Euclidean settings.
Acknowledgements
This work was funded in part by the French government under the management of Agence Nationale de la Recherche as part of the “Investissements d’avenir” program, reference ANR-19-P3IA-0001 (PRAIRIE 3IA Institute). We also acknowledge support from the European Research Council (grant SEQUOIA 724063) and from the DGA.
References
Aldous, D. and Fill, J. A. (2002). Reversible Markov chains and random walks on graphs. Unfinished monograph, recompiled 2014.
Allen-Zhu, Z. and Orecchia, L. (2017). Linear Coupling: An Ultimate Unification of Gradient and Mirror Descent. In Proceedings of the 8th Innovations in Theoretical Computer Science, ITCS ’17.
Arjevani, Y., Shalev-Shwartz, S., and Shamir, O. (2016). On lower and upper bounds in smooth and strongly convex optimization. Journal of Machine Learning Research, 17(126):1–51.
Attouch, H., Chbani, Z., Peypouquet, J., and Redont, P. (2018). Fast convergence of inertial dynamics and algorithms with asymptotic vanishing viscosity. Mathematical Programming, 168(1):123–175.
Attouch, H., Chbani, Z., and Riahi, H. (2019). Rate of convergence of the Nesterov accelerated gradient method in the subcritical case α ≤ 3. ESAIM: Control, Optimisation and Calculus of Variations, 25:2.
Aybat, N. S., Fallah, A., Gurbuzbalaban, M., and Ozdaglar, A. (2020). Robust accelerated gradient methods for smooth strongly convex functions. SIAM Journal on Optimization, 30(1):717–751.
Betancourt, M., Jordan, M., and Wilson, A. (2018). On symplectic optimization. arXiv preprint arXiv:1802.03653.
Bottou, L., Curtis, F. E., and Nocedal, J. (2018). Optimization methods for large-scale machine learning. SIAM Review, 60(2):223–311.
Bubeck, S., Lee, Y. T., and Singh, M. (2015). A geometric alternative to Nesterov’s accelerated gradient descent. arXiv preprint arXiv:1506.08187.
Cohen, M., Diakonikolas, J., and Orecchia, L. (2018). On acceleration with noise-corrupted gradients. In Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 1019–1028. PMLR.
d’Aspremont, A., Scieur, D., and Taylor, A. (2021). Acceleration methods.
Devolder, O. (2011). Stochastic first order methods in smooth convex optimization. Technical report, CORE.
Diakonikolas, J. and Orecchia, L. (2019). The approximate duality gap technique: A unified theory of first-order methods. SIAM Journal on Optimization, 29(1):660–689.
Even, M., Hendrikx, H., and Massoulié, L. (2020). Asynchrony and acceleration in gossip algorithms. arXiv preprint arXiv:2011.02379.
Flammarion, N. and Bach, F. (2015). From averaging to acceleration, there is only a step-size. In Conference on Learning Theory, pages 658–695. PMLR.
Hu, C., Pan, W., and Kwok, J. (2009). Accelerated gradient methods for stochastic optimization and online learning. In Advances in Neural Information Processing Systems, volume 22, pages 781–789.
Ikeda, N. and Watanabe, S. (2014). Stochastic Differential Equations and Diffusion Processes. Elsevier.
Jacod, J. and Shiryaev, A. (2013). Limit Theorems for Stochastic Processes, volume 288. Springer Science & Business Media.
Kim, D. and Fessler, J. A. (2016). Optimized first-order methods for smooth convex minimization. Mathematical Programming, 159(1):81–107.
Krichene, W., Bayen, A., and Bartlett, P. (2015). Accelerated mirror descent in continuous and discrete time. Advances in Neural Information Processing Systems, 28:2845–2853.
Lan, G. (2012). An optimal method for stochastic composite optimization. Math. Program., 133(1-2, Ser. A):365–397.
Le Gall, J.-F. (2016). Brownian Motion, Martingales, and Stochastic Calculus, volume 274. Springer.
Muehlebach, M. and Jordan, M. (2019). A dynamical systems perspective on Nesterov acceleration. In International Conference on Machine Learning, pages 4656–4662. PMLR.
Nemirovskij, A. S. and Yudin, D. B. (1983). Problem Complexity and Method Efficiency in Optimization. Wiley-Interscience.
Nesterov, Y. (1983). A method of solving a convex programming problem with convergence rate O(1/k²). Dokl. Akad. Nauk SSSR, 27(2):372–376.
Nesterov, Y. (2003). Introductory Lectures on Convex Optimization: A Basic Course, volume 87. Springer Science & Business Media.
Nesterov, Y. (2018). Lectures on Convex Optimization, volume 137. Springer.
Sanz-Serna, J. M. and Zygalakis, K. (2020). The connections between Lyapunov functions for some optimization algorithms and differential equations. arXiv preprint arXiv:2009.00673.
Shi, B., Du, S., Jordan, M., and Su, W. (2018). Understanding the acceleration phenomenon via high-resolution differential equations. arXiv preprint arXiv:1810.08907.
Shi, B., Du, S., Su, W., and Jordan, M. (2019). Acceleration via symplectic discretization of high-resolution differential equations. In Advances in Neural Information Processing Systems, volume 32, pages 5744–5752.
Siegel, J. W. (2019). Accelerated first-order methods: Differential equations and Lyapunov functions. arXiv preprint arXiv:1903.05671.
Su, W., Boyd, S., and Candes, E. (2014). A differential equation for modeling Nesterov’s accelerated gradient method: theory and insights. Advances in Neural Information Processing Systems, 27:2510–2518.
Wibisono, A., Wilson, A. C., and Jordan, M. I. (2016). A variational perspective on accelerated methods in optimization. Proceedings of the National Academy of Sciences, 113(47):E7351–E7358.
Wilson, A., Recht, B., and Jordan, M. I. (2016). A Lyapunov analysis of momentum methods in optimization. arXiv preprint arXiv:1611.02635.
Wright, S. (2015). Coordinate descent algorithms. Math. Program., 151(1, Ser. B):3–34.
Xiao, L. (2010). Dual averaging methods for regularized stochastic learning and online optimization. J. Mach. Learn. Res., 11:2543–2596.
Zhang, J., Mokhtari, A., Sra, S., and Jadbabaie, A. (2018). Direct Runge-Kutta discretization achieves acceleration. In Advances in Neural Information Processing Systems, volume 31, pages 3900–3909.

Appendix A. Stochastic calculus toolbox
In this appendix, we give a short introduction to the mathematical tools that we use in this paper. For more details, the reader can consult the more rigorous monographs of Jacod and Shiryaev (2013), Ikeda and Watanabe (2014), and Le Gall (2016).
A.1. Poisson point measures.
We fix $P$ a probability law on some space $\Xi$.

Definition 1. A (homogeneous) Poisson point measure on $\mathbb{R}_{\geq 0} \times \Xi$, with intensity $\nu(\mathrm{d}t, \mathrm{d}\xi) = \mathrm{d}t \otimes \mathrm{d}P(\xi)$, is a random measure $N$ on $\mathbb{R}_{\geq 0} \times \Xi$ such that
• for any disjoint measurable subsets $A$ and $B$ of $\mathbb{R}_{\geq 0} \times \Xi$, $N(A)$ and $N(B)$ are independent;
• for any measurable subset $A$ of $\mathbb{R}_{\geq 0} \times \Xi$, $N(A)$ is a Poisson random variable with parameter $\nu(A)$. (If $\nu(A) = \infty$, then $N(A)$ is equal to $\infty$ almost surely.)

Proposition 1.
Let N be a Poisson point measure on R (cid:62) × Ξ with intensity d t ⊗ d P ( ξ ) .There exists a decomposition d N ( t, ξ ) = (cid:80) k (cid:62) δ ( T k ,ξ k ) (d t, d ξ ) on R (cid:62) × Ξ where < T < T Let N be a Poisson point measure on R (cid:62) × Ξ with intensity d t ⊗ d P ( ξ ) . The filtration F t , t (cid:62) , generated by N is defined by the formula F t = σ ( N ([0 , s ] × A ) , s (cid:54) t, A ⊂ Ξ measurable ) . A.2. Martingales and supermartingales. Let (Ω , F , P ) be a probability space and F t , t (cid:62) ,a filtration on this probability space. Definition 3. A random process x t ∈ R d , t (cid:62) , is adapted if for all t (cid:62) , x t is F t -measurable.An adapted process x t ∈ R , t (cid:62) is a martingale (resp. supermartingale ) if for all (cid:54) s (cid:54) t , E [ x t | F s ] = x s (resp. E [ x t | F s ] (cid:54) x s ). Definition 4. A random variable T ∈ [0 , ∞ ] is a stopping time if for all t (cid:62) , { T (cid:54) t } ∈ F t . Definition 5. A function x t , t (cid:62) , is said to be càdlàg if it is right continuous and for every t > , the limit x t − := lim s → t,s We fix P a probabilitylaw on some space Ξ , N a Poisson point measure on R (cid:62) × Ξ with intensity d t ⊗ d P ( ξ ) , and denote F t , t (cid:62) , the filtration generated by N . Definition 6. Let b : R d → R d and G : R d × Ξ → R d be two functions. An random process x t ∈ R d , t (cid:62) , is said to be a solution of the equation d x t = b ( x t )d t + (cid:90) Ξ G ( x t , ξ )d N ( t, ξ ) if it is adapted, càdlàg, and for all t (cid:62) , x t = x + (cid:90) t b ( x s )d s + (cid:90) [0 ,t ] × Ξ G ( x s − , ξ )d N ( s, ξ ) . If we consider the decomposition d N ( t, ξ ) = (cid:80) k (cid:62) δ ( T k ,ξ k ) (d t, d ξ ) given by Proposition 1, then (cid:90) [0 ,t ] × Ξ G ( x s − , ξ )d N ( s, ξ ) = (cid:88) k (cid:62) { T k (cid:54) t } G ( x T k − , ξ k ) . Proposition 2. Let x t ∈ R d be a solution of d x t = b ( x t )d t + (cid:90) Ξ G ( x t , ξ )d N ( t, ξ ) and ϕ : R d → R be a smooth function. Then ϕ ( x t ) = ϕ ( x ) + (cid:90) t (cid:104)∇ ϕ ( x s ) , b ( x s ) (cid:105) d s + (cid:90) [0 ,t ] × Ξ ( ϕ ( x s − + G ( x s − , ξ )) − ϕ ( x s − )) d N ( s, ξ ) . Moreover, we have the decomposition (cid:90) [0 ,t ] × Ξ ( ϕ ( x s − + G ( x s − , ξ )) − ϕ ( x s − )) d N ( s, ξ )= (cid:90) t (cid:90) Ξ ( ϕ ( x s + G ( x s , ξ )) − ϕ ( x s )) d t d P ( ξ ) + M t , where M t = (cid:82) [0 ,t ] × Ξ ( ϕ ( x s − + G ( x s − , ξ )) − ϕ ( x s − )) (d N ( s, ξ ) − d t d P ( ξ )) is a martingale. This proposition is an elementary calculus of variations formula: to compute the value of theobservable ϕ ( x t ) , one must sum the effects of the continuous part and of the Poisson jumps.Moreover, the integral with respect to the Poisson measure N becomes a martingale if the sameintegral with respect to its intensity measure d t ⊗ d P ( ξ ) is removed. Appendix B. Analysis of the continuized Nesterov acceleration To encompass the proofs in the convex and in the strongly convex cases in a unified way, weassume f is µ -strongly convex, µ (cid:62) . If µ > , this corresponds to assuming the µ -strong convexityin the usual sense; if µ = 0 , it means that we only assume the function to be convex. In otherwords, the proofs in the convex case can be obtained by taking µ = 0 below.In this section, F t , t (cid:62) , is the filtration associated to the Poisson point measure N . B.1. Noiseless case: proofs of Theorems 3 and 5. 
In this section, we analyze the convergenceof the continuized iteration (8)-(9), that we recall for the reader’s convenience: d x t = η t ( z t − x t )d t − γ t ∇ f ( x t )d N ( t ) , d z t = η (cid:48) t ( x t − z t )d t − γ (cid:48) t ∇ f ( x t )d N ( t ) . The choices of parameters η t , η (cid:48) t , γ t , γ (cid:48) t , t (cid:62) , and the corresponding convergence bounds follownaturally from the analysis. We seek sufficient conditions under which the function φ t = A t ( f ( x t ) − f ∗ ) + B t (cid:107) z t − x ∗ (cid:107) is a supermartingale.The process ¯ x t = ( t, x t , z t ) satisfies the equation d¯ x t = b (¯ x t )d t + G (¯ x t )d N ( t ) , b (¯ x t ) = η t ( z t − x t ) η (cid:48) t ( x t − z t ) , G (¯ x t ) = − γ t ∇ f ( x t ) − γ (cid:48) t ∇ f ( x t ) . We thus apply Proposition 2 to φ t = ϕ (¯ x t ) = ϕ ( t, x t , z t ) where ϕ ( t, x, z ) = A t ( f ( x ) − f ( x ∗ )) + B t (cid:107) z − x ∗ (cid:107) , we obtain: φ t = φ + (cid:90) t (cid:104)∇ ϕ (¯ x s ) , b (¯ x s ) (cid:105) d s + (cid:90) t ( ϕ (¯ x s + G (¯ x s )) − ϕ (¯ x s )) d s + M t , where M t is a martingale. Thus, to show that ϕ t is a supermartingale, it is sufficient to show thatthe map t (cid:55)→ (cid:82) t (cid:104)∇ ϕ (¯ x s ) , b (¯ x s ) (cid:105) d s + (cid:82) t ( ϕ (¯ x s + G (¯ x s )) − ϕ (¯ x s ))) d s is non-increasing almost surely,i.e., I t := (cid:104)∇ ϕ (¯ x t ) , b (¯ x t ) (cid:105) + ϕ (¯ x t + G (¯ x t )) − ϕ (¯ x t ) (cid:54) . We now compute (cid:104)∇ ϕ (¯ x t ) , b (¯ x t ) (cid:105) = ∂ t ϕ (¯ x t ) + (cid:104) ∂ x ϕ (¯ x t ) , η t ( z t − x t ) (cid:105) + (cid:104) ∂ z ϕ (¯ x t ) , η (cid:48) t ( x t − z t ) (cid:105) = d A t d t ( f ( x t ) − f ( x ∗ )) + d B t d t (cid:107) z t − x ∗ (cid:107) + A t η t (cid:104)∇ f ( x t ) , z t − x t (cid:105) + 2 B t η (cid:48) t (cid:104) z t − x ∗ , x t − z t (cid:105) . Here, we use that as f is µ -strongly convex, f ( x t ) − f ( x ∗ ) (cid:54) (cid:104)∇ f ( x t ) , x t − x ∗ (cid:105) − µ (cid:107) x t − x ∗ (cid:107) , and the simple bound (cid:104) z t − x ∗ , x t − z t (cid:105) = (cid:104) z t − x ∗ , x t − x ∗ (cid:105) − (cid:107) z t − x ∗ (cid:107) (cid:54) (cid:107) z t − x ∗ (cid:107)(cid:107) x t − x ∗ (cid:107) − (cid:107) z t − x ∗ (cid:107) (cid:54) (cid:0) (cid:107) z t − x ∗ (cid:107) + (cid:107) x t − x ∗ (cid:107) (cid:1) − (cid:107) z t − x ∗ (cid:107) = 12 (cid:0) (cid:107) x t − x ∗ (cid:107) − (cid:107) z t − x ∗ (cid:107) (cid:1) . This gives (cid:104)∇ ϕ (¯ x t ) , b (¯ x t ) (cid:105) (cid:54) (cid:18) d A t d t − A t η t (cid:19) (cid:104)∇ f ( x t ) , x t − x ∗ (cid:105) + (cid:18) B t η (cid:48) t − d A t d t µ (cid:19) (cid:107) x t − x ∗ (cid:107) (17) + (cid:18) d B t d t − B t η (cid:48) t (cid:19) (cid:107) z t − x ∗ (cid:107) + A t η t (cid:104)∇ f ( x t ) , z t − x ∗ (cid:105) . (18)Further, ϕ (¯ x t + G (¯ x t )) − ϕ (¯ x t ) = A t ( f ( x t − γ t ∇ f ( x t )) − f ( x t ))+ B t (cid:0) (cid:107) ( z t − x ∗ ) − γ (cid:48) t ∇ f ( x t ) (cid:107) − (cid:107) z t − x ∗ (cid:107) (cid:1) . As f is L -smooth, f ( x t − γ t ∇ f ( x t )) − f ( x t ) (cid:54) (cid:104)∇ f ( x t ) , − γ t ∇ f ( x t ) (cid:105) + L (cid:107) γ t ∇ f ( x t ) (cid:107) = − γ t (cid:18) − Lγ t (cid:19) (cid:107)∇ f ( x t ) (cid:107) . This gives ϕ (¯ x t + G (¯ x t )) − ϕ (¯ x t ) (cid:54) (cid:18) B t γ (cid:48) t − A t γ t (cid:18) − Lγ t (cid:19)(cid:19) (cid:107)∇ f ( x t ) (cid:107) − B t γ (cid:48) t (cid:104)∇ f ( x t ) , z t − x ∗ (cid:105) . 
(19)Finally, combining (17)-(18) with (19), we obtain I t (cid:54) (cid:18) d A t d t − A t η t (cid:19) (cid:104)∇ f ( x t ) , x t − x ∗ (cid:105) + (cid:18) d B t d t − B t η (cid:48) t (cid:19) (cid:107) z t − x ∗ (cid:107) (20) + ( A t η t − B t γ (cid:48) t ) (cid:104)∇ f ( x t ) , z t − x ∗ (cid:105) + (cid:18) B t η (cid:48) t − d A t d t µ (cid:19) (cid:107) x t − x ∗ (cid:107) (21) + (cid:18) B t γ (cid:48) t − A t γ t (cid:18) − Lγ t (cid:19)(cid:19) (cid:107)∇ f ( x t ) (cid:107) . (22)Remember that I t (cid:54) is a sufficient condition for φ t to be a supermartingale. Here, we choose theparameters η t , η (cid:48) t , γ t , γ (cid:48) t , t (cid:62) , so that all prefactors are . We start by taking γ t ≡ L (other choices γ t < L could be possible but would give similar results) and we want to satisfy d A t d t = A t η t , d B t d t = B t η (cid:48) t A t η t = 2 B t γ (cid:48) t , B t η (cid:48) t = d A t d t µ , B t γ (cid:48) t = A t L . To satisfy the last equation, we choose γ (cid:48) t = (cid:114) A t LB t . (23)To satisfy the third equation, we choose η t = 2 B t γ (cid:48) t A t = (cid:114) B t LA t . (24)To satisfy the fourth equation, we choose η (cid:48) t = d A t d t µ B t = A t η t µ B t = µ (cid:114) A t LB t . (25)Having now all parameters η t , η (cid:48) t , γ t , γ (cid:48) t constrained, we now have that φ t is Lyapunov if d A t d t = A t η t = (cid:114) A t B t L , d B t d t = B t η (cid:48) t = µ (cid:114) A t B t L . This only leaves the choice of the initialization ( A , B ) as free: both the algorithm and theLyapunov depend on it. (Actually, only the relative value A /B matters.) Instead of solvingthe above system of two coupled non-linear ODEs, it is convenient to turn them into a singlesecond-order linear ODE: dd t (cid:16)(cid:112) A t (cid:17) = 12 √ A t d A t d t = (cid:114) B t L , dd t (cid:16)(cid:112) B t (cid:17) = 12 √ B t d B t d t = µ (cid:114) A t L . (26)This can also be restated as d d t (cid:16)(cid:112) A t (cid:17) = µ L (cid:112) A t , (cid:112) B t = √ L dd t (cid:16)(cid:112) A t (cid:17) . (27) B.1.1. Proof of the first part (convex case). We now assume µ = 0 , and we choose the solution suchthat A = 0 and B = 1 . From (26), we have dd t (cid:0) √ B t (cid:1) = 0 , thus B t ≡ , and dd t (cid:0) √ A t (cid:1) = √ L ,thus √ A t = t √ L . The parameters of the algorithm are given by (23)-(25): η t = t , η (cid:48) t = 0 , γ (cid:48) t = t √ L (and we had chosen γ t = L ).From the fact that φ t is a supermartingale, we obtain that the associated algorithm satisfies E f ( x t ) − f ( x ∗ ) (cid:54) E φ t A t (cid:54) φ A t = 2 L (cid:107) z − x ∗ (cid:107) t . This proves the first part of Theorem 3.Further, one can apply martingale stopping Theorem 7 to the supermartingale φ t with thestopping time T k to obtain E [ A T k ( f (˜ x k ) − f ( x ∗ ))] = E [ A T k ( f ( x T k ) − f ( x ∗ ))] (cid:54) E φ T k (cid:54) φ = (cid:107) z − x ∗ (cid:107) . This proves the first part of Theorem 5. B.1.2. Proof of the second part (strongly convex case). We now assume µ > . We consider thesolution of (27) that is exponential: (cid:112) A t = (cid:112) A exp (cid:18) (cid:114) µL t (cid:19) , (cid:112) B t = (cid:112) A (cid:114) µ (cid:18) (cid:114) µL t (cid:19) . 
The parameters of the algorithm are given by (23)-(25): η t = η (cid:48) t = (cid:112) µL , γ (cid:48) t = √ µL (and we hadchosen γ t = L ).From the fact that φ t is a supermartingale, we obtain that the associated algorithm satisfies E f ( x t ) − f ( x ∗ ) (cid:54) E φ t A t (cid:54) φ A t = A ( f ( x ) − f ( x ∗ )) + A µ (cid:107) z − x ∗ (cid:107) A t = (cid:16) f ( x ) − f ( x ∗ ) + µ (cid:107) z − x ∗ (cid:107) (cid:17) exp (cid:18) − (cid:114) µL t (cid:19) . This proves the second part of Theorem 3. Similarly to above, one can also apply the martingalestopping theorem to prove the second part of Theorem 5. Remark 1. In the above derivation, in both the convex and strongly convex cases, we choose aparticular solution of (27) , while several solutions are possible. In the convex case, we make thechoice A = 0 to have a succinct bound that does not depend on f ( x ) − f ( x ∗ ) . More importantly,in the strongly convex case, we choose the solution that satisfies the relation (cid:112) µ √ A t = √ B t , whichimplies that η t , η (cid:48) t , γ (cid:48) t , are constant functions of t , and η t = η (cid:48) t . These conditions help solving inclosed form the continuous part of the process d x t = η t ( z t − x t )d t , d z t = η (cid:48) t ( x t − z t )d t , which is crucial if we want to have a discrete implementation of our method (for more details, seeTheorem 4 and its proof ). However, in the strongly convex case, considering other solutions wouldbe interesting, for instance to have an algorithm converging to the convex one as µ → . B.2. With additive noise: proof of Theorem 6. The proof of this theorem is along the samelines as the proof of Theorem 3 above. Here, we only give the major differences.We analyze the convergence of the continuized stochastic iteration (15)-(16), that we recall forthe reader’s convenience: d x t = η t ( z t − x t )d t − γ t (cid:90) Ξ ∇ f ( x t , ξ )d N ( t, ξ ) , d z t = η (cid:48) t ( x t − z t )d t − γ (cid:48) t (cid:90) Ξ ∇ f ( x t , ξ )d N ( t, ξ ) . In this setting, we loose the property that φ t = A t ( f ( x t ) − f ∗ ) + B t (cid:107) z t − x ∗ (cid:107) is a supermartingale. However, we bound the increase of φ t .The process ¯ x t = ( t, x t , z t ) satisfies the equation d¯ x t = b (¯ x t )d t + (cid:90) Ξ G (¯ x t , ξ )d N ( t, ξ ) , b (¯ x t ) = η t ( z t − x t ) η (cid:48) t ( x t − z t ) , G (¯ x t , ξ ) = − γ t ∇ f ( x t , ξ ) − γ (cid:48) t ∇ f ( x t , ξ ) . We apply Proposition 2 to φ t = ϕ (¯ x t ) = ϕ ( t, x t , z t ) and obtain φ t = φ + (cid:90) t I s d s + M t , (28)where M t is a martingale and I t = (cid:104)∇ ϕ (¯ x t ) , b (¯ x t ) (cid:105) + E ξ ϕ (¯ x t + G (¯ x t , ξ )) − ϕ (¯ x t ) . The computation of the first term remains the same: the inequality (17)-(18) holds. The computationof the second term becomes E ξ ϕ (¯ x t + G (¯ x t , ξ )) − ϕ (¯ x t ) = A t ( E ξ f ( x t − γ t ∇ f ( x t , ξ )) − f ( x t ))+ B t (cid:0) E ξ (cid:107) ( z t − x ∗ ) − γ (cid:48) t ∇ f ( x t , ξ ) (cid:107) − (cid:107) z t − x ∗ (cid:107) (cid:1) . As f is L -smooth, f ( x t − γ t ∇ f ( x t , ξ )) − f ( x t ) (cid:54) (cid:104)∇ f ( x t ) , − γ t ∇ f ( x t , ξ ) (cid:105) + L (cid:107) γ t ∇ f ( x t , ξ ) (cid:107) , E ξ f ( x t − γ t ∇ f ( x t , ξ )) − f ( x t ) (cid:54) (cid:104)∇ f ( x t ) , − γ t E ξ ∇ f ( x t , ξ ) (cid:105) + L E ξ (cid:107) γ t ∇ f ( x t , ξ ) (cid:107) . 
Bu assumptions (13) and (14), the stochastic gradient ∇ f ( x, ξ ) is unbiased and has a variancebounded by σ , which implies E ξ (cid:107)∇ f ( x t , ξ ) (cid:107) (cid:54) (cid:107)∇ f ( x t ) (cid:107) + σ . Thus E ξ f ( x t − γ t ∇ f ( x t , ξ )) − f ( x t ) (cid:54) − γ t (cid:18) − Lγ t (cid:19) (cid:107)∇ f ( x t ) (cid:107) + σ Lγ t . Similarly, E ξ (cid:107) ( z t − x ∗ ) − γ (cid:48) t ∇ f ( x t , ξ ) (cid:107) − (cid:107) z t − x ∗ (cid:107) = − γ (cid:48) t (cid:104) E ξ ∇ f ( x t , ξ ) , z t − x ∗ (cid:105) + γ (cid:48) t E ξ (cid:107)∇ f ( x t , ξ ) (cid:107) (cid:54) − γ (cid:48) t (cid:104)∇ f ( x t ) , z t − x ∗ (cid:105) + γ (cid:48) t (cid:107)∇ f ( x t ) (cid:107) + σ γ (cid:48) t . This gives ϕ (¯ x t + G (¯ x t )) − ϕ (¯ x t ) (cid:54) (cid:18) B t γ (cid:48) t − A t γ t (cid:18) − Lγ t (cid:19)(cid:19) (cid:107)∇ f ( x t ) (cid:107) − B t γ (cid:48) t (cid:104)∇ f ( x t ) , z t − x ∗ (cid:105) + σ (cid:18) A t Lγ t B t γ (cid:48) t (cid:19) . Combining the bounds, we obtain I t (cid:54) (cid:18) d A t d t − A t η t (cid:19) (cid:104)∇ f ( x t ) , x t − x ∗ (cid:105) + (cid:18) d B t d t − B t η (cid:48) t (cid:19) (cid:107) z t − x ∗ (cid:107) + ( A t η t − B t γ (cid:48) t ) (cid:104)∇ f ( x t ) , z t − x ∗ (cid:105) + (cid:18) B t η (cid:48) t − d A t d t µ (cid:19) (cid:107) x t − x ∗ (cid:107) + (cid:18) B t γ (cid:48) t − A t γ t (cid:18) − Lγ t (cid:19)(cid:19) (cid:107)∇ f ( x t ) (cid:107) + σ (cid:18) A t Lγ t B t γ (cid:48) t (cid:19) , which is an additive perturbation of the bound (20)-(22) in the noiseless case, with a perturbationproportional to σ . The choices of parameters of Theorem 3 cancel all first five prefactors, andsatisfy γ t = L , A t Lγ t = B t γ (cid:48) t . We thus obtain I t (cid:54) σ A t L . This bound controls the increase of φ t . Using the decomposition (28), we obtain E f ( x t ) − f ( x ∗ ) (cid:54) E φ t A t (cid:54) φ A t + (cid:82) t E I s d sA t (cid:54) A ( f ( x ) − f ( x ∗ )) + B (cid:107) z − x ∗ (cid:107) A t + σ L (cid:82) t A s d sA t . B.2.1. Proof of the first part (convex case). In this case, A t = t L and B = 1 . Thus (cid:82) t A s d s = L t . Thus E f ( x t ) − f ( x ∗ ) (cid:54) L (cid:107) z − x ∗ (cid:107) t + σ t L . B.2.2. Proof of the second part (strongly convex case). In this case, A t = A exp (cid:0)(cid:112) µL t (cid:1) and B = A µ . Thus (cid:82) t A s d s (cid:54) A (cid:112) µL − exp (cid:0)(cid:112) µL t (cid:1) = (cid:113) Lµ A t . Thus E f ( x t ) − f ( x ∗ ) (cid:54) (cid:16) f ( x ) − f ( x ∗ ) + µ (cid:107) z − x ∗ (cid:107) (cid:17) exp (cid:18) − (cid:114) µL t (cid:19) + σ √ µL . Appendix C. Proof of Theorem 4 By integrating the ODE d x t = η t ( z t − x t )d t , d z t = η (cid:48) t ( x t − z t )d t , between T k and T k +1 − , we obtain that there exists τ k , τ (cid:48)(cid:48) k , such that ˜ y k = x T k +1 − = x T k + τ k ( z T k − x T k ) = ˜ x k + τ k (˜ z k − ˜ x k ) , (29) z T k +1 − = z T k + τ (cid:48)(cid:48) k ( x T k − z T k ) = ˜ z k + τ (cid:48)(cid:48) k (˜ x k − ˜ z k ) . 
From the first equation, we have ˜ x k = − τ k (˜ y k − τ k ˜ z k ) , which gives by substitution in the secondequation, z T k +1 − = ˜ z k + τ (cid:48)(cid:48) k (cid:18) − τ k (˜ y k − τ k ˜ z k ) − ˜ z k (cid:19) = ˜ z k + τ (cid:48) k (˜ y k − ˜ z k ) , where τ (cid:48) k = τ (cid:48)(cid:48) k − τ k .Further, from (4)-(5), we obtain the equations ˜ x k +1 = x T k +1 = x T k +1 − − γ T k +1 ∇ f ( x T k +1 − ) = ˜ y k − γ T k +1 ∇ f (˜ y k ) , (30) ˜ z k +1 = z T k +1 = z T k +1 − − γ (cid:48) T k +1 ∇ f ( x T k +1 − ) = ˜ z k + τ (cid:48) k (˜ y k − ˜ z k ) − γ (cid:48) T k +1 ∇ f (˜ y k ) . (31)The stated equation (10)-(12) are the combination of (29), (30) and (31). (1) The parameters of Theorem 3.1 are η t = t , η (cid:48) t = 0 , γ t = L and γ (cid:48) t = t L . In this case, theODE d x t = η t ( z t − x t )d t = 2 t ( z t − x t )d t , d z t = η (cid:48) t ( x t − z t )d t = 0 , can be integrated in closed form: for t (cid:62) t , x t = z t + (cid:18) t t (cid:19) ( x t − z t ) = x t + (cid:32) − (cid:18) t t (cid:19) (cid:33) ( z t − x t ) ,z t = z t . In particular, taking t = T k , t = T k +1 − , we obtain τ k = 1 − (cid:16) T k T k +1 (cid:17) , τ (cid:48)(cid:48) k = 0 and thus τ (cid:48) k = τ (cid:48)(cid:48) k − τ k = 0 . Finally, ˜ γ k = γ T k = L and ˜ γ (cid:48) k = γ (cid:48) T k = T k L .(2) The parameters of Theorem 3.2 are η t = η (cid:48) t ≡ (cid:112) µL , γ t ≡ L and γ (cid:48) t ≡ √ µL . In this case, theODE d x t = η t ( z t − x t )d t = (cid:114) µL ( z t − x t )d t , d z t = η (cid:48) t ( x t − z t )d t = (cid:114) µL ( x t − z t )d t , can also be integrated in closed form: for t (cid:62) t , x t = x t + z t x t − z t (cid:18) − (cid:114) µL ( t − t ) (cid:19) = x t + 12 (cid:18) − exp (cid:18) − (cid:114) µL ( t − t ) (cid:19)(cid:19) ( z t − x t ) ,z t = x t + z t z t − x t (cid:18) − (cid:114) µL ( t − t ) (cid:19) = z t + 12 (cid:18) − exp (cid:18) − (cid:114) µL ( t − t ) (cid:19)(cid:19) ( x t − z t ) . In particular, taking t = T k , t = T k +1 − , we obtain τ k = τ (cid:48)(cid:48) k = (cid:0) − exp (cid:0) − (cid:112) µL ( T k +1 − T k ) (cid:1)(cid:1) and thus τ (cid:48) k = τ (cid:48)(cid:48) k − τ k = tanh (cid:0)(cid:112) µL ( T k +1 − T k ) (cid:1) . Finally, ˜ γ k = γ T k = L and ˜ γ (cid:48) k = γ (cid:48) T k = √ µL . Appendix D. Heuristic ODE scaling limit of the continuizedacceleration D.1. Convex case. With the choices of parameters of Theorem 3.1, the continuized accelerationis d x t = 2 t ( z t − x t )d t − L ∇ f ( x t )d N ( t ) , d z t = − t L ∇ f ( x t )d N ( t ) . The ODE scaling limit is obtained by taking the limit L → ∞ (so that the stepsize /L vanishes)and rescaling the time s = t/ √ L . Some law of large number argument heuristically gives us that,as L → ∞ , d N ( t ) = d N ( √ Ls ) ≈ √ L d s . Thus in the limit, we obtain d x s = 2 √ Ls ( z s − x s ) √ L d s − L ∇ f ( x s ) √ L d s , d z s = − √ Ls L ∇ f ( x s ) √ L d s . The second term of the first equation becomes negligible in the limit. Thus the equations simplifyto d x s d s = 2 s ( z s − x s ) , d z s d s = − s ∇ f ( x s ) . Thus − s ∇ f ( x s ) = d z s d s = dd s (cid:18) x s + s x s d s (cid:19) = d x s d s + 12 d x s d s + s x s d s , and thus d x s d s + 3 s d x s d s + ∇ f ( x s ) = 0 . This is the same limiting ODE as the one found by Su et al. (2014) for Nesterov acceleration. D.2. Strongly-convex case. 
With the choices of parameters of Theorem 3.2, the continuizedacceleration is d x t = (cid:114) µL ( z t − x t )d t − L ∇ f ( x t )d N ( t ) , d z t = (cid:114) µL ( x t − z t )d t − √ µL ∇ f ( x t )d N ( t ) . Again, we take joint scaling L → ∞ , s = t/ √ L , with the approximation d N ( t ) ≈ √ L d s . We obtain d x s = (cid:114) µL ( z s − x s ) √ L d s − L ∇ f ( x s ) √ L d s , d z s = (cid:114) µL ( x s − z s ) √ L d s − √ µL ∇ f ( x s ) √ L d s . As before, the second term of the first equation becomes negligible in the limit. Thus the equationssimplify to d x s d s = √ µ ( z s − x s ) , (32) d z s d s = √ µ ( x s − z s ) − √ µ ∇ f ( x s ) . (33)From (32), we have z s = x s + √ µ d x s d s , and by substitution in (33), we obtain d x s d s + 2 √ µ d x s d s + ∇ f ( x s ) = 0 ..