Acceleration Methods
Alexandre d’Aspremont
CNRS & Ecole Normale Supérieure, [email protected]
Damien Scieur
Samsung SAIT AI Lab & MILA, [email protected]
Adrien Taylor
INRIA & Ecole Normale Supérieure, [email protected]
ABSTRACT

This monograph covers some recent advances on a range of acceleration techniques frequently used in convex optimization. We first use quadratic optimization problems to introduce two key families of methods, namely momentum and nested optimization schemes. They coincide in the quadratic case to form the Chebyshev method, whose complexity is analyzed using Chebyshev polynomials.

We discuss momentum methods in detail, starting with the seminal work of Nesterov (1983), and structure convergence proofs using a few master templates, such as that of optimized gradient methods, which have the key benefit of showing how momentum methods maximize convergence rates.

We further cover proximal acceleration techniques, at the heart of the Catalyst and Accelerated Hybrid Proximal Extragradient frameworks, using similar algorithmic patterns.

Common acceleration techniques directly rely on the knowledge of some regularity parameters of the problem at hand, and we conclude by discussing restart schemes, a set of simple techniques to reach nearly optimal convergence rates while adapting to unobserved regularity parameters.
Introduction
Optimization methods are a core component of the modern numerical toolkit. In many cases, iterative algorithms for solving convex optimization problems have reached a level of efficiency and reliability comparable to that of advanced linear algebra routines. This is largely true on medium-scale problems, where interior point methods reign supreme, less so on large-scale problems, where the complexity of first-order methods is not as well understood and efficiency still remains a concern.

The situation has markedly improved in recent years, driven in particular by the emergence of a number of applications in statistics, machine learning and signal processing. Building on Nesterov's path-breaking algorithm from the 80's, several accelerated methods and numerical schemes have been developed that both improve the efficiency of optimization algorithms and refine their complexity bounds. Our objective in this monograph is to cover these recent developments using a few master templates.

The methods described in this manuscript can be arranged in roughly two categories. The first, stemming from the work of Nesterov (1983), produces variants of the gradient method with accelerated worst-case convergence rates, which are provably optimal under classical regularity assumptions. The second uses outer iteration (aka nested) schemes to speed up convergence. In this second setting, accelerated schemes run both an inner and an outer loop, with inner iterations being solved by classical optimization methods.
Direct acceleration techniques.
Ever since the original algorithm by Nesterov (1983), the acceleration phenomenon has remained somewhat of a mystery. While accelerated gradient methods can be seen as iteratively building a model for the function, using it to guide gradient computations, the argument is purely algebraic and essentially amounts to exploiting regularity assumptions very effectively. This approach of collecting inequalities induced by regularity assumptions and cleverly chaining them to prove convergence was also used in e.g. (Beck and Teboulle, 2009) to produce an optimal proximal gradient method. There too, however, the proof yielded very little evidence on why the method must actually be faster.

Fortunately, we are now better equipped to push the proof mechanisms much further. Recent advances in programmatic design of optimization algorithms allow us to perform the design and analysis of algorithms following a more principled approach. In particular, the performance estimation approach, pioneered by Drori and Teboulle (2014), can be used to design optimal methods from scratch, picking algorithmic parameters to optimize worst-case performance guarantees (Drori and Teboulle, 2014; Kim and Fessler, 2016). Primal-dual optimality conditions on the design problem then provide a blueprint for the accelerated algorithm structure and for its convergence proof.

Using this framework, acceleration is not a mystery anymore: it is the main objective in the algorithm's design. We recover the usual “soup of regularity inequalities” template of classical convergence proofs, but the optimality conditions of the design problem explicitly produce a method that optimizes the convergence rate. In this monograph, we cover accelerated first-order methods using this systematic template and describe a number of convergence proofs for classical variants of the accelerated gradient method of (Nesterov, 1983), such as those of Nesterov (1983), Nesterov (2003), and Beck and Teboulle (2009).

Nested acceleration schemes.
The second category of acceleration techniques we cover in this monograph is formed by outer iteration schemes, which use classical optimization algorithms as a black-box in their inner loops, and where acceleration is produced by an argument on the outer loop. We describe three types of acceleration results in this vein.

The first scheme is based on nonlinear acceleration techniques. Using arguments dating back to (Aitken, 1927; Wynn, 1956; Anderson and Nash, 1987), these techniques use a weighted average of iterates to extrapolate a better candidate solution than the last iterate. We begin by describing the Chebyshev method for solving quadratic problems, which interestingly qualifies both as a gradient method and as an outer iteration scheme, and takes its name from the fact that Chebyshev polynomial coefficients are used to approximately minimize the gradient at the extrapolated solution. This argument can be extended to non-quadratic optimization problems provided the extrapolation procedure is regularized.

The second scheme, due to (Güler, 1992; Monteiro and Svaiter, 2013; Lin et al., 2015), solves a conceptual accelerated proximal point algorithm, and uses classical iterative methods to approximate the proximal point in an inner loop. In particular, this framework produces accelerated gradient methods (in the same sense as Nesterov's acceleration) when the approximate proximal points are computed using linearly converging gradient-based optimization methods, taking advantage of the fact that the inner problems are always strongly convex.

Finally, we describe restart schemes. These take advantage of regularity properties called Hölderian error bounds, which extend strong convexity properties near the optimum and hold almost generically, to improve convergence rates of most first-order methods. The parameters of Hölderian error bounds are usually unknown, but the restart schemes are robust, which means that they are adaptive to these constants and that their empirical performance is excellent on problems with reasonable precision targets.
Content & organization
We present a few convergence acceleration techniques that are particularly relevant in the context of (first-order) convex optimization. This summary presents our own points of view on the topic, and focuses on techniques that received a lot of attention since the early 2000's, although some of the underlying ideas date back to long before. We do not pretend to be exhaustive, and are aware that very valuable references might not appear below.

Chapters can be read nearly independently. However, we believe the insights of some chapters largely benefit the others. In particular, Chebyshev acceleration (Chapter 2) & nonlinear acceleration (Chapter 3) are clear complementary readings. Similarly, Chebyshev acceleration (Chapter 2) & Nesterov acceleration (Chapter 4), Nesterov acceleration (Chapter 4) & proximal acceleration (Chapter 5), as well as Nesterov acceleration (Chapter 4) & restarts (Chapter 6) certainly belong together.
Prerequisites and complementary readings.
This monograph is not meant to be a general-purpose manuscript on convex optimization, for which we refer the reader to the now classical references (Boyd and Vandenberghe, 2004; Bonnans et al., 2006; Nocedal and Wright, 2006). Other directly related references are provided in the text.

We assume the reader to have a working knowledge of basic linear algebra and convex analysis (such as subdifferentials and subgradients), as we do not spend much time on the corresponding technical details. Classical references on the latter include (Rockafellar, 1970; Bauschke and Combettes, 2011; Hiriart-Urruty and Lemaréchal, 2013).
Notes on this version.
Last compiled on January 26, 2021. This version is still preliminary and under revision. Any feedback is welcome. In particular, the current version certainly does not contain all appropriate references. Please feel free to suggest any reference that you believe is relevant.
Chebyshev Acceleration
While “Chebyshev polynomials are everywhere dense in numerical analysis”, we would like to argue here that Chebyshev polynomials also provide one of the most direct and intuitive explanations for acceleration arguments in first-order methods. In quadratic optimization, one can form a linear combination of fixed-step gradient descent iterates that minimizes the gradient norm optimally and uniformly over all quadratic problems satisfying given regularity conditions. Given a fixed number of iterations, these linear combination coefficients emerge from a Chebyshev minimization problem, whose solution can also be computed iteratively, thus yielding an algorithm, called the Chebyshev method (Nemirovskiy and Polyak, 1984), which is detailed in what follows and asymptotically matches the heavy-ball method.
In what follows, we show basic acceleration results on quadratic minimization problems. In this case, the optimal points can be written as the solutions of a linear system, and the basic gradient method can be seen as a simple iterative solver for this linear system. We then detail acceleration methods on these sequences using a classical argument involving Chebyshev polynomials.

Analyzing this simple scenario is useful in two ways. First, recursive formulations of the Chebyshev argument yield a basic algorithmic template to design accelerated methods, providing some intuition on their structure, such as the presence of a momentum term. Second, the arguments are robust to perturbations of the quadratic function f, hence apply asymptotically to more generic convex functions, thus enabling acceleration in a much wider range of applications. This is covered in the next chapter.
For now, consider the following unconstrained quadratic minimization problem

    minimize f(x) ≜ (1/2) xᵀHx − bᵀx    (2.1)

in the variable x ∈ R^d, where H ∈ S_d is the Hessian of f. We further assume that f is both smooth and strongly convex, i.e. that there are µ, L > 0 such that µI ⪯ H ⪯ LI (this can be readily extended to the case where µ is the smallest nonzero eigenvalue, or where µ = 0). Suppose we solve this problem using the fixed-step gradient method detailed as Algorithm 1.

Algorithm 1
Gradient Method
Input:
A differentiable convex function f, an initial point x₀, a step size γ > 0.
for i = 0, …, k − 1 do
    x_{i+1} := x_i − γ∇f(x_i)
end for
Output: An approximate solution x_k.

On problem (2.1), the iterations become

    x_{k+1} = (I − γH)x_k + γb,

and, calling x⋆ = H⁻¹b the optimum of problem (2.1), this is again

    x_{k+1} − x⋆ = (I − γH)(x_k − x⋆).    (2.2)

This means that gradient descent iterates x_k − x⋆ are computed from x₀ − x⋆ using the matrix polynomial

    P_k^{Grad}(H) = (I − γH)^k.    (2.3)

Suppose we set the step size γ such that ‖I − γH‖ < 1; then

    ‖x_k − x⋆‖ ≤ ‖I − γH‖^k ‖x₀ − x⋆‖,  for k ≥ 0.    (2.4)

Because the matrix H is symmetric, hence diagonalizable in an orthogonal basis, given γ > 0 we get

    ‖I − γH‖ ≤ max_{µI ⪯ H ⪯ LI} ‖I − γH‖ ≤ max_{µ ≤ h ≤ L} |1 − γh| ≤ max_{µ ≤ h ≤ L} {γh − 1, 1 − γh} ≤ max{γL − 1, 1 − γµ}.

Figure 2.1: Plot of |P_k^{Grad}(λ)| (for the optimal γ in (2.6)) for a few values of k, with µ = 1, L = 10. The polynomial is normalized such that P(0) = 1. The rate is equal to the largest value of |P_k^{Grad}(λ)| on the interval, achieved at the boundaries where λ = µ or L.

To get the best possible worst-case rate, we now minimize this quantity in γ > 0, solving

    min_{γ>0} max{γL − 1, 1 − γµ} = (L − µ)/(L + µ),    (2.5)

with the optimal step size obtained when both terms in the max are equal, at

    γ = 2/(L + µ).    (2.6)

Calling κ = L/µ ≥ 1 the condition number of the function f, the bound in (2.4) finally becomes

    ‖x_k − x⋆‖ ≤ ((κ − 1)/(κ + 1))^k ‖x₀ − x⋆‖,  for k ≥ 0,    (2.7)

which characterizes the convergence of the gradient method in the smooth, strongly convex case.
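As a quick illustration, the following minimal NumPy sketch (ours, not from the monograph) runs Algorithm 1 with the optimal step size (2.6) on a synthetic quadratic with spectrum in [µ, L] and checks the bound (2.7); all variable names and the instance are ours.

```python
# Minimal sketch: fixed-step gradient descent (Algorithm 1) on a quadratic (2.1),
# with the optimal step size gamma = 2/(L + mu) from (2.6).
import numpy as np

def gradient_method(H, b, x0, mu, L, iters):
    gamma = 2.0 / (L + mu)
    x = x0.copy()
    for _ in range(iters):
        x = x - gamma * (H @ x - b)   # gradient of f(x) = x'Hx/2 - b'x is Hx - b
    return x

rng = np.random.default_rng(0)
d, mu, L, iters = 50, 1.0, 10.0, 100
Q, _ = np.linalg.qr(rng.standard_normal((d, d)))
H = Q @ np.diag(np.linspace(mu, L, d)) @ Q.T     # symmetric, spectrum in [mu, L]
b = rng.standard_normal(d)
x_star = np.linalg.solve(H, b)                   # x* = H^{-1} b

x_k = gradient_method(H, b, np.zeros(d), mu, L, iters)
rate = (L - mu) / (L + mu)
# Bound (2.7): ||x_k - x*|| <= ((kappa-1)/(kappa+1))^k ||x_0 - x*||
print(np.linalg.norm(x_k - x_star) <= rate ** iters * np.linalg.norm(x_star))
```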
In equation (2.4) above, we saw that the convergence rate of the gradient method on quadratic functions can be controlled by the spectral norm of a matrix polynomial, and Figure 2.1 plots the polynomial P_k^{Grad} for various degrees k. We can push this reasoning a bit further to produce methods with accelerated rates of convergence.

The bounds derived above for gradient descent can be extended to a broader class of first-order methods for quadratic optimization. We consider first-order algorithms where each iterate belongs to the span of previous gradients, i.e.

    x_{k+1} ∈ x₀ + span{∇f(x₀), ∇f(x₁), …, ∇f(x_k)},

and show that the iterates are computed using a matrix polynomial as in (2.3) above.

Proposition 2.1.
Assume the iterates x_k are generated by a first-order algorithm with fixed step size such that

    x_{k+1} ∈ x₀ + span{∇f(x₀), ∇f(x₁), …, ∇f(x_k)},    (2.8)

where ∇f is the gradient of the quadratic objective function in (2.1). Then, the error at iterate x_k can be written

    x_k − x⋆ = P_k(H)(x₀ − x⋆)    (2.9)

where P_k is a polynomial of degree at most k with P_k(0) = 1.

Proof.
Indeed, since ∇f(x) is the gradient of a quadratic function, it reads ∇f(x) = Hx − b = H(x − x⋆) for any x⋆ satisfying Hx⋆ = b, where H is symmetric. We have

    x₀ − x⋆ = 1 · (x₀ − x⋆) = P₀(H)(x₀ − x⋆),
    x₁ − x⋆ = (x₀ − x⋆) + α₀^{(1)} H(x₀ − x⋆) = P₁(H)(x₀ − x⋆).

Assuming recursively that (2.9) holds for all indices i ≤ k, our assumption on iterates in (2.8) means

    x_{k+1} − x⋆ = x₀ − x⋆ + Σ_{i=0}^{k} α_i^{(k+1)} ∇f(x_i)
                = x₀ − x⋆ + Σ_{i=0}^{k} α_i^{(k+1)} H P_i(H)(x₀ − x⋆)
                = (I + H Σ_{i=0}^{k} α_i^{(k+1)} P_i(H)) (x₀ − x⋆).

Writing P_{k+1}(x) = 1 + x Σ_{i=0}^{k} α_i^{(k+1)} P_i(x), we have x_{k+1} − x⋆ = P_{k+1}(H)(x₀ − x⋆) with P_{k+1}(0) = 1 and deg(P_{k+1}) ≤ k + 1.

Note that the proof structure does not change if the step size is allowed to vary with x_k, H, …, so that α_i^{(k)} ≡ α_i^{(k)}(x_k, H, …) in the proof of Proposition 2.1, e.g. when using line search. This means that the above result readily applies in this case.

This last proposition helps us design optimal algorithms, given a class M of problem matrices. Indeed, because there is a one-to-one link between polynomials and first-order methods, we can use tools from approximation theory to find optimal polynomials and design optimal methods. Given a matrix class M, this means minimizing the worst-case rate of convergence over H ∈ M, solving

    P_k⋆ = argmin_{deg(P) ≤ k, P(0)=1} max_{H ∈ M} ‖P(H)‖

in the variable P_k⋆, a polynomial of degree at most k. This polynomial P_k⋆ will be an optimal polynomial for M, and will yield a (worst-case) optimal algorithm for the class M. In the simple case where M is the set of positive definite matrices with bounded spectrum, namely

    M = {H ∈ S_n : 0 ≺ µI ⪯ H ⪯ LI},

finding the optimal polynomial over this class simplifies to

    P_k⋆ = argmin_{deg(P) ≤ k, P(0)=1} max_{λ ∈ [µ,L]} |P(λ)|    (2.10)

in the same variable. Polynomials solving (2.10) are derived from Chebyshev polynomials of the first kind in approximation theory and can be formed explicitly to produce an optimal algorithm called the Chebyshev method. The next section describes and analyses the rate of convergence of this method.
In the previous section, we have mentioned that the optimal polynomial for the class

    M = {H ∈ S_n : 0 ≺ µI ⪯ H ⪯ LI}

is related to Chebyshev polynomials of the first kind. We now explicitly introduce these polynomials, then analyse the rate of convergence of the associated algorithm, namely the Chebyshev method. A more complete treatment of these polynomials is available in e.g. (Mason and Handscomb, 2002).
The Chebyshev polynomials of the first kind are defined recursively as follows:

    T₀(x) = 1,  T₁(x) = x,    (2.11)
    T_k(x) = 2x T_{k−1}(x) − T_{k−2}(x),  for k ≥ 2.

These polynomials T_k have minimal ℓ∞ norm on the interval [−1, 1] among polynomials satisfying deg(T_k) = k and T_k(1) = 1, i.e.

    T_k(x) = argmin_{P ∈ P_k, P(1)=1} max_{x ∈ [−1,1]} |P(x)|.    (2.12)
There exists a compact explicit expression for Chebyshev polynomials involving trigonometric functions,

    T_k(x) = cos(k acos(x))             if x ∈ [−1, 1],
             cosh(k acosh(x))           if x > 1,    (2.13)
             (−1)^k cosh(k acosh(−x))   if x < −1.

We seek minimal polynomials on the interval [µ, L], not [−1, 1]. Using the linear mapping

    x → t^{[µ,L]}(x) = (2x − (L + µ))/(L − µ),

we obtain shifted Chebyshev polynomials as

    C_k^{[µ,L]}(x) = T_k(t^{[µ,L]}(x)) / T_k(t^{[µ,L]}(0)),    (2.14)

where we have enforced the normalization constraint C_k^{[µ,L]}(0) = 1. It is also possible to characterize ℓ∞ optimality using an equi-oscillation argument, see (Süli and Mayers, 2003) for example. Using this mapping, the recursion becomes (after simplification)

    C₀^{[µ,L]}(x) = 1,
    C₁^{[µ,L]}(x) = 1 − 2x/(L + µ),    (2.15)
    C_k^{[µ,L]}(x) = (2δ_k/(L − µ)) (L + µ − 2x) C_{k−1}^{[µ,L]}(x) + (1 − 2δ_k(L + µ)/(L − µ)) C_{k−2}^{[µ,L]}(x),  for k ≥ 2,

where

    δ₁ = (L − µ)/(L + µ),  δ_k = (2(L + µ)/(L − µ) − δ_{k−1})^{−1},  for k ≥ 2,

and δ_k ensures C_k^{[µ,L]}(0) = 1. We plot the shifted Chebyshev polynomials of degrees 1, 3 and 5 in Figure 2.2 as illustration.
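To make the recursion (2.15) concrete, here is a small sketch (ours, not from the monograph) that evaluates the shifted Chebyshev polynomial both through the three-term recursion (2.15) and through the closed form (2.14)–(2.13), and checks that the two coincide.

```python
# Minimal sketch: shifted Chebyshev polynomial via the recursion (2.15)
# versus the closed form (2.14), for mu = 1, L = 10.
import numpy as np

def shifted_chebyshev_recursive(x, k, mu, L):
    """Evaluate C_k^{[mu,L]}(x) via the three-term recursion (2.15)."""
    c_prev, c = 1.0, 1.0 - 2.0 * x / (L + mu)      # C_0, C_1
    delta = (L - mu) / (L + mu)                    # delta_1
    if k == 0:
        return c_prev
    for _ in range(2, k + 1):
        delta = 1.0 / (2.0 * (L + mu) / (L - mu) - delta)
        c_prev, c = c, (2.0 * delta / (L - mu) * (L + mu - 2.0 * x) * c
                        + (1.0 - 2.0 * delta * (L + mu) / (L - mu)) * c_prev)
    return c

def shifted_chebyshev_closed_form(x, k, mu, L):
    """Evaluate C_k^{[mu,L]}(x) = T_k(t(x)) / T_k(t(0)) using NumPy's Chebyshev basis."""
    t = lambda y: (2.0 * y - (L + mu)) / (L - mu)
    T_k = np.polynomial.chebyshev.Chebyshev.basis(k)
    return T_k(t(x)) / T_k(t(0))

mu, L, x = 1.0, 10.0, 3.7
for k in range(6):
    assert np.isclose(shifted_chebyshev_recursive(x, k, mu, L),
                      shifted_chebyshev_closed_form(x, k, mu, L))
```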
The shifted Chebyshev polynomials above solve (2.10), hence allow us to bound the convergence rate of first-order methods using (2.9). We now present the resulting algorithm, called the Chebyshev semi-iterative method in (Golub and Varga, 1961).

Suppose we define iterates using C_k^{[µ,L]}(x) as follows,

    x_k − x⋆ = C_k^{[µ,L]}(H)(x₀ − x⋆).

The recursion in (2.15) yields

    x_k − x⋆ = (2δ_k/(L − µ)) ((L + µ)I − 2H)(x_{k−1} − x⋆) + (1 − 2δ_k(L + µ)/(L − µ)) (x_{k−2} − x⋆).

Figure 2.2: Plot of the absolute value of C₁^{[µ,L]}(x), C₃^{[µ,L]}(x) and C₅^{[µ,L]}(x) for λ ∈ [µ, L], where µ = 1 and L = 10. The polynomials are normalized such that C_k^{[µ,L]}(0) = 1. The maximum value of the image of [µ, L] by C_k decreases very fast as k grows, implying a better rate of convergence.

Since the gradient of the function reads ∇f(x_k) = H(x_k − x⋆), we can simplify away x⋆ to get the following recursion

    x_k = (2δ_k/(L − µ)) ((L + µ)x_{k−1} − 2∇f(x_{k−1})) + (1 − 2δ_k(L + µ)/(L − µ)) x_{k−2},

which describes iterates of the Chebyshev method. We summarize it as Algorithm 2. The Chebyshev method is optimal for minimizing quadratics. Its iteration structure is surprisingly simple and intuitive: it involves a gradient descent step with variable step size 4δ_k/(L − µ), combined with a variable momentum term.

Algorithm 2
Chebyshev method
Input: x₀, strong convexity constant µ, smoothness constant L.
Set δ₁ = (L − µ)/(L + µ), and x₁ = x₀ − (2/(L + µ)) ∇f(x₀).
for k = 2, …, k_max do
    Set δ_k = (2(L + µ)/(L − µ) − δ_{k−1})^{−1},
    x_k := x_{k−1} − (4δ_k/(L − µ)) ∇f(x_{k−1}) + (1 − 2δ_k(L + µ)/(L − µ)) (x_{k−2} − x_{k−1}).
end for
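As an illustration, here is a minimal sketch (ours, not from the monograph) of Algorithm 2 on a synthetic quadratic; the δ_k update follows the recursion above, and all names and the instance are ours.

```python
# Minimal sketch: the Chebyshev method (Algorithm 2) on a quadratic f(x) = x'Hx/2 - b'x.
import numpy as np

def chebyshev_method(grad, x0, mu, L, iters):
    delta = (L - mu) / (L + mu)                     # delta_1
    x_prev, x = x0, x0 - 2.0 / (L + mu) * grad(x0)  # x_0, x_1
    for _ in range(2, iters + 1):
        delta = 1.0 / (2.0 * (L + mu) / (L - mu) - delta)
        x_next = (x - 4.0 * delta / (L - mu) * grad(x)
                  + (1.0 - 2.0 * delta * (L + mu) / (L - mu)) * (x_prev - x))
        x_prev, x = x, x_next
    return x

rng = np.random.default_rng(1)
d, mu, L = 30, 1.0, 10.0
Q, _ = np.linalg.qr(rng.standard_normal((d, d)))
H = Q @ np.diag(np.linspace(mu, L, d)) @ Q.T
b = rng.standard_normal(d)
x_star = np.linalg.solve(H, b)
x_k = chebyshev_method(lambda x: H @ x - b, np.zeros(d), mu, L, iters=30)
print(np.linalg.norm(x_k - x_star) / np.linalg.norm(x_star))  # far smaller than for gradient descent
```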
Asymptotically in k, the Chebyshev method converges to Polyak's heavy-ball method, written

    x_k = x_{k−1} − (4/(√L + √µ)²) ∇f(x_{k−1}) + ((√L − √µ)²/(√L + √µ)²) (x_{k−1} − x_{k−2}).

To see this, it suffices to compute the limit of δ_k, written δ∞, solving

    δ∞ = (2(L + µ)/(L − µ) − δ∞)^{−1}

to get

    δ∞ = (√L − √µ)/(√L + √µ).

We obtain Polyak's heavy-ball method by replacing δ_k by δ∞ in Algorithm 2.

The rate of convergence of the Chebyshev method is bounded as in (2.5), with

    ‖x_k − x⋆‖ ≤ ‖C_k^{[µ,L]}(H)(x₀ − x⋆)‖ ≤ ‖x₀ − x⋆‖ max_{x ∈ [µ,L]} |C_k^{[µ,L]}(x)|.    (2.16)

The maximum value is given by evaluating the polynomial at one of the extremities of the interval (Mason and Handscomb, 2002, Chap. 2), i.e.,

    max_{x ∈ [µ,L]} |C_k^{[µ,L]}(x)| = |C_k^{[µ,L]}(L)|.

Using successively (2.14) and (2.13),

    |C_k^{[µ,L]}(L)| = 1/|T_k(t^{[µ,L]}(0))| = 1/cosh(k acosh((L + µ)/(L − µ))).

We get the convergence rate of the Chebyshev method, as stated in the following theorem.
Theorem 2.1.
Suppose we are solving the quadratic minimization problem (2.1) with H ∈ S_n such that µI ⪯ H ⪯ LI. Among all first-order methods satisfying (2.8), the Chebyshev algorithm is optimal, and its rate of convergence is given by

    ‖x_k − x⋆‖ ≤ (2/(ξ^k + ξ^{−k})) ‖x₀ − x⋆‖,  where  ξ = (√(L/µ) + 1)/(√(L/µ) − 1).    (2.17)

Proof.
The optimality of Chebyshev's method follows directly from (2.10), whose solution is given by the shifted Chebyshev polynomial (2.14). It remains to bound (2.16), i.e.

    ‖x_k − x⋆‖ ≤ ‖x₀ − x⋆‖ max_{x ∈ [µ,L]} |C_k^{[µ,L]}(x)| = ‖x₀ − x⋆‖ / cosh(k acosh((L + µ)/(L − µ))).

First, we evaluate the acosh term, for which

    acosh((L + µ)/(L − µ)) = ln((L + µ)/(L − µ) + √(((L + µ)/(L − µ))² − 1)) = ln(ξ),  where  ξ = (√(L/µ) + 1)/(√(L/µ) − 1).

After plugging this result in the cosh,

    1/cosh(k ln(ξ)) = 2/(e^{k ln(ξ)} + e^{−k ln(ξ)}) = 2/(ξ^k + ξ^{−k}),

which gives the desired result.

This rate of convergence may be hard to compare with that of gradient descent due to its more complex expression. However, neglecting the denominator term ξ^{−k}, we can simplify it as follows:

    ‖x_k − x⋆‖ ≤ ((√κ − 1)/(√κ + 1))^k ‖x₀ − x⋆‖.

This matches the rate of Polyak's heavy-ball method, and is better than that of gradient descent in (2.5), which reads

    ‖x_k − x⋆‖ ≤ ((κ − 1)/(κ + 1))^k ‖x₀ − x⋆‖.

We summarize this in the following corollary, comparing the number of iterations required to reach a target accuracy ε.

Corollary 2.2.
For fixed µ and L, to ensure ‖x_k − x⋆‖ ≤ ε, we need
• k ≥ (L/µ) log(‖x₀ − x⋆‖/ε) iterations of gradient descent,
• k ≥ √(L/µ) log(‖x₀ − x⋆‖/ε) iterations of the Polyak-Chebyshev method.

Proof.
For gradient descent, a sufficient condition on the number of iterations to reach an ε accuracy reads

    ‖x_k − x⋆‖ ≤ ((κ − 1)/(κ + 1))^k ‖x₀ − x⋆‖ ≤ ε.

Taking logarithms on both sides, assuming ‖x₀ − x⋆‖ > ε, we get

    k ≥ log(‖x₀ − x⋆‖/ε) / log((κ + 1)/(κ − 1)).

However, log((1 + x)/(1 − x)) > x for 0 < x ≤ 1. This means the condition above can be strengthened by considering

    k ≥ κ log(‖x₀ − x⋆‖/ε).

This gives the desired result for gradient descent. With the same approach, we also get the result for the Chebyshev algorithm.

This corollary shows in particular that the Chebyshev method can be √κ times faster than gradient descent. This means a factor 100 speedup on problems with a (reasonable) condition number of 10⁴, which is very significant. Furthermore, when the dimension of the ambient space is large, and without further assumptions on the spectrum of H, these worst-case guarantees on Chebyshev's method are essentially unimprovable, see for example (Nemirovsky and Yudin, 1983a; Nemirovsky, 1992; Nesterov, 2003).

However, to achieve this optimal rate of convergence, the Chebyshev method requires knowledge of the constants µ and L. While L can be computed easily since L = ‖H‖, the constant µ is usually more difficult to estimate. We will now introduce another nonlinear acceleration technique, known as Anderson acceleration, which achieves the same convergence rate as Chebyshev's method, but runs without assuming knowledge of µ.
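As a quick sanity check on these figures, the following two-line computation (ours, with a hypothetical target accuracy) illustrates the √κ gap between the two iteration counts of Corollary 2.2.

```python
# Minimal sketch: iteration counts from Corollary 2.2 for kappa = 10**4 and a
# hypothetical ratio ||x_0 - x*|| / eps = 10**6 (both values are ours, for illustration).
import numpy as np
kappa, ratio = 1e4, 1e6
k_gradient  = kappa * np.log(ratio)            # ~ 1.4e5 iterations
k_chebyshev = np.sqrt(kappa) * np.log(ratio)   # ~ 1.4e3 iterations, a factor sqrt(kappa) fewer
print(int(k_gradient), int(k_chebyshev))
```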
Nonlinear Acceleration

In what follows, we see that the main argument used in the Chebyshev method can be adapted beyond quadratic problems. This follows a pattern that is known in numerical analysis as nonlinear acceleration, and seeks to accelerate the convergence of sequences by extrapolation, using nonlinear averages. These methods are known under various names, starting with Aitken's Δ² (Aitken, 1927), Wynn's epsilon algorithm (Wynn, 1956) or Anderson acceleration (Anderson, 1965) (a survey of these techniques can be found in (Sidi et al., 1986)). These results can be transposed to the optimization setting, and a regularization argument produces convergence bounds.

We focus again on a generic unconstrained convex minimization problem written

    minimize f(x)    (3.1)

in the variable x ∈ R^d, where f is twice continuously differentiable in a neighborhood of its minimizers x⋆. In what follows, some of the ideas behind Chebyshev acceleration, developed on quadratic models in Chapter 2, are adapted to arbitrary convex minimization problems. Quite naturally, this stems from a local quadratic approximation of the objective function f written

    f(x) = f⋆ + (x − x⋆)ᵀ H (x − x⋆)/2 + o(‖x − x⋆‖²)    (3.2)

where H = ∇²f(x⋆) ∈ S_d is the Hessian matrix of f at x⋆, satisfying µI ⪯ H ⪯ LI. Problem (3.2) can be written in the format of problem (2.1) in Chapter 2, with

    f(x) = xᵀHx/2 + bᵀx + c,  where  b = −Hx⋆,  c = f⋆ + (x⋆)ᵀHx⋆/2,

but the form in (3.2) simplifies notations below.
We could apply Chebyshev's method directly to this approximate quadratic model in (3.2), but we can get better weights by adaptively forming the coefficients of the polynomial P_k(H) in Proposition 2.1 to improve convergence speed. The Chebyshev acceleration arguments of Chapter 2 are then used in this case as a wrapper (or outer loop) around existing first-order methods, and form better estimates of the optimum using linear combinations of the sequence of gradients produced by an underlying first-order algorithm. This technique is called nonlinear acceleration and is known under various other names such as Anderson or minimum polynomial acceleration (see §3.6 for more references).

More concretely, assume we run a first-order method of the form

    x_{k+1} ∈ x₀ + span{∇f(x₀), ∇f(x₁), …, ∇f(x_k)},    (3.3)

producing a sequence of gradients ∇f(x₀), ∇f(x₁), …, ∇f(x_k). Nonlinear acceleration approximates a minimizer x⋆ using a linear combination of the iterates x_i by minimizing the norm of the gradient, since ∇f(x⋆) = 0, solving for

    c⋆ = argmin_{𝟙ᵀc = 1} ‖∇f(Σ_{i=0}^{k} c_i x_i)‖.    (3.4)

While we linearly combine the iterates x_i, the coefficients c_i depend on ∇f and the x_i, hence the term nonlinear acceleration.

Solving (3.4) is of course just as hard as solving (3.1) in general. However, when f is a quadratic function

    f(x) = (x − x⋆)ᵀ H (x − x⋆)/2 + f⋆,    (3.5)

the gradient in (3.4) is a linear function of x and reads

    ∇f(Σ_{i=0}^{k} c_i x_i) = H(Σ_{i=0}^{k} c_i x_i − x⋆).

When the coefficients c_i sum to one, the gradient of the linear combination becomes a linear combination of gradients, satisfying

    ∇f(Σ_{i=0}^{k} c_i x_i) = H(Σ_{i=0}^{k} c_i x_i − x⋆) = Σ_{i=0}^{k} c_i H(x_i − x⋆) = Σ_{i=0}^{k} c_i ∇f(x_i).    (3.6)

This means that, in the quadratic case, subproblem (3.4) becomes a simple quadratic program with one equality constraint, involving known quantities (the gradients at each iterate of the underlying first-order method), written

    c⋆ = argmin_{𝟙ᵀc = 1} ‖Σ_{i=0}^{k} c_i ∇f(x_i)‖,    (3.7)

in the variable c ∈ R^{k+1}. In matrix form, the problem reads

    c⋆ = argmin_{𝟙ᵀc = 1} ‖Gc‖,    (3.8)

in the variable c ∈ R^{k+1}, where G = [∇f(x₀), …, ∇f(x_k)] is the matrix formed by concatenating all gradients produced by the underlying first-order method. This quadratic subproblem requires solving a small linear system of (k + 1) equations, with an explicit solution computed as

    c⋆ = z / (Σ_{i=0}^{k} z_i),  where  z = (GᵀG)⁻¹𝟙.

The previous section shows that it is possible to efficiently minimize the gradient by combining previous iterates x_i. However, even if we use the information contained in ∇f(x_k) to form the coefficients c_i, the extrapolation y_{k+1} only belongs to the span of gradients up to ∇f(x_{k−1}), thus implicitly discarding the information contained in ∇f(x_k). We can incorporate ∇f(x_k) by mixing the points x_i and their gradients as follows,

    y_{k+1} = Σ_{i=0}^{k} c_i x_i − h_{k+1} Σ_{i=0}^{k} c_i ∇f(x_i),

where h_{k+1} is the mixing parameter. This corresponds to a simple gradient step at the extrapolated point Σ_{i=0}^{k} c_i x_i. In fact, using equation (3.6), we obtain

    y_{k+1} = Σ_{i=0}^{k} c_i x_i − h_{k+1} ∇f(Σ_{i=0}^{k} c_i x_i),

which gives a simple strategy for slightly improving the rate of convergence of nonlinear acceleration using ∇f(x_k).
In practice, the mixing parameter h is not necessary outside of certain contexts (see §“online acceleration” in Section 3.5.4). Altogether, we obtain the nonlinear acceleration method described as Algorithm 3.
Algorithm 3
Nonlinear Acceleration
Input:
Sequence of pairs {(x_i, ∇f(x_i))}_{i=0,…,k}; step size h.
1: Form the matrix G = [∇f(x₀), …, ∇f(x_k)], and compute GᵀG.
2: Solve the linear system z = (GᵀG)⁻¹𝟙, and compute c = z/(zᵀ𝟙).
Output:
Return the extrapolation Σ_{i=0}^{k} c_i (x_i − h∇f(x_i)).
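The following minimal sketch (ours, not from the monograph) implements Algorithm 3 and applies it, offline, to a few gradient-descent iterates on a synthetic quadratic; all names and the instance are ours.

```python
# Minimal sketch of Algorithm 3: combine iterates x_i with weights c minimizing
# ||sum_i c_i grad_i|| subject to sum_i c_i = 1, see (3.7)-(3.8).
import numpy as np

def nonlinear_acceleration(xs, gs, h=0.0):
    G = np.column_stack(gs)                      # gradient matrix G = [g_0, ..., g_k]
    z = np.linalg.solve(G.T @ G, np.ones(len(gs)))
    c = z / z.sum()                              # normalize so the weights sum to one
    return (np.column_stack(xs) - h * G) @ c     # extrapolation sum_i c_i (x_i - h g_i)

# Offline use on a few fixed-step gradient descent iterates for a synthetic quadratic.
rng = np.random.default_rng(2)
d, mu, L = 30, 1.0, 10.0
Q, _ = np.linalg.qr(rng.standard_normal((d, d)))
H = Q @ np.diag(np.linspace(mu, L, d)) @ Q.T
b = rng.standard_normal(d)
x, xs, gs = np.zeros(d), [], []
for _ in range(6):
    g = H @ x - b
    xs.append(x.copy()); gs.append(g)
    x = x - 2.0 / (L + mu) * g
y = nonlinear_acceleration(xs, gs)
# The extrapolated point has a smaller gradient norm than the last iterate.
print(np.linalg.norm(H @ y - b), np.linalg.norm(H @ xs[-1] - b))
```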
The complexity of this algorithm is O(k²n + k³), where k is the number of iterations and n the ambient dimension. The first term comes from the matrix-matrix multiplication in step 1, the second from inverting a (k + 1) × (k + 1) matrix in step 2. The length k of the sequence is typically much smaller than n, thus the complexity is roughly O(k²n). If an extrapolation step is computed each time a new gradient is inserted, the per-iteration complexity can be reduced to O(nk) by computing the matrix GᵀG and the coefficients c⋆ using low-rank updates.

The accuracy of nonlinear acceleration can be quantified when the sequence of points x_i is generated by a first-order method, under mild assumptions. The following ensures that the matrix GᵀG is invertible.

Assumption 1.
At iteration k of (3.3), the dimension of the space x₀ + span{∇f(x₀), …, ∇f(x_k)} is k + 1.

Under this assumption, the next theorem shows that Algorithm 3 is at least as efficient as Chebyshev acceleration. We first show that the solution of (3.7), combined with the latest gradient, allows obtaining guarantees on the norm of the gradient at the extrapolated point.

Theorem 3.1.
Consider the quadratic function f described in (3.5). Let the x_i be generated by a first-order method (3.3), satisfying Assumption 1, and let y_{k+1} be the output of Algorithm 3 on the sequence {(x_i, ∇f(x_i))}_{i=0,…,k}. Then

    ‖∇f(y_{k+1})‖ ≤ ‖I − hH‖ min_{𝟙ᵀc = 1} ‖Σ_{i=0}^{k} c_i ∇f(x_i)‖,    (3.9)

where ‖I − hH‖ ≤ max{|1 − hµ|, |1 − hL|}.

Proof.
We have the following chain of equalities and inequalities:

    ‖∇f(y_{k+1})‖ = ‖∇f(Σ_{i=0}^{k} c_i (x_i − h∇f(x_i)))‖
                 = ‖H Σ_{i=0}^{k} c_i (I − hH)(x_i − x⋆)‖
                 ≤ ‖I − hH‖ ‖Σ_{i=0}^{k} c_i H(x_i − x⋆)‖
                 = ‖I − hH‖ ‖Σ_{i=0}^{k} c_i ∇f(x_i)‖,

and µI ⪯ H ⪯ LI yields the desired result.

This result shows that, once the last step h is fixed, the rate of convergence is essentially controlled by the optimal value of a minimization problem. This value can be bounded using a Chebyshev argument, and the next proposition thus connects the rate of convergence of nonlinear acceleration with the maximum of Chebyshev polynomials. Unlike Chebyshev acceleration, whose coefficients are predetermined and minimize the worst-case rate of convergence, the nonlinear acceleration technique adaptively looks for the best coefficients, and is thus at least as fast as Chebyshev acceleration.

Proposition 3.1.
Consider the quadratic function f described in (3.5). Let the x_i be generated by a first-order method (3.3), satisfying Assumption 1, and let y_{k+1} be the output of Algorithm 3 on the sequence {(x_i, ∇f(x_i))}_{i=0,…,k}. In that case, nonlinear acceleration finds the best polynomial, satisfying

    min_{𝟙ᵀc = 1} ‖Σ_{i=0}^{k} c_i ∇f(x_i)‖ = min_{P ∈ P_k, P(0)=1} ‖P(H)∇f(x₀)‖,

where P_k is the set of polynomials of degree at most k. This implies

    min_{𝟙ᵀc = 1} ‖Σ_{i=0}^{k} c_i ∇f(x_i)‖ ≤ (2/(ξ^k + ξ^{−k})) ‖∇f(x₀)‖,  with  ξ = (√(L/µ) + 1)/(√(L/µ) − 1).

Proof.
We start with the definition of a first-order method, yielding iterates x_i such that

    x_i ∈ x₀ + span{∇f(x₀), …, ∇f(x_{i−1})}.

From Proposition 2.1 in Section 2.3, if x_{i+1} follows the previous recursion (for a quadratic function), then x_i reads

    x_i = x⋆ + P_i(H)(x₀ − x⋆),  with  P_i ∈ P_i,  P_i(0) = 1.

The same idea holds for its gradient,

    ∇f(x_i) = H(x_i − x⋆) = H P_i(H)(x₀ − x⋆) = P_i(H)(H(x₀ − x⋆)) = P_i(H)∇f(x₀).    (3.10)

We inject this expression in the objective of (3.7),

    min_{𝟙ᵀc = 1} ‖Σ_{i=0}^{k} c_i ∇f(x_i)‖ = min_{𝟙ᵀc = 1} ‖(Σ_{i=0}^{k} c_i P_i(H)) ∇f(x₀)‖.    (3.11)

By Assumption 1, the dimension of the span at iteration k is k + 1. This means all gradients are linearly independent. Therefore, because of equation (3.10), the polynomials P_i are also linearly independent, and hence {P_i}_{i=0,…,k} is a basis of the space P_k. Finally, because P_i(0) = 1 and 𝟙ᵀc = 1, we can rephrase the objective (3.11) as

    min_{𝟙ᵀc = 1} ‖(Σ_{i=0}^{k} c_i P_i(H)) ∇f(x₀)‖ = min_{P ∈ P_k : P(0)=1} ‖P(H)∇f(x₀)‖.
A convergence rate is thus given by

    ‖∇f(y_{k+1})‖ ≤ ‖I − hH‖ ‖Σ_{i=0}^{k} c_i ∇f(x_i)‖ = ‖I − hH‖ min_{P ∈ P_k : P(0)=1} ‖P(H)∇f(x₀)‖.

Because the shifted Chebyshev polynomial from Theorem 2.1 is a feasible solution of this last minimization problem, we can use it to bound the last term.

In short, nonlinear acceleration adaptively looks for the best combination of previous iterates given the information stored in previous gradients, as opposed to Chebyshev acceleration, which uses the same worst-case optimal polynomial in all cases. Moreover, Algorithm 3 does not require knowledge of the smoothness and strong convexity parameters. Note however that the iteration complexity of nonlinear acceleration grows with the number of iterates, so it is typically only used on a limited number of steps.
Nonlinear acceleration suffers from some serious drawbacks when used outside the restricted setting of quadratic functions. In fact, vanilla nonlinear acceleration is highly numerically unstable. The origin of this problem is the conditioning of the matrix GᵀG, used to compute the coefficients c_i.

To illustrate this statement, suppose we are given a noisy sequence of gradients in Algorithm 3,

    NonlinAcc({(x₀, ∇f(x₀) + e₀); …; (x_k, ∇f(x_k) + e_k)}, h),

where ‖e_i‖ ≤ ε. (Scieur et al., 2016, Proposition 3.1) shows that the (relative) distance between c̃ (the coefficients computed using the noisy sequence above) and its noise-free version c is bounded as

    ‖c − c̃‖/‖c‖ ≤ O(‖E‖ ‖((G + E)ᵀ(G + E))⁻¹‖),    (3.12)

where E ≜ [e₀, …, e_k] is the noise matrix. In this bound, the perturbation impacts the solution proportionally to its norm and to the conditioning of G + E.

Unfortunately, even for small perturbations, the condition number of G + E and the norm of the vector c are usually huge. In fact, (Scieur et al., 2016) show that G has a Krylov matrix structure, which is notoriously poorly conditioned (Tyrtyshnikov, 1994). This means even a very small perturbation E in (3.12) can have a significant impact on performance.
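To illustrate this numerically, here is a small sketch (ours, not from the monograph) that stacks successive gradient-descent gradients on a quadratic into G (the Krylov structure of G is made explicit just below) and prints the condition number of GᵀG, which blows up after only a handful of columns, in line with Figure 3.1.

```python
# Minimal sketch: conditioning of G'G when G stacks successive gradients of
# fixed-step gradient descent on a quadratic (G is then a Krylov matrix).
import numpy as np

rng = np.random.default_rng(4)
d, mu, L = 50, 1.0, 10.0
Q, _ = np.linalg.qr(rng.standard_normal((d, d)))
H = Q @ np.diag(np.linspace(mu, L, d)) @ Q.T
g = rng.standard_normal(d)                      # plays the role of grad f(x_0)
gs = [g]
for _ in range(9):
    gs.append((np.eye(d) - 2.0 / (L + mu) * H) @ gs[-1])   # grad f(x_{k+1}) = (I - gamma H) grad f(x_k)
G = np.column_stack(gs)
print(np.linalg.cond(G.T @ G))                  # grows extremely fast with the number of columns
```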
Figure 3.1: Illustration of the sensitivity of nonlinear acceleration, when applied to different methods minimizing a random quadratic function. Panel (a): the condition number of the matrix GᵀG, which grows exponentially with its size (the plateau at the end is caused by numerical errors). Panel (b): the norm of the vector of coefficients c, which grows quickly over time.

We briefly illustrate the link between G and Krylov matrices. Consider for simplicity the gradient descent algorithm with step size γ on a quadratic function. Equation (2.3) from Section 2.3 shows that the iterates follow the recursion

    x_{k+1} − x⋆ = (I − γH)^k (x₀ − x⋆)  ⇔  ∇f(x_{k+1}) = (I − γH)^k ∇f(x₀).

Replacing gradients in the matrix G by these expressions gives

    G = [∇f(x₀), (I − γH)∇f(x₀), (I − γH)²∇f(x₀), …],

which shows that G is in fact a Krylov matrix.

Figure 3.1 shows the norm of c and the condition number of GᵀG when nonlinear acceleration is used to accelerate gradient descent, fast gradient descent (Nesterov, 2013), and itself (in the online setting, algorithm (3.23)). We see that after only 3 iterations, the system is considered as singular (i.e., its smallest eigenvalue is smaller than machine precision).

To stabilize the method, it is common to regularize the linear system. The resulting algorithm is often referred to as regularized nonlinear acceleration (RNA) (Scieur et al., 2016). The following subsection is devoted to the theoretical properties of this method.

Regularized nonlinear acceleration consists in using Algorithm 3 with a regularization term. In short, the base operation used by RNA solves

    argmin_{𝟙ᵀc = 1} ‖Gc‖²/‖GᵀG‖ + λ‖c − c_ref‖²    (3.13)

in the variable c ∈ R^{k+1}, where c_ref is a reference vector of coefficients, and the regularization forces c to be close to c_ref. Ideally, c_ref should sum to one. For instance, setting this vector to 𝟙/k makes c closer to averaging. Another possibility is to concentrate c_ref on the last iterate. The division by ‖GᵀG‖ is for scaling purposes only, which makes λ unit-free.

Algorithm 4
Regularized Nonlinear Acceleration
Input:
Sequence of pairs {(x_i, ∇f(x_i))}_{i=0,…,k}, step size h, regularization parameter λ, reference vector c_ref.
1: Form G = [∇f(x₀), …, ∇f(x_k)], and compute G := GᵀG/‖GᵀG‖.
2: Solve the linear system w = λ(G + λI)⁻¹ c_ref.
3: Solve the linear system z = (G + λI)⁻¹ 𝟙.
4: Compute the coefficients c = w + z (1 − 𝟙ᵀw)/(𝟙ᵀz).
Output:
Return the extrapolation Σ_{i=0}^{k} c_i (x_i − h∇f(x_i)).

The resulting method is provided in Algorithm 4, and is slightly more complicated than its unregularized counterpart. In the special case where c_ref = 𝟙/k, the procedure simplifies to Algorithm 5.

Algorithm 5
Regularized Nonlinear Acceleration (Simplified)
Input:
See Algorithm 4. Ensure that c_ref = 𝟙/k.
1: Form G = [∇f(x₀), …, ∇f(x_k)], and compute G := GᵀG/‖GᵀG‖.
2: Solve the linear system z = (G + λI)⁻¹ 𝟙/k.
3: Compute the coefficients c = z/(𝟙ᵀz).
Output:
Return the extrapolation Σ_{i=0}^{k} c_i (x_i − h∇f(x_i)).

Because of the regularization term, the output c of Algorithm 4 is less sensitive to noise. The next section introduces the concept of perturbed quadratic gradients, and illustrates that regularization is crucial in some applications.
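A minimal sketch (ours, not from the monograph) of the simplified regularized extrapolation step of Algorithm 5, with c_ref = 𝟙/k; the helper name and the default λ are ours.

```python
# Minimal sketch of Algorithm 5: regularized nonlinear acceleration with c_ref = 1/k.
import numpy as np

def rna(xs, gs, lam=1e-8, h=0.0):
    G = np.column_stack(gs)
    GtG = G.T @ G
    GtG = GtG / np.linalg.norm(GtG, 2)            # normalize so that lam is unit-free
    k = len(gs)
    z = np.linalg.solve(GtG + lam * np.eye(k), np.ones(k) / k)
    c = z / z.sum()                               # enforce sum_i c_i = 1
    return (np.column_stack(xs) - h * G) @ c      # extrapolation sum_i c_i (x_i - h g_i)
```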
This section introduces the model of perturbed quadratic gradients. We first focus on a deterministic non-quadratic function f(x), with gradient ∇f(x). As before, the iterates x_i come from a first-order algorithm, i.e.

    x_{k+1} ∈ x₀ + span{∇f(x₀), ∇f(x₁), …, ∇f(x_k)}.

However, ∇f(x) is no longer the gradient of a quadratic function. Instead, the gradient contains some error term e(x), and reads, as in (3.2),

    ∇f(x) = H(x − x⋆) + e(x),    (3.14)

for some matrix H such that µI ⪯ H ⪯ LI. We detail two scenarios under which this decomposition holds locally.

Example 1: Twice-differentiable functions. Consider the problem

    min_x f(x),

in the variable x ∈ R^d, for f a twice-differentiable function. For any x around its optimal point x⋆, the function satisfies

    f(x) = f⋆ + ∇f(x⋆)ᵀ(x − x⋆) + (x − x⋆)ᵀ∇²f(x⋆)(x − x⋆)/2 + O(‖x − x⋆‖³),  with ∇f(x⋆) = 0.

This means f(x) can be approximated by the quadratic function (3.14), with H = ∇²f(x⋆). Using the same technique, its gradient reads

    ∇f(x) = ∇f(x⋆) + ∇²f(x⋆)(x − x⋆) + e(x) = H(x − x⋆) + e(x),

where e(x) is the first-order Taylor remainder of the gradient. Thus, minimizing a non-quadratic function is equivalent to minimizing a perturbed quadratic one, whose error on the gradient is of second order and satisfies

    e(x_k) = ∇f(x_k) − H(x_k − x⋆) = O(‖x_k − x⋆‖²),

where H = ∇²f(x⋆).

Example 2: Stochastic iterations.
In the case where the gradient is corrupted by stochastic noise, the iterate x_{k+1} reads

    x_{k+1} ∈ x₀ + span{∇̃f(x₀), ∇̃f(x₁), …, ∇̃f(x_k)},

where ∇̃f(x_i) is a stochastic estimate of the true gradient ∇f(x_i) that satisfies

    E[∇̃f(x_i)] = ∇f(x_i),  E[‖∇̃f(x_i) − ∇f(x_i)‖²] ≤ σ².

Here the gradient ∇̃f(x) is a perturbed version of the gradient of a quadratic function, with

    ∇̃f(x_i) = H(x_i − x⋆) + e_i,

where e_i is now the sum of a stochastic noise and a Taylor remainder. For stochastic noise, Theorem 3.2 still holds, but in expectation.

Using a perturbation argument, it is possible to derive a convergence guarantee for regularized nonlinear acceleration. We state here a simplified version of (Scieur et al., 2018, Theorem 3.2), describing how regularization balances acceleration and stability in the algorithm. For simplicity, we state the next theorem specifically for Algorithm 5, but one can extend the results to any c_ref.
Theorem 3.2.
Let x_extr be the output of Algorithm 5 applied to the sequence {(x₀, ∇f(x₀)), …, (x_k, ∇f(x_k))}, where ∇f(x) follows (3.14) with ‖e(x_i)‖ ≤ ε. Then,

    ‖H(x_extr − x⋆)‖ ≤ ‖I − hH‖ C_k^{λ,µ,L} ‖H(x₀ − x⋆)‖   (acceleration)
                       + O(√(ε (1 + 1/λ) ‖H(x₀ − x⋆)‖))     (stability)

where C_k^{λ,µ,L} is a constant, corresponding to the maximum value on the interval [µ, L] of the regularized Chebyshev polynomial,

    C_k^{λ,µ,L} = min_{P ∈ P_k, P(0)=1} max_{x ∈ [µ,L]} P(x)² + (λ/‖GᵀG‖)‖P‖²,    (3.15)

in which ‖P‖ is the norm of the coefficients of the polynomial.

This theorem shows that regularization, as expected, stabilizes the algorithm while slowing down the convergence rate. The regularized Chebyshev polynomial interpolates between the classical shifted Chebyshev polynomial C_k^{[µ,L]} from (2.14) and the polynomial whose coefficients are all equal to 1/k, i.e., the polynomial that averages the iterates x_i. By construction, its maximum value is always higher than that of the Chebyshev polynomial, but the norm of its coefficients is smaller. Unfortunately, there is for now no known explicit expression for the regularized Chebyshev polynomial. However, its value can be computed numerically using sum-of-squares techniques, and it can be upper bounded using proxy Chebyshev polynomials (Barré et al., 2020b).

This interpolation between Chebyshev coefficients and simple averaging of the iterates is also natural in the context of noisy iterations. When the noise is negligible, a small regularization parameter combines the iterates x_i using classical Chebyshev weights. In cases where the noise is more significant, a larger regularization makes the coefficients c closer to the average 𝟙/k, which reduces the noise but increases the value of C_k^{λ,µ,L}, and hence reduces the acceleration effect.

We now quickly discuss the behavior of the RNA algorithm when the initialization x₀ gets closer to the solution x⋆. In particular, the next proposition shows that if the perturbation magnitude ε decreases fast enough relative to ‖H(x₀ − x⋆)‖, the parameter λ can be adjusted to ensure an asymptotically optimal rate of convergence, i.e. a rate comparable to that of the Chebyshev method on quadratics (Nemirovsky and Yudin, 1983b). The intuition is fairly simple: as x₀ gets closer to x⋆, the Taylor expansion of the function f gets closer to a quadratic. In this setting, the rate of convergence of RNA converges to the one of Theorem 3.1.

Proposition 3.2.
Let x_extr be the output of Algorithm 5 applied to the sequence {(x₀, ∇f(x₀)), …, (x_k, ∇f(x_k))}, where ∇f(x) follows (3.14) with ‖e(x_i)‖ ≤ ε. Assume the error magnitude ε satisfies

    ε ≤ O(‖x₀ − x⋆‖^α),  α > 1.

Then, if λ = O(‖x₀ − x⋆‖^s) with 0 < s < α − 1,

    lim_{x₀ → x⋆} ‖H(x_extr − x⋆)‖ / ‖H(x₀ − x⋆)‖ ≤ ‖I − hH‖ · 2/(ξ^k + ξ^{−k}),  where  ξ = (√(L/µ) + 1)/(√(L/µ) − 1),

which corresponds to the rate of Chebyshev acceleration from Theorem 2.1.

Proof.
We start with the result of Theorem 3.2, and divide both sides by ‖H(x₀ − x⋆)‖:

    ‖H(x_extr − x⋆)‖/‖H(x₀ − x⋆)‖ ≤ ‖I − hH‖ (C_k^{λ,µ,L} + O(√(ε (1 + 1/λ) / ‖H(x₀ − x⋆)‖))).

Let λ = ‖H(x₀ − x⋆)‖^s and ε = ‖H(x₀ − x⋆)‖^α. Therefore,

    ‖H(x_extr − x⋆)‖/‖H(x₀ − x⋆)‖ ≤ ‖I − hH‖ (C_k^{λ,µ,L} + O(√(‖H(x₀ − x⋆)‖^{α−1} + ‖H(x₀ − x⋆)‖^{α−1−s}))).

When x₀ → x⋆, we have

    ‖H(x₀ − x⋆)‖^{α−1} → 0   since α > 1,
    ‖H(x₀ − x⋆)‖^{α−1−s} → 0   since s < α − 1.

Finally, the regularization parameter λ/‖GᵀG‖ in (3.15) goes to 0; therefore, the regularized Chebyshev polynomial in (3.15) converges to the regular (shifted) Chebyshev polynomial.
In short, the previous result shows that an asymptotically optimal rate of convergence is achieved as soon as λ goes to zero, but more slowly than ε. This is possible, for instance, when accelerating gradient descent on twice-differentiable functions. Indeed, in this setting, the error decreases as O(‖x₀ − x⋆‖²), so α = 2 > 1 satisfies the assumption. In contrast, asymptotic acceleration is out of reach when the perturbation does not vanish as x₀ approaches x⋆. This happens, for instance, when accelerating fixed-step stochastic gradient descent. In this case ε = O(1) and asymptotic acceleration is not possible. However, this is not a surprise, as the fixed-step version of SGD itself does not converge to the optimum.
So far, we have discussed the acceleration of optimization algorithms solving unconstrained problems. Unfortunately, extending these results to the constrained case is not straightforward. For instance, when minimizing a function f over a restricted domain C ⊆ R^d, there is no known mechanism to ensure that the output of Algorithm 3 (or its regularized variant) remains inside C. Moreover, projecting the extrapolated point onto the set C does not guarantee an improvement in accuracy.

Even worse, our definition of first-order methods (3.3) does not handle algorithms possibly involving projections: after a projection step, the resulting iterate might be outside the span of previous gradients. Therefore, the whole intuition of nonlinear acceleration developed in previous sections vanishes, as well as its convergence guarantees.

Recently, Mai and Johansson (2020) adapted the Anderson acceleration method to handle a large class of constrained and non-smooth problems, using Clarke's generalized Jacobian (Clarke, 1990) and semi-smoothness (Mifflin, 1977; Qi and Sun, 1993). Before providing the technical details, the next section introduces proximal gradient descent, as well as some notation.

Consider the composite optimization problem

    min_{x ∈ R^d} f(x) + φ(x),    (3.16)

where f(x) is a smooth strongly convex function and φ(x) is convex, potentially non-smooth, with a tractable proximal operator

    prox_{hφ}(x) ≜ argmin_y { hφ(y) + (1/2)‖x − y‖² }.    (3.17)

The composite optimization problem (3.16) can be easily solved using proximal gradient descent, whose recursion reads

    x_{k+1} = prox_{hφ}(x_k − h∇f(x_k)).    (3.18)

We skip most of the details on proximal algorithms and refer the reader to e.g. (Parikh and Boyd, 2014) for an excellent survey. The proximal step corresponds to an implicit update. That is, using the first-order optimality condition on (3.17),

    0 ∈ ∂(φ(y) + (1/(2h))‖x − y‖²)  ⇔  y ∈ x − h∂φ(y),

where ∂ stands for the subdifferential. Injecting this property in the proximal gradient step (3.18) gives the recursion

    x_{k+1} = x_k − h(∇f(x_k) + ∂φ(x_{k+1})).    (3.19)

The term implicit comes from the fact that the (sub)gradient of the “proximal” function φ is evaluated at x_{k+1}.

Many constrained optimization problems have a composite structure as in (3.16). For example, consider the constrained optimization problem

    min_{x ∈ C} f(x),    (3.20)

where C ⊆ R^d is a convex set. We can form an equivalent problem of the form (3.16) by letting φ(x) be the indicator function of the set, i.e.,

    φ(x) = +∞ if x ∉ C,  φ(x) = 0 if x ∈ C.

Note that this does not mean that the proximal operation is easy to compute, as it might largely depend on the set. Another typical example is ℓ₁ regularization, where φ(x) = ‖x‖₁. In this case, the proximal operator is called soft thresholding, with

    prox_{h‖·‖₁}(x) = 0 if x ∈ [−h, h],  x − h if x > h,  x + h if x < h,

applied componentwise. This operator is commonly used in the LASSO problem, to exploit the sparsity-inducing properties of ℓ₁ regularization.
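As an illustration of (3.17)–(3.18) for φ = ‖·‖₁, here is a short sketch (ours, not from the monograph) of the soft-thresholding operator and one proximal gradient step; function names are ours.

```python
# Minimal sketch: soft-thresholding (prox of h*||.||_1) and one proximal gradient step (3.18).
import numpy as np

def soft_threshold(x, h):
    # componentwise: 0 on [-h, h], otherwise shifted toward zero by h
    return np.sign(x) * np.maximum(np.abs(x) - h, 0.0)

def proximal_gradient_step(x, grad_f, h):
    return soft_threshold(x - h * grad_f(x), h)   # x_{k+1} = prox_{h phi}(x_k - h grad f(x_k))
```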
In the previous sections, we considered first-order methods (3.3), where x_{k+1} belongs to the span of previous gradients ∇f(x_i). Unfortunately, proximal gradient descent (3.18) does not satisfy this requirement, due to the proximal operator. Indeed, after expanding (3.19), we obtain

    x_{k+1} ∈ x₀ + span{∇f(x₀) + ∂φ(x₁), …, ∇f(x_k) + ∂φ(x_{k+1})},

which does not satisfy definition (3.3). We can however consider another sequence of points z_k, satisfying

    z_{k+1} ∈ x₀ + span{∇f(x₀) + ∂φ(x₀), …, ∇f(x_k) + ∂φ(x_k)},    (3.21)

such that x_{k+1} = prox_{hφ}(z_{k+1}). In the case of proximal gradient descent, this can be achieved by considering the sequence

    z_{k+1} = x_k − h∇f(x_k),  x_{k+1} = prox_{hφ}(z_{k+1}).

The modified sequence is simply an intermediate iterate between the proximal operator and the gradient step, whose iterates z_k follow

    z_{k+1} = prox_{hφ}(z_k) − h∇f(prox_{hφ}(z_k)).

This simple trick allows the derivation of convergence bounds for proximal nonlinear acceleration, with little change to the algorithm. In particular, Mai and Johansson (2020) show that using Algorithm 6 in the presence of a proximal operator does not change the convergence analysis, assuming that the function φ is twice epi-differentiable and twice differentiable at the solution x⋆. We refer the reader to Rockafellar and Wets (1998, Section 13) for a comprehensive treatment of epi-differentiability.

Algorithm 6
Online Regularized Nonlinear Acceleration with Proximal operator
Input:
Step size h, regularization parameter λ, reference vector c_ref, initial point z₀ with x₀ = prox_{hφ}(z₀).
for k = 0, …, maxIter do
    {// Perform one proximal gradient step}
    Compute g_k = x_k − h∇f(x_k), and r_k = g_k − z_k.
    {// Compute the coefficients of nonlinear acceleration}
    Form G = [r₀, …, r_k], and compute G := GᵀG/‖GᵀG‖.
    Solve the linear system w = λ(G + λI)⁻¹ c_ref.
    Solve the linear system z = (G + λI)⁻¹ 𝟙.
    Compute the coefficients c = w + z (1 − 𝟙ᵀw)/(𝟙ᵀz).
    {// Compute the extrapolation}
    z_{k+1} = Σ_{i=0}^{k} c_i g_i.
    x_{k+1} = prox_{hφ}(z_{k+1}).
end for
Output: Return x_{k+1}.

In what follows, we discuss a few strategies to improve the performance of nonlinear acceleration.
The nonlinear acceleration algorithm computes an extrapolated point Σ_{i=0}^{k} c_i x_i and an extrapolated descent direction Σ_{i=0}^{k} c_i ∇f(x_i). Therefore, we can check whether (1) the extrapolated point is better than the other iterates x_i and (2) the extrapolated direction is a descent direction. More formally, this means checking

    f(Σ_{i=0}^{k} c_i x_i) < min_{i ∈ {0,…,k}} f(x_i),  and  (Σ_{i=0}^{k} c_i ∇f(x_i))ᵀ ∇f(Σ_{i=0}^{k} c_i x_i) > 0,

and then keeping the extrapolated point if these conditions are met. Otherwise, we discard it and perform a step of the base algorithm without extrapolation. When optimizing quadratics without numerical errors, this descent condition should never discard any acceleration step. However, that may not be the case beyond quadratics, or in the presence of large enough numerical errors.

Nonlinear acceleration requires picking a step size h, whose best theoretical value is 2/(µ + L). In practice, it is possible to perform a line search on the parameter h by solving

    min_h f(Σ_{i=0}^{k} c_i x_i − h Σ_{i=0}^{k} c_i ∇f(x_i))

approximately.

Despite the asymptotically optimal rate of convergence of nonlinear acceleration, its computational cost per iteration grows with the length of the sequence {∇f(x₀), …, ∇f(x_k)}. As it turns out, the most relevant information is mostly carried by the last iterates in practice; therefore, limited-memory variants are often harmless. More precisely, this means running the extrapolation using only the last m iterates

    {(x_{k−m+1}, ∇f(x_{k−m+1})), …, (x_k, ∇f(x_k))}.

So far, we have mostly described nonlinear acceleration in Algorithm 3 as a refinement (post-processing) step, but the scheme can either run in parallel, without interfering with the original optimization process (i.e., offline), or accelerate its own iterates (i.e., online nonlinear acceleration, or Anderson acceleration) as in Algorithm 6. We discuss this in more detail below.
Offline Acceleration
When run in offline mode, the algorithm keeps track of the iterates x_k and y_k independently, as follows:

    x_{k+1} = FirstOrderMethod(x₀, {∇f(x₀), …, ∇f(x_k)}, …),
    y_{k+1} = NonlinAcc({(x₀, ∇f(x₀)); …; (x_k, ∇f(x_k))}, h).    (3.22)
In this setting, nonlinear acceleration is used as a post-processing step on the iterates, after the first-order method outputs its final iterate x_k; thus the x_k are independent of the y_k. Since the strategy does not interfere with the original algorithm, this ensures that the convergence rate of the original scheme holds. The parallelization of this technique is straightforward, but its numerical performance is typically worse than that of online nonlinear acceleration.

Online Acceleration
Since nonlinear acceleration in Algorithm 3 creates a new point y_{k+1} that belongs to the span of previous gradients, we can in fact consider it as a first-order method. That means we can use nonlinear acceleration recursively, with

    x_{k+1} = NonlinAcc({(x₀, ∇f(x₀)), …, (x_k, ∇f(x_k))}, h),    (3.23)

under the condition h ≠ 0. Otherwise, there is no contribution of ∇f(x_k) in x_{k+1}, which breaks Assumption 1. Online nonlinear acceleration usually exhibits better performance than the offline version.

Nonlinear acceleration techniques were extensively studied during the past decades, and excellent reviews can be found in (Smith et al., 1987; Jbilou and Sadok, 1991; Jbilou and Sadok, 1995; Jbilou and Sadok, 2000; Brezinski, 2001; Brezinski and Zaglia, 2013). Let us provide a few words summarizing the different approaches.

There are a lot of independent works leading to methods similar to those described here. The most classical, and probably closest, one is
Anderson acceleration (Anderson, 1965), which corresponds exactly to the online mode of nonlinear acceleration (without regularization). Despite being an old algorithm, there has been a recent regain of interest in its convergence analysis (Walker and Ni, 2011; Toth and Kelley, 2015), thanks to its good empirical performance and its strong connection with quasi-Newton methods (Fang and Saad, 2009).

Other versions of nonlinear acceleration use different arguments but behave similarly. For instance, minimal polynomial extrapolation (MPE), which uses the properties of the minimal polynomial of a matrix (Cabay and Jackson, 1976), reduced rank extrapolation (RRE), or the Mešina method (Mešina, 1977; Eddy, 1979) are also variants of Anderson acceleration. The properties and equivalences of these approaches were extensively studied during the past decades (Sidi, 1988; Sidi, 1991; Sidi and Shapira, 1998; Sidi, 2008; Sidi, 2017b; Sidi, 2017a). Unfortunately, these methods hardly extend to nonlinear functions f, especially due to conditioning problems (Sidi, 1986; Sidi and Bridger, 1988; Scieur et al., 2016). Recent works still prove the convergence of such methods, provided we can ensure a good conditioning of the linear system (Sidi, 2019).

There are also other classes of nonlinear acceleration algorithms, based on existing algorithms for accelerating the convergence of scalar sequences (Brezinski, 1975). For instance, the topological epsilon vector algorithm (TEA) extends the idea of the scalar ε-algorithm of (Wynn, 1956) to vectors.

Nesterov Acceleration
This chapter presents a systematic interpretation of the acceleration of gradient methods stemming from Nesterov's original work in (Nesterov, 1983). The early part of this chapter is devoted to the gradient method and the “optimized gradient method”, due to Drori and Teboulle (2014) and Kim and Fessler (2016), since the motivations and ideas underlying this method are natural, and very similar to those behind the introduction of Chebyshev polynomials for optimizing quadratic functions (see Chapter 2). Furthermore, the optimized gradient method has a relatively simple format and proof, and can be used as an inspiration for developing many variant schemes with a wider range of applications—including Nesterov's early accelerated gradient methods (Nesterov, 1983; Nesterov, 2013), and FISTA (Beck and Teboulle, 2009). Some parts of this chapter are more technical; however, we believe all ideas can be reasonably well understood even when skipping the algebraic proofs.

We start with the base theory and interpretation of acceleration in a simple setting: smooth unconstrained convex minimization in a Euclidean space. A second part is then devoted to methods that take advantage of strong convexity, using the same ideas and algorithm structures. For later reference, we provide a few different (equivalent) templates for the algorithms, as, when going into more advanced settings, they naturally do not generalize the same way. We then recap and discuss a few of the practical extensions for handling constrained problems, nonsmooth regularization terms, unknown problem parameters/line searches, and non-Euclidean geometries. Notebooks for simpler reproduction of the proofs of this chapter are provided in Section 4.8.
In the first part of this chapter, we consider smooth unconstrained convex minimization problems. This type of problem is a direct extension of unconstrained quadratic minimization problems where the quadratic function has eigenvalues bounded above by some constant. More precisely, consider the simple unconstrained differentiable convex minimization problem

f_⋆ = min_{x∈R^d} f(x),    (4.1)

where f is L-smooth and convex (see definition below), and we assume throughout that there exists a minimizer x_⋆. The goal of the methods presented below is to find a candidate solution x satisfying f(x) − f_⋆ ≤ ε for some ε > 0. Depending on the target application, other quality measures, such as guarantees on ‖∇f(x)‖ or ‖x − x_⋆‖, might be preferred; see discussions in §“Changing the performance measure”, Section 4.8. We start with gradient descent, and then show that its iteration complexity can be significantly improved, using an acceleration technique due to Nesterov (1983). After presenting the theory for the smooth convex case, we will see how it extends to the smooth strongly convex case, a direct extension of the unconstrained quadratic minimization problem where the quadratic function has eigenvalues bounded above and below by some constants L and µ.

Definition 4.1.
Let 0 ≤ µ < L ≤ +∞. A function f : R^d → R is L-smooth and µ-strongly convex (written f ∈ F_{µ,L}) if and only if

• (L-smoothness) for all x, y ∈ R^d, it holds that
  f(x) ≤ f(y) + ⟨∇f(y); x − y⟩ + (L/2)‖x − y‖²,    (4.2)

• (µ-strong convexity) for all x, y ∈ R^d, it holds that
  f(x) ≥ f(y) + ⟨∇f(y); x − y⟩ + (µ/2)‖x − y‖².    (4.3)

We naturally use the notation F_{0,L} for the set of smooth convex functions. By extension, we use F_{0,∞} to denote the set of proper, closed, and convex functions (i.e., convex functions whose epigraphs are non-empty closed convex sets). Finally, we denote by ∂f(x) the subdifferential of f at x ∈ R^d, by g_f(x) ∈ ∂f(x) a particular subgradient at x, and by q = µ/L the inverse condition number of functions in the class F_{µ,L}. Figure 4.1 provides an illustration of the global quadratic upper approximation (with curvature L) on f(·) due to smoothness, and of the global quadratic lower approximation (with curvature µ) on f(·) due to strong convexity.

A number of inequalities can be written to characterize functions in F_{µ,L}; see for example (Nesterov, 2003, Theorem 2.1.5). When analyzing methods for minimizing functions in that class, it is crucial to have the right inequalities at our disposal, as worst-case analyses essentially boil down to appropriately combining inequalities. We provide the most important ones along with their interpretations and proofs in Appendix 4.A. In this chapter, we only use three of them. First, we use the quadratic upper and lower bounds arising in our definition of smooth strongly convex functions, that is, (4.2) and (4.3).

Figure 4.1:
Let f(·) (blue) be a differentiable function. (Right) Smoothness: f(·) is L-smooth if and only if it is upper bounded by f(y) + ⟨∇f(y); · − y⟩ + (L/2)‖· − y‖² (dashed, brown) for all y. (Left) Strong convexity: f(·) is µ-strongly convex if and only if it is lower bounded by f(y) + ⟨∇f(y); · − y⟩ + (µ/2)‖· − y‖² (dashed, brown) for all y.

For some analyses, however, we need an additional inequality, provided by the following theorem. This inequality, although apparently a bit rough, is often referred to as an interpolation (or extension) inequality (details in Appendix 4.A). It can be shown that worst-case analyses of all first-order methods for minimizing smooth strongly convex functions can be performed using only this one, instantiated at the iterates of the method at hand, as well as at an optimizer. Furthermore, its proof is relatively simple, as it only consists in requiring all quadratic lower bounds from (4.3) to be below all quadratic upper bounds from (4.2).
Theorem 4.1.
A differentiable function f ∈ F_{µ,L} if and only if for all x, y ∈ R^d it holds that

f(x) ≥ f(y) + ⟨∇f(y); x − y⟩ + (1/(2L))‖∇f(y) − ∇f(x)‖² + (µL/(2(L − µ)))‖x − y − (1/L)(∇f(x) − ∇f(y))‖².    (4.4)

However, as shown in Section 4.3.3, this inequality has some flaws when it comes to extending methods to settings more general than (4.1) (for example in the presence of constraints). Therefore, we still use (4.2) and (4.3) whenever possible. Before moving to the next section, let us mention that both smoothness and strong convexity are strong assumptions. More generic assumptions are discussed in Chapter 6 to get improved rates under weaker assumptions.

In this section, we analyze gradient descent using the concept of potential functions. The resulting proofs are technically simple, although they do not seem to provide any direct intuition on the method at hand. We use the same ideas for analyzing a few improvements over gradient descent, before providing some interpretations underlying this mechanism.

The simplest and probably most natural method for minimizing differentiable functions is gradient descent. It is often attributed to Cauchy (1847) and consists in iterating

x_{k+1} = x_k − γ_k∇f(x_k),

where γ_k is some step size. There are many different techniques for picking γ_k, and the simplest one is to set γ_k = 1/L, assuming L is known—otherwise, line-search techniques are typically used, see Section 4.7. Our objective now is to assess the number of iterations required by gradient descent to obtain an approximate minimizer x_k of f, satisfying f(x_k) − f_⋆ ≤ ε.

Potential (or Lyapunov) functions are classical tools for proving convergence rates in the first-order literature, and a nice recent review of this topic is given by Bansal and Gupta (2019). For gradient descent, the idea consists in recursively using a simple inequality (proof below)

(k + 1)(f(x_{k+1}) − f_⋆) + (L/2)‖x_{k+1} − x_⋆‖² ≤ k(f(x_k) − f_⋆) + (L/2)‖x_k − x_⋆‖²,

valid for all f ∈ F_{0,L} and x_k ∈ R^d, when x_{k+1} = x_k − (1/L)∇f(x_k). In this context, we refer to

φ_k := k(f(x_k) − f_⋆) + (L/2)‖x_k − x_⋆‖²

as a potential, and use φ_{k+1} ≤ φ_k as the building block for the worst-case analysis. Once such a potential inequality φ_{k+1} ≤ φ_k is established, a convergence rate can easily be deduced through a recursive argument which yields

N(f(x_N) − f_⋆) ≤ φ_N ≤ φ_{N−1} ≤ . . . ≤ φ_0 = (L/2)‖x_0 − x_⋆‖²,    (4.5)

hence f(x_N) − f_⋆ ≤ L‖x_0 − x_⋆‖²/(2N). We also conclude that the worst-case accuracy of gradient descent is O(N^{−1}), or equivalently that its iteration complexity is O(ε^{−1}). Therefore, the main inequality to be proved for this worst-case analysis to work is the potential inequality φ_{k+1} ≤ φ_k. In other words, the analysis of N iterations of gradient descent is reduced to the analysis of a single iteration, using an appropriate potential. This kind of approach was already used for example by Nesterov (1983), and many different variants of this potential function can be used to prove convergence of gradient descent and related methods, in similar ways.
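As a quick numerical illustration (our own, on a simple random quadratic, assuming only numpy), the following sketch runs gradient descent with step size 1/L and checks that the potential φ_k = k(f(x_k) − f_⋆) + (L/2)‖x_k − x_⋆‖² never increases.

```python
import numpy as np

# simple quadratic test problem f(x) = 0.5 * x^T H x (so x_star = 0, f_star = 0)
rng = np.random.default_rng(0)
M = rng.standard_normal((20, 10))
H = M.T @ M
L = np.linalg.eigvalsh(H).max()             # smoothness constant
f = lambda x: 0.5 * x @ H @ x
grad = lambda x: H @ x
x_star, f_star = np.zeros(10), 0.0

x = rng.standard_normal(10)
phi_prev = np.inf
for k in range(50):
    phi = k * (f(x) - f_star) + 0.5 * L * np.linalg.norm(x - x_star) ** 2
    assert phi <= phi_prev + 1e-9           # the potential is nonincreasing
    phi_prev = phi
    x = x - grad(x) / L                     # gradient step with step size 1/L

print("f(x_N) - f_star =", f(x) - f_star)
```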
Theorem 4.2. Let f be a L-smooth convex function, x_⋆ ∈ argmin_x f(x), and k ∈ N. For any A_k ≥ 0 and x_k ∈ R^d it holds that

A_{k+1}(f(x_{k+1}) − f_⋆) + (L/2)‖x_{k+1} − x_⋆‖² ≤ A_k(f(x_k) − f_⋆) + (L/2)‖x_k − x_⋆‖²,

with x_{k+1} = x_k − (1/L)∇f(x_k) and A_{k+1} = 1 + A_k.

Proof.
The proof consists in performing a weighted sum of the following inequalities:

• convexity of f between x_k and x_⋆, with weight λ_1 = 1:
  0 ≥ f(x_k) − f_⋆ + ⟨∇f(x_k); x_⋆ − x_k⟩,

• smoothness of f between x_k and x_{k+1}, with weight λ_2 = 1 + A_k:
  0 ≥ f(x_{k+1}) − f(x_k) − ⟨∇f(x_k); x_{k+1} − x_k⟩ − (L/2)‖x_k − x_{k+1}‖².

This last inequality is often referred to as the descent lemma, as substituting x_{k+1} = x_k − (1/L)∇f(x_k) allows obtaining f(x_{k+1}) ≤ f(x_k) − (1/(2L))‖∇f(x_k)‖². This weighted sum forms a valid inequality:

0 ≥ λ_1[f(x_k) − f_⋆ + ⟨∇f(x_k); x_⋆ − x_k⟩] + λ_2[f(x_{k+1}) − f(x_k) − ⟨∇f(x_k); x_{k+1} − x_k⟩ − (L/2)‖x_k − x_{k+1}‖²].

Using x_{k+1} = x_k − (1/L)∇f(x_k), this inequality can be rewritten (completing the squares, or simply expanding both expressions and verifying that they match on a term-by-term basis) as

(A_k + 1)(f(x_{k+1}) − f_⋆) + (L/2)‖x_{k+1} − x_⋆‖² ≤ A_k(f(x_k) − f_⋆) + (L/2)‖x_k − x_⋆‖² − (A_k/(2L))‖∇f(x_k)‖² ≤ A_k(f(x_k) − f_⋆) + (L/2)‖x_k − x_⋆‖²,

where the last inequality follows from neglecting the last term on the right-hand side (which is nonpositive). A convergence rate for gradient descent can directly be obtained as a consequence of Theorem 4.2, following the reasoning of (4.5), and the rate corresponds to f(x_N) − f_⋆ = O(A_N^{−1}) = O(N^{−1}). We detail this in the next corollary.

Corollary 4.3.
Let f be a L-smooth convex function, and x_⋆ ∈ argmin_x f(x). For any N ∈ N, the iterates of gradient descent with step size γ_0 = γ_1 = . . . = γ_{N−1} = 1/L satisfy

f(x_N) − f_⋆ ≤ L‖x_0 − x_⋆‖²/(2N).
Proof.
Following the reasoning of (4.5), we recursively use Theorem 4.2 starting with A_0 = 0.

Before moving to other methods, let us show that this O(N^{−1}) worst-case rate of gradient descent can actually be attained on very simple problems, motivating the search for alternate methods with better guarantees. This worst-case rate is attained by, e.g., all functions that are nearly linear over large regions. One such common function is the Huber loss

f(x) = a_τ|x| − b_τ  if |x| ≥ τ,   (L/2)x²  otherwise

(with a_τ = Lτ and b_τ = (L/2)τ² to ensure continuity and differentiability of this function). Indeed, on this function, as long as the iterates of gradient descent satisfy |x_k| ≥ τ, they behave as if the function was linear, and the gradient on this part is constant. It is therefore relatively easy to explicitly compute all iterates. In particular, picking τ = |x_0|/(2N + 1), we get that f(x_N) − f_⋆ = L‖x_0 − x_⋆‖²/(2(2N + 1)), reaching the O(N^{−1}) worst-case bound (this example is due to Drori and Teboulle (2014, Theorem 3.2)). We picked f as a function of N here for making computations simple, but this is not necessary for obtaining one-dimensional examples on which f(x_N) − f_⋆ = O(N^{−1}). Therefore, it appears that the worst-case bound from Corollary 4.3 for gradient descent can only be improved in terms of the constants, but the rate itself is the best possible one for this simple method; see for example (Drori and Teboulle, 2014; Drori, 2014) for the corresponding tight expressions. In the next section, we show that a similar reasoning based on potential functions produces methods with an improved O(N^{−2}) convergence rate, compared to the O(N^{−1}) rate of simple gradient descent.

Given that the complexity bound for gradient descent cannot be improved, it is reasonable to look for alternate, hopefully better, methods. In fact, in what follows, we design an accelerated method by optimizing its worst-case performance. A reasonably broad family of candidate first-order methods is described by

y_1 = y_0 − h_{1,0}∇f(y_0),
y_2 = y_1 − h_{2,0}∇f(y_0) − h_{2,1}∇f(y_1),
y_3 = y_2 − h_{3,0}∇f(y_0) − h_{3,1}∇f(y_1) − h_{3,2}∇f(y_2),
...
y_N = y_{N−1} − Σ_{i=0}^{N−1} h_{N,i}∇f(y_i).    (4.6)

Of course, methods in this form are impractical, as they require keeping track of all previous gradients. Neglecting this potential problem for now, one possibility for choosing the step sizes {h_{i,j}} is to solve a minimax problem

min_{{h_{i,j}}} max_{f∈F_{0,L}} { (f(y_N) − f_⋆)/‖y_0 − x_⋆‖² : y_N obtained from (4.6) and x_0 }.    (4.7)

In other words, we are looking for the best possible worst-case ratio among methods of the form (4.6). Of course, different target notions of accuracy could be considered instead of (f(y_N) − f_⋆)/‖y_0 − x_⋆‖², but we stick with this notion as this is the one we already used for gradient descent in Corollary 4.3 and Nesterov's method in Corollary 4.9. It turns out that (4.7) has a clean solution, obtained by Kim and Fessler (2016), based on clever reformulations and relaxations of (4.7) developed by Drori and Teboulle (2014) (some details are provided in Section 4.8). Furthermore, this method has “factorized” forms which do not require keeping track of previous gradients. The optimized gradient method (OGM) is parameterized by a sequence {θ_{k,N}}_k that is constructed recursively starting from θ_{−1,N} = 0 (or equivalently θ_{0,N} = 1), using

θ_{k+1,N} = (1 + √(4θ²_{k,N} + 1))/2  if k ≤ N − 2,   (1 + √(8θ²_{k,N} + 1))/2  if k = N − 1.    (4.8)

Let us also mention that optimized gradient methods can be stated in various equivalent formats, which we provide in Algorithm 7 and Algorithm 8 (a rigorous equivalence statement is provided in Appendix 4.B.1). While the shape of Algorithm 8 is more common to accelerated methods, the equivalent formulation provided in Algorithm 7 allows for slightly more direct proofs.

Algorithm 7
Optimized gradient method (OGM), form I
Input: A L-smooth convex function f, initial point x_0, budget N. Initialize z_0 = y_0 = x_0 and θ_{−1,N} = 0.
for k = 0, . . . , N − 1 do
  θ_{k,N} = (1 + √(4θ²_{k−1,N} + 1))/2
  y_k = (1 − 1/θ_{k,N})x_k + (1/θ_{k,N})z_k
  x_{k+1} = y_k − (1/L)∇f(y_k)
  z_{k+1} = x_0 − (2/L)Σ_{i=0}^k θ_{i,N}∇f(y_i)
end for
Output: Approx. solution y_N = (1 − 1/θ_{N,N})x_N + (1/θ_{N,N})z_N with θ_{N,N} = (1 + √(8θ²_{N−1,N} + 1))/2

Direct approaches to (4.7) are rather technical—see details in (Drori and Teboulle, 2014; Kim and Fessler, 2016). However, showing that the OGM is indeed optimal on the class of smooth convex functions can be done indirectly, by providing an upper bound on its worst-case complexity guarantees, and by showing that no first-order method can have a better worst-case guarantee on this class of problems. We detail a fully explicit worst-case guarantee for OGM in the next section.

Algorithm 8
Optimized gradient method (OGM), form II
Input: A L-smooth convex function f, initial point x_0, budget N. Initialize y_0 = x_0 and θ_{0,N} = 1.
for k = 0, . . . , N − 1 do
  θ_{k+1,N} given by (4.8)
  x_{k+1} = y_k − (1/L)∇f(y_k)
  y_{k+1} = x_{k+1} + ((θ_{k,N} − 1)/θ_{k+1,N})(x_{k+1} − x_k) + (θ_{k,N}/θ_{k+1,N})(x_{k+1} − y_k)
end for
Output: Approx. solution y_N

This guarantee consists in showing that

φ_k = 2θ²_{k−1,N}(f(y_{k−1}) − f_⋆ − (1/(2L))‖∇f(y_{k−1})‖²) + (L/2)‖z_k − x_⋆‖²    (4.9)

is a potential function for the optimized gradient method (Theorem 4.4 below), when k < N. For k = N, we need a minor adjustment (Lemma 4.5 below) to obtain a bound on f(y_N) − f_⋆, and not in terms of f(y_N) − f_⋆ − (1/(2L))‖∇f(y_N)‖², which appears in the potential. As for gradient descent, the proof therefore relies on potential functions. Following the recursive argument from (4.5), the convergence guarantee is driven by the convergence speed of θ⁻²_{k,N} towards 0. Let us note that when k < N − 1,

θ_{k+1,N} = (1 + √(4θ²_{k,N} + 1))/2 ≥ θ_{k,N} + 1/2,    (4.10)

therefore θ_{k,N} ≥ (k + 2)/2. We also get θ_{N,N} ≥ (N + 2)/2 and hence θ⁻²_{N,N} = O(N^{−2}). Before providing the proof, let us mention that it heavily relies on inequality (4.4), with µ = 0. This inequality turns out to be key for formulating (4.7) in a tractable way.

The main point now is to prove that (4.9) is indeed a potential for the optimized gradient method. We emphasize again that our main motivation for doing so is to show that OGM provides a good template algorithm for acceleration (i.e., a method involving two or three sequences), and that the corresponding potential functions can also be used as a template for the analysis of more advanced methods. Note that although the potential structure might give the impression of falling from the sky, it was actually found using computer-assisted proof design techniques (see Section 4.8 for further references).
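For concreteness, here is a minimal Python sketch of Algorithm 8 (assuming only numpy and a user-supplied gradient oracle grad_f; the transcription and names are ours), with the recursion (4.8) written explicitly.

```python
import numpy as np

def ogm(grad_f, x0, L, N):
    """Optimized gradient method, form II: N gradient steps on an L-smooth convex f."""
    x_prev = x0.copy()
    y = x0.copy()
    theta = 1.0                                    # theta_{0,N}
    for k in range(N):
        x = y - grad_f(y) / L                      # gradient step
        # theta_{k+1,N}: the last step uses the factor 8 instead of 4, see (4.8)
        factor = 8.0 if k == N - 1 else 4.0
        theta_next = (1.0 + np.sqrt(factor * theta ** 2 + 1.0)) / 2.0
        # momentum combining the two differences (x - x_prev) and (x - y)
        y = x + (theta - 1.0) / theta_next * (x - x_prev) \
              + theta / theta_next * (x - y)
        x_prev, theta = x, theta_next
    return y                                       # y_N, the point covered by Corollary 4.6
```

Compared to a plain accelerated gradient method, the only differences are the extra momentum term (x − y) and the modified recursion at the last iteration.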
Theorem 4.4.
Let f be a L-smooth convex function, x_⋆ ∈ argmin_x f(x), and N ∈ N.
For any k ∈ N with 0 ≤ k ≤ N − 1, and any y_{k−1}, z_k ∈ R^d, it holds that

2θ²_{k,N}(f(y_k) − f_⋆ − (1/(2L))‖∇f(y_k)‖²) + (L/2)‖z_{k+1} − x_⋆‖² ≤ 2θ²_{k−1,N}(f(y_{k−1}) − f_⋆ − (1/(2L))‖∇f(y_{k−1})‖²) + (L/2)‖z_k − x_⋆‖²,

when y_k and z_{k+1} are obtained from Algorithm 7.

Proof.
Let us recall that the algorithm can be written as y k = − θ k,N ! (cid:18) y k − − L ∇ f ( y k − ) (cid:19) + 1 θ k,N z k z k +1 = z k − θ k,N L ∇ f ( y k ) . The proof then consists in performing a weighted sum of the following inequalities.• Smoothness and convexity of f between y k − and y k with weight λ = 2 θ k − ,N ≥ f ( y k ) − f ( y k − ) + h∇ f ( y k ); y k − − y k i + 12 L k∇ f ( y k ) − ∇ f ( y k − ) k , • smoothness and convexity of f between x ⋆ and y k with weight λ = 2 θ k,N ≥ f ( y k ) − f ⋆ + h∇ f ( y k ); x ⋆ − y k i + 12 L k∇ f ( y k ) k . The weights being nonnegative, the weighted sum produces a valid inequality of the form0 ≥ λ (cid:20) f ( y k ) − f ( y k − ) + h∇ f ( y k ); y k − − y k i + 12 L k∇ f ( y k ) − ∇ f ( y k − ) k (cid:21) λ (cid:20) f ( y k ) − f ⋆ + h∇ f ( y k ); x ⋆ − y k i + 12 L k∇ f ( y k ) k (cid:21) , (4.11)which can be reformulated as (again either by completing the squares, or simply byextending both expressions and verifying that they match on a term by term basis)0 ≥ λ (cid:20) f ( y k ) − f ( y k − ) + h∇ f ( y k ); y k − − y k i + 12 L k∇ f ( y k ) − ∇ f ( y k − ) k (cid:21) λ (cid:20) f ( y k ) − f ⋆ + h∇ f ( y k ); x ⋆ − y k i + 12 L k∇ f ( y k ) k (cid:21) =2 θ k,N (cid:18) f ( y k ) − f ⋆ − L k∇ f ( y k ) k (cid:19) + L k z k +1 − x ⋆ k − θ k − ,N (cid:18) f ( y k − ) − f ⋆ − L k∇ f ( y k − ) k (cid:19) − L k z k − x ⋆ k + 2 θ k,N (cid:16) θ k − ,N + θ k,N − θ k,N (cid:17) h∇ f ( y k ); y k − − L ∇ f ( y k − ) − z k i + 2 (cid:16) θ k − ,N + θ k,N − θ k,N (cid:17) (cid:18) f ( y k ) − f ⋆ + 12 L k∇ f ( y k ) k (cid:19) , .3. Optimized Gradient Method and the desired conclusion follows from picking θ k,N ≥ θ k − ,N satisfying θ k − ,N + θ k,N − θ k,N = 0 , hence the choice (4.8), reaching2 θ k,N (cid:18) f ( y k ) − f ⋆ − L k∇ f ( y k ) k (cid:19) + L k z k +1 − x ⋆ k ≤ θ k − ,N (cid:18) f ( y k − ) − f ⋆ − L k∇ f ( y k − ) k (cid:19) + L k z k − x ⋆ k . A last technical fix is required now. In order show that the optimized gradient methodis an optimal solution to (4.7), we need an upper bound on function values, and not onfunction values minus a squared gradient norm. This discrepancy is handled by thefollowing technical lemma.
Lemma 4.5.
Let f be a L-smooth convex function, x_⋆ ∈ argmin_x f(x), and N ∈ N. For any y_{N−1}, z_N ∈ R^d it holds that

θ²_{N,N}(f(y_N) − f_⋆) + (L/2)‖z_N − (θ_{N,N}/L)∇f(y_N) − x_⋆‖² ≤ 2θ²_{N−1,N}(f(y_{N−1}) − f_⋆ − (1/(2L))‖∇f(y_{N−1})‖²) + (L/2)‖z_N − x_⋆‖²,

where y_N is obtained from Algorithm 7.

Proof.
The proof consists in performing a weighted sum of the following inequalities.• Smoothness and convexity of f between y N − and y N with weight λ = 2 θ N − ,N ≥ f ( y N ) − f ( y N − ) + h∇ f ( y N ); y N − − y N i + 12 L k∇ f ( y N ) − ∇ f ( y N − ) k , • smoothness and convexity of f between x ⋆ and y N with weight λ = θ N,N ≥ f ( y N ) − f ⋆ + h∇ f ( y N ); x ⋆ − y N i + 12 L k∇ f ( y N ) k . The weights being nonnegative, the weighted sum produces a valid inequality of the form0 ≥ λ (cid:20) f ( y N ) − f ( y N − ) + h∇ f ( y N ); y N − − y N i + 12 L k∇ f ( y N ) − ∇ f ( y N − ) k (cid:21) + λ (cid:20) f ( y N ) − f ⋆ + h∇ f ( y N ); x ⋆ − y N i + 12 L k∇ f ( y N ) k (cid:21) , which can be reformulated as0 ≥ θ N,N ( f ( y N ) − f ⋆ ) + L k z N − θ N,N L ∇ f ( y N ) − x ⋆ k − θ N − ,N (cid:18) f ( y N − ) − f ⋆ − L k∇ f ( y N − ) k (cid:19) − L k z N − x ⋆ k + 1 θ N,N (cid:16) θ N − ,N − θ N,N + θ N,N (cid:17) h∇ f ( y N ); y N − − L ∇ f ( y N − ) − z N i + (cid:16) θ N − ,N − θ N,N + θ N,N (cid:17) (cid:18) f ( y N ) − f ⋆ + 12 L k∇ f ( y N ) k (cid:19) . Nesterov Acceleration
The conclusion follows from picking θ N,N ≥ θ N − ,N such that2 θ N − ,N − θ N,N + θ N,N = 0 , reaching the desired θ N,N ( f ( y N ) − f ⋆ ) + L k z N − θ N,N L ∇ f ( y N ) − x ⋆ k ≤ θ N − ,N (cid:18) f ( y N − ) − f ⋆ − L k∇ f ( y N − ) k (cid:19) + L k z N − x ⋆ k . By combining Theorem 4.4 and technical Lemma 4.5, we get the final worst-caseconvergence bound performances of OGM on function values, detailed in the corollarybelow.
Corollary 4.6.
Let f be a L-smooth convex function, and x_⋆ ∈ argmin_x f(x). For any N ∈ N and x_0 ∈ R^d, the output of the optimized gradient method (OGM, Algorithm 7 or Algorithm 8) satisfies

f(y_N) − f_⋆ ≤ L‖x_0 − x_⋆‖²/(2θ²_{N,N}) ≤ 2L‖x_0 − x_⋆‖²/(N + 2)².

Proof.
Defining, for k ∈ {0, . . . , N},

φ_k = 2θ²_{k−1,N}(f(y_{k−1}) − f_⋆ − (1/(2L))‖∇f(y_{k−1})‖²) + (L/2)‖z_k − x_⋆‖²,

and

φ_{N+1} = θ²_{N,N}(f(y_N) − f_⋆) + (L/2)‖z_N − (θ_{N,N}/L)∇f(y_N) − x_⋆‖²,

we reach the desired statement

θ²_{N,N}(f(y_N) − f_⋆) ≤ φ_{N+1} ≤ φ_N ≤ . . . ≤ φ_0 = (L/2)‖x_0 − x_⋆‖²,

using Theorem 4.4 and technical Lemma 4.5. We obtain the last bound using θ_{N,N} ≥ (N + 2)/2, see (4.10).
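As a numerical sanity check of Theorem 4.4, Lemma 4.5, and Corollary 4.6 (our own illustration on a random quadratic, assuming only numpy), one can run Algorithm 7 and verify that the potential (4.9) is nonincreasing and that the final bound holds.

```python
import numpy as np

rng = np.random.default_rng(1)
d, N = 10, 30
M = rng.standard_normal((2 * d, d))
H = M.T @ M                                   # f(x) = 0.5 x^T H x, minimized at x_star = 0
L = np.linalg.eigvalsh(H).max()
f = lambda x: 0.5 * x @ H @ x
grad = lambda x: H @ x
x_star, f_star = np.zeros(d), 0.0

x0 = rng.standard_normal(d)
x, z = x0.copy(), x0.copy()
theta = 0.0                                    # theta_{-1,N}
phi_prev = 0.5 * L * np.linalg.norm(z - x_star) ** 2      # phi_0
for k in range(N):
    theta = (1 + np.sqrt(4 * theta ** 2 + 1)) / 2          # theta_{k,N}
    y = (1 - 1 / theta) * x + (1 / theta) * z
    g = grad(y)
    x = y - g / L                                          # x_{k+1}
    z = z - 2 * theta / L * g                              # z_{k+1}
    phi = 2 * theta ** 2 * (f(y) - f_star - np.dot(g, g) / (2 * L)) \
        + 0.5 * L * np.linalg.norm(z - x_star) ** 2        # phi_{k+1}, see (4.9)
    assert phi <= phi_prev + 1e-8                          # Theorem 4.4
    phi_prev = phi

theta_NN = (1 + np.sqrt(8 * theta ** 2 + 1)) / 2           # last-step rule in (4.8)
y_N = (1 - 1 / theta_NN) * x + (1 / theta_NN) * z
bound = L * np.linalg.norm(x0 - x_star) ** 2 / (2 * theta_NN ** 2)
assert f(y_N) - f_star <= bound + 1e-10                    # Corollary 4.6
```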
Remark 4.1.
The potential function for the optimized gradient method can be foundin (Taylor and Bach, 2019, Theorem 11), and was obtained using semidefinite program-ming.
A nice, commonly used guide for designing optimal methods consists in constructing problems that are difficult for all methods within a certain class. This strategy results in lower complexity bounds, and is often deployed via the concept of minimax risk (of a class of problems and a class of methods)—see e.g., Guzmán and Nemirovsky (2015)—which corresponds to the worst-case performance of the best method within the prescribed class. In this section, we briefly discuss such results in the context of smooth convex minimization, on the particular class of black-box first-order methods. The term black-box is used to emphasize that the method has no prior knowledge on f (beyond the class of functions to which f belongs; so methods are allowed to use L), and can only obtain information on f by evaluating its gradient/function value through an oracle. Of particular interest to us, Drori (2017) established that the worst-case performance achieved by the optimized gradient method cannot be improved, in general, by any black-box first-order method on the class of smooth convex functions.

Theorem 4.7. (Drori, 2017, Theorem 3) Let
L > 0, d, N ∈ N, and d ≥ N + 1. For any black-box first-order method that performs at most N calls to the first-order oracle (f(·), ∇f(·)), there exists a function f ∈ F_{0,L}(R^d) such that

f(x_N) − f(x_⋆) ≥ L‖x_0 − x_⋆‖²/(2θ²_{N,N}),

where x_⋆ ∈ argmin_x f(x), and x_N is the output of the method under consideration.

While Drori's approach to obtain this lower bound is rather technical, there are simpler approaches allowing to show that the rate O(N^{−2}) (that is, neglecting the tight constants) cannot be beaten in general, in smooth convex minimization. For one such example, we refer to (Nesterov, 2003, Theorem 2.1.6). In a very related line of work, (Nemirovsky, 1991) established similar exact bounds in the context of solving linear systems of equations, and for minimizing convex quadratic functions. For convex quadratic problems whose Hessians have eigenvalues bounded between 0 and L, these lower bounds are attained by Chebyshev (see Chapter 2) and by conjugate gradient methods (Nemirovsky, 1991; Nemirovsky, 1992). Perhaps amazingly, it turns out that the conjugate gradient method also shares the same optimal worst-case guarantee as that of OGM in smooth convex minimization, and that the proof follows from the same potential as that of OGM (see Appendix 4.C).

Before going further, let us quickly summarize what we have learned from optimized gradient methods. First of all, the optimized gradient method can be seen as a counterpart of the Chebyshev method for minimizing quadratics, applied to smooth convex minimization. It is an optimal method in the sense that it has the smallest possible worst-case ratio (f(y_N) − f_⋆)/‖y_0 − x_⋆‖² over the class f ∈ F_{0,L}, among all black-box first-order methods, given a fixed computational budget of N gradient evaluations. Furthermore, although this method has a few drawbacks (we mention a few below), it can be seen as a template for designing other accelerated methods using the same algorithmic and proof structures. We extensively use variants of this template below. In other words, most variants of accelerated gradient methods usually rely on the same two (or three) sequence structure, and on similar potential functions. Those variants usually rely on slight variations in the choices of the parameters used throughout the iterative process, typically involving less aggressive step size strategies (i.e., smaller values for {h_{i,j}} in (4.6)).

Second, OGM is not a very practical method: it is fine-tuned for unconstrained smooth convex minimization, and does not readily extend to other situations, involving for example constraints, or unknown smoothness parameters, to name a few. On the other hand, we will see in what follows that it is relatively easy to design other methods, following the same template and achieving the same O(N^{−2}) rate, while fixing the issues of OGM listed above. Those methods use slightly less aggressive step sizes, at the cost of being slightly suboptimal for (4.7), i.e., with slightly worse worst-case guarantees. In this vein, we start by discussing the original accelerated gradient method due to Nesterov (1983).

Motivated by the format of the optimized gradient method, we detail a potential-based proof for Nesterov's method. We then quickly review the concept of estimate sequences, and show that they provide an interpretation of potential functions as increasingly good models of the function to be minimized. We then extend these results to strongly convex minimization.
In this section, we follow the same algorithmic template as that provided by the optimized gradient method. In this spirit, we start by discussing the first accelerated method in its simplest form (Algorithm 9), as well as its potential function, due to Nesterov (1983) (although our presentation is slightly different). Our goal is to derive the simplest algebraic proof for this scheme. We follow the algorithmic template of the optimized gradient method (which is further motivated in Section 4.6.1) and, once a potential is chosen, the proofs are quite straightforward as simple combinations of inequalities and basic algebra. Our choice of potential function is not immediately obvious but allows for simple extensions afterwards. Other choices are possible, for example by incorporating f(y_k) as in OGM, or additional terms such as ‖∇f(x_k)‖². We pick a potential function similar to that used for gradient descent and that of the optimized gradient method, written

φ_k = A_k(f(x_k) − f_⋆) + (L/2)‖z_k − x_⋆‖²,

where one iteration of the algorithm has the following form, reminiscent of the OGM:

y_k = x_k + τ_k(z_k − x_k)
x_{k+1} = y_k − α_k∇f(y_k)
z_{k+1} = z_k − γ_k∇f(y_k).    (4.12)

Our goal is to pick algorithmic parameters {(τ_k, α_k, γ_k)}_k greedily making A_{k+1} as large as possible as a function of A_k, as the convergence rate of the method is controlled by the inverse of the growth rate of A_k, i.e., f(x_N) − f_⋆ = O(A_N^{−1}). In practice, we can pick A_{k+1} = A_k + (1 + √(4A_k + 1))/2, by choosing τ_k = 1 − A_k/A_{k+1}, α_k = 1/L, and γ_k = (A_{k+1} − A_k)/L (see Algorithm 9), and the proof is then quite compact.

Algorithm 9
Nesterov’s first method, form I
Input: A L-smooth convex function f, initial point x_0. Initialize z_0 = x_0 and A_0 = 0.
for k = 0, . . . do
  a_k = (1 + √(4A_k + 1))/2
  A_{k+1} = A_k + a_k
  y_k = x_k + (1 − A_k/A_{k+1})(z_k − x_k)
  x_{k+1} = y_k − (1/L)∇f(y_k)
  z_{k+1} = z_k − ((A_{k+1} − A_k)/L)∇f(y_k)
end for
Output: An approximate solution x_{k+1}.

Before moving to the proof of the potential inequality, let us show that A_k^{−1} = O(k^{−2}). Indeed we have

A_k = A_{k−1} + (1 + √(4A_{k−1} + 1))/2 ≥ A_{k−1} + 1/2 + √A_{k−1} ≥ (√A_{k−1} + 1/2)² ≥ k²/4,    (4.13)

where the last inequality follows from a recursive application of the previous one along with A_0 = 0.

Theorem 4.8.
Let f be a L-smooth convex function, x_⋆ ∈ argmin_x f(x), and k ∈ N. For any x_k, z_k ∈ R^d, and A_k ≥ 0,

A_{k+1}(f(x_{k+1}) − f_⋆) + (L/2)‖z_{k+1} − x_⋆‖² ≤ A_k(f(x_k) − f_⋆) + (L/2)‖z_k − x_⋆‖²,

with A_{k+1} = A_k + (1 + √(4A_k + 1))/2.

Proof.
The proof consists in a weighted sum of the following inequalities.• Convexity of f between x ⋆ and y k with weight λ = A k +1 − A k f ⋆ ≥ f ( y k ) + h∇ f ( y k ); x ⋆ − y k i , • convexity of f between x k and y k with weight λ = A k f ( x k ) ≥ f ( y k ) + h∇ f ( y k ); x k − y k i , Nesterov Acceleration • smoothness of f between y k and x k +1 (a.k.a., descent lemma ) with weight λ = A k +1 f ( y k ) + h∇ f ( y k ); x k +1 − y k i + L k x k +1 − y k k ≥ f ( x k +1 ) . We therefore arrive to the following valid inequality0 ≥ λ [ f ( y k ) − f ⋆ + h∇ f ( y k ); x ⋆ − y k i ]+ λ [ f ( y k ) − f ( x k ) + h∇ f ( y k ); x k − y k i ]+ λ [ f ( x k +1 ) − f ( y k ) − h∇ f ( y k ); x k +1 − y k i − L k x k +1 − y k k ] . For the sake of simplicity, we do not substitute A k +1 by its expression until the laststage of the reformulation. Substituting y k , x k +1 , and z k +1 by their expressions in (4.12)along with τ k = 1 − A k /A k +1 , α k = L , and γ k = A k +1 − A k L , basic algebra shows that theprevious inequality can be reorganized as0 ≥ A k +1 ( f ( x k +1 ) − f ⋆ ) + L k z k +1 − x ⋆ k − A k ( f ( x k ) − f ⋆ ) − L k z k − x ⋆ k + A k +1 − ( A k − A k +1 ) L k∇ f ( y k ) k . The desired claim follows from picking A k +1 ≥ A k such that A k +1 − ( A k − A k +1 ) = 0,reaching A k +1 ( f ( x k +1 ) − f ⋆ ) + L k z k +1 − x ⋆ k ≤ A k ( f ( x k ) − f ⋆ ) + L k z k − x ⋆ k . The final worst-case guarantee is obtained using the same chaining argument asin (4.5), combined with an upper bound on A N . Corollary 4.9.
Let f be a L-smooth convex function, and x_⋆ ∈ argmin_x f(x). For any N ∈ N, the iterates of Algorithm 9 satisfy

f(x_N) − f_⋆ ≤ 2L‖x_0 − x_⋆‖²/N².

Proof.

Following the argument of (4.5), we recursively use Theorem 4.8 with A_0 = 0:

A_N(f(x_N) − f_⋆) ≤ φ_N ≤ . . . ≤ φ_0 = (L/2)‖x_0 − x_⋆‖²,

which yields

f(x_N) − f_⋆ ≤ L‖x_0 − x_⋆‖²/(2A_N) ≤ 2L‖x_0 − x_⋆‖²/N²,

where we used A_N ≥ N²/4. This O(N^{−2}) rate matches that of the lower bounds (see e.g., Theorem 4.7) up to absolute constants.

Finally, note that Nesterov's method is often written in a slightly different format, similar to that of Algorithm 8. This alternate formulation, which omits the third sequence z_k, corresponds to Algorithm 10. It is preferred in many references on the topic due to its simplicity. A third equivalent variant is provided by Algorithm 11; this variant turns out to be useful when generalizing the method beyond Euclidean spaces. The equivalence statements between Algorithm 9, Algorithm 10, and Algorithm 11 are relatively simple, and are provided in Appendix 4.B.2. Many references typically favor one of those formulations, and we want to point out that they are often equivalent. Although the expressions of those different formats in terms of the same external sequence {A_k}_k do not always correspond to their simplest forms (particularly in the strongly convex case, which follows), we stick with it in order not to introduce too many variations around the same theme.

Algorithm 10
Nesterov’s first method, form II
Input: A L-smooth convex function f, initial point x_0. Initialize y_0 = x_0 and A_0 = 0.
for k = 0, . . . do
  a_k = (1 + √(4A_k + 1))/2
  A_{k+1} = A_k + a_k
  x_{k+1} = y_k − (1/L)∇f(y_k)
  y_{k+1} = x_{k+1} + ((a_k − 1)/a_{k+1})(x_{k+1} − x_k)
end for
Output: An approximate solution x_{k+1}.

Algorithm 11
Nesterov’s method, form III
Input: A L-smooth convex function f, initial point x_0. Initialize z_0 = x_0 and A_0 = 0.
for k = 0, . . . do
  a_k = (1 + √(4A_k + 1))/2
  A_{k+1} = A_k + a_k
  y_k = x_k + (1 − A_k/A_{k+1})(z_k − x_k)
  z_{k+1} = z_k − ((A_{k+1} − A_k)/L)∇f(y_k)
  x_{k+1} = (A_k/A_{k+1})x_k + (1 − A_k/A_{k+1})z_{k+1}
end for
Output: An approximate solution x_{k+1}.

We now relate the potential function approach to estimate sequences, which maintain a model of the function along iterations. This technique was originally developed in (Nesterov, 2003, Section 2.2), and was then used in numerous works to obtain accelerated first-order methods in various settings (see discussions in Section 4.8). We present a slightly modified version, related to those of (Baes, 2009; Wilson et al., 2016), which simplifies comparisons with previous material.
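Before turning to estimate sequences, here is a minimal Python sketch of Nesterov's method in its form II (Algorithm 10), assuming only numpy and a user-supplied gradient oracle grad_f; the transcription is ours. Its output satisfies the bound of Corollary 4.9.

```python
import numpy as np

def nesterov(grad_f, x0, L, N):
    """Nesterov's accelerated gradient method (form II) for an L-smooth convex f."""
    x = x0.copy()
    y = x0.copy()
    A = 0.0
    for _ in range(N):
        a = (1.0 + np.sqrt(4.0 * A + 1.0)) / 2.0              # a_k
        A_next = A + a                                        # A_{k+1}
        a_next = (1.0 + np.sqrt(4.0 * A_next + 1.0)) / 2.0    # a_{k+1}
        x_new = y - grad_f(y) / L                             # gradient step: x_{k+1}
        y = x_new + (a - 1.0) / a_next * (x_new - x)          # momentum step: y_{k+1}
        x, A = x_new, A_next
    return x
```

Note that the first momentum coefficient (a_0 − 1)/a_1 equals zero, so the first iteration is a plain gradient step, as expected.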
Estimate Sequences
As we see below, the base idea underlying estimate sequences is closely related to that of potential functions, but has explicit interpretations in terms of models of the objective function f. More precisely, a sequence of pairs {(A_k, ϕ_k(x))}_k, with A_k ≥ 0 and ϕ_k : R^d → R, is called an estimate sequence of a function f if

(i) for all k ≥ 0 and x ∈ R^d we have
  ϕ_k(x) − f(x) ≤ A_k^{−1}(ϕ_0(x) − f(x)),    (4.14)

(ii) A_k → ∞ as k → ∞.

If, in addition, an estimate sequence satisfies

(iii) for all k ≥
0, there exists some x k such that f ( x k ) ≤ ϕ k ( x ⋆ ), then we can guaranteethat f ( x k ) − f ⋆ = O ( A − k ).The purpose of estimate sequences is to start from an initial model ϕ ( x ) satisfying ϕ ( x ) ≥ f ⋆ for all x ∈ R d , and to design a sequence of convex models ϕ k that areincreasingly good approximations of f , in the sense of (4.14). We now further commenton conditions (i) and (iii), assuming for simplicity that { A k } is monotonically increasing.• Regarding (i), for all x ∈ R d , we have to design ϕ k to be either (a) a lower bound onthe function (i.e., ϕ k ( x ) − f ( x ) ≤ x ), or (b) an increasingly good upperapproximation of f ( x ) when 0 ≤ ϕ k ( x ) − f ( x ) ≤ A − k ( ϕ ( x ) − f ( x )) for that x .That is, we require that the error | f ( x ) − ϕ k ( x ) | incurred when approximating f ( x )by ϕ k ( x ) gets smaller for all x for which ϕ k ( x ) is an upper bound on f ( x ).To develop such models and the corresponding methods, three sequences of pointsare commonly used: (a) minimizers of our models ϕ k will correspond to iterates z k of the corresponding method, (b) a sequence y k of points, whose first-orderinformation is used to update the model of the function, and (c) the iterates x k ,corresponding to the best possible f ( x k ) we can form (which often do not corre-spond to the minimum of the model, ϕ k , which is not necessarily an upper boundon the function).• Regarding (iii), this condition ensures that the models ϕ k remain upper boundson the optimal value f ⋆ . That is, f ⋆ ≤ ϕ k ( x ⋆ ) (since f ⋆ ≤ f ( x k )), and hence that ϕ k ( x ⋆ ) − f ⋆ ≥
0. From previous bullet point, this ensure that the modeling error of f ⋆ goes to 0 asymptotically as k increases. More formally, conditions (ii) and (iii)allow constructing proofs similar to potential functions, and to obtain convergencerates. That is, under (iii), we get that f ( x k ) − f ⋆ ≤ ϕ k ( x ⋆ ) − f ( x ⋆ ) ≤ A − k ( ϕ ( x ⋆ ) − f ( x ⋆ )) , (4.15) .4. Nesterov’s Acceleration and therefore f ( x k ) − f ⋆ ≤ O ( A − k ), and the convergence rate is dictated by thatof A − k , which goes to 0 by (ii).Now, the game consists in picking appropriate sequences { ( A k , ϕ k ) } corresponding tosimple algorithms. We thus translate our potential functions results in terms of estimatesequences. Potential Functions as Estimate Sequences
One can observe that potential functions and estimate sequences are closely related. First,in both cases, convergence speed is dictated by that of a scalar sequence A − k . In fact,there is one subtle but important difference between the two approaches: whereas ϕ k ( x )should be an increasingly good approximation of f for all x in the context of estimatesequences, potential functions require a model to be an increasingly good approximationof f ⋆ only, which is less restrictive. Hence, estimate sequences are more general butmay not handle situations where the analysis actually requires having a weaker modelholding only on f ⋆ , and not of f ( x ) for all x . Let us make this discussion more concrete onthree examples, namely gradient descent, Nesterov’s method, and the optimized gradientmethod.• Gradient descent: the potential inequality from Theorem 4.2 actually holds for all x , and not only x ⋆ , as the proof does not use the optimality of x ⋆ , that is( A k + 1)( f ( x k +1 ) − f ( x )) + L k x k +1 − x k ≤ A k ( f ( x k ) − f ( x )) + L k x k − x k , for all x ∈ R d . Therefore the pair { ( A k , ϕ k ( x )) } k with ϕ k ( x ) = f ( x k ) + L A k k x k − x k and A k = A + k (with A >
0) is an estimate sequence for gradient descent.• Nesterov’s first method: the potential inequality from Theorem 4.8 also holds forall x ∈ R d , not only x ⋆ , as the proof does not use optimality of x ⋆ , that is A k +1 ( f ( x k +1 ) − f ( x )) + L k z k +1 − x k ≤ A k ( f ( x k ) − f ( x )) + L k z k − x k , for all x ∈ R d . Hence the pair { ( A k , ϕ k ( x )) } k with ϕ k ( x ) = f ( x k ) + L A k k z k − x k and A k = A k − + √ A k − +12 (with A >
0) is an estimate sequence for Nesterov’smethod. Nesterov Acceleration • Optimized gradient method: the potential inequality from Theorem 4.4 actuallyuses the fact that x ⋆ is an optimal point. Indeed, the proof relies on f ( x ) ≥ f ( y k ) + h∇ f ( y k ); x − y k i + 12 L k∇ f ( y k ) k which is a valid upper bound on f ( x ) only when ∇ f ( x ) = 0 (see Equation (4.4)).This does not mean there is no estimate sequence -type model of the function asthe algorithm proceeds, but the potential does not directly correspond to one.Alternatively, one can interpret ϕ k ( x ) = f ( y k ) − L k∇ f ( y k ) k + L θ k,N k z k +1 − x k as an increasingly good model of f ⋆ .A similar conclusion holds for the conjugate gradient method (CG), from Ap-pendix 4.C. We are not aware of any estimate sequence allowing to prove CGreaches the lower bound from Theorem 4.7.Those discussions can be carried on to the strongly convex setting, which we now address. Before designing faster methods exploiting strong convexity, let us briefly describe thebenefits and limitations of this additional assumption. Roughly speaking, strong convex-ity guarantees gradients will get larger further away from the optimal solution. One wayof looking at it is as follows: a function f is L -smooth and µ -strongly convex if and onlyif f ( x ) = ˜ f ( x ) + µ k x − x ⋆ k , where x ⋆ is an optimal point for both f and ˜ f , and where ˜ f is convex and ( L − µ )-smooth.Therefore, one iteration of gradient descent can be described as follows: x k +1 − x ⋆ = x k − x ⋆ − γ ∇ f ( x k )= x k − x ⋆ − γ ( ∇ ˜ f ( x k ) + µ ( x k − x ⋆ ))= (1 − γµ )( x k − x ⋆ ) − γ ∇ ˜ f ( x k ) , and we see that for small enough step sizes γ , there is an additional contraction effectdue to the factor (1 − γµ ), as compared to the effect of gradient descent has on smoothconvex functions such as ˜ f . In what follows, we adapt our proofs to develop acceleratedmethods in the strongly convex case. On the other hand, smooth strongly convex func-tions are sandwiched between two quadratic functions, so these assumptions are muchmore restrictive than smoothness alone. .5. Acceleration under Strong Convexity As in the smooth convex case, the smooth strongly convex case can be studied throughpotential functions. There are many ways to prove convergence rates for this setting,but let us only consider one that allows recovering the µ = 0 case as its limit, in orderto have well-defined results even in degenerate cases. The proof below is essentially thesame as that for the smooth convex case in Theorem 4.2 and the same inequalities areused, with strong convexity instead of convexity. The potential is only slightly modifiedallowing A k to have a geometric growth rate, as follows φ k = A k ( f ( x k ) − f ⋆ ) + L + µA k k x k − x ⋆ k . For notation convenience we use q = µL to denote the inverse condition ratio below. Thisquantity plays a key role in the geometric convergence of first-order methods in thepresence of strong convexity. Theorem 4.10.
Let f be a L-smooth (possibly µ-strongly) convex function, x_⋆ ∈ argmin_x f(x), and k ∈ N. For any A_k ≥ 0 and x_k it holds that

A_{k+1}(f(x_{k+1}) − f_⋆) + ((L + µA_{k+1})/2)‖x_{k+1} − x_⋆‖² ≤ A_k(f(x_k) − f_⋆) + ((L + µA_k)/2)‖x_k − x_⋆‖²,

with x_{k+1} = x_k − (1/L)∇f(x_k), A_{k+1} = (1 + A_k)/(1 − q), and q = µ/L.

Proof.
The proof consists in performing a weighted sum of the following inequalities:• µ -strong convexity of f between x k and x ⋆ , with weight λ = A k +1 − A k ≥ f ( x k ) − f ⋆ + h∇ f ( x k ); x ⋆ − x k i + µ k x ⋆ − x k k , • smoothness of f between x k and x k +1 with weight λ = A k +1 ≥ f ( x k +1 ) − f ( x k ) − h∇ f ( x k ); x k +1 − x k i − L k x k − x k +1 k . This weighted sum yields a valid inequality0 ≥ λ [ f ( x k ) − f ⋆ + h∇ f ( x k ); x ⋆ − x k i + µ k x ⋆ − x k k ]+ λ [ f ( x k +1 ) − f ( x k ) − h∇ f ( x k ); x k +1 − x k i − L k x k − x k +1 k ] . Using x k +1 = x k − L ∇ f ( x k ), this inequality can be rewritten exactly as A k +1 ( f ( x k +1 ) − f ⋆ ) + L + µA k +1 k x k +1 − x ⋆ k ≤ A k ( f ( x k ) − f ⋆ ) + L + µA k k x k − x ⋆ k − A k L k∇ f ( x k ) k + 1 + A k − (1 − q ) A k +1 L h∇ f ( x k ); ∇ f ( x k ) + 2 L ( x ⋆ − x k ) i . Nesterov Acceleration
The desired inequality follows from A_{k+1} = (1 + A_k)/(1 − q) and the sign of A_k, making one term nonpositive and the other one equal to zero, reaching

A_{k+1}(f(x_{k+1}) − f_⋆) + ((L + µA_{k+1})/2)‖x_{k+1} − x_⋆‖² ≤ A_k(f(x_k) − f_⋆) + ((L + µA_k)/2)‖x_k − x_⋆‖².

From this theorem, we observe that adding strong convexity to the problem allows A_k to grow at a geometric rate given by (1 − q)^{−1} (where we denote again by q = µ/L the inverse condition number). The corresponding iteration complexity of gradient descent for smooth strongly convex minimization is O((L/µ) log(1/ε)) to find an approximate solution f(x_k) − f_⋆ ≤ ε. This rate is essentially tight, as can be verified on quadratic functions (see e.g. Chapter 2), and follows from the following corollary.

Corollary 4.11.
Let f be a L-smooth (possibly µ-strongly) convex function, and x_⋆ ∈ argmin_x f(x). For any N ∈ N, the iterates of gradient descent with step size γ_0 = γ_1 = . . . = γ_{N−1} = 1/L satisfy

f(x_N) − f_⋆ ≤ µ‖x_0 − x_⋆‖²/(2((1 − q)^{−N} − 1)),

with q = µ/L the inverse condition number.

Proof.

Following the reasoning of (4.5), we recursively use Theorem 4.10 starting with A_0 = 0, that is,

A_N(f(x_N) − f_⋆) ≤ φ_N ≤ . . . ≤ φ_0 = (L/2)‖x_0 − x_⋆‖²,

and notice that the recurrence equation A_{k+1} = (A_k + 1)/(1 − q) has the solution A_k = ((1 − q)^{−k} − 1)/q. The final bound is obtained using again f(x_N) − f_⋆ ≤ (L/2)‖x_0 − x_⋆‖²/A_N.

Remark 4.2 (Lower bounds). As in the smooth convex case, one can derive lower complexity bounds for smooth strongly convex optimization. Using the same lower bounds arising from smooth strongly convex quadratic minimization (for which Chebyshev's methods have optimal iteration complexities), one can conclude that no black-box first-order method can behave better than f(x_k) − f_⋆ = O(ρ^k) with ρ = (1 − √q)²/(1 + √q)². We refer the reader to Nesterov (2003) and Nemirovsky (1992) for more details. For smooth strongly convex problems beyond quadratics, this lower bound can be improved to f(x_k) − f_⋆ = O((1 − √q)^k), as provided in (Drori and Taylor, 2021, Corollary 4). In this context we will see that Nesterov's acceleration satisfies

f(x_k) − f_⋆ = O((1 − √q)^k),

i.e., has a O(√(L/µ) log(1/ε)) iteration complexity, reaching the lower complexity bound up to a constant factor. As for the optimized gradient method provided in Section 4.3, an optimal method for the smooth strongly convex case is detailed in Section 4.6.1, and can be shown to match the corresponding lower-complexity bound.

To adapt proofs of convergence of accelerated methods to the strongly convex case, we need to make a small adjustment in the shape of the previous accelerated method:

y_k = x_k + τ_k(z_k − x_k)
x_{k+1} = y_k − α_k∇f(y_k)
z_{k+1} = (1 − (µ/L)δ_k)z_k + (µ/L)δ_k y_k − γ_k∇f(y_k).    (4.16)

As discussed below, this scheme can be seen as an optimized gradient method for smooth strongly convex minimization (whose details are provided in Section 4.6.1), similar to the smooth convex case. Following this scheme, Nesterov's method for strongly convex problems is presented in Algorithm 12. As in the smooth convex case, we detail several of its convenient reformulations in Algorithm 20 and Algorithm 21. The corresponding equivalences are established in Appendix 4.B.3.

Algorithm 12
Nesterov’s method, form I
Input: A L-smooth (possibly µ-strongly) convex function f, initial point x_0. Initialize z_0 = x_0 and A_0 = 0, set q = µ/L (the inverse condition ratio).
for k = 0, . . . do
  A_{k+1} = (2A_k + 1 + √(4A_k + 4qA_k² + 1))/(2(1 − q))
  Set τ_k = (A_{k+1} − A_k)(1 + qA_k)/(A_{k+1} + 2qA_kA_{k+1} − qA_k²) and δ_k = (A_{k+1} − A_k)/(1 + qA_{k+1})
  y_k = x_k + τ_k(z_k − x_k)
  x_{k+1} = y_k − (1/L)∇f(y_k)
  z_{k+1} = (1 − qδ_k)z_k + qδ_k y_k − (δ_k/L)∇f(y_k)
end for
Output: An approximate solution x_{k+1}.

Regarding the potential, we make the same adjustment as for gradient descent, arriving at the following theorem.

Theorem 4.12.
Let f be a L-smooth (possibly µ-strongly) convex function, x_⋆ ∈ argmin_x f(x), and k ∈ N. For all x_k, z_k ∈ R^d, and A_k ≥ 0,

A_{k+1}(f(x_{k+1}) − f_⋆) + ((L + µA_{k+1})/2)‖z_{k+1} − x_⋆‖² ≤ A_k(f(x_k) − f_⋆) + ((L + µA_k)/2)‖z_k − x_⋆‖²,

with A_{k+1} = (2A_k + 1 + √(4A_k + 4qA_k² + 1))/(2(1 − q)) and q = µ/L.

Proof.
The proof consists in a weighted sum of the following inequalities:• strong convexity between x ⋆ and y k with weight λ = A k +1 − A k f ⋆ ≥ f ( y k ) + h∇ f ( y k ); x ⋆ − y k i + µ k x ⋆ − y k k , Nesterov Acceleration • convexity between x k and y k with weight λ = A k f ( x k ) ≥ f ( y k ) + h∇ f ( y k ); x k − y k i , • smoothness between y k and x k +1 ( descent lemma ) with weight λ = A k +1 f ( y k ) + h∇ f ( y k ); x k +1 − y k i + L k x k +1 − y k k ≥ f ( x k +1 ) . We therefore arrive at the following valid inequality0 ≥ λ [ f ( y k ) − f ⋆ + h∇ f ( y k ); x ⋆ − y k i + µ k x ⋆ − y k k ]+ λ [ f ( y k ) − f ( x k ) + h∇ f ( y k ); x k − y k i ]+ λ [ f ( x k +1 ) − f ( y k ) − h∇ f ( y k ); x k +1 − y k i − L k x k +1 − y k k ] . For the sake of simplicity, we do not substitute A k +1 by its expression until the last stageof the reformulation. Substituting x k +1 , z k +1 by their expressions in (4.16) along with τ k = ( A k +1 − A k )(1+ qA k ) A k +1 +2 qA k A k +1 − qA k , α k = L , δ k = A k +1 − A k qA k +1 and γ k = δ k L , basic algebra shows thatthe previous inequality can be reorganized as A k +1 ( f ( x k +1 ) − f ⋆ ) + L + µA k +1 k z k +1 − x ⋆ k ≤ A k ( f ( x k ) − f ⋆ ) + L + µA k k z k − x ⋆ k + ( A k − A k +1 ) − A k +1 − qA k +1 qA k +1 L k∇ f ( y k ) k − A k ( A k +1 − A k )(1 + qA k )(1 + qA k +1 )( A k +1 + 2 qA k A k +1 − qA k ) µ k x k − z k k . The desired statement follows from picking A k +1 ≥ A k ≥
0, and such that

(A_k − A_{k+1})² − A_{k+1} − qA²_{k+1} = 0,

yielding

A_{k+1}(f(x_{k+1}) − f_⋆) + ((L + µA_{k+1})/2)‖z_{k+1} − x_⋆‖² ≤ A_k(f(x_k) − f_⋆) + ((L + µA_k)/2)‖z_k − x_⋆‖².

The final worst-case guarantee is obtained using the same reasoning as before, together with a simple bound on A_{k+1}:

A_{k+1} = (2A_k + 1 + √(4A_k + 4qA_k² + 1))/(2(1 − q)) ≥ (2A_k + 2A_k√q)/(2(1 − q)) = A_k/(1 − √q),    (4.17)

which means f(x_k) − f_⋆ = O((1 − √q)^k) when µ > 0, or alternatively a O(√(L/µ) log(1/ε)) iteration complexity for obtaining an approximate solution f(x_N) − f_⋆ ≤ ε. The following corollary summarizes what we obtained for Nesterov's method.

Corollary 4.13.
Let f be a L-smooth (possibly µ-strongly) convex function, and x_⋆ ∈ argmin_x f(x). For all N ∈ N, N ≥ 1, the iterates of Algorithm 12 satisfy

f(x_N) − f_⋆ ≤ min{2/N², (1 − √q)^N} L‖x_0 − x_⋆‖²,

with q = µ/L.

Proof.

Following the argument of (4.5), we recursively use Theorem 4.12 with A_0 = 0, together with the bounds on A_N for the smooth convex case (4.13) (notice that A_{k+1} is an increasing function of µ, and hence the bound for the smooth case remains valid in the smooth strongly convex one) and for the smooth strongly convex one (4.17) with A_1 = (1 − q)^{−1} = ((1 − √q)(1 + √q))^{−1} ≥ (1/2)(1 − √q)^{−1}, reaching A_N ≥ (1/2)(1 − √q)^{−N}.
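For concreteness, here is a minimal Python sketch of Algorithm 12 (assuming only numpy and a user-supplied gradient oracle grad_f; the parameter recursions are transcribed from the algorithm above, and this code is ours, not the authors'). Setting mu = 0 recovers Algorithm 9.

```python
import numpy as np

def nesterov_strongly_convex(grad_f, x0, L, mu, N):
    """Nesterov's method, form I (Algorithm 12), for an L-smooth, mu-strongly convex f."""
    q = mu / L                      # inverse condition ratio
    x = x0.copy()
    z = x0.copy()
    A = 0.0
    for _ in range(N):
        A_next = (2.0 * A + 1.0 + np.sqrt(4.0 * A + 4.0 * q * A ** 2 + 1.0)) / (2.0 * (1.0 - q))
        tau = (A_next - A) * (1.0 + q * A) / (A_next + 2.0 * q * A * A_next - q * A ** 2)
        delta = (A_next - A) / (1.0 + q * A_next)
        y = x + tau * (z - x)
        g = grad_f(y)
        x = y - g / L                                            # gradient step
        z = (1.0 - q * delta) * z + q * delta * y - (delta / L) * g
        A = A_next
    return x
```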
Before moving to the next section, let us mention that another directconsequence of the potential inequality above (Theorem 4.12) is that z k may also serveas an approximate solution to x ⋆ when µ >
0. Indeed, using the inequality

((L + µA_N)/2)‖z_N − x_⋆‖² ≤ φ_N ≤ . . . ≤ φ_0 = (L/2)‖x_0 − x_⋆‖²,

it follows that

‖z_N − x_⋆‖² ≤ (1/(1 + qA_N))‖x_0 − x_⋆‖²,

and hence, A_N growing as (1 − √q)^{−N}, that ‖z_N − x_⋆‖² = O((1 − √q)^N) as well, and therefore also that

f(z_N) − f_⋆ ≤ (L/2)‖z_N − x_⋆‖² = O((1 − √q)^N).

In addition, y_N being a convex combination of x_N and z_N, the same conclusion holds for ‖y_N − x_⋆‖² and f(y_N) − f_⋆. Similar observations also apply to other variants of accelerated methods when µ > 0.

Some important simplifications are often made to the method in the strongly convex case where µ >
0. Several approaches produce the same method, known as the “constant momentum” version of Nesterov's accelerated gradient. We derive it by observing that the limit case, as k → ∞, of Algorithm 12 can be characterized explicitly. In particular, when k → ∞, it is clear that A_k → ∞ as well. We can then take the limits of all parameters as A_k → ∞, to obtain the “limit algorithm”. This is similar in spirit to the result showing that Polyak's heavy-ball method is the asymptotic version of Chebyshev's method, discussed in Section 2.3.2. First, the convergence rate is obtained as

lim_{A_k→∞} A_{k+1}/A_k = (1 − √q)^{−1}.

By taking the limits of all algorithmic parameters, that is,

lim_{A_k→∞} τ_k = √q/(1 + √q),   lim_{A_k→∞} δ_k = 1/√q,

we obtain Algorithm 13 and its equivalent, probably most well-known, second form, provided by Algorithm 14.

Algorithm 13
Nesterov’s method, form I, constant momentum
Input: A L-smooth µ-strongly convex function f, initial point x_0. Initialize z_0 = x_0 and A_0 > 0, set q = µ/L (the inverse condition ratio).
for k = 0, . . . do
  A_{k+1} = A_k/(1 − √q) {Only for the proof/relation to previous methods.}
  y_k = x_k + (√q/(1 + √q))(z_k − x_k)
  x_{k+1} = y_k − (1/L)∇f(y_k)
  z_{k+1} = (1 − √q)z_k + √q(y_k − (1/µ)∇f(y_k))
end for
Output: Approximate solutions (y_k, x_{k+1}, z_{k+1}).

Algorithm 14
Nesterov’s method, form II, constant momentum
Input: A L-smooth µ-strongly convex function f, initial point x_0. Initialize y_0 = x_0 and A_0 > 0, set q = µ/L (the inverse condition ratio).
for k = 0, . . . do
  A_{k+1} = A_k/(1 − √q) {Only for the proof/relation to previous methods.}
  x_{k+1} = y_k − (1/L)∇f(y_k)
  y_{k+1} = x_{k+1} + ((1 − √q)/(1 + √q))(x_{k+1} − x_k)
end for
Output: Approximate solutions (y_k, x_{k+1}).

From a worst-case analysis perspective, those simplifications correspond to using a Lyapunov function obtained by dividing that of Theorem 4.12 by A_k and then taking the limit, written

ρ^{−1}(f(x_{k+1}) − f_⋆ + (µ/2)‖z_{k+1} − x_⋆‖²) ≤ f(x_k) − f_⋆ + (µ/2)‖z_k − x_⋆‖²,

with ρ = (1 − √q).

Theorem 4.14.
Let f be a L-smooth µ-strongly convex function, x_⋆ ∈ argmin_x f(x), and k ∈ N. For any x_k, z_k ∈ R^d, the iterates of Algorithm 13 (or equivalently those of Algorithm 14) satisfy

ρ^{−1}(f(x_{k+1}) − f_⋆ + (µ/2)‖z_{k+1} − x_⋆‖²) ≤ f(x_k) − f_⋆ + (µ/2)‖z_k − x_⋆‖².

Proof.
Let A k > A k +1 = A k / (1 − √ q ). The proof is essentially the same asthat of as Theorem 4.12, by dividing all weights A k leading to a slight variation in thereformulation of the weighted sum, and the following valid inequality0 ≥ λ [ f ( y k ) − f ⋆ + h∇ f ( y k ); x ⋆ − y k i + µ k x ⋆ − y k k ]+ λ [ f ( y k ) − f ( x k ) + h∇ f ( y k ); x k − y k i ]+ λ [ f ( x k +1 ) − f ( y k ) − h∇ f ( y k ); x k +1 − y k i − L k x k +1 − y k k ] , with weights λ = A k +1 − A k A k = √ q − √ q , λ = A k A k = 1 , and λ = A k +1 A k = (1 − √ q ) − . Substituting y k = x k + √ q √ q ( z k − x k ) x k +1 = y k − L ∇ f ( y k ) z k +1 = (1 − √ q ) z k + √ q (cid:18) y k − µ ∇ f ( y k ) (cid:19) , we arrive to the following valid inequality ρ − (cid:18) f ( x k +1 ) − f ⋆ + µ k z k +1 − x ⋆ k (cid:19) ≤ f ( x k ) − f ⋆ + µ k z k − x ⋆ k − √ q (1 + √ q ) µ k x k − z k k , where we reach the desired statement from the last term being nonpositive ρ − (cid:18) f ( x k +1 ) − f ⋆ + µ k z k +1 − x ⋆ k (cid:19) ≤ f ( x k ) − f ⋆ + µ k z k − x ⋆ k . Corollary 4.15.
Let f be a L-smooth µ-strongly convex function, and x_⋆ ∈ argmin_x f(x). For all N ∈ N, N ≥ 1, the iterates of Algorithm 13 (or equivalently of Algorithm 14) satisfy

f(x_N) − f_⋆ ≤ (1 − √q)^N (f(x_0) − f_⋆ + (µ/2)‖x_0 − x_⋆‖²),

with q = µ/L.

Proof.

It directly follows from Theorem 4.14 with

φ_k = ρ^{−k}(f(x_k) − f_⋆ + (µ/2)‖z_k − x_⋆‖²),

ρ = 1 − √q, z_0 = x_0, and ρ^{−N}(f(x_N) − f_⋆) ≤ φ_N ≤ φ_0.
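In code, the constant-momentum form is particularly compact; the following minimal sketch of Algorithm 14 (assuming only numpy and a user-supplied gradient oracle grad_f; the transcription is ours) produces iterates satisfying the bound of Corollary 4.15.

```python
import numpy as np

def nesterov_constant_momentum(grad_f, x0, L, mu, N):
    """Nesterov's method with constant momentum (form II, Algorithm 14)."""
    q = mu / L
    beta = (1.0 - np.sqrt(q)) / (1.0 + np.sqrt(q))   # fixed momentum coefficient
    x = x0.copy()
    y = x0.copy()
    for _ in range(N):
        x_new = y - grad_f(y) / L                    # gradient step
        y = x_new + beta * (x_new - x)               # constant momentum step
        x = x_new
    return x
```

The only problem-dependent quantities are L and µ, through the step size 1/L and the fixed momentum coefficient (1 − √q)/(1 + √q).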
Remark 4.4.
In view of Section 4.4.2, one can also find estimate sequence interpretationsfor Algorithm 12 and 13 from their respective potential functions.
Remark 4.5.
A few works on accelerated methods focus on understanding this particularinstance of Nesterov’s method. Our analysis here is largely inspired by that of Nesterov(2003), but such potentials can be obtained in different ways, see for example (Wilson et al. , 2016; Hu and Lessard, 2017; Siegel, 2019; Bansal and Gupta, 2019).
In this section, we first push the potential function reasoning to its limit. We present the information-theoretic exact method (Taylor and Drori, 2021), which generalizes the optimized gradient method to the strongly convex case. Similar to Nesterov's method with constant momentum, the information-theoretic exact method has a limit case that is known as the triple momentum method (Van Scoy et al., 2017). We then discuss a more geometric variant, known as geometric descent (Bubeck et al., 2015), or quadratic averaging (Drusvyatskiy et al., 2018).
It turns out that there also exists optimal gradient methods for smooth strongly convexminimization, similar to the optimized gradient method for smooth convex minimization.Such methods can be obtained by solving a minimax problem similar to (4.7) withdifferent objectives.The following scheme is optimal for the criterion k z N − x ⋆ k k x − x ⋆ k , reaching the exact lowercomplexity bound for this criterion, as discussed below. In addition, this method reducesto OGM (see Section 4.3) when µ = 0 using the correspondence A k +1 = 4 θ k,N (for k < N ). Therefore, this method is doubly optimal , i.e. optimal for two criteria, in thesense that it also achieves the lower complexity bound for f ( y N ) − f ⋆ k x − x ⋆ k when µ = 0, usingthe last iteration adjustment from Lemma 4.5.The following analysis pushes the potential function reasoning to its limit. It is remi-niscent of Nesterov’s method in Algorithm 12, but also of the optimized gradient methodand its proof (see Theorem 4.4). That is, the known potential function for ITEM relieson inequality (4.4), not only for its proof but also simply for showing that it is non-negative, which follows from instantiating (4.4) at y = x ⋆ . The following analyses canbe found almost verbatim in (Taylor and Drori, 2021). The main proof of this sectionis particularly algebraic, but can be very reasonably skipped as it follows from similarideas as previous developments. Theorem 4.16.
Algorithm 15 Information-theoretic exact method (ITEM)
Input: An $L$-smooth (possibly $\mu$-strongly) convex function $f$, an initial point $x_0$.
Initialize $z_0 = x_0$, $A_0 = 0$, and set $q = \mu/L$ (the inverse condition ratio).
for $k = 0, \ldots$ do
  $A_{k+1} = \frac{(1+q)A_k + 2\big(1 + \sqrt{(1+A_k)(1+qA_k)}\big)}{(1-q)^2}$
  set $\tau_k = 1 - \frac{A_k}{(1-q)A_{k+1}}$ and $\delta_k = \frac{1}{2}\,\frac{(1-q)^2 A_{k+1} - (1+q)A_k}{1 + q + qA_k}$
  $y_k = x_k + \tau_k(z_k - x_k)$
  $x_{k+1} = y_k - \frac{1}{L}\nabla f(y_k)$
  $z_{k+1} = (1 - q\delta_k)z_k + q\delta_k y_k - \frac{\delta_k}{L}\nabla f(y_k)$
end for
Output: Approximate solutions $(y_k, x_{k+1}, z_{k+1})$.

Theorem 4.16. Let $f$ be an $L$-smooth (possibly $\mu$-strongly) convex function, $x_\star \in \operatorname{argmin}_x f(x)$, and $k \in \mathbb{N}$. For all $y_{k-1}, z_k \in \mathbb{R}^d$ and $A_k \geq 0$, the iterates of Algorithm 15 satisfy $\varphi_{k+1} \leq \varphi_k$, with
\[ \varphi_k = A_k\Big[ f(y_{k-1}) - f_\star - \tfrac{1}{2L}\|\nabla f(y_{k-1})\|^2 - \tfrac{\mu}{2(1-q)}\big\|y_{k-1} - \tfrac{1}{L}\nabla f(y_{k-1}) - x_\star\big\|^2 \Big] + \frac{L + \mu A_k}{2(1-q)}\|z_k - x_\star\|^2 \]
and $A_{k+1} = \frac{(1+q)A_k + 2\big(1 + \sqrt{(1+A_k)(1+qA_k)}\big)}{(1-q)^2}$. Proof.
Perform a weighted sum of two inequalities due to Theorem 4.4.• Smoothness and strong convexity between y k − and y k with weight λ = A k f ( y k − ) ≥ f ( y k ) + h∇ f ( y k ); y k − − y k i + 12 L k∇ f ( y k ) − ∇ f ( y k − ) k + µ − q ) k y k − y k − − L ( ∇ f ( y k ) − ∇ f ( y k − )) k , • smoothness and strong convexity of f between x ⋆ and y k with weight λ = A k +1 − A k f ⋆ ≥ f ( y k ) + h∇ f ( y k ); x ⋆ − y k i + 12 L k∇ f ( y k ) k + µ − q ) k y k − x ⋆ − L ∇ f ( y k ) k . Summing up and reorganizing those two inequalities (without substituting A k +1 by its Nesterov Acceleration expression, for simplicity), we arrive to the following valid inequality0 ≥ λ (cid:20) f ( y k ) + h∇ f ( y k ); y k − − y k i + 12 L k∇ f ( y k ) − ∇ f ( y k − ) k + µ − q ) k y k − y k − − L ( ∇ f ( y k ) − ∇ f ( y k − )) k (cid:21) + λ (cid:20) f ( y k ) − f ⋆ + h∇ f ( y k ); x ⋆ − y k i + 12 L k∇ f ( y k ) k + µ − q ) k y k − x ⋆ − L ∇ f ( y k ) k (cid:21) . Substituting the expressions of y k and z k +1 with y k = (1 − τ k ) (cid:16) y k − − L ∇ f ( y k − ) (cid:17) + τ k z k z k +1 = (1 − qδ k ) z k + qδ k y k − δ k L ∇ f ( y k )(noting that this substitution is valid even for k = 0 as A = 0 in this case, and hence τ = 1 and y = z ), the previous inequality can be reformulated exactly as φ k +1 ≤ φ k − LK − q P ( A k +1 ) k z k − x ⋆ k + K L (1 − q ) P ( A k +1 ) k (1 − q ) A k +1 ∇ f ( y k ) − µA k ( x k − x ⋆ ) + K µ ( z k − x ⋆ ) k , with three parameters (well defined given that 0 ≤ µ < L and A k , A k +1 ≥ K = q (1 + q ) + (1 − q ) qA k +1 K = (1 + q ) + (1 − q ) qA k +1 (1 − q ) (1 + q + qA k ) A k +1 K = (1 + q ) (1 + q ) A k − (1 − q )(2 + qA k ) A k +1 (1 + q ) + (1 − q ) qA k +1 , as well as P ( A k +1 ) = ( A k − (1 − q ) A k +1 ) − A k +1 (1 + qA k ) . For obtaining the desired inequality, we pick A k +1 such that A k +1 ≥ A k and P ( A k +1 ) = 0,reaching the claim φ k +1 ≤ φ k as well as the choice for A k +1 .The final bound for this method is obtained after the usual growth analysis of thesequence A k , as follows. When µ = 0, we have A k +1 = 2 + A k + 2 p A k ≥ A k + 2 p A k ≥ (1 + p A k ) , reaching p A k +1 ≥ √ A k and hence √ A k ≥ k and A k ≥ k . When µ >
0, we can use an alternate bound
\[ A_{k+1} = \frac{(1+q)A_k + 2\big(1+\sqrt{(1+A_k)(1+qA_k)}\big)}{(1-q)^2} \geq \frac{(1+q)A_k + 2\sqrt{qA_k^2}}{(1-q)^2} = \frac{A_k}{(1-\sqrt{q})^2}, \]
therefore reaching similar bounds as before. In this case, we only emphasize convergence results for $\|z_N - x_\star\|^2$, as it corresponds to the lower complexity bound for smooth strongly convex minimization (provided below). Corollary 4.17.
Let $f \in \mathcal{F}_{\mu,L}$ and denote $q = \mu/L$. For any $x_0 = z_0 \in \mathbb{R}^d$ and $N \in \mathbb{N}$ with $N \geq 1$, the iterates of Algorithm 15 satisfy
\[ \|z_N - x_\star\|^2 \leq \frac{1}{1 + qA_N}\|z_0 - x_\star\|^2 \leq \frac{(1-\sqrt{q})^{2N}}{(1-\sqrt{q})^{2N} + q}\|z_0 - x_\star\|^2. \]
Proof.
From Theorem 4.16, we get $\varphi_N \leq \varphi_{N-1} \leq \ldots \leq \varphi_0 = \frac{L}{2(1-q)}\|z_0 - x_\star\|^2$. From (4.4), we have that $\varphi_N \geq \frac{L + A_N\mu}{2(1-q)}\|z_N - x_\star\|^2$. It remains to use the bounds on $A_N$. That is, using $A_1 = \frac{4}{(1-q)^2} = \frac{4}{(1+\sqrt{q})^2(1-\sqrt{q})^2} \geq (1-\sqrt{q})^{-2}$, we have that $A_N \geq (1-\sqrt{q})^{-2N}$, which concludes the proof.
Before concluding, let us mention that the algorithm is non-improvable for minimizing large-scale smooth strongly convex functions in the following sense.
Theorem 4.18. (Drori and Taylor, 2021, Corollary 4) Let $0 \leq \mu < L < \infty$, $d, N \in \mathbb{N}$, and $d \geq N + 1$. For any black-box first-order method that performs at most $N$ calls to the first-order oracle $(f(\cdot), \nabla f(\cdot))$, there exists a function $f \in \mathcal{F}_{\mu,L}(\mathbb{R}^d)$ such that
\[ \|x_N - x_\star\|^2 \geq \frac{1}{1 + qA_N}\|x_0 - x_\star\|^2, \]
where $x_\star \in \operatorname{argmin}_x f(x)$, $x_N$ is the output of the method under consideration, and $A_N$ is defined as in Algorithm 15.
In the next section, we see that this method also admits a clean asymptotic counterpart, known as the triple momentum method (TMM). Remark 4.6.
Just as for the optimized gradient method from Section 4.3, this method might serve as a template for designing other accelerated schemes. However, it has the same caveats as the optimized gradient method, which are also similar to those of the triple momentum method presented in the next section. As emphasized in Section 4.3.3, it is unclear how to generalize it to broader classes of problems, e.g., involving constraints.
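As a complement to the discussion above, the following short Python sketch implements the updates of Algorithm 15 as stated earlier and compares $\|z_N - x_\star\|^2$ with the guarantee of Corollary 4.17 on a toy quadratic; the function name, the test problem, and the parameter values are illustrative choices of ours.

```python
import numpy as np

def item(grad, x0, L, mu, n_iter):
    """Information-theoretic exact method (Algorithm 15), as stated above."""
    q = mu / L
    x, z, A = x0.copy(), x0.copy(), 0.0
    for _ in range(n_iter):
        A_next = ((1 + q) * A + 2 * (1 + np.sqrt((1 + A) * (1 + q * A)))) / (1 - q) ** 2
        tau = 1 - A / ((1 - q) * A_next)
        delta = ((1 - q) ** 2 * A_next - (1 + q) * A) / (2 * (1 + q + q * A))
        y = x + tau * (z - x)
        g = grad(y)
        x = y - g / L
        z = (1 - q * delta) * z + q * delta * y - (delta / L) * g
        A = A_next
    return x, z, A

if __name__ == "__main__":
    d = np.logspace(0, 2, 30)
    L, mu = d.max(), d.min()
    grad = lambda x: d * x            # f(x) = 0.5 x^T diag(d) x, minimizer x* = 0
    x0 = np.ones(30)
    x, z, A = item(grad, x0, L, mu, n_iter=50)
    q = mu / L
    # Corollary 4.17: ||z_N - x*||^2 <= ||z_0 - x*||^2 / (1 + q A_N).
    print(np.sum(z ** 2), np.sum(x0 ** 2) / (1 + q * A))
```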
The triple momentum method, due to Van Scoy et al. (2017), pushes the Lyapunov reasoning to its limit. It is reminiscent of Nesterov's method in Algorithm 13, and corresponds to the asymptotic behavior of the information-theoretic exact method. Indeed, considering Algorithm 15, one can explicitly compute
\[ \lim_{A_k \to \infty} \frac{A_{k+1}}{A_k} = (1-\sqrt{q})^{-2}, \]
as well as
\[ \lim_{A_k \to \infty} \tau_k = 1 - \frac{1-\sqrt{q}}{1+\sqrt{q}}, \qquad \lim_{A_k \to \infty} \delta_k = \frac{1}{\sqrt{q}}. \]
Algorithm 16
Triple momentum method (TMM)
Input: An $L$-smooth $\mu$-strongly convex function $f$, an initial point $x_0$.
Initialize $y_{-1} = z_0 = x_0$, and set $q = \mu/L$ (the inverse condition ratio).
for $k = 0, \ldots$ do
  $y_k = \frac{1-\sqrt{q}}{1+\sqrt{q}}\Big(y_{k-1} - \frac{1}{L}\nabla f(y_{k-1})\Big) + \Big(1 - \frac{1-\sqrt{q}}{1+\sqrt{q}}\Big)z_k$
  $z_{k+1} = \sqrt{q}\Big(y_k - \frac{1}{\mu}\nabla f(y_k)\Big) + (1-\sqrt{q})z_k$
end for
Output: An approximate solution $y_k$ (or $z_{k+1}$).

As for the information-theoretic exact method, the known potential function for the triple momentum method relies on inequality (4.4), not only for its proof but also simply for showing that it is nonnegative, which follows from instantiating (4.4) at $y = x_\star$. Theorem 4.19.
Let $f$ be an $L$-smooth $\mu$-strongly convex function, $x_\star = \operatorname{argmin}_x f(x)$, and $k \in \mathbb{N}$. For any $y_{k-1}, z_k \in \mathbb{R}^d$, the iterates of Algorithm 16 satisfy
\[ f(y_k) - f_\star - \tfrac{1}{2L}\|\nabla f(y_k)\|^2 - \tfrac{\mu}{2(1-q)}\big\|y_k - x_\star - \tfrac{1}{L}\nabla f(y_k)\big\|^2 + \tfrac{\mu}{2(1-q)}\|z_{k+1} - x_\star\|^2 \]
\[ \leq \rho^2 \Big( f(y_{k-1}) - f_\star - \tfrac{1}{2L}\|\nabla f(y_{k-1})\|^2 - \tfrac{\mu}{2(1-q)}\big\|y_{k-1} - x_\star - \tfrac{1}{L}\nabla f(y_{k-1})\big\|^2 + \tfrac{\mu}{2(1-q)}\|z_k - x_\star\|^2 \Big), \]
with $\rho = 1 - \sqrt{q}$. Proof.
For simplicity, let us consider Algorithm 16 in the following form, parameter-ized by ρ y k = ρ − ρ (cid:16) y k − − L ∇ f ( y k − ) (cid:17) + (cid:18) − ρ − ρ (cid:19) z k z k +1 = (1 − ρ ) (cid:16) y k − µ ∇ f ( y k ) (cid:17) + ρz k . We combine the following two inequalities• smoothness and strong convexity between y k − and y k with weight λ = ρ f ( y k − ) ≥ f ( y k ) + h∇ f ( y k ); y k − − y k i + 12 L k∇ f ( y k ) − ∇ f ( y k − ) k + µ − q ) k y k − y k − − L ( ∇ f ( y k ) − ∇ f ( y k − )) k . • smoothness and strong convexity between x ⋆ and y k with weight λ = 1 − ρ f ⋆ ≥ f ( y k ) + h∇ f ( y k ); x ⋆ − y k i + 12 L k∇ f ( y k ) k + µ − q ) k y k − x ⋆ − L ∇ f ( y k ) k , .6. Recent Variants of Accelerated Methods After some algebra, the weighted sum can be reformulated exactly as (it is simpler notto use the expression of ρ to verify this) f ( y k ) − f ⋆ − L k∇ f ( y k ) k − µ − q ) k y k − x ⋆ − L ∇ f ( y k ) k + µ − q k z k +1 − x ⋆ k ≤ ρ (cid:18) f ( y k − ) − f ⋆ − L k∇ f ( y k − ) k − µ − q ) k y k − − x ⋆ − L ∇ f ( y k − ) k + µ − q k z k − x ⋆ k (cid:19) − q − ( ρ − ( ρ −
2) (1 − q ) h∇ f ( y k ); ρ ( y k − − x ⋆ ) − ρ − z k − x ⋆ ) − ρL ∇ f ( y k − ) i− q − ( ρ − (1 − q ) µ k∇ f ( y k ) k . Using the expression ρ = 1 − √ q , the last two terms cancel and we arrive to the desired f ( y k ) − f ⋆ − L k∇ f ( y k ) k − µ − q ) k y k − x ⋆ − L ∇ f ( y k ) k + µ − q k z k +1 − x ⋆ k ≤ ρ (cid:18) f ( y k − ) − f ⋆ − L k∇ f ( y k − ) k − µ − q ) k y k − − x ⋆ − L ∇ f ( y k − ) k + µ − q k z k − x ⋆ k (cid:19) . Remark 4.7.
This method was proposed in (Van Scoy et al., 2017). It was further studied in Cyrus et al. (2018) from a robust control perspective, and by Zhou et al. (2020) as an accelerated method for a different objective. The triple momentum method can be obtained as a time-independent optimized gradient method (Lessard and Seiler, 2020; Gramlich et al., 2020). Of course, all drawbacks that were valid for OGM and ITEM also apply to the triple momentum method, so the same questions related to generalizations of this scheme remain open. Furthermore, this method is defined only for $\mu > 0$, like Nesterov's method with constant momentum.
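The following small Python sketch implements the two updates of Algorithm 16 as written above; the helper name and the quadratic test problem are ours and only serve as an illustration.

```python
import numpy as np

def triple_momentum(grad, x0, L, mu, n_iter):
    """Triple momentum method (Algorithm 16), with the coefficients stated above."""
    q = mu / L
    sq = np.sqrt(q)
    y_prev, z = x0.copy(), x0.copy()   # y_{-1} = z_0 = x_0
    for _ in range(n_iter):
        y = (1 - sq) / (1 + sq) * (y_prev - grad(y_prev) / L) \
            + (2 * sq / (1 + sq)) * z
        z = sq * (y - grad(y) / mu) + (1 - sq) * z
        y_prev = y
    return y_prev, z

if __name__ == "__main__":
    d = np.logspace(0, 2, 20)
    L, mu = d.max(), d.min()
    f = lambda x: 0.5 * np.dot(d * x, x)
    grad = lambda x: d * x
    yN, zN = triple_momentum(grad, np.ones(20), L, mu, n_iter=100)
    print(f(yN))   # decays roughly like (1 - sqrt(mu/L))^(2N) on this instance
```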
Accelerated methods tailored for the strongly convex setting, such as Algorithms 13 and 16, make use of two kinds of step sizes for updating the different sequences. First, they perform small gradient steps with step size $1/L$. Such steps correspond to minimizing quadratic upper bounds (4.2). Second, they use so-called large steps of size $1/\mu$, which correspond to minimizing quadratic lower bounds (4.3).
This idea can be exploited for developing more geometric accelerated methods. In particular, Drusvyatskiy et al. (2018) propose a method based on quadratic averaging, similar in shape to Algorithm 13, but where the last sequence $z_k$ is explicitly computed as the minimum of a quadratic lower bound on the function (the coefficients arising in the computation of $z_k$ are then picked dynamically to maximize this lower bound). One of those lower bounds is that of the previous iteration, whose minimum is achieved at $z_{k-1}$, and the other one is $f(y_k) + \langle \nabla f(y_k); x - y_k\rangle + \tfrac{\mu}{2}\|x - y_k\|^2$, whose minimum is $y_k - \tfrac{1}{\mu}\nabla f(y_k)$. Because of the specific shape of those lower bounds, their convex combination has a minimum that is the convex combination of their respective minima, with the same weights, motivating the update rule $z_k = \lambda\big(y_k - \tfrac{1}{\mu}\nabla f(y_k)\big) + (1-\lambda)z_{k-1}$, with $\lambda$ chosen dynamically (to maximize the minimum value of the new under-approximation).
Alternatively, the sequence $z_k$ can be interpreted in terms of a localization method, tracking $x_\star$ using intersections of balls. In this case, the sequence $z_k$ corresponds to the centers of the balls containing $x_\star$. This alternative is referred to as geometric descent (Bubeck et al., 2015), and the $\lambda$'s are picked to minimize the radius of the ball, centered at $z_k = \lambda\big(y_k - \tfrac{1}{\mu}\nabla f(y_k)\big) + (1-\lambda)z_{k-1}$, while making sure it contains $x_\star$. Geometric descent and quadratic averaging produce the same sequences of iterates, as shown in (Drusvyatskiy et al., 2018, Theorem 4.5).
Geometric descent is detailed in (Bubeck et al., 2015) and (Bubeck, 2015, Section 3.6.3). It was extended to handle constraints (Chen et al., 2017), and studied using the same Lyapunov function as that used in Theorem 4.14 (Karimi and Vavasis, 2017).

Practical Extensions

Our goal now is to detail a few of the many extensions of Nesterov's accelerated methods. We see below that additional elements can be incorporated into the accelerated methods above while keeping exactly the same proof structures. That is, we perform weighted sums involving the exact same inequalities for the smooth (possibly strongly) convex functions, and only use a few additional inequalities to take the new elements into account. We also intend to provide intuitions, along with bibliographical pointers for going further. The following scenarios are particularly important in practice.
Constraints
In the presence of constraints, or nonsmooth functions, a common ap-proach is to use (proximal) splitting approaches. This idea is not recent, see e.g., (Dou-glas and Rachford, 1956; Glowinski and Marroco, 1975; Lions and Mercier, 1979), butis still very relevant in signal processing, computer vision and statistical learning (Com-bettes and Pesquet, 2011; Parikh and Boyd, 2014; Chambolle and Pock, 2016; Fessler,2020).
Adaptation
Problem constants, such as smoothness or strong convexity parameters, are generally unknown. Furthermore, their local values are typically much more advantageous than their (usually conservative) global counterparts. Smoothness constants are commonly estimated on the fly using backtracking line-search strategies, see e.g. (Goldstein, 1962; Armijo, 1966; Nesterov, 1983). Strong convexity, or Hölderian error bounds, on the other hand, are more difficult to estimate, and restart schemes are often used to adapt to these additional regularity properties (see Chapter 6), see e.g. (Becker et al., 2011; O'Donoghue and Candes, 2015; Roulet and d'Aspremont, 2017).

Non-Euclidean settings
Although we only briefly mention this topic, taking into account the geometry of the problem at hand is in general key to obtaining good empirical performance. In particular, optimization problems are often more naturally formulated in a non-Euclidean space, with non-Euclidean norms producing a better implicit model for the function. A popular method for this setting is commonly known as mirror descent (Nemirovsky and Yudin, 1983a), see also (Ben-Tal and Nemirovski, 2001; Juditsky and Nemirovsky, 2011a); we do not intend to detail it at length here. Good surveys are provided by Beck and Teboulle (2003) and Bubeck (2015). However, we do detail an accelerated method in this setting in Section 4.7.3.
In this section, we consider the problem of minimizing a sum of two convex functions
\[ F_\star = \min_{x \in \mathbb{R}^d} \{ F(x) \equiv f(x) + h(x) \}, \tag{4.18} \]
where $f$ is $L$-smooth and (possibly $\mu$-strongly) convex, and where $h$ is convex, closed and proper (CCP), which we denote by $f \in \mathcal{F}_{\mu,L}$ and $h \in \mathcal{F}_{0,\infty}$. Those are technical conditions ensuring that the proximal operator, defined hereafter, is well defined everywhere on $\mathbb{R}^d$. We refer to (Rockafellar, 1970; Bauschke and Combettes, 2011; Ryu and Boyd, 2016) for further details. In addition, we assume a proximal operator of $h$ to be readily available, so that
\[ \operatorname{prox}_{\gamma h}(x) = \operatorname{argmin}_{y \in \mathbb{R}^d} \Big\{ \gamma h(y) + \tfrac{1}{2}\|x - y\|^2 \Big\}, \tag{4.19} \]
can be evaluated efficiently (Chapter 5 deals with some cases where this operator is approximated; see also the discussions in Section 4.8). The proximal operator can be seen as an implicit (sub)gradient step on $h$, as dictated by the optimality conditions of the proximal operation:
\[ x_+ = \operatorname{prox}_{\gamma h}(x) \;\Leftrightarrow\; x_+ = x - \gamma g_h(x_+) \text{ with } g_h(x_+) \in \partial h(x_+). \]
In particular, when $h(x)$ is the indicator function of a closed convex set $Q$, the proximal operation corresponds to the orthogonal projection onto $Q$. There are a few commonly used functions for which the proximal operation has an analytical solution, such as $h(x) = \|x\|_1$; see, for instance, (Combettes and Pesquet, 2011, Table 2). In the proofs below, we incorporate $h$ using inequalities characterizing convexity, that is,
\[ h(x) \geq h(y) + \langle g_h(y); x - y\rangle, \]
with $g_h(y) \in \partial h(y)$ some subgradient of $h$ at $y$.
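As an illustration of (4.19) and of the implicit-step interpretation above, the following Python snippet implements two standard proximal operators (soft-thresholding for the $\ell_1$ norm and projection for the indicator of a Euclidean ball) and checks the optimality condition $x_+ = x - \gamma g_h(x_+)$; the helper names are ours, while the closed forms are the standard ones.

```python
import numpy as np

def prox_l1(x, gamma):
    """prox_{gamma * ||.||_1}(x): componentwise soft-thresholding."""
    return np.sign(x) * np.maximum(np.abs(x) - gamma, 0.0)

def prox_ball_indicator(x, radius=1.0):
    """prox of the indicator of {y : ||y||_2 <= radius}: orthogonal projection."""
    nrm = np.linalg.norm(x)
    return x if nrm <= radius else (radius / nrm) * x

# Sanity check of x_+ = x - gamma * g_h(x_+) with g_h(x_+) a subgradient of ||.||_1 at x_+:
x, gamma = np.array([1.5, -0.2, 0.7]), 0.5
xp = prox_l1(x, gamma)
g = (x - xp) / gamma          # implicit subgradient recovered from the prox step
print(xp, g)                  # entries of g lie in [-1, 1], and g = sign(xp) wherever xp != 0
```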
In this setting, classical methods for solving (4.18) consist in using a forward-backward splitting strategy (in other words, forward, or gradient, steps on $f$, and backward, proximal, or implicit, steps on $h$), introduced by Passty (1979). This topic is addressed in many references, and we refer to (Combettes and Pesquet, 2011; Parikh and Boyd, 2014; Ryu and Boyd, 2016) and the references therein for further details. In the context of accelerated methods, forward-backward splittings were introduced by Nesterov (2003) and Nesterov (2013) through the concept of gradient mapping; see also Tseng (2008) and Beck and Teboulle (2009). Problem (4.18) is also sometimes referred to as the composite convex optimization setting (Nesterov, 2013). Depending on the assumptions made on $f$ and $h$, there are of course alternate ways of solving this problem (for example, when the proximal operator is available for both, one could use the Douglas-Rachford splitting (Douglas and Rachford, 1956; Lions and Mercier, 1979)), but this goes beyond the purpose of this chapter and we refer to (Combettes and Pesquet, 2011; Ryu and Yin, 2020) and the references therein for further discussions on those topics.
In previous sections, we assumed $f$ to be $L$-smooth and possibly $\mu$-strongly convex. In all previous algorithms, we explicitly used the values of both $L$ and $\mu$ to design the methods. However, this is not a desirable feature. First, it means that we need to be able to estimate valid values for $L$ and $\mu$. Second, it means that we are not adaptive to potentially better (local) values of those parameters. That is, we will not benefit from problems being simpler than specified, i.e., when the smallest valid $L$ is much smaller than our estimate, and/or the largest valid $\mu$ is much larger than our estimate. We also want to benefit from the typically better local properties of the function along the path taken by the method, rather than from global ones. The difference between local and global regularity properties is often significant, meaning that adaptive methods often converge much faster in practice.
We discuss below how adaptation is implemented for the smoothness constant, using line-search techniques. However, it is still an open issue whether strong convexity parameters can be efficiently estimated while keeping reasonable worst-case guarantees, without resorting to restart schemes (i.e., outer iterations) (see Chapter 6).
To handle unknown parameters, the key is to look at the inequalities used in the proofs of the desired method. It turns out that smoothness is usually only used in inequalities between pairs of iterates, which means that these inequalities can be tested during the iterations. For our guarantees to hold, we therefore do not need the function to be $L$-smooth everywhere, but rather only need some inequality to hold for the value of $L$ we are using (smoothness of the function ensures that such an $L$ exists). Strong convexity, however, is typically only used in inequalities involving the optimal point (see, for example, the proof of Theorem 4.12), which we do not know a priori. Therefore, those inequalities cannot be tested as the algorithm proceeds, which complicates the estimation of strong convexity while running the algorithm. Adaptation to strong convexity is therefore typically done via the use of restarts. Those approaches are common, not new (Goldstein, 1962; Armijo, 1966), and were already used by Nesterov (1983).
They were adapted to the forward-backward setting in Nesterov (2013) and Beck and Teboulle (2009), and further exploited to improve performance in various settings, see e.g., (Scheinberg et al., 2014). The topic is further discussed in the next section, as well as in the notes and references provided in Section 4.8.

An Accelerated Forward-Backward Method with Backtracking
As discussed above, the smoothness constant $L$ is used very sparsely in the proofs, both for gradient descent (see Theorem 4.2 and Theorem 4.10) and accelerated variants (Theorem 4.8, Theorem 4.12, and Theorem 4.14). Essentially, it is used in only three places: (i) to compute $A_{k+1}$ (actually only when $\mu > 0$), (ii) in the gradient step $x_{k+1} = y_k - \tfrac{1}{L}\nabla f(y_k)$, and (iii) in the inequality
\[ f(x_{k+1}) \leq f(y_k) + \langle \nabla f(y_k); x_{k+1} - y_k\rangle + \tfrac{L}{2}\|x_{k+1} - y_k\|^2 \tag{4.20} \]
(a.k.a. the descent lemma, as substituting the gradient step $x_{k+1} = y_k - \tfrac{1}{L}\nabla f(y_k)$ allows writing it as $f(x_{k+1}) \leq f(y_k) - \tfrac{1}{2L}\|\nabla f(y_k)\|^2$). In addition, outside of $L$, (4.20) only contains information which we observe, hence we can simply check whether this inequality holds for a given estimate of $L$. When it does not, we simply increase the current approximation of $L$ and recompute, using this new estimate of $L$, (i) $A_{k+1}$ (necessary only if $\mu > 0$) and the corresponding $y_k$, and (ii) $x_{k+1}$ using the new step size. We then check whether (4.20) is satisfied, again. If (4.20) is satisfied, then we can proceed (the potential inequality is verified), and otherwise we increase our approximation of $L$ until the descent condition (4.20) is satisfied. Finally, to guarantee that we only perform a finite number of "wasted" gradient steps while estimating $L$, we need an appropriate rule for increasing it. It is common to simply multiply the current approximation by some constant $\alpha > 1$, which guarantees that at most $\lceil \log_\alpha \frac{L}{L_0} \rceil$ gradient steps are wasted in the process, where $L$ is the true smoothness constant and $L_0$ is our starting estimate. As we see below, both backtracking and nonsmooth terms require proofs very similar to those presented above, and potential-based analyses are still well suited here.
We present two particular extensions of Nesterov's first method that handle a nonsmooth term and include a backtracking procedure on the smoothness parameter. The first one, FISTA, is particularly popular, while the second one fixes a potential issue arising in the original FISTA.

FISTA
The fast iterative shrinkage-thresholding algorithm (FISTA), due to Beck and Teboulle (2009), is a natural extension of Nesterov (1983) in its first form (see Algorithm 12), handling an additional nonsmooth term. In this section, we present a strongly convex variant of FISTA, provided by Algorithm 17. The proof contains the same ingredients as the original work, and can easily be compared to the previous material.
Algorithm 17
Strongly convex FISTA, form I
Input: An $L$-smooth (possibly $\mu$-strongly) convex function $f$, a convex function $h$ with proximal operator available, an initial point $x_0$, and an initial estimate $L_0 > \mu$.
Initialize $z_0 = x_0$, $A_0 = 0$, and pick some $\alpha > 1$.
for $k = 0, \ldots$ do
  $L_{k+1} = L_k$ {Alternate strategies are provided in Remark 4.8.}
  loop
    $q_{k+1} = \mu/L_{k+1}$
    $A_{k+1} = \frac{2A_k + 1 + \sqrt{4A_k + 4q_{k+1}A_k^2 + 1}}{2(1 - q_{k+1})}$
    set $\tau_k = \frac{(A_{k+1} - A_k)(1 + q_{k+1}A_k)}{A_{k+1} + 2q_{k+1}A_kA_{k+1} - q_{k+1}A_k^2}$ and $\delta_k = \frac{A_{k+1} - A_k}{1 + q_{k+1}A_{k+1}}$
    $y_k = x_k + \tau_k(z_k - x_k)$
    $x_{k+1} = \operatorname{prox}_{h/L_{k+1}}\big(y_k - \tfrac{1}{L_{k+1}}\nabla f(y_k)\big)$
    $z_{k+1} = (1 - q_{k+1}\delta_k)z_k + q_{k+1}\delta_k y_k + \delta_k(x_{k+1} - y_k)$
    if (4.20) holds then
      break {Iterates accepted; $k$ will be incremented.}
    else
      $L_{k+1} = \alpha L_{k+1}$ {Iterates not accepted; recompute new $L_{k+1}$.}
    end if
  end loop
end for
Output: An approximate solution $x_{k+1}$.

The proof follows exactly the same lines as those of Theorem 4.12 (Nesterov's method for strongly convex functions), but accounts for the nonsmooth function $h$ (observe that the potential is stated in terms of $F$ and not $f$). Two additional inequalities, involving convexity of $h$ between two different pairs of points, allow taking this nonsmooth term into account. In this case, $f$ is assumed to be smooth and convex over $\mathbb{R}^d$ (i.e., it has full domain, $\operatorname{dom} f = \mathbb{R}^d$), and we are therefore allowed to evaluate gradients of $f$ outside of the domain of $h$. Theorem 4.20.
Let $f \in \mathcal{F}_{\mu,L}$ (with full domain; $\operatorname{dom} f = \mathbb{R}^d$), $h$ be a closed, proper and convex function, $x_\star \in \operatorname{argmin}_x \{f(x) + h(x)\}$, and $k \in \mathbb{N}$. For any $x_k, z_k \in \mathbb{R}^d$ and $A_k \geq 0$, the iterates of Algorithm 17 satisfying (4.20) also satisfy
\[ A_{k+1}(F(x_{k+1}) - F_\star) + \frac{L_{k+1} + \mu A_{k+1}}{2}\|z_{k+1} - x_\star\|^2 \leq A_k(F(x_k) - F_\star) + \frac{L_{k+1} + \mu A_k}{2}\|z_k - x_\star\|^2, \]
with $A_{k+1} = \frac{2A_k + 1 + \sqrt{4A_k + 4q_{k+1}A_k^2 + 1}}{2(1 - q_{k+1})}$ and $q_{k+1} = \mu/L_{k+1}$. Proof.
The proof consists in a weighted sum of the following inequalities: .7. Practical Extensions • strong convexity of f between x ⋆ and y k with weight λ = A k +1 − A k f ( x ⋆ ) ≥ f ( y k ) + h∇ f ( y k ); x ⋆ − y k i + µ k x ⋆ − y k k , • strong convexity of f between x k and y k with weight λ = A k f ( x k ) ≥ f ( y k ) + h∇ f ( y k ); x k − y k i , • smoothness of f between y k and x k +1 ( descent lemma ) with weight λ = A k +1 f ( y k ) + h∇ f ( y k ); x k +1 − y k i + L k +1 k x k +1 − y k k ≥ f ( x k +1 ) , • convexity of h between x ⋆ and x k +1 with weight λ = A k +1 − A k h ( x ⋆ ) ≥ h ( x k +1 ) + h g h ( x k +1 ); x ⋆ − x k +1 i , with g h ( x k +1 ) ∈ ∂h ( x k +1 ) and x k +1 = y k − L k +1 ( ∇ f ( y k ) + g h ( x k +1 ))• convexity of h between x k and x k +1 with weight λ = A k h ( x k ) ≥ h ( x k +1 ) + h g h ( x k +1 ); x k − x k +1 i . We get the following inequality0 ≥ λ [ f ( y k ) − f ( x ⋆ ) + h∇ f ( y k ); x ⋆ − y k i + µ k x ⋆ − y k k ]+ λ [ f ( y k ) − f ( x k ) + h∇ f ( y k ); x k − y k i ]+ λ [ f ( x k +1 ) − ( f ( y k ) + h∇ f ( y k ); x k +1 − y k i + L k +1 k x k +1 − y k k )]+ λ [ h ( x k +1 ) − h ( x ⋆ ) + h g h ( x k +1 ); x ⋆ − x k +1 i ]+ λ [ h ( x k +1 ) − h ( x k ) + h g h ( x k +1 ); x k − x k +1 i ] . Substituting the y k , x k +1 , and z k +1 using y k = x k + τ k ( z k − x k ) x k +1 = y k − L k +1 ( ∇ f ( y k ) + g h ( x k +1 )) z k +1 = (1 − q k +1 δ k ) z k + q k +1 δ k y k + δ k ( x k +1 − y k )the previous weighted sum can be reformulated exactly as A k +1 ( f ( x k +1 ) + h ( x k +1 ) − f ( x ⋆ ) − h ( x ⋆ )) + L k +1 + A k +1 µ k z k +1 − x ⋆ k ≤ A k ( f ( x k ) + h ( x k ) − f ( x ⋆ ) − h ( x ⋆ )) + L k +1 + A k µ k z k − x ⋆ k + ( A k − A k +1 ) − A k +1 − q k +1 A k +1 q k +1 A k +1 L k +1 k∇ f ( y k ) + g h ( x k +1 ) k − A k ( A k +1 − A k )(1 + q k +1 A k )(1 + q k +1 A k +1 ) (cid:0) A k +1 + 2 q k +1 A k A k +1 − q k +1 A k (cid:1) µ k x k − z k k . Nesterov Acceleration
Using $0 \leq q_{k+1} \leq 1$ and picking $A_{k+1}$ such that $A_{k+1} \geq A_k$ and
\[ (A_k - A_{k+1})^2 - A_{k+1} - q_{k+1}A_{k+1}^2 = 0 \]
yields the desired result
\[ A_{k+1}(f(x_{k+1}) + h(x_{k+1}) - f(x_\star) - h(x_\star)) + \frac{L_{k+1} + \mu A_{k+1}}{2}\|z_{k+1} - x_\star\|^2 \leq A_k(f(x_k) + h(x_k) - f(x_\star) - h(x_\star)) + \frac{L_{k+1} + \mu A_k}{2}\|z_k - x_\star\|^2. \]
Finally, we obtain a complexity guarantee by adapting the potential argument (4.5), and by noting that $A_{k+1}$ is a decreasing function of $L_{k+1}$ (whose maximal value is $\alpha L$, assuming $L_0 < L$; otherwise its maximal value is $L_0$). The growth rate of $A_k$ in the smooth convex setting remains unchanged, see (4.13); however, its geometric growth rate is slightly degraded to
\[ A_{k+1} \geq \Big(1 - \sqrt{\tfrac{\mu}{\alpha L}}\Big)^{-1} A_k, \]
which remains better than the $(1 - \tfrac{\mu}{\alpha L})$ rate of gradient descent with backtracking (assuming in both cases again that $L_0 < L$, as otherwise the rates are respectively $(1 - \sqrt{\mu/L_0})$ and $(1 - \mu/L_0)$). Corollary 4.21.
Let $f \in \mathcal{F}_{\mu,L}(\mathbb{R}^d)$, $h$ be a closed, proper and convex function, and $x_\star \in \operatorname{argmin}_x\{F(x) \equiv f(x) + h(x)\}$. For any $N \in \mathbb{N}$, $N \geq 1$, and $x_0 \in \mathbb{R}^d$, the output of Algorithm 17 satisfies
\[ F(x_N) - F_\star \leq \min\left\{ \frac{2}{N^2}, \Big(1 - \sqrt{\tfrac{\mu}{\ell}}\Big)^N \right\} \ell \|x_0 - x_\star\|^2, \]
with $\ell = \max\{\alpha L, L_0\}$. Proof.
We assume that $L_0 < L$, as otherwise $f \in \mathcal{F}_{\mu,L_0}$ and the proof directly follows the case without backtracking. Define
\[ \varphi_k = A_k(F(x_k) - F_\star) + \frac{L_k + \mu A_k}{2}\|z_k - x_\star\|^2. \]
Since $L_{k+1}/L_k \geq 1$, we have
\[ \varphi_{k+1} \leq A_k(F(x_k) - F_\star) + \frac{L_{k+1} + \mu A_k}{2}\|z_k - x_\star\|^2 \leq \frac{L_{k+1}}{L_k}\varphi_k. \]
The chained potential argument (4.5) can now be adapted to get
\[ A_N(F(x_N) - F_\star) \leq \varphi_N \leq \frac{L_N}{L_{N-1}}\varphi_{N-1} \leq \frac{L_N}{L_{N-2}}\varphi_{N-2} \leq \ldots \leq \frac{L_N}{L_0}\varphi_0, \]
where we used Theorem 4.20 and the fact that the output of the algorithm satisfies (4.20). Using $A_0 = 0$, we reach
\[ F(x_N) - F_\star \leq \frac{L_N \|x_0 - x_\star\|^2}{2 A_N}. \]
Using our previous bounds on $A_N$ (noting that $A_{k+1}$ is a decreasing function of $L_{k+1}$), e.g., from Corollary 4.13, along with the fact that the estimated smoothness constant cannot be larger than $\alpha$ times the true constant ($L_N < \alpha L$ in the worst case), except if $L_0$ was already larger than the true $L$, in which case $L_N = L_0$, we get $L_N \leq \ell = \max\{\alpha L, L_0\}$, yielding the desired result. Remark 4.8.
There are two common variations around the backtracking strategy presented in this section. One can for example reset $L_{k+1} \leftarrow L_0$ (at line 3 of Algorithm 17) at each iteration, potentially using a total of $N\lceil \log_\alpha \frac{L}{L_0}\rceil$ additional gradient evaluations over all iterations. Another possibility is to pick some additional constant $0 < \beta < 1$ and set $L_{k+1} \leftarrow \beta L_k$ (at line 3 of Algorithm 17). In the case $\beta = 1/\alpha$, this strategy potentially costs 1 additional gradient evaluation per iteration due to the backtracking strategy, potentially using a total of $N + \lceil \log_\alpha \frac{L}{L_0}\rceil$ additional gradient evaluations over all iterations.
Incorporating such non-monotonic estimations of $L$ can be done at low additional technical cost. The corresponding methods and their analyses are essentially the same as those of this chapter; they are provided in Appendix 4.D.1 and 4.D.2 (see Algorithm 23 and Algorithm 24). Remark 4.9.
Variations around strongly convex extensions of FISTA, involving backtracking line searches, can be found in, e.g., (Chambolle and Pock, 2016; Calatroni and Chambolle, 2019; Florea and Vorobyov, 2018; Florea and Vorobyov, 2020), together with practical improvements. The method presented in this section was designed for easy comparison with the previous material.
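To make the backtracking mechanism concrete, here is a short Python sketch of a FISTA-type method with the test (4.20); for simplicity it implements the $\mu = 0$ specialization of Algorithm 17 (where $A_{k+1} = A_k + (1+\sqrt{4A_k+1})/2$, $\tau_k = (A_{k+1}-A_k)/A_{k+1}$, and $\delta_k = A_{k+1}-A_k$), applied to an illustrative LASSO instance; the problem data and helper names are ours.

```python
import numpy as np

def fista_backtracking(gradf, f, prox_h, x0, L0, alpha=2.0, n_iter=100):
    """Accelerated forward-backward method with backtracking on L (mu = 0 case).
    prox_h(v, t) should return prox_{t*h}(v); the accept/reject test is inequality (4.20)."""
    x, z, A, Lk = x0.copy(), x0.copy(), 0.0, L0
    for _ in range(n_iter):
        while True:
            A_next = A + (1 + np.sqrt(4 * A + 1)) / 2   # (A_next - A)^2 = A_next when mu = 0
            tau = (A_next - A) / A_next
            y = x + tau * (z - x)
            gy, fy = gradf(y), f(y)
            x_next = prox_h(y - gy / Lk, 1.0 / Lk)
            # Descent condition (4.20): accept the iterate, or increase L and retry.
            if f(x_next) <= fy + gy @ (x_next - y) + Lk / 2 * np.sum((x_next - y) ** 2):
                break
            Lk *= alpha
        z = z + (A_next - A) * (x_next - y)
        x, A = x_next, A_next
    return x

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    M, b, lam = rng.standard_normal((40, 60)), rng.standard_normal(40), 0.1
    f = lambda x: 0.5 * np.sum((M @ x - b) ** 2)
    gradf = lambda x: M.T @ (M @ x - b)
    prox_h = lambda v, t: np.sign(v) * np.maximum(np.abs(v) - lam * t, 0.0)  # prox of lam*||.||_1
    x = fista_backtracking(gradf, f, prox_h, np.zeros(60), L0=1.0)
    print(f(x) + lam * np.sum(np.abs(x)))
```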
Another Accelerated Gradient Method for Composite Convex Optimization
FISTA potentially evaluates gradients outside of the domain of $h$, and therefore implicitly assumes that $f$ is defined even outside this region. In many situations this is not an issue, for example when $f$ is quadratic. However, it can be problematic in some cases. In this section, we assume instead that $f$ is continuously differentiable and satisfies the smoothness condition (4.21) only for all $x, y \in \operatorname{dom} h$. Definition 4.2.
Let $0 \leq \mu < L \leq +\infty$ and $C \subseteq \mathbb{R}^d$. A closed, proper and convex function $f: \mathbb{R}^d \to \mathbb{R}\cup\{+\infty\}$ is $L$-smooth and $\mu$-strongly convex on $C$ (written $f \in \mathcal{F}_{\mu,L}(C)$) if and only if
• ($L$-smoothness) there exists an open set $C'$ such that $C \subseteq C'$, $f$ is continuously differentiable on $C'$, and for all $x, y \in C$ it holds that
\[ f(x) \leq f(y) + \langle \nabla f(y); x - y\rangle + \tfrac{L}{2}\|x - y\|^2, \tag{4.21} \]
• ($\mu$-strong convexity) for all $x, y \in C$, it holds that
\[ f(x) \geq f(y) + \langle \nabla f(y); x - y\rangle + \tfrac{\mu}{2}\|x - y\|^2. \tag{4.22} \]
By extension, $\mathcal{F}_{\mu,\infty}(C)$ denotes the set of proper, closed, and $\mu$-strongly convex functions whose domain contains $C$, and $\mathcal{F}_{0,\infty}$ denotes the set of proper, closed and convex functions.
There exist different ways of handling this situation. The method of this section relies on using the proximal operator on the sequence $z_k$, and on formulating Nesterov's method in its form III (see Algorithm 11): assuming the initial point is feasible ($F(x_0) < \infty$), both the $x_k$'s and $y_k$'s are obtained as convex combinations of feasible points, and hence are feasible.
A wide variety of accelerated methods exists, and most variants settle FISTA's problem by using two proximal operations per iteration (on both sequences $x_k$ and $z_k$). The variant of this section performs only one projection per iteration, while also fixing the infeasibility issue of $y_k$ in FISTA. Variations on this theme can be found in a number of references, see for example (Auslender and Teboulle, 2006, "Improved interior gradient algorithm"), (Tseng, 2008, Algorithm 1), or more recently (Gasnikov and Nesterov, 2018, "Method of similar triangles").
Algorithm 18 A proximal accelerated gradient method
Input: $h \in \mathcal{F}_{0,\infty}$ with proximal operator available, $f \in \mathcal{F}_{\mu,L}(\operatorname{dom} h)$, an initial point $x_0 \in \operatorname{dom} h$, and an initial estimate $L_0 > \mu$.
Initialize $z_0 = x_0$, $A_0 = 0$, and pick some $\alpha > 1$.
for $k = 0, \ldots$ do
  $L_{k+1} = L_k$ {Alternate strategies are provided in Remark 4.8.}
  loop
    $q_{k+1} = \mu/L_{k+1}$
    $A_{k+1} = \frac{2A_k + 1 + \sqrt{4A_k + 4q_{k+1}A_k^2 + 1}}{2(1 - q_{k+1})}$
    set $\tau_k = \frac{(A_{k+1} - A_k)(1 + q_{k+1}A_k)}{A_{k+1} + 2q_{k+1}A_kA_{k+1} - q_{k+1}A_k^2}$ and $\delta_k = \frac{A_{k+1} - A_k}{1 + q_{k+1}A_{k+1}}$
    $y_k = x_k + \tau_k(z_k - x_k)$
    $z_{k+1} = \operatorname{prox}_{\delta_k h/L_{k+1}}\big( (1 - q_{k+1}\delta_k)z_k + q_{k+1}\delta_k y_k - \tfrac{\delta_k}{L_{k+1}}\nabla f(y_k) \big)$
    $x_{k+1} = \frac{A_k}{A_{k+1}}x_k + \big(1 - \frac{A_k}{A_{k+1}}\big)z_{k+1}$
    if (4.20) holds then
      break {Iterates accepted; $k$ will be incremented.}
    else
      $L_{k+1} = \alpha L_{k+1}$ {Iterates not accepted; recompute new $L_{k+1}$.}
    end if
  end loop
end for
Output: An approximate solution $x_{k+1}$.

Theorem 4.22. Let $h \in \mathcal{F}_{0,\infty}$, $f \in \mathcal{F}_{\mu,L}(\operatorname{dom} h)$, $x_\star \in \operatorname{argmin}_x\{F(x) \equiv f(x) + h(x)\}$, and $k \in \mathbb{N}$. For any $x_k, z_k \in \mathbb{R}^d$ and $A_k \geq 0$, the iterates of Algorithm 18 satisfying (4.20) also satisfy
\[ A_{k+1}(F(x_{k+1}) - F_\star) + \frac{L_{k+1} + \mu A_{k+1}}{2}\|z_{k+1} - x_\star\|^2 \leq A_k(F(x_k) - F_\star) + \frac{L_{k+1} + \mu A_k}{2}\|z_k - x_\star\|^2, \]
with $A_{k+1} = \frac{2A_k + 1 + \sqrt{4A_k + 4q_{k+1}A_k^2 + 1}}{2(1 - q_{k+1})}$ and $q_{k+1} = \mu/L_{k+1}$.
Proof. First, the $\{z_k\}$ are in $\operatorname{dom} h$ by construction (each is the output of a proximal/projection step). Furthermore, we have $0 \leq \frac{A_k}{A_{k+1}} \leq 1$ since $A_{k+1} \geq A_k \geq 0$. Since $z_0 = x_0 \in \operatorname{dom} h$, all subsequent $\{y_k\}$ and $\{x_k\}$ are also in $\operatorname{dom} h$ (as they are obtained from convex combinations of feasible points).
The rest of the proof consists in a weighted sum of the following inequalities (which are valid due to feasibility of the iterates).
• Strong convexity of $f$ between $x_\star$ and $y_k$ with weight $\lambda_1 = A_{k+1} - A_k$:
$f(x_\star) \geq f(y_k) + \langle \nabla f(y_k); x_\star - y_k\rangle + \tfrac{\mu}{2}\|x_\star - y_k\|^2$,
• convexity of $f$ between $x_k$ and $y_k$ with weight $\lambda_2 = A_k$:
$f(x_k) \geq f(y_k) + \langle \nabla f(y_k); x_k - y_k\rangle$,
• smoothness of $f$ between $y_k$ and $x_{k+1}$ (descent lemma) with weight $\lambda_3 = A_{k+1}$:
$f(y_k) + \langle \nabla f(y_k); x_{k+1} - y_k\rangle + \tfrac{L_{k+1}}{2}\|x_{k+1} - y_k\|^2 \geq f(x_{k+1})$,
• convexity of $h$ between $x_\star$ and $z_{k+1}$ with weight $\lambda_4 = A_{k+1} - A_k$:
$h(x_\star) \geq h(z_{k+1}) + \langle g_h(z_{k+1}); x_\star - z_{k+1}\rangle$, with $g_h(z_{k+1}) \in \partial h(z_{k+1})$ and $z_{k+1} = (1 - q_{k+1}\delta_k)z_k + q_{k+1}\delta_k y_k - \tfrac{\delta_k}{L_{k+1}}(\nabla f(y_k) + g_h(z_{k+1}))$,
• convexity of $h$ between $x_k$ and $x_{k+1}$ with weight $\lambda_5 = A_k$:
$h(x_k) \geq h(x_{k+1}) + \langle g_h(x_{k+1}); x_k - x_{k+1}\rangle$, with $g_h(x_{k+1}) \in \partial h(x_{k+1})$,
• convexity of $h$ between $z_{k+1}$ and $x_{k+1}$ with weight $\lambda_6 = A_{k+1} - A_k$:
$h(z_{k+1}) \geq h(x_{k+1}) + \langle g_h(x_{k+1}); z_{k+1} - x_{k+1}\rangle$.
We get the following inequality0 ≥ λ [ f ( y k ) − f ( x ⋆ ) + h∇ f ( y k ); x ⋆ − y k i + µ k x ⋆ − y k k ]+ λ [ f ( y k ) − f ( x k ) + h∇ f ( y k ); x k − y k i ]+ λ [ f ( x k +1 ) − ( f ( y k ) + h∇ f ( y k ); x k +1 − y k i + L k +1 k x k +1 − y k k )]+ λ [ h ( z k +1 ) − h ( x ⋆ ) + h g h ( z k +1 ); x ⋆ − z k +1 i ]+ λ [ h ( x k +1 ) − h ( x k ) + h g h ( x k +1 ); x k − x k +1 i ]+ λ [ h ( x k +1 ) − h ( z k +1 ) + h g h ( x k +1 ); z k +1 − x k +1 i ] . Substituting the y k , z k +1 , and x k +1 using y k = x k + τ k ( z k − x k ) z k +1 = (1 − q k +1 δ k ) z k + q k +1 δ k y k − δ k L k +1 ( ∇ f ( y k ) + g h ( z k +1 )) x k +1 = A k A k +1 x k + (cid:18) − A k A k +1 (cid:19) z k +1 , we reformulate the previous inequality as A k +1 ( f ( x k +1 ) + h ( x k +1 ) − f ( x ⋆ ) − h ( x ⋆ )) + L k +1 + A k +1 µ k z k +1 − x ⋆ k ≤ A k ( f ( x k ) + h ( x k ) − f ( x ⋆ ) − h ( x ⋆ )) + L k +1 + A k µ k z k − x ⋆ k + ( A k − A k +1 ) (cid:16) ( A k − A k +1 ) − A k +1 − q k +1 A k +1 (cid:17) A k +1 (1 + q k +1 A k +1 ) L k +1 k∇ f ( y k ) + g h ( z k +1 ) k − A k ( A k +1 − A k )(1 + q k +1 A k )(1 + q k +1 A k +1 ) (cid:0) A k +1 + 2 q k +1 A k A k +1 − q k +1 A k (cid:1) µ k x k − z k k . Picking A k +1 ≥ A k such that( A k − A k +1 ) − A k +1 − q k +1 A k +1 = 0 , yields the desired result A k +1 ( f ( x k +1 ) + h ( x k +1 ) − f ( x ⋆ ) − h ( x ⋆ )) + L k +1 + µA k +1 k z k +1 − x ⋆ k ≤ A k ( f ( x k ) + h ( x k ) − f ( x ⋆ ) − h ( x ⋆ )) + L k +1 + µA k k z k − x ⋆ k . We have the following corollary.
Corollary 4.23.
Let $h \in \mathcal{F}_{0,\infty}$, $f \in \mathcal{F}_{\mu,L}(\operatorname{dom} h)$, and $x_\star \in \operatorname{argmin}_x\{F(x) \equiv f(x) + h(x)\}$. For any $N \in \mathbb{N}$, $N \geq 1$, and $x_0 \in \mathbb{R}^d$, the output of Algorithm 18 satisfies
\[ F(x_N) - F_\star \leq \min\left\{ \frac{2}{N^2}, \Big(1 - \sqrt{\tfrac{\mu}{\ell}}\Big)^N \right\} \ell \|x_0 - x_\star\|^2, \]
with $\ell = \max\{\alpha L, L_0\}$. Proof.
The proof follows the same arguments as those of Corollary 4.21, using the potential from Theorem 4.22 and the fact that the output of the algorithm satisfies (4.20).
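For completeness, the following Python sketch mirrors the structure of Algorithm 18 in the $\mu = 0$ case with a fixed smoothness estimate (no backtracking), emphasizing that gradients of $f$ are only evaluated at feasible points; the helper names and the box-constrained example are illustrative assumptions of ours.

```python
import numpy as np

def accelerated_proximal_gradient(gradf, prox_h, x0, L, n_iter=100):
    """Single-prox accelerated forward-backward sketch (mu = 0, fixed L):
    the prox is applied to the z-sequence, and y_k, x_k stay feasible as
    convex combinations of feasible points."""
    x, z, A = x0.copy(), x0.copy(), 0.0
    for _ in range(n_iter):
        A_next = A + (1 + np.sqrt(4 * A + 1)) / 2
        tau = (A_next - A) / A_next
        delta = A_next - A
        y = x + tau * (z - x)                      # feasible: convex combination
        z = prox_h(z - (delta / L) * gradf(y), delta / L)
        x = (A / A_next) * x + (1 - A / A_next) * z
        A = A_next
    return x

if __name__ == "__main__":
    # Minimize a quadratic over the box [0, 1]^d; the prox of the box indicator is clipping.
    d = np.linspace(1, 10, 20)
    gradf = lambda x: d * (x - 2.0)                # minimizer of f lies outside the box
    proj_box = lambda v, t: np.clip(v, 0.0, 1.0)
    x = accelerated_proximal_gradient(gradf, proj_box, np.zeros(20), L=d.max())
    print(x)                                       # converges to the all-ones vector
```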
Remark 4.10.
In this chapter, we introduced backtracking techniques by "observing" what inequalities were used in the proofs. In particular, because smoothness was used only through the descent lemma, in which the only unknown value is $L$, one can simply check this inequality at runtime. Another possible exploitation of this observation ("what inequalities do we need in the proof") is to identify minimal assumptions on the class of functions under which it is possible to prove accelerated rates; this topic is explored by, e.g., Necoara et al. (2019). More generally, the same question holds simply for the ability of proving convergence rates; this is further discussed in Chapter 6.
In this section, we put ourselves in a slightly different scenario, often referred to as the mirror descent setting. Let us consider the convex minimization problem
\[ F_\star = \min_{x \in \mathbb{R}^d}\{F(x) \equiv f(x) + h(x)\}, \tag{4.23} \]
with $h, f: \mathbb{R}^d \to \mathbb{R}\cup\{+\infty\}$ closed, proper and convex. Furthermore, we assume $f$ to be differentiable and to have Lipschitz gradients with respect to some (possibly non-Euclidean) norm $\|\cdot\|$. That is, denoting by $\|s\|_* = \sup_x\{\langle s; x\rangle : \|x\| \leq 1\}$ the corresponding dual norm, we require
\[ \|\nabla f(x) - \nabla f(y)\|_* \leq L\|x - y\|, \]
for all $x, y \in \operatorname{dom} h$. In this setting, inequality (4.21) also holds (see Appendix 4.A.2), and we perhaps abusively also denote $f \in \mathcal{F}_{0,L}(\operatorname{dom} h)$.
To solve (4.23), we define a few additional ingredients. First, we pick a 1-strongly convex, closed and proper function $w: \mathbb{R}^d \to \mathbb{R}\cup\{+\infty\}$ such that $\operatorname{dom} h \subseteq \operatorname{dom} w$ (recall that, by assumption on $h$, $\operatorname{dom} h \neq \emptyset$, and therefore $\operatorname{dom} w \neq \emptyset$). Those assumptions ensure the proximal operations below are well defined, and $w$ is commonly referred to as the distance generating function. Finally, pick $g_w(y) \in \partial w(y)$ and define a Bregman divergence generated by $w$ as
\[ D_w(x; y) = w(x) - w(y) - \langle g_w(y); x - y\rangle, \tag{4.24} \]
which we use below as a notion of distance for generalizing the previous proximal operator (4.19). Note that the Bregman divergence $D_w(\cdot;\cdot)$ generated by any subgradient of $w$ at $z_k$ is considered valid here.
Now, the base ingredient we use for solving (4.23) is the Bregman proximal gradient step, with step size $a_k/L$:
\[ z_{k+1} = \operatorname{argmin}_y \Big\{ \tfrac{a_k}{L}\big(h(y) + \langle \nabla f(y_k); y - y_k\rangle\big) + D_w(y; z_k) \Big\}, \tag{4.25} \]
which corresponds to the usual Euclidean proximal gradient step when $w(x) = \tfrac12\|x\|^2$. Under the previous assumptions, (4.25) is well defined and we can explicitly write
\[ g_w(z_{k+1}) = g_w(z_k) - \tfrac{a_k}{L}\big(\nabla f(y_k) + g_h(z_{k+1})\big), \]
for some $g_w(z_{k+1}) \in \partial w(z_{k+1})$, $g_w(z_k) \in \partial w(z_k)$, and $g_h(z_{k+1}) \in \partial h(z_{k+1})$.
Under this construction, one can then rely on (4.25) for solving (4.23) using Algorithm 19. Note that when $w$ is differentiable (which is usually the case, but for which requiring $w$ to be closed, proper and convex requires some additional discussion), we often refer to $\nabla w$ as a mirror map, which is a bijective mapping due to strong convexity and differentiability of $w$. In this case, iterations can be described as
\[ \nabla w(z_{k+1}) = \nabla w(z_k) - \tfrac{a_k}{L}\big(\nabla f(y_k) + g_h(z_{k+1})\big). \]
Algorithm 19
Proximal accelerated Bregman gradient method
Input: $h \in \mathcal{F}_{0,\infty}$, $f \in \mathcal{F}_{0,L}(\operatorname{dom} h)$, $w \in \mathcal{F}_{1,\infty}$ with $\operatorname{dom} h \subseteq \operatorname{dom} w$, and $x_0 \in \operatorname{dom} h$ (such that $\partial w(z_0) \neq \emptyset$).
Initialize $z_0 = x_0$ and $A_0 = 0$.
for $k = 0, \ldots$ do
  $a_k = \frac{1 + \sqrt{4A_k + 1}}{2}$
  $A_{k+1} = A_k + a_k$
  $y_k = \frac{A_k}{A_{k+1}}x_k + \big(1 - \frac{A_k}{A_{k+1}}\big)z_k$
  $z_{k+1} = \operatorname{argmin}_y \big\{ \tfrac{a_k}{L}\big(h(y) + \langle \nabla f(y_k); y - y_k\rangle\big) + D_w(y; z_k) \big\}$
  $x_{k+1} = \frac{A_k}{A_{k+1}}x_k + \big(1 - \frac{A_k}{A_{k+1}}\big)z_{k+1}$
end for
Output: Approximate solution $x_{k+1}$.

Theorem 4.24 below provides a convergence guarantee for Algorithm 19 using a potential argument.
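The following Python sketch mirrors the structure of Algorithm 19, with a generic Bregman proximal step passed as an argument; the function names and the Euclidean specialization shown at the end are ours, and the sketch is only meant to illustrate how (4.25) enters the iteration.

```python
import numpy as np

def accelerated_bregman_gradient(gradf, bregman_prox, x0, L, n_iter=100):
    """Sketch of a proximal accelerated Bregman gradient iteration.
    bregman_prox(z, v, t) should return argmin_y { t*(h(y) + <v; y>) + D_w(y; z) },
    i.e., the step (4.25) with v = grad f(y_k) (the constant term <v; y_k> is irrelevant)."""
    x, z, A = x0.copy(), x0.copy(), 0.0
    for _ in range(n_iter):
        a = (1 + np.sqrt(4 * A + 1)) / 2
        A_next = A + a
        y = (A / A_next) * x + (1 - A / A_next) * z
        z = bregman_prox(z, gradf(y), a / L)
        x = (A / A_next) * x + (1 - A / A_next) * z
        A = A_next
    return x

# Euclidean special case w(x) = 0.5 ||x||^2 and h = 0: the Bregman step reduces to z - t*v.
euclidean_step = lambda z, v, t: z - t * v
```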
Theorem 4.24.
Let $h \in \mathcal{F}_{0,\infty}$, $f \in \mathcal{F}_{0,L}(\operatorname{dom} h)$, $w \in \mathcal{F}_{1,\infty}$ with $\operatorname{dom} h \subseteq \operatorname{dom} w$, let $x_\star \in \operatorname{argmin}_x\{F(x) \equiv f(x) + h(x)\}$, and $k \in \mathbb{N}$. For any $x_k, z_k \in \operatorname{dom} h$ such that $\partial w(z_k) \neq \emptyset$ and $A_k \geq 0$, the iterates of Algorithm 19 satisfy
\[ A_{k+1}(F(x_{k+1}) - F_\star) + L D_w(x_\star; z_{k+1}) \leq A_k(F(x_k) - F_\star) + L D_w(x_\star; z_k), \]
with $A_{k+1} = A_k + \frac{1+\sqrt{4A_k+1}}{2}$ and $D_w(\cdot;\cdot)$ the Bregman divergence (4.24) with respect to $w$. Furthermore, $\partial w(z_{k+1}) \neq \emptyset$. Proof.
First, the $\{z_k\}$ are all feasible, i.e., $z_k \in \operatorname{dom} h$, by construction. Indeed, $z_0$ is feasible by assumption, and the following iterates $z_k$ ($k > 0$) are obtained as outputs of proximal steps, hence $\partial h(z_k) \neq \emptyset$ and therefore $z_k \in \operatorname{dom} h$.
Secondly, it is direct to verify that $0 \leq \frac{A_k}{A_{k+1}} \leq 1$ since $A_{k+1} \geq A_k \geq 0$. As a consequence, all subsequent $\{y_k\}$ and $\{x_k\}$ are obtained as convex combinations. It follows from $z_0 = x_0 \in \operatorname{dom} h$ that the sequences $\{y_k\}$ and $\{x_k\}$ are also in $\operatorname{dom} h$, which is convex. The rest of the proof consists in a weighted sum of the following inequalities:
• convexity of $f$ between $x_\star$ and $y_k$ with weight $\lambda_1 = A_{k+1} - A_k$: $f(x_\star) \geq f(y_k) + \langle \nabla f(y_k); x_\star - y_k\rangle$,
• convexity of $f$ between $x_k$ and $y_k$ with weight $\lambda_2 = A_k$: $f(x_k) \geq f(y_k) + \langle \nabla f(y_k); x_k - y_k\rangle$,
• smoothness of $f$ between $y_k$ and $x_{k+1}$ (descent lemma) with weight $\lambda_3 = A_{k+1}$: $f(y_k) + \langle \nabla f(y_k); x_{k+1} - y_k\rangle + \tfrac{L}{2}\|x_{k+1} - y_k\|^2 \geq f(x_{k+1})$,
• convexity of $h$ between $x_\star$ and $z_{k+1}$ with weight $\lambda_4 = A_{k+1} - A_k$: $h(x_\star) \geq h(z_{k+1}) + \langle g_h(z_{k+1}); x_\star - z_{k+1}\rangle$, with $g_h(z_{k+1}) \in \partial h(z_{k+1})$,
• convexity of $h$ between $x_k$ and $x_{k+1}$ with weight $\lambda_5 = A_k$: $h(x_k) \geq h(x_{k+1}) + \langle g_h(x_{k+1}); x_k - x_{k+1}\rangle$, with $g_h(x_{k+1}) \in \partial h(x_{k+1})$,
• convexity of $h$ between $z_{k+1}$ and $x_{k+1}$ with weight $\lambda_6 = A_{k+1} - A_k$: $h(z_{k+1}) \geq h(x_{k+1}) + \langle g_h(x_{k+1}); z_{k+1} - x_{k+1}\rangle$,
• 1-strong convexity of $w$ between $z_{k+1}$ and $z_k$ with weight $\lambda_7 = L$: $w(z_{k+1}) \geq w(z_k) + \langle g_w(z_k); z_{k+1} - z_k\rangle + \tfrac{1}{2}\|z_k - z_{k+1}\|^2$, with $g_w(z_k) \in \partial w(z_k)$.
We then get the following inequality:
\[ 0 \geq \lambda_1\big[f(y_k) - f_\star + \langle \nabla f(y_k); x_\star - y_k\rangle\big] + \lambda_2\big[f(y_k) - f(x_k) + \langle \nabla f(y_k); x_k - y_k\rangle\big] + \lambda_3\big[f(x_{k+1}) - \big(f(y_k) + \langle \nabla f(y_k); x_{k+1} - y_k\rangle + \tfrac{L}{2}\|x_{k+1} - y_k\|^2\big)\big] + \lambda_4\big[h(z_{k+1}) - h(x_\star) + \langle g_h(z_{k+1}); x_\star - z_{k+1}\rangle\big] + \lambda_5\big[h(x_{k+1}) - h(x_k) + \langle g_h(x_{k+1}); x_k - x_{k+1}\rangle\big] + \lambda_6\big[h(x_{k+1}) - h(z_{k+1}) + \langle g_h(x_{k+1}); z_{k+1} - x_{k+1}\rangle\big] + \lambda_7\big[w(z_k) - w(z_{k+1}) + \langle g_w(z_k); z_{k+1} - z_k\rangle + \tfrac{1}{2}\|z_k - z_{k+1}\|^2\big]. \]
Now, by substituting (using $g_w(z_{k+1}) \in \partial w(z_{k+1})$)
\[ y_k = \tfrac{A_k}{A_{k+1}}x_k + \big(1 - \tfrac{A_k}{A_{k+1}}\big)z_k, \qquad g_w(z_{k+1}) = g_w(z_k) - \tfrac{A_{k+1}-A_k}{L}\big(\nabla f(y_k) + g_h(z_{k+1})\big), \qquad x_{k+1} = \tfrac{A_k}{A_{k+1}}x_k + \big(1 - \tfrac{A_k}{A_{k+1}}\big)z_{k+1}, \]
we obtain exactly $\|x_{k+1} - y_k\|^2 = \frac{(A_{k+1}-A_k)^2}{A_{k+1}^2}\|z_k - z_{k+1}\|^2$, and the weighted sum can be reformulated exactly as
\[ A_{k+1}(F(x_{k+1}) - F_\star) + L D_w(x_\star; z_{k+1}) \leq A_k(F(x_k) - F_\star) + L D_w(x_\star; z_k) + \frac{(A_k - A_{k+1})^2 - A_{k+1}}{2A_{k+1}} L \|z_k - z_{k+1}\|^2, \]
and we obtain the desired inequality by picking $A_{k+1}$ satisfying $A_{k+1} \geq A_k$ and $(A_k - A_{k+1})^2 - A_{k+1} = 0$.
Let us conclude this section by providing the final corollary describing the worst-case performance of the method.
Corollary 4.25.
Let $h \in \mathcal{F}_{0,\infty}$, $f \in \mathcal{F}_{0,L}(\operatorname{dom} h)$, $w \in \mathcal{F}_{1,\infty}$ with $\operatorname{dom} h \subseteq \operatorname{dom} w$, and let $x_\star \in \operatorname{argmin}_x\{F(x) \equiv f(x) + h(x)\}$. For any $x_0 \in \operatorname{dom} h$ such that $\partial w(x_0) \neq \emptyset$, and any $N \in \mathbb{N}$, the iterates of Algorithm 19 satisfy
\[ F(x_N) - F_\star \leq \frac{L D_w(x_\star; z_0)}{A_N} \leq \frac{4 L D_w(x_\star; z_0)}{N^2}. \]
Proof.
The claim directly follows from previous arguments, using the potential $\varphi_k = A_k(F(x_k) - F_\star) + L D_w(x_\star; z_k)$, along with $A_N \geq N^2/4$ from (4.13). Remark 4.11.
The method presented in this section is essentially (Gasnikov and Nes-terov, 2018, “Method of similar triangles”); see also (Auslender and Teboulle, 2006, “Im-proved interior gradient algorithm”). It enjoys a number of variants, see e.g., discussionsin (Tseng, 2008), possibly involving two projections per iteration, as in (Nesterov, 2005,Section 3), (Lan et al. , 2011, Section 3). The method can also naturally be embeddedwith a backtracking procedure, exactly as in previous sections.
Remark 4.12.
Beyond the Euclidean setting where $w(x) = \tfrac12\|x\|_2^2$, one of the most classical examples of the impact of the non-Euclidean setup is that of optimizing a simple function over the simplex. In this case, writing $x^{(i)}$ the $i$-th component of some $x \in \mathbb{R}^d$, we consider the situation where $h$ is the indicator function of the simplex,
\[ h(x) = \begin{cases} 0 & \text{if } \sum_{i=1}^d x^{(i)} = 1,\; x^{(i)} \geq 0,\; i = 1, \ldots, d, \\ +\infty & \text{otherwise,} \end{cases} \]
and $w$ is the entropy (which is closed, proper, and convex, and 1-strongly convex over the simplex for $\|\cdot\| = \|\cdot\|_1$; this is known as Pinsker's inequality), that is, define some $w_i: \mathbb{R} \to \mathbb{R}$,
\[ w_i(x) = \begin{cases} x\log x & \text{if } x > 0, \\ 0 & \text{if } x = 0, \\ +\infty & \text{otherwise,} \end{cases} \]
and set $w(x) = \sum_{i=1}^d w_i(x^{(i)})$. In this case, the expression for the Bregman proximal gradient step in Algorithm 19 can be computed exactly: assuming $y_k^{(i)} \neq 0$ (for all $i = 1, \ldots, d$),
\[ z_{k+1}^{(i)} = \frac{y_k^{(i)} \exp\big[-\tfrac{a_k}{L}[\nabla f(y_k)]^{(i)}\big]}{\sum_{j=1}^d y_k^{(j)} \exp\big[-\tfrac{a_k}{L}[\nabla f(y_k)]^{(j)}\big]}. \]
Hence we also have that $y_k^{(i)} \neq 0$ as soon as $y_0^{(i)} \neq 0$; a common technique is to instantiate $x_0^{(i)} = 1/d$.
In this setup, a non-Euclidean geometry often provides a significant practical advantage when optimizing large-scale functions, by improving the dependence on the dimension from $d$ to $\log(d)$ in the final complexity bound. Indeed, we have $D_w(x_\star, x_0) \leq \ln d$ here, compared to $D_{\frac12\|\cdot\|_2^2}(x_\star, x_0) \leq 1$ in the Euclidean case, so the dependence in $d$ is seemingly better in the Euclidean case. However, the choice of norms has a very significant impact here. When the gradient has a small Lipschitz constant measured with respect to $\|\cdot\| = \|\cdot\|_1$, that is,
\[ \|\nabla f(x) - \nabla f(y)\|_\infty \leq L_1 \|x - y\|_1, \]
the Lipschitz constant might be up to $d$ times smaller than that computed using the Euclidean norm $\|\cdot\| = \|\cdot\|_2$ (using norm equivalences), i.e.,
\[ \|\nabla f(x) - \nabla f(y)\|_2 \leq L_2 \|x - y\|_2, \]
with $L_1 = O(L_2/d)$. The final complexity bound using a Euclidean geometry then reads
\[ F(x_N) - F_\star \leq \frac{4L_2}{N^2} \approx \frac{4L_1 d}{N^2}, \quad \text{compared to} \quad F(x_N) - F_\star \leq \frac{4L_1 \ln d}{N^2} \]
for the same $L_1 > 0$, if we were using the geometry induced by the entropy. The impact of the choice of norm is discussed extensively in e.g. (Juditsky et al., 2009, Example 2.1) and (d'Aspremont et al., 2013).
Another, related, example is that of optimizing over the spectrahedron, see for example the nice introduction by Bubeck (2015, Chapter 4). This setup is also largely motivated by (Nesterov, 2005), and we refer to (Allen-Zhu and Orecchia, 2014, Appendix A) and the references therein for further discussions on this topic.
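As an illustration of the closed form above, the following Python snippet implements the entropic Bregman proximal step on the simplex (an exponentiated-gradient update, written here for the proximal center passed as `z`), with a standard log-sum-exp shift for numerical stability; the helper name and the stability trick are ours, and the center is assumed to have positive entries.

```python
import numpy as np

def entropy_prox(z, v, t):
    """Bregman step (4.25) on the simplex with w = entropy and h = simplex indicator:
    returns the normalized multiplicative update z * exp(-t * v), assuming z > 0."""
    logits = np.log(z) - t * v
    logits -= logits.max()          # shift before exponentiation, for numerical stability
    p = np.exp(logits)
    return p / p.sum()

# Can be passed as the Bregman step of the accelerated sketch given earlier,
# starting from the uniform distribution x0 = (1/d, ..., 1/d).
```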
Notes & References

Potential functions were used in the original paper by Nesterov (1983) to develop accelerated methods. Nesterov (2003) developed estimate sequences as an alternate, more constructive approach to obtaining optimal first-order methods. Since then, both approaches were used in many references on this topic, in various settings, and Tseng (2008) provides a nice unified view on accelerated methods. Estimate sequences were extensively studied by e.g. Nesterov (2013), Baes (2009), Devolder (2011), and Kulunchakov and Mairal (2019).
Nesterov's methods were also interpreted and studied in a continuous-time setting, through ordinary differential equations, in the smooth convex case (Su et al., 2016; Krichene et al., 2015; Attouch and Peypouquet, 2016), and in the strongly convex one (Siegel, 2019), where constant momentum schemes can be deployed. In this case, Nesterov's method can be seen as an appropriate integration scheme of the gradient flow (Scieur et al., 2017). A continuous-time counterpart to estimate sequences, referred to as approximate duality gaps, was proposed by Diakonikolas and Orecchia (2019b).
Beyond Euclidean Geometries.
Mirror descent dates back to the work of Nemirovsky and Yudin (1983a). It was further developed and used in many subsequent works (Ben-Tal and Nemirovski, 2001; Nesterov, 2005; Nesterov, 2009; Xiao, 2010; Juditsky and Nesterov, 2014). Nice pedagogical surveys can be found in (Beck and Teboulle, 2003; Juditsky and Nemirovsky, 2011a; Juditsky and Nemirovsky, 2011b; Bubeck, 2015).
Beyond the setting described in this chapter, mirror descent was also studied in the relative smoothness setting, introduced by Bauschke et al. (2016) (see also Teboulle, 2018) and extended to a notion of relative strong convexity by Lu et al. (2018). However, acceleration remains an open issue there, and it is generally unclear which additional assumptions allow accelerated rates. It is however clear that additional assumptions are required, as emphasized by the lower bound provided by Dragomir et al. (2019) in a generic setting.
Lower Complexity Bounds.
Lower complexity bounds were studied in a variety of settings to establish limits on the worst-case performance of black-box methods. The classical reference on this topic is the book of Nemirovsky and Yudin (1983b). Of particular interest to us, Nemirovsky (1991) and Nemirovsky (1992) establish optimality of Chebyshev and conjugate gradient methods for convex quadratic minimization. Lower bounds for black-box first-order methods in the context of smooth convex and smooth strongly convex optimization can be found in (Nesterov, 2003). The final lower bound for black-box smooth convex minimization was obtained by Drori (2017) and shows optimality of the optimized gradient method, as well as that of conjugate gradients, as discussed earlier in this chapter. Lower bounds for $\ell_p$ norms in the mirror descent setup are constructed in Guzmán and Nemirovsky (2015), whereas a lower bound for mirror descent in the relative smoothness setup is provided by Dragomir et al. (2019).

Changing the Performance Measure.
Obtaining (practical) accelerated methods for other types of convergence criteria, such as gradient norms, is still not a fully settled issue. Those criteria are important in other contexts, including dual methods, and to draw links between methods intrinsically designed for solving convex problems and those used in nonconvex settings, for which finding stationary points is the goal. There are a few tricks allowing to pass from a guarantee in one context to another. For example, a regularization trick is proposed in (Nesterov, 2012b), yielding approximate solutions with small gradient norm. Beyond that, in the context of smooth convex minimization, recent progress was made by Kim and Fessler (2020), who designed an optimized method for minimizing the gradient norm after a given number of iterations. Corresponding lower bounds, based on quadratic minimization, for a variety of performance measures can be found in (Nemirovsky, 1992).
Backtracking Line Searches.
The idea of using backtracking line searches is classical,and attributed to e.g. Goldstein (1962) and Armijo (1966)—see e.g., discussions (No-cedal and Wright, 2006; Bonnans et al. , 2006). It was already incorporated in the originalwork of Nesterov (1983) estimating the smoothness constant within an accelerated gra-dient method. Since then, many works on the topic heavily rely on this technique, oftenadapted to obtain better practical performance, see for example (Scheinberg et al. , 2014;Chambolle and Pock, 2016; Florea and Vorobyov, 2018; Calatroni and Chambolle, 2019).A more recent adaptive step size strategy can be found in (Malitsky and Mishchenko,2020), though without acceleration.
Inexactness, Stochasticity, and Randomness.
The ability to use approximate first-order information, be it stochastic or deterministic, is key for tackling certain problemsfor which computing exact gradient is expansive. Deterministic (or adversarial) errormodels are studied in e.g. (d’Aspremont, 2008; Schmidt et al. , 2011; Devolder et al. ,2014; Devolder, 2013; Devolder et al. , 2013) through different noise models. Such ap-proaches can also be deployed when the projection/proximal operation is computedapproximately (Güler, 1992; Schmidt et al. , 2011; Villa et al. , 2013) (see also Chapter 5and the references therein).Similarly, stochastic approximations and incremental gradient methods are key inmany statistical learning problems, where samples are accessed one at a time and forwhich it is not desirable to optimize beyond data accuracy (Bottou and Bousquet, 2007).For this reason, the old idea of stochastic approximations (Robbins and Monro, 1951) isstill widely used and an active area of research. Their “optimal” variants were developedmuch later (Hu et al. , 2009; Xiao, 2010; Devolder, 2011; Lan, 2012; Dvurechensky andGasnikov, 2016) with the rise of machine learning applications—in particular, we notethat “stochastic” estimate sequences were developed in (Devolder, 2011; Kulunchakovand Mairal, 2019). The case of the stochastic noise arising from sampling an objectivefunction that is a finite sum of smooth components attracted a lot of attention in the2010’s, starting with (Schmidt et al. , 2017; Johnson and Zhang, 2013; Shalev-Shwartz Nesterov Acceleration and Zhang, 2013; Defazio et al. , 2014a; Defazio et al. , 2014b; Mairal, 2015) and thenextended to feature acceleration techniques (Shalev-Shwartz and Zhang, 2014; Allen-Zhu, 2017; Zhou et al. , 2018; Zhou et al. , 2019). Acceleration techniques also apply inthe context of randomized block coordinate descent as well, see for example Nesterov(2012a), Lee and Sidford (2013), Fercoq and Richtárik (2015), and Nesterov and Stich(2017).
Higher-order Methods.
Acceleration mechanisms, via estimate sequences, were alsoproposed in the context of higher-order methods, as well; see for example (Nesterov,2008; Baes, 2009; Wilson et al. , 2016), and optimal methods presented by (Gasnikov etal. , 2019) and by (Monteiro and Svaiter, 2013) (which we also discuss in the next chapter).It was not clear before the work of Nesterov (2019) that intermediate subproblems arisingin the context of higher-order methods were tractable. The fact that tractability is notan issue has attracted a lot of attention towards these methods.
Optimized Methods
Optimized gradient methods were premised by Drori and Teboulle(2014), and discovered by Kim and Fessler (2016). Since then, optimized methods havebeen studied in various settings, incorporating constraints/proximal terms (Kim andFessler, 2018b; Taylor et al. , 2017a), optimizing gradient norms (Kim and Fessler, 2018c;Kim and Fessler, 2020) (as an alternative to (Nesterov, 2012b)), adapting to unknownproblem parameters using exact line searches (Drori and Taylor, 2020), or restarts (Kimand Fessler, 2018a), and in the strongly convex case (Van Scoy et al. , 2017; Cyrus et al. ,2018; Taylor and Drori, 2021). Such methods also appeared in the context of fixed-pointiterations (Lieder, 2020), and proximal methods (Kim, 2019; Barré et al. , 2020a).
On Obtaining Proofs from this Chapter.
The worst-case performance of first-order methods can often be computed numerically, as shown in (Drori and Teboulle, 2014; Drori, 2014; Drori and Teboulle, 2016) through the introduction of performance estimation problems. The performance estimation approach was shown to provide tight certificates, from which one could recover both worst-case certificates and matching worst-case problem instances, in (Taylor et al., 2017c; Taylor et al., 2017a). A consequence is that worst-case guarantees for first-order methods, such as those detailed in this chapter, can always be obtained as weighted sums of appropriate inequalities characterizing the problem at hand, see for instance (De Klerk et al., 2017; Dragomir et al., 2019). A similar approach, framed in control-theoretic terms and originally tailored to obtain geometric convergence rates, was developed by Lessard et al. (2016) and can also be used to form potential functions (Hu and Lessard, 2017; Fazlyab et al., 2018), as well as optimized methods such as the triple momentum method (Van Scoy et al., 2017; Cyrus et al., 2018; Lessard and Seiler, 2020).
Proofs of this section were obtained using the performance estimation approach tailored for potential functions (Taylor and Bach, 2019) together with the performance estimation toolbox (Taylor et al., 2017b). These techniques can be used for either validating or rediscovering the proofs of this chapter numerically, through semidefinite programming. For reproducibility purposes, we provide the corresponding codes, as well as notebooks for symbolically verifying the algebraic reformulations of this section, at https://github.com/AdrienTaylor/AccelerationMonograph.

Appendices to Chapter 4

In this section, we prove basic inequalities involving smooth strongly convex functions. Most of these inequalities are not used in our developments. Nevertheless, we believe they are useful for gaining intuition about those classes of functions, as well as for comparisons with previous works.
Useful Inequalities

In this section, we consider a Euclidean setting, where $\|x\|^2 = \langle x; x\rangle$, and $\langle \cdot;\cdot\rangle : \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}$ is a dot product. The following theorem summarizes known inequalities characterizing the class of smooth convex functions. Note that those characterizations of $f \in \mathcal{F}_{0,L}$ are all equivalent assuming $f \in \mathcal{F}_{0,\infty}$, as convexity is not implied by some of the points below. In particular, (i), (ii), (v), (vi), or (vii) alone do not encode convexity of $f$, whereas (iii) or (iv) encode both smoothness and convexity. Theorem 4.26.
Let f : R d → R be a differentiable convex function. The followingstatements are equivalent for inclusion in F ,L .(i) ∇ f satisfies a Lipschitz condition; for all x, y ∈ R d k∇ f ( x ) − ∇ f ( y ) k ≤ L k x − y k . (ii) f is upper bounded by quadratic functions; for all x, y ∈ R d f ( x ) ≤ f ( y ) + h∇ f ( y ); x − y i + L k x − y k . (iii) f satisfies, for all x, y ∈ R d f ( x ) ≥ f ( y ) + h∇ f ( y ); x − y i + 12 L k∇ f ( x ) − ∇ f ( y ) k . .A. Useful Inequalities (iv) ∇ f is cocoercive; for all x, y ∈ R d h∇ f ( x ) − ∇ f ( y ); x − y i ≥ L k∇ f ( x ) − ∇ f ( y ) k . (v) ∇ f satisfies, for all x, y ∈ R d h∇ f ( x ) − ∇ f ( y ); x − y i ≤ L k x − y k . (vi) L k x k − f ( x ) is convex.(vii) f satisfies, for all λ ∈ [0 , f ( λx + (1 − λ ) y ) ≥ λf ( x ) + (1 − λ ) f ( y ) − λ (1 − λ ) L k x − y k . Proof.
Let us start by (i) ⇒ (ii). We use a first-order expansion f ( y ) = f ( x ) + Z h∇ f ( x + τ ( y − x )); y − x i dτ. The quadratic upper bound then follows from algebraic manipulations, and from upperbounding the integral term f ( y ) = f ( x ) + h∇ f ( x ); y − x i + Z h∇ f ( x + τ ( y − x )) − ∇ f ( x ); y − x i dτ ≤ f ( x ) + h∇ f ( x ); y − x i + Z k∇ f ( x + τ ( y − x )) − ∇ f ( x ) kk y − x k dτ ≤ f ( x ) + h∇ f ( x ); y − x i + L k x − y k Z τ dτ = f ( x ) + h∇ f ( x ); y − x i + L k x − y k . We proceed with (ii) ⇒ (iii). The idea is to require the quadratic upper bound to beeverywhere above the linear lower bound arising from convexity of f , that is, for all x, y, z ∈ R d f ( y ) + h∇ f ( y ); z − y i ≤ f ( z ) ≤ f ( x ) + h∇ f ( x ); z − x i + L k x − z k . In other words, we must have for all z ∈ R d f ( y ) + h∇ f ( y ); z − y i ≤ f ( x ) + h∇ f ( x ); z − x i + L k x − z k ⇔ f ( y ) − f ( x ) + h∇ f ( y ); z − y i − h∇ f ( x ); z − x i − L k x − z k ≤ ⇔ f ( y ) − f ( x ) + max z ∈ R d h∇ f ( y ); z − y i − h∇ f ( x ); z − x i − L k x − z k ≤ ⇔ f ( y ) − f ( x ) + h∇ f ( y ); x − y i + 12 L k∇ f ( x ) − ∇ f ( y ) k ≤ , Nesterov Acceleration where the last line follows from the explicit maximization on z , that is, we pick z = x − L ( ∇ f ( x ) − ∇ f ( y )), reaching the desired result after base algebraic manipulations.We continue with (iii) ⇒ (iv), which simply follows from adding f ( x ) ≥ f ( y ) + h∇ f ( y ); x − y i + 12 L k∇ f ( x ) − ∇ f ( y ) k f ( y ) ≥ f ( x ) + h∇ f ( x ); y − x i + 12 L k∇ f ( x ) − ∇ f ( y ) k . For obtaining (iv) ⇒ (i), one can use Cauchy-Schwartz1 L k∇ f ( x ) − ∇ f ( y ) k ≤ h∇ f ( x ) − ∇ f ( y ); x − y i ≤ k∇ f ( x ) − ∇ f ( y ) kk x − y k , which allows concluding k∇ f ( x ) − ∇ f ( y ) k ≤ L k x − y k , reaching the final statement.For obtaining (ii) ⇒ (v), we simply add f ( x ) ≤ f ( y ) + h∇ f ( y ); x − y i + L k x − y k f ( y ) ≤ f ( x ) + h∇ f ( x ); y − x i + L k x − y k , and reorganize the resulting inequality.For obtaining (v) ⇒ (ii), we use a first-order expansion, again f ( y ) = f ( x ) + Z h∇ f ( x + τ ( y − x )); y − x i dτ. The quadratic upper bound then follows from algebraic manipulations, and from upperbounding the integral term (we use the intermediate variable z τ = x + τ ( y − x ) forconvenience) f ( y ) = f ( x ) + h∇ f ( x ); y − x i + Z h∇ f ( x + τ ( y − x )) − ∇ f ( x ); y − x i dτ = f ( x ) + h∇ f ( x ); y − x i + Z τ h∇ f ( z τ ) − ∇ f ( x ); z τ − x i dτ ≤ f ( x ) + h∇ f ( x ); y − x i + Z Lτ k z τ − x k dτ = f ( x ) + h∇ f ( x ); y − x i + L k x − y k Z τ dτ = f ( x ) + h∇ f ( x ); y − x i + L k x − y k . For the equivalence (vi) ⇔ (ii), simply define h ( x ) = L k x k − f ( x ) (and hence ∇ h ( x ) = Lx − ∇ f ( x )) and observe that for all x, y ∈ R d h ( x ) ≥ h ( y ) + h∇ h ( y ); x − y i ⇔ f ( x ) ≤ f ( y ) + h∇ f ( y ); x − y i + L k x − y k , which follows from base algebraic manipulations. .A. Useful Inequalities Finally, the equivalence (vi) ⇔ (vii) follows the same h ( x ) = L k x k − f ( x ) (and hence ∇ h ( x ) = Lx − ∇ f ( x )) and the observation that for all x, y ∈ R d and λ ∈ [0 ,
1] we have h ( λx + (1 − λ ) y ) ≤ λh ( x ) + (1 − λ ) h ( y ) ⇔ f ( λx + (1 − λ ) y ) ≥ λf ( x ) + (1 − λ ) f ( y ) − λ (1 − λ ) L k x − y k , which follows from base algebraic manipulations.For obtaining the corresponding inequalities in the strongly convex case, one can relyon Fenchel conjugation between smoothness and strong convexity, see for example, (Rock-afellar and Wets, 1998, Proposition 12.6). The following inequalities are stated withoutproofs, and can be obtained either as direct consequences of the definitions, or fromFenchel conjugation along with statements of Theorem 4.26. Theorem 4.27.
Let f : R d → R be a closed, proper, and convex function. The followingstatements are equivalent for inclusion in F µ,L .(i) ∇ f satisfies a Lipschitz and an inverse Lipschitz condition, for all x, y ∈ R d µ k x − y k ≤ k∇ f ( x ) − ∇ f ( y ) k ≤ L k x − y k . (ii) f is lower and upper bounded by quadratic functions; for all x, y ∈ R d f ( y ) + h∇ f ( y ); x − y i + µ k x − y k ≤ f ( x ) ≤ f ( y ) + h∇ f ( y ); x − y i + L k x − y k . (iii) f satisfies, for all x, y ∈ R d f ( y )+ h∇ f ( y ); x − y i + 12 L k∇ f ( x ) − ∇ f ( y ) k ≤ f ( x ) ≤ f ( y ) + h∇ f ( y ); x − y i + 12 µ k∇ f ( x ) − ∇ f ( y ) k . (iv) ∇ f satisfies for all x, y ∈ R d L k∇ f ( x ) − ∇ f ( y ) k ≤ h∇ f ( x ) − ∇ f ( y ); x − y i ≤ µ k∇ f ( x ) − ∇ f ( y ) k . (v) ∇ f satisfies for all x, y ∈ R d µ k x − y k ≤ h∇ f ( x ) − ∇ f ( y ); x − y i ≤ L k x − y k . (vi) for all λ ∈ [0 , λf ( x )+(1 − λ ) f ( y ) − λ (1 − λ ) L k x − y k ≤ f ( λx + (1 − λ ) y ) ≤ λf ( x )+(1 − λ ) f ( y ) − λ (1 − λ ) µ k x − y k Nesterov Acceleration (vii) f ( x ) − µ k x k and L k x k − f ( x ) are convex and ( L − µ )-smooth.Finally, let us mention that the existence of an inequality allowing to encode bothsmoothness and strong convexity together. This inequality is also known as an interpola-tion inequality (Taylor et al. , 2017c) and turns out to be particularly useful for provingworst-case guarantees. Theorem 4.28.
Let f : R d → R be a differentiable function. f is L -smooth µ -stronglyconvex if and only if f ( x ) ≥ f ( y )+ h∇ f ( y ); x − y i + 12 L k∇ f ( x ) − ∇ f ( y ) k + µ − µ/L ) k x − y − L ( ∇ f ( x ) − ∇ f ( y )) k . (4.26) Proof. ( f ∈ F µ,L ⇒ (4.26)) The idea is to require the quadratic upper boundfrom smoothness to be everywhere above the quadratic lower bound arising from strongconvexity, that is, for all x, y, z ∈ R d f ( y ) + h∇ f ( y ); z − y i + µ k z − y k ≤ f ( z ) ≤ f ( x ) + h∇ f ( x ); z − x i + L k x − z k . In other words, we must have for all z ∈ R d f ( y ) + h∇ f ( y ); z − y i + µ k z − y k ≤ f ( x ) + h∇ f ( x ); z − x i + L k x − z k ⇔ f ( y ) − f ( x ) + h∇ f ( y ); z − y i + µ k z − y k − h∇ f ( x ); z − x i − L k x − z k ≤ ⇔ f ( y ) − f ( x ) + max z ∈ R d h∇ f ( y ); z − y i + µ k z − y k − h∇ f ( x ); z − x i − L k x − z k ≤ z , that is, picking z = Lx − µyL − µ − L − µ ( ∇ f ( x ) − ∇ f ( y )) allowsreaching the desired inequality after base algebraic manipulations.((4.26) ⇒ f ∈ F µ,L ) f ∈ F ,L is direct by observing that (4.26) is stronger thanTheorem 4.26(iii); f ∈ F µ,L is then direct by reformulating (4.26) as f ( x ) ≥ f ( y )+ h∇ f ( y ); x − y i + µ k x − y k + 12 L (1 − µ/L ) k∇ f ( x ) − ∇ f ( y ) − µ ( x − y ) k , which is stronger than f ( x ) ≥ f ( y ) + h∇ f ( y ); x − y i + µ k x − y k . Remark 4.13.
It is crucial to recall that some inequalities above are only valid when dom f = R d . We refer to (Drori, 2018) for illustrating that some inequalities are notvalid when restricted on some dom f = R d . .A. Useful Inequalities In this section, we show that requiring a Lipschitz condition on ∇ f , on a convex set C ⊆ R d , implies a quadratic upper bound on f . That is, requiring that for all x, y ∈ C k∇ f ( x ) − ∇ f ( y ) k ∗ ≤ L k x − y k , where k . k is some norm and k . k ∗ is the corresponding dual norm, implies a quadraticupper bound ∀ x, y ∈ Cf ( x ) ≤ f ( y ) + h∇ f ( y ); x − y i + L k x − y k . Theorem 4.29.
Let f : R d → R ∪ { + ∞} be continuously differentiable on some openconvex set C ⊆ R d , and satisfying a Lipschitz condition k∇ f ( x ) − ∇ f ( y ) k ∗ ≤ L k x − y k , for all x, y ∈ C . Then, it holds that f ( x ) ≤ f ( y ) + h∇ f ( y ); x − y i + L k x − y k , for all x, y ∈ C . Proof.
The desired result is obtained from a first-order expansion f ( y ) = f ( x ) + Z h∇ f ( x + τ ( y − x )); y − x i dτ. The quadratic upper bound then follows from algebraic manipulations, and from upperbounding the integral term f ( y ) = f ( x ) + h∇ f ( x ); y − x i + Z h∇ f ( x + τ ( y − x )) − ∇ f ( x ); y − x i dτ ≤ f ( x ) + h∇ f ( x ); y − x i + Z k∇ f ( x + τ ( y − x )) − ∇ f ( x ) k ∗ k y − x k dτ ≤ f ( x ) + h∇ f ( x ); y − x i + L k x − y k Z τ dτ = f ( x ) + h∇ f ( x ); y − x i + L k x − y k . Nesterov Acceleration
In this short section, we show that Algorithm 7 and Algorithm 8 generate the samesequence { y k } . A direct consequence of this statement is that the sequences { x k } alsomatch, as they are generated, in both cases, from simple gradient steps on { y k } .For doing so, we show that Algorithm 8 is a reformulation of Algorithm 7. Proposition 4.1.
The sequence { y k } k generated by Algorithm 7 is equal to that ofAlgorithm 8. Proof.
Let us first observe that the sequences are initiated the same way in bothformulations of OGM. Furthermore, consider one iteration of OGM in form I, we have: y k = − θ k,N ! x k + 1 θ k,N z k . Clearly, we therefore have z k = θ k,N y k + (1 − θ k,N ) x k . At the next iteration, we have y k +1 = − θ k +1 ,N ! x k +1 + 1 θ k +1 ,N (cid:18) z k − θ k,N L ∇ f ( y k ) (cid:19) = − θ k +1 ,N ! x k +1 + 1 θ k +1 ,N (cid:18) θ k,N y k + (1 − θ k,N ) x k − θ k,N L ∇ f ( y k ) (cid:19) where we substituted z k by its equivalent expression from previous iteration. Now, bynoting that − L ∇ f ( y k ) = x k +1 − y k , we reach y k +1 = θ k +1 ,N − θ k +1 ,N x k +1 + 1 θ k +1 ,N ((1 − θ k,N ) x k + 2 θ k,N x k +1 − θ k,N y k )= x k +1 + θ k,N − θ k +1 ,N ( x k +1 − x k ) + θ k,N θ k +1 ,N ( x k +1 − y k )where we reorganized the terms for reaching the same format as in Algorithm 8. The two sequences { x k } k and { y k } k generated by Algorithm 9 areequal to those of Algorithm 10. Proof.
In order to prove the result, we use the identities A k +1 = a k , as well as A k = P k − i =0 a i , and a k +1 = a k + a k +1 .Given that the sequences { x k } k are obtained from gradient steps on y k in bothformulations, it is sufficient to prove that the sequences { y k } match. The equivalenceis clear for k = 0, as both methods generate y = x − L ∇ f ( x ). For k ≥
0, fromAlgorithm 9, one can write iteration k as y k = A k A k +1 x k + (cid:18) − A k A k +1 (cid:19) z k , .B. Relations between Acceleration Methods and hence z k = A k +1 A k +1 − A k y k + (cid:18) − A k +1 A k +1 − A k (cid:19) x k = a k y k + (1 − a k ) x k . Substituting this expression in that of iteration k + 1, we reach y k +1 = A k +1 A k +2 x k +1 + A k +2 − A k +1 A k +2 (cid:18) z k − A k +1 − A k L ∇ f ( y k ) (cid:19) = a k a k +1 x k +1 + 1 a k +1 (cid:18) a k y k + (1 − a k ) x k − a k L ∇ f ( y k ) (cid:19) = a k a k +1 x k +1 + 1 a k +1 ( a k x k +1 + (1 − a k ) x k )= x k +1 + a k − a k +1 ( x k +1 − x k ) , where we substituted the expression of z k , and used previous identities for reaching thedesired statement.The same relationship holds with Algorithm 11, as provided by the next proposition. Proposition 4.3.
The three sequences { z k } k , { x k } k and { y k } k generated by Algorithm 9are equal to those of Algorithm 11. Proof.
Clearly, we have x = z = y in both methods. Let us assume that thesequences match up to iteration k , that is, up to y k − , x k and z k . Clearly, both y k and z k +1 are computed in the exact same way in both methods. It remains to compare updaterules for x k +1 ; in Algorithm 11, we have x k +1 = A k A k +1 x k + (cid:18) − A k A k +1 (cid:19) z k +1 = y k − (cid:18) − A k A k +1 (cid:19) A k +1 − A k L ∇ f ( y k )where we used the update rule for z k +1 . Further simplifications, along with the identity( A k +1 − A k ) = A k +1 allows arriving to x k +1 = y k − ( A k +1 − A k ) LA k +1 ∇ f ( y k )= y k − L ∇ f ( y k ) , which is clearly the same update rule as that of Algorithm 9, and hence all sequencesmatch and the desired statement is proved. Nesterov Acceleration
In this short section, we provide alternate, equivalent, formulations for Algorithm 12.
Algorithm 20
Nesterov’s method, form II
Input: A L -smooth µ -strongly convex function f , initial point x . Initialize z = x , set q = µ/L , A = 0, and A = (1 − q ) − . for k = 0 , . . . do A k +2 = A k +1 +1+ p A k +1 +4 qA k +1 +12(1 − q ) x k +1 = y k − L ∇ f ( y k ) y k +1 = x k +1 + β k ( x k +1 − x k ) with β k = ( A k +2 − A k +1 )( A k +1 (1 − q ) − A k − A k +2 (2 qA k +1 +1) − qA k +1 end forOutput: An approximate solution x N . Proposition 4.4.
The two sequences { x k } k and { y k } k generated by Algorithm 12 areequal to those of Algorithm 20. Proof.
Without loss of generality, we can consider that a third sequence z k is presentin Algorithm 20 (although it is not computed).Clearly, we have x = z = y in both methods. Let us assume that the sequencesmatch up to iteration k , that is, up to y k , x k and z k . Clearly, x k +1 is computed in theexact same way in both methods, as a gradient step from y k , and it remains to compareupdate rules for y k +1 . In Algorithm 12, we have y k +1 = x k + ( τ k − τ k +1 ( τ k − − qδ k )) ( z k − x k ) − ( δ k − τ k +1 + 1 L ∇ f ( y k ) , whereas in Algorithm 12, we have y k +1 = x k + ( β k + 1) τ k ( z k − x k ) − β k L ∇ f ( y k ) . By remarking that β k = τ k +1 ( δ k − ∇ f ( y k ) match in bothexpressions, and it remains to check that( β k + 1) τ k − ( τ k − τ k +1 ( τ k − − qδ k ))is identically 0 for reaching the desired statement. Substituting β k = τ k +1 ( δ k − τ k +1 ( δ k ( τ k (1 − q ) + q ) − , and we have to verify ( δ k ( τ k (1 − q ) + q ) −
1) to be zero. Substituting and reworking thisexpression, using those of τ k , and δ k , we arrive to τ k (cid:16) ( A k +1 − A k ) − A k +1 − qA k +1 (cid:17) ( A k +1 − A k )(1 + qA k +1 ) = 0 , .B. Relations between Acceleration Methods as we recognize ( A k +1 − A k ) − A k +1 − qA k +1 = 0 (which is the expression we used forpicking A k +1 ). Algorithm 21
Nesterov’s method, form III
Input: A L -smooth µ -strongly convex function f , initial point x . Initialize z = x and A = 0, set q = µ/L . for k = 0 , . . . do A k +1 = A k +1+ √ A k +4 qA k +12(1 − q ) set τ k = ( A k +1 − A k )(1+ qA k ) A k +1 +2 qA k A k +1 − qA k and δ k = A k +1 − A k qA k +1 y k = x k + τ k ( z k − x k ) z k +1 = (1 − qδ k ) z k + qδ k y k − δ k L ∇ f ( y k ) x k +1 = A k A k +1 x k + (1 − A k A k +1 ) z k +1 end forOutput: An approximate solution x N . Proposition 4.5.
The three sequences { z k } k , { x k } k and { y k } k generated by Algorithm 12are equal to those of Algorithm 21. Proof.
Clearly, we have x = z = y in both methods. Let us assume that thesequences match up to iteration k , that is, up to y k − , x k and z k . Clearly, y k and z k +1 are computed in the exact same way in both methods, so we only have to verify thatthe update rules for x k +1 match. In other words, we have to verify A k A k +1 x k + (1 − A k A k +1 ) z k +1 = y k − L ∇ f ( y k ) , which, using the update rules for z k +1 and y k amounts to verify − ( A k +1 − A k ) − A k +1 − qA k +1 LA k +1 (1 + qA k +1 ) ∇ f ( y k ) = 0 , which is true, as we recognize ( A k +1 − A k ) − A k +1 − qA k +1 = 0 (which is the expressionused for choosing A k +1 ). Nesterov Acceleration
Historically, Nesterov’s accelerated gradient method (Nesterov, 1983) was preceded by afew other methods with optimal worst-case convergence rates O ( N − ) for smooth convexminimization. However, those alternate schemes required the possibility of optimizingexactly in one dimension or few (Nemirovsky, 1982)—the first works in this line requiredthe ability to solve exactly intermediary two or three dimensional optimization problems;unfortunately they are not to be found. Those methods were obtained through links withconjugate gradients (Algorithm 22), as by-products of the worst-case analysis. It turnsout that the connection between OGM and conjugate gradients is absolutely perfect: theexact same proof (achieving the lower bound) is valid for both. The conjugate gradient Algorithm 22
Conjugate gradient method
Input: A L -smooth convex function f , initial point y , budget N . for k = 0 , . . . , N − do y k +1 = argmin x { f ( x ) : x ∈ y + span {∇ f ( y ) , ∇ f ( y ) , . . . , ∇ f ( y k ) }} end forOutput: An approximate solution y N .method (CG) for solving quadratic optimization problems is known to have an efficientform not requiring to perform those span searches (which are in general too expensiveto be of any practical interest), see for example (Nocedal and Wright, 2006). Beyondquadratics, it is in general not possible to reformulate the method in an efficient way.However, it is possible to find other methods for which the exact same worst-case analysisapplies, and it turns out that OGM is one of them. Similarly, by slightly weakening theanalysis of CG, one can find other methods, such as Nesterov’s accelerated gradients.More precisely, let us recall the previous definition for the sequence { θ k,N } k definedin (4.8); θ k +1 ,N = p θ k,N +12 if k ≤ N − p θ k,N +12 if k = N − . As a result of the worst-case analysis presented below, all methods satisfying h∇ f ( y i ); y i − " − θ i,N ! (cid:16) y i − − L ∇ f ( y i − ) (cid:17) + 1 θ i,N y − L i − X j =0 θ j,N ∇ f ( y j ) i ≤ . (4.27)achieve the optimal worst-case complexity of smooth convex minimization, providedby Theorem 4.7. On the one hand, CG ensures this inequality to hold thanks to span .C. Conjugate Gradient Method searches (i.e., orthogonality of successive search directions), that is h∇ f ( y i ); y i − y i − + 1 θ i,N ( y i − − y ) i = 0 h∇ f ( y i ); ∇ f ( y ) i = 0... h∇ f ( y i ); ∇ f ( y i − ) i = 0 . On the other hand OGM enforces this inequality by using y i = − θ i,N ! (cid:16) y i − − L ∇ f ( y i − ) (cid:17) + 1 θ i,N y − L i − X j =0 θ j,N ∇ f ( y j ) . Optimized and Conjugate Gradient Methods: Worst-case Analyses
The worst-case analysis below relies on the exact same potentials as for the optimizedgradient method, see Theorem 4.4 and Lemma 4.5.
Theorem 4.30.
Let f be a L -smooth convex function, N ∈ N , and some x ⋆ ∈ argmin x f ( x ).The iterates of the conjugate gradient method (CG, Algorithm 22), and of all methodswhose iterates are compliant with (4.27), satisfy f ( y N ) − f ( x ⋆ ) ≤ L k y − x ⋆ k θ N,N , for all y ∈ R d . Proof.
The result is obtained from the exact same potential as that of OGM, ob-tained from more inequalities. That is, first perform a weighted sum of the followinginequalities.• Smoothness and convexity of f between y k − and y k with weight λ = 2 θ k − ,N ≥ f ( y k ) − f ( y k − ) + h∇ f ( y k ); y k − − y k i + 12 L k∇ f ( y k ) − ∇ f ( y k − ) k , • smoothness and convexity of f between x ⋆ and y k with weight λ = 2 θ k,N ≥ f ( y k ) − f ( x ⋆ ) + h∇ f ( y k ); x ⋆ − y k i + 12 L k∇ f ( y k ) k . • search procedure for obtaining y k , with weight λ = 2 θ k,N ≥ h∇ f ( y k ); y k − " − θ k,N ! (cid:16) y k − − L ∇ f ( y k − ) (cid:17) + 1 θ k,N z k i , where we used z k := y − L P k − j =0 θ j,N ∇ f ( y j ). Nesterov Acceleration
The weighted sum is a valid inequality0 ≥ λ [ f ( y k ) − f ( y k − ) + h∇ f ( y k ); y k − − y k i + 12 L k∇ f ( y k ) − ∇ f ( y k − ) k ]+ λ [ f ( y k ) − f ( x ⋆ ) + h∇ f ( y k ); x ⋆ − y k i + 12 L k∇ f ( y k ) k ]+ λ [ h∇ f ( y k ); y k − " − θ k,N ! (cid:16) y k − − L ∇ f ( y k − ) (cid:17) + 1 θ k,N z k i ] . Substituting z k +1 , the previous inequality can be reformulated exactly as0 ≥ θ k,N (cid:18) f ( y k ) − f ⋆ − L k∇ f ( y k ) k (cid:19) + L k z k +1 − x ⋆ k − θ k − ,N (cid:18) f ( y k − ) − f ⋆ − L k∇ f ( y k − ) k (cid:19) − L k z k − x ⋆ k + 2 (cid:16) θ k − ,N − θ k,N + θ k,N (cid:17) (cid:18) f ( y k ) − f ⋆ + 12 L k∇ f ( y k ) k (cid:19) + 2 (cid:16) θ k − ,N − θ k,N + θ k,N (cid:17) h∇ f ( y k ); y k − − L ∇ f ( y k − ) − y k i . We reach the desired inequality by picking θ k,N ≥ θ k − ,N satisfying θ k − ,N − θ k,N + θ k,N = 0 , reaching the same potential as in Theorem 4.4.For obtaining the technical lemma allowing to bound the final f ( y N ) − f ⋆ , we followthe same steps with the following inequalities.• Smoothness and convexity of f between y k − and y k with weight λ = 2 θ N − ,N ≥ f ( y N ) − f ( y N − ) + h∇ f ( y N ); y N − − y N i + 12 L k∇ f ( y N ) − ∇ f ( y N − ) k , • smoothness and convexity of f between x ⋆ and y k with weight λ = θ N,N ≥ f ( y N ) − f ( x ⋆ ) + h∇ f ( y N ); x ⋆ − y N i + 12 L k∇ f ( y N ) k . • search procedure for obtaining y N , with weight λ = θ N,N ≥ h∇ f ( y N ); y N − " − θ N,N ! (cid:16) y N − − L ∇ f ( y N − ) (cid:17) + 1 θ N,N z N i . The weighted sum can then be reformulated exactly as0 ≥ θ N,N ( f ( y N ) − f ⋆ ) + L k z N − θ N,N L ∇ f ( y N ) − x ⋆ k − θ N − ,N (cid:18) f ( y N − ) − f ⋆ − L k∇ f ( y N − ) k (cid:19) − L k z N − x ⋆ k + (cid:16) θ N − ,N − θ N,N + θ N,N (cid:17) (cid:18) f ( y N ) − f ⋆ + 12 L k∇ f ( y N ) k (cid:19) + (cid:16) θ N − ,N − θ N,N + θ N,N (cid:17) h∇ f ( y N ); y N − − L ∇ f ( y N − ) − y N i .C. Conjugate Gradient Method reaching the desired inequality as in Lemma 4.5 by picking θ N,N ≥ θ N − ,N satisfying2 θ N − ,N − θ N,N + θ N,N . Hence, the potential argument from Corollary 4.6 applies as such and we reach thedesired conclusion. In other words, one can define, for all k ∈ { , . . . , N } , φ k = 2 θ k − ,N (cid:18) f ( y k − ) − f ⋆ − L k∇ f ( y k − ) k (cid:19) + L k z k − x ⋆ k , and φ N +1 = θ N,N ( f ( y N ) − f ⋆ ) + L k z N − θ N,N L ∇ f ( y N ) − x ⋆ k , and reach the desired statement by chaining the inequalities θ N,N ( f ( y N ) − f ⋆ ) ≤ φ N +1 ≤ φ N ≤ . . . ≤ φ = L k y − x ⋆ k . Remark 4.14.
It is possible to further exploit the conjugate gradient method for design-ing practical accelerated methods in different settings, such as that of Nesterov (1983).Such points of view were exploited in, among others, (Narkiss and Zibulevsky, 2005;Karimi and Vavasis, 2016; Karimi and Vavasis, 2017; Diakonikolas and Orecchia, 2019a).The link between CG and OGM presented in this section is due to Drori and Taylor(2020), though under a different presentation. Nesterov Acceleration
In this section, we show how to incorporate backtracking strategies possibly not satisfying L k +1 ≥ L k , which is important in practice. The developments are essentially the same,and one possible trick is to incorporate all the knowledge about L k in A k . That is, weuse a rescaled shape for the potential function φ k = B k ( f ( x k ) − f ⋆ ) + 1 + µB k k z k − x ⋆ k , where, without backtracking strategy, B k = A k L . This somehow cosmetic change allows φ k to depend on L k solely via B k , and applies to both backtracking methods presentedin this chapter.The idea used for obtaining both methods below is that, one can perform the samecomputations as in Algorithm 12, replacing A k by L k +1 B k and A k +1 by L k +1 A k +1 atiteration k . So, as in previous versions, only the current approximate Lipschitz constant L k +1 is used at iteration k ; previous approximations were only used for computing B k . Algorithm 23
Strongly convex FISTA (general initialization of L k +1 ) Input: A L -smooth (possibly µ -strongly) convex function f , a convex function h withproximal operator available, an initial point x , and an initial estimate L > µ . Initialize z = x , B = 0, and some α > for k = 0 , . . . do Pick L k +1 ∈ ( µ, ∞ ){reasonable choices include L k +1 ∈ [ L , L k ].} loop set q k +1 = µ/L k +1 , B k +1 = L k +1 B k +1+ √ L k +1 B k +4 µL k +1 B k +12( L k +1 − µ ) set τ k = ( B k +1 − B k )(1+ µB k )( B k +1 +2 µB k B k +1 − µB k ) and δ k = L k +1 B k +1 − B k µB k +1 y k = x k + τ k ( z k − x k ) x k +1 = prox h/L k +1 (cid:16) y k − L k +1 ∇ f ( y k ) (cid:17) z k +1 = (1 − q k +1 δ k ) z k + q k +1 δ k y k + δ k ( x k +1 − y k ) if (4.20) holds then break {Iterates accepted; k will be incremented.} else L k +1 = αL k +1 {Iterates not accepted; recompute new L k +1 .} end if end loop end forOutput: An approximate solution x k +1 .The proof follows exactly the same lines as those of FISTA (Algorithm 4.20). In this .D. Proximal Accelerated Gradient Without Monotone Backtracking case, f is assumed to be smooth and convex over R d (i.e., it has full domain, dom f = R d ),and we are therefore allowed to evaluate gradients of f outside of the domain of h . Theorem 4.31.
Let f ∈ F µ,L (with full domain; dom f = R d ), h be a closed, properand convex function, x ⋆ ∈ argmin x { f ( x ) + h ( x ) } , and k ∈ N . For any x k , z k ∈ R d and B k ≥
0, the iterates of Algorithm 23 satisfying (4.20) also satisfy B k +1 ( F ( x k +1 ) − F ⋆ ) + 1 + µB k +1 k z k +1 − x ⋆ k ≤ B k ( F ( x k ) − F ⋆ ) + 1 + µB k k z k − x ⋆ k , with B k +1 = L k +1 B k +1+ √ L k +1 B k +4 µL k +1 B k +12( L k +1 − µ ) . Proof.
The proof consists in a weighted sum of the following inequalities:• strong convexity of f between x ⋆ and y k with weight λ = B k +1 − B k f ⋆ ≥ f ( y k ) + h∇ f ( y k ); x ⋆ − y k i + µ k x ⋆ − y k k , • strong convexity of f between x k and y k with weight λ = B k f ( x k ) ≥ f ( y k ) + h∇ f ( y k ); x k − y k i , • smoothness of f between y k and x k +1 ( descent lemma ) with weight λ = B k +1 f ( y k ) + h∇ f ( y k ); x k +1 − y k i + L k +1 k x k +1 − y k k ≥ f ( x k +1 ) , • convexity of h between x ⋆ and x k +1 with weight λ = B k +1 − B k h ( x ⋆ ) ≥ h ( x k +1 ) + h g h ( x k +1 ); x ⋆ − x k +1 i , with g h ( x k +1 ) ∈ ∂h ( x k +1 ) and x k +1 = y k − L k +1 ( ∇ f ( y k ) + g h ( x k +1 ))• convexity of h between x k and x k +1 with weight λ = B k h ( x k ) ≥ h ( x k +1 ) + h g h ( x k +1 ); x k − x k +1 i . We get the following inequality0 ≥ λ [ f ( y k ) − f ⋆ + h∇ f ( y k ); x ⋆ − y k i + µ k x ⋆ − y k k ]+ λ [ f ( y k ) − f ( x k ) + h∇ f ( y k ); x k − y k i ]+ λ [ f ( x k +1 ) − ( f ( y k ) + h∇ f ( y k ); x k +1 − y k i + L k +1 k x k +1 − y k k )]+ λ [ h ( x k +1 ) − h ( x ⋆ ) + h g h ( x k +1 ); x ⋆ − x k +1 i ]+ λ [ h ( x k +1 ) − h ( x k ) + h g h ( x k +1 ); x k − x k +1 i ] . Nesterov Acceleration
Substituting the y k , x k +1 , and z k +1 using y k = x k + τ k ( z k − x k ) x k +1 = y k − L k +1 ( ∇ f ( y k ) + g h ( x k +1 )) z k +1 = (1 − q k +1 δ k ) z k + q k +1 δ k y k + δ k ( x k +1 − y k )yields, after some basic but tedious algebra B k +1 ( f ( x k +1 ) + h ( x k +1 ) − f ( x ⋆ ) − h ( x ⋆ )) + 1 + B k +1 µ k z k +1 − x ⋆ k ≤ B k ( f ( x k ) + h ( x k ) − f ( x ⋆ ) − h ( x ⋆ )) + 1 + B k µ k z k − x ⋆ k + L k +1 ( B k − B k +1 ) − B k +1 − µB k +1 µB k +1 L k +1 k∇ f ( y k ) + g h ( x k +1 ) k − B k ( B k +1 − B k )(1 + µB k )(1 + µB k +1 ) (cid:0) B k +1 + 2 µB k B k +1 − µB k (cid:1) µ k x k − z k k . Picking B k +1 such that B k +1 ≥ B k and L k +1 ( B k − B k +1 ) − B k +1 − µB k +1 = 0 , yields the desired result B k +1 ( f ( x k +1 ) + h ( x k +1 ) − f ( x ⋆ ) − h ( x ⋆ )) + 1 + B k +1 µ k z k +1 − x ⋆ k ≤ B k ( f ( x k ) + h ( x k ) − f ( x ⋆ ) − h ( x ⋆ )) + 1 + B k µ k z k − x ⋆ k . Finally, we obtain a complexity guarantee by adapting the potential argument (4.5),and by noting that B k +1 is a decreasing function of L k +1 (whose maximal value is αL ,assuming L < L ; otherwise its maximal value is L ). The growth rate of B k in thesmooth convex setting remains unchanged, see (4.13), as we have B k +1 ≥ (cid:16) + p B k L k +1 (cid:17) L k +1 , and hence p B k +1 ≥ √ L k +1 + √ B k , and therefore B k ≥ (cid:16) k √ ℓ (cid:17) with ℓ = max { L , αL } and L k +1 ≤ ℓ . As for the geometric rate, we obtain, similarly B k +1 ≥ B k (cid:18) q µL k +1 (cid:19) − µL k +1 = B k − q µL k +1 , and therefore B k +1 ≥ (1 − q µℓ ) − B k . .D. Proximal Accelerated Gradient Without Monotone Backtracking Corollary 4.32.
Let f ∈ F µ,L ( R d ), h be a closed, proper and convex function, and x ⋆ ∈ argmin x { F ( x ) ≡ f ( x ) + h ( x ) } . For any N ∈ N , N ≥
1, and x ∈ R d , the output ofAlgorithm 23 satisfy F ( x N ) − F ⋆ ≤ min ( N , (cid:18) − r µℓ (cid:19) N ) ℓ k x − x ⋆ k , with ℓ = max { αL, L } . Proof.
We assume that
L > L as otherwise f ∈ F µ,L and the proof directly followsfrom the case without backtracking. The chained potential argument (4.5) can be usedas before. Using B = 0, we reach F ( x N ) − F ⋆ ≤ k x − x ⋆ k B N . Using our previous bounds on B N yields the desired result, using B = 1 L k +1 − µ ≥ ℓ − − µℓ = 2 ℓ − (cid:16) − q µℓ (cid:17) (cid:16) q µℓ (cid:17) ≥ ℓ − − q µℓ , and hence B N ≥ ℓ − (cid:16) − q µℓ (cid:17) − N , as well as B k ≥ (cid:16) k √ ℓ (cid:17) . Just as for FISTA, we can perform the same cosmetic change to Algorithm 18 for incor-porating a non-monotonic estimations of the Lipschitz constant. The proof is thereforeessentially that of Algorithm 18.
Theorem 4.33.
Let h ∈ F , ∞ , f ∈ F µ,L ( dom h ), x ⋆ ∈ argmin x { F ( x ) ≡ f ( x ) + h ( x ) } ,and k ∈ N . For any x k , z k ∈ R d and B k ≥
0, the iterates of Algorithm 24 satisfying (4.20)also satisfy B k +1 ( F ( x k +1 ) − F ⋆ ) + 1 + µB k +1 k z k +1 − x ⋆ k ≤ B k ( F ( x k ) − F ⋆ ) + 1 + µB k k z k − x ⋆ k , with B k +1 = L k +1 B k +1+ √ L k +1 B k +4 µL k +1 B k +12( L k +1 − µ ) . Proof.
First, { z k } are in dom h by construction—it is the output of the proxi-mal/projection step. Furthermore, we have 0 ≤ B k B k +1 ≤ B k +1 ≥ B k ≥ z = x ∈ dom h , all subsequent { y k } and { x k } arealso in dom h (as they are obtained from convex combinations of feasible points).The rest of the proof consists in a weighted sum of the following inequalities (whichare valid due to feasibility of the iterates): Nesterov Acceleration
Algorithm 24
A proximal accelerated gradient (general initialization of L k +1 ) Input: h ∈ F , ∞ with proximal operator available, f ∈ F µ,L ( dom h ), an initial point x ∈ dom h , and an initial estimate L > µ . Initialize z = x , A = 0, and some α > for k = 0 , . . . do Pick L k +1 ∈ ( µ, ∞ ){reasonable choices include L k +1 ∈ [ L , L k ].} loop set q k +1 = µ/L k +1 , B k +1 = L k +1 B k +1+ √ L k +1 B k +4 µL k +1 B k +12( L k +1 − µ ) set τ k = L k +1 ( B k +1 − B k )(1+ µB k ) L k +1 ( B k +1 +2 µB k B k +1 − µB k ) and δ k = L k +1 B k +1 − B k µB k +1 y k = x k + τ k ( z k − x k ) z k +1 = prox δ k h/L k +1 (cid:16) (1 − q k +1 δ k ) z k + q k +1 δ k y k − δ k L k +1 ∇ f ( y k ) (cid:17) x k +1 = A k A k +1 x k + (1 − A k A k +1 ) z k +1 if (4.20) holds then break {Iterates accepted; k will be incremented.} else L k +1 = αL k +1 {Iterates not accepted; recompute new L k +1 .} end if end loop end forOutput: An approximate solution x k +1 .• strong convexity of f between x ⋆ and y k with weight λ = B k +1 − B k f ( x ⋆ ) ≥ f ( y k ) + h∇ f ( y k ); x ⋆ − y k i + µ k x ⋆ − y k k , • convexity of f between x k and y k with weight λ = B k f ( x k ) ≥ f ( y k ) + h∇ f ( y k ); x k − y k i , • smoothness of f between y k and x k +1 ( descent lemma ) with weight λ = B k +1 f ( y k ) + h∇ f ( y k ); x k +1 − y k i + L k +1 k x k +1 − y k k ≥ f ( x k +1 ) , • convexity of h between x ⋆ and z k +1 with weight λ = B k +1 − B k h ( x ⋆ ) ≥ h ( z k +1 ) + h g h ( z k +1 ); x ⋆ − z k +1 i , with g h ( z k +1 ) ∈ ∂h ( z k +1 ) and z k +1 = (1 − qδ k ) z k + qδ k y k − δ k L k +1 ( ∇ f ( y k ) + g h ( z k +1 ))• convexity of h between x k and x k +1 with weight λ = B k h ( x k ) ≥ h ( x k +1 ) + h g h ( x k +1 ); x k − x k +1 i , with g h ( x k +1 ) ∈ ∂h ( x k +1 ) .D. Proximal Accelerated Gradient Without Monotone Backtracking • convexity of h between z k +1 and x k +1 with weight λ = B k +1 − B k h ( z k +1 ) ≥ h ( x k +1 ) + h g h ( x k +1 ); z k +1 − x k +1 i . We get the following inequality0 ≥ λ [ f ( y k ) − f ⋆ + h∇ f ( y k ); x ⋆ − y k i + µ k x ⋆ − y k k ]+ λ [ f ( y k ) − f ( x k ) + h∇ f ( y k ); x k − y k i ]+ λ [ f ( x k +1 ) − ( f ( y k ) + h∇ f ( y k ); x k +1 − y k i + L k +1 k x k +1 − y k k )]+ λ [ h ( z k +1 ) − h ( x ⋆ ) + h g h ( z k +1 ); x ⋆ − z k +1 i ]+ λ [ h ( x k +1 ) − h ( x k ) + h g h ( x k +1 ); x k − x k +1 i ]+ λ [ h ( x k +1 ) − h ( z k +1 ) + h g h ( x k +1 ); z k +1 − x k +1 i ] . Substituting the y k , z k +1 , and x k +1 using y k = x k + τ k ( z k − x k ) z k +1 = (1 − q k +1 δ k ) z k + q k +1 δ k y k − δ k L k +1 ( ∇ f ( y k ) + g h ( z k +1 )) x k +1 = B k B k +1 x k + (cid:18) − B k B k +1 (cid:19) z k +1 , some algebra allows obtaining the following reformulation B k +1 ( f ( x k +1 ) + h ( x k +1 ) − f ( x ⋆ ) − h ( x ⋆ )) + 1 + µB k +1 k z k +1 − x ⋆ k ≤ B k ( f ( x k ) + h ( x k ) − f ( x ⋆ ) − h ( x ⋆ )) + 1 + µB k k z k − x ⋆ k + ( B k − B k +1 ) (cid:16) L k +1 ( B k − B k +1 ) − B k +1 − µB k +1 (cid:17) B k +1 (1 + µB k +1 ) k∇ f ( y k ) + g h ( z k +1 ) k − B k ( B k +1 − B k )(1 + µB k )(1 + µB k +1 ) (cid:0) B k +1 + 2 µB k B k +1 − µB k (cid:1) µ k x k − z k k , and the desired inequality follows from picking B k +1 ≥ B k such that L k +1 ( B k − B k +1 ) − B k +1 − µB k +1 = 0 , yielding B k +1 ( f ( x k +1 ) + h ( x k +1 ) − f ( x ⋆ ) − h ( x ⋆ )) + 1 + µB k +1 k z k +1 − x ⋆ k ≤ B k ( f ( x k ) + h ( x k ) − f ( x ⋆ ) − h ( x ⋆ )) + 1 + B k µ k z k − x ⋆ k . 
The last corollary follows exactly from the same arguments as those used in Corol-lary 4.32, and provides the final bound for Algorithm 24. Nesterov Acceleration
Corollary 4.34.
Let h ∈ F , ∞ , f ∈ F µ,L ( dom h ) and x ⋆ ∈ argmin x { F ( x ) ≡ f ( x )+ h ( x ) } .For any N ∈ N , N ≥
1, and x ∈ R d , the output of Algorithm 24 satisfy F ( x N ) − F ⋆ ≤ min ( N , (cid:18) − r µℓ (cid:19) − N ) ℓ k x − x ⋆ k , with ℓ = max { αL, L } . Proof.
The proof follows the same arguments as those of Corollary 4.32, using thepotential from Theorem 4.33, using the fact the output of the algorithm satisfies (4.20).
Proximal Acceleration and Catalyst
In this chapter we present simple methods based on approximate proximal operationsthat produce accelerated gradient-based methods. This idea is exploited for examplein the Catalyst (Lin et al. , 2015; Lin et al. , 2017) and Accelerated Hybrid ProximalExtragradient frameworks (Monteiro and Svaiter, 2013). In essence, the idea is to de-velop (conceptual) accelerated proximal point algorithms, and to use classical iterativemethods to approximate the proximal point. In particular, these frameworks produceaccelerated gradient methods (in the same sense as Nesterov’s acceleration) when theapproximate proximal points are computed using linearly converging gradient-based op-timization methods.
We review acceleration from the perspective of proximal point algorithms (PPA). Thekey concept here, called proximal operation , dates back to the sixties, with the works ofMoreau (1962; 1965). Its introduction in optimization is attributed to Martinet (1970;1972) and was primarily motivated by its link with augmented Lagrangian techniques.In contrast with previous sections, where information about the functions to be min-imized was obtained through their gradients, the following pages deal with the casewhere information is gathered through a proximal operator , or an approximation of thatoperator.The proximal point algorithm, and its use for developing optimization schemes arenicely surveyed in (Parikh and Boyd, 2014). We aim to go in a slightly different directionhere and describe its use in an outer loop to obtain improved convergence guarantees,in the spirit of the Accelerated Hybrid Proximal Extragradient Method (Monteiro andSvaiter, 2013), and of Catalyst acceleration (Lin et al. , 2015; Lin et al. , 2017).
Proximal Acceleration and Catalyst
In this chapter, we focus on the problem of solving f ⋆ = min x ∈ R d f ( x ) , (5.1)where f is closed, proper, and convex (it has a non-empty, closed, and convex epigraph),which we denote by f ∈ F , ∞ , in line with Definition 4.1 from Chapter 4. We denoteby ∂f ( x ) the subdifferential of f at x ∈ R d , and by g f ( x ) ∈ ∂f ( x ) some element of thesubdifferential at x , irrespective of f being continuously differentiable or not. We aim tofind an ǫ -approximate solution x such that f ( x ) − f ⋆ ≤ ǫ .It is possible to develop optimized proximal methods in the spirit of optimized gra-dient methods presented in Chapter 4. That is, given a computational budget—in theproximal setting, this consists of a number of iterations, and a sequence of step sizes—,one can choose the algorithmic parameters to optimize worst-case performances. Theproximal equivalent to the optimized gradient method is Güler’s second method (Güler,1992, Section 6) (see discussions in Section 5.6). We do not spend time on this methodhere, and directly aim for methods designed from simple potential functions, in the samespirit as for Nesterov’s accelerated gradient methods from Chapter 4. Whereas the base method for minimizing a function using its gradient is gradient descent, x k +1 = x k − λg f ( x k ) , the base method for minimizing a function using its proximal oracle is the proximal pointalgorithm, x k +1 = prox λf ( x k ) , (5.2)where the proximal operator is given byprox λf ( x ) ≡ argmin y { Φ( y ; x ) ≡ λf ( y ) + 12 k y − x k } . The proximal point algorithm has a number of intuitive interpretations, with two ofthem particularly convenient for our purposes.• Optimality conditions of the proximal subproblem reveal that a proximal stepcorresponds to an implicit (sub)gradient method x k +1 = x k − λg f ( x k +1 ) . where g f ( x k +1 ) ∈ ∂f ( x k +1 ).• Using the proximal point algorithm is equivalent to apply gradient descent to theMoreau envelope of f , where the Moreau envelope, denoted F λ , is provided by F λ ( x ) = min y { f ( y ) + 12 λ k y − x k } . .2. Proximal Point Algorithm and Acceleration The Moreau envelope has the same set of optimal solutions as that of f , whileenjoying nice additional regularity properties (it is 1 /λ -smooth and convex). See,for example, (Lemaréchal and Sagastizábal, 1997).In general, proximal operations are expensive, sometimes nearly as expensive as mini-mizing the function itself. However, there are many cases, especially in the context ofcomposite optimization problems, where one can isolate parts of the objective for whichproximal operators actually have analytical solutions, see e.g. (Combettes and Pesquet,2011, Table 2) for a list of such examples.In the following sections, we start by analyzing such proximal point methods, thenshow at the end of this chapter how proximal methods can be used in outer loops ,where proximal subproblems are solved approximately using a classical iterative method(in inner loops). In particular, we describe how this combination produces acceleratednumerical schemes. Given the links between proximal operations and gradient methods, it is probably notsurprising that proximal point methods for convex optimization can be analyzed usingpotential functions similar to those used for gradient methods.However, there is a huge different between gradient and proximal steps, as the latercan be made arbitrarily “powerful”, by taking large step sizes. 
In other words, a singleproximal operation can produce an arbitrarily good approximate solution, by picking anarbitrarily large step size. This contrasts with gradient descent, for which large step sizesmake the method diverge. This fact is made clearer later by Corollary 5.2. However, thisnice property of proximal operators comes at the cost: we may not be able to computeefficiently the proximal step.As emphasized by the next theorem, proximal point methods for solving (5.1) canbe analyzed using similar potentials as those of gradient-based methods. We use φ k = A k ( f ( x k ) − f ( x ⋆ )) + 12 k x k − x ⋆ k , and show that φ k +1 ≤ φ k . As before, this type of reasoning can be used recursively A N ( f ( x N ) − f ( x ⋆ )) ≤ φ N ≤ φ N − ≤ . . . ≤ φ = A ( f ( x ) − f ( x ⋆ )) + 12 k x − x ⋆ k , reaching bounds of type f ( x N ) − f ⋆ ≤ A N k x − x ⋆ k = O ( A − N ), assuming A = 0.Therefore, the convergence rates are dictated by the growth rate of the scalar sequence { A k } , so the proofs are designed to increase A k as fast as possible. Theorem 5.1.
Let f ∈ F , ∞ . For any k ∈ N , A k , λ k ≥ x k , it holds that A k +1 ( f ( x k +1 ) − f ( x ⋆ )) + 12 k x k +1 − x ⋆ k ≤ A k ( f ( x k ) − f ( x ⋆ )) + 12 k x k − x ⋆ k , Proximal Acceleration and Catalyst with x k +1 = prox λ k f ( x k ) and A k +1 = A k + λ k . Proof.
We perform a weighted sum of the following valid inequalities originatingfrom our assumptions.• convexity between x k +1 and x ⋆ with weight λ k f ( x ⋆ ) ≥ f ( x k +1 ) + h g f ( x k +1 ); x ⋆ − x k +1 i , with some g f ( x k +1 ) ∈ ∂f ( x k +1 ),• convexity between x k +1 and x k with weight A k f ( x k ) ≥ f ( x k +1 ) + h g f ( x k +1 ); x k − x k +1 i , with the same g f ( x k +1 ) ∈ ∂f ( x k +1 ) as before.By performing a weighted sum of those two inequalities, with respective weights, weobtain the following valid inequality:0 ≥ λ k [ f ( x k +1 ) − f ( x ⋆ ) + h g f ( x k +1 ); x ⋆ − x k +1 i ]+ A k [ f ( x k +1 ) − f ( x k ) + h g f ( x k +1 ); x k − x k +1 i ] . By matching the expressions term by term and by substituting x k +1 = x k − λ k g f ( x k +1 ),one can easily check that the previous inequality can be rewritten exactly as( A k + λ k )( f ( x k +1 ) − f ( x ⋆ )) + 12 k x k +1 − x ⋆ k ≤ A k ( f ( x k ) − f ( x ⋆ )) + 12 k x k − x ⋆ k − λ k A k + λ k k g f ( x k +1 ) k . By omitting the last term of the right hand side (which is nonpositive), we reach thedesired statement.The first proof of the following worst-case guarantee is due to Güler (1991), anddirectly follows from the previous potential.
Corollary 5.2.
Let f ∈ F , ∞ , { λ i } i ≥ be a sequence of nonnegative step sizes, and { x i } i ≥ be the sequence of iterates from the corresponding proximal point algorithm (5.2). Forall k ∈ N , k ≥ f ( x k ) − f ⋆ ≤ k x − x ⋆ k P k − i =0 λ i . Proof.
Directly follows from the potential and the choice A = 0, resulting in A k = P k − i =0 λ i , along with f ( x k ) − f ( x ⋆ ) ≤ A k k x − x ⋆ k , and the claim follows. .3. Güler and Monteiro-Svaiter Acceleration Note again that we can make this bound arbitrarily good by simply increasing thevalue of λ k ’s. There is no contradiction here because the proximal oracle is massivelystronger than the usual gradient step, as previously discussed. However, solving evena single proximal step is usually (nearly) as hard as solving the original optimizationproblem, so the proximal method, as detailed here, is a purely conceptual algorithm.Note that the choice of a constant step size λ k = λ results in a f ( x N ) − f ⋆ = O ( N − )convergence, reminiscent of gradient descent. It turns out that, as for gradient-basedoptimization of smooth convex functions, it is possible to improve this result to O ( N − )by using information from previous iterations. This idea was proposed by Güler (1992).In the case of a constant step size λ k = λ , one possible way to obtain this improvementis to apply Nesterov’s method (or any other accelerated variant) to the Moreau envelopeof f . For varying step sizes, the corresponding bound has the form f ( x k ) − f ⋆ ≤ k x − x ⋆ k (cid:16)P k − i =0 √ λ i (cid:17) . In addition, Güler’s acceleration is actually robust to computation errors (as describedin the next sections), while allowing varying step size strategies.Those two key properties allow using Güler’s acceleration to design improved numer-ical optimization schemes:1. approximate proximal steps, for example by approximately solving the proximalsubproblems via iterative methods, can be used.2. Step sizes λ i ’s can be increased from iteration to iteration, allowing to reach arbi-trarily fast convergence rates (assuming of course that the proximal subproblemscan be solved efficiently).It is important to note that classical lower bounds for gradient-type methods do notapply here, as we use the much stronger, and more expensive, proximal oracle. It istherefore not a surprise that such techniques (i.e., increasing step sizes λ k from iterationto iteration) might beat the O ( N − ) bound obtained through Nesterov’s acceleration.Such increasing step size rules can, for example, be used when solving the proximalsubproblem via Newton’s method, as proposed by Monteiro and Svaiter (2013). In this section, we describe an accelerated version of the proximal point algorithm, pos-sibly involving inexact proximal evaluations. The method detailed below is a simplifiedversion of that of Monteiro and Svaiter (2013), sufficient for our purposes, for which weprovide a simple convergence proof. The method essentially boils down to that of Güler(1992) when exact proximal evaluations are used.Before proceeding, let us mention that there exists quite a few natural notions ofinexactness for proximal operations. In this chapter, we focus on approximately satisfying Proximal Acceleration and Catalyst first-order optimality conditions of the proximal problem x k +1 = argmin x { Φ( x ; y k ) ≡ f ( x ) + 12 λ k k x − y k k } . In other words, optimality conditions of the proximal subproblem is0 = λ k g f ( x k +1 ) + x k +1 − y k , for some g f ( x k +1 ) ∈ ∂f ( x k +1 ). In the following lines, we tolerate, instead, an error e k e k = λ k g f ( x k +1 ) + x k +1 − y k , and require k e k k to be small enough, for guaranteeing convergence—as even starting atan optimal point does not imply staying at it, without proper assumptions on e k . 
Onepossibility is to require k e k k to be small with respect to the distance between the startingpoint y k and the approximate solution to the proximal subproblem x k +1 . Formally, we usethe following definition for an approximate solution with relative inaccuracy 0 ≤ δ ≤ x k +1 ≈ δ prox λ k f ( y k ) ⇐⇒ k e k k ≤ δ k x k +1 − y k k with e k := x k +1 − y k + λ k g f ( x k +1 )for some g f ( x k +1 ) ∈ ∂f ( x k +1 ) (5.3)Intuitively, this notion tolerates relatively large errors when the solution of the proximalsubproblem is far from y k (meaning that y k is also far away from a minimum of f ), whileimposing relatively small errors when getting closer to a solution. On the other side, if y k is an optimal point for f , then so is x k +1 , as shown by the following proposition. Proposition 5.1.
Let y k ∈ argmin x f ( x ). For any δ ∈ [0 ,
1] and any x k +1 ≈ δ prox λ k f ( y k ),it holds that x k +1 ∈ argmin x f ( x ). Proof.
We only consider the case δ = 1, as without loss of generality x k +1 ≈ prox λ k f ( y k ) ⇒ x k +1 ≈ δ prox λ k f ( y k ) for any δ ∈ [0 , x k +1 , we have k x k +1 − y k + λ k g f ( x k +1 ) k ≤ k x k +1 − y k k ⇔ λ k h g f ( x k +1 ); x k +1 − y k i ≤ − λ k k g f ( x k +1 ) k , (5.4)for some g f ( x k +1 ) ∈ ∂f ( x k +1 ) (the second inequality follows from base algebraic manip-ulations of the first one). In addition, optimality of y k implies h g f ( x k +1 ); x k +1 − y k i = h g f ( x k +1 ) − g f ( y k ); x k +1 − y k i ≥ , with g f ( y k ) = 0 ∈ ∂f ( y k ), and where the second inequality follows from convexity of f (see e.g., Section 4.A). Therefore, condition (5.4) can be satisfied only when g f ( x k +1 ) = 0,meaning that x k +1 is a minimizer of f .Assuming now it is possible to find an approximate solution to the proximal operator,one can use Algorithm 25, originating from (Monteiro and Svaiter, 2013), to minimizethe convex function f . The parameter A k , a k in the algorithm were optimized for δ = 1for simplicity and can be slightly improved by exploiting the cases 0 ≤ δ < .3. Güler and Monteiro-Svaiter Acceleration Algorithm 25
An inexact accelerated proximal point method (Monteiro and Svaiter,2013)
Input:
A convex function f , an initial point x . Initialize z = x and A = 0. for k = 0 , . . . do Pick a k = λ k + √ λ k +4 A k λ k A k +1 = A k + a k . y k = A k A k + a k x k + a k A k + a k z k x k +1 ≈ δ prox λ k f ( y k ) (see Eq. (5.3), for some δ ∈ [0 , z k +1 = z k − a k g f ( x k +1 ) end forOutput: An approximate solution x k +1 . Theorem 5.3.
Let f ∈ F , ∞ . For any k ∈ N , A k , λ k ≥ x k , z k ∈ R d it holdsthat A k +1 ( f ( x k +1 ) − f ( x ⋆ )) + 12 k z k +1 − x ⋆ k ≤ A k ( f ( x k ) − f ( x ⋆ )) + 12 k z k − x ⋆ k , where x k +1 and z k +1 are generated by one iteration of Algorithm 25, and A k +1 = A k + a k . Proof.
We perform a weighted sum of the following valid inequalities, originatingfrom our assumptions.• Convexity between x k +1 and x ⋆ with weight A k +1 − A k f ( x ⋆ ) ≥ f ( x k +1 ) + h g f ( x k +1 ); x ⋆ − x k +1 i , for some g f ( x k +1 ) ∈ ∂f ( x k +1 ), which we also use below,• convexity between x k +1 and x k with weight A k f ( x k ) ≥ f ( x k +1 ) + h g f ( x k +1 ); x k − x k +1 i , • error magnitude with weight ( A k +1 ) / (2 λ k ) k e k k ≤ k x k +1 − y k k . By performing the weighted sum of those three inequalities, we obtain the following validinequality 0 ≥ ( A k +1 − A k ) [ f ( x k +1 ) − f ( x ⋆ ) + h g f ( x k +1 ); x ⋆ − x k +1 i ]+ A k [ f ( x k +1 ) − f ( x k ) + h g f ( x k +1 ); x k − x k +1 i ]+ A k +1 λ k h k e k k − k x k +1 − y k k i Proximal Acceleration and Catalyst
After substituting A k +1 = A k + a k , x k +1 = A k / ( A k + a k ) x k + a k / ( A k + a k ) z k − λ k g f ( x k +1 )+ e k , and z k +1 = z k − a k g f ( x k +1 ), one can easily check that the previous inequality canbe rewritten as( A k + a k )( f ( x k +1 ) − f ( x ⋆ )) + 12 k z k +1 − x ⋆ k ≤ A k ( f ( x k ) − f ( x ⋆ )) + 12 k z k − x ⋆ k − λ k ( a k + A k ) − a k k g f ( x k +1 ) k , either by comparing the expressions on a term by term basis, or by using an appropriate“complete the squares” strategy. We obtain the desired statement by enforcing λ k ( a k + A k ) − a k ≥
0, which allows us to neglect the last term of the right hand side (which is thennonpositive). Finally, since we already assumed a k ≥
0, requiring λ k ( a k + A k ) − a k ≥ ≤ a k ≤ λ k + q λ k + 4 A k λ k Corollary 5.4.
Let f ∈ F , ∞ , { λ i } i ≥ be a sequence of nonnegative step sizes, and { x i } i ≥ be the corresponding sequence of iterates from Algorithm 25. For all k ∈ N , k ≥ f ( x k ) − f ⋆ ≤ k x − x ⋆ k (cid:16)P k − i =0 √ λ i (cid:17) . Proof.
Using the potential from Theorem 5.3 with A = 0, we directly obtain f ( x k ) − f ⋆ ≤ k x − x ⋆ k A k . The desired result then follows from A k +1 = A k + a k = A k + λ k + q λ k + 4 A k λ k ≥ A k + λ k p A k λ k hence A k ≥ (cid:16)p A k − + p λ k − (cid:17) ≥ (cid:16)P k − i =0 √ λ i (cid:17) . In this section, we provide refined convergence results when the function to minimizeis µ -strongly convex (all results from previous sections are recovered by picking µ = 0).The algebra is slightly more technical, but the message and techniques are the same.While the proofs in the previous section can simply be seen as particular cases of theproofs presented below, we detail both versions separately to make the argument moreaccessible. .4. Exploiting Strong Convexity Proximal point algorithm under strong convexity
Let us begin by refining the results on the proximal point algorithm. The same modifi-cation to the potential function is used for incorporating acceleration in the sequel.
Theorem 5.5.
Let f be a closed proper and µ -strongly convex function. For any k ∈ N , A k , λ k ≥
0, any x k , and A k +1 = A k (1 + λ k µ ) + λ k , it holds that A k +1 ( f ( x k +1 ) − f ( x ⋆ )) + 1 + µA k +1 k x k +1 − x ⋆ k ≤ A k ( f ( x k ) − f ( x ⋆ )) + 1 + µA k k x k − x ⋆ k , with x k +1 = prox λ k f ( x k ). Proof.
We perform a weighted sum of the following valid inequalities, originatingfrom our assumptions.• Strong convexity between x k +1 and x ⋆ with weight A k +1 − A k f ( x ⋆ ) ≥ f ( x k +1 ) + h g f ( x k +1 ); x ⋆ − x k +1 i + µ k x ⋆ − x k +1 k , with some g f ( x k +1 ) ∈ ∂f ( x k +1 ) which we further use below,• convexity between x k +1 and x k with weight A k f ( x k ) ≥ f ( x k +1 ) + h g f ( x k +1 ); x k − x k +1 i . By performing the weighted sum of those two inequalities, we obtain the following validinequality0 ≥ ( A k +1 − A k ) (cid:20) f ( x k +1 ) − f ( x ⋆ ) + h g f ( x k +1 ); x ⋆ − x k +1 i + µ k x ⋆ − x k +1 k (cid:21) + A k [ f ( x k +1 ) − f ( x k ) + h g f ( x k +1 ); x k − x k +1 i ] . By matching the expressions term by term, after substituting x k +1 = x k − λ k g f ( x k +1 )and A k +1 = A k (1+ λ k µ )+ λ k , one can check that the previous inequality can be rewrittenas A k +1 ( f ( x k +1 ) − f ( x ⋆ )) + 1 + µA k +1 k x k +1 − x ⋆ k ≤ A k ( f ( x k ) − f ( x ⋆ )) + 1 + µA k k x k − x ⋆ k − λ k A k (2 + λ k µ ) + λ k k g f ( x k +1 ) k . Neglecting the last term of the right hand side (which is nonpositive), we reach thedesired statement. Proximal Acceleration and Catalyst
In order to get to the convergence speed guaranteed by the previous potential, weagain have to characterize the growth rate of A k , observing that A k +1 ≥ A k (1 + λ k µ ) = A k − λ k µ λ k µ , we conclude in the corollary below that accuracy ǫ is therefore achieved in O (cid:16) λµλµ log ǫ (cid:17) iterations of the proximal point algorithm when the step size λ k = λ is kept constant,which contrasts with the O (cid:16) λǫ (cid:17) in the non-strongly convex case. Corollary 5.6.
Let f be a closed proper and µ -strongly convex function, with µ ≥ { λ i } i ≥ be a sequence of nonnegative step sizes, and { x i } i ≥ be the corresponding se-quence of iterates from the proximal point algorithm. For all k ∈ N , k ≥ f ( x k ) − f ⋆ ≤ µ k x − x ⋆ k k − i =0 (1 + λ i µ ) − . Proof.
One can note that the recurrence for A k provided in Theorem 5.5 has a simplesolution A k = ([Π k − i =0 (1 + λ i µ )] − /µ . Together with f ( x k ) − f ⋆ ≤ k x − x ⋆ k A k , as provided by Theorem 5.5 with A = 0, we reach the desired statement.As a particular case, one can note that we recover the case µ = 0 from the previouscorollary, as A k → P k − i =0 λ i when µ goes to zero. Proximal acceleration and inexactness under strong convexity
To accelerate convergence while exploiting strong convexity, we upgrade Algorithm 25,and end up with Algorithm 26, whose analysis follows along the same lines as before. Inthis case, the algorithm was optimized for δ = √ λ k µ for simplicity; it can be slightlyimproved by exploiting the cases 0 ≤ δ < √ λ k µ . This method can be found in (Barré et al. , 2020a). Theorem 5.7.
Let f be a closed proper and µ -strongly convex function. For any k ∈ N , A k , λ k ≥
0, the iterates of Algorithm 26 satisfy A k +1 ( f ( x k +1 ) − f ( x ⋆ )) + 1 + µA k +1 k z k +1 − x ⋆ k ≤ A k ( f ( x k ) − f ( x ⋆ )) + 1 + µA k k z k − x ⋆ k . Proof.
We perform a weighted sum of the following valid inequalities, originatingfrom our assumptions. .4. Exploiting Strong Convexity
Algorithm 26
An inexact accelerated proximal point method
Input:
A ( µ -strongly) convex function f , an initial point x . Initialize z = x and A = 0. for k = 0 , . . . do Pick A k +1 = A k + λ k +2 A k λ k µ + √ A k λ k µ ( λ k µ +1)+4 A k λ k ( λ k µ +1)+ λ k y k = x k + ( A k +1 − A k )( A k µ +1) A k +1 +2 µA k A k +1 − µA k ( z k − x k ) x k +1 ≈ δ prox λ k f ( y k ) (see Eq. (5.3), for some δ ∈ [0 , √ λ k µ ]) z k +1 = z k + µ A k +1 − A k µA k +1 ( x k +1 − z k ) − A k +1 − A k µA k +1 g f ( x k +1 ) end forOutput: An approximate solution x k +1 .• Strong convexity between x k +1 and x ⋆ with weight A k +1 − A k f ( x ⋆ ) ≥ f ( x k +1 ) + h g f ( x k +1 ); x ⋆ − x k +1 i + µ k x ⋆ − x k +1 k , with some g f ( x k +1 ) ∈ ∂f ( x k +1 ), and where this particular subgradient is usedrepetitively below,• strong convexity between x k +1 and x k with weight A k f ( x k ) ≥ f ( x k +1 ) + h g f ( x k +1 ); x k − x k +1 i + µ k x k − x k +1 k , • error magnitude with weight A k +1 +2 µA k A k +1 − µA k λ k (1+ µA k +1 ) k e k k ≤ (1 + λ k µ ) k x k +1 − y k k . By performing a weighted sum of those three inequalities, with their respective weights,we obtain the following valid inequality0 ≥ ( A k +1 − A k ) (cid:20) f ( x k +1 ) − f ( x ⋆ ) + h g f ( x k +1 ); x ⋆ − x k +1 i + µ k x ⋆ − x k +1 k (cid:21) + A k (cid:20) f ( x k +1 ) − f ( x k ) + h g f ( x k +1 ); x k − x k +1 i + µ k x k − x k +1 k (cid:21) + A k +1 + 2 µA k A k +1 − µA k λ k (1 + µA k +1 ) [ k e k k − (1 + λ k µ ) k x k +1 − y k k ] . By matching the expressions term by term and by substituting the expressions of y k , x k +1 = y k − λ k g f ( x k +1 ) + e k , and z k +1 , one can check that the previous inequality can Proximal Acceleration and Catalyst be rewritten as (we advise not substituting A k +1 at this stage) A k +1 ( f ( x k +1 ) − f ( x ⋆ )) + 1 + µA k +1 k z k +1 − x ⋆ k ≤ A k ( f ( x k ) − f ( x ⋆ )) + 1 + µA k k z k − x ⋆ k − ( A k +1 + 2 µA k A k +1 − µA k ) λ k − ( A k +1 − A k ) µA k +1 k g f ( x k +1 ) k − A k ( A k +1 − A k )(1 + µA k ) A k +1 + 2 µA k A k +1 − µA k µ k x k − z k k The conclusion follows from A k +1 ≥ A k , allowing to discard the last term (which is thennonpositive). Positivity of the first residual term can be enforced by choosing A k +1 suchthat ( A k +1 + 2 µA k A k +1 − µA k ) λ k − ( A k +1 − A k ) ≥ , and the desired result is, in particular, achieved by picking the largest root of secondorder polynomial in A k +1 , so that A k +1 ≥ A k .In contrast with the previous proximal point algorithm, this accelerated version re-quires O (cid:18)q λµλµ log ǫ (cid:19) inexact proximal iterations to reach f ( x k ) − f ( x ⋆ ) ≤ ǫ , when using a constant step size λ k = λ . This follows from an inequality characterizing the growth rate of the sequence A k , written A k +1 ≥ A k (1 + λ k µ ) + A k q λ k µ (1 + λ k µ ) = A k − q λ k µ λ k µ . (5.5) Corollary 5.8.
Let $f \in \mathcal{F}_{\mu,\infty}$ with $\mu \geq 0$, let $\{\lambda_i\}_{i\geq 0}$ be a sequence of nonnegative step sizes, and let $\{x_i\}_{i\geq 0}$ be the corresponding sequence of iterates from Algorithm 26. For all $k \in \mathbb{N}$, $k \geq 1$,
\[ f(x_k) - f_\star \leq \prod_{i=1}^{k-1}\left(1 - \sqrt{\tfrac{\lambda_i\mu}{1+\lambda_i\mu}}\right)\frac{\|x_0 - x_\star\|^2}{2\lambda_0}. \]
Proof.
The proof follows from the same arguments as before, that is,
\[ f(x_k) - f_\star \leq \frac{\|x_0 - x_\star\|^2}{2A_k}, \quad \text{using} \quad A_k \geq \lambda_0 \prod_{i=1}^{k-1}\left(1 - \sqrt{\tfrac{\lambda_i\mu}{1+\lambda_i\mu}}\right)^{-1}, \]
where we used $A_0 = 0$, resulting in $A_1 = \lambda_0$, and proceeding with (5.5).

Before going to the next section, let us note that combining Corollary 5.4 with Corollary 5.8 shows
\[ f(x_k) - f_\star \leq \min\left\{ \frac{1}{2\lambda_0}\prod_{i=1}^{k-1}\left(1 - \sqrt{\tfrac{\lambda_i\mu}{1+\lambda_i\mu}}\right),\ \frac{2}{\big(\sum_{i=0}^{k-1}\sqrt{\lambda_i}\big)^2} \right\} \|x_0 - x_\star\|^2. \]
There exist many notions of inexactness for solving the proximal subproblems, giving rise to different types of guarantees, together with slightly different methods. In particular, we required the approximate solution to have a small gradient. Other notions include satisfying an inaccuracy criterion in terms of function values (Güler, 1992; Schmidt et al., 2011; Villa et al., 2013). Different choices might be favored, depending on the target application or on the target algorithm for solving inner problems. A fairly general framework was developed by Monteiro and Svaiter (2013) (where the error is controlled via a primal-dual gap on the proximal subproblem).

In what follows, we illustrate how to use proximal methods as meta-algorithms to improve convergence of simple gradient-based first-order methods. This idea can be extended by embedding basically any algorithm that can solve the proximal subproblem.
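To make the relative inexactness criterion concrete, the following Python sketch implements one approximate proximal step: it minimizes $\Phi(x) = f(x) + \|x - y\|^2/(2\lambda)$ by plain gradient descent and stops as soon as $\lambda\|\Phi'(w)\| \leq \delta\|w - y\|$, in the spirit of the relative error rule used later in this chapter (cf. (5.9)). This is only an illustrative sketch; the oracle names, the quadratic test function, and the choice $\delta = 1$ are ours.

```python
import numpy as np

def inexact_prox(f_grad, y, lam, L, delta=1.0, max_iter=10000):
    """Approximate prox_{lam*f}(y): gradient descent on
    Phi(x) = f(x) + ||x - y||^2 / (2*lam), stopped once the relative
    error rule  lam * ||Phi'(w)|| <= delta * ||w - y||  holds.
    f_grad is assumed to be the gradient oracle of an L-smooth convex f."""
    step = 1.0 / (L + 1.0 / lam)      # Phi is (L + 1/lam)-smooth
    w = y.copy()                      # warm start at the prox center
    for i in range(1, max_iter + 1):
        grad_phi = f_grad(w) + (w - y) / lam
        w = w - step * grad_phi
        grad_phi = f_grad(w) + (w - y) / lam
        if lam * np.linalg.norm(grad_phi) <= delta * np.linalg.norm(w - y):
            return w, i               # approximate prox point, inner cost
    return w, max_iter

# toy usage on the quadratic f(x) = 0.5 * x' A x (assumed example)
rng = np.random.default_rng(0)
A = rng.standard_normal((20, 20)); A = A.T @ A / 20.0
x_next, n_inner = inexact_prox(lambda x: A @ x, rng.standard_normal(20),
                               lam=1.0, L=float(np.linalg.eigvalsh(A)[-1]))
```

Any linearly converging method can replace the inner gradient loop, which is precisely the idea exploited by Catalyst in the next section.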
A popular application of inexact accelerated proximal point methods is "Catalyst" acceleration (Lin et al., 2015). For readability purposes, we do not present the general Catalyst framework, but rather a simple instance. Stochastic versions of this acceleration procedure were also developed, which we briefly summarize in Section 5.5.4. The idea is again to use a base first-order method to approximate the proximal subproblem up to the required accuracy. For now, let us assume we want to minimize an $L$-smooth convex function $f$, i.e., solve
\[ \underset{x}{\text{minimize}}\ f(x). \]
The corresponding proximal subproblem has the form
\[ \mathrm{prox}_{\lambda f}(y) = \mathop{\rm argmin}_x \left\{ f(x) + \frac{1}{2\lambda}\|x - y\|^2 \right\}, \quad (5.6) \]
and is therefore the minimization of an $(L + 1/\lambda)$-smooth and $1/\lambda$-strongly convex function. To solve such a problem, one can use a first-order method to approximate its solution.

Preliminaries
In what follows, we consider using a method $\mathcal{M}$ to solve the proximal subproblem (5.6). We assume this method is guaranteed to converge linearly on any smooth strongly convex problem with minimizer $w_\star$, satisfying
\[ \|w_k - w_\star\| \leq C_{\mathcal{M}}(1 - \tau_{\mathcal{M}})^k \|w_0 - w_\star\| \quad (5.7) \]
for some constants $C_{\mathcal{M}} \geq 0$ and $0 < \tau_{\mathcal{M}} \leq 1$. Note that we consider linear convergence in terms of $\|w_k - w_\star\|$ for convenience; other notions can be used instead, such as convergence in function values.

We distinguish the points $x_k$, $y_k$, and $z_k$, which are the iterates of the inexact accelerated proximal point algorithm (Algorithm 25), from the sequence of iterates $w_0, \ldots, w_k$, which are the iterates of $\mathcal{M}$, used to approximate step 5 of Algorithm 25. One can then apply Algorithm 25 to minimize $f$, while (approximately) solving the proximal subproblems with $\mathcal{M}$. We first define three iteration counters:

1. $N_{\mathrm{outer}}$, the number of iterations of the inexact accelerated proximal point method (Algorithm 25, or 26), which serves as the "outer loop" of the overall acceleration scheme.

2. $N_{\mathrm{inner}}(k)$, the number of iterations of method $\mathcal{M}$ for approximately solving the proximal subproblem at iteration $k$ of the outer-loop inexact proximal point method.

3. $N_{\mathrm{total}}$, the total number of iterations of method $\mathcal{M}$:
\[ N_{\mathrm{total}} = N_{\mathrm{useless}} + \sum_{k=0}^{N_{\mathrm{outer}}-1} N_{\mathrm{inner}}(k), \]
where $N_{\mathrm{useless}}$ is the number of iterations of $\mathcal{M}$ that did not allow finishing one additional outer iteration, so $N_{\mathrm{useless}} < N_{\mathrm{inner}}(N_{\mathrm{outer}})$ (i.e., the number of useless iterations of $\mathcal{M}$ is smaller than the number of such iterations that would have led to an additional outer iteration).

Overall complexity
As we detail in the sequel, assuming $\mathcal{M}$ indeed satisfies (5.7), the overall complexity of the combination of methods is then guaranteed to be
\[ f(x_{N_{\mathrm{outer}}}) - f_\star = O\!\left(N_{\mathrm{total}}^{-2}\right), \]
where $x_{N_{\mathrm{outer}}}$ is the iterate produced after $N_{\mathrm{outer}}$ iterations of the inexact accelerated proximal point method, or equivalently the iterate produced after a total number of iterations $N_{\mathrm{total}}$ of the method $\mathcal{M}$. More precisely, it is guaranteed to satisfy
\[ f(x_{N_{\mathrm{outer}}}) - f(x_\star) \leq \frac{2\|x_0 - x_\star\|^2}{\lambda N_{\mathrm{outer}}^2} \leq \frac{2\|x_0 - x_\star\|^2}{\lambda \lfloor B_{\mathcal{M},\lambda}^{-1} N_{\mathrm{total}}\rfloor^2}, \]
where we used $N_{\mathrm{inner}}(k) \leq B_{\mathcal{M},\lambda}$ for all $k \geq$
0, and hence ⌊ N total B M ,λ ⌋ ≤ N outer , where theconstant B M ,λ depends solely on the choice of λ and on properties of M . It representsthe computational burden of approximately solving one proximal subproblem with M ,and satisfies B M ,λ ≤ log( C M ( λL + 2)) τ M + 1 . .5. Application: Catalyst Acceleration Let us provide a few simple examples based on gradient methods for smooth stronglyconvex minimization. For all those methods, the embedding within the inexact proximalframework yields N total = O (cid:18) B M ,λ q L k x − x ⋆ k ǫ (cid:19) (5.8)iteration complexity in terms of total number of calls to M to find a point satisfying f ( x N outer ) − f ( x ⋆ ) ≤ ǫ . We can make this bound a bit more explicit depending on thechoice of M .• Suppose we solve the proximal subproblem with M a regular gradient method usingstep size 1 / ( L + 1 /λ ). The method is known to converge linearly with C M = 1 and τ M = λL (inverse condition ratio for the proximal subproblem) and produces theaccelerated rate in (5.8). Note that directly applying the gradient method to theproblem of minimizing f yields a much worst iteration complexity O (cid:16) L k x − x ⋆ k ǫ (cid:17) .• Let M be a gradient method featured with an exact line search. It is guaranteed toconverge linearly with C M = λL + 1 (condition ratio of the proximal subproblem)and τ M = λL . The iteration complexity of applying this steepest descent schemedirectly to f is similar to that of vanilla gradient descent. One can also choose λ in order not to have a too large B M ,λ , for example λ = 1 /L .• Let M be an accelerated gradient method specifically tailored for smooth stronglyconvex optimization, for example Nesterov’s method with constant momentum, seeAlgorithm 14. It is guaranteed to converge linearly with C M = λL + 1 and τ M = q λL . Although there is no working guarantee for this method on the originalminimization problem, if f is not strongly convex, it can still be used for minimizing f through the inexact proximal point framework, as proximal subproblems arestrongly convex.As a conclusion, inexact accelerated proximal schemes produce accelerated rates forvanilla optimization methods that are linearly converging in the smooth strongly convexcase. This idea of embedding a simple first-order method within an inexact acceleratedscheme can be applied to a large panel of settings, including obtaining acceleration instrongly convex (see below), and stochastic settings. However, one should note thatpractical tuning of the corresponding numerical schemes (and particularly of the stepsize parameters) critically affects overall performance, as discussed in e.g. (Lin et al. ,2017) and makes effective implementation somewhat tricky. The analysis of non-convexsettings is beyond the scope of this chapter, but examples of such results can be foundin e.g. (Paquette et al. , 2018). Recall that function value accuracies, e.g. in Corollary 5.4 are expressed in terms ofouter loop iterations. Therefore, in order to complete the analysis, we need to answer Proximal Acceleration and Catalyst the following question: given total budget N total of inner iterations of method M , howmany iterations of Algorithm 25, N outer , will we perform in the ideal strategy (in otherwords, what is B M ,λ )? To answer this question, we start by analyzing the computationalcost of solving a single proximal subproblem through M . Computational cost of inner problems
LetΦ k ( x ) = f ( x ) + 12 λ k x − y k k be the objective of the proximal subproblem we aim to solve at iteration k (line 5 ofAlgorithm 25) centered at y k . By construction, Φ k ( x ) is ( L + 1 /λ )-smooth and 1 /λ -strongly convex. Also denote by w = y k our (warm-started) initial iterate, and by w , w , . . . , w N inner ( k ) the iterates of M for solving min x Φ k ( x ).We need to compute an upper bound on the number of iterations N inner ( k ) requiredto satisfy the error criterion (5.3): k e N inner ( k ) k = λ k Φ ′ k ( w N inner ( k ) ) k ≤ k w N inner ( k ) − w k , (5.9)where we denote by N inner ( k ) = inf { i : k Φ ′ k ( w i ) k ≤ /λ k w i − w k} the index of the firstiteration such that (5.9) is satisfied, which is precisely the quantity we want to upperbound. We start by the following observations.• By ( L + 1 /λ )-smoothness of Φ k k Φ ′ k ( w i ) k ≤ ( L + 1 /λ ) k w i − w ⋆ (Φ k ) k , (5.10)where w ⋆ (Φ k ) is the minimizer of Φ k .• A triangle inequality on k w − w ⋆ (Φ k ) k implies k w − w ⋆ (Φ k ) k − k w i − w ⋆ (Φ k ) k ≤ k w − w i k . (5.11)Hence (5.9) is satisfied if the right hand side of (5.10) is smaller than the left hand sideof (5.11) divided by λ , and hence, for any i for which we can prove( L + 1 /λ ) k w i − w ⋆ (Φ k ) k ≤ /λ ( k w − w ⋆ (Φ k ) k − k w ⋆ (Φ k ) − w i k ) . we get N inner ( k ) ≤ i . Rephrasing this inequality leads to k w i − w ⋆ (Φ k ) k ≤ λL + 2 k w − w ⋆ (Φ k ) k . Therefore, by assumption on M , (5.9) is guaranteed to hold as soon as C M (1 − τ M ) i ≤ λL + 2 , hence for any i satisfying i ≥ (cid:24) log ( C M ( λL + 2))log (1 / (1 − τ M )) (cid:25) . .5. Application: Catalyst Acceleration We conclude that (5.9) is satisfied before this number of iterations is achieved, hence N inner ( k ) ≤ (cid:24) log ( C M ( λL + 2))log (1 / (1 − τ M )) (cid:25) ≤ log ( C M ( λL + 2))log (1 / (1 − τ M )) + 1 . Given that the right hand side does not depend on k , we use the notation B M ,λ := log ( C M ( λL + 2))log (1 / (1 − τ M )) + 1as our upper bound on the iteration cost of solving the proximal subproblem via M . Global complexity bound
We showed that the number of iterations in the inner loop is bounded above by a constant that depends on the specific choice of the regularization parameter and on the method $\mathcal{M}$; in other words, $N_{\mathrm{inner}}(k) \leq B_{\mathcal{M},\lambda}$. Writing $N_{\mathrm{total}}$ for the total number of calls to the gradient of $f$, $N_{\mathrm{outer}}$ for the number of iterations performed by Algorithm 25, and $N_{\mathrm{useless}}$ for the number of iterations of $\mathcal{M}$ that did not result in an additional outer iteration, we conclude that
\[ N_{\mathrm{total}} = N_{\mathrm{useless}} + \sum_{k=0}^{N_{\mathrm{outer}}-1} N_{\mathrm{inner}}(k) < (N_{\mathrm{outer}} + 1)\, B_{\mathcal{M},\lambda}, \]
hence $N_{\mathrm{outer}} \geq \lfloor B_{\mathcal{M},\lambda}^{-1} N_{\mathrm{total}}\rfloor$, as $N_{\mathrm{useless}} < B_{\mathcal{M},\lambda}$ (the number of useless iterations is smaller than the number of iterations that would have led to an additional outer iteration). The conclusion follows from Corollary 5.4:
\[ f(x_{N_{\mathrm{outer}}}) - f(x_\star) \leq \frac{2\|x_0 - x_\star\|^2}{\lambda N_{\mathrm{outer}}^2} \leq \frac{2\|x_0 - x_\star\|^2}{\lambda \lfloor B_{\mathcal{M},\lambda}^{-1} N_{\mathrm{total}}\rfloor^2}. \]
In other words, given a target accuracy $\epsilon$, the iteration complexity written in terms of the total number of approximate proximal minimizations in Algorithm 25 is $O\big(\sqrt{\|x_0 - x_\star\|^2/(\lambda\epsilon)}\big)$, and the total iteration complexity for solving the problem using $\mathcal{M}$ in the inner loops is simply the same bound multiplied by the cost of solving a single proximal subproblem, namely $O\big(B_{\mathcal{M},\lambda}\sqrt{\|x_0 - x_\star\|^2/(\lambda\epsilon)}\big)$.

The previous analysis holds for the convex (but not necessarily strongly convex) case; the iteration complexity of solving the inner problems remains valid in the strongly convex case, and the expression for $B_{\mathcal{M},\lambda}$ can only be slightly improved, by taking into account the better strong convexity parameter $\mu + 1/\lambda$ and the possibly larger acceptable error magnitude of a factor $\sqrt{\lambda\mu}$ in Algorithm 26. Therefore, the total number of iterations of Algorithm 26 embedded with $\mathcal{M}$ remains bounded in a similar fashion, the overall error decreases as
\[ \left(1 - \sqrt{\tfrac{\lambda\mu}{1+\lambda\mu}}\right)^{\lfloor N_{\mathrm{total}}/B_{\mathcal{M},\lambda}\rfloor}, \]
and the iteration complexity is therefore of order
\[ O\!\left( B_{\mathcal{M},\lambda}\,\sqrt{\tfrac{1+\lambda\mu}{\lambda\mu}}\,\log\tfrac{1}{\epsilon} \right). \quad (5.12) \]
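For concreteness, the following Python sketch evaluates these quantities numerically for a specific inner method: it computes $B_{\mathcal{M},\lambda}$ and the resulting outer and total iteration estimates when $\mathcal{M}$ is plain gradient descent on the proximal subproblem. The choice $C_{\mathcal{M}} = 1$ and $\tau_{\mathcal{M}} = 1/(\lambda L + 1)$ used below (the inverse condition number of the $(L+1/\lambda)$-smooth, $1/\lambda$-strongly convex subproblem) is our assumption for the example, as are all numerical values.

```python
import math

def B_inner(C_M, tau_M, lam, L):
    """Per-subproblem cost bound of this section:
    B_{M,lam} = log(C_M * (lam*L + 2)) / log(1 / (1 - tau_M)) + 1."""
    return math.log(C_M * (lam * L + 2)) / math.log(1.0 / (1.0 - tau_M)) + 1.0

def catalyst_budget(L, lam, R2, eps, C_M, tau_M):
    """Outer iterations needed for f(x) - f* <= eps (Corollary 5.4 with a
    constant step size lam), and the corresponding total inner budget."""
    B = B_inner(C_M, tau_M, lam, L)
    n_outer = math.ceil(math.sqrt(2.0 * R2 / (lam * eps)))
    return B, n_outer, math.ceil(B * n_outer)

# inner solver M = gradient descent on the subproblem (assumed constants)
L, lam, R2, eps = 1e2, 1e-2, 1.0, 1e-6     # R2 stands for ||x_0 - x*||^2
tau_M = 1.0 / (lam * L + 1.0)
print(catalyst_budget(L, lam, R2, eps, C_M=1.0, tau_M=tau_M))
```

Varying $\lambda$ in this sketch reproduces the trade-off discussed next: a larger $\lambda$ reduces the number of outer iterations but increases the cost of each proximal subproblem.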
It is then natural to choose the value of $\lambda$ by optimizing the overall iteration complexity of Algorithm 26 combined with $\mathcal{M}$. One way to proceed is to optimize
\[ \sqrt{\tfrac{1+\lambda\mu}{\lambda\mu}}\;\frac{1}{\tau_{\mathcal{M}}}, \]
essentially neglecting the factor $\log(C_{\mathcal{M}}(\lambda L + 2))$ in the complexity estimate (5.12). Here are a few examples:

• Gradient method with suboptimal tuning (e.g., when using backtracking or line-search techniques): $\tau_{\mathcal{M}} = \frac{\mu\lambda+1}{L\lambda+1}$. Optimizing the ratio leads to the choice $\lambda = \frac{1}{L-2\mu}$ and the ratio is equal to $2\sqrt{L/\mu - 1}$. Assuming $C_{\mathcal{M}} = 1$ (which is the case for the standard step size $1/L$), the overall iteration complexity is then $O\!\left(\sqrt{L/\mu}\,\log\frac{1}{\epsilon}\right)$, where we neglected the factor $\log\frac{2 - 2\mu/L}{1 - 2\mu/L} \approx \log 2$ when $L/\mu$ is large enough.

• Gradient method with optimal tuning: $\tau_{\mathcal{M}} = \frac{2(\mu\lambda+1)}{L\lambda+\mu\lambda+2}$. The resulting choice is $\lambda = \frac{2}{L-3\mu}$ and the ratio is $\sqrt{2}\sqrt{L/\mu - 1}$, arriving at the same $O\!\left(\sqrt{L/\mu}\,\log\frac{1}{\epsilon}\right)$.

Similar results hold for stochastic methods as well, assuming convergence of $\mathcal{M}$ in expectation instead of (5.7), for example in the form $\mathbb{E}\|w_k - w_\star\| \leq C_{\mathcal{M}}(1-\tau_{\mathcal{M}})^k\|w_0 - w_\star\|$. Overall, the idea remains the same:

1. Use the inexact accelerated proximal point method (Algorithm 25 or 26) as if $\mathcal{M}$ were deterministic.

2. Use the stochastic method $\mathcal{M}$ to obtain points satisfying the accuracy requirement.

Dealing with the computational burden of solving the inner problem is a bit more technical, but the overall analysis remains similar. One can bound the expected number of iterations for solving the inner problem, $\mathbb{E}[N_{\mathrm{inner}}(k)]$, by some constant $B^{(\mathrm{stoch})}_{\mathcal{M},\lambda}$ of the form (details below)
\[ B^{(\mathrm{stoch})}_{\mathcal{M},\lambda} := \frac{\log\big(C_{\mathcal{M}}(\lambda L + 2)\big)}{\log\big(1/(1-\tau_{\mathcal{M}})\big)} + 2, \]
which is simply $B^{(\mathrm{stoch})}_{\mathcal{M},\lambda} = B_{\mathcal{M},\lambda} + 1$. A simple argument to obtain this bound uses Markov's inequality as follows:
\[ \mathbb{P}\big(N_{\mathrm{inner}}(k) > i\big) \leq \mathbb{P}\!\left( \|w_i - w_\star(\Phi_k)\| > \tfrac{1}{\lambda L + 2}\|w_0 - w_\star(\Phi_k)\| \right) \leq \frac{\mathbb{E}\big[\|w_i - w_\star(\Phi_k)\|\big]}{\tfrac{1}{\lambda L + 2}\|w_0 - w_\star(\Phi_k)\|} \leq C(1-\tau)^i(\lambda L + 2), \]
where the second inequality is Markov's. We then use a refined version of this bound,
\[ \mathbb{P}\big(N_{\mathrm{inner}}(k) > i\big) \leq \min\big\{1,\ (\lambda L + 2)\,C(1-\tau)^i\big\}, \]
and, in order to bound $\mathbb{E}[N_{\mathrm{inner}}(k)]$, we proceed as
\[ \mathbb{E}[N_{\mathrm{inner}}(k)] = \sum_{t=1}^{\infty}\mathbb{P}\big(N_{\mathrm{inner}}(k) \geq t\big) \leq \int_0^{N_0} dt + C(\lambda L + 2)\int_{N_0}^{\infty}(1-\tau)^t\, dt, \]
where $N_0$ is such that $1 = C(\lambda L + 2)(1-\tau)^{N_0}$; a direct computation then yields $\mathbb{E}[N_{\mathrm{inner}}(k)] \leq B^{(\mathrm{stoch})}_{\mathcal{M},\lambda}$.

The overall expected iteration complexity is that of the inexact accelerated proximal point method multiplied by the expected computational burden of solving the proximal subproblems, $B^{(\mathrm{stoch})}_{\mathcal{M},\lambda}$. That is, the expected iteration complexity becomes
\[ O\!\left( B^{(\mathrm{stoch})}_{\mathcal{M},\lambda}\,\sqrt{\tfrac{\|x_0 - x_\star\|^2}{\lambda\epsilon}} \right) \]
in the smooth convex setting, and
\[ O\!\left( B^{(\mathrm{stoch})}_{\mathcal{M},\lambda}\,\sqrt{\tfrac{1+\lambda\mu}{\lambda\mu}}\,\log\tfrac{1}{\epsilon} \right) \]
in the smooth strongly convex setting. The main argument of this section, namely the use of Markov's inequality, was adapted from Lin et al. (2017, Appendix B.4) (merged with those of the deterministic case above).

In the optimization literature, the proximal operation is an essential algorithmic primitive at the heart of many practical optimization methods. Proximal point algorithms are also largely motivated by the fact that they offer a nice framework for obtaining "meta" (or high-level) algorithms. Among others, they naturally appear in augmented Lagrangian and splitting-based numerical schemes. We refer the reader to the very nice surveys in (Parikh and Boyd, 2014; Ryu and Boyd, 2016) for more details.
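The bound on the expected inner cost can be illustrated numerically. The short simulation below uses an artificial stochastic inner method whose distance to the subproblem minimizer contracts in expectation at rate $(1-\tau)$, counts the first iteration at which the target contraction by $1/(\lambda L + 2)$ is reached, and compares the empirical average with $B^{(\mathrm{stoch})}_{\mathcal{M},\lambda}$; the random model and all constants are ours and only serve as an illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
lamL, C_M, tau = 1.0, 1.0, 0.05          # lam * L, C_M, tau_M (toy values)
B_stoch = np.log(C_M * (lamL + 2)) / np.log(1 / (1 - tau)) + 2

def inner_cost(max_iter=10000):
    """First index i with dist_i <= dist_0 / (lamL + 2) for a toy stochastic
    method with E[dist_i] = (1 - tau)^i * dist_0 (multiplicative noise)."""
    dist0 = dist = 1.0
    for i in range(1, max_iter + 1):
        dist *= (1 - tau) * rng.uniform(0.5, 1.5)   # mean factor (1 - tau)
        if dist <= dist0 / (lamL + 2):
            return i
    return max_iter

costs = [inner_cost() for _ in range(2000)]
print(f"empirical E[N_inner] = {np.mean(costs):.1f}  vs  bound = {B_stoch:.1f}")
```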
Proximal point algorithms, accelerated and inexact variants.
Proximal point algorithms have a long history, dating back to the works of Moreau (1962; 1965), and brought to the optimization community by Martinet (1970; 1972). Early interest in proximal methods was motivated by their connections to augmented Lagrangian techniques (Rockafellar, 1973; Rockafellar, 1976; Iusem, 1999); see also the nice tutorial by Eckstein and Silva (2013). Among the many other successes and uses of proximal operations, one can cite the many splitting techniques (Lions and Mercier, 1979; Eckstein, 1989), for which nice surveys already exist (Boyd et al., 2011; Combettes and Pesquet, 2011; Eckstein and Yao, 2012). In this context, inexact proximal operations were already introduced by Rockafellar (1976) and combined with acceleration much later by Güler (1992).
Hybrid proximal extragradient (HPE) framework.
Whereas "Catalyst" acceleration is based on the idea of solving the proximal subproblem via a first-order method, the (related) hybrid proximal extragradient framework was also used together with a Newton scheme in (Monteiro and Svaiter, 2013). Furthermore, the accelerated hybrid proximal extragradient framework allows using an increasing sequence of step sizes, leading to faster rates than those obtained via vanilla first-order methods (i.e., using an increasing sequence of $\{\lambda_i\}_i$, $\sum_{i=1}^N\sqrt{\lambda_i}$ might grow much faster than $N$). The HPE framework was initiated by Solodov and Svaiter (1999; 1999; 2000; 2001), before being embedded with acceleration techniques.

Catalyst.
The variant presented in this chapter was chosen for simplicity of exposition and is largely inspired by recent works on the topic in (Lin et al., 2017; Ivanova et al., 2019), along with (Monteiro and Svaiter, 2013). Efficient implementations of Catalyst can be found in the Cyanure package by Mairal (2019). In particular, most efficient practical implementations of Catalyst appear to rely on an absolute inaccuracy criterion for the inexact proximal operation, instead of a relative (or multiplicative) one, as used in this chapter. However, we chose the relative error model as we believe it allows a slightly simpler exposition, while relying on essentially the same techniques. Catalyst was originally proposed by Lin et al. (2015) as a generic tool for obtaining accelerated methods. Among others, it allowed accelerating stochastic methods such as SVRG (Johnson and Zhang, 2013), SAGA (Defazio et al., 2014a), MISO (Mairal, 2015), and Finito (Defazio et al., 2014b) before direct acceleration techniques became known for them (Allen-Zhu, 2017; Zhou et al., 2018; Zhou et al., 2019).
Higher-order proximal subproblems.
Higher-order proximal subproblems of the form
\[ \min_x \left\{ f(x) + \frac{1}{\lambda(p+1)}\|x - x_k\|^{p+1} \right\} \quad (5.13) \]
were used by (Nesterov, 2020a; Nesterov, 2020b) as a new primitive for designing optimization schemes. Those subproblems can also be solved approximately (via $p$th-order tensor methods (Nesterov, 2019)) while keeping good convergence guarantees.

Optimized proximal point methods.
It is possible to develop optimized proximal methods in the spirit of optimized gradient methods. That is, given a computational budget—in the proximal setting, this consists of a number of iterations $N$ and a sequence of step sizes $\{\lambda_i\}_{1\leq i\leq N}$—one can choose algorithmic parameters to optimize the worst-case performance of a method of the type
\[ x_{k+1} = x_0 - \sum_{i=1}^{k}\beta_i\, g_f(x_i) - \lambda_{k+1}\, g_f(x_{k+1}) \]
with respect to the $\beta_i$'s. The proximal equivalent of the optimized gradient method is Güler's second method (Güler, 1992, Section 6), which was obtained as an optimized proximal point method in (Barré et al., 2020a). Alternatively, Güler's second method (Güler, 1992, Section 6) can be obtained by applying the optimized gradient method (without its last iteration trick) to the Moreau envelope of the nonsmooth convex function $f$. More precisely, denoting by $\tilde{f}$ the Moreau envelope of $f$, one can apply the optimized gradient method without the last iteration trick to $\tilde{f}$, as
\[ f(x_k) - f_\star = \tilde{f}(x_k) - \tilde{f}_\star - \tfrac{1}{2L}\|g_{\tilde{f}}(x_k)\|^2, \]
which precisely corresponds to the first term of the potential of the optimized gradient method (see Equation 4.9). In the more general setting of monotone inclusions, one can obtain alternate optimized proximal point methods for different criteria, as in (Kim, 2019; Lieder, 2020).

Proofs of this Chapter.
Proofs of the potential inequalities of this chapter were obtained through the performance estimation methodology, introduced by Drori and Teboulle (2014), and specialized for studying inexact proximal operations by Barré et al. (2020a). More details can be found in Section 4.8, §"On Obtaining Proofs of this Chapter". In particular, for reproducibility purposes, we provide codes for symbolically verifying the algebraic reformulations of this section at
https://github.com/AdrienTaylor/AccelerationMonograph
together with those of Chapter 4.
Restart Schemes
First-order methods typically exhibit sublinear convergence, whose rate varies with gradient smoothness. Polynomial upper complexity bounds are typically convex functions of the number of iterations, so these methods converge faster at the beginning while convergence tails off as iterations progress. This suggests that periodically restarting first-order methods, i.e. simply running more "early" iterations, could accelerate convergence. We illustrate this in Figure 6.1.
Figure 6.1: Left: Sublinear convergence plot without restart. Right: Sublinear convergence plot with restart.
Beyond this graphical argument, all accelerated methods have memory and look back at least one step to compute the next iterate. They iteratively form a model for the function around the optimum, and restarting allows this model to be periodically refreshed, discarding outdated information as the algorithm converges towards the optimum. While the benefits of restart are immediately apparent in this plot, restart schemes raise several important questions. How many iterations should we run between restarts? What is the best complexity bound we can hope for using this scheme? What regularity properties of the problem drive the performance of restart schemes? Fortunately here, all these questions have an explicit answer stemming from a simple and intuitive argument. We will see that restart schemes are also adaptive to unknown regularity constants and often reach near optimal convergence rates without observing these parameters. We begin by illustrating this on the problem of minimizing a strongly convex function using the fixed step gradient method.
We illustrate the main argument of this chapter when minimizing strongly convex functions using fixed step gradient descent. Suppose we seek to solve the following minimization problem
\[ \text{minimize } f(x) \quad (6.1) \]
in the variable $x \in \mathbb{R}^n$. Suppose that the gradient of $f$ is Lipschitz continuous with constant $L$ with respect to the Euclidean norm, i.e.
\[ \|\nabla f(y) - \nabla f(x)\| \leq L\|y - x\|, \quad \text{for all } x, y \in \mathbb{R}^n. \quad (6.2) \]
We can use the fixed step gradient method to solve problem (6.1) as in Algorithm 27 below.
Gradient Method
Input:
A smooth convex function $f$, an initial point $x_0$.
for $i = 0, \ldots, k-1$ do
  $x_{i+1} := x_i - \frac{1}{L}\nabla f(x_i)$
end for
Output: An approximate solution $x_k$.

The smoothness assumption in (6.2) ensures the following complexity bound
\[ f(x_k) - f_\star \leq \frac{2L\|x_0 - x_\star\|^2}{k+4} \quad (6.3) \]
after $k$ iterations (see Chapter 4 for a complete discussion). Assume now that $f$ is also strongly convex with parameter $\mu$ with respect to the Euclidean norm. Strong convexity means that $f$ satisfies
\[ \frac{\mu}{2}\|x - x_\star\|^2 \leq f(x) - f_\star \quad (6.4) \]
where $x_\star$ is an optimal solution to problem (6.1) and $f_\star$ the corresponding optimal objective value. Let us write $\mathcal{A}(x_0, k)$ the output of $k$ iterations of Algorithm 27 started
Algorithm 28
Restart Scheme
Input:
A smooth convex function $f$, an initial point $x_0$ and an inner optimization algorithm $\mathcal{A}(x, k)$.
for $i = 0, \ldots, N-1$ do
  Obtain $x_{i+1}$ by running $k_i$ iterations of the gradient method, starting at $x_i$, i.e. $x_{i+1} := \mathcal{A}(x_i, k_i)$
end for
Output: $x_N$

at $x_0$, and suppose that we periodically restart the gradient method according to the following scheme. Combining the strong convexity bound in (6.4) with the complexity bound in (6.3) yields
\[ f(x_{i+1}) - f_\star \leq \frac{2L\|x_i - x_\star\|^2}{k+4} \leq \frac{4L}{\mu(k+4)}\big(f(x_i) - f_\star\big) \quad (6.5) \]
after an iteration of the restart scheme in Algorithm 28 in which we run $k$ (inner) iterations of the gradient method in Algorithm 27. This means that if we set $k_i = k = 8L/\mu$ then
\[ f(x_N) - f_\star \leq \left(\tfrac{1}{2}\right)^N \big(f(x_0) - f_\star\big) \]
after $N$ iterations of the restart scheme in Algorithm 28. Therefore, running a total of $T = Nk$ gradient steps, we can rewrite the complexity bound in terms of this total number of gradient oracle calls (or inner iterations) as
\[ f(x_T) - f_\star \leq \left(\tfrac{1}{2}\right)^{\frac{\mu T}{8L}} \big(f(x_0) - f_\star\big) \quad (6.6) \]
which proves linear convergence in the strongly convex case.

Of course, the basic gradient method with fixed step size in Algorithm 27 has no memory, so restarting it has no impact on iterations or numerical performance. Invoking the restart scheme in Algorithm 28 simply allows us to produce a better complexity bound in the strongly convex case. Without information on the strong convexity parameter, the classical bound yields sublinear convergence, while the restart method converges linearly.

Crucially here, the argument in (6.5) can be significantly generalized to improve the convergence rate of several types of first-order methods. In fact, as we will see below, a local bound on the growth rate of the function akin to strong convexity holds almost generically, albeit with a different exponent than in (6.4).
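As a concrete illustration, the Python sketch below implements Algorithm 27 and the restart wrapper of Algorithm 28 with the constant schedule $k_i = 8L/\mu$ used above, on a toy quadratic. As noted above, restarting the memoryless fixed-step gradient method does not change its iterates; the wrapper only materializes the bookkeeping behind the improved bound (6.6), and becomes genuinely useful once the inner method has momentum. The test problem and all numerical values are ours.

```python
import numpy as np

def gradient_method(f_grad, x0, L, k):
    """Algorithm 27: k fixed-step gradient iterations with step 1/L."""
    x = x0.copy()
    for _ in range(k):
        x = x - f_grad(x) / L
    return x

def restart_scheme(f_grad, x0, L, mu, n_restarts):
    """Algorithm 28 with the constant schedule k_i = 8 L / mu of Section 6.1."""
    k = int(np.ceil(8 * L / mu))
    x = x0.copy()
    for _ in range(n_restarts):
        x = gradient_method(f_grad, x, L, k)   # x_{i+1} = A(x_i, k)
    return x

# toy strongly convex quadratic f(x) = 0.5 * x' diag(d) x, f* = 0 (assumed)
rng = np.random.default_rng(0)
d = np.linspace(1.0, 50.0, 30)                 # eigenvalues: mu = 1, L = 50
f_grad = lambda x: d * x
x_hat = restart_scheme(f_grad, rng.standard_normal(30), L=50.0, mu=1.0,
                       n_restarts=5)
print(0.5 * np.sum(d * x_hat ** 2))            # f(x_hat) - f*
```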
Hölderian Error Bounds

We now recall several results related to subanalytic functions and Hölderian error bounds of the form
\[ \frac{\mu}{r}\, d(x, X^\star)^r \leq f(x) - f_\star, \quad \text{for all } x \in K, \quad \text{(HEB)} \]
for some $\mu, r > 0$. We refer the reader to e.g. (Bolte et al., 2007) for a more complete discussion. These results produce bounds akin to local versions of strong convexity, with various exponents, and are known to hold under very generic conditions. They thus better explain the good empirical performance of restart schemes. In general of course, these values are neither observed nor known a priori, but as detailed below, restart schemes are essentially adaptive to these parameters and reach optimal convergence rates without any prior information on $\mu, r$.

Let $f$ be a smooth convex function on $\mathbb{R}^n$. Smoothness ensures
\[ f(x) \leq f_\star + \frac{L}{2}\|x - y\|^2, \]
for any $x \in \mathbb{R}^n$ and $y \in X^\star$. Setting $y$ to be the projection of $x$ on $X^\star$, this yields the following upper bound on suboptimality:
\[ f(x) - f_\star \leq \frac{L}{2}\, d(x, X^\star)^2. \quad (6.7) \]
Now, assume that $f$ satisfies the Hölderian error bound (HEB) on a set $K$ with parameters $(r, \mu)$. Combining (6.7) and (HEB) leads to
\[ \frac{2\mu}{rL} \leq d(x, X^\star)^{2-r}, \]
for every $x \in K$. This means that $2 \leq r$, by taking $x$ close enough to $X^\star$. We will also allow the gradient smoothness exponent 2 to vary in later results, where we assume the gradient to be Hölder smooth, but we first detail the smooth case for simplicity. In what follows, we will use the following notations
\[ \kappa \triangleq L/\mu^{2/r} \quad \text{and} \quad \tau \triangleq 1 - 2/r \quad (6.8) \]
defining generalized condition numbers for the function $f$. Note that if $r = 2$, then $\kappa$ matches the classical condition number of the function.

Subanalytic functions form a very broad class of functions for which we can show Hölderian error bounds as in (HEB), akin to strong convexity. We recall some key definitions and refer the reader to e.g. (Bolte et al., 2007) for a more complete discussion.
Definition 6.1 (Subanalyticity) . (i) A subset A ⊂ R n is called semianalytic if each pointof R n admits a neighborhood V for which A ∩ V assumes the following form p [ i =1 q \ i =1 { x ∈ V : f ij ( x ) = 0 , g ij ( x ) > } , where f ij , g ij : V → R are real analytic for 1 ≤ i ≤ p , 1 ≤ j ≤ q .(ii) A subset A ⊂ R n is called subanalytic if each point of R n admits a neighborhood V such that A ∩ V = { x ∈ R n : ( x, y ) ∈ B } where B is a bounded semianalytic subset of R n × R m .(iii) A function f : R n → R ∪ { + ∞} is called subanalytic if its graph is a subanalyticsubset of R n × R .The class of subanalytic functions is of course very large, but the definition abovesuffers from one key shortcoming as the image and preimage of a subanalytic functionare not in general subanalytic. To remedy this stability issue, we can define a notion ofglobal subanalyticity. We first define the function β n with β n ( x ) , (cid:18) x x , . . . , x n x n (cid:19) and we have the following definition. Definition 6.2 (Global subanalyticity) . (i) A subset A of R n is called globally subanalytic if its image under β n is a subanalytic subset of R n .(ii) A function f : R n → R ∪ { + ∞} is called globally subanalytic if its graph is a globallysubanalytic subset of R n × R .We now recall the Łojasiewicz factorization lemma, which will give us local growthbounds on the graph of the function around its minimum. Theorem 6.1 (Łojasiewicz factorization lemma) . Let K ⊂ R n be a compact set and g, h : K → R two continuous globally subanalytic functions. If h − (0) ⊂ g − (0) , then µr | g ( x ) | r ≤ | h ( x ) | , for all x ∈ K, (6.9)for some µ, r > K ⊂ R n , we can set h ( x ) = f ( x ) − f ⋆ and g ( x ) = d ( x, X ⋆ ), the Euclidean distance from x to the set X ⋆ , where X ⋆ is the set ofoptimal solutions. In this case, we have h − (0) ⊂ g − (0), we can show that g is globallysubanalytic if X ⋆ is and if f is continuous and globally subanalytic, Theorem 6.1 showsthe following Hölderian error bound, µr d ( x, X ⋆ ) r ≤ f ( x ) − f ⋆ , for all x ∈ K, (6.10) .3. Optimal Restart Schemes for some µ, r >
0. Here, Theorem 6.1 produces a bound on the growth rate of the function around the optimum, generalizing the strong convexity bound in (6.4). We illustrate this in Figure 6.2. Overall, since continuity and subanalyticity are very weak conditions, Theorem 6.1 shows that the Hölderian error bound in (HEB) holds almost generically.
Figure 6.2: Left and center: The functions $|x|$ and $x^2$ satisfy a growth condition around zero. Right:
The function exp ( − /x ) does not. We now discuss how to exploit the Hölderian error bounds detailed above using restartschemes. Suppose again that we seek to solve the following unconstrained minimizationproblem, written minimize f ( x ) (6.11)in the variable x ∈ R n , where the gradient of f is Lipschitz continuous with constant L with respect to the Euclidean norm. The optimal method in (4.9) detailed as Algorithm 9produces a point x k satisfying f ( x k ) − f ⋆ ≤ Lk k x − x ⋆ k (6.12)after k iterations.Assume the function f satisfies the Hölderian error bound (HEB), we can use achaining argument similar to that in (6.5) to show improved convergence rates. Whilea constant number of inner iterations (between restarts) was optimal in the stronglyconvex case, the optimal restart scheme for r > Theorem 6.2 (Restart Complexity) . Let f be a smooth convex function satisfying (6.2)with parameter L and (HEB) with parameters ( r, µ ) on a set K . Assume that we aregiven x ∈ R n such that { x | f ( x ) ≤ f ( x ) } ⊂ K . Run the restart scheme in Algorithm 28 Restart Schemes from x with iteration schedule k i = C ⋆κ,τ e τi , for i = 1 , . . . , R , where C ⋆κ,τ , e − τ ( cκ ) ( f ( x ) − f ⋆ ) − τ , (6.13)with κ and τ defined in (6.8) and c = 4 e /e here. The precision reached at the last pointˆ x is bounded by, f (ˆ x ) − f ⋆ ≤ f ( x ) − f ⋆ (cid:16) τ e − ( f ( x ) − f ⋆ ) τ ( cκ ) − N + 1 (cid:17) τ = O (cid:16) N − τ (cid:17) , (6.14)when τ >
0, where N = P Ri =1 k i is the total number of inner iterations.In the strongly convex case, i.e. when τ = 0, the bound above actually becomes f (ˆ x ) − f ⋆ ≤ exp (cid:16) − e − ( cκ ) − N (cid:17) ( f ( x ) − f ⋆ ) = O (cid:16) exp( − κ − N ) (cid:17) and we recover the classical linear convergence bound for Algorithm 12 in the stronglyconvex case. When 0 < τ < faster conver-gence rate than accelerated gradient methods on non-strongly convex functions (i.e. when r > r is to 2, the tighter our bounds are, which yields a better modelfor the function and faster convergence. This matches the lower bounds for optimizingsmooth and sharp functions (Nemirovskii and Nesterov, 1985) up to constant factors.Also, setting k i = C ⋆κ,τ e τi yields continuous bounds on precision, i.e. when τ →
0, bound(6.14) converges to the linear bound, which also shows that for τ near zero, constantrestart schemes are almost optimal. The previous restart schedules depend on the sharpness parameters ( r, µ ) in (HEB). Ingeneral of course, these values are neither observed nor known a priori. Making the restartscheme above adaptive is thus crucial to practical performance. Fortunately, we will seethat a simple logarithmic grid search on these parameters is enough to guarantee nearlyoptimal performance. In other words, as shown in (Roulet and d’Aspremont, 2017), thecomplexity bound in (6.14) is somewhat robust to a misspecification of the inner iterationschedule k i . We can run several restart schemes in Algorithm 28, each with a given number of in-ner iterations N to perform a log-scale grid search on the values of τ and κ in (6.8).We see below that running (log N ) restart schemes suffices to achieve nearly optimalperformance. We define these schemes as follows. ( S p, : Restart Algorithm 9 with k i = C p , S p,q : Restart Algorithm 9 with k i = C p e τ q i , (6.15) .5. Extensions where C p = 2 p and τ q = 2 − q . We stop these schemes when the total number of inneralgorithm iterations has exceed N , i.e. at the smallest R such that P Ri =1 k i ≥ N . Thesize of the grid search in C p is naturally bounded as we cannot restart the algorithmafter more than N total inner iterations, so p ∈ [1 , . . . , ⌊ log N ⌋ ]. Also, when τ is smallerthan 1 /N , a constant schedule performs as well as the optimal geometrically increasingschedule, which crucially means we can also choose q ∈ [0 , . . . , ⌈ log N ⌉ ] and limits thecost of grid search to log N . We have the following complexity bounds. Theorem 6.3 (Adaptive Restart Complexity) . Let f be a smooth convex function satis-fying (6.2) with parameter L and (HEB) with parameters ( r, µ ) on a set K . Assumethat we are given x ∈ R n such that { x | f ( x ) ≤ f ( x ) } ⊂ K and write N a givennumber of iterations. Run schemes S p,q defined in (6.15) for p ∈ [1 , . . . , ⌊ log N ⌋ ] and q ∈ [0 , . . . , ⌈ log N ⌉ ], stopping each time after N total inner algorithm iterations i.e. for R such that P Ri =1 k i ≥ N . Assume N is large enough, so N ≥ C ⋆κ,τ , and if N > τ > C ⋆κ,τ > τ = 0, there exists p ∈ [1 , . . . , ⌊ log N ⌋ ] such that scheme S p, achieves a precisiongiven by f (ˆ x ) − f ⋆ ≤ exp (cid:16) − e − ( cκ ) − N (cid:17) ( f ( x ) − f ⋆ ) . (ii) If τ >
0, there exist p ∈ [1 , . . . , ⌊ log N ⌋ ] and q ∈ [1 , . . . , ⌈ log N ⌉ ] such that scheme S p,q achieves a precision given by f (ˆ x ) − f ⋆ ≤ f ( x ) − f ⋆ (cid:16) τ e − ( cκ ) − ( f ( x ) − f ⋆ ) τ ( N − / (cid:17) τ . Overall, running the logarithmic grid search has a complexity (log N ) times higherthan running N iterations using the optimal scheme where we know the parametersin (HEB), while the convergence rate is roughly slowed down by a factor four. We now discuss several extensions of the results above.
Hölder Smooth Gradient
The results above can be somewhat directly extended to more general notions of regu-larity. In particular, if we assume that there exist s ∈ [1 ,
2] and
L > J ⊂ R n ,i.e. k∇ f ( x ) − ∇ f ( y ) k ≤ L k x − y k s − , for all x, y ∈ J . (6.16)so that the gradient is Hölder smooth. Without further assumptions on f , the optimalrate of convergence for this class of functions is bounded as O (1 /N ρ ), where N is thetotal number of iterations and ρ = 3 s/ − , (6.17) Restart Schemes which gives ρ = 2 for smooth functions and ρ = 1 / ǫ and a starting point x as inputs. It outputs a point x , U ( x , ǫ, t ),such that f ( x ) − f ⋆ ≤ ǫ cL s d ( x , X ⋆ ) ǫ s t ρs ǫ , (6.18)after t iterations, where c is a constant ( c = 2 s − s ). We can extend the definition of κ and τ in (6.8) to this setting, with κ , L s µ r and τ , − sr (6.19)here. We see that τ acts as an analytical condition number measuring how tightnessof upper and lower bound models. The key difference with the smooth case describedabove is that here we need to schedule both the target accuracy ǫ i used by the algorithm and the number of iterations k i made at the i th run of the algorithm. Our scheme isdescribed in Algorithm 29. Algorithm 29
Universal scheduled restarts for convex minimization
Input: $x_0 \in \mathbb{R}^n$, $\epsilon_0 \geq f(x_0) - f_\star$, $\gamma \geq 0$, a sequence $k_i$ and an inner algorithm $\mathcal{U}(x, \epsilon, k)$.
for $i = 1, \ldots, R$ do
  $\epsilon_i := e^{-\gamma}\epsilon_{i-1}$, \quad $x_i := \mathcal{U}(x_{i-1}, \epsilon_i, k_i)$
end for
Output: An iterate $x_R$.

We will choose a sequence $k_i$ that ensures $f(x_i) - f_\star \leq \epsilon_i$ for the geometrically decreasing sequence $\epsilon_i$. Grid search on the restart scheme still works in this case but requires the knowledge of both $s$ and $\tau$. Theorem 6.4.
Let f be a convex function satisfying (6.16) with parameters ( s, L ) ona set J and (HEB) with parameters ( r, µ ) on a set K . Given x ∈ R n assume that { x | f ( x ) ≤ f ( x ) } ⊂ J ∩ K . Run the restart scheme in Algorithm 29 from x for a given ǫ ≥ f ( x ) − f ⋆ with γ = ρ, k i = C ⋆κ,τ,ρ e τi , where C ⋆κ,τ,ρ , e − τ ( cκ ) s ρ ǫ − τρ where ρ is defined in (6.17), κ and τ are defined in (6.19) and c = 8 e /e here. Theprecision reached at the last point x R is given by, f ( x R ) − f ⋆ ≤ exp (cid:16) − ρe − ( cκ ) − s ρ N (cid:17) ǫ = O (cid:16) exp( − κ − s ρ N ) (cid:17) , .6. Calculus Rules when τ = 0, while f ( x R ) − f ⋆ ≤ ǫ (cid:18) τ e − ( cκ ) − s ρ ǫ τρ N + 1 (cid:19) − ρτ = O (cid:16) κ s τ N − ρτ (cid:17) , when τ >
0, where N = P Ri =1 k i is total number of iterations. Relative Smoothness
We can also extend the inequality defining condition (HEB), replacing the distance tothe optimal set by a more general Bregman divergence. Suppose h ( x ) is a one stronglyconvex function with respect to the Euclidean norm, the Bregman divergence D h ( x, y )is defined as D h ( x, y ) , h ( x ) − h ( y ) − h∇ f ( y ); ( x − y ) i (6.20)and we will say that a function f is L -smooth with respect to h on R n if f ( y ) ≤ f ( x ) + h∇ f ( x ); ( y − x ) i + LD h ( y, x ) , for all x, y ∈ R n . (6.21)We can then extend the Hölderian error bound to the Bregman setting as follows. Inan optimization context on a compact set K ⊂ R n , we can set h ( x ) = f ( x ) − f ⋆ and g ( x ) = D ( x, X ⋆ ) = inf y ∈ X ⋆ D h ( x, y ) where X ⋆ is the set of optimal solutions. In thiscase, we have h − (0) ⊂ g − (0), we can show that g is globally subanalytic if X ⋆ is andif f is continuous and globally subanalytic, Theorem 6.1 again shows µr D ( x, X ⋆ ) r ≤ f ( x ) − f ⋆ , for all x ∈ K, (HEB-B)for some µ, r >
0. This allows us to use the restart scheme complexity results above to accelerate proximal gradient methods.
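Returning briefly to the adaptive schemes of Section 6.4, the following Python sketch implements the log-scale grid search over the restart schemes $\mathcal{S}_{p,q}$ in (6.15); the inner routine is left abstract (any accelerated method such as Algorithm 9 can be plugged in), and the interface and stopping convention below are our own simplifications.

```python
import math

def run_scheme(f, inner, x0, C_p, tau_q, N):
    """One scheme from (6.15): k_i = C_p (when tau_q == 0, scheme S_{p,0}) or
    k_i = C_p * exp(tau_q * i), stopped once N inner iterations are spent.
    `inner(x, k)` is assumed to run k iterations of the base method from x."""
    x, spent, i = x0, 0, 1
    while spent < N:
        k = math.ceil(C_p if tau_q == 0.0 else C_p * math.exp(tau_q * i))
        x = inner(x, k)
        spent += k
        i += 1
    return x

def grid_search_restart(f, inner, x0, N):
    """Log-scale grid over (C_p, tau_q): C_p = 2^p and tau_q in {0} or 2^{-q},
    about (log N)^2 runs in total; return the best point found."""
    candidates = []
    for p in range(1, max(1, math.floor(math.log(N))) + 1):
        taus = [0.0] + [2.0 ** (-q)
                        for q in range(1, math.ceil(math.log(N)) + 1)]
        for tau_q in taus:
            candidates.append(run_scheme(f, inner, x0, 2.0 ** p, tau_q, N))
    return min(candidates, key=f)
```

Only the objective values of the candidate outputs are compared, so the procedure requires no knowledge of $(r, \mu)$, which is the point of the robustness result in Theorem 6.3.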
In general, the exponent r and the factor µ in the bounds (HEB) and (6.26) are notobserved and hard to estimate. Despite this, the robustness result in Theorem 6.3 meansthat searching for the best restart scheme only introduces a log factor in the overallalgorithm complexity. There are however a number of scenarios where we can producemuch more precise estimates on r and µ , hence get refined a priori complexity boundsand reduce the cost of the grid search in (6.15).In particular, (Li and Pong, 2018) shows “calculus rules” for the exponent for anumber of elementary operations using a related type of error bound known as theKurdyka-Łojasiewicz inequality. The results focus on the Kurdyka-Łojasiewicz exponent α , defined as follows. Definition 6.3.
A proper closed convex function has Kurdyka-Łojasiewicz (KL) exponent α iff for any point ¯ x ∈ dom ∂f there is a neighborhood V of ¯ x , an ν > and a constant c > such that D ( ∂f ( x ) , ≥ c ( f ( x ) − f (¯ x )) α (6.22) Restart Schemes whenever x ∈ V and f (¯ x ) ≤ f ( x ) ≤ f (¯ x ) + ν .In particular, (Bolte et al. , 2007, Th. 3.3) shows that (HEB) implies (6.22) withexponent α = 1 − /r . Very briefly, the following calculus rules apply to the exponent α .• If f ( x ) = min i f i ( x ) and each f i has KL exponent α i then f has KL exponent α = max i α i (Li and Pong, 2018, Cor. 3.1).• Let f ( x ) = g ◦ F ( x ) , where g is a proper closed function and F is a continuouslydifferentiable mapping. Suppose in addition that g is a KL function with exponent α and the Jacobian JF(x) is a surjective mapping at some ¯ x ∈ dom ∂f . Then f hasthe KL property at ¯ x with exponent α (Li and Pong, 2018, Th. 3.2).• If f ( x ) = P i f i ( x i ) and each f i is continuous and has KL exponent α i then f hasKL exponent α = max i α i (Li and Pong, 2018, Cor. 3.3).• Let f be a proper closed convex function with a KL exponent α ∈ [0 , / . Supposefurther that f is continuous on dom ∂f . Fix λ > and consider F λ ( X ) = inf y (cid:26) f ( y ) + 12 λ k x − y k (cid:27) then F λ has KL exponent α = max n , α − α o (Li and Pong, 2018, Th. 3.4).Note that another related notion of error bound where the primal gap is replaced bythe norm of the prox step was studied in e.g. (Pang, 1987; Luo and Tseng, 1992; Tseng,2010; Zhou and So, 2017). In some applications such as compressed sensing, under some classical assumptions onthe problem data, the exponent r is equal to one and the constant µ can be directlycomputed from quantities controlling recovery performance. In these problems, a singleparameter thus controls both signal recovery and computational performance.Consider for instance a sparse recovery problem using the ℓ norm. Given a matrix A ∈ R n × p and observations b = Ax ⋆ on a signal x ⋆ ∈ R p , recovery is performed bysolving the ℓ minimization programminimize k x k subject to Ax = b ( ℓ recovery)in the variable x ∈ R p . A number of conditions on A have been derived to guaranteethat ( ℓ recovery) recovers the true signal whenever it is sparse enough. Among these,the null space property (see Cohen et al. , 2009 and references therein) is defined asfollows. .8. Restarting Other First-Order Methods Definition 6.4. (Null Space Property)
The matrix A satisfies the Null Space Property(NSP) on support S ⊂ { , p } with constant α ≥ if for any z ∈ Null ( A ) \ { } , α k z S k < k z S c k . (NSP)The matrix A satisfies the Null Space Property at order s with constant α ≥ if itsatisfies it on every support S of cardinality at most s .The null space property is a necessary and sufficient condition for the convex pro-gram ( ℓ recovery) to recover all signals up to some sparsity threshold. We have, thefollowing proposition directly linking the null space property and the Hölderian errorbound (HEB). Proposition 6.1.
Given a coding matrix A ∈ R n × p satisfying (NSP) at order s withconstant α ≥ , if the original signal x ⋆ is s -sparse, then for any x ∈ R p satisfying Ax = b , x = x ⋆ , we have k x k − k x ⋆ k > α − α + 1 k x − x ⋆ k . (6.23)This implies signal recovery, i.e. optimality of x ⋆ for ( ℓ recovery) and the Hölderianerror bound (HEB) with µ = α − α +1 . The restart argument can be readily extended to other optimization methods providedtheir complexity bound directly depends on some measure of distance to optimality. Thisis the case for instance for the Frank-Wolfe method, as detailed in (Kerdreux et al. , 2018).Suppose that we seek to solve the following constrained optimization problemminimize f ( x ) subject to x ∈ C (6.24)Distance to optimality is now measured in terms of the strong Wolfe gap, defined asfollows. Definition 6.5 (Strong Wolfe Gap) . Let f be a smooth convex function, C a polytope,and let x ∈ C be arbitrary. Then the strong Wolfe gap w ( x ) over C is defined as w ( x ) , min S ∈S x max y ∈ S,z ∈C h f ( x ); ( y − z ) i (6.25)where x ∈ Co ( S ) and S x = { S ⊂ Ext ( C ) , finite, x proper combination of elements of S } , is the set of proper supports of x .The inequality playing the role of the Hölderian error bound in (HEB) for the strongWolfe gap is then written as follows. Restart Schemes
Definition 6.6 (Strong Wolfe primal bound) . Let K be a compact neighborhood of X ⋆ in C , where X ⋆ is the set of solutions of the constrained optimization problem (6.24). Afunction f satisfies a r -strong Wolfe primal bound on K , if and only if there exists r ≥ and µ > such that for all x ∈ Kf ( x ) − f ⋆ ≤ µw ( x ) r , (6.26)and f ⋆ its optimal value.Notice that this inequality is an upper bound on the primal gap f ( x ) − f ⋆ , while theHölderian error bound in (HEB) provides a lower bound. This is because the strong Wolfegap can be understood as a gradient norm, so that (6.26) is a Łojasiewicz inequality asin (Bolte et al. , 2007), instead of a direct consequence of the Łojasiewicz factorizationlemma as in (HEB) above.Regularity of f is measured using away curvature as in (Lacoste-Julien and Jaggi,2015), with C Af , sup x,s,v ∈C η ∈ [0 , y = x + η ( s − v ) η (cid:0) f ( y ) − f ( x ) − η h∇ f ( x ) , s − v i (cid:1) , (6.27)and allows us to bound the performance the Fractional Away-Step Frank-Wolfe Algo-rithm in (Kerdreux et al. , 2018), as follows. Theorem 6.5.
Let f be a smooth convex function with away curvature C Af . Assume thestrong Wolfe primal bound in (6.26) holds for some ≤ r ≤ . Let γ > and assume x ∈ C is such that e − γ w ( x , S ) / ≤ C Af . With γ k = γ , the output of the FractionalAway-Step Frank-Wolfe Algorithm satisfies f ( x T ) − f ⋆ ≤ w (cid:16) T C rγ (cid:17) − r when ≤ r < f ( x T ) − f ⋆ ≤ w exp − γe γ ˜ T C Af µ ! when r = 2 , (6.28)after T steps, with w = w ( x , S ) , ˜ T , T − ( |S | − |S T | ) , and C rγ , e γ (2 − r ) − e γ C Af µw ( x , S ) r − . (6.29)This result is similar to that of Theorem 6.4, and shows that restart yields linear com-plexity bounds when the exponent in the strong Wolfe primal bound in (6.26) matchesthat in the curvature (i.e. r = 2 ), and improved linear rates when this exponent r satisfies ≤ r < . Crucially here, the method is fully adaptive to the error bound parametersso no prior knowledge of these parameters is required to get the accelerated rates inTheorem 6.5 and no log scale grid search is required. .9. Notes & References The optimal complexity bounds and exponential restart schemes detailed here can betraced back to (Nemirovskii and Nesterov, 1985). Restart schemes were extensivelybenchmarked in the numerical toolbox TFOCS by (Becker et al. , 2011), with a par-ticular focus on compressed sensing applications. The robustness result showing thatlog scale grid search produces near optimal complexity bounds is due to (Roulet andd’Aspremont, 2017).Hölderian error bounds can be traced to the work of Lojasiewicz (1963) for analyticfunctions. This was extended to much broader classes of functions by (Kurdyka, 1998;Bolte et al. , 2007). Several examples of problems in signal processing where this conditionholds can be found in e.g.(Zhou et al. , 2015; Zhou and So, 2017). Calculus rules for theexponent are discussed in details in e.g. (Li and Pong, 2018).Restart also helps in a stochastic setting with (Davis et al. , 2019) showing recentlythat stochastic algorithms with geometric step decay converge linearly on functions sat-isfying Hölderian error bounds, thus validating a classical empirical acceleration trick,which restarts every few epochs after adjusting step size (aka learning rate in machinelearning terminology). cknowledgements
The authors would like to thank Francis Bach, Mathieu Barré, Raphaël Berthier, AymericDieuleveut, Radu-Alexandru Dragomir, Yoel Drori, Baptiste Goujaud, Hadrien Hen-drikx, Reza Babanezhad, Simon Lacoste-Julien, Fabian Pedregosa and Vincent Rouletfor fruitful discussions and pointers, which largely simplified the writing process of thismanuscript.In particular, the authors would like to warmly thank Raphaël Berthier and MathieuBarré for comments on early versions of this manuscript, for spotting a few typos, andfor discussions and developments related to Chapter 4 and Chapter 5.AA is at the département d’informatique de l’ENS, École normale supérieure, UMRCNRS 8548, PSL Research University, 75005 Paris, France, and INRIA. AA wouldlike to acknowledge support from the ML and Optimisation joint research initiativewith the fonds AXA pour la recherche and Kamet Ventures, a Google focused award,as well as funding by the French government under management of Agence Nationalede la Recherche as part of the “Investissements d’avenir” program, reference ANR-19-P3IA-0001 (PRAIRIE 3IA Institute). AT acknowledges support from the European Re-search Council (ERC grant SEQUOIA 724063). AT is at INRIA and the départementd’informatique de l’ENS, École normale supérieure, CNRS, PSL Research University,75005 Paris, France. eferences
Aitken, A. C. 1927. “On Bernoulli’s Numerical Solution of Algebraic Equations”.
Pro-ceedings of the Royal Society of Edinburgh . 46: 289–305.Allen-Zhu, Z. 2017. “Katyusha: The first direct acceleration of stochastic gradient meth-ods”.
The Journal of Machine Learning Research (JMLR) . 18(1): 8194–8244.Allen-Zhu, Z. and L. Orecchia. 2014. “Linear coupling: An ultimate unification of gradi-ent and mirror descent”. arXiv preprint arXiv:1407.1537 .Anderson, D. G. 1965. “Iterative procedures for nonlinear integral equations”.
Journalof the ACM (JACM) . 12(4): 547–560.Anderson, E. J. and P. Nash. 1987.
Linear programming in infinite-dimensional spaces .Edward J. Anderson, Peter Nash. Chichester: Wiley.Armijo, L. 1966. “Minimization of functions having Lipschitz continuous first partialderivatives”.
Pacific Journal of mathematics . 16(1): 1–3.Attouch, H. and J. Peypouquet. 2016. “The rate of convergence of Nesterov’s acceleratedforward-backward method is actually faster than /k ”. SIAM Journal on Optimiza-tion . 26(3): 1824–1834.Auslender, A. and M. Teboulle. 2006. “Interior gradient and proximal methods for convexand conic optimization”.
SIAM Journal on Optimization . 16(3): 697–725.Baes, M. 2009.
Estimate sequence methods: extensions and approximations . url Theoryof Computing . 15(1): 1–32.Barré, M., A. Taylor, and F. Bach. 2020a. “Principled Analyses and Design of First-OrderMethods with Inexact Proximal Operators”. preprint arXiv:2006.06041 .Barré, M., A. Taylor, and A. d’Aspremont. 2020b. “Convergence of constrained andersonacceleration”. arXiv preprint arXiv:2010.15482 .Bauschke, H. H., J. Bolte, and M. Teboulle. 2016. “A descent Lemma beyond Lipschitzgradient continuity: first-order methods revisited and applications”.
Mathematics ofOperations Research . 42(2): 330–348.
References
Bauschke, H. H. and P. L. Combettes. 2011.
Convex analysis and monotone operatortheory in Hilbert spaces . Vol. 408. Springer.Beck, A. and M. Teboulle. 2009. “A fast iterative shrinkage-thresholding algorithm forlinear inverse problems”.
SIAM Journal on Imaging Sciences . 2(1): 183–202.Beck, A. and M. Teboulle. 2003. “Mirror descent and nonlinear projected subgradientmethods for convex optimization”.
Operations Research Letters . 31(3): 167–175.Becker, S. R., E. J. Candès, and M. C. Grant. 2011. “Templates for convex cone problemswith applications to sparse signal recovery”.
Mathematical Programming Computa-tion . 3(3): 165–218.Ben-Tal, A. and A. Nemirovski. 2001.
Lectures on modern convex optimization : analysis,algorithms, and engineering applications . MPS-SIAM series on optimization . SIAM.Bolte, J., A. Daniilidis, and A. Lewis. 2007. “The Lojasiewicz inequality for nonsmoothsubanalytic functions with applications to subgradient dynamical systems”.
SIAMJournal on Optimization . 17(4): 1205–1223.Bonnans, J.-F., J. C. Gilbert, C. Lemaréchal, and C. A. Sagastizábal. 2006.
Numericaloptimization: theoretical and practical aspects . Springer Science & Business Media.Bottou, L. and O. Bousquet. 2007. “The tradeoffs of large scale learning”.
Advances inNeural Information Processing Systems (NIPS) . 20: 161–168.Boyd, S., N. Parikh, E. Chu, B. Peleato, and J. Eckstein. 2011. “Distributed optimizationand statistical learning via the alternating direction method of multipliers”.
Founda-tions and Trends in Machine learning . 3(1): 1–122.Boyd, S. and L. Vandenberghe. 2004.
Convex optimization . Cambridge university press.Brezinski, C. 1975. “Généralisations de la transformation de Shanks, de la table de Padéet de l’ ε -algorithme”. Calcolo . 12(4): 317–360.Brezinski, C. 2001. “Convergence acceleration during the 20th century”.
Numerical Anal-ysis: Historical Developments in the 20th Century : 113.Brezinski, C. and M. R. Zaglia. 2013.
Extrapolation methods: theory and practice . Else-vier.Bubeck, S. 2015. “Convex Optimization: Algorithms and Complexity”.
Foundations andTrends in Machine Learning . 8(3-4): 231–357.Bubeck, S., Y. T. Lee, and M. Singh. 2015. “A geometric alternative to Nesterov’saccelerated gradient descent”. preprint arXiv:1506.08187 .Cabay, S. and L. Jackson. 1976. “A polynomial extrapolation method for finding limitsand antilimits of vector sequences”.
SIAM Journal on Numerical Analysis . 13(5):734–752.Calatroni, L. and A. Chambolle. 2019. “Backtracking strategies for accelerated descentmethods with smooth composite objectives”.
SIAM Journal on Optimization . 29(3):1772–1798.Cauchy, A. 1847. “Méthode générale pour la résolution des systemes d’équations simul-tanées”.
Comptes Rendus de l’Académie des Sciences de Paris . 25(1847): 536–538.Chambolle, A. and T. Pock. 2016. “An introduction to continuous optimization for imag-ing”.
Acta Numerica . 25: 161–319. eferences
Chen, S., S. Ma, and W. Liu. 2017. “Geometric descent method for convex compositeminimization”. In:
Advances in Neural Information Processing Systems (NIPS) . 636–644.Clarke, F. H. 1990.
Optimization and nonsmooth analysis . Vol. 5. SIAM.Cohen, A., W. Dahmen, and R. DeVore. 2009. “Compressed sensing and best k-termapproximation”.
Journal of the AMS . 22(1): 211–231.Combettes, P. L. and J.-C. Pesquet. 2011. “Proximal splitting methods in signal pro-cessing”. In:
Fixed-point algorithms for inverse problems in science and engineering .Springer. 185–212.Cyrus, S., B. Hu, B. Van Scoy, and L. Lessard. 2018. “A robust accelerated optimiza-tion algorithm for strongly convex functions”. In:
Proceedings of the 2018 AmericanControl Conference (ACC) . IEEE. 1376–1381.d’Aspremont, A. 2008. “Smooth Optimization with Approximate Gradient”.
SIAM Jour-nal on Optimization . 19(3): 1171–1183.d’Aspremont, A., C. Guzman, and M. Jaggi. 2013. “Optimal Affine Invariant SmoothMinimization Algorithm”. arXiv preprint arXiv:1301.0465 .Davis, D., D. Drusvyatskiy, and V. Charisopoulos. 2019. “Stochastic algorithms with geo-metric step decay converge linearly on sharp functions”. arXiv preprint arXiv:1907.09547 .De Klerk, E., F. Glineur, and A. B. Taylor. 2017. “On the worst-case complexity ofthe gradient method with exact line search for smooth strongly convex functions”.
Optimization Letters . 11(7): 1185–1199.Defazio, A., F. Bach, and S. Lacoste-Julien. 2014a. “SAGA: A fast incremental gradientmethod with support for non-strongly convex composite objectives”. In:
Advances inNeural Information Processing Systems (NIPS) . 1646–1654.Defazio, A. J., T. S. Caetano, and J. Domke. 2014b. “Finito: A faster, permutableincremental gradient method for big data problems”. In:
Proceedings of the 31stInternational Conference on Machine Learning (ICML) . 1125–1133.Devolder, O. 2011. “Stochastic first order methods in smooth convex optimization”.
Tech.rep.
CORE discussion paper.Devolder, O. 2013. “Exactness, inexactness and stochasticity in first-order methods forlarge-scale convex optimization”.
PhD thesis .Devolder, O., F. Glineur, and Y. Nesterov. 2013. “Intermediate gradient methods forsmooth convex problems with inexact oracle”.Devolder, O., F. Glineur, and Y. Nesterov. 2014. “First-order methods of smooth convexoptimization with inexact oracle”.
Mathematical Programming . 146(1-2): 37–75.Diakonikolas, J. and L. Orecchia. 2019a. “Conjugate gradients and accelerated methodsunified: The approximate duality gap view”. preprint arXiv:1907.00289 .Diakonikolas, J. and L. Orecchia. 2019b. “The approximate duality gap technique: Aunified theory of first-order methods”.
SIAM Journal on Optimization . 29(1): 660–689. References
Douglas, J. and H. H. Rachford. 1956. “On the numerical solution of heat conduction problems in two and three space variables”. Transactions of the American Mathematical Society. 82(2): 421–439.
Dragomir, R.-A., A. Taylor, A. d’Aspremont, and J. Bolte. 2019. “Optimal complexity and certification of Bregman first-order methods”. preprint arXiv:1911.08510.
Drori, Y. 2014. “Contributions to the Complexity Analysis of Optimization Algorithms”. PhD thesis. Tel-Aviv University.
Drori, Y. 2017. “The exact information-based complexity of smooth convex minimization”. Journal of Complexity. 39: 1–16.
Drori, Y. 2018. “On the properties of convex functions over open sets”. preprint arXiv:1812.02419.
Drori, Y. and A. Taylor. 2021. “On the oracle complexity of smooth strongly convex minimization”. preprint.
Drori, Y. and A. B. Taylor. 2020. “Efficient first-order methods for convex minimization: a constructive approach”. Mathematical Programming. 184(1): 183–220.
Drori, Y. and M. Teboulle. 2014. “Performance of first-order methods for smooth convex minimization: a novel approach”. Mathematical Programming. 145(1-2): 451–482.
Drori, Y. and M. Teboulle. 2016. “An optimal variant of Kelley’s cutting-plane method”. Mathematical Programming. 160(1-2): 321–351.
Drusvyatskiy, D., M. Fazel, and S. Roy. 2018. “An optimal first order method based on optimal quadratic averaging”. SIAM Journal on Optimization. 28(1): 251–271.
Dvurechensky, P. and A. Gasnikov. 2016. “Stochastic intermediate gradient method for convex problems with stochastic inexact oracle”. Journal of Optimization Theory and Applications. 171(1): 121–145.
Eckstein, J. 1989. “Splitting methods for monotone operators with applications to parallel optimization”. PhD thesis. Massachusetts Institute of Technology.
Eckstein, J. and P. J. Silva. 2013. “A practical relative error criterion for augmented Lagrangians”. Mathematical Programming. 141(1-2): 319–348.
Eckstein, J. and W. Yao. 2012. “Augmented Lagrangian and alternating direction methods for convex optimization: A tutorial and some illustrative computational results”. RUTCOR Research Reports. 32(3).
Eddy, R. 1979. “Extrapolating to the limit of a vector sequence”. In: Information linkage between applied mathematics and industry. Elsevier. 387–396.
Fang, H.-r. and Y. Saad. 2009. “Two classes of multisecant methods for nonlinear acceleration”. Numerical Linear Algebra with Applications. 16(3): 197–221.
Fazlyab, M., A. Ribeiro, M. Morari, and V. M. Preciado. 2018. “Analysis of optimization algorithms via integral quadratic constraints: Nonstrongly convex problems”. SIAM Journal on Optimization. 28(3): 2654–2689.
Fercoq, O. and P. Richtárik. 2015. “Accelerated, parallel, and proximal coordinate descent”. SIAM Journal on Optimization. 25(4): 1997–2023.
Fessler, J. A. 2020. “Optimization methods for magnetic resonance image reconstruction: Key models and optimization algorithms”. IEEE Signal Processing Magazine. 37(1): 33–40. Complete version: http://arxiv.org/abs/1903.03510.
Florea, M. I. and S. A. Vorobyov. 2018. “An accelerated composite gradient method for large-scale composite objective problems”. IEEE Transactions on Signal Processing. 67(2): 444–459.
Florea, M. I. and S. A. Vorobyov. 2020. “A generalized accelerated composite gradient method: Uniting Nesterov’s fast gradient method and FISTA”. IEEE Transactions on Signal Processing.
Gasnikov, A., P. Dvurechensky, E. Gorbunov, E. Vorontsova, D. Selikhanovych, and C. Uribe. 2019. “Optimal Tensor Methods in Smooth Convex and Uniformly Convex Optimization”. In: Proceedings of the 32nd Conference on Learning Theory (COLT). Vol. 99. 1–18.
Gasnikov, A. V. and Y. E. Nesterov. 2018. “Universal method for stochastic composite optimization problems”. Computational Mathematics and Mathematical Physics. 58(1): 48–64.
Glowinski, R. and A. Marroco. 1975. “Sur l’approximation, par éléments finis d’ordre un, et la résolution, par pénalisation-dualité d’une classe de problèmes de Dirichlet non linéaires”. ESAIM: Mathematical Modelling and Numerical Analysis - Modélisation Mathématique et Analyse Numérique. 9(R2): 41–76.
Goldstein, A. 1962. “Cauchy’s method of minimization”. Numerische Mathematik. 4(1): 146–150.
Golub, G. H. and R. S. Varga. 1961. “Chebyshev semi-iterative methods, successive overrelaxation iterative methods, and second order Richardson iterative methods”. Numerische Mathematik. 3(1): 157–168.
Gramlich, D., C. Ebenbauer, and C. W. Scherer. 2020. “Convex Synthesis of Accelerated Gradient Algorithms for Optimization and Saddle Point Problems using Lyapunov functions”. preprint arXiv:2006.09946.
Güler, O. 1991. “On the convergence of the proximal point algorithm for convex minimization”. SIAM Journal on Control and Optimization. 29(2): 403–419.
Güler, O. 1992. “New proximal point algorithms for convex minimization”. SIAM Journal on Optimization. 2(4): 649–664.
Guzmán, C. and A. Nemirovsky. 2015. “On lower complexity bounds for large-scale smooth convex optimization”. Journal of Complexity. 31(1): 1–14.
Hiriart-Urruty, J.-B. and C. Lemaréchal. 2013. Convex analysis and minimization algorithms I: Fundamentals. Vol. 305. Springer Science & Business Media.
Hu, B. and L. Lessard. 2017. “Dissipativity theory for Nesterov’s accelerated method”. In: Proceedings of the 34th International Conference on Machine Learning (ICML). 1549–1557.
Hu, C., W. Pan, and J. Kwok. 2009. “Accelerated gradient methods for stochastic optimization and online learning”. Advances in Neural Information Processing Systems (NIPS). 22: 781–789.
Iusem, A. N. 1999. “Augmented Lagrangian methods and proximal point methods for convex optimization”. Investigación Operativa. 8(11-49): 7.
Ivanova, A., D. Grishchenko, A. Gasnikov, and E. Shulgin. 2019. “Adaptive Catalyst for smooth convex optimization”. preprint arXiv:1911.11271.
Jbilou, K. and H. Sadok. 2000. “Vector extrapolation methods. Applications and numerical comparison”. Journal of Computational and Applied Mathematics. 122(1-2): 149–165.
Jbilou, K. and H. Sadok. 1991. “Some results about vector extrapolation methods and related fixed-point iterations”. Journal of Computational and Applied Mathematics. 36(3): 385–398.
Jbilou, K. and H. Sadok. 1995. “Analysis of some vector extrapolation methods for solving systems of linear equations”. Numerische Mathematik. 70(1): 73–89.
Johnson, R. and T. Zhang. 2013. “Accelerating stochastic gradient descent using predictive variance reduction”. In: Advances in Neural Information Processing Systems (NIPS). 315–323.
Juditsky, A., G. Lan, A. Nemirovski, and A. Shapiro. 2009. “Stochastic Approximation Approach to Stochastic Programming”. SIAM Journal on Optimization. 19(4): 1574–1609.
Juditsky, A. and A. Nemirovsky. 2011a. “First order methods for nonsmooth convex large-scale optimization, i: general purpose methods”. Optimization for Machine Learning. 30(9): 121–148.
Juditsky, A. and A. Nemirovsky. 2011b. “First order methods for nonsmooth convex large-scale optimization, ii: utilizing problems structure”. Optimization for Machine Learning. 30(9): 149–183.
Juditsky, A. and Y. Nesterov. 2014. “Deterministic and stochastic primal-dual subgradient algorithms for uniformly convex minimization”. Stochastic Systems. 4(1): 44–80.
Karimi, S. and S. Vavasis. 2017. “A single potential governing convergence of conjugate gradient, accelerated gradient and geometric descent”. preprint arXiv:1712.09498.
Karimi, S. and S. A. Vavasis. 2016. “A unified convergence bound for conjugate gradient and accelerated gradient”. preprint arXiv:1605.00320.
Kerdreux, T., A. d’Aspremont, and S. Pokutta. 2018. “Restarting Frank-Wolfe”. arXiv preprint arXiv:1810.02429.
Kim, D. 2019. “Accelerated proximal point method for maximally monotone operators”. preprint arXiv:1905.05149.
Kim, D. and J. A. Fessler. 2018a. “Adaptive restart of the optimized gradient method for convex optimization”. Journal of Optimization Theory and Applications. 178(1): 240–263.
Kim, D. and J. A. Fessler. 2016. “Optimized first-order methods for smooth convex minimization”. Mathematical Programming. 159(1-2): 81–107.
Kim, D. and J. A. Fessler. 2018b. “Another look at the fast iterative shrinkage/thresholding algorithm (FISTA)”. SIAM Journal on Optimization. 28(1): 223–250.
Kim, D. and J. A. Fessler. 2018c. “Generalizing the optimized gradient method for smooth convex minimization”. SIAM Journal on Optimization. 28(2): 1920–1950.
Kim, D. and J. A. Fessler. 2020. “Optimizing the efficiency of first-order methods for decreasing the gradient of smooth convex functions”. Journal of Optimization Theory and Applications.
Krichene, W., A. Bayen, and P. L. Bartlett. 2015. “Accelerated mirror descent in continuous and discrete time”. Advances in Neural Information Processing Systems (NIPS). 28: 2845–2853.
Kulunchakov, A. and J. Mairal. 2019. “Estimate sequences for stochastic composite optimization: Variance reduction, acceleration, and robustness to noise”. preprint arXiv:1901.08788.
Kurdyka, K. 1998. “On gradients of functions definable in o-minimal structures”. In: Annales de l’institut Fourier. Vol. 48. No. 3. 769–783.
Lacoste-Julien, S. and M. Jaggi. 2015. “On the global linear convergence of Frank-Wolfe optimization variants”. Advances in Neural Information Processing Systems (NIPS). 28: 496–504.
Lan, G. 2012. “An optimal method for stochastic composite optimization”. Mathematical Programming. 133(1-2): 365–397.
Lan, G., Z. Lu, and R. D. Monteiro. 2011. “Primal-dual first-order methods with O(1/ε) iteration-complexity for cone programming”. Mathematical Programming. 126(1): 1–29.
Lee, Y. T. and A. Sidford. 2013. “Efficient accelerated coordinate descent methods and faster algorithms for solving linear systems”. In: Proceedings of the 54th Annual Symposium on Foundations of Computer Science (FOCS). IEEE. 147–156.
Lemaréchal, C. and C. Sagastizábal. 1997. “Practical Aspects of the Moreau–Yosida Regularization: Theoretical Preliminaries”. SIAM Journal on Optimization. 7(2): 367–385.
Lessard, L., B. Recht, and A. Packard. 2016. “Analysis and design of optimization algorithms via integral quadratic constraints”. SIAM Journal on Optimization. 26(1): 57–95.
Lessard, L. and P. Seiler. 2020. “Direct synthesis of iterative algorithms with bounds on achievable worst-case convergence rate”. In: Proceedings of the 2020 American Control Conference (ACC). IEEE. 119–125.
Li, G. and T. K. Pong. 2018. “Calculus of the exponent of Kurdyka–Łojasiewicz inequality and its applications to linear convergence of first-order methods”. Foundations of Computational Mathematics. 18(5): 1199–1232.
Lieder, F. 2020. “On the convergence rate of the Halpern-iteration”. Optimization Letters: 1–14.
Lin, H., J. Mairal, and Z. Harchaoui. 2015. “A universal catalyst for first-order optimization”. In: Advances in Neural Information Processing Systems (NIPS). 3384–3392.
Lin, H., J. Mairal, and Z. Harchaoui. 2017. “Catalyst acceleration for first-order convex optimization: from theory to practice”. The Journal of Machine Learning Research (JMLR). 18(1): 7854–7907.
Lions, P.-L. and B. Mercier. 1979. “Splitting algorithms for the sum of two nonlinear operators”. SIAM Journal on Numerical Analysis. 16(6): 964–979.
Lojasiewicz, S. 1963. “Une propriété topologique des sous-ensembles analytiques réels”. Les équations aux dérivées partielles: 87–89.
Lu, H., R. M. Freund, and Y. Nesterov. 2018. “Relatively smooth convex optimization by first-order methods, and applications”. SIAM Journal on Optimization. 28(1): 333–354.
Luo, Z.-Q. and P. Tseng. 1992. “On the linear convergence of descent methods for convex essentially smooth minimization”. SIAM Journal on Control and Optimization. 30(2): 408–425.
Mai, V. and M. Johansson. 2020. “Anderson acceleration of proximal gradient methods”. In: Proceedings of the 37th International Conference on Machine Learning (ICML). PMLR. 6620–6629.
Mairal, J. 2015. “Incremental majorization-minimization optimization with application to large-scale machine learning”. SIAM Journal on Optimization. 25(2): 829–855.
Mairal, J. 2019. “Cyanure: An Open-Source Toolbox for Empirical Risk Minimization for Python, C++, and soon more”. preprint arXiv:1912.08165.
Malitsky, Y. and K. Mishchenko. 2020. “Adaptive Gradient Descent without Descent”. In: Proceedings of the 37th International Conference on Machine Learning (ICML). PMLR. 6702–6712.
Martinet, B. 1970. “Régularisation d’inéquations variationnelles par approximations successives”. Revue Française d’Informatique et de Recherche Opérationnelle. 4: 154–158.
Martinet, B. 1972. “Détermination approchée d’un point fixe d’une application pseudo-contractante. Cas de l’application prox.” Comptes rendus hebdomadaires des séances de l’Académie des sciences de Paris. 274: 163–165.
Mason, J. C. and D. C. Handscomb. 2002. Chebyshev polynomials. CRC Press.
Mešina, M. 1977. “Convergence acceleration for the iterative solution of the equations X = AX + f”. Computer Methods in Applied Mechanics and Engineering. 10(2): 165–173.
Mifflin, R. 1977. “Semismooth and semiconvex functions in constrained optimization”. SIAM Journal on Control and Optimization. 15(6): 959–972.
Monteiro, R. D. and B. F. Svaiter. 2013. “An accelerated hybrid proximal extragradient method for convex optimization and its implications to second-order methods”. SIAM Journal on Optimization. 23(2): 1092–1125.
Moreau, J.-J. 1962. “Fonctions convexes duales et points proximaux dans un espace hilbertien”. Comptes rendus hebdomadaires des séances de l’Académie des sciences de Paris. 255: 2897–2899.
Moreau, J.-J. 1965. “Proximité et dualité dans un espace hilbertien”. Bulletin de la Société mathématique de France. 93: 273–299.
Narkiss, G. and M. Zibulevsky. 2005. Sequential subspace optimization method for large-scale unconstrained problems. Technion-IIT, Department of Electrical Engineering.
Necoara, I., Y. Nesterov, and F. Glineur. 2019. “Linear convergence of first order methods for non-strongly convex optimization”. Mathematical Programming. 175(1-2): 69–107.
Nemirovskii, A. and Y. E. Nesterov. 1985. “Optimal methods of smooth convex minimization”. USSR Computational Mathematics and Mathematical Physics. 25(2): 21–30.
Nemirovskiy, A. S. and B. T. Polyak. 1984. “Iterative methods for solving linear ill-posed problems under precise information.” ENG. CYBER. (4): 50–56.
Nemirovsky, A. and D. Yudin. 1983a. Problem complexity and method efficiency in optimization.
Nemirovsky, A. 1982. “Orth-method for smooth convex optimization”. Izvestia AN SSSR, Transl.: Eng. Cybern. Soviet J. Comput. Syst. Sci. 2: 937–947.
Nemirovsky, A. S. 1991. “On optimality of Krylov’s information when solving linear operator equations”. Journal of Complexity. 7(2): 121–130.
Nemirovsky, A. S. 1992. “Information-based complexity of linear operator equations”. Journal of Complexity. 8(2): 153–175.
Nemirovsky, A. S. and D. B. Yudin. 1983b. “Problem Complexity and Method Efficiency in Optimization.” Wiley-Interscience, New York.
Nesterov, Y. 1983. “A method of solving a convex programming problem with convergence rate O(1/k²)”. Soviet Mathematics Doklady. 27(2): 372–376.
Nesterov, Y. 2003. Introductory Lectures on Convex Optimization. Springer.
Nesterov, Y. 2005. “Smooth minimization of non-smooth functions”. Mathematical Programming. 103(1): 127–152.
Nesterov, Y. 2009. “Primal-dual subgradient methods for convex problems”. Mathematical Programming Series B. 120(1): 221–259.
Nesterov, Y. 2008. “Accelerating the cubic regularization of Newton’s method on convex problems”. Mathematical Programming. 112(1): 159–181.
Nesterov, Y. 2012a. “Efficiency of coordinate descent methods on huge-scale optimization problems”. SIAM Journal on Optimization. 22(2): 341–362.
Nesterov, Y. 2013. “Gradient methods for minimizing composite functions”. Mathematical Programming. 140(1): 125–161.
Nesterov, Y. 2015. “Universal gradient methods for convex optimization problems”. Mathematical Programming. 152(1-2): 381–404.
Nesterov, Y. 2012b. “How to make the gradients small”. Optima. Mathematical Optimization Society Newsletter. (88): 10–11.
Nesterov, Y. 2019. “Implementable tensor methods in unconstrained convex optimization”. Mathematical Programming: 1–27.
Nesterov, Y. 2020a. “Inexact accelerated high-order proximal-point methods”. Tech. rep. CORE discussion paper.
Nesterov, Y. 2020b. “Inexact high-order proximal-point methods with auxiliary search procedure”. Tech. rep. CORE discussion paper.
Nesterov, Y. and S. U. Stich. 2017. “Efficiency of the accelerated coordinate descent method on structured optimization problems”. SIAM Journal on Optimization. 27(1): 110–123.
Nocedal, J. and S. Wright. 2006. Numerical optimization. Springer Science & Business Media.
O’Donoghue, B. and E. Candes. 2015. “Adaptive restart for accelerated gradient schemes”. Foundations of Computational Mathematics. 15(3): 715–732.
Pang, J.-S. 1987. “A posteriori error bounds for the linearly-constrained variational inequality problem”. Mathematics of Operations Research. 12(3): 474–484.
Paquette, C., H. Lin, D. Drusvyatskiy, J. Mairal, and Z. Harchaoui. 2018. “Catalyst for Gradient-based Nonconvex Optimization”. In: Proceedings of the 21st International Conference on Artificial Intelligence and Statistics (AISTATS). Vol. 84.
Parikh, N. and S. Boyd. 2014. “Proximal algorithms”. Foundations and Trends in Optimization. 1(3): 127–239.
Passty, G. B. 1979. “Ergodic convergence to a zero of the sum of monotone operators in Hilbert space”. Journal of Mathematical Analysis and Applications. 72(2): 383–390.
Qi, L. and J. Sun. 1993. “A nonsmooth version of Newton’s method”. Mathematical Programming. 58(1-3): 353–367.
Robbins, H. and S. Monro. 1951. “A stochastic approximation method”. The Annals of Mathematical Statistics: 400–407.
Rockafellar, R. T. 1973. “A dual approach to solving nonlinear programming problems by unconstrained optimization”. Mathematical Programming. 5(1): 354–373.
Rockafellar, R. T. 1976. “Augmented Lagrangians and applications of the proximal point algorithm in convex programming”. Mathematics of Operations Research. 1(2): 97–116.
Rockafellar, R. T. 1970. Convex Analysis. Princeton: Princeton University Press.
Rockafellar, R. T. and R.-B. Wets. 1998. Variational Analysis. Berlin: Springer-Verlag.
Roulet, V. and A. d’Aspremont. 2017. “Sharpness, restart and acceleration”. In: Advances in Neural Information Processing Systems (NIPS). 1119–1129.
Ryu, E. K. and S. Boyd. 2016. “Primer on monotone operator methods”. Appl. Comput. Math. 15(1): 3–43.
Ryu, E. K. and W. Yin. 2020. Large-Scale Convex Optimization via Monotone Operators.
Scheinberg, K., D. Goldfarb, and X. Bai. 2014. “Fast first-order methods for composite convex optimization with backtracking”. Foundations of Computational Mathematics. 14(3): 389–417.
Schmidt, M., N. Le Roux, and F. Bach. 2011. “Convergence rates of inexact proximal-gradient methods for convex optimization”. In: Advances in Neural Information Processing Systems (NIPS). 1458–1466.
Schmidt, M., N. Le Roux, and F. Bach. 2017. “Minimizing finite sums with the stochastic average gradient”. Mathematical Programming. 162(1-2): 83–112.
Scieur, D., A. d’Aspremont, and F. Bach. 2016. “Regularized nonlinear acceleration”. In: Advances in Neural Information Processing Systems (NIPS). 712–720.
Scieur, D., E. Oyallon, A. d’Aspremont, and F. Bach. 2018. “Online Regularized Nonlinear Acceleration”. preprint arXiv:1805.09639.
Scieur, D., V. Roulet, F. Bach, and A. d’Aspremont. 2017. “Integration methods and optimization algorithms”. In: Advances in Neural Information Processing Systems (NIPS). 1109–1118.
Shalev-Shwartz, S. and T. Zhang. 2013. “Stochastic dual coordinate ascent methods for regularized loss minimization”. The Journal of Machine Learning Research (JMLR). 14(Feb): 567–599.
Shalev-Shwartz, S. and T. Zhang. 2014. “Accelerated proximal stochastic dual coordinate ascent for regularized loss minimization”. In: Proceedings of the 31st International Conference on Machine Learning (ICML). 64–72.
Sidi, A. 1986. “Convergence and stability properties of minimal polynomial and reduced rank extrapolation algorithms”. SIAM Journal on Numerical Analysis. 23(1): 197–209.
Sidi, A. 1988. “Extrapolation vs. projection methods for linear systems of equations”. Journal of Computational and Applied Mathematics. 22(1): 71–88.
Sidi, A. 1991. “Efficient implementation of minimal polynomial and reduced rank extrapolation methods”. Journal of Computational and Applied Mathematics. 36(3): 305–337.
Sidi, A. 2008. “Vector extrapolation methods with applications to solution of large systems of equations and to PageRank computations”. Computers & Mathematics with Applications. 56(1): 1–24.
Sidi, A. 2017a. “Minimal polynomial and reduced rank extrapolation methods are related”. Advances in Computational Mathematics. 43(1): 151–170.
Sidi, A. 2017b. Vector extrapolation methods with applications. SIAM.
Sidi, A. 2019. “A convergence study for reduced rank extrapolation on nonlinear systems”. Numerical Algorithms: 1–26.
Sidi, A. and J. Bridger. 1988. “Convergence and stability analyses for some vector extrapolation methods in the presence of defective iteration matrices”. Journal of Computational and Applied Mathematics. 22(1): 35–61.
Sidi, A., W. F. Ford, and D. A. Smith. 1986. “Acceleration of convergence of vector sequences”. SIAM Journal on Numerical Analysis. 23(1): 178–196.
Sidi, A. and Y. Shapira. 1998. “Upper bounds for convergence rates of acceleration methods with initial iterations”. Numerical Algorithms. 18(2): 113–132.
Siegel, J. W. 2019. “Accelerated first-order methods: Differential equations and Lyapunov functions”. preprint arXiv:1903.05671.
Smith, D. A., W. F. Ford, and A. Sidi. 1987. “Extrapolation methods for vector sequences”. SIAM Review. 29(2): 199–233.
Solodov, M. V. and B. F. Svaiter. 1999a. “A hybrid approximate extragradient–proximal point algorithm using the enlargement of a maximal monotone operator”. Set-Valued Analysis. 7(4): 323–345.
Solodov, M. V. and B. F. Svaiter. 1999b. “A hybrid projection-proximal point algorithm”. Journal of Convex Analysis. 6(1): 59–70.
Solodov, M. V. and B. F. Svaiter. 2000. “Error bounds for proximal point subproblems and associated inexact proximal point algorithms”. Mathematical Programming. 88(2): 371–389.
Solodov, M. V. and B. F. Svaiter. 2001. “A unified framework for some inexact proximal point algorithms”. Numerical Functional Analysis and Optimization. 22(7-8): 1013–1035.
Su, W., S. Boyd, and E. J. Candes. 2016. “A differential equation for modeling Nesterov’s accelerated gradient method: theory and insights”. The Journal of Machine Learning Research (JMLR). 17(1): 5312–5354.
Süli, E. and D. F. Mayers. 2003. An introduction to numerical analysis. Cambridge University Press.
Taylor, A. and F. Bach. 2019. “Stochastic first-order methods: non-asymptotic and computer-aided analyses via potential functions”. In: Proceedings of the 32nd Conference on Learning Theory (COLT). Vol. 99. 2934–2992.
Taylor, A. and Y. Drori. 2021. “An optimal gradient method for smooth (possibly strongly) convex minimization”. preprint.
Taylor, A. B., J. M. Hendrickx, and F. Glineur. 2017a. “Exact worst-case performance of first-order methods for composite convex optimization”. SIAM Journal on Optimization. 27(3): 1283–1313.
Taylor, A. B., J. M. Hendrickx, and F. Glineur. 2017b. “Performance estimation toolbox (PESTO): automated worst-case analysis of first-order optimization methods”. In: Proceedings of the 56th Conference on Decision and Control (CDC). IEEE. 1278–1283.
Taylor, A. B., J. M. Hendrickx, and F. Glineur. 2017c. “Smooth strongly convex interpolation and exact worst-case performance of first-order methods”. Mathematical Programming. 161(1-2): 307–345.
Teboulle, M. 2018. “A simplified view of first order methods for optimization”. Mathematical Programming. 170(1): 67–96.
Toth, A. and C. Kelley. 2015. “Convergence analysis for Anderson acceleration”. SIAM Journal on Numerical Analysis. 53(2): 805–819.
Tseng, P. 2008. “On accelerated proximal gradient methods for convex-concave optimization”. Mathematical Programming. 125(2): 263–295.
Tyrtyshnikov, E. E. 1994. “How bad are Hankel matrices?” Numerische Mathematik. 67(2): 261–269.
Van Scoy, B., R. A. Freeman, and K. M. Lynch. 2017. “The fastest known globally convergent first-order method for minimizing strongly convex functions”. IEEE Control Systems Letters. 2(1): 49–54.
Villa, S., S. Salzo, L. Baldassarre, and A. Verri. 2013. “Accelerated and inexact forward-backward algorithms”. SIAM Journal on Optimization. 23(3): 1607–1633.
Walker, H. F. and P. Ni. 2011. “Anderson acceleration for fixed-point iterations”. SIAM Journal on Numerical Analysis. 49(4): 1715–1735.
Wilson, A. C., B. Recht, and M. I. Jordan. 2016. “A Lyapunov analysis of momentum methods in optimization”. preprint arXiv:1611.02635.
Wynn, P. 1956. “On a device for computing the e_m(S_n) transformation”. Mathematical Tables and Other Aids to Computation. 10(54): 91–96.
Xiao, L. 2010. “Dual Averaging Methods for Regularized Stochastic Learning and Online Optimization”. The Journal of Machine Learning Research (JMLR). 11: 2543–2596.
Zhou, K., Q. Ding, F. Shang, J. Cheng, D. Li, and Z.-Q. Luo. 2019. “Direct acceleration of SAGA using sampled negative momentum”. In: Proceedings of the 22nd International Conference on Artificial Intelligence and Statistics (AISTATS). 1602–1610.
Zhou, K., F. Shang, and J. Cheng. 2018. “A Simple Stochastic Variance Reduced Algorithm with Fast Convergence Rates”. In: Proceedings of the 35th International Conference on Machine Learning (ICML). 5980–5989.
Zhou, K., A. M.-C. So, and J. Cheng. 2020. “Boosting First-order Methods by Shifting Objective: New Schemes with Faster Worst Case Rates”. In: Advances in Neural Information Processing Systems (NeurIPS).
Zhou, Z. and A. M.-C. So. 2017. “A unified approach to error bounds for structured convex optimization problems”. Mathematical Programming: 1–40.
Zhou, Z., Q. Zhang, and A. M.-C. So. 2015. “ℓ1,p-Norm Regularization: Error Bounds and Convergence Rate Analysis of First-Order Methods”. In: