An optimal gradient method for smooth strongly convex minimization
Adrien Taylor · Yoel Drori
Date of current version: January 26, 2021
Abstract
We present an optimal gradient method for smooth (possibly strongly) convex optimization. The method is optimal in the sense that its worst-case bound exactly matches the lower bound on the oracle complexity for the class of problems, meaning that no black-box first-order method can have a better worst-case guarantee without further assumptions on the class of problems at hand. The method is in some sense a generalization of the Optimized Gradient Method of Kim and Fessler [2016], and asymptotically corresponds to the Triple Momentum Method [Van Scoy et al., 2017] in the presence of strong convexity. Furthermore, the method is numerically stable to arbitrarily large condition numbers and admits a conceptually very simple proof, which involves a Lyapunov argument and a sum of two inequalities. Finally, we provide a numerical recipe for obtaining the algorithmic parameters of the method, using semidefinite programming, and illustrate that it can be used for developing other methods as well.
Consider the unconstrained minimization problem
$$\min_{x\in\mathbb{R}^d} f(x), \tag{1}$$
where $f$ is a smooth convex function, possibly strongly convex. For solving such problems, one can rely on black-box first-order methods, which iteratively acquire information about $f$ by evaluating its gradient at a sequence of iterates. In this context, the question of designing first-order methods with good worst-case guarantees occupies an important place.

In this work, we provide an optimal black-box first-order method, the Information-Theoretic Exact Method (ITEM), for when $f$ is smooth and (possibly strongly) convex. This method exactly achieves the lower bound on the oracle complexity of smooth strongly convex minimization for the distance to an optimal solution, and therefore no black-box first-order method can be guaranteed to perform strictly better in the worst case for this criterion, neither in rate, nor in the constants involved.

A. Taylor acknowledges support from the European Research Council (grant SEQUOIA 724063). This work was funded in part by the French government under the management of Agence Nationale de la Recherche as part of the "Investissements d'avenir" program, reference ANR-19-P3IA-0001 (PRAIRIE 3IA Institute).
Adrien Taylor: INRIA, Département d'informatique de l'ENS, École normale supérieure, CNRS, PSL Research University, Paris, France. Email: [email protected]
Yoel Drori: Google Research Israel. Email: [email protected]
In a nutshell, assuming $f$ is a $\mu$-strongly convex function with $L$-Lipschitz gradient, the method of interest in this text has the following structure
$$y_k = (1-\beta_k)z_k + \beta_k\left(y_{k-1} - \tfrac{1}{L}\nabla f(y_{k-1})\right)$$
$$z_{k+1} = (1-q\delta_k)z_k + q\delta_k y_k - \tfrac{\delta_k}{L}\nabla f(y_k),$$
where $q = \mu/L$ is the (inverse) condition ratio. When $\mu > 0$, those methods can be described in a more symmetric, and perhaps more natural, way
$$y_k = (1-\beta_k)z_k + \beta_k\left(y_{k-1} - \tfrac{1}{L}\nabla f(y_{k-1})\right)$$
$$z_{k+1} = (1-q\delta_k)z_k + q\delta_k\left(y_k - \tfrac{1}{\mu}\nabla f(y_k)\right). \tag{2}$$
Both sequences $\{\beta_k\}$ and $\{\delta_k\}$ are parametrized by a sequence $A_k$, incorporating the dependency on the iteration number, as follows
$$\beta_k = \frac{A_k}{(1-q)A_{k+1}}, \qquad \delta_k = \frac{1}{2}\,\frac{(1-q)^2A_{k+1}-(1+q)A_k}{1+q+qA_k},$$
with $A_0 = 0$ and
$$A_{k+1} = \frac{(1+q)A_k + 2\left(1+\sqrt{(1+A_k)(1+qA_k)}\right)}{(1-q)^2}.$$
This sequence allows concisely describing the worst-case performance of (3) as
$$\|z_N - x_\star\|^2 \le \frac{1}{1+qA_N}\,\|x_0 - x_\star\|^2,$$
which is the exact lower bound for smooth strongly convex minimization, as exposed in Section 2.2 and obtained in [Drori and Taylor, 2021]. Therefore, no black-box first-order method can further improve this guarantee. In addition, as $A_N \ge (1-\sqrt{q})^{-2N}$, this bound provides a guarantee that $z_N$ strictly improves over $z_0$ with a convergence rate better than $(1-\sqrt{q})^2$. Furthermore, when $\mu >$
0, and as $k\to\infty$, the method's parameters $\beta_k$ and $\delta_k$ tend to those of the Triple Momentum Method by Van Scoy et al. [2017] (details below). On the other hand, when $\mu = 0$, the parameters correspond to those of the Optimized Gradient Method of Kim and Fessler [2016], which achieves the lower complexity bound for $(f(y_N)-f_\star)/\|x_0-x_\star\|^2$. Optimality of this method, together with that of OGM, allows further justifying the two/three-sequence structure of common accelerated methods, as follows: the $\{y_k\}$ sequence achieves the optimal rate (and constant) for optimizing function values when $\mu = 0$, whereas the $\{z_k\}$ sequence achieves the optimal rate (and constant) for optimizing the distance to a solution in all settings, although the bound is only weakly informative when $\mu = 0$.

1.1 Related works

Lower bounds and accelerated methods.
The method presented in this work is closely related to the celebrated fast gradient methods (FGMs) by Nesterov [1983, 2004]. Lyapunov and potential function-based analyses of FGMs were presented in many works, including in the original [Nesterov, 1983]. The analyses are usually tailored for the smooth convex minimization setting [Nesterov, 1983, Beck and Teboulle, 2009], for the smooth strongly convex one [Wilson et al., 2016, Bansal and Gupta, 2019], and sometimes deal with both simultaneously [Nesterov, 2004, Gasnikov and Nesterov, 2018]. In the large-scale quadratic smooth strongly convex minimization setting, optimal worst-case accuracies are achieved by Chebyshev and conjugate gradient methods [Nemirovskii, 1992, Nemirovski, 1999].
Performance estimation problems.
The idea of computing the worst-case accuracy of a given method through semidefinite programming dates back to Drori and Teboulle [2014]. It was refined using the concept of convex interpolation in [Taylor et al., 2017b], which allows guaranteeing that worst-case accuracies provided by the semidefinite programs actually correspond to matching examples (i.e., the certificates are tight). The approach was taken further in different directions, for analyzing and designing numerical methods in different contexts. A closely related line of works, initiated by [Lessard et al., 2016], presents such analyses from a control-theoretic perspective, and corresponds to looking for Lyapunov functions. Those works hence rather target asymptotic properties of time-invariant numerical methods, which allows using smaller-sized SDPs.
Optimized gradient methods.
The method presented in this work was first obtained as a solution to a convex optimization problem, through an approach closely related to that taken by Drori and Teboulle [2014] and Kim and Fessler [2016] for obtaining the Optimized Gradient Method. The Optimized Gradient Method for smooth convex minimization ($\mu = 0$), due to Kim and Fessler [2016], was obtained by explicitly choosing the step sizes of a method for minimizing an upper bound on the worst-case inaccuracy criterion. The resulting method was later proved to achieve the lower bound in [Drori, 2017]. When $\mu = 0$, optimal methods for the criterion $(f(x_N)-f_\star)/\|x_0-x_\star\|^2$ include the Optimized Gradient Method [Kim and Fessler, 2016, Drori, 2017] and the conjugate gradient method [Drori and Taylor, 2020]. It is also worth mentioning that optimized methods can be developed for other criteria as well. In particular, optimized methods for the criterion $\|\nabla f(x_N)\|^2/(f(x_0)-f_\star)$ are studied by Kim and Fessler [2020], in the smooth convex setting.

The design of the Triple Momentum Method [Van Scoy et al., 2017] can be performed from a Lyapunov function perspective, using an idea similar to that of the Optimized Gradient Method, but for time-independent methods (i.e., whose coefficients do not depend on the iteration counter) as in [Lessard and Seiler, 2020], relying on the integral quadratic constraint framework by Lessard et al. [2016]. The problem of devising optimized methods for smooth strongly convex minimization is also addressed in [Zhou et al., 2020, Gramlich et al., 2020], which also recover the triple momentum method.

So far, it remained unclear how to conciliate both optimal methods, as the OGM is clearly not optimal anymore when $\mu > 0$, whereas the Triple Momentum Method is not defined when $\mu = 0$.

1.2 Organization

The Information-Theoretic Exact Method is provided in Section 2, along with its worst-case analysis.
In Section 3, we describe a constructive approach that leads to the method and illustrate that it can be used for developing other methods as well. Finally, we draw some conclusions in Section 4.

1.3 Preliminaries and notations

We denote by $x_\star$ some optimal solution to (1) (which is unique if $\mu > 0$) and by $f_\star$ its optimal value.

Definition 1
Let $f:\mathbb{R}^d\to\mathbb{R}\cup\{+\infty\}$ be a proper, closed, and convex function, and consider two constants $0\le\mu<L$. We say that $f$ is $L$-smooth and $\mu$-strongly convex, denoted $f\in\mathcal{F}_{\mu,L}(\mathbb{R}^d)$, if
– ($L$-smooth) for all $x,y\in\mathbb{R}^d$, it holds that $f(x) \le f(y) + \langle\nabla f(y); x-y\rangle + \frac{L}{2}\|x-y\|^2$,
– ($\mu$-strongly convex) for all $x,y\in\mathbb{R}^d$, it holds that $f(x) \ge f(y) + \langle\nabla f(y); x-y\rangle + \frac{\mu}{2}\|x-y\|^2$.

We simply denote $f\in\mathcal{F}_{\mu,L}$ when the dimension is either clear from the context or unspecified. In addition, we use $q := \mu/L$ for the (inverse) condition number of the class (hence $0\le q<1$), excluding the case $L=\mu$ for readability purposes. Smooth strongly convex functions are associated with many inequalities, see e.g. [Nesterov, 2004, Theorem 2.1.5]. For the developments below, we need one specific inequality characterizing smooth and strongly convex functions.
Theorem 1
Let $f\in\mathcal{F}_{\mu,L}(\mathbb{R}^d)$. For all $x,y\in\mathbb{R}^d$, it holds that
$$f(y) \ge f(x) + \langle\nabla f(x); y-x\rangle + \frac{1}{2L}\|\nabla f(x)-\nabla f(y)\|^2 + \frac{\mu}{2(1-\mu/L)}\left\|x-y-\tfrac{1}{L}\left(\nabla f(x)-\nabla f(y)\right)\right\|^2.$$

This inequality turns out to be key in proving worst-case guarantees for first-order methods applied to smooth strongly convex problems, due to the following result [Taylor et al., 2017b, Theorem 4].
Theorem 2 ($\mathcal{F}_{\mu,L}$-interpolation) Let $I$ be an index set and $S = \{(x_i,g_i,f_i)\}_{i\in I}\subseteq\mathbb{R}^d\times\mathbb{R}^d\times\mathbb{R}$ be a set of triplets. There exists $f\in\mathcal{F}_{\mu,L}(\mathbb{R}^d)$ satisfying $f(x_i)=f_i$ and $g_i\in\partial f(x_i)$ for all $i\in I$ if and only if
$$f_i \ge f_j + \langle g_j; x_i-x_j\rangle + \frac{1}{2L}\|g_i-g_j\|^2 + \frac{\mu}{2(1-\mu/L)}\left\|x_i-x_j-\tfrac{1}{L}(g_i-g_j)\right\|^2$$
holds for all $i,j\in I$.

In the method below, we use an additional sequence $\{x_k\}$ for notational convenience.

Information-Theoretic Exact Method (ITEM)
Input: $f\in\mathcal{F}_{\mu,L}$ with $0\le\mu<L<\infty$, initial guess $x_0\in\mathbb{R}^d$.
Initialization: $z_0 = x_0$, $A_0 = 0$, $q = \mu/L$.
For $k = 0,1,\ldots,N-1$:
$$A_{k+1} = \frac{(1+q)A_k+2\left(1+\sqrt{(1+A_k)(1+qA_k)}\right)}{(1-q)^2}$$
Set $\beta_k = \frac{A_k}{(1-q)A_{k+1}}$, and $\delta_k = \frac{1}{2}\,\frac{(1-q)^2A_{k+1}-(1+q)A_k}{1+q+qA_k}$
$$y_k = (1-\beta_k)z_k+\beta_k x_k$$
$$x_{k+1} = y_k-\tfrac{1}{L}\nabla f(y_k)$$
$$z_{k+1} = (1-q\delta_k)z_k+q\delta_k y_k-\tfrac{\delta_k}{L}\nabla f(y_k). \tag{3}$$
Output: approximate solutions $(y_{N-1}, x_N, z_N)$

Before providing the theory for this method, we inspect its limit cases. First, when $\mu = 0$, ITEM can be compared to the Optimized Gradient Method of Kim and Fessler [2016]. In their notations, we denote by $\theta_k = \frac{\sqrt{A_{k+1}}}{2}$ a sequence that can alternatively be defined recursively as $\theta_0 = 1$ and $\theta_{k+1} = \frac{1+\sqrt{1+4\theta_k^2}}{2}$. In this setting, the parameters correspond to
$$\beta_k = \frac{A_k}{A_{k+1}}, \qquad \delta_k = \frac{A_{k+1}-A_k}{2},$$
and we recover, using Kim and Fessler [2016]'s notations (using the identity $\theta_k^2 = \theta_{k-1}^2+\theta_k$)
$$y_k = \frac{\theta_k-1}{\theta_k}x_k + \frac{1}{\theta_k}z_k$$
$$x_{k+1} = y_k - \tfrac{1}{L}\nabla f(y_k)$$
$$z_{k+1} = z_k - \frac{2\theta_k}{L}\nabla f(y_k).$$
Note though that Kim and Fessler [2016] use a "last iteration adjustment" by setting $\theta_N = \frac{1+\sqrt{1+8\theta_{N-1}^2}}{2}$. This adjustment is not needed for the purpose of obtaining the optimal bound on $\|z_N-x_\star\|^2$.
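As a sanity check on this correspondence (our own illustration, not part of the original text), the snippet below iterates the recursion for $A_k$ with $\mu = 0$ and verifies that $\theta_k = \sqrt{A_{k+1}}/2$ indeed satisfies the OGM recursion $\theta_{k+1} = (1+\sqrt{1+4\theta_k^2})/2$, together with the growth bound $A_k \ge k^2$ used later in the analysis:

```python
import math

def A_next(A, q):
    # One step of the ITEM sequence:
    # A_{k+1} = ((1+q) A_k + 2 (1 + sqrt((1+A_k)(1+q A_k)))) / (1-q)^2
    return ((1 + q) * A + 2 * (1 + math.sqrt((1 + A) * (1 + q * A)))) / (1 - q) ** 2

# mu = 0 case: the recursion reduces to A_{k+1} = 2 + A_k + 2 sqrt(1 + A_k)
A = [0.0]
for k in range(20):
    A.append(A_next(A[-1], q=0.0))

# OGM correspondence: theta_k = sqrt(A_{k+1}) / 2 satisfies theta_0 = 1 and
# theta_{k+1} = (1 + sqrt(1 + 4 theta_k^2)) / 2
theta = [math.sqrt(a) / 2 for a in A[1:]]
ogm_ok = all(
    abs(theta[k + 1] - (1 + math.sqrt(1 + 4 * theta[k] ** 2)) / 2) < 1e-9
    for k in range(len(theta) - 1)
)

# growth bound used in the worst-case analysis when mu = 0: A_k >= k^2
growth_ok = all(A[k] >= k ** 2 for k in range(len(A)))
```

Running this confirms both properties along the first twenty iterations (the first values are $A_1 = 4$, hence $\theta_0 = 1$).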
Second, when $\mu > 0$ and as $k\to\infty$, one can explicitly compute the limits of the algorithmic parameters
$$\lim_{k\to\infty}\frac{A_k}{A_{k+1}} = \lim_{A_k\to\infty}\frac{(1-q)^2A_k}{(1+q)A_k+2\left(1+\sqrt{(1+A_k)(1+qA_k)}\right)} = \frac{(1-q)^2}{(1+\sqrt{q})^2} = (1-\sqrt{q})^2$$
$$\lim_{k\to\infty}\beta_k = \lim_{k\to\infty}\frac{A_k}{(1-q)A_{k+1}} = \frac{1-\sqrt{q}}{1+\sqrt{q}}$$
$$\lim_{k\to\infty}\delta_k = \lim_{k\to\infty}\frac{1}{2}\,\frac{(1-q)^2A_{k+1}-(1+q)A_k}{1+q+qA_k} = \frac{1}{2}\,\frac{(1-q)^2-(1+q)(1-\sqrt{q})^2}{q(1-\sqrt{q})^2} = \frac{1}{\sqrt{q}},$$
reaching
$$y_k = \frac{1-\sqrt{q}}{1+\sqrt{q}}\left(y_{k-1}-\tfrac{1}{L}\nabla f(y_{k-1})\right)+\left(1-\frac{1-\sqrt{q}}{1+\sqrt{q}}\right)z_k$$
$$z_{k+1} = \sqrt{q}\left(y_k-\tfrac{1}{\mu}\nabla f(y_k)\right)+(1-\sqrt{q})z_k,$$
which is the Triple Momentum Method [Van Scoy et al., 2017] and its convergence rate $(1-\sqrt{q})^2$.

The main results concerning Algorithm (3) are stated as follows. Firstly, we provide a bound on $\|z_N-x_\star\|^2$, and secondly, we present a bound involving function values, which is more relevant as $\mu\to 0$. A proof for this theorem is provided in the next section.
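Before proceeding, the limits above are easy to confirm numerically. The sketch below (our illustration, with an arbitrarily chosen $q = 0.01$) iterates the parameter recursions and compares $\beta_k$ and $\delta_k$ with the Triple Momentum limits $(1-\sqrt{q})/(1+\sqrt{q})$ and $1/\sqrt{q}$:

```python
import math

q = 0.01  # inverse condition ratio mu / L (arbitrary choice for illustration)
A = 0.0
for k in range(500):
    A_new = ((1 + q) * A + 2 * (1 + math.sqrt((1 + A) * (1 + q * A)))) / (1 - q) ** 2
    beta = A / ((1 - q) * A_new)
    delta = 0.5 * ((1 - q) ** 2 * A_new - (1 + q) * A) / (1 + q + q * A)
    A = A_new

sq = math.sqrt(q)
beta_limit = (1 - sq) / (1 + sq)   # Triple Momentum Method momentum parameter
delta_limit = 1 / sq               # Triple Momentum Method step parameter
```

After a few hundred iterations, `beta` and `delta` agree with `beta_limit` and `delta_limit` to machine precision, since $A_k/A_{k+1}$ converges geometrically to $(1-\sqrt{q})^2$.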
Theorem 3
Let $f\in\mathcal{F}_{\mu,L}$ and denote $q = \mu/L$. For any $x_0 = z_0\in\mathbb{R}^d$ and $N\in\mathbb{N}$ with $N\ge 1$, the iterates of (3) satisfy
$$\|z_N-x_\star\|^2 \le \frac{1}{1+qA_N}\|z_0-x_\star\|^2 \le \frac{(1-\sqrt{q})^{2N}}{(1-\sqrt{q})^{2N}+q}\|z_0-x_\star\|^2,$$
$$\psi_N \le \frac{L}{2(1-q)A_{N+1}}\|z_0-x_\star\|^2 \le \min\left\{(1-\sqrt{q})^{2(N+1)},\,\frac{1}{(N+1)^2}\right\}\frac{L}{2(1-q)}\|z_0-x_\star\|^2,$$
with
$$\psi_N = f(y_N)-f_\star-\frac{1}{2L}\|\nabla f(y_N)\|^2-\frac{\mu}{2(1-\mu/L)}\left\|y_N-\tfrac{1}{L}\nabla f(y_N)-x_\star\right\|^2 \ge 0.$$

This $\psi_N$ is related to a potential (or Lyapunov) function that turns out to be key in the analysis of the method, as provided in the next section.

2.1 Worst-case analysis

For performing the analysis, we use a potential function argument (see e.g. the nice review by Bansal and Gupta [2019]) similar to those used for standard accelerated methods [Nesterov, 1983, Beck and Teboulle, 2009]. We show that for all $y_{k-1}, z_k\in\mathbb{R}^d$ and $A_k\ge 0$, the function
$$\begin{aligned}\phi_k &= (1-q)A_k\,\psi_{k-1}+\frac{L+\mu A_k}{2}\|z_k-x_\star\|^2\\ &= (1-q)A_k\left[f(y_{k-1})-f_\star-\frac{1}{2L}\|\nabla f(y_{k-1})\|^2-\frac{\mu}{2(1-\mu/L)}\left\|y_{k-1}-\tfrac{1}{L}\nabla f(y_{k-1})-x_\star\right\|^2\right]+\frac{L+\mu A_k}{2}\|z_k-x_\star\|^2\end{aligned} \tag{4}$$
satisfies $\phi_{k+1}\le\phi_k$ when $z_{k+1}$, $y_k$, and $A_{k+1}$ are generated from one iteration of (3).

Lemma 1
Let $f\in\mathcal{F}_{\mu,L}$. For any $y_{k-1}, z_k\in\mathbb{R}^d$ and $A_k\ge 0$, two consecutive iterations of (3) satisfy $\phi_{k+1}\le\phi_k$, with $A_{k+1}$ being defined as in (3).

Proof We perform a weighted sum of two inequalities due to Theorem 1 (note that $y_{-1}$ does not intervene in the computations, as $A_0 = 0$):
– smoothness and strong convexity of $f$ between $x_\star$ and $y_k$, with weight $\lambda_1 = (1-q)(A_{k+1}-A_k)$:
$$f_\star \ge f(y_k)+\langle\nabla f(y_k); x_\star-y_k\rangle+\frac{1}{2L}\|\nabla f(y_k)\|^2+\frac{\mu}{2(1-q)}\left\|y_k-x_\star-\tfrac{1}{L}\nabla f(y_k)\right\|^2,$$
– smoothness and strong convexity between $y_{k-1}$ and $y_k$, with weight $\lambda_2 = (1-q)A_k$:
$$f(y_{k-1}) \ge f(y_k)+\langle\nabla f(y_k); y_{k-1}-y_k\rangle+\frac{1}{2L}\|\nabla f(y_k)-\nabla f(y_{k-1})\|^2+\frac{\mu}{2(1-q)}\left\|y_k-y_{k-1}-\tfrac{1}{L}\left(\nabla f(y_k)-\nabla f(y_{k-1})\right)\right\|^2.$$
Summing up and reorganizing those two inequalities (without substituting $A_{k+1}$ by its expression, for now), we arrive at the following valid inequality
$$\begin{aligned}0 \ge{}& \lambda_1\left[f(y_k)-f_\star+\langle\nabla f(y_k); x_\star-y_k\rangle+\frac{1}{2L}\|\nabla f(y_k)\|^2+\frac{\mu}{2(1-q)}\left\|y_k-x_\star-\tfrac{1}{L}\nabla f(y_k)\right\|^2\right]\\ &+\lambda_2\left[f(y_k)-f(y_{k-1})+\langle\nabla f(y_k); y_{k-1}-y_k\rangle+\frac{1}{2L}\|\nabla f(y_k)-\nabla f(y_{k-1})\|^2+\frac{\mu}{2(1-q)}\left\|y_k-y_{k-1}-\tfrac{1}{L}\left(\nabla f(y_k)-\nabla f(y_{k-1})\right)\right\|^2\right].\end{aligned}$$
Substituting
$$y_k = (1-\beta_k)z_k+\beta_k\left(y_{k-1}-\tfrac{1}{L}\nabla f(y_{k-1})\right)$$
$$z_{k+1} = (1-q\delta_k)z_k+q\delta_k y_k-\tfrac{\delta_k}{L}\nabla f(y_k)$$
(note that this substitution is also valid when $k=0$, as $\beta_k = 0$ in this case, and hence $y_0 = z_0$), the weighted sum can be reformulated exactly as (this can be verified by expanding both expressions and matching them on a term-by-term basis)
$$\phi_{k+1} \le \phi_k - LK_1P(A_{k+1})\|z_k-x_\star\|^2 + LK_2P(A_{k+1})\left\|(1-q)A_{k+1}\nabla f(y_k)-\mu A_k\left(y_{k-1}-x_\star-\tfrac{1}{L}\nabla f(y_{k-1})\right)+K_3\,\mu(z_k-x_\star)\right\|^2$$
with three constants (well defined given that $0\le\mu<L$ and $A_k, A_{k+1}\ge 0$)
$$K_1 = \frac{q}{(1+q)+(1-q)qA_{k+1}},\qquad K_2 = \frac{(1+q)+(1-q)qA_{k+1}}{(1-q)(1+q+qA_k)A_{k+1}},\qquad K_3 = \frac{(1+q)^2A_k-(1-q)(2+qA_k)A_{k+1}}{(1+q)+(1-q)qA_{k+1}},$$
as well as
$$P(A_{k+1}) = \left(A_k-(1-q)A_{k+1}\right)^2-4A_{k+1}(1+qA_k).$$
For obtaining the desired potential inequality, we simply pick $A_{k+1}$ such that $A_{k+1}\ge A_k$ and $P(A_{k+1}) = 0$, reaching the claim $\phi_{k+1}\le\phi_k$ as well as the choice for $A_{k+1}$. ⊓⊔

We are now armed for proving our main result, presented in Theorem 3.
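As a quick numerical sanity check (ours; it is not part of the original argument), one can verify that the value of $A_{k+1}$ produced by the recursion in (3) is indeed a root of $P$:

```python
import math

def item_A_next(A, q):
    # A_{k+1} from (3): ((1+q) A_k + 2 (1 + sqrt((1+A_k)(1+q A_k)))) / (1-q)^2
    return ((1 + q) * A + 2 * (1 + math.sqrt((1 + A) * (1 + q * A)))) / (1 - q) ** 2

def P(A_k, A_next, q):
    # P(A_{k+1}) = (A_k - (1-q) A_{k+1})^2 - 4 A_{k+1} (1 + q A_k)
    return (A_k - (1 - q) * A_next) ** 2 - 4 * A_next * (1 + q * A_k)

q = 0.1
A = 0.0
residuals = []
for k in range(30):
    A_new = item_A_next(A, q)
    # relative residual of P at the value chosen by the recursion
    residuals.append(P(A, A_new, q) / max(1.0, A_new ** 2))
    A = A_new

max_residual = max(abs(r) for r in residuals)
```

Indeed, the recursion is exactly the larger root of the quadratic (in $A_{k+1}$) equation $P(A_{k+1}) = 0$, so `max_residual` stays at the level of round-off error.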
Proof (Theorem 3)
From Lemma 1, we get $\phi_N \le \phi_{N-1} \le \ldots \le \phi_0 = \frac{L}{2}\|z_0-x_\star\|^2$. From Theorem 1 (evaluated at $x\leftarrow x_\star$ and $y\leftarrow y_{N-1}$), we have that $\frac{L+\mu A_N}{2}\|z_N-x_\star\|^2 \le \phi_N$, reaching
$$\|z_N-x_\star\|^2 \le \frac{2\phi_0}{L+\mu A_N} = \frac{1}{1+qA_N}\|z_0-x_\star\|^2.$$

(The puzzled reader can verify the reformulation of the previous proof using base symbolic computations; we provide a notebook for verifying the equivalence of the expressions in Section 4.)
Similarly, we have that $(1-q)A_{N+1}\psi_N \le \phi_{N+1} \le \phi_N$ and hence
$$\psi_N \le \frac{\phi_0}{(1-q)A_{N+1}} = \frac{L}{2(1-q)A_{N+1}}\|z_0-x_\star\|^2.$$
For reaching the claims, it therefore remains to characterize the growth rate of $\{A_k\}$. Because solving the recurrence for $\{A_k\}$ appears to be out of reach, we consider the classical two scenarios for bounding its growth rate. First, when $\mu = 0$,
$$A_{k+1} = 2+A_k+2\sqrt{1+A_k} \ge 1+A_k+2\sqrt{A_k} = \left(1+\sqrt{A_k}\right)^2,$$
reaching $\sqrt{A_{k+1}} \ge 1+\sqrt{A_k}$ and hence $\sqrt{A_k} \ge k$ and $A_k \ge k^2$. Second, when $\mu > 0$, one also has
$$A_{k+1} = \frac{(1+q)A_k+2\left(1+\sqrt{(1+A_k)(1+qA_k)}\right)}{(1-q)^2} \ge \frac{(1+q)A_k+2\sqrt{q}A_k}{(1-q)^2} = \frac{A_k}{(1-\sqrt{q})^2},$$
reaching $A_N \ge (1-\sqrt{q})^{-2N}$ using $A_1 = \frac{4}{(1-q)^2} = \frac{4}{(1+\sqrt{q})^2(1-\sqrt{q})^2} \ge (1-\sqrt{q})^{-2}$, and concluding the proof. ⊓⊔

Remark 1
The analysis can be simplified when directly dealing with the two limit cases mentioned above. First, in the case $\mu = 0$, we reach
$$A_{k+1}\left(f(y_k)-f_\star-\tfrac{1}{2L}\|\nabla f(y_k)\|^2\right)+\tfrac{L}{2}\|z_{k+1}-x_\star\|^2 \le A_k\left(f(y_{k-1})-f_\star-\tfrac{1}{2L}\|\nabla f(y_{k-1})\|^2\right)+\tfrac{L}{2}\|z_k-x_\star\|^2$$
with $A_{k+1} = 2+A_k+2\sqrt{A_k+1} = \left(1+\sqrt{A_k+1}\right)^2$, recovering [Taylor and Bach, 2019, Theorem 11], which is the potential function for the Optimized Gradient Method. When $\mu > 0$, dividing both sides of the inequality $\phi_{k+1}\le\phi_k$ by $A_k$ and taking the limits as $A_k\to\infty$ allows obtaining
$$\begin{aligned}\rho^{-2}&\left[f(y_k)-f_\star-\tfrac{1}{2L}\|\nabla f(y_k)\|^2-\tfrac{\mu}{2(1-q)}\left\|y_k-\tfrac{1}{L}\nabla f(y_k)-x_\star\right\|^2+\tfrac{\mu}{2(1-q)}\|z_{k+1}-x_\star\|^2\right]\\ \le{}&\left[f(y_{k-1})-f_\star-\tfrac{1}{2L}\|\nabla f(y_{k-1})\|^2-\tfrac{\mu}{2(1-q)}\left\|y_{k-1}-\tfrac{1}{L}\nabla f(y_{k-1})-x_\star\right\|^2+\tfrac{\mu}{2(1-q)}\|z_k-x_\star\|^2\right]\end{aligned}$$
with $\rho = 1-\sqrt{q}$, which is precisely the known Lyapunov function for the Triple Momentum Method [Cyrus et al., 2018, Inequality (10)].

2.2 Matching examples and lower bound

In this section, we show the correspondence with the lower bound. Then, we also provide two simple one-dimensional examples on which the method achieves its worst case. First, the lower bound from [Drori and Taylor, 2021, Corollary 4] states that for any black-box first-order method, there exists $f\in\mathcal{F}_{\mu,L}$ such that
$$\|x_N-x_\star\| \ge \frac{\lambda_N}{\sqrt{q}}\|x_0-x_\star\|,$$
where $x_\star = \operatorname{argmin}_x f(x)$, $x_N$ is the output of the black-box first-order method under consideration, and where the sequence $\{\lambda_i\}$ is defined recursively as $\lambda_0 = \sqrt{q}$ and
$$\lambda_{k+1} = \frac{1-\sqrt{q-(1-q)\lambda_k^2}}{1+\lambda_k^2}\,\lambda_k.$$
Let us show that it matches the upper bound provided by Theorem 3. One can verify the correspondence
$$\lambda_k = \frac{\sqrt{q}}{\sqrt{1+qA_k}},$$
by verifying that it is compatible with $A_0 = 0$ and through a recurrence argument. That is, assuming $\lambda_k = \frac{\sqrt{q}}{\sqrt{1+qA_k}}$, it is relatively simple to establish that
$$\lambda_{k+1} = \frac{1-\sqrt{q-(1-q)\frac{q}{1+qA_k}}}{1+\frac{q}{1+qA_k}}\,\frac{\sqrt{q}}{\sqrt{1+qA_k}} = \frac{\sqrt{q}}{\sqrt{1+qA_{k+1}}}.$$
Before moving on, let us mention that (3) also attains its worst-case bound on simple one-dimensional functions, including the two base quadratic functions
$$f_L(x) = \tfrac{L}{2}|x|^2,\qquad f_\mu(x) = \tfrac{\mu}{2}|x|^2.$$
In other words, the guarantee
$$\|z_N-x_\star\|^2 \le \frac{\|x_0-x_\star\|^2}{1+qA_N}$$
holds with equality on both $f_L(.)$ and $f_\mu(.)$. Equivalently, it holds with equality on the separable two-dimensional function
$$f(x) = \frac{1}{2}x^\top\begin{pmatrix}L & 0\\ 0 & \mu\end{pmatrix}x$$
for any starting point $x_0\in\mathbb{R}^2$.

Lemma 2
Let $0<\mu<L<\infty$, and $f_L, f_\mu\in\mathcal{F}_{\mu,L}(\mathbb{R})$ with $f_L(x) = \frac{L}{2}x^2$ and $f_\mu(x) = \frac{\mu}{2}x^2$. The iterates of ITEM (3) satisfy
$$z_k^2 = \frac{z_0^2}{1+qA_k}$$
when applied to either $f_L$ or $f_\mu$.

Proof We proceed by recurrence. It is clear that $z_0^2 = \frac{z_0^2}{1+qA_0}$ (recall $A_0 = 0$), which establishes the base recurrence case.
(i) Let us start with $f_L$. It is clear from explicit computations that for all $y_k\in\mathbb{R}$, $x_{k+1} = y_k-\frac{1}{L}\nabla f_L(y_k) = x_\star = 0$. Therefore, we have $y_k = \frac{1}{L}\nabla f_L(y_k)$ along with $y_k = (1-\beta_k)z_k$ (this also trivially holds for $k=0$, as in this case $\beta_k = 0$ and $y_0 = z_0 = x_0$), and therefore
$$z_{k+1} = z_k+q\delta_k(y_k-z_k)-\delta_k y_k = \left((1-q)\beta_k\delta_k-\delta_k+1\right)z_k.$$
Substituting the expressions of $\beta_k$, $\delta_k$, $A_{k+1}$, and $z_k^2 = \frac{z_0^2}{1+qA_k}$ in this equality (squared) leads to
$$z_{k+1}^2 = \frac{\left(1+qA_k-q\sqrt{(1+A_k)(1+qA_k)}\right)^2}{(1+qA_k)(1+q+qA_k)^2}\,z_0^2 = \frac{z_0^2}{1+qA_{k+1}},$$
where the last equality can be verified by base algebra.
(ii) We proceed with $f_\mu$. In this case, for all $y_k\in\mathbb{R}$ we have $y_k-\frac{1}{\mu}\nabla f_\mu(y_k) = x_\star = 0$. Therefore,
$$z_{k+1} = (1-q\delta_k)z_k.$$
Substituting the expression of $\delta_k$ and the recurrence hypothesis $z_k^2 = \frac{z_0^2}{1+qA_k}$, we arrive at the same expression as before
$$z_{k+1}^2 = \frac{\left(1+qA_k-q\sqrt{(1+A_k)(1+qA_k)}\right)^2}{(1+qA_k)(1+q+qA_k)^2}\,z_0^2 = \frac{z_0^2}{1+qA_{k+1}},$$
reaching the desired claim. ⊓⊔

We established that the ITEM achieves a lower complexity bound for smooth strongly convex minimization. The next section provides a constructive way of discovering the method.
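Lemma 2 is straightforward to reproduce numerically. The sketch below (our illustration; helper names are ours) runs ITEM (3) on both one-dimensional quadratics and checks that $z_k^2 = z_0^2/(1+qA_k)$ holds along the whole trajectory, i.e., that the bound of Theorem 3 is attained with equality:

```python
import math

def item_1d(grad, L, mu, x0, N):
    """Run ITEM (3) on a one-dimensional function with gradient `grad`.

    Returns the lists (z_0, ..., z_N) and (A_0, ..., A_N)."""
    q = mu / L
    z, x, A = x0, x0, 0.0
    zs, As = [z], [A]
    for k in range(N):
        A_next = ((1 + q) * A + 2 * (1 + math.sqrt((1 + A) * (1 + q * A)))) / (1 - q) ** 2
        beta = A / ((1 - q) * A_next)
        delta = 0.5 * ((1 - q) ** 2 * A_next - (1 + q) * A) / (1 + q + q * A)
        y = (1 - beta) * z + beta * x
        x = y - grad(y) / L
        z = (1 - q * delta) * z + q * delta * y - delta * grad(y) / L
        A = A_next
        zs.append(z)
        As.append(A)
    return zs, As

L, mu, x0, N = 1.0, 0.1, 1.0, 15
q = mu / L
worst = 0.0
# f_L(x) = L x^2 / 2 and f_mu(x) = mu x^2 / 2, both with minimizer x_* = 0
for g in (lambda y: L * y, lambda y: mu * y):
    zs, As = item_1d(g, L, mu, x0, N)
    for z, A in zip(zs, As):
        worst = max(worst, abs(z ** 2 - x0 ** 2 / (1 + q * A)))
```

Up to round-off, `worst` is zero on both functions, matching the equality claim of Lemma 2.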
The intent of this section is to provide a constructive procedure for obtaining the Information-Theoretic Exact Method. We would like to emphasize that although ITEM was derived using the technique described below, its proof is independent of the following, as provided in earlier sections.

As a starting point, consider the class of black-box first-order methods gathering information about the objective function $f$ only by evaluating an oracle $\mathcal{O}_f(x) = (f(x),\nabla f(x))$. We describe such a black-box method $M$ as a set of rules $\{M_1, M_2,\ldots,M_N\}$ for forming its iterates, which we denote by $w_k$ for avoiding confusions with any sequences of the ITEM, as
$$w_1 = M_1(w_0, \mathcal{O}_f(w_0))$$
$$w_2 = M_2(w_0, \mathcal{O}_f(w_0), \mathcal{O}_f(w_1))$$
$$\vdots$$
$$w_N = M_N(w_0, \mathcal{O}_f(w_0), \mathcal{O}_f(w_1),\ldots,\mathcal{O}_f(w_{N-1})),$$
and we denote by $\mathcal{M}_N$ the set of black-box first-order methods that perform $N$ gradient evaluations. Furthermore, we call the efficiency estimate of a method $M$ the following quantity
$$W_{\mu,L}(M) = \max_{f\in\mathcal{F}_{\mu,L}}\left\{\frac{\|w_N-w_\star\|^2}{\|w_0-w_\star\|^2}:\ \text{for any sequence } w_1,\ldots,w_N \text{ generated by } M \text{ on } f, \text{ initiated at some } w_0, \text{ and } w_\star\in\operatorname{argmin}_w f(w)\right\}, \tag{5}$$
which corresponds to the worst-case performance of $M$ on the class $\mathcal{F}_{\mu,L}$ for the criterion $\frac{\|w_N-w_\star\|^2}{\|w_0-w_\star\|^2}$. A direct consequence of Theorem 3 and the lower complexity bound discussed in Section 2.2 is that ITEM belongs to the class of black-box first-order methods with optimal performances with respect to $W_{\mu,L}(M)$. ITEM is therefore a solution to
$$\min_{M\in\mathcal{M}_N} W_{\mu,L}(M). \tag{6}$$
Although this minimax problem appears hard to solve directly, we illustrate below that it can be approached using semidefinite programming.

In a nutshell, we consider two simplified upper bounds to this minimax problem. First, we consider a subclass of black-box first-order methods, referred to as fixed-step first-order methods.
Those are first-order methods that are described by a set of normalized coefficients $\{\alpha_{i,j}\}$, and whose formal description is provided below. Second, given a fixed-step first-order method $M$, the idea is to develop a tractable upper bound on the efficiency estimate of $M$, written $\mathrm{UB}_{\mu,L}(M)$ and such that $\mathrm{UB}_{\mu,L}(M) \ge W_{\mu,L}(M)$. After that, we show that minimization over $M$ is also tractable for this upper bound. That is, we can solve $\min_{\{\alpha_{i,j}\}}\mathrm{UB}_{\mu,L}(M)$ and obtain the Information-Theoretic Exact Method as a solution. As a comparison, let us mention that the Optimized Gradient Method [Drori and Teboulle, 2014, Kim and Fessler, 2016] was obtained through similar steps for the objective $(f(w_N)-f_\star)/\|w_0-w_\star\|^2$ when $\mu = 0$.

More precisely, we proceed as follows:
– In Section 3.1, we describe the class of fixed-step first-order methods. This class of methods is somewhat natural and contains classical numerical methods such as gradient, heavy-ball, and accelerated gradient methods, but excludes adaptive methods.
– In Sections 3.2 and 3.3, we detail a tractable upper bound $\mathrm{UB}_{\mu,L}(M)$ which can be computed numerically through semidefinite programming.
– In Section 3.4, we show how to render $\min_{\{\alpha_{i,j}\}}\mathrm{UB}_{\mu,L}(M)$ tractable, yielding the Information-Theoretic Exact Method as a solution.

We complement those developments with numerical examples of the design procedure, as well as applications to alternate design criteria that include $(f(w_N)-f_\star)/\|w_0-w_\star\|^2$, in Appendix D.

An alternate parametrization of the methods in $\mathcal{M}_N$ allows for more convenient formulations of optimization problems over the class of methods, such as the minimax problem (6). We start with the following "natural" description of the class of methods of interest, and then introduce an alternate parametrization which is more convenient for the step-size optimization procedure of the following sections.

Definition 2
A black-box first-order method is called a fixed-step first-order method if there is a set $\{h_{i,j}\}\subset\mathbb{R}$ which allows describing the iterates of the method as
$$w_1 = w_0-\tfrac{h_{1,0}}{L}\nabla f(w_0)$$
$$w_2 = w_1-\tfrac{h_{2,0}}{L}\nabla f(w_0)-\tfrac{h_{2,1}}{L}\nabla f(w_1)$$
$$\vdots$$
$$w_N = w_{N-1}-\sum_{i=0}^{N-1}\tfrac{h_{N,i}}{L}\nabla f(w_i). \tag{7}$$
In what follows, we restrict ourselves to these fixed-step first-order methods. However, we parameterize them in a slightly different, but equivalent, fashion. Informally, the alternate parametrization allows formulating the maximization problem arising in the efficiency estimate $W_{\mu,L}(M)$ (see (5)) in a more convenient way. In short, we show below that computing $W_{\mu,L}(M)$ can be formulated as a semidefinite program. This convex formulation of $W_{\mu,L}(M)$ contains a linear matrix inequality that is naturally quadratic in terms of $\{h_{i,j}\}$, but linear in terms of $\{\alpha_{i,j}\}$, the new parameters introduced below.

More precisely, we express the method using gradient steps on a function $\tilde f$, defined by
$$\tilde f(x) := f(x)-\tfrac{\mu}{2}\|x-w_\star\|^2,$$
where $w_\star$ is a minimizer of both $f$ and $\tilde f$. It follows from $f\in\mathcal{F}_{\mu,L}$ that $\tilde f\in\mathcal{F}_{0,L-\mu}$ (see e.g. [Nesterov, 2004]). Then, one can express (7) in terms of evaluations of the gradient of $\tilde f$, instead of that of $f$. Concretely, we reformulate (7) in terms of coefficients $\{\alpha_{i,j}\}$ as follows
$$w_1-w_\star = (w_0-w_\star)\left(1-\tfrac{\mu}{L}\alpha_{1,0}\right)-\tfrac{\alpha_{1,0}}{L}\nabla\tilde f(w_0)$$
$$w_2-w_\star = (w_0-w_\star)\left(1-\tfrac{\mu}{L}(\alpha_{2,0}+\alpha_{2,1})\right)-\tfrac{\alpha_{2,0}}{L}\nabla\tilde f(w_0)-\tfrac{\alpha_{2,1}}{L}\nabla\tilde f(w_1)$$
$$\vdots$$
$$w_N-w_\star = (w_0-w_\star)\left(1-\tfrac{\mu}{L}\sum_{i=0}^{N-1}\alpha_{N,i}\right)-\sum_{i=0}^{N-1}\tfrac{\alpha_{N,i}}{L}\nabla\tilde f(w_i). \tag{8}$$
One can show that there is a bijection between representations (7) and (8). Therefore, the problem of designing an optimal method in the form (7) is equivalent to that of devising an optimal method in the form (8). This is formalized by the following lemma.

Lemma 3
Let $N\in\mathbb{N}$, $f\in\mathcal{F}_{\mu,L}$, $w_\star\in\operatorname{argmin}_w f(w)$, and a sequence $\{w_k\}_{k=0,\ldots,N}\subset\mathbb{R}^d$. The following statements are equivalent.
– There exists a set $\{h_{i,j}\}_{i=1,\ldots,N;\,j=0,\ldots,i-1}$ such that the sequence $\{w_k\}_{k=0,\ldots,N}$ satisfies (7).
– There exists a set $\{\alpha_{i,j}\}_{i=1,\ldots,N;\,j=0,\ldots,i-1}$ such that the sequence $\{w_k\}_{k=0,\ldots,N}$ satisfies (8).

Proof The proof follows from a short recurrence argument (provided in Appendix A) for showing that the two representations are isomorphic, and that they are linked through the following triangular system of equations
$$\alpha_{k+1,i} = \begin{cases} h_{k+1,k} & \text{if } i = k,\\ h_{k+1,i}+\alpha_{k,i}-\tfrac{\mu}{L}\sum_{j=i+1}^{k}h_{k+1,j}\alpha_{j,i} & \text{if } 0\le i<k.\end{cases} \tag{9}$$
Therefore, although we use (8) in the following sections, any method formulated in terms of $\{\alpha_{i,j}\}$ can be converted to the more natural $\{h_{i,j}\}$ notation, and reciprocally. ⊓⊔

The efficiency estimate (5) of a fixed-step first-order method $M$ can be written as the following maximization problem
$$\begin{aligned}\max_{\tilde f,\,d\in\mathbb{N},\,\{w_i\}\subset\mathbb{R}^d}\ &\frac{\|w_N-w_\star\|^2}{\|w_0-w_\star\|^2}\\ \text{s.t.}\ & w_k \text{ generated by (8) applied on } \tilde f, \text{ and initiated at some } w_0,\\ &\tilde f\in\mathcal{F}_{0,L-\mu}(\mathbb{R}^d),\\ & w_\star\in\operatorname{argmin}_w \tilde f(w),\end{aligned}$$
where we use an index set $I = \{\star, 0, 1,\ldots,N\}$. Note the maximization over $d$, which aims at obtaining dimension-independent guarantees.

As a first step towards an "efficient" upper bound, we reformulate (5) using an extension (or interpolation) argument. That is, the previous maximization problem can be restated using an existence argument for replacing the function by a finite set of samples. In other words, we optimize over the oracle's responses while keeping the responses consistent with the assumptions on $f$
$$\begin{aligned}\max_{\{(w_i,g_i,f_i)\}_{i\in I}\subset\mathbb{R}^d\times\mathbb{R}^d\times\mathbb{R},\,d\in\mathbb{N}}\ &\frac{\|w_N-w_\star\|^2}{\|w_0-w_\star\|^2}\\ \text{s.t.}\ & w_k \text{ generated by (8) for } k=1,\ldots,N,\\ &\exists\,\tilde f\in\mathcal{F}_{0,L-\mu}(\mathbb{R}^d):\ g_i\in\partial\tilde f(w_i),\ f_i = \tilde f(w_i)\ \forall i\in I,\quad g_\star = 0.\end{aligned}$$
Using a homogeneity argument, one can reformulate this problem without the fractional objective. More precisely, for any feasible point $S = \{(w_i,g_i,f_i)\}_{i\in I}$ and any $\alpha > 0$, the point $S' = \{(\alpha w_i,\alpha g_i,\alpha^2 f_i)\}_{i\in I}$ is also feasible while reaching the same objective value. We can therefore arbitrarily fix the scale of the problem to $\|w_0-w_\star\|^2 = 1$, reaching the following problem with the same optimal value
$$\begin{aligned}\max\ &\|w_N-w_\star\|^2\\ \text{s.t.}\ &\|w_0-w_\star\|^2 = 1,\\ & w_k \text{ generated by (8) for } k=1,\ldots,N,\\ &\exists\,\tilde f\in\mathcal{F}_{0,L-\mu}(\mathbb{R}^d):\ g_i\in\partial\tilde f(w_i),\ f_i = \tilde f(w_i)\ \forall i\in I,\quad g_\star = 0.\end{aligned}$$
It follows from Theorem 2 that the previous problem can be reformulated exactly as
$$\begin{aligned}\max\ &\|w_N-w_\star\|^2\\ \text{s.t.}\ &\|w_0-w_\star\|^2 = 1,\\ & w_k \text{ generated by (8) for } k=1,\ldots,N,\\ & f_i \ge f_j+\langle g_j; w_i-w_j\rangle+\tfrac{1}{2(L-\mu)}\|g_i-g_j\|^2\quad\forall i,j\in I,\\ & g_\star = 0.\end{aligned} \tag{10}$$
Whereas the equivalence between the two previous problems might be regarded as technical, the fact that (10) produces upper bounds on (5) is quite direct. Indeed, any $\tilde f\in\mathcal{F}_{0,L-\mu}$ satisfies the above inequalities, and hence any feasible point to (5) can be converted to a feasible point to (10) by sampling $\tilde f$.

In what follows, we use the following relaxation of (10), incorporating only a specific subset of the previous quadratic inequalities, therefore forming an upper bound on the original problem. The discarded inequalities were removed because they introduce nonlinearities in the steps taken in the next sections. Note that these inequalities are not active at the optimal point, because of the tightness of the result shown below. Hence, the following relaxation also produces upper bounds on (5)
$$\begin{aligned}\max_{\{(w_i,g_i,f_i)\}_{i\in I}\subset\mathbb{R}^d\times\mathbb{R}^d\times\mathbb{R},\,d\in\mathbb{N}}\ &\|w_N-w_\star\|^2\\ \text{s.t.}\ &\|w_0-w_\star\|^2 = 1,\quad g_\star = 0, &\text{(R)}\\ & w_k \text{ generated by (8) for } k=1,\ldots,N,\\ & f_i \ge f_{i+1}+\langle g_{i+1}; w_i-w_{i+1}\rangle+\tfrac{1}{2(L-\mu)}\|g_i-g_{i+1}\|^2 &\text{for } i=0,\ldots,N-2,\\ & f_\star \ge f_i+\langle g_i; w_\star-w_i\rangle+\tfrac{1}{2(L-\mu)}\|g_i\|^2 &\text{for } i=0,\ldots,N-1,\\ & f_{N-1} \ge f_\star+\tfrac{1}{2(L-\mu)}\|g_{N-1}\|^2.\end{aligned}$$
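Stepping back to Lemma 3, the triangular conversion (9) can be illustrated on a toy instance (our own example; the coefficients $h_{i,j}$ below are made up): given $\{h_{i,j}\}$, we build $\{\alpha_{i,j}\}$ via (9) and check that forms (7) and (8) generate identical iterates on a simple quadratic:

```python
# Toy check of Lemma 3 on f(w) = w^2, so L = 2; we take mu = 0.5 and w_* = 0.
L, mu = 2.0, 0.5
q = mu / L

def grad_f(w):
    return 2.0 * w            # gradient of f

def grad_ft(w):
    return (2.0 - mu) * w     # gradient of f~(w) = f(w) - (mu/2) w^2

h = {(1, 0): 1.5, (2, 0): 0.3, (2, 1): 0.8}  # made-up step sizes for N = 2
N = 2

# Triangular conversion (9): alpha_{k,k-1} = h_{k,k-1}, and for i < k-1
# alpha_{k,i} = h_{k,i} + alpha_{k-1,i} - q * sum_j h_{k,j} alpha_{j,i}
alpha = {}
for k in range(1, N + 1):
    alpha[(k, k - 1)] = h[(k, k - 1)]
    for i in range(k - 2, -1, -1):
        alpha[(k, i)] = h[(k, i)] + alpha[(k - 1, i)] \
            - q * sum(h[(k, j)] * alpha[(j, i)] for j in range(i + 1, k))

w0 = 1.0
# Method in form (7), with gradients of f
w = [w0]
for k in range(1, N + 1):
    w.append(w[k - 1] - sum(h[(k, i)] * grad_f(w[i]) for i in range(k)) / L)
# Method in form (8), with gradients of f~
v = [w0]
for k in range(1, N + 1):
    s = sum(alpha[(k, i)] for i in range(k))
    v.append(w0 * (1 - q * s) - sum(alpha[(k, i)] * grad_ft(v[i]) for i in range(k)) / L)

max_gap = max(abs(a - b) for a, b in zip(w, v))
```

Both representations produce the same trajectory (here `w = v = [1.0, -0.5, -0.4]` up to round-off), illustrating the bijection between (7) and (8).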
As shown in the next section, this problem is semidefinite-representable, and one can use standard packages for solving it. The following lemma summarizes what we have obtained so far, that is, $W_{\mu,L}(M) \le \mathrm{val(R)}$, where $\mathrm{val(R)}$ denotes the optimal value of (R).

Lemma 4
Let $N\in\mathbb{N}$, $0\le\mu<L<\infty$, and $M\in\mathcal{M}_N$ be a fixed-step first-order method (8) performing $N$ gradient evaluations and described by a set of coefficients $\{\alpha_{i,j}\}_{i,j}$. For any $d\in\mathbb{N}$, $f\in\mathcal{F}_{\mu,L}(\mathbb{R}^d)$, $w_\star\in\operatorname{argmin}_w f(w)$, initial guess $w_0\in\mathbb{R}^d$, and $w_N = M(w_0, f)$, it holds that
$$\|w_N-w_\star\|^2 \le \mathrm{val(R)}\,\|w_0-w_\star\|^2,$$
where $\mathrm{val(R)}$ denotes the optimal value of (R).

For formulating (R) as a semidefinite program, we introduce a pair of variables $(G, F)$ (after substituting the $w_k$'s by their expressions), defined by
$$G = \begin{pmatrix}\|w_0-w_\star\|^2 & \langle g_0; w_0-w_\star\rangle & \langle g_1; w_0-w_\star\rangle & \ldots & \langle g_{N-1}; w_0-w_\star\rangle\\ \langle g_0; w_0-w_\star\rangle & \|g_0\|^2 & \langle g_0; g_1\rangle & \ldots & \langle g_{N-1}; g_0\rangle\\ \langle g_1; w_0-w_\star\rangle & \langle g_0; g_1\rangle & \|g_1\|^2 & \ldots & \langle g_{N-1}; g_1\rangle\\ \vdots & \vdots & \vdots & \ddots & \vdots\\ \langle g_{N-1}; w_0-w_\star\rangle & \langle g_{N-1}; g_0\rangle & \langle g_{N-1}; g_1\rangle & \ldots & \|g_{N-1}\|^2\end{pmatrix},\qquad F = \begin{pmatrix}f_0-f_\star\\ f_1-f_\star\\ \vdots\\ f_{N-1}-f_\star\end{pmatrix}. \tag{11}$$
Formally, let us introduce the following notations for picking elements in $G$ and $F$ and conveniently formulating the SDPs
$$\mathbf{w}_0 = e_1\in\mathbb{R}^{N+1},\qquad \mathbf{g}_i = e_{i+2}\in\mathbb{R}^{N+1},\qquad \mathbf{f}_i = e_{i+1}\in\mathbb{R}^{N},$$
with $i = 0,\ldots,N-1$ and $e_i$ being the unit vector whose $i$th component is equal to 1. In addition, we also denote
$$\mathbf{w}_k = \mathbf{w}_0\left(1-\tfrac{\mu}{L}\sum_{i=0}^{k-1}\alpha_{k,i}\right)-\sum_{i=0}^{k-1}\tfrac{\alpha_{k,i}}{L}\mathbf{g}_i,$$
for $k = 0,\ldots,N$ (note that $\mathbf{w}_k$ is therefore parameterized by $\{\alpha_{k,i}\}$). Those notations allow expressing the objective and constraints of (R) directly in terms of $G$ and $F$ using the following identities
$$f_i-f_\star = \mathbf{f}_i^\top F \quad i = 0,1,\ldots,N-1$$
$$\|g_i\|^2 = \mathbf{g}_i^\top G\,\mathbf{g}_i \quad i = 0,1,\ldots,N-1$$
$$\|w_i-w_\star\|^2 = \mathbf{w}_i^\top G\,\mathbf{w}_i \quad i = 0,1,\ldots,N$$
$$\langle g_i; w_j-w_\star\rangle = \mathbf{g}_i^\top G\,\mathbf{w}_j \quad i = 0,1,\ldots,N-1,\ j = 0,1,\ldots,N.$$
Using those notations, any feasible point to (R) can be transformed into a feasible point to the following problem (SDP-R), using the Gram matrix representation. Hence, the optimal value of the following problem is an upper bound on that of (R)
$$\begin{aligned}\max_{G\in\mathbb{S}^{N+1},\,F\in\mathbb{R}^N,\,d\in\mathbb{N}}\ &\mathbf{w}_N^\top G\,\mathbf{w}_N\\ \text{s.t.}\ & G\succeq 0,\quad \mathbf{w}_0^\top G\,\mathbf{w}_0 = 1,\\ & 0 \ge (\mathbf{f}_{i+1}-\mathbf{f}_i)^\top F+\mathbf{g}_{i+1}^\top G(\mathbf{w}_i-\mathbf{w}_{i+1})+\tfrac{1}{2(L-\mu)}(\mathbf{g}_i-\mathbf{g}_{i+1})^\top G(\mathbf{g}_i-\mathbf{g}_{i+1}) &\text{for } i = 0,\ldots,N-2,\\ & 0 \ge \mathbf{f}_i^\top F-\mathbf{g}_i^\top G\,\mathbf{w}_i+\tfrac{1}{2(L-\mu)}\mathbf{g}_i^\top G\,\mathbf{g}_i &\text{for } i = 0,\ldots,N-1,\\ & 0 \ge -\mathbf{f}_{N-1}^\top F+\tfrac{1}{2(L-\mu)}\mathbf{g}_{N-1}^\top G\,\mathbf{g}_{N-1},\\ & \operatorname{rank}(G) \le d.\end{aligned} \tag{SDP-R}$$
After getting rid of the variable $d$ and the rank constraint (which is void due to the maximization over $d$), this problem is a linear SDP, parametrized by $L > \mu \ge 0$ and $\{\alpha_{i,j}\}$.

For transforming the minimax problem into a bilinear minimization problem, the next key step in our procedure is to express the Lagrangian dual of (SDP-R), substituting the inner maximization problem by a minimization, hence replacing the minimax by a minimization problem. Note that we do not assume strong duality, as weak duality suffices for obtaining an upper bound on the original problem. That is, we perform the following primal-dual associations
$$\|w_0-w_\star\|^2 = 1\ :\ \tau$$
$$f_i \ge f_{i+1}+\langle g_{i+1}; w_i-w_{i+1}\rangle+\tfrac{1}{2(L-\mu)}\|g_i-g_{i+1}\|^2 \quad\text{for } i = 0,\ldots,N-2\ :\ \lambda_{i,i+1}$$
$$f_\star \ge f_i+\langle g_i; w_\star-w_i\rangle+\tfrac{1}{2(L-\mu)}\|g_i\|^2 \quad\text{for } i = 0,\ldots,N-1\ :\ \lambda_{\star,i}$$
$$f_{N-1} \ge f_\star+\tfrac{1}{2(L-\mu)}\|g_{N-1}\|^2\ :\ \lambda_{N-1,\star}$$
and arrive at the following dual formulation of (SDP-R), whose optimal value is denoted by $\mathrm{UB}_{\mu,L}(M)$
$$\begin{aligned}\mathrm{UB}_{\mu,L}(\{\alpha_{i,j}\}) := \min_{\tau,\,\lambda_{i,j}\ge 0}\ &\tau,\qquad\text{s.t.}\quad S(\tau,\{\lambda_{i,j}\}) \succeq 0, &\text{(dual-SDP-R)}\\ &\sum_{i=0}^{N-2}\lambda_{i,i+1}(\mathbf{f}_{i+1}-\mathbf{f}_i)+\sum_{i=0}^{N-1}\lambda_{\star,i}\mathbf{f}_i-\lambda_{N-1,\star}\mathbf{f}_{N-1} = 0,\end{aligned}$$
with
$$\begin{aligned}S(\tau,\{\lambda_{i,j}\}) ={}& \tau\,\mathbf{w}_0\mathbf{w}_0^\top-\mathbf{w}_N\mathbf{w}_N^\top+\frac{\lambda_{N-1,\star}}{2(L-\mu)}\mathbf{g}_{N-1}\mathbf{g}_{N-1}^\top\\ &+\sum_{i=0}^{N-1}\lambda_{\star,i}\left(-\tfrac{1}{2}\mathbf{g}_i\mathbf{w}_i^\top-\tfrac{1}{2}\mathbf{w}_i\mathbf{g}_i^\top+\tfrac{1}{2(L-\mu)}\mathbf{g}_i\mathbf{g}_i^\top\right)\\ &+\sum_{i=0}^{N-2}\lambda_{i,i+1}\left(\tfrac{1}{2}\mathbf{g}_{i+1}(\mathbf{w}_i-\mathbf{w}_{i+1})^\top+\tfrac{1}{2}(\mathbf{w}_i-\mathbf{w}_{i+1})\mathbf{g}_{i+1}^\top+\tfrac{1}{2(L-\mu)}(\mathbf{g}_i-\mathbf{g}_{i+1})(\mathbf{g}_i-\mathbf{g}_{i+1})^\top\right).\end{aligned}$$
Note that in the case $N > 1$, the equality constraint in (dual-SDP-R) can be written as
$$\lambda_{\star,0}-\lambda_{0,1} = 0$$
$$\lambda_{i-1,i}+\lambda_{\star,i}-\lambda_{i,i+1} = 0 \quad\text{for } i = 1,\ldots,N-2$$
$$\lambda_{N-2,N-1}+\lambda_{\star,N-1}-\lambda_{N-1,\star} = 0. \tag{12}$$
Alternatively, when $N = 1$, it reduces to $\lambda_{\star,0}-\lambda_{0,\star} = 0$. The following lemma recaps the current situation, which is informally summarized by $W_{\mu,L}(M) \le \mathrm{UB}_{\mu,L}(M)$.

Lemma 5
Let $N \in \mathbb{N}$, $0 \le \mu < L < \infty$, and $M \in \mathcal{M}_N$ be a fixed-step first-order method (8) performing $N$ gradient evaluations and described by a set of coefficients $\{\alpha_{i,j}\}_{i,j}$. For any $d \in \mathbb{N}$, $w_0 \in \mathbb{R}^d$, $f \in \mathcal{F}_{\mu,L}(\mathbb{R}^d)$, $w_\star \in \operatorname{argmin}_w f(w)$, and $w_N = M(w_0, f)$, it holds that

    $\|w_N - w_\star\|^2 \le \mathrm{UB}_{\mu,L}(\{\alpha_{i,j}\})\, \|w_0 - w_\star\|^2$.

Proof
The result follows from Lemma 4. More precisely, any feasible point of (R) can be translated to a feasible point of (SDP-R) using the Gram matrix representation (11), hence $\mathrm{val}(R) \le \mathrm{val}(\text{SDP-R})$. Furthermore, weak duality implies $\mathrm{val}(\text{SDP-R}) \le \mathrm{val}(\text{dual-SDP-R}) = \mathrm{UB}_{\mu,L}(\{\alpha_{i,j}\})$. Therefore

    $\mathrm{val}(R) \le \mathrm{val}(\text{SDP-R}) \le \mathrm{UB}_{\mu,L}(\{\alpha_{i,j}\})$,

and it follows that

    $\|w_N - w_\star\|^2 \le \mathrm{val}(R)\, \|w_0 - w_\star\|^2 \le \mathrm{UB}_{\mu,L}(\{\alpha_{i,j}\})\, \|w_0 - w_\star\|^2$,

where the first inequality is due to Lemma 4. ⊓⊔

The last remaining difficulty is that $S(\cdot)$ appearing in (dual-SDP-R) is bilinear in the algorithmic parameters $\{\alpha_{i,j}\}$ (the vectors $\mathbf{w}_i$ depend linearly on those parameters) and the dual variables $\{\lambda_{i,j}\}$. Therefore, it might be unclear how to solve

    $\min_{\{\alpha_{i,j}\}} \ \min_{\tau,\ \lambda_{i,j} \ge 0} \ \tau$ s.t. $S(\tau, \{\lambda_{i,j}\}) \succeq 0$, $\sum_{i=0}^{N-2} \lambda_{i,i+1}(\mathbf{f}_{i+1} - \mathbf{f}_i) + \sum_{i=0}^{N-1} \lambda_{\star,i}\, \mathbf{f}_i - \lambda_{N-1,\star}\, \mathbf{f}_{N-1} = 0$,

as problems involving such bilinear matrix inequalities are NP-hard in general [Toker and Ozbay, 1995]. In the next section, we introduce a linearization trick that allows tackling this specific problem.

3.4 An approximate minimax and its semidefinite representation

In this section, we show how to solve

    $\min_{\{\alpha_{i,j}\}} \mathrm{UB}_{\mu,L}(\{\alpha_{i,j}\})$,   (13)

which is a minimization problem jointly in the $\alpha_{i,j}$'s and the $\lambda_{i,j}$'s. As it is, the problem features a bilinear matrix inequality. However, expanding and manipulating the expression of $S$ (those manipulations are provided in Appendix B) reveals that the following change of variables solves this issue:

    $\alpha'_{i,j} = \begin{cases} \lambda_{i,i+1}\, \alpha_{i,j} & \text{if } 0 \le i < N-1, \\ \lambda_{N-1,\star}\, \alpha_{N-1,j} & \text{if } i = N-1, \\ \alpha_{N,j} & \text{if } i = N. \end{cases}$   (14)

As provided in Lemma 6, this change of variables is invertible for this problem. In other words, for any $N > 1$ and $0 \le \mu < L$, one can solve

    $\min_{\tau,\ \lambda_{i,j} \ge 0,\ \{\alpha'_{i,j}\}} \ \tau$
    s.t. $\begin{pmatrix} S''(\tau, \{\lambda_{i,j}\}, \{\alpha'_{i,j}\}) & \mathbf{w}_N \\ \mathbf{w}_N^\top & 1 \end{pmatrix} \succeq 0$,   (Minimax-R)
         $\sum_{i=0}^{N-2} \lambda_{i,i+1}(\mathbf{f}_{i+1} - \mathbf{f}_i) + \sum_{i=0}^{N-1} \lambda_{\star,i}\, \mathbf{f}_i - \lambda_{N-1,\star}\, \mathbf{f}_{N-1} = 0$,

which is a standard linear semidefinite program, with the following definitions:

    $S''(\tau, \{\lambda_{i,j}\}, \{\alpha'_{i,j}\}) = \tau\, \mathbf{w}_0 \mathbf{w}_0^\top + \tfrac{1}{2(L-\mu)}\Big( \lambda_{N-1,\star}\, \mathbf{g}_{N-1}\mathbf{g}_{N-1}^\top + \sum_{i=0}^{N-1} \lambda_{\star,i}\, \mathbf{g}_i\mathbf{g}_i^\top + \sum_{i=0}^{N-2} \lambda_{i,i+1} (\mathbf{g}_i - \mathbf{g}_{i+1})(\mathbf{g}_i - \mathbf{g}_{i+1})^\top \Big)$
    $\quad - \tfrac{1}{2}\lambda_{\star,0}\big(\mathbf{g}_0\mathbf{w}_0^\top + \mathbf{w}_0\mathbf{g}_0^\top\big) - \tfrac{1}{2}\sum_{i=1}^{N-2} \Big(\lambda_{i,i+1} - \tfrac{\mu}{L}\sum_{j=0}^{i-1} \alpha'_{i,j}\Big)\big(\mathbf{g}_i\mathbf{w}_0^\top + \mathbf{w}_0\mathbf{g}_i^\top\big) + \sum_{i=1}^{N-2}\sum_{j=0}^{i-1} \tfrac{\alpha'_{i,j}}{2L}\big(\mathbf{g}_i\mathbf{g}_j^\top + \mathbf{g}_j\mathbf{g}_i^\top\big)$
    $\quad - \tfrac{1}{2}\Big(\lambda_{N-1,\star} - \tfrac{\mu}{L}\sum_{j=0}^{N-2} \alpha'_{N-1,j}\Big)\big(\mathbf{g}_{N-1}\mathbf{w}_0^\top + \mathbf{w}_0\mathbf{g}_{N-1}^\top\big) + \sum_{j=0}^{N-2} \tfrac{\alpha'_{N-1,j}}{2L}\big(\mathbf{g}_{N-1}\mathbf{g}_j^\top + \mathbf{g}_j\mathbf{g}_{N-1}^\top\big)$
    $\quad + \tfrac{1}{2}\sum_{i=0}^{N-2} \Big(\lambda_{i,i+1} - \tfrac{\mu}{L}\sum_{j=0}^{i-1} \alpha'_{i,j}\Big)\big(\mathbf{g}_{i+1}\mathbf{w}_0^\top + \mathbf{w}_0\mathbf{g}_{i+1}^\top\big) - \sum_{i=0}^{N-2}\sum_{j=0}^{i-1} \tfrac{\alpha'_{i,j}}{2L}\big(\mathbf{g}_{i+1}\mathbf{g}_j^\top + \mathbf{g}_j\mathbf{g}_{i+1}^\top\big)$,

and $\mathbf{w}_N = \mathbf{w}_0\big(1 - \tfrac{\mu}{L}\sum_{i=0}^{N-1} \alpha'_{N,i}\big) - \sum_{i=0}^{N-1} \tfrac{\alpha'_{N,i}}{L}\mathbf{g}_i$ (recall that (14) leaves the last row unchanged, $\alpha'_{N,i} = \alpha_{N,i}$). From the solution to (Minimax-R), one can recover a fixed-step first-order method whose worst-case performance satisfies $\|w_N - w_\star\|^2 \le \mathrm{val}(\text{Minimax-R})\, \|w_0 - w_\star\|^2$, as formalized by the next lemma.

Lemma 6
Let $N \in \mathbb{N}$ with $N > 1$, and $0 \le \mu < L < \infty$. Furthermore, let $(\tau, \{\alpha'_{i,j}\}, \{\lambda_{i,j}\})$ be a solution to (Minimax-R). The following statements hold.
(i) If $\lambda_{i,i+1} = 0$, then $\alpha'_{i,j} = 0$ (for all $j$).
(ii) If $\lambda_{N-1,\star} = 0$, then $\alpha'_{N-1,j} = 0$ (for all $j$).
(iii) Let $\{\alpha_{i,j}\}$ be defined as

    $\alpha_{i,j} = \begin{cases} 0 & \text{if } \alpha'_{i,j} = 0, \\ \alpha'_{i,j}/\lambda_{i,i+1} & \text{if } 0 \le i < N-1, \\ \alpha'_{N-1,j}/\lambda_{N-1,\star} & \text{if } i = N-1, \\ \alpha'_{N,j} & \text{if } i = N. \end{cases}$   (15)

The output of the corresponding method of the form (8) satisfies $\|w_N - w_\star\|^2 \le \tau \|w_0 - w_\star\|^2$ for any $d \in \mathbb{N}$, $w_0 \in \mathbb{R}^d$, and $f \in \mathcal{F}_{\mu,L}(\mathbb{R}^d)$.

Proof (i) Assume $\lambda_{k,k+1} = 0$ for some $k$; it then follows from $\lambda_{i,j} \ge 0$ and (12) that $\lambda_{\star,i} = 0$ and $\lambda_{i-1,i} = 0$ for all $1 \le i \le k$. From the expression of $S''$, this means that there are no diagonal entries corresponding to $e_2 e_2^\top, \ldots, e_{k+2} e_{k+2}^\top$ (corresponding to $\mathbf{g}_0\mathbf{g}_0^\top, \ldots, \mathbf{g}_k\mathbf{g}_k^\top$). Therefore, the constraint $S'' \succeq 0$ forces the corresponding off-diagonal entries (those of $\mathbf{w}_0\mathbf{g}_0^\top, \ldots, \mathbf{w}_0\mathbf{g}_k^\top$, and of $\mathbf{g}_i\mathbf{g}_j^\top$ for $j = 0, \ldots, k$ and $i = 0, \ldots, N-1$) to vanish as well; hence $\alpha'_{k,0} = \ldots = \alpha'_{k,k-1} = 0$.
(ii) Using a similar argument: it follows from $\lambda_{N-1,\star} = 0$ that $\lambda_{N-2,N-1} = \lambda_{\star,N-1} = 0$. There is therefore no diagonal element corresponding to the entry $\mathbf{g}_{N-1}\mathbf{g}_{N-1}^\top$ in $S''$, and the corresponding off-diagonal elements must be zero as well due to the constraint $S'' \succeq 0$. Hence $\alpha'_{N-1,0} = \ldots = \alpha'_{N-1,N-2} = 0$.
(iii) Using (i) and (ii), for $\{\alpha_{i,j}\}$ as defined in (15), the couple $(\tau, \{\lambda_{i,j}\})$ is feasible for (dual-SDP-R) by construction, following the reformulation steps of $S$ in Appendix B. It follows that $\mathrm{UB}_{\mu,L}(\{\alpha_{i,j}\}) \le \tau$, and Lemma 5 allows reaching the desired claim. ⊓⊔

It is relatively straightforward to establish that ITEM is a solution to (Minimax-R), given that (i) ITEM achieves the lower complexity bound (see Section 2.2), that (ii) ITEM is a fixed-step first-order method, and that (iii) all the inequalities involved in the proofs of Theorem 1 and Theorem 3 are used in the relaxation procedure (R). We conclude this section with the corresponding formal statement.
Theorem 4
Let $N \in \mathbb{N}$ and $0 \le \mu < L < \infty$. ITEM (Algorithm (3)) is a solution to (Minimax-R).

Proof We exhibit a solution to (Minimax-R) and show that it corresponds to ITEM in Appendix C. ⊓⊔

Numerical examples of the design procedure are provided in Appendix D, including methods optimized for different objectives, such as function values, for which we provide the slightly adapted design strategy in Appendix E. Codes for reproducing the results are provided in Section 4.
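Before moving on, the invertibility of the change of variables (14)-(15) is easy to illustrate numerically: multiplying the step coefficients by nonzero multipliers and dividing back recovers the original coefficients exactly, while the last row is left untouched. A minimal sketch (the multiplier and coefficient values below are arbitrary placeholders, not solutions of the SDP):

```python
import numpy as np

rng = np.random.default_rng(2)
N = 4  # number of gradient evaluations (arbitrary)

# Arbitrary positive multipliers: lam[i] stands for lambda_{i,i+1} when
# i < N-1, and for lambda_{N-1,*} when i = N-1 (all nonzero here).
lam = {i: rng.random() + 0.1 for i in range(N)}

# Arbitrary step coefficients alpha_{i,j}, defined for j < i, i = 1..N.
alpha = {(i, j): rng.standard_normal() for i in range(1, N + 1) for j in range(i)}

# Forward change of variables (14): scale every row but the last one.
alpha_p = {(i, j): a if i == N else lam[i] * a for (i, j), a in alpha.items()}

# Inverse map (15): divide back by the same multipliers.
alpha_back = {(i, j): a if i == N else a / lam[i] for (i, j), a in alpha_p.items()}

# The round trip recovers the original coefficients exactly.
assert all(np.isclose(alpha_back[k], alpha[k]) for k in alpha)
```

When some multipliers vanish, statements (i) and (ii) of Lemma 6 guarantee that the corresponding $\alpha'_{i,j}$'s vanish too, so the convention $\alpha_{i,j} = 0$ in (15) keeps the map well defined.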
In this work, we provided the Information-Theoretic Exact Method (ITEM), a first-order method whose worst-case guarantees exactly match the lower bounds for minimizing both smooth convex and smooth strongly convex functions. Furthermore, we showed how to develop such methods constructively, through performance estimation problems and semidefinite programming.

We believe that obtaining accelerated first-order methods as solutions to minimax problems brings perspective and a systematic approach to accelerated methods in first-order convex optimization. In addition, we think that the conceptual simplicity of the shapes and proofs of such optimized methods renders them attractive as textbook examples for illustrating the acceleration phenomenon. In particular, it was quite surprising to us that both sequences $\{y_k\}$ and $\{z_k\}$ now have relatively clear interpretations: the $z_k$'s are optimal for optimizing the distance to an optimal solution, whereas the $y_k$'s are essentially optimal for optimizing function values (see [Kim and Fessler, 2016]).

Those methods might as well serve as an inspiration for further developments on this topic, for designing accelerated methods in other settings, and for alternate performance criteria, as showcased numerically in Appendix D.

On the negative side, it remains unclear how to extend methods such as the Optimized Gradient Method, the Triple Momentum Method, and the Information-Theoretic Exact Method to more general situations, possibly involving constraints, for instance.

Codes
Codes helping to reproduce the algebraic passages in Section 2 can be found at https://github.com/AdrienTaylor/OptimalGradientMethod, together with implementations in the Performance Estimation Toolbox [Taylor et al., 2017a] for numerically validating the potential from Lemma 1, the final bound from Theorem 3, and the constructive procedure of Lemma 6.
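In the same numerical spirit, the Gram-matrix lifting (11) underlying the SDPs of Section 3 can be illustrated in a few self-contained lines: stacking $w_0 - w_\star$ and the gradients as columns of a matrix $P$, the matrix $G = P^\top P$ is positive semidefinite by construction, and its entries recover the inner products used in the formulation. The vectors below are arbitrary stand-ins, unrelated to any particular function:

```python
import numpy as np

rng = np.random.default_rng(0)
d, N = 5, 3  # dimension and number of gradient evaluations (arbitrary)

# Arbitrary vectors standing in for w0 - w* and the gradients g_0, ..., g_{N-1}.
cols = [rng.standard_normal(d) for _ in range(N + 1)]
P = np.column_stack(cols)   # P = [w0 - w*, g_0, ..., g_{N-1}]
G = P.T @ P                 # Gram matrix, of size (N+1) x (N+1)

# G is symmetric positive semidefinite by construction.
assert np.all(np.linalg.eigvalsh(G) >= -1e-10)

# Entries of G recover the inner products appearing in (SDP-R):
w0_ws, g0 = cols[0], cols[1]
assert np.isclose(G[0, 0], w0_ws @ w0_ws)   # ||w0 - w*||^2
assert np.isclose(G[1, 0], g0 @ w0_ws)      # <g_0; w0 - w*>
```

Conversely, any PSD matrix $G$ of rank at most $d$ factors as $P^\top P$ with $P \in \mathbb{R}^{d \times (N+1)}$, which is what makes the lifting lossless once the rank constraint is dropped.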
References
Nikhil Bansal and Anupam Gupta. Potential-function proofs for gradient methods. Theory of Computing, 15(1):1-32, 2019.
Amir Beck and Marc Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2(1):183-202, 2009.
Saman Cyrus, Bin Hu, Bryan Van Scoy, and Laurent Lessard. A robust accelerated optimization algorithm for strongly convex functions. In 2018 Annual American Control Conference (ACC), pages 1376-1381. IEEE, 2018.
Etienne De Klerk, François Glineur, and Adrien B. Taylor. On the worst-case complexity of the gradient method with exact line search for smooth strongly convex functions. Optimization Letters, 11(7):1185-1199, 2017.
Yoel Drori. The exact information-based complexity of smooth convex minimization. Journal of Complexity, 39:1-16, 2017.
Yoel Drori and Adrien B. Taylor. Efficient first-order methods for convex minimization: a constructive approach. Mathematical Programming, 184(1):183-220, 2020.
Yoel Drori and Adrien B. Taylor. On the oracle complexity of smooth strongly convex minimization. preprint (to appear), 2021.
Yoel Drori and Marc Teboulle. Performance of first-order methods for smooth convex minimization: a novel approach. Mathematical Programming, 145(1):451-482, 2014.
Alexander V. Gasnikov and Yurii E. Nesterov. Universal method for stochastic composite optimization problems. Computational Mathematics and Mathematical Physics, 58(1):48-64, 2018.
Dennis Gramlich, Christian Ebenbauer, and Carsten W. Scherer. Convex synthesis of accelerated gradient algorithms for optimization and saddle point problems using Lyapunov functions. preprint arXiv:2006.09946, 2020.
Donghwan Kim and Jeffrey A. Fessler. Optimized first-order methods for smooth convex minimization. Mathematical Programming, 159(1):81-107, 2016.
Donghwan Kim and Jeffrey A. Fessler. Optimizing the efficiency of first-order methods for decreasing the gradient of smooth convex functions. Journal of Optimization Theory and Applications, 2020.
Laurent Lessard and Peter Seiler. Direct synthesis of iterative algorithms with bounds on achievable worst-case convergence rate. In 2020 American Control Conference (ACC), pages 119-125. IEEE, 2020.
Laurent Lessard, Benjamin Recht, and Andrew Packard. Analysis and design of optimization algorithms via integral quadratic constraints. SIAM Journal on Optimization, 26(1):57-95, 2016.
Johan Löfberg. YALMIP: A toolbox for modeling and optimization in MATLAB. In Proceedings of the CACSD Conference, 2004.
APS Mosek. The MOSEK optimization software, 2010.
Arkadi Nemirovski. Optimization II: Numerical methods for nonlinear continuous optimization. Lecture notes, 1999.
Arkadi S. Nemirovskii. Information-based complexity of linear operator equations. Journal of Complexity, 8(2):153-175, 1992.
Yurii Nesterov. A method of solving a convex programming problem with convergence rate O(1/k²). Soviet Math. Dokl., 27(2):372-376, 1983.
Yurii Nesterov. Introductory lectures on convex optimization: a basic course. Applied Optimization. Kluwer Academic Publishers, 2004. ISBN 9781402075537.
Adrien Taylor. Convex Interpolation and Performance Estimation of First-order Methods for Convex Optimization. PhD thesis, Université catholique de Louvain, 2017.
Adrien Taylor and Francis Bach. Stochastic first-order methods: non-asymptotic and computer-aided analyses via potential functions. In Proceedings of the 2019 Conference on Learning Theory (COLT), volume 99, pages 2934-2992, 2019.
Adrien B. Taylor, Julien M. Hendrickx, and François Glineur. Performance estimation toolbox (PESTO): automated worst-case analysis of first-order optimization methods. In 2017 IEEE 56th Annual Conference on Decision and Control (CDC), pages 1278-1283. IEEE, 2017a.
Adrien B. Taylor, Julien M. Hendrickx, and François Glineur. Smooth strongly convex interpolation and exact worst-case performance of first-order methods. Mathematical Programming, 161(1-2):307-345, 2017b.
Onur Toker and Hitay Ozbay. On the NP-hardness of solving bilinear matrix inequalities and simultaneous stabilization with static output feedback. In Proceedings of 1995 American Control Conference, volume 4, pages 2525-2526. IEEE, 1995.
Charles F. Van Loan and Gene H. Golub. Matrix Computations. Johns Hopkins University Press, Baltimore, 1983.
Bryan Van Scoy, Randy A. Freeman, and Kevin M. Lynch. The fastest known globally convergent first-order method for minimizing strongly convex functions. IEEE Control Systems Letters, 2(1):49-54, 2017.
Ashia C. Wilson, Benjamin Recht, and Michael I. Jordan. A Lyapunov analysis of momentum methods in optimization. preprint arXiv:1611.02635, 2016.
Kaiwen Zhou, Anthony Man-Cho So, and James Cheng. Boosting first-order methods by shifting objective: New schemes with faster worst case rates. 2020.
A Alternate parametrization for first-order methods
In this section, we show that any method (7) can be reparametrized as (8) (and vice versa), using the identity (9). First note that the equivalence is clear for $k = 1$, as

    $w_1 - x_\star = w_0 - x_\star - \tfrac{h_{1,0}}{L}\big(\nabla \tilde f(w_0) + \mu (w_0 - x_\star)\big) = (w_0 - x_\star)\big(1 - \tfrac{\mu}{L} h_{1,0}\big) - \tfrac{h_{1,0}}{L}\, \nabla \tilde f(w_0)$,

and hence the equivalence holds for $k = 1$. Now, assuming the equivalence holds at iteration $k$, that is, assuming

    $w_i - x_\star = (w_0 - x_\star)\Big(1 - \tfrac{\mu}{L}\sum_{j=0}^{i-1} \alpha_{i,j}\Big) - \tfrac{1}{L}\sum_{j=0}^{i-1} \alpha_{i,j}\, \nabla \tilde f(w_j)$ for $0 \le i \le k$,

let us check that it holds at iteration $k+1$, and compute

    $w_{k+1} - x_\star = w_k - x_\star - \tfrac{1}{L}\sum_{i=0}^{k} h_{k+1,i}\big(\nabla \tilde f(w_i) + \mu(w_i - x_\star)\big)$
    $= (w_0 - x_\star)\Big(1 - \tfrac{\mu}{L}\sum_{i=0}^{k-1} \alpha_{k,i} - \tfrac{\mu}{L}\sum_{i=0}^{k} h_{k+1,i} + \tfrac{\mu^2}{L^2}\sum_{i=0}^{k}\sum_{j=0}^{i-1} h_{k+1,i}\, \alpha_{i,j}\Big) - \tfrac{1}{L}\sum_{i=0}^{k-1} \alpha_{k,i}\, \nabla \tilde f(w_i) - \tfrac{1}{L}\sum_{i=0}^{k} h_{k+1,i}\, \nabla \tilde f(w_i) + \tfrac{\mu}{L^2}\sum_{i=0}^{k}\sum_{j=0}^{i-1} h_{k+1,i}\, \alpha_{i,j}\, \nabla \tilde f(w_j)$.

By reverting the ordering of the double sums, renaming the indices, and reordering, we get

    $w_{k+1} - x_\star = (w_0 - x_\star)\Big(1 - \tfrac{\mu}{L} h_{k+1,k} - \tfrac{\mu}{L}\sum_{i=0}^{k-1}\big(\alpha_{k,i} + h_{k+1,i} - \tfrac{\mu}{L}\sum_{j=i+1}^{k} h_{k+1,j}\, \alpha_{j,i}\big)\Big) - \tfrac{h_{k+1,k}}{L}\, \nabla \tilde f(w_k) - \tfrac{1}{L}\sum_{i=0}^{k-1}\Big(\alpha_{k,i} + h_{k+1,i} - \tfrac{\mu}{L}\sum_{j=i+1}^{k} h_{k+1,j}\, \alpha_{j,i}\Big) \nabla \tilde f(w_i)$.

From this last reformulation, the choice (9), that is,

    $\alpha_{k+1,i} = \begin{cases} h_{k+1,k} & \text{if } i = k, \\ h_{k+1,i} + \alpha_{k,i} - \tfrac{\mu}{L}\sum_{j=i+1}^{k} h_{k+1,j}\, \alpha_{j,i} & \text{if } 0 \le i < k, \end{cases}$

allows enforcing the coefficients of all independent terms $(w_0 - x_\star)$, $\nabla \tilde f(w_0), \ldots, \nabla \tilde f(w_k)$ to be equal in both (7) and (8), reaching the desired statement.
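The equivalence of the two parametrizations can also be checked numerically: running a method in form (7) with arbitrary coefficients $\{h_{i,j}\}$ and re-generating its iterates in form (8) through the recursion (9) produces the same points. A small sketch on a strongly convex quadratic with known minimizer $x_\star = 0$ (the function and all coefficient values are arbitrary illustrations):

```python
import numpy as np

rng = np.random.default_rng(3)
d, N, L, mu = 4, 5, 1.0, 0.1
q = mu / L

# A quadratic f(w) = 0.5 w^T A w with mu I <= A <= L I; its minimizer is x* = 0.
evals = np.linspace(mu, L, d)
Q = np.linalg.qr(rng.standard_normal((d, d)))[0]
A = Q @ np.diag(evals) @ Q.T
grad_f = lambda w: A @ w                       # gradient of f
grad_ft = lambda w: (A - mu * np.eye(d)) @ w   # gradient of f~ = f - (mu/2)||.||^2

# Arbitrary coefficients h_{k+1,i} for the method in form (7).
h = {(k + 1, i): rng.standard_normal() for k in range(N) for i in range(k + 1)}

# Form (7): w_{k+1} = w_k - (1/L) sum_i h_{k+1,i} grad f(w_i).
w = [rng.standard_normal(d)]
for k in range(N):
    w.append(w[k] - sum(h[k + 1, i] * grad_f(w[i]) for i in range(k + 1)) / L)

# Build the alpha coefficients through the recursion (9).
alpha = {(1, 0): h[1, 0]}
for k in range(1, N):
    alpha[k + 1, k] = h[k + 1, k]
    for i in range(k):
        alpha[k + 1, i] = h[k + 1, i] + alpha[k, i] \
            - q * sum(h[k + 1, j] * alpha[j, i] for j in range(i + 1, k + 1))

# Form (8) (with x* = 0) regenerates exactly the same iterates.
for k in range(1, N + 1):
    s = sum(alpha[k, i] for i in range(k))
    wk = w[0] * (1 - q * s) - sum(alpha[k, i] * grad_ft(w[i]) for i in range(k)) / L
    assert np.allclose(wk, w[k])
```

The check is purely algebraic: it relies only on $\nabla f(w) = \nabla \tilde f(w) + \mu(w - x_\star)$, so any function with a known minimizer would do; a quadratic simply makes $x_\star$ explicit.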
In addition, note that this change of variables is reversible.

B Algebraic manipulations of the SDP
In this section, we reformulate (dual-SDP-R) to enable optimizing over both the $\alpha_{i,j}$'s and the $\lambda_{i,j}$'s simultaneously. To that end, let us start by conveniently noting that

    $S(\tau, \{\lambda_{i,j}\}) \succeq 0 \ \Leftrightarrow\ \begin{pmatrix} S'(\tau, \{\lambda_{i,j}\}) & \mathbf{w}_N \\ \mathbf{w}_N^\top & 1 \end{pmatrix} \succeq 0$,

with $S'(\tau, \{\lambda_{i,j}\}) = S(\tau, \{\lambda_{i,j}\}) + \mathbf{w}_N \mathbf{w}_N^\top$, using a standard Schur complement argument (see, e.g., [Van Loan and Golub, 1983]). The motivation underlying this reformulation is that this lifted linear matrix inequality depends linearly on the $\{\alpha_{N,i}\}_i$'s. Indeed, the coefficients of the last iteration only appear through the term $\mathbf{w}_N$, which is not present in $S'$ (details below). We only consider the case $N > 1$; when $N = 1$, (6) can be solved without the following simplifications.

Let us develop the expression of $S'(\tau, \{\lambda_{i,j}\})$ as follows:

    $S'(\tau, \{\lambda_{i,j}\}) = \tau\, \mathbf{w}_0\mathbf{w}_0^\top + \tfrac{1}{2(L-\mu)}\Big(\lambda_{N-1,\star}\, \mathbf{g}_{N-1}\mathbf{g}_{N-1}^\top + \sum_{i=0}^{N-1} \lambda_{\star,i}\, \mathbf{g}_i\mathbf{g}_i^\top + \sum_{i=0}^{N-2} \lambda_{i,i+1}(\mathbf{g}_i - \mathbf{g}_{i+1})(\mathbf{g}_i - \mathbf{g}_{i+1})^\top\Big)$
    $\quad - \tfrac{1}{2}\lambda_{\star,0}\big(\mathbf{g}_0\mathbf{w}_0^\top + \mathbf{w}_0\mathbf{g}_0^\top\big) - \tfrac{1}{2}\sum_{i=1}^{N-1} (\lambda_{i-1,i} + \lambda_{\star,i})\big(\mathbf{g}_i\mathbf{w}_i^\top + \mathbf{w}_i\mathbf{g}_i^\top\big) + \tfrac{1}{2}\sum_{i=0}^{N-2} \lambda_{i,i+1}\big(\mathbf{g}_{i+1}\mathbf{w}_i^\top + \mathbf{w}_i\mathbf{g}_{i+1}^\top\big)$
    $= \tau\, \mathbf{w}_0\mathbf{w}_0^\top + \tfrac{1}{2(L-\mu)}\Big(\lambda_{N-1,\star}\, \mathbf{g}_{N-1}\mathbf{g}_{N-1}^\top + \sum_{i=0}^{N-1} \lambda_{\star,i}\, \mathbf{g}_i\mathbf{g}_i^\top + \sum_{i=0}^{N-2} \lambda_{i,i+1}(\mathbf{g}_i - \mathbf{g}_{i+1})(\mathbf{g}_i - \mathbf{g}_{i+1})^\top\Big)$
    $\quad - \tfrac{1}{2}\lambda_{\star,0}\big(\mathbf{g}_0\mathbf{w}_0^\top + \mathbf{w}_0\mathbf{g}_0^\top\big) - \tfrac{1}{2}\sum_{i=1}^{N-2} \lambda_{i,i+1}\big(\mathbf{g}_i\mathbf{w}_i^\top + \mathbf{w}_i\mathbf{g}_i^\top\big) - \tfrac{1}{2}\lambda_{N-1,\star}\big(\mathbf{g}_{N-1}\mathbf{w}_{N-1}^\top + \mathbf{w}_{N-1}\mathbf{g}_{N-1}^\top\big) + \tfrac{1}{2}\sum_{i=0}^{N-2} \lambda_{i,i+1}\big(\mathbf{g}_{i+1}\mathbf{w}_i^\top + \mathbf{w}_i\mathbf{g}_{i+1}^\top\big)$,

where we used $\lambda_{i-1,i} + \lambda_{\star,i} = \lambda_{i,i+1}$ (for $i = 1, \ldots, N-2$) and $\lambda_{N-2,N-1} + \lambda_{\star,N-1} = \lambda_{N-1,\star}$ for obtaining the second equality. Substituting the expressions of the $\mathbf{w}_i$'s, we arrive at

    $S'(\tau, \{\lambda_{i,j}\}) = \tau\, \mathbf{w}_0\mathbf{w}_0^\top + \tfrac{1}{2(L-\mu)}\Big(\lambda_{N-1,\star}\, \mathbf{g}_{N-1}\mathbf{g}_{N-1}^\top + \sum_{i=0}^{N-1} \lambda_{\star,i}\, \mathbf{g}_i\mathbf{g}_i^\top + \sum_{i=0}^{N-2} \lambda_{i,i+1}(\mathbf{g}_i - \mathbf{g}_{i+1})(\mathbf{g}_i - \mathbf{g}_{i+1})^\top\Big)$
    $\quad - \tfrac{1}{2}\lambda_{\star,0}\big(\mathbf{g}_0\mathbf{w}_0^\top + \mathbf{w}_0\mathbf{g}_0^\top\big) - \tfrac{1}{2}\sum_{i=1}^{N-2} \lambda_{i,i+1}\Big(1 - \tfrac{\mu}{L}\sum_{j=0}^{i-1} \alpha_{i,j}\Big)\big(\mathbf{g}_i\mathbf{w}_0^\top + \mathbf{w}_0\mathbf{g}_i^\top\big) + \sum_{i=1}^{N-2} \lambda_{i,i+1} \sum_{j=0}^{i-1} \tfrac{\alpha_{i,j}}{2L}\big(\mathbf{g}_i\mathbf{g}_j^\top + \mathbf{g}_j\mathbf{g}_i^\top\big)$
    $\quad - \tfrac{1}{2}\lambda_{N-1,\star}\Big(1 - \tfrac{\mu}{L}\sum_{j=0}^{N-2} \alpha_{N-1,j}\Big)\big(\mathbf{g}_{N-1}\mathbf{w}_0^\top + \mathbf{w}_0\mathbf{g}_{N-1}^\top\big) + \lambda_{N-1,\star} \sum_{j=0}^{N-2} \tfrac{\alpha_{N-1,j}}{2L}\big(\mathbf{g}_{N-1}\mathbf{g}_j^\top + \mathbf{g}_j\mathbf{g}_{N-1}^\top\big)$
    $\quad + \tfrac{1}{2}\sum_{i=0}^{N-2} \lambda_{i,i+1}\Big(1 - \tfrac{\mu}{L}\sum_{j=0}^{i-1} \alpha_{i,j}\Big)\big(\mathbf{g}_{i+1}\mathbf{w}_0^\top + \mathbf{w}_0\mathbf{g}_{i+1}^\top\big) - \sum_{i=0}^{N-2} \lambda_{i,i+1} \sum_{j=0}^{i-1} \tfrac{\alpha_{i,j}}{2L}\big(\mathbf{g}_{i+1}\mathbf{g}_j^\top + \mathbf{g}_j\mathbf{g}_{i+1}^\top\big)$,

where we simply expressed each $\mathbf{w}_k$ in two terms: one with the contribution of $\mathbf{w}_0$, and the other with the contributions of the $\mathbf{g}_i$'s. Although not pretty, one can observe that $S'$ is still bilinear in $\{\alpha_{i,j}\}$ and $\{\lambda_{i,j}\}$. This expression can be largely simplified, but this form suffices for the purposes of this work.

C The Information-Theoretic Exact Method is a solution to (Minimax-R)
Proof of Theorem 4
For readability purposes, we establish the claim without explicitly computing the values $\{\alpha_{i,j}\}$ for ITEM. For avoiding this step, let us note that ITEM is clearly a fixed-step first-order method following Definition 2. Therefore, following Lemma 3, the method can also be written in the alternate parametrization (8), using the association $w_i \leftarrow y_i$ ($i = 0, 1, \ldots, N-1$) and $w_N \leftarrow z_N$.

Now, we show that the choice

    $\lambda_{\star,i} = \tfrac{1-q}{L}\, \tfrac{A_{i+1} - A_i}{1 + qA_N}, \qquad \lambda_{i,i+1} = \tfrac{1-q}{L}\, \tfrac{A_{i+1}}{1 + qA_N}, \qquad \lambda_{N-1,\star} = \tfrac{1-q}{L}\, \tfrac{A_N}{1 + qA_N}, \qquad \tau = \tfrac{1}{1 + qA_N}$   (16)

is feasible for the choice of $\{\alpha_{i,j}\}$ that corresponds to ITEM. Note that optimality of the solution follows from the value of $\tau$, which matches the lower complexity bound discussed in Section 2.2.

For establishing dual feasibility, we relate (Minimax-R) to the Lagrangian of (R). That is, denoting $K = S''(\tau, \{\lambda_{i,j}\}, \{\alpha'_{i,j}\}) - \mathbf{w}_N \mathbf{w}_N^\top$, we have for all $(F, G)$ as in (11), by construction,

    $F^\top\Big(\sum_{i=0}^{N-2} \lambda_{i,i+1}(\mathbf{f}_{i+1} - \mathbf{f}_i) + \sum_{i=0}^{N-1} \lambda_{\star,i}\, \mathbf{f}_i - \lambda_{N-1,\star}\, \mathbf{f}_{N-1}\Big) + \operatorname{Tr}(KG)$
    $= \tau \|w_0 - w_\star\|^2 - \|w_N - w_\star\|^2$
    $\quad + \sum_{i=0}^{N-2} \lambda_{i,i+1}\Big[\tilde f(w_{i+1}) - \tilde f(w_i) + \langle \nabla \tilde f(w_{i+1}); w_i - w_{i+1}\rangle + \tfrac{1}{2(L-\mu)}\|\nabla \tilde f(w_i) - \nabla \tilde f(w_{i+1})\|^2\Big]$
    $\quad + \sum_{i=0}^{N-1} \lambda_{\star,i}\Big[\tilde f(w_i) - \tilde f_\star + \langle \nabla \tilde f(w_i); w_\star - w_i\rangle + \tfrac{1}{2(L-\mu)}\|\nabla \tilde f(w_i)\|^2\Big]$
    $\quad + \lambda_{N-1,\star}\Big[\tilde f_\star - \tilde f(w_{N-1}) + \tfrac{1}{2(L-\mu)}\|\nabla \tilde f(w_{N-1})\|^2\Big]$.

Using the association $w_i \leftarrow y_i$ ($i = 0, 1, \ldots, N-1$) and $w_N \leftarrow z_N$, as well as $f(y_i) = \tilde f(y_i) + \tfrac{\mu}{2}\|y_i - x_\star\|^2$, it follows from Lemma 1 (and in particular the weighted sum in its proof) that, for $i = 1, \ldots, N-1$,

    $\lambda_{\star,i}\Big[\tilde f(w_i) - \tilde f_\star + \langle \nabla \tilde f(w_i); w_\star - w_i\rangle + \tfrac{1}{2(L-\mu)}\|\nabla \tilde f(w_i)\|^2\Big] + \lambda_{i-1,i}\Big[\tilde f(w_i) - \tilde f(w_{i-1}) + \langle \nabla \tilde f(w_i); w_{i-1} - w_i\rangle + \tfrac{1}{2(L-\mu)}\|\nabla \tilde f(w_i) - \nabla \tilde f(w_{i-1})\|^2\Big] = \tfrac{1}{L + \mu A_N}(\phi_{i+1} - \phi_i)$,

as well as

    $\lambda_{\star,0}\Big[\tilde f(w_0) - \tilde f_\star + \langle \nabla \tilde f(w_0); w_\star - w_0\rangle + \tfrac{1}{2(L-\mu)}\|\nabla \tilde f(w_0)\|^2\Big] = \tfrac{1}{L + \mu A_N}(\phi_1 - \phi_0)$,
    $\lambda_{N-1,\star}\Big[\tilde f_\star - \tilde f(w_{N-1}) + \tfrac{1}{2(L-\mu)}\|\nabla \tilde f(w_{N-1})\|^2\Big] = -\tfrac{(1-q)A_N}{L + \mu A_N}\, \psi_{N-1}$.

In addition, noting that $\phi_0 = L\|w_0 - w_\star\|^2$ allows reaching the following reformulation:

    $F^\top\Big(\sum_{i=0}^{N-2} \lambda_{i,i+1}(\mathbf{f}_{i+1} - \mathbf{f}_i) + \sum_{i=0}^{N-1} \lambda_{\star,i}\, \mathbf{f}_i - \lambda_{N-1,\star}\, \mathbf{f}_{N-1}\Big) + \operatorname{Tr}(KG) = \tfrac{1}{L + \mu A_N}\Big(\phi_0 - (1-q)A_N\, \psi_{N-1} + \sum_{i=0}^{N-1}(\phi_{i+1} - \phi_i)\Big) - \|z_N - w_\star\|^2 = 0$,

where the last equality follows from $\phi_N = (1-q)A_N\, \psi_{N-1} + (L + \mu A_N)\|z_N - w_\star\|^2$. Therefore, ITEM is a solution to (Minimax-R). In more direct terms of the SDP (Minimax-R), one can verify that

    $\lambda_{\star,0} - \lambda_{0,1} = \tfrac{1-q}{L}\Big(\tfrac{A_1 - A_0}{1 + qA_N} - \tfrac{A_1}{1 + qA_N}\Big) = 0$ (as $A_0 = 0$),
    $\lambda_{i-1,i} + \lambda_{\star,i} - \lambda_{i,i+1} = \tfrac{1-q}{L}\Big(\tfrac{A_i}{1 + qA_N} + \tfrac{A_{i+1} - A_i}{1 + qA_N} - \tfrac{A_{i+1}}{1 + qA_N}\Big) = 0$ for $i = 1, \ldots, N-2$,
    $\lambda_{N-2,N-1} + \lambda_{\star,N-1} - \lambda_{N-1,\star} = \tfrac{1-q}{L}\Big(\tfrac{A_{N-1}}{1 + qA_N} + \tfrac{A_N - A_{N-1}}{1 + qA_N} - \tfrac{A_N}{1 + qA_N}\Big) = 0$.

The previous computations therefore imply that $K = 0$ for ITEM, and hence $S''(\tau, \{\lambda_{i,j}\}, \{\alpha'_{i,j}\}) = \mathbf{w}_N \mathbf{w}_N^\top \succeq 0$. ⊓⊔

D A few numerical examples
In this section, we provide a few numerical examples that were obtained by solving the first-order method design problem (Minimax-R), formulated as a linear semidefinite program and solved using standard packages [Löfberg, 2004, Mosek, 2010]. The following list summarizes numerical examples obtained for $N = 1, \ldots, 5$, with $L = 1$ and $\mu = 0.1$, in the notations from (7), together with the corresponding worst-case guarantees. We use the notation from Definition 2 for presentation convenience (the solutions in terms of $\{\alpha_{i,j}\}$'s are converted to $\{h_{i,j}\}$'s using (9)). [The numerical worst-case values and step size tables $[h_{i,j}]$ are omitted in this version.]

- For a single iteration, the step size optimization procedure produces a method whose single step size $[h_{1,0}]$ corresponds to $2/(L+\mu)$, together with a worst-case guarantee on $\|w_1 - x_\star\|^2/\|w_0 - x_\star\|^2$.
- For $N = 2, 3, 4$, and $5$, the optimized methods come with lower-triangular step size matrices $[h_{i,j}]$ and worst-case guarantees on $\|w_N - x_\star\|^2/\|w_0 - x_\star\|^2$.

It is straightforward to verify that the numerical worst-case guarantees match $\|z_N - x_\star\|^2 \le \|z_0 - x_\star\|^2/(1 + qA_N)$ from Theorem 3. Furthermore, one can observe an apparently strange step size pattern when going from one iteration count to the next: each time, the last line is replaced by another one, and an additional line is added. This behavior can be explained as follows: ITEM is obtained by setting $y_k \leftarrow w_k$ ($k = 0, \ldots, N-1$) and $z_N \leftarrow w_N$ in the notation (7), as the $w_k$'s are the points where the gradients are evaluated for $k = 0, \ldots, N-1$, whereas $w_N$ is obtained for optimizing the worst-case guarantee (but no gradient is evaluated at $w_N$).

D.1 Optimized methods for $(f(w_N) - f_\star)/\|w_0 - x_\star\|^2$

A slight modification of the relaxations used for obtaining (Minimax-R) allows optimizing step sizes for different criteria. Those modifications are provided in Appendix E and essentially correspond to selecting a different subset of the quadratic inequalities in (10). In this section, we optimize step sizes for the criterion $(f(w_N) - f_\star)/\|w_0 - x_\star\|^2$. The examples below were obtained for $L = 1$ and $\mu = 0.1$. [Numerical guarantees and step size tables omitted.]

- For a single iteration, solving the corresponding optimization problem yields a bound and a step size matching the known optimal step size $h_{1,0}$; see [Taylor, 2017, Theorem 4.14].
- For $N = 2, \ldots, 5$, we similarly obtain optimized lower-triangular step size matrices $[h_{i,j}]$ together with the corresponding worst-case guarantees on $(f(w_N) - f_\star)/\|w_0 - x_\star\|^2$.

Note that when $\mu = 0$, we recover the step size policy of the OGM by Kim and Fessler [2016]. When setting $\mu > 0$, we observe that the resulting optimized method is apparently less practical, as the step sizes critically depend on the horizon $N$. In particular, one can observe that $h_{1,0}$ varies with the horizon $N$.

D.2 Optimized methods for $(f(w_N) - f_\star)/(f(w_0) - f_\star)$

As in the previous section, the technique can be adapted to the criterion $(f(w_N) - f_\star)/(f(w_0) - f_\star)$; see Appendix E for details. The following step sizes were obtained by setting $L = 1$ and $\mu = 0.1$. [Numerical guarantees and step size tables omitted.]

- For a single iteration, we obtain a method whose step size matches the known optimal step size $2/(L+\mu)$ for this setup [De Klerk et al., 2017, Theorem 4.2].
- For $N = 2, \ldots, 5$, we again obtain optimized lower-triangular step size matrices $[h_{i,j}]$ together with the corresponding worst-case guarantees on $(f(w_N) - f_\star)/(f(w_0) - f_\star)$.

Note that the resulting method is again apparently less practical than ITEM, as the step sizes also critically depend on the horizon $N$; for example, observe again that the value of $h_{1,0}$ depends on $N$. However, one can observe that the corresponding step sizes are remarkably symmetric, and that the worst-case guarantees seem to behave slightly better than in the distance problem $\|w_N - x_\star\|^2/\|w_0 - x_\star\|^2$, although the asymptotic behaviors have to be the same, due to the lower complexity bounds.

E SDP formulation for optimizing function values
In this section, we show how to adapt the methodology to alternate design criteria. We do it for both the criterion of Section D.1 and that of Section D.2 simultaneously. The developments only slightly differ from those required for optimizing $\|w_N - w_\star\|^2/\|w_0 - w_\star\|^2$, and hence we decided not to present a unified version in the core of the text, for readability purposes. In particular, the set of selected inequalities is slightly different, altering the linearization procedure.

In this section, we deal with the criterion

    $\dfrac{f(w_N) - f_\star}{c_w \|w_0 - w_\star\|^2 + c_f (f(w_0) - f_\star)} = \dfrac{\tilde f(w_N) - \tilde f_\star + \tfrac{\mu}{2}\|w_N - w_\star\|^2}{c_w \|w_0 - w_\star\|^2 + c_f \big(\tilde f(w_0) - \tilde f_\star + \tfrac{\mu}{2}\|w_0 - w_\star\|^2\big)}$.

As the steps are essentially the same, we proceed without providing much detail. We start with the discrete version, using the set $I = \{\star, 0, \ldots, N\}$:

    $\max_{\{(w_i, g_i, f_i)\}_{i \in I},\ d \in \mathbb{N}} \ f_N - f_\star + \tfrac{\mu}{2}\|w_N - w_\star\|^2$
    s.t. $c_w \|w_0 - w_\star\|^2 + c_f \big(f_0 - f_\star + \tfrac{\mu}{2}\|w_0 - w_\star\|^2\big) = 1$, $g_\star = 0$,
         $w_k$ generated by (8) for $k = 1, \ldots, N$,
         $f_i \ge f_j + \langle g_j; w_i - w_j\rangle + \tfrac{1}{2(L-\mu)}\|g_i - g_j\|^2$ for all $i, j \in I$.

The upper bound we use is now very slightly different:

    $\mathrm{UB}_{\mu,L}(\{\alpha_{i,j}\}) = \max_{\{(w_i, g_i, f_i)\}_{i \in I},\ d \in \mathbb{N}} \ f_N - f_\star + \tfrac{\mu}{2}\|w_N - w_\star\|^2$
    s.t. $c_w \|w_0 - w_\star\|^2 + c_f \big(f_0 - f_\star + \tfrac{\mu}{2}\|w_0 - w_\star\|^2\big) = 1$, $g_\star = 0$,
         $w_k$ generated by (8) for $k = 1, \ldots, N$,
         $f_i \ge f_{i+1} + \langle g_{i+1}; w_i - w_{i+1}\rangle + \tfrac{1}{2(L-\mu)}\|g_i - g_{i+1}\|^2$ for $i = 0, \ldots, N-1$,
         $f_\star \ge f_i + \langle g_i; x_\star - w_i\rangle + \tfrac{1}{2(L-\mu)}\|g_i\|^2$ for $i = 0, \ldots, N$.

The corresponding SDP can be written using a similar couple $(G, F)$:

    $G = \begin{pmatrix} \|w_0 - w_\star\|^2 & \langle g_0; w_0 - w_\star\rangle & \langle g_1; w_0 - w_\star\rangle & \cdots & \langle g_N; w_0 - w_\star\rangle \\ \langle g_0; w_0 - w_\star\rangle & \|g_0\|^2 & \langle g_0; g_1\rangle & \cdots & \langle g_0; g_N\rangle \\ \langle g_1; w_0 - w_\star\rangle & \langle g_0; g_1\rangle & \|g_1\|^2 & \cdots & \langle g_1; g_N\rangle \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ \langle g_N; w_0 - w_\star\rangle & \langle g_0; g_N\rangle & \langle g_1; g_N\rangle & \cdots & \|g_N\|^2 \end{pmatrix}, \qquad F = \begin{pmatrix} f_0 - f_\star \\ f_1 - f_\star \\ \vdots \\ f_N - f_\star \end{pmatrix}$,

and the similar notations $\mathbf{w}_0 = e_1 \in \mathbb{R}^{N+2}$, $\mathbf{g}_i = e_{i+2} \in \mathbb{R}^{N+2}$, $\mathbf{f}_i = e_{i+1} \in \mathbb{R}^{N+1}$, with $i = 0, \ldots, N$ and $e_i$ being the unit vector whose $i$th component is equal to 1. In addition, we also denote

    $\mathbf{w}_k = \mathbf{w}_0 \Big(1 - \tfrac{\mu}{L}\sum_{i=0}^{k-1} \alpha_{k,i}\Big) - \sum_{i=0}^{k-1} \tfrac{\alpha_{k,i}}{L}\, \mathbf{g}_i$.

A dual formulation of $\mathrm{UB}_{\mu,L}$ is given by (we directly included the Schur complement)

    $\mathrm{UB}_{\mu,L}(\{\alpha_{i,j}\}) = \min_{\tau,\ \lambda_{i,j} \ge 0} \ \tau$
    s.t. $\begin{pmatrix} S'(\tau, \{\lambda_{i,j}\}) & \sqrt{\mu/2}\, \mathbf{w}_N \\ \sqrt{\mu/2}\, \mathbf{w}_N^\top & 1 \end{pmatrix} \succeq 0$,
         $\tau c_f\, \mathbf{f}_0 + \sum_{i=0}^{N-1} \lambda_{i,i+1}(\mathbf{f}_{i+1} - \mathbf{f}_i) + \sum_{i=0}^{N} \lambda_{\star,i}\, \mathbf{f}_i = \mathbf{f}_N$,

with

    $S'(\tau, \{\lambda_{i,j}\}) = \tau\big(c_w + c_f \tfrac{\mu}{2}\big)\, \mathbf{w}_0\mathbf{w}_0^\top + \sum_{i=0}^{N} \lambda_{\star,i}\Big(-\tfrac{1}{2}\big(\mathbf{g}_i\mathbf{w}_i^\top + \mathbf{w}_i\mathbf{g}_i^\top\big) + \tfrac{1}{2(L-\mu)}\, \mathbf{g}_i\mathbf{g}_i^\top\Big) + \sum_{i=0}^{N-1} \lambda_{i,i+1}\Big(\tfrac{1}{2}\big(\mathbf{g}_{i+1}(\mathbf{w}_i - \mathbf{w}_{i+1})^\top + (\mathbf{w}_i - \mathbf{w}_{i+1})\mathbf{g}_{i+1}^\top\big) + \tfrac{1}{2(L-\mu)}(\mathbf{g}_i - \mathbf{g}_{i+1})(\mathbf{g}_i - \mathbf{g}_{i+1})^\top\Big)$.

Note that the equality constraint corresponds to

    $\tau c_f + \lambda_{\star,0} - \lambda_{0,1} = 0$,
    $\lambda_{i-1,i} + \lambda_{\star,i} - \lambda_{i,i+1} = 0$ for $i = 1, \ldots, N-1$,
    $\lambda_{N-1,N} + \lambda_{\star,N} = 1$.

We perform some additional work on $S'$, as before:

    $S'(\tau, \{\lambda_{i,j}\}) = \tau\big(c_w + c_f \tfrac{\mu}{2}\big)\, \mathbf{w}_0\mathbf{w}_0^\top + \tfrac{1}{2(L-\mu)}\Big(\sum_{i=0}^{N} \lambda_{\star,i}\, \mathbf{g}_i\mathbf{g}_i^\top + \sum_{i=0}^{N-1} \lambda_{i,i+1}(\mathbf{g}_i - \mathbf{g}_{i+1})(\mathbf{g}_i - \mathbf{g}_{i+1})^\top\Big)$
    $\quad - \tfrac{1}{2}\lambda_{\star,0}\big(\mathbf{g}_0\mathbf{w}_0^\top + \mathbf{w}_0\mathbf{g}_0^\top\big) - \tfrac{1}{2}\sum_{i=1}^{N-1} \lambda_{i,i+1}\big(\mathbf{g}_i\mathbf{w}_i^\top + \mathbf{w}_i\mathbf{g}_i^\top\big) - \tfrac{1}{2}\big(\mathbf{g}_N\mathbf{w}_N^\top + \mathbf{w}_N\mathbf{g}_N^\top\big) + \tfrac{1}{2}\sum_{i=0}^{N-1} \lambda_{i,i+1}\big(\mathbf{g}_{i+1}\mathbf{w}_i^\top + \mathbf{w}_i\mathbf{g}_{i+1}^\top\big)$,

where we used $\lambda_{\star,i} + \lambda_{i-1,i} = \lambda_{i,i+1}$ (for $i = 1, \ldots, N-1$) and $\lambda_{\star,N} + \lambda_{N-1,N} = 1$. Now, making the dependence on the $\alpha_{i,j}$'s explicit again, we arrive at

    $S'(\tau, \{\lambda_{i,j}\}) = \tau\big(c_w + c_f \tfrac{\mu}{2}\big)\, \mathbf{w}_0\mathbf{w}_0^\top + \tfrac{1}{2(L-\mu)}\Big(\sum_{i=0}^{N} \lambda_{\star,i}\, \mathbf{g}_i\mathbf{g}_i^\top + \sum_{i=0}^{N-1} \lambda_{i,i+1}(\mathbf{g}_i - \mathbf{g}_{i+1})(\mathbf{g}_i - \mathbf{g}_{i+1})^\top\Big)$
    $\quad - \tfrac{1}{2}\lambda_{\star,0}\big(\mathbf{g}_0\mathbf{w}_0^\top + \mathbf{w}_0\mathbf{g}_0^\top\big) - \tfrac{1}{2}\sum_{i=1}^{N-1} \lambda_{i,i+1}\Big(1 - \tfrac{\mu}{L}\sum_{j=0}^{i-1} \alpha_{i,j}\Big)\big(\mathbf{g}_i\mathbf{w}_0^\top + \mathbf{w}_0\mathbf{g}_i^\top\big) + \sum_{i=1}^{N-1} \lambda_{i,i+1} \sum_{j=0}^{i-1} \tfrac{\alpha_{i,j}}{2L}\big(\mathbf{g}_i\mathbf{g}_j^\top + \mathbf{g}_j\mathbf{g}_i^\top\big)$
    $\quad - \tfrac{1}{2}\Big(1 - \tfrac{\mu}{L}\sum_{j=0}^{N-1} \alpha_{N,j}\Big)\big(\mathbf{g}_N\mathbf{w}_0^\top + \mathbf{w}_0\mathbf{g}_N^\top\big) + \sum_{j=0}^{N-1} \tfrac{\alpha_{N,j}}{2L}\big(\mathbf{g}_N\mathbf{g}_j^\top + \mathbf{g}_j\mathbf{g}_N^\top\big)$
    $\quad + \tfrac{1}{2}\sum_{i=0}^{N-1} \lambda_{i,i+1}\Big(1 - \tfrac{\mu}{L}\sum_{j=0}^{i-1} \alpha_{i,j}\Big)\big(\mathbf{g}_{i+1}\mathbf{w}_0^\top + \mathbf{w}_0\mathbf{g}_{i+1}^\top\big) - \sum_{i=0}^{N-1} \lambda_{i,i+1} \sum_{j=0}^{i-1} \tfrac{\alpha_{i,j}}{2L}\big(\mathbf{g}_{i+1}\mathbf{g}_j^\top + \mathbf{g}_j\mathbf{g}_{i+1}^\top\big)$,

and it remains to remark that the change of variables

    $\alpha'_{i,j} = \begin{cases} \lambda_{i,i+1}\, \alpha_{i,j} & \text{if } 0 \le i \le N-1, \\ \alpha_{N,j} & \text{if } i = N, \end{cases}$   (17)

linearizes the bilinear matrix inequality, again, and it remains to solve the SDP (13) using standard packages. Numerical results for the pairs $(c_w, c_f) = (1, 0)$ and $(c_w, c_f) = (0, 1)$ are reported in Section D.1 and Section D.2, respectively.
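As a final sanity check on the $N = 1$ entries reported in Appendix D, the single-step method with effective step size $2/(L+\mu)$ can be verified directly on one-dimensional quadratics, where the worst-case contraction of the distance to the minimizer is attained at the extreme curvatures $\mu$ and $L$. This is only a numerical illustration on a restricted problem class, not the SDP-based certification used in the text:

```python
import numpy as np

L, mu = 1.0, 0.1
gamma = 2.0 / (L + mu)       # effective step size recovered by the design procedure
rho = (L - mu) / (L + mu)    # classical optimal contraction factor

# On quadratics f(w) = lam/2 * w^2 with mu <= lam <= L, one gradient step
# contracts the distance to the minimizer by |1 - gamma * lam|.
lams = np.linspace(mu, L, 1001)
worst = np.max(np.abs(1.0 - gamma * lams))

# The worst case over the class equals rho, attained at lam in {mu, L} ...
assert np.isclose(worst, rho)

# ... and no other step size does better in the worst case over this grid.
for g in np.linspace(0.5 * gamma, 1.5 * gamma, 11):
    assert np.max(np.abs(1.0 - g * lams)) >= worst - 1e-12
```

In the notation of the main text, this corresponds to the guarantee $\|w_1 - x_\star\|^2 \le \rho^2 \|w_0 - x_\star\|^2$ for $N = 1$; quadratics already attain the bound, which is why this one-dimensional check matches the SDP output.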