Factor-\sqrt{2} Acceleration of Accelerated Gradient Methods
aa r X i v : . [ m a t h . O C ] F e b Factor- √ Acceleration of Accelerated Gradient Methods
Chanwoo Park Jisun Park Ernest K. Ryu Abstract
The optimized gradient method (OGM) providesa factor- √ speedup upon Nesterov’s celebratedaccelerated gradient method in the convex (butnon-strongly convex) setup. However, this im-proved acceleration mechanism has not beenwell understood; prior analyses of OGM reliedon a computer-assisted proof methodology, sothe proofs were opaque for humans despite beingverifiable and correct. In this work, we presenta new analysis of OGM based on a Lyapunovfunction and linear coupling. These analyses aredeveloped and presented without the assistanceof computers and are understandable by humans.Furthermore, we generalize OGM’s accelerationmechanism and obtain a factor- √ speedup inother setups: acceleration with a simpler ratio-nal stepsize, the strongly convex setup, and themirror descent setup.
1. Introduction
Nesterov’s celebrated accelerated gradient method (AGM)solves the problem of finding the minimum of an L -smoothconvex function with an “optimal” accelerated O (1 /k ) complexity (Nesterov, 1983). Surprisingly, AGM turnedout to be not exactly optimal, but optimal only up to a con-stant. The optimized gradient method (OGM) has a factor- smaller (better) worst-case guarantee and thereby requiresfactor- √ fewer iterations to guarantee the same accuracy(Drori & Teboulle, 2014; Kim & Fessler, 2016).However, this remarkable discovery has not been wellunderstood. OGM was originally obtained through acomputer-assisted methodology based on the performanceestimation problem (PEP). The resulting convergence anal-yses involve arduous but elementary calculations that areverifiable but arguably not understandable by humans. Department of Statistics, Seoul National University, Seoul,Korea Department of Mathematical Sciences, Seoul NationalUniversity, Seoul, Korea. Correspondence to: Ernest K. Ryu
Contribution.
In this work, we present human-understandable analyses of OGM. First, we showthat the improved acceleration mechanism of OGM can beunderstood and analyzed through an unconventional Lya-punov function. We then use this insight to propose a newmethod that obtains the factor- √ speedup in the stronglyconvex setup. Finally, we present a human-understandablederivation of OGM based on refining the linear couplinganalysis of Allen-Zhu & Orecchia (2017), and generalizeOGM to the mirror descent setup.As minor contributions, we analyze the primary and sec-ondary sequences of OGM through a single unified analy-sis; to the best of our knowledge, prior works provide twoseparate convergence proofs for x - and y -sequences. More-over, we present a unified class of accelerated methods con-taining AGM and OGM through the linear coupling analy-sis. For
L > , a differentiable convex function f : R n → R is L -smooth with respect to a norm k · k if k∇ f ( x ) − ∇ f ( y ) k ∗ ≤ L k x − y k ∀ x, y ∈ R n , where k · k ∗ denotes the dual norm. A convex function f : R n → R is µ -strongly convex if f ( x ) − ( µ/ k x k isconvex (Nesterov, 2004; Ryu & Yin, 2021).Throughout this paper, we consider the problemminimize x ∈ R n f ( x ) and make the following assumptions on f : R n → R :(A1) f is convex, differentiable, and L -smooth with respectto k · k and(A2) f has a minimizer (not necessarily unique).We write x ⋆ for a minimizer of f and f ⋆ = f ( x ⋆ ) for theoptimal value. actor- √ Acceleration of Accelerated Gradient Methods
Nesterov’s AGM.
Nesterov’s AGM is y k +1 = x k − L ∇ f ( x k ) x k +1 = y k +1 + θ k − θ k +1 ( y k +1 − y k ) , where y = x , θ = 1 , and θ k +1 − θ k +1 = θ k for k =0 , , . . . (Nesterov, 1983). We can equivalently write AGMas y k +1 = x k − L ∇ f ( x k ) z k +1 = z k − θ k L ∇ f ( x k ) x k +1 = (cid:18) − θ k +1 (cid:19) y k +1 + 1 θ k +1 z k +1 with z = x (Nesterov, 2005).AGM can be generalized to use the relaxed parameter re-quirement θ k +1 − θ k +1 ≤ θ k on the positive sequence { θ k } ∞ k =0 . The choice θ k = ( k + 2) / is a common in-stance.In the setup where f is furthermore µ -strongly convex, Nes-terov’s AGM for the strongly convex setup (SC-AGM) is y k +1 = x k − L ∇ f ( x k ) x k +1 = y k +1 + √ κ − √ κ + 1 ( y k +1 − y k ) for k = 0 , , . . . , where κ = L/µ and y = x (Nesterov,2004). Optimized gradient method.
OGM is y k +1 = x k − L ∇ f ( x k ) x k +1 = y k +1 + θ k − θ k +1 ( y k +1 − y k ) + θ k θ k +1 ( y k +1 − x k ) for k = 0 , , . . . , where y = x and { θ k } ∞ k =1 is the sameas that of AGM (Drori & Teboulle, 2014; Kim & Fessler,2016). We refer to θ k − θ k +1 ( y k +1 − y k ) as the momentum term and θ k θ k +1 ( y k +1 − x k ) as the correction term . The addedcorrection term is the difference between AGM and OGM.We can equivalently write OGM as y k +1 = x k − L ∇ f ( x k ) z k +1 = z k − θ k L ∇ f ( x k ) x k +1 = (cid:18) − θ k +1 (cid:19) y k +1 + 1 θ k +1 z k +1 , where z = x (Kim & Fessler, 2016). The factor in z k +1 is the difference compared to AGM. The y k -sequence of OGM exhibits a rate faster than thatof AGM by a factor of √ . This rate was proved in(Kim & Fessler, 2017), and we also state it in Corollary 1.To clarify, the guarantee on the function value is smaller(better) by a factor of , and, combined with the O (1 /k ) iteration dependence, this represents a factor- √ reductionin the number of iterations necessary to reach a given accu-racy.Furthermore, OGM’s original presentation(Drori & Teboulle, 2014; Kim & Fessler, 2016) in-volves what we refer to as the last-step modification on thesecondary sequence ˜ x k +1 = y k +1 + θ k − ϕ k +1 ( y k +1 − y k ) + θ k ϕ k +1 ( y k +1 − x k )= (cid:18) − ϕ k +1 (cid:19) y k +1 + 1 ϕ k +1 z k +1 , where ϕ k − ϕ k − θ k − = 0 . The ˜ x k -sequence of OGM ex-hibits a rate slightly better than OGM’s y k -sequence and isin fact exactly optimal (Drori, 2017). This rate was provedin the original presentation of OGM (Drori & Teboulle,2014; Kim & Fessler, 2016), and we also state it in Corol-lary 3.In this work, we present the first variant of OGM for thestrongly convex setup. θ k -sequence asymptotic characterization. Throughoutthe exposition of this work, we will use the followingasymptotic characterization: if θ = 1 and θ k +1 − θ k +1 = θ k for k = 0 , , . . . , then θ k = k + ζ + 12 + log k o (1) (1)as k → ∞ , where ζ ≈ . . While we suspect this re-sult may be known, we could not find it in any reference.Therefore, we formally state and prove (1) as Lemma 7 inthe appendix. Computer-assisted derivation and analysis of OGM.
OGM was originally obtained through a computer-assisted methodology based on the performance estima-tion problem (PEP); it was first discovered numerically(Drori & Teboulle, 2014) and then its analytical form andconvergence analysis was found (Kim & Fessler, 2016).The PEP methodology’s key insight is to optimize overthe class of fixed-step first-order gradient methods, withthe objective being the convergence guarantee. Surpris-ingly, this problem is semidefinite programming- (SDP-)representable and has a tightness guarantee (Taylor et al.,2017b). OGM was re-discovered by using the PEP to finda greedy first-order method simplified with a “subspace-search elimination procedure” (Drori & Taylor, 2020). actor- √ Acceleration of Accelerated Gradient Methods
However, these prior analyses of OGM, generated by com-puters, are verifiable but arguably not understandable byhumans. Moreover, as the analyses rely on finding analyt-ical solutions to the SDPs arising from the PEP, they areinaccessible to those unfamiliar with the methodology.
Lyapunov analysis of AGM.
Nesterov’s original 1983paper established the celebrated O (1 /k ) rate using aLyapunov analysis (Nesterov, 1983). Subsequent works(Ahn & Sra, 2020; Auslender & Teboulle, 2006; Baes,2009; Li et al., 2020; Nesterov, 2004; 2005; 2008; 2012;Wibisono et al., 2016) analyzed AGM and its variantsthrough the “estimate sequence” technique, which manyconsider to be less transparent than Lyapunov analy-ses. In recent years, there has been a renewed in-terest in studying accelerated methods via Lyapunovanalyses (Aujol & Dossal, 2017; Aujol et al., 2019a;b;Bansal & Gupta, 2019; Beck & Teboulle, 2009; Su et al.,2014; Taylor & Bach, 2019)In this work, we present the first Lyapunov analysis ofOGM. Linear coupling analysis of AGM.
The interpretation ofAGM as a linear coupling between gradient descent andmirror descent was presented in (Allen-Zhu & Orecchia,2017). Specifically, AGM can be written as y k +1 = arg min y (cid:26) h∇ f ( x k ) , y − x k i + L k y − x k k (cid:27) z k +1 = arg min y { V z k ( y ) + h α k +1 ∇ f ( x k ) , y − x k i} x k +1 = (1 − τ k +1 ) y k +1 + τ k +1 z k +1 , where V z is a Bregman divergence. The y k -update canbe viewed as a gradient descent update and the z k -updatecan be viewed as a mirror descent update. Mirror descent(Nemirovsky & Yudin, 1983) was originally presented asa method that maps the current point to a dual space, per-forms a gradient update, and maps the point back to theprimal space. An alternate proximal form of mirror de-scent (which we use) was presented in (Beck & Teboulle,2003). An alternate “dual averaging” interpretation of mir-ror descent as a method that constructs a lower bound ofthe function was presented in (Nesterov, 2009). The key in-sight of linear coupling is to carefully interpolate betweenmirror descent and gradient descent to obtain AGM.Linear coupling has been used to obtain and analyze manyextensions of AGM (Allen-Zhu, 2017; Allen-Zhu & Hazan,2016; Allen-Zhu et al., 2016a;b), but whether the linearcoupling argument itself can be further refined seems notto have been studied. In this work, we show that refiningthe linear coupling analysis naturally leads to OGM. Tight inequalities.
We informally say an inequality istight if it cannot be improved without further assump-tions and formally if it satisfies the “interpolation condi-tions” of Taylor et al. (2017b). The recent literature onperformance estimation problem focuses on using tightinequalities to obtain proofs that are provably cannot beimproved (De Klerk et al., 2020; Gu & Yang, 2020; Kim,2019; Lieder, 2020; Ryu et al., 2020; Taylor & Bach, 2019;Taylor et al., 2017a).The tight inequality we use is f ( y ) ≥ f ( x ) + h∇ f ( x ) , y − x i + 12 L k∇ f ( x ) − ∇ f ( y ) k ∗ for all L -smooth convex function f and x, y ∈ R n . The lin-ear coupling analysis of AGM uses strictly weaker inequal-ities three times. By refining the analysis by replacing thenon-tight inequalities with tight ones, we obtain OGM.
2. Lyapunov analysis of OGM
In this section, we present a Lyapunov analysis of OGM.Our key insight is to use (cid:18) f ( x k ) − f ⋆ − L k∇ f ( x k ) k (cid:19) , which is nonnegative due to L -smoothness, instead of ( f ( x k ) − f ⋆ ) or ( f ( y k ) − f ⋆ ) in the construction of theLyapunov function. Throughout this section, k · k = k · k ∗ denotes the Euclidean norm.Based on this insight, we present: (i) a more human-understandable analysis of OGM (ii) a unified analysis ofboth the primary and secondary sequences of OGM thatadmits simpler θ k -choices. Nesterov’s AGM has the rate f ( y k ) − f ⋆ ≤ L k x − x ⋆ k θ k − = 2 L k x − x ⋆ k ( k + ζ ) − L k x − x ⋆ k log k ( k + ζ ) + o (cid:18) k (cid:19) for k = 0 , , . . . . This rate can be established through thefollowing Lyapunov analysis (Nesterov, 1983): for k =0 , , . . . , define U k = θ k − ( f ( y k ) − f ⋆ ) + L k z k − x ⋆ k with θ − = 0 and show U k ≤ · · · ≤ U . Conclude with θ k − ( f ( y k ) − f ⋆ ) ≤ U k ≤ U = L k x − x ⋆ k . actor- √ Acceleration of Accelerated Gradient Methods
We now analyze OGM’s convergence through an analogousLyapunov analysis.
Theorem 1.
Assume (A1) and (A2). Let the positive se-quence { θ k } ∞ k =0 satisfy θ = 1 and ≤ θ k +1 − θ k +1 ≤ θ k for k = 0 , , . . . . OGM’s y k -sequence exhibits the rate f ( y k ) − f ⋆ ≤ L k x − x ⋆ k θ k − for k = 1 , , . . . .Proof outline. Set θ − = 0 and x − = x . For k = − , , , . . . , define U k =2 θ k (cid:18) f ( x k ) − f ⋆ − L k∇ f ( x k ) k (cid:19) + L k z k +1 − x ⋆ k . We can show that { U k } ∞ k = − is nonincreasing. Using f ( y k ) ≤ f ( x k − ) − L k∇ f ( x k − ) k , which follows from L -smoothness, conclude the rate with θ k − ( f ( y k ) − f ⋆ ) ≤ θ k − (cid:18) f ( x k − ) − f ⋆ − L k∇ f ( x k − ) k (cid:19) ≤ U k − ≤ U − = L k z − x ⋆ k for k = 1 , , . . . .As with AGM, the optimal { θ k } ∞ k =0 is given by θ k +1 − θ k +1 = θ k , which was used in the original presentation ofOGM (Drori & Teboulle, 2014; Kim & Fessler, 2016). Corollary 1.
Under the setup of Theorem 1, the choice θ k +1 − θ k +1 = θ k leads to the rate f ( y k ) − f ⋆ ≤ L k x − x ⋆ k θ k − = L k x − x ⋆ k ( k + ζ ) − L k x − x ⋆ k log k ( k + ζ ) + o (cid:18) k (cid:19) for k = 1 , , . . . .Proof. This follows from Theorem 1 and (1).The relaxed parameter requirement ≤ θ k +1 − θ k +1 ≤ θ k of Theorem 1 is reminiscent of the requirement for AGM.We note that (Kim & Fessler, 2018c) had presented a gen-eralized analysis with requirement θ k +1 ≤ P k +1 i =1 θ i basedon the performance estimation problem methodology. The relaxed parameter requirement allows us to use the sim-pler rational coefficients θ k = ( k + 2) / . This leads to y k +1 = x k − L ∇ f ( x k ) x k +1 = y k +1 + kk + 3 ( y k +1 − y k ) + k + 2 k + 3 ( y k +1 − x k ) , which we call Simple-OGM . Corollary 2.
Assume (A1) and (A2). Simple-OGM’s y k -sequence exhibits the rate f ( y k ) − f ⋆ ≤ L k x − x ⋆ k ( k + 1) for k = 1 , , . . . .Proof. This follows from Theorem 1.
We now analyze the convergence of OGM’s secondary se-quence with last-step modification through a unified Lya-punov analysis.
Theorem 2.
Assume (A1) and (A2). Let the positive se-quence { θ k } ∞ k =0 satisfy θ = 1 , and ≤ θ k +1 − θ k +1 ≤ θ k for k = 0 , , . . . . Let the positive sequence { ϕ k } ∞ k =0 satisfy ≤ ϕ k − ϕ k ≤ θ k − for k = 0 , , . . . , where we define θ − = 0 . OGM’s ˜ x k -sequence, the secondary sequencewith last-step modification, exhibits the rate f (˜ x k ) − f ⋆ ≤ L k x − x ⋆ k ϕ k for k = 0 , , . . . .Proof outline. Let { U k } ∞ k = − be as defined in the proof ofthe Theorem 1. Define { ˜ U k } ∞ k =0 as ˜ U k = ϕ k ( f (˜ x k ) − f ⋆ ) + L (cid:13)(cid:13)(cid:13)(cid:13) z k − L ϕ k ∇ f (˜ x k ) − x ⋆ (cid:13)(cid:13)(cid:13)(cid:13) . We can show that ˜ U k ≤ U k − , conclude the rate with ϕ k ( f (˜ x k ) − f ⋆ ) ≤ ˜ U k ≤ U − = L k x − x ⋆ k for k = 0 , , . . . . Corollary 3.
Under the setup of Theorem 2, the choice θ k +1 − θ k +1 = θ k and ϕ k − ϕ k = 2 θ k − leads to therate f (˜ x k ) − f ⋆ ≤ L k x − x ⋆ k ϕ k = L k x − x ⋆ k ( k + ζ + 1 / √ − L k x − x ⋆ k log k ( k + ζ + 1 / √ + o (cid:18) k (cid:19) for k = 0 , , . . . . actor- √ Acceleration of Accelerated Gradient Methods
Proof.
This follows from (1), which implies ϕ k = k + ζ + √ √ + √ k + o (1) , and Theorem 2.Simple-OGM with the last-step modification is y k +1 = x k − L ∇ f ( x k ) x k +1 = y k +1 + kk + 3 ( y k +1 − y k ) + k + 2 k + 3 ( y k +1 − x k )˜ x k +1 = y k +1 + k √ k + 2) + 1 ( y k +1 − y k )+ k + 2 √ k + 2) + 1 ( y k +1 − x k ) , where x = y . Corollary 4.
Assume (A1) and (A2). Simple-OGM’s ˜ x k -sequence, the secondary sequence with last-step modifica-tion, exhibits the rate f (˜ x k ) − f ⋆ ≤ L k x − x ⋆ k ( k + 1 + 1 / √ for k = 0 , , . . . .Proof. Use Corollary 3 with θ k = k +22 and ϕ k = k +1+ √ √ . We clarify that the presented Lyapunov analysis is anovel contribution, while the results themselves are mostlyknown (Kim & Fessler, 2016; 2017; 2018c). The proofsin this section are complete except where we assert mono-tonicity of the Lyapunov functions and state “we can show”.The missing arguments are provided in the appendix.We emphasize two key points. First is the somewhat un-usual construction of the Lyapunov function. This key in-sight will be used in the following section to present a novelmethod for the strongly convex setup.The second point we emphasize is that we present a unifiedanalysis of the primary and last-step-modified secondarysequences using the Lyapunov functions U k and ˜ U k . Priorworks on the two sequences of AGM and OGM rely on twoseparate analyses (Kim & Fessler, 2016; 2017).
3. Strongly convex OGM
In this section, we present strongly convex OGM (SC-OGM), a novel method that provides a factor- √ improve-ment over Nesterov’s SC-AGM. The method and its analy-sis are obtained with following the key insight of Section 2:use the OGM-type correction term in the method and use (cid:18) f ( x k ) − f ⋆ − L k∇ f ( x k ) k (cid:19) in the construction of the Lyapunov function. Throughoutthis section, k · k = k · k ∗ denotes the Euclidean norm.Based on this insight, we present: (i) a novel method SC-OGM and (ii) a unified analysis of both the primary andsecondary sequences of SC-OGM. Further assume f is µ -strongly convex and write κ = L/µ .SC-AGM’s convergence rate f ( y k ) − f ⋆ ≤ (cid:18) √ κ − (cid:19) − k µ + L k x − x ⋆ k = O (cid:0) exp (cid:0) − k/ √ κ (cid:1)(cid:1) can be established through the following Lyapunov analy-sis (Bansal & Gupta, 2019). For k = 0 , , . . . , define U k = (cid:18) √ κ − (cid:19) k (cid:16) f ( y k ) − f ⋆ + µ k z k − x ⋆ k (cid:17) with z k = ( √ κ + 1) x k − √ κy k and show U k ≤ · · · ≤ U ≤ µ + L k x − x ⋆ k . We newly propose SC-OGM: y k +1 = x k − L ∇ f ( x k ) x k +1 = y k +1 + 12 γ + 1 ( y k +1 − y k ) + 12 γ + 1 ( y k +1 − x k ) for k = 0 , , . . . , where y = x and γ = √ κ +1+32 κ − . Theorem 3.
Assume (A1), (A2), and that f is µ -stronglyconvex. SC-OGM’s y k -sequence exhibits the rate f ( y k ) − f ⋆ ≤ (1 + γ ) − k +1 µ + 2 L k x − x ⋆ k = O (exp( −√ k/ √ κ )) for k = 1 , , . . . .Proof outline. For k = 0 , , . . . , define z k = 2 γ + 1 γ x k − γ + 1 γ y k and U k = (1 + γ ) k (cid:18) f ( x k ) − f ⋆ − L k∇ f ( x k ) k + µ k z k +1 − x ⋆ k (cid:19) . We can show that { U k } ∞ k =0 is nonincreasing and that U ≤ µ +2 L k x − x ⋆ k . Using f ( y k ) ≤ f ( x k − ) − actor- √ Acceleration of Accelerated Gradient Methods L k∇ f ( x k − ) k , which follows from L -smoothness, con-clude the rate with (1 + γ ) k − ( f ( y k ) − f ⋆ ) ≤ (1 + γ ) k − (cid:18) f ( x k − ) − f ⋆ − L k∇ f ( x k − ) k (cid:19) ≤ U k − ≤ U ≤ µ + 2 L k x − x ⋆ k for k = 1 , , . . . . We now analyze the convergence of SC-OGM’s secondarysequence with a unified Lyapunov analysis. We note thatSC-OGM does not require the last-step modification, un-like the non-strongly convex counterpart.
Theorem 4.
Assume (A1), (A2), and that f is µ -stronglyconvex. SC-OGM’s x k -sequence, the secondary sequencewithout last-step modification, exhibits the rate f ( x k ) − f ⋆ ≤ (1 + γ ) − k +2 γ (cid:18) µ + 2 L k x − x ⋆ k (cid:19) for k = 1 , , . . . .Proof outline. Let { z k } ∞ k =0 and { U k } ∞ k =0 be defined as inthe proof of the Theorem 3. For k = 0 , , . . . , define ˜ U k = (1+ γ ) k − (cid:18) γ γ ( f ( x k ) − f ⋆ )+ µ (cid:13)(cid:13)(cid:13)(cid:13) z k − (cid:18) γ + 2 γ (cid:19) L ∇ f ( x k ) − x ⋆ (cid:13)(cid:13)(cid:13)(cid:13) (cid:19) We can show that ˜ U k ≤ U k − . Conclude the rate with (1 + γ ) k − γ γ ( f ( x k ) − f ⋆ ) ≤ ˜ U k ≤ U ≤ µ + 2 L k x − x ⋆ k for k = 1 , , . . . . The factor- √ improvement of SC-OGM over SC-AGM isconsistent with the factor- √ improvement of OGM overAGM. AGM and OGM share the same momentum termwhile OGM has the additional “correction term”. In con-trast, the momentum coefficients differ in the strongly con-vex case: SC-AGM has √ κ − √ κ + 1 = 1 − √ κ + O (cid:18) κ (cid:19) while SC-OGM has γ + 1 = 1 − √ √ κ + O (cid:18) κ (cid:19) . Of course, SC-OGM also has the correction term, which isessential in the analysis.Again, we point out that we analyze both the primary andsecondary sequences in a single streamlined proof using theLyapunov functions U k and ˜ U k .We also point out that Kim & Fessler (2018a) presented avariant of OGM named OGM-q for smooth strongly convex quadratic functions. In contrast, SC-OGM applies to thebroader class of smooth strongly convex functions.
4. Linear coupling analysis
While the Lyapunov analyses of Sections 2 and 3 do pro-vide insight into the acceleration mechanism of OGM,they do not shed light onto the provenance of the method.Originally, OGM was generated through a computer-assisted proof methodology as the exactly optimal first-order method, but this approach is arguably opaque to hu-mans.In this section, we present a human-understandable deriva-tion of OGM based on linear coupling. Specifically, weobtain OGM by refining the linear coupling analysis ofAllen-Zhu & Orecchia (2017) through replacing the use ofnon-tight inequalities with tight inequalities.We specifically provide: (i) a natural (and non-computerassisted) derivation of OGM, (ii) a generalization of OGMto the mirror descent setup, and (iii) a unification of AGMand OGM. We moreover provide (iv) a generalization ofSC-OGM to the mirror descent setup in the appendix, inSection H.
Assumption and notation.
In this section, assume(A3) k · k = p x T Qx is a quadratic norm, where Q is asymmetric positive definite matrix.Assumption (A1) is to be interpreted as L -smoothness withrespect to norm k · k . Write k · k ∗ = x T Q − x for the dualnorm of k·k . However, h· , ·i is the standard Euclidean innerproduct (unrelated to Q ). Let w : R n → R be a “distancegenerating function” that is differentiable and -stronglyconvex with respect to k · k , and let V x ( y ) = w ( y ) − h∇ w ( x ) , y − x i − w ( x ) ∀ x, y ∈ R n be the Bregman divergence generated by w . We briefly outline the linear coupling analysis of AGMpresented in (Allen-Zhu & Orecchia, 2017) and point outwhere the analysis can be refined. actor- √ Acceleration of Accelerated Gradient Methods
Consider the problem of minimizing f under assumptions(A1), (A2), and (A3). The linear coupling method is y k +1 = x k − L − Q − ∇ f ( x k ) (LC) z k +1 = arg min y ∈ R n { V z k ( y ) + h α k +1 ∇ f ( x k ) , y − x k i} x k +1 = (1 − τ k +1 ) y k +1 + τ k +1 z k +1 for k = 0 , , . . . , where x = z and { α k } ∞ k =1 and { τ k } ∞ k =1 are positive sequences to be determined.We obtain AGM by performing a non-tight analysis of (LC)and letting the analysis inform the choices of { α k } ∞ k =1 and { τ k } ∞ k =1 . The first step of this analysis is α k +1 h∇ f ( x k ) , z k − x ⋆ i≤ α k +1 k∇ f ( x k ) k ∗ + V z k ( x ⋆ ) − V z k +1 ( x ⋆ ) ≤ α k +1 L ( f ( x k ) − f ( y k +1 )) + V z k ( x ⋆ ) − V z k +1 ( x ⋆ ) . The second inequality follows from f ( x k ) − f ( y k +1 ) ≥ L k∇ f ( x k ) k ∗ + 12 L k∇ f ( y k +1 ) k ∗ , but the underscored term L k∇ f ( y k +1 ) k ∗ is not used, i.e.,proof utilizes the weaker and non-tight inequality f ( x k ) − f ( y k +1 ) ≥ L k∇ f ( x k ) k ∗ . The second step of this analysis is to choose τ k = α k +1 L to eliminate f ( x k ) and to show α k +1 L (cid:0) f ( y k +1 ) − f ⋆ (cid:1) + V z k +1 ( x ⋆ ) ≤ (cid:0) α k +1 L − α k +1 (cid:1) ( f ( y k ) − f ⋆ ) + V z k ( x ⋆ ) . The inequality follows from f ( x k ) − f ⋆ ≤ h∇ f ( x k ) , x k +1 − x ⋆ i − L k∇ f ( x k ) k ∗ and h∇ f ( x k ) , y k − x k i≤ f ( y k ) − f ( x k ) − L k∇ f ( y k ) − ∇ f ( x k ) k ∗ , but the underscored terms are not used. Finally, conver-gence is established through a telescoping sum argument. We now derive OGM through performing a tight analysis of(LC) and letting the analysis inform the choices of { α k } ∞ k =1 and { τ k } ∞ k =0 .In the first step of our linear coupling analysis, we followthe same arguments but do not take the step utilizing thenon-tight inequality. Lemma 1.
Assume (A1) and (A2). The iterates (LC) satisfy α k +1 h∇ f ( x k ) , z k − x ⋆ i≤ α k +1 k∇ f ( x k ) k ∗ + V z k ( x ⋆ ) − V z k +1 ( x ⋆ ) for k = 0 , , . . . .Proof. This is exactly the first part of Lemma 4.2 of(Allen-Zhu & Orecchia, 2017).In the second step of our linear coupling analysis, wechoose τ k = α k +1 L to allow for a telescoping sum argu-ment and show the following lemma. Lemma 2.
Assume (A1), (A2) and (A3). Let < τ k = α k +1 L ≤ for k = 0 , , .. , α = L , and x − = x . Set h ( x ) = f ( x ) − f ⋆ − L k∇ f ( x ) k ∗ . The iterates (LC) satisfy α k +1 L h ( x k ) + V z k +1 ( x ⋆ ) ≤ α k +1 L − α k +1 h ( x k − ) + V z k ( x ⋆ ) for k = 0 , , . . . .Proof outline. We follow the steps of Lemma 4.3 of(Allen-Zhu & Orecchia, 2017), but use the tight inequali-ties f ( x k ) − f ⋆ ≤ h∇ f ( x k ) , x k − x ⋆ i − L k∇ f ( x k ) k ∗ and h∇ f ( x k ) , x k − − x k i≤ f ( x k − ) − f ( x k ) − L k∇ f ( x k − ) − ∇ f ( x k ) k ∗ . Theorem 5.
Assume (A1), (A2), and (A3). Let the positivesequence { α k } ∞ k =1 satisfy ≤ α k +1 L − α k +1 ≤ α k L for k = 1 , . . . and α = L . Let τ k = α k +1 L for k =1 , , . . . . The y k -sequence of (LC) exhibits the rate f ( y k ) − f ⋆ ≤ V x ( x ⋆ ) Lα k for k = 1 , , . . . .Proof. Sum the inequality of Lemma 2 from to ( k − . Then use V z k ( x ⋆ ) ≥ and f ( y k ) ≤ f ( x k − ) − L k∇ f ( x k − ) k ∗ to conclude the rate. actor- √ Acceleration of Accelerated Gradient Methods
The { θ k } ∞ k =0 of the original OGM formulation is relatedto { α k } ∞ k =1 through α k +1 = 2 θ k /L for k = 0 , , . . . .The seemingly different parameter choices τ k = α k +1 L forAGM and τ k = α k +1 L for OGM actually turn out to bethe same as { α k } ∞ k =1 for AGM and OGM differ by a fac-tor of . In the appendix, we further discuss the choice of { τ k } ∞ k =1 and how it naturally arises from the analysis. In the linear coupling context, the last-step modificationcan be expressed as ˜ x k = (1 − ˜ τ k ) y k + ˜ τ k z k (2)for k = 0 , , . . . , where { ˜ τ k } ∞ k =0 is a positive sequence tobe determined. Lemma 3.
Assume (A1), (A2) and (A3). Let < ˜ τ k = α k +1 L ≤ for k = 0 , , . . . , ˜ α = L , and x − = x .Then the ˜ x k -sequence of (2) , the secondary sequence withlast-step modification of (LC) , satisfies ˜ α k +1 L ( f (˜ x k ) − f ⋆ ) + V z k +1 ( x ⋆ ) ≤ (cid:0) ˜ α k +1 L − ˜ α k +1 (cid:1) h ( x k − ) + V z k ( x ⋆ ) for k = 0 , , . . . . The proof follows steps similar to that of Lemma 2.
Theorem 6.
In the setup of Theorem 5, let ≤ ˜ α k +1 L − ˜ α k +1 ≤ α k L and ˜ α = L . Then the ˜ x k -sequence, thesecondary sequence with last-step modification, of the lin-ear coupling method (LC) exhibits the rate f (˜ x k ) − f ⋆ ≤ V x ( x ⋆ ) L ˜ α k +1 for k = 0 , , . . . Proof.
Sum the inequality of Lemma 2 from to ( k − and the inequality of Lemma 3 with k − . Then use V z k ( x ⋆ ) ≥ to conclude the rate. If we choose w ( y ) = t k y k , so that V x ( y ) = t k x − y k , and < t ≤ , so that w is 1-strongly convex,and substitute α k +1 = 2 θ k /L , (LC) becomes y k +1 = x k − L ∇ f ( x k ) z k +1 = z k − tθ k L ∇ f ( x k ) x k +1 = (cid:18) − θ k +1 (cid:19) y k +1 + 1 θ k +1 z k +1 for k = 0 , , . . . . We also express this method with the mo-mentum and correction terms and without the z k -iterates inthe appendix, in Section G. This method unifies AGM andOGM through the constant t ; AGM and OGM respectivelycorrespond to t = (1 / and t = 1 . Corollary 5.
Assume (A1), (A2) and (A3). Let < t ≤ .Then f ( y k ) − f ⋆ ≤ L k x − x ⋆ k tθ k − for k = 1 , , . . . Proof.
This follows from Theorem 5 with α k +1 = θ k L .The rates of Corollary 5 at t = and t = 1 exactly matchthe previously discussed rates of AGM and OGM. By identifying OGM as an instance of linear coupling, wegeneralized OGM to the setup with quadratic norms andmirror descent steps while maintaining the factor- √ im-provement. However, we do point out that the generaliza-tion is narrower than that of (Allen-Zhu & Orecchia, 2017),which allows non-quadratic norms and constrained y k -and z k -updates. The analysis on strongly convex case followsfrom a similar line of reasoning, and is presented in Ap-pendix, Section H.In addition to the human-understandable derivation ofOGM, this section provides two non-obvious observations,which we point out again. The first is that AGM andOGM can be unified into a single parameterized family ofaccelerated gradient methods, all achieving the O (1 /k ) rate. Another is that the linear coupling analysis ofAllen-Zhu & Orecchia (2017) was suboptimal in the sameway that AGM is suboptimal and can be improved.
5. Conclusion
In this work, we presented human-understandable analysesof OGM. The first key insight is to use a Lyapunov func-tion with f ( x k ) − f ⋆ − L k∇ f ( x k ) k , a somewhat un-usual term in Lyapunov analyses. The second key insight isto obtain OGM by refining the linear coupling analysis ofAllen-Zhu & Orecchia (2017) through replacing non-tightinequalities with tight ones. With these insights, we ex-tended the factor- √ acceleration to other setups.In our view, the most significant contribution of this work isthe improved understanding of OGM’s acceleration mech-anism. While Nesterov’s acceleration mechanism hasbeen utilized as a component in a wide range of setups, actor- √ Acceleration of Accelerated Gradient Methods
OGM’s acceleration mechanism has not yet seen any exter-nal use. Through the understanding provided by the anal-ysis of this work, we hope OGM’s acceleration becomesmore widely utilized to gain a (perhaps factor- √ ) speedupcompared to what can be achieved with AGM’s accelera-tion. For example, whether accelerated coordinate gradientmethods (Allen-Zhu et al., 2016b; Nesterov & Stich, 2017)or non-convex stochastic optimization (Ghadimi & Lan,2016) can be improved with OGM’s acceleration mech-anism would be an interesting question to address in fu-ture work. Improving the FISTA (Beck & Teboulle, 2009)and the more general mirror descent setup (Bauschke et al.,2017; Lu et al., 2018) are also interesting directions, al-though there are known limitations (Dragomir et al., 2021;Kim & Fessler, 2018b).Finally, studying how OGM’s acceleration interacts withother techniques used to analyze AGM, such as thecontinuous-time analysis (Su et al., 2014), high-resolutionODEs (Shi et al., 2019), and variational perspective(Wibisono et al., 2016) is also an interesting direction. actor- √ Acceleration of Accelerated Gradient Methods
References
Ahn, K. and Sra, S. From Nesterov’s estimate sequence toRiemannian acceleration.
COLT , 2020.Allen-Zhu, Z. Katyusha: The first direct acceleration ofstochastic gradient methods.
STOC , 2017.Allen-Zhu, Z. and Hazan, E. Variance reduction for fasternon-convex optimization.
ICML , 2016.Allen-Zhu, Z. and Orecchia, L. Linear coupling: An ulti-mate unification of gradient and mirror descent.
ITCS ,2017.Allen-Zhu, Z., Lee, Y. T., and Orecchia, L. Using optimiza-tion to obtain a width-independent, parallel, simpler, andfaster positive SDP solver.
SODA , 2016a.Allen-Zhu, Z., Qu, Z., Richtárik, P., and Yuan, Y. Evenfaster accelerated coordinate descent using non-uniformsampling.
ICML , 2016b.Aujol, J. and Dossal, C. Optimal rate of convergence of anODE associated to the fast gradient descent schemes for b > . HAL Archives Ouvertes , 2017.Aujol, J.-F., Dossal, C., Fort, G., and Moulines, É. Rates ofconvergence of perturbed FISTA-based algorithms.
HALArchives Ouvertes , 2019a.Aujol, J.-F., Dossal, C., and Rondepierre, A. Optimal con-vergence rates for Nesterov acceleration.
SIAM Journalon Optimization , 29(4):3131–3153, 2019b.Auslender, A. and Teboulle, M. Interior gradient and prox-imal methods for convex and conic optimization.
SIAMJournal on Optimization , 16(3):697–725, 2006.Baes, M. Estimate sequence methods: extensions and ap-proximations. Technical report, Institute for OperationsResearch, ETH, Zürich, Switzerland, 2009.Bansal, N. and Gupta, A. Potential-function proofs for gra-dient methods.
Theory of Computing , 15(4):1–32, 2019.Bauschke, H. H., Bolte, J., and Teboulle, M. A descentlemma beyond Lipschitz gradient continuity: First-ordermethods revisited and applications.
Mathematics of Op-erations Research , 42(2):330–348, 2017.Beck, A. and Teboulle, M. Mirror descent and nonlinearprojected subgradient methods for convex optimization.
Operations Research Letters , 31(3):167–175, 2003.Beck, A. and Teboulle, M. A fast iterative shrinkage-thresholding algorithm for linear inverse problems.
SIAM Journal on Imaging Sciences , 2(1):183–202, 2009. De Klerk, E., Glineur, F., and Taylor, A. B. Worst-case con-vergence analysis of inexact gradient and newton meth-ods through semidefinite programming performance es-timation.
SIAM Journal on Optimization , 30(3):2053–2082, 2020.Dragomir, R.-A., Taylor, A. B., d’Aspremont, A., andBolte, J. Optimal complexity and certification of Breg-man first-order methods.
Mathematical Programming ,2021.Drori, Y. The exact information-based complexity ofsmooth convex minimization.
Journal of Complexity , 39:1–16, 2017.Drori, Y. and Taylor, A. B. Efficient first-order methods forconvex minimization: a constructive approach.
Mathe-matical Programming , 184(1):183–220, 2020.Drori, Y. and Teboulle, M. Performance of first-order meth-ods for smooth convex minimization: a novel approach.
Mathematical Programming , 145(1-2):451–482, 2014.Ghadimi, S. and Lan, G. Accelerated gradient methods fornonconvex nonlinear and stochastic programming.
Math-ematical Programming , 156(1-2):59–99, 2016.Gu, G. and Yang, J. Tight sublinear convergence rate ofthe proximal point algorithm for maximal monotone in-clusion problems.
SIAM Journal on Optimization , 30(3):1905–1921, 2020.Kim, D. Accelerated proximal point method for maximallymonotone operators. arXiv preprint arXiv:1905.05149 ,2019.Kim, D. and Fessler, J. A. Optimized first-order meth-ods for smooth convex minimization.
Mathematical Pro-gramming , 159(1-2):81–107, 2016.Kim, D. and Fessler, J. A. On the convergence analysis ofthe optimized gradient method.
Journal of OptimizationTheory and Applications , 172(1):187–205, 2017.Kim, D. and Fessler, J. A. Adaptive restart of the optimizedgradient method for convex optimization.
Journal ofOptimization Theory and Applications , 178(1):240–263,2018a.Kim, D. and Fessler, J. A. Another look at the fast iter-ative shrinkage/thresholding algorithm (FISTA).
SIAMJournal on Optimization , 28(1):223–250, 2018b.Kim, D. and Fessler, J. A. Generalizing the optimized gra-dient method for smooth convex minimization.
SIAMJournal on Optimization , 28(2):1920–1950, 2018c. actor- √ Acceleration of Accelerated Gradient Methods
Li, B., Coutiño, M., and Giannakis, G. B. Revisitof estimate sequence for accelerated gradient methods.
ICASSP , 2020.Lieder, F. On the convergence rate of the halpern-iteration.
Optimization Letters , pp. 1–14, 2020.Lu, H., Freund, R. M., and Nesterov, Y. Relatively smoothconvex optimization by first-order methods, and appli-cations.
SIAM Journal on Optimization , 28(1):333–354,2018.Nemirovsky, A. S. and Yudin, D. B.
Problem Complexityand Method Efficiency in Optimization. O (1 /k ) . Proceedings of the USSR Academy of Sciences , 269:543–547, 1983.Nesterov, Y.
Introductory Lectures on Convex Optimiza-tion: A Basic Course . 2004.Nesterov, Y. Smooth minimization of non-smooth func-tions.
Mathematical Programming , 103(1):127–152,2005.Nesterov, Y. Accelerating the cubic regularization of New-ton’s method on convex problems.
Mathematical Pro-gramming , 112(1):159–181, 2008.Nesterov, Y. Primal-dual subgradient methods for con-vex problems.
Mathematical Programming , 120(1):221–259, 2009.Nesterov, Y. Efficiency of coordinate descent methods onhuge-scale optimization problems.
SIAM Journal on Op-timization , 22(2):341–362, 2012.Nesterov, Y. and Stich, S. U. Efficiency of the acceler-ated coordinate descent method on structured optimiza-tion problems.
SIAM Journal on Optimization , 27(1):110–123, 2017.Rockafellar, R. T.
Convex Analysis . 1970.Ryu, E. K. and Yin, W.
Large-Scale Convex Optimizationvia Monotone Operators . Draft, 2021.Ryu, E. K., Taylor, A. B., Bergeling, C., and Giselsson,P. Operator splitting performance estimation: Tight con-traction factors and optimal parameter selection.
SIAMJournal on Optimization , 30(3):2251–2271, 2020.Shi, B., Du, S. S., Su, W., and Jordan, M. I. Accelerationvia symplectic discretization of high-resolution differen-tial equations.
NeurIPS , 2019. Su, W., Boyd, S., and Candes, E. A differential equation formodeling Nesterov’s accelerated gradient method: The-ory and insights.
NeurIPS , 2014.Taylor, A. B. and Bach, F. Stochastic first-order methods:non-asymptotic and computer-aided analyses via poten-tial functions.
COLT , 2019.Taylor, A. B., Hendrickx, J. M., and Glineur, F. Exactworst-case performance of first-order methods for com-posite convex optimization.
SIAM Journal on Optimiza-tion , 27(3):1283–1313, 2017a.Taylor, A. B., Hendrickx, J. M., and Glineur, F. Smoothstrongly convex interpolation and exact worst-case per-formance of first-order methods.
Mathematical Pro-gramming , 161(1-2):307–345, 2017b.Wibisono, A., Wilson, A. C., and Jordan, M. I. A vari-ational perspective on accelerated methods in optimiza-tion.
Proceedings of the National Academy of Sciences ,113(47):E7351–E7358, 2016. actor- √ Acceleration of Accelerated Gradient Methods
A. Method reference
For reference, we restate all aforementioned methods. In all methods, we assume that f is L -smooth function, { θ k } ∞ k =0 and { ϕ k } ∞ k =0 are the sequences of positive scalars, and x = y = z . OGM.
One form of OGM is y k +1 = x k − L ∇ f ( x k ) x k +1 = y k +1 + θ k − θ k +1 ( y k +1 − y k ) + θ k θ k +1 ( y k +1 − x k ) and an equivalent form with z -iterates is y k +1 = x k − L ∇ f ( x k ) z k +1 = z k − θ k L ∇ f ( x k ) x k +1 = (cid:18) − θ k +1 (cid:19) y k +1 + 1 θ k +1 z k +1 for k = 0 , , . . . . The last-step modification on the secondary sequence can be written as ˜ x k +1 = y k +1 + θ k − ϕ k +1 ( y k +1 − y k ) + θ k ϕ k +1 ( y k +1 − x k )= (cid:18) − ϕ k +1 (cid:19) y k +1 + 1 ϕ k +1 z k +1 where k = 0 , , . . . . OGM-simple.
OGM-simple is a simpler variant of
OGM with θ k = k +22 and ϕ k = k +1+ √ √ . One form of OGM-simpleis y k +1 = x k − L ∇ f ( x k ) x k +1 = y k +1 + kk + 3 ( y k +1 − y k ) + k + 2 k + 3 ( y k +1 − x k ) and an equivalent form with z -iterates is y k +1 = x k − L ∇ f ( x k ) z k +1 = z k − k + 2 L ∇ f ( x k ) x k +1 = (cid:18) − k + 3 (cid:19) y k +1 + 2 k + 3 z k +1 for k = 0 , , . . . . The last-step modification on secondary sequence is written as ˜ x k +1 = y k +1 + k √ k + 2) + 1 ( y k +1 − y k ) + k + 2 √ k + 2) + 1 ( y k +1 − x k ) where k = 0 , , . . . . actor- √ Acceleration of Accelerated Gradient Methods
SC-OGM.
Here, we assume that f is a µ -strongly convex function, condition number of f is κ = L/µ , and γ = √ κ +1+32 κ − . SC-OGM is written as y k +1 = x k − L ∇ f ( x k ) x k +1 = y k +1 + 12 γ + 1 ( y k +1 − y k ) + 12 γ + 1 ( y k +1 − x k ) for k = 0 , , . . . . LC-OGM.
LC-OGM (Linear Coupling OGM) is defined as y k +1 = x k − L − Q − ∇ f ( x k ) z k +1 = arg min y ∈ R n { V z k ( y ) + h α k +1 ∇ f ( x k ) , y − x k i} x k +1 = (1 − τ k +1 ) y k +1 + τ k +1 z k +1 for k = 0 , , . . . , where V z ( y ) is a Bregman divergence, { α k } ∞ k =1 and { τ k } ∞ k =1 are nonnegative sequences defined as α = L , ≤ α k +1 L − α k +1 ≤ α k L , τ k = α k +1 L , and Q is a positive definite matrix defining k x k = x T Qx .For last step modification , we define positive sequences { ˜ α k } ∞ k =1 and { ˜ τ k } ∞ k =1 as α = L , ≤ ˜ α k +1 L − ˜ α k +1 ≤ α k L ,and ˜ τ k = α k +1 L , and also define ˜ x k = (1 − ˜ τ k ) y k + ˜ τ k z k for k = 1 , , . . . . Unification of AGM and OGM.
Using LC-OGM, we can unify AGM and OGM as y k +1 = x k − L ∇ f ( x k ) z k +1 = z k − tθ k L ∇ f ( x k ) x k +1 = (cid:18) − θ k +1 (cid:19) y k +1 + 1 θ k +1 z k +1 . for k = 0 , , . . . . In Section G, we prove that this is equivalent to y k +1 = x k − L ∇ f ( x k ) x k +1 = y k +1 + θ k − θ k +1 ( y k +1 − y k ) + (2 t − θ k θ k +1 ( y k +1 − x k ) . LC-SC-OGM.
LC-SC-OGM (Linear Coupling Strongly Convex OGM) is y k +1 = x k − L Q − ∇ f ( x k ) z k +1 = 11 + γ (cid:18) z k + γx k − γµ Q − ∇ f ( x k ) (cid:19) x k +1 = τ z k +1 + (1 − τ ) y k +1 , for k = 0 , , . . . , where Q is a positive definite matrix. actor- √ Acceleration of Accelerated Gradient Methods
B. Omitted proofs of Section 2
Lemma 4.
In the setup of Theorem 1, define U k = 2 θ k (cid:18) f ( x k ) − f ⋆ − L k∇ f ( x k ) k (cid:19) + L k z k +1 − x ⋆ k for k = − , , , . . . . Then, U k is monotonically decreasing.Proof. For k = − , , , . . . , we have U k − U k +1 = 2 θ k (cid:18) f ( x k ) − f ⋆ − L k∇ f ( x k ) k (cid:19) − θ k +1 (cid:18) f ( x k +1 ) − f ⋆ − L k∇ f ( x k +1 ) k (cid:19) + L k z k +1 − x ⋆ k − L k z k +2 − x ⋆ k = 2 θ k (cid:18) f ( x k ) − f ⋆ − L k∇ f ( x k ) k (cid:19) − θ k +1 (cid:18) f ( x k +1 ) − f ⋆ − L k∇ f ( x k +1 ) k (cid:19) − h θ k +1 ∇ f ( x k +1 ) , x ⋆ − z k +1 i − L θ k +1 k∇ f ( x k +1 ) k = 2 θ k (cid:18) f ( x k ) − f ⋆ − L k∇ f ( x k ) k (cid:19) − θ k +1 (cid:18) f ( x k +1 ) − f ⋆ + 12 L k∇ f ( x k +1 ) k (cid:19) − h θ k +1 ∇ f ( x k +1 ) , x ⋆ − z k +1 i≥ θ k +1 − θ k +1 ) (cid:18) f ( x k ) − f ⋆ − L k∇ f ( x k ) k (cid:19) − θ k +1 (cid:18) f ( x k +1 ) − f ⋆ + 12 L k∇ f ( x k +1 ) k (cid:19) − h θ k +1 ∇ f ( x k +1 ) , x ⋆ − z k +1 i = 2( θ k +1 − θ k +1 ) (cid:18) f ( x k ) − f ⋆ − L k∇ f ( x k ) k − f ( x k +1 ) + f ⋆ − L k∇ f ( x k +1 ) k (cid:19) − θ k +1 (cid:18) f ( x k +1 ) − f ⋆ + 12 L k∇ f ( x k +1 ) k (cid:19) − h θ k +1 ∇ f ( x k +1 ) , x ⋆ − z k +1 i = 2( θ k +1 − θ k +1 ) (cid:18) f ( x k ) − f ( x k +1 ) − L k∇ f ( x k ) k − L k∇ f ( x k +1 ) k (cid:19) + 2 θ k +1 (cid:18) f ⋆ − f ( x k +1 ) − L k∇ f ( x k +1 ) k + h∇ f ( x k +1 ) , x k +1 − x ⋆ i (cid:19) + 2 θ k +1 h∇ f ( x k +1 ) , z k +1 − x k +1 i≥ θ k +1 − θ k +1 ) (cid:18) f ( x k ) − f ( x k +1 ) − L k∇ f ( x k ) k − L k∇ f ( x k +1 ) k (cid:19) + 2 θ k +1 h∇ f ( x k +1 ) , z k +1 − x k +1 i , where the inequalities follow from the cocoercivity of f .Consider two separate cases k = − and k = 0 , , . . . . In case of k = − , θ k +1 − θ k +1 = 1 − and z k +1 − x k +1 = z − x = 0 . The last formula becomes zero, so U − − U ≥ . For k = 0 , , . . . , θ k +1 − θ k +1 ) (cid:18) f ( x k ) − f ( x k +1 ) − L k∇ f ( x k ) k − L k∇ f ( x k +1 ) k (cid:19) + 2 θ k +1 h∇ f ( x k +1 ) , z k +1 − x k +1 i = 2( θ k +1 − θ k +1 ) (cid:18) f ( x k ) − f ( x k +1 ) − L k∇ f ( x k ) k − L k∇ f ( x k +1 ) k (cid:19) + 2 θ k +1 ( θ k +1 − h∇ f ( x k +1 ) , x k +1 − x k + 1 L ∇ f ( x k ) i = (2 θ k +1 − θ k +1 ) (cid:18) f ( x k ) − f ( x k +1 ) − L k∇ f ( x k ) − ∇ f ( x k +1 ) k + h∇ f ( x k +1 ) , x k +1 − x k i (cid:19) ≥ , where the inequalities follow from the cocoercivity of f . actor- √ Acceleration of Accelerated Gradient Methods
Lemma 5.
In the setup of Theorem 2, define ˜ U k = ϕ k ( f (˜ x k ) − f ⋆ ) + L (cid:13)(cid:13)(cid:13)(cid:13) z k − L ϕ k ∇ f (˜ x k ) − x ⋆ (cid:13)(cid:13)(cid:13)(cid:13) . Then, ˜ U k ≤ U k − for k = 0 , , . . . , where U k − is as defined in Lemma 4.Proof. For k = 0 , , . . . , we have U k − − ˜ U k = 2 θ k − (cid:18) f ( x k − ) − f ⋆ − L k∇ f ( x k − ) k (cid:19) − ϕ k ( f (˜ x k ) − f ⋆ )+ L k z k − x ⋆ k − L (cid:13)(cid:13)(cid:13)(cid:13) z k − L ϕ k ∇ f (˜ x k ) − x ⋆ (cid:13)(cid:13)(cid:13)(cid:13) = 2 θ k − (cid:18) f ( x k − ) − f ⋆ − L k∇ f ( x k − ) k (cid:19) − ϕ k ( f (˜ x k ) − f ⋆ ) − h ϕ k ∇ f (˜ x k ) , x ⋆ − z k i − L ϕ k k∇ f (˜ x k ) k = 2 θ k − (cid:18) f ( x k − ) − f ⋆ − L k∇ f ( x k − ) k (cid:19) − ϕ k (cid:18) f (˜ x k ) − f ⋆ + 12 L k∇ f (˜ x k ) k (cid:19) − h ϕ k ∇ f (˜ x k ) , x ⋆ − z k i≥ ( ϕ k − ϕ k ) (cid:18) f ( x k − ) − f ⋆ − L k∇ f ( x k − ) k (cid:19) − ϕ k (cid:18) f (˜ x k ) − f ⋆ + 12 L k∇ f (˜ x k ) k (cid:19) − h ϕ k ∇ f (˜ x k ) , x ⋆ − z k i = ( ϕ k − ϕ k ) (cid:18) f ( x k − ) − f ⋆ − L k∇ f ( x k − ) k − f (˜ x k ) + f ⋆ − L k∇ f (˜ x k ) k (cid:19) + ϕ k (cid:18) f ⋆ − f (˜ x k ) − L k∇ f (˜ x k ) k + h∇ f (˜ x k ) , ˜ x k − x ⋆ i (cid:19) + h ϕ k ∇ f (˜ x k ) , z k − ˜ x k i≥ ( ϕ k − ϕ k ) (cid:18) f ( x k − ) − f (˜ x k ) − L k∇ f ( x k − ) k − L k∇ f (˜ x k ) k (cid:19) + h ϕ k ∇ f (˜ x k ) , z k − ˜ x k i = ( ϕ k − ϕ k ) (cid:18) f ( x k − ) − f (˜ x k ) − L k∇ f ( x k − ) k − L k∇ f (˜ x k ) k (cid:19) + ϕ k ( ϕ k − h∇ f (˜ x k ) , ˜ x k − x k − + 1 L ∇ f ( x k − ) i = ( ϕ k − ϕ k ) (cid:18) f ( x k − ) − f (˜ x k ) − L k∇ f ( x k − ) − ∇ f (˜ x k ) k + h∇ f (˜ x k ) , ˜ x k − x k − i (cid:19) ≥ , where the inequalities follow from the cocoercivity of f . C. Omitted proofs of Section 3
Lemma 6.
In the setup of Theorem 3, define z k = 2 γ + 1 γ x k − γ + 1 γ y k and let X k = x k − x ⋆ and Z k = z k − x ⋆ , for k = 0 , , . . . . Then ( x k − x k − ) + 1 L ∇ f ( x k − ) + γX k = 11 + γ ( γZ k + γ X k ) actor- √ Acceleration of Accelerated Gradient Methods for k = 1 , , . . . , and Z k +1 = 1 γ + 1 Z k + γγ + 1 X k − L γ + 2 γ ∇ f ( x k ) for k = 0 , , . . . .Proof. Plug y k = x k − − L ∇ f ( x k − ) in the definition of z k . Then we obtain the first formula.For the second formula, from definition of z k and z k +1 z k +1 = 2 γ + 1 γ x k +1 − γ + 1 γ x k + 1 L γγ ∇ f ( x k ) z k = 2 γ + 1 γ x k − γ + 1 γ x k − + 1 L γγ ∇ f ( x k − ) and definition of x k , we have x k +1 = 2 γ + 22 γ + 1 y k +1 − γ + 1 y k − L γ + 1 ∇ f ( x k )= 2 γ + 22 γ + 1 x k − γ + 1 x k − − L γ + 32 γ + 1 ∇ f ( x k ) + 1 L γ + 1 ∇ f ( x k − ) . Therefore, z k +1 − γ + 1 z k = 2 γ + 1 γ x k +1 − γ + 1 γ x k + 1 L γγ ∇ f ( x k ) − γ + 1 (cid:18) γ + 1 γ x k − γ + 1 γ x k − + 1 L γγ ∇ f ( x k − ) (cid:19) = 2 γ + 1 γ x k +1 − γ + 4 γ + 2 γ ( γ + 1) x k + 1 γ x k − + 1 L γγ ∇ f ( x k ) − L γ ∇ f ( x k − )= 2 γ + 1 γ (cid:18) γ + 22 γ + 1 x k − γ + 1 x k − − L γ + 32 γ + 1 ∇ f ( x k ) + 1 L γ + 1 ∇ f ( x k − ) (cid:19) − γ + 4 γ + 2 γ ( γ + 1) x k + 1 γ x k − + 1 L γγ ∇ f ( x k ) − L γ ∇ f ( x k − )= γγ + 1 x k − L γ + 2 γ ∇ f ( x k ) so we obtained the second formula. Lemma 7.
In the setup of Theorem 3, define U k =(1 + γ ) k (cid:18) f ( x k ) − f ⋆ − L k∇ f ( x k ) k + µ k z k +1 − x ⋆ k (cid:19) for k = 0 , , . . . . Then U k is monotonically decreasing.Proof. It suffices to show that for k = 0 , , . . . , (1 + γ ) − k ( U k − U k +1 ) ≥ which is equivalent to showing (cid:18) ( f ( x k ) − f ⋆ − L k∇ f ( x k ) k ) − (1 + γ )( f ( x k +1 ) − f ⋆ − L k∇ f ( x k +1 ) k ) (cid:19) + µ (cid:16) k z k +1 − x ⋆ k − (1 + γ ) k z k +2 − x ⋆ k (cid:17) ≥ . By L -smoothness of f , we have f ( x k +1 ) − f ( x k ) ≤ − L k∇ f ( x k +1 ) − ∇ f ( x k ) k + h∇ f ( x k +1 ) , x k +1 − x k i actor- √ Acceleration of Accelerated Gradient Methods and from strong convexity, f ( x k +1 ) − f ⋆ ≤ h∇ f ( x k +1 ) , x k +1 − x ⋆ i − µ k x k +1 − x ⋆ k . For k = 0 , , . . . , using above two inequalities and Lemma 6, ( f ( x k ) − f ⋆ − L k∇ f ( x k ) k ) − (1 + γ )( f ( x k +1 ) − f ⋆ − L k∇ f ( x k +1 ) k )= ( f ( x k ) − f ( x k +1 )) − γ ( f ( x k +1 ) − f ⋆ ) + 1 + γ L k∇ f ( x k +1 ) k − L k∇ f ( x k ) k ≥ (cid:18) L k∇ f ( x k +1 ) − ∇ f ( x k ) k + h∇ f ( x k +1 ) , x k − x k +1 i (cid:19) − γ (cid:16) h∇ f ( x k +1 ) , x k +1 − x ⋆ i − µ k x k +1 − x ⋆ k (cid:17) + 1 + γ L k∇ f ( x k +1 ) k − L k∇ f ( x k ) k = h∇ f ( x k +1 ) , − L ∇ f ( x k ) − x k +1 + x k − γ ( x k +1 − x ⋆ ) i + 2 + γ L k∇ f ( x k +1 ) k + µγ k x k +1 − x ⋆ k = h∇ f ( x k +1 ) , −
11 + γ ( γZ k +1 + γ X k +1 ) i + 2 + γ L k∇ f ( x k +1 ) k + µγ k x k +1 − x ⋆ k . In addition, µ (cid:16) (1 + γ ) k Z k +2 k − k Z k +1 k (cid:17) = µ (1 + γ ) (cid:13)(cid:13)(cid:13)(cid:13)
11 + γ Z k +1 + γ γ X k +1 − L γγ ∇ f ( x k +1 ) (cid:13)(cid:13)(cid:13)(cid:13) − k Z k +1 k ! = µ − γ γ k Z k +1 k + γ γ k X k +1 k + (1 + γ ) 1 L (2 + γ ) γ k∇ f ( x k +1 ) k + 2 γ γ h Z k +1 , X k +1 i − γLγ h∇ f ( x k +1 ) , Z k +1 i − γL h∇ f ( x k +1 ) , X k +1 i ) . Since µ γLγ = 11 + γ , we can telescope concerned ∇ f ( x k +1 ) ’s inner product in U k − U k +1 .For k = 0 , , . . . , we have (1 + γ ) − k ( U k − U k +1 ) ≥ γ L k∇ f ( x k +1 ) k + µγ k X k +1 k − µ (cid:18) − γ γ k Z k +1 k + γ γ k X k +1 k + (1 + γ ) 1 L (2 + γ ) γ k∇ f ( x k +1 ) k + 2 γ γ h Z k +1 , X k +1 i (cid:19) = − µ (cid:18) − γ γ k X k +1 k − γ γ k Z k +1 k + 2 γ γ h Z k +1 , X k +1 i (cid:19) = µ γ γ k Z k +1 − X k +1 k ≥ . Lemma 8.
In the setup of Theorem 4, define { ˜ U k } ∞ k =1 as ˜ U k =(1 + γ ) k − γ γ ( f ( x k ) − f ⋆ ) + µ (cid:13)(cid:13)(cid:13)(cid:13) z k − (cid:18) γ + 2 γ (cid:19) L ∇ f ( x k ) − x ⋆ (cid:13)(cid:13)(cid:13)(cid:13) ! then ˜ U k ≤ U k − for k = 1 , , . . . , where U k − is as defined in Lemma 7. actor- √ Acceleration of Accelerated Gradient Methods
Proof.
Note that γ +1 γ (cid:0) ( x k − x k − ) + L ∇ f ( x k − ) (cid:1) = ( Z k − X k ) . Then we have (cid:18) f ( x k − ) − f ⋆ − L k∇ f ( x k − ) k (cid:19) − γ γ ( f ( x k ) − f ⋆ ) + Lγ γ )(2 + γ ) k z k − x ⋆ k − Lγ γ )(2 + γ ) (cid:13)(cid:13)(cid:13)(cid:13) z k − (cid:18) γ + 2 γ (cid:19) L ∇ f ( x k ) − x ⋆ (cid:13)(cid:13)(cid:13)(cid:13) = (cid:18) f ( x k − ) − f ⋆ − L k∇ f ( x k − ) k (cid:19) − γ γ ( f ( x k ) − f ⋆ ) + γ γ h Z k , ∇ f ( x k ) i − L γ γ k∇ f ( x k ) k = (cid:18) f ( x k − ) − f ⋆ − L k∇ f ( x k − ) k (cid:19) − γ γ ( f ( x k ) − f ⋆ )+ γ γ (cid:28) γ + 1 γ (cid:18) ( x k − x k − ) + 1 L ∇ f ( x k − ) (cid:19) + X k , ∇ f ( x k ) (cid:29) − L γ γ k∇ f ( x k ) k = (cid:18) f ( x k − ) − f ⋆ − L k∇ f ( x k − ) k (cid:19) − γ γ ( f ( x k ) − f ⋆ )+ h x k − x k − , ∇ f ( x k ) i + 1 L h∇ f ( x k − ) , ∇ f ( x k ) i + γ γ h X k , ∇ f ( x k ) i − L γ γ k∇ f ( x k ) k = (cid:18) f ( x k − ) − f ( x k ) − L k∇ f ( x k − ) − ∇ f ( x k ) k + h∇ f ( x k ) , x k − x k − i (cid:19) + 12 L γ γ k∇ f ( x k ) k + 11 + γ (cid:18) f ( x k ) − f ⋆ − L k∇ f ( x k ) k (cid:19) + γ γ (cid:18) f ⋆ − f ( x k ) − L k∇ f ( x k ) k + h X k , ∇ f ( x k ) i (cid:19) ≥ . Since Lγ γ )(2+ γ ) = µ , above inequality indicates that (cid:18) f ( x k − ) − f ⋆ − L k∇ f ( x k − ) k (cid:19) + µ k z k − x ⋆ k ≥ γ γ ( f ( x k ) − f ⋆ ) + µ (cid:13)(cid:13)(cid:13)(cid:13) z k − (cid:18) γ + 2 γ (cid:19) L ∇ f ( x k ) − x ⋆ (cid:13)(cid:13)(cid:13)(cid:13) . Lemma 9.
In the setup of Lemma 7, U ≤ (cid:0) L + µ (cid:1) k x − x ⋆ k .Proof. We have U = f ( x ) − f ⋆ − L k∇ f ( x ) k + µ k z − x ⋆ k = f ( x ) − f ⋆ − L k∇ f ( x ) k + µ (cid:13)(cid:13)(cid:13)(cid:13) x − L γ + 2 γ ∇ f ( x ) − x ⋆ (cid:13)(cid:13)(cid:13)(cid:13) = f ( x ) − f ⋆ + 12 L γ + 1 k∇ f ( x ) k − γ γ h∇ f ( x ) , x − x ⋆ i + µ k x − x ⋆ k ≤ f ( x ) − f ⋆ + 12 L k∇ f ( x ) k − γ γ h∇ f ( x ) , x − x ⋆ i + µ k x − x ⋆ k ≤ γ + 1 ( f ( x ) − f ⋆ ) + 12 L
11 + γ k∇ f ( x ) k + µ k x − x ⋆ k ≤
21 + γ ( f ( x ) − f ⋆ ) + µ k x − x ⋆ k ≤ (cid:16) L + µ (cid:17) k x − x ⋆ k . From (1 + γ ) − N = (cid:18) √ κ + 1 + 32 κ − (cid:19) − N = √ √ κ + o (cid:18) √ κ (cid:19)! − N , actor- √ Acceleration of Accelerated Gradient Methods we conclude that the convergence rate of SC-OGM is O (exp( −√ N/ √ κ )) . D. Co-coercivity inequality in general norm
Lemma 10.
Let f be a closed convex proper function. Then, ≤ f ( x ) + f ∗ ( u ) − h x, u i and inf x { f ( x ) + f ∗ ( u ) − h x, u i} = 0inf u { f ( x ) + f ∗ ( u ) − h x, u i} = 0 . Proof.
By the definition of the conjugate function, − f ∗ ( u ) = inf x { f ( x ) − h x, u i} and inf x { f ( x ) + f ∗ ( u ) − h x, u i} = 0 . Therefore, ≤ f ( x ) + f ∗ ( u ) − h x, u i ∀ x. The statement with u follows from the same argument and the fact that f ∗∗ = f . Lemma 11.
Consider a norm k · k and its dual norm k · k ∗ . Then, ≤ k x k + 12 k u k ∗ − h x, u i and inf x ∈ R n (cid:26) k x k + 12 k u k ∗ − h x, u i (cid:27) = 0inf u ∈ R n (cid:26) k x k + 12 k u k ∗ − h x, u i (cid:27) = 0 . Proof.
This follows from Lemma 10 with f ( x ) = k x k and (cid:16) k·k (cid:17) ∗ = k·k ∗ . Lemma 12.
Let
Grad ( x ) = arg min y ∈ R n (cid:26) L k y − x k + h∇ f ( x ) , y − x i (cid:27) . Then, h∇ f ( x ) , Grad ( x ) − x i + L k Grad ( x ) − x k = − L k∇ f ( x ) k ∗ . Proof.
Let z = L ( Grad ( x ) − x ) . By the definition of Grad ( x ) and Lemma 11, we have L k∇ f ( x ) k ∗ + L k Grad ( x ) − x k + h∇ f ( x ) , Grad ( x ) − x i = inf z ∈ R n L k∇ f ( x ) k ∗ + 12 L k z k + 1 L h∇ f ( x ) , z i = 0 . Lemma 13.
Let f : R n → R be a differentiable convex function such that k∇ f ( x ) − ∇ f ( y ) k ∗ ≤ L k x − y k for all x, y ∈ R n . Then f ( y ) ≤ f ( x ) + h∇ f ( x ) , y − x i + L k y − x k . actor- √ Acceleration of Accelerated Gradient Methods
Proof.
Since a differentiable convex function is continuously differentiable (Rockafellar, 1970, Theorem 25.5), f ( y ) − f ( x ) = Z h∇ f ( x + t ( y − x )) , y − x i dt = Z h∇ f ( x + t ( y − x )) − ∇ f ( x ) , y − x i dt + h∇ f ( x ) , y − x i≤ Z k∇ f ( x + t ( y − x )) − ∇ f ( x ) k ∗ k y − x k dt + h∇ f ( x ) , y − x i≤ Z tL k y − x k dt + h∇ f ( x ) , y − x i = L k y − x k + h∇ f ( x ) , y − x i . Lemma 14 (Co-coercivity inequality with general norm) . Let f : R n → R be a differentiable convex function such that k∇ f ( x ) − ∇ f ( y ) k ∗ ≤ L k x − y k for all x, y ∈ R n . Then f ( y ) ≥ f ( x ) + h∇ f ( x ) , y − x i + 12 L k∇ f ( x ) − ∇ f ( y ) k ∗ . Proof.
Set φ ( y ) = f ( y ) − h∇ f ( x ) , y − x i . Then x ∈ arg min φ . So by Lemma 12, φ ( x ) ≤ φ ( Grad ( y )) ≤ φ ( y ) + h∇ φ ( y ) , Grad ( y ) − y i + L k Grad ( y ) − y k = φ ( y ) − L k∇ φ ( y ) k ∗ . Substituting f back in φ yields the co-coercivity inequality. E. Telescoping sum argument
Suppose we established the inequality a i F i + b i G i ≤ c i F i − + d i G i − − E i for i = 1 , , . . . , where E i , F i , G i are nonnegative quantities and a i , b i , c i , and d i are nonnegative scalars. Assume c i ≤ a i − and d i ≤ b i − . By summing the inequalities for i = 1 , , . . . , k , we obtain a k F k ≤ − b k G k − k X i =2 ( a i − − c i ) F i − − k X i =2 ( b i − − d i ) G i − − k X i =2 E i + c F + d G ≤ c F + d G . However, note that the − b k G k − k X i =2 ( a i − − c i ) F i − − k X i =2 ( b i − − d i ) G i − − k X i =1 E i terms are wasted in the analysis. If one has the freedom to do so, it may be good to choose parameters so that a i − = c i , b i − = d i and E i = 0 for i = 1 , , . . . . Not having wasted terms may be an indication that the analysis is tight. actor- √ Acceleration of Accelerated Gradient Methods
F. Omitted proof of Section 4
In this section, we provide the proofs of the lemmas used in the linear coupling analysis of OGM and compare them withthe ones used in the analysis of AGM by Allen-Zhu & Orecchia (2017).
Proof of Lemma 1.
From the definition of mirror descent, h∇ V z k ( z k +1 ) + α k +1 ∇ f ( x k ) , u − z k +1 i ≥ . (3)Using the above inequality, α k +1 h∇ f ( x k ) , z k − u i = α k +1 h∇ f ( x k ) , z k − z k +1 i + α k +1 h∇ f ( x k ) , z k +1 − u i≤ α k +1 h∇ f ( x k ) , z k − z k +1 i + h−∇ V z k ( z k +1 ) , z k +1 − u i (4) = α k +1 h∇ f ( x k ) , z k − z k +1 i + V z k ( u ) − V z k +1 ( u ) − V z k ( z k +1 ) (5) ≤ (cid:18) α k +1 h∇ f ( x k ) , z k − z k +1 i − k z k − z k +1 k (cid:19) + (cid:0) V z k ( u ) − V z k +1 ( u ) (cid:1) (6) ≤ α k +1 k∇ f ( x k ) k ∗ + V z k ( u ) − V z k +1 ( u ) , (7)where (4) follows from (3), (5) follows from triangle equality of Bregman divergence, and (6) is due to 1-strong convexityof V · ( u ) , and (7) follows from Lemma 11. Proof of Lemma 2.
For k = 1 , , . . . , we have α k +1 ( f ( x k ) − f ⋆ )) ≤ α k +1 h∇ f ( x k ) , x k − x ⋆ i − α k +1 L k∇ f ( x k ) k ∗ (8) = α k +1 h∇ f ( x k ) , x k − z k i + α k +1 h∇ f ( x k ) , z k − x ⋆ i − α k +1 L k∇ f ( x k ) k ∗ = 1 − τ k τ k α k +1 h∇ f ( x k ) , y k − x k i + α k +1 h∇ f ( x k ) , z k − x ⋆ i − α k +1 L k∇ f ( x k ) k ∗ = 1 − τ k τ k α k +1 h∇ f ( x k ) , x k − − x k − L Q − ∇ f ( x k − ) i + α k +1 h∇ f ( x k ) , z k − x ⋆ i − α k +1 L k∇ f ( x k ) k ∗ (9) ≤ − τ k τ k α k +1 (cid:18) f ( x k − ) − f ( x k ) − L k∇ f ( x k − ) k ∗ − L k∇ f ( x k ) k ∗ (cid:19) (10) + α k +1 h∇ f ( x k ) , z k − x ⋆ i − α k +1 L k∇ f ( x k ) k ∗ ≤ − τ k τ k α k +1 (cid:18) f ( x k − ) − f ( x k ) − L k∇ f ( x k − ) k ∗ − L k∇ f ( x k ) k ∗ (cid:19) (11) + α k +1 k∇ f ( x k ) k ∗ + V z k ( x ⋆ ) − V z k +1 ( x ⋆ ) − α k +1 L k∇ f ( x k ) k ∗ . (8) and (10) follow from Lemma 14, (9) follows from the definition of linear coupling, and (11) follows from Lemma 1.The case of k = 0 follows from α = L and f ⋆ − f ( x ) − h∇ f ( x ) , x ⋆ − x i − L k∇ f ( x ) k ∗ ≥ with Lemma 1.The parameters { α k } ∞ k =1 and { τ k } ∞ k =1 are chosen to make the telescoping sum argument work and to make it work tightly,as described in Section E. Specifically, one starts with a general form M k (cid:16) f ( x k ) − f ⋆ − B k k∇ f ( x k ) k ∗ (cid:17) + V z k +1 ( x ⋆ ) ≤ N k − (cid:16) f ( x k − ) − f ⋆ − B k − k∇ f ( x k − ) k ∗ (cid:17) + V z k ( x ⋆ ) , where the scalar coefficients M k , N k − , B k , and B k − are determined by (11). To make the telescoping sum argumentwork, the { B k } ∞ k =0 must be independent of k . The only possibility for this is B k = B k − = L , and this is achieved when − L (cid:18) α k +1 + 1 − τ k τ k α k +1 (cid:19) = − α k +1 L (cid:18) α k +1 + 1 − τ k τ k α k +1 (cid:19) actor- √ Acceleration of Accelerated Gradient Methods holds. Solving this equation leads to the choice τ k = Lα k +1 . The requirement α k +1 L − α k +1 ≤ α k L is needed for thetelescoping sum argument to work, and the choice α k +1 L − α k +1 = α k L makes the argument tight. Proof of Lemma 3.
Proof of Lemma 3.
The proof is identical to that of Lemma 2 with $\tau_k$ replaced by $\tilde\tau_k$.

Comparison of the linear coupling analyses of AGM and OGM.
The linear coupling analysis of Allen-Zhu & Orecchia (2017), which derives AGM, relies on the following two key lemmas.
Lemma 15 (Allen-Zhu & Orecchia, 2017, Lemma 4.2). In the linear coupling setup,
$$\alpha_{k+1}\langle\nabla f(x_k),\, z_k - x_\star\rangle \le \frac{\alpha_{k+1}^2}{2}\|\nabla f(x_k)\|_*^2 + V_{z_k}(x_\star) - V_{z_{k+1}}(x_\star) \le \alpha_{k+1}^2 L\big(f(x_k) - f(y_{k+1})\big) + V_{z_k}(x_\star) - V_{z_{k+1}}(x_\star)$$
for $k = 0, 1, \dots$.

Lemma 16 (Allen-Zhu & Orecchia, 2017, Lemma 4.3; Coupling Lemma). In the linear coupling setup,
$$\alpha_{k+1}^2 L\big(f(y_{k+1}) - f_\star\big) + V_{z_{k+1}}(x_\star) \le \big(\alpha_{k+1}^2 L - \alpha_{k+1}\big)\big(f(y_k) - f_\star\big) + V_{z_k}(x_\star)$$
for $k = 0, 1, \dots$.

As discussed in the main body, the proof of Allen-Zhu & Orecchia (2017, Lemma 4.2) uses the non-tight inequality
$$f(x_k) - f(y_{k+1}) \ge \frac{1}{2L}\|\nabla f(x_k)\|_*^2,$$
and the proof of Allen-Zhu & Orecchia (2017, Lemma 4.3) follows steps similar to that of Lemma 2, but uses the non-tight inequalities
$$f(x_k) - f_\star \le \langle\nabla f(x_k),\, x_k - x_\star\rangle \qquad\text{and}\qquad \langle\nabla f(x_k),\, y_k - x_k\rangle \le f(y_k) - f(x_k).$$
In both linear coupling analyses, for OGM and AGM, the telescoping sum argument is made tight by choosing $\{\alpha_k\}_{k=1}^\infty$ and $\{\tau_k\}_{k=1}^\infty$ appropriately. However, the analysis of Allen-Zhu & Orecchia (2017) uses non-tight inequalities before the telescoping sum argument, while our analysis uses tight inequalities in all steps.

G. Unification of AGM and OGM
In Section 4.4, we presented a unification of AGM and OGM. The unified form is equivalent to
$$\begin{aligned}
y_{k+1} &= x_k - \tfrac1L\nabla f(x_k)\\
x_{k+1} &= y_{k+1} + \frac{\theta_k - 1}{\theta_{k+1}}(y_{k+1} - y_k) + \frac{(2t-1)\theta_k}{\theta_{k+1}}(y_{k+1} - x_k).
\end{aligned}$$

Proof.
To prove the equivalence, we show that the above sequence leads to
$$x_{k+1} = \Big(1 - \frac{1}{\theta_{k+1}}\Big)y_{k+1} + \frac{1}{\theta_{k+1}}z_{k+1}.$$
That is,
$$\begin{aligned}
x_{k+1} &= \Big(1 - \tfrac{1}{\theta_{k+1}}\Big)y_{k+1} + \frac{\theta_k}{\theta_{k+1}}y_{k+1} - \frac{\theta_k - 1}{\theta_{k+1}}y_k - \frac{(2t-1)\theta_k}{\theta_{k+1}L}\nabla f(x_k)\\
&= \Big(1 - \tfrac{1}{\theta_{k+1}}\Big)y_{k+1} + \frac{\theta_k}{\theta_{k+1}}\Big(x_k - \tfrac1L\nabla f(x_k)\Big) - \frac{\theta_k - 1}{\theta_{k+1}}y_k - \frac{(2t-1)\theta_k}{\theta_{k+1}L}\nabla f(x_k)\\
&= \Big(1 - \tfrac{1}{\theta_{k+1}}\Big)y_{k+1} + \frac{\theta_k}{\theta_{k+1}}x_k - \frac{\theta_k - 1}{\theta_{k+1}}y_k - \frac{2t\theta_k}{\theta_{k+1}L}\nabla f(x_k)\\
&= \Big(1 - \tfrac{1}{\theta_{k+1}}\Big)y_{k+1} + \frac{\theta_k}{\theta_{k+1}}\Big(y_k + \frac{\theta_{k-1}-1}{\theta_k}(y_k - y_{k-1}) - \frac{(2t-1)\theta_{k-1}}{\theta_k L}\nabla f(x_{k-1})\Big) - \frac{\theta_k - 1}{\theta_{k+1}}y_k - \frac{2t\theta_k}{\theta_{k+1}L}\nabla f(x_k)\\
&= \Big(1 - \tfrac{1}{\theta_{k+1}}\Big)y_{k+1} + \Big(\frac{\theta_k}{\theta_{k+1}} + \frac{\theta_{k-1}-1}{\theta_{k+1}} - \frac{\theta_k - 1}{\theta_{k+1}}\Big)y_k - \frac{\theta_{k-1}-1}{\theta_{k+1}}y_{k-1} - \frac{(2t-1)\theta_{k-1}}{\theta_{k+1}L}\nabla f(x_{k-1}) - \frac{2t\theta_k}{\theta_{k+1}L}\nabla f(x_k)\\
&= \Big(1 - \tfrac{1}{\theta_{k+1}}\Big)y_{k+1} + \frac{\theta_{k-1}}{\theta_{k+1}}y_k - \frac{\theta_{k-1}-1}{\theta_{k+1}}y_{k-1} - \frac{(2t-1)\theta_{k-1}}{\theta_{k+1}L}\nabla f(x_{k-1}) - \frac{2t\theta_k}{\theta_{k+1}L}\nabla f(x_k)\\
&= \Big(1 - \tfrac{1}{\theta_{k+1}}\Big)y_{k+1} + \frac{\theta_{k-1}}{\theta_{k+1}}\Big(x_{k-1} - \tfrac1L\nabla f(x_{k-1})\Big) - \frac{\theta_{k-1}-1}{\theta_{k+1}}y_{k-1} - \frac{(2t-1)\theta_{k-1}}{\theta_{k+1}L}\nabla f(x_{k-1}) - \frac{2t\theta_k}{\theta_{k+1}L}\nabla f(x_k)\\
&= \Big(1 - \tfrac{1}{\theta_{k+1}}\Big)y_{k+1} + \frac{\theta_{k-1}}{\theta_{k+1}}x_{k-1} - \frac{\theta_{k-1}-1}{\theta_{k+1}}y_{k-1} - \frac{2t\theta_k}{\theta_{k+1}L}\nabla f(x_k) - \frac{2t\theta_{k-1}}{\theta_{k+1}L}\nabla f(x_{k-1})\\
&\;\;\vdots\\
&= \Big(1 - \tfrac{1}{\theta_{k+1}}\Big)y_{k+1} + \frac{\theta_0}{\theta_{k+1}}x_0 - \frac{\theta_0 - 1}{\theta_{k+1}}y_0 - \frac{1}{\theta_{k+1}}\sum_{i=0}^{k}\frac{2t\theta_i}{L}\nabla f(x_i)\\
&= \Big(1 - \tfrac{1}{\theta_{k+1}}\Big)y_{k+1} + \frac{1}{\theta_{k+1}}z_{k+1}.
\end{aligned}$$
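As a sanity check (our own, not part of the paper), one can run the two-sequence unified form alongside the three-sequence form $x_{k+1} = (1 - 1/\theta_{k+1})y_{k+1} + z_{k+1}/\theta_{k+1}$ on an arbitrary quadratic and confirm the iterates coincide. The update $z_{k+1} = z_k - \frac{2t\theta_k}{L}\nabla f(x_k)$ with $z_0 = x_0$ is our reading of the telescoped display above:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 4
A = rng.normal(size=(n, n)); A = A.T @ A + np.eye(n)   # f(x) = x'Ax/2 - b'x
b = rng.normal(size=n)
grad = lambda x: A @ x - b
L = np.linalg.eigvalsh(A).max()                        # smoothness constant

t = 1.0                                                # t = 1: OGM, t = 1/2: AGM
theta = [1.0]
for _ in range(21):          # theta_{k+1}^2 - theta_{k+1} = theta_k^2
    theta.append((1 + np.sqrt(1 + 4 * theta[-1] ** 2)) / 2)

x0 = rng.normal(size=n)
x, y = x0.copy(), x0.copy()        # two-sequence (unified) form
x2, z = x0.copy(), x0.copy()       # three-sequence form

for k in range(20):
    y_next = x - grad(x) / L
    x = (y_next
         + (theta[k] - 1) / theta[k + 1] * (y_next - y)
         + (2 * t - 1) * theta[k] / theta[k + 1] * (y_next - x))
    y = y_next

    y2_next = x2 - grad(x2) / L
    z = z - 2 * t * theta[k] / L * grad(x2)
    x2 = (1 - 1 / theta[k + 1]) * y2_next + z / theta[k + 1]

print(np.allclose(x, x2))          # True: both forms generate the same iterates
```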
H. SC-OGM via linear coupling

In this section, we analyze SC-OGM through the linear coupling analysis. We consider the linear coupling form
$$\begin{aligned}
y_{k+1} &= x_k - \tfrac1L Q^{-1}\nabla f(x_k)\\
z_{k+1} &= \frac{1}{1+\gamma}\Big(z_k + \gamma x_k - \tfrac{\gamma}{\mu}Q^{-1}\nabla f(x_k)\Big)\\
x_{k+1} &= \tau z_{k+1} + (1-\tau)y_{k+1},
\end{aligned}$$
where $\tau$ is a coupling coefficient to be determined. As an aside, we can view $z_{k+1}$ as a mirror descent update of the form
$$z_{k+1} = \operatorname*{arg\,min}_z\Big\{\tfrac12\|z - z_k\|^2 + \tfrac{\gamma}{2}\|z - x_k\|^2 + \tfrac{\gamma}{\mu}\langle\nabla f(x_k),\, z\rangle\Big\},$$
which is similar to what was considered in Allen-Zhu et al. (2016b).
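The closed form for $z_{k+1}$ follows from setting the gradient of the strongly convex arg-min objective to zero, with norms taken as $\|v\|^2 = \langle Qv, v\rangle$. A quick numerical confirmation (our own sketch; $Q$ and the vectors are random stand-ins):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 4
M = rng.normal(size=(n, n))
Q = M.T @ M + np.eye(n)                 # Q > 0 defines the norm <Qv, v>
zk, xk, g = rng.normal(size=(3, n))     # z_k, x_k, stand-in for grad f(x_k)
gamma, mu = 0.8, 0.5

# Closed form: z_{k+1} = (z_k + gamma*x_k - (gamma/mu)*Q^{-1} g)/(1 + gamma)
z_next = (zk + gamma * xk - (gamma / mu) * np.linalg.solve(Q, g)) / (1 + gamma)

# Stationarity of the objective: Q(z - z_k) + gamma*Q(z - x_k) + (gamma/mu)*g = 0
residual = Q @ (z_next - zk) + gamma * Q @ (z_next - xk) + (gamma / mu) * g
print(np.allclose(residual, 0.0))       # True
```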
Lemma 17. Assume (A1), (A2), and (A3). Then,
$$\frac{\gamma}{\mu}\langle\nabla f(x_k),\, z_{k+1} - x_\star\rangle - \frac{\gamma}{2}\|x_k - x_\star\|^2 \le -\frac{\gamma^2}{2(1+\gamma)\mu^2}\|\nabla f(x_k)\|_*^2 + \frac12\|z_k - x_\star\|^2 - \frac{1+\gamma}{2}\|z_{k+1} - x_\star\|^2$$
for $k = 0, 1, \dots$.

Proof. This proof follows steps similar to that of Allen-Zhu et al. (2016b, Lemma 5.4).
From the definition of $z_{k+1}$ (first-order optimality of the arg-min), we have
$$0 = \Big\langle \frac{\partial}{\partial z}\Big\{\tfrac12\|z - z_k\|^2 + \tfrac{\gamma}{2}\|z - x_k\|^2 + \tfrac{\gamma}{\mu}\langle\nabla f(x_k),\, z\rangle\Big\}\Big|_{z = z_{k+1}},\; z_{k+1} - x_\star\Big\rangle = \langle Q(z_{k+1} - z_k),\, z_{k+1} - x_\star\rangle + \tfrac{\gamma}{\mu}\langle\nabla f(x_k),\, z_{k+1} - x_\star\rangle + \gamma\langle Q(z_{k+1} - x_k),\, z_{k+1} - x_\star\rangle.$$
By the three-point identity $\langle Q(a - b),\, a - c\rangle = \tfrac12\|a - b\|^2 + \tfrac12\|a - c\|^2 - \tfrac12\|b - c\|^2$,
$$\frac{\gamma}{\mu}\langle\nabla f(x_k),\, z_{k+1} - x_\star\rangle + \frac{\gamma}{2}\Big(\|x_k - z_{k+1}\|^2 - \|x_k - x_\star\|^2\Big) = -\frac12\|z_k - z_{k+1}\|^2 + \frac12\|z_k - x_\star\|^2 - \frac{1+\gamma}{2}\|z_{k+1} - x_\star\|^2.$$
Plugging in the definition of $z_{k+1}$,
$$\frac{\gamma}{2}\|x_k - z_{k+1}\|^2 + \frac12\|z_k - z_{k+1}\|^2 = \frac{\gamma}{2}\Big\|\frac{1}{1+\gamma}(x_k - z_k) + \frac{\gamma}{(1+\gamma)\mu}Q^{-1}\nabla f(x_k)\Big\|^2 + \frac12\Big\|-\frac{\gamma}{1+\gamma}(x_k - z_k) + \frac{\gamma}{(1+\gamma)\mu}Q^{-1}\nabla f(x_k)\Big\|^2 \ge \frac{\gamma^2}{2(1+\gamma)\mu^2}\|\nabla f(x_k)\|_*^2.$$
Combining the results above, we get
$$\frac{\gamma}{\mu}\langle\nabla f(x_k),\, z_{k+1} - x_\star\rangle - \frac{\gamma}{2}\|x_k - x_\star\|^2 \le -\frac{\gamma^2}{2(1+\gamma)\mu^2}\|\nabla f(x_k)\|_*^2 + \frac12\|z_k - x_\star\|^2 - \frac{1+\gamma}{2}\|z_{k+1} - x_\star\|^2.$$

Lemma 18 (Coupling lemma in SC-OGM). Assume (A1), (A2), and (A3). Then
$$(1+\gamma)\Big(f(x_k) - f_\star - \tfrac1{2L}\|\nabla f(x_k)\|_*^2 + \tfrac{\mu}{2}\|z_{k+1} - x_\star\|^2\Big) \le f(x_{k-1}) - f_\star - \tfrac1{2L}\|\nabla f(x_{k-1})\|_*^2 + \tfrac{\mu}{2}\|z_k - x_\star\|^2$$
holds for $k = 1, 2, \dots$.

Proof.
We have
$$\begin{aligned}
\gamma\big(f(x_k) - f(x_\star)\big)
&\le \gamma\langle\nabla f(x_k),\, x_k - x_\star\rangle - \frac{\mu\gamma}{2}\|x_k - x_\star\|^2\\
&= \gamma\langle\nabla f(x_k),\, x_k - z_k\rangle + \gamma\langle\nabla f(x_k),\, z_k - x_\star\rangle - \frac{\mu\gamma}{2}\|x_k - x_\star\|^2\\
&= \frac{1-\tau}{\tau}\gamma\langle\nabla f(x_k),\, y_k - x_k\rangle + \gamma\langle\nabla f(x_k),\, z_k - x_\star\rangle - \frac{\mu\gamma}{2}\|x_k - x_\star\|^2\\
&= \frac{1-\tau}{\tau}\gamma\Big\langle\nabla f(x_k),\, x_{k-1} - x_k - \tfrac1L Q^{-1}\nabla f(x_{k-1})\Big\rangle + \gamma\langle\nabla f(x_k),\, z_k - x_\star\rangle - \frac{\mu\gamma}{2}\|x_k - x_\star\|^2\\
&\le \Big(\frac{1-\tau}{\tau}\gamma - 1\Big)\Big\langle\nabla f(x_k),\, x_{k-1} - x_k - \tfrac1L Q^{-1}\nabla f(x_{k-1})\Big\rangle + \Big(f(x_{k-1}) - f(x_k) - \tfrac1{2L}\|\nabla f(x_{k-1})\|_*^2 - \tfrac1{2L}\|\nabla f(x_k)\|_*^2\Big)\\
&\qquad + \gamma\langle\nabla f(x_k),\, z_k - z_{k+1}\rangle + \gamma\langle\nabla f(x_k),\, z_{k+1} - x_\star\rangle - \frac{\mu\gamma}{2}\|x_k - x_\star\|^2\\
&\le \Big(\frac{1-\tau}{\tau}\gamma - 1\Big)\langle\nabla f(x_k),\, y_k - x_k\rangle + \Big(f(x_{k-1}) - f(x_k) - \tfrac1{2L}\|\nabla f(x_{k-1})\|_*^2 - \tfrac1{2L}\|\nabla f(x_k)\|_*^2\Big)\\
&\qquad + \gamma\langle\nabla f(x_k),\, z_k - z_{k+1}\rangle - \frac{\gamma^2}{2(1+\gamma)\mu}\|\nabla f(x_k)\|_*^2 + \frac{\mu}{2}\|z_k - x_\star\|^2 - (1+\gamma)\frac{\mu}{2}\|z_{k+1} - x_\star\|^2,
\end{aligned}$$
where the last inequality is an application of Lemma 17 (multiplied through by $\mu$). Note that
$$z_k - z_{k+1} = z_k - \frac{1}{1+\gamma}\Big(z_k + \gamma x_k - \tfrac{\gamma}{\mu}Q^{-1}\nabla f(x_k)\Big) = \frac{\gamma}{1+\gamma}(z_k - x_k) + \frac{\gamma}{(1+\gamma)\mu}Q^{-1}\nabla f(x_k) = \frac{\gamma}{1+\gamma}\cdot\frac{1-\tau}{\tau}(x_k - y_k) + \frac{\gamma}{(1+\gamma)\mu}Q^{-1}\nabla f(x_k).$$
To eliminate the $\langle\nabla f(x_k),\,\cdot\,\rangle$ term, we choose $\tau$ to satisfy
$$\frac{1-\tau}{\tau}\gamma - 1 = \frac{\gamma^2}{1+\gamma}\cdot\frac{1-\tau}{\tau}, \tag{12}$$
i.e., $\tau = \frac{\gamma}{1+2\gamma}$. Plugging this in, the inequality above becomes
$$\gamma\big(f(x_k) - f(x_\star)\big) \le \Big(f(x_{k-1}) - f(x_k) - \tfrac1{2L}\|\nabla f(x_{k-1})\|_*^2 - \tfrac1{2L}\|\nabla f(x_k)\|_*^2\Big) + \frac{\gamma^2}{2(1+\gamma)\mu}\|\nabla f(x_k)\|_*^2 + \frac{\mu}{2}\|z_k - x_\star\|^2 - (1+\gamma)\frac{\mu}{2}\|z_{k+1} - x_\star\|^2.$$
In order to obtain a telescoping form such as
$$M_k\Big(f(x_k) - f_\star - B_k\|\nabla f(x_k)\|_*^2 + C_k\|z_{k+1} - x_\star\|^2\Big) \le N_{k-1}\Big(f(x_{k-1}) - f_\star - B_{k-1}\|\nabla f(x_{k-1})\|_*^2 + C_{k-1}\|z_k - x_\star\|^2\Big),$$
we choose $B_k = \tfrac1{2L}$ and $C_k = \tfrac{\mu}{2}$, which leads to the choice of $\gamma$ satisfying
$$\frac{\gamma + 2}{2L} = \frac{\gamma^2}{2(1+\gamma)\mu}. \tag{13}$$
We get the desired result by plugging (12) and (13) into the above inequality.

I. Asymptotic characterization of $\theta_k$

Theorem 7.
Let the positive sequence $\{\theta_k\}_{k=0}^\infty$ satisfy $\theta_0 = 1$ and $\theta_{k+1}^2 - \theta_{k+1} - \theta_k^2 = 0$ for $k = 0, 1, \dots$. Then,
$$\theta_k = \frac{k+2}{2} + \frac14\log k + \zeta + o(1)$$
for some constant $\zeta$.

Proof.
The proof consists of the following three steps:
1. If $c_k < \frac14$, then $c_{k+1} < \frac14$.
2. $c_k \to \frac14$ as $k \to \infty$.
3. If $\theta_k = \frac{k+2}{2} + \frac14\log k + e_k$, then $e_k$ is convergent.

First step. If $c_k < \frac14$, then $c_{k+1} < \frac14$.
Let $\theta_k = \frac{k+2}{2} + c_k\log k$. For our convenience, let $c_0 = 0$ with $c_0\log 0 = 0$. Plugging this into $\theta_{k+1}^2 - \theta_{k+1} - \theta_k^2 = 0$ and completing the square, we have
$$\Big(\frac{k+2}{2} + c_{k+1}\log(k+1)\Big)^2 = \Big(\frac{k+2}{2} + c_k\log k\Big)^2 + \frac14,$$
so
$$\big(c_{k+1}\log(k+1) - c_k\log k\big)\big(k + 2 + c_{k+1}\log(k+1) + c_k\log k\big) = \frac14.$$
Assume $c_{k+1} \ge \frac14$. Then
$$\frac14 = \big(c_{k+1}\log(k+1) - c_k\log k\big)\big(k + 2 + c_{k+1}\log(k+1) + c_k\log k\big) \ge \frac14\log\Big(\frac{k+1}{k}\Big)(k+2) > \frac14,$$
a contradiction, which proves the first claim.

Second step. $c_k \to \frac14$ as $k \to \infty$.
Put $d_k = \frac14 - c_k$; then $0 < d_k \le \frac14$. We have
$$\frac14 = \Big(\frac14\log\Big(\frac{k+1}{k}\Big) - d_{k+1}\log(k+1) + d_k\log k\Big)\Big(k + 2 + \frac14\log\big(k(k+1)\big) - d_{k+1}\log(k+1) - d_k\log k\Big) \le \Big(\frac14\log\Big(\frac{k+1}{k}\Big) - d_{k+1}\log(k+1) + d_k\log k\Big)\Big(k + 2 + \frac12\log(k+1)\Big).$$
Therefore
$$d_{k+1}\log(k+1) - d_k\log k \le \frac14\log\Big(\frac{k+1}{k}\Big) - \frac14\cdot\frac{1}{k + 2 + \frac12\log(k+1)}.$$
By Taylor expansion,
$$d_{k+1}\log(k+1) - d_k\log k \le O\Big(\frac{\log k}{k^2}\Big).$$
So, by summing the above inequalities from $1$ to $k$, we get $d_{k+1}\log(k+1) \le C$ for some constant $C$, so $d_{k+1} < \frac{C}{\log(k+1)}$. In conclusion, $d_k \to 0$ as $k \to \infty$, i.e., $c_k \to \frac14$.

Third step. If $\theta_k = \frac{k+2}{2} + \frac14\log k + e_k$, then $e_k$ converges.
Note that $e_k = -d_k\log k$, so by the previous claim $|e_k|$ is bounded; in particular, $|e_k| < \frac16\log k$ for all sufficiently large $k$. Plugging this parametrization into the recursion gives
$$\Big(\frac{k+2}{2} + \frac14\log(k+1) + e_{k+1}\Big)^2 = \Big(\frac{k+2}{2} + \frac14\log k + e_k\Big)^2 + \frac14.$$
Then,
$$\frac14 = \Big(\frac14\log\Big(\frac{k+1}{k}\Big) + e_{k+1} - e_k\Big)\Big(k + 2 + \frac14\log\big(k(k+1)\big) + e_{k+1} + e_k\Big) \le \Big(\frac14\log\Big(\frac{k+1}{k}\Big) + e_{k+1} - e_k\Big)\Big(k + 2 + \frac56\log(k+1)\Big).$$
So,
$$e_{k+1} - e_k \ge \frac{1/4}{k + 2 + \frac56\log(k+1)} - \frac14\log\Big(\frac{k+1}{k}\Big) = -O\Big(\frac{\log k}{k^2}\Big).$$
Summing this over $k$, we get that $e_{k+1} > D$ for some constant $D$. Moreover,
$$\frac14 = \Big(\frac14\log\Big(\frac{k+1}{k}\Big) + e_{k+1} - e_k\Big)\Big(k + 2 + \frac14\log\big(k(k+1)\big) + e_{k+1} + e_k\Big) \ge \Big(\frac14\log\Big(\frac{k+1}{k}\Big) + e_{k+1} - e_k\Big)(k + 2) > \frac14 + (k+2)(e_{k+1} - e_k),$$
which indicates that $e_{k+1} < e_k$ for all sufficiently large $k$. Since $\{e_k\}_{k=0}^\infty$ is eventually monotonically decreasing and bounded below, it converges, which completes the proof.
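As a quick numerical illustration of Theorem 7 (our own check, not part of the proof), iterating the recursion and tracking the residual $\theta_k - \frac{k+2}{2} - \frac14\log k$ shows it settling toward a constant $\zeta$:

```python
import numpy as np

theta = 1.0                                   # theta_0 = 1
for k in range(1, 10**5 + 1):
    # theta_k is the positive root of t^2 - t - theta_{k-1}^2 = 0
    theta = (1.0 + np.sqrt(1.0 + 4.0 * theta**2)) / 2.0
    if k in (10**2, 10**3, 10**4, 10**5):
        print(k, theta - (k + 2) / 2 - 0.25 * np.log(k))
# The printed residuals stabilize as k grows, consistent with
# theta_k = (k + 2)/2 + (1/4) log k + zeta + o(1).
```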