Accelerated Algorithms for Smooth Convex-Concave Minimax Problems with \mathcal{O}(1/k^2) Rate on Squared Gradient Norm
TaeHo Yoon    Ernest K. Ryu

Department of Mathematical Sciences, Seoul National University, Seoul, Korea. Correspondence to: Ernest K. Ryu <[email protected]>.

Abstract
In this work, we study the computational complexity of reducing the squared gradient magnitude for smooth minimax optimization problems. First, we present algorithms with accelerated O(1/k²) last-iterate rates, faster than the existing O(1/k) or slower rates for extragradient, Popov, and gradient descent with anchoring. The acceleration mechanism combines extragradient steps with anchoring and is distinct from Nesterov's acceleration. We then establish optimality of the O(1/k²) rate through a matching lower bound.
1. Introduction
Minimax optimization problems, or minimax games, of the form

  minimize_{x ∈ R^n}  maximize_{y ∈ R^m}  L(x, y)    (1)

have recently gained significant interest in the optimization and machine learning communities due to their application in adversarial training (Goodfellow et al., 2015; Madry et al., 2018) and generative adversarial networks (GANs) (Goodfellow et al., 2014). Prior works on minimax optimization often consider compact domains X, Y for x, y and use the duality gap
  Err_gap(x, y) := sup_{ỹ ∈ Y} L(x, ỹ) − inf_{x̃ ∈ X} L(x̃, y)

to quantify suboptimality of algorithms' iterates in solving (1). However, while it is a natural analog of minimization error for minimax problems, the duality gap can be difficult to measure directly in practice, and it is unclear how to generalize the notion to non-convex-concave problems. In contrast, the squared gradient magnitude ‖∇L(x, y)‖², when L is differentiable, is a more directly observable value for quantifying suboptimality. Moreover, the notion is meaningful for differentiable non-convex-concave minimax games. Interestingly, very few prior works have analyzed convergence rates on the gradient norm for minimax problems, and the optimal convergence rate or corresponding algorithms were hitherto unknown.

Contributions.
In this work, we introduce the extra anchored gradient (EAG) algorithms for smooth convex-concave minimax problems and establish an accelerated ‖∇L(z^k)‖² ≤ O(R²/k²) rate, where R is the Lipschitz constant of ∇L. The rate improves upon the O(R²/k) rates of prior algorithms and is, to the best of our knowledge, the first accelerated rate in this setup. We then provide a matching Ω(R²/k²) complexity lower bound for gradient-based algorithms and thereby establish optimality of EAG.

Beyond establishing the optimal complexity, our results provide the following observations. First, different suboptimality measures lead to materially different acceleration mechanisms, since reducing the duality gap is done optimally by the extragradient algorithm (Nemirovski, 2004; Nemirovsky & Yudin, 1983). Also, since our optimal accelerated convergence rate is on the non-ergodic last iterate, neither averaging nor keeping track of the best iterate is necessary for optimally reducing the gradient magnitude in the deterministic setup.

We say a saddle function L: R^n × R^m → R is convex-concave if L(x, y) is convex in x ∈ R^n for all fixed y ∈ R^m and L(x, y) is concave in y ∈ R^m for all fixed x ∈ R^n. We say (x⋆, y⋆) is a saddle point of L if L(x⋆, y) ≤ L(x⋆, y⋆) ≤ L(x, y⋆) for all x ∈ R^n and y ∈ R^m. Solutions to the minimax problem (1) are defined to be saddle points of L. For notational conciseness, write z = (x, y). When L is differentiable, define the saddle subdifferential of L at z = (x, y) by

  ∂L(z) = ( ∇_x L(x, y), −∇_y L(x, y) ).    (2)

The saddle subdifferential is a monotone operator (Rockafellar, 1970), i.e., ⟨∂L(z) − ∂L(z′), z − z′⟩ ≥ 0 for all z, z′ ∈ R^n × R^m. We say L is R-smooth if ∂L is R-Lipschitz continuous. Note that ∇L ≠ ∂L due to the sign change in the y-gradient, but ‖∇L‖ = ‖∂L‖, and we use the two forms interchangeably. Because z⋆ = (x⋆, y⋆) is a saddle point if and only if ∂L(z⋆) = 0, the squared gradient magnitude is a natural measure of suboptimality at a given point for smooth convex-concave problems.

The first main component of our proposed algorithm is the extragradient (EG) algorithm of Korpelevich (1977). EG and its variants, including the algorithm of Popov (1980), have been studied in the context of saddle point and variational inequality problems and have appeared in the mathematical programming literature (Solodov & Svaiter, 1999; Tseng, 2000; Noor, 2003; Censor et al., 2011; Lyashko et al., 2011; Malitsky & Semenov, 2014; Malitsky, 2015; 2020). More recently in the machine learning literature, similar ideas such as optimism (Chiang et al., 2012; Rakhlin & Sridharan, 2013a), prediction (Yadav et al., 2018), and negative momentum (Gidel et al., 2019; Zhang et al., 2020) have been presented and used in the context of multi-player games (Daskalakis et al., 2011; Rakhlin & Sridharan, 2013b; Syrgkanis et al., 2015; Antonakopoulos et al., 2021) and GANs (Gidel et al., 2018; Mertikopoulos et al., 2019; Liang & Stokes, 2019; Peng et al., 2020).

O(R/k) rates on duality gap.
For minimax problems with an R-smooth L and bounded domains for x and y, Nemirovski (2004) and Nesterov (2007) respectively presented the mirror-prox and dual extrapolation algorithms generalizing EG and established ergodic O(R/k) convergence rates on Err_gap. Monteiro & Svaiter (2010), Monteiro & Svaiter (2011), and Mokhtari et al. (2020b) extended the O(R/k) complexity analysis to the case of unbounded domains. Since there exists an Ω(R/k) complexity lower bound on Err_gap for black-box gradient-based minimax optimization algorithms (Nemirovsky & Yudin, 1983), these algorithms are order-optimal in terms of the duality gap.
Convergence rates on squared gradient norm.
Using standard summability arguments that appear in weak convergence proofs for EG (e.g., (Solodov & Svaiter, 1999, Lemma 2.3)), one can show a min_{i=0,...,k} ‖∂L(z^i)‖² ≤ O(R²/k) convergence rate of EG, provided that L is R-smooth. Ryu et al. (2019) showed that optimistic descent algorithms also attain O(R²/k) convergence in terms of the best iterate, proposed simultaneous gradient descent with anchoring, which pulls iterates toward the initial point z^0, and established last-iterate convergence rates in terms of the squared gradient norm, albeit no faster than O(R²/k). Anchoring is the second main component of EAG; we combine the EG steps with anchoring and obtain an accelerated last-iterate convergence rate of O(R²/k²).

Structured minimax problems.
For structured minimax problems of the form L(x, y) = f(x) + ⟨Ax, y⟩ − g(y), where f, g are convex and A is a linear operator, primal-dual splitting algorithms (Chambolle & Pock, 2011; Condat, 2013; Vũ, 2013; Yan, 2018; Ryu & Yin, 2021) and Nesterov's smoothing technique (Nesterov, 2005a;b) have also been extensively studied (Chen et al., 2014; He & Monteiro, 2016). Notably, when g is of "simple" form, Nesterov's smoothing framework achieves an accelerated rate O(‖A‖/k + L_f/k²) on the duality gap, where L_f denotes the smoothness parameter of f. Additionally, Chambolle & Pock (2016) have shown that splitting algorithms can achieve O(1/k²) or linear convergence rates under appropriate strong convexity and smoothness assumptions on f and g, although they rely on proximal operations. Kolossoski & Monteiro (2017); Hamedani & Aybat (2018); Zhao (2019); Alkousa et al. (2020) generalized these accelerated algorithms to the setting where the coupling term ⟨Ax, y⟩ is replaced by a non-bilinear convex-concave function Φ(x, y).

Complexity lower bounds.
Ouyang & Xu (2021) presented an Ω(‖A‖/k + L_f/k²) complexity lower bound on the duality gap for gradient-based algorithms solving bilinear minimax problems with proximable g, establishing optimality of Nesterov's smoothing. Zhang et al. (2019) presented lower bounds for strongly-convex-strongly-concave problems. These approaches are aligned with the information-based complexity analysis introduced in (Nemirovsky & Yudin, 1983) and thoroughly studied in (Nemirovsky, 1991; 1992) for the special case of linear equations.

Other problem setups.
Nesterov (2009) and Nedić & Ozdaglar (2009) proposed subgradient algorithms for non-smooth minimax problems. Stochastic minimax and variational inequality problems were studied in (Nemirovski et al., 2009; Juditsky et al., 2011; Lan, 2012; Ghadimi & Lan, 2012; 2013; Chen et al., 2014; 2017; Hsieh et al., 2019). Strongly monotone variational inequality problems or strongly-convex-strongly-concave minimax problems were studied in (Tseng, 1995; Nesterov & Scrimali, 2011; Gidel et al., 2018; Mokhtari et al., 2020a; Lin et al., 2020b; Wang & Li, 2020; Zhang et al., 2020; Azizian et al., 2020). Recently, minimax problems with objectives that are either strongly convex or nonconvex in one variable were studied in (Rafique et al., 2018; Thekumparampil et al., 2019; Jin et al., 2019; Nouiehed et al., 2019; Ostrovskii et al., 2020; Lin et al., 2020a;b; Lu et al., 2020; Wang & Li, 2020; Yang et al., 2020; Chen et al., 2021).
2. Accelerated algorithms: Extra anchored gradient
We now present two accelerated EAG algorithms that are qualitatively very similar but differ in the choice of step-sizes. The two algorithms present a tradeoff between the simplicity of the step-size and the simplicity of the convergence proof; one algorithm has a varying step-size but a simpler convergence proof, while the other algorithm has a simpler constant step-size but a more complicated proof.
The proposed extra anchored gradient (EAG) algorithms have the following general form:

  z^{k+1/2} = z^k + β_k(z^0 − z^k) − α_k ∂L(z^k)
  z^{k+1}   = z^k + β_k(z^0 − z^k) − α_k ∂L(z^{k+1/2})    (3)

for k ≥ 0, where z^0 ∈ R^n × R^m is the starting point. We use ∂L defined in (2) rather than describing the x- and y-updates separately to keep the notation concise. We call α_k > 0 step-sizes and β_k ∈ [0, 1) anchoring coefficients. Note that when β_k = 0, EAG coincides with the unconstrained extragradient algorithm. The simplest choice of {α_k}_{k≥0} is the constant one. Together with the choice β_k = 1/(k+2) (which we clarify later), we get the following simpler algorithm.

EAG with constant step-size (EAG-C)

  z^{k+1/2} = z^k + (1/(k+2))(z^0 − z^k) − α ∂L(z^k)
  z^{k+1}   = z^k + (1/(k+2))(z^0 − z^k) − α ∂L(z^{k+1/2})

where α > 0 is fixed.
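The update (3) translates directly into code. The following is a minimal sketch of EAG-C, assuming a user-supplied function saddle_grad that returns the saddle subdifferential ∂L(z) of (2) as a single stacked vector; the function and variable names are ours, not part of the paper.

```python
import numpy as np

def eag_c(saddle_grad, z0, alpha, num_iters):
    """Sketch of EAG-C: extragradient steps combined with anchoring to z0.

    saddle_grad(z) should return dL(z) = (grad_x L(x, y), -grad_y L(x, y))
    stacked as one vector; alpha is the constant step-size.
    """
    z_anchor = np.asarray(z0, dtype=float)
    z = z_anchor.copy()
    for k in range(num_iters):
        beta = 1.0 / (k + 2)                                        # anchoring coefficient
        z_half = z + beta * (z_anchor - z) - alpha * saddle_grad(z)
        z = z + beta * (z_anchor - z) - alpha * saddle_grad(z_half)
    return z

# Toy usage: L(x, y) = x * y has saddle subdifferential (y, -x) and R = 1,
# so alpha = 1/8 lies in the step-size range used in Corollary 1.
grad = lambda z: np.array([z[1], -z[0]])
print(eag_c(grad, np.array([1.0, 1.0]), alpha=1 / 8, num_iters=500))
```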
Theorem 1.

Assume L: R^n × R^m → R is an R-smooth convex-concave function with a saddle point z⋆. Assume α > 0 satisfies the step-size conditions (4), a pair of polynomial inequalities in αR that hold, in particular, for all α ∈ (0, 1/(8R)]. Then EAG-C converges with rate

  ‖∇L(z^k)‖² ≤ (4(1 + αR + α²R²))/(α²(1 + αR)) · ‖z^0 − z⋆‖²/(k+1)²

for k ≥ 0.

Corollary 1.
In the setup of Theorem 1, any α ∈ (0, 1/(8R)] satisfies (4), and the particular choice α = 1/(8R) yields

  ‖∇L(z^k)‖² ≤ 260 R²‖z^0 − z⋆‖²/(k+1)²

for k ≥ 0.

While EAG-C is simple in its form, its convergence proof (presented in the appendix) is complicated. Furthermore, the constant in Corollary 1 seems large and raises the question of whether it could be reduced. These issues, to some extent, are addressed by the following alternative version of EAG.
EAG with varying step-size (EAG-V)

  z^{k+1/2} = z^k + (1/(k+2))(z^0 − z^k) − α_k ∂L(z^k)
  z^{k+1}   = z^k + (1/(k+2))(z^0 − z^k) − α_k ∂L(z^{k+1/2}),

where α_0 ∈ (0, 3/(4R)) and

  α_{k+1} = (α_k/(1 − α_k²R²)) (1 − (k+2)²α_k²R²/((k+1)(k+3)))
          = α_k (1 − α_k²R²/((k+1)(k+3)(1 − α_k²R²)))    (5)

for k ≥ 0. As the recurrence relation (5) may seem unfamiliar, we provide the following lemma describing the behavior of the resulting sequence.

Lemma 1. If α_0 ∈ (0, 3/(4R)), then the sequence {α_k}_{k≥0} of (5) monotonically decreases to a positive limit. In particular, when α_0 = 0.618/R, we have lim_{k→∞} α_k ≈ 0.437/R.
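As a quick numerical illustration of Lemma 1, the recurrence (5) can be iterated directly. The sketch below is our own (not from the paper); it checks monotone decrease of the step-sizes and prints the empirical limit for α_0 = 0.618/R.

```python
import numpy as np

def eag_v_stepsizes(alpha0, R, num_terms):
    """Iterate the EAG-V step-size recurrence (5) and return the sequence."""
    alphas = [alpha0]
    for k in range(num_terms - 1):
        a = alphas[-1]
        aR2 = (a * R) ** 2
        alphas.append(a * (1.0 - aR2 / ((k + 1) * (k + 3) * (1.0 - aR2))))
    return np.array(alphas)

alphas = eag_v_stepsizes(alpha0=0.618, R=1.0, num_terms=100000)
assert np.all(np.diff(alphas) < 0)   # monotone decrease, as in Lemma 1
print(alphas[-1])                    # settles near 0.437 for this starting value
```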
We now state the convergence results for EAG-V.

Theorem 2.
Assume L: R^n × R^m → R is an R-smooth convex-concave function with a saddle point z⋆. Assume α_0 ∈ (0, 3/(4R)), and define α_∞ = lim_{k→∞} α_k. Then EAG-V converges with rate

  ‖∇L(z^k)‖² ≤ (4(1 + α_0 α_∞ R²)/α_∞²) · ‖z^0 − z⋆‖²/((k+1)(k+2))

for k ≥ 0.

Corollary 2.
EAG-V with α_0 = 0.618/R satisfies

  ‖∇L(z^k)‖² ≤ 27 R²‖z^0 − z⋆‖²/((k+1)(k+2))

for k ≥ 0.

We now outline the convergence analysis for EAG-V, whose proof is simpler than that of EAG-C. The key ingredient of the proof is a Lyapunov analysis with a nonincreasing Lyapunov function, the V_k of the following lemma.

Lemma 2.
Let {β_k}_{k≥0} ⊆ (0, 1) and α_0 ∈ (0, 1/R) be given. Let b_0 = 1. Define the sequences {a_k}_{k≥0}, {b_k}_{k≥0}, and {α_k}_{k≥0} by the recurrence relations

  a_k = (α_k b_k)/(2β_k)    (6)
  b_{k+1} = b_k/(1 − β_k)    (7)
  α_{k+1} = (α_k β_{k+1}(1 − α_k²R² − β_k²))/(β_k(1 − β_k)(1 − α_k²R²))    (8)

for k ≥ 0. Suppose that α_k ∈ (0, 1/R) holds for all k ≥ 0. Assume L is R-smooth and convex-concave. Then the sequence {V_k}_{k≥0} defined as

  V_k := a_k‖∂L(z^k)‖² + b_k⟨∂L(z^k), z^k − z^0⟩    (9)

for the EAG iterations in (3) is nonincreasing.

In Lemma 2, the choice of β_k = 1/(k+2) leads to b_k = k + 1, a_k = α_k(k+1)(k+2)/2, and (5). Why the Lyapunov function of Lemma 2 leads to the convergence guarantee of Theorem 2 may not be immediately obvious. The following proof provides the analysis.

Proof of Theorem 2.
Let β_k = 1/(k+2) as specified by the definition of EAG-V. By Lemma 2, the quantity V_k defined by (9) is nonincreasing in k. Therefore,

  V_k ≤ ⋯ ≤ V_0 = α_0‖∂L(z^0)‖² ≤ α_0 R²‖z^0 − z⋆‖².

Next, we have

  V_k = a_k‖∂L(z^k)‖² + b_k⟨∂L(z^k), z^k − z^0⟩
    (a) ≥ a_k‖∂L(z^k)‖² + b_k⟨∂L(z^k), z⋆ − z^0⟩
    (b) ≥ a_k‖∂L(z^k)‖² − (a_k/2)‖∂L(z^k)‖² − (b_k²/(2a_k))‖z^0 − z⋆‖²
    (c) = (α_k(k+1)(k+2)/4)‖∂L(z^k)‖² − ((k+1)/(α_k(k+2)))‖z^0 − z⋆‖²
    (d) ≥ (α_∞(k+1)(k+2)/4)‖∂L(z^k)‖² − (1/α_∞)‖z^0 − z⋆‖²,

where (a) follows from the monotonicity inequality ⟨∂L(z^k), z^k − z⋆⟩ ≥ 0, (b) follows from Young's inequality, (c) follows from plugging in a_k = α_k(k+1)(k+2)/2 and b_k = k + 1, and (d) follows from Lemma 1 (α_k ↓ α_∞). Reorganize to get

  (α_∞(k+1)(k+2)/4)‖∂L(z^k)‖² ≤ V_k + (1/α_∞)‖z^0 − z⋆‖² ≤ (α_0 R² + 1/α_∞)‖z^0 − z⋆‖²,

and divide both sides by α_∞(k+1)(k+2)/4.

The algorithms and results of Sections 2.1 and 2.2 remain valid when we replace ∂L with an R-Lipschitz continuous monotone operator; neither the definition of the EAG algorithms nor any part of the proofs of Theorems 1 and 2 utilizes properties of saddle functions beyond the monotonicity of their subdifferentials.

For EAG-C, the step-size conditions (4) in Theorem 1 can be relaxed to accommodate larger values of α. However, we do not pursue such generalizations to keep the already complicated and arduous analysis of EAG-C manageable. Also, larger step-sizes are more naturally allowed in EAG-V and Theorem 2. Finally, although (4) holds for values of α somewhat larger than 1/(8R), we present the slightly smaller range (0, 1/(8R)] in Corollary 1 for simplicity.

For EAG-V, the choice β_k = 1/(k+2) was obtained by roughly, but not fully, optimizing the bound on EAG-V originating from Lemma 2. If one chooses β_k = 1/(k+δ) with δ > 1, then (6) and (7) become

  a_k = (α_k(k+δ)(k+δ−1))/(2(δ−1)),   b_k = (k+δ−1)/(δ−1).

As the proof of Theorem 2 illustrates, linear growth of b_k and quadratic growth of a_k lead to O(1/k²) convergence of ‖∂L(z^k)‖². The value α_0 = 0.618/R in Lemma 1 and Corollary 2 was obtained by numerically minimizing the constant 4(1 + α_0α_∞R²)/α_∞² in Theorem 2 in the case of δ = 2. The choice δ = 2, however, is not optimal. Indeed, the constant of Corollary 2 can be reduced somewhat further by numerically optimizing over both δ and α_0. Finally, there is a possibility that a choice of β_k not of the form β_k = 1/(k+δ) leads to an improved constant.

In the end, we choose to present EAG-C and EAG-V with the simple choice β_k = 1/(k+2). As we establish in Section 3, the EAG algorithms are optimal up to a constant.
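To make the Lyapunov argument above concrete, the following small sketch (our own construction, not from the paper) runs EAG-V on a random bilinear saddle function L(x, y) = ⟨Ax, y⟩ and checks numerically that the quantity V_k of (9), with a_k = α_k(k+1)(k+2)/2 and b_k = k + 1, is nonincreasing along the iterates.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5
A = rng.standard_normal((n, n))
R = np.linalg.norm(A, 2)                      # Lipschitz constant of dL

def saddle_grad(z):                           # dL(z) = (A^T y, -A x) for L = <Ax, y>
    x, y = z[:n], z[n:]
    return np.concatenate([A.T @ y, -A @ x])

z0 = rng.standard_normal(2 * n)
z, alpha = z0.copy(), 0.618 / R
V_prev = np.inf
for k in range(200):
    a_k, b_k = alpha * (k + 1) * (k + 2) / 2, k + 1
    g = saddle_grad(z)
    V = a_k * (g @ g) + b_k * (g @ (z - z0))  # Lyapunov function (9)
    assert V <= V_prev + 1e-6 * (1 + abs(V_prev)), "V_k should be nonincreasing"
    V_prev = V
    beta = 1.0 / (k + 2)                      # EAG-V update (3)
    z_half = z + beta * (z0 - z) - alpha * saddle_grad(z)
    z = z + beta * (z0 - z) - alpha * saddle_grad(z_half)
    aR2 = (alpha * R) ** 2                    # step-size recurrence (5)
    alpha = alpha * (1 - aR2 / ((k + 1) * (k + 3) * (1 - aR2)))
```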
3. Optimality of EAG via a matching complexity lower bound
Upon seeing an accelerated algorithm, it is natural to ask whether the algorithm is optimal. In this section, we present an Ω(R²/k²) complexity lower bound for the class of deterministic gradient-based algorithms for smooth convex-concave minimax problems. This result establishes that EAG is indeed optimal.

For the class of smooth minimax optimization problems, a deterministic algorithm A produces iterates (x^k, y^k) = z^k for k ≥ 1 given a starting point (x^0, y^0) = z^0 and a saddle function L, and we write z^k = A(z^0, …, z^{k−1}; L) for k ≥ 1. Define A_sim as the class of algorithms satisfying

  z^k ∈ z^0 + span{∂L(z^0), …, ∂L(z^{k−1})},    (10)

and A_sep as the class of algorithms satisfying

  x^k ∈ x^0 + span{∇_x L(x^0, y^0), …, ∇_x L(x^{k−1}, y^{k−1})}
  y^k ∈ y^0 + span{∇_y L(x^0, y^0), …, ∇_y L(x^{k−1}, y^{k−1})}.    (11)

To clarify, algorithms in A_sim access and utilize the x- and y-subgradients simultaneously. So A_sim contains simultaneous gradient descent, extragradient, Popov, and EAG. On the other hand, algorithms in A_sep can access and utilize the x- and y-subgradients separately. So A_sim ⊂ A_sep, and alternating gradient descent-ascent belongs to A_sep but not to A_sim.

In this section, we present a complexity lower bound that applies to all algorithms in A_sep, not just the algorithms in A_sim. Although EAG-C and EAG-V are in A_sim, we consider the broader class A_sep to rule out the possibility that separately updating the x- and y-variables provides an improvement beyond a constant factor.

We say L(x, y) is biaffine if it is an affine function of x for any fixed y and an affine function of y for any fixed x. Biaffine functions are, of course, convex-concave. We first establish a complexity lower bound on minimax optimization problems with biaffine loss functions.

Theorem 3.
Let k ≥ 1 be fixed. For any n ≥ k + 2, there exists an R-smooth biaffine function L on R^n × R^n for which

  ‖∇L(z^k)‖² ≥ R²‖z^0 − z⋆‖²/(2⌊k/2⌋ + 1)²    (12)

holds for any algorithm in A_sep, where ⌊·⌋ is the floor function and z⋆ is the saddle point of L closest to z^0. Moreover, this lower bound is optimal in the sense that it cannot be improved with biaffine functions.

Since smooth biaffine functions are special cases of smooth convex-concave functions, Theorem 3 implies the optimality of EAG applied to smooth convex-concave minimax optimization problems.
Corollary 3.
For R-smooth convex-concave minimax problems, an algorithm in A_sep cannot attain a worst-case convergence rate better than

  R²‖z^0 − z⋆‖²/(2⌊k/2⌋ + 1)²

with respect to ‖∇L(z^k)‖². Since EAG-C and EAG-V have rates O(R²‖z^0 − z⋆‖²/k²), they are optimal, up to a constant factor, in A_sep.

Consider biaffine functions of the form L(x, y) = ⟨Ax − b, y − c⟩, where A ∈ R^{n×n} and b, c ∈ R^n. Then ∇_x L(x, y) = A^⊤(y − c), ∇_y L(x, y) = Ax − b, ∂L is ‖A‖-Lipschitz, and solutions to

  minimize_{x ∈ R^n}  maximize_{y ∈ R^n}  ⟨Ax − b, y − c⟩

are characterized by Ax − b = 0 and A^⊤(y − c) = 0. Through translation, we may assume without loss of generality that x^0 = 0, y^0 = 0. In this case, (11) becomes

  x^k ∈ span{A^⊤c, A^⊤(AA^⊤)c, …} + span{A^⊤b, A^⊤(AA^⊤)b, …}
  y^k ∈ span{b, (AA^⊤)b, …} + span{AA^⊤c, (AA^⊤)²c, …}    (13)

for k ≥ 1, where each span collects the first roughly k/2 terms of the indicated form. (We detail these arguments in the appendix.) Furthermore, let A = A^⊤ and b = A^⊤c = Ac. Then the characterization of A_sep further simplifies to

  x^k, y^k ∈ K_{k−1}(A; b) := span{b, Ab, A²b, …, A^{k−1}b}.

Note that K_{k−1}(A; b) is the order-(k−1) Krylov subspace. Consider the following lemma. Its proof, deferred to the appendix, combines arguments from Nemirovsky (1991; 1992).
Lemma 3.
Let
R > 0, k ≥ 1, and n ≥ k + 2. Then there exists A = A^⊤ ∈ R^{n×n} such that ‖A‖ ≤ R and a b ∈ R(A), satisfying

  ‖Ax − b‖² ≥ R²‖x⋆‖²/(2⌊k/2⌋ + 1)²    (14)

for any x ∈ K_{k−1}(A; b), where x⋆ is the minimum norm solution to the equation Ax = b.

Take A and b as in Lemma 3 and c = x⋆. Then z⋆ = (x⋆, x⋆) is the saddle point of L(x, y) = ⟨Ax − b, y − c⟩ with minimum norm. Finally,

  ‖∇L(x^k, y^k)‖² = ‖A^⊤(y^k − c)‖² + ‖Ax^k − b‖²
                  = ‖Ay^k − b‖² + ‖Ax^k − b‖²
                  ≥ R²‖x⋆‖²/(2⌊k/2⌋ + 1)² + R²‖x⋆‖²/(2⌊k/2⌋ + 1)²
                  = R²‖z⋆ − z^0‖²/(2⌊k/2⌋ + 1)²

for any x^k, y^k ∈ K_{k−1}(A; b). This completes the construction of the biaffine L of Theorem 3.
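The construction can be probed numerically. The sketch below is our own illustration; the specific matrix A of Lemma 3 is deferred to the appendix, so here we simply use a random symmetric A. It computes, for the biaffine L(x, y) = ⟨Ax − b, y − c⟩, the smallest squared gradient norm attainable by iterates confined to the Krylov subspace K_{k−1}(A; b), which is exactly the quantity the lower bound controls. For a random A this value is typically much smaller than the worst-case R²‖x⋆‖²/(2⌊k/2⌋ + 1)²; the A of Lemma 3 is chosen to make it as large as possible.

```python
import numpy as np

rng = np.random.default_rng(1)
n, k = 30, 6
M = rng.standard_normal((n, n))
A = (M + M.T) / 2
A *= 1.0 / np.linalg.norm(A, 2)              # symmetric, ||A|| <= R = 1
x_star = rng.standard_normal(n)
b = A @ x_star                               # b lies in range(A); take c = x_star

# Krylov subspace K_{k-1}(A; b) = span{b, Ab, ..., A^{k-1} b}
K = np.column_stack([np.linalg.matrix_power(A, j) @ b for j in range(k)])

# Best achievable residual ||A x - b|| over x in the Krylov subspace.
coef, *_ = np.linalg.lstsq(A @ K, b, rcond=None)
best_residual_sq = np.linalg.norm(A @ (K @ coef) - b) ** 2

# With both x^k and y^k restricted to the subspace, the squared gradient norm
# ||A y - b||^2 + ||A x - b||^2 is at least 2 * best_residual_sq.
print(2 * best_residual_sq)
```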
Let F be a function class, P_F = {P_f}_{f∈F} a class of optimization problems (with some common form), and E(·; P_f) a suboptimality measure for the problem P_f. Define the worst-case complexity of an algorithm A for P_F at the k-th iteration, given the initial condition ‖z^0 − z⋆‖ ≤ D, as

  C(A; P_F, D, k) := sup_{z^0 ∈ B(z⋆; D), f ∈ F} E(z^k; P_f),

where z^j = A(z^0, …, z^{j−1}; f) for j = 1, …, k and B(z; D) denotes the closed ball of radius D centered at z. The optimal complexity lower bound with respect to an algorithm class 𝒜 is

  C(𝒜; P_F, D, k) := inf_{A ∈ 𝒜} C(A; P_F, D, k) = inf_{A ∈ 𝒜} sup_{z^0 ∈ B(z⋆; D), f ∈ F} E(z^k; P_f).

A complexity lower bound is a lower bound on the optimal complexity lower bound.

Let L_R(R^n × R^m) be the class of R-smooth convex-concave functions on R^n × R^m, P_L the minimax problem (1), and E(z; P_L) = ‖∂L(z)‖². With this notation, the results of Section 2 can be expressed as

  C(EAG; P_{L_R(R^n × R^m)}, D, k) = O(R²D²/k²).

Let L^biaff_R(R^n × R^m) be the class of R-smooth biaffine functions on R^n × R^m. Then the first statement of Theorem 3, the existence of L, can be expressed as

  C(A_sep; P_{L^biaff_R(R^n × R^n)}, D, k) ≥ R²D²/(2⌊k/2⌋ + 1)²    (15)

for n ≥ k + 2.

As an aside, the argument of Corollary 3 can be expressed as: for any A ∈ A_sep, we have

  C(A; P_{L_R(R^n × R^n)}, D, k) ≥ C(A_sep; P_{L_R(R^n × R^n)}, D, k)
                                 ≥ C(A_sep; P_{L^biaff_R(R^n × R^n)}, D, k)
                                 ≥ R²D²/(2⌊k/2⌋ + 1)².

The first inequality follows from
A ∈ A_sep, the second from L^biaff_R ⊂ L_R, and the third from Theorem 3.

Optimality of lower bound of Theorem 3.
The second statement of Theorem 3, optimality of the lower bound, can be expressed as

  C(A_sep; P_{L^biaff_R(R^n × R^n)}, D, k) = R²D²/(2⌊k/2⌋ + 1)²    (16)

for n ≥ k + 2. We establish this claim with the following chain of inequalities, which we define and justify one at a time:

  R²D²/(2⌊k/2⌋ + 1)² ≤ C(A_sep; P_{L^biaff_R(R^n × R^n)}, D, k)    (17)
                      ≤ C(A_sim; P_{L^biaff_R(R^n × R^n)}, D, k)    (18)
                      ≤ C(A_lin; P^{n,skew}_{R,D}, k)    (19)
                      ≤ C(A_lin; P^n_{R,D}, k)    (20)
                      ≤ R²D²/(2⌊k/2⌋ + 1)².    (21)

Once these inequalities are established, equality holds throughout and (16) is proved.

Inequality (17) is what we established in Section 3.1. Inequality (18) follows from A_sim ⊂ A_sep and the fact that the infimum over a larger class is smaller.

Inequality (19) follows from the following observation. If we express an L ∈ L^biaff_R(R^n × R^n) as L(x, y) = b^⊤x + x^⊤Ay − c^⊤y, then

  ∂L(x, y) = [ O  A ; −A^⊤  O ] [ x ; y ] + [ b ; c ].

Therefore, the minimax problem with L is equivalent to solving the linear operator equation Bz = v with the skew-symmetric matrix B = [ O  A ; −A^⊤  O ] and v = −[ b ; c ] ∈ R^{2n}. Through translation, we may assume without loss of generality that z^0 = 0. In this case, condition (10) becomes

  z^k ∈ K_{k−1}(B; v)    (22)

for any z^k computed by an algorithm in A_sim. Therefore, for L ∈ L^biaff_R(R^n × R^n), oracle queries on ∂L from algorithms in A_sim are equivalent to matrix multiplication with B.

Let P^n_{R,D} be the collection of linear equations with n × n matrices B satisfying ‖B‖ ≤ R and v = Bz⋆ for some z⋆ ∈ B(0; D). Let P^{n,skew}_{R,D} ⊂ P^n_{R,D} be the subclass with skew-symmetric B. Each problem instance within L^biaff_R(R^n × R^n) can be identified with a problem instance in P^{n,skew}_{R,D}. In other words, we can effectively write L^biaff_R(R^n × R^n) ⊂ P^{n,skew}_{R,D}.

Write A_lin for the class of algorithms solving linear equations Bz = v using matrix-vector products with B and B^⊤. Precisely, A ∈ A_lin produces an auxiliary sequence

  v^k = A(v^0, …, v^{k−1}; B) = Bv^j or B^⊤v^j for some j = 0, …, k−1

for k ≥ 1 with v^0 = v, and the k-th approximate solution satisfies

  z^k = A(v^0, …, v^{k−1}; B) ∈ span{v^0, …, v^{k−1}}.    (23)
Figure 1. Plots of ‖∂L(z^k)‖² versus iteration count for (a) the two-dimensional example L_δ,ε of (24) and (b) the Lagrangian of the linearly constrained QP of (25), comparing EG, Popov, EAG-V, EAG-C, and SimGD-A. Dashed lines indicate corresponding theoretical upper bounds.

The optimal complexity lower bound for a class of linear equation instances is

  C(A_lin; P^n_{R,D}, k) := inf_{A ∈ A_lin} sup_{‖B‖ ≤ R, v = Bz⋆, ‖z⋆‖ ≤ D} ‖Bz^k − v‖².

Define C(A_lin; P^{n,skew}_{R,D}, k) analogously. When B is skew-symmetric, (23) reduces to (22), so A_sim and A_lin are effectively the same class of algorithms. Since L^biaff_R(R^n × R^n) ⊂ P^{n,skew}_{R,D} effectively, and since the supremum over a larger class of problems is larger, inequality (19) holds.

Inequality (20) follows from P^{n,skew}_{R,D} ⊂ P^n_{R,D} and the fact that the supremum over a larger class of problems is larger. Finally, (21) follows from the following lemma, using arguments based on Chebyshev-type matrix polynomials from Nemirovsky (1992). Its proof is deferred to the appendix.

Lemma 4.
Let
R > 0 and k ≥ 1. Then there exists A ∈ A_lin such that for any m ≥ 1, B ∈ R^{m×m}, and v = Bz⋆ satisfying ‖B‖ ≤ R and ‖z⋆‖ ≤ D, the z^k-iterate produced by A satisfies

  ‖Bz^k − v‖² ≤ R²D²/(2⌊k/2⌋ + 1)².

In (10) and (11), we assumed the subgradient queries are made within the span of the gradients at the previous iterates. This span condition can be removed using the resisting oracle technique (Nemirovsky & Yudin, 1983) at the cost of slightly enlarging the required problem dimension. We informally state the generalized result below and provide details in the appendix.
Theorem 4 (Informal). Let n ≥ k + 2. For any gradient-based deterministic algorithm, there exists an R-smooth biaffine function L on R^n × R^n such that (12) holds.

Although we do not formally pursue this, the requirement that the algorithm is not randomized can also be removed using the techniques of Woodworth & Srebro (2016), which exploit near-orthogonality of random vectors in high dimensions.
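The mechanism behind Lemma 4 can be illustrated numerically: for any algorithm whose iterates lie in the Krylov subspace span{v, Bv, …, B^{k−1}v}, the best achievable residual ‖Bz^k − v‖ is a polynomial-approximation quantity. The sketch below is our own illustration (not the Chebyshev construction of the appendix); it computes that best-in-subspace residual for a random skew-symmetric B.

```python
import numpy as np

rng = np.random.default_rng(2)
m, k, D = 40, 8, 1.0
S = rng.standard_normal((m, m))
B = S - S.T                                   # skew-symmetric
B *= 1.0 / np.linalg.norm(B, 2)               # ||B|| <= R = 1
z_star = rng.standard_normal(m)
z_star *= D / np.linalg.norm(z_star)          # ||z_star|| <= D
v = B @ z_star

# Krylov basis {v, Bv, ..., B^{k-1} v} and the best residual over its span.
K = np.empty((m, k))
K[:, 0] = v
for j in range(1, k):
    K[:, j] = B @ K[:, j - 1]
coef, *_ = np.linalg.lstsq(B @ K, v, rcond=None)
print(np.linalg.norm(B @ (K @ coef) - v) ** 2)   # compare against R^2 D^2 / (2*(k//2)+1)^2
```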
We established that one cannot improve the lower bound of Theorem 3 using biaffine functions, arguably the simplest family of convex-concave functions. Furthermore, this optimality statement holds for both algorithm classes A_sep and A_sim, as established through the chain of inequalities in Section 3.2. However, as demonstrated by Drori (2017), who introduced a non-quadratic lower bound for smooth convex minimization that improves upon the classical quadratic lower bounds of Nemirovsky (1992) and Nesterov (2013), a non-biaffine construction may improve the constant. In our setup, there remains a constant-factor difference between the upper and lower bounds. (Note that each EAG iteration requires 2 evaluations of the saddle subdifferential oracle.) We suspect that both the algorithm and the lower bound can be improved upon, but we leave this to future work.
4. Experiments
We now present experiments illustrating the accelerated rate of EAG. We compare EAG-C and EAG-V against the prior algorithms with convergence guarantees: EG, Popov's algorithm (or optimistic descent), and simultaneous gradient descent with anchoring (SimGD-A). The precise forms of the algorithms are restated in the appendix.

Figure 1(a) presents experiments on our first example, constructed as follows. For ε > 0, define

  f_ε(u) = ε|u| − ε²/2   if |u| ≥ ε,
           u²/2          if |u| < ε.
Figure 2. Comparison of the discrete trajectories and their corresponding continuous-time flows: (a) discrete trajectories with L_δ,ε; (b) the Moreau–Yosida regularized flow and the anchored flow with L(x, y) = xy. Trajectories from EAG-C and SimGD-A virtually coincide and resemble the anchored flow. However, SimGD-A progresses slower due to its diminishing step-sizes.
Next, for 0 < ε ≪ δ ≪ 1, define

  L_δ,ε(x, y) = (1 − δ)f_ε(x) + δxy − (1 − δ)f_ε(y),    (24)

where x, y ∈ R. Since f_ε is a 1-smooth convex function, L_δ,ε has smoothness parameter 1, which is almost tight due to the quadratic behavior of L_δ,ε within the region |x|, |y| ≤ ε. This construction was inspired by Drori & Teboulle (2014), who presented f_ε as the worst-case instance for gradient descent. We choose step-sizes comfortably within the theoretical range of convergent parameters for EG, EAG-C, and Popov, and likewise for α_0 in EAG-V; we take ε ≪ δ ≪ 1 and a nonzero initial point z^0.

Figure 1(b) presents experiments on our second example,

  L(x, y) = (1/2)x^⊤Hx − h^⊤x − ⟨Ax − b, y⟩,    (25)

where x, y ∈ R^n, A ∈ R^{n×n}, b ∈ R^n, H ∈ R^{n×n} is positive semidefinite, and h ∈ R^n. Note that this is the Lagrangian of a linearly constrained quadratic minimization problem. We adopted this saddle function from Ouyang & Xu (2021), where the authors constructed H, h, A, and b to provide a lower bound on the duality gap. The exact forms of H, h, A, and b are restated in the appendix. We use n = 200, step-sizes within the respective theoretical ranges for EG, Popov, EAG-C, and EAG-V, and the initial point z^0 = 0.
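A minimal reproduction of the flavor of the first experiment is sketched below. It is our own script with arbitrarily chosen δ, ε, step-sizes, and iteration budget, since the exact experimental values are not reproduced here; it runs EG and EAG-V on L_δ,ε and prints the squared gradient norm of the last iterate.

```python
import numpy as np

delta, eps = 1e-2, 5e-5                       # illustrative values with eps << delta << 1

def huber_grad(u):
    # derivative of f_eps: u for |u| < eps, eps*sign(u) otherwise
    return u if abs(u) < eps else eps * np.sign(u)

def saddle_grad(z):                           # dL = (dL/dx, -dL/dy) for L_{delta,eps}
    x, y = z
    gx = (1 - delta) * huber_grad(x) + delta * y
    gy = delta * x - (1 - delta) * huber_grad(y)
    return np.array([gx, -gy])

def run(method, z0, iters, alpha0):
    z, anchor, alpha = z0.copy(), z0.copy(), alpha0
    for k in range(iters):
        beta = 1.0 / (k + 2) if method == "EAG-V" else 0.0   # beta = 0 recovers EG
        z_half = z + beta * (anchor - z) - alpha * saddle_grad(z)
        z = z + beta * (anchor - z) - alpha * saddle_grad(z_half)
        if method == "EAG-V":                 # step-size recurrence (5) with R = 1
            a2 = alpha ** 2
            alpha *= 1 - a2 / ((k + 1) * (k + 3) * (1 - a2))
    return np.linalg.norm(saddle_grad(z)) ** 2

z0 = np.array([1.0, 1.0]) / np.sqrt(2)
print("EG   :", run("EG", z0, 1000, 0.25))
print("EAG-V:", run("EAG-V", z0, 1000, 0.618))
```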
ODE Interpretation

Figure 2(a) illustrates the algorithms applied to (24). For |x|, |y| ≫ ε,

  ∂L_δ,ε(x, y) = ( (1−δ)ε·sign(x) + δy, (1−δ)ε·sign(y) − δx ) ≈ δ(y, −x),

so the algorithms roughly behave as if the objective is the bilinear function δxy. When δ is sufficiently small, trajectories of the algorithms closely resemble the corresponding continuous-time flows with L(x, y) = xy.

Csetnek et al. (2019) demonstrated that Popov's algorithm can be viewed as a discretization of the Moreau–Yosida regularized flow ż(t) = −∂L_λ(z(t)), where ∂L_λ := (1/λ)(Id − (Id + λ∂L)^{−1}), for some λ > 0, and a similar analysis can be performed with EG. This connection explains why EG's trajectory in Figure 2(a) and the regularized flow depicted in Figure 2(b) are similar. On the other hand, EAG and SimGD-A can be viewed as discretizations of the anchored flow ODE

  ż(t) = −∂L(z(t)) + (1/t)(z^0 − z(t)).

The anchored flow depicted in Figure 2(b) approaches the solution much more quickly due to the anchoring term dampening the cycling behavior. The trajectories of the EAG and SimGD-A iterates in Figure 2(a) are very similar to the anchored flow. However, SimGD-A requires diminishing step-sizes of the form (1−p)/(k+1)^p (both theoretically and experimentally) and therefore progresses much slower.
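The qualitative difference between the two flows is easy to see numerically. The sketch below (our own forward-Euler integration for the bilinear saddle L(x, y) = xy) integrates the plain flow ż = −∂L(z) and the anchored flow ż = −∂L(z) + (z^0 − z)/t; the former keeps cycling around the origin while the latter spirals in.

```python
import numpy as np

def dL(z):                                    # saddle subdifferential of L(x, y) = x*y
    return np.array([z[1], -z[0]])

def integrate(anchored, z0, T=50.0, h=1e-3):
    z, t = np.asarray(z0, float).copy(), h
    while t < T:                              # forward Euler
        drift = -dL(z) + anchored * (z0 - z) / t
        z, t = z + h * drift, t + h
    return z

z0 = np.array([1.0, 1.0])
print("plain flow   :", integrate(anchored=0, z0=z0))  # rotates around the origin without converging
print("anchored flow:", integrate(anchored=1, z0=z0))  # contracts toward the saddle point (0, 0)
```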
5. Conclusion
This work presents the extra anchored gradient (EAG) algorithms, which exhibit accelerated O(1/k²) rates on the squared gradient magnitude for smooth convex-concave minimax problems. The acceleration combines the extragradient and anchoring mechanisms, which separately achieve O(1/k) or slower rates. We complement the O(1/k²) rate with a matching Ω(1/k²) complexity lower bound, thereby establishing optimality of EAG.

At a superficial level, the acceleration mechanism of EAG seems to be distinct from that of Nesterov; anchoring dampens oscillations, but momentum provides the opposite effect of dampening. However, are the two acceleration phenomena entirely unrelated? Finding a common structure, a connection, between the two acceleration phenomena would be an interesting direction of future work.
References
Alkousa, M., Gasnikov, A., Dvinskikh, D., Kovalev, D.,and Stonyakin, F. Accelerated methods for saddle-pointproblem.
Computational Mathematics and MathematicalPhysics , 60(11):1787–1809, 2020.Antonakopoulos, K., Belmega, E. V., and Mertikopoulos, P.Adaptive extra-gradient methods for min-max optimiza-tion and games.
ICLR , 2021.Azizian, W., Mitliagkas, I., Lacoste-Julien, S., and Gidel, G.A tight and unified analysis of gradient-based methodsfor a whole spectrum of differentiable games.
AISTATS ,2020.Censor, Y., Gibali, A., and Reich, S. The subgradient ex-tragradient method for solving variational inequalitiesin Hilbert space.
Journal of Optimization Theory andApplications , 148(2):318–335, 2011.Chambolle, A. and Pock, T. A first-order primal-dual algo-rithm for convex problems with applications to imaging.
Journal of Mathematical Imaging and Vision , 40(1):120–145, 2011.Chambolle, A. and Pock, T. On the ergodic convergencerates of a first-order primal–dual algorithm.
MathematicalProgramming , 159(1-2):253–287, 2016.Chen, Y., Lan, G., and Ouyang, Y. Optimal primal-dualmethods for a class of saddle point problems.
SIAMJournal on Optimization , 24(4):1779–1814, 2014.Chen, Y., Lan, G., and Ouyang, Y. Accelerated schemesfor a class of variational inequalities.
Mathematical Pro-gramming , 165(1):113–149, 2017.Chen, Z., Zhou, Y., Xu, T., and Liang, Y. Proximal gradientdescent-ascent: Variable convergence under KŁ geometry.
ICLR , 2021.Chiang, C.-K., Yang, T., Lee, C.-J., Mahdavi, M., Lu, C.-J.,Jin, R., and Zhu, S. Online optimization with gradualvariations.
COLT , 2012.Condat, L. A primal–dual splitting method for convex opti-mization involving Lipschitzian, proximable and linearcomposite terms.
Journal of Optimization Theory andApplications , 158(2):460–479, 2013.Csetnek, E. R., Malitsky, Y., and Tam, M. K. ShadowDouglas–Rachford splitting for monotone inclusions.
Ap-plied Mathematics & Optimization , 80(3):665–678, 2019.Daskalakis, C., Deckelbaum, A., and Kim, A. Near-optimalno-regret algorithms for zero-sum games.
SODA , 2011. Drori, Y. The exact information-based complexity of smoothconvex minimization.
Journal of Complexity , 39:1–16,2017.Drori, Y. and Teboulle, M. Performance of first-order meth-ods for smooth convex minimization: A novel approach.
Mathematical Programming , 145(1-2):451–482, 2014.Ghadimi, S. and Lan, G. Optimal stochastic approxima-tion algorithms for strongly convex stochastic compositeoptimization I: A generic algorithmic framework.
SIAMJournal on Optimization , 22(4):1469–1492, 2012.Ghadimi, S. and Lan, G. Optimal stochastic approxima-tion algorithms for strongly convex stochastic compositeoptimization, II: Shrinking procedures and optimal algo-rithms.
SIAM Journal on Optimization , 23(4):2061–2089,2013.Gidel, G., Berard, H., Vignoud, G., Vincent, P., and Lacoste-Julien, S. A variational inequality perspective on genera-tive adversarial networks.
ICLR , 2018.Gidel, G., Hemmat, R. A., Pezeshki, M., Le Priol, R., Huang,G., Lacoste-Julien, S., and Mitliagkas, I. Negative mo-mentum for improved game dynamics.
AISTATS , 2019.Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B.,Warde-Farley, D., Ozair, S., Courville, A., and Bengio, Y.Generative adversarial nets.
NeurIPS , 2014.Goodfellow, I. J., Shlens, J., and Szegedy, C. Explainingand harnessing adversarial examples.
ICLR , 2015.Hamedani, E. Y. and Aybat, N. S. A primal-dual algorithmfor general convex-concave saddle point problems. arXivpreprint arXiv:1803.01401 , 2018.He, Y. and Monteiro, R. D. An accelerated HPE-type al-gorithm for a class of composite convex-concave saddle-point problems.
SIAM Journal on Optimization , 26(1):29–56, 2016.Hsieh, Y.-G., Iutzeler, F., Malick, J., and Mertikopoulos,P. On the convergence of single-call stochastic extra-gradient methods.
NeurIPS , 2019.Jin, C., Netrapalli, P., and Jordan, M. I. Minmax optimiza-tion: Stable limit points of gradient descent ascent arelocally optimal. arXiv preprint arXiv:1902.00618 , 2019.Juditsky, A., Nemirovski, A., and Tauvel, C. Solving varia-tional inequalities with stochastic mirror-prox algorithm.
Stochastic Systems , 1(1):17–58, 2011.Kolossoski, O. and Monteiro, R. D. An accelerated non-euclidean hybrid proximal extragradient-type algorithmfor convex–concave saddle-point problems.
Optimization Methods and Software, 32(6):1244–1272, 2017.
Korpelevich, G. Extragradient method for finding saddlepoints and other problems.
Matekon , 13(4):35–49, 1977.Lan, G. An optimal method for stochastic composite op-timization.
Mathematical Programming , 133(1-2):365–397, 2012.Liang, T. and Stokes, J. Interaction matters: A note on non-asymptotic local convergence of generative adversarialnetworks.
AISTATS , 2019.Lin, T., Jin, C., and Jordan, M. On gradient descent ascentfor nonconvex-concave minimax problems.
ICML , 2020a.Lin, T., Jin, C., Jordan, M., et al. Near-optimal algorithmsfor minimax optimization.
COLT , 2020b.Lu, S., Tsaknakis, I., Hong, M., and Chen, Y. Hybridblock successive approximation for one-sided non-convexmin-max problems: Algorithms and applications.
IEEETransactions on Signal Processing , 68:3676–3691, 2020.Lyashko, S., Semenov, V., and Voitova, T. Low-cost modi-fication of Korpelevich’s methods for monotone equilib-rium problems.
Cybernetics and Systems Analysis , 47(4):631, 2011.Madry, A., Makelov, A., Schmidt, L., Tsipras, D., andVladu, A. Towards deep learning models resistant toadversarial attacks.
ICLR , 2018.Malitsky, Y. Projected reflected gradient methods for mono-tone variational inequalities.
SIAM Journal on Optimiza-tion , 25(1):502–520, 2015.Malitsky, Y. Golden ratio algorithms for variational inequal-ities.
Mathematical Programming , 184:383–410, 2020.Malitsky, Y. V. and Semenov, V. An extragradient algorithmfor monotone variational inequalities.
Cybernetics andSystems Analysis , 50(2):271–277, 2014.Mason, J. C. and Handscomb, D. C.
Chebyshev Polynomials .2002.Mertikopoulos, P., Zenati, H., Lecouat, B., Foo, C.-S., Chan-drasekhar, V., and Piliouras, G. Optimistic mirror descentin saddle-point problems: Going the extra (gradient) mile.
ICLR , 2019.Mokhtari, A., Ozdaglar, A., and Pattathil, S. A unifiedanalysis of extra-gradient and optimistic gradient meth-ods for saddle point problems: Proximal point approach.
AISTATS , 2020a.Mokhtari, A., Ozdaglar, A. E., and Pattathil, S. Convergencerate of O (1 /k ) for optimistic gradient and extragradientmethods in smooth convex-concave saddle point prob-lems. SIAM Journal on Optimization , 30(4):3230–3251,2020b. Monteiro, R. D. and Svaiter, B. F. On the complexity of thehybrid proximal extragradient method for the iterates andthe ergodic mean.
SIAM Journal on Optimization , 20(6):2755–2787, 2010.Monteiro, R. D. and Svaiter, B. F. Complexity of variantsof Tseng’s modified FB splitting and Korpelevich’s meth-ods for hemivariational inequalities with applications tosaddle-point and convex optimization problems.
SIAMJournal on Optimization , 21(4):1688–1720, 2011.Nedi´c, A. and Ozdaglar, A. Subgradient methods for saddle-point problems.
Journal of Optimization Theory andApplications , 142(1):205–228, 2009.Nemirovski, A. Prox-method with rate of convergence O (1 /t ) for variational inequalities with Lipschitz contin-uous monotone operators and smooth convex-concavesaddle point problems. SIAM Journal on Optimization ,15(1):229–251, 2004.Nemirovski, A., Juditsky, A., Lan, G., and Shapiro, A. Ro-bust stochastic approximation approach to stochastic pro-gramming.
SIAM Journal on Optimization , 19(4):1574–1609, 2009.Nemirovsky, A. S. On optimality of Krylov’s informationwhen solving linear operator equations.
Journal of Com-plexity , 7(2):121–130, 1991.Nemirovsky, A. S. Information-based complexity of linearoperator equations.
Journal of Complexity , 8(2):153–175,1992.Nemirovsky, A. S. and Yudin, D. B.
Problem Complexityand Method Efficiency in Optimization . 1983.Nesterov, Y. Excessive gap technique in nonsmooth convexminimization.
SIAM Journal on Optimization , 16(1):235–249, 2005a.Nesterov, Y. Smooth minimization of non-smooth functions.
Mathematical Programming , 103(1):127–152, 2005b.Nesterov, Y. Dual extrapolation and its applications to solv-ing variational inequalities and related problems.
Mathe-matical Programming , 109(2-3):319–344, 2007.Nesterov, Y. Primal-dual subgradient methods for convexproblems.
Mathematical Programming , 120(1):221–259,2009.Nesterov, Y.
Introductory Lectures on Convex Optimization:A Basic Course . 2013.Nesterov, Y. and Scrimali, L. Solving strongly monotonevariational and quasi-variational inequalities.
Discrete & Continuous Dynamical Systems – A, 31(4):1383–1396, 2011.
Noor, M. A. New extragradient-type methods for generalvariational inequalities.
Journal of Mathematical Analysisand Applications , 277(2):379–394, 2003.Nouiehed, M., Sanjabi, M., Huang, T., Lee, J. D., and Raza-viyayn, M. Solving a class of non-convex min-max gamesusing iterative first order methods.
NeurIPS , 2019.Ostrovskii, D. M., Lowy, A., and Razaviyayn, M. Effi-cient search of first-order Nash equilibria in nonconvex-concave smooth min-max problems. arXiv preprintarXiv:2002.07919 , 2020.Ouyang, Y. and Xu, Y. Lower complexity bounds of first-order methods for convex-concave bilinear saddle-pointproblems.
Mathematical Programming , 185:1–35, 2021.Peng, W., Dai, Y.-H., Zhang, H., and Cheng, L. Train-ing GANs with centripetal acceleration.
OptimizationMethods and Software , 35(5):955–973, 2020.Popov, L. D. A modification of the Arrow–Hurwicz methodfor search of saddle points.
Mathematical Notes of theAcademy of Sciences of the USSR , 28(5):845–848, 1980.Rafique, H., Liu, M., Lin, Q., and Yang, T. Non-convex min-max optimization: Provable algorithms and applicationsin machine learning. arXiv preprint arXiv:1810.02060 ,2018.Rakhlin, A. and Sridharan, K. Online learning with pre-dictable sequences.
COLT , 2013a.Rakhlin, S. and Sridharan, K. Optimization, learning, andgames with predictable sequences.
NeurIPS , 2013b.Rockafellar, R. T. Monotone operators associated withsaddle-functions and minimax problems.
Nonlinear Func-tional Analysis , 18(part 1):397–407, 1970.Ryu, E. K. and Yin, W.
Large-Scale Convex Optimizationvia Monotone Operators . Draft, 2021.Ryu, E. K., Yuan, K., and Yin, W. ODE analysis ofstochastic gradient methods with optimism and anchor-ing for minimax problems and GANs. arXiv preprintarXiv:1905.10899 , 2019.Solodov, M. V. and Svaiter, B. F. A hybrid approximateextragradient–proximal point algorithm using the enlarge-ment of a maximal monotone operator.
Set-Valued Analy-sis , 7(4):323–345, 1999.Syrgkanis, V., Agarwal, A., Luo, H., and Schapire, R. E.Fast convergence of regularized learning in games.
NeurIPS , 2015.Taylor, A. and Bach, F. Stochastic first-order methods: Non-asymptotic and computer-aided analyses via potentialfunctions.
COLT , 2019. Taylor, A. B., Hendrickx, J. M., and Glineur, F. Smoothstrongly convex interpolation and exact worst-case per-formance of first-order methods.
Mathematical Program-ming , 161(1-2):307–345, 2017.Thekumparampil, K. K., Jain, P., Netrapalli, P., and Oh, S.Efficient algorithms for smooth minimax optimization.
NeurIPS , 2019.Tseng, P. On linear convergence of iterative methods for thevariational inequality problem.
Journal of Computationaland Applied Mathematics , 60(1-2):237–252, 1995.Tseng, P. A modified forward-backward splitting method formaximal monotone mappings.
SIAM Journal on Controland Optimization , 38(2):431–446, 2000.V˜u, B. C. A splitting algorithm for dual monotone inclusionsinvolving cocoercive operators.
Advances in Computa-tional Mathematics , 38(3):667–681, 2013.Wang, Y. and Li, J. Improved algorithms for convex-concaveminimax optimization.
NeurIPS , 2020.Woodworth, B. and Srebro, N. Tight complexity bounds foroptimizing composite objectives.
NeurIPS , 2016.Yadav, A., Shah, S., Xu, Z., Jacobs, D., and Goldstein,T. Stabilizing adversarial nets with prediction methods.
ICLR , 2018.Yan, M. A new primal–dual algorithm for minimizing thesum of three functions with a linear operator.
Journal ofScientific Computing , 76(3):1698–1717, 2018.Yang, J., Zhang, S., Kiyavash, N., and He, N. A catalystframework for minimax optimization.
NeurIPS, 2020.
Zhang, G., Bao, X., Lessard, L., and Grosse, R. A unified analysis of first-order methods for smooth games via integral quadratic constraints. arXiv preprint arXiv:2009.11359, 2020.
Zhang, J., Hong, M., and Zhang, S. On lower iteration complexity bounds for the saddle point problems. arXiv preprint arXiv:1912.07481, 2019.
Zhao, R. Optimal stochastic algorithms for convex-concave saddle-point problems. arXiv preprint arXiv:1903.01687, 2019.
A. Algorithm specifications
For the sake of clarity, we precisely specify all the algorithms discussed in this work.
Simultaneous gradient descent for smooth minimax optimization is defined as

  x^{k+1} = x^k − α∇_x L(x^k, y^k)
  y^{k+1} = y^k + α∇_y L(x^k, y^k).

The notation becomes more concise with the joint variable notation z^k = (x^k, y^k) and the saddle subdifferential operator (2), where the sign change in the y-gradient is already included:

  z^{k+1} = z^k − α∂L(z^k).

Alternating gradient descent-ascent is defined as

  x^{k+1} = x^k − α∇_x L(x^k, y^k)
  y^{k+1} = y^k + α∇_y L(x^{k+1}, y^k).

Note that we update the x-variable first and then use it to update the y-iterate.

The extragradient (EG) algorithm is defined as

  z^{k+1/2} = z^k − α∂L(z^k),
  z^{k+1} = z^k − α∂L(z^{k+1/2}).

Popov's algorithm, or optimistic descent, is defined as

  z^{k+1} = z^k − α∂L(z^k) − α(∂L(z^k) − ∂L(z^{k−1})).

Simultaneous gradient descent with anchoring (SimGD-A) (Ryu et al., 2019) is defined as

  z^{k+1} = z^k − ((1−p)/(k+1)^p)∂L(z^k) + ((1−p)γ/(k+1))(z^0 − z^k),

where p ∈ (1/2, 1) and γ > 0. It has been proved in Ryu et al. (2019) that SimGD-A converges in terms of the squared gradient norm, with a rate that depends on p. In this paper, we always used γ = 1 and a fixed value of p within (1/2, 1).
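For reference, the specifications above translate into the following minimal sketch (our own code, with saddle_grad again denoting the saddle subdifferential ∂L stacked as one vector and all step-size choices left to the caller).

```python
import numpy as np

def extragradient(saddle_grad, z0, alpha, iters):
    z = np.asarray(z0, float).copy()
    for _ in range(iters):
        z_half = z - alpha * saddle_grad(z)
        z = z - alpha * saddle_grad(z_half)
    return z

def popov(saddle_grad, z0, alpha, iters):
    z = np.asarray(z0, float).copy()
    g_prev = saddle_grad(z)
    for _ in range(iters):
        g = saddle_grad(z)
        z = z - alpha * g - alpha * (g - g_prev)   # optimistic / Popov update
        g_prev = g
    return z

def simgd_anchored(saddle_grad, z0, p, gamma, iters):
    z0 = np.asarray(z0, float)
    z = z0.copy()
    for k in range(iters):
        step = (1 - p) / (k + 1) ** p              # diminishing step-size
        anchor = (1 - p) * gamma / (k + 1)         # anchoring coefficient
        z = z - step * saddle_grad(z) + anchor * (z0 - z)
    return z
```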
B. Omitted proofs of Section 2

The following identities follow directly from the definition of the EAG iterates:

  z^k − z^{k+1} = β_k(z^k − z^0) + α_k∂L(z^{k+1/2}),    (26)
  z^{k+1/2} − z^{k+1} = α_k(∂L(z^{k+1/2}) − ∂L(z^k)),    (27)
  z^0 − z^{k+1} = (1 − β_k)(z^0 − z^k) + α_k∂L(z^{k+1/2}).    (28)

B.1. Proof of Lemma 2
Recall that ∂L is a monotone operator, so that

  0 ≤ ⟨z^k − z^{k+1}, ∂L(z^k) − ∂L(z^{k+1})⟩.
Therefore, V k − V k +1 ≥ V k − V k +1 − b k β k (cid:10) z k − z k +1 , ∂ L ( z k ) − ∂ L ( z k +1 ) (cid:11) = a k (cid:13)(cid:13) ∂ L ( z k ) (cid:13)(cid:13) + b k (cid:10) ∂ L ( z k ) , z k − z (cid:11) − a k +1 (cid:13)(cid:13) ∂ L ( z k +1 ) (cid:13)(cid:13) − b k +1 (cid:10) ∂ L ( z k +1 ) , z k +1 − z (cid:11) − b k β k (cid:10) z k − z k +1 , ∂ L ( z k ) − ∂ L ( z k +1 ) (cid:11) (a) = a k (cid:13)(cid:13) ∂ L ( z k ) (cid:13)(cid:13) + b k (cid:10) ∂ L ( z k ) , z k − z (cid:11) − a k +1 (cid:13)(cid:13) ∂ L ( z k +1 ) (cid:13)(cid:13) + b k +1 (cid:68) ∂ L ( z k +1 ) , (1 − β k )( z − z k ) + α k ∂ L ( z k +1 / ) (cid:69) − b k (cid:10) z k − z , ∂ L ( z k ) − ∂ L ( z k +1 ) (cid:11) − α k b k β k (cid:68) ∂ L ( z k +1 / ) , ∂ L ( z k ) − ∂ L ( z k +1 ) (cid:69) (b) = a k (cid:13)(cid:13) ∂ L ( z k ) (cid:13)(cid:13) − a k +1 (cid:13)(cid:13) ∂ L ( z k +1 ) (cid:13)(cid:13) + α k b k +1 (cid:68) ∂ L ( z k +1 ) , ∂ L ( z k +1 / ) (cid:69) − α k b k β k (cid:68) ∂ L ( z k +1 / ) , ∂ L ( z k ) − ∂ L ( z k +1 ) (cid:69) , (29)where (a) follows from (26) and (28), and (b) results from cancellation and collection of terms using (7). Next, we have ≤ R (cid:13)(cid:13) z k +1 / − z k +1 (cid:13)(cid:13) − (cid:13)(cid:13) ∂ L ( z k +1 / ) − ∂ L ( z k +1 ) (cid:13)(cid:13) = α k R (cid:13)(cid:13) ∂ L ( z k ) − ∂ L ( z k +1 / ) (cid:13)(cid:13) − (cid:13)(cid:13) ∂ L ( z k +1 / ) − ∂ L ( z k +1 ) (cid:13)(cid:13) (30)from R -Lipschitzness of ∂ L and (27). Now multiplying the factor a k α k R to (30) and subtracting from (29) gives V k − V k +1 ≥ a k (cid:13)(cid:13) ∂ L ( z k ) (cid:13)(cid:13) − a k +1 (cid:13)(cid:13) ∂ L ( z k +1 ) (cid:13)(cid:13) + α k b k +1 (cid:10) ∂ L ( z k +1 ) , ∂ L ( z k +1 / ) (cid:11) − α k b k β k (cid:68) ∂ L ( z k +1 / ) , ∂ L ( z k ) − ∂ L ( z k +1 ) (cid:69) − a k (cid:13)(cid:13)(cid:13) ∂ L ( z k ) − ∂ L ( z k +1 / ) (cid:13)(cid:13)(cid:13) + a k α k R (cid:13)(cid:13)(cid:13) ∂ L ( z k +1 / ) − ∂ L ( z k +1 ) (cid:13)(cid:13)(cid:13) = a k (1 − α k R ) α k R (cid:13)(cid:13)(cid:13) ∂ L ( z k +1 / ) (cid:13)(cid:13)(cid:13) + (cid:18) a k α k R − a k +1 (cid:19) (cid:13)(cid:13) ∂ L ( z k +1 ) (cid:13)(cid:13) + (cid:18) a k − α k b k β k (cid:19) (cid:68) ∂ L ( z k ) , ∂ L ( z k +1 / ) (cid:69) + (cid:18) α k b k +1 + α k b k β k − a k α k R (cid:19) (cid:68) ∂ L ( z k +1 / ) , ∂ L ( z k +1 ) (cid:69) . (31)Observe that the (cid:10) ∂ L ( z k ) , ∂ L ( z k +1 / ) (cid:11) term vanishes because of (6), and that α k b k +1 + α k b k β k = α k (cid:18) b k − β k + b k β k (cid:19) = α k b k β k (1 − β k ) = 2 a k − β k . Furthermore, by (8), we have a k +1 = α k +1 b k +1 β k +1 = α k β k +1 (1 − α k R − β k )(1 − α k R ) β k (1 − β k ) b k β k +1 (1 − β k ) = a k (1 − α k R − β k )(1 − α k R )(1 − β k ) . ccelerated O (1 /k ) Rate for Smooth Convex-Concave Minimax Problems on Squared Gradient Norm
Plugging these identities into (31) and simplifying, we get V k − V k +1 ≥ a k (1 − α k R ) α k R (cid:13)(cid:13)(cid:13) ∂ L ( z k +1 / ) (cid:13)(cid:13)(cid:13) + a k (1 − α k R − β k ) α k R (1 − α k R )(1 − β k ) (cid:13)(cid:13) ∂ L ( z k +1 ) (cid:13)(cid:13) − a k (1 − α k R − β k ) α k R (1 − β k ) (cid:68) ∂ L ( z k +1 / ) , ∂ L ( z k +1 ) (cid:69) ≥ , where the last inequality is an application of Young’s inequality. B.2. Proof of Lemma 1
We may assume R = 1 without loss of generality because we can recover the general case by replacing α_k with α_kR. Rewrite (5) as

  α_k − α_{k+1} = α_k³/((k+1)(k+3)(1 − α_k²)).    (32)

Suppose that we have already established 0 < α_N < ρ for some N ≥ 0 and ρ ∈ (0, 1), where ρ satisfies

  γ := (1/2)(1/(N+1) + 1/(N+2)) · ρ²/(1 − ρ²) < 1.    (33)

Note that (33) holds true for all N ≥ 0 if ρ < 3/4. Now we will show that given (33),

  α_N > α_{N+1} > ⋯ > α_{N+k} > (1 − γ)α_N

for all k ≥ 1, so that α_k ↓ α_∞ for some α_∞ ≥ (1 − γ)α_N > 0. It suffices to prove that (1 − γ)α_N < α_{N+k} < ρ for all k ≥ 0, because it is clear from (32) that {α_k}_{k≥0} is decreasing.

We use induction on k to prove that α_{N+k} ∈ ((1 − γ)α_N, ρ). The case k = 0 is trivial. Now suppose that (1 − γ)α_N < α_{N+j} < ρ holds true for all j = 0, …, k. Then by (32), for each 0 ≤ j ≤ k we have

  0 < α_{N+j} − α_{N+j+1} = (1/((N+j+1)(N+j+3))) · α_{N+j}³/(1 − α_{N+j}²)
                          < (1/((N+j+1)(N+j+3))) · ρ²α_N/(1 − ρ²).

Summing up the inequalities for j = 0, …, k, we obtain

  0 < α_N − α_{N+k+1} < Σ_{j=0}^{k} (1/((N+j+1)(N+j+3))) · ρ²α_N/(1 − ρ²)
                      < (ρ²α_N/(1 − ρ²)) Σ_{j=0}^{∞} 1/((N+j+1)(N+j+3))
                      = (ρ²α_N/(1 − ρ²)) · (1/2)(1/(N+1) + 1/(N+2)) = γα_N,

which gives (1 − γ)α_N < α_{N+k+1} < α_N < ρ, completing the induction.

In particular, when α_0 = 0.618, direct calculation shows that α_N has essentially reached its limit by N = 1000; applying the argument above with this N and a correspondingly small ρ makes γ negligible, which gives α_∞ ≥ (1 − γ)α_N ≈ 0.437.
B.3. Proof of Theorem 1
As in the proof of Theorem 2, assume without loss of generality that R = 1 . The strategy of the proof is basically thesame as in Theorem 2; we construct a nonincreasing Lyapunov function by combining the same set of inequalities, but withdifferent (more intricate) coefficients. For k ≥ , let V k = a k (cid:13)(cid:13) ∂ L ( z k ) (cid:13)(cid:13) + b k (cid:10) ∂ L ( z k ) , z k − z (cid:11) . As in Lemma 2, we will use b k = − β k = k + 1 , and a k ≥ will be specified later. Because we have the fixed step-size α ,the identities (26), (27), and (28) become z k +1 / − z k +1 = α (cid:16) ∂ L ( z k +1 / ) − ∂ L ( z k ) (cid:17) z k − z k +1 = 1 k + 2 ( z k − z ) + α ∂ L ( z k +1 / ) z k +1 − z = k + 1 k + 2 ( z k − z ) − α ∂ L ( z k +1 / ) . Now, subtracting the same inequalities from monotonicity and Lipschitzness from V k − V k +1 as in Lemma 2, each withcoefficients ( k + 1)( k + 2) and τ k ≥ (to be specified later), we obtain V k − V k +1 ≥ V k − V k +1 − ( k + 1)( k + 2) (cid:10) z k − z k +1 , ∂ L ( z k ) − ∂ L ( z k +1 ) (cid:11) − τ k (cid:18)(cid:13)(cid:13)(cid:13) z k +1 / − z k +1 (cid:13)(cid:13)(cid:13) − (cid:13)(cid:13)(cid:13) ∂ L ( z k +1 / ) − ∂ L ( z k +1 ) (cid:13)(cid:13)(cid:13) (cid:19) = ( a k − α τ k ) (cid:13)(cid:13) ∂ L ( z k ) (cid:13)(cid:13) + τ k (1 − α ) (cid:13)(cid:13)(cid:13) ∂ L ( z k +1 / ) (cid:13)(cid:13)(cid:13) + ( τ k − a k +1 ) (cid:13)(cid:13) ∂ L ( z k +1 ) (cid:13)(cid:13) + (cid:0) α τ k − α ( k + 1)( k + 2) (cid:1) (cid:68) ∂ L ( z k ) , ∂ L ( z k +1 / ) (cid:69) + (cid:0) α ( k + 2) − τ k (cid:1) (cid:68) ∂ L ( z k +1 / ) , ∂ L ( z k +1 ) (cid:69) = Tr ( M k S k M (cid:124) k ) , where we define M k := (cid:2) ∂ L ( z k ) ∂ L ( z k +1 / ) ∂ L ( z k +1 ) (cid:3) and S k := a k − α τ k α τ k − α ( k + 1)( k + 2) 0 α τ k − α ( k + 1)( k + 2) τ k (1 − α ) α ( k + 2) − τ k α ( k + 2) − τ k τ k − a k +1 . (34)If S k (cid:23) O , then Tr ( M k S k M (cid:124) k ) = Tr ( S k M (cid:124) k M k ) ≥ because the positive semidefinite cone is self-dual with respect tothe matrix inner product (cid:104) A, B (cid:105) = Tr( A (cid:124) B ) . Because b k = k + 1 grows linearly, provided that the sequence { a k } shouldgrows quadratically, we can derive O (1 /k ) convergence by using similar line of arguments as in the proof of Theorem 2.This reduction of the proof into a search of appropriate parameters (i.e., τ k ) that meet semidefiniteness constraints ( S k (cid:23) O in our case) while allowing for desired rate of growth in Lyapunov function coefficients ( a k in our case) was inspired byworks of Taylor et al. (2017) and Taylor & Bach (2019). In the following, we demonstrate that careful choices of a and τ k make a k asymptotically close to α ( k +1)( k +2)2 , so quadratic growth is guaranteed. We begin with the following lemma,which will be used throughout the proof. Lemma 5.
Let k ∈ N ≥ and α ∈ (cid:0) , (cid:3) be fixed, and define (cid:96) k := α ( k + 2)( k + 1 + kα )2(1 + α ) , u k := α ( k + 2)( k + 1 − kα )2(1 − α ) . ccelerated O (1 /k ) Rate for Smooth Convex-Concave Minimax Problems on Squared Gradient Norm
Then, u k > α ( k + 1)( k + 2)2 > (cid:96) k (35) ≥ α ( k + 1)( k + 1 + α ( k + 2))2(1 + α ) (36) ≥ α ( k + 1) − α k ( k + 2)2(1 − α ) (37) ≥ max (cid:26) α ( k + 1)( k + 1 − α ( k + 2))2(1 − α ) , α ( k + 1)( k + 2)1 + α (cid:27) (38) ≥ α ( k + 1)( k + 2) + α ( k + 2) α ) . (39)We shall prove Lemma 5 after the proof of the main theorem and for now, focus on why we need such results. Observethat all the quantities within the lines (35) through (37) are asymptotically close to αk . We show that a k ∈ I k := [ (cid:96) k , u k ] for all k ≥ , which implies the quadratic growth. The quantities in Lemma 5 are used for choosing the right τ k and forshowing the positive semidefiniteness of S k .Subdivide the interval I k into two parts: I − k = (cid:20) (cid:96) k , α ( k + 1)( k + 2)2 (cid:21) , I + k = (cid:20) α ( k + 1)( k + 2)2 , u k (cid:21) . We divide cases: a k ∈ I − k and a k ∈ I + k . However, the latter case is in fact not needed unless we wish to extend the prooffor α beyond . R . If that is not the case, we recommend the readers to refer to Case 1 only. Nevertheless, we exhibitanalysis of both cases because Case 2 might provide useful data for enlarging or even completely determining the range ofconvergent step-sizes for EAG-C. Case 1.
Suppose that a k ∈ I − k . In this case, we choose τ k = ( k + 2) (2(1 − α ) a k − α ( k + 1)( k + 1 − α ( k + 2)))2 ( α ( k + 2)( k + 1 − kα ) − − α ) a k ) . (40)The denominator and numerator of (40) are both positive because u k > a k > α ( k +1)( k +1 − α ( k +2))2(1 − α ) (see (38)). Thus, τ k > .Next, define a k +1 as a k +1 = α ( k + 2) (cid:0) − α ) a k − α ( k + 1 − α ( k + 2)) (cid:1) − α ) ((1 − α ) a k + α ( k + 1)( k + 2))= α ( k + 2) − α (cid:18) − α ( k + 1 + α ( k + 2)) − α ) a k + α ( k + 1)( k + 2)) (cid:19) . (41)Then (34) can be rewritten as S k = s s s s s s s , where s = ( α ( k + 1)( k + 2) − a k ) (cid:0) − α ) a k + α ( k + 1)( k + 2) − α ( k + 2) (cid:1) α ( k + 2)( k + 1 − kα ) − − α ) a k ) , (42) s = − α (1 − α )( k + 2)( k + 1 + α ( k + 2))( α ( k + 1)( k + 2) − a k )2 ( α ( k + 2)( k + 1 − kα ) − − α ) a k ) , (43) ccelerated O (1 /k ) Rate for Smooth Convex-Concave Minimax Problems on Squared Gradient Norm s = (1 − α )( k + 2) (2(1 − α ) a k − α ( k + 1)( k + 1 − α ( k + 2)))2 ( α ( k + 2)( k + 1 − kα ) − − α ) a k ) , (44) s = − ( k + 2) (cid:0) − α ) a k − α ( k + 1) + α k ( k + 2) (cid:1) α ( k + 2)( k + 1 − kα ) − − α ) a k ) , (45) s = ( k + 2) (cid:0) − α ) a k − α ( k + 1) + α k ( k + 2) (cid:1) (cid:0) − α ) a k + α ( k + 1)( k + 2) − α ( k + 2) (cid:1) − α ) ( α ( k + 2)( k + 1 − kα ) − − α ) a k ) ((1 − α ) a k + α ( k + 1)( k + 2)) . (46)The expressions seem ridiculously complicated, but there are a number of repeating terms. Let E = α ( k + 2)( k + 1 − kα ) − − α ) a k ,E = α ( k + 1)( k + 2) − a k . Because a k ≤ α ( k +1)( k +2)2 < u k (see (35)), we have E > , E ≥ . (Note that E = 0 only in the boundary case a k = sup I − k .) Next, put E = 2(1 − α ) a k − α ( k + 1)( k + 1 − α ( k + 2)) , which is a factor that appears within the definition of τ k (40); we have already seen that E > . Further, let E = 2(1 − α ) a k + α ( k + 1)( k + 2) − α ( k + 2) ,E = (1 − α ) a k + α ( k + 1)( k + 2) ,E = 2(1 − α ) a k − α ( k + 1) + α k ( k + 2) ,E = k + 1 + α ( k + 2) . It is obvious that E , E > , and E > follows directly from (37). To see that E > , observe that k + 1 − α ( k + 2) =( k + 2) (cid:16) k +1 k +2 − α (cid:17) ≥ ( k + 2) (cid:0) − α (cid:1) ≥ , provided that α ≤ . This implies E = 2(1 − α ) a k + α ( k + 2) ( k + 1 − ( k + 2) α ) > . Now we can rewrite (42) through (46) as s = E E E s = − α (1 − α )( k + 2) E E E s = (1 − α )( k + 2) E E s = − ( k + 2) E E s = ( k + 2) E E − α ) E E . This immediately shows that the diagonal entries s ii are nonnegative for i = 1 , , . By brute-force calculation, it is notdifficult to verify the identity (1 + α ) E E = α (1 − α ) E E + 2 E E . Using this, we see that v := (cid:104) α ( k +2) E E E − α ) E (cid:105) (cid:124) satisfies S k v = 0 , and this implies det S k = 0 . The cofactor-expansion of det S k along the first row gives S k = s (cid:12)(cid:12)(cid:12)(cid:12) s s s s (cid:12)(cid:12)(cid:12)(cid:12) − s (cid:12)(cid:12)(cid:12)(cid:12) s s s (cid:12)(cid:12)(cid:12)(cid:12) ⇐⇒ (cid:12)(cid:12)(cid:12)(cid:12) s s s s (cid:12)(cid:12)(cid:12)(cid:12) = s s s > ccelerated O (1 /k ) Rate for Smooth Convex-Concave Minimax Problems on Squared Gradient Norm when s > , and via continuity argument we can argue that (cid:12)(cid:12)(cid:12)(cid:12) s s s s (cid:12)(cid:12)(cid:12)(cid:12) ≥ even in the boundary case s = 0 . Similarlyone can show that (cid:12)(cid:12)(cid:12)(cid:12) s s s s (cid:12)(cid:12)(cid:12)(cid:12) ≥ . 
Therefore, we have shown that all diagonal submatrices of S k (including the trivial case (cid:12)(cid:12)(cid:12)(cid:12) s s (cid:12)(cid:12)(cid:12)(cid:12) = s s ≥ ) have nonnegative determinants, that is, S k (cid:23) O .Finally, (41) shows that a k +1 is increasing with respect to a k . We see that a k +1 (cid:12)(cid:12)(cid:12) a k = α ( k +1)( k +2)2 = α ( k + 2)(( k + 1)( k + 3) − α ( k + 2) )2(1 − α )( k + 1) < α ( k + 2)( k + 3)2 (47)and a k +1 | a k = (cid:96) k − (cid:96) k +1 = α (cid:0) (1 − α − α − α ) k + 1 − α + α − α (cid:1) − α ) ((1 + α ) k + 1 + α + 2 α ) , and the last expression is nonnegative because of the assumption (4), which we restate here for the case R = 1 forconvenience: − α − α − α ≥ and − α + α − α ≥ . This proves that a k +1 ∈ I − k +1 ⊂ I k +1 , as desired. Case 2.
Suppose that a k ∈ I + k . The proof would be similar to Case 1, but choices of τ k and a k +1 are different. We let τ k = ( k + 2) (2(1 + α ) a k − α ( k + 1)( k + 1 + α ( k + 2)))4(1 + α ) a k − α ( k + 2)( k + 1 + kα ) . (48)Since a k > (cid:96) k > α ( k +1)( k +1+ α ( k +2))2(1+ α ) , the denominator and numerator of (48) are both positive and thus τ k > . Next, let a k +1 = α ( k + 2) (cid:0) α ) a k − α ( k + 1 + α ( k + 2)) (cid:1) α ) ((1 + α ) a k − α ( k + 1)( k + 2))= α ( k + 2) α (cid:18) − α ( k + 1 − α ( k + 2)) α ) a k − α ( k + 1)( k + 2)) (cid:19) . (49)Then we can check that s = (2 a k − α ( k + 1)( k + 2))(2(1 + α ) a k − α ( k + 1)( k + 2) − α ( k + 2) )4(1 + α ) a k − α ( k + 2)( k + 1 + kα ) ,s = ( k + 2) (cid:0) α ) a k − α ( k + 1)( k + 2) − α ( k + 2) (cid:1) (cid:0) − α ) a k − α ( k + 1) + α k ( k + 2) (cid:1) α ) (2(1 + α ) a k − α ( k + 2)( k + 1 + kα )) (2(1 + α ) a k − α ( k + 1)( k + 2)) , and so on. (Note that a k − α ( k + 1)( k + 2) ≥ because now we are assuming that a k ∈ I + k .) We omit further detailsof calculations, but with the above choices of τ k and a k +1 it can be shown that det S k = 0 and s , s ≥ , using (36)through (39). As in Case 1, this implies S k (cid:23) O .The identity (49) shows that a k +1 is increasing with respect to a k . Interestingly, although (41) and (49) have distinct forms,for the boundary value a k = α ( k +1)( k +2)2 , they evaluate to the same expression (47) and thus arguments from Case 1 readilyshow that a k +1 > (cid:96) k +1 . On the other hand, we have u k +1 − a k +1 | a k = u k = α (cid:0)(cid:0) α − α + α (cid:1) k + 1 + 8 α + α + 2 α (cid:1) − α ) ((1 − α ) k + 1 − α + 2 α ) and the last term is positive for any α ∈ (0 , , i.e., a k +1 < u k +1 . This completes Case 2. Proof of the theorem statement.
Given that a k ∈ I − k implies a k +1 ∈ I − k +1 (which has been proved in Case 1), the rest iseasy. If we take a = (cid:96) = α α , then because S k (cid:23) O for all k ≥ , we see that V k is nonincreasing: α α (cid:107) z − z (cid:63) (cid:107) ≥ α α (cid:13)(cid:13) ∂ L ( z ) (cid:13)(cid:13) = V ≥ · · · ≥ V k = a k (cid:13)(cid:13) ∂ L ( z k ) (cid:13)(cid:13) + ( k + 1) (cid:10) z k − z , ∂ L ( z k ) (cid:11) , ccelerated O (1 /k ) Rate for Smooth Convex-Concave Minimax Problems on Squared Gradient Norm where the first inequality follows from Lipschitzness of ∂ L (recall that we are assuming that R = 1 ). Also by (35) and (36), a k ≥ (cid:96) k > α ( k + 1)( k + 1 + α ( k + 2))2(1 + α ) = α ( k + 1)2 (1 + α )( k + 1) + α α > α ( k + 1) . (50)Hence, we obtain α α (cid:107) z − z (cid:63) (cid:107) ≥ V k ≥ (cid:96) k (cid:13)(cid:13) ∂ L ( z k ) (cid:13)(cid:13) + ( k + 1) (cid:10) z k − z , ∂ L ( z k ) (cid:11) (a) ≥ α ( k + 1) (cid:13)(cid:13) ∂ L ( z k ) (cid:13)(cid:13) + ( k + 1) (cid:10) z (cid:63) − z , ∂ L ( z k ) (cid:11) (b) ≥ α ( k + 1) (cid:13)(cid:13) ∂ L ( z k ) (cid:13)(cid:13) − ( k + 1) (cid:18) α ( k + 1) (cid:107) z (cid:63) − z (cid:107) + α ( k + 1)4 (cid:13)(cid:13) ∂ L ( z k ) (cid:13)(cid:13) (cid:19) , where (a) follows from (50) and the monotonicity inequality (cid:104) z k − z (cid:63) , ∂ L ( z k ) (cid:105) ≥ , and (b) follows from Young’s inequality.Rearranging terms, we conclude that (cid:13)(cid:13) ∂ L ( z k ) (cid:13)(cid:13) ≤ α ( k + 1) (cid:18) α α + 1 α (cid:19) (cid:107) z − z (cid:63) (cid:107) = C (cid:107) z − z (cid:63) (cid:107) ( k + 1) , where C = α + α ) α (1+ α ) . Proof of Lemma 5.
Direct calculation gives u k − α ( k + 1)( k + 2)2 = α ( k + 2)2(1 − α ) > α ( k + 1)( k + 2)2 − (cid:96) k = α ( k + 2)2(1 + α ) > , showing (35). Next, (cid:96) k − α ( k + 1)( k + 1 + α ( k + 2))2(1 + α ) = α ( k + 1 − α ( k + 2))2(1 + α ) ≥ because k + 1 − α ( k + 2) = ( k + 2)( k +1 k +2 − α ) ≥ ( k + 2)( − α ) ≥ , which shows (36). Similarly, we observe that α ( k + 1)( k + 1 + α ( k + 2))2(1 + α ) − α ( k + 1) − α k ( k + 2)2(1 − α ) = α ( k + 1 − α ( k + 2))2(1 − α ) ≥ α ( k + 1) − α k ( k + 2)2(1 − α ) − α ( k + 1)( k + 1 − α ( k + 2))2(1 − α ) = α ( k + 1 + α ( k + 2))2(1 − α ) > α ( k + 1) − α k ( k + 2)2(1 − α ) − α ( k + 1)( k + 2)1 + α = α ( k + 1 − α ( k + 2)) − α ) ≥ α ( k + 1)( k + 2)1 + α − α ( k + 1)( k + 2) + α ( k + 2) α ) = α ( k + 2)( k + 1 − α ( k + 2))2(1 + α ) ≥ , and each line corresponds to an inequality within (37), (38) and (39). C. Omitted proofs of Section 3
In this section, we provide a self-contained discussion on the complexity lower bound results for linear operator equationsfrom Nemirovsky (1991; 1992).
C.1. Proof of Theorem 3
The proof of Theorem 3 was essentially completed in the main body of the paper, except the argument regarding translation,(13), and the proof of Lemma 3. ccelerated O (1 /k ) Rate for Smooth Convex-Concave Minimax Problems on Squared Gradient Norm
We first provide the precise meaning of the translation invariance that we are to prove. Given a saddle function L and z ∈ R n × R n , let z (cid:63) L ( z ) be the saddle point of L nearest to z . For any z ∈ R n × R n , k ≥ and D > , define T (cid:0) z ; k, D (cid:1) := (cid:40) z k (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) L ( x, y ) = (cid:104) Ax − b, y − c (cid:105) , A ∈ R n × n , b, c ∈ R n , (cid:13)(cid:13) z (cid:63) L (cid:0) z (cid:1) − z (cid:13)(cid:13) ≤ D,z j = A ( z , . . . , z j − ; L ) , j = 1 , . . . , k, A ∈ A sep (cid:41) . We will show that T (cid:0) z ; k, D (cid:1) = z + T (0; k, D ) holds for any z ∈ R n × R n .Let z = ( x , y ) and L ( x, y ) = (cid:104) Ax − b, y − c (cid:105) be given, and assume that (cid:107) z (cid:63) L ( z ) − z (cid:107) ≤ D . Let b = b − Ax and c = c − y . Then ∇ x L ( x , y ) = A (cid:124) ( y − c ) = − A (cid:124) c , ∇ y L ( x , y ) = Ax − b = − b . Hence, (11) with k = 1 reads as x − x ∈ span { A (cid:124) c } ∆ = X ( A ; b , c ) ,y − y ∈ span { b } ∆ = Y ( A ; b , c ) . This further shows that ∇ x L ( x , y ) = A (cid:124) ( y − c ) = A (cid:124) ( y − y ) − A (cid:124) c ∈ span { A (cid:124) b , A (cid:124) c } , ∇ y L ( x , y ) = Ax − b = A ( x − x ) − b ∈ span { A ( A (cid:124) c ) , b } , and (11) with k = 2 becomes x − x ∈ span { A (cid:124) c , A (cid:124) b } ∆ = X ( A ; b , c ) ,y − y ∈ span { b , AA (cid:124) c } ∆ = Y ( A ; b , c ) . As one can see, we have x k − x ∈ X k ( A ; b , c ) and y k − y ∈ Y k ( A ; b , c ) , where we inductively define X k +1 ( A ; b , c ) = span { A (cid:124) c } + A (cid:124) Y k ( A ; b , c ) , Y k +1 ( A ; b , c ) = span { b } + A X k ( A ; b , c ) . Then it is not difficult to see that for k ≥ , X k ( A ; b , c ) = span (cid:110) A (cid:124) c , A (cid:124) ( AA (cid:124) ) c , . . . , A (cid:124) ( AA (cid:124) ) (cid:98) k − (cid:99) c (cid:111) + span (cid:110) A (cid:124) b , A (cid:124) ( AA (cid:124) ) b , . . . , A (cid:124) ( AA (cid:124) ) (cid:98) k (cid:99)− b (cid:111) , Y k ( A ; b , c ) = span (cid:110) b , ( AA (cid:124) ) b , . . . , ( AA (cid:124) ) (cid:98) k − (cid:99) b (cid:111) + span (cid:110) AA (cid:124) c , . . . , ( AA (cid:124) ) (cid:98) k (cid:99) c (cid:111) . Now consider L ( x, y ) := (cid:104) Ax − b , y − c (cid:105) = (cid:10) A ( x + x ) − b, y + y − c (cid:11) . Because z (cid:63) L is a saddle point of L if andonly if z (cid:63) L + z is a saddle point of L , we have z (cid:63) L (0) = z (cid:63) L ( z ) − z , and thus (cid:107) z (cid:63) L (0) (cid:107) ≤ D . Therefore, if we let S ( A ; D ) ∆ = (cid:26) (˜ b, ˜ c ) ∈ R n × R n (cid:12)(cid:12)(cid:12)(cid:12) (cid:13)(cid:13) z (cid:63) ˜ L (0) (cid:13)(cid:13) ≤ D, where ˜ L ( x, y ) = (cid:104) Ax − ˜ b, y − ˜ c (cid:105) (cid:27) , then T (cid:0) z ; k, D (cid:1) = (cid:91) A ∈ R n × n ( b ,c ) ∈S ( A ; D ) z + ( X k ( A ; b , c ) × Y k ( A ; b , c )) . This proves that the translation invariance holds with T (0; k, D ) = (cid:83) A ∈ R n × n ( b ,c ) ∈S ( A ; D ) ( X k ( A ; b , c ) × Y k ( A ; b , c )) and inparticular, shows (13). ccelerated O (1 /k ) Rate for Smooth Convex-Concave Minimax Problems on Squared Gradient Norm
C.2. Complexity of solving linear operator equations and minimax polynomials
We first make some general observations. Suppose that we are given a symmetric matrix A ∈ R n × n , b ∈ R n , and an integer k ≥ . Then any x ∈ K k − ( A ; b ) = span { b, Ab, . . . , A k − b } can be expressed in the form x = q ( A ) b, where q ( t ) = q + q t + · · · + q k − t k − , for some q , . . . , q k − ∈ R . Then we can write b − Ax = b − Aq ( A ) b = ( I − Aq ( A )) b = p ( A ) b, (51)where p ( t ) = 1 − tq ( t ) is a polynomial of degree at most k satisfying p (0) = 1 . Note that conversely, given any polynomial ˜ p ( t ) with degree ≤ k and constant term , one can decompose it as ˜ p ( t ) = 1 − t ˜ q ( t ) and recover a polynomial ˜ q of degree ≤ k − corresponding to x .Now suppose further there exists x (cid:63) ∈ R n such that b = Ax (cid:63) and (cid:107) x (cid:63) (cid:107) ≤ D . The symmetric matrix A has an orthonormaleigenbasis v , . . . , v n , corresponding to eigenvalues λ , . . . , λ n , so we can write x (cid:63) = c v + · · · + c n v n for some c , . . . , c n ∈ R . Using (51), we obtain (cid:107) Ax − b (cid:107) = (cid:107) p ( A ) Ax (cid:63) (cid:107) = (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) n (cid:88) j =1 c j Ap ( A ) v j (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) = (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) n (cid:88) j =1 c j λ j p ( λ j ) v j (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) = n (cid:88) j =1 c j λ j p ( λ j ) ≤ D (cid:18) max j =1 ,...,n λ j p ( λ j ) (cid:19) . (52)We define the problem class by (cid:107) A (cid:107) ≤ R , which is equivalent to λ j ∈ [ − R, R ] for all j = 1 , . . . , n . Therefore, we considera method corresponding to a polynomial q ( t ) such that p ( t ) = 1 − tq ( t ) minimizes max λ ∈ [ − R,R ] λ p ( λ ) = (cid:18) max λ ∈ [ − R,R ] | λp ( λ ) | (cid:19) . More precisely, if p (cid:63)k ( t ) = 1 − tq (cid:63)k ( t ) minimizes the last quantity among all p ( t ) such that deg p ≤ k and p (0) = 1 , and ifwe put x k = q (cid:63)k ( A ) b , then (52) implies (cid:13)(cid:13) Ax k − b (cid:13)(cid:13) = n (cid:88) j =1 c j λ j ( p (cid:63)k ( λ j )) ≤ D M (cid:63) ( k, R ) ,M (cid:63) ( k, R ) ∆ = min deg p ≤ kp (0)=1 max λ ∈ [ − R,R ] | λp ( λ ) | , (53)for all A whose spectrum belongs to [ − R, R ] and b = Ax (cid:63) with (cid:107) x (cid:63) (cid:107) ≤ D . As p (cid:63)k solves (53), it is called a minimaxpolynomial .In order to establish Lemma 3, we present a two-fold analysis in the following. First, we compute the quantity (53) byexplicitly naming p (cid:63)k for each k ≥ . (This was given by Nemirovsky (1992), but without a proof.) Then, following theexposition from (Nemirovsky, 1991), we show that there exists an instance of ( A, b ) such that (cid:107) Aq ( A ) b − b (cid:107) ≥ D M (cid:63) ( k, R ) holds for any polynomial q of degree ≤ k − . C.3. Proof of Lemma 3
The solutions to (53) are characterized using the
Chebyshev polynomials of first kind , defined by T N (cos θ ) = cos( N θ ) , N ≥ , ccelerated O (1 /k ) Rate for Smooth Convex-Concave Minimax Problems on Squared Gradient Norm or equivalently by T N ( t ) = cos( N arccos t ) . If N = 2 d for some nonnegative integer d , then T N is an even polynomialsatisfying T N (0) = cos( dπ ) = ( − d . On the other hand, if N = 2 d + 1 , then T N is an odd polynomial of the form T d +1 ( t ) = ( − d (2 d + 1) t + · · · , (54)which can be shown via induction using the recurrence relation T N +1 ( t ) = 2 tT N ( t ) − T N − ( t ) , which follows from thetrigonometric identity cos(( N + 1) θ ) + cos(( N − θ ) = 2 cos( N θ ) cos θ. Based on arguments from (Nemirovsky, 1992; Mason & Handscomb, 2002), we will show that given k ≥ and m := (cid:98) k (cid:99) , p (cid:63)k ( t ) := ( − m m + 1 (cid:18) Rt (cid:19) T m +1 (cid:18) tR (cid:19) solves (53).The Chebyshev polynomials satisfy the equioscillation property which makes them so special: the extrema of T N within [ − , occur at t j = cos ( N − j ) πN for j = 0 , . . . , N , and the signs of the extremal values alternate. Indeed, we have | T N ( t ) = cos( N arccos t ) | ≤ for all t ∈ [ − , , and for each j = 0 , . . . , N , T N ( t j ) = cos (cid:18) N ( N − j ) πN (cid:19) = cos( N − j ) π = ( − N − j . Also, we have T N ( t j ) = − T N ( t j − ) for each j = 1 , . . . , n .Given k ≥ , we denote by P k the collection of all polynomials p of degree ≤ k with p (0) = 1 . Recall that we are tominimize M ( p, R ) := max λ ∈ [ − R,R ] | λ p ( λ ) | (55)over p ∈ P k . If p ∈ P k minimizes (55), then so does p ev ( t ) := p ( t )+ p ( − t )2 , since for all λ ∈ [ − R, R ] | λp ev ( λ ) | = | λ | · (cid:12)(cid:12)(cid:12)(cid:12) p ( λ ) + p ( − λ )2 (cid:12)(cid:12)(cid:12)(cid:12) ≤ | λp ( λ ) | | ( − λ ) p ( − λ ) | ≤ M ( p, R )2 + M ( p, R )2 = M ( p, R ) (56)holds, which implies that M ( p ev , R ) ≤ M ( p, R ) .Observe that p (cid:63)k ∈ P k due to (54). Next, note that λp (cid:63)k ( λ ) = ( − m R m +1 T m +1 ( λR ) has extrema of alternating signs and samemagnitude within [ − R, R ] , which occur precisely at λ j := R cos (2 m +1 − j ) π m +1 , where j = 0 , . . . , m + 1 . Suppose that p (cid:63)k isnot a minimizer of M ( p, R ) over P k , so that there exists p ∈ P k such that | λ j p ( λ j ) | ≤ M ( p, R ) < M ( p (cid:63)k , R ) = | λ j p (cid:63)k ( λ j ) | ( j = 0 , . . . , m + 1) . (57)Due to (56), by replacing p with p ev if necessary, we may assume that p is even and has degree ≤ m . Since λ j (cid:54) = 0 for all j = 0 , . . . , m + 1 , the condition (57) reduces to | p ( λ j ) | < | p (cid:63)k ( λ j ) | .As p and p (cid:63)k are both polynomials of degree ≤ m and constant terms 1, we can write p (cid:63)k ( λ ) − p ( λ ) = λq ( λ ) for some polynomial q of degree ≤ m − . But then | p ( λ j ) | = | p (cid:63)k ( λ j ) − λ j q ( λ j ) | < | p (cid:63)k ( λ j ) | , which implies that p (cid:63)k ( λ j ) and λ j q ( λ j ) have same signs for j = 0 , . . . , m + 1 . Now, because p (cid:63)k ( λ j ) have alternating signs and λ < · · · < λ m < < λ m +1 < · · · < λ m +1 , we see that the signs of q ( λ j ) alternate over j = 0 , . . . , m and over j = m + 1 , . . . , m + 1 , respectively. Therefore, q musthave at least one zero in each open interval ( λ j , λ j +1 ) for j = 0 , . . . , m − , m + 1 , . . . , m . This implies that q ( t ) ≡ since deg q ≤ m − , while q has at least m zeros. Therefore, we arrive at p (cid:63)k = p , which is a contradiction. 
ccelerated O (1 /k ) Rate for Smooth Convex-Concave Minimax Problems on Squared Gradient Norm
We have established that M (cid:63) ( k, R ) = M ( p (cid:63)k , R ) = | λ j p (cid:63)k ( λ j ) | = R m + 1 = R (cid:98) k/ (cid:99) + 1 ( j = 0 , . . . , m + 1) . (58)Furthermore, the above arguments show that the minimization of (55) over p ∈ P k is in fact the same as the minimization of max j =0 ,..., m +1 | λ j p ( λ j ) | = max λ ∈ Λ | λp ( λ ) | , Λ := { λ , λ , . . . , λ m +1 } . (59)Note that the trick of replacing p by p ev is still applicable to (59), but only because the set Λ is symmetric with respect tothe origin. Now we can write M (cid:63) ( k, R ) = (cid:18) min p ∈P k max λ ∈ [ − R,R ] | λp ( λ ) | (cid:19) = (cid:18) min p ∈P k max λ ∈ Λ | λp ( λ ) | (cid:19) = min p ∈P k max λ ∈ Λ λ p ( λ ) , (60)and the final problem from the line (60) is equivalent tominimize ν ∈ R , p ∈P k ν subject to λ j p ( λ j ) ≤ ν, j = 0 , . . . , m + 1 . (61)We can identify any p ( t ) = 1 + p t + · · · + p k t k ∈ P k as the vector ( p , . . . , p k ) ∈ R k . Under this identification, (61) is asecond order cone program (as the constraints are convex quadratic in p , . . . , p k ), and Slater’s constraint qualification isclearly satisfied. Hence M (cid:63) ( k, R ) equals the optimal value of the dual problemmaximize µ ∈ R m +2 minimize p ∈P k (cid:80) m +1 j =0 µ j λ j p ( λ j ) subject to (cid:80) m +1 j =0 µ j = 1 ,µ j ≥ . (62)Let µ (cid:63) = ( µ (cid:63) , . . . , µ (cid:63) m +1 ) be the dual optimal solution to (62). Provided that n ≥ k + 2 ≥ m + 2 , we can take standardbasis vectors (with -indexing) e , . . . , e m +1 ∈ R n . Define A by Ae j = λ j e j ( j = 0 , . . . , m + 1) , Av = 0 ( v ⊥ span { e , . . . , e m +1 } ) and let b = Ax (cid:63) , x (cid:63) = D m +1 (cid:88) j =0 (cid:0) µ (cid:63)j (cid:1) / e j so that (cid:107) x (cid:63) (cid:107) = D . For any given x = q ( A ) b with deg q ≤ k − , we use (52) to rewrite (cid:107) Ax − b (cid:107) as (cid:107) Ax − b (cid:107) = D n (cid:88) j =1 µ (cid:63)j λ j (1 − λ j q ( λ j )) = D n (cid:88) j =1 µ (cid:63)j λ j p ( λ j ) , where p ( t ) = 1 − tq ( t ) ∈ P k . But since ( p (cid:63)k , µ (cid:63) ) is the primal-dual solution pair to the problems (61) and (62), p (cid:63)k minimizes (cid:80) m +1 j =0 µ (cid:63)j λ j p ( λ j ) within P k . Therefore, (cid:107) Ax − b (cid:107) = D n (cid:88) j =1 µ (cid:63)j λ j p ( λ j ) ≥ D n (cid:88) j =1 µ (cid:63)j λ j p (cid:63)k ( λ j ) = D M (cid:63) ( k, R ) = R D (cid:98) k/ (cid:99) + 1) , which establishes (14). ccelerated O (1 /k ) Rate for Smooth Convex-Concave Minimax Problems on Squared Gradient Norm
C.4. Proof of Lemma 4
Let k ≥ be a given (fixed) integer. Consider the polynomial p (cid:63)k we defined in the previous section. It is an even polynomialof degree (cid:98) k (cid:99) , and thus p (cid:63)k (cid:0) √ t (cid:1) is a polynomial in t of degree (cid:98) k (cid:99) , whose constant term is p (cid:63)k (0) = 1 . Therefore, we canwrite p (cid:63)k (cid:0) √ t (cid:1) = 1 − tq k ( t ) for some polynomial q k . We will show that z k = q k ( B (cid:124) B ) B (cid:124) v (63)satisfies (cid:107) Bz k − v (cid:107) ≤ R D (cid:98) k/ (cid:99) +1) for any (possibly non-symmetric) B ∈ R m × m and v = Bz (cid:63) satisfying (cid:107) B (cid:107) ≤ R and (cid:107) z (cid:63) (cid:107) ≤ D . The equation (63) defines an algorithm within the class A lin , as q k is of degree (cid:98) k (cid:99) − , so that z k is determinedby (cid:98) k (cid:99) − ≤ k − queries to the matrix multiplication oracle.We proceed via arguments similar to derivations in C.2. First, observe that (cid:13)(cid:13) Bz k − v (cid:13)(cid:13) = (cid:13)(cid:13) Bz k − Bz (cid:63) (cid:13)(cid:13) = ( z k − z (cid:63) ) (cid:124) B (cid:124) B ( z k − z (cid:63) ) = ( z k − z (cid:63) ) (cid:124) | B | ( z k − z (cid:63) ) = (cid:13)(cid:13) | B | z k − | B | z (cid:63) (cid:13)(cid:13) , (64)where | B | is the matrix square root of the positive semidefinite matrix B (cid:124) B . Rewriting (63) in terms of | B | , we obtain z k = q k ( B (cid:124) B ) B (cid:124) Bz (cid:63) = q k (cid:0) | B | (cid:1) | B | z (cid:63) . Plugging the last equation into (64) gives (cid:13)(cid:13) | B | z (cid:63) − | B | z k (cid:13)(cid:13) = (cid:13)(cid:13)(cid:0) I − | B | q k (cid:0) | B | (cid:1)(cid:1) | B | z (cid:63) (cid:13)(cid:13) = (cid:107) p (cid:63)k ( | B | ) | B | z (cid:63) (cid:107) Finally, because | B | is a symmetric matrix whose eigenvalues are within [0 , R ] , we can apply (52) with | B | , z (cid:63) in places of A, x (cid:63) , and use (58) to conclude that (cid:13)(cid:13) | B | z (cid:63) − | B | z k (cid:13)(cid:13) ≤ D (cid:18) max λ ∈ [0 ,R ] λ p (cid:63)k ( λ ) (cid:19) ≤ D (cid:18) max λ ∈ [ − R,R ] λ p (cid:63)k ( λ ) (cid:19) = R D (2 (cid:98) k/ (cid:99) + 1) . C.5. Proof of Theorem 4
We first describe the general class A of algorithms without the linear span assumption. An algorithm A within A is asequence of deterministic functions A , A , . . . , each of which having the form ( z i , z i ) = A i (cid:0) z , O ( z ; L ) , . . . , O ( z i − ; L ); L (cid:1) for i ≥ , where z = ( x , y ) ∈ R n × R m is an initial point and O : ( R n × R m ) × L R ( R n × R m ) → R n × R m is thegradient oracle defined as O (( x, y ); L ) = ( ∇ x L ( x, y ) , ∇ y L ( x, y )) . The sequence { z i } i ≥ are the inquiry points , and { z i } i ≥ are the approximate solutions produced by A . When k ≥ isthe predefined maximum number of iterations, then we assume z k = z k without loss of generality. Similar definitions fordeterministic algorithms have been considered in (Nemirovsky, 1991; Ouyang & Xu, 2021).To clarify, given L ∈ L R ( R n × R m ) , an algorithm A uses only the previous oracle information to choose the nextinquiry point and approximate solution. Therefore, if O ( z i ; L ) = O ( z i ; L ) for all i = 0 , . . . , k − , then the algorithmoutput ( z k , z k ) for the two functions will coincide, even if L (cid:54) = L . In that sense, A is deterministic , black-box , and gradient-based .Now we precisely restate Theorem 4. Theorem 4.
Let k ≥ and n ≥ k + 2 . Let A ∈ A be a deterministic black-box gradient-based algorithm for solvingconvex-concave minimax problems on R n × R n . Then for any initial point z ∈ R n × R n , there exists L ∈ L biaff R ( R n × R n ) with a saddle point z (cid:63) , for which z k , the k -th iterate produced by A , satisfies (cid:107)∇ L ( z k ) (cid:107) ≥ (cid:107) z − z (cid:63) (cid:107) (2 (cid:98) k/ (cid:99) + 1) . ccelerated O (1 /k ) Rate for Smooth Convex-Concave Minimax Problems on Squared Gradient Norm
Proof.
Let z = ( x , y ) ∈ R n × R n be given. Take A and b as in Lemma 3. Denote by x min the minimum norm solutionto Ax = b . Recall the construction of A and b , where R ( A ) = span { e , . . . , e m +1 } ⊥ ker( A ) . Define L ( x , y ) = − b (cid:124) ( x − x ) + ( x − x ) (cid:124) A ( y − y ) − b (cid:124) ( y − y ) . Then ( ∇ x L ( x, y ) , ∇ y L ( x, y )) = (cid:0) A ( y − y ) − b, A ( x − x ) − b (cid:1) , and z + (cid:0) x min , x min (cid:1) is a saddle point of L .We follow the oracle-resisting proof strategy of Nemirovsky (1991), described as follows. For each i = 1 , . . . , k , weinductively define a rotated biaffine function L i ( x , y ) = − b (cid:124) ( x − x ) + ( x − x ) (cid:124) A i ( y − y ) − b (cid:124) ( y − y ) , where A i = U i AU (cid:124) i for an orthogonal matrix U i ∈ R n × n . We will show that U i can be chosen to satisfy U i b = b , O ( z j ; L i ) = O ( z j ; L i − ) (65)for j = 0 , . . . , i − , and x j − x , y j − y ∈ K j − ( A i ; b ) ⊕ U i N i = U i K j − ( A ; b ) ⊕ U i N i (66)for j = 0 , . . . , i , where N i is a subspace of ker( A ) such that dim( N i ) ≤ i . Note that (65) implies that the algorithmiterates ( z j , z j ) for j = 1 , . . . , i do not change when L i − is replaced by L i . Hence, this process sequentially adjusts theobjective function L upon observing an iterate z i to resist the algorithm from optimizing it efficiently. Indeed, if (66) holdswith i = j = k , then x k − x = U k q x ( A ) b + U k v kx y k − y = U k q y ( A ) b + U k v ky for some polynomials q x , q y of degree ≤ k − and v kx , v ky ∈ N i ⊆ ker( A ) . Thus ∇ x L k ( x k , y k ) = A k ( y k − y ) − b = U k AU (cid:124) k (cid:0) U k q y ( A ) b + U k v ky (cid:1) − b = U k ( Aq y ( A ) − I ) b and similarly ∇ y L k ( x k , y k ) = U k ( Aq x ( A ) − I ) b, showing that (cid:107)∇ L k ( z k ) (cid:107) = (cid:107) U k ( Aq y ( A ) − I ) b (cid:107) + (cid:107) U k ( Aq x ( A ) − I ) b (cid:107) ≥ (cid:107) x min (cid:107) (2 (cid:98) k/ (cid:99) + 1) . Then the theorem statement follows from the fact that z (cid:63) = z + ( U k x min , U k x min ) is a saddle point of L k .It remains to provide an inductive scheme for choosing U i . We set U = I (so that A = A ), N = { } , and define K − ( A ; b ) = { } for convenience. Let ≤ i ≤ k , and suppose that we already have an orthogonal matrix U i − and N i − ⊆ ker( A ) for which U i − b = b , dim( N i − ) ≤ i − , and (66) holds with i − (which is vacuously true when i = 1 ). Let ( z i , z i ) = A i (cid:0) z , O ( z ; L i − ) , . . . , O ( z i − ; L i − ) (cid:1) . We want U i (to be defined) to satisfy s ix , s iy ∈ U i ker( A ) while K i − ( A i − ; b ) = K i − ( A i ; b ) . The latter condition issatisfied if U i = Q i U i − for some orthogonal matrix Q i which preserves every element within J i − = K i − ( A i − ; b ) ⊕ U i − N i − , because then it follows that U i b = Q i U i − b = Q i b = b and K i − ( A i ; b ) = U i K i − ( A ; b ) = Q i U i − K i − ( A ; b ) = Q i K i − ( A i − ; b ) = K i − ( A i − ; b ) . Consider the decomposition x i − x = Π K i − ( A i − ; b ) ( x i − x ) + U i − r ix + s ix y i − y = Π K i − ( A i − ; b ) ( y i − y ) + U i − r iy + s iy ccelerated O (1 /k ) Rate for Smooth Convex-Concave Minimax Problems on Squared Gradient Norm where Π denotes the orthogonal projection, r ix , r iy ∈ N i − and s ix , s iy ∈ J ⊥ i − . 
Since dim ker( A ) = n − m − ≥ n − k − and dim ( N i − ) ⊥ ≥ n − (2 i − ≥ n − k + 2 , we have dim (cid:16) ker( A ) ∩ ( N i − ) ⊥ (cid:17) ≥ n − k ≥ , so there exist ˜ s ix , ˜ s iy ∈ ker( A ) ∩ ( N i − ) ⊥ such that (cid:107) ˜ s ix (cid:107) = (cid:107) s ix (cid:107) , (cid:107) ˜ s iy (cid:107) = (cid:107) s iy (cid:107) , and (cid:104) ˜ s ix , ˜ s iy (cid:105) = (cid:104) s ix , s iy (cid:105) . Also, because ker( A ) ⊥ K i − ( A ; b ) , J i − = U i − ( K i − ( A ; b ) + N i − ) ⊥ U i − (cid:16) ker( A ) ∩ ( N i − ) ⊥ (cid:17) . This implies that there exists an orthogonal Q i ∈ R n × n satisfying Q i (cid:12)(cid:12) J i − = Id J i − Q i (cid:0) U i − ˜ s ix (cid:1) = s ix Q i (cid:0) U i − ˜ s iy (cid:1) = s iy . Now let v ix = r ix + ˜ s ix ∈ ker( A ) , v iy = r iy + ˜ s iy ∈ ker( A ) , and U i ∆ = Q i U i − N i ∆ = N i − + span { v ix , v iy } . Then clearly U i b = b , N i ⊆ ker( A ) , and dim N i ≤ i . Next, for each j = 0 , . . . , i − , we have x j − x , y j − y ∈ K j − ( A i − ; b ) ⊕ U i − N i − ⊆ K j − ( A i ; b ) ⊕ U i N i since Q i preserves J i − and N i − ⊆ N i . Moreover, because U i − r ix = Q i U i − r ix = U i r ix and s ix = Q i U i − ˜ s ix = U i ˜ s ix , x i − x = Π K i − ( A i − ; b ) ( x i − x ) + U i ( r ix + ˜ s ix ) ∈ K i − ( A i − ; b ) ⊕ U i N i = K i − ( A i ; b ) ⊕ U i N i and similarly y i − y ∈ K i − ( A i ; b ) ⊕ U i N i . This proves (66).Finally, for j = 0 , . . . , i − , ∇ x L i ( x j , y j ) = A i ( y j − y ) − b = Q i A i − Q (cid:124) i ( y j − y ) − b. But Q (cid:124) i ( y j − y ) = y j − y because y j − y ∈ K j − ( A i − ; b ) ⊕ U i − N i − ⊆ J i − , and A i − ( y j − y ) ∈ A i − K j − ( A i − ; b ) ⊕ A i − U i − N i − = K j ( A i − ; b ) ⊆ J i − , which shows that ∇ x L i ( x j , y j ) = Q i A i − Q (cid:124) i ( y j − y ) − b = A i − ( y j − y ) − b = ∇ x L i − ( x j , y j ) . Arguing analogouslyfor the y -variable gives ∇ y L i ( x j , y j ) = ∇ y L i − ( x j , y j ) , proving (65). This completes the induction step, and hence theproof. D. Experimental details
D.1. Exact forms of the construction from Ouyang & Xu (2021)
Following Ouyang & Xu (2021), we use A = 14 − . . . . . . − − ∈ R n × n , b = 14 ... ∈ R n , h = 14 ... and H = 2 A (cid:124) A . Ouyang & Xu (2021) shows that (cid:107) A (cid:107) ≤ , which implies (cid:107) H (cid:107) ≤ . Therefore (25) is a -smooth saddlefunction. ccelerated O (1 /k ) Rate for Smooth Convex-Concave Minimax Problems on Squared Gradient Norm
D.2. Best-iterate gradient norm bound for EG
In Figure 1, we indicated theoretical upper bounds for EG. To clarify, there is no known last-iterate convergence resultfor EG with respect to (cid:107) ∂ L ( · ) (cid:107) . However, it is straightforward to derive O ( R /k ) best-iterate convergence via standardsummability arguments in weak convergence proofs for EG. Although there is no theoretical guarantee that (cid:107) ∂ L ( z k ) (cid:107) willmonotonically decrease with EG, in our experiments on both examples, they did monotonically decrease (see Figures 1(a),1(b)). Therefore, we safely used the best-iterate bounds to visualize the upper bound for EG in Figure 1. For the sake ofcompleteness, we derive the best-iterate bound below. Lemma 6.
Let L : R n × R m → R be an R -smooth convex-concave saddle function with a saddle point z (cid:63) . Let z ∈ R n × R m and α ∈ (cid:0) , R (cid:1) . Then w = z − α ∂ L ( z ) and z + = z − α ∂ L ( w ) satisfy (cid:107) z − z (cid:63) (cid:107) − (cid:107) z + − z (cid:63) (cid:107) ≥ (1 − α R ) (cid:107) z − w (cid:107) . Proof. (cid:107) z − z (cid:63) (cid:107) − (cid:107) z + − z (cid:63) (cid:107) = (cid:0) (cid:107) z − w (cid:107) + 2 (cid:104) z − w, w − z (cid:63) (cid:105) + (cid:107) w − z (cid:63) (cid:107) (cid:1) − (cid:0) (cid:107) z + − w (cid:107) + 2 (cid:104) z + − w, w − z (cid:63) (cid:105) + (cid:107) w − z (cid:63) (cid:107) (cid:1) = (cid:107) z − w (cid:107) − (cid:107) z + − w (cid:107) + 2 (cid:104) z − z + , w − z (cid:63) (cid:105)≥ (cid:107) z − w (cid:107) − (cid:107) z + − w (cid:107) . The last inequality is just monotonicity: (cid:104) z − z + , w − z (cid:63) (cid:105) = α (cid:104) ∂ L ( w ) , w − z (cid:63) (cid:105) ≥ . Now the conclusion follows from (cid:107) z + − w (cid:107) = (cid:107) ( z − α ∂ L ( w )) − ( z − α ∂ L ( z )) (cid:107) = α (cid:107) ∂ L ( z ) − ∂ L ( w ) (cid:107) ≤ α R (cid:107) z − w (cid:107) , where the last inequality follows from R -Lipschitzness of ∂ L .Now fix an integer k ≥ , and consider the EG iterations z i +1 / = z i − α ∂ L ( z i ) ,z i +1 = z i − α ∂ L ( z i +1 / ) for i = 0 , . . . , k . Applying Lemma 6 with z = z i , w = z i +1 / and z + = z i +1 , we have (cid:107) z i − z (cid:63) (cid:107) − (cid:107) z i +1 − z (cid:63) (cid:107) ≥ (1 − α R ) (cid:107) z i − z i +1 / (cid:107) = (1 − α R ) α (cid:107) ∂ L ( z i ) (cid:107) (67)for i = 0 , . . . , k . Summing up the inequalities (67) for all i = 0 , . . . , k , we obtain (cid:107) z − z (cid:63) (cid:107) − (cid:107) z k +1 − z (cid:63) (cid:107) ≥ (1 − α R ) α k (cid:88) i =0 (cid:107) ∂ L ( z i ) (cid:107) . The left hand side is at most (cid:107) z − z (cid:63) (cid:107) , while the right hand side is lower bounded by (1 − α R ) α ( k + 1) min i =0 ,...,k (cid:107) ∂ L ( z i ) (cid:107) . Therefore we conclude that min i =0 ,...,k (cid:107) ∂ L ( z i ) (cid:107) ≤ C (cid:107) z − z (cid:63) (cid:107) k + 1 where C = α (1 − α R ) . ccelerated O (1 /k ) Rate for Smooth Convex-Concave Minimax Problems on Squared Gradient Norm
D.3. ODE flows for L( x, y ) = xy Interestingly, the continuous-time flows with L ( x, y ) = xy have exact closed-form solutions.Note that ∂ L ( x, y ) = (cid:20) − (cid:21) (cid:20) xy (cid:21) . Therefore, ( ∂ L ) λ ( x, y ) = 1 λ (cid:32)(cid:20) (cid:21) − (cid:20) λ − λ (cid:21) − (cid:33) (cid:20) xy (cid:21) = (cid:20) λ λ λ − λ λ λ (cid:21) (cid:20) xy (cid:21) . The solution to the Moreau–Yosida regularized flow (cid:20) ˙ x ˙ y (cid:21) = (cid:20) − λ λ − λ λ − λ λ (cid:21) (cid:20) xy (cid:21) can be obtained with the matrix exponent. The results are x ( t ) = exp (cid:18) − λ λ t (cid:19) (cid:18) x cos t λ − y sin t λ (cid:19) ,y ( t ) = exp (cid:18) − λ λ t (cid:19) (cid:18) y cos t λ + x sin t λ (cid:19) . The anchored flow ODE for L ( x, y ) = xy is given by x (cid:48) ( t ) = − y ( t ) + 1 t (cid:0) x − x ( t ) (cid:1) ,y (cid:48) ( t ) = x ( t ) + 1 t (cid:0) y − y ( t ) (cid:1) . From the first equation, we have ( tx ( t )) (cid:48) = tx (cid:48) ( t ) + x ( t ) = − ty ( t ) + x , while similar manipulation of the second equationgives ( ty ( t )) (cid:48) = tx ( t ) + y . Therefore, ( tx ( t )) (cid:48)(cid:48) = − ( ty ( t )) (cid:48) = − tx ( t ) − y , ( ty ( t )) (cid:48)(cid:48) = ( tx ( t )) (cid:48) = − ty ( t ) + x , which gives tx ( t ) = c cos t − c sin t − y ,ty ( t ) = c sin t + c cos t + x . Using the initial conditions to determine the coefficients c , c , we obtain x ( t ) = y cos t + x sin t − y t ,y ( t ) = y sin t − x cos t + x0