Improved Algorithms for Convex-Concave Minimax Optimization
Yuanhao Wang and Jian Li
Institute for Interdisciplinary Information Sciences, Tsinghua University [email protected], [email protected]
June 12, 2020
Abstract
This paper studies minimax optimization problems $\min_x \max_y f(x, y)$, where $f(x, y)$ is $m_x$-strongly convex with respect to $x$, $m_y$-strongly concave with respect to $y$ and $(L_x, L_{xy}, L_y)$-smooth. Zhang et al. [2019] provided the following lower bound on the gradient complexity of any first-order method: $\Omega\big(\sqrt{\tfrac{L_x}{m_x} + \tfrac{L_{xy}^2}{m_x m_y} + \tfrac{L_y}{m_y}}\cdot\ln(1/\epsilon)\big)$. This paper proposes a new algorithm with gradient complexity upper bound $\tilde{O}\big(\sqrt{\tfrac{L_x}{m_x} + \tfrac{L\cdot L_{xy}}{m_x m_y} + \tfrac{L_y}{m_y}}\cdot\ln(1/\epsilon)\big)$, where $L = \max\{L_x, L_{xy}, L_y\}$. This improves over the best known upper bound $\tilde{O}\big(\sqrt{\tfrac{L^2}{m_x m_y}}\,\ln(1/\epsilon)\big)$ by Lin et al. [2020]. Our bound achieves a linear convergence rate and a tighter dependency on condition numbers, especially when $L_{xy} \ll L$ (i.e., when the interaction between $x$ and $y$ is weak). Via reduction, our new bound also implies improved bounds for strongly convex-concave and convex-concave minimax optimization problems. When $f$ is quadratic, we can further improve the upper bound, which matches the lower bound up to a small sub-polynomial factor.

Introduction

In this paper, we study the following minimax optimization problem
$$\min_{x \in \mathbb{R}^n} \max_{y \in \mathbb{R}^m} f(x, y). \qquad (1)$$
This problem can be thought of as finding the equilibrium in a zero-sum two-player game, and has been studied extensively in game theory [Von Neumann and Morgenstern, 2007, Basar and Olsder, 1999]. This formulation also arises in many machine learning applications, including adversarial training [Madry et al., 2018, Sinha et al., 2017], prediction and regression problems [Xu et al., 2005, Taskar et al., 2006], reinforcement learning [Du et al., 2017, Dai et al., 2018, Nachum et al., 2019] and generative adversarial networks [Goodfellow et al., 2014, Arjovsky et al., 2017]. Apart from machine learning, minimax optimization has also found applications in imaging [Chambolle and Pock, 2011, Haber and Modersitzki, 2004], control [Hast et al., 2013] and economics [Nagurney, 2013].

We study the fundamental setting where $f$ is smooth, strongly convex w.r.t. $x$ and strongly concave w.r.t. $y$. In particular, we consider the function class $\mathcal{F}(m_x, m_y, L_x, L_{xy}, L_y)$, where $m_x$ is the strong convexity modulus, $m_y$ is the strong concavity modulus, $L_x$ and $L_y$ characterize the smoothness w.r.t. $x$ and $y$ respectively, and $L_{xy}$ characterizes the interaction between $x$ and $y$ (see Definition 2). The reason to consider such a function class is twofold. First, the strongly convex-strongly concave setting is fundamental: via reduction [Lin et al., 2020], an efficient algorithm for this setting implies efficient algorithms for other settings, including strongly convex-concave, convex-concave, and nonconvex-concave settings. Second, Zhang et al. [2019] recently proved a gradient complexity lower bound $\Omega\big(\sqrt{\tfrac{L_x}{m_x} + \tfrac{L_{xy}^2}{m_x m_y} + \tfrac{L_y}{m_y}}\cdot\ln(1/\epsilon)\big)$, which naturally depends on the above parameters.

In this setting, classic algorithms such as Gradient Descent-Ascent and ExtraGradient [Korpelevich, 1976] can achieve linear convergence [Tseng, 1995, Zhang et al., 2019]; however, their dependence on the condition number is far from optimal.

Figure 1: Comparison of the previous upper bound [Lin et al., 2020], the lower bound [Zhang et al., 2019] and the results in this paper when $L_x = L_y$, $m_x < m_y$, ignoring logarithmic factors. The upper bounds and lower bounds are shown as a function of $L_{xy}$ while other parameters are fixed.
Recently, Lin et al. [2020] showed an upper bound of $\tilde{O}\big(\sqrt{L^2/(m_x m_y)}\,\ln(1/\epsilon)\big)$, which has a much tighter dependence on the condition number. In particular, when $L_{xy} > \max\{L_x, L_y\}$, the dependence on the condition number matches the lower bound. However, when $L_{xy} \ll \max\{L_x, L_y\}$, this dependence is no longer tight (see Fig. 1 for an illustration). In particular, we note that when $x$ and $y$ are completely decoupled (i.e., $L_{xy} = 0$), the optimal gradient complexity bound is $\Theta\big(\sqrt{L_x/m_x + L_y/m_y}\cdot\ln(1/\epsilon)\big)$ (the upper bound can be obtained by simply optimizing $x$ and $y$ separately). Moreover, Lin et al.'s result does not enjoy a linear rate, which may be undesirable if a high-precision solution is needed.

In this work, we propose new algorithms to address these two issues. Our contributions can be summarized as follows.

1. For general functions in $\mathcal{F}(m_x, m_y, L_x, L_{xy}, L_y)$, we design an algorithm called Proximal Best Response (Algorithm 4), and prove a convergence rate of
$$\tilde{O}\left(\sqrt{\frac{L_x}{m_x} + \frac{L_{xy}\cdot L}{m_x m_y} + \frac{L_y}{m_y}}\,\ln(1/\epsilon)\right).$$
It achieves linear convergence, and has a better dependence on condition numbers when $L_{xy}$ is small (see Theorem 3 and the red line in Fig. 1).

2. We obtain tighter upper bounds for the strongly convex-concave problem and the general convex-concave problem, by reducing them to the strongly convex-strongly concave problem (see Corollaries 1 and 2).

3. We also study the special case where $f$ is a quadratic function. We propose an algorithm called Recursive Hermitian-Skew-Hermitian Split (RHSS($k$)), and show that it achieves an upper bound of
$$O\left(\sqrt{\frac{L_x}{m_x} + \frac{L_{xy}^2}{m_x m_y} + \frac{L_y}{m_y}}\cdot\left(\frac{L^2}{m_x m_y}\right)^{o(1)}\ln(1/\epsilon)\right).$$
Details can be found in Theorem 4 and Corollary 3. We note that the lower bound by Zhang et al. [2019] holds for quadratic functions as well. Hence, our upper bound matches the gradient complexity lower bound up to a sub-polynomial factor.
Preliminaries
In this work we are interested in strongly convex-strongly concave smooth problems. We first review some standard definitions of strong convexity and smoothness. A function $f: \mathbb{R}^n \to \mathbb{R}^m$ is $L$-Lipschitz if $\forall x, x' \in \mathbb{R}^n$, $\|f(x) - f(x')\| \le L\|x - x'\|$. A function $f: \mathbb{R}^n \to \mathbb{R}^m$ is $L$-smooth if $\nabla f$ is $L$-Lipschitz. A differentiable function $\phi: \mathbb{R}^n \to \mathbb{R}$ is said to be $m$-strongly convex if for any $x, x' \in \mathbb{R}^n$, $\phi(x') \ge \phi(x) + (x' - x)^T\nabla\phi(x) + \frac{m}{2}\|x' - x\|^2$. If $m = 0$, we recover the definition of convexity. If $-\phi$ is $m$-strongly convex, $\phi$ is said to be $m$-strongly concave. For a function $f(x, y)$, if $\forall y$, $f(\cdot, y)$ is strongly convex, and $\forall x$, $f(x, \cdot)$ is strongly concave, then $f$ is said to be strongly convex-strongly concave.

Definition 1.
A differentiable function $f: \mathbb{R}^n \times \mathbb{R}^m \to \mathbb{R}$ is said to be $(L_x, L_{xy}, L_y)$-smooth if
1. For any $y$, $\nabla_x f(\cdot, y)$ is $L_x$-Lipschitz;
2. For any $x$, $\nabla_y f(x, \cdot)$ is $L_y$-Lipschitz;
3. For any $x$, $\nabla_x f(x, \cdot)$ is $L_{xy}$-Lipschitz;
4. For any $y$, $\nabla_y f(\cdot, y)$ is $L_{xy}$-Lipschitz.

In this work, we are interested in functions that are strongly convex-strongly concave and smooth. Specifically, we study the following function class.

Definition 2. The function class $\mathcal{F}(m_x, m_y, L_x, L_{xy}, L_y)$ contains differentiable functions from $\mathbb{R}^n \times \mathbb{R}^m$ to $\mathbb{R}$ such that: 1. $\forall y$, $f(\cdot, y)$ is $m_x$-strongly convex; 2. $\forall x$, $f(x, \cdot)$ is $m_y$-strongly concave; 3. $f$ is $(L_x, L_{xy}, L_y)$-smooth.

In the case where $f(x, y)$ is twice continuously differentiable, denote the Hessian of $f$ at $(x, y)$ by
$$H := \begin{pmatrix} H_{xx} & H_{xy} \\ H_{yx} & H_{yy} \end{pmatrix}.$$
Then $\mathcal{F}(m_x, m_y, L_x, L_{xy}, L_y)$ can be characterized with the Hessian; in particular we require $m_x I \preceq H_{xx} \preceq L_x I$, $m_y I \preceq -H_{yy} \preceq L_y I$ and $\|H_{xy}\| \le L_{xy}$.
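To make the Hessian characterization concrete, the following Python sketch reads off $(m_x, m_y, L_x, L_{xy}, L_y)$ for a quadratic $f(x, y) = \frac{1}{2}x^T A x + x^T B y - \frac{1}{2}y^T C y$; the random instance and the snippet itself are only an illustration, not part of the algorithms in this paper.

```python
import numpy as np

# Minimal sketch: for a quadratic f(x,y) = 0.5 x'Ax + x'By - 0.5 y'Cy, the Hessian
# blocks are H_xx = A, H_xy = B, H_yy = -C, so the class parameters are eigen/singular values.
rng = np.random.default_rng(0)
n, m = 5, 4

def random_spd(d, lo, hi):
    # symmetric matrix with eigenvalues drawn from [lo, hi]
    Q, _ = np.linalg.qr(rng.standard_normal((d, d)))
    return Q @ np.diag(rng.uniform(lo, hi, d)) @ Q.T

A = random_spd(n, 1.0, 10.0)       # H_xx block
C = random_spd(m, 2.0, 8.0)        # -H_yy block
B = rng.standard_normal((n, m))    # H_xy block

m_x, L_x = np.linalg.eigvalsh(A)[[0, -1]]
m_y, L_y = np.linalg.eigvalsh(C)[[0, -1]]
L_xy = np.linalg.norm(B, 2)        # spectral norm of the coupling block

print(f"m_x={m_x:.3f}, L_x={L_x:.3f}, m_y={m_y:.3f}, "
      f"L_y={L_y:.3f}, L_xy={L_xy:.3f}, L={max(L_x, L_xy, L_y):.3f}")
```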
For notational simplicity, we assume that $L_x = L_y$ when considering algorithms and upper bounds. This is without loss of generality, since one can define $g(x, y) := f((L_y/L_x)^{1/4}x, (L_x/L_y)^{1/4}y)$ in order to make the two smoothness constants equal. It is not hard to show that this rescaling does not change $L_x/m_x$, $L_y/m_y$, $L_{xy}$ and $m_x m_y$, and that $L = \max\{L_x, L_{xy}, L_y\}$ does not increase. (Note that this rescaling also does not change the lower bound.) Hence, we can make the following assumption without loss of generality.

Assumption 1. $f \in \mathcal{F}(m_x, m_y, L_x, L_{xy}, L_y)$, and $L_x = L_y$.

The optimal solution of the convex-concave minimax optimization problem $\min_x \max_y f(x, y)$ is the saddle point $(x^*, y^*)$ defined as follows.

Definition 3. $(x^*, y^*)$ is a saddle point of $f: \mathbb{R}^n \times \mathbb{R}^m \to \mathbb{R}$ if $\forall x \in \mathbb{R}^n, y \in \mathbb{R}^m$,
$$f(x, y^*) \ge f(x^*, y^*) \ge f(x^*, y).$$

For strongly convex-strongly concave functions, it is well known that such a saddle point exists and is unique. Meanwhile, the saddle point is a stationary point, i.e. $\nabla f(x^*, y^*) = 0$, and $x^*$ is the minimizer of $\phi(x) := \max_y f(x, y)$. For the design of numerical algorithms, we are satisfied with a close enough approximation of the saddle point, called an $\epsilon$-saddle point.

Definition 4. $(\hat{x}, \hat{y})$ is an $\epsilon$-saddle point of $f$ if
$$\max_y f(\hat{x}, y) - \min_x f(x, \hat{y}) \le \epsilon.$$

Alternatively, we can also characterize optimality with the distance to the saddle point. In particular, let $z^* := [x^*; y^*]$ and $\hat{z} := [\hat{x}; \hat{y}]$; then one may require $\|\hat{z} - z^*\| \le \epsilon$. This implies that $\max_y f(\hat{x}, y) - \min_x f(x, \hat{y}) \le \frac{L^2}{\min\{m_x, m_y\}}\epsilon^2$ (see Fact 4 in the appendix for a proof). Thus, it is sufficient to find a point close enough to the saddle point.

In this work we focus on first-order methods, that is, algorithms that only access $f$ through gradient evaluations. The complexity of algorithms is measured through the gradient complexity: the number of gradient evaluations required to find an $\epsilon$-saddle point (or to get to $\|\hat{z} - z^*\| \le \epsilon$).

Accelerated Gradient Descent

Nesterov's Accelerated Gradient Descent [Nesterov, 1983] is an optimal first-order algorithm for smooth and convex functions. Here we present a version of AGD for minimizing an $l$-smooth and $m$-strongly convex function $g(\cdot)$. It is a crucial building block for the algorithms in this work.
Algorithm AGD($g$, $x_0$, $T$) [Nesterov, 2013]
Require: Initial point $x_0$, smoothness constant $l$, strong convexity modulus $m$, number of iterations $T$
  $\tilde{x}_0 \leftarrow x_0$, $\eta \leftarrow 1/l$, $\kappa \leftarrow l/m$, $\theta \leftarrow (\sqrt{\kappa} - 1)/(\sqrt{\kappa} + 1)$
  for $t = 1, \cdots, T$ do
    $x_t \leftarrow \tilde{x}_{t-1} - \eta\nabla g(\tilde{x}_{t-1})$
    $\tilde{x}_t \leftarrow x_t + \theta(x_t - x_{t-1})$
  end for

The following convergence theorem holds for AGD. It implies that the complexity is $O(\sqrt{\kappa}\ln(1/\epsilon))$, which greatly improves over the $O(\kappa\ln(1/\epsilon))$ bound for gradient descent.

Lemma 1 (Nesterov [2013, Theorem 2.2.3]). In the AGD algorithm,
$$\|x_T - x^*\|^2 \le (\kappa + 1)\|x_0 - x^*\|^2\cdot\left(1 - \frac{1}{\sqrt{\kappa}}\right)^T.$$
There has been a long line of work on the convex-concave saddle point problem. Apart from GDA and ExtraGradient [Korpelevich, 1976, Tseng, 1995, Nemirovski, 2004, Gidel et al., 2019], other algorithms with theoretical guarantees include OGDA [Rakhlin and Sridharan, 2013, Daskalakis et al., 2018, Mokhtari et al., 2019, Azizian et al., 2019], Hamiltonian Gradient Descent [Abernethy et al., 2019] and Consensus Optimization [Mescheder et al., 2017, Abernethy et al., 2019, Azizian et al., 2019]. For the convex-concave case and the strongly convex-concave case, lower bounds have been provided by Ouyang and Xu [2019]. Some authors have studied variance reduction algorithms for minimax optimization [Carmon et al., 2019, Palaniappan and Bach, 2016], which is beyond the scope of this work. One special case of convex-concave functions is the so-called bilinear case, where $f(x, y) = h(x) + x^T A y - g(y)$. It has been studied by Chambolle and Pock [2011], Chen et al. [2014], Ouyang and Xu [2019] and Du and Hu [2019].

Another special case is the quadratic case, where $f(x, y)$ is a quadratic function, and solving the saddle point problem amounts to solving a structured linear system. The quadratic saddle point problem has been studied extensively in the numerical analysis community [Benzi et al., 2005, Bai et al., 2003, Bai, 2009]. One of the most notable algorithms for quadratic saddle point problems is Hermitian-Skew-Hermitian Split (HSS) [Bai et al., 2003]. However, most existing work does not provide a bound on the overall number of matrix-vector products.

Beyond the convex-concave setting, some researchers have also studied the nonconvex-concave case recently [Lin et al., 2019, Thekumparampil et al., 2019, Rafique et al., 2018, Lin et al., 2020, Lu et al., 2019, Nouiehed et al., 2019], with the goal being to find a stationary point of the nonconvex function $\phi(x) := \max_y f(x, y)$. By reducing to the strongly convex-strongly concave setting, Lin et al. [2020] have achieved state-of-the-art results for nonconvex-concave problems.

$L_{xy}$ in General Cases

Consider the extreme case where $L_{xy} = 0$. In this case, there is no interaction between $x$ and $y$, and $f(x, y)$ can simply be written as $h_1(x) - h_2(y)$, where $h_1$ and $h_2$ are strongly convex functions. Thus, in this case, the following trivial algorithm solves the problem:
$$x^* \leftarrow \arg\min_x f(x, y_0), \qquad y^* \leftarrow \arg\max_y f(x^*, y).$$
In other words, the equilibrium can be found by directly playing the best response to each other once.

Now, let us consider the case where $L_{xy}$ is nonzero but small. In this case, would the best response dynamics converge to the saddle point? Specifically, consider the following iterative procedure:
$$\begin{cases} x_{t+1} \leftarrow \arg\min_x\{f(x, y_t)\}, \\ y_{t+1} \leftarrow \arg\max_y\{f(x_{t+1}, y)\}. \end{cases} \qquad (2)$$
Let us define $y^*(x) := \arg\max_y f(x, y)$ and $x^*(y) := \arg\min_x f(x, y)$. Because $y^*(x)$ is $L_{xy}/m_y$-Lipschitz and $x^*(y)$ is $L_{xy}/m_x$-Lipschitz (see Fact 1 in Appendix A),
$$\|x_{t+1} - x^*\| = \|x^*(y_t) - x^*(y^*)\| \le \frac{L_{xy}}{m_x}\|y_t - y^*\| = \frac{L_{xy}}{m_x}\|y^*(x_t) - y^*(x^*)\| \le \frac{L_{xy}^2}{m_x m_y}\|x_t - x^*\|.$$
Thus, when $L_{xy}^2 < m_x m_y$, (2) is indeed a contraction. In fact, we can replace the exact solutions of the inner optimization problems with Nesterov's Accelerated Gradient Descent (AGD) run for a fixed number of steps, as described in Algorithm 1.

Algorithm 1 Alternating Best Response (ABR)
Require: $g(\cdot, \cdot)$, initial point $z_0 = [x_0; y_0]$, precision $\epsilon$, parameters $m_x, m_y, L_x, L_y$
  $\kappa_x := L_x/m_x$, $\kappa_y := L_y/m_y$, $T \leftarrow \left\lceil\log_2\left(\frac{\sqrt{\kappa_x + \kappa_y}}{\epsilon}\right)\right\rceil$
  for $t = 0, \cdots, T$ do
    $x_{t+1} \leftarrow$ AGD($g(\cdot, y_t)$, $x_t$, $\sqrt{\kappa_x}\ln(24\kappa_x)$)
    $y_{t+1} \leftarrow$ AGD($-g(x_{t+1}, \cdot)$, $y_t$, $\sqrt{\kappa_y}\ln(24\kappa_y)$)
  end for

The following theorem holds for the Alternating Best Response algorithm. The proof of the theorem can be found in Appendix B.

Theorem 1. If $g \in \mathcal{F}(m_x, m_y, L_x, L_{xy}, L_y)$ and $L_{xy} < \sqrt{m_x m_y}$, Alternating Best Response returns $(x_T, y_T)$ such that
$$\|x_T - x^*\| + \|y_T - y^*\| \le \epsilon\left(\|x_0 - x^*\| + \|y_0 - y^*\|\right),$$
and the number of gradient evaluations is bounded by (with $\kappa_x = L_x/m_x$, $\kappa_y = L_y/m_y$)
$$O\left(\left(\sqrt{\kappa_x} + \sqrt{\kappa_y}\right)\cdot\ln(\kappa_x\kappa_y)\ln(\kappa_x\kappa_y/\epsilon)\right).$$

Note that when $L_{xy}$ is small, the lower bound of Zhang et al. [2019] can be written as $\Omega\big(\sqrt{\kappa_x + \kappa_y}\,\ln(1/\epsilon)\big)$. Thus Alternating Best Response matches this lower bound up to logarithmic factors.
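To make the alternating structure concrete, the sketch below runs ABR on a toy quadratic instance, using a small AGD routine (as in the box above) as the inner solver. The instance, coupling strength and step counts are illustrative choices only, not taken from the paper.

```python
import numpy as np

def agd(grad, x0, l, m, T):
    """Nesterov's AGD for an l-smooth, m-strongly convex objective (cf. the AGD box)."""
    eta, kappa = 1.0 / l, l / m
    theta = (np.sqrt(kappa) - 1) / (np.sqrt(kappa) + 1)
    x_prev = x_tilde = x0
    for _ in range(T):
        x = x_tilde - eta * grad(x_tilde)        # gradient step at the extrapolated point
        x_tilde = x + theta * (x - x_prev)       # momentum / extrapolation step
        x_prev = x
    return x_prev

# Toy instance f(x,y) = 0.5 x'Ax + x'By - 0.5 y'Cy with weak coupling, so that
# best-response dynamics contract (L_xy^2 < m_x * m_y). Saddle point: the origin.
rng = np.random.default_rng(1)
n = 4
a, c = rng.uniform(1.0, 5.0, n), rng.uniform(1.0, 5.0, n)
A, C = np.diag(a), np.diag(c)
B = 0.1 * rng.standard_normal((n, n))

x, y = np.ones(n), np.ones(n)
for t in range(15):
    # approximate best response in x: minimize f(., y)
    x = agd(lambda x_: A @ x_ + B @ y, x, l=a.max(), m=a.min(), T=30)
    # approximate best response in y: maximize f(x, .), i.e. minimize -f(x, .)
    y = agd(lambda y_: C @ y_ - B.T @ x, y, l=c.max(), m=c.min(), T=30)

print(np.linalg.norm(x), np.linalg.norm(y))   # both shrink toward the saddle point 0
```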
In the previous subsection, we showed that Alternating Best Response matches the lower bound when the interaction term $L_{xy}$ is sufficiently small. However, in order to handle functions with $L_{xy} > \sqrt{m_x m_y}$, we need another algorithmic component, namely the accelerated proximal point algorithm [Güler, 1992, Lin et al., 2020].

For a minimax optimization problem $\min_x \max_y f(x, y)$, define $\phi(x) := \max_y f(x, y)$. Suppose that we run the accelerated proximal point algorithm on $\phi(x)$ with proximal parameter $\beta$: then the number of iterations can be easily bounded, while in each iteration one needs to solve a proximal problem in $x$, namely $\min_x\{\phi(x) + \beta\|x - \hat{x}_t\|^2\}$. The key observation is that this is equivalent to solving a minimax optimization problem $\min_x \max_y\{f(x, y) + \beta\|x - \hat{x}_t\|^2\}$. Thus, via accelerated proximal point, we are able to reduce solving $\min_x \max_y f(x, y)$ to solving $\min_x \max_y\{f(x, y) + \beta\|x - \hat{x}_t\|^2\}$. Intuitively, the regularizer on $x$ makes the new problem easier to solve.

This is exactly the idea behind Algorithm 2 (the idea was also used in Lin et al. [2020]). In the algorithm, $M$ is a positive constant characterizing the precision of solving the subproblem, where we require $M \ge \mathrm{poly}(L/m_x, L/m_y, \beta/m_x)$. If $M \to \infty$, the algorithm becomes an instance of accelerated proximal point on $\phi(x) = \max_y f(x, y)$.

Algorithm 2
Accelerated Proximal Point Algorithm for Minimax Optimization
Require:
Initial point $z_0 = [x_0; y_0]$, proximal parameter $\beta$, strong convexity modulus $m_x$
  $\hat{x}_0 \leftarrow x_0$, $\kappa \leftarrow \beta/m_x$, $\theta \leftarrow \frac{\sqrt{\kappa}-1}{\sqrt{\kappa}+1}$, $\tau \leftarrow \frac{\sqrt{\kappa}+1}{4\kappa}$
  for $t = 1, \cdots, T$ do
    Let $(x_t^*, y_t^*)$ be the saddle point of $f(x, y) + \beta\|x - \hat{x}_{t-1}\|^2$.
    Find $(x_t, y_t)$ such that $\|x_t - x_t^*\| + \|y_t - y_t^*\| \le \frac{1}{M}\big[\|x_{t-1} - x_t^*\| + \|y_{t-1} - y_t^*\|\big]$
    $\hat{x}_t \leftarrow x_t + \theta(x_t - x_{t-1}) + \tau(x_t - \hat{x}_{t-1})$
  end for
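A minimal sketch of this outer loop is given below. The inner saddle-point subproblem is solved by a placeholder routine (a few steps of gradient descent-ascent on the regularized objective), which is only an illustrative stand-in for the subroutines developed in this paper; the toy instance and all numerical parameters are assumptions for demonstration.

```python
import numpy as np

def inexact_appa(grad_x, grad_y, x0, y0, beta, m_x, T, inner_iters=200, inner_step=0.05):
    """Sketch of the accelerated proximal point outer loop (cf. Algorithm 2).

    The proximal subproblem min_x max_y { f(x,y) + beta*||x - x_hat||^2 } is solved
    approximately by plain gradient descent-ascent -- an illustrative placeholder only.
    """
    kappa = beta / m_x
    theta = (np.sqrt(kappa) - 1) / (np.sqrt(kappa) + 1)
    tau = (np.sqrt(kappa) + 1) / (4 * kappa)
    x_prev, y, x_hat = x0, y0, x0
    for _ in range(T):
        x = x_prev
        for _ in range(inner_iters):                       # inexact proximal step
            gx = grad_x(x, y) + 2 * beta * (x - x_hat)
            gy = grad_y(x, y)
            x, y = x - inner_step * gx, y + inner_step * gy
        x_hat = x + theta * (x - x_prev) + tau * (x - x_hat)
        x_prev = x
    return x_prev, y

# Illustrative quadratic-plus-bilinear instance with saddle point at the origin.
A, C, B = 2.0, 2.0, 5.0
gx = lambda x, y: A * x + B * y
gy = lambda x, y: B * x - C * y
x, y = inexact_appa(gx, gy, np.array([1.0]), np.array([1.0]), beta=B, m_x=A, T=30)
print(abs(x[0]), abs(y[0]))   # both approach 0
```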
The following theorem can be shown for Algorithm 2. The proof can be found in Appendix C, and is based on the proof of Theorem 4.1 in Lin et al. [2020].

Theorem 2.
The number of iterations needed by Algorithm 2 to produce $(x_T, y_T)$ such that $\|x_T - x^*\| + \|y_T - y^*\| \le \epsilon(\|x_0 - x^*\| + \|y_0 - y^*\|)$ is at most (with $\kappa = \beta/m_x$)
$$\hat{T} = 8\sqrt{\kappa}\cdot\ln\left(\frac{\kappa L}{m_y}\sqrt{\frac{L^2}{m_x m_y}}\cdot\frac{1}{\epsilon}\right). \qquad (3)$$

With the two algorithmic components, namely Alternating Best Response and Accelerated Proximal Point, in place, we can now combine them and design an efficient algorithm for general strongly convex-strongly concave functions. The high-level idea is to apply the accelerated proximal point algorithm twice in order to reduce a general problem into one solvable by Alternating Best Response.

To start with, let us consider a strongly convex-strongly concave function $f(x, y)$, and apply Algorithm 2 to $f$ with proximal parameter $\beta_1 = L_{xy}$. By Theorem 2, the algorithm converges in $\tilde{O}\big(\sqrt{L_{xy}/m_x}\big)$ iterations, while in each iteration we need to solve a regularized minimax problem
$$\min_x\max_y\left\{f(x, y) + \beta_1\|x - \hat{x}_{t-1}\|^2\right\}.$$
This is equivalent to $\min_y\max_x\{-f(x, y) - \beta_1\|x - \hat{x}_{t-1}\|^2\}$, so we can apply Algorithm 2 once more to this problem with parameter $\beta_2 = L_{xy}$. This procedure requires $\tilde{O}\big(\sqrt{L_{xy}/m_y}\big)$ iterations, and in each iteration one needs to solve a minimax problem of the form
$$\min_y\max_x\left\{-f(x, y) - \beta_1\|x - \hat{x}_{t-1}\|^2 + \beta_2\|y - \hat{y}_{t'-1}\|^2\right\} = -\min_x\max_y\left\{f(x, y) + \beta_1\|x - \hat{x}_{t-1}\|^2 - \beta_2\|y - \hat{y}_{t'-1}\|^2\right\}.$$
Hence, we have reduced the original problem to a problem that is $2\beta_1$-strongly convex with respect to $x$ and $2\beta_2$-strongly concave with respect to $y$. Now the interaction between $x$ and $y$ is (relatively) much weaker, and one can easily see that $L_{xy} \le \sqrt{\beta_1\beta_2}$. Consequently the final problem can be solved in $\tilde{O}\big(\sqrt{L/L_{xy}}\big)$ gradient evaluations using the Alternating Best Response algorithm.

We first consider the case where $L_{xy} > \max\{m_x, m_y\}$. The total gradient complexity would thus be
$$\tilde{O}\left(\sqrt{\frac{L_{xy}}{m_x}}\right)\cdot\tilde{O}\left(\sqrt{\frac{L_{xy}}{m_y}}\right)\cdot\tilde{O}\left(\sqrt{\frac{L}{L_{xy}}}\right) = \tilde{O}\left(\sqrt{\frac{L\cdot L_{xy}}{m_x m_y}}\right).$$
In order to deal with the case where $L_{xy} < \max\{m_x, m_y\}$, we shall choose $\beta_1 = \max\{L_{xy}, m_x\}$ for the first level of proximal point, and $\beta_2 = \max\{L_{xy}, m_y\}$ for the second level of proximal point. In this case, the total gradient complexity bound can be shown to be
$$\tilde{O}\left(\sqrt{\frac{\beta_1}{m_x}}\right)\cdot\tilde{O}\left(\sqrt{\frac{\beta_2}{m_y}}\right)\cdot\tilde{O}\left(\sqrt{\frac{L}{\beta_1} + \frac{L}{\beta_2}}\right) = \tilde{O}\left(\sqrt{\frac{L_x}{m_x} + \frac{L\cdot L_{xy}}{m_x m_y} + \frac{L_y}{m_y}}\right).$$
A formal description of the algorithm is provided in Algorithm 4, and a formal statement of the complexity upper bound is provided in Theorem 3. The proof can be found in Appendix D.
Theorem 3.
Assume that $f \in \mathcal{F}(m_x, m_y, L_x, L_{xy}, L_y)$. In Algorithm 4, the gradient complexity to produce $(x_T, y_T)$ such that $\|z_T - z^*\| \le \epsilon$ is
$$O\left(\sqrt{\frac{L_x}{m_x} + \frac{L\cdot L_{xy}}{m_x m_y} + \frac{L_y}{m_y}}\cdot\ln\left(\frac{L^2}{m_x m_y}\right)\ln\left(\frac{L^2}{m_x m_y}\cdot\frac{\|z_0 - z^*\|}{\epsilon}\right)\right).$$

Algorithm 3
APPA-ABR
Require: $g(\cdot, \cdot)$, initial point $z_0 = [x_0; y_0]$, precision parameter $M_2$
  $\beta_2 \leftarrow \max\{m_y, L_{xy}\}$, $M_1 \leftarrow L^{2.5}/(m_x m_y^{1.5})$
  $\hat{y}_0 \leftarrow y_0$, $\kappa_2 \leftarrow \beta_2/m_y$, $\theta \leftarrow \frac{\sqrt{\kappa_2}-1}{\sqrt{\kappa_2}+1}$, $\tau \leftarrow \frac{\sqrt{\kappa_2}+1}{4\kappa_2}$, $t \leftarrow 0$
  repeat
    $t \leftarrow t + 1$
    $(x_t, y_t) \leftarrow$ ABR($g(x, y) - \beta_2\|y - \hat{y}_{t-1}\|^2$, $[x_{t-1}; y_{t-1}]$, $1/M_1$, $2\beta_1$, $2\beta_2$, $3L$, $3L$)
    $\hat{y}_t \leftarrow y_t + \theta(y_t - y_{t-1}) + \tau(y_t - \hat{y}_{t-1})$
  until $\|\nabla g(x_t, y_t)\| \le \frac{\min\{m_x, m_y\}}{L M_2}\|\nabla g(x_0, y_0)\|$

Algorithm 4
Proximal Best Response
Require:
Initial point $z_0 = [x_0; y_0]$
  $\beta_1 \leftarrow \max\{m_x, L_{xy}\}$, $M_2 \leftarrow 80L^3/(m_x^{1.5} m_y^{1.5})$
  $\hat{x}_0 \leftarrow x_0$, $\kappa_1 \leftarrow \beta_1/m_x$, $\theta \leftarrow \frac{\sqrt{\kappa_1}-1}{\sqrt{\kappa_1}+1}$, $\tau \leftarrow \frac{\sqrt{\kappa_1}+1}{4\kappa_1}$
  for $t = 1, \cdots, T$ do
    $(x_t, y_t) \leftarrow$ APPA-ABR($f(x, y) + \beta_1\|x - \hat{x}_{t-1}\|^2$, $[x_{t-1}; y_{t-1}]$, $M_2$)
    $\hat{x}_t \leftarrow x_t + \theta(x_t - x_{t-1}) + \tau(x_t - \hat{x}_{t-1})$
  end for

Theorem 3 improves over the results of Lin et al. in two ways. First, Lin et al.'s upper bound has a $\ln^3(1/\epsilon)$ factor, while our algorithm enjoys linear convergence. Second, our result has a better dependence on $L_{xy}$. To see this, note that when $L_{xy} \ll L$,
$$\frac{L_x}{m_x} + \frac{L\cdot L_{xy}}{m_x m_y} + \frac{L_y}{m_y} \ll \frac{L_x}{m_x} + \frac{L^2}{m_x m_y} + \frac{L_y}{m_y} \le \frac{3L^2}{m_x m_y}.$$
This is also illustrated by Fig. 1, where Proximal Best Response (the red line) significantly outperforms Lin et al.'s result (the blue line) when $L_{xy} \ll L$. In particular, Proximal Best Response matches the lower bound when $L_{xy} > L_x$ or when $L_{xy} < \max\{m_x, m_y\}$; in between, it is able to gracefully interpolate between the two cases.

As shown by Lin et al. [2020], convex-concave problems and strongly convex-concave problems can be reduced to strongly convex-strongly concave problems. Hence, Theorem 3 naturally implies improved algorithms for convex-concave and strongly convex-concave problems.

Corollary 1. If $f(x, y)$ is $(L_x, L_{xy}, L_y)$-smooth and $m_x$-strongly convex w.r.t. $x$, via reduction to Theorem 3, the gradient complexity of finding an $\epsilon$-saddle point is $\tilde{O}\Big(\sqrt{\frac{m_x\cdot L_y + L\cdot L_{xy}}{m_x\epsilon}}\Big)$.

Corollary 2. If $f(x, y)$ is $(L_x, L_{xy}, L_y)$-smooth and convex-concave, via reduction to Theorem 3, the gradient complexity to produce an $\epsilon$-saddle point is $\tilde{O}\Big(\sqrt{\frac{L_x + L_y}{\epsilon}} + \frac{\sqrt{L\cdot L_{xy}}}{\epsilon}\Big)$.

The precise statements as well as the proofs can be found in Appendix F. We remark that the reduction is for constrained minimax optimization, and Theorem 3 holds for constrained problems after simple modifications to the algorithm.

$L_{xy}$ in Quadratic Cases

We can see that Proximal Best Response has near-optimal dependence on condition numbers when $L_{xy} > L_x$ or when $L_{xy} < \max\{m_x, m_y\}$. However, when $L_{xy}$ falls in between, there is still a significant gap between the upper bound and the lower bound. In this section, we try to close this gap for quadratic functions, i.e. we assume that
$$f(x, y) = \frac{1}{2}x^T A x + x^T B y - \frac{1}{2}y^T C y + u^T x + v^T y. \qquad (4)$$
The reason to consider quadratic functions is threefold. First, the lower bound instance by Zhang et al. [2019] is a quadratic function; thus, this lower bound applies to quadratic functions as well, so it would be interesting to match the lower bound for quadratic functions first. Second, quadratic functions are considerably easier to analyze. Third, finding the saddle point of quadratic functions is an important problem in its own right, and has many applications (see Benzi et al. [2005] and references therein).

For quadratic functions of the form (4), our assumption that $f \in \mathcal{F}(m_x, m_y, L_x, L_{xy}, L_y)$ becomes an assumption on the singular values of the matrices: $m_x I \preceq A \preceq L_x I$, $m_y I \preceq C \preceq L_y I$, $\|B\| \le L_{xy}$. In this case, the unique saddle point is given by the solution to a linear system:
$$\begin{pmatrix} x^* \\ y^* \end{pmatrix} = J^{-1}b = \begin{pmatrix} A & B \\ -B^T & C \end{pmatrix}^{-1}\begin{pmatrix} -u \\ v \end{pmatrix}.$$
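As a quick numerical illustration of this correspondence (a toy check with randomly generated matrices, not an instance from the paper), one can form $J$ and $b$, solve the linear system, and verify that the gradient of $f$ vanishes at the solution:

```python
import numpy as np

# Toy check: the saddle point of f(x,y) = 0.5 x'Ax + x'By - 0.5 y'Cy + u'x + v'y
# solves the (nonsymmetric, positive definite) linear system J z = b.
rng = np.random.default_rng(2)
n, m = 5, 3
A = np.diag(rng.uniform(1.0, 4.0, n))
C = np.diag(rng.uniform(1.0, 4.0, m))
B = rng.standard_normal((n, m))
u, v = rng.standard_normal(n), rng.standard_normal(m)

J = np.block([[A, B], [-B.T, C]])
b = np.concatenate([-u, v])
z = np.linalg.solve(J, b)
x, y = z[:n], z[n:]

print(np.linalg.norm(A @ x + B @ y + u))     # grad_x f(x*, y*) ~ 0
print(np.linalg.norm(B.T @ x - C @ y + v))   # grad_y f(x*, y*) ~ 0
```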
To see this, note that
$$J^{-1}b = \begin{pmatrix} A & B \\ -B^T & C \end{pmatrix}^{-1}\begin{pmatrix} -u \\ v \end{pmatrix} = -\begin{pmatrix} A & B \\ B^T & -C \end{pmatrix}^{-1}\begin{pmatrix} u \\ v \end{pmatrix}.$$
Thus, $[x^*; y^*]$ is the unique stationary point of $f(x, y)$. The question then becomes: how can we compute $J^{-1}b$, i.e. solve such a linear system, efficiently using first-order methods?

Throughout this section we assume that $L_x = L_y$ and $m_x < m_y$, which is without loss of generality, and that $m_y < L_{xy}$, as otherwise Proximal Best Response is already near-optimal.

We now focus on how to solve the linear system $Jz = b$, where $J := \begin{pmatrix} A & B \\ -B^T & C \end{pmatrix}$ is positive definite but not symmetric. We utilize the Hermitian-Skew-Hermitian Split (HSS) algorithm [Bai et al., 2003], which is designed to solve positive definite asymmetric systems. Define
$$G := \begin{pmatrix} A & 0 \\ 0 & C \end{pmatrix}, \quad S := \begin{pmatrix} 0 & B \\ -B^T & 0 \end{pmatrix}, \quad P := \begin{pmatrix} \alpha I + \beta A & 0 \\ 0 & I + \beta C \end{pmatrix},$$
where $\alpha$ and $\beta$ are constants to be determined. Let $z_t := [x_t; y_t]$. Then HSS runs as
$$\begin{cases} (\eta P + G)z_{t+1/2} = (\eta P - S)z_t + b, \\ (\eta P + S)z_{t+1} = (\eta P - G)z_{t+1/2} + b. \end{cases} \qquad (5)$$
Here $\eta > 0$ is a parameter. One can verify that
$$z_{t+1} - z^* = (\eta P + S)^{-1}(\eta P - G)(\eta P + G)^{-1}(\eta P - S)(z_t - z^*).$$
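The following sketch runs the two half-steps of (5) with direct solves for both systems; this is illustrative only, since in the paper the systems are solved by first-order subroutines, and the choice of $\alpha$, $\beta$, $\eta$ below is an arbitrary placeholder rather than the tuning used by RHSS($k$).

```python
import numpy as np

# Sketch of the HSS iteration (5) on a small quadratic saddle-point system J z = b,
# with J = G + S, G = diag(A, C) the symmetric part, S the skew part containing B.
# Both inner systems use dense solves here; the paper replaces them with first-order
# solvers (CG for the symmetric one, a recursive call for the other).
rng = np.random.default_rng(3)
n, m = 4, 4
A = np.diag(rng.uniform(1.0, 3.0, n))
C = np.diag(rng.uniform(1.0, 3.0, m))
B = rng.standard_normal((n, m))
u, v = rng.standard_normal(n), rng.standard_normal(m)

G = np.block([[A, np.zeros((n, m))], [np.zeros((m, n)), C]])
S = np.block([[np.zeros((n, n)), B], [-B.T, np.zeros((m, m))]])
J, b = G + S, np.concatenate([-u, v])

alpha, beta, eta = 1.0, 1.0, 1.0          # placeholder parameters
P = np.block([[alpha * np.eye(n) + beta * A, np.zeros((n, m))],
              [np.zeros((m, n)), np.eye(m) + beta * C]])

z = np.zeros(n + m)
for _ in range(50):
    z_half = np.linalg.solve(eta * P + G, (eta * P - S) @ z + b)
    z = np.linalg.solve(eta * P + S, (eta * P - G) @ z_half + b)

print(np.linalg.norm(J @ z - b))           # residual shrinks across iterations
```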
The key observation of HSS is that the iteration above is a contraction.

Lemma 2 ([Bai et al., 2003]). Define $M(\eta) := (\eta P + S)^{-1}(\eta P - G)(\eta P + G)^{-1}(\eta P - S)$. Then
$$\rho(M(\eta)) \le \|M(\eta)\| \le \max_{\lambda_i \in \mathrm{sp}(P^{-1}G)}\left|\frac{\lambda_i - \eta}{\lambda_i + \eta}\right| < 1.$$
(Here $\rho(\cdot)$ stands for the spectral radius of a matrix, and $\mathrm{sp}(\cdot)$ stands for its spectrum.)

Lemma 2 provides an upper bound on the iteration complexity of HSS, as in the original analysis of HSS [Bai et al., 2003]. However, it does not consider the computational cost per iteration. In particular, the matrix $\eta P + S$ is also asymmetric, and in fact corresponds to another quadratic minimax optimization problem. The original HSS paper did not consider how to solve this subproblem for general $P$. Our idea is to solve the subproblem recursively, as described in the next subsection.

In this subsection, we describe our algorithm Recursive Hermitian-Skew-Hermitian Split, or RHSS($k$), which uses HSS at $k - 1$ levels of recursion. RHSS($k$) calls HSS with $\alpha = m_x/m_y$ and with $\beta$ and $\eta$ set to suitable powers of $L_{xy}$ and $m_y$ depending on $k$ (see Algorithm 5). In each iteration, it solves two linear systems. The first one, which is associated with $\eta P + G$, can be solved with Conjugate Gradient [Hestenes et al., 1952], as $\eta P + G$ is symmetric positive definite. The second one is associated with
$$\eta P + S = \begin{pmatrix} \eta(\alpha I + \beta A) & B \\ -B^T & \eta(I + \beta C) \end{pmatrix},$$
which is again a quadratic minimax optimization problem. RHSS($k$) then makes a recursive call to RHSS($k-1$) to solve this subproblem.
When $k = 1$, we simply run the Proximal Best Response algorithm (Algorithm 4). A detailed description of RHSS($k$) for $k \ge 2$ is given in Algorithm 5. The guarantee for RHSS($k$) is the following theorem, whose proof is deferred to Appendix G. Note that for an algorithm on quadratic functions, the number of matrix-vector products is the same as the gradient complexity.

Theorem 4.
There exist constants $C_1, C_2$ such that the number of matrix-vector products needed to find $(x_T, y_T)$ such that $\|z_T - z^*\| \le \epsilon$ is at most
$$\sqrt{\frac{L_{xy}^2}{m_x m_y} + \left(\frac{L_x}{m_x} + \frac{L_y}{m_y}\right)\left(\frac{L_{xy}}{\max\{m_x, m_y\}}\right)^{2/k}}\cdot\left(C_1\ln\left(C_2\frac{L^2}{m_x m_y}\right)\right)^{k+3}\ln\left(\frac{\|z_0 - z^*\|}{\epsilon}\right). \qquad (6)$$
If $k$ is chosen as a fixed constant, the comparison of (6) and the lower bound [Zhang et al., 2019] is illustrated in Fig. 1. One can see that as $k$ increases, the upper bound of RHSS($k$) gradually approaches the lower bound. One may also try to choose the optimal $k$. In particular, we can show the following corollary.

Corollary 3.
When $k = \Theta\left(\sqrt{\ln\left(\frac{L^2}{m_x m_y}\right)\big/\ln\ln\left(\frac{L^2}{m_x m_y}\right)}\right)$, the number of matrix-vector products that RHSS($k$) needs to find $z_T$ such that $\|z_T - z^*\| \le \epsilon$ is
$$\sqrt{\frac{L_{xy}^2}{m_x m_y} + \frac{L_x}{m_x} + \frac{L_y}{m_y}}\cdot\ln\left(\frac{\|z_0 - z^*\|}{\epsilon}\right)\cdot\left(\frac{L^2}{m_x m_y}\right)^{o(1)}.$$
In other words, for the quadratic saddle point problem, RHSS($k$) with the optimal choice of $k$ matches the lower bound up to a sub-polynomial factor.

Algorithm 5 RHSS($k$) (Recursive Hermitian-Skew-Hermitian Split)
Require: Initial point $[x_0; y_0]$, precision $\epsilon$, parameters $m_x$, $m_y$, $L_{xy}$
  $t \leftarrow 0$, $M_1 \leftarrow L^2/(m_x m_y)$, $M_2 \leftarrow L_{xy}/m_y$, $\alpha \leftarrow m_x/m_y$, $\beta, \eta \leftarrow$ suitable powers of $L_{xy}$ and $m_y$ depending on $k$, $\tilde{\epsilon} \leftarrow \frac{m_x\epsilon}{L_{xy} + L_x}$
  repeat
    $\begin{pmatrix} r_1 \\ r_2 \end{pmatrix} \leftarrow \begin{pmatrix} \eta(\alpha I + \beta A) & -B \\ B^T & \eta(I + \beta C) \end{pmatrix}\begin{pmatrix} x_t \\ y_t \end{pmatrix} + \begin{pmatrix} -u \\ v \end{pmatrix}$.
    Call conjugate gradient to compute $\begin{pmatrix} x_{t+1/2} \\ y_{t+1/2} \end{pmatrix} \leftarrow \mathrm{CG}\left(\begin{pmatrix} \eta(\alpha I + \beta A) + A & 0 \\ 0 & \eta(I + \beta C) + C \end{pmatrix}, \begin{pmatrix} r_1 \\ r_2 \end{pmatrix}, \begin{pmatrix} x_t \\ y_t \end{pmatrix}, 1/M_1\right)$.
    Compute $\begin{pmatrix} w_1 \\ w_2 \end{pmatrix} \leftarrow \begin{pmatrix} \eta\alpha I + \eta\beta A - A & 0 \\ 0 & \eta(I + \beta C) - C \end{pmatrix}\begin{pmatrix} x_{t+1/2} \\ y_{t+1/2} \end{pmatrix} + \begin{pmatrix} -u \\ v \end{pmatrix}$
    Call RHSS($k-1$) with initial point $[x_t; y_t]$ and precision $1/M_2$ to solve
    $$\begin{pmatrix} x_{t+1} \\ y_{t+1} \end{pmatrix} \leftarrow \begin{pmatrix} \eta(\alpha I + \beta A) & B \\ -B^T & \eta(I + \beta C) \end{pmatrix}^{-1}\begin{pmatrix} w_1 \\ w_2 \end{pmatrix}.$$
    $t \leftarrow t + 1$
  until $\|Jz_t - b\| \le \tilde{\epsilon}\|Jz_0 - b\|$

Algorithm 6
The Conjugate Gradient Algorithm: CG($A$, $b$, $x_0$, $\epsilon$) [Allaire and Kaber, 2008]
  $r_0 \leftarrow b - Ax_0$, $p_0 \leftarrow r_0$, $k \leftarrow 0$
  repeat
    $\alpha_k \leftarrow \frac{r_k^T r_k}{p_k^T A p_k}$
    $x_{k+1} \leftarrow x_k + \alpha_k p_k$
    $r_{k+1} \leftarrow r_k - \alpha_k A p_k$
    $\beta_k \leftarrow \frac{r_{k+1}^T r_{k+1}}{r_k^T r_k}$
    $p_{k+1} \leftarrow r_{k+1} + \beta_k p_k$
    $k \leftarrow k + 1$
  until $\|r_k\| \le \epsilon\|b - Ax_0\|$
  Return $x_k$

Conclusion

In this work, we studied convex-concave minimax optimization problems. For general strongly convex-strongly concave problems, our Proximal Best Response algorithm achieves linear convergence and a better dependence on $L_{xy}$, the interaction parameter. Via known reductions [Lin et al., 2020], this result implies better upper bounds for strongly convex-concave and convex-concave problems. For quadratic functions, our algorithm RHSS($k$) is able to match the lower bound up to a sub-polynomial factor.

In future research, one interesting direction is to extend RHSS($k$) to general strongly convex-strongly concave functions. Another important direction would be to shave the remaining sub-polynomial factor from the upper bound for quadratic functions.

Acknowledgment
The authors thank Kefan Dong, Guodong Zhang and Chi Jin for helpful discussions. This paper is part of Yuanhao Wang's undergraduate thesis project at Tsinghua University.

References
Jacob Abernethy, Kevin A Lai, and Andre Wibisono. Last-iterate convergence rates for min-max optimiza-tion. arXiv preprint arXiv:1906.02027 , 2019.Gr´egoire Allaire and Sidi Mahmoud Kaber.
Numerical linear algebra , volume 55. Springer, 2008.Martin Arjovsky, Soumith Chintala, and L´eon Bottou. Wasserstein gan. arXiv preprint arXiv:1701.07875 ,2017.Wa¨ıss Azizian, Ioannis Mitliagkas, Simon Lacoste-Julien, and Gauthier Gidel. A tight and unified analysisof extragradient for a whole spectrum of differentiable games. arXiv preprint arXiv:1906.05945 , 2019.Zhong-Zhi Bai. Optimal parameters in the hss-like methods for saddle-point problems.
Numerical LinearAlgebra with Applications , 16(6):447–479, 2009.Zhong-Zhi Bai, Gene H Golub, and Michael K Ng. Hermitian and skew-hermitian splitting methods fornon-hermitian positive definite linear systems.
SIAM Journal on Matrix Analysis and Applications , 24(3):603–626, 2003.Tamer Basar and Geert Jan Olsder.
Dynamic noncooperative game theory , volume 23. Siam, 1999.Michele Benzi, Gene H Golub, and J¨org Liesen. Numerical solution of saddle point problems.
Acta numerica ,14:1–137, 2005.Yair Carmon, Yujia Jin, Aaron Sidford, and Kevin Tian. Variance reduction for matrix games. In
Advancesin Neural Information Processing Systems , pages 11377–11388, 2019.Antonin Chambolle and Thomas Pock. A first-order primal-dual algorithm for convex problems with appli-cations to imaging.
Journal of mathematical imaging and vision , 40(1):120–145, 2011.Yunmei Chen, Guanghui Lan, and Yuyuan Ouyang. Optimal primal-dual methods for a class of saddle pointproblems.
SIAM Journal on Optimization , 24(4):1779–1814, 2014.Bo Dai, Albert Shaw, Lihong Li, Lin Xiao, Niao He, Zhen Liu, Jianshu Chen, and Le Song. Sbeed: Convergentreinforcement learning with nonlinear function approximation. In
International Conference on MachineLearning , pages 1133–1142, 2018.Constantinos Daskalakis, Andrew Ilyas, Vasilis Syrgkanis, and Haoyang Zeng. Training gans with optimism.In
International Conference on Learning Representations (ICLR 2018) , 2018.Simon S Du and Wei Hu. Linear convergence of the primal-dual gradient method for convex-concave saddlepoint problems without strong convexity. In
The 22nd International Conference on Artificial Intelligenceand Statistics , pages 196–205, 2019.Simon S Du, Jianshu Chen, Lihong Li, Lin Xiao, and Dengyong Zhou. Stochastic variance reduction methodsfor policy evaluation. In
International Conference on Machine Learning , pages 1049–1058, 2017.Gauthier Gidel, Hugo Berard, Gatan Vignoud, Pascal Vincent, and Simon Lacoste-Julien. A variationalinequality perspective on generative adversarial networks. In
International Conference on Learning Rep-resentations , 2019. URL https://openreview.net/forum?id=r1laEnA5Ym .Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, AaronCourville, and Yoshua Bengio. Generative adversarial nets. In
Advances in neural information processingsystems , pages 2672–2680, 2014.Osman G¨uler. New proximal point algorithms for convex minimization.
SIAM Journal on Optimization , 2(4):649–664, 1992.Eldad Haber and Jan Modersitzki. Numerical methods for volume preserving image registration.
Inverse Problems, 20(5):1621, 2004.
Martin Hast, Karl Johan Åström, Bo Bernhardsson, and Stephen Boyd. PID design by convex-concave optimization. In , pages 4460–4465. IEEE, 2013.
Magnus R Hestenes, Eduard Stiefel, et al. Methods of conjugate gradients for solving linear systems.
Journalof research of the National Bureau of Standards , 49(6):409–436, 1952.GM Korpelevich. The extragradient method for finding saddle points and other problems.
Matecon , 12:747–756, 1976.Tianyi Lin, Chi Jin, and Michael I Jordan. On gradient descent ascent for nonconvex-concave minimaxproblems. arXiv preprint arXiv:1906.00331 , 2019.Tianyi Lin, Chi Jin, Michael Jordan, et al. Near-optimal algorithms for minimax optimization. arXiv preprintarXiv:2002.02417 , 2020.Songtao Lu, Ioannis Tsaknakis, Mingyi Hong, and Yongxin Chen. Hybrid block successive approximation forone-sided non-convex min-max problems: algorithms and applications. arXiv preprint arXiv:1902.08294 ,2019.Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu. Towards deeplearning models resistant to adversarial attacks. In
International Conference on Learning Representations ,2018. URL https://openreview.net/forum?id=rJzIBfZAb .Lars Mescheder, Sebastian Nowozin, and Andreas Geiger. The numerics of gans. In
Advances in NeuralInformation Processing Systems , pages 1825–1835, 2017.Aryan Mokhtari, Asuman Ozdaglar, and Sarath Pattathil. A unified analysis of extra-gradient and optimisticgradient methods for saddle point problems: Proximal point approach. arXiv preprint arXiv:1901.08511 ,2019.Ofir Nachum, Yinlam Chow, Bo Dai, and Lihong Li. Dualdice: Behavior-agnostic estimation of discountedstationary distribution corrections. In
Advances in Neural Information Processing Systems , pages 2315–2325, 2019.Anna Nagurney.
Network economics: A variational inequality approach , volume 10. Springer Science &Business Media, 2013.Arkadi Nemirovski. Prox-method with rate of convergence o (1/t) for variational inequalities with lipschitzcontinuous monotone operators and smooth convex-concave saddle point problems.
SIAM Journal onOptimization , 15(1):229–251, 2004.Y. E. Nesterov. A method for solving the convex programming problem with convergence rate O (1 /k ). Proceedings of the USSR Academy of Sciences , 269:543–547, 1983.Yurii Nesterov.
Introductory lectures on convex optimization: A basic course , volume 87. Springer Science& Business Media, 2013.Maher Nouiehed, Maziar Sanjabi, Jason D Lee, and Meisam Razaviyayn. Solving a class of non-convexmin-max games using iterative first order methods. arXiv preprint arXiv:1902.08297 , 2019.Yuyuan Ouyang and Yangyang Xu. Lower complexity bounds of first-order methods for convex-concavebilinear saddle-point problems.
Mathematical Programming , pages 1–35, 2019.Balamurugan Palaniappan and Francis Bach. Stochastic variance reduction methods for saddle-point prob-lems. In
Advances in Neural Information Processing Systems , pages 1416–1424, 2016.Hassan Rafique, Mingrui Liu, Qihang Lin, and Tianbao Yang. Non-convex min-max optimization: Provablealgorithms and applications in machine learning. arXiv preprint arXiv:1810.02060 , 2018.Alexander Rakhlin and Karthik Sridharan. Online learning with predictable sequences. In
Conference on Learning Theory, pages 993–1019, 2013.
Aman Sinha, Hongseok Namkoong, and John Duchi. Certifiable distributional robustness with principled adversarial training. arXiv preprint arXiv:1710.10571, 2, 2017.
Ben Taskar, Simon Lacoste-Julien, and Michael I Jordan. Structured prediction via the extragradient method. In
Advances in neural information processing systems , pages 1345–1352, 2006.Kiran Koshy Thekumparampil, Prateek Jain, Praneeth Netrapalli, and Sewoong Oh. Efficient algorithmsfor smooth minimax optimization. arXiv preprint arXiv:1907.01543 , 2019.Paul Tseng. On linear convergence of iterative methods for the variational inequality problem.
Journal ofComputational and Applied Mathematics , 60(1-2):237–252, 1995.John Von Neumann and Oskar Morgenstern.
Theory of games and economic behavior (commemorativeedition) . Princeton university press, 2007.Linli Xu, James Neufeld, Bryce Larson, and Dale Schuurmans. Maximum margin clustering. In
Advancesin neural information processing systems , pages 1537–1544, 2005.Junyu Zhang, Mingyi Hong, and Shuzhong Zhang. On lower iteration complexity bounds for the saddlepoint problems. arXiv preprint arXiv:1912.07481 , 2019.13
Some Useful Properties
In this section, we review some useful properties of functions in F ( m x , m y , L x , L xy , L y ). Some of the factsare known (see e.g., Lin et al. [2020] and Zhang et al. [2019]) and we provide the proofs for completeness. Fact 1.
Suppose f ∈ F ( m x , m y , L x , L xy , L y ). Let us define y ∗ ( x ) := arg max y f ( x , y ), x ∗ ( y ) := arg min x f ( x , y ), φ ( x ) := max y f ( x , y ) and ψ ( y ) := min x f ( x , y ). Then, we have that1. y ∗ is L xy /m y -Lipschitz, x ∗ is L xy /m x -Lipschitz;2. φ ( x ) is m x -strongly convex and L x + L xy /m y -smooth; ψ ( y ) is m y -strongly concave and L y + L xy /m x -smooth. Proof.
1. Consider arbitrary x and x (cid:48) . By definition, ∇ y f ( x , y ∗ ( x )) = ∇ y f ( x (cid:48) , y ∗ ( x (cid:48) )) = . By thedefinition of ( L x , L xy , L y )-smoothness, (cid:107)∇ y f ( x (cid:48) , y ∗ ( x )) (cid:107) ≤ L xy (cid:107) x − x (cid:48) (cid:107) . Thus m y (cid:107) y ∗ ( x ) − y ∗ ( x (cid:48) ) (cid:107) ≤ (cid:107)∇ y f ( x (cid:48) , y ∗ ( x )) (cid:107) ≤ L xy (cid:107) x − x (cid:48) (cid:107) . This proves that y ∗ ( · ) is L xy /m y -Lipschitz. Similarly x ∗ ( · ) is L xy /m x -Lipschitz.2. By Danskin’s Theorem, ∇ φ ( x ) = ∇ x f ( x , y ∗ ( x )). Thus, ∀ x , x (cid:48) (cid:107)∇ φ ( x ) − ∇ φ ( x (cid:48) ) (cid:107) = (cid:107)∇ x f ( x , y ∗ ( x )) − ∇ x f ( x (cid:48) , y ∗ ( x (cid:48) )) (cid:107)≤ (cid:107)∇ x f ( x , y ∗ ( x )) − ∇ x f ( x , y ∗ ( x (cid:48) )) (cid:107) + (cid:107)∇ x f ( x , y ∗ ( x (cid:48) )) − ∇ x f ( x (cid:48) , y ∗ ( x (cid:48) )) (cid:107)≤ L xy · (cid:107) y ∗ ( x ) − y ∗ ( x (cid:48) ) (cid:107) + L x (cid:107) x − x (cid:48) (cid:107)≤ (cid:32) L x + L xy m y (cid:33) (cid:107) x − x (cid:48) (cid:107) . On the other hand, ∀ x , x (cid:48) , φ ( x (cid:48) ) − φ ( x ) − ( x (cid:48) − x ) T ∇ φ ( x ) = f ( x (cid:48) , y ∗ ( x (cid:48) )) − f ( x , y ∗ ( x )) − ( x (cid:48) − x ) T ∇ x f ( x , y ∗ ( x )) ≥ f ( x (cid:48) , y ∗ ( x )) − f ( x , y ∗ ( x )) − ( x (cid:48) − x ) T ∇ x f ( x , y ∗ ( x )) ≥ m x (cid:107) x (cid:48) − x (cid:107) . Thus φ ( x ) is m x -strongly convex and (cid:16) L x + L xy m y (cid:17) -smooth. By symmetric arguments, one can show that ψ ( y ) is m y -strongly concave and (cid:16) L y + L xy m x (cid:17) -smooth. Fact 2.
Let z := [ x ; y ] and z ∗ := [ x ∗ ; y ∗ ]. Then1 √ (cid:107) x − x ∗ (cid:107) + (cid:107) y − y ∗ (cid:107) ) ≤ (cid:107) z − z ∗ (cid:107) ≤ (cid:107) x − x ∗ (cid:107) + (cid:107) y − y ∗ (cid:107) . Proof.
This can be easily proven using the AM-GM inequality.
Fact 3.
Let z := [ x ; y ] ∈ R m + n , z ∗ := [ x ∗ ; y ∗ ]. Thenmin { m x , m y }(cid:107) z − z ∗ (cid:107) ≤ (cid:107)∇ f ( x , y ) (cid:107) ≤ L (cid:107) z − z ∗ (cid:107) . Proof.
By properties of strong convexity [Nesterov, 2013], ∀ x , y f ( x , y ∗ ( x )) − f ( x , y ) ≤ m y (cid:107)∇ y f ( x , y ) (cid:107) . Similarly, f ( x , y ) − f ( x ∗ ( y ) , y ) ≤ m x (cid:107)∇ x f ( x , y ) (cid:107) . (cid:107)∇ f ( x , y ) (cid:107) = (cid:107)∇ x f ( x , y ) (cid:107) + (cid:107)∇ y f ( x , y ) (cid:107) ≥ { m x , m y } ( φ ( x ) − ψ ( y )) . Here φ ( · ) = max y f ( · , y ), ψ ( · ) = min x f ( x , · ). By Proposition 1, φ is m x -strongly convex while ψ is m y -strongly concave. Hence φ ( x ) − ψ ( y ) ≥ min { m x , m y } (cid:0) (cid:107) x − x ∗ (cid:107) + (cid:107) y − y ∗ (cid:107) (cid:1) = min { m x , m y } (cid:107) z − z ∗ (cid:107) . It follows that (cid:107)∇ f ( x , y ) (cid:107) ≥ min { m x , m y }(cid:107) z − z ∗ (cid:107) . On the other hand, (cid:107)∇ x f ( x , y ) (cid:107) ≤ L xy (cid:107) y − y ∗ (cid:107) + L x (cid:107) x − x ∗ (cid:107) , (cid:107)∇ y f ( x , y ) (cid:107) ≤ L xy (cid:107) x − x ∗ (cid:107) + L y (cid:107) y − y ∗ (cid:107) . As a result (cid:107)∇ f ( x , y ) (cid:107) ≤ L ( (cid:107) x − x ∗ (cid:107) + (cid:107) y − y ∗ (cid:107) ) ≤ L (cid:107) z − z ∗ (cid:107) . Fact 4.
Let ˆ z = [ˆ x ; ˆ y ]. Then (cid:107) ˆ z − z ∗ (cid:107) ≤ (cid:15) impliesmax y f (ˆ x , y ) − min x f ( x , ˆ y ) ≤ L min { m x , m y } (cid:15) . Proof.
Define φ ( x ) = max y f ( x , y ) and ψ ( y ) = min x f ( x , y ). Thenmax y f (ˆ x , y ) − min x f ( x , ˆ y ) = φ (ˆ x ) − ψ (ˆ y ) . By Fact 1, φ is ( L x + L xy /m x )-smooth while ψ is ( L y + L xy /m x )-smooth. Since φ ( x ∗ ) = ψ ( y ∗ ), ∇ φ ( x ∗ ) = , ∇ ψ ( y ∗ ) = , φ (ˆ x ) − ψ (ˆ y ) ≤ (cid:18) L x + L xy m x (cid:19) (cid:107) ˆ x − x ∗ (cid:107) + 12 (cid:18) L y + L xy m y (cid:19) (cid:107) ˆ y − y ∗ (cid:107) ≤ (cid:18) L + L xy min { m x , m y } (cid:19) ( (cid:107) ˆ x − x ∗ (cid:107) + (cid:107) ˆ y − y ∗ (cid:107) ) ≤ L min { m x , m y } (cid:15) . B Proof of Theorem 1
Theorem 1. If g ∈ F ( m x , m y , L x , L xy , L y ) and L xy < √ m x m y , Alternating Best Response returns ( x T , y T ) such that (cid:107) x T − x ∗ (cid:107) + (cid:107) y T − y ∗ (cid:107) ≤ (cid:15) ( (cid:107) x − x ∗ (cid:107) + (cid:107) y − y ∗ (cid:107) ) , using ( κ x = L x /m x , κ y = L y /m y ) O (cid:16)(cid:0)(cid:112) κ x + κ y (cid:1) · ln ( κ x κ y ) ln (cid:16) κ x κ y (cid:15) (cid:17)(cid:17) . gradient evaluations.Proof. Define ˜ x t +1 := arg min x f ( x , y t ). Let us define y ∗ ( x ) := arg max y f ( x , y ), x ∗ ( y ) := arg min x f ( x , y )and φ ( x ) := max y f ( x , y ). Also define ˜ x t +1 := arg min x f ( x , y ∗ ( x t )) and ˆ x t +1 := arg min x f ( x , y t ).15he basic idea is the following. Because y ∗ ( · ) is L xy /m y -Lipschitz and x ∗ ( · ) is L xy /m x -Lipschitz (Fact 1), (cid:107) x ∗ ( y t ) − x ∗ (cid:107) = (cid:107) x ∗ ( y t ) − x ∗ ( y ∗ ) (cid:107) ≤ L xy m x (cid:107) y t − y ∗ (cid:107) , (cid:107) y ∗ ( x t +1 ) − y ∗ (cid:107) = (cid:107) y ∗ ( x t +1 ) − y ∗ ( x ∗ ) (cid:107) ≤ L xy m y (cid:107) x t +1 − x ∗ (cid:107) . By a standard analysis of accelerated gradient descent (Lemma 1), since ˆ x t +1 = x ∗ ( y t ) is the minimum of f ( · , y t ) and x t is the initial point, (cid:107) x t +1 − ˆ x t +1 (cid:107) ≤ ( κ x + 1) (cid:107) x t − ˆ x t +1 (cid:107) · (cid:18) − √ κ x (cid:19) √ κ x ln(24 κ x ) ≤ (cid:107) x t − ˆ x t +1 (cid:107) · ( κ x + 1) · exp {− κ x ) }≤ (cid:107) x t − ˆ x t +1 (cid:107) . That is, (cid:107) x t +1 − x ∗ ( y t ) (cid:107) ≤ (cid:107) x t − x ∗ ( y t ) (cid:107) ≤
116 ( (cid:107) x t − x ∗ (cid:107) + (cid:107) x ∗ ( y t ) − x ∗ (cid:107) ) . Thus (cid:107) x t +1 − x ∗ (cid:107) ≤ (cid:107) x t +1 − x ∗ ( y t ) (cid:107) + (cid:107) x ∗ ( y t ) − x ∗ (cid:107) ≤ · L xy m x (cid:107) y t − y ∗ (cid:107) + 116 (cid:107) x t − x ∗ (cid:107) . (7)Similarly, (cid:107) y t +1 − y ∗ ( x t +1 ) (cid:107) ≤ (cid:107) y t − y ∗ ( x t +1 ) (cid:107) ≤
116 ( (cid:107) y t − y ∗ (cid:107) + (cid:107) y ∗ ( x t +1 ) − y ∗ (cid:107) ) . Thus (cid:107) y t +1 − y ∗ (cid:107) ≤ (cid:107) y t +1 − y ∗ ( x t +1 ) (cid:107) + (cid:107) y ∗ ( x t +1 ) − y ∗ (cid:107)≤ · L xy m y (cid:107) x t +1 − x ∗ (cid:107) + 116 (cid:107) y t − y ∗ (cid:107)≤ (cid:32) · L xy m x m y + 116 (cid:33) (cid:107) y t − y ∗ (cid:107) + 17 L xy m y (cid:107) x t − x ∗ (cid:107)≤ . (cid:107) y t − y ∗ (cid:107) + 17 L xy m y (cid:107) x t − x ∗ (cid:107) . (8)Define C := 4 (cid:112) m y /m x . By adding (7) and C times (8), one gets (cid:107) x t +1 − x ∗ (cid:107) + C (cid:107) y t +1 − y ∗ (cid:107) ≤ (cid:18)
116 + 17 L xy √ m x m y (cid:19) (cid:107) x t − x ∗ (cid:107) + (cid:18) . C + 1716 · L xy m x (cid:19) (cid:107) y t − y ∗ (cid:107)≤ (cid:107) x t − x ∗ (cid:107) + (cid:18) .
35 + 17 L xy √ m x m y (cid:19) C (cid:107) y t − y ∗ (cid:107)≤
12 ( (cid:107) x t − x ∗ (cid:107) + C (cid:107) y t − y ∗ (cid:107) ) . It follows that (cid:107) x T − x ∗ (cid:107) + C (cid:107) y T − y ∗ (cid:107) ≤ − T ( (cid:107) x − x ∗ (cid:107) + C (cid:107) y − y ∗ (cid:107) ) . If C ≥
1, then (cid:107) x T − x ∗ (cid:107) + (cid:107) y T − y ∗ (cid:107) ≤ (cid:114) m y m x · − T · ( (cid:107) x − x ∗ (cid:107) + (cid:107) y − y ∗ (cid:107) ) .
16n the other hand, if
C <
1, then (cid:107) x T − x ∗ (cid:107) + (cid:107) y T − y ∗ (cid:107) ≤ − T C ( (cid:107) x − x ∗ (cid:107) + (cid:107) y − y ∗ (cid:107) ) = (cid:114) m x m y · − T − ( (cid:107) x − x ∗ (cid:107) + (cid:107) y − y ∗ (cid:107) ) . Since max { m x /m y , m y /m x } ≤ L x / min { m x , m y } , (cid:107) x T − x ∗ (cid:107) + (cid:107) y T − y ∗ (cid:107) ≤ (cid:115) L x min { m x , m y } · − T ( (cid:107) x − x ∗ (cid:107) + (cid:107) y − y ∗ (cid:107) ) . (9)The theorem follows from this inequality. C Proof of Theorem 2
Theorem 2.
Assume that M ≥ κ (cid:113) κ + Lm x + L xy m x m y (cid:16) Lm y (cid:17) . The number of iterations needed byAlgorithm 2 to produce ( x T , y T ) such that (cid:107) x T − x ∗ (cid:107) + (cid:107) y T − y ∗ (cid:107) ≤ (cid:15) ( (cid:107) x − x ∗ (cid:107) + (cid:107) y − y ∗ (cid:107) ) is at most ( κ = β/m x ) ˆ T = 8 √ κ · ln (cid:32) κ Lm y (cid:115) L m x m y · (cid:15) (cid:33) . (10)Before proving the theorem, we would first state the inexact accelerated proximal point algorithm [Linet al., 2020], which is the basis of Algorithm 2. Algorithm 7
Inexact Accelerated Proximal Point Algorithm (Inexact APPA)
Require:
Initial point x , proximal parameter β , strongly convex module m ˆ x ← x , κ ← β/m , θ ← √ κ − √ κ +1 , τ ← √ κ +4 κ for t = 1 , · · · , T do Find x t such that g ( x t ) + β (cid:107) x t − ˆ x t − (cid:107) ≤ min x { g ( x ) + β (cid:107) x − ˆ x t − (cid:107) } + δ t ˆ x t ← x t + θ ( x t − x t − ) + τ ( x t − ˆ x t − ) end for The following lemma about the inexact APPA algorithm follows directly from the proof of Theorem4.1 [Lin et al., 2020]. We state it without proving it.
Lemma 3.
Suppose that { x t } t ≥ is generated by running the inexact APPA algorithm on g ( · ) . There existsa sequence { Λ t } t ≥ such that1. Λ t ≥ g ( x t ) Λ − g ( x ∗ ) ≤ g ( x ) − g ( x ∗ )) Λ t +1 − g ( x ∗ ) ≤ (cid:16) − √ κ (cid:17) (Λ t − g ( x ∗ )) + 11 κδ t +1 Here Λ t can be recursively defined as follows (see also Lin et al. [2020]),Λ := g ( x ) + m (cid:107) x ∗ − x (cid:107) , Λ t +1 := 12 √ κ (cid:18) g ( x t +1 ) + 2 β (ˆ x t − x t +1 ) T ( x ∗ − x t +1 ) + m (cid:107) x ∗ − x t +1 (cid:107) κ / δ t +1 (cid:19) + (cid:18) − √ κ (cid:19) Λ t . However, we do not need to make use of the explicit definition of Λ t .Now we are ready to prove Theorem 2. 17 roof. Define φ ( x ) := max y f ( x , y ) and ˆ L := L + L xy /m y . Then φ ( x ) is m x -strongly convex and ˆ L -smooth.Observe that x ∗ t = arg min x (cid:2) φ ( x ) + β (cid:107) x − ˆ x t − (cid:107) (cid:3) , y ∗ t = arg max y [ f ( x ∗ t , y )] . Thus Algorithm 2 is an instance of the inexact APPA algorithm on φ ( x ) with proximal parameter β andstrongly convex module m x , and with δ t = φ ( x t ) + β (cid:107) x t − ˆ x t − (cid:107) − min x (cid:8) φ ( x ) + β (cid:107) x − ˆ x t − (cid:107) (cid:9) ≤ ˆ L + 2 β (cid:107) x t − x ∗ t (cid:107) . (11)Here we used the fact that, for a L -smooth function g ( · ) whose minimum is x ∗ , g ( x ) − g ( x ∗ ) ≤ L (cid:107) x − x ∗ (cid:107) .Define C := (cid:107) x − x ∗ (cid:107) + (cid:107) y − y ∗ (cid:107) and C := 44 κ √ κ ˆ L +2 β C . Let us state the following inductionhypothesis ∆ t := Λ t − φ ( x ∗ ) ≤ C (cid:18) − √ κ (cid:19) t , (12) (cid:15) t := (cid:107) x t − x ∗ t (cid:107) + (cid:107) y t − y ∗ t (cid:107) ≤ C (cid:18) − √ κ (cid:19) t . (13)It is easy to verify that with our choice of C and C , both (12) and (13) hold for t = 0.Now, assume that (12) and (13) hold for τ = 1 , , · · · , t . Define y ∗ ( · ) := arg max y f ( · , y ). By Fact 1, y ∗ ( · ) is ( L/m y )-Lipschitz. Thus (cid:107) y t − y ∗ t +1 (cid:107) ≤ (cid:107) y ∗ t − y ∗ t +1 (cid:107) + (cid:107) y t − y ∗ t (cid:107)≤ (cid:107) y ∗ ( x ∗ t ) − y ∗ ( x ∗ t +1 ) (cid:107) + (cid:15) t ≤ Lm y · (cid:0) (cid:107) x ∗ t − x t (cid:107) + (cid:107) x t − x ∗ t +1 (cid:107) (cid:1) + (cid:15) t ≤ (cid:18) Lm y + 1 (cid:19) (cid:15) t + Lm y (cid:107) x t − x ∗ t +1 (cid:107) . It follows that (cid:15) t +1 ≤ M (cid:2) (cid:107) x t − x ∗ t +1 (cid:107) + (cid:107) y t − y ∗ t +1 (cid:107) (cid:3) ≤ Lm y M · (cid:0) (cid:107) x t − x ∗ t +1 (cid:107) + (cid:15) t (cid:1) . (14)Note that by Lemma 3 and the induction hypothesis (12) φ ( x ∗ t +1 ) − φ ( x ∗ ) ≤ (cid:18) − √ κ (cid:19) ∆ t ≤ C (cid:18) − √ κ (cid:19) t . By the m x -strong convexity of φ ( · ) (Fact 1), (cid:107) x ∗ t +1 − x ∗ (cid:107) ≤ (cid:114) m x (cid:0) φ ( x ∗ t +1 ) − φ ( x ∗ ) (cid:1) ≤ (cid:114) C m x (cid:18) − √ κ (cid:19) t . Meanwhile (cid:107) x t − x ∗ (cid:107) ≤ (cid:114) m x ( φ ( x t ) − φ ( x ∗ )) ≤ (cid:114) C m x (cid:18) − √ κ (cid:19) t . Therefore (cid:107) x t − x ∗ t +1 (cid:107) ≤ (cid:107) x t − x ∗ (cid:107) + (cid:107) x ∗ t +1 − x ∗ (cid:107) ≤ (cid:114) C m x (cid:18) − √ κ (cid:19) t . (15)18y (14), (13) and the fact that M ≥ κ (cid:113) κ + ˆ Lm x (1 + L/m y ) (cid:15) t +1 ≤ Lm y M (cid:32) (cid:114) C m x + C (cid:33) (cid:18) − √ κ (cid:19) t ≤ Lm y M · (cid:115) κ . ( ˆ L + 2 β ) m x C (cid:18) − √ κ (cid:19) t ( C = 44 κ . L +2 β C ) ≤ √ κ (cid:113) ˆ L +2 βm x κ (cid:113) κ + ˆ Lm x · C (cid:18) − √ κ (cid:19) t (2 √
44 + 1 < ≤ C (cid:18) − √ κ (cid:19) t ≤ C (cid:18) − √ κ (cid:19) t +12 . Therefore (13) holds for t + 1. Meanwhile, by (11) and Lemma 3,∆ t +1 ≤ (cid:18) − √ κ (cid:19) ∆ t + 11 κ · ˆ L + 2 β (cid:15) t +1 ≤ (cid:18) − √ κ (cid:19) C (cid:18) − √ κ (cid:19) t + 11 κ · ˆ L + 2 β · C (cid:18) − √ κ (cid:19) t = C (cid:18) − √ κ (cid:19) t +1 , where we used the fact that11 κ · ˆ L + 2 β · C = 14 √ κ · κ . ˆ L + 2 β C = C √ κ . Thus (12) also holds for t + 1. By induction on t , we can see that (12) and (13) both hold for all t ≥ (cid:107) x T − x ∗ (cid:107) ≤ (cid:114) m x [ φ ( x T ) − φ ( x ∗ )] ≤ (cid:115) m x · κ √ κ ˆ L + 2 β C (cid:18) − √ κ (cid:19) T ≤ C (cid:18) − √ κ (cid:19) T (cid:115) κ √ κ · (cid:18) L m x m y + κ (cid:19) . Meanwhile, (cid:107) y T − y ∗ (cid:107) ≤ (cid:107) y T − y ∗ ( x T ) (cid:107) + (cid:107) y ∗ − y ∗ ( x T ) (cid:107) ≤ (cid:15) T + L xy m y (cid:107) x T − x ∗ (cid:107) . Therefore (cid:107) x T − x ∗ (cid:107) + (cid:107) y T − y ∗ (cid:107) ≤ (cid:15) T + (cid:18) L xy m y + 1 (cid:19) (cid:107) x T − x ∗ (cid:107)≤ C (cid:18) − √ κ (cid:19) T + 2 Lm y · C (cid:18) − √ κ (cid:19) T · (cid:115) κ √ κ · (cid:18) L m x m y + κ (cid:19) ≤ C (cid:18) − √ κ (cid:19) T · (cid:34) κ Lm y (cid:115) L m x m y (cid:35) ≤ κ Lm y (cid:115) L m x m y · (cid:18) − √ κ (cid:19) T · ( (cid:107) x − x ∗ (cid:107) + (cid:107) y − y ∗ (cid:107) ) , which proves the theorem. 19 Proof of Theorem 3
Theorem 3.
Assume that f ∈ F ( m x , m y , L x , L xy , L y ) . In Algorithm 4, the gradient complexity to produce ( x T , y T ) such that (cid:107) z T − z ∗ (cid:107) ≤ (cid:15) is O (cid:32)(cid:115) L x m x + L · L xy m x m y + L y m y · ln (cid:18) L m x m y (cid:19) ln (cid:18) L m x m y · (cid:107) z − z ∗ (cid:107) (cid:15) (cid:19)(cid:33) . Proof.
We start the proof by verifying f ( x , y ) + β (cid:107) x − ˆ x (cid:107) − β (cid:107) y − ˆ y (cid:107) can indeed be solved by callingABR( · ,[ x ; y ],1 /M , 2 β , 2 β , 3 L , 3 L ). Observe that L xy ≤ β , β ≤ L . Since f ( x , y ) + β (cid:107) x − ˆ x (cid:107) − β (cid:107) y − ˆ y (cid:107) is 2 β -strongly convex w.r.t. x and 2 β -strongly concave w.r.t. y , we can see that √ β · β ≥ L xy .We can also verify that f ( x , y ) + β (cid:107) x − ˆ x (cid:107) − β (cid:107) y − ˆ y (cid:107) is 3 L -smooth, which follows from the fact that L + max { β , β } ≤ L .Therefore, we can apply Theorem 1 and conclude that at line 5 of Algorithm 3 (cid:107) x t − x ∗ t (cid:107) + (cid:107) y t − y ∗ t (cid:107) ≤ M ( (cid:107) x t − − x ∗ t (cid:107) + (cid:107) y t − − y ∗ t (cid:107) ) , where ( x ∗ t , y ∗ t ) := min x max y { g ( x , y ) − β (cid:107) y − y t − (cid:107) } , and such ( x t , y t ) is found in a gradient complexityof O (cid:32)(cid:115) Lβ + Lβ · ln (cid:18) L β β (cid:19) ln (cid:18) L β β · M (cid:19)(cid:33) = O (cid:32)(cid:115) Lβ + Lβ · ln (cid:18) L m x m y (cid:19)(cid:33) . Next, we verify that Algorithm 3 is an instance of Algorithm 2 on the function − g ( x , y ). Notice thatmin y max x (cid:8) − g ( x , y ) + β (cid:107) y − ˆ y (cid:107) (cid:9) = − min x max y (cid:8) g ( x , y ) − β (cid:107) y − ˆ y (cid:107) (cid:9) . That is, min x max y (cid:8) g ( x , y ) − (cid:107) y − ˆ y (cid:107) (cid:9) has the same saddle point as − g ( x , y ) + β (cid:107) y − ˆ y (cid:107) . Thus, we onlyneed to verify that M ≥ · β m (cid:48) y (cid:18) L (cid:48) m (cid:48) x (cid:19) (cid:115) β m (cid:48) y + L (cid:48) m (cid:48) y + L xy m x m y , (16)where ( m (cid:48) x , m (cid:48) y , L (cid:48) x , L xy , L (cid:48) y ) are parameters for f ( x , y ) + β (cid:107) x (cid:107) , and L (cid:48) = max { L xy , L (cid:48) x , L (cid:48) y } . Note that m (cid:48) x ≥ m x + 2 β , m (cid:48) y = m y , L (cid:48) x = L (cid:48) y ≤ L + 2 β , L xy ≤ β , β ≤ L . ThusRHS of (16) ≤ · β m y (cid:115) β m (cid:48) y + L + 2 β m y + L xy m y ( m x + 2 β ) · (cid:18) L + 2 β m x + 2 β (cid:19) ≤ · Lm y (cid:115) Lm y + 3 Lm y + L xy m y (cid:18) Lm x (cid:19) ≤ L . m x m . y = M . Therefore, Algorithm 3 is indeed an instance of Inexact APPA (Algorithm 7). Notice that by the stoppingcondition of Algorithm 3,( (cid:107) x t − x ∗ (cid:107) + (cid:107) y t − y ∗ (cid:107) ) ≤ √ { m x , m y } (cid:107)∇ g ( x t , y t ) (cid:107) (Fact 3 and 2) ≤ √ { m x , m y } · min { m x , m y } LM (cid:107)∇ g ( x , y ) (cid:107)≤ √ { m x , m y } · min { m x , m y } LM · L ( (cid:107) x − x ∗ (cid:107) + (cid:107) y − y ∗ (cid:107) ) ≤ M ( (cid:107) x − x ∗ (cid:107) + (cid:107) y − y ∗ (cid:107) ) . Here g ( x , y ) refers to the argument passed to Algorithm 3, which in our case has the form f ( x , y ) + β (cid:107) x − ˆ x t (cid:48) − (cid:107) . (cid:107) x t − x ∗ (cid:107) + (cid:107) y t − y ∗ (cid:107) ≤ M ( (cid:107) x − x ∗ (cid:107) + (cid:107) y − y ∗ (cid:107) ) (17)On the other hand, suppose that (cid:107) x t − x ∗ (cid:107) + (cid:107) y t − y ∗ (cid:107) ≤ M min { m x , m y } L · ( (cid:107) x − x ∗ (cid:107) + (cid:107) y − y ∗ (cid:107) ) , we can show that (cid:107)∇ g ( x t , y t ) (cid:107) ≤ L ( (cid:107) x t − x ∗ (cid:107) + (cid:107) y t − y ∗ (cid:107) ) ≤ min { m x , m y } M ( (cid:107) x − x ∗ (cid:107) + (cid:107) y − y ∗ (cid:107) ) ≤ M (cid:107)∇ g ( x , y ) (cid:107) . Thus in this case Algorithm 3 must return. 
By Theorem 2, we can see that Algorithm 3 always returns inat most O (cid:32)(cid:115) β m y · ln (cid:18) L m x m y · L min { m x , m y } M (cid:19)(cid:33) = O (cid:32)(cid:115) β m y · ln (cid:18) L m x m y (cid:19)(cid:33) (18)iterations.Finally, we verify that Algorithm 4 is an instance of Algorithm 2 on f ( x , y ) with parameter β . Notethat by (17), we only need to verify that M = 80 L m . x m . y ≥ · β m x (cid:115) β m x + Lm x + L xy m x m y (cid:18) Lm y (cid:19) . Observe that 20 · β m x (cid:115) β m x + Lm x + L xy m x m y (cid:18) Lm y (cid:19) ≤ · Lm x (cid:115) Lm x + Lm x + L m x m y · Lm y ≤ · Lm x · (cid:115) L m x m y · Lm y = M . Therefore Algorithm 4 is indeed an instance of Algorithm 2 on f ( x , y ). As a result, by Theorem 2, thenumber of iterations needed such that (cid:107) z T − z ∗ (cid:107) ≤ (cid:15) is O (cid:32)(cid:114) β m x · ln (cid:18) L m x m y · (cid:107) z − z ∗ (cid:107) (cid:15) (cid:19)(cid:33) . (19)We now compute the total gradient complexity. Recall that β = max { m x , L xy } , while β = max { m y , L xy } .By (19), (18) and (D), the total gradient complexity of Algorithm 4 to reach (cid:107) z T − z ∗ (cid:107) ≤ (cid:15) is O (cid:32)(cid:114) β m x · ln (cid:18) L m x m y · (cid:107) z − z ∗ (cid:107) (cid:15) (cid:19) · (cid:115) β m y · ln (cid:18) L m x m y (cid:19) · (cid:115) Lβ + Lβ · ln (cid:18) L m x m y (cid:19)(cid:33) = O (cid:32)(cid:115) L ( β + β ) m x m y · ln (cid:18) L m x m y (cid:19) ln (cid:18) L m x m y · (cid:107) z − z ∗ (cid:107) (cid:15) (cid:19)(cid:33) . If L xy ≥ max { m x , m y } , then β = β = L xy , so (cid:115) L ( β + β ) m x m y = (cid:115) L · L xy m x m y ≤ (cid:115) L x m x + L · L xy m x m y + L y m y . L xy < max { m x , m y } . Without loss of generality, assume that m x ≤ m y .Suppose that L xy < m y , then L = L x , β = m y , while β ≤ m y . Hence (cid:115) L ( β + β ) m x m y ≤ (cid:115) L x · m y m x m y = (cid:114) L x m x ≤ (cid:115) L x m x + L · L xy m x m y + L y m y . Thus, in either case, (cid:113) L ( β + β ) m x m y = O (cid:16)(cid:113) L x m x + L · L xy m x m y + L y m y (cid:17) . We conclude that the total gradient complexityof Algorithm 4 to find a point z T = [ x T ; y T ] such that (cid:107) z T − z ∗ (cid:107) ≤ (cid:15) is O (cid:32)(cid:115) L x m x + L · L xy m x m y + L y m y · ln (cid:18) L m x m y (cid:19) ln (cid:18) L m x m y · (cid:107) z − z ∗ (cid:107) (cid:15) (cid:19)(cid:33) . E Application to Constrained Problems
In the constrained minimax optimization problem, $x$ is constrained to a compact convex set $\mathcal{X} \subseteq \mathbb{R}^n$ while $y$ is constrained to a compact convex set $\mathcal{Y} \subseteq \mathbb{R}^m$. For constrained minimax optimization problems, saddle points are defined as follows.

Definition 9. $(x^*, y^*)$ is a saddle point of $f : \mathcal{X} \times \mathcal{Y} \to \mathbb{R}$ if $\forall x \in \mathcal{X}, y \in \mathcal{Y}$, $f(x, y^*) \ge f(x^*, y^*) \ge f(x^*, y)$.

Definition 10. $(\hat{x}, \hat{y})$ is an $\epsilon$-saddle point of $f : \mathcal{X} \times \mathcal{Y} \to \mathbb{R}$ if
$$\max_{y \in \mathcal{Y}} f(\hat{x}, y) - \min_{x \in \mathcal{X}} f(x, \hat{y}) \le \epsilon.$$

We will use $\mathcal{P}_{\mathcal{X}}[\cdot]$ to denote the projection onto the convex set $\mathcal{X}$. Assuming efficient projection oracles, our algorithms can all be easily adapted to the constrained case. In particular, for Algorithm 1, we only need to replace AGD with its projected version; that is, set $x_t \leftarrow \mathcal{P}_{\mathcal{X}}[\tilde{x}_{t-1} - \eta \nabla g(\tilde{x}_{t-1})]$.

For Algorithms 3 and 4, the modified versions are presented below. The only significant change is the addition of a projected gradient descent-ascent step in lines 5-6 of Algorithm 3 and lines 5-6 and 9-10 of Algorithm 4.

E.1 Algorithmic Modifications
Algorithm 8 AGD($g$, $x_0$, $T$) with Projections [Nesterov, 2013, (2.2.63)]
Require: Initial point $x_0$, smoothness constant $l$, strong-convexity modulus $m$, number of iterations $T$
1: $\eta \leftarrow 1/l$, $\kappa \leftarrow l/m$, $\theta \leftarrow (\sqrt{\kappa} - 1)/(\sqrt{\kappa} + 1)$
2: $x_1 \leftarrow \mathcal{P}_{\mathcal{X}}[x_0 - \eta \nabla g(x_0)]$, $\tilde{x}_1 \leftarrow x_1$
3: for $t = 2, \cdots, T + 1$ do
4: $\quad x_t \leftarrow \mathcal{P}_{\mathcal{X}}[\tilde{x}_{t-1} - \eta \nabla g(\tilde{x}_{t-1})]$
5: $\quad \tilde{x}_t \leftarrow x_t + \theta(x_t - x_{t-1})$
6: end for

For Algorithm 1, the only necessary modification is to add projection steps to the Accelerated Gradient Descent procedure. The reason for the extra gradient step on line 2 is technical. From the original analysis [Nesterov, 2013, Theorem 2.2.3], it only follows that
$$\|x_{T+1} - x^*\|^2 \le \left[\|x_0 - x^*\|^2 + \frac{2}{m}\big(f(x_0) - f(x^*)\big)\right] \cdot \left(1 - \frac{1}{\sqrt{\kappa}}\right)^T.$$
In the constrained case, $f(x_0) - f(x^*) \le L\|x_0 - x^*\|^2$ does not hold in general. However, with the initial projected gradient step, it can be shown that $\|x_1 - x^*\| \le \|x_0 - x^*\|$ and that $f(x_1) - f(x^*) \le \frac{L}{2}\|x_0 - x^*\|^2$ (see Lemma 5). Thus
$$\|x_{T+1} - x^*\|^2 \le (\kappa + 1)\,\|x_0 - x^*\|^2 \left(1 - \frac{1}{\sqrt{\kappa}}\right)^T.$$
For Algorithms 3 and 4, the modified versions are presented below.
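As a concrete illustration of Algorithm 8, the following is a minimal NumPy sketch, assuming a user-supplied gradient oracle and a Euclidean projection oracle; the function names and the quadratic test problem are ours and are only meant to make the update rule explicit.

```python
import numpy as np

def agd_projected(grad, project, x0, l, m, T):
    """Sketch of Algorithm 8: accelerated gradient descent with projections.

    grad    : callable returning the gradient of g at a point
    project : callable implementing the projection P_X onto the feasible set
    x0      : initial point
    l, m    : smoothness constant and strong-convexity modulus of g
    T       : number of accelerated iterations
    """
    eta = 1.0 / l
    kappa = l / m
    theta = (np.sqrt(kappa) - 1.0) / (np.sqrt(kappa) + 1.0)
    # Initial projected gradient step (line 2 of Algorithm 8); it keeps the
    # function-value term in Nesterov's bound under control on a constrained
    # domain (see Lemma 5).
    x_prev = project(x0 - eta * grad(x0))
    x_tilde = x_prev.copy()
    for _ in range(T):
        x = project(x_tilde - eta * grad(x_tilde))       # projected gradient step
        x_tilde = x + theta * (x - x_prev)               # momentum extrapolation
        x_prev = x
    return x_prev

# Example: minimize a strongly convex quadratic over the unit Euclidean ball.
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n = 20
    A = rng.standard_normal((n, n))
    A = A.T @ A + np.eye(n)                              # l-smooth, m-strongly convex
    b = rng.standard_normal(n)
    grad = lambda x: A @ x - b
    project = lambda x: x / max(1.0, np.linalg.norm(x))  # projection onto the unit ball
    eigs = np.linalg.eigvalsh(A)
    x_hat = agd_projected(grad, project, np.zeros(n), l=eigs[-1], m=eigs[0], T=200)
```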
Algorithm 9
APPA-ABR (for Constrained Optimization)
Require: g ( · , · ), Initial point z = [ x ; y ], precision parameter M β ← max { m y , L xy } , M ← L m x m y ˆ y ← y κ ← β /m y , θ ← √ κ − √ κ +1 , τ ← √ κ +4 κ , T ← (cid:108) √ κ ln (cid:16) κ L M m x √ m x m y (cid:17)(cid:109) for t = 1 , · · · , T do ( x (cid:48) t , y (cid:48) t ) ← ABR( g ( x , y ) − β (cid:107) y − ˆ y t − (cid:107) , [ x t − ; y t − ], 1 /M , 2 β , 2 β , 3 L , 3 L ) x t ← P X (cid:2) x (cid:48) t − L ∇ x g ( x (cid:48) t , y (cid:48) t ) (cid:3) y t ← P Y (cid:2) y (cid:48) t + L ( ∇ y g ( x (cid:48) t , y (cid:48) t ) − β ( y (cid:48) t − ˆ y t − )) (cid:3) ˆ y t ← y t + θ ( y t − y t − ) + τ ( y t − ˆ y t − ) end forAlgorithm 10 Proximal Best Response (for Constrained Optimization)
Require:
Initial point z = [ x ; y ] β ← max { m x , L xy } , M ← L . m x m . y ˆ x ← x , κ ← β /m x , θ ← √ κ − √ κ +1 , τ ← √ κ +4 κ for t = 1 , · · · , T do ( x (cid:48) t , y (cid:48) t ) ← APPA-ABR( f ( x , y ) + β (cid:107) x − ˆ x t − (cid:107) , [ x t − , y t − ], M ) x t ← P X (cid:2) x (cid:48) t − L ( ∇ x f ( x (cid:48) t , y (cid:48) t ) + 2 β ( x (cid:48) t − ˆ x t − )) (cid:3) y t ← P Y (cid:2) y (cid:48) t + L ∇ y f ( x (cid:48) t , y (cid:48) t ) (cid:3) ˆ x t ← x t + θ ( x t − x t − ) + τ ( x t − ˆ x t − ) end for ˆ x ← P X (cid:2) x T − L ∇ x f ( x T , y T ) (cid:3) ˆ y ← P Y (cid:2) y T + L ∇ y f ( x T , y T ) (cid:3) The most significant change is the addition of a projected gradient descent-ascent step in line 5-6 ofAlgorithm 3 and line 5-6 and 9-10 of Algorithm 4. The reason for this modification is very similar tothat of the initial projected gradient descent step for AGD. For unconstrained problems, a small distanceto the saddle point implies a small duality gap (Fact 4); however this may not be true for constrainedproblems, since the saddle point may no longer be a stationary point. This is also true for minimization: if x ∗ = arg min x ∈X g ( x ) where g ( x ) is a L -smooth function g ( x ) − g ( x ∗ ) ≤ L (cid:107) x − x ∗ (cid:107) may not hold.Fortunately, there is a simple fix to this problem. By applying projected gradient descent-ascent once, wecan assure that a small distance implies small duality gap. This is specified by the following lemma, whichis the key reason why our result can be adapted to the constrained problem. Lemma 4.
Suppose that f ∈ F ( m x , m y , L x , L xy , L y ) , ( x ∗ , y ∗ ) is a saddle point of f , z = ( x , y ) satisfies (cid:107) z − z ∗ (cid:107) ≤ (cid:15) . Let ˆ z = (ˆ x , ˆ y ) be the result of one projected GDA update, i.e. ˆ x ← P X (cid:20) x − L ∇ x f ( x , y ) (cid:21) , ˆ y ← P Y (cid:20) y + 12 L ∇ y f ( x , y ) (cid:21) . hen (cid:107) ˆ z − z ∗ (cid:107) ≤ (cid:15) , and max y ∈Y f (ˆ x , y ) − min x ∈X f ( x , ˆ y ) ≤ (cid:32) L xy min { m x , m y } (cid:33) L(cid:15) . The proof of Lemma 4 is deferred to Sec. E.3.Because we would use Lemma 4 to replace (11) in the analysis of Algorithm 3 and 4, we would need toaccordingly increase M to L . m x m . y and M to L m x m y . Apart from this, another minor change in Algorithm 3is that it would terminate after a fixed number of iterations instead of based on a termination criterion. Thenumber of iterations is chosen such that (cid:107) x T − x ∗ (cid:107) + (cid:107) y T − y ∗ (cid:107) ≤ M [ (cid:107) x − x ∗ (cid:107) + (cid:107) y − y ∗ (cid:107) ] is guaranteed. E.2 Modification of Analysis
We now show that, after the modifications above, Theorem 3 continues to hold in the constrained case.
Theorem 3. (Modified) Assume that f ∈ F ( m x , m y , L x , L xy , L y ) . In Algorithm 4, the gradient complexityto find an (cid:15) -saddle point O (cid:32)(cid:115) L x m x + L · L xy m x m y + L y m y · ln (cid:18) L m x m y (cid:19) ln (cid:18) L m x m y · L (cid:107) z − z ∗ (cid:107) (cid:15) (cid:19)(cid:33) . The proof of this theorem is, for the most part, the same as the unconstrained version. Hence, we onlyneed to point out parts of the original proof that need to be modified for the constrained case.To start with, Theorem 1 holds in the constrained case. The proof of Theorem 1 only relies on theanalysis of AGD and the Lipschitz properties in Fact 1, and both still hold for constrained problems. (SeeLemma B.2 Lin et al. [2020] for the proof of Fact 1 in constrained problems.)As for Theorem 2, the key modification is about (11). As argued above, (11) uses the property g ( x ) − g ( x ∗ ) ≤ L (cid:107) x − x ∗ (cid:107) , which does not hold in constrained problems, since the optimum may not be a stationarypoint. Here, we would use Lemma 4 to derive a similar bound to replace (11). Note that originally (11) isonly used to derive δ t ≤ ˆ L +2 β (cid:15) t . Using Lemma 4, we can replace this with δ t ≤ max y ∈Y (cid:8) f ( x t , y ) + β (cid:107) x t − ˆ x t − (cid:107) (cid:9) − min x ∈X (cid:8) f ( x , y t ) + β (cid:107) x − ˆ x t − (cid:107) (cid:9) ≤ (cid:32) L xy m x m y (cid:33) L(cid:15) t . Accordingly, we can change C to 44 κ √ κ · L (cid:16) L xy m x m y (cid:17) C , and the assumption on M to M ≥ κ (cid:115) Lm x (cid:18) L xy m x m y (cid:19) (cid:18) Lm y (cid:19) . Then Theorem 2 would hold for the constrained case as well.Finally, as for Theorem 3, we need to re-verify that M and M satisfy the new assumptions of M inorder to apply Theorem 2. Observe that20 · β m y · (cid:115) L + 2 β ) m y · (cid:18) L xy β · m y (cid:19) · (cid:18) Lm x (cid:19) ≤ · Lm y · (cid:115) L m x m y · Lm x ≤ L m x m y = M , and that 20 · β m x · (cid:115) Lm x · L xy m x m y · Lm y ≤ √ L . m x m . y ≤ M .
24t follows that the number of iterations needed to find (cid:107) z T − z ∗ (cid:107) ≤ (cid:15) is O (cid:32)(cid:115) L x m x + L · L xy m x m y + L y m y · ln (cid:18) L m x m y (cid:19) ln (cid:18) L m x m y · (cid:107) z − z ∗ (cid:107) (cid:15) (cid:19)(cid:33) . It follows from Lemma 4 that the duality gap of (ˆ x , ˆ y ) is at mostmax y ∈Y f (ˆ x , y ) − min x ∈X f ( x , ˆ y ) ≤ (cid:32) L xy min { m x , m y } (cid:33) L(cid:15) . Resetting (cid:15) to (cid:113) (cid:15) min { m x ,m y } L proves the theorem. E.3 Properties of Projected Gradient
Lemma 5. If $g : \mathcal{X} \to \mathbb{R}$ is convex and $L$-smooth, $x^* = \arg\min_{x \in \mathcal{X}} g(x)$, and $\hat{x} = \mathcal{P}_{\mathcal{X}}\left[x_0 - \frac{1}{L}\nabla g(x_0)\right]$, then $\|\hat{x} - x^*\| \le \|x_0 - x^*\|$ and $g(\hat{x}) - g(x^*) \le \frac{L}{2}\|x_0 - x^*\|^2$.

Proof. By Corollary 2.2.1 [Nesterov, 2013], $(x_0 - \hat{x})^{\top}(x_0 - x^*) \ge \|\hat{x} - x_0\|^2$. Therefore
$$\|\hat{x} - x^*\|^2 = \|(x_0 - x^*) + (\hat{x} - x_0)\|^2 = \|x_0 - x^*\|^2 + 2(x_0 - x^*)^{\top}(\hat{x} - x_0) + \|\hat{x} - x_0\|^2 \le \|x_0 - x^*\|^2.$$
Meanwhile, note that $\hat{x} = \arg\min_{x \in \mathcal{X}}\left\{\nabla g(x_0)^{\top} x + \frac{L}{2}\|x - x_0\|^2\right\}$. By the optimality condition and the $L$-strong convexity of $\nabla g(x_0)^{\top} x + \frac{L}{2}\|x - x_0\|^2$, we have
$$\nabla g(x_0)^{\top}\hat{x} + \frac{L}{2}\|\hat{x} - x_0\|^2 + \frac{L}{2}\|\hat{x} - x^*\|^2 \le \nabla g(x_0)^{\top} x^* + \frac{L}{2}\|x^* - x_0\|^2.$$
Thus
$$\nabla g(x_0)^{\top}(\hat{x} - x^*) \le \frac{L}{2}\left[\|x^* - x_0\|^2 - \|\hat{x} - x_0\|^2 - \|\hat{x} - x^*\|^2\right].$$
It follows that
$$g(\hat{x}) - g(x^*) \le \nabla g(\hat{x})^{\top}(\hat{x} - x^*) = \nabla g(x_0)^{\top}(\hat{x} - x^*) + (\nabla g(\hat{x}) - \nabla g(x_0))^{\top}(\hat{x} - x^*)$$
$$\le \frac{L}{2}\|x^* - x_0\|^2 \underbrace{- \frac{L}{2}\|\hat{x} - x_0\|^2 - \frac{L}{2}\|\hat{x} - x^*\|^2 + L\|\hat{x} - x_0\|\cdot\|\hat{x} - x^*\|}_{\le 0} \le \frac{L}{2}\|x^* - x_0\|^2.$$
We then prove Lemma 4.
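Before the proof, note that the update analyzed in Lemma 4 is simply one projected gradient descent-ascent step with step size $1/(2L)$. The following sketch (our own helper names, assuming callables for the two partial gradients and the two projection oracles) makes the update explicit.

```python
def projected_gda_step(grad_x, grad_y, proj_X, proj_Y, x, y, L):
    """One projected gradient descent-ascent update with step size 1/(2L),
    as used in Lemma 4: descend in x, ascend in y, then project back onto
    the feasible sets."""
    step = 1.0 / (2.0 * L)
    x_hat = proj_X(x - step * grad_x(x, y))
    y_hat = proj_Y(y + step * grad_y(x, y))
    return x_hat, y_hat
```

Lemma 4 states that this single step does not increase the distance to the saddle point, and that it converts a small distance into a small duality gap.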
Proof of Lemma 4.
This can be seen as a special case of Proposition 2.2 [Nemirovski, 2004]. Define thegradient descent-ascent field to be F ( z ) := (cid:20) ∇ x f ( x , y ) −∇ y f ( x , y ) (cid:21) . Note that the ˆ z can also be written asˆ z = arg min z ∈X ×Y (cid:8) L (cid:107) z − z (cid:107) + F ( z ) T z (cid:9) . z (cid:48) = ( x (cid:48) , y (cid:48) ) to be x (cid:48) ← P X (cid:20) x − L ∇ x f (ˆ x , ˆ y ) (cid:21) , y (cid:48) ← P Y (cid:20) y + 12 L ∇ y f (ˆ x , ˆ y ) (cid:21) . In other words, z (cid:48) = arg min z ∈X ×Y (cid:8) L (cid:107) z − z (cid:107) + F (ˆ z ) T z (cid:9) . By the optimality condition and 2 L -strongconvexity of L (cid:107) z − z (cid:107) + F (ˆ z ) T z , for any z ∈ X × Y , L (cid:107) z (cid:48) − z (cid:107) + F (ˆ z ) T z (cid:48) + L (cid:107) z (cid:48) − z (cid:107) ≤ L (cid:107) z − z (cid:107) + F (ˆ z ) T z . Similarly, by optimality of ˆ z , L (cid:107) ˆ z − z (cid:107) + F ( z ) T ˆ z + L (cid:107) z (cid:48) − ˆ z (cid:107) ≤ L (cid:107) z (cid:48) − z (cid:107) + F ( z ) T z (cid:48) . Thus F (ˆ z ) T (ˆ z − z ) = F (ˆ z ) T ( z (cid:48) − z ) + F (ˆ z ) T (ˆ z − z (cid:48) )= F (ˆ z ) T ( z (cid:48) − z ) + F ( z ) T (ˆ z − z (cid:48) ) + ( F (ˆ z ) − F ( z )) T (ˆ z − z (cid:48) ) ≤ L (cid:0) (cid:107) z − z (cid:107) − (cid:107) z (cid:48) − z (cid:107) − (cid:107) z (cid:48) − z (cid:107) (cid:1) + ( F (ˆ z ) − F ( z )) T (ˆ z − z (cid:48) )+ L (cid:0) (cid:107) z (cid:48) − z (cid:107) − (cid:107) ˆ z − z (cid:107) − (cid:107) z (cid:48) − ˆ z (cid:107) (cid:1) ≤ L (cid:0) (cid:107) z − z (cid:107) − (cid:107) z (cid:48) − z (cid:107) (cid:1) + 2 L (cid:107) ˆ z − z (cid:107) · (cid:107) ˆ z − z (cid:48) (cid:107) − L (cid:107) ˆ z − z (cid:48) (cid:107) − L (cid:107) ˆ z − z (cid:107) ≤ L (cid:0) (cid:107) z − z (cid:107) − (cid:107) z (cid:48) − z (cid:107) (cid:1) . Here we used the fact that for any z , z , (cid:107) F ( z ) − F ( z ) (cid:107) ≤ L (cid:107) z − z (cid:107) . Note that (by convexity andconcavity) F (ˆ z ) T (ˆ z − z ) = ∇ x f (ˆ x , ˆ y ) T (ˆ x − x ) − ∇ y f (ˆ x , ˆ y ) T (ˆ y − y ) ≥ [ f (ˆ x , ˆ y ) − f ( x , ˆ y )] + [ f (ˆ x , y ) − f (ˆ x , ˆ y )] ≥ f (ˆ x , y ) − f ( x , ˆ y ) . If we choose x and y to be x ∗ (ˆ y ) and y ∗ (ˆ x ), we can see thatmax y ∈Y f (ˆ x , y ) − min x ∈X f ( x , ˆ y ) ≤ L (cid:107) z − z (cid:107) ≤ L (cid:107) z − z ∗ (cid:107) + 2 L (cid:107) z ∗ − z (cid:107) ≤ L (cid:107) x ∗ (ˆ y ) − x ∗ (cid:107) + 2 L (cid:107) y ∗ (ˆ x ) − y ∗ (cid:107) + 2 L (cid:107) z ∗ − z (cid:107) ≤ L xy min { m x , m y } · L (cid:107) ˆ z − z ∗ (cid:107) + 2 L (cid:107) z ∗ − z (cid:107) . By Corollary 2.2.1 [Nesterov, 2013], ( x − ˆ x ) T ( x − x ∗ ) ≥ (cid:107) ˆ x − x (cid:107) . Therefore (cid:107) ˆ x − x ∗ (cid:107) = (cid:107) ( x − x ∗ ) + (ˆ x − x ) (cid:107) = (cid:107) x − x ∗ (cid:107) + 2( x − x ∗ ) T (ˆ x − x ) + (cid:107) ˆ x − x (cid:107) ≤ (cid:107) x − x ∗ (cid:107) . Similarly, (cid:107) ˆ y − y ∗ (cid:107) ≤ (cid:107) y − y ∗ (cid:107) . Thus (cid:107) ˆ z − z ∗ (cid:107) = (cid:107) ˆ x − x ∗ (cid:107) + (cid:107) ˆ y − y ∗ (cid:107) ≤ (cid:107) z − z ∗ (cid:107) ≤ (cid:15) . It follows that max y ∈Y f (ˆ x , y ) − min x ∈X f ( x , ˆ y ) ≤ L · (cid:32) L xy min { m x , m y } + 1 (cid:33) (cid:15) . Implications of Theorem 3
In this section, we discuss how Theorem 3 implies improved bounds for strongly convex-concave problems and convex-concave problems via reductions established in Lin et al. [2020].

Let us consider the minimax optimization problem $\min_{x \in \mathcal{X}} \max_{y \in \mathcal{Y}} f(x, y)$, where $f(x, y)$ is $m_x$-strongly convex with respect to $x$, concave with respect to $y$, and $(L_x, L_{xy}, L_y)$-smooth. Here, we assume that $\mathcal{X}$ and $\mathcal{Y}$ are bounded sets, with diameters $D_x = \max_{x, x' \in \mathcal{X}}\|x - x'\|$ and $D_y = \max_{y, y' \in \mathcal{Y}}\|y - y'\|$.

Following Lin et al. [2020], let us consider the function
$$f_{\epsilon, y_0}(x, y) := f(x, y) - \frac{\epsilon}{2 D_y^2}\|y - y_0\|^2.$$
Recall that $(\hat{x}, \hat{y})$ is an $\epsilon$-saddle point of $f$ if $\max_{y \in \mathcal{Y}} f(\hat{x}, y) - \min_{x \in \mathcal{X}} f(x, \hat{y}) \le \epsilon$. We now show that an $(\epsilon/2)$-saddle point of $f_{\epsilon, y_0}$ is an $\epsilon$-saddle point of $f$. Let $x^*(\cdot) := \arg\min_{x \in \mathcal{X}} f(x, \cdot)$ and $y^*(\cdot) := \arg\max_{y \in \mathcal{Y}} f(\cdot, y)$. Obviously, for any $x \in \mathcal{X}$, $y \in \mathcal{Y}$,
$$f(x, y) - \epsilon/2 \le f_{\epsilon, y_0}(x, y) \le f(x, y).$$
Thus, if $(\hat{x}, \hat{y})$ is an $(\epsilon/2)$-saddle point of $f_{\epsilon, y_0}$, then
$$f(\hat{x}, y^*(\hat{x})) \le f_{\epsilon, y_0}(\hat{x}, y^*(\hat{x})) + \epsilon/2 \le \max_{y \in \mathcal{Y}} f_{\epsilon, y_0}(\hat{x}, y) + \epsilon/2,$$
$$f(x^*(\hat{y}), \hat{y}) \ge f_{\epsilon, y_0}(x^*(\hat{y}), \hat{y}) \ge \min_{x \in \mathcal{X}} f_{\epsilon, y_0}(x, \hat{y}).$$
It immediately follows that
$$\max_{y \in \mathcal{Y}} f(\hat{x}, y) - \min_{x \in \mathcal{X}} f(x, \hat{y}) \le \frac{\epsilon}{2} + \max_{y \in \mathcal{Y}} f_{\epsilon, y_0}(\hat{x}, y) - \min_{x \in \mathcal{X}} f_{\epsilon, y_0}(x, \hat{y}) \le \epsilon.$$
Thus, to find an $\epsilon$-saddle point of $f$, we only need to find an $(\epsilon/2)$-saddle point of $f_{\epsilon, y_0}$. We can now prove Corollary 1 by reducing to (the constrained version of) Theorem 3.

Observe that $f_{\epsilon, y_0}$ belongs to $\mathcal{F}\big(m_x, \epsilon/D_y^2, L_x, L_{xy}, L_y + \epsilon/D_y^2\big)$. Thus, by Theorem 3, the gradient complexity of finding an $(\epsilon/2)$-saddle point of $f_{\epsilon, y_0}$ is
$$O\left(\sqrt{\frac{L_x}{m_x} + \left(\frac{L \cdot L_{xy}}{m_x} + L_y\right)\cdot\frac{D_y^2}{\epsilon}}\;\cdot\;\ln\left(\frac{(D_x + D_y) L}{m_x \epsilon}\right)\right) = \tilde{O}\left(\sqrt{\frac{m_x \cdot L_y + L \cdot L_{xy}}{m_x \epsilon}}\right),$$
which proves Corollary 1.

Corollary 1. If $f(x, y)$ is $(L_x, L_{xy}, L_y)$-smooth and $m_x$-strongly convex w.r.t. $x$, via reduction to Theorem 3, the gradient complexity of finding an $\epsilon$-saddle point is $\tilde{O}\left(\sqrt{\frac{m_x \cdot L_y + L \cdot L_{xy}}{m_x \epsilon}}\right)$.

In comparison, Lin et al.'s result in this setting is $\tilde{O}\left(\sqrt{\frac{L^2}{m_x \epsilon}}\right)$. Meanwhile, a lower bound for this problem has been shown to be $\Omega\left(\sqrt{\frac{L_{xy}^2}{m_x \epsilon}}\right)$ [Ouyang and Xu, 2019]. It can be seen that when $L_{xy} \ll L$, our bound is a significant improvement over Lin et al.'s result, as $m_x \cdot L_y + L \cdot L_{xy} \ll L^2$.

Similarly, if $f : \mathcal{X} \times \mathcal{Y} \to \mathbb{R}$ is convex with respect to $x$, concave with respect to $y$, and $(L_x, L_{xy}, L_y)$-smooth, we can consider the function
$$f_{\epsilon}(x, y) := f(x, y) + \frac{\epsilon}{4 D_x^2}\|x - x_0\|^2 - \frac{\epsilon}{4 D_y^2}\|y - y_0\|^2.$$
It can be shown that for any $\hat{x} \in \mathcal{X}$,
$$\max_{y \in \mathcal{Y}}\left\{f(\hat{x}, y) + \frac{\epsilon}{4 D_x^2}\|\hat{x} - x_0\|^2 - \frac{\epsilon}{4 D_y^2}\|y - y_0\|^2\right\} \ge \max_{y \in \mathcal{Y}} f(\hat{x}, y) - \frac{\epsilon}{4}.$$
Here it is assumed that $\epsilon$ is sufficiently small, i.e. $\epsilon \le \max\{L_{xy}, m_x\} D_y^2$. Similarly, for any $\hat{y} \in \mathcal{Y}$,
$$\min_{x \in \mathcal{X}}\left\{f(x, \hat{y}) + \frac{\epsilon}{4 D_x^2}\|x - x_0\|^2 - \frac{\epsilon}{4 D_y^2}\|\hat{y} - y_0\|^2\right\} \le \min_{x \in \mathcal{X}} f(x, \hat{y}) + \frac{\epsilon}{4}.$$
Therefore, if $(\hat{x}, \hat{y})$ is an $(\epsilon/2)$-saddle point of $f_{\epsilon}$, it is an $\epsilon$-saddle point of $f$, as
$$\max_{y \in \mathcal{Y}} f(\hat{x}, y) - \min_{x \in \mathcal{X}} f(x, \hat{y}) \le \frac{\epsilon}{2} + \max_{y \in \mathcal{Y}} f_{\epsilon}(\hat{x}, y) - \min_{x \in \mathcal{X}} f_{\epsilon}(x, \hat{y}) \le \epsilon.$$
Observe that $f_{\epsilon}$ belongs to $\mathcal{F}\big(\frac{\epsilon}{2D_x^2}, \frac{\epsilon}{2D_y^2}, L_x + \frac{\epsilon}{2D_x^2}, L_{xy}, L_y + \frac{\epsilon}{2D_y^2}\big)$. Thus, by Theorem 3, the gradient complexity of finding an $(\epsilon/2)$-saddle point of $f_{\epsilon}$ is
$$O\left(\left(\sqrt{\frac{L_x D_x^2 + L_y D_y^2}{\epsilon}} + \frac{D_x D_y \sqrt{L \cdot L_{xy}}}{\epsilon}\right)\cdot\ln\left(\frac{L (D_x + D_y)^2}{\epsilon}\right)\right),$$
which proves Corollary 2.

Corollary 2. If $f(x, y)$ is $(L_x, L_{xy}, L_y)$-smooth and convex-concave, via reduction to Theorem 3, the gradient complexity to produce an $\epsilon$-saddle point is $\tilde{O}\left(\sqrt{\frac{L_x + L_y}{\epsilon}} + \frac{\sqrt{L \cdot L_{xy}}}{\epsilon}\right)$.

In comparison, Lin et al.'s result for this setting is $\tilde{O}\left(\frac{L}{\epsilon}\right)$, and the classic result for ExtraGradient is $O\left(\frac{L}{\epsilon}\right)$ [Nemirovski, 2004]. Meanwhile, a lower bound for this setting has been shown to be $\Omega\left(\sqrt{\frac{L_x}{\epsilon}} + \frac{L_{xy}}{\epsilon}\right)$ [Ouyang and Xu, 2019]. Again, our result can be a significant improvement over Lin et al.'s result if $L_{xy} \ll L$, and is closer to the lower bound.

G Proof of Theorem 4
We will start by proving several useful lemmas.
Lemma 2. ([Bai et al., 2003])
Define $M(\eta) := (\eta P + S)^{-1}(\eta P - G)(\eta P + G)^{-1}(\eta P - S)$. Then
$$\rho(M(\eta)) \le \|M(\eta)\| \le \max_{\lambda_i \in \mathrm{sp}(P^{-1}G)}\left|\frac{\lambda_i - \eta}{\lambda_i + \eta}\right| < 1.$$
Proof of Lemma 1.
We provide a proof for completeness. First, observe that
$$M(\eta) = (\eta P + S)^{-1}(\eta P - G)(\eta P + G)^{-1}(\eta P - S) = P^{-1/2}\,(\eta I + P^{-1/2} S P^{-1/2})^{-1}(\eta I - P^{-1/2} G P^{-1/2})(\eta I + P^{-1/2} G P^{-1/2})^{-1}(\eta I - P^{-1/2} S P^{-1/2})\,P^{1/2}.$$
Let $\hat{G} := P^{-1/2} G P^{-1/2}$ and $\hat{S} := P^{-1/2} S P^{-1/2}$. Then $M(\eta)$ is similar to
$$(\eta I + \hat{S})^{-1}(\eta I - \hat{G})(\eta I + \hat{G})^{-1}(\eta I - \hat{S}),$$
which is in turn similar to $(\eta I - \hat{G})(\eta I + \hat{G})^{-1}(\eta I - \hat{S})(\eta I + \hat{S})^{-1}$. The key observation is that $(\eta I - \hat{S})(\eta I + \hat{S})^{-1}$ is orthogonal, since $\hat{S}$ is skew-symmetric:
$$\left((\eta I + \hat{S})^{-1}\right)^{\top}(\eta I - \hat{S})^{\top}(\eta I - \hat{S})(\eta I + \hat{S})^{-1} = (\eta I - \hat{S})^{-1}(\eta I + \hat{S})(\eta I - \hat{S})(\eta I + \hat{S})^{-1} = (\eta I - \hat{S})^{-1}(\eta I - \hat{S})(\eta I + \hat{S})(\eta I + \hat{S})^{-1} = I.$$
Therefore
$$\rho(M(\eta)) \le \left\|(\eta I - \hat{G})(\eta I + \hat{G})^{-1}(\eta I - \hat{S})(\eta I + \hat{S})^{-1}\right\| \le \left\|(\eta I - \hat{G})(\eta I + \hat{G})^{-1}\right\|\cdot\left\|(\eta I - \hat{S})(\eta I + \hat{S})^{-1}\right\| = \left\|(\eta I - \hat{G})(\eta I + \hat{G})^{-1}\right\| = \max_{\lambda_i \in \mathrm{sp}(\hat{G})}\left|\frac{\lambda_i - \eta}{\lambda_i + \eta}\right| = \max_{\lambda_i \in \mathrm{sp}(P^{-1}G)}\left|\frac{\lambda_i - \eta}{\lambda_i + \eta}\right|.$$
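The bound is easy to verify numerically. The snippet below builds a random quadratic saddle instance (illustrative matrices of our own choosing, with a simple diagonal choice of $P$) and checks that the spectral radius of $M(\eta)$ is indeed dominated by $\max_{\lambda \in \mathrm{sp}(P^{-1}G)} |\lambda - \eta|/(\lambda + \eta)$.

```python
import numpy as np

rng = np.random.default_rng(2)
n, m = 6, 5
# Quadratic saddle problem: G is the symmetric part, S the skew-symmetric part.
A = rng.standard_normal((n, n))
A = A.T @ A + np.eye(n)
C = rng.standard_normal((m, m))
C = C.T @ C + np.eye(m)
B = rng.standard_normal((n, m))
G = np.block([[A, np.zeros((n, m))], [np.zeros((m, n)), C]])
S = np.block([[np.zeros((n, n)), B], [-B.T, np.zeros((m, m))]])
P = np.diag(rng.uniform(0.5, 2.0, n + m))            # any symmetric positive definite P works
eta = 1.0

M = np.linalg.solve(eta * P + S, eta * P - G) @ np.linalg.solve(eta * P + G, eta * P - S)
rho = max(abs(np.linalg.eigvals(M)))                 # spectral radius of M(eta)
lam = np.linalg.eigvals(np.linalg.solve(P, G)).real  # sp(P^{-1} G) is real and positive
bound = max(abs((lam - eta) / (lam + eta)))
assert rho <= bound + 1e-10 and bound < 1.0
print(f"rho(M(eta)) = {rho:.4f} <= bound = {bound:.4f}")
```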
Lemma 6.
The following statements about the eigenvalues and singular values of matrices hold:1. The singular values of J fall in [ m x , L xy + L x ] ;2. The condition number of η P + G is at most L x m x (cid:16) m y L xy (cid:17) k ;3. The condition number of η P + G is at most L x /m x .4. The eigenvalues of η ( α I + β A ) fall in [ ηα, ηβL x ] . The eigenvalues of η ( I + β C ) fall in [ η, ηβL x ] .Proof of Lemma 6.
1. Consider an arbitrary $x \in \mathbb{R}^{n+m}$ with $\|x\| = 1$. Construct a set of orthonormal vectors $\{x_1, \cdots, x_{n+m}\}$ with $x_1 = x$. Then
$$x^{\top} J^{\top} J x = \sum_{i=1}^{n+m} x^{\top} J^{\top} x_i x_i^{\top} J x = \sum_{i=1}^{n+m}\left(x^{\top} J^{\top} x_i\right)^2 \ge \left(x^{\top} J^{\top} x_1\right)^2 = \left(x^{\top} J^{\top} x\right)^2.$$
Since
$$J = G + S = \begin{bmatrix} A & \\ & C \end{bmatrix} + \begin{bmatrix} & B \\ -B^{\top} & \end{bmatrix},$$
where $S$ is skew-symmetric, $x^{\top} J^{\top} x = x^{\top} G x \ge m_x$. Thus $\sigma_{\min}(J) = \sqrt{\lambda_{\min}(J^{\top} J)} \ge m_x$. Meanwhile, $\lambda_{\max}(J^{\top} J) \le (\|G\| + \|S\|)^2 \le (L_{xy} + L_x)^2$.
2. Note that η P + G = (cid:20) η ( α I + β A ) + A η ( I + β C ) + C (cid:21) . Thus (cid:107) η P + G (cid:107) ≤ max { η ( α + βL x ) + L x , η (1 + βL x ) + L x } = η (1 + βL x ) + L x . On the other hand λ min ( η P + G ) ≥ min { ηα + ηβm x + m x , η + ηβm y + m y } = ηα + ηβm x + m x . Thus the condition number of η P + G is at most η (1 + βL x ) + L x ηα + ηβm x + m x ≤ L x ηα + 1 + βL x α + βm x ≤ L x ηα + 2 βL x α = L x m x (cid:18) m y L xy (cid:19) k + 2 L x m x (cid:18) m y L xy (cid:19) k ≤ L x m x (cid:18) m y L xy (cid:19) k .
3. On the other hand, η (1 + βL x ) + L x ηα + ηβm x + m x = η + ηβL x + L x ηα + ηβm x + m x ≤ max (cid:26) α , L x m x (cid:27) = L x m x .
4. Finally, let us consider the matrices $\eta(\alpha I + \beta A)$ and $\eta(I + \beta C)$. Obviously $\eta(\alpha I + \beta A) \succeq \eta\alpha I$ and $\eta(I + \beta C) \succeq \eta I$. Meanwhile,
$$\|\eta(\alpha I + \beta A)\| \le \eta\cdot(\alpha + \beta L_x) \le \eta(1 + \beta L_x) \quad (\alpha < 1) \quad \le 2\eta\beta L_x, \quad (\beta L_x > 1)$$
and similarly $\|\eta(I + \beta C)\| \le 2\eta\beta L_x$.

Lemma 7.
With our choice of η , α and β , ρ ( M ( η )) ≤ (cid:107) M ( η ) (cid:107) ≤ − (cid:18) m y L xy (cid:19) k . Proof of Lemma 7.
By Lemma 1, ρ ( M ( η )) ≤ (cid:107) M ( η ) (cid:107) ≤ max λ i ∈ sp ( P − G ) (cid:12)(cid:12)(cid:12)(cid:12) λ i − ηλ i + η (cid:12)(cid:12)(cid:12)(cid:12) . Observe that P − G = (cid:20) ( α I + β A ) − A ( I + β C ) − C (cid:21) . The eigenvalues of ( α I + β A ) − A are contained in (cid:20) m x α + βm x , L x α + βL x (cid:21) ⊆ (cid:20) m y , β (cid:21) . ( βm x ≤ α )Similarly the eigenvalues of ( I + β C ) − C are contained in (cid:20) m y βm y , L x βL x (cid:21) ⊆ (cid:20) m y , β (cid:21) . ( βm y ≤ η = L /k xy m − /k y = (cid:112) m y /β . As a result,max λ i ∈ sp ( P − G ) (cid:12)(cid:12)(cid:12)(cid:12) λ i − ηλ i + η (cid:12)(cid:12)(cid:12)(cid:12) ≤ max β − (cid:113) m y β β + (cid:113) m y β , (cid:113) m y β − m y (cid:113) m y β + m y ≤ − (cid:112) βm y − (cid:18) m y L xy (cid:19) k . Lemma 8.
When RHSS( k ) terminates (cid:107) z t − z ∗ (cid:107) ≤ (cid:15) (cid:107) z − z ∗ (cid:107) .Proof of Lemma 8. σ min ( J ) (cid:107) z t − z ∗ (cid:107) ≤ (cid:107) Jz t − b (cid:107) ≤ ˜ (cid:15) (cid:107) Jz − b (cid:107) ≤ σ max ( J ) (cid:107) z − z ∗ (cid:107) . We know that σ min ( J ) ≥ m x and that σ max ( J ) ≤ L x + L xy . Thus (cid:107) z t − z ∗ (cid:107) ≤ ˜ (cid:15) · ( L x + L xy ) m x (cid:107) z − z ∗ (cid:107) = (cid:15) (cid:107) z − z ∗ (cid:107) . Lemma 9 (Proposition 9.5.1, [Allaire and Kaber, 2008]) . CG( A , b , x , (cid:15) ) returns (i.e. satisfies (cid:107) Ax T − b (cid:107) ≤ (cid:15) (cid:107) Ax − b (cid:107) ) in at most (cid:108) √ κ ln (cid:16) √ κ(cid:15) (cid:17)(cid:109) iterations. emma 10. In RHSS( k ), (cid:107) z t +1 − z ∗ (cid:107) ≤ (cid:32) − (cid:18) m y L xy (cid:19) k (cid:33) (cid:107) z t − z ∗ (cid:107) . (20) Proof of Lemma 10.
Let us define˜ z t +1 / = (cid:20) ˜ x t +1 / ˜ y t +1 / (cid:21) = (cid:20) η ( α I + β A ) + A η ( I + β C ) + C (cid:21) − (cid:20) r r (cid:21) . Since (cid:107) ( η P + G )( z t +1 / − ˜ z t +1 / (cid:107) ≤ M (cid:107) ( η P + G )( z t − ˜ z t +1 / (cid:107) , (cid:107) z t +1 / − ˜ z t +1 / (cid:107) ≤ (cid:107) ( η P + G )( z t +1 / − ˜ z t +1 / (cid:107) λ min ( η P + G ) ≤ (cid:107) ( η P + G )( z t − ˜ z t +1 / (cid:107) M λ min ( η P + G ) ≤ λ max ( η P + G ) M λ min ( η P + G ) (cid:107) z t − ˜ z t +1 / (cid:107)≤ L x M m x (cid:107) z t − ˜ z t +1 / (cid:107) = m x m y L (cid:107) z t − ˜ z t +1 / (cid:107) . (21)Because ˜ z t +1 / − z ∗ = ( η P + G ) − ( η P − S ) ( z t − z ∗ ) , (cid:107) ˜ z t +1 / − z ∗ (cid:107) ≤ (cid:107) ( η P + G ) − (cid:107) (cid:107) η P − S (cid:107) (cid:107) z t − z ∗ (cid:107)≤ ηα · ( L xy + ηα + ηβL x ) · (cid:107) z t − z ∗ (cid:107)≤ (cid:18) Lm x (cid:19) (cid:107) z t − z ∗ (cid:107) . It follows that (cid:107) z t − ˜ z t +1 / (cid:107) ≤ (cid:107) z t − z ∗ (cid:107) + (cid:107) ˜ z t +1 / − z ∗ (cid:107) ≤ (cid:18) Lm x (cid:19) (cid:107) z t − z ∗ (cid:107) . By plugging this into (21), one gets (cid:107) z t +1 / − ˜ z t +1 / (cid:107) ≤ m x m y L · (cid:18) Lm x (cid:19) (cid:107) z t − z ∗ (cid:107) ≤ m y L (cid:107) z t − z ∗ (cid:107) . (22)Now, let us define ˜ z t +1 := ( η P + S ) − (cid:2) ( η P − G )˜ z t +1 / + b (cid:3) , ˆ z t +1 := ( η P + S ) − (cid:2) ( η P − G ) z t +1 / + b (cid:3) . First let us try to bound (cid:107) ˜ z t +1 − ˆ z t +1 (cid:107) . Observe that ˆ z t +1 − z ∗ = ( η P + S ) − ( η P − G )( z t +1 / − z ∗ ), so (cid:107) ˜ z t +1 − ˆ z t +1 (cid:107) = (cid:107) (˜ z t +1 − z ∗ ) − (ˆ z t +1 − z ∗ ) (cid:107) = (cid:107) ( η P + G ) − ( η P − S ) (˜ z t +1 / − z t +1 / ) (cid:107)≤ (cid:107) ( η P + G ) − ( η P − S ) (cid:107) · (cid:107) ˜ z t +1 / − z t +1 / (cid:107)≤ L m y · m y L (cid:107) z t − z ∗ (cid:107) = m y L (cid:107) z t − z ∗ (cid:107) . (23)Next, by Lemma 8 on RHSS( k − (cid:107) z t +1 − ˆ z t +1 (cid:107) ≤ M (cid:107) z t − ˆ z t +1 (cid:107) ≤ M ( (cid:107) z t − z ∗ (cid:107) + (cid:107) ˆ z t +1 − z ∗ (cid:107) ) .
31y Lemma 7, (cid:107) ˜ z t +1 − z ∗ (cid:107) = (cid:107) M ( η )( z t − z ∗ ) (cid:107) ≤ (cid:32) − (cid:18) m y L xy (cid:19) k (cid:33) (cid:107) z t − z ∗ (cid:107) . (24)Thus (cid:107) z t +1 − ˆ z t +1 (cid:107) ≤ M (2 (cid:107) z t − z ∗ (cid:107) + (cid:107) ˜ z t +1 − ˆ z t +1 (cid:107) ) . (25)Combining (23) and (25), one gets (cid:107) z t +1 − ˜ z t +1 (cid:107) ≤ (cid:107) z t +1 − ˆ z t +1 (cid:107) + (cid:107) ˆ z t +1 − ˜ z t +1 (cid:107)≤ M (cid:107) z t − z ∗ (cid:107) + (cid:18) M (cid:19) (cid:107) ˆ z t +1 − ˜ z t +1 (cid:107)≤ m y L xy (cid:107) z t − z ∗ (cid:107) + m y L (cid:107) z t − z ∗ (cid:107) ≤ m y L xy (cid:107) z t − z ∗ (cid:107) . Combining this with (24), one gets (cid:107) z t +1 − z ∗ (cid:107) ≤ (cid:107) ˜ z t +1 − z ∗ (cid:107) + (cid:107) ˜ z t +1 − z t +1 (cid:107)≤ (cid:32) − (cid:18) m y L xy (cid:19) k (cid:33) (cid:107) z t − z ∗ (cid:107) + m y L xy (cid:107) z t − z ∗ (cid:107)≤ (cid:32) − (cid:18) m y L xy (cid:19) k (cid:33) (cid:107) z t − z ∗ (cid:107) . Finally, we are ready to prove Theorem 4.
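For intuition, the iteration analyzed in Lemmas 7-10 is an inexact version of the classical splitting iteration built from the matrices $\eta P + G$ and $\eta P + S$; RHSS replaces the two exact linear solves with conjugate gradient and with a recursive call, respectively. The sketch below implements the exact version on a toy quadratic instance (our own illustrative matrices and choice of $P$ and $\eta$); its error contracts by the factor $\rho(M(\eta))$ bounded above.

```python
import numpy as np

def hss_iteration(G, S, P, b, eta, T, z0=None):
    """Exact splitting iteration for J z = b with J = G + S, where G is symmetric
    positive definite, S is skew-symmetric and P is symmetric positive definite.
    The algorithm in the paper replaces the two exact solves below with CG and
    with a recursive call, respectively."""
    z = np.zeros(len(b)) if z0 is None else z0.copy()
    for _ in range(T):
        z_half = np.linalg.solve(eta * P + G, (eta * P - S) @ z + b)   # first half-step
        z = np.linalg.solve(eta * P + S, (eta * P - G) @ z_half + b)   # second half-step
    return z

# Toy instance: random quadratic saddle problem, P = I, eta = geometric mean of sp(G).
rng = np.random.default_rng(4)
n, m = 5, 4
A = rng.standard_normal((n, n))
A = A.T @ A + np.eye(n)
C = rng.standard_normal((m, m))
C = C.T @ C + np.eye(m)
B = rng.standard_normal((n, m))
G = np.block([[A, np.zeros((n, m))], [np.zeros((m, n)), C]])
S = np.block([[np.zeros((n, n)), B], [-B.T, np.zeros((m, m))]])
b = rng.standard_normal(n + m)
lam = np.linalg.eigvalsh(G)
z = hss_iteration(G, S, np.eye(n + m), b, eta=np.sqrt(lam[0] * lam[-1]), T=200)
print(np.linalg.norm(z - np.linalg.solve(G + S, b)))   # distance to the exact saddle point
```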
Theorem 4.
There exist constants $C_1, C_2$ such that the number of matrix-vector products needed to find $(x_T, y_T)$ such that $\|z_T - z^*\| \le \epsilon$ is at most
$$\sqrt{\frac{L_{xy}^2}{m_x m_y} + \left(\frac{L_x}{m_x} + \frac{L_y}{m_y}\right)\left(1 + \left(\frac{L_{xy}}{\max\{m_x, m_y\}}\right)^{1/k}\right)}\cdot\left(C_1\ln\left(\frac{C_2 L^2}{m_x m_y}\right)\right)^{k+3}\ln\left(\frac{\|z_0 - z^*\|}{\epsilon}\right). \qquad (26)$$
By Lemma 10, when running RHSS( k ), (cid:107) z T − z ∗ (cid:107) ≤ (cid:32) − (cid:18) m y L xy (cid:19) k (cid:33) T (cid:107) z − z ∗ (cid:107) . Thus, when
$T > \left(\frac{L_{xy}}{m_y}\right)^{1/k}\cdot\ln\left(\frac{\|z_0 - z^*\|}{\epsilon}\right)$, one can ensure that $\|z_T - z^*\| \le \epsilon$. Now we can focus on the number of matrix-vector products needed per iteration, which comes in two parts: the cost of calling conjugate gradient and the cost of calling RHSS($k-1$).
The matrix to be solved via conjugate gradient is η P + G . By Lemma 6, itscondition number is upper bounded by L x m x (cid:16) m y L xy (cid:17) /k . By Lemma 9, the number of matrix-vector productsneeded for calling CG is (cid:115) L x m x (cid:18) m y L xy (cid:19) /k ln (cid:115) L x m x (cid:18) m y L xy (cid:19) /k M ≤ c (cid:115) L x m x (cid:18) m y L xy (cid:19) k · ln (cid:18) c L m x m y (cid:19) , (27)for some constants c , c >
0.
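Lemma 9 quotes the standard complexity bound for the conjugate gradient method with a relative-residual stopping rule. For reference, a generic CG sketch with exactly that stopping criterion (a textbook implementation in our own notation, not the paper's specific subroutine) looks as follows.

```python
import numpy as np

def cg(A, b, x0, eps, max_iter=None):
    """Conjugate gradient for A x = b with A symmetric positive definite.
    Stops once ||A x_t - b|| <= eps * ||A x_0 - b||, the criterion in Lemma 9."""
    x = x0.astype(float)
    r = b - A @ x                        # residual
    p = r.copy()
    target = eps * np.linalg.norm(r)
    rs = r @ r
    if max_iter is None:
        max_iter = len(b)                # illustrative default; CG is exact in n steps in exact arithmetic
    for _ in range(max_iter):
        if np.sqrt(rs) <= target:
            break
        Ap = A @ p
        alpha = rs / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x
```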
HSS( k − ) cost By Lemma 6, the new saddle point problem involving η P + S has parameters m (cid:48) x = ηα , m (cid:48) y = η , L (cid:48) x = L (cid:48) y = 2 ηβL x , L (cid:48) xy = L xy . It is easy to see that m (cid:48) y = η ≥ m y , m (cid:48) x = ( m x /m y ) m (cid:48) y ≥ m x , andthat L (cid:48) x = L (cid:48) y ≤ L x . Thus L (cid:48) = max { L (cid:48) x , L (cid:48) y , L (cid:48) xy } ≤ L . Assuming that Theorem 4 holds for RHSS( k − (cid:118)(cid:117)(cid:117)(cid:116) L xy m (cid:48) x m (cid:48) y + (cid:18) L (cid:48) x m (cid:48) x + L (cid:48) y m (cid:48) y (cid:19) (cid:32) (cid:18) L xy max { m (cid:48) x , m (cid:48) y } (cid:19) / ( k − (cid:33)(cid:124) (cid:123)(cid:122) (cid:125) ( a ) · (cid:18) C ln (cid:18) C L (cid:48) m (cid:48) y m (cid:48) x (cid:19)(cid:19) k +2 · ln (cid:18) L M m x (cid:19)(cid:124) (cid:123)(cid:122) (cid:125) ( b ) . Here we used Lemma 8, that when (cid:107) z t − z ∗ (cid:107) ≤ (cid:16) m x L x + L xy (cid:17) (cid:107) z − z ∗ (cid:107) , RHSS( k −
1) returns. Assume that C >
8. Note that L xy m (cid:48) x m (cid:48) y = L xy αη = L xy m x m y · L /k xy m − /k y = L xy m x m y (cid:18) m y L xy (cid:19) k ,L xy m (cid:48) y = L xy η = L xy L /k xy m − /k y = (cid:18) L xy m y (cid:19) k − k . Therefore ( a ) ≤ (cid:118)(cid:117)(cid:117)(cid:116) L xy m x m y (cid:18) m y L xy (cid:19) k + 2 L x m x (cid:18) m y L xy (cid:19) k · (cid:32) (cid:18) L xy m y (cid:19) k (cid:33) ≤ (cid:115) L xy m x m y (cid:18) m y L xy (cid:19) k + L x m x (cid:18) m y L xy (cid:19) k , ln (cid:18) C L (cid:48) m (cid:48) y m (cid:48) x (cid:19) ≤ ln (cid:18) C L m x m y (cid:19) ≤ (cid:18) C L m x m y (cid:19) , ( b ) ≤ ln (cid:18) L m x m y (cid:19) ≤ (cid:18) L m x m y (cid:19) . Thus the cost of calling RHSS( k −
1) is at most4 C k +21 ln k +3 (cid:18) C L m x m y (cid:19) (cid:115) L xy m x m y (cid:18) m y L xy (cid:19) k + L x m x (cid:18) m y L xy (cid:19) k . (28)In the case where k = 2, RHSS( k −
1) is exactly Proximal Best Response (Algorithm 4). Hence, byTheorem 3, the number of matrix-vector products needed is at most O (cid:32)(cid:115) L xy · max { L xy , L (cid:48) } m (cid:48) x m (cid:48) y + L (cid:48) x m (cid:48) x + L (cid:48) y m (cid:48) y · ln (cid:18) L (cid:48) m (cid:48) x m (cid:48) y (cid:19) ln (cid:18) L (cid:48) M m (cid:48) x m (cid:48) y (cid:19)(cid:33) = O (cid:32)(cid:115) L xy m x + L x √ m y m x (cid:112) L xy ln (cid:18) L m x m y (cid:19)(cid:33) . By this, we mean there exists constants c , c > c (cid:115) L xy m x + L x √ m y m x (cid:112) L xy ln (cid:18) c L m x m y (cid:19) . Thus, (28) also holds for k = 2, provided that C ≥ c and C ≥ c .33 otal cost. By combining (27) and (28), we can see that the cost (i.e. number of matrix-vector products)of RHSS( k ) per iteration is (cid:0) C k +21 + c (cid:1) ln k +3 (cid:18) max { c , C } L m x m y (cid:19) (cid:115) L xy m x m y (cid:18) m y L xy (cid:19) k + L x m x (cid:18) m y L xy (cid:19) k . Let us choose C > max { c , } and C > max { c , } . Then, in order to ensure that (cid:107) z T − z ∗ (cid:107) ≤ (cid:15) , thenumber of matrix-vector products that RHSS( k ) needs is4 (cid:18) L xy m y (cid:19) /k ln (cid:18) (cid:107) z − z ∗ (cid:107) (cid:15) (cid:19) · (cid:0) C k +21 + c (cid:1) ln k +3 (cid:18) C L m x m y (cid:19) (cid:115) L xy m x m y (cid:18) m y L xy (cid:19) k + L x m x (cid:18) m y L xy (cid:19) k ≤ C k +21 ln (cid:18) (cid:107) z − z ∗ (cid:107) (cid:15) (cid:19) ln k +3 (cid:18) C L m x m y (cid:19) (cid:115) L xy m x m y (cid:18) m y L xy (cid:19) + L x m x (cid:18) L xy m y (cid:19) k ≤ (cid:118)(cid:117)(cid:117)(cid:116) L xy m x m y + (cid:18) L x m x + L y m y (cid:19) (cid:32) (cid:18) L xy m y (cid:19) /k (cid:33) · (cid:18) C ln (cid:18) C L m x m y (cid:19)(cid:19) k +3 ln (cid:18) (cid:107) z − z ∗ (cid:107) (cid:15) (cid:19) . We now discuss how to choose the optimal k . Observe that(6) ≤ (cid:115) L xy m x m y + L x m x + L y m y ln (cid:18) (cid:107) z − z ∗ (cid:107) (cid:15) (cid:19) · (cid:18) L m x m y (cid:19) k (cid:18) C ln (cid:18) C L m x m y (cid:19)(cid:19) k +3 (cid:124) (cid:123)(cid:122) (cid:125) ( a ) . Compared to the lower bound, there is only one additional factor ( a ), whose logarithm isln (cid:32)(cid:18) L m x m y (cid:19) k C k +31 ln k +3 (cid:18) C L m x m y (cid:19)(cid:33) = 12 k ln (cid:18) L m x m y (cid:19) + ( k + 3) ln (cid:18) C ln (cid:18) C L m x m y (cid:19)(cid:19) , which is minimized when k = (cid:115) ln (cid:16) L m x m y (cid:17) (cid:16) C ln (cid:16) L m x m y (cid:17)(cid:17) , and the minimum value is3 ln (cid:18) C ln (cid:18) C L m x m y (cid:19)(cid:19) + (cid:115)
$$\sqrt{2\ln\left(\frac{L^2}{m_x m_y}\right)\ln\left(C_1\ln\left(\frac{C_2 L^2}{m_x m_y}\right)\right)} = o\left(\ln\left(\frac{L^2}{m_x m_y}\right)\right).$$
That is, the factor $(a)$ is sub-polynomial in $\frac{L^2}{m_x m_y}$. This proves Corollary 3, which states that, when
$$k = \Theta\left(\sqrt{\ln\left(\frac{L^2}{m_x m_y}\right)\Big/\ln\ln\left(\frac{L^2}{m_x m_y}\right)}\right),$$
the number of matrix-vector products that RHSS($k$) needs to find $z_T$ such that $\|z_T - z^*\| \le \epsilon$ is
$$\sqrt{\frac{L_{xy}^2}{m_x m_y} + \frac{L_x}{m_x} + \frac{L_y}{m_y}}\cdot\ln\left(\frac{\|z_0 - z^*\|}{\epsilon}\right)\cdot\left(\frac{L^2}{m_x m_y}\right)^{o(1)}.$$
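As a closing remark, the choice of $k$ in Corollary 3 is easy to evaluate in practice. The helper below is our own utility, not part of the paper: it ignores the unspecified constants $C_1, C_2$ and returns the order-of-magnitude value of $k$ suggested by the corollary, with $\kappa$ denoting the ratio $L^2/(m_x m_y)$ that appears inside the logarithms.

```python
import numpy as np

def recommended_k(L, m_x, m_y):
    """Order-of-magnitude recursion depth from Corollary 3 (constants ignored):
    k ~ sqrt(ln(kappa) / ln(ln(kappa))) with kappa = L^2 / (m_x * m_y)."""
    kappa = L ** 2 / (m_x * m_y)
    return max(1, int(round(np.sqrt(np.log(kappa) / np.log(np.log(kappa))))))

print(recommended_k(L=1e4, m_x=1.0, m_y=1.0))   # kappa = 1e8 gives k = 3
```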