Efficient Search of First-Order Nash Equilibria in Nonconvex-Concave Smooth Min-Max Problems
EEFFICIENT SEARCH OF FIRST-ORDER NASH EQUILIBRIAIN NONCONVEX-CONCAVE SMOOTH MIN-MAX PROBLEMS
DMITRII M. OSTROVSKII ∗ , ANDREW LOWY ∗ , AND
MEISAM RAZAVIYAYN ∗ Abstract.
We propose an efficient algorithm for finding first-order Nash equilibria in min-max problems of the form min x ∈ X max y ∈ Y F ( x, y ), where the objective function is smooth in bothvariables and concave with respect to y ; the sets X and Y are convex and “projection-friendly,” and Y is compact. Our goal is to find an ( ε x , ε y )-first-order Nash equilibrium with respect to a stationaritycriterion that is stronger than the commonly used proximal gradient norm. The proposed approachis fairly simple: we perform approximate proximal-point iterations on the primal function, withinexact oracle provided by Nesterov’s algorithm run on the regularized function F ( x t , · ), x t beingthe current primal iterate. The resulting iteration complexity is O ( ε x − ε y − / ) up to a logarithmicfactor. As a byproduct, the choice ε y = O ( ε x ) allows for the O ( ε x − ) complexity of finding an ε x -stationary point for the standard Moreau envelope of the primal function. Moreover, when theobjective is strongly concave with respect to y , the complexity estimate for our algorithm improves to O ( ε x − κ y / ) up to a logarithmic factor, where κ y is the condition number appropriately adjusted forcoupling. In both scenarios, the complexity estimates are the best known so far, and are only knownfor the (weaker) proximal gradient norm criterion. Meanwhile, our approach is “user-friendly”: (i)the algorithm is built upon running a variant of Nesterov’s accelerated algorithm as subroutine andavoids extragradient steps; (ii) the convergence analysis recycles the well-known results on acceleratedmethods with inexact oracle. Finally, we extend the approach to non-Euclidean proximal geometries. Key words. first-order Nash equilibria, stationary points, nonconvex min-max problems
AMS subject classifications.
1. Introduction.
In recent years, min-max problems have received significantattention across the optimization and machine learning communities due to theirapplications in training generative adversarial networks (GANs) [13], training machinelearning models that are robust to adversarial attacks [21], reinforcement learning [8],fair statistical inference [1], and distributed non-convex optimization [20], to name afew. These applications involve solving optimization problems in the general form(1.1) min x ∈ X max y ∈ Y F ( x, y ) , where F is a smooth objective function and X, Y is a pair of convex sets in thecorresponding Euclidean spaces X , Y . When the objective is convex in x and concavein y , (1.1) is well-studied. In this case, the corresponding variational inequality ismonotone, and there is a number of efficient algorithms known for solving it, even atthe optimal rate (see, e.g., [22], [24], [31]). However, many of the applications discussedabove involve an objective F that is nonconvex in x and not necessarily concave in y ,which makes the problem much harder to solve. In fact, even Nash equilibria are notguaranteed to exist in (1.1) in this general nonconvex-nonconcave setting.In this work, we study (1.1) under the assumption that F ( x, y ) is concave in y but do not assume convexity x . To the best of our knowledge, [29] was the first workproviding non-asymptotic convergence rates for nonconvex-concave problems withoutassuming special structure of the objective function. They use the notion of ε -firstorder Nash equilibrium (FNE) to measure the rate of convergence of their approach.This notion looks at the min-max problem as a two-player zero-sum game and usesthe first-order optimality condition with respect to each variable as the optimality ∗ Viterbi School of Engineering, University of Southern California, Los Angeles, CA 90089, USA.Email: [email protected], [email protected], [email protected] a r X i v : . [ m a t h . O C ] D ec D. M. OSTROVSKII, A. LOWY, AND M. RAZAVIYAYN measure. Using this notion, they showed that their algorithm finds an ε -first-orderNash equilibrium in O ( ε − . ) gradient evaluations. In this work, we use a similaroptimality notion to the one in [29]; however, in order to measure the optimalityw.r.t. the x, y variables separately, we generalize their notion to ( ε x , ε y )-first orderstationarity. Therefore, by setting ε x = ε y = ε , we obtain the optimality measure usedin [29]. Using the defined first order stationarity measure, we propose an algorithmthat can find ( ε x , ε y )-first order stationary point in O ( ε x − ε y − / ) gradient evaluations.Another way to measure the convergence rate of an algorithm for solving (1.1), isto define the primal function ϕ ( x ) = max y ∈ Y F ( x, y ), and measure the first-order opti-mality in terms of the nonconvex problem min x ∈ X ϕ ( x ). In this context, a commonlyused inaccuracy measure for a candidate solution (cid:98) x is the gradient norm of the standardMoreau envelope of the primal function (see Section 5 for more details). Using thisviewpoint, subtle analyses have been provided in [34, 17], and more recently, in theconcurrent work [19] whose preprint was announced a few days prior to ours. Morerecently, a similar approach has been used in [35]. 
The underlying idea in all theseworks, as well as in ours, is to obtain the next iterate ( x t +1 , y t +1 ) by approximatelysolving a strongly-convex-concave saddle-point problem(1.2) min x ∈ X max y ∈ Y [ F ( x, y ) + L xx (cid:107) x − x t (cid:107) ] , where L xx is the uniform over y ∈ Y bound on the Lipschitz constant of ∇ x F ( · , y ).Yet, there are notable differences between all these works, which we shall now discuss. The work [34] focuses on the problem of finding an ε x -stationary point of the Moreau envelope ϕ L xx ( x ) := min x (cid:48) ∈ X [ ϕ ( x (cid:48) ) + L xx (cid:107) x (cid:48) − x (cid:107) ].To achieve this goal, they solve (1.2) up to accuracy (cid:15) in objective value. The resultingscheme produces a point (cid:98) x satisfying (cid:107) ϕ L xx ( (cid:98) x ) (cid:107) (cid:54) ε x in O ( ε x ) oracle calls, providedthat one takes (cid:15) = O ( ε x ). However, they only handle the case X = X and do notprovide an algorithm to reach an ( ε x , ε y )-FNE for general ( ε x , ε y ); modifying theirscheme correspondingly might be challenging because of Y (cid:54) = Y . (Our discussionsin Section 5 shed some light on such intricacies). Moreover, their proposed algorithmis way less transparent than ours: it comprises an extragradient-type scheme as wellas Nesterov’s acceleration, whereas our approach is based solely on the analysis of FastGradient Method (FGM) a version of Nesterov’s accelerated algorithm, and uses thereadily available results for inexact-oracle FGM due to [12].The approach in the concurrent work [19] more closely resembles ours. In particu-lar, their scheme also avoids using an extragradient-type subroutine, and producesan ( ε x , ε y )-FNE in O ( ε x − ε y − / ) oracle calls; setting ε y = O ( ε x ) then allows to recoverthe O ( ε − x ) result for the Moreau envelope. However, they use the definition of ( ε x , ε y )-FNE based on the proximal gradient norm, or the weak criterion in our terminologyin Section 2 (cf. (2.4)), whereas our scheme produces an ( ε x , ε y )-FNE with respect tothe so-called strong criterion (cf. (2.3)). As we discuss in Remark 2.2, the latter task ismore challenging: an ( ε x , ε y )-FNE with respect to the strong criterion is also such withrespect to the weak criterion (hence the names); meanwhile, a guarantee on the weakcriterion does not imply any guarantee on the strong one in the presense of constraints.Moreover, the ε y = O ( ε x ) reduction for the Moreau envelope in [19] does not allowfor X (cid:54) = X (same as in [34]). In addition, this reduction relies on the result [18,Prop. 4.12]; as we discuss in Section 5, this result seems to be invalid unless Y = Y ,which is irrelevant in the context of (1.1). We rectify both these issues in Section 5through the delicate use of the strong criterion. It should be noted, however, that theauthors of [19] focus on the convex-concave scenario which we do not address here. FFICIENT SEARCH OF FIRST-ORDER NASH EQUILIBRIA O ( ε x − ε y − / ) complexity to find an ( ε x , ε y )-first-order stationary point; however, to the best of our understanding, their stationaritycriterion is not directly comparable to ours nor to the one in [19]. Setting ε y = O ( ε x ),the authors of [17] obtain O ( ε − x ) complexity result for the criterion similar to thegradient norm of the Moreau envelope, but slightly weaker (as follows by comparing [17,Eq. (2)] with the second claim of our Proposition 5.1). 
However, they assume directaccess to the gradient of the smooth function max y ∈ Y [ F ( x, y ) − λ (cid:107) y (cid:107) ], which isunrealistic (e.g., we solve a similar task via an FGM subroutine). This assumptionhas been removed in the recent work [35] (that appeared some time after our work).Moreover, while [35] only focuses on the primal accuracy measure (in the samesense as [17]), they address the general Bregman geometries (which we also do,see Appendix B). However, [35] do not address the general ( ε x , ε y )-stationarity notion.Finally, let us discuss the precursor work [29]. In our terminlogy, [29] showsan O ( ε x − ε y − / ) complexity estimate to find an ( ε x , ε y )-FNE, using a notion ofstationarity weaker than the one in [19] (and thus a fortiori weaker than the one weuse in the present work). The extra ε − y complexity factor compared to our resultcomes from using a more naive algorithmic approach: instead of forming an iteratesequence by solving (1.2), they proceed by running projected gradient descent onthe smoothed primal function max y ∈ Y [ F ( x, y ) − λ y (cid:107) y (cid:107) ] . Taking λ = ε y /R y ensuresthat ∇ y [ F ( x, y ) − λ y (cid:107) y (cid:107) ] ≈ ∇ y F ( x, y ) up to O ( ε y ) error, which guarantees that ( ε x , ε y )-FNE for the new problem is still valid for the initial one. However, gradient descentsuffers from a poor smoothness of the smoothed primal function, whose gradient isonly Lipschitz with modulus L xx + O ( L xy /λ y ) = O ( ε − y ), which results in the finaliteration complexity estimate O ( ε x − ε y − / ).
2. Problem formulation and preview of the main result.
We study themin-max problem (1.1) in the setting where
X, Y are convex and “projection-friendly”sets with non-empty interior in the corresponding Euclidean spaces X , Y ; moreover, Y is contained in a Euclidean ball with radius R y < ∞ . The function F : X × Y → R isconcave in y for all x ∈ X , and has Lipschitz gradient, namely, the inequalities(2.1) (cid:107)∇ x F ( x (cid:48) , y ) − ∇ x F ( x, y ) (cid:107) (cid:54) L xx (cid:107) x (cid:48) − x (cid:107) , (cid:107)∇ y F ( x, y (cid:48) ) − ∇ y F ( x, y ) (cid:107) (cid:54) L yy (cid:107) y (cid:48) − y (cid:107) , (cid:107)∇ x F ( x, y (cid:48) ) − ∇ x F ( x, y ) (cid:107) (cid:54) L xy (cid:107) y (cid:48) − y (cid:107) , (cid:107)∇ y F ( x (cid:48) , y ) − ∇ y F ( x, y ) (cid:107) (cid:54) L xy (cid:107) x (cid:48) − x (cid:107) hold uniformly over x, x (cid:48) ∈ X and y, y (cid:48) ∈ Y with Lipschitz constants L xx , L yy , L xy .Here and in what follows, (cid:107) · (cid:107) and (cid:104)· , ·(cid:105) denote the standard Euclidean norm and innerproduct (regardless of the space), and [ ∇ x F ( x, y ) , ∇ y F ( x, y )] are the components ofthe full gradient ∇ F ( x, y ). Instead of seeking an exact solution to (1.1), we focus onthe more feasible task of finding an approximate first-order Nash equilibrium . Definition
A point ( (cid:98) x, (cid:98) y ) ∈ X × Y is called ( ε x , ε y )-approximate first-orderNash equilibrium (( ε x , ε y )-FNE) in the problem (1.1) if the following holds: (2.2) S X ( (cid:98) x, ∇ x F ( (cid:98) x, (cid:98) y ) , L xx ) (cid:54) ε x and S Y ( (cid:98) y, −∇ y F ( (cid:98) x, (cid:98) y ) , L yy ) (cid:54) ε y , where the inaccuracy measure S Z , with Z being a convex subset of a Euclidean space Z ,is defined on triples z, ζ ∈ Z , L (cid:62) as follows: S Z ( z, ζ, L ) := 2 L max z (cid:48) ∈ Z (cid:2) − (cid:104) ζ, z (cid:48) − z (cid:105) − L (cid:107) z (cid:48) − z (cid:107) (cid:3) . (2.3) Remark f : Z → R is the norm of the proximal gradient , D. M. OSTROVSKII, A. LOWY, AND M. RAZAVIYAYN that is W Z ( (cid:98) z, ∇ f ( (cid:98) z ) , L ) := L (cid:13)(cid:13)(cid:98) z − Π Z (cid:2)(cid:98) z − L ∇ f ( (cid:98) z ) (cid:3)(cid:13)(cid:13) , where Π Z ( · ) is the operator ofEuclidean projection onto Z (see [27]), and we define the functional(2.4) W Z ( z, ζ, L ) := L (cid:13)(cid:13) z − Π Z (cid:2) z − L ζ (cid:3)(cid:13)(cid:13) for convenience. In the unconstrained case, both measures reduce to (cid:107)∇ f ( (cid:98) z ) (cid:107) . However,in the general constrained case S Z provides a stronger criterion, in the following sense:(i) For any z, ζ, L one has W Z ( z, ζ, L ) (cid:54) S Z ( z, ζ, L ); thus, any ( ε x , ε y )-FNEin the sense of Definition 2.1 is an ( ε x , ε y )-FNE in the weak sense, i.e., W X ( (cid:98) x, ∇ x F ( (cid:98) x, (cid:98) y ) , L xx ) (cid:54) ε x , W Y ( (cid:98) y, −∇ y F ( (cid:98) x, (cid:98) y ) , L yy ) (cid:54) ε y . See [3, Thm. 4.3].(ii) The converse is generally false (unless in the unconstrained case). In par-ticular, in Section 5 (cf. Remark 5.4) we exhibit a minimization problem inwhich W Z ( (cid:98) z, ∇ f ( (cid:98) z ) , L ) (cid:54) ε at (cid:98) z ∈ Z , but S Z ( (cid:98) z, ∇ f ( (cid:98) z ) , L ) is arbitrarily large.Our goal is to provide an efficient algorithm for finding ( ε x , ε y )-FNE givenaccess to the full gradient oracle ∇ F ( x, y ). Following the established trend in theliterature, we assume the feasible sets X, Y to be “projection-friendly”, i.e., Euclideanprojection onto them can be done with a small computational effort; thus, the naturalnotion of efficiency is simply the number of gradient computations. Besides the twoaccuracies ε x , ε y , the Lipschitz parameters L xx , L yy , L xy , and the “radius” R y of Y , weneed a parameter quantifying the hardness of the primal problem – that of minimizing(2.5) ϕ ( x ) := max y ∈ Y F ( x, y ) . Since X can be unbounded, the natural choice of such parameter is the primal gap ∆,∆ := ϕ ( x ) − min x ∈ X ϕ ( x ) , where x is the initial iterate. To give a concise and intuitive statement of our mainresult, it is helpful to define the “coupling-adjusted” counterpart L + yy of L yy , defined as(2.6) L + yy := L yy + L xy L xx , as well as the unit-free quantities – the “complexity factors” T x and T y given by(2.7) T x := L xx ∆ ε x , T y := (cid:115) L + yy R y ε y . Upon consulting the literature (e.g., [7, 25]), we recognize T x as the iteration com-plexity of finding ε x -stationary point (with respect to gradient norm) in the class ofunconstrained minimization problems with L xx -smooth (possibly nonconvex) objectiveand initial gap ∆. 
On the other hand, we recognize T y as the tight complexity boundfor the problem of finding ε y -stationary point in the class of maximization problemswith concave and L + yy -smooth objective, given the initial point within R y distance ofan optimum, using first-order information. (This bound is also tight in the constrainedsetup, with ε y bounding the proximal gradient norm.) We now state our main result. Theorem
There exists an algorithmthat, given ( ε x , ε y , L xx , L xy , L yy , R y , ∆) , outputs ( ε x , ε y ) -FNE of the problem (1.1) in (2.8) (cid:101) O ( T x T y ) computations of ∇ F ( x, y ) and projections, where (cid:101) O ( · ) hides logarithmic factors in T x , T y . Exact FNE might not exist when X is not compact; however ( ε x , ε y )-FNE exists for all ε x , ε y > ε x , ε y )-FNE in the problem (1.1) can be viewed as the product of the “pri-mal” complexity of finding an ε x -stationary point of F ( · , y ) with fixed y , and the “dual”complexity of finding ε y -stationary point of ψ x ( · ) = min x (cid:48) ∈ X F ( x (cid:48) , · ) + L xx (cid:107) x (cid:48) − x (cid:107) with fixed x ∈ X on R y . Note that ψ x ( · ) has L + yy -Lipschitz gradient by Danskin’stheorem (see [10] and [29, Lem. 24]), and is associated with the standard Moreauenvelope ϕ L xx ( x ) := min x (cid:48) ∈ X [ ϕ ( x (cid:48) ) + L xx (cid:107) x (cid:48) − x (cid:107) ] of the primal function ϕ ( x ) ([14]).Second, in Section 5 we prove that the primal component (cid:98) x of an ( ε x , ε y )-FNEwith ε y = ε x / ( L xx R y ) satisfies (cid:107) ϕ L xx ( (cid:98) x ) (cid:107) = O ( ε x ). In view of Theorem 2.3, this leadsto the complexity (cid:101) O ( ε − x ) of finding an ε x -stationary point for the standard Moreauenvelope (see Section 5 for the rigorous result (cf. (5.8)) and detailed discussion). Sucha result is known from the recent literature [34, 19] in the case X = X ; moreover, aswe discuss in Section 5, the result [18, Prop. 4.12] that is commonly used (in particular,in [19]) in order to reduce the Moreau envelope criterion to the criterion based on thenorm of the proximal gradient (i.e., our weak criterion in (2.4)) seems to be invalidwhen Y (cid:54) = Y . Our results close these gaps by working with the strong criterion (2.3).Third, our approach can be extended to composite objectives, e.g., by following [28,15]. To keep the presentation simple, we avoid such extension here (see, e.g., [3]). Onthe other hand, extension to non-Euclidean geometries faces some non-trivial challengesthat have not been properly addressed in the prior literature. In Appendix B wediscuss these challenges and introduce the necessary adjustments into our framework.
Notation.
Throughout the paper, and unless explicitly stated otherwise, (cid:107) · (cid:107) and (cid:104)· , ·(cid:105) denote the standard Euclidean norm and inner product regardless of the(Euclidean) space. We let [ T ] := { , , ..., T } for T ∈ N . log( · ) is the natural loga-rithm; g = O ( f ) means that for any z ∈ Dom( f ) = Dom( g ) one has f ( z ) (cid:54) Cg ( z )with C being a generic constant; g = (cid:101) O ( f ) means the same but with C replaced bya poly-logarithmic factor in g . We write ∂ y F ( x ( y ) , y ) for the partial gradient in y of F ( x ( y ) , y ) as a function of y ; in other words, ∂ y F ( x ( y ) , y ) = ∇ y F ( x, y ) with x = x ( y )substituted post-factum. We shall introduce additional notation when the need arises.
3. Building blocks and preliminaries.
Given a convex set Z in a Euclideanspace Z and a pair z, ζ ∈ Z , we define the prox-mapping (3.1) prox z,Z ( ζ ) := argmin z (cid:48) ∈ Z (cid:104) ζ, z (cid:48) (cid:105) + (cid:107) z (cid:48) − z (cid:107) . In what follows, we assume prox z,Z ( ζ ) to be computationally cheap. Note that in theunconstrained case with Z = Z , one has prox z,Z ( ζ ) = z − ζ , whereas in the (general)constrained case one has prox z,Z ( ζ ) = Π Z ( z − ζ ). Furthermore, in what follows weuse the notion of inexact first-order oracle for a smooth convex function due to [12]. Definition δ -inexact oracle). Let f : Z → R be convex with L -Lipschitzgradient. Pair [ (cid:101) f ( · ) , (cid:101) ∇ f ( · )] is called inexact oracle for f with accuracy δ (cid:62) if for anypair of points z, z (cid:48) ∈ Z one has (3.2) 0 (cid:54) f ( z (cid:48) ) − (cid:101) f ( z ) − (cid:104) (cid:101) ∇ f ( z ) , z (cid:48) − z (cid:105) (cid:54) L (cid:107) z (cid:48) − z (cid:107) + δ. Note that, unlike [12], we do not include L into the definition of inexact oracle.Next we present Nesterov’s fast gradient method (FGM) for smooth convexoptimization with inexact oracle (see [12]) and a restart scheme for it. We use them in A notable exception is the work [35] that appeared shortly after the first version of this manuscript.
D. M. OSTROVSKII, A. LOWY, AND M. RAZAVIYAYN two scenarios: (a) minimization of a strongly convex function on X with exact oracle;(b) maximization of a strongly concave function on Y with δ -inexact oracle. Algorithm 3.1
Fast Gradient Method function FGM ( z , Z, γ, T, (cid:101) ∇ f ( · )) G = 0 for t = 0 , , ..., T − do u t = prox z ,Z ( γG t ) τ t = t +2)( t +1)( t +4) v t +1 = τ t u t + (1 − τ t ) z t g t = t +22 (cid:101) ∇ f ( v t +1 ) w t +1 = prox u t ,Z ( γg t ) z t +1 = τ t w t +1 + (1 − τ t ) z t G t +1 = G t + g t end for return z T end function Algorithm 3.2 Restart Scheme for FGM function RestartFGM ( z , Z, γ, T, S, (cid:101) ∇ f ( · )) for s ∈ [ S ] do z s = FGM ( z s − , Z, γ, T, (cid:101) ∇ f ( · )) end for return z S end function3.1. Fast gradient method with inexactoracle. Assume we are given initial point z ∈ Z ,target number of iterations T , stepsize γ > δ -inexact (or, possibly, exact)oracle for function f : Z → R which satisfies therequirements in Definition 3.1.We will use a variant of fast gradient method with inexact oracle due to [12],given here as Algorithm 3.1, that performs T iterations and outputs approximateminimizer z T of f ; each of these iterations reduces to a single call of (cid:101) ∇ f ( · ), twoprox-mapping computations and a few entrywise vector operations. Note that theinexact oracle (cid:101) ∇ f ( · ) is passed as an input parameter (i.e., “function handle”); thismeans that such an oracle must be implemented as an external procedure.Assuming that the error of (cid:101) ∇ f ( · ) is small enough, the work [12] ensures that thestandard O ( T − ) convergence of FGM is preserved. Let us now rephrase their result. Theorem
Running Algorithm 3.1 runs with γ =1 /L and δ -inexact oracle of f that is L -smooth, convex, and minimized at z ∗ suchthat (cid:107) z − z ∗ (cid:107) (cid:54) R , ensures that f ( z T ) − f ( z ∗ ) (cid:54) LR /T + 2 δT. As a result, one has (3.3) f ( z T ) − f ( z ∗ ) (cid:54) LR T whenever δ (cid:54) δ T := LR T . When f is also λ -strongly convex, (3.3) allows to bound the distance to z ∗ :(3.4) (cid:107) z T − z ∗ (cid:107) (cid:54) κR /T , where κ = L/λ is the condition number. That is, we are guaranteed to get twice closerto the optimum after T = O ( √ κ ) iterations. Following [26], we exploit this fact toobtain linear convergence via the simple restart scheme given in Algorithm 3.2, andderive the following result. Corollary
Run Algorithm with γ = 1 /L , parameters T, S satisfying (3.5) T (cid:62) √ κ, S (cid:62) log (3 LR/ε ) for some ε > , and δ (cid:54) δ T , cf. (3.3) . Then the final iterate z S satisfies (3.6) (cid:107) z S − z ∗ (cid:107) (cid:54) ε L , f ( z S ) − f ( z ∗ ) (cid:54) ε L , S Z ( z S , ∇ f ( z S ) , L ) (cid:54) ε . Proof.
By (3.4), T (cid:62) √ κ iterations in the first epoch ensure that (cid:107) z − z ∗ (cid:107) (cid:54) R/ FGM is FFICIENT SEARCH OF FIRST-ORDER NASH EQUILIBRIA z . Repeating this process for s epochs, we have (cid:107) z s − z ∗ (cid:107) (cid:54) − s R . Inparticular, after S (cid:62) log (3 LR/ε ) epochs we arrive at the first bound in (3.6). Now,arguing in a similar manner, but this time using (3.3), we have that f ( z s ) − f ( z ∗ ) (cid:54) L (2 − s +1 R ) T (cid:54) LR s T (cid:54) LR s +1 , where in the first step we combined (3.3) with with the bound (cid:107) z s − − z ∗ (cid:107) (cid:54) − s +1 R ,and in the end we used κ (cid:62)
1. Plugging in 2 S +1 = 18 L R /ε , we verify the secondinequality in (3.6). Finally, for the last inequality in (3.6), we first observe that, dueto the smoothness of f , it holds that f ( z ) − f ( z S ) (cid:54) (cid:10) ∇ f ( z S ) , z − z S (cid:11) + L (cid:107) z − z S (cid:107) . Thus, one has f ( z ∗ ) − f ( z S ) (cid:54) min z ∈ Z [ (cid:10) ∇ f ( z S ) , z − z S (cid:11) + L (cid:107) z − z S (cid:107) ], whence (3.7) S Z ( z S , ∇ f ( z S ) , L ) = 2 L max z ∈ Z (cid:104) −(cid:104)∇ f ( z S ) , z − z S (cid:105) − L (cid:107) z − z S (cid:107) (cid:105) = − L min z ∈ Z (cid:104) (cid:104)∇ f ( z S ) , z − z S (cid:105) + L (cid:107) z − z S (cid:107) (cid:105) (cid:54) − L [ f ( z ∗ ) − f ( z S )] (cid:54) ε / , where the final inequality uses the second part of (3.6) proved earlier.When Algorithm 3.2 is used for minimization in x , the complexity factor T x isparametrized by the upper bound on the initial objective gap ∆ f [ (cid:62) f ( z ) − f ( z ∗ )]rather than R , and the exact oracle ∇ f ( · ) is available (function value is not used).Note that, by strong convexity, such a bound also implies a bound on the initialdistance to the optimum, namely R = 2 κ ∆ f /L , and we arrive at the following result. Corollary
Assume that f ( z ) − f ( z ∗ ) (cid:54) ∆ f . Run Algorithm with δ = 0 , (3.8) S (cid:62)
12 log (cid:0) κL ∆ f /ε (cid:1) , and other parameters set as in Corollary . Then the bounds in (3.6) remain valid. Next webriefly review the proximal point method, which forms the backbone of our approach,in the context of searching for stationary points of nonconvex functions. Then weshow how the iterations of this method can be approximated by using Algorithm 3.2.Given a convex set X and φ : X → R with L -Lipschitz gradient, the proximalpoint operator of φ on X with stepsize < γ < /L is defined by(3.9) x (cid:55)→ x + γφ,X ( x ) := argmin x (cid:48) ∈ X (cid:20) φ ( x (cid:48) ) + 12 γ (cid:107) x (cid:48) − x (cid:107) (cid:21) . Denoting x + = x + γφ,X ( x ) for brevity, the first-order optimality condition in (3.9) writes(3.10) (cid:10) ∇ φ ( x + ) + γ ( x + − x ) , x (cid:48) − x + (cid:11) (cid:62) , ∀ x (cid:48) ∈ X. Note that this reduces to the “implicit gradient descent” update x + = x − γ ∇ φ ( x + )in the unconstrained case. For large stepsize, computing the proximal operator ata point might be as hard as minimizing φ . However, with sufficient regularization,namely when γ = c/L for 0 < c (cid:54) /
2, the task becomes easy, since the objectivein (3.9) is strongly convex and well-conditioned, with κ = (1 + c ) / (1 − c ) (cid:54) . On theother hand, with such stepsize the proximal point method , as given by(3.11) x t = x + γφ,X ( x t − ) , D. M. OSTROVSKII, A. LOWY, AND M. RAZAVIYAYN attains the optimal rate O (1 / √ T ) of minimizing the stationarity measure S X (cf. Def-inition 2.1). Indeed, from (3.9) with γ = c/L we get(3.12) φ ( x + ) + L c (cid:107) x + − x (cid:107) (cid:54) φ ( x ) . Iterating this T times according to (3.11) results in(3.13) min t ∈ [ T ] (cid:107) x t − x t − (cid:107) (cid:54) T (cid:88) t ∈ [ T ] (cid:107) x t − x t − (cid:107) (cid:54) c ∆ LT , where ∆ = φ ( x ) − min x ∈ X φ ( x ) is the initial gap. On the other hand,(3.14) S X ( x + , ∇ φ ( x + ) , L ) ≡ L max x (cid:48) ∈ X (cid:20) − (cid:10) ∇ φ ( x + ) , x (cid:48) − x + (cid:11) − L (cid:107) x (cid:48) − x + (cid:107) (cid:21) (cid:54) L max x (cid:48) ∈ X (cid:20) c (cid:10) x + − x, x (cid:48) − x + (cid:11) − (cid:107) x (cid:48) − x + (cid:107) (cid:21) (cid:54) L (cid:107) x + − x (cid:107) c , where we first used the first-order optimality condition (3.10) and then Young’s inequal-ity; note that the last inequality becomes tight when X = X . Combining (3.11), (3.13)and (3.14), we arrive at(3.15) min t ∈ [ T ] S X ( x t , ∇ φ ( x t ) , L ) (cid:54) (cid:114) L ∆ cT , i.e., the iteration complexity T ( ε ) = O (cid:0) L ∆ /ε (cid:1) of minimizing the measure S X , whichis optimal in the unconstrained case [7].Of course, the above argument would be useless if (3.13) or (3.14) were not tolerantto errors when computing x + γφ,X ( x ), i.e., when minimizing the regularized function(3.16) φ L,x ( · ) := φ ( · ) + L (cid:107) · − x (cid:107) . (Here we fixed c = 1 / γ = 1 / (2 L ), cf. (3.9).) We shall now verifysuch error-tolerance for (3.13), (3.14) and (3.15) as a result. Indeed, let (cid:101) x + ∈ X satisfy(3.17) φ L,x ( (cid:101) x + ) (cid:54) φ L,x ( x + ) + ε L for given x , where x + = x + φ/ L,X ( x ) is the true minimizer, ε the desired accuracy, andthe constant 1 /
24 will be convenient in further calculations. Consider the counterpartof (3.11), i.e., the sequence (cid:101) x t = (cid:101) x + φ/ L,X ( (cid:101) x t − ) obeing (3.17) at each step. Using (3.12)and proceeding as when deriving (3.13), we obtain the following counterpart of (3.13):(3.18) min t ∈ [ T ] (cid:107) (cid:101) x t − (cid:101) x t − (cid:107) (cid:54) T (cid:88) t ∈ [ T ] (cid:107) (cid:101) x t − (cid:101) x t − (cid:107) (cid:54) ∆ LT + ε L , thus showing the desired error-tolerance for (3.13). Moreover, assume now that (cid:101) x + , inaddition to (3.17), admits the matching guarantee for the stationarity measure, that is(3.19) S X ( (cid:101) x + , ∇ φ L,x ( (cid:101) x + ) , L ) (cid:54) ε/ . FFICIENT SEARCH OF FIRST-ORDER NASH EQUILIBRIA S X ( (cid:101) x + , ∇ φ ( (cid:101) x + ) , L ) ≡ L max x (cid:48) ∈ X (cid:2) − (cid:10) ∇ φ ( (cid:101) x + ) , x (cid:48) − (cid:101) x + (cid:11) − L (cid:107) x (cid:48) − (cid:101) x + (cid:107) (cid:3) (cid:54) L max x (cid:48) ∈ X (cid:2) − (cid:10) ∇ φ ( (cid:101) x + ) + 2 L ( (cid:101) x + − x ) , x (cid:48) − (cid:101) x + (cid:11) − L (cid:107) x (cid:48) − (cid:101) x + (cid:107) (cid:3) + 2 L max x (cid:48) ∈ X (cid:2)(cid:10) L ( (cid:101) x + − x ) , x (cid:48) − (cid:101) x + (cid:11) − L (cid:107) x (cid:48) − (cid:101) x + (cid:107) (cid:3) =2 S X ( (cid:101) x + , ∇ φ L,x ( (cid:101) x + ) , L/
2) + 8 L (cid:107) (cid:101) x + − x (cid:107) (cid:54) S X ( (cid:101) x + , ∇ φ L,x ( (cid:101) x + ) , L ) + 8 L (cid:107) (cid:101) x + − x (cid:107) = ε / L (cid:107) (cid:101) x + − x (cid:107) ;here we first used the explicit form of ∇ φ L,x , then estimated the additional term viaYoung’s inequality, and finally used that S X ( x, ξ, L ) is non-decreasing in L , as followsfrom the proximal Polyak-Lojasiewicz lemma ([16, Lem. 1]). Thus, we have justverified the required error-tolerance for (3.14). Finally, by recalling (3.18) we arrive at(3.21) min t ∈ [ T ] S X ( (cid:101) x t , ∇ φ ( (cid:101) x t ) , L ) (cid:54) (cid:114) L ∆ T + 5 ε (cid:54) (cid:114) L ∆ T + ε, which results in the same complexity T ( ε ) = O (cid:0) L ∆ /ε (cid:1) as for the exact updates (3.11).It remains to notice that the point (cid:101) x + satisfying (3.17) and (3.19) can be obtainedby running FGM with restarts (Algorithm 3.2) with a near-constant total number oforacle calls, since the function φ L,x minimized in (3.9) is 3 L -smooth and L -strongly-convex. Namely, combining Corollary 3.3 and Corollary 3.4, we obtain the following. Proposition
Givensome x ∈ X , let φ : X → R have L -Lipschitz gradient, and let x + = x + φ/ (2 L ) ,X ( x ) bethe minimizer of φ L,x , cf. (3.16) . Let (cid:101) x + be the output of Algorithm run with exactoracle ∇ φ L,x ( · ) , z = x, Z = X , and parameters (3.22) T = 11 , γ = 13 L , and S (cid:62)
12 log (cid:18) L ∆ L,x ε (cid:19) , where ∆ L,x := φ ( x ) − min x (cid:48) φ L,x ( x (cid:48) ) . Then (3.23) (cid:107) (cid:101) x + − x + (cid:107) (cid:54) ε L , S X ( (cid:101) x + , ∇ φ L,x ( (cid:101) x + ) , L ) (cid:54) ε , φ L,x ( (cid:101) x + ) − φ L,x ( x + ) (cid:54) ε L .
Proof.
Note that φ L,x ( · ) is 3 L -smooth, has condition number κ (cid:54)
3, and is ∆
L,x -suboptimal at x . Hence, Algorithm 3.2 run with T = 11 > √ κ and S (cid:62)
12 log (cid:18) L ∆ L,x ε (cid:19) (cid:62)
12 log (cid:18) κ (3 L )∆ L,x (3 ε/ (cid:19) , cf. (3.8), outputs a point for which (3.6) holds with the following replacements: z S (cid:55)→ (cid:101) x + , z ∗ (cid:55)→ x + , f ( · ) (cid:55)→ φ L,x ( · ) , L (cid:55)→ L, ε (cid:55)→ ε . Using that S X ( x, ξ, L ) (cid:54) S X ( x, ξ, L ), we verify all three inequalities in (3.23). Namely, we apply [16, Lemma 1] using the indicator of X as the proximal function g ( x ) there. D. M. OSTROVSKII, A. LOWY, AND M. RAZAVIYAYN
4. Algorithm and main result.
In order to better convey the ideas behind ourapproach, we shall present it in a similar manner as in Section 3.2. Namely, we shallfirst present the “conceptual” algorithm with exact proximal-point type updates, andthen show how to approximate these updates, which shall result in our final algorithm.
First,following [25], we reduce the problem of finding ( ε x , ε y )-FNE in (1.1) to the problemof finding approximate FNE of the regularized function(4.1) F reg ( x, y ) := F ( x, y ) − ε y R y (cid:107) y − ¯ y (cid:107) . This function has a unique maximizer for any x ∈ X as it is ε y /R y -strongly concave.This strong concavity will help us obtain faster algorithms for finding ( ε x , ε y )-FNEwhen applying standard accelerated procedures.The crux of our approach is to run a version of primal-dual proximal-point method,choosing the next iterate ( x t , y t ) as an approximate optimal solution to the convex-concave saddle-point problem (with unique exact solution):(4.2) min x ∈ X max y ∈ Y (cid:2) F reg t ( x, y ) := F reg ( x, y ) + L xx (cid:107) x − x t − (cid:107) (cid:3) . To illustrate this idea, let us consider the idealized iterates ( (cid:98) x t , (cid:98) y t ) corresponding to theexact saddle point in (4.2), which exists and is unique by Sion’s minimax theorem [33].By definition, we have(4.3) F reg t ( (cid:98) x t , (cid:98) y t +1 ) (cid:54) F reg t ( (cid:98) x t , (cid:98) y t ) (cid:54) F reg t ( (cid:98) x t − , (cid:98) y t ) . Using the expression for F reg t ( x, y ) in (4.2), the right-hand inequality in (4.3) reads(4.4) F reg ( (cid:98) x t , (cid:98) y t ) + L xx (cid:107) (cid:98) x t − (cid:98) x t − (cid:107) (cid:54) F reg ( (cid:98) x t − , (cid:98) y t ) . Meanwhile, the first inequality in (4.3) implies that F reg ( (cid:98) x t , (cid:98) y t ) (cid:62) F reg ( (cid:98) x t , (cid:98) y t +1 ). Thuswe arrive at(4.5) F reg ( (cid:98) x t , (cid:98) y t +1 ) + L xx (cid:107) (cid:98) x t − (cid:98) x t − (cid:107) (cid:54) F reg ( (cid:98) x t − , (cid:98) y t ) . The point here is that, unlike (4.4), relation (4.5) can now be iterated, as the indexgets shifted in the left-hand side for both variables. Iterating (4.5) results in a similarargument as in (3.12)-(3.15) and gives the T x complexity factor. More precisely,applying (4.5) for t ∈ [ T −
1] and (4.4) at t = T , we arrive at an analogue of (3.13):min t ∈ [ T ] (cid:107) (cid:98) x t − (cid:98) x t − (cid:107) (cid:54) T (cid:88) t ∈ [ T ] (cid:107) (cid:98) x t − (cid:98) x t − (cid:107) (cid:54) F reg ( (cid:98) x , (cid:98) y ) − F reg ( (cid:98) x T , (cid:98) y T ) L xx T .
Now observe that we can relate F reg ( (cid:98) x , (cid:98) y ) − F reg ( (cid:98) x T , (cid:98) y T ) to ∆ = ϕ ( (cid:98) x ) − min x ∈ X ϕ ( x ):(4.6) F reg ( (cid:98) x , (cid:98) y ) (cid:54) F ( (cid:98) x , (cid:98) y ) (cid:54) max y ∈ Y F ( (cid:98) x , y ) = ϕ ( (cid:98) x ) ,F reg ( (cid:98) x T , (cid:98) y T ) (cid:62) max y ∈ Y F ( (cid:98) x T , y ) − ε y R y max y (cid:48) ∈ Y (cid:107) y (cid:48) − ¯ y (cid:107) (cid:62) min x ∈ X ϕ ( x ) − ε y R y . Thus we can guarantee the existence of τ ∈ [ T ] for which (cid:107) (cid:98) x τ − (cid:98) x τ − (cid:107) (cid:54) ∆+2 ε y R y L xx T , mimicking (3.13) up to O ( ε y ) additive error. Now we can proceed as in (3.14), usingthe primal optimality condition in (4.2) and ∇ x F reg ( x, y ) ≡ ∇ x F ( x, y ). This results in S X ( (cid:98) x τ , ∇ x F ( (cid:98) x τ , (cid:98) y τ ) , L xx ) (cid:54) (cid:114) L xx (∆ + 2 ε y R y ) T ,
FFICIENT SEARCH OF FIRST-ORDER NASH EQUILIBRIA O ( T x ) iterations (2.7) to ensure S X ( (cid:98) x τ , ∇ x F ( (cid:98) x τ , (cid:98) y τ ) , L xx ) (cid:54) ε x . Mean-while, we remain near-stationary in y : indeed, for any iteration t ∈ [ T x ] we have that S Y ( (cid:98) y t , −∇ y F ( (cid:98) x t , (cid:98) y t ) , L yy ) = 2 L yy max y ∈ Y (cid:20) (cid:104)∇ y F ( (cid:98) x t , (cid:98) y t ) , y − (cid:98) y t (cid:105) − L yy (cid:107) y − (cid:98) y t (cid:107) (cid:21) (cid:54) L yy max y ∈ Y (cid:20)(cid:28) ε y R y ( (cid:98) y t − ¯ y ) , y − (cid:98) y t (cid:29) − L yy (cid:107) y − (cid:98) y t (cid:107) (cid:21) + 2 L yy max y ∈ Y (cid:20)(cid:28) ∇ y F ( (cid:98) x t , (cid:98) y t ) − ε y R y ( (cid:98) y t − ¯ y ) , y − (cid:98) y t (cid:29)(cid:21) (cid:54) L yy max y ∈ Y (cid:20)(cid:28) ε y R y ( (cid:98) y t − ¯ y ) , y − (cid:98) y t (cid:29) − L yy (cid:107) y − (cid:98) y t (cid:107) (cid:21) = ε y R y (cid:107) (cid:98) y t − ¯ y (cid:107) (cid:54) ε y , where in the second inequality we used the dual optimality condition for (4.2), andthen used Young’s inequality. Thus, ( (cid:98) x τ , (cid:98) y τ ) is an ( ε x , O ( ε y ))-FNE.So far we assumed the update (4.2) can be done exactly and analyzed the iterationcomplexity of the resulting idealized procedure. Next we show how to approximate (4.2)via Algorithm 3.2, leading to our final algorithm and its efficiency estimate. As in the case of the usualproximal point method, the update stemming from the auxilliary min-max problemin (4.2) cannot be performed exactly. To address this problem, we extend the approachdescribed in Section 3.2 and approximately solve the (primal) minimization problemin (4.2) up to O ( ε x ) accuracy in the S X -measure via Algorithm 3.2 (cf. Proposition 3.5).The key challenge here is that the function to minimize in (4.2) stems from the nestedmaximization problem, hence neither it nor its gradient can be computed exactly.Instead, we provide inexact oracle for this function through the following steps. Algorithm 4.1
Solve Regularized Dual Problem function SolveRegDual ( y, x t − , ¯ y, γ x , λ y , T, S ) (cid:101) x t ( y ) = RestartFGM ( x t − , X, γ x , T, S, ∇ x F ( · , y ) − γ x ( · − x t − )) (cid:101) ∇ ψ t ( y ) = ∇ y F ( (cid:101) x t ( y ) , y ) − λ y ( y − ¯ y ) return (cid:101) x t ( y ) , (cid:101) ∇ ψ t ( y ) end function First, given the current primal iterate x t − , consider the minimization problemcorresponding to the dual function of (4.2) evaluated at some fixed y ∈ Y : ψ t ( y ) := min x ∈ X (cid:2) F reg t ( x, y ) := F reg ( x, y ) + L xx (cid:107) x − x t − (cid:107) (cid:3) . Solving this minimization problem for fixed y ∈ Y by running Algorithm 3.2 withexact oracle ∇ x F ( · , y ) + 2 L xx ( · − x t − ), we obtain approximation (cid:101) x t ( y ) of the exactminimizer (cid:98) x t ( y ). As F t ( · , y ) is well-conditioned, it only takes a logarithmic number oforacle calls to ensure a very small (inversely polynomial in the problem parameters)error of approximating (cid:98) x t ( y ). On the other hand, a version of Danskin’s theorem ([29,Lem. 24]) guarantees that the gradient of ψ t ( y ), given by(4.7) ∇ ψ t ( y ) ≡ ∂ y F reg ( (cid:98) x t ( y ) , y ) , is O ( L + yy )-Lipschitz. Hence, (cid:101) x t ( y ) provides a δ -inexact oracle for ψ t ( y ):(4.8) (cid:101) ψ t ( y ) := F reg ( (cid:101) x t ( y ) , y ) , (cid:101) ∇ ψ t ( y ) := ∂ y F reg ( (cid:101) x t ( y ) , y ) , cf. Definition 3.1, where the accuracy parameter δ can be arbitrarily chosen. Forconvenience, we outline the subroutine that returns (cid:101) x t ( y ) and the approximate dual2 D. M. OSTROVSKII, A. LOWY, AND M. RAZAVIYAYN gradient (cid:101) ∇ ψ t ( y ) in Algorithm 4.1. Now, observe that we can switch the order of minand max in (4.2), recasting it as y t = arg max y ∈ Y ψ t ( y ) , and x t = (cid:98) x t ( y t ) . Naturally,we replace those with the approximate updates given by(4.9) y t ≈ arg max y ∈ Y ψ t ( y ) , x t = (cid:101) x t ( y t ) , maximizing ψ t ( y ) by running Algorithm 3.2 with inexact gradient − (cid:101) ∇ ψ t ( y ) definedin (4.8), and without using (cid:101) ψ t ( y ). Since ψ t ( y ) is L + yy -smooth and ( ε y /R y )-stronglyconcave, in O ( T y ) calls of the inexact oracle − (cid:101) ∇ ψ t ( · ) Algorithm 3.2 finds O ( ε y )-approximate maximizer y t of ψ t , ensuring that S Y ( y t , −∇ ψ t ( y t ) , L + yy ) (cid:54) ε y , max y ∈ Y ψ t ( y ) − ψ t ( y t ) (cid:54) ε y L + yy . Combining the first of these inequalities with (4.7) and recalling that (cid:101) x t ( y t ) ≈ (cid:98) x t ( y t )with very high accuracy, we ensure that ( x t , y t ) obtained via (4.9) is O ( ε y )-stationaryin y (in the sense of Definition 2.1). As this must be repeated for t ∈ [ T x ], we recover thefirst term in (2.8). On the other hand, the second inequality leads to the extra O ( ε y /L + yy )error in the saddle point relation (4.3), whereas, as we know from Proposition 3.5, thiserror must be O ( ε x /L xx ) in order to preserve the argument in Section 4.1. This is easyto fix: it suffices to perform a logarithmic in T x number of additional restarts whenmaximizing ψ t ( y ) (cf. (3.22)). Thus, the argument in Section 4.1 remains valid, andwe find ( ε x , O ( ε y ))-FNE in (1.1) in (cid:101) O ( T x T y ) gradient computations and projections.The resulting algorithm, our main practical contribution, is given in Algorithm 4.2. Algorithm 4.2
FNE Search in Nonconvex-Concave Smooth Min-Max Problem
Require: ∇ F ( · , · ), Y , x , ¯ y ∈ Y , T x , T y , S y , γ x , γ y , λ y , T o , S o for t ∈ [ T x ] do (cid:46) Using Algorithms 3.2 and 4.1 as subroutines y t = RestartFGM (¯ y, Y, γ y , T y , S y , − (cid:101) ∇ ψ t ( · )) with (cid:101) ∇ ψ t ( y ) returned by SolveRegDual ( y, x t − , ¯ y, γ x , λ y , T o , S o ) x t = (cid:101) x t ( y t ) returned by SolveRegDual ( y t , x t − , ¯ y, γ x , λ y , T o , S o ) end for return ( x τ , y τ ) with τ ∈ Argmin t ∈ [ T x ] (cid:107)∇ x F ( x t , y t ) (cid:107) We state our main result.
Theorem
Define λ y := ε y R y , Θ := L yy R y , Θ + := L + yy R y , and δ := min (cid:34) ε y R y , Θ2 T y , (cid:115) ∆(Θ + − Θ) T x T y (cid:35) . (4.10) Let us run Algorithm with (4.11) γ x = 12 L xx , γ y = 1 L + yy + λ y , (4.12) T x (cid:62) L xx (∆ + 2 ε y R y ) ε x , T y (cid:62) (cid:115) L + yy + λ y ) λ y , S y (cid:62) (cid:18) max (cid:20) T y , Θ + δ (cid:21)(cid:19) , (4.13) T o = 11 , S o (cid:62)
12 log (cid:18) ε y R y ) (cid:20) L xx ε x + 2Θ + δ + 112 δ (cid:21)(cid:19) . Its output is (2 ε x , ε y ) -FNE in the problem (1.1) , in the sense of Definition , in (cid:6) T o S o S y T x T y (cid:7) computations of ∇ F ( x, y ) and twice that many projections onto X and Y . FFICIENT SEARCH OF FIRST-ORDER NASH EQUILIBRIA (cid:98) x, (cid:98) y ) also satisfies L xx (cid:13)(cid:13)(cid:13)(cid:98) x − Π X (cid:16)(cid:98) x − L xx ∇ x F ( (cid:98) x, (cid:98) y ) (cid:17)(cid:13)(cid:13)(cid:13) (cid:54) ε x , L yy (cid:13)(cid:13)(cid:13)(cid:98) y − Π Y (cid:16)(cid:98) y + L yy ∇ y F ( (cid:98) x, (cid:98) y ) (cid:17)(cid:13)(cid:13)(cid:13) (cid:54) ε y , cf. Remark 2.2. On the other hand, the converse is not true: the above guarantee isnot sufficient to conclude that the point is ( ε x , ε y )-FNE in the sense of Definition 2.1. Remark F ( x, y ) is λ y -strongly concavein y with general λ y , leading to the complexity estimate (cid:101) O ( T x ( κ + y ) / ), where κ + y = L + yy /λ y is the condition number of the dual function in (4.2). This matches the bestknown rate (see, e.g., [19]). To this end, it suffices to run the algorithm with parametersset as in the premise of Theorem 4.1, but fixing a prescribed value for λ y . Remark T x of iterations in the outer loop and under thelogarithms in S o and S y through δ , cf. (4.10). In practice, ∆ is usually unknown, butthis does not pose a problem. Indeed, in the case of logarithmic dependencies (in S o and S y ), we can use, instead of ∆, a very crude upper bound (e.g., we always have ∆ (cid:54) L xx R x whenever X is contained in the Euclidean ball with radius R x ). As for T x ,observe that, when actually running Algorithm 4.2, one does not have fix in advancethe number of outer loop iterations. Instead, one can run an infinite loop and check thestopping criterion S X ( x t , ∇ x F ( x t , y t ) , L xx ) (cid:54) ε x after each iteration, which amounts tocomputing a prox-mapping for X . This is a valid stopping criterion: as follows from theproof of Theorem 4.1, the complementary condition S Y ( y t , −∇ y F ( x t , y t ) , L yy ) (cid:54) ε y is maintained at each t . To this end, Theorem 4.1 guarantees the termination ofAlgorithm 4.2 after at most T x outer loop iterations. We use the notation introduced in Section 4.1–4.2and refer to the arguments presented there if needed. o . Given a primal iterate x t − , let us define the following auxiliary functions: F t ( x, y ) := F ( x, y ) + L xx (cid:107) x − x t − (cid:107) ,F reg ( x, y ) := F ( x, y ) − λ y (cid:107) y − ¯ y (cid:107) ,F reg t ( x, y ) := F t ( x, y ) − λ y (cid:107) y − ¯ y (cid:107) [= F reg ( x, y ) + L xx (cid:107) x − x t − (cid:107) ] . Consider first the “idealized” update from the primal iterate x t − , as given by(4.14) y t = arg max y ∈ Y ψ t ( y ) , x t = (cid:98) x t ( y t ) . Here, ψ t ( y ) and (cid:98) x t ( y ) are defined as(4.15) ψ t ( y ) := min x ∈ X F reg t ( x, y ) [= F reg t ( (cid:98) x t ( y ) , y )] , (cid:98) x t ( y ) := argmin x ∈ X F reg t ( x, y ) , with Clearly, ψ t ( y ) is λ y -strongly concave with λ y = ε y /R y . On the other hand, byDanskin’s theorem (see, e.g., [29, Lem. 24]), ψ t ( y ) is continuously differentiable with(4.16) ∇ ψ t ( y ) = ∂ y F reg ( (cid:98) x t ( y ) , y ) = ∂ y F ( (cid:98) x t ( y ) , y ) − λ y ( y − ¯ y ) , and ∇ ψ t ( y ) is ( L + yy + λ y )-Lipschitz with L + yy defined in (2.6).4 D. M. OSTROVSKII, A. LOWY, AND M. RAZAVIYAYN o . 
We now focus on the properties of the point (cid:101) x t ( y ) returned when calling SolveRegDual ( y, x t − , ¯ y, γ x , λ y , T o , S o ) , cf. line 3 of Algorithm 4.2, as well as the corre-sponding pair [ (cid:101) ψ t ( y ) , (cid:101) ∇ ψ t ( y )], cf. (4.8). Note that the function value (cid:101) ψ t ( y ) is nevercomputed in Algorithm 4.2 and we only use it in the analysis. Inspecting the pseudocodeof SolveRegDual (Algorithm 4.1), we see that (cid:101) x t ( y ) corresponds to the approximateminimizer of F reg t ( x, y ) (thus also F t ( x, y )) in x , obtained by running restarted FGM(Algorithm 3.2) starting from x t − , with stepsize γ = 1 / (3 L xx ), T o = 11 inner loopiterations, and the number of restarts S o given in (4.13). Observe that minimiz-ing F t ( · , y ) corresponds to computing the proximal operator x + γF ( · ,y ) ,X ( x t − ) for thefunction F ( · , y ) which is L xx -smooth. Hence, due to our choice of input parameters,the premise of Proposition 3.5 is satisfied; applying it with our choice of S o yields S X ( (cid:101) x t ( y ) , ∇ x F t ( (cid:101) x t ( y ) , y ) , L xx ) (cid:54) ε x , (4.17) (cid:107) (cid:101) x t ( y ) − (cid:98) x t ( y ) (cid:107) (cid:54) min (cid:20) ε x L xx , δ L xy R y (cid:21) , (4.18) F reg t ( (cid:101) x t ( y ) , y ) − F reg t ( (cid:98) x t ( y ) , y ) (cid:54) min (cid:20) ε x L xx , δ (cid:21) . (4.19)Here, (4.17) and the first respective terms in (4.18)–(4.19) are due to the first of threeterms in brackets under logarithm in (4.13), cf. (3.22), combined with a very crudeuniform over y ∈ Y estimate(4.20) F reg t ( x t − , y ) − min x ∈ X F reg t ( x, y ) (cid:54)
3∆ + 2Θ + 6 ε y R y . (We defer the proof of (4.20) to appendix.) On the other hand, the second respectiveestimates in (4.18)–(4.19) correspond to the two remaining terms in brackets underlogarithm in (4.13), cf. (3.22), combined with the following easy-to-verify relations: L xx ε x = 112 δ ⇐⇒ ε x L xx = δ , (cid:40) + δ (cid:62) L xy R y L xx δ = L xx (cid:18) L xy R y L xx δ (cid:19) =: L xx ( ε (cid:48) x ) (cid:41) = ⇒ ε (cid:48) x L xx = δ L xy R y . Now, (4.17)–(4.19) have two consequences. First, by (4.19) we immediately have(4.21) F reg ( (cid:101) x t ( y ) , y ) + L xx (cid:107) (cid:101) x t ( y ) − x t − (cid:107) − ε x L xx (cid:54) F reg ( x t − , y ) , which mimics (4.4). The bound (4.21) will be our departure point when bounding S X later on. Second, the second respective terms in the right-hand side of (4.18)–(4.19)together ensure that the pair [ − (cid:101) ψ δt ( y ) , − (cid:101) ∇ ψ t ( y )] with(4.22) (cid:101) ψ δt ( y ) := F reg t ( (cid:101) x t ( y ) , y ) + δ/ , (cid:101) ∇ ψ t ( y ) = ∂ y F reg ( (cid:101) x t ( y ) , y )is a δ -inexact first-order oracle for − ψ t ( y ) in the sense of Definition 3.1, namely,(4.23) 0 (cid:54) − ψ t ( y (cid:48) )+ (cid:101) ψ δt ( y )+ (cid:104) (cid:101) ∇ ψ t ( y ) , y (cid:48) − y (cid:105) (cid:54) ( L + yy + λ y ) (cid:107) y (cid:48) − y (cid:107) + δ, ∀ y, y (cid:48) ∈ Y, where we used that ψ t is ( L + yy + λ y )-smooth (the proof of (4.23) is deferred to appendix). FFICIENT SEARCH OF FIRST-ORDER NASH EQUILIBRIA o . Now consider the actual update performed in the for-loop of Algorithm 4.2:(4.24) y t ≈ arg max y ∈ Y ψ t ( y ) , x t = (cid:101) x t ( y t ) , where the precise meaning of “ ≈ ”, is y t = RestartFGM (¯ y, Y, γ y , T y , S y , − (cid:101) ∇ ψ t ( · )) , cf. line 2.In other words, y t is obtained by running Algorithm 3.2 with δ -inexact gradient (cid:101) ∇ ψ t ( · ),starting from ¯ y ∈ Y , with T y iterations in the inner calls of FGM and S y restarts, T y and S y being given in (4.12). Recall that ψ t ( · ) is ( L + yy + λ y )-smooth and λ y -stronglyconvex with λ y = ε y /R y , and(4.25) δ (4.10) (cid:54) Θ2 T y (cid:54) ( L + yy + λ y ) R y T y , i.e., the condition in (3.3) is satisfied. By our choice of T y and S y in (4.12), and dueto Corollary 3.3, we get(4.26) (cid:107) y t − y ∗ t (cid:107) (cid:54) ε y L + yy , ψ t ( y ∗ t ) − ψ t ( y t ) (cid:54) min (cid:34) ε y L + yy , ε x L xx T y (cid:35) , and S Y ( y t , −∇ ψ t ( y t ) , L + yy + λ y ) (cid:54) ε y / , where y ∗ t is the exact maximizer of ψ t (cf. (3.6)).Here we used the first lower bound in (4.12) for S y to obtain all estimates except forthe second estimate of ψ t ( y ∗ t ) − ψ t ( y t ), and for this latter estimate we used the secondbound in (4.12) for S y and the last bound in (4.10) for δ , and did a series of estimates: S y (4.12) (cid:62) log (cid:18) (Θ + ) δ (cid:19) (4.10) (cid:62) log (cid:32) T x T y Θ + ∆ (cid:33) (cid:62) log (cid:32) L xx T y L + yy R y ε x (cid:33) (cid:62) (cid:18) L + yy R y ε (cid:48) y (cid:19) , ( ε (cid:48) y ) := ε x L + yy L xx T y . Now, by the proximal PL-lemma ([16, Lem. 1]), S Y ( y, g, L ) is non-decreasing in L , so(4.27) S Y ( y t , −∇ ψ t ( y t ) , L yy ) (cid:54) ε y / . 
Due to (4.16) and the Lipschitzness of ∇ y F ( · , y ) and prox y t ,Y ( · ), x t = (cid:101) x t ( y t ) satisfies S Y ( y t , −∇ y F ( x t , y t ) , L yy ) ( a ) (cid:54) S Y ( y t , −∇ y F ( x t , y t ) , L yy ) = 4 L yy max y ∈ Y (cid:2)(cid:10) ∇ y F ( x t , y t ) , y − y t (cid:11) − L yy (cid:107) y − y t (cid:107) (cid:3) (cid:54) L yy max y ∈ Y (cid:2)(cid:10) ∇ y F ( x t , y t ) − ∂ y F ( (cid:98) x t ( y t ) , y t ) + ε y R y ( y t − ¯ y ) , y − y t (cid:11) − L yy (cid:107) y − y t (cid:107) (cid:3) + 4 L yy max y ∈ Y (cid:2)(cid:10) ∂ y F ( (cid:98) x t ( y t ) , y t ) − ε y R y ( y t − ¯ y ) , y − y t (cid:11) − L yy (cid:107) y − y t (cid:107) (cid:3) = 4 L yy max y ∈ Y (cid:2)(cid:10) ∇ y F ( x t , y t ) − ∂ y F ( (cid:98) x t ( y t ) , y t ) + ε y R y ( y t − ¯ y ) , y − y t (cid:11) − L yy (cid:107) y − y t (cid:107) (cid:3) + 2 S Y ( y t , −∇ ψ t ( y t ) , L yy ) ( b ) (cid:54) L yy max y ∈ Y (cid:2)(cid:10) ∇ y F ( x t , y t ) − ∂ y F ( (cid:98) x t ( y t ) , y t ) + ε y R y ( y t − ¯ y ) , y − y t (cid:11) − L yy (cid:107) y − y t (cid:107) (cid:3) + ε y c ) (cid:54) (cid:13)(cid:13) ∇ y F ( x t , y t ) − ∂ y F ( (cid:98) x t ( y t ) , y t ) + ε y R y ( y t − ¯ y ) (cid:13)(cid:13) + ε y d ) (cid:54) (cid:107)∇ y F ( x t , y t ) − ∂ y F ( (cid:98) x t ( y t ) , y t ) (cid:107) + ε y R y (cid:107) y t − ¯ y (cid:107) + ε y e ) = 4 L xy (cid:107) (cid:101) x t ( y t ) − (cid:98) x t ( y t ) (cid:107) + 16 ε y + ε y ( f ) (cid:54) δ R y + 16 ε y + ε y ( g ) (cid:54) ε y . D. M. OSTROVSKII, A. LOWY, AND M. RAZAVIYAYN
Here in ( a ) we used that S Y ( y, g, L ) is non-decreasing in L ([16, Lem. 1]); in ( b ) weused (4.27); in ( c ) we used Young’s inequality; in ( d ) we used the Cauchy-Schwarzinequality; in ( e ) we used the Lipschitzness of F ; in ( f ) we used (4.18); in ( g ) we usedour choice of δ in (4.10). Thus, ( x t , y t ) is kept 5 ε y -stationary in y at any iteration t . o . We now revisit (4.21). Applying it to y = y t , we get(4.28) F reg ( x t , y t ) + L xx (cid:107) x t − x t − (cid:107) − ε x L xx (cid:54) F reg ( x t − , y t ) , which mimics (4.4). Our goal, however, is to mimic (4.5), for which we must lower-bound, up to a small error, F reg ( x t , y t ) via F reg ( x t , y t +1 ), or, equivalently, F reg t ( x t , y t )via F reg t ( x t , y t +1 ). First,(4.29) F reg t ( x t , y t +1 ) (cid:54) max y ∈ Y F reg t ( x t , y ) = ϕ t ( x t ) , where ϕ t ( x ) := max y ∈ Y F reg t ( x, y ) is the primal function in the saddle-point prob-lem (4.2). On the other hand, denoting x ∗ t = (cid:98) x t ( y ∗ t ), so that ( x ∗ t , y ∗ t ) is the uniquesaddle point in (4.2), we have F reg t ( x t , y t ) ≡ F reg t ( (cid:101) x t ( y t ) , y t ) (cid:62) F reg t ( (cid:98) x t ( y t ) , y t ) = ψ t ( y t ) (4.26) (cid:62) ψ t ( y ∗ t ) − ε x L xx T y (cid:62) ϕ t ( x ∗ t ) − ε x L xx T y . (4.30)It remains to compare ϕ t ( x t ) and ϕ t ( x ∗ t ). Combining F reg t ( x ∗ t , y ∗ t ) (cid:62) F reg t ( x ∗ t , y t )with the previous inequality, and observing that F reg t ( · , y t ) is L xx -strongly convex andminimized at (cid:98) x t ( y t ), we obtain (cid:107) (cid:98) x t ( y t ) − x ∗ t (cid:107) (cid:54) ε x L xx T y . On the other hand, (4.18)applied to y = y t gives(4.31) (cid:107) x t − (cid:98) x t ( y t ) (cid:107) (cid:54) (cid:18) δ L xy R y (cid:19) (cid:54) ∆(Θ + − Θ)64 L xy R y T y T x = ε x L xx T y , where we used the last expression in (4.10) for δ . Combining these results, we get (cid:107) x t − x ∗ t (cid:107) (cid:54) . ε x L xx T y . Now, ϕ t is (cid:0) L xx + L xy /λ y (cid:1) -smooth by Danskin’s theorem, and minimized at x ∗ t . Thus(4.32) ϕ t ( x t ) − ϕ t ( x ∗ t ) (cid:54) (cid:32) L xx + L xy λ y (cid:33) (cid:107) x t − x ∗ t (cid:107) (cid:54) . ε x L xx (cid:32) L + yy λ y T y (cid:33) (cid:54) . ε x L xx , where in the last step we plugged in T y from (4.12). Returning to (4.29)–(4.30), weget F reg ( x t , y t ) (cid:62) F reg ( x t , y t +1 ) − . ε x /L xx . Combining this with (4.28) we finallyget the desired analogue of (4.5):(4.33) F reg ( x t , y t +1 ) + L xx (cid:107) x t − x t − (cid:107) − ε x L xx (cid:54) F reg ( x t − , y t ) . FFICIENT SEARCH OF FIRST-ORDER NASH EQUILIBRIA t ∈ [ T x ] (cid:107) x t − x t − (cid:107) (cid:54) T x (cid:88) t ∈ [ T x ] (cid:107) x t − x t − (cid:107) (cid:54) F reg ( x , y ) − F reg ( x T x , y T x ) L xx T x + 7 ε x L xx (cid:54) ∆ + 2 ε y R y L xx T x + 7 ε x L xx (cid:54) ε x L xx , (4.34)where we used the estimates (4.6) and plugged in T x . 
It remains to mimic (3.20): S X ( x t , ∇ x F ( x t , y t ) , L xx ) (cid:54) S X ( x t , ∇ x F ( x t , y t ) , L xx ) ≡ L xx max x (cid:48) ∈ X (cid:2) − (cid:104)∇ x F ( x t , y t ) , x (cid:48) − x t (cid:105) − L xx (cid:107) x (cid:48) − x t (cid:107) (cid:3) (cid:54) L xx max x (cid:48) ∈ X (cid:2) − (cid:104)∇ x F t ( x t , y t ) , x (cid:48) − x t (cid:105) − L xx (cid:107) x (cid:48) − x t (cid:107) (cid:3) + 4 L xx max x (cid:48) ∈ X (cid:2) (cid:104) L xx ( x t − x t − ) , x (cid:48) − x t (cid:105) − L xx (cid:107) x (cid:48) − x t (cid:107) (cid:3) = 2 S X ( x t , ∇ x F t ( x t , y t ) , L xx )+ 4 L xx max x (cid:48) ∈ X (cid:2) (cid:104) L xx ( x t − x t − ) , x (cid:48) − x t (cid:105) − L xx (cid:107) x (cid:48) − x t (cid:107) (cid:3) ( a ) (cid:54) ε x / L xx max x (cid:48) ∈X (cid:2) (cid:104) x t − x t − , x (cid:48) − x t (cid:105) − (cid:107) x (cid:48) − x t (cid:107) (cid:3) ( b ) (cid:54) ε x / L xx (cid:107) x t − x t − (cid:107) c ) (cid:54) (1 / / ε x (cid:54) ε x , where in ( a ) we used (4.17) with y = y t , in ( b ) we used Young’s inequality, and ( c ) wasdue to (4.34). Combining this with the result of o , we conclude that ( x τ , y τ ) with τ ∈ argmin t ∈ T x (cid:107) x t − x t − (cid:107) is (2 ε x , ε y )-FNE. Moreover, we have performed (cid:6) T o S o S y T x T y (cid:7) iterations of FGM (in the for-loop of Algorithm 3.1) in total, with one computationof ∇ F and at most two projections on Y and X at each iteration.
5. Guarantees for the Moreau envelope.
We now consider the standard
Moreau envelope (see [34, 18]) of the primal function ϕ ( x ) = max y ∈ Y F ( x, y ), cf. (2.5):(5.1) ϕ L xx ( x ) := min x (cid:48) ∈ X (cid:2) ϕ ( x (cid:48) ) + L xx (cid:107) x (cid:48) − x (cid:107) (cid:3) . Clearly, ϕ is L xx -weakly convex, thus the minimized function in (5.1) is L xx -stronglyconvex. Focusing on the Moreau envelope makes sense in the applications where oneis only interested in the “primal” accuracy of solving (1.1). A common practice, inthe x -unconstrained case ( X = X ), is then to use the primal component (cid:98) x of anapproximate Nash equilibrium ( (cid:98) x, (cid:98) y ) as a candidate near-stationary point, measuringthe accuracy by (cid:107)∇ ϕ L xx ( (cid:98) x ) (cid:107) . When X = X , passing to the Moreau envelope canbe motivated as follows. On the one hand, ϕ L xx has Lipschitz gradient on X (byDanskin’s theorem), and we can “ignore” the non-differentiability of ϕ . On theother hand, (cid:107)∇ ϕ L xx ( (cid:98) x ) (cid:107) (cid:54) ε x implies that the point x + = x + ( (cid:98) x ) delivering theminimum in (5.1) for x = (cid:98) x (formally, x + λϕ, X ( (cid:98) x ) with λ = L xx , cf. (3.9)) satisfies (cid:107) x + − (cid:98) x (cid:107) = O ( ε x L xx ) and min ξ ∈ ∂ϕ ( x + ) (cid:107) ξ (cid:107) (cid:54) ε x , see, e.g., [32]. In other words, any ε x -stationary point for the Moreau envelope is within O ( ε x /L xx ) distance from a point atwhich ϕ has an ε x -small subgradient. We now extend this result to the case X ⊆ X . Proposition
Proposition 5.1. Let $\phi: X \to \mathbf R$ be $L$-weakly convex, and define its standard Moreau envelope $\phi_L(x) = \min_{x'\in X}\big[\phi(x') + L\|x'-x\|^2\big]$, cf. (5.1). Then:
1. We have $\|\nabla\phi_L(x)\| = S_X(x, \nabla\phi_L(x), 2L) = W_X(x, \nabla\phi_L(x), 2L)$ for any $x \in X$.
2. We have $\nabla\phi_{L_{xx}}(\widehat x) = 2L_{xx}(\widehat x - x^+)$, where $x^+ = \operatorname{argmin}_{x'\in X}\big[\phi(x') + L_{xx}\|x'-\widehat x\|^2\big]$. Thus $\|\nabla\phi_{L_{xx}}(\widehat x)\| \le \varepsilon_x$ implies $\|x^+ - \widehat x\| \le \varepsilon_x/(2L_{xx})$ and $\min_{\xi\in\partial\phi(x^+)} S_X(x^+, \xi, L_{xx}) \le \varepsilon_x$.

This result is proved in the appendix. Proposition 5.1 motivates the task of finding a point $\widehat x$ with a small gradient norm of the Moreau envelope, $\|\nabla\varphi_{L_{xx}}(\widehat x)\|$. The recent work [34] proposes an algorithm that directly produces such a point in $O(\varepsilon_x^{-3})$ first-order oracle calls in the primally-unconstrained setup ($X = \mathcal X$). The work [19] uses a different approach: they first find an $(\varepsilon_x, \varepsilon_y)$-FNE for (1.1) with respect to the weak criterion, i.e., $(\widehat x, \widehat y)$ such that (2.2) holds with $S_X, S_Y$ replaced by $W_X, W_Y$ respectively. Then they use the result [18, Prop. 4.12], which claims to guarantee (again in the case $X = \mathcal X$) that $\|\nabla\varphi_{L_{xx}}(\widehat x)\| = O(\varepsilon_x)$ whenever $\varepsilon_y = O(\varepsilon_x^2)$. Let us rephrase the claim in [18].
Proposition 5.2. Assuming (2.1) and $X = \mathcal X$, one has
$$(5.2)\qquad \|\nabla\varphi_{L_{xx}}(\widehat x)\| = O\Big(\|\nabla_x F(\widehat x,\widehat y)\| + \sqrt{L_{xx}R_y\widehat W_y}\Big),$$
where $\widehat W_y := W_Y(\widehat y, -\nabla_y F(\widehat x,\widehat y), L_{yy})$. In particular, $\|\nabla\varphi_{L_{xx}}(\widehat x)\| = O(\varepsilon_x)$ provided that $\|\nabla_x F(\widehat x,\widehat y)\| \le \varepsilon_x$ and $\widehat W_y \le \varepsilon_x^2/(L_{xx}R_y)$.

Inspecting the results in [19], we conclude that their algorithm outputs an $(\varepsilon_x, \varepsilon_y)$-FNE (with respect to the weak measure $W_Y$ in $y$) in $\widetilde O(T_xT_y)$ oracle calls. By Proposition 5.2, this translates to finding an $\varepsilon_x$-stationary point for the Moreau envelope in
$$(5.3)\qquad \widetilde O\bigg(\frac{\Delta L_{xx}^{3/2} L_{yy}^{+\,1/2} R_y}{\varepsilon_x^{3}}\bigg)$$
oracle calls. However, our careful inspection of the proof of [18, Prop. 4.12] only allowed us to verify (5.2) in the unconstrained case $Y = \mathcal Y$ (replacing $R_y$ with the distance $\|\widehat y - y^o\|$ for some $y^o \in \operatorname{Argmax}_{y\in Y} F(\widehat x, y)$), so that $\widehat W_y$ becomes $\|\nabla_y F(\widehat x,\widehat y)\|$. The underlying issue is that the proof relies on the bound
$$(5.4)\qquad F(\widehat x, y^o) - F(\widehat x, \widehat y) \le \langle\widehat\zeta_y, y^o - \widehat y\rangle,$$
where $\widehat\zeta_y := L_{yy}\big(\Pi_Y[\widehat y + \tfrac{1}{L_{yy}}\nabla_y F(\widehat x,\widehat y)] - \widehat y\big)$ is the negative proximal gradient of $-F(\widehat x, \cdot)$ at $\widehat y$; this gives the term $\sqrt{L_{xx}R_y\widehat W_y}$ in (5.2) by the Cauchy-Schwarz inequality. However, (5.4) can be invalid when $Y \ne \mathcal Y$. (In fact, our attempts to obtain an alternative proof of Proposition 5.2 without using (5.5) have failed.) The following result allows us to rectify this.
Lemma 5.3. Let $h: \mathcal Y \to \mathbf R$ be differentiable and concave, and define $\zeta_L(y) := L\big(\Pi_Y[y + \tfrac1L\nabla h(y)] - y\big)$ for $L > 0$. Then for any $y, y' \in Y$ and $L > 0$, one has
$$(5.5)\qquad h(y') - h(y) \le \langle\zeta_L(y), y'-y\rangle + \frac{1}{2L}\big[S_Y(y, -\nabla h(y), L)^2 - W_Y(y, -\nabla h(y), L)^2\big].$$

Remark 5.4. When $Y = \mathcal Y$, the second term in the right-hand side of (5.5) vanishes, and we recover (5.4) by putting $h(y) = F(\widehat x, y)$. Meanwhile, in the constrained case (5.5) is tight up to a constant factor; moreover, (5.4) can be violated with arbitrarily large gap due to the additional term in the right-hand side, which can be arbitrarily large (while the inner-product term remains fixed). Indeed, consider the following problem for $a \ge 1$:
$$\max_{y\in[-1,0]}\Big[h(y) := -\tfrac12(y-a)^2\Big].$$
Clearly, $y^o = 0$ is the unique maximizer of $h(\cdot)$ on $Y = [-1,0]$, and $L = 1$ (which corresponds to the smoothness of $h$). For $a \ge 1$ and $\varepsilon \in [0,1]$, the point $\widehat y = -\varepsilon$ satisfies
$$h(y^o) - h(\widehat y) = \tfrac{\varepsilon^2}{2} + a\varepsilon,\qquad \zeta_L(\widehat y) = \varepsilon,\qquad \langle\zeta_L(\widehat y), y^o - \widehat y\rangle = \varepsilon^2,$$
$$W_Y(\widehat y, -\nabla h(\widehat y), L) = \varepsilon,\qquad S_Y(\widehat y, -\nabla h(\widehat y), L)^2 = 2a\varepsilon + \varepsilon^2.$$
Thus, for this instance (5.5) with $y' = y^o$ and $y = \widehat y$ is almost attained: the left-hand side equals $\frac{\varepsilon^2}{2} + a\varepsilon$ and the right-hand side $\varepsilon^2 + a\varepsilon$. Moreover, the term $\frac{1}{2L}[S_Y(\widehat y, -\nabla h(\widehat y), L)^2 - W_Y(\widehat y, -\nabla h(\widehat y), L)^2] = a\varepsilon$ can be made arbitrarily large (by increasing $a$) without changing the term $\langle\zeta_L(\widehat y), y^o - \widehat y\rangle = \varepsilon^2$.

As we noted before, the additional term in (5.5) seems to invalidate Proposition 5.2, and thus the complexity estimate (5.3). Indeed, (5.2) in fact gains the additional term under $O(\cdot)$ in the right-hand side, and this term can be arbitrarily large even when $\widehat W_y \le \varepsilon_y$. Fortunately, the complexity estimate (5.3) can be obtained in the fully constrained setup, by using that Algorithm 4.2 produces an $(\varepsilon_x, \varepsilon_y)$-FNE in the strong sense (cf. Definition 2.1).
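The gap in Remark 5.4 is easy to probe numerically; the following sketch (our own helper code) evaluates both sides of (5.5) and the two stationarity measures on the instance above, showing $W$ fixed at $\varepsilon$ while $S$ grows with $a$.

```python
import numpy as np

def proj(y, lo=-1.0, hi=0.0):
    return min(max(y, lo), hi)

def measures(a, eps, L=1.0):
    """Quantities of Remark 5.4 for h(y) = -(y - a)^2 / 2 on Y = [-1, 0]."""
    y_hat, y_opt = -eps, 0.0
    grad = a - y_hat                                  # h'(y_hat)
    zeta = L * (proj(y_hat + grad / L) - y_hat)       # proximal gradient (= eps)
    W = abs(zeta)
    # S^2 = 2L * max_{w in Y} [ h'(y_hat)*(w - y_hat) - (L/2)*(w - y_hat)^2 ]
    ws = np.linspace(-1.0, 0.0, 100001)
    S2 = 2 * L * np.max(grad * (ws - y_hat) - (L / 2) * (ws - y_hat) ** 2)
    lhs = (-(y_opt - a) ** 2 / 2) - (-(y_hat - a) ** 2 / 2)   # h(y^o) - h(y_hat)
    rhs = zeta * (y_opt - y_hat) + (S2 - W ** 2) / (2 * L)
    return lhs, rhs, W, np.sqrt(S2)

for a in (1.0, 10.0, 100.0):
    lhs, rhs, W, S = measures(a, eps=0.01)
    print(f"a={a:6.1f}: lhs={lhs:.4f}  rhs={rhs:.4f}  W={W:.4f}  S={S:.4f}")
```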
Proposition 5.5. Assume (2.1), and let $\widehat W_y$ be as defined in Proposition 5.2. Then
$$(5.6)\qquad \|\nabla\varphi_{L_{xx}}(\widehat x)\| = O\Big(S_X(\widehat x, \nabla_x F(\widehat x,\widehat y), L_{xx}) + \sqrt{L_{xx}R_y\widehat W_y} + \sqrt{L_{xx}(\widehat S_y^2 - \widehat W_y^2)/L_{yy}}\Big),$$
where $\widehat S_y := S_Y(\widehat y, -\nabla_y F(\widehat x,\widehat y), L_{yy})$. As a result, $\|\nabla\varphi_{L_{xx}}(\widehat x)\| = O(\varepsilon_x)$ whenever $(\widehat x,\widehat y)$ is an $(\varepsilon_x,\varepsilon_y)$-FNE for (1.1), in the sense of (2.2), with
$$(5.7)\qquad \varepsilon_y \le \min\bigg[\frac{\varepsilon_x^2}{L_{xx}R_y},\ \varepsilon_x\sqrt{\frac{L_{yy}}{L_{xx}}}\bigg].$$
Recalling Theorem 4.1 (cf. (2.8)), we conclude that Algorithm 4.2, when run with $\varepsilon_y$ set to the right-hand side of (5.7), produces $\widehat x \in X$ such that $\|\nabla\varphi_{L_{xx}}(\widehat x)\| = O(\varepsilon_x)$ in
$$(5.8)\qquad \widetilde O\bigg(\max\bigg[\frac{\Delta L_{xx}^{3/2} L_{yy}^{+\,1/2} R_y}{\varepsilon_x^{3}},\ \frac{\Delta L_{xx}^{5/4} L_{yy}^{+\,1/2} R_y^{1/2}}{\varepsilon_x^{5/2} L_{yy}^{1/4}}\bigg]\bigg)$$
oracle calls.
We thank Babak Barazandeh for technical discussions, for discovering the issue with [18, Prop. 4.12] (cf. (5.4)), and for sketching the proof of (5.5).
References.
[1] S. Baharlouei, M. Nouiehed, A. Beirami, and M. Razaviyayn, Renyi fair inference, in 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020, 2020.
[2] K. Ball, E. A. Carlen, and E. H. Lieb, Sharp uniform convexity and smoothness inequalities for trace norms, Inventiones mathematicae, 115 (1994), pp. 463-482.
[3] B. Barazandeh and M. Razaviyayn, Solving non-convex non-differentiable min-max games using proximal gradient method, arXiv preprint arXiv:2003.08093, (2020).
[4] H. H. Bauschke, J. Bolte, and M. Teboulle, A descent lemma beyond Lipschitz gradient continuity: first-order methods revisited and applications, Mathematics of Operations Research, 42 (2016), pp. 330-348.
[5] J. Borwein, A. Guirao, P. Hájek, and J. Vanderwerff, Uniformly convex functions on Banach spaces, Proceedings of the American Mathematical Society, 137 (2009), pp. 1081-1091.
[6] S. Bubeck, Convex optimization: Algorithms and complexity, Foundations and Trends in Machine Learning, 8 (2015), pp. 231-357.
[7] Y. Carmon, J. C. Duchi, O. Hinder, and A. Sidford, Lower bounds for finding stationary points I, Mathematical Programming, (2017), pp. 1-50.
[8] B. Dai, A. Shaw, L. Li, L. Xiao, N. He, Z. Liu, J. Chen, and L. Song, SBEED: Convergent reinforcement learning with nonlinear function approximation, in International Conference on Machine Learning, 2018, pp. 1133-1142.
[9] C. D. Dang and G. Lan, Stochastic block mirror descent methods for nonsmooth and stochastic optimization, SIAM Journal on Optimization, 25 (2015), pp. 856-881.
[10] J. M. Danskin, The theory of max-min, with applications, SIAM Journal on Applied Mathematics, 14 (1966), pp. 641-664.
[11] O. Devolder, Stochastic first order methods in smooth convex optimization, CORE discussion paper, (2011).
[12] O. Devolder, F. Glineur, and Y. Nesterov, First-order methods of smooth convex optimization with inexact oracle, Mathematical Programming, 146 (2014), pp. 37-75.
[13] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, Generative adversarial nets, in Advances in Neural Information Processing Systems, 2014, pp. 2672-2680.
[14] C. Jin, P. Netrapalli, and M. I. Jordan, What is local optimality in nonconvex-nonconcave minimax optimization?, arXiv:1902.00618v2, (2019).
[15] A. Juditsky and A. Nemirovski, First-order methods for nonsmooth convex large-scale optimization, I: General purpose methods, Optimization for Machine Learning, (2011), pp. 121-148.
[16] H. Karimi, J. Nutini, and M. Schmidt, Linear convergence of gradient and proximal-gradient methods under the Polyak-Łojasiewicz condition, in Joint European Conference on Machine Learning and Knowledge Discovery in Databases, Springer, 2016, pp. 795-811.
[17] W. Kong and R. D. Monteiro, An accelerated inexact proximal point method for solving nonconvex-concave min-max problems, arXiv:1905.13433, (2019).
[18] T. Lin, C. Jin, and M. I. Jordan, On gradient descent ascent for nonconvex-concave minimax problems, arXiv preprint arXiv:1906.00331, (2019).
[19] T. Lin, C. Jin, and M. I. Jordan, Near-optimal algorithms for minimax optimization, arXiv preprint arXiv:2002.02417, (2020).
[20] S. Lu, I. Tsaknakis, and M. Hong, Block alternating optimization for non-convex min-max problems: algorithms and applications in signal processing and communications, in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, 2019, pp. 4754-4758.
[21] A. Madry, A. Makelov, L. Schmidt, D. Tsipras, and A. Vladu, Towards deep learning models resistant to adversarial attacks, arXiv preprint arXiv:1706.06083v4, (2019).
[22] A. Nemirovski, Prox-method with rate of convergence O(1/t) for variational inequalities with Lipschitz continuous monotone operators and smooth convex-concave saddle point problems, SIAM Journal on Optimization, 15 (2004), pp. 229-251.
[23] A. Nemirovski and D. Yudin, Problem complexity and method efficiency in optimization, Wiley, Chichester, 1983.
[24] Y. Nesterov, Smooth minimization of non-smooth functions, Mathematical Programming, 103 (2005), pp. 127-152.
[25] Y. Nesterov, How to make the gradients small, Optima. Mathematical Optimization Society Newsletter, (2012), pp. 10-11.
[26] Y. Nesterov, Gradient methods for minimizing composite functions, Mathematical Programming, 140 (2013), pp. 125-161.
[27] Y. Nesterov, Introductory lectures on convex optimization: A basic course, vol. 87, Springer Science & Business Media, 2013.
[28] Y. Nesterov and A. Nemirovski, On first-order algorithms for ℓ1/nuclear norm minimization, Acta Numerica, 22 (2013), pp. 509-575.
[29] M. Nouiehed, M. Sanjabi, J. D. Lee, and M. Razaviyayn, Solving a class of non-convex min-max games using iterative first order methods, arXiv preprint arXiv:1902.08297, (2019).
[30] D. Ostrovskii and Z. Harchaoui, Efficient first-order algorithms for adaptive signal denoising, in Proceedings of the 35th ICML Conference, vol. 80, 2018, pp. 3946-3955.
[31] Y. Ouyang and Y. Xu, Lower complexity bounds of first-order methods for convex-concave bilinear saddle-point problems, arXiv preprint arXiv:1808.02901, (2018).
[32] R. T. Rockafellar, Convex analysis, Princeton University Press, 2015.
[33] M. Sion, On general minimax theorems, Pacific Journal of Mathematics, 8 (1958), pp. 171-176.
[34] K. K. Thekumparampil, P. Jain, P. Netrapalli, and S. Oh, Efficient algorithms for smooth minimax optimization, arXiv:1907.01543, (2019).
[35] R. Zhao, A primal dual smoothing framework for max-structured nonconvex optimization, arXiv preprint arXiv:2003.04375, (2020).
Appendix A. Deferred proofs.
A.1. Verification of (4.23).
By concavity and $(L_{yy}^+ + \lambda_y)$-smoothness of $\psi_t$,
$$0 \le -\psi_t(y') + \psi_t(y) + \langle\nabla\psi_t(y), y'-y\rangle \le \frac{L_{yy}^+ + \lambda_y}{2}\|y'-y\|^2,\quad \forall y,y'\in Y.$$
By (4.15), (4.19), and (4.22), $-\delta/2 \le \widetilde\psi_t^\delta(y) - \psi_t(y) \le \delta/2$ for all $y \in Y$. On the other hand, by the second part of (4.18),
$$\|\widetilde\nabla\psi_t(y) - \nabla\psi_t(y)\| = \|\partial_y F(\widetilde x_t(y), y) - \partial_y F(\widehat x_t(y), y)\| \le L_{xy}\|\widehat x_t(y) - \widetilde x_t(y)\| \le \frac{\delta}{4R_y};$$
hence, as $\|y'-y\| \le 2R_y$ for $y', y \in Y$, we get $-\delta/2 \le \langle\widetilde\nabla\psi_t(y) - \nabla\psi_t(y), y'-y\rangle \le \delta/2$. We obtain (4.23) by summing up the two-sided inequalities above.
A.2. Verification of (4.20).
Let $\varphi_t(x) = \max_{y\in Y} F^{\mathrm{reg}}_t(x,y)$ be the primal function of the saddle-point problem in (4.2). Then $\varphi_t(x) - 2L_{yy}R_y^2 \le F_t(x,y) \le \varphi_t(x)$, by bounding the variation of the smooth function $F(x,\cdot)$ over $y \in Y$, whence
$$(A.1)\qquad F_t(x_{t-1}, y) - \min_{x\in X}F(x,y) \le \varphi_t(x_{t-1}) - \min_x\varphi_t(x) + 2L_{yy}R_y^2 \le \varphi_t(x_{t-1}) - \min_{x\in X}\varphi(x) + 2L_{yy}R_y^2 + 2\varepsilon_yR_y,$$
where we used that $F^{\mathrm{reg}}_t(x,y) \ge F(x,y) - 2\varepsilon_yR_y$. Thus, it only remains to prove that $\varphi_t(x_{t-1})$ decreases in $t$ up to a certain error (since $\varphi_1(x_0) \le \varphi(x_0)$). To this end, we proceed by induction. The base is obvious: (4.20) is satisfied when $t = 1$, since $\varphi_1(x_0) - 2L_{yy}R_y^2 \le F^{\mathrm{reg}}_1(x_0, y_1) \le \varphi_1(x_0)$ and $\varphi_1(x_0) \ge \varphi(x_0) - 2\varepsilon_yR_y$. Now, assume that (4.20) was satisfied at steps $\tau \in [t-1]$. Then, by $1^o$ of the proof of Theorem 4.1, in all these previous steps, including step $t-1$, the saddle-point problem (4.2) has been solved up to accuracy $O(\varepsilon_x^2/L_{xx})$ in the primal gap:
$$(A.2)\qquad \varphi_\tau(x_\tau) - \min_x\varphi_\tau(x) \le O\Big(\frac{\varepsilon_x^2}{L_{xx}}\Big),\quad \tau \in [t-1],$$
cf. (4.32). On the other hand, one can easily see that $\varphi_\tau(x_{\tau-1}) \le \varphi_{\tau-1}(x_{\tau-1})$ for all $\tau \in [T_x]$, cf. (4.2). Combining the two inequalities sequentially, we get
$$\varphi_t(x_{t-1}) \le \varphi_{t-1}(x_{t-1}) + O\Big(\frac{\varepsilon_x^2}{L_{xx}}\Big) \le \cdots \le \varphi_1(x_0) + O\Big(\frac{T_x\varepsilon_x^2}{L_{xx}}\Big) \le \varphi(x_0) + 2\Delta + 4\varepsilon_yR_y.$$
Combining this with (A.1), we arrive at (4.20).
A.3. Proof of Proposition 5.1.
First observe that $\nabla\phi_L(\widehat x) = 2L(\widehat x - x^+)$ for any $\widehat x \in X$, by Danskin's theorem. Thus, $x^+ = \widehat x - \frac{1}{2L}\nabla\phi_L(\widehat x)$. This implies the first claim of the proposition: indeed, $x^+ \in X$, so that $x^+ = \Pi_X(x^+)$, and hence $S_X(\widehat x, \nabla\phi_L(\widehat x), 2L) = W_X(\widehat x, \nabla\phi_L(\widehat x), 2L) = \|\nabla\phi_L(\widehat x)\|$. The first part of the second claim is obvious. For the second part, note that (5.1) is a convex minimization problem by the weak convexity of $\phi$, and the first-order optimality condition for it reads
$$(A.3)\qquad \exists\,\xi \in \partial\phi(x^+):\quad \langle\xi + 2L(x^+ - \widehat x), x - x^+\rangle \ge 0,\quad \forall x\in X.$$
For such $\xi$, and assuming that $\|\nabla\phi_L(\widehat x)\| \le \varepsilon_x$, we have
$$S_X(x^+, \xi, L)^2 = 2L\max_{x\in X}\big[-\langle\xi, x-x^+\rangle - \tfrac{L}{2}\|x-x^+\|^2\big] \le 2L\max_{x\in X}\big[2L\langle x^+-\widehat x, x-x^+\rangle - \tfrac{L}{2}\|x-x^+\|^2\big] \le 4L^2\|x^+-\widehat x\|^2 \le \varepsilon_x^2.$$
Here we first used (A.3) and then the Cauchy-Schwarz inequality.
A.4. Proof of Lemma 5.3.
By concavity, $h(y') - h(y) \le \langle\nabla h(y), y'-y\rangle$; thus, it suffices to prove that
$$(A.4)\qquad \langle y'-y, \zeta\rangle \le L\langle y'-y, y^+-y\rangle + \frac{1}{2L}\big[S_Y(y, -\zeta, L)^2 - W_Y(y, -\zeta, L)^2\big],$$
where $y^+ := \Pi_Y(y + \tfrac1L\zeta)$, for arbitrary $y', y \in Y$, $\zeta \in \mathcal Y$, and $L > 0$. Indeed, (5.5) then follows by applying (A.4) to $\zeta = \nabla h(y)$, so that $\zeta_L(y) = L(y^+ - y)$. Now, observe that
$$\langle y'-y, \zeta\rangle = L\langle y'-y, y^+-y\rangle + \langle y'-y, \zeta - L(y^+-y)\rangle,$$
and $\langle y'-y, \zeta - L(y^+-y)\rangle \le \langle y^+-y, \zeta - L(y^+-y)\rangle$ by the projection lemma (see, e.g., [6, Lem. 3.1]). Finally,
$$\langle y^+-y, \zeta - L(y^+-y)\rangle = \langle\zeta, y^+-y\rangle - L\|y^+-y\|^2 \le \max_{w\in Y}\Big[\langle\zeta, w-y\rangle - \tfrac{L}{2}\|w-y\|^2\Big] - \tfrac{L}{2}\|y^+-y\|^2 = \frac{1}{2L}\big[S_Y(y, -\zeta, L)^2 - W_Y(y, -\zeta, L)^2\big].$$

A.5. Proof of Proposition 5.5.
By the second claim of Proposition 5.1, we have $\nabla\varphi_{L_{xx}}(\widehat x) = 2L_{xx}(\widehat x - x^+)$; thus we can focus on bounding $L_{xx}\|\widehat x - x^+\|$. To this end, the $L_{xx}$-strong convexity of the function $\varphi(\cdot) + L_{xx}\|\cdot - \widehat x\|^2$ (minimized at $x^+$) yields
$$\frac{L_{xx}}{2}\|\widehat x - x^+\|^2 \le \varphi(\widehat x) - \varphi(x^+) - L_{xx}\|x^+ - \widehat x\|^2.$$
Moreover, we clearly have $\varphi(\widehat x) - \varphi(x^+) \le F(\widehat x, y^o) - F(\widehat x, \widehat y) + F(\widehat x, \widehat y) - F(x^+, \widehat y)$ for $y^o \in Y$ such that $F(\widehat x, y^o) = \varphi(\widehat x)$. Now, by the descent lemma (due to (2.1)), we get
$$F(\widehat x, \widehat y) - F(x^+, \widehat y) - L_{xx}\|x^+ - \widehat x\|^2 \le -\langle\nabla_x F(\widehat x,\widehat y), x^+ - \widehat x\rangle - \frac{L_{xx}}{2}\|x^+ - \widehat x\|^2 \le \frac{S_X(\widehat x, \nabla_x F(\widehat x,\widehat y), L_{xx})^2}{2L_{xx}}.$$
On the other hand, applying Lemma 5.3 to $h(\cdot) = F(\widehat x, \cdot)$ with $L = L_{yy}$ results in
$$F(\widehat x, y^o) - F(\widehat x, \widehat y) \le 2R_y\widehat W_y + \frac{1}{2L_{yy}}\big[\widehat S_y^2 - \widehat W_y^2\big].$$
Combining the results obtained so far, we arrive at (5.6). The second claim follows by using that $\widehat W_y \le \widehat S_y$, and requiring that $\max\big[\widehat S_yL_{xx}R_y,\ \widehat S_y^2L_{xx}/L_{yy}\big] \le \varepsilon_x^2$.

Appendix B. Extension to non-Euclidean geometries.
Here we do not assume the norm $\|\cdot\|$ to be Euclidean (unless explicitly stated).
B.1. Near-stationary points of a convex function.
The first challenge when extending Algorithm 4.2 to the non-Euclidean setup arises already in the sub-problem of finding a near-stationary point of a smooth convex function; therefore, we first focus on this problem in isolation. Given a norm $\|\cdot\|$ on $\mathbf R^d$ and its dual norm $\|\cdot\|_*$, consider the problem of finding an $\varepsilon$-first-order-stationary point $\widehat z \in \mathbf R^d$ of a function $f: \mathbf R^d \to \mathbf R$, i.e., such that $\|\nabla f(\widehat z)\|_* \le \varepsilon$. (We first focus on the unconstrained setup for the sake of simplicity; the case where $z$ lives on a "simple" convex body in $\mathbf R^d$ can be treated in a similar vein, and is postponed to Appendix B.2.) We assume that $f$ is convex and has $L$-Lipschitz gradient with respect to $\|\cdot\|$, i.e.,
$$\|\nabla f(z') - \nabla f(z)\|_* \le L\|z'-z\|,\quad \forall z', z \in \mathbf R^d,$$
and that $\widehat z$ belongs to the origin-centered $\|\cdot\|$-norm ball with radius $R$.

Recall that, in the Euclidean case, the recipe of Nesterov [25] is to add the regularizer $r_\varepsilon(z) = \frac{\varepsilon}{R}\|z\|^2$, observing that the regularized function $f_\varepsilon$ has two properties:
(i) $f_\varepsilon$ has $(L + 2\varepsilon/R)$-Lipschitz gradient (since $r_\varepsilon(z)$ has $\frac{2\varepsilon}{R}$-Lipschitz gradient) and is $\frac{2\varepsilon}{R}$-strongly convex (since $r_\varepsilon(z)$ is strongly convex).
(ii) The gradient $\nabla f_\varepsilon$ uniformly approximates $\nabla f$ with respect to $\|\cdot\|_* = \|\cdot\|$:
$$(B.1)\qquad \|\nabla f_\varepsilon(z) - \nabla f(z)\| = \frac{2\varepsilon\|z\|}{R} \le 2\varepsilon \quad\text{for } \|z\| \le R.$$
Property (ii) allows one to search for approximate stationary points of $f_\varepsilon$ instead of $f$, whereas (i) guarantees that restarted FGM (Algorithm 3.2) finds such a point in $\widetilde O(\sqrt\kappa)$ queries of $\nabla f(\cdot)$ in total, where $\kappa = O(LR/\varepsilon)$, which results in the complexity bound
$$(B.2)\qquad T_\varepsilon = \widetilde O\big(\sqrt{LR/\varepsilon}\big).$$
This complexity bound is optimal up to a logarithmic factor in the Euclidean case [25].

In the setup with a non-Euclidean proximal geometry, one would expect the complexity bound (B.2) to be preserved. More precisely, assume that the norm $\|\cdot\|$, now not necessarily Euclidean, admits a distance-generating function (d.-g. f.) $\omega: \mathcal Z \to \mathbf R$ replacing the squared norm $\|\cdot\|^2$ of the Euclidean case, with the following three properties (see, e.g., [15, 28, 30] and references therein):
1) The function $\omega(\cdot)$ is convex, admits a continuous selection of subgradients (denoted $\nabla\omega(z)$ later on), and has strong convexity modulus 1 with respect to $\|\cdot\|$.
2) One can easily solve (explicitly or to high accuracy) optimization problems of the form $\min_{z'}[\langle\zeta, z'\rangle + \omega(z')]$, where $\zeta$ is an arbitrary linear form (i.e., an element of the dual space identified with $\mathbf R^d$ by the Riesz theorem), and $\langle\cdot,\cdot\rangle$ is the duality pairing (identified with the canonical dot product on $\mathbf R^d$). Equivalently, one requires computational tractability of the problem
$$\min_{z'\in\mathbf R^d}\big[\langle\zeta, z'\rangle + D_\omega(z', z)\big],$$
where $D_\omega(z', z) := \omega(z') - \omega(z) - \langle\nabla\omega(z), z'-z\rangle$ is the Bregman divergence generated by $\omega$. The fulfillment of these requirements is guaranteed by working with d.-g. f.'s that are coordinate-separable (such as the entropy on the non-negative orthant, or $\|\cdot\|_p^p$ for $p \ge 1$) or "quasi-separable" (e.g., compositions of a separable function and a monotone map on $\mathbf R$), such as $\|\cdot\|_p^2$ with $p \ge 1$.
3) $\omega$ is minimized at the origin (and $\nabla\omega(0) = 0$ is included in the continuous selection of subgradients), and satisfies the following quadratic growth condition: the $\omega$-radius functional $\Omega[\cdot]$, defined as
$$\Omega[Z] := \max_{z\in Z}\omega(z) - \min_{z'\in Z}\omega(z')$$
for compact subsets of $\mathbf R^d$, satisfies
$$(B.3)\qquad \Omega[Z_r(0)] \le r^2\,\widetilde O_d(1),\quad \forall r \ge 0,$$
where $Z_r(z_0) := \{z': \|z'-z_0\| \le r\}$, and $\widetilde O_d(1)$ is a logarithmic factor in $d$. In other words, $\Omega[Z_r]$ grows as the squared radius of the $\|\cdot\|$-ball, mimicking the squared norm $\|\cdot\|^2$ in this respect. Note also that the same bound holds for the divergence: $D_\omega(z, 0) \le \Omega[Z_r(0)]$ for all $z \in Z_r(0)$. Moreover, these conditions can be "re-centered" at an arbitrary point $z_0$ by replacing $\omega(\cdot)$ with the shifted d.-g. f.
$$(B.4)\qquad \omega_{z_0}(\cdot) := \omega(\cdot - z_0),$$
which is minimized at $z_0$ and satisfies (B.3) with $Z_r(z_0)$ instead of $Z_r(0)$; the previous properties hold for $\omega_{z_0}$ as well. Here we note that the "slow growth" property is not required to obtain convergence guarantees in terms of the $\omega$-radius; rather, it is needed to "translate" such guarantees into guarantees in terms of the $\|\cdot\|$-norm distance to the optimum. Another remark is that the balls $Z_r$ here are only allowed to be centered at the origin (i.e., at the minimum of $\omega$), which makes the condition significantly less restrictive than that in [9], where (B.3) is required to hold for balls with arbitrary centers, not only those centered at the d.-g. f. minimizer. Note that the latter condition implies the Lipschitzness of $\nabla\omega$ with respect to $\|\cdot\|$, whereas the former does not; we will revisit this circumstance in Appendix B.3.

We call any d.-g. f. satisfying the above three properties compatible with $\|\cdot\|$. Whenever one can find a compatible d.-g. f., the usual recipe is to modify the "Euclidean" algorithm by replacing the Euclidean prox-mapping (3.1) with its generalization
$$(B.5)\qquad \mathrm{prox}_{z,\omega}(\zeta) := \operatorname{argmin}_{z'}\big[\langle\zeta, z'\rangle + D_\omega(z', z)\big],$$
which corresponds to replacing the gradient descent step with the so-called mirror descent step ([23]): "steepest descent" with respect to the d.-g. f., which takes into account the geometry of $\|\cdot\|$. For many standard primitives in convex optimization, such a recipe results in the desirable outcome: the distance to the optimum $R$ and the Lipschitz constant $L$ get replaced with their $\|\cdot\|$-norm counterparts. In particular, this is the case for FGM (Algorithm 3.1), as Theorem 3.2 generalizes almost verbatim.
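For instance, with the entropy d.-g. f. on the positive orthant (one of the separable examples above), the generalized prox-mapping (B.5) is available in closed form; a minimal sketch under this choice of $\omega$ (the helper name is ours):

```python
import numpy as np

def prox_entropy(z, zeta):
    """Generalized prox-mapping (B.5) for the entropy d.-g. f.
    omega(z) = sum_i z_i*log(z_i) on the positive orthant:
    setting the gradient of <zeta, z'> + D_omega(z', z) to zero gives
    z'_i = z_i * exp(-zeta_i), a multiplicative (geometry-aware) step."""
    return z * np.exp(-zeta)

z = np.array([0.5, 1.0, 2.0])
zeta = np.array([0.1, -0.2, 0.3])
print(prox_entropy(z, zeta))
```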
Theorem B.1 ([12, Thm. 5 and Eq. (42)]). Assume $f$ is convex, has $L$-Lipschitz gradient with respect to the norm $\|\cdot\|$, cf. (B.1), and is minimized at $z^*$ such that $\|z^* - z_0\| \le R$. Consider running Algorithm 3.1, with the prox-mappings in lines 4 and 8 replaced by the generalized prox-mapping (B.5) with respect to the d.-g. f. $\omega_{z_0}$ (a compatible d.-g. f. $\omega$ re-centered at $z_0$, cf. (B.4)), with stepsize $\gamma = 1/L$ and a $\delta$-inexact oracle in the sense of Definition 3.1 (with $\|\cdot\|$ being the given norm). Then
$$(B.6)\qquad f(z_T) - f(z^*) \le \frac{L\Omega}{T^2} + 2\delta T \le \widetilde O_d(1)\,\frac{LR^2}{T^2} + 2\delta T,$$
where $\Omega := \Omega[Z_R(z_0)]$ is the $\omega$-radius of the $z_0$-centered ball containing $z^*$. Thus,
$$(B.7)\qquad f(z_T) - f(z^*) \le \frac{2L\Omega}{T^2} \le \widetilde O_d(1)\,\frac{LR^2}{T^2}\quad\text{whenever}\quad \delta \le \delta_T := \frac{L\Omega}{2T^3}.$$

Returning to our problem of finding a near-stationary point of $f(\cdot)$, the reasonable approach would be to regularize $f$ with the term
$$(B.8)\qquad r_\varepsilon(z) = \frac{\varepsilon\,\omega(z)}{\sqrt\Omega},$$
which reduces to $\frac{\varepsilon}{R}\|z\|^2$ (up to a constant factor) in the Euclidean setup with $\frac12\|\cdot\|^2$ used as the d.-g. f. However, we immediately see that neither of the properties (i), (ii) remains valid.
- Indeed, while the regularized function $f_\varepsilon(z)$ is strongly convex with respect to $\|\cdot\|$, its gradient can be non-Lipschitz: in fact, the existence of functions that are strongly convex and smooth at the same time, with near-constant condition number, is quite special to the Euclidean norm.
- As for property (ii), it is again a "fortunate coincidence" that in the Euclidean case $\nabla\omega(z) \equiv z$ and $\|\cdot\|_* \equiv \|\cdot\|$, whence $\|\nabla r_\varepsilon(z)\|_* = O(\varepsilon)$ on $Z_R(0)$.
The first of these issues is easy to fix: instead of treating $f_\varepsilon$ as a smooth function, which it is not anymore, one can treat it as a composite function with $L$-smooth part $f$ and a non-smooth but "simple" term $r_\varepsilon$, simplicity being guaranteed by the compatibility of $\omega$. As such, one can exploit the "tolerance" of Algorithm 3.1 to such composite objectives: one can use the inexact gradient oracle for $f$, rather than for $f_\varepsilon$, instead incorporating $r_\varepsilon$ into the prox-mapping, i.e., replacing (B.5) with
$$\mathrm{prox}_{z,\omega,\varepsilon}(\zeta) := \operatorname{argmin}_{z'}\Big[\langle\zeta, z'\rangle + D_\omega(z', z) + \frac{\varepsilon}{\sqrt\Omega}D_\omega(z', z_0)\Big].$$
As shown in [11, Sec. 6.3 and Thm. 8], Theorem B.1 generalizes to this most general setup: the guarantees (B.6)-(B.7) remain valid, with $f$ in the left-hand side replaced by $f_\varepsilon$, and $z^*$ being the minimizer of $f_\varepsilon$. As a result, using the strong convexity of $f_\varepsilon$, we can proceed with the same restart scheme as before (Algorithm 3.2). We now state the appropriate modification of Corollary 3.3.
Corollary B.2. Let $f_{\lambda\sqrt\Omega}$ be the composite function given by
$$f_{\lambda\sqrt\Omega}(z) := f(z) + \lambda D_\omega(z, z_0),$$
with $\lambda \ge 0$ and $f$ having $L$-Lipschitz gradient with respect to $\|\cdot\|$. Given $\varepsilon > 0$, run Algorithm 3.2 on $f_{\lambda\sqrt\Omega}$ with $\gamma = 1/L$, parameters $T, S$ satisfying
$$(B.9)\qquad T \ge \sqrt{\widetilde O_d(1)\,L/\lambda},\qquad S \ge \log_2\big(L\sqrt\Omega/\varepsilon\big),$$
where $\widetilde O_d(1)$ is the logarithmic factor in (B.6), and $\delta \le \delta_T$, cf. (3.3). Then $z_S$ satisfies
$$(B.10)\qquad \|z_S - z^*\| \le \frac{\varepsilon}{3L},\qquad f_{\lambda\sqrt\Omega}(z_S) - f_{\lambda\sqrt\Omega}(z^*) \le \frac{\varepsilon^2}{18L},\qquad \|\nabla f(z_S) - \nabla f(z^*)\|_* \le \frac{\varepsilon}{3}.$$
Note that, with given T , we ensure that R s := (cid:107) z s − z ∗ (cid:107) satisfies R s (B.7) (cid:54) (cid:115) λ · (cid:101) L Ω[ Z R s − ( z s − )] T (cid:54) (cid:115) (cid:101) O d (1) LR s − λT (cid:54) R s − . Here the first transition relied on the fact that ω is re-centered to z s − at s -th epoch,and the second transition used the quadratic growth condition (B.3). This givesthe first inequality in (B.10) The second inequality can be verified as in the proofof Corollary 3.3 (with Ω replacing R ), and the last one follows by smoothness.We see that the first of the two issues with regularization is solved: we simplyrun Algorithm 3.2 on f ε . Alas, the second issue is still present: while ∇ f ( z S ) approxi-mates ∇ f ( z ∗ ), where z ∗ minimizes f ε , we cannot guarantee that (cid:107)∇ f ( z ∗ ) (cid:107) ∗ is small:indeed, (cid:107)∇ f ( z ∗ ) (cid:107) ∗ (cid:54) ε is equivalent to(B.11) sup z ∈ Z R (cid:107)∇ ω ( z ) (cid:107) ∗ (cid:54) Ω , D. M. OSTROVSKII, A. LOWY, AND M. RAZAVIYAYN but this latter condition cannot be guaranteed from the compatibility properties of ω .In fact, in the constrained setup, where minimization has to be performed on a convexbody Z ⊂ R d , (B.11) breaks for the important class of Legendre d.-g. f.’s – thosewith gradients diverging on the boundary of the feasible set [4]. However, in theabsence of constraints, or for non-Legendre potentials in the constrained case, (B.11)can sometimes be guaranteed. Next we consider one such example relevant in practice.
Regularization with $\|\cdot\|_p$. Let the norm of interest be $\|\cdot\|_1$, with dual norm $\|\cdot\|_\infty$. It is well-known (see, e.g., [28]) that, for any $d \ge 3$, the function
$$(B.12)\qquad \omega(z) = C_d\|z\|_p^2\quad\text{with}\quad p = 1 + \frac{1}{\log(d)}$$
and an appropriate normalizing constant $C_d$ of order $\log d$ (see [28] for the exact expression) is a compatible d.-g. f. for $\|\cdot\|_1$; in particular, $\omega(z)$ is 1-strongly convex on $\mathbf R^d$ with respect to $\|\cdot\|_1$, and $\Omega[Z_1(0)] \le c\log(d)$ for some universal constant $c$ (with a matching lower bound). At the same time, (B.11) can be easily verified:
$$\nabla\omega(z) = 2C_d\|z\|_p^{2-p}\,z^{[p-1]},$$
where the coordinates of $z^{[p-1]} \in \mathbf R^d$ are the $(p-1)$-th powers of the coordinates of $z$ (with the signs preserved). As a result,
$$(B.13)\qquad \sup_{\|z\|_1\le1}\|\nabla\omega(z)\|_\infty = 2C_d\sup_{\|z\|_1\le1}\|z\|_p^{2-p}\|z\|_\infty^{p-1} \le 2C_d\sup_{\|z\|_1\le1}\|z\|_p \le 2\sqrt{cC_d\log(d)},$$
where we first used that $\|z\|_\infty \le \|z\|_p$, and then used the bound $\Omega[Z_1(0)] \le c\log(d)$ (noting that $C_d = \Omega[Z_1(0)]$). Thus, (B.11) is verified, so $\|\cdot\|_p$-regularization only perturbs the gradient up to $O(\varepsilon)$, up to a logarithmic factor.
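The bound (B.13) is easy to probe numerically. In the sketch below the normalization `C_d = e**2 * log(d) / 2` is only a stand-in of order $\log d$; the exact constant from [28] is not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 10_000
p = 1 + 1 / np.log(d)
C_d = np.e ** 2 * np.log(d) / 2   # hypothetical normalization of order log(d)

def grad_omega(z):
    """Gradient of omega(z) = C_d*||z||_p^2: 2*C_d*||z||_p^(2-p)*sign(z)*|z|^(p-1)."""
    norm_p = np.sum(np.abs(z) ** p) ** (1 / p)
    return 2 * C_d * norm_p ** (2 - p) * np.sign(z) * np.abs(z) ** (p - 1)

# Probe the l_1 unit sphere with directions of varying sparsity.
sup_val = 0.0
for k in (1, 10, 100, d):
    z = np.zeros(d)
    idx = rng.choice(d, size=k, replace=False)
    z[idx] = rng.choice([-1.0, 1.0], size=k) / k      # ||z||_1 = 1
    sup_val = max(sup_val, np.max(np.abs(grad_omega(z))))

print(f"sup ||grad omega||_inf over probes: {sup_val:.2f}")
print(f"2*C_d = {2 * C_d:.2f}  (the bound in (B.13))")
```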
B.2. Constrained case. We have just seen that, in the unconstrained scenario, one can indeed efficiently approximate first-order stationary points of a convex function, at least in the $\ell_1$-geometry setup. Let us now demonstrate that this result can be extended to the constrained scenario. Namely, we now incorporate into the problem a set $Z \subset \mathbf R^d$, assumed to be convex, compact, and "prox-friendly": one must be able to efficiently compute the prox-mapping with respect to $Z$, defined as
$$(B.14)\qquad \mathrm{prox}_{z,Z,\omega}(\zeta) := \operatorname{argmin}_{z'\in Z}\big[\langle\zeta, z'\rangle + D_\omega(z', z)\big];$$
note that this is satisfied when $Z$ is a "simple" set such as an $\ell_p$-ball or a simplex. Accordingly, we modify the d.-g. f. compatibility requirements, now only requiring strong convexity on $Z$. It is known from [12] that Theorem B.1 extends almost word-for-word to this setting, with the prox-mapping (B.5) replaced by (B.14), and $\Omega[Z_R]$ replaced by $\Omega = \Omega[Z]$. Furthermore, the first two inequalities in (B.10) are preserved, under the same premise (B.9). Now, let us define the natural ($\omega$-adapted) stationarity measure $S_{Z,\omega}$ by
$$(B.15)\qquad S_{Z,\omega}(z, \zeta, L)^2 := 2L\max_{z'\in Z}\big[-\langle\zeta, z'-z\rangle - L\,D_\omega(z', z)\big].$$
We can easily verify that, under (B.9), one has
$$(B.16)\qquad S_{Z,\omega}\big(z_S, \nabla f_{\lambda\sqrt\Omega}(z_S), L+\lambda\big) \le \frac{\varepsilon}{3}\sqrt{\frac{L+\lambda}{L}}.$$
Indeed, the argument mimics that in (3.7):
$$S_{Z,\omega}\big(z_S, \nabla f_{\lambda\sqrt\Omega}(z_S), L+\lambda\big)^2 = -2(L+\lambda)\min_{z\in Z}\big[\langle\nabla f_{\lambda\sqrt\Omega}(z_S), z - z_S\rangle + (L+\lambda)D_\omega(z, z_S)\big];$$
on the other hand, for any $z \in Z$ one has
$$f_{\lambda\sqrt\Omega}(z) - f_{\lambda\sqrt\Omega}(z_S) = f(z) - f(z_S) + \lambda\big[D_\omega(z, z_0) - D_\omega(z_S, z_0)\big]$$
$$\le \langle\nabla f(z_S), z - z_S\rangle + \frac{L}{2}\|z - z_S\|^2 + \lambda\big[D_\omega(z, z_0) - D_\omega(z_S, z_0)\big]$$
$$\le \langle\nabla f(z_S), z - z_S\rangle + L\,D_\omega(z, z_S) + \lambda\big[D_\omega(z, z_0) - D_\omega(z_S, z_0)\big]$$
$$= \langle\nabla f(z_S), z - z_S\rangle + L\,D_\omega(z, z_S) + \lambda\big[D_\omega(z, z_S) + \langle\nabla\omega(z_S) - \nabla\omega(z_0), z - z_S\rangle\big]$$
$$= \langle\nabla f_{\lambda\sqrt\Omega}(z_S), z - z_S\rangle + (L+\lambda)D_\omega(z, z_S),$$
where we first used the smoothness of $f$, then the 1-strong convexity of $\omega$, and finally the well-known three-point identity for the Bregman divergence (see, e.g., [6, Eq. (4.1)]). Minimizing both sides over $z \in Z$ and recalling that $f_{\lambda\sqrt\Omega}(z_S) - f_{\lambda\sqrt\Omega}(z^*) \le \varepsilon^2/(18L)$, we arrive at (B.16).

Applying Corollary B.2 with $\lambda = \varepsilon/\sqrt\Omega$, i.e., to the regularized function $f_\varepsilon$, cf. (B.8), we see that one can obtain an $O(\varepsilon)$-stationary point, either in the sense of the dual gradient norm in the unconstrained case, or in the sense of the $S_{Z,\omega}(\cdot,\cdot,\cdot)$ criterion, in $\widetilde O(\sqrt{LR/\varepsilon})$ prox-mapping computations, by running the appropriately generalized version of Algorithm 3.2.
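As an illustration of (B.14), the constrained prox-mapping for the entropy d.-g. f. on the probability simplex is explicit, giving the familiar multiplicative-weights update; a minimal sketch (function names are ours):

```python
import numpy as np

def prox_simplex_entropy(z, zeta):
    """Constrained prox-mapping (B.14) on the probability simplex with the
    entropy d.-g. f.: argmin_{z' in simplex} [<zeta, z'> + D_omega(z', z)]
    has coordinates z'_i = z_i*exp(-zeta_i) / sum_j z_j*exp(-zeta_j)."""
    logits = np.log(z) - zeta
    logits -= logits.max()          # numerical stability
    w = np.exp(logits)
    return w / w.sum()

z = np.full(4, 0.25)                # uniform point of the simplex
print(prox_simplex_entropy(z, np.array([0.0, 1.0, 2.0, 3.0])))
```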
Remark B.3. Using the optimality conditions in (B.15), one can verify that
$$(B.17)\qquad S_{Z,\omega}(z, \nabla f(z), L)^2 \ge 2L^2\,D_\omega\big(z, \nabla\omega^*_Z[\nabla\omega(z) - \tfrac1L\nabla f(z)]\big)\ \Big[= 2L^2\,D_{\omega^*_Z}\big(\nabla\omega(z) - \tfrac1L\nabla f(z),\ \nabla\omega(z)\big)\Big],$$
with equality in the unconstrained case. Here, $\omega^*_Z$ is the Fenchel dual of $\omega$ on $Z$, i.e.,
$$\omega^*_Z(\zeta) := \max_{z\in Z}\big[\langle\zeta, z\rangle - \omega(z)\big],\quad \forall\zeta\in\mathbf R^d,$$
and $\nabla\omega^*_Z[\nabla\omega(z) - \frac1L\nabla f(z)]$ is the mirror descent update from $z$. The second representation in (B.17) is by the standard properties of Bregman divergences ([32]). From it, noting that $\nabla\omega^*_Z$ is 1-Lipschitz with respect to $\|\cdot\|_*$, we conclude that $S_{Z,\omega}(z, \nabla f(z), L)$ under-estimates the dual gradient norm $\|\nabla f(z)\|_*$ in the unconstrained setup, this estimate only being tight in the Euclidean case, i.e., when $\omega(z) = \frac12\|z\|^2$. On the other hand, from the first representation we see that $S_{Z,\omega}(z, \nabla f(z), L)$ over-estimates the proximal gradient norm measure $W_{Z,\omega}(z, \nabla f(z), L)$, defined by
$$W_{Z,\omega}(z, \nabla f(z), L) = L\big\|z - \nabla\omega^*_Z[\nabla\omega(z) - \tfrac1L\nabla f(z)]\big\|.$$
Thus, $S_{Z,\omega}$ corresponds to a stronger criterion than $W_{Z,\omega}$ in the constrained case; in the unconstrained case the two measures coincide, and the resulting criterion is weaker than the gradient-norm one (unless the norm is Euclidean).

Summarizing the results of this section, we see that one of the two key "computational primitives" in our framework, the search for a near-stationary point of a smooth and concave function, extends to the $\ell_1$-geometry with distance-generating function given by (B.12), where the accuracy can be measured by the $S_{Z,\omega}$ measure (cf. (B.15)) or by the dual gradient norm in the unconstrained case. Thus, we have extended the results of Section 3.1. Our next goal is to similarly extend the results of Section 3.2, i.e., to implement the non-Euclidean proximal point algorithm with inexact iterations.
B.3. Bregman proximal point algorithm.
Given a function $\phi: X \to \mathbf R$ with $L$-Lipschitz gradient with respect to the norm $\|\cdot\|$, where $X \subseteq \mathbf R^d$ is convex and "prox-friendly" with respect to a d.-g. f. $\omega$ compatible with $\|\cdot\|$, the goal is to find a point $\widehat x \in X$ such that $S_{X,\omega}(\widehat x, \nabla\phi(\widehat x), L) \le \varepsilon$. As in Section 3.2, we will achieve this result via proximal-point updates implemented using Algorithm 3.2. First, we define the Bregman proximal-point operator following [22]:
$$(B.18)\qquad x \mapsto x^+_{\gamma\phi,X,\omega}(x) := \operatorname{argmin}_{x'\in X}\Big[\phi(x') + \frac1\gamma D_\omega(x', x)\Big];$$
note that the objective in (B.18) is $1/\gamma$-strongly convex with respect to $\|\cdot\|$. We denote $x^+ := x^+_{\gamma\phi,X,\omega}(x)$ for brevity, and fix $\gamma = \frac{1}{2L}$. The optimality condition reads
$$(B.19)\qquad \Big\langle\frac{1}{2L}\nabla\phi(x^+) + \nabla\omega(x^+) - \nabla\omega(x),\ x' - x^+\Big\rangle \ge 0,\quad \forall x'\in X.$$
Following Section 3.2, we first analyze the exact updates $x_t = x^+_{\phi/(2L),X,\omega}(x_{t-1})$. By (B.18) we have $\phi(x_{t-1}) \ge \phi(x_t) + 2L\,D_\omega(x_t, x_{t-1})$, which allows us to mimic (3.13):
$$(B.20)\qquad \min_{t\in[T]}\|x_t - x_{t-1}\|^2 \le \min_{t\in[T]}2D_\omega(x_t, x_{t-1}) \le \frac2T\sum_{t\in[T]}D_\omega(x_t, x_{t-1}) \le \frac{\Delta}{LT}.$$
On the other hand, we can bound the stationarity measure proceeding as in (3.14):
$$(B.21)\qquad S_{X,\omega}(x^+, \nabla\phi(x^+), 2L)^2 \equiv 4L\max_{x'\in X}\big[-\langle\nabla\phi(x^+), x' - x^+\rangle - 2L\,D_\omega(x', x^+)\big]$$
$$\le 4L\max_{x'\in X}\big[2L\langle\nabla\omega(x^+) - \nabla\omega(x), x' - x^+\rangle - 2L\,D_\omega(x', x^+)\big]$$
$$\le 8L^2\max_{x'\in X}\Big[\|\nabla\omega(x^+) - \nabla\omega(x)\|_*\|x' - x^+\| - \tfrac12\|x' - x^+\|^2\Big] \le 4L^2\|\nabla\omega(x^+) - \nabla\omega(x)\|_*^2,$$
where we first used (B.19), then the strong convexity of $\omega$, and finally Young's inequality. Note that in the unconstrained case, and with $S_{X,\omega}(x^+, \nabla\phi(x^+), 2L)$ replaced by $\|\nabla\phi(x^+)\|_*$, the bound (B.21) becomes an equality; on the other hand, we have not been able to find a tighter bound for $W_{X,\omega}(x^+, \nabla\phi(x^+), 2L)$; all this indicates that (B.21) is likely unimprovable in general. Now, the inequalities (B.20) and (B.21), combined together, imply that, in order to proceed as in the Euclidean case, one must require that the d.-g. f. $\omega$ be smooth on $X$ with respect to $\|\cdot\|$, i.e., for some $\ell_{X,\omega} \ge 1$,
$$(B.22)\qquad \|\nabla\omega(x'') - \nabla\omega(x')\|_* \le \ell_{X,\omega}\|x'' - x'\|,\quad \forall x', x''\in X.$$
Indeed, when combined with (B.20)-(B.21), this implies, after $T$ exact updates, that
$$\min_{t\in[T]} S_{X,\omega}(x_t, \nabla\phi(x_t), 2L) \le 2\ell_{X,\omega}\sqrt{\frac{L\Delta}{T}},$$
i.e., the same convergence rate as in the Euclidean case (up to the extra factor $\ell_{X,\omega}$). Moreover, as in the Euclidean case, this argument preserves "robustness" to errors in (B.18). Indeed, denoting by $\phi_{L,x,\omega}(\cdot)$ the objective in (B.18), assume that $\widetilde x^+$ satisfies
$$\phi_{L,x,\omega}(\widetilde x^+) \le \phi_{L,x,\omega}(x^+) + \frac{\varepsilon^2}{L}\quad\text{and}\quad S_{X,\omega}\big(\widetilde x^+, \nabla\phi_{L,x,\omega}(\widetilde x^+), L+\lambda\big) \le \varepsilon,$$
which can be ensured by running Algorithm 3.2 on $\phi_{L,x,\omega}$ with appropriately chosen parameter values. On the other hand, the sequence $\widetilde x_t = \widetilde x^+_{\phi/(2L),X,\omega}(\widetilde x_{t-1})$ satisfies the counterpart of (B.20):
$$\min_{t\in[T]}\|\widetilde x_t - \widetilde x_{t-1}\|^2 \le \min_{t\in[T]}2D_\omega(\widetilde x_t, \widetilde x_{t-1}) \le \frac2T\sum_{t\in[T]}D_\omega(\widetilde x_t, \widetilde x_{t-1}) \le \frac{\Delta}{LT} + \frac{\varepsilon^2}{L^2},$$
and that of (B.21):
$$S_{X,\omega}\big(\widetilde x^+, \nabla\phi(\widetilde x^+), L+\lambda\big)^2 \le 2\,S_{X,\omega}\big(\widetilde x^+, \nabla\phi_{L,x,\omega}(\widetilde x^+), L+\lambda\big)^2 + 8L(2L+\lambda)\|\nabla\omega(\widetilde x^+) - \nabla\omega(x)\|_*^2$$
$$\le 2\varepsilon^2 + 8L(2L+\lambda)\|\nabla\omega(\widetilde x^+) - \nabla\omega(x)\|_*^2,$$
where we applied Young's inequality and strong convexity. This allows us to mimic (3.21):
$$\min_{t\in[T]} S_{X,\omega}\big(\widetilde x_t, \nabla\phi(\widetilde x_t), L+\lambda\big) \le 2\ell_{X,\omega}\sqrt{\frac{(L+\lambda)\Delta}{T}} + 5\varepsilon\Big(\frac{L+\lambda}{L}\Big).$$
Taking $\lambda = L$, we arrive at the desired complexity estimate $O(L\Delta/\varepsilon^2)$.
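A schematic instance of the resulting Bregman proximal-point update, for the entropy geometry on the simplex, with composite mirror descent standing in for the inner solver of Corollary B.2 (all names and step sizes are illustrative, not the parameters of the analysis):

```python
import numpy as np

def bregman_prox_step(grad_phi, x_prev, L, inner_steps=200, gamma=None):
    """Approximate the Bregman proximal-point update (B.18) with gamma = 1/(2L)
    on the probability simplex with the entropy d.-g. f.: minimize
        phi(y) + 2L * D_omega(y, x_prev)
    by composite mirror descent, handling the proximal term inside the prox
    (in the spirit of prox_{z,omega,eps} of Appendix B.1)."""
    if gamma is None:
        gamma = 1.0 / L                                # stepsize for the smooth part
    a = 2.0 * L * gamma / (1.0 + 2.0 * L * gamma)      # weight of the proximal center
    y = x_prev.copy()
    for _ in range(inner_steps):
        logits = (a * np.log(x_prev) + (1.0 - a) * np.log(y)
                  - gamma / (1.0 + 2.0 * L * gamma) * grad_phi(y))
        logits -= logits.max()                         # numerical stability
        y = np.exp(logits)
        y /= y.sum()
    return y
```

Iterating `x = bregman_prox_step(grad_phi, x, L)` then yields the proximal-point trajectory analyzed above.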
While trivially satisfied inthe Euclidean case with (cid:96)
X,ω = 1 for any X , the smoothness assumption (B.22) – whichis also made, e.g., in [9, 35] – is strong in the general Bregman scenario. For example,the negative entropy h ( x ) = (cid:80) i ∈ [ d ] x i log( x i ), perhaps the most common choice of anon-Euclidean d.-g. f., is not smooth on its domain, the probability simplex ∆ d . Likewise, the previously considered (cid:107) · (cid:107) p -function (cf. (B.12)) does not satisfy (B.22)with respect to (cid:107) · (cid:107) on the set X = { x ∈ R d : (cid:107) x (cid:107) (cid:54) r } for any finite (cid:96) X,ω ,however small is r >
0, unless when d = 1. More generally, as implied by [5,Theorem 3.5 and its omitted dual version] in combination with [2, p. 469], there isno function simultaneously strongly convex and smooth, with dimension-independentcondition number, with respect (cid:96) p -norm for p (cid:54) = 2. Still, it would be interesting toexhibit ( X, (cid:107) · (cid:107) , ω ) with (cid:107) · (cid:107) -compatible ω for which (B.22) holds with moderate (cid:96) X,ω .Alternatively, we may consider circumventing (B.22) by discarding Bregmandivergences and working directly with the norm. Indeed, using conjugacy of (cid:107) · (cid:107) and (cid:107) · (cid:107) ∗ one can derive the O ( L ∆ /ε ) convergence rate for minimizing the gradientnorm up to ε by steepest descent with respect to the norm (cid:107) · (cid:107) , i.e., replacing theBregman divergence in (B.5) by (cid:107) · (cid:107) . The resulting prox-mapping is tractablewhenever (cid:107) · (cid:107) is a “simple” function, which is the case, e.g., for (cid:107) · (cid:107) = (cid:107) · (cid:107) p with p (cid:62) (cid:107) · (cid:107) p is O (1)-strongly convex with respect to (cid:107) · (cid:107) p when 1 < p (cid:54)
2. Thus, theresults of Appendix B.3 extend to such (cid:107) · (cid:107) -regularized proximal point algorithm. In fact, it is easy to see that no Legendre d.-g. f., i.e., such that (cid:107)∇ ω ( x ) (cid:107) ∗ → ∞ when x → ∂ dom( ω ), can satisfy (B.22) with X = dom( ω ) for any finite (cid:96) X,ω . Indeed, by explicitly computing the Hessian ∇ ω ( x ) defined almost everywhere on R d , we observethat its first diagonal element “explodes”: [ ∇ ω ( x )] → ∞ when x = ue + (1 − u ) e with u →→