The Complexity of Constrained Min-Max Optimization
Constantinos Daskalakis, Stratis Skoulakis, Manolis Zampetakis
Constantinos Daskalakis (MIT) [email protected]
Stratis Skoulakis (SUTD) [email protected]
Manolis Zampetakis (MIT) [email protected]
September 22, 2020
Abstract
Despite its important applications in Machine Learning, min-max optimization of objective functions that are nonconvex-nonconcave remains elusive. Not only are there no known first-order methods converging even to approximate local min-max points, but the computational complexity of identifying them is also poorly understood. In this paper, we provide a characterization of the computational complexity of the problem, as well as of the limitations of first-order methods in constrained min-max optimization problems with nonconvex-nonconcave objectives and linear constraints.

As a warm-up, we show that, even when the objective is a Lipschitz and smooth differentiable function, deciding whether a min-max point exists, in fact even deciding whether an approximate min-max point exists, is NP-hard. More importantly, we show that an approximate local min-max point of large enough approximation is guaranteed to exist, but finding one such point is PPAD-complete. The same is true of computing an approximate fixed point of the (Projected) Gradient Descent/Ascent update dynamics.

An important byproduct of our proof is to establish an unconditional hardness result in the Nemirovsky-Yudin [NY83] oracle optimization model. We show that, given oracle access to some function f : P → [−1, 1] and its gradient ∇f, where P ⊆ [0, 1]^d is a known convex polytope, every algorithm that finds an ε-approximate local min-max point needs to make a number of queries that is exponential in at least one of 1/ε, L, G, or d, where L and G are respectively the smoothness and Lipschitzness of f and d is the dimension. This comes in sharp contrast to minimization problems, where finding approximate local minima in the same setting can be done with Projected Gradient Descent using O(L/ε) many queries. Our result is the first to show an exponential separation between these two fundamental optimization problems in the oracle model.
Introduction
Min-Max Optimization has played a central role in the development of Game Theory [vN28], Convex Optimization [Dan51, Adl13], and Online Learning [Bla56, CBL06, SS12, BCB12, SSBD14, Haz16]. In its general constrained form, it can be written down as follows:

    min_{x ∈ R^d} max_{y ∈ R^d} f(x, y)    (1.1)
    s.t. g(x, y) ≤ 0

where f : R^d × R^d → [−B, B] with B ∈ R+, and g : R^d × R^d → R is typically taken to be a convex function, so that the constraint set g(x, y) ≤ 0 is convex. In this paper we take g to be linear, so the constraint set is a polytope; thus projecting onto this set and checking feasibility of a point with respect to this set can both be done in polynomial time.

The goal in (1.1) is to find a feasible pair (x*, y*), i.e., g(x*, y*) ≤ 0, that satisfies the following:

    f(x*, y*) ≤ f(x, y*),  for all x s.t. g(x, y*) ≤ 0;    (1.2)
    f(x*, y*) ≥ f(x*, y),  for all y s.t. g(x*, y) ≤ 0.    (1.3)

It is well-known that, when f(x, y) is a convex-concave function, i.e., f is convex in x for all y and concave in y for all x, Problem (1.1) is guaranteed to have a solution under compactness of the constraint set [vN28, Ros65], while computing a solution is amenable to convex programming. In fact, if f is L-smooth, the problem can be solved via first-order methods, which are iterative, only access f through its gradient, and achieve an approximation error of poly(L, 1/T) in T iterations; see e.g. [Kor76, Nem04]. When the function is strongly convex-strongly concave, the rate becomes geometric [FP07].

Unfortunately, our ability to solve Problem (1.1) remains rather poor in settings where our objective function f is not convex-concave. This is emerging as a major challenge in Deep Learning, where min-max optimization has recently found many important applications, such as training Generative Adversarial Networks (see e.g. [GPM+14, ACB17]) and robustifying deep neural network-based models against adversarial attacks (see e.g. [MMS+17, JGN+17, LPP+]).

Footnote: In general, the access to the constraints g by these methods is more involved, namely through an optimization oracle that optimizes convex functions (in fact, quadratic suffices) over g(x, y) ≤ 0. In the settings considered in this paper g is linear, and these tasks are computationally straightforward.

Footnote: In the stated error rate, we are suppressing factors that depend on the diameter of the feasible set. Moreover, the stated error of ε(L, T) := poly(L, 1/T) reflects that these methods return an approximate min-max solution, wherein the inequalities on the LHS of (1.2) and (1.3) are satisfied to within an additive ε(L, T).

The goal of this paper is to shed light on the complexity of min-max optimization problems, and to elucidate its difference from minimization and maximization problems. As far as the latter are concerned, without loss of generality we focus on minimization problems, as maximization problems behave exactly the same; we will also think of minimization problems in the framework of (1.1), where the variable y is absent, i.e., its dimension is 0. An important driver of our comparison between min-max optimization and minimization is, of course, the nature of the objective. So let us discuss:
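As a concrete instance of the first-order methods mentioned above for the convex-concave case, here is a minimal sketch (our own illustration, not from the paper) of simultaneous gradient descent-ascent on the bilinear, convex-concave objective f(x, y) = x·y, whose unique min-max point is (0, 0). The last iterate spirals away from the equilibrium, while the average of the iterates approaches it.

```python
# Simultaneous gradient descent-ascent (GDA) on f(x, y) = x * y.
# df/dx = y (the min player descends), df/dy = x (the max player ascends).
# The unique min-max point is (0, 0).

def gda(x, y, eta=0.01, steps=2000):
    xs, ys = [], []
    for _ in range(steps):
        x, y = x - eta * y, y + eta * x   # simultaneous update
        xs.append(x)
        ys.append(y)
    return (x, y), (sum(xs) / steps, sum(ys) / steps)

(last_x, last_y), (avg_x, avg_y) = gda(1.0, 1.0)
# the last iterate has drifted away from (0, 0); the average is close to it
```

For bilinear objectives this average-iterate convergence is classical; obtaining such guarantees, or any guarantees at all, is exactly what breaks down once the objective is no longer convex-concave.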
Convex-Concave Objective.
The benign setting for min-max optimization is that where the objective function is convex-concave, while the benign setting for minimization is that where the objective function is convex. In their corresponding benign settings, the two problems behave quite similarly from a computational perspective, in that they are amenable to convex programming, as well as to first-order methods which only require gradient information about the objective function. Moreover, in their benign settings, both problems have guaranteed existence of a solution under compactness of the constraint set. Finally, it is clear how to define approximate solutions: we just relax the inequalities on the left hand side of (1.2) and (1.3) by some ε > 0.

Nonconvex-Nonconcave Objective.
By contrast, the challenging setting for min-max optimization is that where the objective is not convex-concave, while the challenging setting for minimization is that where the objective is not convex. In these challenging settings, the behavior of the two problems diverges significantly. The first difference is that, while a solution to a minimization problem is still guaranteed to exist under compactness of the constraint set even when the objective is not convex, a solution to a min-max problem is not guaranteed to exist when the objective is not convex-concave, even under compactness of the constraint set. A trivial example is this: min_{x ∈ [0,1]} max_{y ∈ [0,1]} (x − y)^2. Unsurprisingly, we show that checking whether a min-max optimization problem has a solution is NP-hard. In fact, we show that checking whether there is an approximate min-max solution is NP-hard, even when the function is Lipschitz and smooth and the desired approximation error is an absolute constant (see Theorem 10.1).

Since min-max solutions may not exist, what could we plausibly hope to compute? There are two obvious targets:

(I) approximate stationary points of f, as considered e.g. by [ALW19]; and
(II) some type of approximate local min-max solution.

Unfortunately, as far as (I) is concerned, it is still possible that (even approximate) stationary points may not exist, and we show that checking if there is one is NP-hard, even when the constraint set is [0, 1]^d, the objective has Lipschitzness and smoothness polynomial in d, and the desired approximation is an absolute constant (Theorem 4.1). So we focus on (II), i.e., (approximate) local min-max solutions. Several kinds of those have been proposed in the literature [DP18, MR18, JNJ19]. We consider a generalization of the concept of local min-max equilibria, proposed in [DP18, MR18], that also accommodates approximation.

Definition 1.1 (Approximate Local Min-Max Equilibrium). Given f, g as above, and ε, δ > 0, a point (x*, y*) is an (ε, δ)-local min-max solution of (1.1), or an (ε, δ)-local min-max equilibrium, if it is feasible, i.e. g(x*, y*) ≤ 0, and satisfies:

    f(x*, y*) < f(x, y*) + ε,  for all x such that ‖x − x*‖ ≤ δ and g(x, y*) ≤ 0;    (1.4)
    f(x*, y*) > f(x*, y) − ε,  for all y such that ‖y − y*‖ ≤ δ and g(x*, y) ≤ 0.    (1.5)

In words, (x*, y*) is an (ε, δ)-local min-max equilibrium whenever the min player cannot update x to a feasible point within δ of x* to reduce f by at least ε, and symmetrically the max player cannot change y locally to increase f by at least ε.

We show that the existence and complexity of computing such approximate local min-max equilibria depend on the relationship of ε and δ with the smoothness, L, and the Lipschitzness, G, of the objective function f. We distinguish the following regimes, also shown in Figure 1 together with a summary of our associated results.

Trivial Regime.
This occurs when δ < ε/G. This regime is trivial because the G-Lipschitzness of f guarantees that all feasible points are (ε, δ)-local min-max solutions.

Local Regime.

This occurs when δ < √(ε/L), and it represents the interesting regime for min-max optimization. In this regime, we use the smoothness of f to show that (ε, δ)-local min-max solutions always exist. Indeed, we show (Theorem 5.1) that computing them is computationally equivalent to the following variant of (I), which is more suitable for the constrained setting:

(I') (approximate) fixed points of the projected gradient descent-ascent dynamics (Section 3.3).

We show via an application of Brouwer's fixed point theorem to the iteration map of the projected gradient descent-ascent dynamics that solutions of (I') are guaranteed to exist. In fact, not only do they exist, but computing them is in PPAD, as can be shown by bounding the Lipschitzness of the projected gradient descent-ascent dynamics (Theorem 5.2).
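For intuition on (I'), the following sketch (our own illustration, not the paper's formal construction) iterates a projected gradient descent-ascent map on the square [0, 1]^2 for the smooth objective f(x, y) = (x − 1/2)^2 − (y − 1/2)^2 and checks that it reaches an approximate fixed point, which here is the saddle (1/2, 1/2).

```python
# Projected gradient descent-ascent (PGDA) on [0, 1]^2 for the smooth
# objective f(x, y) = (x - 0.5)**2 - (y - 0.5)**2, whose saddle point
# (0.5, 0.5) is the unique fixed point of the PGDA map here.

def clip01(t):
    return max(0.0, min(1.0, t))

def pgda_map(x, y, eta=0.1):
    gx = 2.0 * (x - 0.5)      # df/dx: the min player descends
    gy = -2.0 * (y - 0.5)     # df/dy: the max player ascends
    return clip01(x - eta * gx), clip01(y + eta * gy)

x, y = 0.9, 0.1
for _ in range(300):
    x, y = pgda_map(x, y)

fx, fy = pgda_map(x, y)
residual = ((fx - x) ** 2 + (fy - y) ** 2) ** 0.5
# residual is essentially zero: (x, y) is an approximate fixed point
```

For this convex-concave f the iteration happens to converge; the paper's point is that for general nonconvex-nonconcave objectives such fixed points still exist (by Brouwer's theorem applied to this map), yet computing them is PPAD-complete.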
Global Regime.
This occurs when δ is comparable to the diameter of the constraint set. In this case, the existence of (ε, δ)-local min-max solutions is not guaranteed, and determining their existence is NP-hard, even if ε is an absolute constant (Theorem 10.1).

The main results of this paper, summarized in Figure 1, characterize the complexity of computing local min-max solutions in the local regime. Our first main theorem is the following:

Informal Theorem 1 (see Theorems 4.3, 4.4 and 5.1). Computing (ε, δ)-local min-max solutions of Lipschitz and smooth objectives over convex compact domains in the local regime is PPAD-complete. The hardness holds even when the constraint set is a polytope that is a subset of [0, 1]^d, the objective takes values in [−1, 1], and the smoothness, Lipschitzness, ε and δ are polynomial in the dimension. Equivalently, computing α-approximate fixed points of the Projected Gradient Descent-Ascent dynamics on smooth and Lipschitz objectives is PPAD-complete, and the hardness holds even when the constraint set is a polytope that is a subset of [0, 1]^d, the objective takes values in [−d, d], and the smoothness, Lipschitzness, and α are polynomial in the dimension.

For the above complexity result we assume that we have "white box" access to the objective function. An important byproduct of our proof, however, is to also establish an unconditional hardness result in the Nemirovsky-Yudin [NY83] oracle optimization model, wherein we are given black-box access to oracles computing the objective function and its gradient. Our second main result is informally stated in Informal Theorem 2.

Figure 1: Overview of the results proven in this paper, and comparison between the complexity of computing an (ε, δ)-approximate local minimum and an (ε, δ)-approximate local min-max equilibrium of a G-Lipschitz and L-smooth function over a d-dimensional polytope taking values in the interval [−B, B]. We assume that ε < G²/L, thus the trivial regime is a strict subset of the local regime. Moreover, we assume that the approximation parameter ε is provided in unary representation in the input to these problems, which makes our hardness results stronger and the comparison to the upper bounds known for finding approximate local minima fair, as these require time/oracle queries that are polynomial in 1/ε. We note that the unary representation is not required for our results proving inclusion in PPAD. The figure portrays a sharp contrast between the computational complexity of approximate local minima and approximate local min-max equilibria in the local regime. Above the black lines, tracking the value of δ, we state our "white box" results, and below the black lines we state our "black-box" results. The main result of this paper is the PPAD-hardness of approximate local min-max equilibrium for δ ≥ √(ε/L) and the corresponding query lower bound. In the query lower bound, the function h is defined as h(d, G, L, ε) = (min(d, √(L/ε), G/ε))^p for some universal constant p ∈ R+.
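The query-counting flavor of the black-box model can be made concrete with a small oracle wrapper. The naive grid scan below (a toy illustration with made-up helper names, not the paper's lower-bound construction) makes on the order of (1/δ)^d first-order queries, the kind of exponential-in-d scaling that the lower bound shows is unavoidable for local min-max in the worst case.

```python
# A query-counting first-order oracle in the Nemirovsky-Yudin black-box
# model, plus a naive scan over a delta-grid of [0, 1]^d. The scan makes
# (1/delta + 1)^d oracle queries, i.e. exponentially many in d.
# (Toy illustration only; this is not the paper's lower-bound construction.)
import itertools

class FirstOrderOracle:
    def __init__(self, f, grad):
        self.f, self.grad, self.queries = f, grad, 0

    def __call__(self, x):
        self.queries += 1
        return self.f(x), self.grad(x)

def grid_scan(oracle, d, delta):
    """Return the grid point of [0, 1]^d with the smallest gradient norm."""
    ticks = [i * delta for i in range(int(round(1 / delta)) + 1)]
    best, best_norm = None, float("inf")
    for x in itertools.product(ticks, repeat=d):
        _, g = oracle(x)
        norm = sum(gi * gi for gi in g) ** 0.5
        if norm < best_norm:
            best, best_norm = x, norm
    return best

def f(x):
    return sum((xi - 0.5) ** 2 for xi in x)

def grad(x):
    return [2.0 * (xi - 0.5) for xi in x]

oracle = FirstOrderOracle(f, grad)
point = grid_scan(oracle, d=2, delta=0.1)   # 11 ticks per axis -> 121 queries
```

For minimization, by contrast, projected gradient descent needs only polynomially many queries in B, L and 1/ε, independent of any (1/δ)^d scan, which is the separation the theorems formalize.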
With ⋆ we indicate our PPAD-completeness result, which directly follows from Theorems 4.3 and 4.4. The NP-hardness results in the global regime are presented in Section 10. Finally, the folklore result showing the tractability of finding approximate local minima is presented, for completeness of exposition, in Appendix E. The claimed results for the trivial regime follow from the definition of Lipschitzness.

Informal Theorem 2 (see Theorem 4.5). Assume that we have black-box access to an oracle computing a G-Lipschitz and L-smooth objective function f : P → [−1, 1], where P ⊆ [0, 1]^d is a known polytope, and its gradient ∇f. Then, computing an (ε, δ)-local min-max solution in the local regime (i.e., when δ < √(ε/L)) requires a number of oracle queries that is exponential in at least one of the following: 1/ε, L, G, or d. In fact, exponentially-in-d many queries are required even when L, G, ε and δ are all polynomial in d.

Importantly, the above lower bounds, in both the white-box and the black-box setting, come in sharp contrast to minimization problems, given that finding approximate local minima of smooth non-convex objectives ranging in [−B, B] in the local regime can be done with first-order methods using O(B · L/ε) time/queries (see Appendix E). Our results are the first to show an exponential separation between these two fundamental problems in optimization in the black-box setting, and a super-polynomial separation in the white-box setting assuming PPAD ≠ FP.

We very briefly outline some of the main ideas for the
PPAD -hardness proof that we presentin Sections 6 and 7. Our starting point as in many
PPAD -hardness results is a discrete analogof the problem of finding Brouwer fixed points of a continuous map. Departing from previouswork, however, we do not use Sperner’s lemma as the discrete analog of Brouwer’s fixed pointtheorem. Instead, we define a new problem, called B i S perner , which is useful for showing ourhardness results. B i S perner is closely related to the problem of finding panchromatic simplicesguaranteed by Sperner’s lemma except, roughly speaking, that the vertices of the simplicizationof a d -dimensional hypercube are colored with 2 d rather than d + d colors rather than one, and we are seeking a vertex of the sim-plicization so that the union of colors on the vertices in its neighborhood covers the full set ofcolors. The first step of our proof is to show that B i S perner is PPAD -hard. This step followsfrom the hardness of computing Brouwer fixed points.The step that we describe next is only implicitly done by our proof, but it serves as usefulintuition for reading and understanding it. We want to define a discrete two-player zero-sumgame whose local equilibrium points correspond to solutions of a given B i S perner instance.Our two players, called “minimizer” and “maximizer,” each choose a vertex of the simplicizationof the B i S perner instance. For every pair of strategies in our discrete game, i.e. vertices, chosenby our players, we define a function value and gradient values. Note that, at this point, wetreat these values at different vertices of the simplicization as independent choices, i.e. are notdefining a function over the continuum whose function values and gradient values are consistentwith these choices. 
It is our intention, however, that in the continuous two-player zero-sum gamethat we obtain in the next paragraph via our interpolation scheme, wherein the minimizer andmaximizer may choose any point in the continuous hypercube, the function value determinesthe payment of the minimizer to the maximizer, and the gradient value determines the directionof the best-response dynamics of the game. Before getting to that continuous game in the nextparagraph, the main technical step of this discrete part of our construction is showing that everylocal equilibrium of the discrete game corresponds to a solution of the B i S perner instance we arereducing from. In order to achieve this we need to add some constraints to couple the strategiesof the minimizer and the maximizer player. This step is the reason that the constraints g ( x , y ) ≤ smooth and computationally efficient way the discrete zero-sum game of the previous step. Inlow dimensions (treated in Section 6) such smooth and efficient interpolation can be done ina relatively simple way using single-dimensional smooth step functions. In high dimensions,however, the smooth and efficient interpolation becomes a challenging problem and to the bestof our knowledge no simple solution exists. For this reason we construct our novel smooth andefficient interpolation coefficients of Section 8. These are a technically involved construction that webelieve will prove to be very useful for characterizing the complexity of approximate solutionsof other optimization problems.The last part of our proof is to show that all the previous steps can be implemented inan efficient way both with respect to computational but also with respect to query complexity.This part is essential for both our white-box and black-box results. 
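The single-dimensional smooth step functions used for the low-dimensional interpolation can be illustrated with the classic cubic smoothstep (a generic example of such a function, not necessarily the exact construction of Section 6): it rises from 0 to 1 with vanishing derivative at both endpoints, so values glued together with it remain continuously differentiable.

```python
# Cubic smoothstep: s(t) = 3t^2 - 2t^3 on [0, 1], clamped outside.
# s(0) = 0, s(1) = 1, and s'(0) = s'(1) = 0, so interpolating two
# values a, b as a + (b - a) * s(t) yields a C^1 transition.

def smoothstep(t):
    t = max(0.0, min(1.0, t))
    return t * t * (3.0 - 2.0 * t)

def lerp_smooth(a, b, t):
    return a + (b - a) * smoothstep(t)

# the derivative at the endpoints is (numerically) zero
h = 1e-6
d0 = (smoothstep(0.0 + h) - smoothstep(0.0)) / h
d1 = (smoothstep(1.0) - smoothstep(1.0 - h)) / h
```

Gluing the discrete game's values with such steps keeps the resulting continuous objective smooth; the difficulty the paper addresses is doing this efficiently when the interpolation is over many dimensions at once.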
Although this seems like a relatively easy step, it becomes more difficult due to the complicated expressions in the smooth and efficient interpolation coefficients used in the previous step.

Closing this section, we mention that all our NP-hardness results are proven using an application of the Lovász Local Lemma [EL73], which provides a powerful rounding tool that can drive the inapproximability all the way up to an absolute constant.

Because our proof is convoluted, involving multiple steps, it is difficult to discern from it why finding local min-max solutions is so much harder than finding local minima. For this reason, we illustrate in this section a fundamental difference between local minimization and local min-max optimization. This provides good intuition about why our hardness construction would fail if we tried to apply it to prove hardness results for finding local minima (which we know do not exist).

So let us illustrate a key difference between min-max problems that can be expressed in the form min_{x∈X} max_{y∈Y} f(x, y), i.e. two-player zero-sum games wherein the players optimize opposing objectives, and min-min problems of the form min_{x∈X} min_{y∈Y} f(x, y), i.e., two-player coordination games wherein the players optimize the same objective. For simplicity, suppose X = Y = R and let us consider long paths of best-response dynamics in the strategy space, X × Y, of the two players; these are paths along which at least one of the players improves their payoff. For our illustration, suppose that the derivative of the function with respect to either variable is either 1 or −1. Consider a long path of best-response dynamics starting at a pair of strategies (x₀, y₀) in either a min-min problem or a min-max problem, and a specific point (x, y) along that path. We claim that in min-min problems the function value at (x, y) will have to reveal how far from (x₀, y₀) the point (x, y) lies within the path in ℓ₁ distance. On the other hand, in min-max problems the function value at (x, y) may reveal very little about how far (x, y) lies from (x₀, y₀). We illustrate this in Figure 2. While in our min-min example the function value must be monotonically decreasing inside the best-response path, in the min-max example the function values repeat themselves in every straight line segment of length 3, without revealing where in the path each segment is.

Ultimately, a key difference between min-min and min-max optimization is that best-response paths in min-max optimization problems can be closed, i.e., can form a cycle, as shown in Figure 2, Panel (b). On the other hand, this is impossible in min-min problems, as the function value must monotonically decrease along best-response paths, thus cycles may not exist.

Figure 2: Long paths of best-response dynamics in min-min problems (Panel (a): the function values reveal the location of the points within the best-response path) and min-max problems (Panel (b): the function values do not reveal the location of the points within the best-response path), where horizontal moves correspond to one player (who is a minimizer in both (a) and (b)) and vertical moves correspond to the other player (who is a minimizer in (a) but a maximizer in (b)).
In Panels (a) and (b), we show the function value at a subset of discrete points in a 2D grid along a long path of best-response dynamics, where for our illustration we assumed that the derivative of the objective with respect to either variable always has absolute value 2. As we see in Panel (a), the function value at some point along a long path of the best-response dynamics in a min-min problem reveals information about where in the path that point lies. This is in sharp contrast to min-max problems, where only local information is revealed about the objective, as shown in Panel (b), due to the frequent turns of the path. In Panel (b) we also show that the best-response dynamics in min-max problems can form closed paths. This cannot happen in min-min problems, as the function value must decrease along paths of best-response dynamics; hence it is impossible in min-min problems to build long best-response paths with function values that can be computed locally.

The above discussion offers qualitative differences between min-min and min-max optimization, which lie at the heart of why our computational intractability results are possible to prove for min-max but not min-min problems. For the precise step in our construction that breaks if we were to switch from a min-max to a min-min problem, we refer the reader to Remark 6.9.
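The cycling phenomenon of Panel (b) is easy to reproduce on the smallest possible example, a 2×2 matching-pennies-style game (our own illustration, separate from the figure): under best-response dynamics the min-max version cycles forever, while the min-min version halts at a joint minimum.

```python
# Best-response dynamics on a 2x2 game with f(x, y) = 1 if x == y else 0.
# In the min-max version the min player wants x != y while the max player
# wants y == x, so best responses chase each other around a 4-cycle.
# In the min-min version both players minimize f and the dynamics halt.

def f(x, y):
    return 1 if x == y else 0

def best_response_path(maximizing_y, start=(0, 0), max_steps=12):
    x, y = start
    path = [(x, y)]
    for _ in range(max_steps):
        moved = False
        bx = min((0, 1), key=lambda c: f(c, y))      # min player's best reply
        if f(bx, y) < f(x, y):
            x, moved = bx, True
        else:
            if maximizing_y:
                by = max((0, 1), key=lambda c: f(x, c))
                improves = f(x, by) > f(x, y)
            else:
                by = min((0, 1), key=lambda c: f(x, c))
                improves = f(x, by) < f(x, y)
            if improves:
                y, moved = by, True
        if not moved:
            return path, "converged"
        path.append((x, y))
    return path, "cycling"

minmax_path, minmax_status = best_response_path(maximizing_y=True)
minmin_path, minmin_status = best_response_path(maximizing_y=False)
# the min-max dynamics revisit their starting point; the min-min dynamics stop
```

The min-max run traverses (0,0) → (1,0) → (1,1) → (0,1) → (0,0) and never terminates, mirroring the closed best-response path of Panel (b).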
There is a broad literature on the complexity of equilibrium computation. Virtually all these results are obtained within the computational complexity formalism of total search problems in NP, which was spearheaded by [JPY88, MP89, Pap94b] to capture the complexity of search problems that are guaranteed to have a solution. Some key complexity classes in this landscape are shown in Figure 3. We give a non-exhaustive list of intractability results for equilibrium computation: [FPT04] prove that computing pure Nash equilibria in congestion games is PLS-complete; [DGP09] and later [CDT09] show that computing approximate Nash equilibria in normal-form games is PPAD-complete; [EY10] study the complexity of computing exact Nash equilibria (which may use irrational probabilities), introducing the complexity class FIXP; [VY11, CPY17] consider the complexity of computing Market equilibria; [Das13, Rub15, Rub16] consider the complexity of computing approximate Nash equilibria of constant approximation; [KM18] establish a connection between approximate Nash equilibrium computation and the SoS hierarchy; [Meh14, DFS20] study the complexity of computing Nash equilibria in specially structured games. A result that is particularly useful for our work is that of [HPV89], which shows black-box query lower bounds for computing Brouwer fixed points of a continuous function. We use this result in Section 9 as an ingredient for proving our black-box lower bounds for computing approximate local min-max solutions.

Figure 3: The complexity-theoretic landscape of total search problems in NP.

Beyond equilibrium computation and its applications to Economics and Game Theory, the study of total search problems has found profound connections to many scientific fields, including continuous optimization [DP11, DTZ18], combinatorial optimization [SY91], query complexity [BCE+17, GKSZ19], and cryptography [Jeř16, BPR15, SZZ18]. For a more extensive overview of total search problems we refer the reader to the recent survey by Daskalakis [Das18].

As already discussed, min-max optimization has intimate connections to the foundations of Game Theory, Mathematical Programming, Online Learning, Statistics, and several other fields. Recent applications of min-max optimization to Machine Learning, such as Generative Adversarial Networks and Adversarial Training, have motivated a slew of recent work targeting first-order (or other lightweight online learning) methods for solving min-max optimization problems for convex-concave, nonconvex-concave, as well as nonconvex-nonconcave objectives. Work on convex-concave and nonconvex-concave objectives has focused on obtaining online learning methods with improved rates [KM19, LJJ19, TJNO19, NSH+19, LTHC19, OX19, Zha19, ADSG19, AMLJG20, GPDO20, LJJ20] and last-iterate convergence guarantees [DISZ18, DP18, MR18, MPP18, RLLY18, HA18, ADLH19, DP19, LS19, GHP+19, MOP19, ALW19], while work on nonconvex-nonconcave problems has focused on identifying different notions of local min-max solutions [JNJ19, MV20] and studying the existence and (local) convergence properties of learning methods at these points [WZB19, MV20, MSV20].
Preliminaries
Notation.
For any compact and convex K ⊆ R^d and B ∈ R+, we define L∞(K, B) to be the set of all continuous functions f : K → R such that max_{x ∈ K} |f(x)| ≤ B. When K = [0, 1]^d, we use L∞(B) instead of L∞([0, 1]^d, B) for ease of notation. For p > 0, we define diam_p(K) = max_{x,y ∈ K} ‖x − y‖_p, where ‖·‖_p is the usual ℓ_p-norm of vectors. For an alphabet set Σ, the set Σ*, called the Kleene star of Σ, is equal to ∪_{i=0}^∞ Σ^i. For any string q ∈ Σ* we use |q| to denote the length of q. We use the symbol log(·) for base-2 logarithms and ln(·) for the natural logarithm. We use [n] := {1, . . . , n}, [n]₋ := {0, . . . , n − 1}, and [n]₀ := {0, . . . , n}.

Lipschitzness, Smoothness, and Normalization.
Our main objects of study are continuously differentiable Lipschitz and smooth functions f : P → R, where P ⊆ [0, 1]^d is some polytope. A continuously differentiable function f is called G-Lipschitz if |f(x) − f(y)| ≤ G‖x − y‖, for all x, y, and L-smooth if ‖∇f(x) − ∇f(y)‖ ≤ L‖x − y‖, for all x, y.

Remark. Note that the G-Lipschitzness of a function f : P → R, where P ⊆ [0, 1]^d, implies that for any x and y it holds that |f(x) − f(y)| ≤ G√d. Whenever the range of a G-Lipschitz function is taken to be [−B, B], for some B, we always assume that B ≤ G√d. This can be accomplished by setting f̃(x) = f(x) − f(x₀) for some fixed x₀ in the domain of f. For all the problems that we consider in this paper, any solution for f̃ is also a solution for f and vice-versa.

Function Access.
We study optimization problems involving real-valued functions, considering two access models to such functions.
Black Box Model.
In this model we are given access to an oracle O_f such that, given a point x ∈ [0, 1]^d, the oracle O_f returns the values f(x) and ∇f(x). In this model we assume that we can perform real-number arithmetic operations. This is the traditional model used to prove lower bounds in Optimization and Machine Learning [NY83].

White Box Model.
In this model we are given the description of a polynomial-time Turing machine C_f that computes f(x) and ∇f(x). More precisely, given some input x ∈ [0, 1]^d, described using B bits, and some accuracy ε, C_f runs in time upper bounded by some polynomial in B and log(1/ε), and outputs approximate values for f(x) and ∇f(x), with approximation error that is at most ε in ℓ∞ distance. We note that a running-time upper bound on a given Turing Machine can be enforced syntactically, by stopping the computation and outputting a fixed output whenever the computation exceeds the bound. See also Remark 2.6 for an important remark about how to formally study the computational complexity of problems that take as input a polynomial-time Turing Machine.

Promise Problems.
To simplify the exposition of our paper, make the definitions of our computational problems and theorem statements clearer, and make our intractability results stronger, we choose to enforce the following constraints on our function access, O_f or C_f, as a promise, rather than enforcing these constraints in some syntactic manner.

1. Consistency of Function Values and Gradient Values. Given some oracle O_f or Turing machine C_f, it is difficult to determine, by querying the oracle or examining the description of the Turing machine, whether the function and gradient values output on different inputs are consistent with some differentiable function. In all our computational problems, we will only consider instances where this is promised to be the case. Moreover, for all our computational hardness results, the instances of the problems arising from our reductions satisfy these constraints, which are guaranteed syntactically by our reductions.

2. Lipschitzness, Smoothness and Boundedness. Similarly, given some oracle O_f or Turing machine C_f, it is difficult to determine, by querying the oracle or examining the description of the Turing machine, whether the function and gradient values output by O_f or C_f are consistent with some Lipschitz, smooth and bounded function with some prescribed Lipschitzness, smoothness, and bound on its absolute value. In all our computational problems, we only consider instances where the G-Lipschitzness, L-smoothness and B-boundedness of the function are promised to hold for the parameters G, L and B prescribed in the input of the problem. Moreover, for all our computational hardness results, the instances of the problems arising from our reductions satisfy this constraint, which is guaranteed syntactically by our reduction.

In summary, in the rest of this paper, whenever we prove an upper bound for some computational problem, namely an upper bound on the number of steps or queries to the function oracle required to solve the problem in the black-box model, or the containment of the problem in some complexity class in the white-box model, we assume that the afore-described properties are satisfied by the O_f or C_f provided in the input. On the other hand, whenever we prove a lower bound for some computational problem, namely a lower bound on the number of steps/queries required to solve it in the black-box model, or its hardness for some complexity class in the white-box model, the instances arising in our lower bounds are guaranteed to satisfy the above properties syntactically by our constructions. As such, our hardness results will not exploit the difficulty of checking whether O_f or C_f satisfy the above constraints in order to infuse computational complexity into our problems, but will faithfully target the computational problems pertaining to min-max optimization of smooth and Lipschitz objectives that we aim to understand in this paper.
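The first promise above (consistency of function and gradient values) can be spot-checked, though never certified, by comparing reported gradients against finite differences of reported values; the sketch below (our own illustration, not part of the paper's formalism) flags an oracle whose gradient is inconsistent with its values.

```python
# Spot-checking the consistency promise: compare the oracle's reported
# gradient against a central finite difference of its reported values.
# A passing check is evidence, not a certificate; finitely many queries
# can never verify the promise everywhere.

def finite_diff_check(f, grad, x, h=1e-5, tol=1e-3):
    g = grad(x)
    for i in range(len(x)):
        xp, xm = list(x), list(x)
        xp[i] += h
        xm[i] -= h
        if abs((f(xp) - f(xm)) / (2 * h) - g[i]) > tol:
            return False
    return True

def f(x):
    return x[0] ** 2 + 3.0 * x[1]

def good_grad(x):
    return [2.0 * x[0], 3.0]

def bad_grad(x):           # inconsistent with f in the second coordinate
    return [2.0 * x[0], -3.0]

ok = finite_diff_check(f, good_grad, [0.3, 0.7])
bad = finite_diff_check(f, bad_grad, [0.3, 0.7])
```

This is exactly why the paper treats consistency as a promise: the property cannot be decided from black-box queries or from inspecting C_f, so it is instead guaranteed syntactically in all reductions.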
In this section we define the main complexity classes that we use in this paper, namely NP, FNP and PPAD, as well as the notion of reduction used to show containment or hardness of a problem for one of these complexity classes.
Definition 2.2 (Search Problems, NP, FNP). A binary relation Q ⊆ {0,1}* × {0,1}* is in the class FNP if (i) for every x, y ∈ {0,1}* such that (x, y) ∈ Q, it holds that |y| ≤ poly(|x|); and (ii) there exists an algorithm that verifies whether (x, y) ∈ Q in time poly(|x|, |y|). The search problem associated with a binary relation Q takes some x as input and requests as output some y such that (x, y) ∈ Q, or the output ⊥ if no such y exists. The decision problem associated with Q takes some x as input and requests as output the bit 1, if there exists some y such that (x, y) ∈ Q, and the bit 0, otherwise. The class NP is defined as the set of decision problems associated with relations Q ∈ FNP.

To define the complexity class PPAD we first define the notion of polynomial-time reductions between search problems (in this paper we only define and consider Karp-style reductions between search problems) and the computational problem End-of-a-Line. This problem is sometimes called End-of-the-Line, but we adopt the nomenclature proposed by [Rub16] since we agree that it describes the problem better.

Definition 2.3 (Polynomial-Time Reductions). A search problem P₁ is polynomial-time reducible to a search problem P₂ if there exist polynomial-time computable functions f : {0,1}* → {0,1}* and g : {0,1}* × {0,1}* × {0,1}* → {0,1}* with the following properties: (i) if x is an input to P₁, then f(x) is an input to P₂; and (ii) if y is a solution to P₂ on input f(x), then g(x, f(x), y) is a solution to P₁ on input x.

End-of-a-Line.
Input: Binary circuits C_S (for successor) and C_P (for predecessor) with n inputs and n outputs.
Output: One of the following, where 0ⁿ denotes the all-zeros string:
0. The string 0ⁿ, if either both C_P(C_S(0ⁿ)) and C_S(C_P(0ⁿ)) are equal to 0ⁿ, or if they are both different from 0ⁿ.
1. A binary string x ∈ {0,1}ⁿ such that x ≠ 0ⁿ and C_P(C_S(x)) ≠ x or C_S(C_P(x)) ≠ x.

To make sense of the above definition, we envision that the circuits C_S and C_P implicitly define a directed graph, with vertex set {0,1}ⁿ, such that the directed edge (x, y) ∈ {0,1}ⁿ × {0,1}ⁿ belongs to the graph if and only if C_S(x) = y and C_P(y) = x. As such, all vertices in the implicitly defined graph have in-degree and out-degree at most 1. The above problem permits an output of 0ⁿ if 0ⁿ has equal in-degree and out-degree in this graph. Otherwise it permits an output x ≠ 0ⁿ such that x has in-degree or out-degree equal to 0. It follows by the parity argument on directed graphs, namely that in every directed graph the sum of in-degrees equals the sum of out-degrees, that End-of-a-Line is a total problem, i.e. that for any possible binary circuits C_S and C_P there exists a solution of the "0." kind or the "1." kind in the definition of our problem (or both). Indeed, if 0ⁿ has unequal in- and out-degrees, there must exist another vertex x ≠ 0ⁿ with unequal in- and out-degrees, and thus one of these degrees must be 0 (as all vertices in the graph have in- and out-degrees bounded by 1). We are finally ready to define the complexity class PPAD, introduced by [Pap94b].
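The implicit graph and the two kinds of admissible outputs can be illustrated on a toy instance, with the circuits C_S and C_P played by plain Python functions on 3-bit strings encoded as integers (our own hypothetical example; real instances supply Boolean circuits, and exhaustive search is exponential in n):

```python
N_BITS = 3
ZERO = 0  # the all-zeros string, encoded as an integer

# Toy instance: the implicit graph consists of the single path 0 -> 1 -> 3 -> 7;
# every other string is isolated.  These dictionaries stand in for C_S and C_P.
_succ = {0: 1, 1: 3, 3: 7}
_pred = {1: 0, 3: 1, 7: 3}

def C_S(x):  # successor circuit
    return _succ.get(x, x)

def C_P(x):  # predecessor circuit
    return _pred.get(x, x)

def solve_end_of_a_line(succ, pred, n_bits):
    """Brute-force search over {0,1}^n -- exponential in n, for illustration only."""
    # Kind 0: the all-zeros string itself, if its in-degree equals its out-degree,
    # i.e. if the two tests in the definition are both true or both false.
    if (pred(succ(ZERO)) == ZERO) == (succ(pred(ZERO)) == ZERO):
        return ZERO
    # Kind 1: some x != 0 whose in-degree or out-degree is 0.
    for x in range(1, 2 ** n_bits):
        if pred(succ(x)) != x or succ(pred(x)) != x:
            return x
    raise AssertionError("unreachable: End-of-a-Line is total")

solution = solve_end_of_a_line(C_S, C_P, N_BITS)  # the end of the path, 7
```

Here 0 is a source (in-degree 0, out-degree 1), so the parity argument forces another unbalanced vertex; the search finds the sink 7.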
Definition 2.4 (PPAD). The complexity class PPAD contains all search problems that are polynomial-time reducible to the End-of-a-Line problem.

The complexity class PPAD is of particular importance, since it contains many fundamental problems in Game Theory, Economics, Topology and several other fields [DGP09, Das18]. A particularly important PPAD-complete problem is finding fixed points of continuous functions, whose existence is guaranteed by Brouwer's fixed point theorem.

Brouwer.
Input: Scalars L and γ, and a polynomial-time Turing machine C_M evaluating an L-Lipschitz function M : [0,1]^d → [0,1]^d.
Output: A point z⋆ ∈ [0,1]^d such that ‖z⋆ − M(z⋆)‖ < γ.

While not stated exactly in this form, the following is a straightforward implication of the results presented in [CDT09].

Lemma 2.5 ([CDT09]). Brouwer is PPAD-complete even when d = 2. Additionally, Brouwer is PPAD-complete even when γ = 1/poly(d) and L = poly(d).

Remark 2.6. In the definition of the problem Brouwer we assume that we are given in the input the description of a Turing machine C_M that computes the map M. In order for polynomial-time reductions to and from this problem to be meaningful, we need to have an upper bound on the running time of this Turing machine, which we want to be polynomial in the size of its input. The formal way to ensure this and derive meaningful complexity results is to define a different problem, say k-Brouwer, for every k ∈ ℕ. In the problem k-Brouwer the input Turing machine C_M has running time bounded by n^k in the size n of its input. In the rest of the paper, whenever we say that a polynomial-time Turing machine is required in the input to a computational problem Pr, we formally mean that we define a hierarchy of problems k-Pr, k ∈ ℕ, such that k-Pr takes as input Turing machines with running time bounded by n^k, and we interpret computational complexity results for Pr in the following way: whenever we prove that Pr belongs to some complexity class, we prove that k-Pr belongs to that complexity class for all k ∈ ℕ; whenever we prove that Pr is hard for some complexity class, we prove that, for some absolute constant k₀ determined in the hardness proof, k-Pr is hard for that class, for all k ≥ k₀. For simplicity of exposition, we do not repeat this discussion in the rest of this paper.

In this section, we define the computational problems that we study in this paper and discuss our main results, postponing formal statements to Section 4.
We start in Section 3.1 by defining the mathematical objects of our study, and proceed in Section 3.2 to define our main computational problems, namely: (1) finding approximate stationary points; (2) finding approximate local minima; and (3) finding approximate local min-max equilibria. In Section 3.3, we present some bonus problems, which are intimately related, as we will see, to problems (2) and (3). As discussed in Section 2, for ease of presentation, we define our problems as promise problems.
We define the concepts of stationary points, local minima, and local min-max equilibria of real-valued functions, and make some remarks about their existence, as well as their computational complexity. The formal discussion of the latter is postponed to Sections 3.2 and 4.

Before we proceed with our definitions, recall that the goal of this paper is to study constrained optimization. Our domain will be the hypercube [0,1]^d, which we might intersect with the set { x | g(x) ≤ 0 }, for some convex (potentially multivariate) function g. Although most of the definitions and results that we explore in this paper can be extended to arbitrary convex functions, we will focus on the case where g is linear, and the feasible set is thus a polytope. Focusing on this case avoids additional complications related to the representation of g in the input to the computational problems that we define in the next section, and also avoids issues related to verifying the convexity of g.

Definition 3.1 (Feasible Set and Refutation of Feasibility). Given A ∈ ℝ^{d×m} and b ∈ ℝ^m, we define the set of feasible solutions to be P(A, b) = { z ∈ [0,1]^d | Aᵀz ≤ b }. Observe that testing whether P(A, b) is empty can be done in polynomial time in the bit complexity of A and b.

Definition 3.2 (Projection Operator). For a nonempty, closed, and convex set K ⊂ ℝ^d, we define the projection operator Π_K : ℝ^d → K as Π_K x = argmin_{y ∈ K} ‖x − y‖. It is well known that for any nonempty, closed, and convex set K the argmin_{y ∈ K} ‖x − y‖ exists and is unique, hence Π_K is well defined.

Now that we have defined the domain of the real-valued functions that we consider in this paper, we are ready to define a notion of approximate stationary points.

Definition 3.3 (ε-Stationary Point). Let f : [0,1]^d → ℝ be a G-Lipschitz and L-smooth function, and let A ∈ ℝ^{d×m}, b ∈ ℝ^m. We call a point x⋆ ∈ P(A, b) an ε-stationary point of f if ‖∇f(x⋆)‖ < ε.

It is easy to see that there exist continuously differentiable functions f that do not have any (approximate) stationary points, e.g. linear functions. As we will see later in this paper, deciding whether a given function f has a stationary point is NP-hard and, in fact, it is even NP-hard to decide whether a function has an approximate stationary point of a very gross approximation. At the same time, verifying whether a given point is (approximately) stationary can be done efficiently given access to a polynomial-time Turing machine that computes ∇f, so the problem of deciding whether an (approximate) stationary point exists lies in NP, as long as we can guarantee that, if there is such a point, there will also be one with polynomial bit complexity. We postpone a formal discussion of the computational complexity of finding (approximate) stationary points, or deciding their existence, until we have formally defined our corresponding computational problem and settled the bit complexity of its solutions.

For the definition of local minima and local min-max equilibria we need the notion of closed d-dimensional Euclidean balls.

Definition 3.4 (Euclidean Ball). For r ∈ ℝ₊ we define the closed Euclidean ball of radius r to be the set B_d(r) = { x ∈ ℝ^d | ‖x‖ ≤ r }. We also define the closed Euclidean ball of radius r centered at z ∈ ℝ^d to be the set B_d(r; z) = { x ∈ ℝ^d | ‖x − z‖ ≤ r }.

Definition 3.5 ((ε, δ)-Local Minimum). Let f : [0,1]^d → ℝ be a G-Lipschitz and L-smooth function, A ∈ ℝ^{d×m}, b ∈ ℝ^m, and ε, δ > 0. A point x⋆ ∈ P(A, b) is an (ε, δ)-local minimum of f constrained on P(A, b) if and only if f(x⋆) < f(x) + ε for every x ∈ P(A, b) such that x ∈ B_d(δ; x⋆).

To be clear, using the term "local minimum" in Definition 3.5 is a bit of a misnomer, since for large enough values of δ the definition captures global minima as well. As δ ranges from large to small, our notion of (ε, δ)-local minimum transitions from being an ε-globally optimal point to being an ε-locally optimal point. Importantly, unlike (approximate) stationary points, an (ε, δ)-local minimum is guaranteed to exist for all ε, δ > 0, due to the compactness of [0,1]^d ∩ P(A, b) and the continuity of f. Thus the problem of finding an (ε, δ)-local minimum is total for arbitrary values of ε and δ. On the negative side, for arbitrary values of ε and δ, there is no polynomial-size and polynomial-time verifiable witness for certifying that a point x⋆ is an (ε, δ)-local minimum. Thus the problem of finding an (ε, δ)-local minimum is not known to lie in FNP. As we will see in Section 4, this issue can be circumvented if we focus on particular settings of ε and δ, in relationship to the Lipschitzness and smoothness of f and the dimension d.

Finally, we define (ε, δ)-local min-max equilibria as follows, recasting Definition 1.1 to the constraint set P(A, b).

Definition 3.6 ((ε, δ)-Local Min-Max Equilibrium). Let f : [0,1]^{d₁} × [0,1]^{d₂} → ℝ be a G-Lipschitz and L-smooth function, A ∈ ℝ^{d×m} and b ∈ ℝ^m, where d = d₁ + d₂, and ε, δ > 0. A point (x⋆, y⋆) ∈ P(A, b) is an (ε, δ)-local min-max equilibrium of f if and only if the following hold:
▷ f(x⋆, y⋆) < f(x, y⋆) + ε for every x ∈ B_{d₁}(δ; x⋆) with (x, y⋆) ∈ P(A, b); and
▷ f(x⋆, y⋆) > f(x⋆, y) − ε for every y ∈ B_{d₂}(δ; y⋆) with (x⋆, y) ∈ P(A, b).

Similarly to Definition 3.5, for large enough values of δ, Definition 3.6 captures global min-max equilibria as well. As δ ranges from large to small, our notion of (ε, δ)-local min-max equilibrium transitions from being an ε-approximate min-max equilibrium to being an ε-approximate local min-max equilibrium. Moreover, in comparison to local minima and stationary points, the problem of finding an (ε, δ)-local min-max equilibrium is neither total nor can its solutions be verified efficiently for all values of ε and δ, even when P(A, b) = [0,1]^d. Again, this issue can be circumvented if we focus on particular settings of ε and δ values, as we will see in Section 4.

In this section, we define the search problems associated with our aforementioned definitions of approximate stationary points, local minima, and local min-max equilibria. We state our problems in terms of white-box access to the function f and its gradient. Switching to the black-box variants of our computational problems amounts to simply replacing the Turing machines provided in the input of the problems with oracle access to the function and its gradient, as discussed in Section 2. As per our discussion in the same section, we define our computational problems as promise problems, the promise being that the Turing machine (or oracle) provided in the input to our problems outputs function values and gradient values that are consistent with a smooth and Lipschitz function with the smoothness and Lipschitzness prescribed in the input. Besides making the presentation cleaner, as we discussed in Section 2, the motivation for doing so is to prevent the possibility that computational complexity is tacked onto our problems due to the possibility that the Turing machines/oracles provided in the input do not output function and gradient values that are consistent with a Lipschitz and smooth function. Importantly, all our computational hardness results syntactically guarantee that the Turing machines/oracles provided as input to our constructed hard instances satisfy these constraints.

Before stating our main computational problems below, we note that, for each problem, the dimension d (in unary representation) is also an implicit input, as the description of the Turing machine C_f (or the interface to the oracle O_f in the black-box counterpart of each problem below) has size at least linear in d.
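As an aside, the two inequalities of Definition 3.6 are easy to test against by sampling, which is useful for sanity checks even though random sampling can refute but never certify them. A sketch for the unconstrained case P(A, b) = [0,1]^{d₁+d₂}, with helper names of our own invention:

```python
import math
import random

def refutes_local_min_max(f, x_star, y_star, eps, delta, trials=2000, seed=0):
    """Try to refute that (x_star, y_star) is an (eps, delta)-local min-max
    equilibrium of f on [0,1]^{d1} x [0,1]^{d2}.  Returns a violating deviation,
    or None if no violation was found among the sampled points."""
    rng = random.Random(seed)

    def sample_ball(center):
        # rejection-sample a point of the delta-ball around `center`,
        # intersected with the unit hypercube
        while True:
            p = tuple(c + rng.uniform(-delta, delta) for c in center)
            if math.dist(p, center) <= delta and all(0.0 <= c <= 1.0 for c in p):
                return p

    f_star = f(x_star, y_star)
    for _ in range(trials):
        x = sample_ball(x_star)
        if f_star >= f(x, y_star) + eps:       # min player has a profitable deviation
            return ("x-deviation", x)
        y = sample_ball(y_star)
        if f_star <= f(x_star, y) - eps:       # max player has a profitable deviation
            return ("y-deviation", y)
    return None

# f(x, y) = (x - 1/2)^2 - (y - 1/2)^2 has an exact local min-max point at (1/2, 1/2).
f = lambda x, y: (x[0] - 0.5) ** 2 - (y[0] - 0.5) ** 2
witness = refutes_local_min_max(f, (0.5,), (0.5,), eps=1e-3, delta=0.1)  # None
```

By contrast, a point such as (0.9, 0.5) for the same f is refuted almost immediately, since the min player can decrease f inside the δ-ball.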
We also refer to Remark 2.6 for how we may formally study computational problems that take a polynomial-time Turing machine in their input.

StationaryPoint.
Input: Scalars ε, G, L, B > 0; a polynomial-time Turing machine C_f evaluating a G-Lipschitz and L-smooth function f : [0,1]^d → [−B, B] and its gradient ∇f : [0,1]^d → ℝ^d; a matrix A ∈ ℝ^{d×m} and vector b ∈ ℝ^m such that P(A, b) ≠ ∅.
Output: If there exists some point x ∈ P(A, b) such that ‖∇f(x)‖ < ε/2, output some point x⋆ ∈ P(A, b) such that ‖∇f(x⋆)‖ < ε; if, for all x ∈ P(A, b), ‖∇f(x)‖ > ε, output ⊥; otherwise, it is allowed to either output some x⋆ ∈ P(A, b) such that ‖∇f(x⋆)‖ < ε or to output ⊥.

It is easy to see that StationaryPoint lies in FNP. Indeed, if there exists some point x ∈ P(A, b) such that ‖∇f(x)‖ < ε/2, then by the L-smoothness of f there must exist some point x⋆ ∈ P(A, b) of bit complexity polynomial in the size of the input such that ‖∇f(x⋆)‖ < ε. On the other hand, it is clear that no such point exists if, for all x ∈ P(A, b), ‖∇f(x)‖ > ε. We note that the looseness of the output requirement in our problem, for functions f that do not have points x ∈ P(A, b) such that ‖∇f(x)‖ < ε/2 but do have points x ∈ P(A, b) such that ‖∇f(x)‖ ≤ ε, is introduced for the sole purpose of making the problem lie in FNP, as otherwise we would not be able to guarantee that the solutions to our search problem have polynomial bit complexity. As we show in Section 4, StationaryPoint is also FNP-hard, even when ε is a constant, the constraint set is very simple, namely P(A, b) = [0,1]^d, and G, L are both polynomial in d.

Next, we define the computational problems associated with local minima and local min-max equilibria. Recall that the first is guaranteed to have a solution because, in particular, a global minimum exists due to the continuity of f and the compactness of P(A, b).

LocalMin.
Input: Scalars ε, δ, G, L, B > 0; a polynomial-time Turing machine C_f evaluating a G-Lipschitz and L-smooth function f : [0,1]^d → [−B, B] and its gradient ∇f : [0,1]^d → ℝ^d; a matrix A ∈ ℝ^{d×m} and vector b ∈ ℝ^m such that P(A, b) ≠ ∅.
Output: A point x⋆ ∈ P(A, b) such that f(x⋆) < f(x) + ε for all x ∈ B_d(δ; x⋆) ∩ P(A, b).

LocalMinMax.
Input: Scalars ε, δ, G, L, B > 0; a polynomial-time Turing machine C_f evaluating a G-Lipschitz and L-smooth function f : [0,1]^{d₁} × [0,1]^{d₂} → [−B, B] and its gradient ∇f : [0,1]^{d₁} × [0,1]^{d₂} → ℝ^{d₁+d₂}; a matrix A ∈ ℝ^{d×m} and vector b ∈ ℝ^m such that P(A, b) ≠ ∅, where d = d₁ + d₂.
Output: A point (x⋆, y⋆) ∈ P(A, b) such that
▷ f(x⋆, y⋆) < f(x, y⋆) + ε for all x ∈ B_{d₁}(δ; x⋆) with (x, y⋆) ∈ P(A, b), and
▷ f(x⋆, y⋆) > f(x⋆, y) − ε for all y ∈ B_{d₂}(δ; y⋆) with (x⋆, y) ∈ P(A, b),
or ⊥ if no such point exists.

Unlike StationaryPoint, the problems LocalMin and LocalMinMax exhibit vastly different behavior depending on the values of the inputs ε and δ in relationship to G, L and d, as we will see in Section 4 where we summarize our computational complexity results. This range of behaviors is rooted in our earlier remark that, depending on the value of δ provided in the input to these problems, they capture the complexity of finding global minima/min-max equilibria, for large values of δ, as well as finding local minima/min-max equilibria, for small values of δ.

Next we present a couple of bonus problems, GDFixedPoint and GDAFixedPoint, which respectively capture the computation of fixed points of the (projected) gradient descent and the (projected) gradient descent/ascent dynamics, with learning rate 1. As we see in Section 5, these problems are intimately related, indeed equivalent under polynomial-time reductions, to the problems LocalMin and LocalMinMax respectively, in certain regimes of the approximation parameters. Before stating the problems GDFixedPoint and GDAFixedPoint, we define the mappings F^GD and F^GDA whose fixed points these problems are targeting.
Definition 3.7 (Projected Gradient Descent). For a closed and convex K ⊆ ℝ^d and some continuously differentiable function f : K → ℝ, we define the Projected Gradient Descent Dynamics with learning rate 1 as the map F^GD : K → K, where F^GD(x) = Π_K(x − ∇f(x)).

Definition 3.8 (Projected Gradient Descent/Ascent). For a closed and convex K ⊆ ℝ^{d₁} × ℝ^{d₂} and some continuously differentiable function f : K → ℝ, we define the Unsafe Projected Gradient Descent/Ascent Dynamics with learning rate 1 as the map F^GDA : K → ℝ^{d₁} × ℝ^{d₂} defined as follows:

F^GDA(x, y) ≜ ( Π_{K(y)}(x − ∇_x f(x, y)), Π_{K(x)}(y + ∇_y f(x, y)) ) ≜ ( F^GDA_x(x, y), F^GDA_y(x, y) )

for all (x, y) ∈ K, where K(y) = { x′ | (x′, y) ∈ K } and K(x) = { y′ | (x, y′) ∈ K }.

Note that F^GDA is called "unsafe" because the projection happens individually for x − ∇_x f(x, y) and y + ∇_y f(x, y); thus F^GDA(x, y) may not lie in K. We also define the "safe" version F^sGDA, which projects the pair (x − ∇_x f(x, y), y + ∇_y f(x, y)) jointly onto K. As we show in Section 5 (in particular inside the proof of Theorem 5.2), computing fixed points of F^GDA and of F^sGDA are computationally equivalent, so we stick to F^GDA, which makes the presentation slightly cleaner.

We are now ready to define GDFixedPoint and GDAFixedPoint. As per earlier discussions, we define these computational problems as promise problems, the promise being that the Turing machine provided in the input to these problems outputs function values and gradient values that are consistent with a smooth and Lipschitz function with the smoothness and Lipschitzness prescribed in the input to these problems.

GDFixedPoint.
Input: Scalars α, G, L, B > 0; a polynomial-time Turing machine C_f evaluating a G-Lipschitz and L-smooth function f : [0,1]^d → [−B, B] and its gradient ∇f : [0,1]^d → ℝ^d; a matrix A ∈ ℝ^{d×m} and vector b ∈ ℝ^m such that P(A, b) ≠ ∅.
Output: A point x⋆ ∈ P(A, b) such that ‖x⋆ − F^GD(x⋆)‖ < α, where K = P(A, b) is the projection set used in the definition of F^GD.

GDAFixedPoint.
Input: Scalars α, G, L, B > 0; a polynomial-time Turing machine C_f evaluating a G-Lipschitz and L-smooth function f : [0,1]^{d₁} × [0,1]^{d₂} → [−B, B] and its gradient ∇f : [0,1]^{d₁} × [0,1]^{d₂} → ℝ^{d₁+d₂}; a matrix A ∈ ℝ^{d×m} and vector b ∈ ℝ^m such that P(A, b) ≠ ∅, where d = d₁ + d₂.
Output: A point (x⋆, y⋆) ∈ P(A, b) such that ‖(x⋆, y⋆) − F^GDA(x⋆, y⋆)‖ < α, where K = P(A, b) is the projection set used in the definition of F^GDA.

In Section 5 we show that the problems GDFixedPoint and LocalMin are equivalent under polynomial-time reductions, and that the problems GDAFixedPoint and LocalMinMax are equivalent under polynomial-time reductions, in certain regimes of the approximation parameters.

In this section we summarize our results for the optimization problems that we defined in the previous section. We start with our theorem about the complexity of finding approximate stationary points, which we show to be FNP-complete even for large values of the approximation.
Theorem 4.1 (Complexity of Finding Approximate Stationary Points). The computational problem StationaryPoint is FNP-complete, even when ε is set to any value below some universal constant, and even when P(A, b) = [0,1]^d, G = √d, L = d, and B = 1.

It is folklore and easy to verify that approximate stationary points always exist and can be found in time poly(B, 1/ε, L) when the domain of f is unconstrained, i.e. it is the whole ℝ^d, and the range of f is bounded, i.e., when f(ℝ^d) ⊆ [−B, B]. Theorem 4.1 implies that no such guarantee should be expected in the bounded-domain case, where the existence of approximate stationary points is not guaranteed and must also be verified. In particular, it follows from our theorem that any algorithm that verifies the existence of and computes approximate stationary points in the constrained case must take time that is super-polynomial in at least one of G, L, or d, unless P = NP. The proof of Theorem 4.1 is based on an elegant construction for converting (real-valued) stationary points of an appropriately constructed function to (binary) solutions of a target Sat instance. This conversion involves the use of the Lovász Local Lemma [EL73]. The details of the proof can be found in Appendix A.

The complexity of LocalMin and LocalMinMax is more difficult to characterize, as the nature of these problems changes drastically depending on the relationship of δ with ε, G, L and d, which determines whether these problems ask for a globally vs locally approximately optimal solution. In particular, there are two regimes wherein the complexity of both problems is simple to characterize.

▷ Global Regime.
When δ ≥ √d, both LocalMin and LocalMinMax ask for a globally optimal solution. In this regime it is not difficult to see that both problems are FNP-hard to solve even when ε = Θ(1) and G, L are O(d) (see Section 10).

▷ Trivial Regime.
When δ satisfies δ < ε/G, then for every point z ∈ P(A, b) it holds, by G-Lipschitzness, that |f(z) − f(z′)| < ε for every z′ ∈ B_d(δ; z) with z′ ∈ P(A, b). Thus, every point z in the domain P(A, b) is a solution to both LocalMin and LocalMinMax.

It is clear from our discussion above, and in earlier sections, that, to really capture the complexity of finding local as opposed to global minima/min-max equilibria, we should restrict the value of δ. We identify the following regime, which we call the "local regime." As we argue shortly, this regime is markedly different from the global regime identified above in that (i) a solution is guaranteed to exist for both our problems of interest, whereas in the global regime only LocalMin is guaranteed to have a solution; and (ii) their computational complexity transitions to lower complexity classes.

▷ Local Regime.
Our main focus in this paper is the regime defined by δ < √(2ε/L). In this regime it is well known that Projected Gradient Descent can solve LocalMin in time O(B · L/ε) (see Appendix E). Our main interest is understanding the complexity of LocalMinMax, which is not well understood in this regime. We note that the use of the constant 2 in the constraint δ < √(2ε/L), which defines the local regime, has a natural motivation: consider a point z where an L-smooth function f has ∇f(z) = 0; it follows from the definition of smoothness that z is both an (ε, δ)-local minimum and an (ε, δ)-local min-max equilibrium, as long as δ < √(2ε/L).

The following theorems provide tight upper and lower bounds on the computational complexity of solving LocalMinMax in the local regime. For compactness, we define the following problem:

Definition 4.2 (Local Regime LocalMinMax). We define the local-regime local min-max equilibrium computation problem, in short LR-LocalMinMax, to be the search problem LocalMinMax restricted to instances in the local regime, i.e. satisfying δ < √(2ε/L).

Theorem 4.3 (Existence of Approximate Local Min-Max Equilibrium). The computational problem
LR-LocalMinMax belongs to PPAD. As a byproduct, if some function f is G-Lipschitz and L-smooth, then an (ε, δ)-local min-max equilibrium is guaranteed to exist when δ < √(2ε/L), i.e. in the local regime.
Theorem 4.4 (Hardness of Finding Approximate Local Min-Max Equilibrium). The search problem LR-LocalMinMax is PPAD-hard, for any δ ≥ √(ε/L), and even when it holds that ε = 1/poly(d), G = poly(d), L = poly(d), and B = d.

Theorem 4.4 implies that any algorithm that computes an (ε, δ)-local min-max equilibrium of a G-Lipschitz and L-smooth function f in the local regime should take time that is super-polynomial in at least one of 1/ε, G, L or d, unless FP = PPAD. As such, the complexity of computing local min-max equilibria in the local regime is markedly different from the complexity of computing local minima, which can be found using Projected Gradient Descent in poly(G, L, 1/ε, d) time and function/gradient evaluations (see Appendix E).

An important property of our reduction in the proof of Theorem 4.4 is that it is a black-box reduction. We can hence prove the following unconditional lower bound in the black-box model.

Theorem 4.5 (Black-Box Lower Bound for Finding Approximate Local Min-Max Equilibrium). Suppose A ∈ ℝ^{d×m} and b ∈ ℝ^m are given together with an oracle O_f that outputs a G-Lipschitz and L-smooth function f : P(A, b) → [−1, 1] and its gradient ∇f. Let also δ ≥ √(ε/L), ε ≤ G/L, and let all the parameters 1/ε, δ, L, G be upper bounded by poly(d). Then any algorithm that has access to f only through O_f and computes an (ε, δ)-local min-max equilibrium has to make a number of queries to O_f that is exponential in at least one of the parameters 1/ε, G, L or d, even when P(A, b) ⊆ [0,1]^d.

Our main goal in the rest of the paper is to provide the proofs of Theorems 4.3, 4.4 and 4.5. In Section 5, we show how to use Brouwer's fixed point theorem to prove the existence of approximate local min-max equilibria in the local regime. Moreover, we establish an equivalence between LocalMinMax and GDAFixedPoint in the local regime, and show that both belong to PPAD. In Sections 6 and 7, we provide a detailed proof of our main result, i.e. Theorem 4.4. Finally, in Section 9, we show how our proof from Section 7 produces as a byproduct the black-box, unconditional lower bound of Theorem 4.5. In Section 8, we outline a useful interpolation technique which allows us to interpolate a function, given its values and the values of its gradient on a hypergrid, so as to enforce the Lipschitzness and smoothness of the interpolating function. We make heavy use of this technically involved result in all our hardness proofs.
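To make the dynamics at the center of the following sections concrete, here is a minimal sketch of the unsafe map F^GDA of Definition 3.8 with learning rate 1, specialized to the box K = [0,1] × [0,1] (so both slice projections are just clamps), on the bilinear objective f(x, y) = (x − 1/2)(y − 1/2); the helper names are ours, not the paper's:

```python
def clamp01(t):
    return min(1.0, max(0.0, t))

def F_GDA(x, y, grad_x, grad_y):
    """Unsafe projected Gradient Descent/Ascent map with learning rate 1 on
    K = [0,1]^2; each coordinate block is projected separately (Definition 3.8)."""
    return clamp01(x - grad_x(x, y)), clamp01(y + grad_y(x, y))

# f(x, y) = (x - 1/2)(y - 1/2): grad_x f = y - 1/2, grad_y f = x - 1/2.
gx = lambda x, y: y - 0.5
gy = lambda x, y: x - 0.5

# The center is an exact fixed point of F_GDA ...
assert F_GDA(0.5, 0.5, gx, gy) == (0.5, 0.5)

# ... while iterating from a nearby point, GDA with learning rate 1 does not
# converge in this run: the iterates drift away and stick to the boundary.
x, y = 0.6, 0.5
trajectory = []
for _ in range(8):
    x, y = F_GDA(x, y, gx, gy)
    trajectory.append((round(x, 3), round(y, 3)))
```

This is why the problems above ask for (approximate) fixed points of the map, rather than for limits of the dynamics.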
In this section, we establish the totality of LR-LocalMinMax, i.e. of LocalMinMax for instances satisfying δ < √(2ε/L), as defined in Definition 4.2. In particular, we prove that every G-Lipschitz and L-smooth function admits an (ε, δ)-local min-max equilibrium, as long as δ < √(2ε/L). A byproduct of our proof is in fact that LR-LocalMinMax lies inside PPAD. Specifically, the main tool that we use to prove our result is a computational equivalence between the problem of finding fixed points of the Gradient Descent/Ascent dynamics, i.e. GDAFixedPoint, and the problem LR-LocalMinMax. A similar equivalence between GDFixedPoint and LocalMin also holds, but the details of that are left to the reader as a simple exercise. Next, we first present the equivalence between GDAFixedPoint and LR-LocalMinMax, and we then show that GDAFixedPoint is in PPAD, which then also establishes that LR-LocalMinMax is in PPAD.

Theorem 5.1. The search problems LR-LocalMinMax and GDAFixedPoint are equivalent under polynomial-time reductions. That is, there is a polynomial-time reduction from LR-LocalMinMax to GDAFixedPoint and vice versa. In particular, given some A ∈ ℝ^{d×m} and b ∈ ℝ^m such that P(A, b) ≠ ∅, along with a G-Lipschitz and L-smooth function f : P(A, b) → ℝ:

1. For arbitrary ε > 0 and 0 < δ < √(2ε/L), suppose that (x*, y*) ∈ P(A, b) is an α-approximate fixed point of F^GDA, i.e., ‖(x*, y*) − F^GDA(x*, y*)‖ < α, where α ≤ √((G + δ)² + (2ε − Lδ²)) − (G + δ). Then (x*, y*) is also an (ε, δ)-local min-max equilibrium of f.

2. For arbitrary α > 0, suppose that (x*, y*) is an (ε, δ)-local min-max equilibrium of f for ε = α · L(L + 1) and δ = √(2ε/L). Then (x*, y*) is also an α-approximate fixed point of F^GDA.

The proof of Theorem 5.1 is presented in Appendix B.1. As already discussed, we use GDAFixedPoint as an intermediate step to establish the totality of LR-LocalMinMax and to show its inclusion in PPAD. This leads to the following theorem.
Theorem 5.2. The computational problems GDAFixedPoint and LR-LocalMinMax are both total search problems, and they both lie in PPAD.

Observe that Theorem 4.3 is implied by Theorem 5.2, whose proof is presented in Appendix B.2.
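For intuition on the F^GDA versus F^sGDA distinction that appears inside the proof of Theorem 5.2, the following sketch (a toy set-up of our own, with K = {(x, y) ∈ [0,1]² : x + y ≤ 1} so that the constraint couples x and y) shows the unsafe map leaving K while the safe map stays inside:

```python
def clamp(t, lo, hi):
    return min(hi, max(lo, t))

# Feasible set K = {(x, y) in [0,1]^2 : x + y <= 1}, which couples x and y.
def in_K(x, y, tol=1e-12):
    return -tol <= x <= 1 + tol and -tol <= y <= 1 + tol and x + y <= 1 + tol

def project_K(a, b):
    """Projection onto K: project onto the half-plane x + y <= 1, then clamp to
    the box.  This composition is adequate for the points in this illustration;
    exact projection onto a general polytope needs a small QP."""
    if a + b > 1.0:
        shift = (a + b - 1.0) / 2.0
        a, b = a - shift, b - shift
    return clamp(a, 0.0, 1.0), clamp(b, 0.0, 1.0)

def F_GDA_unsafe(x, y, gx, gy):
    # each block is projected onto its own slice: K(y) = [0, 1-y], K(x) = [0, 1-x]
    return clamp(x - gx(x, y), 0.0, 1.0 - y), clamp(y + gy(x, y), 0.0, 1.0 - x)

def F_GDA_safe(x, y, gx, gy):
    # the pair (x - grad_x f, y + grad_y f) is projected jointly onto K
    return project_K(x - gx(x, y), y + gy(x, y))

# f(x, y) = y - x: the min player wants to increase x, the max player to increase y.
gx = lambda x, y: -1.0
gy = lambda x, y: 1.0

unsafe = F_GDA_unsafe(0.2, 0.2, gx, gy)   # lands at (0.8, 0.8), outside K
safe = F_GDA_safe(0.2, 0.2, gx, gy)       # lands inside K
```

Both players push toward larger coordinates; projecting each block against the other player's current value ignores that the two moves jointly violate x + y ≤ 1, which is exactly why F^GDA is called unsafe.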
In Section 5 we established that LR-LocalMinMax belongs to PPAD. Our proof went via the intermediate problem GDAFixedPoint, which we showed to be computationally equivalent to LR-LocalMinMax. Our next step is to prove the PPAD-hardness of LR-LocalMinMax, again using GDAFixedPoint as an intermediate problem.

In this section we prove that GDAFixedPoint is PPAD-hard in four dimensions. To establish this hardness result we introduce a variant of the classical 2D-Sperner problem, which we call 2D-BiSperner and which we show is PPAD-hard. The main technical part of our proof is to show that 2D-BiSperner with input size n reduces to GDAFixedPoint with input size poly(n), α = exp(−poly(n)), G = L = exp(poly(n)), and B = 2. This reduction proves the hardness of GDAFixedPoint. Formally, the main result of this section is the following theorem.

Theorem 6.1.
The problem GDAFixedPoint is PPAD-complete even in dimension d = 4 and with B = 2. Therefore, LR-LocalMinMax is PPAD-complete even in dimension d = 4 and B = 2.

The above result excludes the existence of an algorithm for GDAFixedPoint whose running time is poly(log G, log L, log(1/α), B) and, equivalently, the existence of an algorithm for LR-LocalMinMax with running time poly(log G, log L, log(1/ε), log(1/δ), B), unless FP = PPAD. Observe that a stronger hardness result for the four-dimensional GDAFixedPoint problem would not be possible, since it is simple to construct brute-force search algorithms with running time poly(1/α, G, L, B). We elaborate on such algorithms towards the end of this section. In order to prove the hardness of GDAFixedPoint for polynomially (rather than exponentially) bounded (in the size of the input) values of 1/α, G, and L (see Theorem 4.4), we need to consider optimization problems in higher dimensions. This is the problem that we explore in Section 7. Beyond establishing the hardness of the problem for d = 4, the reduction of this section also places 2D-BiSperner in PPAD.

We start by introducing the 2D-BiSperner problem. Consider a coloring of the N × N, 2-dimensional grid where, instead of coloring each vertex of the grid with a single color (as in Sperner's lemma), each vertex is colored via a combination of two out of four available colors. The four available colors are 1−, 1+, 2−, 2+. The five rules that define a proper coloring of the N × N grid are the following.

1. The first color of every vertex is either 1− or 1+ and the second color is either 2− or 2+.
2. The first color of all vertices on the left boundary of the grid is 1+.
3. The first color of all vertices on the right boundary of the grid is 1−.
4. The second color of all vertices on the bottom boundary of the grid is 2+.
5. The second color of all vertices on the top boundary of the grid is 2−.

Figure 4: Left: Summary of the rules for a proper coloring of the grid. The gray color on the left and the right side can be replaced with either blue or green. Similarly, the gray color on the top and the bottom side can be replaced with either red or yellow. Right: An example of a proper coloring of a 9 × 9 grid. The brown boxes indicate the two panchromatic cells, i.e., the cells where all four available colors appear.

Using similar proof ideas as in Sperner's lemma, it is not hard to establish via a combinatorial argument that in every proper coloring of the N × N grid there exists a square cell where each of the four colors in {1−, 1+, 2−, 2+} appears in at least one of its vertices. We call such a cell a panchromatic square. In the 2D-BiSperner problem, defined formally below, we are given the description of some coloring of the grid and are asked to find either a panchromatic square or a violation of the proper coloring conditions. In this paper we will not present a direct combinatorial argument guaranteeing the existence of panchromatic squares under proper colorings of the grid, since their existence is implied by the totality of the 2D-BiSperner problem, which in turn follows from our reduction from 2D-BiSperner to GDAFixedPoint together with our proofs in Section 5 establishing the totality of GDAFixedPoint. In Figure 4 we summarize the five rules that define proper colorings and we present an example of a proper coloring of the grid with 9 discrete points on each side.

In order to formally define the computational problem 2D-BiSperner in a way that is useful for our reductions, we need to allow for colorings of the N × N grid that are described in a succinct way, where the value N can be exponentially large compared to the size of the input to the problem.
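To make the five rules concrete, here is a small sketch that checks them mechanically. The coloring is represented as a plain Python function returning a pair in {−1, +1}^2 (first coordinate: 1−/1+, second coordinate: 2−/2+); the particular toy coloring is our own illustrative choice and not from the paper.

```python
# Hypothetical checker for the coloring rules and for panchromatic cells
# on a small 2^n x 2^n grid.  A coloring is a function
# cl(i, j) -> (c1, c2) with c1, c2 in {-1, +1}.

def is_panchromatic(cl, i, j):
    """Cell with bottom-left vertex (i, j): both output coordinates of
    cl must take the values -1 and +1 somewhere among the 4 corners."""
    corners = [cl(i + di, j + dj) for di in (0, 1) for dj in (0, 1)]
    return ({c[0] for c in corners} == {-1, 1}
            and {c[1] for c in corners} == {-1, 1})

def boundary_violation(cl, i, j, n):
    """Rules 2-5: left column must be 1+, right column 1-,
    bottom row 2+, top row 2-."""
    c1, c2 = cl(i, j)
    last = 2 ** n - 1
    return ((i == 0 and c1 == -1) or (i == last and c1 == +1) or
            (j == 0 and c2 == -1) or (j == last and c2 == +1))

def cl(i, j):
    """Toy proper coloring of the 4 x 4 grid (n = 2)."""
    return (+1 if i < 2 else -1, +1 if j < 2 else -1)

# The central cell sees all four colors, and no vertex breaks rules 2-5.
print(is_panchromatic(cl, 1, 1))                              # True
print(any(boundary_violation(cl, i, j, 2)
          for i in range(4) for j in range(4)))               # False
```

The grid here has constant size; in the formal definition below the coloring is instead given as a boolean circuit, so that N = 2^n can be exponentially large in the input size while a candidate solution is still verifiable locally, by evaluating the circuit on the four corners of a single cell.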
A standard way to do this, introduced by [Pap94b] in defining the computational version of Sperner's lemma, is to describe a coloring via a binary circuit C_l that takes as input the coordinates of a vertex of the grid and outputs the combination of colors that is used to color this vertex. In the input, each of the two coordinates of the vertex is given via the binary representation of a number in [N] − 1. Setting N = 2^n, the representation of each coordinate belongs to {0, 1}^n. In the rest of the section we abuse notation and use a coordinate i ∈ {0, 1}^n both as a binary string and as a number in [2^n] − 1. The output of C_l should be a combination of one of the colors {1−, 1+} and one of the colors {2−, 2+}. We represent this combination as a pair in {−1, 1}^2: the first coordinate of this pair refers to the choice of 1− or 1+ and the second coordinate refers to the choice of 2− or 2+.

In the definition of the computational problem 2D-BiSperner the input is a circuit C_l as described above. One type of possible solution to 2D-BiSperner is a pair of coordinates (i, j) ∈ {0, 1}^n × {0, 1}^n indexing a cell of the grid whose bottom-left vertex is (i, j). For this type of solution to be valid, the outputs of C_l evaluated on all the vertices of this cell must contain at least one negative and one positive value in each of the two output coordinates of C_l, i.e., the cell must be panchromatic. Another type of possible solution to 2D-BiSperner is a vertex whose coloring violates the proper coloring conditions for the boundary, namely rules 2–5 above. For notational convenience we refer to the first coordinate of the output of C_l by C_l^1 and to the second coordinate by C_l^2. The formal definition of the computational problem 2D-BiSperner is then the following.

2D-BiSperner.
Input: A boolean circuit C_l : {0, 1}^n × {0, 1}^n → {−1, 1}^2.
Output: A vertex (i, j) ∈ {0, 1}^n × {0, 1}^n such that one of the following holds:
1. i ≠ 2^n − 1, j ≠ 2^n − 1, and
   ⋃_{i′ − i ∈ {0,1}, j′ − j ∈ {0,1}} C_l^1(i′, j′) = {−1, 1} and ⋃_{i′ − i ∈ {0,1}, j′ − j ∈ {0,1}} C_l^2(i′, j′) = {−1, 1}, or
2. i = 0 and C_l^1(i, j) = −1, or
3. i = 2^n − 1 and C_l^1(i, j) = +1, or
4. j = 0 and C_l^2(i, j) = −1, or
5. j = 2^n − 1 and C_l^2(i, j) = +1.

As we show next, 2D-BiSperner is PPAD-hard. Thus our reduction from 2D-BiSperner to GDAFixedPoint in the next section establishes both the PPAD-hardness of GDAFixedPoint and the inclusion of 2D-BiSperner in PPAD.

Lemma 6.2.
The problem 2D-BiSperner is PPAD-hard.

Proof. To prove this lemma we will use Lemma 2.5. Let C_M be a polynomial-time Turing machine that computes a function M : [0, 1]^2 → [0, 1]^2 that is L-Lipschitz. We know from Lemma 2.5 that finding γ-approximate fixed points of M is PPAD-hard. We will use C_M to define a circuit C_l such that a solution of 2D-BiSperner with input C_l will give us a γ-approximate fixed point of M.

Consider the function g(x) = M(x) − x. Since M is L-Lipschitz, the function g : [0, 1]^2 → [−1, 1]^2 is (L + 1)-Lipschitz. Additionally, g can easily be computed via a polynomial-time Turing machine C_g that uses C_M as a subroutine. We construct a proper coloring of a fine grid of [0, 1]^2 using the signs of the outputs of g. Namely, we set n = ⌈log(L/γ) + 2⌉, which defines a 2^n × 2^n grid over [0, 1]^2 that is indexed by {0, 1}^n × {0, 1}^n. Let g_η : [0, 1]^2 → [−1, 1]^2 be the function that the Turing machine C_g evaluates when the requested accuracy is η > 0. Now we can define the circuit C_l as follows:

C_l^1(i, j) = +1 if i = 0; −1 if i = 2^n − 1; +1 if g_{η,1}(i/(2^n − 1), j/(2^n − 1)) ≥ 0 and i ∉ {0, 2^n − 1}; −1 if g_{η,1}(i/(2^n − 1), j/(2^n − 1)) < 0 and i ∉ {0, 2^n − 1},

C_l^2(i, j) = +1 if j = 0; −1 if j = 2^n − 1; +1 if g_{η,2}(i/(2^n − 1), j/(2^n − 1)) ≥ 0 and j ∉ {0, 2^n − 1}; −1 if g_{η,2}(i/(2^n − 1), j/(2^n − 1)) < 0 and j ∉ {0, 2^n − 1},

where g_i is the i-th output coordinate of g. It is not hard then to observe that the coloring C_l is proper, i.e., it satisfies the boundary conditions, due to the fact that the image of M is always inside [0, 1]^2. Therefore the only possible solution to 2D-BiSperner with input C_l is a cell that contains all the colors {1−, 1+, 2−, 2+}. Let (i, j) be the bottom-left vertex of this cell, which we denote by R, namely

R = { x ∈ [0, 1]^2 | x_1 ∈ [i/(2^n − 1), (i + 1)/(2^n − 1)], x_2 ∈ [j/(2^n − 1), (j + 1)/(2^n − 1)] }.

Claim 6.3. Let η = γ/(2√2). There exists x ∈ R such that |g_1(x)| ≤ γ/(2√2) and y ∈ R such that |g_2(y)| ≤ γ/(2√2).

Proof of Claim 6.3. We prove the existence of x; the existence of y follows by an identical argument. If there exists a corner x of R such that g_1(x) is in the range [−η, η], then the claim follows. Suppose not. Since the cell R is panchromatic, the first color of one of the corners of R is 1− and the first color of another corner of R is 1+, so there exist corners x, x′ such that g_{η,1}(x) ≥ 0 and g_{η,1}(x′) ≤ 0. But we have that ‖g_η − g‖_∞ ≤ η. This, together with the fact that g_1(x) ∉ [−η, η] and g_1(x′) ∉ [−η, η], implies that g_1(x) ≥ 0 ≥ g_1(x′). Because of the L-Lipschitzness of g_1 and because the distance between x and x′ is at most √2 · γ/(4L), we conclude that |g_1(x) − g_1(x′)| ≤ γ/(2√2). Hence, due to the signs of g_1(x) and g_1(x′), we conclude that |g_1(x)| ≤ γ/(2√2). In the same way we can prove that |g_2(y)| ≤ γ/(2√2), and the claim follows.
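The construction in the proof can be sketched end to end: color each grid vertex by the signs of g = M − id (forcing the boundary colors of rules 2–5), then locate a panchromatic cell, whose vertices approximate a fixed point of M. The contraction map M below is an illustrative stand-in for the Turing machine C_M, we evaluate g exactly instead of through the approximations g_η, and we find the panchromatic cell by brute-force scan rather than through the PPAD machinery.

```python
import numpy as np

# Hypothetical demo of the reduction in Lemma 6.2.  M is an illustrative
# L-Lipschitz map (a contraction), not from the paper; its displacement
# field g = M - id supplies the colors, with boundary colors forced so
# that the coloring is proper (rules 2-5).

def M(x):
    return 0.5 * x + 0.25            # unique fixed point at (0.5, 0.5)

def g(x):
    return M(x) - x

def color(i, j, n):
    N = 2 ** n
    gx = g(np.array([i / (N - 1), j / (N - 1)]))
    c1 = +1 if i == 0 else -1 if i == N - 1 else (+1 if gx[0] >= 0 else -1)
    c2 = +1 if j == 0 else -1 if j == N - 1 else (+1 if gx[1] >= 0 else -1)
    return c1, c2

def find_panchromatic(n):
    """Brute-force scan for a cell whose corners carry all four colors."""
    N = 2 ** n
    for i in range(N - 1):
        for j in range(N - 1):
            cs = [color(i + di, j + dj, n) for di in (0, 1) for dj in (0, 1)]
            if ({c[0] for c in cs} == {-1, 1}
                    and {c[1] for c in cs} == {-1, 1}):
                return i, j

i, j = find_panchromatic(4)
z = np.array([i / 15, j / 15])       # bottom-left vertex of the found cell
print((i, j))                        # (7, 7): the cell straddling x = y = 0.5
print(np.linalg.norm(M(z) - z) < 0.05)   # True: z approximates the fixed point
```

With n = 4 the scan returns the unique cell whose corners straddle the zero of g in both coordinates, and the residual ‖M(z) − z‖ is on the order of the cell diameter, mirroring the role of γ in the proof.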
0, 1 } n both as a binary string and as a numberin ([ n − ] − ) and it is clear from the context which of the two we use. The latter is inaccurate for the cases where the vertex ( j ) belongs to either facets i = i = n −
1. Notice thatthe coloring in such vertices does not depend on the value of g η . However in case where the color of such a corner isnot consistent with the value of g η , i.e. g η ,1 ( j ) < C l ( j ) = | g ( j ) | ≤ η . This is dueto the fact that g ( j ) ≥ | g ( j ) − g η ( j ) | ≤ η . L -Lipschitzness of g we get that for every z ∈ R | g ( z ) − g ( x ) | ≤ L (cid:107) x − z (cid:107) ≤ √ · L · γ L = ⇒ | g ( z ) | ≤ γ √ | g ( z ) − g ( y ) | ≤ L (cid:107) y − z (cid:107) ≤ √ · L · γ L = ⇒ | g ( z ) | ≤ γ √ z , w it holds that (cid:107) z − w (cid:107) ≤ √ γ L which follows from the definition of the size of the grid. Therefore we have that (cid:107) g ( z ) (cid:107) ≤ γ andhence (cid:107) M ( z ) − z (cid:107) ≤ γ which implies that any point z ∈ R is a γ -approximate fixed point of M and the lemma follows.Now that we have established the PPAD -hardness of 2D-B i S perner we are ready to present ourmain result of this section which is a reduction from 2D-B i S perner to GDAF ixed P oint . We start with presenting a construction of a Lipschitz and smooth real-valued function f : [
0, 1 ] × [
0, 1 ] → R based on a given coloring circuit C l : {
0, 1 } n × {
0, 1 } n → {−
1, 1 } . Thenin Section 6.2.1 we will show that any solution to GDAF ixed P oint with input the representation C f of f is also a solution to the 2D-B i S perner problem with input C l . Constructing Lipschitzand smooth functions based on only local information is a surprisingly challenging task in high-dimensions as we will explain in detail in Section 7. Fortunately in the low-dimensional casethat we consider in this section the construction is much more simple and the main ideas of ourreduction are more clear.The basic idea of the construction of f consists in interpreting the coloring of a given pointin the grid as the directions of the gradient of f ( x , y ) with respect to the variables x , y and x , y respectively. More precisely, following the ideas in the proof of Lemma 6.2, we divide the [
0, 1 ] square in square-cells of length 1/ ( N − ) = ( n − ) where the corners of these cellscorrespond to vertices of the N × N grid of the 2D-B i S perner instance described by C l . When x is on a vertex of this grid, the first color of this vertex determines the direction of gradient withrespect to the variables x and y , while the second color of this vertex determines the directionof the gradient of the variables x and y . As an example, if x = ( x , x ) is on a vertex of the N × N grid, and the coloring of this vertex is ( − , 2 + ) , i.e. the output of C l on this vertex is ( − + ) , then we would like to have ∂ f ∂ x ( x , y ) ≥ ∂ f ∂ y ( x , y ) ≤ ∂ f ∂ x ( x , y ) ≤ ∂ f ∂ y ( x , y ) ≥ f locally close to ( x , y ) to be equal to f ( x , y ) = ( x − y ) − ( x − y ) .Similarly, if x is on a vertex of the N × N grid, and the coloring of this vertex is ( − , 2 − ) , i.e. theoutput of C l on this vertex is ( − − ) , then we would like to have ∂ f ∂ x ( x , y ) ≥ ∂ f ∂ y ( x , y ) ≤ ∂ f ∂ x ( x , y ) ≥ ∂ f ∂ y ( x , y ) ≤ f locally close to ( x , y ) to be equal to f ( x , y ) = ( x − y ) + ( x − y ) .23igure 5: The correspondence of the colors of the vertices of the N × N grid with the directionsof the gradient of the function f that we design.In Figure 5 we show pictorially the correspondence of the colors of the vertices of the grid withthe gradient of the function f that we design. As shown in the figure, any set of vertices thatshare at least one of the colors 1 + , 1 − , 2 + , 2 − , agree on the direction of the gradient with respectthe horizontal or the vertical axis. This observation is one of the main ingredients in the proof ofcorrectness of our reduction that we present later in this section.When x is not on a vertex of the N × N grid then our goal is to define f via interpolatingthe functions corresponding to the corners of the cell in which x belongs. 
The reason that thisinterpolation is challenging is that we need to make sure the following properties are satisfied (cid:46) the resulting function f is both Lipschitz and smooth inside every cell, (cid:46) the resulting function f is both Lipschitz and smooth even at the boundaries of every cell,where two differect cells stick together, (cid:46) no solution to the GDAF ixed P oint problem is created inside cells that are not solutions tothe 2D-B i S perner problem. In particular, it has to be true that if all the vertices of one cellagree on some color then the gradient of f inside that cell has large enough gradient in thecorresponding direction.For the low dimensional case, that we explore in this section, satisfying the first two propertiesis not a very difficult task, whereas for the third property we need to be careful and achievingthis property is the main technical contribution of this section. On the contrary, for the high-dimensional case that we explore in Section 7 even achieving the first two properties is verychallenging and technical.As we will see in Section 6.2.1, if we accomplish a construction of a function f with theaforementioned properties, then the fixed points of the projected Gradient Descent/Ascent canonly appear inside cells that have all of the colors { − , 1 + , 2 − , 2 + } at their corners. To see thisconsider a cell that misses some color, e.g. 1 + . Then all the corners of this cell have as firstcolor 1 − . Since f is defined as interpolation of the functions in the corners of the cells, with theaforementioned properties, inside that cell there is always a direction with respect to x and y for which the gradient is large enough. Hence any point inside that cell cannot be a fixed pointof the projected Gradient Descent/Ascent. Of course this example provides just an intuition ofour construction and ignores case where the cell is on the boundary of the grid. 
We provide adetailed explanation of this case in Section 6.2.1.The above neat idea needs some technical adjustments in order to work. At first, the inter-polation of the function in the interior of the cell must be smooth enough so that the resulting24unction is both Lipschitz and smooth. In order to satisfy this, we need to choose appropriatecoefficients of the interpolation that interpolate smoothly not only the value of the function butalso its derivatives. For this purpose we use the following smooth step function of order 1. Definition 6.4 (Smooth Step Function of Order 1) . We define S : [
0, 1 ] → [
0, 1 ] to be the smoothstep function of order S ( x ) = x − x . Observe that the following hold S ( ) = S ( ) = S (cid:48) ( ) =
0, and S (cid:48) ( ) = x it could be that the derivatives of these coefficients overpower the derivatives ofthe functions that we interpolate. In this case we could be potentially creating fixed points ofGradient Descent/Ascent even in non panchromatic squares. As we will see later the magnitudeof the derivatives from the interpolation coefficients depends on the differences x − y and x − y . Hence if we ensure that these differences are small then the derivatives of the interpolationcoefficients will have to remain small and hence they can never overpower the derivatives fromthe corners of every cell. This is the place in our reduction where we add the constraints A · ( x , y ) ≤ b that define the domain of the function f as we describe in Section 3.Now that we have summarized the main ideas of our construction we are ready for the formaldefinition of f based on the coloring circuit C l . Definition 6.5 (Continuous and Smooth Function from Colorings of 2D-Bi-Sperner) . Given abinary circuit C l : {
0, 1 } n × {
0, 1 } n → {−
1, 1 } , we define the function f C l : [
0, 1 ] × [
0, 1 ] → R asfollows. For any x ∈ [
0, 1 ] , let A = ( i A , j A ) , B = ( i B , j B ) , C = ( i C , j C ) , D = ( i D , j D ) be the verticesof the cell of the N (= n ) × N grid which contains x and x A , x B , x C and x C the correspondingpoints in the unit square [
0, 1 ] , i.e. x A = i A / ( n − ) , x A = j A / ( n − ) etc. Let also A bedown-left corner of this cell and B , C , D be the rest of the vertices in clockwise order, then wedefine f C l ( x , y ) = α ( x ) · ( y − x ) + α ( x ) · ( y − x ) where the coefficients α ( x ) , α ( x ) ∈ [ −
1, 1 ] are defined as follows α i ( x ) = S (cid:32) x C − x δ (cid:33) · S (cid:32) x C − x δ (cid:33) · C il ( A ) + S (cid:18) x D − x δ (cid:19) · S (cid:18) x − x D δ (cid:19) · C il ( B )+ S (cid:32) x − x A δ (cid:33) · S (cid:18) x − x A δ (cid:19) · C il ( C ) + S (cid:18) x − x B δ (cid:19) · S (cid:18) x B − x δ (cid:19) · C il ( D ) where δ (cid:44) ( N − ) = ( n − ) .In Figure 6 we present an example of the application of Definition 6.5 to a specific cell with somegiven coloring on the corners.An important property of the definition of the function f C l is that the coefficients used in thedefinition of α i have the following two properties S (cid:32) x C − x δ (cid:33) · S (cid:32) x C − x δ (cid:33) ≥ S (cid:18) x D − x δ (cid:19) · S (cid:18) x − x D δ (cid:19) ≥ S (cid:32) x − x A δ (cid:33) · S (cid:18) x − x A δ (cid:19) ≥ S (cid:18) x − x B δ (cid:19) · S (cid:18) x B − x δ (cid:19) ≥
0, and25igure 6: Example of the definition of the Lipschitz and smooth function f on some cell giventhe coloring on the corners of the cell. For details see Definition 6.5. S (cid:32) x C − x δ (cid:33) · S (cid:32) x C − x δ (cid:33) + S (cid:18) x D − x δ (cid:19) · S (cid:18) x − x D δ (cid:19) + S (cid:32) x − x A δ (cid:33) · S (cid:18) x − x A δ (cid:19) + S (cid:18) x − x B δ (cid:19) · S (cid:18) x B − x δ (cid:19) = f C l inside a cell is a smooth convex combination of the functions on thecorners of the cell, as is suggested from Figure 6. Of course there are many ways to define suchconvex combination but in our case we use the smooth step function S to ensure the Lipschitzcontinuous gradient of the overall function f C l . We prove this formally in the next lemma. Lemma 6.6.
Let f_Cl be the function defined based on a coloring circuit C_l, as per Definition 6.5. Then f_Cl is continuous and differentiable at any point (x, y) ∈ [0, 1]^4. Moreover, f_Cl is Θ(1/δ)-Lipschitz and Θ(1/δ^2)-smooth in the whole 4-dimensional hypercube [0, 1]^4, where δ = 1/(N − 1) = 1/(2^n − 1).

Proof. Clearly from Definition 6.5, f_Cl is differentiable at any point (x, y) ∈ [0, 1]^4 at which x lies in the strict interior of its respective cell. In this case the derivative with respect to x_1 is

∂f_Cl(x, y)/∂x_1 = ∂α_1(x)/∂x_1 · (y_1 − x_1) − α_1(x) + ∂α_2(x)/∂x_1 · (y_2 − x_2),

where for ∂α_1(x)/∂x_1 we have that

∂α_1(x)/∂x_1 = −(1/δ) · S′((x_{C,1} − x_1)/δ) · S((x_{C,2} − x_2)/δ) · C_l^1(A) − (1/δ) · S′((x_{D,1} − x_1)/δ) · S((x_2 − x_{D,2})/δ) · C_l^1(B) + (1/δ) · S′((x_1 − x_{A,1})/δ) · S((x_2 − x_{A,2})/δ) · C_l^1(C) + (1/δ) · S′((x_1 − x_{B,1})/δ) · S((x_{B,2} − x_2)/δ) · C_l^1(D).

Now, since max_{z ∈ [0,1]} |S′(z)| ≤ 6, we can conclude that |∂α_1(x)/∂x_1| ≤ 24/δ. Similarly we can prove that |∂α_2(x)/∂x_1| ≤ 24/δ, which combined with |α_1(x)| ≤ 1 gives |∂f_Cl(x, y)/∂x_1| ≤ O(1/δ). Using similar reasoning we can prove that |∂f_Cl(x, y)/∂x_2| ≤ O(1/δ) and that |∂f_Cl(x, y)/∂y_i| ≤ 1 for i = 1, 2. Hence ‖∇f_Cl(x, y)‖ ≤ O(1/δ).

The only thing missing to prove the Lipschitzness of f_Cl is its continuity on the boundaries of the cells of our subdivision. Suppose x lies on the boundary of some cell, e.g., let x lie on the edge (C, D) of one cell, which is the same as the edge (A′, B′) of the cell to its right. Since S(0) = 0 and S′(0) = S′(1) = 0, both cells yield the same values of α_i and ∂α_i(x)/∂x_1. Therefore the value of ∂f_Cl/∂x_1 remains the same no matter according to which cell it was calculated. As a result, f_Cl is differentiable with respect to x_1 even if x belongs to the boundary of its cell. Using the exact same reasoning for the rest of the variables, one can show that the function f_Cl is differentiable at any point (x, y) ∈ [0, 1]^4, and because of the aforementioned bound on the gradient ∇f_Cl we can conclude that f_Cl is O(1/δ)-Lipschitz.

Using very similar calculations, we can compute the closed forms of the second derivatives of f_Cl, and using the bounds |C_l^i(·)| ≤ 1, |S(·)| ≤ 1, |S′(·)| ≤ 6, and |S″(·)| ≤ 6, we can prove that each entry of the Hessian ∇^2 f_Cl(x, y) is bounded by O(1/δ^2), and thus ‖∇^2 f_Cl(x, y)‖ ≤ O(1/δ^2), which implies the Θ(1/δ^2)-smoothness of f_Cl.

In this section, we present and prove the exact polynomial-time construction of the instance of the problem GDAFixedPoint from an instance C_l of the problem 2D-BiSperner.

(+) Construction of Instance for Fixed Points of Gradient Descent/Ascent.
Our construction can be described via the following properties.

- The payoff function is the real-valued function f_Cl(x, y) from Definition 6.5.
- The domain is the polytope P(A, b) that we described in Section 3. The matrix A and the vector b have constant size and they are computed so that the following inequalities hold:

x_1 − y_1 ≤ ∆, y_1 − x_1 ≤ ∆, x_2 − y_2 ≤ ∆, and y_2 − x_2 ≤ ∆, (6.1)

where ∆ = δ^2/12 and δ = 1/(N − 1) = 1/(2^n − 1).
- The parameter α is set equal to ∆/3.
- The parameters G and L are set equal to the upper bounds on the Lipschitzness and the smoothness of f_Cl, respectively, that we derived in Lemma 6.6. Namely, we have that G = O(1/δ) = O(2^n) and L = O(1/δ^2) = O(2^{2n}).

The first thing that is simple to observe about the above reduction is that it runs in polynomial time with respect to the size of the circuit C_l, which is the input to the 2D-BiSperner problem that we started with. To see this, recall from the definition of GDAFixedPoint that our reduction needs to output: (1) a Turing machine C_{f_Cl} that computes the value and the gradient of the function f_Cl in time polynomial in the number of requested bits of accuracy; (2) the required scalars α, G, and L. For the first, we observe from the definition of f_Cl that it is actually a piecewise polynomial function with a closed form that depends only on the values of the circuit C_l at the corners of the corresponding cell. Since the size of C_l is the size of the input to 2D-BiSperner, we can easily construct a polynomial-time Turing machine that computes both the value and the gradient of the piecewise polynomial function f_Cl. Also, from the aforementioned description of the reduction we have that log(G), log(L), and log(1/α) are linear in n, and hence we can construct the binary representations of all these scalars in time O(n). The same is true for the coefficients of A and b, as we can see from their definition in (+). Hence we conclude that our reduction runs in time that is polynomial in the size of the circuit C_l.

The next thing to observe is that, according to Lemma 6.6, the function f_Cl is both G-Lipschitz and L-smooth, and hence the output of our reduction is a valid input for the promise problem GDAFixedPoint. So the last step to complete the proof of Theorem 6.1 is to prove that the vector x* of every solution (x*, y*) of GDAFixedPoint with input C_{f_Cl} lies in a cell that is either panchromatic or violates the rules of proper coloring; in either case we can find a solution to the 2D-BiSperner problem. This proves that our construction reduces 2D-BiSperner to GDAFixedPoint.

We prove this last statement in Lemma 6.8, but before that we need the following technical lemma, which is useful for arguing about solutions on the boundary of P(A, b).

Lemma 6.7.
Let C_l be an input to the 2D-BiSperner problem, let f_Cl be the corresponding G-Lipschitz and L-smooth function defined in Definition 6.5, and let P(A, b) be the polytope defined by (6.1). If (x*, y*) is any solution to the GDAFixedPoint problem with inputs α, G, L, C_{f_Cl}, A, and b defined in (+), then the following statements hold, where we recall that ∆ = δ^2/12. For i ∈ {1, 2}:

- If x*_i ∈ (2α, 1 − 2α) and x*_i ∈ (y*_i − ∆ + 2α, y*_i + ∆ − 2α), then |∂f_Cl(x*, y*)/∂x_i| ≤ α.
- If x*_i ≤ 2α or x*_i ≤ y*_i − ∆ + 2α, then ∂f_Cl(x*, y*)/∂x_i ≥ −α.
- If x*_i ≥ 1 − 2α or x*_i ≥ y*_i + ∆ − 2α, then ∂f_Cl(x*, y*)/∂x_i ≤ α.

The symmetric statements for y*_i hold. For i ∈ {1, 2}:

- If y*_i ∈ (2α, 1 − 2α) and y*_i ∈ (x*_i − ∆ + 2α, x*_i + ∆ − 2α), then |∂f_Cl(x*, y*)/∂y_i| ≤ α.
- If y*_i ≤ 2α or y*_i ≤ x*_i − ∆ + 2α, then ∂f_Cl(x*, y*)/∂y_i ≤ α.
- If y*_i ≥ 1 − 2α or y*_i ≥ x*_i + ∆ − 2α, then ∂f_Cl(x*, y*)/∂y_i ≥ −α.

Proof. For this proof it is convenient to define x̂ = x* − ∇_x f_Cl(x*, y*), K(y*) = { x | (x, y*) ∈ P(A, b) }, and z = Π_{K(y*)} x̂.

We first consider the first statement, so for the sake of contradiction let us assume that x*_i ∈ (2α, 1 − 2α), that x*_i ∈ (y*_i − ∆ + 2α, y*_i + ∆ − 2α), and that |∂f_Cl(x*, y*)/∂x_i| > α. Due to the definition of P(A, b) in (6.1), the set K(y*) is an axis-aligned box of R^2, and hence the projection of any vector x onto K(y*) can be implemented independently for every coordinate x_i of x. Therefore, if it happens that x̂_i ∈ (0, 1) ∩ (y*_i − ∆, y*_i + ∆), then it holds that x̂_i = z_i. Now from the definition of x̂_i and z_i, and the fact that K(y*) is an axis-aligned box, we get that |x*_i − z_i| = |x*_i − x̂_i| = |∂f_Cl(x*, y*)/∂x_i| > α, which contradicts the fact that (x*, y*) is a solution to the problem GDAFixedPoint. On the other hand, if x̂_i ∉ (y*_i − ∆, y*_i + ∆) ∩ (0, 1), then z_i has to be on the boundary of K(y*), and hence z_i has to be equal to either 0, or 1, or y*_i − ∆, or y*_i + ∆. In any of these cases, since we assumed that x*_i ∈ (2α, 1 − 2α) and that x*_i ∈ (y*_i − ∆ + 2α, y*_i + ∆ − 2α), we conclude that |x*_i − z_i| > α, and hence we again get a contradiction with the fact that (x*, y*) is a solution to the problem GDAFixedPoint. Hence we have that |∂f_Cl(x*, y*)/∂x_i| ≤ α.

For the second case, we assume for the sake of contradiction that x*_i ≤ 2α and ∂f_Cl(x*, y*)/∂x_i < −α. These imply that x̂_i > x*_i + α and that z_i = min(y*_i + ∆, x̂_i, 1) > min(∆, x̂_i, 1) ≥ min(3α, x*_i + α). As a result, |x*_i − z_i| = z_i − x*_i > min(3α, x*_i + α) − x*_i, which is greater than α. The latter contradicts the assumption that (x*, y*) is a solution to the GDAFixedPoint problem. Also, if we assume that x*_i ≤ y*_i − ∆ + 2α, then using the same reasoning we get that z_i = min(x̂_i, y*_i + ∆, 1). From this we can again prove that |x*_i − z_i| > α, which contradicts the fact that (x*, y*) is a solution to GDAFixedPoint.

The third case can be proved using the same arguments as the second case. Then, using the corresponding arguments, we can prove the corresponding statements for the y variables.

We are now ready to prove that solutions of GDAFixedPoint can only occur in cells that are either panchromatic or violate the boundary conditions of a proper coloring. For convenience, in the rest of this section we define R(x) to be the cell of the 2^n × 2^n grid that contains x.
R(x) = [i/(2^n − 1), (i + 1)/(2^n − 1)] × [j/(2^n − 1), (j + 1)/(2^n − 1)], (6.2)

for i, j such that x_1 ∈ [i/(2^n − 1), (i + 1)/(2^n − 1)] and x_2 ∈ [j/(2^n − 1), (j + 1)/(2^n − 1)]. If there are multiple i, j that satisfy the above condition, then we choose R(x) to be the cell that corresponds to the lexicographically first pair (i, j) satisfying the condition. We also define the corners R_c(x) of R(x) as

R_c(x) = { (i, j), (i, j + 1), (i + 1, j), (i + 1, j + 1) }, (6.3)

where R(x) = [i/(2^n − 1), (i + 1)/(2^n − 1)] × [j/(2^n − 1), (j + 1)/(2^n − 1)].

Lemma 6.8.
Let C_l be an input to the 2D-BiSperner problem, let f_{C_l} be the corresponding G-Lipschitz and L-smooth function defined in Definition 6.5, and let P(A, b) be the polytope defined by (6.1). If (x⋆, y⋆) is any solution to the GDAFixedPoint problem with inputs α, G, L, C_{f_{C_l}}, A, and b defined in (+), then none of the following statements hold for the cell R(x⋆).

1. x⋆₁ ≥ 1/(n−1) and, for all v ∈ R_c(x⋆), it holds that C¹_l(v) = −1.
2. x⋆₁ ≤ (n−2)/(n−1) and, for all v ∈ R_c(x⋆), it holds that C¹_l(v) = +1.
3. x⋆₂ ≥ 1/(n−1) and, for all v ∈ R_c(x⋆), it holds that C²_l(v) = −1.
4. x⋆₂ ≤ (n−2)/(n−1) and, for all v ∈ R_c(x⋆), it holds that C²_l(v) = +1.

Proof. We prove that there is no solution (x⋆, y⋆) of GDAFixedPoint that satisfies statement 1.; the fact that (x⋆, y⋆) cannot satisfy the other statements follows similarly. It is convenient for us to define

x̂ = x⋆ − ∇_x f_{C_l}(x⋆, y⋆),  K(y⋆) = { x | (x, y⋆) ∈ P(A, b) },  z = Π_{K(y⋆)} x̂, and
ŷ = y⋆ + ∇_y f_{C_l}(x⋆, y⋆),  K(x⋆) = { y | (x⋆, y) ∈ P(A, b) },  w = Π_{K(x⋆)} ŷ.

For the sake of contradiction, we assume that there exists a solution (x⋆, y⋆) such that x⋆₁ ≥ 1/(n−1) and, for all v ∈ R_c(x⋆), it holds that C¹_l(v) = −1. Using the fact that the first color of all the corners of R(x⋆) is 1−, we will prove that (1) ∂f_{C_l}(x⋆, y⋆)/∂x₁ ≥ 1/2 and (2) ∂f_{C_l}(x⋆, y⋆)/∂y₁ = −1. Let R(x⋆) = [ i/(n−1), (i+1)/(n−1) ] × [ j/(n−1), (j+1)/(n−1) ]; then, since all the corners v ∈ R_c(x⋆) have C¹_l(v) = −1, from Definition 6.5 we have that

f_{C_l}(x⋆, y⋆) = (x⋆₂ − y⋆₂) · α₂(x⋆)
  − (x⋆₁ − y⋆₁) · S((x_{C,1} − x⋆₁)/δ) · S((x_{C,2} − x⋆₂)/δ) · C¹_l(i, j)
  − (x⋆₁ − y⋆₁) · S((x_{D,1} − x⋆₁)/δ) · S((x⋆₂ − x_{D,2})/δ) · C¹_l(i, j+1)
  − (x⋆₁ − y⋆₁) · S((x⋆₁ − x_{A,1})/δ) · S((x⋆₂ − x_{A,2})/δ) · C¹_l(i+1, j+1)
  − (x⋆₁ − y⋆₁) · S((x⋆₁ − x_{B,1})/δ) · S((x_{B,2} − x⋆₂)/δ) · C¹_l(i+1, j),

where (x_{A,1}, x_{A,2}) = (i/(n−1), j/(n−1)), (x_{B,1}, x_{B,2}) = (i/(n−1), (j+1)/(n−1)), (x_{C,1}, x_{C,2}) = ((i+1)/(n−1), (j+1)/(n−1)), and (x_{D,1}, x_{D,2}) = ((i+1)/(n−1), j/(n−1)). If we differentiate this with respect to y₁, we immediately get that ∂f_{C_l}(x⋆, y⋆)/∂y₁ = −1. On the other hand, if we differentiate with respect to x₁, we get

∂f_{C_l}(x⋆, y⋆)/∂x₁ = 1
  + (x⋆₁ − y⋆₁) · (1/δ) · S′((x_{C,1} − x⋆₁)/δ) · S((x_{C,2} − x⋆₂)/δ) · C¹_l(i, j)
  + (x⋆₁ − y⋆₁) · (1/δ) · S′((x_{D,1} − x⋆₁)/δ) · S((x⋆₂ − x_{D,2})/δ) · C¹_l(i, j+1)
  − (x⋆₁ − y⋆₁) · (1/δ) · S′((x⋆₁ − x_{A,1})/δ) · S((x⋆₂ − x_{A,2})/δ) · C¹_l(i+1, j+1)
  − (x⋆₁ − y⋆₁) · (1/δ) · S′((x⋆₁ − x_{B,1})/δ) · S((x_{B,2} − x⋆₂)/δ) · C¹_l(i+1, j)
  ≥ 1 − Θ(1) · |x⋆₁ − y⋆₁| / δ ≥ 1 − Θ(Δ/δ) ≥ 1/2,   (6.4)

where we used that |S′(·)| is bounded by a constant and that, for all (x, y) ∈ P(A, b), it holds that |x₁ − y₁| ≤ Δ. Hence we have established that if x⋆₁ ≥ 1/(n−1) and, for all v ∈ R_c(x⋆), it holds that C¹_l(v) = −1, then ∂f_{C_l}(x⋆, y⋆)/∂x₁ ≥ 1/2 and ∂f_{C_l}(x⋆, y⋆)/∂y₁ = −1. Now it is easy to see that the only way to satisfy both ∂f_{C_l}(x⋆, y⋆)/∂x₁ ≥ 1/2 and |z₁ − x⋆₁| ≤ α is that either x⋆₁ ≤ α or x⋆₁ ≤ y⋆₁ − Δ + α. The first case is excluded by the assumption in the first statement of our lemma and our choice of α = Δ/3 < 1/(n−1); thus it holds that x⋆₁ ≤ y⋆₁ − Δ + α. But then we can use the third case for the y variables of Lemma 6.7 and we get that ∂f_{C_l}(x⋆, y⋆)/∂y₁ ≥ −α, which cannot be true since we proved that ∂f_{C_l}(x⋆, y⋆)/∂y₁ = −1. Therefore we have a contradiction and the first statement of the lemma holds. Using the same reasoning we prove the rest of the statements.
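Throughout these proofs, the points z = Π_{K(y⋆)} x̂ and w = Π_{K(x⋆)} ŷ are easy to compute: for a fixed y⋆, the slice K(y⋆) of the polytope is a coordinate-wise interval, so the Euclidean projection is a clip. The following sketch is our own illustration, not code from the paper; the gradients and the names delta_cap, alpha are stand-ins for Δ and α. It checks the approximate-fixed-point condition ‖z − x⋆‖∞ ≤ α, ‖w − y⋆‖∞ ≤ α used above:

```python
import numpy as np

def gda_fixed_point_residual(x, y, grad_x, grad_y, delta_cap, box=(0.0, 1.0)):
    """Residual of the projected Gradient Descent/Ascent update at (x, y).

    Under the constraints box[0] <= x_i <= box[1] and |x_i - y_i| <= delta_cap,
    the feasible set for x given y (and vice versa) is a coordinate-wise
    interval, so the Euclidean projection is a coordinate-wise clip.
    """
    lo, hi = box
    x_hat = x - grad_x  # descent step of the min player
    y_hat = y + grad_y  # ascent step of the max player
    z = np.clip(x_hat, np.maximum(lo, y - delta_cap), np.minimum(hi, y + delta_cap))
    w = np.clip(y_hat, np.maximum(lo, x - delta_cap), np.minimum(hi, x + delta_cap))
    return max(np.max(np.abs(z - x)), np.max(np.abs(w - y)))

def is_gda_fixed_point(x, y, grad_x, grad_y, delta_cap, alpha):
    """True iff (x, y) is an alpha-approximate fixed point of projected GDA."""
    return gda_fixed_point_residual(x, y, grad_x, grad_y, delta_cap) <= alpha
```

For instance, at x = y = 0.5 with gradient (+1, −1) and Δ = 0.3, the descent step is clipped back to the constraint boundary and the residual is Δ, so the point is rejected for any α < Δ.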
Remark. The computations presented in (6.4) are the precise point where an attempt to prove the hardness of minimization problems would fail. In particular, if our goal were to construct a hard minimization instance, then the function f_{C_l} would need to have the terms x_i + y_i instead of x_i − y_i, so that the fixed points of gradient descent coincide with approximate local minima of f_{C_l}. In that case we cannot lower bound the gradient in (6.4) from below by 1/2, because the term |x⋆₁ + y⋆₁| will be the dominant one, and hence the sign of the derivative can change depending on the value of |x⋆₁ + y⋆₁|. For a more intuitive explanation of why we cannot prove hardness of minimization problems, we refer to the Introduction, Section 1.2.

We now have all the ingredients to prove Theorem 6.1.

Proof of Theorem 6.1.
Let (x⋆, y⋆) be a solution to the GDAFixedPoint instance that we construct based on the instance C_l of 2D-BiSperner, and let R(x⋆) be the cell that contains x⋆. If the corners R_c(x⋆) contain all the colors 1−, 1+, 2−, 2+, then we have a solution to the 2D-BiSperner instance and Theorem 6.1 follows. Otherwise, there is at least one color missing from R_c(x⋆); let us assume without loss of generality that one of the missing colors is 1−, hence for every v ∈ R_c(x⋆) it holds that C¹_l(v) = +1. Now, from Lemma 6.8, the only way for this to happen is that x⋆₁ > (n−2)/(n−1), which implies that R_c(x⋆) contains at least one corner of the form v = (n−1, j). But we have assumed that C¹_l(v) = +1, hence v is a violation of the proper coloring rules and hence a solution to the 2D-BiSperner instance. We can prove the corresponding statement if any other color from 1+, 2−, 2+ is missing. Finally, we observe that the function that we define has range [−2, 2], and hence Theorem 6.1 follows.

Although the results of Section 6 are quite indicative of the computational complexity of GDAFixedPoint and LR-LocalMinMax, we have not yet excluded the possibility of algorithms running in poly(d, G, L, 1/ε) time. In this section we present a significantly more challenging, high-dimensional version of the reduction that we presented in Section 6. The advantage of this reduction is that it rules out the existence even of algorithms running in poly(d, G, L, 1/ε) steps, unless FP = PPAD; for details see Theorem 4.4. An easy consequence of our result is an unconditional lower bound in the black-box model, stating that the running time of any algorithm for LR-LocalMinMax that has only oracle access to f and ∇f has to be exponential in d, or G, or L, or 1/ε; for details we refer to Theorem 4.5 and Section 9.

The main reduction that we use to prove Theorem 4.4 is from the high-dimensional generalization of the problem 2D-BiSperner, which we call HighD-BiSperner, to GDAFixedPoint. Our reduction in this section resembles some of the ideas of the reductions of Section 6, but it faces many additional significant technical difficulties. The main difficulty is how to define a function on a d-dimensional simplex that (1) is both Lipschitz and smooth, and (2) interpolates between some fixed functions at the d + 1 vertices of the simplex.

We start by presenting the High
D-BiSperner problem. The HighD-BiSperner problem is a straightforward d-dimensional generalization of the 2D-BiSperner problem that we defined in Section 6. Assume that we have a d-dimensional grid N × ··· (d times) ··· × N. We assign to every vertex of this grid a sequence of d colors, and we say that a coloring is proper if the following rules are satisfied.

1. The ith color of every vertex is either the color i+ or the color i−.
2. All the vertices whose ith coordinate is 0, i.e. they are at the lower boundary of the ith direction, should have the ith color equal to i+.
3. All the vertices whose ith coordinate is N − 1, i.e. they are at the higher boundary of the ith direction, should have the ith color equal to i−.

Using proof ideas similar to those in the proof of the original Sperner's Lemma, it is not hard to prove via a combinatorial argument that in every proper coloring of a d-dimensional grid there exists a cubelet of the grid at whose vertices all the 2·d colors {1−, 1+, ..., d−, d+} appear; we call such a cubelet panchromatic. In the HighD-BiSperner problem we are asked to find such a cubelet, or a violation of the rules of proper coloring. As in Section 6.1, we do not present this combinatorial argument in this paper, since the totality of the HighD-BiSperner problem will follow from our reduction from HighD-BiSperner to GDAFixedPoint and our proofs in Section 5 that establish the totality of GDAFixedPoint.

As in the case of 2D-BiSperner, in order to formally define the computational problem HighD-BiSperner we need to define the coloring of the d-dimensional grid N × ··· × N in a succinct way. The fundamental difference compared to the definition of 2D-BiSperner is that for HighD-BiSperner we assume that N is only polynomially large. This difference will enable us to exclude algorithms for GDAFixedPoint that run in time poly(d, 1/α, G, L).

The input to HighD-BiSperner is a coloring, given via a binary circuit C_l that takes as input the coordinates of a vertex of the grid and outputs the sequence of colors that are used to color this vertex. Each one of the d coordinates is given via the binary representation of a number in [N] − 1. Setting N = 2^ℓ, where ℓ is logarithmic in d, we have that the representation of each coordinate is a member of {0, 1}^ℓ. In the rest of the section we abuse notation and use a coordinate i ∈ {0, 1}^ℓ both as a binary string and as a number in [2^ℓ] − 1. The output of C_l should be a sequence of d colors, where the ith member of this sequence is one of the colors {i−, i+}. We represent this sequence as a member of {−1, +1}^d, where the ith coordinate refers to the choice of i− or i+.

In the definition of the computational problem HighD-BiSperner the input is a circuit C_l, as described above. As we discussed, in the HighD-BiSperner problem we are asking for a panchromatic cubelet of the grid. One issue with this high-dimensional setting is that, in order to check whether a cubelet is panchromatic or not, we have to query all the 2^d corners of this cubelet, which makes the verification problem inefficient, and hence containment in the class PPAD cannot be proved. For this reason, as a solution to HighD-BiSperner we ask not just for a cubelet but for 2·d vertices v^(1), ..., v^(d), u^(1), ..., u^(d), not necessarily different, such that they all belong to the same cubelet, the ith output of C_l with input v^(i) is −1, i.e. corresponds to the color i−, and the ith output of C_l with input u^(i) is +1, i.e. corresponds to the color i+. This way we have a certificate of size 2·d that can be checked in polynomial time. Another possible solution of HighD-BiSperner is a vertex whose coloring violates the aforementioned boundary conditions 2. and 3. of a proper coloring. For notational convenience we refer to the ith coordinate of C_l by C^i_l. The formal definition of HighD-BiSperner is then the following.

HighD-BiSperner.
Input: A boolean circuit C_l : {0,1}^ℓ × ··· × {0,1}^ℓ (d times) → {−1, 1}^d.
Output: One of the following:
1. Two sequences of d vertices v^(1), ..., v^(d) and u^(1), ..., u^(d), not necessarily different, all belonging to the same cubelet, with v^(i), u^(i) ∈ ({0,1}^ℓ)^d, such that C^i_l(v^(i)) = −1 and C^i_l(u^(i)) = +1 for all i ∈ [d].
2. A vertex v ∈ ({0,1}^ℓ)^d with v_i = 0 for some i ∈ [d] such that C^i_l(v) = −1.
3. A vertex v ∈ ({0,1}^ℓ)^d with v_i = N − 1 for some i ∈ [d] such that C^i_l(v) = +1.

We prove the PPAD-hardness of HighD-BiSperner in Theorem 7.2. To prove this we use a stronger version of the Brouwer problem, called γ-SuccinctBrouwer, that was first introduced in [Rub16].

γ-SuccinctBrouwer.
Input: A polynomial-time Turing machine C_M evaluating a 1/γ-Lipschitz continuous vector-valued function M : [0, 1]^d → [0, 1]^d.
Output: A point x⋆ ∈ [0, 1]^d such that ‖M(x⋆) − x⋆‖ ≤ γ.

Theorem 7.1 ([Rub16]). γ-SuccinctBrouwer is PPAD-complete for any fixed constant γ > 0.

Theorem 7.2.
There is a polynomial-time reduction from any instance of the γ-SuccinctBrouwer problem to an instance of HighD-BiSperner with N = Θ(d/γ²).

Proof. Consider the function g(x) = M(x) − x. Since M is 1/γ-Lipschitz, g : [0, 1]^d → [−1, 1]^d is also (1 + 1/γ)-Lipschitz. Additionally, g can easily be computed via a polynomial-time Turing machine C_g that uses C_M as a subroutine. We construct the coloring sequence of every vertex of a d-dimensional grid with N = Θ(d/γ²) points in every direction using g. Let g_η : [0, 1]^d → [−1, 1]^d be the function that the Turing machine C_g evaluates when the requested accuracy is η > 0. For every vertex v = (v₁, ..., v_d) ∈ ([N] − 1)^d of the d-dimensional grid, its coloring sequence C_l(v) ∈ {−1, 1}^d is constructed as follows: for each coordinate j = 1, ..., d,

C^j_l(v) = +1 if v_j = 0, −1 if v_j = N − 1, and sign( g_{η,j}( v₁/(N−1), ..., v_d/(N−1) ) ) otherwise,

where sign : [−1, 1] → {−1, 1} is the sign function and g_{η,j}(·) is the jth coordinate of g_η.

Observe that, by the above construction, any vertex v with v_j = 0 has C^j_l(v) = +1 and any vertex with v_j = N − 1 has C^j_l(v) = −1, hence there are no vertices in the grid satisfying the possible outputs 2. or 3. of the HighD-BiSperner problem. Thus the only possible solution of the above HighD-BiSperner instance is a sequence of 2d vertices v^(1), ..., v^(d), u^(1), ..., u^(d) on the same cubelet that certify that the corresponding cubelet is panchromatic, as per possible output 1. of the HighD-BiSperner problem. We next prove that, for any vertex v of that cubelet, it holds that

| g_j( v/(N−1) ) | ≤ Θ( √d / (γN) ) for all coordinates j = 1, ..., d.

Let v be any vertex in the same cubelet as the output vertices v^(1), ..., v^(d), u^(1), ..., u^(d). From the guarantees on the colors of the sequences v^(1), ..., v^(d), u^(1), ..., u^(d), we have that either C^j_l(v) · C^j_l(v^(j)) = −1 or C^j_l(v) · C^j_l(u^(j)) = −1; let v̂^(j) be whichever of v^(j), u^(j) has jth color with product −1 with C^j_l(v). Now let η = √d/(γN). If g_j( v/(N−1) ) ∈ [−η, η], then the wanted inequality follows. On the other hand, if g_j( v/(N−1) ) ∉ [−η, η], then we use the fact that ‖ g( v/(N−1) ) − g_η( v/(N−1) ) ‖∞ ≤ η and that, from the definition of the colors, either g_{η,j}( v/(N−1) ) ≥ 0 > g_{η,j}( v̂^(j)/(N−1) ) or g_{η,j}( v/(N−1) ) < 0 ≤ g_{η,j}( v̂^(j)/(N−1) ). Hence g_j( v̂^(j)/(N−1) ) is within η of a value whose sign is opposite to that of g_j( v/(N−1) ), and therefore

| g_j( v/(N−1) ) | ≤ | g_j( v/(N−1) ) − g_j( v̂^(j)/(N−1) ) | + η ≤ (1 + 1/γ) · ‖ v/(N−1) − v̂^(j)/(N−1) ‖ + η ≤ Θ( √d / (γN) ),

where in the second inequality we used the (1 + 1/γ)-Lipschitzness of g, and in the third that two vertices of the same cubelet are at distance at most √d/(N−1). As a result, the point v̂ = v/(N−1) ∈ [0, 1]^d satisfies ‖M(v̂) − v̂‖ ≤ Θ( d/(γN) ), and thus, if we pick N = Θ(d/γ²), then any vertex v of the panchromatic cubelet is a solution for γ-SuccinctBrouwer.

Now that we have established the PPAD-hardness of High
D-BiSperner, we are ready to present the main result of this section, which is a reduction from the problem HighD-BiSperner to the problem GDAFixedPoint, with the additional constraints that the scalars α, G, L in the input satisfy 1/α = poly(d), G = poly(d), and L = poly(d).

Given the binary circuit C_l : ([N] − 1)^d → {−1, +1}^d that is an instance of HighD-BiSperner, we construct a G-Lipschitz and L-smooth function f_{C_l} : [0, 1]^d × [0, 1]^d → ℝ. To do so, we divide the [0, 1]^d hypercube into cubelets of side length δ = 1/(N−1). The corners of these cubelets have coordinates that are integer multiples of δ = 1/(N−1), and we call them vertices. Each vertex can be represented by a vector v = (v₁, ..., v_d) ∈ ([N] − 1)^d and admits a coloring sequence defined by the boolean circuit C_l : ([N] − 1)^d → {−1, +1}^d. For every x ∈ [0, 1]^d, we use R(x) to denote the cubelet that contains x; formally,

R(x) = [ c₁/(N−1), (c₁+1)/(N−1) ] × ··· × [ c_d/(N−1), (c_d+1)/(N−1) ],

where c ∈ ([N−1] − 1)^d is such that x ∈ [ c₁/(N−1), (c₁+1)/(N−1) ] × ··· × [ c_d/(N−1), (c_d+1)/(N−1) ]; if there are multiple corners c that satisfy this condition, then we choose R(x) to be the cell that corresponds to the lexicographically first such c. We also define R_c(x) to be the set of vertices that are corners of the cubelet R(x), namely

R_c(x) = { c₁, c₁+1 } × ··· × { c_d, c_d+1 },

where c ∈ ([N−1] − 1)^d is such that R(x) = [ c₁/(N−1), (c₁+1)/(N−1) ] × ··· × [ c_d/(N−1), (c_d+1)/(N−1) ]. Every y that belongs to the cubelet R(x) can be written as a convex combination of the vectors v/(N−1) with v ∈ R_c(x). The value of the function f_{C_l}(x, y) that we construct in this section is determined by the coloring sequences C_l(v) of the vertices v ∈ R_c(x). One of the main challenges that we face, though, is that the size of R_c(x) is 2^d; hence, if we want to be able to compute the value of f_{C_l}(x, y) efficiently, we have to find a consistent rule to pick a small subset of the vertices of R_c(x) whose coloring sequences suffice to define the function value f_{C_l}(x, y). Although there are traditional ways to overcome this difficulty using the canonical simplicial subdivision of the cubelet R(x), these techniques lead only to functions that are continuous and Lipschitz; they are not enough to guarantee continuity of the gradient, and hence the resulting functions are not smooth. The problem of finding a computationally efficient way to define a continuous function as an interpolation of some fixed function at the corners of a cubelet, so that the resulting function is both Lipschitz and smooth, is surprisingly difficult to solve.
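To make the difficulty concrete: the two-dimensional construction of Section 6 blends the four corner colors of a cell using products of a smooth step S, and the direct d-dimensional analogue of such a product blend touches all 2^d corners, which is exponential in d. The sketch below is our own illustration, not the paper's construction; the cubic smooth step used here is a stand-in for the S of Definition 6.5:

```python
import numpy as np

def S(t):
    """A smooth step: S(t) = 0 for t <= 0, S(t) = 1 for t >= 1,
    continuously differentiable in between (stand-in for Definition 6.5's S)."""
    t = np.clip(t, 0.0, 1.0)
    return 3 * t**2 - 2 * t**3

def smooth_bilinear(x1, x2, corner_vals, delta):
    """Blend the four corner values of the cell [0, delta] x [0, delta]
    containing (x1, x2); corner_vals[(i, j)] for i, j in {0, 1}.
    In d dimensions the analogous product blend touches 2**d corners."""
    s1, s2 = S(x1 / delta), S(x2 / delta)
    return ((1 - s1) * (1 - s2) * corner_vals[(0, 0)]
            + (1 - s1) * s2 * corner_vals[(0, 1)]
            + s1 * (1 - s2) * corner_vals[(1, 0)]
            + s1 * s2 * corner_vals[(1, 1)])
```

The blend agrees with the prescribed values at the corners and is differentiable everywhere, but evaluating it requires every corner of the cell; the SEIC coefficients introduced next achieve a comparable interpolation while reading only d + 1 corners.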
For this reason we introduce in this section the smooth and efficient interpolation coefficients (SEIC), which, as we will see in Section 7.2.2, are the main technical tool to implement such an interpolation. Our novel interpolation coefficients are of independent interest, and we believe that they will serve as a main technical tool for proving other hardness results in continuous optimization in the future.

In this section we only give a high-level description of the smooth and efficient interpolation coefficients via the properties that we use in Section 7.2.2 to define the function f_{C_l}. The actual construction of the coefficients is very challenging and technical, and hence we postpone a detailed exposition to Section 8.

Definition 7.3 (Smooth and Efficient Interpolation Coefficients). For every N ∈ ℕ we define the set of smooth and efficient interpolation coefficients (SEIC) as the family of functions, called coefficients,

I_{d,N} = { P_v : [0, 1]^d → ℝ | v ∈ ([N] − 1)^d }

with the following properties.

(A) For all vertices v ∈ ([N] − 1)^d, the coefficient P_v(x) is a twice-differentiable function and satisfies
  • | ∂P_v(x)/∂x_i | ≤ Θ(d/δ).
  • | ∂²P_v(x)/∂x_i∂x_ℓ | ≤ Θ(d/δ).

(B) For all v ∈ ([N] − 1)^d, it holds that P_v(x) ≥ 0 and that Σ_{v ∈ ([N]−1)^d} P_v(x) = Σ_{v ∈ R_c(x)} P_v(x) = 1, for all x ∈ [0, 1]^d.

(C) For all x ∈ [0, 1]^d, it holds that all but d + 1 of the coefficients P_v ∈ I_{d,N} satisfy P_v(x) = 0, ∇P_v(x) = 0, and ∇²P_v(x) = 0. We denote by R⁺(x) the set of the d + 1 vertices whose coefficients may be non-zero at x. Furthermore, it holds that R⁺(x) ⊆ R_c(x), and given x we can compute the set R⁺(x) in time poly(d).

(D) For all x ∈ [0, 1]^d, if x_i ≤ 1/(N−1) for some i ∈ [d], then there exists v ∈ R⁺(x) such that v_i = 0. Respectively, if x_i ≥ 1 − 1/(N−1), then there exists v ∈ R⁺(x) such that v_i = N − 1.

In words, the properties (A) – (D) can be expressed as follows.

(A) – The coefficients P_v are both Lipschitz and smooth, with Lipschitzness and smoothness parameters that depend polynomially on d and N = 1/δ + 1.
(B) – The coefficients P_v(x) define a convex combination of the vertices in R_c(x).
(C) – For every x ∈ [0, 1]^d, out of the N^d coefficients P_v, only d + 1 can be non-zero at x. Moreover, given x ∈ [0, 1]^d, we can identify these d + 1 coefficients efficiently.
(D) – For every x ∈ [0, 1]^d that lies in a cubelet touching the boundary, at least one of the vertices in R⁺(x) is on the boundary of the continuous hypercube [0, 1]^d.

In Section 10, in the proof of Theorem 10.4, we present a simple application of the existence of the SEIC coefficients: proving simple black-box oracle lower bounds for the global minimization problem.

Based on the existence of these coefficients, we are now ready to define the function f_{C_l}, which is the main construction of our reduction. In this section our goal is to formally define the function f_{C_l} and prove its Lipschitzness and smoothness properties in Lemma 7.5.

Definition 7.4 (Continuous and Smooth Function from Colorings of Bi-Sperner). Given a binary circuit C_l : ([N] − 1)^d → {−1, 1}^d, we define the function f_{C_l} : [0, 1]^d × [0, 1]^d → ℝ as follows:

f_{C_l}(x, y) = Σ_{j=1}^d (x_j − y_j) · α_j(x), where α_j(x) = − Σ_{v ∈ ([N]−1)^d} P_v(x) · C^j_l(v),

and the P_v are the coefficients defined in Definition 7.3.

We first prove that the function f_{C_l} constructed in Definition 7.4 is G-Lipschitz and L-smooth for appropriately selected parameters G, L that are polynomial in the dimension d and in the discretization parameter N. We use this property to establish that f_{C_l} is a valid input to the promise problem GDAFixedPoint.

Lemma 7.5.
The function f_{C_l} of Definition 7.4 is O(d/δ)-Lipschitz and O(d/δ)-smooth.

Proof. Taking the derivatives with respect to x_i and y_i, and using property (B) of the coefficients P_v, we get the following relations:

∂f_{C_l}(x, y)/∂x_i = Σ_{j=1}^d (x_j − y_j) · ∂α_j(x)/∂x_i + α_i(x)  and  ∂f_{C_l}(x, y)/∂y_i = −α_i(x),

where α_i(x) = − Σ_{v ∈ ([N]−1)^d} P_v(x) · C^i_l(v) and ∂α_j(x)/∂x_i = − Σ_{v ∈ ([N]−1)^d} (∂P_v(x)/∂x_i) · C^j_l(v).

Now, by property (C) of Definition 7.3, there are at most d + 1 vertices v of R_c(x) with ∇P_v(x) ≠ 0. If we also use property (A), we get |∂α_j(x)/∂x_i| ≤ Θ(d/δ), and using property (B) we get |α_i(x)| ≤ 1. Thus |∂f_{C_l}(x, y)/∂x_i| ≤ Θ(d/δ) and |∂f_{C_l}(x, y)/∂y_i| ≤ 1. Therefore we can conclude that ‖∇f_{C_l}(x, y)‖ ≤ Θ(d/δ), which proves that the function f_{C_l} is Lipschitz continuous with Lipschitz constant Θ(d/δ).

To prove the smoothness of f_{C_l}, we use property (B) of Definition 7.3 and we have

∂²f_{C_l}(x, y)/∂x_i∂x_ℓ = Σ_{j=1}^d (x_j − y_j) · ∂²α_j(x)/∂x_i∂x_ℓ + ∂α_ℓ(x)/∂x_i + ∂α_i(x)/∂x_ℓ,
∂²f_{C_l}(x, y)/∂x_i∂y_ℓ = −∂α_ℓ(x)/∂x_i,  and  ∂²f_{C_l}(x, y)/∂y_i∂y_ℓ = 0,

where ∂²α_j(x)/∂x_i∂x_ℓ = − Σ_{v ∈ ([N]−1)^d} (∂²P_v(x)/∂x_i∂x_ℓ) · C^j_l(v). Again using property (C) of Definition 7.3, we get that there are at most d + 1 vertices v of R_c(x) such that ∇²P_v(x) ≠ 0. This, together with property (A) of Definition 7.3, leads to the fact that |∂²α_j(x)/∂x_i∂x_ℓ| ≤ Θ(d/δ). Using the latter together with the bounds that we obtained for |∂α_j(x)/∂x_i| at the beginning of the proof, we get that ‖∇²f_{C_l}(x, y)‖_F ≤ Θ(d/δ), where ‖·‖_F denotes the Frobenius norm. Since the bound on the Frobenius norm is also a bound on the spectral norm, the function f_{C_l} is Θ(d/δ)-smooth.

We start with a description of the reduction from High
D-BiSperner to GDAFixedPoint. Suppose we have an instance of HighD-BiSperner given by a boolean circuit C_l : ([N] − 1)^d → {−1, 1}^d; we construct an instance of GDAFixedPoint according to the following set of rules.

(⋆) Construction of Instance for Fixed Points of Gradient Descent/Ascent.

• The payoff function is the real-valued function f_{C_l}(x, y) from Definition 7.4.
• The domain is the polytope P(A, b) that we described in Section 3. The matrix A and the vector b are computed so that the following inequalities hold:

x_i − y_i ≤ Δ,  y_i − x_i ≤ Δ  for all i ∈ [d],  (7.1)

where Δ = t·δ/d, with t ∈ ℝ₊ a constant such that |∂P_v(x)/∂x_i| · (δ·t/d) ≤ 1/2 for all v ∈ ([N] − 1)^d and x ∈ [0, 1]^d. The fact that such a constant t exists follows from property (A) of the smooth and efficient interpolation coefficients.
• The parameter α is set equal to Δ/3.
• The parameters G and L are set equal to the upper bounds on the Lipschitzness and the smoothness of f_{C_l}, respectively, that we derived in Lemma 7.5; namely, G = O(d/δ) and L = O(d/δ).

The first thing to observe is that the afore-described reduction is polynomial-time. To see this, observe that all of α, G, L, A, and b have representations that are polynomial in d, even if we use unary instead of binary representation. So the only thing that remains is the existence of a Turing machine C_{f_{C_l}} that computes the value and the gradient of f_{C_l} in time polynomial in the size of C_l and the requested accuracy. To prove this we need a detailed description of the SEIC coefficients, and for this reason we postpone the proof to Appendix D. Here we formally state the result that we prove in Appendix D, which, together with the discussion above, establishes that our reduction is indeed polynomial-time.

Theorem 7.6.
Let C_l : ([N] − 1)^d → {−1, 1}^d be a binary circuit that is an input to the HighD-BiSperner problem. Then there exists a polynomial-time Turing machine C_{f_{C_l}}, which can be constructed in polynomial time from the circuit C_l, such that, for all vectors x, y ∈ [0, 1]^d and accuracy ε > 0, C_{f_{C_l}} computes both z ∈ ℝ and w ∈ ℝ^{2d} such that

|z − f_{C_l}(x, y)| ≤ ε,  ‖w − ∇f_{C_l}(x, y)‖ ≤ ε.

Moreover, the running time of C_{f_{C_l}} is polynomial in the binary representation of x, y, and log(1/ε).

We also observe that, according to Lemma 7.5, the function f_{C_l} is both G-Lipschitz and L-smooth, and hence the output of our reduction is a valid input for the constructed instance of the promise problem GDAFixedPoint. The next step is to prove that the vector x⋆ of every solution (x⋆, y⋆) of GDAFixedPoint with the input described above lies in a cubelet that is either panchromatic according to C_l or contains a violation of the rules for proper coloring of the HighD-BiSperner problem.

Lemma 7.7.
Let C_l be an input to the HighD-BiSperner problem, let f_{C_l} be the corresponding G-Lipschitz and L-smooth function defined in Definition 7.4, and let P(A, b) be the polytope defined by (7.1). If (x⋆, y⋆) is any solution to the GDAFixedPoint problem with input α, G, L, C_{f_{C_l}}, A, and b, defined in (⋆), then the following statements hold, where we remind the reader that Δ = t·δ/d.

• If x⋆_i ∈ (α, 1 − α) and x⋆_i ∈ (y⋆_i − Δ + α, y⋆_i + Δ − α), then |∂f_{C_l}(x⋆, y⋆)/∂x_i| ≤ α.
• If x⋆_i ≤ α or x⋆_i ≤ y⋆_i − Δ + α, then ∂f_{C_l}(x⋆, y⋆)/∂x_i ≥ −α.
• If x⋆_i ≥ 1 − α or x⋆_i ≥ y⋆_i + Δ − α, then ∂f_{C_l}(x⋆, y⋆)/∂x_i ≤ α.

The symmetric statements for y⋆_i hold.

• If y⋆_i ∈ (α, 1 − α) and y⋆_i ∈ (x⋆_i − Δ + α, x⋆_i + Δ − α), then |∂f_{C_l}(x⋆, y⋆)/∂y_i| ≤ α.
• If y⋆_i ≤ α or y⋆_i ≤ x⋆_i − Δ + α, then ∂f_{C_l}(x⋆, y⋆)/∂y_i ≤ α.
• If y⋆_i ≥ 1 − α or y⋆_i ≥ x⋆_i + Δ − α, then ∂f_{C_l}(x⋆, y⋆)/∂y_i ≥ −α.

Proof. The proof of this lemma is identical to the proof of Lemma 6.7, and for this reason we skip the details here.

Lemma 7.8.
Let C_l be an input to the HighD-BiSperner problem, let f_{C_l} be the corresponding G-Lipschitz and L-smooth function defined in Definition 7.4, and let P(A, b) be the polytope defined by (7.1). If (x⋆, y⋆) is any solution to the GDAFixedPoint problem with input α, G, L, C_{f_{C_l}}, A, and b, defined in (⋆), then none of the following statements hold for the cubelet R(x⋆).

1. x⋆_i ≥ 1/(N−1) and, for all v ∈ R⁺(x⋆), it holds that C^i_l(v) = −1.
2. x⋆_i ≤ 1 − 1/(N−1) and, for all v ∈ R⁺(x⋆), it holds that C^i_l(v) = +1.

Proof. We prove that there is no solution (x⋆, y⋆) of GDAFixedPoint that satisfies statement 1.; the fact that (x⋆, y⋆) cannot satisfy statement 2. follows similarly. It is convenient for us to define

x̂ = x⋆ − ∇_x f_{C_l}(x⋆, y⋆),  K(y⋆) = { x | (x, y⋆) ∈ P(A, b) },  z = Π_{K(y⋆)} x̂, and
ŷ = y⋆ + ∇_y f_{C_l}(x⋆, y⋆),  K(x⋆) = { y | (x⋆, y) ∈ P(A, b) },  w = Π_{K(x⋆)} ŷ.

For the sake of contradiction, we assume that there exists a solution (x⋆, y⋆) such that x⋆_i ≥ 1/(N−1) and, for all v ∈ R⁺(x⋆), it holds that C^i_l(v) = −1. Using this fact, we will prove that (1) ∂f_{C_l}(x⋆, y⋆)/∂x_i ≥ 1/2 and (2) ∂f_{C_l}(x⋆, y⋆)/∂y_i = −1. Let R(x⋆) = [ c₁/(N−1), (c₁+1)/(N−1) ] × ··· × [ c_d/(N−1), (c_d+1)/(N−1) ]; then, since all the vertices v ∈ R⁺(x⋆) have C^i_l(v) = −1, from Definition 7.4 we have that

f_{C_l}(x⋆, y⋆) = (x⋆_i − y⋆_i) + Σ_{j=1, j≠i}^d (x⋆_j − y⋆_j) · α_j(x⋆).

If we differentiate this with respect to y_i, we immediately get that ∂f_{C_l}(x⋆, y⋆)/∂y_i = −1. On the other hand, if we differentiate with respect to x_i, we get

∂f_{C_l}(x⋆, y⋆)/∂x_i = 1 + Σ_{j≠i} (x⋆_j − y⋆_j) · ∂α_j(x⋆)/∂x_i ≥ 1 − Σ_{j≠i} |x⋆_j − y⋆_j| · |∂α_j(x⋆)/∂x_i| ≥ 1 − Δ · d · Θ(d/δ) ≥ 1/2,

where we used (1) that |∂α_j(x)/∂x_ℓ| ≤ Θ(d/δ), which is proved in the proof of Lemma 7.5, (2) that |x_j − y_j| ≤ Δ, and (3) the definition of Δ. Now it is easy to see that the only way to satisfy both ∂f_{C_l}(x⋆, y⋆)/∂x_i ≥ 1/2 and |z_i − x⋆_i| ≤ α is that either x⋆_i ≤ α or x⋆_i ≤ y⋆_i − Δ + α. The first case is excluded by the assumption of the first statement of our lemma and our choice of α = Δ/3 < 1/(N−1); thus it holds that x⋆_i ≤ y⋆_i − Δ + α. But then we can use the third case for the y variables of Lemma 7.7, and we get that ∂f_{C_l}(x⋆, y⋆)/∂y_i ≥ −α, which cannot be true since we proved that ∂f_{C_l}(x⋆, y⋆)/∂y_i = −1. Therefore we have a contradiction and the first statement of the lemma holds. Using the same reasoning we prove the second statement too.

We are now ready to complete the proof that our reduction from HighD-BiSperner to GDAFixedPoint is correct, and hence to prove Theorem 4.4.

Proof of Theorem 4.4. Let (x⋆, y⋆) be a solution to the GDAFixedPoint problem with input a Turing machine that represents the function f_{C_l}, α = Δ/3, where Δ = t·δ/d, G = Θ(d/δ), L = Θ(d/δ), and A, b as described in (⋆). For each coordinate i, there exist the following three mutually exclusive cases.

• 1/(N−1) ≤ x⋆_i ≤ 1 − 1/(N−1): Since |R⁺(x⋆)| ≥ 1, it follows directly from Lemma 7.8 that there exists v ∈ R⁺(x⋆) such that C^i_l(v) = +1 and there exists v′ ∈ R⁺(x⋆) such that C^i_l(v′) = −1.

• 0 ≤ x⋆_i < 1/(N−1): By Lemma 7.8, there exists v ∈ R⁺(x⋆) with C^i_l(v) = −1. If some vertex of R⁺(x⋆) has ith color +1, we obtain the same pair of witnesses as in the first case. Otherwise, C^i_l(v) = −1 for all v ∈ R⁺(x⋆); by property (D) of the SEIC coefficients, there exists v ∈ R⁺(x⋆) with v_i = 0, and this vertex is hence a solution of type 2. for the HighD-BiSperner problem.

• 1 − 1/(N−1) < x⋆_i ≤ 1: Symmetrically, either we obtain the same pair of witnesses as in the first case, or C^i_l(v) = +1 for all v ∈ R⁺(x⋆); by property (D), there exists v ∈ R⁺(x⋆) with v_i = N − 1, and this vertex is hence a solution of type 3. for the HighD-BiSperner problem.

Since R⁺(x⋆) is computable in polynomial time given x⋆, we can easily check, for every i ∈ [d], which of the above cases holds. If for some i ∈ [d] the 2nd or the 3rd case yields a violating vertex, then that vertex gives a solution to the HighD-BiSperner problem and our reduction is correct. Hence we may assume that, for every i ∈ [d], we obtain vertices v, v′ ∈ R⁺(x⋆) ⊆ R_c(x⋆) with C^i_l(v) = +1 and C^i_l(v′) = −1. This implies that the cubelet R(x⋆) is panchromatic and therefore yields a solution to the problem HighD-BiSperner. Finally, we observe that the function that we define has range [−d, d], and hence Theorem 4.4 follows using Theorem 5.1.

In this section we describe the construction of the smooth and efficient interpolation coefficients (SEIC) that we introduced in Section 7.2.1. After the description of the construction, we present the statements of the lemmas that prove the properties (A) – (D) of Definition 7.3, and we refer to Appendix C for their proofs. We first remind the reader of the definition of the SEIC coefficients.
Definition 7.3 (Smooth and Efficient Interpolation Coefficients). For every N ∈ ℕ we define the set of smooth and efficient interpolation coefficients (SEIC) as the family of functions, called coefficients,

I_{d,N} = { P_v : [0,1]^d → ℝ | v ∈ ([N] − 1)^d }

with the following properties.

(A) For all vertices v ∈ ([N] − 1)^d, the coefficient P_v(x) is a twice-differentiable function and satisfies
  ▸ |∂P_v(x)/∂x_i| ≤ Θ(d/δ),
  ▸ |∂²P_v(x)/(∂x_i ∂x_ℓ)| ≤ Θ(d²/δ²).

(B) For all v ∈ ([N] − 1)^d, it holds that P_v(x) ≥ 0 and Σ_{v ∈ ([N]−1)^d} P_v(x) = Σ_{v ∈ R_c(x)} P_v(x) = 1.

(C) For all x ∈ [0,1]^d, it holds that all but d + 1 coefficients P_v ∈ I_{d,N} satisfy P_v(x) = ∇P_v(x) = ∇²P_v(x) = 0. We denote this set of d + 1 remaining vertices by R⁺(x). Furthermore, it holds that R⁺(x) ⊆ R_c(x) and given x we can compute the set R⁺(x) in time poly(d).

(D) For all x ∈ [0,1]^d, if x_i ≤ 1/(N−1) for some i ∈ [d] then there exists v ∈ R⁺(x) such that v_i = 0. Respectively, if x_i ≥ 1 − 1/(N−1) then there exists v ∈ R⁺(x) such that v_i = N − 1.

Theorem 8.1. For every d ∈ ℕ and every N = poly(d) there exists a family of functions I_{d,N} that satisfies the properties (A) - (D) of Definition 7.3.

One important component of the construction of the SEIC coefficients is the smooth step functions, which we introduce in Section 8.1. These functions also provide a toy example of smooth and efficient interpolation coefficients in 1 dimension. Then in Section 8.2 we present the construction of the SEIC coefficients in multiple dimensions, and in Section 8.3 we state the main lemmas that lead to the proof of Theorem 8.1.
8.1 Smooth Step Functions

Smooth step functions are real-valued functions g : ℝ → ℝ of a single real variable with the following properties.

Step Value. For every x ≤ 0 it holds that g(x) = 0, for every x ≥ 1 it holds that g(x) = 1, and for every x ∈ [0,1] it holds that g(x) ∈ [0,1].

Smoothness. For some k ∈ ℕ it holds that g is k times continuously differentiable and its kth derivative satisfies g^{(k)}(0) = g^{(k)}(1) = 0.

The maximum k such that the smoothness property from above holds characterizes the order of smoothness of the smooth step function g. In Section 6 we have already defined and used the smooth step function of order 1. For the construction of the SEIC coefficients we use the smooth step function of order 2 and the smooth step function of order ∞, defined as follows. Definition 8.2.
We define the smooth step function S : ℝ → ℝ of order 2 as the following function:

S(x) = 6x⁵ − 15x⁴ + 10x³ for x ∈ (0,1), S(x) = 0 for x ≤ 0, and S(x) = 1 for x ≥ 1.

We also define the smooth step function S_∞ : ℝ → ℝ of order ∞ as the following function:

S_∞(x) = 2^{−1/x} / (2^{−1/x} + 2^{−1/(1−x)}) for x ∈ (0,1), S_∞(x) = 0 for x ≤ 0, and S_∞(x) = 1 for x ≥ 1.

In the rest of the paper we use S instead of S₂ for the smooth step function of order 2, for simplicity of the exposition of the paper. We present a plot of these step functions in Figure 7, and we summarize some of their properties in Lemma 8.3. A more detailed lemma with additional properties of S_∞ that are useful for the proof of Theorem 8.1 is presented in Lemma C.5 in Appendix C. Lemma 8.3.
Let S and S_∞ be the smooth step functions defined in Definition 8.2. It holds that both S and S_∞ are monotone increasing functions, and that S(0) = 0, S(1) = 1, and also S′(0) = S′(1) = S″(0) = S″(1) = 0. It also holds that S_∞(0) = 0, S_∞(1) = 1, and also S_∞^{(k)}(0) = S_∞^{(k)}(1) = 0 for every k ∈ ℕ. Additionally it holds for every x that |S′(x)| ≤ 2 and |S″(x)| ≤ 6, whereas |S′_∞(x)| ≤ 16 and |S″_∞(x)| ≤ 32.
Figure 7: (a) The smooth step function S of order 2 and the smooth step function S_∞ of order ∞. As we can see, both S and S_∞ are continuous and continuously differentiable functions, but S_∞ is much more flat around 0 and 1, since it has all its derivatives equal to 0 both at the point 0 and at the point 1. This makes the S_∞ function infinitely many times differentiable. (b) The constructed function P₃ of the family of SEIC coefficients for the single dimensional case with N = 5. For details we refer to Example 8.4.
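To make the two step functions concrete, here is a small sketch that implements them as given in Definition 8.2 (the closed forms and the function names are our reading of the definition) and checks the step-value, symmetry, and monotonicity properties used below.

```python
def S(x: float) -> float:
    """Smooth step function of order 2: 6x^5 - 15x^4 + 10x^3 on (0, 1)."""
    if x <= 0.0:
        return 0.0
    if x >= 1.0:
        return 1.0
    return 6 * x**5 - 15 * x**4 + 10 * x**3


def S_inf(x: float) -> float:
    """Smooth step function of order infinity: 2^(-1/x) / (2^(-1/x) + 2^(-1/(1-x)))."""
    if x <= 0.0:
        return 0.0
    if x >= 1.0:
        return 1.0
    a = 2.0 ** (-1.0 / x)
    b = 2.0 ** (-1.0 / (1.0 - x))
    return a / (a + b)


# Step values at the endpoints and at the midpoint.
assert S(0) == 0 and S(1) == 1 and abs(S(0.5) - 0.5) < 1e-12
assert S_inf(0) == 0 and S_inf(1) == 1 and abs(S_inf(0.5) - 0.5) < 1e-12

# The symmetry 1 - S_inf(x) = S_inf(1 - x), used for property (B) in Example 8.4.
for x in (0.125, 0.25, 0.7):
    assert abs((1 - S_inf(x)) - S_inf(1 - x)) < 1e-12

# Monotonicity on a grid.
grid = [i / 100 for i in range(101)]
assert all(S(a) <= S(b) and S_inf(a) <= S_inf(b) for a, b in zip(grid, grid[1:]))
```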
Proof. For the function S we compute S′(x) = 30x⁴ − 60x³ + 30x² for x ∈ [0,1] and S′(x) = 0 for x ∉ [0,1]. Therefore we can easily get that |S′(x)| ≤ 2 for every x ∈ ℝ. We also have that S″(x) = 120x³ − 180x² + 60x for x ∈ (0,1) and S″(x) = 0 for x ∉ [0,1], hence we can conclude that |S″(x)| ≤ 6.

The calculations for S_∞ are more complicated. We have that

S′_∞(x) = ln(2) · 2^{−1/(x(1−x))} · (1 − 2x(1−x)) / ( (2^{−1/x} + 2^{−1/(1−x)})² · (1−x)² · x² ).

We set h(x) ≜ (2^{−1/x} + 2^{−1/(1−x)})² · (1−x)² · x² for x ∈ [0,1], and doing simple calculations we get that for x ≤ 1/2 it holds that h(x) ≥ 2^{−2/(1−x)} · (1−x)² · x², since 2^{−1/x} + 2^{−1/(1−x)} ≥ 2^{−1/(1−x)}. Combining the above, for x ≤ 1/2 we get |S′_∞(x)| ≤ ln(2) · 2^{−(1−2x)/(x(1−x))} / ((1−x)² x²), which can be easily upper bounded by 16 via elementary calculus; the case x ≥ 1/2 follows by the symmetry S_∞(x) = 1 − S_∞(1−x). Using a similar argument we can prove that |S″_∞(x)| ≤ 32. For all the derivatives of S_∞ we can inductively prove that

S_∞^{(k)}(x) = Σ_{i=0}^{k−1} h_i(x) · S_∞^{(i)}(x),

where the functions h_i(x) are bounded. Then the fact that all the derivatives of S_∞ vanish at 0 and at 1 follows by a simple inductive argument.

Example 8.4. Using the smooth step functions that we described above we can get a construction of SEIC coefficients for the single dimensional case. Unfortunately the extension to multiple dimensions is substantially harder and invokes new ideas that we explore later in this section. For the single dimensional problem of this example we have the interval [0,1] divided into N subintervals, and our goal is to design N + 1 functions P₀, …, P_N that satisfy the properties (A) - (D) of Definition 7.3. A simple construction of such functions is the following:

P_i(x) = S_∞(N · x − (i − 1)) if x ≤ i/N, and P_i(x) = S_∞(i + 1 − N · x) if x > i/N.

Based on Lemma 8.3 it is not hard then to see that P_i is twice differentiable and it has bounded first and second derivatives, hence it satisfies property (A) of Definition 7.3. Using the fact that 1 − S_∞(x) = S_∞(1 − x) we can also prove property (B). Finally, properties (C) and (D) can be proved via the definition of the coefficient P_i from above. In Figure 7 we can see the plot of P₃ for N = 5. We leave the exact proofs of this example as an exercise for the reader.
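The one-dimensional construction of Example 8.4 can also be checked numerically. The sketch below implements the formula for P_i exactly as written above (our reconstruction of the example, with peaks at the grid points i/N) and verifies the two claims the example makes: the coefficients are non-negative and sum to 1 everywhere (property (B)), and at most 2 = d + 1 of them are non-zero at any point (property (C)).

```python
def S_inf(x: float) -> float:
    # Smooth step function of order infinity (Definition 8.2).
    if x <= 0.0:
        return 0.0
    if x >= 1.0:
        return 1.0
    a = 2.0 ** (-1.0 / x)
    return a / (a + 2.0 ** (-1.0 / (1.0 - x)))


def P(i: int, x: float, N: int) -> float:
    # Coefficient P_i of Example 8.4, peaked at the grid point i / N.
    if x <= i / N:
        return S_inf(N * x - (i - 1))
    return S_inf(i + 1 - N * x)


N = 5
for k in range(1001):
    x = k / 1000
    values = [P(i, x, N) for i in range(N + 1)]
    assert all(v >= 0 for v in values)
    assert sum(1 for v in values if v > 0) <= 2   # property (C): d + 1 = 2 in one dimension
    assert abs(sum(values) - 1) < 1e-9            # property (B): partition of unity
```

The partition-of-unity check is exactly the identity 1 − S_∞(x) = S_∞(1 − x) at work: between two consecutive grid points only the two neighboring coefficients are non-zero, and they are complements of each other.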
8.2 Construction of the SEIC Coefficients in Multiple Dimensions

The goal of this section is to present the construction of the family I_{d,N} of smooth and efficient interpolation coefficients for every number of dimensions d and any discretization parameter N. Before diving into the details of our construction, observe that even the 2-dimensional case is non-trivial. A first attempt would be to divide [0,1]² into two triangles along the diagonal of [0,1]², and then, using any soft-max function that is twice continuously differentiable, to define a convex combination at every triangle. Unfortunately this approach cannot work, since the resulting coefficients have discontinuous gradients along the diagonal of [0,1]². We leave the precise calculations of this example as an exercise to the reader.

We start with some definitions about the orientation and the representation of the cubelets of the grid ([N] − 1)^d. Then we proceed with the definition of the Q_v functions in Definition 8.7. Finally, using Q_v, we proceed with the construction of the SEIC coefficients.

Definition 8.5 (Source and Target of Cubelets). Each cubelet [c₁/(N−1), (c₁+1)/(N−1)] × ⋯ × [c_d/(N−1), (c_d+1)/(N−1)], where c ∈ ([N−1] − 1)^d, admits a source vertex s_c = (s₁, …, s_d) ∈ ([N] − 1)^d and a target vertex t_c = (t₁, …, t_d) ∈ ([N] − 1)^d defined as follows:

s_j = c_j + 1 if c_j is odd, and s_j = c_j if c_j is even;  t_j = c_j if c_j is odd, and t_j = c_j + 1 if c_j is even.

Notice that the source s_c and the target t_c are vertices of the cubelet whose down-left corner is c.

Definition 8.6 (Canonical Representation). Let x ∈ [0,1]^d and R(x) = [c₁/(N−1), (c₁+1)/(N−1)] × ⋯ × [c_d/(N−1), (c_d+1)/(N−1)], where c ∈ ([N−1] − 1)^d. The canonical representation of x under the cubelet with down-left corner c, denoted by p_x^c = (p₁, …, p_d), is defined as follows:

p_j = ((N−1) · x_j − s_j) / (t_j − s_j),

where t_c = (t₁, …, t_d) and s_c = (s₁, …, s_d) are respectively the target and the source of R(x).

Definition 8.7 (Defining the functions Q_v(x)). Let x ∈ [0,1]^d lying in the cubelet

R(x) = [c₁/(N−1), (c₁+1)/(N−1)] × ⋯ × [c_d/(N−1), (c_d+1)/(N−1)],

with corners R_c(x) = {c₁, c₁+1} × ⋯ × {c_d, c_d+1}, where c ∈ ([N−1] − 1)^d. Let also s_c = (s₁, …, s_d) be the source vertex of R(x) and p_x^c = (p₁, …, p_d) be the canonical representation of x. Then for each vertex v ∈ R_c(x) we define the following partition of the set of coordinates [d]:

A_v^c = { j : |v_j − s_j| = 0 } and B_v^c = { j : |v_j − s_j| = 1 }.

If there exist j ∈ A_v^c and ℓ ∈ B_v^c such that p_j ≥ p_ℓ, then Q_v^c(x) = 0. Otherwise we define

Q_v^c(x) = ∏_{j ∈ A_v^c} ∏_{ℓ ∈ B_v^c} S_∞(S(p_ℓ) − S(p_j)) if A_v^c, B_v^c ≠ ∅;
Q_v^c(x) = ∏_{ℓ=1}^d S_∞(1 − S(p_ℓ)) if B_v^c = ∅;
Q_v^c(x) = ∏_{j=1}^d S_∞(S(p_j)) if A_v^c = ∅,

where S_∞(x) and S(x) are the smooth step functions defined in Definition 8.2.

To provide a better understanding of Definitions 8.5, 8.6, and 8.7 we present the following 3-dimensional example.

Example 8.8. We consider a case where d = 3 and N = 4. Let x be a point lying in the cubelet R(x) = [1/3, 2/3] × [2/3, 1] × [0, 1/3], and let c = (1, 2, 0). Then the source of R(x) is s_c = (2, 2, 0) and the target is t_c = (1, 3, 1) (Definition 8.5). Let p_x^c = (p₁, p₂, p₃) be the canonical representation of x (Definition 8.6), and assume that p₁ > p₂ > p₃. The only vertices with non-zero coefficients Q_v^c(x) are those belonging in the set R⁺(x) = { (1, 3, 1), (1, 3, 0), (1, 2, 0), (2, 2, 0) }, and by Definition 8.7 we have that

▷ Q_{(1,3,1)}^c(x) = S_∞(S(p₁)) · S_∞(S(p₂)) · S_∞(S(p₃)),
▷ Q_{(1,3,0)}^c(x) = S_∞(S(p₁) − S(p₃)) · S_∞(S(p₂) − S(p₃)),
▷ Q_{(1,2,0)}^c(x) = S_∞(S(p₁) − S(p₂)) · S_∞(S(p₁) − S(p₃)),
▷ Q_{(2,2,0)}^c(x) = S_∞(1 − S(p₁)) · S_∞(1 − S(p₂)) · S_∞(1 − S(p₃)).

Now based on Definitions 8.5, 8.6, and 8.7 we are ready to present the construction of the smooth and efficient interpolation coefficients.

Definition 8.9 (Construction of SEIC Coefficients). Let x ∈ [
0,1]^d lying in the cubelet R(x) = [c₁/(N−1), (c₁+1)/(N−1)] × ⋯ × [c_d/(N−1), (c_d+1)/(N−1)]. Then for each vertex v ∈ ([N] − 1)^d the coefficient P_v(x) is defined as follows:

P_v(x) = Q_v^c(x) / Σ_{v′ ∈ R_c(x)} Q_{v′}^c(x) if v ∈ R_c(x), and P_v(x) = 0 if v ∉ R_c(x),

where the functions Q_v^c(x) ≥ 0 are defined in Definition 8.7 for every v ∈ R_c(x). We note that in the expressions above ∏ denotes the product symbol and should not be confused with the projection operator used in the previous sections.

8.3 Sketch of the Proof of Theorem 8.1

First it is necessary to argue that P_v(x) is a continuous function, since it could be the case that Q_v^c(x) / Σ_{v′ ∈ R_c(x)} Q_{v′}^c(x) ≠ Q_v^{c′}(x) / Σ_{v′ ∈ R_{c′}(x)} Q_{v′}^{c′}(x) for some point x that lies on the boundary of two adjacent cubelets with down-left corners c and c′ respectively. We specifically design the coefficients Q_v^c(x) so that the latter does not occur, and this is the main reason that the definition of the function Q_v^c(x) is slightly complicated. For this reason we prove the following lemma.

Lemma 8.10.
For any vertex v ∈ ([N] − 1)^d, P_v(x) is a continuous and twice differentiable function, and for any v ∉ R_c(x) it holds that P_v(x) = ∇P_v(x) = ∇²P_v(x) = 0. Moreover, for every x ∈ [0,1]^d the set R⁺(x) of vertices v ∈ ([N] − 1)^d such that P_v(x) > 0 satisfies |R⁺(x)| ≤ d + 1.

Based on Lemma 8.10 and the expression of P_v we can prove that the P_v coefficients defined in Definition 8.9 satisfy the properties (B) and (C) of Definition 7.3. To prove the properties (A) and (D) we also need the following two lemmas.

Lemma 8.11. For any vertex v ∈ ([N] − 1)^d, it holds that
1. |∂P_v(x)/∂x_i| ≤ Θ(d/δ),
2. |∂²P_v(x)/(∂x_i ∂x_j)| ≤ Θ(d²/δ²).

Lemma 8.12. Let a point x ∈ [0,1]^d and let R⁺(x) be the set of vertices with P_v(x) > 0. Then we have that:
1. If 0 ≤ x_i < 1/(N−1) then there always exists a vertex v ∈ R⁺(x) such that v_i = 0.
2. If 1 − 1/(N−1) < x_i ≤ 1 then there always exists a vertex v ∈ R⁺(x) such that v_i = N − 1.

The proofs of Lemmas 8.10, 8.11, and 8.12 can be found in Appendix C. Based on Lemmas 8.10, 8.11, and 8.12 we are now ready to prove Theorem 8.1.

Proof of Theorem 8.1. The fact that the coefficients P_v satisfy property (A) follows directly from Lemma 8.11. Property (B) follows directly from the definition of P_v in Definition 8.9 and the simple fact that Q_v^c(x) ≥ 0. Property (C) follows from the second part of Lemma 8.10. Finally, property (D) follows directly from Lemma 8.12.
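As a sanity check of Definitions 8.5-8.9 as stated above, the following sketch computes the non-zero coefficients P_v at random points (the variable names and the test points are ours) and verifies the partition-of-unity part of property (B) together with the bound |R⁺(x)| ≤ d + 1 from Lemma 8.10.

```python
import itertools
import random


def S(p):
    # Smooth step function of order 2 (Definition 8.2).
    return 0.0 if p <= 0 else 1.0 if p >= 1 else 6 * p**5 - 15 * p**4 + 10 * p**3


def S_inf(p):
    # Smooth step function of order infinity (Definition 8.2).
    if p <= 0:
        return 0.0
    if p >= 1:
        return 1.0
    a = 2.0 ** (-1.0 / p)
    return a / (a + 2.0 ** (-1.0 / (1.0 - p)))


def coefficients(x, N):
    """Return {v: P_v(x)} for the non-zero SEIC coefficients at x (Definitions 8.5-8.9)."""
    d = len(x)
    c = [min(int(xj * (N - 1)), N - 2) for xj in x]              # down-left corner of R(x)
    s = [cj + 1 if cj % 2 == 1 else cj for cj in c]              # source vertex (Definition 8.5)
    t = [cj if cj % 2 == 1 else cj + 1 for cj in c]              # target vertex (Definition 8.5)
    p = [((N - 1) * x[j] - s[j]) / (t[j] - s[j]) for j in range(d)]  # canonical representation

    Q = {}
    for v in itertools.product(*[(cj, cj + 1) for cj in c]):     # corners R_c(x)
        A = [j for j in range(d) if v[j] == s[j]]
        B = [j for j in range(d) if v[j] != s[j]]
        if any(p[j] >= p[l] for j in A for l in B):
            continue                                             # Q_v(x) = 0 off the monotone path
        q = 1.0
        if not B:
            for j in range(d):
                q *= S_inf(1 - S(p[j]))                          # v is the source
        elif not A:
            for j in range(d):
                q *= S_inf(S(p[j]))                              # v is the target
        else:
            for j in A:
                for l in B:
                    q *= S_inf(S(p[l]) - S(p[j]))
        if q > 0:
            Q[v] = q
    total = sum(Q.values())
    return {v: q / total for v, q in Q.items()}                  # normalization (Definition 8.9)


random.seed(0)
d, N = 3, 4
for _ in range(100):
    x = [random.random() for _ in range(d)]
    P = coefficients(x, N)
    assert len(P) <= d + 1                    # Lemma 8.10: |R+(x)| <= d + 1
    assert all(q > 0 for q in P.values())
    assert abs(sum(P.values()) - 1) < 1e-9    # property (B): partition of unity
```

The design choice that the sketch makes visible is that a vertex survives only if it lies on the monotone staircase path from the source to the target determined by sorting the canonical coordinates, which is what caps the support at d + 1 vertices.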
9 Black-Box Lower Bounds

In this section our goal is to prove Theorem 4.5 based on Theorem 4.4, which we proved in Section 7, and the known black-box lower bounds for PPAD by [HPV89]. In this section we assume that all the real number operations are performed with infinite precision.

Theorem 9.1 ([HPV89]). Assume that there exists an algorithm A that has black-box oracle access to the value of a function M : [0,1]^d → [0,1]^d and outputs w* ∈ [0,1]^d. There exists a universal constant c > 0 such that if M is O(1)-Lipschitz and ‖M(w*) − w*‖ ≤ c, then A has to make at least 2^d different oracle calls to the function value of M.

It is easy to observe that the reduction in the proof of Theorem 7.2 is a black-box reduction, and every evaluation of the constructed circuit C_l only requires one evaluation of the input function M. Therefore the proof of Theorem 7.2 together with Theorem 9.1 imply the following corollary.

Figure 8: Pictorial representation of the way the black-box lower bound follows from the white-box PPAD-completeness that we presented in Section 7 and the known black-box lower bounds for the Brouwer problem by [HPV89]. In the figure we can see the four dimensional case of Section 6 that corresponds to the 2D-BiSperner and the 2-dimensional Brouwer. As we can see, in that case 1 query to O_f can be implemented with 3 queries to 2D-BiSperner, and each of these can be implemented with 1 query to 2-dimensional Brouwer. In the high dimensional setting of Section 7, every query (x, y) to the oracle O_f to return the values f(x, y) and ∇f(x, y) can be implemented via d + 1 queries to a HighD-BiSperner instance. Each of these queries to HighD-BiSperner can be implemented via 1 query to a Brouwer instance. Therefore a 2^d query lower bound for Brouwer implies a 2^d query lower bound for HighD-BiSperner, which in turn implies a 2^d/(d + 1) query lower bound for our GDAFixedPoint and LR-LocalMinMax problems.

Corollary 9.2 (Black-Box Lower Bound for Bi-Sperner). Let C_l : ([N] − 1)^d → {−1, 1}^d be an instance of the HighD-BiSperner problem with N = O(d). Then any algorithm that has black-box oracle access to C_l and outputs a solution to the corresponding HighD-BiSperner problem needs 2^d different oracle calls to the value of C_l.

Based on Corollary 9.2 and the reduction that we presented in Section 7, we are now ready to prove Theorem 4.5.
Proof of Theorem 4.5.
This proof follows the steps of Figure 8. The last part of that figure is established in Corollary 9.2. So what is left to prove Theorem 4.5 is that for every instance of HighD-BiSperner we can construct a function f such that the oracle O_f can be implemented via d + 1 queries to the HighD-BiSperner instance, and also that every solution of GDAFixedPoint with oracle access O_f to f and ∇f reveals one solution of the starting HighD-BiSperner instance.

To construct this oracle O_f we follow exactly the reduction that we described in Section 7. The correctness of the reduction that we provided in Section 7 suffices to prove that every solution of GDAFixedPoint with oracle access O_f to f and ∇f gives a solution to the initial HighD-BiSperner instance. So the only thing that remains is to bound the number of queries to the HighD-BiSperner instance that we need in order to implement the oracle O_f. To do this, consider the following definition of f based on an instance C_l of HighD-BiSperner from Definition 7.4, with a scaling factor to make sure that the range of the function is [−1, 1]:

f_{C_l}(x, y) = (1/d) · Σ_{j=1}^d (x_j − y_j) · α_j(x), where α_j(x) = − Σ_{v ∈ ([N]−1)^d} P_v(x) · C_l^j(v),

and P_v are the coefficients defined in Definition 7.3. From property (C) of the coefficients P_v we have that to evaluate α_j(x) we only need the values C_l^j(v) for the d + 1 vertices v ∈ R⁺(x), and the same coefficients are needed to evaluate α_j(x) for every j. This implies that for every (x, y) we need d + 1 queries to the instance C_l of HighD-BiSperner so that O_f returns the value of f_{C_l}(x, y). If we take the gradient of f_{C_l} with respect to (x, y), then an identical argument implies that the same set of d + 1 queries to HighD-BiSperner is needed so that O_f returns the value of ∇f_{C_l}(x, y) too. Therefore every query to the oracle O_f can be implemented via d + 1 queries to C_l. Now we can use Corollary 9.2 to get that the number of queries that we need in order to solve GDAFixedPoint with oracle access O_f to f and ∇f is at least 2^d/(d + 1). Finally, observe that the proof of Theorem 5.1 applies in the black-box model too. Hence finding a solution of GDAFixedPoint when we have black-box access O_f to f and ∇f is equivalent to finding a solution of LR-LocalMinMax when we have exactly the same black-box access O_f to f and ∇f. Therefore to find solutions of LR-LocalMinMax with black-box access O_f to f and ∇f we need at least 2^d/(d + 1) queries to O_f, and the theorem follows by observing that in our proof the only parameters that depend on d are L, G, ε, and possibly δ, but 1/δ = O(√(L/ε)) and hence the dependence on δ can be replaced by a dependence on L and ε.
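To illustrate the query accounting in the proof above, the sketch below evaluates f_{C_l}(x, y) = (1/d) · Σ_j (x_j − y_j) · α_j(x) through a counting wrapper around a black-box C_l. For brevity it uses the single-cubelet specialization N = 2 of the interpolation coefficients and an arbitrary toy labeling for C_l; both simplifications are ours. The point is that one evaluation of f_{C_l} touches at most d + 1 vertices.

```python
import itertools


def S(p):
    # Smooth step function of order 2 (Definition 8.2).
    return 0.0 if p <= 0 else 1.0 if p >= 1 else 6 * p**5 - 15 * p**4 + 10 * p**3


def S_inf(p):
    # Smooth step function of order infinity (Definition 8.2).
    if p <= 0:
        return 0.0
    if p >= 1:
        return 1.0
    a = 2.0 ** (-1.0 / p)
    return a / (a + 2.0 ** (-1.0 / (1.0 - p)))


def nonzero_coeffs(x):
    """{v: P_v(x)} over the corners of the single cubelet [0,1]^d (the case N = 2)."""
    d = len(x)
    Q = {}
    for v in itertools.product((0, 1), repeat=d):
        A = [j for j in range(d) if v[j] == 0]      # coordinates still at the source
        B = [j for j in range(d) if v[j] == 1]      # coordinates already at the target
        if any(x[j] >= x[l] for j in A for l in B):
            continue
        q = 1.0
        if not B:
            for j in A:
                q *= S_inf(1 - S(x[j]))
        elif not A:
            for l in B:
                q *= S_inf(S(x[l]))
        else:
            for j in A:
                for l in B:
                    q *= S_inf(S(x[l]) - S(x[j]))
        if q > 0:
            Q[v] = q
    total = sum(Q.values())
    return {v: q / total for v, q in Q.items()}


queried = set()


def C_l(v):
    # Black-box HighD-BiSperner labeling with a query log; the labels are a toy choice.
    queried.add(v)
    return [1 if bit else -1 for bit in v]


def f_C(x, y):
    d = len(x)
    P = nonzero_coeffs(x)
    alpha = [-sum(p * C_l(v)[j] for v, p in P.items()) for j in range(d)]
    return sum((x[j] - y[j]) * alpha[j] for j in range(d)) / d


d = 4
value = f_C([0.15, 0.4, 0.65, 0.9], [0.5] * d)
assert -1.0 <= value <= 1.0        # the 1/d scaling keeps the range inside [-1, 1]
assert len(queried) <= d + 1       # one O_f query costs at most d + 1 queries to C_l
```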
10 Hardness in the Global Regime
In this section our goal is to prove that the complexity of the problems LocalMinMax and LocalMin significantly increases when ε, δ lie outside the local regime, in the global regime. We start with the following theorem, where we show the FNP-hardness of LocalMinMax.

Theorem 10.1. LocalMinMax is FNP-hard even when ε is set to any value below a sufficiently small universal constant, δ is set to any value ≥ 1, and even when P(A, b) = [0,1]^d, G = √d, L = d, and B = d.

Proof. We now present a reduction from 3-SAT(3) to LocalMinMax that proves Theorem 10.1. First we remind the definition of the problem 3-SAT(3).

3-SAT(3).
Input: A boolean CNF formula φ with boolean variables x₁, …, x_n such that every clause of φ has at most 3 boolean variables and every boolean variable appears in at most 3 clauses.
Output: An assignment x ∈ {0,1}^n that satisfies φ, or ⊥ if no such assignment exists.

Given an instance of 3-SAT(3) we first construct a polynomial P_j(x) for each clause φ_j as follows: to each boolean variable x_i (there are n boolean variables x_i) we correspond a respective real-valued variable x_i. Then for each clause φ_j (there are m such clauses), let ℓ_i, ℓ_k, ℓ_r denote the literals participating in φ_j, and define

P_j(x) = P_{ji}(x) · P_{jk}(x) · P_{jr}(x), where P_{ji}(x) = 1 − x_i if ℓ_i = x_i, and P_{ji}(x) = x_i if ℓ_i = ¬x_i.

Finally, we define

f(x, w, z) = Σ_{j=1}^m P_j(x) · (w_j − z_j)²,

where w_j, z_j are additional variables associated with clause φ_j. The player that wants to minimize f controls the x, w vectors, while the maximizing player controls the z variables. Lemma 10.2.
The formula φ admits a satisfying assignment if and only if there exists an (ε, δ)-local min-max equilibrium of f with ε below a sufficiently small universal constant, δ = 1, and (x, w) ∈ [0,1]^{n+m}.

Proof. Let us assume that there exists a satisfying assignment. Given such a satisfying assignment we construct ((x*, w*), z*) that is a (0, 1)-local min-max equilibrium of f. We set each variable x*_i ≜ 1 if the corresponding boolean variable is true and x*_i ≜ 0 otherwise. Then P_j(x*) = 0 for every clause j, meaning that the strategy profile ((x*, w*), z*) is a global Nash equilibrium no matter the values of w*, z*.

On the opposite direction, let us assume that there exists an (ε, δ)-local min-max equilibrium of f with ε sufficiently small and δ = 1. In this case we first prove that for each j = 1, …, m,

P_j(x*) ≤ 16 · ε.

Fix any clause j. In case |w*_j − z*_j| ≥ 1/4, the minimizing player can decrease the value of f by at least P_j(x*)/16 by setting w*_j ≜ z*_j. On the other hand, in case |w*_j − z*_j| ≤ 1/4, the maximizing player can increase the value of f by at least P_j(x*)/16 by moving z*_j either to 0 or to 1. We remark that both of the options are feasible since δ = 1.

Now consider the random assignment in which each boolean variable x_i is independently selected to be true with probability x*_i. Then

P(clause φ_j is not satisfied) = P_j(x*) ≤ 16 · ε.

Since each clause φ_j shares variables with at most 6 other clauses, the event of φ_j not being satisfied is dependent with at most 6 other events. For ε sufficiently small, by the Lovász Local Lemma [EL73] we get that the probability that none of these events occurs is positive. As a result, there exists a satisfying assignment.

Hence the formula φ is satisfiable if and only if f has an (ε, 1)-local min-max equilibrium point. What is left to prove the FNP-hardness is to show how we can find a satisfying assignment of φ given an approximate stationary point of f. This can be done using the celebrated results that provide constructive proofs of the Lovász Local Lemma [Mos09, MT10]. Finally, to conclude the proof, observe that since the f that we construct is a polynomial of degree at most 6, which can efficiently be described as a sum of monomials, we can trivially construct a Turing machine that computes the values of both f and ∇f in polynomial time in the requested number of bits of accuracy. The constructed function f is √d-Lipschitz and d-smooth, where d is the number of variables, which is equal to n + 2m. More precisely, since each boolean variable x_i participates in at most 3 clauses, the real-valued variable x_i appears in at most 3 monomials P_j. Thus −3 ≤ ∂f(x, w, z)/∂x_i ≤ 3. Similarly, it is not hard to see that −2 ≤ ∂f(x, w, z)/∂w_j, ∂f(x, w, z)/∂z_j ≤ 2. All the latter imply that ‖∇f(x, w, z)‖ ≤ Θ(√(n + m)), meaning that f(x, w, z) is Θ(√(n + m))-Lipschitz. Using again the fact that each x_i participates in at most 3 monomials P_j(x), we get that all the terms ∂²f/∂x_i², ∂²f/(∂x_i ∂w_j), ∂²f/(∂x_i ∂z_j), ∂²f/∂w_j², ∂²f/∂z_j², ∂²f/(∂w_j ∂z_j) lie in [−6, 6]. Thus the absolute value of each entry of ∇²f(x, w, z) is bounded by 6 and thus ‖∇²f(x, w, z)‖ ≤ Θ(n + m), which implies the Θ(n + m)-smoothness. Therefore our reduction produces a valid instance of LocalMinMax and hence the theorem follows.

Next we show the FNP-hardness of LocalMin. As we can see, there is a gap between Theorem 10.1 and Theorem 10.3. In particular, the FNP-hardness result for LocalMinMax is stronger, since it holds for any δ ≥ 1, whereas for the FNP-hardness of LocalMin our proof needs δ ≥ √d when the rest of the parameters remain the same.

Theorem 10.3. LocalMin is FNP-hard even when ε is set to any value below a sufficiently small universal constant, δ is set to any value ≥ √d, and even when P(A, b) = [0,1]^d, G = √d, L = d, and B = d.

Proof. We follow the same proof as in the proof of Theorem 10.1, but we instead set f(x) = Σ_{j=1}^m P_j(x), where x ∈ [0,1]^n (the number of variables is d := n). We then get that if the initial formula is satisfiable then there exists x ∈ P(A, b) such that f(x) = 0. On the other hand, if there exists x ∈ P(A, b) such that f(x) ≤ ε for ε sufficiently small, then via the Lovász Local Lemma we get that φ admits a satisfying assignment, and the FNP-hardness follows again from the constructive proofs of the Lovász Local Lemma [Mos09, MT10]. Setting δ ≥ √n, which equals the diameter of the feasibility set, implies that in case there exists x̂ with f(x̂) = 0, any (ε, δ)-LocalMin solution x* must admit value f(x*) ≤ ε.

The following theorem shows the hardness of LocalMin in the global regime in the black-box model, but with worse Lipschitzness and smoothness parameters than the ones in Theorem 10.3, and for this reason we present both of them.

Theorem 10.4.
In the worst case, Ω(2^d/d) value/gradient black-box queries are needed to determine an (ε, δ)-LocalMin solution for functions f : [0,1]^d → [0,1] with G = Θ(d), L = Θ(d²), ε < 1, and δ = √d.

Proof. The proof is based on the fact that, given just black-box access to a boolean formula φ : {0,1}^d ↦ {0,1}, at least Ω(2^d) queries are needed in order to determine whether φ admits a satisfying assignment. The term black-box access refers to the fact that the clauses of the formula are not given, and the only way to determine whether a specific boolean assignment is satisfying is by querying the specific binary string.

Given such black-box access to φ, we construct the function f_φ : [0,1]^d ↦ [0,1] as follows:

1. for each corner v ∈ V of the [0,1]^d hypercube, i.e. v ∈ {0,1}^d, we set f_φ(v) := 1 − φ(v);
2. for the rest of the points x ∈ [0,1]^d \ V, f_φ(x) := Σ_{v ∈ V} P_v(x) · f_φ(v), where P_v are the coefficients of Definition 8.9.

We remind that by Lemma 8.11 we get that ‖∇f_φ(x)‖ ≤ Θ(d) and ‖∇²f_φ(x)‖ ≤ Θ(d²), meaning that f_φ(·) is Θ(d)-Lipschitz and Θ(d²)-smooth. Moreover, by Lemma 8.10, for any x ∈ [0,1]^d the set V(x) = { v ∈ V : P_v(x) ≠ 0 } has cardinality at most d + 1, while at the same time Σ_{v ∈ V} P_v(x) = 1.

In case φ is not satisfiable, then f_φ(x) = 1 for all x ∈ [0,1]^d, since f_φ(v) = 1 for all v ∈ V. In case there exists a satisfying assignment v*, then f_φ(v*) = 0. Since δ ≥ √d, which is the diameter of [0,1]^d, any (ε, δ)-LocalMin solution x* must have f_φ(x*) ≤ ε < 1. Since f_φ(x*) ≜ Σ_{v ∈ V(x*)} P_v(x*) · f_φ(v) < 1, there exists at least one vertex v̂ ∈ V(x*) with f_φ(v̂) = 0, meaning that φ(v̂) = 1. As a result, given an (ε, δ)-LocalMin solution x* with f_φ(x*) < 1, we can find a satisfying assignment v̂ by querying φ(v) for each vertex v ∈ V(x*). Since |V(x*)| ≤ d + 1, this will take at most d + 1 queries.

To conclude, if an (ε, δ)-LocalMin solution could be determined with less than O(2^d/d) value/gradient queries, then determining whether φ admits a satisfying assignment could be done with less than O(2^d) queries on φ (the latter is obviously impossible). Notice that in any value/gradient query, both f_φ(x) and ∇f_φ(x) can be computed by querying the value f_φ(v) of the vertices v ∈ V(x). Since |V(x)| ≤ d + 1, any value/gradient query of f_φ can be simulated by d + 1 queries of φ.

Acknowledgements
This work was supported by NSF Awards IIS-1741137, CCF-1617730 and CCF-1901292, by a Simons Investigator Award, by the DOE PhILMs project (No. DE-AC05-76RL01830), and by the DARPA award HR00111990021. M.Z. was also supported by a Google Ph.D. Fellowship. S.S. was supported by NRF 2018 Fellowship NRF-NRFF2018-07.
References [AAZB +
17] Naman Agarwal, Zeyuan Allen-Zhu, Brian Bullins, Elad Hazan, and Tengyu Ma.Finding approximate local minima faster than gradient descent. In
Proceedings ofthe 49th Annual ACM SIGACT Symposium on Theory of Computing , pages 1195–1199,2017.[ACB17] Martin Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein generative ad-versarial networks. In
Proceedings of the 34th International Conference on MachineLearning-Volume 70 , pages 214–223, 2017.[Adl13] Ilan Adler. The equivalence of linear programs and zero-sum games.
InternationalJournal of Game Theory , 42(1):165–177, 2013.[ADLH19] Leonard Adolphs, Hadi Daneshmand, Aurelien Lucchi, and Thomas Hofmann. Lo-cal saddle point optimization: A curvature exploitation approach. In
The 22ndInternational Conference on Artificial Intelligence and Statistics , pages 486–495, 2019.[ADSG19] Mohammad Alkousa, Darina Dvinskikh, Fedor Stonyakin, and Alexander Gas-nikov. Accelerated methods for composite non-bilinear saddle point problem. arXivpreprint arXiv:1906.03620 , 2019.[ALW19] Jacob Abernethy, Kevin A Lai, and Andre Wibisono. Last-iterate convergence ratesfor min-max optimization. arXiv preprint arXiv:1906.02027 , 2019.50AMLJG20] Waïss Azizian, Ioannis Mitliagkas, Simon Lacoste-Julien, and Gauthier Gidel. Atight and unified analysis of extragradient for a whole spectrum of differentiablegames. In
Proceedings of the 23rd International Conference on Artificial Intelligence and Statistics (AISTATS), 2020.

[BCB12] Sébastien Bubeck and Nicolò Cesa-Bianchi. Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations and Trends in Machine Learning, 5(1):1–122, 2012.

[BCE+95] Paul Beame, Stephen A. Cook, Jeff Edmonds, Russell Impagliazzo, and Toniann Pitassi. The relative complexity of NP search problems. In Proceedings of the Twenty-Seventh Annual ACM Symposium on Theory of Computing, 29 May–1 June 1995, Las Vegas, Nevada, USA, pages 303–314, 1995.

[BIQ+17] Aleksandrs Belovs, Gábor Ivanyos, Youming Qiao, Miklos Santha, and Siyi Yang. On the polynomial parity argument complexity of the combinatorial nullstellensatz. In Proceedings of the 32nd Computational Complexity Conference, pages 1–24, 2017.

[Bla56] David Blackwell. An analog of the minimax theorem for vector payoffs. Pacific J. Math., 6(1):1–8, 1956.

[BPR15] Nir Bitansky, Omer Paneth, and Alon Rosen. On the cryptographic hardness of finding a nash equilibrium. In Proceedings of the 56th Annual Symposium on Foundations of Computer Science (FOCS), 2015.

[Bre76] Richard P. Brent. Fast multiple-precision evaluation of elementary functions.
Journal of the ACM (JACM), 23(2):242–251, 1976.

[CBL06] Nicolò Cesa-Bianchi and Gábor Lugosi. Prediction, Learning, and Games. Cambridge University Press, 2006.

[CDT09] Xi Chen, Xiaotie Deng, and Shang-Hua Teng. Settling the complexity of computing two-player nash equilibria. Journal of the ACM (JACM), 56(3):1–57, 2009.

[CPY17] Xi Chen, Dimitris Paparas, and Mihalis Yannakakis. The complexity of non-monotone markets. J. ACM, 64(3):20:1–20:56, 2017.

[Dan51] George B. Dantzig. A proof of the equivalence of the programming problem and the game problem. In T. C. Koopmans, editor, Activity Analysis of Production and Allocation. Wiley, New York, 1951.

[Das13] Constantinos Daskalakis. On the complexity of approximating a nash equilibrium. ACM Transactions on Algorithms (TALG), 9(3):1–35, 2013.

[Das18] Constantinos Daskalakis. Equilibria, Fixed Points, and Computational Complexity - Nevanlinna Prize Lecture. Proceedings of the International Congress of Mathematicians (ICM), 1:147–209, 2018.

[DFS20] Argyrios Deligkas, John Fearnley, and Rahul Savani. Tree polymatrix games are ppad-hard. CoRR, abs/2002.12119, 2020.

[DGP09] Constantinos Daskalakis, Paul W. Goldberg, and Christos H. Papadimitriou. The complexity of computing a nash equilibrium.
SIAM Journal on Computing, 39(1):195–259, 2009.

[DHS11] John Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12(Jul):2121–2159, 2011.

[DISZ18] Constantinos Daskalakis, Andrew Ilyas, Vasilis Syrgkanis, and Haoyang Zeng. Training gans with optimism. In International Conference on Learning Representations (ICLR 2018), 2018.

[DP11] Constantinos Daskalakis and Christos Papadimitriou. Continuous local search. In Proceedings of the Twenty-Second Annual ACM-SIAM Symposium on Discrete Algorithms, pages 790–804. SIAM, 2011.

[DP18] Constantinos Daskalakis and Ioannis Panageas. The limit points of (optimistic) gradient descent in min-max optimization. In Advances in Neural Information Processing Systems, pages 9236–9246, 2018.

[DP19] Constantinos Daskalakis and Ioannis Panageas. Last-iterate convergence: Zero-sum games and constrained min-max optimization. Innovations in Theoretical Computer Science, 2019.

[DTZ18] Constantinos Daskalakis, Christos Tzamos, and Manolis Zampetakis. A converse to banach's fixed point theorem and its CLS-completeness. In Proceedings of the 50th Annual ACM SIGACT Symposium on Theory of Computing (STOC), 2018.

[EL73] Paul Erdős and László Lovász. Problems and results on 3-chromatic hypergraphs and some related questions. In Colloquia Mathematica Societatis Janos Bolyai 10. Infinite and Finite Sets, Keszthely (Hungary). Citeseer, 1973.

[EY10] Kousha Etessami and Mihalis Yannakakis. On the complexity of nash equilibria and other fixed points. SIAM Journal on Computing, 39(6):2531–2597, 2010.

[FG18] Aris Filos-Ratsikas and Paul W. Goldberg. Consensus halving is ppa-complete. In Proceedings of the 50th Annual ACM SIGACT Symposium on Theory of Computing (STOC), 2018.

[FG19] Aris Filos-Ratsikas and Paul W. Goldberg. The complexity of splitting necklaces and bisecting ham sandwiches. In Proceedings of the 51st Annual ACM SIGACT Symposium on Theory of Computing (STOC), 2019.

[FP07] Francisco Facchinei and Jong-Shi Pang. Finite-Dimensional Variational Inequalities and Complementarity Problems. Springer Science & Business Media, 2007.

[FPT04] Alex Fabrikant, Christos H. Papadimitriou, and Kunal Talwar. The complexity of pure nash equilibria. In Proceedings of the 36th Annual ACM Symposium on Theory of Computing (STOC), 2004.

[FRHSZ20a] Aris Filos-Ratsikas, Alexandros Hollender, Katerina Sotiraki, and Manolis Zampetakis. Consensus-halving: Does it ever get easier? arXiv preprint arXiv:2002.11437, 2020.

[FRHSZ20b] Aris Filos-Ratsikas, Alexandros Hollender, Katerina Sotiraki, and Manolis Zampetakis. A topological characterization of modulo-p arguments and implications for necklace splitting. arXiv preprint arXiv:2003.11974, 2020.

[GH19] Paul W. Goldberg and Alexandros Hollender. The hairy ball problem is ppad-complete. In
Proceedings of the 46th International Colloquium on Automata, Languages, and Programming (ICALP), 2019.

[GHP+19] Gauthier Gidel, Reyhane Askari Hemmat, Mohammad Pezeshki, Rémi Le Priol, Gabriel Huang, Simon Lacoste-Julien, and Ioannis Mitliagkas. Negative momentum for improved game dynamics. In The 22nd International Conference on Artificial Intelligence and Statistics, pages 1802–1811, 2019.

[GKSZ19] Mika Göös, Pritish Kamath, Katerina Sotiraki, and Manolis Zampetakis. On the complexity of modulo-q arguments and the chevalley-warning theorem. arXiv preprint arXiv:1912.04467, 2019.

[Goo16] Ian Goodfellow. Nips 2016 tutorial: Generative adversarial networks. arXiv preprint arXiv:1701.00160, 2016.

[GPDO20] Noah Golowich, Sarath Pattathil, Constantinos Daskalakis, and Asuman E. Ozdaglar. Last iterate is slower than averaged iterate in smooth convex-concave saddle point problems. CoRR, abs/2002.00057, 2020.

[GPM+14] Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron C. Courville, and Yoshua Bengio. Generative Adversarial Nets. In Advances in Neural Information Processing Systems 27: Annual Conference on Neural Information Processing Systems 2014, December 8-13 2014, Montreal, Quebec, Canada, pages 2672–2680, 2014.

[HA18] Erfan Yazdandoost Hamedani and Necdet Serhat Aybat. A primal-dual algorithm for general convex-concave saddle point problems. arXiv preprint arXiv:1803.01401, 2018.

[Haz16] Elad Hazan. Introduction to online convex optimization. Foundations and Trends in Optimization, 2(3-4):157–325, 2016.

[HPV89] M. D. Hirsch, C. H. Papadimitriou, and S. A. Vavasis. Exponential lower bounds for finding brouwer fixed points. Journal of Complexity, 5:379–416, 1989.

[Jeř16] Emil Jeřábek. Integer factoring and modular square roots. Journal of Computer and System Sciences, 82(2):380–394, 2016.

[JGN+17] Chi Jin, Rong Ge, Praneeth Netrapalli, Sham M. Kakade, and Michael I. Jordan. How to escape saddle points efficiently. In Proceedings of the 34th International Conference on Machine Learning, Volume 70, pages 1724–1732. JMLR.org, 2017.

[JNJ19] Chi Jin, Praneeth Netrapalli, and Michael I. Jordan. What is local optimality in nonconvex-nonconcave minimax optimization? arXiv preprint arXiv:1902.00618, 2019.

[JPY88] David S. Johnson, Christos H. Papadimitriou, and Mihalis Yannakakis. How easy is local search?
Journal of Computer and System Sciences, 37(1):79–100, 1988.

[KB14] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

[KM18] Pravesh K. Kothari and Ruta Mehta. Sum-of-squares meets Nash: lower bounds for finding any equilibrium. In Proceedings of the 50th Annual ACM SIGACT Symposium on Theory of Computing (STOC), 2018.

[KM19] Weiwei Kong and Renato D.C. Monteiro. An accelerated inexact proximal point method for solving nonconvex-concave min-max problems. arXiv preprint arXiv:1905.13433, 2019.

[Kor76] G. M. Korpelevich. The extragradient method for finding saddle points and other problems. Matecon, 12:747–756, 1976.

[LJJ19] Tianyi Lin, Chi Jin, and Michael I. Jordan. On gradient descent ascent for nonconvex-concave minimax problems. arXiv preprint arXiv:1906.00331, 2019.

[LJJ20] Tianyi Lin, Chi Jin, and Michael Jordan. Near-optimal algorithms for minimax optimization. arXiv preprint arXiv:2002.02417, 2020.

[LPP+19] Jason D. Lee, Ioannis Panageas, Georgios Piliouras, Max Simchowitz, Michael I. Jordan, and Benjamin Recht. First-order methods almost always avoid strict saddle points. Math. Program., 176(1-2):311–337, 2019.

[LS19] Tengyuan Liang and James Stokes. Interaction matters: A note on non-asymptotic local convergence of generative adversarial networks. In The 22nd International Conference on Artificial Intelligence and Statistics, pages 907–915, 2019.

[LTHC19] Songtao Lu, Ioannis Tsaknakis, Mingyi Hong, and Yongxin Chen. Hybrid block successive approximation for one-sided non-convex min-max problems: algorithms and applications. arXiv preprint arXiv:1902.08294, 2019.

[Meh14] Ruta Mehta. Constant rank bimatrix games are ppad-hard. In Proceedings of the 46th Symposium on Theory of Computing (STOC), 2014.

[MGN18] Lars Mescheder, Andreas Geiger, and Sebastian Nowozin. Which training methods for gans do actually converge? In International Conference on Machine Learning, pages 3481–3490, 2018.

[MMS+18] Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu. Towards deep learning models resistant to adversarial attacks. In International Conference on Learning Representations, 2018.

[MOP19] Aryan Mokhtari, Asuman Ozdaglar, and Sarath Pattathil. A unified analysis of extra-gradient and optimistic gradient methods for saddle point problems: Proximal point approach. arXiv preprint arXiv:1901.08511, 2019.

[Mos09] Robin A. Moser. A constructive proof of the lovász local lemma. In Proceedings of the Forty-First Annual ACM Symposium on Theory of Computing, pages 343–350, 2009.

[MP89] N. Megiddo and C. H. Papadimitriou. A note on total functions, existence theorems, and computational complexity. Technical report, IBM, 1989.

[MPP18] Panayotis Mertikopoulos, Christos H. Papadimitriou, and Georgios Piliouras. Cycles in adversarial regularized learning. In Proceedings of the Twenty-Ninth Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), 2018.

[MPPSD16] Luke Metz, Ben Poole, David Pfau, and Jascha Sohl-Dickstein. Unrolled generative adversarial networks. arXiv preprint arXiv:1611.02163, 2016.

[MR18] Eric Mazumdar and Lillian J. Ratliff. On the convergence of gradient-based learning in continuous games. arXiv preprint arXiv:1804.05464, 2018.

[MSV20] Oren Mangoubi, Sushant Sachdeva, and Nisheeth K. Vishnoi. A provably convergent and practical algorithm for min-max optimization with applications to gans. arXiv preprint arXiv:2006.12376, 2020.

[MT10] Robin A. Moser and Gábor Tardos. A constructive proof of the general lovász local lemma.
Journal of the ACM (JACM), 57(2):1–15, 2010.

[MV20] Oren Mangoubi and Nisheeth K. Vishnoi. A second-order equilibrium in nonconvex-nonconcave min-max optimization: Existence and algorithm. arXiv preprint arXiv:2006.12363, 2020.

[Nem04] Arkadi Nemirovski. Interior point polynomial time methods in convex programming. Lecture notes, 2004.

[NSH+19] Maher Nouiehed, Maziar Sanjabi, Tianjian Huang, Jason D. Lee, and Meisam Razaviyayn. Solving a class of non-convex min-max games using iterative first order methods. In Advances in Neural Information Processing Systems, pages 14905–14916, 2019.

[NY83] Arkadiĭ Semenovich Nemirovsky and David Borisovich Yudin. Problem Complexity and Method Efficiency in Optimization. Chichester: Wiley, 1983.

[OX19] Yuyuan Ouyang and Yangyang Xu. Lower complexity bounds of first-order methods for convex-concave bilinear saddle-point problems. Mathematical Programming, pages 1–35, 2019.

[Pap94a] C. Papadimitriou. Computational Complexity. Addison-Wesley, 1994.

[Pap94b] Christos H. Papadimitriou. On the complexity of the parity argument and other inefficient proofs of existence. Journal of Computer and System Sciences, 48(3):498–532, 1994.

[RKK18] Sashank J. Reddi, Satyen Kale, and Sanjiv Kumar. On the convergence of adam and beyond. In
Proceedings of the 6th International Conference on Learning Representations (ICLR), 2018.

[RLLY18] Hassan Rafique, Mingrui Liu, Qihang Lin, and Tianbao Yang. Non-convex min-max optimization: Provable algorithms and applications in machine learning. arXiv preprint arXiv:1810.02060, 2018.

[Ros65] J. Ben Rosen. Existence and uniqueness of equilibrium points for concave n-person games. Econometrica: Journal of the Econometric Society, pages 520–534, 1965.

[Rub15] Aviad Rubinstein. Inapproximability of nash equilibrium. In Proceedings of the Forty-Seventh Annual ACM Symposium on Theory of Computing (STOC), 2015.

[Rub16] Aviad Rubinstein. Settling the complexity of computing approximate two-player nash equilibria. In 57th Annual IEEE Symposium on Foundations of Computer Science (FOCS), pages 258–265. IEEE, 2016.

[SS12] Shai Shalev-Shwartz. Online learning and online convex optimization. Foundations and Trends in Machine Learning, 4(2):107–194, 2012.

[SSBD14] Shai Shalev-Shwartz and Shai Ben-David. Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press, 2014.

[SY91] Alejandro A. Schäffer and Mihalis Yannakakis. Simple local search problems that are hard to solve. SIAM J. Comput., 20(1):56–87, 1991.

[SZZ18] Katerina Sotiraki, Manolis Zampetakis, and Giorgos Zirdelis. Ppp-completeness with connections to cryptography. In Proceedings of the 59th IEEE Annual Symposium on Foundations of Computer Science (FOCS), 2018.

[TJNO19] Kiran K. Thekumparampil, Prateek Jain, Praneeth Netrapalli, and Sewoong Oh. Efficient algorithms for smooth minimax optimization. In Advances in Neural Information Processing Systems, pages 12659–12670, 2019.

[vN28] John von Neumann. Zur Theorie der Gesellschaftsspiele. Math. Ann., pages 295–320, 1928.

[VY11] Vijay V. Vazirani and Mihalis Yannakakis. Market equilibrium under separable, piecewise-linear, concave utilities. J. ACM, 58(3):10:1–10:25, 2011.

[WZB19] Yuanhao Wang, Guodong Zhang, and Jimmy Ba. On solving minimax optimization locally: A follow-the-ridge approach. In International Conference on Learning Representations, 2019.

[Zha19] Renbo Zhao. Optimal algorithms for stochastic three-composite convex-concave saddle point problems. arXiv preprint arXiv:1903.01687, 2019.
A Proof of Theorem 4.1
We first remind the definition of the 3-SAT(3) problem that we will use for our reduction.

3-SAT(3).
Input: A boolean CNF-formula φ with boolean variables x_1, . . . , x_n such that every clause of φ has at most 3 boolean variables and every boolean variable appears in at most 3 clauses.
Output: An assignment x ∈ {0, 1}^n that satisfies φ, or ⊥ if no such assignment exists.

It is well known that 3-SAT(3) is FNP-complete; for details see §9.2 of [Pap94a]. To prove Theorem 4.1, we reduce 3-SAT(3) to ε-StationaryPoint.

Given an instance of 3-SAT(3) we construct a function f : [0, 1]^{n+m} → R, where m is the number of clauses of φ. For each literal x_i we assign a real-valued variable which, by abuse of notation, we also denote x_i; it will be clear from the context whether we refer to the literal or to the real-valued variable. Then for each clause φ_j of φ we construct a polynomial P_j(x) as follows: if ℓ_i, ℓ_k, ℓ_m are the literals participating in φ_j, then P_j(x) = P_{ji}(x) · P_{jk}(x) · P_{jm}(x), where

    P_{ji}(x) = 1 − x_i   if ℓ_i = x_i,
    P_{ji}(x) = x_i       if ℓ_i = ¬x_i.

The overall constructed function is f(x, w) = ∑_{j=1}^{m} w_j · P_j(x), where each w_j is an additional variable associated with clause φ_j. Notice that 0 ≤ ∂f(x, w)/∂w_j ≤ 1 and −3 ≤ ∂f(x, w)/∂x_i ≤ 3, since every variable x_i participates in at most 3 clauses. As a result, ‖∇f(x, w)‖ ≤ Θ(√(n + m)), meaning that f(x, w) is G-Lipschitz with G = Θ(√(n + m)). Also notice that all the entries of ∇²f(x, w), i.e. ∂²f(x, w)/∂x_i ∂w_j, ∂²f(x, w)/∂x_i ∂x_m, and ∂²f(x, w)/∂w_k ∂w_j, lie in [−3, 3]. As a result, ‖∇²f(x, w)‖ ≤ Θ(n + m), meaning that f(x, w) is L-smooth with L = Θ(n + m).

Lemma A.1.
There exists a satisfying assignment for the clauses φ_1, . . . , φ_m if and only if the constructed StationaryPoint instance with ε = 1/24 admits a solution (x⋆, w⋆) ∈ [0, 1]^{n+m} such that ‖∇f(x⋆, w⋆)‖ < 1/24.

Proof. By the definition of StationaryPoint, in case there exists a pair of points (x̂, ŵ) ∈ [0, 1]^{n+m} with ‖∇f(x̂, ŵ)‖ < ε/2 = 1/48, a solution (x⋆, w⋆) with ‖∇f(x⋆, w⋆)‖ < ε = 1/24 must be returned; otherwise, if ‖∇f(x, w)‖ > ε/2 = 1/48 for all (x, w) ∈ [0, 1]^{n+m}, the null symbol ⊥ is returned.

Let us assume that there exists a satisfying assignment of φ. Consider the solution (x̂, ŵ) constructed as follows: each variable x̂_i is set to 1 iff the respective boolean variable is true, and ŵ_j = 0 for all j = 1, . . . , m. Since the assignment satisfies the CNF-formula φ, there exists at least one true literal in each clause φ_j, which means that P_j(x̂) = 0 for all j = 1, . . . , m. As a result, ∂f(x̂, ŵ)/∂w_j = P_j(x̂) = 0 for all j = 1, . . . , m. At the same time, ∂f(x̂, ŵ)/∂x_i = ∑_{j=1}^{m} ŵ_j · ∂P_j(x̂)/∂x_i = 0 for every i, since ŵ_j = 0 for all j = 1, . . . , m.
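The forward direction above is easy to check numerically. The sketch below is our own illustration (the toy formula, the clause encoding, and all helper names are ours, not the paper's): it builds f(x, w) = ∑_j w_j · P_j(x) for a tiny 3-SAT(3) instance, verifies that a satisfying assignment together with ŵ = 0 is an exact stationary point, and sanity-checks the Lovász Local Lemma constant used in the opposite direction.

```python
import math

# Toy instance (our own example): phi = (x1 ∨ ¬x2) ∧ (x2 ∨ x3),
# encoded as lists of (variable index, is_positive) pairs.
clauses = [[(0, True), (1, False)], [(1, True), (2, True)]]

def P(j, x):
    """P_j(x): product of (1 - x_i) for positive literals and x_i for
    negated ones; on 0/1 points it vanishes iff clause j is satisfied."""
    out = 1.0
    for i, positive in clauses[j]:
        out *= (1.0 - x[i]) if positive else x[i]
    return out

def f(x, w):
    """The constructed objective f(x, w) = sum_j w_j * P_j(x)."""
    return sum(w[j] * P(j, x) for j in range(len(clauses)))

def grad(x, w, h=1e-6):
    """Central-difference gradient over all n + m coordinates; exact up
    to rounding because f is linear in each single coordinate."""
    z, n, g = list(x) + list(w), len(x), []
    for k in range(len(z)):
        zp, zm = z[:], z[:]
        zp[k] += h; zm[k] -= h
        g.append((f(zp[:n], zp[n:]) - f(zm[:n], zm[n:])) / (2.0 * h))
    return g

# Satisfying assignment (x1=T, x2=T, x3=F) with w = 0 is an exact
# stationary point: every P_j vanishes and every w_j is 0.
x_hat, w_hat = [1.0, 1.0, 0.0], [0.0, 0.0]
assert all(abs(gk) < 1e-5 for gk in grad(x_hat, w_hat))

# Sanity check of the constants in the reverse direction: the symmetric
# Lovász Local Lemma requires e * p * (d + 1) <= 1, here with bad-event
# probability p = 1/24 and dependency degree d = 6.
assert math.e * (1.0 / 24.0) * 7.0 <= 1.0
```

Since f is linear in every single coordinate, the central differences are exact up to rounding, so the zero-gradient check is not an artifact of the step size h.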
Overall we have that ∇f(x̂, ŵ) = 0 and hence ‖∇f(x̂, ŵ)‖ = 0 < 1/48 = ε/2. As a result, the constructed StationaryPoint instance must return a solution (x⋆, w⋆) with ‖∇f(x⋆, w⋆)‖ < 1/24 = ε.

In the opposite direction, the existence of a pair of points (x⋆, w⋆) with ‖∇f(x⋆, w⋆)‖ < 1/24 implies that P_j(x⋆) < 1/24 for all j = 1, . . . , m. Consider the probability distribution over the boolean assignments in which each boolean variable x_i is independently selected to be true with probability x⋆_i. Then, P(clause φ_j is not satisfied) = P_j(x⋆) < 1/24. Since φ_j shares variables with at most 6 other clauses, the bad event of φ_j not being satisfied is dependent on at most 6 other bad events. By the Lovász Local Lemma [EL73], we get that the probability that none of the bad events occurs is positive. As a result, there exists a satisfying assignment.

Using Lemma A.1 we can conclude that φ is satisfiable if and only if f has a 1/24-approximate stationary point. What is left to prove the FNP-hardness is to show how we can find a satisfying assignment of φ given an approximate stationary point of f. This can be done using the celebrated results that provide constructive proofs of the Lovász Local Lemma [Mos09, MT10]. Finally, we remind that the constructed function f is Θ(√d)-Lipschitz and Θ(d)-smooth, where d is the number of variables, which is equal to n + m.

B Missing Proofs from Section 5
In this section we give proofs for the statements presented in Section 5. These statements establish the totality and the inclusion in PPAD of LR-LocalMinMax and GDAFixedPoint.

B.1 Proof of Theorem 5.1
We start by establishing claim "1." in the statement of the theorem. It will be clear that our proof provides a polynomial-time reduction from LR-LocalMinMax to GDAFixedPoint. Suppose that (x⋆, y⋆) is an α-approximate fixed point of F_GDA, where α is the function of δ, G and L specified in the theorem statement. To simplify our proof, we abuse notation and define f(x) ≜ f(x, y⋆), ∇f(x) ≜ ∇_x f(x, y⋆), K ≜ {x | (x, y⋆) ∈ P(A, b)} and x̂ ≜ Π_K(x⋆ − ∇f(x⋆)). Because (x⋆, y⋆) is an α-approximate fixed point of F_GDA, it follows that ‖x̂ − x⋆‖ < α.

Claim B.1. ⟨∇f(x⋆), x⋆ − x⟩ < (G + δ + α) · α, for all x ∈ K ∩ B_d(δ; x⋆).

Proof. Using the fact that x̂ = Π_K(x⋆ − ∇f(x⋆)) and that K is a convex set, we can apply Theorem 1.5.5 (b) of [FP07] to get that

    ⟨x⋆ − ∇f(x⋆) − x̂, x − x̂⟩ ≤ 0 for all x ∈ K.    (B.1)

Next, we do some simple algebra to get that, for all x ∈ K ∩ B_d(δ; x⋆),

    ⟨∇f(x⋆), x⋆ − x⟩ = ⟨x⋆ − ∇f(x⋆) − x̂, x − x̂⟩ + ⟨x − x̂ − ∇f(x⋆), x̂ − x⋆⟩
                    ≤ ⟨x − x̂ − ∇f(x⋆), x̂ − x⋆⟩          (by (B.1))
                    ≤ (‖x − x̂‖ + ‖∇f(x⋆)‖) · ‖x̂ − x⋆‖
                    < (G + δ + α) · α,

where the second-to-last inequality follows from the Cauchy–Schwarz inequality and the triangle inequality, and the last inequality follows from the triangle inequality and the following facts: (1) ‖x⋆ − x̂‖ < α, (2) x ∈ B_d(δ; x⋆), and (3) ‖∇f(x, y)‖ ≤ G for all (x, y) ∈ P(A, b).

For all x ∈ K ∩ B_d(δ; x⋆), from the L-smoothness of f we have that

    |f(x) − (f(x⋆) + ⟨∇f(x⋆), x − x⋆⟩)| ≤ (L/2) · ‖x − x⋆‖².    (B.2)

We distinguish two cases:

1. f(x⋆) ≤ f(x): In this case we stop, remembering that

    f(x⋆) ≤ f(x).    (B.3)

2. f(x⋆) > f(x): In this case, we consider two further sub-cases:

(a) ⟨∇f(x⋆), x − x⋆⟩ ≥ 0: in this sub-case, Eq. (B.2) gives f(x⋆) − f(x) + ⟨∇f(x⋆), x − x⋆⟩ ≤ (L/2) · ‖x − x⋆‖². Thus

    f(x⋆) ≤ f(x) + (L/2) · ‖x − x⋆‖² ≤ f(x) + (L/2) · δ² < f(x) + ε,    (B.4)

where for the last inequality we used that x ∈ B_d(δ; x⋆), and that δ < √(ε/L).

(b) ⟨∇f(x⋆), x − x⋆⟩ < 0: in this sub-case, Eq. (B.2) gives f(x⋆) − f(x) − ⟨∇f(x⋆), x⋆ − x⟩ ≤ (L/2) · ‖x − x⋆‖². Thus

    f(x⋆) ≤ f(x) + ⟨∇f(x⋆), x⋆ − x⟩ + (L/2) · ‖x − x⋆‖²
         ≤ f(x) + ⟨∇f(x⋆), x⋆ − x⟩ + (L/2) · δ²
         < f(x) + (G + δ + α) · α + (L/2) · δ² ≤ f(x) + ε,    (B.5)

where the second inequality follows from the fact that x ∈ B_d(δ; x⋆), the third inequality follows from Claim B.1, and the last inequality follows from the constraints δ < √(ε/L) and α ≤ √((G + δ)² + (ε − (L/2)δ²)) − (G + δ).

In all cases, we get from (B.3), (B.4) and (B.5) that f(x⋆) < f(x) + ε, for all x ∈ K ∩ B_d(δ; x⋆). Thus, lifting our abuse of notation, we get that f(x⋆, y⋆) < f(x, y⋆) + ε for all x ∈ {x | x ∈ B_d(δ; x⋆) and (x, y⋆) ∈ P(A, b)}. Using an identical argument we can also show that f(x⋆, y⋆) > f(x⋆, y) − ε for all y ∈ {y | y ∈ B_d(δ; y⋆) and (x⋆, y) ∈ P(A, b)}. The first part of the theorem follows.

Now let us establish claim "2." in the theorem statement. It will be clear that our proof provides a polynomial-time reduction from GDAFixedPoint to LR-LocalMinMax. For the choice of parameters ε and δ described in the theorem statement, we will show that, if (x⋆, y⋆) is an (ε, δ)-local min-max equilibrium of f, then ‖F_GDA,x(x⋆, y⋆) − x⋆‖ < α/2 and ‖F_GDA,y(x⋆, y⋆) − y⋆‖ < α/2. The second part of the theorem will then follow. We only prove that ‖F_GDA,x(x⋆, y⋆) − x⋆‖ < α/2, as the argument for y⋆ is identical. In the argument below we abuse notation in the same way we described earlier.
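Before diving into that argument, it may help to see the projected GDA map in code. The following sketch is a toy illustration of ours, with the box [0, 1]² standing in for the polytope P(A, b) and a hypothetical convex-concave objective f(x, y) = (x − 0.3)² − (y − 0.7)²; it is not the paper's construction.

```python
def clip01(t):
    """Projection onto the interval [0, 1]; the box [0, 1]^2 plays the
    role of the feasible polytope P(A, b) in this toy example."""
    return min(1.0, max(0.0, t))

def F_GDA(x, y, gx, gy):
    """One projected Gradient Descent/Ascent step: a descent step in x
    and an ascent step in y, each followed by projection to the box."""
    return clip01(x - gx(x, y)), clip01(y + gy(x, y))

# Hypothetical objective f(x, y) = (x - 0.3)**2 - (y - 0.7)**2 and its
# partial derivatives (convex in x, concave in y):
gx = lambda x, y: 2.0 * (x - 0.3)   # df/dx
gy = lambda x, y: -2.0 * (y - 0.7)  # df/dy

# (0.3, 0.7) is the global min-max point, and it is an exact fixed point
# of F_GDA: both gradients vanish and the projection is idle.
assert F_GDA(0.3, 0.7, gx, gy) == (0.3, 0.7)

# A point far from the equilibrium moves under the map (here it is even
# pushed onto the boundary by the projection):
assert F_GDA(1.0, 0.0, gx, gy) == (0.0, 1.0)
```

The quantity the proof controls, ‖F_GDA(x, y) − (x, y)‖, is exactly the residual of this update map: an (ε, δ)-local min-max point makes it small, and vice versa.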
With that notation, we will show that ‖x̂ − x⋆‖ < α/2.

Proof that ‖x̂ − x⋆‖ < α/2. From our choice of ε and δ, it is easy to see that δ = α/(3(L + 1)) < α/2. Thus, if ‖x̂ − x⋆‖ < δ, then we automatically get ‖x̂ − x⋆‖ < α/2. So it remains to handle the case ‖x̂ − x⋆‖ ≥ δ. We choose x_c ≜ x⋆ + δ · (x̂ − x⋆)/‖x̂ − x⋆‖. It is easy to see that x_c ∈ B_d(δ; x⋆), and hence we get that

    f(x⋆) − ε < f(x_c) ≤ f(x⋆) + ⟨∇f(x⋆), x_c − x⋆⟩ + (L/2) · ‖x_c − x⋆‖² ≤ f(x⋆) + ⟨∇f(x⋆), x_c − x⋆⟩ + ε/2,

where the first inequality holds because (x⋆, y⋆) is an (ε, δ)-local min-max equilibrium, the second inequality follows from the L-smoothness of f, and the third inequality follows from ‖x_c − x⋆‖ ≤ δ and our choice of δ = √(ε/L). The above implies ⟨∇f(x⋆), x⋆ − x_c⟩ < 3ε/2. Since x̂ − x⋆ = (x_c − x⋆) · ‖x̂ − x⋆‖/δ, we get that ⟨∇f(x⋆), x⋆ − x̂⟩ < (3ε)/(2δ) · ‖x⋆ − x̂‖. Therefore

    ‖x⋆ − x̂‖² = ⟨x⋆ − ∇f(x⋆) − x̂, x⋆ − x̂⟩ + ⟨∇f(x⋆), x⋆ − x̂⟩ < (3ε)/(2δ) · ‖x⋆ − x̂‖,

where in the above inequality we have also used (B.1). As a result, ‖x⋆ − x̂‖ < 3ε/(2δ) = (3L/2) · δ = αL/(2(L + 1)) < α/2.

B.2 Proof of Theorem 5.2
We provide a polynomial-time reduction from GDAFixedPoint to Brouwer. This establishes both the totality of GDAFixedPoint and its inclusion in PPAD, since Brouwer is both total and lies in PPAD, as per Lemma 2.5. It also establishes the totality and the inclusion in PPAD of LR-LocalMinMax, since LR-LocalMinMax is polynomial-time reducible to GDAFixedPoint, as shown in Theorem 5.1.

We proceed to describe our reduction. Suppose that f is the G-Lipschitz and L-smooth function provided as input to GDAFixedPoint. Suppose also that α is the approximation parameter provided as input to GDAFixedPoint. Given f and α, we define the function M : P(A, b) → P(A, b), which serves as input to Brouwer, as follows:

    M(x, y) = Π_P(A,b)[(x − ∇_x f(x, y), y + ∇_y f(x, y))].

Given that f is L-smooth, it follows that M is (L + 1)-Lipschitz. We set the approximation parameter provided as input to Brouwer to be γ = α²/(4(G + 2√d)).

To show the validity of the afore-described reduction, we prove that every feasible point (x⋆, y⋆) ∈ P(A, b) that is a γ-approximate fixed point of M, i.e. ‖M(x⋆, y⋆) − (x⋆, y⋆)‖ < γ, is also an α-approximate fixed point of F_GDA. Observe that since P(A, b) ⊆ [0, 1]^d it holds that ‖(x, y) − (x′, y′)‖ ≤ √d for all (x, y), (x′, y′) ∈ P(A, b). Hence, if γ > √d, then finding γ-approximate fixed points of M is trivial, and the same is true for finding α-approximate fixed points of F_GDA, since γ = α²/(4(G + 2√d)) implies that, if γ > √d, then α > √d. Thus, we may assume that γ ≤ √d.

Next, to simplify notation, we define (x_Δ, y_Δ) = (x⋆ − ∇_x f(x⋆, y⋆), y⋆ + ∇_y f(x⋆, y⋆)) and (x̂, ŷ) = argmin_{(x,y) ∈ P(A,b)} ‖(x_Δ, y_Δ) − (x, y)‖. Given that (x⋆, y⋆) is a γ-approximate fixed point of M, we have that

    ‖(x⋆, y⋆) − (x̂, ŷ)‖ < γ.    (B.6)

Using Theorem 1.5.5 (b) of [FP07], we get that

    ⟨(x_Δ, y_Δ) − (x̂, ŷ), (x, y) − (x̂, ŷ)⟩ ≤ 0 for all (x, y) ∈ P(A, b).    (B.7)

Next we show the following:

Claim B.2.
For all (x, y) ∈ P(A, b), ⟨(x_Δ, y_Δ) − (x⋆, y⋆), (x, y) − (x⋆, y⋆)⟩ < (G + 2√d) · γ.

Proof. We have that:

    ⟨(x_Δ, y_Δ) − (x⋆, y⋆), (x, y) − (x⋆, y⋆)⟩
      = ⟨(x_Δ, y_Δ) − (x̂, ŷ), (x, y) − (x⋆, y⋆)⟩ + ⟨(x̂, ŷ) − (x⋆, y⋆), (x, y) − (x⋆, y⋆)⟩
      = ⟨(x_Δ, y_Δ) − (x̂, ŷ), (x, y) − (x̂, ŷ)⟩ + ⟨(x_Δ, y_Δ) − (x̂, ŷ), (x̂, ŷ) − (x⋆, y⋆)⟩
        + ⟨(x̂, ŷ) − (x⋆, y⋆), (x, y) − (x⋆, y⋆)⟩
      < ‖(x_Δ, y_Δ) − (x̂, ŷ)‖ · γ + γ · √d
      ≤ (‖(x_Δ, y_Δ) − (x⋆, y⋆)‖ + γ) · γ + γ · √d
      = ‖∇f(x⋆, y⋆)‖ · γ + γ² + γ · √d
      ≤ (G + 2√d) · γ,

where (1) for the first inequality we use (B.6), (B.7), the Cauchy–Schwarz inequality, and the fact that the ℓ₂-diameter of P(A, b) is at most √d; (2) for the second inequality we use the triangle inequality and (B.6); (3) for the equality that follows we use the definition of (x_Δ, y_Δ); and (4) for the last inequality we use that G, the Lipschitzness of f, bounds the magnitude of its gradient, and that γ ≤ √d.

Now let x′ = argmin_{x ∈ K(y⋆)} ‖x − x_Δ‖, where K(y⋆) = {x | (x, y⋆) ∈ P(A, b)}. Using Theorem 1.5.5 (b) of [FP07] for x′ we get that ⟨x_Δ − x′, x⋆ − x′⟩ ≤ 0. Using Claim B.2 for the vector (x′, y⋆) ∈ P(A, b) we get that ⟨x⋆ − x_Δ, x⋆ − x′⟩ < (G + 2√d) · γ. Adding the last two inequalities and using the fact that γ = α²/(4(G + 2√d)), we get the following:

    ‖x⋆ − Π_K(y⋆)(x⋆ − ∇_x f(x⋆, y⋆))‖ < √((G + 2√d) · γ) = α/2.

Using the exact same reasoning we can also prove that ‖y⋆ − Π_K(x⋆)(y⋆ + ∇_y f(x⋆, y⋆))‖ < α/2, where K(x⋆) = {y | (x⋆, y) ∈ P(A, b)}. Combining the last two inequalities we get that (x⋆, y⋆) is an α-approximate fixed point of F_GDA.

C Missing Proofs from Section 8
In this section we present the missing proofs from Section 8; more precisely, in the following subsections we prove Lemmas 8.10, 8.11, and 8.12. For the rest of the proofs in this section we define L(c) to be the cubelet which has down-left corner equal to c, formally

    L(c) = [c_1/(N − 1), (c_1 + 1)/(N − 1)] × · · · × [c_d/(N − 1), (c_d + 1)/(N − 1)],

and we also define L_c(c) to be the set of corners of the cubelet L(c), or more formally L_c(c) = {c_1, c_1 + 1} × · · · × {c_d, c_d + 1}.

C.1 Proof of Lemma 8.10

We start with a lemma about the differentiability properties of the functions Q_cv which we defined in Definition 8.7.

Lemma C.1.
Let x ∈ [0, 1]^d lie in the cubelet R(x) = [c_1/(N−1), (c_1+1)/(N−1)] × ⋯ × [c_d/(N−1), (c_d+1)/(N−1)], where c ∈ ([N]−1)^d. Then for any vertex v ∈ R_c(x), the function Q^c_v(x) is continuous and twice differentiable. Moreover, if Q^c_v(x) = 0 then also dQ^c_v(x)/dx_i = 0 and d²Q^c_v(x)/dx_i dx_j = 0.

Proof. We recall from Definition 8.7 that if we let s^c = (s_1, …, s_d) be the source vertex of R(x) and p^c_x = (p_1, …, p_d) be the canonical representation of x, then for each vertex v ∈ R_c(x) we define the following partition of the set of coordinates [d]:

A^c_v = { j : |v_j − s_j| = 0 }   and   B^c_v = { j : |v_j − s_j| = 1 }.

Now in case B^c_v = ∅, which corresponds to v being the source vertex s^c, we have Q^c_v(x) = ∏_{j=1}^d S_∞(1 − S(p_j)), which is clearly differentiable as a product of compositions of differentiable functions. The exact same holds for A^c_v = ∅, which corresponds to v being the target vertex t^c of the cubelet R(x). We thus focus on the case where A^c_v, B^c_v ≠ ∅. To simplify notation we denote Q^c_v(x) by Q(x), A^c_v by A, and B^c_v by B for the rest of this proof. We prove that in case i ∈ B the derivative ∂Q(x)/∂x_i always exists; the case i ∈ A then follows symmetrically. We have the following cases.

▶ Let j ∈ A and ℓ ∈ B \ {i} be such that p_j ≥ p_ℓ. By Definition 8.7, if ε is sufficiently small then Q(x_i − ε, x_{−i}) = Q(x_i + ε, x_{−i}) = Q(x_i, x_{−i}) = 0. Thus ∂Q(x)/∂x_i exists and equals 0.

▶ Let p_ℓ > p_j for all ℓ ∈ B \ {i} and j ∈ A. In this case we have the following subcases.

  ▷ p_i > p_j for all j ∈ A: Then ∂Q(x)/∂x_i exists since both S_∞(·) and S(·) are differentiable.

  ▷ p_i < p_j for some j ∈ A: By Definition 8.7, if ε is sufficiently small then Q(x_i − ε, x_{−i}) = Q(x_i + ε, x_{−i}) = Q(x_i, x_{−i}) = 0. Thus ∂Q(x)/∂x_i exists and equals 0.

  ▷ p_i = p_j for some j ∈ A and p_i ≥ p_{j′} for all j′ ∈ A \ {j}: By Definition 8.7, if ε is sufficiently small then Q(x_i − ε, x_{−i}) = Q(x_i, x_{−i}) = 0, thus

lim_{ε → 0⁺} (Q(x_i, x_{−i}) − Q(x_i − ε, x_{−i})) / ε = 0.

Moreover, lim_{ε → 0⁺} (Q(x_i + ε, x_{−i}) − Q(x_i, x_{−i})) / ε exists and equals 0 since S_∞(·) and S(·) are differentiable functions and S_∞(S(p_i) − S(p_j)) = S_∞(0) = 0, S′_∞(S(p_i) − S(p_j)) = S′_∞(0) = 0.

We next consider second derivatives. Let Q′(x) denote ∂Q(x)/∂x_k for convenience. As in the previous analysis, in case A^c_v = ∅ or B^c_v = ∅ the function Q′(x) is differentiable with respect to x_i, since S(·) and S_∞(·) are twice differentiable. Thus we again focus on the case where A, B ≠ ∅. Notice that by the previous analysis Q′(x) = 0 whenever there exist ℓ ∈ B and j ∈ A such that p_j ≥ p_ℓ. Without loss of generality we assume that i ∈ B, and we prove that ∂Q′(x)/∂x_i ≜ ∂²Q(x)/∂x_i ∂x_k always exists.

▶ Let j ∈ A and ℓ ∈ B \ {i} be such that p_j ≥ p_ℓ. By Definition 8.7, Q′(x_i − ε, x_{−i}) = Q′(x_i + ε, x_{−i}) = Q′(x_i, x_{−i}) =
0. Thus ∂Q′(x)/∂x_i ≜ ∂²Q(x)/∂x_i ∂x_k exists and equals 0.

▶ Let p_ℓ > p_j for all ℓ ∈ B \ {i} and j ∈ A.

  ▷ p_i > p_j for all j ∈ A: Then ∂Q′(x)/∂x_i ≜ ∂²Q(x)/∂x_i ∂x_k exists since both S_∞(·) and S(·) are twice differentiable.

  ▷ p_i < p_j for some j ∈ A: By Definition 8.7, Q′(x_i − ε, x_{−i}) = Q′(x_i + ε, x_{−i}) = Q′(x_i, x_{−i}) = 0. Thus ∂Q′(x)/∂x_i ≜ ∂²Q(x)/∂x_i ∂x_k exists and equals 0.

  ▷ p_i = p_j for some j ∈ A and p_i > p_{j′} for all j′ ∈ A \ {j}: By Definition 8.7, if ε is sufficiently small then Q′(x_i − ε, x_{−i}) = 0, thus lim_{ε → 0⁺} (Q′(x_i, x_{−i}) − Q′(x_i − ε, x_{−i})) / ε = 0. Moreover, lim_{ε → 0⁺} (Q′(x_i + ε, x_{−i}) − Q′(x_i, x_{−i})) / ε exists since both S_∞(·) and S(·) are twice differentiable; it also equals 0 since S_∞(S(p_i) − S(p_j)) = S_∞(0) = 0, S′_∞(S(p_i) − S(p_j)) = S′_∞(0) = 0, S″_∞(0) = 0, and S(0) = 0. For these properties of S_∞ and S we use Lemma 8.3.

So far we have established that the functions Q^c_v(x) are twice differentiable when x moves within the same cubelet. Next we show that when x moves from one cubelet to another, the corresponding functions Q^c_v change value smoothly.

Lemma C.2.
Let x ∈ [0, 1]^d be such that there exists a coordinate i ∈ [d] with the property that

R(x_i + ε, x_{−i}) = [c_1/(N−1), (c_1+1)/(N−1)] × ⋯ × [c_d/(N−1), (c_d+1)/(N−1)]  and
R(x_i − ε, x_{−i}) = [c′_1/(N−1), (c′_1+1)/(N−1)] × ⋯ × [c′_d/(N−1), (c′_d+1)/(N−1)],

with c, c′ ∈ ([N−1]−1)^d and ε sufficiently small, i.e., x lies on the boundary of two cubelets. Then the following statements hold.

1. For all vertices v ∈ R_c(x_i + ε, x_{−i}) ∩ R_c(x_i − ε, x_{−i}), it holds that
   (a) Q^c_v(x) = Q^{c′}_v(x),
   (b) ∂Q^c_v(x)/∂x_i = ∂Q^{c′}_v(x)/∂x_i for all i ∈ [d], and
   (c) ∂²Q^c_v(x)/∂x_i ∂x_j = ∂²Q^{c′}_v(x)/∂x_i ∂x_j for all i, j ∈ [d].

2. For all vertices v ∈ R_c(x_i + ε, x_{−i}) \ R_c(x_i − ε, x_{−i}), it holds that Q^c_v(x) = ∂Q^c_v(x)/∂x_i = ∂²Q^c_v(x)/∂x_i ∂x_j = 0.

3. For all vertices v ∈ R_c(x_i − ε, x_{−i}) \ R_c(x_i + ε, x_{−i}), it holds that Q^{c′}_v(x) = ∂Q^{c′}_v(x)/∂x_i = ∂²Q^{c′}_v(x)/∂x_i ∂x_j = 0.

Lemma C.2 is crucial since it establishes that P_v(x) is continuous and twice differentiable even when x moves from one cubelet to another. Since the proof of Lemma C.2 is long and contains the proofs of some sublemmas, we postpone it to Section C.1.1 at the end of this section. We now proceed with the proof of Lemma 8.10.

Proof of Lemma 8.10. We first prove that P_v(x) is a continuous function. Let x ∈ [
0, 1 ] d lying onthe boundary of the following cubelets (cid:34) c ( ) N − c ( ) + N − (cid:35) × · · · × (cid:34) c ( ) d N − c ( ) d + N − (cid:35) · · · (cid:34) c ( i ) N − c ( i ) + N − (cid:35) × · · · × (cid:34) c ( i ) d N − c ( i ) d + N − (cid:35) · · · (cid:34) c ( m ) N − c ( m ) + N − (cid:35) × · · · × (cid:34) c ( m ) d N − c ( m ) d + N − (cid:35) .where c ( ) , . . . , c ( m ) ∈ ([ N − ] − ) d . This means that for every i ∈ [ m ] there exists a coordinate j i ∈ [ d ] and a value η i ∈ R with sufficiently small absolute value such that R ( x j i + η i , x − j i ) = (cid:34) c ( i ) N − c ( i ) + N − (cid:35) × · · · × (cid:34) c ( i ) d N − c ( i ) d + N − (cid:35) .We then consider the following cases. (cid:73) v / ∈ ∪ mi = R c ( x j i + η i , x − j i ) . By Definition 8.9, in all the m aforementioned cubelets, thecoefficient P v takes value 0 and hence it is continuous in this part of the space. (cid:73) v ∈ ∩ j ∈ U R c ( x j i + η i , x − j i ) and v / ∈ ∪ i ∈ U R c ( x j i + η i , x − j i ) , for some U ⊆ [ m ] with U = [ m ] \ U .In this case P v ( x j i + η i , x j i ) was computed according to a cubelet with v ∈ R c ( x j i + η i , x − j i ) .Then Lemma C.2 implies that Q c ( i ) v ( x ) = v ∈ R c ( x j i + η i , x − j i ) \ R c ( x j i (cid:48) + η i (cid:48) , x − j i (cid:48) ) where i (cid:48) ∈ [ m ] and i (cid:54) = i (cid:48) . Therefore we conclude that P v ( x ) = η i → P v ( x j i + η i , x − i ) = (cid:73) v ∈ ∩ mi = R c ( x j i + η i , x − j i ) . 
By Lemma C.2 for all i ∈ [ m ] it holds that Q c ( i ) v ( x ) ∑ v ∈ R c ( x ji + η i , x − ji ) Q c ( i ) v ( x ) = Q c ( i ) v ( x ) ∑ v ∈∩ mi = R c ( x ji + η i , x − ji ) Q c ( i ) v ( x )= Q c ( i (cid:48) ) v ( x ) ∑ v ∈∩ mi = R c ( x ji + η i , x − ji ) Q c ( i (cid:48) ) v ( x ) = Q c ( i (cid:48) ) v ( x ) ∑ v ∈ R c ( x ji + η i , x − ji ) Q c ( i (cid:48) ) v ( x ) which again implies the continuity of P v ( x ) at x .Next we prove that P v ( x ) is differentiable for all v ∈ ([ N ] − ) d . Fix some i ∈ [ d ] we willprove that ∂ P ( x ) ∂ x i always exists. Let C + be the set of down-left corners of the cubelets in whichlim ε → + ( x i + ε , x − i ) belongs to and C − be the set of down-left corners of the cubelets in whichlim ε → + ( x i − ε , x − i ) belongs to. It easy to see that C + and C − are non-empty and fixed for ε > ∂ P v ( x ) ∂ x i always exists, we consider the following 3 mutually exclusive cases.64 v ∈ L c ( c ( ) ) for c ( ) ∈ C + and v ∈ L c ( c ( ) ) for c ( ) ∈ C − . Since the coefficient P v ( x ) is a con-tinuous function, we have that (cid:46) lim ε → + P v ( x i + ε , x − i ) − P v ( x i , x − i ) ε = ∂ Q c ( ) v ( x ) ∂ xi ∑ v (cid:48)∈ Lc ( c ( )) Q c ( ) v (cid:48) ( x ) − Q c ( ) v ( x ) ∑ v (cid:48)∈ Lc ( c ( )) ∂ Q c ( ) v (cid:48) ( x ) ∂ xi (cid:16) ∑ v (cid:48)∈ Lc ( c ( )) Q c ( ) v (cid:48) ( x ) (cid:17) (cid:46) lim ε → + P v ( x i , x − i ) − P v ( x i − ε , x − i ) ε = ∂ Q c ( ) v ( x ) ∂ xi ∑ v (cid:48)∈ Lc ( c ( )) Q c ( ) v (cid:48) ( x ) − Q c ( ) v ( x ) ∑ v (cid:48)∈ Lc ( c ( )) ∂ Q c ( ) v (cid:48) ( x ) ∂ xi (cid:16) ∑ v (cid:48)∈ Lc ( c ( )) Q c ( ) v (cid:48) ( x ) (cid:17) Both of the above limits exists due to the fact that Q cv ( x ) is differentiable (Lemma C.1).Moreover, since v ∈ L c ( c ( ) ) ∩ L c ( c ( ) ) , Case 1 of Lemma C.2 implies that the two limitsabove have exactly the same value and hence P v is differentiable at x . (cid:73) v / ∈ L c ( c ( ) ) for all c ( ) ∈ C + . 
In the case where v / ∈ L c ( c ) for all the down-left corners c of the cubelets at which x lies, then by Definition 8.9 P v ( x i , x − i ) = P v ( x i + ε , x − i ) = P v ( x i − ε , x − i ) =
0. Thus ∂ P v ( x ) ∂ x i exists and equals 0. Therefore we may assume that v ∈ L c ( c ) for some down-left corner c of a cubelet at which x lies. Due to the fact that P v ( x ) is acontinuous function and that v / ∈ L c ( c ( ) ) for all c ( ) ∈ C + , we get that P v ( x i + ε , x − i ) = P v ( x i , x − i ) = v ∈ L c ( c ) / L c c ( ) where c , c ( ) are down-left corners of cubelets at which x lies and ( x i + ε , x − i ) lies respectively. Therefore we get by Case 1 of Lemma C.2 that Q cv ( x ) = P v ( x i , x − i ) =
0. As a result,lim ε → + P v ( x i + ε , x − i ) − P v ( x i , x − i ) ε = ε → + P v ( x i , x − i ) − P v ( x i − ε , x − i ) ε exists and equals 0. At first observethat 0 ≤ x i − c i ≤ δ since x lies in the cubelet with down-left corner c . In case x i − c i < δ then ( x i + ε , x − i ) lies in c for arbitrarily small ε , meaning that c ∈ C + . The latter contradictsthe fact that v / ∈ L c c ( ) for all c ( ) ∈ C + . As a result, x i − c i = δ which implies that c ∈ C − and hencelim ε → + P v ( x i , x − i ) − P v ( x i − ε , x − i ) ε = ∂ Q cv ( x ) ∂ x i ∑ v (cid:48) ∈ L c ( c ) Q cv (cid:48) ( x ) − Q cv ( x ) ∑ v (cid:48) ∈ L c ( c ) ∂ Q cv (cid:48) ( x ) ∂ x i (cid:16) ∑ v (cid:48) ∈ L c ( c ) Q cv (cid:48) ( x ) (cid:17) .The above limit equals to 0 since Q cv ( x ) = ∂ Q cv ( x ) ∂ x i = v ∈ L c ( c ) \ L c ( c ( ) ) . (cid:73) v / ∈ L c ( c ( ) ) for all c ( ) ∈ C − . Symmetrically with the previous case.The second order differentiability of P v ( x ) can be established using exactly the same argumentsfor computing the following limitlim ε , ε (cid:48) → P v ( x i + ε , x j + ε (cid:48) , x − i , j ) − P v ( x ) ε .65he last thing that we need to show to prove Lemma 8.10 is that the set R + ( x ) has cardinalityat most d + ( d ) time. Let p cx ∈ [
0, 1]^d be the canonical representation of x with respect to a cubelet L(c) to which x belongs. We define the source vertex s^c = (s_1, …, s_d) and the target vertex t^c = (t_1, …, t_d) of L(c). Once this is done, the vertices in R_+(x) are exactly the vertices v of L_c(c) for which it holds that p_ℓ > p_j for all ℓ ∈ B^c_v and j ∈ A^c_v, since for all other v ∈ ([N]−1)^d it holds that Q^c_v(x) = 0, ∇Q^c_v(x) = 0, and ∇²Q^c_v(x) = 0. The set R_+(x) can be computed in polynomial time as follows: i) the coordinates p_1, …, p_d are sorted in increasing order, and ii) for each m = 0, …, d we compute the vertex v^(m) ∈ L_c(c) with

v^(m)_j = s_j if coordinate j belongs in the first m coordinates w.r.t. the order of p^c_x,
v^(m)_j = t_j if coordinate j belongs in the last d − m coordinates w.r.t. the order of p^c_x.

By Definition 8.7 it immediately follows that R_+(x) ⊆ {v^(0), …, v^(d)}, from which we get that |R_+(x)| ≤ d + 1 and that R_+(x) can be computed in polynomial time.

To finish the proof of Lemma 8.10 we only need the proof of Lemma C.2, which we present in the following section.

C.1.1 Proof of Lemma C.2

Lemma C.3.
Let a point x ∈ [0, 1]^d lie on the boundary of the cubelets with down-left corners c = (c_1, …, c_{m−1}, c_m, c_{m+1}, …, c_d) and c′ = (c_1, …, c_{m−1}, c_m + 1, c_{m+1}, …, c_d). Then the canonical representation of x in the cubelet L(c) is the same as the canonical representation of x in the cubelet L(c′). More precisely, p^c_x = p^{c′}_x.

Proof. Let c_m be even. By the definition of the canonical representation in Definition 8.6, the sources and targets of the cubelets L(c) and L(c′) are respectively

s^c = (s_1, …, s_{m−1}, c_m, s_{m+1}, …, s_d),
t^c = (t_1, …, t_{m−1}, c_m + 1, t_{m+1}, …, t_d),
s^{c′} = (s_1, …, s_{m−1}, c_m + 2, s_{m+1}, …, s_d),
t^{c′} = (t_1, …, t_{m−1}, c_m + 1, t_{m+1}, …, t_d).

Hence we get that p_j = p′_j for all j ≠ m. Since x belongs to the boundary of both cubelets L(c) and L(c′), its m-th coordinate lies on their common facet, and hence p_m = p′_m = 1. In case c_m is odd we similarly get that p^c_x = p^{c′}_x, but with p_m = p′_m = 0.

Lemma C.4.
Let x ∈ [
0, 1 ] d lying at the intersection of the cubelets L ( c ) , L ( c (cid:48) ) with down-left corners c = ( c , . . . , c m − , c m , c m + , . . . , c d ) , and c (cid:48) = ( c , . . . , c m − , c m + c m + , . . . , c d ) . Then the followingstatements are true.1. For all vertices v ∈ L c ( c ) ∩ L c ( c (cid:48) ) it holds that(a) Q cv ( x ) = Q c (cid:48) v ( x ) , b) ∂ Q cv ( x ) ∂ x i = ∂ Q c (cid:48) v ( x ) ∂ x i ,(c) ∂ Q cv ( x ) ∂ x i ∂ x j = ∂ Q c (cid:48) v ( x ) ∂ x i ∂ x j .2. For all vertices v ∈ L c ( c ) \ L c ( c (cid:48) ) it holds that Q cv ( x ) = ∂ Q cv ( x ) ∂ x i = ∂ Q cv ( x ) ∂ x i ∂ x j = .3. For all vertices v ∈ L c ( c (cid:48) ) / L c ( c ) it holds that Q c (cid:48) v ( x ) = ∂ Q c (cid:48) v ( x ) ∂ x i = ∂ Q c (cid:48) v ( x ) ∂ x i ∂ x j = .Proof.
1. Let v ∈ L c ( c ) ∩ L c ( c (cid:48) ) then we have that(a) Q cv ( x ) = Q c (cid:48) v ( x ) . By Lemma C.3 we get that the canonical representation p cx = p c (cid:48) x .Since Q cv ( x ) is a function of the canonical representation p cx (see Definition 8.9), itholds that Q cv ( x ) = Q c (cid:48) v ( x ) for all vertices v ∈ L c ( c ) ∩ L c ( c (cid:48) ) .(b) ∂ Q cv ( x ) ∂ x i = ∂ Q c (cid:48) v ( x ) ∂ x i . For i (cid:54) = m , we get that ∂ Q cv ( x ) ∂ x i = t i − s i ∂ Q cv ( x ) ∂ p i = t (cid:48) i − s (cid:48) i ∂ Q c (cid:48) v ( x ) ∂ p (cid:48) i = ∂ Q c (cid:48) v ( x ) ∂ x i since t i = t (cid:48) i and s i = s (cid:48) i for all i (cid:54) = m . The latter argument cannot be applied for the m -thcoordinate since t m − s m = − ( t (cid:48) m − s (cid:48) m ) . However since x belongs to the boundary ofboth the cubelets L ( c ) and L ( c (cid:48) ) it is implied that p m = p (cid:48) m is either 0 or 1, meaningthat ∂ Q cv ( x ) ∂ x m = ∂ Q c (cid:48) v ( x ) ∂ x m = S (cid:48) ( ) = S (cid:48) ( ) = ∂ Q cv ( x ) ∂ x i ∂ x j = ∂ Q c (cid:48) v ( x ) ∂ x i ∂ x j . For i , j (cid:54) = m , we get that ∂ Q cv ( x ) ∂ x i ∂ x j = t i − s i t j − s j ∂ Q cv ( x ) ∂ p i ∂ p j = t (cid:48) i − s (cid:48) i t (cid:48) j − s (cid:48) j ∂ Q c (cid:48) v ( x ) ∂ p (cid:48) i ∂ p (cid:48) j = ∂ Q c (cid:48) v ( x ) ∂ x i ∂ x j since t i = t (cid:48) i and s i = s (cid:48) i for all i (cid:54) = m . As in the previous case, p m = p (cid:48) m equalseither 0 or 1. As a result, ∂ Q cv ( x ) ∂ x m ∂ x j = ∂ Q c (cid:48) v ( x ) ∂ x m ∂ x j = S (cid:48) ( ) = S (cid:48) ( ) = S (cid:48)(cid:48) ( ) = S (cid:48)(cid:48) ( ) = v ∈ L c ( c ) \ L c ( c (cid:48) ) , we get that v m = c m . In case c m is even, we get that s m = c m = v m and thus the coordinate the coordinate m belongs in the set A cv . 
Since x lies on the common facet of the cubelets L(c) and L(c′), we get that p_m = 1. The latter combined with m ∈ A^c_v implies that Q^c_v(x) = 0, and hence, by Lemma C.1, ∂Q^c_v(x)/∂x_i = ∂²Q^c_v(x)/∂x_i ∂x_j = 0. In case c_m is odd, we get that s_m = c_m + 1. The latter combined with the fact that v_m = c_m implies that the m-th coordinate belongs in B^c_v. Now p_m = 0 and hence Q^c_v(x) = 0. Then again by Lemma C.1, ∂Q^c_v(x)/∂x_i = ∂²Q^c_v(x)/∂x_i ∂x_j = 0.

Proof of Lemma C.2.
1. Let v ∈ L_c(c) ∩ L_c(c′). There exists a sequence of corners c = c^(1), …, c^(m) = c′ such that ‖c^(j) − c^(j+1)‖_1 = 1 and v ∈ L_c(c^(j)) for all j ∈ [m]. By Lemma C.4 we get that
   (a) Q^{c^(j)}_v(x) = Q^{c^(j+1)}_v(x),
   (b) ∂Q^{c^(j)}_v(x)/∂x_i = ∂Q^{c^(j+1)}_v(x)/∂x_i,
   (c) ∂²Q^{c^(j)}_v(x)/∂x_i ∂x_j = ∂²Q^{c^(j+1)}_v(x)/∂x_i ∂x_j,
which implies Case 1 of Lemma C.2.

2. Let v ∈ L_c(c) \ L_c(c′). There exists a sequence of corners c = c^(1), …, c^(i) such that ‖c^(j) − c^(j+1)‖_1 = 1, v ∉ L_c(c^(i)), and v ∈ L_c(c^(j)) for all j < i. By case 2 of Lemma C.4 we get that Q^{c^(i−1)}_v(x) = ∂Q^{c^(i−1)}_v(x)/∂x_i = ∂²Q^{c^(i−1)}_v(x)/∂x_i ∂x_j = 0. Then case 2 of Lemma C.2 follows by case 1 of Lemma C.4.

3. Similarly with case 2.
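The vertex-enumeration step used at the end of the proof of Lemma 8.10 — sort the coordinates of the canonical representation and switch them from target to source one at a time — can be sketched in code. This is a hypothetical illustration: the names `candidate_vertices`, `s`, `t`, and `p` are ours, and we pass the source and target vertices in directly instead of deriving them from Definition 8.6.

```python
def candidate_vertices(p, s, t):
    """Enumerate the d+1 candidate vertices v^(0), ..., v^(d) of a cubelet.

    Coordinates are sorted by increasing canonical value p_j; the vertex
    v^(m) assigns the source coordinate s_j to the first m coordinates in
    that order and the target coordinate t_j to the remaining d - m."""
    d = len(p)
    order = sorted(range(d), key=lambda j: p[j])  # indices by increasing p_j
    vertices = []
    for m in range(d + 1):
        v = list(t)
        for j in order[:m]:
            v[j] = s[j]
        vertices.append(tuple(v))
    return vertices

# Toy example on a single cubelet, source at the down-left corner:
s = (0, 0, 0)          # source vertex (an assumption for the example)
t = (1, 1, 1)          # target vertex
p = (0.7, 0.2, 0.5)    # canonical representation of some x in the cubelet
vs = candidate_vertices(p, s, t)
assert len(vs) == len(p) + 1        # at most d + 1 candidates, as in Lemma 8.10
assert vs[0] == t and vs[-1] == s   # m = 0 gives the target, m = d the source
```

Since `R_+(x)` is contained in this list, the sketch makes the `|R_+(x)| ≤ d + 1` bound and the polynomial running time of the enumeration concrete: one sort plus d + 1 vertex constructions.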
C.2 Proof of Lemma 8.11

We start this section with some fundamental properties of the smooth step function S_∞ that are more fine-grained than the properties we presented in Lemma 8.3.

Lemma C.5.
For d ≥ 10 there exists a universal constant c > 0 such that the following statements hold.

1. If x ≥ 1/d then S_∞(x) ≥ c · 2^{−d}.
2. If x ≤ 1/d then S′_∞(x) ≤ c · d² · 2^{−d}.
3. If x ≥ 1/d then S′_∞(x)/S_∞(x) ≤ c · d².
4. If x ≤ 1/d then |S″_∞(x)| ≤ c · d⁴ · 2^{−d}.
5. If x ≥ 1/d then |S″_∞(x)|/S_∞(x) ≤ c · d⁴.

Proof. We compute the derivative of S_∞ and we have that

S′_∞(x) = ln(2) · S_∞(x) · S_∞(1 − x) · (1/x² + 1/(1 − x)²),

from which we immediately get S′_∞(x) ≥
0. Then we can compute the second derivative of S ∞ as follows S (cid:48)(cid:48) ∞ ( x ) = ln ( ) S ∞ ( x ) S ∞ ( − x ) ·· (cid:32) ln ( ) ( S ∞ ( − x ) − S ∞ ( x )) (cid:18) x + ( − x ) (cid:19) − (cid:32) x − ( − x ) (cid:33)(cid:33) .We next want to prove that S (cid:48)(cid:48) ∞ ( x ) ≥ x ≤ − · S ∞ ( x ) ≥ x ≤ d and therefore S (cid:48)(cid:48) ∞ ( x ) ≥ ln ( ) x S ∞ ( x ) S ∞ ( − x ) (cid:18) ln ( ) x − (cid:19) hence for x ≤
4/ ln ( ) it holds that S (cid:48)(cid:48) ∞ ( x ) ≥
0. By similar but more tedious calculations we canconclude that S (cid:48)(cid:48)(cid:48) ∞ ( x ) ≥ x ≤ x ∈ [
0, 1/10 ] all the functions S ∞ , S (cid:48) ∞ , S (cid:48)(cid:48) ∞ are all increasing functions of x . 68ext we show that the function h ( x ) = − x + − ( − x ) is upper and lower bounded. Firstobserve that h ( x ) ≥ max { − x , 2 − ( − x ) } . Now if we set t ( x ) = − x then t (cid:48) ( x ) = ln ( ) t ( x ) / x and hence t ( x ) ≥ t ( ) = x ≥ − ( − x ) ≥ x ≤ h ( x ) ≥ x ∈ [
0, 1 ] . Also it is not hard to see that 2 − x ≤ − ( − x ) ≤ h ( x ) ≤
1. Hence overall we have that h ( x ) ∈ [ ] for all x ∈ [
0, 1 ] . We are now ready to prove the statements.1. We have shown that S (cid:48) ∞ ( x ) ≥ x ∈ [
0, 1 ] . Hence S ∞ is an increasing function andtherefore S ∞ ( x ) ≥ S ∞ ( d ) for x ≥ d . Now we have that S ∞ ( d ) = − d / h ( d ) ≥ − d .2. Since S (cid:48) ∞ ( x ) is increasing for x ∈ [
0, 1/10 ] , we have that S (cid:48) ∞ ( x ) ≤ S (cid:48) ∞ ( d ) for x ≤ d andtherefore S (cid:48) ∞ ( x ) ≤ ln ( ) S ∞ ( − d ) S ∞ ( d ) (cid:32) d + (cid:0) − d (cid:1) (cid:33) ≤ ( ) − d h ( d ) ≤ ( ) − d .3. We have that for x ≤ dS (cid:48) ∞ ( x ) S ∞ ( x ) = ln ( ) S ∞ ( − x ) (cid:18) x + ( − x ) (cid:19) ≤ ( ) x ≤ ( ) d .4. Follows directly from the statement 1., the fact that S (cid:48)(cid:48) ∞ ( x ) is increasing for x ∈ [
0, 1/10], and the above expression of S″_∞, this statement follows.

5. This statement follows using the same reasoning as statement 3.

In this section we establish the bounds on the gradient and the Hessian of P_v(x). These bounds are formally stated in Lemma 8.11, the proof of which is the main goal of this section.

Lemma 8.11.
For any vertex v ∈ ([N]−1)^d, it holds that

1. |∂P_v(x)/∂x_i| ≤ Θ(d/δ),
2. |∂²P_v(x)/∂x_i ∂x_j| ≤ Θ(d/δ).

In order to prove Lemma 8.11, we first introduce several technical lemmas.
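Before the technical lemmas, a toy numerical sketch of the coefficients P_v on a single cubelet may help fix ideas. This is a hypothetical illustration under stated assumptions: `s_smooth` is an assumed stand-in for the step function S (any smooth increasing step serves the illustration), `s_inf` follows the 2^{−1/x} form that appears in the proof of Lemma C.5, and the source and target are fixed to the all-zeros and all-ones corners; none of these choices are claimed to match the paper's exact construction. The sketch only illustrates that the P_v form a convex combination supported on at most d + 1 corners, as established in Lemma 8.10.

```python
import itertools
import math

def s_smooth(x):
    # assumed smooth increasing step on [0, 1] with S(0) = 0, S(1) = 1
    return 3 * x**2 - 2 * x**3

def s_inf(z):
    # smooth step built from 2^(-1/z): exactly 0 below 0, exactly 1 above 1,
    # and flat (vanishing derivatives) at both ends
    if z <= 0.0:
        return 0.0
    if z >= 1.0:
        return 1.0
    a = 2.0 ** (-1.0 / z)
    b = 2.0 ** (-1.0 / (1.0 - z))
    return a / (a + b)

def coefficients(p):
    """Coefficients P_v over the corners v of the unit cube [0, 1]^d.

    A collects coordinates where v agrees with the source (all-zeros corner),
    B those where v agrees with the target (all-ones corner)."""
    d = len(p)
    Q = {}
    for v in itertools.product((0, 1), repeat=d):
        A = [j for j in range(d) if v[j] == 0]
        B = [j for j in range(d) if v[j] == 1]
        if not B:        # v is the source vertex
            q = math.prod(s_inf(1 - s_smooth(p[j])) for j in A)
        elif not A:      # v is the target vertex
            q = math.prod(s_inf(s_smooth(p[l])) for l in B)
        else:
            q = math.prod(s_inf(s_smooth(p[l]) - s_smooth(p[j]))
                          for l in B for j in A)
        Q[v] = q
    total = sum(Q.values())
    return {v: q / total for v, q in Q.items()}

P = coefficients((0.3, 0.8))
assert abs(sum(P.values()) - 1.0) < 1e-12        # a convex combination
assert sum(1 for q in P.values() if q > 0) <= 3  # at most d + 1 = 3 active corners
```

In this two-dimensional example only the corners compatible with the ordering p_2 > p_1 get positive weight; the corner (1, 0), which would need p_1 > p_2, is killed by the clamped `s_inf`, mirroring the p_ℓ > p_j condition in the text.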
Lemma C.6.
Let x ∈ [
0, 1 ] d lying in cublet L ( c ) , with c ∈ ([ N ] − ) d and let p cx = ( p , . . . , p d ) be thecanonical representation of x . Then for all vertices v ∈ L c ( c ) , it holds that (cid:12)(cid:12)(cid:12)(cid:12) ∂ Q cv ( x ) ∂ p i (cid:12)(cid:12)(cid:12)(cid:12) ≤ Θ ( d ) · ∑ v ∈ V c Q cv ( x ) .69 roof. To simplify notation we use Q v ( x ) instead of Q cv ( x ) , A instead of A cv and B instead of B cv for the rest of the proof. Without loss of generality we assume that for all j ∈ A and (cid:96) ∈ B , p (cid:96) > p j since otherwise ∂ Q cv ( x ) ∂ p i = i ∈ B (symmetrically for i ∈ A ) then, (cid:12)(cid:12)(cid:12)(cid:12) ∂ Q cv ( x ) ∂ p i (cid:12)(cid:12)(cid:12)(cid:12) == ∏ (cid:96) (cid:54) = i ∏ j ∈ A S ∞ ( S ( p (cid:96) ) − S ( p j )) · ∑ j ∈ A (cid:12)(cid:12) S (cid:48) ∞ ( S ( p i ) − S ( p j )) (cid:12)(cid:12) ∏ j (cid:48) ∈ A / { j } S ∞ ( S ( p i ) − S ( p j (cid:48) )) S (cid:48) ( p i ) ≤ ∑ j ∈ A (cid:12)(cid:12) S (cid:48) ∞ ( S ( p i ) − S ( p j )) (cid:12)(cid:12) · ∏ ( j (cid:48) , (cid:96) ) (cid:54) =( j , i ) S ∞ ( S ( p (cid:96) ) − S ( p j (cid:48) )) where the last inequality follows by the fact that | S (cid:48) ( · ) | ≤
6. Since | A | ≤ d the proof of the lemmawill be completed if we are able to show that for any j ∈ A , it holds that (cid:12)(cid:12) S (cid:48) ∞ ( S ( p i ) − S ( p j )) (cid:12)(cid:12) · ∏ ( j (cid:48) , (cid:96) ) (cid:54) =( j , i ) S ∞ ( S ( p (cid:96) ) − S ( p j (cid:48) )) ≤ Θ ( d ) · ∑ v (cid:48) ∈ L c ( c ) Q v (cid:48) ( x ) In case S ( p i ) − S ( p j ) ≥ d then by case 3. of Lemma C.5 we get that (cid:12)(cid:12) S (cid:48) ∞ ( S ( p i ) − S ( p j )) (cid:12)(cid:12) ≤ c · d · S ∞ ( S ( p i ) − S ( p j )) , which implies gthe following (cid:12)(cid:12) S (cid:48) ∞ ( S ( p i ) − S ( p j )) (cid:12)(cid:12) · ∏ ( j (cid:48) , (cid:96) ) (cid:54) =( j , i ) S ∞ ( S ( p (cid:96) ) − S ( p j (cid:48) )) ≤≤ c · d · S ∞ ( S ( p i ) − S ( p j )) · ∏ ( j (cid:48) , (cid:96) ) (cid:54) =( j , i ) S ∞ ( S ( p (cid:96) ) − S ( p j (cid:48) ))= c · d · Q v ( x ) ≤ c · d · ∑ v (cid:48) ∈ L c ( c ) Q v (cid:48) ( x ) Now consider the case where S ( p i ) − S ( p j ) ≤ d . Using case 2. of Lemma C.5, we have that (cid:12)(cid:12) S (cid:48) ∞ ( S ( p i ) − S ( p j )) (cid:12)(cid:12) · ∏ ( j (cid:48) , (cid:96) ) (cid:54) =( j , i ) S ∞ ( S ( p (cid:96) ) − S ( p j (cid:48) )) ≤ (cid:12)(cid:12) S (cid:48) ∞ ( S ( p i ) − S ( p j )) (cid:12)(cid:12) ≤ Θ ( d · − d ) Consider the sequence of points in the [
0, 1 ] interval 0, p , . . . , p d , 1. There always exist two con-secutive points with distance greater that 1/ ( d + ) . As a result, there exists v ∗ ∈ L c ( c ) such that p (cid:96) − p j ≥ ( d + ) for all (cid:96) ∈ B v ∗ and j ∈ A v ∗ . Then S ( p (cid:96) ) − S ( p j ) ≥ ( d + ) and by case 1.of Lemma C.5, S ∞ ( S ( p (cid:96) ) − S ( p j )) ≥ c − ( d + ) . If we also use the fact that | A v ∗ | · | B v ∗ | ≤ d , weget that Q v ∗ ( x ) ≥ ( c · − ( d + ) ) d = c d − ( d + ) · d .Then it holds that 1 Q v ∗ ( x ) · (cid:12)(cid:12) S (cid:48) ∞ ( S ( p i ) − S ( p j )) (cid:12)(cid:12) · ∏ ( j (cid:48) , (cid:96) ) (cid:54) =( j , i ) S ∞ ( S ( p (cid:96) ) − S ( p j (cid:48) )) ≤≤ Θ (cid:18) d · (cid:16) ( c ) · − d +( d + ) (cid:17) d (cid:19) ≤ Θ ( d ) .Combining the later with the discussion in the rest of the proof the lemma follows.70 emma C.7. For any vertex v ∈ ([ N ] − ) d it holds that (cid:12)(cid:12)(cid:12) ∂ P v ( x ) ∂ x i (cid:12)(cid:12)(cid:12) ≤ Θ (cid:0) d / δ (cid:1) .Proof. To simplify notation we use Q v ( x ) instead of Q cv ( x ) for the rest of the proof. Without lossof generality we assume that x lies on a cubelet L ( c ) with c ∈ ([ N ] − ) d and v ∈ L c ( c ) , sinceotherwise ∂ P v ( x ) ∂ x i =
0. Let p cx = ( p , . . . , p d ) be the canonical representation of x in the cubelet L ( c ) . Then it holds that (cid:12)(cid:12)(cid:12)(cid:12) ∂ P v ( x ) ∂ p i (cid:12)(cid:12)(cid:12)(cid:12) = (cid:12)(cid:12)(cid:12) ∂ Q v ( x ) ∂ p i · (cid:104) ∑ v (cid:48) ∈ L c ( c ) Q v (cid:48) ( x ) (cid:105) − Q v ( x ) · (cid:104) ∑ v (cid:48) ∈ L c ( c ) ∂ Q v (cid:48) ( x ) ∂ p i (cid:105)(cid:12)(cid:12)(cid:12) ( ∑ v (cid:48) ∈ L c ( c ) Q v (cid:48) ( x )) ≤ (cid:12)(cid:12)(cid:12) ∂ Q v ( x ) ∂ p i (cid:12)(cid:12)(cid:12) ∑ v (cid:48) ∈ L c ( c ) Q v (cid:48) ( x ) + ∑ v (cid:48) ∈ L c ( c ) (cid:12)(cid:12)(cid:12) ∂ Q v (cid:48) ( x ) ∂ p i (cid:12)(cid:12)(cid:12) ∑ v (cid:48) ∈ L c ( c ) Q v (cid:48) ( x ) ≤ ( d + ) · Θ ( d ) = Θ ( d ) where the last inequality follows by Lemma C.6 and the fact that at most d + v of L c ( c ) have non-zero gradient as we have proved in Lemma 8.10. Then the proof of Lemma C.7 followsby the fact that p i = x i − s i t i − s i . Lemma C.8.
Let c ∈ ([ N ] − ) d and v ∈ L c ( c ) then it holds that (cid:12)(cid:12)(cid:12) ∂ Q cv ( x ) ∂ p i ∂ p j (cid:12)(cid:12)(cid:12) ≤ Θ ( d ) · ∑ v ∈ R c ( x ) Q cv ( x ) .Proof. To simplify the notation we use CS ( p (cid:96) − p m ) to denote S ∞ ( S ( p (cid:96) ) − S ( p m )) , CS (cid:48) ( p (cid:96) − p m ) to denote | S (cid:48) ∞ ( S ( p (cid:96) ) − S ( p m )) | , A to denote A cv and B to denote B cv for the rest of the proof. Asin Lemma C.7, we assume that p (cid:96) > p m for all (cid:96) ∈ B and m ∈ A since otherwise ∂ Q v ( x ) ∂ p i ∂ p j =
0. Wehave the following cases for the indices i and j (cid:73) If i , j ∈ B then (cid:12)(cid:12)(cid:12)(cid:12) ∂ Q v ( x ) ∂ p i ∂ p j (cid:12)(cid:12)(cid:12)(cid:12) == ∑ m , m ∈ A CS (cid:48) ( p i − p m ) CS (cid:48) ( p j − p m ) · ∏ ( m , (cid:96) ) (cid:54) = { ( m , i ) , ( m , j ) } CS ( p (cid:96) − p m ) · S (cid:48) ( p i ) S (cid:48) ( p j ) ≤ ∑ m , m ∈ A CS (cid:48) ( p i − p m ) CS (cid:48) ( p j − p m ) · ∏ ( m , (cid:96) ) (cid:54) = { ( m , i ) , ( m , j ) } CS ( p (cid:96) − p m ) (cid:124) (cid:123)(cid:122) (cid:125) (cid:44) U ( i , j ) .If additionally it holds that S ( p i ) − S ( p m ) ≤ d or S ( p j ) − S ( p m ) ≤ d , then by thecase 2. of Lemma C.5, we have that U ( i , j ) ≤ CS (cid:48) ( p i − p m ) · CS (cid:48) ( p j − p m ) ≤ Θ ( d e − d ) .The latter follows from the fact that the function S (cid:48) ∞ ( · ) is bounded in the [
0, 1 ] interval andthat CS ( p (cid:96) − p m ) ≤
1. With the exact same arguments as in Lemma C.6, we hence get that CS (cid:48) ( p i − p m ) CS (cid:48) ( p j − p m ) · Π ( m , (cid:96) ) (cid:54) = { ( m , i ) , ( m , j ) } CS ( p (cid:96) − p m ) ≤ Θ ( d ) ∑ v (cid:48) ∈ L c ( c ) Q cv (cid:48) ( x ) .Thus (cid:12)(cid:12)(cid:12) ∂ Q v ( x ) ∂ p i ∂ p j (cid:12)(cid:12)(cid:12) ≤ Θ ( d ) ∑ v (cid:48) ∈ L c ( c ) Q cv (cid:48) ( x ) . 71n the other hand if S ( p i ) − S ( p m ) ≥ d and S ( p j ) − S ( p m ) ≥ d then by case 1. ofLemma C.5, CS (cid:48) ( p i − p m ) ≤ c · d · CS ( p i − p m ) and CS (cid:48) ( p j − p m ) ≤ c · d · CS ( p j − p m ) and thus U ( i , j ) ≤ Θ ( d ) · Q cv ( x ) . Overall we get that (cid:12)(cid:12)(cid:12) ∂ Q v ( x ) ∂ p i ∂ p j (cid:12)(cid:12)(cid:12) ≤ Θ ( d ) · ∑ v (cid:48) ∈ R c ( x ) Q cv (cid:48) ( x ) . (cid:73) If i ∈ B and j ∈ A then (cid:12)(cid:12)(cid:12)(cid:12) ∂ Q v ( x ) ∂ p i ∂ p j (cid:12)(cid:12)(cid:12)(cid:12) ≤≤ ∑ m ∈ A , (cid:96) ∈ B CS (cid:48) ( p i − p m ) CS (cid:48) ( p (cid:96) − p j ) · ∏ ( m , (cid:96) ) (cid:54) = { ( i , m ) , ( (cid:96) , j ) } CS ( p (cid:96) − p m ) · S (cid:48) ( p i ) S (cid:48) ( p j )+ (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) CS (cid:48)(cid:48) ( p i − p j ) · ∏ ( m , (cid:96) ) (cid:54) =( i , j ) CS ( p (cid:96) − p m ) · S (cid:48) ( p i ) S (cid:48) ( p j ) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ≤ Θ ( d ) ∑ v ∈ L c ( c ) Q cv ( x ) + (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) CS (cid:48)(cid:48) ( p i − p j ) · ∏ ( m , (cid:96) ) (cid:54) =( i , j ) CS ( p (cid:96) − p m ) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:124) (cid:123)(cid:122) (cid:125) Q (cid:48)(cid:48) ( x ) .In case S ( p i ) − S ( p j ) ≥ d then by case 4. of Lemma C.5, we get that (cid:12)(cid:12)(cid:12) CS (cid:48)(cid:48) ( p i − p j ) (cid:12)(cid:12)(cid:12) ≤ cd · CS ( p i − p j ) which implies that Q (cid:48)(cid:48) ≤ Θ ( d ) · Q cv ( x ) .On the other hand if S ( p i ) − S ( p j ) ≤ d then by case 5. 
of Lemma C.5, we get that Q (cid:48)(cid:48) ≤ (cid:12)(cid:12)(cid:12) CS (cid:48)(cid:48) ( p i − p j ) (cid:12)(cid:12)(cid:12) ≤ c · d e − d . As in the proof of Lemma C.6, there exists a vertex v ∗ ∈ R c ( x ) such that Q cv ∗ ( x ) ≥ c d e − ( d + ) d and thus Q (cid:48)(cid:48) ≤ Θ ( d ) ∑ v ∈ L c ( c ) Q cv ( x ) . Overallwe get that (cid:12)(cid:12)(cid:12)(cid:12) ∂ Q v ( x ) ∂ p i ∂ p j (cid:12)(cid:12)(cid:12)(cid:12) ≤ Θ ( d ) ∑ v ∈ L c ( c ) Q cv ( x ) . (cid:73) If i = j ∈ B then (cid:12)(cid:12)(cid:12)(cid:12) ∂ Q v ( x ) ∂ p i (cid:12)(cid:12)(cid:12)(cid:12) ≤≤ ∑ m , m ∈ A (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) CS (cid:48) ( p i − p m ) CS (cid:48) ( p i − p m ) · ∏ ( m , (cid:96) ) (cid:54) = { ( m , i ) , ( m , i ) } CS ( p (cid:96) − p m ) · S (cid:48) ( p i ) S (cid:48) ( p i ) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) + ∑ m ∈ A (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) CS (cid:48)(cid:48) ( p i − p m ) · ∏ ( m , (cid:96) ) (cid:54) =( m , (cid:96) ) CS ( p (cid:96) − p m ) S (cid:48) ( p i ) S (cid:48) ( p i ) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ≤ Θ ( d + d · d ) · ∑ v ∈ L c ( c ) Q cv ( x ) .If we combine all the above cases then the Lemma follows. Lemma C.9.
For any vertex v ∈ ([ N ] − ) d , it holds that (cid:12)(cid:12)(cid:12) ∂ P v ( x ) ∂ x i ∂ x j (cid:12)(cid:12)(cid:12) ≤ Θ ( d / δ ) . roof. Without loss of generality we assume that v ∈ L c ( c ) , where c ∈ ([ N − ] − ) d such that x ∈ L ( c ) , since otherwise ∂ P v ( x ) ∂ x i ∂ x j = ∂ P v ( x ) ∂ p i ∂ p j = ∂ Q v ( x ) ∂ p i ∂ p j (cid:32) ∑ v (cid:48) ∈ L c ( c ) Q v (cid:48) ( x ) (cid:33) · (cid:16) ∑ v (cid:48) ∈ L c ( c ) Q v (cid:48) ( x ) (cid:17) + ∂ Q v ( x ) ∂ p i ∑ v (cid:48) ∈ L c ( c ) ∂ Q v (cid:48) ( x ) ∂ p j (cid:32) ∑ v (cid:48) ∈ L c ( c ) Q v (cid:48) ( x ) (cid:33) · (cid:16) ∑ v (cid:48) ∈ L c ( c ) Q v (cid:48) ( x ) (cid:17) − ∂ Q v (cid:48) ( x ) ∂ p j ∑ v (cid:48) ∈ L c ( c ) ∂ Q v (cid:48) ( x ) ∂ p i (cid:32) ∑ v (cid:48) ∈ L c ( c ) Q v (cid:48) ( x ) (cid:33) · (cid:16) ∑ v (cid:48) ∈ L c ( c ) Q v (cid:48) ( x ) (cid:17) − Q v ( x ) ∑ v (cid:48) ∈ L c ( c ) ∂ Q v (cid:48) ( x ) ∂ p i ∂ p j (cid:32) ∑ v (cid:48) ∈ L c ( c ) Q v (cid:48) ( x ) (cid:33) · (cid:16) ∑ v (cid:48) ∈ L c ( c ) Q v (cid:48) ( x ) (cid:17) − ∂ Q v ( x ) ∂ p i ∑ v (cid:48) ∈ L c ( c ) Q v (cid:48) ( x ) · ∑ v (cid:48) ∈ L c ( c ) Q v (cid:48) ( x ) ∑ v (cid:48) ∈ L c ( c ) ∂ Q v (cid:48) ( x ) ∂ p j · (cid:16) ∑ v (cid:48) ∈ L c ( c ) Q v (cid:48) ( x ) (cid:17) + Q v ( x ) ∑ v (cid:48) ∈ L c ( c ) ∂ Q v (cid:48) ( x ) ∂ p i · ∑ v (cid:48) ∈ L c ( c ) Q v (cid:48) ( x ) ∑ v (cid:48) ∈ L c ( c ) ∂ Q v (cid:48) ( x ) ∂ p j · (cid:16) ∑ v (cid:48) ∈ L c ( c ) Q v (cid:48) ( x ) (cid:17) Using Lemma C.8 and Lemma C.6 we can bound every term in the above expression and hencewe get that (cid:12)(cid:12)(cid:12) ∂ P v ( x ) ∂ p i ∂ p j (cid:12)(cid:12)(cid:12) ≤ Θ ( d ) . Then the lemma follows from the fact that ∂ p i ∂ x i = δ .Finally using Lemma C.7 and Lemma C.9 we get the proof of Lemma 8.11. C.3 Proof of Lemma 8.12
Let 0 ≤ x_i < 1/(N−1) and let c = (c_1, …, c_i, …, c_d) denote the down-left corner of the cubelet R(x) in which x ∈ [0, 1]^d lies, i.e. x ∈ L(c). Since x_i < 1/(N−1), this means that c_i = 0. By the definition of sources and targets in Definition 8.6, we have that s_i = 0 and t_i = 1/(N−1), where s_i, t_i are respectively the i-th coordinates of the source vertex s^c and the target vertex t^c. Let p^c_x = (p_1, …, p_d) be the canonical representation of x in the cubelet L(c). Now partition the coordinates [d] into the following sets:

A = { j | p_j ≤ p_i }   and   B = { j | p_i < p_j }.

If B = ∅ then notice that P_{s^c}(x) > 0, since p_i < 1 by the fact that x_i < 1/(N−1). Thus the lemma follows since s_i = 0. So we may assume that B ≠ ∅. In this case consider the corner v = (v_1, …, v_d) defined as follows:

v_j = s_j if j ∈ A,   and   v_j = t_j if j ∈ B.

Observe that Q^c_v(x) > 0 and hence v ∈ R_+(x). Moreover the coordinate i belongs to A and therefore it holds that v_i = s_i =
0. This proves the first statement of the Lemma.For the second statement let 1 − ( N − ) ≤ x i ≤ ( N − ) and c = ( c , . . . , c i , . . . , c d ) denote down-left corner of the cubelet R ( x ) at which x ∈ [
0, 1 ] d lies, i.e. x ∈ L ( c ) . This meansthat c i = N − N − . 73 Let N be odd. In this case by the definition of sources and targets in Definition 8.6, we havethat s i = − ( N − ) and t i =
1, where s i , t i are respectively the i -th coordinate of thesource and target vertex. Let p cx = ( p , . . . , p d ) be the canonical representation of x underin the cubelet L ( c ) . Now partition the coordinates [ d ] as follows, A = (cid:8) j | p j < p i (cid:9) and B = (cid:8) j | p i ≤ p j (cid:9) If A = ∅ then notice that for the target vertex t c , P t c ( x ) >
0, since p i >
0, by the fact that x i > − ( N − ) . Thus the lemma follows since t i =
1. So we may assume that A (cid:54) = ∅ .In this case consider the corner v = ( v , . . . , v d ) defined as follows, v j = (cid:26) s j j ∈ At j j ∈ B Observe that Q cv ( x ) > v ∈ R + ( x ) . Moreover the coordinate i ∈ B and thus v i = t i = (cid:73) Let N be even. In this case we have that t i = − ( N − ) and s i =
1. Now partition thecoordinates [ d ] as follows, A = (cid:8) j | p j ≤ p i (cid:9) and B = (cid:8) j | p i < p j (cid:9) If B = ∅ then notice that for the source vertex s c , P s c ( x ) >
0, since p i <
1, by the fact that x i > − ( N − ) . Thus the lemma follows since s i =
1. In case B (cid:54) = ∅ consider thecorner v = ( v , . . . , v d ) defined as follows, v j = (cid:26) s j j ∈ At j j ∈ B Observe that Q cv ( x ) > v ∈ R + ( x ) . Moreover the coordinate i ∈ A and thus v i = s i = D Constructing the Turing Machine – Proof of Theorem 7.6
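The corner-selection rule used throughout the proof above is simple enough to state as code. The following is a minimal Python sketch, assuming the canonical representation $p$, the source $s^c$ and the target $t^c$ are given as plain lists; the function name `select_corner` and the `strict` flag (which switches between the two partitions $A$, $B$ used in the different cases) are ours, not from the paper.

```python
def select_corner(p, s, t, i, strict=False):
    """Pick the corner v of the cubelet as in the proof of Lemma 8.12:
    coordinates j whose p_j falls on the 'small' side of the pivot p_i take
    the source coordinate s_j, the remaining ones take the target coordinate t_j."""
    v = []
    for j in range(len(p)):
        # A = {j : p_j <= p_i} (or p_j < p_i in the strict variant); B is the complement
        in_A = (p[j] < p[i]) if strict else (p[j] <= p[i])
        v.append(s[j] if in_A else t[j])
    return v
```

Note that the pivot coordinate $i$ always lands in $A$ under the non-strict comparison, so the constructed corner satisfies $v_i = s_i$, exactly as the proof requires.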
D Constructing the Turing Machine – Proof of Theorem 7.6

In this section we prove Theorem 7.6, establishing that both the function $f_{C_l}(x,y)$ of Definition 7.4 and its gradient are computable by a polynomial-time Turing Machine. We prove Theorem 7.6 through a series of lemmas. To simplify notation we set $b \triangleq \log(1/\varepsilon)$.

Definition D.1. For $x \in \mathbb{R}$, we denote by $[x]_b \in \mathbb{R}$ a value, represented by $b$ bits, such that $\left| [x]_b - x \right| \le 2^{-b}$.
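Definition D.1 is the usual notion of fixed-point truncation. As a minimal illustration, the following Python sketch realizes one valid choice of $[x]_b$ by rounding to $b$ fractional bits over exact rationals (the name `bits_b` is ours):

```python
from fractions import Fraction

def bits_b(x: Fraction, b: int) -> Fraction:
    """One valid choice of [x]_b: round x to the nearest multiple of 2^-b,
    so that |bits_b(x, b) - x| <= 2^-b as Definition D.1 requires."""
    scale = 2 ** b
    return Fraction(round(x * scale), scale)
```

Rounding to the nearest multiple of $2^{-b}$ incurs error at most $2^{-b-1} \le 2^{-b}$, so any such rounding is a legitimate instance of $[\cdot]_b$.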
Lemma D.2. There exist Turing Machines $M_{S_\infty}$ and $M_{S'_\infty}$ that, given $x \in [0,1]$ and $\varepsilon$ in binary form, compute $[S_\infty(x)]_b$ and $[S'_\infty(x)]_b$ in time polynomial in $b = \log(1/\varepsilon)$ and the binary representation of $x$.

Proof. The Turing Machine $M_{S_\infty}$ outputs the first $b$ bits of the following quantity:

$$ W(x) = \left[ \frac{1}{1 + \left[ 2^{\left[ \frac{1-2x}{x(1-x)} \right]_{b'}} \right]_{b'}} \right]_{b'} $$

where $b'$ will be selected sufficiently large. Notice that it is possible to compute the above quantity due to the fact that all of the functions $\frac{1-2\gamma}{\gamma(1-\gamma)}$, $2^{\gamma}$ and $\frac{1}{1+\gamma}$ can be computed with accuracy $2^{-b'}$ in time polynomial in $b'$ and the binary representation of $\gamma$ [Bre76]. Moreover, writing $g(x) = \frac{1-2x}{x(1-x)}$,

$$ \left| W(x) - S_\infty(x) \right| \le 2^{-b'} + \left| \frac{1}{1 + \left[ 2^{[g(x)]_{b'}} \right]_{b'}} - \frac{1}{1 + 2^{[g(x)]_{b'}}} \right| + \left| \frac{1}{1 + 2^{[g(x)]_{b'}}} - \frac{1}{1 + 2^{g(x)}} \right| \le 2^{-b'} + 2^{-b'} + \ln 2 \cdot \left| [g(x)]_{b'} - g(x) \right| \le 3 \cdot 2^{-b'}, $$

where the first inequality follows from the triangle inequality and the second follows from the facts that $1/(1+\gamma)$ is a $1$-Lipschitz function of $\gamma$ for $\gamma \ge 0$, and $1/(1+2^{\gamma})$ is an $\ln(2)$-Lipschitz function of $\gamma$. The last inequality follows from the definition of $[\cdot]_{b'}$. Hence $W(x)$ is indeed equal to $[S_\infty(x)]_b$ if we choose $b' = b + 2$.

We next describe how $M_{S'_\infty}$ computes $[S'_\infty(x)]_b$. First notice that $S'_\infty(x)$ is equal to

$$ S'_\infty(x) = \ln 2 \cdot \frac{ \frac{2^{-\frac{1}{x(1-x)}}}{x^2} + \frac{2^{-\frac{1}{x(1-x)}}}{(x-1)^2} }{ \left( 2^{-\frac{1}{x}} + 2^{-\frac{1}{1-x}} \right)^2 } $$

(differentiate $S_\infty$ and multiply both the numerator and the denominator by $2^{-2/x}$; in this form all the quantities appearing are bounded on $(0,1)$). To describe how to compute $S'_\infty(x)$ we first assume that we have computed the following quantities. Then, based on these quantities, we show how $S'_\infty(x)$ can be computed, and finally we consider the computation of these quantities.

▷ $[\ln 2]_{b'}$,
▷ $A \gets \left[ 2^{-\frac{1}{x(1-x)}} / x^2 \right]_{b'}$,
▷ $B \gets \left[ 2^{-\frac{1}{x(1-x)}} / (x-1)^2 \right]_{b'}$,
▷ $C \gets \left[ \left( 2^{-\frac{1}{x}} + 2^{-\frac{1}{1-x}} \right)^2 \right]_{b'}$.

Then $M_{S'_\infty}$ outputs the first $b$ bits of the quantity $\left[ [\ln 2]_{b'} \cdot \left[ \frac{A+B}{C} \right]_{b'} \right]_{b'}$. We now prove that

$$ \left| [\ln 2]_{b'} \cdot \left[ \frac{A+B}{C} \right]_{b'} - \ln 2 \cdot \frac{A+B}{C} \right| \le \Theta\left( 2^{-b'} \right), $$

where $\ln 2 \cdot \frac{A+B}{C}$ plays the role of $S'_\infty(x)$. Consider the function $g(\alpha, \beta, \gamma) = \frac{\alpha + \beta}{\gamma}$ where $|\alpha|, |\beta| \le c_1$ and $|\gamma| \ge c_2$ for universal constants $c_1, c_2$. Notice that $g$ is $c_3$-Lipschitz, where $c_3$ is a constant that depends only on $c_1$ and $c_2$. Since for sufficiently large $b'$ all of the quantities $|A|$, $|B|$, $\left| 2^{-\frac{1}{x(1-x)}}/x^2 \right|$, $\left| 2^{-\frac{1}{x(1-x)}}/(x-1)^2 \right|$ are at most $c_1$, while $|C|$ and $\left( 2^{-\frac{1}{x}} + 2^{-\frac{1}{1-x}} \right)^2$ are at least $c_2$ for universal constants $c_1, c_2$, we get that

$$ \left| \left[ \frac{A+B}{C} \right]_{b'} - \frac{A+B}{C} \right| \le \Theta\left( 2^{-b'} \right). $$

Now consider the function $g(\alpha, \beta) = \alpha \cdot \beta$ where $|\alpha|, |\beta| \le c$ for a universal constant $c$. In this case $g$ is $\sqrt{2}\,c$-Lipschitz continuous. Since for $b'$ sufficiently large all of the quantities $|[\ln 2]_{b'}|$, $\left| \left[ \frac{A+B}{C} \right]_{b'} \right|$, $\ln 2$, $\left| \frac{A+B}{C} \right|$ are bounded by a universal constant $c$, we have that

$$ \left| [\ln 2]_{b'} \cdot \left[ \frac{A+B}{C} \right]_{b'} - \ln 2 \cdot \frac{A+B}{C} \right| \le \Theta\left( 2^{-b'} \right). $$

Next we explain how the values $A$, $B$ and $C$ are computed; $[\ln 2]_{b'}$ can easily be computed via standard techniques [Bre76].

▶ Computation of $A$. The Turing Machine $M_{S'_\infty}$ computes $A$ by taking the first $b'$ bits of the quantity

$$ \left[ 2^{\left[ -\frac{1}{x(1-x)} + \frac{2 \ln(1/x)}{\ln 2} \right]_{b''}} \right]_{b''} $$

where $b''$ will be taken sufficiently large. We remark that both the exponentiation and the natural logarithm can be computed in polynomial time with respect to the number of accuracy bits and the binary representation of the input [Bre76]. The function $2^{-\frac{1}{x(1-x)}}/x^2 = 2^{-\frac{1}{x(1-x)} + \frac{2\ln(1/x)}{\ln 2}}$ is $c$-Lipschitz, where $c$ is a universal constant. Thus,

$$ \left| \left[ 2^{\left[ -\frac{1}{x(1-x)} + \frac{2\ln(1/x)}{\ln 2} \right]_{b''}} \right]_{b''} - \frac{2^{-\frac{1}{x(1-x)}}}{x^2} \right| \le \Theta\left( 2^{-b''} \right). $$

▶ Computation of $B$. Using the same arguments as for $A$.

▶ Computation of $C$. To compute $C$ we first compute $b''$ bits of the following quantity:

$$ \left( \left[ 2^{-1/[x]_{b''}} \right]_{b''} + \left[ 2^{1/([x]_{b''} - 1)} \right]_{b''} \right)^2. $$

We first argue that

$$ \left| \left[ \left( \left[ 2^{-1/[x]_{b''}} \right]_{b''} + \left[ 2^{1/([x]_{b''} - 1)} \right]_{b''} \right)^2 \right]_{b''} - \left( 2^{-\frac{1}{x}} + 2^{-\frac{1}{1-x}} \right)^2 \right| \le \Theta\left( 2^{-b''} \right). $$

The latter follows by applying the triangle inequality together with the following three inequalities, where we abbreviate $u = \left[ 2^{-1/[x]_{b''}} \right]_{b''}$ and $v = \left[ 2^{1/([x]_{b''}-1)} \right]_{b''}$.

1. $\left| \left[ (u+v)^2 \right]_{b''} - (u+v)^2 \right| \le 2^{-b''}$, by the definition of $[\cdot]_{b''}$.

2. $\left| (u+v)^2 - \left( 2^{-1/[x]_{b''}} + 2^{1/([x]_{b''}-1)} \right)^2 \right| \le \Theta\left( 2^{-b''} \right)$. This holds since for $b''$ larger than a universal constant both sums are upper bounded by $2$, while the function $g(\alpha) = \alpha^2$ is $4$-Lipschitz for $|\alpha| \le 2$.

3. $\left| \left( 2^{-1/[x]_{b''}} + 2^{1/([x]_{b''}-1)} \right)^2 - \left( 2^{-\frac{1}{x}} + 2^{-\frac{1}{1-x}} \right)^2 \right| \le \Theta\left( 2^{-b''} \right)$. This holds since the functions $\alpha \mapsto 2^{-1/\alpha}$ and $\alpha \mapsto 2^{1/(\alpha - 1)}$ are $\Theta(1)$-Lipschitz on $[0,1]$, while the squaring is again applied to arguments upper bounded by $2$.

This concludes the proof of the lemma.
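To make the object of Lemma D.2 concrete, here is a small Python sketch of the step function as reconstructed above, $S_\infty(x) = 1/\big(1 + 2^{(1-2x)/(x(1-x))}\big)$, evaluated in ordinary floating point. The floats stand in for the $b'$-bit intermediate values of the lemma, and the clipping of the exponent is only an overflow guard, not part of the construction.

```python
def S_inf(x: float) -> float:
    """S_inf(x) = 1 / (1 + 2^((1-2x)/(x(1-x)))), extended by 0 at x <= 0
    and by 1 at x >= 1; a smooth step increasing from 0 to 1 on [0, 1]."""
    if x <= 0.0:
        return 0.0
    if x >= 1.0:
        return 1.0
    e = (1.0 - 2.0 * x) / (x * (1.0 - x))
    # overflow guard: for |e| this large the value is numerically 0 or 1 anyway
    if e > 1000.0:
        return 0.0
    if e < -1000.0:
        return 1.0
    return 1.0 / (1.0 + 2.0 ** e)
```

One can check directly that $S_\infty(1/2) = 1/2$ and that $S_\infty(1-x) = 1 - S_\infty(x)$, since the exponent is antisymmetric about $x = 1/2$.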
Lemma D.3. There exist Turing Machines $M_Q$ and $M_{Q'}$ that, given $x \in [0,1]^d$ and $\varepsilon > 0$ in binary form, respectively compute $[Q^c_v(x)]_b$ and $[\nabla Q^c_v(x)]_b$ for all vertices $v \in ([N]-1)^d$ with $Q^c_v(x) > 0$, where $b = \log(1/\varepsilon)$. These vertices are at most $d+1$. Moreover, both $M_Q$ and $M_{Q'}$ run in polynomial time with respect to $b$, $d$ and the binary representation of $x$.

Proof. Both $M_Q$ and $M_{Q'}$ first compute the canonical representation $p^c_x \in [0,1]^d$ with respect to the cell $R(x)$ in which $x$ lies. The cell $R(x)$ can be computed by taking the first $(\log N + 1)$ bits of each coordinate of $x$. The source vertex $s^c = (s_1, \dots, s_d)$ and the target vertex $t^c = (t_1, \dots, t_d)$ with respect to $R(x)$ are also computed. Once this is done, we are only interested in the vertices $v \in R^c(x)$ for which $p_\ell > p_j$ for all $\ell \in A^c_v$, $j \in B^c_v$; for every other $v \in ([N]-1)^d$, both $Q^c_v(x) = 0$ and $\nabla Q^c_v(x) = 0$. These vertices, denoted by $R^+(x)$, are at most $d+1$ and can be computed in polynomial time as follows: (i) the coordinates $p_1, \dots, p_d$ are sorted in increasing order; (ii) for each $m = 0, \dots, d$ compute the vertex $v^m \in R^c(x)$ with

$$ v^m_j = \begin{cases} s_j & \text{if coordinate } j \text{ belongs in the first } m \text{ coordinates w.r.t. the order of } p^c_x \\ t_j & \text{if coordinate } j \text{ belongs in the last } d - m \text{ coordinates w.r.t. the order of } p^c_x. \end{cases} $$

By Definition 8.7 it immediately follows that $R^+(x) \subseteq \bigcup_{m=0}^{d} \{ v^m \}$, which also establishes that $|R^+(x)| \le d+1$.

Once $R^+(x)$ is computed, $M_Q$ computes for each pair $(\ell, j) \in B^c_v \times A^c_v$ the value $\left[ S_\infty(S(p_\ell) - S(p_j)) \right]_{b'}$, for some accuracy $b'$ that we determine later and which depends polynomially on $b$, $d$ and the input accuracy of $x$. Then for each $v \in R^+(x)$, $M_Q$ outputs as $[Q^c_v(x)]_b$ the first $b$ bits of the quantity

$$ \left[ \prod_{\ell \in B^c_v,\, j \in A^c_v} \left[ S_\infty(S(p_\ell) - S(p_j)) \right]_{b'} \right]_{b'} $$

where $b'$ is selected sufficiently large. We next prove that this computation indeed outputs $[Q^c_v(x)]_b$ accurately. To simplify notation let $S_{\ell j}$ denote $S_\infty(S(p_\ell) - S(p_j))$, and let $A$, $B$ denote $A^c_v$, $B^c_v$. Then,

$$ \left| \left[ \prod_{\ell \in B, j \in A} [S_{\ell j}]_{b'} \right]_{b'} - \prod_{\ell \in B, j \in A} S_{\ell j} \right| \le \left| \left[ \prod_{\ell \in B, j \in A} [S_{\ell j}]_{b'} \right]_{b'} - \prod_{\ell \in B, j \in A} [S_{\ell j}]_{b'} \right| + \left| \prod_{\ell \in B, j \in A} [S_{\ell j}]_{b'} - \prod_{\ell \in B, j \in A} S_{\ell j} \right| \le 2^{-b'} + \left| \prod_{\ell \in B, j \in A} [S_{\ell j}]_{b'} - \prod_{\ell \in B, j \in A} S_{\ell j} \right|. $$

Consider the function $g(y) = \prod_{\ell \in B, j \in A} y_{\ell j}$. For $y \in [0, 1 + 1/d]^{|A| \times |B|}$ we have $\| \nabla g(y) \| \le \Theta(d)$. As a result, for all $y, z \in [0, 1 + 1/d]^{|A| \times |B|}$,

$$ |g(y) - g(z)| \le \Theta(d) \cdot \left[ \sum_{\ell \in B, j \in A} (y_{\ell j} - z_{\ell j})^2 \right]^{1/2}. $$

In case the accuracy $b' \ge \Theta(\log d)$, then $[S_{\ell j}]_{b'} \le S_{\ell j} + 1/d \le 1 + 1/d$ and the above inequality applies. Thus,

$$ \left| \prod_{\ell \in B, j \in A} [S_{\ell j}]_{b'} - \prod_{\ell \in B, j \in A} S_{\ell j} \right| \le \Theta(d) \cdot \left[ \sum_{\ell \in B, j \in A} \left( [S_{\ell j}]_{b'} - S_{\ell j} \right)^2 \right]^{1/2} \le \Theta(d^2) \cdot 2^{-b'}. $$

Overall, $\left| \left[ \prod_{\ell \in B, j \in A} [S_{\ell j}]_{b'} \right]_{b'} - \prod_{\ell \in B, j \in A} S_{\ell j} \right| \le \Theta(d^2) \cdot 2^{-b'}$, which concludes the proof of the correctness of $[Q^c_v(x)]_b$ by selecting $b' = b + \Theta(\log d)$.

In order to compute $\frac{\partial Q^c_v(x)}{\partial x_i}$ where $i \in B^c_v$ (symmetrically for $i \in A^c_v$), $M_{Q'}$ additionally computes the values $\left[ S'_\infty(S(p_i) - S(p_j)) \right]_{b'}$ with accuracy $b'$. To simplify notation we denote $S'_\infty(S(p_i) - S(p_j))$ by $S'_{ij}$ and $S'(p_i)$ by $S'_i$. Then $M_{Q'}$ outputs

$$ \left[ \frac{\partial Q^c_v(x)}{\partial x_i} \right]_{b'} \gets \left[ \frac{1}{t_i - s_i} \cdot \left[ \frac{\partial Q^c_v(x)}{\partial p_i} \right]_{b'} \right]_{b'} \quad \text{where} \quad \left[ \frac{\partial Q^c_v(x)}{\partial p_i} \right]_{b'} \gets \left[ \sum_{j \in A} [S'_{ij}]_{b'} \cdot [S'_i]_{b'} \cdot \prod_{\substack{\ell \in B,\, m \in A \\ (\ell, m) \ne (i, j)}} [S_{\ell m}]_{b'} \right]_{b'}. $$

Observe that $t_i - s_i = \mathrm{sign}(t_i - s_i)/(N-1)$, and thus $\frac{1}{t_i - s_i} \cdot \left[ \frac{\partial Q^c_v(x)}{\partial p_i} \right]_{b'}$ can be exactly computed. We next prove that these computations of $\left[ \frac{\partial Q^c_v(x)}{\partial x_i} \right]_{b'}$ and $\left[ \frac{\partial Q^c_v(x)}{\partial p_i} \right]_{b'}$ are correct.

We first bound $\left| [S'_{ij}]_{b'} \cdot [S'_i]_{b'} \cdot \prod_{(\ell, m) \ne (i,j)} [S_{\ell m}]_{b'} - S'_{ij} \cdot S'_i \cdot \prod_{(\ell, m) \ne (i,j)} S_{\ell m} \right|$. Consider the function $g(y_1, y_2, y_3) = y_1 \cdot y_2 \cdot \prod_{(\ell, m)} (y_3)_{\ell m}$. As previously done, for $y_1, y_2 \in [0, 6]$ and $y_3 \in [0, 1 + 1/d]^{|A| \times |B| - 1}$ we have that $\| \nabla g(y_1, y_2, y_3) \| \le \Theta(d)$. If $b' \ge \Theta(\log d)$ then $|S'_{ij}|, S'_i \le 6$ and $S_{\ell m} \in [0, 1 + 1/d]$. As a result,

$$ \left| [S'_{ij}]_{b'} \cdot [S'_i]_{b'} \cdot \prod_{\substack{\ell \in B, m \in A \\ (\ell, m) \ne (i,j)}} [S_{\ell m}]_{b'} - S'_{ij} \cdot S'_i \cdot \prod_{\substack{\ell \in B, m \in A \\ (\ell, m) \ne (i,j)}} S_{\ell m} \right| \le \Theta(d) \cdot 2^{-b'}. $$

We can now use the above inequality to bound $\left| \left[ \frac{\partial Q^c_v(x)}{\partial p_i} \right]_{b'} - \frac{\partial Q^c_v(x)}{\partial p_i} \right|$. More precisely,

$$ \left| \left[ \frac{\partial Q^c_v(x)}{\partial p_i} \right]_{b'} - \frac{\partial Q^c_v(x)}{\partial p_i} \right| \le 2^{-b'} + \left| \sum_{j \in A} [S'_{ij}]_{b'} \cdot [S'_i]_{b'} \cdot \prod_{\substack{(\ell, m) \ne (i,j)}} [S_{\ell m}]_{b'} - \sum_{j \in A} S'_{ij} \cdot S'_i \cdot \prod_{\substack{(\ell, m) \ne (i,j)}} S_{\ell m} \right| \le \Theta(d^2) \cdot 2^{-b'}. $$

We finally get that

$$ \left| \left[ \frac{\partial Q^c_v(x)}{\partial x_i} \right]_{b'} - \frac{\partial Q^c_v(x)}{\partial x_i} \right| \le 2^{-b'} + N \cdot \left| \left[ \frac{\partial Q^c_v(x)}{\partial p_i} \right]_{b'} - \frac{\partial Q^c_v(x)}{\partial p_i} \right| \le \Theta(N d^2) \cdot 2^{-b'}. $$

Thus the analysis is completed by selecting $b' = b + \Theta(\log d) + \Theta(\log N)$.
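The enumeration of the candidate vertices $v^0, \dots, v^d$ in the proof of Lemma D.3 is a plain sorting argument, which can be sketched as follows (Python; the function name is ours, and the coordinates are assumed given as lists):

```python
def candidate_vertices(p, s, t):
    """Enumerate the d+1 candidate vertices v^0, ..., v^d of Lemma D.3:
    sort the coordinates of p in increasing order; v^m takes the source
    coordinate on the first m coordinates w.r.t. that order and the target
    coordinate on the remaining d-m coordinates."""
    d = len(p)
    order = sorted(range(d), key=lambda j: p[j])  # indices by increasing p_j
    vertices = []
    for m in range(d + 1):
        v = list(t)            # start from the target on every coordinate
        for j in order[:m]:    # overwrite the first m (smallest p_j) with the source
            v[j] = s[j]
        vertices.append(v)
    return vertices
```

Only these $d+1$ vertices can have $Q^c_v(x) > 0$, which is what keeps the whole computation polynomial in $d$ despite the cubelet having $2^d$ corners.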
Lemma D.4. There exist Turing Machines $M_P$ and $M_{P'}$ that, given $x \in [0,1]^d$ and $\varepsilon > 0$ in binary form, compute $[P_v(x)]_b$ and $[\nabla P_v(x)]_b$ respectively for all vertices $v \in ([N]-1)^d$ with $P_v(x) > 0$, where $b = \log(1/\varepsilon)$. These vertices are at most $d+1$. Moreover, both $M_P$ and $M_{P'}$ run in polynomial time with respect to $b$, $d$ and the binary representation of $x$.

Proof. $M_P$ first runs the machine $M_Q$ of Lemma D.3 to find the coefficients $Q^c_v(x) > 0$. We remind the reader that these vertices are denoted by $R^+(x)$ and that $|R^+(x)| \le d+1$. Then for each $v \in R^+(x)$, $M_P$ outputs as $[P_v(x)]_b$ the first $b$ bits of the quantity

$$ \left[ \frac{[Q^c_v(x)]_{b'}}{\sum_{v' \in R^+(x)} [Q^c_{v'}(x)]_{b'}} \right]_{b'} $$

where $b'$ is determined later in the proof and is chosen to be polynomial in $b$ and $d$. We next present the proof that the above expression correctly computes $[P_v(x)]_b$. For accuracy $b' \ge \Theta(d \log d)$ we get that

$$ \sum_{v' \in R^+(x)} [Q^c_{v'}(x)]_{b'} \ge \sum_{v' \in R^+(x)} Q^c_{v'}(x) - \Theta(d) \cdot 2^{-b'} = \sum_{v' \in R^c(x)} Q^c_{v'}(x) - \Theta(d) \cdot 2^{-b'} \ge \Theta\left( d^{-d} \right) - \Theta(d) \cdot 2^{-b'} \ge \Theta\left( d^{-d} \right). $$

Consider the function $g(y) = y_i / \sum_{j=1}^{d+1} y_j$. Notice that for $y \in [0,1]^{d+1}$ with $\sum_{j=1}^{d+1} y_j \ge \mu$ we have $\| \nabla g(y) \| \le \Theta(d/\mu)$. The latter implies that for $y, z \in [0,1]^{d+1}$ such that $\sum_{j=1}^{d+1} y_j \ge \mu$ and $\sum_{j=1}^{d+1} z_j \ge \mu$, it holds that

$$ \left| \frac{y_i}{\sum_{j=1}^{d+1} y_j} - \frac{z_i}{\sum_{j=1}^{d+1} z_j} \right| \le \Theta\left( \frac{d}{\mu} \right) \cdot \| y - z \|. $$

Since there are at most $d+1$ vertices $v' \in R^+(x)$, while both $\sum_{v' \in R^+(x)} [Q^c_{v'}(x)]_{b'}$ and $\sum_{v' \in R^+(x)} Q^c_{v'}(x)$ are greater than $\Theta(d^{-d})$, we can apply the above inequality with $\mu = \Theta(d^{-d})$ and we get

$$ \left| \frac{[Q^c_v(x)]_{b'}}{\sum_{v' \in R^+(x)} [Q^c_{v'}(x)]_{b'}} - \frac{Q^c_v(x)}{\sum_{v' \in R^+(x)} Q^c_{v'}(x)} \right| \le \Theta\left( d^{d+1} \right) \cdot \left[ \sum_{v' \in R^+(x)} \left( [Q^c_{v'}(x)]_{b'} - Q^c_{v'}(x) \right)^2 \right]^{1/2} \le \Theta\left( d^{d+1} \right) \cdot 2^{-b'}. $$

Overall, by the triangle inequality, we have that

$$ \left| \left[ \frac{[Q^c_v(x)]_{b'}}{\sum_{v' \in R^+(x)} [Q^c_{v'}(x)]_{b'}} \right]_{b'} - \frac{Q^c_v(x)}{\sum_{v' \in R^c(x)} Q^c_{v'}(x)} \right| \le \Theta\left( d^{d+1} \right) \cdot 2^{-b'}. $$

The proof is completed by selecting $b' = b + \Theta(d \log d)$.

In order to compute $\frac{\partial P_v(x)}{\partial x_i}$, the Turing Machine $M_{P'}$ computes for all vertices in $R^+(x)$ the coefficients $\frac{\partial Q^c_v(x)}{\partial x_i}$ with accuracy $b'$. Then for each $v \in R^+(x)$ the Turing Machine $M_{P'}$ outputs

$$ \left[ \frac{\partial P_v(x)}{\partial x_i} \right]_{b'} \gets \left[ \frac{1}{t_i - s_i} \cdot \left[ \frac{\partial P_v(x)}{\partial p_i} \right]_{b'} \right]_{b'} \quad \text{where} \quad \left[ \frac{\partial P_v(x)}{\partial p_i} \right]_{b'} \gets \left[ \frac{ \left[ \frac{\partial Q_v(x)}{\partial p_i} \right]_{b'} \cdot \sum_{v' \in R^+(x)} [Q_{v'}(x)]_{b'} - [Q_v(x)]_{b'} \cdot \sum_{v' \in R^+(x)} \left[ \frac{\partial Q_{v'}(x)}{\partial p_i} \right]_{b'} }{ \left( \sum_{v' \in R^+(x)} [Q_{v'}(x)]_{b'} \right)^2 } \right]_{b'}. $$

Similarly to the above, and as in Lemma D.3, we can prove that if $b' \ge b + \Theta(d \log d) + \Theta(\log N)$ then $\left| \left[ \frac{\partial P_v(x)}{\partial p_i} \right]_{b'} - \frac{\partial P_v(x)}{\partial p_i} \right| \le 2^{-b}$.
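The normalization step performed by $M_P$, which turns the coefficients $Q^c_v(x)$ into the convex weights $P_v(x) = Q^c_v(x) / \sum_{v'} Q^c_{v'}(x)$, can be sketched as follows. This is only an illustration: we use exact rationals in place of the $b'$-bit arithmetic of the lemma, the name `normalize` is ours, and the rounded weights need not sum to exactly $1$ in general.

```python
from fractions import Fraction

def normalize(q_values, guard_bits):
    """Given the positive coefficients [Q^c_v(x)]_{b'}, return the weights
    [P_v(x)] = Q_v / sum_v' Q_v', each rounded to guard_bits fractional bits."""
    scale = 2 ** guard_bits
    total = sum(q_values)
    return [Fraction(round(Fraction(q) / total * scale), scale) for q in q_values]
```

Taking `guard_bits` larger than the target accuracy by the $\Theta(d \log d)$ slack of the lemma absorbs the error introduced by truncating each $Q^c_v(x)$ before dividing.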
Proof of Theorem 7.6. Let $R(x)$ be the cell in which $x$ lies. The Turing Machine $M_{f_{C_l}}$ initially calculates the vertices $v \in R^c(x)$ with coefficient $P_v(x) > 0$. We remind the reader that this set is denoted by $R^+(x)$ and that $|R^+(x)| \le d+1$. Then $M_{f_{C_l}}$ outputs the first $b$ bits of the following quantity:

$$ \left[ f_{C_l}(x, y) \right]_{b'} = \sum_{j=1}^{d} [\alpha(x, j)]_{b'} \cdot (x_j - y_j) \quad \text{where} \quad [\alpha(x, j)]_{b'} = \sum_{v \in R^+(x)} C_l(v, j) \cdot [P_v(x)]_{b'}. $$

We next prove that the above computation is correct:

$$ \left| \left[ f_{C_l}(x, y) \right]_{b'} - f_{C_l}(x, y) \right| = \left| \sum_{j=1}^{d} [\alpha(x, j)]_{b'} \cdot (x_j - y_j) - \sum_{j=1}^{d} \alpha(x, j) \cdot (x_j - y_j) \right| \le \sum_{j=1}^{d} \left| [\alpha(x, j)]_{b'} - \alpha(x, j) \right| = \sum_{j=1}^{d} \left| \sum_{v \in R^+(x)} C_l(v, j) \cdot \left( [P_v(x)]_{b'} - P_v(x) \right) \right| \le \sum_{j=1}^{d} \sum_{v \in R^+(x)} \left| [P_v(x)]_{b'} - P_v(x) \right| \le d \cdot (d+1) \cdot 2^{-b'}. $$

Setting $b' = b + \Theta(\log d)$ we get the desired result. The partial derivatives $\frac{\partial f_{C_l}(x, y)}{\partial x_i}$ and $\frac{\partial f_{C_l}(x, y)}{\partial y_i}$ are computed similarly.

E Convergence of PGD to Approximate Local Minimum