The Complexity of Constrained Min-Max Optimization
Constantinos Daskalakis, Stratis Skoulakis, Manolis Zampetakis
Constantinos Daskalakis (MIT) [email protected]
Stratis Skoulakis (SUTD) [email protected]
Manolis Zampetakis (MIT) [email protected]
September 22, 2020
Abstract
Despite its important applications in Machine Learning, min-max optimization of objective functions that are nonconvex-nonconcave remains elusive. Not only are there no known first-order methods converging even to approximate local min-max points, but the computational complexity of identifying them is also poorly understood. In this paper, we provide a characterization of the computational complexity of the problem, as well as of the limitations of first-order methods in constrained min-max optimization problems with nonconvex-nonconcave objectives and linear constraints.

As a warm-up, we show that, even when the objective is a Lipschitz and smooth differentiable function, deciding whether a min-max point exists, in fact even deciding whether an approximate min-max point exists, is NP-hard. More importantly, we show that an approximate local min-max point of large enough approximation is guaranteed to exist, but finding one such point is PPAD-complete. The same is true of computing an approximate fixed point of the (Projected) Gradient Descent/Ascent update dynamics.

An important byproduct of our proof is to establish an unconditional hardness result in the Nemirovsky-Yudin [NY83] oracle optimization model. We show that, given oracle access to some function f : P → [−1, 1] and its gradient ∇f, where P ⊆ [0, 1]^d is a known convex polytope, every algorithm that finds an ε-approximate local min-max point needs to make a number of queries that is exponential in at least one of 1/ε, L, G, or d, where L and G are respectively the smoothness and Lipschitzness of f and d is the dimension. This comes in sharp contrast to minimization problems, where finding approximate local minima in the same setting can be done with Projected Gradient Descent using O(L/ε) many queries. Our result is the first to show an exponential separation between these two fundamental optimization problems in the oracle model.
Introduction
Min-Max Optimization has played a central role in the development of Game Theory [vN28], Convex Optimization [Dan51, Adl13], and Online Learning [Bla56, CBL06, SS12, BCB12, SSBD14, Haz16]. In its general constrained form, it can be written down as follows:

    min_{x ∈ R^d} max_{y ∈ R^d} f(x, y)    (1.1)
    s.t. g(x, y) ≤ 0

where f : R^d × R^d → [−B, B] with B ∈ R+, and g : R^d × R^d → R is typically taken to be a convex function, so that the constraint set g(x, y) ≤ 0 is convex. In this paper we take g to be linear, so the constraint set is a polytope; thus projecting onto this set and checking feasibility of a point with respect to this set can both be done in polynomial time.

The goal in (1.1) is to find a feasible pair (x*, y*), i.e., g(x*, y*) ≤ 0, that satisfies the following:

    f(x*, y*) ≤ f(x, y*),  for all x s.t. g(x, y*) ≤ 0;    (1.2)
    f(x*, y*) ≥ f(x*, y),  for all y s.t. g(x*, y) ≤ 0.    (1.3)

It is well-known that, when f(x, y) is a convex-concave function, i.e., f is convex in x for all y and concave in y for all x, Problem (1.1) is guaranteed to have a solution under compactness of the constraint set [vN28, Ros65], while computing a solution is amenable to convex programming. In fact, if f is L-smooth, the problem can be solved via first-order methods, which are iterative, only access f through its gradient, and achieve an approximation error of poly(L, 1/T) in T iterations; see e.g. [Kor76, Nem04]. When the function is strongly convex-strongly concave, the rate becomes geometric [FP07].

Unfortunately, our ability to solve Problem (1.1) remains rather poor in settings where our objective function f is not convex-concave. This is emerging as a major challenge in Deep Learning, where min-max optimization has recently found many important applications, such as training Generative Adversarial Networks (see e.g. [GPM+14, ACB17]) and robustifying deep neural network-based models against adversarial attacks (see e.g. [MMS+17, JGN+17, LPP+]).

Footnote: In general, the access to the constraints g by these methods is more involved, namely through an optimization oracle that optimizes convex functions (in fact, quadratic suffices) over g(x, y) ≤ 0. In the settings considered in this paper g is linear, and these tasks are computationally straightforward.

Footnote: In the stated error rate, we are suppressing factors that depend on the diameter of the feasible set. Moreover, the stated error of ε(L, T) := poly(L, 1/T) reflects that these methods return an approximate min-max solution, wherein the inequalities on the LHS of (1.2) and (1.3) are satisfied to within an additive ε(L, T).

The goal of this paper is to shed light on the complexity of min-max optimization problems, and to elucidate its difference from minimization and maximization problems. As far as the latter are concerned, without loss of generality we focus on minimization problems, as maximization problems behave exactly the same; we will also think of minimization problems in the framework of (1.1), where the variable y is absent, i.e., its dimension is 0. An important driver of our comparison between min-max optimization and minimization is, of course, the nature of the objective. So let us discuss:
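As a concrete instance of the first-order methods mentioned above for the convex-concave case, here is a minimal sketch (our own illustration, not from the paper) of simultaneous gradient descent-ascent on the bilinear, convex-concave objective f(x, y) = x·y, whose unique min-max point is (0, 0). The last iterate spirals away from the equilibrium, while the average of the iterates approaches it.

```python
# Simultaneous gradient descent-ascent (GDA) on f(x, y) = x * y.
# df/dx = y (the min player descends), df/dy = x (the max player ascends).
# The unique min-max point is (0, 0).

def gda(x, y, eta=0.01, steps=2000):
    xs, ys = [], []
    for _ in range(steps):
        x, y = x - eta * y, y + eta * x   # simultaneous update
        xs.append(x)
        ys.append(y)
    return (x, y), (sum(xs) / steps, sum(ys) / steps)

(last_x, last_y), (avg_x, avg_y) = gda(1.0, 1.0)
# the last iterate has drifted away from (0, 0); the average is close to it
```

For bilinear objectives this average-iterate convergence is classical; obtaining such guarantees, or any guarantees at all, is exactly what breaks down once the objective is no longer convex-concave.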
Convex-Concave Objective.
The benign setting for min-max optimization is that where the objective function is convex-concave, while the benign setting for minimization is that where the objective function is convex. In their corresponding benign settings, the two problems behave quite similarly from a computational perspective, in that they are amenable to convex programming, as well as to first-order methods which only require gradient information about the objective function. Moreover, in their benign settings, both problems have guaranteed existence of a solution under compactness of the constraint set. Finally, it is clear how to define approximate solutions: we just relax the inequalities on the left hand side of (1.2) and (1.3) by some ε > 0.

Nonconvex-Nonconcave Objective.
By contrast, the challenging setting for min-max optimization is that where the objective is not convex-concave, while the challenging setting for minimization is that where the objective is not convex. In these challenging settings, the behavior of the two problems diverges significantly. The first difference is that, while a solution to a minimization problem is still guaranteed to exist under compactness of the constraint set even when the objective is not convex, a solution to a min-max problem is not guaranteed to exist when the objective is not convex-concave, even under compactness of the constraint set. A trivial example is this: min_{x ∈ [0,1]} max_{y ∈ [0,1]} (x − y)^2. Unsurprisingly, we show that checking whether a min-max optimization problem has a solution is NP-hard. In fact, we show that checking whether there is an approximate min-max solution is NP-hard, even when the function is Lipschitz and smooth and the desired approximation error is an absolute constant (see Theorem 10.1).

Since min-max solutions may not exist, what could we plausibly hope to compute? There are two obvious targets:

(I) approximate stationary points of f, as considered e.g. by [ALW19]; and
(II) some type of approximate local min-max solution.

Unfortunately, as far as (I) is concerned, it is still possible that (even approximate) stationary points may not exist, and we show that checking if there is one is NP-hard, even when the constraint set is [0, 1]^d, the objective has Lipschitzness and smoothness polynomial in d, and the desired approximation is an absolute constant (Theorem 4.1). So we focus on (II), i.e., (approximate) local min-max solutions. Several kinds of those have been proposed in the literature [DP18, MR18, JNJ19]. We consider a generalization of the concept of local min-max equilibria, proposed in [DP18, MR18], that also accommodates approximation.

Definition 1.1 (Approximate Local Min-Max Equilibrium). Given f, g as above, and ε, δ > 0, a point (x*, y*) is an (ε, δ)-local min-max solution of (1.1), or an (ε, δ)-local min-max equilibrium, if it is feasible, i.e. g(x*, y*) ≤ 0, and satisfies:

    f(x*, y*) < f(x, y*) + ε,  for all x such that ‖x − x*‖ ≤ δ and g(x, y*) ≤ 0;    (1.4)
    f(x*, y*) > f(x*, y) − ε,  for all y such that ‖y − y*‖ ≤ δ and g(x*, y) ≤ 0.    (1.5)

In words, (x*, y*) is an (ε, δ)-local min-max equilibrium whenever the min player cannot update x to a feasible point within δ of x* to reduce f by at least ε, and symmetrically the max player cannot change y locally to increase f by at least ε.

We show that the existence and complexity of computing such approximate local min-max equilibria depend on the relationship of ε and δ with the smoothness, L, and the Lipschitzness, G, of the objective function f. We distinguish the following regimes, also shown in Figure 1 together with a summary of our associated results.

Trivial Regime.
This occurs when δ < ε/G. This regime is trivial because the G-Lipschitzness of f guarantees that all feasible points are (ε, δ)-local min-max solutions.

Local Regime.

This occurs when δ < √(ε/L), and it represents the interesting regime for min-max optimization. In this regime, we use the smoothness of f to show that (ε, δ)-local min-max solutions always exist. Indeed, we show (Theorem 5.1) that computing them is computationally equivalent to the following variant of (I), which is more suitable for the constrained setting:

(I') (approximate) fixed points of the projected gradient descent-ascent dynamics (Section 3.3).

We show via an application of Brouwer's fixed point theorem to the iteration map of the projected gradient descent-ascent dynamics that solutions of (I') are guaranteed to exist. In fact, not only do they exist, but computing them is in PPAD, as can be shown by bounding the Lipschitzness of the projected gradient descent-ascent dynamics (Theorem 5.2).
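For intuition on (I'), the following sketch (our own illustration, not the paper's formal construction) iterates a projected gradient descent-ascent map on the square [0, 1]^2 for the smooth objective f(x, y) = (x − 1/2)^2 − (y − 1/2)^2 and checks that it reaches an approximate fixed point, which here is the saddle (1/2, 1/2).

```python
# Projected gradient descent-ascent (PGDA) on [0, 1]^2 for the smooth
# objective f(x, y) = (x - 0.5)**2 - (y - 0.5)**2, whose saddle point
# (0.5, 0.5) is the unique fixed point of the PGDA map here.

def clip01(t):
    return max(0.0, min(1.0, t))

def pgda_map(x, y, eta=0.1):
    gx = 2.0 * (x - 0.5)      # df/dx: the min player descends
    gy = -2.0 * (y - 0.5)     # df/dy: the max player ascends
    return clip01(x - eta * gx), clip01(y + eta * gy)

x, y = 0.9, 0.1
for _ in range(300):
    x, y = pgda_map(x, y)

fx, fy = pgda_map(x, y)
residual = ((fx - x) ** 2 + (fy - y) ** 2) ** 0.5
# residual is essentially zero: (x, y) is an approximate fixed point
```

For this convex-concave f the iteration happens to converge; the paper's point is that for general nonconvex-nonconcave objectives such fixed points still exist (by Brouwer's theorem applied to this map), yet computing them is PPAD-complete.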
Global Regime.
This occurs when δ is comparable to the diameter of the constraint set. In this case, the existence of (ε, δ)-local min-max solutions is not guaranteed, and determining their existence is NP-hard, even if ε is an absolute constant (Theorem 10.1).

The main results of this paper, summarized in Figure 1, characterize the complexity of computing local min-max solutions in the local regime. Our first main theorem is the following:

Informal Theorem 1 (see Theorems 4.3, 4.4 and 5.1). Computing (ε, δ)-local min-max solutions of Lipschitz and smooth objectives over convex compact domains in the local regime is PPAD-complete. The hardness holds even when the constraint set is a polytope that is a subset of [0, 1]^d, the objective takes values in [−1, 1], and the smoothness, Lipschitzness, ε and δ are polynomial in the dimension. Equivalently, computing α-approximate fixed points of the Projected Gradient Descent-Ascent dynamics on smooth and Lipschitz objectives is PPAD-complete, and the hardness holds even when the constraint set is a polytope that is a subset of [0, 1]^d, the objective takes values in [−d, d], and the smoothness, Lipschitzness, and α are polynomial in the dimension.

For the above complexity result we assume that we have "white box" access to the objective function. An important byproduct of our proof, however, is to also establish an unconditional hardness result in the Nemirovsky-Yudin [NY83] oracle optimization model, wherein we are given black-box access to oracles computing the objective function and its gradient. Our second main result is informally stated in Informal Theorem 2.

Figure 1: Overview of the results proven in this paper, and comparison between the complexity of computing an (ε, δ)-approximate local minimum and an (ε, δ)-approximate local min-max equilibrium of a G-Lipschitz and L-smooth function over a d-dimensional polytope taking values in the interval [−B, B]. We assume that ε < G²/L, thus the trivial regime is a strict subset of the local regime. Moreover, we assume that the approximation parameter ε is provided in unary representation in the input to these problems, which makes our hardness results stronger and the comparison to the upper bounds known for finding approximate local minima fair, as these require time/oracle queries that are polynomial in 1/ε. We note that the unary representation is not required for our results proving inclusion in PPAD. The figure portrays a sharp contrast between the computational complexity of approximate local minima and approximate local min-max equilibria in the local regime. Above the black lines, tracking the value of δ, we state our "white box" results, and below the black lines we state our "black-box" results. The main result of this paper is the PPAD-hardness of approximate local min-max equilibrium for δ ≥ √(ε/L) and the corresponding query lower bound. In the query lower bound, the function h is defined as h(d, G, L, ε) = (min(d, √(L/ε), G/ε))^p for some universal constant p ∈ R+.
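The query-counting flavor of the black-box model can be made concrete with a small oracle wrapper. The naive grid scan below (a toy illustration with made-up helper names, not the paper's lower-bound construction) makes on the order of (1/δ)^d first-order queries, the kind of exponential-in-d scaling that the lower bound shows is unavoidable for local min-max in the worst case.

```python
# A query-counting first-order oracle in the Nemirovsky-Yudin black-box
# model, plus a naive scan over a delta-grid of [0, 1]^d. The scan makes
# (1/delta + 1)^d oracle queries, i.e. exponentially many in d.
# (Toy illustration only; this is not the paper's lower-bound construction.)
import itertools

class FirstOrderOracle:
    def __init__(self, f, grad):
        self.f, self.grad, self.queries = f, grad, 0

    def __call__(self, x):
        self.queries += 1
        return self.f(x), self.grad(x)

def grid_scan(oracle, d, delta):
    """Return the grid point of [0, 1]^d with the smallest gradient norm."""
    ticks = [i * delta for i in range(int(round(1 / delta)) + 1)]
    best, best_norm = None, float("inf")
    for x in itertools.product(ticks, repeat=d):
        _, g = oracle(x)
        norm = sum(gi * gi for gi in g) ** 0.5
        if norm < best_norm:
            best, best_norm = x, norm
    return best

def f(x):
    return sum((xi - 0.5) ** 2 for xi in x)

def grad(x):
    return [2.0 * (xi - 0.5) for xi in x]

oracle = FirstOrderOracle(f, grad)
point = grid_scan(oracle, d=2, delta=0.1)   # 11 ticks per axis -> 121 queries
```

For minimization, by contrast, projected gradient descent needs only polynomially many queries in B, L and 1/ε, independent of any (1/δ)^d scan, which is the separation the theorems formalize.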
With ⋆ we indicate our PPAD-completeness result, which directly follows from Theorems 4.3 and 4.4. The NP-hardness results in the global regime are presented in Section 10. Finally, the folklore result showing the tractability of finding approximate local minima is presented, for completeness of exposition, in Appendix E. The claimed results for the trivial regime follow from the definition of Lipschitzness.

Informal Theorem 2 (see Theorem 4.5). Assume that we have black-box access to an oracle computing a G-Lipschitz and L-smooth objective function f : P → [−1, 1], where P ⊆ [0, 1]^d is a known polytope, and its gradient ∇f. Then, computing an (ε, δ)-local min-max solution in the local regime (i.e., when δ < √(ε/L)) requires a number of oracle queries that is exponential in at least one of the following: 1/ε, L, G, or d. In fact, exponentially-in-d many queries are required even when L, G, ε and δ are all polynomial in d.

Importantly, the above lower bounds, in both the white-box and the black-box setting, come in sharp contrast to minimization problems, given that finding approximate local minima of smooth non-convex objectives ranging in [−B, B] in the local regime can be done with first-order methods using O(B · L/ε) time/queries (see Appendix E). Our results are the first to show an exponential separation between these two fundamental problems in optimization in the black-box setting, and a super-polynomial separation in the white-box setting assuming PPAD ≠ FP.

We very briefly outline some of the main ideas for the
PPAD -hardness proof that we presentin Sections 6 and 7. Our starting point as in many
PPAD -hardness results is a discrete analogof the problem of finding Brouwer fixed points of a continuous map. Departing from previouswork, however, we do not use Sperner’s lemma as the discrete analog of Brouwer’s fixed pointtheorem. Instead, we define a new problem, called B i S perner , which is useful for showing ourhardness results. B i S perner is closely related to the problem of finding panchromatic simplicesguaranteed by Sperner’s lemma except, roughly speaking, that the vertices of the simplicizationof a d -dimensional hypercube are colored with 2 d rather than d + d colors rather than one, and we are seeking a vertex of the sim-plicization so that the union of colors on the vertices in its neighborhood covers the full set ofcolors. The first step of our proof is to show that B i S perner is PPAD -hard. This step followsfrom the hardness of computing Brouwer fixed points.The step that we describe next is only implicitly done by our proof, but it serves as usefulintuition for reading and understanding it. We want to define a discrete two-player zero-sumgame whose local equilibrium points correspond to solutions of a given B i S perner instance.Our two players, called “minimizer” and “maximizer,” each choose a vertex of the simplicizationof the B i S perner instance. For every pair of strategies in our discrete game, i.e. vertices, chosenby our players, we define a function value and gradient values. Note that, at this point, wetreat these values at different vertices of the simplicization as independent choices, i.e. are notdefining a function over the continuum whose function values and gradient values are consistentwith these choices. 
It is our intention, however, that in the continuous two-player zero-sum gamethat we obtain in the next paragraph via our interpolation scheme, wherein the minimizer andmaximizer may choose any point in the continuous hypercube, the function value determinesthe payment of the minimizer to the maximizer, and the gradient value determines the directionof the best-response dynamics of the game. Before getting to that continuous game in the nextparagraph, the main technical step of this discrete part of our construction is showing that everylocal equilibrium of the discrete game corresponds to a solution of the B i S perner instance we arereducing from. In order to achieve this we need to add some constraints to couple the strategiesof the minimizer and the maximizer player. This step is the reason that the constraints g ( x , y ) ≤ smooth and computationally efficient way the discrete zero-sum game of the previous step. Inlow dimensions (treated in Section 6) such smooth and efficient interpolation can be done ina relatively simple way using single-dimensional smooth step functions. In high dimensions,however, the smooth and efficient interpolation becomes a challenging problem and to the bestof our knowledge no simple solution exists. For this reason we construct our novel smooth andefficient interpolation coefficients of Section 8. These are a technically involved construction that webelieve will prove to be very useful for characterizing the complexity of approximate solutionsof other optimization problems.The last part of our proof is to show that all the previous steps can be implemented inan efficient way both with respect to computational but also with respect to query complexity.This part is essential for both our white-box and black-box results. 
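The single-dimensional smooth step functions used for the low-dimensional interpolation can be illustrated with the classic cubic smoothstep (a generic example of such a function, not necessarily the exact construction of Section 6): it rises from 0 to 1 with vanishing derivative at both endpoints, so values glued together with it remain continuously differentiable.

```python
# Cubic smoothstep: s(t) = 3t^2 - 2t^3 on [0, 1], clamped outside.
# s(0) = 0, s(1) = 1, and s'(0) = s'(1) = 0, so interpolating two
# values a, b as a + (b - a) * s(t) yields a C^1 transition.

def smoothstep(t):
    t = max(0.0, min(1.0, t))
    return t * t * (3.0 - 2.0 * t)

def lerp_smooth(a, b, t):
    return a + (b - a) * smoothstep(t)

# the derivative at the endpoints is (numerically) zero
h = 1e-6
d0 = (smoothstep(0.0 + h) - smoothstep(0.0)) / h
d1 = (smoothstep(1.0) - smoothstep(1.0 - h)) / h
```

Gluing the discrete game's values with such steps keeps the resulting continuous objective smooth; the difficulty the paper addresses is doing this efficiently when the interpolation is over many dimensions at once.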
Although this seems like a relatively easy step, it becomes more difficult due to the complicated expressions in the smooth and efficient interpolation coefficients used in the previous step.

Closing this section, we mention that all our NP-hardness results are proven using an application of the Lovász Local Lemma [EL73], which provides a powerful rounding tool that can drive the inapproximability all the way up to an absolute constant.

Because our proof is convoluted, involving multiple steps, it is difficult to discern from it why finding local min-max solutions is so much harder than finding local minima. For this reason, we illustrate in this section a fundamental difference between local minimization and local min-max optimization. This provides good intuition about why our hardness construction would fail if we tried to apply it to prove hardness results for finding local minima (which we know do not exist).

So let us illustrate a key difference between min-max problems that can be expressed in the form min_{x∈X} max_{y∈Y} f(x, y), i.e. two-player zero-sum games wherein the players optimize opposing objectives, and min-min problems of the form min_{x∈X} min_{y∈Y} f(x, y), i.e., two-player coordination games wherein the players optimize the same objective. For simplicity, suppose X = Y = R and let us consider long paths of best-response dynamics in the strategy space, X × Y, of the two players; these are paths along which at least one of the players improves their payoff. For our illustration, suppose that the derivative of the function with respect to either variable is either 1 or −1. Consider a long path of best-response dynamics starting at a pair of strategies (x₀, y₀) in either a min-min problem or a min-max problem, and a specific point (x, y) along that path. We claim that in min-min problems the function value at (x, y) will have to reveal how far from (x₀, y₀) the point (x, y) lies within the path in ℓ₁ distance. On the other hand, in min-max problems the function value at (x, y) may reveal very little about how far (x, y) lies from (x₀, y₀). We illustrate this in Figure 2. While in our min-min example the function value must be monotonically decreasing inside the best-response path, in the min-max example the function values repeat themselves in every straight line segment of length 3, without revealing where in the path each segment is.

Ultimately, a key difference between min-min and min-max optimization is that best-response paths in min-max optimization problems can be closed, i.e., can form a cycle, as shown in Figure 2, Panel (b). On the other hand, this is impossible in min-min problems, as the function value must monotonically decrease along best-response paths, thus cycles may not exist.

Figure 2: Long paths of best-response dynamics in min-min problems (Panel (a): the function values reveal the location of the points within the best-response path) and min-max problems (Panel (b): the function values do not reveal the location of the points within the best-response path), where horizontal moves correspond to one player (who is a minimizer in both (a) and (b)) and vertical moves correspond to the other player (who is a minimizer in (a) but a maximizer in (b)).
In Panels (a) and (b), we show the function value at a subset of discrete points in a 2D grid along a long path of best-response dynamics, where for our illustration we assumed that the derivative of the objective with respect to either variable always has absolute value 2. As we see in Panel (a), the function value at some point along a long path of the best-response dynamics in a min-min problem reveals information about where in the path that point lies. This is in sharp contrast to min-max problems, where only local information is revealed about the objective, as shown in Panel (b), due to the frequent turns of the path. In Panel (b) we also show that the best-response dynamics in min-max problems can form closed paths. This cannot happen in min-min problems, as the function value must decrease along paths of best-response dynamics; hence it is impossible in min-min problems to build long best-response paths with function values that can be computed locally.

The above discussion offers qualitative differences between min-min and min-max optimization, which lie at the heart of why our computational intractability results are possible to prove for min-max but not min-min problems. For the precise step in our construction that breaks if we were to switch from a min-max to a min-min problem, we refer the reader to Remark 6.9.
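The cycling phenomenon of Panel (b) is easy to reproduce on the smallest possible example, a 2×2 matching-pennies-style game (our own illustration, separate from the figure): under best-response dynamics the min-max version cycles forever, while the min-min version halts at a joint minimum.

```python
# Best-response dynamics on a 2x2 game with f(x, y) = 1 if x == y else 0.
# In the min-max version the min player wants x != y while the max player
# wants y == x, so best responses chase each other around a 4-cycle.
# In the min-min version both players minimize f and the dynamics halt.

def f(x, y):
    return 1 if x == y else 0

def best_response_path(maximizing_y, start=(0, 0), max_steps=12):
    x, y = start
    path = [(x, y)]
    for _ in range(max_steps):
        moved = False
        bx = min((0, 1), key=lambda c: f(c, y))      # min player's best reply
        if f(bx, y) < f(x, y):
            x, moved = bx, True
        else:
            if maximizing_y:
                by = max((0, 1), key=lambda c: f(x, c))
                improves = f(x, by) > f(x, y)
            else:
                by = min((0, 1), key=lambda c: f(x, c))
                improves = f(x, by) < f(x, y)
            if improves:
                y, moved = by, True
        if not moved:
            return path, "converged"
        path.append((x, y))
    return path, "cycling"

minmax_path, minmax_status = best_response_path(maximizing_y=True)
minmin_path, minmin_status = best_response_path(maximizing_y=False)
# the min-max dynamics revisit their starting point; the min-min dynamics stop
```

The min-max run traverses (0,0) → (1,0) → (1,1) → (0,1) → (0,0) and never terminates, mirroring the closed best-response path of Panel (b).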
There is a broad literature on the complexity of equilibrium computation. Virtually all these results are obtained within the computational complexity formalism of total search problems in NP, which was spearheaded by [JPY88, MP89, Pap94b] to capture the complexity of search problems that are guaranteed to have a solution. Some key complexity classes in this landscape are shown in Figure 3. We give a non-exhaustive list of intractability results for equilibrium computation: [FPT04] prove that computing pure Nash equilibria in congestion games is PLS-complete; [DGP09] and later [CDT09] show that computing approximate Nash equilibria in normal-form games is PPAD-complete; [EY10] study the complexity of computing exact Nash equilibria (which may use irrational probabilities), introducing the complexity class FIXP; [VY11, CPY17] consider the complexity of computing Market equilibria; [Das13, Rub15, Rub16] consider the complexity of computing approximate Nash equilibria of constant approximation; [KM18] establish a connection between approximate Nash equilibrium computation and the SoS hierarchy; [Meh14, DFS20] study the complexity of computing Nash equilibria in specially structured games. A result that is particularly useful for our work is that of [HPV89], which shows black-box query lower bounds for computing Brouwer fixed points of a continuous function. We use this result in Section 9 as an ingredient for proving our black-box lower bounds for computing approximate local min-max solutions.

Figure 3: The complexity-theoretic landscape of total search problems in NP.

Beyond equilibrium computation and its applications to Economics and Game Theory, the study of total search problems has found profound connections to many scientific fields, including continuous optimization [DP11, DTZ18], combinatorial optimization [SY91], query complexity [BCE+17, GKSZ19], and cryptography [Jeř16, BPR15, SZZ18]. For a more extensive overview of total search problems we refer the reader to the recent survey by Daskalakis [Das18].

As already discussed, min-max optimization has intimate connections to the foundations of Game Theory, Mathematical Programming, Online Learning, Statistics, and several other fields. Recent applications of min-max optimization to Machine Learning, such as Generative Adversarial Networks and Adversarial Training, have motivated a slew of recent work targeting first-order (or other lightweight online learning) methods for solving min-max optimization problems for convex-concave, nonconvex-concave, as well as nonconvex-nonconcave objectives. Work on convex-concave and nonconvex-concave objectives has focused on obtaining online learning methods with improved rates [KM19, LJJ19, TJNO19, NSH+19, LTHC19, OX19, Zha19, ADSG19, AMLJG20, GPDO20, LJJ20] and last-iterate convergence guarantees [DISZ18, DP18, MR18, MPP18, RLLY18, HA18, ADLH19, DP19, LS19, GHP+19, MOP19, ALW19], while work on nonconvex-nonconcave problems has focused on identifying different notions of local min-max solutions [JNJ19, MV20] and studying the existence and (local) convergence properties of learning methods at these points [WZB19, MV20, MSV20].
Preliminaries
Notation.
For any compact and convex K ⊆ R^d and B ∈ R+, we define L∞(K, B) to be the set of all continuous functions f : K → R such that max_{x ∈ K} |f(x)| ≤ B. When K = [0, 1]^d, we use L∞(B) instead of L∞([0, 1]^d, B) for ease of notation. For p > 0, we define diam_p(K) = max_{x,y ∈ K} ‖x − y‖_p, where ‖·‖_p is the usual ℓ_p-norm of vectors. For an alphabet set Σ, the set Σ*, called the Kleene star of Σ, is equal to ∪_{i=0}^∞ Σ^i. For any string q ∈ Σ* we use |q| to denote the length of q. We use the symbol log(·) for base-2 logarithms and ln(·) for the natural logarithm. We use [n] := {1, . . . , n}, [n]₋ := {0, . . . , n − 1}, and [n]₀ := {0, . . . , n}.

Lipschitzness, Smoothness, and Normalization.
Our main objects of study are continuously differentiable Lipschitz and smooth functions f : P → R, where P ⊆ [0, 1]^d is some polytope. A continuously differentiable function f is called G-Lipschitz if |f(x) − f(y)| ≤ G‖x − y‖, for all x, y, and L-smooth if ‖∇f(x) − ∇f(y)‖ ≤ L‖x − y‖, for all x, y.

Remark. Note that the G-Lipschitzness of a function f : P → R, where P ⊆ [0, 1]^d, implies that for any x and y it holds that |f(x) − f(y)| ≤ G√d. Whenever the range of a G-Lipschitz function is taken to be [−B, B], for some B, we always assume that B ≤ G√d. This can be accomplished by setting f̃(x) = f(x) − f(x₀) for some fixed x₀ in the domain of f. For all the problems that we consider in this paper, any solution for f̃ is also a solution for f and vice-versa.

Function Access.
We study optimization problems involving real-valued functions, considering two access models to such functions.
Black Box Model.
In this model we are given access to an oracle O_f such that, given a point x ∈ [0, 1]^d, the oracle O_f returns the values f(x) and ∇f(x). In this model we assume that we can perform real-number arithmetic operations. This is the traditional model used to prove lower bounds in Optimization and Machine Learning [NY83].

White Box Model.
In this model we are given the description of a polynomial-time Turing machine C_f that computes f(x) and ∇f(x). More precisely, given some input x ∈ [0, 1]^d, described using B bits, and some accuracy ε, C_f runs in time upper bounded by some polynomial in B and log(1/ε), and outputs approximate values for f(x) and ∇f(x), with approximation error that is at most ε in ℓ∞ distance. We note that a running-time upper bound on a given Turing Machine can be enforced syntactically, by stopping the computation and outputting a fixed output whenever the computation exceeds the bound. See also Remark 2.6 for an important remark about how to formally study the computational complexity of problems that take as input a polynomial-time Turing Machine.

Promise Problems.
To simplify the exposition of our paper, make the definitions of our computational problems and theorem statements clearer, and make our intractability results stronger, we choose to enforce the following constraints on our function access, O_f or C_f, as a promise, rather than enforcing these constraints in some syntactic manner.

1. Consistency of Function Values and Gradient Values. Given some oracle O_f or Turing machine C_f, it is difficult to determine, by querying the oracle or examining the description of the Turing machine, whether the function and gradient values output on different inputs are consistent with some differentiable function. In all our computational problems, we will only consider instances where this is promised to be the case. Moreover, for all our computational hardness results, the instances of the problems arising from our reductions satisfy these constraints, which are guaranteed syntactically by our reductions.

2. Lipschitzness, Smoothness and Boundedness. Similarly, given some oracle O_f or Turing machine C_f, it is difficult to determine, by querying the oracle or examining the description of the Turing machine, whether the function and gradient values output by O_f or C_f are consistent with some Lipschitz, smooth and bounded function with some prescribed Lipschitzness, smoothness, and bound on its absolute value. In all our computational problems, we only consider instances where the G-Lipschitzness, L-smoothness and B-boundedness of the function are promised to hold for the parameters G, L and B prescribed in the input of the problem. Moreover, for all our computational hardness results, the instances of the problems arising from our reductions satisfy this constraint, which is guaranteed syntactically by our reduction.

In summary, in the rest of this paper, whenever we prove an upper bound for some computational problem, namely an upper bound on the number of steps or queries to the function oracle required to solve the problem in the black-box model, or the containment of the problem in some complexity class in the white-box model, we assume that the afore-described properties are satisfied by the O_f or C_f provided in the input. On the other hand, whenever we prove a lower bound for some computational problem, namely a lower bound on the number of steps/queries required to solve it in the black-box model, or its hardness for some complexity class in the white-box model, the instances arising in our lower bounds are guaranteed to satisfy the above properties syntactically by our constructions. As such, our hardness results will not exploit the difficulty of checking whether O_f or C_f satisfy the above constraints in order to infuse computational complexity into our problems, but will faithfully target the computational problems pertaining to min-max optimization of smooth and Lipschitz objectives that we aim to understand in this paper.
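The first promise above (consistency of function and gradient values) can be spot-checked, though never certified, by comparing reported gradients against finite differences of reported values; the sketch below (our own illustration, not part of the paper's formalism) flags an oracle whose gradient is inconsistent with its values.

```python
# Spot-checking the consistency promise: compare the oracle's reported
# gradient against a central finite difference of its reported values.
# A passing check is evidence, not a certificate; finitely many queries
# can never verify the promise everywhere.

def finite_diff_check(f, grad, x, h=1e-5, tol=1e-3):
    g = grad(x)
    for i in range(len(x)):
        xp, xm = list(x), list(x)
        xp[i] += h
        xm[i] -= h
        if abs((f(xp) - f(xm)) / (2 * h) - g[i]) > tol:
            return False
    return True

def f(x):
    return x[0] ** 2 + 3.0 * x[1]

def good_grad(x):
    return [2.0 * x[0], 3.0]

def bad_grad(x):           # inconsistent with f in the second coordinate
    return [2.0 * x[0], -3.0]

ok = finite_diff_check(f, good_grad, [0.3, 0.7])
bad = finite_diff_check(f, bad_grad, [0.3, 0.7])
```

This is exactly why the paper treats consistency as a promise: the property cannot be decided from black-box queries or from inspecting C_f, so it is instead guaranteed syntactically in all reductions.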
In this section we define the main complexity classes that we use in this paper, namely NP, FNP and PPAD, as well as the notion of reduction used to show containment or hardness of a problem for one of these complexity classes.
Definition 2.2 (Search Problems, NP, FNP). A binary relation Q ⊆ {0,1}* × {0,1}* is in the class FNP if (i) for every x, y ∈ {0,1}* such that (x, y) ∈ Q, it holds that |y| ≤ poly(|x|); and (ii) there exists an algorithm that verifies whether (x, y) ∈ Q in time poly(|x|, |y|). The search problem associated with a binary relation Q takes some x as input and requests as output some y such that (x, y) ∈ Q, or the output ⊥ if no such y exists. The decision problem associated with Q takes some x as input and requests as output the bit 1, if there exists some y such that (x, y) ∈ Q, and the bit 0, otherwise. The class NP is defined as the set of decision problems associated with relations Q ∈ FNP.

To define the complexity class PPAD we first define the notion of polynomial-time reductions between search problems (in this paper we only define and consider Karp-style reductions between search problems) and the computational problem End-of-a-Line. This problem is sometimes called End-of-the-Line, but we adopt the nomenclature proposed by [Rub16] since we agree that it describes the problem better.

Definition 2.3 (Polynomial-Time Reductions). A search problem P₁ is polynomial-time reducible to a search problem P₂ if there exist polynomial-time computable functions f : {0,1}* → {0,1}* and g : {0,1}* × {0,1}* × {0,1}* → {0,1}* with the following properties: (i) if x is an input to P₁, then f(x) is an input to P₂; and (ii) if y is a solution to P₂ on input f(x), then g(x, f(x), y) is a solution to P₁ on input x.

End-of-a-Line.
Input: Binary circuits C_S (for successor) and C_P (for predecessor) with n inputs and n outputs.
Output: One of the following, where 0ⁿ denotes the all-zeros string:
0. The string 0ⁿ, if either both C_P(C_S(0ⁿ)) and C_S(C_P(0ⁿ)) are equal to 0ⁿ, or if they are both different from 0ⁿ.
1. A binary string x ∈ {0,1}ⁿ such that x ≠ 0ⁿ and C_P(C_S(x)) ≠ x or C_S(C_P(x)) ≠ x.

To make sense of the above definition, we envision that the circuits C_S and C_P implicitly define a directed graph, with vertex set {0,1}ⁿ, such that the directed edge (x, y) ∈ {0,1}ⁿ × {0,1}ⁿ belongs to the graph if and only if C_S(x) = y and C_P(y) = x. As such, all vertices in the implicitly defined graph have in-degree and out-degree at most 1. The above problem permits an output of 0ⁿ if 0ⁿ has equal in-degree and out-degree in this graph. Otherwise it permits an output x ≠ 0ⁿ such that x has in-degree or out-degree equal to 0. It follows by the parity argument on directed graphs, namely that in every directed graph the sum of in-degrees equals the sum of out-degrees, that End-of-a-Line is a total problem, i.e. that for any possible binary circuits C_S and C_P there exists a solution of the "0." kind or the "1." kind in the definition of our problem (or both). Indeed, if 0ⁿ has unequal in- and out-degrees, there must exist another vertex x ≠ 0ⁿ with unequal in- and out-degrees, and thus one of these degrees must be 0 (as all vertices in the graph have in- and out-degrees bounded by 1). We are finally ready to define the complexity class PPAD, introduced by [Pap94b].
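The implicit graph and the two kinds of admissible outputs can be illustrated on a toy instance, with the circuits C_S and C_P played by plain Python functions on 3-bit strings encoded as integers (our own hypothetical example; real instances supply Boolean circuits, and exhaustive search is exponential in n):

```python
N_BITS = 3
ZERO = 0  # the all-zeros string, encoded as an integer

# Toy instance: the implicit graph consists of the single path 0 -> 1 -> 3 -> 7;
# every other string is isolated.  These dictionaries stand in for C_S and C_P.
_succ = {0: 1, 1: 3, 3: 7}
_pred = {1: 0, 3: 1, 7: 3}

def C_S(x):  # successor circuit
    return _succ.get(x, x)

def C_P(x):  # predecessor circuit
    return _pred.get(x, x)

def solve_end_of_a_line(succ, pred, n_bits):
    """Brute-force search over {0,1}^n -- exponential in n, for illustration only."""
    # Kind 0: the all-zeros string itself, if its in-degree equals its out-degree,
    # i.e. if the two tests in the definition are both true or both false.
    if (pred(succ(ZERO)) == ZERO) == (succ(pred(ZERO)) == ZERO):
        return ZERO
    # Kind 1: some x != 0 whose in-degree or out-degree is 0.
    for x in range(1, 2 ** n_bits):
        if pred(succ(x)) != x or succ(pred(x)) != x:
            return x
    raise AssertionError("unreachable: End-of-a-Line is total")

solution = solve_end_of_a_line(C_S, C_P, N_BITS)  # the end of the path, 7
```

Here 0 is a source (in-degree 0, out-degree 1), so the parity argument forces another unbalanced vertex; the search finds the sink 7.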
Definition 2.4 (PPAD). The complexity class PPAD contains all search problems that are polynomial-time reducible to the End-of-a-Line problem.

The complexity class PPAD is of particular importance, since it contains many fundamental problems in Game Theory, Economics, Topology and several other fields [DGP09, Das18]. A particularly important PPAD-complete problem is finding fixed points of continuous functions, whose existence is guaranteed by Brouwer's fixed point theorem.

Brouwer.
Input: Scalars L and γ, and a polynomial-time Turing machine C_M evaluating an L-Lipschitz function M : [0,1]^d → [0,1]^d.
Output: A point z⋆ ∈ [0,1]^d such that ‖z⋆ − M(z⋆)‖ < γ.

While not stated exactly in this form, the following is a straightforward implication of the results presented in [CDT09].

Lemma 2.5 ([CDT09]). Brouwer is PPAD-complete even when d = 2. Additionally, Brouwer is PPAD-complete even when γ = 1/poly(d) and L = poly(d).

Remark 2.6. In the definition of the problem Brouwer we assume that we are given in the input the description of a Turing machine C_M that computes the map M. In order for polynomial-time reductions to and from this problem to be meaningful, we need to have an upper bound on the running time of this Turing machine, which we want to be polynomial in the size of its input. The formal way to ensure this and derive meaningful complexity results is to define a different problem, say k-Brouwer, for every k ∈ ℕ. In the problem k-Brouwer the input Turing machine C_M has running time bounded by n^k in the size n of its input. In the rest of the paper, whenever we say that a polynomial-time Turing machine is required in the input to a computational problem Pr, we formally mean that we define a hierarchy of problems k-Pr, k ∈ ℕ, such that k-Pr takes as input Turing machines with running time bounded by n^k, and we interpret computational complexity results for Pr in the following way: whenever we prove that Pr belongs to some complexity class, we prove that k-Pr belongs to that complexity class for all k ∈ ℕ; whenever we prove that Pr is hard for some complexity class, we prove that, for some absolute constant k₀ determined in the hardness proof, k-Pr is hard for that class, for all k ≥ k₀. For simplicity of exposition, we do not repeat this discussion in the rest of this paper.

In this section, we define the computational problems that we study in this paper and discuss our main results, postponing formal statements to Section 4.
We start in Section 3.1 by defining the mathematical objects of our study, and proceed in Section 3.2 to define our main computational problems, namely: (1) finding approximate stationary points; (2) finding approximate local minima; and (3) finding approximate local min-max equilibria. In Section 3.3, we present some bonus problems, which are intimately related, as we will see, to problems (2) and (3). As discussed in Section 2, for ease of presentation, we define our problems as promise problems.
We define the concepts of stationary points, local minima, and local min-max equilibria of real-valued functions, and make some remarks about their existence, as well as their computational complexity. The formal discussion of the latter is postponed to Sections 3.2 and 4.

Before we proceed with our definitions, recall that the goal of this paper is to study constrained optimization. Our domain will be the hypercube [0,1]^d, which we might intersect with the set { x | g(x) ≤ 0 }, for some convex (potentially multivariate) function g. Although most of the definitions and results that we explore in this paper can be extended to arbitrary convex functions, we will focus on the case where g is linear, and the feasible set is thus a polytope. Focusing on this case avoids additional complications related to the representation of g in the input to the computational problems that we define in the next section, and also avoids issues related to verifying the convexity of g.

Definition 3.1 (Feasible Set and Refutation of Feasibility). Given A ∈ ℝ^{d×m} and b ∈ ℝ^m, we define the set of feasible solutions to be P(A, b) = { z ∈ [0,1]^d | Aᵀz ≤ b }. Observe that testing whether P(A, b) is empty can be done in polynomial time in the bit complexity of A and b.

Definition 3.2 (Projection Operator). For a nonempty, closed, and convex set K ⊂ ℝ^d, we define the projection operator Π_K : ℝ^d → K as Π_K x = argmin_{y ∈ K} ‖x − y‖. It is well known that for any nonempty, closed, and convex set K the argmin_{y ∈ K} ‖x − y‖ exists and is unique, hence Π_K is well defined.

Now that we have defined the domain of the real-valued functions that we consider in this paper, we are ready to define a notion of approximate stationary points.

Definition 3.3 (ε-Stationary Point). Let f : [0,1]^d → ℝ be a G-Lipschitz and L-smooth function, and let A ∈ ℝ^{d×m}, b ∈ ℝ^m. We call a point x⋆ ∈ P(A, b) an ε-stationary point of f if ‖∇f(x⋆)‖ < ε.

It is easy to see that there exist continuously differentiable functions f that do not have any (approximate) stationary points, e.g. linear functions. As we will see later in this paper, deciding whether a given function f has a stationary point is NP-hard and, in fact, it is even NP-hard to decide whether a function has an approximate stationary point of a very gross approximation. At the same time, verifying whether a given point is (approximately) stationary can be done efficiently given access to a polynomial-time Turing machine that computes ∇f, so the problem of deciding whether an (approximate) stationary point exists lies in NP, as long as we can guarantee that, if there is such a point, there will also be one with polynomial bit complexity. We postpone a formal discussion of the computational complexity of finding (approximate) stationary points, or deciding their existence, until we have formally defined our corresponding computational problem and settled the bit complexity of its solutions.

For the definition of local minima and local min-max equilibria we need the notion of closed d-dimensional Euclidean balls.

Definition 3.4 (Euclidean Ball). For r ∈ ℝ₊ we define the closed Euclidean ball of radius r to be the set B_d(r) = { x ∈ ℝ^d | ‖x‖ ≤ r }. We also define the closed Euclidean ball of radius r centered at z ∈ ℝ^d to be the set B_d(r; z) = { x ∈ ℝ^d | ‖x − z‖ ≤ r }.

Definition 3.5 ((ε, δ)-Local Minimum). Let f : [0,1]^d → ℝ be a G-Lipschitz and L-smooth function, A ∈ ℝ^{d×m}, b ∈ ℝ^m, and ε, δ > 0. A point x⋆ ∈ P(A, b) is an (ε, δ)-local minimum of f constrained on P(A, b) if and only if f(x⋆) < f(x) + ε for every x ∈ P(A, b) such that x ∈ B_d(δ; x⋆).

To be clear, using the term "local minimum" in Definition 3.5 is a bit of a misnomer, since for large enough values of δ the definition captures global minima as well. As δ ranges from large to small, our notion of (ε, δ)-local minimum transitions from being an ε-globally optimal point to being an ε-locally optimal point. Importantly, unlike (approximate) stationary points, an (ε, δ)-local minimum is guaranteed to exist for all ε, δ > 0, due to the compactness of [0,1]^d ∩ P(A, b) and the continuity of f. Thus the problem of finding an (ε, δ)-local minimum is total for arbitrary values of ε and δ. On the negative side, for arbitrary values of ε and δ, there is no polynomial-size and polynomial-time verifiable witness for certifying that a point x⋆ is an (ε, δ)-local minimum. Thus the problem of finding an (ε, δ)-local minimum is not known to lie in FNP. As we will see in Section 4, this issue can be circumvented if we focus on particular settings of ε and δ, in relationship to the Lipschitzness and smoothness of f and the dimension d.

Finally, we define (ε, δ)-local min-max equilibria as follows, recasting Definition 1.1 to the constraint set P(A, b).

Definition 3.6 ((ε, δ)-Local Min-Max Equilibrium). Let f : [0,1]^{d₁} × [0,1]^{d₂} → ℝ be a G-Lipschitz and L-smooth function, A ∈ ℝ^{d×m} and b ∈ ℝ^m, where d = d₁ + d₂, and ε, δ > 0. A point (x⋆, y⋆) ∈ P(A, b) is an (ε, δ)-local min-max equilibrium of f if and only if the following hold:
▷ f(x⋆, y⋆) < f(x, y⋆) + ε for every x ∈ B_{d₁}(δ; x⋆) with (x, y⋆) ∈ P(A, b); and
▷ f(x⋆, y⋆) > f(x⋆, y) − ε for every y ∈ B_{d₂}(δ; y⋆) with (x⋆, y) ∈ P(A, b).

Similarly to Definition 3.5, for large enough values of δ, Definition 3.6 captures global min-max equilibria as well. As δ ranges from large to small, our notion of (ε, δ)-local min-max equilibrium transitions from being an ε-approximate min-max equilibrium to being an ε-approximate local min-max equilibrium. Moreover, in comparison to local minima and stationary points, the problem of finding an (ε, δ)-local min-max equilibrium is neither total nor can its solutions be verified efficiently for all values of ε and δ, even when P(A, b) = [0,1]^d. Again, this issue can be circumvented if we focus on particular settings of ε and δ values, as we will see in Section 4.

In this section, we define the search problems associated with our aforementioned definitions of approximate stationary points, local minima, and local min-max equilibria. We state our problems in terms of white-box access to the function f and its gradient. Switching to the black-box variants of our computational problems amounts to simply replacing the Turing machines provided in the input of the problems with oracle access to the function and its gradient, as discussed in Section 2. As per our discussion in the same section, we define our computational problems as promise problems, the promise being that the Turing machine (or oracle) provided in the input to our problems outputs function values and gradient values that are consistent with a smooth and Lipschitz function with the smoothness and Lipschitzness prescribed in the input. Besides making the presentation cleaner, as we discussed in Section 2, the motivation for doing so is to prevent the possibility that computational complexity is tacked onto our problems due to the possibility that the Turing machines/oracles provided in the input do not output function and gradient values that are consistent with a Lipschitz and smooth function. Importantly, all our computational hardness results syntactically guarantee that the Turing machines/oracles provided as input to our constructed hard instances satisfy these constraints.

Before stating our main computational problems below, we note that, for each problem, the dimension d (in unary representation) is also an implicit input, as the description of the Turing machine C_f (or the interface to the oracle O_f in the black-box counterpart of each problem below) has size at least linear in d.
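As an aside, the two inequalities of Definition 3.6 are easy to test against by sampling, which is useful for sanity checks even though random sampling can refute but never certify them. A sketch for the unconstrained case P(A, b) = [0,1]^{d₁+d₂}, with helper names of our own invention:

```python
import math
import random

def refutes_local_min_max(f, x_star, y_star, eps, delta, trials=2000, seed=0):
    """Try to refute that (x_star, y_star) is an (eps, delta)-local min-max
    equilibrium of f on [0,1]^{d1} x [0,1]^{d2}.  Returns a violating deviation,
    or None if no violation was found among the sampled points."""
    rng = random.Random(seed)

    def sample_ball(center):
        # rejection-sample a point of the delta-ball around `center`,
        # intersected with the unit hypercube
        while True:
            p = tuple(c + rng.uniform(-delta, delta) for c in center)
            if math.dist(p, center) <= delta and all(0.0 <= c <= 1.0 for c in p):
                return p

    f_star = f(x_star, y_star)
    for _ in range(trials):
        x = sample_ball(x_star)
        if f_star >= f(x, y_star) + eps:       # min player has a profitable deviation
            return ("x-deviation", x)
        y = sample_ball(y_star)
        if f_star <= f(x_star, y) - eps:       # max player has a profitable deviation
            return ("y-deviation", y)
    return None

# f(x, y) = (x - 1/2)^2 - (y - 1/2)^2 has an exact local min-max point at (1/2, 1/2).
f = lambda x, y: (x[0] - 0.5) ** 2 - (y[0] - 0.5) ** 2
witness = refutes_local_min_max(f, (0.5,), (0.5,), eps=1e-3, delta=0.1)  # None
```

By contrast, a point such as (0.9, 0.5) for the same f is refuted almost immediately, since the min player can decrease f inside the δ-ball.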
We also refer to Remark 2.6 for how we may formally study computational problems that take a polynomial-time Turing machine in their input.

StationaryPoint.
Input: Scalars ε, G, L, B > 0; a polynomial-time Turing machine C_f evaluating a G-Lipschitz and L-smooth function f : [0,1]^d → [−B, B] and its gradient ∇f : [0,1]^d → ℝ^d; a matrix A ∈ ℝ^{d×m} and vector b ∈ ℝ^m such that P(A, b) ≠ ∅.
Output: If there exists some point x ∈ P(A, b) such that ‖∇f(x)‖ < ε/2, output some point x⋆ ∈ P(A, b) such that ‖∇f(x⋆)‖ < ε; if, for all x ∈ P(A, b), ‖∇f(x)‖ > ε, output ⊥; otherwise, it is allowed to either output some x⋆ ∈ P(A, b) such that ‖∇f(x⋆)‖ < ε or to output ⊥.

It is easy to see that StationaryPoint lies in FNP. Indeed, if there exists some point x ∈ P(A, b) such that ‖∇f(x)‖ < ε/2, then by the L-smoothness of f there must exist some point x⋆ ∈ P(A, b) of bit complexity polynomial in the size of the input such that ‖∇f(x⋆)‖ < ε. On the other hand, it is clear that no such point exists if, for all x ∈ P(A, b), ‖∇f(x)‖ > ε. We note that the looseness of the output requirement in our problem, for functions f that do not have points x ∈ P(A, b) such that ‖∇f(x)‖ < ε/2 but do have points x ∈ P(A, b) such that ‖∇f(x)‖ ≤ ε, is introduced for the sole purpose of making the problem lie in FNP, as otherwise we would not be able to guarantee that the solutions to our search problem have polynomial bit complexity. As we show in Section 4, StationaryPoint is also FNP-hard, even when ε is a constant, the constraint set is very simple, namely P(A, b) = [0,1]^d, and G, L are both polynomial in d.

Next, we define the computational problems associated with local minima and local min-max equilibria. Recall that the first is guaranteed to have a solution because, in particular, a global minimum exists due to the continuity of f and the compactness of P(A, b).

LocalMin.
Input: Scalars ε, δ, G, L, B > 0; a polynomial-time Turing machine C_f evaluating a G-Lipschitz and L-smooth function f : [0,1]^d → [−B, B] and its gradient ∇f : [0,1]^d → ℝ^d; a matrix A ∈ ℝ^{d×m} and vector b ∈ ℝ^m such that P(A, b) ≠ ∅.
Output: A point x⋆ ∈ P(A, b) such that f(x⋆) < f(x) + ε for all x ∈ B_d(δ; x⋆) ∩ P(A, b).

LocalMinMax.
Input: Scalars ε, δ, G, L, B > 0; a polynomial-time Turing machine C_f evaluating a G-Lipschitz and L-smooth function f : [0,1]^{d₁} × [0,1]^{d₂} → [−B, B] and its gradient ∇f : [0,1]^{d₁} × [0,1]^{d₂} → ℝ^{d₁+d₂}; a matrix A ∈ ℝ^{d×m} and vector b ∈ ℝ^m such that P(A, b) ≠ ∅, where d = d₁ + d₂.
Output: A point (x⋆, y⋆) ∈ P(A, b) such that
▷ f(x⋆, y⋆) < f(x, y⋆) + ε for all x ∈ B_{d₁}(δ; x⋆) with (x, y⋆) ∈ P(A, b), and
▷ f(x⋆, y⋆) > f(x⋆, y) − ε for all y ∈ B_{d₂}(δ; y⋆) with (x⋆, y) ∈ P(A, b),
or ⊥ if no such point exists.

Unlike StationaryPoint, the problems LocalMin and LocalMinMax exhibit vastly different behavior depending on the values of the inputs ε and δ in relationship to G, L and d, as we will see in Section 4 where we summarize our computational complexity results. This range of behaviors is rooted in our earlier remark that, depending on the value of δ provided in the input to these problems, they capture the complexity of finding global minima/min-max equilibria, for large values of δ, as well as finding local minima/min-max equilibria, for small values of δ.

Next we present a couple of bonus problems, GDFixedPoint and GDAFixedPoint, which respectively capture the computation of fixed points of the (projected) gradient descent and the (projected) gradient descent/ascent dynamics, with learning rate 1. As we see in Section 5, these problems are intimately related, indeed equivalent under polynomial-time reductions, to the problems LocalMin and LocalMinMax respectively, in certain regimes of the approximation parameters. Before stating the problems GDFixedPoint and GDAFixedPoint, we define the mappings F^GD and F^GDA whose fixed points these problems are targeting.
Definition 3.7 (Projected Gradient Descent). For a closed and convex K ⊆ ℝ^d and some continuously differentiable function f : K → ℝ, we define the Projected Gradient Descent Dynamics with learning rate 1 as the map F^GD : K → K, where F^GD(x) = Π_K(x − ∇f(x)).

Definition 3.8 (Projected Gradient Descent/Ascent). For a closed and convex K ⊆ ℝ^{d₁} × ℝ^{d₂} and some continuously differentiable function f : K → ℝ, we define the Unsafe Projected Gradient Descent/Ascent Dynamics with learning rate 1 as the map F^GDA : K → ℝ^{d₁} × ℝ^{d₂} defined as follows:

F^GDA(x, y) ≜ ( Π_{K(y)}(x − ∇_x f(x, y)), Π_{K(x)}(y + ∇_y f(x, y)) ) ≜ ( F^GDA_x(x, y), F^GDA_y(x, y) )

for all (x, y) ∈ K, where K(y) = { x′ | (x′, y) ∈ K } and K(x) = { y′ | (x, y′) ∈ K }.

Note that F^GDA is called "unsafe" because the projection happens individually for x − ∇_x f(x, y) and y + ∇_y f(x, y); thus F^GDA(x, y) may not lie in K. We also define the "safe" version F^sGDA, which projects the pair (x − ∇_x f(x, y), y + ∇_y f(x, y)) jointly onto K. As we show in Section 5 (in particular inside the proof of Theorem 5.2), computing fixed points of F^GDA and of F^sGDA are computationally equivalent, so we stick to F^GDA, which makes the presentation slightly cleaner.

We are now ready to define GDFixedPoint and GDAFixedPoint. As per earlier discussions, we define these computational problems as promise problems, the promise being that the Turing machine provided in the input to these problems outputs function values and gradient values that are consistent with a smooth and Lipschitz function with the smoothness and Lipschitzness prescribed in the input to these problems.

GDFixedPoint.
Input: Scalars α, G, L, B > 0; a polynomial-time Turing machine C_f evaluating a G-Lipschitz and L-smooth function f : [0,1]^d → [−B, B] and its gradient ∇f : [0,1]^d → ℝ^d; a matrix A ∈ ℝ^{d×m} and vector b ∈ ℝ^m such that P(A, b) ≠ ∅.
Output: A point x⋆ ∈ P(A, b) such that ‖x⋆ − F^GD(x⋆)‖ < α, where K = P(A, b) is the projection set used in the definition of F^GD.

GDAFixedPoint.
Input: Scalars α, G, L, B > 0; a polynomial-time Turing machine C_f evaluating a G-Lipschitz and L-smooth function f : [0,1]^{d₁} × [0,1]^{d₂} → [−B, B] and its gradient ∇f : [0,1]^{d₁} × [0,1]^{d₂} → ℝ^{d₁+d₂}; a matrix A ∈ ℝ^{d×m} and vector b ∈ ℝ^m such that P(A, b) ≠ ∅, where d = d₁ + d₂.
Output: A point (x⋆, y⋆) ∈ P(A, b) such that ‖(x⋆, y⋆) − F^GDA(x⋆, y⋆)‖ < α, where K = P(A, b) is the projection set used in the definition of F^GDA.

In Section 5 we show that the problems GDFixedPoint and LocalMin are equivalent under polynomial-time reductions, and that the problems GDAFixedPoint and LocalMinMax are equivalent under polynomial-time reductions, in certain regimes of the approximation parameters.

In this section we summarize our results for the optimization problems that we defined in the previous section. We start with our theorem about the complexity of finding approximate stationary points, which we show to be FNP-complete even for large values of the approximation.
Theorem 4.1 (Complexity of Finding Approximate Stationary Points). The computational problem StationaryPoint is FNP-complete, even when ε is set to any value below some universal constant, and even when P(A, b) = [0,1]^d, G = √d, L = d, and B = 1.

It is folklore and easy to verify that approximate stationary points always exist and can be found in time poly(B, 1/ε, L) when the domain of f is unconstrained, i.e. it is the whole ℝ^d, and the range of f is bounded, i.e., when f(ℝ^d) ⊆ [−B, B]. Theorem 4.1 implies that no such guarantee should be expected in the bounded-domain case, where the existence of approximate stationary points is not guaranteed and must also be verified. In particular, it follows from our theorem that any algorithm that verifies the existence of and computes approximate stationary points in the constrained case must take time that is super-polynomial in at least one of G, L, or d, unless P = NP. The proof of Theorem 4.1 is based on an elegant construction for converting (real-valued) stationary points of an appropriately constructed function to (binary) solutions of a target Sat instance. This conversion involves the use of the Lovász Local Lemma [EL73]. The details of the proof can be found in Appendix A.

The complexity of LocalMin and LocalMinMax is more difficult to characterize, as the nature of these problems changes drastically depending on the relationship of δ with ε, G, L and d, which determines whether these problems ask for a globally vs locally approximately optimal solution. In particular, there are two regimes wherein the complexity of both problems is simple to characterize.

▷ Global Regime.
When δ ≥ √d, both LocalMin and LocalMinMax ask for a globally optimal solution. In this regime it is not difficult to see that both problems are FNP-hard to solve even when ε = Θ(1) and G, L are O(d) (see Section 10).

▷ Trivial Regime.
When δ satisfies δ < ε/G, then for every point z ∈ P(A, b) it holds, by G-Lipschitzness, that |f(z) − f(z′)| < ε for every z′ ∈ B_d(δ; z) with z′ ∈ P(A, b). Thus, every point z in the domain P(A, b) is a solution to both LocalMin and LocalMinMax.

It is clear from our discussion above, and in earlier sections, that, to really capture the complexity of finding local as opposed to global minima/min-max equilibria, we should restrict the value of δ. We identify the following regime, which we call the "local regime." As we argue shortly, this regime is markedly different from the global regime identified above in that (i) a solution is guaranteed to exist for both our problems of interest, whereas in the global regime only LocalMin is guaranteed to have a solution; and (ii) their computational complexity transitions to lower complexity classes.

▷ Local Regime.
Our main focus in this paper is the regime defined by δ < √(2ε/L). In this regime it is well known that Projected Gradient Descent can solve LocalMin in time O(B · L/ε) (see Appendix E). Our main interest is understanding the complexity of LocalMinMax, which is not well understood in this regime. We note that the use of the constant 2 in the constraint δ < √(2ε/L), which defines the local regime, has a natural motivation: consider a point z where an L-smooth function f has ∇f(z) = 0; it follows from the definition of smoothness that z is both an (ε, δ)-local minimum and an (ε, δ)-local min-max equilibrium, as long as δ < √(2ε/L).

The following theorems provide tight upper and lower bounds on the computational complexity of solving LocalMinMax in the local regime. For compactness, we define the following problem:

Definition 4.2 (Local Regime LocalMinMax). We define the local-regime local min-max equilibrium computation problem, in short LR-LocalMinMax, to be the search problem LocalMinMax restricted to instances in the local regime, i.e. satisfying δ < √(2ε/L).

Theorem 4.3 (Existence of Approximate Local Min-Max Equilibrium). The computational problem
LR-LocalMinMax belongs to PPAD. As a byproduct, if some function f is G-Lipschitz and L-smooth, then an (ε, δ)-local min-max equilibrium is guaranteed to exist when δ < √(2ε/L), i.e. in the local regime.
Theorem 4.4 (Hardness of Finding Approximate Local Min-Max Equilibrium). The search problem LR-LocalMinMax is PPAD-hard, for any δ ≥ √(ε/L), and even when it holds that ε = 1/poly(d), G = poly(d), L = poly(d), and B = d.

Theorem 4.4 implies that any algorithm that computes an (ε, δ)-local min-max equilibrium of a G-Lipschitz and L-smooth function f in the local regime should take time that is super-polynomial in at least one of 1/ε, G, L or d, unless FP = PPAD. As such, the complexity of computing local min-max equilibria in the local regime is markedly different from the complexity of computing local minima, which can be found using Projected Gradient Descent in poly(G, L, 1/ε, d) time and function/gradient evaluations (see Appendix E).

An important property of our reduction in the proof of Theorem 4.4 is that it is a black-box reduction. We can hence prove the following unconditional lower bound in the black-box model.

Theorem 4.5 (Black-Box Lower Bound for Finding Approximate Local Min-Max Equilibrium). Suppose A ∈ ℝ^{d×m} and b ∈ ℝ^m are given together with an oracle O_f that outputs a G-Lipschitz and L-smooth function f : P(A, b) → [−1, 1] and its gradient ∇f. Let also δ ≥ √(ε/L), ε ≤ G/L, and let all the parameters 1/ε, δ, L, G be upper bounded by poly(d). Then any algorithm that has access to f only through O_f and computes an (ε, δ)-local min-max equilibrium has to make a number of queries to O_f that is exponential in at least one of the parameters 1/ε, G, L or d, even when P(A, b) ⊆ [0,1]^d.

Our main goal in the rest of the paper is to provide the proofs of Theorems 4.3, 4.4 and 4.5. In Section 5, we show how to use Brouwer's fixed point theorem to prove the existence of approximate local min-max equilibria in the local regime. Moreover, we establish an equivalence between LocalMinMax and GDAFixedPoint in the local regime, and show that both belong to PPAD. In Sections 6 and 7, we provide a detailed proof of our main result, i.e. Theorem 4.4. Finally, in Section 9, we show how our proof from Section 7 produces as a byproduct the black-box, unconditional lower bound of Theorem 4.5. In Section 8, we outline a useful interpolation technique which allows us to interpolate a function, given its values and the values of its gradient on a hypergrid, so as to enforce the Lipschitzness and smoothness of the interpolating function. We make heavy use of this technically involved result in all our hardness proofs.
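To make the dynamics at the center of the following sections concrete, here is a minimal sketch of the unsafe map F^GDA of Definition 3.8 with learning rate 1, specialized to the box K = [0,1] × [0,1] (so both slice projections are just clamps), on the bilinear objective f(x, y) = (x − 1/2)(y − 1/2); the helper names are ours, not the paper's:

```python
def clamp01(t):
    return min(1.0, max(0.0, t))

def F_GDA(x, y, grad_x, grad_y):
    """Unsafe projected Gradient Descent/Ascent map with learning rate 1 on
    K = [0,1]^2; each coordinate block is projected separately (Definition 3.8)."""
    return clamp01(x - grad_x(x, y)), clamp01(y + grad_y(x, y))

# f(x, y) = (x - 1/2)(y - 1/2): grad_x f = y - 1/2, grad_y f = x - 1/2.
gx = lambda x, y: y - 0.5
gy = lambda x, y: x - 0.5

# The center is an exact fixed point of F_GDA ...
assert F_GDA(0.5, 0.5, gx, gy) == (0.5, 0.5)

# ... while iterating from a nearby point, GDA with learning rate 1 does not
# converge in this run: the iterates drift away and stick to the boundary.
x, y = 0.6, 0.5
trajectory = []
for _ in range(8):
    x, y = F_GDA(x, y, gx, gy)
    trajectory.append((round(x, 3), round(y, 3)))
```

This is why the problems above ask for (approximate) fixed points of the map, rather than for limits of the dynamics.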
In this section, we establish the totality of LR-LocalMinMax, i.e. of LocalMinMax for instances satisfying δ < √(2ε/L), as defined in Definition 4.2. In particular, we prove that every G-Lipschitz and L-smooth function admits an (ε, δ)-local min-max equilibrium, as long as δ < √(2ε/L). A byproduct of our proof is in fact that LR-LocalMinMax lies inside PPAD. Specifically, the main tool that we use to prove our result is a computational equivalence between the problem of finding fixed points of the Gradient Descent/Ascent dynamics, i.e. GDAFixedPoint, and the problem LR-LocalMinMax. A similar equivalence between GDFixedPoint and LocalMin also holds, but the details of that are left to the reader as a simple exercise. Next, we first present the equivalence between GDAFixedPoint and LR-LocalMinMax, and we then show that GDAFixedPoint is in PPAD, which then also establishes that LR-LocalMinMax is in PPAD.

Theorem 5.1. The search problems LR-LocalMinMax and GDAFixedPoint are equivalent under polynomial-time reductions. That is, there is a polynomial-time reduction from LR-LocalMinMax to GDAFixedPoint and vice versa. In particular, given some A ∈ ℝ^{d×m} and b ∈ ℝ^m such that P(A, b) ≠ ∅, along with a G-Lipschitz and L-smooth function f : P(A, b) → ℝ:

1. For arbitrary ε > 0 and 0 < δ < √(2ε/L), suppose that (x*, y*) ∈ P(A, b) is an α-approximate fixed point of F^GDA, i.e., ‖(x*, y*) − F^GDA(x*, y*)‖ < α, where α ≤ √((G + δ)² + (2ε − Lδ²)) − (G + δ). Then (x*, y*) is also an (ε, δ)-local min-max equilibrium of f.

2. For arbitrary α > 0, suppose that (x*, y*) is an (ε, δ)-local min-max equilibrium of f for ε = α · L(L + 1) and δ = √(2ε/L). Then (x*, y*) is also an α-approximate fixed point of F^GDA.

The proof of Theorem 5.1 is presented in Appendix B.1. As already discussed, we use GDAFixedPoint as an intermediate step to establish the totality of LR-LocalMinMax and to show its inclusion in PPAD. This leads to the following theorem.
Theorem 5.2. The computational problems GDAFixedPoint and LR-LocalMinMax are both total search problems, and they both lie in PPAD.

Observe that Theorem 4.3 is implied by Theorem 5.2, whose proof is presented in Appendix B.2.
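For intuition on the F^GDA versus F^sGDA distinction that appears inside the proof of Theorem 5.2, the following sketch (a toy set-up of our own, with K = {(x, y) ∈ [0,1]² : x + y ≤ 1} so that the constraint couples x and y) shows the unsafe map leaving K while the safe map stays inside:

```python
def clamp(t, lo, hi):
    return min(hi, max(lo, t))

# Feasible set K = {(x, y) in [0,1]^2 : x + y <= 1}, which couples x and y.
def in_K(x, y, tol=1e-12):
    return -tol <= x <= 1 + tol and -tol <= y <= 1 + tol and x + y <= 1 + tol

def project_K(a, b):
    """Projection onto K: project onto the half-plane x + y <= 1, then clamp to
    the box.  This composition is adequate for the points in this illustration;
    exact projection onto a general polytope needs a small QP."""
    if a + b > 1.0:
        shift = (a + b - 1.0) / 2.0
        a, b = a - shift, b - shift
    return clamp(a, 0.0, 1.0), clamp(b, 0.0, 1.0)

def F_GDA_unsafe(x, y, gx, gy):
    # each block is projected onto its own slice: K(y) = [0, 1-y], K(x) = [0, 1-x]
    return clamp(x - gx(x, y), 0.0, 1.0 - y), clamp(y + gy(x, y), 0.0, 1.0 - x)

def F_GDA_safe(x, y, gx, gy):
    # the pair (x - grad_x f, y + grad_y f) is projected jointly onto K
    return project_K(x - gx(x, y), y + gy(x, y))

# f(x, y) = y - x: the min player wants to increase x, the max player to increase y.
gx = lambda x, y: -1.0
gy = lambda x, y: 1.0

unsafe = F_GDA_unsafe(0.2, 0.2, gx, gy)   # lands at (0.8, 0.8), outside K
safe = F_GDA_safe(0.2, 0.2, gx, gy)       # lands inside K
```

Both players push toward larger coordinates; projecting each block against the other player's current value ignores that the two moves jointly violate x + y ≤ 1, which is exactly why F^GDA is called unsafe.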
In Section 5 we established that LR-LocalMinMax belongs to PPAD. Our proof went via the intermediate problem GDAFixedPoint, which we showed to be computationally equivalent to LR-LocalMinMax. Our next step is to prove the PPAD-hardness of LR-LocalMinMax, again using GDAFixedPoint as an intermediate problem.

In this section we prove that GDAFixedPoint is PPAD-hard in four dimensions. To establish this hardness result we introduce a variant of the classical 2D-Sperner problem, which we call 2D-BiSperner and which we show is PPAD-hard. The main technical part of our proof is to show that 2D-BiSperner with input size n reduces to GDAFixedPoint with input size poly(n), α = exp(−poly(n)), G = L = exp(poly(n)), and B = 2. This reduction proves the hardness of GDAFixedPoint. Formally, the main result of this section is the following theorem.

Theorem 6.1.
The problem GDAFixedPoint is PPAD-complete even in dimension d = 4 and with B = 2. Therefore, LR-LocalMinMax is PPAD-complete even in dimension d = 4 and B = 2.

The above result excludes the existence of an algorithm for GDAFixedPoint whose running time is poly(log G, log L, log(1/α), B) and, equivalently, the existence of an algorithm for LR-LocalMinMax with running time poly(log G, log L, log(1/ε), log(1/δ), B), unless FP = PPAD. Observe that a stronger hardness result for the four-dimensional GDAFixedPoint problem would not be possible, since it is simple to construct brute-force search algorithms with running time poly(1/α, G, L, B). We elaborate on such algorithms towards the end of this section. In order to prove the hardness of GDAFixedPoint for polynomially (rather than exponentially) bounded (in the size of the input) values of 1/α, G, and L (see Theorem 4.4), we need to consider optimization problems in higher dimensions. This is the problem that we explore in Section 7. Beyond establishing the hardness of the problem for d = 4, the reduction of this section also places 2D-BiSperner in PPAD.

We start by introducing the 2D-BiSperner problem. Consider a coloring of the N × N, 2-dimensional grid where, instead of coloring each vertex of the grid with a single color (as in Sperner's lemma), each vertex is colored via a combination of two out of four available colors. The four available colors are 1−, 1+, 2−, 2+. The five rules that define a proper coloring of the N × N grid are the following.

1. The first color of every vertex is either 1− or 1+ and the second color is either 2− or 2+.
2. The first color of all vertices on the left boundary of the grid is 1+.
3. The first color of all vertices on the right boundary of the grid is 1−.
4. The second color of all vertices on the bottom boundary of the grid is 2+.
5. The second color of all vertices on the top boundary of the grid is 2−.

Figure 4: Left: Summary of the rules for a proper coloring of the grid. The gray color on the left and the right side can be replaced with either blue or green. Similarly, the gray color on the top and the bottom side can be replaced with either red or yellow. Right: An example of a proper coloring of a 9 × 9 grid. The brown boxes indicate the two panchromatic cells, i.e., the cells where all four available colors appear.

Using similar proof ideas as in Sperner's lemma, it is not hard to establish via a combinatorial argument that in every proper coloring of the N × N grid there exists a square cell where each of the four colors in {1−, 1+, 2−, 2+} appears in at least one of its vertices. We call such a cell a panchromatic square. In the 2D-BiSperner problem, defined formally below, we are given the description of some coloring of the grid and are asked to find either a panchromatic square or a violation of the proper coloring conditions. In this paper we will not present a direct combinatorial argument guaranteeing the existence of panchromatic squares under proper colorings of the grid, since their existence is implied by the totality of the 2D-BiSperner problem, which in turn follows from our reduction from 2D-BiSperner to GDAFixedPoint together with our proofs in Section 5 establishing the totality of GDAFixedPoint. In Figure 4 we summarize the five rules that define proper colorings and we present an example of a proper coloring of the grid with 9 discrete points on each side.

In order to formally define the computational problem 2D-BiSperner in a way that is useful for our reductions, we need to allow for colorings of the N × N grid that are described in a succinct way, where the value N can be exponentially large compared to the size of the input to the problem.
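To make the five rules concrete, here is a small sketch that checks them mechanically. The coloring is represented as a plain Python function returning a pair in {−1, +1}^2 (first coordinate: 1−/1+, second coordinate: 2−/2+); the particular toy coloring is our own illustrative choice and not from the paper.

```python
# Hypothetical checker for the coloring rules and for panchromatic cells
# on a small 2^n x 2^n grid.  A coloring is a function
# cl(i, j) -> (c1, c2) with c1, c2 in {-1, +1}.

def is_panchromatic(cl, i, j):
    """Cell with bottom-left vertex (i, j): both output coordinates of
    cl must take the values -1 and +1 somewhere among the 4 corners."""
    corners = [cl(i + di, j + dj) for di in (0, 1) for dj in (0, 1)]
    return ({c[0] for c in corners} == {-1, 1}
            and {c[1] for c in corners} == {-1, 1})

def boundary_violation(cl, i, j, n):
    """Rules 2-5: left column must be 1+, right column 1-,
    bottom row 2+, top row 2-."""
    c1, c2 = cl(i, j)
    last = 2 ** n - 1
    return ((i == 0 and c1 == -1) or (i == last and c1 == +1) or
            (j == 0 and c2 == -1) or (j == last and c2 == +1))

def cl(i, j):
    """Toy proper coloring of the 4 x 4 grid (n = 2)."""
    return (+1 if i < 2 else -1, +1 if j < 2 else -1)

# The central cell sees all four colors, and no vertex breaks rules 2-5.
print(is_panchromatic(cl, 1, 1))                              # True
print(any(boundary_violation(cl, i, j, 2)
          for i in range(4) for j in range(4)))               # False
```

The grid here has constant size; in the formal definition below the coloring is instead given as a boolean circuit, so that N = 2^n can be exponentially large in the input size while a candidate solution is still verifiable locally, by evaluating the circuit on the four corners of a single cell.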
A standard way to do this, introduced by [Pap94b] in defining the computational version of Sperner's lemma, is to describe a coloring via a binary circuit C_l that takes as input the coordinates of a vertex of the grid and outputs the combination of colors that is used to color this vertex. In the input, each of the two coordinates of the vertex is given via the binary representation of a number in [N] − 1. Setting N = 2^n, the representation of each coordinate belongs to {0, 1}^n. In the rest of the section we abuse notation and use a coordinate i ∈ {0, 1}^n both as a binary string and as a number in [2^n] − 1. The output of C_l should be a combination of one of the colors {1−, 1+} and one of the colors {2−, 2+}. We represent this combination as a pair in {−1, 1}^2: the first coordinate of this pair refers to the choice of 1− or 1+ and the second coordinate refers to the choice of 2− or 2+.

In the definition of the computational problem 2D-BiSperner the input is a circuit C_l as described above. One type of possible solution to 2D-BiSperner is a pair of coordinates (i, j) ∈ {0, 1}^n × {0, 1}^n indexing a cell of the grid whose bottom-left vertex is (i, j). For this type of solution to be valid, the outputs of C_l evaluated on all the vertices of this cell must contain at least one negative and one positive value in each of the two output coordinates of C_l, i.e., the cell must be panchromatic. Another type of possible solution to 2D-BiSperner is a vertex whose coloring violates the proper coloring conditions for the boundary, namely rules 2–5 above. For notational convenience we refer to the first coordinate of the output of C_l by C_l^1 and to the second coordinate by C_l^2. The formal definition of the computational problem 2D-BiSperner is then the following.

2D-BiSperner.
Input: A boolean circuit C_l : {0, 1}^n × {0, 1}^n → {−1, 1}^2.
Output: A vertex (i, j) ∈ {0, 1}^n × {0, 1}^n such that one of the following holds:
1. i ≠ 2^n − 1, j ≠ 2^n − 1, and
   ⋃_{i′ − i ∈ {0,1}, j′ − j ∈ {0,1}} C_l^1(i′, j′) = {−1, 1} and ⋃_{i′ − i ∈ {0,1}, j′ − j ∈ {0,1}} C_l^2(i′, j′) = {−1, 1}, or
2. i = 0 and C_l^1(i, j) = −1, or
3. i = 2^n − 1 and C_l^1(i, j) = +1, or
4. j = 0 and C_l^2(i, j) = −1, or
5. j = 2^n − 1 and C_l^2(i, j) = +1.

As we show next, 2D-BiSperner is PPAD-hard. Thus our reduction from 2D-BiSperner to GDAFixedPoint in the next section establishes both the PPAD-hardness of GDAFixedPoint and the inclusion of 2D-BiSperner in PPAD.

Lemma 6.2.
The problem 2D-BiSperner is PPAD-hard.

Proof. To prove this lemma we will use Lemma 2.5. Let C_M be a polynomial-time Turing machine that computes a function M : [0, 1]^2 → [0, 1]^2 that is L-Lipschitz. We know from Lemma 2.5 that finding γ-approximate fixed points of M is PPAD-hard. We will use C_M to define a circuit C_l such that a solution of 2D-BiSperner with input C_l will give us a γ-approximate fixed point of M.

Consider the function g(x) = M(x) − x. Since M is L-Lipschitz, the function g : [0, 1]^2 → [−1, 1]^2 is (L + 1)-Lipschitz. Additionally, g can easily be computed via a polynomial-time Turing machine C_g that uses C_M as a subroutine. We construct a proper coloring of a fine grid of [0, 1]^2 using the signs of the outputs of g. Namely, we set n = ⌈log(L/γ) + 2⌉, which defines a 2^n × 2^n grid over [0, 1]^2 that is indexed by {0, 1}^n × {0, 1}^n. Let g_η : [0, 1]^2 → [−1, 1]^2 be the function that the Turing machine C_g evaluates when the requested accuracy is η > 0. Now we can define the circuit C_l as follows:

C_l^1(i, j) = +1 if i = 0; −1 if i = 2^n − 1; +1 if g_{η,1}(i/(2^n − 1), j/(2^n − 1)) ≥ 0 and i ∉ {0, 2^n − 1}; −1 if g_{η,1}(i/(2^n − 1), j/(2^n − 1)) < 0 and i ∉ {0, 2^n − 1},

C_l^2(i, j) = +1 if j = 0; −1 if j = 2^n − 1; +1 if g_{η,2}(i/(2^n − 1), j/(2^n − 1)) ≥ 0 and j ∉ {0, 2^n − 1}; −1 if g_{η,2}(i/(2^n − 1), j/(2^n − 1)) < 0 and j ∉ {0, 2^n − 1},

where g_i is the i-th output coordinate of g. It is not hard then to observe that the coloring C_l is proper, i.e., it satisfies the boundary conditions, due to the fact that the image of M is always inside [0, 1]^2. Therefore the only possible solution to 2D-BiSperner with input C_l is a cell that contains all the colors {1−, 1+, 2−, 2+}. Let (i, j) be the bottom-left vertex of this cell, which we denote by R, namely

R = { x ∈ [0, 1]^2 | x_1 ∈ [i/(2^n − 1), (i + 1)/(2^n − 1)], x_2 ∈ [j/(2^n − 1), (j + 1)/(2^n − 1)] }.

Claim 6.3. Let η = γ/(2√2). There exists x ∈ R such that |g_1(x)| ≤ γ/(2√2) and y ∈ R such that |g_2(y)| ≤ γ/(2√2).

Proof of Claim 6.3. We prove the existence of x; the existence of y follows by an identical argument. If there exists a corner x of R such that g_1(x) is in the range [−η, η], then the claim follows. Suppose not. Since the cell R is panchromatic, the first color of one of the corners of R is 1− and the first color of another corner of R is 1+, so there exist corners x, x′ such that g_{η,1}(x) ≥ 0 and g_{η,1}(x′) ≤ 0. But we have that ‖g_η − g‖_∞ ≤ η. This, together with the fact that g_1(x) ∉ [−η, η] and g_1(x′) ∉ [−η, η], implies that g_1(x) ≥ 0 ≥ g_1(x′). Because of the L-Lipschitzness of g_1 and because the distance between x and x′ is at most √2 · γ/(4L), we conclude that |g_1(x) − g_1(x′)| ≤ γ/(2√2). Hence, due to the signs of g_1(x) and g_1(x′), we conclude that |g_1(x)| ≤ γ/(2√2). In the same way we can prove that |g_2(y)| ≤ γ/(2√2), and the claim follows.
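The construction in the proof can be sketched end to end: color each grid vertex by the signs of g = M − id (forcing the boundary colors of rules 2–5), then locate a panchromatic cell, whose vertices approximate a fixed point of M. The contraction map M below is an illustrative stand-in for the Turing machine C_M, we evaluate g exactly instead of through the approximations g_η, and we find the panchromatic cell by brute-force scan rather than through the PPAD machinery.

```python
import numpy as np

# Hypothetical demo of the reduction in Lemma 6.2.  M is an illustrative
# L-Lipschitz map (a contraction), not from the paper; its displacement
# field g = M - id supplies the colors, with boundary colors forced so
# that the coloring is proper (rules 2-5).

def M(x):
    return 0.5 * x + 0.25            # unique fixed point at (0.5, 0.5)

def g(x):
    return M(x) - x

def color(i, j, n):
    N = 2 ** n
    gx = g(np.array([i / (N - 1), j / (N - 1)]))
    c1 = +1 if i == 0 else -1 if i == N - 1 else (+1 if gx[0] >= 0 else -1)
    c2 = +1 if j == 0 else -1 if j == N - 1 else (+1 if gx[1] >= 0 else -1)
    return c1, c2

def find_panchromatic(n):
    """Brute-force scan for a cell whose corners carry all four colors."""
    N = 2 ** n
    for i in range(N - 1):
        for j in range(N - 1):
            cs = [color(i + di, j + dj, n) for di in (0, 1) for dj in (0, 1)]
            if ({c[0] for c in cs} == {-1, 1}
                    and {c[1] for c in cs} == {-1, 1}):
                return i, j

i, j = find_panchromatic(4)
z = np.array([i / 15, j / 15])       # bottom-left vertex of the found cell
print((i, j))                        # (7, 7): the cell straddling x = y = 0.5
print(np.linalg.norm(M(z) - z) < 0.05)   # True: z approximates the fixed point
```

With n = 4 the scan returns the unique cell whose corners straddle the zero of g in both coordinates, and the residual ‖M(z) − z‖ is on the order of the cell diameter, mirroring the role of γ in the proof.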
0, 1 } n both as a binary string and as a numberin ([ n − ] − ) and it is clear from the context which of the two we use. The latter is inaccurate for the cases where the vertex ( j ) belongs to either facets i = i = n −
1. Notice thatthe coloring in such vertices does not depend on the value of g η . However in case where the color of such a corner isnot consistent with the value of g η , i.e. g η ,1 ( j ) < C l ( j ) = | g ( j ) | ≤ η . This is dueto the fact that g ( j ) ≥ | g ( j ) − g η ( j ) | ≤ η . L -Lipschitzness of g we get that for every z ∈ R | g ( z ) − g ( x ) | ≤ L (cid:107) x − z (cid:107) ≤ √ · L · γ L = ⇒ | g ( z ) | ≤ γ √ | g ( z ) − g ( y ) | ≤ L (cid:107) y − z (cid:107) ≤ √ · L · γ L = ⇒ | g ( z ) | ≤ γ √ z , w it holds that (cid:107) z − w (cid:107) ≤ √ γ L which follows from the definition of the size of the grid. Therefore we have that (cid:107) g ( z ) (cid:107) ≤ γ andhence (cid:107) M ( z ) − z (cid:107) ≤ γ which implies that any point z ∈ R is a γ -approximate fixed point of M and the lemma follows.Now that we have established the PPAD -hardness of 2D-B i S perner we are ready to present ourmain result of this section which is a reduction from 2D-B i S perner to GDAF ixed P oint . We start with presenting a construction of a Lipschitz and smooth real-valued function f : [
0, 1 ] × [
0, 1 ] → R based on a given coloring circuit C l : {
0, 1 } n × {
0, 1 } n → {−
1, 1 } . Thenin Section 6.2.1 we will show that any solution to GDAF ixed P oint with input the representation C f of f is also a solution to the 2D-B i S perner problem with input C l . Constructing Lipschitzand smooth functions based on only local information is a surprisingly challenging task in high-dimensions as we will explain in detail in Section 7. Fortunately in the low-dimensional casethat we consider in this section the construction is much more simple and the main ideas of ourreduction are more clear.The basic idea of the construction of f consists in interpreting the coloring of a given pointin the grid as the directions of the gradient of f ( x , y ) with respect to the variables x , y and x , y respectively. More precisely, following the ideas in the proof of Lemma 6.2, we divide the [
0, 1 ] square in square-cells of length 1/ ( N − ) = ( n − ) where the corners of these cellscorrespond to vertices of the N × N grid of the 2D-B i S perner instance described by C l . When x is on a vertex of this grid, the first color of this vertex determines the direction of gradient withrespect to the variables x and y , while the second color of this vertex determines the directionof the gradient of the variables x and y . As an example, if x = ( x , x ) is on a vertex of the N × N grid, and the coloring of this vertex is ( − , 2 + ) , i.e. the output of C l on this vertex is ( − + ) , then we would like to have ∂ f ∂ x ( x , y ) ≥ ∂ f ∂ y ( x , y ) ≤ ∂ f ∂ x ( x , y ) ≤ ∂ f ∂ y ( x , y ) ≥ f locally close to ( x , y ) to be equal to f ( x , y ) = ( x − y ) − ( x − y ) .Similarly, if x is on a vertex of the N × N grid, and the coloring of this vertex is ( − , 2 − ) , i.e. theoutput of C l on this vertex is ( − − ) , then we would like to have ∂ f ∂ x ( x , y ) ≥ ∂ f ∂ y ( x , y ) ≤ ∂ f ∂ x ( x , y ) ≥ ∂ f ∂ y ( x , y ) ≤ f locally close to ( x , y ) to be equal to f ( x , y ) = ( x − y ) + ( x − y ) .23igure 5: The correspondence of the colors of the vertices of the N × N grid with the directionsof the gradient of the function f that we design.In Figure 5 we show pictorially the correspondence of the colors of the vertices of the grid withthe gradient of the function f that we design. As shown in the figure, any set of vertices thatshare at least one of the colors 1 + , 1 − , 2 + , 2 − , agree on the direction of the gradient with respectthe horizontal or the vertical axis. This observation is one of the main ingredients in the proof ofcorrectness of our reduction that we present later in this section.When x is not on a vertex of the N × N grid then our goal is to define f via interpolatingthe functions corresponding to the corners of the cell in which x belongs. 
The reason that thisinterpolation is challenging is that we need to make sure the following properties are satisfied (cid:46) the resulting function f is both Lipschitz and smooth inside every cell, (cid:46) the resulting function f is both Lipschitz and smooth even at the boundaries of every cell,where two differect cells stick together, (cid:46) no solution to the GDAF ixed P oint problem is created inside cells that are not solutions tothe 2D-B i S perner problem. In particular, it has to be true that if all the vertices of one cellagree on some color then the gradient of f inside that cell has large enough gradient in thecorresponding direction.For the low dimensional case, that we explore in this section, satisfying the first two propertiesis not a very difficult task, whereas for the third property we need to be careful and achievingthis property is the main technical contribution of this section. On the contrary, for the high-dimensional case that we explore in Section 7 even achieving the first two properties is verychallenging and technical.As we will see in Section 6.2.1, if we accomplish a construction of a function f with theaforementioned properties, then the fixed points of the projected Gradient Descent/Ascent canonly appear inside cells that have all of the colors { − , 1 + , 2 − , 2 + } at their corners. To see thisconsider a cell that misses some color, e.g. 1 + . Then all the corners of this cell have as firstcolor 1 − . Since f is defined as interpolation of the functions in the corners of the cells, with theaforementioned properties, inside that cell there is always a direction with respect to x and y for which the gradient is large enough. Hence any point inside that cell cannot be a fixed pointof the projected Gradient Descent/Ascent. Of course this example provides just an intuition ofour construction and ignores case where the cell is on the boundary of the grid. 
We provide adetailed explanation of this case in Section 6.2.1.The above neat idea needs some technical adjustments in order to work. At first, the inter-polation of the function in the interior of the cell must be smooth enough so that the resulting24unction is both Lipschitz and smooth. In order to satisfy this, we need to choose appropriatecoefficients of the interpolation that interpolate smoothly not only the value of the function butalso its derivatives. For this purpose we use the following smooth step function of order 1. Definition 6.4 (Smooth Step Function of Order 1) . We define S : [
0, 1 ] → [
0, 1 ] to be the smoothstep function of order S ( x ) = x − x . Observe that the following hold S ( ) = S ( ) = S (cid:48) ( ) =
0, and S (cid:48) ( ) = x it could be that the derivatives of these coefficients overpower the derivatives ofthe functions that we interpolate. In this case we could be potentially creating fixed points ofGradient Descent/Ascent even in non panchromatic squares. As we will see later the magnitudeof the derivatives from the interpolation coefficients depends on the differences x − y and x − y . Hence if we ensure that these differences are small then the derivatives of the interpolationcoefficients will have to remain small and hence they can never overpower the derivatives fromthe corners of every cell. This is the place in our reduction where we add the constraints A · ( x , y ) ≤ b that define the domain of the function f as we describe in Section 3.Now that we have summarized the main ideas of our construction we are ready for the formaldefinition of f based on the coloring circuit C l . Definition 6.5 (Continuous and Smooth Function from Colorings of 2D-Bi-Sperner) . Given abinary circuit C l : {
0, 1 } n × {
0, 1 } n → {−
1, 1 } , we define the function f C l : [
0, 1 ] × [
0, 1 ] → R asfollows. For any x ∈ [
0, 1 ] , let A = ( i A , j A ) , B = ( i B , j B ) , C = ( i C , j C ) , D = ( i D , j D ) be the verticesof the cell of the N (= n ) × N grid which contains x and x A , x B , x C and x C the correspondingpoints in the unit square [
0, 1 ] , i.e. x A = i A / ( n − ) , x A = j A / ( n − ) etc. Let also A bedown-left corner of this cell and B , C , D be the rest of the vertices in clockwise order, then wedefine f C l ( x , y ) = α ( x ) · ( y − x ) + α ( x ) · ( y − x ) where the coefficients α ( x ) , α ( x ) ∈ [ −
1, 1 ] are defined as follows α i ( x ) = S (cid:32) x C − x δ (cid:33) · S (cid:32) x C − x δ (cid:33) · C il ( A ) + S (cid:18) x D − x δ (cid:19) · S (cid:18) x − x D δ (cid:19) · C il ( B )+ S (cid:32) x − x A δ (cid:33) · S (cid:18) x − x A δ (cid:19) · C il ( C ) + S (cid:18) x − x B δ (cid:19) · S (cid:18) x B − x δ (cid:19) · C il ( D ) where δ (cid:44) ( N − ) = ( n − ) .In Figure 6 we present an example of the application of Definition 6.5 to a specific cell with somegiven coloring on the corners.An important property of the definition of the function f C l is that the coefficients used in thedefinition of α i have the following two properties S (cid:32) x C − x δ (cid:33) · S (cid:32) x C − x δ (cid:33) ≥ S (cid:18) x D − x δ (cid:19) · S (cid:18) x − x D δ (cid:19) ≥ S (cid:32) x − x A δ (cid:33) · S (cid:18) x − x A δ (cid:19) ≥ S (cid:18) x − x B δ (cid:19) · S (cid:18) x B − x δ (cid:19) ≥
0, and25igure 6: Example of the definition of the Lipschitz and smooth function f on some cell giventhe coloring on the corners of the cell. For details see Definition 6.5. S (cid:32) x C − x δ (cid:33) · S (cid:32) x C − x δ (cid:33) + S (cid:18) x D − x δ (cid:19) · S (cid:18) x − x D δ (cid:19) + S (cid:32) x − x A δ (cid:33) · S (cid:18) x − x A δ (cid:19) + S (cid:18) x − x B δ (cid:19) · S (cid:18) x B − x δ (cid:19) = f C l inside a cell is a smooth convex combination of the functions on thecorners of the cell, as is suggested from Figure 6. Of course there are many ways to define suchconvex combination but in our case we use the smooth step function S to ensure the Lipschitzcontinuous gradient of the overall function f C l . We prove this formally in the next lemma. Lemma 6.6.
Let f_Cl be the function defined based on a coloring circuit C_l, as per Definition 6.5. Then f_Cl is continuous and differentiable at any point (x, y) ∈ [0, 1]^4. Moreover, f_Cl is Θ(1/δ)-Lipschitz and Θ(1/δ^2)-smooth in the whole 4-dimensional hypercube [0, 1]^4, where δ = 1/(N − 1) = 1/(2^n − 1).

Proof. Clearly from Definition 6.5, f_Cl is differentiable at any point (x, y) ∈ [0, 1]^4 at which x lies in the strict interior of its respective cell. In this case the derivative with respect to x_1 is

∂f_Cl(x, y)/∂x_1 = ∂α_1(x)/∂x_1 · (y_1 − x_1) − α_1(x) + ∂α_2(x)/∂x_1 · (y_2 − x_2),

where for ∂α_1(x)/∂x_1 we have that

∂α_1(x)/∂x_1 = −(1/δ) · S′((x_{C,1} − x_1)/δ) · S((x_{C,2} − x_2)/δ) · C_l^1(A) − (1/δ) · S′((x_{D,1} − x_1)/δ) · S((x_2 − x_{D,2})/δ) · C_l^1(B) + (1/δ) · S′((x_1 − x_{A,1})/δ) · S((x_2 − x_{A,2})/δ) · C_l^1(C) + (1/δ) · S′((x_1 − x_{B,1})/δ) · S((x_{B,2} − x_2)/δ) · C_l^1(D).

Now, since max_{z ∈ [0,1]} |S′(z)| ≤ 6, we can conclude that |∂α_1(x)/∂x_1| ≤ 24/δ. Similarly we can prove that |∂α_2(x)/∂x_1| ≤ 24/δ, which combined with |α_1(x)| ≤ 1 gives |∂f_Cl(x, y)/∂x_1| ≤ O(1/δ). Using similar reasoning we can prove that |∂f_Cl(x, y)/∂x_2| ≤ O(1/δ) and that |∂f_Cl(x, y)/∂y_i| ≤ 1 for i = 1, 2. Hence ‖∇f_Cl(x, y)‖ ≤ O(1/δ).

The only thing missing to prove the Lipschitzness of f_Cl is its continuity on the boundaries of the cells of our subdivision. Suppose x lies on the boundary of some cell, e.g., let x lie on the edge (C, D) of one cell, which is the same as the edge (A′, B′) of the cell to its right. Since S(0) = 0 and S′(0) = S′(1) = 0, both cells yield the same values of α_i and ∂α_i(x)/∂x_1. Therefore the value of ∂f_Cl/∂x_1 remains the same no matter according to which cell it was calculated. As a result, f_Cl is differentiable with respect to x_1 even if x belongs to the boundary of its cell. Using the exact same reasoning for the rest of the variables, one can show that the function f_Cl is differentiable at any point (x, y) ∈ [0, 1]^4, and because of the aforementioned bound on the gradient ∇f_Cl we can conclude that f_Cl is O(1/δ)-Lipschitz.

Using very similar calculations, we can compute the closed forms of the second derivatives of f_Cl, and using the bounds |C_l^i(·)| ≤ 1, |S(·)| ≤ 1, |S′(·)| ≤ 6, and |S″(·)| ≤ 6, we can prove that each entry of the Hessian ∇^2 f_Cl(x, y) is bounded by O(1/δ^2), and thus ‖∇^2 f_Cl(x, y)‖ ≤ O(1/δ^2), which implies the Θ(1/δ^2)-smoothness of f_Cl.

In this section, we present and prove the exact polynomial-time construction of the instance of the problem GDAFixedPoint from an instance C_l of the problem 2D-BiSperner.

(+) Construction of Instance for Fixed Points of Gradient Descent/Ascent.
Our construction can be described via the following properties.

- The payoff function is the real-valued function f_Cl(x, y) from Definition 6.5.
- The domain is the polytope P(A, b) that we described in Section 3. The matrix A and the vector b have constant size and they are computed so that the following inequalities hold:

x_1 − y_1 ≤ ∆, y_1 − x_1 ≤ ∆, x_2 − y_2 ≤ ∆, and y_2 − x_2 ≤ ∆, (6.1)

where ∆ = δ^2/12 and δ = 1/(N − 1) = 1/(2^n − 1).
- The parameter α is set equal to ∆/3.
- The parameters G and L are set equal to the upper bounds on the Lipschitzness and the smoothness of f_Cl, respectively, that we derived in Lemma 6.6. Namely, we have that G = O(1/δ) = O(2^n) and L = O(1/δ^2) = O(2^{2n}).

The first thing that is simple to observe about the above reduction is that it runs in polynomial time with respect to the size of the circuit C_l, which is the input to the 2D-BiSperner problem that we started with. To see this, recall from the definition of GDAFixedPoint that our reduction needs to output: (1) a Turing machine C_{f_Cl} that computes the value and the gradient of the function f_Cl in time polynomial in the number of requested bits of accuracy; (2) the required scalars α, G, and L. For the first, we observe from the definition of f_Cl that it is actually a piecewise polynomial function with a closed form that depends only on the values of the circuit C_l at the corners of the corresponding cell. Since the size of C_l is the size of the input to 2D-BiSperner, we can easily construct a polynomial-time Turing machine that computes both the value and the gradient of the piecewise polynomial function f_Cl. Also, from the aforementioned description of the reduction we have that log(G), log(L), and log(1/α) are linear in n, and hence we can construct the binary representations of all these scalars in time O(n). The same is true for the coefficients of A and b, as we can see from their definition in (+). Hence we conclude that our reduction runs in time that is polynomial in the size of the circuit C_l.

The next thing to observe is that, according to Lemma 6.6, the function f_Cl is both G-Lipschitz and L-smooth, and hence the output of our reduction is a valid input for the promise problem GDAFixedPoint. So the last step to complete the proof of Theorem 6.1 is to prove that the vector x* of every solution (x*, y*) of GDAFixedPoint with input C_{f_Cl} lies in a cell that is either panchromatic or violates the rules of proper coloring; in either case we can find a solution to the 2D-BiSperner problem. This proves that our construction reduces 2D-BiSperner to GDAFixedPoint.

We prove this last statement in Lemma 6.8, but before that we need the following technical lemma, which is useful for arguing about solutions on the boundary of P(A, b).

Lemma 6.7.
Let C_l be an input to the 2D-BiSperner problem, let f_Cl be the corresponding G-Lipschitz and L-smooth function defined in Definition 6.5, and let P(A, b) be the polytope defined by (6.1). If (x*, y*) is any solution to the GDAFixedPoint problem with inputs α, G, L, C_{f_Cl}, A, and b defined in (+), then the following statements hold, where we recall that ∆ = δ^2/12. For i ∈ {1, 2}:

- If x*_i ∈ (2α, 1 − 2α) and x*_i ∈ (y*_i − ∆ + 2α, y*_i + ∆ − 2α), then |∂f_Cl(x*, y*)/∂x_i| ≤ α.
- If x*_i ≤ 2α or x*_i ≤ y*_i − ∆ + 2α, then ∂f_Cl(x*, y*)/∂x_i ≥ −α.
- If x*_i ≥ 1 − 2α or x*_i ≥ y*_i + ∆ − 2α, then ∂f_Cl(x*, y*)/∂x_i ≤ α.

The symmetric statements for y*_i hold. For i ∈ {1, 2}:

- If y*_i ∈ (2α, 1 − 2α) and y*_i ∈ (x*_i − ∆ + 2α, x*_i + ∆ − 2α), then |∂f_Cl(x*, y*)/∂y_i| ≤ α.
- If y*_i ≤ 2α or y*_i ≤ x*_i − ∆ + 2α, then ∂f_Cl(x*, y*)/∂y_i ≤ α.
- If y*_i ≥ 1 − 2α or y*_i ≥ x*_i + ∆ − 2α, then ∂f_Cl(x*, y*)/∂y_i ≥ −α.

Proof. For this proof it is convenient to define x̂ = x* − ∇_x f_Cl(x*, y*), K(y*) = { x | (x, y*) ∈ P(A, b) }, and z = Π_{K(y*)} x̂.

We first consider the first statement, so for the sake of contradiction let us assume that x*_i ∈ (2α, 1 − 2α), that x*_i ∈ (y*_i − ∆ + 2α, y*_i + ∆ − 2α), and that |∂f_Cl(x*, y*)/∂x_i| > α. Due to the definition of P(A, b) in (6.1), the set K(y*) is an axis-aligned box of R^2, and hence the projection of any vector x onto K(y*) can be implemented independently for every coordinate x_i of x. Therefore, if it happens that x̂_i ∈ (0, 1) ∩ (y*_i − ∆, y*_i + ∆), then it holds that x̂_i = z_i. Now from the definition of x̂_i and z_i, and the fact that K(y*) is an axis-aligned box, we get that |x*_i − z_i| = |x*_i − x̂_i| = |∂f_Cl(x*, y*)/∂x_i| > α, which contradicts the fact that (x*, y*) is a solution to the problem GDAFixedPoint. On the other hand, if x̂_i ∉ (y*_i − ∆, y*_i + ∆) ∩ (0, 1), then z_i has to be on the boundary of K(y*), and hence z_i has to be equal to either 0, or 1, or y*_i − ∆, or y*_i + ∆. In any of these cases, since we assumed that x*_i ∈ (2α, 1 − 2α) and that x*_i ∈ (y*_i − ∆ + 2α, y*_i + ∆ − 2α), we conclude that |x*_i − z_i| > α, and hence we again get a contradiction with the fact that (x*, y*) is a solution to the problem GDAFixedPoint. Hence we have that |∂f_Cl(x*, y*)/∂x_i| ≤ α.

For the second case, we assume for the sake of contradiction that x*_i ≤ 2α and ∂f_Cl(x*, y*)/∂x_i < −α. These imply that x̂_i > x*_i + α and that z_i = min(y*_i + ∆, x̂_i, 1) > min(∆, x̂_i, 1) ≥ min(3α, x*_i + α). As a result, |x*_i − z_i| = z_i − x*_i > min(3α, x*_i + α) − x*_i, which is greater than α. The latter contradicts the assumption that (x*, y*) is a solution to the GDAFixedPoint problem. Also, if we assume that x*_i ≤ y*_i − ∆ + 2α, then using the same reasoning we get that z_i = min(x̂_i, y*_i + ∆, 1). From this we can again prove that |x*_i − z_i| > α, which contradicts the fact that (x*, y*) is a solution to GDAFixedPoint.

The third case can be proved using the same arguments as the second case. Then, using the corresponding arguments, we can prove the corresponding statements for the y variables.

We are now ready to prove that solutions of GDAFixedPoint can only occur in cells that are either panchromatic or violate the boundary conditions of a proper coloring. For convenience, in the rest of this section we define R(x) to be the cell of the 2^n × 2^n grid that contains x.
R(x) = [i/(2^n − 1), (i + 1)/(2^n − 1)] × [j/(2^n − 1), (j + 1)/(2^n − 1)], (6.2)

for i, j such that x_1 ∈ [i/(2^n − 1), (i + 1)/(2^n − 1)] and x_2 ∈ [j/(2^n − 1), (j + 1)/(2^n − 1)]. If there are multiple i, j that satisfy the above condition, then we choose R(x) to be the cell that corresponds to the lexicographically first pair (i, j) satisfying the condition. We also define the corners R_c(x) of R(x) as

R_c(x) = { (i, j), (i, j + 1), (i + 1, j), (i + 1, j + 1) }, (6.3)

where R(x) = [i/(2^n − 1), (i + 1)/(2^n − 1)] × [j/(2^n − 1), (j + 1)/(2^n − 1)].

Lemma 6.8.
Let C_l be an input to the 2D-BiSperner problem, let f_{C_l} be the corresponding G-Lipschitz and L-smooth function defined in Definition 6.5, and let P(A, b) be the polytope defined by (6.1). If (x⋆, y⋆) is any solution to the GDAFixedPoint problem with inputs α, G, L, C_{f_{C_l}}, A, and b defined in (+), then none of the following statements hold for the cell R(x⋆).

1. x⋆₁ ≥ 1/(n−1) and, for all v ∈ R_c(x⋆), it holds that C¹_l(v) = −1.
2. x⋆₁ ≤ (n−2)/(n−1) and, for all v ∈ R_c(x⋆), it holds that C¹_l(v) = +1.
3. x⋆₂ ≥ 1/(n−1) and, for all v ∈ R_c(x⋆), it holds that C²_l(v) = −1.
4. x⋆₂ ≤ (n−2)/(n−1) and, for all v ∈ R_c(x⋆), it holds that C²_l(v) = +1.

Proof. We prove that there is no solution (x⋆, y⋆) of GDAFixedPoint that satisfies statement 1.; the fact that (x⋆, y⋆) cannot satisfy the other statements follows similarly. It is convenient for us to define

x̂ = x⋆ − ∇_x f_{C_l}(x⋆, y⋆),  K(y⋆) = { x | (x, y⋆) ∈ P(A, b) },  z = Π_{K(y⋆)} x̂, and
ŷ = y⋆ + ∇_y f_{C_l}(x⋆, y⋆),  K(x⋆) = { y | (x⋆, y) ∈ P(A, b) },  w = Π_{K(x⋆)} ŷ.

For the sake of contradiction, we assume that there exists a solution (x⋆, y⋆) such that x⋆₁ ≥ 1/(n−1) and, for all v ∈ R_c(x⋆), it holds that C¹_l(v) = −1. Using the fact that the first color of all the corners of R(x⋆) is 1−, we will prove that (1) ∂f_{C_l}(x⋆, y⋆)/∂x₁ ≥ 1/2 and (2) ∂f_{C_l}(x⋆, y⋆)/∂y₁ = −1. Let R(x⋆) = [ i/(n−1), (i+1)/(n−1) ] × [ j/(n−1), (j+1)/(n−1) ]; then, since all the corners v ∈ R_c(x⋆) have C¹_l(v) = −1, from Definition 6.5 we have that

f_{C_l}(x⋆, y⋆) = (x⋆₂ − y⋆₂) · α₂(x⋆)
  − (x⋆₁ − y⋆₁) · S((x_{C,1} − x⋆₁)/δ) · S((x_{C,2} − x⋆₂)/δ) · C¹_l(i, j)
  − (x⋆₁ − y⋆₁) · S((x_{D,1} − x⋆₁)/δ) · S((x⋆₂ − x_{D,2})/δ) · C¹_l(i, j+1)
  − (x⋆₁ − y⋆₁) · S((x⋆₁ − x_{A,1})/δ) · S((x⋆₂ − x_{A,2})/δ) · C¹_l(i+1, j+1)
  − (x⋆₁ − y⋆₁) · S((x⋆₁ − x_{B,1})/δ) · S((x_{B,2} − x⋆₂)/δ) · C¹_l(i+1, j),

where (x_{A,1}, x_{A,2}) = (i/(n−1), j/(n−1)), (x_{B,1}, x_{B,2}) = (i/(n−1), (j+1)/(n−1)), (x_{C,1}, x_{C,2}) = ((i+1)/(n−1), (j+1)/(n−1)), and (x_{D,1}, x_{D,2}) = ((i+1)/(n−1), j/(n−1)). If we differentiate this with respect to y₁, we immediately get that ∂f_{C_l}(x⋆, y⋆)/∂y₁ = −1. On the other hand, if we differentiate with respect to x₁, we get

∂f_{C_l}(x⋆, y⋆)/∂x₁ = 1
  + (x⋆₁ − y⋆₁) · (1/δ) · S′((x_{C,1} − x⋆₁)/δ) · S((x_{C,2} − x⋆₂)/δ) · C¹_l(i, j)
  + (x⋆₁ − y⋆₁) · (1/δ) · S′((x_{D,1} − x⋆₁)/δ) · S((x⋆₂ − x_{D,2})/δ) · C¹_l(i, j+1)
  − (x⋆₁ − y⋆₁) · (1/δ) · S′((x⋆₁ − x_{A,1})/δ) · S((x⋆₂ − x_{A,2})/δ) · C¹_l(i+1, j+1)
  − (x⋆₁ − y⋆₁) · (1/δ) · S′((x⋆₁ − x_{B,1})/δ) · S((x_{B,2} − x⋆₂)/δ) · C¹_l(i+1, j)
  ≥ 1 − Θ(1) · |x⋆₁ − y⋆₁| / δ ≥ 1 − Θ(Δ/δ) ≥ 1/2,   (6.4)

where we used that |S′(·)| is bounded by a constant and that, for all (x, y) ∈ P(A, b), it holds that |x₁ − y₁| ≤ Δ. Hence we have established that if x⋆₁ ≥ 1/(n−1) and, for all v ∈ R_c(x⋆), it holds that C¹_l(v) = −1, then ∂f_{C_l}(x⋆, y⋆)/∂x₁ ≥ 1/2 and ∂f_{C_l}(x⋆, y⋆)/∂y₁ = −1. Now it is easy to see that the only way to satisfy both ∂f_{C_l}(x⋆, y⋆)/∂x₁ ≥ 1/2 and |z₁ − x⋆₁| ≤ α is that either x⋆₁ ≤ α or x⋆₁ ≤ y⋆₁ − Δ + α. The first case is excluded by the assumption in the first statement of our lemma and our choice of α = Δ/3 < 1/(n−1); thus it holds that x⋆₁ ≤ y⋆₁ − Δ + α. But then we can use the third case for the y variables of Lemma 6.7 and we get that ∂f_{C_l}(x⋆, y⋆)/∂y₁ ≥ −α, which cannot be true since we proved that ∂f_{C_l}(x⋆, y⋆)/∂y₁ = −1. Therefore we have a contradiction and the first statement of the lemma holds. Using the same reasoning we prove the rest of the statements.
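Throughout these proofs, the points z = Π_{K(y⋆)} x̂ and w = Π_{K(x⋆)} ŷ are easy to compute: for a fixed y⋆, the slice K(y⋆) of the polytope is a coordinate-wise interval, so the Euclidean projection is a clip. The following sketch is our own illustration, not code from the paper; the gradients and the names delta_cap, alpha are stand-ins for Δ and α. It checks the approximate-fixed-point condition ‖z − x⋆‖∞ ≤ α, ‖w − y⋆‖∞ ≤ α used above:

```python
import numpy as np

def gda_fixed_point_residual(x, y, grad_x, grad_y, delta_cap, box=(0.0, 1.0)):
    """Residual of the projected Gradient Descent/Ascent update at (x, y).

    Under the constraints box[0] <= x_i <= box[1] and |x_i - y_i| <= delta_cap,
    the feasible set for x given y (and vice versa) is a coordinate-wise
    interval, so the Euclidean projection is a coordinate-wise clip.
    """
    lo, hi = box
    x_hat = x - grad_x  # descent step of the min player
    y_hat = y + grad_y  # ascent step of the max player
    z = np.clip(x_hat, np.maximum(lo, y - delta_cap), np.minimum(hi, y + delta_cap))
    w = np.clip(y_hat, np.maximum(lo, x - delta_cap), np.minimum(hi, x + delta_cap))
    return max(np.max(np.abs(z - x)), np.max(np.abs(w - y)))

def is_gda_fixed_point(x, y, grad_x, grad_y, delta_cap, alpha):
    """True iff (x, y) is an alpha-approximate fixed point of projected GDA."""
    return gda_fixed_point_residual(x, y, grad_x, grad_y, delta_cap) <= alpha
```

For instance, at x = y = 0.5 with gradient (+1, −1) and Δ = 0.3, the descent step is clipped back to the constraint boundary and the residual is Δ, so the point is rejected for any α < Δ.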
Remark. The computations presented in (6.4) are the precise point where an attempt to prove the hardness of minimization problems would fail. In particular, if our goal were to construct a hard minimization instance, then the function f_{C_l} would need to have the terms x_i + y_i instead of x_i − y_i, so that the fixed points of gradient descent coincide with approximate local minima of f_{C_l}. In that case we cannot lower bound the gradient in (6.4) from below by 1/2, because the term |x⋆₁ + y⋆₁| will be the dominant one, and hence the sign of the derivative can change depending on the value of |x⋆₁ + y⋆₁|. For a more intuitive explanation of why we cannot prove hardness of minimization problems, we refer to the Introduction, Section 1.2.

We now have all the ingredients to prove Theorem 6.1.

Proof of Theorem 6.1.
Let (x⋆, y⋆) be a solution to the GDAFixedPoint instance that we construct based on the instance C_l of 2D-BiSperner, and let R(x⋆) be the cell that contains x⋆. If the corners R_c(x⋆) contain all the colors 1−, 1+, 2−, 2+, then we have a solution to the 2D-BiSperner instance and Theorem 6.1 follows. Otherwise, there is at least one color missing from R_c(x⋆); let us assume without loss of generality that one of the missing colors is 1−, hence for every v ∈ R_c(x⋆) it holds that C¹_l(v) = +1. Now, from Lemma 6.8, the only way for this to happen is that x⋆₁ > (n−2)/(n−1), which implies that R_c(x⋆) contains at least one corner of the form v = (n−1, j). But we have assumed that C¹_l(v) = +1, hence v is a violation of the proper coloring rules and hence a solution to the 2D-BiSperner instance. We can prove the corresponding statement if any other color from 1+, 2−, 2+ is missing. Finally, we observe that the function that we define has range [−2, 2], and hence Theorem 6.1 follows.

Although the results of Section 6 are quite indicative of the computational complexity of GDAFixedPoint and LR-LocalMinMax, we have not yet excluded the possibility of algorithms running in poly(d, G, L, 1/ε) time. In this section we present a significantly more challenging, high-dimensional version of the reduction that we presented in Section 6. The advantage of this reduction is that it rules out the existence even of algorithms running in poly(d, G, L, 1/ε) steps, unless FP = PPAD; for details see Theorem 4.4. An easy consequence of our result is an unconditional lower bound in the black-box model, stating that the running time of any algorithm for LR-LocalMinMax that has only oracle access to f and ∇f has to be exponential in d, or G, or L, or 1/ε; for details we refer to Theorem 4.5 and Section 9.

The main reduction that we use to prove Theorem 4.4 is from the high-dimensional generalization of the problem 2D-BiSperner, which we call HighD-BiSperner, to GDAFixedPoint. Our reduction in this section resembles some of the ideas of the reductions of Section 6, but it faces many additional significant technical difficulties. The main difficulty is how to define a function on a d-dimensional simplex that (1) is both Lipschitz and smooth, and (2) interpolates between some fixed functions at the d + 1 vertices of the simplex.

We start by presenting the High
D-BiSperner problem. The HighD-BiSperner problem is a straightforward d-dimensional generalization of the 2D-BiSperner problem that we defined in Section 6. Assume that we have a d-dimensional grid N × ··· (d times) ··· × N. We assign to every vertex of this grid a sequence of d colors, and we say that a coloring is proper if the following rules are satisfied.

1. The ith color of every vertex is either the color i+ or the color i−.
2. All the vertices whose ith coordinate is 0, i.e. they are at the lower boundary of the ith direction, should have the ith color equal to i+.
3. All the vertices whose ith coordinate is N − 1, i.e. they are at the higher boundary of the ith direction, should have the ith color equal to i−.

Using proof ideas similar to those in the proof of the original Sperner's Lemma, it is not hard to prove via a combinatorial argument that in every proper coloring of a d-dimensional grid there exists a cubelet of the grid at whose vertices all the 2·d colors {1−, 1+, ..., d−, d+} appear; we call such a cubelet panchromatic. In the HighD-BiSperner problem we are asked to find such a cubelet, or a violation of the rules of proper coloring. As in Section 6.1, we do not present this combinatorial argument in this paper, since the totality of the HighD-BiSperner problem will follow from our reduction from HighD-BiSperner to GDAFixedPoint and our proofs in Section 5 that establish the totality of GDAFixedPoint.

As in the case of 2D-BiSperner, in order to formally define the computational problem HighD-BiSperner we need to define the coloring of the d-dimensional grid N × ··· × N in a succinct way. The fundamental difference compared to the definition of 2D-BiSperner is that for HighD-BiSperner we assume that N is only polynomially large. This difference will enable us to exclude algorithms for GDAFixedPoint that run in time poly(d, 1/α, G, L).

The input to HighD-BiSperner is a coloring, given via a binary circuit C_l that takes as input the coordinates of a vertex of the grid and outputs the sequence of colors that are used to color this vertex. Each one of the d coordinates is given via the binary representation of a number in [N] − 1. Setting N = 2^ℓ, where ℓ is logarithmic in d, we have that the representation of each coordinate is a member of {0, 1}^ℓ. In the rest of the section we abuse notation and use a coordinate i ∈ {0, 1}^ℓ both as a binary string and as a number in [2^ℓ] − 1. The output of C_l should be a sequence of d colors, where the ith member of this sequence is one of the colors {i−, i+}. We represent this sequence as a member of {−1, +1}^d, where the ith coordinate refers to the choice of i− or i+.

In the definition of the computational problem HighD-BiSperner the input is a circuit C_l, as described above. As we discussed, in the HighD-BiSperner problem we are asking for a panchromatic cubelet of the grid. One issue with this high-dimensional setting is that, in order to check whether a cubelet is panchromatic or not, we have to query all the 2^d corners of this cubelet, which makes the verification problem inefficient, and hence containment in the class PPAD cannot be proved. For this reason, as a solution to HighD-BiSperner we ask not just for a cubelet but for 2·d vertices v^(1), ..., v^(d), u^(1), ..., u^(d), not necessarily different, such that they all belong to the same cubelet, the ith output of C_l with input v^(i) is −1, i.e. corresponds to the color i−, and the ith output of C_l with input u^(i) is +1, i.e. corresponds to the color i+. This way we have a certificate of size 2·d that can be checked in polynomial time. Another possible solution of HighD-BiSperner is a vertex whose coloring violates the aforementioned boundary conditions 2. and 3. of a proper coloring. For notational convenience we refer to the ith coordinate of C_l by C^i_l. The formal definition of HighD-BiSperner is then the following.

HighD-BiSperner.
Input: A boolean circuit C_l : {0,1}^ℓ × ··· × {0,1}^ℓ (d times) → {−1, 1}^d.
Output: One of the following:
1. Two sequences of d vertices v^(1), ..., v^(d) and u^(1), ..., u^(d), not necessarily different, all belonging to the same cubelet, with v^(i), u^(i) ∈ ({0,1}^ℓ)^d, such that C^i_l(v^(i)) = −1 and C^i_l(u^(i)) = +1 for all i ∈ [d].
2. A vertex v ∈ ({0,1}^ℓ)^d with v_i = 0 for some i ∈ [d] such that C^i_l(v) = −1.
3. A vertex v ∈ ({0,1}^ℓ)^d with v_i = N − 1 for some i ∈ [d] such that C^i_l(v) = +1.

We prove the PPAD-hardness of HighD-BiSperner in Theorem 7.2. To prove this we use a stronger version of the Brouwer problem, called γ-SuccinctBrouwer, that was first introduced in [Rub16].

γ-SuccinctBrouwer.
Input: A polynomial-time Turing machine C_M evaluating a 1/γ-Lipschitz continuous vector-valued function M : [0, 1]^d → [0, 1]^d.
Output: A point x⋆ ∈ [0, 1]^d such that ‖M(x⋆) − x⋆‖ ≤ γ.

Theorem 7.1 ([Rub16]). γ-SuccinctBrouwer is PPAD-complete for any fixed constant γ > 0.

Theorem 7.2.
There is a polynomial-time reduction from any instance of the γ-SuccinctBrouwer problem to an instance of HighD-BiSperner with N = Θ(d/γ²).

Proof. Consider the function g(x) = M(x) − x. Since M is 1/γ-Lipschitz, g : [0, 1]^d → [−1, 1]^d is also (1 + 1/γ)-Lipschitz. Additionally, g can easily be computed via a polynomial-time Turing machine C_g that uses C_M as a subroutine. We construct the coloring sequence of every vertex of a d-dimensional grid with N = Θ(d/γ²) points in every direction using g. Let g_η : [0, 1]^d → [−1, 1]^d be the function that the Turing machine C_g evaluates when the requested accuracy is η > 0. For every vertex v = (v₁, ..., v_d) ∈ ([N] − 1)^d of the d-dimensional grid, its coloring sequence C_l(v) ∈ {−1, 1}^d is constructed as follows: for each coordinate j = 1, ..., d,

C^j_l(v) = +1 if v_j = 0, −1 if v_j = N − 1, and sign( g_{η,j}( v₁/(N−1), ..., v_d/(N−1) ) ) otherwise,

where sign : [−1, 1] → {−1, 1} is the sign function and g_{η,j}(·) is the jth coordinate of g_η.

Observe that, by the above construction, any vertex v with v_j = 0 has C^j_l(v) = +1 and any vertex with v_j = N − 1 has C^j_l(v) = −1, hence there are no vertices in the grid satisfying the possible outputs 2. or 3. of the HighD-BiSperner problem. Thus the only possible solution of the above HighD-BiSperner instance is a sequence of 2d vertices v^(1), ..., v^(d), u^(1), ..., u^(d) on the same cubelet that certify that the corresponding cubelet is panchromatic, as per possible output 1. of the HighD-BiSperner problem. We next prove that, for any vertex v of that cubelet, it holds that

| g_j( v/(N−1) ) | ≤ Θ( √d / (γN) ) for all coordinates j = 1, ..., d.

Let v be any vertex in the same cubelet as the output vertices v^(1), ..., v^(d), u^(1), ..., u^(d). From the guarantees on the colors of the sequences v^(1), ..., v^(d), u^(1), ..., u^(d), we have that either C^j_l(v) · C^j_l(v^(j)) = −1 or C^j_l(v) · C^j_l(u^(j)) = −1; let v̂^(j) be whichever of v^(j), u^(j) has jth color with product −1 with C^j_l(v). Now let η = √d/(γN). If g_j( v/(N−1) ) ∈ [−η, η], then the wanted inequality follows. On the other hand, if g_j( v/(N−1) ) ∉ [−η, η], then we use the fact that ‖ g( v/(N−1) ) − g_η( v/(N−1) ) ‖∞ ≤ η and that, from the definition of the colors, either g_{η,j}( v/(N−1) ) ≥ 0 > g_{η,j}( v̂^(j)/(N−1) ) or g_{η,j}( v/(N−1) ) < 0 ≤ g_{η,j}( v̂^(j)/(N−1) ). Hence g_j( v̂^(j)/(N−1) ) is within η of a value whose sign is opposite to that of g_j( v/(N−1) ), and therefore

| g_j( v/(N−1) ) | ≤ | g_j( v/(N−1) ) − g_j( v̂^(j)/(N−1) ) | + η ≤ (1 + 1/γ) · ‖ v/(N−1) − v̂^(j)/(N−1) ‖ + η ≤ Θ( √d / (γN) ),

where in the second inequality we used the (1 + 1/γ)-Lipschitzness of g, and in the third that two vertices of the same cubelet are at distance at most √d/(N−1). As a result, the point v̂ = v/(N−1) ∈ [0, 1]^d satisfies ‖M(v̂) − v̂‖ ≤ Θ( d/(γN) ), and thus, if we pick N = Θ(d/γ²), then any vertex v of the panchromatic cubelet is a solution for γ-SuccinctBrouwer.

Now that we have established the PPAD-hardness of High
D-BiSperner, we are ready to present the main result of this section, which is a reduction from the problem HighD-BiSperner to the problem GDAFixedPoint, with the additional constraints that the scalars α, G, L in the input satisfy 1/α = poly(d), G = poly(d), and L = poly(d).

Given the binary circuit C_l : ([N] − 1)^d → {−1, +1}^d that is an instance of HighD-BiSperner, we construct a G-Lipschitz and L-smooth function f_{C_l} : [0, 1]^d × [0, 1]^d → ℝ. To do so, we divide the [0, 1]^d hypercube into cubelets of side length δ = 1/(N−1). The corners of these cubelets have coordinates that are integer multiples of δ = 1/(N−1), and we call them vertices. Each vertex can be represented by a vector v = (v₁, ..., v_d) ∈ ([N] − 1)^d and admits a coloring sequence defined by the boolean circuit C_l : ([N] − 1)^d → {−1, +1}^d. For every x ∈ [0, 1]^d, we use R(x) to denote the cubelet that contains x; formally,

R(x) = [ c₁/(N−1), (c₁+1)/(N−1) ] × ··· × [ c_d/(N−1), (c_d+1)/(N−1) ],

where c ∈ ([N−1] − 1)^d is such that x ∈ [ c₁/(N−1), (c₁+1)/(N−1) ] × ··· × [ c_d/(N−1), (c_d+1)/(N−1) ]; if there are multiple corners c that satisfy this condition, then we choose R(x) to be the cell that corresponds to the lexicographically first such c. We also define R_c(x) to be the set of vertices that are corners of the cubelet R(x), namely

R_c(x) = { c₁, c₁+1 } × ··· × { c_d, c_d+1 },

where c ∈ ([N−1] − 1)^d is such that R(x) = [ c₁/(N−1), (c₁+1)/(N−1) ] × ··· × [ c_d/(N−1), (c_d+1)/(N−1) ]. Every y that belongs to the cubelet R(x) can be written as a convex combination of the vectors v/(N−1) with v ∈ R_c(x). The value of the function f_{C_l}(x, y) that we construct in this section is determined by the coloring sequences C_l(v) of the vertices v ∈ R_c(x). One of the main challenges that we face, though, is that the size of R_c(x) is 2^d; hence, if we want to be able to compute the value of f_{C_l}(x, y) efficiently, we have to find a consistent rule to pick a small subset of the vertices of R_c(x) whose coloring sequences suffice to define the function value f_{C_l}(x, y). Although there are traditional ways to overcome this difficulty using the canonical simplicial subdivision of the cubelet R(x), these techniques lead only to functions that are continuous and Lipschitz; they are not enough to guarantee continuity of the gradient, and hence the resulting functions are not smooth. The problem of finding a computationally efficient way to define a continuous function as an interpolation of some fixed function at the corners of a cubelet, so that the resulting function is both Lipschitz and smooth, is surprisingly difficult to solve.
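To make the difficulty concrete: the two-dimensional construction of Section 6 blends the four corner colors of a cell using products of a smooth step S, and the direct d-dimensional analogue of such a product blend touches all 2^d corners, which is exponential in d. The sketch below is our own illustration, not the paper's construction; the cubic smooth step used here is a stand-in for the S of Definition 6.5:

```python
import numpy as np

def S(t):
    """A smooth step: S(t) = 0 for t <= 0, S(t) = 1 for t >= 1,
    continuously differentiable in between (stand-in for Definition 6.5's S)."""
    t = np.clip(t, 0.0, 1.0)
    return 3 * t**2 - 2 * t**3

def smooth_bilinear(x1, x2, corner_vals, delta):
    """Blend the four corner values of the cell [0, delta] x [0, delta]
    containing (x1, x2); corner_vals[(i, j)] for i, j in {0, 1}.
    In d dimensions the analogous product blend touches 2**d corners."""
    s1, s2 = S(x1 / delta), S(x2 / delta)
    return ((1 - s1) * (1 - s2) * corner_vals[(0, 0)]
            + (1 - s1) * s2 * corner_vals[(0, 1)]
            + s1 * (1 - s2) * corner_vals[(1, 0)]
            + s1 * s2 * corner_vals[(1, 1)])
```

The blend agrees with the prescribed values at the corners and is differentiable everywhere, but evaluating it requires every corner of the cell; the SEIC coefficients introduced next achieve a comparable interpolation while reading only d + 1 corners.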
For this reason we introduce in this section the smooth and efficient interpolation coefficients (SEIC), which, as we will see in Section 7.2.2, are the main technical tool to implement such an interpolation. Our novel interpolation coefficients are of independent interest, and we believe that they will serve as a main technical tool for proving other hardness results in continuous optimization in the future.

In this section we only give a high-level description of the smooth and efficient interpolation coefficients via the properties that we use in Section 7.2.2 to define the function f_{C_l}. The actual construction of the coefficients is very challenging and technical, and hence we postpone a detailed exposition to Section 8.

Definition 7.3 (Smooth and Efficient Interpolation Coefficients). For every N ∈ ℕ we define the set of smooth and efficient interpolation coefficients (SEIC) as the family of functions, called coefficients,

I_{d,N} = { P_v : [0, 1]^d → ℝ | v ∈ ([N] − 1)^d }

with the following properties.

(A) For all vertices v ∈ ([N] − 1)^d, the coefficient P_v(x) is a twice-differentiable function and satisfies
  • | ∂P_v(x)/∂x_i | ≤ Θ(d/δ).
  • | ∂²P_v(x)/∂x_i∂x_ℓ | ≤ Θ(d/δ).

(B) For all v ∈ ([N] − 1)^d, it holds that P_v(x) ≥ 0 and that Σ_{v ∈ ([N]−1)^d} P_v(x) = Σ_{v ∈ R_c(x)} P_v(x) = 1, for all x ∈ [0, 1]^d.

(C) For all x ∈ [0, 1]^d, it holds that all but d + 1 of the coefficients P_v ∈ I_{d,N} satisfy P_v(x) = 0, ∇P_v(x) = 0, and ∇²P_v(x) = 0. We denote by R⁺(x) the set of the d + 1 vertices whose coefficients may be non-zero at x. Furthermore, it holds that R⁺(x) ⊆ R_c(x), and given x we can compute the set R⁺(x) in time poly(d).

(D) For all x ∈ [0, 1]^d, if x_i ≤ 1/(N−1) for some i ∈ [d], then there exists v ∈ R⁺(x) such that v_i = 0. Respectively, if x_i ≥ 1 − 1/(N−1), then there exists v ∈ R⁺(x) such that v_i = N − 1.

In words, the properties (A) – (D) can be expressed as follows.

(A) – The coefficients P_v are both Lipschitz and smooth, with Lipschitzness and smoothness parameters that depend polynomially on d and N = 1/δ + 1.
(B) – The coefficients P_v(x) define a convex combination of the vertices in R_c(x).
(C) – For every x ∈ [0, 1]^d, out of the N^d coefficients P_v, only d + 1 can be non-zero at x. Moreover, given x ∈ [0, 1]^d, we can identify these d + 1 coefficients efficiently.
(D) – For every x ∈ [0, 1]^d that lies in a cubelet touching the boundary, at least one of the vertices in R⁺(x) is on the boundary of the continuous hypercube [0, 1]^d.

In Section 10, in the proof of Theorem 10.4, we present a simple application of the existence of the SEIC coefficients: proving simple black-box oracle lower bounds for the global minimization problem.

Based on the existence of these coefficients, we are now ready to define the function f_{C_l}, which is the main construction of our reduction. In this section our goal is to formally define the function f_{C_l} and prove its Lipschitzness and smoothness properties in Lemma 7.5.

Definition 7.4 (Continuous and Smooth Function from Colorings of Bi-Sperner). Given a binary circuit C_l : ([N] − 1)^d → {−1, 1}^d, we define the function f_{C_l} : [0, 1]^d × [0, 1]^d → ℝ as follows:

f_{C_l}(x, y) = Σ_{j=1}^d (x_j − y_j) · α_j(x), where α_j(x) = − Σ_{v ∈ ([N]−1)^d} P_v(x) · C^j_l(v),

and the P_v are the coefficients defined in Definition 7.3.

We first prove that the function f_{C_l} constructed in Definition 7.4 is G-Lipschitz and L-smooth for appropriately selected parameters G, L that are polynomial in the dimension d and in the discretization parameter N. We use this property to establish that f_{C_l} is a valid input to the promise problem GDAFixedPoint.

Lemma 7.5.
The function f_{C_l} of Definition 7.4 is O(d/δ)-Lipschitz and O(d/δ)-smooth.

Proof. Taking the derivatives with respect to x_i and y_i, and using property (B) of the coefficients P_v, we get the following relations:

∂f_{C_l}(x, y)/∂x_i = Σ_{j=1}^d (x_j − y_j) · ∂α_j(x)/∂x_i + α_i(x)  and  ∂f_{C_l}(x, y)/∂y_i = −α_i(x),

where α_i(x) = − Σ_{v ∈ ([N]−1)^d} P_v(x) · C^i_l(v) and ∂α_j(x)/∂x_i = − Σ_{v ∈ ([N]−1)^d} (∂P_v(x)/∂x_i) · C^j_l(v).

Now, by property (C) of Definition 7.3, there are at most d + 1 vertices v of R_c(x) with ∇P_v(x) ≠ 0. If we also use property (A), we get |∂α_j(x)/∂x_i| ≤ Θ(d/δ), and using property (B) we get |α_i(x)| ≤ 1. Thus |∂f_{C_l}(x, y)/∂x_i| ≤ Θ(d/δ) and |∂f_{C_l}(x, y)/∂y_i| ≤ 1. Therefore we can conclude that ‖∇f_{C_l}(x, y)‖ ≤ Θ(d/δ), which proves that the function f_{C_l} is Lipschitz continuous with Lipschitz constant Θ(d/δ).

To prove the smoothness of f_{C_l}, we use property (B) of Definition 7.3 and we have

∂²f_{C_l}(x, y)/∂x_i∂x_ℓ = Σ_{j=1}^d (x_j − y_j) · ∂²α_j(x)/∂x_i∂x_ℓ + ∂α_ℓ(x)/∂x_i + ∂α_i(x)/∂x_ℓ,
∂²f_{C_l}(x, y)/∂x_i∂y_ℓ = −∂α_ℓ(x)/∂x_i,  and  ∂²f_{C_l}(x, y)/∂y_i∂y_ℓ = 0,

where ∂²α_j(x)/∂x_i∂x_ℓ = − Σ_{v ∈ ([N]−1)^d} (∂²P_v(x)/∂x_i∂x_ℓ) · C^j_l(v). Again using property (C) of Definition 7.3, we get that there are at most d + 1 vertices v of R_c(x) such that ∇²P_v(x) ≠ 0. This, together with property (A) of Definition 7.3, leads to the fact that |∂²α_j(x)/∂x_i∂x_ℓ| ≤ Θ(d/δ). Using the latter together with the bounds that we obtained for |∂α_j(x)/∂x_i| at the beginning of the proof, we get that ‖∇²f_{C_l}(x, y)‖_F ≤ Θ(d/δ), where ‖·‖_F denotes the Frobenius norm. Since the bound on the Frobenius norm is also a bound on the spectral norm, the function f_{C_l} is Θ(d/δ)-smooth.

We start with a description of the reduction from High
D-BiSperner to GDAFixedPoint. Suppose we have an instance of HighD-BiSperner given by a boolean circuit C_l : ([N] − 1)^d → {−1, 1}^d; we construct an instance of GDAFixedPoint according to the following set of rules.

(⋆) Construction of Instance for Fixed Points of Gradient Descent/Ascent.

• The payoff function is the real-valued function f_{C_l}(x, y) from Definition 7.4.
• The domain is the polytope P(A, b) that we described in Section 3. The matrix A and the vector b are computed so that the following inequalities hold:

x_i − y_i ≤ Δ,  y_i − x_i ≤ Δ  for all i ∈ [d],  (7.1)

where Δ = t·δ/d, with t ∈ ℝ₊ a constant such that |∂P_v(x)/∂x_i| · (δ·t/d) ≤ 1/2 for all v ∈ ([N] − 1)^d and x ∈ [0, 1]^d. The fact that such a constant t exists follows from property (A) of the smooth and efficient interpolation coefficients.
• The parameter α is set equal to Δ/3.
• The parameters G and L are set equal to the upper bounds on the Lipschitzness and the smoothness of f_{C_l}, respectively, that we derived in Lemma 7.5; namely, G = O(d/δ) and L = O(d/δ).

The first thing to observe is that the afore-described reduction is polynomial-time. To see this, observe that all of α, G, L, A, and b have representations that are polynomial in d, even if we use unary instead of binary representation. So the only thing that remains is the existence of a Turing machine C_{f_{C_l}} that computes the value and the gradient of f_{C_l} in time polynomial in the size of C_l and the requested accuracy. To prove this we need a detailed description of the SEIC coefficients, and for this reason we postpone the proof to Appendix D. Here we formally state the result that we prove in Appendix D, which, together with the discussion above, establishes that our reduction is indeed polynomial-time.

Theorem 7.6.
Let C_l : ([N] − 1)^d → {−1, 1}^d be a binary circuit that is an input to the HighD-BiSperner problem. Then there exists a polynomial-time Turing machine C_{f_{C_l}}, which can be constructed in polynomial time from the circuit C_l, such that, for all vectors x, y ∈ [0, 1]^d and accuracy ε > 0, C_{f_{C_l}} computes both z ∈ ℝ and w ∈ ℝ^{2d} such that

|z − f_{C_l}(x, y)| ≤ ε,  ‖w − ∇f_{C_l}(x, y)‖ ≤ ε.

Moreover, the running time of C_{f_{C_l}} is polynomial in the binary representation of x, y, and log(1/ε).

We also observe that, according to Lemma 7.5, the function f_{C_l} is both G-Lipschitz and L-smooth, and hence the output of our reduction is a valid input for the constructed instance of the promise problem GDAFixedPoint. The next step is to prove that the vector x⋆ of every solution (x⋆, y⋆) of GDAFixedPoint with the input described above lies in a cubelet that is either panchromatic according to C_l or contains a violation of the rules for proper coloring of the HighD-BiSperner problem.

Lemma 7.7.
Let C_l be an input to the HighD-BiSperner problem, let f_{C_l} be the corresponding G-Lipschitz and L-smooth function defined in Definition 7.4, and let P(A, b) be the polytope defined by (7.1). If (x⋆, y⋆) is any solution to the GDAFixedPoint problem with input α, G, L, C_{f_{C_l}}, A, and b, defined in (⋆), then the following statements hold, where we remind the reader that Δ = t·δ/d.

• If x⋆_i ∈ (α, 1 − α) and x⋆_i ∈ (y⋆_i − Δ + α, y⋆_i + Δ − α), then |∂f_{C_l}(x⋆, y⋆)/∂x_i| ≤ α.
• If x⋆_i ≤ α or x⋆_i ≤ y⋆_i − Δ + α, then ∂f_{C_l}(x⋆, y⋆)/∂x_i ≥ −α.
• If x⋆_i ≥ 1 − α or x⋆_i ≥ y⋆_i + Δ − α, then ∂f_{C_l}(x⋆, y⋆)/∂x_i ≤ α.

The symmetric statements for y⋆_i hold.

• If y⋆_i ∈ (α, 1 − α) and y⋆_i ∈ (x⋆_i − Δ + α, x⋆_i + Δ − α), then |∂f_{C_l}(x⋆, y⋆)/∂y_i| ≤ α.
• If y⋆_i ≤ α or y⋆_i ≤ x⋆_i − Δ + α, then ∂f_{C_l}(x⋆, y⋆)/∂y_i ≤ α.
• If y⋆_i ≥ 1 − α or y⋆_i ≥ x⋆_i + Δ − α, then ∂f_{C_l}(x⋆, y⋆)/∂y_i ≥ −α.

Proof. The proof of this lemma is identical to the proof of Lemma 6.7, and for this reason we skip the details here.

Lemma 7.8.
Let C_l be an input to the HighD-BiSperner problem, let f_{C_l} be the corresponding G-Lipschitz and L-smooth function defined in Definition 7.4, and let P(A, b) be the polytope defined by (7.1). If (x⋆, y⋆) is any solution to the GDAFixedPoint problem with input α, G, L, C_{f_{C_l}}, A, and b, defined in (⋆), then none of the following statements hold for the cubelet R(x⋆).

1. x⋆_i ≥ 1/(N−1) and, for all v ∈ R⁺(x⋆), it holds that C^i_l(v) = −1.
2. x⋆_i ≤ 1 − 1/(N−1) and, for all v ∈ R⁺(x⋆), it holds that C^i_l(v) = +1.

Proof. We prove that there is no solution (x⋆, y⋆) of GDAFixedPoint that satisfies statement 1.; the fact that (x⋆, y⋆) cannot satisfy statement 2. follows similarly. It is convenient for us to define

x̂ = x⋆ − ∇_x f_{C_l}(x⋆, y⋆),  K(y⋆) = { x | (x, y⋆) ∈ P(A, b) },  z = Π_{K(y⋆)} x̂, and
ŷ = y⋆ + ∇_y f_{C_l}(x⋆, y⋆),  K(x⋆) = { y | (x⋆, y) ∈ P(A, b) },  w = Π_{K(x⋆)} ŷ.

For the sake of contradiction, we assume that there exists a solution (x⋆, y⋆) such that x⋆_i ≥ 1/(N−1) and, for all v ∈ R⁺(x⋆), it holds that C^i_l(v) = −1. Using this fact, we will prove that (1) ∂f_{C_l}(x⋆, y⋆)/∂x_i ≥ 1/2 and (2) ∂f_{C_l}(x⋆, y⋆)/∂y_i = −1. Let R(x⋆) = [ c₁/(N−1), (c₁+1)/(N−1) ] × ··· × [ c_d/(N−1), (c_d+1)/(N−1) ]; then, since all the vertices v ∈ R⁺(x⋆) have C^i_l(v) = −1, from Definition 7.4 we have that

f_{C_l}(x⋆, y⋆) = (x⋆_i − y⋆_i) + Σ_{j=1, j≠i}^d (x⋆_j − y⋆_j) · α_j(x⋆).

If we differentiate this with respect to y_i, we immediately get that ∂f_{C_l}(x⋆, y⋆)/∂y_i = −1. On the other hand, if we differentiate with respect to x_i, we get

∂f_{C_l}(x⋆, y⋆)/∂x_i = 1 + Σ_{j≠i} (x⋆_j − y⋆_j) · ∂α_j(x⋆)/∂x_i ≥ 1 − Σ_{j≠i} |x⋆_j − y⋆_j| · |∂α_j(x⋆)/∂x_i| ≥ 1 − Δ · d · Θ(d/δ) ≥ 1/2,

where we used (1) that |∂α_j(x)/∂x_ℓ| ≤ Θ(d/δ), which is proved in the proof of Lemma 7.5, (2) that |x_j − y_j| ≤ Δ, and (3) the definition of Δ. Now it is easy to see that the only way to satisfy both ∂f_{C_l}(x⋆, y⋆)/∂x_i ≥ 1/2 and |z_i − x⋆_i| ≤ α is that either x⋆_i ≤ α or x⋆_i ≤ y⋆_i − Δ + α. The first case is excluded by the assumption of the first statement of our lemma and our choice of α = Δ/3 < 1/(N−1); thus it holds that x⋆_i ≤ y⋆_i − Δ + α. But then we can use the third case for the y variables of Lemma 7.7, and we get that ∂f_{C_l}(x⋆, y⋆)/∂y_i ≥ −α, which cannot be true since we proved that ∂f_{C_l}(x⋆, y⋆)/∂y_i = −1. Therefore we have a contradiction and the first statement of the lemma holds. Using the same reasoning we prove the second statement too.

We are now ready to complete the proof that our reduction from HighD-BiSperner to GDAFixedPoint is correct, and hence to prove Theorem 4.4.

Proof of Theorem 4.4. Let (x⋆, y⋆) be a solution to the GDAFixedPoint problem with input a Turing machine that represents the function f_{C_l}, α = Δ/3, where Δ = t·δ/d, G = Θ(d/δ), L = Θ(d/δ), and A, b as described in (⋆). For each coordinate i, there exist the following three mutually exclusive cases.

• 1/(N−1) ≤ x⋆_i ≤ 1 − 1/(N−1): Since |R⁺(x⋆)| ≥ 1, it follows directly from Lemma 7.8 that there exists v ∈ R⁺(x⋆) such that C^i_l(v) = +1 and there exists v′ ∈ R⁺(x⋆) such that C^i_l(v′) = −1.

• 0 ≤ x⋆_i < 1/(N−1): By Lemma 7.8, there exists v ∈ R⁺(x⋆) with C^i_l(v) = −1. If some vertex of R⁺(x⋆) has ith color +1, we obtain the same pair of witnesses as in the first case. Otherwise, C^i_l(v) = −1 for all v ∈ R⁺(x⋆); by property (D) of the SEIC coefficients, there exists v ∈ R⁺(x⋆) with v_i = 0, and this vertex is hence a solution of type 2. for the HighD-BiSperner problem.

• 1 − 1/(N−1) < x⋆_i ≤ 1: Symmetrically, either we obtain the same pair of witnesses as in the first case, or C^i_l(v) = +1 for all v ∈ R⁺(x⋆); by property (D), there exists v ∈ R⁺(x⋆) with v_i = N − 1, and this vertex is hence a solution of type 3. for the HighD-BiSperner problem.

Since R⁺(x⋆) is computable in polynomial time given x⋆, we can easily check, for every i ∈ [d], which of the above cases holds. If for some i ∈ [d] the 2nd or the 3rd case yields a violating vertex, then that vertex gives a solution to the HighD-BiSperner problem and our reduction is correct. Hence we may assume that, for every i ∈ [d], we obtain vertices v, v′ ∈ R⁺(x⋆) ⊆ R_c(x⋆) with C^i_l(v) = +1 and C^i_l(v′) = −1. This implies that the cubelet R(x⋆) is panchromatic and therefore yields a solution to the problem HighD-BiSperner. Finally, we observe that the function that we define has range [−d, d], and hence Theorem 4.4 follows using Theorem 5.1.

In this section we describe the construction of the smooth and efficient interpolation coefficients (SEIC) that we introduced in Section 7.2.1. After the description of the construction, we present the statements of the lemmas that prove the properties (A) – (D) of Definition 7.3, and we refer to Appendix C for their proofs. We first remind the reader of the definition of the SEIC coefficients.
Definition 7.3 (Smooth and Efficient Interpolation Coefficients). For every N ∈ ℕ we define the set of smooth and efficient interpolation coefficients (SEIC) as the family of functions, called coefficients,

I_{d,N} = { P_v : [0,1]^d → ℝ | v ∈ ([N] − 1)^d }

with the following properties.

(A) For all vertices v ∈ ([N] − 1)^d, the coefficient P_v(x) is a twice-differentiable function and satisfies
  ▸ |∂P_v(x)/∂x_i| ≤ Θ(d/δ),
  ▸ |∂²P_v(x)/(∂x_i ∂x_ℓ)| ≤ Θ(d²/δ²).

(B) For all v ∈ ([N] − 1)^d, it holds that P_v(x) ≥ 0 and Σ_{v ∈ ([N]−1)^d} P_v(x) = Σ_{v ∈ R_c(x)} P_v(x) = 1.

(C) For all x ∈ [0,1]^d, it holds that all but d + 1 coefficients P_v ∈ I_{d,N} satisfy P_v(x) = ∇P_v(x) = ∇²P_v(x) = 0. We denote this set of d + 1 remaining vertices by R⁺(x). Furthermore, it holds that R⁺(x) ⊆ R_c(x) and given x we can compute the set R⁺(x) in time poly(d).

(D) For all x ∈ [0,1]^d, if x_i ≤ 1/(N−1) for some i ∈ [d] then there exists v ∈ R⁺(x) such that v_i = 0. Respectively, if x_i ≥ 1 − 1/(N−1) then there exists v ∈ R⁺(x) such that v_i = N − 1.

Theorem 8.1. For every d ∈ ℕ and every N = poly(d) there exists a family of functions I_{d,N} that satisfies the properties (A) - (D) of Definition 7.3.

One important component of the construction of the SEIC coefficients is the smooth step functions, which we introduce in Section 8.1. These functions also provide a toy example of smooth and efficient interpolation coefficients in 1 dimension. Then in Section 8.2 we present the construction of the SEIC coefficients in multiple dimensions, and in Section 8.3 we state the main lemmas that lead to the proof of Theorem 8.1.
8.1 Smooth Step Functions

Smooth step functions are real-valued functions g : ℝ → ℝ of a single real variable with the following properties.

Step Value. For every x ≤ 0 it holds that g(x) = 0, for every x ≥ 1 it holds that g(x) = 1, and for every x ∈ [0,1] it holds that g(x) ∈ [0,1].

Smoothness. For some k ∈ ℕ it holds that g is k times continuously differentiable and its kth derivative satisfies g^{(k)}(0) = g^{(k)}(1) = 0.

The maximum k such that the smoothness property from above holds characterizes the order of smoothness of the smooth step function g. In Section 6 we have already defined and used the smooth step function of order 1. For the construction of the SEIC coefficients we use the smooth step function of order 2 and the smooth step function of order ∞, defined as follows. Definition 8.2.
We define the smooth step function S : ℝ → ℝ of order 2 as the following function:

S(x) = 6x⁵ − 15x⁴ + 10x³ for x ∈ (0,1), S(x) = 0 for x ≤ 0, and S(x) = 1 for x ≥ 1.

We also define the smooth step function S_∞ : ℝ → ℝ of order ∞ as the following function:

S_∞(x) = 2^{−1/x} / (2^{−1/x} + 2^{−1/(1−x)}) for x ∈ (0,1), S_∞(x) = 0 for x ≤ 0, and S_∞(x) = 1 for x ≥ 1.

In the rest of the paper we use S instead of S₂ for the smooth step function of order 2, for simplicity of the exposition of the paper. We present a plot of these step functions in Figure 7, and we summarize some of their properties in Lemma 8.3. A more detailed lemma with additional properties of S_∞ that are useful for the proof of Theorem 8.1 is presented in Lemma C.5 in Appendix C. Lemma 8.3.
Let S and S_∞ be the smooth step functions defined in Definition 8.2. It holds that both S and S_∞ are monotone increasing functions, and that S(0) = 0, S(1) = 1, and also S′(0) = S′(1) = S″(0) = S″(1) = 0. It also holds that S_∞(0) = 0, S_∞(1) = 1, and also S_∞^{(k)}(0) = S_∞^{(k)}(1) = 0 for every k ∈ ℕ. Additionally it holds for every x that |S′(x)| ≤ 2 and |S″(x)| ≤ 6, whereas |S′_∞(x)| ≤ 16 and |S″_∞(x)| ≤ 32.
Figure 7: (a) The smooth step function S of order 2 and the smooth step function S_∞ of order ∞. As we can see, both S and S_∞ are continuous and continuously differentiable functions, but S_∞ is much more flat around 0 and 1, since it has all its derivatives equal to 0 both at the point 0 and at the point 1. This makes the S_∞ function infinitely many times differentiable. (b) The constructed function P₃ of the family of SEIC coefficients for the single dimensional case with N = 5. For details we refer to Example 8.4.
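To make the two step functions concrete, here is a small sketch that implements them as given in Definition 8.2 (the closed forms and the function names are our reading of the definition) and checks the step-value, symmetry, and monotonicity properties used below.

```python
def S(x: float) -> float:
    """Smooth step function of order 2: 6x^5 - 15x^4 + 10x^3 on (0, 1)."""
    if x <= 0.0:
        return 0.0
    if x >= 1.0:
        return 1.0
    return 6 * x**5 - 15 * x**4 + 10 * x**3


def S_inf(x: float) -> float:
    """Smooth step function of order infinity: 2^(-1/x) / (2^(-1/x) + 2^(-1/(1-x)))."""
    if x <= 0.0:
        return 0.0
    if x >= 1.0:
        return 1.0
    a = 2.0 ** (-1.0 / x)
    b = 2.0 ** (-1.0 / (1.0 - x))
    return a / (a + b)


# Step values at the endpoints and at the midpoint.
assert S(0) == 0 and S(1) == 1 and abs(S(0.5) - 0.5) < 1e-12
assert S_inf(0) == 0 and S_inf(1) == 1 and abs(S_inf(0.5) - 0.5) < 1e-12

# The symmetry 1 - S_inf(x) = S_inf(1 - x), used for property (B) in Example 8.4.
for x in (0.125, 0.25, 0.7):
    assert abs((1 - S_inf(x)) - S_inf(1 - x)) < 1e-12

# Monotonicity on a grid.
grid = [i / 100 for i in range(101)]
assert all(S(a) <= S(b) and S_inf(a) <= S_inf(b) for a, b in zip(grid, grid[1:]))
```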
Proof. For the function S we compute S′(x) = 30x⁴ − 60x³ + 30x² for x ∈ [0,1] and S′(x) = 0 for x ∉ [0,1]. Therefore we can easily get that |S′(x)| ≤ 2 for every x ∈ ℝ. We also have that S″(x) = 120x³ − 180x² + 60x for x ∈ (0,1) and S″(x) = 0 for x ∉ [0,1], hence we can conclude that |S″(x)| ≤ 6.

The calculations for S_∞ are more complicated. We have that

S′_∞(x) = ln(2) · 2^{−1/(x(1−x))} · (1 − 2x(1−x)) / ( (2^{−1/x} + 2^{−1/(1−x)})² · (1−x)² · x² ).

We set h(x) ≜ (2^{−1/x} + 2^{−1/(1−x)})² · (1−x)² · x² for x ∈ [0,1], and doing simple calculations we get that for x ≤ 1/2 it holds that h(x) ≥ 2^{−2/(1−x)} · (1−x)² · x², since 2^{−1/x} + 2^{−1/(1−x)} ≥ 2^{−1/(1−x)}. Combining the above, for x ≤ 1/2 we get |S′_∞(x)| ≤ ln(2) · 2^{−(1−2x)/(x(1−x))} / ((1−x)² x²), which can be easily upper bounded by 16 via elementary calculus; the case x ≥ 1/2 follows by the symmetry S_∞(x) = 1 − S_∞(1−x). Using a similar argument we can prove that |S″_∞(x)| ≤ 32. For all the derivatives of S_∞ we can inductively prove that

S_∞^{(k)}(x) = Σ_{i=0}^{k−1} h_i(x) · S_∞^{(i)}(x),

where the functions h_i(x) are bounded. Then the fact that all the derivatives of S_∞ vanish at 0 and at 1 follows by a simple inductive argument.

Example 8.4. Using the smooth step functions that we described above we can get a construction of SEIC coefficients for the single dimensional case. Unfortunately the extension to multiple dimensions is substantially harder and invokes new ideas that we explore later in this section. For the single dimensional problem of this example we have the interval [0,1] divided into N subintervals, and our goal is to design N + 1 functions P₀, …, P_N that satisfy the properties (A) - (D) of Definition 7.3. A simple construction of such functions is the following:

P_i(x) = S_∞(N · x − (i − 1)) if x ≤ i/N, and P_i(x) = S_∞(i + 1 − N · x) if x > i/N.

Based on Lemma 8.3 it is not hard then to see that P_i is twice differentiable and it has bounded first and second derivatives, hence it satisfies property (A) of Definition 7.3. Using the fact that 1 − S_∞(x) = S_∞(1 − x) we can also prove property (B). Finally, properties (C) and (D) can be proved via the definition of the coefficient P_i from above. In Figure 7 we can see the plot of P₃ for N = 5. We leave the exact proofs of this example as an exercise for the reader.
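The one-dimensional construction of Example 8.4 can also be checked numerically. The sketch below implements the formula for P_i exactly as written above (our reconstruction of the example, with peaks at the grid points i/N) and verifies the two claims the example makes: the coefficients are non-negative and sum to 1 everywhere (property (B)), and at most 2 = d + 1 of them are non-zero at any point (property (C)).

```python
def S_inf(x: float) -> float:
    # Smooth step function of order infinity (Definition 8.2).
    if x <= 0.0:
        return 0.0
    if x >= 1.0:
        return 1.0
    a = 2.0 ** (-1.0 / x)
    return a / (a + 2.0 ** (-1.0 / (1.0 - x)))


def P(i: int, x: float, N: int) -> float:
    # Coefficient P_i of Example 8.4, peaked at the grid point i / N.
    if x <= i / N:
        return S_inf(N * x - (i - 1))
    return S_inf(i + 1 - N * x)


N = 5
for k in range(1001):
    x = k / 1000
    values = [P(i, x, N) for i in range(N + 1)]
    assert all(v >= 0 for v in values)
    assert sum(1 for v in values if v > 0) <= 2   # property (C): d + 1 = 2 in one dimension
    assert abs(sum(values) - 1) < 1e-9            # property (B): partition of unity
```

The partition-of-unity check is exactly the identity 1 − S_∞(x) = S_∞(1 − x) at work: between two consecutive grid points only the two neighboring coefficients are non-zero, and they are complements of each other.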
8.2 Construction of the SEIC Coefficients in Multiple Dimensions

The goal of this section is to present the construction of the family I_{d,N} of smooth and efficient interpolation coefficients for every number of dimensions d and any discretization parameter N. Before diving into the details of our construction, observe that even the 2-dimensional case is non-trivial. A first attempt would be to divide [0,1]² into two triangles along the diagonal of [0,1]², and then, using any soft-max function that is twice continuously differentiable, to define a convex combination at every triangle. Unfortunately this approach cannot work, since the resulting coefficients have discontinuous gradients along the diagonal of [0,1]². We leave the precise calculations of this example as an exercise to the reader.

We start with some definitions about the orientation and the representation of the cubelets of the grid ([N] − 1)^d. Then we proceed with the definition of the Q_v functions in Definition 8.7. Finally, using Q_v, we proceed with the construction of the SEIC coefficients.

Definition 8.5 (Source and Target of Cubelets). Each cubelet [c₁/(N−1), (c₁+1)/(N−1)] × ⋯ × [c_d/(N−1), (c_d+1)/(N−1)], where c ∈ ([N−1] − 1)^d, admits a source vertex s_c = (s₁, …, s_d) ∈ ([N] − 1)^d and a target vertex t_c = (t₁, …, t_d) ∈ ([N] − 1)^d defined as follows:

s_j = c_j + 1 if c_j is odd, and s_j = c_j if c_j is even;  t_j = c_j if c_j is odd, and t_j = c_j + 1 if c_j is even.

Notice that the source s_c and the target t_c are vertices of the cubelet whose down-left corner is c.

Definition 8.6 (Canonical Representation). Let x ∈ [0,1]^d and R(x) = [c₁/(N−1), (c₁+1)/(N−1)] × ⋯ × [c_d/(N−1), (c_d+1)/(N−1)], where c ∈ ([N−1] − 1)^d. The canonical representation of x under the cubelet with down-left corner c, denoted by p_x^c = (p₁, …, p_d), is defined as follows:

p_j = ((N−1) · x_j − s_j) / (t_j − s_j),

where t_c = (t₁, …, t_d) and s_c = (s₁, …, s_d) are respectively the target and the source of R(x).

Definition 8.7 (Defining the functions Q_v(x)). Let x ∈ [0,1]^d lying in the cubelet

R(x) = [c₁/(N−1), (c₁+1)/(N−1)] × ⋯ × [c_d/(N−1), (c_d+1)/(N−1)],

with corners R_c(x) = {c₁, c₁+1} × ⋯ × {c_d, c_d+1}, where c ∈ ([N−1] − 1)^d. Let also s_c = (s₁, …, s_d) be the source vertex of R(x) and p_x^c = (p₁, …, p_d) be the canonical representation of x. Then for each vertex v ∈ R_c(x) we define the following partition of the set of coordinates [d]:

A_v^c = { j : |v_j − s_j| = 0 } and B_v^c = { j : |v_j − s_j| = 1 }.

If there exist j ∈ A_v^c and ℓ ∈ B_v^c such that p_j ≥ p_ℓ, then Q_v^c(x) = 0. Otherwise we define

Q_v^c(x) = ∏_{j ∈ A_v^c} ∏_{ℓ ∈ B_v^c} S_∞(S(p_ℓ) − S(p_j)) if A_v^c, B_v^c ≠ ∅;
Q_v^c(x) = ∏_{ℓ=1}^d S_∞(1 − S(p_ℓ)) if B_v^c = ∅;
Q_v^c(x) = ∏_{j=1}^d S_∞(S(p_j)) if A_v^c = ∅,

where S_∞(x) and S(x) are the smooth step functions defined in Definition 8.2.

To provide a better understanding of Definitions 8.5, 8.6, and 8.7 we present the following 3-dimensional example.

Example 8.8. We consider a case where d = 3 and N = 4. Let x be a point lying in the cubelet R(x) = [1/3, 2/3] × [2/3, 1] × [0, 1/3], and let c = (1, 2, 0). Then the source of R(x) is s_c = (2, 2, 0) and the target is t_c = (1, 3, 1) (Definition 8.5). Let p_x^c = (p₁, p₂, p₃) be the canonical representation of x (Definition 8.6), and assume that p₁ > p₂ > p₃. The only vertices with non-zero coefficients Q_v^c(x) are those belonging in the set R⁺(x) = { (1, 3, 1), (1, 3, 0), (1, 2, 0), (2, 2, 0) }, and by Definition 8.7 we have that

▷ Q_{(1,3,1)}^c(x) = S_∞(S(p₁)) · S_∞(S(p₂)) · S_∞(S(p₃)),
▷ Q_{(1,3,0)}^c(x) = S_∞(S(p₁) − S(p₃)) · S_∞(S(p₂) − S(p₃)),
▷ Q_{(1,2,0)}^c(x) = S_∞(S(p₁) − S(p₂)) · S_∞(S(p₁) − S(p₃)),
▷ Q_{(2,2,0)}^c(x) = S_∞(1 − S(p₁)) · S_∞(1 − S(p₂)) · S_∞(1 − S(p₃)).

Now based on Definitions 8.5, 8.6, and 8.7 we are ready to present the construction of the smooth and efficient interpolation coefficients.

Definition 8.9 (Construction of SEIC Coefficients). Let x ∈ [
0,1]^d lying in the cubelet R(x) = [c₁/(N−1), (c₁+1)/(N−1)] × ⋯ × [c_d/(N−1), (c_d+1)/(N−1)]. Then for each vertex v ∈ ([N] − 1)^d the coefficient P_v(x) is defined as follows:

P_v(x) = Q_v^c(x) / Σ_{v′ ∈ R_c(x)} Q_{v′}^c(x) if v ∈ R_c(x), and P_v(x) = 0 if v ∉ R_c(x),

where the functions Q_v^c(x) ≥ 0 are defined in Definition 8.7 for every v ∈ R_c(x). We note that in the expressions above ∏ denotes the product symbol and should not be confused with the projection operator used in the previous sections.

8.3 Sketch of the Proof of Theorem 8.1

First it is necessary to argue that P_v(x) is a continuous function, since it could be the case that Q_v^c(x) / Σ_{v′ ∈ R_c(x)} Q_{v′}^c(x) ≠ Q_v^{c′}(x) / Σ_{v′ ∈ R_{c′}(x)} Q_{v′}^{c′}(x) for some point x that lies on the boundary of two adjacent cubelets with down-left corners c and c′ respectively. We specifically design the coefficients Q_v^c(x) so that the latter does not occur, and this is the main reason that the definition of the function Q_v^c(x) is slightly complicated. For this reason we prove the following lemma.

Lemma 8.10.
For any vertex v ∈ ([N] − 1)^d, P_v(x) is a continuous and twice differentiable function, and for any v ∉ R_c(x) it holds that P_v(x) = ∇P_v(x) = ∇²P_v(x) = 0. Moreover, for every x ∈ [0,1]^d the set R⁺(x) of vertices v ∈ ([N] − 1)^d such that P_v(x) > 0 satisfies |R⁺(x)| ≤ d + 1.

Based on Lemma 8.10 and the expression of P_v we can prove that the P_v coefficients defined in Definition 8.9 satisfy the properties (B) and (C) of Definition 7.3. To prove the properties (A) and (D) we also need the following two lemmas.

Lemma 8.11. For any vertex v ∈ ([N] − 1)^d, it holds that
1. |∂P_v(x)/∂x_i| ≤ Θ(d/δ),
2. |∂²P_v(x)/(∂x_i ∂x_j)| ≤ Θ(d²/δ²).

Lemma 8.12. Let a point x ∈ [0,1]^d and let R⁺(x) be the set of vertices with P_v(x) > 0. Then we have that:
1. If 0 ≤ x_i < 1/(N−1) then there always exists a vertex v ∈ R⁺(x) such that v_i = 0.
2. If 1 − 1/(N−1) < x_i ≤ 1 then there always exists a vertex v ∈ R⁺(x) such that v_i = N − 1.

The proofs of Lemmas 8.10, 8.11, and 8.12 can be found in Appendix C. Based on Lemmas 8.10, 8.11, and 8.12 we are now ready to prove Theorem 8.1.

Proof of Theorem 8.1. The fact that the coefficients P_v satisfy property (A) follows directly from Lemma 8.11. Property (B) follows directly from the definition of P_v in Definition 8.9 and the simple fact that Q_v^c(x) ≥ 0. Property (C) follows from the second part of Lemma 8.10. Finally, property (D) follows directly from Lemma 8.12.
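As a sanity check of Definitions 8.5-8.9 as stated above, the following sketch computes the non-zero coefficients P_v at random points (the variable names and the test points are ours) and verifies the partition-of-unity part of property (B) together with the bound |R⁺(x)| ≤ d + 1 from Lemma 8.10.

```python
import itertools
import random


def S(p):
    # Smooth step function of order 2 (Definition 8.2).
    return 0.0 if p <= 0 else 1.0 if p >= 1 else 6 * p**5 - 15 * p**4 + 10 * p**3


def S_inf(p):
    # Smooth step function of order infinity (Definition 8.2).
    if p <= 0:
        return 0.0
    if p >= 1:
        return 1.0
    a = 2.0 ** (-1.0 / p)
    return a / (a + 2.0 ** (-1.0 / (1.0 - p)))


def coefficients(x, N):
    """Return {v: P_v(x)} for the non-zero SEIC coefficients at x (Definitions 8.5-8.9)."""
    d = len(x)
    c = [min(int(xj * (N - 1)), N - 2) for xj in x]              # down-left corner of R(x)
    s = [cj + 1 if cj % 2 == 1 else cj for cj in c]              # source vertex (Definition 8.5)
    t = [cj if cj % 2 == 1 else cj + 1 for cj in c]              # target vertex (Definition 8.5)
    p = [((N - 1) * x[j] - s[j]) / (t[j] - s[j]) for j in range(d)]  # canonical representation

    Q = {}
    for v in itertools.product(*[(cj, cj + 1) for cj in c]):     # corners R_c(x)
        A = [j for j in range(d) if v[j] == s[j]]
        B = [j for j in range(d) if v[j] != s[j]]
        if any(p[j] >= p[l] for j in A for l in B):
            continue                                             # Q_v(x) = 0 off the monotone path
        q = 1.0
        if not B:
            for j in range(d):
                q *= S_inf(1 - S(p[j]))                          # v is the source
        elif not A:
            for j in range(d):
                q *= S_inf(S(p[j]))                              # v is the target
        else:
            for j in A:
                for l in B:
                    q *= S_inf(S(p[l]) - S(p[j]))
        if q > 0:
            Q[v] = q
    total = sum(Q.values())
    return {v: q / total for v, q in Q.items()}                  # normalization (Definition 8.9)


random.seed(0)
d, N = 3, 4
for _ in range(100):
    x = [random.random() for _ in range(d)]
    P = coefficients(x, N)
    assert len(P) <= d + 1                    # Lemma 8.10: |R+(x)| <= d + 1
    assert all(q > 0 for q in P.values())
    assert abs(sum(P.values()) - 1) < 1e-9    # property (B): partition of unity
```

The design choice that the sketch makes visible is that a vertex survives only if it lies on the monotone staircase path from the source to the target determined by sorting the canonical coordinates, which is what caps the support at d + 1 vertices.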
9 Black-Box Lower Bounds

In this section our goal is to prove Theorem 4.5 based on Theorem 4.4, which we proved in Section 7, and the known black-box lower bounds for PPAD by [HPV89]. In this section we assume that all the real number operations are performed with infinite precision.

Theorem 9.1 ([HPV89]). Assume that there exists an algorithm A that has black-box oracle access to the value of a function M : [0,1]^d → [0,1]^d and outputs w* ∈ [0,1]^d. There exists a universal constant c > 0 such that if M is O(1)-Lipschitz and ‖M(w*) − w*‖ ≤ c, then A has to make at least 2^d different oracle calls to the function value of M.

It is easy to observe that the reduction in the proof of Theorem 7.2 is a black-box reduction, and every evaluation of the constructed circuit C_l only requires one evaluation of the input function M. Therefore the proof of Theorem 7.2 together with Theorem 9.1 imply the following corollary.

Figure 8: Pictorial representation of the way the black-box lower bound follows from the white-box PPAD-completeness that we presented in Section 7 and the known black-box lower bounds for the Brouwer problem by [HPV89]. In the figure we can see the four dimensional case of Section 6 that corresponds to the 2D-BiSperner and the 2-dimensional Brouwer. As we can see, in that case 1 query to O_f can be implemented with 3 queries to 2D-BiSperner, and each of these can be implemented with 1 query to 2-dimensional Brouwer. In the high dimensional setting of Section 7, every query (x, y) to the oracle O_f to return the values f(x, y) and ∇f(x, y) can be implemented via d + 1 queries to a HighD-BiSperner instance. Each of these queries to HighD-BiSperner can be implemented via 1 query to a Brouwer instance. Therefore a 2^d query lower bound for Brouwer implies a 2^d query lower bound for HighD-BiSperner, which in turn implies a 2^d/(d + 1) query lower bound for our GDAFixedPoint and LR-LocalMinMax problems.

Corollary 9.2 (Black-Box Lower Bound for Bi-Sperner). Let C_l : ([N] − 1)^d → {−1, 1}^d be an instance of the HighD-BiSperner problem with N = O(d). Then any algorithm that has black-box oracle access to C_l and outputs a solution to the corresponding HighD-BiSperner problem needs 2^d different oracle calls to the value of C_l.

Based on Corollary 9.2 and the reduction that we presented in Section 7, we are now ready to prove Theorem 4.5.
Proof of Theorem 4.5.
This proof follows the steps of Figure 8. The last part of that figure is established in Corollary 9.2. So what is left to prove Theorem 4.5 is that for every instance of HighD-BiSperner we can construct a function f such that the oracle O_f can be implemented via d + 1 queries to the HighD-BiSperner instance, and also that every solution of GDAFixedPoint with oracle access O_f to f and ∇f reveals one solution of the starting HighD-BiSperner instance.

To construct this oracle O_f we follow exactly the reduction that we described in Section 7. The correctness of the reduction that we provided in Section 7 suffices to prove that every solution of GDAFixedPoint with oracle access O_f to f and ∇f gives a solution to the initial HighD-BiSperner instance. So the only thing that remains is to bound the number of queries to the HighD-BiSperner instance that we need in order to implement the oracle O_f. To do this, consider the following definition of f based on an instance C_l of HighD-BiSperner from Definition 7.4, with a scaling factor to make sure that the range of the function is [−1, 1]:

f_{C_l}(x, y) = (1/d) · Σ_{j=1}^d (x_j − y_j) · α_j(x), where α_j(x) = − Σ_{v ∈ ([N]−1)^d} P_v(x) · C_l^j(v),

and P_v are the coefficients defined in Definition 7.3. From property (C) of the coefficients P_v we have that to evaluate α_j(x) we only need the values C_l^j(v) for the d + 1 vertices v ∈ R⁺(x), and the same coefficients are needed to evaluate α_j(x) for every j. This implies that for every (x, y) we need d + 1 queries to the instance C_l of HighD-BiSperner so that O_f returns the value of f_{C_l}(x, y). If we take the gradient of f_{C_l} with respect to (x, y), then an identical argument implies that the same set of d + 1 queries to HighD-BiSperner is needed so that O_f returns the value of ∇f_{C_l}(x, y) too. Therefore every query to the oracle O_f can be implemented via d + 1 queries to C_l. Now we can use Corollary 9.2 to get that the number of queries that we need in order to solve GDAFixedPoint with oracle access O_f to f and ∇f is at least 2^d/(d + 1). Finally, observe that the proof of Theorem 5.1 applies in the black-box model too. Hence finding a solution of GDAFixedPoint when we have black-box access O_f to f and ∇f is equivalent to finding a solution of LR-LocalMinMax when we have exactly the same black-box access O_f to f and ∇f. Therefore to find solutions of LR-LocalMinMax with black-box access O_f to f and ∇f we need at least 2^d/(d + 1) queries to O_f, and the theorem follows by observing that in our proof the only parameters that depend on d are L, G, ε, and possibly δ, but 1/δ = O(√(L/ε)) and hence the dependence on δ can be replaced by a dependence on L and ε.
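To illustrate the query accounting in the proof above, the sketch below evaluates f_{C_l}(x, y) = (1/d) · Σ_j (x_j − y_j) · α_j(x) through a counting wrapper around a black-box C_l. For brevity it uses the single-cubelet specialization N = 2 of the interpolation coefficients and an arbitrary toy labeling for C_l; both simplifications are ours. The point is that one evaluation of f_{C_l} touches at most d + 1 vertices.

```python
import itertools


def S(p):
    # Smooth step function of order 2 (Definition 8.2).
    return 0.0 if p <= 0 else 1.0 if p >= 1 else 6 * p**5 - 15 * p**4 + 10 * p**3


def S_inf(p):
    # Smooth step function of order infinity (Definition 8.2).
    if p <= 0:
        return 0.0
    if p >= 1:
        return 1.0
    a = 2.0 ** (-1.0 / p)
    return a / (a + 2.0 ** (-1.0 / (1.0 - p)))


def nonzero_coeffs(x):
    """{v: P_v(x)} over the corners of the single cubelet [0,1]^d (the case N = 2)."""
    d = len(x)
    Q = {}
    for v in itertools.product((0, 1), repeat=d):
        A = [j for j in range(d) if v[j] == 0]      # coordinates still at the source
        B = [j for j in range(d) if v[j] == 1]      # coordinates already at the target
        if any(x[j] >= x[l] for j in A for l in B):
            continue
        q = 1.0
        if not B:
            for j in A:
                q *= S_inf(1 - S(x[j]))
        elif not A:
            for l in B:
                q *= S_inf(S(x[l]))
        else:
            for j in A:
                for l in B:
                    q *= S_inf(S(x[l]) - S(x[j]))
        if q > 0:
            Q[v] = q
    total = sum(Q.values())
    return {v: q / total for v, q in Q.items()}


queried = set()


def C_l(v):
    # Black-box HighD-BiSperner labeling with a query log; the labels are a toy choice.
    queried.add(v)
    return [1 if bit else -1 for bit in v]


def f_C(x, y):
    d = len(x)
    P = nonzero_coeffs(x)
    alpha = [-sum(p * C_l(v)[j] for v, p in P.items()) for j in range(d)]
    return sum((x[j] - y[j]) * alpha[j] for j in range(d)) / d


d = 4
value = f_C([0.15, 0.4, 0.65, 0.9], [0.5] * d)
assert -1.0 <= value <= 1.0        # the 1/d scaling keeps the range inside [-1, 1]
assert len(queried) <= d + 1       # one O_f query costs at most d + 1 queries to C_l
```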
10 Hardness in the Global Regime
In this section our goal is to prove that the complexity of the problems LocalMinMax and LocalMin significantly increases when ε, δ lie outside the local regime, in the global regime. We start with the following theorem, where we show the FNP-hardness of LocalMinMax.

Theorem 10.1. LocalMinMax is FNP-hard even when ε is set to any value below a sufficiently small universal constant, δ is set to any value ≥ 1, and even when P(A, b) = [0,1]^d, G = √d, L = d, and B = d.

Proof. We now present a reduction from 3-SAT(3) to LocalMinMax that proves Theorem 10.1. First we remind the definition of the problem 3-SAT(3).

3-SAT(3).
Input: A boolean CNF formula φ with boolean variables x₁, …, x_n such that every clause of φ has at most 3 boolean variables and every boolean variable appears in at most 3 clauses.
Output: An assignment x ∈ {0,1}^n that satisfies φ, or ⊥ if no such assignment exists.

Given an instance of 3-SAT(3) we first construct a polynomial P_j(x) for each clause φ_j as follows: to each boolean variable x_i (there are n boolean variables x_i) we correspond a respective real-valued variable x_i. Then for each clause φ_j (there are m such clauses), let ℓ_i, ℓ_k, ℓ_r denote the literals participating in φ_j, and define

P_j(x) = P_{ji}(x) · P_{jk}(x) · P_{jr}(x), where P_{ji}(x) = 1 − x_i if ℓ_i = x_i, and P_{ji}(x) = x_i if ℓ_i = ¬x_i.

Finally, we define

f(x, w, z) = Σ_{j=1}^m P_j(x) · (w_j − z_j)²,

where w_j, z_j are additional variables associated with clause φ_j. The player that wants to minimize f controls the x, w vectors, while the maximizing player controls the z variables. Lemma 10.2.
The formula φ admits a satisfying assignment if and only if there exists an (ε, δ)-local min-max equilibrium of f with ε below a sufficiently small universal constant, δ = 1, and (x, w) ∈ [0,1]^{n+m}.

Proof. Let us assume that there exists a satisfying assignment. Given such a satisfying assignment we construct ((x*, w*), z*) that is a (0, 1)-local min-max equilibrium of f. We set each variable x*_i ≜ 1 if the corresponding boolean variable is true and x*_i ≜ 0 otherwise. Then P_j(x*) = 0 for every clause j, meaning that the strategy profile ((x*, w*), z*) is a global Nash equilibrium no matter the values of w*, z*.

On the opposite direction, let us assume that there exists an (ε, δ)-local min-max equilibrium of f with ε sufficiently small and δ = 1. In this case we first prove that for each j = 1, …, m,

P_j(x*) ≤ 16 · ε.

Fix any clause j. In case |w*_j − z*_j| ≥ 1/4, the minimizing player can decrease the value of f by at least P_j(x*)/16 by setting w*_j ≜ z*_j. On the other hand, in case |w*_j − z*_j| ≤ 1/4, the maximizing player can increase the value of f by at least P_j(x*)/16 by moving z*_j either to 0 or to 1. We remark that both of the options are feasible since δ = 1.

Now consider the random assignment in which each boolean variable x_i is independently selected to be true with probability x*_i. Then

P(clause φ_j is not satisfied) = P_j(x*) ≤ 16 · ε.

Since each clause φ_j shares variables with at most 6 other clauses, the event of φ_j not being satisfied is dependent with at most 6 other events. For ε sufficiently small, by the Lovász Local Lemma [EL73] we get that the probability that none of these events occurs is positive. As a result, there exists a satisfying assignment.

Hence the formula φ is satisfiable if and only if f has an (ε, 1)-local min-max equilibrium point. What is left to prove the FNP-hardness is to show how we can find a satisfying assignment of φ given an approximate stationary point of f. This can be done using the celebrated results that provide constructive proofs of the Lovász Local Lemma [Mos09, MT10]. Finally, to conclude the proof, observe that since the f that we construct is a polynomial of degree at most 6, which can efficiently be described as a sum of monomials, we can trivially construct a Turing machine that computes the values of both f and ∇f in polynomial time in the requested number of bits of accuracy. The constructed function f is √d-Lipschitz and d-smooth, where d is the number of variables, which is equal to n + 2m. More precisely, since each boolean variable x_i participates in at most 3 clauses, the real-valued variable x_i appears in at most 3 monomials P_j. Thus −3 ≤ ∂f(x, w, z)/∂x_i ≤ 3. Similarly, it is not hard to see that −2 ≤ ∂f(x, w, z)/∂w_j, ∂f(x, w, z)/∂z_j ≤ 2. All the latter imply that ‖∇f(x, w, z)‖ ≤ Θ(√(n + m)), meaning that f(x, w, z) is Θ(√(n + m))-Lipschitz. Using again the fact that each x_i participates in at most 3 monomials P_j(x), we get that all the terms ∂²f/∂x_i², ∂²f/(∂x_i ∂w_j), ∂²f/(∂x_i ∂z_j), ∂²f/∂w_j², ∂²f/∂z_j², ∂²f/(∂w_j ∂z_j) lie in [−6, 6]. Thus the absolute value of each entry of ∇²f(x, w, z) is bounded by 6 and thus ‖∇²f(x, w, z)‖ ≤ Θ(n + m), which implies the Θ(n + m)-smoothness. Therefore our reduction produces a valid instance of LocalMinMax and hence the theorem follows.

Next we show the FNP-hardness of LocalMin. As we can see, there is a gap between Theorem 10.1 and Theorem 10.3. In particular, the FNP-hardness result for LocalMinMax is stronger, since it holds for any δ ≥ 1, whereas for the FNP-hardness of LocalMin our proof needs δ ≥ √d when the rest of the parameters remain the same.

Theorem 10.3. LocalMin is FNP-hard even when ε is set to any value below a sufficiently small universal constant, δ is set to any value ≥ √d, and even when P(A, b) = [0,1]^d, G = √d, L = d, and B = d.

Proof. We follow the same proof as in the proof of Theorem 10.1, but we instead set f(x) = Σ_{j=1}^m P_j(x), where x ∈ [0,1]^n (the number of variables is d := n). We then get that if the initial formula is satisfiable then there exists x ∈ P(A, b) such that f(x) = 0. On the other hand, if there exists x ∈ P(A, b) such that f(x) ≤ ε for ε sufficiently small, then via the Lovász Local Lemma we get that φ admits a satisfying assignment, and the FNP-hardness follows again from the constructive proofs of the Lovász Local Lemma [Mos09, MT10]. Setting δ ≥ √n, which equals the diameter of the feasibility set, implies that in case there exists x̂ with f(x̂) = 0, any (ε, δ)-LocalMin solution x* must admit value f(x*) ≤ ε.

The following theorem shows the hardness of LocalMin in the global regime in the black-box model, but with worse Lipschitzness and smoothness parameters than the ones in Theorem 10.3, and for this reason we present both of them.

Theorem 10.4.
In the worst case, Ω(2^d/d) value/gradient black-box queries are needed to determine an (ε, δ)-LocalMin solution for functions f : [0,1]^d → [0,1] with G = Θ(d), L = Θ(d²), ε < 1, and δ = √d.

Proof. The proof is based on the fact that, given just black-box access to a boolean formula φ : {0,1}^d ↦ {0,1}, at least Ω(2^d) queries are needed in order to determine whether φ admits a satisfying assignment. The term black-box access refers to the fact that the clauses of the formula are not given, and the only way to determine whether a specific boolean assignment is satisfying is by querying the specific binary string.

Given such black-box access to φ, we construct the function f_φ : [0,1]^d ↦ [0,1] as follows:

1. for each corner v ∈ V of the [0,1]^d hypercube, i.e. v ∈ {0,1}^d, we set f_φ(v) := 1 − φ(v);
2. for the rest of the points x ∈ [0,1]^d \ V, f_φ(x) := Σ_{v ∈ V} P_v(x) · f_φ(v), where P_v are the coefficients of Definition 8.9.

We remind that by Lemma 8.11 we get that ‖∇f_φ(x)‖ ≤ Θ(d) and ‖∇²f_φ(x)‖ ≤ Θ(d²), meaning that f_φ(·) is Θ(d)-Lipschitz and Θ(d²)-smooth. Moreover, by Lemma 8.10, for any x ∈ [0,1]^d the set V(x) = { v ∈ V : P_v(x) ≠ 0 } has cardinality at most d + 1, while at the same time Σ_{v ∈ V} P_v(x) = 1.

In case φ is not satisfiable, then f_φ(x) = 1 for all x ∈ [0,1]^d, since f_φ(v) = 1 for all v ∈ V. In case there exists a satisfying assignment v*, then f_φ(v*) = 0. Since δ ≥ √d, which is the diameter of [0,1]^d, any (ε, δ)-LocalMin solution x* must have f_φ(x*) ≤ ε < 1. Since f_φ(x*) ≜ Σ_{v ∈ V(x*)} P_v(x*) · f_φ(v) < 1, there exists at least one vertex v̂ ∈ V(x*) with f_φ(v̂) = 0, meaning that φ(v̂) = 1. As a result, given an (ε, δ)-LocalMin solution x* with f_φ(x*) < 1, we can find a satisfying assignment v̂ by querying φ(v) for each vertex v ∈ V(x*). Since |V(x*)| ≤ d + 1, this will take at most d + 1 queries.

To conclude, if an (ε, δ)-LocalMin solution could be determined with less than O(2^d/d) value/gradient queries, then determining whether φ admits a satisfying assignment could be done with less than O(2^d) queries on φ (the latter is obviously impossible). Notice that in any value/gradient query, both f_φ(x) and ∇f_φ(x) can be computed by querying the value f_φ(v) of the vertices v ∈ V(x). Since |V(x)| ≤ d + 1, any value/gradient query of f_φ can be simulated by d + 1 queries of φ.

Acknowledgements
This work was supported by NSF Awards IIS-1741137, CCF-1617730 and CCF-1901292, by a Simons Investigator Award, by the DOE PhILMs project (No. DE-AC05-76RL01830), and by the DARPA award HR00111990021. M.Z. was also supported by a Google Ph.D. Fellowship. S.S. was supported by NRF 2018 Fellowship NRF-NRFF2018-07.
References [AAZB +
17] Naman Agarwal, Zeyuan Allen-Zhu, Brian Bullins, Elad Hazan, and Tengyu Ma.Finding approximate local minima faster than gradient descent. In
Proceedings ofthe 49th Annual ACM SIGACT Symposium on Theory of Computing , pages 1195–1199,2017.[ACB17] Martin Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein generative ad-versarial networks. In
Proceedings of the 34th International Conference on MachineLearning-Volume 70 , pages 214–223, 2017.[Adl13] Ilan Adler. The equivalence of linear programs and zero-sum games.
InternationalJournal of Game Theory , 42(1):165–177, 2013.[ADLH19] Leonard Adolphs, Hadi Daneshmand, Aurelien Lucchi, and Thomas Hofmann. Lo-cal saddle point optimization: A curvature exploitation approach. In
The 22ndInternational Conference on Artificial Intelligence and Statistics , pages 486–495, 2019.[ADSG19] Mohammad Alkousa, Darina Dvinskikh, Fedor Stonyakin, and Alexander Gas-nikov. Accelerated methods for composite non-bilinear saddle point problem. arXivpreprint arXiv:1906.03620 , 2019.[ALW19] Jacob Abernethy, Kevin A Lai, and Andre Wibisono. Last-iterate convergence ratesfor min-max optimization. arXiv preprint arXiv:1906.02027 , 2019.50AMLJG20] Waïss Azizian, Ioannis Mitliagkas, Simon Lacoste-Julien, and Gauthier Gidel. Atight and unified analysis of extragradient for a whole spectrum of differentiablegames. In
Proceedings of the 23rd International Conference on Artificial Intelligence and Statistics (AISTATS), 2020.

[BCB12] Sébastien Bubeck and Nicolò Cesa-Bianchi. Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations and Trends in Machine Learning, 5(1):1–122, 2012.

[BCE+95] Paul Beame, Stephen A. Cook, Jeff Edmonds, Russell Impagliazzo, and Toniann Pitassi. The relative complexity of NP search problems. In Proceedings of the Twenty-Seventh Annual ACM Symposium on Theory of Computing, 29 May–1 June 1995, Las Vegas, Nevada, USA, pages 303–314, 1995.

[BIQ+17] Aleksandrs Belovs, Gábor Ivanyos, Youming Qiao, Miklos Santha, and Siyi Yang. On the polynomial parity argument complexity of the combinatorial nullstellensatz. In Proceedings of the 32nd Computational Complexity Conference, pages 1–24, 2017.

[Bla56] David Blackwell. An analog of the minimax theorem for vector payoffs. Pacific J. Math., 6(1):1–8, 1956.

[BPR15] Nir Bitansky, Omer Paneth, and Alon Rosen. On the cryptographic hardness of finding a nash equilibrium. In Proceedings of the 56th Annual Symposium on Foundations of Computer Science (FOCS), 2015.

[Bre76] Richard P. Brent. Fast multiple-precision evaluation of elementary functions.
Journal of the ACM (JACM), 23(2):242–251, 1976.

[CBL06] Nicolò Cesa-Bianchi and Gábor Lugosi. Prediction, Learning, and Games. Cambridge University Press, 2006.

[CDT09] Xi Chen, Xiaotie Deng, and Shang-Hua Teng. Settling the complexity of computing two-player nash equilibria. Journal of the ACM (JACM), 56(3):1–57, 2009.

[CPY17] Xi Chen, Dimitris Paparas, and Mihalis Yannakakis. The complexity of non-monotone markets. J. ACM, 64(3):20:1–20:56, 2017.

[Dan51] George B. Dantzig. A proof of the equivalence of the programming problem and the game problem. In T. C. Koopmans, editor, Activity Analysis of Production and Allocation. Wiley, New York, 1951.

[Das13] Constantinos Daskalakis. On the complexity of approximating a nash equilibrium. ACM Transactions on Algorithms (TALG), 9(3):1–35, 2013.

[Das18] Constantinos Daskalakis. Equilibria, Fixed Points, and Computational Complexity - Nevanlinna Prize Lecture. Proceedings of the International Congress of Mathematicians (ICM), 1:147–209, 2018.

[DFS20] Argyrios Deligkas, John Fearnley, and Rahul Savani. Tree polymatrix games are ppad-hard. CoRR, abs/2002.12119, 2020.

[DGP09] Constantinos Daskalakis, Paul W. Goldberg, and Christos H. Papadimitriou. The complexity of computing a nash equilibrium.
SIAM Journal on Computing, 39(1):195–259, 2009.

[DHS11] John Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12(Jul):2121–2159, 2011.

[DISZ18] Constantinos Daskalakis, Andrew Ilyas, Vasilis Syrgkanis, and Haoyang Zeng. Training gans with optimism. In International Conference on Learning Representations (ICLR 2018), 2018.

[DP11] Constantinos Daskalakis and Christos Papadimitriou. Continuous local search. In Proceedings of the Twenty-Second Annual ACM-SIAM Symposium on Discrete Algorithms, pages 790–804. SIAM, 2011.

[DP18] Constantinos Daskalakis and Ioannis Panageas. The limit points of (optimistic) gradient descent in min-max optimization. In Advances in Neural Information Processing Systems, pages 9236–9246, 2018.

[DP19] Constantinos Daskalakis and Ioannis Panageas. Last-iterate convergence: Zero-sum games and constrained min-max optimization. Innovations in Theoretical Computer Science, 2019.

[DTZ18] Constantinos Daskalakis, Christos Tzamos, and Manolis Zampetakis. A converse to banach's fixed point theorem and its CLS-completeness. In Proceedings of the 50th Annual ACM SIGACT Symposium on Theory of Computing (STOC), 2018.

[EL73] Paul Erdős and László Lovász. Problems and results on 3-chromatic hypergraphs and some related questions. In Colloquia Mathematica Societatis Janos Bolyai 10. Infinite and Finite Sets, Keszthely (Hungary). Citeseer, 1973.

[EY10] Kousha Etessami and Mihalis Yannakakis. On the complexity of nash equilibria and other fixed points. SIAM Journal on Computing, 39(6):2531–2597, 2010.

[FG18] Aris Filos-Ratsikas and Paul W. Goldberg. Consensus halving is ppa-complete. In Proceedings of the 50th Annual ACM SIGACT Symposium on Theory of Computing (STOC), 2018.

[FG19] Aris Filos-Ratsikas and Paul W. Goldberg. The complexity of splitting necklaces and bisecting ham sandwiches. In Proceedings of the 51st Annual ACM SIGACT Symposium on Theory of Computing (STOC), 2019.

[FP07] Francisco Facchinei and Jong-Shi Pang. Finite-Dimensional Variational Inequalities and Complementarity Problems. Springer Science & Business Media, 2007.

[FPT04] Alex Fabrikant, Christos H. Papadimitriou, and Kunal Talwar. The complexity of pure nash equilibria. In Proceedings of the 36th Annual ACM Symposium on Theory of Computing (STOC), 2004.

[FRHSZ20a] Aris Filos-Ratsikas, Alexandros Hollender, Katerina Sotiraki, and Manolis Zampetakis. Consensus-halving: Does it ever get easier? arXiv preprint arXiv:2002.11437, 2020.

[FRHSZ20b] Aris Filos-Ratsikas, Alexandros Hollender, Katerina Sotiraki, and Manolis Zampetakis. A topological characterization of modulo-p arguments and implications for necklace splitting. arXiv preprint arXiv:2003.11974, 2020.

[GH19] Paul W. Goldberg and Alexandros Hollender. The hairy ball problem is ppad-complete. In
Proceedings of the 46th International Colloquium on Automata, Languages, and Programming (ICALP), 2019.

[GHP+19] Gauthier Gidel, Reyhane Askari Hemmat, Mohammad Pezeshki, Rémi Le Priol, Gabriel Huang, Simon Lacoste-Julien, and Ioannis Mitliagkas. Negative momentum for improved game dynamics. In The 22nd International Conference on Artificial Intelligence and Statistics, pages 1802–1811, 2019.

[GKSZ19] Mika Göös, Pritish Kamath, Katerina Sotiraki, and Manolis Zampetakis. On the complexity of modulo-q arguments and the chevalley-warning theorem. arXiv preprint arXiv:1912.04467, 2019.

[Goo16] Ian Goodfellow. Nips 2016 tutorial: Generative adversarial networks. arXiv preprint arXiv:1701.00160, 2016.

[GPDO20] Noah Golowich, Sarath Pattathil, Constantinos Daskalakis, and Asuman E. Ozdaglar. Last iterate is slower than averaged iterate in smooth convex-concave saddle point problems. CoRR, abs/2002.00057, 2020.

[GPM+14] Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron C. Courville, and Yoshua Bengio. Generative Adversarial Nets. In Advances in Neural Information Processing Systems 27: Annual Conference on Neural Information Processing Systems 2014, December 8-13 2014, Montreal, Quebec, Canada, pages 2672–2680, 2014.

[HA18] Erfan Yazdandoost Hamedani and Necdet Serhat Aybat. A primal-dual algorithm for general convex-concave saddle point problems. arXiv preprint arXiv:1803.01401, 2018.

[Haz16] Elad Hazan. Introduction to online convex optimization. Foundations and Trends in Optimization, 2(3-4):157–325, 2016.

[HPV89] M. D. Hirsch, C. H. Papadimitriou, and S. A. Vavasis. Exponential lower bounds for finding brouwer fixed points. Journal of Complexity, 5:379–416, 1989.

[Jeř16] Emil Jeřábek. Integer factoring and modular square roots. Journal of Computer and System Sciences, 82(2):380–394, 2016.

[JGN+17] Chi Jin, Rong Ge, Praneeth Netrapalli, Sham M. Kakade, and Michael I. Jordan. How to escape saddle points efficiently. In Proceedings of the 34th International Conference on Machine Learning, Volume 70, pages 1724–1732. JMLR.org, 2017.

[JNJ19] Chi Jin, Praneeth Netrapalli, and Michael I. Jordan. What is local optimality in nonconvex-nonconcave minimax optimization? arXiv preprint arXiv:1902.00618, 2019.

[JPY88] David S. Johnson, Christos H. Papadimitriou, and Mihalis Yannakakis. How easy is local search?
Journal of Computer and System Sciences, 37(1):79–100, 1988.

[KB14] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

[KM18] Pravesh K. Kothari and Ruta Mehta. Sum-of-squares meets Nash: lower bounds for finding any equilibrium. In Proceedings of the 50th Annual ACM SIGACT Symposium on Theory of Computing (STOC), 2018.

[KM19] Weiwei Kong and Renato D.C. Monteiro. An accelerated inexact proximal point method for solving nonconvex-concave min-max problems. arXiv preprint arXiv:1905.13433, 2019.

[Kor76] G. M. Korpelevich. The extragradient method for finding saddle points and other problems. Matecon, 12:747–756, 1976.

[LJJ19] Tianyi Lin, Chi Jin, and Michael I. Jordan. On gradient descent ascent for nonconvex-concave minimax problems. arXiv preprint arXiv:1906.00331, 2019.

[LJJ20] Tianyi Lin, Chi Jin, and Michael Jordan. Near-optimal algorithms for minimax optimization. arXiv preprint arXiv:2002.02417, 2020.

[LPP+19] Jason D. Lee, Ioannis Panageas, Georgios Piliouras, Max Simchowitz, Michael I. Jordan, and Benjamin Recht. First-order methods almost always avoid strict saddle points. Math. Program., 176(1-2):311–337, 2019.

[LS19] Tengyuan Liang and James Stokes. Interaction matters: A note on non-asymptotic local convergence of generative adversarial networks. In The 22nd International Conference on Artificial Intelligence and Statistics, pages 907–915, 2019.

[LTHC19] Songtao Lu, Ioannis Tsaknakis, Mingyi Hong, and Yongxin Chen. Hybrid block successive approximation for one-sided non-convex min-max problems: algorithms and applications. arXiv preprint arXiv:1902.08294, 2019.

[Meh14] Ruta Mehta. Constant rank bimatrix games are ppad-hard. In Proceedings of the 46th Symposium on Theory of Computing (STOC), 2014.

[MGN18] Lars Mescheder, Andreas Geiger, and Sebastian Nowozin. Which training methods for gans do actually converge? In International Conference on Machine Learning, pages 3481–3490, 2018.

[MMS+18] Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu. Towards deep learning models resistant to adversarial attacks. In International Conference on Learning Representations, 2018.

[MOP19] Aryan Mokhtari, Asuman Ozdaglar, and Sarath Pattathil. A unified analysis of extra-gradient and optimistic gradient methods for saddle point problems: Proximal point approach. arXiv preprint arXiv:1901.08511, 2019.

[Mos09] Robin A. Moser. A constructive proof of the lovász local lemma. In Proceedings of the Forty-First Annual ACM Symposium on Theory of Computing, pages 343–350, 2009.

[MP89] N. Megiddo and C. H. Papadimitriou. A note on total functions, existence theorems, and computational complexity. Technical report, IBM, 1989.

[MPP18] Panayotis Mertikopoulos, Christos H. Papadimitriou, and Georgios Piliouras. Cycles in adversarial regularized learning. In Proceedings of the Twenty-Ninth Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), 2018.

[MPPSD16] Luke Metz, Ben Poole, David Pfau, and Jascha Sohl-Dickstein. Unrolled generative adversarial networks. arXiv preprint arXiv:1611.02163, 2016.

[MR18] Eric Mazumdar and Lillian J. Ratliff. On the convergence of gradient-based learning in continuous games. arXiv preprint arXiv:1804.05464, 2018.

[MSV20] Oren Mangoubi, Sushant Sachdeva, and Nisheeth K. Vishnoi. A provably convergent and practical algorithm for min-max optimization with applications to gans. arXiv preprint arXiv:2006.12376, 2020.

[MT10] Robin A. Moser and Gábor Tardos. A constructive proof of the general lovász local lemma.
Journal of the ACM (JACM), 57(2):1–15, 2010.

[MV20] Oren Mangoubi and Nisheeth K. Vishnoi. A second-order equilibrium in nonconvex-nonconcave min-max optimization: Existence and algorithm. arXiv preprint arXiv:2006.12363, 2020.

[Nem04] Arkadi Nemirovski. Interior point polynomial time methods in convex programming. Lecture notes, 2004.

[NSH+19] Maher Nouiehed, Maziar Sanjabi, Tianjian Huang, Jason D. Lee, and Meisam Razaviyayn. Solving a class of non-convex min-max games using iterative first order methods. In Advances in Neural Information Processing Systems, pages 14905–14916, 2019.

[NY83] Arkadiĭ Semenovich Nemirovsky and David Borisovich Yudin. Problem Complexity and Method Efficiency in Optimization. Chichester: Wiley, 1983.

[OX19] Yuyuan Ouyang and Yangyang Xu. Lower complexity bounds of first-order methods for convex-concave bilinear saddle-point problems. Mathematical Programming, pages 1–35, 2019.

[Pap94a] C. Papadimitriou. Computational Complexity. Addison-Wesley, 1994.

[Pap94b] Christos H. Papadimitriou. On the complexity of the parity argument and other inefficient proofs of existence. Journal of Computer and System Sciences, 48(3):498–532, 1994.

[RKK18] Sashank J. Reddi, Satyen Kale, and Sanjiv Kumar. On the convergence of adam and beyond. In
Proceedings of the 6th International Conference on Learning Representations (ICLR), 2018.

[RLLY18] Hassan Rafique, Mingrui Liu, Qihang Lin, and Tianbao Yang. Non-convex min-max optimization: Provable algorithms and applications in machine learning. arXiv preprint arXiv:1810.02060, 2018.

[Ros65] J. Ben Rosen. Existence and uniqueness of equilibrium points for concave n-person games. Econometrica: Journal of the Econometric Society, pages 520–534, 1965.

[Rub15] Aviad Rubinstein. Inapproximability of nash equilibrium. In Proceedings of the Forty-Seventh Annual ACM Symposium on Theory of Computing (STOC), 2015.

[Rub16] Aviad Rubinstein. Settling the complexity of computing approximate two-player nash equilibria. In 57th Annual IEEE Symposium on Foundations of Computer Science (FOCS), pages 258–265. IEEE, 2016.

[SS12] Shai Shalev-Shwartz. Online learning and online convex optimization. Foundations and Trends in Machine Learning, 4(2):107–194, 2012.

[SSBD14] Shai Shalev-Shwartz and Shai Ben-David. Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press, 2014.

[SY91] Alejandro A. Schäffer and Mihalis Yannakakis. Simple local search problems that are hard to solve. SIAM J. Comput., 20(1):56–87, 1991.

[SZZ18] Katerina Sotiraki, Manolis Zampetakis, and Giorgos Zirdelis. Ppp-completeness with connections to cryptography. In Proceedings of the 59th IEEE Annual Symposium on Foundations of Computer Science (FOCS), 2018.

[TJNO19] Kiran K. Thekumparampil, Prateek Jain, Praneeth Netrapalli, and Sewoong Oh. Efficient algorithms for smooth minimax optimization. In Advances in Neural Information Processing Systems, pages 12659–12670, 2019.

[vN28] John von Neumann. Zur Theorie der Gesellschaftsspiele. Math. Ann., pages 295–320, 1928.

[VY11] Vijay V. Vazirani and Mihalis Yannakakis. Market equilibrium under separable, piecewise-linear, concave utilities. J. ACM, 58(3):10:1–10:25, 2011.

[WZB19] Yuanhao Wang, Guodong Zhang, and Jimmy Ba. On solving minimax optimization locally: A follow-the-ridge approach. In International Conference on Learning Representations, 2019.

[Zha19] Renbo Zhao. Optimal algorithms for stochastic three-composite convex-concave saddle point problems. arXiv preprint arXiv:1903.01687, 2019.
A Proof of Theorem 4.1
We first remind the definition of the 3-SAT(3) problem that we will use for our reduction.

3-SAT(3).
Input: A boolean CNF-formula φ with boolean variables x_1, . . . , x_n such that every clause of φ has at most 3 boolean variables and every boolean variable appears in at most 3 clauses.
Output: An assignment x ∈ {0, 1}^n that satisfies φ, or ⊥ if no such assignment exists.

It is well known that 3-SAT(3) is FNP-complete; for details see §9.2 of [Pap94a]. To prove Theorem 4.1, we reduce 3-SAT(3) to ε-StationaryPoint.

Given an instance of 3-SAT(3) we construct a function f : [0, 1]^{n+m} → R, where m is the number of clauses of φ. For each literal x_i we assign a real-valued variable which, by abuse of notation, we also denote x_i; it will be clear from the context whether we refer to the literal or to the real-valued variable. Then for each clause φ_j of φ we construct a polynomial P_j(x) as follows: if ℓ_i, ℓ_k, ℓ_m are the literals participating in φ_j, then P_j(x) = P_{ji}(x) · P_{jk}(x) · P_{jm}(x), where

    P_{ji}(x) = 1 − x_i   if ℓ_i = x_i,
    P_{ji}(x) = x_i       if ℓ_i = ¬x_i.

The overall constructed function is f(x, w) = ∑_{j=1}^{m} w_j · P_j(x), where each w_j is an additional variable associated with clause φ_j. Notice that 0 ≤ ∂f(x, w)/∂w_j ≤ 1 and −3 ≤ ∂f(x, w)/∂x_i ≤ 3, since every variable x_i participates in at most 3 clauses. As a result, ‖∇f(x, w)‖ ≤ Θ(√(n + m)), meaning that f(x, w) is G-Lipschitz with G = Θ(√(n + m)). Also notice that all the entries of ∇²f(x, w), i.e. ∂²f(x, w)/∂x_i ∂w_j, ∂²f(x, w)/∂x_i ∂x_m, and ∂²f(x, w)/∂w_k ∂w_j, lie in [−3, 3]. As a result, ‖∇²f(x, w)‖ ≤ Θ(n + m), meaning that f(x, w) is L-smooth with L = Θ(n + m).

Lemma A.1.
There exists a satisfying assignment for the clauses φ_1, . . . , φ_m if and only if the constructed StationaryPoint instance with ε = 1/24 admits a solution (x⋆, w⋆) ∈ [0, 1]^{n+m} such that ‖∇f(x⋆, w⋆)‖ < 1/24.

Proof. By the definition of StationaryPoint, in case there exists a pair of points (x̂, ŵ) ∈ [0, 1]^{n+m} with ‖∇f(x̂, ŵ)‖ < ε/2 = 1/48, a solution (x⋆, w⋆) with ‖∇f(x⋆, w⋆)‖ < ε = 1/24 must be returned; otherwise, if ‖∇f(x, w)‖ > ε/2 = 1/48 for all (x, w) ∈ [0, 1]^{n+m}, the null symbol ⊥ is returned.

Let us assume that there exists a satisfying assignment of φ. Consider the solution (x̂, ŵ) constructed as follows: each variable x̂_i is set to 1 iff the respective boolean variable is true, and ŵ_j = 0 for all j = 1, . . . , m. Since the assignment satisfies the CNF-formula φ, there exists at least one true literal in each clause φ_j, which means that P_j(x̂) = 0 for all j = 1, . . . , m. As a result, ∂f(x̂, ŵ)/∂w_j = P_j(x̂) = 0 for all j = 1, . . . , m. At the same time, ∂f(x̂, ŵ)/∂x_i = ∑_{j=1}^{m} ŵ_j · ∂P_j(x̂)/∂x_i = 0 for every i, since ŵ_j = 0 for all j = 1, . . . , m.
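The forward direction above is easy to check numerically. The sketch below is our own illustration (the toy formula, the clause encoding, and all helper names are ours, not the paper's): it builds f(x, w) = ∑_j w_j · P_j(x) for a tiny 3-SAT(3) instance, verifies that a satisfying assignment together with ŵ = 0 is an exact stationary point, and sanity-checks the Lovász Local Lemma constant used in the opposite direction.

```python
import math

# Toy instance (our own example): phi = (x1 ∨ ¬x2) ∧ (x2 ∨ x3),
# encoded as lists of (variable index, is_positive) pairs.
clauses = [[(0, True), (1, False)], [(1, True), (2, True)]]

def P(j, x):
    """P_j(x): product of (1 - x_i) for positive literals and x_i for
    negated ones; on 0/1 points it vanishes iff clause j is satisfied."""
    out = 1.0
    for i, positive in clauses[j]:
        out *= (1.0 - x[i]) if positive else x[i]
    return out

def f(x, w):
    """The constructed objective f(x, w) = sum_j w_j * P_j(x)."""
    return sum(w[j] * P(j, x) for j in range(len(clauses)))

def grad(x, w, h=1e-6):
    """Central-difference gradient over all n + m coordinates; exact up
    to rounding because f is linear in each single coordinate."""
    z, n, g = list(x) + list(w), len(x), []
    for k in range(len(z)):
        zp, zm = z[:], z[:]
        zp[k] += h; zm[k] -= h
        g.append((f(zp[:n], zp[n:]) - f(zm[:n], zm[n:])) / (2.0 * h))
    return g

# Satisfying assignment (x1=T, x2=T, x3=F) with w = 0 is an exact
# stationary point: every P_j vanishes and every w_j is 0.
x_hat, w_hat = [1.0, 1.0, 0.0], [0.0, 0.0]
assert all(abs(gk) < 1e-5 for gk in grad(x_hat, w_hat))

# Sanity check of the constants in the reverse direction: the symmetric
# Lovász Local Lemma requires e * p * (d + 1) <= 1, here with bad-event
# probability p = 1/24 and dependency degree d = 6.
assert math.e * (1.0 / 24.0) * 7.0 <= 1.0
```

Since f is linear in every single coordinate, the central differences are exact up to rounding, so the zero-gradient check is not an artifact of the step size h.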
Overall we have that ∇f(x̂, ŵ) = 0 and hence ‖∇f(x̂, ŵ)‖ = 0 < 1/48 = ε/2. As a result, the constructed StationaryPoint instance must return a solution (x⋆, w⋆) with ‖∇f(x⋆, w⋆)‖ < 1/24 = ε.

In the opposite direction, the existence of a pair of points (x⋆, w⋆) with ‖∇f(x⋆, w⋆)‖ < 1/24 implies that P_j(x⋆) < 1/24 for all j = 1, . . . , m. Consider the probability distribution over the boolean assignments in which each boolean variable x_i is independently selected to be true with probability x⋆_i. Then, P(clause φ_j is not satisfied) = P_j(x⋆) < 1/24. Since φ_j shares variables with at most 6 other clauses, the bad event of φ_j not being satisfied is dependent on at most 6 other bad events. By the Lovász Local Lemma [EL73], we get that the probability that none of the bad events occurs is positive. As a result, there exists a satisfying assignment.

Using Lemma A.1 we can conclude that φ is satisfiable if and only if f has a 1/24-approximate stationary point. What is left to prove the FNP-hardness is to show how we can find a satisfying assignment of φ given an approximate stationary point of f. This can be done using the celebrated results that provide constructive proofs of the Lovász Local Lemma [Mos09, MT10]. Finally, we remind that the constructed function f is Θ(√d)-Lipschitz and Θ(d)-smooth, where d is the number of variables, which is equal to n + m.

B Missing Proofs from Section 5
In this section we give proofs for the statements presented in Section 5. These statements establish the totality and the inclusion in PPAD of LR-LocalMinMax and GDAFixedPoint.

B.1 Proof of Theorem 5.1
We start by establishing claim "1." in the statement of the theorem. It will be clear that our proof provides a polynomial-time reduction from LR-LocalMinMax to GDAFixedPoint. Suppose that (x⋆, y⋆) is an α-approximate fixed point of F_GDA, where α is the function of δ, G and L specified in the theorem statement. To simplify our proof, we abuse notation and define f(x) ≜ f(x, y⋆), ∇f(x) ≜ ∇_x f(x, y⋆), K ≜ {x | (x, y⋆) ∈ P(A, b)} and x̂ ≜ Π_K(x⋆ − ∇f(x⋆)). Because (x⋆, y⋆) is an α-approximate fixed point of F_GDA, it follows that ‖x̂ − x⋆‖ < α.

Claim B.1. ⟨∇f(x⋆), x⋆ − x⟩ < (G + δ + α) · α, for all x ∈ K ∩ B_d(δ; x⋆).

Proof. Using the fact that x̂ = Π_K(x⋆ − ∇f(x⋆)) and that K is a convex set, we can apply Theorem 1.5.5 (b) of [FP07] to get that

    ⟨x⋆ − ∇f(x⋆) − x̂, x − x̂⟩ ≤ 0 for all x ∈ K.    (B.1)

Next, we do some simple algebra to get that, for all x ∈ K ∩ B_d(δ; x⋆),

    ⟨∇f(x⋆), x⋆ − x⟩ = ⟨x⋆ − ∇f(x⋆) − x̂, x − x̂⟩ + ⟨x − x̂ − ∇f(x⋆), x̂ − x⋆⟩
                    ≤ ⟨x − x̂ − ∇f(x⋆), x̂ − x⋆⟩          (by (B.1))
                    ≤ (‖x − x̂‖ + ‖∇f(x⋆)‖) · ‖x̂ − x⋆‖
                    < (G + δ + α) · α,

where the second-to-last inequality follows from the Cauchy–Schwarz inequality and the triangle inequality, and the last inequality follows from the triangle inequality and the following facts: (1) ‖x⋆ − x̂‖ < α, (2) x ∈ B_d(δ; x⋆), and (3) ‖∇f(x, y)‖ ≤ G for all (x, y) ∈ P(A, b).

For all x ∈ K ∩ B_d(δ; x⋆), from the L-smoothness of f we have that

    |f(x) − (f(x⋆) + ⟨∇f(x⋆), x − x⋆⟩)| ≤ (L/2) · ‖x − x⋆‖².    (B.2)

We distinguish two cases:

1. f(x⋆) ≤ f(x): In this case we stop, remembering that

    f(x⋆) ≤ f(x).    (B.3)

2. f(x⋆) > f(x): In this case, we consider two further sub-cases:

(a) ⟨∇f(x⋆), x − x⋆⟩ ≥ 0: in this sub-case, Eq. (B.2) gives f(x⋆) − f(x) + ⟨∇f(x⋆), x − x⋆⟩ ≤ (L/2) · ‖x − x⋆‖². Thus

    f(x⋆) ≤ f(x) + (L/2) · ‖x − x⋆‖² ≤ f(x) + (L/2) · δ² < f(x) + ε,    (B.4)

where for the last inequality we used that x ∈ B_d(δ; x⋆), and that δ < √(ε/L).

(b) ⟨∇f(x⋆), x − x⋆⟩ < 0: in this sub-case, Eq. (B.2) gives f(x⋆) − f(x) − ⟨∇f(x⋆), x⋆ − x⟩ ≤ (L/2) · ‖x − x⋆‖². Thus

    f(x⋆) ≤ f(x) + ⟨∇f(x⋆), x⋆ − x⟩ + (L/2) · ‖x − x⋆‖²
         ≤ f(x) + ⟨∇f(x⋆), x⋆ − x⟩ + (L/2) · δ²
         < f(x) + (G + δ + α) · α + (L/2) · δ² ≤ f(x) + ε,    (B.5)

where the second inequality follows from the fact that x ∈ B_d(δ; x⋆), the third inequality follows from Claim B.1, and the last inequality follows from the constraints δ < √(ε/L) and α ≤ √((G + δ)² + (ε − (L/2)δ²)) − (G + δ).

In all cases, we get from (B.3), (B.4) and (B.5) that f(x⋆) < f(x) + ε, for all x ∈ K ∩ B_d(δ; x⋆). Thus, lifting our abuse of notation, we get that f(x⋆, y⋆) < f(x, y⋆) + ε for all x ∈ {x | x ∈ B_d(δ; x⋆) and (x, y⋆) ∈ P(A, b)}. Using an identical argument we can also show that f(x⋆, y⋆) > f(x⋆, y) − ε for all y ∈ {y | y ∈ B_d(δ; y⋆) and (x⋆, y) ∈ P(A, b)}. The first part of the theorem follows.

Now let us establish claim "2." in the theorem statement. It will be clear that our proof provides a polynomial-time reduction from GDAFixedPoint to LR-LocalMinMax. For the choice of parameters ε and δ described in the theorem statement, we will show that, if (x⋆, y⋆) is an (ε, δ)-local min-max equilibrium of f, then ‖F_GDA,x(x⋆, y⋆) − x⋆‖ < α/2 and ‖F_GDA,y(x⋆, y⋆) − y⋆‖ < α/2. The second part of the theorem will then follow. We only prove that ‖F_GDA,x(x⋆, y⋆) − x⋆‖ < α/2, as the argument for y⋆ is identical. In the argument below we abuse notation in the same way we described earlier.
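Before diving into that argument, it may help to see the projected GDA map in code. The following sketch is a toy illustration of ours, with the box [0, 1]² standing in for the polytope P(A, b) and a hypothetical convex-concave objective f(x, y) = (x − 0.3)² − (y − 0.7)²; it is not the paper's construction.

```python
def clip01(t):
    """Projection onto the interval [0, 1]; the box [0, 1]^2 plays the
    role of the feasible polytope P(A, b) in this toy example."""
    return min(1.0, max(0.0, t))

def F_GDA(x, y, gx, gy):
    """One projected Gradient Descent/Ascent step: a descent step in x
    and an ascent step in y, each followed by projection to the box."""
    return clip01(x - gx(x, y)), clip01(y + gy(x, y))

# Hypothetical objective f(x, y) = (x - 0.3)**2 - (y - 0.7)**2 and its
# partial derivatives (convex in x, concave in y):
gx = lambda x, y: 2.0 * (x - 0.3)   # df/dx
gy = lambda x, y: -2.0 * (y - 0.7)  # df/dy

# (0.3, 0.7) is the global min-max point, and it is an exact fixed point
# of F_GDA: both gradients vanish and the projection is idle.
assert F_GDA(0.3, 0.7, gx, gy) == (0.3, 0.7)

# A point far from the equilibrium moves under the map (here it is even
# pushed onto the boundary by the projection):
assert F_GDA(1.0, 0.0, gx, gy) == (0.0, 1.0)
```

The quantity the proof controls, ‖F_GDA(x, y) − (x, y)‖, is exactly the residual of this update map: an (ε, δ)-local min-max point makes it small, and vice versa.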
With that notation, we will show that ‖x̂ − x⋆‖ < α/2.

Proof that ‖x̂ − x⋆‖ < α/2. From our choice of ε and δ, it is easy to see that δ = α/(3(L + 1)) < α/2. Thus, if ‖x̂ − x⋆‖ < δ, then we automatically get ‖x̂ − x⋆‖ < α/2. So it remains to handle the case ‖x̂ − x⋆‖ ≥ δ. We choose x_c ≜ x⋆ + δ · (x̂ − x⋆)/‖x̂ − x⋆‖. It is easy to see that x_c ∈ B_d(δ; x⋆), and hence we get that

    f(x⋆) − ε < f(x_c) ≤ f(x⋆) + ⟨∇f(x⋆), x_c − x⋆⟩ + (L/2) · ‖x_c − x⋆‖² ≤ f(x⋆) + ⟨∇f(x⋆), x_c − x⋆⟩ + ε/2,

where the first inequality holds because (x⋆, y⋆) is an (ε, δ)-local min-max equilibrium, the second inequality follows from the L-smoothness of f, and the third inequality follows from ‖x_c − x⋆‖ ≤ δ and our choice of δ = √(ε/L). The above implies ⟨∇f(x⋆), x⋆ − x_c⟩ < 3ε/2. Since x̂ − x⋆ = (x_c − x⋆) · ‖x̂ − x⋆‖/δ, we get that ⟨∇f(x⋆), x⋆ − x̂⟩ < (3ε)/(2δ) · ‖x⋆ − x̂‖. Therefore

    ‖x⋆ − x̂‖² = ⟨x⋆ − ∇f(x⋆) − x̂, x⋆ − x̂⟩ + ⟨∇f(x⋆), x⋆ − x̂⟩ < (3ε)/(2δ) · ‖x⋆ − x̂‖,

where in the above inequality we have also used (B.1). As a result, ‖x⋆ − x̂‖ < 3ε/(2δ) = (3L/2) · δ = αL/(2(L + 1)) < α/2.

B.2 Proof of Theorem 5.2
We provide a polynomial-time reduction from GDAFixedPoint to Brouwer. This establishes both the totality of GDAFixedPoint and its inclusion in PPAD, since Brouwer is both total and lies in PPAD, as per Lemma 2.5. It also establishes the totality and the inclusion in PPAD of LR-LocalMinMax, since LR-LocalMinMax is polynomial-time reducible to GDAFixedPoint, as shown in Theorem 5.1.

We proceed to describe our reduction. Suppose that f is the G-Lipschitz and L-smooth function provided as input to GDAFixedPoint. Suppose also that α is the approximation parameter provided as input to GDAFixedPoint. Given f and α, we define the function M : P(A, b) → P(A, b), which serves as input to Brouwer, as follows:

    M(x, y) = Π_P(A,b)[(x − ∇_x f(x, y), y + ∇_y f(x, y))].

Given that f is L-smooth, it follows that M is (L + 1)-Lipschitz. We set the approximation parameter provided as input to Brouwer to be γ = α²/(4(G + 2√d)).

To show the validity of the afore-described reduction, we prove that every feasible point (x⋆, y⋆) ∈ P(A, b) that is a γ-approximate fixed point of M, i.e. ‖M(x⋆, y⋆) − (x⋆, y⋆)‖ < γ, is also an α-approximate fixed point of F_GDA. Observe that since P(A, b) ⊆ [0, 1]^d it holds that ‖(x, y) − (x′, y′)‖ ≤ √d for all (x, y), (x′, y′) ∈ P(A, b). Hence, if γ > √d, then finding γ-approximate fixed points of M is trivial, and the same is true for finding α-approximate fixed points of F_GDA, since γ = α²/(4(G + 2√d)) implies that, if γ > √d, then α > √d. Thus, we may assume that γ ≤ √d.

Next, to simplify notation, we define (x_Δ, y_Δ) = (x⋆ − ∇_x f(x⋆, y⋆), y⋆ + ∇_y f(x⋆, y⋆)) and (x̂, ŷ) = argmin_{(x,y) ∈ P(A,b)} ‖(x_Δ, y_Δ) − (x, y)‖. Given that (x⋆, y⋆) is a γ-approximate fixed point of M, we have that

    ‖(x⋆, y⋆) − (x̂, ŷ)‖ < γ.    (B.6)

Using Theorem 1.5.5 (b) of [FP07], we get that

    ⟨(x_Δ, y_Δ) − (x̂, ŷ), (x, y) − (x̂, ŷ)⟩ ≤ 0 for all (x, y) ∈ P(A, b).    (B.7)

Next we show the following:

Claim B.2.
For all (x, y) ∈ P(A, b), ⟨(x_Δ, y_Δ) − (x⋆, y⋆), (x, y) − (x⋆, y⋆)⟩ < (G + 2√d) · γ.

Proof. We have that:

    ⟨(x_Δ, y_Δ) − (x⋆, y⋆), (x, y) − (x⋆, y⋆)⟩
      = ⟨(x_Δ, y_Δ) − (x̂, ŷ), (x, y) − (x⋆, y⋆)⟩ + ⟨(x̂, ŷ) − (x⋆, y⋆), (x, y) − (x⋆, y⋆)⟩
      = ⟨(x_Δ, y_Δ) − (x̂, ŷ), (x, y) − (x̂, ŷ)⟩ + ⟨(x_Δ, y_Δ) − (x̂, ŷ), (x̂, ŷ) − (x⋆, y⋆)⟩
        + ⟨(x̂, ŷ) − (x⋆, y⋆), (x, y) − (x⋆, y⋆)⟩
      < ‖(x_Δ, y_Δ) − (x̂, ŷ)‖ · γ + γ · √d
      ≤ (‖(x_Δ, y_Δ) − (x⋆, y⋆)‖ + γ) · γ + γ · √d
      = ‖∇f(x⋆, y⋆)‖ · γ + γ² + γ · √d
      ≤ (G + 2√d) · γ,

where (1) for the first inequality we use (B.6), (B.7), the Cauchy–Schwarz inequality, and the fact that the ℓ₂-diameter of P(A, b) is at most √d; (2) for the second inequality we use the triangle inequality and (B.6); (3) for the equality that follows we use the definition of (x_Δ, y_Δ); and (4) for the last inequality we use that G, the Lipschitzness of f, bounds the magnitude of its gradient, and that γ ≤ √d.

Now let x′ = argmin_{x ∈ K(y⋆)} ‖x − x_Δ‖, where K(y⋆) = {x | (x, y⋆) ∈ P(A, b)}. Using Theorem 1.5.5 (b) of [FP07] for x′ we get that ⟨x_Δ − x′, x⋆ − x′⟩ ≤ 0. Using Claim B.2 for the vector (x′, y⋆) ∈ P(A, b) we get that ⟨x⋆ − x_Δ, x⋆ − x′⟩ < (G + 2√d) · γ. Adding the last two inequalities and using the fact that γ = α²/(4(G + 2√d)), we get the following:

    ‖x⋆ − Π_K(y⋆)(x⋆ − ∇_x f(x⋆, y⋆))‖ < √((G + 2√d) · γ) = α/2.

Using the exact same reasoning we can also prove that ‖y⋆ − Π_K(x⋆)(y⋆ + ∇_y f(x⋆, y⋆))‖ < α/2, where K(x⋆) = {y | (x⋆, y) ∈ P(A, b)}. Combining the last two inequalities we get that (x⋆, y⋆) is an α-approximate fixed point of F_GDA.

C Missing Proofs from Section 8
In this section we present the missing proofs from Section 8; more precisely, in the following subsections we prove Lemmas 8.10, 8.11, and 8.12. For the rest of the proofs in this section we define L(c) to be the cubelet which has down-left corner equal to c, formally

    L(c) = [c_1/(N − 1), (c_1 + 1)/(N − 1)] × · · · × [c_d/(N − 1), (c_d + 1)/(N − 1)],

and we also define L_c(c) to be the set of corners of the cubelet L(c), or more formally L_c(c) = {c_1, c_1 + 1} × · · · × {c_d, c_d + 1}.

C.1 Proof of Lemma 8.10

We start with a lemma about the differentiability properties of the functions Q_cv which we defined in Definition 8.7.

Lemma C.1.
Let x ∈ [0, 1]^d lie in the cubelet R(x) = [c_1/(N−1), (c_1+1)/(N−1)] × ⋯ × [c_d/(N−1), (c_d+1)/(N−1)], where c ∈ ([N]−1)^d. Then for any vertex v ∈ R_c(x), the function Q^c_v(x) is continuous and twice differentiable. Moreover, if Q^c_v(x) = 0 then also dQ^c_v(x)/dx_i = 0 and d²Q^c_v(x)/dx_i dx_j = 0.

Proof. We recall from Definition 8.7 that if we let s^c = (s_1, …, s_d) be the source vertex of R(x) and p^c_x = (p_1, …, p_d) be the canonical representation of x, then for each vertex v ∈ R_c(x) we define the following partition of the set of coordinates [d]:

A^c_v = { j : |v_j − s_j| = 0 }   and   B^c_v = { j : |v_j − s_j| = 1 }.

Now in case B^c_v = ∅, which corresponds to v being the source vertex s^c, we have Q^c_v(x) = ∏_{j=1}^d S_∞(1 − S(p_j)), which is clearly differentiable as a product of compositions of differentiable functions. The exact same holds for A^c_v = ∅, which corresponds to v being the target vertex t^c of the cubelet R(x). We thus focus on the case where A^c_v, B^c_v ≠ ∅. To simplify notation we denote Q^c_v(x) by Q(x), A^c_v by A, and B^c_v by B for the rest of this proof. We prove that in case i ∈ B the derivative ∂Q(x)/∂x_i always exists; the case i ∈ A then follows symmetrically. We have the following cases.

▶ Let j ∈ A and ℓ ∈ B \ {i} be such that p_j ≥ p_ℓ. By Definition 8.7, if ε is sufficiently small then Q(x_i − ε, x_{−i}) = Q(x_i + ε, x_{−i}) = Q(x_i, x_{−i}) = 0. Thus ∂Q(x)/∂x_i exists and equals 0.

▶ Let p_ℓ > p_j for all ℓ ∈ B \ {i} and j ∈ A. In this case we have the following subcases.

  ▷ p_i > p_j for all j ∈ A: Then ∂Q(x)/∂x_i exists since both S_∞(·) and S(·) are differentiable.

  ▷ p_i < p_j for some j ∈ A: By Definition 8.7, if ε is sufficiently small then Q(x_i − ε, x_{−i}) = Q(x_i + ε, x_{−i}) = Q(x_i, x_{−i}) = 0. Thus ∂Q(x)/∂x_i exists and equals 0.

  ▷ p_i = p_j for some j ∈ A and p_i ≥ p_{j′} for all j′ ∈ A \ {j}: By Definition 8.7, if ε is sufficiently small then Q(x_i − ε, x_{−i}) = Q(x_i, x_{−i}) = 0, thus

lim_{ε → 0⁺} (Q(x_i, x_{−i}) − Q(x_i − ε, x_{−i})) / ε = 0.

Moreover, lim_{ε → 0⁺} (Q(x_i + ε, x_{−i}) − Q(x_i, x_{−i})) / ε exists and equals 0 since S_∞(·) and S(·) are differentiable functions and S_∞(S(p_i) − S(p_j)) = S_∞(0) = 0, S′_∞(S(p_i) − S(p_j)) = S′_∞(0) = 0.

We next consider second derivatives. Let Q′(x) denote ∂Q(x)/∂x_k for convenience. As in the previous analysis, in case A^c_v = ∅ or B^c_v = ∅ the function Q′(x) is differentiable with respect to x_i, since S(·) and S_∞(·) are twice differentiable. Thus we again focus on the case where A, B ≠ ∅. Notice that by the previous analysis Q′(x) = 0 whenever there exist ℓ ∈ B and j ∈ A such that p_j ≥ p_ℓ. Without loss of generality we assume that i ∈ B, and we prove that ∂Q′(x)/∂x_i ≜ ∂²Q(x)/∂x_i ∂x_k always exists.

▶ Let j ∈ A and ℓ ∈ B \ {i} be such that p_j ≥ p_ℓ. By Definition 8.7, Q′(x_i − ε, x_{−i}) = Q′(x_i + ε, x_{−i}) = Q′(x_i, x_{−i}) =
0. Thus ∂Q′(x)/∂x_i ≜ ∂²Q(x)/∂x_i ∂x_k exists and equals 0.

▶ Let p_ℓ > p_j for all ℓ ∈ B \ {i} and j ∈ A.

  ▷ p_i > p_j for all j ∈ A: Then ∂Q′(x)/∂x_i ≜ ∂²Q(x)/∂x_i ∂x_k exists since both S_∞(·) and S(·) are twice differentiable.

  ▷ p_i < p_j for some j ∈ A: By Definition 8.7, Q′(x_i − ε, x_{−i}) = Q′(x_i + ε, x_{−i}) = Q′(x_i, x_{−i}) = 0. Thus ∂Q′(x)/∂x_i ≜ ∂²Q(x)/∂x_i ∂x_k exists and equals 0.

  ▷ p_i = p_j for some j ∈ A and p_i > p_{j′} for all j′ ∈ A \ {j}: By Definition 8.7, if ε is sufficiently small then Q′(x_i − ε, x_{−i}) = 0, thus lim_{ε → 0⁺} (Q′(x_i, x_{−i}) − Q′(x_i − ε, x_{−i})) / ε = 0. Moreover, lim_{ε → 0⁺} (Q′(x_i + ε, x_{−i}) − Q′(x_i, x_{−i})) / ε exists since both S_∞(·) and S(·) are twice differentiable; it also equals 0 since S_∞(S(p_i) − S(p_j)) = S_∞(0) = 0, S′_∞(S(p_i) − S(p_j)) = S′_∞(0) = 0, S″_∞(0) = 0, and S(0) = 0. For these properties of S_∞ and S we use Lemma 8.3.

So far we have established that the functions Q^c_v(x) are twice differentiable when x moves within the same cubelet. Next we show that when x moves from one cubelet to another, the corresponding functions Q^c_v change value smoothly.

Lemma C.2.
Let x ∈ [0, 1]^d be such that there exists a coordinate i ∈ [d] with the property that

R(x_i + ε, x_{−i}) = [c_1/(N−1), (c_1+1)/(N−1)] × ⋯ × [c_d/(N−1), (c_d+1)/(N−1)]  and
R(x_i − ε, x_{−i}) = [c′_1/(N−1), (c′_1+1)/(N−1)] × ⋯ × [c′_d/(N−1), (c′_d+1)/(N−1)],

with c, c′ ∈ ([N−1]−1)^d and ε sufficiently small, i.e., x lies on the boundary of two cubelets. Then the following statements hold.

1. For all vertices v ∈ R_c(x_i + ε, x_{−i}) ∩ R_c(x_i − ε, x_{−i}), it holds that
   (a) Q^c_v(x) = Q^{c′}_v(x),
   (b) ∂Q^c_v(x)/∂x_i = ∂Q^{c′}_v(x)/∂x_i for all i ∈ [d], and
   (c) ∂²Q^c_v(x)/∂x_i ∂x_j = ∂²Q^{c′}_v(x)/∂x_i ∂x_j for all i, j ∈ [d].

2. For all vertices v ∈ R_c(x_i + ε, x_{−i}) \ R_c(x_i − ε, x_{−i}), it holds that Q^c_v(x) = ∂Q^c_v(x)/∂x_i = ∂²Q^c_v(x)/∂x_i ∂x_j = 0.

3. For all vertices v ∈ R_c(x_i − ε, x_{−i}) \ R_c(x_i + ε, x_{−i}), it holds that Q^{c′}_v(x) = ∂Q^{c′}_v(x)/∂x_i = ∂²Q^{c′}_v(x)/∂x_i ∂x_j = 0.

Lemma C.2 is crucial since it establishes that P_v(x) is continuous and twice differentiable even when x moves from one cubelet to another. Since the proof of Lemma C.2 is long and contains the proofs of some sublemmas, we postpone it to Section C.1.1 at the end of this section. We now proceed with the proof of Lemma 8.10.

Proof of Lemma 8.10. We first prove that P_v(x) is a continuous function. Let x ∈ [
0, 1 ] d lying onthe boundary of the following cubelets (cid:34) c ( ) N − c ( ) + N − (cid:35) × · · · × (cid:34) c ( ) d N − c ( ) d + N − (cid:35) · · · (cid:34) c ( i ) N − c ( i ) + N − (cid:35) × · · · × (cid:34) c ( i ) d N − c ( i ) d + N − (cid:35) · · · (cid:34) c ( m ) N − c ( m ) + N − (cid:35) × · · · × (cid:34) c ( m ) d N − c ( m ) d + N − (cid:35) .where c ( ) , . . . , c ( m ) ∈ ([ N − ] − ) d . This means that for every i ∈ [ m ] there exists a coordinate j i ∈ [ d ] and a value η i ∈ R with sufficiently small absolute value such that R ( x j i + η i , x − j i ) = (cid:34) c ( i ) N − c ( i ) + N − (cid:35) × · · · × (cid:34) c ( i ) d N − c ( i ) d + N − (cid:35) .We then consider the following cases. (cid:73) v / ∈ ∪ mi = R c ( x j i + η i , x − j i ) . By Definition 8.9, in all the m aforementioned cubelets, thecoefficient P v takes value 0 and hence it is continuous in this part of the space. (cid:73) v ∈ ∩ j ∈ U R c ( x j i + η i , x − j i ) and v / ∈ ∪ i ∈ U R c ( x j i + η i , x − j i ) , for some U ⊆ [ m ] with U = [ m ] \ U .In this case P v ( x j i + η i , x j i ) was computed according to a cubelet with v ∈ R c ( x j i + η i , x − j i ) .Then Lemma C.2 implies that Q c ( i ) v ( x ) = v ∈ R c ( x j i + η i , x − j i ) \ R c ( x j i (cid:48) + η i (cid:48) , x − j i (cid:48) ) where i (cid:48) ∈ [ m ] and i (cid:54) = i (cid:48) . Therefore we conclude that P v ( x ) = η i → P v ( x j i + η i , x − i ) = (cid:73) v ∈ ∩ mi = R c ( x j i + η i , x − j i ) . 
By Lemma C.2 for all i ∈ [ m ] it holds that Q c ( i ) v ( x ) ∑ v ∈ R c ( x ji + η i , x − ji ) Q c ( i ) v ( x ) = Q c ( i ) v ( x ) ∑ v ∈∩ mi = R c ( x ji + η i , x − ji ) Q c ( i ) v ( x )= Q c ( i (cid:48) ) v ( x ) ∑ v ∈∩ mi = R c ( x ji + η i , x − ji ) Q c ( i (cid:48) ) v ( x ) = Q c ( i (cid:48) ) v ( x ) ∑ v ∈ R c ( x ji + η i , x − ji ) Q c ( i (cid:48) ) v ( x ) which again implies the continuity of P v ( x ) at x .Next we prove that P v ( x ) is differentiable for all v ∈ ([ N ] − ) d . Fix some i ∈ [ d ] we willprove that ∂ P ( x ) ∂ x i always exists. Let C + be the set of down-left corners of the cubelets in whichlim ε → + ( x i + ε , x − i ) belongs to and C − be the set of down-left corners of the cubelets in whichlim ε → + ( x i − ε , x − i ) belongs to. It easy to see that C + and C − are non-empty and fixed for ε > ∂ P v ( x ) ∂ x i always exists, we consider the following 3 mutually exclusive cases.64 v ∈ L c ( c ( ) ) for c ( ) ∈ C + and v ∈ L c ( c ( ) ) for c ( ) ∈ C − . Since the coefficient P v ( x ) is a con-tinuous function, we have that (cid:46) lim ε → + P v ( x i + ε , x − i ) − P v ( x i , x − i ) ε = ∂ Q c ( ) v ( x ) ∂ xi ∑ v (cid:48)∈ Lc ( c ( )) Q c ( ) v (cid:48) ( x ) − Q c ( ) v ( x ) ∑ v (cid:48)∈ Lc ( c ( )) ∂ Q c ( ) v (cid:48) ( x ) ∂ xi (cid:16) ∑ v (cid:48)∈ Lc ( c ( )) Q c ( ) v (cid:48) ( x ) (cid:17) (cid:46) lim ε → + P v ( x i , x − i ) − P v ( x i − ε , x − i ) ε = ∂ Q c ( ) v ( x ) ∂ xi ∑ v (cid:48)∈ Lc ( c ( )) Q c ( ) v (cid:48) ( x ) − Q c ( ) v ( x ) ∑ v (cid:48)∈ Lc ( c ( )) ∂ Q c ( ) v (cid:48) ( x ) ∂ xi (cid:16) ∑ v (cid:48)∈ Lc ( c ( )) Q c ( ) v (cid:48) ( x ) (cid:17) Both of the above limits exists due to the fact that Q cv ( x ) is differentiable (Lemma C.1).Moreover, since v ∈ L c ( c ( ) ) ∩ L c ( c ( ) ) , Case 1 of Lemma C.2 implies that the two limitsabove have exactly the same value and hence P v is differentiable at x . (cid:73) v / ∈ L c ( c ( ) ) for all c ( ) ∈ C + . 
In the case where v / ∈ L c ( c ) for all the down-left corners c of the cubelets at which x lies, then by Definition 8.9 P v ( x i , x − i ) = P v ( x i + ε , x − i ) = P v ( x i − ε , x − i ) =
0. Thus ∂ P v ( x ) ∂ x i exists and equals 0. Therefore we may assume that v ∈ L c ( c ) for some down-left corner c of a cubelet at which x lies. Due to the fact that P v ( x ) is acontinuous function and that v / ∈ L c ( c ( ) ) for all c ( ) ∈ C + , we get that P v ( x i + ε , x − i ) = P v ( x i , x − i ) = v ∈ L c ( c ) / L c c ( ) where c , c ( ) are down-left corners of cubelets at which x lies and ( x i + ε , x − i ) lies respectively. Therefore we get by Case 1 of Lemma C.2 that Q cv ( x ) = P v ( x i , x − i ) =
0. As a result,lim ε → + P v ( x i + ε , x − i ) − P v ( x i , x − i ) ε = ε → + P v ( x i , x − i ) − P v ( x i − ε , x − i ) ε exists and equals 0. At first observethat 0 ≤ x i − c i ≤ δ since x lies in the cubelet with down-left corner c . In case x i − c i < δ then ( x i + ε , x − i ) lies in c for arbitrarily small ε , meaning that c ∈ C + . The latter contradictsthe fact that v / ∈ L c c ( ) for all c ( ) ∈ C + . As a result, x i − c i = δ which implies that c ∈ C − and hencelim ε → + P v ( x i , x − i ) − P v ( x i − ε , x − i ) ε = ∂ Q cv ( x ) ∂ x i ∑ v (cid:48) ∈ L c ( c ) Q cv (cid:48) ( x ) − Q cv ( x ) ∑ v (cid:48) ∈ L c ( c ) ∂ Q cv (cid:48) ( x ) ∂ x i (cid:16) ∑ v (cid:48) ∈ L c ( c ) Q cv (cid:48) ( x ) (cid:17) .The above limit equals to 0 since Q cv ( x ) = ∂ Q cv ( x ) ∂ x i = v ∈ L c ( c ) \ L c ( c ( ) ) . (cid:73) v / ∈ L c ( c ( ) ) for all c ( ) ∈ C − . Symmetrically with the previous case.The second order differentiability of P v ( x ) can be established using exactly the same argumentsfor computing the following limitlim ε , ε (cid:48) → P v ( x i + ε , x j + ε (cid:48) , x − i , j ) − P v ( x ) ε .65he last thing that we need to show to prove Lemma 8.10 is that the set R + ( x ) has cardinalityat most d + ( d ) time. Let p cx ∈ [
0, 1]^d be the canonical representation of x with respect to a cubelet L(c) to which x belongs. We define the source vertex s^c = (s_1, …, s_d) and the target vertex t^c = (t_1, …, t_d) of L(c). Once this is done, the vertices in R_+(x) are exactly the vertices v of L_c(c) for which it holds that p_ℓ > p_j for all ℓ ∈ B^c_v and j ∈ A^c_v, since for all other v ∈ ([N]−1)^d it holds that Q^c_v(x) = 0, ∇Q^c_v(x) = 0, and ∇²Q^c_v(x) = 0. The set R_+(x) can be computed in polynomial time as follows: i) the coordinates p_1, …, p_d are sorted in increasing order, and ii) for each m = 0, …, d we compute the vertex v^(m) ∈ L_c(c) with

v^(m)_j = s_j if coordinate j belongs in the first m coordinates w.r.t. the order of p^c_x,
v^(m)_j = t_j if coordinate j belongs in the last d − m coordinates w.r.t. the order of p^c_x.

By Definition 8.7 it immediately follows that R_+(x) ⊆ {v^(0), …, v^(d)}, from which we get that |R_+(x)| ≤ d + 1 and that R_+(x) can be computed in polynomial time.

To finish the proof of Lemma 8.10 we only need the proof of Lemma C.2, which we present in the following section.

C.1.1 Proof of Lemma C.2

Lemma C.3.
Let a point x ∈ [0, 1]^d lie on the boundary of the cubelets with down-left corners c = (c_1, …, c_{m−1}, c_m, c_{m+1}, …, c_d) and c′ = (c_1, …, c_{m−1}, c_m + 1, c_{m+1}, …, c_d). Then the canonical representation of x in the cubelet L(c) is the same as the canonical representation of x in the cubelet L(c′). More precisely, p^c_x = p^{c′}_x.

Proof. Let c_m be even. By the definition of the canonical representation in Definition 8.6, the sources and targets of the cubelets L(c) and L(c′) are respectively

s^c = (s_1, …, s_{m−1}, c_m, s_{m+1}, …, s_d),
t^c = (t_1, …, t_{m−1}, c_m + 1, t_{m+1}, …, t_d),
s^{c′} = (s_1, …, s_{m−1}, c_m + 2, s_{m+1}, …, s_d),
t^{c′} = (t_1, …, t_{m−1}, c_m + 1, t_{m+1}, …, t_d).

Hence we get that p_j = p′_j for all j ≠ m. Since x belongs to the boundary of both cubelets L(c) and L(c′), its m-th coordinate lies on their common facet, and hence p_m = p′_m = 1. In case c_m is odd we similarly get that p^c_x = p^{c′}_x, but with p_m = p′_m = 0.

Lemma C.4.
Let x ∈ [
0, 1 ] d lying at the intersection of the cubelets L ( c ) , L ( c (cid:48) ) with down-left corners c = ( c , . . . , c m − , c m , c m + , . . . , c d ) , and c (cid:48) = ( c , . . . , c m − , c m + c m + , . . . , c d ) . Then the followingstatements are true.1. For all vertices v ∈ L c ( c ) ∩ L c ( c (cid:48) ) it holds that(a) Q cv ( x ) = Q c (cid:48) v ( x ) , b) ∂ Q cv ( x ) ∂ x i = ∂ Q c (cid:48) v ( x ) ∂ x i ,(c) ∂ Q cv ( x ) ∂ x i ∂ x j = ∂ Q c (cid:48) v ( x ) ∂ x i ∂ x j .2. For all vertices v ∈ L c ( c ) \ L c ( c (cid:48) ) it holds that Q cv ( x ) = ∂ Q cv ( x ) ∂ x i = ∂ Q cv ( x ) ∂ x i ∂ x j = .3. For all vertices v ∈ L c ( c (cid:48) ) / L c ( c ) it holds that Q c (cid:48) v ( x ) = ∂ Q c (cid:48) v ( x ) ∂ x i = ∂ Q c (cid:48) v ( x ) ∂ x i ∂ x j = .Proof.
1. Let v ∈ L c ( c ) ∩ L c ( c (cid:48) ) then we have that(a) Q cv ( x ) = Q c (cid:48) v ( x ) . By Lemma C.3 we get that the canonical representation p cx = p c (cid:48) x .Since Q cv ( x ) is a function of the canonical representation p cx (see Definition 8.9), itholds that Q cv ( x ) = Q c (cid:48) v ( x ) for all vertices v ∈ L c ( c ) ∩ L c ( c (cid:48) ) .(b) ∂ Q cv ( x ) ∂ x i = ∂ Q c (cid:48) v ( x ) ∂ x i . For i (cid:54) = m , we get that ∂ Q cv ( x ) ∂ x i = t i − s i ∂ Q cv ( x ) ∂ p i = t (cid:48) i − s (cid:48) i ∂ Q c (cid:48) v ( x ) ∂ p (cid:48) i = ∂ Q c (cid:48) v ( x ) ∂ x i since t i = t (cid:48) i and s i = s (cid:48) i for all i (cid:54) = m . The latter argument cannot be applied for the m -thcoordinate since t m − s m = − ( t (cid:48) m − s (cid:48) m ) . However since x belongs to the boundary ofboth the cubelets L ( c ) and L ( c (cid:48) ) it is implied that p m = p (cid:48) m is either 0 or 1, meaningthat ∂ Q cv ( x ) ∂ x m = ∂ Q c (cid:48) v ( x ) ∂ x m = S (cid:48) ( ) = S (cid:48) ( ) = ∂ Q cv ( x ) ∂ x i ∂ x j = ∂ Q c (cid:48) v ( x ) ∂ x i ∂ x j . For i , j (cid:54) = m , we get that ∂ Q cv ( x ) ∂ x i ∂ x j = t i − s i t j − s j ∂ Q cv ( x ) ∂ p i ∂ p j = t (cid:48) i − s (cid:48) i t (cid:48) j − s (cid:48) j ∂ Q c (cid:48) v ( x ) ∂ p (cid:48) i ∂ p (cid:48) j = ∂ Q c (cid:48) v ( x ) ∂ x i ∂ x j since t i = t (cid:48) i and s i = s (cid:48) i for all i (cid:54) = m . As in the previous case, p m = p (cid:48) m equalseither 0 or 1. As a result, ∂ Q cv ( x ) ∂ x m ∂ x j = ∂ Q c (cid:48) v ( x ) ∂ x m ∂ x j = S (cid:48) ( ) = S (cid:48) ( ) = S (cid:48)(cid:48) ( ) = S (cid:48)(cid:48) ( ) = v ∈ L c ( c ) \ L c ( c (cid:48) ) , we get that v m = c m . In case c m is even, we get that s m = c m = v m and thus the coordinate the coordinate m belongs in the set A cv . 
Since x lies on the common facet of the cubelets L(c) and L(c′), we get that p_m = 1. The latter combined with m ∈ A^c_v implies that Q^c_v(x) = 0, and hence, by Lemma C.1, ∂Q^c_v(x)/∂x_i = ∂²Q^c_v(x)/∂x_i ∂x_j = 0. In case c_m is odd, we get that s_m = c_m + 1. The latter combined with the fact that v_m = c_m implies that the m-th coordinate belongs in B^c_v. Now p_m = 0 and hence Q^c_v(x) = 0. Then again by Lemma C.1, ∂Q^c_v(x)/∂x_i = ∂²Q^c_v(x)/∂x_i ∂x_j = 0.

Proof of Lemma C.2.
1. Let v ∈ L_c(c) ∩ L_c(c′). There exists a sequence of corners c = c^(1), …, c^(m) = c′ such that ‖c^(j) − c^(j+1)‖_1 = 1 and v ∈ L_c(c^(j)) for all j ∈ [m]. By Lemma C.4 we get that
   (a) Q^{c^(j)}_v(x) = Q^{c^(j+1)}_v(x),
   (b) ∂Q^{c^(j)}_v(x)/∂x_i = ∂Q^{c^(j+1)}_v(x)/∂x_i,
   (c) ∂²Q^{c^(j)}_v(x)/∂x_i ∂x_j = ∂²Q^{c^(j+1)}_v(x)/∂x_i ∂x_j,
which implies Case 1 of Lemma C.2.

2. Let v ∈ L_c(c) \ L_c(c′). There exists a sequence of corners c = c^(1), …, c^(i) such that ‖c^(j) − c^(j+1)‖_1 = 1, v ∉ L_c(c^(i)), and v ∈ L_c(c^(j)) for all j < i. By case 2 of Lemma C.4 we get that Q^{c^(i−1)}_v(x) = ∂Q^{c^(i−1)}_v(x)/∂x_i = ∂²Q^{c^(i−1)}_v(x)/∂x_i ∂x_j = 0. Then case 2 of Lemma C.2 follows by case 1 of Lemma C.4.

3. Similarly with case 2.
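The vertex-enumeration step used at the end of the proof of Lemma 8.10 — sort the coordinates of the canonical representation and switch them from target to source one at a time — can be sketched in code. This is a hypothetical illustration: the names `candidate_vertices`, `s`, `t`, and `p` are ours, and we pass the source and target vertices in directly instead of deriving them from Definition 8.6.

```python
def candidate_vertices(p, s, t):
    """Enumerate the d+1 candidate vertices v^(0), ..., v^(d) of a cubelet.

    Coordinates are sorted by increasing canonical value p_j; the vertex
    v^(m) assigns the source coordinate s_j to the first m coordinates in
    that order and the target coordinate t_j to the remaining d - m."""
    d = len(p)
    order = sorted(range(d), key=lambda j: p[j])  # indices by increasing p_j
    vertices = []
    for m in range(d + 1):
        v = list(t)
        for j in order[:m]:
            v[j] = s[j]
        vertices.append(tuple(v))
    return vertices

# Toy example on a single cubelet, source at the down-left corner:
s = (0, 0, 0)          # source vertex (an assumption for the example)
t = (1, 1, 1)          # target vertex
p = (0.7, 0.2, 0.5)    # canonical representation of some x in the cubelet
vs = candidate_vertices(p, s, t)
assert len(vs) == len(p) + 1        # at most d + 1 candidates, as in Lemma 8.10
assert vs[0] == t and vs[-1] == s   # m = 0 gives the target, m = d the source
```

Since `R_+(x)` is contained in this list, the sketch makes the `|R_+(x)| ≤ d + 1` bound and the polynomial running time of the enumeration concrete: one sort plus d + 1 vertex constructions.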
C.2 Proof of Lemma 8.11

We start this section with some fundamental properties of the smooth step function S_∞ that are more fine-grained than the properties we presented in Lemma 8.3.

Lemma C.5.
For d ≥ 10 there exists a universal constant c > 0 such that the following statements hold.

1. If x ≥ 1/d then S_∞(x) ≥ c · 2^{−d}.
2. If x ≤ 1/d then S′_∞(x) ≤ c · d² · 2^{−d}.
3. If x ≥ 1/d then S′_∞(x)/S_∞(x) ≤ c · d².
4. If x ≤ 1/d then |S″_∞(x)| ≤ c · d⁴ · 2^{−d}.
5. If x ≥ 1/d then |S″_∞(x)|/S_∞(x) ≤ c · d⁴.

Proof. We compute the derivative of S_∞ and we have that

S′_∞(x) = ln(2) · S_∞(x) · S_∞(1 − x) · (1/x² + 1/(1 − x)²),

from which we immediately get S′_∞(x) ≥
0. Then we can compute the second derivative of S ∞ as follows S (cid:48)(cid:48) ∞ ( x ) = ln ( ) S ∞ ( x ) S ∞ ( − x ) ·· (cid:32) ln ( ) ( S ∞ ( − x ) − S ∞ ( x )) (cid:18) x + ( − x ) (cid:19) − (cid:32) x − ( − x ) (cid:33)(cid:33) .We next want to prove that S (cid:48)(cid:48) ∞ ( x ) ≥ x ≤ − · S ∞ ( x ) ≥ x ≤ d and therefore S (cid:48)(cid:48) ∞ ( x ) ≥ ln ( ) x S ∞ ( x ) S ∞ ( − x ) (cid:18) ln ( ) x − (cid:19) hence for x ≤
4/ ln ( ) it holds that S (cid:48)(cid:48) ∞ ( x ) ≥
0. By similar but more tedious calculations we canconclude that S (cid:48)(cid:48)(cid:48) ∞ ( x ) ≥ x ≤ x ∈ [
0, 1/10 ] all the functions S ∞ , S (cid:48) ∞ , S (cid:48)(cid:48) ∞ are all increasing functions of x . 68ext we show that the function h ( x ) = − x + − ( − x ) is upper and lower bounded. Firstobserve that h ( x ) ≥ max { − x , 2 − ( − x ) } . Now if we set t ( x ) = − x then t (cid:48) ( x ) = ln ( ) t ( x ) / x and hence t ( x ) ≥ t ( ) = x ≥ − ( − x ) ≥ x ≤ h ( x ) ≥ x ∈ [
0, 1 ] . Also it is not hard to see that 2 − x ≤ − ( − x ) ≤ h ( x ) ≤
1. Hence overall we have that h ( x ) ∈ [ ] for all x ∈ [
0, 1 ] . We are now ready to prove the statements.1. We have shown that S (cid:48) ∞ ( x ) ≥ x ∈ [
0, 1 ] . Hence S ∞ is an increasing function andtherefore S ∞ ( x ) ≥ S ∞ ( d ) for x ≥ d . Now we have that S ∞ ( d ) = − d / h ( d ) ≥ − d .2. Since S (cid:48) ∞ ( x ) is increasing for x ∈ [
0, 1/10 ] , we have that S (cid:48) ∞ ( x ) ≤ S (cid:48) ∞ ( d ) for x ≤ d andtherefore S (cid:48) ∞ ( x ) ≤ ln ( ) S ∞ ( − d ) S ∞ ( d ) (cid:32) d + (cid:0) − d (cid:1) (cid:33) ≤ ( ) − d h ( d ) ≤ ( ) − d .3. We have that for x ≤ dS (cid:48) ∞ ( x ) S ∞ ( x ) = ln ( ) S ∞ ( − x ) (cid:18) x + ( − x ) (cid:19) ≤ ( ) x ≤ ( ) d .4. Follows directly from the statement 1., the fact that S (cid:48)(cid:48) ∞ ( x ) is increasing for x ∈ [
0, 1/10], and the above expression of S″_∞, this statement follows.

5. This statement follows using the same reasoning as statement 3.

In this section we establish the bounds on the gradient and the Hessian of P_v(x). These bounds are formally stated in Lemma 8.11, the proof of which is the main goal of this section.

Lemma 8.11.
For any vertex v ∈ ([N]−1)^d, it holds that

1. |∂P_v(x)/∂x_i| ≤ Θ(d/δ),
2. |∂²P_v(x)/∂x_i ∂x_j| ≤ Θ(d/δ).

In order to prove Lemma 8.11, we first introduce several technical lemmas.
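Before the technical lemmas, a toy numerical sketch of the coefficients P_v on a single cubelet may help fix ideas. This is a hypothetical illustration under stated assumptions: `s_smooth` is an assumed stand-in for the step function S (any smooth increasing step serves the illustration), `s_inf` follows the 2^{−1/x} form that appears in the proof of Lemma C.5, and the source and target are fixed to the all-zeros and all-ones corners; none of these choices are claimed to match the paper's exact construction. The sketch only illustrates that the P_v form a convex combination supported on at most d + 1 corners, as established in Lemma 8.10.

```python
import itertools
import math

def s_smooth(x):
    # assumed smooth increasing step on [0, 1] with S(0) = 0, S(1) = 1
    return 3 * x**2 - 2 * x**3

def s_inf(z):
    # smooth step built from 2^(-1/z): exactly 0 below 0, exactly 1 above 1,
    # and flat (vanishing derivatives) at both ends
    if z <= 0.0:
        return 0.0
    if z >= 1.0:
        return 1.0
    a = 2.0 ** (-1.0 / z)
    b = 2.0 ** (-1.0 / (1.0 - z))
    return a / (a + b)

def coefficients(p):
    """Coefficients P_v over the corners v of the unit cube [0, 1]^d.

    A collects coordinates where v agrees with the source (all-zeros corner),
    B those where v agrees with the target (all-ones corner)."""
    d = len(p)
    Q = {}
    for v in itertools.product((0, 1), repeat=d):
        A = [j for j in range(d) if v[j] == 0]
        B = [j for j in range(d) if v[j] == 1]
        if not B:        # v is the source vertex
            q = math.prod(s_inf(1 - s_smooth(p[j])) for j in A)
        elif not A:      # v is the target vertex
            q = math.prod(s_inf(s_smooth(p[l])) for l in B)
        else:
            q = math.prod(s_inf(s_smooth(p[l]) - s_smooth(p[j]))
                          for l in B for j in A)
        Q[v] = q
    total = sum(Q.values())
    return {v: q / total for v, q in Q.items()}

P = coefficients((0.3, 0.8))
assert abs(sum(P.values()) - 1.0) < 1e-12        # a convex combination
assert sum(1 for q in P.values() if q > 0) <= 3  # at most d + 1 = 3 active corners
```

In this two-dimensional example only the corners compatible with the ordering p_2 > p_1 get positive weight; the corner (1, 0), which would need p_1 > p_2, is killed by the clamped `s_inf`, mirroring the p_ℓ > p_j condition in the text.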
Lemma C.6.
Let x ∈ [
0, 1 ] d lying in cublet L ( c ) , with c ∈ ([ N ] − ) d and let p cx = ( p , . . . , p d ) be thecanonical representation of x . Then for all vertices v ∈ L c ( c ) , it holds that (cid:12)(cid:12)(cid:12)(cid:12) ∂ Q cv ( x ) ∂ p i (cid:12)(cid:12)(cid:12)(cid:12) ≤ Θ ( d ) · ∑ v ∈ V c Q cv ( x ) .69 roof. To simplify notation we use Q v ( x ) instead of Q cv ( x ) , A instead of A cv and B instead of B cv for the rest of the proof. Without loss of generality we assume that for all j ∈ A and (cid:96) ∈ B , p (cid:96) > p j since otherwise ∂ Q cv ( x ) ∂ p i = i ∈ B (symmetrically for i ∈ A ) then, (cid:12)(cid:12)(cid:12)(cid:12) ∂ Q cv ( x ) ∂ p i (cid:12)(cid:12)(cid:12)(cid:12) == ∏ (cid:96) (cid:54) = i ∏ j ∈ A S ∞ ( S ( p (cid:96) ) − S ( p j )) · ∑ j ∈ A (cid:12)(cid:12) S (cid:48) ∞ ( S ( p i ) − S ( p j )) (cid:12)(cid:12) ∏ j (cid:48) ∈ A / { j } S ∞ ( S ( p i ) − S ( p j (cid:48) )) S (cid:48) ( p i ) ≤ ∑ j ∈ A (cid:12)(cid:12) S (cid:48) ∞ ( S ( p i ) − S ( p j )) (cid:12)(cid:12) · ∏ ( j (cid:48) , (cid:96) ) (cid:54) =( j , i ) S ∞ ( S ( p (cid:96) ) − S ( p j (cid:48) )) where the last inequality follows by the fact that | S (cid:48) ( · ) | ≤
6. Since | A | ≤ d the proof of the lemmawill be completed if we are able to show that for any j ∈ A , it holds that (cid:12)(cid:12) S (cid:48) ∞ ( S ( p i ) − S ( p j )) (cid:12)(cid:12) · ∏ ( j (cid:48) , (cid:96) ) (cid:54) =( j , i ) S ∞ ( S ( p (cid:96) ) − S ( p j (cid:48) )) ≤ Θ ( d ) · ∑ v (cid:48) ∈ L c ( c ) Q v (cid:48) ( x ) In case S ( p i ) − S ( p j ) ≥ d then by case 3. of Lemma C.5 we get that (cid:12)(cid:12) S (cid:48) ∞ ( S ( p i ) − S ( p j )) (cid:12)(cid:12) ≤ c · d · S ∞ ( S ( p i ) − S ( p j )) , which implies gthe following (cid:12)(cid:12) S (cid:48) ∞ ( S ( p i ) − S ( p j )) (cid:12)(cid:12) · ∏ ( j (cid:48) , (cid:96) ) (cid:54) =( j , i ) S ∞ ( S ( p (cid:96) ) − S ( p j (cid:48) )) ≤≤ c · d · S ∞ ( S ( p i ) − S ( p j )) · ∏ ( j (cid:48) , (cid:96) ) (cid:54) =( j , i ) S ∞ ( S ( p (cid:96) ) − S ( p j (cid:48) ))= c · d · Q v ( x ) ≤ c · d · ∑ v (cid:48) ∈ L c ( c ) Q v (cid:48) ( x ) Now consider the case where S ( p i ) − S ( p j ) ≤ d . Using case 2. of Lemma C.5, we have that (cid:12)(cid:12) S (cid:48) ∞ ( S ( p i ) − S ( p j )) (cid:12)(cid:12) · ∏ ( j (cid:48) , (cid:96) ) (cid:54) =( j , i ) S ∞ ( S ( p (cid:96) ) − S ( p j (cid:48) )) ≤ (cid:12)(cid:12) S (cid:48) ∞ ( S ( p i ) − S ( p j )) (cid:12)(cid:12) ≤ Θ ( d · − d ) Consider the sequence of points in the [
0, 1 ] interval 0, p , . . . , p d , 1. There always exist two con-secutive points with distance greater that 1/ ( d + ) . As a result, there exists v ∗ ∈ L c ( c ) such that p (cid:96) − p j ≥ ( d + ) for all (cid:96) ∈ B v ∗ and j ∈ A v ∗ . Then S ( p (cid:96) ) − S ( p j ) ≥ ( d + ) and by case 1.of Lemma C.5, S ∞ ( S ( p (cid:96) ) − S ( p j )) ≥ c − ( d + ) . If we also use the fact that | A v ∗ | · | B v ∗ | ≤ d , weget that Q v ∗ ( x ) ≥ ( c · − ( d + ) ) d = c d − ( d + ) · d .Then it holds that 1 Q v ∗ ( x ) · (cid:12)(cid:12) S (cid:48) ∞ ( S ( p i ) − S ( p j )) (cid:12)(cid:12) · ∏ ( j (cid:48) , (cid:96) ) (cid:54) =( j , i ) S ∞ ( S ( p (cid:96) ) − S ( p j (cid:48) )) ≤≤ Θ (cid:18) d · (cid:16) ( c ) · − d +( d + ) (cid:17) d (cid:19) ≤ Θ ( d ) .Combining the later with the discussion in the rest of the proof the lemma follows.70 emma C.7. For any vertex v ∈ ([ N ] − ) d it holds that (cid:12)(cid:12)(cid:12) ∂ P v ( x ) ∂ x i (cid:12)(cid:12)(cid:12) ≤ Θ (cid:0) d / δ (cid:1) .Proof. To simplify notation we use Q v ( x ) instead of Q cv ( x ) for the rest of the proof. Without lossof generality we assume that x lies on a cubelet L ( c ) with c ∈ ([ N ] − ) d and v ∈ L c ( c ) , sinceotherwise ∂ P v ( x ) ∂ x i =
0. Let p cx = ( p , . . . , p d ) be the canonical representation of x in the cubelet L ( c ) . Then it holds that (cid:12)(cid:12)(cid:12)(cid:12) ∂ P v ( x ) ∂ p i (cid:12)(cid:12)(cid:12)(cid:12) = (cid:12)(cid:12)(cid:12) ∂ Q v ( x ) ∂ p i · (cid:104) ∑ v (cid:48) ∈ L c ( c ) Q v (cid:48) ( x ) (cid:105) − Q v ( x ) · (cid:104) ∑ v (cid:48) ∈ L c ( c ) ∂ Q v (cid:48) ( x ) ∂ p i (cid:105)(cid:12)(cid:12)(cid:12) ( ∑ v (cid:48) ∈ L c ( c ) Q v (cid:48) ( x )) ≤ (cid:12)(cid:12)(cid:12) ∂ Q v ( x ) ∂ p i (cid:12)(cid:12)(cid:12) ∑ v (cid:48) ∈ L c ( c ) Q v (cid:48) ( x ) + ∑ v (cid:48) ∈ L c ( c ) (cid:12)(cid:12)(cid:12) ∂ Q v (cid:48) ( x ) ∂ p i (cid:12)(cid:12)(cid:12) ∑ v (cid:48) ∈ L c ( c ) Q v (cid:48) ( x ) ≤ ( d + ) · Θ ( d ) = Θ ( d ) where the last inequality follows by Lemma C.6 and the fact that at most d + v of L c ( c ) have non-zero gradient as we have proved in Lemma 8.10. Then the proof of Lemma C.7 followsby the fact that p i = x i − s i t i − s i . Lemma C.8.
Let c ∈ ([ N ] − ) d and v ∈ L c ( c ) then it holds that (cid:12)(cid:12)(cid:12) ∂ Q cv ( x ) ∂ p i ∂ p j (cid:12)(cid:12)(cid:12) ≤ Θ ( d ) · ∑ v ∈ R c ( x ) Q cv ( x ) .Proof. To simplify the notation we use CS ( p (cid:96) − p m ) to denote S ∞ ( S ( p (cid:96) ) − S ( p m )) , CS (cid:48) ( p (cid:96) − p m ) to denote | S (cid:48) ∞ ( S ( p (cid:96) ) − S ( p m )) | , A to denote A cv and B to denote B cv for the rest of the proof. Asin Lemma C.7, we assume that p (cid:96) > p m for all (cid:96) ∈ B and m ∈ A since otherwise ∂ Q v ( x ) ∂ p i ∂ p j =
0. Wehave the following cases for the indices i and j (cid:73) If i , j ∈ B then (cid:12)(cid:12)(cid:12)(cid:12) ∂ Q v ( x ) ∂ p i ∂ p j (cid:12)(cid:12)(cid:12)(cid:12) == ∑ m , m ∈ A CS (cid:48) ( p i − p m ) CS (cid:48) ( p j − p m ) · ∏ ( m , (cid:96) ) (cid:54) = { ( m , i ) , ( m , j ) } CS ( p (cid:96) − p m ) · S (cid:48) ( p i ) S (cid:48) ( p j ) ≤ ∑ m , m ∈ A CS (cid:48) ( p i − p m ) CS (cid:48) ( p j − p m ) · ∏ ( m , (cid:96) ) (cid:54) = { ( m , i ) , ( m , j ) } CS ( p (cid:96) − p m ) (cid:124) (cid:123)(cid:122) (cid:125) (cid:44) U ( i , j ) .If additionally it holds that S ( p i ) − S ( p m ) ≤ d or S ( p j ) − S ( p m ) ≤ d , then by thecase 2. of Lemma C.5, we have that U ( i , j ) ≤ CS (cid:48) ( p i − p m ) · CS (cid:48) ( p j − p m ) ≤ Θ ( d e − d ) .The latter follows from the fact that the function S (cid:48) ∞ ( · ) is bounded in the [
0, 1 ] interval andthat CS ( p (cid:96) − p m ) ≤
1. With the exact same arguments as in Lemma C.6, we hence get that CS (cid:48) ( p i − p m ) CS (cid:48) ( p j − p m ) · Π ( m , (cid:96) ) (cid:54) = { ( m , i ) , ( m , j ) } CS ( p (cid:96) − p m ) ≤ Θ ( d ) ∑ v (cid:48) ∈ L c ( c ) Q cv (cid:48) ( x ) .Thus (cid:12)(cid:12)(cid:12) ∂ Q v ( x ) ∂ p i ∂ p j (cid:12)(cid:12)(cid:12) ≤ Θ ( d ) ∑ v (cid:48) ∈ L c ( c ) Q cv (cid:48) ( x ) . 71n the other hand if S ( p i ) − S ( p m ) ≥ d and S ( p j ) − S ( p m ) ≥ d then by case 1. ofLemma C.5, CS (cid:48) ( p i − p m ) ≤ c · d · CS ( p i − p m ) and CS (cid:48) ( p j − p m ) ≤ c · d · CS ( p j − p m ) and thus U ( i , j ) ≤ Θ ( d ) · Q cv ( x ) . Overall we get that (cid:12)(cid:12)(cid:12) ∂ Q v ( x ) ∂ p i ∂ p j (cid:12)(cid:12)(cid:12) ≤ Θ ( d ) · ∑ v (cid:48) ∈ R c ( x ) Q cv (cid:48) ( x ) . (cid:73) If i ∈ B and j ∈ A then (cid:12)(cid:12)(cid:12)(cid:12) ∂ Q v ( x ) ∂ p i ∂ p j (cid:12)(cid:12)(cid:12)(cid:12) ≤≤ ∑ m ∈ A , (cid:96) ∈ B CS (cid:48) ( p i − p m ) CS (cid:48) ( p (cid:96) − p j ) · ∏ ( m , (cid:96) ) (cid:54) = { ( i , m ) , ( (cid:96) , j ) } CS ( p (cid:96) − p m ) · S (cid:48) ( p i ) S (cid:48) ( p j )+ (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) CS (cid:48)(cid:48) ( p i − p j ) · ∏ ( m , (cid:96) ) (cid:54) =( i , j ) CS ( p (cid:96) − p m ) · S (cid:48) ( p i ) S (cid:48) ( p j ) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ≤ Θ ( d ) ∑ v ∈ L c ( c ) Q cv ( x ) + (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) CS (cid:48)(cid:48) ( p i − p j ) · ∏ ( m , (cid:96) ) (cid:54) =( i , j ) CS ( p (cid:96) − p m ) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:124) (cid:123)(cid:122) (cid:125) Q (cid:48)(cid:48) ( x ) .In case S ( p i ) − S ( p j ) ≥ d then by case 4. of Lemma C.5, we get that (cid:12)(cid:12)(cid:12) CS (cid:48)(cid:48) ( p i − p j ) (cid:12)(cid:12)(cid:12) ≤ cd · CS ( p i − p j ) which implies that Q (cid:48)(cid:48) ≤ Θ ( d ) · Q cv ( x ) .On the other hand if S ( p i ) − S ( p j ) ≤ d then by case 5. 
of Lemma C.5, we get that Q (cid:48)(cid:48) ≤ (cid:12)(cid:12)(cid:12) CS (cid:48)(cid:48) ( p i − p j ) (cid:12)(cid:12)(cid:12) ≤ c · d e − d . As in the proof of Lemma C.6, there exists a vertex v ∗ ∈ R c ( x ) such that Q cv ∗ ( x ) ≥ c d e − ( d + ) d and thus Q (cid:48)(cid:48) ≤ Θ ( d ) ∑ v ∈ L c ( c ) Q cv ( x ) . Overallwe get that (cid:12)(cid:12)(cid:12)(cid:12) ∂ Q v ( x ) ∂ p i ∂ p j (cid:12)(cid:12)(cid:12)(cid:12) ≤ Θ ( d ) ∑ v ∈ L c ( c ) Q cv ( x ) . (cid:73) If i = j ∈ B then (cid:12)(cid:12)(cid:12)(cid:12) ∂ Q v ( x ) ∂ p i (cid:12)(cid:12)(cid:12)(cid:12) ≤≤ ∑ m , m ∈ A (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) CS (cid:48) ( p i − p m ) CS (cid:48) ( p i − p m ) · ∏ ( m , (cid:96) ) (cid:54) = { ( m , i ) , ( m , i ) } CS ( p (cid:96) − p m ) · S (cid:48) ( p i ) S (cid:48) ( p i ) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) + ∑ m ∈ A (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) CS (cid:48)(cid:48) ( p i − p m ) · ∏ ( m , (cid:96) ) (cid:54) =( m , (cid:96) ) CS ( p (cid:96) − p m ) S (cid:48) ( p i ) S (cid:48) ( p i ) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ≤ Θ ( d + d · d ) · ∑ v ∈ L c ( c ) Q cv ( x ) .If we combine all the above cases then the Lemma follows. Lemma C.9.
For any vertex v ∈ ([ N ] − ) d , it holds that (cid:12)(cid:12)(cid:12) ∂ P v ( x ) ∂ x i ∂ x j (cid:12)(cid:12)(cid:12) ≤ Θ ( d / δ ) . roof. Without loss of generality we assume that v ∈ L c ( c ) , where c ∈ ([ N − ] − ) d such that x ∈ L ( c ) , since otherwise ∂ P v ( x ) ∂ x i ∂ x j = ∂ P v ( x ) ∂ p i ∂ p j = ∂ Q v ( x ) ∂ p i ∂ p j (cid:32) ∑ v (cid:48) ∈ L c ( c ) Q v (cid:48) ( x ) (cid:33) · (cid:16) ∑ v (cid:48) ∈ L c ( c ) Q v (cid:48) ( x ) (cid:17) + ∂ Q v ( x ) ∂ p i ∑ v (cid:48) ∈ L c ( c ) ∂ Q v (cid:48) ( x ) ∂ p j (cid:32) ∑ v (cid:48) ∈ L c ( c ) Q v (cid:48) ( x ) (cid:33) · (cid:16) ∑ v (cid:48) ∈ L c ( c ) Q v (cid:48) ( x ) (cid:17) − ∂ Q v (cid:48) ( x ) ∂ p j ∑ v (cid:48) ∈ L c ( c ) ∂ Q v (cid:48) ( x ) ∂ p i (cid:32) ∑ v (cid:48) ∈ L c ( c ) Q v (cid:48) ( x ) (cid:33) · (cid:16) ∑ v (cid:48) ∈ L c ( c ) Q v (cid:48) ( x ) (cid:17) − Q v ( x ) ∑ v (cid:48) ∈ L c ( c ) ∂ Q v (cid:48) ( x ) ∂ p i ∂ p j (cid:32) ∑ v (cid:48) ∈ L c ( c ) Q v (cid:48) ( x ) (cid:33) · (cid:16) ∑ v (cid:48) ∈ L c ( c ) Q v (cid:48) ( x ) (cid:17) − ∂ Q v ( x ) ∂ p i ∑ v (cid:48) ∈ L c ( c ) Q v (cid:48) ( x ) · ∑ v (cid:48) ∈ L c ( c ) Q v (cid:48) ( x ) ∑ v (cid:48) ∈ L c ( c ) ∂ Q v (cid:48) ( x ) ∂ p j · (cid:16) ∑ v (cid:48) ∈ L c ( c ) Q v (cid:48) ( x ) (cid:17) + Q v ( x ) ∑ v (cid:48) ∈ L c ( c ) ∂ Q v (cid:48) ( x ) ∂ p i · ∑ v (cid:48) ∈ L c ( c ) Q v (cid:48) ( x ) ∑ v (cid:48) ∈ L c ( c ) ∂ Q v (cid:48) ( x ) ∂ p j · (cid:16) ∑ v (cid:48) ∈ L c ( c ) Q v (cid:48) ( x ) (cid:17) Using Lemma C.8 and Lemma C.6 we can bound every term in the above expression and hencewe get that (cid:12)(cid:12)(cid:12) ∂ P v ( x ) ∂ p i ∂ p j (cid:12)(cid:12)(cid:12) ≤ Θ ( d ) . Then the lemma follows from the fact that ∂ p i ∂ x i = δ .Finally using Lemma C.7 and Lemma C.9 we get the proof of Lemma 8.11. C.3 Proof of Lemma 8.12
Let 0 ≤ x_i < 1/(N−1) and let c = (c_1, …, c_i, …, c_d) denote the down-left corner of the cubelet R(x) in which x ∈ [0, 1]^d lies, i.e. x ∈ L(c). Since x_i < 1/(N−1), this means that c_i = 0. By the definition of sources and targets in Definition 8.6, we have that s_i = 0 and t_i = 1/(N−1), where s_i, t_i are respectively the i-th coordinates of the source vertex s^c and the target vertex t^c. Let p^c_x = (p_1, …, p_d) be the canonical representation of x in the cubelet L(c). Now partition the coordinates [d] into the following sets:

A = { j | p_j ≤ p_i }   and   B = { j | p_i < p_j }.

If B = ∅ then notice that P_{s^c}(x) > 0, since p_i < 1 by the fact that x_i < 1/(N−1). Thus the lemma follows since s_i = 0. So we may assume that B ≠ ∅. In this case consider the corner v = (v_1, …, v_d) defined as follows:

v_j = s_j if j ∈ A,   and   v_j = t_j if j ∈ B.

Observe that Q^c_v(x) > 0 and hence v ∈ R_+(x). Moreover the coordinate i belongs to A and therefore it holds that v_i = s_i =
0. This proves the first statement of the Lemma.For the second statement let 1 − ( N − ) ≤ x i ≤ ( N − ) and c = ( c , . . . , c i , . . . , c d ) denote down-left corner of the cubelet R ( x ) at which x ∈ [
0, 1 ] d lies, i.e. x ∈ L ( c ) . This meansthat c i = N − N − . 73 Let N be odd. In this case by the definition of sources and targets in Definition 8.6, we havethat s i = − ( N − ) and t i =
1, where s i , t i are respectively the i -th coordinate of thesource and target vertex. Let p cx = ( p , . . . , p d ) be the canonical representation of x underin the cubelet L ( c ) . Now partition the coordinates [ d ] as follows, A = (cid:8) j | p j < p i (cid:9) and B = (cid:8) j | p i ≤ p j (cid:9) If A = ∅ then notice that for the target vertex t c , P t c ( x ) >
0, since p i >
0, by the fact that x i > − ( N − ) . Thus the lemma follows since t i =
1. So we may assume that A (cid:54) = ∅ .In this case consider the corner v = ( v , . . . , v d ) defined as follows, v j = (cid:26) s j j ∈ At j j ∈ B Observe that Q cv ( x ) > v ∈ R + ( x ) . Moreover the coordinate i ∈ B and thus v i = t i = (cid:73) Let N be even. In this case we have that t i = − ( N − ) and s i =
1. Now partition thecoordinates [ d ] as follows, A = (cid:8) j | p j ≤ p i (cid:9) and B = (cid:8) j | p i < p j (cid:9) If B = ∅ then notice that for the source vertex s c , P s c ( x ) >
0, since p i <
1, by the fact that x i > − ( N − ) . Thus the lemma follows since s i =
1. In case B (cid:54) = ∅ consider thecorner v = ( v , . . . , v d ) defined as follows, v j = (cid:26) s j j ∈ At j j ∈ B Observe that Q cv ( x ) > v ∈ R + ( x ) . Moreover the coordinate i ∈ A and thus v i = s i = D Constructing the Turing Machine – Proof of Theorem 7.6
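The corner-selection rule used throughout the proof above is simple enough to state as code. The following is a minimal Python sketch, assuming the canonical representation $p$, the source $s^c$ and the target $t^c$ are given as plain lists; the function name `select_corner` and the `strict` flag (which switches between the two partitions $A$, $B$ used in the different cases) are ours, not from the paper.

```python
def select_corner(p, s, t, i, strict=False):
    """Pick the corner v of the cubelet as in the proof of Lemma 8.12:
    coordinates j whose p_j falls on the 'small' side of the pivot p_i take
    the source coordinate s_j, the remaining ones take the target coordinate t_j."""
    v = []
    for j in range(len(p)):
        # A = {j : p_j <= p_i} (or p_j < p_i in the strict variant); B is the complement
        in_A = (p[j] < p[i]) if strict else (p[j] <= p[i])
        v.append(s[j] if in_A else t[j])
    return v
```

Note that the pivot coordinate $i$ always lands in $A$ under the non-strict comparison, so the constructed corner satisfies $v_i = s_i$, exactly as the proof requires.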
D Constructing the Turing Machine – Proof of Theorem 7.6

In this section we prove Theorem 7.6, establishing that both the function $f_{C_l}(x,y)$ of Definition 7.4 and its gradient are computable by a polynomial-time Turing Machine. We prove Theorem 7.6 through a series of lemmas. To simplify notation we set $b \triangleq \log(1/\varepsilon)$.

Definition D.1. For $x \in \mathbb{R}$, we denote by $[x]_b \in \mathbb{R}$ a value, represented by $b$ bits, such that $\left| [x]_b - x \right| \le 2^{-b}$.
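Definition D.1 is the usual notion of fixed-point truncation. As a minimal illustration, the following Python sketch realizes one valid choice of $[x]_b$ by rounding to $b$ fractional bits over exact rationals (the name `bits_b` is ours):

```python
from fractions import Fraction

def bits_b(x: Fraction, b: int) -> Fraction:
    """One valid choice of [x]_b: round x to the nearest multiple of 2^-b,
    so that |bits_b(x, b) - x| <= 2^-b as Definition D.1 requires."""
    scale = 2 ** b
    return Fraction(round(x * scale), scale)
```

Rounding to the nearest multiple of $2^{-b}$ incurs error at most $2^{-b-1} \le 2^{-b}$, so any such rounding is a legitimate instance of $[\cdot]_b$.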
Lemma D.2. There exist Turing Machines $M_{S_\infty}$ and $M_{S'_\infty}$ that, given $x \in [0,1]$ and $\varepsilon$ in binary form, compute $[S_\infty(x)]_b$ and $[S'_\infty(x)]_b$ in time polynomial in $b = \log(1/\varepsilon)$ and the binary representation of $x$.

Proof. The Turing Machine $M_{S_\infty}$ outputs the first $b$ bits of the following quantity:

$$ W(x) = \left[ \frac{1}{1 + \left[ 2^{\left[ \frac{1-2x}{x(1-x)} \right]_{b'}} \right]_{b'}} \right]_{b'} $$

where $b'$ will be selected sufficiently large. Notice that it is possible to compute the above quantity due to the fact that all of the functions $\frac{1-2\gamma}{\gamma(1-\gamma)}$, $2^{\gamma}$ and $\frac{1}{1+\gamma}$ can be computed with accuracy $2^{-b'}$ in time polynomial in $b'$ and the binary representation of $\gamma$ [Bre76]. Moreover, writing $g(x) = \frac{1-2x}{x(1-x)}$,

$$ \left| W(x) - S_\infty(x) \right| \le 2^{-b'} + \left| \frac{1}{1 + \left[ 2^{[g(x)]_{b'}} \right]_{b'}} - \frac{1}{1 + 2^{[g(x)]_{b'}}} \right| + \left| \frac{1}{1 + 2^{[g(x)]_{b'}}} - \frac{1}{1 + 2^{g(x)}} \right| \le 2^{-b'} + 2^{-b'} + \ln 2 \cdot \left| [g(x)]_{b'} - g(x) \right| \le 3 \cdot 2^{-b'}, $$

where the first inequality follows from the triangle inequality and the second follows from the facts that $1/(1+\gamma)$ is a $1$-Lipschitz function of $\gamma$ for $\gamma \ge 0$, and $1/(1+2^{\gamma})$ is an $\ln(2)$-Lipschitz function of $\gamma$. The last inequality follows from the definition of $[\cdot]_{b'}$. Hence $W(x)$ is indeed equal to $[S_\infty(x)]_b$ if we choose $b' = b + 2$.

We next describe how $M_{S'_\infty}$ computes $[S'_\infty(x)]_b$. First notice that $S'_\infty(x)$ is equal to

$$ S'_\infty(x) = \ln 2 \cdot \frac{ \frac{2^{-\frac{1}{x(1-x)}}}{x^2} + \frac{2^{-\frac{1}{x(1-x)}}}{(x-1)^2} }{ \left( 2^{-\frac{1}{x}} + 2^{-\frac{1}{1-x}} \right)^2 } $$

(differentiate $S_\infty$ and multiply both the numerator and the denominator by $2^{-2/x}$; in this form all the quantities appearing are bounded on $(0,1)$). To describe how to compute $S'_\infty(x)$ we first assume that we have computed the following quantities. Then, based on these quantities, we show how $S'_\infty(x)$ can be computed, and finally we consider the computation of these quantities.

▷ $[\ln 2]_{b'}$,
▷ $A \gets \left[ 2^{-\frac{1}{x(1-x)}} / x^2 \right]_{b'}$,
▷ $B \gets \left[ 2^{-\frac{1}{x(1-x)}} / (x-1)^2 \right]_{b'}$,
▷ $C \gets \left[ \left( 2^{-\frac{1}{x}} + 2^{-\frac{1}{1-x}} \right)^2 \right]_{b'}$.

Then $M_{S'_\infty}$ outputs the first $b$ bits of the quantity $\left[ [\ln 2]_{b'} \cdot \left[ \frac{A+B}{C} \right]_{b'} \right]_{b'}$. We now prove that

$$ \left| [\ln 2]_{b'} \cdot \left[ \frac{A+B}{C} \right]_{b'} - \ln 2 \cdot \frac{A+B}{C} \right| \le \Theta\left( 2^{-b'} \right), $$

where $\ln 2 \cdot \frac{A+B}{C}$ plays the role of $S'_\infty(x)$. Consider the function $g(\alpha, \beta, \gamma) = \frac{\alpha + \beta}{\gamma}$ where $|\alpha|, |\beta| \le c_1$ and $|\gamma| \ge c_2$ for universal constants $c_1, c_2$. Notice that $g$ is $c_3$-Lipschitz, where $c_3$ is a constant that depends only on $c_1$ and $c_2$. Since for sufficiently large $b'$ all of the quantities $|A|$, $|B|$, $\left| 2^{-\frac{1}{x(1-x)}}/x^2 \right|$, $\left| 2^{-\frac{1}{x(1-x)}}/(x-1)^2 \right|$ are at most $c_1$, while $|C|$ and $\left( 2^{-\frac{1}{x}} + 2^{-\frac{1}{1-x}} \right)^2$ are at least $c_2$ for universal constants $c_1, c_2$, we get that

$$ \left| \left[ \frac{A+B}{C} \right]_{b'} - \frac{A+B}{C} \right| \le \Theta\left( 2^{-b'} \right). $$

Now consider the function $g(\alpha, \beta) = \alpha \cdot \beta$ where $|\alpha|, |\beta| \le c$ for a universal constant $c$. In this case $g$ is $\sqrt{2}\,c$-Lipschitz continuous. Since for $b'$ sufficiently large all of the quantities $|[\ln 2]_{b'}|$, $\left| \left[ \frac{A+B}{C} \right]_{b'} \right|$, $\ln 2$, $\left| \frac{A+B}{C} \right|$ are bounded by a universal constant $c$, we have that

$$ \left| [\ln 2]_{b'} \cdot \left[ \frac{A+B}{C} \right]_{b'} - \ln 2 \cdot \frac{A+B}{C} \right| \le \Theta\left( 2^{-b'} \right). $$

Next we explain how the values $A$, $B$ and $C$ are computed; $[\ln 2]_{b'}$ can easily be computed via standard techniques [Bre76].

▶ Computation of $A$. The Turing Machine $M_{S'_\infty}$ computes $A$ by taking the first $b'$ bits of the quantity

$$ \left[ 2^{\left[ -\frac{1}{x(1-x)} + \frac{2 \ln(1/x)}{\ln 2} \right]_{b''}} \right]_{b''} $$

where $b''$ will be taken sufficiently large. We remark that both the exponentiation and the natural logarithm can be computed in polynomial time with respect to the number of accuracy bits and the binary representation of the input [Bre76]. The function $2^{-\frac{1}{x(1-x)}}/x^2 = 2^{-\frac{1}{x(1-x)} + \frac{2\ln(1/x)}{\ln 2}}$ is $c$-Lipschitz, where $c$ is a universal constant. Thus,

$$ \left| \left[ 2^{\left[ -\frac{1}{x(1-x)} + \frac{2\ln(1/x)}{\ln 2} \right]_{b''}} \right]_{b''} - \frac{2^{-\frac{1}{x(1-x)}}}{x^2} \right| \le \Theta\left( 2^{-b''} \right). $$

▶ Computation of $B$. Using the same arguments as for $A$.

▶ Computation of $C$. To compute $C$ we first compute $b''$ bits of the following quantity:

$$ \left( \left[ 2^{-1/[x]_{b''}} \right]_{b''} + \left[ 2^{1/([x]_{b''} - 1)} \right]_{b''} \right)^2. $$

We first argue that

$$ \left| \left[ \left( \left[ 2^{-1/[x]_{b''}} \right]_{b''} + \left[ 2^{1/([x]_{b''} - 1)} \right]_{b''} \right)^2 \right]_{b''} - \left( 2^{-\frac{1}{x}} + 2^{-\frac{1}{1-x}} \right)^2 \right| \le \Theta\left( 2^{-b''} \right). $$

The latter follows by applying the triangle inequality together with the following three inequalities, where we abbreviate $u = \left[ 2^{-1/[x]_{b''}} \right]_{b''}$ and $v = \left[ 2^{1/([x]_{b''}-1)} \right]_{b''}$.

1. $\left| \left[ (u+v)^2 \right]_{b''} - (u+v)^2 \right| \le 2^{-b''}$, by the definition of $[\cdot]_{b''}$.

2. $\left| (u+v)^2 - \left( 2^{-1/[x]_{b''}} + 2^{1/([x]_{b''}-1)} \right)^2 \right| \le \Theta\left( 2^{-b''} \right)$. This holds since for $b''$ larger than a universal constant both sums are upper bounded by $2$, while the function $g(\alpha) = \alpha^2$ is $4$-Lipschitz for $|\alpha| \le 2$.

3. $\left| \left( 2^{-1/[x]_{b''}} + 2^{1/([x]_{b''}-1)} \right)^2 - \left( 2^{-\frac{1}{x}} + 2^{-\frac{1}{1-x}} \right)^2 \right| \le \Theta\left( 2^{-b''} \right)$. This holds since the functions $\alpha \mapsto 2^{-1/\alpha}$ and $\alpha \mapsto 2^{1/(\alpha - 1)}$ are $\Theta(1)$-Lipschitz on $[0,1]$, while the squaring is again applied to arguments upper bounded by $2$.

This concludes the proof of the lemma.
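To make the object of Lemma D.2 concrete, here is a small Python sketch of the step function as reconstructed above, $S_\infty(x) = 1/\big(1 + 2^{(1-2x)/(x(1-x))}\big)$, evaluated in ordinary floating point. The floats stand in for the $b'$-bit intermediate values of the lemma, and the clipping of the exponent is only an overflow guard, not part of the construction.

```python
def S_inf(x: float) -> float:
    """S_inf(x) = 1 / (1 + 2^((1-2x)/(x(1-x)))), extended by 0 at x <= 0
    and by 1 at x >= 1; a smooth step increasing from 0 to 1 on [0, 1]."""
    if x <= 0.0:
        return 0.0
    if x >= 1.0:
        return 1.0
    e = (1.0 - 2.0 * x) / (x * (1.0 - x))
    # overflow guard: for |e| this large the value is numerically 0 or 1 anyway
    if e > 1000.0:
        return 0.0
    if e < -1000.0:
        return 1.0
    return 1.0 / (1.0 + 2.0 ** e)
```

One can check directly that $S_\infty(1/2) = 1/2$ and that $S_\infty(1-x) = 1 - S_\infty(x)$, since the exponent is antisymmetric about $x = 1/2$.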
Lemma D.3. There exist Turing Machines $M_Q$ and $M_{Q'}$ that, given $x \in [0,1]^d$ and $\varepsilon > 0$ in binary form, respectively compute $[Q^c_v(x)]_b$ and $[\nabla Q^c_v(x)]_b$ for all vertices $v \in ([N]-1)^d$ with $Q^c_v(x) > 0$, where $b = \log(1/\varepsilon)$. These vertices are at most $d+1$. Moreover, both $M_Q$ and $M_{Q'}$ run in polynomial time with respect to $b$, $d$ and the binary representation of $x$.

Proof. Both $M_Q$ and $M_{Q'}$ first compute the canonical representation $p^c_x \in [0,1]^d$ with respect to the cell $R(x)$ in which $x$ lies. The cell $R(x)$ can be computed by taking the first $(\log N + 1)$ bits of each coordinate of $x$. The source vertex $s^c = (s_1, \dots, s_d)$ and the target vertex $t^c = (t_1, \dots, t_d)$ with respect to $R(x)$ are also computed. Once this is done, we are only interested in the vertices $v \in R^c(x)$ for which $p_\ell > p_j$ for all $\ell \in A^c_v$, $j \in B^c_v$; for every other $v \in ([N]-1)^d$, both $Q^c_v(x) = 0$ and $\nabla Q^c_v(x) = 0$. These vertices, denoted by $R^+(x)$, are at most $d+1$ and can be computed in polynomial time as follows: (i) the coordinates $p_1, \dots, p_d$ are sorted in increasing order; (ii) for each $m = 0, \dots, d$ compute the vertex $v^m \in R^c(x)$ with

$$ v^m_j = \begin{cases} s_j & \text{if coordinate } j \text{ belongs in the first } m \text{ coordinates w.r.t. the order of } p^c_x \\ t_j & \text{if coordinate } j \text{ belongs in the last } d - m \text{ coordinates w.r.t. the order of } p^c_x. \end{cases} $$

By Definition 8.7 it immediately follows that $R^+(x) \subseteq \bigcup_{m=0}^{d} \{ v^m \}$, which also establishes that $|R^+(x)| \le d+1$.

Once $R^+(x)$ is computed, $M_Q$ computes for each pair $(\ell, j) \in B^c_v \times A^c_v$ the value $\left[ S_\infty(S(p_\ell) - S(p_j)) \right]_{b'}$, for some accuracy $b'$ that we determine later and which depends polynomially on $b$, $d$ and the input accuracy of $x$. Then for each $v \in R^+(x)$, $M_Q$ outputs as $[Q^c_v(x)]_b$ the first $b$ bits of the quantity

$$ \left[ \prod_{\ell \in B^c_v,\, j \in A^c_v} \left[ S_\infty(S(p_\ell) - S(p_j)) \right]_{b'} \right]_{b'} $$

where $b'$ is selected sufficiently large. We next prove that this computation indeed outputs $[Q^c_v(x)]_b$ accurately. To simplify notation let $S_{\ell j}$ denote $S_\infty(S(p_\ell) - S(p_j))$, and let $A$, $B$ denote $A^c_v$, $B^c_v$. Then,

$$ \left| \left[ \prod_{\ell \in B, j \in A} [S_{\ell j}]_{b'} \right]_{b'} - \prod_{\ell \in B, j \in A} S_{\ell j} \right| \le \left| \left[ \prod_{\ell \in B, j \in A} [S_{\ell j}]_{b'} \right]_{b'} - \prod_{\ell \in B, j \in A} [S_{\ell j}]_{b'} \right| + \left| \prod_{\ell \in B, j \in A} [S_{\ell j}]_{b'} - \prod_{\ell \in B, j \in A} S_{\ell j} \right| \le 2^{-b'} + \left| \prod_{\ell \in B, j \in A} [S_{\ell j}]_{b'} - \prod_{\ell \in B, j \in A} S_{\ell j} \right|. $$

Consider the function $g(y) = \prod_{\ell \in B, j \in A} y_{\ell j}$. For $y \in [0, 1 + 1/d]^{|A| \times |B|}$ we have $\| \nabla g(y) \| \le \Theta(d)$. As a result, for all $y, z \in [0, 1 + 1/d]^{|A| \times |B|}$,

$$ |g(y) - g(z)| \le \Theta(d) \cdot \left[ \sum_{\ell \in B, j \in A} (y_{\ell j} - z_{\ell j})^2 \right]^{1/2}. $$

In case the accuracy $b' \ge \Theta(\log d)$, then $[S_{\ell j}]_{b'} \le S_{\ell j} + 1/d \le 1 + 1/d$ and the above inequality applies. Thus,

$$ \left| \prod_{\ell \in B, j \in A} [S_{\ell j}]_{b'} - \prod_{\ell \in B, j \in A} S_{\ell j} \right| \le \Theta(d) \cdot \left[ \sum_{\ell \in B, j \in A} \left( [S_{\ell j}]_{b'} - S_{\ell j} \right)^2 \right]^{1/2} \le \Theta(d^2) \cdot 2^{-b'}. $$

Overall, $\left| \left[ \prod_{\ell \in B, j \in A} [S_{\ell j}]_{b'} \right]_{b'} - \prod_{\ell \in B, j \in A} S_{\ell j} \right| \le \Theta(d^2) \cdot 2^{-b'}$, which concludes the proof of the correctness of $[Q^c_v(x)]_b$ by selecting $b' = b + \Theta(\log d)$.

In order to compute $\frac{\partial Q^c_v(x)}{\partial x_i}$ where $i \in B^c_v$ (symmetrically for $i \in A^c_v$), $M_{Q'}$ additionally computes the values $\left[ S'_\infty(S(p_i) - S(p_j)) \right]_{b'}$ with accuracy $b'$. To simplify notation we denote $S'_\infty(S(p_i) - S(p_j))$ by $S'_{ij}$ and $S'(p_i)$ by $S'_i$. Then $M_{Q'}$ outputs

$$ \left[ \frac{\partial Q^c_v(x)}{\partial x_i} \right]_{b'} \gets \left[ \frac{1}{t_i - s_i} \cdot \left[ \frac{\partial Q^c_v(x)}{\partial p_i} \right]_{b'} \right]_{b'} \quad \text{where} \quad \left[ \frac{\partial Q^c_v(x)}{\partial p_i} \right]_{b'} \gets \left[ \sum_{j \in A} [S'_{ij}]_{b'} \cdot [S'_i]_{b'} \cdot \prod_{\substack{\ell \in B,\, m \in A \\ (\ell, m) \ne (i, j)}} [S_{\ell m}]_{b'} \right]_{b'}. $$

Observe that $t_i - s_i = \mathrm{sign}(t_i - s_i)/(N-1)$, and thus $\frac{1}{t_i - s_i} \cdot \left[ \frac{\partial Q^c_v(x)}{\partial p_i} \right]_{b'}$ can be exactly computed. We next prove that these computations of $\left[ \frac{\partial Q^c_v(x)}{\partial x_i} \right]_{b'}$ and $\left[ \frac{\partial Q^c_v(x)}{\partial p_i} \right]_{b'}$ are correct.

We first bound $\left| [S'_{ij}]_{b'} \cdot [S'_i]_{b'} \cdot \prod_{(\ell, m) \ne (i,j)} [S_{\ell m}]_{b'} - S'_{ij} \cdot S'_i \cdot \prod_{(\ell, m) \ne (i,j)} S_{\ell m} \right|$. Consider the function $g(y_1, y_2, y_3) = y_1 \cdot y_2 \cdot \prod_{(\ell, m)} (y_3)_{\ell m}$. As previously done, for $y_1, y_2 \in [0, 6]$ and $y_3 \in [0, 1 + 1/d]^{|A| \times |B| - 1}$ we have that $\| \nabla g(y_1, y_2, y_3) \| \le \Theta(d)$. If $b' \ge \Theta(\log d)$ then $|S'_{ij}|, S'_i \le 6$ and $S_{\ell m} \in [0, 1 + 1/d]$. As a result,

$$ \left| [S'_{ij}]_{b'} \cdot [S'_i]_{b'} \cdot \prod_{\substack{\ell \in B, m \in A \\ (\ell, m) \ne (i,j)}} [S_{\ell m}]_{b'} - S'_{ij} \cdot S'_i \cdot \prod_{\substack{\ell \in B, m \in A \\ (\ell, m) \ne (i,j)}} S_{\ell m} \right| \le \Theta(d) \cdot 2^{-b'}. $$

We can now use the above inequality to bound $\left| \left[ \frac{\partial Q^c_v(x)}{\partial p_i} \right]_{b'} - \frac{\partial Q^c_v(x)}{\partial p_i} \right|$. More precisely,

$$ \left| \left[ \frac{\partial Q^c_v(x)}{\partial p_i} \right]_{b'} - \frac{\partial Q^c_v(x)}{\partial p_i} \right| \le 2^{-b'} + \left| \sum_{j \in A} [S'_{ij}]_{b'} \cdot [S'_i]_{b'} \cdot \prod_{\substack{(\ell, m) \ne (i,j)}} [S_{\ell m}]_{b'} - \sum_{j \in A} S'_{ij} \cdot S'_i \cdot \prod_{\substack{(\ell, m) \ne (i,j)}} S_{\ell m} \right| \le \Theta(d^2) \cdot 2^{-b'}. $$

We finally get that

$$ \left| \left[ \frac{\partial Q^c_v(x)}{\partial x_i} \right]_{b'} - \frac{\partial Q^c_v(x)}{\partial x_i} \right| \le 2^{-b'} + N \cdot \left| \left[ \frac{\partial Q^c_v(x)}{\partial p_i} \right]_{b'} - \frac{\partial Q^c_v(x)}{\partial p_i} \right| \le \Theta(N d^2) \cdot 2^{-b'}. $$

Thus the analysis is completed by selecting $b' = b + \Theta(\log d) + \Theta(\log N)$.
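The enumeration of the candidate vertices $v^0, \dots, v^d$ in the proof of Lemma D.3 is a plain sorting argument, which can be sketched as follows (Python; the function name is ours, and the coordinates are assumed given as lists):

```python
def candidate_vertices(p, s, t):
    """Enumerate the d+1 candidate vertices v^0, ..., v^d of Lemma D.3:
    sort the coordinates of p in increasing order; v^m takes the source
    coordinate on the first m coordinates w.r.t. that order and the target
    coordinate on the remaining d-m coordinates."""
    d = len(p)
    order = sorted(range(d), key=lambda j: p[j])  # indices by increasing p_j
    vertices = []
    for m in range(d + 1):
        v = list(t)            # start from the target on every coordinate
        for j in order[:m]:    # overwrite the first m (smallest p_j) with the source
            v[j] = s[j]
        vertices.append(v)
    return vertices
```

Only these $d+1$ vertices can have $Q^c_v(x) > 0$, which is what keeps the whole computation polynomial in $d$ despite the cubelet having $2^d$ corners.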
Lemma D.4. There exist Turing Machines $M_P$ and $M_{P'}$ that, given $x \in [0,1]^d$ and $\varepsilon > 0$ in binary form, compute $[P_v(x)]_b$ and $[\nabla P_v(x)]_b$ respectively for all vertices $v \in ([N]-1)^d$ with $P_v(x) > 0$, where $b = \log(1/\varepsilon)$. These vertices are at most $d+1$. Moreover, both $M_P$ and $M_{P'}$ run in polynomial time with respect to $b$, $d$ and the binary representation of $x$.

Proof. $M_P$ first runs the machine $M_Q$ of Lemma D.3 to find the coefficients $Q^c_v(x) > 0$. We remind the reader that these vertices are denoted by $R^+(x)$ and that $|R^+(x)| \le d+1$. Then for each $v \in R^+(x)$, $M_P$ outputs as $[P_v(x)]_b$ the first $b$ bits of the quantity

$$ \left[ \frac{[Q^c_v(x)]_{b'}}{\sum_{v' \in R^+(x)} [Q^c_{v'}(x)]_{b'}} \right]_{b'} $$

where $b'$ is determined later in the proof and is chosen to be polynomial in $b$ and $d$. We next present the proof that the above expression correctly computes $[P_v(x)]_b$. For accuracy $b' \ge \Theta(d \log d)$ we get that

$$ \sum_{v' \in R^+(x)} [Q^c_{v'}(x)]_{b'} \ge \sum_{v' \in R^+(x)} Q^c_{v'}(x) - \Theta(d) \cdot 2^{-b'} = \sum_{v' \in R^c(x)} Q^c_{v'}(x) - \Theta(d) \cdot 2^{-b'} \ge \Theta\left( d^{-d} \right) - \Theta(d) \cdot 2^{-b'} \ge \Theta\left( d^{-d} \right). $$

Consider the function $g(y) = y_i / \sum_{j=1}^{d+1} y_j$. Notice that for $y \in [0,1]^{d+1}$ with $\sum_{j=1}^{d+1} y_j \ge \mu$ we have $\| \nabla g(y) \| \le \Theta(d/\mu)$. The latter implies that for $y, z \in [0,1]^{d+1}$ such that $\sum_{j=1}^{d+1} y_j \ge \mu$ and $\sum_{j=1}^{d+1} z_j \ge \mu$, it holds that

$$ \left| \frac{y_i}{\sum_{j=1}^{d+1} y_j} - \frac{z_i}{\sum_{j=1}^{d+1} z_j} \right| \le \Theta\left( \frac{d}{\mu} \right) \cdot \| y - z \|. $$

Since there are at most $d+1$ vertices $v' \in R^+(x)$, while both $\sum_{v' \in R^+(x)} [Q^c_{v'}(x)]_{b'}$ and $\sum_{v' \in R^+(x)} Q^c_{v'}(x)$ are greater than $\Theta(d^{-d})$, we can apply the above inequality with $\mu = \Theta(d^{-d})$ and we get

$$ \left| \frac{[Q^c_v(x)]_{b'}}{\sum_{v' \in R^+(x)} [Q^c_{v'}(x)]_{b'}} - \frac{Q^c_v(x)}{\sum_{v' \in R^+(x)} Q^c_{v'}(x)} \right| \le \Theta\left( d^{d+1} \right) \cdot \left[ \sum_{v' \in R^+(x)} \left( [Q^c_{v'}(x)]_{b'} - Q^c_{v'}(x) \right)^2 \right]^{1/2} \le \Theta\left( d^{d+1} \right) \cdot 2^{-b'}. $$

Overall, by the triangle inequality, we have that

$$ \left| \left[ \frac{[Q^c_v(x)]_{b'}}{\sum_{v' \in R^+(x)} [Q^c_{v'}(x)]_{b'}} \right]_{b'} - \frac{Q^c_v(x)}{\sum_{v' \in R^c(x)} Q^c_{v'}(x)} \right| \le \Theta\left( d^{d+1} \right) \cdot 2^{-b'}. $$

The proof is completed by selecting $b' = b + \Theta(d \log d)$.

In order to compute $\frac{\partial P_v(x)}{\partial x_i}$, the Turing Machine $M_{P'}$ computes for all vertices in $R^+(x)$ the coefficients $\frac{\partial Q^c_v(x)}{\partial x_i}$ with accuracy $b'$. Then for each $v \in R^+(x)$ the Turing Machine $M_{P'}$ outputs

$$ \left[ \frac{\partial P_v(x)}{\partial x_i} \right]_{b'} \gets \left[ \frac{1}{t_i - s_i} \cdot \left[ \frac{\partial P_v(x)}{\partial p_i} \right]_{b'} \right]_{b'} \quad \text{where} \quad \left[ \frac{\partial P_v(x)}{\partial p_i} \right]_{b'} \gets \left[ \frac{ \left[ \frac{\partial Q_v(x)}{\partial p_i} \right]_{b'} \cdot \sum_{v' \in R^+(x)} [Q_{v'}(x)]_{b'} - [Q_v(x)]_{b'} \cdot \sum_{v' \in R^+(x)} \left[ \frac{\partial Q_{v'}(x)}{\partial p_i} \right]_{b'} }{ \left( \sum_{v' \in R^+(x)} [Q_{v'}(x)]_{b'} \right)^2 } \right]_{b'}. $$

Similarly to the above, and as in Lemma D.3, we can prove that if $b' \ge b + \Theta(d \log d) + \Theta(\log N)$ then $\left| \left[ \frac{\partial P_v(x)}{\partial p_i} \right]_{b'} - \frac{\partial P_v(x)}{\partial p_i} \right| \le 2^{-b}$.
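The normalization step performed by $M_P$, which turns the coefficients $Q^c_v(x)$ into the convex weights $P_v(x) = Q^c_v(x) / \sum_{v'} Q^c_{v'}(x)$, can be sketched as follows. This is only an illustration: we use exact rationals in place of the $b'$-bit arithmetic of the lemma, the name `normalize` is ours, and the rounded weights need not sum to exactly $1$ in general.

```python
from fractions import Fraction

def normalize(q_values, guard_bits):
    """Given the positive coefficients [Q^c_v(x)]_{b'}, return the weights
    [P_v(x)] = Q_v / sum_v' Q_v', each rounded to guard_bits fractional bits."""
    scale = 2 ** guard_bits
    total = sum(q_values)
    return [Fraction(round(Fraction(q) / total * scale), scale) for q in q_values]
```

Taking `guard_bits` larger than the target accuracy by the $\Theta(d \log d)$ slack of the lemma absorbs the error introduced by truncating each $Q^c_v(x)$ before dividing.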
Proof of Theorem 7.6. Let $R(x)$ be the cell in which $x$ lies. The Turing Machine $M_{f_{C_l}}$ initially calculates the vertices $v \in R^c(x)$ with coefficient $P_v(x) > 0$. We remind the reader that this set is denoted by $R^+(x)$ and that $|R^+(x)| \le d+1$. Then $M_{f_{C_l}}$ outputs the first $b$ bits of the following quantity:

$$ \left[ f_{C_l}(x, y) \right]_{b'} = \sum_{j=1}^{d} [\alpha(x, j)]_{b'} \cdot (x_j - y_j) \quad \text{where} \quad [\alpha(x, j)]_{b'} = \sum_{v \in R^+(x)} C_l(v, j) \cdot [P_v(x)]_{b'}. $$

We next prove that the above computation is correct:

$$ \left| \left[ f_{C_l}(x, y) \right]_{b'} - f_{C_l}(x, y) \right| = \left| \sum_{j=1}^{d} [\alpha(x, j)]_{b'} \cdot (x_j - y_j) - \sum_{j=1}^{d} \alpha(x, j) \cdot (x_j - y_j) \right| \le \sum_{j=1}^{d} \left| [\alpha(x, j)]_{b'} - \alpha(x, j) \right| = \sum_{j=1}^{d} \left| \sum_{v \in R^+(x)} C_l(v, j) \cdot \left( [P_v(x)]_{b'} - P_v(x) \right) \right| \le \sum_{j=1}^{d} \sum_{v \in R^+(x)} \left| [P_v(x)]_{b'} - P_v(x) \right| \le d \cdot (d+1) \cdot 2^{-b'}. $$

Setting $b' = b + \Theta(\log d)$ we get the desired result. The partial derivatives $\frac{\partial f_{C_l}(x, y)}{\partial x_i}$ and $\frac{\partial f_{C_l}(x, y)}{\partial y_i}$ are computed similarly.

E Convergence of PGD to Approximate Local Minimum