Direct-Search for a Class of Stochastic Min-Max Problems
Sotiris Anagnostidis, Institute for Machine Learning, ETH Zürich, [email protected]
Aurelien Lucchi, Institute for Machine Learning, ETH Zürich, [email protected]
Youssef Diouane, ISAE-SUPAERO, Université de Toulouse, France, [email protected]
Abstract
Recent applications in machine learning have renewed the interest of the community in min-max optimization problems. While gradient-based optimization methods are widely used to solve such problems, there are many scenarios where these techniques are not well-suited, or even not applicable when the gradient is not accessible. We investigate the use of direct-search methods, a class of derivative-free techniques that only access the objective function through an oracle. In this work, we design a novel algorithm in the context of min-max saddle point games where one sequentially updates the min and the max player. We prove convergence of this algorithm under mild assumptions, where the objective of the max player satisfies the Polyak-Łojasiewicz (PL) condition, while the min player is characterized by a nonconvex objective. Our method only requires dynamically adjusted accurate estimates of the oracle with a fixed probability. To the best of our knowledge, our analysis is the first to address the convergence of a direct-search method for min-max objectives in a stochastic setting.
Recent applications in the field of machine learning, including generative models (Goodfellow et al., 2014) and robust optimization (Ben-Tal et al., 2009), have triggered significant interest in the optimization of min-max functions of the form

min_{x ∈ X} max_{y ∈ Y} f(x, y) = E[f̃(x, y, ξ)],   (1)

where ξ is a random variable characterized by some distribution. In machine learning, ξ is for instance often drawn from a distribution that depends on the training data.

In practice, min-max problems are often solved using gradient-based algorithms, especially simultaneous gradient descent ascent (GDA), which simply alternates between a gradient descent step for x and a gradient ascent step for y. While these algorithms are attractive due to their simplicity, there are cases where the gradient of the objective function is not accessible, such as when modelling distributions with categorical variables (Jang et al., 2016), tuning hyper-parameters (Audet and Orban, 2006; Marzat et al., 2011), or in multi-agent reinforcement learning with bandit feedback (Zhang et al., 2019). A resurgence of interest has recently emerged for applications in black-box optimization (Bogunovic et al., 2018; Liu et al., 2019) and black-box poisoning attacks (Liu et al., 2020), where an attacker deliberately modifies the training data in order to tamper with the model's predictions. This can be formulated as a min-max optimization problem where only stochastic accesses to the objective function are available (Wang et al., 2020).

In this work, we investigate the use of direct-search methods to optimize min-max objective functions without requiring access to the gradients of the objective f. Direct-search methods have a long history in the field of optimization, dating back to the seminal paper of Hooke and Jeeves (1961). The appeal of these methods is due to their simplicity but also their potential ability to deal with non-trivial objective functions.
Although there are variations among random search techniques, most of them can be summarized conceptually as sampling random directions from a search space and moving towards directions that decrease the objective function value. We note that these techniques are sometimes named derivative-free methods, but it is important to distinguish them from other techniques that try to estimate derivatives based on finite differences (Spall, 2003) or smoothing (Nesterov and Spokoiny, 2017). We refer the reader to the surveys by Lewis et al. (2000); Rios and Sahinidis (2013) for a comprehensive review of direct-search methods.

Solving the saddle point problem (1) is equivalent to finding a saddle point (x*, y*) such that

f(x*, y) ≤ f(x*, y*) ≤ f(x, y*)   ∀x ∈ X, ∀y ∈ Y.

In the game theory literature, such a point is commonly referred to as a (global) Nash equilibrium, see e.g. Liang and Stokes (2018). There is a rich literature on saddle point optimization for the particular class of convex-concave functions (i.e. when f is convex in x and concave in y) that are differentiable. Although this type of objective function is commonly encountered in applications such as constrained convex minimization, many saddle point problems of interest do not satisfy the convex-concave assumption. This for instance includes applications such as Generative Adversarial Networks (GANs) (Goodfellow et al., 2014), robust optimization (Ben-Tal et al., 2009; Bogunovic et al., 2018) and multi-agent reinforcement learning (Omidshafiei et al.). Existing direct-search approaches for min-max problems mostly address the finite minimax setting f(x) = max{f_i(x) : i = 1, ..., N}, where N > 1 and each f_i is continuously differentiable. Other techniques such as Bertsimas and Nohadani (2010); Bertsimas et al. (2010) are restricted to functions f that are convex with respect to x, or only provide an asymptotic convergence analysis (Menickelly and Wild, 2020).
We refer the reader to Section 2 for a more detailed discussion of prior approaches.

Motivated by a wide range of applications, we therefore focus on a nonconvex-nonconcave stochastic setting where the max player satisfies the PL condition (see Definition 5 in Section 4), which is known to be a weaker assumption than convexity (Karimi and Schmidt, 2015). In summary, our main contributions are:

• We design a novel direct-search algorithm for such min-max problems and provide non-asymptotic convergence guarantees in terms of first-order Nash equilibria. Concretely, we prove convergence to an ε-first-order Nash equilibrium (for a definition see Section 3) in O(ε⁻² log(ε⁻¹)) iterations, which is comparable to the rate achieved by gradient-based techniques (Nouiehed et al., 2019).

• We derive theoretical convergence guarantees in a stochastic setting where one only has access to accurate estimates of the objective function with some fixed probability. We prove our results for the case where the min player optimizes a nonconvex function while the max player optimizes a PL function.

• We validate our theoretical findings empirically, including settings where derivatives are not available.
Direct-search methods for minimization problems
The general principle behind direct-search methods is to optimize a function f(x) without having access to its gradient ∇f(x). A large number of algorithms belong to this broad family, including golden-section search techniques and random search (Rastrigin, 1963). Among the most popular algorithms in machine learning are evolution strategies and population-based algorithms, which have demonstrated promising results in reinforcement learning (Salimans et al., 2017; Maheswaranathan et al., 2018) and bandit optimization (Flaxman et al., 2004). At a high level, these techniques work by maintaining a distribution over parameters and duplicating the individuals in the population with higher fitness. Often these algorithms are initialized at a random point and then adapt their search space depending on which area contains the best samples (i.e. the lowest function value when minimizing f(x)). New samples are then generated from the best regions in a process repeated until convergence. The most well-known algorithms that belong to this class are evolutionary-like algorithms, including for instance CMA-ES (Hansen et al., 2003). Evolution strategies have recently been shown to be able to solve various complex tasks in reinforcement learning such as Atari games or robotic control problems, see e.g. Salimans et al. (2017). Their advantages in the context of reinforcement learning are their reduced sensitivity to noisy or uninformative gradients (potentially increasing their ability to avoid local minima (Conti et al., 2017)) and the ease with which one can implement a distributed or parallel version.

Convergence guarantees for direct-search methods
Proofs of convergence for direct-search methods are based on a specific construction of the sampling directions, often requiring that they positively span the whole search space (Conn et al., 2009), or that they are dense in certain types of directions (known as refining directions) at the limit point (Audet and Dennis Jr, 2006). In addition, they also typically rely on the use of a forcing function that requires each newly selected iterate to decrease the function value adequately. This technique has been analyzed by Vicente (2013), who proved convergence under mild assumptions in O(ε⁻²) iterations for the goal ‖∇f(x)‖ < ε. The number of required steps is reduced to O(ε⁻¹) for convex functions f, and to O(log(ε⁻¹)) for strongly convex functions (Konečný and Richtárik, 2014). This is on par with the steepest descent method for unconstrained optimization (Nesterov, 2013), apart from some constants that depend on the dimensionality of the problem.

Stochastic estimates of the function
In our analysis, we only assume access to stochastic estimates of the objective function

f(x, y) = E[f̃(x, y, ξ)],   (2)

where ξ is a random variable that captures the randomness of the objective function. The noise could be privacy related or caused by a noisy adversary; most commonly, it arises from online streaming data, or from distributed and mini-batch updates due to the sheer size of the problem. Stochastic gradient descent is often used to optimize Eq. (2), where one often assumes access to accurate estimates of f and considers updates only in expectation (Johnson and Zhang, 2013). To establish convergence rates similar to the deterministic case, an alternative solution consists of adapting the accuracy of these estimates dynamically, which can be ensured by averaging multiple samples together. This approach has for instance been analyzed in the context of trust-region methods (Blanchet et al., 2019) and line-search methods (Paquette and Scheinberg, 2018; Bergou et al., 2018), including direct search for the minimization of nonconvex functions (Dzahini, 2020).

Algorithms for finding equilibria in games
Since the pioneering work of von Neumann (1928), equilibria in games have received great attention. Most past results focus on convex-concave settings (Chen et al., 2014; Hien et al., 2017). Notably, Cherukuri et al. (2017) studied convergence of the GDA algorithm under strictly convex-concave assumptions. For problems where the function does not satisfy this condition, however, convergence to a saddle point is not guaranteed. More recent results focus on relaxing these conditions. The work of Nouiehed et al. (2019) analyzed gradient descent-ascent under a similar scenario, where the objective of the max player satisfies the PL condition and the min player optimizes a nonconvex objective. Ostrovskii et al. (2020); Wang et al. (2020) analyze a nonconvex-concave class of problems, while Lin et al. (2020) present a two-scale variant of the GDA algorithm for a similar scenario, providing a replacement for the alternating updates scheme.

We take inspiration from the work of Liu et al. (2019); Nouiehed et al. (2019); Sanjabi et al. (2018) to design a novel alternating direct-search algorithm, where the inner maximization problem is solved almost exactly before performing a single step towards improving the strategy of the minimization player. We are able to prove convergence of our direct-search algorithm under this procedure, which has been shown to be more stable than the analogous simultaneous one, as rigorously demonstrated by Gidel et al. (2018) and Zhang and Yu (2019) for a variety of algorithms.
Throughout, we use ‖·‖ to denote the Euclidean norm; that is, for x ∈ R^n we have ‖x‖ = √(xᵀx). We consider the optimization problem defined in Eq. (1), for which a common notion of optimality is the concept of Nash equilibrium as mentioned previously, formally defined as follows.
Definition 1.
We say that a point (x*, y*) ∈ X × Y is a Nash equilibrium of the game if

f(x*, y) ≤ f(x*, y*) ≤ f(x, y*)   ∀x ∈ X, ∀y ∈ Y.

A Nash equilibrium is a point where a change of strategy of either player individually does not lead to an improvement from her viewpoint. Such a Nash equilibrium always exists for convex-concave games (Jin et al., 2019), but not necessarily for nonconvex-nonconcave games. Even when they exist, finding Nash equilibria is known to be an NP-hard problem, which has led to the introduction of local characterizations, as discussed in Jin et al. (2019); Adolphs et al. (2018). Here we use the notion of a first-order Nash equilibrium (FNE) (for a definition we refer to Pang and Razaviyayn (2016)). We focus on the problem of converging to such an FNE point, or an approximate FNE defined as follows (adapted from Nouiehed et al. (2019) in the absence of constraints).
Definition 2.
For a function f : R^n × R^m → R, a point (x*, y*) ∈ R^n × R^m is said to be an ε-first-order Nash equilibrium (ε-FNE) if:

‖∇_x f(x*, y*)‖ ≤ ε  and  ‖∇_y f(x*, y*)‖ ≤ ε.
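As a concrete illustration (our own toy example, not taken from the paper's experiments), the ε-FNE condition can be checked numerically with central finite differences; the function and helper names below are hypothetical:

```python
import numpy as np

def grad_norms(f, x, y, h=1e-6):
    """Central finite-difference estimates of ||grad_x f|| and ||grad_y f||."""
    gx = np.array([(f(x + h * e, y) - f(x - h * e, y)) / (2 * h) for e in np.eye(len(x))])
    gy = np.array([(f(x, y + h * e) - f(x, y - h * e)) / (2 * h) for e in np.eye(len(y))])
    return np.linalg.norm(gx), np.linalg.norm(gy)

# f(x, y) = x^2 - y^2 has a first-order Nash equilibrium at the origin.
f = lambda x, y: float(x[0] ** 2 - y[0] ** 2)
nx, ny = grad_norms(f, np.array([0.0]), np.array([0.0]))
# Both gradient norms vanish at (0, 0), so it is an eps-FNE for any eps > 0.
```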
Spanning set

Direct-search methods typically rely on the smoothness of the objective function, which we denote by f : R^n → R in this section, and on an appropriate choice of sampling points to prove convergence. The key idea used to guarantee convergence is that one of the sampled directions forms an acute angle with the negative gradient. This can be ensured by sampling from a Positive Spanning Set (PSS). The quality of a spanning set D is typically measured using a notion of cosine measure defined as

κ(D) = min_{0 ≠ u ∈ R^n} max_{d ∈ D} uᵀd / (‖u‖ ‖d‖).   (3)

In the following, we will consider positive spanning sets such that κ(D) ≥ κ_min > 0 and 0 < d_min ≤ ‖d‖ ≤ d_max for all d ∈ D. These assumptions require |D| ≥ n + 1. Common choices are i) the positive and negative orthonormal bases D = [I_n, −I_n] = [e_1, ..., e_n, −e_1, ..., −e_n] of size |D| = 2n, ii) a minimal positive basis with uniform angles of size |D| = n + 1 (see Corollary 2.6 of Conn et al. (2009) and Kolda et al. (2003)), or iii) rotations of these matrices (Gratton et al., 2016).
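For intuition, the cosine measure of the coordinate PSS [I_n, −I_n] equals 1/√n (a standard fact); the sketch below, our own illustration, approximates Eq. (3) from above by Monte-Carlo sampling of the directions u:

```python
import numpy as np

def cosine_measure_mc(D, n_samples=200_000, seed=0):
    """Monte-Carlo upper estimate of the cosine measure kappa(D) of Eq. (3).

    Samples random unit vectors u and takes the worst case over the samples;
    with many samples this approaches kappa(D) from above."""
    rng = np.random.default_rng(seed)
    U = rng.standard_normal((n_samples, D.shape[1]))
    U /= np.linalg.norm(U, axis=1, keepdims=True)
    Dn = D / np.linalg.norm(D, axis=1, keepdims=True)  # rows are directions d
    return float((U @ Dn.T).max(axis=1).min())

n = 2
D = np.vstack([np.eye(n), -np.eye(n)])  # the PSS [I_n, -I_n]
kappa = cosine_measure_mc(D)            # close to 1/sqrt(2) for n = 2
```

The worst-case direction u is the one bisecting two coordinate axes, which is why the measure degrades as 1/√n in higher dimensions.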
Forcing function

Another critical component used to guarantee that the function value decreases appropriately at each step is a forcing function ρ satisfying ρ(σ)/σ → 0 as σ → 0. Given a step size σ, direct-search methods sample new points according to the rule

x' = x + σd,   (4)

with d ∈ D, and accept points for which

f(x') < f(x) − ρ(σ).   (5)

If this condition holds for some d ∈ D, the new point is accepted, the step is deemed successful and the parameter σ is increased; otherwise σ is decreased and the above process is repeated. We use a parameter γ > 1 to control these updates of the step size. For convenience and without loss of generality, we will only consider spanning sets with vectors of unit length (d_min = d_max = 1) and the forcing function ρ(σ) = cσ². The direct-search scheme is displayed in Algorithm 1. (When using the function f with one set of variables, we consider the minimization problem; when using two sets of variables, we instead consider the min-max problem as defined in Eq. (1).)

Algorithm 1: Direct-search(f, x_0, c, T)
    Input: f: objective function, with f_k its estimate at step k; c: forcing function constant; T: number of steps.
    Initialize the step size σ_0, choose γ > 1 and create the positive spanning set D.
    for k = 0, ..., T − 1 do
        1. Offspring generation: generate the points x_i = x_k + σ_k d_i for all d_i ∈ D.
        2. Parent selection: choose x' = argmin_i f_k(x_i).
        3. Sufficient decrease:
            if f_k(x') < f_k(x_k) − ρ(σ_k) then (iteration is successful)
                update and increase the step size: x_{k+1} = x', σ_{k+1} = min{σ_max, γσ_k}
            else (iteration is unsuccessful)
                decrease the step size: x_{k+1} = x_k, σ_{k+1} = γ⁻¹σ_k
    return x_T

The full algorithm we analyze to solve the min-max objective is presented in Algorithm 2. It consists of two steps: i) first solve the maximization problem w.r.t. the y variable using Algorithm 1, and ii) perform one update step for the x variable. In this section, we first analyze the convergence properties of Algorithm 1 in the setting where we only have access to estimates of the objective function f,

f(x) = E[f̃(x, ξ)].

Let (Ω, F, P) be a probability space with elementary events denoted by ω. We denote the random quantities for the iterate by x_k = X_k(ω) and for the step size by σ_k = Σ_k(ω). Similarly, let {F_k, F_k^σ} be the estimates of f(X_k) and f(X_k + Σ_k d_k), for each d_k in a set D, with realizations f_k = F_k(ω), f_k^σ = F_k^σ(ω). At each iteration the influence of the noise on the function evaluations is random. We will assume that, when conditioned on all past iterates, these estimates are sufficiently accurate with a sufficiently high probability. We formalize this concept in the two definitions below.

Definition 3. (ε_f-accurate) The estimates {F_k, F_k^σ} are said to be ε_f-accurate with respect to the corresponding sequence if

|F_k − f(X_k)| ≤ ε_f Σ_k²  and  |F_k^σ − f(X_k + Σ_k d_k)| ≤ ε_f Σ_k².
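To make the scheme concrete, the following self-contained sketch implements Algorithm 1 with a noisy oracle in the spirit of Definition 3 (our own illustration: the quadratic forcing function, the PSS [I_n, −I_n], the mini-batch averaging and all constants are choices made here for concreteness, not the paper's exact settings):

```python
import numpy as np

def direct_search(f_est, x0, c=0.1, T=200, sigma0=1.0, gamma=2.0, sigma_max=10.0):
    """Sketch of Algorithm 1 with forcing function rho(s) = c * s**2."""
    x = np.asarray(x0, dtype=float)
    n = len(x)
    D = np.vstack([np.eye(n), -np.eye(n)])        # positive spanning set [I_n, -I_n]
    sigma = sigma0
    for _ in range(T):
        cands = [x + sigma * d for d in D]        # 1. offspring generation
        vals = [f_est(z) for z in cands]
        i = int(np.argmin(vals))                  # 2. parent selection
        if vals[i] < f_est(x) - c * sigma ** 2:   # 3. sufficient decrease: successful
            x, sigma = cands[i], min(sigma_max, gamma * sigma)
        else:                                     # unsuccessful: shrink the step
            sigma = sigma / gamma
    return x

# Noisy oracle for f(x) = ||x||^2: each call averages a small batch of
# perturbed evaluations, mimicking sample-averaged estimates.
rng = np.random.default_rng(0)
f_est = lambda x: float(np.dot(x, x) + rng.normal(0.0, 1e-3, size=64).mean())
x_T = direct_search(f_est, [2.0, -1.5])  # ends close to the minimizer 0
```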
Definition 4. (p_f-probabilistically ε_f-accurate) The estimates {F_k, F_k^σ} are said to be p_f-probabilistically ε_f-accurate with respect to the corresponding sequence if the events

J_k = {the estimates {F_k, F_k^σ} are ε_f-accurate}

satisfy the condition

P(J_k | F_{k−1}) = E[1_{J_k} | F_{k−1}] ≥ p_f,

where F_{k−1} is the sigma-algebra generated by the sequence {F_0, F_0^σ, ..., F_{k−1}, F_{k−1}^σ}.

As the step size σ gets smaller, meaning that we are getting closer to the optimum, we require the accuracy of the function values to increase. The probability of encountering a good estimate, however, remains the same throughout. A significant challenge arises because steps may falsely satisfy the sufficient decrease condition of Eq. (5), leading to a potential increase of the objective value. This increase can potentially be very large, leading to divergence, and we therefore need an additional assumption on the variance of the error.
Assumption 1. The sequence of estimates {F_k, F_k^σ} is said to satisfy an l_f-variance condition if for all k ≥ 0

E[|F_k − f(X_k)|² | F_{k−1}] ≤ l_f² Σ_k⁴  and  E[|F_k^σ − f(X_k + Σ_k d_k)|² | F_{k−1}] ≤ l_f² Σ_k⁴.

Based on the above assumptions, we reach the following conclusion regarding inaccurate steps (similar to Lemma 2.5 in Paquette and Scheinberg (2018)).
Lemma 1.
Let Assumption 1 hold for p_f-probabilistically ε_f-accurate estimates of a function. Then for all k ≥ 0 we have

E[1_{J_k^c} |F_k − f(X_k)| | F_{k−1}] ≤ (1 − p_f)^{1/2} l_f Σ_k²,
E[1_{J_k^c} |F_k^σ − f(X_k + Σ_k d_k)| | F_{k−1}] ≤ (1 − p_f)^{1/2} l_f Σ_k².
Computing the estimates

In order to satisfy Assumption 1, we can perform multiple function evaluations and average them (see for instance Tropp (2015)). We then obtain the estimate F_k = (1/|S_k|) Σ_{ξ_i ∈ S_k} f̃(X_k, ξ_i), where S_k and S_k^σ correspond to independent sets of samples for F_k and F_k^σ respectively. (We use 1_A to denote the indicator function of a set A and A^c to denote its complement.) Assuming bounded variance, i.e. E[|f̃(x, ξ) − f(x)|²] ≤ σ_f², known concentration results ensure p_f-probabilistically ε_f-accurate estimates for

|S_k| ≥ O(1) (σ_f² / (ε_f² Σ_k⁴)) log(1 / (1 − p_f))

evaluations (the same result holds for S_k^σ). To also satisfy Assumption 1, we additionally require |S_k| ≥ σ_f² / (l_f² Σ_k⁴).

In order to study the convergence properties of Algorithm 1, we introduce the following (random) Lyapunov function:

Φ_k = v (f(X_k) − f*) + (1 − v) Σ_k²,

where v ∈ (0, 1) is a constant. We denote by f* the minimum of the function f, assumed to exist and potentially achieved at multiple points. The Lyapunov function Φ_k will be used to track the progress of the gradient norm ‖∇f(X_k)‖, which will serve as a measure of convergence. Theorem 2 below ensures that the Lyapunov function decreases over iterations. Using this result, one can guarantee that the sequence of step sizes decreases, and then exploit the fact that for sufficiently small step sizes (and accurate estimates) the steps are successful, i.e. they decrease the objective function. The proof of the next theorem is mainly inspired by Dzahini (2020); Audet et al. (2021).
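The two lower bounds on |S_k| above can be combined into a small batch-size helper (a sketch under the simplifying assumption that the O(1) factor equals 1; all numeric inputs below are hypothetical):

```python
import math

def batch_size(sigma_f, eps_f, l_f, p_f, Sigma_k):
    """Samples to average so that F_k is p_f-probabilistically eps_f-accurate
    and satisfies the l_f-variance condition (O(1) factor taken as 1)."""
    accuracy = (sigma_f ** 2 / (eps_f ** 2 * Sigma_k ** 4)) * math.log(1.0 / (1.0 - p_f))
    variance = sigma_f ** 2 / (l_f ** 2 * Sigma_k ** 4)
    return math.ceil(max(accuracy, variance))

# Smaller steps demand sharper estimates: Sigma_k enters with the fourth power.
n_big = batch_size(sigma_f=1.0, eps_f=0.5, l_f=1.0, p_f=0.9, Sigma_k=1.0)    # 10
n_small = batch_size(sigma_f=1.0, eps_f=0.5, l_f=1.0, p_f=0.9, Sigma_k=0.5)  # 148
```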
Theorem 2.
Let f be a function with minimum value f* and L-Lipschitz continuous gradients, and let its estimates be p_f-probabilistically ε_f-accurate, while also having bounded noise variance according to Assumption 1 with constant l_f. Then:

E[Φ_{k+1} − Φ_k | F_{k−1}] ≤ −p_f (1 − v)(1 − γ⁻²) Σ_k².   (6)

The constants c, v and p_f should satisfy

c − ε_f > 0,   p_f / √(1 − p_f) ≥ v l_f / ((1 − v)(1 − γ⁻²)),   v / (1 − v) ≥ (γ² − γ⁻²) / (c − ε_f).

Next, we characterize the number of steps required to converge by using a renewal-reward process adapted from Blanchet et al. (2019). Let us define the random process {Φ_k, Σ_k}, with Φ_k ≥ 0 for all k ≥ 0. Let us also denote by W_k a random walk process and by F_k the σ-algebra generated by {Φ_0, Σ_0, W_0, ..., Φ_k, Σ_k, W_k}, with W_0 = 1 and

P(W_{k+1} = 1 | F_k) = p,   P(W_{k+1} = −1 | F_k) = 1 − p.   (7)

We also define a family of stopping times {T_ε}_{ε>0} with respect to {F_k}_{k≥0}, for ε > 0.
Assumption 2.
Given the random quantities {Φ_k, Σ_k, W_k}, we make the following assumptions.

i. There exists λ > 0 such that Σ_max = Σ_0 e^{λ j_max} for some j_max ∈ Z, and Σ_k ≤ Σ_max for all k.

ii. There exists Σ_ε = Σ_0 e^{λ j_ε} with j_ε ∈ Z, such that

1_{T_ε > k} Σ_{k+1} ≥ 1_{T_ε > k} min{Σ_k e^{λ W_{k+1}}, Σ_ε},

where W_{k+1} satisfies Equation (7) with probability p > 1/2.

iii. There exist a nondecreasing function h(·) : [0, ∞) → (0, ∞) and a constant Θ > 0 such that

1_{T_ε > k} E[Φ_{k+1} | F_k] ≤ 1_{T_ε > k} (Φ_k − Θ h(Σ_k)).

Assumption 2 (ii) requires that step sizes tend to increase when below a specific threshold, while Assumption 2 (iii) requires that the random function Φ decreases in expectation (already proved in Theorem 2). Under these assumptions, the following result holds for the stopping time T_ε (Blanchet et al., 2019).
Theorem 3.
Under Assumption 2, we have

E[T_ε] ≤ (p / (2p − 1)) · Φ_0 / (Θ h(Σ_ε)) + 1.

In our case, we use the fundamental convergence result for direct-search methods, which correlates the norm of the gradient with the step size at unsuccessful iterations (a generalization of results in Vicente (2013); Gratton et al. (2016)).
Lemma 4.
Let f : R^n → R be a continuously differentiable function with L-Lipschitz continuous gradients. Let also D be a positive spanning set with κ(D) = κ_min > 0 and vectors d satisfying ‖d‖ = 1 for all d ∈ D. For a forcing function ρ(σ) = cσ² and ε_f-accurate estimates of the function, for an unsuccessful step k it holds that

σ_k ≥ C ‖∇f(x_k)‖,  with  C = 2κ_min / (L + 2c + 4ε_f).   (8)

In this analysis, our goal is to show that the norm of the gradient decreases below a threshold ε, and we accordingly define

T_ε = inf{k ≥ 0 : ‖∇f(X_k)‖ ≤ ε}.

We assume that Assumption 2 (i) holds by the choice of Σ_max. We also know from Lemma 4 that if ‖∇f(X_k)‖ > ε and Σ_k ≤ Cε, then a successful step occurs, provided that the estimates are accurate. Following Lemma 4.10 from Paquette and Scheinberg (2018), we then get that Assumption 2 (ii) also holds, for Σ_ε = Cε. Based on the results of Theorem 3 and Lemma 4, we can now prove convergence for a nonconvex bounded function.
Theorem 5.
Assume that the assumptions of Theorem 2 hold with additionally p_f > 1/2. Then to reach ‖∇f(X_k)‖ ≤ ε, the expected stopping time of Algorithm 1 is

E[T_ε] ≤ O(1) κ_min⁻² (2p_f − 1)⁻¹ (f(X_0) − f* + Σ_0²)(L + c + ε_f)² ε⁻².

Note that in the deterministic scenario where ε_f = l_f = 0, the above bound matches known results for direct search in the nonconvex case (Vicente, 2013; Konečný and Richtárik, 2014). We now establish faster convergence for a function f additionally satisfying the PL condition, defined below.

Definition 5. (Polyak-Łojasiewicz condition) A differentiable function f : R^n → R with minimum value f* = min_x f(x) is said to be µ-Polyak-Łojasiewicz (µ-PL) if

(1/2) ‖∇f(x)‖² ≥ µ (f(x) − f*)  for all x.

The PL condition is the weakest among a large family of function classes that includes convex functions as well as certain nonconvex ones (Karimi and Schmidt, 2015). Again we can guarantee convergence that closely matches results for deterministic direct search under strong convexity, by proving that the number of iterations required to halve the distance to the optimal objective value is constant in terms of the accuracy ε.
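For instance, f(x) = ‖x‖² satisfies Definition 5 with µ = 2, since ½‖∇f(x)‖² = 2‖x‖² = 2(f(x) − f*); the quick numerical check below is our own illustration:

```python
import numpy as np

# Verify the PL inequality  0.5 * ||grad f(x)||^2 >= mu * (f(x) - f*)
# for f(x) = ||x||^2, which is 2-PL (f* = 0 and grad f(x) = 2x).
f = lambda x: float(np.dot(x, x))
grad = lambda x: 2.0 * x
mu, f_star = 2.0, 0.0
points = [np.array([0.3, -1.2]), np.array([5.0, 4.0]), np.array([1e-3, 0.0])]
pl_holds = all(0.5 * np.linalg.norm(grad(x)) ** 2 >= mu * (f(x) - f_star) - 1e-12
               for x in points)
```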
Theorem 6.
Let f be a function with minimum value f*, satisfying the PL condition with constant µ and having L-Lipschitz continuous gradients. Let also f be p_f-probabilistically ε_f-accurate, while also having bounded noise variance according to Assumption 1 with constant l_f. Then to reach ‖∇f(X_k)‖ ≤ ε, the expected stopping time of Algorithm 1 is

E[T_ε] ≤ O(1) (κ_min⁻² (c + L)²) / ((2p_f − 1) µ c) · log(L (f(X_0) − f*) / ε²).   (9)

The constants c, v and p_f > 1/2 should satisfy

c > max{ε_f, √l_f},   p_f / √(1 − p_f) ≥ v l_f / ((1 − v)(1 − γ⁻²))   and   v / (1 − v) ≥ max{(γ² − γ⁻²) / (c − ε_f), γ² c}.

We now focus on the min-max problem presented in Eq. (1). To proceed, we make the following standard assumptions regarding the smoothness of f.
Assumption 3.
The function f is continuously differentiable in both x and y, and there exist constants L_11, L_12, L_21 and L_22 such that for every x, x_1, x_2 ∈ X and y, y_1, y_2 ∈ Y:

‖∇_x f(x_1, y) − ∇_x f(x_2, y)‖ ≤ L_11 ‖x_1 − x_2‖,
‖∇_x f(x, y_1) − ∇_x f(x, y_2)‖ ≤ L_12 ‖y_1 − y_2‖,
‖∇_y f(x_1, y) − ∇_y f(x_2, y)‖ ≤ L_21 ‖x_1 − x_2‖,
‖∇_y f(x, y_1) − ∇_y f(x, y_2)‖ ≤ L_22 ‖y_1 − y_2‖.

We require that the objective of the max player satisfies the PL condition.
Assumption 4.
There exists a constant µ > 0 such that the function −f(x, ·) in problem (1) is µ-PL for any x ∈ X.

Following prior works on PL games, e.g. Nouiehed et al. (2019), we propose a sequential scheme for the updates of the two players, presented in Algorithm 2 (for simplicity, some of the algorithm's constants are not depicted). This multi-step algorithm solves the maximization problem up to some accuracy, and then performs a single (successful) Direct-Search (DR) step for the minimization problem (see Algorithm 3). We formalize our assumptions and our final result.
Assumption 5.
The function f is defined on the whole domain X × Y = R^n × R^m. We also require f to be bounded below for every y ∈ Y and bounded above for every x ∈ X.

Algorithm 2: Min-Max-Direct-search
    Input: f: objective function; (x_0, y_0): initial point; σ_0: initial step size for the min problem.
    for t = 1, ..., T do
        y_t = Direct-search(−f(x_{t−1}, ·), y_{t−1})
        x_t, σ_t = One-Step-Direct-search(f(·, y_t), x_{t−1}, σ_{t−1})
    return (x_T, y_T)
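A minimal, self-contained sketch of this alternating scheme on a toy objective (our own illustration: `ds_step` plays the role of one direct-search iteration, the inner loop stands in for the full inner solve, and all constants are hypothetical):

```python
import numpy as np

def ds_step(obj, z, sigma, c=0.1, gamma=2.0, sigma_max=1.0):
    """One direct-search iteration on `obj` with the PSS [I, -I]."""
    n = len(z)
    D = np.vstack([np.eye(n), -np.eye(n)])
    cands = [z + sigma * d for d in D]
    vals = [obj(w) for w in cands]
    i = int(np.argmin(vals))
    if vals[i] < obj(z) - c * sigma ** 2:       # sufficient decrease: successful
        return cands[i], min(sigma_max, gamma * sigma)
    return z, sigma / gamma                     # unsuccessful: shrink the step

def min_max_ds(f, x, y, T=60, inner=40):
    sigma_x = 1.0
    for _ in range(T):
        sigma_y = 1.0
        for _ in range(inner):                  # (i) solve the max over y almost exactly
            y, sigma_y = ds_step(lambda w: -f(x, w), y, sigma_y)
        x, sigma_x = ds_step(lambda u: f(u, y), x, sigma_x)  # (ii) one min step in x
    return x, y

# Toy objective f(x, y) = (x - 1)^2 - (y - x)^2: strongly concave (hence PL) in y,
# with first-order Nash equilibrium at (1, 1).
f = lambda x, y: float((x[0] - 1.0) ** 2 - (y[0] - x[0]) ** 2)
x_T, y_T = min_max_ds(f, np.array([3.0]), np.array([0.0]))
```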
Theorem 7.
Suppose that the objective function f(x, y) satisfies Assumptions 3, 4 and 5. If the estimates are deterministic, then Algorithm 2 converges to an ε-FNE within O(ε⁻² log(ε⁻¹)) steps. When f(x, y) is ε_x-accurate with probability p_x for every y, satisfying the assumptions of Theorem 5, and ε_y-accurate with probability p_y for every x, satisfying the assumptions of Theorem 6, then with probability at least δ Algorithm 2 converges, and the expected number of steps to reach an ε-FNE is

O( p_x⁻¹ p_y⁻¹ ε⁻² ( log(ε⁻¹) + [log((1 − p_x)/p_x)]⁻¹ log(1 − e^{p_x − 1} ε⁻² log δ) ) ).

Algorithm 2 performs in total O(ε⁻²) updates for the minimization problem, and each minimization update requires O(log(ε⁻¹)) updates for the maximization problem. The proof of Theorem 7 consists in showing that the maximization problem is solved with sufficient accuracy, for which we invoke the result of Theorem 6. We then proceed by showing that iteratively solving the minimization problem allows us to converge in terms of the min-max objective, which is done using the result of Theorem 5. We note that the sufficient decrease condition allows us to prove convergence for the last iterate, instead of relying on the existence of an iterate k in the whole sequence that satisfies the required inequalities (as proven in the corresponding gradient-based method by Nouiehed et al. (2019)).

One advantage of direct-search methods is their ability to explore the space of parameters. This however comes at the price of a high dependency on the size of the parameter space (Vicente, 2013). For nonconvex optimization problems in R^n, the complexity of DS methods is of the order O(n²) (Dodangeh et al., 2016).
Figure 1: Zero-one loss for each method across classes. The term "lr" stands for the different learning rates used. Error bars correspond to 20% of the standard deviation across 10-fold cross validation.

However, recent works by Gratton et al. (2015); Bergou et al. (2018) have shown that by replacing the sampling from a PSS with a procedure that correlates with the gradient direction probabilistically, it is possible to achieve a dependence of the order O(n). The sequential aspect of our method allows us to adopt this probabilistic perspective for the experiments that follow, thus lowering the computational cost.

Robustly-regularized estimators have been successfully used in prior work (Namkoong and Duchi, 2017) to deal with situations in which the empirical risk minimizer is susceptible to high amounts of noise. Formally, the problem of empirical risk minimization can be formulated as follows:

min_θ sup_{P ∈ P} [ f(X; θ, P) = { E_P[l(X; θ)] : D(P ‖ P̂_n) ≤ ρ/n } ],   (10)

where l(X; θ) denotes the loss function, X the data, and D(P ‖ P̂_n) a distance function that measures the divergence between the true data distribution P and the empirical data distribution P̂_n. For the specific case of a binary classification problem, as for instance considered in Adolphs et al. (2018), Eq. (10) can be reformulated as

min_θ max_p { −Σ_{i=1}^n p_i [y_i log(ŷ(X_i; θ)) + (1 − y_i) log(1 − ŷ(X_i; θ))] − λ Σ_{i=1}^n (p_i − 1/n)² },

where y_i and ŷ(X_i; θ) correspond to the true and the predicted class of data point X_i, and λ > 0 is a regularization parameter. The inner maximization problem is strongly concave in p (i.e. it satisfies our PL assumption) and can thus be solved efficiently. We consider this optimization problem on the Wisconsin breast cancer data set, comparing the performance of our proposed direct-search method against GDA, using the same neural network as classifier. The zero-one loss is shown in Fig.
1 which clearly showsthat our algorithm can consistently outperform GDA for different choices ofregularization parameters. Generative Adversarial Networks (Goodfellow et al., 2014) are formulated asthe saddle point problem:min x max y f ( x , y ) = E θθθ ∼ p data [log D y ( θθθ )]+ E z ∼ p z [log(1 − D y ( G x ( z )))] , where D y : R n → [0 ,
1] and G_x : R^m → R^n are the discriminator and generator networks. Although GANs have been used in a wide variety of applications (Goodfellow, 2016), very few approaches can deal with discrete data. The most severe impeding factor in such settings is the non-existence of the gradient due to the non-smooth nature of the objective function. One advantage of direct-search techniques over gradient-based methods is that they can be used in such a context where gradients are not accessible. In some cases, we note that ℓ2 regularization can be used to increase the smoothness constant of the objective function. We illustrate the performance of our direct-search algorithm on a simple example consisting of correlated categorical data, in Figure 2. For a more detailed discussion and more experimental results we refer the reader to the Appendix.

Figure 2: Learning a discretized mixture of Gaussians using direct-search methods (panels show steps 0 through 60000 and the target). Both the Hellinger distance and the maximum mean discrepancy decrease as direct search learns the modes of the distribution.

Scaling direct search to higher dimensions still remains an active area of research, where recent developments include guided search (Maheswaranathan

https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+(Diagnostic)
et al., 2018) and projection-based approaches (Wang et al., 2016). In this work, we focus on the theoretical guarantees of our algorithm in the stochastic min-max setting. While we demonstrate good empirical behavior on relatively small-scale problems, scaling our algorithm to large-scale problems will require further modifications to improve its scalability.
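Returning to the robustly-regularized classification experiment: the max-player's objective in the reformulation of Eq. (10) can be evaluated directly. The following is an illustrative sketch; the function and variable names are ours, not the authors' implementation:

```python
import numpy as np

def robust_bce_objective(y_true, y_prob, p, lam):
    """Value of the max-player's objective in the robust reformulation:
    sample weights p inflate the binary cross-entropy of hard examples,
    and are penalized for deviating from the uniform weights 1/n.
    (Illustrative sketch; not the paper's code.)"""
    n = len(y_true)
    # Per-sample binary cross-entropy losses.
    bce = -(y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob))
    # Weighted loss minus the quadratic deviation penalty.
    return float(p @ bce - lam * np.sum((p - 1.0 / n) ** 2))
```

At the uniform weights p_i = 1/n the penalty vanishes and the value reduces to the standard average cross-entropy; the max-player improves on this by shifting weight toward poorly classified examples.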
We presented and proved convergence results for a direct-search method in a stochastic minimization setting for both nonconvex and PL objective functions. We then extended these results to prove convergence for min-max objective functions, where the objective of the max-player satisfies the PL condition, while the min-player objective is nonconvex. Our experimental results establish that direct search can outperform traditionally adopted optimization schemes, while also presenting a promising alternative for categorical settings. A potential direction for future work is to improve the scalability of our algorithm in order to run it on large-scale problems, such as adversarial poisoning attacks on benchmark computer vision datasets. Additional extensions of our work include the use of momentum to accelerate convergence as in Gidel et al. (2018), or developing an optimistic variant of our algorithm as in Daskalakis et al. (2017); Daskalakis and Panageas (2018).
Sotiris Anagnostidis is supported by the Onassis Foundation - Scholarship ID: F ZP 002-1/2019-2020.
References
Leonard Adolphs, Hadi Daneshmand, Aurelien Lucchi, and Thomas Hofmann. Local saddle point optimization: A curvature exploitation approach. arXiv preprint arXiv:1805.05751, 2018.
Charles Audet and John E Dennis Jr. Mesh adaptive direct search algorithms for constrained optimization. SIAM Journal on Optimization, 17(1):188–217, 2006.
Charles Audet and Dominique Orban. Finding optimal algorithmic parameters using derivative-free optimization. SIAM Journal on Optimization, 17(3):642–664, 2006.
Charles Audet, Kwassi Joseph Dzahini, Michael Kokkolaras, and Sébastien Le Digabel. StoMADS: Stochastic blackbox optimization using probabilistic estimates. To appear in Computational Optimization and Applications, 2021.
Aharon Ben-Tal, Laurent El Ghaoui, and Arkadi Nemirovski. Robust optimization, volume 28. Princeton University Press, 2009.
El Houcine Bergou, Youssef Diouane, Vyacheslav Kungurtsev, and Clément W Royer. A subsampling line-search method with second-order results. arXiv preprint arXiv:1810.07211, 2018.
Dimitris Bertsimas and Omid Nohadani. Robust optimization with simulated annealing. Journal of Global Optimization, 48(2):323–334, 2010.
Dimitris Bertsimas, Omid Nohadani, and Kwong Meng Teo. Robust optimization for unconstrained simulation-based problems.
Operations Research, 58(1):161–178, 2010.
Jose Blanchet, Coralia Cartis, Matt Menickelly, and Katya Scheinberg. Convergence rate analysis of a stochastic trust-region method via supermartingales. INFORMS Journal on Optimization, 1(2):92–119, 2019.
Ilija Bogunovic, Jonathan Scarlett, Stefanie Jegelka, and Volkan Cevher. Adversarially robust optimization with Gaussian processes. In Advances in Neural Information Processing Systems, pages 5760–5770, 2018.
Ruobing Chen, Matt Menickelly, and Katya Scheinberg. Stochastic optimization using a trust-region method and random models. Mathematical Programming, 169(2):447–487, 2018.
Yunmei Chen, Guanghui Lan, and Yuyuan Ouyang. Optimal primal-dual methods for a class of saddle point problems. SIAM Journal on Optimization, 24(4):1779–1814, 2014.
Ashish Cherukuri, Bahman Gharesifard, and Jorge Cortes. Saddle-point dynamics: conditions for asymptotic stability of saddle points. SIAM Journal on Control and Optimization, 55(1):486–511, 2017.
Andrew R Conn, Katya Scheinberg, and Luis Nunes Vicente. Introduction to derivative-free optimization, volume 8. SIAM, 2009.
Edoardo Conti, Vashisht Madhavan, Felipe Petroski Such, Joel Lehman, Kenneth O Stanley, and Jeff Clune. Improving exploration in evolution strategies for deep reinforcement learning via a population of novelty-seeking agents. arXiv preprint arXiv:1712.06560, 2017.
Ana Luisa Custódio, Youssef Diouane, Roholla Garmanjani, and Elisa Riccietti. Worst-case complexity bounds of directional direct-search methods for multiobjective optimization. Journal of Optimization Theory and Applications, 188(1):73–93, 2021.
Constantinos Daskalakis and Ioannis Panageas. The limit points of (optimistic) gradient descent in min-max optimization. In Advances in Neural Information Processing Systems, pages 9236–9246, 2018.
Constantinos Daskalakis, Andrew Ilyas, Vasilis Syrgkanis, and Haoyang Zeng. Training GANs with optimism. arXiv preprint arXiv:1711.00141, 2017.
Mahdi Dodangeh, Luís Nunes Vicente, and Zaikun Zhang. On the optimal order of worst case complexity of direct search. Optimization Letters, 10(4):699–708, 2016.
Kwassi Joseph Dzahini. Expected complexity analysis of stochastic direct-search. arXiv preprint arXiv:2003.03066, 2020.
Abraham D Flaxman, Adam Tauman Kalai, and H Brendan McMahan. Online convex optimization in the bandit setting: gradient descent without a gradient. arXiv preprint cs/0408007, 2004.
Gauthier Gidel, Reyhane Askari Hemmat, Mohammad Pezeshki, Remi Lepriol, Gabriel Huang, Simon Lacoste-Julien, and Ioannis Mitliagkas. Negative momentum for improved game dynamics. arXiv preprint arXiv:1807.04740, 2018.
Ian Goodfellow. NIPS 2016 tutorial: Generative adversarial networks. arXiv preprint arXiv:1701.00160, 2016.
Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2672–2680, 2014.
Serge Gratton, Clément W Royer, Luís Nunes Vicente, and Zaikun Zhang. Direct search based on probabilistic descent.
SIAM Journal on Optimization, 25(3):1515–1541, 2015.
Serge Gratton, Clément W Royer, and Luís Nunes Vicente. A second-order globally convergent direct-search method and its worst-case complexity. Optimization, 65(6):1105–1128, 2016.
Paulina Grnarova, Kfir Y Levy, Aurelien Lucchi, Thomas Hofmann, and Andreas Krause. An online learning approach to generative adversarial networks. arXiv preprint arXiv:1706.03269, 2017.
Nikolaus Hansen, Sibylle D Müller, and Petros Koumoutsakos. Reducing the time complexity of the derandomized evolution strategy with covariance matrix adaptation (CMA-ES). Evolutionary Computation, 11(1):1–18, 2003.
Warren Hare and Mason Macklem. Derivative-free optimization methods for finite minimax problems. Optimization Methods and Software, 28(2):300–312, 2013.
Warren Hare and Julie Nutini. A derivative-free approximate gradient sampling algorithm for finite minimax problems. Computational Optimization and Applications, 56(1):1–38, 2013.
Le Thi Khanh Hien, Renbo Zhao, and William B Haskell. An inexact primal-dual smoothing framework for large-scale non-bilinear saddle point problems. arXiv preprint arXiv:1711.03669, 2017.
Robert Hooke and Terry A Jeeves. Direct search solution of numerical and statistical problems. Journal of the ACM (JACM), 8(2):212–229, 1961.
Eric Jang, Shixiang Gu, and Ben Poole. Categorical reparameterization with Gumbel-softmax. arXiv preprint arXiv:1611.01144, 2016.
Chi Jin, Praneeth Netrapalli, and Michael I Jordan. Minmax optimization: Stable limit points of gradient descent ascent are locally optimal. arXiv preprint arXiv:1902.00618, 2019.
Rie Johnson and Tong Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In Advances in Neural Information Processing Systems, pages 315–323, 2013.
Hamed Karimi and Mark Schmidt. Linear convergence of proximal-gradient methods under the Polyak-Łojasiewicz condition. arXiv preprint arXiv:1608.04636, 2015.
Tamara G Kolda, Robert Michael Lewis, and Virginia Torczon. Optimization by direct search: New perspectives on some classical and modern methods. SIAM Review, 45(3):385–482, 2003.
Jakub Konečný and Peter Richtárik. Simple complexity analysis of simplified direct search. arXiv preprint arXiv:1410.0390, 2014.
Robert Michael Lewis, Virginia Torczon, and Michael W Trosset. Direct search methods: then and now. Journal of Computational and Applied Mathematics, 124(1-2):191–207, 2000.
Tengyuan Liang and James Stokes. Interaction matters: A note on non-asymptotic local convergence of generative adversarial networks. arXiv preprint arXiv:1802.06132, 2018.
Tianyi Lin, Chi Jin, and Michael Jordan. On gradient descent ascent for nonconvex-concave minimax problems. In
International Conference on Machine Learning, pages 6083–6093. PMLR, 2020.
Sijia Liu, Songtao Lu, Xiangyi Chen, Yao Feng, Kaidi Xu, Abdullah Al-Dujaili, Mingyi Hong, and Una-May O'Reilly. Min-max optimization without gradients: Convergence and applications to adversarial ml. arXiv preprint arXiv:1909.13806, 2019.
Sijia Liu, Songtao Lu, Xiangyi Chen, Yao Feng, Kaidi Xu, Abdullah Al-Dujaili, Mingyi Hong, and Una-May O'Reilly. Min-max optimization without gradients: Convergence and applications to black-box evasion and poisoning attacks. In International Conference on Machine Learning, pages 6282–6293. PMLR, 2020.
Niru Maheswaranathan, Luke Metz, George Tucker, Dami Choi, and Jascha Sohl-Dickstein. Guided evolutionary strategies: Augmenting random search with surrogate gradients. arXiv preprint arXiv:1806.10230, 2018.
Julien Marzat, Hélène Piet-Lahanier, and Eric Walter. Min-max hyperparameter tuning, with application to fault detection. IFAC Proceedings Volumes, 44(1):12904–12909, 2011.
Matt Menickelly and Stefan M Wild. Derivative-free robust optimization by outer approximations. Mathematical Programming, 179(1-2):157–193, 2020.
Hongseok Namkoong and John C Duchi. Variance-based regularization with convex objectives. In Advances in Neural Information Processing Systems, pages 2971–2980, 2017.
Yurii Nesterov. Introductory lectures on convex optimization: A basic course, volume 87. Springer Science & Business Media, 2013.
Yurii Nesterov and Vladimir Spokoiny. Random gradient-free minimization of convex functions. Foundations of Computational Mathematics, 17(2):527–566, 2017.
John von Neumann. Zur Theorie der Gesellschaftsspiele. Mathematische Annalen, 100(1):295–320, 1928.
Maher Nouiehed, Maziar Sanjabi, Tianjian Huang, Jason D Lee, and Meisam Razaviyayn. Solving a class of non-convex min-max games using iterative first order methods. In Advances in Neural Information Processing Systems, pages 14905–14916, 2019.
Shayegan Omidshafiei, Jason Pazis, Christopher Amato, Jonathan P How, and John Vian. Deep decentralized multi-task multi-agent reinforcement learning under partial observability. In
Proceedings of the 34th International Conference on Machine Learning - Volume 70, pages 2681–2690. JMLR.org, 2017.
Dmitrii M Ostrovskii, Andrew Lowy, and Meisam Razaviyayn. Efficient search of first-order Nash equilibria in nonconvex-concave smooth min-max problems. arXiv preprint arXiv:2002.07919, 2020.
Jong-Shi Pang and Meisam Razaviyayn. A unified distributed algorithm for non-cooperative games, 2016.
Courtney Paquette and Katya Scheinberg. A stochastic line search method with convergence rate analysis. arXiv preprint arXiv:1807.07994, 2018.
LA Rastrigin. The convergence of the random search method in the extremal control of a many parameter system. Automation & Remote Control, 24:1337–1342, 1963.
Luis Miguel Rios and Nikolaos V Sahinidis. Derivative-free optimization: a review of algorithms and comparison of software implementations. Journal of Global Optimization, 56(3):1247–1293, 2013.
Tim Salimans, Jonathan Ho, Xi Chen, Szymon Sidor, and Ilya Sutskever. Evolution strategies as a scalable alternative to reinforcement learning. arXiv preprint arXiv:1703.03864, 2017.
Maziar Sanjabi, Jimmy Ba, Meisam Razaviyayn, and Jason D Lee. On the convergence and robustness of training GANs with regularized optimal transport. In Advances in Neural Information Processing Systems, pages 7091–7101, 2018.
J Spall. Stochastic approximation and the finite-difference method. Introduction to Stochastic Search and Optimization: Estimation, Simulation and Control, pages 150–175, 2003.
Nilesh Tripuraneni, Mitchell Stern, Chi Jin, Jeffrey Regier, and Michael I Jordan. Stochastic cubic regularization for fast nonconvex optimization. In Advances in Neural Information Processing Systems, pages 2899–2908, 2018.
Joel A Tropp. An introduction to matrix concentration inequalities. arXiv preprint arXiv:1501.01571, 2015.
Luís Nunes Vicente. Worst case complexity of direct search. EURO Journal on Computational Optimization, 1(1-2):143–153, 2013.
Zhongruo Wang, Krishnakumar Balasubramanian, Shiqian Ma, and Meisam Razaviyayn. Zeroth-order algorithms for nonconvex minimax problems with improved complexities. arXiv preprint arXiv:2001.07819, 2020.
Ziyu Wang, Frank Hutter, Masrour Zoghi, David Matheson, and Nando de Freitas. Bayesian optimization in a billion dimensions via random embeddings. Journal of Artificial Intelligence Research, 55:361–387, 2016.
Guojun Zhang and Yaoliang Yu. Convergence of gradient methods on bilinear zero-sum games. arXiv preprint arXiv:1908.05699, 2019.
Kaiqing Zhang, Zhuoran Yang, and Tamer Başar. Multi-agent reinforcement learning: A selective overview of theories and algorithms. arXiv preprint arXiv:1911.10635, 2019.

Appendix A Algorithms

We present the omitted algorithms. Algorithm 3 depicts the updates for the minimization problem. At each outer iteration of Algorithm 2, a single successful step for the minimization problem is performed. In contrast to standard direct-search algorithms, we do not increase the step size parameter immediately after a successful step, but instead before the start of the next search for a new successful step, in order to simplify the notation in the upcoming proofs.
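The one-step scheme described here (and listed as Algorithm 3) can be sketched in a few lines. The coordinate positive spanning set, the quadratic forcing function ρ(σ) = cσ², and the default constants below are illustrative assumptions, not values prescribed by the paper:

```python
import numpy as np

def one_step_direct_search(f, x, sigma, c=0.1, gamma=2.0, sigma_max=10.0):
    """Return the first point satisfying sufficient decrease (one step).

    `f` plays the role of the estimate f_k. This is a sketch of the
    one-step scheme: expansion of sigma happens up front, because the
    previous outer iteration ended with a successful step.
    """
    n = len(x)
    # Positive spanning set D = {+e_i, -e_i}_i for the variables x.
    D = np.vstack([np.eye(n), -np.eye(n)])
    # Deferred step-size expansion after the last successful step.
    sigma = min(gamma * sigma, sigma_max)
    while True:
        # 1. Offspring generation: poll the 2n points x + sigma * d_i.
        candidates = x + sigma * D
        values = np.array([f(xi) for xi in candidates])
        # 2. Parent selection: keep the best polled point.
        best = int(np.argmin(values))
        # 3. Sufficient decrease test with forcing function c * sigma^2.
        if values[best] < f(x) - c * sigma**2:
            return candidates[best], sigma   # successful iteration
        sigma /= gamma                       # unsuccessful: shrink sigma
```

For instance, starting from x = (1, 0) on f(x) = ‖x‖² with σ = 0.5, the expanded scale σ = 1 immediately polls the successful point (0, 0).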
Algorithm 3: One-Step-Direct-Search(f, x, σ)
Input: f: objective function, with f_k its estimate at step k; x: initial point; σ: step size value; c: forcing function constant; γ > 1: step size update parameter.
Create the positive spanning set D for the variables x.
Update σ = min{γσ, σ_max}, as the last update was successful.
for k = 1, 2, . . . do
  1. Offspring generation: Generate the points x_i = x + σ_k d_i, for all d_i ∈ D.
  2. Parent selection: Choose x′ = argmin_i f_k(x_i).
  3. Sufficient decrease: if f_k(x′) < f_k(x) − ρ(σ_k) then (iteration is successful) return x′, σ_k; else (iteration is unsuccessful) decrease the step size: σ_{k+1} = γ⁻¹ σ_k.
end

Appendix B Proofs of Section 4
B.1 Proof of Lemma 1
Proof.
The result follows by applying Hölder's inequality:

E[ 1_{J_k^c} |F_k − f(X_k)| / (l_f Σ_k²) | F_{k−1} ] ≤ ( E[ 1_{J_k^c} | F_{k−1} ] )^{1/2} ( E[ |F_k − f(X_k)|² / (l_f² Σ_k⁴) | F_{k−1} ] )^{1/2}.

By Assumption 1, it holds that ( E[ |F_k − f(X_k)|² / (l_f² Σ_k⁴) | F_{k−1} ] )^{1/2} ≤ 1.

B.2 Proof of Theorem 2
Proof.
We begin by taking separate cases according to whether the estimates are accurate or not, and whether the steps of Algorithm 1 are successful or not. We use 1_{Succ_k} to denote the event that step k is successful.

Case 1: Accurate estimates.

• Successful step. At a successful step with accurate estimates we have that

1_{Succ_k} 1_{J_k} ( f(X_{k+1}) − f(X_k) )
= 1_{Succ_k} 1_{J_k} ( f(X_{k+1}) − f_k(X_{k+1}) + f_k(X_{k+1}) − f_k(X_k) + f_k(X_k) − f(X_k) )
≤ 1_{Succ_k} 1_{J_k} ( −(c − 2ε_f) Σ_k² ).

Therefore,

1_{Succ_k} 1_{J_k} ( Φ_{k+1} − Φ_k )
= 1_{Succ_k} 1_{J_k} ( v( f(X_{k+1}) − f(X_k) ) + (1 − v)Σ²_{k+1} − (1 − v)Σ_k² )
≤ 1_{Succ_k} 1_{J_k} ( −v(c − 2ε_f)Σ_k² + (1 − v)(γ² − 1)Σ_k² ).

• Unsuccessful step.

1_{Succ_k^c} 1_{J_k} ( Φ_{k+1} − Φ_k ) = 1_{Succ_k^c} 1_{J_k} ( (1 − v)Σ²_{k+1} − (1 − v)Σ_k² ) = 1_{Succ_k^c} 1_{J_k} ( −(1 − v)(1 − γ^{−2})Σ_k² ).

Combining the above results, and given that

v/(1 − v) ≥ (γ² − γ^{−2})/(c − 2ε_f) ⟹ −v(c − 2ε_f) + (1 − v)(γ² − 1) ≤ −(1 − v)(1 − γ^{−2}),

in the case of accurate estimates we have

E[ 1_{J_k} ( Φ_{k+1} − Φ_k ) | F_{k−1} ] ≤ −p_f (1 − v)(1 − γ^{−2}) Σ_k².   (11)

Case 2: Inaccurate estimates.

• Successful step.

1_{Succ_k} 1_{J_k^c} ( Φ_{k+1} − Φ_k ) = 1_{Succ_k} 1_{J_k^c} ( v( f(X_{k+1}) − f(X_k) ) + (1 − v)Σ²_{k+1} − (1 − v)Σ_k² )
= 1_{Succ_k} 1_{J_k^c} ( v( f(X_{k+1}) − f_k(X_{k+1}) + f_k(X_{k+1}) − f_k(X_k) + f_k(X_k) − f(X_k) ) + (1 − v)Σ²_{k+1} − (1 − v)Σ_k² )
≤ 1_{Succ_k} 1_{J_k^c} ( −vc Σ_k² + v| f(X_{k+1}) − f_k(X_{k+1}) | + v| f(X_k) − f_k(X_k) | + (1 − v)(γ² − 1)Σ_k² ),

where we will later bound the terms | f(X_{k+1}) − f_k(X_{k+1}) | and | f(X_k) − f_k(X_k) | using Lemma 1.

• Unsuccessful step. As before,

1_{Succ_k^c} 1_{J_k^c} ( Φ_{k+1} − Φ_k ) = 1_{Succ_k^c} 1_{J_k^c} ( (1 − v)Σ²_{k+1} − (1 − v)Σ_k² ) = 1_{Succ_k^c} 1_{J_k^c} ( −(1 − v)(1 − γ^{−2})Σ_k² ).

In total, for inaccurate estimates, and by using Assumption 1 and Lemma 1,

E[ 1_{J_k^c} ( Φ_{k+1} − Φ_k ) | F_{k−1} ] ≤ 2v (1 − p_f)^{1/2} l_f Σ_k².   (12)

Finally, integrating both successful and unsuccessful iterations,

E[ Φ_{k+1} − Φ_k | F_{k−1} ] ≤ −p_f (1 − v)(1 − γ^{−2}) Σ_k² + 2v (1 − p_f)^{1/2} l_f Σ_k² ≤ −(p_f/2) (1 − v)(1 − γ^{−2}) Σ_k²,

by our requirement p_f / √(1 − p_f) ≥ 4 v l_f / ( (1 − v)(1 − γ^{−2}) ).

B.3 Proof of Lemma 4
Proof.
Similar to Conn et al. (2009), for an unsuccessful step with accurate estimates we have, for some d_k ∈ D,

κ(D) ‖∇f(x_k)‖ ‖d_k‖ ≤ −∇f(x_k)^⊤ d_k.   (13)

By the mean value theorem, for some η_k ∈ [0, 1],

f(x_k + σ_k d_k) − f(x_k) = σ_k ∇f(x_k + η_k σ_k d_k)^⊤ d_k.

Since k is the index of an unsuccessful iteration, f_k(x_k + σ_k d_k) − f_k(x_k) + ρ(σ_k) ≥ 0, so

f(x_k + σ_k d_k) − f(x_k) = f(x_k + σ_k d_k) − f_k(x_k + σ_k d_k) + f_k(x_k + σ_k d_k) − f_k(x_k) + f_k(x_k) − f(x_k)
≥ −ε_f σ_k² − ρ(σ_k) − ε_f σ_k² = −(c + 2ε_f) σ_k².

Combining the above equations,

σ_k ∇f(x_k + η_k σ_k d_k)^⊤ d_k + (c + 2ε_f) σ_k² ≥ 0
⟹ ∇f(x_k + η_k σ_k d_k)^⊤ d_k + (c + 2ε_f) σ_k ≥ 0
⟹ −∇f(x_k)^⊤ d_k ≤ ( ∇f(x_k + η_k σ_k d_k) − ∇f(x_k) )^⊤ d_k + (c + 2ε_f) σ_k,   (14)

where in the last inequality we subtracted ∇f(x_k)^⊤ d_k from both sides. Finally, Eq. (13) implies

κ(D) ‖∇f(x_k)‖ ‖d_k‖ ≤ ( ∇f(x_k + η_k σ_k d_k) − ∇f(x_k) )^⊤ d_k + (c + 2ε_f) σ_k
⟹ κ(D) ‖∇f(x_k)‖ ≤ ‖ ∇f(x_k + η_k σ_k d_k) − ∇f(x_k) ‖ + (c + 2ε_f) σ_k ≤ L σ_k + (c + 2ε_f) σ_k.

B.4 Proof of Theorem 5
Proof.
By Theorem 2 and Lemma 4.10 from Paquette and Scheinberg (2018), we get that Assumption 2 is satisfied for Σ_ε = Cε. Then, by an application of Theorem 3, we get

E[ T_ε ] ≤ ( p_f/(2p_f − 1) ) · ( v( f(X_0) − f* ) + (1 − v)Σ_0² ) / ( (p_f/2)(1 − v)(1 − γ^{−2}) C² ε² ).

The result follows.
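As an informal illustration of the stochastic setting covered by this bound (and not a verification of its constants), one can run a coordinate direct search against a noisy oracle and observe that the iterates still approach a stationary point. The toy objective, noise model, spanning set, and constants below are all assumptions of ours:

```python
import numpy as np

def noisy_direct_search(f_noisy, x0, sigma0=1.0, c=0.1, gamma=2.0,
                        iters=200, seed=0):
    """Toy run of direct search with a noisy function oracle.
    Illustrative sketch only; not the paper's implementation."""
    rng = np.random.default_rng(seed)
    x = np.array(x0, dtype=float)
    sigma = sigma0
    n = len(x)
    D = np.vstack([np.eye(n), -np.eye(n)])      # coordinate spanning set
    for _ in range(iters):
        fx = f_noisy(x, rng)                    # noisy estimate at parent
        cands = x + sigma * D
        vals = np.array([f_noisy(xi, rng) for xi in cands])
        b = int(np.argmin(vals))
        if vals[b] < fx - c * sigma**2:         # sufficient decrease
            x, sigma = cands[b], gamma * sigma  # success: expand sigma
        else:
            sigma /= gamma                      # failure: shrink sigma
    return x

# Quadratic objective observed with small additive Gaussian noise.
oracle = lambda z, rng: float(z @ z) + 1e-3 * rng.standard_normal()
x_T = noisy_direct_search(oracle, [2.0, -1.5])
```

Because most evaluations are close to the true values, the sufficient decrease test filters out spurious improvements and the iterate drifts toward the minimizer at the origin.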
B.5 Proof of Theorem 6
We will also use the following additional result, which holds for any function with Lipschitz-continuous gradient.

Lemma 8. Let f : R^n → R be a continuously differentiable function with L-Lipschitz-continuous gradient and minimum value achieved at x*. Then

f(x) − f(x*) ≥ (1/(2L)) ‖∇f(x)‖².   (15)

Proof. By smoothness, and for y = x − (1/L)∇f(x), we have

f(x) − f(x*) ≥ f(x) − f(y) ≥ ⟨∇f(x), x − y⟩ − (L/2)‖y − x‖² = (1/L)‖∇f(x)‖² − (1/(2L))‖∇f(x)‖² = (1/(2L))‖∇f(x)‖².

We can now proceed with the proof of Theorem 6.
Proof.
We note that, under the conditions on the constants c, v and p_f, the requirements of Theorem 2 are also satisfied. We define

T_i = inf{ k ≥ 0 : f(X_k) − f* ≤ (f(X_0) − f*)/2^i },

with T_0 = 0. We will also use the random variable Λ_i = T_i − T_{i−1}. We will assume, without loss of generality, that

Σ_0² ≤ (8γ²/c)( f(X_0) − f* ) ≜ A ( f(X_0) − f* ),   (16)

for A = 8γ²/c. We apply Theorem 3. Given that f(X_{T_{i−1}}) − f* ≤ (f(X_0) − f*)/2^{i−1}, and that f(X_k) − f* > (f(X_0) − f*)/2^i for k ∈ [T_{i−1}, T_i) (possibly an empty set), by Lemma 4 and Definition 5, for step sizes Σ_k² ≤ C²μ ( f(X_0) − f* )/2^{i−1} and accurate estimates, steps are successful. Then, by Theorem 3 and an application of the results from Theorem 2 as before, we have

E[ Λ_i | F_{T_{i−1}−1} ]
≤ ( p_f/(2p_f − 1) ) · ( v( f(X_{T_{i−1}}) − f* ) + (1 − v)Σ²_{T_{i−1}} ) / ( (p_f/2)(1 − v)(1 − γ^{−2}) C²μ ( f(X_0) − f* )/2^{i−1} )
= 2/( (2p_f − 1)(1 − γ^{−2}) C²μ ) · ( (v/(1 − v)) ( f(X_{T_{i−1}}) − f* ) / ( ( f(X_0) − f* )/2^{i−1} ) + Σ²_{T_{i−1}} / ( ( f(X_0) − f* )/2^{i−1} ) )
≤ 2/( (2p_f − 1)(1 − γ^{−2}) C²μ ) · ( v/(1 − v) + Σ²_{T_{i−1}} / ( ( f(X_0) − f* )/2^{i−1} ) ).   (17)

We will further show by induction that E[ Σ²_{T_i} ] ≤ A ( f(X_0) − f* )/2^i. As a result,

E[ Λ_i ] ≤ 2/( (2p_f − 1)(1 − γ^{−2}) C²μ ) · ( v/(1 − v) + A ).   (18)

The final complexity is then

E[ T_{⌈log₂( 2L( f(X_0) − f* )/ε² )⌉} ] = E[ Λ_1 + Λ_2 + ··· + Λ_{⌈log₂( 2L( f(X_0) − f* )/ε² )⌉} ]
≤ 2/( (2p_f − 1)(1 − γ^{−2}) C²μ ) · ( v/(1 − v) + A ) · ⌈ log₂( 2L( f(X_0) − f* )/ε² ) ⌉.   (19)

Since ‖∇f(x)‖² ≤ 2L( f(x) − f* ) by Lemma 8, the result follows.

It remains to show that E[ Σ²_{T_i} ] ≤ A ( f(X_0) − f* )/2^i. By assumption, as aforementioned, it holds for T_0. We then assume that it holds for T_{i−1} and show that it also holds for T_i.
For each T_i, at the last step, T_i − 1, X was updated to satisfy the goal f(X_{T_i}) − f* ≤ ( f(X_0) − f* )/2^i. As in Theorem 2, we differentiate between the events of this step being accurate or not. Since we have a successful step,

f_{T_i−1}(X_{T_i}) − f_{T_i−1}(X_{T_i−1}) ≤ −c Σ²_{T_i−1}.

Then

f(X_{T_i}) − f(X_{T_i−1}) = f(X_{T_i}) − f_{T_i−1}(X_{T_i}) + f_{T_i−1}(X_{T_i}) − f_{T_i−1}(X_{T_i−1}) + f_{T_i−1}(X_{T_i−1}) − f(X_{T_i−1}).   (20)

We denote by p_Acc the probability of this last step being accurate. Note that this is not the same as p_f, as we are conditioning on a successful step. We then distinguish the two cases.

• Accurate estimates.
By Assumption 3 we get that

1_Acc ( f(X_{T_i}) − f(X_{T_i−1}) ) ≤ 1_Acc ( −(c − 2ε_f) Σ²_{T_i−1} ).   (21)

• Inaccurate estimates.
In this case, similarly to the proof of Lemma 1, we get

E[ 1_{Acc^c} ( f(X_{T_i}) − f(X_{T_i−1}) ) | F_{T_i−1} ] ≤ −(1 − p_Acc) c Σ²_{T_i−1} + 2 √(1 − p_Acc) l_f Σ²_{T_i−1}.   (22)

Combining the above cases, we get

E[ f(X_{T_i}) − f(X_{T_i−1}) | F_{T_i−1} ]
≤ −p_Acc (c − 2ε_f) Σ²_{T_i−1} − (1 − p_Acc) c Σ²_{T_i−1} + 2 √(1 − p_Acc) l_f Σ²_{T_i−1}
= −c Σ²_{T_i−1} ( 1 − 2 p_Acc ε_f/c − 2 √(1 − p_Acc) l_f/c )
≤ −(c/2) Σ²_{T_i−1}, for c > 8 max{ ε_f, l_f },
⟹ Σ²_{T_i−1} ≤ 2 E[ f(X_{T_i−1}) − f* | F_{T_i−1} ] / c.   (23)

Furthermore, by Theorem 2 we get that

E[ Φ_{T_i−1} | F_{T_{i−1}−1} ] ≤ Φ_{T_{i−1}}
⟹ E[ v( f(X_{T_i−1}) − f* ) | F_{T_{i−1}−1} ] ≤ v( f(X_{T_{i−1}}) − f* ) + (1 − v)Σ²_{T_{i−1}}
⟹ E[ f(X_{T_i−1}) − f* ] ≤ ( f(X_0) − f* )/2^{i−1} + ( (1 − v)/v ) E[ Σ²_{T_{i−1}} ]
≤ ( f(X_0) − f* )/2^{i−1} + ( (1 − v)/v ) A ( f(X_0) − f* )/2^{i−1}
= ( f(X_0) − f* )/2^{i−1} ( 1 + ( (1 − v)/v ) A ).   (24)

By combining (23) and (24) and using the law of iterated expectation, we have

E[ Σ²_{T_i} ] = E[ γ² Σ²_{T_i−1} ] ≤ ( f(X_0) − f* )/2^{i−1} · (2γ²/c) ( 1 + ( (1 − v)/v ) A ) ≤ A ( f(X_0) − f* )/2^i,   (25)

for v/(1 − v) ≥ 8γ²/c and A = 8γ²/c. The proof is complete.

Appendix C Proofs of Section 5
We first present some additional results required for our proof. From Karimi and Schmidt (2015), a function that satisfies the PL condition additionally satisfies the Quadratic Growth (QG) condition.
Lemma 9.
A differentiable function f that satisfies the PL condition with parameter μ also satisfies the QG condition with parameter μ:

f(x) − f* ≥ (μ/2) ‖x* − x‖²,

where x* belongs to the solution set X*.

Figure 3: For step sizes Σ > Σ_ε, the process corresponds to a biased reflected random walk. The dotted line indicates the barrier at position 0, which corresponds to a step size of Σ_ε.

Based on the previous lemma, we can easily prove the following result.

Lemma 10.
Let f be a differentiable μ-PL function and let x* ∈ argmin_x f(x). If we know that ‖∇f(x)‖ ≤ ε, then ‖x − x*‖ ≤ ε/μ.

Proof.
By Lemma 9 and the definition of the PL condition we have that

‖x − x*‖ ≤ √( (2/μ)( f(x) − f* ) ) ≤ (1/μ) ‖∇f(x)‖ ≤ ε/μ.

Lemma 11. (Lemma A.3 from Nouiehed et al. (2019)) Assume that −f(x, ·), for a specific x, is a μ-PL function in y. Define the set of optimal solutions Y(x) = argmax_y f(x, y). Then, for every x₁, x₂ ∈ X and y₁* ∈ Y(x₁), y₂* ∈ Y(x₂), it holds that

‖y₁* − y₂*‖ ≤ L_xy ‖x₁ − x₂‖,

where we denote L_xy = L/μ.

Next, we will need to establish a lower bound on the step size Σ. In the deterministic case, Lemma 4 establishes such a lower bound for unsuccessful steps, guaranteeing that if Σ = Σ_ε, then ‖∇f(x)‖ ≤ ε. However, in the stochastic case, inaccurate steps may occur. We want to ensure a lower bound on the step size parameter with high probability.

To do so, we will consider the worst-case scenario, in which step sizes get as small as possible. This corresponds to the case where, for all step sizes Σ > Σ_ε, unsuccessful steps occur, as do all of the inaccurate estimates, which happen with probability 1 − p_f. For convenience, we will ignore steps above the value Σ_ε, since we only require a bound. This corresponds to a random walk with a reflecting barrier at position 0 (which corresponds to the step size Σ_ε) and an increment probability 1 − p_f, where p_f is the probability of accurate estimates. We therefore use the following lemma to get a probabilistic lower bound on the step sizes.

Lemma 12. Consider a random walk starting at position 0, with a reflecting barrier at position 0 and transition probability matrix

[ p_f   1 − p_f   0        0       ··· ]
[ p_f   0        1 − p_f   0       ··· ]
[ 0     p_f      0        1 − p_f  ··· ]
[ ⋮           ⋱        ⋱        ⋱     ]

for p_f > 1/2. Then, for

k ≥ log( 1 − e^{log(δ)/n} ) / log( (1 − p_f)/p_f ) − 1,

a random walk of length n stays confined within the space [0, k] with probability at least δ > 0.

Proof.
Consider the random walk S_n = max{ S_{n−1} + X_n, 0 }, with S_0 = 0 and P(X_n = 1) = 1 − p_f, P(X_n = −1) = p_f, for p_f > 1/2. The probability that the random walk stays below position k, P(S_i ≤ k, ∀i ≤ n), is bounded below by the probability that n points chosen independently from the stationary distribution all lie at positions lower than or equal to k.

Let us denote by p_{i,n} the probability that the random walk is at position i after n total steps. We first prove by induction that

p_{i,n} ≥ ( p_f/(1 − p_f) ) p_{i+1,n}.   (26)

It obviously holds for n = 0, as p_{0,0} = 1 and p_{i,0} = 0 for all i ≥ 1. Assume that it holds for n. As shown in Fig. 3, position i is incremented with probability 1 − p_f; therefore, for i ≥ 1,

p_{i,n+1} = p_{i−1,n}(1 − p_f) + p_{i+1,n} p_f
≥ ( p_f/(1 − p_f) ) p_{i,n}(1 − p_f) + ( p_f/(1 − p_f) ) p_{i+2,n} p_f   (by induction)
= ( p_f/(1 − p_f) ) ( p_{i,n}(1 − p_f) + p_{i+2,n} p_f )
= ( p_f/(1 − p_f) ) p_{i+1,n+1},

and for i = 0,

p_{0,n+1} = p_{0,n} p_f + p_{1,n} p_f
≥ p_{0,n} p_f + ( p_f/(1 − p_f) ) p_{2,n} p_f   (by induction)
= ( p_f/(1 − p_f) ) ( p_{0,n}(1 − p_f) + p_{2,n} p_f )
= ( p_f/(1 − p_f) ) p_{1,n+1}.

Let us now consider the probability that the random walk resides within the first k positions. Then:

Σ_{i=0}^{k} p_{i,n+1} = p_{0,n} p_f + p_{1,n} p_f + Σ_{i=1}^{k} ( p_{i−1,n}(1 − p_f) + p_{i+1,n} p_f )
= Σ_{i=0}^{k} p_{i,n} − p_{k,n}(1 − p_f) + p_{k+1,n} p_f
≤ Σ_{i=0}^{k} p_{i,n},   by (26),

where the equality in the second line is due to the telescoping of the terms in the sum of the first line. As a result, we can lower bound the probability Σ_{i=0}^{k} p_{i,n} by the corresponding one for n → ∞, which corresponds to the stationary distribution. Also,

P(S_i ≤ k | S_{i−1} ≤ k) = P(S_i ≤ k | S_{i−1} = k) P(S_{i−1} = k) + P(S_i ≤ k | S_{i−1} < k) P(S_{i−1} < k)
= p_f P(S_{i−1} = k) + P(S_{i−1} < k) ≥ p_f,

and

P(S_i ≤ k | S_{i−1} > k) = P(S_i ≤ k | S_{i−1} = k + 1) P(S_{i−1} = k + 1) + P(S_i ≤ k | S_{i−1} > k + 1) P(S_{i−1} > k + 1)
= p_f P(S_{i−1} = k + 1) + 0 · P(S_{i−1} > k + 1) ≤ p_f ≤ P(S_i ≤ k | S_{i−1} ≤ k).

As a result,

P(S_i ≤ k) = P(S_i ≤ k | S_{i−1} ≤ k) P(S_{i−1} ≤ k) + P(S_i ≤ k | S_{i−1} > k) P(S_{i−1} > k) ≤ P(S_i ≤ k | S_{i−1} ≤ k).   (27)

The probability of a random walk of length n to stay within the first k positions, for k ≥ 0, is then

P(S_i ≤ k, ∀i ≤ n) = P(S_0 ≤ k) ∏_{i=1}^{n} P(S_i ≤ k | S_j ≤ k, ∀j ∈ [0, i − 1])
≥ ∏_{i=1}^{n} P(S_i ≤ k | S_{i−1} ≤ k),   with P(S_0 ≤ k) = 1 for all k ≥ 0,
≥ ∏_{i=1}^{n} P(S_i ≤ k),   by Eq. (27),
= ∏_{j=1}^{n} ( Σ_{i=0}^{k} p_{i,j} ) ≥ ( Σ_{i=0}^{k} π_i )^n,

where π_i denotes the stationary probability of the random walk at position i ∈ N. From the recursive relation we get π_i = ( p_f/(1 − p_f) ) π_{i+1}, which means π_i = ( (1 − p_f)/p_f )^i π_0. We now calculate the probability π_{≤k} of a randomly chosen point to be part of the first k positions for the stationary distribution:

π_{≤k} = Σ_{i=0}^{k} π_i = π_0 Σ_{i=0}^{k} ( (1 − p_f)/p_f )^i / ( π_0 Σ_{i=0}^{∞} ( (1 − p_f)/p_f )^i ) = 1 − ( (1 − p_f)/p_f )^{k+1}.

Thus the required probability is lower bounded by requiring

π_{≤k}^n ≥ δ ⟹ ( 1 − ( (1 − p_f)/p_f )^{k+1} )^n ≥ δ
⟺ log( 1 − ( (1 − p_f)/p_f )^{k+1} ) ≥ log(δ)/n
⟺ ( (1 − p_f)/p_f )^{k+1} ≤ 1 − e^{log(δ)/n}
⟺ k ≥ log( 1 − e^{log(δ)/n} ) / log( (1 − p_f)/p_f ) − 1,   (28)

where for the last step we used the fact that log( (1 − p_f)/p_f ) < 0, since p_f > 1/2 implies (1 − p_f)/p_f < 1.
We denote by c_x the constant used for the sufficient decrease condition of the min problem. We first prove the deterministic case. In the deterministic case, Theorems 5 and 6 hold deterministically, meaning that we can reduce the norm of the gradient below a threshold ε in O(ε^{−2}) iterations for the nonconvex case, and in O(log(ε^{−1})) iterations for the case where the function satisfies our PL condition.

At each step, the max problem is solved almost exactly, which is guaranteed by Theorem 6 and Algorithm 1. Then

‖∇_y f(x_{t−1}, y_t)‖ ≤ ε_max,

for an accuracy ε_max to be specified later. In the proof, we will show that, for a particular choice of the forcing function constant, the improvement on the minimization problem exceeds the possible deterioration caused by the updates of the max problem. By Assumption 3 (Lipschitz continuity),

‖∇_y f(x_t, y_t) − ∇_y f(x_{t−1}, y_t)‖ ≤ L ‖x_t − x_{t−1}‖ = L σ_t ⟹ ‖∇_y f(x_t, y_t)‖ ≤ L σ_t + ε_max,   (29)

for a successful update. Here σ_t denotes the step size used for the minimization step throughout Algorithms 2 and 3. We note that σ_t always belongs to a successful step, by the notation used in Algorithm 2. Also, by the triangle inequality, we have (let y_t* and y*_{t+1} belong to the optimal solution sets at iterations t and t + 1, respectively)

‖y_{t+1} − y_t‖ = ‖y_{t+1} − y*_{t+1} + y*_{t+1} − y_t* + y_t* − y_t‖ ≤ ‖y_{t+1} − y*_{t+1}‖ + ‖y*_{t+1} − y_t*‖ + ‖y_t* − y_t‖.   (30)

By Lemma 11 we have that ‖y*_{t+1} − y_t*‖ ≤ L_xy σ_t, since y*_{t+1} ∈ Y(x_t) and y_t* ∈ Y(x_{t−1}) (we remind the reader that Y(x) = argmax_y f(x, y)).
Also, as a consequence of Definition 5 and Lemma 10, we have both $\|y_{t+1} - y_{t+1}^*\| \le \epsilon_{\max}/\mu$ and $\|y_t - y_t^*\| \le \epsilon_{\max}/\mu$. As a result,
$$\|y_{t+1} - y_t\| \le \frac{2\epsilon_{\max}}{\mu} + L_{xy}\sigma_t. \qquad (31)$$
Finally, for a successful update of Algorithm 3 we have
$$f(x_t, y_{t+1}) - f(x_t, y_t) \le \langle\nabla_y f(x_t, y_t),\, y_{t+1} - y_t\rangle + \frac{L}{2}\|y_{t+1} - y_t\|^2 \le (L\sigma_t + \epsilon_{\max})\Big(L_{xy}\sigma_t + \frac{2\epsilon_{\max}}{\mu}\Big) + \frac{L}{2}\Big(L_{xy}\sigma_t + \frac{2\epsilon_{\max}}{\mu}\Big)^2 = D_1\sigma_t^2 + D_2\sigma_t\epsilon_{\max} + D_3\epsilon_{\max}^2, \qquad (32)$$
for $D_1 \triangleq L L_{xy} + \frac{L}{2}L_{xy}^2$, $D_2 = \frac{2L}{\mu} + L_{xy} + \frac{2L L_{xy}}{\mu}$ and $D_3 = \frac{2}{\mu}\big(1 + \frac{L}{\mu}\big)$. During the updates of the minimization problem we have that $\sigma \ge \sigma_{\min} = \sigma_\epsilon$, with $\sigma_\epsilon = C\epsilon$.
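As a sanity check on the algebra above, the expansion defining $D_1$, $D_2$ and $D_3$ can be verified numerically at random parameter values; the sampled values of $L$, $L_{xy}$, $\mu$, $\sigma_t$ and $\epsilon_{\max}$ below are arbitrary placeholders, not values from the paper.

```python
import random

# Check numerically that
#   (L*s + e) * (Lxy*s + 2*e/mu) + (L/2) * (Lxy*s + 2*e/mu)**2
#     = D1*s**2 + D2*s*e + D3*e**2
# with D1 = L*Lxy + (L/2)*Lxy**2, D2 = 2*L/mu + Lxy + 2*L*Lxy/mu,
# and D3 = (2/mu) * (1 + L/mu), as in Eq. (32).
random.seed(0)
for _ in range(100):
    L, Lxy, mu, s, e = (random.uniform(0.1, 5.0) for _ in range(5))
    lhs = (L * s + e) * (Lxy * s + 2 * e / mu) \
        + (L / 2) * (Lxy * s + 2 * e / mu) ** 2
    D1 = L * Lxy + (L / 2) * Lxy ** 2
    D2 = 2 * L / mu + Lxy + 2 * L * Lxy / mu
    D3 = (2 / mu) * (1 + L / mu)
    rhs = D1 * s ** 2 + D2 * s * e + D3 * e ** 2
    assert abs(lhs - rhs) < 1e-9 * max(1.0, abs(lhs))
```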
Here $C$, which is defined in Lemma 4, collects the constants of the min problem. We want to ensure that
$$f(x_t, y_{t+1}) - f(x_{t-1}, y_t) < -K\sigma_t^2, \qquad (33)$$
for some $K > 0$ and $\sigma_t \ge \sigma_{\min}$, in order to then apply Theorem 2, with $f^*$ the minimum of $f$ at each $y_t$. Taking also into account our sufficient decrease condition, we want to make sure that the following holds for the polynomial $p$:
$$p(\sigma_t) \triangleq K\sigma_t^2 - c_x\sigma_t^2 + D_1\sigma_t^2 + D_2\sigma_t\epsilon_{\max} + D_3\epsilon_{\max}^2 \le 0 \quad \text{for } \sigma_t \ge \sigma_{\min}. \qquad (34)$$
To establish this, since the quadratic has a negative leading coefficient (for $c_x > D_1 + K$), we just need to ensure that its maximum occurs at position
$$\frac{D_2\,\epsilon_{\max}}{2(c_x - K - D_1)} \le C\epsilon \;\Longrightarrow\; \epsilon_{\max} \le \frac{2\epsilon C(c_x - K - D_1)}{D_2}, \qquad (35)$$
and also that
$$p(C\epsilon) \le 0 \;\Longleftrightarrow\; (-c_x + K + D_1)C^2\epsilon^2 + D_2 C\epsilon\,\epsilon_{\max} + D_3\epsilon_{\max}^2 \le 0. \qquad (36)$$
For this final condition to hold,
$$\epsilon_{\max} \le \frac{\epsilon C\big({-D_2} + \sqrt{D_2^2 + 4(c_x - K - D_1)D_3}\big)}{2D_3}. \qquad (37)$$
In the stochastic case, we apply Theorems 5 and 6 as is to get the expected number of steps. In this case, however, the step size may become smaller than the pre-specified $\sigma_{\min}$ parameter due to inaccurate estimates. We can then use Lemma 12 to get a high-probability bound on this minimum step-size value. More specifically, for the number of iterates $n$ specified by Theorem 5 and for
$$k \ge \frac{\log\big(1 - e^{\frac{1}{n}\log(\delta)}\big)}{\log\big(\frac{1-p_x}{p_x}\big)} - 1,$$
throughout the updates $\sigma \ge \sigma'_{\min} = \frac{1}{\gamma^{k}}\sigma_{\min}$, with probability at least $\delta > 0$,
where $\gamma$ is the update parameter for the min problem in Algorithm 3. We then get the analogous bounds
$$\epsilon_{\max} \le \epsilon\,\min\left\{\frac{2C(c_x - K - D_1)}{\gamma^{k} D_2},\; \frac{C\big({-D_2} + \sqrt{D_2^2 + 4(c_x - K - D_1)D_3}\big)}{2D_3\,\gamma^{k}}\right\}.$$
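As an illustration, the chain of conditions above can be checked numerically: pick $k$ as in Lemma 12, form $\sigma'_{\min}$ and the resulting $\epsilon_{\max}$, and verify that $p(\sigma) \le 0$ for all $\sigma \ge \sigma'_{\min}$. All constants below are illustrative placeholders (chosen only to satisfy $p_x > \frac12$ and $c_x > D_1 + K$), not values from the paper.

```python
import math

# Placeholder constants satisfying the standing assumptions.
p_x, delta, n, gamma = 0.8, 0.95, 100, 2.0
cx, K, D1, D2, D3 = 4.0, 0.5, 1.0, 2.0, 3.0
C, eps = 0.1, 0.01

# Step 1: k from Lemma 12 / Eq. (28) (with p_x in place of p_f), so that
# sigma >= sigma'_min = sigma_min / gamma**k with probability >= delta.
r = (1.0 - p_x) / p_x
k = math.ceil(math.log(1.0 - delta ** (1.0 / n)) / math.log(r) - 1.0)
assert (1.0 - r ** (k + 1)) ** n >= delta    # the guarantee behind Eq. (28)
sigma_floor = C * eps / gamma ** k           # sigma'_min with sigma_min = C*eps

# Step 2: eps_max from the two stochastic-case bounds, i.e. the
# deterministic bounds (35) and (37) with C*eps replaced by C*eps/gamma**k.
vertex_bound = 2.0 * sigma_floor * (cx - K - D1) / D2
root_bound = sigma_floor * (-D2 + math.sqrt(D2 ** 2 + 4.0 * (cx - K - D1) * D3)) / (2.0 * D3)
eps_max = min(vertex_bound, root_bound)

# Step 3: p(sigma) <= 0 must hold for all sigma >= sigma'_min, which is
# what yields the sufficient decrease (33) with constant K.
def p(sigma):
    return (K - cx + D1) * sigma ** 2 + D2 * sigma * eps_max + D3 * eps_max ** 2

assert p(sigma_floor) <= 1e-12
for i in range(1, 1000):
    assert p(sigma_floor * (1.0 + 0.1 * i)) <= 0.0
```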
We note that $K$ acts as a new sufficient decrease constant and should be taken into account in all assumptions of Theorem 5, namely $K > \epsilon_x$, which holds for the constant $c_x > D_1 + K > D_1 + 2\epsilon_x$.

Appendix D Experimental Setup
D.1 Robust Optimization
The Wisconsin breast cancer data set is a binary classification task with 569 samples in total, each having 30 attributes. We use a simple neural network with a hidden layer of size 50 and a LeakyReLU activation. This choice of activation accommodates the GDA baseline by providing additional gradient information. All networks across methods and folds are initialized with the same weights. For the GDA method we tried a range of different learning rates from the set { }, but only present results for the cases that converged.

In Fig. 4 we present the evolution of the zero-one error across epochs for each method. We stress that one epoch for the GDA approach corresponds to one update each for the max and the min problem, whereas one epoch for DR corresponds to a series of updates for the max problem (at most 10) followed by a single update for the min problem. GDA was run for a total of 10000 epochs and DR for a maximum of 2000 epochs, but DR usually converges much faster than that. GDA suffers considerably more from poor initializations compared to DR. In Fig. 4, constant large errors correspond to the network constantly outputting a single class of the problem (the dataset is unbalanced, with class rates 0.63 and 0.37).
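For concreteness, a minimal sketch of the classifier's forward pass (30 inputs, one hidden layer of 50 LeakyReLU units, a sigmoid output) is given below; the random weights and the LeakyReLU slope of 0.01 are illustrative assumptions, not the trained model from the experiments.

```python
import math, random

def leaky_relu(z, slope=0.01):
    """LeakyReLU keeps a small slope for z < 0, which is what provides
    the extra gradient information useful to the GDA baseline."""
    return z if z > 0 else slope * z

def forward(x, W1, b1, W2, b2):
    """Two-layer classifier: 30 inputs -> 50 hidden (LeakyReLU) -> 1 logit."""
    h = [leaky_relu(sum(w * xi for w, xi in zip(row, x)) + b)
         for row, b in zip(W1, b1)]
    logit = sum(w * hi for w, hi in zip(W2, h)) + b2
    return 1.0 / (1.0 + math.exp(-logit))    # sigmoid class probability

# Illustrative random initialization (shared across methods in the paper).
random.seed(0)
W1 = [[random.gauss(0, 0.1) for _ in range(30)] for _ in range(50)]
b1 = [0.0] * 50
W2 = [random.gauss(0, 0.1) for _ in range(50)]
b2 = 0.0
prob = forward([0.5] * 30, W1, b1, W2, b2)
assert 0.0 < prob < 1.0
```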
[Figure 4 panels: error vs. epoch under regularization 0.01, 0.02 and 0.05; legend: DR, GDA (lr 0.005), GDA (lr 0.05), GDA (lr 0.01).]
Figure 4: Misclassification error across epochs for each method.
D.2 Toy Examples
Although in the following examples the objective of the max player is nonconcave and does not satisfy the PL condition, empirical results demonstrate that the proposed algorithm can be successful. We begin by illustrating examples of GANs learning different 2D underlying distributions in a continuous case in Fig. 5. Both the generator and the discriminator have 2 hidden layers of size 350 (64 for learning a mixture of Gaussians in a grid formation) with Tanh activations, and we also use spectral normalization for the discriminator. In all scenarios, we sample the latent code from a lower-dimensional space N(0, I), such that it matches the data dimensionality, allowing the generator to learn a simpler mapping (as in Grnarova et al. (2017)).

[Figure 5 panels: generated samples vs. target for the Ring (steps 0 to 90000), Grid (steps 0 to 150000) and Swiss roll (steps 0 to 45000) distributions.]
Figure 5: Mode collapse check for direct-search methods for three different problems in the continuous setting.

Motivated by these encouraging results, we proceed to a discrete setting, where each of the 2 dimensions of the underlying distribution is parametrized by a categorical variable. The choice of this categorical variable makes the objective function of the generator nondifferentiable. As mentioned above, our algorithm can support multi-categorical data. In the current literature, the most popular methods to deal with this kind of scenario are baselines based on the Gumbel-softmax or the REINFORCE algorithm. Due to their sampling techniques, though, these baselines exhibit an exponential dependence on the number of parameters of the model.

We briefly describe how training is performed for each of the baselines used. Based on the output logits $o$ of size $n$, the result of a projection layer, each method samples a new point $y$.

Gumbel-softmax
Using the Gumbel-max trick, the sampling can be parametrized as
$$y = \text{one-hot}\big(\arg\max_{1\le i\le n}(o^{(i)} + g^{(i)})\big), \qquad (38)$$
where the $g^{(i)}$ are sampled i.i.d. from the Gumbel distribution. To enable the calculation of gradients, this is relaxed to the result of a softmax operation
$$\hat{y} = \sigma\Big(\frac{o + g}{T}\Big), \qquad (39)$$
with $T > 0$ denoting the temperature.
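A minimal pure-Python sketch of both the hard Gumbel-max sample of Eq. (38) and its softmax relaxation of Eq. (39); the logits used in the demonstration are arbitrary placeholders.

```python
import math, random

def gumbel_max_sample(logits):
    """Hard sample via the Gumbel-max trick, Eq. (38):
    y = one_hot(argmax_i(o_i + g_i)), with g_i ~ Gumbel(0, 1)."""
    g = [-math.log(-math.log(random.random())) for _ in logits]
    idx = max(range(len(logits)), key=lambda i: logits[i] + g[i])
    return [1.0 if i == idx else 0.0 for i in range(len(logits))]

def gumbel_softmax_sample(logits, T):
    """Relaxed sample, Eq. (39): y_hat = softmax((o + g) / T)."""
    g = [-math.log(-math.log(random.random())) for _ in logits]
    z = [(o + gi) / T for o, gi in zip(logits, g)]
    m = max(z)                                  # subtract max for stability
    exps = [math.exp(zi - m) for zi in z]
    s = sum(exps)
    return [e / s for e in exps]

random.seed(0)
o = [1.0, 2.0, 0.5]
y = gumbel_max_sample(o)
assert sum(y) == 1.0 and set(y) <= {0.0, 1.0}   # valid one-hot vector
y_hat = gumbel_softmax_sample(o, T=0.1)         # low T: close to one-hot
assert abs(sum(y_hat) - 1.0) < 1e-9 and max(y_hat) > 0.9
```

As the temperature $T$ decreases, the relaxed sample concentrates on a single coordinate, mirroring the behavior described next.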
The temperature controls the softness of the sampling: a high initial temperature forces more exploration, while as the temperature decreases, $\hat{y}$ becomes a better approximation of $y$, leading however to steeper gradients and more instabilities. This is also the reason why gradient clipping is crucial for the stability of this method. Untimely updates of the temperature are known to promote mode collapse. For all experiments, we use an exponential decay update scheme, decreasing the temperature after a predefined number of steps.

REINFORCE
We sample a new point $s$ from the output logits and assign a specific reward according to the output of the discriminator, $r = 2(D(s) - 0.5)$.

Direct-search