Efficiently avoiding saddle points with zero order methods: No gradients required

Lampros Flokas*
Department of Computer Science, Columbia University, New York, NY 10025
[email protected]

Emmanouil V. Vlatakis-Gkaragkounis*
Department of Computer Science, Columbia University, New York, NY 10025
[email protected]

Georgios Piliouras
Engineering Systems and Design, Singapore University of Technology and Design, Singapore
[email protected]

* Equal contribution
Abstract
We consider the case of derivative-free algorithms for non-convex optimization, also known as zero order algorithms, that use only function evaluations rather than gradients. For a wide variety of gradient approximators based on finite differences, we establish asymptotic convergence to second order stationary points using a carefully tailored application of the Stable Manifold Theorem. Regarding efficiency, we introduce a noisy zero-order method that converges to second order stationary points, i.e. avoids saddle points. Our algorithm uses only Õ(1/ε²) approximate gradient calculations and, thus, it matches the convergence rate guarantees of its exact gradient counterparts up to constants. In contrast to previous work, our convergence rate analysis avoids imposing additional dimension dependent slowdowns in the number of iterations required for non-convex zero order optimization.

1 Introduction

Given a function f : R^d → R, solving the problem x* = arg min_{x ∈ R^d} f(x) is one of the building blocks that many machine learning algorithms are based on. The difficulty of this problem varies significantly depending on the properties of f and the way we can access information about it. The general case of non-convex functions makes the problem significantly more challenging, since first order stationary points can be global or local optima as well as saddle points. In fact, discovering global optima is an NP-hard problem in general, and even for quartic functions verifying local optima is a co-NP complete problem [Murty and Kabadi, 1987, Lee et al., 2019]. While local optima may be satisfactory for some applications in machine learning Choromanska et al. [2015], saddle points can make high dimensional non-convex optimization tasks significantly more difficult Dauphin et al. [2014], Sun et al. [2018]. Therefore, researchers have focused their efforts on functions possessing the strict saddle property. Under this property, Hessians of f evaluated at saddle points have at least one negative eigenvalue, making detection of saddle points tractable. Given this assumption, methods that use second order information like computing Hessians or Hessian-vector products [Nesterov and Polyak, 2006, Carmon and Duchi, 2016, Agarwal et al., 2017] can converge to second order stationary points (SOSPs) and thus avoid strict saddle points. Recent work [Ge et al., 2015, Levy, 2016, Jin et al., 2017, Lee et al., 2019, Allen-Zhu and Li, 2018, Jin et al., 2018b] has also shown that gradient descent (and its variants) can avoid strict saddle points and converge to local minima.

Unfortunately, access to gradient evaluations is not available in all settings of interest. Even with the advent of automatic differentiation software, there are several applications where computation of gradients is either computationally inefficient or even impossible. Examples of such applications are hyper-parameter tuning of machine learning models Snoek et al. [2012], Salimans et al. [2017], Choromanski et al. [2018], black-box adversarial attacks on deep neural networks Papernot et al. [2017], Madry et al. [2018], Chen et al. [2017], computer network control Liu et al. [2018a], variational approaches to graphical models Wainwright and Jordan [2008] and simulation based Rubinstein and Kroese [2016], Spall [2003] or bandit feedback optimization Agarwal et al. [2010], Chen and Giannakis [2019].
Zero order methods, also known as black-box methods, try to address these issues by employing only evaluations of the function f during the optimization procedure. The case of convex functions is well understood Nesterov and Spokoiny [2017], Duchi et al. [2015], Agarwal et al. [2010]. For the non-convex case, there has been a considerable amount of work on the convergence to first order stationary points, both for deterministic settings Nesterov and Spokoiny [2017] and stochastic ones Ghadimi and Lan [2013], Wang et al. [2018], Balasubramanian and Ghadimi [2018], Liu et al. [2018b], Gu et al. [2016].

The case of SOSPs has been so far comparatively under-studied. It has been established that SOSPs are achievable through zero order trust region methods that employ fully quadratic models Conn et al. [2009]. The disadvantage of trust region methods is that their computation cost per iteration is O(d²), which quickly becomes prohibitive as we increase the number of dimensions d. More recently, the authors of Jin et al. [2018a] studied the case of finding local minima of functions having access only to approximate function or gradient evaluations. They manage to reduce zero order optimization to the stochastic first order optimization of a Gaussian smoothed version of f. While this approach yields guarantees of convergence to SOSPs, each stochastic gradient evaluation requires O(poly(d, 1/ε)) function evaluations. This leads to significantly less efficient optimization algorithms when compared to their first order counterparts. It is therefore yet unclear whether there are scalable zero order methods that can safely avoid strict saddle points and always converge to local minima of f. To the best of our knowledge, our work is the first one to establish a positive answer to this important question.

Our results. We prove that zero order optimization methods solve general non-convex problems efficiently.
In a nutshell, we present a family of zero order optimization methods which provably converge to SOSPs. Our proof includes a new, elaborate analysis based on the Stable Manifold Theorem (see Section 4). Additionally, the number of approximate gradient evaluations matches the standard bounds for first order methods in non-convex problems (see Table 1 & Section 5).
Algorithm | Oracle | Iterations | Evaluations of f | Theorem
PAGD (this paper) | Function Evaluations + Noise | Õ(1/ε²) | Õ(d/ε²) | Theorem 4
FPSGD Jin et al. [2018a] | Approx. Gradient + Noise | Õ(d/ε²) | Õ(d²/ε²) | -
ZPSGD Jin et al. [2018a] | Function Evaluations + Noise | Õ(1/ε²) | Õ(poly(d, 1/ε)) | -
Jin et al. [2017] | Exact Gradient + Noise | Õ(1/ε²) | - | -

Table 1: Oracle model and iteration complexity to SOSPs.
Algorithms. Instead of focusing on a single finite differences algorithm, we construct a general framework of approximate gradient oracles that generalizes over many finite differences approaches in the literature. We then use these approximate gradient oracles to devise approximate gradient descent algorithms. For more details see Section 3.3 and Definition 4.

Asymptotic convergence.
We use the stable manifold theorem to prove that zero order methods can almost surely avoid saddle points. In contrast to the analysis of Lee et al. [2019] for first order methods, the zero order case is more demanding. Convergence to first order stationary points requires changing the gradient approximation accuracy over the iterations and, thus, the equivalent dynamical system is time dependent. By reducing our time dependent dynamical system to a time invariant one defined on an expanded state space, we are able to obtain provable guarantees about avoiding saddle points. To extend our guarantees of convergence to deterministic choices of the initial accuracy, we provide a carefully tailored application of the Stable Manifold Theorem that analyzes the structure of the stable manifolds of the dynamical system. Our results on saddle point avoidance extend to functions with non-isolated critical points, where convergence of the iterates to a single point is not automatic; to address this, we provide sufficient conditions for point-wise convergence of the iterates of approximate gradient descent methods for the case of analytic functions.
Convergence rates for noisy dynamics.
In order to produce fast convergence rates, as in the case of first order methods Jin et al. [2017], it is useful to consider perturbed/noisy versions of the dynamics. Once again the case of zero order methods poses distinct hurdles. Close to critical points of f, approximations of the potentially arbitrarily small gradient can be very noisy. Iterates of exact gradient descent and approximate gradient descent may diverge significantly in this case. In fact, provably escaping saddle points by guaranteeing decrease of the value of f is more challenging for the case of approximate gradient descent since it is not a descent algorithm. A key technical step is to show that the negative curvature dynamics that enable gradient descent to escape saddle points are robust to gradient approximation errors. As long as the gradient approximation error is smaller than a fixed a-priori known threshold, zero order methods can provably escape saddle points. Based on this, we are able to prove that zero order methods can converge to approximate SOSPs with the same number of approximate gradient evaluations as provided by Jin et al. [2017], up to constants.

It is worth pointing out that achieving an Õ(ε^-2) bound on the number of approximate gradient evaluations requires conceptually different techniques from other recent approaches in zero order methods. Indeed, previous work on randomized and stochastic zero order optimization [Nesterov and Spokoiny, 2017, Ghadimi and Lan, 2013] has relied on treating randomized approximate gradients of f as, in expectation, exact gradients of a carefully constructed smoothed version of f. Then, with some additional work, convergence arguments for the smoothed version of f can be transferred to f itself. Although these arguments are applicable to our case as well, as shown by the work of Jin et al. [2018a], they also lead to a slowdown both in terms of the dimension d and the required accuracy ε. The main reasons behind this slowdown are that the Lipschitz constants of the smoothed version of f depend on d and the high variance of the stochastic gradient estimators. To sidestep both issues, we analyze the effect of the gradient approximation error directly on the optimization of f.

Our work builds and improves upon previous finite difference approaches for non-convex optimization and provides SOSP guarantees previously only reserved for computationally expensive methods.
First Order Algorithms
A recent line of work has shown that gradient descent and variations of it can actually converge to SOSPs. Specifically, Lee et al. [2019] shows that gradient descent starting from a random point eventually converges to SOSPs with probability one. Jin et al. [2017, 2018b] modified standard gradient descent using perturbations to provide an algorithm that converges to SOSPs in O(poly(log d, 1/ε)) iterations. As noted in the introduction, the zero order case poses additional hurdles compared to the first order one. Our work, by addressing these hurdles, effectively extends the guarantees provided by Lee et al. [2019], Jin et al. [2017] to zero order methods.

Zero Order Algorithms
Approximating gradients using finite differences methods has been the standard approach for both convex and non-convex zero order optimization. Nesterov and Spokoiny [2017] established convergence properties even for randomized gradient oracles. Recently, Duchi et al. [2015] provided optimal guarantees for stochastic convex optimization up to logarithmic factors. For the more general case of stochastic non-convex optimization there has been extensive work covering several aspects of the problem: distributed Hajinezhad and Zavlanos [2018], asynchronous Lian et al. [2016], and high-dimensional Wang et al. [2018], Balasubramanian and Ghadimi [2018] optimization, as well as variance reduction Liu et al. [2018b], Gu et al. [2016]. It is significant to mention that the aforementioned work is focused on convergence to ε-first order stationary points.

Regarding SOSPs, Conn et al. [2009] showed that trust region methods that employ fully quadratic models can converge to SOSPs at the cost of O(d²) operations per iteration. The authors of Jin et al. [2018a] studied the convergence to SOSPs using approximate function or gradient evaluations. While both approaches are applicable to the zero order setting with exact function evaluations, as we will see in Section 3.4, this type of reduction results in algorithms that require substantially more function evaluations to reach an ε-SOSP. Our work provides provable guarantees of convergence at significantly faster rates.

We will use lower case bold letters x, y to denote vectors. ‖·‖ will be used to denote the spectral norm and the ℓ₂ vector norm. λ_min(·) will be used to denote the minimum eigenvalue of a matrix. If g is a vector valued differentiable function then Dg denotes the differential of the function g. We will use {e_1, e_2, ..., e_d} to refer to the standard orthonormal basis of R^d. Also, C^n is the set of n times continuously differentiable functions. B_x(r) refers to the ball of radius r centered at x. Finally, μ(S) is the Lebesgue measure of a measurable set S ⊆ R^d.

A function f : R^d → R is said to be L-continuous, ℓ-gradient, ρ-Hessian Lipschitz if for every x, y ∈ R^d

    |f(x) − f(y)| ≤ L‖x − y‖,    ‖∇f(x) − ∇f(y)‖ ≤ ℓ‖x − y‖,    ‖∇²f(x) − ∇²f(y)‖ ≤ ρ‖x − y‖

correspondingly. Additionally, we can define approximate first order stationary points as:

Definition 1 (ε-first order stationary point). Let f : R^d → R be a differentiable function. Then x ∈ R^d is an ε-first order stationary point of f if ‖∇f(x)‖ ≤ ε.

A first order stationary point can be either a local minimum, a local maximum or a saddle point. Following the terminology of Lee et al. [2019] and Jin et al. [2017], we will include local maxima in saddle points since they are both undesirable for our minimization task. Under this definition, strict saddle points can be identified as follows:
Definition 2 (Strict saddle point). Let f : R^d → R be a twice differentiable function. Then x ∈ R^d is a strict saddle point of f if ‖∇f(x)‖ = 0 and λ_min(∇²f(x)) < 0.

To avoid convergence to strict saddle points, we need to converge to SOSPs. In order to study the convergence rate of algorithms that converge to SOSPs, we need to define some notion of approximate SOSPs. Following the convention of Jin et al. [2017] we define the following:
Definition 3 (ε-SOSP). Let f : R^d → R be a ρ-Hessian Lipschitz function. Then x ∈ R^d is an ε-second order stationary point of f if ‖∇f(x)‖ ≤ ε and λ_min(∇²f(x)) ≥ −√(ρε).

One of the key ways that enables zero order methods to converge quickly is using approximations of the gradient based on finite differences approaches. Here we will show how forward differencing can provide these approximate gradient calculations. Without much additional effort we can get the same results for other finite differences approaches like backward and symmetric differences, as well as finite differences approaches with higher order accuracy guarantees. Let us define the gradient approximation function based on forward differences, r_f : R^d × R → R^d:

    r_f(x, h) = Σ_{l=1}^d [f(x + h e_l) − f(x)]/h · e_l   if h ≠ 0,    r_f(x, h) = ∇f(x)   if h = 0.    (1)

This function takes two arguments: a vector x where the gradient should be approximated, as well as a scalar value h that controls the approximation accuracy of the estimator. An additional property that will be of interest when we analyze approximate gradient descent is the fact that r_f is Lipschitz.
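To make the construction concrete, the following Python sketch implements the forward-difference approximator r_f of Equation 1 (the function names are ours and serve only as an illustration; each call uses d + 1 evaluations of f):

import numpy as np

def forward_difference_gradient(f, x, h):
    # Approximate the gradient of f at x with forward differences (Equation 1).
    # For h = 0 the definition falls back to the exact gradient, which is not
    # available in the zero order setting, so callers should pass h != 0.
    x = np.asarray(x, dtype=float)
    d = x.shape[0]
    fx = f(x)                          # one evaluation shared by all coordinates
    grad = np.zeros(d)
    for l in range(d):
        e_l = np.zeros(d)
        e_l[l] = 1.0
        grad[l] = (f(x + h * e_l) - fx) / h
    return grad

# Example: f(x) = ||x||^2 has gradient 2x.
f_quad = lambda x: float(np.dot(x, x))
print(forward_difference_gradient(f_quad, np.array([1.0, -2.0, 0.5]), h=1e-5))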
Based on the definition one can show:

Lemma 1. Let f be ℓ-gradient Lipschitz. Then r_f(·, h) as defined in Equation 1 is √d ℓ-Lipschitz for all h ∈ R and, for all h ∈ R and x ∈ R^d, ‖r_f(x, h) − ∇f(x)‖ ≤ ℓ√d|h|.

3.4 Black box reductions to first order methods

As shown in the works of Nesterov and Spokoiny [2017], Ghadimi and Lan [2013], zero order optimization is reducible to stochastic first order optimization. The reduction relies on treating randomized approximate gradients of f as, in expectation, exact gradients of a carefully constructed smoothed version of f. These arguments are applicable to our case as well. FPSG, one of the approaches of Jin et al. [2018a], naively leads to a large poly(d) dependence in the convergence rate. More specifically, one can show that Jin et al. [2018a]'s FPSG method needs Õ(d²/ε²) evaluations of ∇g to converge to an ε-SOSP. The main reason behind this dimension dependent slowdown is that the Hessian Lipschitz constant of the smoothed version of g is O(ρ√d). An alternative approach in Jin et al. [2018a] named ZPSG builds gradient estimators using function evaluations directly. The main source of slowdown here is the high variance of the stochastic gradients. An analysis of those methods for the case where exact function evaluations are available can be found in the Appendix. In the next sections we will provide an alternative analysis that accounts for the gradient approximation errors on the optimization of f directly. Thus, we will be able to sidestep the above issues and provide faster convergence rates and better sample complexity.

It is easy to see that conceptually any iterative optimization method can be expressed as a dynamical system of the form x_{k+1} = g(x_k), where x_k is the current solution iterate that gets updated through an update function g. Additionally, for first order methods strict saddle points correspond to the unstable fixed points of the dynamical system. These key observations have motivated Lee et al. [2019] to use the Stable Manifold Theorem (SMT) Shub [1987] in order to prove that gradient descent avoids strict saddle points. Intuitively, SMT formalizes why convergence to unstable fixed points is unlikely starting from a local region around an unstable fixed point. Adding the requirement that g is a global diffeomorphism, Lee et al. [2019] generalizes the conclusions of SMT to the whole space.

In order to prove similar guarantees for a zero order algorithm using approximate gradient evaluations, we will need to construct a new dynamical system that is applicable to our zero order setting. The state of our dynamical system χ_k consists of two parts: the current solution iterate x_k, which is a vector in R^d, and a scalar value h_k ∈ R that controls the quality of the gradient approximation. Specifically we have

    χ_{k+1} = g(χ_k) ≜ ( x_{k+1}, h_{k+1} ) = ( x_k − η q_x(x_k, h_k),  β q_h(h_k) )    (2)

where η, β ∈ R_+ are positive scalar parameters and q_x : R^d × R → R^d and q_h : R → R are functions. The function q_x can be seen as the gradient approximation oracle used by the dynamical system, as described in Section 3.3. The function q_h is responsible for controlling the accuracy of the gradient approximation. As we shall see later, it is important that h_k converges to 0 so that the fixed points of g are the same as in gradient descent.
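For concreteness, the following minimal Python sketch iterates the extended dynamical system of Equation 2, instantiated with the forward-difference oracle from the earlier sketch as q_x and q_h(h) = h (the step sizes below are arbitrary illustrative choices, not the values used in our analysis):

import numpy as np

def approximate_gradient_descent(f, grad_oracle, x0, h0, eta, beta, num_iters):
    # Iterate chi_{k+1} = g(chi_k) from Equation 2.
    x, h = np.asarray(x0, dtype=float), float(h0)
    for _ in range(num_iters):
        x = x - eta * grad_oracle(f, x, h)   # x_{k+1} = x_k - eta * q_x(x_k, h_k)
        h = beta * h                         # h_{k+1} = beta * q_h(h_k) with q_h(h) = h
    return x, h

# With beta < 1 the accuracy parameter h_k converges to 0, so the fixed points of
# the extended system coincide with first order stationary points of f.
f_quad = lambda x: float(np.dot(x, x))
x_star, h_star = approximate_gradient_descent(
    f_quad, forward_difference_gradient, x0=[1.0, -2.0, 0.5],
    h0=1e-2, eta=0.1, beta=0.5, num_iters=200)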
In this section we will provide sufficient conditions that the parameters η, β must satisfy so that the update rule of Equation 2 avoids convergence to strict saddle points. To do this we will need to introduce some properties of g.

Definition 4 ((L, B, c)-Well-behaved function). Let f : R^d → R ∈ C² be an ℓ-gradient Lipschitz function. A function g of the form of Equation 2 is an (L, B, c)-well-behaved function (for the function f) if it has the following properties: i) q_x, q_h ∈ C¹ with q_h(0) = 0. ii) ∀h ∈ R: q_x(·, h) is L-Lipschitz and 0 < ∂q_h(h)/∂h ≤ B. iii) ∀(x, h) ∈ R^{d+1}: ‖q_x(x, h) − ∇f(x)‖ ≤ c|h|.

Given this definition and Lemma 1, it is clear that we can always construct (L, B, c)-well-behaved functions with L = √d ℓ, B = 1, c = √d ℓ using q_x = r_f and q_h(h) = h.

In the following lemmas and theorems we will require that βB < 1. Under this assumption βq_h is a contraction having 0 as its only fixed point, so for all fixed points of g we know that h = 0.
Notice also that when h = 0 we have q_x(x, 0) = ∇f(x), and therefore the x coordinates of fixed points of g must coincide with first order stationary points of f. In fact, in the Appendix we prove that there is a one-to-one mapping between strict saddles of f and unstable fixed points of g. Using the same assumptions, we also get that det(Dg(·)) ≠ 0. Putting it all together, we are able to prove our first main result.
Theorem 1. Let g be an (L, B, c)-well-behaved function for the function f. Let X*_f be the set of strict saddle points of f. Then if η < 1/L and β < 1/B:

    ∀h_0 ∈ R : μ({x_0 : lim_{k→∞} x_k ∈ X*_f}) = 0.

Notice that the random initialization refers only to the x's domain. Indeed, a straightforward application of the result of Lee et al. [2019] would guarantee a saddle-avoidance lemma only under an extra random choice of h_0. Such a result would not be able to clarify whether saddle-avoidance stems from the instability of the fixed point, just like in first order methods, or from the additional randomness of h_0. The key insight provided by the SMT is that all the initialization points that eventually converge to an unstable fixed point lie in a low dimensional manifold. Thus, to obtain a stronger result we have to understand how SMT restricts the dimensionality of this stable manifold for a fixed h_0. The structure of the eigenvectors of the Jacobian of g around a fixed point reveals that such an interesting decoupling is finally achievable.

In the previous section we provided sufficient conditions to avoid convergence to strict saddle points. These results are meaningful, however, only if lim_{k→∞} x_k exists. Therefore, in this section we will provide sufficient conditions such that the dynamical system of g converges. Given that strict saddle points are avoided, it is sufficient to prove convergence to first order stationary points. Let the error of the gradient approximation be ε_k = q_x(x_k, h_k) − ∇f(x_k).
Firstly, we establish the zero order analogue of the folklore lower bound for the decrease of the function:

Lemma 2 (Step-Convergence). Suppose that g is an (L, B, c)-well-behaved function for an ℓ-gradient Lipschitz function f. If η ≤ 1/ℓ then we have that

    f(x_{k+1}) ≤ f(x_k) − (η/2) (‖∇f(x_k)‖² − ‖ε_k‖²).
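For intuition, a minimal derivation of this bound (under the stated assumptions, writing the update as x_{k+1} = x_k − η(∇f(x_k) + ε_k) and using ℓ-gradient Lipschitzness together with ℓη² ≤ η) is:

\begin{aligned}
f(x_{k+1}) &\le f(x_k) - \eta\,\nabla f(x_k)^\top\big(\nabla f(x_k) + \varepsilon_k\big) + \tfrac{\ell\eta^2}{2}\,\big\|\nabla f(x_k) + \varepsilon_k\big\|^2 \\
 &\le f(x_k) - \eta\,\|\nabla f(x_k)\|^2 - \eta\,\nabla f(x_k)^\top\varepsilon_k + \tfrac{\eta}{2}\Big(\|\nabla f(x_k)\|^2 + 2\,\nabla f(x_k)^\top\varepsilon_k + \|\varepsilon_k\|^2\Big) \\
 &= f(x_k) - \tfrac{\eta}{2}\Big(\|\nabla f(x_k)\|^2 - \|\varepsilon_k\|^2\Big).
\end{aligned}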
Given this lemma we can prove convergence to first order stationary points.

Theorem 2 (Convergence to first order stationary points). Suppose that g is an (L, B, c)-well-behaved function for an ℓ-gradient Lipschitz function f. Let η ≤ 1/ℓ and β < 1/B. Then if f is lower bounded, lim_{k→∞} ‖∇f(x_k)‖ = 0.

The last theorem gives us a guarantee that the norm of the gradient is converging to zero, but this is not enough to prove convergence to a single stationary point if f has non-isolated critical points. In the Appendix, we prove that if the gradient approximation error decreases quickly enough, then convergence to a single stationary point is guaranteed for analytic functions. This allows us to conclude our analysis with this final theorem.

Theorem 3 (Convergence to minimizers). Let f : R^d → R ∈ C² be an ℓ-gradient Lipschitz function. Let us also assume that f is analytic, has compact sub-level sets and all of its saddle points are strict. Let g be an (L, B, c)-well-behaved function for f with η < min{1/L, 1/ℓ} and β < (1 − ηℓ)/B. If we pick a random initialization point x_0, then we have that for the x_k iterates of g

    ∀h_0 ∈ R : Pr(lim_{k→∞} x_k = x*) = 1

where x* is a local minimizer of f.

In the previous subsections we provided sufficient conditions for approximate gradient descent to avoid strict saddle points. However, the stable manifold theorem only guarantees that this will happen asymptotically. In fact, convergence could be quite slow until we reach a neighborhood of a local minimum. An analysis done for the first order case by Du et al. [2017] showed that avoiding saddle points could take exponential time in the worst case. In this section, we will use ideas from the work of Jin et al. [2017] in order to get a zero order algorithm that converges to SOSPs efficiently.

Convergence to SOSPs poses unique challenges to zero order methods when it comes to controlling the gradient approximation accuracy. For convergence to first order stationary points one can use property iii) of Definition 4 and Lemma 2 to show that h = ε/c guarantees the decrease of f until ‖∇f(x_k)‖ ≤ ε. For SOSPs, this is not applicable, as the norm of the gradient can become arbitrarily small near saddle points. One could resort to iteratively trying smaller h to find one that guarantees the decrease of f. A surprising fact about our algorithm is that even if the gradient is arbitrarily small, computationally burdensome searches for h can be totally avoided.
Algorithm 1 PAGD(x_0)
Initialization: (ℓ, ρ, ε, c, δ, Δ_f)
    χ ← 3 max{log(dℓΔ_f/(cε²δ)), 4},  η ← c/ℓ,  r ← (√c/χ²)·(ε/ℓ),  g_thres ← (√c/χ²)·ε,
    f_thres ← (c/χ³)·√(ε³/ρ),  t_thres ← (χ/c²)·(ℓ/√(ρε)),  S ← (√c/χ²)·√(ε/ρ),
    h_low ← (1/c_h)·min{g_thres/4, rρδS/(2√d)}
1: for t = 0, 1, . . . do
2:    z_t ← q(x_t, g_thres/(4 c_h))
3:    if ‖z_t‖ ≥ 3 g_thres/4 then
4:        x_{t+1} ← x_t − η z_t
5:    else
6:        x_{t+1} ← EscapeSaddle(x_t)
7:        if x_{t+1} = x_t then return x_t
8:    end if
9: end for

Algorithm 2 EscapeSaddle(x̂)
1: ξ ∼ Unif(B_0(r))
2: x̃_0 ← x̂ + ξ
3: for i = 0, 1, . . . , t_thres do
4:    if f(x̂) − f(x̃_i) ≥ f_thres then return x̃_i
5:    x̃_{i+1} ← x̃_i − η q(x̃_i, h_low)
6: end for
7: return x̂

Just like Jin et al. [2017], we will assume that f is ℓ-gradient Lipschitz and also ρ-Hessian Lipschitz. To construct a zero order algorithm we will also need a gradient approximator q : R^d × R → R^d. We will only require the error bound property on q, i.e., there exists a constant c_h such that

    ∀x ∈ R^d, h ∈ R : ‖q(x, h) − ∇f(x)‖ ≤ c_h|h|.

The high level idea of Algorithm 1 is that, given a point x_t that is not an ε-SOSP, the algorithm makes progress by finding an x_{t+1} where f(x_{t+1}) is substantially smaller than f(x_t). By the definition of ε-SOSPs, either the gradient of f at x_t is large or the Hessian has a substantially negative eigenvalue. Separating these two cases is not as straightforward as in the first order case. Given the norm of the approximate gradient q(x, h), we only know that ‖∇f(x)‖ ∈ ‖q(x, h)‖ ± c_h|h|. In Algorithm 1, by choosing 3g_thres/4 as the threshold to test for and h = g_thres/(4c_h), we guarantee that in step 4 ‖∇f(x_t)‖ ≥ g_thres/2. This threshold is actually high enough to guarantee a substantial decrease of f. Indeed, given that we have a lower bound on the exact gradient and using Lemma 2 we get

    f(x_t) − f(x_{t+1}) ≥ (η/2)(‖∇f(x_t)‖² − ‖ε_t‖²) ≥ (η/2)(g_thres²/4 − g_thres²/16) = 3η g_thres²/32,

where ε_t is the gradient approximation error at x_t. This decrease is the same as in the first order case up to constants.

On the other hand, in Algorithm 2 we are guaranteed that ‖∇f(x̂)‖ ≤ g_thres. In this case our approximate gradient cannot guarantee a substantial decrease of f. However, we know that the Hessian has a substantially negative eigenvalue and therefore a direction of steep decrease of f must exist. The problem is that we do not know which direction has this property. In Jin et al. [2017] it is proved that identifying this direction is not necessary for the first order case. Adding a small random perturbation to our current iterate (step 2) is enough so that, with high probability, we can get a substantial decrease of f after at most t_thres gradient descent steps (step 5). Of course this work is not directly applicable to our case since we do not have access to exact gradients.

The work of Jin et al. [2017] mainly depends on two arguments to provide its guarantees. The first argument is that if the x̃_i iterates do not achieve a decrease of f_thres in t_thres steps, then they must remain confined in a small ball around x̃_0. Specifically, for the exact gradient case we have that ‖x̃_i − x̃_0‖ ≤ √(2η f_thres t_thres). This argument relies on gradient descent being a descent method, which is no longer true in our setting; therefore, iterates may wander away from x̃_0 without even decreasing the function value of f. To amend this argument for the zero order case we require that h_low ≤ g_thres/(4 c_h). This guarantees that even if gradient approximation errors amass over the iterations, we will get the same bound as in the first order case up to constants.
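The following Python sketch mirrors the structure of Algorithms 1 and 2 (the helper names, the generic oracle q and the simplified termination test are ours; the constants follow the initialization above):

import numpy as np

def escape_saddle(f, q, x_hat, eta, f_thres, t_thres, h_low, r, rng):
    # Random perturbation followed by approximate gradient steps (Algorithm 2).
    d = x_hat.shape[0]
    xi = rng.normal(size=d)
    xi *= r * rng.random() ** (1.0 / d) / np.linalg.norm(xi)   # uniform point in B_0(r)
    x = x_hat + xi
    f_hat = f(x_hat)
    for _ in range(int(t_thres)):
        if f_hat - f(x) >= f_thres:
            return x
        x = x - eta * q(f, x, h_low)
    return x_hat

def pagd(f, q, x0, eta, g_thres, f_thres, t_thres, h_low, c_h, r, max_iters=10000, seed=0):
    # Perturbed approximate gradient descent (Algorithm 1).
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iters):
        z = q(f, x, g_thres / (4.0 * c_h))        # approximate gradient with fixed accuracy
        if np.linalg.norm(z) >= 0.75 * g_thres:   # large approximate gradient: take a step
            x = x - eta * z
        else:                                     # small gradient: try to escape a saddle
            x_new = escape_saddle(f, q, x, eta, f_thres, t_thres, h_low, r, rng)
            if np.allclose(x_new, x):
                return x                          # no progress: report x as an approximate SOSP
            x = x_new
    return x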
The second argument of Jin et al. [2017] formalizes why the existence of a negative eigenvalue of the Hessian is important. Let us run gradient descent starting from two points u_0 and w_0 such that w_0 − u_0 = κ e, where e is the eigenvector corresponding to the most negative eigenvalue of the Hessian and κ ≥ rδ/(2√d). Then at least one of the sequences {w_i}, {u_i} is able to escape away from its starting point in t_thres iterations and, by the first argument, it is also able to decrease the value of f substantially. The proof of the claim is based on creating a recurrence relationship on v_i = w_i − u_i. The corresponding recurrence relationship for the zero order case is more complicated, with additional terms that correspond to the gradient approximation errors for w_i and u_i. However, we are able to prove that if h_low ≤ rρδS/(2√d) then these additional terms cannot distort the exponential growth of v_i. Having extended both arguments of Jin et al. [2017], we can establish the same guarantees for escaping saddle points.

Theorem 4 (Analysis of PAGD). There exists an absolute constant c_max such that: if f is ℓ-gradient Lipschitz and ρ-Hessian Lipschitz, then for any δ > 0, ε ≤ ℓ²/ρ, Δ_f ≥ f(x_0) − f*, and constant c ≤ c_max, with probability 1 − δ, the output of PAGD(x_0, ℓ, ρ, ε, c, δ, Δ_f) will be an ε-SOSP, and PAGD will have the following number of iterations until termination:

    O( (ℓ(f(x_0) − f*)/ε²) · log⁴( dℓΔ_f/(ε²δ) ) ).

In this section we use simulations to verify our theoretical findings. Specifically, we are interested in verifying whether zero order methods can avoid saddle points as efficiently as first order methods. To do this we use the two dimensional Rastrigin function, a popular benchmark in the non-convex optimization literature. This function exhibits several strict saddle points, so it will be an adequate benchmark for our case. The two dimensional Rastrigin function can be defined as

    Ras(x_1, x_2) = 20 + x_1² − 10 cos(2πx_1) + x_2² − 10 cos(2πx_2).

For this experiment we selected 75 points randomly from a square box centered at the origin; in this domain the Rastrigin function is ℓ-gradient Lipschitz. Using these points as initialization we run gradient descent and the approximate gradient descent dynamical system we introduced in Section 4.2. For both gradient descent and approximate gradient descent we used η = 1/(4ℓ). Then for approximate gradient descent we used symmetric differences to approximate the gradients, together with fixed values for β and h_0.
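A minimal Python sketch of this experimental setup is given below; the initialization box, β and h_0 are illustrative choices on our part, and the symmetric-difference oracle plays the role of q_x:

import numpy as np

def rastrigin(x):
    return 20.0 + np.sum(x ** 2 - 10.0 * np.cos(2.0 * np.pi * x))

def rastrigin_gradient(x):
    return 2.0 * x + 20.0 * np.pi * np.sin(2.0 * np.pi * x)

def symmetric_difference_gradient(f, x, h):
    d = x.shape[0]
    grad = np.zeros(d)
    for l in range(d):
        e_l = np.zeros(d)
        e_l[l] = 1.0
        grad[l] = (f(x + h * e_l) - f(x - h * e_l)) / (2.0 * h)
    return grad

ell = 2.0 + 40.0 * np.pi ** 2            # an upper bound on the gradient Lipschitz constant of Ras
eta, beta, h0 = 1.0 / (4.0 * ell), 0.9, 1e-2
rng = np.random.default_rng(0)
for x0 in rng.uniform(-2.5, 2.5, size=(75, 2)):
    x_gd, x_agd, h = x0.copy(), x0.copy(), h0
    for _ in range(500):
        x_gd = x_gd - eta * rastrigin_gradient(x_gd)                              # gradient descent
        x_agd = x_agd - eta * symmetric_difference_gradient(rastrigin, x_agd, h)  # approximate GD
        h = beta * h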
Figure 1 shows the contour plot of the Rastrigin function as well as the evolution of the iterates of both methods (the panels show the initial points and the iterates after 2, 4 and 6 iterations).
Figure 1: Contour plots of the Rastrigin function along with the evolution of the iterates of gradient descent and approximate gradient descent. Green points correspond to gradient descent whereas cyan points correspond to approximate gradient descent.

As expected, for points initialized close to local minima of the function convergence is quite fast. On the other hand, points starting close to saddle points of the Rastrigin function take some more time to converge to minima. However, it is clear that in both cases the behaviour of gradient descent and approximate gradient descent is similar, in the sense that for the same initialization there is no discrepancy in terms of convergence speed between the two methods.

We also want to experimentally verify the performance of PAGD. To do this we use the octopus function proposed by Du et al. [2017]. This function is particularly relevant to our setting as it possesses a sequence of saddle points. The authors of Du et al. [2017] proved that for this function gradient descent needs exponential time to avoid saddle points before converging to a local minimum. In contrast, the perturbed version of gradient descent (PGD) of Jin et al. [2017] does not suffer from the same limitation. Based on the results of Theorem 4, we expect PAGD not to have this limitation either. We compare gradient descent (GD), PGD, AGD and PAGD on an octopus function of d = 15 dimensions. Figure 2 clearly shows that the zero order versions have the same iteration performance as the first-order ones. In fact, AGD is shown to behave even better than GD in this example thanks to the noise induced by the gradient approximation.
Figure 2: Octopus function value as a function of the number of iterations. Parameters of the octopus function: τ = e, L = e, γ = 1. Parameters of the first order methods are taken from Du et al. [2017]. Zero order methods use symmetric differencing with a fixed value of h.

This paper is the first one to establish that zero order methods can avoid saddle points efficiently. To achieve this we went beyond the smoothing arguments used in prior work and studied the effect of the gradient approximation error on first order methods that converge to second order stationary points. One important open question for future work is whether similar guarantees can be established for other zero order methods used in practice, like direct search methods and trust region methods using linear models. Another generalization of interest would be to consider the performance of zero order methods for instances of (non-convex) constrained optimization.
Acknowledgements
Georgios Piliouras acknowledges MOE AcRF Tier 2 Grant 2016-T2-1-170, grant PIE-SGP-AI-2018-01 and NRF 2018 Fellowship NRF-NRFF2018-07. Emmanouil-Vasileios Vlatakis-Gkaragkounis was supported by NSF CCF-1563155, NSF CCF-1814873, NSF CCF-1703925, NSF CCF-1763970. We are grateful to Alexandros Potamianos for bringing this problem to our attention, and for helpful discussions at an early stage of this project about its connection to Natural Language Processing tasks. Finally, this work was supported by the Onassis Foundation - Scholarship ID: F ZN 010-1/2017-2018.

References
Pierre-Antoine Absil, Robert E. Mahony, and B. Andrews. Convergence of the iterates of descentmethods for analytic cost functions.
SIAM Journal on Optimization , 16(2):531–547, 2005. doi:10.1137/040605266.Alekh Agarwal, Ofer Dekel, and Lin Xiao. Optimal algorithms for online convex optimization withmulti-point bandit feedback. In
COLT 2010 - The 23rd Conference on Learning Theory, Haifa,Israel, June 27-29, 2010 , pages 28–40, 2010.Naman Agarwal, Zeyuan Allen Zhu, Brian Bullins, Elad Hazan, and Tengyu Ma. Finding approximatelocal minima faster than gradient descent. In
Proceedings of the 49th Annual ACM SIGACTSymposium on Theory of Computing, STOC 2017, Montreal, QC, Canada, June 19-23, 2017 , pages1195–1199, 2017. doi: 10.1145/3055399.3055464.Zeyuan Allen-Zhu and Yuanzhi Li. NEON2: finding local minima via first-order oracles. In
Advances in Neural Information Processing Systems 31: Annual Conference on Neural InformationProcessing Systems 2018, NeurIPS 2018, 3-8 December 2018, Montréal, Canada. , pages 3720–3730, 2018.Krishnakumar Balasubramanian and Saeed Ghadimi. Zeroth-order (non)-convexstochastic optimization via conditional gradient and gradient updates. In
Advancesin Neural Information Processing Systems 31: Annual Conference on Neural In-formation Processing Systems 2018, NeurIPS 2018, 3-8 December 2018, Mon-tréal, Canada. , pages 3459–3468, 2018. URL http://papers.nips.cc/paper/7605-zeroth-order-non-convex-stochastic-optimization-via-conditional-gradient-and-gradient-updates .Yair Carmon and John C. Duchi. Gradient descent efficiently finds the cubic-regularized non-convexnewton step.
CoRR , abs/1612.00547, 2016.Pin-Yu Chen, Huan Zhang, Yash Sharma, Jinfeng Yi, and Cho-Jui Hsieh. ZOO: zeroth orderoptimization based black-box attacks to deep neural networks without training substitute models.In
Proceedings of the 10th ACM Workshop on Artificial Intelligence and Security, AISec@CCS2017, Dallas, TX, USA, November 3, 2017 , pages 15–26, 2017. doi: 10.1145/3128572.3140448.URL https://doi.org/10.1145/3128572.3140448 .Tianyi Chen and Georgios B. Giannakis. Bandit convex optimization for scalable and dynamic iotmanagement.
IEEE Internet of Things Journal , 6(1):1276–1286, 2019. doi: 10.1109/JIOT.2018.2839563. URL https://doi.org/10.1109/JIOT.2018.2839563 .Anna Choromanska, Mikael Henaff, Michaël Mathieu, Gérard Ben Arous, and Yann LeCun. Theloss surfaces of multilayer networks. In
Proceedings of the Eighteenth International Conferenceon Artificial Intelligence and Statistics, AISTATS 2015, San Diego, California, USA, May 9-12,2015 , 2015. URL http://jmlr.org/proceedings/papers/v38/choromanska15.html .Krzysztof Choromanski, Mark Rowland, Vikas Sindhwani, Richard E. Turner, and Adrian Weller.Structured evolution with compact architectures for scalable policy optimization. In
Proceedingsof the 35th International Conference on Machine Learning, ICML 2018, Stockholmsmässan,Stockholm, Sweden, July 10-15, 2018 , pages 969–977, 2018. URL http://proceedings.mlr.press/v80/choromanski18a.html .Andrew R. Conn, Katya Scheinberg, and Luís N. Vicente. Global convergence of general derivative-free trust-region algorithms to first- and second-order critical points.
SIAM Journal on Optimization ,20(1):387–415, 2009. doi: 10.1137/060673424.Yann N. Dauphin, Razvan Pascanu, Çaglar Gülçehre, KyungHyun Cho, Surya Ganguli, andYoshua Bengio. Identifying and attacking the saddle point problem in high-dimensionalnon-convex optimization. In
Advances in Neural Information Processing Systems 27: AnnualConference on Neural Information Processing Systems 2014, December 8-13 2014, Mon-treal, Quebec, Canada , pages 2933–2941, 2014. URL http://papers.nips.cc/paper/5486-identifying-and-attacking-the-saddle-point-problem-in-high-dimensional-non-convex-optimization .10imon S. Du, Chi Jin, Jason D. Lee, Michael I. Jordan, Aarti Singh, and Barnabás Póczos. Gradientdescent can take exponential time to escape saddle points. In
Advances in Neural InformationProcessing Systems 30: Annual Conference on Neural Information Processing Systems 2017, 4-9December 2017, Long Beach, CA, USA , pages 1067–1077, 2017.John C. Duchi, Michael I. Jordan, Martin J. Wainwright, and Andre Wibisono. Optimal rates forzero-order convex optimization: The power of two function evaluations.
IEEE Trans. InformationTheory , 61(5):2788–2806, 2015. doi: 10.1109/TIT.2015.2409256.Rong Ge, Furong Huang, Chi Jin, and Yang Yuan. Escaping from saddle points - online stochasticgradient for tensor decomposition. In
Proceedings of The 28th Conference on Learning Theory,COLT 2015, Paris, France, July 3-6, 2015 , pages 797–842, 2015.Saeed Ghadimi and Guanghui Lan. Stochastic first- and zeroth-order methods for nonconvexstochastic programming.
SIAM Journal on Optimization, 23(4):2341–2368, 2013. doi: 10.1137/120880811. Bin Gu, Zhouyuan Huo, and Heng Huang. Zeroth-order asynchronous doubly stochastic algorithm with variance reduction. arXiv preprint arXiv:1612.01425, 2016. Davood Hajinezhad and Michael M. Zavlanos. Gradient-free multi-agent nonconvex nonsmooth optimization. In 2018 IEEE Conference on Decision and Control (CDC), pages 4939–4944, 2018. doi: 10.1109/CDC.2018.8619333. URL https://doi.org/10.1109/CDC.2018.8619333. Chi Jin, Rong Ge, Praneeth Netrapalli, Sham M. Kakade, and Michael I. Jordan. How to escape saddle points efficiently. In
Proceedings of the 34th International Conference on Machine Learning,ICML 2017, Sydney, NSW, Australia, 6-11 August 2017 , pages 1724–1732, 2017.Chi Jin, Lydia T. Liu, Rong Ge, and Michael I. Jordan. On the local minima of the em-pirical risk. In
Advances in Neural Information Processing Systems 31: Annual Confer-ence on Neural Information Processing Systems 2018, NeurIPS 2018, 3-8 December 2018,Montréal, Canada. , pages 4901–4910, 2018a. URL http://papers.nips.cc/paper/7738-on-the-local-minima-of-the-empirical-risk .Chi Jin, Praneeth Netrapalli, and Michael I. Jordan. Accelerated gradient descent escapes saddlepoints faster than gradient descent. In Sébastien Bubeck, Vianney Perchet, and Philippe Rigollet,editors,
Proceedings of the 31st Conference On Learning Theory , volume 75 of
Proceedings ofMachine Learning Research , pages 1042–1085. PMLR, 06–09 Jul 2018b.Jason D. Lee, Ioannis Panageas, Georgios Piliouras, Max Simchowitz, Michael I. Jordan, andBenjamin Recht. First-order methods almost always avoid strict saddle points.
Math. Program. ,176(1-2):311–337, 2019. doi: 10.1007/s10107-019-01374-3. URL https://doi.org/10.1007/s10107-019-01374-3 .Kfir Y. Levy. The power of normalization: Faster evasion of saddle points.
CoRR , abs/1611.04831,2016.Xiangru Lian, Huan Zhang, Cho-Jui Hsieh, Yijun Huang, and Ji Liu. A comprehensivelinear speedup analysis for asynchronous stochastic parallel optimization from zeroth-order to first-order. In
Advances in Neural Information Processing Systems 29: AnnualConference on Neural Information Processing Systems 2016, December 5-10, 2016,Barcelona, Spain , pages 3054–3062, 2016. URL http://papers.nips.cc/paper/6551-a-comprehensive-linear-speedup-analysis-for-asynchronous-stochastic-parallel-optimization-from-zeroth-order-to-first-order .Sijia Liu, Jie Chen, Pin-Yu Chen, and Alfred Hero. Zeroth-order online alternating direction methodof multipliers: Convergence analysis and applications. In
International Conference on ArtificialIntelligence and Statistics, AISTATS 2018, 9-11 April 2018, Playa Blanca, Lanzarote, CanaryIslands, Spain , pages 288–297, 2018a. URL http://proceedings.mlr.press/v84/liu18a.html . 11ijia Liu, Bhavya Kailkhura, Pin-Yu Chen, Pai-Shun Ting, Shiyu Chang, and LisaAmini. Zeroth-order stochastic variance reduction for nonconvex optimization. In
Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems 2018, NeurIPS 2018, 3-8 December 2018, Montréal, Canada, pages 3731–3741, 2018b. URL http://papers.nips.cc/paper/7630-zeroth-order-stochastic-variance-reduction-for-nonconvex-optimization. Loring W. Tu. An Introduction to Manifolds, 2008. Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu. Towards deep learning models resistant to adversarial attacks. In International Conference on Learning Representations, 2018. URL https://openreview.net/forum?id=rJzIBfZAb. Katta G. Murty and Santosh N. Kabadi. Some NP-complete problems in quadratic and nonlinear programming.
Math. Program. , 39(2):117–129, 1987. doi: 10.1007/BF02592948.Yurii Nesterov and Boris T. Polyak. Cubic regularization of newton method and its global performance.
Math. Program. , 108(1):177–205, 2006. doi: 10.1007/s10107-006-0706-8.Yurii Nesterov and Vladimir G. Spokoiny. Random gradient-free minimization of convex func-tions.
Foundations of Computational Mathematics , 17(2):527–566, 2017. doi: 10.1007/s10208-015-9296-2.Nicolas Papernot, Patrick D. McDaniel, Ian J. Goodfellow, Somesh Jha, Z. Berkay Celik, andAnanthram Swami. Practical black-box attacks against machine learning. In
Proceedings of the2017 ACM on Asia Conference on Computer and Communications Security, AsiaCCS 2017, AbuDhabi, United Arab Emirates, April 2-6, 2017 , pages 506–519, 2017. doi: 10.1145/3052973.3053009. URL https://doi.org/10.1145/3052973.3053009 .Reuven Y Rubinstein and Dirk P Kroese.
Simulation and the Monte Carlo method , volume 10. JohnWiley & Sons, 2016.Tim Salimans, Jonathan Ho, Xi Chen, and Ilya Sutskever. Evolution strategies as a scalable alternativeto reinforcement learning.
CoRR , abs/1703.03864, 2017. URL http://arxiv.org/abs/1703.03864 .Michael Shub.
Global stability of dynamical systems . Springer Science & Business Media, 1987.Jasper Snoek, Hugo Larochelle, and Ryan P. Adams. Practical bayesian optimiza-tion of machine learning algorithms. In
Advances in Neural Information Process-ing Systems 25: 26th Annual Conference on Neural Information Processing Systems2012. Proceedings of a meeting held December 3-6, 2012, Lake Tahoe, Nevada,United States. , pages 2960–2968, 2012. URL http://papers.nips.cc/paper/4522-practical-bayesian-optimization-of-machine-learning-algorithms .James C. Spall.
Introduction to Stochastic Search and Optimization . John Wiley & Sons, Inc., NewYork, NY, USA, 1 edition, 2003. ISBN 0471330523.Ju Sun, Qing Qu, and John Wright. A geometric analysis of phase retrieval.
Foundations ofComputational Mathematics , 18(5):1131–1198, 2018. doi: 10.1007/s10208-017-9365-9.Martin J. Wainwright and Michael I. Jordan. Graphical models, exponential families, and variationalinference.
Foundations and Trends in Machine Learning , 1(1-2):1–305, 2008. doi: 10.1561/2200000001.Yining Wang, Simon S. Du, Sivaraman Balakrishnan, and Aarti Singh. Stochastic zeroth-orderoptimization in high dimensions. In
International Conference on Artificial Intelligence and Statistics, AISTATS 2018, 9-11 April 2018, Playa Blanca, Lanzarote, Canary Islands, Spain, pages 1356–1365, 2018. URL http://proceedings.mlr.press/v84/wang18e.html.
Supplementary Materials

A Preliminaries: Detailed proofs
In this first subsection, we show that the forward finite differences method can be used toconstruct an approximate gradient oracle. Similar oracles can be constructed using backward,symmetric finite differences or Richardson extrapolation which have even higher gradientapproximation accuracy. Additionally, we compute the Lipschitz constant of our method andwe show that our definition of "well-behaved" approximate gradient is well defined. In otherwords, there are simple approximation oracles which follow the smoothness requirementsthat our work assumes.
A.1 Gradient Approximation using Zero Order Information

Lemma 4 (Lemma 1 restated). Let f be ℓ-gradient Lipschitz. Then r_f(·, h) as defined in Equation 1 is √d ℓ-Lipschitz for all h ∈ R and it holds that ‖r_f(x, h) − ∇f(x)‖ ≤ ℓ√d|h|.
Proof. For the first part of the lemma we split our proof into two cases:

• For any h ≠ 0 and any x, x' ∈ R^d we have

    ‖r_f(x, h) − r_f(x', h)‖ = ‖ Σ_{l=1}^d [f(x + h e_l) − f(x)]/h · e_l − Σ_{l=1}^d [f(x' + h e_l) − f(x')]/h · e_l ‖
                             = √( Σ_{l=1}^d | [f(x + h e_l) − f(x' + h e_l) − (f(x) − f(x'))]/h |² ).

Let us define the function q_l(s) = f(x + s e_l) − f(x' + s e_l) for all l ∈ [d]. Then by applying the mean value theorem we get

    ‖r_f(x, h) − r_f(x', h)‖ = √( Σ_{l=1}^d |(q_l(h) − q_l(0))/h|² ) = √( Σ_{l=1}^d |q'_l(ξ_l)|² )

for some ξ_l ∈ (0, h). We have that q'_l(ξ_l) = ∂f(x + ξ_l e_l)/∂x_l − ∂f(x' + ξ_l e_l)/∂x_l. If f is ℓ-gradient Lipschitz, so are all the partial derivatives, therefore

    ‖r_f(x, h) − r_f(x', h)‖ ≤ √( Σ_{l=1}^d ℓ² ‖x − x'‖² ) = √d ℓ ‖x − x'‖.

• For the special case of h = 0,

    ‖r_f(x, 0) − r_f(x', 0)‖ = ‖∇f(x) − ∇f(x')‖ ≤ ℓ‖x − x'‖ ≤ √d ℓ ‖x − x'‖.

For the second part of the lemma, for any h ≠ 0 and any x,

    ‖r_f(x, h) − ∇f(x)‖ = ‖ Σ_{l=1}^d [f(x + h e_l) − f(x)]/h · e_l − ∇f(x) ‖
                        = √( Σ_{l=1}^d | [f(x + h e_l) − f(x)]/h − ∂f(x)/∂x_l |² ).

For each l ∈ [d] we use the mean value theorem so that for some ξ_l with |ξ_l| ≤ |h| we have

    ‖r_f(x, h) − ∇f(x)‖ = √( Σ_{l=1}^d |∂f(x + ξ_l e_l)/∂x_l − ∂f(x)/∂x_l|² ) ≤ √( Σ_{l=1}^d (ℓ ξ_l)² ) ≤ ℓ√d|h|.

For h = 0 the requested inequality holds as an equality. ∎

As noted in the main paper, recent studies have analyzed zero order optimization by carefully crafting a smoothed version of the original objective function. These arguments are applicable to our case as well. The following lemmas show why these approaches lead to a poly(d, 1/ε) slowdown in terms of the number of iterations and function evaluations.
Algorithm 3 of Jin et al. [2018a], uses approximate gradient evaluations at randomly sampled pointsaround the current iterate to get an estimate of the gradient of f . This estimate is then perturbed withnoise in order to avoid any potential saddle point. Algorithm 3
First order Perturbed Stochastic Gradient Descent (FPSGD)
Input: x_0, learning rate η, noise radius r, mini-batch size m.
for t = 0, 1, . . . , T do
    sample (z_t^(1), . . . , z_t^(m)) ∼ N(0, σ²I)
    g_t(x_t) ← (1/m) Σ_{i=1}^m ∇g(x_t + z_t^(i))
    x_{t+1} ← x_t − η (g_t(x_t) + ξ_t),  ξ_t uniformly ∼ B_0(r)
end for
return x_T
Lemma 5. Let f : R^d → R be a bounded, L-continuous, ℓ-gradient, ρ-Hessian Lipschitz function. Additionally, suppose that we have access to a function g : R^d → R such that ‖∇g − ∇f‖_∞ ≤ ν. Then, Jin et al. [2018a]'s FPSG method needs Õ(d²/ε²) evaluations of ∇g to converge to an ε-SOSP.

Proof. We will show the main steps that Jin et al. [2018a] followed in Section E of their Appendix. The first step of the proof is to define the Gaussian smoothing of the function g with parameter σ:

    g_σ(x) = E_{z∼N(0,σ²I)}[g(x + z)].

One can show that

    ∇g_σ(x) = E_{z∼N(0,σ²I)}[∇g(x + z)],    ∇²g_σ(x) = E_{z∼N(0,σ²I)}[∇²g(x + z)].

Additionally, Lemma 48 of Jin et al. [2018a] tells us that the gradients and Hessians of g_σ and f are close to each other and that g_σ is gradient Lipschitz and Hessian Lipschitz:

• g_σ is O(ℓ + ν/σ) gradient Lipschitz and O(ρ + ν/σ²) Hessian Lipschitz.
• ‖∇g_σ(x) − ∇f(x)‖ ≤ O(ρdσ² + ν) and ‖∇²g_σ(x) − ∇²f(x)‖ ≤ O(ρ√d σ + ν/σ).

Then Lemma 54 of Jin et al. [2018a] proves that an ε/√d-SOSP of g_σ is also an O(ε) second order stationary point of f if

    σ ≤ O(√(ε/(ρd)))   and   ν ≤ O(ε/√d).

For the aforementioned choices of ν and σ, ∇g_σ is bounded:

    ‖∇g_σ(x)‖ ≤ ‖∇g_σ(x) − ∇f(x)‖ + ‖∇f(x)‖ ≤ √d ν + L ≤ ε + L.

So the stochastic gradient ∇g(x + z) is O(ε + L) sub-Gaussian. Notice also that by replacing with the upper bounds on σ and ν, one can observe that the Hessian Lipschitz constant of g_σ is O(ρ√d). This is the main reason that an ε/√d-SOSP of g_σ is required. According to Theorem 65 of Jin et al. [2018a], getting an ε-SOSP of g_σ requires Õ(d/ε²) evaluations of ∇g. So to get an ε/√d-SOSP of g_σ, one would require Õ(d²/ε²) evaluations of ∇g. ∎

Notice that the above lemma makes the technical assumption that the gradient approximator is the gradient of a function, which may not be true for standard finite differences approximators. The lemma below for ZPSG does not have the same limitation. In contrast to FPSG, Algorithm 4 works with function evaluations directly to come up with appropriate gradient estimates.

Algorithm 4
Zero order Perturbed Stochastic Gradient Descent (ZPSGD)
Input: x_0, learning rate η, noise radius r, mini-batch size m.
for t = 0, 1, . . . , T do
    sample (z_t^(1), . . . , z_t^(m)) ∼ N(0, σ²I)
    g_t(x_t) ← (1/m) Σ_{i=1}^m z_t^(i) [f(x_t + z_t^(i)) − f(x_t)] / σ²
    x_{t+1} ← x_t − η (g_t(x_t) + ξ_t),  ξ_t uniformly ∼ B_0(r)
end for
return x_T
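A minimal Python sketch of the mini-batch estimator used by ZPSGD and of one ZPSGD step is given below (the function names are ours; note that f(x_t) is reused across the mini-batch, so one batch costs m + 1 function evaluations):

import numpy as np

def zpsgd_gradient_estimate(f, x, sigma, m, rng):
    # g_t(x) = (1/m) * sum_i z_i * (f(x + z_i) - f(x)) / sigma^2
    fx = f(x)
    z = rng.normal(scale=sigma, size=(m, x.shape[0]))
    weights = np.array([f(x + z_i) - fx for z_i in z]) / sigma ** 2
    return (weights[:, None] * z).mean(axis=0)

def zpsgd_step(f, x, eta, sigma, m, r, rng):
    # One iteration of Algorithm 4: estimated gradient plus a perturbation from B_0(r).
    g = zpsgd_gradient_estimate(f, x, sigma, m, rng)
    xi = rng.normal(size=x.shape[0])
    xi *= r * rng.random() ** (1.0 / x.shape[0]) / np.linalg.norm(xi)
    return x - eta * (g + xi)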
Lemma 6. Let f : R^d → R be a bounded, L-continuous, ℓ-gradient, ρ-Hessian Lipschitz function. Then, Jin et al. [2018a]'s ZPSG method needs a number of evaluations of f that grows polynomially in d and 1/ε to converge to an ε-SOSP.

Proof. We will show the main steps that Jin et al. [2018a] followed in Section A of their Appendix. The first step of the proof is to define the Gaussian smoothing of the function f with parameter σ:

    f_σ(x) = E_{z∼N(0,σ²I)}[f(x + z)].

One can show that

    ∇f_σ(x) = E_{z∼N(0,σ²I)}[∇f(x + z)],    ∇²f_σ(x) = E_{z∼N(0,σ²I)}[∇²f(x + z)].

Additionally, Lemma 18 of Jin et al. [2018a] for ν = 0 tells us that the gradients and Hessians of f_σ and f are close to each other and that f_σ is gradient Lipschitz and Hessian Lipschitz:

• f_σ is O(ℓ) gradient Lipschitz and O(ρ) Hessian Lipschitz.
• ‖∇f_σ(x) − ∇f(x)‖ ≤ O(ρdσ²) and ‖∇²f_σ(x) − ∇²f(x)‖ ≤ O(ρ√d σ).

Then an ε-SOSP of f_σ is also an O(ε) second order stationary point of f if σ ≤ O(√(ε/(ρd))). We also need to develop a random gradient approximator of ∇f_σ given only evaluations of f. Based on Lemma 19,

    ∇f_σ(x) = E_{z∼N(0,σ²I)}[ z (f(x + z) − f(x)) / σ² ].

Let us define

    g(x; z) = z (f(x + z) − f(x)) / σ².

Lemma 24 shows that g is B/σ sub-Gaussian, where B is the upper bound on |f(x)| (it exists since f is bounded). Replacing with the upper bound on σ, it turns out that g is O(B √(ρd/ε)) sub-Gaussian. This dependence on d and ε is the main reason for the slowdown in this case. According to Theorem 65, getting an ε-SOSP of f_σ then requires a number of evaluations of g that is polynomial in d and 1/ε and strictly larger than in the first order case. Each evaluation of g requires 2 evaluations of f. ∎

In the next section, we show the complete proof of our first main result. We will use the Stable Manifold Theorem (SMT) to prove that zero-order approximate gradient descent (AGD) avoids strict saddle points.

B Approximate Gradient Descent: Detailed proofs
Our first two lemmas prove the equivalence between the first order stationary points of f and the fixed points of the AGD dynamics. Additionally, we show that the saddle points of the objective function correspond exactly to the unstable fixed points of the proposed zero order method. Finally, we show that for a sufficiently small step-size the dynamical system is a diffeomorphism. This critical property will allow us to generalize the consequences of SMT from a local region around a saddle point to the global domain.

B.1 Avoiding strict saddle points
Lemma 7. Assume that g is an (L, B, c)-well-behaved function. If β < 1/B and η < 1/L, then for every strict saddle point x* of f we have that (x*, 0) is not a stable fixed point of g. Additionally, these are the only unstable fixed points of g.

Proof. For h = 0 and at a strict saddle x*, we will calculate the general differential of g:

    Dg(x*, 0) = [ I − η D_x q_x(x*, 0)    −η D_h q_x(x*, 0) ]   =   [ I − η ∇²f(x*)    −η D_h q_x(x*, 0) ]
                [ 0                        β ∂q_h(0)/∂h     ]       [ 0                  β ∂q_h(0)/∂h     ]

with eigenvalues β ∂q_h(0)/∂h and (1 − ηλ_i), where λ_i are the eigenvalues of ∇²f(x*). Since x* is a strict saddle, there is at least one eigenvalue λ_i < 0, and 1 − ηλ_i > 1. Thus (x*, 0) is an unstable fixed point of g. To prove that these are the only unstable fixed points, observe that β ∂q_h(0)/∂h ∈ (0, 1), so the only way Dg(x*, 0) has an eigenvalue greater than 1 is for some λ_i to be negative, and therefore x* should be a strict saddle. ∎
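As a quick numerical illustration of the lemma (the toy function and parameter values are our own), one can verify that for f(x) = x_1² − x_2² the Jacobian of g at the strict saddle (x*, h) = (0, 0) has an eigenvalue larger than one:

import numpy as np

ell = 2.0                        # gradient Lipschitz constant of f(x) = x1^2 - x2^2
eta, beta = 0.1, 0.5             # eta < 1/L and beta < 1/B for L = sqrt(2)*ell, B = 1
hessian = np.diag([2.0, -2.0])   # Hessian of f at the saddle x* = 0

# Dg at (x*, 0) is block upper triangular with blocks I - eta*Hessian and beta * dq_h/dh.
jacobian = np.zeros((3, 3))
jacobian[:2, :2] = np.eye(2) - eta * hessian
jacobian[2, 2] = beta            # q_h(h) = h, so dq_h(0)/dh = 1

print(np.linalg.eigvals(jacobian))   # {0.8, 1.2, 0.5}: the eigenvalue 1.2 > 1 certifies instability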
Lemma 8. Assume that $g$ is an $(L,B,c)$-well-behaved function for a function $f$ with $\beta < \frac{1}{B}$. Then for each first order stationary point $x^*$ of $f$, $\binom{x^*}{0}$ is a fixed point of $g$. Additionally, $g$ has no other fixed points.

Proof. For $\beta < \frac{1}{B}$ we have that $g_h(h) = \beta q_h(h)$ is a contraction, since its Lipschitz constant is less than one. So the only fixed point of $g_h$ is $0$. Therefore, for $h \ne 0$ no point $\binom{x}{h}$ is a fixed point. Now for $h = 0$ we get that $q_x(x,0) = \nabla f(x)$, so we have
\[ x_{k+1} = x_k - \eta\nabla f(x_k). \tag{3} \]
So $x$ is a fixed point if and only if $\nabla f(x) = 0$. Combining this with the requirement that all fixed points of $g$ have $h = 0$ proves the lemma.

In order to prove Theorem 1 we also have to prove the diffeomorphism property of $g$.

Lemma 9. If $g$ is an $(L,B,c)$-well-behaved function and $\eta < \frac{1}{L}$, then $\det(\mathrm{D}g(\cdot)) \ne 0$.

Proof. Let
\[ \mathcal{K} = \mathrm{D}_x q_x(x,h). \tag{4} \]
By a straightforward calculation,
\[
\mathrm{D}g\binom{x}{h} = \begin{pmatrix} I - \eta\mathcal{K} & -\eta\,\mathrm{D}_h q_x(x,h) \\ 0 & \beta\frac{\partial q_h(h)}{\partial h} \end{pmatrix}.
\]
Given that $q_x(\cdot,h)$ is $L$-Lipschitz for all $h \in \mathbb{R}$, we have that $\|\mathcal{K}\| \le L$. Clearly $\det(I - \eta\mathcal{K}) \ne 0$, since every singular value of $I - \eta\mathcal{K}$ is at least $1 - \eta L > 0$. Finally, we have that
\[ \det\Big(\mathrm{D}g\binom{x}{h}\Big) = \beta\frac{\partial q_h(h)}{\partial h}\,\det(I - \eta\mathcal{K}) \ne 0. \]
A quick numerical sanity check of this non-degeneracy is sketched below.
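Complementing the previous sketch, the following snippet (same illustrative setup and assumptions; a numerical sanity check, not a proof of the diffeomorphism property) evaluates $\det(\mathrm{D}g)$ at a few random points $(x,h)$ with $\eta < 1/L$ and observes that it stays bounded away from zero, as Lemma 9 predicts.

```python
import numpy as np

A = np.diag([2.0, -1.0])                      # Hessian of the quadratic test objective below
f = lambda x: 0.5 * x @ A @ x
grad_f = lambda x: A @ x

def q_x(x, h):                                # forward-difference oracle with q_x(x, 0) = grad f(x)
    if h == 0.0:
        return grad_f(x)
    return np.array([(f(x + h * e) - f(x)) / h for e in np.eye(len(x))])

q_h = lambda h: 0.5 * h
eta, beta = 0.1, 0.9                          # here L = 2, so eta < 1/L holds

def G(z):                                     # augmented map on (x, h)
    x, h = z[:-1], z[-1]
    return np.append(x - eta * q_x(x, h), beta * q_h(h))

def num_jacobian(z, eps=1e-6):
    n = len(z)
    return np.column_stack([(G(z + eps * np.eye(n)[j]) - G(z - eps * np.eye(n)[j])) / (2 * eps)
                            for j in range(n)])

rng = np.random.default_rng(0)
dets = [np.linalg.det(num_jacobian(rng.normal(size=3))) for _ in range(5)]
print(dets)   # all well away from zero: det(Dg) = beta*q_h'(h)*det(I - eta*D_x q_x) != 0
```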
A straightforward application of the result of Lee et al. [2019] and the SMT yields a saddle-avoidance guarantee of the following kind:

Let $X^*_f$ be the set of the strict saddle points of $f$, $\eta < \frac{1}{L}$ and $\beta < \frac{1}{B}$. Then it holds that
\[ \Pr\Big(\Big\{\binom{x_0}{h_0} : \lim_{k\to\infty} x_k \in X^*_f\Big\}\Big) = 0. \]
Notice that the random choice would be over both $x_0$ and $h_0$. In the following subsection we will prove that a stronger result, where the random initialization refers only to the $x$-domain, is surprisingly possible via a new refinement of the SMT:
\[ \forall h_0 \in \mathbb{R}: \quad \Pr\big(\lim_{k\to\infty} x_k = x^*\big) = 1, \]
where $x^*$ is a local minimizer of $f$. Let us first describe our general strategy for proving this refinement:
1. We will restate the Stable Manifold Theorem and understand its implications. (Section B.2.1)
2. We will study the structure of the eigenvalues of $\mathrm{D}g$ at fixed points of $g$. (Section B.2.2)
3. We will show how this affects the projections onto the stable and unstable eigenspaces of $\mathrm{D}g$. (Section B.2.3)
4. Finally, we will see how this enables us to study the dimension of the stable manifold when $h_0$ is fixed. (Section B.2.4)

B.2 A Refinement of the Stable Manifold Theorem

B.2.1 Understanding the Stable Manifold Theorem

Theorem 5 (Theorem III.2 & III.7 of Shub [1987]). Let $p$ be a fixed point for the $C^r$ local diffeomorphism $h : U \to \mathbb{R}^n$, where $U \subset \mathbb{R}^n$ is an open neighborhood of $p$ in $\mathbb{R}^n$ and $r \ge 1$. Let $E^s \oplus E^c \oplus E^u$ be the invariant splitting of $\mathbb{R}^n$ into generalized eigenspaces of $\mathrm{D}h(p)$ corresponding to eigenvalues of absolute value less than one, equal to one, and greater than one. To the $\mathrm{D}h(p)$-invariant subspace $E^s \oplus E^c$ there is an associated local $h$-invariant embedded disc $W^{loc}_{sc}$, which is the graph of a $C^r$ function $r : E^s \oplus E^c \to E^u$, and a ball $B$ around $p$, such that $h(W^{loc}_{sc}) \cap B \subset W^{loc}_{sc}$. If $h^n(x) \in B$ for all $n \ge 0$, then $x \in W^{loc}_{sc}$.

We will give some intuition on how the Stable Manifold Theorem restricts the dimensionality of the stable manifold. It essentially boils down to restricting the dimensionality of the manifold $W^{loc}_{sc}$. Let $x \in U$; it can be decomposed into two vectors $x_{sc}$ and $x_u$, the projections of $x$ onto $E^s \oplus E^c$ and $E^u$ respectively. By the construction of $W^{loc}_{sc}$ in the proof of the Stable Manifold Theorem, we know that there is a function $r : E^s \oplus E^c \to E^u$ such that if $x \in W^{loc}_{sc}$, then $(x_{sc}, x_u) \in \mathrm{graph}(r)$, or equivalently $x_u = r(x_{sc})$. By construction $r$ is smooth, so $\dim(W^{loc}_{sc}) = \dim(\mathrm{graph}(r)) = \dim(E^s \oplus E^c)$. To understand why the last statement is true, the interested reader can look at Example 5.14 of Loring [2008].

B.2.2 Eigenvalues of the Jacobian at fixed points

Our main tool for understanding the structure of the eigenvalues of $\mathrm{D}g$ at fixed points of $g$ is comparing and contrasting it with its first order counterpart, gradient descent. The dynamical system of gradient descent is
\[ x_{k+1} = x_k - \eta\nabla f(x_k). \]
Now let us pick a first order stationary point $x^*$ of $f$. The Jacobian of the gradient descent map at $x^*$ is
\[ I - \eta\nabla^2 f(x^*), \]
which is a symmetric matrix, since $f$ is $C^2$. We can therefore write down its real orthonormal eigenvectors $\{v_i\}_{i=1}^d$. Without loss of generality we can reorder them so that the first $k$ eigenvectors correspond to eigenvalues of absolute value less than one, the next $s$ correspond to eigenvalues of absolute value equal to one, and the last ones correspond to eigenvalues of absolute value larger than one. Based on this separation between the eigenvectors, we can now define the following three vector spaces:
\[ E_s = \mathrm{span}\{v_1,\cdots,v_k\}, \qquad E_c = \mathrm{span}\{v_{k+1},\cdots,v_{k+s}\}, \qquad E_u = \mathrm{span}\{v_{k+s+1},\cdots,v_d\}. \]
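For intuition (our own illustration; the Hessian below is an arbitrary assumption), the next snippet performs exactly this splitting on a small example: it diagonalizes $I - \eta\nabla^2 f(x^*)$ and groups the orthonormal eigenvectors into the stable, center and unstable eigenspaces $E_s$, $E_c$, $E_u$. At a strict saddle the unstable eigenspace is non-trivial, so $\dim(E_s \oplus E_c) < d$.

```python
import numpy as np

eta = 0.1
# Hypothetical Hessian at a stationary point x*: one positive eigenvalue,
# one zero eigenvalue (center direction) and one negative eigenvalue (strict saddle).
H = np.diag([2.0, 0.0, -1.0])

J = np.eye(3) - eta * H                      # Jacobian of gradient descent at x*
lam, V = np.linalg.eigh(J)                   # symmetric => real spectrum, orthonormal eigenvectors

tol = 1e-12
E_s = V[:, np.abs(lam) < 1 - tol]            # eigenvalues with |lambda| < 1
E_c = V[:, np.abs(np.abs(lam) - 1) <= tol]   # eigenvalues with |lambda| = 1
E_u = V[:, np.abs(lam) > 1 + tol]            # eigenvalues with |lambda| > 1

print(lam)                                        # [0.8, 1.0, 1.1]
print(E_s.shape[1], E_c.shape[1], E_u.shape[1])   # 1 1 1: the three dimensions sum to d
```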
Then we can prove the following interesting lemma.

Lemma 10. If $v$ is an eigenvector of $I - \eta\nabla^2 f(x^*)$, the Jacobian of gradient descent at $x^*$, then $\binom{v}{0}$ is an eigenvector of $\mathrm{D}g\binom{x^*}{0}$ with the same eigenvalue.

Proof. By a straightforward calculation,
\[
\mathrm{D}g\binom{x^*}{0} = \begin{pmatrix} I - \eta\,\mathrm{D}_x q_x(x^*,0) & -\eta\,\mathrm{D}_h q_x(x^*,0) \\ 0 & \beta\frac{\partial q_h(0)}{\partial h} \end{pmatrix}
= \begin{pmatrix} I - \eta\nabla^2 f(x^*) & -\eta\,\mathrm{D}_h q_x(x^*,0) \\ 0 & \beta\frac{\partial q_h(0)}{\partial h} \end{pmatrix}.
\]
Indeed, if $v$ is an eigenvector of $I - \eta\nabla^2 f(x^*)$ with eigenvalue $\lambda$, then
\[
\mathrm{D}g\binom{x^*}{0}\binom{v}{0} = \begin{pmatrix} I - \eta\nabla^2 f(x^*) & -\eta\,\mathrm{D}_h q_x(x^*,0) \\ 0 & \beta\frac{\partial q_h(0)}{\partial h} \end{pmatrix}\binom{v}{0} = \binom{\lambda v}{0} = \lambda\binom{v}{0}.
\]

Now we know the form of $d$ out of the $d+1$ generalized eigenvectors of $\mathrm{D}g\binom{x^*}{0}$. There must be at least one more generalized eigenvector along with its corresponding eigenvalue. It is known that the generalized eigenvectors span the whole space, but so far all of the eigenvectors above have a zero in the last coordinate. So the last generalized eigenvector must have a non-zero value in the last coordinate. Without loss of generality we can assume that this last coordinate is 1, so the vector is of the form $\binom{\tilde v}{1}$. We would like to determine its corresponding eigenvalue.

Lemma 11.
The eigenvalue of $\mathrm{D}g\binom{x^*}{0}$ that corresponds to $\binom{\tilde v}{1}$ is $\beta\frac{\partial q_h(0)}{\partial h}$.

Proof.
Since the last row of $\mathrm{D}g\binom{x^*}{0}$ contains only one non-zero element, we know that the characteristic polynomial $p$ of $\mathrm{D}g\binom{x^*}{0}$ can be written as
\[
\det\Big(\mathrm{D}g\binom{x^*}{0} - \lambda I_{(d+1)\times(d+1)}\Big) = \det\big((I - \eta\nabla^2 f(x^*)) - \lambda I_{d\times d}\big)\,\Big(\beta\frac{\partial q_h(0)}{\partial h} - \lambda\Big).
\]
Given that all the other eigenvalues cover the roots of the first factor, we know that the last eigenvalue is $\beta\frac{\partial q_h(0)}{\partial h}$.

By assumption we know that $0 < \beta\frac{\partial q_h(0)}{\partial h} < 1$. Thus the last generalized eigenvector corresponds to a stable eigenvalue. Now we can write down the following:
\[
E^{g}_s = \mathrm{span}\Big\{\binom{v_1}{0},\cdots,\binom{v_k}{0},\binom{\tilde v}{1}\Big\}, \qquad
E^{g}_c = \mathrm{span}\Big\{\binom{v_{k+1}}{0},\cdots,\binom{v_{k+s}}{0}\Big\}, \qquad
E^{g}_u = \mathrm{span}\Big\{\binom{v_{k+s+1}}{0},\cdots,\binom{v_d}{0}\Big\}. \tag{5}
\]

B.2.3 Projections to stable and unstable eigenspaces of the Jacobian
In this paragraph we want to learn more about the projection to the stable and unstable eigenspacesof D g . Specifically for any vector (cid:0) x h (cid:1) , there are unique x g sc , x g u , h s , h u such that (cid:18) x h (cid:19) = (cid:18) x g sc h s (cid:19) + (cid:18) x g u h u (cid:19)(cid:18) x g sc h s (cid:19) ∈ E g s ⊕ E g c and (cid:18) x g u h u (cid:19) ∈ E g u Let us compute these projections. Given that the generalized eigenvectors span the whole space, wehave that there are unique λ i ∈ R such that (cid:18) x h (cid:19) = n (cid:88) i =1 λ i (cid:18) v i (cid:19) + λ n +1 (cid:18) ˜ v (cid:19) ⇔ λ n +1 = h and x = n (cid:88) i =1 λ i v i + h ˜ v ⇔ λ n +1 = h and x − h ˜ v = n (cid:88) i =1 λ i v i ⇔ λ n +1 = h and λ i = (cid:104) x − h ˜ v , v i (cid:105) Since v i are orthogonal as eigenvectors of a symmetrical matrix. We can now find the vectors andvalues x g sc , x g u , h s , h u x g sc = k + (cid:96) (cid:88) i =1 λ i v i + h ˜ v = k + (cid:96) (cid:88) i =1 (cid:104) x − h ˜ v , v i (cid:105) v i + h ˜ v = k + (cid:96) (cid:88) i =1 (cid:104) x , v i (cid:105) v i + h (cid:32) ˜ v − k + (cid:96) (cid:88) i =1 (cid:104) ˜ v, v i (cid:105) v i (cid:33) x g u = n (cid:88) i = k + (cid:96) +1 λ i v i = n (cid:88) i = k + (cid:96) +1 (cid:104) x − h ˜ v , v i (cid:105) v i = n (cid:88) i = k + (cid:96) +1 (cid:104) x , v i (cid:105) v i − h n (cid:88) i = k + (cid:96) +1 (cid:104) ˜ v , v i (cid:105) v i h s = h and h u = 0 Once again we will compare and contrast with the first order case. Equivalently for every vector x there are unique x g sc , x g u such that x = x g sc + x g u x g sc ∈ E g s ⊕ E g c and x g u ∈ E g u q = ˜ v − k + (cid:96) (cid:88) i =1 (cid:104) ˜ v , v i (cid:105) v i = n (cid:88) i = k + (cid:96) +1 (cid:104) ˜ v , v i (cid:105) v i (6)Then clearly x g sc = x g sc + h qx g u = x g u − h q (7) h sc = hh u = 0 B.2.4 Restricting the dimension of the stable manifold for fixed initial h In this paragraph we are ready to finally prove Theorem . Theorem 6 (Theorem 1 restated) . Let g be a ( L, B, c ) -well-behaved function for function f . Let X ∗ f be the set of strict saddle points of f . Then if η < L and β < B : ∀ h ∈ R : µ ( { x : lim k →∞ x k ∈ X ∗ f } ) = 0 Proof.
Without loss of generality, let us fix $h = h_0$. Let us define $M_{h_0}$ as
\[ M_{h_0} = \big\{x_0 \in \mathbb{R}^d : \lim_{k\to\infty} g^k(x_0,h_0) = (x^*,0) \text{ for some } x^* \in X^*_f\big\}. \]
We want to prove that the set $M_{h_0}$ has measure $0$. Let us apply the Stable Manifold Theorem to $g$ at all fixed points $p = (x^*,0) \in X^*_f \times \{0\}$. Let $B_p$, $W^{loc}_{sc,p}$ be the ball and the corresponding manifold derived from Theorem 5. We consider the union of those balls $B = \bigcup_p B_p$. The following property of $\mathbb{R}^N$ holds:

Theorem (Lindelöf's lemma). For every open cover there is a countable subcover.
Therefore, due to Lindelöf's lemma, we can find a countable subcover for $B$, i.e., there exists a countable family of fixed points $p_0, p_1, \cdots$ such that $B = \bigcup_{m=0}^{+\infty} B_{p_m}$. Once again, based on Theorem 5, if starting from $x_0$ one converges to an unstable fixed point, then it holds that
\[
x_0 \in M_{h_0} \;\Rightarrow\; \exists\, m, t_0 : \forall t \ge t_0,\; (x_t,h_t) = g^t(x_0,h_0) \in B_{p_m} \;\Rightarrow\; \exists\, m, t_0 : (x_{t_0},h_{t_0}) = g^{t_0}(x_0,h_0) \in W^{loc}_{sc,p_m}.
\]
Let us define
\[ U^m_t = \big\{x_0 \in \mathbb{R}^d : (x_t,h_t) = g^t(x_0,h_0) \text{ and } (x_t,h_t) \in W^{loc}_{sc,p_m}\big\}. \]
Therefore we have
\[ M_{h_0} \subseteq \bigcup_{m=0}^{\infty}\bigcup_{t=0}^{\infty} U^m_t. \]
Now it suffices to prove that all the sets $U^m_t$ have zero measure. Let us first prove the following lemma as a stepping stone.

Lemma.
Let us define the following set of points:
\[ R^m_h = \big\{x \in \mathbb{R}^d : (x,h) \in W^{loc}_{sc,p_m}\big\}. \]
Then $\dim(R^m_h) < d$.

Proof. Based on our discussion of the Stable Manifold Theorem, we know that there is a smooth function $r : E^{g}_s \oplus E^{g}_c \to E^{g}_u$ such that
\[ \binom{x}{h} \in W^{loc}_{sc,p_m} \;\Rightarrow\; \binom{x^{g}_u}{h_u} = r\big(x^{g}_{sc},\, h_s\big), \]
where $x^{g}_u$, $x^{g}_{sc}$, $h_s$ and $h_u$ are the components of the projections onto $E^{g}_s \oplus E^{g}_c$ and $E^{g}_u$, as defined in the equations of (5). Now, using our analysis in the equations of (7),
\[ \binom{x}{h} \in W^{loc}_{sc,p_m} \;\Rightarrow\; \binom{x_u - h\,q}{0} = r\big(x_{sc} + h\,q,\, h\big), \]
where $q$ is the vector we defined in Equation (6) and $x_{sc}$, $x_u$ are the components of $x$ along $E_s \oplus E_c$ and $E_u$. Let $\Pi$ be the projection that for each $\binom{x}{h} \in \mathbb{R}^{d+1}$ returns $x$. Then we can define the following smooth function $r'_h : E_s \oplus E_c \to E_u$:
\[ r'_h(x) = h\,q + \Pi\, r(x + h\,q,\, h). \]
Using $\{v_i\}_{i=1}^d$ as a basis, we can write
\[ \binom{x}{h} \in W^{loc}_{sc,p_m} \;\Rightarrow\; x_u = r'_h(x_{sc}) \;\Rightarrow\; x \in \mathrm{graph}(r'_h). \]
Therefore $\dim(R^m_h) \le \dim(E_s \oplus E_c) < d$, since $p_m$ corresponds to an unstable fixed point of $g$.

Then we can prove the following lemma.

Lemma 12.
The measure of $U^m_t$ is zero.

Proof. We argue by contradiction. Let us assume that $U^m_t$ has non-zero measure. Let us define
\[
W^m_0 = U^m_t, \qquad W^m_1 = \big\{x \in \mathbb{R}^d : x \in g(W^m_0, h_0)\big\}, \qquad \dots, \qquad W^m_t = \big\{x \in \mathbb{R}^d : x \in g(W^m_{t-1}, h_{t-1})\big\}.
\]
Given that $g(\cdot, h_i)$ is a diffeomorphism for all $i$, we have that every $W^m_i$ has non-zero measure. Observe that $W^m_t \subseteq R^m_{h_t}$, so $\dim(W^m_t) < d$ and $W^m_t$ has measure zero, leading to a contradiction.

Since a countable union of zero measure sets has zero measure, we clearly have that $M_{h_0}$ has measure zero, as requested.

In the previous section, we provided sufficient conditions to avoid convergence to strict saddle points. These results are meaningful, however, only if $\lim_{k\to\infty} x_k = x^*$ exists. Thus, in order to complete the proof of Theorem 3, in the following section we provide sufficient conditions under which the dynamical system of AGD converges.

B.3 Convergence

We will refer to the error of the gradient approximation as $\varepsilon_k = q_x(x_k,h_k) - \nabla f(x_k)$. In order to prove convergence, we first establish a lower bound on the decrease of the function value, connected with the norm of the gradient and its approximation error (Lemma 2). We also prove that our scheme yields an exponential decrease of that error (Lemma 14). Given those lemmas, we can prove an exact and an $\epsilon$-first order stationary point convergence theorem.

Lemma 13 (Lemma 2 restated). Suppose that $g$ is an $(L,B,c)$-well-behaved function for an $\ell$-gradient Lipschitz function $f$. If $\eta \le \frac{1}{\ell}$, then we have that
\[ f(x_{k+1}) \le f(x_k) - \frac{\eta}{2}\big(\|\nabla f(x_k)\|^2 - \|\varepsilon_k\|^2\big). \tag{8} \]

Proof.
\begin{align*}
f(x_{k+1}) &\le f(x_k) + \nabla f(x_k)^\top (x_{k+1}-x_k) + \frac{\ell}{2}\|x_{k+1}-x_k\|^2 \\
&\le f(x_k) - \eta\,\nabla f(x_k)^\top q_x(x_k,h_k) + \frac{\eta^2\ell}{2}\|q_x(x_k,h_k)\|^2 \\
&\le f(x_k) - \eta\,\nabla f(x_k)^\top \big(\nabla f(x_k)+\varepsilon_k\big) + \frac{\eta^2\ell}{2}\|\nabla f(x_k)+\varepsilon_k\|^2 \\
&\le f(x_k) - \eta\,\nabla f(x_k)^\top \big(\nabla f(x_k)+\varepsilon_k\big) + \frac{\eta}{2}\|\nabla f(x_k)+\varepsilon_k\|^2 \\
&\le f(x_k) - \frac{\eta}{2}\big(\|\nabla f(x_k)\|^2 - \|\varepsilon_k\|^2\big).
\end{align*}

Lemma 14 (Exponentially decreasing $\varepsilon_k$). Suppose that $g$ is an $(L,B,c)$-well-behaved function for a function $f$. Then we have that
\[ \|\varepsilon_k\| \le c\,|h_0|\,(\beta B)^k. \]

Proof.
Since $q_h$ is $B$-Lipschitz,
\[ |h_{k+1}| = |\beta q_h(h_k) - \beta q_h(0)| \le \beta B\,|h_k|. \]
Therefore we have that $|h_k| \le (\beta B)^k |h_0|$. Based on property 3 of the $(L,B,c)$-well-behaved function, we have that
\[ \|\varepsilon_k\| = \|q_x(x_k,h_k) - \nabla f(x_k)\| \le c\,|h_k| \le c\,|h_0|\,(\beta B)^k. \]
A small numerical sanity check of this geometric decay is sketched below. With it in hand, we are ready to prove convergence to the first order stationary points.
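The following small sketch is our illustration only; the quadratic test function, the forward-difference oracle and all constants are assumptions. It checks that the gradient-approximation error of a forward-difference oracle decays geometrically when $h_k$ is contracted by $\beta q_h$ with $\beta B < 1$.

```python
import numpy as np

# Hypothetical smooth test function and its gradient.
A = np.diag([1.0, 3.0])
f = lambda x: 0.5 * x @ A @ x
grad_f = lambda x: A @ x

def q_x(x, h):
    # Forward-difference gradient oracle; its error is O(h) for smooth f.
    if h == 0.0:
        return grad_f(x)
    d = len(x)
    return np.array([(f(x + h * np.eye(d)[i]) - f(x)) / h for i in range(d)])

eta, beta = 0.1, 0.9
q_h = lambda h: h                      # q_h(h) = h is 1-Lipschitz (B = 1), so beta*B = 0.9 < 1

x, h = np.array([1.0, -1.0]), 0.5
for k in range(10):
    err = np.linalg.norm(q_x(x, h) - grad_f(x))
    print(k, h, err)                   # err shrinks at the geometric rate (beta*B)^k, as in Lemma 14
    x = x - eta * q_x(x, h)            # approximate gradient descent step
    h = beta * q_h(h)                  # auxiliary variable contracts: |h_{k+1}| <= beta*B*|h_k|
```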
Theorem 7 (Theorem 2 restated). Suppose that $g$ is an $(L,B,c)$-well-behaved gradient function for an $\ell$-gradient Lipschitz function $f$. Let $\eta \le \frac{1}{\ell}$, $\beta < \frac{1}{B}$. Then, if $f$ is lower bounded,
\[ \lim_{k\to\infty}\|\nabla f(x_k)\| = 0. \]

Proof. Applying Lemma 2 repeatedly, we get
\[ f(x_0) - f(x_k) \ge \frac{\eta}{2}\sum_{i=0}^{k}\big(\|\nabla f(x_i)\|^2 - \|\varepsilon_i\|^2\big). \]
We now have that
\begin{align*}
f(x_0) - f(x_k) + \frac{\eta}{2}\sum_{i=0}^{k}\|\varepsilon_i\|^2 &\ge \frac{\eta}{2}\sum_{i=0}^{k}\|\nabla f(x_i)\|^2 \\
f(x_0) - f(x_k) + \frac{\eta}{2}\sum_{i=0}^{\infty}\|\varepsilon_i\|^2 &\ge \frac{\eta}{2}\sum_{i=0}^{k}\|\nabla f(x_i)\|^2 \\
f(x_0) - f(x_k) + \frac{\eta}{2}\sum_{i=0}^{\infty}\big(c\,|h_0|\,(\beta B)^i\big)^2 &\ge \frac{\eta}{2}\sum_{i=0}^{k}\|\nabla f(x_i)\|^2 \\
f(x_0) - f(x_k) + \frac{\eta}{2}\,\frac{c^2 h_0^2}{1-(\beta B)^2} &\ge \frac{\eta}{2}\sum_{i=0}^{k}\|\nabla f(x_i)\|^2.
\end{align*}
Given that $f$ is lower bounded, $f(x_0) - f(x_k)$ and therefore the whole left hand side is upper bounded, which means the series on the right hand side is upper bounded. Since this is a series of non-negative terms, the series converges and therefore $\lim_{k\to\infty}\|\nabla f(x_k)\| = 0$.

For the sake of completeness, we also analyze the convergence rate to $\epsilon$-first order stationary points in this setting. This enables a fair comparison with previous results that assume a fixed $h_k = h_0$. Notice that the following result improves over previous work on randomized zero order gradient approximations. In Nesterov and Spokoiny [2017], it was proved that using a randomized oracle that requires 2 function evaluations per iteration, one could get an in-expectation $\epsilon$-first order stationary point after $O\big(d\ell(f(x_0)-f^*)/\epsilon^2\big)$ iterations. For the case of $q_x$ using $r_f$ as defined in Equation 1 of Section 3, we have just proved that with $d+1$ function evaluations per iteration we can get an $\epsilon$-first order stationary point after only $O\big(\ell(f(x_0)-f^*)/\epsilon^2\big)$ iterations. Thus, for the same number of function evaluations up to constants, our work provides deterministic guarantees, whereas Nesterov and Spokoiny [2017] provides guarantees only in expectation.

Theorem 8 ($\epsilon$-first order stationary points). Suppose that $g$ is an $(L,B,c)$-well-behaved gradient function for an $\ell$-gradient Lipschitz function $f$. Let $q_h(h) = h$, $\beta = 1$ and $\eta = \frac{1}{\ell}$. Then, if $f$ has minimum value $f^*$ and $h_0 = \frac{\epsilon}{\sqrt{2}\,c}$, the required number of iterations to reach an $\epsilon$-first order stationary point is
\[ O\Big(\frac{\ell\,(f(x_0)-f^*)}{\epsilon^2}\Big). \]

Proof.
Applying Lemma 2 repeatedly, we get
\[ f(x_0) - f(x_k) \ge \frac{1}{2\ell}\sum_{i=0}^{k}\big(\|\nabla f(x_i)\|^2 - \|\varepsilon_i\|^2\big). \]
24e now have that f ( x ) − f ( x k ) + 12 (cid:96) k (cid:88) i =0 (cid:107) ε i (cid:107) ≥ (cid:96) k (cid:88) i =0 (cid:107)∇ f ( x i ) (cid:107) f ( x ) − f ( x k ) + k + 12 (cid:96) ( c | h | ) ≥ (cid:96) k (cid:88) i =0 (cid:107)∇ f ( x i ) (cid:107) (cid:96) ( f ( x ) − f ( x k ))2( k + 1) + c | h | ≥ k + 1 k (cid:88) i =0 (cid:107)∇ f ( x i ) (cid:107) (cid:96) ( f ( x ) − f ∗ )2( k + 1) + (cid:15) ≥ k + 1 k (cid:88) i =0 (cid:107)∇ f ( x i ) (cid:107) Choose the smallest k such that (cid:96) ( f ( x ) − f ∗ )( k +1) ≤ (cid:15) . Then we have (cid:15) ≥ k + 1 k (cid:88) i =0 (cid:107)∇ f ( x i ) (cid:107) Since the average of the squared norms of the gradients is less than (cid:15) , there should be at least onethat is less or equal to (cid:15) . That is there is a k ≤ k such that (cid:107)∇ f ( x k ) (cid:107) ≤ (cid:15) . Given the definition of k we get the iteration bound stated in the theorem.The last theorems give us a guarantee that the norm of the gradient is converging to zerobut this is not enough to prove convergence to a single stationary point if f has non isolatedcritical points. To establish a stronger result we prove that {(cid:107)∇ f ( x k ) (cid:107)} does not decreasearbitrarily quickly. Lemma 15 (Sufficiently large gradients) . Suppose that g is a ( L, B, c ) -well-behaved function for a (cid:96) -gradient Lipschitz function f . Then we have that (cid:107)∇ f ( x k +1 ) (cid:107) ≥ (1 − η(cid:96) ) (cid:107)∇ f ( x k ) (cid:107) − η(cid:96) (cid:107) ε k (cid:107) Proof. (cid:107)∇ f ( x k +1 ) (cid:107) ≥ (cid:107)∇ f ( x k ) (cid:107) − (cid:107)∇ f ( x k +1 ) − ∇ f ( x k ) (cid:107)≥ (cid:107)∇ f ( x k ) (cid:107) − (cid:96) (cid:107) x k +1 − x k (cid:107)≥ (cid:107)∇ f ( x k ) (cid:107) − η(cid:96) (cid:107) q x ( x k , h k ) (cid:107)≥ (cid:107)∇ f ( x k ) (cid:107) − η(cid:96) (cid:107)∇ f ( x k ) + ε k (cid:107)≥ (cid:107)∇ f ( x k ) (cid:107) − η(cid:96) (cid:107)∇ f ( x k ) (cid:107) − η(cid:96) (cid:107) ε k (cid:107)≥ (1 − η(cid:96) ) (cid:107)∇ f ( x k ) (cid:107) − η(cid:96) (cid:107) ε k (cid:107) Theorem 9.
Assume that f is (cid:96) -gradient Lipschitz, is analytic and that it has compact sub-level setsand that g is a ( L, B, c ) -well-behaved gradient oracle. Let η < (cid:96) , β < − η(cid:96)B . Then lim x k existsand is a stationary point of f .Proof. We will first prove that given the fact that f has compact sub-level sets { x k } is confined incompact set. Based on Lemma 2 we have that for all k ≥ f ( x k +1 ) − f ( x k ) ≤ η (cid:107) ε k (cid:107) Applying this recursively and adding the inequalities f ( x k +1 ) ≤ f ( x ) + η k (cid:88) i =0 (cid:107) ε i (cid:107) ≤ f ( x ) + η k (cid:88) i =0 (cid:0) c | h | ( βB ) i (cid:1) ≤ f ( x ) + η c h k (cid:88) i =0 ( βB ) i ≤ f ( x ) + η c h − ( βB ) So clearly { f ( x k ) } is bounded and therefore { x k } stays in one of the compact sub-level sets of f forever.Let us define the following φ k ( h ) = c | h | ( βB ) k We will split the proof of the theorem in two cases. For the first case we will assume that there is a k ∈ N such that (cid:107)∇ f ( x k ) (cid:107) ≥ φ k ( h ) Then by Lemma 15 (cid:107)∇ f ( x k +1 ) (cid:107) ≥ (1 − η(cid:96) ) (cid:107)∇ f ( x k ) (cid:107) − η(cid:96) (cid:107) ε k (cid:107) ≥ (1 − η(cid:96) ) φ k ( h ) − η(cid:96)φ k ( h ) ≥ (1 − η(cid:96) ) φ k ( h ) ≥ − η(cid:96)βB βBφ k ( h ) ≥ − η(cid:96)βB φ k +1 ( h ) ≥ φ k +1 ( h ) By induction we have that ∀ k ≥ k + 1 (cid:107)∇ f ( x k ) (cid:107) ≥ − η(cid:96)βB φ k ( h ) By Lemma 14 (cid:107)∇ f ( x k ) (cid:107)(cid:107) ε k (cid:107) ≥ (cid:18) − η(cid:96)βB (cid:19) = q >
26t the same time −∇ f ( x k ) (cid:62) ( x k +1 − x k ) = η ∇ f ( x k ) (cid:62) ( ∇ f ( x k ) + ε k )= η (cid:107)∇ f ( x k ) (cid:107) + η ∇ f ( x k ) (cid:62) ε k ≤ η (cid:18) q (cid:19) (cid:107)∇ f ( x k ) (cid:107) Additionally using similar arguments as above −∇ f ( x k ) (cid:62) ( x k +1 − x k ) (cid:107)∇ f ( x k ) (cid:107)(cid:107) ( x k +1 − x k ) (cid:107) ≥ η (cid:16) − q (cid:17) (cid:107)∇ f ( x k ) (cid:107) η (cid:16) q (cid:17) (cid:107)∇ f ( x k ) (cid:107) = (cid:16) − q (cid:17)(cid:16) q (cid:17) Let us define c = 12 (cid:18) − q (cid:19) c = (cid:16) − q (cid:17)(cid:16) q (cid:17) Clearly by Lemma 2 we have that f ( x k ) − f ( x k +1 ) ≥ η (cid:16) (cid:107)∇ f ( x k ) (cid:107) − (cid:107) ε k (cid:107) (cid:17) ≥ η (cid:18) − q (cid:19) (cid:107)∇ f ( x k ) (cid:107) We can conclude that f ( x k ) − f ( x k +1 ) ≥ − c ∇ f ( x k ) (cid:62) ( x k +1 − x k ) ≥ c c (cid:107)∇ f ( x k ) (cid:107)(cid:107) ( x k +1 − x k ) (cid:107) with c c > . Moreover, (cid:107)∇ f ( x k ) (cid:107) ≥ φ k ( h ) > so we do not have to worry about arriving onstationary points in finite time. Given that f is analytic, we have all the necessary conditions ofTheorem 3.2 in Absil et al. [2005] and we have ruled out the possibility of { x k } escaping to infinity.Therefore, we can now claim that { x k } converges.For the second case we have that for for all k ∈ N (cid:107)∇ f ( x k ) (cid:107) < φ k ( h ) . We will now prove that { x k } is a Cauchy sequence. (cid:107) x k − x m (cid:107) ≤ k (cid:88) i = m (cid:107) x i +1 − x i (cid:107)≤ k (cid:88) i = m (cid:107) ηq x ( x i , h i ) (cid:107)≤ η k (cid:88) i = m (cid:107)∇ f ( x i , h i ) + ε i (cid:107)≤ η k (cid:88) i = m φ i ( h ) We know that (cid:80) ∞ i φ i ( h ) converges so the partial sums must converge to 0. Then lim m,k →∞ (cid:107) x k − x m (cid:107) ≤ η lim m,k →∞ k (cid:88) i = m φ i ( h ) = 0 So lim m,k →∞ (cid:107) x k − x m (cid:107) = 0 and { x k } is a Cauchy sequence bounded in a compact set and thereforeit converges.In either of the cases the limit of { x k } is of course a stationary point.27e can now conclude our analysis with this final theorem. Theorem 10 (Theorem 3 restated) . Let f : R d → R ∈ C be a (cid:96) -gradient Lipschitz function. Letus also assume that f is analytic, has compact sub-level sets and all of its saddle points are strict.Let g be a ( L, B, c ) -well-behaved function for f with η < min { L , (cid:96) } and β < − η(cid:96)B . If we pick arandom initialization point x , then we have that for the x k iterates of g ∀ h ∈ R Pr( lim k →∞ x k = x ∗ ) = 1 where x ∗ is a local minimizer of f .Proof. Given the assumptions, we can apply Theorem 9 and get that lim k →∞ x k exists and is astationary point of f . We can also apply Theorem 1 in order to guarantee that the limit is not a strictsaddle of f with probability 1. Given the assumption that f has only strict saddles, then lim k →∞ x k is with probability 1 a local minimum of f . 28 Escaping Saddle Points Efficiently Detailed proofs
Before presenting the iteration complexity proof ( Theorem 4 ) we will state our main probabilisticlemma.
Lemma 16.
There exists an absolute constant $c_{\max}$ such that the following holds for any function $f$ that is $\ell$-gradient Lipschitz and $\rho$-Hessian Lipschitz, any $c \le c_{\max}$ and any $\chi \ge 1$. Let $\eta, r, g_{thres}, f_{thres}, t_{thres}, h_{low}$ be calculated in the same way as in Algorithm 1. Then, if $x_t$ satisfies
\[ \|\nabla f(x_t)\| \le g_{thres} \qquad \text{and} \qquad \lambda_{\min}\big(\nabla^2 f(x_t)\big) \le -\sqrt{\rho\epsilon}, \]
let $\tilde x_0 = x_t + \xi$, where $\xi$ comes from the uniform distribution over the ball $B_0(r)$, and let $\{\tilde x_i\}$ be the iterates of approximate gradient descent from $\tilde x_0$ with step size $\eta$ and $h = h_{low}$. Then, with probability at least $1 - \frac{d\ell}{\sqrt{\rho\epsilon}}e^{-\chi}$, we have
\[ \exists\, i' \le t_{thres} : \quad f(x_t) - f(\tilde x_{i'}) \ge f_{thres}. \]

This lemma will be the "workhorse" which offers the high probability guarantees of Algorithm 1, given that substantial progress can be made in the low gradient phase. The proof of the above lemma is deferred to the end of this section. A schematic sketch of the perturbed zero-order descent loop that this lemma and the theorem below analyze is given next, for illustration only.
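The sketch below is a schematic rendering written by us; it is not the authors' exact Algorithm 1, and the objective `f`, the zero-order oracle `q_x` and all threshold values are placeholder assumptions supplied by the caller.

```python
import numpy as np

def perturbed_zero_order_descent(f, q_x, x0, eta, g_thres, f_thres, t_thres,
                                 h, h_low, r, max_iters=10_000):
    """Schematic loop (a sketch, not the paper's exact Algorithm 1): take
    approximate gradient steps while the zero-order gradient estimate is large;
    when it is small, add a uniform-ball perturbation and check whether f drops
    by f_thres within t_thres steps, as in Lemma 16."""
    x = np.asarray(x0, dtype=float)
    t = 0
    while t < max_iters:
        z = q_x(x, h)                                   # zero-order gradient estimate
        if np.linalg.norm(z) > g_thres:
            x = x - eta * z                             # ordinary approximate gradient step
            t += 1
            continue
        # Low-gradient phase: perturb inside the ball B(r), then descend with h_low.
        x_anchor, f_anchor = x.copy(), f(x)
        xi = np.random.randn(len(x))
        xi *= r * np.random.rand() ** (1.0 / len(x)) / np.linalg.norm(xi)
        x = x_anchor + xi
        escaped = False
        for _ in range(t_thres):
            x = x - eta * q_x(x, h_low)
            t += 1
            if f_anchor - f(x) >= f_thres:
                escaped = True
                break
        if not escaped:
            # No f_thres decrease after the perturbation: return the anchor as the
            # candidate (epsilon-)second order stationary point.
            return x_anchor
    return x
```

In the analysis, the hyperparameters $\eta, r, g_{thres}, f_{thres}, t_{thres}, h_{low}$ are specific functions of $\ell, \rho, \epsilon, c, \delta$ fixed by Algorithm 1; in this sketch they are simply arguments.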
We are ready now to prove our main theorem.

Theorem 11 (Theorem 4 restated). There exists an absolute constant $c_{\max}$ such that: if $f$ is $\ell$-gradient Lipschitz and $\rho$-Hessian Lipschitz, then for any $\delta > 0$, $\epsilon \le \frac{\ell^2}{\rho}$, $\Delta_f \ge f(x_0) - f^\star$, and constant $c \le c_{\max}$, with probability $1-\delta$, the output of PAGD$(x_0, \ell, \rho, \epsilon, c, \delta, \Delta_f)$ will be an $\epsilon$-SOSP, and the algorithm terminates within
\[ O\Big(\frac{\ell\,(f(x_0) - f^\star)}{\epsilon^2}\,\log^4\Big(\frac{d\ell\Delta_f}{\epsilon^2\delta}\Big)\Big) \]
iterations.

Proof.
Denote ˜ c max to be the absolute constant allowed in Lemma 16. In this theorem, we let c max = min { ˜ c max , / } , and choose any constant c ≤ c max .In this proof, that Algorithm 1 returns a point x that satisfies the following condition: (cid:107)∇ f ( x ) (cid:107) ≤ g thres = √ cχ · (cid:15), λ min ( ∇ f ( x )) ≥ −√ ρ(cid:15) (9)Since c ≤ , χ ≥ , we have √ cχ ≤ , which implies any x satisfies Equation (9) is also a (cid:15) -SOSP .Starting from x , we know if x does not satisfy Equation 9, there are only two cases:1. (cid:107) z (cid:107) = (cid:13)(cid:13)(cid:13) q (cid:16) x , g thres c h (cid:17)(cid:13)(cid:13)(cid:13) > g thres In this case, (cid:107)∇ f ( x ) (cid:107) ≥ g thres and Algorithm 1 will not add perturbation. By Lemma 2: f ( x ) − f ( x ) ≥ η · ( (cid:107)∇ f ( x ) (cid:107) − (cid:107) ε (cid:107) ) where ε = q (cid:16) x , g thres c h (cid:17) − ∇ f ( x ) . Therefore we get (cid:107) ε (cid:107) ≤ g thres f ( x ) − f ( x ) ≥ η · ( (cid:107)∇ f ( x ) (cid:107) − (cid:107) ε (cid:107) ) ≥ η g thres ≥ c (cid:15) (cid:96)χ (cid:107) z (cid:107) = (cid:13)(cid:13)(cid:13) q (cid:16) x , g thres c h (cid:17)(cid:13)(cid:13)(cid:13) ≤ g thres In this case, (cid:107)∇ f ( x ) (cid:107) ≤ g thres and Algorithm 1 will add a perturbation ξ of radius r suchthat ˜ x ← x + ξ , and will perform approximate gradient descent (without perturbations)for at most t thres steps. Since x is not a second-order stationary point then by Lemma 16there exists i (cid:48) ≤ t thres such that: f ( x ) − f ( x ) = f ( x ) − f (˜ x i (cid:48) ) ≥ f thres = cχ · (cid:115) (cid:15) ρ f ( x ) − f (˜ x i (cid:48) ) i (cid:48) ≥ f thres t thres = c χ · (cid:15) (cid:96) Hence, we can conclude that as long as Algorithm 1 has not terminated yet, on average, every stepdecreases function value by at least c χ · (cid:15) (cid:96) . However, we clearly can not decrease function value bymore than f ( x ) − f (cid:63) , where f (cid:63) is the minimum value of f . This means Algorithm 1 must terminatewithin the following number of iterations: f ( x ) − f (cid:63)c χ · (cid:15) (cid:96) = χ c · (cid:96) ( f ( x ) − f (cid:63) ) (cid:15) = O (cid:18) (cid:96) ( f ( x ) − f (cid:63) ) (cid:15) log (cid:18) d(cid:96) ∆ f (cid:15) δ (cid:19)(cid:19) Finally, we have to ensure that the above statement holds with high probability. In the worst casescenario, in each outer-loop iteration the algorithm will be enforced to add a perturbation yielding adecrease of f thres . Thus, the maximum number of perturbations are at most: f ( x ) − f (cid:63) f thres = f ( x ) − f (cid:63)cχ · (cid:113) (cid:15) ρ Applying Lemma 16, we know that the guaranteed decrease of f thres happens with probability at least − d(cid:96) √ ρ(cid:15) e − χ each time. By union bound, the probability that all perturbations satisfy the decreaseguarantee is at least − d(cid:96) √ ρ(cid:15) e − χ · f ( x ) − f (cid:63)cχ · (cid:113) (cid:15) ρ = 1 − χ e − χ c · d(cid:96) ( f ( x ) − f (cid:63) ) (cid:15) Recall our choice of χ = 3 max { log( d(cid:96) ∆ f c(cid:15) δ ) , } . Since χ ≥ , we have χ e − χ ≤ e − χ/ , this gives: χ e − χ c · d(cid:96) ( f ( x ) − f (cid:63) ) (cid:15) ≤ e − χ/ d(cid:96) ( f ( x ) − f (cid:63) ) c(cid:15) ≤ δ which finishes the proof.What remains to be proven is why adding a perturbation is guaranteed to help the algorithm decreasethe value of f substantially with high probability. Following the proof strategy of Jin et al. [2017] wewill define some additional notation. 
Let the condition number be the ratio of the Lipschitz constantof ∇ f and the smallest negative eigenvalue of the Hessian of x t before adding the perturbation, i.e κ = (cid:96)/γ ≥ . Additionally we define the following units: p ← log( dκδ ) , L ← η(cid:96), F ← L p γ ρ , G ← √ L p γ ρ , S ← √ L p γρ , R ← S κp , T ← pηγ Following the above definitions, it holds that: S = (cid:113) F pγ = G pγ , (cid:96) R = 2 G and η TG = S (A): The first argument in this proof is that if the ˜ x i iterates do not achieve a decrease of . F in cT steps thenthey must remain confined in a small ball around ˜ x . Lemma 17.
For any constant c ≥ , define: T = min (cid:110) inf t { t | f ( u ) − f ( u t ) ≥ . F } , cT (cid:111) then, for any η ≤ /(cid:96) , we have for all t < T that (cid:107) u t − u (cid:107) ≤ S · c ) . roof of Lemma 17. Applying repeatedly Lemma 2, we get for t < Tf ( u t ) − f ( u ) ≤ − η t (cid:88) i =0 (cid:16) (cid:107)∇ f ( u i ) (cid:107) − (cid:107) ε i (cid:107) (cid:17) where ε i = q x ( u i , h low ) − ∇ f ( u i ) . By definition of T we have that the function value of f has not yet decreased by . F . η t (cid:88) i =0 (cid:107)∇ f ( u i ) (cid:107) ≤ f ( u ) − f ( u t ) + η t (cid:88) i =0 (cid:107) ε i (cid:107) η t (cid:88) i =0 (cid:107)∇ f ( u i ) (cid:107) ≤ . F + η t (cid:88) i =0 (cid:107) ε i (cid:107) Since T ≤ cT and also (cid:107) ε i (cid:107) ≤ G we then have η t (cid:88) i =0 (cid:107)∇ f ( u i ) (cid:107) ≤ . F + η G cT t (cid:88) i =0 (cid:107)∇ f ( u i ) (cid:107) ≤ η F + G cT t (cid:88) i =0 (cid:16) (cid:107)∇ f ( u i ) (cid:107) + (cid:107) ε i (cid:107) (cid:17) ≤ η F + 2 G cT We also have that (cid:107) q x ( u i , h low ) (cid:107) ≤ (cid:16) (cid:107)∇ f ( u i ) (cid:107) + (cid:107) ε i (cid:107) (cid:17) . Therefore we have that t (cid:88) i =0 (cid:107) q x ( u i , h low ) (cid:107) ≤ η F + 4 G cT Now we can bound the difference between u t and u : (cid:107) u t − u (cid:107) = (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) t (cid:88) i =1 u i − u i − (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) ≤ t t (cid:88) i =1 (cid:107) u i − u i − (cid:107) ≤ tη t (cid:88) i =0 (cid:107) q x ( u i , h low ) (cid:107) ≤ tη (cid:18) η F + 4 G cT (cid:19) ≤ tη (cid:18) η F + 4 G cT (cid:19) ≤ cT η (cid:18) η F + 4 G cT (cid:19) Manipulating the constants we get (cid:107) u t − u (cid:107) ≤ (cid:0) c + c (cid:1) S (cid:107) u t − u (cid:107) ≤ (cid:112) (10 c + c ) S For any c ≥ we have (cid:107) u t − u (cid:107) ≤ c S ) u are constrained in a smallball, iterates from w = u + µ · R e , for large enough µ must be able to decrease the function value. In orderto do that, we keep track of vector v which is the difference between { u i } and { w i } . We also decompose v into two different eigenspaces: the direction e (the minimum-eigenvalue eigenvector) and its orthogonalsubspace. Lemma 18.
There exists absolute constant c max , c such that: for any δ ∈ (0 , dκe ] , let f ( · ) , ˆ x satisfiesthe following conditions (cid:107)∇ f (ˆ x ) (cid:107) ≤ G and λ min ( ∇ f (ˆ x )) ≤ − γ and any two sequences { u t } , { w t } with initial points u , w satisfying: w = u + µ · R · e , µ ∈ [ δ/ (2 √ d ) , , (cid:107) u − ˆ x (cid:107) ≤ R e is the eignevector of the minimum eigenvalue of ∇ f (ˆ x ) . Assume also that h low ≤ ρ S δ c h √ d R .Define T = min (cid:110) inf t { t | f ( w ) − f ( w t ) ≥ . F } , cT (cid:111) then, for any η ≤ c max /(cid:96) , if (cid:107) u t − u (cid:107) ≤ S · c ) for all t < T , we will have T < cT .Proof of Lemma 18. Recall notation ˜ H = ∇ f (ˆ x ) . Since δ ∈ (0 , dκe ] , we always have p ≥ . Define v t = w t − u t , by assumption, we have v = µ R e . Let us firstly define the gradient approximationerrors for these two sequences ε w t = q x ( w t , h low ) − ∇ f ( w t ) ε u t = q x ( u t , h low ) − ∇ f ( u t ) Now, consider the update equation for w t : u t +1 + v t +1 = w t +1 = w t − ηq x ( w t , h low )= w t − η ( ∇ f ( w t ) + ε w t )= u t + v t − η ∇ f ( u t + v t ) − η ε w t = u t + v t − η ∇ f ( u t ) − η (cid:20)(cid:90) ∇ f ( u t + θ v t )d θ (cid:21) v t − η ε w t = u t + v t − η ∇ f ( u t ) − η ( ˜ H + ∆ (cid:48) t ) v t − η ε w t = u t − η ∇ f ( u t ) + ( I − η ˜ H − η ∆ (cid:48) t ) v t − η ε w t = u t − η ( ∇ f ( u t ) + ε u t ) + ( I − η ˜ H − η ∆ (cid:48) t ) v t − η ( ε w t − ε u t )= u t − ηq x ( u t , h low ) + ( I − η ˜ H − η ∆ (cid:48) t ) v t − η ( ε w t − ε u t )= u t +1 + ( I − η ˜ H − η ∆ (cid:48) t ) v t − η ( ε w t − ε u t ) where ∆ (cid:48) t = (cid:90) ∇ f ( u t + θv t )d θ − ˜ H This gives the dynamic for v t satisfy: v t +1 = ( I − η ˜ H − η ∆ (cid:48) t ) v t − η ( ε w t − ε u t ) (10)Since f is Hessian Lipschitz, we have (cid:107) ∆ (cid:48) t (cid:107) = (cid:13)(cid:13)(cid:13)(cid:13)(cid:90) ∇ f ( u t + θ v t ) − ∇ f (ˆ x )d θ (cid:13)(cid:13)(cid:13)(cid:13) ≤ (cid:90) ρ (cid:107) u t + θ v t − ˆ x (cid:107) d θ ≤ ρ ( (cid:107) u t − u (cid:107) + (cid:107) v t (cid:107) + (cid:107) ˆ x − u (cid:107) ) . t < T the sequence { w t } has not decreased the function f by − . F . In other words, it holdsthat f ( w ) − f ( w t ) ≤ . F , so applying Lemma 17, we know for all t ≤ T (cid:107) w t − w (cid:107) ≤ S c ) . By condition of Lemma 18, we know (cid:107) u t − u (cid:107) ≤ S c ) for all t < T . This gives for all t < T : (cid:107) v t (cid:107) = (cid:107) w t − u t (cid:107) = (cid:107) ( w t − w ) − ( u t − u ) + ( w − u ) (cid:107)≤ (cid:107) ( w t − w ) (cid:107) + (cid:107) u t − u (cid:107) + (cid:107) w − u (cid:107)≤ S c ) + 100( S c ) + µ R ≤ S c ) + R ≤ (200 c + 1) S (11)where the last step holds because R ≤ S This gives us for t < T : (cid:107) ∆ (cid:48) t (cid:107) ≤ ρ ( (cid:107) u t − u (cid:107) + (cid:107) v t (cid:107) + (cid:107) ˆ x − u (cid:107) ) ≤ ρ (100 c S + (200 c + 1) S + R ) ≤ ρ S (300 c + 2) Let ψ t be the norm of v t projected onto e direction and the normal vector and ϕ t correspondinglybe the norm of v t projected onto remaining subspace. Let us define as λ = ηρ S (300 c + 2) . 
Equation10 gives us: ψ t +1 = (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:89) e ( I − η ˜ H ) v t − η ∆ (cid:48) t v t − η ( ε w t − ε u t ) (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) ϕ t +1 = (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) (cid:89) R d \{ e } ( I − η ˜ H ) v t − η ∆ (cid:48) t v t − η ( ε w t − ε u t ) (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) Lower bound of ψ t +1 : ψ t +1 = (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:89) e [( I − η ˜ H ) ψ t e − η ∆ (cid:48) t v t − η ( ε w t − ε u t )] (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) ≥ (cid:107) ( I − η ˜ H ) ψ t e (cid:107) − η (cid:107) (cid:89) e [∆ (cid:48) t v t ] (cid:107) − η (cid:107) (cid:89) e [ ε w t − ε u t ] (cid:107)≥ (1 + γη ) ψ t − η (cid:107) ∆ (cid:48) t v t (cid:107) − η (cid:107) ε w t − ε u t (cid:107)≥ (1 + γη ) ψ t − η (cid:107) ∆ (cid:48) t (cid:107)(cid:107) v t (cid:107) − η (cid:107) ε w t − ε u t (cid:107)≥ (1 + γη ) ψ t − λ (cid:113) ψ t + ϕ t − η (cid:107) ε w t − ε u t (cid:107) Upper bound of ϕ t +1 : ϕ t +1 = (cid:107) (cid:89) R d \{ e } [( I − η ˜ H ) v t − η ∆ (cid:48) t v t − η ( ε w t − ε u t )] (cid:107)≤ (cid:107) (cid:89) R d \{ e } [( I − η ˜ H ) v t ] (cid:107) + (cid:107) (cid:89) R d \{ e } [ η ∆ (cid:48) t v t ] (cid:107) + η (cid:107) (cid:89) R d \{ e } [ ε w t − ε u t ] (cid:107)≤ (cid:107) (cid:89) R d \{ e } [( I − η ˜ H ) v t ] (cid:107) + (cid:107) η ∆ (cid:48) t v t (cid:107) + η (cid:107) ε w t − ε u t (cid:107)≤ (1 + γη ) ϕ t + λ (cid:113) ψ t + ϕ t + η (cid:107) ε w t − ε u t (cid:107) ψ t +1 ≥ (1 + γη ) ψ t − λ (cid:113) ψ t + ϕ t − η (cid:107) ε w t − ε u t (cid:107) ϕ t +1 ≤ (1 + γη ) ϕ t + λ (cid:113) ψ t + ϕ t + η (cid:107) ε w t − ε u t (cid:107) We will now prove via induction the following fact:
Claim 1. ∀ t < T ϕ t ≤ λt · ψ t and (cid:107) ε w t (cid:107) ≤ λ η (cid:107) v t (cid:107) and (cid:107) ε u t (cid:107) ≤ λ η (cid:107) v t (cid:107) Proof.
Let us prove the base case of the induction: • By hypothesis of Lemma 18, we know ϕ = 0 so ϕ ≤ λ · ψ holds trivially • Based on the choice of h low we have that (cid:107) ε w t (cid:107) ≤ ρ S δ √ d R ≤ λ η ψ ≤ λ η (cid:107) v (cid:107)(cid:107) ε u t (cid:107) ≤ ρ S δ √ d R ≤ λ η ψ ≤ λ η (cid:107) v (cid:107) . Thus the base case of induction holds. Assume Claim 1 is true for τ ≤ t . Now we can rewrite theinequalities based on the inductive hypothesis as follows: ψ t +1 ≥ (1 + γη ) ψ t − λ (cid:113) ψ t + ϕ t ϕ t +1 ≤ (1 + γη ) ϕ t + 2 λ (cid:113) ψ t + ϕ t For t + 1 ≤ T , we have: (cid:40) λ ( t + 1) ψ t +1 ≥ λ ( t + 1) (cid:16) (1 + γη ) ψ t − λ (cid:112) ψ t + ϕ t (cid:17) ϕ t +1 ≤ λt (1 + γη ) ψ t + 2 λ (cid:112) ψ t + ϕ t (cid:41) Thus it suffices to prove that: λt (1 + γη ) ψ t + 2 λ (cid:113) ψ t + ϕ t ≤ λ ( t + 1) (cid:18) (1 + γη ) ψ t − λ (cid:113) ψ t + ϕ t (cid:19) (2 + 8 λ ( t + 1)) (cid:113) ψ t + ϕ t ≤ γη ) ψ t . By choosing √ c max ≤ c +2 min { √ , / c } , using the facts (cid:26) ηρ S T = √ η(cid:96)η ≤ c max /(cid:96) , we have λ ( t + 1) ≤ λT ≤ ηρ S (300 c + 2) cT = 8 (cid:112) η(cid:96) (300 c + 2) c ≤ / This gives: γη ) ψ t ≥ ψ t ≥ (cid:113) ψ t ≥ (2 + 8 λ ( t + 1)) (cid:113) ψ t + ϕ t which finishes the induction of the first part.Now, using again the induction hypothesis, we know ϕ t ≤ λt · ψ t ≤ ψ t , this gives: ψ t +1 ≥ (1 + γη ) ψ t − √ λψ t ≥ (1 + γη ψ t (12)where the last step follows from √ λ = √ ηρ S (300 c + 2) = √ (cid:112) η(cid:96) γρp ≤ √ c max (300 c + 2) γ ηp < γη . Equation 12 yields that ψ t is increasing sequence. Clearly (cid:107) ε w t +1 (cid:107) ≤ λ η ψ ≤ λ η ψ t +1 ≤ λ η (cid:107) v t +1 (cid:107)(cid:107) ε u t +1 (cid:107) ≤ λ η ψ ≤ λ η ψ t +1 ≤ λ η (cid:107) v t +1 (cid:107) Thus we have completed the induction. 34inally, combining Eq.(11) and (12) we have for all t < T : (200 c + 1) S ≥ (cid:107) v t (cid:107) ≥ ψ t ≥ (1 + γη t ψ = (1 + γη t µ R = (1 + γη t S κ p = (1 + γη t δ √ d S κ p This implies:
T < log( (200 c +1)2 √ d κdδ · p )log(1 + γη ) ≤ log((200 c + 1)) + log( κdδ ) + log p ( γη ) ≤ c + 1) γη +2 log( κdδ ) γη +2 pγη The last inequality is due to the following facts • p = log( κdδ ) ≥ and ∀ x ≥ x ≤ x . • ∀ x ≥ x ) ≤ x thus log(1 + γη ) ≤ γη . • T = pγη Therefore, it holds that:
T < c + 1) pγη + 4 T ≤ T (2 log(200 c + 1) + 4) By choosing constant c to be large enough to satisfy c + 1) + 4) ≤ c , for example (i.e c ≥ ), we will have T < cT , which finishes the proof.35C): Until now we have proved that firstly if approximate gradient descent from u does not decreasefunction value, then all the iterates must lie within a small ball around u (Lemma 17) and secondly startingan approximate descent from w , which is u but displaced along e direction (negative eigenvalue’seigenvector for at least a certain distance), will decreases the function value if { u t } is bounded. (Lemma 18).The following lemma combines the above two lemmas: Lemma 19.
There exists a universal constant ˆ c max , for any δ ∈ (0 , dκe ] , let f ( · ) , ˆ x satisfies thefollowing conditions (cid:107)∇ f (ˆ x ) (cid:107) ≤ G and λ min ( ∇ f (ˆ x )) ≤ − γ and e be the minimum eigenvector of ∇ f (ˆ x ) . Consider two algorithm sequences { u t } , { w t } withinitial points u , w satisfying: (cid:107) u − ˆ x (cid:107) ≤ R , w = u + µ · R · e , µ ∈ [ δ/ (2 √ d ) , Then, for any step size η ≤ ˆ c max /(cid:96) , at least one of the following is true • there exists T u ≤ c max T such that f ( u ) − f ( u T u ) ≥ . F • there exists T w ≤ c max T such that f ( w ) − f ( w T w ) ≥ . F Proof of Lemma 19.
Let ( c (1)max , c ) be the absolute constant so that Lemma 18 holds. Choose ˆ c max = min { , c (1)max , c } Let T (cid:63) = cT . Notice that by definition T (cid:63) ≤ c max T . Finally , define: T ◦ = inf t { t | f ( u ) − f ( u t ) ≥ . F } Let’s consider following two cases:
Case T ◦ ≤ T (cid:63) : Clearly for this case we have for T u = T ◦ that f ( u ) − f ( u T u ) ≥ . F Case T ◦ > T (cid:63) : In this case, by Lemma 17, we know (cid:107) u t − u (cid:107) ≤ O ( S ) for all t ≤ T (cid:63) . Define T ◦◦ = inf t { t | f ( w ) − f ( w t ) ≥ . F } By Lemma 18, we immediately have T ◦◦ ≤ T (cid:63) = cT . Clearly for this case we have for T u = T ◦◦ we have that f ( w ) − f ( w T w ) ≥ . F . Lemma 20.
Let f be a (cid:96) -gradient Lipschitz and ρ -Hessian Lipschitz function. There exists universalconstant c max , for any δ ∈ (0 , dκe ] , suppose we start with point ˆ x satisfying following conditions: (cid:107)∇ f (ˆ x ) (cid:107) ≤ G and λ min ( ∇ f (ˆ x )) ≤ − γ Let x = ˆ x + ξ where ξ come from the uniform distribution over ball with radius r = R , and let x t be the iterates of approximate gradient descent from x and T = T c max . Then, when step size η ≤ c max /(cid:96) , with at least probability − δ , we have that: ∃ t ≤ T : f (ˆ x ) − f ( x t ) ≥ F Proof of Lemma 20.
By adding perturbation, in worst case we increase function value by: f ( x ) − f (ˆ x ) ≤ ∇ f (ˆ x ) (cid:62) ξ + (cid:96) (cid:107) ξ (cid:107) ≤ (cid:96) R = 3 (cid:96) S κ p = 3 (cid:96) F pγ κ p ≤ F κp ≤ F We know x come from the uniform distribution over B ˆ x ( r ) . Let A ⊂ B ˆ x ( r ) denote the set of badstarting points A = { x ∈ B ˆ x ( r ) | ∀ t ≤ T : f ( x ) − f ( x t ) < . F } otherwise if x ∈ B ˆ x ( r ) \ A , we have that ∃ t ≤ T : f ( x ) − f ( x t ) ≥ . F By applying Lemma 18, we know for any x ∈ A , it is guaranteed that x ± µr e (cid:54)∈ A where µ ∈ [ δ √ d , where e is the eigenvector of ∇ f (ˆ x ) with the smallest negative eigenvalue.Let us denote I A ( · ) be the indicator function of being inside set A . For a vector x let us define thefollowing quantities x e = (cid:104) x , e (cid:105) x ¬ e = (cid:89) R d \{ e } x Recall B ( d ) ( r ) be d -dimensional ball with radius r . By calculus, this gives an upper bound on thevolume of A : Vol ( A ) = (cid:90) B ( d )ˆ x ( r ) d x · I A ( x )= (cid:90) B ( d − x ( r ) d x ¬ e (cid:90) ˆ x e + √ r −(cid:107) ˆ x ¬ e − x ¬ e (cid:107) ˆ x e − √ r −(cid:107) ˆ x ¬ e − x ¬ e (cid:107) d x e · I A ( x ) ≤ (cid:90) B ( d − x ( r ) d x ¬ e · (cid:18) · δ √ d r (cid:19) = Vol ( B ( d − ( r )) × δr √ d Then, we immediately have the ratio:Vol ( A ) Vol ( B ( d )ˆ x ( r )) ≤ δr √ d × Vol ( B ( d − ( r )) Vol ( B ( d )0 ( r )) = δ √ πd Γ( d + 1)Γ( d + ) ≤ δ √ πd · (cid:114) d ≤ δ The second last inequality is by the property of Gamma function that Γ( x +1)Γ( x +1 / < (cid:113) x + as long as x ≥ . Therefore, with at least probability − δ , x (cid:54)∈ A . In this case, we have that there exists a t ≤ T : f (ˆ x ) − f ( x t ) = f (ˆ x ) − f ( x ) + f ( x ) − f ( x t ) ≤ . F − . F ≥ F which finishes the proof. 37t is easy to check that our initial Lemma 16 can be derived by substituting η = c(cid:96) , γ = √ ρ(cid:15), δ = d(cid:96) √ ρ(cid:15) e − χ and simply applying the definitions of G , T , F , g thres , t thres , f thresthres
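For completeness, the following Monte Carlo sketch (our illustration only; the dimension, radius and $\delta$ are arbitrary assumptions) visualizes the volume argument used above: a point drawn uniformly from the ball $B_{\hat x}(r)$ falls inside a slab of width $\frac{\delta r}{\sqrt d}$ around any fixed hyperplane through the center with probability at most roughly $\delta$, which is why the added perturbation avoids the "stuck" region $A$ with probability at least $1-\delta$.

```python
import numpy as np

def sample_ball(rng, d, r, n):
    """Uniform samples from the d-dimensional ball of radius r centered at 0."""
    x = rng.normal(size=(n, d))
    x /= np.linalg.norm(x, axis=1, keepdims=True)      # uniform random directions
    radii = r * rng.random(n) ** (1.0 / d)             # radius with density prop. to rho^(d-1)
    return x * radii[:, None]

rng = np.random.default_rng(0)
d, r, delta, n = 50, 1.0, 0.1, 200_000
xi = sample_ball(rng, d, r, n)

# The "bad" starting points lie in a slab of half-width (delta / (2*sqrt(d))) * r
# along the minimum-eigenvalue direction e_1 (cf. the proof of Lemma 20).
half_width = delta * r / (2.0 * np.sqrt(d))
frac = np.mean(np.abs(xi[:, 0]) <= half_width)
print(frac, delta)   # the empirical fraction stays below delta (roughly 0.4 * delta for large d)
```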