Efficiently avoiding saddle points with zero order methods: No gradients required

Lampros Flokas*
Department of Computer Science, Columbia University, New York, NY 10025
[email protected]

Emmanouil V. Vlatakis-Gkaragkounis*
Department of Computer Science, Columbia University, New York, NY 10025
[email protected]

Georgios Piliouras
Engineering Systems and Design, Singapore University of Technology and Design, Singapore
[email protected]

* Equal contribution
Abstract
We consider the case of derivative-free algorithms for non-convex optimization, also known as zero order algorithms, that use only function evaluations rather than gradients. For a wide variety of gradient approximators based on finite differences, we establish asymptotic convergence to second order stationary points using a carefully tailored application of the Stable Manifold Theorem. Regarding efficiency, we introduce a noisy zero-order method that converges to second order stationary points, i.e. avoids saddle points. Our algorithm uses only Õ(1/ε²) approximate gradient calculations and, thus, it matches the convergence rate guarantees of its exact gradient counterparts up to constants. In contrast to previous work, our convergence rate analysis avoids imposing additional dimension dependent slowdowns in the number of iterations required for non-convex zero order optimization.

1 Introduction

Given a function f : R^d → R, solving the problem x* = arg min_{x ∈ R^d} f(x) is one of the building blocks that many machine learning algorithms are based on. The difficulty of this problem varies significantly depending on the properties of f and the way we can access information about it. The general case of non-convex functions makes the problem significantly more challenging, since first order stationary points can be global or local optima as well as saddle points. In fact, discovering global optima is an NP-hard problem in general, and even for quartic functions verifying local optima is a co-NP complete problem [Murty and Kabadi, 1987, Lee et al., 2019]. While local optima may be satisfactory for some applications in machine learning Choromanska et al. [2015], saddle points can make high dimensional non-convex optimization tasks significantly more difficult Dauphin et al. [2014], Sun et al. [2018]. Therefore, researchers have focused their efforts on functions possessing the strict saddle property. Under this property, Hessians of f evaluated at saddle points have at least one negative eigenvalue, making detection of saddle points tractable. Given this assumption, methods that use second order information like computing Hessians or Hessian-vector products [Nesterov and Polyak, 2006, Carmon and Duchi, 2016, Agarwal et al., 2017] can converge to second order stationary points (SOSPs) and thus avoid strict saddle points. Recent work [Ge et al., 2015, Levy, 2016, Jin et al., 2017, Lee et al., 2019, Allen-Zhu and Li, 2018, Jin et al., 2018b] has also shown that gradient descent (and its variants) can avoid strict saddle points and converge to local minima.

Unfortunately, access to gradient evaluations is not available in all settings of interest. Even with the advent of automatic differentiation software, there are several applications where computation of gradients is either computationally inefficient or even impossible. Examples of such applications are hyper-parameter tuning of machine learning models Snoek et al. [2012], Salimans et al. [2017], Choromanski et al. [2018], black-box adversarial attacks on deep neural networks Papernot et al. [2017], Madry et al. [2018], Chen et al. [2017], computer network control Liu et al. [2018a], variational approaches to graphical models Wainwright and Jordan [2008] and simulation based Rubinstein and Kroese [2016], Spall [2003] or bandit feedback optimization Agarwal et al. [2010], Chen and Giannakis [2019].
Zero order methods, also known as black-box methods, try to address these issues by employing only evaluations of the function f during the optimization procedure. The case of convex functions is well understood Nesterov and Spokoiny [2017], Duchi et al. [2015], Agarwal et al. [2010]. For the non-convex case, there has been a considerable amount of work on the convergence to first order stationary points, both for deterministic settings Nesterov and Spokoiny [2017] and stochastic ones Ghadimi and Lan [2013], Wang et al. [2018], Balasubramanian and Ghadimi [2018], Liu et al. [2018b], Gu et al. [2016].

The case of SOSPs has been so far comparatively under-studied. It has been established that SOSPs are achievable through zero order trust region methods that employ fully quadratic models Conn et al. [2009]. The disadvantage of trust region methods is that their computation cost per iteration is O(d²), which quickly becomes prohibitive as we increase the number of dimensions d. More recently, the authors of Jin et al. [2018a] studied the case of finding local minima of functions having access only to approximate function or gradient evaluations. They manage to reduce zero order optimization to the stochastic first order optimization of a Gaussian smoothed version of f. While this approach yields guarantees of convergence to SOSPs, each stochastic gradient evaluation requires O(poly(d, 1/ε)) function evaluations. This leads to significantly less efficient optimization algorithms when compared to their first order counterparts. It is therefore yet unclear whether there are scalable zero order methods that can safely avoid strict saddle points and always converge to local minima of f. To the best of our knowledge, our work is the first one to establish a positive answer to this important question.

Our results. We prove that zero order optimization methods solve general non-convex problems efficiently.
In a nutshell, we present a family of zero order optimization methods which provably converge to SOSPs. Our proof includes a new, elaborate analysis based on the Stable Manifold Theorem (see Section 4). Additionally, the number of approximate gradient evaluations matches the standard bounds for first order methods in non-convex problems (see Table 1 & Section 5).
Algorithm | Oracle | Iterations | Evaluations of f | Theorem
PAGD (this paper) | Function Evaluations + Noise | Õ(1/ε²) | Õ(d/ε²) | Theorem 4
FPSGD Jin et al. [2018a] | Approx. Gradient + Noise | Õ(d/ε²) | Õ(d²/ε²) | -
ZPSGD Jin et al. [2018a] | Function Evaluations + Noise | Õ(1/ε²) | Õ(poly(d, 1/ε)) | -
Jin et al. [2017] | Exact Gradient + Noise | Õ(1/ε²) | - | -

Table 1: Oracle model and iteration complexity to SOSPs.
Algorithms. Instead of focusing on a single finite differences algorithm, we construct a general framework of approximate gradient oracles that generalizes over many finite differences approaches in the literature. We then use these approximate gradient oracles to devise approximate gradient descent algorithms. For more details see Section 3.3 and Definition 4.

Asymptotic convergence.
We use the stable manifold theorem to prove that zero order methods can almost surely avoid saddle points. In contrast to the analysis of Lee et al. [2019] for first order methods, the zero order case is more demanding. Convergence to first order stationary points requires changing the gradient approximation accuracy over the iterations and, thus, the equivalent dynamical system is time dependent. By reducing our time dependent dynamical system to a time invariant one defined on an expanded state space, we are able to obtain provable guarantees about avoiding saddle points. To extend our guarantees of convergence to deterministic choices of the initial accuracy, we provide a carefully tailored application of the Stable Manifold Theorem that analyzes the structure of the stable manifolds of the dynamical system. Our results on saddle point avoidance extend to functions with non-isolated critical points, where convergence of the iterates to a single point is not automatic; to address this, we provide sufficient conditions for point-wise convergence of the iterates of approximate gradient descent methods for the case of analytic functions.
Convergence rates for noisy dynamics.
In order to produce fast convergence rates, as in the case of first order methods Jin et al. [2017], it is useful to consider perturbed/noisy versions of the dynamics. Once again the case of zero order methods poses distinct hurdles. Close to critical points of f, approximations of the potentially arbitrarily small gradient can be very noisy. Iterates of exact gradient descent and approximate gradient descent may diverge significantly in this case. In fact, provably escaping saddle points by guaranteeing decrease of the value of f is more challenging for the case of approximate gradient descent since it is not a descent algorithm. A key technical step is to show that the negative curvature dynamics that enable gradient descent to escape saddle points are robust to gradient approximation errors. As long as the gradient approximation error is smaller than a fixed a-priori known threshold, zero order methods can provably escape saddle points. Based on this, we are able to prove that zero order methods can converge to approximate SOSPs with the same number of approximate gradient evaluations as provided by Jin et al. [2017], up to constants.

It is worth pointing out that achieving an Õ(ε^-2) bound on the number of approximate gradient evaluations requires conceptually different techniques from other recent approaches in zero order methods. Indeed, previous work on randomized and stochastic zero order optimization [Nesterov and Spokoiny, 2017, Ghadimi and Lan, 2013] has relied on treating randomized approximate gradients of f as, in expectation, exact gradients of a carefully constructed smoothed version of f. Then, with some additional work, convergence arguments for the smoothed version of f can be transferred to f itself. Although these arguments are applicable to our case as well, as shown by the work of Jin et al. [2018a], they also lead to a slowdown both in terms of the dimension d and the required accuracy ε. The main reasons behind this slowdown are that the Lipschitz constants of the smoothed version of f depend on d and the high variance of the stochastic gradient estimators. To sidestep both issues, we analyze the effect of the gradient approximation error directly on the optimization of f.

Our work builds and improves upon previous finite difference approaches for non-convex optimization and provides SOSP guarantees previously only reserved for computationally expensive methods.
First Order Algorithms
A recent line of work has shown that gradient descent and variations of it can actually converge to SOSPs. Specifically, Lee et al. [2019] shows that gradient descent starting from a random point eventually converges to SOSPs with probability one. Jin et al. [2017, 2018b] modified standard gradient descent using perturbations to provide an algorithm that converges to SOSPs in O(poly(log d, 1/ε)) iterations. As noted in the introduction, the zero order case poses additional hurdles compared to the first order one. Our work, by addressing these hurdles, effectively extends the guarantees provided by Lee et al. [2019], Jin et al. [2017] to zero order methods.

Zero Order Algorithms
Approximating gradients using finite differences methods has been the standard approach for both convex and non-convex zero order optimization. Nesterov and Spokoiny [2017] established convergence properties even for randomized gradient oracles. Recently, Duchi et al. [2015] provided optimal guarantees for stochastic convex optimization up to logarithmic factors. For the more general case of stochastic non-convex optimization there has been extensive work covering several aspects of the problem: distributed Hajinezhad and Zavlanos [2018], asynchronous Lian et al. [2016], and high-dimensional Wang et al. [2018], Balasubramanian and Ghadimi [2018] optimization, as well as variance reduction Liu et al. [2018b], Gu et al. [2016]. It is significant to mention that the aforementioned work is focused on convergence to ε-first order stationary points.

Regarding SOSPs, Conn et al. [2009] showed that trust region methods that employ fully quadratic models can converge to SOSPs at the cost of O(d²) operations per iteration. The authors of Jin et al. [2018a] studied the convergence to SOSPs using approximate function or gradient evaluations. While both approaches are applicable to the zero order setting with exact function evaluations, as we will see in Section 3.4, this type of reduction results in algorithms that require substantially more function evaluations to reach an ε-SOSP. Our work provides provable guarantees of convergence at significantly faster rates.

We will use lower case bold letters x, y to denote vectors. ‖·‖ will be used to denote the spectral norm and the ℓ₂ vector norm. λ_min(·) will be used to denote the minimum eigenvalue of a matrix. If g is a vector valued differentiable function then Dg denotes the differential of the function g. We will use {e_1, e_2, ..., e_d} to refer to the standard orthonormal basis of R^d. Also, C^n is the set of n times continuously differentiable functions. B_x(r) refers to the ball of radius r centered at x. Finally, μ(S) is the Lebesgue measure of a measurable set S ⊆ R^d.

A function f : R^d → R is said to be L-continuous, ℓ-gradient, ρ-Hessian Lipschitz if for every x, y ∈ R^d

    |f(x) − f(y)| ≤ L‖x − y‖,    ‖∇f(x) − ∇f(y)‖ ≤ ℓ‖x − y‖,    ‖∇²f(x) − ∇²f(y)‖ ≤ ρ‖x − y‖

correspondingly. Additionally, we can define approximate first order stationary points as:

Definition 1 (ε-first order stationary point). Let f : R^d → R be a differentiable function. Then x ∈ R^d is an ε-first order stationary point of f if ‖∇f(x)‖ ≤ ε.

A first order stationary point can be either a local minimum, a local maximum or a saddle point. Following the terminology of Lee et al. [2019] and Jin et al. [2017], we will include local maxima in saddle points since they are both undesirable for our minimization task. Under this definition, strict saddle points can be identified as follows:
Definition 2 (Strict saddle point). Let f : R^d → R be a twice differentiable function. Then x ∈ R^d is a strict saddle point of f if ‖∇f(x)‖ = 0 and λ_min(∇²f(x)) < 0.

To avoid convergence to strict saddle points, we need to converge to SOSPs. In order to study the convergence rate of algorithms that converge to SOSPs, we need to define some notion of approximate SOSPs. Following the convention of Jin et al. [2017] we define the following:
Definition 3 (ε-SOSP). Let f : R^d → R be a ρ-Hessian Lipschitz function. Then x ∈ R^d is an ε-second order stationary point of f if ‖∇f(x)‖ ≤ ε and λ_min(∇²f(x)) ≥ −√(ρε).

One of the key ways that enables zero order methods to converge quickly is using approximations of the gradient based on finite differences approaches. Here we will show how forward differencing can provide these approximate gradient calculations. Without much additional effort we can get the same results for other finite differences approaches like backward and symmetric differences, as well as finite differences approaches with higher order accuracy guarantees. Let us define the gradient approximation function based on forward differences, r_f : R^d × R → R^d:

    r_f(x, h) = Σ_{l=1}^d [f(x + h e_l) − f(x)]/h · e_l   if h ≠ 0,    r_f(x, h) = ∇f(x)   if h = 0.    (1)

This function takes two arguments: a vector x where the gradient should be approximated, as well as a scalar value h that controls the approximation accuracy of the estimator. An additional property that will be of interest when we analyze approximate gradient descent is the fact that r_f is Lipschitz.
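To make the construction concrete, the following Python sketch implements the forward-difference approximator r_f of Equation 1 (the function names are ours and serve only as an illustration; each call uses d + 1 evaluations of f):

import numpy as np

def forward_difference_gradient(f, x, h):
    # Approximate the gradient of f at x with forward differences (Equation 1).
    # For h = 0 the definition falls back to the exact gradient, which is not
    # available in the zero order setting, so callers should pass h != 0.
    x = np.asarray(x, dtype=float)
    d = x.shape[0]
    fx = f(x)                          # one evaluation shared by all coordinates
    grad = np.zeros(d)
    for l in range(d):
        e_l = np.zeros(d)
        e_l[l] = 1.0
        grad[l] = (f(x + h * e_l) - fx) / h
    return grad

# Example: f(x) = ||x||^2 has gradient 2x.
f_quad = lambda x: float(np.dot(x, x))
print(forward_difference_gradient(f_quad, np.array([1.0, -2.0, 0.5]), h=1e-5))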
Based on the definition one can show:

Lemma 1. Let f be ℓ-gradient Lipschitz. Then r_f(·, h) as defined in Equation 1 is √d ℓ-Lipschitz for all h ∈ R and, for all h ∈ R and x ∈ R^d, ‖r_f(x, h) − ∇f(x)‖ ≤ ℓ√d|h|.

3.4 Black box reductions to first order methods

As shown in the works of Nesterov and Spokoiny [2017], Ghadimi and Lan [2013], zero order optimization is reducible to stochastic first order optimization. The reduction relies on treating randomized approximate gradients of f as, in expectation, exact gradients of a carefully constructed smoothed version of f. These arguments are applicable to our case as well. FPSG, one of the approaches of Jin et al. [2018a], naively leads to a large poly(d) dependence in the convergence rate. More specifically, one can show that Jin et al. [2018a]'s FPSG method needs Õ(d²/ε²) evaluations of ∇g to converge to an ε-SOSP. The main reason behind this dimension dependent slowdown is that the Hessian Lipschitz constant of the smoothed version of g is O(ρ√d). An alternative approach in Jin et al. [2018a] named ZPSG builds gradient estimators using function evaluations directly. The main source of slowdown here is the high variance of the stochastic gradients. An analysis of those methods for the case where exact function evaluations are available can be found in the Appendix. In the next sections we will provide an alternative analysis that accounts for the gradient approximation errors on the optimization of f directly. Thus, we will be able to sidestep the above issues and provide faster convergence rates and better sample complexity.

It is easy to see that conceptually any iterative optimization method can be expressed as a dynamical system of the form x_{k+1} = g(x_k), where x_k is the current solution iterate that gets updated through an update function g. Additionally, for first order methods strict saddle points correspond to the unstable fixed points of the dynamical system. These key observations have motivated Lee et al. [2019] to use the Stable Manifold Theorem (SMT) Shub [1987] in order to prove that gradient descent avoids strict saddle points. Intuitively, SMT formalizes why convergence to unstable fixed points is unlikely starting from a local region around an unstable fixed point. Adding the requirement that g is a global diffeomorphism, Lee et al. [2019] generalizes the conclusions of SMT to the whole space.

In order to prove similar guarantees for a zero order algorithm using approximate gradient evaluations, we will need to construct a new dynamical system that is applicable to our zero order setting. The state of our dynamical system χ_k consists of two parts: the current solution iterate x_k, which is a vector in R^d, and a scalar value h_k ∈ R that controls the quality of the gradient approximation. Specifically we have

    χ_{k+1} = g(χ_k) ≜ ( x_{k+1}, h_{k+1} ) = ( x_k − η q_x(x_k, h_k),  β q_h(h_k) )    (2)

where η, β ∈ R_+ are positive scalar parameters and q_x : R^d × R → R^d and q_h : R → R are functions. The function q_x can be seen as the gradient approximation oracle used by the dynamical system, as described in Section 3.3. The function q_h is responsible for controlling the accuracy of the gradient approximation. As we shall see later, it is important that h_k converges to 0 so that the fixed points of g are the same as in gradient descent.
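For concreteness, the following minimal Python sketch iterates the extended dynamical system of Equation 2, instantiated with the forward-difference oracle from the earlier sketch as q_x and q_h(h) = h (the step sizes below are arbitrary illustrative choices, not the values used in our analysis):

import numpy as np

def approximate_gradient_descent(f, grad_oracle, x0, h0, eta, beta, num_iters):
    # Iterate chi_{k+1} = g(chi_k) from Equation 2.
    x, h = np.asarray(x0, dtype=float), float(h0)
    for _ in range(num_iters):
        x = x - eta * grad_oracle(f, x, h)   # x_{k+1} = x_k - eta * q_x(x_k, h_k)
        h = beta * h                         # h_{k+1} = beta * q_h(h_k) with q_h(h) = h
    return x, h

# With beta < 1 the accuracy parameter h_k converges to 0, so the fixed points of
# the extended system coincide with first order stationary points of f.
f_quad = lambda x: float(np.dot(x, x))
x_star, h_star = approximate_gradient_descent(
    f_quad, forward_difference_gradient, x0=[1.0, -2.0, 0.5],
    h0=1e-2, eta=0.1, beta=0.5, num_iters=200)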
In this section we will provide sufficient conditions that the parameters η, β must satisfy so that the update rule of Equation 2 avoids convergence to strict saddle points. To do this we will need to introduce some properties of g.

Definition 4 ((L, B, c)-Well-behaved function). Let f : R^d → R ∈ C² be an ℓ-gradient Lipschitz function. A function g of the form of Equation 2 is an (L, B, c)-well-behaved function (for the function f) if it has the following properties: i) q_x, q_h ∈ C¹ with q_h(0) = 0. ii) ∀h ∈ R: q_x(·, h) is L-Lipschitz and 0 < ∂q_h(h)/∂h ≤ B. iii) ∀(x, h) ∈ R^{d+1}: ‖q_x(x, h) − ∇f(x)‖ ≤ c|h|.

Given this definition and Lemma 1, it is clear that we can always construct (L, B, c)-well-behaved functions with L = √d ℓ, B = 1, c = √d ℓ using q_x = r_f and q_h(h) = h.

In the following lemmas and theorems we will require that βB < 1. Under this assumption βq_h is a contraction having 0 as its only fixed point, so for all fixed points of g we know that h = 0.
Notice also that when h = 0 we have q_x(x, 0) = ∇f(x), and therefore the x coordinates of fixed points of g must coincide with first order stationary points of f. In fact, in the Appendix we prove that there is a one-to-one mapping between strict saddles of f and unstable fixed points of g. Using the same assumptions, we also get that det(Dg(·)) ≠ 0. Putting it all together, we are able to prove our first main result.
Theorem 1. Let g be an (L, B, c)-well-behaved function for the function f. Let X*_f be the set of strict saddle points of f. Then if η < 1/L and β < 1/B:

    ∀h_0 ∈ R : μ({x_0 : lim_{k→∞} x_k ∈ X*_f}) = 0.

Notice that the random initialization refers only to the x's domain. Indeed, a straightforward application of the result of Lee et al. [2019] would guarantee a saddle-avoidance lemma only under an extra random choice of h_0. Such a result would not be able to clarify whether saddle-avoidance stems from the instability of the fixed point, just like in first order methods, or from the additional randomness of h_0. The key insight provided by the SMT is that all the initialization points that eventually converge to an unstable fixed point lie in a low dimensional manifold. Thus, to obtain a stronger result we have to understand how SMT restricts the dimensionality of this stable manifold for a fixed h_0. The structure of the eigenvectors of the Jacobian of g around a fixed point reveals that such an interesting decoupling is finally achievable.

In the previous section we provided sufficient conditions to avoid convergence to strict saddle points. These results are meaningful, however, only if lim_{k→∞} x_k exists. Therefore, in this section we will provide sufficient conditions such that the dynamical system of g converges. Given that strict saddle points are avoided, it is sufficient to prove convergence to first order stationary points. Let the error of the gradient approximation be ε_k = q_x(x_k, h_k) − ∇f(x_k).
Firstly, we establish the zero order analogue of the folklore lower bound for the decrease of the function:

Lemma 2 (Step-Convergence). Suppose that g is an (L, B, c)-well-behaved function for an ℓ-gradient Lipschitz function f. If η ≤ 1/ℓ then we have that

    f(x_{k+1}) ≤ f(x_k) − (η/2) (‖∇f(x_k)‖² − ‖ε_k‖²).
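For intuition, a minimal derivation of this bound (under the stated assumptions, writing the update as x_{k+1} = x_k − η(∇f(x_k) + ε_k) and using ℓ-gradient Lipschitzness together with ℓη² ≤ η) is:

\begin{aligned}
f(x_{k+1}) &\le f(x_k) - \eta\,\nabla f(x_k)^\top\big(\nabla f(x_k) + \varepsilon_k\big) + \tfrac{\ell\eta^2}{2}\,\big\|\nabla f(x_k) + \varepsilon_k\big\|^2 \\
 &\le f(x_k) - \eta\,\|\nabla f(x_k)\|^2 - \eta\,\nabla f(x_k)^\top\varepsilon_k + \tfrac{\eta}{2}\Big(\|\nabla f(x_k)\|^2 + 2\,\nabla f(x_k)^\top\varepsilon_k + \|\varepsilon_k\|^2\Big) \\
 &= f(x_k) - \tfrac{\eta}{2}\Big(\|\nabla f(x_k)\|^2 - \|\varepsilon_k\|^2\Big).
\end{aligned}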
Given this lemma we can prove convergence to first order stationary points.

Theorem 2 (Convergence to first order stationary points). Suppose that g is an (L, B, c)-well-behaved function for an ℓ-gradient Lipschitz function f. Let η ≤ 1/ℓ and β < 1/B. Then if f is lower bounded, lim_{k→∞} ‖∇f(x_k)‖ = 0.

The last theorem gives us a guarantee that the norm of the gradient is converging to zero, but this is not enough to prove convergence to a single stationary point if f has non-isolated critical points. In the Appendix, we prove that if the gradient approximation error decreases quickly enough, then convergence to a single stationary point is guaranteed for analytic functions. This allows us to conclude our analysis with this final theorem.

Theorem 3 (Convergence to minimizers). Let f : R^d → R ∈ C² be an ℓ-gradient Lipschitz function. Let us also assume that f is analytic, has compact sub-level sets and all of its saddle points are strict. Let g be an (L, B, c)-well-behaved function for f with η < min{1/L, 1/ℓ} and β < (1 − ηℓ)/B. If we pick a random initialization point x_0, then we have that for the x_k iterates of g

    ∀h_0 ∈ R : Pr(lim_{k→∞} x_k = x*) = 1

where x* is a local minimizer of f.

In the previous subsections we provided sufficient conditions for approximate gradient descent to avoid strict saddle points. However, the stable manifold theorem only guarantees that this will happen asymptotically. In fact, convergence could be quite slow until we reach a neighborhood of a local minimum. An analysis done for the first order case by Du et al. [2017] showed that avoiding saddle points could take exponential time in the worst case. In this section, we will use ideas from the work of Jin et al. [2017] in order to get a zero order algorithm that converges to SOSPs efficiently.

Convergence to SOSPs poses unique challenges to zero order methods when it comes to controlling the gradient approximation accuracy. For convergence to first order stationary points one can use property iii) of Definition 4 and Lemma 2 to show that h = ε/c guarantees the decrease of f until ‖∇f(x_k)‖ ≤ ε. For SOSPs, this is not applicable, as the norm of the gradient can become arbitrarily small near saddle points. One could resort to iteratively trying smaller h to find one that guarantees the decrease of f. A surprising fact about our algorithm is that even if the gradient is arbitrarily small, computationally burdensome searches for h can be totally avoided.
Algorithm 1 PAGD(x_0)
Initialization: (ℓ, ρ, ε, c, δ, Δ_f)
    χ ← 3 max{log(dℓΔ_f/(cε²δ)), 4},  η ← c/ℓ,  r ← (√c/χ²)·(ε/ℓ),  g_thres ← (√c/χ²)·ε,
    f_thres ← (c/χ³)·√(ε³/ρ),  t_thres ← (χ/c²)·(ℓ/√(ρε)),  S ← (√c/χ²)·√(ε/ρ),
    h_low ← (1/c_h)·min{g_thres/4, rρδS/(2√d)}
1: for t = 0, 1, . . . do
2:    z_t ← q(x_t, g_thres/(4 c_h))
3:    if ‖z_t‖ ≥ 3 g_thres/4 then
4:        x_{t+1} ← x_t − η z_t
5:    else
6:        x_{t+1} ← EscapeSaddle(x_t)
7:        if x_{t+1} = x_t then return x_t
8:    end if
9: end for

Algorithm 2 EscapeSaddle(x̂)
1: ξ ∼ Unif(B_0(r))
2: x̃_0 ← x̂ + ξ
3: for i = 0, 1, . . . , t_thres do
4:    if f(x̂) − f(x̃_i) ≥ f_thres then return x̃_i
5:    x̃_{i+1} ← x̃_i − η q(x̃_i, h_low)
6: end for
7: return x̂

Just like Jin et al. [2017], we will assume that f is ℓ-gradient Lipschitz and also ρ-Hessian Lipschitz. To construct a zero order algorithm we will also need a gradient approximator q : R^d × R → R^d. We will only require the error bound property on q, i.e., there exists a constant c_h such that

    ∀x ∈ R^d, h ∈ R : ‖q(x, h) − ∇f(x)‖ ≤ c_h|h|.

The high level idea of Algorithm 1 is that, given a point x_t that is not an ε-SOSP, the algorithm makes progress by finding an x_{t+1} where f(x_{t+1}) is substantially smaller than f(x_t). By the definition of ε-SOSPs, either the gradient of f at x_t is large or the Hessian has a substantially negative eigenvalue. Separating these two cases is not as straightforward as in the first order case. Given the norm of the approximate gradient q(x, h), we only know that ‖∇f(x)‖ ∈ ‖q(x, h)‖ ± c_h|h|. In Algorithm 1, by choosing 3g_thres/4 as the threshold to test for and h = g_thres/(4c_h), we guarantee that in step 4 ‖∇f(x_t)‖ ≥ g_thres/2. This threshold is actually high enough to guarantee a substantial decrease of f. Indeed, given that we have a lower bound on the exact gradient and using Lemma 2 we get

    f(x_t) − f(x_{t+1}) ≥ (η/2)(‖∇f(x_t)‖² − ‖ε_t‖²) ≥ (η/2)(g_thres²/4 − g_thres²/16) = 3η g_thres²/32,

where ε_t is the gradient approximation error at x_t. This decrease is the same as in the first order case up to constants.

On the other hand, in Algorithm 2 we are guaranteed that ‖∇f(x̂)‖ ≤ g_thres. In this case our approximate gradient cannot guarantee a substantial decrease of f. However, we know that the Hessian has a substantially negative eigenvalue and therefore a direction of steep decrease of f must exist. The problem is that we do not know which direction has this property. In Jin et al. [2017] it is proved that identifying this direction is not necessary for the first order case. Adding a small random perturbation to our current iterate (step 2) is enough so that, with high probability, we can get a substantial decrease of f after at most t_thres gradient descent steps (step 5). Of course this work is not directly applicable to our case since we do not have access to exact gradients.

The work of Jin et al. [2017] mainly depends on two arguments to provide its guarantees. The first argument is that if the x̃_i iterates do not achieve a decrease of f_thres in t_thres steps, then they must remain confined in a small ball around x̃_0. Specifically, for the exact gradient case we have that ‖x̃_i − x̃_0‖ ≤ √(2η f_thres t_thres). This argument relies on gradient descent being a descent method, which is no longer true in our setting; therefore, iterates may wander away from x̃_0 without even decreasing the function value of f. To amend this argument for the zero order case we require that h_low ≤ g_thres/(4 c_h). This guarantees that even if gradient approximation errors amass over the iterations, we will get the same bound as in the first order case up to constants.
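The following Python sketch mirrors the structure of Algorithms 1 and 2 (the helper names, the generic oracle q and the simplified termination test are ours; the constants follow the initialization above):

import numpy as np

def escape_saddle(f, q, x_hat, eta, f_thres, t_thres, h_low, r, rng):
    # Random perturbation followed by approximate gradient steps (Algorithm 2).
    d = x_hat.shape[0]
    xi = rng.normal(size=d)
    xi *= r * rng.random() ** (1.0 / d) / np.linalg.norm(xi)   # uniform point in B_0(r)
    x = x_hat + xi
    f_hat = f(x_hat)
    for _ in range(int(t_thres)):
        if f_hat - f(x) >= f_thres:
            return x
        x = x - eta * q(f, x, h_low)
    return x_hat

def pagd(f, q, x0, eta, g_thres, f_thres, t_thres, h_low, c_h, r, max_iters=10000, seed=0):
    # Perturbed approximate gradient descent (Algorithm 1).
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iters):
        z = q(f, x, g_thres / (4.0 * c_h))        # approximate gradient with fixed accuracy
        if np.linalg.norm(z) >= 0.75 * g_thres:   # large approximate gradient: take a step
            x = x - eta * z
        else:                                     # small gradient: try to escape a saddle
            x_new = escape_saddle(f, q, x, eta, f_thres, t_thres, h_low, r, rng)
            if np.allclose(x_new, x):
                return x                          # no progress: report x as an approximate SOSP
            x = x_new
    return x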
The second argument of Jin et al. [2017] formalizes why the existence of a negative eigenvalue of the Hessian is important. Let us run gradient descent starting from two points u_0 and w_0 such that w_0 − u_0 = κ e, where e is the eigenvector corresponding to the most negative eigenvalue of the Hessian and κ ≥ rδ/(2√d). Then at least one of the sequences {w_i}, {u_i} is able to escape away from its starting point in t_thres iterations and, by the first argument, it is also able to decrease the value of f substantially. The proof of the claim is based on creating a recurrence relationship on v_i = w_i − u_i. The corresponding recurrence relationship for the zero order case is more complicated, with additional terms that correspond to the gradient approximation errors for w_i and u_i. However, we are able to prove that if h_low ≤ rρδS/(2√d) then these additional terms cannot distort the exponential growth of v_i. Having extended both arguments of Jin et al. [2017], we can establish the same guarantees for escaping saddle points.

Theorem 4 (Analysis of PAGD). There exists an absolute constant c_max such that: if f is ℓ-gradient Lipschitz and ρ-Hessian Lipschitz, then for any δ > 0, ε ≤ ℓ²/ρ, Δ_f ≥ f(x_0) − f*, and constant c ≤ c_max, with probability 1 − δ, the output of PAGD(x_0, ℓ, ρ, ε, c, δ, Δ_f) will be an ε-SOSP, and PAGD will have the following number of iterations until termination:

    O( (ℓ(f(x_0) − f*)/ε²) · log⁴( dℓΔ_f/(ε²δ) ) ).

In this section we use simulations to verify our theoretical findings. Specifically, we are interested in verifying whether zero order methods can avoid saddle points as efficiently as first order methods. To do this we use the two dimensional Rastrigin function, a popular benchmark in the non-convex optimization literature. This function exhibits several strict saddle points, so it will be an adequate benchmark for our case. The two dimensional Rastrigin function can be defined as

    Ras(x_1, x_2) = 20 + x_1² − 10 cos(2πx_1) + x_2² − 10 cos(2πx_2).

For this experiment we selected 75 points randomly from a square box centered at the origin; in this domain the Rastrigin function is ℓ-gradient Lipschitz. Using these points as initialization we run gradient descent and the approximate gradient descent dynamical system we introduced in Section 4.2. For both gradient descent and approximate gradient descent we used η = 1/(4ℓ). Then for approximate gradient descent we used symmetric differences to approximate the gradients, together with fixed values for β and h_0.
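A minimal Python sketch of this experimental setup is given below; the initialization box, β and h_0 are illustrative choices on our part, and the symmetric-difference oracle plays the role of q_x:

import numpy as np

def rastrigin(x):
    return 20.0 + np.sum(x ** 2 - 10.0 * np.cos(2.0 * np.pi * x))

def rastrigin_gradient(x):
    return 2.0 * x + 20.0 * np.pi * np.sin(2.0 * np.pi * x)

def symmetric_difference_gradient(f, x, h):
    d = x.shape[0]
    grad = np.zeros(d)
    for l in range(d):
        e_l = np.zeros(d)
        e_l[l] = 1.0
        grad[l] = (f(x + h * e_l) - f(x - h * e_l)) / (2.0 * h)
    return grad

ell = 2.0 + 40.0 * np.pi ** 2            # an upper bound on the gradient Lipschitz constant of Ras
eta, beta, h0 = 1.0 / (4.0 * ell), 0.9, 1e-2
rng = np.random.default_rng(0)
for x0 in rng.uniform(-2.5, 2.5, size=(75, 2)):
    x_gd, x_agd, h = x0.copy(), x0.copy(), h0
    for _ in range(500):
        x_gd = x_gd - eta * rastrigin_gradient(x_gd)                              # gradient descent
        x_agd = x_agd - eta * symmetric_difference_gradient(rastrigin, x_agd, h)  # approximate GD
        h = beta * h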
Figure 1 shows the contour plot of the Rastrigin function as well as the evolution of the iterates of both methods (the panels show the initial points and the iterates after 2, 4 and 6 iterations).
Figure 1: Contour plots of the Rastrigin function along with the evolution of the iterates of gradient descent and approximate gradient descent. Green points correspond to gradient descent whereas cyan points correspond to approximate gradient descent.

As expected, for points initialized close to local minima of the function convergence is quite fast. On the other hand, points starting close to saddle points of the Rastrigin function take some more time to converge to minima. However, it is clear that in both cases the behaviour of gradient descent and approximate gradient descent is similar, in the sense that for the same initialization there is no discrepancy in terms of convergence speed between the two methods.

We also want to experimentally verify the performance of PAGD. To do this we use the octopus function proposed by Du et al. [2017]. This function is particularly relevant to our setting as it possesses a sequence of saddle points. The authors of Du et al. [2017] proved that for this function gradient descent needs exponential time to avoid saddle points before converging to a local minimum. In contrast, the perturbed version of gradient descent (PGD) of Jin et al. [2017] does not suffer from the same limitation. Based on the results of Theorem 4, we expect PAGD not to have this limitation either. We compare gradient descent (GD), PGD, AGD and PAGD on an octopus function of d = 15 dimensions. Figure 2 clearly shows that the zero order versions have the same iteration performance as the first-order ones. In fact, AGD is shown to behave even better than GD in this example thanks to the noise induced by the gradient approximation.
Figure 2: Octopus function value as a function of the number of iterations. Parameters of the octopus function: τ = e, L = e, γ = 1. Parameters of the first order methods are taken from Du et al. [2017]. Zero order methods use symmetric differencing with a fixed value of h.

This paper is the first one to establish that zero order methods can avoid saddle points efficiently. To achieve this we went beyond the smoothing arguments used in prior work and studied the effect of the gradient approximation error on first order methods that converge to second order stationary points. One important open question for future work is whether similar guarantees can be established for other zero order methods used in practice, like direct search methods and trust region methods using linear models. Another generalization of interest would be to consider the performance of zero order methods for instances of (non-convex) constrained optimization.
Acknowledgements
Georgios Piliouras acknowledges MOE AcRF Tier 2 Grant 2016-T2-1-170, grant PIE-SGP-AI-2018-01 and NRF 2018 Fellowship NRF-NRFF2018-07. Emmanouil-Vasileios Vlatakis-Gkaragkounis was supported by NSF CCF-1563155, NSF CCF-1814873, NSF CCF-1703925, NSF CCF-1763970. We are grateful to Alexandros Potamianos for bringing this problem to our attention, and for helpful discussions at an early stage of this project about its connection to Natural Language Processing tasks. Finally, this work was supported by the Onassis Foundation - Scholarship ID: F ZN 010-1/2017-2018.

References
Pierre-Antoine Absil, Robert E. Mahony, and B. Andrews. Convergence of the iterates of descentmethods for analytic cost functions.
SIAM Journal on Optimization , 16(2):531–547, 2005. doi:10.1137/040605266.Alekh Agarwal, Ofer Dekel, and Lin Xiao. Optimal algorithms for online convex optimization withmulti-point bandit feedback. In
COLT 2010 - The 23rd Conference on Learning Theory, Haifa,Israel, June 27-29, 2010 , pages 28–40, 2010.Naman Agarwal, Zeyuan Allen Zhu, Brian Bullins, Elad Hazan, and Tengyu Ma. Finding approximatelocal minima faster than gradient descent. In
Proceedings of the 49th Annual ACM SIGACTSymposium on Theory of Computing, STOC 2017, Montreal, QC, Canada, June 19-23, 2017 , pages1195–1199, 2017. doi: 10.1145/3055399.3055464.Zeyuan Allen-Zhu and Yuanzhi Li. NEON2: finding local minima via first-order oracles. In
Advances in Neural Information Processing Systems 31: Annual Conference on Neural InformationProcessing Systems 2018, NeurIPS 2018, 3-8 December 2018, Montréal, Canada. , pages 3720–3730, 2018.Krishnakumar Balasubramanian and Saeed Ghadimi. Zeroth-order (non)-convexstochastic optimization via conditional gradient and gradient updates. In
Advancesin Neural Information Processing Systems 31: Annual Conference on Neural In-formation Processing Systems 2018, NeurIPS 2018, 3-8 December 2018, Mon-tréal, Canada. , pages 3459–3468, 2018. URL http://papers.nips.cc/paper/7605-zeroth-order-non-convex-stochastic-optimization-via-conditional-gradient-and-gradient-updates .Yair Carmon and John C. Duchi. Gradient descent efficiently finds the cubic-regularized non-convexnewton step.
CoRR , abs/1612.00547, 2016.Pin-Yu Chen, Huan Zhang, Yash Sharma, Jinfeng Yi, and Cho-Jui Hsieh. ZOO: zeroth orderoptimization based black-box attacks to deep neural networks without training substitute models.In
Proceedings of the 10th ACM Workshop on Artificial Intelligence and Security, AISec@CCS2017, Dallas, TX, USA, November 3, 2017 , pages 15–26, 2017. doi: 10.1145/3128572.3140448.URL https://doi.org/10.1145/3128572.3140448 .Tianyi Chen and Georgios B. Giannakis. Bandit convex optimization for scalable and dynamic iotmanagement.
IEEE Internet of Things Journal , 6(1):1276–1286, 2019. doi: 10.1109/JIOT.2018.2839563. URL https://doi.org/10.1109/JIOT.2018.2839563 .Anna Choromanska, Mikael Henaff, Michaël Mathieu, Gérard Ben Arous, and Yann LeCun. Theloss surfaces of multilayer networks. In
Proceedings of the Eighteenth International Conferenceon Artificial Intelligence and Statistics, AISTATS 2015, San Diego, California, USA, May 9-12,2015 , 2015. URL http://jmlr.org/proceedings/papers/v38/choromanska15.html .Krzysztof Choromanski, Mark Rowland, Vikas Sindhwani, Richard E. Turner, and Adrian Weller.Structured evolution with compact architectures for scalable policy optimization. In
Proceedingsof the 35th International Conference on Machine Learning, ICML 2018, Stockholmsmässan,Stockholm, Sweden, July 10-15, 2018 , pages 969–977, 2018. URL http://proceedings.mlr.press/v80/choromanski18a.html .Andrew R. Conn, Katya Scheinberg, and Luís N. Vicente. Global convergence of general derivative-free trust-region algorithms to first- and second-order critical points.
SIAM Journal on Optimization ,20(1):387–415, 2009. doi: 10.1137/060673424.Yann N. Dauphin, Razvan Pascanu, Çaglar Gülçehre, KyungHyun Cho, Surya Ganguli, andYoshua Bengio. Identifying and attacking the saddle point problem in high-dimensionalnon-convex optimization. In
Advances in Neural Information Processing Systems 27: AnnualConference on Neural Information Processing Systems 2014, December 8-13 2014, Mon-treal, Quebec, Canada , pages 2933–2941, 2014. URL http://papers.nips.cc/paper/5486-identifying-and-attacking-the-saddle-point-problem-in-high-dimensional-non-convex-optimization .10imon S. Du, Chi Jin, Jason D. Lee, Michael I. Jordan, Aarti Singh, and Barnabás Póczos. Gradientdescent can take exponential time to escape saddle points. In
Advances in Neural InformationProcessing Systems 30: Annual Conference on Neural Information Processing Systems 2017, 4-9December 2017, Long Beach, CA, USA , pages 1067–1077, 2017.John C. Duchi, Michael I. Jordan, Martin J. Wainwright, and Andre Wibisono. Optimal rates forzero-order convex optimization: The power of two function evaluations.
IEEE Trans. InformationTheory , 61(5):2788–2806, 2015. doi: 10.1109/TIT.2015.2409256.Rong Ge, Furong Huang, Chi Jin, and Yang Yuan. Escaping from saddle points - online stochasticgradient for tensor decomposition. In
Proceedings of The 28th Conference on Learning Theory,COLT 2015, Paris, France, July 3-6, 2015 , pages 797–842, 2015.Saeed Ghadimi and Guanghui Lan. Stochastic first- and zeroth-order methods for nonconvexstochastic programming.
SIAM Journal on Optimization, 23(4):2341–2368, 2013. doi: 10.1137/120880811. Bin Gu, Zhouyuan Huo, and Heng Huang. Zeroth-order asynchronous doubly stochastic algorithm with variance reduction. arXiv preprint arXiv:1612.01425, 2016. Davood Hajinezhad and Michael M. Zavlanos. Gradient-free multi-agent nonconvex nonsmooth optimization. In 2018 IEEE Conference on Decision and Control (CDC), pages 4939–4944, 2018. doi: 10.1109/CDC.2018.8619333. URL https://doi.org/10.1109/CDC.2018.8619333. Chi Jin, Rong Ge, Praneeth Netrapalli, Sham M. Kakade, and Michael I. Jordan. How to escape saddle points efficiently. In
Proceedings of the 34th International Conference on Machine Learning,ICML 2017, Sydney, NSW, Australia, 6-11 August 2017 , pages 1724–1732, 2017.Chi Jin, Lydia T. Liu, Rong Ge, and Michael I. Jordan. On the local minima of the em-pirical risk. In
Advances in Neural Information Processing Systems 31: Annual Confer-ence on Neural Information Processing Systems 2018, NeurIPS 2018, 3-8 December 2018,Montréal, Canada. , pages 4901–4910, 2018a. URL http://papers.nips.cc/paper/7738-on-the-local-minima-of-the-empirical-risk .Chi Jin, Praneeth Netrapalli, and Michael I. Jordan. Accelerated gradient descent escapes saddlepoints faster than gradient descent. In Sébastien Bubeck, Vianney Perchet, and Philippe Rigollet,editors,
Proceedings of the 31st Conference On Learning Theory , volume 75 of
Proceedings ofMachine Learning Research , pages 1042–1085. PMLR, 06–09 Jul 2018b.Jason D. Lee, Ioannis Panageas, Georgios Piliouras, Max Simchowitz, Michael I. Jordan, andBenjamin Recht. First-order methods almost always avoid strict saddle points.
Math. Program. ,176(1-2):311–337, 2019. doi: 10.1007/s10107-019-01374-3. URL https://doi.org/10.1007/s10107-019-01374-3 .Kfir Y. Levy. The power of normalization: Faster evasion of saddle points.
CoRR , abs/1611.04831,2016.Xiangru Lian, Huan Zhang, Cho-Jui Hsieh, Yijun Huang, and Ji Liu. A comprehensivelinear speedup analysis for asynchronous stochastic parallel optimization from zeroth-order to first-order. In
Advances in Neural Information Processing Systems 29: AnnualConference on Neural Information Processing Systems 2016, December 5-10, 2016,Barcelona, Spain , pages 3054–3062, 2016. URL http://papers.nips.cc/paper/6551-a-comprehensive-linear-speedup-analysis-for-asynchronous-stochastic-parallel-optimization-from-zeroth-order-to-first-order .Sijia Liu, Jie Chen, Pin-Yu Chen, and Alfred Hero. Zeroth-order online alternating direction methodof multipliers: Convergence analysis and applications. In
International Conference on ArtificialIntelligence and Statistics, AISTATS 2018, 9-11 April 2018, Playa Blanca, Lanzarote, CanaryIslands, Spain , pages 288–297, 2018a. URL http://proceedings.mlr.press/v84/liu18a.html . 11ijia Liu, Bhavya Kailkhura, Pin-Yu Chen, Pai-Shun Ting, Shiyu Chang, and LisaAmini. Zeroth-order stochastic variance reduction for nonconvex optimization. In
Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems 2018, NeurIPS 2018, 3-8 December 2018, Montréal, Canada, pages 3731–3741, 2018b. URL http://papers.nips.cc/paper/7630-zeroth-order-stochastic-variance-reduction-for-nonconvex-optimization. Loring W. Tu. An Introduction to Manifolds, 2008. Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu. Towards deep learning models resistant to adversarial attacks. In International Conference on Learning Representations, 2018. URL https://openreview.net/forum?id=rJzIBfZAb. Katta G. Murty and Santosh N. Kabadi. Some NP-complete problems in quadratic and nonlinear programming.
Math. Program. , 39(2):117–129, 1987. doi: 10.1007/BF02592948.Yurii Nesterov and Boris T. Polyak. Cubic regularization of newton method and its global performance.
Math. Program. , 108(1):177–205, 2006. doi: 10.1007/s10107-006-0706-8.Yurii Nesterov and Vladimir G. Spokoiny. Random gradient-free minimization of convex func-tions.
Foundations of Computational Mathematics , 17(2):527–566, 2017. doi: 10.1007/s10208-015-9296-2.Nicolas Papernot, Patrick D. McDaniel, Ian J. Goodfellow, Somesh Jha, Z. Berkay Celik, andAnanthram Swami. Practical black-box attacks against machine learning. In
Proceedings of the2017 ACM on Asia Conference on Computer and Communications Security, AsiaCCS 2017, AbuDhabi, United Arab Emirates, April 2-6, 2017 , pages 506–519, 2017. doi: 10.1145/3052973.3053009. URL https://doi.org/10.1145/3052973.3053009 .Reuven Y Rubinstein and Dirk P Kroese.
Simulation and the Monte Carlo method , volume 10. JohnWiley & Sons, 2016.Tim Salimans, Jonathan Ho, Xi Chen, and Ilya Sutskever. Evolution strategies as a scalable alternativeto reinforcement learning.
CoRR , abs/1703.03864, 2017. URL http://arxiv.org/abs/1703.03864 .Michael Shub.
Global stability of dynamical systems . Springer Science & Business Media, 1987.Jasper Snoek, Hugo Larochelle, and Ryan P. Adams. Practical bayesian optimiza-tion of machine learning algorithms. In
Advances in Neural Information Process-ing Systems 25: 26th Annual Conference on Neural Information Processing Systems2012. Proceedings of a meeting held December 3-6, 2012, Lake Tahoe, Nevada,United States. , pages 2960–2968, 2012. URL http://papers.nips.cc/paper/4522-practical-bayesian-optimization-of-machine-learning-algorithms .James C. Spall.
Introduction to Stochastic Search and Optimization . John Wiley & Sons, Inc., NewYork, NY, USA, 1 edition, 2003. ISBN 0471330523.Ju Sun, Qing Qu, and John Wright. A geometric analysis of phase retrieval.
Foundations ofComputational Mathematics , 18(5):1131–1198, 2018. doi: 10.1007/s10208-017-9365-9.Martin J. Wainwright and Michael I. Jordan. Graphical models, exponential families, and variationalinference.
Foundations and Trends in Machine Learning , 1(1-2):1–305, 2008. doi: 10.1561/2200000001.Yining Wang, Simon S. Du, Sivaraman Balakrishnan, and Aarti Singh. Stochastic zeroth-orderoptimization in high dimensions. In
International Conference on Artificial Intelligence and Statistics, AISTATS 2018, 9-11 April 2018, Playa Blanca, Lanzarote, Canary Islands, Spain, pages 1356–1365, 2018. URL http://proceedings.mlr.press/v84/wang18e.html.
Supplementary Materials

A Preliminaries: Detailed proofs
In this first subsection, we show that the forward finite differences method can be used toconstruct an approximate gradient oracle. Similar oracles can be constructed using backward,symmetric finite differences or Richardson extrapolation which have even higher gradientapproximation accuracy. Additionally, we compute the Lipschitz constant of our method andwe show that our definition of "well-behaved" approximate gradient is well defined. In otherwords, there are simple approximation oracles which follow the smoothness requirementsthat our work assumes.
A.1 Gradient Approximation using Zero Order Information

Lemma 4 (Lemma 1 restated). Let f be ℓ-gradient Lipschitz. Then r_f(·, h) as defined in Equation 1 is √d ℓ-Lipschitz for all h ∈ R and it holds that ‖r_f(x, h) − ∇f(x)‖ ≤ ℓ√d|h|.
Proof. For the first part of the lemma we split our proof into two cases:

• For any h ≠ 0 and any x, x' ∈ R^d we have

    ‖r_f(x, h) − r_f(x', h)‖ = ‖ Σ_{l=1}^d [f(x + h e_l) − f(x)]/h · e_l − Σ_{l=1}^d [f(x' + h e_l) − f(x')]/h · e_l ‖
                             = √( Σ_{l=1}^d | [f(x + h e_l) − f(x' + h e_l) − (f(x) − f(x'))]/h |² ).

Let us define the function q_l(s) = f(x + s e_l) − f(x' + s e_l) for all l ∈ [d]. Then by applying the mean value theorem we get

    ‖r_f(x, h) − r_f(x', h)‖ = √( Σ_{l=1}^d |(q_l(h) − q_l(0))/h|² ) = √( Σ_{l=1}^d |q'_l(ξ_l)|² )

for some ξ_l ∈ (0, h). We have that q'_l(ξ_l) = ∂f(x + ξ_l e_l)/∂x_l − ∂f(x' + ξ_l e_l)/∂x_l. If f is ℓ-gradient Lipschitz, so are all the partial derivatives, therefore

    ‖r_f(x, h) − r_f(x', h)‖ ≤ √( Σ_{l=1}^d ℓ² ‖x − x'‖² ) = √d ℓ ‖x − x'‖.

• For the special case of h = 0,

    ‖r_f(x, 0) − r_f(x', 0)‖ = ‖∇f(x) − ∇f(x')‖ ≤ ℓ‖x − x'‖ ≤ √d ℓ ‖x − x'‖.

For the second part of the lemma, for any h ≠ 0 and any x,

    ‖r_f(x, h) − ∇f(x)‖ = ‖ Σ_{l=1}^d [f(x + h e_l) − f(x)]/h · e_l − ∇f(x) ‖
                        = √( Σ_{l=1}^d | [f(x + h e_l) − f(x)]/h − ∂f(x)/∂x_l |² ).

For each l ∈ [d] we use the mean value theorem so that for some ξ_l with |ξ_l| ≤ |h| we have

    ‖r_f(x, h) − ∇f(x)‖ = √( Σ_{l=1}^d |∂f(x + ξ_l e_l)/∂x_l − ∂f(x)/∂x_l|² ) ≤ √( Σ_{l=1}^d (ℓ ξ_l)² ) ≤ ℓ√d|h|.

For h = 0 the requested inequality holds as an equality. ∎

As noted in the main paper, recent studies have analyzed zero order optimization by carefully crafting a smoothed version of the original objective function. These arguments are applicable to our case as well. The following lemmas show why these approaches lead to a poly(d, 1/ε) slowdown in terms of the number of iterations and function evaluations.
Algorithm 3 of Jin et al. [2018a], uses approximate gradient evaluations at randomly sampled pointsaround the current iterate to get an estimate of the gradient of f . This estimate is then perturbed withnoise in order to avoid any potential saddle point. Algorithm 3
First order Perturbed Stochastic Gradient Descent (FPSGD)
Input: x_0, learning rate η, noise radius r, mini-batch size m.
for t = 0, 1, . . . , T do
    sample (z_t^(1), . . . , z_t^(m)) ∼ N(0, σ²I)
    g_t(x_t) ← (1/m) Σ_{i=1}^m ∇g(x_t + z_t^(i))
    x_{t+1} ← x_t − η (g_t(x_t) + ξ_t),  ξ_t uniformly ∼ B_0(r)
end for
return x_T
Lemma 5. Let f : R^d → R be a bounded, L-continuous, ℓ-gradient, ρ-Hessian Lipschitz function. Additionally, suppose that we have access to a function g : R^d → R such that ‖∇g − ∇f‖_∞ ≤ ν. Then, Jin et al. [2018a]'s FPSG method needs Õ(d²/ε²) evaluations of ∇g to converge to an ε-SOSP.

Proof. We will show the main steps that Jin et al. [2018a] followed in Section E of their Appendix. The first step of the proof is to define the Gaussian smoothing of the function g with parameter σ:

    g_σ(x) = E_{z∼N(0,σ²I)}[g(x + z)].

One can show that

    ∇g_σ(x) = E_{z∼N(0,σ²I)}[∇g(x + z)],    ∇²g_σ(x) = E_{z∼N(0,σ²I)}[∇²g(x + z)].

Additionally, Lemma 48 of Jin et al. [2018a] tells us that the gradients and Hessians of g_σ and f are close to each other and that g_σ is gradient Lipschitz and Hessian Lipschitz:

• g_σ is O(ℓ + ν/σ) gradient Lipschitz and O(ρ + ν/σ²) Hessian Lipschitz.
• ‖∇g_σ(x) − ∇f(x)‖ ≤ O(ρdσ² + ν) and ‖∇²g_σ(x) − ∇²f(x)‖ ≤ O(ρ√d σ + ν/σ).

Then Lemma 54 of Jin et al. [2018a] proves that an ε/√d-SOSP of g_σ is also an O(ε) second order stationary point of f if

    σ ≤ O(√(ε/(ρd)))   and   ν ≤ O(ε/√d).

For the aforementioned choices of ν and σ, ∇g_σ is bounded:

    ‖∇g_σ(x)‖ ≤ ‖∇g_σ(x) − ∇f(x)‖ + ‖∇f(x)‖ ≤ √d ν + L ≤ ε + L.

So the stochastic gradient ∇g(x + z) is O(ε + L) sub-Gaussian. Notice also that by replacing with the upper bounds on σ and ν, one can observe that the Hessian Lipschitz constant of g_σ is O(ρ√d). This is the main reason that an ε/√d-SOSP of g_σ is required. According to Theorem 65 of Jin et al. [2018a], getting an ε-SOSP of g_σ requires Õ(d/ε²) evaluations of ∇g. So to get an ε/√d-SOSP of g_σ, one would require Õ(d²/ε²) evaluations of ∇g. ∎

Notice that the above lemma makes the technical assumption that the gradient approximator is the gradient of a function, which may not be true for standard finite differences approximators. The lemma below for ZPSG does not have the same limitation. In contrast to FPSG, Algorithm 4 works with function evaluations directly to come up with appropriate gradient estimates.

Algorithm 4
Zero order Perturbed Stochastic Gradient Descent (ZPSGD)
Input: x_0, learning rate η, noise radius r, mini-batch size m.
for t = 0, 1, . . . , T do
    sample (z_t^(1), . . . , z_t^(m)) ∼ N(0, σ²I)
    g_t(x_t) ← (1/m) Σ_{i=1}^m z_t^(i) [f(x_t + z_t^(i)) − f(x_t)] / σ²
    x_{t+1} ← x_t − η (g_t(x_t) + ξ_t),  ξ_t uniformly ∼ B_0(r)
end for
return x_T
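A minimal Python sketch of the mini-batch estimator used by ZPSGD and of one ZPSGD step is given below (the function names are ours; note that f(x_t) is reused across the mini-batch, so one batch costs m + 1 function evaluations):

import numpy as np

def zpsgd_gradient_estimate(f, x, sigma, m, rng):
    # g_t(x) = (1/m) * sum_i z_i * (f(x + z_i) - f(x)) / sigma^2
    fx = f(x)
    z = rng.normal(scale=sigma, size=(m, x.shape[0]))
    weights = np.array([f(x + z_i) - fx for z_i in z]) / sigma ** 2
    return (weights[:, None] * z).mean(axis=0)

def zpsgd_step(f, x, eta, sigma, m, r, rng):
    # One iteration of Algorithm 4: estimated gradient plus a perturbation from B_0(r).
    g = zpsgd_gradient_estimate(f, x, sigma, m, rng)
    xi = rng.normal(size=x.shape[0])
    xi *= r * rng.random() ** (1.0 / x.shape[0]) / np.linalg.norm(xi)
    return x - eta * (g + xi)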
Lemma 6. Let f : R^d → R be a bounded, L-continuous, ℓ-gradient, ρ-Hessian Lipschitz function. Then, Jin et al. [2018a]'s ZPSG method needs a number of evaluations of f that grows polynomially in d and 1/ε to converge to an ε-SOSP.

Proof. We will show the main steps that Jin et al. [2018a] followed in Section A of their Appendix. The first step of the proof is to define the Gaussian smoothing of the function f with parameter σ:

    f_σ(x) = E_{z∼N(0,σ²I)}[f(x + z)].

One can show that

    ∇f_σ(x) = E_{z∼N(0,σ²I)}[∇f(x + z)],    ∇²f_σ(x) = E_{z∼N(0,σ²I)}[∇²f(x + z)].

Additionally, Lemma 18 of Jin et al. [2018a] for ν = 0 tells us that the gradients and Hessians of f_σ and f are close to each other and that f_σ is gradient Lipschitz and Hessian Lipschitz:

• f_σ is O(ℓ) gradient Lipschitz and O(ρ) Hessian Lipschitz.
• ‖∇f_σ(x) − ∇f(x)‖ ≤ O(ρdσ²) and ‖∇²f_σ(x) − ∇²f(x)‖ ≤ O(ρ√d σ).

Then an ε-SOSP of f_σ is also an O(ε) second order stationary point of f if σ ≤ O(√(ε/(ρd))). We also need to develop a random gradient approximator of ∇f_σ given only evaluations of f. Based on Lemma 19,

    ∇f_σ(x) = E_{z∼N(0,σ²I)}[ z (f(x + z) − f(x)) / σ² ].

Let us define

    g(x; z) = z (f(x + z) − f(x)) / σ².

Lemma 24 shows that g is B/σ sub-Gaussian, where B is the upper bound on |f(x)| (it exists since f is bounded). Replacing with the upper bound on σ, it turns out that g is O(B √(ρd/ε)) sub-Gaussian. This dependence on d and ε is the main reason for the slowdown in this case. According to Theorem 65, getting an ε-SOSP of f_σ then requires a number of evaluations of g that is polynomial in d and 1/ε and strictly larger than in the first order case. Each evaluation of g requires 2 evaluations of f. ∎

In the next section, we show the complete proof of our first main result. We will use the Stable Manifold Theorem (SMT) to prove that zero-order approximate gradient descent (AGD) avoids strict saddle points.

B Approximate Gradient Descent: Detailed proofs
Our first two lemmas prove the equivalence between the first order stationary points of f and the fixed points of the AGD dynamics. Additionally, we show that the saddle points of the objective function correspond exactly to the unstable fixed points of the proposed zero order method. Finally, we show that for a sufficiently small step-size the dynamical system is a diffeomorphism. This critical property will allow us to generalize the consequences of SMT from a local region around a saddle point to the global domain.

B.1 Avoiding strict saddle points
Lemma 7. Assume that g is an (L, B, c)-well-behaved function. If β < 1/B and η < 1/L, then for every strict saddle point x* of f we have that (x*, 0) is not a stable fixed point of g. Additionally, these are the only unstable fixed points of g.

Proof. For h = 0 and at a strict saddle x*, we will calculate the general differential of g:

    Dg(x*, 0) = [ I − η D_x q_x(x*, 0)    −η D_h q_x(x*, 0) ]   =   [ I − η ∇²f(x*)    −η D_h q_x(x*, 0) ]
                [ 0                        β ∂q_h(0)/∂h     ]       [ 0                  β ∂q_h(0)/∂h     ]

with eigenvalues β ∂q_h(0)/∂h and (1 − ηλ_i), where λ_i are the eigenvalues of ∇²f(x*). Since x* is a strict saddle, there is at least one eigenvalue λ_i < 0, and 1 − ηλ_i > 1. Thus (x*, 0) is an unstable fixed point of g. To prove that these are the only unstable fixed points, observe that β ∂q_h(0)/∂h ∈ (0, 1), so the only way Dg(x*, 0) has an eigenvalue greater than 1 is for some λ_i to be negative, and therefore x* should be a strict saddle. ∎
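As a quick numerical illustration of the lemma (the toy function and parameter values are our own), one can verify that for f(x) = x_1² − x_2² the Jacobian of g at the strict saddle (x*, h) = (0, 0) has an eigenvalue larger than one:

import numpy as np

ell = 2.0                        # gradient Lipschitz constant of f(x) = x1^2 - x2^2
eta, beta = 0.1, 0.5             # eta < 1/L and beta < 1/B for L = sqrt(2)*ell, B = 1
hessian = np.diag([2.0, -2.0])   # Hessian of f at the saddle x* = 0

# Dg at (x*, 0) is block upper triangular with blocks I - eta*Hessian and beta * dq_h/dh.
jacobian = np.zeros((3, 3))
jacobian[:2, :2] = np.eye(2) - eta * hessian
jacobian[2, 2] = beta            # q_h(h) = h, so dq_h(0)/dh = 1

print(np.linalg.eigvals(jacobian))   # {0.8, 1.2, 0.5}: the eigenvalue 1.2 > 1 certifies instability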
Lemma 8. Assume that $g$ is an $(L,B,c)$-well-behaved function for a function $f$ with $\beta < \frac{1}{B}$. Then for each first order stationary point $x^*$ of $f$, $\binom{x^*}{0}$ is a fixed point of $g$. Additionally, $g$ has no other fixed points.

Proof. For $\beta < \frac{1}{B}$ we have that $g_h(h) = \beta q_h(h)$ is a contraction, since its Lipschitz constant is less than one. So the only fixed point of $g_h$ is $0$. Therefore, for $h \ne 0$ no point $\binom{x}{h}$ is a fixed point. Now for $h = 0$ we get that $q_x(x,0) = \nabla f(x)$, so we have
\[ x_{k+1} = x_k - \eta\nabla f(x_k). \tag{3} \]
So $x$ is a fixed point if and only if $\nabla f(x) = 0$. Combining this with the requirement that all fixed points of $g$ have $h = 0$ proves the lemma.

In order to prove Theorem 1 we also have to prove the diffeomorphism property of $g$.

Lemma 9. If $g$ is an $(L,B,c)$-well-behaved function and $\eta < \frac{1}{L}$, then $\det(\mathrm{D}g(\cdot)) \ne 0$.

Proof. Let
\[ \mathcal{K} = \mathrm{D}_x q_x(x,h). \tag{4} \]
By a straightforward calculation,
\[
\mathrm{D}g\binom{x}{h} = \begin{pmatrix} I - \eta\mathcal{K} & -\eta\,\mathrm{D}_h q_x(x,h) \\ 0 & \beta\frac{\partial q_h(h)}{\partial h} \end{pmatrix}.
\]
Given that $q_x(\cdot,h)$ is $L$-Lipschitz for all $h \in \mathbb{R}$, we have that $\|\mathcal{K}\| \le L$. Clearly $\det(I - \eta\mathcal{K}) \ne 0$, since every singular value of $I - \eta\mathcal{K}$ is at least $1 - \eta L > 0$. Finally, we have that
\[ \det\Big(\mathrm{D}g\binom{x}{h}\Big) = \beta\frac{\partial q_h(h)}{\partial h}\,\det(I - \eta\mathcal{K}) \ne 0. \]
A quick numerical sanity check of this non-degeneracy is sketched below.
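Complementing the previous sketch, the following snippet (same illustrative setup and assumptions; a numerical sanity check, not a proof of the diffeomorphism property) evaluates $\det(\mathrm{D}g)$ at a few random points $(x,h)$ with $\eta < 1/L$ and observes that it stays bounded away from zero, as Lemma 9 predicts.

```python
import numpy as np

A = np.diag([2.0, -1.0])                      # Hessian of the quadratic test objective below
f = lambda x: 0.5 * x @ A @ x
grad_f = lambda x: A @ x

def q_x(x, h):                                # forward-difference oracle with q_x(x, 0) = grad f(x)
    if h == 0.0:
        return grad_f(x)
    return np.array([(f(x + h * e) - f(x)) / h for e in np.eye(len(x))])

q_h = lambda h: 0.5 * h
eta, beta = 0.1, 0.9                          # here L = 2, so eta < 1/L holds

def G(z):                                     # augmented map on (x, h)
    x, h = z[:-1], z[-1]
    return np.append(x - eta * q_x(x, h), beta * q_h(h))

def num_jacobian(z, eps=1e-6):
    n = len(z)
    return np.column_stack([(G(z + eps * np.eye(n)[j]) - G(z - eps * np.eye(n)[j])) / (2 * eps)
                            for j in range(n)])

rng = np.random.default_rng(0)
dets = [np.linalg.det(num_jacobian(rng.normal(size=3))) for _ in range(5)]
print(dets)   # all well away from zero: det(Dg) = beta*q_h'(h)*det(I - eta*D_x q_x) != 0
```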
A straightforward application of the result of Lee et al. [2019] and the SMT yields a saddle-avoidance guarantee of the following kind:

Let $X^*_f$ be the set of the strict saddle points of $f$, $\eta < \frac{1}{L}$ and $\beta < \frac{1}{B}$. Then it holds that
\[ \Pr\Big(\Big\{\binom{x_0}{h_0} : \lim_{k\to\infty} x_k \in X^*_f\Big\}\Big) = 0. \]
Notice that the random choice would be over both $x_0$ and $h_0$. In the following subsection we will prove that a stronger result, where the random initialization refers only to the $x$-domain, is surprisingly possible via a new refinement of the SMT:
\[ \forall h_0 \in \mathbb{R}: \quad \Pr\big(\lim_{k\to\infty} x_k = x^*\big) = 1, \]
where $x^*$ is a local minimizer of $f$. Let us first describe our general strategy for proving this refinement:
1. We will restate the Stable Manifold Theorem and understand its implications. (Section B.2.1)
2. We will study the structure of the eigenvalues of $\mathrm{D}g$ at fixed points of $g$. (Section B.2.2)
3. We will show how this affects the projections onto the stable and unstable eigenspaces of $\mathrm{D}g$. (Section B.2.3)
4. Finally, we will see how this enables us to study the dimension of the stable manifold when $h_0$ is fixed. (Section B.2.4)

B.2 A Refinement of the Stable Manifold Theorem

B.2.1 Understanding the Stable Manifold Theorem

Theorem 5 (Theorem III.2 & III.7 of Shub [1987]). Let $p$ be a fixed point for the $C^r$ local diffeomorphism $h : U \to \mathbb{R}^n$, where $U \subset \mathbb{R}^n$ is an open neighborhood of $p$ in $\mathbb{R}^n$ and $r \ge 1$. Let $E^s \oplus E^c \oplus E^u$ be the invariant splitting of $\mathbb{R}^n$ into generalized eigenspaces of $\mathrm{D}h(p)$ corresponding to eigenvalues of absolute value less than one, equal to one, and greater than one. To the $\mathrm{D}h(p)$-invariant subspace $E^s \oplus E^c$ there is an associated local $h$-invariant embedded disc $W^{loc}_{sc}$, which is the graph of a $C^r$ function $r : E^s \oplus E^c \to E^u$, and a ball $B$ around $p$, such that $h(W^{loc}_{sc}) \cap B \subset W^{loc}_{sc}$. If $h^n(x) \in B$ for all $n \ge 0$, then $x \in W^{loc}_{sc}$.

We will give some intuition on how the Stable Manifold Theorem restricts the dimensionality of the stable manifold. It essentially boils down to restricting the dimensionality of the manifold $W^{loc}_{sc}$. Let $x \in U$; it can be decomposed into two vectors $x_{sc}$ and $x_u$, the projections of $x$ onto $E^s \oplus E^c$ and $E^u$ respectively. By the construction of $W^{loc}_{sc}$ in the proof of the Stable Manifold Theorem, we know that there is a function $r : E^s \oplus E^c \to E^u$ such that if $x \in W^{loc}_{sc}$, then $(x_{sc}, x_u) \in \mathrm{graph}(r)$, or equivalently $x_u = r(x_{sc})$. By construction $r$ is smooth, so $\dim(W^{loc}_{sc}) = \dim(\mathrm{graph}(r)) = \dim(E^s \oplus E^c)$. To understand why the last statement is true, the interested reader can look at Example 5.14 of Loring [2008].

B.2.2 Eigenvalues of the Jacobian at fixed points

Our main tool for understanding the structure of the eigenvalues of $\mathrm{D}g$ at fixed points of $g$ is comparing and contrasting it with its first order counterpart, gradient descent. The dynamical system of gradient descent is
\[ x_{k+1} = x_k - \eta\nabla f(x_k). \]
Now let us pick a first order stationary point $x^*$ of $f$. The Jacobian of the gradient descent map at $x^*$ is
\[ I - \eta\nabla^2 f(x^*), \]
which is a symmetric matrix, since $f$ is $C^2$. We can therefore write down its real orthonormal eigenvectors $\{v_i\}_{i=1}^d$. Without loss of generality we can reorder them so that the first $k$ eigenvectors correspond to eigenvalues of absolute value less than one, the next $s$ correspond to eigenvalues of absolute value equal to one, and the last ones correspond to eigenvalues of absolute value larger than one. Based on this separation between the eigenvectors, we can now define the following three vector spaces:
\[ E_s = \mathrm{span}\{v_1,\cdots,v_k\}, \qquad E_c = \mathrm{span}\{v_{k+1},\cdots,v_{k+s}\}, \qquad E_u = \mathrm{span}\{v_{k+s+1},\cdots,v_d\}. \]
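For intuition (our own illustration; the Hessian below is an arbitrary assumption), the next snippet performs exactly this splitting on a small example: it diagonalizes $I - \eta\nabla^2 f(x^*)$ and groups the orthonormal eigenvectors into the stable, center and unstable eigenspaces $E_s$, $E_c$, $E_u$. At a strict saddle the unstable eigenspace is non-trivial, so $\dim(E_s \oplus E_c) < d$.

```python
import numpy as np

eta = 0.1
# Hypothetical Hessian at a stationary point x*: one positive eigenvalue,
# one zero eigenvalue (center direction) and one negative eigenvalue (strict saddle).
H = np.diag([2.0, 0.0, -1.0])

J = np.eye(3) - eta * H                      # Jacobian of gradient descent at x*
lam, V = np.linalg.eigh(J)                   # symmetric => real spectrum, orthonormal eigenvectors

tol = 1e-12
E_s = V[:, np.abs(lam) < 1 - tol]            # eigenvalues with |lambda| < 1
E_c = V[:, np.abs(np.abs(lam) - 1) <= tol]   # eigenvalues with |lambda| = 1
E_u = V[:, np.abs(lam) > 1 + tol]            # eigenvalues with |lambda| > 1

print(lam)                                        # [0.8, 1.0, 1.1]
print(E_s.shape[1], E_c.shape[1], E_u.shape[1])   # 1 1 1: the three dimensions sum to d
```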
Then we can prove the following interesting lemma.

Lemma 10. If $v$ is an eigenvector of $I - \eta\nabla^2 f(x^*)$, the Jacobian of gradient descent at $x^*$, then $\binom{v}{0}$ is an eigenvector of $\mathrm{D}g\binom{x^*}{0}$ with the same eigenvalue.

Proof. By a straightforward calculation,
\[
\mathrm{D}g\binom{x^*}{0} = \begin{pmatrix} I - \eta\,\mathrm{D}_x q_x(x^*,0) & -\eta\,\mathrm{D}_h q_x(x^*,0) \\ 0 & \beta\frac{\partial q_h(0)}{\partial h} \end{pmatrix}
= \begin{pmatrix} I - \eta\nabla^2 f(x^*) & -\eta\,\mathrm{D}_h q_x(x^*,0) \\ 0 & \beta\frac{\partial q_h(0)}{\partial h} \end{pmatrix}.
\]
Indeed, if $v$ is an eigenvector of $I - \eta\nabla^2 f(x^*)$ with eigenvalue $\lambda$, then
\[
\mathrm{D}g\binom{x^*}{0}\binom{v}{0} = \begin{pmatrix} I - \eta\nabla^2 f(x^*) & -\eta\,\mathrm{D}_h q_x(x^*,0) \\ 0 & \beta\frac{\partial q_h(0)}{\partial h} \end{pmatrix}\binom{v}{0} = \binom{\lambda v}{0} = \lambda\binom{v}{0}.
\]

Now we know the form of $d$ out of the $d+1$ generalized eigenvectors of $\mathrm{D}g\binom{x^*}{0}$. There must be at least one more generalized eigenvector along with its corresponding eigenvalue. It is known that the generalized eigenvectors span the whole space, but so far all of the eigenvectors above have a zero in the last coordinate. So the last generalized eigenvector must have a non-zero value in the last coordinate. Without loss of generality we can assume that this last coordinate is 1, so the vector is of the form $\binom{\tilde v}{1}$. We would like to determine its corresponding eigenvalue.

Lemma 11.
The eigenvalue of $\mathrm{D}g\binom{x^*}{0}$ that corresponds to $\binom{\tilde v}{1}$ is $\beta\frac{\partial q_h(0)}{\partial h}$.

Proof.
Since the last row of $\mathrm{D}g\binom{x^*}{0}$ contains only one non-zero element, we know that the characteristic polynomial $p$ of $\mathrm{D}g\binom{x^*}{0}$ can be written as
\[
\det\Big(\mathrm{D}g\binom{x^*}{0} - \lambda I_{(d+1)\times(d+1)}\Big) = \det\big((I - \eta\nabla^2 f(x^*)) - \lambda I_{d\times d}\big)\,\Big(\beta\frac{\partial q_h(0)}{\partial h} - \lambda\Big).
\]
Given that all the other eigenvalues cover the roots of the first factor, we know that the last eigenvalue is $\beta\frac{\partial q_h(0)}{\partial h}$.

By assumption we know that $0 < \beta\frac{\partial q_h(0)}{\partial h} < 1$. Thus the last generalized eigenvector corresponds to a stable eigenvalue. Now we can write down the following:
\[
E^{g}_s = \mathrm{span}\Big\{\binom{v_1}{0},\cdots,\binom{v_k}{0},\binom{\tilde v}{1}\Big\}, \qquad
E^{g}_c = \mathrm{span}\Big\{\binom{v_{k+1}}{0},\cdots,\binom{v_{k+s}}{0}\Big\}, \qquad
E^{g}_u = \mathrm{span}\Big\{\binom{v_{k+s+1}}{0},\cdots,\binom{v_d}{0}\Big\}. \tag{5}
\]

B.2.3 Projections to stable and unstable eigenspaces of the Jacobian
In this paragraph we want to learn more about the projection to the stable and unstable eigenspacesof D g . Specifically for any vector (cid:0) x h (cid:1) , there are unique x g sc , x g u , h s , h u such that (cid:18) x h (cid:19) = (cid:18) x g sc h s (cid:19) + (cid:18) x g u h u (cid:19)(cid:18) x g sc h s (cid:19) ∈ E g s ⊕ E g c and (cid:18) x g u h u (cid:19) ∈ E g u Let us compute these projections. Given that the generalized eigenvectors span the whole space, wehave that there are unique λ i ∈ R such that (cid:18) x h (cid:19) = n (cid:88) i =1 λ i (cid:18) v i (cid:19) + λ n +1 (cid:18) ˜ v (cid:19) ⇔ λ n +1 = h and x = n (cid:88) i =1 λ i v i + h ˜ v ⇔ λ n +1 = h and x − h ˜ v = n (cid:88) i =1 λ i v i ⇔ λ n +1 = h and λ i = (cid:104) x − h ˜ v , v i (cid:105) Since v i are orthogonal as eigenvectors of a symmetrical matrix. We can now find the vectors andvalues x g sc , x g u , h s , h u x g sc = k + (cid:96) (cid:88) i =1 λ i v i + h ˜ v = k + (cid:96) (cid:88) i =1 (cid:104) x − h ˜ v , v i (cid:105) v i + h ˜ v = k + (cid:96) (cid:88) i =1 (cid:104) x , v i (cid:105) v i + h (cid:32) ˜ v − k + (cid:96) (cid:88) i =1 (cid:104) ˜ v, v i (cid:105) v i (cid:33) x g u = n (cid:88) i = k + (cid:96) +1 λ i v i = n (cid:88) i = k + (cid:96) +1 (cid:104) x − h ˜ v , v i (cid:105) v i = n (cid:88) i = k + (cid:96) +1 (cid:104) x , v i (cid:105) v i − h n (cid:88) i = k + (cid:96) +1 (cid:104) ˜ v , v i (cid:105) v i h s = h and h u = 0 Once again we will compare and contrast with the first order case. Equivalently for every vector x there are unique x g sc , x g u such that x = x g sc + x g u x g sc ∈ E g s ⊕ E g c and x g u ∈ E g u q = ˜ v − k + (cid:96) (cid:88) i =1 (cid:104) ˜ v , v i (cid:105) v i = n (cid:88) i = k + (cid:96) +1 (cid:104) ˜ v , v i (cid:105) v i (6)Then clearly x g sc = x g sc + h qx g u = x g u − h q (7) h sc = hh u = 0 B.2.4 Restricting the dimension of the stable manifold for fixed initial h In this paragraph we are ready to finally prove Theorem . Theorem 6 (Theorem 1 restated) . Let g be a ( L, B, c ) -well-behaved function for function f . Let X ∗ f be the set of strict saddle points of f . Then if η < L and β < B : ∀ h ∈ R : µ ( { x : lim k →∞ x k ∈ X ∗ f } ) = 0 Proof.
Without loss of generality, let us fix $h = h_0$. Let us define $M_{h_0}$ as
\[ M_{h_0} = \big\{x_0 \in \mathbb{R}^d : \lim_{k\to\infty} g^k(x_0,h_0) = (x^*,0) \text{ for some } x^* \in X^*_f\big\}. \]
We want to prove that the set $M_{h_0}$ has measure $0$. Let us apply the Stable Manifold Theorem to $g$ at all fixed points $p = (x^*,0) \in X^*_f \times \{0\}$. Let $B_p$, $W^{loc}_{sc,p}$ be the ball and the corresponding manifold derived from Theorem 5. We consider the union of those balls $B = \bigcup_p B_p$. The following property of $\mathbb{R}^N$ holds:

Theorem (Lindelöf's lemma). For every open cover there is a countable subcover.
Therefore, due to Lindelöf's lemma, we can find a countable subcover for $B$, i.e., there exists a countable family of fixed points $p_0, p_1, \cdots$ such that $B = \bigcup_{m=0}^{+\infty} B_{p_m}$. Once again, based on Theorem 5, if starting from $x_0$ one converges to an unstable fixed point, then it holds that
\[
x_0 \in M_{h_0} \;\Rightarrow\; \exists\, m, t_0 : \forall t \ge t_0,\; (x_t,h_t) = g^t(x_0,h_0) \in B_{p_m} \;\Rightarrow\; \exists\, m, t_0 : (x_{t_0},h_{t_0}) = g^{t_0}(x_0,h_0) \in W^{loc}_{sc,p_m}.
\]
Let us define
\[ U^m_t = \big\{x_0 \in \mathbb{R}^d : (x_t,h_t) = g^t(x_0,h_0) \text{ and } (x_t,h_t) \in W^{loc}_{sc,p_m}\big\}. \]
Therefore we have
\[ M_{h_0} \subseteq \bigcup_{m=0}^{\infty}\bigcup_{t=0}^{\infty} U^m_t. \]
Now it suffices to prove that all the sets $U^m_t$ have zero measure. Let us first prove the following lemma as a stepping stone.

Lemma.
Let us define the following set of points:
\[ R^m_h = \big\{x \in \mathbb{R}^d : (x,h) \in W^{loc}_{sc,p_m}\big\}. \]
Then $\dim(R^m_h) < d$.

Proof. Based on our discussion of the Stable Manifold Theorem, we know that there is a smooth function $r : E^{g}_s \oplus E^{g}_c \to E^{g}_u$ such that
\[ \binom{x}{h} \in W^{loc}_{sc,p_m} \;\Rightarrow\; \binom{x^{g}_u}{h_u} = r\big(x^{g}_{sc},\, h_s\big), \]
where $x^{g}_u$, $x^{g}_{sc}$, $h_s$ and $h_u$ are the components of the projections onto $E^{g}_s \oplus E^{g}_c$ and $E^{g}_u$, as defined in the equations of (5). Now, using our analysis in the equations of (7),
\[ \binom{x}{h} \in W^{loc}_{sc,p_m} \;\Rightarrow\; \binom{x_u - h\,q}{0} = r\big(x_{sc} + h\,q,\, h\big), \]
where $q$ is the vector we defined in Equation (6) and $x_{sc}$, $x_u$ are the components of $x$ along $E_s \oplus E_c$ and $E_u$. Let $\Pi$ be the projection that for each $\binom{x}{h} \in \mathbb{R}^{d+1}$ returns $x$. Then we can define the following smooth function $r'_h : E_s \oplus E_c \to E_u$:
\[ r'_h(x) = h\,q + \Pi\, r(x + h\,q,\, h). \]
Using $\{v_i\}_{i=1}^d$ as a basis, we can write
\[ \binom{x}{h} \in W^{loc}_{sc,p_m} \;\Rightarrow\; x_u = r'_h(x_{sc}) \;\Rightarrow\; x \in \mathrm{graph}(r'_h). \]
Therefore $\dim(R^m_h) \le \dim(E_s \oplus E_c) < d$, since $p_m$ corresponds to an unstable fixed point of $g$.

Then we can prove the following lemma.

Lemma 12.
The measure of $U^m_t$ is zero.

Proof. We argue by contradiction. Let us assume that $U^m_t$ has non-zero measure. Let us define
\[
W^m_0 = U^m_t, \qquad W^m_1 = \big\{x \in \mathbb{R}^d : x \in g(W^m_0, h_0)\big\}, \qquad \dots, \qquad W^m_t = \big\{x \in \mathbb{R}^d : x \in g(W^m_{t-1}, h_{t-1})\big\}.
\]
Given that $g(\cdot, h_i)$ is a diffeomorphism for all $i$, we have that every $W^m_i$ has non-zero measure. Observe that $W^m_t \subseteq R^m_{h_t}$, so $\dim(W^m_t) < d$ and $W^m_t$ has measure zero, leading to a contradiction.

Since a countable union of zero measure sets has zero measure, we clearly have that $M_{h_0}$ has measure zero, as requested.

In the previous section, we provided sufficient conditions to avoid convergence to strict saddle points. These results are meaningful, however, only if $\lim_{k\to\infty} x_k = x^*$ exists. Thus, in order to complete the proof of Theorem 3, in the following section we provide sufficient conditions under which the dynamical system of AGD converges.

B.3 Convergence

We will refer to the error of the gradient approximation as $\varepsilon_k = q_x(x_k,h_k) - \nabla f(x_k)$. In order to prove convergence, we first establish a lower bound on the decrease of the function value, connected with the norm of the gradient and its approximation error (Lemma 2). We also prove that our scheme yields an exponential decrease of that error (Lemma 14). Given those lemmas, we can prove an exact and an $\epsilon$-first order stationary point convergence theorem.

Lemma 13 (Lemma 2 restated). Suppose that $g$ is an $(L,B,c)$-well-behaved function for an $\ell$-gradient Lipschitz function $f$. If $\eta \le \frac{1}{\ell}$, then we have that
\[ f(x_{k+1}) \le f(x_k) - \frac{\eta}{2}\big(\|\nabla f(x_k)\|^2 - \|\varepsilon_k\|^2\big). \tag{8} \]

Proof.
\begin{align*}
f(x_{k+1}) &\le f(x_k) + \nabla f(x_k)^\top (x_{k+1}-x_k) + \frac{\ell}{2}\|x_{k+1}-x_k\|^2 \\
&\le f(x_k) - \eta\,\nabla f(x_k)^\top q_x(x_k,h_k) + \frac{\eta^2\ell}{2}\|q_x(x_k,h_k)\|^2 \\
&\le f(x_k) - \eta\,\nabla f(x_k)^\top \big(\nabla f(x_k)+\varepsilon_k\big) + \frac{\eta^2\ell}{2}\|\nabla f(x_k)+\varepsilon_k\|^2 \\
&\le f(x_k) - \eta\,\nabla f(x_k)^\top \big(\nabla f(x_k)+\varepsilon_k\big) + \frac{\eta}{2}\|\nabla f(x_k)+\varepsilon_k\|^2 \\
&\le f(x_k) - \frac{\eta}{2}\big(\|\nabla f(x_k)\|^2 - \|\varepsilon_k\|^2\big).
\end{align*}

Lemma 14 (Exponentially decreasing $\varepsilon_k$). Suppose that $g$ is an $(L,B,c)$-well-behaved function for a function $f$. Then we have that
\[ \|\varepsilon_k\| \le c\,|h_0|\,(\beta B)^k. \]

Proof.
Since $q_h$ is $B$-Lipschitz,
\[ |h_{k+1}| = |\beta q_h(h_k) - \beta q_h(0)| \le \beta B\,|h_k|. \]
Therefore we have that $|h_k| \le (\beta B)^k |h_0|$. Based on property 3 of the $(L,B,c)$-well-behaved function, we have that
\[ \|\varepsilon_k\| = \|q_x(x_k,h_k) - \nabla f(x_k)\| \le c\,|h_k| \le c\,|h_0|\,(\beta B)^k. \]
A small numerical sanity check of this geometric decay is sketched below. With it in hand, we are ready to prove convergence to the first order stationary points.
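The following small sketch is our illustration only; the quadratic test function, the forward-difference oracle and all constants are assumptions. It checks that the gradient-approximation error of a forward-difference oracle decays geometrically when $h_k$ is contracted by $\beta q_h$ with $\beta B < 1$.

```python
import numpy as np

# Hypothetical smooth test function and its gradient.
A = np.diag([1.0, 3.0])
f = lambda x: 0.5 * x @ A @ x
grad_f = lambda x: A @ x

def q_x(x, h):
    # Forward-difference gradient oracle; its error is O(h) for smooth f.
    if h == 0.0:
        return grad_f(x)
    d = len(x)
    return np.array([(f(x + h * np.eye(d)[i]) - f(x)) / h for i in range(d)])

eta, beta = 0.1, 0.9
q_h = lambda h: h                      # q_h(h) = h is 1-Lipschitz (B = 1), so beta*B = 0.9 < 1

x, h = np.array([1.0, -1.0]), 0.5
for k in range(10):
    err = np.linalg.norm(q_x(x, h) - grad_f(x))
    print(k, h, err)                   # err shrinks at the geometric rate (beta*B)^k, as in Lemma 14
    x = x - eta * q_x(x, h)            # approximate gradient descent step
    h = beta * q_h(h)                  # auxiliary variable contracts: |h_{k+1}| <= beta*B*|h_k|
```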
Theorem 7 (Theorem 2 restated). Suppose that $g$ is an $(L,B,c)$-well-behaved gradient function for an $\ell$-gradient Lipschitz function $f$. Let $\eta \le \frac{1}{\ell}$, $\beta < \frac{1}{B}$. Then, if $f$ is lower bounded,
\[ \lim_{k\to\infty}\|\nabla f(x_k)\| = 0. \]

Proof. Applying Lemma 2 repeatedly, we get
\[ f(x_0) - f(x_k) \ge \frac{\eta}{2}\sum_{i=0}^{k}\big(\|\nabla f(x_i)\|^2 - \|\varepsilon_i\|^2\big). \]
We now have that
\begin{align*}
f(x_0) - f(x_k) + \frac{\eta}{2}\sum_{i=0}^{k}\|\varepsilon_i\|^2 &\ge \frac{\eta}{2}\sum_{i=0}^{k}\|\nabla f(x_i)\|^2 \\
f(x_0) - f(x_k) + \frac{\eta}{2}\sum_{i=0}^{\infty}\|\varepsilon_i\|^2 &\ge \frac{\eta}{2}\sum_{i=0}^{k}\|\nabla f(x_i)\|^2 \\
f(x_0) - f(x_k) + \frac{\eta}{2}\sum_{i=0}^{\infty}\big(c\,|h_0|\,(\beta B)^i\big)^2 &\ge \frac{\eta}{2}\sum_{i=0}^{k}\|\nabla f(x_i)\|^2 \\
f(x_0) - f(x_k) + \frac{\eta}{2}\,\frac{c^2 h_0^2}{1-(\beta B)^2} &\ge \frac{\eta}{2}\sum_{i=0}^{k}\|\nabla f(x_i)\|^2.
\end{align*}
Given that $f$ is lower bounded, $f(x_0) - f(x_k)$ and therefore the whole left hand side is upper bounded, which means the series on the right hand side is upper bounded. Since this is a series of non-negative terms, the series converges and therefore $\lim_{k\to\infty}\|\nabla f(x_k)\| = 0$.

For the sake of completeness, we also analyze the convergence rate to $\epsilon$-first order stationary points in this setting. This enables a fair comparison with previous results that assume a fixed $h_k = h_0$. Notice that the following result improves over previous work on randomized zero order gradient approximations. In Nesterov and Spokoiny [2017], it was proved that using a randomized oracle that requires 2 function evaluations per iteration, one could get an in-expectation $\epsilon$-first order stationary point after $O\big(d\ell(f(x_0)-f^*)/\epsilon^2\big)$ iterations. For the case of $q_x$ using $r_f$ as defined in Equation 1 of Section 3, we have just proved that with $d+1$ function evaluations per iteration we can get an $\epsilon$-first order stationary point after only $O\big(\ell(f(x_0)-f^*)/\epsilon^2\big)$ iterations. Thus, for the same number of function evaluations up to constants, our work provides deterministic guarantees, whereas Nesterov and Spokoiny [2017] provides guarantees only in expectation.

Theorem 8 ($\epsilon$-first order stationary points). Suppose that $g$ is an $(L,B,c)$-well-behaved gradient function for an $\ell$-gradient Lipschitz function $f$. Let $q_h(h) = h$, $\beta = 1$ and $\eta = \frac{1}{\ell}$. Then, if $f$ has minimum value $f^*$ and $h_0 = \frac{\epsilon}{\sqrt{2}\,c}$, the required number of iterations to reach an $\epsilon$-first order stationary point is
\[ O\Big(\frac{\ell\,(f(x_0)-f^*)}{\epsilon^2}\Big). \]

Proof.
Applying Lemma 2 repeatedly, we get
\[ f(x_0) - f(x_k) \ge \frac{1}{2\ell}\sum_{i=0}^{k}\big(\|\nabla f(x_i)\|^2 - \|\varepsilon_i\|^2\big). \]
24e now have that f ( x ) − f ( x k ) + 12 (cid:96) k (cid:88) i =0 (cid:107) ε i (cid:107) ≥ (cid:96) k (cid:88) i =0 (cid:107)∇ f ( x i ) (cid:107) f ( x ) − f ( x k ) + k + 12 (cid:96) ( c | h | ) ≥ (cid:96) k (cid:88) i =0 (cid:107)∇ f ( x i ) (cid:107) (cid:96) ( f ( x ) − f ( x k ))2( k + 1) + c | h | ≥ k + 1 k (cid:88) i =0 (cid:107)∇ f ( x i ) (cid:107) (cid:96) ( f ( x ) − f ∗ )2( k + 1) + (cid:15) ≥ k + 1 k (cid:88) i =0 (cid:107)∇ f ( x i ) (cid:107) Choose the smallest k such that (cid:96) ( f ( x ) − f ∗ )( k +1) ≤ (cid:15) . Then we have (cid:15) ≥ k + 1 k (cid:88) i =0 (cid:107)∇ f ( x i ) (cid:107) Since the average of the squared norms of the gradients is less than (cid:15) , there should be at least onethat is less or equal to (cid:15) . That is there is a k ≤ k such that (cid:107)∇ f ( x k ) (cid:107) ≤ (cid:15) . Given the definition of k we get the iteration bound stated in the theorem.The last theorems give us a guarantee that the norm of the gradient is converging to zerobut this is not enough to prove convergence to a single stationary point if f has non isolatedcritical points. To establish a stronger result we prove that {(cid:107)∇ f ( x k ) (cid:107)} does not decreasearbitrarily quickly. Lemma 15 (Sufficiently large gradients) . Suppose that g is a ( L, B, c ) -well-behaved function for a (cid:96) -gradient Lipschitz function f . Then we have that (cid:107)∇ f ( x k +1 ) (cid:107) ≥ (1 − η(cid:96) ) (cid:107)∇ f ( x k ) (cid:107) − η(cid:96) (cid:107) ε k (cid:107) Proof. (cid:107)∇ f ( x k +1 ) (cid:107) ≥ (cid:107)∇ f ( x k ) (cid:107) − (cid:107)∇ f ( x k +1 ) − ∇ f ( x k ) (cid:107)≥ (cid:107)∇ f ( x k ) (cid:107) − (cid:96) (cid:107) x k +1 − x k (cid:107)≥ (cid:107)∇ f ( x k ) (cid:107) − η(cid:96) (cid:107) q x ( x k , h k ) (cid:107)≥ (cid:107)∇ f ( x k ) (cid:107) − η(cid:96) (cid:107)∇ f ( x k ) + ε k (cid:107)≥ (cid:107)∇ f ( x k ) (cid:107) − η(cid:96) (cid:107)∇ f ( x k ) (cid:107) − η(cid:96) (cid:107) ε k (cid:107)≥ (1 − η(cid:96) ) (cid:107)∇ f ( x k ) (cid:107) − η(cid:96) (cid:107) ε k (cid:107) Theorem 9.
Assume that f is (cid:96) -gradient Lipschitz, is analytic and that it has compact sub-level setsand that g is a ( L, B, c ) -well-behaved gradient oracle. Let η < (cid:96) , β < − η(cid:96)B . Then lim x k existsand is a stationary point of f .Proof. We will first prove that given the fact that f has compact sub-level sets { x k } is confined incompact set. Based on Lemma 2 we have that for all k ≥ f ( x k +1 ) − f ( x k ) ≤ η (cid:107) ε k (cid:107) Applying this recursively and adding the inequalities f ( x k +1 ) ≤ f ( x ) + η k (cid:88) i =0 (cid:107) ε i (cid:107) ≤ f ( x ) + η k (cid:88) i =0 (cid:0) c | h | ( βB ) i (cid:1) ≤ f ( x ) + η c h k (cid:88) i =0 ( βB ) i ≤ f ( x ) + η c h − ( βB ) So clearly { f ( x k ) } is bounded and therefore { x k } stays in one of the compact sub-level sets of f forever.Let us define the following φ k ( h ) = c | h | ( βB ) k We will split the proof of the theorem in two cases. For the first case we will assume that there is a k ∈ N such that (cid:107)∇ f ( x k ) (cid:107) ≥ φ k ( h ) Then by Lemma 15 (cid:107)∇ f ( x k +1 ) (cid:107) ≥ (1 − η(cid:96) ) (cid:107)∇ f ( x k ) (cid:107) − η(cid:96) (cid:107) ε k (cid:107) ≥ (1 − η(cid:96) ) φ k ( h ) − η(cid:96)φ k ( h ) ≥ (1 − η(cid:96) ) φ k ( h ) ≥ − η(cid:96)βB βBφ k ( h ) ≥ − η(cid:96)βB φ k +1 ( h ) ≥ φ k +1 ( h ) By induction we have that ∀ k ≥ k + 1 (cid:107)∇ f ( x k ) (cid:107) ≥ − η(cid:96)βB φ k ( h ) By Lemma 14 (cid:107)∇ f ( x k ) (cid:107)(cid:107) ε k (cid:107) ≥ (cid:18) − η(cid:96)βB (cid:19) = q >
26t the same time −∇ f ( x k ) (cid:62) ( x k +1 − x k ) = η ∇ f ( x k ) (cid:62) ( ∇ f ( x k ) + ε k )= η (cid:107)∇ f ( x k ) (cid:107) + η ∇ f ( x k ) (cid:62) ε k ≤ η (cid:18) q (cid:19) (cid:107)∇ f ( x k ) (cid:107) Additionally using similar arguments as above −∇ f ( x k ) (cid:62) ( x k +1 − x k ) (cid:107)∇ f ( x k ) (cid:107)(cid:107) ( x k +1 − x k ) (cid:107) ≥ η (cid:16) − q (cid:17) (cid:107)∇ f ( x k ) (cid:107) η (cid:16) q (cid:17) (cid:107)∇ f ( x k ) (cid:107) = (cid:16) − q (cid:17)(cid:16) q (cid:17) Let us define c = 12 (cid:18) − q (cid:19) c = (cid:16) − q (cid:17)(cid:16) q (cid:17) Clearly by Lemma 2 we have that f ( x k ) − f ( x k +1 ) ≥ η (cid:16) (cid:107)∇ f ( x k ) (cid:107) − (cid:107) ε k (cid:107) (cid:17) ≥ η (cid:18) − q (cid:19) (cid:107)∇ f ( x k ) (cid:107) We can conclude that f ( x k ) − f ( x k +1 ) ≥ − c ∇ f ( x k ) (cid:62) ( x k +1 − x k ) ≥ c c (cid:107)∇ f ( x k ) (cid:107)(cid:107) ( x k +1 − x k ) (cid:107) with c c > . Moreover, (cid:107)∇ f ( x k ) (cid:107) ≥ φ k ( h ) > so we do not have to worry about arriving onstationary points in finite time. Given that f is analytic, we have all the necessary conditions ofTheorem 3.2 in Absil et al. [2005] and we have ruled out the possibility of { x k } escaping to infinity.Therefore, we can now claim that { x k } converges.For the second case we have that for for all k ∈ N (cid:107)∇ f ( x k ) (cid:107) < φ k ( h ) . We will now prove that { x k } is a Cauchy sequence. (cid:107) x k − x m (cid:107) ≤ k (cid:88) i = m (cid:107) x i +1 − x i (cid:107)≤ k (cid:88) i = m (cid:107) ηq x ( x i , h i ) (cid:107)≤ η k (cid:88) i = m (cid:107)∇ f ( x i , h i ) + ε i (cid:107)≤ η k (cid:88) i = m φ i ( h ) We know that (cid:80) ∞ i φ i ( h ) converges so the partial sums must converge to 0. Then lim m,k →∞ (cid:107) x k − x m (cid:107) ≤ η lim m,k →∞ k (cid:88) i = m φ i ( h ) = 0 So lim m,k →∞ (cid:107) x k − x m (cid:107) = 0 and { x k } is a Cauchy sequence bounded in a compact set and thereforeit converges.In either of the cases the limit of { x k } is of course a stationary point.27e can now conclude our analysis with this final theorem. Theorem 10 (Theorem 3 restated) . Let f : R d → R ∈ C be a (cid:96) -gradient Lipschitz function. Letus also assume that f is analytic, has compact sub-level sets and all of its saddle points are strict.Let g be a ( L, B, c ) -well-behaved function for f with η < min { L , (cid:96) } and β < − η(cid:96)B . If we pick arandom initialization point x , then we have that for the x k iterates of g ∀ h ∈ R Pr( lim k →∞ x k = x ∗ ) = 1 where x ∗ is a local minimizer of f .Proof. Given the assumptions, we can apply Theorem 9 and get that lim k →∞ x k exists and is astationary point of f . We can also apply Theorem 1 in order to guarantee that the limit is not a strictsaddle of f with probability 1. Given the assumption that f has only strict saddles, then lim k →∞ x k is with probability 1 a local minimum of f . 28 Escaping Saddle Points Efficiently Detailed proofs
Before presenting the iteration complexity proof ( Theorem 4 ) we will state our main probabilisticlemma.
Lemma 16.
There exists an absolute constant $c_{\max}$ such that the following holds for any function $f$ that is $\ell$-gradient Lipschitz and $\rho$-Hessian Lipschitz, any $c \le c_{\max}$ and any $\chi \ge 1$. Let $\eta, r, g_{thres}, f_{thres}, t_{thres}, h_{low}$ be calculated in the same way as in Algorithm 1. Then, if $x_t$ satisfies
\[ \|\nabla f(x_t)\| \le g_{thres} \qquad \text{and} \qquad \lambda_{\min}\big(\nabla^2 f(x_t)\big) \le -\sqrt{\rho\epsilon}, \]
let $\tilde x_0 = x_t + \xi$, where $\xi$ comes from the uniform distribution over the ball $B_0(r)$, and let $\{\tilde x_i\}$ be the iterates of approximate gradient descent from $\tilde x_0$ with step size $\eta$ and $h = h_{low}$. Then, with probability at least $1 - \frac{d\ell}{\sqrt{\rho\epsilon}}e^{-\chi}$, we have
\[ \exists\, i' \le t_{thres} : \quad f(x_t) - f(\tilde x_{i'}) \ge f_{thres}. \]

This lemma will be the "workhorse" which offers the high probability guarantees of Algorithm 1, given that substantial progress can be made in the low gradient phase. The proof of the above lemma is deferred to the end of this section. A schematic sketch of the perturbed zero-order descent loop that this lemma and the theorem below analyze is given next, for illustration only.
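The sketch below is a schematic rendering written by us; it is not the authors' exact Algorithm 1, and the objective `f`, the zero-order oracle `q_x` and all threshold values are placeholder assumptions supplied by the caller.

```python
import numpy as np

def perturbed_zero_order_descent(f, q_x, x0, eta, g_thres, f_thres, t_thres,
                                 h, h_low, r, max_iters=10_000):
    """Schematic loop (a sketch, not the paper's exact Algorithm 1): take
    approximate gradient steps while the zero-order gradient estimate is large;
    when it is small, add a uniform-ball perturbation and check whether f drops
    by f_thres within t_thres steps, as in Lemma 16."""
    x = np.asarray(x0, dtype=float)
    t = 0
    while t < max_iters:
        z = q_x(x, h)                                   # zero-order gradient estimate
        if np.linalg.norm(z) > g_thres:
            x = x - eta * z                             # ordinary approximate gradient step
            t += 1
            continue
        # Low-gradient phase: perturb inside the ball B(r), then descend with h_low.
        x_anchor, f_anchor = x.copy(), f(x)
        xi = np.random.randn(len(x))
        xi *= r * np.random.rand() ** (1.0 / len(x)) / np.linalg.norm(xi)
        x = x_anchor + xi
        escaped = False
        for _ in range(t_thres):
            x = x - eta * q_x(x, h_low)
            t += 1
            if f_anchor - f(x) >= f_thres:
                escaped = True
                break
        if not escaped:
            # No f_thres decrease after the perturbation: return the anchor as the
            # candidate (epsilon-)second order stationary point.
            return x_anchor
    return x
```

In the analysis, the hyperparameters $\eta, r, g_{thres}, f_{thres}, t_{thres}, h_{low}$ are specific functions of $\ell, \rho, \epsilon, c, \delta$ fixed by Algorithm 1; in this sketch they are simply arguments.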
We are ready now to prove our main theorem.

Theorem 11 (Theorem 4 restated). There exists an absolute constant $c_{\max}$ such that: if $f$ is $\ell$-gradient Lipschitz and $\rho$-Hessian Lipschitz, then for any $\delta > 0$, $\epsilon \le \frac{\ell^2}{\rho}$, $\Delta_f \ge f(x_0) - f^\star$, and constant $c \le c_{\max}$, with probability $1-\delta$, the output of PAGD$(x_0, \ell, \rho, \epsilon, c, \delta, \Delta_f)$ will be an $\epsilon$-SOSP, and the algorithm terminates within
\[ O\Big(\frac{\ell\,(f(x_0) - f^\star)}{\epsilon^2}\,\log^4\Big(\frac{d\ell\Delta_f}{\epsilon^2\delta}\Big)\Big) \]
iterations.

Proof.
Denote ˜ c max to be the absolute constant allowed in Lemma 16. In this theorem, we let c max = min { ˜ c max , / } , and choose any constant c ≤ c max .In this proof, that Algorithm 1 returns a point x that satisfies the following condition: (cid:107)∇ f ( x ) (cid:107) ≤ g thres = √ cχ · (cid:15), λ min ( ∇ f ( x )) ≥ −√ ρ(cid:15) (9)Since c ≤ , χ ≥ , we have √ cχ ≤ , which implies any x satisfies Equation (9) is also a (cid:15) -SOSP .Starting from x , we know if x does not satisfy Equation 9, there are only two cases:1. (cid:107) z (cid:107) = (cid:13)(cid:13)(cid:13) q (cid:16) x , g thres c h (cid:17)(cid:13)(cid:13)(cid:13) > g thres In this case, (cid:107)∇ f ( x ) (cid:107) ≥ g thres and Algorithm 1 will not add perturbation. By Lemma 2: f ( x ) − f ( x ) ≥ η · ( (cid:107)∇ f ( x ) (cid:107) − (cid:107) ε (cid:107) ) where ε = q (cid:16) x , g thres c h (cid:17) − ∇ f ( x ) . Therefore we get (cid:107) ε (cid:107) ≤ g thres f ( x ) − f ( x ) ≥ η · ( (cid:107)∇ f ( x ) (cid:107) − (cid:107) ε (cid:107) ) ≥ η g thres ≥ c (cid:15) (cid:96)χ (cid:107) z (cid:107) = (cid:13)(cid:13)(cid:13) q (cid:16) x , g thres c h (cid:17)(cid:13)(cid:13)(cid:13) ≤ g thres In this case, (cid:107)∇ f ( x ) (cid:107) ≤ g thres and Algorithm 1 will add a perturbation ξ of radius r suchthat ˜ x ← x + ξ , and will perform approximate gradient descent (without perturbations)for at most t thres steps. Since x is not a second-order stationary point then by Lemma 16there exists i (cid:48) ≤ t thres such that: f ( x ) − f ( x ) = f ( x ) − f (˜ x i (cid:48) ) ≥ f thres = cχ · (cid:115) (cid:15) ρ f ( x ) − f (˜ x i (cid:48) ) i (cid:48) ≥ f thres t thres = c χ · (cid:15) (cid:96) Hence, we can conclude that as long as Algorithm 1 has not terminated yet, on average, every stepdecreases function value by at least c χ · (cid:15) (cid:96) . However, we clearly can not decrease function value bymore than f ( x ) − f (cid:63) , where f (cid:63) is the minimum value of f . This means Algorithm 1 must terminatewithin the following number of iterations: f ( x ) − f (cid:63)c χ · (cid:15) (cid:96) = χ c · (cid:96) ( f ( x ) − f (cid:63) ) (cid:15) = O (cid:18) (cid:96) ( f ( x ) − f (cid:63) ) (cid:15) log (cid:18) d(cid:96) ∆ f (cid:15) δ (cid:19)(cid:19) Finally, we have to ensure that the above statement holds with high probability. In the worst casescenario, in each outer-loop iteration the algorithm will be enforced to add a perturbation yielding adecrease of f thres . Thus, the maximum number of perturbations are at most: f ( x ) − f (cid:63) f thres = f ( x ) − f (cid:63)cχ · (cid:113) (cid:15) ρ Applying Lemma 16, we know that the guaranteed decrease of f thres happens with probability at least − d(cid:96) √ ρ(cid:15) e − χ each time. By union bound, the probability that all perturbations satisfy the decreaseguarantee is at least − d(cid:96) √ ρ(cid:15) e − χ · f ( x ) − f (cid:63)cχ · (cid:113) (cid:15) ρ = 1 − χ e − χ c · d(cid:96) ( f ( x ) − f (cid:63) ) (cid:15) Recall our choice of χ = 3 max { log( d(cid:96) ∆ f c(cid:15) δ ) , } . Since χ ≥ , we have χ e − χ ≤ e − χ/ , this gives: χ e − χ c · d(cid:96) ( f ( x ) − f (cid:63) ) (cid:15) ≤ e − χ/ d(cid:96) ( f ( x ) − f (cid:63) ) c(cid:15) ≤ δ which finishes the proof.What remains to be proven is why adding a perturbation is guaranteed to help the algorithm decreasethe value of f substantially with high probability. Following the proof strategy of Jin et al. [2017] wewill define some additional notation. 
Let the condition number be the ratio of the Lipschitz constantof ∇ f and the smallest negative eigenvalue of the Hessian of x t before adding the perturbation, i.e κ = (cid:96)/γ ≥ . Additionally we define the following units: p ← log( dκδ ) , L ← η(cid:96), F ← L p γ ρ , G ← √ L p γ ρ , S ← √ L p γρ , R ← S κp , T ← pηγ Following the above definitions, it holds that: S = (cid:113) F pγ = G pγ , (cid:96) R = 2 G and η TG = S (A): The first argument in this proof is that if the ˜ x i iterates do not achieve a decrease of . F in cT steps thenthey must remain confined in a small ball around ˜ x . Lemma 17.
For any constant c ≥ , define: T = min (cid:110) inf t { t | f ( u ) − f ( u t ) ≥ . F } , cT (cid:111) then, for any η ≤ /(cid:96) , we have for all t < T that (cid:107) u t − u (cid:107) ≤ S · c ) . roof of Lemma 17. Applying repeatedly Lemma 2, we get for t < Tf ( u t ) − f ( u ) ≤ − η t (cid:88) i =0 (cid:16) (cid:107)∇ f ( u i ) (cid:107) − (cid:107) ε i (cid:107) (cid:17) where ε i = q x ( u i , h low ) − ∇ f ( u i ) . By definition of T we have that the function value of f has not yet decreased by . F . η t (cid:88) i =0 (cid:107)∇ f ( u i ) (cid:107) ≤ f ( u ) − f ( u t ) + η t (cid:88) i =0 (cid:107) ε i (cid:107) η t (cid:88) i =0 (cid:107)∇ f ( u i ) (cid:107) ≤ . F + η t (cid:88) i =0 (cid:107) ε i (cid:107) Since T ≤ cT and also (cid:107) ε i (cid:107) ≤ G we then have η t (cid:88) i =0 (cid:107)∇ f ( u i ) (cid:107) ≤ . F + η G cT t (cid:88) i =0 (cid:107)∇ f ( u i ) (cid:107) ≤ η F + G cT t (cid:88) i =0 (cid:16) (cid:107)∇ f ( u i ) (cid:107) + (cid:107) ε i (cid:107) (cid:17) ≤ η F + 2 G cT We also have that (cid:107) q x ( u i , h low ) (cid:107) ≤ (cid:16) (cid:107)∇ f ( u i ) (cid:107) + (cid:107) ε i (cid:107) (cid:17) . Therefore we have that t (cid:88) i =0 (cid:107) q x ( u i , h low ) (cid:107) ≤ η F + 4 G cT Now we can bound the difference between u t and u : (cid:107) u t − u (cid:107) = (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) t (cid:88) i =1 u i − u i − (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) ≤ t t (cid:88) i =1 (cid:107) u i − u i − (cid:107) ≤ tη t (cid:88) i =0 (cid:107) q x ( u i , h low ) (cid:107) ≤ tη (cid:18) η F + 4 G cT (cid:19) ≤ tη (cid:18) η F + 4 G cT (cid:19) ≤ cT η (cid:18) η F + 4 G cT (cid:19) Manipulating the constants we get (cid:107) u t − u (cid:107) ≤ (cid:0) c + c (cid:1) S (cid:107) u t − u (cid:107) ≤ (cid:112) (10 c + c ) S For any c ≥ we have (cid:107) u t − u (cid:107) ≤ c S ) u are constrained in a smallball, iterates from w = u + µ · R e , for large enough µ must be able to decrease the function value. In orderto do that, we keep track of vector v which is the difference between { u i } and { w i } . We also decompose v into two different eigenspaces: the direction e (the minimum-eigenvalue eigenvector) and its orthogonalsubspace. Lemma 18.
There exists absolute constant c max , c such that: for any δ ∈ (0 , dκe ] , let f ( · ) , ˆ x satisfiesthe following conditions (cid:107)∇ f (ˆ x ) (cid:107) ≤ G and λ min ( ∇ f (ˆ x )) ≤ − γ and any two sequences { u t } , { w t } with initial points u , w satisfying: w = u + µ · R · e , µ ∈ [ δ/ (2 √ d ) , , (cid:107) u − ˆ x (cid:107) ≤ R e is the eignevector of the minimum eigenvalue of ∇ f (ˆ x ) . Assume also that h low ≤ ρ S δ c h √ d R .Define T = min (cid:110) inf t { t | f ( w ) − f ( w t ) ≥ . F } , cT (cid:111) then, for any η ≤ c max /(cid:96) , if (cid:107) u t − u (cid:107) ≤ S · c ) for all t < T , we will have T < cT .Proof of Lemma 18. Recall notation ˜ H = ∇ f (ˆ x ) . Since δ ∈ (0 , dκe ] , we always have p ≥ . Define v t = w t − u t , by assumption, we have v = µ R e . Let us firstly define the gradient approximationerrors for these two sequences ε w t = q x ( w t , h low ) − ∇ f ( w t ) ε u t = q x ( u t , h low ) − ∇ f ( u t ) Now, consider the update equation for w t : u t +1 + v t +1 = w t +1 = w t − ηq x ( w t , h low )= w t − η ( ∇ f ( w t ) + ε w t )= u t + v t − η ∇ f ( u t + v t ) − η ε w t = u t + v t − η ∇ f ( u t ) − η (cid:20)(cid:90) ∇ f ( u t + θ v t )d θ (cid:21) v t − η ε w t = u t + v t − η ∇ f ( u t ) − η ( ˜ H + ∆ (cid:48) t ) v t − η ε w t = u t − η ∇ f ( u t ) + ( I − η ˜ H − η ∆ (cid:48) t ) v t − η ε w t = u t − η ( ∇ f ( u t ) + ε u t ) + ( I − η ˜ H − η ∆ (cid:48) t ) v t − η ( ε w t − ε u t )= u t − ηq x ( u t , h low ) + ( I − η ˜ H − η ∆ (cid:48) t ) v t − η ( ε w t − ε u t )= u t +1 + ( I − η ˜ H − η ∆ (cid:48) t ) v t − η ( ε w t − ε u t ) where ∆ (cid:48) t = (cid:90) ∇ f ( u t + θv t )d θ − ˜ H This gives the dynamic for v t satisfy: v t +1 = ( I − η ˜ H − η ∆ (cid:48) t ) v t − η ( ε w t − ε u t ) (10)Since f is Hessian Lipschitz, we have (cid:107) ∆ (cid:48) t (cid:107) = (cid:13)(cid:13)(cid:13)(cid:13)(cid:90) ∇ f ( u t + θ v t ) − ∇ f (ˆ x )d θ (cid:13)(cid:13)(cid:13)(cid:13) ≤ (cid:90) ρ (cid:107) u t + θ v t − ˆ x (cid:107) d θ ≤ ρ ( (cid:107) u t − u (cid:107) + (cid:107) v t (cid:107) + (cid:107) ˆ x − u (cid:107) ) . t < T the sequence { w t } has not decreased the function f by − . F . In other words, it holdsthat f ( w ) − f ( w t ) ≤ . F , so applying Lemma 17, we know for all t ≤ T (cid:107) w t − w (cid:107) ≤ S c ) . By condition of Lemma 18, we know (cid:107) u t − u (cid:107) ≤ S c ) for all t < T . This gives for all t < T : (cid:107) v t (cid:107) = (cid:107) w t − u t (cid:107) = (cid:107) ( w t − w ) − ( u t − u ) + ( w − u ) (cid:107)≤ (cid:107) ( w t − w ) (cid:107) + (cid:107) u t − u (cid:107) + (cid:107) w − u (cid:107)≤ S c ) + 100( S c ) + µ R ≤ S c ) + R ≤ (200 c + 1) S (11)where the last step holds because R ≤ S This gives us for t < T : (cid:107) ∆ (cid:48) t (cid:107) ≤ ρ ( (cid:107) u t − u (cid:107) + (cid:107) v t (cid:107) + (cid:107) ˆ x − u (cid:107) ) ≤ ρ (100 c S + (200 c + 1) S + R ) ≤ ρ S (300 c + 2) Let ψ t be the norm of v t projected onto e direction and the normal vector and ϕ t correspondinglybe the norm of v t projected onto remaining subspace. Let us define as λ = ηρ S (300 c + 2) . 
Equation10 gives us: ψ t +1 = (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:89) e ( I − η ˜ H ) v t − η ∆ (cid:48) t v t − η ( ε w t − ε u t ) (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) ϕ t +1 = (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) (cid:89) R d \{ e } ( I − η ˜ H ) v t − η ∆ (cid:48) t v t − η ( ε w t − ε u t ) (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) Lower bound of ψ t +1 : ψ t +1 = (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:89) e [( I − η ˜ H ) ψ t e − η ∆ (cid:48) t v t − η ( ε w t − ε u t )] (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) ≥ (cid:107) ( I − η ˜ H ) ψ t e (cid:107) − η (cid:107) (cid:89) e [∆ (cid:48) t v t ] (cid:107) − η (cid:107) (cid:89) e [ ε w t − ε u t ] (cid:107)≥ (1 + γη ) ψ t − η (cid:107) ∆ (cid:48) t v t (cid:107) − η (cid:107) ε w t − ε u t (cid:107)≥ (1 + γη ) ψ t − η (cid:107) ∆ (cid:48) t (cid:107)(cid:107) v t (cid:107) − η (cid:107) ε w t − ε u t (cid:107)≥ (1 + γη ) ψ t − λ (cid:113) ψ t + ϕ t − η (cid:107) ε w t − ε u t (cid:107) Upper bound of ϕ t +1 : ϕ t +1 = (cid:107) (cid:89) R d \{ e } [( I − η ˜ H ) v t − η ∆ (cid:48) t v t − η ( ε w t − ε u t )] (cid:107)≤ (cid:107) (cid:89) R d \{ e } [( I − η ˜ H ) v t ] (cid:107) + (cid:107) (cid:89) R d \{ e } [ η ∆ (cid:48) t v t ] (cid:107) + η (cid:107) (cid:89) R d \{ e } [ ε w t − ε u t ] (cid:107)≤ (cid:107) (cid:89) R d \{ e } [( I − η ˜ H ) v t ] (cid:107) + (cid:107) η ∆ (cid:48) t v t (cid:107) + η (cid:107) ε w t − ε u t (cid:107)≤ (1 + γη ) ϕ t + λ (cid:113) ψ t + ϕ t + η (cid:107) ε w t − ε u t (cid:107) ψ t +1 ≥ (1 + γη ) ψ t − λ (cid:113) ψ t + ϕ t − η (cid:107) ε w t − ε u t (cid:107) ϕ t +1 ≤ (1 + γη ) ϕ t + λ (cid:113) ψ t + ϕ t + η (cid:107) ε w t − ε u t (cid:107) We will now prove via induction the following fact:
Claim 1. ∀ t < T ϕ t ≤ λt · ψ t and (cid:107) ε w t (cid:107) ≤ λ η (cid:107) v t (cid:107) and (cid:107) ε u t (cid:107) ≤ λ η (cid:107) v t (cid:107) Proof.
Let us prove the base case of the induction: • By hypothesis of Lemma 18, we know ϕ = 0 so ϕ ≤ λ · ψ holds trivially • Based on the choice of h low we have that (cid:107) ε w t (cid:107) ≤ ρ S δ √ d R ≤ λ η ψ ≤ λ η (cid:107) v (cid:107)(cid:107) ε u t (cid:107) ≤ ρ S δ √ d R ≤ λ η ψ ≤ λ η (cid:107) v (cid:107) . Thus the base case of induction holds. Assume Claim 1 is true for τ ≤ t . Now we can rewrite theinequalities based on the inductive hypothesis as follows: ψ t +1 ≥ (1 + γη ) ψ t − λ (cid:113) ψ t + ϕ t ϕ t +1 ≤ (1 + γη ) ϕ t + 2 λ (cid:113) ψ t + ϕ t For t + 1 ≤ T , we have: (cid:40) λ ( t + 1) ψ t +1 ≥ λ ( t + 1) (cid:16) (1 + γη ) ψ t − λ (cid:112) ψ t + ϕ t (cid:17) ϕ t +1 ≤ λt (1 + γη ) ψ t + 2 λ (cid:112) ψ t + ϕ t (cid:41) Thus it suffices to prove that: λt (1 + γη ) ψ t + 2 λ (cid:113) ψ t + ϕ t ≤ λ ( t + 1) (cid:18) (1 + γη ) ψ t − λ (cid:113) ψ t + ϕ t (cid:19) (2 + 8 λ ( t + 1)) (cid:113) ψ t + ϕ t ≤ γη ) ψ t . By choosing √ c max ≤ c +2 min { √ , / c } , using the facts (cid:26) ηρ S T = √ η(cid:96)η ≤ c max /(cid:96) , we have λ ( t + 1) ≤ λT ≤ ηρ S (300 c + 2) cT = 8 (cid:112) η(cid:96) (300 c + 2) c ≤ / This gives: γη ) ψ t ≥ ψ t ≥ (cid:113) ψ t ≥ (2 + 8 λ ( t + 1)) (cid:113) ψ t + ϕ t which finishes the induction of the first part.Now, using again the induction hypothesis, we know ϕ t ≤ λt · ψ t ≤ ψ t , this gives: ψ t +1 ≥ (1 + γη ) ψ t − √ λψ t ≥ (1 + γη ψ t (12)where the last step follows from √ λ = √ ηρ S (300 c + 2) = √ (cid:112) η(cid:96) γρp ≤ √ c max (300 c + 2) γ ηp < γη . Equation 12 yields that ψ t is increasing sequence. Clearly (cid:107) ε w t +1 (cid:107) ≤ λ η ψ ≤ λ η ψ t +1 ≤ λ η (cid:107) v t +1 (cid:107)(cid:107) ε u t +1 (cid:107) ≤ λ η ψ ≤ λ η ψ t +1 ≤ λ η (cid:107) v t +1 (cid:107) Thus we have completed the induction. 34inally, combining Eq.(11) and (12) we have for all t < T : (200 c + 1) S ≥ (cid:107) v t (cid:107) ≥ ψ t ≥ (1 + γη t ψ = (1 + γη t µ R = (1 + γη t S κ p = (1 + γη t δ √ d S κ p This implies:
T < log( (200 c +1)2 √ d κdδ · p )log(1 + γη ) ≤ log((200 c + 1)) + log( κdδ ) + log p ( γη ) ≤ c + 1) γη +2 log( κdδ ) γη +2 pγη The last inequality is due to the following facts • p = log( κdδ ) ≥ and ∀ x ≥ x ≤ x . • ∀ x ≥ x ) ≤ x thus log(1 + γη ) ≤ γη . • T = pγη Therefore, it holds that:
T < c + 1) pγη + 4 T ≤ T (2 log(200 c + 1) + 4) By choosing constant c to be large enough to satisfy c + 1) + 4) ≤ c , for example (i.e c ≥ ), we will have T < cT , which finishes the proof.35C): Until now we have proved that firstly if approximate gradient descent from u does not decreasefunction value, then all the iterates must lie within a small ball around u (Lemma 17) and secondly startingan approximate descent from w , which is u but displaced along e direction (negative eigenvalue’seigenvector for at least a certain distance), will decreases the function value if { u t } is bounded. (Lemma 18).The following lemma combines the above two lemmas: Lemma 19.
There exists a universal constant ˆ c max , for any δ ∈ (0 , dκe ] , let f ( · ) , ˆ x satisfies thefollowing conditions (cid:107)∇ f (ˆ x ) (cid:107) ≤ G and λ min ( ∇ f (ˆ x )) ≤ − γ and e be the minimum eigenvector of ∇ f (ˆ x ) . Consider two algorithm sequences { u t } , { w t } withinitial points u , w satisfying: (cid:107) u − ˆ x (cid:107) ≤ R , w = u + µ · R · e , µ ∈ [ δ/ (2 √ d ) , Then, for any step size η ≤ ˆ c max /(cid:96) , at least one of the following is true • there exists T u ≤ c max T such that f ( u ) − f ( u T u ) ≥ . F • there exists T w ≤ c max T such that f ( w ) − f ( w T w ) ≥ . F Proof of Lemma 19.
Let ( c (1)max , c ) be the absolute constant so that Lemma 18 holds. Choose ˆ c max = min { , c (1)max , c } Let T (cid:63) = cT . Notice that by definition T (cid:63) ≤ c max T . Finally , define: T ◦ = inf t { t | f ( u ) − f ( u t ) ≥ . F } Let’s consider following two cases:
Case T ◦ ≤ T (cid:63) : Clearly for this case we have for T u = T ◦ that f ( u ) − f ( u T u ) ≥ . F Case T ◦ > T (cid:63) : In this case, by Lemma 17, we know (cid:107) u t − u (cid:107) ≤ O ( S ) for all t ≤ T (cid:63) . Define T ◦◦ = inf t { t | f ( w ) − f ( w t ) ≥ . F } By Lemma 18, we immediately have T ◦◦ ≤ T (cid:63) = cT . Clearly for this case we have for T u = T ◦◦ we have that f ( w ) − f ( w T w ) ≥ . F . Lemma 20.
Let f be a (cid:96) -gradient Lipschitz and ρ -Hessian Lipschitz function. There exists universalconstant c max , for any δ ∈ (0 , dκe ] , suppose we start with point ˆ x satisfying following conditions: (cid:107)∇ f (ˆ x ) (cid:107) ≤ G and λ min ( ∇ f (ˆ x )) ≤ − γ Let x = ˆ x + ξ where ξ come from the uniform distribution over ball with radius r = R , and let x t be the iterates of approximate gradient descent from x and T = T c max . Then, when step size η ≤ c max /(cid:96) , with at least probability − δ , we have that: ∃ t ≤ T : f (ˆ x ) − f ( x t ) ≥ F Proof of Lemma 20.
By adding perturbation, in worst case we increase function value by: f ( x ) − f (ˆ x ) ≤ ∇ f (ˆ x ) (cid:62) ξ + (cid:96) (cid:107) ξ (cid:107) ≤ (cid:96) R = 3 (cid:96) S κ p = 3 (cid:96) F pγ κ p ≤ F κp ≤ F We know x come from the uniform distribution over B ˆ x ( r ) . Let A ⊂ B ˆ x ( r ) denote the set of badstarting points A = { x ∈ B ˆ x ( r ) | ∀ t ≤ T : f ( x ) − f ( x t ) < . F } otherwise if x ∈ B ˆ x ( r ) \ A , we have that ∃ t ≤ T : f ( x ) − f ( x t ) ≥ . F By applying Lemma 18, we know for any x ∈ A , it is guaranteed that x ± µr e (cid:54)∈ A where µ ∈ [ δ √ d , where e is the eigenvector of ∇ f (ˆ x ) with the smallest negative eigenvalue.Let us denote I A ( · ) be the indicator function of being inside set A . For a vector x let us define thefollowing quantities x e = (cid:104) x , e (cid:105) x ¬ e = (cid:89) R d \{ e } x Recall B ( d ) ( r ) be d -dimensional ball with radius r . By calculus, this gives an upper bound on thevolume of A : Vol ( A ) = (cid:90) B ( d )ˆ x ( r ) d x · I A ( x )= (cid:90) B ( d − x ( r ) d x ¬ e (cid:90) ˆ x e + √ r −(cid:107) ˆ x ¬ e − x ¬ e (cid:107) ˆ x e − √ r −(cid:107) ˆ x ¬ e − x ¬ e (cid:107) d x e · I A ( x ) ≤ (cid:90) B ( d − x ( r ) d x ¬ e · (cid:18) · δ √ d r (cid:19) = Vol ( B ( d − ( r )) × δr √ d Then, we immediately have the ratio:Vol ( A ) Vol ( B ( d )ˆ x ( r )) ≤ δr √ d × Vol ( B ( d − ( r )) Vol ( B ( d )0 ( r )) = δ √ πd Γ( d + 1)Γ( d + ) ≤ δ √ πd · (cid:114) d ≤ δ The second last inequality is by the property of Gamma function that Γ( x +1)Γ( x +1 / < (cid:113) x + as long as x ≥ . Therefore, with at least probability − δ , x (cid:54)∈ A . In this case, we have that there exists a t ≤ T : f (ˆ x ) − f ( x t ) = f (ˆ x ) − f ( x ) + f ( x ) − f ( x t ) ≤ . F − . F ≥ F which finishes the proof. 37t is easy to check that our initial Lemma 16 can be derived by substituting η = c(cid:96) , γ = √ ρ(cid:15), δ = d(cid:96) √ ρ(cid:15) e − χ and simply applying the definitions of G , T , F , g thres , t thres , f thresthres
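For completeness, the following Monte Carlo sketch (our illustration only; the dimension, radius and $\delta$ are arbitrary assumptions) visualizes the volume argument used above: a point drawn uniformly from the ball $B_{\hat x}(r)$ falls inside a slab of width $\frac{\delta r}{\sqrt d}$ around any fixed hyperplane through the center with probability at most roughly $\delta$, which is why the added perturbation avoids the "stuck" region $A$ with probability at least $1-\delta$.

```python
import numpy as np

def sample_ball(rng, d, r, n):
    """Uniform samples from the d-dimensional ball of radius r centered at 0."""
    x = rng.normal(size=(n, d))
    x /= np.linalg.norm(x, axis=1, keepdims=True)      # uniform random directions
    radii = r * rng.random(n) ** (1.0 / d)             # radius with density prop. to rho^(d-1)
    return x * radii[:, None]

rng = np.random.default_rng(0)
d, r, delta, n = 50, 1.0, 0.1, 200_000
xi = sample_ball(rng, d, r, n)

# The "bad" starting points lie in a slab of half-width (delta / (2*sqrt(d))) * r
# along the minimum-eigenvalue direction e_1 (cf. the proof of Lemma 20).
half_width = delta * r / (2.0 * np.sqrt(d))
frac = np.mean(np.abs(xi[:, 0]) <= half_width)
print(frac, delta)   # the empirical fraction stays below delta (roughly 0.4 * delta for large d)
```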