Boundary Conditions for Linear Exit Time Gradient Trajectories Around Saddle Points: Analysis and Algorithm
Rishabh Dixit · Waheed U. Bajwa
Submitted: January 6, 2021
Abstract
Gradient-related first-order methods have become the workhorse of large-scale numerical optimization problems. Many of these problems involve nonconvex objective functions with multiple saddle points, which necessitates an understanding of the behavior of discrete trajectories of first-order methods within the geometrical landscape of these functions. This paper concerns convergence of first-order discrete methods to a local minimum of nonconvex optimization problems that comprise strict saddle points within the geometrical landscape. To this end, it focuses on analysis of discrete gradient trajectories around saddle neighborhoods, derives sufficient conditions under which these trajectories can escape strict-saddle neighborhoods in linear time, explores the contractive and expansive dynamics of these trajectories in neighborhoods of strict-saddle points that are characterized by gradients of moderate magnitude, characterizes the non-curving nature of these trajectories, and highlights the inability of these trajectories to re-enter the neighborhoods around strict-saddle points after exiting them. Based on these insights and analyses, the paper then proposes a simple variant of the vanilla gradient descent algorithm, termed the Curvature Conditioned Regularized Gradient Descent (CCRGD) algorithm, which utilizes a check for an initial boundary condition to ensure its trajectories can escape strict-saddle neighborhoods in linear time. Convergence analysis of the CCRGD algorithm, which includes its rate of convergence to a local minimum within a geometrical landscape that has a maximum number of strict-saddle points, is also presented in the paper. Numerical experiments are then provided on a test function as well as a low-rank matrix factorization problem to evaluate the efficacy of the proposed algorithm.
Keywords
Boundary conditions · Linear exit time · Nonconvex optimization · Strict-saddle property
Mathematics Subject Classification (2010) · · ·

This work was supported in part by the National Science Foundation under grants CCF-1453073, CCF-1910110, and CCF-1907658, by the Army Research Office under grant W911NF-17-1-0546, and by the DARPA Lagrange Program under ONR/SPAWAR contract N660011824020.

R. Dixit
Department of Electrical and Computer Engineering
Rutgers University–New Brunswick, NJ 08854 USA
E-mail: [email protected]
W. U. Bajwa (corresponding author)
Department of Electrical and Computer Engineering
Department of Statistics
Rutgers University–New Brunswick
94 Brett Rd, Piscataway, NJ 08854 USA
Tel.: +1-848-445-8541
E-mail: [email protected]
1 Introduction

The gradient descent method and its (stochastic) variants have been at the forefront of nonconvex optimization for nearly a decade. Many of these variants stem from the earliest works like [8, 13, 36], the interior-point method [18, 26, 30], and their stochastic counterparts. But the highly complicated geometrical landscape of many nonconvex functions often puts the efficacy of these algorithms into question, even though they otherwise have robust performance in convex settings. Indeed, problems involving matrix factorization [21], neural networks [37], rank minimization [6], etc., can be highly nonconvex, wherein the function geometry can possess many saddle points that create regions of very small gradient magnitudes, something which the gradient-related methods rely upon heavily. As a consequence, travel times for trajectories generated by these methods in such regions can be exponentially large, thereby defeating the purpose of optimization. However, large travel times around saddle points are not inevitable for gradient-based methods, as proved in [10], which gives a linear exit-time bound for first-order approximations of gradient trajectories provided some necessary boundary conditions are satisfied by the trajectories. Such analysis suggests the existence of gradient-based methods capable of 'fast' traversal of geometrical landscapes of nonconvex functions under appropriate conditions. Development of such methods, however, necessitates an exhaustive geometric analysis of saddle neighborhoods so as to leverage any initial boundary conditions required by the gradient trajectory around saddle points in order to reduce the total travel time on the entire function landscape.

To this end, we first study in this paper the problem of developing sufficient boundary conditions for gradient trajectories around any saddle point $x^*$ of some nonconvex function $f(x)$ that can guarantee linear exit time, i.e., $K_{exit} = O(\log(\varepsilon^{-1}))$, from the open saddle neighborhood $B_\varepsilon(x^*)$. This problem focuses on a closed neighborhood $\bar{B}_\varepsilon(x^*)$ around the saddle point $x^*$, with the current iterate $x_0$ sitting on the boundary of this neighborhood, i.e., $x_0 \in \bar{B}_\varepsilon(x^*) \setminus B_\varepsilon(x^*)$. Suppose also that the gradient trajectory starting at $x_0$ has approximately linear exit time from this region $B_\varepsilon(x^*)$. (Existence of such trajectories is guaranteed by the analysis in [10].) Then, the question posed here is: what are the sufficient conditions on $x_0$ such that the trajectory can escape $B_\varepsilon(x^*)$ in almost linear time of order $O(\log(\varepsilon^{-1}))$? Once the sufficient conditions have been derived, we next study the question of whether it is possible to get either linear rates or approximately linear rates of travel by the same gradient trajectory in some bigger neighborhood $B_\xi(x^*) \supset B_\varepsilon(x^*)$. Note that, unlike the matrix perturbation-based analysis in [10], the radius $\xi$ of the bigger neighborhood needs to be characterized by a fundamentally different proof technique. This is because the eigenspace of the Hessian $\nabla^2 f(x)$ for any $x \in B_\xi(x^*) \setminus B_\varepsilon(x^*)$ cannot be obtained by perturbing the eigenspace of $\nabla^2 f(x^*)$, since the series expansion of $\nabla^2 f(x)$ about $\nabla^2 f(x^*)$ may not necessarily converge from matrix perturbation theory. Third, after such approximate linear rates have been obtained, we then study whether it is possible to develop a robust algorithm that leverages the boundary conditions so as to steer the gradient trajectory away from $B_\varepsilon(x^*)$ in almost linear time.
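To make the role of the initial boundary condition concrete before the formal development, the following self-contained sketch runs vanilla gradient descent on the toy quadratic saddle $f(x) = \frac{1}{2}(x_1^2 - x_2^2)$ and records the exit time from $B_\varepsilon(x^*)$ as the unstable-subspace projection of the boundary initialization is varied; the quadratic model, the dimensions, and all numeric values here are illustrative assumptions rather than the paper's general setting.

```python
import numpy as np

# Toy strict saddle: f(x) = 0.5 * x^T H x with H = diag(1, -1),
# so x* = 0 is a strict saddle (beta = 1) and L = 1. (Illustrative assumption.)
H = np.diag([1.0, -1.0])
L, eps, alpha = 1.0, 1e-2, 1.0          # step-size alpha = 1/L

def exit_time(theta_us, max_iter=10_000):
    """Iterations of GD needed to leave the ball B_eps(0), starting on its
    boundary with unstable-subspace projection theta_us."""
    theta_s = np.sqrt(1.0 - theta_us**2)
    x = eps * np.array([theta_s, theta_us])   # x0 on the boundary of B_eps
    for k in range(1, max_iter + 1):
        x = x - alpha * (H @ x)               # vanilla gradient descent
        if np.linalg.norm(x) >= eps:
            return k
    return np.inf

for theta_us in [0.5, 1e-2, 1e-4, 1e-8, 0.0]:
    print(f"theta_us = {theta_us:.0e}  ->  K_exit = {exit_time(theta_us)}")
```

On this toy model the exit time grows like $\log(1/\theta_{us})$, and the trajectory never escapes when the initialization has no unstable component, which is precisely the degenerate case that the boundary conditions developed in this paper rule out.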
Finally, we seek an answer to the question of what the rate of convergence of the developed algorithm to some local minimum would be within the global landscape of the nonconvex function.

To address all these problems effectively, we engage in a rigorous analysis of trajectories of the vanilla gradient descent method, starting off directly where we left off in [10]. First, we utilize tools from matrix perturbation theory to develop sufficient conditions on $x_0 \in \bar{B}_\varepsilon(x^*) \setminus B_\varepsilon(x^*)$ for which the subsequent gradient trajectory has linear exit time from $B_\varepsilon(x^*)$. Next, we prove a rather intuitive yet extremely powerful result, termed the sequential monotonicity of gradient trajectories, which establishes that the gradient trajectories in a neighborhood of the saddle point first exhibit contractive dynamics up to some point and thereafter strictly expansive dynamics. Next, we provide an analysis of the travel time for the gradient trajectory in the region $B_\xi(x^*) \setminus B_\varepsilon(x^*)$ using the sequential monotonicity result. Finally, we develop a novel gradient-based algorithm, termed Curvature Conditioned Regularized Gradient Descent (CCRGD), around the idea of sufficient boundary conditions with a robust check condition guaranteeing almost linear exit time from $B_\varepsilon(x^*)$. In doing so, we also prove certain qualitative lemmas about the local behavior of gradient trajectories around saddle points. Thereafter, the rate of convergence of CCRGD to a local minimum is proved using these lemmas. Finally, the performance of CCRGD is evaluated on two problems: a test function for nonconvex optimization and a low-rank matrix factorization problem.

(Since this work is a continuation of [10], we refrain from elaborating certain terminologies and definitions that were covered in detail in [10], though a summary of all the required concepts is provided in Sec. 3.1 to make this a self-contained paper.)

1.1 Relation to Prior Work

Since this work directly extends the results in [10], we steer away from repeating the discussion in [10, Sec. 1.1] in relation to existing convergence guarantees for gradient-related methods in nonconvex settings. Instead, we primarily focus in this section on presenting comparisons and highlighting key differences between our contributions and the existing literature. In addition, given the vast interest of the optimization community in nonconvex optimization using gradient-related methods, we also discuss some additional relevant works here.

First, in this work we develop proofs of convergence to a local minimum and, most importantly, obtain algorithmic rates using the geometry of the function landscape near saddle points and in regions that have sufficient gradient magnitudes. Though this idea of rate analysis has been well summarized in [33] for gradient-related sequences and more recently in [31] for Newton methods, these works refrain from utilizing the nonconvex geometry to its fullest extent. Within the regions of "moderate gradients" around saddle points, i.e., the shell $B_\xi(x^*) \setminus B_\varepsilon(x^*)$, we construct an invex envelope function $\hat{f}(\cdot)$ for $f(\cdot)$ using the sequential monotonicity property (detailed in Theorem 2) that satisfies the Polyak–Łojasiewicz condition whenever the iterate sequence $\{x_k\}$ has contractive dynamics with respect to $x^*$.
Consequently, linear rates of contraction to some point on the boundary $\bar{B}_\varepsilon(x^*) \setminus B_\varepsilon(x^*)$ are recovered for the envelope function $\hat{f}(\cdot)$ under the sequential monotonicity property, which aids in our convergence analysis. Note that this would not be possible for $f(\cdot)$ directly, since it has saddle points and hence cannot be invex. This analytical approach is in contrast to [4] and [5], which utilize the more general Kurdyka–Łojasiewicz property [24] and, as before, do not utilize the function geometry completely. In our work, we categorize the function geometry into regions near and regions away from stationary points so as to better analyze escape conditions from saddle neighborhoods and at the same time generate convergence guarantees to some local minimum. Regions near saddle points are analyzed using matrix perturbation theory, whereas regions away from saddle points utilize properties like sequential monotonicity (cf. Theorem 2), which differs from the monotonicity of gradient sequences used in [24].

Next, to the best of our knowledge, no other work has provided sufficient boundary conditions for escape from saddle neighborhoods for the case of discrete-time gradient descent-related algorithms. Though the idea is not necessarily new and has been explored in the context of continuous-time dynamical systems, specifically boundary value problems, it is still nascent when it comes to analyzing saddle points. The continuous-time works such as [5, 15, 20] have been discussed in detail in [10]. However, even these works do not analyze boundary conditions for continuous trajectories. The work [15] does take into account cascaded saddles encountered by continuous trajectories, which get a detailed treatment in our work in Theorems 4 and 5 for discrete trajectories.

The Stochastic Differential Equation (SDE) setup has also been utilized in a recent work [38] to study gradient-based (stochastic) methods for nonconvex optimization in the continuous-time setting. Interestingly, this work considers the set of index-1 saddle points in the function's geometry and thereby obtains a stochastic rate of convergence to a global minimum, where the rate is of order constant plus some geometric term. While the rate is almost geometric, this work assumes the confining condition (a sufficient growth condition on the function away from the origin) and the Villani condition (growth of the gradient norm with respect to the Laplacian), neither of which is assumed in this work. Also, the constant in the non-geometric term of their rate depends on the horizon $T$ obtained from discretization of the SDE, which could be large. Coincidentally, our work provides a rate of convergence to a local minimum in Theorem 5 that is a sum of geometric terms and constant-order terms, where the geometric terms arise near saddle points and the constant-order terms result from the travel time between saddle neighborhoods. Moreover, their work does not provide any insights into the behavior of discrete gradient trajectories around first-order saddle points. Along the lines of the work [22], which uses
the Stable Manifold Theorem [19], we prove that the trajectories generated by CCRGD (Algorithm 1 in this paper) converge to a local minimum almost surely.

Recently, within the class of discrete-time non-acceleration-based methods, [11, 16] provide rates for escaping saddles using perturbed gradient descent, [40] utilizes the notion of variational coherence between the stochastic mirror gradient and the descent direction in quasi-convex and nonconvex problems for obtaining ergodic rates of convergence to a local/global minimum (under certain conditions), and [9] provides rates and escape guarantees under certain strong assumptions of high correlation between the negative curvature direction and a random perturbation vector. However, none of these stochastic variants explore the idea of initial boundary conditions near saddle points so as to obtain linear rates. It should be noted that the work in [11] shows that the time to escape cascaded saddles scales exponentially, which coincides with the findings from Theorem 5, where the exponential factor comes from the number of saddles encountered.

The next set of related discrete-time gradient-based methods comprises first-order methods leveraging acceleration and momentum techniques. For instance, the work in [35] provides an extension of SGD to methods like the Stochastic Variance Reduced Gradient (SVRG) algorithm for escaping saddles. Recently, methods approximating the second-order information of the function that preserve the first-order nature of the algorithm have also been employed to escape saddles. Examples include [17], where the authors prove that an acceleration step in gradient descent guarantees escape from saddle points, and the method in [39], which utilizes the second-order nature of the acceleration step combined with a stochastic perturbation to guarantee escape rates. Moreover, both [1, 2] build on the idea of utilizing acceleration as a source of finding the negative curvature direction. While both these works rely on the finite-sum structure and adapt the acceleration step through variance reduction, the former adds a strong convexity term to its iterate to boost convergence. Due to the low computational cost of evaluating gradients, we also make use of such connections between the curvature magnitude and the gradient difference in our proposed algorithm (Algorithm 1). In the class of first-order algorithms, there also exist trust region-based methods. The work in [12] is one such method that presents a novel stopping criterion with a heavy-ball controlled mechanism for escaping saddles using the SGD method. If the SGD iterate escapes some neighborhood in a certain number of iterations, the algorithm is restarted with the next round of SGD; else the ergodic average of the iterate sequence is designated to be a second-order stationary solution. In a similar vein, we formally derive in Lemma 5 the escape guarantees from a neighborhood around a saddle point and utilize that result within the proposed Algorithm 1.

Lastly, higher-order methods are discussed in [27, 32], which utilize either a Hessian-based method or a second-order step combined with first-order algorithms so as to reach a local minimum with fast speed while trading off computational costs. Going a step further, the work in [3] poses the escape problem with second-order saddles, thereby making higher-order methods an absolute necessity.
Though these techniques optimize well over certain pathological functions, like those having degenerate saddles or very ill-conditioned geometries, they suffer heavily in terms of complexity; specifically, the work [3] requires third-order methods to solve for a feasible descent direction. This further motivates us to develop a hybrid algorithm for the saddle escape problem that captures the advantages of a Hessian-based method while remaining low in computational complexity.

1.2 Our Contributions

This work starts off directly from the point where we left off in [10], where we obtained exit time bounds for $\varepsilon$-precision gradient descent trajectories around saddle points and derived the necessary condition on the initial unstable subspace projection for linear exit time. The first novel result in this work is the development of a bound on the initial unstable subspace projection in Theorem 1 that approximately guarantees the linear exit time bound from [10, Theorem 2]. Our second contribution is Theorem 2, in which we analyze the behavior of gradient descent trajectories in some region $B_\xi(x^*) \supset B_\varepsilon(x^*)$ where the approximate analysis from matrix perturbation theory may not necessarily hold. In such an augmented neighborhood of the strict saddle point $x^*$, we prove that the gradient descent trajectories have a sequentially monotonic behavior, i.e., there exists some $\xi$ such that the trajectory inside $B_\xi(x^*)$ first exhibits contractive dynamics moving towards $x^*$ and then has expansive dynamics for the entire time that it stays inside $B_\xi(x^*)$. Though this property may appear to be trivial for trajectories around saddle points, it is extremely important in developing improved rates/travel times of the gradient descent trajectories inside $B_\xi(x^*)$, which follows from our next contribution. Our third contribution is Theorem 3, in which we obtain upper bounds on the travel time of the gradient trajectory inside the shell $B_\xi(x^*) \setminus B_\varepsilon(x^*)$, which we denote by $K_{shell}$. This particular region is of great importance since we can categorize it as a region of "moderate" gradients (gradient magnitude not too small) that still inherits certain geometric properties, such as the minimum curvature, from the smaller saddle neighborhood $B_\varepsilon(x^*)$. Without taking such properties into consideration, the journey time in this shell could be naively upper bounded as $K_{shell} = O(\varepsilon^{-2})$ by just using the gradient Lipschitz condition. Hence, it is imperative to separately analyze the journey time inside the shell $B_\xi(x^*) \setminus B_\varepsilon(x^*)$ so as to improve upon the standard nonconvex rate of $O(\varepsilon^{-2})$.

Our next set of contributions corresponds to Lemmas 1–5, in which we provide insights into certain qualitative properties of the gradient descent trajectories around saddle points. Lemma 1 addresses the approximately hyperbolic nature of the gradient trajectories near saddle points, while Lemma 2 proves that trajectories with linear exit time approximately never curve around saddle points. Lemma 3 shows that the gradient trajectory can only exit $B_\varepsilon(x^*)$ at those points where the function value is approximately less than $f(x^*)$. Lemma 4 establishes that the gradient trajectory, once it exits the neighborhood $B_\varepsilon(x^*)$, can never re-enter it, while Lemma 5 extends the same result to the bigger neighborhood $B_\xi(x^*)$ under certain stricter conditions.
Our next contribution, which is also a major one, is the development of the Curvature Conditioned Regularized Gradient Descent (CCRGD) algorithm (cf. Algorithm 1), which provably escapes saddle neighborhoods and yields second-order stationary solutions. The algorithm checks for a curvature condition near the saddle neighborhood and decides whether to perform a second-order iteration for one step or continue using the vanilla gradient descent method. The curvature condition (Step 15 in Algorithm 1) is derived from our proof of convergence of the algorithm; in addition, Algorithm 1 is tested for its efficacy on a modified Rastrigin function (a test function for nonconvex optimization) and the matrix factorization problem as part of the numerical experiments. Last, but not least, the final contribution of this work is the derivation of the rate of convergence of an iterate sequence generated by Algorithm 1 to a local minimum. The rates are obtained for the more general setting of cascaded saddles, where the number of saddles encountered and the total time of convergence are bounded in Theorems 4 and 5.

1.3 Notations

Similar to our notation in [10], all vectors are denoted by bold lower-case letters, all matrices by bold upper-case letters, $\mathbf{0}$ is the $n$-dimensional null vector, $\mathbf{I}$ represents the $n \times n$ identity matrix, and $\langle\cdot,\cdot\rangle$ represents the inner product of two vectors. In addition, unless otherwise stated, all vector norms $\|\cdot\|$ are $\ell_2$ norms, while the matrix norm $\|\cdot\|$ denotes the operator norm. Further, the symbol $(\cdot)^T$ is the transpose operator, the symbol $O$ represents the Big-O notation, and $W(\cdot)$ is the Lambert W function [7]. Throughout the paper, $k$, $K$ are used for discrete time. Next, $\gtrsim$ and $\lesssim$ represent the 'approximately greater than' and 'approximately less than' symbols, respectively. Also, for any matrix expressed as $Z + O(c)$, where $c$ is some scalar, the matrix-valued perturbation term $O(c)$ is with respect to the Frobenius norm. Finally, the operator $\mathrm{dist}(\cdot,\cdot)$ gives the distance between two sets, whereas $\mathrm{diam}(\cdot)$ gives the diameter of a set.

2 Problem Formulation

Consider a nonconvex smooth function $f(\cdot)$ that has strict first-order saddle points in its geometry. By strict first-order saddle points, we mean that the Hessian of the function $f(\cdot)$ at these points has at least one negative eigenvalue, i.e., the function has negative curvature. Next, consider some (open) neighborhood $B_\varepsilon(x^*)$ around a given saddle point $x^*$, where $\varepsilon$ is bounded above from [10, Theorem 2]. Also, it is given that the initial iterate $x_0$ of the gradient trajectory sits on the boundary of the ball, i.e., $x_0 \in \bar{B}_\varepsilon(x^*) \setminus B_\varepsilon(x^*)$, and the gradient trajectory exits $B_\varepsilon(x^*)$ in linear time bounded by [10, Theorem 2]. With this information, we are interested in finding the sufficient conditions on $x_0$ that guarantee the linear exit time. In addition, we are required to analyze the gradient trajectories in some larger ball $B_\xi(x^*) \supset B_\varepsilon(x^*)$ such that the trajectories first contract towards the saddle point and then expand away from it. More importantly, we are interested in finding such $\xi > \varepsilon$ for which the gradient trajectory has approximately linear travel time in the shell $B_\xi(x^*) \setminus B_\varepsilon(x^*)$. Next, we are required to find certain local properties of $f(\cdot)$ for which the gradient trajectories, having escaped it once, can never re-enter the ball $B_\xi(x^*)$.
Finally, we have to develop a robust, low-complexity algorithm that uses the sufficient conditions, can traverse the geometry of saddle neighborhoods in linear time, and admits a rate of convergence to some local minimum.

Having briefly stated the problem, we now formally state the set of assumptions that are required for this problem to be tackled in this work.

2.1 Assumptions

A1.
The function $f : \mathbb{R}^n \to \mathbb{R}$ is $C^\omega$, i.e., all the derivatives of this function are continuous and the function $f(\cdot)$ is analytic.

A2.
The gradient of the function $f(\cdot)$ is $L$-Lipschitz continuous: $\|\nabla f(x) - \nabla f(y)\| \le L\|x - y\|$.

A3.
The Hessian of the function $f(\cdot)$ is $M$-Lipschitz continuous: $\left\|\nabla^2 f(x) - \nabla^2 f(y)\right\| \le M\|x - y\|$.

A4.
The function $f(\cdot)$ has only well-conditioned first-order stationary points, i.e., no eigenvalue of the function's Hessian is close to zero at these points. Formally, if $x^*$ is a first-order stationary point of $f(\cdot)$, then we have:
$$\nabla f(x^*) = \mathbf{0}, \quad \text{and} \quad \min_i \left|\lambda_i\!\left(\nabla^2 f(x^*)\right)\right| > \beta,$$
where $\lambda_i(\nabla^2 f(x^*))$ denotes the $i$th eigenvalue of the matrix $\nabla^2 f(x^*)$ and $\beta > 0$. Note that such a function is termed a Morse function. Also, there exists a neighborhood $W$ of $x^*$ such that $\min_i |\lambda_i(\nabla^2 f(x))| > \beta$ for any $x \in W$.

A5.
The function $f(\cdot)$ has only first-order saddle points in its geometry. Notice that, as a consequence of the previous assumption, these first-order saddle points are strict saddle points, i.e., if a first-order stationary point $x^*$ is a first-order saddle point, then for at least one eigenvalue of $\nabla^2 f(x^*)$ we have $\lambda_i(\nabla^2 f(x^*)) < -\beta$.

A6.
Let $G$ be the set of eigenvalues of the Hessian $\nabla^2 f(x^*)$ at a strict saddle point $x^*$. Note that these eigenvalues can always be grouped into $m$ disjoint sets $\{G_1, G_2, \ldots, G_m\}$ based on the level of degeneracy of the eigenvalues (closeness to one another), where $1 \le m \le n$ due to the previous two assumptions; thus,
$$\bigcup_{p=1}^{m} G_p = G \tag{1}$$
and for every $i$ we have $\lambda_i(\nabla^2 f(x^*)) \in G_p$ for some unique $p$. We assume that these disjoint sets of eigenvalues satisfy the following conditions for some $\delta > 0$:
$$\mathrm{dist}(G_p, G_q) \ge \delta \quad \forall\, G_p, G_q \text{ s.t. } p \ne q, \quad \text{and} \tag{2}$$
$$\max_p \{\mathrm{diam}(G_p)\} = O(\varepsilon), \tag{3}$$
where $\varepsilon$ is the radius of the ball $B_\varepsilon(x^*)$ around $x^*$ in consideration.

Note that, as a consequence of the last assumption A6 and the strict-saddle property, we get the following necessary condition:
$$\beta \ge \delta. \tag{4}$$

3 Preliminaries

Given the ball $B_\varepsilon(x^*)$ for some strict saddle point $x^*$, the goal is to select those gradient trajectories in $B_\varepsilon(x^*)$ for which the exit time is of the order $K_{exit} = O(\log(\varepsilon^{-1}))$, i.e., of linear rate. Formally, the exit time for an iterate sequence $\{x_k\}$ of some trajectory in the ball $B_\varepsilon(x^*)$ is defined as the smallest positive index $K$ such that $\|x_K - x^*\| \ge \varepsilon$, and we are required to obtain such sequences $\{x_k\}$, generated by the gradient descent method, for which the exit time from the saddle neighborhood $B_\varepsilon(x^*)$ is linear. To conduct such analysis, certain essential concepts and definitions need to be elaborated, most of which were developed in the previous work [10].

First, due to the strict-saddle property, for any $x$ in an $\varepsilon$-neighborhood of $x^*$, i.e., $x \in B_\varepsilon(x^*)$, the vector $x - x^*$ belongs to a vector space $E = E_S \oplus E_{US}$, where
$$E_S = \{\mathrm{span}(v_i) \mid \lambda_i > 0\}, \quad N_S = \{i \mid \lambda_i > 0\}, \qquad E_{US} = \{\mathrm{span}(v_j) \mid \lambda_j < 0\}, \quad N_{US} = \{j \mid \lambda_j < 0\},$$
and $(\lambda_i, v_i)$ is the $i$th eigenvalue–eigenvector pair of the Hessian $\nabla^2 f(x^*)$.

Second, using 'degenerate' matrix perturbation theory, the Hessian $\nabla^2 f(x)$ at any point $x = x^* + p\,u$, where $p \in [0, 1]$ and $\|u\| \le \varepsilon$, can be given as
$$\nabla^2 f(x) = \nabla^2 f(x^*) + p\|u\|\, H(\hat{u}) + O(\varepsilon^2), \tag{5}$$
where $u := x - x^*$ is termed the radial vector, $\hat{u} = \frac{u}{\|u\|}$ is the unit radial vector, and we have that
$$H(\hat{u}) = \sum_{i=1}^{n} \left( \langle v_i, H(\hat{u}) v_i\rangle\, v_i v_i^T + \lambda_i \sum_{l \notin G_i} \frac{\langle v_l, H(\hat{u}) v_i\rangle}{\lambda_i - \lambda_l}\left( v_l v_i^T + v_i v_l^T \right) \right) \tag{6}$$
with $G_i = \{j \mid \lambda_j = \lambda_i \pm O(\varepsilon)\}$. For details, see Lemma 4 from [10].

The third concept can be regarded as the most important tool for developing the proof machinery of linear exit time; see Lemmas 5 and 6 from [10] for details. Specifically, it can be summarized as the 'Approximation Lemma' for a linear dynamical system.
Given some initialization $u_0$ of the radial vector and sufficiently small $\varepsilon$, we have for any iteration $K$ that
$$u_K = \prod_{k=0}^{K-1}\left[A_k + \varepsilon P_k\right] u_0,$$
where $\varepsilon P_k = B_k + O(\varepsilon^2)$, $B_k = O(\varepsilon)$ for $x_k \in B_\varepsilon(x^*)$, $\{A_k\}$ and $\{B_k\}$ are sequences of real symmetric matrices, and the $A_k$'s are invertible. When $K\varepsilon^2 \ll \varepsilon < \|A^{-1}\|^{-1}\|P\|^{-1}$, we have the condition:
$$\|A^{-1}\|^{-K}\left(1 - K\varepsilon\|P\|\|A^{-1}\| - O\!\left((K\varepsilon)^2\right)\right) \le \nu_n \le \cdots \le \nu_1 \le \|A\|^K\left(1 + K\varepsilon\|P\|\|A^{-1}\| + O\!\left((K\varepsilon)^2\right)\right),$$
where $\nu_n \le \cdots \le \nu_1$ are the absolute values of the eigenvalues of the matrix $\prod_{k=0}^{K-1}[A_k + \varepsilon P_k]$ and $\sup_{0\le k\le K-1}\|A_k\| = \|A\|$, $\sup_{0\le k\le K-1}\|A_k^{-1}\| = \|A^{-1}\|$, $\sup_{0\le k\le K-1}\|P_k\| = \|P\|$ for some matrices $A$ and $P$. Hence, $u_K = \prod_{k=0}^{K-1}[A_k + \varepsilon P_k]u_0$ can be expanded to first order in $\varepsilon$, with the first-order approximation called $\tilde{u}_K$, and the trajectory generated by the sequence $\{\tilde{u}_K\}$ is termed the $\varepsilon$-precision trajectory. Thus the gradient update $x_{K+1} = x_K - \alpha\nabla f(x_K)$ near $x^*$ can be written as $u_K = \prod_{k=0}^{K-1}[A_k + \varepsilon P_k]u_0$ for $u_K = x_K - x^*$, $A_K = I - \alpha\nabla^2 f(x^*)$, and $\varepsilon P_K = -\alpha\|u_K\| H(\hat{u}_K) + O(\varepsilon^2)$.

Fourth, from Lemma 7 of [10], the 'minimal' $\varepsilon$-precision trajectory has the maximum exit time. More rigorously, let $S_\varepsilon = \left\{\{\tilde{u}^\tau_K\}_{K=0}^{K^\tau_{exit}} \,\middle|\, u_0\right\}$ be the set of $\tau$-parametrized $\varepsilon$-precision trajectories generated by expanding $u_K$ to first order in $\varepsilon$, where $\tau$ varies with variations in the perturbation sequence $\{P_k\}_{k=0}^{K}$. Let $K^\tau_{exit}$ be the exit time of the $\tau$-parametrized trajectory $\{\tilde{u}^\tau_K\}_{K=0}^{K^\tau_{exit}}$ from the ball $B_\varepsilon(x^*)$, where we have $K^\tau_{exit} = \inf_{K\ge 1}\left\{K \mid \|\tilde{u}^\tau_K\| > \varepsilon\right\}$. Let $K_\iota$ be defined as:
$$K_\iota = \inf_{K\ge 1}\left\{K \,\middle|\, \inf_\tau\left\{\|\tilde{u}^\tau_K\|\right\} > \varepsilon\right\}. \tag{7}$$
Then the following inequality holds:
$$K_\iota \ge \sup_\tau\left\{K^\tau_{exit}\right\} = \sup_\tau \inf_{K\ge 1}\left\{K \mid \|\tilde{u}^\tau_K\| > \varepsilon\right\}.$$

Finally, the linear exit time theorem for the $\varepsilon$-precision trajectories (Theorem 2 in [10]) states that, for gradient descent with $\alpha = \frac{1}{L}$ on a well-conditioned function, i.e., $\frac{\beta}{L} > \frac{\varepsilon M}{L}$, and some minimum projection $\sum_{j\in N_{US}}(\theta^{us}_j)^2 \ge \Delta$ of the initial radial vector $u_0$ on $E_{US}$, with $u_0 = \varepsilon\sum_{i\in N_S}\theta^s_i v_i + \varepsilon\sum_{j\in N_{US}}\theta^{us}_j v_j$, there exist $\varepsilon$-precision trajectories $\{\tilde{u}_K\}_{K=0}^{K_{exit}}$ with linear exit time. Moreover, their exit time $K_{exit}$ from $B_\varepsilon(x^*)$ is approximately upper bounded as:
$$K_{exit} < K_\iota \lesssim \log\!\left(\left(1+\frac{\varepsilon M}{L}\right)\log\!\left(\frac{1+\frac{\varepsilon M}{L}}{1+\frac{\beta}{L}-\frac{\varepsilon M}{L}}\right)\frac{\delta}{\varepsilon M n}\right)\Bigg/\log\!\left(\frac{1+\frac{\varepsilon M}{L}}{1+\frac{\beta}{L}-\frac{\varepsilon M}{L}}\right). \tag{8}$$
In [10, Theorem 2], we provide a necessary initial condition for the linear exit time bound, which is
$$\Delta > \frac{\varepsilon M L n}{\delta(L+\beta)} = O(\varepsilon),$$
where it is required that $\sum_{j\in N_{US}}(\theta^{us}_j)^2 \ge \Delta$. In this work we provide the sufficient boundary conditions for linear exit time $\varepsilon$-precision trajectories.

Before moving to the next section, which details the sufficient conditions, we show that the $\varepsilon$-precision trajectory $\{\tilde{u}_K\}_{K=0}^{K_{exit}}$, generated by expanding the matrix product in the expression $u_K = \prod_{k=0}^{K-1}[A_k + \varepsilon P_k]u_0$ to first order in $\varepsilon$, has a very small relative error compared to the exact trajectory.
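As a numerical sanity check on the first-order expansion underlying the $\varepsilon$-precision trajectory, the following sketch compares the exact product $\prod_k(A_k + \eta P_k)u_0$ with its first-order expansion for randomly drawn diagonal $A_k$ and symmetric $P_k$; the matrices, dimensions, and the scale $\eta$ are arbitrary stand-ins for the $A_k$, $P_k$, and $\varepsilon$ of the Approximation Lemma, not quantities from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n, K, eta = 4, 8, 1e-3          # eta stands in for the eps-scale perturbation

A = [np.diag(rng.uniform(0.5, 1.5, n)) for _ in range(K)]   # symmetric, invertible
P = [rng.standard_normal((n, n)) for _ in range(K)]
P = [(Mk + Mk.T) / 2 for Mk in P]                            # symmetrize
u0 = rng.standard_normal(n)

def chain(mats, lo, hi, v):      # apply mats[lo], ..., mats[hi-1] to v in order
    for k in range(lo, hi):
        v = mats[k] @ v
    return v

# Exact trajectory u_K = prod_{k=0}^{K-1} (A_k + eta P_k) u_0
u = u0.copy()
for k in range(K):
    u = (A[k] + eta * P[k]) @ u

# First-order ("eps-precision") expansion, cf. (9) below
u_tilde = chain(A, 0, K, u0.copy())
for r in range(K):
    u_tilde += eta * chain(A, r + 1, K, P[r] @ chain(A, 0, r, u0.copy()))

print("relative error:", np.linalg.norm(u - u_tilde) / np.linalg.norm(u))
# The residual scales like (K * eta)^2, matching the second-order remainder.
```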
Relative error of the $\varepsilon$-precision trajectory. By the definition of the $\varepsilon$-precision trajectory, we have that:
$$\tilde{u}_K = \prod_{k=0}^{K-1} A_k\, u_0 + \varepsilon\sum_{r=0}^{K-1}\left(\prod_{k=r+1}^{K-1}A_k\right)P_r\left(\prod_{k=0}^{r-1}A_k\right)u_0, \tag{9}$$
which is obtained by expanding the matrix product $\prod_{k=0}^{K-1}[A_k + \varepsilon P_k]$ to first order in $\varepsilon$. Now, using the 'Approximation Lemma' discussed above for $K\varepsilon^2 \ll \varepsilon < \|A^{-1}\|^{-1}\|P\|^{-1}$, where $\sup_{0\le k\le K-1}\|A_k\| = \|A\|$, $\sup_{0\le k\le K-1}\|A_k^{-1}\| = \|A^{-1}\|$, and $\sup_{0\le k\le K-1}\|P_k\| = \|P\|$ for some matrices $A$ and $P$, we get that:
$$u_K = \prod_{k=0}^{K-1}\left[A_k + \varepsilon P_k\right]u_0 \tag{10}$$
$$= \prod_{k=0}^{K-1}A_k u_0 + \varepsilon\sum_{r=0}^{K-1}\left(\prod_{k=r+1}^{K-1}A_k\right)P_r\left(\prod_{k=0}^{r-1}A_k\right)u_0 + O\!\left(\|A\|^K(K\varepsilon)^2\|P\|^2\|A^{-1}\|^2\|u_0\|\right) \tag{11}$$
$$= \tilde{u}_K + O\!\left(\|A\|^K(K\varepsilon)^2\varepsilon\right). \tag{12}$$

Next, from the proof of [10, Lemma 5] we recall that $A_k = \sum_{i\in N_S}c^s_i(k)\,v_i v_i^T + \sum_{j\in N_{US}}c^{us}_j(k)\,v_j v_j^T$, where $v_i$ and $v_j$ are the eigenvectors corresponding to the stable and unstable subspaces of $\nabla^2 f(x^*)$. Also, $u_0 = \varepsilon\sum_{i\in N_S}\theta^s_i v_i + \varepsilon\sum_{j\in N_{US}}\theta^{us}_j v_j$, and for $\alpha = \frac{1}{L}$ we have the bounds $1 + \frac{\beta}{L} - \frac{\varepsilon M}{L} \le c^{us}_j(k) \le 1 + \frac{\varepsilon M}{L}$ and $-\frac{\varepsilon M}{L} \le c^s_i(k) \le 1 - \frac{\beta}{L} + \frac{\varepsilon M}{L}$ (see [10, Lemma 5]). Hence we have that:
$$\|u_K\| = \left\|\prod_{k=0}^{K-1}\left[A_k + \varepsilon P_k\right]u_0\right\| \tag{13}$$
$$\ge \left\|\prod_{k=0}^{K-1}A_k u_0\right\| - \left\|\varepsilon\sum_{r=0}^{K-1}\left(\prod_{k=r+1}^{K-1}A_k\right)P_r\left(\prod_{k=0}^{r-1}A_k\right)u_0\right\| - O\!\left(\|A\|^K(K\varepsilon)^2\|P\|^2\|A^{-1}\|^2\|u_0\|\right) \tag{14}$$
$$\ge \left\|\prod_{k=0}^{K-1}A_k u_0\right\| - O\!\left(\|A\|^K(K\varepsilon)^2\|P\|\|A^{-1}\|\|u_0\|\right) \tag{15}$$
$$= \left\|\left(\prod_{k=0}^{K-1}c^s_i(k)\right)\varepsilon\sum_{i\in N_S}\theta^s_i v_i + \left(\prod_{k=0}^{K-1}c^{us}_j(k)\right)\varepsilon\sum_{j\in N_{US}}\theta^{us}_j v_j\right\| - O\!\left(\|A\|^K(K\varepsilon)^2\varepsilon\right) \tag{16}$$
$$\ge \varepsilon\left(\inf\{c^{us}_j(k)\}\right)^K\sqrt{\left(\frac{\inf\{c^s_i(k)\}}{\inf\{c^{us}_j(k)\}}\right)^{2K}\sum_{i\in N_S}(\theta^s_i)^2 + \sum_{j\in N_{US}}(\theta^{us}_j)^2} - O\!\left(\|A\|^K(K\varepsilon)^2\varepsilon\right) \tag{17}$$
$$\approx \varepsilon\left(1+\frac{\beta}{L}-\frac{\varepsilon M}{L}\right)^K\sqrt{\sum_{j\in N_{US}}(\theta^{us}_j)^2} - O\!\left(\|A\|^K(K\varepsilon)^2\varepsilon\right), \tag{18}$$
where we used $\inf\{c^{us}_j(k)\} = 1+\frac{\beta}{L}-\frac{\varepsilon M}{L}$, $\inf\{c^s_i(k)\} = -\frac{\varepsilon M}{L}$, $K\varepsilon^2 \ll \varepsilon \ll 1$, and $\|A\| = \sup\{\|A_k\|\} = \sup\{c^{us}_j(k)\} = 1+\frac{\varepsilon M}{L}$. Rearranging (12) and taking norms yields:
$$\|u_K - \tilde{u}_K\| = O\!\left(\|A\|^K(K\varepsilon)^2\varepsilon\right) = O\!\left(\left(1+\frac{\varepsilon M}{L}\right)^K(K\varepsilon)^2\varepsilon\right). \tag{19}$$
Finally, dividing (19) by (18), we get the following bound on the relative error:
$$\frac{\|u_K - \tilde{u}_K\|}{\|u_K\|} \le \frac{O\!\left(\left(1+\frac{\varepsilon M}{L}\right)^K(K\varepsilon)^2\varepsilon\right)}{\varepsilon\left(1+\frac{\beta}{L}-\frac{\varepsilon M}{L}\right)^K\sqrt{\sum_{j\in N_{US}}(\theta^{us}_j)^2} - O\!\left(\|A\|^K(K\varepsilon)^2\varepsilon\right)} \tag{20}$$
$$\le \frac{O\!\left(\left(\frac{1+\frac{\varepsilon M}{L}}{1+\frac{\beta}{L}-\frac{\varepsilon M}{L}}\right)^K(K\varepsilon)^2\right)}{\sqrt{\sum_{j\in N_{US}}(\theta^{us}_j)^2} - O\!\left(\left(\frac{1+\frac{\varepsilon M}{L}}{1+\frac{\beta}{L}-\frac{\varepsilon M}{L}}\right)^K(K\varepsilon)^2\right)} \tag{21}$$
$$\le \frac{O\!\left(\sqrt{\varepsilon}\left(\log\!\left(\frac{1}{\varepsilon}\right)\varepsilon\right)^2\right)}{\sqrt{\sum_{j\in N_{US}}(\theta^{us}_j)^2} - O\!\left(\sqrt{\varepsilon}\left(\log\!\left(\frac{1}{\varepsilon}\right)\varepsilon\right)^2\right)}, \tag{22}$$
where we have substituted the upper bound on $K_{exit}$ from (8) for $K$.
Now, if $\sqrt{\sum_{j\in N_{US}}(\theta^{us}_j)^2} > O\!\left(\sqrt{\varepsilon}\left(\log\!\left(\frac{1}{\varepsilon}\right)\varepsilon\right)^2\right)$, then the relative error is of order $O\!\left(\sqrt{\varepsilon}\left(\log\!\left(\frac{1}{\varepsilon}\right)\varepsilon\right)^2\right)$, which goes to 0 as $\varepsilon \to 0$.

4 Sufficient Boundary Conditions for Linear Exit Time

Theorem 1
The $\varepsilon$-precision trajectory $\{\tilde{u}_K\}_{K=0}^{K_{exit}}$ generated by the gradient descent method with step-size $\alpha = \frac{1}{L}$ has linear exit time (8) from the ball $B_\varepsilon(x^*)$ provided the projection of the initialization $u_0$ onto the unstable subspace $E_{US}$ of the Hessian $\nabla^2 f(x^*)$, given by $\sum_{j\in N_{US}}(\theta^{us}_j)^2$, is lower bounded as:
$$\sum_{j\in N_{US}}(\theta^{us}_j)^2 \gtrsim \left(1+\frac{\varepsilon M}{L}\right)\left(\frac{\delta\mu\,\log\!\left(1+\frac{\beta}{L}-\frac{\varepsilon M}{L}\right)}{Mn}\right)^{a}\log\!\left(\frac{\delta\left(1+\frac{\varepsilon M}{L}\right)\log\!\left(1+\frac{\beta}{L}-\frac{\varepsilon M}{L}\right)}{\log\!\left(\frac{1+\frac{\varepsilon M}{L}}{1+\frac{\beta}{L}-\frac{\varepsilon M}{L}}\right)\varepsilon M n\,\log\!\left(1+\frac{\varepsilon M}{L}\right)}\right) + 1, \tag{23}$$
where
$$a\sqrt{\mu} = \frac{Mn\,\log\!\left(1+\frac{\varepsilon M}{L}\right)}{\delta\left(1+\frac{\varepsilon M}{L}\right)\log\!\left(1+\frac{\beta}{L}-\frac{\varepsilon M}{L}\right)\log\!\left(\frac{1+\frac{\varepsilon M}{L}}{1+\frac{\beta}{L}-\frac{\varepsilon M}{L}}\right)}, \qquad a = \frac{\log\!\left(1+\frac{\varepsilon M}{L}\right)}{\log\!\left(1+\frac{\varepsilon M}{L}\right) - \log\!\left(1+\frac{\beta}{L}-\frac{\varepsilon M}{L}\right)},$$
the function is well-conditioned with $\frac{\beta}{L} > \frac{\varepsilon M}{L}$, and we require that:
$$\varepsilon < \min\left\{\inf_{\|u\|=1}\left(\limsup_{j\to\infty}\sqrt[j]{\frac{r_j(u)}{j!}}\right)^{-1},\ \frac{L\delta}{M(Ln-\delta)} + O(\varepsilon^2)\right\}, \tag{24}$$
where $r_j(u) = \left\|\frac{d^j}{dw^j}\nabla^2 f(x^* + wu)\Big|_{w=0}\right\|$, $u_0 = \varepsilon\sum_{i\in N_S}\theta^s_i v_i + \varepsilon\sum_{j\in N_{US}}\theta^{us}_j v_j$, and $v_i, v_j$ are the eigenvectors of the Hessian $\nabla^2 f(x^*)$.

In terms of order notation, we require the following lower bound on the projection $\sum_{j\in N_{US}}(\theta^{us}_j)^2$:
$$\sum_{j\in N_{US}}(\theta^{us}_j)^2 \gtrsim O\!\left(\left(\log(\varepsilon^{-1})\right)^{-1}\right). \tag{25}$$

The proof of this theorem is given in Appendix A.
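In practice, the quantity that Theorem 1 constrains is computable from a single eigendecomposition of the Hessian at the saddle. The sketch below does exactly that for a boundary initialization; the example Hessian and all numbers are illustrative assumptions, and the exact lower bound of (23) would play the role of the quantity this value is compared against.

```python
import numpy as np

def unstable_projection(hess_at_saddle, x0, x_star, eps):
    """Squared projection sum over N_US of the scaled radial vector u0/eps,
    i.e., the quantity that Theorem 1 lower-bounds."""
    lam, V = np.linalg.eigh(hess_at_saddle)        # eigenpairs (lambda_i, v_i)
    theta = V.T @ ((x0 - x_star) / eps)            # coordinates of u0/eps in the eigenbasis
    return float(np.sum(theta[lam < 0.0] ** 2))    # sum over N_US = {j : lambda_j < 0}

# Illustrative 3-d strict saddle with one negative eigenvalue (assumption),
# and an initialization placed on the boundary of B_eps(x*) with x* = 0.
H = np.diag([2.0, 1.0, -1.0])
eps = 1e-2
x0 = eps * np.array([0.6, 0.0, 0.8])               # ||x0 - x*|| = eps
print("unstable projection:", unstable_projection(H, x0, np.zeros(3), eps))  # 0.64
```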
Remark 1
Recall from (22) that, for the relative error in the $\varepsilon$-precision trajectory to be bounded, we require that $\sqrt{\sum_{j\in N_{US}}(\theta^{us}_j)^2} > O\!\left(\sqrt{\varepsilon}\left(\log\!\left(\frac{1}{\varepsilon}\right)\varepsilon\right)^2\right)$. However, this condition is already satisfied, in terms of order, by the sufficient condition $\sum_{j\in N_{US}}(\theta^{us}_j)^2 \gtrsim O\!\left(\left(\log(\varepsilon^{-1})\right)^{-1}\right)$, since $O\!\left(\sqrt{\left(\log(\varepsilon^{-1})\right)^{-1}}\right) > O\!\left(\sqrt{\varepsilon}\left(\log\!\left(\frac{1}{\varepsilon}\right)\varepsilon\right)^2\right)$ as $\varepsilon \to 0$.

Theorem 2
If a gradient trajectory with respect to some stationary point $x^*$ has non-contractive dynamics at any iteration $k = K$, then it has expansive dynamics for all iterations $k > K$ provided $\|x_k - x^*\|$ is bounded above by $\xi$, where $\{x_k\}$ is the sequence that generates the gradient trajectory. This property of the sequence of radial distances $\{\|x_k - x^*\|\}$ is termed sequential monotonicity.

Moreover, in the case of $x^*$ being a strict saddle point, we have for gradient trajectories with step-size $\alpha = \frac{1}{L}$ that
$$\xi < \frac{\varsigma}{M}\sqrt{\left(1+\frac{\beta}{L}\right)^2\left(1-\left(1-\frac{\beta}{L}\right)^2\right)^{-1} - 1}$$
for some $\varsigma > 0$, provided the function is well-conditioned in the saddle neighborhood $W$ with $\frac{\beta}{L} > 0.5$. Specifically, consider the tuple $(x, x^+, x^{++})$, which is equivalent to the tuple $(x_k, x_{k+1}, x_{k+2})$ for any $k$. Let $\|x^+ - x^*\| > \|x - x^*\|$ and $\|x - x^*\| < \xi$. Then the following holds:
$$\text{a.}\quad \|x^{++} - x^*\| > \|x^+ - x^*\| - \sigma(x), \quad\text{and} \tag{26}$$
$$\text{b.}\quad \|x^{++} - x^*\| \ge \bar{\rho}(x)\|x^+ - x^*\| - \sigma(x), \tag{27}$$
where $\sigma(x) = O(\|x - x^*\|^2)$ and $\bar{\rho}(x) > 1$.

The proof of this theorem is given in Appendix B. Before moving on to the next theorem, we introduce certain terms that are required for understanding the theorem statement. In this regard, let $\hat{K}_{exit}$ be the first exit time of the gradient descent trajectory from the ball $B_\xi(x^*)$, where we assume that the trajectory starts at the boundary of the ball $B_\xi(x^*)$, i.e., $x_0 \in \bar{B}_\xi(x^*) \setminus B_\xi(x^*)$, and $\xi$ is bounded from Theorem 2. Next, for any $\varepsilon < \xi$, let $\bar{B}_\xi(x^*) \setminus B_\varepsilon(x^*)$ be a compact shell centered at $x^*$. Let $k = K_c$ be the last iteration for which the gradient trajectory has contractive dynamics inside the shell and $k = K_e$ be the first iteration for which the gradient trajectory has expansive dynamics inside the shell. We are now ready to state the theorem.

Theorem 3
The sojourn time $K_{shell}$ for a gradient trajectory inside the compact shell $\bar{B}_\xi(x^*) \setminus B_\varepsilon(x^*)$ for a strict saddle point $x^*$ is bounded by
$$K_{shell} \le \min\left\{\frac{L}{\xi^2 - \varepsilon^2}\left(f(x_{K_c}) - f(x^*) - s_{K_c}\varepsilon^2\right),\ \frac{\log\!\left(\frac{L\left(f(x_{K_c}) - f(x^*) - s_{K_c}\varepsilon^2\right)}{(\beta\xi)^2}\right)}{\log\!\left(\frac{1}{1-\frac{\beta}{L}}\right)}\right\} + \frac{\log(\xi) - \log(\varepsilon)}{\log\!\left(\inf\{\bar{\rho}(x_{k-1})\} + M\xi\right)} + 1,$$
where $K_{shell} = \hat{K}_{exit} + K_c - K_e$, with
$$K_c \le \min\left\{\frac{L}{\xi^2 - \varepsilon^2}\left(f(x_{K_c}) - f(x^*) - s_{K_c}\varepsilon^2\right),\ \frac{\log\!\left(\frac{L\left(f(x_{K_c}) - f(x^*) - s_{K_c}\varepsilon^2\right)}{(\beta\xi)^2}\right)}{\log\!\left(\frac{1}{1-\frac{\beta}{L}}\right)}\right\}, \qquad \hat{K}_{exit} - K_e \le \frac{\log(\xi) - \log(\varepsilon)}{\log\!\left(\inf\{\bar{\rho}(x_{k-1})\} + M\xi\right)} + 1,$$
and the infimum in the term $\inf\{\bar{\rho}(x_{k-1})\}$ is over $K_e + 1 \le k \le \hat{K}_{exit}$. Further, $\hat{K}_{exit} - K_e$ is the time for which the gradient trajectory has expansive dynamics inside the shell $\bar{B}_\xi(x^*)\setminus B_\varepsilon(x^*)$, and $K_c$ is the time for which the gradient trajectory has contractive dynamics inside the shell $\bar{B}_\xi(x^*)\setminus B_\varepsilon(x^*)$. Also, $s_{K_c}$ and $s_0$ are positive constants that belong to a monotonically increasing sequence $\{s_k\}_{k=0}^{K_c}$,
$$\xi \le \min\left\{\frac{\varsigma}{M}\sqrt{\left(1+\frac{\beta}{L}\right)^2\left(1-\left(1-\frac{\beta}{L}\right)^2\right)^{-1} - 1},\ \frac{L s_0}{L-\beta}\right\}$$
with $\varsigma > 0$, and the function $f(\cdot)$ is well-conditioned in the saddle neighborhood $W$ with $\frac{\beta}{L} > 0.5$.

In terms of order notation, $K_{shell}$ has the following rate:
$$K_{shell} = O\!\left(\log\!\left(a(\Lambda - b)\right)\right) + O\!\left(\log\!\left(\frac{\xi}{\varepsilon}\right)\right) + O(1), \tag{28}$$
where $f(x_{K_c}) - f(x^*) = \Lambda$, $K_c = O\!\left(\log\!\left(a(\Lambda - b)\right)\right)$, $\hat{K}_{exit} - K_e = O\!\left(\log\!\left(\frac{\xi}{\varepsilon}\right)\right) + O(1)$, and $a, b$ are some positive constants with $\Lambda > b$.

The proof of this theorem is given in Appendix C.
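The contract-then-expand behavior that Theorems 2 and 3 formalize is easy to observe numerically. The sketch below tracks the radii $\|x_k - x^*\|$ of a gradient descent trajectory around the toy quadratic saddle used earlier and recovers the switch from contraction to expansion; the model, step-size, and initialization are illustrative assumptions.

```python
import numpy as np

# Radii ||x_k - x*|| of GD around the saddle x* = 0 of f(x) = 0.5*(x1^2 - x2^2):
# by sequential monotonicity the sequence should contract up to some iteration
# and then expand monotonically for as long as it stays in the neighborhood.
H = np.diag([1.0, -1.0])        # illustrative saddle Hessian (assumption)
alpha = 0.9
x = np.array([1.0, 1e-6])       # almost entirely stable component: contracts first
radii = [np.linalg.norm(x)]
for _ in range(40):
    x = x - alpha * (H @ x)     # vanilla gradient descent
    radii.append(np.linalg.norm(x))

steps = np.sign(np.diff(radii))             # -1 contraction, +1 expansion
K_c = int(np.where(steps < 0)[0].max())     # last contractive step
K_e = int(np.where(steps > 0)[0].min())     # first expansive step
print("K_c =", K_c, " K_e =", K_e)
assert np.all(steps[K_e:] > 0)              # expansion never reverts to contraction
```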
Remark 2
Note here that $\bar{B}_\xi(x^*) \subset W$, $x_{K_c} \in \bar{B}_\varepsilon(x^*)\setminus B_\varepsilon(x^*)$, $f(x_{K_c}) - f(x^*) - s_{K_c}\varepsilon^2 > 0$, and $\{s_k\}$ is any increasing positive sequence for $0 \le k \le K_c$ that satisfies the condition $s_{k+1}\|x_{k+1} - x^*\|^2 \ge s_k\|x_k - x^*\|^2$ for the strictly decreasing sequence $\{\|x_k - x^*\|\}_{k=0}^{K_c}$.

Remark 3
While giving the rate for $K_{shell} = \hat{K}_{exit} + K_c - K_e$ in order notation, we purposely omit $\xi, \varepsilon$ from the term $K_c$ and substitute $K_c = O\!\left(\log\!\left(a(\Lambda - b)\right)\right)$, where $f(x_{K_c}) - f(x^*) = \Lambda$ and $a, b$ are some positive constants. This is done so as to determine the order of $K_c$ in terms of the function value gap $f(x_{K_c}) - f(x^*)$ only, since the proof in Appendix C obtains $K_c$ from the zeroth-order oracle, i.e., the function value. Therefore $\xi$ and $\varepsilon$ are treated as constants in the order notation for $K_c$ and get absorbed into the constants $a, b$.

We now discuss some additional yet important lemmas instrumental in analyzing the behavior of gradient trajectories/approximate trajectories in saddle neighborhoods of any strict saddle point $x^*$. Also, in the remainder of this section, we do not consider the effects of the first-order perturbation, i.e., the $O(\varepsilon^2)$ terms in the Hessian (see [10, Lemma 4]), since we no longer quantify the exit times/boundary conditions and are only interested in approximate trajectory behavior. Hence most of the results in this section are qualitative. The proofs of the lemmas in this section are given in Appendix D.

Lemma 1.
The gradient trajectories $\{u_K\}_{K=0}^{K_{exit}}$ inside the ball $B_\varepsilon(x^*)$ with linear exit time approximately exhibit hyperbolic behavior: they first move exponentially fast towards the saddle point $x^*$, reach some point of minimum distance from $x^*$, denoted by $x_{critical}$, and then move exponentially fast away from $x^*$ so as to escape the saddle region. For the case when $x_{critical} \to x^*$, their first-order approximations, i.e., the $\varepsilon$-precision trajectories, can take a very large time to exit the ball $B_\varepsilon(x^*)$, i.e., $K_\iota \to \infty$, where $K_\iota$ is defined in (7). In the limit of $x_{critical} = x^*$, we have $K_{exit} = K_\iota = \infty$, which implies that the $\varepsilon$-precision trajectories, and hence the gradient trajectory, can never escape the saddle region.

Lemma 2.
In the ball $B_\varepsilon(x^*)$, gradient descent trajectories with linear exit time approximately never curve around the stationary point $x^*$. Moreover, all the linear exit time gradient descent trajectories lie approximately inside some orthant of the ball $B_\varepsilon(x^*)$, i.e., the entry and exit points approximately subtend an angle less than or equal to $\frac{\pi}{2}$ at the point $x^*$.

Lemma 3.
The function value at the exit point on the ball $B_\varepsilon(x^*)$ for any gradient descent trajectory is approximately less than $f(x^*)$.

Lemma 4.
For well-conditioned problems, i.e., $O\!\left(\sqrt{\frac{1}{\log(\varepsilon^{-1})}}\right) < \frac{\beta}{L} \le 1$, where $\varepsilon$ is upper bounded from Theorem 1, a gradient trajectory having exited the ball $B_\varepsilon(x^*)$ can never re-enter this ball.

Lemma 5.
The gradient descent trajectories exiting the ball $B_\xi(x^*)$, where $\xi$ is defined in Theorem 3 for $\varsigma \ge 1$, can never re-enter this ball provided the gradient magnitudes outside the ball $B_\xi(x^*)$ are sufficiently large, with $\|\nabla f(x)\| \ge \gamma > \sqrt{2}L\xi$, and the function is well-conditioned inside the ball $B_\xi(x^*)$ with $\frac{\beta}{L} > 0.5$.

5 The Curvature Conditioned Regularized Gradient Descent (CCRGD) Algorithm

Now that we have established the preliminaries on our unstable projection value and the requisite condition number, we propose a method that can guarantee escaping saddle points in almost linear time for well-conditioned problems, by virtue of the results in [10] as well as Theorem 1, and that is also guaranteed to converge to a local minimum.
Algorithm 1
Curvature Conditioned Regularized Gradient Descent

1:  Initialize $\{x_0, y_0, y_1\}$ to $\mathbf{0}$, a radius $\varepsilon$ bounded by Theorem 1, constants $L, M, \beta, \delta$, the minimum unstable projection $P_{min}(\varepsilon)$ from the lower bound in (72), the condition flag $\Xi = 1$, $\kappa = \frac{\beta}{L}$, and the step-size $\alpha = \frac{1}{L}$
2:  for $k = 0, 1, \cdots, K$ do
3:    Obtain $\nabla f(x_k)$ from the first-order oracle
4:    if $\|\nabla f(x_k)\| > L\varepsilon$ then
5:      Update $x_{k+1} \leftarrow x_k - \alpha\nabla f(x_k)$
6:      if $\Xi = 0$ then
7:        Update the condition flag $\Xi \leftarrow 1$
8:    else if $\|\nabla f(x_k)\| \le L\varepsilon$ and $\Xi = 0$ then
9:      Update $x_{k+1} \leftarrow x_k - \alpha\nabla f(x_k)$
10:   else if $\|\nabla f(x_k)\| \le L\varepsilon$ and $\Xi = 1$ then
11:     Set $y_0 \leftarrow x_k$
12:     Update $y_1 \leftarrow y_0 - \alpha\nabla f(y_0)$
13:     Compute $V_1 \leftarrow \langle y_1 - y_0,\, y_1 - y_0\rangle$
14:     Compute $V_2 \leftarrow \alpha\langle y_1 - y_0,\, \nabla f(y_1) - \nabla f(y_0)\rangle$
15:     if $\varepsilon^2\kappa < V_1 - V_2 < \left(P_{min}(\varepsilon) + 1\right)\varepsilon^2\kappa$ then   ⊲ Curvature Check Condition
16:       Obtain $H \leftarrow \alpha\nabla^2 f(x_k)$ from the second-order oracle
17:       Solve the constrained eigenvalue problem: $x_{k+1} \in \operatorname*{argmin}_{\|x - x_k\| = \|\nabla f(x_k)\|/\beta}\ (x - x_k)^T H (x - x_k)$
18:     else if $0 < V_1 - V_2 \le \varepsilon^2\kappa$ then   ⊲ Curvature Check Condition
19:       Obtain $H \leftarrow \alpha\nabla^2 f(x_k)$ from the second-order oracle
20:       Compute $\lambda_{min}(H)$
21:       if $\lambda_{min}(H) < 0$ then
22:         Solve the constrained eigenvalue problem: $x_{k+1} \in \operatorname*{argmin}_{\|x - x_k\| = \|\nabla f(x_k)\|/\beta}\ (x - x_k)^T H (x - x_k)$
23:       else break
24:     Update the condition flag $\Xi \leftarrow 0$
25: end for
26: Output: Second-Order Stationary Solution $= x_k$

The proof of convergence of Algorithm 1 to a local minimum is given in Appendix E.
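For orientation, here is a minimal Python sketch of the control flow of Algorithm 1, simplified in one important way: the two curvature-checked branches (Steps 15–22) are collapsed into a single schematic negative-curvature probe, so the test function, the constants, and the simplified check are assumptions of this sketch rather than a faithful implementation of the algorithm's curvature condition.

```python
import numpy as np

def ccrgd(grad, hess, x0, L, beta, eps, max_iter=1000):
    """Minimal sketch of the CCRGD control flow: vanilla gradient steps while
    the gradient is large, and a one-shot negative-curvature step (a schematic
    stand-in for Steps 15-22) when the iterate stalls near a saddle."""
    alpha, x, flag = 1.0 / L, np.asarray(x0, float).copy(), 1
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) > L * eps or flag == 0:
            x, flag = x - alpha * g, 1                 # gradient descent step
            continue
        lam, V = np.linalg.eigh(hess(x))               # second-order oracle
        if lam[0] >= 0:                                # no negative curvature: done
            return x
        x = x + (np.linalg.norm(g) / beta) * V[:, 0]   # escape step of length ||g||/beta
        flag = 0
    return x

# Illustrative test function (assumption): f(x) = x1^4/4 - x1^2/2 + x2^2/2,
# a strict saddle at the origin and minima at (+-1, 0).
grad = lambda x: np.array([x[0]**3 - x[0], x[1]])
hess = lambda x: np.array([[3 * x[0]**2 - 1.0, 0.0], [0.0, 1.0]])
print(ccrgd(grad, hess, [1e-9, 0.5], L=6.0, beta=1.0, eps=1e-3))
```

Started with an almost vanishing unstable component, plain gradient descent stalls near the origin, whereas the curvature probe injects the missing unstable projection and the iterates then converge to one of the minima.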
6 Convergence Rate Analysis

Now that we have developed an algorithm that escapes saddle neighborhoods in linear time, our goal is to obtain rates of convergence for Algorithm 1 to some local minimum. First, notice that the iterate sequence $\{x_k\}$ generated by Algorithm 1 converges to a local minimum since gradient descent converges to a local minimum (see [23] for a measure-theoretic proof). This is because the sequence $\{x_k\}$ generated by Algorithm 1 can be written as follows:
$$\{x_k\} = \left(\bigcup_{j=0}^{J}\{x_k\}_{k=K_{s_j}+1}^{K_{s_{j+1}}-1}\right)\bigcup\ \{x_{K_{s_j}}\}_{j=0}^{J}, \tag{29}$$
where $\{x_k\}_{k=K_{s_j}+1}^{K_{s_{j+1}}-1}$ is the $j$th contiguous subsequence generated by the gradient descent method, while the iterates $\{x_{K_{s_j}}\}_{j=0}^{J}$ are generated by Step 17 of Algorithm 1, with $x_{K_{s_0}} = x_0$, the initialization of the algorithm, and $J$ being the total number of times Step 17 is invoked in Algorithm 1. Now, any subsequence $\{x_k\}_{k=K_{s_j}+1}^{K_{s_{j+1}}-1}$ is a gradient descent trajectory and hence almost surely avoids strict saddle points, while the iterates from the set $\{x_{K_{s_j}}\}_{j=0}^{J}$, being generated by the second-order operations of Step 17 in Algorithm 1, never land on saddle points. Hence, the sequence $\{x_k\}$ generated by Algorithm 1 converges to a minimum.

To develop rates of convergence of the sequence $\{x_k\}$ to some minimum $x^*_{optimal}$ of $f(\cdot)$, we need the following additional assumptions.

6.1 Additional Assumptions

A7.
In some compact domain $U$, let $S^*$ be the set of all first-order stationary points of the function $f(\cdot)$, where $x^*_j \in S^*$ denotes the $j$th stationary point, with $|S^*| = l$. Then the distance between any two stationary points of the function $f(\cdot)$ is lower bounded by $R > 0$, i.e., $\left\|x^*_i - x^*_j\right\| \ge R$ for any $x^*_i$ and $x^*_j$ in $S^*$, and we have that $R > \xi$, where $\xi$ is bounded from Theorem 2.

A8.
Let the sequence $\{x_k\}$ generated by Algorithm 1 converge to the minimum $x^*_{optimal} \in S^*$, and let $\left\|x_0 - x^*_{optimal}\right\| \le \zeta$, where $x_0$ is the initialization point for Algorithm 1 and $R < \zeta < lR$. Also, we have the following condition: $\left\|x_0 - x^*_j\right\| \le \xi$ for some stationary saddle point $x^*_j$.

A9.
The gradient magnitude for any $x \in U \setminus \bigcup_{j=1}^{l}\bar{B}_\xi(x^*_j)$ is lower bounded by some $\gamma$, where we have that:
$$\|\nabla f(x)\| \ge \gamma > \sqrt{2}L\xi,$$
such that $\xi$ is bounded from Theorem 3 and $\frac{\beta}{L}$ is bounded from Lemma 5. Moreover, for any $x \in U$ we have the following upper bound: $\|\nabla f(x)\| \le \Gamma$.

Lemma 6.
As a consequence of Assumption A9, the function $f(\cdot)$ is Lipschitz continuous in the compact domain $U$, where the Lipschitz constant is given by $\left(\Gamma + \frac{L}{2}\mathrm{diam}(U)\right)$.

Proof. From the gradient Lipschitz continuity of $f(\cdot)$, for any $x, y$ in $U$ we have that:
$$f(x) - f(y) \le \langle\nabla f(y), x - y\rangle + \frac{L}{2}\|x - y\|^2 \tag{30}$$
$$\le \|\nabla f(y)\|\|x - y\| + \frac{L}{2}\|x - y\|^2 \tag{31}$$
$$= \left(\|\nabla f(y)\| + \frac{L}{2}\|x - y\|\right)\|x - y\| \tag{32}$$
$$\le \left(\Gamma + \frac{L}{2}\mathrm{diam}(U)\right)\|x - y\|. \tag{33}$$
$\blacksquare$

Theorem 4
The trajectory generated from the iterate sequence $\{x_k\}$ by Algorithm 1 that has escaped some ball $B_{R_1}(x^*)$ cannot escape the ball $B_{R_\omega}(x^*) \supset B_{R_1}(x^*)$ if it has to re-enter the ball $B_{R_1}(x^*)$ in a finite number of iterations, where we have that $x_0 \in B_\xi(x^*)$ and $x^* \in S^*$ is a strict saddle point. The radius $R_\omega$ satisfies the condition:
$$R_\omega \le R_1 + \left(\Gamma + \frac{L}{2}\mathrm{diam}(U)\right)\frac{R_1^2}{\gamma} + N K_{exit}\left(\frac{\beta + L}{\beta}\right)\frac{L\varepsilon^2}{\gamma} + N(K_{exit} + K_{shell})\,\xi, \tag{34}$$
where
$$N = \frac{\left(\Gamma + \frac{L}{2}\mathrm{diam}(U)\right)R_1^2}{R_1\left(1 - K_{exit}\left(\frac{\beta+L}{\beta}\right)\frac{L\varepsilon^2}{R_1} - \frac{\gamma(K_{exit}+K_{shell})\,\xi}{R_1}\right)}$$
and $R_1 > K_{exit}\left(\frac{\beta+L}{\beta}\right)L\varepsilon^2 + \gamma(K_{exit}+K_{shell})\,\xi$. Note that here $K_{exit}$ is upper bounded by (8), $K_{shell}$ is upper bounded by Theorem 3, and the compact domain $U$ contains the ball $B_{R_\omega}(x^*)$, i.e., $U \supset B_{R_\omega}(x^*)$.
Remark 4
In order to characterize the convergence rate for Algorithm 1 in the following, we need to focus onthe worst-case trajectories that can be generated by it. Theorem 4 helps capture the behavior of such worst-case trajectories by finding an upper bound on the radius of a ball to whose boundary such trajectories canpossibly travel.
Theorem 5
Suppose x ∈ B ξ ( x ∗ ) , where x ∗ ∈ S ∗ is a strict saddle point. Let S i ∈ P B R ( x ∗ i ) be the tightestpacking of the ball B R ω ( x ∗ ) where we have (cid:13)(cid:13)(cid:13) x ∗ i − x ∗ j (cid:13)(cid:13)(cid:13) = R for any neighboring balls B R ( x ∗ i ) , B R ( x ∗ j ) and | P | is the packing number of the ball B R ω ( x ∗ ) with balls of radius R where R < R ω . Next, let Y = { B ξ ( x ∗ i ) } Q i = be an ordered sequence of balls traversed by the trajectory of { x k } from Algorithm 1 insidethe ball B R ω ( x ∗ ) , where the ball centers x ∗ i are numbered in the increasing order of traversal by thetrajectory of { x k } . Then Q is maximized whenever the travel time between any two consecutive balls ofthe sequence Y is upper bounded by L γ (cid:18) Γ + L diam ( U ) (cid:19) ( ˆ R + ξ ) where ˆ R = R + (cid:18) Γ + L diam ( U ) (cid:19) R γ + N K exit (cid:18) β + L β (cid:19) L ε γ + N ( K exit + K shell ) from Theorem 4 for R = R provided | P j | ( R − ξ ) R (cid:18) Γ + L diam ( U ) (cid:19) γ ≥ Γ . Here, | P j | is the packing number for the ball B R ( x ∗ j ) with balls of radius R and ˜ R = R + (cid:18) Γ + L diam ( U ) (cid:19) R γ + N K exit (cid:18) β + L β (cid:19) L ε γ + N ( K exit + K shell ) .Further, the total time K max for the trajectory of { x k } to reach x ∗ optimal is bounded by: K max < (cid:18) R e f f R (cid:19) n (cid:18) ( K exit + K shell ) + L γ (cid:18) Γ + L diam ( U ) (cid:19) ( ˆ R + ξ ) (cid:19) , (35) where R e f f = ζ + (cid:18) Γ + L diam ( U ) (cid:19) ζγ + N K exit (cid:18) β + L β (cid:19) L ε γ + N ( K exit + K shell ) ξ , N is defined inTheorem 4, K exit is bounded by (8) , and K shell is bounded from Theorem 3.In terms of the order notation, using (8) and (28) , K max has the following rate: K max = A n (cid:18) O (cid:18) log (cid:18) ε (cid:19)(cid:19) + O (cid:18) log (cid:18) a Λ − b (cid:19)(cid:19) + O (cid:18) log (cid:18) ξε (cid:19)(cid:19) + O (cid:18) γ (cid:19) + O ( ) (cid:19) (36) where A is some positive constant. The proof of this theorem is given in Appendix F.
Remark 5
In the above theorem, the total number of saddle neighborhoods encountered by the trajectory of { x k } is upper bounded by (cid:18) R ef f R (cid:19) n , where we have that (cid:18) R ef f R (cid:19) n < l from Assumption A7 . To test the efficacy of the proposed method, we simulate Algorithm 1 on two different problems, a modifiedRastrigin function and a low-rank matrix factorization problem. f ( x ) = An + n ∑ i = ( x i − cos ( π x i )) (37)where A =
10 and x i ∈ [ − . , . ] , and f ( · ) has a global minimum at x = . In this section, we use amodified version of (37) given by: f ( x ) = n ∑ i = a i cos ( b i x i ) , (38)where (38) differs from (37) in the sense that (38) does not have a quadratic term added to it (hence possiblysome local minima are global minima).For simulations, we set a i = i = a i = − b i = ≤ i ≤ (cid:4) n (cid:5) and b i = . (cid:4) n (cid:5) + ≤ i ≤ n . The point x ∗ = is a strict saddle point in our case and the initialization of the proposedCCRGD algorithm (Algorithm 1) and the gradient descent (GD) method is done in an ε neighborhood of x ∗ . Specifically, the iterate x K is initialized in an ε neighborhood of the strict saddle point x ∗ with a verylow unstable subspace projection, where the initialization point is same for both methods. In addition, thestep-size for both methods is set to α = L , where L is the maximum absolute eigenvalue of the Hessian in thesaddle neighborhood.The results of our simulations are reported in Figures 1(a)–(d), where each subfigure has a total of threeplots for a different combination of ( n , ε ) . In each of the subfigures, the top-left plot shows that the gradientnorm of the proposed CCRGD method first increases and then decreases while the GD method struggles toincrease its gradient norm for many iterations. The top-right plot in each subfigure shows the initial and finaleigenvalues of the Hessian at an iterate generated by the two methods, while the blue stem subplot in thereshows the eigenvalue spectrum at the initialization (which is same for both methods). It can be seen fromthese two plots in each subfigure that the GD method fails to converge to a second-order stationary point,while the CCRGD method easily converges to a local minimum.Finally, the bottom plot in each subfigure shows the evolution of distance of the iterate from initializationfor the two methods. This plot also marks the iteration where the CCRGD method first exited the initial saddleneighborhood (this iteration index is the “First Exit Time”) and also marks those iteration indices where theCCRGD method invoked the second-order Step 15 in Algorithm 1.7.2 Low-Rank Matrix FactorizationThe objective function for the problem in consideration is as follows: f ( X , X ) = (cid:13)(cid:13) M − X X T (cid:13)(cid:13) F + ϖ k X k F + ϖ k X k F , (39)where M ∈ R n × n , X ∈ R n × r and X ∈ R n × r such that r < max { n , n } is the rank of matrix M .To simplify the problem structure so as to make (39) some function of a single variable X , let X and X be blocks of the variable X such that X = (cid:20) X X (cid:21) , where we have X = B X and X = B X with B = (cid:2) I n × n | n × n (cid:3) and B = (cid:2) n × n | I n × n (cid:3) . Here I n × n , I n × n represent the identity matrices and n × n , n × n represent the null rectangular matrices ofappropriate dimensions. Using this change of variable, (39) can be written as a function of X : f ( X ) = (cid:13)(cid:13) M − B XX T B T (cid:13)(cid:13) F + ϖ k B X k F + ϖ k B X k F . (40) oundary Conditions for Trajectories Around Saddle Points 17 Iterations K G r ad i en t no r m Hessian with dimension n = 10, = 0.32, and /L = 0.16
Fig. 1
Simulation results on the modified Rastrigin function for various values of n and ε . Next, ∇ f ( X ) can be given as follows: ∇ f ( X ) = ( B T B XX T B T B + B T B XX T B T B ) X − ( B T M T B + B T MB ) X + ϖ B T B X + ϖ B T B X . (41)Since the gradient in (41) is a matrix, hence the corresponding Hessian will be a tensor, whereas our analysisassumes Hessian to be a matrix. To circumvent this problem, we make use of [25, Theorem 9] by vectorizingmatrix X so that ∇ f ( vec ( X )) is a Jacobian matrix.The closed form expression for the Jacobian is as follows: ∇ f ( vec ( X )) = (cid:18) (( X T B T B ) ⊗ I n × n )(( X ⊗ I n × n )( I r × r ⊗ ( B T B )) + ( I n × n ⊗ ( B T B X )))+ ( I r × r ⊗ ( B T B XX T ))( I r × r ⊗ ( B T B )) (cid:19) + (cid:18) (( X T B T B ) ⊗ I n × n )(( X ⊗ I n × n )( I r × r ⊗ ( B T B ))+ ( I n × n ⊗ ( B T B X ))) + ( I r × r ⊗ ( B T B XX T ))( I r × r ⊗ ( B T B )) (cid:19) − (cid:18) I r × r ⊗ ( B T M T B + B T MB ) (cid:19) + (cid:18) I r × r ⊗ ( ϖ B T B + ϖ B T B ) (cid:19) , (42) Iterations K G r ad i en t no r m Hessian with dimension (n +n ) r, = 0.004 and /L = 0.029 GDCCRGD
Eigenvalue index -505 E i gen v a l ue s K=59 (CCRGD)K=59 (GD)K=0 Iterations K D i s t an c e f r o m i n i t i a li z a t i on n = 8, n = 9, r = 3, = 5e-05 GDCCRGDFirst Exit Time =2Second order iterations for CCRGD
Iterations K G r ad i en t no r m Hessian with dimension (n +n ) r, = 0.002 and /L = 0.084 GDCCRGD
Eigenvalue index -50510 E i gen v a l ue s K=79 (CCRGD)K=79 (GD)K=0 Iterations K D i s t an c e f r o m i n i t i a li z a t i on n = 8, n = 9, r = 3, = 0.0005 GDCCRGDFirst Exit Time =6Second order iterations for CCRGD (a) (b)
Iterations K G r ad i en t no r m Hessian with dimension (n +n ) r, = 0.004 and /L = 0.015 GDCCRGD
Eigenvalue index -10010 E i gen v a l ue s K=114 (CCRGD)K=114 (GD)K=0 Iterations K D i s t an c e f r o m i n i t i a li z a t i on n = 12, n = 10, r = 5, = 5e-05 GDCCRGDFirst Exit Time =2Second order iterations for CCRGD
Iterations K G r ad i en t no r m Hessian with dimension (n +n ) r, = 0.002 and /L = 0.085 GDCCRGD
Eigenvalue index -10010 E i gen v a l ue s K=165 (CCRGD)K=165 (GD)K=0 Iterations K D i s t an c e f r o m i n i t i a li z a t i on n = 12, n = 10, r = 5, = 0.0005 GDCCRGDFirst Exit Time =6 (c) (d)
Iterations K G r ad i en t no r m Hessian with dimension (n +n ) r, = 0.004 and /L = 0.072 GDCCRGD
Eigenvalue index -1001020 E i gen v a l ue s K=89 (CCRGD)K=89 (GD)K=0 Iterations K D i s t an c e f r o m i n i t i a li z a t i on n = 15, n = 9, r = 4, = 0.0001 GDCCRGDFirst Exit Time =2Second order iterations for CCRGD
Iterations K G r ad i en t no r m Hessian with dimension (n +n ) r, = 0.004 and /L = 0.058 GDCCRGD
Eigenvalue index -10010 E i gen v a l ue s K=313 (CCRGD)K=313 (GD)K=0 Iterations K D i s t an c e f r o m i n i t i a li z a t i on n = 15, n = 9, r = 4, = 1e-05 GDCCRGDFirst Exit Time =2Second order iterations for CCRGD (e) (f)
Fig. 2
Simulation results of a low-rank matrix factorization problem for various values of n , n , r , and ε .oundary Conditions for Trajectories Around Saddle Points 19 where n = n + n . For simulations, matrix M was generated randomly using the relation M = U U T + ρ N , where U ∈ R n × r , U ∈ R n × r and the entries of these matrices were independently sampled from a standardnormal distribution. Matrix N ∈ R n × n is the additive noise generated from a normal distribution whosevariance is scaled by ρ .For the experiments, we use ϖ = ϖ = . ρ = .
5, and step-size α = L where L = λ max ( ∇ f ( vec ( X ))) .Also, for the particular selection of parameters, X = is a strict saddle point. Hence, X is initialized on theboundary of ball B ε ( ) and ε is varied in the simulations along with n , n , r . Finally, the proposed methodis plotted against the standard gradient descent method where the metric is k X k − X init k F with X init being thecommon initialization for the two methods.The simulation results for Algorithm 1 are presented in Figures 2(a)–(e) and comparisons are made withthe GD method. For the sake of uniformity, the plots within each subfigure of Figure 2 follow the sameconvention as the plots within each subfigure of Figure 1. From the plots, it is evident that the functions arenot well-conditioned for the different cases. Hence, a remarkable contrast is not observed between the twomethods, which in reality are identical except possibly for a single iteration in saddle neighborhood. Bothmethods exhibit convergence, which is evident from the eigenvalues of the Hessian at final iterate. This work focuses on the global analysis of gradient trajectories on a class of nonconvex functions that havestrict saddle points in their geometry. Building on top of the results from our earlier work [10], sufficientboundary conditions are developed here that guarantee approximate linear exit time of gradient trajectoriesfrom saddle neighborhoods. Further, the gradient trajectories are analyzed in an augmented saddle neighbor-hood and it is proved that the trajectories exhibit sequential monotonicity. Using this result, bounds on thetotal travel time are given for trajectories in this region. A robust algorithm is also developed in this workthat uses the sufficient boundary conditions to check whether a given trajectory will exit saddle neighborhoodin linear time and invokes a second-order step otherwise. Several intuitive yet important lemmas are proved,characterizing the behaviour of gradient trajectories in saddle neighborhoods and two theorems are provedthat provide rate of convergence of the algorithm to a local minimum.
A Proof of Theorem 1
Proof.
From Theorem 1 in [10], for every value of parameter τ , there exists a lower bound on the squared radial distance k ˜ u τ K k for all K in the range 1 ≤ K ≤ sup τ (cid:26) K τ exit (cid:27) provided K ε ≪
1. Moreover, this lower bound can be expressed using a function of K called thetrajectory function Ψ ( K ) . Formally, we have that: ε ≥ inf τ k ˜ u τ K k > ε Ψ ( K ) , (43)where the the trajectory function Ψ ( K ) is given by: Ψ ( K ) = (cid:18) c K − Kc K − b − b c K c K − b c K (cid:19) ∑ i ∈ N S ( θ si ) + (cid:18) c K − Kc K − b − b c K c K − b c K (cid:19) ∑ j ∈ N US ( θ usj ) (44)with c = (cid:18) − α L − αε M − O ( ε ) (cid:19) , c = (cid:18) − αβ + αε M + O ( ε ) (cid:19) , c = (cid:18) + α L + αε M + O ( ε ) (cid:19) , c = (cid:18) + αβ − αε M − O ( ε ) (cid:19) , b = (cid:18) αε MLn δ + O ( ε ) (cid:19) , and b = (cid:18) αε MLn δ + O ( ε ) (cid:19)(cid:18) + O ( K ε ) (cid:19)(cid:18) α L + αβ + O ( ε ) (cid:19) .0 Rishabh Dixit, Waheed U. BajwaSubstituting these coefficients in the expression for Ψ ( K ) followed by dropping order O ( ε ) and O ( K ε ) terms (for K ε ≪ Ψ ( K ) : Ψ ( K ) ≈ (cid:18)(cid:20)(cid:18) − α L − αε M (cid:19) K − K (cid:18) − αβ + αε M (cid:19) K − αε MLn δ (cid:21) ∑ i ∈ N S ( θ si ) + (cid:20)(cid:18) + αβ − αε M (cid:19) K − K (cid:18) + α L + αε M (cid:19) K − αε MLn δ (cid:21) ∑ j ∈ N US ( θ usj ) − αε MLn δ ( α L + αβ ) (cid:18) + α L + αε M (cid:19) K (cid:18) − αβ + αε M (cid:19) K (cid:18) ∑ i ∈ N S ( θ si ) + ∑ j ∈ N US ( θ usj ) (cid:19) − αε MLn δ ( α L + αβ ) (cid:18) + α L + αε M (cid:19) K (cid:19)(cid:18) ∑ i ∈ N S ( θ si ) + ∑ j ∈ N US ( θ usj ) (cid:19) (45) Ψ ( K ) ' (cid:18)(cid:20)(cid:18) − α L − αε M (cid:19) K − K (cid:18) − αβ + αε M (cid:19) K − αε MLn δ (cid:21) ∑ i ∈ N S ( θ si ) + (cid:20)(cid:18) + αβ − αε M (cid:19) K − K (cid:18) + α L + αε M (cid:19) K − αε MLn δ (cid:21) ∑ j ∈ N US ( θ usj ) − ε MLn (cid:18) + α L + αε M (cid:19) K δ ( L + β ) (cid:19) , (46)where in the last step we used the relation (cid:18) ∑ i ∈ N S ( θ si ) + ∑ j ∈ N US ( θ usj ) (cid:19) = (cid:18) − αβ + αε M (cid:19) < (cid:18) + α L + αε M (cid:19) .Now for α = L , (46) becomes the following approximate inequality: Ψ ( K ) ' (cid:18)(cid:20)(cid:18) − ε M L (cid:19) K − K (cid:18) − β L + ε M L (cid:19) K − ε Mn δ (cid:21) ∑ i ∈ N S ( θ si ) + (cid:20)(cid:18) + β L − ε M L (cid:19) K − K (cid:18) + ε M L (cid:19) K − ε Mn δ (cid:21) ∑ j ∈ N US ( θ usj ) − ε MLn (cid:18) + ε M L (cid:19) K δ ( L + β ) (cid:19) (47) Ψ ( K ) ' (cid:18)(cid:20) − K (cid:18) − β L + ε M L (cid:19) K − ε Mn δ (cid:21) ∑ i ∈ N S ( θ si ) + (cid:20)(cid:18) + β L − ε M L (cid:19) K − K (cid:18) + ε M L (cid:19) K − ε Mn δ (cid:21) ∑ j ∈ N US ( θ usj ) − ε MLn (cid:18) + ε M L (cid:19) K δ ( L + β ) (cid:19) . (48)Recall that from the condition (43), the exit time is obtained by evaluating the first K where Ψ ( K ) >
1. From the inequality (48), bysetting the right hand side greater than equal to 1 for some given K of order O ( log ( ε − )) , we will have Ψ ( K ) '
1. Hence the sufficientcondition on the unstable projection ∑ j ∈ N US ( θ usj ) for escaping saddle with linear rate can be obtained from (48) by setting its righthand side greater than equal to 1. Notice that for very large K , the right hand side of (48) is always less than 1. Moreover, there existssome K min ≥ K max > ( K min , K max ) . Wewish to find condition which yields some positive K min and K max . Note that our escape from saddle can happen only in this particulartime interval, i.e., K exit ∈ ( K min , K max ) .We first assume that the approximate lower bound on Ψ ( K ) from (48) is a continuous function of K so as to allow differentiationof this lower bound with respect to variable K . This continuous extension is possible since the approximate lower bound on Ψ ( K ) from(48) is a well-defined function of K . Note that we do not use the lower bound from (47) since we are looking for values of K greater than1 and the derivative of (cid:18) − ε M L (cid:19) K is of at most order O ( ε K − ) for K > ε . Representing this approximate lower bound inoundary Conditions for Trajectories Around Saddle Points 21(48) as Ψ ( K ) where we have that Ψ ( K ) ' Ψ ( K ) , followed by differentiating it with respect to K yields: Ψ ( K ) = (cid:18)(cid:20) − K (cid:18) − β L + ε M L (cid:19) K − ε Mn δ (cid:21) ∑ i ∈ N S ( θ si ) + (cid:20)(cid:18) + β L − ε M L (cid:19) K − K (cid:18) + ε M L (cid:19) K − ε Mn δ (cid:21) ∑ j ∈ N US ( θ usj ) − ε MLn (cid:18) + ε M L (cid:19) K δ ( β − ε M ) (cid:19) (49) d Ψ ( K ) dK = (cid:18)(cid:20) − K log (cid:18) − β L + ε M L (cid:19) − (cid:21)(cid:18) − β L + ε M L (cid:19) K − ε Mn δ ∑ i ∈ N S ( θ si ) + (cid:20) (cid:18) + β L − ε M L (cid:19) K log (cid:18) + β L − ε M L (cid:19) − (cid:18) + ε M L (cid:19) K − ε Mn δ − K (cid:18) + ε M L (cid:19) K − ε Mn δ log (cid:18) + ε M L (cid:19)(cid:21) ∑ j ∈ N US ( θ usj ) − ε MLn (cid:18) + ε M L (cid:19) K δ ( β − ε M ) log (cid:18) + ε M L (cid:19)(cid:19) (50)It can be inferred from the above equation (50) that for well conditioned problems, i.e., β L > ε M L , and some good unstable projections,i.e., ∑ j ∈ N US ( θ usj ) > ∆ where ∆ > ε MLn δ ( L + β ) , the function Ψ ( K ) slopes upward for some small positive values of K and then it slopesdownward for very large values of K , i.e., Ψ ( K ) becomes a decreasing function for large values of K ( Ψ ( K ) → − ∞ as K → ∞ ). Thereforewe only need to find some K ∈ ( K min , K max ) where the function Ψ ( K ) has zero slope and the value Ψ ( K ) is greater than or equal to 1 forguaranteeing escape. The condition Ψ ( K ) ≥ Ψ ( K ) ' K = K .The above condition can be achieved in many different ways. However, to ensure that the so-called sufficient conditions haveminimal restrictions, we must have K to be the local maximum of the function Ψ ( K ) on the interval K ∈ ( , C ] where C is some arbitrarypositive finite value with C ≤ K max . Note that K is a root of the equation d Ψ ( K ) dK =
0. The condition that K is the local maximum of Ψ ( K ) on the interval K ∈ ( , C ] ensures existence of at least one value of K such that Ψ ( K ) ≥ Ψ ( K ) ' Ψ ( K ) ≥ K exit < K ι / O ( log ( ε − )) for ε –precision trajectories with linearexit time. Note that the linear exit time was obtained explicitly by solving for the roots of equation Ψ ( K ) =
1. Now K is the localmaximum of the function Ψ ( K ) on the interval K ∈ ( , C ] and we have Ψ ( K ) ≥ C = K ι which is valid since C wasarbitrary with K exit < C ≤ K max . Similarly, K max was arbitrary hence we can set K max = K ι . Therefore we will have (cid:13)(cid:13)(cid:13) ˜ u τ K (cid:13)(cid:13)(cid:13) > ε for allvalues of τ where { ˜ u τ K } K exit K = was the ε –precision trajectory defined in [10].Then the sufficient condition (though not necessary) which guarantees the escape of the approximate lower bound Ψ ( K ) on thetrajectory function Ψ ( K ) from the ball B ε ( x ∗ ) is as follows: G Ψ = (cid:26) K (cid:12)(cid:12)(cid:12)(cid:12) K ∈ ( , K ι ] , d Ψ ( K ) dK < , d Ψ ( K ) dK = (cid:27) (51)1 ≤ sup K ∈ G Ψ (cid:26) Ψ ( K ) (cid:27) . (52)The condition (52) can be relaxed to obtain Ψ ( K ) ≥ K ∈ G Ψ . Note that the set G Ψ is non-empty since the function Ψ ( K ) slopes upwards for small positive K whereas Ψ ( K ) → − ∞ as K → ∞ . Simplifying the derivative condition (50) by setting it to 0 we getthe following: 0 = d Ψ dK (cid:12)(cid:12)(cid:12)(cid:12) K = K = (cid:18)(cid:20) − K log (cid:18) − β L + ε M L (cid:19) − (cid:21)(cid:18) − β L + ε M L (cid:19) K − ε Mn δ ∑ i ∈ N S ( θ si ) + (cid:20) (cid:18) + β L − ε M L (cid:19) K log (cid:18) + β L − ε M L (cid:19) − (cid:18) + ε M L (cid:19) K − ε Mn δ − K (cid:18) + ε M L (cid:19) K − ε Mn δ log (cid:18) + ε M L (cid:19)(cid:21) ∑ j ∈ N US ( θ usj ) − ε MLn (cid:18) + ε M L (cid:19) K δ ( β − ε M ) log (cid:18) + ε M L (cid:19)(cid:19) (53)0 = (cid:18)(cid:20) − K log (cid:18) − β L + ε M L (cid:19) − (cid:21)(cid:18) − β L + ε M L + ε M L (cid:19) K (cid:18) − β L + ε M L (cid:19) − ε Mn δ ∑ i ∈ N S ( θ si ) + (cid:20) (cid:18) + β L − ε M L + ε M L (cid:19) K log (cid:18) + β L − ε M L (cid:19) − (cid:18) + ε M L (cid:19) − ε Mn δ − K (cid:18) + ε M L (cid:19) − ε Mn δ log (cid:18) + ε M L (cid:19)(cid:21) ∑ j ∈ N US ( θ usj ) − ε MLn δ ( β − ε M ) log (cid:18) + ε M L (cid:19)(cid:19) (54)2 Rishabh Dixit, Waheed U. BajwaObserve that the roots of this equation cannot be explicitly computed due to the transcendental nature of this equation. However, theroots can be obtained if the order of K is known with respect to ε . Since K ∈ G Ψ , we will have K < K ι / O ( log ( ε − )) . Therefore,we compute only those values of K which are linear, i.e., K = O ( log ( ε − )) . For such a K , setting (cid:18) + ε M L (cid:19) K = µε a where µ > a > (cid:18) − β L + ε M L (cid:19) K = ηε b where η > b > β L > ε M L , the above equality (54) becomes:0 = (cid:18)(cid:20) − K log (cid:18) − β L + ε M L (cid:19) − (cid:21)(cid:18) − β L + ε M L (cid:19) − µηε ( + a + b ) Mn δ ∑ i ∈ N S ( θ si ) | {z } F + (cid:20) (cid:18) + β L − ε M L + ε M L (cid:19) K log (cid:18) + β L − ε M L (cid:19) − (cid:18) + ε M L (cid:19) − ε Mn δ − K (cid:18) + ε M L (cid:19) − ε Mn δ log (cid:18) + ε M L (cid:19)(cid:21) ∑ j ∈ N US ( θ usj ) − ε MLn δ ( β − ε M ) log (cid:18) + ε M L (cid:19)(cid:19) (55) (cid:18) + β L − ε M L + ε M L (cid:19) K ≈ (cid:18) + ε M L (cid:19) − ε Mn δ log (cid:18) + ε M L (cid:19) log (cid:18) + β L − ε M L (cid:19) K + (cid:18) + ε M L (cid:19) − ε Mn δ log (cid:18) + β L − ε M L (cid:19) + ε MLn δ ( β − ε M ) log (cid:18) + β L − ε M L (cid:19) ∑ j ∈ N US ( θ usj ) log (cid:18) + ε M L (cid:19) (56)where in the last step, we dropped the term F (since this term F = O ( K ε ( + a + b ) ) = O ( ε ( + a + b ) log ( ε − )) ) to obtain the approximateequality (56). 
The approximate solution for (56) can be obtained using a transcendental equation of the form q x = cx + d where x = K and the coefficients are as follows: q = (cid:18) + β L − ε M L + ε M L (cid:19) , c = (cid:18) + ε M L (cid:19) − ε Mn δ log (cid:18) + ε M L (cid:19) log (cid:18) + β L − ε M L (cid:19) (57) d = (cid:18) + ε M L (cid:19) − ε Mn δ log (cid:18) + β L − ε M L (cid:19) + ε MLn log (cid:18) + ε M L (cid:19) δ ( β − ε M ) log (cid:18) + β L − ε M L (cid:19) ∑ j ∈ N US ( θ usj ) . (58)The solution for this equation is given by the following relation: x = − W ( − log qc q − dc ) log q − dc ≤ log ( − log qc q − dc ) log q − − dc = log ( − log qc ) log q − (59)where W ( . ) is the Lambert W function and we have that W ( y ) ≤ log ( y ) for large y . Substituting these coefficients in (56), we obtain thefollowing approximate condition: K /
12 log (cid:18) + ε M L + β L − ε M L (cid:19) (cid:18) δ (cid:18) + ε M L (cid:19) log (cid:18) + β L − ε M L (cid:19) log (cid:18) + ε M L + β L − ε M L (cid:19) ε Mn log (cid:18) + ε M L (cid:19) (cid:19)| {z } ˆ K (60)where ˆ K is the approximate upper bound on K . However, for the condition K ∈ G Ψ to hold, we also require d Ψ dK (cid:12)(cid:12)(cid:12)(cid:12) K = K < d Ψ dK (cid:12)(cid:12)(cid:12)(cid:12) K = ˆ K < d Ψ dK is positive for very small values of K . Hence, there must exist a localmaximum at some K < ˆ K which would imply d Ψ dK (cid:12)(cid:12)(cid:12)(cid:12) K = K <
0. Hence, it is not required to explicitly solve the condition d Ψ dK (cid:12)(cid:12)(cid:12)(cid:12) K = K < F to obtain the approximate equality (56) is justified. Observe that in the twoapproximate transcendental equations (55) and (56) with K as the variable, the right-hand sides will be greater than their left-handoundary Conditions for Trajectories Around Saddle Points 23sides respectively at the value K = ˆ K . Also, for small values of K the respective left-hand sides of (55) and (56) dominate, hencethe approximate equality occurs for some K < ˆ K . Now, we are only left to prove that the approximations (55) and (56) are almostequal at K = ˆ K . This can be established by proving that the term F = O ( ˆ K ε ( + a + b ) ) = O ( ε ( + a + b ) log ( ε − )) is negligible w.r.t. otherterms in (55) at K = ˆ K . From the particular approximate upper bound in (60), it can be verified that a >
1. Using the substitution (cid:18) + ε M L (cid:19) K = µε a where µ > a >
0, taking log both sides followed by substituting the approximate upper bound ˆ K from (60)yields: a log (cid:18) a √ µε (cid:19) = K log (cid:18) + ε M L (cid:19) (61) a log (cid:18) a √ µε (cid:19) = log (cid:18) δ (cid:18) + ε M L (cid:19) log (cid:18) + β L − ε M L (cid:19) log (cid:18) + ε M L + β L − ε M L (cid:19) ε Mn log (cid:18) + ε M L (cid:19) (cid:19) log (cid:18) + ε M L + β L − ε M L (cid:19) log (cid:18) + ε M L (cid:19) (62) a = log (cid:18) + ε M L (cid:19) log (cid:18) + ε M L (cid:19) − log (cid:18) + β L − ε M L (cid:19) > , (63)where in the last step we have that a √ µε = δ (cid:18) + ε M L (cid:19) log (cid:18) + β L − ε M L (cid:19) log (cid:18) + ε M L + β L − ε M L (cid:19) ε Mn log (cid:18) + ε M L (cid:19) . Now with a > b >
0: lim ε → + ε ( + a + b ) log ( ε − ) ε = . (64)Hence, for sufficiently small ε , term F can be of at most order O ( ε ) .Recall that from the relaxation of condition (52), we require Ψ ( K ) ≥
1. Since K is not explicitly available and we only have theapproximate upper bound ˆ K from (60), hence we use the substitution K = χ ˆ K for some 0 < χ ≤ Ψ atthis point greater than equal to 1.Substituting K = χ ˆ K from (60) into the condition Ψ ( K ) ≥
1, dropping the first term on the right hand side of (49) (it is oforder O ( χ ˆ K ε ( + a + b ) ) = O ( ε ( + a + b ) log ( ε − )) as before, substituting (cid:18) + ε M L (cid:19) K = µε χ a for µ > , ε >
0, using (56) for K = χ ˆ K followed by rearranging, we get: (cid:18)(cid:20)(cid:18) + β L − ε M L + ε M L (cid:19) K − K (cid:18) + ε M L (cid:19) − ε Mn δ (cid:21) ∑ j ∈ N US ( θ usj ) − ε MLn δ ( β − ε M ) (cid:19) ' (cid:18) + ε M L (cid:19) K (65) (cid:18)(cid:18) + ε M L (cid:19) − ε Mn δ log (cid:18) + ε M L (cid:19) log (cid:18) + β L − ε M L (cid:19) K − (cid:18) + ε M L (cid:19) − ε Mn δ K + (cid:18) + ε M L (cid:19) − ε Mn δ log (cid:18) + β L − ε M L (cid:19) (cid:19) ∑ j ∈ N US ( θ usj ) ' µε χ a + ε MLn δ ( β − ε M ) − ε MLn δ ( β − ε M ) log (cid:18) + β L − ε M L (cid:19) log (cid:18) + ε M L (cid:19) (66)4 Rishabh Dixit, Waheed U. BajwaSince (cid:18) ε Mn δ log (cid:18) + ε M L (cid:19) log (cid:18) + β L − ε M L (cid:19) K − ε Mn δ K + ε Mn δ log (cid:18) + β L − ε M L (cid:19) (cid:19) >
0, dividing both sides by this quantity yields the following sufficientcondition on unstable projection ∑ j ∈ N US ( θ usj ) : ∑ j ∈ N US ( θ usj ) ' (cid:18) + ε M L (cid:19)(cid:18) µε χ a + ε MLn δ ( β − ε M ) − ε MLn δ ( β − ε M ) log (cid:18) + β L − ε M L (cid:19) log (cid:18) + ε M L (cid:19)(cid:19)(cid:18) ε Mn δ log (cid:18) + ε M L (cid:19) log (cid:18) + β L − ε M L (cid:19) K − ε Mn δ K + ε Mn δ log (cid:18) + β L − ε M L (cid:19) (cid:19) (67) ∑ j ∈ N US ( θ usj ) ' (cid:18) + ε M L (cid:19)(cid:18) δµε ( χ a − ) Mn + ( β L − ε M L ) − log (cid:18) + ε M L (cid:19) ( β L − ε M L ) log (cid:18) + β L − ε M L (cid:19) (cid:19) χ ˆ K (cid:18) log (cid:18) + ε M L (cid:19) log (cid:18) + β L − ε M L (cid:19) − (cid:19) + (cid:18) + β L − ε M L (cid:19) . (68)Now, recall that from (63) we have a > K ' K = χ ˆ K . Since K is not explicitly known we can choosea surrogate for χ to obtain the sufficient condition. Notice that χ is a quantity between 0 and 1. Choosing a large value for χ sayclose to 1 will yield the following order bound ∑ j ∈ N US ( θ usj ) ' O (cid:18) ε a − log ( ε − ) (cid:19) . Recall that from (22) we require q ∑ j ∈ N US ( θ usj ) > O (cid:18) √ ε (cid:18) log (cid:18) ε (cid:19) ε (cid:19)(cid:19) . However this bound may then contradict the sufficient condition ∑ j ∈ N US ( θ usj ) ' O (cid:18) ε a − log ( ε − ) (cid:19) if a >
2, i.e., wehave O (cid:18) ε (cid:18) log (cid:18) ε (cid:19) ε (cid:19) (cid:19) > O (cid:18) ε a − log ( ε − ) (cid:19) as ε → β L close to 1, it can be checked using (63)that a becomes arbitrarily large). Next, choosing very small values of χ say close to 0 will cause the approximation (56) to fail since theterm F in (55) can no longer be dropped (this term is of order O ( ε ) for χ = χ = a is able to strike a balance between both the requirements (dropping of the term F in (55) and satisfyingthe bound on ∑ j ∈ N US ( θ usj ) from (22)). Observe that by setting χ = a , we can get rid of the ε dependency in the numerator of (68) whichgenerates the order bound ∑ j ∈ N US ( θ usj ) ' O (cid:18) ( ε − ) (cid:19) that agrees with the condition q ∑ j ∈ N US ( θ usj ) > O (cid:18) √ ε (cid:18) log (cid:18) ε (cid:19) ε (cid:19)(cid:19) from(22) for any a >
0. Also, it can be easily checked that the term F from (55) for K = χ ˆ K = a ˆ K has the order O ( ε ( + b ) log ( ε − )) forsome b > F can be dropped to get the approximation (56). Substituting ˆ K from (60) and χ = a in (68) followed byfurther simplification gives the following result: ∑ j ∈ N US ( θ usj ) ' (cid:18) + ε M L (cid:19)(cid:18) δµ log (cid:18) + β L − ε M L (cid:19) Mn + log (cid:18) + β L − ε M L (cid:19) ( β L − ε M L ) − log (cid:18) + ε M L (cid:19) ( β L − ε M L ) (cid:19) a log (cid:18) δ (cid:18) + ε M L (cid:19) log (cid:18) + β L − ε M L (cid:19) log (cid:18) + ε M L + β L − ε M L (cid:19) ε Mn log (cid:18) + ε M L (cid:19) (cid:19) + β L ≫ β L > ε M L ), dropping the negative term log (cid:18) + β L − ε M L (cid:19) ( β L − ε M L ) − log (cid:18) + ε M L (cid:19) ( β L − ε M L ) fromthe numerator of (69) and setting the condition: ∑ j ∈ N US ( θ usj ) ' (cid:18) + ε M L (cid:19)(cid:18) δµ log (cid:18) + β L − ε M L (cid:19) Mn (cid:19) a log (cid:18) δ (cid:18) + ε M L (cid:19) log (cid:18) + β L − ε M L (cid:19) log (cid:18) + ε M L + β L − ε M L (cid:19) ε Mn log (cid:18) + ε M L (cid:19) (cid:19) + , (70)oundary Conditions for Trajectories Around Saddle Points 25will be sufficient to guarantee the approximate lower bound in (69). Now using the upper bound on K from (60) in the expression µε a = (cid:18) + ε M L (cid:19) K , we have that: a √ µ = Mn log (cid:18) + ε M L (cid:19) δ (cid:18) + ε M L (cid:19) log (cid:18) + β L − ε M L (cid:19) log (cid:18) + ε M L + β L − ε M L (cid:19) (71)where a = log (cid:18) + ε M L (cid:19) log (cid:18) + ε M L (cid:19) − log (cid:18) + β L − ε M L (cid:19) >
1. Hence, the approximate lower bound on the unstable projection ∑ j ∈ N US ( θ usj ) has thefollowing order: ∑ j ∈ N US ( θ usj ) ' O (cid:18) ( ε − ) (cid:19) . (72)It is also worth mentioning that the lower bound on the unstable projection ∑ j ∈ N US ( θ usj ) from (72) is an increasing function of ε .Finally, ε is upper bounded from Theorem 2 of [10] as follows: ε < min (cid:26) inf k u k = (cid:18) limsup j → ∞ j s r j ( u ) j ! (cid:19) − , L δ M ( Ln − δ ) + O ( ε ) (cid:27) (73)where r j ( u ) = (cid:13)(cid:13)(cid:13)(cid:13)(cid:18) d j dw j ∇ f ( x ∗ + w u ) (cid:12)(cid:12)(cid:12)(cid:12) w = (cid:19)(cid:13)(cid:13)(cid:13)(cid:13) . (cid:4) B Proof of Theorem 2
Proof.
For an iterative gradient mapping given by x + = x − α∇ f ( x ) in any neighborhood of x ∗ , we have: ∇ f ( x ) = (cid:18) ∇ f ( x ∗ ) + Z p = p = ∇ f ( x ∗ + p ( x − x ∗ ))( x − x ∗ ) dp (cid:19) . (74)provided function f ( · ) is analytic. Using this substitution in the iterative gradient mapping, we have the following result: (cid:13)(cid:13) x + − x ∗ (cid:13)(cid:13) = k x − α∇ f ( x ) − x ∗ k (75) = (cid:13)(cid:13)(cid:13)(cid:13) ( x − x ∗ ) − α (cid:18) ∇ f ( x ∗ ) + Z p = p = ∇ f ( x ∗ + p ( x − x ∗ ))( x − x ∗ ) dp (cid:19)(cid:13)(cid:13)(cid:13)(cid:13) (76) = (cid:13)(cid:13)(cid:13)(cid:13) ( x − x ∗ ) − α Z p = p = ∇ f ( x ∗ + p ( x − x ∗ )) dp ( x − x ∗ ) (cid:13)(cid:13)(cid:13)(cid:13) (77) = (cid:13)(cid:13)(cid:13)(cid:13)(cid:18) I − α Z p = p = ∇ f ( x ∗ + p ( x − x ∗ )) dp (cid:19) ( x − x ∗ ) (cid:13)(cid:13)(cid:13)(cid:13) (78) = s(cid:18) ∑ j ∈ I US ( ν usj h ˆ u , e usj i ) + ∑ i ∈ I S ( ν si h ˆ u , e si i ) (cid:19) k x − x ∗ k (79)where u = x − x ∗ , ˆ u = u k u k , x − x ∗ = k u k (cid:18) ∑ j ∈ I US h ˆ u , e usj i e usj + ∑ i ∈ I S h ˆ u , e si i e si (cid:19) and ( e usj , ν usj ) , ( e sj , ν sj ) are the eigenvector-eigenvaluepair of the matrix D ( x ) where D ( x ) = (cid:18) I − α R p = p = ∇ f ( x ∗ + p ( x − x ∗ )) dp (cid:19) with ν si < i ∈ I S , ν usj ≥ j ∈ I US and I US , I S are the index sets associated respectively with these subspaces respectively.We first consider the case of strict expansive dynamics in the current iteration. Given: k x + − x ∗ k > k x − x ∗ k or equivalently (cid:13)(cid:13) x + − x ∗ (cid:13)(cid:13) = s(cid:18) ∑ j ∈ I US ( ν usj h ˆ u , e usj i ) + ∑ i ∈ I S ( ν si h ˆ u , e si i ) (cid:19) k u k > k u k . (80)This implies: s(cid:18) ∑ j ∈ I US ( ν usj h ˆ u , e usj i ) + ∑ i ∈ I S ( ν si h ˆ u , e si i ) (cid:19) > . (81)6 Rishabh Dixit, Waheed U. BajwaTo prove the following: a. (cid:13)(cid:13) x ++ − x ∗ (cid:13)(cid:13) > (cid:13)(cid:13) x + − x ∗ (cid:13)(cid:13) − σ ( x ) (82) b. (cid:13)(cid:13) x ++ − x ∗ (cid:13)(cid:13) ≥ ¯ ρ ( x ) (cid:13)(cid:13) x + − x ∗ (cid:13)(cid:13) − σ ( x ) (83)where σ ( x ) = O ( k x − x ∗ k ) and ¯ ρ ( x ) > Part a. k x ++ − x ∗ k > k x + − x ∗ k − σ ( x ) Proof.
Since x ++ = x + − α∇ f ( x + ) , we have the following: (cid:13)(cid:13) x ++ − x ∗ (cid:13)(cid:13) = (cid:13)(cid:13) x + − α∇ f ( x + ) − x ∗ (cid:13)(cid:13) (84) = (cid:13)(cid:13)(cid:13)(cid:13) ( x + − x ∗ ) − α (cid:18) ∇ f ( x ∗ ) + Z p = p = ∇ f ( x ∗ + p ( x + − x ∗ ))( x + − x ∗ ) dp (cid:19)(cid:13)(cid:13)(cid:13)(cid:13) (85) = (cid:13)(cid:13)(cid:13)(cid:13) ( x + − x ∗ ) − α Z p = p = ∇ f ( x ∗ + p ( x + − x ∗ )) dp ( x + − x ∗ ) (cid:13)(cid:13)(cid:13)(cid:13) (86) = (cid:13)(cid:13)(cid:13)(cid:13)(cid:18) I − α Z p = p = ∇ f ( x ∗ + p ( x + − x ∗ )) dp (cid:19) ( x + − x ∗ ) (cid:13)(cid:13)(cid:13)(cid:13) (87) = (cid:13)(cid:13)(cid:13)(cid:13)(cid:18) I − α Z p = p = ∇ f ( x ∗ + p ( x + − x ∗ )) dp (cid:19)(cid:18) I − α Z p = p = ∇ f ( x ∗ + p ( x − x ∗ )) dp (cid:19) ( x − x ∗ ) (cid:13)(cid:13)(cid:13)(cid:13) (88) = (cid:13)(cid:13)(cid:13)(cid:13)(cid:18) I − α Z p = p = ∇ f ( x ∗ + p ( x − x ∗ )) dp − α P ( x ) (cid:19)(cid:18) I − α Z p = p = ∇ f ( x ∗ + p ( x − x ∗ )) dp (cid:19) ( x − x ∗ ) (cid:13)(cid:13)(cid:13)(cid:13) (89)where in the last step we used the following substitution: Z p = p = ∇ f ( x ∗ + p ( x + − x ∗ )) dp = Z p = p = ∇ f ( x ∗ + p ( x − x ∗ )) dp + P ( x ) . (90)and we have that k P ( x ) k = O ( k ∇ f ( x ) k ) which can be verified from Assumption A3 . Rearranging (90) and taking norm both sides weget: k P ( x ) k = (cid:13)(cid:13)(cid:13)(cid:13) Z p = p = (cid:18) ∇ f ( x ∗ + p ( x + − x ∗ )) − ∇ f ( x ∗ + p ( x − x ∗ )) (cid:19) dp (cid:13)(cid:13)(cid:13)(cid:13) (91) ≤ Z p = p = (cid:13)(cid:13)(cid:13)(cid:13)(cid:18) ∇ f ( x ∗ + p ( x + − x ∗ )) − ∇ f ( x ∗ + p ( x − x ∗ )) (cid:19)(cid:13)(cid:13)(cid:13)(cid:13) dp (92) ≤ Z p = p = M (cid:13)(cid:13) p ( x + − x ) (cid:13)(cid:13) dp (93) = M (cid:13)(cid:13) x + − x (cid:13)(cid:13) Z p = p = pdp = M α k ∇ f ( x ) k . (94)Now recall that D ( x ) = (cid:18) I − α R p = p = ∇ f ( x ∗ + p ( x − x ∗ )) dp (cid:19) hence further simplifying (89) yields the following: (cid:13)(cid:13) x ++ − x ∗ (cid:13)(cid:13) = (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:18) D ( x ) (cid:19) ( x − x ∗ ) − α (cid:18) D ( x ) P ( x )( x − x ∗ ) (cid:19)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) (95) ≥ s(cid:18) ∑ j ∈ I US ( ν usj ) ( h ˆ u , e usj i ) + ∑ i ∈ I S ( ν si ) ( h ˆ u , e si i ) (cid:19) k x − x ∗ k − α k D ( x ) k k P ( x ) k k x − x ∗ k (96) ≥ s(cid:18) ∑ j ∈ I US ( ν usj ) ( h ˆ u , e usj i ) + ∑ i ∈ I S ( ν si ) ( h ˆ u , e si i ) (cid:19) k x − x ∗ k − sup j { ν usj } M α k ∇ f ( x ) kk x − x ∗ k ≥ s(cid:18) ∑ j ∈ I US ( ν usj ) ( h ˆ u , e usj i ) + ∑ i ∈ I S ( ν si ) ( h ˆ u , e si i ) (cid:19) k x − x ∗ k − sup j { ν usj } ML α k x − x ∗ k k ∇ f ( x ) k ≤ L k x − x ∗ k by Lipschitz continuity of ∇ f ( x ) . Now with σ ( x ) = sup j { ν usj } ML α k x − x ∗ k = O ( k x − x ∗ k ) we are left to prove: s(cid:18) ∑ j ∈ I US ( ν usj ) ( h ˆ u , e usj i ) + ∑ i ∈ I S ( ν si ) ( h ˆ u , e si i ) (cid:19) k x − x ∗ k > (cid:13)(cid:13) x + − x ∗ (cid:13)(cid:13) oundary Conditions for Trajectories Around Saddle Points 27or equivalently the following result: s(cid:18) ∑ j ∈ I US ( ν usj ) ( h ˆ u , e usj i ) + ∑ i ∈ I S ( ν si ) ( h ˆ u , e si i ) (cid:19) k u k > s(cid:18) ∑ j ∈ I US ( ν usj h ˆ u , e usj i ) + ∑ i ∈ I S ( ν si h ˆ u , e si i ) (cid:19) k u k (99) s(cid:18) ∑ j ∈ I US ( ν usj ) ( h ˆ u , e usj i ) + ∑ i ∈ I S ( ν si ) ( h ˆ u , e si i ) (cid:19) > s(cid:18) ∑ j ∈ I US ( ν usj h ˆ u , e usj i ) + ∑ i ∈ I S ( ν si h ˆ u , e si i ) (cid:19) . 
(100)This will hold true if: s(cid:18) ∑ j ∈ I US ( ν usj h ˆ u , e usj i ) + ∑ i ∈ I S ( ν si h ˆ u , e si i ) (cid:19) > . (101)Recall that ( e usj , ν usj ) , ( e sj , ν sj ) are respectively the eigenvector-eigenvalue pair of the matrix D ( x ) = (cid:18) I − α R p = p = ∇ f ( x ∗ + p ( x − x ∗ )) dp (cid:19) with ν si < i ∈ I S , ν usj ≥ j ∈ I US . Then the condition (100) can be written as: q h ˆ u , ( D ( x )) ˆ u i > q h ˆ u , ( D ( x )) ˆ u i (102) = ⇒ h ˆ u , ( D ( x )) ˆ u i > h ˆ u , ( D ( x )) ˆ u i (103)where ˆ u is a unit vector. Also we are given (81) that can be written as: q h ˆ u , ( D ( x )) ˆ u i > = p h ˆ u , ˆ u i (104) = ⇒ h ˆ u , (( D ( x )) − I ) ˆ u i > . (105)Now consider the following difference: h ˆ u , ( D ( x )) ˆ u i − h ˆ u , ( D ( x )) ˆ u i = h ˆ u , (( D ( x )) − I ) ˆ u i | {z } ≥ + h ˆ u , (( D ( x )) − I ) ˆ u i | {z } > > = ⇒ h ˆ u , ( D ( x )) ˆ u i > h ˆ u , ( D ( x )) ˆ u i (107)which completes the proof for part a from (82). (cid:4) Part b. k x ++ − x ∗ k ≥ ¯ ρ ( x ) k x + − x ∗ k − σ ( x ) Proof.
Recall that from (79) we have that: (cid:13)(cid:13) x + − x ∗ (cid:13)(cid:13) = s(cid:18) ∑ j ∈ I US ( ν usj h ˆ u , e usj i ) + ∑ i ∈ I S ( ν si h ˆ u , e si i ) (cid:19) k x − x ∗ k (108) = q h ˆ u , ( D ( x )) ˆ u ik x − x ∗ k (109)Now from (98) we have the following: (cid:13)(cid:13) x ++ − x ∗ (cid:13)(cid:13) ≥ s(cid:18) ∑ j ∈ I US ( ν usj ) ( h ˆ u , e usj i ) + ∑ i ∈ I S ( ν si ) ( h ˆ u , e si i ) (cid:19) k x − x ∗ k − O ( k x − x ∗ k ) (110) = q h ˆ u , ( D ( x )) ˆ u ik x − x ∗ k − σ ( x ) (111) = p h ˆ u , ( D ( x )) ˆ u i p h ˆ u , ( D ( x )) ˆ u i q h ˆ u , ( D ( x )) ˆ u ik x − x ∗ k − σ ( x ) (112) = p h ˆ u , ( D ( x )) ˆ u i p h ˆ u , ( D ( x )) ˆ u i (cid:13)(cid:13) x + − x ∗ (cid:13)(cid:13) − σ ( x ) (113)where in the last step we used the substitution from (109). Next, note that h ˆ u , ( D ( x )) ˆ u i > h ˆ u , ( D ( x )) ˆ u i > ρ ( x ) = √ h ˆ u , ( D ( x )) ˆ u i √ h ˆ u , ( D ( x )) ˆ u i > (cid:4) Sequential Monotonicity
Note that we started with the condition that k x + − x ∗ k > k x − x ∗ k so as to prove the above two claims. In order to apply the result of thistheorem inductively for a sequence { x k } generated by the gradient descent method, we need this condition to hold for a sub-sequence of { x k } . It can be done using Part b. of the result where we lower bound the right hand side of Part b. with k x + − x ∗ k to get: (cid:13)(cid:13) x ++ − x ∗ (cid:13)(cid:13) ≥ ¯ ρ ( x ) (cid:13)(cid:13) x + − x ∗ (cid:13)(cid:13) − σ ( x ) > (cid:13)(cid:13) x + − x ∗ (cid:13)(cid:13) (114) = ⇒ ( ¯ ρ ( x ) − ) (cid:13)(cid:13) x + − x ∗ (cid:13)(cid:13) > σ ( x ) . (115)Since σ ( x ) = O ( k x − x ∗ k ) , hence k x − x ∗ k should be sufficiently small for (115) to hold. Now, if (115) condition holds true, then wewill have the condition (cid:13)(cid:13) x ++ − x ∗ (cid:13)(cid:13) ≥ ¯ ρ ( x ) (cid:13)(cid:13) x + − x ∗ (cid:13)(cid:13) − σ ( x ) > (cid:13)(cid:13) x + − x ∗ (cid:13)(cid:13) or equivalently k x ++ − x ∗ k > k x + − x ∗ k . Next, for some k = K let x = x K , x + = x K + , x ++ = x K + and we have k x K + − x ∗ k > k x K − x ∗ k with the condition (115) satisfied, then we also have k x K + − x ∗ k > k x K + − x ∗ k . Using induction, we then get k x k + − x ∗ k > k x k − x ∗ k for all k ≥ K + x = x k .Hence, the claim of sequential monotonicity has been proved partially, i.e., if a gradient trajectory has expansive dynamics w.r.t.stationary point x ∗ at some k = K , then it has expansive dynamics for all iterations k > K provided k x k − x ∗ k remains bounded above. Now, we are only left with proving the complete claim, i.e., sequential monotonicity holds even if the gradient trajectory has non-contraction dynamics w.r.t. stationary point x ∗ at some k = K . Before completing the proof of this claim, we need to do some analysisof the expansion factor ¯ ρ ( x ) . Comments on ¯ ρ ( x ) for Well-Conditioned Problems From the condition (115), we require σ ( x ) to be upper bounded. Notice that the upper bound on σ ( x ) goes to 0 as ¯ ρ ( x ) approaches1. Then, the particular theorem cannot be applied recursively since σ ( x ) is a positive quantity that comes from (98) and (115) wouldthen fail to hold. Hence, in order to exploit the property (115), we require ¯ ρ ( x ) to be bounded away from 1. Using (109) in (115) andsimplifying ¯ ρ ( x ) , we get that: ( ¯ ρ ( x ) − ) (cid:13)(cid:13) x + − x ∗ (cid:13)(cid:13) > σ ( x ) (116) = ⇒ (cid:18) p h ˆ u , ( D ( x )) ˆ u i p h ˆ u , ( D ( x )) ˆ u i − (cid:19)q h ˆ u , ( D ( x )) ˆ u ik x − x ∗ k > σ ( x ) (117) = ⇒ (cid:18)q h ˆ u , ( D ( x )) ˆ u i − q h ˆ u , ( D ( x )) ˆ u i (cid:19) k x − x ∗ k > σ ( x ) (118)where we require the term (cid:18)p h ˆ u , ( D ( x )) ˆ u i − p h ˆ u , ( D ( x )) ˆ u i (cid:19) to be bounded away from 0. Since we have p h ˆ u , ( D ( x )) ˆ u i > p h ˆ u , ( D ( x )) ˆ u i , using the identity √ a − √ b > p a − b for some a > b > (cid:18)q h ˆ u , ( D ( x )) ˆ u i − q h ˆ u , ( D ( x )) ˆ u i (cid:19) > r h ˆ u , ( D ( x )) ˆ u i − h ˆ u , ( D ( x )) ˆ u i (119)where we require h ˆ u , ( D ( x )) ˆ u i > h ˆ u , ( D ( x )) ˆ u i . Next, recall that h ˆ u , ( D ( x )) ˆ u i = (cid:18) ∑ j ∈ I US ( ν usj ) ( h ˆ u , e usj i ) + ∑ i ∈ I S ( ν si ) ( h ˆ u , e si i ) (cid:19) and h ˆ u , ( D ( x )) ˆ u i = (cid:18) ∑ j ∈ I US ( ν usj h ˆ u , e usj i ) + ∑ i ∈ I S ( ν si h ˆ u , e si i ) (cid:19) . 
Substituting these in the left-hand side of (119) and simplifying yields: (cid:18)q h ˆ u , ( D ( x )) ˆ u i − q h ˆ u , ( D ( x )) ˆ u i (cid:19) > vuut(cid:18) ∑ j ∈ I US (cid:18) ( ν usj ) − (cid:19) ( ν usj h ˆ u , e usj i ) + ∑ i ∈ I S (cid:18) ( ν si ) − (cid:19) ( ν si h ˆ u , e si i ) (cid:19) (120) Notice that x ∗ can be any stationary point and not just the strict saddle point. Since the stationary points of the function are non-degenerate from our assumptions, the extension of this proof to other types of stationary points is left as an easy exercise to the reader.oundary Conditions for Trajectories Around Saddle Points 29Now recall that we had D ( x ) = (cid:18) I − α R p = p = ∇ f ( x ∗ + p ( x − x ∗ )) dp (cid:19) , hence for any eigenvalue ν l of the matrix D ( x ) where ν l = − αλ l ( R p = p = ∇ f ( x ∗ + p ( x − x ∗ )) dp ) and 1 ≤ l ≤ n with ν l ≥ λ l is the corresponding eigenvalue of R p = p = ∇ f ( x ∗ + p ( x − x ∗ )) dp ,we have that: (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:18) Z p = p = (cid:18) I − α∇ f ( x ∗ + p ( x − x ∗ )) (cid:19) dp (cid:19) − (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) − ≤ ν l ≤ (cid:13)(cid:13)(cid:13)(cid:13) Z p = p = (cid:18) I − α∇ f ( x ∗ + p ( x − x ∗ )) (cid:19) dp (cid:13)(cid:13)(cid:13)(cid:13) (121) Z p = p = (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:18) I − α∇ f ( x ∗ + p ( x − x ∗ )) (cid:19) − (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) − dp ≤ ν l ≤ Z p = p = (cid:13)(cid:13)(cid:13)(cid:13)(cid:18) I − α∇ f ( x ∗ + p ( x − x ∗ )) (cid:19)(cid:13)(cid:13)(cid:13)(cid:13) dp (122)1 − α Z p = p = sup l λ l ( ∇ f ( x ∗ + p ( x − x ∗ ))) dp ≤ ν l ≤ − α Z p = p = inf l λ l ( ∇ f ( x ∗ + p ( x − x ∗ ))) dp . (123)Therefore, the bounds on ν si and ν usj for α = L can be given by:1 − α Z p = p = sup λ l > λ l ( ∇ f ( x ∗ + p ( x − x ∗ ))) dp ≤ ν si ≤ − α Z p = p = inf λ l > λ l ( ∇ f ( x ∗ + p ( x − x ∗ ))) dp (124)1 − α Z p = p = Ldp ≤ ν si ≤ − α Z p = p = β dp (125)0 ≤ ν si ≤ − β L (126)1 − α Z p = p = sup λ l < λ l ( ∇ f ( x ∗ + p ( x − x ∗ ))) dp ≤ ν usj ≤ − α Z p = p = inf λ l < λ l ( ∇ f ( x ∗ + p ( x − x ∗ ))) dp (127)1 − α Z p = p = − β dp ≤ ν usj ≤ − α Z p = p = − Ldp (128)1 + β L ≤ ν usj ≤ l | λ l ( ∇ f ( x ∗ + p ( x − x ∗ ))) | > β , i.e., the minimum absolute eigenvalue of the function f ( · ) in a neighbor-hood of x ∗ is greater than β from Assumption A4 . Also, we used sup l | λ l ( ∇ f ( x ∗ + p ( x − x ∗ ))) | ≤ L , from Assumption A2 .Hence, the bound in (120) can be further lower bounded as: (cid:18)q h ˆ u , ( D ( x )) ˆ u i − q h ˆ u , ( D ( x )) ˆ u i (cid:19) > vuut(cid:18) ∑ j ∈ I US (cid:18) ( + β L ) − (cid:19) ( ν usj h ˆ u , e usj i ) − ∑ i ∈ I S ( ν si h ˆ u , e si i ) (cid:19) (130)where we used the fact that (cid:18) ( ν usj ) − (cid:19) ≥ ( + β L ) − (cid:18) ( ν si ) − (cid:19) ≥ −
1. Now using the given condition (101) on the right-handside of (130), we have: (cid:18)q h ˆ u , ( D ( x )) ˆ u i − q h ˆ u , ( D ( x )) ˆ u i (cid:19) > vuut(cid:18) ( + β L ) ∑ j ∈ I US ( ν usj h ˆ u , e usj i ) − (cid:19) (131)Further solving the given condition (101) to obtain lower bound on ∑ j ∈ I US ( ν usj h ˆ u , e usj i ) we obtain: ∑ j ∈ I US ( ν usj h ˆ u , e usj i ) > − ∑ i ∈ I S ( ν si h ˆ u , e si i ) (132) ≥ − (cid:18) − β L (cid:19) ∑ i ∈ I S ( h ˆ u , e si i ) (133) > − (cid:18) − β L (cid:19) . (134)Here we used ( ν si ) ≤ (cid:18) − β L (cid:19) and ∑ i ∈ I S ( h ˆ u , e si i ) = − ∑ j ∈ I US ( h ˆ u , e usj i ) <
1. Substituting (134) in (131) yields: (cid:18)q h ˆ u , ( D ( x )) ˆ u i − q h ˆ u , ( D ( x )) ˆ u i (cid:19) > s(cid:18) ( + β L ) (cid:18) − (cid:18) − β L (cid:19) (cid:19) − (cid:19) (135)Now, it can be verified that for values of β L > .
57, the right-hand side of (135) is bounded away from 0 and is approximately greaterthan 0 . σ ( x ) can be upper bounded by 0 . k x − x ∗ k provided k x − x ∗ k is of order magnitude 10 − . Also, it can be readilychecked that if (135) is satisfied then h ˆ u , ( D ( x )) ˆ u i > h ˆ u , ( D ( x )) ˆ u i which was required while formulating (119).0 Rishabh Dixit, Waheed U. Bajwa Case of k x + − x ∗ k = k x − x ∗ k Notice that while obtaining (131) from (130), we utilized the given condition of (101) according to which we have: ∑ j ∈ I US ( ν usj h ˆ u , e usj i ) + ∑ i ∈ I S ( ν si h ˆ u , e si i ) > . This condition implies that we have k x + − x ∗ k > k x − x ∗ k . However, it could be the case that we have k x + − x ∗ k = k x − x ∗ k whichwould imply h ˆ u , ( D ( x )) ˆ u i = ∑ j ∈ I US ( ν usj h ˆ u , e usj i ) + ∑ i ∈ I S ( ν si h ˆ u , e si i ) = . Using this condition, it can be readily checked that (135) will still hold but only with a non-strict inequality, i.e., we will have: (cid:18)q h ˆ u , ( D ( x )) ˆ u i − q h ˆ u , ( D ( x )) ˆ u i (cid:19) ≥ s(cid:18) ( + β L ) (cid:18) − (cid:18) − β L (cid:19) (cid:19) − (cid:19) . Now since ¯ ρ ( x ) = √ h ˆ u , ( D ( x )) ˆ u i √ h ˆ u , ( D ( x )) ˆ u i = p h ˆ u , ( D ( x )) ˆ u i , we will have that:¯ ρ ( x ) ≥ q h ˆ u , ( D ( x )) ˆ u i + s(cid:18) ( + β L ) (cid:18) − (cid:18) − β L (cid:19) (cid:19) − (cid:19) (136) = + s(cid:18) ( + β L ) (cid:18) − (cid:18) − β L (cid:19) (cid:19) − (cid:19) . (137)Now if σ ( x ) satisfies the condition (115) for this ¯ ρ ( x ) then we are guaranteed to have k x ++ − x ∗ k > k x + − x ∗ k even when k x + − x ∗ k = k x − x ∗ k . This completes the proof of the claim. Radius for monotonic sequence {k x k − x ∗ k} : Now that we have established the fact that if k x + − x ∗ k ≥ k x − x ∗ k , then we are guaranteed to have k x ++ − x ∗ k > k x + − x ∗ k provided σ ( x ) satisfies the condition (115) and the problem is well-conditioned, we can apply it recursively for any gradient trajectory generatedby the sequence { x k } in some neighborhood of x ∗ . To identify the radius of this neighborhood, we use (115) where we substitute σ ( x ) from (98) and ¯ ρ ( x ) = √ h ˆ u , ( D ( x )) ˆ u i √ h ˆ u , ( D ( x )) ˆ u i to get the condition: ( ¯ ρ ( x ) − ) (cid:13)(cid:13) x + − x ∗ (cid:13)(cid:13) > σ ( x ) = sup j { ν usj } ML α k x − x ∗ k = ⇒ (cid:18)q h ˆ u , ( D ( x )) ˆ u i − q h ˆ u , ( D ( x )) ˆ u i (cid:19) k x − x ∗ k > σ ( x ) = sup j { ν usj } ML α k x − x ∗ k ς >
2, we set (cid:18)p h ˆ u , ( D ( x )) ˆ u i− p h ˆ u , ( D ( x )) ˆ u i (cid:19) equal to ς times its lowerbound from (135) and set σ ( x ) to its upper bound in (139) to get the condition:1 ς s(cid:18) ( + β L ) (cid:18) − (cid:18) − β L (cid:19) (cid:19) − (cid:19) k x − x ∗ k ≥ ML α k x − x ∗ k ≥ sup j { ν usj } ML α k x − x ∗ k = σ ( x ) (140)1 ς M s(cid:18) ( + β L ) (cid:18) − (cid:18) − β L (cid:19) (cid:19) − (cid:19) ≥ k x − x ∗ k (141)where we used α = L and the bound sup j { ν usj } = + α L ≤
2. Now for well-conditioned problems, if (141) is satisfied then the condition(139) will hold true. Hence any gradient descent trajectory with α = L inside the ball B ξ ( x ∗ ) will exhibit strictly monotonic expansivedynamics once it has a non-contractive dynamics at any instant. (cid:4) C Proof of Theorem 3
Proof.
From theorem 2 it was established that a gradient trajectory { x k } with x k ∈ B ξ ( x ∗ ) has expansive dynamics for all k > K if at k = K , the gradient trajectory has non-contraction dynamics . Let there be some k = K τ such that the sequence {k x k − x ∗ k} is strictly Note: here we assume that x ∈ ¯ B ξ ( x ∗ ) \ B ξ ( x ∗ ) .oundary Conditions for Trajectories Around Saddle Points 31decreasing for all k ≤ K τ and is non-decreasing for k = K τ . Then from Theorem 2 we have that {k x k − x ∗ k} is strictly increasing for all k > K τ provided x k ∈ B ξ ( x ∗ ) . Since k x K τ − x ∗ k is the minimum of the sequence {k x k − x ∗ k} with x k ∈ B ξ ( x ∗ ) , let k = K c and k = K e be the indices with K c ≤ K τ ≤ K e defined as follows: K c = sup (cid:26) k ≤ K τ (cid:12)(cid:12)(cid:12)(cid:12) x k ∈ ¯ B ξ ( x ∗ ) \ B ε ( x ∗ ) (cid:27) (142) K e = inf (cid:26) k ≥ K τ (cid:12)(cid:12)(cid:12)(cid:12) x k ∈ ¯ B ξ ( x ∗ ) \ B ε ( x ∗ ) (cid:27) . (143)Let the gradient trajectory exit the ball B ξ ( x ∗ ) at some iteration ˆ K exit . Then the total sojourn time for the gradient trajectory inside thecompact shell ¯ B ξ ( x ∗ ) \ B ε ( x ∗ ) is K c + ( ˆ K exit − K e ) . Bound on K c Since K c ≤ K τ , we have the condition that k x k − x ∗ k is monotonically decreasing for all 0 < k ≤ K c . However, even with the monotoni-cally decreasing sequence, it cannot be guaranteed that k x k − x ∗ k will decrease with a geometric rate. This can checked very easily from(106) in the proof of theorem 2. From that condition, we are guaranteed geometric expansion since the factor ¯ ρ ( x ) = √ h ˆ u , ( D ( x )) ˆ u i √ h ˆ u , ( D ( x )) ˆ u i > h ˆ u , ( D ( x )) ˆ u i − h ˆ u , ( D ( x )) ˆ u i = h ˆ u , (( D ( x )) − I ) ˆ u i | {z } ≥ + h ˆ u , (( D ( x )) − I ) ˆ u i | {z } > > k x + − x ∗ k > k x − x ∗ k or equivalently h ˆ u , (( D ( x )) − I ) ˆ u i >
0. Recall that k x + − x ∗ k = p h ˆ u , ( D ( x )) ˆ u ik x − x ∗ k from (109).However, when we have k x + − x ∗ k < k x − x ∗ k or equivalently h ˆ u , (( D ( x )) − I ) ˆ u i < h ˆ u , ( D ( x )) ˆ u i − h ˆ u , ( D ( x )) ˆ u i = h ˆ u , (( D ( x )) − I ) ˆ u i | {z } ≥ + h ˆ u , (( D ( x )) − I ) ˆ u i | {z } < ≶ ρ ( x ) < k x + − x ∗ k < k x − x ∗ k . Hence, the best one can achieve interms of contraction rate is K c = O ( ε − ) rate for a non-strict saddle setting and K c = O ( log ( ε − ) + c ) rate when β is bounded awayfrom 0, where ε = k x K c − x ∗ k in both cases. Intuitively, this is the best that can be done w.r.t. contraction rate for a non-accelerationmethod like gradient descent since the condition of monotonically decreasing sequence k x K − x ∗ k is somewhat similar to the convexcase where the convergence rate of K c = O ( ε − ) is recovered.We now develop a proof technique along lines similar to those of [29] that involves use of some envelope function ˆ f ( · ) for thefunction f ( · ) to generate contraction rates. To motivate the utility of ˆ f ( · ) , we start with some preliminaries required for K c = O ( ε − ) rate. Recall that from the section on convergence rates for convex functions in [29], for a given sequence { x k } generated by gradientdescent method on some convex function f ( · ) with α = L , we require Lipschitz continuity and convexity for O ( ε − ) convergence rates.The first property is the gradient Lipschitz continuity: f ( x k + ) ≤ f ( x k ) + h ∇ f ( x k ) , x k + − x k i + L k x k − x ∗ k (146) ≤ f ( x k ) − α h ∇ f ( x k ) , ∇ f ( x k ) i + L α k ∇ f ( x k ) k (147) ≤ f ( x k ) − L k ∇ f ( x k ) k (148)and the second property is convexity of f ( · ) with respect to x ∗ : f ( x k ) ≤ f ( x ∗ ) + h ∇ f ( x k ) , x k − x ∗ i . (149)Now, substituting (149) into (148), making perfect squares and summing up (148) from k = k = K − f ( x k + ) − f ( x ∗ ) ≤ h ∇ f ( x k ) , x k − x ∗ i − L k ∇ f ( x k ) k (150) = L (cid:18) k x k − x ∗ k − k x k + − x ∗ k (cid:19) (151) = ⇒ K − ∑ k = (cid:18) f ( x k + ) − f ( x ∗ ) (cid:19) ≤ L K − ∑ k = (cid:18) k x k − x ∗ k − k x k + − x ∗ k (cid:19) (152) = ⇒ K (cid:18) f ( x K ) − f ( x ∗ ) (cid:19) ≤ L (cid:18) k x − x ∗ k − k x K − x ∗ k (cid:19) (153) = ⇒ K ≤ L (cid:18) k x − x ∗ k − k x K − x ∗ k (cid:19) (cid:18) f ( x K ) − f ( x ∗ ) (cid:19) (154)2 Rishabh Dixit, Waheed U. Bajwawhere in the second last step we used the fact that the sequence { f ( x k ) } is monotonically decreasing. Notice that we already have (148)condition for f ( · ) and we only require (149) condition. Since f ( · ) is nonconvex, (149) may not hold true even with the condition ofmonotonically decreasing sequence k x K − x ∗ k . However, a variant of (149) can hold true for some envelope function ˆ f ( · ) of f ( · ) whereˆ f ( · ) is a lower bound envelope function. To this end let us define the sequence { ˆ f ( x k ) } as follows:ˆ f ( x k ) = f ( x k ) − s k k x k − x ∗ k (155)where we require that s k k x k − x ∗ k ≤ s k + k x k + − x ∗ k for all k and 0 < k ≤ K c . Notice that this condition on the sequence { s k } ensuresthat the sequence { ˆ f ( x k ) } decreases monotonically. Moreover if for a certain sequence { x k } we have that x k → x ∗ as K c → ∞ then s k → s ∞ where s ∞ is bounded. Therefore we can observe that f ( x ∗ ) = ˆ f ( x ∗ ) = lim x k → x ∗ ˆ f ( x k ) . 
Now using the monotonicity of k x k − x ∗ k , we getthat: k x k − x ∗ k > k x k + − x ∗ k (156) = ⇒ k x k − x ∗ k > k x k − x ∗ k + L k ∇ f ( x k ) k + L h ∇ f ( x k ) , x ∗ − x k i (157) = ⇒ > − L k ∇ f ( x k ) k > h ∇ f ( x k ) , x ∗ − x k i . (158)Next, using gradient Lipschitz condition, we have the following bound: −h ∇ f ( x k ) , x ∗ − x k i − L k x k − x ∗ k ≤ f ( x k ) − f ( x ∗ ) ≤ L k x k − x ∗ k (159) = ⇒ − L k x k − x ∗ k ≤ f ( x k ) − f ( x ∗ ) ≤ L k x k − x ∗ k (160)where we dropped the term −h ∇ f ( x k ) , x ∗ − x k i using (158) in the last step. Subtracting s k k x k − x ∗ k from (160) yields: − L k x k − x ∗ k − s k k x k − x ∗ k ≤ f ( x k ) − s k k x k − x ∗ k − f ( x ∗ ) ≤ L k x k − x ∗ k − s k k x k − x ∗ k (161) = ⇒ − L k x k − x ∗ k − s k k x k − x ∗ k ≤ ˆ f ( x k ) − ˆ f ( x ∗ ) ≤ L k x k − x ∗ k − s k k x k − x ∗ k . (162)Next, taking norm on (74), using the substitution G = ∇ f ( x ∗ + p ( x − x ∗ )) followed by taking the lower bound yields: k ∇ f ( x ) k = (cid:13)(cid:13)(cid:13)(cid:13)(cid:18) Z p = p = ∇ f ( x ∗ + p ( x − x ∗ )) dp (cid:19) ( x − x ∗ ) (cid:13)(cid:13)(cid:13)(cid:13) (163) = ⇒ k ∇ f ( x ) k ≥ (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:18) Z p = p = ∇ f ( x ∗ + p ( x − x ∗ )) dp (cid:19) − (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) − k x − x ∗ k (164) = ⇒ k ∇ f ( x ) k ≥ (cid:18) Z p = p = (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:18) ∇ f ( x ∗ + p ( x − x ∗ )) (cid:19) − (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) − dp (cid:19) k x − x ∗ k (165) = ⇒ k ∇ f ( x ) k ≥ (cid:18) Z p = p = λ min (cid:26) √ GG T (cid:27) dp (cid:19) k x − x ∗ k (166) = ⇒ k ∇ f ( x ) k ≥ β k x − x ∗ k . (167)Substituting (167) into (158) yields: β L k x − x ∗ k < L k ∇ f ( x k ) k < h ∇ f ( x k ) , x k − x ∗ i (168)Now if we have the condition L k x k − x ∗ k − s k k x k − x ∗ k < β L k x k − x ∗ k , then combining (162) and (168) yields:ˆ f ( x k ) − ˆ f ( x ∗ ) ≤ L k x k − x ∗ k − s k k x k − x ∗ k < β L k x k − x ∗ k < L k ∇ f ( x k ) k < h ∇ f ( x k ) , x k − x ∗ i (169) = ⇒ ˆ f ( x k ) − ˆ f ( x ∗ ) < h ∇ f ( x k ) , x k − x ∗ i . (170)The condition (170) holds whenever we have that: L k x k − x ∗ k − s k k x k − x ∗ k < β L k x k − x ∗ k (171) = ⇒ (cid:18) L − β L (cid:19) k x k − x ∗ k < s k k x k − x ∗ k (172) = ⇒ k x k − x ∗ k < s k (cid:18) LL − β (cid:19) . (173)oundary Conditions for Trajectories Around Saddle Points 33Since s k + k x k + − x ∗ k ≥ s k k x k − x ∗ k and k x k − x ∗ k > k x k + − x ∗ k , we require s k + > s k for all k ≤ K c , i.e., { s k } is an increasingsequence. Therefore to guarantee (173) for any feasible k , we must have that: k x k − x ∗ k < s (cid:18) LL − β (cid:19) . (174)Next, subtracting s k k x k − x ∗ k from both sides of (148) yields: f ( x k + ) − s k k x k − x ∗ k ≤ f ( x k ) − s k k x k − x ∗ k − L k ∇ f ( x k ) k (175) = ⇒ f ( x k + ) − s k + k x k + − x ∗ k ≤ f ( x k ) − s k k x k − x ∗ k − L k ∇ f ( x k ) k (176) = ⇒ ˆ f ( x k + ) ≤ ˆ f ( x k ) − L k ∇ f ( x k ) k (177)where we used the fact that s k + k x k + − x ∗ k ≥ s k k x k − x ∗ k .Finally using (170) and (177) along the lines of inequalities (150)-(154) for K = K c yields: K c ≤ L (cid:18) k x − x ∗ k − k x K c − x ∗ k (cid:19) (cid:18) ˆ f ( x K c ) − ˆ f ( x ∗ ) (cid:19) = L (cid:18) ( ξ ) − k x K c − x ∗ k (cid:19) (cid:18) f ( x K c ) − f ( x ∗ ) − s K c k x K c − x ∗ k (cid:19) (178)where we used k x − x ∗ k = ξ and we have that ε ≤ k x K c − x ∗ k ≤ ξ from the definition of K c . 
It should be noted that the above bound isonly valid if f ( x K c ) − f ( x ∗ ) − s K c k x K c − x ∗ k >
0. Now for k x K c − x ∗ k = ε , (178) becomes: K c ≤ L (cid:18) ( ξ ) − ε (cid:19) (cid:18) f ( x K c ) − f ( x ∗ ) − s K c ε (cid:19) (179)where we have that x K c ∈ ¯ B ε ( x ∗ ) \ B ε ( x ∗ ) and f ( x K c ) − f ( x ∗ ) − s K c ε > K c may not be optimal in the sense that we also have the Polyak-Łojasiewicz condition on ˆ f ( . ) thatcould possibly give a better contraction rate. Recall from (169) we have that:ˆ f ( x k ) − ˆ f ( x ∗ ) ≤ L k x k − x ∗ k − s k k x k − x ∗ k < β L k x k − x ∗ k < L k ∇ f ( x k ) k ≤ β k ∇ f ( x k ) k (180) = ⇒ ˆ f ( x k ) − ˆ f ( x ∗ ) ≤ β k ∇ f ( x k ) k . (181)Combining (177) and (181) followed by using (169) for k = f ( x k + ) − ˆ f ( x k ) ≤ − β L (cid:18) ˆ f ( x k ) − ˆ f ( x ∗ ) (cid:19) (182)ˆ f ( x k + ) − ˆ f ( x ∗ ) ≤ (cid:18) − β L (cid:19)(cid:18) ˆ f ( x k ) − ˆ f ( x ∗ ) (cid:19) (183)ˆ f ( x K c ) − ˆ f ( x ∗ ) ≤ (cid:18) − β L (cid:19) K c (cid:18) ˆ f ( x ) − ˆ f ( x ∗ ) (cid:19) ≤ (cid:18) − β L (cid:19) K c β L k x − x ∗ k . (184)Finally substituting k x − x ∗ k = ξ and k x K c − x ∗ k = ε gives: K c ≤ log (cid:18) ˆ f ( x Kc ) − ˆ f ( x ∗ ) β L ( ξ ) (cid:19) log (cid:18) − β L (cid:19) = log (cid:18) L ( f ( x Kc ) − f ( x ∗ ) − s Kc k x Kc − x ∗ k )( βξ ) (cid:19) log (cid:18) − β L (cid:19) = log (cid:18) L ( f ( x Kc ) − f ( x ∗ ) − s Kc ε )( βξ ) (cid:19) log (cid:18) − β L (cid:19) (185)Combining (185) with (179) gives the bound: K c ≤ min ( L (cid:18) ( ξ ) − ε (cid:19) (cid:18) f ( x K c ) − f ( x ∗ ) − s K c ε (cid:19) , log (cid:18) L ( f ( x Kc ) − f ( x ∗ ) − s Kc ε )( βξ ) (cid:19) log (cid:18) − β L (cid:19) ) (186)when f ( x K c ) − f ( x ∗ ) − s K c ε > Note on the Optimality of Contraction Rate K c Recall that we utilized the Polyak-Łojasiewicz condition on ˆ f ( . ) to obtain linear contraction rate (185) w.r.t. envelope function ˆ f ( . ) .However it should be noted that the linear contraction rates can also be recovered by applying the Polyak-Łojasiewicz condition on f ( . ) ,however this would come at the cost of a poorer contraction factor.Using gradient Lipschitz condition on f ( · ) for x k and x ∗ along with (167) we get: f ( x k ) − f ( x ∗ ) ≤ L k x k − x ∗ k ≤ L β k ∇ f ( x k ) k (187)Using gradient Lipschitz condition on f ( · ) for x k and x k + where x k + = x k − L ∇ f ( x k ) followed by (187) gives: f ( x k + ) − f ( x k ) ≤ − L k ∇ f ( x k ) k ≤ − β L ( f ( x k ) − f ( x ∗ )) (188) = ⇒ f ( x k + ) − f ( x ∗ ) ≤ (cid:18) − β L (cid:19)(cid:18) f ( x k ) − f ( x ∗ ) (cid:19) (189) = ⇒ f ( x K c ) − f ( x ∗ ) ≤ (cid:18) − β L (cid:19) K c (cid:18) f ( x ) − f ( x ∗ ) (cid:19) (190) = ⇒ K c ≤ log ( f ( x K c ) − f ( x ∗ )) − log ( f ( x ) − f ( x ∗ )) log (cid:18) − β L (cid:19) . (191)Now the contraction factor (cid:18) − β L (cid:19) obtained above is larger than the contraction factor (cid:18) − β L (cid:19) from (185) hence not optimal.On the other hand, the numerator term from (185) is better compared to that from (191) due to the fact that we used the envelopefunction sequence ˆ f ( x k ) = f ( x k ) − s k k x k − x ∗ k where it is required that s k k x k − x ∗ k ≤ s k + k x k + − x ∗ k . 
To improve the numerator termof (185) we can set the condition s k k x k − x ∗ k = s k + k x k + − x ∗ k for all 0 ≤ k ≤ K c thereby obtaining the sequence { s k } as follows: s k = k x − x ∗ kk x K c − x ∗ kk x k − x ∗ k (192)which increases monotonically for all 0 ≤ k ≤ K c provided k x K c − x ∗ k 6 = Bound on ˆ K exit − K e Recall that from (83) in theorem 2 we have k x ++ − x ∗ k > ¯ ρ ( x ) k x + − x ∗ k − σ ( x ) whenever k x + − x ∗ k ≥ k x − x ∗ k . Now for K e ≤ k ≤ ˆ K exit , the sequence {k x k − x ∗ k} is non-decreasing from the definition of K e . Hence, (83) holds for all such x k which have K e ≤ k ≤ ˆ K exit .Using (83) with x + = x k − and x ++ = x k for K e + ≤ k ≤ ˆ K exit yields: k x k − x ∗ k > ¯ ρ ( x k − ) k x k − − x ∗ k − σ ( x k − ) (193) k x k − x ∗ k > ¯ ρ ( x k − ) k x k − − x ∗ k − M k x k − − x ∗ k (194) k x k − x ∗ k + M k x k − − x ∗ k > ¯ ρ ( x k − ) k x k − − x ∗ k (195) k x k − x ∗ k + M k x k − x ∗ k > ¯ ρ ( x k − ) k x k − − x ∗ k (196) k x k − x ∗ k > ¯ ρ ( x k − ) + M k x k − x ∗ k k x k − − x ∗ k > ¯ ρ ( x k − ) + M ξ k x k − − x ∗ k (197)where we used the bound on σ ( x ) from (98) given by σ ( x k − ) = M k x k − − x ∗ k ≤ M ( ξ ) followed by the condition k x k − x ∗ k > k x k − − x ∗ k arising from the fact that {k x k − x ∗ k} is a monotonically increasing sequence for K e + ≤ k ≤ ˆ K exit and finally the substi-oundary Conditions for Trajectories Around Saddle Points 35tution (cid:13)(cid:13)(cid:13) x ˆ K exit − x ∗ (cid:13)(cid:13)(cid:13) = ξ . Now applying the bound (197) recursively for K e + ≤ k ≤ ˆ K exit yields: (cid:13)(cid:13)(cid:13) x ˆ K exit − x ∗ (cid:13)(cid:13)(cid:13) > ˆ K exit − ∏ k = K e + ¯ ρ ( x k − ) + M ξ k x K e + − x ∗ k (198) (cid:13)(cid:13)(cid:13) x ˆ K exit − x ∗ (cid:13)(cid:13)(cid:13) > (cid:18) inf K e + ≤ k ≤ ˆ K exit { ¯ ρ ( x k − ) } + M ξ (cid:19) ˆ K exit − K e − k x K e + − x ∗ k (199)ˆ K exit − K e − < log (cid:18)(cid:13)(cid:13)(cid:13) x ˆ K exit − x ∗ (cid:13)(cid:13)(cid:13)(cid:19) − log (cid:18) k x K e + − x ∗ k (cid:19) log (cid:18) inf { ¯ ρ ( x k − ) } + M ξ (cid:19) < log ( ξ ) − log ( ε ) log (cid:18) inf { ¯ ρ ( x k − ) } + M ξ (cid:19) (200)ˆ K exit − K e < log ( ξ ) − log ( ε ) log (cid:18) inf { ¯ ρ ( x k − ) } + M ξ (cid:19) + (cid:13)(cid:13)(cid:13) x ˆ K exit − x ∗ (cid:13)(cid:13)(cid:13) = ξ , k x K e + − x ∗ k ≥ ε and the range of infimum is omitted after second step. Note that werequire the condition (cid:18) inf { ¯ ρ ( x k − ) } + M ξ (cid:19) >
1, however this is trivially satisfied which can be easily checked from (135) and (141). Noticethat from (135), we have the following: (cid:18)q h ˆ u , ( D ( x )) ˆ u i − q h ˆ u , ( D ( x )) ˆ u i (cid:19) > s(cid:18) ( + β L ) (cid:18) − (cid:18) − β L (cid:19) (cid:19) − (cid:19) (202) (cid:18) p h ˆ u , ( D ( x )) ˆ u i p h ˆ u , ( D ( x )) ˆ u i − (cid:19) > s(cid:18) ( + β L ) (cid:18) − (cid:18) − β L (cid:19) (cid:19) − (cid:19)p h ˆ u , ( D ( x )) ˆ u i (203)¯ ρ ( x ) > + s(cid:18) ( + β L ) (cid:18) − (cid:18) − β L (cid:19) (cid:19) − (cid:19)p h ˆ u , ( D ( x )) ˆ u i (204)¯ ρ ( x ) > + s(cid:18) ( + β L ) (cid:18) − (cid:18) − β L (cid:19) (cid:19) − (cid:19) (205)where we used the fact that p h ˆ u , ( D ( x )) ˆ u i ≤ ( D ( x )) is 4.Now for ξ ≤ ς M s(cid:18) ( + β L ) (cid:18) − (cid:18) − β L (cid:19) (cid:19) − (cid:19) where ς >
2, we get the condition:¯ ρ ( x ) + M ξ > + s(cid:18) ( + β L ) (cid:18) − (cid:18) − β L (cid:19) (cid:19) − (cid:19) + ς s(cid:18) ( + β L ) (cid:18) − (cid:18) − β L (cid:19) (cid:19) − (cid:19) > . (206)Finally adding (186) and (201), we get the following bound: K shell ≤ min ( L (cid:18) ( ξ ) − ε (cid:19) (cid:18) f ( x K c ) − f ( x ∗ ) − s K c ε (cid:19) , log (cid:18) L ( f ( x Kc ) − f ( x ∗ ) − s Kc ε )( βξ ) (cid:19) log (cid:18) − β L (cid:19) ) + log ( ξ ) − log ( ε ) log (cid:18) inf { ¯ ρ ( x k − ) } + M ξ (cid:19) + K shell = K c + ˆ K exit − K e , x K c ∈ ¯ B ε ( x ∗ ) \ B ε ( x ∗ ) , f ( x K c ) − f ( x ∗ ) − s K c ε > { s k } is any increasing sequence for 0 ≤ k ≤ K c that satisfies the condition s k + k x k + − x ∗ k ≥ s k k x k − x ∗ k for the strictly decreasing sequence {k x k − x ∗ k} . Bounds on ξ From conditions (141) and (174) we have that: ξ ≤ min (cid:26) ς M s(cid:18) ( + β L ) (cid:18) − (cid:18) − β L (cid:19) (cid:19) − (cid:19) , (cid:18) Ls L − β (cid:19)(cid:27) (208)where ς > (cid:4) D Proof of Additional Lemmas
Proof of Lemma 1
Proof.
For values of ε upper bounded by Theorem 1, we have the following approximation: ∇ f ( x ) = (cid:18) Z p = p = ∇ f ( x ∗ + p ( x − x ∗ )) dp (cid:19) ( x − x ∗ ) (209) = ⇒ ∇ f ( x ) = ∇ f ( x ∗ )( x − x ∗ ) + (cid:18) Z p = p = ∇ f ( x ∗ + p ( x − x ∗ )) dp − ∇ f ( x ∗ ) (cid:19)| {z } F ( x − x ∗ ) (210) = ⇒ ∇ f ( x ) = ∇ f ( x ∗ )( x − x ∗ ) + O ( ε ) ≈ ∇ f ( x ∗ )( x − x ∗ ) (211)where in the last step we used the bound k x − x ∗ k ≤ ε and Lipschitz continuity of the Hessian: k F k = (cid:13)(cid:13)(cid:13)(cid:13)(cid:18) Z p = p = ∇ f ( x ∗ + p ( x − x ∗ )) dp − ∇ f ( x ∗ ) (cid:19)(cid:13)(cid:13)(cid:13)(cid:13) (212) ≤ (cid:18) Z p = p = (cid:13)(cid:13) ∇ f ( x ∗ + p ( x − x ∗ )) − ∇ f ( x ∗ ) (cid:13)(cid:13) dp (cid:19) (213) ≤ M k x − x ∗ k ≤ M ε . (214)Hence we can write ∇ f ( x k ) = (cid:18) ∇ f ( x ∗ ) + O ( ε ) (cid:19) u k ≈ ∇ f ( x ∗ ) u k (ignoring O ( ε ) perturbation terms in the Hessian ∇ f ( x ∗ ) ) where u k = x k − x ∗ . Using the iterative update u k + − u k = − α∇ f ( x k ) we can express k u k k as follows: k u k k = (cid:13)(cid:13)(cid:13)(cid:13)(cid:18) I − α∇ f ( x ∗ ) + O ( ε ) (cid:19) u k − (cid:13)(cid:13)(cid:13)(cid:13) (215) = (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:18) I − α∇ f ( x ∗ ) (cid:19) k u + O ( k ε ) u (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) (216) ≈ (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:18) I − α∇ f ( x ∗ ) (cid:19) k u (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) (217)where O ( k ε ) term is neglected w.r.t. the term (cid:18) I − α∇ f ( x ∗ ) (cid:19) k because k < K exit / O ( log ( ε − )) which implies k ε ≪
1. Now, if u has a projection close to 0 on the unstable subspace of ∇ f ( x ∗ ) , then k u k k first approximately decreases exponentially such that x k reaches some x critical and from there onward it approximately increases exponentially until saddle region is escaped. For the casewhen x critical → x ∗ , we will have k x critical − x ∗ k →
0. The escape time for the ε –precision trajectories from this region B ε ′ ( x ∗ ) where ε ′ = k x critical − x ∗ k will be upper bounded by K < O ( log ( ε ′− )) from (8). This upper bound goes to infinity when ε ′ → ε –precision trajectories fail to escape the saddle neighborhood when x critical = x ∗ . It should also be noted that if for some k , u k = or inother words x critical = x ∗ , then for all l > k we have u l = since ∇ f ( x l ) = and the gradient trajectory can never escape the saddleregion. (cid:4) Proof of Lemma 2
Proof.
If a gradient trajectory curves around x ∗ then the vectors u and u k will form an obtuse angle for some finite values of k . Thereforein order to prove the first part, it is sufficient to show that: h u k , u i ≥ k such that k u k k < ε . Now, for sufficiently small ε where ε is upper bounded by Theorem 1, we can have u k =( I − α∇ f ( x ∗ )) k u + O ( k ε ) u ≈ ( I − α∇ f ( x ∗ )) k u where we dropped O ( k ε ) term as in the proof of Lemma 1 since k < K exit / O ( log ( ε − )) . Using this approximate u k we get: h u k , u i ≈ u T ( I − α∇ f ( x ∗ )) k u ≥ ( I − α∇ f ( x ∗ )) k will be a positive semi-definite matrix for α ≤ L . Therefore, vectors u and u k will form an acute angle between them for all values of k such that k u k k < ε . Hence, the trajectory can never curve around x ∗ .oundary Conditions for Trajectories Around Saddle Points 37The proof for second part follows the same method. Let us take any two points on the gradient trajectory denoted by vectors u k and u k w.r.t. stationary point x ∗ . Then we have the following inner product: h u k , u k i ≈ h u , ( I − α∇ f ( x ∗ )) k + k u i ≥ k + k ≤ O ( log ( ε − )) . Now with h u k , u k i ' k , k where k + k ≤ O ( log ( ε − )) such that (cid:13)(cid:13) u k (cid:13)(cid:13) < ε and (cid:13)(cid:13) u k (cid:13)(cid:13) < ε , theangle between the vectors u k and u k can never approximately exceed π . Hence the entire gradient descent trajectory approximatelylies inside some orthant of the ball B ε ( x ∗ ) . (cid:4) Proof of Lemma 3
Proof.
Let us denote the exit point on the ball B ε ( x ∗ ) by x K + where k x K − x ∗ k ≤ ε and k x K + − x ∗ k > ε . Now applying the HessianLipschitz condition around x ∗ for x K , we get the following: f ( x K ) ≤ f ( x ∗ ) + h ∇ f ( x ∗ ) , x K − x ∗ i + h ( x K − x ∗ ) , ∇ f ( x ∗ )( x K − x ∗ ) i + M k x K − x ∗ k (220) ≤ f ( x ∗ ) + h x K − x ∗ , ∇ f ( x K ) i + (cid:28) ( x K − x ∗ ) , (cid:18) ∇ f ( x ∗ ) − ∇ f ( x ∗ ) − O ( ε ) (cid:19) ( x K − x ∗ ) (cid:29) + M k x K − x ∗ k (221) ≤ f ( x ∗ ) + h x K − x ∗ , ∇ f ( x K ) i + O ( ε ) (222)where we have used ∇ f ( x K ) = (cid:18) ∇ f ( x ∗ ) + O ( ε ) (cid:19) ( x K − x ∗ ) from Lemma 1 and substituted k x K − x ∗ k ≤ ε in the last step.Let us first analyze the term h x K − x ∗ , ∇ f ( x K ) i . Now, k x K − x ∗ k < k x K + − x ∗ k since the gradient descent trajectory is exiting the ball B ε ( x ∗ ) at iteration K + . Squaring the condition k x K − x ∗ k < k x K + − x ∗ k yields: k x K − x ∗ k < k x K + − x ∗ k (223) k x K − x ∗ k < k x K − x ∗ k + k α∇ f ( x K ) k − α h x K − x ∗ , ∇ f ( x K ) i (224) h x K − x ∗ , ∇ f ( x K ) i < α k ∇ f ( x K ) k . (225)Next, by the gradient Lipschitz continuity for x K and x K + , we have that: f ( x K + ) ≤ f ( x K ) + h ∇ f ( x K ) , x K + − x K i + L k x K + − x K k (226) f ( x K + ) ≤ f ( x K ) − α k ∇ f ( x K ) k + L k α∇ f ( x K ) k (227) f ( x K + ) + L k ∇ f ( x K ) k ≤ f ( x K ) (228)where we substituted α = L . Combining (228) with (222) followed by substitution of (225) yields: f ( x K + ) + L k ∇ f ( x K ) k ≤ f ( x K ) ≤ f ( x ∗ ) + h x K − x ∗ , ∇ f ( x K ) i + O ( ε ) (229) = ⇒ f ( x K + ) + L k ∇ f ( x K ) k ≤ f ( x ∗ ) + α k ∇ f ( x K ) k + O ( ε ) (230) = ⇒ f ( x K + ) ≤ f ( x ∗ ) − L k ∇ f ( x K ) k + O ( ε ) . (231)Next, using the bound k ∇ f ( x K ) k ≥ β k x K − x ∗ k from (167) in (231) yields: f ( x K + ) ≤ f ( x ∗ ) − β L k x K − x ∗ k + O ( ε ) < f ( x ∗ ) + O ( ε ) (232) = ⇒ f ( x K + ) / f ( x ∗ ) (233)where the term O ( ε ) was dropped from the right -hand side. (cid:4) Exit at iteration K + k x K − x ∗ k < k x K + − x ∗ k .8 Rishabh Dixit, Waheed U. Bajwa Proof of Lemma 4
Proof.
Let us take any two points x , x in the closed ball ¯ B ε ( x ∗ ) . Using gradient Lipschitz condition, we get the following inequalities: f ( x ) ≤ f ( x ∗ ) + h ∇ f ( x ∗ ) , x − x ∗ i + L k x − x ∗ k (234) ≤ f ( x ∗ ) + L k x − x ∗ k (235)and f ( x ∗ ) ≤ f ( x ) − h ∇ f ( x ∗ ) , x − x ∗ i + L k x − x ∗ k (236) ≤ f ( x ) + L k x − x ∗ k (237)Now adding (235) and (237) yields: f ( x ) − f ( x ) ≤ L k x − x ∗ k + L k x − x ∗ k . (238)Next, using the fact that k x − x ∗ k ≤ ε , k x − x ∗ k ≤ ε in (238), we get the following upper bound: f ( x ) − f ( x ) ≤ L ε . (239)Formally, this upper bound states that the function value gap between any two points in the closed ball ¯ B ε ( x ∗ ) surface cannot be morethan L ε . Also notice that the result in (239) only depends on the gradient Lipschitz condition and therefore will hold true for any ε .Next, we assume that our gradient trajectory is currently exiting the ball B ε ( x ∗ ) at point x K s.t. k x K − − x ∗ k ≤ ε and k x K − x ∗ k > ε .Let us further assume that ˆ K iterations after the current iteration, the gradient trajectory re-enters the ball B ε ( x ∗ ) , i.e., (cid:13)(cid:13) x K + ˆ K − x ∗ (cid:13)(cid:13) ≤ ε and (cid:13)(cid:13) x K + ˆ K − − x ∗ (cid:13)(cid:13) > ε . Using the update equation x k + = x k − α∇ f ( x k ) for 0 ≪ α ≤ L together with gradient Lipschitz condition, weget: f ( x k + ) ≤ f ( x k ) + h ∇ f ( x k ) , x k + − x k i + L k x k + − x k k (240) = ⇒ f ( x k + ) ≤ f ( x k ) − α L (cid:18) L − α (cid:19) k ∇ f ( x k ) k (241)Taking the telescopic sum for these inequalities from k = K to k = K + ˆ K − f ( x K ) − f ( x K + ˆ K ) : f ( x K + ˆ K ) ≤ f ( x K ) − α L (cid:18) L − α (cid:19) K + ˆ K − ∑ k = K k ∇ f ( x k ) k (242) α L β (cid:18) L − α (cid:19) ˆ K ε < α L (cid:18) L − α (cid:19) K + ˆ K − ∑ k = K k ∇ f ( x k ) k ≤ f ( x K ) − f ( x K + ˆ K ) ≤ f ( x K − ) − f ( x K + ˆ K ) (243)where f ( x K ) ≤ f ( x K − ) from monotonicity of { f ( x K ) } and we have substituted the lower bound k ∇ f ( x k ) k ≥ β k x k − x ∗ k ≥ βε from (167) since k x k − x ∗ k > ε for all K ≤ k ≤ K + ˆ K −
1. Combining (243) with (239) for x K − , x K + ˆ K ∈ B ε ( x ∗ ) yields the followingcondition on ˆ K : α L β (cid:18) L − α (cid:19) ˆ K ε < L ε (244)ˆ K < αβ (cid:18) L − α (cid:19) . (245)Now, for sake of simplicity we substitute α = L . This yields the following bound on ˆ K :ˆ K < κ (246)where κ = β L . This inequality claims that if the gradient trajectory re-enters the ball B ε ( x ∗ ) , it has to do so in fewer than κ iterations.From here onward we will develop a proof which contradicts this claim. It is to be noted that we can carry out a similar analysis for any other α s.t. 0 ≪ α ≤ L and still obtain the same inference.oundary Conditions for Trajectories Around Saddle Points 39Let us first define some ξ > ε such that ξ = κ ε ( + b ) where κ = β L , b = k x K − x ∗ k ε − ξ is upper boundedfrom theorem 3. Note that x K as defined earlier is the exit point of the gradient trajectory, i.e., k x K − − x ∗ k ≤ ε and k x K − x ∗ k > ε . Nowfor well conditioned problems, i.e., O (cid:18) √ q log ( ε ) (cid:19) < κ ≤ ε upper bounded by Theorem 1, we will have ξ = O ( ε ) . Therefore agradient trajectory moving outwards from the ball B ε ( x ∗ ) is also bound to move out from the ball B ξ ( x ∗ ) since we have already provedthis in 2 for trajectories with expansive dynamics.Under these conditions, let J represent the minimum number of iterations required to exit the ball B ξ ( x ∗ ) for a trajectory which isjust exiting B ε ( x ∗ ) and is currently at the point x K s.t. k x K − x ∗ k > ε . To this end, we rewrite the update equation of radial vector u k forany k ∈ { K , K + ,..., K + J − } : u k + = u k + ( x k + − x k ) = u k − α∇ f ( x k ) (247)where we have that u k = x k − x ∗ . From the gradient Lipschitz condition we have the following bound for any u k : k ∇ f ( x k ) k ≤ L k u k k (248)where u k = x k − x ∗ . Applying norm to (247) followed by triangle inequality and using the upper bound from (248) yields: k u k + k = k u k + ( x k + − x k ) k ≤ k u k k + α k ∇ f ( x k ) k ≤ k u k k (249)for α = L . Applying this bound recursively from k = K to k = K + J − k u K k = ε ( + b ) , we have: k u K + J k ≤ J k u K k = J ε ( + b ) . (250)Since J is the minimum number of iterations required to exit the ξ radius ball for a trajectory which is just exiting the ε ball, we can set2 J ε ( + b ) = ξ . This yields: 2 J ε ( + b ) = ξ = κ ε ( + b ) (251) J = κ . (252)Now, the ˆ K we defined as the time to re-enter the ball B ε ( x ∗ ) should be definitely greater than J since any trajectory will certainly takemore than J iterations to traverse the shell present in between the concentric ξ and ε radii balls.ˆ K > J = κ . (253)However, this inequality contradicts the claim that ˆ K < κ from (246) which completes our proof. (cid:4) Proof of Lemma 5
Proof.
Recall that from (238) and (239) in previous lemma, for any x , x ∈ ¯ B ξ ( x ∗ ) we have that: f ( x ) − f ( x ) ≤ L ( ξ ) . (254)Next, let ˆ K be the minimum number of iterations in which the gradient trajectory re-enters the ball B ξ ( x ∗ ) . Then following the same setof steps as in the previous lemma for obtaining (242), we get: f ( x K + ˆ K ) ≤ f ( x K ) − α L (cid:18) L − α (cid:19) K + ˆ K − ∑ k = K k ∇ f ( x k ) k (255) = ⇒ α L (cid:18) L − α (cid:19) ˆ K ( ξ ) < α L (cid:18) L − α (cid:19) K + ˆ K − ∑ k = K k ∇ f ( x k ) k ≤ f ( x K ) − f ( x K + ˆ K ) ≤ f ( x K − ) − f ( x K + ˆ K ) (256)where we substituted k ∇ f ( x k ) k ≥ γ > √ L ξ and f ( x K ) ≤ f ( x K − ) from monotonicity of { f ( x k ) } . Now if the trajectory re-enters theball B ξ ( x ∗ ) in ˆ K iterations, then x K − , x K + ˆ K ∈ B ξ ( x ∗ ) and hence x K − , x K + ˆ K satisfy (254). Therefore combining (256) with (254)yields the bound: α L (cid:18) L − α (cid:19) ˆ K ( ξ ) < L ( ξ ) (257) = ⇒ ˆ K < α L (cid:18) L − α (cid:19) . (258) (cid:4) Now for α = L , we have that ˆ K <
4. Therefore the gradient trajectory has to re-enter the ball B ξ ( x ∗ ) in three or less iterationswhich is not possible from the proof below.0 Rishabh Dixit, Waheed U. Bajwa Claim: The gradient trajectory cannot re-enter the ball B ξ ( x ∗ ) in three or less iterations. Proof.
Let the current iterate for the gradient trajectory be x − such that k x − − x ∗ k < ξ and k x − x ∗ k ≥ ξ , i.e., the iterate x exits the ball B ξ ( x ∗ ) where ξ is bounded from Theorem 3. Next, from Theorem 2, the iterate x + in the sequence { x − , x , x + } will also have expansivedynamics, i.e., k x + − x ∗ k > k x − x ∗ k . Let x ++ denote the next iterate in the sequence { x − , x , x + } . Now, if the following condition: h x ++ − x + , x − x ∗ i ≥ x ++ B ξ ( x ∗ ) . To check this, let the condition (259) be given and we have the contradiction x ++ ∈ B ξ ( x ∗ ) , i.e., k x ++ − x ∗ k < ξ . Then we can write the following inequality: (cid:13)(cid:13) x ++ − x ∗ (cid:13)(cid:13) < ( ξ ) (260) = ⇒ (cid:13)(cid:13) x ++ − x + + x + − x ∗ (cid:13)(cid:13) < ( ξ ) (261) = ⇒ (cid:13)(cid:13) x ++ − x + (cid:13)(cid:13) + (cid:13)(cid:13) x + − x ∗ (cid:13)(cid:13) + h x ++ − x + , x + − x ∗ i < ( ξ ) (262) = ⇒ (cid:13)(cid:13) x ++ − x + (cid:13)(cid:13) | {z } > + (cid:13)(cid:13) x + − x ∗ (cid:13)(cid:13) | {z } ≥ ( ξ ) + h x ++ − x + , x + − x i | {z } ≥ + h x ++ − x + , x − x ∗ i | {z } ≥ < ( ξ ) (263)which is not possible (left hand side is greater than right hand side). Hence, x ++ ¯ B ξ ( x ∗ ) . Note that we used the condition h x ++ − x + , x + − x i ≥ { x , x + , x ++ } generated by gradient descent method.The following steps verify this result. Using the substitutions x ++ − x + = − α∇ f ( x + ) , x + − x = − α∇ f ( x ) and ∇ f ( x + ) = ∇ f ( x ) + (cid:18) R p = ∇ f ( x + p ( x + − x )) dp (cid:19) ( x + − x ) we get: h x ++ − x + , x + − x i = h− α∇ f ( x + ) , − α∇ f ( x ) i (264) = α (cid:28) ∇ f ( x ) + (cid:18) Z p = ∇ f ( x + p ( x + − x )) dp (cid:19) ( x + − x ) , ∇ f ( x ) (cid:29) (265) = α (cid:28) ∇ f ( x ) − α (cid:18) Z p = ∇ f ( x + p ( x + − x )) dp (cid:19) ∇ f ( x ) , ∇ f ( x ) (cid:29) (266) = α (cid:28)(cid:18) I − α Z p = ∇ f ( x + p ( x + − x )) dp (cid:19) ∇ f ( x ) , ∇ f ( x ) (cid:29) ≥ (cid:18) I − α R p = ∇ f ( x + p ( x + − x )) dp (cid:19) is a positive semi-definite matrix for α ≤ L .Now, we are left to prove (259) condition, i.e., h x ++ − x + , x − x ∗ i ≥
0. Manipulating the left hand side of this condition, we obtain: h x ++ − x + , x − x ∗ i = h− α∇ f ( x + ) , x − x ∗ i (268) = − α (cid:28) ∇ f ( x ) + (cid:18) Z p = ∇ f ( x + p ( x + − x )) dp (cid:19) ( x + − x ) , x − x ∗ (cid:29) (269) = − α (cid:28)(cid:18) I − α Z p = ∇ f ( x + p ( x + − x )) dp (cid:19) ∇ f ( x ) , x − x ∗ (cid:29) (270) = (cid:28) x − x ∗ , (cid:18) I − α Z p = ∇ f ( x + p ( x + − x )) dp (cid:19) ( − α∇ f ( x )) (cid:29) . (271)Next, recall that from (197) in the proof for theorem 3 for any sequence of iterates { x − , x , x + } inside the ball B ξ ( x ∗ ) generated by thegradient descent method, we have that: (cid:13)(cid:13) x + − x ∗ (cid:13)(cid:13) ≥ (cid:18) ¯ ρ ( x − ) + M ξ (cid:19) k x − x ∗ k (272) k x − x ∗ k + k α∇ f ( x ) k + h− α∇ f ( x ) , x − x ∗ i ≥ (cid:18) ¯ ρ ( x − ) + M ξ (cid:19) k x − x ∗ k (273) (cid:18)(cid:18) ¯ ρ ( x − ) + M ξ (cid:19) − (cid:19) k x − x ∗ k − k α∇ f ( x ) k ≤ h− α∇ f ( x ) , x − x ∗ i (274) (cid:18)(cid:18) ¯ ρ ( x − ) + M ξ (cid:19) − (cid:19) k x − x ∗ k ≤ h− α∇ f ( x ) , x − x ∗ i (275)oundary Conditions for Trajectories Around Saddle Points 41where we used the bound k α∇ f ( x ) k ≤ k x − x ∗ k . Next, it can be checked that for β L > .
89 and ς ≥
47, we will have the condition:¯ ρ ( x ) + M ξ > + s(cid:18) ( + β L ) (cid:18) − (cid:18) − β L (cid:19) (cid:19) − (cid:19) + ς s(cid:18) ( + β L ) (cid:18) − (cid:18) − β L (cid:19) (cid:19) − (cid:19) ≥ √ x ∈ B ξ ( x ∗ ) .Combining (276) with (275) for x − ∈ B ξ ( x ∗ ) yields:0 ≤ (cid:18)(cid:18) ¯ ρ ( x − ) + M ξ (cid:19) − (cid:19) k x − x ∗ k ≤ h− α∇ f ( x ) , x − x ∗ i . (277)For any vectors a and b and any positive semi-definite matrix A , if h a , b i ≥ h a , Ab i ≥
0. Using this property for A = (cid:18) I − α R p = ∇ f ( x + p ( x + − x )) dp (cid:19) , a = − α∇ f ( x ) and b = x − x ∗ , we get that (cid:28) x − x ∗ , (cid:18) I − α R p = ∇ f ( x + p ( x + − x )) dp (cid:19) ( − α∇ f ( x )) (cid:29) ≥ h− α∇ f ( x ) , x − x ∗ i ≥ h x ++ − x + , x − x ∗ i = (cid:28) x − x ∗ , (cid:18) I − α Z p = ∇ f ( x + p ( x + − x )) dp (cid:19) ( − α∇ f ( x )) (cid:29) ≥ (cid:4) E Proof of Convergence of Algorithm 1
Proof.
To establish the efficacy of proposed algorithm, it is sufficient to prove the curvature condition (refer Step 15 from Algorithm 1).Now, for k ∇ f ( x ) k ≤ ε and Ξ =
0, we have that: ∇ f ( x ) = (cid:18) ∇ f ( x ∗ ) + Z p = p = ∇ f ( x ∗ + p ( x − x ∗ )) dp (cid:19) ( x − x ∗ ) . (279)With the problem being well conditioned and ε very small and upper bounded by Theorem 1, using Lemma 4 from [10] we canapproximate the Hessian ∇ f ( x ∗ + p ( x − x ∗ )) = ∇ f ( x ∗ ) + O ( ε ) ≈ ∇ f ( x ∗ ) for any x ∈ B ε ( x ∗ ) . This is a valid approximation sincewe are no longer solving for rates of convergence and just need to approximately determine the unstable projection value. Therefore, theequation (279) for x = x k is approximated as: ∇ f ( x k ) = (cid:18) ∇ f ( x ∗ ) + O ( ε ) (cid:19) ( x k − x ∗ ) ≈ ∇ f ( x ∗ )( x k − x ∗ ) (280)where ∇ f ( x ∗ ) is zero vector. With y = x k , y = x k + and the approximation (280), we have the following terms: y = x k + = x k − α∇ f ( x k ) (281) = x k − α (cid:18) ∇ f ( x ∗ ) + O ( ε ) (cid:19) ( x k − x ∗ ) (282) ≈ x k − α∇ f ( x ∗ )( x k − x ∗ ) , (283) ∇ f ( y ) = ∇ f ( x k + ) = (cid:18) ∇ f ( x ∗ ) + O ( ε ) (cid:19) ( x k + − x ∗ ) (284) = (cid:18) ∇ f ( x ∗ ) + O ( ε ) (cid:19)(cid:18) x k − α (cid:18) ∇ f ( x ∗ ) + O ( ε ) (cid:19) ( x k − x ∗ ) − x ∗ (cid:19) (285) ≈ ∇ f ( x ∗ ) (cid:18) x k − α∇ f ( x ∗ )( x k − x ∗ ) − x ∗ (cid:19) . (286)Note that in the second last step we used the substitution from (282). Now, we define the terms V , V using y , y : V = h y − y , y − y i ≈ ( x k − x ∗ ) T ( α∇ f ( x ∗ )) ( x k − x ∗ ) (287) V = α h y − y , ∇ f ( y ) − ∇ f ( y ) i ≈ ( x k − x ∗ ) T ( α∇ f ( x ∗ )) ( x k − x ∗ ) (288)2 Rishabh Dixit, Waheed U. BajwaNext we use the following substitution: x k − x ∗ = k x k − x ∗ k (cid:18) ∑ i ∈ N S θ si v i ( ) + ∑ j ∈ N US θ usj v j ( ) (cid:19) (289)where k x k − x ∗ k θ si = h ( x k − x ∗ ) , v i ( ) i , k x k − x ∗ k θ usj = h ( x k − x ∗ ) , v j ( ) i and v i ( ) , v j ( ) are the eigenvectors of the scaled Hessian α∇ f ( x ∗ ) . On further simplifying V , V using (289) we get: V ≈k x k − x ∗ k (cid:18) ∑ i ∈ N S ( λ si ) ( θ si ) + ∑ j ∈ N US ( λ usj ) ( θ usj ) (cid:19) (290) V ≈k x k − x ∗ k (cid:18) ∑ i ∈ N S ( λ si ) ( θ si ) + ∑ j ∈ N US ( λ usj ) ( θ usj ) (cid:19) (291)where λ si and λ usj are the eigenvalues of stable subspace E S and unstable subspace E US of the scaled Hessian α∇ f ( x ∗ ) respectively.These eigenvalues are bounded by : β L ≤ λ si ≤ − ≤ λ usj ≤ − β L . (293)Evaluating V − V and using the fact that k x k − x ∗ k ≤ β k ∇ f ( x k ) k ≤ L εβ from (167), we get the following expression: V − V / ε κ (cid:18) ∑ i ∈ N S (( λ si ) − ( λ si ) )( θ si ) + ∑ j ∈ N US (( λ usj ) − ( λ usj ) )( θ usj ) (cid:19) (294)where κ = β L . Now, the function h ( y ) = y − y attains a maximum value of in the interval y ∈ ( , ] and a maximum value of 2 inthe interval y ∈ [ − , ) . Substituting y = λ si in the interval y ∈ ( , ] and y = λ usj in the interval y ∈ [ − , ) , the upper bound for (294)becomes: V − V / ε κ (cid:18) ∑ i ∈ N S ( θ si ) + ∑ j ∈ N US ( θ usj ) (cid:19) (295) V − V / ε κ (cid:18) − (cid:18) ∑ j ∈ N US ( θ usj ) (cid:19) + ∑ j ∈ N US ( θ usj ) (cid:19) (296) V − V / ε κ (cid:18) + (cid:18) ∑ j ∈ N US ( θ usj ) (cid:19)(cid:19) (297) ∑ j ∈ N US ( θ usj ) ' ( V − V ) κ ε − . (298)The right-hand side in (298) can be considered as as the lower bound estimate for ∑ j ∈ N US ( θ usj ) . Now, the sufficient condition forescaping the saddle neighborhood comes from the minimum unstable subspace projection in (72). 
Let P min ( ε ) be a function of ε equal tothe lower bound from (72), then with the condition ( V − V ) κ ε − > P min ( ε ) and (298), we can guarantee ∑ j ∈ N US ( θ usj ) ' P min ( ε ) whichimplies that we have a sufficient unstable projection to escape saddle region in almost linear time.Notice that the curvature condition from the step 15 in Algorithm 1 checks the inequality ( V − V ) κ ε − < P min ( ε ) which if truecould imply ∑ j ∈ N US ( θ usj ) < P min ( ε ) . Then the gradient trajectory may not necessarily have linear exit time from saddle neighborhood.Hence, we solve the eigenvector problem given by: x k + ∈ argmin k x − x k k = k ∇ f ( x k ) k β (cid:18) ( x − x k ) T H ( x − x k ) (cid:19) (299)which gives a solution with sufficient unstable projection. Notice that a possible solution to the unconstrained problem: x k + ∈ argmin x (cid:18) ( x − x k ) T H ( x − x k ) (cid:19) (300)can be given by x k + − x k = b k x k − x ∗ k e usj where e usj is any eigenvector of the scaled Hessian H = α∇ f ( x k ) ≈ α∇ f ( x ∗ ) corre-sponding to its least eigenvalue and b is any scalar. Although any vector in the subspace formed by the eigenvectors correspondingoundary Conditions for Trajectories Around Saddle Points 43to the minimum eigenvalue can be used instead of e usj , for sake of simplicity of the proof, we use the direction e usj . Hence fromthe unconstrained eigenvector problem (300), we can write x k + − x ∗ = x k − x ∗ + b k x k − x ∗ k e usj . Using the substitution x k − x ∗ = k x k − x ∗ k (cid:18) ∑ i ∈ N S θ si v i ( ) + ∑ j ∈ N US θ usj v j ( ) (cid:19) as before from (289) we get: x k + − x ∗ = k x k − x ∗ k (cid:18) ∑ i ∈ N S θ si v i ( ) + ∑ j ∈ N US θ usj v j ( ) (cid:19) + b k x k − x ∗ k e usj (301) = k x k − x ∗ k (cid:18) ∑ i ∈ N S θ si v i ( ) + ∑ j ∈ N US θ usj v j ( ) (cid:19) + b k x k − x ∗ k (cid:18) v l ( ) + O ( ε ) (cid:19) (302) = k x k − x ∗ k p + b (cid:18) ∑ i ∈ N S θ si √ + b v i ( ) + ∑ j ∈ N US θ usj √ + b v j ( ) + b √ + b v l ( ) (cid:19) + O ( ε ) (303) = k x k − x ∗ k p + b (cid:18) ∑ i ∈ N S ˜ θ si v i ( ) + ∑ j ∈ N US ˜ θ usj v j ( ) (cid:19) + O ( ε ) . (304)where we have ∑ i ∈ N S ( ˜ θ si ) + ∑ j ∈ N US ( ˜ θ usj ) = θ si , ˜ θ usj . Notice that we used the eigenvector perturbation bound e usj = v l ( ) + O ( ε ) in the second step and v l ( ) corresponds to the eigenvector for the smallest eigenvalue of α∇ f ( x ∗ ) . Notice that l ∈ N US where l is the index of v l ( ) provided x k lies within some saddle neighborhood and not in a local minimum neighborhood. If x k were in a local minimum neighborhood, then the unstable subspace would have been the null space. Finally, in the second last stepwe normalized by dividing with √ + b because we require the condition: ∑ i ∈ N S (cid:18) θ si √ + b (cid:19) + ∑ j ∈ N US (cid:18) θ usj √ + b (cid:19) + (cid:18) b √ + b (cid:19) | {z } U = ∑ i ∈ N S ( θ si ) + ∑ j ∈ N US ( θ usj ) =
1. From (303) and (304) using coefficient comparison, it can be checked that θ si √ + b = ˜ θ si + O ( ε ) for all i ∈ N S . Using this relation in (305) we get that U = ∑ j ∈ N US ( ˜ θ usj ) + O ( ε ) . Next, dropping O ( ε ) term from theright-hand side of (304), we have: x k + − x ∗ ≈ k x k − x ∗ k p + b (cid:18) ∑ i ∈ N S ˜ θ si v i ( ) + ∑ j ∈ N US ˜ θ usj v j ( ) (cid:19) (306)where ∑ j ∈ N US ( ˜ θ usj ) can be considered as the new unstable projection of ( x k + − x ∗ ) and k x k + − x ∗ k ≈ k x k − x ∗ k√ + b . Now, we re-quire that the future gradient trajectory that starts from the point x k + escapes the ball B ˜ ε ( x ∗ ) in linear time where ˜ ε = k x k − x ∗ k√ + b .Therefore we get that: U ≈ ∑ j ∈ N US ( ˜ θ usj ) ≥ P min ( ˜ ε ) (307) = ⇒ ∑ j ∈ N US (cid:18) θ usj √ + b (cid:19) + (cid:18) b √ + b (cid:19) ' P min ( ˜ ε ) (308) = P min ( k x k − x ∗ k p + b ) (309) > P min (cid:18) k ∇ f ( x k ) k √ + b L (cid:19) (310)where in the last step we used P min ( k x k − x ∗ k√ + b ) > P min (cid:18) k ∇ f ( x k ) k √ + b L (cid:19) due to the fact that the function P min ( ε ) monotoni-cally increases with ε from (72) along with the property that k ∇ f ( x k ) k ≤ L k x k − x ∗ k . Now (310) will hold true whenever: (cid:18) b √ + b (cid:19) > P min (cid:18) k ∇ f ( x k ) k √ + b L (cid:19) (311) b > s P min (cid:18) k ∇ f ( x k ) k √ + b L (cid:19)s − P min (cid:18) k ∇ f ( x k ) k √ + b L (cid:19) . (312)It can be checked that (312) will hold true for any positive b as long as it is bounded away from ε . Finally in the substitution x k + − x k = b k x k − x ∗ k e usj , we can use the lower bound k ∇ f ( x k ) k ≥ β k x k − x ∗ k from (167) and the gradient Lipschitz bound k ∇ f ( x k ) k ≤ L k x k − x ∗ k to get the range k ∇ f ( x k ) k L k x k − x ∗ k ≤ b ≤ k ∇ f ( x k ) k β k x k − x ∗ k . Selecting the upper bound of b gives x k + − x k = k ∇ f ( x k ) k β e usj provided β L ≫
0. Thisparticular choice of b is less conservative though it should be selected carefully and the selection criterion may vary from one problemto another. For the particular case of well-conditioned saddle neighborhood, a large b and hence a large step-size can be afforded. Noticethat β L ≤ b ≤ L β and any b in this range will satisfy (312) provided β L ≫
0. Since x k + is the desired solution, taking norm on both sidesof x k + − x k = k ∇ f ( x k ) k β e usj gives the constraint k x k + − x k k = k ∇ f ( x k ) k β in the Step 17 of Algorithm 1.Since evaluating the eigenvector e usj will involve Hessian inversion operations, it will be solved in polynomial time though this stepis invoked only once in the saddle neighborhood if required and hence does not add much computational complexity per iteration (only O ( n ) complexity per saddle point).Recall that the entire algorithmic analysis was carried out assuming there is just one eigenvector e usj corresponding to the smallesteigenvalue of the Hessian ∇ f ( x ∗ ) . However, the same analysis can be done for the case of a subspace corresponding to the smallesteigenvalue. The bounds on b will still be the same however the steps involved are somewhat tedious and lengthy hence purposefully leftout from the proof. Remark on the Monotonicity of { f ( x k ) } It can be very easily established that f ( x K + ) / f ( x K ) where x K + comes from the Step 17 in Algorithm 1. Proof.
Since x K + is generated from Step 17 of Algorithm 1 we can use the particular update x K + − x K = k ∇ f ( x K ) k β e usj (the moregeneral update 17 is avoided for sake of simplicity) where e usj is an eigenvector of ∇ f ( x K ) belonging to its unstable subspace and h e usj , x K − x ∗ i / O ( ε log ( ε − ) ) (this approximate bound implies x K − x ∗ does not have the required unstable subspace projection fromTheorem 1). As a consequence we will have h ∇ f ( x K ) , x K + − x K i / O ( ε log ( ε − ) ) from the following steps where we use the substitutions ∇ f ( x K ) = ( ∇ f ( x ∗ ) + O ( ε ))( x K − x ∗ ) and ∇ f ( x K ) = ( ∇ f ( x ∗ ) + O ( ε )) from matrix perturbation theory. h ∇ f ( x K ) , x K + − x K i = h ∇ f ( x K ) , k ∇ f ( x K ) k β e usj i (313) = k ∇ f ( x K ) k β h e usj , ( ∇ f ( x ∗ ) + O ( ε ))( x K − x ∗ ) i (314) = k ∇ f ( x K ) k β h e usj , ( ∇ f ( x K ) + O ( ε ))( x K − x ∗ ) i (315) = k ∇ f ( x K ) k β h λ usj e usj , ( x K − x ∗ ) i + O ( ε ) / O ( ε log ( ε − ) ) (316)where ∇ f ( x K ) e usj = λ usj e usj and O ( ε log ( ε − ) ) > O ( ε ) .Finally using Hessian Lipschitz condition for x K + about x K along with (316) we get: f ( x K + ) ≤ f ( x K ) + h ∇ f ( x K ) , x K + − x K i + h ( x K + − x K ) , ∇ f ( x K )( x K + − x K ) i | {z } < + M k x K + − x K k (317) / f ( x K ) + O ( ε log ( ε − ) ) + M k x K + − x K k (318) / f ( x K ) + O ( ε log ( ε − ) ) (319)where we used the fact that k x K + − x K k = O ( ε ) and h ( x K + − x K ) , ∇ f ( x K )( x K + − x K ) i < ε , the term ε log ( ε − ) → ε goes to 0. Using the Mean Value Theorem for some a ∈ [ , ] we getoundary Conditions for Trajectories Around Saddle Points 45that: f ( x K + ) = f ( x K ) + h ∇ f ( x K ) , x K + − x K i + h ( x K + − x K ) , ∇ f ( a x K + ( − a ) x K + )( x K + − x K ) i (320) f ( x K + ) − f ( x K ) = k ∇ f ( x K ) k β h λ usj e usj , ( x K − x ∗ ) i + O ( ε ) + h ( x K + − x K ) , ∇ f ( a x K + ( − a ) x K + )( x K + − x K ) i (321) | f ( x K + ) − f ( x K ) | ≥ − k ∇ f ( x K ) k β |h λ usj e usj , ( x K − x ∗ ) i| − O ( ε ) + k ∇ f ( x K ) k β |h e usj , ∇ f ( a x K + ( − a ) x K + ) e usj i| (322) | f ( x K + ) − f ( x K ) | ≥ − k ∇ f ( x K ) k β |h λ usj e usj , ( x K − x ∗ ) i| − O ( ε ) + k ∇ f ( x K ) k β |h e usj , ( ∇ f ( x K ) + O ( ε )) e usj i| | {z } O ( ε ) (323) | f ( x K + ) − f ( x K ) | ' O ( ε ) − O ( ε log ( ε − ) ) − O ( ε ) ≈ O ( ε ) (324)where we used (316) in the second step followed by the facts that ∇ f ( a x K + ( − a ) x K + ) = ∇ f ( x K ) + O ( ε ) and h e usj , ( x K − x ∗ ) i / O ( ε log ( ε − ) ) . Now f ( x K + ) − f ( x K ) is atleast approximately of order O ( ε ) , hence for the Step 17 we have f ( x K + ) / f ( x K ) afterdropping the term O ( ε log ( ε − ) ) in (319) for very small ε . For all other iterations when gradient descent update is used, the sequence { f ( x k ) } decreases monotonically. (cid:4) . Local Minimum Neighborhood
For the case of a local minimum we will have ∑ j ∈ N US ( θ usj ) = ε κ ' V − V . (325)Hence for ε κ / V − V we cannot have a local minimum neighborhood. Hence if (325) holds, then the region can be both a saddleneighborhood or a local minimum region. Therefore, the Step 15 in Algorithm 1 also checks if ε κ < V − V so as to rule out thepossibility of local minimum. If however we have the inequality ε κ > V − V then a secondary condition λ min ( H ) < (cid:4) F Convergence Rate to a Local Minimum
Proof of Theorem 4
Proof.
For any x , y in ¯ B R ( x ∗ ) using (33) we have the following condition: f ( x ) − f ( y ) ≤ (cid:18) Γ + L diam ( U ) (cid:19) k x − y k ≤ (cid:18) Γ + L diam ( U ) (cid:19) R . (326)Next, let the trajectory re-enter the ball B R ( x ∗ ) after J iterations and the current iteration index be K where we have that x K , x K + J belong to ¯ B R ( x ∗ ) whereas x K + J − ¯ B R ( x ∗ ) . Using gradient Lipschitz continuity on x k and x k + we get: f ( x k + ) − f ( x k ) ≤ h ∇ f ( x k ) , x k + − x k i + L k x k + − x k k (327) K + J − ∑ k = K (cid:18) h ∇ f ( x k ) , x k − x k + i − L k x k + − x k k (cid:19) ≤ K + J − ∑ k = K (cid:18) f ( x k ) − f ( x k + ) (cid:19) (328) K + J − ∑ k = K (cid:18) h ∇ f ( x k ) , x k − x k + i − L k x k + − x k k (cid:19) ≤ f ( x K ) − f ( x K + J ) ≤ (cid:18) Γ + L diam ( U ) (cid:19) R (329)where in the last step we used (326). Now from Algorithm 1 let { k l } be the subsequence of I where I = { K ,... , K + J − } forwhich we have the update x k + = x k − α∇ f ( x k ) and I \{ k l } be the subsequence for which we have x k + − x k = k ∇ f ( x k ) k β e usj (this6 Rishabh Dixit, Waheed U. Bajwaupdate is a particular case of the Step 17 from Algorithm 1) . Further let { k l j } be the subsequence of { k l } where k ∇ f ( x k ) k > γ and let r k = h ∇ f ( x k ) , x k − x k + i − L k x k + − x k k . Now the left-hand side of (329) can be written as: ∑ k ∈ I r k = ∑ k ∈{ k lj } r k + ∑ k ∈{ k l }\{ k lj } r k + ∑ k ∈ I \{ k l } r k (330) ∑ k ∈ I r k = ∑ k ∈{ k lj } (cid:18) α h x k − x k + , x k − x k + i − L k x k + − x k k (cid:19) + ∑ k ∈{ k l }\{ k lj } L k ∇ f ( x k ) k + ∑ k ∈ I \{ k l } (cid:18) h ∇ f ( x k ) , k ∇ f ( x k ) k β e usj i − L (cid:13)(cid:13)(cid:13)(cid:13) k ∇ f ( x k ) k β e usj (cid:13)(cid:13)(cid:13)(cid:13) (cid:19) (331) ∑ k ∈ I r k = ∑ k ∈{ k lj } k ∇ f ( x k ) kk x k + − x k k + ∑ k ∈{ k l }\{ k lj } L k ∇ f ( x k ) k + ∑ k ∈ I \{ k l } (cid:18) h ∇ f ( x k ) , k ∇ f ( x k ) k β e usj i − L β k ∇ f ( x k ) k (cid:19) (332) ∑ k ∈ I r k > γ ∑ k ∈{ k lj } k x k + − x k k + ∑ k ∈{ k l }\{ k lj } L k ∇ f ( x k ) k − ∑ k ∈ I \{ k l } (cid:18) β + L β (cid:19) k ∇ f ( x k ) k . (333)Substituting (333) into (329) yields: γ ∑ k ∈{ k lj } k x k + − x k k + ∑ k ∈{ k l }\{ k lj } L k ∇ f ( x k ) k − ∑ k ∈ I \{ k l } (cid:18) β + L β (cid:19) k ∇ f ( x k ) k ≤ (cid:18) Γ + L diam ( U ) (cid:19) R (334) γ ∑ k ∈{ k lj } k x k + − x k k − ∑ k ∈ I \{ k l } (cid:18) β + L β (cid:19) k ∇ f ( x k ) k ≤ (cid:18) Γ + L diam ( U ) (cid:19) R (335) γ ∑ k ∈{ k lj } k x k + − x k k − ∑ k ∈ I \{ k l } (cid:18) β + L β (cid:19) L ε ≤ (cid:18) Γ + L diam ( U ) (cid:19) R (336)(337)where in the last step we used the fact that k ∇ f ( x k ) k ≤ L ε for k ∈ I \{ k l } . Also note that for all k ∈ I \{ k l } we will have x k ∈ S x ∗ i ∈ S ∗ k x ∗ i − x ∗ k > R B ε ( x ∗ i ) . 
Similarly for all k ∈ I \{ k l j } we will have x k , x k + in the region S x ∗ i ∈ S ∗ k x ∗ i − x ∗ k > R B ξ ( x ∗ i ) along with B ξ ( x ∗ r ) ∩ B ξ ( x ∗ s ) = φ for any x ∗ r , x ∗ s in S ∗ .Now adding γ ∑ k ∈ I \{ k lj } k x k + − x k k to both sides of (337) we get: γ ∑ k ∈ I \{ k lj } k x k + − x k k + γ ∑ k ∈{ k lj } k x k + − x k k ≤ (cid:18) Γ + L diam ( U ) (cid:19) R + ∑ k ∈ I \{ k l } (cid:18) β + L β (cid:19) L ε + γ (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) ∑ k ∈ I \{ k lj } ( x k + − x k ) (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) (338) γ ∑ k ∈ I k x k + − x k k ≤ (cid:18) Γ + L diam ( U ) (cid:19) R + ∑ k ∈ I \{ k l } (cid:18) β + L β (cid:19) L ε + γ ∑ k ∈ I \{ k lj } k ( x k + − x k ) k (339) γ ∑ k ∈ I k x k + − x k k ≤ (cid:18) Γ + L diam ( U ) (cid:19) R + ∑ k ∈ I \{ k l } (cid:18) β + L β (cid:19) L ε + γ ∑ k ∈ I \{ k lj } ξ (340)where in the last step we used the fact that k ( x k + − x k ) k ≤ ξ since x k , x k + lie inside some ball B ξ ( x ∗ i ) for k ∈ I \{ k l j } . If thetrajectory { x k } encounters N such B ξ ( x ∗ i ) balls then (340) can be further simplified as: γ ∑ k ∈ I k x k + − x k k ≤ (cid:18) Γ + L diam ( U ) (cid:19) R + NK exit (cid:18) β + L β (cid:19) L ε + γ N ( K exit + K shell ) ξ (341) The more general update Step 17 from Algorithm 1 will also yield the same bound after taking norm but is not used here in theinterest of simplifying analysisoundary Conditions for Trajectories Around Saddle Points 47where exit time from B ε ( x ∗ ) ball is K exit from theorem 2 of [10], exit time from B ξ ( x ∗ ) ball is K exit + K shell after adding results fromTheorem 3 and Theorem 2 of [10], and we have that ∑ k ∈ I \{ k l } ≤ NK exit , ∑ k ∈ I \{ k lj } ≤ N ( K exit + K shell ) .Note that ∑ k ∈ I k x k + − x k k is the total path length of the trajectory inside the shell B R ω ( x ∗ ) \ B R ( x ∗ ) where we have that R ω = max k ∈ I (cid:13)(cid:13) x k − x ∗ (cid:13)(cid:13) and R = (cid:13)(cid:13) x K − x ∗ (cid:13)(cid:13) = (cid:13)(cid:13) x K + J − x ∗ (cid:13)(cid:13) . Hence, for some K ω = argmax k ∈ I (cid:13)(cid:13) x k − x ∗ (cid:13)(cid:13) we will have the condition: ∑ k ∈ I k x k + − x k k = K ω − ∑ k = K k x k + − x k k + K + J ∑ k = K ω k x k + − x k k (342) ≥ (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) K ω − ∑ k = K x k + − x k (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) + (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) K + J ∑ k = K ω x k + − x k (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) (343) ≥ k x K ω − x K k + k x K + J − x K ω k (344) ≥ k x K ω − x ∗ k − k x K − x ∗ k + k x K ω − x ∗ k − k x K + J − x ∗ k (345) = ( R ω − R ) . (346)Substituting (346) into (341) yields: γ ( R ω − R ) ≤ γ ∑ k ∈ I k x k + − x k k ≤ (cid:18) Γ + L diam ( U ) (cid:19) R + NK exit (cid:18) β + L β (cid:19) L ε + γ N ( K exit + K shell ) ξ . (347)Next, recall that the distance between any two stationary points is greater than R . Hence, between two points x , y with k x − y k ≤ D , therecan be at most DR stationary points. Now if the points x , y are connected by a path formed from the sequence of points { v k } Pk = then therecan be at most P − ∑ p = k v k + − v k k R stationary points on the path connecting x , y . 
Using this result in (347) yields the following bound on N : N ≤ γ∑ k ∈ I k x k + − x k k R ≤ (cid:18) Γ + L diam ( U ) (cid:19) R R + NK exit (cid:18) β + L β (cid:19) L ε R + γ N ( K exit + K shell ) ξ R (348) N (cid:18) − K exit (cid:18) β + L β (cid:19) L ε R − γ ( K exit + K shell ) ξ R (cid:19) ≤ (cid:18) Γ + L diam ( U ) (cid:19) R R (349) N ≤ (cid:18) Γ + L diam ( U ) (cid:19) R R (cid:18) − K exit (cid:18) β + L β (cid:19) L ε R − γ ( K exit + K shell ) ξ R (cid:19) (350)provided (cid:18) − K exit (cid:18) β + L β (cid:19) L ε R − γ ( K exit + K shell ) ξ R (cid:19) > R > K exit (cid:18) β + L β (cid:19) L ε + γ ( K exit + K shell ) ξ .Finally, combining (347) and (350) yields the result: R ω ≤ R + (cid:18) Γ + L diam ( U ) (cid:19) R γ + N K exit (cid:18) β + L β (cid:19) L ε γ + N ( K exit + K shell ) ξ (351)where N = (cid:18) Γ + L diam ( U ) (cid:19) R R (cid:18) − K exit (cid:18) β + L β (cid:19) L ε R − γ ( K exit + K shell ) ξ R (cid:19) . (cid:4) Proof of Theorem 5
Proof.
Let the trajectory { x k } be travelling from the ball B ξ ( x ∗ j ) to B ξ ( x ∗ j + ) where B ξ ( x ∗ j ) ∈ Y . From the gradient Lipschitz conti-nuity of f ( · ) for K = sup { k | x k ∈ B ξ ( x ∗ j ) } + K = inf { k | x k ∈ B ξ ( x ∗ j + ) } , we have that for any K ≤ k < K : f ( x k + ) ≤ f ( x k ) + h ∇ f ( x k ) , x k + − x k i + L k x k + − x k k (352)12 L k ∇ f ( x k ) k ≤ f ( x k ) − f ( x k + ) (353) = ⇒ ( K − K ) L γ ≤ K − ∑ k = K L k ∇ f ( x k ) k ≤ K − ∑ k = K ( f ( x k ) − f ( x k + )) (354) = ⇒ ( K − K ) γ L ≤ f ( x K ) − f ( x K ) ≤ (cid:18) Γ + L diam ( U ) (cid:19)(cid:13)(cid:13) x K − x K (cid:13)(cid:13) (355)where we used (33) in the last step and the fact that k ∇ f ( x k ) k > γ for all K ≤ k < K . Next, using Theorem 4 for R = R , we have that R ω ≤ ˆ R where: ˆ R = R + (cid:18) Γ + L diam ( U ) (cid:19) R γ + N K exit (cid:18) β + L β (cid:19) L ε γ + N ( K exit + K shell ) ξ (356)and N is defined from theorem 4. Now let S i ∈ P j B R ( x ∗ i ) is the tightest packing of the ball B R ( x ∗ j ) . Hence S i ∈ P j B ξ ( x ∗ i ) \ B ξ ( x ∗ j ) are the ξ radii balls in the immediate neighborhood of the ball B ξ ( x ∗ j ) such that (cid:13)(cid:13)(cid:13) x ∗ i − x ∗ j (cid:13)(cid:13)(cid:13) = R for all i ∈ P j . Next if there is at least oneball B ∈ S i ∈ P j B ξ ( x ∗ i ) \ B ξ ( x ∗ j ) that has not been visited by the trajectory of { x k } for any k ≤ K then from Theorem 4 we must have (cid:13)(cid:13)(cid:13) x K − x ∗ j (cid:13)(cid:13)(cid:13) ≤ ˆ R if the trajectory of { x k } for some k > K possibly traverses the ball B . Hence, the cardinality of Y or equivalently Q increases if the the trajectory of { x k } visits the missed ball B which is only possible if (cid:13)(cid:13)(cid:13) x K − x ∗ j (cid:13)(cid:13)(cid:13) ≤ ˆ R . Using the bound (cid:13)(cid:13)(cid:13) x K − x ∗ j (cid:13)(cid:13)(cid:13) ≤ ˆ R in (355) yields: ( K − K ) γ L ≤ (cid:18) Γ + L diam ( U ) (cid:19)(cid:13)(cid:13) x K − x K (cid:13)(cid:13) ≤ (cid:18) Γ + L diam ( U ) (cid:19) ( (cid:13)(cid:13) x K − x ∗ j (cid:13)(cid:13) + (cid:13)(cid:13) x K − x ∗ j (cid:13)(cid:13) ) (357) ( K − K ) ≤ L γ (cid:18) Γ + L diam ( U ) (cid:19) ( ˆ R + ξ ) . (358)which is the upper bound on the travel time between two consecutive balls.Next, from Theorem 4 let R = ζ and R ω be upper bounded by some R ef f where: R ef f = ζ + (cid:18) Γ + L diam ( U ) (cid:19) ζγ + N K exit (cid:18) β + L β (cid:19) L ε γ + N ( K exit + K shell ) ξ (359)and N is defined from Theorem 4. Let N R ef f be the packing number of ball B R ef f ( x ∗ ) with balls of radius R . Then using Theorem 2from [10], Theorem 3 and (358), the total time for the trajectory of { x k } to reach x ∗ optimal is bounded by: K max ≤ N R ef f (cid:18) ( K exit + K shell ) + ( K − K ) (cid:19) (360) < (cid:18) R ef f R (cid:19) n (cid:18) ( K exit + K shell ) + L γ (cid:18) Γ + L diam ( U ) (cid:19) ( ˆ R + ξ ) (cid:19) . (361) Claim: If the trajectory of { x k } is currently traversing the ball B ξ ( x ∗ j ) then it cannot have visited every ballin the union S i ∈ P j B ξ ( x ∗ i ) \ B ξ ( x ∗ j ) provided | P j | ( R − ξ ) R (cid:18) Γ + L diam ( U ) (cid:19) γ ≥ Γ where | P j | is the packing numberfor the ball B R ( x ∗ j ) with R radii balls and ˜ R = R + (cid:18) Γ + L diam ( U ) (cid:19) R γ + N K exit (cid:18) β + L β (cid:19) L ε γ + N ( K exit + K shell ) . Proof.
To contradict the claim, consider the case where every ball in the union S i ∈ P j B ξ ( x ∗ i ) \ B ξ ( x ∗ j ) has been visited by the tra-jectory of { x k } for some k ≤ K and the trajectory is currently traversing from ball B ξ ( x ∗ j ) to ball B ξ ( x ∗ j + ) . Recall that the union S i ∈ P j B ξ ( x ∗ i ) \ B ξ ( x ∗ j ) forms the immediate neighborhood of the ball B ξ ( x ∗ j ) . Then in order to completely traverse S i ∈ P j B ξ ( x ∗ i ) \ B ξ ( x ∗ j ) ,oundary Conditions for Trajectories Around Saddle Points 49the trajectory must have traversed some portions of each ball in the union S i ∈ P j B R ( x ∗ i ) \ B R ( x ∗ j ) where we have that B R ( x ∗ j ) ⊃ S i ∈ P j B R ( x ∗ i ) and | P j | is the packing number for the ball B R ( x ∗ j ) with balls of radius R . Since the balls in the union S i ∈ P j B R ( x ∗ i ) are disjoint, the minimum path length that the trajectory of { x k } traverses inside any shell B R ( x ∗ i ) \ B ξ ( x ∗ i ) for any i ∈ P j is R − ξ .Then for a subsequence { k P j } of sequence { k } such that x k ∈ S i ∈ P j (cid:18) B R ( x ∗ i ) \ B ξ ( x ∗ i ) (cid:19) for any k ∈ k P j , we have that: k x k − x k + k = α k ∇ f ( x k ) k (362) = ⇒ | P j | ( R − ξ ) ≤ ∑ k ∈{ k P j } k x k − x k + k = ∑ k ∈{ k P j } α k ∇ f ( x k ) k ≤ Γ L K P j (363) = ⇒ K P j ≥ | P j | ( R − ξ ) L Γ (364)where K P j are the number of iterations of x k in the region S i ∈ P j (cid:18) B R ( x ∗ i ) \ B ξ ( x ∗ i ) (cid:19) . Next, let { k C } be the contiguous subsequenceof { k } such that x k ∈ B ˜ R ( x ∗ j ) for any k ∈ { k C } where ˜ R = R + (cid:18) Γ + L diam ( U ) (cid:19) R γ + N K exit (cid:18) β + L β (cid:19) L ε γ + N ( K exit + K shell ) ξ from Theorem 4 for R = R . Recall from Theorem 4 if the trajectory escapes B ˜ R ( x ∗ j ) , then it cannot return to B R ( x ∗ j ) . Then by thegradient Lipschitz continuity of f ( · ) for any x k , x k + in the region S i ∈ P j (cid:18) B R ( x ∗ i ) \ B ξ ( x ∗ i ) (cid:19) , we have that: f ( x k + ) ≤ f ( x k ) + h ∇ f ( x k ) , x k + − x k i + L k x k + − x k k (365) = ⇒ K P j L γ ≤ ∑ k ∈{ k P j } L k ∇ f ( x k ) k ≤ ∑ k ∈{ k P j } ( f ( x k ) − f ( x k + )) ≤ ∑ k ∈{ k C } ( f ( x k ) − f ( x k + )) (366) = ⇒ K P j L γ ≤ f ( x inf { k C } ) − f ( x sup { k C } + ) ≤ (cid:18) Γ + L diam ( U ) (cid:19)(cid:13)(cid:13) x inf { k C } − x sup { k C } + (cid:13)(cid:13) (367) = ⇒ K P j < (cid:18) Γ + L diam ( U ) (cid:19) ˜ RL γ (368)where we used the monotonicity of the sequence { f ( x k ) } for k ∈ { k C } and the fact that (cid:13)(cid:13) x inf { k C } − x sup { k C } + (cid:13)(cid:13) < R . Combining (364)and (368) yields: | P j | ( R − ξ ) L Γ ≤ K P j < (cid:18) Γ + L diam ( U ) (cid:19) ˜ RL γ (369) | P j | ( R − ξ ) R (cid:18) Γ + L diam ( U ) (cid:19) γ < Γ (370)which contradicts the given condition on Γ . (cid:4) We complete the proof of Theorem 5 by proving one last claim. Recall that K exit was the exit time of the ε –precision trajectory fromthe ball B ε ( x ∗ ) while we proved Theorem 5 for the exact gradient trajectory. Hence, we need to justify the use of the upper bound on K exit from (8) in Theorem 5 which we prove in the following claim. Claim: The actual exit time of the gradient descent trajectory from the ball B ε ( x ∗ ) is approximately less thanor equal to the sum of K exit and a small positive constant. Hence, for sufficiently small ε the upper bound onK exit from (8) can be used as an upper bound for the actual exit time. Proof.
Let K o exit be the actual exit time of the gradient trajectory { u K } from the ball B ε ( x ∗ ) , i.e., K o exit = inf K > (cid:26) K (cid:12)(cid:12)(cid:12)(cid:12) k u K k ≥ ε (cid:27) where u K = x K − x ∗ is the radial vector and k u k = ε . Since K exit is the exit time of the ε –precision trajectory { ˜ u K } from the ball B ε ( x ∗ ) , i.e., K exit = inf K > (cid:26) K (cid:12)(cid:12)(cid:12)(cid:12) k ˜ u K k ≥ ε (cid:27) , by the definition of exit time we have that (cid:13)(cid:13) ˜ u K exit (cid:13)(cid:13) ≥ ε .0 Rishabh Dixit, Waheed U. BajwaNow if the initial unstable subspace projection ∑ j ∈ N US ( θ usj ) satisfies the condition of Theorem 1 then from the relative error bound(22) we have that: k u K − ˜ u K kk u K k ≤ O (cid:18) √ ε (cid:18) log (cid:18) ε (cid:19) ε (cid:19) (cid:19) (371) = ⇒ − O (cid:18) √ ε (cid:18) log (cid:18) ε (cid:19) ε (cid:19) (cid:19) ≤ k ˜ u K kk u K k ≤ + O (cid:18) √ ε (cid:18) log (cid:18) ε (cid:19) ε (cid:19) (cid:19) (372) = ⇒ k ˜ u K k + O (cid:18) √ ε (cid:18) log (cid:18) ε (cid:19) ε (cid:19) (cid:19) ≤ k u K k ≤ k ˜ u K k − O (cid:18) √ ε (cid:18) log (cid:18) ε (cid:19) ε (cid:19) (cid:19) (373) = ⇒ ε + O (cid:18) √ ε (cid:18) log (cid:18) ε (cid:19) ε (cid:19) (cid:19) ≤ (cid:13)(cid:13) u K exit (cid:13)(cid:13) ≤ ( + d ) ε − O (cid:18) √ ε (cid:18) log (cid:18) ε (cid:19) ε (cid:19) (cid:19) (374)where we substituted K = K exit and used the bound ( + d ) ε ≥ (cid:13)(cid:13) ˜ u K exit (cid:13)(cid:13) ≥ ε for some d > K o exit we have that (cid:13)(cid:13)(cid:13) u K o exit (cid:13)(cid:13)(cid:13) ≥ ε . Hence, unless we have (cid:13)(cid:13) u K exit (cid:13)(cid:13) ≥ ε (which implies K o exit ≤ K exit ), the gradient trajectory { u K } will takenot more than K o exit − K exit iterations to travel the shell B ε ( x ∗ ) \ B k u Kexit k ( x ∗ ) . Next, K o exit − K exit can be upper bounded by Theorem 3provided the gradient trajectory has expansive dynamics at K exit (from Theorem 2).Now for sufficiently small ε and K exit ≥ B ε ( x ∗ ) ),there exists some K = K υ with K υ < K exit such that: k ˜ u K υ k − O (cid:18) √ ε (cid:18) log (cid:18) ε (cid:19) ε (cid:19) (cid:19) ≤ (cid:13)(cid:13) ˜ u K exit (cid:13)(cid:13) + O (cid:18) √ ε (cid:18) log (cid:18) ε (cid:19) ε (cid:19) (cid:19) . (375)Combining (375) with (373) for K = K exit and K = K υ we get: k u K υ k ≤ k ˜ u K υ k − O (cid:18) √ ε (cid:18) log (cid:18) ε (cid:19) ε (cid:19) (cid:19) ≤ (cid:13)(cid:13) ˜ u K exit (cid:13)(cid:13) + O (cid:18) √ ε (cid:18) log (cid:18) ε (cid:19) ε (cid:19) (cid:19) ≤ (cid:13)(cid:13) u K exit (cid:13)(cid:13) (376) = ⇒ k u K υ k ≤ (cid:13)(cid:13) u K exit (cid:13)(cid:13) . (377)which implies that the gradient trajectory has expansive dynamics at K = K exit from Theorem 2. Hence, the gradient trajectory will alsohave expansive dynamics from K = K exit to K = K o exit . 
Using Theorem 3 for ξ = (cid:13)(cid:13)(cid:13) u K o exit − (cid:13)(cid:13)(cid:13) , ε = (cid:13)(cid:13) u K exit (cid:13)(cid:13) , ˆ K exit = K o exit − K e = K exit we get: K o exit − − K exit = ˆ K exit − K e ≤ log ( (cid:13)(cid:13)(cid:13) u K o exit − (cid:13)(cid:13)(cid:13) ) − log ( (cid:13)(cid:13) u K exit (cid:13)(cid:13) ) log (cid:18) inf { ¯ ρ ( x k − ) } + M ξ (cid:19) + < log ( + O (cid:18) √ ε (cid:18) log (cid:18) ε (cid:19) ε (cid:19) (cid:19) ) log (cid:18) inf { ¯ ρ ( x k − ) } + M ξ (cid:19) + / (cid:13)(cid:13)(cid:13) u K o exit − (cid:13)(cid:13)(cid:13) < ε from the definition of K o exit , the lower bound on (cid:13)(cid:13) u K exit (cid:13)(cid:13) from (374) in the second last stepand dropped the term log ( + O (cid:18) √ ε (cid:18) log (cid:18) ε (cid:19) ε (cid:19) (cid:19) ) for sufficiently small ε . Hence we have the condition K o exit / K exit + O ( log ( ε − )) term after substituting the upper bound on K exit from (8). This completes theproof. (cid:4)(cid:4) References
1. Allen-Zhu, Z.: Natasha 2: Faster non-convex optimization than sgd. In: Advances in Neural Information Processing Systems, pp.2675–2686 (2018)oundary Conditions for Trajectories Around Saddle Points 512. Allen-Zhu, Z., Li, Y.: Neon2: Finding local minima via first-order oracles. In: Advances in Neural Information Processing Systems,pp. 3716–3726 (2018)3. Anandkumar, A., Ge, R.: Efficient approaches for escaping higher order saddle points in non-convex optimization. In: Conferenceon learning theory, pp. 81–102 (2016)4. Attouch, H., Bolte, J., Svaiter, B.F.: Convergence of descent methods for semi-algebraic and tame problems: proximal algorithms,forward–backward splitting, and regularized gauss–seidel methods. Mathematical Programming (1-2), 91–129 (2013)5. Bolte, J., Daniilidis, A., Lewis, A.: The łojasiewicz inequality for nonsmooth subanalytic functions with applications to subgradientdynamical systems. SIAM Journal on Optimization (4), 1205–1223 (2007)6. Broyden, C.G.: The convergence of a class of double-rank minimization algorithms 1. general considerations. IMA Journal ofApplied Mathematics (1), 76–90 (1970)7. Corless, R.M., Gonnet, G.H., Hare, D.E., Jeffrey, D.J., Knuth, D.E.: On the lambertw function. Advances in Computational mathe-matics (1), 329–359 (1996)8. Curry, H.B.: The method of steepest descent for non-linear minimization problems. Quarterly of Applied Mathematics (3), 258–261 (1944)9. Daneshmand, H., Kohler, J., Lucchi, A., Hofmann, T.: Escaping saddles with stochastic gradients. arXiv preprint arXiv:1803.05999(2018)10. Dixit, R., Bajwa, W.U.: Exit time analysis for approximations of gradient descent trajectories around saddle points. arXiv preprintarXiv:2006.01106 (2020)11. Du, S.S., Jin, C., Lee, J.D., Jordan, M.I., Singh, A., Poczos, B.: Gradient descent can take exponential time to escape saddle points.In: Advances in neural information processing systems, pp. 1067–1077 (2017)12. Fang, C., Lin, Z., Zhang, T.: Sharp analysis for nonconvex sgd escaping from saddle points. arXiv preprint arXiv:1902.00247 (2019)13. Hestenes, M.R., et al.: Methods of conjugate gradients for solving linear systems. Journal of research of the National Bureau ofStandards (6), 409–436 (1952)14. Hoffmeister, F., B¨ack, T.: Genetic algorithms and evolution strategies: Similarities and differences. In: International Conference onParallel Problem Solving from Nature, pp. 455–469. Springer (1990)15. Hu, W., Li, C.J.: On the fast convergence of random perturbations of the gradient flow. arXiv preprint arXiv:1706.00837 (2017)16. Jin, C., Ge, R., Netrapalli, P., Kakade, S.M., Jordan, M.I.: How to escape saddle points efficiently. In: Proceedings of the 34thInternational Conference on Machine Learning-Volume 70, pp. 1724–1732. JMLR. org (2017)17. Jin, C., Netrapalli, P., Jordan, M.I.: Accelerated gradient descent escapes saddle points faster than gradient descent. arXiv preprintarXiv:1711.10456 (2017)18. Karmarkar, N.: A new polynomial-time algorithm for linear programming. In: Proceedings of the sixteenth annual ACM symposiumon Theory of computing, pp. 302–311. ACM (1984)19. Kelley, A.: The stable, center-stable, center, center-unstable, unstable manifolds. Journal of Differential Equations (1966)20. Kifer, Y.: The exit problem for small random perturbations of dynamical systems with a hyperbolic fixed point. Israel Journal ofMathematics (1), 74–96 (1981)21. Lee, D.D., Seung, H.S.: Algorithms for non-negative matrix factorization. In: Advances in neural information processing systems,pp. 
556–562 (2001)22. Lee, J.D., Panageas, I., Piliouras, G., Simchowitz, M., Jordan, M.I., Recht, B.: First-order methods almost always avoid saddlepoints. arXiv preprint arXiv:1710.07406 (2017)23. Lee, J.D., Simchowitz, M., Jordan, M.I., Recht, B.: Gradient descent converges to minimizers. arXiv preprint arXiv:1602.04915(2016)24. Łojasiewicz, S.: Sur le probl`eme de la division. Studia Mathematica , 87–136 (1959)25. Magnus, J.R., Neudecker, H.: Matrix differential calculus with applications to simple, hadamard, and kronecker products. Journalof Mathematical Psychology (4), 474–492 (1985)26. Mehrotra, S.: On the implementation of a primal-dual interior point method. SIAM Journal on optimization (4), 575–601 (1992)27. Mokhtari, A., Ozdaglar, A., Jadbabaie, A.: Escaping saddle points in constrained optimization. In: Advances in Neural InformationProcessing Systems, pp. 3629–3639 (2018)28. M¨uhlenbein, H., Schomisch, M., Born, J.: The parallel genetic algorithm as function optimizer. Parallel computing (6-7), 619–632(1991)29. Nesterov, Y.: Introductory lectures on convex optimization: A basic course, vol. 87. Springer Science & Business Media (2013)30. Nesterov, Y., Nemirovskii, A.: Interior-point polynomial algorithms in convex programming, vol. 13. Siam (1994)31. Nesterov, Y., Polyak, B.T.: Cubic regularization of newton method and its global performance. Mathematical Programming (1),177–205 (2006)32. Paternain, S., Mokhtari, A., Ribeiro, A.: A newton-based method for nonconvex optimization with fast evasion of saddle points.SIAM Journal on Optimization (1), 343–368 (2019)33. Polyak, B.T.: Some methods of speeding up the convergence of iteration methods. USSR Computational Mathematics and Mathe-matical Physics (5), 1–17 (1964)34. Rastrigin, L.A.: Systems of extremal control. Nauka (1974)35. Reddi, S.J., Zaheer, M., Sra, S., Poczos, B., Bach, F., Salakhutdinov, R., Smola, A.J.: A generic approach for escaping saddle points.arXiv preprint arXiv:1709.01434 (2017)36. Rosenbrock, H.: An automatic method for finding the greatest or least value of a function. The Computer Journal (3), 175–184(1960)37. Sanger, T.D.: Optimal unsupervised learning in a single-layer linear feedforward neural network. Neural networks (6), 459–473(1989)2 Rishabh Dixit, Waheed U. Bajwa38. Shi, B., Su, W.J., Jordan, M.I.: On learning rates and Schr¨odinger operators. arXiv preprint (2020). URL https://arxiv.org/abs/2004.06977https://arxiv.org/abs/2004.06977