Newton-Type Methods for Non-Convex Optimization Under Inexact Hessian Information
aa r X i v : . [ m a t h . O C ] M a y Newton-Type Methods for Non-Convex Optimization UnderInexact Hessian Information
Peng Xu ∗ Fred Roosta † Michael W. Mahoney ‡ May 15, 2019
Abstract
We consider variants of trust-region and adaptive cubic regularization methodsfor non-convex optimization, in which the Hessian matrix is approximated. Undercertain condition on the inexact Hessian, and using approximate solution of the cor-responding sub-problems, we provide iteration complexity to achieve ǫ -approximatesecond-order optimality which have been shown to be tight. Our Hessian approx-imation condition offers a range of advantages as compared with the prior worksand allows for direct construction of the approximate Hessian with a priori guaran-tees through various techniques, including randomized sampling methods. In thislight, we consider the canonical problem of finite-sum minimization, provide appro-priate uniform and non-uniform sub-sampling strategies to construct such Hessianapproximations, and obtain optimal iteration complexity for the corresponding sub-sampled trust-region and adaptive cubic regularization methods. Consider the generic unconstrained optimization problemmin x ∈ R d F ( x ) , ( P0 )where F : R d → R is smooth and non-convex . Faced with the large-scale nature of modern“big-data” problems, many of the classical optimization algorithms might prove to beinefficient, if applicable at all. In this light, many of the recent research efforts have beencentered around designing variants of classical algorithms which, by employing suitable approximations of the gradient and/or Hessian, improve upon the cost-per-iteration, whilemaintaining the original iteration complexity. In this light, we focus on trust-region (TR)[17] and cubic regularization (CR) [34], two algorithms which are considered as amongthe most elegant and theoretically sound general-purpose Newton-type methods for non-convex problems. ∗ Institute for Computational and Mathematical Engineering, Stanford University, Email:[email protected] † School of Mathematics and Physics, University of Queensland, Brisbane, Australia, and InternationalComputer Science Institute, Berkeley, USA, Email: [email protected] ‡ International Computer Science Institute and Department of Statistics, University of California atBerkeley, Email: [email protected]
1n doing so, we first consider ( P0 ), and study the theoretical convergence properties ofvariants of these two algorithms in which, under favorable conditions, Hessian is suitablyapproximated. We show that our Hessian approximation conditions, in many cases, areweaker than the existing ones in the literature. In addition, and in contrast to some priorworks, our conditions allow for efficient constructions of the inexact Hessian with a prioriguarantees via various approximation methods, of which Randomized Numerical LinearAlgebra (RandNLA), [22, 42], techniques are shown to be highly effective.Subsequently, to showcase the application of randomized techniques for constructionof the approximate Hessian, we consider an important instance of ( P0 ), i.e., large-scale finite-sum minimization, of the formmin x ∈ R d F ( x ) , n n X i =1 f i ( x ) , ( P1 )and its special case min x ∈ R d F ( x ) , n n X i =1 f i ( a Ti x ) , ( P2 )where n ≫
1, each f i is a smooth but possibly non-convex function, and a i ∈ R d , i =1 , . . . , n, are given. Problems of the form ( P1 ) and ( P2 ) arise very often in machinelearning, e.g., [51] as well as scientific computing, e.g., [47, 48]. In big-data regimewhere n ≫
1, operations with the Hessian of F , e.g., matrix-vector products, typicallyconstitute the main bottleneck of computations. Here, we show that our relaxed Hessianapproximation conditions allow one to draw upon the sub-sampling ideas of [6, 49, 62],to design variants of TR and CR algorithms where the Hessian is (non-)uniformly sub-sampled. We then present the theoretical convergence properties of these variants fornon-convex finite-sum problems of the form ( P1 ) and ( P2 ).The rest of this paper is organized as follows. In Section 1.1, we first introduce thenotation and definitions used throughout the paper. For completeness, in Section 1.2, wegive a brief review of trust region (Section 1.2.1) and cubic regularization (Section 1.2.2)along with related prior works. Our main contributions are summarized in Section 1.3.Theoretical analysis of the proposed algorithms for solving generic non-convex problem( P0 ) are presented in Section 2. Various randomized sub-sampling strategies as well astheoretical properties of the proposed algorithms for finite-sum minimization problems( P1 ) and ( P2 ) are given in Section 3. Conclusions and further thoughts are gathered inSection 4. Throughout the paper, vectors are denoted by bold lowercase letters, e.g., v , and matricesor random variables are denoted by bold upper case letters, e.g., V . v T denotes thetranspose of a real vector v . We use regular lower-case and upper-case letters to denotescalar constants, e.g., c or K . For two vectors, v , w , their inner-product is denoted as h v , w i = v T w . For a vector v , and a matrix V , k v k and k V k denote the vector ℓ norm and the matrix spectral norm, respectively, while k V k F is the matrix Frobeniusnorm. ∇ F ( x ) and ∇ F ( x ) are the gradient and the Hessian of F at x , respectively, and I denotes the identity matrix. For two symmetric matrices A and B , A (cid:23) B indicatesthat A − B is symmetric positive semi-definite. The subscript, e.g., x t , denotes iteration2ounter and log( x ) is the natural logarithm of x . The inexact Hessian is denoted by H ( x ),but for notational simplicity, we may use H t to, instead, denote the approximate Hessianevaluated at the iterate x t in iteration t , i.e., H t , H ( x t ). Throughout the paper, S denotes a collection of indices from { , , · · · , n } , with potentially repeated items and itscardinality is denoted by |S| .Unlike convex functions for which “local optimality” and “global optimality” are infact the same, in non-convex settings, we are often left with designing algorithms thatcan guarantee convergence to approximate local optimality. In this light, throughout thispaper, we make use of the following definition of ( ǫ g , ǫ H )-Optimality: Definition 1 (( ǫ g , ǫ H )-Optimality) . Given ǫ g , ǫ H ∈ (0 , , x ∈ R d is an ( ǫ g , ǫ H ) -optimalsolution to the problem ( P0 ) , if k∇ F ( x ) k ≤ ǫ g , λ min ( ∇ F ( x )) ≥ − ǫ H . (1)We note that ( ǫ g , ǫ H )-Optimality (even with ǫ g = ǫ H = 0) does not necessarily implycloseness to any local minimum, neither in iterate nor in the objective value. However,if the saddle points satisfy the strict-saddle property [26, 40], then an ( ǫ g , ǫ H )-optimalityguarantees vicinity to a local minimum for sufficiently small ǫ g and ǫ H . Arguably, the most straightforward approach for globalization of many Newton-type al-gorithms is the application of line-search. 
However, near saddle points where the gradientmagnitude can be small, traditional line search methods can be very ineffective and infact produce iterates that can get stuck at a saddle point [46]. Trust region and cubicregularization methods are two elegant globalization alternatives that, specially recently,have attracted much attention. The main advantage of these methods is that they arereliably able to take advantage of the direction of negative curvature and escape saddlepoints. In this section we briefly review these algorithms as they pertain to the presentpaper and mention the relevant prior works.
TR methods [17, 53] encompass a general class of iterative methods which specificallydefine a region around the current iterate within which they trust the model to be areasonable approximation of the true objective function. The most widely used approxi-mating model, which we consider here, is done via a quadratic function. More specifically,using the current iterate x t , the quadratic variant of TR algorithm finds the next iterateas x t +1 = x t + s t where s t is a solution of the constrained sub-problemmin m t ( s ) , h s , ∇ F ( x t ) i + 12 h s , ∇ F ( x t ) s i (2a)s.t. k s k ≤ ∆ t . Here, ∆ t is the region in which we “trust” our quadratic model to be an acceptableapproximation of the true objective for the current iteration. The major bottleneck ofcomputations in TR algorithm is the minimization of the constrained quadratic sub-problem (2a), for which numerous approaches have been proposed, e.g., [23, 28, 29, 36,41, 43, 54, 56]. 3or a smooth non-convex objective and in order to obtain approximate first-ordercriticality, i.e., k∇ F ( x t ) k ≤ ǫ g for some ǫ g ∈ (0 , O ( ǫ − g ); e.g.,[5, 11, 31, 32, 33]. Recent non-trivial modifications of the classical TR methods havealso been proposed which improve upon the complexity to O ( ǫ − / g ); see [19] and furtherextensions to a more general framework in [20]. These bounds can be shown to be tight[13] in the worst case. Under a more general algorithmic framework and in terms ofobjective function sub-optimality, i.e., F ( x ) − F ∗ ≤ ǫ , better complexity bounds, in theconvex and strongly-convex settings, have been obtained which are of the orders of O ( ǫ − g )and O (log(1 /ǫ g )), respectively [30].For non-convex problems, however, it is more desired to obtain complexity bounds forachieving approximate second-order criticality, i.e., Definition 1. For this, bounds in theorders of O (max { ǫ − H ǫ − g , ǫ − H } ) and O (max { ǫ − g , ǫ − H } ) have been obtained in [11] and [30],respectively. Similar bounds were also given in [33] under probabilistic model. Boundsof this order have shown to be optimal in certain cases [11].More closely related to the present paper, there have been several results which studythe role of derivative-free and probabilistic models in general, and Hessian approximationin particular, e.g., see [2, 5, 11, 16, 18, 33, 39, 52] and references therein. An alternative to the traditional line-search and TR for globalization of Newton-typemethods is the application of cubic regularization. Such class of methods is characterizedby generating iterates as x t +1 = x t + s t where s t is a solution of the following unconstrained sub-problem min s ∈ R d m t ( s ) , h s , ∇ F ( x t ) i + 12 h s , ∇ F ( x t ) s i + σ t k s k , (2b)where σ t is the cubic regularization parameter chosen for the current iteration. As in thecase of TR, the major bottleneck of CR involved solving the sub-problem (2b), for whichvarious techniques have been proposed, e.g., [1, 4, 8, 9].To the best of our knowledge, the use of such regularization, was first introduced inthe pioneering work of [34], and subsequently further studied in the seminal works of[9, 10, 45].From the worst-case complexity point of view, CR has a better dependenceon ǫ g compared to TR. More specifically, [45] showed that, under global Lipschitz con-tinuity assumption on the Hessian, if the sub-problem (2b) is solved exactly, then theresulting CR algorithm achieves the approximate first-order criticality with complexityof O ( ǫ − / g ). 
These results were extended by the pioneering and seminal works of [9, 10]to an adaptive variant, which is often referred to as ARC (Adaptive Regularization withCubics). In particular, the authors showed that the worst case complexity of O ( ǫ − / g )can be achieved without requiring the knowledge of the Hessian’s Lipschitz constant, ac-cess to the exact Hessian, or multi-dimensional global optimization of the sub-problem(2b). These results were further refined in [11] where it was shown that, not only, multi-dimensional global minimization of (2b) is unnecessary, but also the same complexitycan be achieved with mere one or two dimensional search. This O ( ǫ − / ) bound has beenshown to be tight [14]. As for the approximate second-order criticality, [11] showed thatat least O (max { ǫ − g , ǫ − H } ) is required. With further assumptions on the inexactness of4ub-problem solution, [10, 11] also show that one can achieve O (max { ǫ − / g , ǫ − H } ), whichis shown to be tight [13]. Better dependence on ǫ g can be obtained if one assumes addi-tional structure, such as convexity, e.g., see [12, 45] as well as the acceleration scheme of[44].Recently, for (strongly) convex problems, [27] obtained sub-optimal complexity forARC and its accelerated variants using Hessian approximations. In the context of stochas-tic optimization problems, [57] considers cubic regularization with a priori chosen fixedregularization parameter using both approximations of the gradients and Hessian. Spe-cific to the finite-sum problem ( P1 ), and by a direct application of the theoretical resultsof [9, 10], [37] presents a sub-sampled variant of ARC, in which the exact Hessian andthe gradient are replaced by sub-samples. However, unfortunately, their analysis suffersfrom a rather vicious circle: the approximate Hessian and gradient are formed based onan a priori unknown step which can only be determined after such approximations areformed. In this section, we summarize the key aspects of our contributions. In Section 2, we con-sider ( P0 ) and establish the worst-case iteration complexities for variants of trust-regionand adaptive cubic regularization methods in which the Hessian is suitably approximated.More specifically, our entire analysis is based on the following key condition on the ap-proximate Hessian H ( x ): Condition 1 (Inexact Hessian Regularity) . For some < K H < ∞ , ǫ > , the approxi-mating Hessian, H ( x t ) , satisfies (cid:13)(cid:13)(cid:0) H ( x t ) − ∇ F ( x t ) (cid:1) s t (cid:13)(cid:13) ≤ ǫ · k s t k , (3a) k H ( x t ) k ≤ K H , (3b) where x t and s t are, respectively, the iterate and the update at iteration t . Under Condition 1, we show that our proposed algorithms (Algorithms 1 and 2)achieve the same worst-case iteration complexity to obtain approximate second ordercritical solution as that of the exact variants (Theorems 1, 2, and 3).In Section 3, we describe schemes for constructing H ( x t ) to satisfy Condition 1. Specif-ically, in the context of finite-sum optimization framework, i.e., problems ( P1 ) and ( P2 ),we present various sub-sampling schemes to probabilistically ensure Condition 1 (Lem-mas 16 and 17). Our proposed randomized sub-sampling strategies guarantee, with highprobability, a stronger condition than (3a), namely k H ( x ) − ∇ F ( x ) k ≤ ǫ. (4)It is clear that (4) implies (3a). 
We then give optimal iteration complexities for Algo-rithms 1 and 2 for optimization of non-convex finite-sum problems where the Hessian isapproximated by means of appropriate sub-sampling (Theorems 4, 5 and 6).To establish optimal second-order iteration complexity, many previous works consid-ered Hessian approximation conditions that, while enjoying many advantages, come withcertain disadvantages. Our proposed Condition 1 aims to remedy some of these disad-vantages. We first briefly review the conditions used in the prior works, and subsequentlyhighlight the merits of Condition 1 in comparison.5 .3.1 Conditions Used in Prior Works For the analysis of trust-region, many authors have considered the following condition (cid:13)(cid:13) H ( x t ) − ∇ F ( x t + s ) (cid:13)(cid:13) ≤ C ∆ t , ∀ s ∈ { s ; k s k ≤ ∆ t } , (5a)for some 0 < C < ∞ , where ∆ t is the current trust-region radius, e.g., [2, 33]. In [5],condition (5a) is replaced with (cid:13)(cid:13) H ( x t ) − ∇ F ( x t ) (cid:13)(cid:13) ≤ C ∆ t , (5b)for some 0 < C < ∞ . In fact, by assuming Lipschitz continuity of Hessian, it is easyto show that (5a) and (5b) are equivalent, in that one implies the other, albeit withmodified constants. We also note that [2, 5, 33] study a more general framework underwhich the entire sub-problem model is probabilistically constructed and approximationextends beyond just the Hessian.For cubic regularization, the condition imposed on the inexact Hessian is often con-sidered as (cid:13)(cid:13)(cid:0) H ( x t ) − ∇ F ( x t ) (cid:1) s t (cid:13)(cid:13) ≤ C k s t k , (5c)for some 0 < C < ∞ , e.g., [9, 10, 11] and other follow-up works. In fact, [11] has alsoestablished optimal iteration complexity for trust-region algorithm under (5c). Both of(5a) and (5c), are stronger than the celebrated Dennis-Mor´e [21] condition, i.e.,lim t →∞ k ( H ( x t ) − ∇ F ( x t )) s t kk s t k = 0 . Indeed, under certain assumptions, Dennis-Mor´e condition is satisfied by a number ofquasi-Newton methods, although the same cannot be said about (5a) and (5c) [9].
For our trust-region analysis, we require Condition 1 with ǫ ∈ O (max { ǫ H , ∆ t } ); see(11) in Theorem 1. Hence, when ∆ t is large, e.g., at the beginning of iterations, allthe conditions (3a), (5a), and (5b) are equivalent, up to some constants. However, theconstants in (5a) and (5b) can be larger than what is implied by (3a), amounting tocruder approximations in practice for when ∆ t is large. As iterations progress, the trust-region radius will get smaller, and in fact it is expected that ∆ t will eventually shrink tobe ∆ t ∈ Θ (min { ǫ g , ǫ H } ). In prior works, e.g., [5, 33], the convergence analysis is derivedusing ǫ H = ǫ g , whereas here we allow ǫ H = √ ǫ g . As a result, the requirements (5a) and(5b) can eventually amount to stricter conditions than (3a).As for (5c), the main drawback lies in the difficulty of enforcing it. Despite the factthat for certain values of k s t k and ǫ , e.g., ǫ ≪ k s t k , (5c) can be less restrictive than(3a), a priori enforcing (5c) requires one to have already computed the search direction s t , which itself can be done only after H ( x t ) is constructed, hence creating a viciouscircle. A posteriori guarantees can be given if one obtains a lower-bound estimate on theyet-to-be-computed step-size, i.e., to have s > s ≤ k s t k . This allows oneto consider a stronger condition as k ( H ( x t ) − ∇ F ( x t )) k ≤ C s , which can be enforcedusing a variety of methods such as those described in Section 3. However, to obtain sucha lower-bound estimate on the next step-size, one has to resort to a recursive procedure,6hich necessitates repeated constructions of the approximate Hessian and subsequentsolutions of the corresponding subproblems. Consequently, this procedure may result ina significant computational overhead and will lead to undesirable theoretical complexities.In sharp contrast to (5c), the condition (3a) allows for theoretically principled use ofmany practical techniques to construct H t . For example, under (3a), the use of quasi-Newton methods to approximate the Hessian is theoretically justified. Further, by con-sidering the stronger condition (4), many randomized matrix approximation techniquescan be readily applied, e.g., [42, 58, 59, 60]; see Section 3. To the best of our knowledge,the only successful attempt at guaranteed a priori construction of H t using (5c) is donein [15]. Specifically, by considering probabilistic models, which are “sufficiently accurate”in that they are partly based on (5c), [15] studies first-order complexity of a large classof methods, including ARC, and discusses ways to construct such probabilistic modelsas long as the gradient is large enough, i.e., before first-order approximate-optimality isachieved. Here, by considering (3a), we are able to provide an alternative analysis, whichallows us to obtain second-order complexity results.Requiring (4), as a way of enforcing (3a), offers a variety of other practical advan-tages, which are not readily available with other conditions. For example, consider dis-tributed/parallel environments where the data is distributed across a network and themain bottleneck of computations is the communications across the nodes. In such set-tings, since (4) allows for the Hessian accuracy to be set a priori and to remain fixedacross all iterations, the number of samples in each node can stay the same throughoutiterations. 
This prevents unnecessary communications to re-distribute the data at everyiteration.Furthermore, in case of failed iterations, i.e., when the computed steps are rejected,the previous H t may seamlessly be used in the next iteration, which avoids repeatingmany such, potentially expensive, computations throughout the iterations. For example,consider approximate solutions to the underlying sub-problems by means of dimension-ality reduction, i.e., H t is projected onto a lower dimensional sub-space as U T H t U forsome U ∈ R d × p with p ≪ d , resulting in a smaller dimensional sub-problem. Now ifthe current iteration leads to a rejected step, the projection of the H t from the previousiteration can be readily re-used in the next iteration. This naturally amounts to savingfurther Hessian computations. We are now ready to present our main algorithms for solving the generic non-convexoptimization ( P0 ) along with their corresponding iteration complexity results to obtaina ( ǫ g , ǫ H )-optimal solution as in (1). More precisely, in Section 2.1 and 2.2, respectively,we present modifications of the TR and ARC methods which incorporate inexact Hessianinformation, according to Condition 1.We remind that, though not specifically mentioned in the statement of the theoremsor the algorithms, when the computed steps are rejected and an iteration needs to berepeated with different ∆ t or σ t , the previous H t may seamlessly be used in the nextiteration. This can be a desirable feature in many practical situations and is directly theresult of enforcing (4); see also the discussion in Section 1.3.2.For our analysis throughout the paper, we make the following standard assumptionregarding the regularity of the exact Hessian of the objective function F .7 ssumption 1 (Hessian Regularity) . F ( x ) is twice differentiable and has bounded andLipschitz continuous Hessian on the piece-wise linear path generated by the iterates, i.e.for some < K, L < ∞ and all iterations (cid:13)(cid:13) ∇ F ( x ) − ∇ F ( x t ) (cid:13)(cid:13) ≤ L k x − x t k , ∀ x ∈ [ x t , x t + s t ] , (6a) (cid:13)(cid:13) ∇ F ( x t ) (cid:13)(cid:13) ≤ K, (6b) where x t and s t are, respectively, the iterate and the update step at iteration t . Although, we do not know of a particular way to, a priori, verify (6a), it is clear thatAssumption (6a) is weaker than Lipschitz continuity of the Hessian for all x , i.e., (cid:13)(cid:13) ∇ F ( x ) − ∇ F ( y ) (cid:13)(cid:13) ≤ L k x − y k , ∀ x , y ∈ R d . (7)Despite the fact that theoretically (6a) is weaker than (7), to the best of our knowledgeas of yet, (7) is the only practical sufficient condition for verifying (6a). Algorithm 1 depicts a trust-region algorithm where at each iteration t , instead of thetrue Hessian ∇ F ( x t ), only an inexact approximation, H t , is used. For Algorithm 1, theaccuracy tolerance in (3a) is adaptively chosen as ǫ t ≤ max { ǫ , ∆ t } , where ∆ t is the trustregion in the t-th iteration and ǫ ∈ O ( ǫ H ) is some fixed threshold. This allows for avery crude approximation at the beginning of iterations, when ∆ t is large. As iterationsprogress towards optimality and ∆ t gets small, the threshold ǫ can prevent ǫ from gettingunnecessarily too small.In Algorithm 1, we require that the sub-problem (8) is solved only approximately.Indeed, in large-scale problems, where the exact solution of the sub-problem is the mainbottleneck of the computations, this is a very crucial relaxation. 
Such approximatesolution of the sub-problem (8) has been adopted in many previous work. Here, wefollow the inexactness conditions discussed in [17], which are widely known as Cauchyand Eigenpoint conditions. Recall that the Cauchy and Eigen directions correspond,respectively, to one dimensional minimization of the sub-problem (8) along the directionsgiven by the gradient and negative curvature. Condition 2 (Sufficient Descent Cauchy and Eigen Directions [17]) . Assume that wesolve the sub-problem (8) approximately to find s t such that − m t ( s t ) ≥ − m t ( s Ct ) ≥ k∇ F ( x t ) k min (cid:26) k∇ F ( x t ) k k H t k , ∆ t (cid:27) , (9a) − m t ( s t ) ≥ − m t ( s Et ) ≥ ν | λ min ( H t ) | ∆ t , if λ min ( H t ) < . (9b) Here, m t ( · ) is defined in (8) , s Ct (Cauchy point) is along negative gradient direction and s Et is along approximate negative curvature direction such that h s Et , H t s Et i ≤ νλ min ( H t ) k s Et k < , for some ν ∈ (0 , (see Appendix B for a way to efficiently compute s Et ). One way to ensure that an approximate solution to the sub-problem (8) satisfies (9),is by replacing (8) with the following reduced-dimension problem, in which the searchspace is a two-dimensional sub-space containing vectors s Ct , and s Et , i.e., s t = arg min k s k≤ ∆ t s ∈ Span { s Ct , s Et } h∇ F ( x t ) , s i + 12 h s , H t s i . lgorithm 1 Trust Region with Inexact Hessian Input:
Starting point x , initial radius 0 < ∆ < ∞ , hyper-parameters ǫ , ǫ g , ǫ H , η ∈ (0 , , γ > for t = 0 , , . . . do Set the approximate Hessian, H t , as in (3) with ǫ t ≤ max { ǫ , ∆ t } if k∇ F ( x t ) k ≤ ǫ g , λ min ( H t ) ≥ − ǫ H then Return x t . end if Solve the sub-problem approximately s t ≈ arg min k s k≤ ∆ t m t ( s ) , h∇ F ( x t ) , s i + 12 h s , H t s i (8) Set ρ t , F ( x t ) − F ( x t + s t ) − m t ( s t ) if ρ t ≥ η then x t +1 = x t + s t ∆ t +1 = γ ∆ t else x t +1 = x t ∆ t +1 = ∆ t /γ end if end for Output: x t Of course, any larger dimensional sub-space P for which we have Span { s Ct , s Et } ⊆ P would also guarantee (9). In fact, a larger dimensional sub-space implies a more accuratesolution to our original sub-problem (8).We now set out to provide iteration complexity for Algorithm 1. Our analysis followssimilar line of reasoning as that in [9, 10, 11]. First, we show the discrepancy betweenthe quadratic model and objective function in Lemma 1. Lemma 1.
Given Assumption 1 and Condition (3a) with any ǫ t > , we have | F ( x t + s t ) − F ( x t ) − m t ( s t ) | ≤ L t + ǫ t t . (10) Proof.
Applying Mean Value Theorem on F at x t gives F ( x t + s t ) = F ( x t ) + ∇ F ( x t ) T s t + s Tt ∇ F ( ξ t ) s t , for some ξ t in the segment of [ x t , x t + s t ]. We have | F ( x t + s t ) − F ( x t ) − m t ( s t ) | = 12 (cid:12)(cid:12) s Tt ( ∇ F ( ξ t ) − H t ) s t (cid:12)(cid:12) = 12 (cid:12)(cid:12) s Tt ( ∇ F ( ξ t ) − ∇ F ( x t ) + ∇ F ( x t ) − H t ) s t (cid:12)(cid:12) ≤ (cid:12)(cid:12) s Tt ( ∇ F ( ξ t ) − ∇ F ( x t )) s t (cid:12)(cid:12) + 12 (cid:12)(cid:12) s Tt ( ∇ F ( x t ) − H t ) s t (cid:12)(cid:12) ≤ L k s t k + ǫ t k s t k ≤ L t + ǫ t t . Lemma 2.
Consider any ǫ H > , let ǫ , α (1 − η ) νǫ H for some α ∈ (0 , , and supposeCondition 1 is satisfied with ǫ t ≤ max { ǫ , ∆ t } , where ∆ t is the trust region at the t-thiteration. Given Assumption 1 and Condition 2, if λ min ( H t ) < − ǫ H and ∆ t ≤ (1 − α )(1 − η ) ν | λ min ( H t ) | / ( L + 1) , then the t-th iteration is successful, i.e. ∆ t +1 = γ ∆ t .Proof. Suppose ∆ t ≤ ǫ . From (9b) and (10), we have1 − ρ t = F ( x t + s t ) − F ( x t ) − m t ( s t ) − m t ( s t ) ≤ L ∆ t + ǫ t ∆ t ν | λ min ( H t ) | ∆ t ≤ L ∆ t + α (1 − η ) νǫ H ∆ t ν | λ min ( H t ) | ∆ t ≤ L ∆ t + α (1 − η ) ν | λ min ( H t ) | ∆ t ν | λ min ( H t ) | ∆ t ≤ L ∆ t + α (1 − η ) ν | λ min ( H t ) | ν | λ min ( H t ) | . By the assumption on ∆ t , we get ρ t ≥ η and the iteration is successful. Now consider∆ t ≥ ǫ . Similar to the above, we have1 − ρ t = F ( x t + s t ) − F ( x t ) − m t ( s t ) − m t ( s t ) ≤ L ∆ t + ǫ t ∆ t ν | λ min ( H t ) | ∆ t ≤ ( L + 1)∆ t ν | λ min ( H t ) | ∆ t ≤ ( L + 1)∆ t ν | λ min ( H t ) | , which again by assumption on ∆ t and noting α <
1, we get ρ t ≥ η and the iteration issuccessful. Lemma 3.
Suppose Condition 1 is satisfied with any ǫ t > . Given Assumption 1 andCondition 2, if k∇ F ( x t ) k > ǫ g and ∆ t ≤ min ( k∇ F ( x t ) k (1 + K H ) , p ǫ t + 4 L (1 − η ) k∇ F ( x t ) k − ǫ t L ) , then, the t-th iteration is successful, i.e. ∆ t +1 = γ ∆ t .Proof. By assumption on ∆ t , (9a), and since k∇ F ( x t ) k > ǫ g , we have − m t ( s t ) ≥ k∇ F ( x t ) k min (cid:26) k∇ F ( x t ) k k H t k , ∆ t (cid:27) ≥ k∇ F ( x t ) k ∆ t . Therefore,1 − ρ t = F ( x t + s t ) − F ( x t ) − m t ( s t ) − m t ( s t ) ≤ L ∆ t + ǫ t ∆ t k∇ F ( x t ) k ∆ t ≤ L ∆ t + ǫ t ∆ t k∇ F ( x t ) k ≤ − η, where the last inequality follows by assumption on ∆ t . So ρ t ≥ η , which means theiteration is successful.Lemma 4 gives a lower bound for the trust region radius before the algorithm termi-nates, i.e., this ensures that the trust region never shrinks to become too small.10 emma 4. Consider any ǫ g , ǫ H > such that ǫ H ≤ √ ǫ g and let ǫ , α (1 − η ) νǫ H forsome α ∈ (0 , . Further, suppose Condition 1 is satisfied with ǫ t ≤ max { ǫ , ∆ t } , where ∆ t is the trust region at the t-th iteration. For Algorithm 1, under Assumption 1 andCondition 2, we have ∆ t ≥ κ ∆ min { ǫ g , ǫ H } , ∀ t ≥ , where κ ∆ , min { κ , κ , κ , κ } /γ, κ , (1 − α )(1 − η ) ν/ ( L + 1) , κ , α (1 − η ) ν,κ , / (1 + K H ) , κ , p ( α (1 − η ) ν ) + 4 L (1 − η ) − α (1 − η ) ν/ (2 L ) . Proof.
We prove by contradiction. Assume that the t-th iteration is the first unsuccessfuliteration such that ∆ t +1 = ∆ t /γ ≤ κ ∆ min { ǫ g , ǫ H } , i.e., we have∆ t ≤ min { κ , κ , κ , κ } · min { ǫ g , ǫ H } . Suppose λ min ( H t ) < − ǫ H . By Lemma 2, since ∆ t ≤ (1 − α )(1 − η ) ν | λ min ( H t ) | / ( L + 1),iteration t must have been accepted and we must have ∆ t +1 = γ ∆ t > ∆ t , which isa contradiction. Now suppose k∇ F ( x t ) k ≥ ǫ g . By assumption on ∆ t , we have that∆ t ≤ κ ǫ H = ǫ , which implies that ǫ t ≤ ǫ . Since the function h ( a, b ) , − a + √ a + b ,for any fixed b >
0, is decreasing in a , and for any fixed a , is increasing in b ≥
0, we have h ( ǫ t , L (1 − η ) k∇ F ( x t ) k ) ≥ h ( ǫ , L (1 − η ) k∇ F ( x t ) k ) ≥ h ( ǫ , L (1 − η ) ǫ g ) ≥ h ( ǫ , L (1 − η ) ǫ H ) , which implies p ǫ t + 4 L (1 − η ) k∇ F ( x t ) k − ǫ t L ≥ κ ǫ H . As a result, since ∆ t ≤ min { κ ǫ g , κ ǫ H } , it must satisfy the condition of Lemma 3. Thisimplies that iteration t must have been accepted, which is a contradiction.The following lemma follows closely the line of reasoning in [11, Lemma 4.5]. Lemma 5 (Successful Iterations) . Consider any ǫ g , ǫ H > such that ǫ H ≤ √ ǫ g and let ǫ , α (1 − η ) νǫ H for some α ∈ (0 , . Further, suppose Condition 1 is satisfied with ǫ t ≤ max { ǫ , ∆ t } , where ∆ t is the trust region at the t-th iteration. Let T succ denote theset of all the successful iterations before Algorithm 1 stops. Then, under Assumption 1,Condition 2, the number of successful iterations is upper bounded by, |T succ | ≤ ( F ( x ) − F min ) η min { b κ ∆ , e κ ∆ } · max { ǫ − g ǫ − H , ǫ − H } where b κ ∆ , κ ∆ / , e κ ∆ , νκ / , and κ ∆ is as defined in Lemma 4.Proof. Suppose Algorithm 1 doesn’t terminate at the t-th iteration. Then we have either k∇ F ( x t ) k ≥ ǫ g or λ min (∆ F ( x t )) ≤ − ǫ H . In the first case, from (9a), we have − m t ( s t ) ≥ ǫ g (cid:26) ǫ g K H , ∆ t (cid:27) ≥ ǫ g (cid:26) ǫ g K H , κ ∆ ǫ g , κ ∆ ǫ H (cid:27) ≥ b κ ∆ ǫ g min { ǫ g , ǫ H } , where κ ∆ is as defined in Lemma 4. Similarly, in the second case, from (9b), we obtain − m t ( s t ) ≥ ν | λ min ( H t ) | ∆ t ≥ νκ ǫ H min { ǫ g , ǫ H } = e κ ∆ ǫ H min { ǫ g , ǫ H } . F ( x t ) is monotonically decreasing, we have F ( x ) − F min ≥ ∞ X t =0 F ( x t ) − F ( x t +1 ) ≥ X t ∈T succ F ( x t ) − F ( x t +1 ) ≥ η X t ∈T succ min (cid:8)b κ ∆ ǫ g min { ǫ g , ǫ H } , e κ ∆ ǫ H min { ǫ g , ǫ H } (cid:9) ≥ |T succ | η min { b κ ∆ , e κ ∆ } min { ǫ g ǫ H , ǫ H } . Hence, we have |T succ | ≤ ( F ( x ) − F min ) max { ǫ − g ǫ − H , ǫ − H } / ( η min { b κ ∆ , e κ ∆ } ).Now we are ready to present the final complexity in Theorem 1. Theorem 1 (Optimal Complexity of Algorithm 1) . Consider any ǫ g , ǫ H > such that ǫ H ≤ √ ǫ g and let ǫ , α (1 − η ) νǫ H for some α ∈ (0 , where η is a hyper-parameter inAlgorithm 1, and ν is as in (9b) . Suppose the inexact Hessian, H ( x ) , satisfies Condition1 with the approximation tolerance, ǫ t , in (3a) as ǫ t ≤ max { ǫ , ∆ t } , (11) where ∆ t is the trust region at the t-th iteration. For Problem ( P0 ) , under Assumption1 and Condition 2, Algorithm 1 terminates after at most T ∈ O (cid:0) max { ǫ − g ǫ − H , ǫ − H } (cid:1) iterations.Proof. Suppose Algorithm 1 terminates at the t-th iteration. Let T succ and T fail denote thesets of all the successful and unsuccessful iterations, respectively. Then T = |T succ | + |T fail | and ∆ T = ∆ γ |T succ |− | T fail | , where γ is a hyper-parameter of Algorithm 1. From Lemma 4,we have ∆ T ≥ κ ∆ min { ǫ g , ǫ H } . Hence, ( |T succ | − |T fail | ) log γ ≥ log ( κ ∆ · min { ǫ g , ǫ H } / ∆ ),which implies |T fail | ≤ log (∆ / ( κ ∆ · min { ǫ g , ǫ H } )) / log γ + |T succ | . Combine the resultfrom Lemma 5, we have the total iteration complexity as T ≤ γ log (cid:18) ∆ κ ∆ · min { ǫ g , ǫ H } (cid:19) + 2( F ( x ) − F min ) η min { b κ ∆ , e κ ∆ } · max { ǫ − g ǫ − H , ǫ − H }∈ O (cid:0) max { ǫ − g ǫ − H , ǫ − H } (cid:1) , where κ ∆ , b κ ∆ , e κ ∆ are defined in the proofs of Lemmas 4 and 5, respectively.As it can be seen, the worst-case total number of iterations required by Algorithm 1before termination, matches the optimal iteration complexity obtained in [11]. 
Further-more, from (3a), it follows that upon termination of Algorithm 1 after T iterations, inaddition to k∇ F ( x T ) k ≤ ǫ g , we have λ min ( ∇ F ( x T )) ≥ − ( ǫ H + ǫ T ), i.e., the obtainedsolution satisfies ( ǫ g , ǫ T + ǫ H )-Optimality as in (1).For Algorithm 1, the Hessian approximation tolerance ǫ t is allowed to be chosen per-iteration as ǫ t ≤ O (max { ǫ H , ∆ t } ). This way, when ∆ t is large (e.g., at the beginning ofiterations), one can employ crude Hessian approximations. As iterations progress towardsoptimality, ∆ t can get very small, in which case Hessian accuracy is set in the order of ǫ H . Note that by Lemma 4, we are always guaranteed to have ∆ t ∈ Ω (min { ǫ g , ǫ H } ). Asa result, when ǫ g ≪ ǫ H , e.g., ǫ H = ǫ g = ǫ , we can have that ∆ t ≪ ǫ H . In such cases,the choice ǫ t ≤ O (max { ǫ H , ∆ t } ) ensures that the Hessian approximation tolerance nevergets unnecessarily too small. 12 .2 Adaptive Cubic Regularization with Inexact Hessian Similar to Section 2.1, in this section, we present the algorithm and its correspondingconvergence results for the case of adaptive cubic regularization with inexact Hessian. Inparticular, Algorithm 2 depicts a variant of ARC algorithm where at each iteration t , theinexact approximation, H t , is constructed according to Condition 1. Here, unlike Section2.1, we were unable to provide convergence guarantees with adaptive tolerance in (3a)and as result, ǫ is set fixed a priori to a sufficiently small value, i.e., ǫ ∈ O ( √ ǫ g , ǫ H ) toguarantee ( ǫ g , ǫ H )-optimality. Algorithm 2
Adaptive Cubic Regularization with Inexact Hessian Input:
Starting point x , initial regularization 0 < σ < ∞ , hyper-parameters ǫ g , ǫ H , η ∈ (0 , , γ > for t = 0 , , . . . do Set the approximating Hessian, H t , as in (3) if k∇ F ( x t ) k ≤ ǫ g , λ min ( H t ) ≥ − ǫ H then Return x t . end if Solve the sub-problem approximately s t ≈ arg min s ∈ R d m t ( s ) , h∇ F ( x t ) , s i + 12 h s , H t s i + σ t k s k (12) Set ρ t , F ( x t ) − F ( x t + s t ) − m t ( s t ) if ρ t ≥ η then x t +1 = x t + s t σ t +1 = σ t /γ else x t +1 = x t σ t +1 = γσ t end if end for Output: x t Similar to Algorithm 1, here we also require that the sub-problem (12) in Algorithm 2is solved only approximately. Although similar inexact solutions to the sub-problem (12)by using Cauchy and Eigenpoint has been considered in several previous work, e.g., [11],here we provide refined conditions which prove to be instrumental in obtaining iterationcomplexities with the relaxed Hessian approximation (3a), as opposed to the strongerCondition (5c).
Condition 3 (Sufficient Descent Cauchy & Eigen Directions) . Assume that we solve the ub-problem (12) approximately to find s t such that − m t ( s t ) ≥ − m t ( s Ct ) ≥ max ( k s Ct k (cid:18)q K H + 4 σ t k∇ F ( x t ) k − K H (cid:19) , k∇ F ( x t ) k √ ( k∇ F ( x t ) k K H , s k∇ F ( x t ) k σ t )) , (13a) − m t ( s t ) ≥ − m t ( s Et ) ≥ ν | λ min ( H t ) | (cid:26) k s Et k , ν | λ min ( H t ) | σ t (cid:27) , if λ min ( H t ) < . (13b) Here m t ( · ) is defined in (12) , s Ct (Cauchy point) is along negative gradient direction and s Et is along approximate negative curvature direction such that h s Et , H t s Et i ≤ νλ min ( H t ) k s Et k < for some ν ∈ (0 , (see Appendix B for a way to efficiently compute s Et ). Note that Condition (13) describes the quality of the descent obtained by Cauchy andEigen directions more accurately than is usually found in similar literature. A naturalway to ensure that the approximate solution to the sub-problem (12) satisfies (13), isby replacing the unconstrained high-dimensional sub-problem (12) with the followingconstrained but lower-dimensional problem, in which the search space is reduced to atwo-dimensional sub-space containing vectors s Ct , and s Et , i.e., s t = arg min s ∈ Span { s Ct , s Et } h∇ F ( x t ) , s i + 12 h s , H t s i + σ t k s k . Note that, if U ∈ R d × p is an orthogonal basis for the sub-space “Span { s Ct , s Et } ”, by alinear transformation, we can turn the above sub-problem into an unconstrained problemas v t = arg min v ∈ R p h U T ∇ F ( x t ) , v i + 12 h v , U T H t Uv i + σ t k v k , and set s t = Uv t . As before, any larger dimensional sub-space P for which we haveSpan { s Ct , s Et } ⊆ P would also ensure (13), and, indeed, implies a more accurate solutionto our original sub-problem (12).Lemmas 6 and 7 describe the model reduction obtained by Cauchy and eigen pointsas required by Condition (3). Lemma 6 (Descent with Cauchy Direction) . Consider the Cauchy direction as s Ct = − α ∇ F ( x t ) where α = arg min b α ≥ m t ( − b α ∇ F ( x t )) . We have − m t ( s Ct ) ≥ max ( k s Ct k (cid:18)q K H + 4 σ t k∇ F ( x t ) k − K H (cid:19) , k∇ F ( x t ) k √ ( k∇ F ( x t ) k K H , s k∇ F ( x t ) k σ t )) . Proof.
For any b α ≥
0, we have m t ( − b α ∇ F ( x t )) ≤ m t ( b α ∇ F ( x t )), which implies α =arg min b α ∈ R m t ( − b α ∇ F ( x t )). Hence, we have −k∇ F ( x t ) k + α h∇ F ( x t ) , H t ∇ F ( x t ) i + σ t α k∇ F ( x t ) k = 0. We can find explicit formula for such α by finding the roots of14he quadratic function r ( α ) = σ t k∇ F ( x t ) k α + h∇ F ( x t ) , H t ∇ F ( x t ) i α − k∇ F ( x t ) k .Hence, we must have α = −h∇ F ( x t ) , H t ∇ F ( x t ) i + q(cid:0) h∇ F ( x t ) , H t ∇ F ( x t ) i (cid:1) + 4 σ t k∇ F ( x t ) k σ t k∇ F ( x t ) k ≥ . It follows that2 ασ t k∇ F ( x t ) k = s(cid:18) h∇ F ( x t ) , H t ∇ F ( x t ) ik∇ F ( x t ) k (cid:19) + 4 σ t k∇ F ( x t ) k − h∇ F ( x t ) , H t ∇ F ( x t ) ik∇ F ( x t ) k . Consider the function h ( x ; β ) = p x + β − x . It is easy to verify that, for β ≥ h ( x ) isdecreasing function of x . Now since h∇ F ( x t ) , H t ∇ F ( x t ) i ≤ K H k∇ F ( x t ) k , we get k s Ct k = α k∇ F ( x t ) k ≥ σ t (cid:20)q K H + 4 σ t k∇ F ( x t ) k − K H (cid:21) . (14)Now, from [11, Lemma 2.1], we get − m t ( s Ct ) ≥ σ t k s Ct k k s Ct k ασ t k∇ F ( x t ) k ≥ k s Ct k
12 ( q K H + 4 σ t k∇ F ( x t ) k − K H ) . Alternatively, following the proof of [9, Lemma 2.1], for any α ≥
0, we get m t ( s Ct ) ≤ m t ( − α ∇ F ( x t ))= − α k∇ F ( x t ) k + 12 α h∇ F ( x t ) , H t ∇ F ( x t ) i + α σ t k∇ F ( x t ) k ≤ α k∇ F ( x t ) k (cid:0) − αK H + 2 α σ t k∇ F ( x t ) k (cid:1) . Consider the quadratic polynomial r ( α ) = 2 α σ t k∇ F ( x t ) k +3 αK H −
6. We have r ( α ) ≤ α ∈ [0 , ¯ α ], where¯ α = − K H + p K H + 48 σ t k∇ F ( x t ) k σ t k∇ F ( x t ) k = 12 (cid:16) K H + p K H + 48 σ t k∇ F ( x t ) k (cid:17) . Note that p K H + 48 σ t k∇ F ( x t ) k ≤ √ n K H , p σ t k∇ F ( x t ) k o and trivially 3 K H ≤ √ n K H , p σ t k∇ F ( x t ) k o . Hence, defining α , / ( √ { K H , p σ t k∇ F ( x t ) k} ),it is easy to see that 0 < α ≤ ¯ α . With this α , we get r ( α ) ≤ / / √ − ≤ − . Therefore m t ( s t ) ≤ − k∇ F ( x t ) k √ n K H , p σ t k∇ F ( x t ) k o = −k∇ F ( x t ) k √ k∇ F ( x t ) k K H , s k∇ F ( x t ) k σ t . emma 7 (Descent with Negative Curvature) . Suppose λ min ( H t ) < . For some ν ∈ (0 , , define s Et = α u t , where α = arg min b α ∈ R m t ( b α u t ) , and h u t , H t u t i ≤ νλ min ( H t ) k u t k < . We have − m t ( s Et ) ≥ ν | λ min ( H t ) | (cid:26) k s Et k , ν | λ min ( H t ) | σ t (cid:27) . Proof.
By the first-order necessary optimality condition of α , we get h∇ F ( x t ) , u t i + α h u t , H t u t i + σ t α k u t k = 0, which implies h∇ F ( x t ) , s Et i + h s Et , H t s Et i + σ t k s Et k = 0.Next, since α is a minimizer of m t ( b α u t ), we have m t ( α u t ) ≤ m t ( − α u t ), which implies h∇ F ( x t ) , s Et i ≤
0. Hence, we also obtain h s Et , H t s Et i + σ t k s Et k ≥
0. From [11, Lemma 2.1],we get − m t ( s Et ) ≥ σ t k s Et k / (cid:0) −h∇ F ( x t ) , s Et i − h s Et , H t s Et i (cid:1) / ≥ ν | λ min ( H t ) |k s Et k / σ t k s Et k ≥ − h s Et , H t s Et ik s Et k ≥ ν | λ min ( H t ) | , (15)which gives σ t k s Et k ≥ ν | λ min ( H t ) |k s Et k and σ t k s Et k ≥ ν σ − t | λ min ( H t ) | . Hence, we have − m t ( s Et ) ≥ σ t k s Et k / ≥ ν | λ min ( H t ) |k s Et k / − m t ( s Et ) ≥ σ t k s Et k / ≥ ν σ − t | λ min ( H t ) | / Lemma 8.
Given Assumption 1 and Condition 1, we have F ( x t + s t ) − F ( x t ) − m t ( s t ) ≤ (cid:18) L − σ t (cid:19) k s t k + ǫ k s t k . Proof.
Apply Mean Value Theorem on F at x t gives F ( x t + s t ) = F ( x t ) + ∇ F ( x t ) T s t + s Tt ∇ F ( ξ t ) s t , for some ξ t in the segment of [ x t , x t + s t ]. Now, it follows that F ( x t + s t ) − F ( x t ) − m t ( s t ) = 12 s Tt ( ∇ F ( ξ t ) − H t ) s t − σ t k s t k = 12 s Tt ( ∇ F ( ξ t ) − ∇ F ( x t ) + ∇ F ( x t ) − H t ) s t − σ t k s t k ≤ s Tt ( ∇ F ( ξ t ) − ∇ F ( x t )) s t + 12 s Tt ( ∇ F ( x t ) − H t ) s t − σ t k s t k ≤ L k s t k + 12 ǫ k s t k − σ t k s t k ≤ (cid:18) L − σ t (cid:19) k s t k + ǫ k s t k . Lemma 9.
Given Assumption 1, Conditions 1 and 3, suppose σ t ≥ L, ǫ ≤ min (cid:26) (cid:18)q K H + 8 Lǫ g − K H (cid:19) , νǫ H γ (cid:27) . Then, we have (cid:18) L − σ t (cid:19) k s t k + ǫ k s t k ≤ ǫ k s Ct k , ǫ k s Et k , If λ min ( H t ) ≥ − ǫ H . roof. First consider k s Ct k for which we have two cases. i. If k s t k ≤ k s Ct k , then from assumption on σ t , it immediately follows that (cid:18) L − σ t (cid:19) k s t k + ǫ k s t k ≤ ǫ k s t k ≤ ǫ k s Ct k . ii. If k s t k ≥ k s Ct k , since L ≤ σ t /
2, then (cid:18) L − σ t (cid:19) k s t k + ǫ k s t k ≤ − σ t k s t k + ǫ k s t k ≤ (cid:16) − σ t (cid:13)(cid:13) s Ct (cid:13)(cid:13) + ǫ (cid:17) k s t k ≤ − p K H + 8 Lǫ g − K H
24 + ǫ ! k s t k ≤ ≤ ǫ (cid:13)(cid:13) s Ct (cid:13)(cid:13) . The second last inequality follows from (14).Similarly, for k s Et k , we have two cases. i. If k s t k ≤ k s Et k , then from assumption on σ t , it immediately follows that (cid:18) L − σ t (cid:19) k s t k + ǫ k s t k ≤ ǫ k s t k ≤ ǫ k s Et k . ii. If k s t k ≥ k s Et k , since L ≤ σ t /
2, then (cid:18) L − σ t (cid:19) k s t k + ǫ k s t k ≤ − σ t k s t k + ǫ k s t k ≤ − σ t (cid:13)(cid:13) s Et (cid:13)(cid:13) k s t k + ǫ k s t k ≤ − νǫ H k s t k + ǫ k s t k ≤ < ǫ (cid:13)(cid:13) s Et (cid:13)(cid:13) . The second last inequality follows from (15) and the last line follows from ǫ ≤ νǫ H . Lemma 10.
Given Assumption 1, Conditions 1 and 3, suppose at the t-th iteration, λ min ( H t ) < − ǫ H , σ t ≥ L , and ǫ ≤ min { / , (1 − η ) / } νǫ H . Then, the t-th iteration issuccessful, i.e. σ t +1 = σ t /γ .Proof. From (13b), Lemma 8, Lemma 9, as well as assumptions on σ t and ǫ , we have1 − ρ t = F ( x t + s t ) − F ( x t ) − m t ( s t ) − m t ( s t ) ≤ ( L/ − σ t / k s t k + ǫ k s t k / ν | λ min ( H t ) |k s Et k / ≤ ǫ k s Et k ν | λ min ( H t ) |k s Et k ≤ ǫνǫ H ≤ − η. Hence, ρ t ≥ η , and the iteration is successful. Lemma 11.
Given Assumption 1, Conditions 1 and 3, suppose at the t-th iteration, k∇ F ( x t ) k ≥ ǫ g , σ t ≥ L , and ǫ ≤ min (cid:26) , − η (cid:27) (cid:18)q K H + 8 Lǫ g − K H (cid:19) . Then, the t-th iteration is successful, i.e. σ t +1 = σ t /γ . roof. First note that, from (13a), we have − m t ( s t ) ≥ − m t ( s Ct ) ≥ k s Ct k (cid:18)q K H + 4 σ t k∇ F ( x t ) k − K H (cid:19) . Hence, again, by (13a), Lemma 8 and 9, it follows that1 − ρ t = F ( x t + s t ) − F ( x t ) − m t ( s t ) − m t ( s t ) ≤ (cid:0) L − σ t (cid:1) k s t k + ǫ k s t k − m t ( s Ct ) ≤ ǫ k s Ct k k s Ct k (cid:16)p K H + 4 σ t k∇ F ( x t ) k − K H (cid:17) ≤ ǫ (cid:16)p K H + 4 σ t k∇ F ( x t ) k − K H (cid:17) ≤ ǫ (cid:16)p K H + 8 Lǫ g − K H (cid:17) ≤ − η. Hence, ρ t ≥ η , and the iteration is successful.Now we can upper bound the cubic regularization parameter before the algorithmterminates, as in Lemma 12. Lemma 12.
Consider Assumption 1, Conditions 1 and 3, and ǫ ≤ min (cid:26) min (cid:26) , − η (cid:27) (cid:18)q K H + 8 Lǫ g − K H (cid:19) , min (cid:26) , − η (cid:27) νǫ H (cid:27) , (16) where ν, L, K H are, respectively, defined as in (13b) , (6a) , (3b) , and η is a hyper-parameter of Algorithm 2. For Algorithm 2 we have for all t , σ t ≤ max { σ , γL } .Proof. We prove by contradiction. Assume the t-th iteration is the first unsuccessfuliteration such that σ t +1 = γσ t ≥ γL , which implies that σ t ≥ L . However, according toLemmas 10 and 11, respectively, if λ min ( H t ) < − ǫ H or k∇ F ( x t ) k ≥ ǫ g , then the iterationis successful and hence we must have σ t +1 = σ t /γ ≤ σ t , which is a contradiction.Now, similar to [11, Lemma 2.8], we can get the following result about the estimateof the total number of successful iterations before algorithm terminates. Lemma 13 (Success Iterations) . Given Assumption 1, Conditions 1 and 3, let T succ denote the set of all the successful iterations before Algorithm 2 stops. The number ofsuccessful iterations is upper bounded by, |T succ | ≤ ( F ( x ) − F min ) ηκ σ · max { ǫ − g , ǫ − H } , where κ σ , min n ν / (24 γ L ) , min n /K H , p / (2 γL ) o / (2 √ o .Proof. Suppose Algorithm 2 doesn’t terminate at the t-th iteration. Then either we have k∇ F ( x t ) k ≥ ǫ g or λ min ( ∇ H t ) ≤ − ǫ H . In the first case, (13a) and Lemma 12 gives − m t ( s t ) ≥ k∇ F ( x t ) k √ k∇ F ( x t ) k K H , s k∇ F ( x t ) k σ t ≥ ǫ g √ (cid:26) K H , r γL (cid:27) . λ min ( ∇ H t ) ≤ − ǫ H , from (13b) and Lemma 12, we obtain − m t ( s t ) ≥ ν | λ min ( H t ) | / (6 σ t ) ≥ ν ǫ H / (24 γ L ).Since F ( x t ) is monotonically decreasing, we have F ( x ) − F min ≥ ∞ X t =0 F ( x t ) − F ( x t +1 ) ≥ X t ∈T succ F ( x t ) − F ( x t +1 ) ≥ − η X t ∈T succ m t ( s t ) ≥ η |T succ | min (cid:26) ν ǫ H γ L , ǫ g √ (cid:26) K H , r γL (cid:27)(cid:27) ≥ |T succ | ηκ σ min { ǫ g , ǫ H } . Now we show the final complexity bounds of Algorithm 2 in Theorem 2.
Theorem 2 (Complexity of Algorithm 2) . Consider any < ǫ g , ǫ H < . Suppose theinexact Hessian, H ( x ) , satisfies Condition 1 with the approximation tolerance, ǫ , in (3a) as (16) . For Problem ( P0 ) , under Assumption 1 and Condition 3, Algorithm 2 terminatesafter at most T ∈ O (cid:0) max { ǫ − g , ǫ − H } (cid:1) iterations.Proof. Suppose Algorithm 2 terminates at the t-th iteration. Let T succ and T fail denotethe sets of all the successful and unsuccessful iterations, respectively. Then T = |T succ | + |T fail | and σ T = σ γ | T fail | −|T succ | . From Lemma 12, we have σ T ≤ γL . Hence, |T fail | ≤ log (2 γL/σ ) / log γ + |T succ | , which, using Lemma 13 gives the total iteration complexityas T ≤ log (2 γL/σ ) / log γ + 2( F ( x ) − F min ) · max { ǫ − g , ǫ − H } / ( ηκ σ ) , where κ σ is defined in Lemma 13.In Theorem 2 (as well as Theorem 3 below), we require ǫ ∈ O ( √ ǫ g , ǫ H ). This can berather strict and computationally unattractive, unless either crude solutions are required(e.g., in most machine learning applications very rough solutions are encouraged to avoidover-fitting), or the inexact Hessian is formed from a sub-set of data that is significantlysmaller than the original dataset (e.g., see Section 3 in the context of big-data regimeswhere n ≫ |S| ≪ n ). Nonetheless, the theoretical existence of such tolerance,though small, implies a certain level of robustness of the algorithm, i.e., the complexityof the algorithm is not adversely affected by small errors in Hessian computations.We note that, for iterations where ǫ ≪ k s t k , (3a) is indeed a more stringent conditionthan (5c). As iterations progress towards optimality, step-size can become small, in whichcase (3a) might be theoretically more preferable. Nonetheless, beyond a direct theoreticalcomparison among various Hessian approximation bounds in terms of their tightness, themain advantage of (3a) should be regarded in light of its simplicity, which allows fordirect constructions of H t with a priori guarantees.Condition 3 seems to be the bare minimum required to guarantee convergence to anapproximate second-order criticality. Intuitively, however, if an approximate solution tothe sub-problem (12) satisfies more than (13), i.e., if we solve (12) more exactly than justrequiring (13), one could expect to be able to improve upon the iteration complexity ofTheorem 2. Indeed, suppose we solve the reduced sub-problem on progressively embeddedsub-spaces with increasingly higher dimensions, all of which including “Span { s Ct , s Et } ”,and stop when the corresponding solution s t satisfies the following conditions.19 ondition 4 (Sufficient Descent for Optimal Complexity) . Assume that we solve thesub-problem (12) approximately to find s t such that, in addition to (13) , we have k∇ m t ( s t ) k ≤ ζ max (cid:8) k s t k , θ t k∇ F ( x t ) k (cid:9) , θ t , min { , k s t k} , (17) for some prescribed ζ ∈ (0 , . Here, m t ( · ) is defined in (12) . Conditions on the inexactness of the sub-problems were initially pioneered in [9, 10,11]. However, the main drawback for these conditions is that the inexactness toleranceis closely tied with the magnitude of the gradient. More specifically, when gradient issmall, e.g., near saddle points, the sub-problems are required to be solved exceedinglymore accurately. In fact, at a saddle point where k∇ F ( x t ) k = 0, these conditions imply anexact solution to the sub-problem. 
To the best of our knowledge, Condition 4 representsa novel criterion, which offers the best of both worlds: when gradient is large, we allow forcrude solutions to the sub-problem, but near saddle-points where the gradient is small,inexactness will be determined by the step length, which can be significantly larger thanthe gradient. Using Condition 4, we can obtain the optimal iteration complexity forAlgorithm 2, as shown in Theorem 3. First, we prove the following two lemmas whichwill be used later for the proof of Theorem 3. Lemma 14.
Suppose k∇ F ( x t ) k ≥ ǫ g . Given Assumption 1 and Condition 3, let (3a) hold with ǫ t = min { ǫ, ζ k∇ F ( x t ) k} where ǫ is as in (16) and ζ ∈ (0 , / . Furthermore,suppose (12) is solved such that Condition 4 eventually holds. Then, we have k s t k ≥ κ g p k∇ F ( x t +1 ) k , where κ g , − ζ )((1 + 4 γ ) L + 2 max { ( ǫ + ζ max { , K } ) , ζ max { , K }} ) . Proof.
First, suppose k s t k ≤ θ t k∇ F ( x t ) k . Using Condition 4, we get k∇ F ( x t +1 ) k ≤k∇ F ( x t +1 ) − ∇ m t ( s t ) k + k∇ m t ( s t ) k ≤ k∇ F ( x t +1 ) − ∇ m t ( s t ) k + θ t k∇ F ( x t ) k . Notingthat ∇ m t ( s t ) = ∇ F ( x t ) + H t s t + σ t k s t k s t , and using Mean Value Theorem for vector-valued functions, (6a) and (3a), we get k∇ F ( x t +1 ) − ∇ m t ( s t ) k ≤ k Z ∇ F ( x t + τ s t ) s t dτ − H t s t k + σ t k s t k ≤ k Z (cid:0) ∇ F ( x t + τ s t ) − ∇ F ( x t ) (cid:1) s t dτ + (cid:0) ∇ F ( x t ) − H t (cid:1) s t k + σ t k s t k ≤ k s t k Z k∇ F ( x t + τ s t ) − ∇ F ( x t ) k dτ + k (cid:0) ∇ F ( x t ) − H t (cid:1) s t k + σ t k s t k ≤ L k s t k Z τ dτ + ǫ t k s t k + σ t k s t k ≤ (cid:18) L γL (cid:19) k s t k + ǫ t k s t k , where the last equality follows from Lemma 12. From (6b), it follows that k∇ F ( x t ) k ≤ K k s t k + k∇ F ( x t +1 ) k . (18)As such, using θ t ≤ ζ from Condition 4 as well as the assumption on ǫ t , we get k∇ F ( x t +1 ) k ≤ (cid:18) L γL (cid:19) k s t k + ǫ t k s t k + θ t K k s t k + θ t k∇ F ( x t +1 ) k≤ (cid:18) L γL (cid:19) k s t k + ǫ t k s t k + θ t K k s t k + ζ k∇ F ( x t +1 ) k , − ζ ) k∇ F ( x t +1 ) k ≤ ( L/ γL ) k s t k + ( ǫ t + θ t K ) k s t k . Now usingCondition 4, we consider two cases: i. If k s t k ≥
1, then we get ( ǫ t + θ t K ) k s t k ≤ ( ǫ t + θ t K ) k s t k ≤ ( ǫ + ζ K ) k s t k . Hence, itfollows that (1 − ζ ) k∇ F ( x t +1 ) k ≤ ( L/ γL + ( ǫ + ζ K )) k s t k . ii. If k s t k ≤
1, then from assumption on ǫ t and (18) , we have ǫ t k s t k ≤ ζ k∇ F ( x t ) kk s t k ≤ ζ ( K k s t k + k∇ F ( x t +1 ) kk s t k ) ≤ ζ ( K k s t k + k∇ F ( x t +1 ) k ). Now by assumption on θ t ,we get ( ǫ t + θ t K ) k s t k = ǫ t k s t k + θ t K k s t k ≤ ζ K k s t k + ζ k∇ F ( x t +1 ) k , which, in turn,implies that (1 − ζ ) k∇ F ( x t +1 ) k ≤ ( L/ γL + 2 ζ K ) k s t k .Now suppose, k s t k ≥ θ t k∇ F ( x t ) k . As above, we have k∇ F ( x t +1 ) k ≤ k∇ F ( x t +1 ) −∇ m t ( s t ) k + k∇ m t ( s t ) k ≤ ( L/ γL + ζ ) k s t k + ǫ t k s t k . If k s t k ≥
1, we have ǫ t k s t k ≤ ǫ k s t k , which gives k∇ F ( x t +1 ) k ≤ ( L/ γL + ζ + ǫ ) k s t k . Otherwise, if k s t k ≤ k s t k ≥ θ t k∇ F ( x t ) k implies that k s t k ≥ k∇ F ( x t ) k . From assumption on ǫ t ,it follows that ǫ t k s t k ≤ ζ k∇ F ( x t ) kk s t k ≤ ζ k s t k , which in turn gives k∇ F ( x t +1 ) k ≤ ( L/ γL + 2 ζ ) k s t k . Lemma 15 (Success Iterations: Optimal Case) . Let T succ , { t ; k∇ F ( x t ) k ≥ ǫ g ∨ λ min ( H t ) ≤ − ǫ H } , be the set of all successful iterations, before Algorithm 2 terminates. Under the conditionsof Lemma 14, we must have |T succ | ∈ O (max { ǫ − H , ǫ − / g } ) .Proof. From (13b) and Lemma 12, if λ min ( ∇ H t ) ≤ − ǫ H , it follows that − m t ( s t ) ≥ ν | λ min ( H t ) | / (6 σ t ) ≥ ν ǫ H / (24 γ L ). Note that T succ = T S T S T , where T , { t ∈ T succ ; k∇ F ( x t +1 ) k ≥ ǫ g } , T , { t ∈ T succ ; k∇ F ( x t +1 ) k ≤ ǫ g and λ min ( H t +1 ) ≤ − ǫ H }T , { t ∈ T succ ; k∇ F ( x t +1 ) k ≤ ǫ g and λ min ( H t +1 ) ≥ − ǫ H } . We bound each of these sets individually. Since F ( x t ) is monotonically decreasing, from[9, Lemma 3.3], σ t ≥ σ min , and Lemmas 12 and 14, we have F ( x ) − F min ≥ ∞ X t =0 F ( x t ) − F ( x t +1 ) ≥ X t ∈T F ( x t ) − F ( x t +1 ) ≥ − η X t ∈T m t ( s t ) ≥ η X t ∈T min (cid:26) ν ǫ H γ L , σ min k s t k (cid:27) ≥ η X t ∈T min (cid:26) ν ǫ H γ L , σ min κ g k∇ F ( x t +1 ) k / (cid:27) ≥ η X t ∈T min (cid:26) ν ǫ H γ L , σ min κ g ǫ / g (cid:27) ≥ η X t ∈T min (cid:26) ν γ L , σ min κ g (cid:27) min { ǫ H , ǫ / g } . Hence, |T | ≤ κ T succ max { ǫ − H , ǫ − / g } , where κ T succ , ( F ( x ) − F min ) max { γ L /ν , / ( σ min κ g ) } /η.
21s for T , we have F ( x ) − F min ≥ F ( x ) − F ( x ) + ∞ X t =0 F ( x t +1 ) − F ( x t +2 ) ≥ F ( x ) − F ( x ) + X t ∈T F ( x t +1 ) − F ( x t +2 ) ≥ F ( x ) − F ( x ) − η X t ∈T m t +1 ( s t +1 ) ≥ F ( x ) − F ( x ) + η X t ∈T ν ǫ H γ L . Hence, |T | ≤ κ T succ ǫ − H , where κ T succ , ( F ( x ) − F min )24 γ L / ( ην ). Finally, we have |T | = 1, because in such a case, the algorithm stops in one iteration. Putting thesebounds all together, we get |T succ | ≤ max { , κ T succ , κ T succ } max { ǫ − H , ǫ − / g } .Now we can obtain the optimal complexity bound of Algorithm 2 in Theorem 3. Theproof follows similarly as that of Theorem 2, and hence is omitted here. Theorem 3 (Optimal Complexity of Algorithm 2) . Consider any < ǫ g , ǫ H < . Supposethe inexact Hessian, H ( x ) , satisfies Conditions (3) with the approximation tolerance, ǫ ,in (3a) as ǫ = min { ǫ , ζ ǫ g } where ǫ is as in (16) , and ζ ∈ (0 , / . For Problem ( P0 ) and under Assumption 1, if the approximate solution to the sub-problem (12) satisfiesConditions 3 and 4, then Algorithm 2 terminates after at most T ∈ O (cid:16) max { ǫ − / g , ǫ − H } (cid:17) iterations. From (3a), upon termination of Algorithm 2, the obtained solution satisfies ( ǫ g , ǫ + ǫ H )-Optimality as in (1), i.e., k∇ F ( x T ) k ≤ ǫ g and λ min ( ∇ F ( x T )) ≥ − ( ǫ H + ǫ ). In this section, we give concrete and practical examples to demonstrate ways to constructthe approximate Hessian, which satisfies Condition 1. By considering finite-sum mini-mization , a ubiquitous problem arising frequently in machine learning, we showcase thepractical benefits of the proposed relaxed requirement (3a) for approximating Hessian,compared to the stronger alternative (5c). In Section 3.1, we describe randomized tech-niques to appropriately construct the approximate Hessian, followed by the convergenceanalysis of Algorithms 1 and 2 with such Hessian approximations in Section 3.2.
Indeed, a major advantage of (3a) over (5c) is that there are many approximation techniques that can produce an inexact Hessian satisfying (3a). Of particular interest in the present paper is the application of randomized matrix approximation techniques, which have recently shown great success in RandNLA for solving various numerical linear algebra tasks [22, 42, 60]. For this, we consider the highly prevalent finite-sum minimization problem ( P1 ) and employ random sampling as a way to construct approximations to the exact Hessian which are, probabilistically, guaranteed to satisfy (3a). Many machine learning and scientific computing applications involve finite-sum optimization problems of the form ( P1 ), where each $f_i$ is a loss (or misfit) function corresponding to the $i$-th observation (or measurement); e.g., see [7, 24, 47, 48, 50, 55] and references therein. Here, we consider ( P1 ) in the large-scale regime where $n, d \gg 1$.
In such settings, the mere evaluation of the Hessian and the gradient increases linearly in $n$. Indeed, for big-data problems, operations with the Hessian, e.g., the matrix-vector products involved in the (approximate) solution of the sub-problems (8) and (12), typically constitute the main computational bottleneck and, in particular when $n \gg 1$, are computationally prohibitive.
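To make this cost comparison concrete, the following minimal sketch (the interface and function names are ours and purely illustrative; `hess_fns` is assumed to be a list of callables returning the component Hessians $\nabla^2 f_i(x)$) contrasts an exact Hessian-vector product for a finite sum with a sub-sampled counterpart of the kind constructed below, which touches only $|\mathcal{S}|$ of the $n$ terms:

```python
import numpy as np

def hvp_full(hess_fns, x, v):
    """Exact Hessian-vector product for F = (1/n) * sum_i f_i:
    touches all n component Hessians."""
    n = len(hess_fns)
    return sum(h(x) @ v for h in hess_fns) / n

def hvp_subsampled(hess_fns, x, v, s, rng):
    """Unbiased sub-sampled Hessian-vector product: averages s = |S|
    uniformly drawn terms instead of all n."""
    n = len(hess_fns)
    idx = rng.integers(0, n, size=s)   # uniform sampling with replacement
    return sum(hess_fns[i](x) @ v for i in idx) / s
```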
For the special case of ( P1 ) in which each $f_i$ is convex, randomized sub-sampling has been shown to be effective in reducing such costs, e.g., [6, 49, 62]. We now show that such randomized approximation techniques can indeed be employed effectively in the non-convex settings considered in this paper.

In this light, suppose we have a probability distribution, $p = \{p_i\}_{i=1}^n$, over the set $\{1, 2, \ldots, n\}$, such that for each index $i = 1, \ldots, n$, we have $\Pr(i) = p_i > 0$ and $\sum_{i=1}^n p_i = 1$. Consider picking a sample of indices from $\{1, 2, \ldots, n\}$, at each iteration, randomly according to the distribution $p$. Let $\mathcal{S}$ and $|\mathcal{S}|$ denote the sample collection and its cardinality, respectively, and define
$$H(x) \triangleq \frac{1}{n|\mathcal{S}|}\sum_{j\in\mathcal{S}}\frac{1}{p_j}\nabla^2 f_j(x), \qquad (19)$$
to be the sub-sampled Hessian. In the big-data regime where $n \gg 1$, if $|\mathcal{S}| \ll n$, such sub-sampling can offer significant computational savings.
Now, suppose
$$\sup_{x\in\mathbb{R}^d}\big\|\nabla^2 f_i(x)\big\| \le K_i, \quad i = 1, 2, \ldots, n, \qquad (20a)$$
and define
$$K_{\max} \triangleq \max_{i=1,\ldots,n} K_i, \qquad (20b)$$
$$\hat{K} \triangleq \frac{1}{n}\sum_{i=1}^n K_i. \qquad (20c)$$
In this case, we can naturally consider the uniform distribution over $\{1, 2, \ldots, n\}$, i.e., $p_i = 1/n,\ \forall i$. Lemma 16 gives the sample size required for the inexact Hessian, $H(x)$, to probabilistically satisfy (3) when the indices are picked uniformly at random, with or without replacement.

Lemma 16 (Complexity of Uniform Sampling). Given (20a), (20b), and $0 < \epsilon, \delta < 1$, let
$$|\mathcal{S}| \ge \frac{16K_{\max}^2}{\epsilon^2}\log\frac{2d}{\delta}, \qquad (21)$$
where $K_{\max}$ is defined as in (20b). At any $x\in\mathbb{R}^d$, suppose we pick the elements of $\mathcal{S}$ uniformly at random, with or without replacement, and form $H(x)$ as in (19) with $p_i = 1/n,\ \forall i$. We have
$$\Pr\Big(\big\|H(x) - \nabla^2 F(x)\big\| \le \epsilon\Big) \ge 1 - \delta. \qquad (22)$$
Proof. Consider $|\mathcal{S}|$ random matrices $H_j(x)$, $j = 1, \ldots, |\mathcal{S}|$, such that $\Pr\big(H_j(x) = \nabla^2 f_i(x)\big) = 1/n,\ \forall i = 1, 2, \ldots, n$. Define $X_j \triangleq H_j - \nabla^2 F(x)$, $H \triangleq \sum_{j\in\mathcal{S}} H_j/|\mathcal{S}|$, and $X \triangleq \sum_{j\in\mathcal{S}} X_j = |\mathcal{S}|\big(H - \nabla^2 F(x)\big)$. Note that $\mathbb{E}(X_j) = 0$, and for $H_j = \nabla^2 f_1(x)$ we have
$$\|X_j\| = \Big\|\frac{n-1}{n}\nabla^2 f_1(x) - \sum_{i=2}^n\frac{1}{n}\nabla^2 f_i(x)\Big\| \le \frac{2(n-1)}{n}K_{\max} \le 2K_{\max}.$$
Hence, we can apply the Operator-Bernstein inequality [35, Theorem 1] to get
$$\Pr\Big(\big\|H - \nabla^2 F(x)\big\| \ge \epsilon\Big) = \Pr\big(\|X\| \ge \epsilon|\mathcal{S}|\big) \le 2d\exp\big\{-\epsilon^2|\mathcal{S}|/(16K_{\max}^2)\big\}.$$
Now (21) ensures that $2d\exp\{-\epsilon^2|\mathcal{S}|/(16K_{\max}^2)\} \le \delta$, which gives (22).

Indeed, if (22) holds, then (3a) follows with the same probability. In addition, if $H$ is constructed according to Lemma 16, it is easy to see that (3b) is satisfied with $K_H = K_{\max}$ (in fact, this is a deterministic statement). These two, together, imply that $H$ satisfies Condition 1 with probability $1-\delta$.
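As a sketch of how Lemma 16 might be used in practice (the helper names are ours, and the component Hessians are again assumed available as callables), the sample size (21) and the estimator (19) with $p_i = 1/n$ can be computed as follows:

```python
import numpy as np

def uniform_sample_size(k_max, eps, delta, d):
    """Sample size prescribed by (21): 16 K_max^2 / eps^2 * log(2d/delta)."""
    return int(np.ceil(16.0 * k_max**2 / eps**2 * np.log(2.0 * d / delta)))

def subsampled_hessian_uniform(hess_fns, x, eps, delta, k_max, rng):
    """H(x) as in (19) with p_i = 1/n; by Lemma 16, with probability
    at least 1 - delta, the spectral-norm error is at most eps."""
    n, d = len(hess_fns), x.shape[0]
    s = uniform_sample_size(k_max, eps, delta, d)
    idx = rng.integers(0, n, size=s)       # uniform, with replacement
    H = np.zeros((d, d))
    for i in idx:
        H += hess_fns[i](x)                # the 1/p_i = n factor cancels the 1/n
    return H / s
```

Note that the prescribed $|\mathcal{S}|$ is independent of $n$, which is precisely what makes this construction attractive when $n \gg 1$.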
A Special Case: In certain settings, one might be able to construct a more "informative" distribution, $p$, over the indices in the set $\{1, 2, \ldots, n\}$, as opposed to oblivious uniform sampling. In particular, it might be advantageous to bias the probability distribution towards picking indices corresponding to those $f_i$'s which are more relevant, in a certain sense, in forming the Hessian. If this is possible, then we can expect to require a smaller sample size than with oblivious uniform sampling. One such setting is the finite-sum optimization of the form ( P2 ), which is indeed a special case of ( P1 ) and arises often in many machine learning problems [51]. It is easy to see that the Hessian of $F$ in this case can be written as
$$\nabla^2 F(x) = A^T B A = \frac{1}{n}\sum_{i=1}^n f_i''(a_i^T x)\, a_i a_i^T,$$
where
$$A^T = \begin{pmatrix} | & | & & | \\ a_1 & a_2 & \cdots & a_n \\ | & | & & | \end{pmatrix} \in \mathbb{R}^{d\times n}, \qquad B = \frac{1}{n}\begin{pmatrix} f_1''(a_1^T x) & & & \\ & f_2''(a_2^T x) & & \\ & & \ddots & \\ & & & f_n''(a_n^T x)\end{pmatrix} \in \mathbb{R}^{n\times n}.$$
Now let $S \in \mathbb{R}^{n\times|\mathcal{S}|}$ be the sampling matrix and define the approximate Hessian as $H \triangleq A^T S S^T B A$. It can be seen that approximating the Hessian matrix $\nabla^2 F(x) = A^T B A$ can be regarded as approximating a matrix-matrix product, as in RandNLA [42, 60]. For this, consider the sampling distribution
$$p_i = \frac{\big|f_i''(a_i^T x)\big|\,\|a_i\|^2}{\sum_{j=1}^n \big|f_j''(a_j^T x)\big|\,\|a_j\|^2}. \qquad (23)$$
Note that the absolute values are needed since, for non-convex $f_i$, we might have $f_j''(a_j^T x) < 0$ (if $f_j''(a_j^T x) \ge 0$ for all $j$, one can obtain stronger guarantees than Lemmas 16 and 17; see [62]).
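This construction can be sketched directly. In the snippet below (the interface is ours; `d2f` is assumed to return the vector of values $f_i''(a_i^T x)$), the estimator is accumulated as a weighted sum of sampled rank-one terms, which is exactly $A^T S S^T B A$:

```python
import numpy as np

def subsampled_hessian_nonuniform(A, d2f, x, s, rng):
    """Non-uniform sub-sampled Hessian for (P2).

    A   : (n, d) data matrix with rows a_i
    d2f : vectorized callable returning f_i''(t) for a vector of t's
    s   : sample size |S|
    Samples indices from (23) and returns (1/s) * sum_j (b_j / p_j) a_j a_j^T.
    """
    n, d = A.shape
    b = d2f(A @ x) / n                              # diagonal of B
    w = np.abs(b) * np.einsum('ij,ij->i', A, A)     # |b_i| * ||a_i||^2
    p = w / w.sum()                                 # sampling distribution (23)
    idx = rng.choice(n, size=s, p=p)
    H = np.zeros((d, d))
    for i in idx:
        H += (b[i] / p[i]) * np.outer(A[i], A[i])
    return H / s
```

Each sampled term $b_i a_i a_i^T / p_i$ has expectation $A^T B A$, so the estimator is unbiased by construction.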
Using the non-uniform sampling distribution (23), Lemma 17 gives the sampling complexity for the approximate Hessian of ( P2 ) to, probabilistically, satisfy (3).

Lemma 17 (Complexity of Non-Uniform Sampling). Given (20a), (20c), and $0 < \epsilon, \delta < 1$, let
$$|\mathcal{S}| \ge \frac{4\hat{K}^2}{\epsilon^2}\log\frac{2d}{\delta}, \qquad (24)$$
where $\hat{K}$ is defined as in (20c). At any $x\in\mathbb{R}^d$, suppose we pick the elements of $\mathcal{S}$ randomly according to the probability distribution (23) and form $H(x)$ as in (19). We have
$$\Pr\Big(\big\|H - \nabla^2 F(x)\big\| \le \epsilon\Big) \ge 1 - \delta. \qquad (25)$$
Proof. Define $B = \mathrm{diag}\{f_1''(a_1^T x)/n, \cdots, f_n''(a_n^T x)/n\} \in \mathbb{R}^{n\times n}$. Let $S\in\mathbb{R}^{n\times|\mathcal{S}|}$ be the sampling matrix and define $H \triangleq A^T S S^T B A$. Further, let the diagonal entries of $B$ be denoted by $b_i$ and define $c \triangleq \sum_{i=1}^n |b_i|\,\|a_i\|^2$. Consider $|\mathcal{S}|$ random matrices $H_j$ such that $\Pr\big(H_j = b_i a_i a_i^T/p_i\big) = p_i,\ \forall j = 1, 2, \ldots, |\mathcal{S}|$, where $p_i = |b_i|\,\|a_i\|^2/\big(\sum_{j=1}^n |b_j|\,\|a_j\|^2\big)$. Define
$$X_j \triangleq H_j - A^T B A, \qquad H \triangleq \frac{1}{|\mathcal{S}|}\sum_{j=1}^{|\mathcal{S}|} H_j, \qquad X \triangleq \sum_{j=1}^{|\mathcal{S}|} X_j = |\mathcal{S}|\big(H - A^T B A\big).$$
Note that $\mathbb{E}[X_j] = \sum_{i=1}^n p_i\big(b_i a_i a_i^T/p_i - A^T B A\big) = 0$, and, since $\mathbb{E}[H_j] = A^T B A$,
$$\mathbb{E}[X_j^2] = \mathbb{E}[H_j^2] + (A^T B A)^2 - \mathbb{E}[H_j]A^T B A - A^T B A\,\mathbb{E}[H_j] = \mathbb{E}[H_j^2] - (A^T B A)^2 \preceq \mathbb{E}[H_j^2]$$
$$= \sum_{i=1}^n p_i\Big(\frac{b_i}{p_i}a_i a_i^T\Big)^2 = \sum_{i=1}^n \frac{b_i^2\|a_i\|^2}{p_i}a_i a_i^T = \Big(\sum_{j=1}^n |b_j|\,\|a_j\|^2\Big)\sum_{i=1}^n |b_i|\, a_i a_i^T = c\, A^T|B|A.$$
So we have $\big\|\mathbb{E}[X_j^2]\big\| \le c\,\big\|A^T|B|A\big\|$. Now we can apply the Operator-Bernstein inequality [35, Theorem 1] to get
$$\Pr\big(\|H - A^T B A\| \ge \epsilon\big) = \Pr\big(\|X\| \ge \epsilon|\mathcal{S}|\big) \le 2d\, e^{-\epsilon^2|\mathcal{S}|/(4c\|A^T|B|A\|)}.$$
Since
$$c = \sum_{i=1}^n |b_i|\,\|a_i\|^2 = \frac{1}{n}\sum_{i=1}^n |f_i''|\,\|a_i\|^2 \le \frac{1}{n}\sum_{i=1}^n K_i = \hat{K}$$
and
$$\big\|A^T|B|A\big\| = \Big\|\frac{1}{n}\sum_{i=1}^n |f_i''|\, a_i a_i^T\Big\| \le \frac{1}{n}\sum_{i=1}^n \big\||f_i''|\, a_i a_i^T\big\| \le \frac{1}{n}\sum_{i=1}^n K_i = \hat{K},$$
we have $\Pr\big(\|H - A^T B A\| \ge \epsilon\big) \le 2d\,e^{-\epsilon^2|\mathcal{S}|/(4\hat{K}^2)}$, which gives the desired result.

The bound in (24) can be improved by replacing the dimension $d$ with a smaller quantity, known as the intrinsic dimension; see Appendix A. As can be seen from (20b) and (20c), since $\hat{K} \le K_{\max}$, the sampling complexity given by Lemma 17 always provides a smaller sample size than that prescribed by Lemma 16. Indeed, the advantage of non-uniform sampling is most pronounced when the distribution of the $K_i$'s is highly skewed, i.e., a few large ones and many small ones, in which case we can have $\hat{K} \ll K_{\max}$; see the numerical experiments in [61]. Also, from (25), it follows that the approximate matrix $H$ constructed according to Lemma 17 satisfies (3b) with $K_H = \hat{K} + \epsilon$, with probability $1-\delta$, which in turn implies that Condition 1 is ensured with probability $1-\delta$.

As concrete examples of problems of the form ( P2 ) where Lemma 17 can be readily used, Table 1 gives estimates of $K_i$ in (20a) for robust linear regression with the smooth non-convex bi-weight loss [3], as well as for non-convex binary classification using logistic regression with least-squares loss [61].

Table 1: Examples of problems of the form ( P2 ) with the corresponding estimates for $K_i$ in (20a).

Robust linear regression. Data: $a_i \in \mathbb{R}^d$, $b_i \in \mathbb{R}$.
$$f_i(a_i^T x) = \frac{(a_i^T x - b_i)^2}{(a_i^T x - b_i)^2 + 1}, \qquad \nabla^2 f_i = \frac{2\big(1 - 3(a_i^T x - b_i)^2\big)}{\big((a_i^T x - b_i)^2 + 1\big)^3}\, a_i a_i^T, \qquad K_i = 2\|a_i\|^2.$$

Non-convex binary classification. Data: $a_i \in \mathbb{R}^d$, $b_i \in \{0, 1\}$.
$$f_i(a_i^T x) = \Big(\frac{1}{1 + \exp(-a_i^T x)} - b_i\Big)^2, \qquad \nabla^2 f_i = 2\big(s_i^2(1-s_i)^2 + (s_i - b_i)\,s_i(1-s_i)(1-2s_i)\big)\, a_i a_i^T, \qquad K_i = \|a_i\|^2,$$
where $s_i = 1/(1 + \exp(-a_i^T x))$.
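For instance, for the bi-weight loss in the first row of Table 1, the curvatures $f_i''$, the bounds $K_i = 2\|a_i\|^2$, and the distribution (23) can be evaluated as in the following sketch (ours; the function names are illustrative):

```python
import numpy as np

def biweight_d2(t):
    """Second derivative of the bi-weight loss t^2 / (t^2 + 1):
    f''(t) = 2 (1 - 3 t^2) / (1 + t^2)^3, with |f''| <= 2."""
    return 2.0 * (1.0 - 3.0 * t**2) / (1.0 + t**2) ** 3

def table1_weights(A, bvec, x):
    """Per-sample curvatures, the K_i = 2 ||a_i||^2 bounds, and the
    non-uniform distribution (23) for robust linear regression."""
    t = A @ x - bvec
    d2 = biweight_d2(t)                          # f_i''(a_i^T x - b_i)
    row_sq = np.einsum('ij,ij->i', A, A)         # ||a_i||^2
    K = 2.0 * row_sq                             # K_i as in Table 1
    w = np.abs(d2) * row_sq
    return d2, K, w / w.sum()                    # the last output is (23)
```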
Now we are in a position to give iteration complexities for Algorithms 1 and 2 when the inexact Hessian matrix $H_t$ is constructed according to Lemma 16 or 17. Since the approximation is a probabilistic construction, in order to guarantee success we need to ensure a small failure probability across all iterations. In particular, in order to get an overall, accumulative success probability of $1-\delta$ for the entire $T$ iterations, the per-iteration failure probability is set as $\big(1 - \sqrt[T]{1-\delta}\big) \in \mathcal{O}(\delta/T)$. This failure probability appears only in the "log factor" for the sample size in all of our results, and so it is not the dominating cost. Hence, requiring that all $T$ iterations be successful, for a large $T$, only necessitates a small (logarithmic) increase in the sample size. For example, for $T \in \mathcal{O}(\max\{\epsilon_g^{-2}, \epsilon_H^{-3}\})$, as in Theorem 2, we can set the per-iteration failure probability to $\delta\min\{\epsilon_g^2, \epsilon_H^3\}$ and ensure that, when Algorithm 2 terminates, all Hessian approximations have been, accumulatively, successful with probability $1-\delta$.

Using these results, we have the following probabilistic, but optimal, guarantee on the worst-case iteration complexity of Algorithm 1 for solving the finite-sum problem ( P1 ) (or ( P2 )) in the case where the inexact Hessian is formed by sub-sampling. The proofs follow very similar lines of reasoning as those used to obtain the results of Section 2, and hence are omitted.

Theorem 4 (Optimal Complexity of Algorithm 1 for Finite-Sum Problems). Consider any $0 < \epsilon_g, \epsilon_H, \delta < 1$. Let $\epsilon$ be as in (11) and set $\delta_0 = \delta\min\{\epsilon_g^2\epsilon_H, \epsilon_H^3\}$. Furthermore, for such $(\epsilon, \delta_0)$, let the sample size $|\mathcal{S}|$ be as in (21) (or (24)) and form the sub-sampled matrix $H$ as in (19). For Problem ( P1 ) (or ( P2 )), under Assumption 1 and Condition 2, Algorithm 1 terminates in at most $T \in \mathcal{O}(\max\{\epsilon_g^{-2}\epsilon_H^{-1}, \epsilon_H^{-3}\})$ iterations, upon which, with probability $1-\delta$, we have that $\|\nabla F(x_T)\| \le \epsilon_g$ and $\lambda_{\min}(\nabla^2 F(x_T)) \ge -(\epsilon + \epsilon_H)$.

Similarly, in the setting of optimization problems ( P1 ) and ( P2 ), with appropriate sub-sampling of the Hessian as in Lemmas 16 and 17, we can also obtain probabilistic worst-case iteration complexities for Algorithm 2, as in the deterministic case. Again, the proofs are similar to those in Section 2, and hence are omitted.

Theorem 5 (Complexity of Algorithm 2 for Finite-Sum Problems). Consider any $0 < \epsilon_g, \epsilon_H, \delta < 1$. Let $\epsilon$ be as in (16) and set $\delta_0 = \delta\min\{\epsilon_g^2, \epsilon_H^3\}$. Furthermore, for such $(\epsilon, \delta_0)$, let the sample size $|\mathcal{S}|$ be as in (21) (or (24)) and form the sub-sampled matrix $H$ as in (19). For Problem ( P1 ) (or ( P2 )), under Assumption 1 and Condition 3, Algorithm 2 terminates in at most $T \in \mathcal{O}(\max\{\epsilon_g^{-2}, \epsilon_H^{-3}\})$ iterations, upon which, with probability $1-\delta$, we have that $\|\nabla F(x_T)\| \le \epsilon_g$ and $\lambda_{\min}(\nabla^2 F(x_T)) \ge -(\epsilon + \epsilon_H)$.

Theorem 6 (Optimal Complexity of Algorithm 2 for Finite-Sum Problems). Consider any $0 < \epsilon_g, \epsilon_H, \delta < 1$. Let $\epsilon$ be as in Theorem 3 and set $\delta_0 = \delta\min\{\epsilon_g^{3/2}, \epsilon_H^3\}$. Furthermore, for such $(\epsilon, \delta_0)$, let the sample size $|\mathcal{S}|$ be as in (21) (or (24)) and form the sub-sampled matrix $H$ as in (19). For Problem ( P1 ) (or ( P2 )), under Assumption 1 and Conditions 3 and 4, Algorithm 2 terminates in at most $T \in \mathcal{O}(\max\{\epsilon_g^{-3/2}, \epsilon_H^{-3}\})$ iterations, upon which, with probability $1-\delta$, we have that $\|\nabla F(x_T)\| \le \epsilon_g$ and $\lambda_{\min}(\nabla^2 F(x_T)) \ge -(\epsilon + \epsilon_H)$.
As can be seen, the main difference between Theorems 5 and 6 lies in the solution of the sub-problem (12). More specifically, if in addition to Condition 3, Condition 4 is also satisfied, then Theorem 6 gives the optimal worst-case iteration complexity.
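The bookkeeping in Theorems 4 through 6 can be summarized in a few lines. The sketch below (ours; the iteration bounds enter only through the choice of $\delta_0$) shows how the per-iteration failure probability affects the sample size (21) only through its log factor:

```python
import numpy as np

def per_iteration_delta(delta, eps_g, eps_h, optimal=True):
    """delta_0 as in Theorem 6 (optimal) or Theorem 5:
    delta * min{eps_g^(3/2), eps_h^3} or delta * min{eps_g^2, eps_h^3}."""
    if optimal:
        return delta * min(eps_g**1.5, eps_h**3)
    return delta * min(eps_g**2, eps_h**3)

# Shrinking the failure probability from delta to delta_0 inflates the
# sample size (21) only logarithmically:
d, k_max, eps, delta = 10**4, 10.0, 0.01, 0.1
delta0 = per_iteration_delta(delta, eps_g=0.01, eps_h=0.1)
size = lambda dlt: int(np.ceil(16 * k_max**2 / eps**2 * np.log(2 * d / dlt)))
print(size(delta), size(delta0))   # modest (logarithmic) increase
```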
We considered non-convex optimization settings and developed efficient variants of the trust-region and adaptive cubic regularization methods in which both the sub-problems and the curvature information are suitably approximated. For all of our proposed variants, we obtained iteration complexities for achieving approximate second-order criticality, which are shown to be the same (up to constants) as those of the exact variants.

As compared with previous works, our proposed Hessian approximation condition offers a range of theoretical and practical advantages. As a concrete example, we considered the large-scale finite-sum optimization problem and proposed uniform and non-uniform sub-sampling strategies as ways to efficiently construct the desired approximate Hessian. We then, probabilistically, established optimal iteration complexity for variants of trust-region and adaptive cubic regularization methods in which the Hessian is appropriately sub-sampled.

In this paper, we focused on approximating the Hessian under exact gradient information. Arguably, the bottleneck of the computations in such second-order methods involves the computations with the Hessian, e.g., matrix-vector products in the (approximate) solution of the sub-problem. In fact, the cost of the exact gradient computation is typically amortized by that of the operations with the Hessian. In spite of this, approximating the gradient in a computationally feasible way and under minimal assumptions could further improve the efficiency of the methods proposed here. However, care has to be taken, as cheaper iterations with inaccurate gradients could in fact result in more iterations overall, which would have the adverse effect of slowing down the algorithm's convergence. As a result, approximating the gradient has to be done with care to avoid such pitfalls.

Finally, we mention that our focus here has been solely on developing the theoretical foundations of such randomized algorithms. Extensive empirical evaluations of these algorithms on various machine learning applications are given in [61].

References

[1] Naman Agarwal et al. "Finding Approximate Local Minima Faster than Gradient Descent". In: arXiv preprint arXiv:1611.01146 (2016).
[2] Afonso S. Bandeira, Katya Scheinberg, and Luís N. Vicente. "Convergence of trust-region methods based on probabilistic models". In: SIAM Journal on Optimization 24.3 (2014), pp. 1238-1264.
[3] Albert E. Beaton and John W. Tukey. "The fitting of power series, meaning polynomials, illustrated on band-spectroscopic data". In: Technometrics 16.2 (1974), pp. 147-185.
[4] In: Computational Optimization and Applications.
[5] In: arXiv preprint arXiv:1609.07428v3 (2018).
[6] Raghu Bollapragada, Richard Byrd, and Jorge Nocedal. "Exact and Inexact Subsampled Newton Methods for Optimization". In: arXiv preprint arXiv:1609.08502 (2016).
[7] Léon Bottou, Frank E. Curtis, and Jorge Nocedal. "Optimization methods for large-scale machine learning". In: arXiv preprint arXiv:1606.04838 (2016).
[8] Yair Carmon and John C. Duchi. "Gradient Descent Efficiently Finds the Cubic-Regularized Non-Convex Newton Step". In: arXiv preprint arXiv:1612.00547 (2016).
[9] Coralia Cartis, Nicholas I. M. Gould, and Philippe L. Toint. "Adaptive cubic regularisation methods for unconstrained optimization. Part I: motivation, convergence and numerical results". In: Mathematical Programming 127.2 (2011), pp. 245-295.
[10] Coralia Cartis, Nicholas I. M. Gould, and Philippe L. Toint. "Adaptive cubic regularisation methods for unconstrained optimization. Part II: worst-case function- and derivative-evaluation complexity". In: Mathematical Programming 130.2 (2011), pp. 295-319.
[11] Coralia Cartis, Nicholas I. M. Gould, and Philippe L. Toint. "Complexity bounds for second-order optimality in unconstrained optimization". In: Journal of Complexity 28.1 (2012), pp. 93-108.
[12] In: Optimization Methods and Software.
[13] In: SIAM Journal on Optimization.
[14] Coralia Cartis, Nicholas I. M. Gould, and Philippe L. Toint. Optimal Newton-type methods for nonconvex smooth optimization problems. Tech. rep. ERGO 11-009, School of Mathematics, University of Edinburgh, 2011.
[15] Coralia Cartis and Katya Scheinberg. "Global convergence rate analysis of unconstrained optimization methods based on probabilistic models". In: Mathematical Programming (2015), pp. 1-39.
[16] Ruobing Chen, Matt Menickelly, and Katya Scheinberg. "Stochastic optimization using a trust-region method and random models". In: arXiv preprint arXiv:1504.04231 (2015).
[17] Andrew R. Conn, Nicholas I. M. Gould, and Philippe L. Toint. Trust Region Methods. SIAM, 2000.
[18] Andrew R. Conn, Katya Scheinberg, and Luís N. Vicente. "Global convergence of general derivative-free trust-region algorithms to first- and second-order critical points". In: SIAM Journal on Optimization.
[19] Frank E. Curtis, Daniel P. Robinson, and Mohammadreza Samadi. "A Trust Region Algorithm with a Worst-Case Iteration Complexity of $\mathcal{O}(\epsilon^{-3/2})$ for Nonconvex Optimization". In: Mathematical Programming.
[20] Frank E. Curtis, Daniel P. Robinson, and Mohammadreza Samadi. "An Inexact Regularized Newton Framework with a Worst-Case Iteration Complexity of $\mathcal{O}(\epsilon^{-3/2})$ for Nonconvex Optimization". In: arXiv preprint arXiv:1708.00475 (2017).
[21] John E. Dennis and Jorge J. Moré. "A characterization of superlinear convergence and its application to quasi-Newton methods". In: Mathematics of Computation.
[22] Petros Drineas and Michael W. Mahoney. "RandNLA: Randomized Numerical Linear Algebra". In: Communications of the ACM 59.6 (2016), pp. 80-90.
[23] In: SIAM Journal on Optimization.
[24] Jerome Friedman, Trevor Hastie, and Robert Tibshirani. The Elements of Statistical Learning. Vol. 1. Springer Series in Statistics. Springer, Berlin, 2001.
[25] Dan Garber et al. "Faster eigenvector computation via shift-and-invert preconditioning". In: International Conference on Machine Learning. 2016, pp. 2626-2634.
[26] Rong Ge et al. "Escaping From Saddle Points - Online Stochastic Gradient for Tensor Decomposition". In: COLT. 2015, pp. 797-842.
[27] Saeed Ghadimi, Han Liu, and Tong Zhang. "Second-Order Methods with Cubic Regularization Under Inexact Information". In: arXiv preprint arXiv:1710.05782 (2017).
[28] Nicholas I. M. Gould, Daniel P. Robinson, and H. Sue Thorne. "On solving trust-region and other regularised subproblems in optimization". In: Mathematical Programming Computation.
[29] In: SIAM Journal on Optimization.
[30] In: Optimization Methods and Software.
[31] In: SIAM Journal on Optimization.
[32] In: IMA Journal of Numerical Analysis.
[33] In: IMA Journal of Numerical Analysis.
[34] Andreas Griewank. "The modification of Newton's method for unconstrained optimization by bounding cubic terms". Technical Report NA/12. Department of Applied Mathematics and Theoretical Physics, University of Cambridge (1981).
[35] David Gross and Vincent Nesme. "Note on sampling without replacing from a finite collection of matrices". In: arXiv preprint arXiv:1001.2738 (2010).
[36] Elad Hazan and Tomer Koren. "A linear-time algorithm for trust region problems". In: Mathematical Programming (2015), pp. 1-19.
[37] Jonas Moritz Kohler and Aurelien Lucchi. "Sub-sampled Cubic Regularization for Non-convex Optimization". In: arXiv preprint arXiv:1705.05933 (2017).
[38] J. Kuczyński and H. Woźniakowski. "Estimating the largest eigenvalue by the power and Lanczos algorithms with a random start". In: SIAM Journal on Matrix Analysis and Applications.
[39] In: Computational Optimization and Applications.
[40] Jason D. Lee et al. "Gradient descent only converges to minimizers". In: Conference on Learning Theory. 2016, pp. 1246-1257.
[41] Felix Lenders, Christian Kirches, and Andreas Potschka. "trlib: A vector-free implementation of the GLTR method for iterative solution of the trust region problem". In: arXiv preprint arXiv:1611.04718 (2016).
[42] Michael W. Mahoney. "Randomized algorithms for matrices and data". In: Foundations and Trends in Machine Learning.
[43] Jorge J. Moré and Danny C. Sorensen. "Computing a trust region step". In: SIAM Journal on Scientific and Statistical Computing.
[44] In: Mathematical Programming.
[45] Yurii Nesterov and Boris T. Polyak. "Cubic regularization of Newton method and its global performance". In: Mathematical Programming 108.1 (2006), pp. 177-205.
[46] Jorge Nocedal and Stephen J. Wright. Numerical Optimization. Springer Science & Business Media, 2006.
[47] Farbod Roosta-Khorasani, Kees van den Doel, and Uri Ascher. "Data completion and stochastic algorithms for PDE inversion problems with many measurements". In: Electronic Transactions on Numerical Analysis 42 (2014), pp. 177-196.
[48] Farbod Roosta-Khorasani, Kees van den Doel, and Uri Ascher. "Stochastic algorithms for inverse problems involving PDEs and many measurements". In: SIAM Journal on Scientific Computing.
[49] Farbod Roosta-Khorasani and Michael W. Mahoney. "Sub-sampled Newton methods". In: Mathematical Programming. ISSN: 1436-4646.
[50] Farbod Roosta-Khorasani, Gábor J. Székely, and Uri Ascher. "Assessing stochastic algorithms for large scale nonlinear least squares problems using extremal probabilities of linear combinations of gamma random variables". In: SIAM/ASA Journal on Uncertainty Quantification.
[51] Shai Shalev-Shwartz and Shai Ben-David. Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press, 2014.
[52] Sara Shashaani, Fatemeh Hashemi, and Raghu Pasupathy. "ASTRO-DF: A Class of Adaptive Sampling Trust-Region Algorithms for Derivative-Free Stochastic Optimization". In: arXiv preprint arXiv:1610.06506 (2016).
[53] Danny C. Sorensen. "Newton's method with a model trust region modification". In: SIAM Journal on Numerical Analysis.
[54] In: SIAM Journal on Optimization.
[55] Suvrit Sra, Sebastian Nowozin, and Stephen J. Wright, eds. Optimization for Machine Learning. MIT Press, 2012.
[56] Trond Steihaug. "The conjugate gradient method and trust regions in large scale optimization". In: SIAM Journal on Numerical Analysis.
[57] In: arXiv preprint arXiv:1711.02838 (2017).
[58] Joel A. Tropp. "User-friendly tail bounds for sums of random matrices". In: Foundations of Computational Mathematics.
[59] Joel A. Tropp. "An Introduction to Matrix Concentration Inequalities". In: arXiv preprint arXiv:1501.01571 (2015).
[60] David P. Woodruff. "Sketching as a tool for numerical linear algebra". In: arXiv preprint arXiv:1411.4357 (2014).
[61] Peng Xu, Farbod Roosta-Khorasani, and Michael W. Mahoney. "Second-Order Optimization for Non-Convex Machine Learning: An Empirical Study". In: arXiv preprint arXiv:1708.07827 (2017).
[62] Peng Xu et al. "Sub-sampled Newton methods with non-uniform sampling". In: Advances in Neural Information Processing Systems. 2016, pp. 3000-3008.
Appendix A: Intrinsic Dimension and Improving the Sampling Complexity (24)
We can further improve the sampling complexity (24) by considering the intrinsic dimension of the matrix $A^T|B|A$. Recall that for an SPSD matrix $A\in\mathbb{R}^{d\times d}$, the intrinsic dimension is defined as $t(A) = \mathrm{tr}(A)/\|A\|$, where $\mathrm{tr}(A)$ is the trace of $A$. The intrinsic dimension can be regarded as a measure of the number of dimensions in which $A$ has a significant spectrum. It is easy to see that $1 \le t(A) \le \mathrm{rank}(A) \le d$; see [59] for more details. Now let $t = \mathrm{tr}(A^T|B|A)/\|A^T|B|A\|$ be the intrinsic dimension of the SPSD matrix $A^T|B|A$. We have the following improved sampling complexity result:

Lemma 18 (Complexity of Non-Uniform Sampling: Intrinsic Dimension). The result of Lemma 17 holds with (24) replaced by
$$|\mathcal{S}| \ge \frac{4\hat{K}^2}{\epsilon^2}\log\frac{8t}{\delta}, \qquad (26)$$
where $t = \mathrm{tr}(A^T|B|A)/\|A^T|B|A\| \le d$ is the intrinsic dimension of the matrix $A^T|B|A$.

Proof. It is easy to see that $\mathrm{Var}(X) = \mathbb{E}(X^2) = \sum_{j=1}^{|\mathcal{S}|}\mathbb{E}(X_j^2) \preceq |\mathcal{S}|\, c\, A^T|B|A$, where $X$ and $c$ are as given in the proof of Lemma 17. For $H_j = \frac{b_i}{p_i}a_i a_i^T$, we have
$$\lambda_{\max}(X_j) \le \|X_j\| = \Big\|\frac{b_i}{p_i}a_i a_i^T - A^T B A\Big\| = \Big\|\Big(\frac{1-p_i}{p_i}\Big)b_i\, a_i a_i^T - \sum_{j\neq i} b_j\, a_j a_j^T\Big\|$$
$$\le \Big(\frac{1-p_i}{p_i}\Big)|b_i|\,\|a_i\|^2 + \sum_{j\neq i}|b_j|\,\|a_j\|^2 = (1-p_i)\sum_{j=1}^n |b_j|\,\|a_j\|^2 + \sum_{j\neq i}|b_j|\,\|a_j\|^2$$
$$= 2\sum_{j\neq i}|b_j|\,\|a_j\|^2 \le 2\sum_{j=1}^n |b_j|\,\|a_j\|^2 = 2c.$$
Hence, if $\epsilon|\mathcal{S}| \ge \sqrt{|\mathcal{S}|\,c\,\|A^T|B|A\|} + 2c/3$, we can apply the matrix Bernstein inequality with intrinsic dimension [59, Theorem 7.7.1] to get, for $\epsilon \le 3\hat{K}/2$,
$$\Pr\big(\lambda_{\max}(X) \ge \epsilon|\mathcal{S}|\big) \le 4t\exp\Big\{\frac{-\epsilon^2|\mathcal{S}|}{2c\,\|A^T|B|A\| + 4c\epsilon/3}\Big\} \le 4t\exp\Big\{\frac{-\epsilon^2|\mathcal{S}|}{4\hat{K}^2}\Big\}.$$
Applying the same bound to $Y_j = -X_j$ and $Y = \sum_{j=1}^{|\mathcal{S}|} Y_j$, followed by the union bound, we get the desired result.
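As an illustration (ours; `b_abs` denotes the absolute values of the diagonal of $B$), the intrinsic dimension and the resulting prescriptions (24) versus (26) can be compared directly:

```python
import numpy as np

def intrinsic_dimension(M):
    """t(M) = tr(M) / ||M|| for an SPSD matrix M; 1 <= t(M) <= rank(M) <= d."""
    return np.trace(M) / np.linalg.norm(M, 2)

def sample_sizes(A, b_abs, k_hat, eps, delta):
    """Compare (24), which uses log(2d/delta), with (26), which uses
    log(8t/delta) for t = t(A^T |B| A)."""
    d = A.shape[1]
    M = A.T @ (b_abs[:, None] * A)      # A^T |B| A, with |B| = diag(b_abs)
    t = intrinsic_dimension(M)
    s24 = 4 * k_hat**2 / eps**2 * np.log(2 * d / delta)
    s26 = 4 * k_hat**2 / eps**2 * np.log(8 * t / delta)
    return int(np.ceil(s24)), int(np.ceil(s26))
```

When the spectrum of $A^T|B|A$ decays quickly, $t \ll d$ and the prescription (26) is the smaller of the two.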
Appendix B: Computation of Approximate Negative Curvature Direction
Throughout our analysis, we assume that if a sufficiently negative curvature direction exists, i.e., $\lambda_{\min}(H) \le -\epsilon_H$ for some $\epsilon_H \in (0,1)$, then we can find a direction $u$ such that $\langle u, Hu\rangle \le -\nu\epsilon_H\|u\|^2$ for some $\nu \in (0,1)$. Such a direction can be obtained by applying, e.g., the Lanczos or shifted power method to the matrix $\tilde{H} = K_H\mathbb{I} - H$. These methods only employ matrix-vector products and, hence, are suitable for large-scale problems. More specifically, for any $\kappa \in (0,1)$, these methods, after at most $\mathcal{O}\big(\log(d/\delta)\sqrt{K_H/\kappa}\big)$ matrix-vector products and with probability $1-\delta$, yield a vector $u$ satisfying
$$K_H\|u\|^2 - \langle u, Hu\rangle = \langle u, \tilde{H}u\rangle \ge \kappa\,\lambda_{\max}(\tilde{H})\,\|u\|^2 = \kappa\big(K_H - \lambda_{\min}(H)\big)\|u\|^2.$$
Rearranging, we obtain
$$\langle u, Hu\rangle \le (1-\kappa)K_H\|u\|^2 + \kappa\,\lambda_{\min}(H)\,\|u\|^2.$$
Setting $1 > \kappa \ge (2K_H)/(2K_H + \epsilon_H)$ and $\nu = \kappa/2$ gives $\langle u, Hu\rangle \le -\nu\epsilon_H\|u\|^2$.
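A minimal sketch of the shifted power-iteration idea (ours; in practice Lanczos with a random start, as cited above, gives the stated probabilistic guarantee) using only Hessian-vector products:

```python
import numpy as np

def negative_curvature_direction(hvp, d, k_h, iters, rng):
    """Power iteration on the shifted operator v -> K_H * v - H v.
    Its top eigenvector corresponds to lambda_min(H); `hvp` is any
    Hessian-vector product oracle, so H is never formed explicitly."""
    u = rng.normal(size=d)
    u /= np.linalg.norm(u)
    for _ in range(iters):
        w = k_h * u - hvp(u)            # apply H_tilde = K_H * I - H
        u = w / np.linalg.norm(w)
    return u                            # <u, H u> approximates lambda_min(H)

# Usage sketch on an explicit symmetric matrix (illustrative only):
rng = np.random.default_rng(0)
M = rng.normal(size=(50, 50))
H = 0.5 * (M + M.T)
u = negative_curvature_direction(lambda v: H @ v, 50,
                                 np.linalg.norm(H, 2), 200, rng)
print(u @ H @ u)   # close to lambda_min(H) after enough iterations
```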