Newton-Type Methods for Non-Convex Optimization Under Inexact Hessian Information
aa r X i v : . [ m a t h . O C ] M a y Newton-Type Methods for Non-Convex Optimization UnderInexact Hessian Information
Peng Xu ∗ Fred Roosta † Michael W. Mahoney ‡ May 15, 2019
Abstract
We consider variants of trust-region and adaptive cubic regularization methodsfor non-convex optimization, in which the Hessian matrix is approximated. Undercertain condition on the inexact Hessian, and using approximate solution of the cor-responding sub-problems, we provide iteration complexity to achieve ǫ -approximatesecond-order optimality which have been shown to be tight. Our Hessian approx-imation condition offers a range of advantages as compared with the prior worksand allows for direct construction of the approximate Hessian with a priori guaran-tees through various techniques, including randomized sampling methods. In thislight, we consider the canonical problem of finite-sum minimization, provide appro-priate uniform and non-uniform sub-sampling strategies to construct such Hessianapproximations, and obtain optimal iteration complexity for the corresponding sub-sampled trust-region and adaptive cubic regularization methods. Consider the generic unconstrained optimization problemmin x ∈ R d F ( x ) , ( P0 )where F : R d → R is smooth and non-convex . Faced with the large-scale nature of modern“big-data” problems, many of the classical optimization algorithms might prove to beinefficient, if applicable at all. In this light, many of the recent research efforts have beencentered around designing variants of classical algorithms which, by employing suitable approximations of the gradient and/or Hessian, improve upon the cost-per-iteration, whilemaintaining the original iteration complexity. In this light, we focus on trust-region (TR)[17] and cubic regularization (CR) [34], two algorithms which are considered as amongthe most elegant and theoretically sound general-purpose Newton-type methods for non-convex problems. ∗ Institute for Computational and Mathematical Engineering, Stanford University, Email:[email protected] † School of Mathematics and Physics, University of Queensland, Brisbane, Australia, and InternationalComputer Science Institute, Berkeley, USA, Email: [email protected] ‡ International Computer Science Institute and Department of Statistics, University of California atBerkeley, Email: [email protected]
1n doing so, we first consider ( P0 ), and study the theoretical convergence properties ofvariants of these two algorithms in which, under favorable conditions, Hessian is suitablyapproximated. We show that our Hessian approximation conditions, in many cases, areweaker than the existing ones in the literature. In addition, and in contrast to some priorworks, our conditions allow for efficient constructions of the inexact Hessian with a prioriguarantees via various approximation methods, of which Randomized Numerical LinearAlgebra (RandNLA), [22, 42], techniques are shown to be highly effective.Subsequently, to showcase the application of randomized techniques for constructionof the approximate Hessian, we consider an important instance of ( P0 ), i.e., large-scale finite-sum minimization, of the formmin x ∈ R d F ( x ) , n n X i =1 f i ( x ) , ( P1 )and its special case min x ∈ R d F ( x ) , n n X i =1 f i ( a Ti x ) , ( P2 )where n ≫
1, each f i is a smooth but possibly non-convex function, and a i ∈ R d , i =1 , . . . , n, are given. Problems of the form ( P1 ) and ( P2 ) arise very often in machinelearning, e.g., [51] as well as scientific computing, e.g., [47, 48]. In big-data regimewhere n ≫
1, operations with the Hessian of F , e.g., matrix-vector products, typicallyconstitute the main bottleneck of computations. Here, we show that our relaxed Hessianapproximation conditions allow one to draw upon the sub-sampling ideas of [6, 49, 62],to design variants of TR and CR algorithms where the Hessian is (non-)uniformly sub-sampled. We then present the theoretical convergence properties of these variants fornon-convex finite-sum problems of the form ( P1 ) and ( P2 ).The rest of this paper is organized as follows. In Section 1.1, we first introduce thenotation and definitions used throughout the paper. For completeness, in Section 1.2, wegive a brief review of trust region (Section 1.2.1) and cubic regularization (Section 1.2.2)along with related prior works. Our main contributions are summarized in Section 1.3.Theoretical analysis of the proposed algorithms for solving generic non-convex problem( P0 ) are presented in Section 2. Various randomized sub-sampling strategies as well astheoretical properties of the proposed algorithms for finite-sum minimization problems( P1 ) and ( P2 ) are given in Section 3. Conclusions and further thoughts are gathered inSection 4. Throughout the paper, vectors are denoted by bold lowercase letters, e.g., v , and matricesor random variables are denoted by bold upper case letters, e.g., V . v T denotes thetranspose of a real vector v . We use regular lower-case and upper-case letters to denotescalar constants, e.g., c or K . For two vectors, v , w , their inner-product is denoted as h v , w i = v T w . For a vector v , and a matrix V , k v k and k V k denote the vector ℓ norm and the matrix spectral norm, respectively, while k V k F is the matrix Frobeniusnorm. ∇ F ( x ) and ∇ F ( x ) are the gradient and the Hessian of F at x , respectively, and I denotes the identity matrix. For two symmetric matrices A and B , A (cid:23) B indicatesthat A − B is symmetric positive semi-definite. The subscript, e.g., x t , denotes iteration2ounter and log( x ) is the natural logarithm of x . The inexact Hessian is denoted by H ( x ),but for notational simplicity, we may use H t to, instead, denote the approximate Hessianevaluated at the iterate x t in iteration t , i.e., H t , H ( x t ). Throughout the paper, S denotes a collection of indices from { , , · · · , n } , with potentially repeated items and itscardinality is denoted by |S| .Unlike convex functions for which “local optimality” and “global optimality” are infact the same, in non-convex settings, we are often left with designing algorithms thatcan guarantee convergence to approximate local optimality. In this light, throughout thispaper, we make use of the following definition of ( ǫ g , ǫ H )-Optimality: Definition 1 (( ǫ g , ǫ H )-Optimality) . Given ǫ g , ǫ H ∈ (0 , , x ∈ R d is an ( ǫ g , ǫ H ) -optimalsolution to the problem ( P0 ) , if k∇ F ( x ) k ≤ ǫ g , λ min ( ∇ F ( x )) ≥ − ǫ H . (1)We note that ( ǫ g , ǫ H )-Optimality (even with ǫ g = ǫ H = 0) does not necessarily implycloseness to any local minimum, neither in iterate nor in the objective value. However,if the saddle points satisfy the strict-saddle property [26, 40], then an ( ǫ g , ǫ H )-optimalityguarantees vicinity to a local minimum for sufficiently small ǫ g and ǫ H . Arguably, the most straightforward approach for globalization of many Newton-type al-gorithms is the application of line-search. 
However, near saddle points where the gradientmagnitude can be small, traditional line search methods can be very ineffective and infact produce iterates that can get stuck at a saddle point [46]. Trust region and cubicregularization methods are two elegant globalization alternatives that, specially recently,have attracted much attention. The main advantage of these methods is that they arereliably able to take advantage of the direction of negative curvature and escape saddlepoints. In this section we briefly review these algorithms as they pertain to the presentpaper and mention the relevant prior works.
TR methods [17, 53] encompass a general class of iterative methods which specificallydefine a region around the current iterate within which they trust the model to be areasonable approximation of the true objective function. The most widely used approxi-mating model, which we consider here, is done via a quadratic function. More specifically,using the current iterate x t , the quadratic variant of TR algorithm finds the next iterateas x t +1 = x t + s t where s t is a solution of the constrained sub-problemmin m t ( s ) , h s , ∇ F ( x t ) i + 12 h s , ∇ F ( x t ) s i (2a)s.t. k s k ≤ ∆ t . Here, ∆ t is the region in which we “trust” our quadratic model to be an acceptableapproximation of the true objective for the current iteration. The major bottleneck ofcomputations in TR algorithm is the minimization of the constrained quadratic sub-problem (2a), for which numerous approaches have been proposed, e.g., [23, 28, 29, 36,41, 43, 54, 56]. 3or a smooth non-convex objective and in order to obtain approximate first-ordercriticality, i.e., k∇ F ( x t ) k ≤ ǫ g for some ǫ g ∈ (0 , O ( ǫ − g ); e.g.,[5, 11, 31, 32, 33]. Recent non-trivial modifications of the classical TR methods havealso been proposed which improve upon the complexity to O ( ǫ − / g ); see [19] and furtherextensions to a more general framework in [20]. These bounds can be shown to be tight[13] in the worst case. Under a more general algorithmic framework and in terms ofobjective function sub-optimality, i.e., F ( x ) − F ∗ ≤ ǫ , better complexity bounds, in theconvex and strongly-convex settings, have been obtained which are of the orders of O ( ǫ − g )and O (log(1 /ǫ g )), respectively [30].For non-convex problems, however, it is more desired to obtain complexity bounds forachieving approximate second-order criticality, i.e., Definition 1. For this, bounds in theorders of O (max { ǫ − H ǫ − g , ǫ − H } ) and O (max { ǫ − g , ǫ − H } ) have been obtained in [11] and [30],respectively. Similar bounds were also given in [33] under probabilistic model. Boundsof this order have shown to be optimal in certain cases [11].More closely related to the present paper, there have been several results which studythe role of derivative-free and probabilistic models in general, and Hessian approximationin particular, e.g., see [2, 5, 11, 16, 18, 33, 39, 52] and references therein. An alternative to the traditional line-search and TR for globalization of Newton-typemethods is the application of cubic regularization. Such class of methods is characterizedby generating iterates as x t +1 = x t + s t where s t is a solution of the following unconstrained sub-problem min s ∈ R d m t ( s ) , h s , ∇ F ( x t ) i + 12 h s , ∇ F ( x t ) s i + σ t k s k , (2b)where σ t is the cubic regularization parameter chosen for the current iteration. As in thecase of TR, the major bottleneck of CR involved solving the sub-problem (2b), for whichvarious techniques have been proposed, e.g., [1, 4, 8, 9].To the best of our knowledge, the use of such regularization, was first introduced inthe pioneering work of [34], and subsequently further studied in the seminal works of[9, 10, 45].From the worst-case complexity point of view, CR has a better dependenceon ǫ g compared to TR. More specifically, [45] showed that, under global Lipschitz con-tinuity assumption on the Hessian, if the sub-problem (2b) is solved exactly, then theresulting CR algorithm achieves the approximate first-order criticality with complexityof O ( ǫ − / g ). 
These results were extended by the pioneering and seminal works of [9, 10]to an adaptive variant, which is often referred to as ARC (Adaptive Regularization withCubics). In particular, the authors showed that the worst case complexity of O ( ǫ − / g )can be achieved without requiring the knowledge of the Hessian’s Lipschitz constant, ac-cess to the exact Hessian, or multi-dimensional global optimization of the sub-problem(2b). These results were further refined in [11] where it was shown that, not only, multi-dimensional global minimization of (2b) is unnecessary, but also the same complexitycan be achieved with mere one or two dimensional search. This O ( ǫ − / ) bound has beenshown to be tight [14]. As for the approximate second-order criticality, [11] showed thatat least O (max { ǫ − g , ǫ − H } ) is required. With further assumptions on the inexactness of4ub-problem solution, [10, 11] also show that one can achieve O (max { ǫ − / g , ǫ − H } ), whichis shown to be tight [13]. Better dependence on ǫ g can be obtained if one assumes addi-tional structure, such as convexity, e.g., see [12, 45] as well as the acceleration scheme of[44].Recently, for (strongly) convex problems, [27] obtained sub-optimal complexity forARC and its accelerated variants using Hessian approximations. In the context of stochas-tic optimization problems, [57] considers cubic regularization with a priori chosen fixedregularization parameter using both approximations of the gradients and Hessian. Spe-cific to the finite-sum problem ( P1 ), and by a direct application of the theoretical resultsof [9, 10], [37] presents a sub-sampled variant of ARC, in which the exact Hessian andthe gradient are replaced by sub-samples. However, unfortunately, their analysis suffersfrom a rather vicious circle: the approximate Hessian and gradient are formed based onan a priori unknown step which can only be determined after such approximations areformed. In this section, we summarize the key aspects of our contributions. In Section 2, we con-sider ( P0 ) and establish the worst-case iteration complexities for variants of trust-regionand adaptive cubic regularization methods in which the Hessian is suitably approximated.More specifically, our entire analysis is based on the following key condition on the ap-proximate Hessian H ( x ): Condition 1 (Inexact Hessian Regularity) . For some < K H < ∞ , ǫ > , the approxi-mating Hessian, H ( x t ) , satisfies (cid:13)(cid:13)(cid:0) H ( x t ) − ∇ F ( x t ) (cid:1) s t (cid:13)(cid:13) ≤ ǫ · k s t k , (3a) k H ( x t ) k ≤ K H , (3b) where x t and s t are, respectively, the iterate and the update at iteration t . Under Condition 1, we show that our proposed algorithms (Algorithms 1 and 2)achieve the same worst-case iteration complexity to obtain approximate second ordercritical solution as that of the exact variants (Theorems 1, 2, and 3).In Section 3, we describe schemes for constructing H ( x t ) to satisfy Condition 1. Specif-ically, in the context of finite-sum optimization framework, i.e., problems ( P1 ) and ( P2 ),we present various sub-sampling schemes to probabilistically ensure Condition 1 (Lem-mas 16 and 17). Our proposed randomized sub-sampling strategies guarantee, with highprobability, a stronger condition than (3a), namely k H ( x ) − ∇ F ( x ) k ≤ ǫ. (4)It is clear that (4) implies (3a). 
We then give optimal iteration complexities for Algo-rithms 1 and 2 for optimization of non-convex finite-sum problems where the Hessian isapproximated by means of appropriate sub-sampling (Theorems 4, 5 and 6).To establish optimal second-order iteration complexity, many previous works consid-ered Hessian approximation conditions that, while enjoying many advantages, come withcertain disadvantages. Our proposed Condition 1 aims to remedy some of these disad-vantages. We first briefly review the conditions used in the prior works, and subsequentlyhighlight the merits of Condition 1 in comparison.5 .3.1 Conditions Used in Prior Works For the analysis of trust-region, many authors have considered the following condition (cid:13)(cid:13) H ( x t ) − ∇ F ( x t + s ) (cid:13)(cid:13) ≤ C ∆ t , ∀ s ∈ { s ; k s k ≤ ∆ t } , (5a)for some 0 < C < ∞ , where ∆ t is the current trust-region radius, e.g., [2, 33]. In [5],condition (5a) is replaced with (cid:13)(cid:13) H ( x t ) − ∇ F ( x t ) (cid:13)(cid:13) ≤ C ∆ t , (5b)for some 0 < C < ∞ . In fact, by assuming Lipschitz continuity of Hessian, it is easyto show that (5a) and (5b) are equivalent, in that one implies the other, albeit withmodified constants. We also note that [2, 5, 33] study a more general framework underwhich the entire sub-problem model is probabilistically constructed and approximationextends beyond just the Hessian.For cubic regularization, the condition imposed on the inexact Hessian is often con-sidered as (cid:13)(cid:13)(cid:0) H ( x t ) − ∇ F ( x t ) (cid:1) s t (cid:13)(cid:13) ≤ C k s t k , (5c)for some 0 < C < ∞ , e.g., [9, 10, 11] and other follow-up works. In fact, [11] has alsoestablished optimal iteration complexity for trust-region algorithm under (5c). Both of(5a) and (5c), are stronger than the celebrated Dennis-Mor´e [21] condition, i.e.,lim t →∞ k ( H ( x t ) − ∇ F ( x t )) s t kk s t k = 0 . Indeed, under certain assumptions, Dennis-Mor´e condition is satisfied by a number ofquasi-Newton methods, although the same cannot be said about (5a) and (5c) [9].
For our trust-region analysis, we require Condition 1 with ǫ ∈ O (max { ǫ H , ∆ t } ); see(11) in Theorem 1. Hence, when ∆ t is large, e.g., at the beginning of iterations, allthe conditions (3a), (5a), and (5b) are equivalent, up to some constants. However, theconstants in (5a) and (5b) can be larger than what is implied by (3a), amounting tocruder approximations in practice for when ∆ t is large. As iterations progress, the trust-region radius will get smaller, and in fact it is expected that ∆ t will eventually shrink tobe ∆ t ∈ Θ (min { ǫ g , ǫ H } ). In prior works, e.g., [5, 33], the convergence analysis is derivedusing ǫ H = ǫ g , whereas here we allow ǫ H = √ ǫ g . As a result, the requirements (5a) and(5b) can eventually amount to stricter conditions than (3a).As for (5c), the main drawback lies in the difficulty of enforcing it. Despite the factthat for certain values of k s t k and ǫ , e.g., ǫ ≪ k s t k , (5c) can be less restrictive than(3a), a priori enforcing (5c) requires one to have already computed the search direction s t , which itself can be done only after H ( x t ) is constructed, hence creating a viciouscircle. A posteriori guarantees can be given if one obtains a lower-bound estimate on theyet-to-be-computed step-size, i.e., to have s > s ≤ k s t k . This allows oneto consider a stronger condition as k ( H ( x t ) − ∇ F ( x t )) k ≤ C s , which can be enforcedusing a variety of methods such as those described in Section 3. However, to obtain sucha lower-bound estimate on the next step-size, one has to resort to a recursive procedure,6hich necessitates repeated constructions of the approximate Hessian and subsequentsolutions of the corresponding subproblems. Consequently, this procedure may result ina significant computational overhead and will lead to undesirable theoretical complexities.In sharp contrast to (5c), the condition (3a) allows for theoretically principled use ofmany practical techniques to construct H t . For example, under (3a), the use of quasi-Newton methods to approximate the Hessian is theoretically justified. Further, by con-sidering the stronger condition (4), many randomized matrix approximation techniquescan be readily applied, e.g., [42, 58, 59, 60]; see Section 3. To the best of our knowledge,the only successful attempt at guaranteed a priori construction of H t using (5c) is donein [15]. Specifically, by considering probabilistic models, which are “sufficiently accurate”in that they are partly based on (5c), [15] studies first-order complexity of a large classof methods, including ARC, and discusses ways to construct such probabilistic modelsas long as the gradient is large enough, i.e., before first-order approximate-optimality isachieved. Here, by considering (3a), we are able to provide an alternative analysis, whichallows us to obtain second-order complexity results.Requiring (4), as a way of enforcing (3a), offers a variety of other practical advan-tages, which are not readily available with other conditions. For example, consider dis-tributed/parallel environments where the data is distributed across a network and themain bottleneck of computations is the communications across the nodes. In such set-tings, since (4) allows for the Hessian accuracy to be set a priori and to remain fixedacross all iterations, the number of samples in each node can stay the same throughoutiterations. 
This prevents unnecessary communications to re-distribute the data at everyiteration.Furthermore, in case of failed iterations, i.e., when the computed steps are rejected,the previous H t may seamlessly be used in the next iteration, which avoids repeatingmany such, potentially expensive, computations throughout the iterations. For example,consider approximate solutions to the underlying sub-problems by means of dimension-ality reduction, i.e., H t is projected onto a lower dimensional sub-space as U T H t U forsome U ∈ R d × p with p ≪ d , resulting in a smaller dimensional sub-problem. Now ifthe current iteration leads to a rejected step, the projection of the H t from the previousiteration can be readily re-used in the next iteration. This naturally amounts to savingfurther Hessian computations. We are now ready to present our main algorithms for solving the generic non-convexoptimization ( P0 ) along with their corresponding iteration complexity results to obtaina ( ǫ g , ǫ H )-optimal solution as in (1). More precisely, in Section 2.1 and 2.2, respectively,we present modifications of the TR and ARC methods which incorporate inexact Hessianinformation, according to Condition 1.We remind that, though not specifically mentioned in the statement of the theoremsor the algorithms, when the computed steps are rejected and an iteration needs to berepeated with different ∆ t or σ t , the previous H t may seamlessly be used in the nextiteration. This can be a desirable feature in many practical situations and is directly theresult of enforcing (4); see also the discussion in Section 1.3.2.For our analysis throughout the paper, we make the following standard assumptionregarding the regularity of the exact Hessian of the objective function F .7 ssumption 1 (Hessian Regularity) . F ( x ) is twice differentiable and has bounded andLipschitz continuous Hessian on the piece-wise linear path generated by the iterates, i.e.for some < K, L < ∞ and all iterations (cid:13)(cid:13) ∇ F ( x ) − ∇ F ( x t ) (cid:13)(cid:13) ≤ L k x − x t k , ∀ x ∈ [ x t , x t + s t ] , (6a) (cid:13)(cid:13) ∇ F ( x t ) (cid:13)(cid:13) ≤ K, (6b) where x t and s t are, respectively, the iterate and the update step at iteration t . Although, we do not know of a particular way to, a priori, verify (6a), it is clear thatAssumption (6a) is weaker than Lipschitz continuity of the Hessian for all x , i.e., (cid:13)(cid:13) ∇ F ( x ) − ∇ F ( y ) (cid:13)(cid:13) ≤ L k x − y k , ∀ x , y ∈ R d . (7)Despite the fact that theoretically (6a) is weaker than (7), to the best of our knowledgeas of yet, (7) is the only practical sufficient condition for verifying (6a). Algorithm 1 depicts a trust-region algorithm where at each iteration t , instead of thetrue Hessian ∇ F ( x t ), only an inexact approximation, H t , is used. For Algorithm 1, theaccuracy tolerance in (3a) is adaptively chosen as ǫ t ≤ max { ǫ , ∆ t } , where ∆ t is the trustregion in the t-th iteration and ǫ ∈ O ( ǫ H ) is some fixed threshold. This allows for avery crude approximation at the beginning of iterations, when ∆ t is large. As iterationsprogress towards optimality and ∆ t gets small, the threshold ǫ can prevent ǫ from gettingunnecessarily too small.In Algorithm 1, we require that the sub-problem (8) is solved only approximately.Indeed, in large-scale problems, where the exact solution of the sub-problem is the mainbottleneck of the computations, this is a very crucial relaxation. 
Such approximatesolution of the sub-problem (8) has been adopted in many previous work. Here, wefollow the inexactness conditions discussed in [17], which are widely known as Cauchyand Eigenpoint conditions. Recall that the Cauchy and Eigen directions correspond,respectively, to one dimensional minimization of the sub-problem (8) along the directionsgiven by the gradient and negative curvature. Condition 2 (Sufficient Descent Cauchy and Eigen Directions [17]) . Assume that wesolve the sub-problem (8) approximately to find s t such that − m t ( s t ) ≥ − m t ( s Ct ) ≥ k∇ F ( x t ) k min (cid:26) k∇ F ( x t ) k k H t k , ∆ t (cid:27) , (9a) − m t ( s t ) ≥ − m t ( s Et ) ≥ ν | λ min ( H t ) | ∆ t , if λ min ( H t ) < . (9b) Here, m t ( · ) is defined in (8) , s Ct (Cauchy point) is along negative gradient direction and s Et is along approximate negative curvature direction such that h s Et , H t s Et i ≤ νλ min ( H t ) k s Et k < , for some ν ∈ (0 , (see Appendix B for a way to efficiently compute s Et ). One way to ensure that an approximate solution to the sub-problem (8) satisfies (9),is by replacing (8) with the following reduced-dimension problem, in which the searchspace is a two-dimensional sub-space containing vectors s Ct , and s Et , i.e., s t = arg min k s k≤ ∆ t s ∈ Span { s Ct , s Et } h∇ F ( x t ) , s i + 12 h s , H t s i . lgorithm 1 Trust Region with Inexact Hessian Input:
Starting point x , initial radius 0 < ∆ < ∞ , hyper-parameters ǫ , ǫ g , ǫ H , η ∈ (0 , , γ > for t = 0 , , . . . do Set the approximate Hessian, H t , as in (3) with ǫ t ≤ max { ǫ , ∆ t } if k∇ F ( x t ) k ≤ ǫ g , λ min ( H t ) ≥ − ǫ H then Return x t . end if Solve the sub-problem approximately s t ≈ arg min k s k≤ ∆ t m t ( s ) , h∇ F ( x t ) , s i + 12 h s , H t s i (8) Set ρ t , F ( x t ) − F ( x t + s t ) − m t ( s t ) if ρ t ≥ η then x t +1 = x t + s t ∆ t +1 = γ ∆ t else x t +1 = x t ∆ t +1 = ∆ t /γ end if end for Output: x t Of course, any larger dimensional sub-space P for which we have Span { s Ct , s Et } ⊆ P would also guarantee (9). In fact, a larger dimensional sub-space implies a more accuratesolution to our original sub-problem (8).We now set out to provide iteration complexity for Algorithm 1. Our analysis followssimilar line of reasoning as that in [9, 10, 11]. First, we show the discrepancy betweenthe quadratic model and objective function in Lemma 1. Lemma 1.
Given Assumption 1 and Condition (3a) with any ǫ t > , we have | F ( x t + s t ) − F ( x t ) − m t ( s t ) | ≤ L t + ǫ t t . (10) Proof.
Applying Mean Value Theorem on F at x t gives F ( x t + s t ) = F ( x t ) + ∇ F ( x t ) T s t + s Tt ∇ F ( ξ t ) s t , for some ξ t in the segment of [ x t , x t + s t ]. We have | F ( x t + s t ) − F ( x t ) − m t ( s t ) | = 12 (cid:12)(cid:12) s Tt ( ∇ F ( ξ t ) − H t ) s t (cid:12)(cid:12) = 12 (cid:12)(cid:12) s Tt ( ∇ F ( ξ t ) − ∇ F ( x t ) + ∇ F ( x t ) − H t ) s t (cid:12)(cid:12) ≤ (cid:12)(cid:12) s Tt ( ∇ F ( ξ t ) − ∇ F ( x t )) s t (cid:12)(cid:12) + 12 (cid:12)(cid:12) s Tt ( ∇ F ( x t ) − H t ) s t (cid:12)(cid:12) ≤ L k s t k + ǫ t k s t k ≤ L t + ǫ t t . Lemma 2.
Consider any ǫ H > , let ǫ , α (1 − η ) νǫ H for some α ∈ (0 , , and supposeCondition 1 is satisfied with ǫ t ≤ max { ǫ , ∆ t } , where ∆ t is the trust region at the t-thiteration. Given Assumption 1 and Condition 2, if λ min ( H t ) < − ǫ H and ∆ t ≤ (1 − α )(1 − η ) ν | λ min ( H t ) | / ( L + 1) , then the t-th iteration is successful, i.e. ∆ t +1 = γ ∆ t .Proof. Suppose ∆ t ≤ ǫ . From (9b) and (10), we have1 − ρ t = F ( x t + s t ) − F ( x t ) − m t ( s t ) − m t ( s t ) ≤ L ∆ t + ǫ t ∆ t ν | λ min ( H t ) | ∆ t ≤ L ∆ t + α (1 − η ) νǫ H ∆ t ν | λ min ( H t ) | ∆ t ≤ L ∆ t + α (1 − η ) ν | λ min ( H t ) | ∆ t ν | λ min ( H t ) | ∆ t ≤ L ∆ t + α (1 − η ) ν | λ min ( H t ) | ν | λ min ( H t ) | . By the assumption on ∆ t , we get ρ t ≥ η and the iteration is successful. Now consider∆ t ≥ ǫ . Similar to the above, we have1 − ρ t = F ( x t + s t ) − F ( x t ) − m t ( s t ) − m t ( s t ) ≤ L ∆ t + ǫ t ∆ t ν | λ min ( H t ) | ∆ t ≤ ( L + 1)∆ t ν | λ min ( H t ) | ∆ t ≤ ( L + 1)∆ t ν | λ min ( H t ) | , which again by assumption on ∆ t and noting α <
1, we get ρ t ≥ η and the iteration issuccessful. Lemma 3.
Suppose Condition 1 is satisfied with any ǫ t > . Given Assumption 1 andCondition 2, if k∇ F ( x t ) k > ǫ g and ∆ t ≤ min ( k∇ F ( x t ) k (1 + K H ) , p ǫ t + 4 L (1 − η ) k∇ F ( x t ) k − ǫ t L ) , then, the t-th iteration is successful, i.e. ∆ t +1 = γ ∆ t .Proof. By assumption on ∆ t , (9a), and since k∇ F ( x t ) k > ǫ g , we have − m t ( s t ) ≥ k∇ F ( x t ) k min (cid:26) k∇ F ( x t ) k k H t k , ∆ t (cid:27) ≥ k∇ F ( x t ) k ∆ t . Therefore,1 − ρ t = F ( x t + s t ) − F ( x t ) − m t ( s t ) − m t ( s t ) ≤ L ∆ t + ǫ t ∆ t k∇ F ( x t ) k ∆ t ≤ L ∆ t + ǫ t ∆ t k∇ F ( x t ) k ≤ − η, where the last inequality follows by assumption on ∆ t . So ρ t ≥ η , which means theiteration is successful.Lemma 4 gives a lower bound for the trust region radius before the algorithm termi-nates, i.e., this ensures that the trust region never shrinks to become too small.10 emma 4. Consider any ǫ g , ǫ H > such that ǫ H ≤ √ ǫ g and let ǫ , α (1 − η ) νǫ H forsome α ∈ (0 , . Further, suppose Condition 1 is satisfied with ǫ t ≤ max { ǫ , ∆ t } , where ∆ t is the trust region at the t-th iteration. For Algorithm 1, under Assumption 1 andCondition 2, we have ∆ t ≥ κ ∆ min { ǫ g , ǫ H } , ∀ t ≥ , where κ ∆ , min { κ , κ , κ , κ } /γ, κ , (1 − α )(1 − η ) ν/ ( L + 1) , κ , α (1 − η ) ν,κ , / (1 + K H ) , κ , p ( α (1 − η ) ν ) + 4 L (1 − η ) − α (1 − η ) ν/ (2 L ) . Proof.
We prove by contradiction. Assume that the t-th iteration is the first unsuccessfuliteration such that ∆ t +1 = ∆ t /γ ≤ κ ∆ min { ǫ g , ǫ H } , i.e., we have∆ t ≤ min { κ , κ , κ , κ } · min { ǫ g , ǫ H } . Suppose λ min ( H t ) < − ǫ H . By Lemma 2, since ∆ t ≤ (1 − α )(1 − η ) ν | λ min ( H t ) | / ( L + 1),iteration t must have been accepted and we must have ∆ t +1 = γ ∆ t > ∆ t , which isa contradiction. Now suppose k∇ F ( x t ) k ≥ ǫ g . By assumption on ∆ t , we have that∆ t ≤ κ ǫ H = ǫ , which implies that ǫ t ≤ ǫ . Since the function h ( a, b ) , − a + √ a + b ,for any fixed b >
0, is decreasing in a , and for any fixed a , is increasing in b ≥
0, we have h ( ǫ t , L (1 − η ) k∇ F ( x t ) k ) ≥ h ( ǫ , L (1 − η ) k∇ F ( x t ) k ) ≥ h ( ǫ , L (1 − η ) ǫ g ) ≥ h ( ǫ , L (1 − η ) ǫ H ) , which implies p ǫ t + 4 L (1 − η ) k∇ F ( x t ) k − ǫ t L ≥ κ ǫ H . As a result, since ∆ t ≤ min { κ ǫ g , κ ǫ H } , it must satisfy the condition of Lemma 3. Thisimplies that iteration t must have been accepted, which is a contradiction.The following lemma follows closely the line of reasoning in [11, Lemma 4.5]. Lemma 5 (Successful Iterations) . Consider any ǫ g , ǫ H > such that ǫ H ≤ √ ǫ g and let ǫ , α (1 − η ) νǫ H for some α ∈ (0 , . Further, suppose Condition 1 is satisfied with ǫ t ≤ max { ǫ , ∆ t } , where ∆ t is the trust region at the t-th iteration. Let T succ denote theset of all the successful iterations before Algorithm 1 stops. Then, under Assumption 1,Condition 2, the number of successful iterations is upper bounded by, |T succ | ≤ ( F ( x ) − F min ) η min { b κ ∆ , e κ ∆ } · max { ǫ − g ǫ − H , ǫ − H } where b κ ∆ , κ ∆ / , e κ ∆ , νκ / , and κ ∆ is as defined in Lemma 4.Proof. Suppose Algorithm 1 doesn’t terminate at the t-th iteration. Then we have either k∇ F ( x t ) k ≥ ǫ g or λ min (∆ F ( x t )) ≤ − ǫ H . In the first case, from (9a), we have − m t ( s t ) ≥ ǫ g (cid:26) ǫ g K H , ∆ t (cid:27) ≥ ǫ g (cid:26) ǫ g K H , κ ∆ ǫ g , κ ∆ ǫ H (cid:27) ≥ b κ ∆ ǫ g min { ǫ g , ǫ H } , where κ ∆ is as defined in Lemma 4. Similarly, in the second case, from (9b), we obtain − m t ( s t ) ≥ ν | λ min ( H t ) | ∆ t ≥ νκ ǫ H min { ǫ g , ǫ H } = e κ ∆ ǫ H min { ǫ g , ǫ H } . F ( x t ) is monotonically decreasing, we have F ( x ) − F min ≥ ∞ X t =0 F ( x t ) − F ( x t +1 ) ≥ X t ∈T succ F ( x t ) − F ( x t +1 ) ≥ η X t ∈T succ min (cid:8)b κ ∆ ǫ g min { ǫ g , ǫ H } , e κ ∆ ǫ H min { ǫ g , ǫ H } (cid:9) ≥ |T succ | η min { b κ ∆ , e κ ∆ } min { ǫ g ǫ H , ǫ H } . Hence, we have |T succ | ≤ ( F ( x ) − F min ) max { ǫ − g ǫ − H , ǫ − H } / ( η min { b κ ∆ , e κ ∆ } ).Now we are ready to present the final complexity in Theorem 1. Theorem 1 (Optimal Complexity of Algorithm 1) . Consider any ǫ g , ǫ H > such that ǫ H ≤ √ ǫ g and let ǫ , α (1 − η ) νǫ H for some α ∈ (0 , where η is a hyper-parameter inAlgorithm 1, and ν is as in (9b) . Suppose the inexact Hessian, H ( x ) , satisfies Condition1 with the approximation tolerance, ǫ t , in (3a) as ǫ t ≤ max { ǫ , ∆ t } , (11) where ∆ t is the trust region at the t-th iteration. For Problem ( P0 ) , under Assumption1 and Condition 2, Algorithm 1 terminates after at most T ∈ O (cid:0) max { ǫ − g ǫ − H , ǫ − H } (cid:1) iterations.Proof. Suppose Algorithm 1 terminates at the t-th iteration. Let T succ and T fail denote thesets of all the successful and unsuccessful iterations, respectively. Then T = |T succ | + |T fail | and ∆ T = ∆ γ |T succ |− | T fail | , where γ is a hyper-parameter of Algorithm 1. From Lemma 4,we have ∆ T ≥ κ ∆ min { ǫ g , ǫ H } . Hence, ( |T succ | − |T fail | ) log γ ≥ log ( κ ∆ · min { ǫ g , ǫ H } / ∆ ),which implies |T fail | ≤ log (∆ / ( κ ∆ · min { ǫ g , ǫ H } )) / log γ + |T succ | . Combine the resultfrom Lemma 5, we have the total iteration complexity as T ≤ γ log (cid:18) ∆ κ ∆ · min { ǫ g , ǫ H } (cid:19) + 2( F ( x ) − F min ) η min { b κ ∆ , e κ ∆ } · max { ǫ − g ǫ − H , ǫ − H }∈ O (cid:0) max { ǫ − g ǫ − H , ǫ − H } (cid:1) , where κ ∆ , b κ ∆ , e κ ∆ are defined in the proofs of Lemmas 4 and 5, respectively.As it can be seen, the worst-case total number of iterations required by Algorithm 1before termination, matches the optimal iteration complexity obtained in [11]. 
Further-more, from (3a), it follows that upon termination of Algorithm 1 after T iterations, inaddition to k∇ F ( x T ) k ≤ ǫ g , we have λ min ( ∇ F ( x T )) ≥ − ( ǫ H + ǫ T ), i.e., the obtainedsolution satisfies ( ǫ g , ǫ T + ǫ H )-Optimality as in (1).For Algorithm 1, the Hessian approximation tolerance ǫ t is allowed to be chosen per-iteration as ǫ t ≤ O (max { ǫ H , ∆ t } ). This way, when ∆ t is large (e.g., at the beginning ofiterations), one can employ crude Hessian approximations. As iterations progress towardsoptimality, ∆ t can get very small, in which case Hessian accuracy is set in the order of ǫ H . Note that by Lemma 4, we are always guaranteed to have ∆ t ∈ Ω (min { ǫ g , ǫ H } ). Asa result, when ǫ g ≪ ǫ H , e.g., ǫ H = ǫ g = ǫ , we can have that ∆ t ≪ ǫ H . In such cases,the choice ǫ t ≤ O (max { ǫ H , ∆ t } ) ensures that the Hessian approximation tolerance nevergets unnecessarily too small. 12 .2 Adaptive Cubic Regularization with Inexact Hessian Similar to Section 2.1, in this section, we present the algorithm and its correspondingconvergence results for the case of adaptive cubic regularization with inexact Hessian. Inparticular, Algorithm 2 depicts a variant of ARC algorithm where at each iteration t , theinexact approximation, H t , is constructed according to Condition 1. Here, unlike Section2.1, we were unable to provide convergence guarantees with adaptive tolerance in (3a)and as result, ǫ is set fixed a priori to a sufficiently small value, i.e., ǫ ∈ O ( √ ǫ g , ǫ H ) toguarantee ( ǫ g , ǫ H )-optimality. Algorithm 2
Adaptive Cubic Regularization with Inexact Hessian Input:
Starting point x , initial regularization 0 < σ < ∞ , hyper-parameters ǫ g , ǫ H , η ∈ (0 , , γ > for t = 0 , , . . . do Set the approximating Hessian, H t , as in (3) if k∇ F ( x t ) k ≤ ǫ g , λ min ( H t ) ≥ − ǫ H then Return x t . end if Solve the sub-problem approximately s t ≈ arg min s ∈ R d m t ( s ) , h∇ F ( x t ) , s i + 12 h s , H t s i + σ t k s k (12) Set ρ t , F ( x t ) − F ( x t + s t ) − m t ( s t ) if ρ t ≥ η then x t +1 = x t + s t σ t +1 = σ t /γ else x t +1 = x t σ t +1 = γσ t end if end for Output: x t Similar to Algorithm 1, here we also require that the sub-problem (12) in Algorithm 2is solved only approximately. Although similar inexact solutions to the sub-problem (12)by using Cauchy and Eigenpoint has been considered in several previous work, e.g., [11],here we provide refined conditions which prove to be instrumental in obtaining iterationcomplexities with the relaxed Hessian approximation (3a), as opposed to the strongerCondition (5c).
Condition 3 (Sufficient Descent Cauchy & Eigen Directions) . Assume that we solve the ub-problem (12) approximately to find s t such that − m t ( s t ) ≥ − m t ( s Ct ) ≥ max ( k s Ct k (cid:18)q K H + 4 σ t k∇ F ( x t ) k − K H (cid:19) , k∇ F ( x t ) k √ ( k∇ F ( x t ) k K H , s k∇ F ( x t ) k σ t )) , (13a) − m t ( s t ) ≥ − m t ( s Et ) ≥ ν | λ min ( H t ) | (cid:26) k s Et k , ν | λ min ( H t ) | σ t (cid:27) , if λ min ( H t ) < . (13b) Here m t ( · ) is defined in (12) , s Ct (Cauchy point) is along negative gradient direction and s Et is along approximate negative curvature direction such that h s Et , H t s Et i ≤ νλ min ( H t ) k s Et k < for some ν ∈ (0 , (see Appendix B for a way to efficiently compute s Et ). Note that Condition (13) describes the quality of the descent obtained by Cauchy andEigen directions more accurately than is usually found in similar literature. A naturalway to ensure that the approximate solution to the sub-problem (12) satisfies (13), isby replacing the unconstrained high-dimensional sub-problem (12) with the followingconstrained but lower-dimensional problem, in which the search space is reduced to atwo-dimensional sub-space containing vectors s Ct , and s Et , i.e., s t = arg min s ∈ Span { s Ct , s Et } h∇ F ( x t ) , s i + 12 h s , H t s i + σ t k s k . Note that, if U ∈ R d × p is an orthogonal basis for the sub-space “Span { s Ct , s Et } ”, by alinear transformation, we can turn the above sub-problem into an unconstrained problemas v t = arg min v ∈ R p h U T ∇ F ( x t ) , v i + 12 h v , U T H t Uv i + σ t k v k , and set s t = Uv t . As before, any larger dimensional sub-space P for which we haveSpan { s Ct , s Et } ⊆ P would also ensure (13), and, indeed, implies a more accurate solutionto our original sub-problem (12).Lemmas 6 and 7 describe the model reduction obtained by Cauchy and eigen pointsas required by Condition (3). Lemma 6 (Descent with Cauchy Direction) . Consider the Cauchy direction as s Ct = − α ∇ F ( x t ) where α = arg min b α ≥ m t ( − b α ∇ F ( x t )) . We have − m t ( s Ct ) ≥ max ( k s Ct k (cid:18)q K H + 4 σ t k∇ F ( x t ) k − K H (cid:19) , k∇ F ( x t ) k √ ( k∇ F ( x t ) k K H , s k∇ F ( x t ) k σ t )) . Proof.
For any b α ≥
0, we have m t ( − b α ∇ F ( x t )) ≤ m t ( b α ∇ F ( x t )), which implies α =arg min b α ∈ R m t ( − b α ∇ F ( x t )). Hence, we have −k∇ F ( x t ) k + α h∇ F ( x t ) , H t ∇ F ( x t ) i + σ t α k∇ F ( x t ) k = 0. We can find explicit formula for such α by finding the roots of14he quadratic function r ( α ) = σ t k∇ F ( x t ) k α + h∇ F ( x t ) , H t ∇ F ( x t ) i α − k∇ F ( x t ) k .Hence, we must have α = −h∇ F ( x t ) , H t ∇ F ( x t ) i + q(cid:0) h∇ F ( x t ) , H t ∇ F ( x t ) i (cid:1) + 4 σ t k∇ F ( x t ) k σ t k∇ F ( x t ) k ≥ . It follows that2 ασ t k∇ F ( x t ) k = s(cid:18) h∇ F ( x t ) , H t ∇ F ( x t ) ik∇ F ( x t ) k (cid:19) + 4 σ t k∇ F ( x t ) k − h∇ F ( x t ) , H t ∇ F ( x t ) ik∇ F ( x t ) k . Consider the function h ( x ; β ) = p x + β − x . It is easy to verify that, for β ≥ h ( x ) isdecreasing function of x . Now since h∇ F ( x t ) , H t ∇ F ( x t ) i ≤ K H k∇ F ( x t ) k , we get k s Ct k = α k∇ F ( x t ) k ≥ σ t (cid:20)q K H + 4 σ t k∇ F ( x t ) k − K H (cid:21) . (14)Now, from [11, Lemma 2.1], we get − m t ( s Ct ) ≥ σ t k s Ct k k s Ct k ασ t k∇ F ( x t ) k ≥ k s Ct k
12 ( q K H + 4 σ t k∇ F ( x t ) k − K H ) . Alternatively, following the proof of [9, Lemma 2.1], for any α ≥
0, we get m t ( s Ct ) ≤ m t ( − α ∇ F ( x t ))= − α k∇ F ( x t ) k + 12 α h∇ F ( x t ) , H t ∇ F ( x t ) i + α σ t k∇ F ( x t ) k ≤ α k∇ F ( x t ) k (cid:0) − αK H + 2 α σ t k∇ F ( x t ) k (cid:1) . Consider the quadratic polynomial r ( α ) = 2 α σ t k∇ F ( x t ) k +3 αK H −
6. We have r ( α ) ≤ α ∈ [0 , ¯ α ], where¯ α = − K H + p K H + 48 σ t k∇ F ( x t ) k σ t k∇ F ( x t ) k = 12 (cid:16) K H + p K H + 48 σ t k∇ F ( x t ) k (cid:17) . Note that p K H + 48 σ t k∇ F ( x t ) k ≤ √ n K H , p σ t k∇ F ( x t ) k o and trivially 3 K H ≤ √ n K H , p σ t k∇ F ( x t ) k o . Hence, defining α , / ( √ { K H , p σ t k∇ F ( x t ) k} ),it is easy to see that 0 < α ≤ ¯ α . With this α , we get r ( α ) ≤ / / √ − ≤ − . Therefore m t ( s t ) ≤ − k∇ F ( x t ) k √ n K H , p σ t k∇ F ( x t ) k o = −k∇ F ( x t ) k √ k∇ F ( x t ) k K H , s k∇ F ( x t ) k σ t . emma 7 (Descent with Negative Curvature) . Suppose λ min ( H t ) < . For some ν ∈ (0 , , define s Et = α u t , where α = arg min b α ∈ R m t ( b α u t ) , and h u t , H t u t i ≤ νλ min ( H t ) k u t k < . We have − m t ( s Et ) ≥ ν | λ min ( H t ) | (cid:26) k s Et k , ν | λ min ( H t ) | σ t (cid:27) . Proof.
By the first-order necessary optimality condition of α , we get h∇ F ( x t ) , u t i + α h u t , H t u t i + σ t α k u t k = 0, which implies h∇ F ( x t ) , s Et i + h s Et , H t s Et i + σ t k s Et k = 0.Next, since α is a minimizer of m t ( b α u t ), we have m t ( α u t ) ≤ m t ( − α u t ), which implies h∇ F ( x t ) , s Et i ≤
0. Hence, we also obtain h s Et , H t s Et i + σ t k s Et k ≥
0. From [11, Lemma 2.1],we get − m t ( s Et ) ≥ σ t k s Et k / (cid:0) −h∇ F ( x t ) , s Et i − h s Et , H t s Et i (cid:1) / ≥ ν | λ min ( H t ) |k s Et k / σ t k s Et k ≥ − h s Et , H t s Et ik s Et k ≥ ν | λ min ( H t ) | , (15)which gives σ t k s Et k ≥ ν | λ min ( H t ) |k s Et k and σ t k s Et k ≥ ν σ − t | λ min ( H t ) | . Hence, we have − m t ( s Et ) ≥ σ t k s Et k / ≥ ν | λ min ( H t ) |k s Et k / − m t ( s Et ) ≥ σ t k s Et k / ≥ ν σ − t | λ min ( H t ) | / Lemma 8.
Given Assumption 1 and Condition 1, we have F ( x t + s t ) − F ( x t ) − m t ( s t ) ≤ (cid:18) L − σ t (cid:19) k s t k + ǫ k s t k . Proof.
Apply Mean Value Theorem on F at x t gives F ( x t + s t ) = F ( x t ) + ∇ F ( x t ) T s t + s Tt ∇ F ( ξ t ) s t , for some ξ t in the segment of [ x t , x t + s t ]. Now, it follows that F ( x t + s t ) − F ( x t ) − m t ( s t ) = 12 s Tt ( ∇ F ( ξ t ) − H t ) s t − σ t k s t k = 12 s Tt ( ∇ F ( ξ t ) − ∇ F ( x t ) + ∇ F ( x t ) − H t ) s t − σ t k s t k ≤ s Tt ( ∇ F ( ξ t ) − ∇ F ( x t )) s t + 12 s Tt ( ∇ F ( x t ) − H t ) s t − σ t k s t k ≤ L k s t k + 12 ǫ k s t k − σ t k s t k ≤ (cid:18) L − σ t (cid:19) k s t k + ǫ k s t k . Lemma 9.
Given Assumption 1, Conditions 1 and 3, suppose σ t ≥ L, ǫ ≤ min (cid:26) (cid:18)q K H + 8 Lǫ g − K H (cid:19) , νǫ H γ (cid:27) . Then, we have (cid:18) L − σ t (cid:19) k s t k + ǫ k s t k ≤ ǫ k s Ct k , ǫ k s Et k , If λ min ( H t ) ≥ − ǫ H . roof. First consider k s Ct k for which we have two cases. i. If k s t k ≤ k s Ct k , then from assumption on σ t , it immediately follows that (cid:18) L − σ t (cid:19) k s t k + ǫ k s t k ≤ ǫ k s t k ≤ ǫ k s Ct k . ii. If k s t k ≥ k s Ct k , since L ≤ σ t /
2, then (cid:18) L − σ t (cid:19) k s t k + ǫ k s t k ≤ − σ t k s t k + ǫ k s t k ≤ (cid:16) − σ t (cid:13)(cid:13) s Ct (cid:13)(cid:13) + ǫ (cid:17) k s t k ≤ − p K H + 8 Lǫ g − K H
24 + ǫ ! k s t k ≤ ≤ ǫ (cid:13)(cid:13) s Ct (cid:13)(cid:13) . The second last inequality follows from (14).Similarly, for k s Et k , we have two cases. i. If k s t k ≤ k s Et k , then from assumption on σ t , it immediately follows that (cid:18) L − σ t (cid:19) k s t k + ǫ k s t k ≤ ǫ k s t k ≤ ǫ k s Et k . ii. If k s t k ≥ k s Et k , since L ≤ σ t /
2, then (cid:18) L − σ t (cid:19) k s t k + ǫ k s t k ≤ − σ t k s t k + ǫ k s t k ≤ − σ t (cid:13)(cid:13) s Et (cid:13)(cid:13) k s t k + ǫ k s t k ≤ − νǫ H k s t k + ǫ k s t k ≤ < ǫ (cid:13)(cid:13) s Et (cid:13)(cid:13) . The second last inequality follows from (15) and the last line follows from ǫ ≤ νǫ H . Lemma 10.
Given Assumption 1, Conditions 1 and 3, suppose at the t-th iteration, λ min ( H t ) < − ǫ H , σ t ≥ L , and ǫ ≤ min { / , (1 − η ) / } νǫ H . Then, the t-th iteration issuccessful, i.e. σ t +1 = σ t /γ .Proof. From (13b), Lemma 8, Lemma 9, as well as assumptions on σ t and ǫ , we have1 − ρ t = F ( x t + s t ) − F ( x t ) − m t ( s t ) − m t ( s t ) ≤ ( L/ − σ t / k s t k + ǫ k s t k / ν | λ min ( H t ) |k s Et k / ≤ ǫ k s Et k ν | λ min ( H t ) |k s Et k ≤ ǫνǫ H ≤ − η. Hence, ρ t ≥ η , and the iteration is successful. Lemma 11.
Given Assumption 1, Conditions 1 and 3, suppose at the t-th iteration, k∇ F ( x t ) k ≥ ǫ g , σ t ≥ L , and ǫ ≤ min (cid:26) , − η (cid:27) (cid:18)q K H + 8 Lǫ g − K H (cid:19) . Then, the t-th iteration is successful, i.e. σ t +1 = σ t /γ . roof. First note that, from (13a), we have − m t ( s t ) ≥ − m t ( s Ct ) ≥ k s Ct k (cid:18)q K H + 4 σ t k∇ F ( x t ) k − K H (cid:19) . Hence, again, by (13a), Lemma 8 and 9, it follows that1 − ρ t = F ( x t + s t ) − F ( x t ) − m t ( s t ) − m t ( s t ) ≤ (cid:0) L − σ t (cid:1) k s t k + ǫ k s t k − m t ( s Ct ) ≤ ǫ k s Ct k k s Ct k (cid:16)p K H + 4 σ t k∇ F ( x t ) k − K H (cid:17) ≤ ǫ (cid:16)p K H + 4 σ t k∇ F ( x t ) k − K H (cid:17) ≤ ǫ (cid:16)p K H + 8 Lǫ g − K H (cid:17) ≤ − η. Hence, ρ t ≥ η , and the iteration is successful.Now we can upper bound the cubic regularization parameter before the algorithmterminates, as in Lemma 12. Lemma 12.
Consider Assumption 1, Conditions 1 and 3, and ǫ ≤ min (cid:26) min (cid:26) , − η (cid:27) (cid:18)q K H + 8 Lǫ g − K H (cid:19) , min (cid:26) , − η (cid:27) νǫ H (cid:27) , (16) where ν, L, K H are, respectively, defined as in (13b) , (6a) , (3b) , and η is a hyper-parameter of Algorithm 2. For Algorithm 2 we have for all t , σ t ≤ max { σ , γL } .Proof. We prove by contradiction. Assume the t-th iteration is the first unsuccessfuliteration such that σ t +1 = γσ t ≥ γL , which implies that σ t ≥ L . However, according toLemmas 10 and 11, respectively, if λ min ( H t ) < − ǫ H or k∇ F ( x t ) k ≥ ǫ g , then the iterationis successful and hence we must have σ t +1 = σ t /γ ≤ σ t , which is a contradiction.Now, similar to [11, Lemma 2.8], we can get the following result about the estimateof the total number of successful iterations before algorithm terminates. Lemma 13 (Success Iterations) . Given Assumption 1, Conditions 1 and 3, let T succ denote the set of all the successful iterations before Algorithm 2 stops. The number ofsuccessful iterations is upper bounded by, |T succ | ≤ ( F ( x ) − F min ) ηκ σ · max { ǫ − g , ǫ − H } , where κ σ , min n ν / (24 γ L ) , min n /K H , p / (2 γL ) o / (2 √ o .Proof. Suppose Algorithm 2 doesn’t terminate at the t-th iteration. Then either we have k∇ F ( x t ) k ≥ ǫ g or λ min ( ∇ H t ) ≤ − ǫ H . In the first case, (13a) and Lemma 12 gives − m t ( s t ) ≥ k∇ F ( x t ) k √ k∇ F ( x t ) k K H , s k∇ F ( x t ) k σ t ≥ ǫ g √ (cid:26) K H , r γL (cid:27) . λ min ( ∇ H t ) ≤ − ǫ H , from (13b) and Lemma 12, we obtain − m t ( s t ) ≥ ν | λ min ( H t ) | / (6 σ t ) ≥ ν ǫ H / (24 γ L ).Since F ( x t ) is monotonically decreasing, we have F ( x ) − F min ≥ ∞ X t =0 F ( x t ) − F ( x t +1 ) ≥ X t ∈T succ F ( x t ) − F ( x t +1 ) ≥ − η X t ∈T succ m t ( s t ) ≥ η |T succ | min (cid:26) ν ǫ H γ L , ǫ g √ (cid:26) K H , r γL (cid:27)(cid:27) ≥ |T succ | ηκ σ min { ǫ g , ǫ H } . Now we show the final complexity bounds of Algorithm 2 in Theorem 2.
Theorem 2 (Complexity of Algorithm 2) . Consider any < ǫ g , ǫ H < . Suppose theinexact Hessian, H ( x ) , satisfies Condition 1 with the approximation tolerance, ǫ , in (3a) as (16) . For Problem ( P0 ) , under Assumption 1 and Condition 3, Algorithm 2 terminatesafter at most T ∈ O (cid:0) max { ǫ − g , ǫ − H } (cid:1) iterations.Proof. Suppose Algorithm 2 terminates at the t-th iteration. Let T succ and T fail denotethe sets of all the successful and unsuccessful iterations, respectively. Then T = |T succ | + |T fail | and σ T = σ γ | T fail | −|T succ | . From Lemma 12, we have σ T ≤ γL . Hence, |T fail | ≤ log (2 γL/σ ) / log γ + |T succ | , which, using Lemma 13 gives the total iteration complexityas T ≤ log (2 γL/σ ) / log γ + 2( F ( x ) − F min ) · max { ǫ − g , ǫ − H } / ( ηκ σ ) , where κ σ is defined in Lemma 13.In Theorem 2 (as well as Theorem 3 below), we require ǫ ∈ O ( √ ǫ g , ǫ H ). This can berather strict and computationally unattractive, unless either crude solutions are required(e.g., in most machine learning applications very rough solutions are encouraged to avoidover-fitting), or the inexact Hessian is formed from a sub-set of data that is significantlysmaller than the original dataset (e.g., see Section 3 in the context of big-data regimeswhere n ≫ |S| ≪ n ). Nonetheless, the theoretical existence of such tolerance,though small, implies a certain level of robustness of the algorithm, i.e., the complexityof the algorithm is not adversely affected by small errors in Hessian computations.We note that, for iterations where ǫ ≪ k s t k , (3a) is indeed a more stringent conditionthan (5c). As iterations progress towards optimality, step-size can become small, in whichcase (3a) might be theoretically more preferable. Nonetheless, beyond a direct theoreticalcomparison among various Hessian approximation bounds in terms of their tightness, themain advantage of (3a) should be regarded in light of its simplicity, which allows fordirect constructions of H t with a priori guarantees.Condition 3 seems to be the bare minimum required to guarantee convergence to anapproximate second-order criticality. Intuitively, however, if an approximate solution tothe sub-problem (12) satisfies more than (13), i.e., if we solve (12) more exactly than justrequiring (13), one could expect to be able to improve upon the iteration complexity ofTheorem 2. Indeed, suppose we solve the reduced sub-problem on progressively embeddedsub-spaces with increasingly higher dimensions, all of which including “Span { s Ct , s Et } ”,and stop when the corresponding solution s t satisfies the following conditions.19 ondition 4 (Sufficient Descent for Optimal Complexity) . Assume that we solve thesub-problem (12) approximately to find s t such that, in addition to (13) , we have k∇ m t ( s t ) k ≤ ζ max (cid:8) k s t k , θ t k∇ F ( x t ) k (cid:9) , θ t , min { , k s t k} , (17) for some prescribed ζ ∈ (0 , . Here, m t ( · ) is defined in (12) . Conditions on the inexactness of the sub-problems were initially pioneered in [9, 10,11]. However, the main drawback for these conditions is that the inexactness toleranceis closely tied with the magnitude of the gradient. More specifically, when gradient issmall, e.g., near saddle points, the sub-problems are required to be solved exceedinglymore accurately. In fact, at a saddle point where k∇ F ( x t ) k = 0, these conditions imply anexact solution to the sub-problem. 
To the best of our knowledge, Condition 4 representsa novel criterion, which offers the best of both worlds: when gradient is large, we allow forcrude solutions to the sub-problem, but near saddle-points where the gradient is small,inexactness will be determined by the step length, which can be significantly larger thanthe gradient. Using Condition 4, we can obtain the optimal iteration complexity forAlgorithm 2, as shown in Theorem 3. First, we prove the following two lemmas whichwill be used later for the proof of Theorem 3. Lemma 14.
Suppose k∇ F ( x t ) k ≥ ǫ g . Given Assumption 1 and Condition 3, let (3a) hold with ǫ t = min { ǫ, ζ k∇ F ( x t ) k} where ǫ is as in (16) and ζ ∈ (0 , / . Furthermore,suppose (12) is solved such that Condition 4 eventually holds. Then, we have k s t k ≥ κ g p k∇ F ( x t +1 ) k , where κ g , − ζ )((1 + 4 γ ) L + 2 max { ( ǫ + ζ max { , K } ) , ζ max { , K }} ) . Proof.
First, suppose k s t k ≤ θ t k∇ F ( x t ) k . Using Condition 4, we get k∇ F ( x t +1 ) k ≤k∇ F ( x t +1 ) − ∇ m t ( s t ) k + k∇ m t ( s t ) k ≤ k∇ F ( x t +1 ) − ∇ m t ( s t ) k + θ t k∇ F ( x t ) k . Notingthat ∇ m t ( s t ) = ∇ F ( x t ) + H t s t + σ t k s t k s t , and using Mean Value Theorem for vector-valued functions, (6a) and (3a), we get k∇ F ( x t +1 ) − ∇ m t ( s t ) k ≤ k Z ∇ F ( x t + τ s t ) s t dτ − H t s t k + σ t k s t k ≤ k Z (cid:0) ∇ F ( x t + τ s t ) − ∇ F ( x t ) (cid:1) s t dτ + (cid:0) ∇ F ( x t ) − H t (cid:1) s t k + σ t k s t k ≤ k s t k Z k∇ F ( x t + τ s t ) − ∇ F ( x t ) k dτ + k (cid:0) ∇ F ( x t ) − H t (cid:1) s t k + σ t k s t k ≤ L k s t k Z τ dτ + ǫ t k s t k + σ t k s t k ≤ (cid:18) L γL (cid:19) k s t k + ǫ t k s t k , where the last equality follows from Lemma 12. From (6b), it follows that k∇ F ( x t ) k ≤ K k s t k + k∇ F ( x t +1 ) k . (18)As such, using θ t ≤ ζ from Condition 4 as well as the assumption on ǫ t , we get k∇ F ( x t +1 ) k ≤ (cid:18) L γL (cid:19) k s t k + ǫ t k s t k + θ t K k s t k + θ t k∇ F ( x t +1 ) k≤ (cid:18) L γL (cid:19) k s t k + ǫ t k s t k + θ t K k s t k + ζ k∇ F ( x t +1 ) k , − ζ ) k∇ F ( x t +1 ) k ≤ ( L/ γL ) k s t k + ( ǫ t + θ t K ) k s t k . Now usingCondition 4, we consider two cases: i. If k s t k ≥
1, then we get ( ǫ t + θ t K ) k s t k ≤ ( ǫ t + θ t K ) k s t k ≤ ( ǫ + ζ K ) k s t k . Hence, itfollows that (1 − ζ ) k∇ F ( x t +1 ) k ≤ ( L/ γL + ( ǫ + ζ K )) k s t k . ii. If k s t k ≤
1, then from assumption on ǫ t and (18) , we have ǫ t k s t k ≤ ζ k∇ F ( x t ) kk s t k ≤ ζ ( K k s t k + k∇ F ( x t +1 ) kk s t k ) ≤ ζ ( K k s t k + k∇ F ( x t +1 ) k ). Now by assumption on θ t ,we get ( ǫ t + θ t K ) k s t k = ǫ t k s t k + θ t K k s t k ≤ ζ K k s t k + ζ k∇ F ( x t +1 ) k , which, in turn,implies that (1 − ζ ) k∇ F ( x t +1 ) k ≤ ( L/ γL + 2 ζ K ) k s t k .Now suppose, k s t k ≥ θ t k∇ F ( x t ) k . As above, we have k∇ F ( x t +1 ) k ≤ k∇ F ( x t +1 ) −∇ m t ( s t ) k + k∇ m t ( s t ) k ≤ ( L/ γL + ζ ) k s t k + ǫ t k s t k . If k s t k ≥
1, we have ǫ t k s t k ≤ ǫ k s t k , which gives k∇ F ( x t +1 ) k ≤ ( L/ γL + ζ + ǫ ) k s t k . Otherwise, if k s t k ≤ k s t k ≥ θ t k∇ F ( x t ) k implies that k s t k ≥ k∇ F ( x t ) k . From assumption on ǫ t ,it follows that ǫ t k s t k ≤ ζ k∇ F ( x t ) kk s t k ≤ ζ k s t k , which in turn gives k∇ F ( x t +1 ) k ≤ ( L/ γL + 2 ζ ) k s t k . Lemma 15 (Success Iterations: Optimal Case) . Let T succ , { t ; k∇ F ( x t ) k ≥ ǫ g ∨ λ min ( H t ) ≤ − ǫ H } , be the set of all successful iterations, before Algorithm 2 terminates. Under the conditionsof Lemma 14, we must have |T succ | ∈ O (max { ǫ − H , ǫ − / g } ) .Proof. From (13b) and Lemma 12, if λ min ( ∇ H t ) ≤ − ǫ H , it follows that − m t ( s t ) ≥ ν | λ min ( H t ) | / (6 σ t ) ≥ ν ǫ H / (24 γ L ). Note that T succ = T S T S T , where T , { t ∈ T succ ; k∇ F ( x t +1 ) k ≥ ǫ g } , T , { t ∈ T succ ; k∇ F ( x t +1 ) k ≤ ǫ g and λ min ( H t +1 ) ≤ − ǫ H }T , { t ∈ T succ ; k∇ F ( x t +1 ) k ≤ ǫ g and λ min ( H t +1 ) ≥ − ǫ H } . We bound each of these sets individually. Since F ( x t ) is monotonically decreasing, from[9, Lemma 3.3], σ t ≥ σ min , and Lemmas 12 and 14, we have F ( x ) − F min ≥ ∞ X t =0 F ( x t ) − F ( x t +1 ) ≥ X t ∈T F ( x t ) − F ( x t +1 ) ≥ − η X t ∈T m t ( s t ) ≥ η X t ∈T min (cid:26) ν ǫ H γ L , σ min k s t k (cid:27) ≥ η X t ∈T min (cid:26) ν ǫ H γ L , σ min κ g k∇ F ( x t +1 ) k / (cid:27) ≥ η X t ∈T min (cid:26) ν ǫ H γ L , σ min κ g ǫ / g (cid:27) ≥ η X t ∈T min (cid:26) ν γ L , σ min κ g (cid:27) min { ǫ H , ǫ / g } . Hence, |T | ≤ κ T succ max { ǫ − H , ǫ − / g } , where κ T succ , ( F ( x ) − F min ) max { γ L /ν , / ( σ min κ g ) } /η.
21s for T , we have F ( x ) − F min ≥ F ( x ) − F ( x ) + ∞ X t =0 F ( x t +1 ) − F ( x t +2 ) ≥ F ( x ) − F ( x ) + X t ∈T F ( x t +1 ) − F ( x t +2 ) ≥ F ( x ) − F ( x ) − η X t ∈T m t +1 ( s t +1 ) ≥ F ( x ) − F ( x ) + η X t ∈T ν ǫ H γ L . Hence, |T | ≤ κ T succ ǫ − H , where κ T succ , ( F ( x ) − F min )24 γ L / ( ην ). Finally, we have |T | = 1, because in such a case, the algorithm stops in one iteration. Putting thesebounds all together, we get |T succ | ≤ max { , κ T succ , κ T succ } max { ǫ − H , ǫ − / g } .Now we can obtain the optimal complexity bound of Algorithm 2 in Theorem 3. Theproof follows similarly as that of Theorem 2, and hence is omitted here. Theorem 3 (Optimal Complexity of Algorithm 2) . Consider any < ǫ g , ǫ H < . Supposethe inexact Hessian, H ( x ) , satisfies Conditions (3) with the approximation tolerance, ǫ ,in (3a) as ǫ = min { ǫ , ζ ǫ g } where ǫ is as in (16) , and ζ ∈ (0 , / . For Problem ( P0 ) and under Assumption 1, if the approximate solution to the sub-problem (12) satisfiesConditions 3 and 4, then Algorithm 2 terminates after at most T ∈ O (cid:16) max { ǫ − / g , ǫ − H } (cid:17) iterations. From (3a), upon termination of Algorithm 2, the obtained solution satisfies ( ǫ g , ǫ + ǫ H )-Optimality as in (1), i.e., k∇ F ( x T ) k ≤ ǫ g and λ min ( ∇ F ( x T )) ≥ − ( ǫ H + ǫ ). In this section, we give concrete and practical examples to demonstrate ways to constructthe approximate Hessian, which satisfies Condition 1. By considering finite-sum mini-mization , a ubiquitous problem arising frequently in machine learning, we showcase thepractical benefits of the proposed relaxed requirement (3a) for approximating Hessian,compared to the stronger alternative (5c). In Section 3.1, we describe randomized tech-niques to appropriately construct the approximate Hessian, followed by the convergenceanalysis of Algorithms 1 and 2 with such Hessian approximations in Section 3.2.
Indeed, a major advantage of (3a) over (5c) is that there are many approximation techniques that can produce an inexact Hessian satisfying (3a). Of particular interest in the present paper is the application of randomized matrix approximation techniques, which have recently shown great success in RandNLA for solving various numerical linear algebra tasks [22, 42, 60]. For this, we consider the highly prevalent finite-sum minimization problem ( P1 ) and employ random sampling as a way to construct approximations to the exact Hessian which are, probabilistically, guaranteed to satisfy (3a). Many machine learning and scientific computing applications involve finite-sum optimization problems of the form ( P1 ), where each $f_i$ is a loss (or misfit) function corresponding to the $i$-th observation (or measurement); e.g., see [7, 24, 47, 48, 50, 55] and references therein. Here, we consider ( P1 ) in the large-scale regime where $n, d \gg 1$.
In such settings, the mere evaluation of the Hessian and the gradient increases linearly in $n$. Indeed, for big-data problems, operations with the Hessian, e.g., the matrix-vector products involved in the (approximate) solution of the sub-problems (8) and (12), typically constitute the main computational bottleneck and, in particular when $n \gg 1$, are computationally prohibitive.
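To make this cost comparison concrete, the following minimal sketch (the interface and function names are ours and purely illustrative; `hess_fns` is assumed to be a list of callables returning the component Hessians $\nabla^2 f_i(x)$) contrasts an exact Hessian-vector product for a finite sum with a sub-sampled counterpart of the kind constructed below, which touches only $|\mathcal{S}|$ of the $n$ terms:

```python
import numpy as np

def hvp_full(hess_fns, x, v):
    """Exact Hessian-vector product for F = (1/n) * sum_i f_i:
    touches all n component Hessians."""
    n = len(hess_fns)
    return sum(h(x) @ v for h in hess_fns) / n

def hvp_subsampled(hess_fns, x, v, s, rng):
    """Unbiased sub-sampled Hessian-vector product: averages s = |S|
    uniformly drawn terms instead of all n."""
    n = len(hess_fns)
    idx = rng.integers(0, n, size=s)   # uniform sampling with replacement
    return sum(hess_fns[i](x) @ v for i in idx) / s
```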
For the special case of ( P1 ) in which each $f_i$ is convex, randomized sub-sampling has been shown to be effective in reducing such costs, e.g., [6, 49, 62]. We now show that such randomized approximation techniques can indeed be employed effectively in the non-convex settings considered in this paper.

In this light, suppose we have a probability distribution, $p = \{p_i\}_{i=1}^n$, over the set $\{1, 2, \ldots, n\}$, such that for each index $i = 1, \ldots, n$, we have $\Pr(i) = p_i > 0$ and $\sum_{i=1}^n p_i = 1$. Consider picking a sample of indices from $\{1, 2, \ldots, n\}$, at each iteration, randomly according to the distribution $p$. Let $\mathcal{S}$ and $|\mathcal{S}|$ denote the sample collection and its cardinality, respectively, and define
$$H(x) \triangleq \frac{1}{n|\mathcal{S}|}\sum_{j\in\mathcal{S}}\frac{1}{p_j}\nabla^2 f_j(x), \qquad (19)$$
to be the sub-sampled Hessian. In the big-data regime where $n \gg 1$, if $|\mathcal{S}| \ll n$, such sub-sampling can offer significant computational savings.
Now, suppose
$$\sup_{x\in\mathbb{R}^d}\big\|\nabla^2 f_i(x)\big\| \le K_i, \quad i = 1, 2, \ldots, n, \qquad (20a)$$
and define
$$K_{\max} \triangleq \max_{i=1,\ldots,n} K_i, \qquad (20b)$$
$$\hat{K} \triangleq \frac{1}{n}\sum_{i=1}^n K_i. \qquad (20c)$$
In this case, we can naturally consider the uniform distribution over $\{1, 2, \ldots, n\}$, i.e., $p_i = 1/n,\ \forall i$. Lemma 16 gives the sample size required for the inexact Hessian, $H(x)$, to probabilistically satisfy (3) when the indices are picked uniformly at random, with or without replacement.

Lemma 16 (Complexity of Uniform Sampling). Given (20a), (20b), and $0 < \epsilon, \delta < 1$, let
$$|\mathcal{S}| \ge \frac{16K_{\max}^2}{\epsilon^2}\log\frac{2d}{\delta}, \qquad (21)$$
where $K_{\max}$ is defined as in (20b). At any $x\in\mathbb{R}^d$, suppose we pick the elements of $\mathcal{S}$ uniformly at random, with or without replacement, and form $H(x)$ as in (19) with $p_i = 1/n,\ \forall i$. We have
$$\Pr\Big(\big\|H(x) - \nabla^2 F(x)\big\| \le \epsilon\Big) \ge 1 - \delta. \qquad (22)$$
Proof. Consider $|\mathcal{S}|$ random matrices $H_j(x)$, $j = 1, \ldots, |\mathcal{S}|$, such that $\Pr\big(H_j(x) = \nabla^2 f_i(x)\big) = 1/n,\ \forall i = 1, 2, \ldots, n$. Define $X_j \triangleq H_j - \nabla^2 F(x)$, $H \triangleq \sum_{j\in\mathcal{S}} H_j/|\mathcal{S}|$, and $X \triangleq \sum_{j\in\mathcal{S}} X_j = |\mathcal{S}|\big(H - \nabla^2 F(x)\big)$. Note that $\mathbb{E}(X_j) = 0$, and for $H_j = \nabla^2 f_1(x)$ we have
$$\|X_j\| = \Big\|\frac{n-1}{n}\nabla^2 f_1(x) - \sum_{i=2}^n\frac{1}{n}\nabla^2 f_i(x)\Big\| \le \frac{2(n-1)}{n}K_{\max} \le 2K_{\max}.$$
Hence, we can apply the Operator-Bernstein inequality [35, Theorem 1] to get
$$\Pr\Big(\big\|H - \nabla^2 F(x)\big\| \ge \epsilon\Big) = \Pr\big(\|X\| \ge \epsilon|\mathcal{S}|\big) \le 2d\exp\big\{-\epsilon^2|\mathcal{S}|/(16K_{\max}^2)\big\}.$$
Now (21) ensures that $2d\exp\{-\epsilon^2|\mathcal{S}|/(16K_{\max}^2)\} \le \delta$, which gives (22).

Indeed, if (22) holds, then (3a) follows with the same probability. In addition, if $H$ is constructed according to Lemma 16, it is easy to see that (3b) is satisfied with $K_H = K_{\max}$ (in fact, this is a deterministic statement). These two, together, imply that $H$ satisfies Condition 1 with probability $1-\delta$.
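As a sketch of how Lemma 16 might be used in practice (the helper names are ours, and the component Hessians are again assumed available as callables), the sample size (21) and the estimator (19) with $p_i = 1/n$ can be computed as follows:

```python
import numpy as np

def uniform_sample_size(k_max, eps, delta, d):
    """Sample size prescribed by (21): 16 K_max^2 / eps^2 * log(2d/delta)."""
    return int(np.ceil(16.0 * k_max**2 / eps**2 * np.log(2.0 * d / delta)))

def subsampled_hessian_uniform(hess_fns, x, eps, delta, k_max, rng):
    """H(x) as in (19) with p_i = 1/n; by Lemma 16, with probability
    at least 1 - delta, the spectral-norm error is at most eps."""
    n, d = len(hess_fns), x.shape[0]
    s = uniform_sample_size(k_max, eps, delta, d)
    idx = rng.integers(0, n, size=s)       # uniform, with replacement
    H = np.zeros((d, d))
    for i in idx:
        H += hess_fns[i](x)                # the 1/p_i = n factor cancels the 1/n
    return H / s
```

Note that the prescribed $|\mathcal{S}|$ is independent of $n$, which is precisely what makes this construction attractive when $n \gg 1$.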
A Special Case: In certain settings, one might be able to construct a more "informative" distribution, $p$, over the indices in the set $\{1, 2, \ldots, n\}$, as opposed to oblivious uniform sampling. In particular, it might be advantageous to bias the probability distribution towards picking indices corresponding to those $f_i$'s which are more relevant, in a certain sense, in forming the Hessian. If this is possible, then we can expect to require a smaller sample size than with oblivious uniform sampling. One such setting is the finite-sum optimization of the form ( P2 ), which is indeed a special case of ( P1 ) and arises often in many machine learning problems [51]. It is easy to see that the Hessian of $F$ in this case can be written as
$$\nabla^2 F(x) = A^T B A = \frac{1}{n}\sum_{i=1}^n f_i''(a_i^T x)\, a_i a_i^T,$$
where
$$A^T = \begin{pmatrix} | & | & & | \\ a_1 & a_2 & \cdots & a_n \\ | & | & & | \end{pmatrix} \in \mathbb{R}^{d\times n}, \qquad B = \frac{1}{n}\begin{pmatrix} f_1''(a_1^T x) & & & \\ & f_2''(a_2^T x) & & \\ & & \ddots & \\ & & & f_n''(a_n^T x)\end{pmatrix} \in \mathbb{R}^{n\times n}.$$
Now let $S \in \mathbb{R}^{n\times|\mathcal{S}|}$ be the sampling matrix and define the approximate Hessian as $H \triangleq A^T S S^T B A$. It can be seen that approximating the Hessian matrix $\nabla^2 F(x) = A^T B A$ can be regarded as approximating a matrix-matrix product, as in RandNLA [42, 60]. For this, consider the sampling distribution
$$p_i = \frac{\big|f_i''(a_i^T x)\big|\,\|a_i\|^2}{\sum_{j=1}^n \big|f_j''(a_j^T x)\big|\,\|a_j\|^2}. \qquad (23)$$
Note that the absolute values are needed since, for non-convex $f_i$, we might have $f_j''(a_j^T x) < 0$ (if $f_j''(a_j^T x) \ge 0$ for all $j$, one can obtain stronger guarantees than Lemmas 16 and 17; see [62]).
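This construction can be sketched directly. In the snippet below (the interface is ours; `d2f` is assumed to return the vector of values $f_i''(a_i^T x)$), the estimator is accumulated as a weighted sum of sampled rank-one terms, which is exactly $A^T S S^T B A$:

```python
import numpy as np

def subsampled_hessian_nonuniform(A, d2f, x, s, rng):
    """Non-uniform sub-sampled Hessian for (P2).

    A   : (n, d) data matrix with rows a_i
    d2f : vectorized callable returning f_i''(t) for a vector of t's
    s   : sample size |S|
    Samples indices from (23) and returns (1/s) * sum_j (b_j / p_j) a_j a_j^T.
    """
    n, d = A.shape
    b = d2f(A @ x) / n                              # diagonal of B
    w = np.abs(b) * np.einsum('ij,ij->i', A, A)     # |b_i| * ||a_i||^2
    p = w / w.sum()                                 # sampling distribution (23)
    idx = rng.choice(n, size=s, p=p)
    H = np.zeros((d, d))
    for i in idx:
        H += (b[i] / p[i]) * np.outer(A[i], A[i])
    return H / s
```

Each sampled term $b_i a_i a_i^T / p_i$ has expectation $A^T B A$, so the estimator is unbiased by construction.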
Using the non-uniform sampling distribution (23), Lemma 17 gives the sampling complexity for the approximate Hessian of ( P2 ) to, probabilistically, satisfy (3).

Lemma 17 (Complexity of Non-Uniform Sampling). Given (20a), (20c), and $0 < \epsilon, \delta < 1$, let
$$|\mathcal{S}| \ge \frac{4\hat{K}^2}{\epsilon^2}\log\frac{2d}{\delta}, \qquad (24)$$
where $\hat{K}$ is defined as in (20c). At any $x\in\mathbb{R}^d$, suppose we pick the elements of $\mathcal{S}$ randomly according to the probability distribution (23) and form $H(x)$ as in (19). We have
$$\Pr\Big(\big\|H - \nabla^2 F(x)\big\| \le \epsilon\Big) \ge 1 - \delta. \qquad (25)$$
Proof. Define $B = \mathrm{diag}\{f_1''(a_1^T x)/n, \cdots, f_n''(a_n^T x)/n\} \in \mathbb{R}^{n\times n}$. Let $S\in\mathbb{R}^{n\times|\mathcal{S}|}$ be the sampling matrix and define $H \triangleq A^T S S^T B A$. Further, let the diagonal entries of $B$ be denoted by $b_i$ and define $c \triangleq \sum_{i=1}^n |b_i|\,\|a_i\|^2$. Consider $|\mathcal{S}|$ random matrices $H_j$ such that $\Pr\big(H_j = b_i a_i a_i^T/p_i\big) = p_i,\ \forall j = 1, 2, \ldots, |\mathcal{S}|$, where $p_i = |b_i|\,\|a_i\|^2/\big(\sum_{j=1}^n |b_j|\,\|a_j\|^2\big)$. Define
$$X_j \triangleq H_j - A^T B A, \qquad H \triangleq \frac{1}{|\mathcal{S}|}\sum_{j=1}^{|\mathcal{S}|} H_j, \qquad X \triangleq \sum_{j=1}^{|\mathcal{S}|} X_j = |\mathcal{S}|\big(H - A^T B A\big).$$
Note that $\mathbb{E}[X_j] = \sum_{i=1}^n p_i\big(b_i a_i a_i^T/p_i - A^T B A\big) = 0$, and, since $\mathbb{E}[H_j] = A^T B A$,
$$\mathbb{E}[X_j^2] = \mathbb{E}[H_j^2] + (A^T B A)^2 - \mathbb{E}[H_j]A^T B A - A^T B A\,\mathbb{E}[H_j] = \mathbb{E}[H_j^2] - (A^T B A)^2 \preceq \mathbb{E}[H_j^2]$$
$$= \sum_{i=1}^n p_i\Big(\frac{b_i}{p_i}a_i a_i^T\Big)^2 = \sum_{i=1}^n \frac{b_i^2\|a_i\|^2}{p_i}a_i a_i^T = \Big(\sum_{j=1}^n |b_j|\,\|a_j\|^2\Big)\sum_{i=1}^n |b_i|\, a_i a_i^T = c\, A^T|B|A.$$
So we have $\big\|\mathbb{E}[X_j^2]\big\| \le c\,\big\|A^T|B|A\big\|$. Now we can apply the Operator-Bernstein inequality [35, Theorem 1] to get
$$\Pr\big(\|H - A^T B A\| \ge \epsilon\big) = \Pr\big(\|X\| \ge \epsilon|\mathcal{S}|\big) \le 2d\, e^{-\epsilon^2|\mathcal{S}|/(4c\|A^T|B|A\|)}.$$
Since
$$c = \sum_{i=1}^n |b_i|\,\|a_i\|^2 = \frac{1}{n}\sum_{i=1}^n |f_i''|\,\|a_i\|^2 \le \frac{1}{n}\sum_{i=1}^n K_i = \hat{K}$$
and
$$\big\|A^T|B|A\big\| = \Big\|\frac{1}{n}\sum_{i=1}^n |f_i''|\, a_i a_i^T\Big\| \le \frac{1}{n}\sum_{i=1}^n \big\||f_i''|\, a_i a_i^T\big\| \le \frac{1}{n}\sum_{i=1}^n K_i = \hat{K},$$
we have $\Pr\big(\|H - A^T B A\| \ge \epsilon\big) \le 2d\,e^{-\epsilon^2|\mathcal{S}|/(4\hat{K}^2)}$, which gives the desired result.

The bound in (24) can be improved by replacing the dimension $d$ with a smaller quantity, known as the intrinsic dimension; see Appendix A. As can be seen from (20b) and (20c), since $\hat{K} \le K_{\max}$, the sampling complexity given by Lemma 17 always provides a smaller sample size than that prescribed by Lemma 16. Indeed, the advantage of non-uniform sampling is most pronounced when the distribution of the $K_i$'s is highly skewed, i.e., a few large ones and many small ones, in which case we can have $\hat{K} \ll K_{\max}$; see the numerical experiments in [61]. Also, from (25), it follows that the approximate matrix $H$ constructed according to Lemma 17 satisfies (3b) with $K_H = \hat{K} + \epsilon$, with probability $1-\delta$, which in turn implies that Condition 1 is ensured with probability $1-\delta$.

As concrete examples of problems of the form ( P2 ) where Lemma 17 can be readily used, Table 1 gives estimates of $K_i$ in (20a) for robust linear regression with the smooth non-convex bi-weight loss [3], as well as for non-convex binary classification using logistic regression with least-squares loss [61].

Table 1: Examples of problems of the form ( P2 ) with the corresponding estimates for $K_i$ in (20a).

Robust linear regression. Data: $a_i \in \mathbb{R}^d$, $b_i \in \mathbb{R}$.
$$f_i(a_i^T x) = \frac{(a_i^T x - b_i)^2}{(a_i^T x - b_i)^2 + 1}, \qquad \nabla^2 f_i = \frac{2\big(1 - 3(a_i^T x - b_i)^2\big)}{\big((a_i^T x - b_i)^2 + 1\big)^3}\, a_i a_i^T, \qquad K_i = 2\|a_i\|^2.$$

Non-convex binary classification. Data: $a_i \in \mathbb{R}^d$, $b_i \in \{0, 1\}$.
$$f_i(a_i^T x) = \Big(\frac{1}{1 + \exp(-a_i^T x)} - b_i\Big)^2, \qquad \nabla^2 f_i = 2\big(s_i^2(1-s_i)^2 + (s_i - b_i)\,s_i(1-s_i)(1-2s_i)\big)\, a_i a_i^T, \qquad K_i = \|a_i\|^2,$$
where $s_i = 1/(1 + \exp(-a_i^T x))$.
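For instance, for the bi-weight loss in the first row of Table 1, the curvatures $f_i''$, the bounds $K_i = 2\|a_i\|^2$, and the distribution (23) can be evaluated as in the following sketch (ours; the function names are illustrative):

```python
import numpy as np

def biweight_d2(t):
    """Second derivative of the bi-weight loss t^2 / (t^2 + 1):
    f''(t) = 2 (1 - 3 t^2) / (1 + t^2)^3, with |f''| <= 2."""
    return 2.0 * (1.0 - 3.0 * t**2) / (1.0 + t**2) ** 3

def table1_weights(A, bvec, x):
    """Per-sample curvatures, the K_i = 2 ||a_i||^2 bounds, and the
    non-uniform distribution (23) for robust linear regression."""
    t = A @ x - bvec
    d2 = biweight_d2(t)                          # f_i''(a_i^T x - b_i)
    row_sq = np.einsum('ij,ij->i', A, A)         # ||a_i||^2
    K = 2.0 * row_sq                             # K_i as in Table 1
    w = np.abs(d2) * row_sq
    return d2, K, w / w.sum()                    # the last output is (23)
```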
Now we are in a position to give iteration complexities for Algorithms 1 and 2 when the inexact Hessian matrix $H_t$ is constructed according to Lemma 16 or 17. Since the approximation is a probabilistic construction, in order to guarantee success we need to ensure a small failure probability across all iterations. In particular, in order to get an overall, accumulative success probability of $1-\delta$ for the entire $T$ iterations, the per-iteration failure probability is set as $\big(1 - \sqrt[T]{1-\delta}\big) \in \mathcal{O}(\delta/T)$. This failure probability appears only in the "log factor" for the sample size in all of our results, and so it is not the dominating cost. Hence, requiring that all $T$ iterations be successful, for a large $T$, only necessitates a small (logarithmic) increase in the sample size. For example, for $T \in \mathcal{O}(\max\{\epsilon_g^{-2}, \epsilon_H^{-3}\})$, as in Theorem 2, we can set the per-iteration failure probability to $\delta\min\{\epsilon_g^2, \epsilon_H^3\}$ and ensure that, when Algorithm 2 terminates, all Hessian approximations have been, accumulatively, successful with probability $1-\delta$.

Using these results, we have the following probabilistic, but optimal, guarantee on the worst-case iteration complexity of Algorithm 1 for solving the finite-sum problem ( P1 ) (or ( P2 )) in the case where the inexact Hessian is formed by sub-sampling. The proofs follow very similar lines of reasoning as those used to obtain the results of Section 2, and hence are omitted.

Theorem 4 (Optimal Complexity of Algorithm 1 for Finite-Sum Problems). Consider any $0 < \epsilon_g, \epsilon_H, \delta < 1$. Let $\epsilon$ be as in (11) and set $\delta_0 = \delta\min\{\epsilon_g^2\epsilon_H, \epsilon_H^3\}$. Furthermore, for such $(\epsilon, \delta_0)$, let the sample size $|\mathcal{S}|$ be as in (21) (or (24)) and form the sub-sampled matrix $H$ as in (19). For Problem ( P1 ) (or ( P2 )), under Assumption 1 and Condition 2, Algorithm 1 terminates in at most $T \in \mathcal{O}(\max\{\epsilon_g^{-2}\epsilon_H^{-1}, \epsilon_H^{-3}\})$ iterations, upon which, with probability $1-\delta$, we have that $\|\nabla F(x_T)\| \le \epsilon_g$ and $\lambda_{\min}(\nabla^2 F(x_T)) \ge -(\epsilon + \epsilon_H)$.

Similarly, in the setting of optimization problems ( P1 ) and ( P2 ), with appropriate sub-sampling of the Hessian as in Lemmas 16 and 17, we can also obtain probabilistic worst-case iteration complexities for Algorithm 2, as in the deterministic case. Again, the proofs are similar to those in Section 2, and hence are omitted.

Theorem 5 (Complexity of Algorithm 2 for Finite-Sum Problems). Consider any $0 < \epsilon_g, \epsilon_H, \delta < 1$. Let $\epsilon$ be as in (16) and set $\delta_0 = \delta\min\{\epsilon_g^2, \epsilon_H^3\}$. Furthermore, for such $(\epsilon, \delta_0)$, let the sample size $|\mathcal{S}|$ be as in (21) (or (24)) and form the sub-sampled matrix $H$ as in (19). For Problem ( P1 ) (or ( P2 )), under Assumption 1 and Condition 3, Algorithm 2 terminates in at most $T \in \mathcal{O}(\max\{\epsilon_g^{-2}, \epsilon_H^{-3}\})$ iterations, upon which, with probability $1-\delta$, we have that $\|\nabla F(x_T)\| \le \epsilon_g$ and $\lambda_{\min}(\nabla^2 F(x_T)) \ge -(\epsilon + \epsilon_H)$.

Theorem 6 (Optimal Complexity of Algorithm 2 for Finite-Sum Problems). Consider any $0 < \epsilon_g, \epsilon_H, \delta < 1$. Let $\epsilon$ be as in Theorem 3 and set $\delta_0 = \delta\min\{\epsilon_g^{3/2}, \epsilon_H^3\}$. Furthermore, for such $(\epsilon, \delta_0)$, let the sample size $|\mathcal{S}|$ be as in (21) (or (24)) and form the sub-sampled matrix $H$ as in (19). For Problem ( P1 ) (or ( P2 )), under Assumption 1 and Conditions 3 and 4, Algorithm 2 terminates in at most $T \in \mathcal{O}(\max\{\epsilon_g^{-3/2}, \epsilon_H^{-3}\})$ iterations, upon which, with probability $1-\delta$, we have that $\|\nabla F(x_T)\| \le \epsilon_g$ and $\lambda_{\min}(\nabla^2 F(x_T)) \ge -(\epsilon + \epsilon_H)$.
As can be seen, the main difference between Theorems 5 and 6 lies in the solution of the sub-problem (12). More specifically, if in addition to Condition 3, Condition 4 is also satisfied, then Theorem 6 gives the optimal worst-case iteration complexity.
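The bookkeeping in Theorems 4 through 6 can be summarized in a few lines. The sketch below (ours; the iteration bounds enter only through the choice of $\delta_0$) shows how the per-iteration failure probability affects the sample size (21) only through its log factor:

```python
import numpy as np

def per_iteration_delta(delta, eps_g, eps_h, optimal=True):
    """delta_0 as in Theorem 6 (optimal) or Theorem 5:
    delta * min{eps_g^(3/2), eps_h^3} or delta * min{eps_g^2, eps_h^3}."""
    if optimal:
        return delta * min(eps_g**1.5, eps_h**3)
    return delta * min(eps_g**2, eps_h**3)

# Shrinking the failure probability from delta to delta_0 inflates the
# sample size (21) only logarithmically:
d, k_max, eps, delta = 10**4, 10.0, 0.01, 0.1
delta0 = per_iteration_delta(delta, eps_g=0.01, eps_h=0.1)
size = lambda dlt: int(np.ceil(16 * k_max**2 / eps**2 * np.log(2 * d / dlt)))
print(size(delta), size(delta0))   # modest (logarithmic) increase
```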
We considered non-convex optimization settings and developed efficient variants of the trust-region and adaptive cubic regularization methods in which both the sub-problems and the curvature information are suitably approximated. For all of our proposed variants, we obtained iteration complexities for achieving approximate second-order criticality, which are shown to be the same (up to constants) as those of the exact variants.

As compared with previous works, our proposed Hessian approximation condition offers a range of theoretical and practical advantages. As a concrete example, we considered the large-scale finite-sum optimization problem and proposed uniform and non-uniform sub-sampling strategies as ways to efficiently construct the desired approximate Hessian. We then, probabilistically, established optimal iteration complexity for variants of trust-region and adaptive cubic regularization methods in which the Hessian is appropriately sub-sampled.

In this paper, we focused on approximating the Hessian under exact gradient information. Arguably, the bottleneck of the computations in such second-order methods involves the computations with the Hessian, e.g., matrix-vector products in the (approximate) solution of the sub-problem. In fact, the cost of the exact gradient computation is typically amortized by that of the operations with the Hessian. In spite of this, approximating the gradient in a computationally feasible way and under minimal assumptions could further improve the efficiency of the methods proposed here. However, care has to be taken, as cheaper iterations with inaccurate gradients could in fact result in more iterations overall, which would have the adverse effect of slowing down the algorithm's convergence. As a result, approximating the gradient has to be done with care to avoid such pitfalls.

Finally, we mention that our focus here has been solely on developing the theoretical foundations of such randomized algorithms. Extensive empirical evaluations of these algorithms on various machine learning applications are given in [61].

References

[1] Naman Agarwal et al. "Finding Approximate Local Minima Faster than Gradient Descent". In: arXiv preprint arXiv:1611.01146 (2016).
[2] Afonso S. Bandeira, Katya Scheinberg, and Luís N. Vicente. "Convergence of trust-region methods based on probabilistic models". In: SIAM Journal on Optimization 24.3 (2014), pp. 1238-1264.
[3] Albert E. Beaton and John W. Tukey. "The fitting of power series, meaning polynomials, illustrated on band-spectroscopic data". In: Technometrics 16.2 (1974), pp. 147-185.
[4] In: Computational Optimization and Applications.
[5] In: arXiv preprint arXiv:1609.07428v3 (2018).
[6] Raghu Bollapragada, Richard Byrd, and Jorge Nocedal. "Exact and Inexact Subsampled Newton Methods for Optimization". In: arXiv preprint arXiv:1609.08502 (2016).
[7] Léon Bottou, Frank E. Curtis, and Jorge Nocedal. "Optimization methods for large-scale machine learning". In: arXiv preprint arXiv:1606.04838 (2016).
[8] Yair Carmon and John C. Duchi. "Gradient Descent Efficiently Finds the Cubic-Regularized Non-Convex Newton Step". In: arXiv preprint arXiv:1612.00547 (2016).
[9] Coralia Cartis, Nicholas I. M. Gould, and Philippe L. Toint. "Adaptive cubic regularisation methods for unconstrained optimization. Part I: motivation, convergence and numerical results". In: Mathematical Programming 127.2 (2011), pp. 245-295.
[10] Coralia Cartis, Nicholas I. M. Gould, and Philippe L. Toint. "Adaptive cubic regularisation methods for unconstrained optimization. Part II: worst-case function- and derivative-evaluation complexity". In: Mathematical Programming 130.2 (2011), pp. 295-319.
[11] Coralia Cartis, Nicholas I. M. Gould, and Philippe L. Toint. "Complexity bounds for second-order optimality in unconstrained optimization". In: Journal of Complexity 28.1 (2012), pp. 93-108.
[12] In: Optimization Methods and Software.
[13] In: SIAM Journal on Optimization.
[14] Coralia Cartis, Nicholas I. M. Gould, and Philippe L. Toint. Optimal Newton-type methods for nonconvex smooth optimization problems. Tech. rep. ERGO 11-009, School of Mathematics, University of Edinburgh, 2011.
[15] Coralia Cartis and Katya Scheinberg. "Global convergence rate analysis of unconstrained optimization methods based on probabilistic models". In: Mathematical Programming (2015), pp. 1-39.
[16] Ruobing Chen, Matt Menickelly, and Katya Scheinberg. "Stochastic optimization using a trust-region method and random models". In: arXiv preprint arXiv:1504.04231 (2015).
[17] Andrew R. Conn, Nicholas I. M. Gould, and Philippe L. Toint. Trust Region Methods. SIAM, 2000.
[18] Andrew R. Conn, Katya Scheinberg, and Luís N. Vicente. "Global convergence of general derivative-free trust-region algorithms to first- and second-order critical points". In: SIAM Journal on Optimization.
[19] Frank E. Curtis, Daniel P. Robinson, and Mohammadreza Samadi. "A Trust Region Algorithm with a Worst-Case Iteration Complexity of $\mathcal{O}(\epsilon^{-3/2})$ for Nonconvex Optimization". In: Mathematical Programming.
[20] Frank E. Curtis, Daniel P. Robinson, and Mohammadreza Samadi. "An Inexact Regularized Newton Framework with a Worst-Case Iteration Complexity of $\mathcal{O}(\epsilon^{-3/2})$ for Nonconvex Optimization". In: arXiv preprint arXiv:1708.00475 (2017).
[21] John E. Dennis and Jorge J. Moré. "A characterization of superlinear convergence and its application to quasi-Newton methods". In: Mathematics of Computation.
[22] Petros Drineas and Michael W. Mahoney. "RandNLA: Randomized Numerical Linear Algebra". In: Communications of the ACM 59.6 (2016), pp. 80-90.
[23] In: SIAM Journal on Optimization.
[24] Jerome Friedman, Trevor Hastie, and Robert Tibshirani. The Elements of Statistical Learning. Vol. 1. Springer Series in Statistics. Springer, Berlin, 2001.
[25] Dan Garber et al. "Faster eigenvector computation via shift-and-invert preconditioning". In: International Conference on Machine Learning. 2016, pp. 2626-2634.
[26] Rong Ge et al. "Escaping From Saddle Points - Online Stochastic Gradient for Tensor Decomposition". In: COLT. 2015, pp. 797-842.
[27] Saeed Ghadimi, Han Liu, and Tong Zhang. "Second-Order Methods with Cubic Regularization Under Inexact Information". In: arXiv preprint arXiv:1710.05782 (2017).
[28] Nicholas I. M. Gould, Daniel P. Robinson, and H. Sue Thorne. "On solving trust-region and other regularised subproblems in optimization". In: Mathematical Programming Computation.
[29] In: SIAM Journal on Optimization.
[30] In: Optimization Methods and Software.
[31] In: SIAM Journal on Optimization.
[32] In: IMA Journal of Numerical Analysis.
[33] In: IMA Journal of Numerical Analysis.
[34] Andreas Griewank. "The modification of Newton's method for unconstrained optimization by bounding cubic terms". Technical Report NA/12. Department of Applied Mathematics and Theoretical Physics, University of Cambridge (1981).
[35] David Gross and Vincent Nesme. "Note on sampling without replacing from a finite collection of matrices". In: arXiv preprint arXiv:1001.2738 (2010).
[36] Elad Hazan and Tomer Koren. "A linear-time algorithm for trust region problems". In: Mathematical Programming (2015), pp. 1-19.
[37] Jonas Moritz Kohler and Aurelien Lucchi. "Sub-sampled Cubic Regularization for Non-convex Optimization". In: arXiv preprint arXiv:1705.05933 (2017).
[38] J. Kuczyński and H. Woźniakowski. "Estimating the largest eigenvalue by the power and Lanczos algorithms with a random start". In: SIAM Journal on Matrix Analysis and Applications.
[39] In: Computational Optimization and Applications.
[40] Jason D. Lee et al. "Gradient descent only converges to minimizers". In: Conference on Learning Theory. 2016, pp. 1246-1257.
[41] Felix Lenders, Christian Kirches, and Andreas Potschka. "trlib: A vector-free implementation of the GLTR method for iterative solution of the trust region problem". In: arXiv preprint arXiv:1611.04718 (2016).
[42] Michael W. Mahoney. "Randomized algorithms for matrices and data". In: Foundations and Trends in Machine Learning.
[43] Jorge J. Moré and Danny C. Sorensen. "Computing a trust region step". In: SIAM Journal on Scientific and Statistical Computing.
[44] In: Mathematical Programming.
[45] Yurii Nesterov and Boris T. Polyak. "Cubic regularization of Newton method and its global performance". In: Mathematical Programming 108.1 (2006), pp. 177-205.
[46] Jorge Nocedal and Stephen J. Wright. Numerical Optimization. Springer Science & Business Media, 2006.
[47] Farbod Roosta-Khorasani, Kees van den Doel, and Uri Ascher. "Data completion and stochastic algorithms for PDE inversion problems with many measurements". In: Electronic Transactions on Numerical Analysis 42 (2014), pp. 177-196.
[48] Farbod Roosta-Khorasani, Kees van den Doel, and Uri Ascher. "Stochastic algorithms for inverse problems involving PDEs and many measurements". In: SIAM Journal on Scientific Computing.
[49] Farbod Roosta-Khorasani and Michael W. Mahoney. "Sub-sampled Newton methods". In: Mathematical Programming. ISSN: 1436-4646.
[50] Farbod Roosta-Khorasani, Gábor J. Székely, and Uri Ascher. "Assessing stochastic algorithms for large scale nonlinear least squares problems using extremal probabilities of linear combinations of gamma random variables". In: SIAM/ASA Journal on Uncertainty Quantification.
[51] Shai Shalev-Shwartz and Shai Ben-David. Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press, 2014.
[52] Sara Shashaani, Fatemeh Hashemi, and Raghu Pasupathy. "ASTRO-DF: A Class of Adaptive Sampling Trust-Region Algorithms for Derivative-Free Stochastic Optimization". In: arXiv preprint arXiv:1610.06506 (2016).
[53] Danny C. Sorensen. "Newton's method with a model trust region modification". In: SIAM Journal on Numerical Analysis.
[54] In: SIAM Journal on Optimization.
[55] Suvrit Sra, Sebastian Nowozin, and Stephen J. Wright, eds. Optimization for Machine Learning. MIT Press, 2012.
[56] Trond Steihaug. "The conjugate gradient method and trust regions in large scale optimization". In: SIAM Journal on Numerical Analysis.
[57] In: arXiv preprint arXiv:1711.02838 (2017).
[58] Joel A. Tropp. "User-friendly tail bounds for sums of random matrices". In: Foundations of Computational Mathematics.
[59] Joel A. Tropp. "An Introduction to Matrix Concentration Inequalities". In: arXiv preprint arXiv:1501.01571 (2015).
[60] David P. Woodruff. "Sketching as a tool for numerical linear algebra". In: arXiv preprint arXiv:1411.4357 (2014).
[61] Peng Xu, Farbod Roosta-Khorasani, and Michael W. Mahoney. "Second-Order Optimization for Non-Convex Machine Learning: An Empirical Study". In: arXiv preprint arXiv:1708.07827 (2017).
[62] Peng Xu et al. "Sub-sampled Newton methods with non-uniform sampling". In: Advances in Neural Information Processing Systems. 2016, pp. 3000-3008.
Appendix A: Intrinsic Dimension and Improving the Sampling Complexity (24)
We can further improve the sampling complexity (24) by considering the intrinsic dimension of the matrix $A^T|B|A$. Recall that for an SPSD matrix $A\in\mathbb{R}^{d\times d}$, the intrinsic dimension is defined as $t(A) = \mathrm{tr}(A)/\|A\|$, where $\mathrm{tr}(A)$ is the trace of $A$. The intrinsic dimension can be regarded as a measure of the number of dimensions in which $A$ has a significant spectrum. It is easy to see that $1 \le t(A) \le \mathrm{rank}(A) \le d$; see [59] for more details. Now let $t = \mathrm{tr}(A^T|B|A)/\|A^T|B|A\|$ be the intrinsic dimension of the SPSD matrix $A^T|B|A$. We have the following improved sampling complexity result:

Lemma 18 (Complexity of Non-Uniform Sampling: Intrinsic Dimension). The result of Lemma 17 holds with (24) replaced by
$$|\mathcal{S}| \ge \frac{4\hat{K}^2}{\epsilon^2}\log\frac{8t}{\delta}, \qquad (26)$$
where $t = \mathrm{tr}(A^T|B|A)/\|A^T|B|A\| \le d$ is the intrinsic dimension of the matrix $A^T|B|A$.

Proof. It is easy to see that $\mathrm{Var}(X) = \mathbb{E}(X^2) = \sum_{j=1}^{|\mathcal{S}|}\mathbb{E}(X_j^2) \preceq |\mathcal{S}|\, c\, A^T|B|A$, where $X$ and $c$ are as given in the proof of Lemma 17. For $H_j = \frac{b_i}{p_i}a_i a_i^T$, we have
$$\lambda_{\max}(X_j) \le \|X_j\| = \Big\|\frac{b_i}{p_i}a_i a_i^T - A^T B A\Big\| = \Big\|\Big(\frac{1-p_i}{p_i}\Big)b_i\, a_i a_i^T - \sum_{j\neq i} b_j\, a_j a_j^T\Big\|$$
$$\le \Big(\frac{1-p_i}{p_i}\Big)|b_i|\,\|a_i\|^2 + \sum_{j\neq i}|b_j|\,\|a_j\|^2 = (1-p_i)\sum_{j=1}^n |b_j|\,\|a_j\|^2 + \sum_{j\neq i}|b_j|\,\|a_j\|^2$$
$$= 2\sum_{j\neq i}|b_j|\,\|a_j\|^2 \le 2\sum_{j=1}^n |b_j|\,\|a_j\|^2 = 2c.$$
Hence, if $\epsilon|\mathcal{S}| \ge \sqrt{|\mathcal{S}|\,c\,\|A^T|B|A\|} + 2c/3$, we can apply the matrix Bernstein inequality with intrinsic dimension [59, Theorem 7.7.1] to get, for $\epsilon \le 3\hat{K}/2$,
$$\Pr\big(\lambda_{\max}(X) \ge \epsilon|\mathcal{S}|\big) \le 4t\exp\Big\{\frac{-\epsilon^2|\mathcal{S}|}{2c\,\|A^T|B|A\| + 4c\epsilon/3}\Big\} \le 4t\exp\Big\{\frac{-\epsilon^2|\mathcal{S}|}{4\hat{K}^2}\Big\}.$$
Applying the same bound to $Y_j = -X_j$ and $Y = \sum_{j=1}^{|\mathcal{S}|} Y_j$, followed by the union bound, we get the desired result.
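As an illustration (ours; `b_abs` denotes the absolute values of the diagonal of $B$), the intrinsic dimension and the resulting prescriptions (24) versus (26) can be compared directly:

```python
import numpy as np

def intrinsic_dimension(M):
    """t(M) = tr(M) / ||M|| for an SPSD matrix M; 1 <= t(M) <= rank(M) <= d."""
    return np.trace(M) / np.linalg.norm(M, 2)

def sample_sizes(A, b_abs, k_hat, eps, delta):
    """Compare (24), which uses log(2d/delta), with (26), which uses
    log(8t/delta) for t = t(A^T |B| A)."""
    d = A.shape[1]
    M = A.T @ (b_abs[:, None] * A)      # A^T |B| A, with |B| = diag(b_abs)
    t = intrinsic_dimension(M)
    s24 = 4 * k_hat**2 / eps**2 * np.log(2 * d / delta)
    s26 = 4 * k_hat**2 / eps**2 * np.log(8 * t / delta)
    return int(np.ceil(s24)), int(np.ceil(s26))
```

When the spectrum of $A^T|B|A$ decays quickly, $t \ll d$ and the prescription (26) is the smaller of the two.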
Appendix B: Computation of Approximate Negative Curvature Direction
Throughout our analysis, we assume that if a sufficiently negative curvature direction exists, i.e., $\lambda_{\min}(H) \le -\epsilon_H$ for some $\epsilon_H \in (0,1)$, then we can find a direction $u$ such that $\langle u, Hu\rangle \le -\nu\epsilon_H\|u\|^2$ for some $\nu \in (0,1)$. Such a direction can be obtained by applying, e.g., the Lanczos or shifted power method to the matrix $\tilde{H} = K_H\mathbb{I} - H$. These methods only employ matrix-vector products and, hence, are suitable for large-scale problems. More specifically, for any $\kappa \in (0,1)$, these methods, after at most $\mathcal{O}\big(\log(d/\delta)\sqrt{K_H/\kappa}\big)$ matrix-vector products and with probability $1-\delta$, yield a vector $u$ satisfying
$$K_H\|u\|^2 - \langle u, Hu\rangle = \langle u, \tilde{H}u\rangle \ge \kappa\,\lambda_{\max}(\tilde{H})\,\|u\|^2 = \kappa\big(K_H - \lambda_{\min}(H)\big)\|u\|^2.$$
Rearranging, we obtain
$$\langle u, Hu\rangle \le (1-\kappa)K_H\|u\|^2 + \kappa\,\lambda_{\min}(H)\,\|u\|^2.$$
Setting $1 > \kappa \ge (2K_H)/(2K_H + \epsilon_H)$ and $\nu = \kappa/2$ gives $\langle u, Hu\rangle \le -\nu\epsilon_H\|u\|^2$.
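A minimal sketch of the shifted power-iteration idea (ours; in practice Lanczos with a random start, as cited above, gives the stated probabilistic guarantee) using only Hessian-vector products:

```python
import numpy as np

def negative_curvature_direction(hvp, d, k_h, iters, rng):
    """Power iteration on the shifted operator v -> K_H * v - H v.
    Its top eigenvector corresponds to lambda_min(H); `hvp` is any
    Hessian-vector product oracle, so H is never formed explicitly."""
    u = rng.normal(size=d)
    u /= np.linalg.norm(u)
    for _ in range(iters):
        w = k_h * u - hvp(u)            # apply H_tilde = K_H * I - H
        u = w / np.linalg.norm(w)
    return u                            # <u, H u> approximates lambda_min(H)

# Usage sketch on an explicit symmetric matrix (illustrative only):
rng = np.random.default_rng(0)
M = rng.normal(size=(50, 50))
H = 0.5 * (M + M.T)
u = negative_curvature_direction(lambda v: H @ v, 50,
                                 np.linalg.norm(H, 2), 200, rng)
print(u @ H @ u)   # close to lambda_min(H) after enough iterations
```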