Sharp worst-case evaluation complexity bounds for arbitrary-order nonconvex optimization with inexpensive constraints
C. Cartis∗, N. I. M. Gould† and Ph. L. Toint‡

∗ Mathematical Institute, Oxford University, Oxford OX2 6GG, England. Email: [email protected]
† Computational Mathematics Group, STFC-Rutherford Appleton Laboratory, Chilton OX11 0QX, England. Email: [email protected]. The work of this author was supported by EPSRC grant EP/M025179/1.
‡ Namur Center for Complex Systems (naXys), University of Namur, 61, rue de Bruxelles, B-5000 Namur, Belgium. Email: [email protected]

Abstract
We provide sharp worst-case evaluation complexity bounds for nonconvex minimization problems with general inexpensive constraints, i.e. problems where the cost of evaluating/enforcing the (possibly nonconvex or even disconnected) constraints, if any, is negligible compared to that of evaluating the objective function. These bounds unify, extend or improve all known upper and lower complexity bounds for unconstrained and convexly-constrained problems. It is shown that, given an accuracy level $\epsilon$, a degree of highest available Lipschitz continuous derivatives $p$ and a desired optimality order $q$ between one and $p$, a conceptual regularization algorithm requires no more than $O(\epsilon^{-\frac{p+1}{p-q+1}})$ evaluations of the objective function and its derivatives to compute a suitably approximate $q$-th order minimizer. With an appropriate choice of the regularization, a similar result also holds if the $p$-th derivative is merely Hölder rather than Lipschitz continuous. We provide an example that shows that the above complexity bound is sharp for unconstrained and a wide class of constrained problems; we also give reasons for the optimality of regularization methods from a worst-case complexity point of view, within a large class of algorithms that use the same derivative information.

1 Introduction

Since the seminal paper by Vavasis [21] on the complexity of finding first-order critical points in unconstrained nonlinear optimization was published 25 years ago, the question of the optimal worst-case complexity of optimization methods has been of interest to mathematicians and also, because of its strong connection with deep learning, to computer scientists. Of late, there has been a growing interest in this research field, both for convex and nonconvex problems. This paper focusses on the latter class and follows a now substantial(1) trend of research where bounds on the worst-case evaluation complexity (or oracle complexity) of obtaining first- and
second-order critical points(2) were derived for nonlinear nonconvex unconstrained optimization problems [21, 17, 14, 19, 5]. These papers all provide upper evaluation complexity bounds: they show that, to obtain an $\epsilon$-approximate first-order-necessary minimizer (for an unconstrained problem, this is a point at which the gradient of the objective function is less than $\epsilon$ in norm), at most $O(\epsilon^{-2})$ evaluations of the objective function(3) are needed if a model involving first derivatives is used, and at most $O(\epsilon^{-3/2})$ evaluations are needed if using second derivatives is permitted. This result was extended to convexly-constrained problems in [6]. A broader framework allowing the use of Taylor series of degree $p$ was more recently proposed in [2], in which case the worst-case evaluation complexity bound for an $\epsilon$-first-order-necessary unconstrained minimizer is shown to be $O(\epsilon^{-\frac{p+1}{p}})$, thereby generalizing the previous results for this case. Complexity for obtaining $\epsilon$-approximate second-order-necessary unconstrained minimizers was considered in [19, 5], where a bound of $O(\epsilon^{-3})$ evaluations was proved to obtain an $\epsilon$-second-order-necessary minimizer using a Taylor model of degree two, and a bound of $O(\epsilon^{-\frac{p+1}{p-1}})$ evaluations was shown in [8] for the case where a Taylor model of degree $p$ is used. Defining $q$-th-order-necessary minimizers for $q > 2$ is more delicate; [11] introduced a suitable measure and established a bound of $O(\epsilon^{-(q+1)})$ on the evaluation complexity of obtaining $\epsilon$-approximate $q$-th-order-necessary minimizers for convexly-constrained problems, in particular improving on the bound stated in [1] for the case $p = q = 3$.

The unconstrained and convexly-constrained cases where the assumption of Lipschitz continuity is replaced by the weaker $\beta$-Hölder continuity ($\beta \in (0,1]$) were considered for $q = 1$ in [18, 7, 9]. These references show that at most $O(\epsilon^{-\frac{p+\beta}{p+\beta-1}})$ evaluations are needed for obtaining an $\epsilon$-first-order-necessary minimizer.

While upper complexity bounds are important as they provide a handle on the intrinsic difficulty of the considered problem, they do so on condition of not being overly pessimistic. To address this last point, lower bounds on the evaluation complexity of unconstrained nonconvex optimization problems and methods were derived in [4, 17] and [12], where it was shown that the known upper complexity bounds are sharp (irrespective of the problem's dimension) for most known methods using Taylor models of degree one or two. That is to say that there are examples for which the complexity order predicted by the upper bound is actually achieved. More recently, Carmon et al. [3] provided an elaborate construction showing that at least a multiple of $\epsilon^{-\frac{p+1}{p}}$ function evaluations may be needed to obtain an $\epsilon$-first-order-necessary unconstrained minimizer when derivatives of order at most $p$ are used. This result, which matches in order the upper bound of [2], covers a very wide class of potential optimization methods(4) but has the drawback of being only valid for problems whose dimension essentially exceeds the number of iterations needed, which can be very large and grows quickly when $\epsilon$ tends to zero.

(1) See [12] for a more complete list of references.
(2) That is, points satisfying the first- or second-order necessary optimality conditions for minimization.
(3) And its available derivatives.
(4) In particular, it covers randomized methods, which we do not consider in this paper.

Contributions.
The present paper aims at unifying and generalizing all the above results in a single framework, providing, for problems with inexpensive or no constraints, provably sharp worst-case evaluation complexity bounds. By "inexpensive constraints" we mean constraints whose evaluation/enforcement(5)
cost is negligible compared to the cost of evaluating the objective function. As a consequence, the evaluation complexity for such problems is meaningfully captured by focusing on the number of evaluations of this latter function. This class of minimization problems contains important cases such as bound-constrained problems and convexly-constrained problems (when the projection onto the feasible set is inexpensive), but also allows possibly nonconvex or even disconnected feasible sets.

In order to achieve these objectives, we first revisit the Taylor-based optimality measure of [11] and define $(\epsilon,\delta)$-$q$-th-order-necessary minimizers, a notion extending the standard $\epsilon$-first- and $\epsilon$-second-order cases to arbitrary orders. We then present a conceptual regularization algorithm using degree $p$ models and show that this algorithm requires at most $O(\epsilon^{-\frac{p+\beta}{p-q+\beta}})$ evaluations of $f$ and its derivatives to find such an $(\epsilon,\delta)$-$q$-th-order-necessary minimizer when the $p$-th derivative of $f$ is assumed to be $\beta$-Hölder continuous. (If the $p$-th derivative is assumed to be Lipschitz continuous, the bound becomes $O(\epsilon^{-\frac{p+1}{p-q+1}})$.) This bound matches the best known lower bounds for first- and second-order, and improves on the bound in $O(\epsilon^{-(q+1)})$ given by [11]. We then show that this bound is sharp in order for unconstrained problems with Lipschitz continuous $p$-th derivative by completing and extending the result of [3] in two ways. The first is to show that the lower worst-case bound of order $\epsilon^{-\frac{p+1}{p}}$ evaluations for obtaining a first-order-necessary minimizer using at most $p$ derivatives is also valid for problems of every dimension, and the second is to show that this bound can be generalized to a multiple of $\epsilon^{-\frac{p+1}{p-q+1}}$ for obtaining a $q$-th-order-necessary minimizer of any order $q$. In particular, this result matches in order the upper bound obtained in the first part of the paper and subsumes or improves known lower bounds for first- and second-order-necessary minimizers. While our lower bounds are derived for regularization algorithms applied to unconstrained problems, we also indicate that they may be extended to a much wider class of minimization methods and to a significant class of constrained problems.

The paper is organized as follows. Section 2 introduces the (possibly constrained) minimization problem of interest and the concept of $(\epsilon,\delta)$-approximate $q$-th-order-necessary minimizers. It also presents a variant of the Adaptive Regularization algorithm using degree $p$ Taylor models (AR$p$) whose purpose is to find such minimizers. Section 3 then provides an upper bound on the evaluation complexity for the AR$p$ algorithm to achieve this task. Section 4 then discusses the specialization of this result to the case where $\epsilon$-approximate second-order-necessary minimizers are sought. The complexity upper bound of Section 3 is then proved to be sharp in Section 5 for the Lipschitz-continuous case where the feasible set contains a ray. Some conclusions are finally presented in Section 6.

(5) Constraints' values and those of their derivatives, if relevant.

Notation.
Throughout the paper, $\|v\|$ denotes the standard Euclidean norm of a vector $v \in \mathbb{R}^n$. For a symmetric tensor $S$ of order $p$, $S[v_1,\ldots,v_p]$ is the result of applying $S$ to the vectors $v_1,\ldots,v_p$, $S[v]^p$ is the result of applying $S$ to $p$ copies of the vector $v$, and
$$\|S\|_{[p]} \stackrel{\rm def}{=} \max_{\|v\|=1} |S[v]^p| = \max_{\|v_1\|=\cdots=\|v_p\|=1} |S[v_1,\ldots,v_p]| \qquad (1.1)$$
(where the second equality results from Theorem 2.1 in [23]) is the associated induced norm for such tensors. If $S_1$ and $S_2$ are tensors, $S_1 \otimes S_2$ is their tensor product and $S^{k\otimes}$ is the tensor product of $S$ $k$ times with itself. For a real, sufficiently differentiable univariate function $f$, $f^{(i)}$ denotes its $i$-th derivative and $f^{(0)}$ is a synonym for $f$. For an integer $k$ and a real $\beta \in (0,1]$, we define the generalized factorial $(k+\beta)! \stackrel{\rm def}{=} \prod_{\ell=1}^{k}(\beta+\ell)$ (this coincides with the standard factorial if $\beta = 1$). As is usual, we also define $0! = 1$. If $M$ is a symmetric matrix, $\lambda_{\min}(M)$ is its leftmost eigenvalue. If $\alpha$ is a real, $\lceil\alpha\rceil$ and $\lfloor\alpha\rfloor$ denote the smallest integer not smaller than $\alpha$ and the largest integer not exceeding $\alpha$, respectively. Finally, $\mathop{\rm globmin}_{x\in\mathcal{S}} f(x)$ denotes the smallest value of $f(x)$ over $x \in \mathcal{S}$.
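To make the two less standard notational objects concrete, the following sketch (ours, not part of the original text) evaluates the generalized factorial $(k+\beta)!$ and brute-forces a lower estimate of the tensor norm of (1.1) by sampling unit vectors; the Monte-Carlo estimate is only illustrative, since the exact norm requires global optimization on the sphere.

```python
import numpy as np

def gen_factorial(k, beta):
    """Generalized factorial (k+beta)! = prod_{l=1}^{k} (beta + l)."""
    out = 1.0
    for l in range(1, k + 1):
        out *= beta + l
    return out

def tensor_norm_estimate(S, n_samples=20000, seed=0):
    """Crude Monte-Carlo lower bound on ||S||_[p] = max_{||v||=1} |S[v]^p|
    for a symmetric tensor S of order p (a p-dimensional numpy array)."""
    rng = np.random.default_rng(seed)
    best = 0.0
    for _ in range(n_samples):
        v = rng.standard_normal(S.shape[0])
        v /= np.linalg.norm(v)
        val = S
        for _ in range(S.ndim):          # contract S with p copies of v
            val = np.tensordot(val, v, axes=([0], [0]))
        best = max(best, abs(float(val)))
    return best

# example: for a symmetric matrix (p = 2) the norm is the spectral radius
M = np.array([[2.0, 1.0], [1.0, 3.0]])
print(gen_factorial(3, 1.0))             # 24.0, i.e. the standard 4!
print(tensor_norm_estimate(M), max(abs(np.linalg.eigvalsh(M))))
```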
2 The problem and the AR$p$ algorithm

Given $p \geq 1$, this paper considers the set-constrained optimization problem
$$\min_{x\in\mathcal{F}} f(x), \qquad (2.1)$$
where we assume that $\mathcal{F} \subseteq \mathbb{R}^n$ is closed and nonempty, and where $f \in C^{p,\beta}(\mathbb{R}^n)$, namely, that:
• $f$ is $p$-times continuously differentiable,
• $f$ is bounded below by $f_{\rm low}$, and
• the $p$-th derivative tensor of $f$ at $x$ is globally Hölder continuous, that is, there exist constants $L \geq 0$ and
$\beta \in (0,1]$ such that, for all $x, y \in \mathbb{R}^n$,
$$\|\nabla_x^p f(x) - \nabla_x^p f(y)\|_{[p]} \leq L\|x-y\|^{\beta}. \qquad (2.2)$$
Observe that convexity or even connectedness of $\mathcal{F}$ is not requested. Observe also that the more usual case of Lipschitz continuous $p$-th derivative corresponds to $\beta = 1$. We note that our assumption covers the continuous range of objective-function smoothness from Hölder continuous gradients to Lipschitz continuous $p$-th derivatives. In what follows, we assume that $\beta$ is known.

If $T_p(x,s)$ is the standard $p$-th degree Taylor expansion of $f$ about $x$ computed for the increment $s$, that is
$$T_p(x,s) \stackrel{\rm def}{=} f(x) + \sum_{\ell=1}^{p}\frac{1}{\ell!}\nabla_x^{\ell} f(x)[s]^{\ell}, \qquad (2.3)$$
then (2.2) provides crucial approximation bounds, whose proof can be found in the appendix.

Lemma 2.1
Let $f \in C^{p,\beta}(\mathbb{R}^n)$, and let $T_p(x,s)$ be the Taylor approximation of $f(x+s)$ about $x$ given by (2.3). Then for all $x, s \in \mathbb{R}^n$,
$$f(x+s) \leq T_p(x,s) + \frac{L}{(p+\beta)!}\|s\|^{p+\beta}, \qquad (2.4)$$
$$\|\nabla_x^{j} f(x+s) - \nabla_s^{j} T_p(x,s)\|_{[j]} \leq \frac{L}{(p-j+\beta)!}\|s\|^{p-j+\beta} \quad (j = 1,\ldots,p). \qquad (2.5)$$

Given $\delta \in (0,1]$ and $j \leq p$, we now define the optimality measure
$$\phi_{f,j}^{\delta}(x) \stackrel{\rm def}{=} f(x) - \mathop{\rm globmin}_{\substack{x+d\in\mathcal{F} \\ \|d\|\leq\delta}} T_j(x,d), \qquad (2.6)$$
which can be interpreted as the magnitude of the largest decrease achievable on the Taylor expansion of degree $j$ within the intersection of a ball of radius $\delta$ with the feasible set. It was shown in [11] that $\phi_{f,j}^{\delta}(x)$ is a proper generalization of well-known unconstrained optimality measures for low orders, in that
$$\phi_{f,1}^{\delta}(x) = \|\nabla_x f(x)\|\,\delta, \qquad (2.7)$$
$$\phi_{f,2}^{\delta}(x) = \left|\min\left[0,\lambda_{\min}\left(\nabla_x^2 f(x)\right)\right]\right|\frac{\delta^2}{2} \qquad (2.8)$$
provided $\nabla_x f(x) = 0$, and also, if additionally $\nabla_x^2 f(x)$ is positive semi-definite, that
$$\phi_{f,3}^{\delta} = \left\|\text{projection of } \nabla_x^3 f(x) \text{ onto the nullspace of } \nabla_x^2 f(x)\right\|\frac{\delta^3}{3!}. \qquad (2.9)$$
At variance with other optimality measures, $\phi_{f,j}^{\delta}(x)$ is well-defined for any order $j \geq 1$ and varies continuously when $x$ varies continuously in $\mathcal{F}$.

The role of the "optimality radius" $\delta$ in (2.6) merits some discussion. While the choice of $\delta = 1$ is adequate for retrieving known optimality conditions in the unconstrained case for $j = 1$, for $j = 2$ provided $\nabla_x f(x) = 0$, and for $j = 3$ provided additionally $\nabla_x^2 f(x)$ is positive semi-definite (as we have just seen), $\delta$ becomes important in other cases. Corollary 3.6 in [11] indicates that, when $\mathcal{F}$ is convex, $q$-th-order necessary "path-based" optimality conditions hold if
$$\lim_{\delta\to 0} \frac{\phi_{f,j}^{\delta}(x)}{\delta^j} = 0 \quad \text{for } j = 1,\ldots,q. \qquad (2.10)$$
The limit for $\delta \to 0$ is however of little computational use, and considering $\phi_{f,j}^{\delta}(x)$ for non-vanishing $\delta$ has substantial advantages from the point of view of optimization: while it may fail to indicate that $x$ is a local minimizer, it does so only by providing a direction leading to values of $f$ below $f(x)$, thereby helping to avoid local but non-global approximate solutions. We refer the reader to [11] for a further discussion, but conclude that considering fixed $\delta$ has strong advantages when solving (2.1).

A special case is when $x$ is an isolated feasible point, that is a point which is the sole intersection between $\mathcal{F}$ and any sufficiently small neighbourhood of $x$. Such a point is clearly a local minimizer, and this is reflected by the fact that $\phi_{f,q}^{\delta}(x) = 0$ for any $f$, any $q$ and any sufficiently small $\delta$.

The main drawback of using $\phi_{f,j}^{\delta}(x)$ is, of course, that its computation requires the global minimization of $T_j(x,d)$ in the intersection of the ball of radius $\delta$ with $\mathcal{F}$. We are not aware of an easy way to do this in general(6) when $n > 1$, which is why our analysis remains of an essentially theoretical nature, as was the case for [11]. Note however that, albeit potentially very difficult, solving this global minimization problem does not involve calculating the value of $f$ or of any of its derivatives. In that sense, this drawback is thus irrelevant for the worst-case evaluation complexity, which solely focuses on these evaluations.

(6) A small value of $\delta$ might help, but this computation remains NP-hard in most cases.
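As an illustration of definition (2.6) (our sketch, for a hypothetical univariate objective), $\phi_{f,j}^{\delta}(x)$ can be brute-forced in one dimension by minimizing the degree-$j$ Taylor expansion over a grid of feasible points in the ball of radius $\delta$; this is only viable in tiny dimensions, in line with the NP-hardness caveat above.

```python
import numpy as np
from math import factorial

def phi(derivs, delta, feasible, n_grid=100001):
    """Brute-force phi^delta_{f,j}(x) of (2.6) in one dimension.
    derivs = [f(x), f'(x), ..., f^(j)(x)] at the current point x;
    feasible(d) tests membership of x + d in the feasible set F."""
    d = np.linspace(-delta, delta, n_grid)
    taylor = sum(derivs[l] * d**l / factorial(l) for l in range(len(derivs)))
    mask = np.array([feasible(di) for di in d])
    return derivs[0] - taylor[mask].min()

# example: f(x) = x^3 at x = 0 with F = IR, j = 3, delta = 1;
# the cubic term drives the largest model decrease, so phi = 1
print(phi([0.0, 0.0, 0.0, 6.0], 1.0, lambda d: True))
```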
Suppose now that we relax the first-order condition $\nabla_x f(x) = 0$ for unconstrained problems to $\|\nabla_x f(x)\| \leq \epsilon$ and, at the same time, relax the second-order condition to $|\min[0,\lambda_{\min}(\nabla_x^2 f(x))]| \leq \epsilon$; we then deduce that
$$\phi_{f,2}^{\delta}(x) \leq \epsilon\delta + \frac{1}{2}\epsilon\delta^2 = \epsilon\sum_{\ell=1}^{2}\frac{\delta^{\ell}}{\ell!}. \qquad (2.11)$$
A natural generalization of this observation is to define an $(\epsilon,\delta)$-approximate $q$-th-order-necessary minimizer of $f$ as a point $x$ such that
$$\phi_{f,q}^{\delta}(x) \leq \epsilon\,\chi_q(\delta) \qquad (2.12)$$
where
$$\chi_q(\delta) \stackrel{\rm def}{=} \sum_{\ell=1}^{q}\frac{\delta^{\ell}}{\ell!}. \qquad (2.13)$$
Because (2.12) is a new way to look at approximate optimality and is crucial for the rest of this paper, it is worthwhile to motivate and discuss it further.

1. When $\epsilon = 0$, (2.12) implies that the complicated path-based necessary optimality conditions derived in [11] do hold. This results from the fact that these latter conditions merely express that the Taylor model of order $q$ cannot decrease close enough to $x$ along any feasible polynomial path emanating from $x$, which is clearly the case if $x$ is a global minimizer of the same models in the intersection of the feasible set and a ball of radius $\delta$ centered at $x$. By continuity, these path-based conditions must therefore hold in the limit under (2.12) when $\epsilon$ tends to zero. The role of (2.12) as a condition for approximate minimization is thus coherent and consistent with known necessary conditions.

2. Inspired by (2.10), the stronger approximate optimality condition
$$\phi_{f,j}^{\delta}(x) \leq \epsilon\,\delta^{j} \quad \text{for } j \in \{1,\ldots,q\} \qquad (2.14)$$
was used in [11] instead of (2.12). Our main reason to prefer (2.12) is the following. Observe that (2.14) implies in particular that $\phi_{f,q}^{\delta}(x) \leq \epsilon\delta^{q}$, which in turn implies, for $\delta$ small enough for the first-order term to dominate, that $\phi_{f,1}^{\delta}(x) \leq \epsilon\delta^{q}$. In the unconstrained case (for example), this requires $\|\nabla_x f(x_k)\| \leq \epsilon\delta^{q-1}$, imposing an inordinate level of first-order optimality, much stronger than the standard condition $\|\nabla_x f(x_k)\| \leq \epsilon$. No such difficulty arises with (2.12) because the right-hand side of the condition involves all powers of $\delta$, which is not the case of the right-hand side of (2.14). Note however that the vital continuity properties of $\phi_{f,q}^{\delta}$ are not affected by the choice of the right-hand side, and are thus inherited by (2.12).

3. For given $\delta \in (0,1]$, (2.12) does not guarantee that $\phi_{f,j}^{\delta}(x) \leq \epsilon\chi_j(\delta)$ for $j \in \{1,\ldots,q-1\}$, although the violation of this condition tends to zero with $\delta$(7). This slight blemish can be cured by requiring that $\phi_{f,j}^{\delta}(x) \leq \epsilon\chi_j(\delta)$ for $j \in \{1,\ldots,q\}$ instead of (2.12), but we claim that the benefit of this stronger definition is outweighed by the need to perform $q-1$ additional global optimization computations.

(7) When $\delta$ tends to zero, the terms of orders $j+1$ and higher in the Taylor expansion defining $\phi_{f,q}^{\delta}(x)$ and $\chi_q(\delta)$ become negligible compared to the first $j$.
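For instance, in the unconstrained case with $q = 2$, condition (2.12) can be verified from the gradient and Hessian alone, using the closed forms behind (2.11) rather than a global solve. The following sketch is ours; the upper bound it computes on $\phi_{f,2}^{\delta}$ is exact when the gradient vanishes (cf. (2.7)-(2.8)).

```python
import numpy as np

def chi(q, delta):
    """chi_q(delta) = sum_{l=1}^q delta^l / l!  (see (2.13))."""
    out, fact = 0.0, 1.0
    for l in range(1, q + 1):
        fact *= l
        out += delta**l / fact
    return out

def is_eps_delta_second_order(g, H, eps, delta):
    """Check (2.12) for q = 2 and F = IR^n via the bound
    phi^delta_{f,2} <= ||g||*delta + |min(0, lam_min(H))|*delta^2/2."""
    lam = np.linalg.eigvalsh(H).min()
    phi_bound = np.linalg.norm(g) * delta + 0.5 * max(0.0, -lam) * delta**2
    return phi_bound <= eps * chi(2, delta)

g = np.array([1e-4, 0.0])
H = np.array([[1.0, 0.0], [0.0, -1e-4]])
print(is_eps_delta_second_order(g, H, eps=1e-3, delta=1.0))   # True
```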
We conclude this discussion with a bound on the possible decrease of $f(x)$ in the feasible neighbourhood of an $(\epsilon,\delta)$-approximate $q$-th-order-necessary minimizer.

Theorem 2.2

Suppose that $f$ is $p$ times continuously differentiable and that $\nabla_x^{q} f$ is $\beta$-Hölder continuous with constant $L$ (in the sense of (2.2) with $p = q$) in an open neighbourhood of radius $\delta \in (0,1]$
of some $x \in \mathcal{F}$. Suppose also that $x$ is an $(\epsilon,\delta)$-approximate $q$-th-order-necessary minimizer of $f$ in the sense of (2.12). Then
$$f(x+d) \geq f(x) - 2\epsilon\chi_q(\delta) \quad \text{for all } d \text{ with } x+d\in\mathcal{F} \text{ and } \|d\| \leq \min\left[\delta,\left(\frac{(q+1)!\,\epsilon}{L}\right)^{\frac{1}{q+\beta-1}}\right]. \qquad (2.15)$$

Proof.
Using the triangle inequality, (2.2), (2.4) and (2.12), we obtain that
$$f(x+d) = [f(x+d) - T_q(x,d)] + T_q(x,d) \geq -|f(x+d) - T_q(x,d)| + f(x) - \phi_{f,q}^{\delta}(x) \geq -\frac{L}{(q+1)!}\|d\|^{q+\beta} + f(x) - \epsilon\chi_q(\delta).$$
Thus, if $\|d\| \leq \delta$,
$$f(x+d) \geq f(x) - \frac{L}{(q+1)!}\|d\|^{q+\beta-1}\delta - \epsilon\chi_q(\delta),$$
and the desired bound follows from the radius restriction in (2.15) and the fact that $\delta \leq \chi_q(\delta)$. ✷

In order to find $(\epsilon,\delta)$-approximate $q$-th-order-necessary minimizers, we consider applying a variant of the AR$p$ algorithm to (2.1). This algorithm, described as Algorithm 2.1 below, is of the regularization type in that, at each iterate $x_k$, a step $s_k$ is computed which approximately minimizes (in a sense defined below) the model
$$m_k(s) = T_p(x_k,s) + \frac{\sigma_k}{(p+\beta)!}\|s\|^{p+\beta} \qquad (2.16)$$
subject to $x_k + s \in \mathcal{F}$, where $p$ is an integer such that $p \geq q$ and $\sigma_k \geq \sigma_{\min}$ is a "regularization parameter".

Algorithm 2.1: AR$p$ for $(\epsilon,\delta)$-approximate $q$-th-order-necessary minimizers

Step 0: Initialization. An initial point $x_0 \in \mathcal{F}$ and an initial regularization parameter $\sigma_0 > 0$ are given, as well as an accuracy level $\epsilon \in (0,1)$. The constants $\delta_{-1}$, $\varpi$, $\theta$, $\eta_1$, $\eta_2$, $\gamma_1$, $\gamma_2$, $\gamma_3$ and $\sigma_{\min}$ are also given and satisfy
$$\varpi \in (0,1], \quad \theta > 0, \quad \delta_{-1} \in (0,1], \quad \sigma_{\min} \in (0,\sigma_0], \quad 0 < \eta_1 \leq \eta_2 < 1 \quad \text{and} \quad 0 < \gamma_1 < 1 < \gamma_2 < \gamma_3. \qquad (2.17)$$
Compute $f(x_0)$ and set $k = 0$.

Step 1: Test for termination.
Evaluate $\{\nabla_x^{i} f(x_k)\}_{i=1}^{q}$. If (2.12) holds with $\delta = \delta_{k-1}$, terminate with the approximate solution $x_\epsilon = x_k$. Otherwise compute $\{\nabla_x^{i} f(x_k)\}_{i=q+1}^{p}$.

Step 2: Step calculation.
Attempt to compute a step $s_k$ such that $x_k + s_k \in \mathcal{F}$ and an optimality radius $\delta_k \in (0,1]$
by approximately minimizing the model $m_k(s)$ in the sense that
$$m_k(s_k) < m_k(0) \qquad (2.18)$$
and either
$$\|s_k\| \geq \varpi\,\epsilon^{\frac{1}{p-q+\beta}} \qquad (2.19)$$
or
$$\phi_{m_k,q}^{\delta_k}(s_k) \leq \theta\,\frac{\|s_k\|^{p-q+\beta}}{(p-q+\beta)!}\,\chi_q(\delta_k). \qquad (2.20)$$
If no such step exists, terminate with the approximate solution $x_\epsilon = x_k$.

Step 3: Acceptance of the trial point.
Compute $f(x_k+s_k)$ and define
$$\rho_k = \frac{f(x_k) - f(x_k+s_k)}{T_p(x_k,0) - T_p(x_k,s_k)}. \qquad (2.21)$$
If $\rho_k \geq \eta_1$, then define $x_{k+1} = x_k + s_k$; otherwise define $x_{k+1} = x_k$.

Step 4: Regularization parameter update.
Set
$$\sigma_{k+1} \in \begin{cases} [\max(\sigma_{\min},\gamma_1\sigma_k),\,\sigma_k] & \text{if } \rho_k \geq \eta_2, \\ [\sigma_k,\,\gamma_2\sigma_k] & \text{if } \rho_k \in [\eta_1,\eta_2), \\ [\gamma_2\sigma_k,\,\gamma_3\sigma_k] & \text{if } \rho_k < \eta_1. \end{cases} \qquad (2.22)$$
Increment $k$ by one and go to Step 1 if $\rho_k \geq \eta_1$, or to Step 2 otherwise.

A few comments on this algorithm are useful at this stage.

1. Since $\sigma_k \geq \sigma_{\min}$ by (2.22), we have that $m_k(s)$ is bounded below as a function of $s$, and the existence of a constrained global minimizer $s_k^*$ is guaranteed.

2. Step 2 requires that, for $s_k \neq 0$, we also compute $\delta_k$. This is easy for orders one and two. If $q = 1$, the formula for a global minimizer $s_k^*$ is analytic and $\delta_k = 1$ is then acceptable. The same is true for $q = 2$, where $s_k^*$ can be assessed using a trust-region method whose radius is $\delta_k = 1$ (more details are provided at the end of Section 3). The task is more difficult for higher orders, where one may have to rely on the arguments of Lemma 2.5 below, or use different subproblems with decreasing values of $\delta$. However, none of these computations involve the evaluation of $f$ or its derivatives, and therefore the evaluation complexity bound discussed in this paper is unaffected.

3. That one needs to consider the second case in Step 2 (where no step exists satisfying (2.18)-(2.20)) can be seen by examining the following one-dimensional example. Let $p = q = 3$ and $\beta = 1$, and suppose that $\delta_{k-1} = 1$, $T_q(x_k,s) = s^2 - 2s^3$ and $\sigma_k = 4! = 24$. Then $m_k(s) = s^2 - 2s^3 + s^4 = s^2(s-1)^2$ and the origin is a global minimizer of the model (and a local minimizer of $T_q(x_k,s)$), but yet
$T_q(x_k,1) = -1$, yielding that $\phi_{f,q}^{\delta_{k-1}}(x_k) = 1 > \epsilon\chi_q(1)$ for $\epsilon \leq 1/\chi_q(1) = 3/5$. Thus, Step 1 with $\delta_{k-1} = 1$ has failed to identify that termination was possible. In addition, we see that, at variance with the cases $q = 1$ and $q = 2$,
a global minimizer of the model (2.16) may not, for $q \geq 3$, be a global minimizer of its $q$-th order Taylor expansion in the intersection of $\mathcal{F}$ and a ball of arbitrary radius: we may have to restrict this radius (to $\delta_{k-1} = 1/2$ in our example) for this important property to hold (see Lemma 2.5 below).

4. If (2.19) holds, the possibly expensive computation of $\phi_{m_k,q}^{\delta_k}(s_k)$ in (2.20) is unnecessary and $\delta_k$ may be chosen arbitrarily in $(0,1]$.

6. Each successful iteration requires the evaluation of $f$ and its first $p$ derivative tensors, while only the evaluation of $f$ is needed at unsuccessful ones.

7. The mechanism of the algorithm ensures the non-increasing nature of the sequence $\{f(x_k)\}_{k\geq 0}$.

Iterations for which $\rho_k \geq \eta_1$ (and hence $x_{k+1} = x_k + s_k$) are called "successful" and we denote by
$$\mathcal{S}_k \stackrel{\rm def}{=} \{0 \leq j \leq k \mid \rho_j \geq \eta_1\}$$
the index set of all successful iterations between 0 and $k$. We immediately observe that the total number of iterations (successful or not) can be bounded as a function of the number of successful ones (and include a proof in the appendix).
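The overall flow of Algorithm 2.1 can be summarized by the following schematic loop (our paraphrase, not part of the original text; the model minimization and the termination test (2.12) are abstracted into hypothetical callbacks `solve_subproblem` and `terminated`, and the acceptance ratio and update follow (2.21)-(2.22)):

```python
def arp(f, x0, solve_subproblem, terminated,
        sigma0=1.0, sigma_min=1e-8, eta1=0.1, eta2=0.9,
        gamma1=0.5, gamma2=2.0, gamma3=10.0, max_iter=1000):
    """Schematic AR-p loop: regularized model minimization with a
    trust-region-like acceptance ratio rho_k and the update (2.22)."""
    x, sigma = x0, sigma0
    for _ in range(max_iter):
        if terminated(x):                              # Step 1: test (2.12)
            return x
        # Step 2: returns a step with positive model decrease, cf. (2.18)
        s, model_decrease = solve_subproblem(x, sigma)
        rho = (f(x) - f(x + s)) / model_decrease       # Step 3, ratio (2.21)
        if rho >= eta1:
            x = x + s                                  # successful iteration
        if rho >= eta2:                                # Step 4: update (2.22)
            sigma = max(sigma_min, gamma1 * sigma)     # very successful
        elif rho < eta1:
            sigma = gamma3 * sigma                     # unsuccessful
        else:
            sigma = gamma2 * sigma                     # merely successful
    return x
```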
Lemma 2.3 [2, Theorem 2.4] The mechanism of Algorithm 2.1 guarantees that, if
$$\sigma_k \leq \sigma_{\max} \qquad (2.23)$$
for some $\sigma_{\max} > 0$, then
$$k+1 \leq |\mathcal{S}_k|\left(1 + \frac{|\log\gamma_1|}{\log\gamma_2}\right) + \frac{1}{\log\gamma_2}\log\left(\frac{\sigma_{\max}}{\sigma_0}\right). \qquad (2.24)$$

We also verify that the algorithm is well-defined in the sense that either a step $s_k$ satisfying (2.18)-(2.20) can always be found, or termination is justified. For unconstrained problems and $q \in \{1,2\}$, the first possibility directly results from the observation that $\phi_{m_k,j}^{\delta}(s_k)$ (as given by (2.7)-(2.9) for $f = m_k$ and $j \in \{1,2,3\}$) can be made suitably small at a global minimizer of the model. The situation is more complicated for other cases. In order to clarify it, we first state a useful technical lemma, whose proof is in the appendix.

Lemma 2.4
Let $s$ be a vector of $\mathbb{R}^n$. Then
$$\left\|\nabla_s^{j}\left(\|s\|^{p+\beta}\right)\right\|_{[j]} = \frac{(p+\beta)!}{(p-j+\beta)!}\|s\|^{p-j+\beta} \quad \text{for } j \in \{1,\ldots,p\} \qquad (2.25)$$
and
$$\left\|\nabla_s^{p+1}\left(\|s\|^{p+\beta}\right)\right\|_{[p+1]} = \beta(p+\beta)!\,\|s\|^{\beta-1}. \qquad (2.26)$$

We now provide reasonable sufficient conditions for a nonzero step $s_k$ and an optimality radius $\delta_k$ to satisfy (2.18)-(2.20).

Lemma 2.5
Suppose that $s_k^*$ is a global minimizer of $m_k(s)$ under the constraint that $x_k + s \in \mathcal{F}$, such that $m_k(s_k^*) < m_k(0)$. Then there exist a neighbourhood of $s_k^*$ and a range of sufficiently small $\delta$ such that (2.18) and (2.20) hold for any $s_k$ in the intersection of this neighbourhood with $\mathcal{F}$ and any $\delta_k$ in this range.

Proof.
Let $s_k^*$ be the global minimizer of the model $m_k(s)$ over all $s$ such that $x_k + s \in \mathcal{F}$. Since $m_k(s_k^*) < m_k(0)$, we have that $s_k^* \neq 0$. By Taylor's theorem, we have that, for all $d$,
$$0 \leq m_k(s_k^*+d) - m_k(s_k^*) = \sum_{\ell=1}^{p}\frac{1}{\ell!}\nabla_s^{\ell} m_k(s_k^*)[d]^{\ell} + \frac{1}{(p+1)!}\nabla_s^{p+1} m_k(s_k^*+\xi d)[d]^{p+1}$$
for some $\xi \in (0,1)$, and hence, using (2.26), that
$$-\sum_{\ell=1}^{q}\frac{1}{\ell!}\nabla_s^{\ell} m_k(s_k^*)[d]^{\ell} \leq \sum_{\ell=q+1}^{p}\frac{\|d\|^{\ell}}{\ell!}\|\nabla_s^{\ell} m_k(s_k^*)\|_{[\ell]} + \frac{\|d\|^{p+1}}{(p+1)!}\|\nabla_s^{p+1} m_k(s_k^*+\xi d)\|_{[p+1]} = \sum_{\ell=q+1}^{p}\frac{\|d\|^{\ell}}{\ell!}\|\nabla_s^{\ell} m_k(s_k^*)\|_{[\ell]} + \frac{\beta\sigma_k\|d\|^{p+1}}{(p+1)!}\|s_k^*+\xi d\|^{\beta-1}. \qquad (2.27)$$
Since $s_k^* \neq 0$, we may then choose $\delta_k < \frac{1}{2}\|s_k^*\|$ such that, for every $d$ with $\|d\| \leq \delta_k$, $\|s_k^*+\xi d\| \geq \frac{1}{2}\|s_k^*\| > 0$ and
$$\sum_{\ell=q+1}^{p}\frac{\|d\|^{\ell}}{\ell!}\|\nabla_s^{\ell} m_k(s_k^*)\|_{[\ell]} + \frac{2^{1-\beta}\beta\sigma_k\|d\|^{p+1}}{(p+1)!}\|s_k^*\|^{\beta-1} \leq \frac{\theta\|s_k^*\|^{p-q+\beta}}{(p-q+\beta)!}\|d\|. \qquad (2.28)$$
Hence we deduce from (2.27) and (2.28) that, for $\|d\| \leq \delta_k$,
$$-\sum_{\ell=1}^{q}\frac{1}{\ell!}\nabla_s^{\ell} m_k(s_k^*)[d]^{\ell} \leq \frac{\theta\|s_k^*\|^{p-q+\beta}}{(p-q+\beta)!}\,\delta_k \leq \frac{\theta\|s_k^*\|^{p-q+\beta}}{(p-q+\beta)!}\,\chi_q(\delta_k).$$
The continuity of $m_k$ and its derivatives and the inequality $m_k(s_k^*) < m_k(0)$ then imply that there exists a neighbourhood of $s_k^* \neq 0$ such that (2.18) holds and
$$-\sum_{\ell=1}^{q}\frac{1}{\ell!}\nabla_s^{\ell} m_k(s)[d]^{\ell} \leq \frac{\theta\|s\|^{p-q+\beta}}{(p-q+\beta)!}\,\chi_q(\delta_k)$$
for all $s$ in this neighbourhood and all $d$ with $\|d\| \leq \delta_k$. This yields that, for all such $s$ with $x_k + s \in \mathcal{F}$,
$$\phi_{m_k,q}^{\delta_k}(s) = \max\left[0,\ \mathop{\rm globmax}_{\substack{\|d\|\leq\delta_k \\ x_k+s+d\in\mathcal{F}}}\left(-\sum_{\ell=1}^{q}\frac{1}{\ell!}\nabla_s^{\ell} m_k(s)[d]^{\ell}\right)\right] \leq \frac{\theta\|s\|^{p-q+\beta}}{(p-q+\beta)!}\,\chi_q(\delta_k),$$
as requested. ✷

As can be seen in the proof of this lemma, $\delta_k$ may need to be small if any of the tensors
$$\nabla_s^{\ell} m_k(s_k^*) = \sum_{j=\ell}^{p}\frac{1}{(j-\ell)!}\nabla_s^{j} m_k(0)[s_k^*]^{j-\ell} \quad \text{for } \ell \in \{q+1,\ldots,p+1\}$$
has a large norm. This may occur in particular if $\beta$ and $\|s_k^*\|$ are both close to zero, as is shown by the last term in the left-hand side of (2.28). We also note that (2.20) obviously holds for $s_k = s_k^*$ if $x_k + s_k^*$ is an isolated feasible point. It now remains to verify that it is justified to terminate in Step 2 when no suitable nonzero step can be found.

Lemma 2.6
Suppose that the algorithm terminates in Step 2 of iteration $k$ with $x_\epsilon = x_k$. Then there exists a $\delta \in (0,1]$
such that (2.12) holds for $x = x_\epsilon$, and $x_\epsilon$ is an $(\epsilon,\delta)$-approximate $q$-th-order-necessary minimizer.

Proof.
Given Lemma 2.5, if the algorithm terminates within Step 2, it must be because every global minimizer $s_k^*$ of $m_k(s)$ under the constraint $x_k + s \in \mathcal{F}$ is such that $m_k(s_k^*) \geq m_k(0)$. In that case, $s_k^* = 0$ is one such global minimizer and we have that, for all $d$ with $x_k + d \in \mathcal{F}$,
$$0 \leq m_k(d) - m_k(0) = \sum_{\ell=1}^{q}\frac{1}{\ell!}\nabla_x^{\ell} f(x_k)[d]^{\ell} + \sum_{\ell=q+1}^{p}\frac{1}{\ell!}\nabla_x^{\ell} f(x_k)[d]^{\ell} + \frac{\sigma_k}{(p+\beta)!}\|d\|^{p+\beta}.$$
We may now choose $\delta \in (0,1]$
small enough to ensure that, for all $d$ with $\|d\| \leq \delta$,
$$\left|\sum_{\ell=q+1}^{p}\frac{1}{\ell!}\nabla_x^{\ell} f(x_k)[d]^{\ell} + \frac{\sigma_k}{(p+\beta)!}\|d\|^{p+\beta}\right| \leq \epsilon\|d\| \leq \epsilon\,\chi_q(\delta), \qquad (2.29)$$
which in turn implies that
$$\phi_{f,q}^{\delta}(x_k) = \max\left[0,\ \mathop{\rm globmax}_{\substack{\|d\|\leq\delta \\ x_k+d\in\mathcal{F}}}\left(-\sum_{\ell=1}^{q}\frac{1}{\ell!}\nabla_x^{\ell} f(x_k)[d]^{\ell}\right)\right] \leq \epsilon\,\chi_q(\delta),$$
and hence that (2.12) holds at $x_\epsilon = x_k$. ✷

Observe that, in this proof, we could have chosen $\delta$ small enough to ensure
$$\frac{\sigma_k}{(p+\beta)!}\|d\|^{p+\beta} \leq \epsilon\,\chi_p(\delta)$$
instead of (2.29), yielding $\phi_{f,p}^{\delta}(x_k) \leq \epsilon\chi_p(\delta)$, which is a stronger necessary optimality condition than (2.12). Together, Lemmas 2.5 and 2.6 ensure that Algorithm 2.1 is well-defined.

3 Evaluation complexity

The proofs of the following two lemmas are very similar to corresponding results in [2] and hence we again defer them to the appendix (but still include them for completeness, as the algorithm has changed).
Lemma 3.1
The mechanism of Algorithm 2.1 guarantees that, for all $k \geq 0$,
$$T_p(x_k,0) - T_p(x_k,s_k) \geq \frac{\sigma_k}{(p+\beta)!}\|s_k\|^{p+\beta}, \qquad (3.1)$$
and so (2.21) is well-defined.

Lemma 3.2
Let $f \in C^{p,\beta}(\mathbb{R}^n)$. Then, for all $k \geq 0$,
$$\sigma_k \leq \sigma_{\max} \stackrel{\rm def}{=} \max\left[\sigma_0,\ \frac{\gamma_3 L}{1-\eta_2}\right]. \qquad (3.2)$$

We are now in position to prove the crucial lower bound on the step length.

Lemma 3.3
Let $f \in C^{p,\beta}(\mathbb{R}^n)$. Then, for all $k \geq 0$ such that the algorithm does not terminate at iteration $k+1$,
$$\|s_k\| \geq \kappa_s\,\epsilon^{\frac{1}{p-q+\beta}}, \qquad (3.3)$$
where
$$\kappa_s \stackrel{\rm def}{=} \min\left[\varpi,\ \left(\frac{(p-q+\beta)!}{L+\sigma_{\max}+\theta}\right)^{\frac{1}{p-q+\beta}}\right]. \qquad (3.4)$$

Proof. If $\|s_k\| > \varpi\epsilon^{\frac{1}{p-q+\beta}}$, the result is obvious. Suppose now that $\|s_k\| \leq \varpi\epsilon^{\frac{1}{p-q+\beta}}$. Since the algorithm does not terminate at iteration $k+1$, we have that
$$\phi_{f,q}^{\delta_k}(x_{k+1}) > \epsilon\,\chi_q(\delta_k). \qquad (3.5)$$
Let the global optimum defining $\phi_{f,q}^{\delta_k}(x_{k+1})$ in (2.6) be achieved at $d$ with $\|d\| \leq \delta_k$. Since $\phi_{f,q}^{\delta_k}(x_{k+1}) > 0$,
we have from (2.6) that
$$\sum_{\ell=1}^{q}\frac{1}{\ell!}\nabla_x^{\ell} f(x_{k+1})[d]^{\ell} < 0.$$
Then, using the Taylor expansion of $f$ at $x_{k+1}$, the triangle inequality, (2.16), (1.1), (2.5) and (2.25), we deduce that
$$\phi_{f,q}^{\delta_k}(x_{k+1}) = -\sum_{\ell=1}^{q}\frac{1}{\ell!}\nabla_x^{\ell} f(x_{k+1})[d]^{\ell} = -\sum_{\ell=1}^{q}\frac{1}{\ell!}\left[\nabla_x^{\ell} f(x_{k+1}) - \nabla_s^{\ell} T_p(x_k,s_k)\right][d]^{\ell} - \sum_{\ell=1}^{q}\frac{1}{\ell!}\nabla_s^{\ell} m_k(s_k)[d]^{\ell} + \frac{\sigma_k}{(p+\beta)!}\sum_{\ell=1}^{q}\frac{1}{\ell!}\left(\nabla_s^{\ell}\left[\|s\|^{p+\beta}\right]_{s=s_k}\right)[d]^{\ell}$$
$$\leq \left|\sum_{\ell=1}^{q}\frac{1}{\ell!}\left[\nabla_x^{\ell} f(x_{k+1}) - \nabla_s^{\ell} T_p(x_k,s_k)\right][d]^{\ell}\right| - \sum_{\ell=1}^{q}\frac{1}{\ell!}\nabla_s^{\ell} m_k(s_k)[d]^{\ell} + \frac{\sigma_k}{(p+\beta)!}\left|\sum_{\ell=1}^{q}\frac{1}{\ell!}\left(\nabla_s^{\ell}\left[\|s\|^{p+\beta}\right]_{s=s_k}\right)[d]^{\ell}\right|$$
$$\leq \sum_{\ell=1}^{q}\frac{L}{\ell!(p-\ell+\beta)!}\|s_k\|^{p-\ell+\beta}\delta_k^{\ell} - \sum_{\ell=1}^{q}\frac{1}{\ell!}\nabla_s^{\ell} m_k(s_k)[d]^{\ell} + \sum_{\ell=1}^{q}\frac{\sigma_k}{\ell!(p-\ell+\beta)!}\|s_k\|^{p-\ell+\beta}\delta_k^{\ell}. \qquad (3.6)$$
Now, since $\|d\| \leq \delta_k$, and using (2.6) for $m_k$ at $s_k$,
$$-\sum_{\ell=1}^{q}\frac{1}{\ell!}\nabla_s^{\ell} m_k(s_k)[d]^{\ell} \leq \max\left[0,\ -\sum_{\ell=1}^{q}\frac{1}{\ell!}\nabla_s^{\ell} m_k(s_k)[d]^{\ell}\right] \leq \phi_{m_k,q}^{\delta_k}(s_k).$$
Therefore, using (2.20) and (3.6), we have that
$$\phi_{f,q}^{\delta_k}(x_{k+1}) \leq \sum_{\ell=1}^{q}\frac{L}{\ell!(p-\ell+\beta)!}\|s_k\|^{p-\ell+\beta}\delta_k^{\ell} + \theta\,\frac{\chi_q(\delta_k)}{(p-q+\beta)!}\|s_k\|^{p-q+\beta} + \sum_{\ell=1}^{q}\frac{\sigma_k}{\ell!(p-\ell+\beta)!}\|s_k\|^{p-\ell+\beta}\delta_k^{\ell} \leq \left[L+\sigma_k+\theta\right]\frac{\chi_q(\delta_k)}{(p-q+\beta)!}\|s_k\|^{p-q+\beta}, \qquad (3.7)$$
where we have used the fact that $\|s_k\| \leq \varpi\epsilon^{\frac{1}{p-q+\beta}} \leq 1$. Combining (3.5) and (3.7) then gives that
$$\|s_k\| \geq \left[\frac{\epsilon\,(p-q+\beta)!}{L+\sigma_k+\theta}\right]^{\frac{1}{p-q+\beta}},$$
and (3.3) follows from (3.2) and the definition (3.4). ✷

The bound given by this lemma is another indication that choosing $\theta$ of the order of $L$ (when this is known a priori) makes sense.
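To make the order in $\epsilon$ of the forthcoming bound concrete, the following numerical illustration (ours, with the problem-dependent constant set to one) evaluates the dominant term $\epsilon^{-(p+\beta)/(p-q+\beta)}$ of the evaluation bound for a few choices of $p$ and $q$.

```python
def eval_bound(eps, p, q, beta=1.0):
    """Dominant term eps**(-(p+beta)/(p-q+beta)) of the evaluation
    complexity bound, with the constant kappa_p set to 1."""
    return eps ** (-(p + beta) / (p - q + beta))

for (p, q) in [(1, 1), (2, 1), (2, 2), (3, 2)]:
    print(p, q, f"{eval_bound(1e-4, p, q):.3g}")
# p=1,q=1: eps^-2   = 1e8;   p=2,q=1: eps^-3/2 = 1e6;
# p=2,q=2: eps^-3   = 1e12;  p=3,q=2: eps^-2   = 1e8.
```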
We now combine all the above results to deduce an upper bound on the maximum number of successful iterations, from which a final complexity bound immediately follows.

Theorem 3.4

Let $f \in C^{p,\beta}(\mathbb{R}^n)$. Then, given $\epsilon \in (0,1)$, Algorithm 2.1 needs at most
$$\left\lfloor \kappa_p\,(f(x_0)-f_{\rm low})\,\epsilon^{-\frac{p+\beta}{p-q+\beta}} \right\rfloor + 1$$
successful iterations (each involving one evaluation of $f$ and its $p$ first derivatives) and at most
$$\left\lfloor \left(\left\lfloor \kappa_p\,(f(x_0)-f_{\rm low})\,\epsilon^{-\frac{p+\beta}{p-q+\beta}} \right\rfloor + 1\right)\left(1 + \frac{|\log\gamma_1|}{\log\gamma_2}\right) + \frac{1}{\log\gamma_2}\log\left(\frac{\sigma_{\max}}{\sigma_0}\right) \right\rfloor \qquad (3.8)$$
iterations in total to produce an iterate $x_\epsilon$ such that (2.12) holds, where $\sigma_{\max}$ is given by (3.2) and where
$$\kappa_p \stackrel{\rm def}{=} \frac{(p+\beta)!}{\eta_1\sigma_{\min}}\max\left[\varpi^{-(p+\beta)},\ \left(\frac{L+\sigma_{\max}+\theta}{(p-q+\beta)!}\right)^{\frac{p+\beta}{p-q+\beta}}\right].$$

Proof.
At each successful iteration $k$ before termination, we have the guaranteed decrease
$$f(x_k) - f(x_{k+1}) \geq \eta_1\left(T_p(x_k,0) - T_p(x_k,s_k)\right) \geq \frac{\eta_1\sigma_{\min}}{(p+\beta)!}\|s_k\|^{p+\beta}, \qquad (3.9)$$
where we used (2.21), (3.1) and (2.22). Moreover, we deduce from (3.9), (3.3) and (3.2) that
$$f(x_k) - f(x_{k+1}) \geq \kappa_p^{-1}\,\epsilon^{\frac{p+\beta}{p-q+\beta}}, \quad \text{where} \quad \kappa_p^{-1} \stackrel{\rm def}{=} \frac{\eta_1\sigma_{\min}\kappa_s^{p+\beta}}{(p+\beta)!}. \qquad (3.10)$$
Thus, since $\{f(x_k)\}$ decreases monotonically,
$$f(x_0) - f(x_{k+1}) \geq \kappa_p^{-1}\,\epsilon^{\frac{p+\beta}{p-q+\beta}}\,|\mathcal{S}_k|.$$
Using that $f$ is bounded below by $f_{\rm low}$, we conclude that
$$|\mathcal{S}_k| \leq \kappa_p\,(f(x_0)-f_{\rm low})\,\epsilon^{-\frac{p+\beta}{p-q+\beta}} \qquad (3.11)$$
until termination. The desired bound on the number of successful iterations follows from (3.11), and Lemma 2.3 is then invoked to compute the upper bound on the total number of iterations. ✷

If the $p$-th derivative of $f$ is assumed to be globally Lipschitz rather than merely Hölder continuous (i.e. if $\beta = 1$), the bound (3.8) on the maximum number of evaluations becomes
$$\left\lfloor \left(\left\lfloor \kappa_p\,(f(x_0)-f_{\rm low})\,\epsilon^{-\frac{p+1}{p-q+1}} \right\rfloor + 1\right)\left(1 + \frac{|\log\gamma_1|}{\log\gamma_2}\right) + \frac{1}{\log\gamma_2}\log\left(\frac{\sigma_{\max}}{\sigma_0}\right) \right\rfloor \qquad (3.12)$$
where
$$\kappa_p \stackrel{\rm def}{=} \frac{(p+1)!}{\eta_1\sigma_{\min}}\max\left[\varpi^{-(p+1)},\ \left(\frac{L+\sigma_{\max}+\theta}{(p-q+1)!}\right)^{\frac{p+1}{p-q+1}}\right].$$
This worst-case evaluation bound generalizes known bounds for $q = 1$ (see [2]) or $q = 2$ (see [8]) and significantly improves upon the bounds in $O(\epsilon^{-(q+1)})$ given by [11] for a more stringent termination rule. It also extends the results obtained in [6] for convexly-constrained problems with $q = 1$ by allowing the significantly broader class of inexpensive constraints.

We also note that it is possible to weaken the assumption that $\nabla_x^p f$ must satisfy the Hölder inequality (2.2) for every $x, y \in \mathbb{R}^n$ (as required in the beginning of Section 2). The weakest possible smoothness assumption is to require that (2.2) holds only for points belonging to the same segment of the "path of iterates" $\cup_{k\geq 0}[x_k,x_{k+1}]$ (this is necessary for the proof of Lemma 2.1). As this path joining feasible iterates may be hard to predict a priori, one may instead use the monotonic character of Algorithm 2.1 and require (2.2) to hold for all $x, y$ in the intersection of $\mathcal{F}$ with the level set $\{x \in \mathbb{R}^n \mid f(x) \leq f(x_0)\}$. Again, it may be hard to determine this set and to ensure that it contains the path of iterates, and one may then resort to requiring (2.2) to hold in the whole of $\mathcal{F}$, which must then be convex to ensure the desired Hölder property on every segment $[x_k,x_{k+1}]$.

4 Finding $\epsilon$-approximate second-order-necessary minimizers

We now discuss the particular and much-studied case where second-order minimizers are sought for unconstrained problems with Lipschitz continuous Hessians (that is $p \geq q = 2$, $\mathcal{F} = \mathbb{R}^n$ and $\beta = 1$). As we now show, a specialization of Algorithm 2.1 to this case is very close (but not identical) to well-known methods. Let us consider Step 1 first.
The computation of $\phi_{f,2}^{\delta_{k-1}}(x_k)$ then reduces to
$$\phi_{f,2}^{\delta_{k-1}}(x_k) = \max\left[0,\ -\mathop{\rm globmin}_{\|d\|\leq\delta_{k-1}}\left(\nabla_x f(x_k)^T d + \tfrac{1}{2}d^T\nabla_x^2 f(x_k)\,d\right)\right], \qquad (4.1)$$
which amounts to solving a standard trust-region subproblem with radius $\delta_{k-1}$ (see [13]). Hence verifying (4.1) and testing the more usual approximate second-order criterion
$$\|\nabla_x f(x_k)\| \leq \epsilon \quad \text{and} \quad \lambda_{\min}\left(\nabla_x^2 f(x_k)\right) \geq -\epsilon \qquad (4.2)$$
have very similar numerical costs (remember that finding the leftmost eigenvalue of the Hessian is the same as finding the global minimizer of the associated Rayleigh quotient). If we now turn to the computation of $s_k$ in Step 2, Algorithm 2.1 then computes such a step by attempting to minimize the model
$$T_p(x_k,s) + \frac{\sigma_k}{(p+1)!}\|s\|^{p+1}, \qquad (4.3)$$
as in the adaptive regularization framework using models of degree $p$ [2, 8].
Moreover, the failure of (2.12) in Step 1 is enough, when $q \leq 2$, to guarantee the existence of nonzero global minimizers of $T_p(x_k,s)$ and $m_k(s)$, and thus to ensure that a nonzero $s_k$ is possible. The approximate model minimization is stopped as soon as (2.19) or (2.20) holds, the latter then reducing to checking that
$$\phi_{m_k,2}^{\delta}(s_k) = \max\left[0,\ -\mathop{\rm globmin}_{\|d\|\leq\delta}\left(\nabla_s m_k(s_k)^T d + \tfrac{1}{2}d^T\nabla_s^2 m_k(s_k)\,d\right)\right] \leq \theta\,\frac{\|s_k\|^{p-1}}{(p-1)!}\,\chi_2(\delta) \qquad (4.4)$$
for some $\delta \in (0,1]$.
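Since both the termination test (4.1) and the checks just described reduce to standard trust-region subproblems, we include a minimal dense solver sketch (ours, not part of the original text): a textbook bisection on the multiplier of the secular equation, adequate only for small dense Hessians and away from the so-called hard case.

```python
import numpy as np

def trust_region_global_min(g, H, delta, tol=1e-10):
    """Global min of g'd + 0.5 d'H d over ||d|| <= delta (H symmetric dense),
    by bisection on lam >= max(0, -lam_min(H)) in (H + lam I) d = -g;
    the hard case (g orthogonal to the leftmost eigenvector) is not treated."""
    lam_min = np.linalg.eigvalsh(H).min()
    lo = max(0.0, -lam_min)
    if lam_min > 0:                          # try the interior solution first
        d = np.linalg.solve(H, -g)
        if np.linalg.norm(d) <= delta:
            return g @ d + 0.5 * d @ H @ d
    hi = lo + np.linalg.norm(g) / delta + 1.0    # ||d(hi)|| <= delta is guaranteed
    d = np.zeros_like(g)
    for _ in range(200):
        lam = 0.5 * (lo + hi)
        d = np.linalg.solve(H + lam * np.eye(len(g)), -g)
        if np.linalg.norm(d) > delta:
            lo = lam
        else:
            hi = lam
        if hi - lo < tol:
            break
    return g @ d + 0.5 * d @ H @ d
```

The value returned is the global minimum of the quadratic; negating it gives the quantity compared with the right-hand sides of (4.1) and (4.4).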
For a given $s_k$, finding such a $\delta \in (0,1]$ requires solving (possibly approximately)
$$-\mathop{\rm globmin}_{\|d\|\leq\delta}\left(\nabla_s m_k(s_k)^T d + \tfrac{1}{2}d^T\nabla_s^2 m_k(s_k)\,d\right) \leq \theta\,\frac{\|s_k\|^{p-1}}{(p-1)!}\,\chi_2(\delta).$$
While this could be acceptable without affecting the overall evaluation complexity of the algorithm, a simpler alternative is available for $q = 2$. We may consider terminating the model minimization when either (2.19) holds, or
$$0 \geq \mathop{\rm globmin}_{\|d\|\leq 1}\left(\nabla_s m_k(s_k)^T d + \tfrac{1}{2}d^T\nabla_s^2 m_k(s_k)\,d\right) \geq -\theta\,\frac{\|s_k\|^{p-1}}{(p-1)!}\,\chi_2(1) = -\frac{3\,\theta\,\|s_k\|^{p-1}}{2\,(p-1)!}. \qquad (4.5)$$
The inequality is guaranteed to hold when $s_k$ is close enough to $s_k^*$, a global minimizer of the model $m_k(s)$, since then $\nabla_s m_k(s_k^*) = 0$ and $\nabla_s^2 m_k(s_k^*)$ is positive semi-definite, and then $d = 0$ provides the global minimizer of the second-order Taylor model of $m_k(s)$ around $s_k$. Verifying (4.5) only requires at most one trust-region calculation for each potential step and ensures (4.4) with $\delta = 1$, making the choice $\delta_k = 1$ acceptable. The cost of this technique is comparable to that proposed in [8], where an eigenvalue computation is required for each potential step. Combining these observations, Algorithm 2.1 then becomes Algorithm 4.1.

If $p = q = 2$, computing $s_k$ in Step 2 amounts to approximately minimizing the now well-known cubic model of [15, 19, 22, 5]. In addition, if $s_k$ is the exact global minimizer of this model, the above argument shows that (4.5) automatically holds at $s_k$, and checking this inequality by solving a trust-region subproblem is thus unnecessary. The only difference between our proposed algorithm and the more usual cubic regularization (ARC) method with exact global minimization is that the latter would check (4.2) for termination, while the algorithm presented here would instead check (4.1) with $\delta_{k-1} = 1$ by solving a trust-region subproblem. As observed above, both techniques have comparable numerical cost.

The bound (3.12) then ensures that Algorithm 4.1 terminates in at most $O(\epsilon^{-\frac{p+1}{p-1}})$ evaluations of $f$, its gradient and Hessian. This algorithm thus shares(8) the upper complexity bounds stated in [8] for general $p$ with different values of $\epsilon$ for first- and second-order, and in [19, 5] for $p = 2$.

(8) For a marginally weaker (see footnote (7) and Theorem 2.2) but still necessary and, in our view, more sensible approximate optimality condition.

Algorithm 4.1: AR$p$ for $\epsilon$-approximate second-order-necessary minimizers

Step 0: Initialization. An initial point $x_0 \in \mathcal{F}$ and an initial regularization parameter $\sigma_0 > 0$ are given, as well as an accuracy level $\epsilon \in (0,1)$. The constants $\varpi$, $\theta$, $\eta_1$, $\eta_2$, $\gamma_1$, $\gamma_2$, $\gamma_3$ and $\sigma_{\min}$ are also given and satisfy (2.17). Compute $f(x_0)$ and set $k = 0$.

Step 1: Test for termination.
Evaluate $\{\nabla_x^{i} f(x_k)\}_{i=1}^{2}$. If (2.12) holds with $\phi_{f,2}^{1}(x_k)$ given by (4.1) and $\delta_{k-1} = 1$, terminate with the approximate solution $x_\epsilon = x_k$. Otherwise compute $\{\nabla_x^{i} f(x_k)\}_{i=3}^{p}$.

Step 2: Step calculation.
Compute a step $s_k \neq 0$ by approximately minimizing the model (4.3) in the sense that (2.18) holds and either $\|s_k\| \geq \varpi\,\epsilon^{\frac{1}{p-1}}$ or (4.5) holds.

Step 3: Acceptance of the trial point.
Compute $f(x_k+s_k)$ and define $\rho_k$ as in (2.21). If $\rho_k \geq \eta_1$, then define $x_{k+1} = x_k + s_k$; otherwise define $x_{k+1} = x_k$.

Step 4: Regularization parameter update.
Compute $\sigma_{k+1}$ as in (2.22). Increment $k$ by one and go to Step 1 if $\rho_k \geq \eta_1$, or to Step 2 otherwise.

5 Sharpness of the complexity bound

We now intend to show that the upper bound on evaluation complexity of Theorem 3.4 is tight in terms of the order given for unconstrained and a broad class of constrained problems with Lipschitz continuous $p$-th derivative (i.e. $\beta = 1$(9)). This objective is attained by defining a variant of the high-degree Hermite interpolation technique developed in [11], and then using this technique to build, for any number $p$ of available derivatives of the objective function and any optimality order $q$, an unconstrained univariate example of suitably slow convergence (i.e. for which the order in $\epsilon$ given by (3.12) is achieved). This example is then embedded in higher dimensions to provide general lower bounds.

(9) An example of slow convergence for general $\beta$ and $p > \beta$ is provided in [9].

5.1 Hermite interpolation

We start by investigating some useful properties of Hermite interpolation. Let us assume that we wish to construct a univariate Hermite interpolant $\pi$ of degree $2p+1$ of the form
$$\pi(\tau) = \sum_{i=0}^{2p+1} c_i\,\tau^i \qquad (5.1)$$
on the interval $[0,s]$ satisfying the $2(p+1)$ conditions
$$\pi^{(i)}(0) = f_0^{(i)}, \quad \pi^{(i)}(s) = f_1^{(i)} \quad \text{for } i \in \{0,\ldots,p\}, \qquad (5.2)$$
where the $f_0^{(i)}$ and $f_1^{(i)}$ are given. The values of the coefficients $c_0,\ldots,c_p$ may then be obtained by
$$c_i = \frac{f_0^{(i)}}{i!} \quad \text{for } i \in \{0,\ldots,p\},$$
while the remaining ones satisfy the linear system
$$\begin{pmatrix} a_{0,0}s^{p+1} & a_{0,1}s^{p+2} & \cdots & a_{0,p-1}s^{2p} & a_{0,p}s^{2p+1} \\ a_{1,0}s^{p} & a_{1,1}s^{p+1} & \cdots & a_{1,p-1}s^{2p-1} & a_{1,p}s^{2p} \\ \vdots & \vdots & & \vdots & \vdots \\ a_{p,0}s & a_{p,1}s^{2} & \cdots & a_{p,p-1}s^{p} & a_{p,p}s^{p+1} \end{pmatrix}\begin{pmatrix} c_{p+1} \\ c_{p+2} \\ \vdots \\ c_{2p+1} \end{pmatrix} = \begin{pmatrix} f_1^{(0)} - T_p^{(0)}(0,s) \\ f_1^{(1)} - T_p^{(1)}(0,s) \\ \vdots \\ f_1^{(p)} - T_p^{(p)}(0,s) \end{pmatrix} \qquad (5.3)$$
where
$$T_p(0,s) = \sum_{i=0}^{p}\frac{f_0^{(i)}}{i!}s^i \quad \text{and} \quad a_{i,j} = \frac{(p+j+1)!}{(p+j+1-i)!} \quad (i,j = 0,\ldots,p).$$
Observe that (5.3) can be rewritten as
$$\mathop{\rm diag}(s^p, s^{p-1}, \ldots, 1)\; A_p\; \mathop{\rm diag}(s, s^2, \ldots, s^{p+1})\begin{pmatrix} c_{p+1} \\ c_{p+2} \\ \vdots \\ c_{2p+1} \end{pmatrix} = \begin{pmatrix} f_1^{(0)} - T_p^{(0)}(0,s) \\ f_1^{(1)} - T_p^{(1)}(0,s) \\ \vdots \\ f_1^{(p)} - T_p^{(p)}(0,s) \end{pmatrix},$$
where $A_p$ is the matrix whose $(i,j)$-th entry is $a_{i,j}$, which only depends on $p$. It was shown in [11, Appendix] that $A_p$ is nonsingular. Therefore
$$\begin{pmatrix} c_{p+1}s \\ c_{p+2}s^2 \\ \vdots \\ c_{2p+1}s^{p+1} \end{pmatrix} = A_p^{-1}\begin{pmatrix} s^{-p}\,[f_1^{(0)} - T_p^{(0)}(0,s)] \\ s^{-(p-1)}\,[f_1^{(1)} - T_p^{(1)}(0,s)] \\ \vdots \\ f_1^{(p)} - T_p^{(p)}(0,s) \end{pmatrix}.$$
We therefore deduce that, for any $\tau \in [0,s]$,
$$|\pi^{(p+1)}(\tau)| = \left|\sum_{i=0}^{p}\frac{(p+1+i)!}{i!}c_{p+1+i}\,\tau^i\right| \leq \sum_{i=0}^{p}\frac{(p+1+i)!}{i!}\left(|c_{p+1+i}|\,s^{i+1}\right)s^{-1} \leq \frac{(p+1)(2p+1)!}{p!}\,\|A_p^{-1}\|_\infty\,\max_{j=0,\ldots,p}\left|\frac{f_1^{(j)} - T_p^{(j)}(0,s)}{s^{p-j+1}}\right|.$$
The mean-value theorem then implies that, for any $0 \leq \tau_1 \leq \tau_2 \leq s$ and some $\xi \in [\tau_1,\tau_2] \subseteq [0,s]$,
$$\frac{|\pi^{(p)}(\tau_1) - \pi^{(p)}(\tau_2)|}{|\tau_1-\tau_2|} = |\pi^{(p+1)}(\xi)| \leq \max_{\tau\in[0,s]}|\pi^{(p+1)}(\tau)| \leq \frac{(p+1)(2p+1)!}{p!}\,\|A_p^{-1}\|_\infty\,\max_{j=0,\ldots,p}\left|\frac{f_1^{(j)} - T_p^{(j)}(0,s)}{s^{p-j+1}}\right|. \qquad (5.4)$$
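The interpolant of (5.1)-(5.2) is easy to compute numerically. Rather than forming the reduced system (5.3) with the matrix $A_p$, the following sketch (ours) imposes all $2(p+1)$ conditions directly and checks the construction on a small example.

```python
import numpy as np
from math import factorial

def hermite_coeffs(f0, f1, s):
    """Coefficients c_0..c_{2p+1} of pi(tau) = sum_i c_i tau^i with
    pi^(i)(0) = f0[i] and pi^(i)(s) = f1[i] for i = 0..p  (cf. (5.2))."""
    p = len(f0) - 1
    n = 2 * (p + 1)
    A, b = np.zeros((n, n)), np.zeros(n)
    for i in range(p + 1):
        # condition pi^(i)(0) = i! * c_i
        A[i, i], b[i] = factorial(i), f0[i]
        # condition pi^(i)(s) = sum_{j>=i} j!/(j-i)! * c_j * s^(j-i)
        for j in range(i, n):
            A[p + 1 + i, j] = factorial(j) // factorial(j - i) * s ** (j - i)
        b[p + 1 + i] = f1[i]
    return np.linalg.solve(A, b)

# interpolate f0 = (0, 0) and f1 = (1, 0) with p = 1 on [0, 1]:
# the result is the cubic 3 tau^2 - 2 tau^3, i.e. coefficients [0, 0, 3, -2]
print(hermite_coeffs([0.0, 0.0], [1.0, 0.0], 1.0))
```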
Theorem 5.1

Suppose that $\{f_\ell^{(j)}\}$ are given for $\ell \in \{0,1\}$ and $j \in \{0,\ldots,p\}$. Suppose also that there exists a constant $\kappa_f \geq 0$ such that, for $j \in \{0,1,\ldots,p\}$,
$$|f_1^{(j)} - T_p^{(j)}(0,s)| \leq \kappa_f\,s^{p-j+1}. \qquad (5.5)$$
Then the Hermite interpolation polynomial $\pi(\tau)$ on $[0,s]$ given by (5.1) and satisfying (5.2) admits a Lipschitz continuous $p$-th derivative on $[0,s]$, with Lipschitz constant given by
$$L_p \stackrel{\rm def}{=} \frac{(p+1)(2p+1)!}{p!}\,\|A_p^{-1}\|_\infty\,\kappa_f,$$
which only depends on $p$ and $\kappa_f$.

Proof.
This directly results from (5.4) and (5.5). ✷

Observe that (5.5) is identical to (2.5) when $\beta = 1$ and $n = 1$. This means that the conditions of Theorem 5.1 automatically hold if the interpolation data $\{f_i^{(j)}\}$ is itself extracted from a function having a Lipschitz continuous $p$-th derivative.

Applying the above results to several interpolation intervals then yields the existence of a smooth Hermite interpolant.

Theorem 5.2
Suppose that, for some integer $k_e > p >$
$0$, the data $\{f_k^{(j)}\}$ and $\{x_k\}$ are given for $k \in \{0,\ldots,k_e\}$ and $j \in \{0,\ldots,p\}$. Suppose also that $s_k = x_{k+1} - x_k \in (0,\kappa_s]$ for $k \in \{0,\ldots,k_e-1\}$ and some $\kappa_s >$
$0$, and that, for some constant $\kappa_f \geq 0$ and all $k \in \{0,\ldots,k_e-1\}$,
$$|f_{k+1}^{(j)} - T_{k,p}^{(j)}(x_k,s_k)| \leq \kappa_f\,s_k^{p-j+1}, \qquad (5.6)$$
where $T_{k,p}(x_k,s) = \sum_{i=0}^{p} f_k^{(i)}s^i/i!$. Then there exists a $p$ times continuously differentiable function $f$ from $\mathbb{R}$ to $\mathbb{R}$ with Lipschitz continuous $p$-th derivative such that, for $k \in \{0,\ldots,k_e\}$,
$$f^{(j)}(x_k) = f_k^{(j)} \quad \text{for } j \in \{0,\ldots,p\}.$$
Moreover, the range of $f$ only depends on $p$, $\kappa_f$, $\max_k f_k^{(0)}$ and $\min_k f_k^{(0)}$.

Proof.
We first use Theorem 5.1 to define a Hermite interpolant $\pi_k(s)$ of the form (5.1) on each interval $[x_k,x_{k+1}] = [x_k,x_k+s_k]$ ($k \in \{0,\ldots,k_e-1\}$) using $f_0^{(j)} = f_k^{(j)}$ and $f_1^{(j)} = f_{k+1}^{(j)}$ for $j \in \{0,\ldots,p\}$, and then set
$$f(x_k+s) = \pi_k(s)$$
for any $s \in [0,s_k]$. We may then smoothly prolongate $f$ for $x \in \mathbb{R}$ by defining two addi-
tional interpolation intervals $[x_{-1},x_0] = [-s_{-1},0]$ and $[x_{k_e},x_{k_e}+s_{k_e}]$, with end conditions
$$f_{-1}^{(0)} = f_0^{(0)}, \quad f_{k_e+1}^{(0)} = f_{k_e}^{(0)} \quad \text{and} \quad f_{-1}^{(j)} = f_{k_e+1}^{(j)} = 0 \quad \text{for } j \in \{1,\ldots,p\},$$
and where $s_{-1}$ and $s_{k_e}$ are chosen sufficiently large to ensure that (5.6) also holds on intervals $-1$ and $k_e$. We next set
$$f(x) = \begin{cases} f_0^{(0)} & \text{for } x \leq x_{-1}, \\ \pi_k(x-x_k) & \text{for } x \in [x_k,x_{k+1}] \text{ and } k \in \{-1,\ldots,k_e\}, \\ f_{k_e}^{(0)} & \text{for } x \geq x_{k_e}+s_{k_e}. \end{cases}$$
✷

5.2 Slow convergence to $(\epsilon,\delta)$-approximate $q$-th-order-necessary minimizers

We now consider an unconstrained univariate instance of problem (2.1). Our aim is first to show that, for each choice of $p \geq 1$ and $q \in \{1,\ldots,p\}$, there exists an objective function $f$ for problem (2.1) with $f \in C^{p,1}(\mathbb{R})$ (i.e. $\beta = 1$) such that obtaining an $(\epsilon,\delta)$-approximate $q$-th-order-necessary minimizer may require at least $\epsilon^{-\frac{p+1}{p-q+1}}$ evaluations of the objective function and its derivatives using Algorithm 2.1, matching, in order of $\epsilon$, the upper bound of Theorem 3.4.

Given $\epsilon \in (0,1)$, $p \geq 1$ and $q \in \{1,\ldots,p\}$, we first define the sequences $\{f_k^{(j)}\}$ for $j \in \{0,\ldots,p\}$ and $k \in \{0,\ldots,k_\epsilon\}$ with
$$k_\epsilon = \left\lceil \epsilon^{-\frac{p+1}{p-q+1}} \right\rceil \qquad (5.7)$$
by
$$\omega_k = \epsilon\,\frac{k_\epsilon - k}{k_\epsilon} \qquad (5.8)$$
as well as
$$f_k^{(j)} = 0 \quad \text{for } j \in \{1,\ldots,q-1\} \cup \{q+1,\ldots,p\} \qquad (5.9)$$
and
$$f_k^{(q)} = -(\epsilon+\omega_k)\,q!\,\chi_q(1) < 0. \qquad (5.10)$$
Thus
$$T_p(x_k,s) = \sum_{j=0}^{p}\frac{f_k^{(j)}}{j!}s^j = f_k^{(0)} - (\epsilon+\omega_k)\,\chi_q(1)\,s^q \qquad (5.11)$$
and, assuming $\delta_{k-1} = 1$ for all $k$ (we verify below that this is acceptable),
$$\phi_{f,q}^{\delta_{k-1}}(x_k) = (\epsilon+\omega_k)\,\chi_q(\delta_{k-1}). \qquad (5.12)$$
We also set $\sigma_k = p!$ for all $k \in \{0,\ldots,k_\epsilon\}$ (we again verify below that this is acceptable). Note that $\omega_k \in (0,\epsilon]$ and
$$\phi_{f,q}^{\delta_{k-1}}(x_k) > \epsilon\,\chi_q(\delta_{k-1}) \quad \text{for } k \in \{0,\ldots,k_\epsilon-1\} \qquad (5.13)$$
(so that the algorithm does not terminate at $x_k$), while $\omega_{k_\epsilon} = 0$ and
$$\phi_{f,q}^{\delta_{k-1}}(x_{k_\epsilon}) = \epsilon\,\chi_q(\delta_{k-1}) \qquad (5.14)$$
(and (2.12) holds at $x_{k_\epsilon}$). It is easy to verify using (5.11) that the model (2.16) is then globally minimized for
$$s_k = \left[q(\epsilon+\omega_k)\,\chi_q(1)\right]^{\frac{1}{p-q+1}} > \epsilon^{\frac{1}{p-q+1}} \quad (k \in \{0,\ldots,k_\epsilon\}). \qquad (5.15)$$
Hence this step satisfies (2.19) if we choose $\varpi = 1$. Because of this fact, we are free to choose $\delta_k$ arbitrarily in $(0,1]$,
and we choose $\delta_k = 1$. Thus, provided we make the choice $\delta_{-1} = 1$ ensuring (5.12) for $k = 0$, the value $\delta_k = 1$ is admissible for all $k$. The step (5.15) yields that
$$m_k(s_k) = f_k^{(0)} - (\epsilon+\omega_k)\chi_q(\delta_k)\left[q(\epsilon+\omega_k)\chi_q(\delta_k)\right]^{\frac{q}{p-q+1}} + \frac{1}{p+1}\left[q(\epsilon+\omega_k)\chi_q(\delta_k)\right]^{\frac{p+1}{p-q+1}} = f_k^{(0)} - \zeta(q,p)\left[q(\epsilon+\omega_k)\chi_q(\delta_k)\right]^{\frac{p+1}{p-q+1}} \qquad (5.16)$$
where
$$\zeta(q,p) \stackrel{\rm def}{=} \frac{p-q+1}{q(p+1)} \in (0,1). \qquad (5.17)$$
Thus $m_k(s_k) < m_k(0)$ and (2.18) holds. We then define
$$f_0^{(0)} = 2\left[2q\chi_q(1)\right]^{\frac{p+1}{p-q+1}} \quad \text{and} \quad f_{k+1}^{(0)} = f_k^{(0)} - \zeta(q,p)\left[q(\epsilon+\omega_k)\chi_q(\delta_k)\right]^{\frac{p+1}{p-q+1}}, \qquad (5.18)$$
which provides the identity
$$m_k(s_k) = f_{k+1}^{(0)} \qquad (5.19)$$
(ensuring that iteration $k$ is successful because $\rho_k = 1$ in (2.21), and thus that our choice of a constant $\sigma_k$ is acceptable). In addition, using (5.18), (5.13), (5.17), the equality $\delta_k = 1$ and the inequality $k_\epsilon \leq 2\epsilon^{-\frac{p+1}{p-q+1}}$ from (5.7), we obtain that, for $k \in \{0,\ldots,k_\epsilon\}$,
$$f_0^{(0)} \geq f_k^{(0)} \geq f_0^{(0)} - k\,\zeta(q,p)\left[2q\epsilon\chi_q(\delta_k)\right]^{\frac{p+1}{p-q+1}} \geq f_0^{(0)} - k_\epsilon\,\epsilon^{\frac{p+1}{p-q+1}}\left[2q\chi_q(1)\right]^{\frac{p+1}{p-q+1}} \geq f_0^{(0)} - 2\left[2q\chi_q(1)\right]^{\frac{p+1}{p-q+1}},$$
and hence that
$$f_k^{(0)} \in \left[0,\ 2\left[2q\chi_q(1)\right]^{\frac{p+1}{p-q+1}}\right] \quad \text{for } k \in \{0,\ldots,k_\epsilon\}. \qquad (5.20)$$
We also set
$$\delta_{-1} = 1, \quad x_0 = 0 \quad \text{and} \quad x_k = \sum_{i=0}^{k-1} s_i.$$
Then (5.19) and (2.16) give that
$$|f_{k+1}^{(0)} - T_p(x_k,s_k)| = \frac{1}{p+1}|s_k|^{p+1}. \qquad (5.21)$$
Moreover, for $j \in \{1,\ldots,p\}$,
$$T_p^{(j)}(x_k,s_k) = \frac{f_k^{(q)}}{(q-j)!}\,s_k^{q-j}\,\delta_{[j\leq q]} = -\frac{(q-1)!}{(q-j)!}\,s_k^{p-j+1}\,\delta_{[j\leq q]},$$
where $\delta_{[\cdot]}$ is the standard indicator function. We may now verify that, for $j \in \{1,\ldots,q-1\}$,
$$|f_{k+1}^{(j)} - T_p^{(j)}(x_k,s_k)| = |T_p^{(j)}(x_k,s_k)| \leq \left|\frac{(q-1)!}{(q-j)!}\right||s_k|^{p-j+1} \leq (q-1)!\,|s_k|^{p-j+1}, \qquad (5.22)$$
while, for $j = q$, we have that
$$|f_{k+1}^{(q)} - T_p^{(q)}(x_k,s_k)| = \left|-(q-1)!\,s_{k+1}^{p-q+1} + (q-1)!\,s_k^{p-q+1}\right| \leq (q-1)!\,|s_k|^{p-q+1} \qquad (5.23)$$
(since $0 < s_{k+1} \leq s_k$) and, for $j \in \{q+1,\ldots,p\}$,
$$|f_{k+1}^{(j)} - T_p^{(j)}(x_k,s_k)| = |0 - 0| = 0. \qquad (5.24)$$
Combining (5.21), (5.22), (5.23) and (5.24), we deduce that (5.6) holds with $\kappa_f = (q-1)!$. We may thus apply Theorem 5.2 with $\beta = 1$, $\kappa_f = (q-1)!$ and $\kappa_s = 1$, and deduce the existence of a $p$ times continuously differentiable function $f$ from $\mathbb{R}$ to $\mathbb{R}$ with Lipschitz continuous derivatives of order 0 to $p$ which interpolates the $\{f_k^{(j)}\}$ at $\{x_k\}$ for $k \in \{0,\ldots,k_\epsilon\}$ and $j \in \{0,\ldots,p\}$. Moreover, (5.20) and Theorem 5.2 imply that the range of $f$ only depends on $p$ and $q$. In addition, (5.19) ensures that every iteration is successful and thus, because of (2.22), that the value $\sigma_k = p!$ may be used at all iterations.
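The construction above is easy to instantiate numerically. The following sketch (ours) generates the sequences of (5.7)-(5.10) and the global model steps of (5.15), and confirms that every one of the $k_\epsilon$ iterations takes a step of length at least $\epsilon^{1/(p-q+1)}$.

```python
import math

def slow_example(eps, p, q):
    """Data of the slow-convergence example: the count (5.7), the
    perturbations (5.8) and the global model steps (5.15)."""
    chi_q1 = sum(1.0 / math.factorial(l) for l in range(1, q + 1))  # chi_q(1)
    k_eps = math.ceil(eps ** (-(p + 1.0) / (p - q + 1.0)))          # (5.7)
    steps = []
    for k in range(k_eps + 1):
        omega = eps * (k_eps - k) / k_eps                           # (5.8)
        s = (q * (eps + omega) * chi_q1) ** (1.0 / (p - q + 1.0))   # (5.15)
        steps.append(s)
    return k_eps, steps

k_eps, steps = slow_example(0.1, p=2, q=1)
print(k_eps)                                    # ceil(0.1**-1.5) = 32
print(min(steps) >= 0.1 ** (1.0 / 2.0))         # all steps >= eps^(1/(p-q+1))
```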
This argument allows us to state the following lower bound on the complexity of the regularization algorithm using a $p$-th degree model.

Lemma 5.3

Given any $p \in \mathbb{N}$ and $q \in \{1,\ldots,p\}$, there exists a $p$ times continuously differentiable function $f$ from $\mathbb{R}$ to $\mathbb{R}$, with range only depending on $p$ and $q$ and with Lipschitz continuous $p$-th derivative, such that, when the regularization algorithm with $p$-th degree model (Algorithm 2.1) is applied to minimize $f$ without constraints, it takes exactly
$$k_\epsilon = \left\lceil \epsilon^{-\frac{p+1}{p-q+1}} \right\rceil$$
iterations (and evaluations of the objective function and its derivatives) to find an $(\epsilon,\delta)$-approximate $q$-th-order-necessary minimizer.

This implies the following important consequence for higher dimensional problems.

Theorem 5.4
Given any $n \in \mathbb{N}$, $p \in \mathbb{N}$ and $q \in \{1,\ldots,p\}$, there exists a $p$ times continuously differentiable function $f$ from $\mathbb{R}^n$ to $\mathbb{R}$, with range only depending on $p$ and $q$ and with Lipschitz continuous $p$-th derivative tensor, such that, when the regularization algorithm with $p$-th degree model (Algorithm 2.1) is applied to minimize $f$ without constraints, it takes exactly
$$k_\epsilon = \left\lceil \epsilon^{-\frac{p+1}{p-q+1}} \right\rceil \qquad (5.25)$$
iterations (and evaluations of the objective function and its derivatives) to find an $(\epsilon,\delta)$-approximate $q$-th-order-necessary minimizer. Furthermore, the same conclusion holds if the optimization problem under consideration involves constraints, provided the feasible set $\mathcal{F}$ contains a ray.

Proof.
The first conclusion directly follows from Lemma 5.3, since it is always possible to include the univariate example as an independent component of a multivariate one. The second conclusion follows from the observation that our univariate example of slow convergence is only defined on $\mathbb{R}_+$ (even if Theorem 5.2 provides an extension to the complete real line). As a consequence, it may be used on any feasible ray. ✷

We now make a few observations.

1. Theorem 5.4 generalizes to arbitrary $q$ the bound obtained in [3] for the case $q = 1$, and also shows that, at variance with the result derived in this reference, the generalized bound applies for arbitrary problem dimension, but depends on $\epsilon$, $p$ and $q$.

2. For simplicity, we have chosen, in the above example, to minimize the model $m_k(s)$ globally at every iteration, but we might consider other pairs $(s_k,\delta_k)$. A similar example of slow convergence may in fact be constructed along the lines used above(10) for any sequence of acceptable(11) model-reducing steps and associated optimality radii (in the sense of Lemma 2.5), provided the optimality radii remain bounded away from zero. This means that our example of slow convergence applies not only to Algorithm 2.1 but also to a much broader class of minimization methods. Moreover, it is also possible to weaken the constraints on the step further by relaxing (5.19) and only insisting on acceptable decrease of the objective function value in Step 3 of the algorithm.

In [3], the authors derive their lower bound for $q = 1$ for the general class of "zero-preserving" algorithms, which are algorithms that "never explore (from $x_k$) coordinates which appear not to affect the function", that is directions $d$ along which $T_p(x_k,\cdot)$ is constant. This property is obviously shared by Algorithm 2.1 because it attempts to reduce the Taylor expansion of $f$ around the current iterate (the presence of the isotropic regularization term is irrelevant for this).

3. Our example does not apply, for instance, to a linesearch method using global univariate minimization in a direction of search computed from the Taylor expansion of $f$, which escapes the construction above.

(10) At the price of possibly larger constants.
(11) Remember that $\delta_k = 1$ is always possible for $q = 1$. It is thus unsurprising that no such condition appears in [3].

Consider however the bivariate function $f(x_1,x_2) = (x_2 - x_1^2)(x_2 - 2x_1^2)$. Then $f(0,0) = 0$
and the origin is not a minimizer since $f$ decreases along the arc $x_2 = \frac{3}{2}x_1^2$. Yet the origin is the global minimizer along every line passing through the origin, preventing any linesearch method from progressing away from $(0,0)$.

One may also wonder what would happen if one were to use the unregularized model (that is, (2.16) with $\sigma_k = 0$) in order to find an unconstrained first-order minimizer. It is easy to see that if one chooses
$$f_k^{(1)} = -(\epsilon+\omega_k), \quad f_k^{(i)} = 0 \ \text{ for } i \in \{2,\ldots,p-1\} \quad \text{and} \quad f_k^{(p)} = p!,$$
the same reasoning as above yields that the largest obtainable decrease with this model occurs at
$$s_k = \left(\frac{\epsilon+\omega_k}{p}\right)^{\frac{1}{p-1}}$$
and is given by
$$f_k^{(0)} - m_k(s_k) = (p-1)\left(\frac{\epsilon+\omega_k}{p}\right)^{\frac{p}{p-1}}.$$
This then implies that at least a multiple of $\epsilon^{-\frac{p}{p-1}}$ evaluations may be needed to find approximate first-order-necessary minimizers, which is worse than the bound in $\epsilon^{-\frac{p+1}{p}}$ holding for the regularized algorithm. This is consistent with the known lower $O(\epsilon^{-2})$ bound for first-order points that holds for the (unregularized) Newton method (and hence the trust-region method), both of which use $p = 2$. Adding the regularization term thus not only provides a mechanism to limit the stepsize and make the step well-defined when $T_p(x_k,s)$ is unbounded below, but also amounts to increasing the 'useful degree' of the model by one, improving the worst-case complexity bound.

Summing up the above discussion, we conclude that an example of slow convergence requiring at least (5.25) evaluations can be built for any method whose steps decrease the regularized ($\sigma_k \geq \sigma_{\min}$) or unregularized ($\sigma_k = 0$) model (2.16) and whose approximate local optimality can be measured by (2.20) for some constant $\theta$ and $\delta_k = 1$ (which we can always enforce by adapting $\varpi$ and (5.9)). For orders up to two, this includes most variants of steepest-descent and Newton's methods, including those globalized with regularization, trust-region, a linesearch or a mixture of these (see [12] for a discussion). General linesearch methods are excluded for high-order optimization as they may fail to converge to approximate minimizers of order four and beyond.

Finally, one may wonder what would happen if, for the interpolation data (5.9)-(5.10), the model
$$m_k(s) = T_p(x_k,s) + \frac{\sigma_k}{m!}|s|^m$$
were used for some $m > p+1$, resulting in a shorter step. The global model minimizer would then occur at
$$s = \left[q(\epsilon+\omega_k)\chi_q(1)\right]^{\frac{1}{m-q}}$$
and give an optimal model decrease equal to
$$\frac{m-q}{qm}\left[q(\epsilon+\omega_k)\chi_q(1)\right]^{\frac{m}{m-q}}.$$
However, (5.6) would then fail for $j = 0$ and the argument leading to an example of slow convergence would break down.

6 Conclusions
Conclusions

For any optimality order $q \ge 1$, we have provided the concept of an $(\epsilon, \delta)$-approximate $q$-th-order-necessary minimizer for the very general set-constrained problem (2.1). We have then proposed a conceptual regularization algorithm to find such approximate minimizers and have shown that, if $\nabla_x^p f$ is $\beta$-Hölder continuous, this algorithm requires at most $O(\epsilon^{-\frac{p+\beta}{p-q+\beta}})$ evaluations of the objective function and its $p$ first derivatives to terminate. When $\nabla_x^p f$ is Lipschitz continuous, we have used an unconstrained univariate version of the problem to show that this bound is sharp in terms of the order in $\epsilon$, for any feasible set containing a ray and any problem dimension.

In view of the results in [7, 18], one may wonder what would happen if the regularization power (i.e. the power of $\|s\|$ used in the last term of the model (2.16)) is allowed to differ from $p + \beta$. The theory presented above must then be re-examined, and the crucial point is whether a global upper bound $\sigma_{\max}$ on the regularization parameter can still be ensured as in Lemma 3.2. One easily verifies that this is the case for regularization powers $r \in (p, p+\beta]$. Arguments parallel to those presented above then yield an upper bound of $O(\epsilon^{-\frac{r}{r-q}})$ evaluations (we may even relax (2.20) slightly by replacing $\|s_k\|^{p-q+\beta}$ by $\|s_k\|^{r-q}$), recovering the bound given in Section 3.3 of [7] for $q = 1$. The situation is however more complicated (and beyond the scope of the present paper) for $r > p + \beta$; the determination of a suitable general complexity upper bound for this latter case has not been formalized at this stage, but the analysis for $q = 1$ discussed in Section 3.2 of [7] suggests that an improvement of the bound for larger $r$ is unlikely.

Although the results presented essentially solve the question of determining the optimal evaluation complexity for unconstrained problems and problems with general inexpensive constraints, some interesting issues remain open at this stage. A first such issue is whether an example of slow convergence for all $\epsilon \in (0, 1)$
can be found for feasible domains not containing a ray. A second is to extend the general complexity theory to problems whose constraints are not inexpensive: the discussion in [10] indicates that this is a challenging research area.
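To give a concrete feel for the bounds discussed above, the following small script (our illustration; the exponent formulas are those stated in this section) tabulates the complexity exponent $(p+\beta)/(p-q+\beta)$ in the Lipschitz case $\beta = 1$ and shows how it degrades as $q$ approaches $p$.

```python
from fractions import Fraction

def exponent(p, q, beta=Fraction(1)):
    # Evaluation-complexity exponent (p + beta)/(p - q + beta); for the
    # generalized regularization power r in (p, p + beta] the bound reads
    # r/(r - q), and r = p + beta recovers the same value.
    return (p + beta) / (p - q + beta)

# Lipschitz case (beta = 1): exponent grows from (p+1)/p toward p+1 as the
# optimality order q approaches p.
for p in (1, 2, 3, 4):
    print(f"p={p}: " + "  ".join(f"q={q}: eps^-{exponent(p, q)}"
                                 for q in range(1, p + 1)))
```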
References

[1] A. Anandkumar and R. Ge. Efficient approaches for escaping high-order saddle points in nonconvex optimization. Proceedings of the Machine Learning Society, 49:81–102, 2016.
[2] E. G. Birgin, J. L. Gardenghi, J. M. Martínez, S. A. Santos, and Ph. L. Toint. Worst-case evaluation complexity for unconstrained nonlinear optimization using high-order regularized models. Mathematical Programming, Series A, 163(1):359–368, 2017.
[3] Y. Carmon, J. C. Duchi, O. Hinder, and A. Sidford. Lower bounds for finding stationary points I. arXiv:1710.11606, 2018.
[4] C. Cartis, N. I. M. Gould, and Ph. L. Toint. On the complexity of steepest descent, Newton's and regularized Newton's methods for nonconvex unconstrained optimization. SIAM Journal on Optimization, 20(6):2833–2852, 2010.
[5] C. Cartis, N. I. M. Gould, and Ph. L. Toint. Adaptive cubic overestimation methods for unconstrained optimization. Part II: worst-case function-evaluation complexity. Mathematical Programming, Series A, 130(2):295–319, 2011.
[6] C. Cartis, N. I. M. Gould, and Ph. L. Toint. An adaptive cubic regularization algorithm for nonconvex optimization with convex constraints and its function-evaluation complexity. IMA Journal of Numerical Analysis, 32(4):1662–1695, 2012.
[7] C. Cartis, N. I. M. Gould, and Ph. L. Toint. Universal regularization methods – varying the power, the smoothness and the accuracy. Technical Report naXys-7-2016, Namur Center for Complex Systems (naXys), University of Namur, Namur, Belgium, 2016.
[8] C. Cartis, N. I. M. Gould, and Ph. L. Toint. Improved second-order evaluation complexity for unconstrained nonlinear optimization using high-order regularized models. arXiv:1708.04044, 2017.
[9] C. Cartis, N. I. M. Gould, and Ph. L. Toint. Worst-case evaluation complexity of regularization methods for smooth unconstrained optimization using Hölder continuous gradients. Optimization Methods and Software, 32(6):1273–1298, 2017.
[10] C. Cartis, N. I. M. Gould, and Ph. L. Toint. Optimality of orders one to three and beyond: characterization and evaluation complexity in constrained nonconvex optimization. Journal of Complexity, (to appear), 2018.
[11] C. Cartis, N. I. M. Gould, and Ph. L. Toint. Second-order optimality and beyond: characterization and evaluation complexity in convexly-constrained nonlinear optimization. Foundations of Computational Mathematics, 18(5):1073–1107, 2018.
[12] C. Cartis, N. I. M. Gould, and Ph. L. Toint. Worst-case evaluation complexity and optimality of second-order methods for nonconvex smooth optimization. To appear in the Proceedings of the 2018 International Congress of Mathematicians (ICM 2018), Rio de Janeiro, 2018.
[13] A. R. Conn, N. I. M. Gould, and Ph. L. Toint. Trust-Region Methods. MPS-SIAM Series on Optimization. SIAM, Philadelphia, USA, 2000.
[14] S. Gratton, A. Sartenaer, and Ph. L. Toint. Recursive trust-region methods for multiscale nonlinear optimization. SIAM Journal on Optimization, 19(1):414–444, 2008.
[15] A. Griewank. The modification of Newton's method for unconstrained optimization by bounding cubic terms. Technical Report NA/12, Department of Applied Mathematics and Theoretical Physics, University of Cambridge, Cambridge, United Kingdom, 1981.
[16] H. Hancock. The Theory of Maxima and Minima. The Athenaeum Press, Ginn & Co, New York, USA, 1917. Available online at https://archive.org/details/theoryofmaximami00hancroft.
[17] Yu. Nesterov. Introductory Lectures on Convex Optimization. Applied Optimization. Kluwer Academic Publishers, Dordrecht, The Netherlands, 2004.
[18] Yu. Nesterov and G. N. Grapiglia. Globally convergent second-order schemes for minimizing twice-differentiable functions. Technical Report CORE Discussion paper 2016/28, CORE, Catholic University of Louvain, Louvain-la-Neuve, Belgium, 2016.
[19] Yu. Nesterov and B. T. Polyak. Cubic regularization of Newton method and its global performance. Mathematical Programming, Series A, 108(1):177–205, 2006.
[20] G. Peano. Calcolo differenziale e principii di calcolo integrale. Fratelli Bocca, Roma, Italy, 1884.
[21] S. A. Vavasis. Black-box complexity of local minimization. SIAM Journal on Optimization, 3(1):60–80, 1993.
[22] M. Weiser, P. Deuflhard, and B. Erdmann. Affine conjugate adaptive Newton methods for nonlinear elastomechanics. Optimization Methods and Software, 22(3):413–431, 2007.
[23] X. Zhang, C. Ling, and L. Qi. The best rank-1 approximation of a symmetric tensor and related spherical optimization problems. SIAM Journal on Matrix Analysis and Applications, 33(3):806–821, 2012.
Appendix A

A.1 Proof of Lemmas in Section 2
Proof of Lemma 2.1.
We first establish the identity
\[
I_{k-1,\beta} \;\stackrel{\rm def}{=}\; \int_0^1 \xi^\beta (1-\xi)^{k-1}\,d\xi \;=\; \frac{(k-1)!}{(k+\beta)!},
\qquad\mbox{where}\qquad
(k+\beta)! \;\stackrel{\rm def}{=}\; \prod_{i=1}^{k}(i+\beta). \tag{A.1}
\]
Indeed, integration by parts gives that
\[
I_{k-1,\beta}
= \left[\frac{\xi^{\beta+1}}{\beta+1}(1-\xi)^{k-1}\right]_0^1
  + \frac{k-1}{\beta+1}\int_0^1 \xi^{\beta+1}(1-\xi)^{k-2}\,d\xi
= \frac{k-1}{\beta+1}\, I_{k-2,\beta+1},
\]
and thus, recursively, that
\[
I_{k-1,\beta}
= \frac{(k-1)!}{(k-1+\beta)!}\, I_{0,k-1+\beta}
= \frac{(k-1)!}{(k-1+\beta)!}\int_0^1 \xi^{k-1+\beta}\,d\xi
= \frac{(k-1)!}{(k+\beta)!}.
\]
As in [11], consider the Taylor identity
\[
\psi(1)-\tau_k(1) = \frac{1}{(k-1)!}\int_0^1 (1-\xi)^{k-1}\big[\psi^{(k)}(\xi)-\psi^{(k)}(0)\big]\,d\xi \tag{A.2}
\]
involving a given univariate $C^k$ function $\psi(t)$ and its $k$-th order Taylor approximation
\[
\tau_k(t) = \sum_{i=0}^{k}\psi^{(i)}(0)\,\frac{t^i}{i!},
\]
expressed in terms of the value $\psi(0)$ and the $i$-th derivatives $\psi^{(i)}(0)$, $i = 1, \ldots, k$. Then, picking $\psi(t) = f(x+ts)$ for given $x, s \in \mathbb{R}^n$ and $k = p$, the identity (A.2) and the relationships $\psi^{(p)}(t) = \nabla_x^p f(x+ts)[s]^p$ and $\tau_p(1) = T_p(x,s)$ give that
\[
f(x+s) - T_p(x,s) = \frac{1}{(p-1)!}\int_0^1 (1-\xi)^{p-1}\big(\nabla_x^p f(x+\xi s)-\nabla_x^p f(x)\big)[s]^p\,d\xi,
\]
and thus, from the definition of the tensor norm (1.1), the Hölder bound (2.2) and the identity (A.1) with $k = p$, that
\[
\begin{aligned}
\big|f(x+s)-T_p(x,s)\big|
&\le \frac{1}{(p-1)!}\int_0^1 (1-\xi)^{p-1}
     \left|\big(\nabla_x^p f(x+\xi s)-\nabla_x^p f(x)\big)\left[\frac{s}{\|s\|}\right]^p\right|\|s\|^p\,d\xi\\
&\le \frac{1}{(p-1)!}\int_0^1 (1-\xi)^{p-1}
     \max_{\|v\|=1}\big|\big(\nabla_x^p f(x+\xi s)-\nabla_x^p f(x)\big)[v]^p\big|\,\|s\|^p\,d\xi\\
&= \frac{1}{(p-1)!}\int_0^1 (1-\xi)^{p-1}
     \big\|\nabla_x^p f(x+\xi s)-\nabla_x^p f(x)\big\|_{[p]}\,d\xi\cdot\|s\|^p\\
&\le \frac{L}{(p-1)!}\int_0^1 \xi^\beta(1-\xi)^{p-1}\,d\xi\cdot\|s\|^{p+\beta}
 \;=\; \frac{L}{(p+\beta)!}\,\|s\|^{p+\beta}
\end{aligned}
\]
for all $x, s \in \mathbb{R}^n$, which is the required (2.4).

Likewise, for arbitrary unit vectors $v_1, \ldots, v_j$, choosing $\psi(t) = \nabla_x^j f(x+ts)[v_1,\ldots,v_j]$ and $k = p-j$, it follows from (A.2) and the relationships $\psi^{(p-j)}(t) = \nabla_x^p f(x+ts)[v_1,\ldots,v_j][s]^{p-j}$ and $\tau_{p-j}(1) = \nabla_s^j T_p(x,s)[v_1,\ldots,v_j]$ that
\[
\big(\nabla_x^j f(x+s)-\nabla_s^j T_p(x,s)\big)[v_1,\ldots,v_j]
= \frac{1}{(p-j-1)!}\int_0^1 (1-\xi)^{p-j-1}\big(\nabla_x^p f(x+\xi s)-\nabla_x^p f(x)\big)[v_1,\ldots,v_j][s]^{p-j}\,d\xi. \tag{A.3}
\]
Choosing now $v_1, \ldots, v_j$ to maximize the absolute value of the left-hand side of (A.3) and proceeding as above, using the tensor norm (1.1), the Hölder bound (2.2) and the identity (A.1) with $k = p-j$, we find that
\[
\big\|\nabla_x^j f(x+s)-\nabla_s^j T_p(x,s)\big\|_{[j]}
\le \frac{L}{(p-j-1)!}\int_0^1 \xi^\beta(1-\xi)^{p-j-1}\,d\xi\cdot\|s\|^{p-j+\beta}
= \frac{L}{(p-j+\beta)!}\,\|s\|^{p-j+\beta}
\]
for all $x, s \in \mathbb{R}^n$, which gives (2.5). ✷
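The identity (A.1) is easy to confirm numerically; the following sketch (ours) checks it for a few values of $k$ and $\beta$.

```python
import math
from scipy.integrate import quad

# Numerical check of identity (A.1):
#   int_0^1 xi^beta (1-xi)^(k-1) d xi  =  (k-1)! / prod_{i=1}^k (i+beta)
for k in (1, 2, 3, 5):
    for beta in (0.25, 0.5, 1.0):
        lhs, _ = quad(lambda xi: xi**beta * (1.0 - xi)**(k - 1), 0.0, 1.0)
        rhs = math.factorial(k - 1) / math.prod(i + beta for i in range(1, k + 1))
        assert abs(lhs - rhs) < 1e-8, (k, beta)
print("identity (A.1) confirmed numerically")
```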
Proof of Lemma 2.3.
The regularization parameter update (2.22) gives that, for each $k$,
\[
\gamma_1 \sigma_j \le \max[\gamma_1 \sigma_j, \sigma_{\min}] \le \sigma_{j+1}, \quad j \in \mathcal{S}_k,
\qquad\mbox{and}\qquad
\gamma_2 \sigma_j \le \sigma_{j+1}, \quad j \in \mathcal{U}_k,
\]
where $\mathcal{U}_k \stackrel{\rm def}{=} \{0, \ldots, k\} \setminus \mathcal{S}_k$. Thus we deduce inductively that
\[
\sigma_0\, \gamma_1^{|\mathcal{S}_k|}\, \gamma_2^{|\mathcal{U}_k|} \le \sigma_k.
\]
We therefore obtain, using (2.23), that
\[
|\mathcal{S}_k| \log \gamma_1 + |\mathcal{U}_k| \log \gamma_2 \le \log\left(\frac{\sigma_{\max}}{\sigma_0}\right),
\]
which then implies that
\[
|\mathcal{U}_k| \le -|\mathcal{S}_k|\,\frac{\log \gamma_1}{\log \gamma_2} + \frac{1}{\log \gamma_2}\log\left(\frac{\sigma_{\max}}{\sigma_0}\right),
\]
since $\gamma_2 > 1$. The desired result (2.24) then follows from the equality $k + 1 = |\mathcal{S}_k| + |\mathcal{U}_k|$ and the inequality $\gamma_1 < 1$. ✷
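The counting argument above can be illustrated by a toy simulation (ours; all constants are illustrative, and an iteration is forced to be successful once $\sigma$ is large, mimicking Lemma 3.2). It checks the intermediate bound on $|\mathcal{U}_k|$ derived in the proof.

```python
import math
import random

# Toy simulation of the sigma-update mechanism: sigma may shrink by
# gamma1 < 1 (but not below sigma_min) on successful iterations and grows
# by gamma2 > 1 on unsuccessful ones.
random.seed(0)
gamma1, gamma2, sigma_min, sigma0 = 0.5, 2.0, 1e-3, 1.0
sigma_big = 50.0                      # plays the role of L/(1 - eta2)
sigma_max = gamma2 * sigma_big        # cap implied by Lemma 3.2
sigma, S, U = sigma0, 0, 0
for _ in range(10_000):
    if sigma >= sigma_big or random.random() < 0.5:   # successful
        sigma = max(gamma1 * sigma, sigma_min)
        S += 1
    else:                                             # unsuccessful
        sigma = gamma2 * sigma
        U += 1
bound = (-S * math.log(gamma1) + math.log(sigma_max / sigma0)) / math.log(gamma2)
print(f"|U_k| = {U}, bound = {bound:.1f}, holds: {U <= bound}")
```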
Proof of Lemma 2.4.
We first observe that $\nabla_s^j(\|s\|^{p+\beta})$ is a $j$-th order tensor, whose norm is defined using (1.1). Moreover, we may use the relationships
\[
\nabla_s\big(\|s\|^\tau\big) = \tau\,\|s\|^{\tau-2}\,s
\qquad\mbox{and}\qquad
\nabla_s\big(s^{\tau\otimes}\big) = \tau\, s^{(\tau-1)\otimes}\otimes I
\qquad (\tau \in \mathbb{R}), \tag{A.4}
\]
and define
\[
\nu_0 \stackrel{\rm def}{=} 1
\qquad\mbox{and}\qquad
\nu_i \stackrel{\rm def}{=} \prod_{\ell=1}^{i}(p+\beta+2-2\ell). \tag{A.5}
\]
Suppose now that, for some nonnegative coefficients $\mu_{j-1,i}$ with $\mu_{1,1} = 1$,
\[
\nabla_s^{j-1}\big(\|s\|^{p+\beta}\big)
= \sum_{i=1}^{j-1}\mu_{j-1,i}\,\nu_i\,\|s\|^{p+\beta-2i}\,s^{(2i-(j-1))\otimes}\otimes I^{((j-1)-i)\otimes},
\]
which holds for $j - 1 = 1$ by the first part of (A.4). Then, differentiating once more and using (A.4),
\[
\begin{aligned}
\nabla_s\Big[\nabla_s^{j-1}\big(\|s\|^{p+\beta}\big)\Big]
&= \nabla_s\left[\sum_{i=2}^{j}\mu_{j-1,i-1}\,\nu_{i-1}\,\|s\|^{p+\beta-2(i-1)}\,s^{(2(i-1)-(j-1))\otimes}\otimes I^{((j-1)-(i-1))\otimes}\right]\\
&= \sum_{i=2}^{j}\mu_{j-1,i-1}\,\nu_{i-1}\Big[(p+\beta+2-2i)\,\|s\|^{p+\beta-2i}\,s^{(2i-j)\otimes}\otimes I^{(j-i)\otimes}\\
&\qquad\qquad\qquad +\,\big(2(i-1)-(j-1)\big)\,\|s\|^{p+\beta-2(i-1)}\,s^{(2(i-1)-j)\otimes}\otimes I^{(j-(i-1))\otimes}\Big]\\
&= \sum_{i=1}^{j}(p+\beta+2-2i)\,\mu_{j-1,i-1}\,\nu_{i-1}\,\|s\|^{p+\beta-2i}\,s^{(2i-j)\otimes}\otimes I^{(j-i)\otimes}\\
&\qquad + \sum_{i=1}^{j-1}(2i-j+1)\,\mu_{j-1,i}\,\nu_i\,\|s\|^{p+\beta-2i}\,s^{(2i-j)\otimes}\otimes I^{(j-i)\otimes},
\end{aligned}
\]
where the last equality uses the convention that $\mu_{j,0} = 0$ for all $j$. Thus we may write
\[
\nabla_s^j\big(\|s\|^{p+\beta}\big)
= \nabla_s\Big[\nabla_s^{j-1}\big(\|s\|^{p+\beta}\big)\Big]
= \sum_{i=1}^{j}\mu_{j,i}\,\nu_i\,\|s\|^{p+\beta-2i}\,s^{(2i-j)\otimes}\otimes I^{(j-i)\otimes} \tag{A.6}
\]
with
\[
\mu_{j,i}\,\nu_i
= (p+\beta+2-2i)\,\mu_{j-1,i-1}\,\nu_{i-1} + (2i-j+1)\,\mu_{j-1,i}\,\nu_i
= \big[\mu_{j-1,i-1} + (2i-j+1)\,\mu_{j-1,i}\big]\nu_i, \tag{A.7}
\]
where we used the identity
\[
\nu_i = (p+\beta+2-2i)\,\nu_{i-1}, \qquad i = 1, \ldots, j, \tag{A.8}
\]
to deduce the second equality. Now (A.6) gives that
\[
\nabla_s^j\big(\|s\|^{p+\beta}\big)[v]^j
= \sum_{i=1}^{j}\mu_{j,i}\,\nu_i\,\|s\|^{p+\beta-j}\left(\frac{s^Tv}{\|s\|}\right)^{2i-j}(v^Tv)^{j-i}.
\]
It is then easy to see that the maximum in (1.1) is achieved for $v = s/\|s\|$, so that
\[
\Big\|\nabla_s^j\big(\|s\|^{p+\beta}\big)\Big\|_{[j]}
= \left(\sum_{i=1}^{j}\mu_{j,i}\,\nu_i\right)\|s\|^{p+\beta-j}
= \pi_j\,\|s\|^{p+\beta-j}, \tag{A.9}
\]
with
\[
\pi_j \stackrel{\rm def}{=} \sum_{i=1}^{j}\mu_{j,i}\,\nu_i. \tag{A.10}
\]
Using (A.7), the convention $\mu_{j-1,j} = 0$ and (A.10) again, we then deduce that
\[
\begin{aligned}
\pi_j
&= \sum_{i=1}^{j}\mu_{j-1,i-1}\,\nu_i + \sum_{i=1}^{j}(2i-j+1)\,\mu_{j-1,i}\,\nu_i
 = \sum_{i=1}^{j-1}\mu_{j-1,i}\,\nu_{i+1} + \sum_{i=1}^{j-1}(2i-j+1)\,\mu_{j-1,i}\,\nu_i\\
&= \sum_{i=1}^{j-1}\mu_{j-1,i}\big[\nu_{i+1}+(2i-j+1)\,\nu_i\big]
 = \sum_{i=1}^{j-1}\mu_{j-1,i}\big[(p+\beta-2i)+(2i-j+1)\big]\nu_i\\
&= (p+\beta+1-j)\sum_{i=1}^{j-1}\mu_{j-1,i}\,\nu_i
 \;=\; (p+\beta+1-j)\,\pi_{j-1}.
\end{aligned} \tag{A.11}
\]
Since $\pi_1 = p+\beta$ from the first part of (A.4), we obtain that $\pi_j = (p+\beta)!/(p-j+\beta)!$, which, combined with (A.9) and (A.10), gives (2.25). We obtain (2.26) from (A.9) and (A.10), the observation that $\pi_p = (p+\beta)!$ and (A.11) for $j = p+1$. ✷
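The recurrences (A.5) and (A.7) and the closed form for $\pi_j$ can be cross-checked exactly with rational arithmetic; a minimal sketch (ours, for arbitrary test values of $p$ and $\beta$) follows.

```python
from fractions import Fraction

# Check the combinatorial identity closing the proof:
#   pi_j := sum_i mu_{j,i} nu_i  =  prod_{l=1}^j (p + beta + 1 - l),
# with nu_i from (A.5) and mu_{j,i} from the recurrence (A.7).
# (p and beta are arbitrary test values; beta is kept rational so that
# the computation is exact.)
p, beta = 5, Fraction(1, 2)
c = p + beta

nu = [Fraction(1)]                       # nu_0 = 1
for i in range(1, p + 2):
    nu.append(nu[-1] * (c + 2 - 2 * i))  # (A.8): nu_i = (p+beta+2-2i) nu_{i-1}

mu = {(1, 1): Fraction(1)}
for j in range(2, p + 2):
    for i in range(1, j + 1):            # (A.7), with mu_{j,0} = mu_{j-1,j} = 0
        mu[(j, i)] = (mu.get((j - 1, i - 1), 0)
                      + (2 * i - j + 1) * mu.get((j - 1, i), 0))

for j in range(1, p + 2):
    pi_j = sum(mu[(j, i)] * nu[i] for i in range(1, j + 1))
    target = Fraction(1)
    for l in range(1, j + 1):
        target *= (c + 1 - l)            # (p+beta)!/(p-j+beta)!
    assert pi_j == target, j
print("pi_j = (p+beta)!/(p-j+beta)! verified for j = 1, ..., p+1")
```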
A.2 Proof of Lemmas in Section 3

Proof of Lemma 3.1. (See [2, Lemma 2.1].) Observe that, because of (2.18) and (2.16),
\[
0 \le m_k(0) - m_k(s_k) = T_p(x_k, 0) - T_p(x_k, s_k) - \frac{\sigma_k}{p+1}\,\|s_k\|^{p+\beta},
\]
which implies the desired bound. Note that $s_k \neq 0$ as long as condition (2.18) can be satisfied, and so (3.1) implies that (2.21) is well defined. ✷

Proof of Lemma 3.2. (See [2, Lemma 2.2].) Assume that
\[
\sigma_k \ge \frac{L}{1-\eta_2}. \tag{A.12}
\]
Using (2.4) and (3.1), we may then deduce that
\[
|\rho_k - 1| \le \frac{\big|f(x_k+s_k) - T_p(x_k,s_k)\big|}{\big|T_p(x_k,0) - T_p(x_k,s_k)\big|} \le \frac{L}{\sigma_k} \le 1 - \eta_2,
\]
and thus that $\rho_k \ge \eta_2$. Then iteration $k$ is very successful, in that $\rho_k \ge \eta_2$ and $\sigma_{k+1} \le \sigma_k$. As a consequence, the mechanism of the algorithm ensures that (3.2) holds. ✷
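A small numerical reading of this mechanism (ours; $L$ and $\eta_2$ are illustrative values): once $\sigma_k$ reaches the threshold (A.12), the ratio $\rho_k$ is guaranteed to be within $1 \pm L/\sigma_k$ of one, the iteration is very successful, and $\sigma$ stops growing.

```python
# Illustration of the sigma_max argument in Lemma 3.2 (illustrative numbers):
# once sigma >= L/(1 - eta2), we get |rho - 1| <= L/sigma <= 1 - eta2,
# hence rho >= eta2 and the iteration is very successful.
L, eta2 = 10.0, 0.9
threshold = L / (1.0 - eta2)              # = 100: the bound (A.12)
for sigma in (5.0, 50.0, 100.0, 400.0):
    guaranteed = L / sigma <= 1.0 - eta2
    print(f"sigma = {sigma:6.1f}:  L/sigma = {L / sigma:5.3f},  "
          f"very successful guaranteed: {guaranteed}")
```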