[PDF] Worst-case iteration bounds for log barrier methods for problems with nonconvex constraints

Abstract

Interior point methods (IPMs) that handle nonconvex constraints such as IPOPT, KNITRO and LOQO have had enormous practical success. We consider IPMs in the setting where the objective and constraints have Lipschitz first and second derivatives. Unfortunately, previous analyses of log barrier methods in this setting implicitly prove guarantees with exponential dependencies on 1/μ , where μ is the barrier penalty parameter. We provide an IPM that finds a μ -approximate Fritz John point by solving O( μ −7/4 ) trust-region subproblems. For this setup, the results represent both the first iteration bound with a polynomial dependence on 1/μ for a log barrier method and the best-known guarantee for finding Fritz John points. We also show that, given convexity and regularity conditions, our algorithm finds an ϵ -optimal solution in at most O( ϵ −2/3 ) trust-region steps.

Full PDF

WWorst-case iteration bounds for log barrier methods for problemswith nonconvex constraints

Oliver Hinder Yinyu Ye { ohinder,yyye } @stanford.edu Abstract

Interior point methods (IPMs) such as IPOPT, KNITRO and LOQO that handle nonconvexconstraints have had enormous practical success. We consider IPMs in the setting where theobjective and constraints have Lipschitz ﬁrst and second derivatives. Unfortunately, previousanalyses of log barrier methods in this setting implicitly prove guarantees with exponentialdependencies on 1 /µ , where µ is the barrier penalty parameter. We provide an IPM that ﬁnds a µ -approximate Fritz John point by solving O ( µ − / ) trust-region subproblems. For this setup,the results represent both the ﬁrst iteration bound with a polynomial dependence on 1 /µ for alog barrier method and the best-known guarantee for ﬁnding Fritz John points. We also showthat, given convexity and regularity conditions, our algorithm ﬁnds an (cid:15) -optimal solution in atmost O (cid:0) (cid:15) − / (cid:1) trust-region steps. We are concerned with the problemminimize x ∈ R n f ( x ) such that a ( x ) ≥ , where f : R n → R and a : R n → R m have Lipschitz continuous ﬁrst and second deriva-tives. The worst-case runtime to ﬁnd a global optimum to this problem is exponential in thedesired accuracy [25], so instead we seek a Fritz John point [17], a necessary condition for localoptimality, deﬁned as a point ( x, y, t ) ∈ R n × R m × R satisfying t, a ( x ) , y ≥ (1a) y i a i ( x ) = 0 ∀ i ∈ { , . . . , m } (1b) t ∇ f ( x ) − ∇ a ( x ) T y = , (1c)where y is the vector of dual variables, t is a scalar that is equal to one in the KKT conditions,and ( y, t ) (cid:54) = . When the Mangasarian-Fromovitz constraint qualiﬁcation [20] holds all FritzJohn points are KKT points. Since it is not possible to ﬁnd an exact Fritz John point, werequire a notion of an approximate Fritz John point. One natural deﬁnition of an approximateFritz John point is a ( x ) , y ≥ y i a i ( x ) ≤ µ ∀ i ∈ { , . . . , m }(cid:107) ∇ x L ( x, y ) (cid:107) ≤ µ ( (cid:107) y (cid:107) + 1) , where the Lagrangian is L ( x, y ) := f ( x ) − y T a ( x ), and µ ≥ µ desirable. Our interior point method returns point a r X i v : . [ m a t h . O C ] J un atisfying a slightly stronger condition: a ( x ) , y > (3a) | y i a i ( x ) − µ | ≤ µ/ ∀ i ∈ { , . . . , m } (3b) (cid:107) ∇ x L ( x, y ) (cid:107) ≤ µ (cid:113) (cid:107) y (cid:107) + 1 , (3c)with µ > ψ µ ( x ) := f ( x ) − µ m (cid:88) i =1 log( a i ( x )) (4)with some parameter µ >

0, and start from a strictly feasible point. The log barrier penalizespoints too close to the boundary, enabling the use of unconstrained methods to solve a con-strained problem. Typically, if f and each a i were linear we would apply Newton’s method tothe log barrier. However, since we allow a i to be nonlinear, ∇ ψ µ could be singular or indeﬁnite.To avoid this issue, we use a trust region method to generate our search directions: d x ∈ argmin u ∈ B r ( ) M ψ µ x ( u )with M ψ µ x ( u ) := 12 u T ∇ ψ µ ( x ) u + ∇ ψ µ ( x ) T u B r ( v ) := { x ∈ R n : (cid:107) x − v (cid:107) ≤ r } . The function M ψ µ x ( u ) is a second-order Taylor series local approximation to ψ µ ( x ) at x . Itpredicts how much ψ µ changes as we move from x to x + u . Our algorithm changes the radius r to scale inversely proportional to the size of the current dual iterates.We now give a brief overview of our results; for cleanliness we omit Lipschitz constants,dependence on the number of constraints m , and higher-order terms. Our main results assumethat we are given a feasible starting point, i.e., x (0) ∈ X := { x ∈ R n : a ( x ) > } . This assumption is removed in Section 7, where we use a two-phase algorithm: phase-oneminimizes the constraint violation to obtain a feasible point, then phase-two minimizes theobjective subject to the constraints. We also assume that a i and f are continuous functions on R n with Lipschitz ﬁrst and second derivatives on the set X .Our ﬁrst main result is Theorem 1, which states that after at most O (cid:0) µ − / (cid:1) trust regionsubproblem solves we ﬁnd a µ -approximate Fritz John point, i.e., a point satisfying (3). Oursecond main result is Theorem 2 which additionally assumes that the constraints are concavefunctions (implying the feasible region is convex) and that certain regularity conditions holdto ensures that Fritz John points are KKT points. Under these assumptions Theorem 2 statesthat after at most O ( (cid:15) − / ) trust region subproblem solves we ﬁnd an (cid:15) -optimal solution, i.e., apoint x with f ( x ) − inf z ∈X f ( z ) ≤ (cid:15) .We proceed as follows. The remainder of the introduction provides notation and overviewsrelated work. Section 2 analyzes gradient descent applied to the log barrier and explains whyprevious analyses implicitly prove iteration bounds with exponential dependencies on 1 /µ . Sec-tion 3 introduces our main algorithm, a trust region IPM. Section 4 gives a series of usefullemmas for the analysis. Section 5 proves Theorem 1 and Section 6 proves Theorem 2. Sec-tion 7 compares the iteration bounds of our IPM with existing iteration bounds for problemswith nonconvex constraints [3, 8, 9]. .1 Notation Let diag ( v ) be a diagonal matrix with entries composed of the vector v . Let R denote the setof real numbers, R + the set of nonnegative real numbers and R ++ the set of strictly positivereal numbers. Let Convex { x, y } = { αx + (1 − α ) y : α ∈ [0 , } . Let λ min ( · ) denote theminimum eigenvalue of a matrix. Let ψ ∗ µ = inf x ∈X ψ µ ( x ). Unless otherwise speciﬁed, log( · ) isthe natural logarithm. For a function g : R → R we let g ( p ) ( θ ) denote any function such that g ( p ) ( θ ) = ∂ p g ( θ ) ∂θ p .During this paper we assume some of the derivatives of f : R n → R and a : R n → R m areLipschitz. For this paper, the deﬁnition of a function being Lipschitz is given as follows. Deﬁnition 1.

Let L p ∈ (0 , ∞ ) be a constant and p a nonnegative integer.A univariate function g : R → R has L p -Lipschitz p th derivatives on a set S ⊆ R if for all θ ∈ S function is p + 1 order diﬀerentiable with (cid:12)(cid:12) g ( p +1) ( θ ) (cid:12)(cid:12) ≤ L p .A multivariate function w : R n → R has L p -Lipschitz p th derivatives on a set S ⊆ R n iffor any x ∈ S and v ∈ B ( ) the univariate function g : R → R deﬁned by g ( θ ) := w ( x + vθ ) is L p -Lipschitz on the set { θ : x + vθ ∈ S } . We remark that this deﬁnition is slightly less general than standard deﬁnition. The standarddeﬁnition is that a univariate function g : R → R has L p -Lipschitz p th derivatives on a set S ⊆ R , if for any [ θ , θ ] ⊆ S we have (cid:12)(cid:12) g ( p ) ( θ ) − g ( p ) ( θ ) (cid:12)(cid:12) ≤ L p (cid:12)(cid:12) θ − θ (cid:12)(cid:12) . This is equivalentDeﬁnition 1 when g is p + 1 order diﬀerentiable on the set S . We decided to use Deﬁnition 1because it simpliﬁes the proofs. However, it is also possible to prove our results using thestandard deﬁnition.Taylor’s theorem states that given a one-dimensional function g : R → R with L p -Lipschitz p th derivatives on the interval [0 , θ ] then for all q ∈ { , . . . , p } one has (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) p − q (cid:88) i =0 θ i g ( q + i ) (0) i ! − g ( q ) ( θ ) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ≤ L p | θ | p − q (1 + p − q )! . (5)See [33, Theorem 50.3] for a proof of the remainder version of this theorem with q = 0. Toextend this theorem to q > h ( θ ) := g ( q ) ( θ ).We often refer to the function a : R n → R m as having L p -Lipschitz p th derivatives. By thiswe mean that each component function a i has L p -Lipschitz p th derivatives. Finally, the matrix ∇ a ( x ) is the m × n Jacobian of a ( x ). The practical performance of IPMs is excellent for linear [22], conic [36], general convex [1], andnonconvex optimization [5, 38, 41]. Moreover, the theoretical performance of IPMs for linear[18, 32, 42, 46, 47] and conic [28] optimization is well studied. The main theoretical resultin this area is that it takes at most O ( √ c log(1 /(cid:15) )) iterations to ﬁnd an (cid:15) -global minimum,where c is the self-concordance parameter (e.g., c = m + n for linear programming). Each IPMiteration consists of a Newton step, i.e., one linear system solve, applied to an unconstrainedoptimization problem. Unfortunately, this approach only works for convex cones with tractableself-concordant barriers.While self-concordance theory is designed for structured convex problems, there is a richliterature on the minimization of general blackbox unconstrained objectives, particularly ifthe objective is convex [25, 26]. Here we brieﬂy review results in nonconvex optimization.In unconstrained nonconvex optimization, the measure of local optimality is usually whether (cid:107)∇ f ( x ) (cid:107) ≤ µ , known as a µ -approximate stationary point. A fundamental result is that gra-dient descent needs at most O ( µ − ) iterations to ﬁnd an µ -approximate stationary point if thefunction f : R n → R has Lipschitz continuous ﬁrst derivatives. Nesterov and Polyak [29] showthat the iteration guarantee of cubic regularized Newton is O ( µ − / ) for ﬁnding µ -approximatestationary points. The same iteration bound can be extended to trust region methods [13, 45]. hese O ( µ − ) and O ( µ − / ) iteration bounds match the blackbox lower bounds for functionswith Lipschitz continuous ﬁrst and second derivatives respectively [6, 7].However, there is relatively little theory studying nonconvex optimization with constraints.Important contributions in this area include the work of Ye [44], Bian et al. [2], Haeser et al. [15],who consider an aﬃne scaling technique for general objectives with linear inequality constraints,i.e., a i are linear. At each iteration they solve problems of the form d x ∈ argmin u ∈ R n : (cid:107) S − ∇ a ( x ) u (cid:107) ≤ r M ψ µ x ( u ) (6)with S = diag ( a ( x )). In this context, Haeser et al. [15] give an algorithm with an O ( µ − / )iteration bound for ﬁnding KKT points. This work is pertinent to ours, but the addition ofnonconvex constraints and the use of a trust region method instead of aﬃne scaling distinguishour work.Our motivation is to understand the performance of practical interior point methods, mostof which tend to use an approach similar to ours. To see this relationship, observe that if we areat a feasible solution and set the dual variables to exactly satisfy perturbed complementarity( y = µS − ) then LOQO [38] and the one-phase IPM [16] both generate directions of the form d x ∈ argmin u ∈ R n M ψ µ x ( u ) + δ (cid:107) u (cid:107) (7)for some δ > ∇ ψ µ ( x ) + δ I (cid:31)

0. There is a well-known duality between thismodiﬁed Newton approach and the trust-region approach. In particular, for any δ > r > d x ∈ argmin u ∈ B r ( x ) M ψ µ x ( u ) . (8)The reverse statement holds except in the hard case [30, Chapter 4]. Therefore, our algorithmcan be viewed as an extremely simpliﬁed variant of LOQO, the one-phase IPM, or IPOPT [41].There are major diﬀerences between our approach and practical methods: we ignore feasibilityissues, our method is not primal-dual, we use a trust-region instead of adding δ I to the Hessian,and our algorithm require knowledge of Lipschitz constants. However, these diﬀerences shouldbe viewed in context of our goal: to develop a prototypical algorithm that captures the essenceof nonconvex interior point methods.While there has been theoretical work studying these practically successful log barrier meth-ods with nonconvex constraints, most of this work tends to show only that the method eventuallyconverges [4, 10, 11, 14, 16, 40] without giving explicit iteration bounds, or focuses on super-linear convergence in regions close to local optima [37, 39]. However, there has been analysisof other methods for optimization with nonconvex constraints using methods other than IPMs[3, 8, 9]. We compare with these results in Section 7.There is a vast body of literature analyzing the convergence of unconstrained optimizationmethods on self-concordant functions or functions with Lipschitz derivatives. Unfortunately,with general constraints one cannot assume that the log barrier is self-concordant nor that thederivatives are Lipschitz (even if the derivatives of the constraints are Lipschitz). Therefore wedevelop a new approach. To help the reader understand the crux of this problem, we begin byanalyzing the worst-case performance of gradient descent on the log barrier. This section explains how a naive theoretical analysis, which is often used to analyze IPMwith nonlinear constraints, can give iteration bounds for gradient descent applied to the logbarrier with exponential dependencies on 1 /µ . At the end of this section we explain how toﬁx the analysis to obtain iteration bounds with polynomial dependencies on 1 /µ . Hence theexponential iteration bounds are a ﬂaw of the analysis—not the algorithm. The goal of this ection is to get the reader into the correct mindset for analyzing the more challenging trustregion IPM that is the focus of this paper.The log barrier does not have Lipschitz continuous derivatives. However, typical analysis ofinterior point methods in the nonlinear programming community is as follows: A. Observe that if we apply a descent method to the log barrier, all iterates remain in the set S := { x ∈ R n : ψ µ ( x ) ≤ ψ µ ( x (0) ) } , where x (0) is the starting point. B. Show that the p th derivatives of ψ µ are L p -Lipschitz continuous on the set S . This isusually done by arguing that if x ∈ S then a i ( x ) ≥ inf x min i a i ( x ) = ε >

0. The resultfollows from the fact that log( θ ) has (1 /ε )-Lipschitz continuous derivatives on the set { θ : θ ≥ ε } and using the assumption that the objective and constraints have Lipschitzderivatives. C. Prove that for suﬃciently small steps the line segment between the current and new iteratesremains in S . Apply generic bounds from cubic regularization/gradient descent to givethe iteration bounds.For examples of this style of analysis, see [4, 10, 11, 16]. Turning this into a polynomialbound on 1 /µ requires showing that the constant L p is a polynomial function of the desiredtolerance. However, L p is exponentially large in µ because L p is proportional to 1 /ε and thelower bound on ε can be exponentially small in µ . This can occur even when the constraintsare linear. For example, consider the log barrier arising from the linear program min x s.t.0 ≤ x ≤ ψ µ ( x ) := x − µ (log( x ) + log(2 − x ))with µ ∈ (0 , x (0) = 1. We show that under these assumptions the Lipschitzconstants for the ﬁrst and second derivatives are exponentially large in 1 /µ on the set S := { x ∈ R n : ψ µ ( x ) ≤ ψ µ ( x (0) ) } . Observe that ψ µ ( x (0) ) = 1 and at the point x = exp( − /µ ) ∈ S we have ∇ ψ µ ( x ) = µ (cid:16) x + − x ) (cid:17) ≥ µ exp(2 /µ ) and ∇ ψ µ ( x ) = 2 µ (cid:16) − x + − x ) (cid:17) ≤ − µ exp(3 /µ ).This is illustrated in Figure 1.The methods [4, 10, 11, 16] that employ the (A)-(C) argument use a line search to choosetheir step size rather than use a ﬁxed step size. Line search methods have many beneﬁts overconstant step size methods, including removing the need to do hyperparameter searches overLipschitz constants and converging faster in practice. However, the (A)-(C) argument where weprove a uniform bound on the Lipschitz constant of ∇ ψ µ is roughly equivalent to proving aniteration bound on a constant step size algorithm and then arguing that an adaptive step sizealgorithm is faster. While in some situations this argument gives a good worst-case iterationbound, there exists problem classes where the worst-case iteration bound of the constant stepsize method is exponentially worse than an adaptive method.Claim 1 gives a simple example of constant step size algorithms having poor theoreticalperformance. In particular, the claim shows gradient descent with a ﬁxed step size α ∈ (0 , ∞ ),i.e., x ( k +1) ← x ( k ) − α ∇ ψ µ ( x ( k ) ) , (9)cannot eﬃciently minimize a log barrier for all starting points in the set S C := { x ∈ R n : ψ µ ( x ) ≤ ψ ∗ µ + C } . Contrast to a function f with L -Lipschitz gradient where for any startingpoint x (0) ∈ { x : f ( x ) ≤ inf x f ( x ) + C } gradient descent with a constant step size 1 /L uses atmost 2 L C/(cid:15) iterations until (cid:107) ∇ f ( x ( k ) ) (cid:107) ≤ (cid:15) [27]. Claim 1.

Let ψ µ ( x ) := x − µ (log( x ) + log(2 − x )) , µ ∈ (0 , / and C ∈ [2 , ∞ ) . Fix α ∈ (0 , ∞ ) and suppose the x ( k ) satisﬁes (9) . If x ( k ) remains in the interval [0 , for the startingpoint x (0) = exp( − C/ (2 µ )) ∈ S C , then for the starting point x (0) = 1 ∈ S C and for all k ≤ ( µ/

8) exp( C/ (2 µ )) we have (cid:107) ∇ ψ µ ( x ( k ) ) (cid:107) ≥ µ . The proof appears in Appendix A and involves ﬁrst arguing that the step size α must betiny; otherwise, if we initialize close to the boundary, i.e., x (0) = exp( − C/ (2 µ )), the iterates x ψ µ ( x ) intial p oint x (0) = 1 O exp − µ Derivatives are moving very quicklyand have exp onetially large Lipshitz constant in µ .Region iterates must lie in: S = { x : ψ µ ( x ) ≤ ψ µ (1) } Figure 1: Why a traditional nonlinear programming analysis of IPMs will not give an iterationbound polynomial in 1 /µ . In this example µ = 0 . will leave the feasible region. On the other hand, given the step size α must be tiny then if weinitialize away from the boundary, i.e., x (0) = 1, the algorithm will converge slowly.An astute reader might observe that Claim 1 is dependent on allowing a starting point close tothe boundary. However, any constant step size algorithm that circumvents this issue must showthat all of its iterates do not get too close to the boundary. This requires an innovation on the(A)-(C) argument. Moreover, the fact that the log barrier does not have Lipschitz continuousderivatives causes the same issues for cubic regularized Newton with a ﬁxed regularizationparameter or trust region methods with a ﬁxed trust region radius. Implicitly when usingthe analysis (A)-(C) we are arguing our algorithm cannot do worse than a constant step sizealgorithm. Unfortunately, as we have seen in Claim 1, even from a purely theoretical standpoint,constant step size algorithms can be poor benchmarks.This is the insight of the polynomial time IPM analysis for linear programming—it circum-vents these issues using the self-concordant properties of − µ log( a ( x )) when a is linear [28].However, the function − µ log( a ( x )) is not self-concordant in general. While we do not expect toobtain an algorithm with a polynomial dependence on log(1 /µ ), can we still obtain an algorithmwith polynomial dependence on the desired tolerance 1 /µ ? Next, we show this possible usinggradient descent with an adaptive step size routine, y ( k ) i ← µa i ( x ( k ) ) ∀ i ∈ { , . . . , m } (10a) d ( k ) x ← − ∇ ψ µ ( x ( k ) ) (10b) x ( k +1) ← x ( k ) + α ( k ) d ( k ) x . (10c)This procedure does not tell us how to choose α ( k ) . One approach is to pick, α ( k ) ← min (cid:40) min i a i ( x ( k ) )2 L (cid:107) d ( k ) x (cid:107) , (cid:96) ( x ( k ) ) (cid:41) (11)where the term min i a i ( x ( k ) )2 L (cid:107) d ( k ) x (cid:107) represents the step size that guarantees a i ( x ( k +1) ) > (cid:96) ( x ) := L (1 + 2 (cid:107) y (cid:107) ) + 4 L (cid:107) y (cid:107) µ with y i = µa i ( x ) for i ∈ { , . . . , m } x ψ µ ( x ) Derivatives are moving slowly.Take large gradient descent steps.Derivatives are moving very quickly.Take small gradient descent steps sizes.

Figure 2: Explanation of adaptive step sizes. represents the ‘local’ Lipschitz constant of ∇ ψ µ at the point x . For this reason, the term 1 /(cid:96) ( x )in (11) ensures the log barrier is reduced suﬃciently at each iteration. See Figure 2 explaininghow the step size α ( k ) is small for points close to the boundary and large for points far from theboundary. To prove our results we require the following assumption. Assumption 1. (Lipschitz function and ﬁrst derivatives) Assume that each a i : R n → R for i ∈ { , . . . , m } is a continuous function on R n . Let L , L ∈ (0 , ∞ ) . Assume that, on the set X , each a i is L -Lipschitz continuous with L -Lipschitz continuous derivatives. Also assumethe ﬁrst derivatives of f : R n → R are L -Lipschitz continuous on the set X . This assumption that a is a continuous function on R n may seem extraneous but is is neededto ensure there are no discontinuities on the boundary of the feasible region. In particular, ifwe removed this assumption then a function such as a i ( x ) = (cid:40) x > − x ≤ f ( x ) = x would satisfy Assumption 1 with L and L arbitrarily small. Forthis setup, there exists no µ -approximate Fritz John point for µ suﬃciently small. To see thisassume there is a µ -approximate Fritz John point ( x, y ) for µ < / x > ∇ a ( x ) = 0, ∇ f ( x ) = 1, a ( x ) = 1 and y ≤ µ by (3b). It follows that 1 ≤ (cid:107) ∇ x L ( x, y ) (cid:107) ≤ µ √ µ < X is a bounded set, and if f and each a i are twicediﬀerentiable functions on R n then f and a i are Lipschitz functions with Lipschitz ﬁrst deriva-tives. Of course, this does not give an explicit value for these Lipschitz constants, they couldbe arbitrarily big depending on the functions f and a i .In the following Lemma we justify (11) by proving that the step will remain feasible and (cid:96) ( x ) indeed represents the local Lipschitz constant for ∇ ψ µ . Lemma 1.

Let τ l , µ ∈ (0 , ∞ ) , v ∈ B ( ) , x ∈ X , and g ( θ ) := ψ µ ( x + θv ) . Suppose Assumption 1holds. For all θ ∈ (cid:104) , min i a i ( x )2 L (cid:105) we have a i ( x + θv ) a i ( x ) ∈ [1 / , / for all i = 1 , . . . , m , and (cid:12)(cid:12) g (2) ( θ ) (cid:12)(cid:12) ≤ (cid:96) ( x ) . roof. Deﬁne q i ( θ ) := sup ˆ θ ∈ [0 ,θ ] (cid:12)(cid:12)(cid:12) a i ( x + v ˆ θ ) − a i ( x ) (cid:12)(cid:12)(cid:12) for i ∈ { , . . . , m } . Let x + = x + θv .First, we establish a i ( x + ) a i ( x ) ∈ [1 / , / q i ( ϑ ) > a i ( x )2 forsome ϑ ∈ (cid:104) , min i a i ( x )2 L (cid:105) and i ∈ { , . . . , m } . Since a i is continuous it follows q i is continuous, andby the intermediate value theorem there exists some ˜ θ ∈ [0 , ϑ ] such that q i (˜ θ ) ∈ (cid:16) a i ( x )2 , a i ( x ) (cid:17) .Since a i ( x ) is Lipschitz continuous on the set X and a i ( x + ¯ θv ) > θ ∈ (cid:104) , ˜ θ (cid:105) we have (cid:12)(cid:12)(cid:12) a i ( x ) − a i ( x + v ˜ θ ) (cid:12)(cid:12)(cid:12) ≤ q i (˜ θ ) ≤ L ˜ θ ≤ a i ( x )2 contradicting our earlier statement that q i (˜ θ ) > a i ( x )2 . Since a i ( x + ) − a i ( x ) a i ( x ) ≤ ⇒ a i ( x + ) a i ( x ) ≤ / a i ( x ) − a i ( x + ) a i ( x ) ≤ ⇒ a i ( x + ) a i ( x ) ≥ /

2, we haveestablished a i ( x + ) a i ( x ) ∈ [1 / , / ∇ ψ µ ( x + ) = ∇ f ( x + ) + µ (cid:80) mi =1 (cid:16) ∇ a i ( x + ) a i ( x + ) + ∇ a i ( x + ) ∇ a i ( x + ) T a i ( x + ) (cid:17) , a i ( x + ) a i ( x ) ∈ [1 / , / y i = µa i ( x ) it follows that (cid:12)(cid:12)(cid:12) g (2) ( θ ) (cid:12)(cid:12)(cid:12) = (cid:12)(cid:12) v T ∇ ψ µ ( x + ) v (cid:12)(cid:12) ≤ L + µ m (cid:88) i =1 (cid:18) L a i ( x ) + 4 L a i ( x ) (cid:19) = L (1+2 (cid:107) y (cid:107) )+ 4 L (cid:107) y (cid:107) µ = (cid:96) ( x ) . With Lemma 1 in hand we can now prove Claim 2.

Claim 2.

Let τ l , µ ∈ (0 , ∞ ) . Suppose Assumption 1 holds. Let x (0) ∈ X be the initial point.Then there exists some K ≤ ψ µ ( x (0) ) − ψ ∗ µ )(2 L τ − l µ − + L τ − l µ − + L τ − l µ − ) such thatthe procedure (10) with α ( k ) satisfying (11) ﬁnds a point ( x ( K ) , y ( K ) ) with (cid:107) ∇ ψ µ ( x ( K ) ) (cid:107) ≤ τ l µ (1 + (cid:107) y ( K ) (cid:107) ) .Proof. Denote x + = x ( k +1) , x = x ( k ) , α = α ( k ) , y = y ( k ) , and d x = d ( k ) x . At each iteration of(10) with (cid:107) ∇ ψ µ ( x ) (cid:107) ≥ τ l µ ( (cid:107) y (cid:107) + 1) we have ψ µ ( x ) − ψ µ ( x + ) ≥ α ∇ ψ µ ( x ) T d x + 12 (cid:96) ( x ) α (cid:107) d x (cid:107) (Lemma 1 and Taylor’s theorem)= α (cid:107) ∇ ψ µ ( x ) (cid:107) (cid:18) − α(cid:96) ( x ) (cid:19) (substituting for d x ) ≥ α (cid:107) ∇ ψ µ ( x ) (cid:107) (using (11)) ≥ min (cid:26) (cid:107) ∇ ψ µ ( x ) (cid:107) min i a i ( x )4 L , (cid:107) ∇ ψ µ ( x ) (cid:107) (cid:96) ( x ) (cid:27) (using (11)) ≥ min (cid:26) τ l µ L , τ l µ (1 + (cid:107) y (cid:107) ) (cid:96) ( x ) (cid:27) (by (cid:107) ∇ ψ µ ( x ) (cid:107) ≥ τ l µ ( (cid:107) y (cid:107) + 1) and a i ( x ) ≥ µ/ (cid:107) y (cid:107) )= min  τ l µ L , τ l µ (1 + (cid:107) y (cid:107) ) L (1 + 2 (cid:107) y (cid:107) ) + L (cid:107) y (cid:107) µ  (substituting for (cid:96) ( x )) ≥ min (cid:26) τ l µ L , τ l µ L , τ l µ L (cid:27) (simplifying).Therefore if (cid:107) ∇ ψ µ ( x ( k ) ) (cid:107) ≥ τ l µ ( (cid:107) y ( k ) (cid:107) + 1) for k = 0 , . . . , K then ψ µ ( x (0) ) − ψ ∗ µ ≥ K (cid:88) k =0 ( ψ µ ( x ( k ) ) − ψ µ ( x ( k +1) )) ≥ K min (cid:26) τ l µ L , τ l µ L , τ l µ L (cid:27) , rearranging this expression to upper bound K gives the result. his section demonstrated that gradient descent with a constant step sizes applied to thelog barrier requires an number of iterations proportional to µ exp(1 /µ ) to ﬁnd a Fritz Johnpoint whereas gradient descent with adaptive step sizes requires iterations proportional to µ − .While it is well-known that methods with adaptive step sizes are practically faster than constantstep size methods, most other theoretical results in continuous optimization show no diﬀerencebetween the worst-case performance of adaptive and constant step size methods.Finally, we remark that the algorithms in this paper are not practical. For example, theyrequire knowledge of unknown Lipschitz constants to calculate the local Lipschitz constant (cid:96) .Therefore our primary contributions are theoretical. It remains a subject of further inquiry todevelop practical methods with similar worst-case guarantees. One possibility to remove theneed to know Lipschitz constants would be to use a backtracking line search to compute α ( k ) . This section introduces our trust region IPM (Algorithm 1). A naive algorithm we could use is d x ∈ argmin u ∈ B r ( ) M ψ µ x ( u ) x + ← x + d x x ← x + for some ﬁxed constant r ∈ (0 , ∞ ) where x denotes the current iterate and x + the next iterate.If ∇ ψ µ is L -Lipschitz then one can show a convergence to an (cid:15) -approximate stationary pointof ψ µ in O ( L / (cid:15) − / ) iterations [29]. However, as we described in Section 2 this method willstruggle because the log barrier ensures the eﬀective Lipschitz constant of ∇ ψ µ is exponentiallylarge in µ . Instead, as per line 7 of Algorithm 1, we make the trust region radius adaptive tothe size of the dual variables using the formula r ← η x (cid:114) µL (1 + (cid:107) y (cid:107) ) , this ensures that for constant η x ∈ (0 , ∞ ) the trust region radius becomes smaller as the dualvariable size increases. The intuition for this selection of r is similar to the intuition for thestep sizes for gradient descent in Section 2: the value of r shrinks as we get very close to theboundary of the feasible region. This enables the algorithms to adapt to the ‘local’ Lipschitzconstant of the log barrier. The next iterate for our algorithm is selected by α ← min (cid:26) η s (cid:107) S − d s (cid:107) , (cid:27) x + ← x + αd x , where parameters η x , η s ∈ (0 , ∞ ) are problem dependent. More speciﬁcally they are choosenusing a formula incorporating the Lipschitz constant and barrier parameter. This formulachanges depending on whether the problem is nonconvex (see Theorem 1) and convex (seeLemma 10). These speciﬁc choices guarantee that x + ∈ X and allow us to prove our iterationbounds.The term η s (cid:107) S − d s (cid:107) above encourages small step sizes when the linear approximation ofthe slack variable indicates a large α would cause the algorithm to step outside the feasibleregion. For example, if we were solving a linear program picking η s = 1 / a i ( x + ) > a i ( x ) / > M ψ µ x ( k ) ( d x ) is small we would like to ﬁnd an approximate Fritz Johnpoint. To do this we need a method for selecting the dual variable y + . An instinctive solutionis to pick y + such that y + = µ ( S + ) − with S + = diag ( a ( x + )), i.e., a typical primal barrier pdate. Unfortunately, using this method it is unclear how to construct eﬃcient bounds on (cid:107) ∇ x L ( x + , y + ) (cid:107) . Instead we pick y + using a typical primal-dual step, i.e, y + ← y + αd y where d y satisﬁes Sd y + Y d s + Sy = µ with y = µS − and d s = ∇ a ( x ) d x . We remark that because y = µS − this can be simpliﬁedto y + ← µS − − µS − d s . Hence, Algorithm 1 is a hybrid between a traditional primal-dualmethod and a pure primal method. We believe that one could develop a pure primal-dualversion of our interior method. However, to keep our proofs as simple as possible we decidedto use this hybrid algorithm. To further understand how our algorithm generates its directionnote that d x ∈ argmin u ∈ B r ( ) M ψ µ x ( u ) implies there exists some δ ≥ ∇ ψ µ ( x ) + I δ ) d x = − ∇ ψ µ ( x ) . Using Sd y + Y d s + Sy = µ and substituting d s = ∇ a ( x ) d x into ( ∇ ψ µ ( x ) + I δ ) d x = − ∇ ψ µ ( x )we deduce that  ∇ xx L ( x, y ) + δ I − ∇ a ( x ) T ∇ a ( x ) 0 I S Y   d x d y d s  = −  ∇ x L ( x, y )0 Sy − µ  . (12)At each iteration the radius r is selected suﬃciently small such that the error on the Taylorseries approximations are small, i.e., (cid:12)(cid:12) ψ µ ( x ) + M ψ µ x ( d x ) − ψ µ ( x + ) (cid:12)(cid:12) ≈ | a i ( x ) + ∇ a i ( x ) d x − a i ( x + d x ) | a i ( x ) ≈ (cid:107) ∇ x L ( x, y ) − ∇ x L ( x + d x , y + d y ) − δd x (cid:107) ≈ . Suppose these Taylor series errors are suﬃciently small and the step x + d x is feasible. Then,when δ ≈ x + , y + ) will be an approximate Fritz John point and when δ (cid:29) α δ (cid:107) d x (cid:107) (see Lemma 6). However, thepoint x + d x need not be feasible. For example, if we were solving a linear program each of theseterms would be zero but x + d x could still be infeasible. Therefore we wish to select α smallenough that we remain feasible, this motivates our formula for α in Algorithm 1.Algorithm 1 terminates when it reaches an approximate second-order Fritz John point whichis deﬁned by (FJ1) and (FJ2). Deﬁnition 2. A ( µ, τ l , τ c ) -approximate ﬁrst-order Fritz John point is a point ( x, y ) deﬁned by a ( x ) , y > (FJ1.a) | y i a i ( x ) − µ | ≤ τ c µ ∀ i ∈ { , . . . , m } (FJ1.b) (cid:107) ∇ x L ( x, y ) (cid:107) ≤ τ l µ (cid:113) (cid:107) y (cid:107) + 1 . (FJ1.c)One should interpret (FJ1) thinking of µ ∈ (0 , ∞ ) becoming arbitrarily small, and τ l ∈ (0 , ∞ )as a ﬁxed constant which allows us to trade oﬀ how small we want (cid:107) ∇ x L ( x, y ) (cid:107) relative to y i a i ( x ). Additionally, τ c ∈ (0 ,

1) is a ﬁxed constant deﬁnes how tightly we want perturbedcomplementarity to hold.

Deﬁnition 3. A ( µ, τ l , τ c ) -approximate second-order Fritz John point ( x, y ) satisﬁesequation (FJ1) and ∇ ψ µ ( x ) (cid:23) −√ τ l (1 + (cid:107) y (cid:107) ) I . (FJ2) lgorithm 1 Adaptive trust region interior point algorithm with ﬁxed µ function Trust-IPM ( f, a, µ, τ l , L , η s , η x , x (0) ) Input: ∇ f and ∇ a are L -Lipschitz. The parameters η s ∈ (0 , η x ∈ (0 ,

1) are selectedusing diﬀerent formulas depending on whether the problem is convex or nonconvex. Always x (0) ∈ X . x ← x (0) for k = 0 , . . . , ∞ do S ← diag ( a ( x )) y ← µS − (cid:46) Primal update of dual variables. r ← η x (cid:113) µL (1+ (cid:107) y (cid:107) ) (cid:46) Trust region radius gets smaller as the dual variables get larger. ( d x , d s , d y ) ← Trust-region-direction ( f, a, µ, x, r ) α ← min (cid:110) η s (cid:107) S − d s (cid:107) , (cid:111) (cid:46) Pick a step size α ∈ (0 ,

1] to guarantee x + ∈ X . x + ← x + αd x y + ← y + αd y if ( x + , y + ) satisﬁes (FJ1) and (FJ2) then return ( x + , y + ) (cid:46) Termination criterion met. else x ← x + (cid:46) Only update primal variables, throw away new dual variable y + . end if end for end function function Trust-region-direction ( f, a, µ, x, r ) d x ∈ argmin u ∈ B r ( ) M ψ µ x ( u ) d s ← ∇ a ( x ) d x S ← diag ( a ( x )) d y ← − µS − d s return ( d x , d s , d y ) end function Note that (FJ1.b) and (FJ2) imply ∇ xx L ( x, y ) + µ ∇ a ( x ) T S − ∇ a ( x ) (cid:23) − ( √ τ l + L τ c ) (1 + (cid:107) y (cid:107) ) I with S = diag ( a ( x )). For some threshold ε >

0, let A ε = { i : a i ( x ) < ε } represent the set ofapproximately active constraints, then for all u ∈ { u ∈ B ( ) : ∇ a i ( x ) T u = 0 ∀ i ∈ A ε } wehave u T ∇ xx L ( x, y ) u ≥ − L ε − µ − ( √ τ l + L τ c ) (1 + (cid:107) y (cid:107) ) . Note that if we select ε = ω ( √ µ ) and let µ, τ l , τ c → ∇ xx L ( x, y ) is positive semideﬁnite projected onto the nullspace of the Jacobian of the activeconstraints. See [30, Section 12.4] for an explanation of the second-order necessary conditions.Algorithm 1 keeps µ ﬁxed since for our nonconvex results given in Theorem 1, a ﬁxed µ suﬃces. Practically log barrier methods solve a sequence of problems with decreasing µ .However, Algorithm 2 which is a specialized algorithm for the convex case, solves a sequence ofproblems with decreasing µ .We have omitted the details on how to solve the trust-region subproblems. One issue is thatthe matrix ∇ ψ µ ( x ) and vector ∇ ψ µ ( x ) may contain components that are exponentially large n 1 /µ . While we omit details of this issue from the paper, this can be resolved using the resultsof [43] which show one requires O (log log(1 /(cid:15) )) linear systems solves to solve one trust regionproblem.We remark that this paper provided intuition for the design of our practical one-phase IPMcode [16]. The stabilization steps of the one-phase IPM, where one attempts to minimize a logbarrier, is most strongly related to Trust-IPM . Similarities during these stabilization stepsinclude: A. Maintaining iterates that are exactly feasible using nonlinear slack variable updates ( s + = a ( x + )). B. Adaptive step size and trust region/regularization parameter choice.There are signiﬁcant diﬀerences between the algorithms. In contrast to

Trust-IPM the one-phase IPM is a primal-dual IPM, does not need a strictly feasible initial point, and does notneed to know any Lipschitz constants. Since the one-phase IPM [16] does not have a worst-caseiteration bound and the algorithms presented in this paper are not practical, it remains an openproblem to develop a practical IPM with a polynomial worst-case iteration dependence on 1 /µ . We develop some useful Lemmas in Section 4.1 to predict the quality of our local approximationsas a function of the direction sizes. In Section 4.2, we prove a key lemma, which bounds thedirections size in terms of predicted progress. To prove our main results we need the followingassumption.

Assumption 2. (Lipschitz derivatives) Assume that each a i : R n → R for i ∈ { , . . . , m } is acontinuous function on R n . Let L , L ∈ (0 , ∞ ) . The functions f : R n → R and a i : R n → R have L -Lipschitz ﬁrst derivatives and L -Lipschitz second derivatives on the set X . In this section, as a function of the direction sizes (cid:107) d x (cid:107) , (cid:107) Y − d y (cid:107) and (cid:107) S − d s (cid:107) , we boundthe following. Recall that x + and y + are the next iterates given by Algorithm 1. A. The gap between the predicted reduction and the actual reduction of the log barrier(Lemma 3). This allows us to convert predicted reduction M ψ µ x ( d x ) into a reductionin the log barrier. B. Perturbed complementarity (cid:12)(cid:12) a i ( x + ) T y + i − µ (cid:12)(cid:12) (Lemma 4). This allows us to establish when(FJ1.b) holds. C. The norm of the gradient of the Lagrangian (Lemma 5). This allows us to establish when(FJ1.c) holds. Therefore Lemma 4 and 5 allow us to reason about when we are at anapproximate Fritz John point.

Lemma 2.

Suppose the function g : R → R has L -Lipschitz ﬁrst derivatives and L -Lipschitzsecond derivatives on the set [0 , θ ] where θ ∈ R + . Further assume g (0) > , β ∈ (0 , / , and theinequality | θg (cid:48) (0) | g (0) + L θ g (0) ≤ β holds. Then g ( θ ) g (0) ∈ [ , ] and θ (cid:12)(cid:12)(cid:12) ∂ log( g ( θ )) ∂ θ (cid:12)(cid:12)(cid:12) ≤ L θ +6 L θ βg (0) + 5 β . The proof of Lemma 2 is given in Section B.1. Globally the log barrier does not haveLipschitz second derivatives. But Lemma 2 shows it is possible to bound the Lipschitz constantof second derivatives of log( g ( θ )) in a neighborhood of the current point.Lemma 2 only gives us a bound on the local Lipschitz constant for the second derivativesof log( g ( θ )) when g is univariate. By applying Lemma 2 with g ( θ ) := a i ( x + θv ), v = d x (cid:107) d x (cid:107) wecan bound the diﬀerence between the actual and predicted progress on the log barrier function.This bound is given in Lemma 3. emma 3. Suppose Assumption 2 holds (Lipschitz derivatives). Let x ∈ X , S = diag ( a ( x )) , d x ∈ R n , d s = ∇ a ( x ) d x , y = µS − , and κ ∈ (0 , / . If (cid:107) S − d s (cid:107) + L (cid:107) d x (cid:107) (cid:107) y (cid:107) µ ≤ κ, (13) then a i ( x + d x ) a i ( x ) ∈ [3 / , / for all i ∈ { , . . . , m } and (cid:12)(cid:12) ψ µ ( x ) + M ψ µ x ( d x ) − ψ µ ( x + d x ) (cid:12)(cid:12) ≤ L (cid:107) y (cid:107) ) (cid:107) d x (cid:107) + L (cid:107) d x (cid:107) (cid:107) y (cid:107) κ + µκ . The proof of Lemma 3 is given in Section B.2. Also, observe that if (13) holds for some x ∈ X and d x then (13) holds for any damped direction αd x with α ∈ [0 , Convex { x, x + αd x } ⊆ Convex { x, x + d x } ⊆ X . This observation ensures we can use Lemma 3 to establish the premisesof Lemma 4 and 5 which require Convex { x, x + } ⊆ X . Lemma 4.

Suppose Assumption 2 holds. Let

The reader might observe that our termination criteron (FJ1) has a strange mix ofnorms, in particular the size of ∇ x L ( x, y ) is measured using (cid:107)·(cid:107) and the the size of y is measuredby (cid:107)·(cid:107) . We attempt to explain this by showing how these norms naturally appear in the Lemmasin this section. The bound on (cid:107) ∇ x L ( x, y ) + ∇ xx L ( x, y ) T d x − d Ty ∇ x a ( x ) − ∇ x L ( x + , y + ) (cid:107) inLemma 5 contains a term of the form L (cid:107) y (cid:107) (cid:107) d x (cid:107) . This term is tight because if v = d x (cid:107) d x (cid:107) , d y =0 , x = 0 , a i ( x ) = L ( v T x ) + 1 , and f ( x ) = 0 then (cid:107) ∇ x L ( x, y ) + ∇ xx L ( x, y ) T d x − d Ty ∇ x a ( x ) − ∇ x L ( d x , d y ) (cid:107) = (cid:107) ∇ x L ( d x , y ) (cid:107) = (cid:107) (cid:80) i y i L ( v T d x ) v (cid:107) = L ( v T d x ) (cid:107) y (cid:107) = L (cid:107) y (cid:107) (cid:107) d x (cid:107) .Furthermore, one can see from this example that changing the norm of (cid:107) y (cid:107) would introduce adimension-factor and make the bound strictly weaker. Trust region subproblems can be eﬃcientlysolved when d x is bounded in Euclidean norm. For this reason, we choose to use the Euclideannorm to measure the size of d x . Inspection of the proof of Lemma 5 indicates that one cannotchange the norm on the term ∇ x L ( x, y ) + ∇ xx L ( x, y ) T d x − d Ty ∇ x a ( x ) − ∇ x L ( x + , y + ) withoutchanging the norm on the term d x or introducing a dimension-factor. For similar the reasonsit is inadvisable to change the norms on the term L (cid:107) y (cid:107) (cid:107) d x (cid:107) in Lemma 3. .2 Bounding the direction size of the slack variables This section presents Lemma 7 which allows us to bound the direction size of the slack variables.Before proving Lemma 7 we state Lemma 6 which contains some basic and well-known factsabout trust region subproblems that will be useful. The proof is given is Section B.5.

Lemma 6.

Consider g ∈ R n and a symmetric matrix H ∈ R m × n . Deﬁne ∆( u ) := u T Hu + g T u where ∆ : R n → R and let u ∗ ∈ argmin u ∈ B r ( ) ∆( u ) be an optimal solution to the trustregion subproblem for some r ≥ . Then there exists some δ ≥ such that: δ ( (cid:107) u ∗ (cid:107) − r ) = 0 , ( H + δ I ) u ∗ = − g, and H + δ I (cid:23) . (17) Conversely, if u ∗ satisﬁes (17) then u ∗ ∈ argmin u ∈ B r ( ) ∆( u ) . Let σ ( r ) := min u ∈ B r ( ) ∆( u ) ,then for all r ∈ [0 , ∞ ) we have σ ( r ) ≤ − δr σ ( r ) ≤ σ ( αr ) ≤ α σ ( r ) ∀ α ∈ [0 , . (18b) Furthermore, the function σ ( r ) is monotone decreasing and continuous. Lemma 7 which follows is key to our result, because it allows us to bound the size of (cid:107) S − d s (cid:107) (recall d s = ∇ a ( x ) d x ). We remark that often in linear programming one shows (cid:107) S − d s (cid:107) = O (1) to prove a O ( √ n log(1 /µ )) iteration bound. Combining Lemma 7 with the Lemmas fromSection 4.1 allows us to give concrete bounds on the reduction of the log barrier at each iteration.This underpins our main results in Section 5. Lemma 7.

Consider A ∈ R m × n , g ∈ R n , and a symmetric matrix H ∈ R m × n . Deﬁne ∆( u ) := u T ( H + A T A ) u + g T u where ∆ : R n → R and let d x ∈ argmin u ∈ B r ( ) ∆( u ) for some r ≥ . Then (cid:107) Ad x (cid:107) ≤ (cid:113) − d Tx Hd x − d x ) . (19) Proof

Observe that∆( d x ) = 12 d Tx ( H + A T A ) d x + g T d x = 12 d Tx ( H + A T A ) d x − d Tx ( H + A T A + δ I ) d x = − d Tx (cid:0) H + A T A (cid:1) d x − δ (cid:107) d x (cid:107) where the second transition use the fact from Lemma 6 that there exists some δ such that( H + A T A + δ I ) d x = − g . Rearranging this expression and using δ (cid:107) d x (cid:107) ≥ (cid:107) Ad x (cid:107) ≤ − d Tx Hd x − d x ) . (20)This concludes the proof of Lemma 7. (cid:3) Now, if we set H = ∇ xx L ( x, y ), A = √ µ S − ∇ a ( x ), S = diag ( a ( x )), and d s = ∇ a ( x ) d x then we deduce from Lemma 7 that (cid:107) S − d s (cid:107) ≤ (cid:115) − d Tx ∇ xx L ( x, y ) d x − M ψ µ x ( d x ) µ which if we assume ∇ xx L ( x, y ) is positive deﬁnite we deduce that (cid:107) S − d s (cid:107) ≤ (cid:115) − M ψ µ x ( d x ) µ . (21) lternately, in the nonconvex case if (cid:107) ∇ f ( x ) (cid:107) ≤ L and (cid:107) ∇ a i ( x ) (cid:107) ≤ L then (cid:107) S − d s (cid:107) ≤ (cid:115) L (1 + (cid:107) y (cid:107) ) (cid:107) d x (cid:107) − M ψ µ x ( d x ) µ . (22)We emphasize that (21) and (22) are unusual because the bound on (cid:107) S − d s (cid:107) is dependent onthe amount of predicted progress for a step size of α = 1, i.e., M ψ µ x ( d x ). This is related towhy it is critical that Algorithm 1 adaptively selects the step size. The intuition is as follows.At each iteration if we have not terminated then we want to reduce the barrier function bya ﬁxed quantity. Lemma 3 implies for suﬃciently small α that the new point x + αd x willreduce the barrier function proportional to M ψ µ x ( αd x ). If (cid:107) S − d s (cid:107) is small then we can takea step size with α = 1 and reduce the barrier function proportional to M ψ µ x ( d x ). On the otherhand, if (cid:107) S − d s (cid:107) is big we must pick α small to guarantee that we reduce the barrier functionproportional to M ψ µ x ( αd x ). Since α is small, the term M ψ µ x ( αd x ) is smaller than M ψ µ x ( d x ).Fortunately, this is counterbalanced because if (cid:107) S − d s (cid:107) is large that implies using either (21)and (22) that M ψ µ x ( d x ) is also large. This section outlines the proof of our main result, a bound on the number of iterations

Trust-IPM algorithm takes to ﬁnd a Fritz John point. Section 5.1 gives a general bound for thenumber of iterations to ﬁnd a Fritz John point, i.e., proves Theorem 1. Section 5.2 gives atighter bound in the case that f is convex and each a i is concave. In this section we prove our main result, Theorem 1 which bounds the number of iterations ofAlgorithm 1 to ﬁnd a Fritz John point by O (cid:0) µ − / (cid:1) . At a high level this proof is similar totypical cubic regularization/trust region arguments: we argue that if the termination conditionsare not satisﬁed at the next iterate then we have reduced the log barrier function by at leastΩ( µ / ). Before proving Theorem 1, we prove the auxiliary Lemmas 8 and 9. Lemma 8 showswe reduce the barrier merit function when the predicted progress at each iteration is large;Lemma 9 allows us to reason about when the algorithm will terminate.Recall that Algorithm 1 computes steps via S = diag ( a ( x )) , y = µS − (ITRS.a) r = η x (cid:114) µL ( (cid:107) y (cid:107) + 1) (ITRS.b) d x ∈ argmin u ∈ B r ( ) M ψ µ x ( u ) d s ← ∇ a ( x ) d x d y ← µS − d s (ITRS.c) x + = x + αd x y + = y + αd y . (ITRS.d)where (ITRS) stands for interior trust region subproblem.Also recall that τ l , τ c and µ are all parameters for our termination criterion (FJ1). Tosimplify the analysis we assume µ is small enough such that the following assumption holds. Wealso ﬁx the value of τ c which determines how tightly perturbed complementarity holds in (FJ1). Assumption 3 (Suﬃciently small µ ) . Let τ c = (cid:18) τ l µL (cid:19) / ∈ (0 , L µL ∈ (0 , . emma 8. Suppose Assumptions 2 and 3 hold (Lipschitz derivatives, and suﬃciently small µ ).Let x ∈ X , η s ∈ [0 , / , (ITRS) hold with η x = η s , and α = min (cid:110) , η s (cid:107) S − d s (cid:107) (cid:111) . Then x + ∈ X and ψ µ ( x + ) − ψ µ ( x ) ≤ µη s + max (cid:26) M ψ µ x ( d x ) , − η s µ (cid:27) . (23)Lemma 8 provides a bound on the progress as a function of the parameter η s ∈ [0 ,

1] whichcontrols the step size. This allows us to guarantee that we will be able to reduce the barrierfunction during Algorithm 1 if the predicted progress from solving the trust region subproblem M ψ µ x ( d x ) is suﬃciently large. The proof of Lemma 8 is given in Section C.1 and consists oftwo parts. The ﬁrst part uses (22), (ITRS), and the deﬁnition of α to argue that M ψ µ x ( αd x ) ≤ max {M ψ µ x ( d x ) , − η s µ/ } . The second part uses Lemma 3 to show that M ψ µ x ( αd x ) accuratelypredicts the reduction in the barrier function. Lemma 9.

Suppose (ITRS) , Assumptions 2 and 3 hold (direction selection, Lipschitz deriva-tives, and suﬃciently small µ ). Let x ∈ X , η x ∈ (0 , ( τ l µL ) / ] , and α = 1 . Furtherassume M ψ µ x ( d x ) ≥ − τ l µr √ (cid:107) y (cid:107) . Under these assumptions, ( x + , y + ) satisﬁes (FJ1) and ∇ ψ µ ( x ) (cid:23) −√ τ l (1 + (cid:107) y (cid:107) ) (cid:113) τ l µη x L I . Lemma 9 shows that if the predicted progress, M ψ µ x ( d x ), from the trust region step is smallthen the algorithm must terminate at the next iterate. The proof of Lemma 9 is given inSection C.2. It ﬁrst uses (22) and M ψ µ x ( d x ) ≥ − τ l µr (cid:112) (cid:107) y (cid:107) / (cid:107) S − d s (cid:107) and (cid:107) Y − d y (cid:107) must be small. This enables the use of Lemma 5 to bound (cid:107) ∇ L ( x + , y + ) (cid:107) .With Lemma 8 and 9 in hand we are now ready to prove our main result, Theorem 1. Theorem 1.

Suppose Assumption 2 and 4 hold (Lipschitz derivatives, and suﬃciently small µ ).Let f be convex and each a i concave. Then Trust-IPM ( f, a, µ, τ l , L , η s , η x , x (0) ) with x (0) ∈ X and η x = θ (cid:18) τ l µL (cid:19) / η s = θ (cid:18) τ l µL (cid:19) / θ = 1 / , ( η -2) takes at most O (cid:32) ψ µ ( x (0) ) − ψ ∗ µ µ (cid:18) L τ l µ (cid:19) / (cid:33) iterations to terminate with a ( µ, τ l , τ c ) -approximate ﬁrst-order Fritz John point ( x + , y + ) , i.e., (FJ1) holds. The proof of Lemma 10 is similar to Theorem 1 and is given in Section D. For this resultwe only need to prove that we have found an approximate ﬁrst-order Fritz John rather than anapproximate second-order Fritz John point (by the assumption f is convex and a i is concave wetrivially have ∇ ψ µ ( x ) (cid:23) f is convex and a i concave so we can apply (21) to bound (cid:107) S − d s (cid:107) instead of (22). While Lemma 10 specialized our guarantees to when f is convex and a i is concave, it only madea statement on how long it takes to ﬁnd a Fritz John point. However, ﬁnding a Fritz John pointdoes not necessarily guarantee optimality. The purpose of this section is to provide optimalityguarantees. We begin with a simple lemma showing that ﬁnding an approximate KKT pointimplies approximate optimality. We use this lemma to convert algorithms that ﬁnd approximateKKT points of the log barrier to algorithms that ﬁnd approximately optimal solutions. Finally,the main result (Theorem 2) is that under a certain regularity assumption, our algorithm, whenapplied to a sequence of subproblems with decreasing µ , takes at most O (cid:0) (cid:15) − / (cid:1) trust regionsubproblem solves to ﬁnd an (cid:15) -optimal solution. Lemma 11.

Let f : R n → R and a : R n → R m . Let (cid:107)X (cid:107) ≤ R . If ( x, y ) ∈ X × R m ++ and a i ( x ) y i ≥ µ for i ∈ { , . . . , m } then ψ µ ( x ) − ψ ∗ µ ≤ (cid:107) ∇ x L ( x, y ) (cid:107) R + (cid:80) mi =1 ( a i ( x ) y i − µ ) .Proof Let S := diag ( a ( x )) and ˜ y := y − µS − and q ( z ) := ψ µ ( z ) − a ( z ) T ˜ y . By a i ( x ) y i ≥ µ we have ˜ y i ≥

0. Now, ψ ∗ µ ≥ inf z ∈X q ( z ) ≥ q ( x ) − (cid:107) ∇ q ( x ) (cid:107) R = ψ µ ( x ) − (cid:107) ∇ q ( x ) (cid:107) R − a ( x ) T ˜ y, where the ﬁrst inequality uses a ( z ) T ˜ y ≥

0, the second inequality the convexity of q , and theﬁnal inequality the deﬁnition of q . The result follows by ∇ q ( x ) = ∇ x L ( x, y ). (cid:3) So far we have presented

Trust-IPM which only minimizes the log barrier with µ ﬁxed.However, log barrier methods traditionally solve a sequence of subproblems with µ tendingtoward zero as described in Algorithm 2. lgorithm 2 IPM with decreasing µ function Annealed-IPM ( f, a, µ (0) , x (0) , (cid:15) ) for j = 0 , . . . , ∞ do ( x ( j +1) , y ( j +1) ) ← Generic-IPM ( f, a, µ ( j ) , x ( j ) ) µ ( j +1) ← µ ( j ) / if µ ( j ) m ≤ (cid:15) thenreturn x ( j +1) end ifend forend function In Algorithm 2 we write

Generic-IPM as a placeholder for any algorithm that ﬁnds a FritzJohn point. The precise properties we need

Generic-IPM to satisfy are given in Assump-tion 5. For this paper will use

Generic-IPM = Trust-IPM but any other method satisfyingAssumption 5 would suﬃce. Then, as we show in Lemma 12 it is possible to give an iterationbound for the algorithm to ﬁnd a (cid:15) -optimal solution to the original problem.

Assumption 5.

Let (cid:107)X (cid:107) ≤ R . Suppose that for any µ ∈ (0 , ∞ ) , x ∈ X that Generic-IPM ( f, a, µ, x ) ﬁnds a point ( x + , y + ) with (cid:107) ∇ x L ( x + , y + ) (cid:107) ≤ µmR and (cid:12)(cid:12) a i ( x + ) y + i − µ (cid:12)(cid:12) ≤ µ/ in at most O (1) + ψ µ ( x ) − ψ ∗ µ µ w ( µ ) unit operations, where the function w : R → R is monotone decreas-ing. The term ‘unit operations’ is used to denote the metric for computational cost, this couldbe trust-region steps, linear system solves or matrix-vector multiplies.Before stating Lemma 12 we deﬁne f ∗ := inf z ∈X f ( z )log +2 ( x ) := max { log ( x ) , } . Lemma 12.

Trust-IPM only produce Fritz John points but to satisfy Assumption 5 weneed an algorithm that produces KKT points. Next, we present a regularity assumption whichenables us to convert a Fritz John point into a KKT point and thereby enables

Trust-IPM tosatisfy Assumption 5.

Assumption 6 (Regularity conditions) . Assume there exists some ζ > that if (FJ1) holdsthen (cid:107) y + (cid:107) + 1 ≤ ζ . One suﬃcient condition for Assumption 6 to hold is Slater’s condition, i.e., there exists somepoint x ∈ X and γ > a ( x ) > γ . We show this formally in Section E.1.Next, we present the main result of this section, Theorem 2, which combines Lemma 10, andLemma 12. To satisfy the premises of these lemmas we make the following assumption. ssumption 7 (Parameter settings) . Let τ l = mRζ / (A7. τ l ) τ c = (cid:18) τ l µL (cid:19) / (A7. τ c ) µ (0) = min (cid:26) L R ζm , L mRL √ ζ (cid:27) (A7. µ (0) ) where µ (0) represents the initial µ value of Annealed-IPM . Theorem 2.

Suppose Assumption 2, 6 and 7 hold (Lipschitz derivatives, regularity conditions,and parameter settings). Let f be convex and each a i concave. Let x (0) ∈ X and (cid:107)X (cid:107) ≤ R .Deﬁne η s , η x by ( η -2) and set Generic-IPM ( f, a, µ, x ) := Trust-IPM ( f, a, µ, τ l , L , η s , η x , x ) inside Annealed-IPM . Then

Annealed-IPM ( f, a, µ (0) , x (0) , (cid:15) ) takes at most O (cid:32)(cid:32) m / (cid:18) L R ζ(cid:15) (cid:19) / + 1 (cid:33) log + (cid:18) mµ (0) (cid:15) (cid:19) + ψ µ (0) ( x (0) ) − ψ ∗ µ (0) µ (0) (cid:18) L R ζm µ (0) (cid:19) / (cid:33) unit operations to return a point x ( k ) ∈ X with f ( x ( k ) ) − f ∗ ≤ (cid:15) . The proof of Theorem 2 appears in Section E.3. Notice that the iteration bound given inTheorem 2 comprises of two terms. The ﬁrst term is dependent on (cid:15) and corresponds to thetotal number of trust-region subproblems used during iterations j = 1 , . . . , k of Annealed-IPM . The second term corresponds to the number of inner iterations required in iteration j = 0 of Annealed-IPM , in other words, the number of trust region subproblems used by

Trust-IPM ( f, a, µ (0) , τ l , L , η s , η x , x (0) ). This second term has no (cid:15) dependence, and by sub-stituting the value of µ (0) given by (A7. µ (0) ) we observe this term is bounded by O (cid:18) ∆ f (cid:18) m L R ζ + R L ζ / L m (cid:19) + log( b ) (cid:18) m + R L ζL m (cid:19)(cid:19) where b is some constant such that a i ( x ) a i ( x (0) ) ≤ b for all i = 1 , . . . , m and x ∈ X . One diﬃculty with nonconvex optimization is that there are many choices termination criterionand this choice aﬀects iteration bounds. The results of Birgin et al. [3] guarantee to ﬁnd anunscaled KKT points or a certiﬁcate of local infeasibility. Their criterion is diﬀerent from ourFritz John termination criterion. Therefore for the sake of comparison we now introduce a newpair of termination criterion similar to the criterion they presented. Our own deﬁnition of anunscaled KKT point is a ( x ) ≥ − ε opt (KKT.a) (cid:107) ∇ x L ( x, y ) (cid:107) ≤ ε opt (KKT.b) y ≥ (KKT.c) a i ( x ) y i ≤ ε opt . (KKT.d)Let us contrast this deﬁnition with the deﬁnition of an unscaled KKT point given in Birgin et al.[3]. The most important diﬀerence is how complementarity is measured. In particular, in Birgin t al. [3] their termination criterion replaces (KKT.d) of our criterion with min { a i ( x ) , y i } ≤ ε opt .In this respect, the termination criterion of Birgin et al. [3] is stronger than (KKT). To detectinfeasibility we consider the following termination criterion.min i a i ( x ) < − ε opt / a ( x ) + t ≥ (INF1.b) (cid:13)(cid:13) ∇ a ( x ) T y (cid:13)(cid:13) ≤ ε inf (INF1.c) (cid:107) y (cid:107) = 1 (INF1.d) y ≥ (INF1.e)( a i ( x ) + t ) y i ≤ ε inf ε opt . (INF1.f)System (INF1) ﬁnds an approximate KKT point for the problem of minimizing the inﬁnitynorm of the constraint violation. In contrast, Birgin et al. [3] ﬁnd a stationary point for theEuclidean norm of the constraint violation squared which they denote by θ ( x ). However, thisis a weak measure of infeasibility since if θ ( x ) ≤ ε opt then automatically (cid:107) ∇ θ ( x ) (cid:107) ≤ ε opt . Thenatural termination criterion corresponding to (INF1) is an approximate KKT point for theproblem of minimizing the Euclidean norm of the constraint violation. This can be written asmin i a i ( x ) < − ε opt / (cid:13)(cid:13) ∇ a ( x ) T y (cid:13)(cid:13) ≤ ε inf (INF2.b) y = z (cid:107) z (cid:107) (INF2.c) a ( x ) + z ≥ (INF2.d)( a i ( x ) + z i ) y i = 0 (INF2.e) y ≥ . (INF2.f)To ﬁnd a point satisfying (INF2) they require (cid:107) ∇ θ ( x ) (cid:107) ≤ ε opt ε inf . If this condition holdsthen z = min { a ( x ) , } , y = z (cid:107) z (cid:107) satisﬁes (INF2). Finally, notice that both (INF1) and (INF2)ﬁnd points withmin i a i ( x ) < − ε opt / a i ( x ) y i ≤ ε inf (cid:13)(cid:13) ∇ a ( x ) T y (cid:13)(cid:13) ≤ ε inf y ≥ , which proves infeasibility in ball of radius R if ε inf = O ( ε opt / (1 + R )), f is convex, and a i isconcave [16, Observation 1].To obtain our algorithm that ﬁnds a point satisfying either (KKT) or (INF1), we apply Trust-IPM in two-phases (see

Two-Phase-IPM in Appendix F.1).Let x (0) ∈ R n be our starting point and deﬁne t (0) := ε opt { min i − a i ( x (0) ) , } . Phase-one applies Algorithm 1 to minimize the inﬁnity norm of the constraint violation, i.e., weﬁnd a Fritz John point of min x,t f P ( x, t ) := t (PI.a) a P ( x, t ) :=  a ( x ) + t t ε opt + t (0) − t  ≥ . (PI.b)Let ( x ( P , t ( P ) be the solution obtained. Starting from x ( P , phase-two minimizes theobjective subject to the ( ε opt -relaxed) constraints, i.e., we ﬁnd a Fritz John point ofmin x f ( x ) (PII.a) a P ( x ) := a ( x ) + ε opt ≥ (PII.b) tarting from the point obtained in phase-one.We replace Assumption 2 with Assumption 8, where X replaced with two sets, correspondingto phase-one and phase-two respectively:˜ X ( P := { x ∈ R n : a ( x ) ≥ − ( ε opt / t (0) ) } ˜ X ( P := { x ∈ R n : a ( x ) ≥ − ε opt } . By the deﬁnition of t (0) we have ˜ X ( P ⊆ ˜ X ( P . Assumption 8.

Assume that each a i : R n → R for i ∈ { , . . . , m } is a continuous function on R n . Let L , L ∈ (0 , ∞ ) . The functions a i : R n → R have L -Lipschitz ﬁrst derivatives and L -Lipschitz second derivatives on the set ˜ X ( P . The function f : R n → R and a i : R n → R has L -Lipschitz ﬁrst derivatives and L -Lipschitz second derivatives on the set ˜ X ( P . Before presenting Claim 3 let us introduce non-negative scalars c , ∆ f , and ∆ a chosen asfollows. c ≥ sup x ∈ ˜ X ( P max i ∈{ ,...,m } a i ( x ) (24a)∆ f ≥ sup z ∈ ˜ X ( P f ( z ) − inf z ∈ ˜ X ( P f ( z ) (24b)∆ a ≥ min i ∈{ ,...,m } max {− a i ( x (0) ) , } . (24c) Claim 3.

Let x (0) ∈ R n . Suppose Assumption 8 and (24) holds. Let f be L -Lipschitz. Assume c, ∆ a , ∆ f , L , L ≥ , ε opt ∈ (cid:16) , m log + ( c/ε opt ) (cid:105) , ε inf ∈ (0 , L m ] and ε opt ∈ (0 , √ ε inf ] . Then Two-Phase-IPM ( f, a, ε opt , ε inf , L , L , x (0) ) takes at most O (cid:32) ∆ a (cid:32) L / ε / inf ε / opt + 1 ε inf ε opt (cid:33) + ∆ f ε opt (cid:18) L L ε opt ε inf (cid:19) / (cid:33) trust region subproblem solves to return a point ( x, t, y ) that satisﬁes either (KKT) or (INF1) . The deﬁnition of

Two-Phase-IPM appears in Section F.1 and the proof of Claim 3 appearsin Section F.2. The proof is primarily devoted to analyzing phase-two when we minimize theobjective while approximately satisfying the constraints. We argue that when we terminatewith a Fritz John point in phase-two then either the dual variables are small enough that thisis a KKT point or if the dual variables are large the scaled dual variables give an infeasibilitycertiﬁcate. If we add the assumption that ε opt ∈ (0 , ε inf ] the iteration bound of Claim 3 can beeven more simply stated as O (cid:32) ∆ a + ∆ f ε opt (cid:18) L L ε opt ε inf (cid:19) / (cid:33) . (25)We can now compare with the results of [3] in Table 1. Table 1

This table compares iteration bounds under the setup of (25). It only includes dependen-cies on ε opt and ε inf . CRN stands for cubic regularized Newton [29].algorithm O (cid:16) ε − opt ε − inf (cid:17) gradient computation ∇ Birgin et al. [3] O (cid:16) ε − opt ε − / inf (cid:17) CRN with non-negativity constraint ∇ , ∇ IPM (this paper) O (cid:16) ε − / opt ε − / inf (cid:17) trust-region subproblem ∇ , ∇ he algorithm of Birgin et al. [3] sequentially ﬁnds KKT points to quadratic penalty sub-problems of the form,minimize ( x,r,s ) ∈ R n +1+ m Φ t ( x, r, s ) := ( f ( x ) − t + r ) + (cid:107) a ( x ) + s (cid:107) s.t. r ≥ , s ≥ . (26)To solve this subproblem method they suggest using p th order regularization with non-negativityconstraints. For p = 2 this reduces to cubic regularization Newton’s method with non-negativityconstraints, i.e.,minimize d ∈ R n +1+ m d T ∇ Φ t ( x, r, s ) d + ∇ Φ t ( x, r, s ) T d + C (cid:107) d (cid:107) s.t r + d r ≥ , s + d s ≥ (27)for some constant C > d = ( d x , d r , d s ). Solving this subproblem might be computa-tionally expensive. It is well-known that checking if a point is a local optimum of (27) is ingeneral NP-hard [31]. It is possible to ﬁnd an approximate KKT point using projected gradientdescent or an interior point method for solving nonconvex quadratic program [44]. However,both these approaches are likely to result in a computation runtime worse than O (cid:16) ε − opt ε − / inf (cid:17) .We speculate that one might also be able to apply the interior point method of Haeser et al. [15]as the unconstrained minimization algorithm for solving (26) and potentially obtain the runtimebound of O (cid:16) ε − opt ε − / inf (cid:17) given by [3], although further analysis is needed to conﬁrm this.Finally, Cartis et al. [8, 9] show that one requires O (cid:0) ε − opt (cid:1) iterations to ﬁnd a scaled KKTpoint: (cid:107) ∇ x L ( x, y ) (cid:107) ≤ ε opt ( (cid:107) y (cid:107) + 1) y ≥ a ( x ) ≥ − ε opt a i ( x ) y i ≤ ε opt (1 + (cid:107) y (cid:107) ) , or a certiﬁcate of infeasibility (with ε inf = 1). Their method only requires computation of ﬁrst-derivatives but has the disadvantage that it requires solving a linear program at each iteration. Since there has been relatively little work with general convex constraints we generate a set base-lines for comparison using existing methods for unconstrained optimization. To simplify thesecomparisons we consider the weaker problem of ﬁnding an (cid:15) -optimal solution to the problem ofminimize x ∈ R n max i ∈{ ,...,m } a i ( x ) . (28)To further simplify we assume the optimal objective value of (28) is zero, that the initial point x (0) satisﬁes a ( x (0) ) > and that L + L + L + R + m = O (1) with R ≥ (cid:107) x ∗ − x (0) (cid:107) where x ∗ is some optimal solution. To apply our IPM we can reformulate (28) asminimize ( x,t ) ∈ R n +1 t s.t. (cid:107) x − x (0) (cid:107) ≤ R + 1 , ≤ t ≤ t (0) + 1 , a i ( x ) ≤ t ∀ i ∈ { , . . . , m } , (29)where t (0) := 1+max i ∈{ ,...,m } a i ( x (0) ) and ( x (0) , t (0) ) is the starting point of our IPM. Note thatsubstituting this starting point into (29) implies Assumption 9 holds with γ = 1, and thereforeby Lemma 15 Assumption 6 holds with ζ = O (1 + µ ). This implies our IPM has a runtime of˜ O (cid:0) (cid:15) − / (cid:1) using Theorem 2.Another approach to solve (28) is to minimize ω p ( x ) := m (cid:88) i =1 max { a i ( x ) , } p +1 using a method that only requires the p th order derivative to be Lipschitz. To ﬁnd a pointsatisfying a ( x ) ≤ (cid:15) we need to ﬁnd a point with ω p ( x ) ≤ (cid:15) p +1 . We can then use genericunconstrained optimization methods such as cubic regularization, accelerated gradient descent, nd accelerated cubic regularization to solve this problem. These comparisons are summarizedin Table 2. Table 2

Iteration bounds to ﬁnd a point a ( x ) ≤ (cid:15) . Assume L + L + L + (cid:107) x ∗ − x (0) (cid:107) + m = O (1).SG = sub-gradient method [34], CRN = cubic regularized Newton [29], AGD = accelerated gradientdescent [26], ACRN = accelerated cubic regularized Newton of Monteiro and Svaiter [24]. Allsubproblems with a * have similar computational cost: a logarithmic number of linear systemsolves.algorithm O ( (cid:15) − ) matrix-vector product ∇ CRN on ω O ( (cid:15) − / ) cubic regularization subproblem* ∇ , ∇ AGD on ω O ( (cid:15) − ) matrix-vector product ∇ ACRN on ω O ( (cid:15) − / ) cubic regularization subproblem* ∇ , ∇ IPM (this paper) ˜ O (cid:0) (cid:15) − / (cid:1) trust region subproblem* ∇ , ∇ cutting plane O ( n log(1 /(cid:15) )) centre of polytope ∇ Acknowledgement

We’d like to thank Yair Carmon, Ron Estrin, and Michael Saunders for their useful feedback onthe paper.

References [1] Andersen, E. D. and Ye, Y. (1998). A computational study of the homogeneous algorithmfor large-scale convex optimization.

Computational Optimization and Applications , 10(3):243–269.[2] Bian, W., Chen, X., and Ye, Y. (2015). Complexity analysis of interior point algorithms fornon-Lipschitz and nonconvex minimization.

Mathematical Programming , 149(1-2):301–327.[3] Birgin, E. G., Gardenghi, J., Mart´ınez, J., Santos, S., and Toint, P. L. (2016). Evaluationcomplexity for nonlinear constrained optimization using unscaled KKT conditions and high-order models.

SIAM Journal on Optimization , 26(2):951–967.[4] Byrd, R. H., Gilbert, J. C., and Nocedal, J. (2000). A trust region method based on interiorpoint techniques for nonlinear programming.

Mathematical Programming , 89(1):149–185.[5] Byrd, R. H., Nocedal, J., and Waltz, R. A. (2006). KNITRO: An integrated package fornonlinear optimization. In

Large-scale nonlinear optimization , pages 35–59. Springer.[6] Carmon, Y., Duchi, J. C., Hinder, O., and Sidford, A. (2017a). Lower bounds for ﬁndingstationary points I. arXiv preprint arXiv:1710.11606 .[7] Carmon, Y., Duchi, J. C., Hinder, O., and Sidford, A. (2017b). Lower bounds for ﬁndingstationary points II: First-order methods. arXiv preprint arXiv:1711.00841 .[8] Cartis, C., Gould, N. I., and Toint, P. L. (2011). On the evaluation complexity of compositefunction minimization with applications to nonconvex nonlinear programming.

SIAM Journalon Optimization , 21(4):1721–1739.[9] Cartis, C., Gould, N. I., and Toint, P. L. (2014). On the complexity of ﬁnding ﬁrst-ordercritical points in constrained nonlinear optimization.

Mathematical Programming , 144(1-2):93–106.

10] Chen, L. and Goldfarb, D. (2006). Interior-point l -penalty methods for nonlinear program-ming with strong global convergence properties. Mathematical Programming , 108(1):1–36.[11] Conn, A. R., Gould, N. I., Orban, D., and Toint, P. L. (2000a). A primal-dual trust-regionalgorithm for non-convex nonlinear programming.

Mathematical programming , 87(2):215–249.[12] Conn, A. R., Gould, N. I., and Toint, P. L. (2000b).

Trust region methods . SIAM.[13] Curtis, F. E., Robinson, D. P., and Samadi, M. (2017). A trust region algorithm witha worst-case iteration complexity of O ( (cid:15) − / ) for nonconvex optimization. MathematicalProgramming , 162(1-2):1–32.[14] Gould, N. I., Orban, D., and Toint, P. (2002).

An interior-point l1-penalty method fornonlinear optimization . Citeseer.[15] Haeser, G., Liu, H., and Ye, Y. (2017). Optimality condition and complexity analysis forlinearly-constrained optimization without diﬀerentiability on the boundary. arXiv preprintarXiv:1702.04300 .[16] Hinder, O. and Ye, Y. (2018). A one-phase interior point method for nonconvex optimiza-tion. arXiv preprint arXiv:1801.03072 .[17] John, F. (1948). Extremum problems with inequalities as side conditions.

Studies andEssays: Courant Anniversary Volume , pages 187–204.[18] Karmarkar, N. (1984). A new polynomial-time algorithm for linear programming. In

Proceedings of the sixteenth annual ACM symposium on Theory of computing , pages 302–311.ACM.[19] Kojima, M., Mizuno, S., and Yoshise, A. (1989). A primal-dual interior point algorithmfor linear programming. In

Progress in mathematical programming , pages 29–47. Springer.[20] Mangasarian, O. L. and Fromovitz, S. (1967). The fritz john necessary optimality conditionsin the presence of equality and inequality constraints.

Journal of Mathematical Analysis andApplications , 17(1):37–47.[21] Megiddo, N. (1989). Pathways to the optimal set in linear programming. In

Progress inmathematical programming , pages 131–158. Springer.[22] Mehrotra, S. (1992). On the implementation of a primal-dual interior point method.

SIAMJournal on optimization , 2(4):575–601.[23] Monteiro, R. D. and Adler, I. (1989). Interior path following primal-dual algorithms. partI: Linear programming.

Mathematical programming , 44(1):27–41.[24] Monteiro, R. D. and Svaiter, B. F. (2013). An accelerated hybrid proximal extragradientmethod for convex optimization and its implications to second-order methods.

SIAM Journalon Optimization , 23(2):1092–1125.[25] Nemirovskii, A. and Yudin, D. B. (1983). Problem Complexity and Method Eﬃciency inOptimization.[26] Nesterov, Y. (1983). A method of solving a convex programming problem with convergencerate O (1 /k ). In Soviet Mathematics Doklady , volume 27, pages 372–376.[27] Nesterov, Y. (2013).

Introductory Lectures on Convex Optimization: A Basic Course ,volume 87. Springer Science & Business Media.

28] Nesterov, Y. and Nemirovskii, A. (1994).

Interior-point polynomial algorithms in convexprogramming . SIAM.[29] Nesterov, Y. and Polyak, B. T. (2006). Cubic regularization of newton method and itsglobal performance.

Mathematical Programming , 108(1):177–205.[30] Nocedal, J. and Wright, S. (2006).

Numerical Optimization . Springer Science & BusinessMedia.[31] Pardalos, P. M. and Schnitger, G. (1988). Checking local optimality in constrainedquadratic programming is NP-hard.

Operations Research Letters , 7(1):33–35.[32] Renegar, J. (1988). A polynomial-time algorithm, based on newton’s method, for linearprogramming.

Mathematical Programming , 40(1-3):59–93.[33] Richard Johnsonbaugh, W. P. (2010).

Foundations of mathematical analysis . Dover publi-cations, Mineola, New York.[34] Shor, N. (1985).

Minimization Methods for Non-diﬀerentiable Functions . ComputationalMathematics. Springer.[35] Sorensen, D. C. (1982). Newton’s method with a model trust region modiﬁcation.

SIAMJournal on Numerical Analysis , 19(2):409–426.[36] Sturm, J. F. (2002). Implementation of interior point methods for mixed semideﬁnite andsecond order cone optimization problems.

Optimization Methods and Software , 17(6):1105–1154.[37] Ulbrich, S. (2004). On the superlinear local convergence of a ﬁlter-SQP method.

Mathe-matical Programming , 100(1):217–245.[38] Vanderbei, R. J. (1999). LOQO user’s manual—version 3.10.

Optimization methods andsoftware , 11(1-4):485–514.[39] Vicente, L. N. and Wright, S. J. (2002). Local convergence of a primal-dual method for de-generate nonlinear programming.

Computational Optimization and Applications , 22(3):311–328.[40] W¨achter, A. and Biegler, L. T. (2005). Line search ﬁlter methods for nonlinear program-ming: Motivation and global convergence.

SIAM Journal on Optimization , 16(1):1–31.[41] W¨achter, A. and Biegler, L. T. (2006). On the implementation of an interior-point ﬁlterline-search algorithm for large-scale nonlinear programming.

Mathematical programming ,106(1):25–57.[42] Ye, Y. (1991). An O ( n L ) potential reduction algorithm for linear programming. Mathe-matical programming , 50(1):239–258.[43] Ye, Y. (1992). A new complexity result on minimization of a quadratic function witha sphere constraint. In

Recent Advances in Global Optimization , pages 19–31. PrincetonUniversity Press.[44] Ye, Y. (1998). On the complexity of approximating a KKT point of quadratic programming.

Mathematical programming , 80(2):195–211.[45] Ye, Y. (2018). MS&E311. Lecture note 12. https://web.stanford.edu/class/msande311/handout.shtml.[46] Ye, Y., Todd, M. J., and Mizuno, S. (1994). An O ( √ nL )-iteration homogeneous andself-dual linear programming algorithm. Mathematics of Operations Research , 19(1):53–67.

47] Zhang, Y. (1994). On the convergence of a class of infeasible interior-point methods for thehorizontal linear complementarity problem.

SIAM Journal on Optimization , 4(1):208–227.

A Proof of Claim 1

Claim 1.

8) exp( C/ (2 µ )) we have (cid:107) ∇ ψ µ ( x ( k ) ) (cid:107) ≥ µ .Proof. Suppose x (0) = exp( − C/ (2 µ )). Note that x (0) ≤ exp( − / (2 / −

5) and therefore ψ µ ( x (0) ) = x (0) − µ (cid:16) log (cid:16) x (0) (cid:17) + log (cid:16) − x (0) (cid:17)(cid:17) ≤ x (0) + C/ ≤ C ⇒ x (0) ∈ S C where the ﬁrst inequality uses that log (cid:0) x (0) (cid:1) = log(exp( − C/ (2 µ ))) = − C/ (2 µ ) and log (cid:0) − x (0) (cid:1) ≥ log(1) = 0, the second inequality uses x (0) ≤ exp( − ≤ C/

2, and the implication uses ψ ∗ µ ≥ ∇ ψ µ ( x (0) ) = 1 − µ (cid:18) x (0) − − x (0) (cid:19) ≤ − µ (exp( C/ (2 µ )) − ≤ − µ/ C/ (2 µ ))where the last inequality uses that µ exp( C/ (2 µ )) is monotone decreasing with respect to µ on theinterval [0 , C/

2] because ∂µ exp( C/ (2 µ )) /∂µ = − ( C − µ ) exp( C/ (2 µ )) / (2 µ ) ≤

0, and therefore µ exp( C/ (2 µ )) ≥ exp(2 / (2 / ≥

29. We conclude that if x (1) ≤ α ≤ µ exp( − C/ (2 µ )).On the other hand, if x ( k ) ∈ [1 / ,

1] then 0 . ≤ − / − / .

5) = ∇ ψ µ (1 / ≤ ∇ ψ µ ( x ( k ) ) ≤ ∇ ψ µ (1) ≤ ⇒ ∇ ψ µ ( x ( k ) ) ∈ [0 . , . x (0) = 1 and µ exp( − C/ (2 µ )) ≤ α it willtake at least µ exp( C/ (2 µ )) iterations until x ( k ) (cid:54)∈ [1 / , B Proofs from Section 4.1

B.1 Proof of Lemma 2

Lemma 2.

Suppose the function g : R → R has L -Lipschitz ﬁrst derivatives and L -Lipschitzsecond derivatives on the set [0 , θ ] where θ ∈ R + . Further assume g (0) > , β ∈ (0 , / , and theinequality | θg (cid:48) (0) | g (0) + L θ g (0) ≤ β holds. Then g ( θ ) g (0) ∈ [ , ] and θ (cid:12)(cid:12)(cid:12) ∂ log( g ( θ )) ∂ θ (cid:12)(cid:12)(cid:12) ≤ L θ +6 L θ βg (0) + 5 β .Proof We have | g (0) − g ( θ ) | g (0) ≤ | θg (cid:48) (0) | g (0) + L θ g (0) ≤ β ≤ . The ﬁrst inequality uses | g (0) + g (cid:48) (0) θ − g ( θ ) | ≤ L θ , the triangle inequality and g (0) > g ( θ ) g (0) ∈ [3 / , / g ( θ )), ∂ log( g ( θ )) ∂θ = g (cid:48) ( θ ) g ( θ ) ∂ log( g ( θ )) ∂ θ = g (cid:48)(cid:48) ( θ ) g ( θ ) − g (cid:48) ( θ ) g ( θ ) ∂ log( g ( θ )) ∂ θ = g (cid:48)(cid:48)(cid:48) ( θ ) g ( θ ) − g (cid:48) ( θ ) g (cid:48)(cid:48) ( θ ) g ( θ ) + 2 g (cid:48) ( θ ) g ( θ ) . (30) y (30), g ( θ ) g (0) ∈ [3 / , / | g (cid:48)(cid:48)(cid:48) ( θ ) | ≤ L , and | g (cid:48)(cid:48) ( θ ) | ≤ L we have (cid:12)(cid:12)(cid:12)(cid:12) ∂ log( g ( θ )) ∂ θ (cid:12)(cid:12)(cid:12)(cid:12) ≤ (4 / L g (0) + 3(4 / L | g (cid:48) ( θ ) | g (0) + 2(4 / | g (cid:48) ( θ ) | g (0) . Now, since g (cid:48) ( θ ) ≤ g (cid:48) (0) + L θ we have θ (cid:12)(cid:12)(cid:12)(cid:12) ∂ log( g ( θ )) ∂ θ (cid:12)(cid:12)(cid:12)(cid:12) ≤ (4 / L θ g (0) + 3(4 / L θ ( g (cid:48) (0) θ + L θ ) g (0) + 2(4 / ( | θg (cid:48) (0) | + L θ ) g (0) ≤ L θ + 6 L θ βg (0) + 5 β . (cid:3) B.2 Proof of Lemma 3

Lemma 3.

Suppose Assumption 2 holds (Lipschitz derivatives). Let x ∈ X , S = diag ( a ( x )) , d x ∈ R n , d s = ∇ a ( x ) d x , y = µS − , and κ ∈ (0 , / . If (cid:107) S − d s (cid:107) + L (cid:107) d x (cid:107) (cid:107) y (cid:107) µ ≤ κ, (13) then a i ( x + d x ) a i ( x ) ∈ [3 / , / for all i ∈ { , . . . , m } and (cid:12)(cid:12) ψ µ ( x ) + M ψ µ x ( d x ) − ψ µ ( x + d x ) (cid:12)(cid:12) ≤ L (cid:107) y (cid:107) ) (cid:107) d x (cid:107) + L (cid:107) d x (cid:107) (cid:107) y (cid:107) κ + µκ . Proof

First we aim to prove a i ( x + d x ) a i ( x ) ∈ [3 / , / v := d x / (cid:107) d x (cid:107) , g i ( θ ) := a i ( x + θv )and q i ( θ ) := sup ˆ θ ∈ [0 ,θ ] (cid:12)(cid:12)(cid:12) a i ( x + v ˆ θ ) − a i ( x ) (cid:12)(cid:12)(cid:12) for i ∈ { , . . . , m } . To obtain a contradiction assume q i ( ϑ ) > a i ( x )4 for some ϑ ∈ [0 , (cid:107) d x (cid:107) ] and i ∈ { , . . . , m } . Since a i is continuous it follows q i is continuous, and by the intermediate value theorem there exists some ˜ θ ∈ [0 , ϑ ] such that q i (˜ θ ) ∈ (cid:16) a i ( x )4 , a i ( x ) (cid:17) . Using that a i ( x ) has L -Lipschitz ﬁrst derivatives and L -Lipschitzsecond derivatives on the set X we deduce that g i ( θ ) satisﬁes the same properties on the set[0 , ˜ θ ]. Applying Lemma 2 and (13) we deduce q i (˜ θ ) ≤ a i ( x )4 contradicting the earlier statementthat q i (˜ θ ) ∈ (cid:16) a i ( x )4 , a i ( x ) (cid:17) . We conclude a i ( x + d x ) a i ( x ) ∈ [3 / , / (cid:12)(cid:12)(cid:12) ψ µ ( x ) + M ψ µ x ( d x ) − ψ µ ( x + d x ) (cid:12)(cid:12)(cid:12) we provide some auxiliary bounds. Deﬁne β i := | ∇ a i ( x ) T d x | a i ( x ) + L (cid:107) d x (cid:107) a i ( x ) , for all i ∈ { , . . . , m } . Then we have, (cid:107) β (cid:107) = m (cid:88) i =1 (cid:18) | ∇ a i ( x ) T d x | a i ( x ) + L (cid:107) d x (cid:107) y i µ (cid:19) ≤ m (cid:88) i =1 (cid:32)(cid:18) | ∇ a i ( x ) T d x | a i ( x ) (cid:19) + (cid:18) L (cid:107) d x (cid:107) µ (cid:19) y i (cid:33) = 2 (cid:107) S − d s (cid:107) + 2 (cid:18) L (cid:107) d x (cid:107) µ (cid:19) (cid:107) y (cid:107) ≤ κ here the ﬁrst equality uses 1 /a i ( x ) = y i /µ , the ﬁrst inequality uses the fact that ( a + b ) ≤ a + b ), and the ﬁnal inequality uses a + b ≤ ( a + b ) for a, b ≥

0. Hence, m (cid:88) i =1 β i ≤ (cid:107) β (cid:107) max i { β i } ≤ κ . (31)Observe, also by Taylor’s Theorem and the fact that f is Lipschitz on X that (cid:12)(cid:12)(cid:12)(cid:12) f ( x ) + 12 d x ∇ f ( x ) d x + ∇ f ( x ) T d x − f ( x + d x ) (cid:12)(cid:12)(cid:12)(cid:12) ≤ L (cid:107) d x (cid:107) . (32)Using Lemma 2 and Taylor’s Theorem with g i ( θ ) := a i ( x + θv ), h i ( θ ) := log( g i ( θ )), and v = d x (cid:107) d x (cid:107) , we get (cid:12)(cid:12)(cid:12)(cid:12) h i (0) + θh (cid:48) i (0) + θ h (cid:48)(cid:48) i (0) − h i ( θ ) (cid:12)(cid:12)(cid:12)(cid:12) ≤ θ ˆ θ ∈ [0 ,θ ] h (cid:48)(cid:48)(cid:48) i (ˆ θ ) ≤ (cid:18) L θ + 6 L θ β i g (0) + 5 β i (cid:19) . (33)We can now bound the quality of a second-order Taylor series expansion of ψ µ as (cid:12)(cid:12) ψ µ ( x ) + M ψ µ x ( d x ) − ψ µ ( x + d x ) (cid:12)(cid:12) ≤ L (cid:107) d x (cid:107) + µ m (cid:88) i =1 (cid:18) L (cid:107) d x (cid:107) + 6 L (cid:107) d x (cid:107) β i a i ( x ) + 5 β i (cid:19) ≤ L (cid:107) d x (cid:107) + m (cid:88) i =1 (cid:18) y i (cid:18) L (cid:107) d x (cid:107) L (cid:107) d x (cid:107) β i (cid:19) + µ β i (cid:19) ≤ L (cid:107) y (cid:107) ) (cid:107) d x (cid:107) + L (cid:107) d x (cid:107) (cid:107) y (cid:107) κ + µκ . The ﬁrst inequality uses (32) and (33). The second inequality uses 1 /a i ( x ) = y i /µ . The thirdinequality uses β i ≤ κ and (31). (cid:3) B.3 Proof of Lemma 4

Lemma 4.

Suppose Assumption 2 holds. Let

Convex { x, x + } ⊆ X , s = a ( x ) , s + = a ( x + ) , S = diag ( a ( x )) , Y = diag ( y ) , y + ∈ R m , Y + = diag ( y + ) , d x = x + − x , d y = y + − y , and d s = ∇ a ( x ) d x . If the equation Sy + Sd y + Y d s = µ holds, then (cid:107) Y − d y (cid:107) ≤ (cid:107) S − d s (cid:107) + (cid:107) µ ( SY ) − − (cid:107) (14) (cid:107) Y + s + − µ (cid:107) ≤ (cid:107) Sy (cid:107) ∞ (cid:107) S − d s (cid:107) (cid:107) Y − d y (cid:107) + L (cid:107) y (cid:107) (1 + (cid:107) Y − d y (cid:107) ) (cid:107) d x (cid:107) . (15) Furthermore, if (cid:107) Y + s + − µ (cid:107) ∞ < µ and (cid:107) Y − d y (cid:107) ∞ ≤ then s + , y + ∈ R m ++ .Proof To show (14) notice that multiplying Sy + Sd y + Y d s = µ by ( SY ) − and rearrangingyields Y − d y = − S − d s + (( Sy ) − µ − ).Next, we show (15). Observe that s + i y + i − µ = a i ( x + d x )( y i + d y i ) − µ = ( d s i + a i ( x ))( y i + d y i ) + ( a i ( x + d x ) − ( d s i + a i ( x )))( y i + d y i ) − µ = d s i d y i + ( a i ( x + d x ) − ( d s i + a i ( x )))( y i + d y i ) , (34)where the ﬁrst transition is by deﬁnition of s + i and y + i , the second transition comes from addingand subtracting ( d s i + a i ( x ))( y i + d y i ), and the third transition by substituting µ = s i y i + s i d y i + y i d s i = a i ( x ) y i + a i ( x ) d y i + y i d s i . Furthermore, since ∇ a i is L -Lipschitz continuous on X , | a i ( x + d x ) − ( d s i + a i ( x )) | = | a i ( x + d x ) − ( ∇ a i ( x ) d x + a i ( x )) | ≤ L (cid:107) d x (cid:107) , ombining this equality with (34) yields (cid:12)(cid:12) s + i y + i − µ (cid:12)(cid:12) ≤ | d s i d y i | + L y + i (cid:107) d x (cid:107) ≤ | s i y i | (cid:12)(cid:12) s − i d s i (cid:12)(cid:12)(cid:12)(cid:12) y − i d y i (cid:12)(cid:12) + L y i (1 + y − i d y i ) (cid:107) d x (cid:107) . We deduce (15) by Cauchy-Schwarz. The fact that y + ∈ R m + follows from (cid:107) Y − d y (cid:107) ∞ ≤

1. Thefact that y + , s + ∈ R m ++ follows from y + ∈ R m + and (cid:107) S + y + − µ (cid:107) ∞ < µ . (cid:3) B.4 Proof of Lemma 5

Lemma 5.

Suppose Assumption 2 holds. Let y, y + ∈ R m and Convex { x, x + } ⊆ X . Then thefollowing inequality holds: (cid:107) ∇ x L ( x, y ) + ∇ xx L ( x, y ) T d x − d Ty ∇ x a ( x ) − ∇ x L ( x + , y + ) (cid:107) ≤ L (cid:107) y (cid:107) (cid:107) d x (cid:107) (cid:107) Y − d y (cid:107) + L (cid:107) y (cid:107) + 1) (cid:107) d x (cid:107) (16) with d x = x + − x and d y = y + − y .Proof Observe that: (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:88) i (cid:0) y i ∇ a i ( x ) + y i ∇ a i ( x ) d x − d y i ∇ a i ( x ) − y + i ∇ a i ( x + ) (cid:1)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) ≤ (cid:88) i (cid:13)(cid:13) y i ∇ a i ( x ) + y i ∇ a i ( x ) d x + d y i ∇ a i ( x ) − y + i ∇ a i ( x + ) (cid:13)(cid:13) ≤ (cid:107) y (cid:107) (cid:13)(cid:13) ∇ a i ( x ) + ∇ a i ( x ) d x − ∇ a i ( x + ) (cid:13)(cid:13) + (cid:107) d y (cid:107) (cid:13)(cid:13) ∇ a i ( x ) − ∇ a i ( x + ) (cid:13)(cid:13) ≤ L (cid:107) y (cid:107) (cid:107) d x (cid:107) + L (cid:107) d y (cid:107) (cid:107) d x (cid:107) , where the ﬁrst and second transition hold by the triangle inequality, the third transition applying(5) using the Lipschitz continuity of ∇ a and ∇ a . Next, by the triangle inequality, the inequalitywe just established, and Taylor’s theorem with Lipschitz continuity of ∇ f we get (cid:107) ∇ x L ( x, y ) + ∇ xx L ( x, y ) T d x − d Ty ∇ x a ( x ) − ∇ x L ( x + , y + ) (cid:107) ≤ (cid:13)(cid:13) ∇ f ( x ) + ∇ f ( x ) d x − ∇ f ( x + ) (cid:13)(cid:13) + (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:88) i (cid:0) y i ∇ a i ( x ) + y i ∇ a i ( x ) d x + d y i ∇ a i ( x ) − y + i ∇ a i ( x + ) (cid:1)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) ≤ L (cid:107) y (cid:107) + 1) (cid:107) d x (cid:107) + L (cid:107) d y (cid:107) (cid:107) d x (cid:107) . (35) (cid:3) B.5 Proof of Lemma 6

Lemma 6.

Consider g ∈ R n and a symmetric matrix H ∈ R m × n . Deﬁne ∆( u ) := u T Hu + g T u where ∆ : R n → R and let u ∗ ∈ argmin u ∈ B r ( ) ∆( u ) be an optimal solution to the trustregion subproblem for some r ≥ . Then there exists some δ ≥ such that: δ ( (cid:107) u ∗ (cid:107) − r ) = 0 , ( H + δ I ) u ∗ = − g, and H + δ I (cid:23) . (17) Conversely, if u ∗ satisﬁes (17) then u ∗ ∈ argmin u ∈ B r ( ) ∆( u ) . Let σ ( r ) := min u ∈ B r ( ) ∆( u ) ,then for all r ∈ [0 , ∞ ) we have σ ( r ) ≤ − δr σ ( r ) ≤ σ ( αr ) ≤ α σ ( r ) ∀ α ∈ [0 , . (18b) Furthermore, the function σ ( r ) is monotone decreasing and continuous. roof Equation (17) follows from the KKT conditions, see Sorensen [35, Lemma 2.4.], Connet al. [12, Corollary 7.2.2] or Nocedal and Wright [30, Theorem 4.3.]. We now show (18a).Substituting ( H + δ I ) u ∗ = − g into ( u ∗ ) T Hu ∗ + g T u ∗ yields σ ( r ) = ∆( u ∗ ) = 1 / g T u ∗ − δ/ (cid:107) u ∗ (cid:107) ≤ − δ/ (cid:107) u ∗ (cid:107) where the last inequality follows from g T u ∗ = − g T ( H + δ I ) − g ≤ δ = 0 or (cid:107) u ∗ (cid:107) = r we conclude (18a) holds. The inequality σ ( αr ) ≤ α σ ( r ) holds since σ ( αr ) ≤ ∆( αu ∗ ) = α ( u ∗ ) T Hu ∗ + αg T u ∗ ≤ α ( u ∗ ) T Hu ∗ + α g T u ∗ = α σ ( r ) where the inequality uses g T u ∗ ≤

0. The inequality σ ( r ) ≤ σ ( αr ) holds sinceany solution to (cid:107) u (cid:107) ≤ r is feasible to (cid:107) u (cid:107) ≤ αr . The fact that σ ( r ) is monotone decreasingand continuous follows from (18b). (cid:3) C Proofs of results in Section 5.1

C.1 Proof of Lemma 8

Lemma 8.

Suppose Assumptions 2 and 3 hold (Lipschitz derivatives, and suﬃciently small µ ).Let x ∈ X , η s ∈ [0 , / , (ITRS) hold with η x = η s , and α = min (cid:110) , η s (cid:107) S − d s (cid:107) (cid:111) . Then x + ∈ X and ψ µ ( x + ) − ψ µ ( x ) ≤ µη s + max (cid:26) M ψ µ x ( d x ) , − η s µ (cid:27) . (23) Proof

Our ﬁrst goal is to show for all α ∈ (0 ,

1] that M ψ µ x ( αd x ) ≤ max (cid:26) M ψ µ x ( d x ) , − η s µ (cid:27) . (36)Note (36) trivially holds if α = 1. Therefore let us consider the case α ∈ (0 , α = η s (cid:107) S − d s (cid:107) ≥ η s (cid:115) µL ( (cid:107) y (cid:107) + 1) (cid:107) d x (cid:107) − M ψ µ x ( d x ) ≥ η s (cid:115) µη s µ/ − M ψ µ x ( d x ) (37)where the ﬁrst inequality uses (22), and the second inequality uses (cid:107) d x (cid:107) ≤ η s (cid:113) µL ( (cid:107) y (cid:107) +1) .Furthermore, if M ψ µ x ( d x ) ∈ (cid:104) − η s µ , (cid:105) from (37) we get α ≥ (cid:112) / >

1; by contradiction weconclude M ψ µ x ( d x ) (cid:54)∈ (cid:104) − η s µ , (cid:105) . Using M ψ µ x ( d x ) (cid:54)∈ (cid:104) − η s µ , (cid:105) and M ψ µ x ( d x ) ≤ M ψ µ x ( ) = 0(recall deﬁnition of d x in (ITRS)), we deduce M ψ µ x ( d x ) < − η s µ . Combining M ψ µ x ( d x ) < − η s µ with (37) yields α ≥ η s (cid:113) µ − M ψµx ( d x ) . Therefore, M ψ µ x ( αd x ) = α d Tx ∇ ψ µ ( x ) d x + α ∇ ψ µ ( x ) T d x ≤ α M ψ µ x ( d x ) ≤ − η s µ ∇ ψ µ ( x ) T d x ≤ α ≥ η s (cid:113) µ − M ψµx ( d x ) . Thus (36) holds.It remains to bound the accuracy of the predicted decrease M ψ µ x ( αd x ). Note that by α ∈ [0 , η x = η s we have (cid:107) αd x (cid:107) ≤ (cid:107) d x (cid:107) ≤ η s (cid:114) µL ( (cid:107) y (cid:107) + 1) = η x (cid:114) µL ( (cid:107) y (cid:107) + 1) . (38)Let us select κ = (21 / η s , this choice satisﬁes the premise of Lemma 3 because α (cid:107) S − d s (cid:107) + L (cid:107) αd x (cid:107) (cid:107) y (cid:107) µ ≤ η s + η s ≤ (21 / η s = κ (39) here the ﬁrst inequality comes from α (cid:107) S − d s (cid:107) ≤ η s and (38), and the third inequality uses η s ∈ [0 , / η s ∈ [0 , /

5] we deduce κ ≤ / x + ∈ X . From Lemma 3, (cid:12)(cid:12) ψ µ ( x ) + M ψ µ x ( αd x ) − ψ µ ( x + αd x ) (cid:12)(cid:12) ≤ L (cid:107) y (cid:107) ) (cid:107) αd x (cid:107) + L (cid:107) αd x (cid:107) (cid:107) y (cid:107) κ + µκ ≤ L / µ − / (cid:107) y (cid:107) ) (cid:107) αd x (cid:107) + L (cid:107) αd x (cid:107) (cid:107) y (cid:107) κ + µκ ≤ (cid:18) × + (1 / )(21 / + (21 / (cid:19) µη s ≤ µη s (40)where the second inequality uses L µL ∈ (0 ,

1] from Assumption 3, the third inequality uses ourbound on (cid:107) αd x (cid:107) and κ , i.e., (38) and (39). Combining (36) and (40) gives (23). (cid:3) C.2 Proof of Lemma 9

Lemma 9.

Suppose (ITRS) , Assumptions 2 and 3 hold (direction selection, Lipschitz deriva-tives, and suﬃciently small µ ). Let x ∈ X , η x ∈ (0 , ( τ l µL ) / ] , and α = 1 . Furtherassume M ψ µ x ( d x ) ≥ − τ l µr √ (cid:107) y (cid:107) . Under these assumptions, ( x + , y + ) satisﬁes (FJ1) and ∇ ψ µ ( x ) (cid:23) −√ τ l (1 + (cid:107) y (cid:107) ) (cid:113) τ l µη x L I .Proof First, let us bound (cid:107) S − d s (cid:107) : (cid:107) S − d s (cid:107) ≤ (cid:115) L ( (cid:107) y (cid:107) + 1) (cid:107) d x (cid:107) − M ψ µ x ( d x ) µ ≤ (cid:115) η x + 23 η x (cid:18) µτ l L (cid:19) / ≤ (cid:115) (cid:18) µτ l L (cid:19) / + 260 (cid:18) µτ l L (cid:19) / ≤ (cid:18) τ l µL (cid:19) / , where the ﬁrst inequality uses (22), the second r = η x (cid:113) µL ( (cid:107) y (cid:107) +1) and M ψ µ x ( d x ) ≥ − τ l µr √ (cid:107) y (cid:107) ,the third inequality uses η x ∈ (0 , ( τ l µL ) / ], and the ﬁnal inequality uses τ l µL ∈ (0 ,

1] from As-sumption 3.Let us compute κ from Lemma 3: (cid:107) S − d s (cid:107) + L (cid:107) d x (cid:107) (cid:107) y (cid:107) µ ≤ (cid:16) τ l µL (cid:17) / + η x ≤ = κ wherethe ﬁrst inequality uses (cid:107) d x (cid:107) ≤ r = η x (cid:113) µL ( (cid:107) y (cid:107) +1) ≤ η x (cid:113) µL (cid:107) y (cid:107) and the second inequalityuses η x ≤ (cid:16) τ l µL (cid:17) / ∈ (0 , / Convex { x, x + } ⊆ X .Furthermore, by Lemma 4, the fact y = µS − , and our bound on (cid:107) S − d s (cid:107) we have (cid:107) Y − d y (cid:107) ≤ (cid:107) S − d s (cid:107) ≤ (cid:18) τ l µL (cid:19) / . herefore, (cid:107) δd x − ∇ x L ( x + , y + ) (cid:107) ≤ L (cid:107) d x (cid:107) (cid:107) y (cid:107) (cid:107) Y − d y (cid:107) + L (cid:107) y (cid:107) + 1) (cid:107) d x (cid:107) ≤ L (cid:107) d x (cid:107) (cid:107) y (cid:107) (cid:107) Y − d y (cid:107) + L / µ − / (cid:107) y (cid:107) + 1) (cid:107) d x (cid:107) ≤ η x τ l µ (cid:112) (cid:107) y (cid:107) (cid:18) τ l µL (cid:19) − / + L / µ / η x ≤ τ l µ (cid:112) (cid:107) y (cid:107)

100 + µτ l ≤ µτ l (cid:112) (cid:107) y (cid:107) ≤ µτ l (cid:112) (cid:107) y (cid:107) where the ﬁrst inequality follows from Lemma 5, the second by L µL ∈ (0 , (cid:107) Y − d y (cid:107) ≤ (cid:16) τ l µL (cid:17) / that we just proved and (cid:107) d x (cid:107) ≤ η x (cid:113) µL ( (cid:107) y (cid:107) +1) , andthe fourth inequality using η x ∈ (0 , ( τ l µL ) / ].Next, we bound δ (cid:107) d x (cid:107) . By (12) there exists some δ ≥ ∇ x L ( x, y )+ ∇ xx L ( x, y ) T d x − d Ty ∇ x a ( x ) = δd x . Moreover, − τ l µr √ (cid:107) y (cid:107) ≤ M ψ µ x ( u ) ≤ − δr by (18a), so δ (cid:107) d x (cid:107) ≤ δr ≤ τ l µ (cid:112) (cid:107) y (cid:107) . Therefore using the bounds on δ (cid:107) d x (cid:107) and (cid:107) δd x − ∇ x L ( x + , y + ) (cid:107) that weproved, (cid:107) ∇ x L ( x + , y + ) (cid:107) ≤ (cid:107) δd x − ∇ x L ( x + , y + ) (cid:107) + δ (cid:107) d x (cid:107) ≤ τ l µ (cid:112) (cid:107) y (cid:107) + τ l µ (cid:112) (cid:107) y (cid:107) ≤ τ l µ (cid:112) (cid:107) y (cid:107) . This shows (FJ1.c) holds. It remains to show (FJ1.a) and (FJ1.b). From Lemma 4 we get (cid:107) S + y + − µ (cid:107) ≤ µ (cid:107) S − d s (cid:107) (cid:107) Y − d y (cid:107) + L (cid:107) y (cid:107) (1 + (cid:107) Y − d y (cid:107) ) (cid:107) d x (cid:107) ≤ µ (cid:18) τ l µL (cid:19) / + µη x ≤ µ (cid:18) τ l µL (cid:19) / = µτ c , where the second inequality uses (cid:107) Y − d y (cid:107) ≤ (cid:107) S − d s (cid:107) ≤ (cid:16) τ l µL (cid:17) / ≤ (cid:107) d x (cid:107) ≤ r = η x (cid:113) µL ( (cid:107) y (cid:107) +1) , and the third inequality η x ∈ (0 , ( τ l µL ) / ]. Therefore (FJ1) holds.Let v min be the eigenvector of ∇ ψ µ ( x ) corresponding to the minimum eigenvalue of ∇ ψ µ ( x ).Note that − τ l µr (cid:112) (cid:107) y (cid:107) ≤ M ψ µ x ( d x ) ≤ min {M ψ µ x ( rv min ) , M ψ µ x ( − rv min ) } ≤ λ min ( ∇ ψ µ ( x )) r λ min ( · ) denotes the minimum eigenvalue. Therefore λ min ( ∇ ψ µ ( x )) ≥ − τ l µ (cid:112) (cid:107) y (cid:107) r = − √ τ l (1 + (cid:107) y (cid:107) )3 (cid:114) τ l µη x L . (cid:3) C.3 Proof of Theorem 1

Theorem 1.

Suppose Assumptions 2 and 3 hold (Lipschitz derivatives, and suﬃciently small µ ). Then Trust-IPM ( f, a, µ, τ l , L , η s , η x , x (0) ) with x (0) ∈ X and η s = 140 (cid:18) τ l µL (cid:19) / η x = η s , ( η -1) akes at most O (cid:32) ψ µ ( x (0) ) − ψ ∗ µ µ (cid:18) L µτ l (cid:19) / (cid:33) iterations to terminate with a ( µ, τ l , τ c ) -approximate second-order Fritz John point ( x + , y + ) , i.e., (FJ1) and (FJ2) hold.Proof Let x ∈ X be some iterate of the algorithm with corresponding direction d x . If − τ l µr √ (cid:107) y (cid:107) ≥ M ψ µ x ( d x ) then ψ µ ( x + αd x ) − ψ µ ( x ) ≤ µη s + max (cid:26) M ψ µ x ( d x ) , − η s µ (cid:27) ≤ µη s max  η s − (cid:115) τ l µL , η s − η s  (41)= µ (cid:18) τ l µL (cid:19) / max (cid:40)(cid:18) − × (cid:19) (cid:18) τ l µL (cid:19) / , (cid:18) τ l µL (cid:19) / − × (cid:41) ≤ µ (cid:18) τ l µL (cid:19) / (cid:18) − × (cid:19) = − µ × (cid:18) τ l µL (cid:19) / (42)where the ﬁrst transition uses Lemma 8, the second transition uses M ψ µ x ( d x ) ≥ − τ l µr √ (cid:107) y (cid:107) = − µη s (cid:113) τ l µL , the third transition uses η s = (cid:16) τ l µL (cid:17) / , and the fourth transition uses τ l µL ∈ (0 , x, d x ) denote the current primal iterate and direction. Let ( x + , d x + ) denote the subse-quent primal iterate and direction. By Lemma 9 if − τ l µr √ (cid:107) y (cid:107) ≤ M ψ µ x ( d x ) then (FJ1) holds.Also, by Lemma 9 if − τ l µr √ (cid:107) y + (cid:107) ≤ M ψ µ x + ( d + x ) then ∇ ψ µ ( x + ) (cid:23) −√ τ l (1 + (cid:107) y + (cid:107) ) (cid:113) τ l µη x L I (cid:23)− √ τ l (1 + (cid:107) y + (cid:107) ) I , i.e., (FJ2) holds. Therefore if both − τ l µr √ (cid:107) y (cid:107) ≤ M ψ µ x ( d x ) and − τ l µr √ (cid:107) y + (cid:107) ≤ M ψ µ x + ( d + x ) the algorithm terminates.It remains to show that if either − τ l µr √ (cid:107) y (cid:107) > M ψ µ x ( d x ) or − τ l µr √ (cid:107) y + (cid:107) > M ψ µ x + ( d + x )then over these two iterations we reduce the function value by a constant quantity. First notethat even if M ψ µ x ( d x ) ≥ − τ l µr √ (cid:107) y (cid:107) we by M ψ µ x ( d x ) ≤ ψ µ ( x + αd x ) − ψ µ ( x ) ≤ µη s + max (cid:26) M ψ µ x ( d x ) , − η s µ (cid:27) ≤ µη s = 2 µ (cid:18) τ l µL (cid:19) / (43)where the ﬁrst inequality follows from Lemma 8. The same equation applies replacing ( x, d x )with ( x + , d + x ). By applying (42) and (43) we can see that if over these two iterations thealgorithm did not terminate then ψ µ must have been reduced by at least µ (cid:18) × − (cid:19) (cid:18) τ l µL (cid:19) / = 7 µ × (cid:18) τ l µL (cid:19) / . To conclude note if the algorithm has not terminated across iterations 0 , . . . , K then letting x ( k ) be the k th x iterate, ψ µ ( x (0) ) − ψ ∗ µ ≥ (cid:80) K − k =0 ( ψ µ ( x ( k ) ) − ψ µ ( x ( k +1) )) ≥ K − × µ × (cid:16) τ l µL (cid:17) / ,rearranging to bound K gives the result. (cid:3) Proof of results in Section 5.2

The main purpose of this section is to prove Lemma 10. Before we prove this result in Section D.3we prove two auxiliary Lemmas. Lemma 13 is the convex version of Lemma 8 and Lemma 14 isthe convex version of Lemma 8.

D.1 Proof of Lemma 13

Lemma 13.

Suppose (ITRS) , Assumption 2 and 4 hold (direction selection, Lipschitz deriva-tives, and suﬃciently small µ ). Let f be convex and each a i concave. Let x ∈ X , η x = θ (cid:16) τ l µL (cid:17) / , η s = θ (cid:16) τ l µL (cid:17) / , α = min (cid:110) , η s (cid:107) S − d s (cid:107) (cid:111) , and θ ∈ [0 , / . Then x + ∈ X and ψ µ ( x + αd x ) − ψ µ ( x ) ≤ µθ (cid:18) τ l µL (cid:19) / + max (cid:40) M ψ µ x ( d x ) , − θ µ (cid:18) τ l µL (cid:19) / (cid:41) . Proof

First we show M ψ µ x ( αd x ) ≤ max (cid:26) M ψ µ x ( d x ) , − η s µ (cid:27) , (44)which trivially holds if α = 1. Therefore let us consider the case α ∈ (0 , α = η s (cid:107) S − d s (cid:107) ≥ η s (cid:115) µ − M ψ µ x ( d x ) . (45)Therefore, M ψ µ x ( αd x ) = α d Tx ∇ ψ µ ( x ) d x + α ∇ ψ µ ( x ) T d x ≤ α M ψ µ x ( d x ) ≤ − η s µ ∇ ψ µ ( x ) T d x ≤ α ≥ η s (cid:113) µ − M ψµx ( d x ) , and the third inequality by (45). We conclude (44) holds.It remains to bound the accuracy of the predicted decrease M ψ µ x ( αd x ). Let us bound theconstant κ from Lemma 3, α (cid:107) S − d s (cid:107) + L (cid:107) αd x (cid:107) (cid:107) y (cid:107) µ ≤ θ (cid:0) τ l µ/L (cid:1) / + θ (cid:0) τ l µ/L (cid:1) / ≤ (7 / θ (cid:0) τ l µ/L (cid:1) / = κ (46)where the second inequality comes from α (cid:107) S − d s (cid:107) ≤ η s = θ (cid:16) τ l µL (cid:17) / and (cid:107) αd x (cid:107) ≤ θ (cid:0) τ l µ/L (cid:1) / (cid:113) µL ( (cid:107) y (cid:107) +1) ,the third inequality from θ ∈ [0 , / θ ∈ [0 , /

6] and τ l µ/L ∈ (0 ,

1] we deduce κ ≤ / x + ∈ X . Furthermore, from Lemma 3, (cid:12)(cid:12) ψ µ ( x ) + M ψ µ x ( αd x ) − ψ µ ( x + αd x ) (cid:12)(cid:12) ≤ L (cid:107) y (cid:107) ) (cid:107) d x (cid:107) + L (cid:107) d x (cid:107) (cid:107) y (cid:107) κ + µκ ≤ L / τ / l µ − / (1 / / (cid:107) y (cid:107) ) (cid:107) αd x (cid:107) + L (cid:107) αd x (cid:107) (cid:107) y (cid:107) κ + µκ ≤ µθ (cid:0) τ l µ/L (cid:1) / + µθ (cid:0) τ l µ/L (cid:1) / + (7 / µθ (cid:0) τ l µ/L (cid:1) / ≤ µθ (cid:0) τ l µ/L (cid:1) / (47)where the second inequality uses L µL τ l ∈ (0 , (cid:107) αd x (cid:107) and κ . Combining (44) and (47) gives the result. (cid:3) .2 Proof of Lemma 14 Lemma 14.

Suppose (ITRS) , Assumption 2 and 4 hold (Lipschitz derivatives, and suﬃcientlysmall µ ). Let f be convex and each a i concave. Let x ∈ X , η x ∈ (0 , ( µτ l L ) / ] , and α = 1 . Un-der these assumptions, if M ψ µ x ( d x ) ≥ − τ l µr √ (cid:107) y (cid:107) then ( x + , y + ) is an ( µ, τ l , τ c ) -approximateﬁrst-order Fritz John point.Proof First note, (cid:107) S − d s (cid:107) ≤ (cid:115) − M ψ µ x ( d x ) µ ≤ (cid:114) rτ l (cid:112) (cid:107) y (cid:107) = (cid:18) τ l η x µ L (cid:19) / ≤ (cid:32) (cid:18) τ l µL (cid:19) / (cid:33) / ≤ (cid:18) τ l µL (cid:19) / where the ﬁrst transition uses (21), the second transition uses our assumed bound on M ψ µ x ( d x ),the third transition uses r = η x (cid:113) µL ( (cid:107) y (cid:107) +1) , and the fourth transition uses η x ∈ (0 , ( µτ l L ) / ].Let us compute κ from Lemma 3, (cid:107) S − d s (cid:107) + L (cid:107) y (cid:107) (cid:107) d x (cid:107) µ ≤ (cid:16) µτ l L (cid:17) / + η x ≤ = κ . Itfollows that Convex { x, x + } ⊆ X .By Lemma 4 and the fact y = µS − we have (cid:107) Y − d y (cid:107) ≤ (cid:107) S − d s (cid:107) ≤ (cid:16) µτ l L (cid:17) / . There-fore, (cid:107) δd x − ∇ x L ( x + , y + ) (cid:107) ≤ L (cid:107) d x (cid:107) (cid:107) y (cid:107) (cid:107) Y − d y (cid:107) + L (cid:107) y (cid:107) + 1) (cid:107) d x (cid:107) ≤ L (cid:107) d x (cid:107) (cid:107) y (cid:107) (cid:107) Y − d y (cid:107) + L / µ − / τ / l (cid:107) y (cid:107) + 1) (cid:107) d x (cid:107) ≤ τ l µ (cid:112) (cid:107) y (cid:107) (cid:18) µτ l L (cid:19) − / η x + L / µ / τ / l η x ≤ τ l µ (cid:112) (cid:107) y (cid:107)

10 + µτ l ≤ µτ l (cid:112) (cid:107) y (cid:107) where the ﬁrst inequality follows from Lemma 5, the second by L µL τ l ∈ (0 , (cid:107) d x (cid:107) ≤ η x (cid:113) µL ( (cid:107) y (cid:107) +1) and (cid:107) Y − d y (cid:107) ≤ (cid:16) µτ l L (cid:17) / , and the fourth inequality using η x ∈ (0 , ( µτ l L ) / ].Now, by (12) there exists some δ ≥ ∇ x L ( x, y ) + ∇ xx L ( x, y ) T d x − d Ty ∇ x a ( x ) = δd x . Moreover, − τ l µr √ (cid:107) y (cid:107) ≤ M ψ µ x ( u ) ≤ − δr by (18a) which implies δ (cid:107) d x (cid:107) ≤ δr ≤ τ l µr (cid:112) (cid:107) y (cid:107) r = 23 τ l µ (cid:112) (cid:107) y (cid:107) . Therefore using the bounds on δ (cid:107) d x (cid:107) and (cid:107) δd x − ∇ x L ( x + , y + ) (cid:107) that we proved, (cid:107) ∇ x L ( x + , y + ) (cid:107) ≤ (cid:107) δd x − ∇ x L ( x + , y + ) (cid:107) + δ (cid:107) d x (cid:107) ≤ τ l µ (cid:112) (cid:107) y (cid:107) + τ l µ (cid:112) (cid:107) y (cid:107) ≤ τ l µ (cid:112) (cid:107) y (cid:107) . his shows (FJ1.c) holds. It remains to show (FJ1.a) and (FJ1.b). From Lemma 4 we get (cid:107) S + y + − µ (cid:107) ≤ µ (cid:107) S − d s (cid:107) (cid:107) Y − d y (cid:107) + L (cid:107) y (cid:107) (1 + (cid:107) Y − d y (cid:107) ) (cid:107) d x (cid:107) ≤ µ (cid:18) µτ l L (cid:19) / + µη x ≤ µ (cid:18) µτ l L (cid:19) / ≤ µτ c , where the second inequality uses (cid:107) Y − d y (cid:107) ≤ (cid:107) S − d s (cid:107) ≤ (cid:16) µτ l L (cid:17) / ≤ r , the third inequality uses η x ∈ (0 , ( µτ l L ) / ] and τ l µL ∈ (0 , (cid:3) D.3 Proof of Lemma 10

Lemma 10.

Suppose Assumption 2 and 4 hold (Lipschitz derivatives, and suﬃciently small µ ).Let f be convex and each a i concave. Then Trust-IPM ( f, a, µ, τ l , L , η s , η x , x (0) ) with x (0) ∈ X and η x = θ (cid:18) τ l µL (cid:19) / η s = θ (cid:18) τ l µL (cid:19) / θ = 1 / , ( η -2) takes at most O (cid:32) ψ µ ( x (0) ) − ψ ∗ µ µ (cid:18) L τ l µ (cid:19) / (cid:33) iterations to terminate with a ( µ, τ l , τ c ) -approximate ﬁrst-order Fritz John point ( x + , y + ) , i.e., (FJ1) holds.Proof Let x ∈ X be some iterate of the algorithm with corresponding direction d x . If M ψ µ x ( d x ) ≥ − τ l µr √ (cid:107) y (cid:107) then the algorithm terminates at the next iteration by Lemma 14.Therefore consider the case that − τ l µr √ (cid:107) y (cid:107) < M ψ µ x ( d x ). By Lemma 13 we have x + ∈ X .Furthermore, ψ µ ( x + ) − ψ µ ( x ) ≤ µθ (cid:18) τ l µL (cid:19) / + max (cid:40) M ψ µ x ( d x ) , − θ µ (cid:18) τ l µL (cid:19) / (cid:41) ≤ µθ (cid:18) τ l µL (cid:19) / max (cid:26) θ − , θ − θ (cid:27) = − µ (cid:18) τ l µL (cid:19) / M ψ µ x ( d x ) ≥ − τ l µr √ (cid:107) y (cid:107) = − µθ (cid:16) τ l µL (cid:17) / , and the ﬁnal inequality comes from substituting θ = 1 / , . . . , K then letting x ( k ) be the k th x iterate, ψ µ ( x (0) ) − ψ ∗ µ ≥ (cid:80) K − k =0 ( ψ µ ( x ( k ) ) − ψ µ ( x ( k +1) )) ≥ Kµ (cid:16) τ l µL (cid:17) / ,rearranging to bound K gives the result. (cid:3) E Proof of results in Section 6

E.1 Proof of Lemma 15

Assumption 9 (Slater’s condition) . Suppose that there exists some

R > such that (cid:107)X (cid:107) ≤ R and there exists some z ∈ X , γ ∈ R ++ such that a ( z ) ≥ γ . Further assume there exists someconstant L > such that (cid:107) ∇ f ( x ) (cid:107) ≤ L for all x ∈ X . urthermore, in order to apply Slater’s condition we need µ to be suﬃciently small: µ ≤ γ τ l R . (48)

Lemma 15.

Suppose that f is convex, a i is concave and Assumption 9 holds. If ( x + , y + ) is aFritz John point (i.e., (FJ1) holds) and (48) holds then (cid:13)(cid:13) y + (cid:13)(cid:13) ≤ mµ + 2 L Rγ .

Proof

Observe, (cid:13)(cid:13) y + (cid:13)(cid:13) ≤ a ( z ) T y + γ ≤ ( a ( x + ) + ∇ a ( x + )( z − x + )) T y + γ ≤ a ( x + ) T y + + (cid:107) ∇ a ( x + ) T y + (cid:107) Rγ ≤ mµ + (cid:0) L + τ l µ (cid:112) (cid:107) y + (cid:107) + 1 (cid:1) Rγ ≤ mµ + L Rγ + 12 (cid:113) (cid:107) y + (cid:107) + 1where the ﬁrst inequality uses Assumption 9 which implies a ( z ) /γ ≥ and that (cid:107) y + (cid:107) = T y + ,the second inequality uses that a i is concave, the third inequality uses (cid:107) X (cid:107) ≤ R , the fourthinequality uses (FJ1) and Assumption 9, and the ﬁfth inequality uses (48). It follows that1 + mµ + L Rγ ≥ (cid:13)(cid:13) y + (cid:13)(cid:13) + 1 − (cid:113) (cid:107) y + (cid:107) + 1 ≥ (cid:0)(cid:13)(cid:13) y + (cid:13)(cid:13) + 1 (cid:1) . (cid:3) E.2 Proof of Lemma 12

Lemma 12.

Let f be convex and each a i concave. Suppose that Assumption 5 holds. Let x (0) ∈ X . Then Annealed-IPM ( f, a, µ (0) , x (0) , (cid:15) ) takes at most (cid:16) O (1) + 6 m × w (cid:16) (cid:15) m (cid:17)(cid:17) log +2 (cid:18) mµ (0) (cid:15) (cid:19) + ψ µ (0) ( x (0) ) − ψ ∗ µ (0) µ (0) w ( µ (0) ) unit operations to return a point x ( k ) ∈ X with f ( x ( k ) ) − f ∗ ≤ (cid:15) .Proof Let J = (cid:108) log +2 (cid:16) mµ (0) (cid:15) (cid:17)(cid:109) . At this point if we apply Lemma 11 with µ = 0, we obtain f ( x ( J ) ) − f ∗ ≤ (cid:107) ∇ x L ( x ( J ) , y ( J ) ) (cid:107) R + m (cid:88) i =1 a i ( x ( J ) ) y ( J ) i ≤ (1+2) µ ( J ) m = 3 µ ( J ) m = 3 µ (0) − J ≤ (cid:15). Hence after J iterations we have found an (cid:15) -optimal solution. By Lemma 11 and Assumption 5, ψ µ ( j ) ( x ( j − ) − ψ ∗ µ ( j ) ≤ (cid:107) ∇ x L ( x ( j − , y ( j − ) (cid:107) R + m (cid:88) i =1 ( a i ( x ( j − ) y ( j − i − µ ( j ) ) ≤ µ ( j − m. Applying using this inequality in Assumption 5 we deduce that the number of unit operationsof each iteration j > O (1) + ψ µ ( j ) ( x ( j − ) − ψ ∗ µ ( j ) µ ( j ) w ( µ ( j ) ) ≤ O (1) + 3 µ ( j − mµ ( j ) w ( µ ( j ) ) ≤ O (1) + 6 m × w ( µ ( j ) ) ≤ O (1) + 6 m × w ( µ ( J ) ) . The second inequality uses µ ( j ) = µ ( j − . The ﬁnal inequality uses that w is monotonedecreasing by Assumption 5. (cid:3) .3 Proof of Theorem 2 Theorem 2.

Annealed-IPM ( f, a, µ (0) , x (0) , (cid:15) ) takes at most O (cid:32)(cid:32) m / (cid:18) L R ζ(cid:15) (cid:19) / + 1 (cid:33) log + (cid:18) mµ (0) (cid:15) (cid:19) + ψ µ (0) ( x (0) ) − ψ ∗ µ (0) µ (0) (cid:18) L R ζm µ (0) (cid:19) / (cid:33) unit operations to return a point x ( k ) ∈ X with f ( x ( k ) ) − f ∗ ≤ (cid:15) .Proof Let j be some iteration of Annealed-IPM . Our ﬁrst goal is to show that (A7. µ (0) ),i.e., µ (0) = min (cid:110) L R ζm , L mRL √ ζ (cid:111) , implies the assumptions of Lemma 10, and Lemma 12 are metat iteration j .Recall that (A7. τ l ) states that τ l = mRζ / . In particular, τ l µL ∈ (0 ,

1] holds with µ = µ ( j ) by µ ≤ µ (0) ≤ L R ζm = L τ l and L µL τ l ∈ (0 ,

1] holds by µ ≤ µ (0) ≤ L mRL √ ζ = L τ l L . Thereforethe assumptions of Lemma 10 are met which implies each iteration of Annealed-IPM willterminate satisfying (FJ1). Therefore, (cid:107) ∇ x L ( x ( j ) , y ( j ) ) (cid:107) ≤ µ ( j ) τ l (cid:113) (cid:107) y ( j ) (cid:107) ≤ µ ( j ) τ l ζ / ≤ mµ ( j ) R where the ﬁrst inequality uses (FJ1), the second inequality uses Assumption 6 and the ﬁnalinequality by (A7. τ l ). Therefore Assumption 5 holds which allows us to apply Lemma 12. Inparticular, w ( µ ( j ) ) = O (cid:32)(cid:18) τ l µ ( j ) L (cid:19) − / (cid:33) = O (cid:32)(cid:18) m µ ( j ) L R ζ (cid:19) − / (cid:33) where the second equality uses τ l = mRζ / . Substituting this into Lemma 12 yields the runtimebound. (cid:3) A two-phase method to ﬁnd unscaled KKT points

F.1 Algorithm 3 deﬁnition

Algorithm 3

Two-phase IPM function

Two-Phase-IPM ( f, a, ε opt , ε inf , L , L , x (0) ) Output:

A status (

KKT if (KKT) holds and

INF if (INF1) holds) and a point ( x, t, y ). Phase-one.

Let µ ( P = ε inf ε opt , τ ( P l = min (cid:110) ε opt , (cid:113) L ε opt ε inf (cid:111) , t (0) = ε opt + max { min i − a i ( x (0) ) , } , and η satisfy ( η -1). if t (0) ≤ ε opt / then x ( P ← x (0) else ( x ( P , t ( P , y ( P , λ ( P , γ ( P ) ← Trust-IPM ( f P , a P , µ ( P , τ ( P l , L , η s , η x , ( x (0) , t (0) )) . if min i a i ( x ( P ) < − ε opt / then ( x, t, y ) ← ( x ( P , t ( P , y ( P / (cid:107) y ( P (cid:107) ) . return INF , ( x, t, y ) end ifend ifPhase-two. Let µ ( P = ε opt , τ ( P l = (cid:113) ε inf L +1) , and η satisfy ( η -1).( x ( P , y ( P ) ← Trust-IPM ( f, a P , µ ( P , τ ( P l , L , η s , η x , x ( P ) . if (cid:107) y ( P (cid:107) > /ε inf then ( x, t, y ) ← ( x ( P , ε opt , y ( P / (cid:107) y ( P (cid:107) ) return INF , ( x, t, y ) else ( x, t, y ) ← ( x ( P , ∅ , y ( P ) . return KKT , ( x, t, y ) end ifend function F.2 Proof of Claim 3

Claim 3.

2] represent the set of feasible values of t in phase-one. Now, ψ P µ ( P ( x (0) , t (0) ) − inf ( x,t ) ∈ ˜ X ( P × T ψ P µ ( P ( x, t )= sup ( x,t ) ∈ ˜ X ( P × T t (0) − t + µ ( P (cid:32) log (cid:18) tt (0) (cid:19) + log (cid:32) ε opt + t (0) − t ε opt (cid:33) + m (cid:88) i =1 log (cid:18) t − a i ( x ) t (0) − a i ( x (0) ) (cid:19)(cid:33) = O (cid:16) min i max {− a i ( x (0) ) , } + µ ( P m log + ( c/ε opt ) (cid:17) = O (∆ a )where the second transition uses that 0 ≤ t ≤ t (0) + ε opt / i max {− a i ( x (0) ) , } + ε opt and(24a), and the last transition uses µ ( P = ε inf ε opt = O ( ε opt ), ε opt ∈ (cid:16) , m log + ( c/ε opt ) (cid:105) and∆ a ≥ µ ( P = ε opt / ε opt ∈ (cid:16) , m log + ( c/ε opt ) (cid:105) and ∆ f ≥ ψ P µ ( P ( x ( P ) − inf x ∈ ˜ X ( P ψ P µ ( P ( x ) = O (cid:18) f ( x ( P ) − inf x ∈ ˜ X ( P f ( x ) + µ ( P m log + ( c/ε opt ) (cid:19) = O (∆ f ) . Recall Theorem 1 gives a bound on the iteration count of

Trust-IPM of O (cid:18) ψ µ ( x (0) ) − ψ ∗ µ µ (cid:16) L µτ l (cid:17) / (cid:19) .Substituting the appropriate values of τ l and µ from Algorithm 3 yields a bound of O (cid:32) a (cid:32) L / ε / inf ε / opt + 1 ε inf ε opt (cid:33) + ∆ f ε opt (cid:18) L L ε opt ε inf (cid:19) / (cid:33) trust region subproblem solves for Two-Phase-IPM .It remains to show either (KKT) or (INF1) is satisﬁed. Observe that after calling

Trust-IPM in phase-one we ﬁnd a point satisfying the Fritz John conditions for the problem of minimizingthe inﬁnity norm of the constraint violation, i.e., (cid:13)(cid:13)(cid:13)(cid:13)(cid:18) ∇ a ( x ( P ) T y ( P T y ( P − λ ( P − γ ( P (cid:19)(cid:13)(cid:13)(cid:13)(cid:13) ≤ ε inf (cid:118)(cid:117)(cid:117)(cid:117)(cid:116)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) y ( P λ ( P γ ( P (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) + 1 (49) a ( x ( P ) + t ( P ≥ (50)0 ≤ t ( P ≤ t (0) + ε opt / a i ( x ( P ) + t ( P ) y ( P i ≤ ε inf ε opt (52) t ( P λ ( P ≤ ε inf ε opt (53) (cid:16) ε opt t (0) − t ( P (cid:17) γ ( P ≤ ε inf ε opt (54) y ( P , λ ( P , γ ( P ≥ . (55)Consider the case that in phase-one the status is INF , in which case min i a i ( x ( P ) < − ε opt / t ( P > ε opt / t ( P > ε opt / λ ( P < ε inf .Therefore using (49), ε inf ∈ (0 ,

1] and we deduce (cid:13)(cid:13)(cid:13)(cid:13)(cid:18) ∇ a ( x ( P ) T y ( P T y ( P − − γ ( P (cid:19)(cid:13)(cid:13)(cid:13)(cid:13) ≤ ε inf (cid:114) (cid:107) ( y ( P (cid:107) + ε inf ≤ ε inf (cid:18)(cid:113) (cid:107) ( y ( P (cid:107) + 2 (cid:19) . (56) f (cid:107) y ( P (cid:107) < / / < γ ( P + 1 − T y ( P ≤ ε inf / ≤ /

2. Bycontradiction (cid:107) y ( P (cid:107) ≥ /

2. Using (cid:107) y ( P (cid:107) ≥ /

2, (56), and (52) we deduce (cid:107) ∇ a ( x ( P ) T y ( P (cid:107) (cid:107) y ( P (cid:107) ≤ ε inf ( a i ( x ( P ) + t ( P ) y ( P i (cid:107) y ( P (cid:107) ≤ ε inf ε opt . Observe that after calling

Trust-IPM in phase-two we ﬁnd a point satisfying a ( x ( P ) > − ε opt y ( P i ( a i ( x ( P ) + ε opt ) ≤ ε opt ∀ i ∈ { , . . . , m } (cid:13)(cid:13)(cid:13) ∇ x L ( x ( P , y ( P ) (cid:13)(cid:13)(cid:13) ≤ ε opt (cid:114) ε inf L + 1) (cid:113)(cid:13)(cid:13) y ( P (cid:13)(cid:13) + 1 y ( P > . If (cid:107) y ( P (cid:107) < ε opt ε inf + L ε inf then using the fact that ε opt ∈ (0 , ε opt ∈ (0 , √ ε inf ] and L ≥ (cid:13)(cid:13)(cid:13) ∇ x L ( x ( P , y ( P ) (cid:13)(cid:13)(cid:13) ≤ ε opt (cid:115) ε inf L (cid:18) ε opt ε inf + 3 L ε inf (cid:19) + 1 ≤ ε opt (cid:107) y ( P (cid:107) ≥ ε opt ε inf + L ε inf then (cid:107) ∇ a ( x ( P ) T y ( P (cid:107) (cid:107) y ( P (cid:107) ≤ (cid:13)(cid:13) ∇ x L ( x ( P , y ( P ) (cid:13)(cid:13) + (cid:13)(cid:13) ∇ f ( x ( P ) (cid:13)(cid:13) (cid:107) y ( P (cid:107) ≤ ε opt (cid:107) y ( P (cid:107) / + ε opt (cid:107) y ( P (cid:107) + L (cid:107) y ( P (cid:107) ≤ ε inf and ( a i ( x ( P ) + ε opt ) y ( P i (cid:107) y ( P (cid:107) ≤ ε inf ε opt . Finally note that since y ( P i ( a i ( x ( P ) + ε opt ) ≤ ε opt and (cid:107) y ( P (cid:107) ≥ ε opt ε inf + L ε inf ≥ m wededuce min i a i ( x ( P ) ≤ ε opt min i (cid:18) y ( P i − (cid:19) ≤ − ε opt /

2. Hence (INF1) is satisﬁed with( x, t, y ) = (cid:16) x ( P , ε opt , y ( P (cid:107) y ( P (cid:107) (cid:17) . (cid:3)(cid:3)