Worst-case iteration bounds for log barrier methods for problems with nonconvex constraints
WWorst-case iteration bounds for log barrier methods for problemswith nonconvex constraints
Oliver Hinder Yinyu Ye { ohinder,yyye } @stanford.edu Abstract
Interior point methods (IPMs) such as IPOPT, KNITRO and LOQO that handle nonconvexconstraints have had enormous practical success. We consider IPMs in the setting where theobjective and constraints have Lipschitz first and second derivatives. Unfortunately, previousanalyses of log barrier methods in this setting implicitly prove guarantees with exponentialdependencies on 1 /µ , where µ is the barrier penalty parameter. We provide an IPM that finds a µ -approximate Fritz John point by solving O ( µ − / ) trust-region subproblems. For this setup,the results represent both the first iteration bound with a polynomial dependence on 1 /µ for alog barrier method and the best-known guarantee for finding Fritz John points. We also showthat, given convexity and regularity conditions, our algorithm finds an (cid:15) -optimal solution in atmost O (cid:0) (cid:15) − / (cid:1) trust-region steps. We are concerned with the problemminimize x ∈ R n f ( x ) such that a ( x ) ≥ , where f : R n → R and a : R n → R m have Lipschitz continuous first and second deriva-tives. The worst-case runtime to find a global optimum to this problem is exponential in thedesired accuracy [25], so instead we seek a Fritz John point [17], a necessary condition for localoptimality, defined as a point ( x, y, t ) ∈ R n × R m × R satisfying t, a ( x ) , y ≥ (1a) y i a i ( x ) = 0 ∀ i ∈ { , . . . , m } (1b) t ∇ f ( x ) − ∇ a ( x ) T y = , (1c)where y is the vector of dual variables, t is a scalar that is equal to one in the KKT conditions,and ( y, t ) (cid:54) = . When the Mangasarian-Fromovitz constraint qualification [20] holds all FritzJohn points are KKT points. Since it is not possible to find an exact Fritz John point, werequire a notion of an approximate Fritz John point. One natural definition of an approximateFritz John point is a ( x ) , y ≥ y i a i ( x ) ≤ µ ∀ i ∈ { , . . . , m }(cid:107) ∇ x L ( x, y ) (cid:107) ≤ µ ( (cid:107) y (cid:107) + 1) , where the Lagrangian is L ( x, y ) := f ( x ) − y T a ( x ), and µ ≥ µ desirable. Our interior point method returns point a r X i v : . [ m a t h . O C ] J un atisfying a slightly stronger condition: a ( x ) , y > (3a) | y i a i ( x ) − µ | ≤ µ/ ∀ i ∈ { , . . . , m } (3b) (cid:107) ∇ x L ( x, y ) (cid:107) ≤ µ (cid:113) (cid:107) y (cid:107) + 1 , (3c)with µ > ψ µ ( x ) := f ( x ) − µ m (cid:88) i =1 log( a i ( x )) (4)with some parameter µ >
0, and start from a strictly feasible point. The log barrier penalizespoints too close to the boundary, enabling the use of unconstrained methods to solve a con-strained problem. Typically, if f and each a i were linear we would apply Newton’s method tothe log barrier. However, since we allow a i to be nonlinear, ∇ ψ µ could be singular or indefinite.To avoid this issue, we use a trust region method to generate our search directions: d x ∈ argmin u ∈ B r ( ) M ψ µ x ( u )with M ψ µ x ( u ) := 12 u T ∇ ψ µ ( x ) u + ∇ ψ µ ( x ) T u B r ( v ) := { x ∈ R n : (cid:107) x − v (cid:107) ≤ r } . The function M ψ µ x ( u ) is a second-order Taylor series local approximation to ψ µ ( x ) at x . Itpredicts how much ψ µ changes as we move from x to x + u . Our algorithm changes the radius r to scale inversely proportional to the size of the current dual iterates.We now give a brief overview of our results; for cleanliness we omit Lipschitz constants,dependence on the number of constraints m , and higher-order terms. Our main results assumethat we are given a feasible starting point, i.e., x (0) ∈ X := { x ∈ R n : a ( x ) > } . This assumption is removed in Section 7, where we use a two-phase algorithm: phase-oneminimizes the constraint violation to obtain a feasible point, then phase-two minimizes theobjective subject to the constraints. We also assume that a i and f are continuous functions on R n with Lipschitz first and second derivatives on the set X .Our first main result is Theorem 1, which states that after at most O (cid:0) µ − / (cid:1) trust regionsubproblem solves we find a µ -approximate Fritz John point, i.e., a point satisfying (3). Oursecond main result is Theorem 2 which additionally assumes that the constraints are concavefunctions (implying the feasible region is convex) and that certain regularity conditions holdto ensures that Fritz John points are KKT points. Under these assumptions Theorem 2 statesthat after at most O ( (cid:15) − / ) trust region subproblem solves we find an (cid:15) -optimal solution, i.e., apoint x with f ( x ) − inf z ∈X f ( z ) ≤ (cid:15) .We proceed as follows. The remainder of the introduction provides notation and overviewsrelated work. Section 2 analyzes gradient descent applied to the log barrier and explains whyprevious analyses implicitly prove iteration bounds with exponential dependencies on 1 /µ . Sec-tion 3 introduces our main algorithm, a trust region IPM. Section 4 gives a series of usefullemmas for the analysis. Section 5 proves Theorem 1 and Section 6 proves Theorem 2. Sec-tion 7 compares the iteration bounds of our IPM with existing iteration bounds for problemswith nonconvex constraints [3, 8, 9]. .1 Notation Let diag ( v ) be a diagonal matrix with entries composed of the vector v . Let R denote the setof real numbers, R + the set of nonnegative real numbers and R ++ the set of strictly positivereal numbers. Let Convex { x, y } = { αx + (1 − α ) y : α ∈ [0 , } . Let λ min ( · ) denote theminimum eigenvalue of a matrix. Let ψ ∗ µ = inf x ∈X ψ µ ( x ). Unless otherwise specified, log( · ) isthe natural logarithm. For a function g : R → R we let g ( p ) ( θ ) denote any function such that g ( p ) ( θ ) = ∂ p g ( θ ) ∂θ p .During this paper we assume some of the derivatives of f : R n → R and a : R n → R m areLipschitz. For this paper, the definition of a function being Lipschitz is given as follows. Definition 1.
Let L p ∈ (0 , ∞ ) be a constant and p a nonnegative integer.A univariate function g : R → R has L p -Lipschitz p th derivatives on a set S ⊆ R if for all θ ∈ S function is p + 1 order differentiable with (cid:12)(cid:12) g ( p +1) ( θ ) (cid:12)(cid:12) ≤ L p .A multivariate function w : R n → R has L p -Lipschitz p th derivatives on a set S ⊆ R n iffor any x ∈ S and v ∈ B ( ) the univariate function g : R → R defined by g ( θ ) := w ( x + vθ ) is L p -Lipschitz on the set { θ : x + vθ ∈ S } . We remark that this definition is slightly less general than standard definition. The standarddefinition is that a univariate function g : R → R has L p -Lipschitz p th derivatives on a set S ⊆ R , if for any [ θ , θ ] ⊆ S we have (cid:12)(cid:12) g ( p ) ( θ ) − g ( p ) ( θ ) (cid:12)(cid:12) ≤ L p (cid:12)(cid:12) θ − θ (cid:12)(cid:12) . This is equivalentDefinition 1 when g is p + 1 order differentiable on the set S . We decided to use Definition 1because it simplifies the proofs. However, it is also possible to prove our results using thestandard definition.Taylor’s theorem states that given a one-dimensional function g : R → R with L p -Lipschitz p th derivatives on the interval [0 , θ ] then for all q ∈ { , . . . , p } one has (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) p − q (cid:88) i =0 θ i g ( q + i ) (0) i ! − g ( q ) ( θ ) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ≤ L p | θ | p − q (1 + p − q )! . (5)See [33, Theorem 50.3] for a proof of the remainder version of this theorem with q = 0. Toextend this theorem to q > h ( θ ) := g ( q ) ( θ ).We often refer to the function a : R n → R m as having L p -Lipschitz p th derivatives. By thiswe mean that each component function a i has L p -Lipschitz p th derivatives. Finally, the matrix ∇ a ( x ) is the m × n Jacobian of a ( x ). The practical performance of IPMs is excellent for linear [22], conic [36], general convex [1], andnonconvex optimization [5, 38, 41]. Moreover, the theoretical performance of IPMs for linear[18, 32, 42, 46, 47] and conic [28] optimization is well studied. The main theoretical resultin this area is that it takes at most O ( √ c log(1 /(cid:15) )) iterations to find an (cid:15) -global minimum,where c is the self-concordance parameter (e.g., c = m + n for linear programming). Each IPMiteration consists of a Newton step, i.e., one linear system solve, applied to an unconstrainedoptimization problem. Unfortunately, this approach only works for convex cones with tractableself-concordant barriers.While self-concordance theory is designed for structured convex problems, there is a richliterature on the minimization of general blackbox unconstrained objectives, particularly ifthe objective is convex [25, 26]. Here we briefly review results in nonconvex optimization.In unconstrained nonconvex optimization, the measure of local optimality is usually whether (cid:107)∇ f ( x ) (cid:107) ≤ µ , known as a µ -approximate stationary point. A fundamental result is that gra-dient descent needs at most O ( µ − ) iterations to find an µ -approximate stationary point if thefunction f : R n → R has Lipschitz continuous first derivatives. Nesterov and Polyak [29] showthat the iteration guarantee of cubic regularized Newton is O ( µ − / ) for finding µ -approximatestationary points. The same iteration bound can be extended to trust region methods [13, 45]. hese O ( µ − ) and O ( µ − / ) iteration bounds match the blackbox lower bounds for functionswith Lipschitz continuous first and second derivatives respectively [6, 7].However, there is relatively little theory studying nonconvex optimization with constraints.Important contributions in this area include the work of Ye [44], Bian et al. [2], Haeser et al. [15],who consider an affine scaling technique for general objectives with linear inequality constraints,i.e., a i are linear. At each iteration they solve problems of the form d x ∈ argmin u ∈ R n : (cid:107) S − ∇ a ( x ) u (cid:107) ≤ r M ψ µ x ( u ) (6)with S = diag ( a ( x )). In this context, Haeser et al. [15] give an algorithm with an O ( µ − / )iteration bound for finding KKT points. This work is pertinent to ours, but the addition ofnonconvex constraints and the use of a trust region method instead of affine scaling distinguishour work.Our motivation is to understand the performance of practical interior point methods, mostof which tend to use an approach similar to ours. To see this relationship, observe that if we areat a feasible solution and set the dual variables to exactly satisfy perturbed complementarity( y = µS − ) then LOQO [38] and the one-phase IPM [16] both generate directions of the form d x ∈ argmin u ∈ R n M ψ µ x ( u ) + δ (cid:107) u (cid:107) (7)for some δ > ∇ ψ µ ( x ) + δ I (cid:31)
0. There is a well-known duality between thismodified Newton approach and the trust-region approach. In particular, for any δ > r > d x ∈ argmin u ∈ B r ( x ) M ψ µ x ( u ) . (8)The reverse statement holds except in the hard case [30, Chapter 4]. Therefore, our algorithmcan be viewed as an extremely simplified variant of LOQO, the one-phase IPM, or IPOPT [41].There are major differences between our approach and practical methods: we ignore feasibilityissues, our method is not primal-dual, we use a trust-region instead of adding δ I to the Hessian,and our algorithm require knowledge of Lipschitz constants. However, these differences shouldbe viewed in context of our goal: to develop a prototypical algorithm that captures the essenceof nonconvex interior point methods.While there has been theoretical work studying these practically successful log barrier meth-ods with nonconvex constraints, most of this work tends to show only that the method eventuallyconverges [4, 10, 11, 14, 16, 40] without giving explicit iteration bounds, or focuses on super-linear convergence in regions close to local optima [37, 39]. However, there has been analysisof other methods for optimization with nonconvex constraints using methods other than IPMs[3, 8, 9]. We compare with these results in Section 7.There is a vast body of literature analyzing the convergence of unconstrained optimizationmethods on self-concordant functions or functions with Lipschitz derivatives. Unfortunately,with general constraints one cannot assume that the log barrier is self-concordant nor that thederivatives are Lipschitz (even if the derivatives of the constraints are Lipschitz). Therefore wedevelop a new approach. To help the reader understand the crux of this problem, we begin byanalyzing the worst-case performance of gradient descent on the log barrier. This section explains how a naive theoretical analysis, which is often used to analyze IPMwith nonlinear constraints, can give iteration bounds for gradient descent applied to the logbarrier with exponential dependencies on 1 /µ . At the end of this section we explain how tofix the analysis to obtain iteration bounds with polynomial dependencies on 1 /µ . Hence theexponential iteration bounds are a flaw of the analysis—not the algorithm. The goal of this ection is to get the reader into the correct mindset for analyzing the more challenging trustregion IPM that is the focus of this paper.The log barrier does not have Lipschitz continuous derivatives. However, typical analysis ofinterior point methods in the nonlinear programming community is as follows: A. Observe that if we apply a descent method to the log barrier, all iterates remain in the set S := { x ∈ R n : ψ µ ( x ) ≤ ψ µ ( x (0) ) } , where x (0) is the starting point. B. Show that the p th derivatives of ψ µ are L p -Lipschitz continuous on the set S . This isusually done by arguing that if x ∈ S then a i ( x ) ≥ inf x min i a i ( x ) = ε >
0. The resultfollows from the fact that log( θ ) has (1 /ε )-Lipschitz continuous derivatives on the set { θ : θ ≥ ε } and using the assumption that the objective and constraints have Lipschitzderivatives. C. Prove that for sufficiently small steps the line segment between the current and new iteratesremains in S . Apply generic bounds from cubic regularization/gradient descent to givethe iteration bounds.For examples of this style of analysis, see [4, 10, 11, 16]. Turning this into a polynomialbound on 1 /µ requires showing that the constant L p is a polynomial function of the desiredtolerance. However, L p is exponentially large in µ because L p is proportional to 1 /ε and thelower bound on ε can be exponentially small in µ . This can occur even when the constraintsare linear. For example, consider the log barrier arising from the linear program min x s.t.0 ≤ x ≤ ψ µ ( x ) := x − µ (log( x ) + log(2 − x ))with µ ∈ (0 , x (0) = 1. We show that under these assumptions the Lipschitzconstants for the first and second derivatives are exponentially large in 1 /µ on the set S := { x ∈ R n : ψ µ ( x ) ≤ ψ µ ( x (0) ) } . Observe that ψ µ ( x (0) ) = 1 and at the point x = exp( − /µ ) ∈ S we have ∇ ψ µ ( x ) = µ (cid:16) x + − x ) (cid:17) ≥ µ exp(2 /µ ) and ∇ ψ µ ( x ) = 2 µ (cid:16) − x + − x ) (cid:17) ≤ − µ exp(3 /µ ).This is illustrated in Figure 1.The methods [4, 10, 11, 16] that employ the (A)-(C) argument use a line search to choosetheir step size rather than use a fixed step size. Line search methods have many benefits overconstant step size methods, including removing the need to do hyperparameter searches overLipschitz constants and converging faster in practice. However, the (A)-(C) argument where weprove a uniform bound on the Lipschitz constant of ∇ ψ µ is roughly equivalent to proving aniteration bound on a constant step size algorithm and then arguing that an adaptive step sizealgorithm is faster. While in some situations this argument gives a good worst-case iterationbound, there exists problem classes where the worst-case iteration bound of the constant stepsize method is exponentially worse than an adaptive method.Claim 1 gives a simple example of constant step size algorithms having poor theoreticalperformance. In particular, the claim shows gradient descent with a fixed step size α ∈ (0 , ∞ ),i.e., x ( k +1) ← x ( k ) − α ∇ ψ µ ( x ( k ) ) , (9)cannot efficiently minimize a log barrier for all starting points in the set S C := { x ∈ R n : ψ µ ( x ) ≤ ψ ∗ µ + C } . Contrast to a function f with L -Lipschitz gradient where for any startingpoint x (0) ∈ { x : f ( x ) ≤ inf x f ( x ) + C } gradient descent with a constant step size 1 /L uses atmost 2 L C/(cid:15) iterations until (cid:107) ∇ f ( x ( k ) ) (cid:107) ≤ (cid:15) [27]. Claim 1.
Let ψ µ ( x ) := x − µ (log( x ) + log(2 − x )) , µ ∈ (0 , / and C ∈ [2 , ∞ ) . Fix α ∈ (0 , ∞ ) and suppose the x ( k ) satisfies (9) . If x ( k ) remains in the interval [0 , for the startingpoint x (0) = exp( − C/ (2 µ )) ∈ S C , then for the starting point x (0) = 1 ∈ S C and for all k ≤ ( µ/
8) exp( C/ (2 µ )) we have (cid:107) ∇ ψ µ ( x ( k ) ) (cid:107) ≥ µ . The proof appears in Appendix A and involves first arguing that the step size α must betiny; otherwise, if we initialize close to the boundary, i.e., x (0) = exp( − C/ (2 µ )), the iterates x ψ µ ( x ) intial p oint x (0) = 1 O exp − µ Derivatives are moving very quicklyand have exp onetially large Lipshitz constant in µ .Region iterates must lie in: S = { x : ψ µ ( x ) ≤ ψ µ (1) } Figure 1: Why a traditional nonlinear programming analysis of IPMs will not give an iterationbound polynomial in 1 /µ . In this example µ = 0 . will leave the feasible region. On the other hand, given the step size α must be tiny then if weinitialize away from the boundary, i.e., x (0) = 1, the algorithm will converge slowly.An astute reader might observe that Claim 1 is dependent on allowing a starting point close tothe boundary. However, any constant step size algorithm that circumvents this issue must showthat all of its iterates do not get too close to the boundary. This requires an innovation on the(A)-(C) argument. Moreover, the fact that the log barrier does not have Lipschitz continuousderivatives causes the same issues for cubic regularized Newton with a fixed regularizationparameter or trust region methods with a fixed trust region radius. Implicitly when usingthe analysis (A)-(C) we are arguing our algorithm cannot do worse than a constant step sizealgorithm. Unfortunately, as we have seen in Claim 1, even from a purely theoretical standpoint,constant step size algorithms can be poor benchmarks.This is the insight of the polynomial time IPM analysis for linear programming—it circum-vents these issues using the self-concordant properties of − µ log( a ( x )) when a is linear [28].However, the function − µ log( a ( x )) is not self-concordant in general. While we do not expect toobtain an algorithm with a polynomial dependence on log(1 /µ ), can we still obtain an algorithmwith polynomial dependence on the desired tolerance 1 /µ ? Next, we show this possible usinggradient descent with an adaptive step size routine, y ( k ) i ← µa i ( x ( k ) ) ∀ i ∈ { , . . . , m } (10a) d ( k ) x ← − ∇ ψ µ ( x ( k ) ) (10b) x ( k +1) ← x ( k ) + α ( k ) d ( k ) x . (10c)This procedure does not tell us how to choose α ( k ) . One approach is to pick, α ( k ) ← min (cid:40) min i a i ( x ( k ) )2 L (cid:107) d ( k ) x (cid:107) , (cid:96) ( x ( k ) ) (cid:41) (11)where the term min i a i ( x ( k ) )2 L (cid:107) d ( k ) x (cid:107) represents the step size that guarantees a i ( x ( k +1) ) > (cid:96) ( x ) := L (1 + 2 (cid:107) y (cid:107) ) + 4 L (cid:107) y (cid:107) µ with y i = µa i ( x ) for i ∈ { , . . . , m } x ψ µ ( x ) Derivatives are moving slowly.Take large gradient descent steps.Derivatives are moving very quickly.Take small gradient descent steps sizes.
Figure 2: Explanation of adaptive step sizes. represents the ‘local’ Lipschitz constant of ∇ ψ µ at the point x . For this reason, the term 1 /(cid:96) ( x )in (11) ensures the log barrier is reduced sufficiently at each iteration. See Figure 2 explaininghow the step size α ( k ) is small for points close to the boundary and large for points far from theboundary. To prove our results we require the following assumption. Assumption 1. (Lipschitz function and first derivatives) Assume that each a i : R n → R for i ∈ { , . . . , m } is a continuous function on R n . Let L , L ∈ (0 , ∞ ) . Assume that, on the set X , each a i is L -Lipschitz continuous with L -Lipschitz continuous derivatives. Also assumethe first derivatives of f : R n → R are L -Lipschitz continuous on the set X . This assumption that a is a continuous function on R n may seem extraneous but is is neededto ensure there are no discontinuities on the boundary of the feasible region. In particular, ifwe removed this assumption then a function such as a i ( x ) = (cid:40) x > − x ≤ f ( x ) = x would satisfy Assumption 1 with L and L arbitrarily small. Forthis setup, there exists no µ -approximate Fritz John point for µ sufficiently small. To see thisassume there is a µ -approximate Fritz John point ( x, y ) for µ < / x > ∇ a ( x ) = 0, ∇ f ( x ) = 1, a ( x ) = 1 and y ≤ µ by (3b). It follows that 1 ≤ (cid:107) ∇ x L ( x, y ) (cid:107) ≤ µ √ µ < X is a bounded set, and if f and each a i are twicedifferentiable functions on R n then f and a i are Lipschitz functions with Lipschitz first deriva-tives. Of course, this does not give an explicit value for these Lipschitz constants, they couldbe arbitrarily big depending on the functions f and a i .In the following Lemma we justify (11) by proving that the step will remain feasible and (cid:96) ( x ) indeed represents the local Lipschitz constant for ∇ ψ µ . Lemma 1.
Let τ l , µ ∈ (0 , ∞ ) , v ∈ B ( ) , x ∈ X , and g ( θ ) := ψ µ ( x + θv ) . Suppose Assumption 1holds. For all θ ∈ (cid:104) , min i a i ( x )2 L (cid:105) we have a i ( x + θv ) a i ( x ) ∈ [1 / , / for all i = 1 , . . . , m , and (cid:12)(cid:12) g (2) ( θ ) (cid:12)(cid:12) ≤ (cid:96) ( x ) . roof. Define q i ( θ ) := sup ˆ θ ∈ [0 ,θ ] (cid:12)(cid:12)(cid:12) a i ( x + v ˆ θ ) − a i ( x ) (cid:12)(cid:12)(cid:12) for i ∈ { , . . . , m } . Let x + = x + θv .First, we establish a i ( x + ) a i ( x ) ∈ [1 / , / q i ( ϑ ) > a i ( x )2 forsome ϑ ∈ (cid:104) , min i a i ( x )2 L (cid:105) and i ∈ { , . . . , m } . Since a i is continuous it follows q i is continuous, andby the intermediate value theorem there exists some ˜ θ ∈ [0 , ϑ ] such that q i (˜ θ ) ∈ (cid:16) a i ( x )2 , a i ( x ) (cid:17) .Since a i ( x ) is Lipschitz continuous on the set X and a i ( x + ¯ θv ) > θ ∈ (cid:104) , ˜ θ (cid:105) we have (cid:12)(cid:12)(cid:12) a i ( x ) − a i ( x + v ˜ θ ) (cid:12)(cid:12)(cid:12) ≤ q i (˜ θ ) ≤ L ˜ θ ≤ a i ( x )2 contradicting our earlier statement that q i (˜ θ ) > a i ( x )2 . Since a i ( x + ) − a i ( x ) a i ( x ) ≤ ⇒ a i ( x + ) a i ( x ) ≤ / a i ( x ) − a i ( x + ) a i ( x ) ≤ ⇒ a i ( x + ) a i ( x ) ≥ /
2, we haveestablished a i ( x + ) a i ( x ) ∈ [1 / , / ∇ ψ µ ( x + ) = ∇ f ( x + ) + µ (cid:80) mi =1 (cid:16) ∇ a i ( x + ) a i ( x + ) + ∇ a i ( x + ) ∇ a i ( x + ) T a i ( x + ) (cid:17) , a i ( x + ) a i ( x ) ∈ [1 / , / y i = µa i ( x ) it follows that (cid:12)(cid:12)(cid:12) g (2) ( θ ) (cid:12)(cid:12)(cid:12) = (cid:12)(cid:12) v T ∇ ψ µ ( x + ) v (cid:12)(cid:12) ≤ L + µ m (cid:88) i =1 (cid:18) L a i ( x ) + 4 L a i ( x ) (cid:19) = L (1+2 (cid:107) y (cid:107) )+ 4 L (cid:107) y (cid:107) µ = (cid:96) ( x ) . With Lemma 1 in hand we can now prove Claim 2.
Claim 2.
Let τ l , µ ∈ (0 , ∞ ) . Suppose Assumption 1 holds. Let x (0) ∈ X be the initial point.Then there exists some K ≤ ψ µ ( x (0) ) − ψ ∗ µ )(2 L τ − l µ − + L τ − l µ − + L τ − l µ − ) such thatthe procedure (10) with α ( k ) satisfying (11) finds a point ( x ( K ) , y ( K ) ) with (cid:107) ∇ ψ µ ( x ( K ) ) (cid:107) ≤ τ l µ (1 + (cid:107) y ( K ) (cid:107) ) .Proof. Denote x + = x ( k +1) , x = x ( k ) , α = α ( k ) , y = y ( k ) , and d x = d ( k ) x . At each iteration of(10) with (cid:107) ∇ ψ µ ( x ) (cid:107) ≥ τ l µ ( (cid:107) y (cid:107) + 1) we have ψ µ ( x ) − ψ µ ( x + ) ≥ α ∇ ψ µ ( x ) T d x + 12 (cid:96) ( x ) α (cid:107) d x (cid:107) (Lemma 1 and Taylor’s theorem)= α (cid:107) ∇ ψ µ ( x ) (cid:107) (cid:18) − α(cid:96) ( x ) (cid:19) (substituting for d x ) ≥ α (cid:107) ∇ ψ µ ( x ) (cid:107) (using (11)) ≥ min (cid:26) (cid:107) ∇ ψ µ ( x ) (cid:107) min i a i ( x )4 L , (cid:107) ∇ ψ µ ( x ) (cid:107) (cid:96) ( x ) (cid:27) (using (11)) ≥ min (cid:26) τ l µ L , τ l µ (1 + (cid:107) y (cid:107) ) (cid:96) ( x ) (cid:27) (by (cid:107) ∇ ψ µ ( x ) (cid:107) ≥ τ l µ ( (cid:107) y (cid:107) + 1) and a i ( x ) ≥ µ/ (cid:107) y (cid:107) )= min τ l µ L , τ l µ (1 + (cid:107) y (cid:107) ) L (1 + 2 (cid:107) y (cid:107) ) + L (cid:107) y (cid:107) µ (substituting for (cid:96) ( x )) ≥ min (cid:26) τ l µ L , τ l µ L , τ l µ L (cid:27) (simplifying).Therefore if (cid:107) ∇ ψ µ ( x ( k ) ) (cid:107) ≥ τ l µ ( (cid:107) y ( k ) (cid:107) + 1) for k = 0 , . . . , K then ψ µ ( x (0) ) − ψ ∗ µ ≥ K (cid:88) k =0 ( ψ µ ( x ( k ) ) − ψ µ ( x ( k +1) )) ≥ K min (cid:26) τ l µ L , τ l µ L , τ l µ L (cid:27) , rearranging this expression to upper bound K gives the result. his section demonstrated that gradient descent with a constant step sizes applied to thelog barrier requires an number of iterations proportional to µ exp(1 /µ ) to find a Fritz Johnpoint whereas gradient descent with adaptive step sizes requires iterations proportional to µ − .While it is well-known that methods with adaptive step sizes are practically faster than constantstep size methods, most other theoretical results in continuous optimization show no differencebetween the worst-case performance of adaptive and constant step size methods.Finally, we remark that the algorithms in this paper are not practical. For example, theyrequire knowledge of unknown Lipschitz constants to calculate the local Lipschitz constant (cid:96) .Therefore our primary contributions are theoretical. It remains a subject of further inquiry todevelop practical methods with similar worst-case guarantees. One possibility to remove theneed to know Lipschitz constants would be to use a backtracking line search to compute α ( k ) . This section introduces our trust region IPM (Algorithm 1). A naive algorithm we could use is d x ∈ argmin u ∈ B r ( ) M ψ µ x ( u ) x + ← x + d x x ← x + for some fixed constant r ∈ (0 , ∞ ) where x denotes the current iterate and x + the next iterate.If ∇ ψ µ is L -Lipschitz then one can show a convergence to an (cid:15) -approximate stationary pointof ψ µ in O ( L / (cid:15) − / ) iterations [29]. However, as we described in Section 2 this method willstruggle because the log barrier ensures the effective Lipschitz constant of ∇ ψ µ is exponentiallylarge in µ . Instead, as per line 7 of Algorithm 1, we make the trust region radius adaptive tothe size of the dual variables using the formula r ← η x (cid:114) µL (1 + (cid:107) y (cid:107) ) , this ensures that for constant η x ∈ (0 , ∞ ) the trust region radius becomes smaller as the dualvariable size increases. The intuition for this selection of r is similar to the intuition for thestep sizes for gradient descent in Section 2: the value of r shrinks as we get very close to theboundary of the feasible region. This enables the algorithms to adapt to the ‘local’ Lipschitzconstant of the log barrier. The next iterate for our algorithm is selected by α ← min (cid:26) η s (cid:107) S − d s (cid:107) , (cid:27) x + ← x + αd x , where parameters η x , η s ∈ (0 , ∞ ) are problem dependent. More specifically they are choosenusing a formula incorporating the Lipschitz constant and barrier parameter. This formulachanges depending on whether the problem is nonconvex (see Theorem 1) and convex (seeLemma 10). These specific choices guarantee that x + ∈ X and allow us to prove our iterationbounds.The term η s (cid:107) S − d s (cid:107) above encourages small step sizes when the linear approximation ofthe slack variable indicates a large α would cause the algorithm to step outside the feasibleregion. For example, if we were solving a linear program picking η s = 1 / a i ( x + ) > a i ( x ) / > M ψ µ x ( k ) ( d x ) is small we would like to find an approximate Fritz Johnpoint. To do this we need a method for selecting the dual variable y + . An instinctive solutionis to pick y + such that y + = µ ( S + ) − with S + = diag ( a ( x + )), i.e., a typical primal barrier pdate. Unfortunately, using this method it is unclear how to construct efficient bounds on (cid:107) ∇ x L ( x + , y + ) (cid:107) . Instead we pick y + using a typical primal-dual step, i.e, y + ← y + αd y where d y satisfies Sd y + Y d s + Sy = µ with y = µS − and d s = ∇ a ( x ) d x . We remark that because y = µS − this can be simplifiedto y + ← µS − − µS − d s . Hence, Algorithm 1 is a hybrid between a traditional primal-dualmethod and a pure primal method. We believe that one could develop a pure primal-dualversion of our interior method. However, to keep our proofs as simple as possible we decidedto use this hybrid algorithm. To further understand how our algorithm generates its directionnote that d x ∈ argmin u ∈ B r ( ) M ψ µ x ( u ) implies there exists some δ ≥ ∇ ψ µ ( x ) + I δ ) d x = − ∇ ψ µ ( x ) . Using Sd y + Y d s + Sy = µ and substituting d s = ∇ a ( x ) d x into ( ∇ ψ µ ( x ) + I δ ) d x = − ∇ ψ µ ( x )we deduce that ∇ xx L ( x, y ) + δ I − ∇ a ( x ) T ∇ a ( x ) 0 I S Y d x d y d s = − ∇ x L ( x, y )0 Sy − µ . (12)At each iteration the radius r is selected sufficiently small such that the error on the Taylorseries approximations are small, i.e., (cid:12)(cid:12) ψ µ ( x ) + M ψ µ x ( d x ) − ψ µ ( x + ) (cid:12)(cid:12) ≈ | a i ( x ) + ∇ a i ( x ) d x − a i ( x + d x ) | a i ( x ) ≈ (cid:107) ∇ x L ( x, y ) − ∇ x L ( x + d x , y + d y ) − δd x (cid:107) ≈ . Suppose these Taylor series errors are sufficiently small and the step x + d x is feasible. Then,when δ ≈ x + , y + ) will be an approximate Fritz John point and when δ (cid:29) α δ (cid:107) d x (cid:107) (see Lemma 6). However, thepoint x + d x need not be feasible. For example, if we were solving a linear program each of theseterms would be zero but x + d x could still be infeasible. Therefore we wish to select α smallenough that we remain feasible, this motivates our formula for α in Algorithm 1.Algorithm 1 terminates when it reaches an approximate second-order Fritz John point whichis defined by (FJ1) and (FJ2). Definition 2. A ( µ, τ l , τ c ) -approximate first-order Fritz John point is a point ( x, y ) defined by a ( x ) , y > (FJ1.a) | y i a i ( x ) − µ | ≤ τ c µ ∀ i ∈ { , . . . , m } (FJ1.b) (cid:107) ∇ x L ( x, y ) (cid:107) ≤ τ l µ (cid:113) (cid:107) y (cid:107) + 1 . (FJ1.c)One should interpret (FJ1) thinking of µ ∈ (0 , ∞ ) becoming arbitrarily small, and τ l ∈ (0 , ∞ )as a fixed constant which allows us to trade off how small we want (cid:107) ∇ x L ( x, y ) (cid:107) relative to y i a i ( x ). Additionally, τ c ∈ (0 ,
1) is a fixed constant defines how tightly we want perturbedcomplementarity to hold.
Definition 3. A ( µ, τ l , τ c ) -approximate second-order Fritz John point ( x, y ) satisfiesequation (FJ1) and ∇ ψ µ ( x ) (cid:23) −√ τ l (1 + (cid:107) y (cid:107) ) I . (FJ2) lgorithm 1 Adaptive trust region interior point algorithm with fixed µ function Trust-IPM ( f, a, µ, τ l , L , η s , η x , x (0) ) Input: ∇ f and ∇ a are L -Lipschitz. The parameters η s ∈ (0 , η x ∈ (0 ,
1) are selectedusing different formulas depending on whether the problem is convex or nonconvex. Always x (0) ∈ X . x ← x (0) for k = 0 , . . . , ∞ do S ← diag ( a ( x )) y ← µS − (cid:46) Primal update of dual variables. r ← η x (cid:113) µL (1+ (cid:107) y (cid:107) ) (cid:46) Trust region radius gets smaller as the dual variables get larger. ( d x , d s , d y ) ← Trust-region-direction ( f, a, µ, x, r ) α ← min (cid:110) η s (cid:107) S − d s (cid:107) , (cid:111) (cid:46) Pick a step size α ∈ (0 ,
1] to guarantee x + ∈ X . x + ← x + αd x y + ← y + αd y if ( x + , y + ) satisfies (FJ1) and (FJ2) then return ( x + , y + ) (cid:46) Termination criterion met. else x ← x + (cid:46) Only update primal variables, throw away new dual variable y + . end if end for end function function Trust-region-direction ( f, a, µ, x, r ) d x ∈ argmin u ∈ B r ( ) M ψ µ x ( u ) d s ← ∇ a ( x ) d x S ← diag ( a ( x )) d y ← − µS − d s return ( d x , d s , d y ) end function Note that (FJ1.b) and (FJ2) imply ∇ xx L ( x, y ) + µ ∇ a ( x ) T S − ∇ a ( x ) (cid:23) − ( √ τ l + L τ c ) (1 + (cid:107) y (cid:107) ) I with S = diag ( a ( x )). For some threshold ε >
0, let A ε = { i : a i ( x ) < ε } represent the set ofapproximately active constraints, then for all u ∈ { u ∈ B ( ) : ∇ a i ( x ) T u = 0 ∀ i ∈ A ε } wehave u T ∇ xx L ( x, y ) u ≥ − L ε − µ − ( √ τ l + L τ c ) (1 + (cid:107) y (cid:107) ) . Note that if we select ε = ω ( √ µ ) and let µ, τ l , τ c → ∇ xx L ( x, y ) is positive semidefinite projected onto the nullspace of the Jacobian of the activeconstraints. See [30, Section 12.4] for an explanation of the second-order necessary conditions.Algorithm 1 keeps µ fixed since for our nonconvex results given in Theorem 1, a fixed µ suffices. Practically log barrier methods solve a sequence of problems with decreasing µ .However, Algorithm 2 which is a specialized algorithm for the convex case, solves a sequence ofproblems with decreasing µ .We have omitted the details on how to solve the trust-region subproblems. One issue is thatthe matrix ∇ ψ µ ( x ) and vector ∇ ψ µ ( x ) may contain components that are exponentially large n 1 /µ . While we omit details of this issue from the paper, this can be resolved using the resultsof [43] which show one requires O (log log(1 /(cid:15) )) linear systems solves to solve one trust regionproblem.We remark that this paper provided intuition for the design of our practical one-phase IPMcode [16]. The stabilization steps of the one-phase IPM, where one attempts to minimize a logbarrier, is most strongly related to Trust-IPM . Similarities during these stabilization stepsinclude: A. Maintaining iterates that are exactly feasible using nonlinear slack variable updates ( s + = a ( x + )). B. Adaptive step size and trust region/regularization parameter choice.There are significant differences between the algorithms. In contrast to
Trust-IPM the one-phase IPM is a primal-dual IPM, does not need a strictly feasible initial point, and does notneed to know any Lipschitz constants. Since the one-phase IPM [16] does not have a worst-caseiteration bound and the algorithms presented in this paper are not practical, it remains an openproblem to develop a practical IPM with a polynomial worst-case iteration dependence on 1 /µ . We develop some useful Lemmas in Section 4.1 to predict the quality of our local approximationsas a function of the direction sizes. In Section 4.2, we prove a key lemma, which bounds thedirections size in terms of predicted progress. To prove our main results we need the followingassumption.
Assumption 2. (Lipschitz derivatives) Assume that each a i : R n → R for i ∈ { , . . . , m } is acontinuous function on R n . Let L , L ∈ (0 , ∞ ) . The functions f : R n → R and a i : R n → R have L -Lipschitz first derivatives and L -Lipschitz second derivatives on the set X . In this section, as a function of the direction sizes (cid:107) d x (cid:107) , (cid:107) Y − d y (cid:107) and (cid:107) S − d s (cid:107) , we boundthe following. Recall that x + and y + are the next iterates given by Algorithm 1. A. The gap between the predicted reduction and the actual reduction of the log barrier(Lemma 3). This allows us to convert predicted reduction M ψ µ x ( d x ) into a reductionin the log barrier. B. Perturbed complementarity (cid:12)(cid:12) a i ( x + ) T y + i − µ (cid:12)(cid:12) (Lemma 4). This allows us to establish when(FJ1.b) holds. C. The norm of the gradient of the Lagrangian (Lemma 5). This allows us to establish when(FJ1.c) holds. Therefore Lemma 4 and 5 allow us to reason about when we are at anapproximate Fritz John point.
Lemma 2.
Suppose the function g : R → R has L -Lipschitz first derivatives and L -Lipschitzsecond derivatives on the set [0 , θ ] where θ ∈ R + . Further assume g (0) > , β ∈ (0 , / , and theinequality | θg (cid:48) (0) | g (0) + L θ g (0) ≤ β holds. Then g ( θ ) g (0) ∈ [ , ] and θ (cid:12)(cid:12)(cid:12) ∂ log( g ( θ )) ∂ θ (cid:12)(cid:12)(cid:12) ≤ L θ +6 L θ βg (0) + 5 β . The proof of Lemma 2 is given in Section B.1. Globally the log barrier does not haveLipschitz second derivatives. But Lemma 2 shows it is possible to bound the Lipschitz constantof second derivatives of log( g ( θ )) in a neighborhood of the current point.Lemma 2 only gives us a bound on the local Lipschitz constant for the second derivativesof log( g ( θ )) when g is univariate. By applying Lemma 2 with g ( θ ) := a i ( x + θv ), v = d x (cid:107) d x (cid:107) wecan bound the difference between the actual and predicted progress on the log barrier function.This bound is given in Lemma 3. emma 3. Suppose Assumption 2 holds (Lipschitz derivatives). Let x ∈ X , S = diag ( a ( x )) , d x ∈ R n , d s = ∇ a ( x ) d x , y = µS − , and κ ∈ (0 , / . If (cid:107) S − d s (cid:107) + L (cid:107) d x (cid:107) (cid:107) y (cid:107) µ ≤ κ, (13) then a i ( x + d x ) a i ( x ) ∈ [3 / , / for all i ∈ { , . . . , m } and (cid:12)(cid:12) ψ µ ( x ) + M ψ µ x ( d x ) − ψ µ ( x + d x ) (cid:12)(cid:12) ≤ L (cid:107) y (cid:107) ) (cid:107) d x (cid:107) + L (cid:107) d x (cid:107) (cid:107) y (cid:107) κ + µκ . The proof of Lemma 3 is given in Section B.2. Also, observe that if (13) holds for some x ∈ X and d x then (13) holds for any damped direction αd x with α ∈ [0 , Convex { x, x + αd x } ⊆ Convex { x, x + d x } ⊆ X . This observation ensures we can use Lemma 3 to establish the premisesof Lemma 4 and 5 which require Convex { x, x + } ⊆ X . Lemma 4.
Suppose Assumption 2 holds. Let
Convex { x, x + } ⊆ X , s = a ( x ) , s + = a ( x + ) , S = diag ( a ( x )) , Y = diag ( y ) , y + ∈ R m , Y + = diag ( y + ) , d x = x + − x , d y = y + − y , and d s = ∇ a ( x ) d x . If the equation Sy + Sd y + Y d s = µ holds, then (cid:107) Y − d y (cid:107) ≤ (cid:107) S − d s (cid:107) + (cid:107) µ ( SY ) − − (cid:107) (14) (cid:107) Y + s + − µ (cid:107) ≤ (cid:107) Sy (cid:107) ∞ (cid:107) S − d s (cid:107) (cid:107) Y − d y (cid:107) + L (cid:107) y (cid:107) (1 + (cid:107) Y − d y (cid:107) ) (cid:107) d x (cid:107) . (15) Furthermore, if (cid:107) Y + s + − µ (cid:107) ∞ < µ and (cid:107) Y − d y (cid:107) ∞ ≤ then s + , y + ∈ R m ++ . We give the proof of Lemma 4 in Section B.3. Lemma 4 will allow us to guarantee ( x + , y + )satisfies (FJ1.a) and (FJ1.b) when we take a primal-dual step in Algorithm 1. This a typicalLemma used for interior point methods in linear programming except that the nonlinearity ofthe constraints creates the additional L (cid:107) y (cid:107) (1 + (cid:107) Y − d y (cid:107) ) (cid:107) d x (cid:107) term in (15). Lemma 5.
Suppose Assumption 2 holds. Let y, y + ∈ R m and Convex { x, x + } ⊆ X . Then thefollowing inequality holds: (cid:107) ∇ x L ( x, y ) + ∇ xx L ( x, y ) T d x − d Ty ∇ x a ( x ) − ∇ x L ( x + , y + ) (cid:107) ≤ L (cid:107) y (cid:107) (cid:107) d x (cid:107) (cid:107) Y − d y (cid:107) + L (cid:107) y (cid:107) + 1) (cid:107) d x (cid:107) (16) with d x = x + − x and d y = y + − y . The proof of Lemma 5 is given in Section B.4. Lemma 5 allows us to guarantee that (FJ1.c)holds at ( x + , y + ) when (cid:107) d x (cid:107) and (cid:107) Y − d y (cid:107) are small. The introduction of the L (cid:107) y (cid:107) (cid:107) d x (cid:107) (cid:107) Y − d y (cid:107) term is the key reason that the analysis of [2, 15, 43] for affine scaling does not automaticallyextend into nonlinear constraints because this method does not efficiently bound (cid:107) Y − d y (cid:107) . Remark 1.
The reader might observe that our termination criteron (FJ1) has a strange mix ofnorms, in particular the size of ∇ x L ( x, y ) is measured using (cid:107)·(cid:107) and the the size of y is measuredby (cid:107)·(cid:107) . We attempt to explain this by showing how these norms naturally appear in the Lemmasin this section. The bound on (cid:107) ∇ x L ( x, y ) + ∇ xx L ( x, y ) T d x − d Ty ∇ x a ( x ) − ∇ x L ( x + , y + ) (cid:107) inLemma 5 contains a term of the form L (cid:107) y (cid:107) (cid:107) d x (cid:107) . This term is tight because if v = d x (cid:107) d x (cid:107) , d y =0 , x = 0 , a i ( x ) = L ( v T x ) + 1 , and f ( x ) = 0 then (cid:107) ∇ x L ( x, y ) + ∇ xx L ( x, y ) T d x − d Ty ∇ x a ( x ) − ∇ x L ( d x , d y ) (cid:107) = (cid:107) ∇ x L ( d x , y ) (cid:107) = (cid:107) (cid:80) i y i L ( v T d x ) v (cid:107) = L ( v T d x ) (cid:107) y (cid:107) = L (cid:107) y (cid:107) (cid:107) d x (cid:107) .Furthermore, one can see from this example that changing the norm of (cid:107) y (cid:107) would introduce adimension-factor and make the bound strictly weaker. Trust region subproblems can be efficientlysolved when d x is bounded in Euclidean norm. For this reason, we choose to use the Euclideannorm to measure the size of d x . Inspection of the proof of Lemma 5 indicates that one cannotchange the norm on the term ∇ x L ( x, y ) + ∇ xx L ( x, y ) T d x − d Ty ∇ x a ( x ) − ∇ x L ( x + , y + ) withoutchanging the norm on the term d x or introducing a dimension-factor. For similar the reasonsit is inadvisable to change the norms on the term L (cid:107) y (cid:107) (cid:107) d x (cid:107) in Lemma 3. .2 Bounding the direction size of the slack variables This section presents Lemma 7 which allows us to bound the direction size of the slack variables.Before proving Lemma 7 we state Lemma 6 which contains some basic and well-known factsabout trust region subproblems that will be useful. The proof is given is Section B.5.
Lemma 6.
Consider g ∈ R n and a symmetric matrix H ∈ R m × n . Define ∆( u ) := u T Hu + g T u where ∆ : R n → R and let u ∗ ∈ argmin u ∈ B r ( ) ∆( u ) be an optimal solution to the trustregion subproblem for some r ≥ . Then there exists some δ ≥ such that: δ ( (cid:107) u ∗ (cid:107) − r ) = 0 , ( H + δ I ) u ∗ = − g, and H + δ I (cid:23) . (17) Conversely, if u ∗ satisfies (17) then u ∗ ∈ argmin u ∈ B r ( ) ∆( u ) . Let σ ( r ) := min u ∈ B r ( ) ∆( u ) ,then for all r ∈ [0 , ∞ ) we have σ ( r ) ≤ − δr σ ( r ) ≤ σ ( αr ) ≤ α σ ( r ) ∀ α ∈ [0 , . (18b) Furthermore, the function σ ( r ) is monotone decreasing and continuous. Lemma 7 which follows is key to our result, because it allows us to bound the size of (cid:107) S − d s (cid:107) (recall d s = ∇ a ( x ) d x ). We remark that often in linear programming one shows (cid:107) S − d s (cid:107) = O (1) to prove a O ( √ n log(1 /µ )) iteration bound. Combining Lemma 7 with the Lemmas fromSection 4.1 allows us to give concrete bounds on the reduction of the log barrier at each iteration.This underpins our main results in Section 5. Lemma 7.
Consider A ∈ R m × n , g ∈ R n , and a symmetric matrix H ∈ R m × n . Define ∆( u ) := u T ( H + A T A ) u + g T u where ∆ : R n → R and let d x ∈ argmin u ∈ B r ( ) ∆( u ) for some r ≥ . Then (cid:107) Ad x (cid:107) ≤ (cid:113) − d Tx Hd x − d x ) . (19) Proof
Observe that∆( d x ) = 12 d Tx ( H + A T A ) d x + g T d x = 12 d Tx ( H + A T A ) d x − d Tx ( H + A T A + δ I ) d x = − d Tx (cid:0) H + A T A (cid:1) d x − δ (cid:107) d x (cid:107) where the second transition use the fact from Lemma 6 that there exists some δ such that( H + A T A + δ I ) d x = − g . Rearranging this expression and using δ (cid:107) d x (cid:107) ≥ (cid:107) Ad x (cid:107) ≤ − d Tx Hd x − d x ) . (20)This concludes the proof of Lemma 7. (cid:3) Now, if we set H = ∇ xx L ( x, y ), A = √ µ S − ∇ a ( x ), S = diag ( a ( x )), and d s = ∇ a ( x ) d x then we deduce from Lemma 7 that (cid:107) S − d s (cid:107) ≤ (cid:115) − d Tx ∇ xx L ( x, y ) d x − M ψ µ x ( d x ) µ which if we assume ∇ xx L ( x, y ) is positive definite we deduce that (cid:107) S − d s (cid:107) ≤ (cid:115) − M ψ µ x ( d x ) µ . (21) lternately, in the nonconvex case if (cid:107) ∇ f ( x ) (cid:107) ≤ L and (cid:107) ∇ a i ( x ) (cid:107) ≤ L then (cid:107) S − d s (cid:107) ≤ (cid:115) L (1 + (cid:107) y (cid:107) ) (cid:107) d x (cid:107) − M ψ µ x ( d x ) µ . (22)We emphasize that (21) and (22) are unusual because the bound on (cid:107) S − d s (cid:107) is dependent onthe amount of predicted progress for a step size of α = 1, i.e., M ψ µ x ( d x ). This is related towhy it is critical that Algorithm 1 adaptively selects the step size. The intuition is as follows.At each iteration if we have not terminated then we want to reduce the barrier function bya fixed quantity. Lemma 3 implies for sufficiently small α that the new point x + αd x willreduce the barrier function proportional to M ψ µ x ( αd x ). If (cid:107) S − d s (cid:107) is small then we can takea step size with α = 1 and reduce the barrier function proportional to M ψ µ x ( d x ). On the otherhand, if (cid:107) S − d s (cid:107) is big we must pick α small to guarantee that we reduce the barrier functionproportional to M ψ µ x ( αd x ). Since α is small, the term M ψ µ x ( αd x ) is smaller than M ψ µ x ( d x ).Fortunately, this is counterbalanced because if (cid:107) S − d s (cid:107) is large that implies using either (21)and (22) that M ψ µ x ( d x ) is also large. This section outlines the proof of our main result, a bound on the number of iterations
Trust-IPM algorithm takes to find a Fritz John point. Section 5.1 gives a general bound for thenumber of iterations to find a Fritz John point, i.e., proves Theorem 1. Section 5.2 gives atighter bound in the case that f is convex and each a i is concave. In this section we prove our main result, Theorem 1 which bounds the number of iterations ofAlgorithm 1 to find a Fritz John point by O (cid:0) µ − / (cid:1) . At a high level this proof is similar totypical cubic regularization/trust region arguments: we argue that if the termination conditionsare not satisfied at the next iterate then we have reduced the log barrier function by at leastΩ( µ / ). Before proving Theorem 1, we prove the auxiliary Lemmas 8 and 9. Lemma 8 showswe reduce the barrier merit function when the predicted progress at each iteration is large;Lemma 9 allows us to reason about when the algorithm will terminate.Recall that Algorithm 1 computes steps via S = diag ( a ( x )) , y = µS − (ITRS.a) r = η x (cid:114) µL ( (cid:107) y (cid:107) + 1) (ITRS.b) d x ∈ argmin u ∈ B r ( ) M ψ µ x ( u ) d s ← ∇ a ( x ) d x d y ← µS − d s (ITRS.c) x + = x + αd x y + = y + αd y . (ITRS.d)where (ITRS) stands for interior trust region subproblem.Also recall that τ l , τ c and µ are all parameters for our termination criterion (FJ1). Tosimplify the analysis we assume µ is small enough such that the following assumption holds. Wealso fix the value of τ c which determines how tightly perturbed complementarity holds in (FJ1). Assumption 3 (Sufficiently small µ ) . Let τ c = (cid:18) τ l µL (cid:19) / ∈ (0 , L µL ∈ (0 , . emma 8. Suppose Assumptions 2 and 3 hold (Lipschitz derivatives, and sufficiently small µ ).Let x ∈ X , η s ∈ [0 , / , (ITRS) hold with η x = η s , and α = min (cid:110) , η s (cid:107) S − d s (cid:107) (cid:111) . Then x + ∈ X and ψ µ ( x + ) − ψ µ ( x ) ≤ µη s + max (cid:26) M ψ µ x ( d x ) , − η s µ (cid:27) . (23)Lemma 8 provides a bound on the progress as a function of the parameter η s ∈ [0 ,
1] whichcontrols the step size. This allows us to guarantee that we will be able to reduce the barrierfunction during Algorithm 1 if the predicted progress from solving the trust region subproblem M ψ µ x ( d x ) is sufficiently large. The proof of Lemma 8 is given in Section C.1 and consists oftwo parts. The first part uses (22), (ITRS), and the definition of α to argue that M ψ µ x ( αd x ) ≤ max {M ψ µ x ( d x ) , − η s µ/ } . The second part uses Lemma 3 to show that M ψ µ x ( αd x ) accuratelypredicts the reduction in the barrier function. Lemma 9.
Suppose (ITRS) , Assumptions 2 and 3 hold (direction selection, Lipschitz deriva-tives, and sufficiently small µ ). Let x ∈ X , η x ∈ (0 , ( τ l µL ) / ] , and α = 1 . Furtherassume M ψ µ x ( d x ) ≥ − τ l µr √ (cid:107) y (cid:107) . Under these assumptions, ( x + , y + ) satisfies (FJ1) and ∇ ψ µ ( x ) (cid:23) −√ τ l (1 + (cid:107) y (cid:107) ) (cid:113) τ l µη x L I . Lemma 9 shows that if the predicted progress, M ψ µ x ( d x ), from the trust region step is smallthen the algorithm must terminate at the next iterate. The proof of Lemma 9 is given inSection C.2. It first uses (22) and M ψ µ x ( d x ) ≥ − τ l µr (cid:112) (cid:107) y (cid:107) / (cid:107) S − d s (cid:107) and (cid:107) Y − d y (cid:107) must be small. This enables the use of Lemma 5 to bound (cid:107) ∇ L ( x + , y + ) (cid:107) .With Lemma 8 and 9 in hand we are now ready to prove our main result, Theorem 1. Theorem 1.
Suppose Assumptions 2 and 3 hold (Lipschitz derivatives, and sufficiently small µ ). Then Trust-IPM ( f, a, µ, τ l , L , η s , η x , x (0) ) with x (0) ∈ X and η s = 140 (cid:18) τ l µL (cid:19) / η x = η s , ( η -1) takes at most O (cid:32) ψ µ ( x (0) ) − ψ ∗ µ µ (cid:18) L µτ l (cid:19) / (cid:33) iterations to terminate with a ( µ, τ l , τ c ) -approximate second-order Fritz John point ( x + , y + ) , i.e., (FJ1) and (FJ2) hold. The proof is given in Section C.3. The idea is that if over two consecutive iterations thefunction is not reduced by Ω( µ / ) then (FJ1) and (FJ2) hold. This argument is a little differentfrom proofs of related results in literature. Convergence proofs for cubic regularization argue thatif there is a little progress this iteration then the next iterate will satisfy the termination criterion;convergence proofs for gradient descent argue that if there is little progress this iteration thenthe current iteration satisfies the termination criterion. The reason for our unusual argumentis that Lemma 9 guarantees that that (FJ1) holds at the next iterate and that ∇ ψ µ ( x ) isapproximately positive definite at the current iterate. To obtain our results in this section we will assume that the function f is convex and eachfunction a i is concave. The result, Lemma 10, only gives the iteration bound to find a FritzJohn point. In the subsequence section we use this Lemma to prove Theorem 2 which gives aniteration bound for finding an (cid:15) -optimal solution.Similar, to Assumption 3 given in Section 5.1 we use Assumption 4 to require that µ is smallto simplify the analysis and final bound. ssumption 4 (Sufficiently small µ ) . Let τ c = (cid:18) τ l µL (cid:19) / ∈ (0 , L µL τ l ∈ (0 , . Lemma 10.
Suppose Assumption 2 and 4 hold (Lipschitz derivatives, and sufficiently small µ ).Let f be convex and each a i concave. Then Trust-IPM ( f, a, µ, τ l , L , η s , η x , x (0) ) with x (0) ∈ X and η x = θ (cid:18) τ l µL (cid:19) / η s = θ (cid:18) τ l µL (cid:19) / θ = 1 / , ( η -2) takes at most O (cid:32) ψ µ ( x (0) ) − ψ ∗ µ µ (cid:18) L τ l µ (cid:19) / (cid:33) iterations to terminate with a ( µ, τ l , τ c ) -approximate first-order Fritz John point ( x + , y + ) , i.e., (FJ1) holds. The proof of Lemma 10 is similar to Theorem 1 and is given in Section D. For this resultwe only need to prove that we have found an approximate first-order Fritz John rather than anapproximate second-order Fritz John point (by the assumption f is convex and a i is concave wetrivially have ∇ ψ µ ( x ) (cid:23) f is convex and a i concave so we can apply (21) to bound (cid:107) S − d s (cid:107) instead of (22). While Lemma 10 specialized our guarantees to when f is convex and a i is concave, it only madea statement on how long it takes to find a Fritz John point. However, finding a Fritz John pointdoes not necessarily guarantee optimality. The purpose of this section is to provide optimalityguarantees. We begin with a simple lemma showing that finding an approximate KKT pointimplies approximate optimality. We use this lemma to convert algorithms that find approximateKKT points of the log barrier to algorithms that find approximately optimal solutions. Finally,the main result (Theorem 2) is that under a certain regularity assumption, our algorithm, whenapplied to a sequence of subproblems with decreasing µ , takes at most O (cid:0) (cid:15) − / (cid:1) trust regionsubproblem solves to find an (cid:15) -optimal solution. Lemma 11.
Let f : R n → R and a : R n → R m . Let (cid:107)X (cid:107) ≤ R . If ( x, y ) ∈ X × R m ++ and a i ( x ) y i ≥ µ for i ∈ { , . . . , m } then ψ µ ( x ) − ψ ∗ µ ≤ (cid:107) ∇ x L ( x, y ) (cid:107) R + (cid:80) mi =1 ( a i ( x ) y i − µ ) .Proof Let S := diag ( a ( x )) and ˜ y := y − µS − and q ( z ) := ψ µ ( z ) − a ( z ) T ˜ y . By a i ( x ) y i ≥ µ we have ˜ y i ≥
0. Now, ψ ∗ µ ≥ inf z ∈X q ( z ) ≥ q ( x ) − (cid:107) ∇ q ( x ) (cid:107) R = ψ µ ( x ) − (cid:107) ∇ q ( x ) (cid:107) R − a ( x ) T ˜ y, where the first inequality uses a ( z ) T ˜ y ≥
0, the second inequality the convexity of q , and thefinal inequality the definition of q . The result follows by ∇ q ( x ) = ∇ x L ( x, y ). (cid:3) So far we have presented
Trust-IPM which only minimizes the log barrier with µ fixed.However, log barrier methods traditionally solve a sequence of subproblems with µ tendingtoward zero as described in Algorithm 2. lgorithm 2 IPM with decreasing µ function Annealed-IPM ( f, a, µ (0) , x (0) , (cid:15) ) for j = 0 , . . . , ∞ do ( x ( j +1) , y ( j +1) ) ← Generic-IPM ( f, a, µ ( j ) , x ( j ) ) µ ( j +1) ← µ ( j ) / if µ ( j ) m ≤ (cid:15) thenreturn x ( j +1) end ifend forend function In Algorithm 2 we write
Generic-IPM as a placeholder for any algorithm that finds a FritzJohn point. The precise properties we need
Generic-IPM to satisfy are given in Assump-tion 5. For this paper will use
Generic-IPM = Trust-IPM but any other method satisfyingAssumption 5 would suffice. Then, as we show in Lemma 12 it is possible to give an iterationbound for the algorithm to find a (cid:15) -optimal solution to the original problem.
Assumption 5.
Let (cid:107)X (cid:107) ≤ R . Suppose that for any µ ∈ (0 , ∞ ) , x ∈ X that Generic-IPM ( f, a, µ, x ) finds a point ( x + , y + ) with (cid:107) ∇ x L ( x + , y + ) (cid:107) ≤ µmR and (cid:12)(cid:12) a i ( x + ) y + i − µ (cid:12)(cid:12) ≤ µ/ in at most O (1) + ψ µ ( x ) − ψ ∗ µ µ w ( µ ) unit operations, where the function w : R → R is monotone decreas-ing. The term ‘unit operations’ is used to denote the metric for computational cost, this couldbe trust-region steps, linear system solves or matrix-vector multiplies.Before stating Lemma 12 we define f ∗ := inf z ∈X f ( z )log +2 ( x ) := max { log ( x ) , } . Lemma 12.
Let f be convex and each a i concave. Suppose that Assumption 5 holds. Let x (0) ∈ X . Then Annealed-IPM ( f, a, µ (0) , x (0) , (cid:15) ) takes at most (cid:16) O (1) + 6 m × w (cid:16) (cid:15) m (cid:17)(cid:17) log +2 (cid:18) mµ (0) (cid:15) (cid:19) + ψ µ (0) ( x (0) ) − ψ ∗ µ (0) µ (0) w ( µ (0) ) unit operations to return a point x ( k ) ∈ X with f ( x ( k ) ) − f ∗ ≤ (cid:15) . The proof of Lemma 12 appears in Section E.2.Our results for
Trust-IPM only produce Fritz John points but to satisfy Assumption 5 weneed an algorithm that produces KKT points. Next, we present a regularity assumption whichenables us to convert a Fritz John point into a KKT point and thereby enables
Trust-IPM tosatisfy Assumption 5.
Assumption 6 (Regularity conditions) . Assume there exists some ζ > that if (FJ1) holdsthen (cid:107) y + (cid:107) + 1 ≤ ζ . One sufficient condition for Assumption 6 to hold is Slater’s condition, i.e., there exists somepoint x ∈ X and γ > a ( x ) > γ . We show this formally in Section E.1.Next, we present the main result of this section, Theorem 2, which combines Lemma 10, andLemma 12. To satisfy the premises of these lemmas we make the following assumption. ssumption 7 (Parameter settings) . Let τ l = mRζ / (A7. τ l ) τ c = (cid:18) τ l µL (cid:19) / (A7. τ c ) µ (0) = min (cid:26) L R ζm , L mRL √ ζ (cid:27) (A7. µ (0) ) where µ (0) represents the initial µ value of Annealed-IPM . Theorem 2.
Suppose Assumption 2, 6 and 7 hold (Lipschitz derivatives, regularity conditions,and parameter settings). Let f be convex and each a i concave. Let x (0) ∈ X and (cid:107)X (cid:107) ≤ R .Define η s , η x by ( η -2) and set Generic-IPM ( f, a, µ, x ) := Trust-IPM ( f, a, µ, τ l , L , η s , η x , x ) inside Annealed-IPM . Then
Annealed-IPM ( f, a, µ (0) , x (0) , (cid:15) ) takes at most O (cid:32)(cid:32) m / (cid:18) L R ζ(cid:15) (cid:19) / + 1 (cid:33) log + (cid:18) mµ (0) (cid:15) (cid:19) + ψ µ (0) ( x (0) ) − ψ ∗ µ (0) µ (0) (cid:18) L R ζm µ (0) (cid:19) / (cid:33) unit operations to return a point x ( k ) ∈ X with f ( x ( k ) ) − f ∗ ≤ (cid:15) . The proof of Theorem 2 appears in Section E.3. Notice that the iteration bound given inTheorem 2 comprises of two terms. The first term is dependent on (cid:15) and corresponds to thetotal number of trust-region subproblems used during iterations j = 1 , . . . , k of Annealed-IPM . The second term corresponds to the number of inner iterations required in iteration j = 0 of Annealed-IPM , in other words, the number of trust region subproblems used by
Trust-IPM ( f, a, µ (0) , τ l , L , η s , η x , x (0) ). This second term has no (cid:15) dependence, and by sub-stituting the value of µ (0) given by (A7. µ (0) ) we observe this term is bounded by O (cid:18) ∆ f (cid:18) m L R ζ + R L ζ / L m (cid:19) + log( b ) (cid:18) m + R L ζL m (cid:19)(cid:19) where b is some constant such that a i ( x ) a i ( x (0) ) ≤ b for all i = 1 , . . . , m and x ∈ X . One difficulty with nonconvex optimization is that there are many choices termination criterionand this choice affects iteration bounds. The results of Birgin et al. [3] guarantee to find anunscaled KKT points or a certificate of local infeasibility. Their criterion is different from ourFritz John termination criterion. Therefore for the sake of comparison we now introduce a newpair of termination criterion similar to the criterion they presented. Our own definition of anunscaled KKT point is a ( x ) ≥ − ε opt (KKT.a) (cid:107) ∇ x L ( x, y ) (cid:107) ≤ ε opt (KKT.b) y ≥ (KKT.c) a i ( x ) y i ≤ ε opt . (KKT.d)Let us contrast this definition with the definition of an unscaled KKT point given in Birgin et al.[3]. The most important difference is how complementarity is measured. In particular, in Birgin t al. [3] their termination criterion replaces (KKT.d) of our criterion with min { a i ( x ) , y i } ≤ ε opt .In this respect, the termination criterion of Birgin et al. [3] is stronger than (KKT). To detectinfeasibility we consider the following termination criterion.min i a i ( x ) < − ε opt / a ( x ) + t ≥ (INF1.b) (cid:13)(cid:13) ∇ a ( x ) T y (cid:13)(cid:13) ≤ ε inf (INF1.c) (cid:107) y (cid:107) = 1 (INF1.d) y ≥ (INF1.e)( a i ( x ) + t ) y i ≤ ε inf ε opt . (INF1.f)System (INF1) finds an approximate KKT point for the problem of minimizing the infinitynorm of the constraint violation. In contrast, Birgin et al. [3] find a stationary point for theEuclidean norm of the constraint violation squared which they denote by θ ( x ). However, thisis a weak measure of infeasibility since if θ ( x ) ≤ ε opt then automatically (cid:107) ∇ θ ( x ) (cid:107) ≤ ε opt . Thenatural termination criterion corresponding to (INF1) is an approximate KKT point for theproblem of minimizing the Euclidean norm of the constraint violation. This can be written asmin i a i ( x ) < − ε opt / (cid:13)(cid:13) ∇ a ( x ) T y (cid:13)(cid:13) ≤ ε inf (INF2.b) y = z (cid:107) z (cid:107) (INF2.c) a ( x ) + z ≥ (INF2.d)( a i ( x ) + z i ) y i = 0 (INF2.e) y ≥ . (INF2.f)To find a point satisfying (INF2) they require (cid:107) ∇ θ ( x ) (cid:107) ≤ ε opt ε inf . If this condition holdsthen z = min { a ( x ) , } , y = z (cid:107) z (cid:107) satisfies (INF2). Finally, notice that both (INF1) and (INF2)find points withmin i a i ( x ) < − ε opt / a i ( x ) y i ≤ ε inf (cid:13)(cid:13) ∇ a ( x ) T y (cid:13)(cid:13) ≤ ε inf y ≥ , which proves infeasibility in ball of radius R if ε inf = O ( ε opt / (1 + R )), f is convex, and a i isconcave [16, Observation 1].To obtain our algorithm that finds a point satisfying either (KKT) or (INF1), we apply Trust-IPM in two-phases (see
Two-Phase-IPM in Appendix F.1).Let x (0) ∈ R n be our starting point and define t (0) := ε opt { min i − a i ( x (0) ) , } . Phase-one applies Algorithm 1 to minimize the infinity norm of the constraint violation, i.e., wefind a Fritz John point of min x,t f P ( x, t ) := t (PI.a) a P ( x, t ) := a ( x ) + t t ε opt + t (0) − t ≥ . (PI.b)Let ( x ( P , t ( P ) be the solution obtained. Starting from x ( P , phase-two minimizes theobjective subject to the ( ε opt -relaxed) constraints, i.e., we find a Fritz John point ofmin x f ( x ) (PII.a) a P ( x ) := a ( x ) + ε opt ≥ (PII.b) tarting from the point obtained in phase-one.We replace Assumption 2 with Assumption 8, where X replaced with two sets, correspondingto phase-one and phase-two respectively:˜ X ( P := { x ∈ R n : a ( x ) ≥ − ( ε opt / t (0) ) } ˜ X ( P := { x ∈ R n : a ( x ) ≥ − ε opt } . By the definition of t (0) we have ˜ X ( P ⊆ ˜ X ( P . Assumption 8.
Assume that each a i : R n → R for i ∈ { , . . . , m } is a continuous function on R n . Let L , L ∈ (0 , ∞ ) . The functions a i : R n → R have L -Lipschitz first derivatives and L -Lipschitz second derivatives on the set ˜ X ( P . The function f : R n → R and a i : R n → R has L -Lipschitz first derivatives and L -Lipschitz second derivatives on the set ˜ X ( P . Before presenting Claim 3 let us introduce non-negative scalars c , ∆ f , and ∆ a chosen asfollows. c ≥ sup x ∈ ˜ X ( P max i ∈{ ,...,m } a i ( x ) (24a)∆ f ≥ sup z ∈ ˜ X ( P f ( z ) − inf z ∈ ˜ X ( P f ( z ) (24b)∆ a ≥ min i ∈{ ,...,m } max {− a i ( x (0) ) , } . (24c) Claim 3.
Let x (0) ∈ R n . Suppose Assumption 8 and (24) holds. Let f be L -Lipschitz. Assume c, ∆ a , ∆ f , L , L ≥ , ε opt ∈ (cid:16) , m log + ( c/ε opt ) (cid:105) , ε inf ∈ (0 , L m ] and ε opt ∈ (0 , √ ε inf ] . Then Two-Phase-IPM ( f, a, ε opt , ε inf , L , L , x (0) ) takes at most O (cid:32) ∆ a (cid:32) L / ε / inf ε / opt + 1 ε inf ε opt (cid:33) + ∆ f ε opt (cid:18) L L ε opt ε inf (cid:19) / (cid:33) trust region subproblem solves to return a point ( x, t, y ) that satisfies either (KKT) or (INF1) . The definition of
Two-Phase-IPM appears in Section F.1 and the proof of Claim 3 appearsin Section F.2. The proof is primarily devoted to analyzing phase-two when we minimize theobjective while approximately satisfying the constraints. We argue that when we terminatewith a Fritz John point in phase-two then either the dual variables are small enough that thisis a KKT point or if the dual variables are large the scaled dual variables give an infeasibilitycertificate. If we add the assumption that ε opt ∈ (0 , ε inf ] the iteration bound of Claim 3 can beeven more simply stated as O (cid:32) ∆ a + ∆ f ε opt (cid:18) L L ε opt ε inf (cid:19) / (cid:33) . (25)We can now compare with the results of [3] in Table 1. Table 1
This table compares iteration bounds under the setup of (25). It only includes dependen-cies on ε opt and ε inf . CRN stands for cubic regularized Newton [29].algorithm O (cid:16) ε − opt ε − inf (cid:17) gradient computation ∇ Birgin et al. [3] O (cid:16) ε − opt ε − / inf (cid:17) CRN with non-negativity constraint ∇ , ∇ IPM (this paper) O (cid:16) ε − / opt ε − / inf (cid:17) trust-region subproblem ∇ , ∇ he algorithm of Birgin et al. [3] sequentially finds KKT points to quadratic penalty sub-problems of the form,minimize ( x,r,s ) ∈ R n +1+ m Φ t ( x, r, s ) := ( f ( x ) − t + r ) + (cid:107) a ( x ) + s (cid:107) s.t. r ≥ , s ≥ . (26)To solve this subproblem method they suggest using p th order regularization with non-negativityconstraints. For p = 2 this reduces to cubic regularization Newton’s method with non-negativityconstraints, i.e.,minimize d ∈ R n +1+ m d T ∇ Φ t ( x, r, s ) d + ∇ Φ t ( x, r, s ) T d + C (cid:107) d (cid:107) s.t r + d r ≥ , s + d s ≥ (27)for some constant C > d = ( d x , d r , d s ). Solving this subproblem might be computa-tionally expensive. It is well-known that checking if a point is a local optimum of (27) is ingeneral NP-hard [31]. It is possible to find an approximate KKT point using projected gradientdescent or an interior point method for solving nonconvex quadratic program [44]. However,both these approaches are likely to result in a computation runtime worse than O (cid:16) ε − opt ε − / inf (cid:17) .We speculate that one might also be able to apply the interior point method of Haeser et al. [15]as the unconstrained minimization algorithm for solving (26) and potentially obtain the runtimebound of O (cid:16) ε − opt ε − / inf (cid:17) given by [3], although further analysis is needed to confirm this.Finally, Cartis et al. [8, 9] show that one requires O (cid:0) ε − opt (cid:1) iterations to find a scaled KKTpoint: (cid:107) ∇ x L ( x, y ) (cid:107) ≤ ε opt ( (cid:107) y (cid:107) + 1) y ≥ a ( x ) ≥ − ε opt a i ( x ) y i ≤ ε opt (1 + (cid:107) y (cid:107) ) , or a certificate of infeasibility (with ε inf = 1). Their method only requires computation of first-derivatives but has the disadvantage that it requires solving a linear program at each iteration. Since there has been relatively little work with general convex constraints we generate a set base-lines for comparison using existing methods for unconstrained optimization. To simplify thesecomparisons we consider the weaker problem of finding an (cid:15) -optimal solution to the problem ofminimize x ∈ R n max i ∈{ ,...,m } a i ( x ) . (28)To further simplify we assume the optimal objective value of (28) is zero, that the initial point x (0) satisfies a ( x (0) ) > and that L + L + L + R + m = O (1) with R ≥ (cid:107) x ∗ − x (0) (cid:107) where x ∗ is some optimal solution. To apply our IPM we can reformulate (28) asminimize ( x,t ) ∈ R n +1 t s.t. (cid:107) x − x (0) (cid:107) ≤ R + 1 , ≤ t ≤ t (0) + 1 , a i ( x ) ≤ t ∀ i ∈ { , . . . , m } , (29)where t (0) := 1+max i ∈{ ,...,m } a i ( x (0) ) and ( x (0) , t (0) ) is the starting point of our IPM. Note thatsubstituting this starting point into (29) implies Assumption 9 holds with γ = 1, and thereforeby Lemma 15 Assumption 6 holds with ζ = O (1 + µ ). This implies our IPM has a runtime of˜ O (cid:0) (cid:15) − / (cid:1) using Theorem 2.Another approach to solve (28) is to minimize ω p ( x ) := m (cid:88) i =1 max { a i ( x ) , } p +1 using a method that only requires the p th order derivative to be Lipschitz. To find a pointsatisfying a ( x ) ≤ (cid:15) we need to find a point with ω p ( x ) ≤ (cid:15) p +1 . We can then use genericunconstrained optimization methods such as cubic regularization, accelerated gradient descent, nd accelerated cubic regularization to solve this problem. These comparisons are summarizedin Table 2. Table 2
Iteration bounds to find a point a ( x ) ≤ (cid:15) . Assume L + L + L + (cid:107) x ∗ − x (0) (cid:107) + m = O (1).SG = sub-gradient method [34], CRN = cubic regularized Newton [29], AGD = accelerated gradientdescent [26], ACRN = accelerated cubic regularized Newton of Monteiro and Svaiter [24]. Allsubproblems with a * have similar computational cost: a logarithmic number of linear systemsolves.algorithm O ( (cid:15) − ) matrix-vector product ∇ CRN on ω O ( (cid:15) − / ) cubic regularization subproblem* ∇ , ∇ AGD on ω O ( (cid:15) − ) matrix-vector product ∇ ACRN on ω O ( (cid:15) − / ) cubic regularization subproblem* ∇ , ∇ IPM (this paper) ˜ O (cid:0) (cid:15) − / (cid:1) trust region subproblem* ∇ , ∇ cutting plane O ( n log(1 /(cid:15) )) centre of polytope ∇ Acknowledgement
We’d like to thank Yair Carmon, Ron Estrin, and Michael Saunders for their useful feedback onthe paper.
References [1] Andersen, E. D. and Ye, Y. (1998). A computational study of the homogeneous algorithmfor large-scale convex optimization.
Computational Optimization and Applications , 10(3):243–269.[2] Bian, W., Chen, X., and Ye, Y. (2015). Complexity analysis of interior point algorithms fornon-Lipschitz and nonconvex minimization.
Mathematical Programming , 149(1-2):301–327.[3] Birgin, E. G., Gardenghi, J., Mart´ınez, J., Santos, S., and Toint, P. L. (2016). Evaluationcomplexity for nonlinear constrained optimization using unscaled KKT conditions and high-order models.
SIAM Journal on Optimization , 26(2):951–967.[4] Byrd, R. H., Gilbert, J. C., and Nocedal, J. (2000). A trust region method based on interiorpoint techniques for nonlinear programming.
Mathematical Programming , 89(1):149–185.[5] Byrd, R. H., Nocedal, J., and Waltz, R. A. (2006). KNITRO: An integrated package fornonlinear optimization. In
Large-scale nonlinear optimization , pages 35–59. Springer.[6] Carmon, Y., Duchi, J. C., Hinder, O., and Sidford, A. (2017a). Lower bounds for findingstationary points I. arXiv preprint arXiv:1710.11606 .[7] Carmon, Y., Duchi, J. C., Hinder, O., and Sidford, A. (2017b). Lower bounds for findingstationary points II: First-order methods. arXiv preprint arXiv:1711.00841 .[8] Cartis, C., Gould, N. I., and Toint, P. L. (2011). On the evaluation complexity of compositefunction minimization with applications to nonconvex nonlinear programming.
SIAM Journalon Optimization , 21(4):1721–1739.[9] Cartis, C., Gould, N. I., and Toint, P. L. (2014). On the complexity of finding first-ordercritical points in constrained nonlinear optimization.
Mathematical Programming , 144(1-2):93–106.
10] Chen, L. and Goldfarb, D. (2006). Interior-point l -penalty methods for nonlinear program-ming with strong global convergence properties. Mathematical Programming , 108(1):1–36.[11] Conn, A. R., Gould, N. I., Orban, D., and Toint, P. L. (2000a). A primal-dual trust-regionalgorithm for non-convex nonlinear programming.
Mathematical programming , 87(2):215–249.[12] Conn, A. R., Gould, N. I., and Toint, P. L. (2000b).
Trust region methods . SIAM.[13] Curtis, F. E., Robinson, D. P., and Samadi, M. (2017). A trust region algorithm witha worst-case iteration complexity of O ( (cid:15) − / ) for nonconvex optimization. MathematicalProgramming , 162(1-2):1–32.[14] Gould, N. I., Orban, D., and Toint, P. (2002).
An interior-point l1-penalty method fornonlinear optimization . Citeseer.[15] Haeser, G., Liu, H., and Ye, Y. (2017). Optimality condition and complexity analysis forlinearly-constrained optimization without differentiability on the boundary. arXiv preprintarXiv:1702.04300 .[16] Hinder, O. and Ye, Y. (2018). A one-phase interior point method for nonconvex optimiza-tion. arXiv preprint arXiv:1801.03072 .[17] John, F. (1948). Extremum problems with inequalities as side conditions.
Studies andEssays: Courant Anniversary Volume , pages 187–204.[18] Karmarkar, N. (1984). A new polynomial-time algorithm for linear programming. In
Proceedings of the sixteenth annual ACM symposium on Theory of computing , pages 302–311.ACM.[19] Kojima, M., Mizuno, S., and Yoshise, A. (1989). A primal-dual interior point algorithmfor linear programming. In
Progress in mathematical programming , pages 29–47. Springer.[20] Mangasarian, O. L. and Fromovitz, S. (1967). The fritz john necessary optimality conditionsin the presence of equality and inequality constraints.
Journal of Mathematical Analysis andApplications , 17(1):37–47.[21] Megiddo, N. (1989). Pathways to the optimal set in linear programming. In
Progress inmathematical programming , pages 131–158. Springer.[22] Mehrotra, S. (1992). On the implementation of a primal-dual interior point method.
SIAMJournal on optimization , 2(4):575–601.[23] Monteiro, R. D. and Adler, I. (1989). Interior path following primal-dual algorithms. partI: Linear programming.
Mathematical programming , 44(1):27–41.[24] Monteiro, R. D. and Svaiter, B. F. (2013). An accelerated hybrid proximal extragradientmethod for convex optimization and its implications to second-order methods.
SIAM Journalon Optimization , 23(2):1092–1125.[25] Nemirovskii, A. and Yudin, D. B. (1983). Problem Complexity and Method Efficiency inOptimization.[26] Nesterov, Y. (1983). A method of solving a convex programming problem with convergencerate O (1 /k ). In Soviet Mathematics Doklady , volume 27, pages 372–376.[27] Nesterov, Y. (2013).
Introductory Lectures on Convex Optimization: A Basic Course ,volume 87. Springer Science & Business Media.
28] Nesterov, Y. and Nemirovskii, A. (1994).
Interior-point polynomial algorithms in convexprogramming . SIAM.[29] Nesterov, Y. and Polyak, B. T. (2006). Cubic regularization of newton method and itsglobal performance.
Mathematical Programming , 108(1):177–205.[30] Nocedal, J. and Wright, S. (2006).
Numerical Optimization . Springer Science & BusinessMedia.[31] Pardalos, P. M. and Schnitger, G. (1988). Checking local optimality in constrainedquadratic programming is NP-hard.
Operations Research Letters , 7(1):33–35.[32] Renegar, J. (1988). A polynomial-time algorithm, based on newton’s method, for linearprogramming.
Mathematical Programming , 40(1-3):59–93.[33] Richard Johnsonbaugh, W. P. (2010).
Foundations of mathematical analysis . Dover publi-cations, Mineola, New York.[34] Shor, N. (1985).
Minimization Methods for Non-differentiable Functions . ComputationalMathematics. Springer.[35] Sorensen, D. C. (1982). Newton’s method with a model trust region modification.
SIAMJournal on Numerical Analysis , 19(2):409–426.[36] Sturm, J. F. (2002). Implementation of interior point methods for mixed semidefinite andsecond order cone optimization problems.
Optimization Methods and Software , 17(6):1105–1154.[37] Ulbrich, S. (2004). On the superlinear local convergence of a filter-SQP method.
Mathe-matical Programming , 100(1):217–245.[38] Vanderbei, R. J. (1999). LOQO user’s manual—version 3.10.
Optimization methods andsoftware , 11(1-4):485–514.[39] Vicente, L. N. and Wright, S. J. (2002). Local convergence of a primal-dual method for de-generate nonlinear programming.
Computational Optimization and Applications , 22(3):311–328.[40] W¨achter, A. and Biegler, L. T. (2005). Line search filter methods for nonlinear program-ming: Motivation and global convergence.
SIAM Journal on Optimization , 16(1):1–31.[41] W¨achter, A. and Biegler, L. T. (2006). On the implementation of an interior-point filterline-search algorithm for large-scale nonlinear programming.
Mathematical programming ,106(1):25–57.[42] Ye, Y. (1991). An O ( n L ) potential reduction algorithm for linear programming. Mathe-matical programming , 50(1):239–258.[43] Ye, Y. (1992). A new complexity result on minimization of a quadratic function witha sphere constraint. In
Recent Advances in Global Optimization , pages 19–31. PrincetonUniversity Press.[44] Ye, Y. (1998). On the complexity of approximating a KKT point of quadratic programming.
Mathematical programming , 80(2):195–211.[45] Ye, Y. (2018). MS&E311. Lecture note 12. https://web.stanford.edu/class/msande311/handout.shtml.[46] Ye, Y., Todd, M. J., and Mizuno, S. (1994). An O ( √ nL )-iteration homogeneous andself-dual linear programming algorithm. Mathematics of Operations Research , 19(1):53–67.
47] Zhang, Y. (1994). On the convergence of a class of infeasible interior-point methods for thehorizontal linear complementarity problem.
SIAM Journal on Optimization , 4(1):208–227.
A Proof of Claim 1
Claim 1.
Let ψ µ ( x ) := x − µ (log( x ) + log(2 − x )) , µ ∈ (0 , / and C ∈ [2 , ∞ ) . Fix α ∈ (0 , ∞ ) and suppose the x ( k ) satisfies (9) . If x ( k ) remains in the interval [0 , for the startingpoint x (0) = exp( − C/ (2 µ )) ∈ S C , then for the starting point x (0) = 1 ∈ S C and for all k ≤ ( µ/
8) exp( C/ (2 µ )) we have (cid:107) ∇ ψ µ ( x ( k ) ) (cid:107) ≥ µ .Proof. Suppose x (0) = exp( − C/ (2 µ )). Note that x (0) ≤ exp( − / (2 / −
5) and therefore ψ µ ( x (0) ) = x (0) − µ (cid:16) log (cid:16) x (0) (cid:17) + log (cid:16) − x (0) (cid:17)(cid:17) ≤ x (0) + C/ ≤ C ⇒ x (0) ∈ S C where the first inequality uses that log (cid:0) x (0) (cid:1) = log(exp( − C/ (2 µ ))) = − C/ (2 µ ) and log (cid:0) − x (0) (cid:1) ≥ log(1) = 0, the second inequality uses x (0) ≤ exp( − ≤ C/
2, and the implication uses ψ ∗ µ ≥ ∇ ψ µ ( x (0) ) = 1 − µ (cid:18) x (0) − − x (0) (cid:19) ≤ − µ (exp( C/ (2 µ )) − ≤ − µ/ C/ (2 µ ))where the last inequality uses that µ exp( C/ (2 µ )) is monotone decreasing with respect to µ on theinterval [0 , C/
2] because ∂µ exp( C/ (2 µ )) /∂µ = − ( C − µ ) exp( C/ (2 µ )) / (2 µ ) ≤
0, and therefore µ exp( C/ (2 µ )) ≥ exp(2 / (2 / ≥
29. We conclude that if x (1) ≤ α ≤ µ exp( − C/ (2 µ )).On the other hand, if x ( k ) ∈ [1 / ,
1] then 0 . ≤ − / − / .
5) = ∇ ψ µ (1 / ≤ ∇ ψ µ ( x ( k ) ) ≤ ∇ ψ µ (1) ≤ ⇒ ∇ ψ µ ( x ( k ) ) ∈ [0 . , . x (0) = 1 and µ exp( − C/ (2 µ )) ≤ α it willtake at least µ exp( C/ (2 µ )) iterations until x ( k ) (cid:54)∈ [1 / , B Proofs from Section 4.1
B.1 Proof of Lemma 2
Lemma 2.
Suppose the function g : R → R has L -Lipschitz first derivatives and L -Lipschitzsecond derivatives on the set [0 , θ ] where θ ∈ R + . Further assume g (0) > , β ∈ (0 , / , and theinequality | θg (cid:48) (0) | g (0) + L θ g (0) ≤ β holds. Then g ( θ ) g (0) ∈ [ , ] and θ (cid:12)(cid:12)(cid:12) ∂ log( g ( θ )) ∂ θ (cid:12)(cid:12)(cid:12) ≤ L θ +6 L θ βg (0) + 5 β .Proof We have | g (0) − g ( θ ) | g (0) ≤ | θg (cid:48) (0) | g (0) + L θ g (0) ≤ β ≤ . The first inequality uses | g (0) + g (cid:48) (0) θ − g ( θ ) | ≤ L θ , the triangle inequality and g (0) > g ( θ ) g (0) ∈ [3 / , / g ( θ )), ∂ log( g ( θ )) ∂θ = g (cid:48) ( θ ) g ( θ ) ∂ log( g ( θ )) ∂ θ = g (cid:48)(cid:48) ( θ ) g ( θ ) − g (cid:48) ( θ ) g ( θ ) ∂ log( g ( θ )) ∂ θ = g (cid:48)(cid:48)(cid:48) ( θ ) g ( θ ) − g (cid:48) ( θ ) g (cid:48)(cid:48) ( θ ) g ( θ ) + 2 g (cid:48) ( θ ) g ( θ ) . (30) y (30), g ( θ ) g (0) ∈ [3 / , / | g (cid:48)(cid:48)(cid:48) ( θ ) | ≤ L , and | g (cid:48)(cid:48) ( θ ) | ≤ L we have (cid:12)(cid:12)(cid:12)(cid:12) ∂ log( g ( θ )) ∂ θ (cid:12)(cid:12)(cid:12)(cid:12) ≤ (4 / L g (0) + 3(4 / L | g (cid:48) ( θ ) | g (0) + 2(4 / | g (cid:48) ( θ ) | g (0) . Now, since g (cid:48) ( θ ) ≤ g (cid:48) (0) + L θ we have θ (cid:12)(cid:12)(cid:12)(cid:12) ∂ log( g ( θ )) ∂ θ (cid:12)(cid:12)(cid:12)(cid:12) ≤ (4 / L θ g (0) + 3(4 / L θ ( g (cid:48) (0) θ + L θ ) g (0) + 2(4 / ( | θg (cid:48) (0) | + L θ ) g (0) ≤ L θ + 6 L θ βg (0) + 5 β . (cid:3) B.2 Proof of Lemma 3
Lemma 3.
Suppose Assumption 2 holds (Lipschitz derivatives). Let x ∈ X , S = diag ( a ( x )) , d x ∈ R n , d s = ∇ a ( x ) d x , y = µS − , and κ ∈ (0 , / . If (cid:107) S − d s (cid:107) + L (cid:107) d x (cid:107) (cid:107) y (cid:107) µ ≤ κ, (13) then a i ( x + d x ) a i ( x ) ∈ [3 / , / for all i ∈ { , . . . , m } and (cid:12)(cid:12) ψ µ ( x ) + M ψ µ x ( d x ) − ψ µ ( x + d x ) (cid:12)(cid:12) ≤ L (cid:107) y (cid:107) ) (cid:107) d x (cid:107) + L (cid:107) d x (cid:107) (cid:107) y (cid:107) κ + µκ . Proof
First we aim to prove a i ( x + d x ) a i ( x ) ∈ [3 / , / v := d x / (cid:107) d x (cid:107) , g i ( θ ) := a i ( x + θv )and q i ( θ ) := sup ˆ θ ∈ [0 ,θ ] (cid:12)(cid:12)(cid:12) a i ( x + v ˆ θ ) − a i ( x ) (cid:12)(cid:12)(cid:12) for i ∈ { , . . . , m } . To obtain a contradiction assume q i ( ϑ ) > a i ( x )4 for some ϑ ∈ [0 , (cid:107) d x (cid:107) ] and i ∈ { , . . . , m } . Since a i is continuous it follows q i is continuous, and by the intermediate value theorem there exists some ˜ θ ∈ [0 , ϑ ] such that q i (˜ θ ) ∈ (cid:16) a i ( x )4 , a i ( x ) (cid:17) . Using that a i ( x ) has L -Lipschitz first derivatives and L -Lipschitzsecond derivatives on the set X we deduce that g i ( θ ) satisfies the same properties on the set[0 , ˜ θ ]. Applying Lemma 2 and (13) we deduce q i (˜ θ ) ≤ a i ( x )4 contradicting the earlier statementthat q i (˜ θ ) ∈ (cid:16) a i ( x )4 , a i ( x ) (cid:17) . We conclude a i ( x + d x ) a i ( x ) ∈ [3 / , / (cid:12)(cid:12)(cid:12) ψ µ ( x ) + M ψ µ x ( d x ) − ψ µ ( x + d x ) (cid:12)(cid:12)(cid:12) we provide some auxiliary bounds. Define β i := | ∇ a i ( x ) T d x | a i ( x ) + L (cid:107) d x (cid:107) a i ( x ) , for all i ∈ { , . . . , m } . Then we have, (cid:107) β (cid:107) = m (cid:88) i =1 (cid:18) | ∇ a i ( x ) T d x | a i ( x ) + L (cid:107) d x (cid:107) y i µ (cid:19) ≤ m (cid:88) i =1 (cid:32)(cid:18) | ∇ a i ( x ) T d x | a i ( x ) (cid:19) + (cid:18) L (cid:107) d x (cid:107) µ (cid:19) y i (cid:33) = 2 (cid:107) S − d s (cid:107) + 2 (cid:18) L (cid:107) d x (cid:107) µ (cid:19) (cid:107) y (cid:107) ≤ κ here the first equality uses 1 /a i ( x ) = y i /µ , the first inequality uses the fact that ( a + b ) ≤ a + b ), and the final inequality uses a + b ≤ ( a + b ) for a, b ≥
0. Hence, m (cid:88) i =1 β i ≤ (cid:107) β (cid:107) max i { β i } ≤ κ . (31)Observe, also by Taylor’s Theorem and the fact that f is Lipschitz on X that (cid:12)(cid:12)(cid:12)(cid:12) f ( x ) + 12 d x ∇ f ( x ) d x + ∇ f ( x ) T d x − f ( x + d x ) (cid:12)(cid:12)(cid:12)(cid:12) ≤ L (cid:107) d x (cid:107) . (32)Using Lemma 2 and Taylor’s Theorem with g i ( θ ) := a i ( x + θv ), h i ( θ ) := log( g i ( θ )), and v = d x (cid:107) d x (cid:107) , we get (cid:12)(cid:12)(cid:12)(cid:12) h i (0) + θh (cid:48) i (0) + θ h (cid:48)(cid:48) i (0) − h i ( θ ) (cid:12)(cid:12)(cid:12)(cid:12) ≤ θ ˆ θ ∈ [0 ,θ ] h (cid:48)(cid:48)(cid:48) i (ˆ θ ) ≤ (cid:18) L θ + 6 L θ β i g (0) + 5 β i (cid:19) . (33)We can now bound the quality of a second-order Taylor series expansion of ψ µ as (cid:12)(cid:12) ψ µ ( x ) + M ψ µ x ( d x ) − ψ µ ( x + d x ) (cid:12)(cid:12) ≤ L (cid:107) d x (cid:107) + µ m (cid:88) i =1 (cid:18) L (cid:107) d x (cid:107) + 6 L (cid:107) d x (cid:107) β i a i ( x ) + 5 β i (cid:19) ≤ L (cid:107) d x (cid:107) + m (cid:88) i =1 (cid:18) y i (cid:18) L (cid:107) d x (cid:107) L (cid:107) d x (cid:107) β i (cid:19) + µ β i (cid:19) ≤ L (cid:107) y (cid:107) ) (cid:107) d x (cid:107) + L (cid:107) d x (cid:107) (cid:107) y (cid:107) κ + µκ . The first inequality uses (32) and (33). The second inequality uses 1 /a i ( x ) = y i /µ . The thirdinequality uses β i ≤ κ and (31). (cid:3) B.3 Proof of Lemma 4
Lemma 4.
Suppose Assumption 2 holds. Let
Convex { x, x + } ⊆ X , s = a ( x ) , s + = a ( x + ) , S = diag ( a ( x )) , Y = diag ( y ) , y + ∈ R m , Y + = diag ( y + ) , d x = x + − x , d y = y + − y , and d s = ∇ a ( x ) d x . If the equation Sy + Sd y + Y d s = µ holds, then (cid:107) Y − d y (cid:107) ≤ (cid:107) S − d s (cid:107) + (cid:107) µ ( SY ) − − (cid:107) (14) (cid:107) Y + s + − µ (cid:107) ≤ (cid:107) Sy (cid:107) ∞ (cid:107) S − d s (cid:107) (cid:107) Y − d y (cid:107) + L (cid:107) y (cid:107) (1 + (cid:107) Y − d y (cid:107) ) (cid:107) d x (cid:107) . (15) Furthermore, if (cid:107) Y + s + − µ (cid:107) ∞ < µ and (cid:107) Y − d y (cid:107) ∞ ≤ then s + , y + ∈ R m ++ .Proof To show (14) notice that multiplying Sy + Sd y + Y d s = µ by ( SY ) − and rearrangingyields Y − d y = − S − d s + (( Sy ) − µ − ).Next, we show (15). Observe that s + i y + i − µ = a i ( x + d x )( y i + d y i ) − µ = ( d s i + a i ( x ))( y i + d y i ) + ( a i ( x + d x ) − ( d s i + a i ( x )))( y i + d y i ) − µ = d s i d y i + ( a i ( x + d x ) − ( d s i + a i ( x )))( y i + d y i ) , (34)where the first transition is by definition of s + i and y + i , the second transition comes from addingand subtracting ( d s i + a i ( x ))( y i + d y i ), and the third transition by substituting µ = s i y i + s i d y i + y i d s i = a i ( x ) y i + a i ( x ) d y i + y i d s i . Furthermore, since ∇ a i is L -Lipschitz continuous on X , | a i ( x + d x ) − ( d s i + a i ( x )) | = | a i ( x + d x ) − ( ∇ a i ( x ) d x + a i ( x )) | ≤ L (cid:107) d x (cid:107) , ombining this equality with (34) yields (cid:12)(cid:12) s + i y + i − µ (cid:12)(cid:12) ≤ | d s i d y i | + L y + i (cid:107) d x (cid:107) ≤ | s i y i | (cid:12)(cid:12) s − i d s i (cid:12)(cid:12)(cid:12)(cid:12) y − i d y i (cid:12)(cid:12) + L y i (1 + y − i d y i ) (cid:107) d x (cid:107) . We deduce (15) by Cauchy-Schwarz. The fact that y + ∈ R m + follows from (cid:107) Y − d y (cid:107) ∞ ≤
1. Thefact that y + , s + ∈ R m ++ follows from y + ∈ R m + and (cid:107) S + y + − µ (cid:107) ∞ < µ . (cid:3) B.4 Proof of Lemma 5
Lemma 5.
Suppose Assumption 2 holds. Let y, y + ∈ R m and Convex { x, x + } ⊆ X . Then thefollowing inequality holds: (cid:107) ∇ x L ( x, y ) + ∇ xx L ( x, y ) T d x − d Ty ∇ x a ( x ) − ∇ x L ( x + , y + ) (cid:107) ≤ L (cid:107) y (cid:107) (cid:107) d x (cid:107) (cid:107) Y − d y (cid:107) + L (cid:107) y (cid:107) + 1) (cid:107) d x (cid:107) (16) with d x = x + − x and d y = y + − y .Proof Observe that: (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:88) i (cid:0) y i ∇ a i ( x ) + y i ∇ a i ( x ) d x − d y i ∇ a i ( x ) − y + i ∇ a i ( x + ) (cid:1)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) ≤ (cid:88) i (cid:13)(cid:13) y i ∇ a i ( x ) + y i ∇ a i ( x ) d x + d y i ∇ a i ( x ) − y + i ∇ a i ( x + ) (cid:13)(cid:13) ≤ (cid:107) y (cid:107) (cid:13)(cid:13) ∇ a i ( x ) + ∇ a i ( x ) d x − ∇ a i ( x + ) (cid:13)(cid:13) + (cid:107) d y (cid:107) (cid:13)(cid:13) ∇ a i ( x ) − ∇ a i ( x + ) (cid:13)(cid:13) ≤ L (cid:107) y (cid:107) (cid:107) d x (cid:107) + L (cid:107) d y (cid:107) (cid:107) d x (cid:107) , where the first and second transition hold by the triangle inequality, the third transition applying(5) using the Lipschitz continuity of ∇ a and ∇ a . Next, by the triangle inequality, the inequalitywe just established, and Taylor’s theorem with Lipschitz continuity of ∇ f we get (cid:107) ∇ x L ( x, y ) + ∇ xx L ( x, y ) T d x − d Ty ∇ x a ( x ) − ∇ x L ( x + , y + ) (cid:107) ≤ (cid:13)(cid:13) ∇ f ( x ) + ∇ f ( x ) d x − ∇ f ( x + ) (cid:13)(cid:13) + (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:88) i (cid:0) y i ∇ a i ( x ) + y i ∇ a i ( x ) d x + d y i ∇ a i ( x ) − y + i ∇ a i ( x + ) (cid:1)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) ≤ L (cid:107) y (cid:107) + 1) (cid:107) d x (cid:107) + L (cid:107) d y (cid:107) (cid:107) d x (cid:107) . (35) (cid:3) B.5 Proof of Lemma 6
Lemma 6.
Consider g ∈ R n and a symmetric matrix H ∈ R m × n . Define ∆( u ) := u T Hu + g T u where ∆ : R n → R and let u ∗ ∈ argmin u ∈ B r ( ) ∆( u ) be an optimal solution to the trustregion subproblem for some r ≥ . Then there exists some δ ≥ such that: δ ( (cid:107) u ∗ (cid:107) − r ) = 0 , ( H + δ I ) u ∗ = − g, and H + δ I (cid:23) . (17) Conversely, if u ∗ satisfies (17) then u ∗ ∈ argmin u ∈ B r ( ) ∆( u ) . Let σ ( r ) := min u ∈ B r ( ) ∆( u ) ,then for all r ∈ [0 , ∞ ) we have σ ( r ) ≤ − δr σ ( r ) ≤ σ ( αr ) ≤ α σ ( r ) ∀ α ∈ [0 , . (18b) Furthermore, the function σ ( r ) is monotone decreasing and continuous. roof Equation (17) follows from the KKT conditions, see Sorensen [35, Lemma 2.4.], Connet al. [12, Corollary 7.2.2] or Nocedal and Wright [30, Theorem 4.3.]. We now show (18a).Substituting ( H + δ I ) u ∗ = − g into ( u ∗ ) T Hu ∗ + g T u ∗ yields σ ( r ) = ∆( u ∗ ) = 1 / g T u ∗ − δ/ (cid:107) u ∗ (cid:107) ≤ − δ/ (cid:107) u ∗ (cid:107) where the last inequality follows from g T u ∗ = − g T ( H + δ I ) − g ≤ δ = 0 or (cid:107) u ∗ (cid:107) = r we conclude (18a) holds. The inequality σ ( αr ) ≤ α σ ( r ) holds since σ ( αr ) ≤ ∆( αu ∗ ) = α ( u ∗ ) T Hu ∗ + αg T u ∗ ≤ α ( u ∗ ) T Hu ∗ + α g T u ∗ = α σ ( r ) where the inequality uses g T u ∗ ≤
0. The inequality σ ( r ) ≤ σ ( αr ) holds sinceany solution to (cid:107) u (cid:107) ≤ r is feasible to (cid:107) u (cid:107) ≤ αr . The fact that σ ( r ) is monotone decreasingand continuous follows from (18b). (cid:3) C Proofs of results in Section 5.1
C.1 Proof of Lemma 8
Lemma 8.
Suppose Assumptions 2 and 3 hold (Lipschitz derivatives, and sufficiently small µ ).Let x ∈ X , η s ∈ [0 , / , (ITRS) hold with η x = η s , and α = min (cid:110) , η s (cid:107) S − d s (cid:107) (cid:111) . Then x + ∈ X and ψ µ ( x + ) − ψ µ ( x ) ≤ µη s + max (cid:26) M ψ µ x ( d x ) , − η s µ (cid:27) . (23) Proof
Our first goal is to show for all α ∈ (0 ,
1] that M ψ µ x ( αd x ) ≤ max (cid:26) M ψ µ x ( d x ) , − η s µ (cid:27) . (36)Note (36) trivially holds if α = 1. Therefore let us consider the case α ∈ (0 , α = η s (cid:107) S − d s (cid:107) ≥ η s (cid:115) µL ( (cid:107) y (cid:107) + 1) (cid:107) d x (cid:107) − M ψ µ x ( d x ) ≥ η s (cid:115) µη s µ/ − M ψ µ x ( d x ) (37)where the first inequality uses (22), and the second inequality uses (cid:107) d x (cid:107) ≤ η s (cid:113) µL ( (cid:107) y (cid:107) +1) .Furthermore, if M ψ µ x ( d x ) ∈ (cid:104) − η s µ , (cid:105) from (37) we get α ≥ (cid:112) / >
1; by contradiction weconclude M ψ µ x ( d x ) (cid:54)∈ (cid:104) − η s µ , (cid:105) . Using M ψ µ x ( d x ) (cid:54)∈ (cid:104) − η s µ , (cid:105) and M ψ µ x ( d x ) ≤ M ψ µ x ( ) = 0(recall definition of d x in (ITRS)), we deduce M ψ µ x ( d x ) < − η s µ . Combining M ψ µ x ( d x ) < − η s µ with (37) yields α ≥ η s (cid:113) µ − M ψµx ( d x ) . Therefore, M ψ µ x ( αd x ) = α d Tx ∇ ψ µ ( x ) d x + α ∇ ψ µ ( x ) T d x ≤ α M ψ µ x ( d x ) ≤ − η s µ ∇ ψ µ ( x ) T d x ≤ α ≥ η s (cid:113) µ − M ψµx ( d x ) . Thus (36) holds.It remains to bound the accuracy of the predicted decrease M ψ µ x ( αd x ). Note that by α ∈ [0 , η x = η s we have (cid:107) αd x (cid:107) ≤ (cid:107) d x (cid:107) ≤ η s (cid:114) µL ( (cid:107) y (cid:107) + 1) = η x (cid:114) µL ( (cid:107) y (cid:107) + 1) . (38)Let us select κ = (21 / η s , this choice satisfies the premise of Lemma 3 because α (cid:107) S − d s (cid:107) + L (cid:107) αd x (cid:107) (cid:107) y (cid:107) µ ≤ η s + η s ≤ (21 / η s = κ (39) here the first inequality comes from α (cid:107) S − d s (cid:107) ≤ η s and (38), and the third inequality uses η s ∈ [0 , / η s ∈ [0 , /
5] we deduce κ ≤ / x + ∈ X . From Lemma 3, (cid:12)(cid:12) ψ µ ( x ) + M ψ µ x ( αd x ) − ψ µ ( x + αd x ) (cid:12)(cid:12) ≤ L (cid:107) y (cid:107) ) (cid:107) αd x (cid:107) + L (cid:107) αd x (cid:107) (cid:107) y (cid:107) κ + µκ ≤ L / µ − / (cid:107) y (cid:107) ) (cid:107) αd x (cid:107) + L (cid:107) αd x (cid:107) (cid:107) y (cid:107) κ + µκ ≤ (cid:18) × + (1 / )(21 / + (21 / (cid:19) µη s ≤ µη s (40)where the second inequality uses L µL ∈ (0 ,
1] from Assumption 3, the third inequality uses ourbound on (cid:107) αd x (cid:107) and κ , i.e., (38) and (39). Combining (36) and (40) gives (23). (cid:3) C.2 Proof of Lemma 9
Lemma 9.
Suppose (ITRS) , Assumptions 2 and 3 hold (direction selection, Lipschitz deriva-tives, and sufficiently small µ ). Let x ∈ X , η x ∈ (0 , ( τ l µL ) / ] , and α = 1 . Furtherassume M ψ µ x ( d x ) ≥ − τ l µr √ (cid:107) y (cid:107) . Under these assumptions, ( x + , y + ) satisfies (FJ1) and ∇ ψ µ ( x ) (cid:23) −√ τ l (1 + (cid:107) y (cid:107) ) (cid:113) τ l µη x L I .Proof First, let us bound (cid:107) S − d s (cid:107) : (cid:107) S − d s (cid:107) ≤ (cid:115) L ( (cid:107) y (cid:107) + 1) (cid:107) d x (cid:107) − M ψ µ x ( d x ) µ ≤ (cid:115) η x + 23 η x (cid:18) µτ l L (cid:19) / ≤ (cid:115) (cid:18) µτ l L (cid:19) / + 260 (cid:18) µτ l L (cid:19) / ≤ (cid:18) τ l µL (cid:19) / , where the first inequality uses (22), the second r = η x (cid:113) µL ( (cid:107) y (cid:107) +1) and M ψ µ x ( d x ) ≥ − τ l µr √ (cid:107) y (cid:107) ,the third inequality uses η x ∈ (0 , ( τ l µL ) / ], and the final inequality uses τ l µL ∈ (0 ,
1] from As-sumption 3.Let us compute κ from Lemma 3: (cid:107) S − d s (cid:107) + L (cid:107) d x (cid:107) (cid:107) y (cid:107) µ ≤ (cid:16) τ l µL (cid:17) / + η x ≤ = κ wherethe first inequality uses (cid:107) d x (cid:107) ≤ r = η x (cid:113) µL ( (cid:107) y (cid:107) +1) ≤ η x (cid:113) µL (cid:107) y (cid:107) and the second inequalityuses η x ≤ (cid:16) τ l µL (cid:17) / ∈ (0 , / Convex { x, x + } ⊆ X .Furthermore, by Lemma 4, the fact y = µS − , and our bound on (cid:107) S − d s (cid:107) we have (cid:107) Y − d y (cid:107) ≤ (cid:107) S − d s (cid:107) ≤ (cid:18) τ l µL (cid:19) / . herefore, (cid:107) δd x − ∇ x L ( x + , y + ) (cid:107) ≤ L (cid:107) d x (cid:107) (cid:107) y (cid:107) (cid:107) Y − d y (cid:107) + L (cid:107) y (cid:107) + 1) (cid:107) d x (cid:107) ≤ L (cid:107) d x (cid:107) (cid:107) y (cid:107) (cid:107) Y − d y (cid:107) + L / µ − / (cid:107) y (cid:107) + 1) (cid:107) d x (cid:107) ≤ η x τ l µ (cid:112) (cid:107) y (cid:107) (cid:18) τ l µL (cid:19) − / + L / µ / η x ≤ τ l µ (cid:112) (cid:107) y (cid:107)
100 + µτ l ≤ µτ l (cid:112) (cid:107) y (cid:107) ≤ µτ l (cid:112) (cid:107) y (cid:107) where the first inequality follows from Lemma 5, the second by L µL ∈ (0 , (cid:107) Y − d y (cid:107) ≤ (cid:16) τ l µL (cid:17) / that we just proved and (cid:107) d x (cid:107) ≤ η x (cid:113) µL ( (cid:107) y (cid:107) +1) , andthe fourth inequality using η x ∈ (0 , ( τ l µL ) / ].Next, we bound δ (cid:107) d x (cid:107) . By (12) there exists some δ ≥ ∇ x L ( x, y )+ ∇ xx L ( x, y ) T d x − d Ty ∇ x a ( x ) = δd x . Moreover, − τ l µr √ (cid:107) y (cid:107) ≤ M ψ µ x ( u ) ≤ − δr by (18a), so δ (cid:107) d x (cid:107) ≤ δr ≤ τ l µ (cid:112) (cid:107) y (cid:107) . Therefore using the bounds on δ (cid:107) d x (cid:107) and (cid:107) δd x − ∇ x L ( x + , y + ) (cid:107) that weproved, (cid:107) ∇ x L ( x + , y + ) (cid:107) ≤ (cid:107) δd x − ∇ x L ( x + , y + ) (cid:107) + δ (cid:107) d x (cid:107) ≤ τ l µ (cid:112) (cid:107) y (cid:107) + τ l µ (cid:112) (cid:107) y (cid:107) ≤ τ l µ (cid:112) (cid:107) y (cid:107) . This shows (FJ1.c) holds. It remains to show (FJ1.a) and (FJ1.b). From Lemma 4 we get (cid:107) S + y + − µ (cid:107) ≤ µ (cid:107) S − d s (cid:107) (cid:107) Y − d y (cid:107) + L (cid:107) y (cid:107) (1 + (cid:107) Y − d y (cid:107) ) (cid:107) d x (cid:107) ≤ µ (cid:18) τ l µL (cid:19) / + µη x ≤ µ (cid:18) τ l µL (cid:19) / = µτ c , where the second inequality uses (cid:107) Y − d y (cid:107) ≤ (cid:107) S − d s (cid:107) ≤ (cid:16) τ l µL (cid:17) / ≤ (cid:107) d x (cid:107) ≤ r = η x (cid:113) µL ( (cid:107) y (cid:107) +1) , and the third inequality η x ∈ (0 , ( τ l µL ) / ]. Therefore (FJ1) holds.Let v min be the eigenvector of ∇ ψ µ ( x ) corresponding to the minimum eigenvalue of ∇ ψ µ ( x ).Note that − τ l µr (cid:112) (cid:107) y (cid:107) ≤ M ψ µ x ( d x ) ≤ min {M ψ µ x ( rv min ) , M ψ µ x ( − rv min ) } ≤ λ min ( ∇ ψ µ ( x )) r λ min ( · ) denotes the minimum eigenvalue. Therefore λ min ( ∇ ψ µ ( x )) ≥ − τ l µ (cid:112) (cid:107) y (cid:107) r = − √ τ l (1 + (cid:107) y (cid:107) )3 (cid:114) τ l µη x L . (cid:3) C.3 Proof of Theorem 1
Theorem 1.
Suppose Assumptions 2 and 3 hold (Lipschitz derivatives, and sufficiently small µ ). Then Trust-IPM ( f, a, µ, τ l , L , η s , η x , x (0) ) with x (0) ∈ X and η s = 140 (cid:18) τ l µL (cid:19) / η x = η s , ( η -1) akes at most O (cid:32) ψ µ ( x (0) ) − ψ ∗ µ µ (cid:18) L µτ l (cid:19) / (cid:33) iterations to terminate with a ( µ, τ l , τ c ) -approximate second-order Fritz John point ( x + , y + ) , i.e., (FJ1) and (FJ2) hold.Proof Let x ∈ X be some iterate of the algorithm with corresponding direction d x . If − τ l µr √ (cid:107) y (cid:107) ≥ M ψ µ x ( d x ) then ψ µ ( x + αd x ) − ψ µ ( x ) ≤ µη s + max (cid:26) M ψ µ x ( d x ) , − η s µ (cid:27) ≤ µη s max η s − (cid:115) τ l µL , η s − η s (41)= µ (cid:18) τ l µL (cid:19) / max (cid:40)(cid:18) − × (cid:19) (cid:18) τ l µL (cid:19) / , (cid:18) τ l µL (cid:19) / − × (cid:41) ≤ µ (cid:18) τ l µL (cid:19) / (cid:18) − × (cid:19) = − µ × (cid:18) τ l µL (cid:19) / (42)where the first transition uses Lemma 8, the second transition uses M ψ µ x ( d x ) ≥ − τ l µr √ (cid:107) y (cid:107) = − µη s (cid:113) τ l µL , the third transition uses η s = (cid:16) τ l µL (cid:17) / , and the fourth transition uses τ l µL ∈ (0 , x, d x ) denote the current primal iterate and direction. Let ( x + , d x + ) denote the subse-quent primal iterate and direction. By Lemma 9 if − τ l µr √ (cid:107) y (cid:107) ≤ M ψ µ x ( d x ) then (FJ1) holds.Also, by Lemma 9 if − τ l µr √ (cid:107) y + (cid:107) ≤ M ψ µ x + ( d + x ) then ∇ ψ µ ( x + ) (cid:23) −√ τ l (1 + (cid:107) y + (cid:107) ) (cid:113) τ l µη x L I (cid:23)− √ τ l (1 + (cid:107) y + (cid:107) ) I , i.e., (FJ2) holds. Therefore if both − τ l µr √ (cid:107) y (cid:107) ≤ M ψ µ x ( d x ) and − τ l µr √ (cid:107) y + (cid:107) ≤ M ψ µ x + ( d + x ) the algorithm terminates.It remains to show that if either − τ l µr √ (cid:107) y (cid:107) > M ψ µ x ( d x ) or − τ l µr √ (cid:107) y + (cid:107) > M ψ µ x + ( d + x )then over these two iterations we reduce the function value by a constant quantity. First notethat even if M ψ µ x ( d x ) ≥ − τ l µr √ (cid:107) y (cid:107) we by M ψ µ x ( d x ) ≤ ψ µ ( x + αd x ) − ψ µ ( x ) ≤ µη s + max (cid:26) M ψ µ x ( d x ) , − η s µ (cid:27) ≤ µη s = 2 µ (cid:18) τ l µL (cid:19) / (43)where the first inequality follows from Lemma 8. The same equation applies replacing ( x, d x )with ( x + , d + x ). By applying (42) and (43) we can see that if over these two iterations thealgorithm did not terminate then ψ µ must have been reduced by at least µ (cid:18) × − (cid:19) (cid:18) τ l µL (cid:19) / = 7 µ × (cid:18) τ l µL (cid:19) / . To conclude note if the algorithm has not terminated across iterations 0 , . . . , K then letting x ( k ) be the k th x iterate, ψ µ ( x (0) ) − ψ ∗ µ ≥ (cid:80) K − k =0 ( ψ µ ( x ( k ) ) − ψ µ ( x ( k +1) )) ≥ K − × µ × (cid:16) τ l µL (cid:17) / ,rearranging to bound K gives the result. (cid:3) Proof of results in Section 5.2
The main purpose of this section is to prove Lemma 10. Before we prove this result in Section D.3we prove two auxiliary Lemmas. Lemma 13 is the convex version of Lemma 8 and Lemma 14 isthe convex version of Lemma 8.
D.1 Proof of Lemma 13
Lemma 13.
Suppose (ITRS) , Assumption 2 and 4 hold (direction selection, Lipschitz deriva-tives, and sufficiently small µ ). Let f be convex and each a i concave. Let x ∈ X , η x = θ (cid:16) τ l µL (cid:17) / , η s = θ (cid:16) τ l µL (cid:17) / , α = min (cid:110) , η s (cid:107) S − d s (cid:107) (cid:111) , and θ ∈ [0 , / . Then x + ∈ X and ψ µ ( x + αd x ) − ψ µ ( x ) ≤ µθ (cid:18) τ l µL (cid:19) / + max (cid:40) M ψ µ x ( d x ) , − θ µ (cid:18) τ l µL (cid:19) / (cid:41) . Proof
First we show M ψ µ x ( αd x ) ≤ max (cid:26) M ψ µ x ( d x ) , − η s µ (cid:27) , (44)which trivially holds if α = 1. Therefore let us consider the case α ∈ (0 , α = η s (cid:107) S − d s (cid:107) ≥ η s (cid:115) µ − M ψ µ x ( d x ) . (45)Therefore, M ψ µ x ( αd x ) = α d Tx ∇ ψ µ ( x ) d x + α ∇ ψ µ ( x ) T d x ≤ α M ψ µ x ( d x ) ≤ − η s µ ∇ ψ µ ( x ) T d x ≤ α ≥ η s (cid:113) µ − M ψµx ( d x ) , and the third inequality by (45). We conclude (44) holds.It remains to bound the accuracy of the predicted decrease M ψ µ x ( αd x ). Let us bound theconstant κ from Lemma 3, α (cid:107) S − d s (cid:107) + L (cid:107) αd x (cid:107) (cid:107) y (cid:107) µ ≤ θ (cid:0) τ l µ/L (cid:1) / + θ (cid:0) τ l µ/L (cid:1) / ≤ (7 / θ (cid:0) τ l µ/L (cid:1) / = κ (46)where the second inequality comes from α (cid:107) S − d s (cid:107) ≤ η s = θ (cid:16) τ l µL (cid:17) / and (cid:107) αd x (cid:107) ≤ θ (cid:0) τ l µ/L (cid:1) / (cid:113) µL ( (cid:107) y (cid:107) +1) ,the third inequality from θ ∈ [0 , / θ ∈ [0 , /
6] and τ l µ/L ∈ (0 ,
1] we deduce κ ≤ / x + ∈ X . Furthermore, from Lemma 3, (cid:12)(cid:12) ψ µ ( x ) + M ψ µ x ( αd x ) − ψ µ ( x + αd x ) (cid:12)(cid:12) ≤ L (cid:107) y (cid:107) ) (cid:107) d x (cid:107) + L (cid:107) d x (cid:107) (cid:107) y (cid:107) κ + µκ ≤ L / τ / l µ − / (1 / / (cid:107) y (cid:107) ) (cid:107) αd x (cid:107) + L (cid:107) αd x (cid:107) (cid:107) y (cid:107) κ + µκ ≤ µθ (cid:0) τ l µ/L (cid:1) / + µθ (cid:0) τ l µ/L (cid:1) / + (7 / µθ (cid:0) τ l µ/L (cid:1) / ≤ µθ (cid:0) τ l µ/L (cid:1) / (47)where the second inequality uses L µL τ l ∈ (0 , (cid:107) αd x (cid:107) and κ . Combining (44) and (47) gives the result. (cid:3) .2 Proof of Lemma 14 Lemma 14.
Suppose (ITRS) , Assumption 2 and 4 hold (Lipschitz derivatives, and sufficientlysmall µ ). Let f be convex and each a i concave. Let x ∈ X , η x ∈ (0 , ( µτ l L ) / ] , and α = 1 . Un-der these assumptions, if M ψ µ x ( d x ) ≥ − τ l µr √ (cid:107) y (cid:107) then ( x + , y + ) is an ( µ, τ l , τ c ) -approximatefirst-order Fritz John point.Proof First note, (cid:107) S − d s (cid:107) ≤ (cid:115) − M ψ µ x ( d x ) µ ≤ (cid:114) rτ l (cid:112) (cid:107) y (cid:107) = (cid:18) τ l η x µ L (cid:19) / ≤ (cid:32) (cid:18) τ l µL (cid:19) / (cid:33) / ≤ (cid:18) τ l µL (cid:19) / where the first transition uses (21), the second transition uses our assumed bound on M ψ µ x ( d x ),the third transition uses r = η x (cid:113) µL ( (cid:107) y (cid:107) +1) , and the fourth transition uses η x ∈ (0 , ( µτ l L ) / ].Let us compute κ from Lemma 3, (cid:107) S − d s (cid:107) + L (cid:107) y (cid:107) (cid:107) d x (cid:107) µ ≤ (cid:16) µτ l L (cid:17) / + η x ≤ = κ . Itfollows that Convex { x, x + } ⊆ X .By Lemma 4 and the fact y = µS − we have (cid:107) Y − d y (cid:107) ≤ (cid:107) S − d s (cid:107) ≤ (cid:16) µτ l L (cid:17) / . There-fore, (cid:107) δd x − ∇ x L ( x + , y + ) (cid:107) ≤ L (cid:107) d x (cid:107) (cid:107) y (cid:107) (cid:107) Y − d y (cid:107) + L (cid:107) y (cid:107) + 1) (cid:107) d x (cid:107) ≤ L (cid:107) d x (cid:107) (cid:107) y (cid:107) (cid:107) Y − d y (cid:107) + L / µ − / τ / l (cid:107) y (cid:107) + 1) (cid:107) d x (cid:107) ≤ τ l µ (cid:112) (cid:107) y (cid:107) (cid:18) µτ l L (cid:19) − / η x + L / µ / τ / l η x ≤ τ l µ (cid:112) (cid:107) y (cid:107)
10 + µτ l ≤ µτ l (cid:112) (cid:107) y (cid:107) where the first inequality follows from Lemma 5, the second by L µL τ l ∈ (0 , (cid:107) d x (cid:107) ≤ η x (cid:113) µL ( (cid:107) y (cid:107) +1) and (cid:107) Y − d y (cid:107) ≤ (cid:16) µτ l L (cid:17) / , and the fourth inequality using η x ∈ (0 , ( µτ l L ) / ].Now, by (12) there exists some δ ≥ ∇ x L ( x, y ) + ∇ xx L ( x, y ) T d x − d Ty ∇ x a ( x ) = δd x . Moreover, − τ l µr √ (cid:107) y (cid:107) ≤ M ψ µ x ( u ) ≤ − δr by (18a) which implies δ (cid:107) d x (cid:107) ≤ δr ≤ τ l µr (cid:112) (cid:107) y (cid:107) r = 23 τ l µ (cid:112) (cid:107) y (cid:107) . Therefore using the bounds on δ (cid:107) d x (cid:107) and (cid:107) δd x − ∇ x L ( x + , y + ) (cid:107) that we proved, (cid:107) ∇ x L ( x + , y + ) (cid:107) ≤ (cid:107) δd x − ∇ x L ( x + , y + ) (cid:107) + δ (cid:107) d x (cid:107) ≤ τ l µ (cid:112) (cid:107) y (cid:107) + τ l µ (cid:112) (cid:107) y (cid:107) ≤ τ l µ (cid:112) (cid:107) y (cid:107) . his shows (FJ1.c) holds. It remains to show (FJ1.a) and (FJ1.b). From Lemma 4 we get (cid:107) S + y + − µ (cid:107) ≤ µ (cid:107) S − d s (cid:107) (cid:107) Y − d y (cid:107) + L (cid:107) y (cid:107) (1 + (cid:107) Y − d y (cid:107) ) (cid:107) d x (cid:107) ≤ µ (cid:18) µτ l L (cid:19) / + µη x ≤ µ (cid:18) µτ l L (cid:19) / ≤ µτ c , where the second inequality uses (cid:107) Y − d y (cid:107) ≤ (cid:107) S − d s (cid:107) ≤ (cid:16) µτ l L (cid:17) / ≤ r , the third inequality uses η x ∈ (0 , ( µτ l L ) / ] and τ l µL ∈ (0 , (cid:3) D.3 Proof of Lemma 10
Lemma 10.
Suppose Assumption 2 and 4 hold (Lipschitz derivatives, and sufficiently small µ ).Let f be convex and each a i concave. Then Trust-IPM ( f, a, µ, τ l , L , η s , η x , x (0) ) with x (0) ∈ X and η x = θ (cid:18) τ l µL (cid:19) / η s = θ (cid:18) τ l µL (cid:19) / θ = 1 / , ( η -2) takes at most O (cid:32) ψ µ ( x (0) ) − ψ ∗ µ µ (cid:18) L τ l µ (cid:19) / (cid:33) iterations to terminate with a ( µ, τ l , τ c ) -approximate first-order Fritz John point ( x + , y + ) , i.e., (FJ1) holds.Proof Let x ∈ X be some iterate of the algorithm with corresponding direction d x . If M ψ µ x ( d x ) ≥ − τ l µr √ (cid:107) y (cid:107) then the algorithm terminates at the next iteration by Lemma 14.Therefore consider the case that − τ l µr √ (cid:107) y (cid:107) < M ψ µ x ( d x ). By Lemma 13 we have x + ∈ X .Furthermore, ψ µ ( x + ) − ψ µ ( x ) ≤ µθ (cid:18) τ l µL (cid:19) / + max (cid:40) M ψ µ x ( d x ) , − θ µ (cid:18) τ l µL (cid:19) / (cid:41) ≤ µθ (cid:18) τ l µL (cid:19) / max (cid:26) θ − , θ − θ (cid:27) = − µ (cid:18) τ l µL (cid:19) / M ψ µ x ( d x ) ≥ − τ l µr √ (cid:107) y (cid:107) = − µθ (cid:16) τ l µL (cid:17) / , and the final inequality comes from substituting θ = 1 / , . . . , K then letting x ( k ) be the k th x iterate, ψ µ ( x (0) ) − ψ ∗ µ ≥ (cid:80) K − k =0 ( ψ µ ( x ( k ) ) − ψ µ ( x ( k +1) )) ≥ Kµ (cid:16) τ l µL (cid:17) / ,rearranging to bound K gives the result. (cid:3) E Proof of results in Section 6
E.1 Proof of Lemma 15
Assumption 9 (Slater’s condition) . Suppose that there exists some
R > such that (cid:107)X (cid:107) ≤ R and there exists some z ∈ X , γ ∈ R ++ such that a ( z ) ≥ γ . Further assume there exists someconstant L > such that (cid:107) ∇ f ( x ) (cid:107) ≤ L for all x ∈ X . urthermore, in order to apply Slater’s condition we need µ to be sufficiently small: µ ≤ γ τ l R . (48)
Lemma 15.
Suppose that f is convex, a i is concave and Assumption 9 holds. If ( x + , y + ) is aFritz John point (i.e., (FJ1) holds) and (48) holds then (cid:13)(cid:13) y + (cid:13)(cid:13) ≤ mµ + 2 L Rγ .
Proof
Observe, (cid:13)(cid:13) y + (cid:13)(cid:13) ≤ a ( z ) T y + γ ≤ ( a ( x + ) + ∇ a ( x + )( z − x + )) T y + γ ≤ a ( x + ) T y + + (cid:107) ∇ a ( x + ) T y + (cid:107) Rγ ≤ mµ + (cid:0) L + τ l µ (cid:112) (cid:107) y + (cid:107) + 1 (cid:1) Rγ ≤ mµ + L Rγ + 12 (cid:113) (cid:107) y + (cid:107) + 1where the first inequality uses Assumption 9 which implies a ( z ) /γ ≥ and that (cid:107) y + (cid:107) = T y + ,the second inequality uses that a i is concave, the third inequality uses (cid:107) X (cid:107) ≤ R , the fourthinequality uses (FJ1) and Assumption 9, and the fifth inequality uses (48). It follows that1 + mµ + L Rγ ≥ (cid:13)(cid:13) y + (cid:13)(cid:13) + 1 − (cid:113) (cid:107) y + (cid:107) + 1 ≥ (cid:0)(cid:13)(cid:13) y + (cid:13)(cid:13) + 1 (cid:1) . (cid:3) E.2 Proof of Lemma 12
Lemma 12.
Let f be convex and each a i concave. Suppose that Assumption 5 holds. Let x (0) ∈ X . Then Annealed-IPM ( f, a, µ (0) , x (0) , (cid:15) ) takes at most (cid:16) O (1) + 6 m × w (cid:16) (cid:15) m (cid:17)(cid:17) log +2 (cid:18) mµ (0) (cid:15) (cid:19) + ψ µ (0) ( x (0) ) − ψ ∗ µ (0) µ (0) w ( µ (0) ) unit operations to return a point x ( k ) ∈ X with f ( x ( k ) ) − f ∗ ≤ (cid:15) .Proof Let J = (cid:108) log +2 (cid:16) mµ (0) (cid:15) (cid:17)(cid:109) . At this point if we apply Lemma 11 with µ = 0, we obtain f ( x ( J ) ) − f ∗ ≤ (cid:107) ∇ x L ( x ( J ) , y ( J ) ) (cid:107) R + m (cid:88) i =1 a i ( x ( J ) ) y ( J ) i ≤ (1+2) µ ( J ) m = 3 µ ( J ) m = 3 µ (0) − J ≤ (cid:15). Hence after J iterations we have found an (cid:15) -optimal solution. By Lemma 11 and Assumption 5, ψ µ ( j ) ( x ( j − ) − ψ ∗ µ ( j ) ≤ (cid:107) ∇ x L ( x ( j − , y ( j − ) (cid:107) R + m (cid:88) i =1 ( a i ( x ( j − ) y ( j − i − µ ( j ) ) ≤ µ ( j − m. Applying using this inequality in Assumption 5 we deduce that the number of unit operationsof each iteration j > O (1) + ψ µ ( j ) ( x ( j − ) − ψ ∗ µ ( j ) µ ( j ) w ( µ ( j ) ) ≤ O (1) + 3 µ ( j − mµ ( j ) w ( µ ( j ) ) ≤ O (1) + 6 m × w ( µ ( j ) ) ≤ O (1) + 6 m × w ( µ ( J ) ) . The second inequality uses µ ( j ) = µ ( j − . The final inequality uses that w is monotonedecreasing by Assumption 5. (cid:3) .3 Proof of Theorem 2 Theorem 2.
Suppose Assumption 2, 6 and 7 hold (Lipschitz derivatives, regularity conditions,and parameter settings). Let f be convex and each a i concave. Let x (0) ∈ X and (cid:107)X (cid:107) ≤ R .Define η s , η x by ( η -2) and set Generic-IPM ( f, a, µ, x ) := Trust-IPM ( f, a, µ, τ l , L , η s , η x , x ) inside Annealed-IPM . Then
Annealed-IPM ( f, a, µ (0) , x (0) , (cid:15) ) takes at most O (cid:32)(cid:32) m / (cid:18) L R ζ(cid:15) (cid:19) / + 1 (cid:33) log + (cid:18) mµ (0) (cid:15) (cid:19) + ψ µ (0) ( x (0) ) − ψ ∗ µ (0) µ (0) (cid:18) L R ζm µ (0) (cid:19) / (cid:33) unit operations to return a point x ( k ) ∈ X with f ( x ( k ) ) − f ∗ ≤ (cid:15) .Proof Let j be some iteration of Annealed-IPM . Our first goal is to show that (A7. µ (0) ),i.e., µ (0) = min (cid:110) L R ζm , L mRL √ ζ (cid:111) , implies the assumptions of Lemma 10, and Lemma 12 are metat iteration j .Recall that (A7. τ l ) states that τ l = mRζ / . In particular, τ l µL ∈ (0 ,
1] holds with µ = µ ( j ) by µ ≤ µ (0) ≤ L R ζm = L τ l and L µL τ l ∈ (0 ,
1] holds by µ ≤ µ (0) ≤ L mRL √ ζ = L τ l L . Thereforethe assumptions of Lemma 10 are met which implies each iteration of Annealed-IPM willterminate satisfying (FJ1). Therefore, (cid:107) ∇ x L ( x ( j ) , y ( j ) ) (cid:107) ≤ µ ( j ) τ l (cid:113) (cid:107) y ( j ) (cid:107) ≤ µ ( j ) τ l ζ / ≤ mµ ( j ) R where the first inequality uses (FJ1), the second inequality uses Assumption 6 and the finalinequality by (A7. τ l ). Therefore Assumption 5 holds which allows us to apply Lemma 12. Inparticular, w ( µ ( j ) ) = O (cid:32)(cid:18) τ l µ ( j ) L (cid:19) − / (cid:33) = O (cid:32)(cid:18) m µ ( j ) L R ζ (cid:19) − / (cid:33) where the second equality uses τ l = mRζ / . Substituting this into Lemma 12 yields the runtimebound. (cid:3) A two-phase method to find unscaled KKT points
F.1 Algorithm 3 definition
Algorithm 3
Two-phase IPM function
Two-Phase-IPM ( f, a, ε opt , ε inf , L , L , x (0) ) Output:
A status (
KKT if (KKT) holds and
INF if (INF1) holds) and a point ( x, t, y ). Phase-one.
Let µ ( P = ε inf ε opt , τ ( P l = min (cid:110) ε opt , (cid:113) L ε opt ε inf (cid:111) , t (0) = ε opt + max { min i − a i ( x (0) ) , } , and η satisfy ( η -1). if t (0) ≤ ε opt / then x ( P ← x (0) else ( x ( P , t ( P , y ( P , λ ( P , γ ( P ) ← Trust-IPM ( f P , a P , µ ( P , τ ( P l , L , η s , η x , ( x (0) , t (0) )) . if min i a i ( x ( P ) < − ε opt / then ( x, t, y ) ← ( x ( P , t ( P , y ( P / (cid:107) y ( P (cid:107) ) . return INF , ( x, t, y ) end ifend ifPhase-two. Let µ ( P = ε opt , τ ( P l = (cid:113) ε inf L +1) , and η satisfy ( η -1).( x ( P , y ( P ) ← Trust-IPM ( f, a P , µ ( P , τ ( P l , L , η s , η x , x ( P ) . if (cid:107) y ( P (cid:107) > /ε inf then ( x, t, y ) ← ( x ( P , ε opt , y ( P / (cid:107) y ( P (cid:107) ) return INF , ( x, t, y ) else ( x, t, y ) ← ( x ( P , ∅ , y ( P ) . return KKT , ( x, t, y ) end ifend function F.2 Proof of Claim 3
Claim 3.
Let x (0) ∈ R n . Suppose Assumption 8 and (24) holds. Let f be L -Lipschitz. Assume c, ∆ a , ∆ f , L , L ≥ , ε opt ∈ (cid:16) , m log + ( c/ε opt ) (cid:105) , ε inf ∈ (0 , L m ] and ε opt ∈ (0 , √ ε inf ] . Then Two-Phase-IPM ( f, a, ε opt , ε inf , L , L , x (0) ) takes at most O (cid:32) ∆ a (cid:32) L / ε / inf ε / opt + 1 ε inf ε opt (cid:33) + ∆ f ε opt (cid:18) L L ε opt ε inf (cid:19) / (cid:33) trust region subproblem solves to return a point ( x, t, y ) that satisfies either (KKT) or (INF1) .Proof Let ψ P µ ( P and ψ P µ ( P denote the log barrier for problems (PI) and (PII) respectively. et T := [0 , t (0) + ε opt /
2] represent the set of feasible values of t in phase-one. Now, ψ P µ ( P ( x (0) , t (0) ) − inf ( x,t ) ∈ ˜ X ( P × T ψ P µ ( P ( x, t )= sup ( x,t ) ∈ ˜ X ( P × T t (0) − t + µ ( P (cid:32) log (cid:18) tt (0) (cid:19) + log (cid:32) ε opt + t (0) − t ε opt (cid:33) + m (cid:88) i =1 log (cid:18) t − a i ( x ) t (0) − a i ( x (0) ) (cid:19)(cid:33) = O (cid:16) min i max {− a i ( x (0) ) , } + µ ( P m log + ( c/ε opt ) (cid:17) = O (∆ a )where the second transition uses that 0 ≤ t ≤ t (0) + ε opt / i max {− a i ( x (0) ) , } + ε opt and(24a), and the last transition uses µ ( P = ε inf ε opt = O ( ε opt ), ε opt ∈ (cid:16) , m log + ( c/ε opt ) (cid:105) and∆ a ≥ µ ( P = ε opt / ε opt ∈ (cid:16) , m log + ( c/ε opt ) (cid:105) and ∆ f ≥ ψ P µ ( P ( x ( P ) − inf x ∈ ˜ X ( P ψ P µ ( P ( x ) = O (cid:18) f ( x ( P ) − inf x ∈ ˜ X ( P f ( x ) + µ ( P m log + ( c/ε opt ) (cid:19) = O (∆ f ) . Recall Theorem 1 gives a bound on the iteration count of
Trust-IPM of O (cid:18) ψ µ ( x (0) ) − ψ ∗ µ µ (cid:16) L µτ l (cid:17) / (cid:19) .Substituting the appropriate values of τ l and µ from Algorithm 3 yields a bound of O (cid:32) a (cid:32) L / ε / inf ε / opt + 1 ε inf ε opt (cid:33) + ∆ f ε opt (cid:18) L L ε opt ε inf (cid:19) / (cid:33) trust region subproblem solves for Two-Phase-IPM .It remains to show either (KKT) or (INF1) is satisfied. Observe that after calling
Trust-IPM in phase-one we find a point satisfying the Fritz John conditions for the problem of minimizingthe infinity norm of the constraint violation, i.e., (cid:13)(cid:13)(cid:13)(cid:13)(cid:18) ∇ a ( x ( P ) T y ( P T y ( P − λ ( P − γ ( P (cid:19)(cid:13)(cid:13)(cid:13)(cid:13) ≤ ε inf (cid:118)(cid:117)(cid:117)(cid:117)(cid:116)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) y ( P λ ( P γ ( P (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) + 1 (49) a ( x ( P ) + t ( P ≥ (50)0 ≤ t ( P ≤ t (0) + ε opt / a i ( x ( P ) + t ( P ) y ( P i ≤ ε inf ε opt (52) t ( P λ ( P ≤ ε inf ε opt (53) (cid:16) ε opt t (0) − t ( P (cid:17) γ ( P ≤ ε inf ε opt (54) y ( P , λ ( P , γ ( P ≥ . (55)Consider the case that in phase-one the status is INF , in which case min i a i ( x ( P ) < − ε opt / t ( P > ε opt / t ( P > ε opt / λ ( P < ε inf .Therefore using (49), ε inf ∈ (0 ,
1] and we deduce (cid:13)(cid:13)(cid:13)(cid:13)(cid:18) ∇ a ( x ( P ) T y ( P T y ( P − − γ ( P (cid:19)(cid:13)(cid:13)(cid:13)(cid:13) ≤ ε inf (cid:114) (cid:107) ( y ( P (cid:107) + ε inf ≤ ε inf (cid:18)(cid:113) (cid:107) ( y ( P (cid:107) + 2 (cid:19) . (56) f (cid:107) y ( P (cid:107) < / / < γ ( P + 1 − T y ( P ≤ ε inf / ≤ /
2. Bycontradiction (cid:107) y ( P (cid:107) ≥ /
2. Using (cid:107) y ( P (cid:107) ≥ /
2, (56), and (52) we deduce (cid:107) ∇ a ( x ( P ) T y ( P (cid:107) (cid:107) y ( P (cid:107) ≤ ε inf ( a i ( x ( P ) + t ( P ) y ( P i (cid:107) y ( P (cid:107) ≤ ε inf ε opt . Observe that after calling
Trust-IPM in phase-two we find a point satisfying a ( x ( P ) > − ε opt y ( P i ( a i ( x ( P ) + ε opt ) ≤ ε opt ∀ i ∈ { , . . . , m } (cid:13)(cid:13)(cid:13) ∇ x L ( x ( P , y ( P ) (cid:13)(cid:13)(cid:13) ≤ ε opt (cid:114) ε inf L + 1) (cid:113)(cid:13)(cid:13) y ( P (cid:13)(cid:13) + 1 y ( P > . If (cid:107) y ( P (cid:107) < ε opt ε inf + L ε inf then using the fact that ε opt ∈ (0 , ε opt ∈ (0 , √ ε inf ] and L ≥ (cid:13)(cid:13)(cid:13) ∇ x L ( x ( P , y ( P ) (cid:13)(cid:13)(cid:13) ≤ ε opt (cid:115) ε inf L (cid:18) ε opt ε inf + 3 L ε inf (cid:19) + 1 ≤ ε opt (cid:107) y ( P (cid:107) ≥ ε opt ε inf + L ε inf then (cid:107) ∇ a ( x ( P ) T y ( P (cid:107) (cid:107) y ( P (cid:107) ≤ (cid:13)(cid:13) ∇ x L ( x ( P , y ( P ) (cid:13)(cid:13) + (cid:13)(cid:13) ∇ f ( x ( P ) (cid:13)(cid:13) (cid:107) y ( P (cid:107) ≤ ε opt (cid:107) y ( P (cid:107) / + ε opt (cid:107) y ( P (cid:107) + L (cid:107) y ( P (cid:107) ≤ ε inf and ( a i ( x ( P ) + ε opt ) y ( P i (cid:107) y ( P (cid:107) ≤ ε inf ε opt . Finally note that since y ( P i ( a i ( x ( P ) + ε opt ) ≤ ε opt and (cid:107) y ( P (cid:107) ≥ ε opt ε inf + L ε inf ≥ m wededuce min i a i ( x ( P ) ≤ ε opt min i (cid:18) y ( P i − (cid:19) ≤ − ε opt /
2. Hence (INF1) is satisfied with( x, t, y ) = (cid:16) x ( P , ε opt , y ( P (cid:107) y ( P (cid:107) (cid:17) . (cid:3)(cid:3)