Solving Linear Programs with Õ(√rank) Linear System Solves
Yin Tat Lee, University of Washington and Microsoft Research, [email protected]
Aaron Sidford, Stanford University, [email protected]
Abstract
We present an algorithm that, given a linear program with n variables, m constraints, and constraint matrix A, computes an ε-approximate solution in Õ(√rank(A) log(1/ε)) iterations with high probability. Each iteration of our method consists of solving Õ(1) linear systems and additional nearly linear time computation, improving by a factor of Ω̃((m/rank(A))^{1/2}) over the previous fastest method with this iteration cost due to Renegar (1988) [51]. Further, we provide a deterministic polynomial time computable Õ(rank(A))-self-concordant barrier function for the polytope, resolving an open question of Nesterov and Nemirovski (1994) [47] on the theory of "universal barriers" for interior point methods.

Applying our techniques to the linear program formulation of maximum flow yields an Õ(|E|√|V| log(U)) time algorithm for solving the maximum flow problem on directed graphs with |E| edges, |V| vertices, and integer capacities of size at most U. This improves upon the previous fastest polynomial running time of O(|E| min{|E|^{1/2}, |V|^{2/3}} log(|V|^2/|E|) log(U)) achieved by Goldberg and Rao (1998) [18]. In the special case of solving dense directed unit capacity graphs our algorithm improves upon the previous fastest O(|E| min{|E|^{1/2}, |V|^{2/3}}) running times achieved by Even and Tarjan (1975) [16] and Karzanov (1973) [22] and of Õ(|E|^{10/7}) achieved more recently by Mądry (2013) [39].

This paper is a journal version of the paper "Path-Finding Methods for Linear Programming: Solving Linear Programs in Õ(√rank) Iterations and Faster Algorithms for Maximum Flow" [34] and arXiv submissions [32, 33]. This paper contains several new results beyond these prior submissions. It provides the first proof of an Õ(r)-self-concordant barrier for all polytopes {x ∈ R^n : Ax ≥ b} with r = rank(A) that is polynomial time computable (as opposed to the pseudo-polynomial time computability of the universal barrier of [47]). Further, it provides new connections between the algorithms presented, the barrier analyzed, and ℓ_p Lewis weights [37, 9, 8].

Several components of [34, 32, 33] were not included in this journal version. Techniques for leveraging this paper to solve linear programs exactly are deferred to [32], and techniques for analyzing the error induced by approximate linear system solves are deferred to [33]. These techniques are fairly standard and general and are omitted from this paper for brevity. Further, techniques for reducing the cost of the linear systems found in [32] are also not included and have been improved in a sequence of recent work [35, 8, 2], and techniques for solving generalized minimum cost flow, as opposed to the more restricted minimum cost flow problem considered in this paper, are deferred to [33].

1 Introduction
Given a matrix A ∈ R^{m×n} and vectors b ∈ R^m and c ∈ R^n, solving a linear program

min_{x ∈ R^n : Ax ≥ b} c^⊤x    (1.1)

is a core algorithmic task for the theory and practice of computer science and operations research. Since Karmarkar's breakthrough result in 1984 [21], proving that interior point methods can solve linear programs in polynomial time for a relatively small polynomial, interior point methods have been an incredibly active area of research. Currently, the fastest asymptotic running times for solving (1.1) in many regimes are interior point methods. Previously, state-of-the-art interior point methods for solving (1.1) computed an ε-approximate solution in either Õ(√m log(1/ε)) iterations of solving linear systems [51] or Õ((m·rank(A))^{1/4} log(1/ε)) iterations of a more complicated but still polynomial time operation [56, 59, 61, 3].

However, in a breakthrough result of Nesterov and Nemirovski in 1994, they showed that there exists a universal barrier function that, if computable, would allow (1.1) to be solved in O(√rank(A) log(1/ε)) iterations [48]. Unfortunately, this barrier is more difficult to compute than solutions to (1.1), and despite this result, in many regimes the fastest interior point algorithms are still based on the Õ(√m log(1/ε)) iteration algorithm of Renegar from 1988.

In this paper we present a new interior point method that solves general linear programs in Õ(√rank(A) log(1/ε)) iterations, thereby matching the theoretical limit proved by Nesterov and Nemirovski up to polylogarithmic factors. Further, we show how to achieve this convergence rate while only solving Õ(1) linear systems and performing additional Õ(nnz(A)) work in each iteration. Our algorithm is easily parallelizable, and in the standard PRAM model of computation we achieve the first Õ(√rank(A) log(1/ε))-depth polynomial-work method for solving linear programs. Using state-of-the-art regression algorithms [43, 38], our linear programming algorithm has a running time of Õ((nnz(A) + (rank(A))^ω)√rank(A) log(1/ε)) where ω < 2.373 is the matrix multiplication constant [64]. Further, leveraging advances in solving sequences of linear systems, this running time is improvable to Õ((nnz(A) + rank(A)^2)√rank(A) log(1/ε)) [35].

We achieve our results through an extension of standard path following techniques for linear programming [51, 19] that we call weighted path finding. We study the weighted central path, i.e. a weighted variant of the standard logarithmic barrier function [55, 17, 41], which was used implicitly by Mądry [39] to achieve a breakthrough improvement to the running time for solving the unit-capacity maximum flow problem [39]. We provide a general analysis of the weighted central path, discuss tools for manipulating points along the path and changing the path, and leverage this to produce an efficiently computable path that converges in Õ(√rank(A) log(1/ε)) iterations.

Ultimately, we show that approximately following the central path re-weighted by variants of ℓ_p Lewis weights, a fundamental concept in Banach space theory that has recently found applications for solving ℓ_p regression, yields our desired running times.
We provide further intuition regarding these weighted central paths and show that the central path re-weighted by ℓ_p Lewis weights is the central path induced by an Õ(rank(A))-self-concordant barrier. Further, we show that the value, gradient, and Hessian of this barrier are all computable deterministically in polynomial time.

(Here and throughout the paper we use Õ(·) to hide factors polylogarithmic in m, n, U, |V|, |E|, and M. All approximate linear programming algorithms discussed in this paper can be leveraged to obtain exact solutions in weakly polynomial time through standard straightforward reductions (see e.g. [51]); this transformation replaces each log(1/ε) factor in running times with L, a parameter that is at most the number of bits needed to represent (1.1) but in many cases can be much smaller. We assume that A has no rows or columns that are all zero, as these can be remedied by trivially removing constraints or variables respectively or by immediately solving the linear program; therefore nnz(A) ≥ min{m, n}.)

The Lewis weight barrier constitutes the first barrier for polytopes whose self-concordance nearly matches that of the universal [49, 36] and entropic [5] barriers, neither of which is known to be deterministically or polynomial time computable. Previous methods for computing such barriers required random sampling and run in pseudo-polynomial time, i.e. have running times which depend polynomially (as opposed to polylogarithmically) on the desired accuracy [1].

To further demonstrate the efficacy of our proposed interior point method, we show that it yields provably faster algorithms for solving the maximum flow problem, one of the most well studied problems in combinatorial optimization [52]. By applying our interior point method to a linear program formulation of maximum flow and applying state-of-the-art solvers for symmetric diagonally dominant linear systems [54, 26, 27, 23, 31, 6, 30, 29] to implement the iterations, we achieve an algorithm on |V| node, |E| edge graphs with integer capacities in the range 1 to U that runs in time O(|E|√|V| log^{O(1)}(|V|) log(U)) with high probability. This improves upon the previous fastest polynomial running time of O(|E| min{|E|^{1/2}, |V|^{2/3}} log(|V|^2/|E|) log(U)) achieved in 1998 by Goldberg and Rao [18] for dense graphs. In the special case of solving dense unit capacity graphs our algorithm improves upon the previous fastest running times of O(|E| min{|E|^{1/2}, |V|^{2/3}}) achieved by Even and Tarjan in 1975 [16] and Karzanov in 1973 [22] and of Õ(|E|^{10/7}) achieved by Mądry [39] more recently. Further, our algorithm is easily parallelizable and, using [50, 30, 28], in the PRAM model we obtain an Õ(|E|√|V| log(U))-work, Õ(√|V|)-depth algorithm. Using the same technique, we also solve the minimum cost flow problem in time Õ(|E|√|V| log(M)) with high probability, where M is an upper bound on the absolute value of the integer costs and capacities, improving upon the previous fastest algorithm of Õ(|E|^{1.5} log(M)) due to Daitch and Spielman [11].

1.1 Previous Work

Linear programming is an extremely well studied problem with a long history. There are numerous algorithmic frameworks for solving linear programming problems, e.g. simplex methods [12], ellipsoid methods [24], and interior point methods [21].
Each method has a rich history and an impressive body of work analyzing the practical and theoretical guarantees of the methods. Here we only present the major improvements on the number of iterations required to solve (1.1) and discuss the asymptotic running times of these methods. For a more comprehensive history of linear programming and interior point methods we refer the reader to one of the many excellent references on the subject, e.g. [49, 66].

In 1984 Karmarkar [21] provided the first proof of an interior point method running in polynomial time. This method required O(m log(1/ε)) iterations, where the running time of each iteration was dominated by the time needed to solve a linear system of the form A^⊤DAx = y for some diagonal matrix D ∈ R^{m×m}_{>0} and some y ∈ R^n. Using low rank matrix updates and preconditioning, Karmarkar achieved a running time of O(m^{3.5} log(1/ε)) for solving (1.1), inspiring a long line of research into interior point methods.

In 1988 Renegar provided an improved O(√m log(1/ε)) iteration interior point method for solving (1.1). His method was based on the type of interior point methods known as path following methods, which solve (1.1) by incrementally minimizing f_t(x) := t·c^⊤x + φ(x), where φ : R^n → R is a barrier function such that φ(x) → ∞ as x tends to the boundary of the polytope and t is a parameter changed during the algorithm. Renegar's method used the log barrier φ_ℓ(x) := −∑_{i∈[m]} log([Ax − b]_i), which serves as the foundation for many modern interior point methods. As with Karmarkar's result, the running time of each iteration of this method was dominated by the time needed to solve a linear system of the form A^⊤DAx = y. Using a combination of techniques involving low rank updates, preconditioning, and fast matrix multiplication, the amortized complexity of each iteration was improved [58, 19, 49], yielding the previous best known running time of O(m^{1.5}n log(1/ε)) [57].
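To make this template concrete, the following is a minimal sketch (ours, for illustration only; Renegar's actual method and analysis differ in the step rule, initialization, and constants) of short-step path following with the log barrier. Each iteration performs one Newton step on f_t, i.e. solves one linear system in A^⊤DA, and then increases t:

```python
import numpy as np

def log_barrier_path_following(A, b, c, x, eps=1e-6):
    """Short-step path following on f_t(x) = t*c^T x - sum_i log([Ax - b]_i).

    Illustrative sketch: assumes x is strictly feasible (Ax > b) and stays in
    the short-step regime, so no line search or feasibility safeguards are
    included. Each iteration costs one linear system solve in A^T D A.
    """
    m, n = A.shape
    t = 1.0
    while m / t > eps:                    # duality-gap style stopping rule
        s = A @ x - b                     # slacks; positive while feasible
        grad = t * c - A.T @ (1.0 / s)    # gradient of f_t at x
        D = 1.0 / s ** 2                  # barrier Hessian is A^T D A
        x = x - np.linalg.solve(A.T @ (D[:, None] * A), grad)  # Newton step
        t *= 1.0 + 0.1 / np.sqrt(m)       # advance the path parameter
    return x
```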
In seminal work in 1994 [49], Nesterov and Nemirovski generalized this approach and showed that path-following methods can be applied to minimize any linear cost function over any convex set given a suitable barrier function. They introduced a measure of complexity of a barrier known as self-concordance and showed that given any ν-self-concordant barrier for the set, an Õ(√ν log(1/ε)) iteration method can be achieved. Further, they showed that for any convex set in R^n there exists an O(n)-self-concordant barrier, called the universal barrier function. Therefore, in theory, any such n-dimensional convex optimization problem can be solved in Õ(√n log(1/ε)) iterations. However, this result is traditionally considered to be primarily of theoretical interest, as the universal barrier function is difficult to compute. Given the possible algorithmic implications of faster interior point methods, e.g. the flow problems of this paper, obtaining a barrier with near-optimal self-concordance that is easy to minimize is a fundamental open problem.

In 1989, Vaidya [61] made an important breakthrough in this direction. He proposed two barrier functions related to the volume of certain ellipsoids and obtained O((m·rank(A))^{1/4} log(1/ε)) and O(rank(A) log(1/ε)) iteration linear programming algorithms [59, 61, 56]. Unfortunately, each iteration of these methods required computing the projection matrix D^{1/2}A(A^⊤DA)^{-1}A^⊤D^{1/2} for a positive diagonal matrix D ∈ R^{m×m}. This was slightly improved by Anstreicher [3], who showed it sufficed to compute the diagonal of this projection matrix. Unfortunately, neither of these methods yields faster running times than [57] unless m ≫ n, and neither is immediately amenable to taking full advantage of improvements in solving structured linear systems and thereby improving the running time for solving the maximum flow problem.

Year | Author | Number of Iterations | Nature of Iterations
1984 | Karmarkar [21] | Õ(m log(1/ε)) | Linear system solve
1986 | Renegar [51] | O(√m log(1/ε)) | Linear system solve
1989 | Vaidya [60] | O((m·rank(A))^{1/4} log(1/ε)) | Matrix inversion
1994 | Nesterov and Nemirovskii [49] | O(√rank(A) log(1/ε)) | Volume computation
— | This paper | Õ(√rank(A) log(1/ε)) | Õ(1) linear system solves

These results suggest that one can approach the Õ(√rank(A) log(1/ε)) bound achieved by the universal barrier only by paying more in each iteration. In this paper, we show that this is not the case. We provide a method that, up to polylogarithmic factors, matches the convergence rate of the universal barrier function while only having iterations of cost comparable to that of Karmarkar's [21] and Renegar's [51] algorithms.

1.2 Our Results

Our main result is provably faster algorithms which, given A ∈ R^{m×n}, b ∈ R^n, c ∈ R^m, l_i ∈ R ∪ {−∞}, and u_i ∈ R ∪ {+∞} for all i ∈ [m], solve linear programs of the following form:

OPT := min_{x ∈ R^m : A^⊤x = b, ∀i ∈ [m] : l_i ≤ x_i ≤ u_i} c^⊤x.    (1.2)

We assume throughout that A is non-degenerate, which we define as having full column rank and no rows that are all zero. Further, we assume that for all i ∈ [m] the set dom(x_i) := {x : l_i < x < u_i} is neither the empty set nor the entire real line, i.e. l_i < u_i and either l_i ≠ −∞ or u_i ≠ +∞, and we assume that the interior of the polytope, Ω° := {x ∈ R^m : A^⊤x = b, l_i < x_i < u_i}, is non-empty. (Typically (1.2) is written as Ax = b rather than A^⊤x = b. We chose this formulation to be consistent with the derivation of the self-concordant barrier in Section 5 and with the standard use of n to denote the number of vertices and m the number of edges in the linear program formulation of flow problems.)

Theorem 1 (Linear Programming). Given an interior point x_0 ∈ Ω° for linear program (1.2), the algorithm LPSolve (Algorithm 3) outputs x ∈ Ω° with c^⊤x ≤ OPT + ε with constant probability in O(√n log m · log(mU/ε) · T_w) work and O(√n log m · log(mU/ε) · T_d) depth, where U = max{‖1/(u − x_0)‖_∞, ‖1/(x_0 − l)‖_∞, ‖u − l‖_∞, ‖c‖_∞} and T_w and T_d are the work and depth needed to compute (A^⊤DA)^{-1}q for an input positive diagonal matrix D and vector q.

Note that (1.2) is the dual of (1.1) in the special case when u_i = ∞ for all i ∈ [m]. Consequently, in obtaining this result we solve (1.1) with the desired complexity (see Theorem 43). We consider this formulation with two-sided constraints, (1.2), as it directly encompasses the formulation of maximum flow and minimum cost flow as a linear program [11].
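For example, minimum cost flow is an instance of (1.2): take A to be the |E| × |V| edge-vertex incidence matrix, so A^⊤x = b enforces flow conservation, and let the capacities give the two-sided bounds 0 ≤ x_i ≤ u_i. A small sketch of this standard encoding (the graph, demands, and costs are hypothetical):

```python
import numpy as np

# Minimum cost flow in the form (1.2): variables x_e are edge flows,
# l_e = 0 and u_e = cap_e are the two-sided constraints, and A^T x = b
# is flow conservation for the |E| x |V| edge-vertex incidence matrix A.
edges = [(0, 1), (1, 2), (0, 2)]        # directed edges (tail, head)
caps  = np.array([2.0, 2.0, 1.0])       # upper bounds u (lower bounds l = 0)
costs = np.array([1.0, 1.0, 3.0])       # cost vector c

A = np.zeros((len(edges), 3))           # m x n with m = |E|, n = |V|
for e, (tail, head) in enumerate(edges):
    A[e, tail], A[e, head] = -1.0, 1.0  # flow leaves tail, enters head

b = np.array([-2.0, 0.0, 2.0])          # demands: route 2 units from 0 to 2
# The instance of (1.2): min c^T x  s.t.  A^T x = b,  0 <= x <= caps.
```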
Interestingly, while it is well known that all linear programs, including (1.2), can be written in standard form, all known transformations to put (1.2) in standard form would increase the rank of A, causing an Õ(√rank(A)) iteration algorithm to be too slow to improve the running time for solving the maximum flow problem. Using lower bound results of Nesterov and Nemirovski, it is not hard to see that any general barrier for (1.2) must have self-concordance Ω(m). In particular, Proposition 2.3.6 of [49] shows that if any vertex of an m-dimensional polytope belongs to k linearly independent (m−1)-dimensional facets, then the self-concordance of any barrier on Ω is at least k. Consequently, Theorem 1 corresponds to a method which converges at a rate faster than what would be predicted by standard interior point theory. That we solve (1.2) in o(√m log(1/ε)) iterations is critical for achieving our faster maximum flow results. Leveraging Theorem 1 we show the following:

Theorem 2 (Maximum Flow). Given a directed graph G = (V, E) with integral costs q ∈ Z^E and capacities c ∈ Z^E_{≥0} with ‖q‖_∞ ≤ M and ‖c‖_∞ ≤ M, we can compute a minimum cost maximum flow with constant probability in O(|E|√|V| log|E| log M) work and O(√|V| log|E| log M) depth.

We complement these results by designing a new barrier whose self-concordance nearly matches that of the universal barrier. We show that, specialized to (1.1), an idealized version of our algorithm corresponds to a path following scheme on a natural barrier induced by Lewis weights [37]. Formally, when A is non-degenerate we provide an O(rank(A) log(m))-self-concordant barrier such that its gradient and Hessian are polynomial time computable.

Theorem 3 (Nearly Universal Barrier). Let Ω° := {x : Ax > b} be non-empty for non-degenerate A ∈ R^{m×n}. There is an O(n log m)-self-concordant barrier ψ for Ω° such that for all ε > 0 and x ∈ Ω°, in O(mn^{ω−1} · log m · log(m/ε)) work and O(log m · log(m/ε)) depth it is possible to compute g ∈ R^n and H ∈ R^{n×n} with

‖g − ∇ψ(x)‖_{(∇²ψ(x))^{-1}} ≤ ε and (1 − ε)∇²ψ(x) ⪯ H ⪯ (1 + ε)∇²ψ(x).

To obtain these results we provide several additional tools of possible independent interest. In Section 4 we provide several algebraic facts regarding Lewis weights, and in Section B we provide several algorithms for computing Lewis weights in different contexts. Further, in Section C we provide results for a natural online learning problem which we leverage to handle approximation errors in our path finding schemes.

Ultimately, we hope the varied results of this paper will open the door towards developing even faster algorithms for convex programming more broadly. While the analysis in the paper is quite technical, ultimately the algorithms and heuristics it suggests, i.e. locally re-weighting the central path by Lewis weights (and in the case of maximum flow, effective resistance), are straightforward, and we hope they may be used more broadly.

1.3 Geometric Motivation
To motivate our approach, consider the slightly simplified problem of designing an Õ(√n log(1/ε)) = Õ(√rank(A) log(1/ε)) iteration algorithm for solving (1.1) for non-degenerate A, where the running time of each iteration is dominated by the time needed to solve a linear system A^⊤DAx = y for diagonal D ∈ R^{m×m}_{≥0}. The classic self-concordance theory for analyzing interior point methods established in [49] shows that it suffices to produce a simple enough Õ(n)-self-concordant barrier for the set Ω° := {x ∈ R^n | Ax > b}. This seminal work of Nesterov and Nemirovski showed that given any ν-self-concordant barrier for an open convex set K there is an Õ(√ν log(1/ε)) iteration interior point method, based on a technique known as path following, for minimizing linear functions over K. Further, the running time of each iteration is dominated by the time needed to compute a gradient of the barrier and approximately solve a linear system in its Hessian.

Definition 4 (Self-concordance). A convex, thrice continuously differentiable function φ : K → R is a ν-self-concordant barrier function for an open convex set K ⊂ R^n if the following conditions hold:

• lim_{i→∞} φ(x_i) = ∞ for all sequences x_i ∈ K converging to the boundary of K,
• |D³φ(x)[h, h, h]| ≤ 2|D²φ(x)[h, h]|^{3/2} for all x ∈ K and h ∈ R^n,
• |Dφ(x)[h]| ≤ √ν · |D²φ(x)[h, h]|^{1/2} for all x ∈ K and h ∈ R^n.

To achieve our goals, ideally we would produce an Õ(n)-self-concordant barrier function for the feasible region such that the resulting path following scheme has sufficiently low iteration costs. Unfortunately, as we have discussed, no such barrier is known to exist, all previous Õ(n)-self-concordant barriers are more difficult to evaluate than linear programming, and it would be unclear how to generalize such an approach to solving (1.2). Deferring this last issue to Section 1.4, here we describe how to overcome the first two issues and derive a deterministic polynomial-time computable barrier function with self-concordance Õ(n).

Our barrier function can be derived from the following intuition regarding interior point methods. At a high level, interior point methods address the key difficulty of linear programming, making progress in the presence of non-differentiable inequality constraints, by leveraging a barrier function φ which provides a local smooth approximation. These methods solve the linear program by performing Newton's method, i.e. solving a sequence of linear systems, which trades off the utility of minimizing the cost, c^⊤x, against staying away from the constraints, i.e. minimizing φ. Since these Newton steps correspond to minimizing linear functions over ellipsoids, and these ellipsoids come from the second-order approximations of the barrier function, interior point methods essentially approximate polytopes by a sequence of ellipsoids. Self-concordance can be viewed as a geometric condition that relates how well these ellipsoids approximate the domain. In particular, the following theorem shows that the second-order approximation of the barrier function at its minimizer well-approximates the domain.

Theorem 5 (Dikin Ellipsoid Rounding [46, Thm 4.2.6]). Given a ν-self-concordant barrier function φ for convex set K ⊂ R^n, let x_φ be the minimizer of φ and let E := {x ∈ R^n : (x − x_φ)^⊤∇²φ(x_φ)(x − x_φ) ≤ 1} be the Dikin ellipsoid.
Then E is a (ν + 2√ν)-rounding of K, i.e. E ⊆ K ⊆ (ν + 2√ν)E. Consequently, to obtain an Õ(n)-self-concordant barrier it is necessary to obtain ellipsoids that are Õ(n)-roundings of the domain. The maximum volume contained ellipsoid, or John ellipsoid, has this property.
Lemma 6 (John Ellipsoid Rounding [20]). For convex K ⊆ R^n and John ellipsoid J(K), i.e. the largest volume ellipsoid contained inside K, we have J(K) ⊆ K ⊆ nJ(K).
In contrast to other ellipsoids that yield approximation guarantees, e.g. the covariance matrix of the uniform distribution on the body [63], the John ellipsoid has the desirable property of being defined by a convex optimization problem and therefore can be computed in weakly polynomial time. There are multiple ways to express the John ellipsoid as the solution to a convex problem. Our barrier function is motivated by the following formulation, called D-optimal design.

Lemma 7 (Convex Formulation of John Ellipsoid [25]). For any A ∈ R^{m×n}, b ∈ R^m, and polytope interior Ω° = {x ∈ R^n : Ax > b}, the John ellipsoid equals {y ∈ R^n : (y − x_0)^⊤A^⊤S_{x_0}^{-1}W_0S_{x_0}^{-1}A(y − x_0) ≤ 1}, where {x_0 ∈ Ω°, w_0 ∈ R^m_{≥0}} is the saddle point of the following convex-concave problem:

min_{x∈Ω°} φ_∞(x) where φ_∞(x) = max_{∑ w_i = n, w_i ≥ 0} ln det(A^⊤S_x^{-1}WS_x^{-1}A)    (1.3)

where S_x and W are diagonal m × m matrices with [S_x]_ii := a_i^⊤x − b_i and W_ii := w_i.

Motivated by Theorem 5, a natural approach towards obtaining a polynomial time computable Õ(n)-self-concordant barrier would simply be to pick a barrier function for Ω° whose minimizer is the center of the John ellipsoid. The function φ_∞(x) of (1.3) is such a function, but unfortunately, simply inducing a Dikin ellipsoid that approximates the feasible region is insufficient for being a self-concordant barrier. A self-concordant barrier also needs to not change too quickly; however, φ_∞(x) is not even continuously differentiable. To see this, let J(Ω, x) be the maximum volume ellipsoid inside Ω centered at x and note that φ_∞(x) = −c log(vol(J(Ω, x))) for a universal constant c > 0. Consequently, for Ω = [−1, 1] we have φ_∞(x) = −c log(2(1 − |x|)), i.e. φ_∞ is only affected by one constraint at each point, except at 0, where it is non-differentiable.

To make φ_∞(x) smooth, we could apply the standard approach of adding a strongly concave term, i.e. a regularizer, to the objective function ln det(A^⊤S_x^{-1}WS_x^{-1}A). In general, if a smooth f(x, y) is strongly concave in y, then max_y f(x, y) is smooth in x. In fact, there are multiple ways to apply this approach to obtain a polynomial time computable nearly universal barrier function. For example, it can be shown that the following is an Õ(n)-self-concordant barrier function:

φ_r(x) := max_{∑_{i∈[m]} w_i = n, w_i ≥ 0} ln det(A^⊤S_x^{-1}WS_x^{-1}A) − (n/m)∑_{i∈[m]} w_i ln w_i − (n/m)∑_{i∈[m]} ln [S_x]_ii.

Lewis Weight Barrier: In this paper, we provide a more elegant barrier that we believe further elucidates the geometric structure of the problem. In Section 5, for all p > 0 we consider the function

φ_p(x) := max_{w ∈ R^m : w ≥ 0} f_p(x, w) if p ≥ 2 and φ_p(x) := min_{w ∈ R^m : w ≥ 0} f_p(x, w) if p ≤ 2,

where

f_p(x, w) := ln det(A^⊤S_x^{-1}W^{1−2/p}S_x^{-1}A) − (1 − 2/p)∑_{i∈[m]} w_i.

We show that the maximizing (p > 2) or minimizing (p < 2) weights w ∈ R^m_{≥0} for φ_p are the ℓ_p Lewis weights of the matrix S_x^{-1}A [37], and hence we call φ_p the Lewis weight barrier.

Lewis weights are fundamental in the theory of Banach spaces and a key tool for approximating a matrix in ℓ_p norms. They generalize a fundamental ℓ_2 measure of row importance known as leverage scores, which are defined for A ∈ R^{m×n} as σ(A) = diag(A(A^⊤A)^{-1}A^⊤), i.e. the diagonals of the orthogonal projection matrix onto the image of A.
For all p > 0 the ℓ_p Lewis weights of A are the unique vector w_p(A) which equals the leverage scores of W_p^{1/2−1/p}A for W_p = Diag(w_p(A)). Intuitively, the ℓ_p Lewis weight of a row i, w_p(A)_i, denotes the importance of the i-th row in the ℓ_p norm, and it is known that sampling Õ(n^{max{p/2, 1}}) rows of A with probabilities proportional to the ℓ_p Lewis weights and reweighting yields a matrix B such that with high probability ‖Bx‖_p ≈ ‖Ax‖_p multiplicatively for all x [4]. Recently, Cohen and Peng [9] studied Lewis weights in the context of solving ℓ_p regression, showed that Lewis weight computation can be written as a convex optimization problem for p ≥ 2, and provided a nearly constant iteration algorithm for computing Lewis weights for p ∈ (0, 4). In this paper we provide several complementary results regarding Lewis weights, including formulating their computation as a convex optimization problem for all p > 0 (Section 4) and providing additional algorithms for computing them (Section B). Further, we study the stability of Lewis weights under re-scalings and show that they induce ellipsoids that approximate the polytope Ω = {x ∈ R^n | ‖Ax‖_∞ ≤ 1} well for large p (Section 4). Leveraging this analysis, we show that the Lewis weight barrier for p = Θ(log m) is an O(n log m)-self-concordant barrier for Ω° (Section 5) and prove Theorem 3. The barrier φ_∞ is essentially the limit of φ_p as p → ∞, and consequently our analysis shows that the ℓ_{Θ(log m)} generalization of the John ellipsoid yields a nearly universal barrier.

1.4 Weighted Path Finding

Though the explanation of the previous section suffices to prove Theorem 3, it is unclear how to leverage this analysis to prove Theorem 1. As discussed, there is no O(n)-self-concordant barrier for the feasible region of (1.2), and even if this issue could be overcome, naively implementing such a method would require the expensive operation of computing Lewis weights. Computing these weights to high precision (or even certifying their properties) necessitates computing leverage scores, which naively yields iteration costs comparable to those of Vaidya's and Anstreicher's interior point methods [59, 61, 56, 3], i.e. slower than solving Õ(1) linear systems.

To overcome these issues we develop a scheme for dynamically re-weighting self-concordant barriers for the dom(x_i) in (1.2). We provide 1-self-concordant barriers φ_i for each dom(x_i) (see Section 3.1) and study the central path they induce, i.e. x_t for t > 0, where

x_t := argmin_{A^⊤x=b} f_t(x) where f_t(x) := t·c^⊤x + ∑_{i∈[m]} φ_i(x_i).    (1.4)

Self-concordance theory yields that ∑_{i∈[m]} φ_i(x_i) is an m-self-concordant barrier, and therefore this yields an Õ(√m) iteration method; we directly attempt to improve this bound.

To motivate our improvement, note that the performance of this method is highly dependent on the representation of (1.1). Duplicating a constraint, i.e. a row of A and the corresponding entries of b_i, l_i, and u_i, corresponds to doubling the contribution of some φ_i.
Repeating a constraint many times can actually slow down the convergence of standard path following methods: in a series of papers [13, 14, 44, 45, 42], it was shown that by carefully duplicating constraints on Klee-Minty cubes, standard interior point methods for the dual can take Ω(√m) iterations.

Since the weighting of the φ_i can affect convergence, we provide algorithms which dynamically re-weight the φ_i. We show that this can improve the convergence rate from Ω(√m) to Õ(√n). In Section 3 we study the weighted barrier function φ(x) = ∑_{i∈[m]} g_i(x)φ_i(x_i), where g : R^m_{≥0} → R^m_{≥0} is a weight function of the current point, and the weighted central path it induces, i.e.

x_t^g := argmin_{A^⊤x=b} f_t(x) where f_t(x) := t·c^⊤x + ∑_{i∈[m]} g_i(x)φ_i(x_i) for all t ≥ 0.

To obtain our improved running times, we investigate what properties of g(x) improve convergence. Standard analysis suggests that g should have small total size, i.e. ‖g(x)‖_1 = O(n), and should induce Newton steps that do not change the Hessian much. Optimizing weights for these conditions suggests that g(x) should be the ℓ_∞ Lewis weights of the local re-weighting of the constraint matrix. In the special case where all l_i = 0 and u_i = +∞ this recovers the motivation for φ_∞! Here, we run into the same issues discussed in Section 1.3, e.g. the instability of the John ellipsoid and of ℓ_∞ Lewis weights. Consequently, we consider the dual analog of the approach of Section 1.3 and let g be the ℓ_p Lewis weights for p = 1 − 1/log(4m), plus a fixed amount, and show these regularized Lewis weights have the desired properties. Interestingly, when l_i = 0 and u_i = +∞, ignoring the constant regularization, the x_t^g are dual to the central path induced by the ℓ_q Lewis weight barrier.

This reasoning yields a dual algorithm related to path following methods with the Lewis weight barrier: take a Newton step on x for fixed g(x), update g(x), update t, and repeat. All that remains is the issue of computing ℓ_p Lewis weights. To overcome this issue, we exploit that leverage scores, and consequently ℓ_p Lewis weights for small p, can be efficiently approximated, as was shown in the aforementioned exciting result [9]. In Section B we provide additional Lewis weight computation algorithms for all p, which we leverage to compute multiplicative approximations to ℓ_p Lewis weights in our methods. Unfortunately, this error is still too much for our methods to handle directly, as such large weight changes can greatly decrease centrality measures.

To overcome this final issue, rather than using the weighted barrier φ(x) = ∑_{i∈[m]} g_i(x)φ_i(x_i), where the weights g(x) depend on x directly, we instead maintain separate weights w ∈ R^m_{>0} and a current point x and use the barrier φ(x, w) = ∑_{i∈[m]} w_iφ_i(x_i). We then design a method that maintains the invariants that x is close to the minimum of φ(x, w) over A^⊤x = b and that w is multiplicatively close to g(x). Since each fixed w ∈ R^m_{≥0} induces a particular weighted central path, i.e. the minimizers of t·c^⊤x + φ(x, w), our method can be viewed as alternating between advancing along a weighted central path and changing the path. We call this technique path finding.
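For intuition about the Lewis weight computations discussed above, the defining fixed point w_p(A) = σ(W_p^{1/2−1/p}A) can be iterated directly. The sketch below is in the spirit of the Cohen–Peng iteration for p ∈ (0, 4) [9], though the dense-matrix implementation details here are ours and purely illustrative:

```python
import numpy as np

def lewis_weights(A, p, iters=50):
    """Approximate the l_p Lewis weights of A for 0 < p < 4.

    Iterates w_i <- (a_i^T (A^T W^{1-2/p} A)^{-1} a_i)^{p/2}; a fixed point
    satisfies w = sigma(W^{1/2 - 1/p} A), the defining property of Lewis
    weights, and Cohen and Peng [9] show rapid convergence for p < 4.
    """
    m, n = A.shape
    w = np.ones(m)
    for _ in range(iters):
        M = A.T @ (w[:, None] ** (1.0 - 2.0 / p) * A)         # A^T W^{1-2/p} A
        tau = np.einsum('ij,ij->i', A @ np.linalg.inv(M), A)  # a_i^T M^{-1} a_i
        w = tau ** (p / 2.0)
    return w

# For p = 2 the iteration is exact after one step: w = sigma(A), the
# leverage scores of A.
```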
We design this path-finding method in two steps. First, we show that we can take a Newton step on x and update w while improving centrality and not changing w too much. This requires care, as with the weighted barrier it is difficult to certify that Newton steps are stable, i.e. that they do not change points multiplicatively. To overcome this, we explicitly measure the centrality of our points by the size of the Newton step in a mixed norm of the form ‖·‖ = ‖·‖_∞ + C_norm‖·‖_W to keep track of both the standard measure of centrality and this multiplicative change. Second, we show that given a multiplicative approximation to g(x) and bounds on the change of g(x), we can maintain the invariant that g(x) is close to w multiplicatively without moving w too much. We formulate this as a general two-player game and provide an efficient solution in Section C.

By combining these insights and formulating minimum cost flow as a linear program, we prove Theorem 1 and Theorem 2. Measuring Newton step sizes with respect to the mixed norm helps explain how our method outperforms the self-concordance of the best barrier for (1.2). Self-concordance is based on ℓ_2 analysis, and lower bounds for self-concordance stem from the failure of ℓ_2 to approximate ℓ_∞. While ideally our methods might optimize over ℓ_∞ directly, ℓ_∞ is rife with degeneracies impairing this analysis. However, unconstrained minimization over a box is simple, and by working with this mixed norm and carefully choosing weights we take advantage of the simplicity of minimizing ℓ_∞ over most of the domain and only pay for the Õ(n)-self-concordance of a barrier for the subspace induced by the A^⊤x = b constraint.

After providing notation in Section 2, in Section 3 we provide our analysis of weighted path finding, in Section 4 we provide our analysis of Lewis weights, and in Section 5 we prove the self-concordance of the Lewis weight barrier. The proofs of Theorems 1, 2, and 3 are then given in Section B. Algorithms for computing Lewis weights and many technical details are deferred to the appendix. Note that throughout we made only limited attempts to reduce polylogarithmic factors.
2 Notation
Vector Operations:
We frequently apply scalar operations to vectors with the interpretation that these operations should be applied coordinate-wise, e.g. for x, y ∈ R^n we let x/y ∈ R^n with [x/y]_i := x_i/y_i, xy ∈ R^n with [xy]_i := x_iy_i, and log(x) ∈ R^n with [log(x)]_i = log(x_i) for all i ∈ [n].

Matrices: We call a matrix A non-degenerate if it has full column rank and no zero rows. We call a symmetric matrix B ∈ R^{n×n} positive semidefinite (PSD) if x^⊤Bx ≥ 0 for all x ∈ R^n and positive definite (PD) if x^⊤Bx > 0 for all x ≠ 0.

Matrix Operations:
For symmetric matrices A, B ∈ R^{n×n} we write A ⪯ B to indicate that x^⊤Ax ≤ x^⊤Bx for all x ∈ R^n and define ≺, ≻, and ⪰ analogously. For A, B ∈ R^{n×m}, we let A∘B denote the Schur product, i.e. [A∘B]_ij := A_ij·B_ij for all i ∈ [n] and j ∈ [m], and we let A^{(2)} := A∘A. We use nnz(A) to denote the number of nonzero entries in A.

Diagonals:
For A ∈ R^{n×n} we define diag(A) ∈ R^n with diag(A)_i = A_ii for all i ∈ [n], and for x ∈ R^n we define Diag(x) ∈ R^{n×n} as the diagonal matrix with diag(Diag(x)) = x. We often use upper case to denote a vector's associated diagonal matrix, e.g. X := Diag(x) and S := Diag(s).

Fundamental Matrices: For non-degenerate A we let P(A) := A(A^⊤A)^{-1}A^⊤ denote the orthogonal projection matrix onto A's image and σ(A) := diag(P(A)) denote A's leverage scores. We let Σ(A) := Diag(σ(A)), P^{(2)}(A) := P(A)∘P(A), Λ(A) := Σ(A) − P^{(2)}(A), and Λ̄(A) := Σ(A)^{-1/2}Λ(A)Σ(A)^{-1/2}. Λ(A) is a Laplacian matrix and Λ̄(A) is a normalized Laplacian matrix.

Norms: For PD A ∈ R^{n×n} we let ‖·‖_A denote the norm where ‖x‖_A := √(x^⊤Ax) for all x ∈ R^n. For positive w ∈ R^n_{>0} we let ‖·‖_w denote the norm where ‖x‖_w := √(∑_{i∈[n]} w_ix_i²) for all x ∈ R^n. For any norm ‖·‖ and matrix M, the induced operator norm of M is defined by ‖M‖ = sup_{‖x‖=1} ‖Mx‖.
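A dense-matrix sketch of these fundamental matrices (exposition only; forming P(A) explicitly is far more expensive than the solver-based access used by the algorithms in this paper). It also checks two standard facts: leverage scores sum to rank(A), and Λ(A), being a Laplacian, has zero row sums:

```python
import numpy as np

def fundamental_matrices(A):
    """Compute P(A), sigma(A), and Lambda(A) for non-degenerate A."""
    P = A @ np.linalg.solve(A.T @ A, A.T)   # orthogonal projection onto im(A)
    sigma = np.diag(P).copy()               # leverage scores sigma(A)
    Lam = np.diag(sigma) - P ** 2           # Sigma(A) - P^(2)(A), a Laplacian
    return P, sigma, Lam

A = np.random.randn(20, 5)
P, sigma, Lam = fundamental_matrices(A)
assert np.isclose(sigma.sum(), 5)           # sum of leverage scores = rank(A)
assert np.allclose(Lam @ np.ones(20), 0)    # Laplacian rows sum to zero
```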
Calculus: For a function of two vectors, i.e. g(x, y) ∈ R for all x ∈ R^n and y ∈ R^n, we let ∇_x g(a, b) ∈ R^n denote the gradient of g as a function of x for fixed y at (a, b) ∈ R^n × R^n, i.e. [∇_x g(a, b)]_i = (∂/∂x_i)g(a, b), and define ∇_y, ∇_xx, and ∇_yy analogously. For h : R^n → R^m and x ∈ R^n we let J_h(x) ∈ R^{m×n} denote the Jacobian of h at x, i.e. [J_h(x)]_ij := (∂/∂x_j)h(x)_i for all i ∈ [m] and j ∈ [n]. For f : R^n → R and x, h ∈ R^n we let Df(x)[h] denote the directional derivative of f in direction h at x, i.e. Df(x)[h] := lim_{t→0}[f(x + th) − f(x)]/t.

Convex Sets:
We call U ⊆ R^k convex if t·x + (1 − t)·y ∈ U for all x, y ∈ U and t ∈ [0, 1], and symmetric if x ∈ U ⇔ −x ∈ U. For all α > 0 and U ⊆ R^k we let αU := {x ∈ R^k | α^{-1}x ∈ U}. For all p ∈ [1, ∞] and r > 0 we call the symmetric convex set {x ∈ R^k | ‖x‖_p ≤ r} the ℓ_p ball of radius r.

Misc:
For a positive integer z we let [z] := {1, 2, ..., z}. We let 1_i denote the vector that has value 1 in coordinate i and 0 elsewhere. We use Õ to hide factors polylogarithmic in m, n, U, |V|, |E|, and M.

3 Weighted Path Finding

Here we introduce our weighted path finding scheme for solving (1.2). First we introduce the weighted central path (Section 3.2) and provide key properties of the path (Section 3.3) and of weight functions (Section 3.4). Assuming a weight function (shown to exist in Section 4.4), we then provide the main lemmas we need for an Õ(√rank(A) log(U/ε)) iteration weighted path following algorithm for (1.2). In Sections 3.5, 3.6, and 3.7 we study the effect of changing the path parameter, the point, and the weights, and in Section 3.8 we give our main subroutine for following the path.

3.1 Preliminaries

Recall that our goal is to efficiently solve (1.2), repeated below:

min_{x ∈ R^m : A^⊤x = b, ∀i ∈ [m] : l_i ≤ x_i ≤ u_i} c^⊤x.

Here A ∈ R^{m×n}, b ∈ R^n, c ∈ R^m, l_i ∈ R ∪ {−∞}, and u_i ∈ R ∪ {+∞}, and we assume that A is non-degenerate, that dom(x_i) := {x : l_i < x < u_i} is neither the empty set nor the entire real line for all i ∈ [m], and that the interior of the polytope, Ω° := {x ∈ R^m : A^⊤x = b, l_i < x_i < u_i}, is non-empty.

Rather than working directly with the different domains of the x_i, we take a slightly more general approach and let φ_i : dom(x_i) → R for all i ∈ [m] denote a 1-self-concordant barrier function for dom(x_i) (see Definition 4). In the remainder of the paper we will simply leverage that each φ_i is a 1-self-concordant barrier for dom(φ_i) and not use any further structure of the barriers or the domains. It is easy to show that such φ_i exist, and for completeness, below we provide an explicit 1-self-concordant barrier function for each possible dom(x_i):

• Case (1): l_i finite and u_i = +∞. We use a log barrier defined as φ_i(x) := −log(x − l_i). Here φ'_i(x) = −1/(x − l_i), φ''_i(x) = 1/(x − l_i)², and φ'''_i(x) = −2/(x − l_i)³, and therefore clearly |φ'''_i(x)| = 2(φ''_i(x))^{3/2}, |φ'_i(x)| = √(φ''_i(x)), and lim_{x→l_i^+} φ_i(x) = +∞.

• Case (2): l_i = −∞ and u_i finite. We use a log barrier defined as φ_i(x) := −log(u_i − x). Here φ'_i(x) = 1/(u_i − x), φ''_i(x) = 1/(u_i − x)², and φ'''_i(x) = 2/(u_i − x)³, and therefore clearly |φ'''_i(x)| = 2(φ''_i(x))^{3/2}, |φ'_i(x)| = √(φ''_i(x)), and lim_{x→u_i^−} φ_i(x) = +∞.

• Case (3): l_i and u_i both finite. We use a trigonometric barrier defined as φ_i(x) := −log cos(a_ix + b_i) for a_i = π/(u_i − l_i) and b_i = −(π/2)·(u_i + l_i)/(u_i − l_i). As x → u_i^− we have a_ix + b_i → π/2, and as x → l_i^+ we have a_ix + b_i → −π/2; therefore, in both cases φ_i(x) → +∞. Further, φ'_i(x) = a_i tan(a_ix + b_i), φ''_i(x) = a_i²/cos²(a_ix + b_i), and φ'''_i(x) = 2a_i³ sin(a_ix + b_i)/cos³(a_ix + b_i).
Therefore, |φ'_i(x)| = a_i|tan(a_ix + b_i)| ≤ a_i/|cos(a_ix + b_i)| = √(φ''_i(x)), and we have

|φ'''_i(x)| = |2a_i³ sin(a_ix + b_i)/cos³(a_ix + b_i)| ≤ 2a_i³/|cos(a_ix + b_i)|³ = 2(φ''_i(x))^{3/2}.

While there is a rich theory regarding self-concordance, we will primarily use only the following two lemmas regarding the φ_i: Lemma 8 bounds the change in the Hessian of φ_i and Lemma 9 bounds the gradient of φ_i.

Lemma 8 ([46, Theorem 4.1.6]). If s ∈ dom(φ_i) for i ∈ [m] and r := √(φ''_i(s))·|s − t| < 1, then t ∈ dom(φ_i) and

(1 − r)√(φ''_i(s)) ≤ √(φ''_i(t)) ≤ (1 − r)^{-1}√(φ''_i(s)).

Consequently, √(φ''_i(s)) ≥ 1/U where U is the diameter of dom(φ_i).

Lemma 9 ([46, Theorem 4.2.4]). φ'_i(x)·(y − x) ≤ 1 for all x, y ∈ dom(φ_i) and i ∈ [m].
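A direct transcription of these three barriers (a sketch; the returned functions vectorize coordinate-wise over numpy arrays, which is how they are used in the sketches below):

```python
import numpy as np

def barrier(l, u):
    """Return (phi, dphi, ddphi) for the 1-self-concordant barrier on (l, u),
    following the three cases above: log barriers for one-sided intervals
    and the trigonometric barrier for a finite interval."""
    if np.isfinite(l) and not np.isfinite(u):          # Case (1)
        return (lambda x: -np.log(x - l),
                lambda x: -1.0 / (x - l),
                lambda x: 1.0 / (x - l) ** 2)
    if not np.isfinite(l) and np.isfinite(u):          # Case (2)
        return (lambda x: -np.log(u - x),
                lambda x: 1.0 / (u - x),
                lambda x: 1.0 / (u - x) ** 2)
    a = np.pi / (u - l)                                # Case (3)
    b = -(np.pi / 2.0) * (u + l) / (u - l)
    return (lambda x: -np.log(np.cos(a * x + b)),
            lambda x: a * np.tan(a * x + b),
            lambda x: a ** 2 / np.cos(a * x + b) ** 2)
```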
3.2 The Weighted Central Path

Our path-finding algorithm maintains a feasible point x ∈ Ω°, weights w ∈ R^m_{>0}, and minimizes the following penalized objective function for increasing t and small w:

min_{A^⊤x=b} f_t(x, w) where f_t(x, w) := t·c^⊤x + ∑_{i∈[m]} w_iφ_i(x_i).    (3.1)

For every fixed set of weights w ∈ R^m_{>0}, the set of points x_w(t) = argmin_{x∈Ω°} f_t(x, w) for t ∈ [0, ∞) forms a path through the interior of the polytope that we call the weighted central path. We call x_w(0) a weighted center of Ω° and note that lim_{t→∞} x_w(t) is a solution to (1.2) (Lemma 41).

While all weighted central paths converge to a solution of the linear program, different paths may have different algebraic properties which either improve or impair the convergence of a path following scheme. Consequently, our algorithm alternates between advancing down a central path (i.e. increasing t), moving closer to the weighted central path (i.e. updating x), and picking a better path (i.e. updating the weights w). More formally, we assume we have a feasible point {x, w} ∈ {Ω° × R^m_{>0}} and a weight function g(x) : Ω° → R^m_{>0} such that for any point x ∈ Ω° the function g(x) returns a good set of weights that suggest a possibly better weighted path. Our algorithm then repeats the following: (1) if x is close to argmin_{y : A^⊤y=b} f_t(y, w), then increase t; (2) otherwise, use a projected Newton step to update x and move w closer to g(x).

In the remainder of this section we present how we measure the quality of a current feasible point {x, w} ∈ {Ω° × R^m_{>0}} and the quality of the weight function, and how, with a weight function, we control centrality. In Section 3.3 we derive and present both how we measure how close {x, w} is to the weighted central path and the step we take to improve this centrality, and in Section 3.4 we present how we measure the quality of a weight function, i.e. how good the weighted paths it finds are. The remaining subsections analyze controlling centrality under changes to x, w, and t.

3.3 Measuring Centrality

Here we explain how we measure the distance from x to the minimum of f_t(x, w) for fixed w, denoted δ_t(x, w). As δ_t(x, w) measures the proximity of x to the weighted central path, we call it a centrality measure of x and w. To motivate δ_t(x, w), we first compute a projected Newton step for x. For all x ∈ Ω°, we define φ(x) ∈ R^m by φ(x)_i = φ_i(x_i) for i ∈ [m], define φ'(x), φ''(x), and φ'''(x) analogously, and let Φ', Φ'', Φ''' denote their associated diagonal matrices. This yields

∇_x f_t(x, w) = t·c + wφ'(x) and ∇_xx f_t(x, w) = WΦ''(x).

Lemma 51 (proved in the appendix) shows that a Newton step for x is given by

h_t(x, w) = −(I − (WΦ''(x))^{-1}A(A^⊤(WΦ''(x))^{-1}A)^{-1}A^⊤)(WΦ''(x))^{-1}∇_x f_t(x, w)
         = −Φ''(x)^{-1/2}P_{x,w}W^{-1}Φ''(x)^{-1/2}∇_x f_t(x, w)    (3.2)

where

P_{x,w} := I − W^{-1}A_x(A_x^⊤W^{-1}A_x)^{-1}A_x^⊤ for A_x := Φ''(x)^{-1/2}A.    (3.3)

As with standard convergence analysis of interior point methods, we wish to keep the Newton step size in the Hessian norm, i.e. ‖h_t(x, w)‖_{wφ''(x)} = ‖√(φ''(x))h_t(x, w)‖_w, small, and the multiplicative change in the Hessian, ‖√(φ''(x))h_t(x, w)‖_∞, small. While in standard logarithmic barrier analysis, i.e. w_i = 1 for all i, we can bound the multiplicative change by the change in the Hessian norm (since ‖·‖_∞ ≤ ‖·‖_2), here we would like to use small weights, and this comparison would be insufficient. To track both these quantities simultaneously, we define the mixed norm for all y ∈ R^m by

‖y‖_{w+∞} := ‖y‖_∞ + C_norm‖y‖_w    (3.4)

for C_norm > 0 defined in Definition 12. Note that ‖·‖_{w+∞} is indeed a norm for w ∈ R^m_{>0}, as in this case both ‖·‖_∞ and ‖·‖_w are norms. However, rather than measuring centrality by the quantity

‖√(φ''(x))h_t(x, w)‖_{w+∞} = ‖P_{x,w}(∇_x f_t(x, w)/(w√(φ''(x))))‖_{w+∞},

we instead find it more convenient to use the following idealized form:

δ_t(x, w) := min_{η∈R^n} ‖(∇_x f_t(x, w) − Aη)/(w√(φ''(x)))‖_{w+∞}.

This definition is justified by the following lemma, which shows that these two quantities differ by at most a multiplicative factor of ‖P_{x,w}‖_{w+∞}.
Lemma 10. For any norm ‖·‖ and ‖y‖_Q := min_{η∈R^n} ‖y − Aη/(w√(φ''(x)))‖, we have

‖y‖_Q ≤ ‖P_{x,w}y‖ ≤ ‖P_{x,w}‖·‖y‖_Q,

and therefore for all {x, w} ∈ {Ω° × R^m_{>0}} we have

δ_t(x, w) ≤ ‖√(φ''(x))h_t(x, w)‖_{w+∞} ≤ ‖P_{x,w}‖_{w+∞}·δ_t(x, w).    (3.5)

Proof.
By definition, P_{x,w}y = y − Aη_y/(w√(φ''(x))) for some η_y ∈ R^n. Consequently,

‖y‖_Q = min_{η∈R^n} ‖y − Aη/(w√(φ''(x)))‖ ≤ ‖P_{x,w}y‖.

Further, letting η_q be such that ‖y‖_Q = ‖y − Aη_q/(w√(φ''))‖ and noting that P_{x,w}W^{-1}(Φ'')^{-1/2}A = 0 yields

‖P_{x,w}y‖ = ‖P_{x,w}(y − Aη_q/(w√(φ'')))‖ ≤ ‖P_{x,w}‖·‖y − Aη_q/(w√(φ''))‖ = ‖P_{x,w}‖·‖y‖_Q.

We summarize this section with the following definition.
Definition 11 (Centrality Measure). For {x, w} ∈ {Ω° × R^m_{>0}} and t ≥ 0, we let h_t(x, w) denote the projected Newton step for x on the penalized objective f_t, given by

h_t(x, w) := −(1/√(φ''(x)))·P_{x,w}(∇_x f_t(x, w)/(w√(φ''(x))))

where P_{x,w} is defined in (3.3). We measure the centrality of {x, w} by

δ_t(x, w) := min_{η∈R^n} ‖(∇_x f_t(x, w) − Aη)/(w√(φ''(x)))‖_{w+∞}    (3.6)

where for all y ∈ R^m we let ‖y‖_{w+∞} := ‖y‖_∞ + C_norm‖y‖_w for C_norm > 0 defined in Definition 12.
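A dense sketch of Definition 11 (illustrative only; dphi and ddphi apply φ'_i and φ''_i coordinate-wise, e.g. from the barrier sketch above, and the returned size is ‖√(φ''(x))h_t(x, w)‖_{w+∞}, which by Lemma 10 upper bounds δ_t(x, w)):

```python
import numpy as np

def newton_step_and_centrality(A, c, x, w, t, dphi, ddphi, C_norm):
    """Projected Newton step h_t(x, w) of (3.2) and the mixed-norm size of
    sqrt(phi''(x)) h_t(x, w), an upper bound on delta_t(x, w) by Lemma 10."""
    g  = t * c + w * dphi(x)              # grad_x f_t(x, w)
    d2 = ddphi(x)                         # phi''(x), coordinate-wise
    Ax = A / np.sqrt(d2)[:, None]         # A_x = Phi''(x)^{-1/2} A
    z  = g / (w * np.sqrt(d2))
    # P_{x,w} z = z - W^{-1} A_x (A_x^T W^{-1} A_x)^{-1} A_x^T z
    y  = z - (Ax @ np.linalg.solve(Ax.T @ (Ax / w[:, None]), Ax.T @ z)) / w
    h  = -y / np.sqrt(d2)                 # h_t(x, w); satisfies A^T h = 0
    size = np.max(np.abs(y)) + C_norm * np.sqrt(np.sum(w * y * y))
    return h, size                        # size = ||sqrt(phi'') h||_{w+inf}
```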
With the Newton step and centrality conditions defined, the specification of our algorithm becomes clearer. Our algorithm simply repeatedly (1) increases t provided δ_t(x, w) is small, (2) otherwise decreases δ_t(x, w) by setting x^{(new)} ← x + h_t(x, w), and (3) moves w^{(new)} towards g(x^{(new)}) for some weight function g(x) : Ω° → R^m_{>0}. To prove this algorithm converges, we need to show what happens to δ_t(x, w) when we change t, x, and w. At the heart of this paper is understanding what conditions we need to impose on the weight function g so that we can bound this change in δ_t(x, w) and hence achieve a fast convergence rate. In Lemma 14 we show that the effect of changing t on δ_t is bounded by C_norm and ‖g(x)‖_1, in Lemma 15 we show that the effect that a Newton step on x has on δ_t is bounded by ‖P_{x,g(x)}‖_{g(x)+∞}, and in Lemmas 16 and 17 we show that the change of w as g(x) changes is bounded by ‖G(x)^{-1}J_g(x)(Φ''(x))^{-1/2}‖_{g(x)+∞}.

For the remainder of the paper we assume we have a weight function g(x) : Ω° → R^m_{>0} and make the following assumptions regarding this weight function. In Section 4.4 we prove that one exists.

Definition 12 (Weight Function). A differentiable g : Ω° → R^m_{>0} is a (c_1, c_s, c_k)-weight function if the following hold for all x ∈ Ω° and i ∈ [m]:

• The size, c_1, satisfies c_1 ≥ max{1, ‖g(x)‖_1}. This bounds how quickly centrality changes as t changes.

• The sensitivity, c_s, satisfies c_s ≥ e_i^⊤G(x)^{-1}A_x(A_x^⊤G(x)^{-1}A_x)^{-1}A_x^⊤G(x)^{-1}e_i. This bounds how quickly the Hessian changes as x changes.

• The consistency, c_k, satisfies ‖G(x)^{-1}J_g(x)(Φ''(x))^{-1/2}‖_{g(x)+∞} ≤ 1 − c_k^{-1} < 1. This bounds how much the weights change as x changes, thereby governing how consistent the weights are with changes to x along the weighted central path.

Throughout we assume we have such a weight function and define C_norm := 24√(c_s)·c_k. To motivate the slack sensitivity, we show that it bounds ‖P_{x,w}‖_{w+∞}. This is used in Lemma 15.

Lemma 13. For any w such that g(x) ≤ w ≤ 2g(x), we have that ‖P_{x,w}‖_{w+∞} ≤ c_γ where c_γ := 1 + √(2c_s)/C_norm ≤ 1 + 1/(16c_k).

Proof.
Letting ‖I − P_{x,w}‖_{w→∞} := max_{‖z‖_w=1} ‖(I − P_{x,w})z‖_∞, we see that for any y ∈ R^m we have

‖P_{x,w}y‖_{w+∞} = ‖P_{x,w}y‖_∞ + C_norm‖P_{x,w}y‖_w
  ≤ ‖y‖_∞ + ‖(I − P_{x,w})y‖_∞ + C_norm‖y‖_w
  ≤ ‖y‖_∞ + (‖I − P_{x,w}‖_{w→∞} + C_norm)‖y‖_w
  ≤ (1 + ‖I − P_{x,w}‖_{w→∞}/C_norm)‖y‖_{w+∞},    (3.7)

where we used the fact that ‖P_{x,w}y‖_w ≤ ‖y‖_w for all y in the first inequality. Further, note that

‖I − P_{x,w}‖²_{w→∞} = max_{i∈[m]} max_{‖y‖_w≤1} (e_i^⊤(I − P_{x,w})y)² ≤ max_{i∈[m]} ‖e_i^⊤(I − P_{x,w})W^{-1/2}‖²_2 = max_{i∈[m]} e_i^⊤W^{-1}A_x(A_x^⊤W^{-1}A_x)^{-1}A_x^⊤W^{-1}e_i.

Since g(x) ≤ w ≤ 2g(x), we have that

‖I − P_{x,w}‖²_{w→∞} ≤ 2 max_{i∈[m]} e_i^⊤G(x)^{-1}A_x(A_x^⊤G(x)^{-1}A_x)^{-1}A_x^⊤G(x)^{-1}e_i ≤ 2c_s.    (3.8)

Combining (3.7) and (3.8) and using C_norm = 24√(c_s)·c_k yields the claims.

3.5 Changing t

Here we bound how much centrality increases as we increase t. We show that this rate of increase is governed by C_norm and ‖w‖_1.

Lemma 14.
For all {x, w} ∈ {Ω° × R^m_{>0}}, t > 0, and α ≥ 0, we have

δ_{(1+α)t}(x, w) ≤ (1 + α)δ_t(x, w) + α(1 + C_norm√(‖w‖_1)).

Proof.
Let η_t ∈ R^n be such that

δ_t(x, w) = ‖(∇_x f_t(x, w) − Aη_t)/(w√(φ''(x)))‖_{w+∞} = ‖(t·c + wφ'(x) − Aη_t)/(w√(φ''(x)))‖_{w+∞}.

Applying this to the definition of δ_{(1+α)t} and using that ‖·‖_{w+∞} is a norm then yields

δ_{(1+α)t}(x, w) = min_{η∈R^n} ‖((1 + α)t·c + wφ'(x) − Aη)/(w√(φ''(x)))‖_{w+∞}
  ≤ ‖((1 + α)t·c + wφ'(x) − (1 + α)Aη_t)/(w√(φ''(x)))‖_{w+∞}
  ≤ (1 + α)‖(t·c + wφ'(x) − Aη_t)/(w√(φ''(x)))‖_{w+∞} + α‖φ'(x)/√(φ''(x))‖_{w+∞}
  = (1 + α)δ_t(x, w) + α(‖φ'(x)/√(φ''(x))‖_∞ + C_norm‖φ'(x)/√(φ''(x))‖_w).

The result follows from the fact that |φ'_i(x)| ≤ √(φ''_i(x)) for all i ∈ [m] (Definition 4 with ν = 1), so that the first term is at most 1 and ‖φ'(x)/√(φ''(x))‖_w ≤ √(‖w‖_1).

3.6 Changing x

Here we analyze the effect of a Newton step on x on centrality. We show that for sufficiently central {x, w} ∈ {Ω° × R^m_{>0}} with w sufficiently close to g(x), Newton steps converge quadratically.
Lemma 15.
Let { x , w } ∈ { Ω ◦ × R m> } such that δ t ( x , w ) ≤ and g ( x ) ≤ w ≤ g ( x ) andconsider a Newton step x = x + h t ( x, w ) . Then, δ t ( x , w ) ≤ δ t ( x , w )) . Proof.
Let φ def = φ ( x ) and let φ def = φ ( x ) . By the definition of h t ( x , w ) and the formula of P x ,w
14e know that there is some η ∈ R n such that − (cid:113) φ (cid:48)(cid:48) h t ( x , w ) = t · c + wφ (cid:48) − A η w (cid:112) φ (cid:48)(cid:48) . Therefore, A η = t · c + wφ (cid:48) + wφ (cid:48)(cid:48) h t ( x , w ) . Recalling the definition of δ t this implies that δ t ( x , w ) = min η ∈ R n (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) t · c + wφ (cid:48) − A ηw (cid:112) φ (cid:48)(cid:48) (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) w + ∞ ≤ (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) t · c + wφ (cid:48) − A η w (cid:112) φ (cid:48)(cid:48) (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) w + ∞ ≤ (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) w ( φ (cid:48) − φ (cid:48) ) − wφ (cid:48)(cid:48) h t ( x , w ) w (cid:112) φ (cid:48)(cid:48) (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) w + ∞ = (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) ( φ (cid:48) − φ (cid:48) ) − φ (cid:48)(cid:48) h t ( x , w ) (cid:112) φ (cid:48)(cid:48) (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) w + ∞ . By mean value theorem φ (cid:48) − φ (cid:48) = φ (cid:48)(cid:48) ( θ ) h t ( x , w ) for θ between x and x coordinate-wise. Hence, δ t ( x , w ) ≤ (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) φ (cid:48)(cid:48) ( θ ) h t ( x , w ) − φ (cid:48)(cid:48) h t ( x , w ) (cid:112) φ (cid:48)(cid:48) (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) w + ∞ = (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) ( φ (cid:48)(cid:48) ( θ ) − φ (cid:48)(cid:48) ) (cid:112) φ (cid:48)(cid:48) (cid:112) φ (cid:48)(cid:48) ( (cid:113) φ (cid:48)(cid:48) h t ( x , w )) (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) w + ∞ ≤ (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) φ (cid:48)(cid:48) ( θ ) − φ (cid:48)(cid:48) (cid:112) φ (cid:48)(cid:48) (cid:112) φ (cid:48)(cid:48) (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) ∞ · (cid:13)(cid:13)(cid:13)(cid:13)(cid:113) φ (cid:48)(cid:48) h t ( x , w ) (cid:13)(cid:13)(cid:13)(cid:13) w + ∞ . (3.9)To bound the first term, we use Lemma 8 as follows (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) φ (cid:48)(cid:48) ( θ ) − φ (cid:48)(cid:48) (cid:112) φ (cid:48)(cid:48) (cid:112) φ (cid:48)(cid:48) (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) ∞ ≤ (cid:13)(cid:13)(cid:13)(cid:13) φ (cid:48)(cid:48) ( θ ) φ (cid:48)(cid:48) − (cid:13)(cid:13)(cid:13)(cid:13) ∞ · (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) (cid:112) φ (cid:48)(cid:48) (cid:112) φ (cid:48)(cid:48) (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) ∞ ≤ (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:18) − (cid:13)(cid:13)(cid:13)(cid:13)(cid:113) φ (cid:48)(cid:48) h t ( x , w ) (cid:13)(cid:13)(cid:13)(cid:13) ∞ (cid:19) − − (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) · (cid:18) − (cid:13)(cid:13)(cid:13)(cid:13)(cid:113) φ (cid:48)(cid:48) h t ( x , w ) (cid:13)(cid:13)(cid:13)(cid:13) ∞ (cid:19) − . Using (3.5), i.e. Lemma 10, the bound c γ ≤ (Lemma 13), and that δ t ( x , w ) ≤ yields (cid:13)(cid:13)(cid:13)(cid:13)(cid:113) φ (cid:48)(cid:48) h t ( x , w ) (cid:13)(cid:13)(cid:13)(cid:13) ∞ ≤ (cid:13)(cid:13)(cid:13)(cid:13)(cid:113) φ (cid:48)(cid:48) h t ( x , w ) (cid:13)(cid:13)(cid:13)(cid:13) w + ∞ ≤ c γ · δ t ( x , w ) ≤ . Using (cid:0) (1 − t ) − − (cid:1) · (1 − t ) − ≤ t for t ≤ / , we have (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) φ (cid:48)(cid:48) ( θ ) − φ (cid:48)(cid:48) (cid:112) φ (cid:48)(cid:48) (cid:112) φ (cid:48)(cid:48) (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) ∞ ≤ (cid:13)(cid:13)(cid:13)(cid:13)(cid:113) φ (cid:48)(cid:48) h t ( x , w ) (cid:13)(cid:13)(cid:13)(cid:13) ∞ . (3.10)Combining the formulas (3.9) and (3.10) yields that δ t ( x , w ) ≤ δ t ( x , w )) as desired. 
w In Section 3.6 we used the assumption that the weights, w , were multiplicatively close to g ( x ) , forthe current point x ∈ Ω ◦ . To maintain this invariant when we change x we will need to change w to move it closer to g ( x ) . Here we bound how much g ( x ) can move as we move x (Lemma 16) andwe bound how much changing w can decrease centrality (Lemma 17). Together these lemmas willallow us to show that we can keep w close to g ( x ) while still improving centrality (Section 3.8). Lemma 16.
For all t ∈ [0 , , let x t def = x + t ∆ for ∆ ∈ R m and g t = g ( x t ) such that x t ∈ Ω ◦ . Then or (cid:15) = (cid:107) (cid:112) φ (cid:48)(cid:48) ∆ (cid:107) g + ∞ ≤ we have (cid:107) log( g ) − log( g ) (cid:107) g + ∞ ≤ (1 − c k ( g ) − + 4 (cid:15) ) (cid:15) ≤ and for all s, t ∈ [0 , and for all y ∈ R m we have (cid:107) y (cid:107) g s + ∞ ≤ (1 + 2 (cid:15) ) (cid:107) y (cid:107) g t + ∞ . Proof.
Let q : [0 , → R m be given by q ( t ) def = log( g t ) for all t ∈ [0 , . Then, q (cid:48) ( t ) = G − t J g ( x t )∆ x .Letting Q ( t ) def = (cid:107) q ( t ) − q (0) (cid:107) g + ∞ and using Jensen’s inequality yields that for all u ∈ [0 , , Q ( u ) ≤ Q ( u ) def = (cid:90) u (cid:13)(cid:13)(cid:13) G − t J g ( x t )( Φ (cid:48)(cid:48) t ) − / (cid:13)(cid:13)(cid:13) g + ∞ (cid:13)(cid:13)(cid:13)(cid:13)(cid:113) φ (cid:48)(cid:48) t ∆ (cid:13)(cid:13)(cid:13)(cid:13) g + ∞ dt. Using Lemma 8 and (cid:15) ≤ , we have for all t ∈ [0 , , (cid:13)(cid:13)(cid:13)(cid:13)(cid:113) φ (cid:48)(cid:48) t ∆ x (cid:13)(cid:13)(cid:13)(cid:13) g + ∞ ≤ (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:115) φ (cid:48)(cid:48) t φ (cid:48)(cid:48) (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) ∞ (cid:13)(cid:13)(cid:13)(cid:13)(cid:113) φ (cid:48)(cid:48) ∆ x (cid:13)(cid:13)(cid:13)(cid:13) g + ∞ ≤ (cid:107) (cid:112) φ (cid:48)(cid:48) ∆ x (cid:107) g + ∞ − (cid:13)(cid:13)(cid:13)(cid:112) φ (cid:48)(cid:48) ∆ x (cid:13)(cid:13)(cid:13) ∞ ≤ (cid:15) − (cid:15) . Thus, we have Q ( u ) ≤ (cid:15) − (cid:15) (cid:90) u (cid:107) G − t J g ( x t )( Φ (cid:48)(cid:48) t ) − / (cid:107) g + ∞ dt. (3.11)Note that Q is monotonically increasing. Let θ = sup u ∈ [0 , { Q ( u ) ≤ (1 − c − k + 4 (cid:15) ) (cid:15) } . Since (cid:107) q ( t ) − q (0) (cid:107) ∞ ≤ Q ( θ ) ≤ and q ( t ) = log( g t ) , we know that for all s, t ∈ [0 , θ ] , we have (cid:13)(cid:13)(cid:13)(cid:13) g s − g t g t (cid:13)(cid:13)(cid:13)(cid:13) ∞ ≤ (cid:107) q ( s ) − q ( t ) (cid:107) ∞ + (cid:107) q ( s ) − q ( t ) (cid:107) ∞ and therefore (cid:107) g s /g t (cid:107) ∞ ≤ (1 + (cid:107) q ( s ) − q ( t ) (cid:107) ∞ ) ≤ (1 + (1 − c − k + 4 (cid:15) ) (cid:15) ) . Consequently, (cid:107) y (cid:107) g s + ∞ ≤ (1 + (1 − c − k + 4 (cid:15) ) (cid:15) ) (cid:107) y (cid:107) g t + ∞ ≤ (1 + 2 (cid:15) ) (cid:107) y (cid:107) g t + ∞ . Using (3.11), we have for all u ∈ [0 , θ ] , Q ( u ) ≤ Q ( u ) ≤ (cid:15) − (cid:15) (cid:90) u (cid:107) G − t J g ( x t )( Φ (cid:48)(cid:48) t ) − / (cid:107) g + ∞ dt ≤ (cid:15) − (cid:15) (cid:90) u (1 + 2 (cid:15) ) (cid:107) G − t J g ( x t )( Φ (cid:48)(cid:48) t ) − / (cid:107) g t + ∞ dt ≤ (cid:15) − (cid:15) (1 + 2 (cid:15) )(1 − c − k ) θ < (1 − c − k + 4 (cid:15) ) (cid:15). Consequently, θ = 1 and we have the desired result by the above bound on Q (1) . Lemma 17.
Let v, w ∈ R m> such that (cid:15) = (cid:107) log( w ) − log( v ) (cid:107) w + ∞ ≤ . Then for x ∈ Ω ◦ we have δ t ( x, v ) ≤ (1 + 4 (cid:15) )( δ t ( x, w ) + (cid:15) ) . Proof.
Let η w be such that δ t ( x, w ) = (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) c + wφ (cid:48) ( x ) − A η w w (cid:112) φ (cid:48)(cid:48) ( x ) (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) w + ∞ . (3.12)16he assumptions imply that (1 + (cid:15) ) − w i ≤ v i ≤ (1 + (cid:15) ) w i for all i and consequently δ t ( x, v ) = min η (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) c + vφ (cid:48) ( x ) − A ηv (cid:112) φ (cid:48)(cid:48) ( x ) (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) v + ∞ ≤ (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) c + vφ (cid:48) ( x ) − A η w v (cid:112) φ (cid:48)(cid:48) ( x ) (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) v + ∞ ≤ (1 + (cid:15) ) (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) c + vφ (cid:48) ( x ) − A η w v (cid:112) φ (cid:48)(cid:48) ( x ) (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) w + ∞ ≤ (1 + (cid:15) ) · (cid:32)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) c + wφ (cid:48) ( x ) − A η w v (cid:112) φ (cid:48)(cid:48) ( x ) (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) w + ∞ + (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) ( v − w ) φ (cid:48) ( x ) v (cid:112) φ (cid:48)(cid:48) ( x ) (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) w + ∞ (cid:33) ≤ (1 + (cid:15) ) δ t ( x, w ) + (1 + (cid:15) ) · (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) φ (cid:48) ( x ) (cid:112) φ (cid:48)(cid:48) ( x ) (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) ∞ · (cid:13)(cid:13)(cid:13)(cid:13) ( v − w ) v (cid:13)(cid:13)(cid:13)(cid:13) w + ∞ . The result follows from the fact that | φ (cid:48) i ( x ) | ≤ (cid:112) φ (cid:48)(cid:48) i ( x ) for all i ∈ [ m ] and x ∈ R m by Definition 4. The results of the previous sections imply an efficient linear programming algorithm provided theweight function can be computed efficiently to high-precision. Unfortunately such a weight compu-tation algorithm is unknown and instead only efficient approximate weight computation algorithmsare presently available (See Appendix B). Consequently, here we show how to improve centralityeven when the weight function is only computed approximately. Our algorithm is based on a solutionto the “ (cid:96) ∞ Chasing Game” summarized in Theorem 18 and proved in Appendix C.
Theorem 18 ( (cid:96) ∞ Chasing Game) . For x (0) , y (0) ∈ R m and (cid:15) ∈ (0 , / , consider the two playergame consisting of repeating the following for k = 1 , , . . .
1. The adversary chooses U ( k ) ⊆ R m , u ( k ) ∈ U ( k ) , and sets y ( k ) = y ( k − + u ( k ) .2. The adversary chooses z ( k ) with (cid:107) z ( k ) − y ( k ) (cid:107) ∞ ≤ R and reveals z ( k ) and U ( k ) to the player.3. The player chooses ∆ ( k ) ∈ (1 + (cid:15) ) U ( k ) and sets x ( k ) = x ( k − + ∆ ( k ) . Suppose that each U ( k ) is a symmetric convex set that contains an (cid:96) ∞ ball of radius r k and iscontained in a (cid:96) ∞ ball of radius R k ≤ R and the player plays the strategy ∆ ( k ) = arg min ∆ ∈ (1+ (cid:15) ) U ( k ) (cid:68) ∇ Φ µ ( x ( k − − z ( k ) ) , ∆ (cid:69) where Φ µ ( x ) def = (cid:88) i ∈ [ m ] (cid:0) e µx i + e − µx i (cid:1) and µ def = (cid:15) R . If Φ µ ( x (0) − y (0) ) ≤ mτ(cid:15) for τ = max k R k r k then this strategy guarantees that for all k we have Φ µ ( x ( k ) − y ( k ) ) ≤ mτ(cid:15) and (cid:107) x ( k ) − y ( k ) (cid:107) ∞ ≤ R(cid:15) log (cid:18) mτ(cid:15) (cid:19) . We can think of the problem of maintaining weights as playing this game; we want to keep w isclose to g ( x ) while the adversary controls the change in g ( x ) and the noise in approximating g ( x ) .Theorem 18 shows that we control the error (cid:96) ∞ if we can approximate g ( x ) in (cid:96) ∞ . Since we wishto maintain multiplicative approximations to g ( x ) we play this game with the log of w and g ( x ) .Formally, our goal is to not move w too much in (cid:107) · (cid:107) w + ∞ while keeping (cid:107) log( g ( x )) − log( w ) (cid:107) ∞ ≤ K K just small enough to not impair our ability to decrease δ t and approximate g . Algorithm 1: ( x (new) , w (new) ) = centeringInexact ( x, w, K ) R = K c k log(36 c c s c k m ) , δ = δ t ( x, w ) and (cid:15) = c k . x (new) = x − √ φ (cid:48)(cid:48) ( x ) P x,w (cid:18) tc − wφ (cid:48) ( x ) w √ φ (cid:48)(cid:48) ( x ) (cid:19) . Let U = { x ∈ R m | (cid:107) x (cid:107) w + ∞ ≤ (1 − c k ) δ } .Find z such that (cid:107) z − log( g ( x (new) )) (cid:107) ∞ ≤ R . w (new) = exp (cid:16) log( w ) + arg min u ∈ (1+ (cid:15) ) U (cid:68) ∇ Φ (cid:15) R ( z − log( w )) , u (cid:69)(cid:17) .In the following Theorem 19 we show that the above algorithm, centeringInexact , achievesprecisely these goals. The algorithm consists primarily of taking a projected Newton step, whichcorresponds to solving a linear system, and a projection onto U (step 5), which in Section D weshow can be done polylogarithmic depth and nearly linear work. Consequently, Theorem 19 is ourprimary subroutine for designing efficient linear programming algorithms in Section 6. Theorem 19.
Assume that K ≤ c k . Suppose that δ def = δ t ( x, w ) ≤ R and Φ µ (log( g ( x )) − log( w )) ≤ c c s c k m where µ = (cid:15) R and R = K c k log(36 c c s c k m ) . Let ( x (new) , w (new) ) = centeringInexact ( x, w, K ) , then δ t ( x (new) , w (new) ) ≤ (cid:18) − c k (cid:19) δ and Φ µ (log( g ( x (new) )) − log( w (new) )) ≤ c c s c k m. Proof.
By Lemma 16, inequality (3.5), c γ ( g ) ≤ / c k ( g )) (Lemma 13) and δ ≤ c k , we have (cid:107) log( g ( x (new) )) − log( g ( x )) (cid:107) g ( x )+ ∞ ≤ (1 − c k + 4 c γ δ ) · c γ δ ≤ (1 − c k ) · c γ δ + 5 δ ≤ (cid:18) − c k (cid:19) δ. (3.13)Using Φ µ (log( g ( x )) − log( w )) ≤ c c s c k m , the definition of µ and R , and K ≤ c k , we have (cid:13)(cid:13)(cid:13)(cid:13) w − g ( x ) g ( x ) (cid:13)(cid:13)(cid:13)(cid:13) ∞ ≤ (cid:107) log( w ) − log( g ( x )) (cid:107) ∞ + (cid:107) log( w ) − log( g ( x )) (cid:107) ∞ ≤ K ≤ c k . Hence, we have (cid:107) log( g ( x (new) )) − log( g ( x )) (cid:107) w + ∞ ≤ (cid:18) c k (cid:19) (cid:18) − c k (cid:19) δ ≤ (cid:18) − c k (cid:19) δ. Therefore, we know that for the Newton step, we have log( g ( x (new) )) − log( g ( x )) ∈ U where U isthe symmetric convex set given by U def = { x ∈ R m | (cid:107) x (cid:107) w + ∞ ≤ C } where C def = (1 − c k ) δ. Note thatfrom our assumption on δ , we have C ≤ δ ≤ K c k log(36 c c s c k m ) = R. and therefore U is contained in a (cid:96) ∞ ball of radius R . Therefore, we can play the (cid:96) ∞ chasing18ame on log( g ( x )) attempting to maintain the invariant that (cid:107) log( w ) − log( g ( x )) (cid:107) ∞ ≤ K withouttaking steps that are more than (cid:15) times the size of U where we pick (cid:15) = c k so to not interferewith our ability to decrease δ t linearly. To apply Theorem 18 we need to ensure that R satisfies R(cid:15) log (cid:0) mτ(cid:15) (cid:1) ≤ K where here τ is as defined in Theorem 18.To bound τ , we need to lower bound the radius of (cid:96) ∞ ball it contains. Since by assumption (cid:107) g ( x ) (cid:107) ≤ c ( g ) and (cid:107) log( g ( x (new) )) − log( w ) (cid:107) ∞ ≤ , we have that (cid:107) w (cid:107) ≤ c ( g ) . Hence, we have (cid:107) u (cid:107) ∞ ≥ c ( g ) (cid:107) u (cid:107) w for all u ∈ R m and consequently, if (cid:107) u (cid:107) ∞ ≤ δ C norm √ c ( g ) , then u ∈ U . Therefore, U contains a box of radius δ C norm √ c ( g ) and since U is contained in a box of radius δ , we have that τ ≤ C norm √ c ≤ √ c √ c s c k . where we used the fact that C norm = 24 √ c s c k . Using that (cid:15) = c k , we have that R(cid:15) log (cid:18) mτ(cid:15) (cid:19) ≤ Rc k log(36 c c s c k m ) = K and mτ(cid:15) ≤ c c s c k m . This proves that we meet the conditions of Theorem 18. Consequently, (cid:107) log( g ( x (new) )) − log( w (new) ) (cid:107) ∞ ≤ K and Φ µ ≤ c c s c k m .Since K ≤ , Lemma 15 and δ ≤ c k shows that δ t ( x (new) , w ) ≤ δ t ( x, w ) ≤ δ c k . (3.14)Step 5 shows that (cid:107) log( w ) − log( w (new) ) (cid:107) w + ∞ ≤ (cid:18) c k (cid:19) (cid:18) − c k (cid:19) δ ≤ (cid:18) − c k (cid:19) δ. Using this and (3.14), Lemma 17 shows that δ t ( x (new) , w (new) ) ≤ (cid:18) (cid:18) − c k (cid:19) δ (cid:19) (cid:18) δ t ( x (new) , w ) + (cid:18) − c k (cid:19) δ (cid:19) ≤ (cid:18) c k (cid:19) (cid:18) δ c k + (cid:18) − c k (cid:19) δ (cid:19) ≤ (cid:18) c k − (cid:18) c k − c k (cid:19)(cid:19) δ ≤ (cid:18) − c k (cid:19) δ. As discussed in Section 1.3 and Section 1.4, Lewis weights play a key role in designing a weightfunction needed to obtain our fastest linear programming algorithms and designing an efficientlycomputable self-concordant barrier. Formally, Lewis weights are defined as follows.
Definition 20 (Lewis Weight) . For all p > and non-degenerate A ∈ R m × n we define the (cid:96) p Lewisweight w ( p ) ( A ) as the unique vector w ∈ R m> such that w = σ ( W − p A ) where W = Diag ( w ) . The non-degeneracy assumptions on A are mild and made primarily for notational convenience. If A has a zerorow its corresponding Lewis weight is defined to be and if A is not full rank, much of the definitions and analysisstill apply where inverses and determinants are replaced with pseudoinverse and pseudodeterminants respectively.
19n this section we provide facts about Lewis weights that we use throughout the paper. Firstin Section 4.1 we show how Lewis weights can be written as the minimizer of a convex problem forall p > . This convex formulation is a critical for our self-concordant barrier construction. Thenin Section 4.1 we provide facts about the stability of Lewis weights which are critical for analyzingthe performance of this barrier. Further, in Section 4.3 we show that Lewis weights yield ellipsoidsthat are provably good approximations to the polytope Ω = { x ∈ R n | (cid:107) A x (cid:107) ∞ ≤ } . Finally, inSection 4.4 we use this analysis to show that Lewis weights yield a weight function in the context ofDefinition 12. In Appendix E we shed further light on Lewis weights and the algorithms we buildwith them showing that they interpolate between the natural uniform distribution over the rows ofthe matrix (i.e. p → ) and the weights that yield a John ellipse of Ω (i.e. p → ∞ ). Here we show that Lewis weights are the result of solving a particular convex optimization problemfor p > with p (cid:54) = 2 . This convex formulation relies on the following potential. Definition 21 (Volumetric Potential) . For non-degenerate A ∈ R m × n and p > with p (cid:54) = 2 wedefine the volumetric potential as V A p ( w ) def = − − p log det (cid:16) A (cid:62) W − p A (cid:17) . We will omit A and p when they are clear from context.The main result of this section is the following lemma, claiming that Lewis weights are theunique solution to min w i ≥ V A p ( w ) + (cid:80) i ∈ [ m ] w i , or equivalently, min w i ≥ , (cid:80) i ∈ [ m ] w i = n V A p ( w ) . Lemma 22.
For all non-degenerate A ∈ R m × n its (cid:96) p Lewis weights exist and are unique for p > .For p (cid:54) = 2 the weights w p ( A ) are the unique minimizer of the following equivalent convex problems: min w ∈ R m> V A p ( w ) + (cid:88) i ∈ [ m ] w i and min w ∈ R m> : (cid:80) i ∈ [ m ] w i = n V A p ( w ) . To prove Lemma 22 we first compute and bound the gradient and Hessian of V A p ( w ) . Lemma 23 (Gradient and Hessian of Volumetric Potential) . For all non-degenerate A ∈ R m × n , w ∈ R m> , and p > with p (cid:54) = 2 we have ∇V A p ( w ) = − W − σ w and ∇ V A p ( w ) = W − (cid:18) Σ w − (cid:18) − p (cid:19) Λ w (cid:19) W − where W def = Diag ( w ) , σ w def = σ ( W − p A ) , Σ w def = Σ ( W − p A ) , and Λ w def = P ( W − p A ). Conse-quently, V A p is convex in w and { p, } · W − Σ w W − (cid:22) ∇ V A p ( w ) (cid:22) { p, } · W − Σ w W − . (4.1) Proof.
The formulas for ∇V A p ( w ) and ∇ V A p ( w ) follow from a more general Lemma 50. Now, (cid:22) Λ w (cid:22) Σ w by Lemma 47; consequently W − Σ w W − (cid:22) ∇ V A p ( w ) (cid:22) p W − Σ w W − when p < and p W − Σ w W − (cid:22) ∇ V A p ( w ) (cid:22) W − Σ w W − when p > . In either case (4.1) holds.Using Lemma 23 we prove Lemma 22, our main result of this section.20 roof of Lemma 22. For all w ∈ R m> let f ( w ) = V A p ( w ) + (cid:80) mi =1 w i and σ w def = σ ( W − p A ) where W = Diag ( w ) . Lemma 23 and [ σ w ] i ≤ (Lemma 47) yields that for all w ∈ R m> if w i > then ∂f ( w ) ∂w i = 1 − [ σ w ] i w i ≥ − w i > . Hence, we have inf w i > f ( w ) = inf >w i > f ( w ) .Now, if p > and w i ∈ [0 , for all i ∈ [ m ] then since − p > σ w ] i = σ (cid:16) W − p A (cid:17) i = w − p i (cid:104) A ( A (cid:62) W − p A ) − A (cid:62) (cid:105) ii ≥ w − p i (cid:104) A ( A (cid:62) A ) − A (cid:62) (cid:105) ii = w − p i σ ( A ) i . (4.2)Since A is non-degenerate, σ ( A ) i ∈ (0 , for all i . Therefore for any j with w j < σ ( A ) p/ j , we have ∂f ( w ) ∂w i = 1 − [ σ w ] j w j ≤ − w − p j σ ( A ) j < and consequently, inf w i > f ( w ) = inf >w i ≥ f ( w ) = inf >w i >σ p/ i f ( w ) .Similarly, if p < , w i ∈ [0 , for all i ∈ [ m ] , and w min = min i ∈ [ m ] w i then since − p < we have W − p (cid:22) w − p min I . Consequently, by analogous derivation to (4.2) we have [ σ w ] i ≥ ( w i /w min ) − (2 /p ) σ ( A ) i . Consequently, if j ∈ arg min i ∈ [ m ] w i this implies that [ σ w ] j ≥ σ ( A ) j andtherefore if w j < σ ( A ) j we have ∂f ( w ) ∂w j < . Therefore, if we let σ min = min i ∈ [ m ] σ i > (since A isnon-degenerate) we have inf w i > f ( w ) = inf >w i ≥ σ ( A ) i f ( w ) .In either case, since f is continuous the above reasoning argues that f achieves its minimumon the interior of the domain and therefore we have that the minimizer of w ∗ of f ( w ) satisfies ∇ f ( w ∗ ) = 0 , i.e. [ w ∗ ] i = [ σ w ] i for all i ∈ [ n ] . This proves that the minimizer f ( w ) on w ∈ R m> exists and are Lewis weights. Further, Lemma 23 shows that ∇ f ( w ) (cid:23) p, · W − Σ w W − (cid:31) for all w > and therefore f is strictly convex where ≥ w i ≥ min( σ i , σ p/ i ) for all i . Consequently,the minimizer of f is unique and it is the unique point satisfying ∇ f ( w ) = 0 for w ∈ R n> . Further,since (cid:80) mi =1 [ σ w ] ii = n by Lemma 47 we have (cid:80) mi =1 w p ( A ) i = n and we have the desired equivalenceof the two given objective functions. Here we study the sensitivity of Lewis weight under rescaling A . In Lemma 24 we compute theJacobian of Lewis weights with respect to rescaling and in Lemma 25 we bound it. Lemma 24.
For all non-degenerate A ∈ R m × n , p > with p (cid:54) = 2 , and v ∈ R m. , let w ( v ) def = w p ( VA ) where V = Diag ( v ) . Then, for Λ v def = Λ (cid:16) W − p VA (cid:17) and W v def = Diag ( w ( v )) we have J w ( v ) = 2 W v (cid:18) W v − (cid:18) − p (cid:19) Λ v (cid:19) − Λ v V − . roof. Let f ( v, w ) def = − − p log det( A (cid:62) VW − p VA ) + (cid:80) mi =1 w i . Lemma 22 shows that w ( v ) = arg min w ∈ R m> f ( v, w ) and that the optimal is in the interior. Hence, the optimality conditions yield ∇ w f ( v, w ( v )) = 0 .Taking derivative with respect to v on both sides, we have that ∇ wv f ( v, w ( v )) + ∇ ww f ( v, w ( v )) J w ( v ) = 0 . Therefore, we have that J w ( v ) = − ( ∇ ww f ( v, w ( v ))) − ∇ wv f ( v, w ( v )) . (4.3)Lemma 23 and Lemma 50 yield that ∇ ww f ( v, w ) = W − (cid:18) Σ w − (cid:18) − p (cid:19) Λ w (cid:19) W − . (4.4)For ∇ wv f ( v, w ( v )) , we note that ∇ w f ( v, w ) = − W − σ ( W − p VA ) . Taking derivative withrespect to v and using Lemma 49 gives that [ ∇ wv f ( v, w )] ij = − w i Λ [ W − p VA ] ij · v − j . (4.5)Combining (4.3), (4.4), and (4.5), we have J w ( v ) = W v (cid:18) Σ w − (cid:18) − p (cid:19) Λ w (cid:19) − W v · W − v Λ v V − . The result follows from w ( v ) = σ ( W − p VA ) by Lemma 22. Lemma 25.
Under the setting of Lemma 24 for any v ∈ R m> and h ∈ R m , we have that (cid:107) W − v J w ( v ) h (cid:107) w ( v ) ≤ p · (cid:13)(cid:13) V − h (cid:13)(cid:13) w ( v ) , (4.6) and that (cid:107) [ W − v J w ( v ) − p V − ] h (cid:107) ∞ ≤ p · max (cid:110) p , (cid:111) · (cid:107) V − h (cid:107) w ( v ) . (4.7) Proof.
Fix an arbitrary v ∈ R m> and h ∈ R m and let w def = w ( v ) , Σ def = Σ ( W − p VA ) , Λ def = Λ ( W − p VA ) , ¯ Λ def = ¯ Λ ( W − p VA ) , and P (2) = P (2) ( W − p VA ) . By Lemma 24 and the fact that w def = w p ( VA ) = σ ( W − p VA ) ∈ R m> , we have that J w ( v ) h = 2 Σ (cid:18) Σ − (cid:18) − p (cid:19) Λ (cid:19) − ΛV − h = 2 Σ / Λ (cid:18) I − (cid:18) − p (cid:19) Λ (cid:19) − W / V − h (4.8)Consequently (4.6) follows from (cid:107) W − v J w ( v ) h (cid:107) w = (cid:107) Σ − / z (cid:107) = 2 (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) Λ (cid:18) I − (cid:18) − p (cid:19) Λ (cid:19) − W / V − h (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) ≤ p (cid:107) W / V − h (cid:107) Λ ( I − (1 − p ) Λ ) − is a symmetric matrix whose eigenvalues areof the form λ/ (1 − (1 − p ) λ ) for each eigenvalue of λ of Λ and since (cid:22) ¯ Λ (cid:22) I (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) Λ (cid:18) I − (cid:18) − p (cid:19) Λ (cid:19) − (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) ≤ max ≤ λ ≤ λ − (1 − p ) λ = p . (4.9)Next, note that Λ = Σ − P (2) and therefore I − ¯ Λ = W − / P (2) W − / . Combining this with(4.8) and that I − (cid:16) − p (cid:17) Λ is invertible yields [ W − J w ( v ) − p V − ] h = W − / (cid:20) Λ − p (cid:18) I − (cid:18) − p (cid:19) Λ (cid:19)(cid:21) (cid:18) I − (cid:18) − p (cid:19) Λ (cid:19) − W / V − h = p W − P (2) W − / (cid:18) I − (cid:18) − p (cid:19) Λ (cid:19) − W / V − h . However, by Lemma 47 we know that (cid:107) Σ − P (2) x (cid:107) ∞ ≤ (cid:107) x (cid:107) Σ = (cid:107) Σ / x (cid:107) for all x and therefore(4.7) follows from (cid:107) [ W − J w ( v ) − p V − ] h (cid:107) ∞ ≤ p (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:18) I − (cid:18) − p (cid:19) Λ (cid:19) − W / V − h (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) ≤ p · max (cid:110) , p (cid:111) · (cid:107) V − h (cid:107) w . To prove the inequality in the last step above, note that ( I − (1 − p ) Λ ) − is a symmetric matrixwhose eigenvalues are of the form / (1 − (1 − p ) λ ) for each eigenvalue of λ of Λ and since (cid:22) ¯ Λ (cid:22) I (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:18) I − (cid:18) − p (cid:19) Λ (cid:19) − (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) ≤ max ≤ λ ≤ − (1 − p ) λ = max (cid:110) , p (cid:111) . (4.10) Here we show that the Lewis weights for a matrix A ∈ R m × n provide ellipses that provably approx-imate the polytope Ω = { x ∈ R n |(cid:107) A x (cid:107) ∞ ≤ } . This bound helps analyze both the self-concordanceof the Lewis weight barrier and the efficacy as Lewis weights for the weighted central path. Themain result of this section is the more general Lemma 26 which relates every (cid:96) p Lewis weight to (cid:96) r Lewis weight for r ≥ p . In particular, it bounds how well (cid:96) p lewis weights w satisfy the optimalityconditions of being a (cid:96) r Lewis weight, i.e. the size of σ ( W − r A ) i w − i . Lemma 26.
For all non-degenerate A ∈ R m × n and w def = w p ( A ) for < p < r we have σ ( W − r A ) i w − i ≤ c p,r,m def = (1 + α ) α (cid:18)(cid:18) α (cid:19) m (cid:19) α α ≤ m α α where α = 2 p − r . Lemma 26 shows that (cid:96) p Lewis weights for large p yield an ellipse that well approximates Ω though the following simple lemma. Lemma 27.
For non-degenerate A ∈ R m × n , w def = w p ( A ) for p > , and W def = Diag ( w ) , define E def = { x ∈ R n : x (cid:62) A (cid:62) WA x ≤ } and K def = { x ∈ R n : (cid:107) A x (cid:107) ∞ ≤ } . hen R def = max i ∈ [ m ] σ ( W A ) i w − i is the smallest value such that E ⊆ √ RK.
Further, K is a √ c p, ∞ ,m n -rounding (See Lemma 26) as K ⊂ √ nE .Proof. Note that σ ( W A ) i w − i = e (cid:62) i A ( A (cid:62) WA ) − A (cid:62) e i and max x ∈ E (cid:107) A x (cid:107) ∞ = max i ∈ [ m ] ,x ∈ E (cid:16) e (cid:62) i A ( A (cid:62) WA ) − ( A (cid:62) WA ) x (cid:17) = max i ∈ [ m ] e (cid:62) i A ( A (cid:62) WA ) − A (cid:62) e i = R. Consequently, R is as desired. Further, for any x ∈ K we have x (cid:62) A (cid:62) WA x ≤ (cid:80) i ∈ [ m ] w i ≤ n andtherefore K ⊂ √ nE .Note that in Lemma 26 we have lim p →∞ c p, ∞ ,m = 1 . This suggest that that as p → ∞ theellipse E behaves like a John ellipse for Ω . Indeed, in Appendix E.2 we show that E converges tothe John ellipse of Ω .To prove Lemma 26 we first provide the following helper lemma analyzing the effect of changingthe power to which we might raise W in A (cid:62) WA . Lemma 28.
Let A ∈ R m × n , p > and w = w p ( A ) . Then, for any r ≥ p , we have that A (cid:62) W − r A (cid:22) A (cid:62) W − p A (cid:22) (1 + α ) (cid:18)(cid:18) α (cid:19) m (cid:19) α A (cid:62) W − r A where α = 2 p − r . Proof.
Since r ≥ p and w i ∈ (0 , for all i ∈ [ m ] we have that w − r i ≤ w − p i for all i ∈ [ m ] andtherefore A (cid:62) W − r A (cid:22) A (cid:62) W − p A .To prove the other direction, let (cid:15) > be arbitrary and let I w> (cid:15)m ∈ R m × m be the diagonalmatrix where [ I (cid:15) ] ii = 1 if w i > (cid:15)m and [ I (cid:15) ] ii = 0 otherwise and let I w ≤ (cid:15)m = I − I w> (cid:15)m . Note that tr (cid:104) ( A (cid:62) W − p A ) − A (cid:62) W − p I w ≤ (cid:15)m A (cid:105) = (cid:88) i ∈ [ m ]: w i ≤ (cid:15)m w i ≤ m · (cid:15)m = (cid:15) where we used that w = σ ( W − p A ) . Therefore, A (cid:62) W − p I w ≤ (cid:15)m A (cid:22) (cid:15) · A (cid:62) W − p A and hence A (cid:62) W − p A (cid:22) − (cid:15) A (cid:62) W − p I w> (cid:15)m A . (4.11)Now, we note that A (cid:62) W − p I w> (cid:15)m A ≺ (cid:16) m(cid:15) (cid:17) p − r A (cid:62) W − r I w> (cid:15)m A . (4.12)Combining (4.11) and (4.12), recalling α = p − r and choosing the minimizing (cid:15) = α α yields A (cid:62) W − p A (cid:22) − (cid:15) (cid:16) m(cid:15) (cid:17) α A (cid:62) W − r A = (1 + α ) α α α m α A (cid:62) W − r A . Using Lemma 28 we prove Lemma 26, the main result of this section.
Proof of Lemma 26.
Note that σ ( W − r A ) i w − i = e (cid:62) i A ( A (cid:62) W − r A ) − A (cid:62) e i w − r i . (4.13)24pplying Lemma 28 yields that A (cid:62) W − r A (cid:23) α )((1 + α ) m ) α A (cid:62) W − p A where α = 2 p − r . Consequently, for all i ∈ [ m ] it follows that e (cid:62) i A ( A (cid:62) W − r A ) − A (cid:62) e i w − r i ≤ (1 + α ) ((1 + α ) m ) α e (cid:62) i A ( A (cid:62) W − p A ) − A (cid:62) e i w − r i . (4.14)Further, since w = σ ( W − p A ) we have that e (cid:62) i A ( A (cid:62) W − p A ) − A (cid:62) e i w − r i = w p − i σ i ( W − p A ) w − r i = w αi (4.15)Additionally, since w − r i A (cid:62) e i e (cid:62) i A (cid:22) A (cid:62) W − r A we have w − r i e (cid:62) i A ( A (cid:62) W − r A ) − A (cid:62) e i ≤ and e (cid:62) i A ( A (cid:62) W − r A ) − A (cid:62) e i w − r i ≤ w − i . (4.16)Combining (4.13), (4.14), (4.15), and (4.16) yields σ ( W − r A ) i w − i ≤ min (cid:26) (1 + α ) (cid:18)(cid:18) α (cid:19) m (cid:19) α w αi , w − i (cid:27) ≤ (1 + α ) α (cid:18)(cid:18) α (cid:19) m (cid:19) α α . where we used that min { ax b , x c } ≤ a − cb − c for a ≥ , b ≥ c , and x ∈ [0 , . The final inequalityfollows from the fact that if we let f ( α ) def = (1 + α ) α (1 + α ) α α then f ( α ) ≤ for all α ≥ as theconcavity of log shows that log f ( α ) = log(1 + α )1 + α + α · log(1 + (1 /α ))1 + α ≤ log (cid:18) (1 + α )1 + α + α · /α )1 + α (cid:19) = log 2 . Here, we show that regularized Lewis weights for suitable p and small enough regularization area valid weight function (Definition 12). The main result of this section is the following theoremwhich bounds the weight function parameters. This theorem follows almost immediately from thecalculations in Section 4. To obtain our fastest algorithms for linear programming, we choose p = 1 − / log(4 m ) and c = n/ (2 m ) . Theorem 29.
For any p ∈ (0 , and any c ≥ , the weight function g : Ω ◦ → R m> defined for all x ∈ R m> as g ( x ) def = w p ( A x ) + c where A x def = ( Φ (cid:48)(cid:48) ( x )) − / A . (4.17) is a weight function in the context of Definition 12 and satisfies c ( g ) ≤ n + c m , c s ( g ) ≤ m − p ,and c k ( g ) ≤ − p . Further, for p = 1 − m ) and c = n m , we have c ( g ) ≤ n , c s ( g ) ≤ , and c k ( g ) ≤ m ) .Proof. To bound the size, c ( g ) , recall that w p ( A x ) = σ ( W − p A x ) and therefore Lemma 47 implies25 i ∈ [ m ] w p ( A x ) i = n . To bound the sensitivity, c s ( g ) , note that Lemma 26 and that p ≤ yield e (cid:62) i G ( x ) − A x (cid:16) A (cid:62) x G ( x ) − A x (cid:17) − A (cid:62) x G ( x ) − e i = g ( x ) − i σ ( G ( x ) − A x ) i ≤ m α α where α = p − = p (1 − p ) . As α α = − p p ≤ − p , the bound on c s ( g ) follows.To bound the consistency, c k ( g ) , note that for arbitrary h ∈ R m and w ( v ) = w p ( VA ) we have G ( x ) − J g ( x ) (cid:0) Φ (cid:48)(cid:48) ( x ) (cid:1) − h = G ( x ) − J w (( Φ (cid:48)(cid:48) ( x )) − ) z (4.18)where z = − ( Φ (cid:48)(cid:48) ( x )) − Φ (cid:48)(cid:48)(cid:48) ( x ) h . By Lemma 25, we have that (cid:13)(cid:13)(cid:13) G ( x ) − J w (( Φ (cid:48)(cid:48) ( x )) − ) z (cid:13)(cid:13)(cid:13) g ( x ) ≤ p (cid:13)(cid:13)(cid:13) ( Φ (cid:48)(cid:48) ( x )) z (cid:13)(cid:13)(cid:13) g ( x ) (4.19)and (cid:13)(cid:13)(cid:13) G ( x ) − J w (( Φ (cid:48)(cid:48) ( x )) − ) z (cid:13)(cid:13)(cid:13) ∞ ≤ p (cid:13)(cid:13)(cid:13) ( Φ (cid:48)(cid:48) ( x )) z (cid:13)(cid:13)(cid:13) ∞ + p (cid:13)(cid:13)(cid:13) ( Φ (cid:48)(cid:48) ( x )) z (cid:13)(cid:13)(cid:13) g ( x ) . (4.20)Combining (4.18), (4.19), (4.20) and using the definition (cid:107)·(cid:107) g ( x )+ ∞ def = (cid:107) · (cid:107) ∞ + C norm (cid:107) · (cid:107) g ( x ) yields (cid:13)(cid:13)(cid:13) G ( x ) − J g ( x ) (cid:0) Φ (cid:48)(cid:48) ( x ) (cid:1) − h (cid:13)(cid:13)(cid:13) g ( x )+ ∞ ≤ p (cid:13)(cid:13)(cid:13) ( Φ (cid:48)(cid:48) ( x )) z (cid:13)(cid:13)(cid:13) ∞ + p (1 + C norm ) · (cid:13)(cid:13)(cid:13) ( Φ (cid:48)(cid:48) ( x )) z (cid:13)(cid:13)(cid:13) g ( x ) . Note that (cid:12)(cid:12)(cid:12) Φ (cid:48)(cid:48) ( x ) z (cid:12)(cid:12)(cid:12) i = (cid:12)(cid:12)(cid:12) Φ (cid:48)(cid:48) ( x ) − Φ (cid:48)(cid:48)(cid:48) ( x ) h (cid:12)(cid:12)(cid:12) i ≤ | h i | by the self-concordance of Φ . Therefore, (cid:13)(cid:13)(cid:13) G ( x ) − J g ( x ) (cid:0) Φ (cid:48)(cid:48) ( x ) (cid:1) − h (cid:13)(cid:13)(cid:13) g ( x )+ ∞ ≤ p (cid:107) h (cid:107) ∞ + p (1 + C norm ) · (cid:107) h (cid:107) g ( x ) ≤ p (cid:18) C norm (cid:19) (cid:107) h (cid:107) g ( x )+ ∞ . Recalling that C norm = 24 (cid:112) c s ( g ) c k ( g ) and using c s ( g ) ≥ , the bound of c k ( g ) = − p follows from p (cid:18) C norm (cid:19) ≤ p + 124 c k ( g ) = 1 − c k ( g ) + 124 c k ( g ) ≤ − c k ( g ) . In this section, we construct an (cid:101) O ( n ) -self-concordant barrier for the set Ω ◦ def = { x ∈ R n | A x > b } for non-degenerate A ∈ R m × n and vector b ∈ R m using (cid:96) q Lewis weights. Interestingly, the centralpath for this barrier is the points x ( t ) ∈ R n , λ ( t ) ∈ R m> , and s ( t ) ∈ R m> for t > satisfying λ ( t ) i · s ( t ) i = t · w q ( A x ( t ) ) i for all i ∈ [ m ] A (cid:62) λ ( t ) = c, A x ( t ) + s ( t ) = b. We use q throughout rather than p as in Section 4 to clearly distinguish between the different (but closely related)functions considered in each section. 
A x def = S − x A and S x = Diag ( A x − b ) . For all x ∈ Ω ◦ and w ∈ R m> we let f ( x, w ) def = ln det (cid:16) A (cid:62) x W − q A x (cid:17) − (cid:18) − q (cid:19) m (cid:88) i =1 w i and define the barrier as ψ ( x ) def = (cid:40) max w ∈ R m : w ≥ f ( x, w ) if q ≥ w ∈ R m : w ≥ f ( x, w ) if q ≤ . (5.1)Note with respect to w the function f ( x, w ) is just a scaling of the function we used for definingLewis weights. Here we need these two cases to maintain that f is a convex function in x. Note that when q = 2 the function f ( x, w ) does not depend on w and therefore ψ is well defined.Further in this case ψ is exactly the volumetric barrier function, i.e. f ( x, w ) = ln det( A (cid:62) x A x ) .Further, note that as q → , Lemma 63 shows that w q ( A ) i = 1 (as long as A is in general position).and in this case ψ is the log barrier function.We call this the Lewis weight barrier function as by Lemma 22 we can equivalently write ψ ( x ) = ln det (cid:18) A (cid:62) x W − q x A x (cid:19) where W x = Diag ( w q ( A x )) . The main result of this section is the following theorem which shows that the Lewis weight barrier isa self-concordant barrier. In particular this theorem shows that for q = Θ(log( m )) the Lewis-weightbarrier is a O ( n log m ) -self concordant. Further, when q = 2 this theorem recovers the fact thatthe volumetric barrier function is a O ( √ mn ) -self concordant [47, 3]. Theorem 30.
Let Ω ◦ = { x : A x > b } denote the interior of non-empty polytope for non-degenerate A . For any q > , ψ : Ω ◦ → R defined in (5.1) is a barrier function such that for all x ∈ Ω ◦ and h ∈ R n , we have1. ∇ ψ ( x ) (cid:62) ∇ ψ ( x ) − ∇ ψ ( x ) ≤ n ,2. D ψ [ h, h, h ] ≤ v q (cid:107) h (cid:107) / ∇ ψ ( x ) for v q = ( q + 2) / m q +2 + 4 max { q, } . Consequently, v q ψ is a nv q -self-concordant barrier function for Ω ◦ . In the remainder of this section we prove Theorem 30. Leveraging the analysis of Section 4, thisis a straightforward but tedious calculus exercise. We split the proof into parts. In Section 5.1, wecompute the gradient of the Hessian of f and ψ and prove the first item in Theorem 30 (Lemma 32).In Section 5.2, we then prove the second item of Theorem 30 (Lemma 38) by bounding the stabilityof each component of the Hessian. For brevity, throughout the remainder of this section, we let w x = arg max w ∈ R m ≥ f ( x, w ) when q ≥ , w x = arg min w ∈ R m ≥ f ( x, w ) when q ≤ , and w x = σ ( A x ) when q = 2 . We will show inLemma 31 that w x = w q ( A x ) for all q . Further, for all x ∈ Ω ◦ we let P x def = P ( W − q x A x ) , P (2) x,w def = P (2) ( W − q x A x ) , ¯ Λ x def = ¯ Λ ( W − q x A x ) , and σ x def = σ ( W − q x A x ) . Leveraging this notation we compute and bound the gradient and Hessian of ψ .27 emma 31. For all x ∈ Ω ◦ , σ x = w x = w q ( A x ) and for N x def = 2 ¯ Λ x ( I − (1 − q ) ¯ Λ x ) − we have ∇ ψ ( x ) = − A (cid:62) x σ x and ∇ ψ ( x ) = A (cid:62) x Σ / x ( I + N x ) Σ / x A x (5.2) Further, N x is a symmetric matrix with (cid:22) N x (cid:22) q I and therefore A (cid:62) x Σ x A x (cid:22) ∇ ψ ( x ) (cid:22) (1 + q ) A (cid:62) x Σ x A x . (5.3) Proof.
By Lemma 50, deferred to the appendix, recalling that c q def = 1 − q we have ∇ x f ( x, w ) = − A (cid:62) x σ x,w , ∇ w f ( x, w ) = c q W − σ x,w − c q , ∇ xx f ( x, w ) = A (cid:62) x (2 Σ x,w + 4 Λ x,w ) A x , ∇ ww f ( x, w ) = − c q W − ( Σ x,w − c q Λ x,w ) W − , and ∇ xw f ( x, w ) = − c q A (cid:62) x Λ x,w W − . where P x,w def = P ( W − q A x ) , Λ x,w def = Λ ( W − q A x ) , and σ x,w def = σ ( W − q A x ) .Consequently, f ( x, w ) is concave in w when q > , convex in w when q < and each casethe optimizer is in the interior of the set { w i ≥ } by Lemma 22. Further, whenever q (cid:54) = 2 theoptimality conditions imply that ∇ w f ( x, w x ) = 0 and considering the q = 2 case directly we seethat in all cases σ x = w x = w q ( A x ) .For q (cid:54) = 2 taking the derivative of ∇ w f ( x, w x ) = 0 with respect to x yields that for w ( x ) def = w x , ∇ wx f ( x, w x ) + ∇ ww f ( x, w x ) J w ( x ) = 0 . Since ∇ ww f ( x, w x ) is invertible, we have that in this case J w ( x ) = − ( ∇ ww f ( x, w x )) − ∇ wx f ( x, w x ) . Using that ψ ( x ) = f ( x, w x ) and taking the derivative of x on both sides, we have ∇ ψ ( x ) = ∇ x f ( x, w x ) + J w ( x ) (cid:62) ∇ w f ( x, w x ) = ∇ x f ( x, w x ) = − A (cid:62) x σ x (5.4)where we used that ∇ w f ( x, w x ) = 0 by optimality. Next, taking the derivative again yields that ∇ ψ ( x ) = ∇ xx f ( x, w x ) + ∇ xw f ( x, w x ) J w ( x )= ∇ xx f ( x, w x ) − ∇ xw f ( x, w x ) (cid:0) ∇ ww f ( x, w x ) (cid:1) − (cid:0) ∇ wx f ( x, w x ) (cid:1) Substituting in the computed values for ∇ xx f ( x, w x ) , ∇ xw f ( x, w x ) , ∇ wx f ( x, w x ) and using that Σ x = W x then yields that ∇ ψ ( x ) = A (cid:62) x ( Σ x + 2 Λ x ) A x + 2 c q A (cid:62) x Λ x ( Σ x − c q Λ x ) − Λ x A x (5.5)Further, since when q = 2 we have ψ ( x ) = f ( x, w ) for any w ∈ R m> and c q = 0 we see that (5.4)and (5.5) are correct for all q > . Rearranging, scaling, and leveraging that Σ x is PD (i.e. allleverage scores are positive) yields ∇ ψ ( x ) = A (cid:62) x Σ / x (cid:16) I + 2 ¯ Λ x + 2 c q ¯ Λ x (cid:0) I − c q ¯ Λ x (cid:1) − ¯ Λ x (cid:17) Σ / x A x . Now, note that (cid:22) ¯ Λ (cid:22) I and c q ∈ ( −∞ , for q ∈ (0 , ∞ ) and therefore no eigenvalue of ¯ Λ has28alue /c q . Since, x + c q x (1 − c q x ) − = x (1 − c q x ) − for x (cid:54) = 1 /c q and ¯ Λ and I trivially commute wehave that ∇ ψ ( x ) = A (cid:62) x Σ / x ( I + N x ) Σ / x A x as desired. Further, this implies that N x is symmetricwith all eigenvalues in the range [0 , q ] (see e.g. (4.9)), proving (5.3).Using Lemma 31 we can immediately bound ∇ ψ ( x ) (cid:62) ∇ ψ ( x ) − ∇ ψ ( x ) . Lemma 32.
For all x ∈ Ω ◦ , we have ∇ ψ ( x ) (cid:62) ∇ ψ ( x ) − ∇ ψ ( x ) ≤ n .Proof. Since ∇ ψ ( x ) = − A (cid:62) x σ x and ∇ ψ ( x ) (cid:23) A (cid:62) x Σ x A x by Lemma 31 we have. ∇ ψ ( x ) (cid:62) ∇ ψ ( x ) − ∇ ψ ( x ) ≤ σ (cid:62) x A x (cid:16) A (cid:62) x Σ x A x (cid:17) − A (cid:62) x σ x = 1 (cid:62) Σ / x PΣ / x where P def = Σ / x A x ( A (cid:62) x Σ x A x ) − A (cid:62) x Σ / x . Since P is a projection matrix, P (cid:22) I (Lemma 47) and ∇ ψ ( x ) (cid:62) ∇ ψ ( x ) − ∇ ψ ( x ) ≤ (cid:62) Σ x (cid:88) i ∈ [ m ] [ σ x ] i = n . Here we bound the directional derivatives of the barrier and show that they are not too large.Lemma 38 proved in this section, combined with Lemma 32 of the previous section immediatelyprove Theorem 30, bounding the self-concordance of ψ .Throughout this section, to simplify the notation, we fix an arbitrary point x ∈ Ω ◦ and adirection h ∈ R n and define x t def = x + th , s t = A x t − b , and A t = A x t and further define w t , W t , Σ t , P (2) t , Λ t , ¯ Λ t „ and N t (Lemma 31) analogously.First, we bound the derivatives of the slacks and weights in the following Lemma 33 and 34. Lemma 33.
For all x ∈ Ω ◦ and h ∈ R n we have (cid:13)(cid:13)(cid:13)(cid:13) S − t ddt s t (cid:13)(cid:13)(cid:13)(cid:13) W t ≤ (cid:107) h (cid:107) A (cid:62) t W t A t and (cid:13)(cid:13)(cid:13)(cid:13) S − t ddt s t (cid:13)(cid:13)(cid:13)(cid:13) ∞ ≤ √ m q +2 (cid:107) h (cid:107) A (cid:62) t W t A t . Proof.
Since S − t ddt s t = A t h we have (cid:107) S − t ddt s t (cid:107) W t = (cid:107) h (cid:107) A (cid:62) t W t A t . For the second inequality notethat by Cauchy Schwarz, (cid:13)(cid:13)(cid:13)(cid:13) S − t ddt s t (cid:13)(cid:13)(cid:13)(cid:13) ∞ = (cid:107) A x h (cid:107) ∞ = max i ∈ [ m ] |(cid:104) e i , A x h (cid:105)| = max i ∈ [ m ] (cid:12)(cid:12)(cid:12)(cid:68) ( A (cid:62) x W x A x ) − / A (cid:62) x e i , ( A (cid:62) x W x A x ) / h (cid:69)(cid:12)(cid:12)(cid:12) ≤ (cid:114) max i ∈ [ m ] [ A x ( A (cid:62) x W x A x ) − A (cid:62) x ] ii (cid:107) h (cid:107) A (cid:62) x W x A x . The result follows that Lemma 26 shows max i ∈ [ m ] (cid:104) A x ( A (cid:62) x W x A x ) − A (cid:62) x (cid:105) ii = max i ∈ [ m ] σ ( W x A x ) i [ w x ] − i ≤ m q +2 . Lemma 34.
For all x ∈ Ω ◦ and h ∈ R n we have (cid:13)(cid:13)(cid:13)(cid:13) W − t ddt w t (cid:13)(cid:13)(cid:13)(cid:13) W t ≤ q (cid:107) h (cid:107) A (cid:62) t W t A t and (cid:13)(cid:13)(cid:13)(cid:13) W − t ddt w t (cid:13)(cid:13)(cid:13)(cid:13) ∞ ≤ q (cid:16) √ · m q +2 + max (cid:110) q , (cid:111)(cid:17) (cid:107) h (cid:107) A (cid:62) t W t A t . roof. Since that w t = w q ( S − t A ) , chain rule, Lemma 25, and Lemma 33 shows that the function p ( x ) def = w q ( S − x A ) satisfies (cid:107) J p ( x t ) h (cid:107) W − t ≤ q (cid:13)(cid:13)(cid:13)(cid:13) ( S − t ) − S − t ddt s t (cid:13)(cid:13)(cid:13)(cid:13) W t = q (cid:13)(cid:13)(cid:13)(cid:13) S − t ddt s t (cid:13)(cid:13)(cid:13)(cid:13) W t = q (cid:107) h (cid:107) A (cid:62) t W t A t . The same tools also show that (cid:107) W − t J p ( x t ) h (cid:107) ∞ ≤ q (cid:13)(cid:13)(cid:13)(cid:13) ( S − t ) − S − t ddt s t (cid:13)(cid:13)(cid:13)(cid:13) ∞ + q · max (cid:110) q , (cid:111) · (cid:13)(cid:13)(cid:13)(cid:13) ( S − t ) − S − t ddt s t (cid:13)(cid:13)(cid:13)(cid:13) W t ≤ q (cid:13)(cid:13)(cid:13)(cid:13) S − t ddt s t (cid:13)(cid:13)(cid:13)(cid:13) ∞ + q · max (cid:110) q , (cid:111) · (cid:13)(cid:13)(cid:13)(cid:13) S − t ddt s t (cid:13)(cid:13)(cid:13)(cid:13) W t ≤ q √ · m q +2 · (cid:107) h (cid:107) A (cid:62) t W t A t + q · max (cid:110) q , (cid:111) · (cid:107) h (cid:107) A (cid:62) t W t A t . Since ddt w t = J p ( x t ) h the result follows.Now, recall that by Lemma 31 we have ∇ ψ ( x ) = A (cid:62) t Σ / t ( I + N t ) Σ / t A x where N t def = 2 ¯ Λ t ( I − c q ¯ Λ t ) − and c q = 1 − (2 /q ) . Since we have already bounded the stability of A t and Σ t all thatremains is to bound the stability of ¯ Λ t and leverage this to bound the stability N t and ∇ ψ ( x t ) .To simplify these calculation for all t > and α ∈ R we define z t,α ∈ R n be defined for all i ∈ [ n ] by [ z t,α ] i = ddt ln ([ w t ] αi / [ s t ] i ) and Z t,α def = Diag ( z t,α ) . We will repeatedly use the fact ddt W αt S − t = W αt S − t ddt ln( W αt S − t ) = W αt S − t Z t,α . (5.6)In the following lemma we bound z t,α and use this to simplify these derivative bounds. Lemma 35.
For all x ∈ Ω ◦ and h ∈ R n we have and z t,α ∈ R n defined for all i ∈ [ n ] by [ z t,α ] i = ddt ln ([ w t ] αi / [ s t ] i ) we have that (cid:107) z t (cid:107) Σ ≤ ( | α | q + 1) (cid:107) h (cid:107) ∇ ψ ( x t ) and (cid:107) z t (cid:107) ∞ ≤ (cid:16) ( | α | q + 1) √ m q +2 + q | α | max (cid:110) q , (cid:111)(cid:17) (cid:107) h (cid:107) ∇ ψ ( x t ) Proof.
Note that [ z t,α ] i = α · ( ddt [ w t ] i / [ w t ] i ) − ( ddt [ s t ] i / [ s t ] i ) and consequently the result thereforefollows from triangle inequality, Lemma 33, Lemma 34, and A (cid:62) t W t A t (cid:22) ∇ ψ ( x t ) .Using this we can bound the stability of ¯ Λ t Lemma 36.
For all x ∈ Ω ◦ and h ∈ R n we have (cid:13)(cid:13)(cid:13)(cid:13) ddt ¯ Λ t (cid:13)(cid:13)(cid:13)(cid:13) ≤ max { q, }(cid:107) h (cid:107) ∇ ψ ( x t ) . Proof.
Let Q t def = W − / t P t W − / t . Since W t = Σ t this implies that ¯ Λ t = I − Σ − / t P (2) t Σ − / t = I − Q (2) t . Now, let, z t,α be defined as in Lemma 35 and let Z t,α def = Diag ( z t,α ) . Since ddt (cid:18) A (cid:62) t W − q t A t (cid:19) − = − (cid:18) A (cid:62) t W − q t A t (cid:19) − ddt (cid:20) A (cid:62) t W − q t A t (cid:21) (cid:18) A (cid:62) t W − q t A t (cid:19) − ddt S − t W − q t = ddt (cid:34)(cid:18) S − t W − q t (cid:19) (cid:35) = 2 S − t W − q t Z t, − q (using (5.6)), we have that ddt [ Q t ] ij = ddt e (cid:62) i W − q t A t (cid:18) A (cid:62) t W − q t A t (cid:19) − A (cid:62) t W − q t e j = [ z t, − q ] i [ Q t ] ij + [ Q t ] ij [ z t, − q ] j − (cid:104) Q t W / t Z t, − q W / t Q t (cid:105) ij . Consequently, ddt Q t = Z t, − q Q t + Q t Z t, − q − Q t W / t Z t, − q W / t Q t and by chain rule we have that ddt Q (2) t = 2 Q t ◦ (cid:20) ddt Q t (cid:21) = 2 Z t, − q Q (2) t + 2 Q (2) t Z t, − q − (cid:104) Q t ◦ Q t W / t Z t, − q W / t Q t (cid:105) Therefore, for all y ∈ R n we have y (cid:62) (cid:20) ddt Q (2) t (cid:21) y = 4 y (cid:62) Σ − / t Z t, − q P (2) t Σ − / t y − y (cid:62) (cid:104) P t ◦ Σ − / t P t Z t, − q P t Σ − / t (cid:105) y Applying Lemma 47 yields (cid:12)(cid:12)(cid:12)(cid:12) y (cid:62) (cid:20) ddt Q (2) t (cid:21) y (cid:12)(cid:12)(cid:12)(cid:12) ≤ (cid:107) Σ − / t y (cid:107) Σ t (cid:107) z t, − q (cid:107) Σ t + 4 (cid:107) Σ − / t y (cid:107) Σ t (cid:107) z t, − q (cid:107) Σ t . Since, (cid:107) Σ − / t y (cid:107) Σ t = (cid:107) y (cid:107) applying Lemma 35 then yields that (cid:13)(cid:13)(cid:13)(cid:13) ddt Q (2) t (cid:13)(cid:13)(cid:13)(cid:13) ≤ (cid:104) (cid:107) z t, − q (cid:107) Σ t + (cid:107) z t, − q (cid:107) Σ t (cid:105) ≤ (cid:20)(cid:18)(cid:12)(cid:12)(cid:12)(cid:12) − q (cid:12)(cid:12)(cid:12)(cid:12) q + 1 (cid:19) + (cid:18)(cid:12)(cid:12)(cid:12)(cid:12) − q (cid:12)(cid:12)(cid:12)(cid:12) q + 1 (cid:19)(cid:21) (cid:107) h (cid:107) ∇ ψ ( x t ) . Since | q − | + | q − | + 8 ≤ max { q, } and ddt Q (2) t = ddt ¯ Λ t the result follows.Using this we can now bound the stability of N t . Lemma 37.
For all x ∈ Ω ◦ and h ∈ R n we have (cid:13)(cid:13)(cid:13)(cid:13) ( I + N t ) − ddt N t ( I + N t ) − (cid:13)(cid:13)(cid:13)(cid:13) ≤ (cid:110) , q (cid:111) (cid:13)(cid:13)(cid:13)(cid:13) ddt ¯ Λ t (cid:13)(cid:13)(cid:13)(cid:13) ≤ { q, } (cid:107) h (cid:107) ∇ ψ ( x t ) . Proof.
Recall that N t def = 2 ¯ Λ t ( I − c q ¯ Λ t ) − for c q = 1 − q . Direct calculation yields ddt N t = 2 (cid:20) ddt ¯ Λ t (cid:21) (cid:0) I − c q ¯ Λ t (cid:1) − − Λ t (cid:0) I − c q ¯ Λ t (cid:1) − (cid:20) ( − α q ) ddt ¯ Λ t (cid:21) (cid:0) I − c q ¯ Λ t (cid:1) − = 2 (cid:104) I + c q ¯ Λ t (cid:0) I − c q ¯ Λ t (cid:1) − (cid:105) · (cid:20) ddt ¯ Λ t (cid:21) · (cid:2) I − c q ¯ Λ t (cid:3) − = 2 · (cid:2) I − c q ¯ Λ t (cid:3) − · (cid:20) ddt ¯ Λ t (cid:21) · (cid:2) I − c q ¯ Λ t (cid:3) − . c q ∈ ( −∞ , and (cid:22) ¯ Λ (cid:22) I we have I + N t = ( I − c q ¯ Λ t ) − ( I + (2 − c q ) ¯ Λ ) (cid:23) ( I − c q ¯ Λ t ) − and therefore (cid:107) ( I + N t ) − ( I − c q ¯ Λ t ) − (cid:107) ≤ . Further as (cid:107) ( I − c q ¯ Λ t ) − (cid:107) ≤ max { , q } (see, e.g.(4.10)), the result follows from Lemma 36.Now we can combine everything to prove the desired result Lemma 38.
For all x ∈ Ω ◦ and h, y ∈ R n we have (cid:12)(cid:12)(cid:12)(cid:12) y (cid:62) (cid:20) ddt ∇ ψ ( x t ) (cid:21) y (cid:12)(cid:12)(cid:12)(cid:12) (cid:22) (cid:16) ( q + 2) / √ m q +2 + 6 max { q, } . (cid:17) (cid:107) h (cid:107) ∇ ψ ( x t ) (cid:104) y (cid:62) ∇ ψ ( x t ) y (cid:105) . Proof.
Since ∇ ψ ( x ) = A (cid:62) t Σ / t ( I + N t ) Σ / t A x by Lemma 31 we have y (cid:62) (cid:20) ddt ∇ ψ ( x t ) (cid:21) y = 2 y (cid:62) A (cid:62) t W / t Z t, / [ I + N t ] W / t A t y + y (cid:62) A (cid:62) t W / t (cid:20) ddt N t (cid:21) W / t A t y . Cauchy Schwarz, and the facts that I + N t is PSD and ∇ ψ ( x ) = A (cid:62) t Σ / t ( I + N t ) Σ / t A x yield (cid:12)(cid:12)(cid:12) y (cid:62) A (cid:62) t W / t Z t, / [ I + N t ] W / t A t y (cid:12)(cid:12)(cid:12) ≤ (cid:107) W / t A t y (cid:107) (cid:107) Z t, / [ I + N t ] W / t A t y (cid:107) ≤ (cid:107) Z t, / (cid:107) (cid:112) (cid:107) I + N t (cid:107) (cid:107) y (cid:107) ∇ ψ ( x t ) . and (cid:12)(cid:12)(cid:12)(cid:12) y (cid:62) A (cid:62) t W / t (cid:20) ddt N t (cid:21) W / t A t y (cid:12)(cid:12)(cid:12)(cid:12) ≤ (cid:107) y (cid:107) ∇ ψ ( x t ) (cid:13)(cid:13)(cid:13)(cid:13) ( I + N t ) − ddt N t ( I + N t ) − (cid:13)(cid:13)(cid:13)(cid:13) . Now, by Lemma 35 (cid:107) Z t, / (cid:107) = (cid:107) z t, / (cid:107) ∞ ≤ (cid:16)(cid:16) q (cid:17) √ m q +2 + q (cid:110) q , (cid:111)(cid:17) (cid:107) h (cid:107) ∇ ψ ( x t ) . Further, since (cid:107) I + N t (cid:107) ≤ q combining and applying Lemma 37 yields that (cid:12)(cid:12) y (cid:62) (cid:2) ddt ∇ ψ ( x t ) (cid:3) y (cid:12)(cid:12) is bounded by (cid:16) (cid:112) q (cid:16)(cid:16) q (cid:17) √ · m q +2 + q (cid:110) q , (cid:111)(cid:17) + 4 max { q, } (cid:17) (cid:107) h (cid:107) ∇ ψ ( x t ) (cid:104) y (cid:62) ∇ ψ ( x t ) y (cid:105) and the result follows by basic calculations. In this section we show how to leverage the results of the previous sections to obtain efficientalgorithms and derive the main results of this paper. In Section 6.1 we prove Theorem 1 andTheorem 43, our main results on linear programming, in Section 6.2 we prove Theorem 2, our mainresult on minimum cost maximum flow, and in Section 6.3 we prove Theorem 3 and a more generalTheorem 46, our main results on a polynomial time computable nearly-universal self-concordantbarrier. The algorithms in this section make critical use of algorithms for computing Lewis weightsprovided an analyzed in Appendix B and are stated as needed.32 .1 Linear Programming Algorithm
Here we show how to combine the results of the preceding sections to obtain our efficient linearprogramming algorithm and prove Theorem 1 and Theorem 43. Our algorithm uses the followingresult regarding approximately computing Lewis weights proved in Appendix B.
Theorem 39 (Approximate Weight Computation) . Let A ∈ R m × n be non-degenerate and let T w and T d denote the work and depth needed to compute ( A (cid:62) DA ) − z for arbitrary positive diagonal matrix D and vector z . For all (cid:15) ∈ (0 , , p ∈ (0 , , w (0) ∈ R m> with (cid:107) w − ( w p ( A ) − w (0) ) (cid:107) ∞ ≤ − p (4 − p ) ,the algorithm computeApxWeight ( x, w (0) , (cid:15) ) can be implemented to return w that with high probabilityin n (cid:107) w p ( A ) − ( w p ( A ) − w ) (cid:107) ∞ ≤ (cid:15) in O ( p − (4 − p ) − (cid:15) − log ( n/ ( p(cid:15) )) steps each of which can beimplemented in O (nnz( A ) + T w ) work and O ( T d ) depth.Without w (0) the algorithm computeInitialWeight ( A , p, (cid:15) ) (Algorithm 7) can be implementedto have the same guarantee with O ( √ n (4 − p ) − p − ) log mn log ( n/ ( p(cid:15) )) steps of the same cost. Leveraging this result in Algorithm 2 we give the procedure, pathFollowing , for approximatelyfollowing the weighted central path induced by regularized Lewis weights and in Theorem 40 weanalyze it. Interestingly, the regularization (i.e. choosing c > in Section 4.4) is not neededfor this path following procedure to work. Instead, it is used to reason about the conditioning ofthe systems encountered by this method and for leveraging the procedure to efficiently solve linearprograms. Algorithm 2: ( x (final) , w (final) ) = pathFollowing ( x, w, t start , t end , (cid:15) ) t = t start , K = c k , α = R √ n log m where R is defined in Theorem 19. repeat ( x (new) , w (new) ) = centeringInexact ( x, w, K ) where computeApxWeight to approximate g ( x ) (defined in Section 4.4 Theorem 29 for p = 1 − m and c = nm ). t ← median ((1 − α ) · t, t end , (1 + α ) · t ) . x ← x (new) , w ← w (new) . until t = t end ; for i = 1 , · · · , c k log( (cid:15) ) do ( x, w ) = centeringInexact ( x, w, K ) where computeApxWeight is used to approximate g ( x ) . endOutput: ( x, w ) . Theorem 40.
Define µ as defined in Theorem 19and suppose that δ t start ( x, w ) ≤ log m and Φ µ (log g ( x ) − log w ) ≤ c c s c k m. where. If ( x (final) , w (new) ) = pathFollowing ( x, w, t start , t end ) , then with high probability in n , δ t end ( x (final) , w (final) ) ≤ (cid:15) and Φ µ (log g ( x (final) ) − log w (final) ) ≤ c c s c k m. Further, pathFollowing ( x, w, t start , t end ) can be implemented O (cid:0) √ n log m · κ · T w (cid:1) work and O (cid:0) √ n log m · κ · T w (cid:1) depthwhere κ = (cid:12)(cid:12)(cid:12) log t end t start (cid:12)(cid:12)(cid:12) + log (cid:15) and T w and T d are the work and depth needed to compute ( A (cid:62) DA ) − q or input positive diagonal matrix D and vector q . Furthermore, with high probability in n during thewhole algorithm, we have δ t ( x, w ) ≤ R def = c k log(36 c c s c k m ) and Φ µ (log g ( x ) − log w ) ≤ c c s c k m .Proof. We first show that pathFollowing maintains the invariant that δ t ( x, w ) ≤ R and Φ µ (log g ( x ) − log w ) ≤ c c s c k m in each iteration where R def = K c k log(36 c c s c k m ) = 1768 c k log(36 c c s c k m ) = 13072 log m log(288 nm log m ) ≥ log m . Note that this holds for the input ( x, w ) by assumption, so suppose that this holds at the start ofone of the loops. By the definition of Φ µ , µ and K , we have (cid:107) log g ( x ) − log w (cid:107) ∞ ≤ R and by (3.13),we have (cid:107) log g ( x ) − log g ( x (new) ) (cid:107) ∞ ≤ R . Therefore, we have (cid:107) log g ( x (new) ) − log w (cid:107) ∞ ≤ R ≤ p + p ) (6.1)where we used the formula of R and p = m ) at the end. Thus, the weight w satisfies the condi-tions for Theorem 39 and the algorithm centeringInexact can use the function computeApxWeight to find the approximation of g ( x (new) ) . Consequently, by Lemma 19 with high probability in nδ t ( x (new) , w (new) ) ≤ (cid:18) − c k (cid:19) δ t ( x, w ) and Φ µ (log g ( x (new) ) − log w (new) ) ≤ c c s c k m . Using Lemma 14, (6.1) and Theorem 29,we have δ t (new) ( x (new) , w (new) ) ≤ (1 + α ) (cid:18) − c k (cid:19) δ t ( x, w ) + α (cid:16) C norm (cid:112) (cid:107) w (cid:107) (cid:17) ≤ δ t ( x, w ) − δ t ( x, w )8 c k + 100 α √ n log m ≤ δ t ( x, w ) . Hence, we proved that the invariant. Note that in the second loop, t does not change andtherefore δ t ( x, w ) decreases by (1 − c k ) in each step with high probability in n yielding that δ t end ( x (final) , w (final) ) ≤ (cid:15) as desired. To bound the runtime, note that R = Ω(log − m ) and hence α = Ω( n − / log − m ) . Therefore,the total number of step is O ( √ n log m · [ | log( t end /t start ) | + log (1 /(cid:15) )]) . Finally, each step involvescomputing projection to the mixed ball and computing Lewis weights. Theorem 62 shows that theprojection can be formed in O ( m log m ) time and O (log m ) depth. Theorem 18 shows that we needto compute Lewis weight with ± R = 1 ± Θ(log − m ) multiplicative approximation. Theorem39 shows that we can compute the Lewis weights using O ((1 /R ) log ( m/R )) = O (log m ) linearsystems solves of the desired form.To leverage this result we first provide Lemma 41, which bounds how large a t is needed guaranteean approximately optimal solution. 
Further, in Lemma 42, we show how much the approximate Note that the with high probability claim of this theorem requires that | log( t end /t start ) | + | log(1 /(cid:15) ) | = O ( poly ( n )) .However, if that is not the case, then every O ( poly ( n )) steps of the loops we can afford to exactly check the invariantsand compute the weights by Theorem 45 which we introduce later and repeat the steps if the invariants do not hold.This increases the expected running time only by multiplicative constants and we can run the algorithm O (log( n )) times in parallel to ensure that one of them outputs a point with the correct invariants without taking more thantwice the desired runtime with high probability. LPSolve , and prove Theorem 1 and Theorem 43.
Lemma 41 ([46, Theorem 4.2.7]) . Let x ∗ ∈ R m denote an optimal solution to (1.2) and x t =arg min f t ( x, w ) for some t > and w ∈ R m> . Then the following holds c (cid:62) x t ( w ) − c (cid:62) x ∗ ≤ (cid:107) w (cid:107) t . Proof.
By the optimality conditions of (1.2) we know that ∇ x f t ( x t ( w )) = t · c + wφ (cid:48) ( x t ( w )) isorthogonal to the kernel of A (cid:62) . Furthermore since x t ( w ) − x ∗ ∈ ker( A (cid:62) ) we have (cid:0) t · c + wφ (cid:48) ( x t ( w )) (cid:1) (cid:62) ( x t ( w ) − x ∗ ) = 0 . Using that φ (cid:48) i ( x t ( w ) i ) · ( x ∗ i − x t ( w ) i ) ≤ by Lemma 9 then yields c (cid:62) ( x t ( w ) − x ∗ ) = 1 t (cid:88) i ∈ [ m ] w i · φ (cid:48) i ( x t ( w ) i ) · ( x ∗ i − x t ( w ) i ) ≤ (cid:107) w (cid:107) t . Lemma 42.
For x such that δ t ( x, g ( x )) ≤ log m and x t def = arg min f t ( x, w ) we have (cid:13)(cid:13)(cid:13)(cid:112) φ (cid:48)(cid:48) ( x t ) ( x − x t ) (cid:13)(cid:13)(cid:13) ∞ ≤ δ t ( x, g ( x )) . Proof.
We prove this statement via our centering algorithm. We use Theorem 19 with exact weightcomputation and start with x (1) = x and w (1) = g ( x (1) ) . In each iteration, δ t is decreased by afactor of − (4 c k ) − . (3.5) shows that (cid:107) (cid:113) φ (cid:48)(cid:48) ( x ( k ) ) (cid:16) x ( k +1) − x ( k ) (cid:17) (cid:107) ∞ ≤ δ t ( x ( k ) , w ( k ) ) (6.2)where we used c γ ≤ (Lemma 13). The Lemma 8 shows that (cid:13)(cid:13)(cid:13) log (cid:16) φ (cid:48)(cid:48) ( x ( k ) ) (cid:17) − log (cid:16) φ (cid:48)(cid:48) ( x ( k +1) ) (cid:17)(cid:13)(cid:13)(cid:13) ∞ ≤ (cid:16) − δ t ( x ( k ) , w ( k ) ) (cid:17) − ≤ e δ t ( x ( k ) ,w ( k ) ) . Therefore, for any k , we have (cid:13)(cid:13)(cid:13) log (cid:16) φ (cid:48)(cid:48) ( x ( k ) ) (cid:17) − log (cid:0) φ (cid:48)(cid:48) ( x t ) (cid:1)(cid:13)(cid:13)(cid:13) ∞ ≤ e (cid:80) ki =1 δ t ( x ( i ) ,w ( i ) ) ≤ e c k δ t ( x (1) ,g ( x (1) )) ≤ where we used that δ t is decreased by a factor of (1 − c k ) , c k ≤ m and that δ t ≤ log m .Using this on (6.2), we have (cid:13)(cid:13)(cid:13)(cid:112) φ (cid:48)(cid:48) ( x t ) (cid:16) x (1) − x t (cid:17)(cid:13)(cid:13)(cid:13) ∞ ≤ k (cid:88) i =1 δ t ( x ( i ) , w ( i ) ) ≤ δ t ( x (1) , w (1) ) . lgorithm 3: x (final) = LPSolve ( x , (cid:15) ) Input: an initial point x such that A (cid:62) x = b . w = computeInitialWeight ( x , log m ) + n m , d = − w i φ (cid:48) i ( x ) . t = (2 m / U log m ) − , t = m(cid:15) , (cid:15) = log m , (cid:15) = (cid:15) U . ( x (new) , w (new) ) = pathFollowing ( x , w, , t , (cid:15) ) with cost vector d . ( x (final) , w (final) ) = pathFollowing ( x (new) , w (new) , t , t , (cid:15) ) with cost vector c . Output: x (final) . Proof of Theorem 1.
By Theorem 39, we know computeInitialWeight gives an weight (cid:107) G ( x ) − ( g ( x ) − w ) (cid:107) ∞ ≤ log m ≤ R. By the definition of R , we have that Φ µ (log g ( x ) − log w ) ≤ c c s c k m and that x is the minimumof min x d (cid:62) x − (cid:88) i w i φ i ( x ) given A (cid:62) x = b. Therefore, ( x, w ) satisfies the assumption of theorem 40 because δ t = 0 and Φ µ is small enough.Hence, we have δ dt ( x (new) , w (new) ) ≤ log m and Φ µ (log g ( x (new) ) − w (new) ) ≤ c c s c k m where we used the superscript d to indicate δ is defined using the cost vector d . Using this notationand (3.5), we have δ ct ( x (new) , w (new) ) ≤ (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) P x (new) ,w (new) (cid:32) t c + w (new) φ (cid:48) ( x (new) ) w (new) (cid:112) φ (cid:48)(cid:48) ( x (new) ) (cid:33)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) w + ∞ ≤ (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) P x (new) ,w (new) (cid:32) t d + w (new) φ (cid:48) ( x (new) ) w (new) (cid:112) φ (cid:48)(cid:48) ( x (new) ) (cid:33)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) w + ∞ + t (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) P x (new) ,w (new) (cid:32) c − dw (new) (cid:112) φ (cid:48)(cid:48) ( x (new) ) (cid:33)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) w + ∞ ≤ · δ dt ( x (new) , w (new) ) + 100 √ n log m · t · (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) P x (new) ,w (new) (cid:32) c − dw (new) (cid:112) φ (cid:48)(cid:48) ( x (new) ) (cid:33)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) ∞ (6.3)where we used Lemma 13 at the end.Next, we note that for any x and w , let Q = W − A x (cid:0) A (cid:62) x W − A x (cid:1) − A (cid:62) x W − , then (3.3) andsensitivity c s ≤ shows that (cid:107) P x,w W − (cid:107) ∞→∞ ≤ (cid:18) min i ∈ [ m ] w i (cid:19) − + (cid:107) Q (cid:107) ∞→∞ ≤ m + m max i ∈ [ m ] Q ii ≤ m. Substituting into (6.3) yields δ ct ( x (new) , w (new) ) ≤ log m + 600 m / log m · t · (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) c − d (cid:112) φ (cid:48)(cid:48) ( x (new) ) (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) ∞ ≤ log m where we used that (cid:107) c − d (cid:107) ∞ ≤ (cid:107) c (cid:107) ∞ + (cid:107) φ (cid:48) ( x ) (cid:107) ∞ ≤ U (Lemma 9), min y (cid:112) φ (cid:48)(cid:48) ( y ) ≥ U (Lemma36) and that we have chosen t small enough.Hence, ( x (new) , w (new) ) satisfy the assumption of Theorem 40 for the original cost function c .Now, we only need to bound how large t should be and how small (cid:15) should be in order to get x such that c (cid:62) x ≤ OPT + (cid:15) . By Lemma 41 and (cid:107) w (final) (cid:107) ≤ m , we have c (cid:62) x t ≤ OPT + 2 mt . Also, Lemma 42 shows that we have (cid:13)(cid:13)(cid:13)(cid:112) φ (cid:48)(cid:48) ( x t ) (cid:16) x (final) − x t (cid:17)(cid:13)(cid:13)(cid:13) ∞ ≤ (cid:15) . Using min y (cid:112) φ (cid:48)(cid:48) ( y ) ≥ U , we have (cid:13)(cid:13) x (final) − x t (cid:13)(cid:13) ∞ ≤ (cid:15) U and hence our choice of t and (cid:15) yields c (cid:62) x (final) ≤ OPT + 2 mt + 8 (cid:15) U ≤ OPT + (cid:15). Theorem 43.
Let x ∈ Ω def = { A (cid:62) x = b, x ≥ } for non-degenerate A ∈ R m × n . There is analgorithm that finds y ∈ R n with A y ≤ c and b (cid:62) y ≥ max A y ≤ c b (cid:62) y − (cid:15) with constant probability in O (cid:18) √ n log m · log( mU(cid:15) ) · T w (cid:19) work and O (cid:18) √ n log m · log( mU(cid:15) ) · T d (cid:19) depthwhere U def = max { diam (Ω) , (cid:107) c (cid:107) ∞ , (cid:107) /x (cid:107) ∞ } , diam (Ω) is the diameter of Ω , and T w and T d is thework and depth needed to compute ( A (cid:62) DA ) − q for input positive diagonal matrix D and vector q .Proof. Use Algorithm
LPSolve to solve the linear program min A (cid:62) x = b,x ≥ c (cid:62) x . Following the proofof Theorem 1 and using φ i ( x i ) = − log x i for all i , we can find x and w such that δ t ( x, w ) ≤ with t = ( n(cid:15) ) O (1) in the time same work and depth as Theorem 1. Further, (3.5) shows that η = (cid:0) A (cid:62) x W − A x (cid:1) − A (cid:62) x ∇ x f t ( x,w ) w √ φ (cid:48)(cid:48) ( x ) satisfies (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) ∇ x f t ( x, w ) − A ηw (cid:112) φ (cid:48)(cid:48) ( x ) (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) ∞ ≤ . We will prove that y = ηt has the desired properties. Since φ (cid:48)(cid:48) i ( x ) = x − i , we have that (cid:13)(cid:13)(cid:13) W − X (cid:16) tc − wx − A η (cid:17)(cid:13)(cid:13)(cid:13) ∞ ≤ . In particular, we have ( A y ) i ≤ c i − w i tx i + w i tx i ≤ c i for all i ∈ [ n ] . Similarly, we have that ( A y ) i ≥ c i − w i tx i − w i tx i . Hence, we have b (cid:62) y = x (cid:62) A y ≥ c (cid:62) x − t (cid:88) i ∈ [ m ] w i ≥ c (cid:62) x − nt and picking t = n(cid:15) gives the result. 37 .2 Minimum Cost Maximum Flow Here we show how to use the interior point method of the previous Section 6.1 to solve the maximumflow problem and the minimum cost flow problem and thereby prove Theorem 2. Formally, themaximum flow and minimum cost flow problems [11] is as follows. Let G = ( V, E ) be a connecteddirected graph where each edge e ∈ E has capacity c e > . We call x ∈ R E a s - t flow for s, t ∈ V if x e ∈ [0 , c e ] for all e in E and for each vertex v / ∈ { s, t } the amount of flow entering v , i.e. (cid:80) e =( a,v ) ∈ E f e equals the amount of flow leaving v , i.e. (cid:80) e =( v,b ) ∈ E f e . The value of s - t flow is theamount of flow leaving s (or equivalently, entering t ). The maximum flow problem is to compute a s - t flow of maximum value. In the minimum cost maximum flow problem there are costs q e ∈ R oneach edge e ∈ E and the goal is to compute a maximum s - t flow of minimum cost, (cid:80) e ∈ E q e f e = q (cid:62) f .Since the minimum cost flow problem includes the maximum flow problem, we focus on thisgeneral formulation. The problem can be written as the following linear program min ≤ x ≤ c q (cid:62) x such that A x = F e t where F is the maximum flow value, e t ∈ R | V \{ s }| is an indicator vector of size | V | − that is non-zero at vertices t and A is a | V \{ s }| × | E | matrix such that for each edge e , we have A e head ,e = 1 and A e tail ,e = − . In order words, the constraint A x = F e t requires the flow to satisfies the flowconversation at all vertices except s and t and requires it flows F unit of flow into t (and therefore F out of s ). We assume c e and q e are integer and M be the maximum absolute value of c e and q e .Note that rank ( A ) = | V | − because the graph is connected and hence our algorithm takes only ˜ O ( (cid:112) | V | log( U/(cid:15) )) iterations to compute an (cid:15) -approximate solution to this linear program. However,to solve minimum cost maximum flow with this we need to bound U/(cid:15) , compute F , and turn theapproximate solution into an exact minimum cost maximum flow. While there are many ways todeal with this issue we consider a different linear program formulation below related to [11]. Lemma 44.
Given a directed graph G = ( V, E ) with integral costs q ∈ Z E and capacities c ∈ Z E ≥ with (cid:107) q (cid:107) ∞ ≤ M and (cid:107) c (cid:107) ∞ ≤ M in linear time we can find a new integral cost vector ˜ q ∈ Z E with (cid:107) ˜ q (cid:107) ∞ ≤ ˜ M def = 8 | E | M such that the following modified linear program min ˜ q (cid:62) x + λ (1 (cid:62) y + 1 (cid:62) z ) − n ˜ M F subject to A x + y − z = F e t ≤ x i ≤ c i , ≤ y i ≤ | V | M, ≤ z i ≤ | V | M, ≤ F ≤ | V | M with λ = 440 | E | ˜ M M satisfies the following conditions with constant probability:1. F = | V | M , x = c , y = 2 | V | M − ( A c ) − + F e t , z = 2 | V | M A c ) + is an interior point ofthe linear program.2. Given any feasible ( x, y, z ) with cost value within M of the optimum. Then, one can find anexact minimum cost maximum s - t flow for graph G with costs q and capacities c in O ( | E | ) work and O (1) depth.3. The linear system of the linear program can be solve in nearly linear time, i.e. for any positive iagonal matrix S and vector b , it takes O (cid:0) | E | log | V | log( | V | /η ) (cid:1) work and O (cid:0) log | V | log( | V | /η ) (cid:1) depthto find x such that (cid:107) x − L − b (cid:107) L ≤ η (cid:107) x (cid:107) L (6.4) where L = [ A | I | − I | − e t ] S [ A | I | − I | − e t ] (cid:62) . Proof.
By Lemma [11, Lemma 3.13], if we add the cost of every edge by a number uniformly atrandom from { | E | M , | E | M , · · · , | E | M | E | M } . Then with probability at least / , the new problemhas an unique solution and this solution is a solution for the original problem. Applying thisreduction and scaling the problem back to integral, we obtain the new cost vector such that thesolution is unique.For 1) Note that | V | M ≤ y i ≤ | V | M and | V | M ≤ z i ≤ | V | M . So, ( x, y, z ) is an interior point.For 2) Let Val = ˜ q (cid:62) x − | V | ˜ M F + λ (cid:0) (cid:62) y + 1 (cid:62) z (cid:1) be the objective value of ( x, y, z ) . Let Opt bethe objective value given by the minimum cost maximum flow.First, we prove the the total excess demand (cid:62) y + 1 (cid:62) z is small. Since ≤ F ≤ | V | M and | q (cid:62) x | ≤ | E | ˜ M M , we have that | ˜ q (cid:62) x − | V | ˜ M F | ≤ | E | ˜ M M (6.5)for both the algorithm and for the optimum flow. By assumption, we know that
Val ≤ Opt + M ,we have that λ (1 (cid:62) y + 1 (cid:62) z ) ≤ | E | ˜ M M.
Using λ = 440 | E | ˜ M M , we have that the total excess demand (cid:62) y + 1 (cid:62) z ≤ (cid:15) def = | E | ˜ MM .To route back the excess demand, we first scale the vector x, y, z and F by a − (cid:15) factor. Then,we create a spanning tree at s . At every vertex v , we route the excess demand from v back to thesource s in the tree. To route one unit of excess demand at v , we pay at most (cid:80) i ∈ P v | ˜ q i | where P v is the path from s to v on the tree. Since (cid:80) i ∈ P v | ˜ q i | ≤ | V | ˜ M , the cost we pay for routing is at mostthe potential decrease in the term λ (cid:0) (cid:62) y + 1 (cid:62) z (cid:1) . So the objective value of this new flow is at most (1 − (cid:15) ) Val ≤ Val + 5 (cid:15) | E | ˜ M M ≤ Val + 112 M ≤ Opt + 16 M where we used Val ≤ | E | ˜ M M due to (6.5). Since we scale the vectors by − (cid:15) factor, the flow isfeasible.Due to the routing above, we can assume the flow x has no excess demand with Val ≤ Opt + M .However, the flow x may not be the maximum flow. Imagine now, we send the extra flow from s to t to make x maximum. For every unit we send, we decrease the objective by at least | V | ˜ M due to the term − | V | ˜ M F and the fact that the cost of that unit of flow is at most | V | ˜ M . Since Val ≤ Opt + M , we can send at most M ·| V | ˜ M amount of extra flow. We call this new x as ˜ x and welet (cid:103) Val be the objective value for this ˜ x . Note that the procedure above only decrease the objectivevalue. So, we have again (cid:103) Val ≤ Opt + M . Finally, we note that ˜ x is a weighted combination ofmaximum flow from s to t . Since the minimum cost solution is unique and the cost are integral, thecombined weight contributed by non-minimum-cost flow is at most M . Since the flow is boundedby M , we know ˜ x is at most far from the minimum cost solution for all edges.For the total runtime, note that both the step ˜ x and the step of routing excess demand cannot39e omitted because it does not change the flow for every edge by more than / . So, we can simplyround every number to nearest integer, which takes linear work and constant depth.For Part 3, L is symmetric diagonally dominant. The result follows from [30, Theorem 9.2].Using the reduction mentioned above, one can obtain the promised minimum cost flow algorithm.(Further, using techniques from [11] this can be generalized to solving lossy flow problems.) Theorem 2 (Maximum Flow) . Given a directed graph G = ( V, E ) with integral costs q ∈ Z E andcapacities c ∈ Z E ≥ with (cid:107) q (cid:107) ∞ ≤ M and (cid:107) c (cid:107) ∞ ≤ M , we can compute a minimum cost maximum flowwith constant probability with O ( | E | (cid:112) | V | log | E | log M ) work and O ( (cid:112) | V | log | E | log M ) depth.Proof. Using the reduction (Lemma 44) and Theorem 1, we get an algorithm of minimum cost flowby solving O (cid:16)(cid:112) | V | log | E | · log( | V | M ) · T w (cid:17) work and O (cid:16)(cid:112) | V | log | E | · log( | V | M ) · T d (cid:17) depthwhere T w and T d are the work and the depth of solving linear systems. It is known that for interiorpoint methods, we only need to solve linear system with accuracy η = m O (1) ( η defined in (6.4))because each step of interior point method only need to decrease the centrality δ t by a constantfactor. Hence, Lemma 44 shows that each linear system takes T w = O (cid:0) | E | log | V | log( | V | ) (cid:1) work and T d = O (cid:0) log | V | log( | V | ) (cid:1) depth . Hence, we have the result.
Here we show how to combine the results of the preceding sections to obtain our main results ona polynomial time computable nearly-universal self-concordant barrier. We first provide and proofTheorem 3, a generalization of Theorem 3, and then show Theorem 3 as a special case. The resultsof this section use following result regarding computing Lewis weights proved in Appendix B.
Theorem 45 (Exact Weight Computation) . Let A ∈ R m × n be non-degenerate matrix and let (cid:15) ∈ (0 , and p ∈ (0 , ∞ ) . For all w (0) ∈ R m> with (cid:107) w − ( w p ( A ) − w (0) ) (cid:107) ∞ ≤ p p +2) , the al-gorithm computeExactWeight ( A , p, w (0) , (cid:15) ) (Algorithm 4) can be implemented to return w suchthat (cid:107) w p ( A ) − ( w p ( A ) − w ) (cid:107) ∞ ≤ (cid:15) in O ( mn ω − ( p + p − ) log( n (1 + p ) (cid:15) − )) work and O (( p + p − ) log( m ) log( n (1 + p ) (cid:15) − )) depth.Without w (0) , the algorithm computeInitialWeight ( A , p, (cid:15) ) (Algorithm 7) can be implementedto achieve the same guarantee with O ( mn ω − (1 / ( p + p − ) log( mn ) log( n(cid:15) − ( p + p − ))) work and O (( p + p − ) log( mn ) log( m ) log( n(cid:15) − ( p + p − ))) depth. Theorem 46.
Let Ω ◦ = { x : A x > b } denote the interior of non-empty polytope for non-degenerate A ∈ R m × n . There is an O ( n log m ) -self concordant barrier ψ defined using (cid:96) q Lewis weight with q = Θ(log m ) (See (5.1) ) satisfying A (cid:62) x W x A (cid:62) x (cid:22) ∇ ψ ( x ) (cid:22) ( q + 1) A (cid:62) x W x A (cid:62) x where A x = Diag ( A x − b ) and w x is the (cid:96) q Lewis weight of the matrix A x . Furthermore, we cancompute or update the w x , ∇ ψ ( x ) and ∇ ψ ( x ) as follows: • Initial Weight: For any x ∈ R n , we can compute a vector (cid:101) w x such that (1 − (cid:15) ) w x ≤ (cid:101) w x ≤ (1 + (cid:15) ) w x in O ( mn ω − (1 / · log m · log( m/(cid:15) )) -work and O ( √ n · log m · log( m/(cid:15) )) -depth. Update Weight and Compute Gradient/Hessian: Given a vector (cid:101) w x such that (cid:101) w x = (1 ± ) w x ,for any y with (cid:107) x − y (cid:107) A (cid:62) x W x A (cid:62) x ≤ c log m with some small constant c > , we can compute (cid:101) w y , v and H such that (cid:101) w y = (1 ± (cid:15) ) w y , (cid:107) v − ∇ ψ ( x ) (cid:107) ∇ ψ ( x ) − ≤ (cid:15) and (1 − (cid:15) ) ∇ ψ ( x ) (cid:22) H (cid:22) (1 + (cid:15) ) ∇ ψ ( x ) in O ( mn ω − · log m · log( m/(cid:15) )) -work and O (log m · log( m/(cid:15) )) -depth.Proof. Theorem 30 with q = log m shows that there is such a barrier function ψ that is O ( n log m ) .Lemma 31 shows that ∇ ψ ( x ) = − A (cid:62) x σ x and ∇ ψ ( x ) = A (cid:62) x Σ / x ( I + N x ) Σ / x A x . Note that σ x = w q ( A x ) and we can compute ˜ σ such that ˜ σ ∈ (1 ± (cid:15) √ n ) σ x in O (cid:16) mn ω − · log m · log m(cid:15) (cid:17) work and O (cid:0) n / · log m · log m(cid:15) (cid:1) depth using Theorem 45.For the update version, we let w x and w y be the Lewis weight corresponding to x and y .Picking c to be small enough constant, Lemma 34 shows that (cid:107) w − y ( w x − w y ) (cid:107) ∞ ≤ O ( c ) ≤ q q +2) .Hence, Theorem 45 shows that we can compute w x with O ( mn ω − · log m · log( m/(cid:15) )) -work and O (log m · log( m/(cid:15) )) -depth in this case.For the gradient, with the approximate Lewis weight, Lemma 32 shows that (cid:107)∇ ψ ( x ) + A (cid:62) x ˜ σ (cid:107) ∇ ψ ( x ) − ≤ (cid:15) √ n (cid:107)∇ ψ ( x ) (cid:107) ∇ ψ ( x ) − ≤ (cid:15). For the Hessian, we recall that N x = 2 ¯ Λ x ( I − (1 − q ) ¯ Λ x ) − and ¯ Λ x def = ¯ Λ ( Σ − q x A x ) . Followingcalculations in Lemma 37 and Lemma 38, one can check that replacing σ x by (1 ± (cid:15) ) σ x in the formulaof ∇ ψ ( x ) (via N x and ¯ Λ x ) only changes the matrix ∇ ψ ( x ) multiplicatively by ± (cid:15) log O (1) m . Hence,we can compute it again in the same work and depth.Leveraging this this theorem we prove, Theorem 3. Proof of Theorem 3.
This theorem is a specialization of Theorem 46.
We thank Yan Kit Chi, Michael B. Cohen, Jonathan A. Kelner, Aleksander Mądry, Richard Peng,and Nisheeth Vishnoi for helpful conversations. This work was partially supported by NSF awardsCCF-0843915 and CCF-1111109, NSF Graduate Research Fellowship (grant no. 1122374), HongKong RGC grant 2150701, CCF-1749609, CCF-1740551, DMS-1839116, CCF-1844855, and a Mi-crosoft Research Faculty Fellowship. Part of this work was done while both authors were visitingthe Simons Institute for the Theory of Computing, UC Berkeley.
References [1] Jacob Abernethy and Elad Hazan. Faster convex optimization: Simulated annealing with anefficient universal barrier. In
International Conference on Machine Learning , pages 2520–2528,2016.[2] Deeksha Adil, Rasmus Kyng, Richard Peng, and Sushant Sachdeva. Iterative refinement for (cid:96) p -norm regression. In Proceedings of the Thirtieth Annual ACM-SIAM Symposium on DiscreteAlgorithms, SODA 2019, San Diego, California, USA, January 6-9, 2019 , pages 1405–1424,2019. 413] Kurt M. Anstreicher. Volumetric path following algorithms for linear programming.
Math.Program. , 76:245–263, 1996.[4] Jean Bourgain, Joram Lindenstrauss, and V Milman. Approximation of zonoids by zonotopes.
Acta mathematica , 162(1):73–141, 1989.[5] Sébastien Bubeck and Ronen Eldan. The entropic barrier: a simple and optimal universal self-concordant barrier. In
Proceedings of The 28th Conference on Learning Theory, COLT 2015,Paris, France, July 3-6, 2015 , page 279, 2015.[6] Michael B Cohen, Rasmus Kyng, Gary L Miller, Jakub W Pachocki, Richard Peng, Anup BRao, and Shen Chen Xu. Solving sdd linear systems in nearly m log 1/2 n time. In
Proceedingsof the forty-sixth annual ACM symposium on Theory of computing , pages 343–352. ACM, 2014.[7] Michael B Cohen, Yin Tat Lee, Cameron Musco, Christopher Musco, Richard Peng, and AaronSidford. Uniform sampling for matrix approximation. In
Proceedings of the 2015 Conferenceon Innovations in Theoretical Computer Science , pages 181–190. ACM, 2015.[8] Michael B. Cohen, Yin Tat Lee, and Zhao Song. Solving linear programs in the current matrixmultiplication time.
CoRR , abs/1810.07896, 2018.[9] Michael B. Cohen and Richard Peng. (cid:96) p row sampling by lewis weights. In Proceedings of theForty-Seventh Annual ACM on Symposium on Theory of Computing, STOC 2015, Portland,OR, USA, June 14-17, 2015 , pages 183–192, 2015.[10] Richard Cole. Parallel merge sort.
SIAM Journal on Computing , 17(4):770–785, 1988.[11] Samuel I Daitch and Daniel A Spielman. Faster approximate lossy generalized flow via interiorpoint algorithms. In
Proceedings of the 40th annual ACM symposium on Theory of computing ,pages 451–460. ACM, 2008.[12] George B Dantzig. Maximization of a linear function of variables subject to linear inequalities.
New York , 1951.[13] Antoine Deza, Eissa Nematollahi, Reza Peyghami, and Tamás Terlaky. The central path visitsall the vertices of the klee–minty cube.
Optimisation Methods and Software , 21(5):851–865,2006.[14] Antoine Deza, Eissa Nematollahi, and Tamás Terlaky. How good are interior point methods?klee–minty cubes tighten iteration-complexity bounds.
Mathematical Programming , 113(1):1–14, 2008.[15] Petros Drineas, Malik Magdon-Ismail, Michael W Mahoney, and David P Woodruff. Fastapproximation of matrix coherence and statistical leverage.
Journal of Machine Learning Re-search , 13(Dec):3475–3506, 2012.[16] Shimon Even and R Endre Tarjan. Network flow and testing graph connectivity.
SIAM journalon computing , 4(4):507–518, 1975.[17] RobertM. Freund. Projective transformations for interior-point algorithms, and a superlinearlyconvergent algorithm for the w-center problem.
Mathematical Programming , 58(1-3):385–414,1993. 4218] Andrew V. Goldberg and Satish Rao. Beyond the flow decomposition barrier.
J. ACM ,45(5):783–797, 1998.[19] Clovis C Gonzaga. Path-following methods for linear programming.
SIAM review , 34(2):167–224, 1992.[20] Fritz John. Extremum problems with inequalities as subsidiary conditions, studies and essayspresented to r. courant on his 60th birthday, january 8, 1948, 187–204. 1948.[21] Narendra Karmarkar. A new polynomial-time algorithm for linear programming. In
Proceedingsof the sixteenth annual ACM symposium on Theory of computing , pages 302–311. ACM, 1984.[22] Alexander V Karzanov. On finding a maximum flow in a network with special structure andsome applications.
Matematicheskie Voprosy Upravleniya Proizvodstvom , 5:81–94, 1973.[23] Jonathan A. Kelner, Lorenzo Orecchia, Aaron Sidford, and Zeyuan Allen Zhu. A Simple,Combinatorial Algorithm for Solving SDD Systems in Nearly-Linear Time. January 2013.[24] Leonid G Khachiyan. Polynomial algorithms in linear programming.
USSR ComputationalMathematics and Mathematical Physics , 20(1):53–72, 1980.[25] Leonid G Khachiyan. Rounding of polytopes in the real number model of computation.
Math-ematics of Operations Research , 21(2):307–320, 1996.[26] Ioannis Koutis, Gary L. Miller, and Richard Peng. Approaching optimality for solving SDDsystems. In
Proceedings of the 51st Annual Symposium on Foundations of Computer Science ,2010.[27] Ioannis Koutis, Gary L. Miller, and Richard Peng. A nearly-m log n time solver for sdd linearsystems. In
Foundations of Computer Science (FOCS), 2011 IEEE 52nd Annual Symposiumon , pages 590 –598, oct. 2011.[28] Rasmus Kyng, Yin Tat Lee, Richard Peng, Sushant Sachdeva, and Daniel A. Spielman. Sparsi-fied cholesky and multigrid solvers for connection laplacians. In
Proceedings of the 48th AnnualACM SIGACT Symposium on Theory of Computing, STOC 2016, Cambridge, MA, USA, June18-21, 2016 , pages 842–850, 2016.[29] Rasmus Kyng and Sushant Sachdeva. Approximate gaussian elimination for laplacians-fast,sparse, and simple. In , pages 573–582. IEEE, 2016.[30] Yin Tat Lee, Richard Peng, and Daniel A Spielman. Sparsified cholesky solvers for sdd linearsystems. arXiv preprint arXiv:1506.08204 , 2015.[31] Yin Tat Lee and Aaron Sidford. Efficient accelerated coordinate descent methods and fasteralgorithms for solving linear systems. In
The 54th Annual Symposium on Foundations ofComputer Science (FOCS) , 2013.[32] Yin Tat Lee and Aaron Sidford. Path finding i: Solving linear programs with \ ˜ o (sqrt(rank))linear system solves. arXiv preprint arXiv:1312.6677 , 2013.[33] Yin Tat Lee and Aaron Sidford. Path finding ii: An \ ˜ o (m sqrt (n)) algorithm for the minimumcost flow problem. arXiv preprint arXiv:1312.6713 , 2013.4334] Yin Tat Lee and Aaron Sidford. Path-finding methods for linear programming : Solving linearprograms in õ(sqrt(rank)) iterations and faster algorithms for maximum flow. In , pages 424–433, 2014.[35] Yin Tat Lee and Aaron Sidford. Efficient inverse maintenance and faster algorithms for linearprogramming. In IEEE 56th Annual Symposium on Foundations of Computer Science, FOCS2015, Berkeley, CA, USA, 17-20 October, 2015 , pages 230–249, 2015.[36] Yin Tat Lee and Man-Chung Yue. Universal barrier is n -self-concordant. arXiv preprintarXiv:1809.03011 , 2018.[37] D. Lewis. Finite dimensional subspaces of l { p } . Studia Mathematica , 63(2):207–212, 1978.[38] Mu Li, Gary L Miller, and Richard Peng. Iterative row sampling. 2012.[39] Aleksander Madry. Navigating central path with electrical flows: from flows to matchings,and back. In
Proceedings of the 54th Annual Symposium on Foundations of Computer Science ,2013.[40] Michael W. Mahoney. Randomized algorithms for matrices and data.
Foundations and Trendsin Machine Learning , 3(2):123–224, 2011.[41] Nimrod Megiddo. Pathways to the optimal set in linear programming. In Nimrod Megiddo,editor,
Progress in Mathematical Programming , pages 131–158. Springer New York, 1989.[42] Murat Mut and Tamás Terlaky. A tight iteration-complexity upper bound for the mty predictor-corrector algorithm via redundant klee-minty cubes. 2013.[43] Jelani Nelson and Huy L Nguyên. Osnap: Faster numerical linear algebra algorithms via sparsersubspace embeddings. arXiv preprint arXiv:1211.1002 , 2012.[44] Eissa Nematollahi and Tamás Terlaky. A redundant klee–minty construction with all theredundant constraints touching the feasible region.
Operations Research Letters , 36(4):414–418, 2008.[45] Eissa Nematollahi and Tamás Terlaky. A simpler and tighter redundant klee–minty construc-tion.
Optimization Letters , 2(3):403–414, 2008.[46] Yu Nesterov.
Introductory Lectures on Convex Optimization: A Basic Course , volume I. 2003.[47] Yu Nesterov and Arkadi Nemirovskiy.
Self-concordant functions and polynomial-time methodsin convex programming . USSR Academy of Sciences, Central Economic & Mathematic Institute,1989.[48] Yu E Nesterov and Michael J Todd. Self-scaled barriers and interior-point methods for convexprogramming.
Mathematics of Operations research , 22(1):1–42, 1997.[49] Yurii Nesterov and Arkadii Semenovich Nemirovskii.
Interior-point polynomial algorithms inconvex programming , volume 13. Society for Industrial and Applied Mathematics, 1994.[50] Richard Peng and Daniel A Spielman. An efficient parallel solver for sdd linear systems. arXivpreprint arXiv:1311.3286 , 2013. 4451] James Renegar. A polynomial-time algorithm, based on newton’s method, for linear program-ming.
Mathematical Programming , 40(1-3):59–93, 1988.[52] Alexander Schrijver.
Combinatorial optimization: polyhedra and efficiency , volume 24.Springer, 2003.[53] Daniel A Spielman and Nikhil Srivastava. Graph sparsification by effective resistances.
SIAMJournal on Computing , 40(6):1913–1926, 2011.[54] Daniel A Spielman and Shang-Hua Teng. Nearly-linear time algorithms for graph partitioning,graph sparsification, and solving linear systems. In
Proceedings of the thirty-sixth annual ACMsymposium on Theory of computing , pages 81–90. ACM, 2004.[55] Michael J Todd. Scaling, shifting and weighting in interior-point methods.
ComputationalOptimization and Applications , 3(4):305–315, 1994.[56] Pravin M. Vaidya. A new algorithm for minimizing convex functions over convex sets (extendedabstract). In
FOCS , pages 338–343, 1989.[57] Pravin M Vaidya. Speeding-up linear programming using fast matrix multiplication. In
Foun-dations of Computer Science, 1989., 30th Annual Symposium on , pages 332–337. IEEE, 1989.[58] Pravin M Vaidya. An algorithm for linear programming which requires o (((m+ n) n 2+(m+n) 1.5 n) l) arithmetic operations.
Mathematical Programming , 47(1-3):175–201, 1990.[59] Pravin M. Vaidya. Reducing the parallel complexity of certain linear programming problems(extended abstract). In
FOCS , pages 583–589, 1990.[60] Pravin M Vaidya. A new algorithm for minimizing convex functions over convex sets.
Mathe-matical Programming , 73(3):291–341, 1996.[61] Pravin M Vaidya and David S Atkinson. A technique for bounding the number of iterations inpath following algorithms.
Complexity in Numerical Optimization , pages 462–489, 1993.[62] Stephen A Vavasis and Yinyu Ye. A primal-dual interior point method whose running timedepends only on the constraint matrix.
Mathematical Programming , 74(1):79–120, 1996.[63] Santosh S Vempala. Recent progress and open problems in algorithmic convex geometry. In
LIPIcs-Leibniz International Proceedings in Informatics , volume 8. Schloss Dagstuhl-Leibniz-Zentrum fuer Informatik, 2010.[64] Virginia Vassilevska Williams. Multiplying matrices faster than coppersmith-winograd. In
Proceedings of the forty-fourth annual ACM symposium on Theory of computing , pages 887–898. ACM, 2012.[65] David P Woodruff et al. Sketching as a tool for numerical linear algebra.
Foundations andTrends R (cid:13) in Theoretical Computer Science , 10(1–2):1–157, 2014.[66] Yinyu Ye. Interior point algorithms: theory and analysis , volume 44. John Wiley & Sons, 2011.45
Projection Matrices, Leverages Scores, and log det
In this section, we prove various properties of projection matrices, leverage scores, and the logarithmof the determinant that we use throughout the paper.First we provide the following theorem which gives various properties of projection matrices andleverage scores.
Lemma 47 (Projection Matrices) . Let P ∈ R m × m be an arbitrary orthogonal projection matrix andlet Σ = Diag ( P ) . For all i, j ∈ [ m ] , x, y ∈ R m , and X = Diag ( x ) we have (1) Σ ii = (cid:80) j ∈ [ m ] P (2) ij (5) (cid:107) Σ − P (2) x (cid:107) ∞ ≤ (cid:107) x (cid:107) ∞ (2) (cid:22) P (2) (cid:22) Σ (cid:22) I , ( in particular, ≤ Σ ii ≤
1) (6) (cid:80) i ∈ [ m ] Σ ii = rank( P )(3) P (2) ij ≤ Σ ii Σ jj (7) (cid:12)(cid:12) y (cid:62) XP (2) y (cid:12)(cid:12) ≤ (cid:107) y (cid:107) Σ · (cid:107) x (cid:107) Σ (4) (cid:107) Σ − P (2) x (cid:107) ∞ ≤ (cid:107) x (cid:107) Σ (8) (cid:12)(cid:12) y (cid:62) ( P ◦ PXP ) y (cid:12)(cid:12) ≤ (cid:107) y (cid:107) Σ · (cid:107) x (cid:107) Σ . Proof.
To prove (1), we simply note that by definition of a projection matrix P = PP and therefore Σ ii = P ii = e (cid:62) i P e i = e (cid:62) i PP e i = (cid:88) j ∈ [ m ] P ij = (cid:88) j ∈ [ m ] P (2) ij . To prove (2), we observe that since P is a projection matrix, all its eigenvalues are either 0 or1. Therefore, Σ (cid:22) I and by (1) Σ − P (2) is diagonally dominant. Consequently, Σ − P (2) (cid:23) .Rearranging terms and using the well known fact that the Shur product of two positive semi-definitematrices is positive semi-definite yields (2).To prove (3), we use P = PP , Cauchy-Schwarz, and (1) to derive P ij = (cid:88) k ∈ [ m ] P ik P kj ≤ (cid:118)(cid:117)(cid:117)(cid:117)(cid:116) (cid:88) k ∈ [ m ] P ik (cid:88) k ∈ [ m ] P kj = (cid:112) Σ ii Σ jj . Squaring then yields (3).To prove (4), we note that by the definition of P (2) and Cauchy-Schwarz, we have (cid:12)(cid:12)(cid:12) e (cid:62) i P (2) x (cid:12)(cid:12)(cid:12) = (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) (cid:88) j ∈ [ m ] P (2) ij x j (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ≤ (cid:118)(cid:117)(cid:117)(cid:117)(cid:116) (cid:88) j ∈ [ m ] Σ jj x j · (cid:88) j ∈ [ m ] P (4) ij Σ jj . (A.1)Now, by (1) and (3), we know that (cid:88) j ∈ [ m ] P ij Σ jj ≤ (cid:88) j ∈ [ m ] P ij Σ ii Σ jj Σ jj = Σ ii (cid:88) j ∈ [ m ] P ij = Σ ii . (A.2)Since (cid:107) x (cid:107) Σ def = (cid:113)(cid:80) j ∈ [ m ] Σ jj x j , combining (A.1) and (A.2) yields (cid:12)(cid:12) e (cid:62) i P (2) x (cid:12)(cid:12) ≤ Σ ii (cid:107) x (cid:107) Σ as desired.To prove (5), we note that (cid:12)(cid:12)(cid:12) e (cid:62) i P (2) x (cid:12)(cid:12)(cid:12) = (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) (cid:88) j ∈ [ m ] P (2) ij x j (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ≤ (cid:88) j ∈ [ m ] P (2) ij | x j | = Σ ii (cid:107) x (cid:107) ∞
46o prove (6), we note that all the eigenvalues of P are either 0 or 1 and (cid:80) i ∈ [ m ] Σ ii = tr( P ) .To prove (7), we apply (4) and Cauchy Schwarz to show (cid:12)(cid:12)(cid:12) y (cid:62) XP (2) y (cid:12)(cid:12)(cid:12) = (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) (cid:88) i ∈ [ m ] x i · y i (cid:62) i P (2) y (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ≤ (cid:88) i ∈ [ m ] | x i | · | y i | · Σ ii · (cid:107) y (cid:107) Σ ≤ (cid:107) x (cid:107) Σ (cid:107) y (cid:107) Σ (cid:107) y (cid:107) Σ . To prove (8), we note that by Cauchy Schwarz (cid:12)(cid:12)(cid:12) y (cid:62) ( P ◦ PXP ) y (cid:12)(cid:12)(cid:12) = (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) (cid:88) i,j ∈ [ m ] y i y j P ij (cid:88) k ∈ [ m ] P ik P jk x k (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ≤ (cid:118)(cid:117)(cid:117)(cid:117)(cid:116) (cid:88) i,j ∈ [ m ] | y i | · | y j | · P ij · (cid:88) i,j ∈ [ m ] | y i | · | y j | · (cid:88) k ∈ [ m ] P ik P jk x k . Letting | x | and | y | be the vectors whose entries are the absolute values of the entries of x and y werespectively, see that by (2) we have (cid:88) i,j ∈ [ m ] | y i | · | y j | · P ij = (cid:107)| y |(cid:107) P (2) ≤ (cid:107)| y |(cid:107) Σ = (cid:107) y (cid:107) Σ and (cid:88) i,j ∈ [ m ] | y i | · | y j | · (cid:88) k ∈ [ m ] P ik P jk x k = (cid:88) i,j ∈ [ m ] (cid:88) k ∈ [ m ] (cid:104) P ik (cid:112) | y i || x k | (cid:105) (cid:20) P jk (cid:113) | y j || x k | (cid:21) . Applying Cauchy Schwarz twice then yields that (cid:88) i,j ∈ [ m ] | y i | · | y j | · (cid:88) k ∈ [ m ] P ik P jk x k ≤ (cid:88) i,k ∈ [ m ] | y i | P ik | x k | = (cid:16) | y | (cid:62) P (2) | x | (cid:17) ≤ (cid:107)| y |(cid:107) P (2) (cid:107)| x |(cid:107) P (2) ≤ (cid:107)| y |(cid:107) Σ (cid:107)| x |(cid:107) Σ = (cid:107) y (cid:107) Σ (cid:107) x (cid:107) Σ . Combining these inequalities than yields the desired bound on (cid:12)(cid:12) y (cid:62) ( P ◦ PXP ) y (cid:12)(cid:12) .Next, we derive various matrix calculus formulas relating the projection matrix with the logdeterminant. We start by computing the derivative of the volumetric barrier function , f ( w ) def =log det( A (cid:62) WA ) . Lemma 48 (Derivative of Volumetric Barrier) . For full rank matrix A ∈ R n × m let f : R m> → R begiven by f ( w ) def = log det( A (cid:62) WA ) . For any w ∈ R m> , we have ∇ f ( w ) = W − σ ( W A ) .Proof. Using the derivative of log det , we have that for all i ∈ [ m ] ∂f ( w ) ∂w i = tr (cid:20)(cid:16) A (cid:62) WA (cid:17) − ∂∂w i (cid:16) A (cid:62) WA (cid:17)(cid:21) = tr (cid:20)(cid:16) A (cid:62) WA (cid:17) − A (cid:62) e i e (cid:62) i A (cid:21) = w − i σ (cid:16) W A (cid:17) i . Lemma 49 (Derivative of Projection Matrix) . Given full rank A ∈ R n × m and w ∈ R m> we have D w P ( WA )[ h ] = ∆P ( WA ) + P ( WA ) ∆ − P ( WA ) ∆P ( WA ) where W = Diag ( w ) and ∆ = Diag ( h/w ) . In particular, we have that D w σ ( WA )[ h ] = 2 Λ ( WA ) W − h. Proof.
Note that P ( WA ) = WA ( A (cid:62) W A ) − A (cid:62) W . Using the derivative of matrix inverse, we have that D w P ( WA )[ h ] = HA ( A (cid:62) W A ) − A (cid:62) W + WA ( A (cid:62) WA ) − A (cid:62) H − WA ( A (cid:62) W A ) − A (cid:62) HWA ( A (cid:62) W A ) − A (cid:62) W = ∆P + P∆ − P∆P . Consequently, D w σ ( WA ) i [ h ] = [ D w P ( WA )[ h ]] ii = 2 ∆ ii P ii − P∆P ) ii = 2 σ i h i w i − (cid:88) j ∈ [ m ] P ij ∆ jj = 2 (cid:104)(cid:16) Σ − P (2) (cid:17) ( h/w ) (cid:105) i = 2( Λ ( h/w )) i . In the following lemma we provide a general formula regarding the derivative of a function thatappears throughout the paper.
Lemma 50 (Potential Function Derivative) . For non-degenerate A ∈ R m × n and q > with q (cid:54) = 2 let p ( x, w ) def = ln det( A (cid:62) x W − q A x ) for all x ∈ R n with A x > b and all w ∈ R m> where A x def = S − x A , S x = Diag ( A x − b ) , and w ∈ Diag ( w ) . Then, the following hold ∇ x p ( x, w ) = − A (cid:62) x σ x,w , ∇ w p ( x, w ) = c q W − σ x,w , ∇ xx p ( x, w ) = A (cid:62) x (2 Σ x,w + 4 Λ x,w ) A x , ∇ ww p ( x, w ) = − c q W − ( Σ x,w − c q Λ x,w ) W − , and ∇ xw p ( x, w ) = − c q A (cid:62) x Λ x,w W − where, c q def = 1 − q , σ x,w def = σ ( W − q A x ) , Σ x,w def = Σ ( W − q A x ) , and Λ x,w def = Λ ( W − q A x ) .Proof. To simplify the calculations, throughout this proof we overload notation and let p ( s, w ) def =ln det( A (cid:62) S − W − q S − A ) where S def = Diag ( s ) and let s x def = A x − b . Since S x = Diag ( s x ) and s x isa linear transformation of x this implies that the derivatives with respect to x to follow immediatelyfrom the derivatives with respect to s .For ∇ x p ( x, w ) , Lemma 48 shows that ∂∂s i p ( s x , w ) = s i w − q i [ σ x,w ] i · ( − s − i w − q i ) = − σ x,w ] i s i . (A.3)Therefore ∇ x p ( x, w ) = − A (cid:62) x σ x,w by chain rule.48or ∇ w p ( x, w ) , Lemma 48 and chain rule shows that ∂∂w i p ( x, w ) = s i w − q i [ σ x,w ] i · (cid:18) s − i (cid:18) − q (cid:19) w − q i (cid:19) = c q [ σ x,w ] i w i . (A.4)Therefore ∇ w p ( x, w ) = c q W − σ x,w .For ∇ xx p ( x, w ) , the formula for ∂∂s i p ( s x , w ) given by (A.3) and Lemma 49 yields ∂ ∂s i ∂s j p ( s x , w ) = 2 [ σ x,w ] i s i i = j − Λ x,w ] ij s i · s j w − + q j · ( − s − j w − q j )= 2 [ σ x,w ] i s i i = j + 4 [ Λ x,w ] ij s i s j . Therefore, ∇ xx p ( x, w ) = A (cid:62) x (2 Σ x,w + 4 Λ x,w ) A x .For ∇ ww p ( x, w ) , the formula for ∂∂w i p ( x, w ) given by (A.4) and Lemma 49 yields ∂ ∂w i ∂w j p ( x, w ) = − c q [ σ x,w ] i w i i = j + 2 c q [ Λ x,w ] ij w i · s j w − + q j · (cid:18) − p (cid:19) ( s − j w − − q j )= − c q [ σ x,w ] i w i i = j + c q [ Λ x,w ] ij w i w j . Therefore ∇ ww p ( x, w ) = − c q W − ( Σ x,w − c q Λ x,w ) W − .For ∇ xw p ( x, w ) , the formula for ∂∂s i p ( s x , w ) given by (A.3) and Lemma 49 yield ∂ ∂s i ∂w j p ( s x , w ) = − Λ x,w ] ij s i · s j w − + q j · (cid:18) − p (cid:19) ( s − j w − − q j ) = − c q [ Λ x,w ] ij s i w j . Therefore, − c q A (cid:62) x Λ x,w W − by chain rule. Lemma 51.
For any vector v , any positive vector w and matrix A , we have that arg min A (cid:62) x =0 v (cid:62) x + 12 (cid:107) x (cid:107) w = x ∗ def = − W − v + W − A ( A (cid:62) W − A ) − A (cid:62) W − v. Proof.
Let f ( x ) def = v (cid:62) x + (cid:107) x (cid:107) w . Note that ∇ f ( x ) = v + W x and consequently, x ∈ ker( A (cid:62) ) isoptimal if and only if v + W x ⊥ ker( A (cid:62) x ) , i.e. v + W x ∈ im( A ) , and A (cid:62) x = 0 . Since A (cid:62) x ∗ = 0 and w + W x = A ( A (cid:62) W − A ) − A (cid:62) W − v ∈ im( A (cid:62) ) the result follows. B Lewis Weight Computation
Here, we describe how to efficiently compute approximations to Lewis weights and ultimately proveTheorem 39 and Theorem 45 (the Lewis weight computation results claimed and used in Section 6).We achieve our results by a combination of a number of technical tools, including projected gradi-ent descent (for computing Lewis weights exactly in Section B.1 given a good initial weight), theJohnson-Lindenstrauss lemma (for computing Lewis weights approximately in Section B.2 givena good initial weight), and homotopy methods (for computing initial weights and completing theproofs of the main theorems in Section B.3).Throughout the remainder of this section we let A ∈ R m × n denote an arbitrary non-degeneratematrix and p ∈ (0 , ∞ ) with p (cid:54) = 2 . Further we let w p def = w p ( A ) and W p def = Diag ( w p ) .49 .1 Exact Computation Since Lewis weight can be found by the minimizer of a convex optimization problem (Lemma 22),we can use the gradient descent method directly to minimize V A p ( w ) . Indeed, in this section weshow how applying the gradient descent method in a carefully scaled space allows us to compute theweight to good accuracy in (cid:101) O (poly( p )) iterations. This results makes two assumptions to computethe weight: (1) we compute the gradient of V A p ( w ) exactly and (2) we are given a weight that is nottoo far from the true weight. In the remaining subsection we show how to address these issues.First we state the following theorem regarding gradient descent method we use in our analysis.This theorem shows that if we take repeated projected gradient steps then we can achieve linearconvergence up to bounds on how much the Hessian of the function changes over the domain ofinterest. Theorem 52 (Simple Constrained Minimization for Twice Differentiable Function) . Let H be apositive definite matrix and Q ⊆ R m be a convex set. Let f : Q → R be a twice differentiable function.Suppose that there are constants ≤ µ ≤ L such that for all x ∈ Q we have µ · H (cid:22) ∇ f ( x ) (cid:22) L · H .For any x (0) ∈ Q and any k ≥ if we apply the update rule x ( k +1) = arg min x ∈ Q f ( x ( k ) ) + ∇ f ( x ( k ) ) (cid:62) ( x − x ( k ) ) + L (cid:107) x − x ( k ) (cid:107) H then it follows that (cid:107) x ( k ) − x ∗ (cid:107) H ≤ (cid:16) − µL (cid:17) k (cid:107) x (0) − x ∗ (cid:107) H . To apply Theorem 52 to compute Lewis weight, we first recall from Lemma 22 that the Lewisweight w p ( A ) is the unique minimizer of the convex problem, min w i ≥ V A p ( w )+ (cid:80) i ∈ [ m ] w i . Therefore,to apply this result we first need to show that there is a region around the optimal point w p suchthat the Hessian of V A p ( w ) does not change too much. Lemma 53 (Hessian Approximation) . If w ∈ R m ≥ satisfies (cid:107) W − ( w p − w ) (cid:107) ∞ ≤ p p +2) for thematrix W def = Diag ( w ) then min (cid:26) , p (cid:27) W − (cid:22) ∇ V A p ( w ) (cid:22) max (cid:26) , p (cid:27) W − . Proof.
Using that (cid:107) W − ( w p − w ) (cid:107) ∞ ≤ p p +2) and letting V = Diag ( w p ) , we have Σ w = Σ (cid:16) W − p A (cid:17) (cid:22) (1 + p p +2) ) (cid:12)(cid:12)(cid:12) − p (cid:12)(cid:12)(cid:12) (1 − p p +2) ) (cid:12)(cid:12)(cid:12) − p (cid:12)(cid:12)(cid:12) Σ (cid:16) V − p A (cid:17) (cid:22) V (cid:22) W and Σ w = Σ (cid:16) W − p A (cid:17) (cid:23) (1 − p p +2) ) (cid:12)(cid:12)(cid:12) − p (cid:12)(cid:12)(cid:12) (1 + p p +2) ) (cid:12)(cid:12)(cid:12) − p (cid:12)(cid:12)(cid:12) Σ (cid:16) V − p A (cid:17) (cid:23) V (cid:23) W . The result therefore follows immediately from Lemma 23.Combining Theorem 52 and Lemma 53, we get the following algorithm to compute the weightfunction using the exact computation of the gradient of V A p .50 emma 54. Let w (0) ∈ R m> such that (cid:107) W − ( w p − w (0) ) (cid:107) ∞ ≤ r where r def = p p +2) . For all k ≥ let L def = max { , p } and w ( k +1) def = median (cid:32) (1 − r ) w (0) , w ( k ) − L (cid:32) w (0) − w (0) w ( k ) σ (cid:18) W − p ( k ) A (cid:19)(cid:33) , (1 + r ) w (0) (cid:33) (B.1) where [ median ( x, y, z )] i is the median of x i , y i and z i for all i ∈ [ m ] . For all k , we have (cid:107) w ( k ) − w p (cid:107) W − p ≤ n · (cid:32) − p + p ) (cid:33) k (cid:107) W − ( w p − w (0) ) (cid:107) ∞ . Proof.
Let Q def = { w ∈ R m | (cid:107) W − ( w − w (0) ) (cid:107) ∞ ≤ r } and W ( k ) def = Diag ( w ( k ) ) . Applied to min w i ≥ V A p ( w ) + (cid:80) mi =1 w i , by Lemma 23, iterations of Theorem 52 are w ( k +1) = arg min w ∈ Q (cid:28) − w − k ) σ (cid:18) W − p ( k ) A (cid:19) , w (cid:29) + L (cid:13)(cid:13)(cid:13) w − w ( k ) (cid:13)(cid:13)(cid:13) W − = arg min w ∈ Q (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) w − w ( k ) + 1 L (cid:32) w (0) − w (0) w ( k ) σ (cid:18) W − p ( k ) A (cid:19)(cid:33)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) W − . Since the objective function and the constraints are axis-aligned, we can compute w ( k +1) coordinate-wise and we see that this is the same as in the statement of this lemma.To apply Theorem 52, we note that (cid:107) W − ( w p − w (0) ) (cid:107) ∞ ≤ p p +2) implies that any w ∈ Q satisfies (cid:107) W − ( w p − w ) (cid:107) ∞ ≤ p p +2) and hence Lemma 53 shows that min (cid:26) , p (cid:27) W − (cid:22) min (cid:26) , p (cid:27) W − (cid:22) ∇ V ( w ) (cid:22) max (cid:26) , p (cid:27) W − (cid:22) max (cid:26) , p (cid:27) W − . (B.2)Hence, Theorem 52 and inequality (B.2) shows that (cid:107) w ( k ) − w p (cid:107) W − ≤ (cid:32) − min { , p } max(4 , p ) (cid:33) k (cid:107) w (0) − w p (cid:107) W − ≤ (cid:32) − p + p ) (cid:33) k (cid:107) w (0) − w p (cid:107) W − . The result follows as w (0) is close to w p multiplicatively and therefore (cid:107) w (0) − w p (cid:107) W − p ≤ (cid:88) i ∈ [ n ] w (0) i · (cid:107) W − ( w p − w (0) ) (cid:107) ∞ ≤ n (cid:107) W − ( w p − w (0) ) (cid:107) ∞ . Note that the lemma does not shows that w ( k ) is a multiplicative approximation of w p . Thefollowing lemma shows that we can use w ( k ) to get a multiplicative approximation. Lemma 55.
Given w such that (cid:107) W − p ( w p − w ) (cid:107) ∞ ≤ p p +2) and that (cid:107) w − w p (cid:107) W − p ≤ p ) √ n . et (cid:98) w = (diag( A ( A (cid:62) W − p A ) − A (cid:62) )) p . Then, we have that (cid:13)(cid:13) W − p ( (cid:98) w − w p ) (cid:13)(cid:13) ∞ ≤ (cid:18) p (cid:19) √ n · (cid:107) w − w p (cid:107) W − p . Proof.
The definition of (cid:98) w is motivated from the equality w p = (diag( A ( A (cid:62) W − p p A ) − A (cid:62) )) p . Toshow (cid:98) w is multiplicative close to w p , it therefore suffices to prove that A (cid:62) W − p p A is multiplicativelyclose to A (cid:62) W − p A . Note that (1 − α ) A (cid:62) W − p p A (cid:22) A (cid:62) W − p A (cid:22) (1 + α ) A (cid:62) W − p p A with α ≤ tr (cid:20) ( A (cid:62) W − p p A ) − ( A (cid:62) | W − p − W − p p | A ) (cid:21) = (cid:88) i ∈ [ n ] σ i ( W − p p A )[ w p ] − /pi (cid:12)(cid:12)(cid:12) w − /pi − [ w p ] − /pi (cid:12)(cid:12)(cid:12) . Since (cid:107) W − p ( w p − w ) (cid:107) ∞ ≤ p p +2) we have that for all i ∈ [ n ] (cid:12)(cid:12)(cid:12) w − /pi − [ w p ] − /pi (cid:12)(cid:12)(cid:12) ≤ (cid:12)(cid:12)(cid:12)(cid:12) − p (cid:12)(cid:12)(cid:12)(cid:12) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) w i − [ w p ] i [ w p ] /pi (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) and therefore, by Cauchy Schwarz and that (cid:80) i ∈ [ n ] σ i ( W − pp A ) [ w p ] i = (cid:80) i ∈ [ n ] σ i ( W − p p A ) = n we have α ≤ (cid:12)(cid:12)(cid:12)(cid:12) − p (cid:12)(cid:12)(cid:12)(cid:12) (cid:88) i ∈ [ n ] σ i ( W − p p A )[ w p ] − /pi (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) w i − [ w p ] i [ w p ] /pi (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ≤ (cid:12)(cid:12)(cid:12)(cid:12) − p (cid:12)(cid:12)(cid:12)(cid:12) (cid:118)(cid:117)(cid:117)(cid:117)(cid:116) (cid:88) i ∈ [ n ] σ i ( W − p p A ) [ w p ] i (cid:118)(cid:117)(cid:117)(cid:116) (cid:88) i ∈ [ n ] ( w i − [ w p ] i ) [ w p ] i = 2 (cid:12)(cid:12)(cid:12)(cid:12) − p (cid:12)(cid:12)(cid:12)(cid:12) √ n · δ . The result follows from w p = (diag( A ( A (cid:62) W − p p A ) − A (cid:62) )) p , that (1 − | − (2 /p ) | δ √ n ) − /p ≥ − p ) √ nδ , and that (1 + 2 | − (2 /p ) | δ √ n ) − /p ≤ p ) √ nδ .Combining Lemma 54 and Lemma 55 yields the following Theorem 56, the main result of thissection on weight computation. Algorithm 4: w = computeExactWeight ( A , p, w (0) , (cid:15) ) Let T = (cid:108) p + p ) log(8 n (1 + p ) (cid:15) − ) (cid:109) , r = p p +2) , and L = max { , p } for k = 1 , · · · , T − do w ( k +1) = median (cid:18) (1 − r ) w (0) , w ( k ) − L (cid:18) w (0) − w (0) w ( k ) σ (cid:18) W − p ( k ) A (cid:19)(cid:19) , (1 + r ) w (0) (cid:19) endOutput: (diag( A ( A (cid:62) W − p ( T ) A ) − A (cid:62) )) p for W ( T ) = Diag ( w ( T ) ) . Theorem 56 (Exact Weight Updates) . For all (cid:15) ∈ (0 , and w (0) ∈ R m> with (cid:107) w − ( w p ( A ) − w (0) ) (cid:107) ∞ ≤ p p +2) the algorithm computeExactWeight ( A , p, w (0) , (cid:15) ) (Algorithm 4) outputs w ∈ R m> with (cid:107) w p ( A ) − ( w p ( A ) − w (0) (cid:107) ∞ ≤ (cid:15) in O (( p + p ) log( n (1 + p ) (cid:15) − )) iterations, where each iteration nvolves computing σ ( VA ) for diagonal matrix V and extra linear time work and O (1) depth.Proof. This result follows immediately from Lemma 54 and Lemma 55.
B.2 Approximate Computation
Here we show how to modify the algorithm and analysis of the previous subsection to use ap-proximate leverage scores instead of exact leverage score in computing gradient. Further, we showhow to use the Johnson-Lindenstrauss lemma to compute approximate leverage scores efficientlyusing a linear system solver. Together, these results give us efficient algorithms for improving theapproximation quality of Lewis weights.To analyze our algorithm, computeApxWeight (Algorithm 5) given below, we first give a lemmashowing that the optimality condition σ ( W − p A ) i /w i is stable under changes to w . Algorithm 5: w = computeApxWeight ( A , p, w (0) , (cid:15) ) L = max { , p } , r = p (4 − p )2 and δ = (4 − p ) (cid:15) .Let the number of iterations T = (cid:108) p + p ) log (cid:0) pn (cid:15) (cid:1)(cid:109) . for j = 1 , · · · , T − do Compute σ ( j ) ∈ R n such that e − δ σ ( W − p ( j ) A ) i ≤ σ ( j ) i ≤ e δ σ ( W − p ( j ) A ) i for all i ∈ [ n ] . w ( j +1) = median (cid:16) (1 − r ) w (0) , w ( j ) − L (cid:16) w (0) − w (0) w ( j ) σ ( j ) (cid:17) , (1 + r ) w (0) (cid:17) . endOutput: (diag( A ( A (cid:62) W − p ( T ) A ) − A (cid:62) )) p . Lemma 57.
Let w, v ∈ R m> with w i = e δ i v i for | δ i | ≤ δ for all i ∈ [ n ] . Then, for all i ∈ [ n ] e p δ i −| − p | δ · σ ( W − p A ) i w i ≤ σ ( V − p A ) i v i ≤ e p δ i + | − p | δ · σ ( W − p A ) i w i . Proof.
Note that v − i σ ( V − p A ) i = v − p i a (cid:62) i ( A (cid:62) V − p A ) − a i where a i is the i -th row of A . By theassumptions on w, v ∈ R m> we have v − i σ ( V − p A ) i = e p δ i w − p i a (cid:62) i ( A (cid:62) V − p A ) − a i ≤ e p δ i + | − p | δ w − p i a (cid:62) i ( A (cid:62) W − p A ) − a i . and the lower bound on σ ( V − p A ) i follows similarly. Theorem 58 (Approximate Weight Computation) . If p ∈ (0 , and w (0) ∈ R m> satisfies (cid:107) w − ( w p ( A ) − w (0) ) (cid:107) ∞ ≤ r where r = p (4 − p )2 . For < (cid:15) < p − | − p | , the algorithm computeApxWeight ( x, w (0) , (cid:15) ) returns w such that (cid:107) w p ( A ) − ( w p ( A ) − w ) (cid:107) ∞ ≤ (cid:15) in O ( p − log( np − (cid:15) − )) steps. Each step involvescomputing σ up to ± Θ((4 − p ) · (cid:15) ) multiplicative error with some extra linear time work.Proof. Consider an execution of computeApxWeight ( x, w (0) , (cid:15) ) where there is no error in computingleverages scores, i.e. σ ( j ) = σ ( W − p ( j ) A ) , and let v ( j ) denote the w computed during this idealizedexecution of computeApxWeight . We will show that w ( j ) and v ( j ) are multiplicatively close.53uppose that w ( j ) i = e δ ( j ) i v ( j ) i with | δ ( j ) i | ≤ δ ( j ) for some δ ( j ) ≥ . Define v ( j +1) , w ( j +1) ∈ R m> tobe v ( j +1) and w ( j +1) i before taking the median, i.e. v ( j +1) i = v ( j ) − L (cid:32) w (0) − w (0) v ( j ) σ ( V − p ( j ) A ) (cid:33) and w ( j +1) i = w ( j ) − L (cid:32) w (0) − w (0) w ( j ) σ ( j ) (cid:33) Using ± δ to denote a real value with magnitude at most δ and applying Lemma 57 with v = v ( j ) and w = w ( j ) , we have w ( j +1) i − v ( j +1) i = w ( j ) − v ( j ) + w (0) L e ± δ σ ( W − p ( j ) A ) w ( j ) − σ ( V − p ( j ) A ) v ( j ) = ( e δ ( j ) i − v ( j ) i + w (0) L (cid:18) e − p δ ( j ) i ±| − p | δ ( j ) ± δ − (cid:19) · σ ( V − p ( j ) A ) v ( j ) . (B.3)Since (cid:107) W − ( w (0) − v ( j ) ) (cid:107) ∞ ≤ r and that (cid:107) W − ( w (0) − w p ( A )) (cid:107) ∞ ≤ r , we have that w p ( A ) = e ± r v ( j ) . Lemma 57 shows that for w = w p ( A ) we have e − p + | − p | ) r · w − i σ ( W − p A ) i ≤ ( v ( j ) i ) − σ ( V − p ( j ) A ) i ≤ e p + | − p | ) r · w − i σ ( W − p A ) i . (B.4)where Using that w i = σ ( W − p A ) i , (B.4), and (B.3), we have that w ( j +1) i − v ( j +1) i = ( e δ ( j ) i − v ( j ) i + w (0) L (cid:18) e − p δ ( j ) i ±| − p | δ ( j ) ± δ − (cid:19) e ± p ) r Since w ( j +1) and v ( j +1) are just truncation of w ( j +1) and v ( j +1) , we have the same bound for w ( j +1) i − v ( j +1) i . Using w ( j +1) i = e δ ( j +1) i v ( j +1) i , we get that ( e δ ( j +1) i − v ( j +1) i = ( e δ ( j ) i − v ( j ) i + w (0) L (cid:18) e − p δ ( j ) i ±| − p | δ ( j ) ± δ − (cid:19) e ± p ) r . Finally, we note that v ( j +1) = e ± r w (0) and hence e δ ( j +1) i − e ± r ( e δ ( j ) i −
1) + 1 L (cid:18) e − p δ ( j ) i ±| − p | δ ( j ) ± δ − (cid:19) e ± p ) r . L = max { , p } , r = p (4 − p )2 , (cid:12)(cid:12)(cid:12) δ ( j ) i (cid:12)(cid:12)(cid:12) ≤ δ ( j ) ≤ r , δ ≤ and p ≤ we obtain e δ ( j +1) i − e δ ( j ) i − ± rδ ( j ) i + 1 L (cid:18) e − p δ ( j ) i ±| − p | δ ( j ) ± δ − (cid:19) ± L (cid:18) p (cid:19) r (cid:18)(cid:18) p (cid:19) δ ( j ) + δ (cid:19) = δ ( j ) i ± ( δ ( j ) ) ± rδ ( j ) i + 1 L (cid:32) − p δ ( j ) i ± (cid:12)(cid:12)(cid:12)(cid:12) − p (cid:12)(cid:12)(cid:12)(cid:12) δ ( j ) ± δ ± (cid:18)(cid:18) p (cid:19) δ ( j ) (cid:19) (cid:33) ± L (cid:18) p (cid:19) r (cid:18)(cid:18) p (cid:19) δ ( j ) + δ (cid:19) = (cid:18) − pL (cid:19) δ ( j ) i ± L (cid:12)(cid:12)(cid:12)(cid:12) − p (cid:12)(cid:12)(cid:12)(cid:12) δ ( j ) ± δL ± δ ( j ) ) ± rδ ( j ) i ± L (cid:18) p (cid:19) δ ( j ) r = (cid:18) − pL (cid:19) δ ( j ) i ± L (cid:12)(cid:12)(cid:12)(cid:12) − p (cid:12)(cid:12)(cid:12)(cid:12) δ ( j ) ± δL ± (cid:18) p (cid:19) δ ( j ) r (B.5)where we used e x = 1 ± x for | x | ≤ in the first equality, we used e x = 1 + x ± x for | x | ≤ inthe second equality.For the first two terms, we have that (cid:12)(cid:12)(cid:12)(cid:12)(cid:18) − pL (cid:19) δ ( j ) i ± L (cid:12)(cid:12)(cid:12)(cid:12) − p (cid:12)(cid:12)(cid:12)(cid:12) δ ( j ) (cid:12)(cid:12)(cid:12)(cid:12) ≤ (cid:18) − pL + 1 L (cid:12)(cid:12)(cid:12)(cid:12) − p (cid:12)(cid:12)(cid:12)(cid:12)(cid:19) δ ( j ) . Using this, x ≤ e x and (B.5) and (cid:12)(cid:12)(cid:12) δ ( j ) i (cid:12)(cid:12)(cid:12) ≤ δ ( j ) ≤ r , we have δ ( j +1) ≤ (cid:18) − pL + 1 L (cid:12)(cid:12)(cid:12)(cid:12) − p (cid:12)(cid:12)(cid:12)(cid:12) + 40 (cid:18) p (cid:19) r (cid:19) δ ( j ) + 3 δL . Using our choice of r , we have p ) r ≤ L ( p − | − p | ) and hence δ ( j +1) ≤ (cid:18) − L (cid:18) p − (cid:12)(cid:12)(cid:12)(cid:12) − p (cid:12)(cid:12)(cid:12)(cid:12)(cid:19)(cid:19) δ ( j ) + 3 δL and hence for all j ∈ [ m ] δ ( j ) ≤ L ( p − | − p | ) · δL ≤ δ p − | − p | ≤ (cid:15) . Applying Lemma 54 and Lemma 55, and recalling that k = (cid:108) p + p ) log (cid:0) pn (cid:15) (cid:1)(cid:109) , we have (cid:13)(cid:13)(cid:13) W − p ( w p − w ( k ) ) (cid:13)(cid:13)(cid:13) ∞ ≤ (cid:13)(cid:13)(cid:13) W − p ( w p − v ( k ) ) (cid:13)(cid:13)(cid:13) ∞ + (cid:13)(cid:13)(cid:13) W − p (cid:16) v ( k ) − w ( k ) (cid:17)(cid:13)(cid:13)(cid:13) ∞ ≤ (cid:18) p (cid:19) √ n · (cid:107) w − w p (cid:107) W − p + 2 δ ( k ) ≤ (cid:18) p (cid:19) √ n · √ n · (cid:32) − p + p ) (cid:33) k · p
Computing $\sigma$ exactly is too expensive for our purposes. However, it was shown in [53, 15] that we can compute leverage scores, $\sigma$, approximately by solving only polylogarithmically many regression problems (see [40, 38, 65, 7] for more details). These results use the fact that the leverage score of the $i$-th constraint, i.e. $\sigma(\mathbf{A})_i$, is the squared $\ell_2$ length of the vector $\mathbf{A}(\mathbf{A}^\top\mathbf{A})^{-1}\mathbf{A}^\top e_i$, and that by the Johnson-Lindenstrauss lemma these lengths are preserved up to multiplicative error if we project these vectors onto a certain random low-dimensional subspace. Consequently, to approximate $\sigma$ we first compute the projected vectors and then use them to approximate $\sigma$, and hence only need to solve $\widetilde O(1)$ regression problems. For completeness, we provide an algorithm and theorem statement below, most closely resembling the ones from [53].

Algorithm 6: $\sigma^{(\mathrm{apx})} = \mathrm{computeLeverageScores}(\mathbf{A}, \epsilon)$
  Let $q^{(1)}, \ldots, q^{(k)}$ be $k$ random $\pm1/\sqrt{k}$ vectors of length $m$ with $k = O(\log(m)/\epsilon^2)$.
  Compute $l^{(j)} = (\mathbf{A}^\top\mathbf{A})^{-1}\mathbf{A}^\top q^{(j)}$ and $p^{(j)} = \mathbf{A}\,l^{(j)}$ for each $j\in[k]$.
  Output: $\sigma_i^{(\mathrm{apx})} = \sum_{j=1}^{k}\left(p_i^{(j)}\right)^2$ for each $i\in[m]$.

Lemma 59.
For $\epsilon \in (0,1)$, with probability at least $1 - m^{-O(1)}$ the algorithm computeLeverageScores returns $\sigma^{(\mathrm{apx})}$ such that for all $i\in[m]$,
$$(1-\epsilon)\,\sigma(\mathbf{A})_i \le \sigma_i^{(\mathrm{apx})} \le (1+\epsilon)\,\sigma(\mathbf{A})_i,$$
by solving only $O(\epsilon^{-2}\log m)$ linear systems.

In the next section we show how to combine these results to obtain our main result on approximate weight computation, Theorem 39.
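To make Algorithm 6 concrete, here is a minimal dense-linear-algebra sketch in Python. The function name, the constant $24$ inside $k = O(\log(m)/\epsilon^2)$, and the direct solve of $(\mathbf{A}^\top\mathbf{A})\,l = \mathbf{A}^\top q$ are our illustrative choices; the paper instead assumes a fast regression oracle for these solves.

```python
import numpy as np

def compute_leverage_scores(A, eps, seed=0):
    """JL-sketched leverage scores (Algorithm 6 sketch).

    sigma_i(A) = ||P e_i||_2^2 for the projection P = A (A^T A)^{-1} A^T;
    each p^(j) = P q^(j) is one row of the sketched projection, so summing
    squared entries gives an unbiased (1 +- eps) estimate of sigma_i(A).
    """
    rng = np.random.default_rng(seed)
    m, n = A.shape
    k = int(np.ceil(24 * np.log(m) / eps**2))  # k = O(log(m)/eps^2), constant illustrative
    G = A.T @ A
    sigma_apx = np.zeros(m)
    for _ in range(k):
        q = rng.choice([-1.0, 1.0], size=m) / np.sqrt(k)  # random +-1/sqrt(k) vector
        l = np.linalg.solve(G, A.T @ q)                   # l = (A^T A)^{-1} A^T q
        p = A @ l                                         # p = A l
        sigma_apx += p**2                                 # accumulate squared lengths
    return sigma_apx

# Sanity check against exact leverage scores sigma_i = (A (A^T A)^{-1} A^T)_{ii}.
A = np.random.default_rng(1).standard_normal((200, 10))
exact = np.einsum('ij,ij->i', A @ np.linalg.inv(A.T @ A), A)
approx = compute_leverage_scores(A, eps=0.5)
assert np.all(approx <= 1.9 * exact) and np.all(approx >= 0.1 * exact)
```

Each loop iteration costs one linear system solve, matching the $O(\epsilon^{-2}\log m)$ system count in Lemma 59.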
B.3 Initial Weight and Final Theorems
Here, we show how to compute an initial weight without already having an approximate weight to aid the computation. While we can use the results of the previous section during the iterations of our linear programming algorithms (as we have shown that the Lewis weights do not change too quickly), we still need a routine to compute the initial weights. Here we show that the algorithm computeInitialWeight (Algorithm 7) achieves this by calling the weight computation algorithms of the previous sections $\widetilde O(\sqrt{n})$ times: it first computes the Lewis weights for $p = 2$, i.e. the leverage scores, and then gradually moves $p$ toward its target value. (A code sketch of this homotopy loop is given at the end of this subsection.)

Algorithm 7: $w = \mathrm{computeInitialWeight}(\mathbf{A}, p_{\mathrm{target}}, \epsilon)$
  $p = 2$.
  while $p \ne p_{\mathrm{target}}$ do
    Let $r$ be defined as in computeApxWeight or computeExactWeight.
    $h = \frac{\min\{1,\, p/2\}}{\sqrt{n}\log\left(\frac{me}{n}\right)}\cdot r$.
    $p^{(\mathrm{new})} = \mathrm{median}\left(p - h,\ p_{\mathrm{target}},\ p + h\right)$.
    $w = \mathrm{computeApxWeight}(p^{(\mathrm{new})}, w^{p^{(\mathrm{new})}/p}, r)$ or $\mathrm{computeExactWeight}(p^{(\mathrm{new})}, w^{p^{(\mathrm{new})}/p}, r)$.
    $p = p^{(\mathrm{new})}$.
  end
  Output: $\mathrm{computeApxWeight}(p_{\mathrm{target}}, w, \epsilon)$.

(Footnote: the leverage score approximation is the only place our algorithm uses randomness for general linear programs. Since we can verify the centrality of the central path by computing leverage scores exactly, instead of using this theorem, every $m^{O(1)}$ iterations of the interior point method, success probability $1 - m^{-O(1)}$ is high enough even when $\epsilon$ is doubly exponentially small.)

The correctness of the above algorithm follows directly from the following lemma:

Lemma 60. For all $q > 0$ let $\widetilde w_q \in \mathbb{R}^m_{>0}$ denote the vector with $[\widetilde w_q]_i = [w_p(\mathbf{A})]_i^{q/p}$ for all $i\in[m]$. If $|p - q| \le \frac{\min\{1,\, p/2\}}{\sqrt{n}\log(me/n)}$ then
$$\left\|\log\left(\frac{w_q(\mathbf{A})}{\widetilde w_q}\right)\right\|_\infty \le \max\left\{1, \frac{2}{p}\right\}\sqrt{n}\log\left(\frac{me}{n}\right)\cdot|p - q|. \tag{B.6}$$
Proof. For notational convenience let $w = w_p(\mathbf{A})$, $\mathbf{W} \overset{\mathrm{def}}{=} \mathbf{Diag}(w)$ and $\mathbf{\Lambda} \overset{\mathrm{def}}{=} \mathbf{\Lambda}(\mathbf{W}^{1/2-1/p}\mathbf{A})$. Taking the derivative with respect to $p$ on both sides of the fixed point characterization of $w_p(\mathbf{A})$ and using Lemma 49 yields
$$\frac{dw_p(\mathbf{A})}{dp} = \mathbf{\Lambda}\left[\left(1 - \frac{2}{p}\right)\mathbf{W}^{-1}\frac{dw}{dp} + \frac{2}{p^2}\ln w\right].$$
Hence, we have that
$$\frac{dw_p(\mathbf{A})}{dp} = 2\mathbf{W}\left(\mathbf{W} - \left(1 - \frac{2}{p}\right)\mathbf{\Lambda}\right)^{-1}\mathbf{\Lambda}\cdot\frac{\ln w}{p^2}. \tag{B.7}$$
Lemmas 24 and 25 show that for all $h\in\mathbb{R}^m$,
$$\left\|2\left(\mathbf{W} - \left(1 - \frac{2}{p}\right)\mathbf{\Lambda}\right)^{-1}\mathbf{\Lambda}h - ph\right\|_\infty \le p\cdot\max\left\{\frac{p}{2}, 1\right\}\cdot\|h\|_{\mathbf{W}}.$$
Setting $h = p^{-2}\log w$ and using (B.7), we have
$$\left\|\mathbf{W}^{-1}\frac{dw_p(\mathbf{A})}{dp} - \frac{\ln w}{p}\right\|_\infty \le p\cdot\max\left\{\frac{p}{2}, 1\right\}\cdot\left\|p^{-2}\log w\right\|_{\mathbf{W}} \le \max\left\{1, \frac{2}{p}\right\}\|\log w\|_{\mathbf{W}}.$$
Finally, we note that
$$\|\ln w\|_{\mathbf{W}}^2 = \sum_{i\in[m]} w_i\log^2 w_i \le \sum_{w_i \le 1/e} w_i\log^2 w_i + \sum_{w_i\in(1/e,\,1]} w_i \le n\log^2\left(\frac{m}{n}\right) + n \le n\log^2\left(\frac{me}{n}\right),$$
where we used that $w\log^2 w$ is concave on $[0, 1/e]$, that $w_i \le 1$, and that $\sum_{i\in[m]}w_i \le n$.

Combining these bounds yields that for all $q$, with $w_q \overset{\mathrm{def}}{=} w_q(\mathbf{A})$ and $\mathbf{W}_q \overset{\mathrm{def}}{=} \mathbf{Diag}(w_q)$,
$$\left\|\frac{d}{dq}\ln(w_q/\widetilde w_q)\right\|_\infty = \left\|\mathbf{W}_q^{-1}\frac{d}{dq}w_q - \frac{\log(\widetilde w_q)}{q}\right\|_\infty \le \max\left\{1, \frac{2}{p}\right\}\sqrt{n}\log\left(\frac{me}{n}\right) + \frac{1}{p}\left\|\ln(w_q/\widetilde w_q)\right\|_\infty.$$
Now let $\delta$ be the largest number such that $|p - q| \le \delta$ implies $\|\ln(w_q/\widetilde w_q)\|_\infty \le 1$. Since for all such $q$ we have
$$\left\|\frac{d}{dq}\ln(w_q/\widetilde w_q)\right\|_\infty \le 2\max\left\{1, \frac{2}{p}\right\}\sqrt{n}\log\left(\frac{me}{n}\right)$$
and $\ln(w_p/\widetilde w_p) = 0$, integration yields that (B.6) holds for all such $q$. Therefore, it must be the case that $\delta \ge \left[2\max\left\{1, \frac{2}{p}\right\}\sqrt{n}\log\left(\frac{me}{n}\right)\right]^{-1}$, and the result follows.

We now have everything we need to prove our main theorems regarding exact and approximate Lewis weight computation. First we prove the result on exact weight computation (Theorem 45) and then the result on approximate weight computation (Theorem 39).

Proof of Theorem 45.
From Lemma 60, we see that each step in $p$ stays within the requirements of Theorem 56. Furthermore, Lemma 60 shows that computeInitialWeight takes $O(\sqrt{n}\cdot(p + p^{-1})\cdot\log\frac{m}{n})$ steps. Each call of computeExactWeight involves $O((p + p^{-1})\log(\frac{n}{\epsilon}(1+p)))$ iterations, and each iteration involves computing leverage scores exactly, which takes $O(mn^{\omega-1})$ work and $O(\log m)$ depth.
From Lemma 60, we see that each step in $p$ stays within the requirements of Theorem 58. Furthermore, Lemma 60 shows that computeInitialWeight takes $O(\sqrt{n}\cdot((4-p)^{-1} + p^{-1})\cdot\log\frac{m}{n})$ steps. Each call of computeApxWeight involves $O(p^{-1}\log(\frac{n}{p\epsilon}))$ iterations, and each iteration involves computing leverage scores up to multiplicative accuracy $\Theta((4-p)\cdot\epsilon)$. Finally, Lemma 59 shows that this involves solving $O((4-p)^{-2}\epsilon^{-2}\log m)$ linear systems.
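The following Python sketch shows the structure of the computeInitialWeight homotopy. It is illustrative only: `compute_apx_weight` is an assumed oracle implementing Theorem 58, and the radius constant `512` and the exact step size `h` are stand-ins for the constants fixed by computeApxWeight and Lemma 60.

```python
import numpy as np

def compute_initial_weight(A, p_target, eps, compute_apx_weight):
    """Homotopy in p from leverage scores (p = 2) down/up to p_target.

    compute_apx_weight(A, p, w0, eps) is assumed to refine an entrywise
    r-accurate initial weight w0 into an eps-accurate l_p Lewis weight.
    """
    m, n = A.shape
    p = 2.0
    # At p = 2 the Lewis weights are exactly the leverage scores of A.
    w = np.einsum('ij,ij->i', A @ np.linalg.inv(A.T @ A), A)
    while p != p_target:
        r = p * (4 - p) / 512                                   # admissible error radius (illustrative)
        h = min(1.0, p / 2) * r / (np.sqrt(n) * np.log(m * np.e / n))  # step size from Lemma 60
        p_new = float(np.clip(p_target, p - h, p + h))          # median(p - h, p_target, p + h)
        w0 = w ** (p_new / p)                                   # entrywise guess \tilde w_q for q = p_new
        w = compute_apx_weight(A, p_new, w0, r)
        p = p_new
    return compute_apx_weight(A, p_target, w, eps)
```

Each outer step moves $p$ by roughly $r/(\max\{1, 2/p\}\sqrt{n}\log\frac{me}{n})$, which by Lemma 60 keeps the reweighted guess within the radius $r$ required by Theorem 58, giving the $\widetilde O(\sqrt n)$ step count.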
C Chasing Game

The goal of this section is to prove the following theorem:
Theorem 18 ($\ell_\infty$ Chasing Game). For $x^{(0)}, y^{(0)} \in \mathbb{R}^m$ and $\epsilon \in (0, 1/2]$, consider the two-player game consisting of repeating the following for $k = 1, 2, \ldots$:
1. The adversary chooses $U^{(k)} \subseteq \mathbb{R}^m$ and $u^{(k)} \in U^{(k)}$, and sets $y^{(k)} = y^{(k-1)} + u^{(k)}$.
2. The adversary chooses $z^{(k)}$ with $\|z^{(k)} - y^{(k)}\|_\infty \le R$ and reveals $z^{(k)}$ and $U^{(k)}$ to the player.
3. The player chooses $\Delta^{(k)} \in (1+\epsilon)U^{(k)}$ and sets $x^{(k)} = x^{(k-1)} + \Delta^{(k)}$.

Suppose that each $U^{(k)}$ is a symmetric convex set that contains an $\ell_\infty$ ball of radius $r_k$ and is contained in an $\ell_\infty$ ball of radius $R_k \le R$, and that the player plays the strategy
$$\Delta^{(k)} = \arg\min_{\Delta\in(1+\epsilon)U^{(k)}}\left\langle\nabla\Phi_\mu\left(x^{(k-1)} - z^{(k)}\right), \Delta\right\rangle \quad\text{where}\quad \Phi_\mu(x) \overset{\mathrm{def}}{=} \sum_{i\in[m]}\left(e^{\mu x_i} + e^{-\mu x_i}\right) \text{ and } \mu \overset{\mathrm{def}}{=} \frac{\epsilon}{32R}.$$
If $\Phi_\mu(x^{(0)} - y^{(0)}) \le \frac{64m\tau}{\epsilon}$ for $\tau = \max_k \frac{R_k}{r_k}$, then this strategy guarantees that for all $k$ we have
$$\Phi_\mu(x^{(k)} - y^{(k)}) \le \frac{64m\tau}{\epsilon} \quad\text{and}\quad \|x^{(k)} - y^{(k)}\|_\infty \le \frac{32R}{\epsilon}\log\left(\frac{64m\tau}{\epsilon}\right).$$

This theorem says that taking "projected gradient steps" using the potential function $\Phi_\mu(x)$ suffices to maintain a point $x^{(k)}$ sufficiently close to $y^{(k)}$ with respect to $\ell_\infty$, provided that the $y^{(k)}$ are updated by a direction in $U^{(k)}$, noisy measurements $z^{(k)}$ of the $y^{(k)}$ are available, and slightly larger movements of the $x^{(k)}$ (i.e. by $(1+\epsilon)U^{(k)}$) are allowed. Formally, this theorem analyzes the strategy of updating $x^{(k)}$ by setting the change, $\Delta^{(k)}$, to be the vector in $(1+\epsilon)U^{(k)}$ that best decreases the potential of the observed position difference, i.e. minimizes $\langle\nabla\Phi_\mu(x^{(k-1)} - z^{(k)}), \Delta\rangle$, for a careful choice of $\mu$. A code sketch of one round of this strategy follows; to prove Theorem 18 we first establish properties of the potential function $\Phi_\mu$ (Lemma 61 below).
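The player's move is easy to compute in the special case where $U^{(k)}$ is itself an $\ell_\infty$ ball, since the linear minimizer over a scaled box is a scaled sign pattern. The sketch below, including the toy adversary, is our illustration of that special case, not code from the paper; for a general symmetric convex $U^{(k)}$ one would instead return $-(1+\epsilon)$ times the dual-norm maximizer (the "flat" vector $\nabla\Phi_\mu(\cdot)^\flat$).

```python
import numpy as np

def player_step(x_prev, z, R_k, eps, mu):
    """One round of the Theorem 18 strategy with U^(k) = l_inf ball of radius R_k."""
    d = x_prev - z
    grad = mu * (np.exp(mu * d) - np.exp(-mu * d))  # coordinatewise gradient of Phi_mu
    delta = -(1.0 + eps) * R_k * np.sign(grad)      # argmin of <grad, Delta> over (1+eps)U^(k)
    return x_prev + delta

# Toy adversary: y drifts inside U^(k); the player only sees a noisy z.
rng = np.random.default_rng(0)
m, R, R_k, eps = 50, 1.0, 0.1, 0.2
mu = eps / (32 * R)
x, y = np.zeros(m), np.zeros(m)
for _ in range(2000):
    y = y + rng.uniform(-R_k, R_k, size=m)   # adversary move u^(k) in U^(k)
    z = y + rng.uniform(-R, R, size=m)       # noisy observation of y^(k)
    x = player_step(x, z, R_k, eps, mu)
print(np.max(np.abs(x - y)))  # stays O((R/eps) * log(m*tau/eps)) per Theorem 18
```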
Lemma 61. For all $x \in \mathbb{R}^m$ and $\mu > 0$, we have
$$e^{\mu\|x\|_\infty} \le \Phi_\mu(x) \le 2me^{\mu\|x\|_\infty} \quad\text{and}\quad \mu\Phi_\mu(x) - 2\mu m \le \|\nabla\Phi_\mu(x)\|_1. \tag{C.1}$$
Furthermore, for any symmetric convex set $U \subseteq \mathbb{R}^m$ and any $x\in\mathbb{R}^m$, let $x^\flat \overset{\mathrm{def}}{=} \arg\max_{y\in U}\langle x, y\rangle$ and $\|x\|_U \overset{\mathrm{def}}{=} \max_{y\in U}\langle x, y\rangle$. Then for all $x, y\in\mathbb{R}^m$ with $\|x - y\|_\infty \le \delta \le \frac{1}{4\mu}$ we have
$$e^{-\mu\delta}\|\nabla\Phi_\mu(y)\|_U - 2\mu\|\nabla\Phi_\mu(y)^\flat\|_1 \le \left\langle\nabla\Phi_\mu(x), \nabla\Phi_\mu(y)^\flat\right\rangle \le e^{\mu\delta}\|\nabla\Phi_\mu(y)\|_U + 2\mu e^{\mu\delta}\|\nabla\Phi_\mu(y)^\flat\|_1. \tag{C.2}$$
If additionally $U$ is contained in an $\ell_\infty$ ball of radius $R$, then
$$e^{-\mu\delta}\|\nabla\Phi_\mu(y)\|_U - 2\mu mR \le \|\nabla\Phi_\mu(x)\|_U \le e^{\mu\delta}\|\nabla\Phi_\mu(y)\|_U + 2\mu e^{\mu\delta}mR. \tag{C.3}$$

Proof.
For notational convenience, let $p_\mu(x) \overset{\mathrm{def}}{=} e^{\mu x} + e^{-\mu x}$ for all $x\in\mathbb{R}$, so that $\Phi_\mu(x) = \sum_{i\in[m]}p_\mu(x_i)$. Equation (C.1) follows from the facts that for all $x\in\mathbb{R}$ we have $e^{\mu|x|} \le p_\mu(x) \le 2e^{\mu|x|}$ and $p'_\mu(x) = \mu\,\mathrm{sign}(x)\left(e^{\mu|x|} - e^{-\mu|x|}\right)$.

Next, let $x, y\in\mathbb{R}$ with $|x - y| \le \delta$. Note that $|p'_\mu(x)| = p'_\mu(|x|) = \mu(e^{\mu|x|} - e^{-\mu|x|})$ and $|x - y| \le \delta$ implies that $|x| = |y| + z$ for some $z\in[-\delta, \delta]$. Using that $p'_\mu(|x|)$ is monotonic in $|x|$, we then have
$$|p'_\mu(x)| = p'_\mu(|y| + z) \le p'_\mu(|y| + \delta) = \mu\left(e^{\mu|y| + \mu\delta} - e^{-\mu|y| - \mu\delta}\right) = e^{\mu\delta}p'_\mu(|y|) + \mu e^{-\mu|y|}\left(e^{\mu\delta} - e^{-\mu\delta}\right) \le e^{\mu\delta}|p'_\mu(y)| + 2\mu e^{\mu\delta}. \tag{C.4}$$
By symmetry (i.e. swapping $x$ and $y$) this implies that
$$|p'_\mu(x)| \ge e^{-\mu\delta}|p'_\mu(y)| - 2\mu. \tag{C.5}$$
Since $U$ is symmetric, for all $i\in[m]$ we have $\mathrm{sign}(\nabla\Phi_\mu(y)^\flat)_i = \mathrm{sign}(\nabla\Phi_\mu(y)_i) = \mathrm{sign}(y_i)$. Therefore, if for all $i\in[m]$ we have $\mathrm{sign}(x_i) = \mathrm{sign}(y_i)$, then by (C.4) we see that
$$\left\langle\nabla\Phi_\mu(x), \nabla\Phi_\mu(y)^\flat\right\rangle = \sum_{i\in[m]}p'_\mu(x_i)\nabla\Phi_\mu(y)^\flat_i \le \sum_{i\in[m]}\left(e^{\mu\delta}|p'_\mu(y_i)| + 2\mu e^{\mu\delta}\right)\left|\nabla\Phi_\mu(y)^\flat_i\right| \le e^{\mu\delta}\left\langle\nabla\Phi_\mu(y), \nabla\Phi_\mu(y)^\flat\right\rangle + 2\mu e^{\mu\delta}\|\nabla\Phi_\mu(y)^\flat\|_1 = e^{\mu\delta}\|\nabla\Phi_\mu(y)\|_U + 2\mu e^{\mu\delta}\|\nabla\Phi_\mu(y)^\flat\|_1.$$
Similarly, using (C.5), we have $e^{-\mu\delta}\|\nabla\Phi_\mu(y)\|_U - 2\mu\|\nabla\Phi_\mu(y)^\flat\|_1 \le \langle\nabla\Phi_\mu(x), \nabla\Phi_\mu(y)^\flat\rangle$, and hence (C.2) holds. On the other hand, if $\mathrm{sign}(x_i) \ne \mathrm{sign}(y_i)$ then we know that $|x_i| \le \delta$ and consequently $|p'_\mu(x_i)| \le \mu(e^{\mu\delta} - e^{-\mu\delta}) \le \mu$, since $\delta \le \frac{1}{4\mu}$. Thus, for such coordinates we have
$$e^{-\mu\delta}|p'_\mu(y_i)| - 2\mu \le -\mu \le \mathrm{sign}(y_i)\,p'_\mu(x_i) \le \mu \le e^{\mu\delta}|p'_\mu(y_i)| + 2\mu e^{\mu\delta}.$$
Taking inner products on both sides with $\nabla\Phi_\mu(y)^\flat_i$ and using the definitions of $\|\cdot\|_U$ and $\cdot^\flat$, we get (C.2). Thus, (C.2) holds in general.

Finally, we note that since $U$ is contained in an $\ell_\infty$ ball of radius $R$, we have $\|y^\flat\|_1 \le mR$ for all $y$. Using this fact, (C.2), and the definition of $\|\cdot\|_U$, we obtain
$$e^{-\mu\delta}\|\nabla\Phi_\mu(y)\|_U - 2\mu mR \le \left\langle\nabla\Phi_\mu(x), \nabla\Phi_\mu(y)^\flat\right\rangle \le \|\nabla\Phi_\mu(x)\|_U,$$
since $\nabla\Phi_\mu(y)^\flat \in U$. By symmetry, (C.3) follows.

Using Lemma 61 we now prove Theorem 18.

Proof of Theorem 18.
For the remainder of the proof, let $\|x\|_{U^{(k)}} = \max_{y\in U^{(k)}}\langle x, y\rangle$ and $x^{\flat(k)} = \arg\max_{y\in U^{(k)}}\langle x, y\rangle$. Since $U^{(k)}$ is symmetric, we know that $\Delta^{(k)} = -(1+\epsilon)\left(\nabla\Phi_\mu(x^{(k-1)} - z^{(k)})\right)^{\flat(k)}$, and therefore by applying the mean value theorem twice we have that
$$\Phi_\mu(x^{(k)} - y^{(k)}) = \Phi_\mu(x^{(k-1)} - y^{(k)}) + \left\langle\nabla\Phi_\mu(\zeta_1), x^{(k)} - x^{(k-1)}\right\rangle = \Phi_\mu(x^{(k-1)} - y^{(k-1)}) - \left\langle\nabla\Phi_\mu(\zeta_2), y^{(k)} - y^{(k-1)}\right\rangle + \left\langle\nabla\Phi_\mu(\zeta_1), x^{(k)} - x^{(k-1)}\right\rangle$$
for some $\zeta_1$ between $x^{(k)} - y^{(k)}$ and $x^{(k-1)} - y^{(k)}$, and some $\zeta_2$ between $x^{(k-1)} - y^{(k)}$ and $x^{(k-1)} - y^{(k-1)}$. Now, using that $y^{(k)} - y^{(k-1)} = u^{(k)}\in U^{(k)}$ and that $x^{(k)} - x^{(k-1)} = \Delta^{(k)}$, we have
$$\Phi_\mu(x^{(k)} - y^{(k)}) \le \Phi_\mu(x^{(k-1)} - y^{(k-1)}) + \|\nabla\Phi_\mu(\zeta_2)\|_{U^{(k)}} - (1+\epsilon)\left\langle\nabla\Phi_\mu(\zeta_1), \left(\nabla\Phi_\mu(x^{(k-1)} - z^{(k)})\right)^{\flat(k)}\right\rangle. \tag{C.6}$$
Since $U^{(k)}$ is contained within the $\ell_\infty$ ball of radius $R_k$, Lemma 61 shows that
$$\|\nabla\Phi_\mu(\zeta_2)\|_{U^{(k)}} \le e^{\mu R_k}\|\nabla\Phi_\mu(x^{(k-1)} - y^{(k-1)})\|_{U^{(k)}} + 2m\mu R_k e^{\mu R_k}. \tag{C.7}$$
Furthermore, since $\epsilon \le 1/2$ and $R_k \le R$, by the triangle inequality we have $\|\zeta_1 - (x^{(k-1)} - z^{(k)})\|_\infty \le (1+\epsilon)R_k + R \le 3R$ and $\|(x^{(k-1)} - z^{(k)}) - (x^{(k-1)} - y^{(k-1)})\|_\infty \le R + R_k \le 2R$. Therefore, applying Lemma 61 twice yields that
$$\left\langle\nabla\Phi_\mu(\zeta_1), \left(\nabla\Phi_\mu(x^{(k-1)} - z^{(k)})\right)^{\flat(k)}\right\rangle \ge e^{-3\mu R}\|\nabla\Phi_\mu(x^{(k-1)} - z^{(k)})\|_{U^{(k)}} - 2\mu mR_k \ge e^{-5\mu R}\|\nabla\Phi_\mu(x^{(k-1)} - y^{(k-1)})\|_{U^{(k)}} - 4\mu mR_k. \tag{C.8}$$
Combining (C.6), (C.7), and (C.8) then yields that
$$\Phi_\mu(x^{(k)} - y^{(k)}) \le \Phi_\mu(x^{(k-1)} - y^{(k-1)}) - \left((1+\epsilon)e^{-5\mu R} - e^{\mu R}\right)\|\nabla\Phi_\mu(x^{(k-1)} - y^{(k-1)})\|_{U^{(k)}} + 2m\mu R_k e^{\mu R} + 4(1+\epsilon)m\mu R_k.$$
Since we chose $\mu = \frac{\epsilon}{32R}$ and $\epsilon\in(0, 1/2]$, we have
$$(1+\epsilon)e^{-5\mu R} - e^{\mu R} \ge \frac{\epsilon}{2} \quad\text{and}\quad 2m\mu R_k e^{\mu R} + 4(1+\epsilon)m\mu R_k \le \frac{\epsilon m R_k}{2R}.$$
Thus, we have
$$\Phi_\mu(x^{(k)} - y^{(k)}) \le \Phi_\mu(x^{(k-1)} - y^{(k-1)}) - \frac{\epsilon}{2}\|\nabla\Phi_\mu(x^{(k-1)} - y^{(k-1)})\|_{U^{(k)}} + \frac{\epsilon m R_k}{2R}.$$
Using Lemma 61 and the fact that $U^{(k)}$ contains an $\ell_\infty$ ball of radius $r_k$, we have
$$\|\nabla\Phi_\mu(x^{(k-1)} - y^{(k-1)})\|_{U^{(k)}} \ge r_k\|\nabla\Phi_\mu(x^{(k-1)} - y^{(k-1)})\|_1 \ge \mu r_k\left(\Phi_\mu(x^{(k-1)} - y^{(k-1)}) - 2m\right).$$
Consequently,
$$\Phi_\mu(x^{(k)} - y^{(k)}) \le \left(1 - \frac{\epsilon^2 r_k}{64R}\right)\Phi_\mu(x^{(k-1)} - y^{(k-1)}) + \frac{\epsilon^2 r_k}{32R}m + \frac{\epsilon m R_k}{2R} \le \left(1 - \frac{\epsilon^2 r_k}{64R}\right)\Phi_\mu(x^{(k-1)} - y^{(k-1)}) + \frac{\epsilon m R_k}{R}.$$
Hence, if $\Phi_\mu(x^{(k-1)} - y^{(k-1)}) \le \frac{64m\tau}{\epsilon}$, then since $\tau \ge R_k/r_k$ we have $\Phi_\mu(x^{(k)} - y^{(k)}) \le \frac{64m\tau}{\epsilon}$. Since $\Phi_\mu(x^{(0)} - y^{(0)}) \le \frac{64m\tau}{\epsilon}$ by assumption, we have by induction that $\Phi_\mu(x^{(k)} - y^{(k)}) \le \frac{64m\tau}{\epsilon}$ for all $k$. The necessary bound on $\|x^{(k)} - y^{(k)}\|_\infty$ then follows immediately from Lemma 61.

D Appendix: Projection on Mixed Norm Ball
Here we give an algorithm to solve the following problem:
$$\max_{\|x\|_2 + \|\mathbf{L}^{-1}x\|_\infty \le 1}\langle a, x\rangle \tag{D.1}$$
for given vectors $a\in\mathbb{R}^n$ and $l\in\mathbb{R}^n_{>0}$, where $\mathbf{L} = \mathbf{Diag}(l)$. This is used in Section 6 to compute weights. Note that
$$\max_{\|x\|_2 + \|\mathbf{L}^{-1}x\|_\infty \le 1}\langle a, x\rangle = \max_{0\le t\le 1}\left[\max_{\|x\|_2 \le 1-t,\ -tl_i \le x_i \le tl_i}\langle a, x\rangle\right] = \max_{0\le t\le 1}(1-t)\left[\max_{\|x\|_2 \le 1,\ -\frac{t}{1-t}l_i \le x_i \le \frac{t}{1-t}l_i}\langle a, x\rangle\right] = \max_{0\le t\le 1}(1-t)f(t)$$
where
$$f(t) \overset{\mathrm{def}}{=} \max_{\|x\|_2 \le 1,\ -\frac{t}{1-t}l_i \le x_i \le \frac{t}{1-t}l_i}\langle a, x\rangle. \tag{D.2}$$
After sorting the coordinates so that $|a_i|/l_i$ monotonically decreases with $i\in[n]$, and considering the maximization problem in $f(t)$ with only the $\|x\|_2$ or only the $-\frac{t}{1-t}l_i \le x_i \le \frac{t}{1-t}l_i$ constraints active, it can be shown that the maximizing $x$ in the definition of $f$ is $x^{(i_t)}$, where for all $j\in[n]$
$$x^{(i_t)}_j = \begin{cases}\frac{t}{1-t}\,\mathrm{sign}(a_j)\,l_j & \text{if } j\in[i_t],\\[2pt] \sqrt{\dfrac{1 - \left(\frac{t}{1-t}\right)^2\sum_{k\in[i_t]}l_k^2}{\|a\|_2^2 - \sum_{k\in[i_t]}a_k^2}}\; a_j & \text{otherwise},\end{cases} \tag{D.3}$$
and $i_t$ is the first coordinate $i\in[n]$ such that
$$\frac{1 - \left(\frac{t}{1-t}\right)^2\sum_{k\in[i]}l_k^2}{\|a\|_2^2 - \sum_{k\in[i]}a_k^2} \le \left(\frac{t}{1-t}\right)^2\frac{l_i^2}{a_i^2}.$$
Note that $i_t \ge i_s$ if $t \le s$. Therefore, the set of $t$ such that $i_t = j$ is simply an interval given by
$$\frac{|a_j|}{\sqrt{l_j^2\left(\|a\|_2^2 - \sum_{k\in[j]}a_k^2\right) + a_j^2\sum_{k\in[j]}l_k^2}} \le \frac{t}{1-t} < \frac{|a_{j-1}|}{\sqrt{l_{j-1}^2\left(\|a\|_2^2 - \sum_{k\in[j-1]}a_k^2\right) + a_{j-1}^2\sum_{k\in[j-1]}l_k^2}}. \tag{D.4}$$
(We ignore some boundary cases for simplicity.) Direct computation then shows that
$$f(t) = \left\langle a, x^{(i_t)}\right\rangle = \frac{t}{1-t}\sum_{j\in[i_t]}|a_j|\,l_j + \sqrt{1 - \left(\frac{t}{1-t}\right)^2\sum_{k\in[i_t]}l_k^2}\,\sqrt{\|a\|_2^2 - \sum_{k\in[i_t]}a_k^2}.$$
Substituting this into (D.2), we have that
$$\max_{\|x\|_2 + \|\mathbf{L}^{-1}x\|_\infty \le 1}\langle a, x\rangle = \max_{0\le t\le 1} g(t) \quad\text{where}\quad g(t) \overset{\mathrm{def}}{=} t\sum_{j\in[i_t]}|a_j|\,l_j + \sqrt{(1-t)^2 - t^2\sum_{k\in[i_t]}l_k^2}\,\sqrt{\|a\|_2^2 - \sum_{k\in[i_t]}a_k^2}.$$
Note that
$$g'(t) = \sum_{j\in[i_t]}|a_j|\,l_j + \frac{\left(\left(1 - \sum_{k\in[i_t]}l_k^2\right)t - 1\right)\sqrt{\|a\|_2^2 - \sum_{k\in[i_t]}a_k^2}}{\sqrt{(1-t)^2 - t^2\sum_{k\in[i_t]}l_k^2}}, \qquad g''(t) = \frac{-\left(\sum_{k\in[i_t]}l_k^2\right)\cdot\sqrt{\|a\|_2^2 - \sum_{k\in[i_t]}a_k^2}}{\left((1-t)^2 - t^2\sum_{k\in[i_t]}l_k^2\right)^{3/2}}.$$
Hence, $g(t)$ is concave and its maximizer has a closed form via the quadratic formula. Therefore, one can compute the maximum value on each interval of $t$ given by (D.4) and take the best. This yields the following algorithm (a simplified code sketch follows Theorem 62 below).

Algorithm 8: $x = \mathrm{projectMixedBall}(a, l)$
  Sort the coordinates so that $|a_i|/l_i$ is in descending order.
  Precompute $\sum_{k=1}^{i}l_k^2$, $\sum_{k=1}^{i}a_k^2$ and $\sum_{j=1}^{i}|a_j|\,l_j$ for all $i$.
  Let $g_i(t) = t\sum_{j\in[i]}|a_j|\,l_j + \sqrt{(1-t)^2 - t^2\sum_{k=1}^{i}l_k^2}\,\sqrt{\|a\|_2^2 - \sum_{k=1}^{i}a_k^2}$.
  For each $j\in\{0, \cdots, n\}$: find $t_j = \arg\max_{t : i_t = j} g_j(t)$ using (D.4).
  Find $i = \arg\max_j g_j(t_j)$.
  Output: $(1 - t_i)\,x^{(i)}$ defined by (D.3).

The discussion above leads to the following theorem.

Theorem 62.
For any $a\in\mathbb{R}^n$ and $l\in\mathbb{R}^n_{>0}$, the algorithm projectMixedBall$(a, l)$ outputs a solution to (D.1) in total work $O(n\log n)$ and depth $O(\log n)$ (in the EREW PRAM model).
Proof. The correctness follows from the discussion above. For the running time, it is known that sorting can be done in $O(n\log n)$ work and $O(\log n)$ depth in the EREW model [10], and that prefix sums can be computed in $O(n)$ work and $O(\log n)$ depth in the EREW model. The remaining steps are straightforward.
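For intuition, the following Python sketch solves (D.1) with a numerically simple method of our choosing rather than Algorithm 8's closed-form interval search: the inner box-capped ball problem is solved by bisection on the KKT scale, and the outer one-dimensional problem by ternary search (valid since, as shown above, the value is concave in $t$). It costs $O(n\log(1/\mathrm{tol}))$ per evaluation instead of $O(n\log n)$ total.

```python
import numpy as np

def project_mixed_ball(a, l):
    """Maximize <a, x> over ||x||_2 + ||diag(l)^{-1} x||_inf <= 1 (problem (D.1))."""
    a, l = np.asarray(a, float), np.asarray(l, float)
    if not np.any(a):
        return np.zeros_like(a)

    def inner(t):
        # max <a,x> s.t. ||x||_2 <= 1-t, |x_i| <= t*l_i; KKT gives x = clip(lam*a, box).
        box = t * l
        if np.linalg.norm(np.sign(a) * box) <= 1 - t:   # box corner already feasible
            return np.sign(a) * box
        lo, hi = 0.0, (1 - t) / np.linalg.norm(a)
        while np.linalg.norm(np.clip(hi * a, -box, box)) < 1 - t:
            hi *= 2                                     # grow until the l2 budget is used
        for _ in range(100):                            # bisection on the scale lam
            mid = (lo + hi) / 2
            if np.linalg.norm(np.clip(mid * a, -box, box)) < 1 - t:
                lo = mid
            else:
                hi = mid
        return np.clip(lo * a, -box, box)

    lo, hi = 0.0, 1.0
    for _ in range(200):                                # ternary search on concave value(t)
        t1, t2 = lo + (hi - lo) / 3, hi - (hi - lo) / 3
        if a @ inner(t1) < a @ inner(t2):
            lo = t1
        else:
            hi = t2
    return inner((lo + hi) / 2)
```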
E Extreme Lewis Weights and Barrier

In this section we discuss the limits of Lewis weights and the Lewis weight barrier as $p\to 0$ and $p\to\infty$. In Section E.1 we show that as $p\to 0$, Lewis weights converge, under mild assumptions, to the uniform distribution over the rows of the matrix. This shows that under mild assumptions on the structure of a polytope, the Lewis weight barrier considered in Section 5 converges to the standard logarithmic barrier. In Section E.2 we consider the opposite extreme, $p\to\infty$. In this case we show that the $\ell_\infty$ Lewis weights of the matrix $\mathbf{A}$ are precisely the weights that induce a John ellipse of the polytope $\{x\in\mathbb{R}^n : \|\mathbf{A}x\|_\infty \le 1\}$. This justifies the intuition given in the introduction regarding our barrier and path finding scheme as following a path induced by regularized John ellipses.

E.1 $p\to 0$

Here we show that the $\ell_p$ Lewis weights of a matrix $\mathbf{A}\in\mathbb{R}^{m\times n}$ in general position, i.e. such that any $n$ rows are linearly independent, converge to uniform as $p\to 0$. Note that the assumption of general position is stronger than non-degeneracy and is required for the statement to be true. For example, if there is a row that is orthogonal to all other rows, then it is not difficult to show that this row must have Lewis weight $1$ for any $p > 0$.
Given a matrix $\mathbf{A}\in\mathbb{R}^{m\times n}$ in general position, i.e. such that any $n$ rows of $\mathbf{A}$ are linearly independent, we have $\lim_{p\to0^+} w_p(\mathbf{A})_i = \frac{n}{m}$ for all $i$.
For $p\in(0, 2]$, Lemma 22 shows that the Lewis weights are given by
$$w_p(\mathbf{A}) = \arg\min_{w\in\mathbb{R}^m_{\ge0},\ \sum_{i\in[m]}w_i = n}\det\left(\mathbf{A}^\top\mathbf{W}^{1-2/p}\mathbf{A}\right).$$
Considering $\bar w\in\mathbb{R}^m$ with $\bar w_i = \frac{n}{m}$ for all $i\in[m]$, we see that
$$\min_{w_i\ge0,\ \sum_{i\in[m]}w_i = n}\det\left(\mathbf{A}^\top\mathbf{W}^{1-2/p}\mathbf{A}\right) \le \left(\frac{n}{m}\right)^{n(1-2/p)}\det(\mathbf{A}^\top\mathbf{A}). \tag{E.1}$$
On the other hand, the Cauchy-Binet formula shows that
$$\det\left(\mathbf{A}^\top\mathbf{W}^{1-2/p}\mathbf{A}\right) = \sum_{S\in\binom{[m]}{n}}\det(\mathbf{A}_S)^2\det\left(\mathbf{W}_S^{1-2/p}\right),$$
where $\mathbf{A}_S\in\mathbb{R}^{n\times n}$ denotes the rows of $\mathbf{A}$ at indices from $S$, $\mathbf{W}_S\in\mathbb{R}^{n\times n}$ is diagonal with the diagonal entries of $\mathbf{W}$ at indices from $S$, and the summation is over all subsets of size $n$. Since $\mathbf{A}$ is in general position, we have $\det\mathbf{A}_S \ne 0$ for all $S$. Therefore, for all subsets $S\subseteq[m]$ of size $n$ and all $\mathbf{W}\succeq\mathbf{0}$,
$$\det\left(\mathbf{W}_S^{1-2/p}\right) \le \frac{\det\left(\mathbf{A}^\top\mathbf{W}^{1-2/p}\mathbf{A}\right)}{\min_{S'\in\binom{[m]}{n}}\det(\mathbf{A}_{S'})^2}. \tag{E.2}$$
Now, let $\mathbf{W}_p = \mathbf{Diag}(w_p(\mathbf{A}))$ be the diagonal matrix formed by the $\ell_p$ Lewis weights of $\mathbf{A}$. Combining (E.1) and (E.2), we have that
$$\det\left(\left[\mathbf{W}_p^{1-2/p}\right]_S\right) \le c\left(\frac{n}{m}\right)^{n(1-2/p)} \quad\text{where}\quad c = \frac{\det\mathbf{A}^\top\mathbf{A}}{\min_{S\in\binom{[m]}{n}}(\det\mathbf{A}_S)^2}.$$
Hence, since $1 - 2/p < 0$, we have that $\det([\mathbf{W}_p]_S) \ge c^{-\frac{p}{2-p}}\cdot\left(\frac{n}{m}\right)^n$. Let $w^* = \liminf_{p\to0^+}w_p$. Taking the limit $p\to0^+$ on both sides (note $c^{-p/(2-p)}\to1$), we have that $\det(\mathbf{W}^*_S) \ge (n/m)^n$ for all subsets $S$ of size $n$. Since this holds for all subsets and since $\sum_{i\in[m]}w^*_i = n$, we have that $w^*_i = \frac{n}{m}$ for all $i$. Since
$$\limsup_{p\to0^+}\sum_{i\in[m]}w_p(\mathbf{A})_i = n = \liminf_{p\to0^+}\sum_{i\in[m]}w_p(\mathbf{A})_i,$$
this shows that $\lim_{p\to0^+}w_p$ exists and converges to $\frac{n}{m}$ entrywise.
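Lemma 63 is easy to observe numerically. The sketch below computes $\ell_p$ Lewis weights by the classical fixed-point iteration $w_i \leftarrow (\vec a_i^\top(\mathbf{A}^\top\mathbf{W}^{1-2/p}\mathbf{A})^{-1}\vec a_i)^{p/2}$, whose fixed point is exactly $w_i = \sigma(\mathbf{W}^{1/2-1/p}\mathbf{A})_i$ and which contracts for $p\in(0,4)$; this dense implementation and its iteration counts are our illustration, not the paper's computeApxWeight.

```python
import numpy as np

def lewis_weights(A, p, iters=3000):
    """l_p Lewis weights via the fixed point w_i = sigma_i(W^{1/2-1/p} A).

    Contraction factor is roughly |1 - p/2|, so small p needs many iterations;
    starting from uniform keeps the entrywise powers numerically tame.
    """
    m, n = A.shape
    w = np.full(m, n / m)
    for _ in range(iters):
        M = A.T @ (w[:, None] ** (1 - 2 / p) * A)             # A^T W^{1-2/p} A
        tau = np.einsum('ij,ij->i', A @ np.linalg.inv(M), A)  # a_i^T M^{-1} a_i
        w = tau ** (p / 2)
    return w

rng = np.random.default_rng(0)
A = rng.standard_normal((8, 3))   # generic random rows are in general position a.s.
for p in [1.0, 0.5, 0.25, 0.1]:
    w = lewis_weights(A, p)
    print(p, np.max(np.abs(w - 3 / 8)))  # deviation from n/m shrinks as p -> 0 (Lemma 63)
```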
E.2 $p\to\infty$

Here we show that as $p\to\infty$ the ellipse $E = \{x\in\mathbb{R}^n : x^\top\mathbf{M}x \le 1\}$ for $\mathbf{M} \overset{\mathrm{def}}{=} \lim_{p\to+\infty}\mathbf{A}^\top\mathbf{W}_p\mathbf{A}$, where $\mathbf{W}_p = \mathbf{Diag}(w_p(\mathbf{A}))$, is the John ellipse of the polytope $K = \{x\in\mathbb{R}^n : \|\mathbf{A}x\|_\infty \le 1\}$, i.e. the ellipsoid of maximum volume contained inside $K$. To prove this we use the following lemma, proved in [25], characterizing the John ellipse.

Lemma 64. Given a polytope $\Omega = \{x\in\mathbb{R}^n : \|\mathbf{A}x\|_\infty \le 1\}$ with $\mathbf{A}\in\mathbb{R}^{m\times n}$, let $E$ be the John ellipsoid of $\Omega$, namely the maximum volume ellipsoid contained inside $\Omega$. Then $E = \{x : x^\top\mathbf{A}^\top\mathbf{W}\mathbf{A}x \le 1\}$, with the diagonal matrix $\mathbf{W}$ given by the vector maximizing
$$\max_{w_i\ge0,\ \sum_{i\in[m]}w_i = n}\log\det\left(\mathbf{A}^\top\mathbf{W}\mathbf{A}\right).$$

Using this we prove our desired result regarding the limit of Lewis weights as $p\to\infty$.

Lemma 65.
For non-degenerate $\mathbf{A}\in\mathbb{R}^{m\times n}$ let $\mathbf{M} \overset{\mathrm{def}}{=} \lim_{p\to+\infty}\mathbf{A}^\top\mathbf{W}_p\mathbf{A}$ where $\mathbf{W}_p = \mathbf{Diag}(w_p(\mathbf{A}))$. Then $E = \{x\in\mathbb{R}^n : x^\top\mathbf{M}x \le 1\}$ is the John ellipsoid of $K = \{x\in\mathbb{R}^n : \|\mathbf{A}x\|_\infty \le 1\}$.

Proof. Let $c_{p,m}$ be the constant defined in Lemma 27. Further, for all $p > 0$ let $\mathbf{M}_p \overset{\mathrm{def}}{=} c_{p,m}\mathbf{A}^\top\mathbf{W}_p\mathbf{A}$ and $E_p \overset{\mathrm{def}}{=} \{x\in\mathbb{R}^n : x^\top\mathbf{M}_px \le 1\}$. Lemma 27 shows that $E_p\subseteq K$. Further, letting $s_n$ denote the volume of the unit ball, we have that
$$\mathrm{vol}(E_p) = s_n\left(c_{p,m}^n\det(\mathbf{A}^\top\mathbf{W}_p\mathbf{A})\right)^{-1/2} \ge s_n\left(c_{p,m}^n\det\left(\mathbf{A}^\top\mathbf{W}_p^{1-2/p}\mathbf{A}\right)\right)^{-1/2} = s_n\left(c_{p,m}^n\cdot\max_{w_i\ge0,\ \sum_{i\in[m]}w_i=n}\det\left(\mathbf{A}^\top\mathbf{W}^{1-2/p}\mathbf{A}\right)\right)^{-1/2},$$
where in the last step we used Lemma 22. Note that $w^{1-2/p}\to w$ as $p\to\infty$ for all $w > 0$, and $c_{p,m}\to1$ as $p\to\infty$. Hence,
$$\liminf_{p\to+\infty}\mathrm{vol}(E_p) \ge s_n\cdot\left(\max_{w_i\ge0,\ \sum_{i\in[m]}w_i=n}\det\left(\mathbf{A}^\top\mathbf{W}\mathbf{A}\right)\right)^{-1/2}.$$
On the other hand, it is known that the John ellipsoid $E^*$ of $K$ is unique and that its volume is given by the right-hand side (Lemma 64). This implies that $E_p$ converges to the John ellipsoid of $K$.
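The variational characterization in Lemma 64 is a D-optimal design problem, and a simple way to compute its maximizer is the classical multiplicative ascent, which repeatedly replaces each weight by its leverage score. The dense sketch below is our illustration of that standard method; by Lemma 65, the resulting weights are also what the $\ell_p$ Lewis weights approach as $p\to\infty$.

```python
import numpy as np

def john_ellipse_weights(A, iters=2000):
    """Weights W of the John ellipse of {x : ||Ax||_inf <= 1} (Lemma 64).

    Multiplicative ascent for max{log det(A^T W A) : w >= 0, sum w = n}:
    the update w_i <- sigma_i(W^{1/2} A) preserves sum(w) = n and
    monotonically increases the objective.
    """
    m, n = A.shape
    w = np.full(m, n / m)
    for _ in range(iters):
        Minv = np.linalg.inv(A.T @ (w[:, None] * A))
        w = w * np.einsum('ij,ij->i', A @ Minv, A)  # w_i <- w_i * a_i^T M^{-1} a_i
    return w
```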
F Linear System Properties

Often, the running time of a linear system solver depends on the condition number of the matrix and/or on how quickly the linear systems change from iteration to iteration. Here we show that our interior point method enjoys properties frequently exploited by other interior point methods and is therefore amenable to techniques for improving iteration costs.

There are two key lemmas we prove in this section. First, in Lemma 67 we provide a general technical lemma on the structure of weighted minimizers of self-concordant barriers. This allows us to reason about how close the weighted central path can come to the boundary of the polytope, and thereby about how ill-conditioned the linear systems we need to solve can become over the course of the algorithm (see Corollary 68 and its proof). Second, in Lemma 69 we bound how much the linear systems can change over the course of our algorithm.

Since Lemma 67 is of independent interest, we prove a slightly more general version than what we need here (which in turn is a generalization of [62, Lemma 16]). We consider minimizing weighted combinations of arbitrary self-concordant functions subject to a linear constraint and bound, under changes to the weights, how far the line between the minimizers can be extended while remaining in the domain. To prove Lemma 67 we use the following known equivalent characterization of self-concordance and properties of self-concordant functions.
Lemma 66 ([46, Theorem 4.1.6, 4.2.4]). We call a convex function $\phi$ a $\nu$-self-concordant barrier for an open convex set $\Omega\subseteq\mathbb{R}^n$ if $\phi(x)\to+\infty$ as $x\to\partial\Omega$ and for all $x\in\Omega$ and $h\in\mathbb{R}^n$, $\psi(t)\overset{\mathrm{def}}{=}\phi(x+th)$ satisfies $\psi'''(0) \le 2(\psi''(0))^{3/2}$ and $\psi'(0) \le (\nu\cdot\psi''(0))^{1/2}$. For such $\phi$ the following hold:
  - For all $s\in\Omega$ and $t\in\mathbb{R}^n$ such that $\|t - s\|_{\nabla^2\phi(s)} < 1$, we have $t\in\mathrm{dom}(\phi)$.
  - For all $x, y\in\Omega$ we have $\nabla\phi(x)^\top(y - x) \le \nu$.
Lemma 67. For all $i\in[m]$ let $\phi_i$ be a $\nu_i$-self-concordant barrier on $\mathrm{dom}(\phi_i)$, an open convex subset of $\mathbb{R}$. Let $\Omega \overset{\mathrm{def}}{=} \{x\in\mathbb{R}^m : \mathbf{A}^\top x = b \text{ and } x_i\in\mathrm{dom}(\phi_i)\ \forall i\in[m]\}$ for arbitrary $\mathbf{A}\in\mathbb{R}^{m\times n}$ and $b\in\mathbb{R}^n$. Further, let $c\in\mathbb{R}^m$ and for all $w\in\mathbb{R}^m_{>0}$ let
$$x_w \overset{\mathrm{def}}{=} \arg\min_{x\in\Omega}\ c^\top x + \sum_{i\in[m]}w_i\phi_i(x_i).$$
Then if $w^{(0)}, w^{(1)}\in\mathbb{R}^m_{>0}$ are such that either $w^{(0)}\ge w^{(1)}$ or $w^{(1)}\ge w^{(0)}$ entrywise, then $p(t) \overset{\mathrm{def}}{=} x_{w^{(0)}} + t(x_{w^{(1)}} - x_{w^{(0)}})\in\Omega$ for all $t\in(-\theta, 1+\theta)$, where
$$\theta \overset{\mathrm{def}}{=} \frac{\min\left\{\min_{j\in[m]}w_j^{(0)},\ \min_{j\in[m]}w_j^{(1)}\right\}}{\sum_{j\in[m]}\nu_j\left|w_j^{(0)} - w_j^{(1)}\right|}. \tag{F.1}$$
Further, for any $w^{(0)}, w^{(1)}\in\mathbb{R}^m_{>0}$ (regardless of their entrywise relation), we have $p(t)\in\Omega$ for all $t\in(-\gamma, 1+\gamma)$ where $\gamma = \frac{\theta^2}{1+2\theta}$ for $\theta$ as defined above.

Proof. First, we prove the case where either $w^{(0)}\ge w^{(1)}$ or $w^{(1)}\ge w^{(0)}$ entrywise. Note that $p(t)$ is the straight line through $p(0) = x_{w^{(0)}}$ and $p(1) = x_{w^{(1)}}$. Let $\theta_0$ denote the smallest value for which either $p(-\theta_0)\notin\Omega$ or $p(1+\theta_0)\notin\Omega$, i.e. the least amount the segment between $x_{w^{(0)}}$ and $x_{w^{(1)}}$ needs to be extended in either direction to leave $\Omega$. By convexity, $p(t)\in\Omega$ for all $t\in[0,1]$ and therefore $\theta_0 \ge 0$. Further, we assume that $\theta_0$ is finite, i.e. the straight line through $x_{w^{(0)}}$ and $x_{w^{(1)}}$ leaves $\Omega$, as otherwise the lemma trivially holds.

Note that either $p(1+\theta_0)\in\partial\Omega$, the boundary of $\Omega$, or $p(-\theta_0)\in\partial\Omega$. By symmetry, we assume without loss of generality that $p(1+\theta_0)\in\partial\Omega$ (applying the lemma under this assumption with $w^{(0)}$ and $w^{(1)}$ swapped yields the other case). Consequently, for some $i\in[m]$ we have $p(1+\theta_0)_i\in\partial\,\mathrm{dom}(\phi_i)$, and we fix such an $i\in[m]$ throughout. For notational convenience, we define $\psi_j(t) \overset{\mathrm{def}}{=} \phi_j(p(t)_j)$ for all $j\in[m]$ and let $\mathrm{dom}(\psi_j)$ denote the set of values $u$ for which $p(u)_j\in\mathrm{dom}(\phi_j)$. We will leverage that each $\psi_j$ is $\nu_j$-self-concordant, as the restriction of a $\nu$-self-concordant function to a line is $\nu$-self-concordant (see Lemma 66).

Now, note that the first bullet of Lemma 66 implies that if for some $t\in(-\theta_0, 1+\theta_0)$ and $u\ge t$ we have $\sqrt{\psi_i''(t)}\,(u-t) < 1$, then $u\in\mathrm{dom}(\psi_i)$. However, we know that $u\notin\mathrm{dom}(\psi_i)$ for $u = 1+\theta_0$ and thus $\psi_i''(t) \ge (1+\theta_0-t)^{-2}$. Integrating yields that
$$\psi_i'(1) = \psi_i'(0) + \int_0^1\psi_i''(t)\,dt \ge \psi_i'(0) + \int_0^1\frac{dt}{(1+\theta_0-t)^2} = \psi_i'(0) + \frac{1}{\theta_0(1+\theta_0)}.$$
Since each $\psi_j$ is convex we have that $\psi_j'(1) \ge \psi_j'(0)$, and combining yields that
$$\sum_{j\in[m]}w_j^{(1)}\psi_j'(1) \ge \sum_{j\in[m]}w_j^{(1)}\psi_j'(0) + \frac{w_i^{(1)}}{\theta_0(1+\theta_0)} \quad\text{and}\quad \sum_{j\in[m]}w_j^{(0)}\psi_j'(1) \ge \sum_{j\in[m]}w_j^{(0)}\psi_j'(0) + \frac{w_i^{(0)}}{\theta_0(1+\theta_0)}. \tag{F.2}$$
Next, note that the optimality conditions of $x_{w^{(0)}}$ and $x_{w^{(1)}}$ imply that for some $\lambda^{(0)}, \lambda^{(1)}\in\mathbb{R}^n$,
$$\sum_{j\in[m]}w_j^{(0)}\nabla\phi_j(x_{w^{(0)}}) = \mathbf{A}\lambda^{(0)} - c \quad\text{and}\quad \sum_{j\in[m]}w_j^{(1)}\nabla\phi_j(x_{w^{(1)}}) = \mathbf{A}\lambda^{(1)} - c.$$
Now, since $\mathbf{A}^\top x_{w^{(0)}} = b = \mathbf{A}^\top x_{w^{(1)}}$, this implies
$$\sum_{j\in[m]}w_j^{(1)}\psi_j'(1) = \sum_{j\in[m]}w_j^{(1)}\left[\nabla\phi_j(x_{w^{(1)}})^\top\left(x_{w^{(1)}} - x_{w^{(0)}}\right)\right] = \left[\mathbf{A}\lambda^{(1)} - c\right]^\top\left(x_{w^{(1)}} - x_{w^{(0)}}\right) = -c^\top\left(x_{w^{(1)}} - x_{w^{(0)}}\right) = \left[\mathbf{A}\lambda^{(0)} - c\right]^\top\left(x_{w^{(1)}} - x_{w^{(0)}}\right) = \sum_{j\in[m]}w_j^{(0)}\psi_j'(0).$$
Combining with (F.2) yields that
$$\frac{w_i^{(1)}}{\theta_0(1+\theta_0)} \le \sum_{j\in[m]}\left(w_j^{(0)} - w_j^{(1)}\right)\psi_j'(0) \quad\text{and}\quad \frac{w_i^{(0)}}{\theta_0(1+\theta_0)} \le \sum_{j\in[m]}\left(w_j^{(0)} - w_j^{(1)}\right)\psi_j'(1).$$
Further, the definition of $\theta_0$ implies that $t\in\mathrm{dom}(\psi_j)$ for all $t\in(-\theta_0, 1+\theta_0)$ and $j\in[m]$. Therefore, the second bullet of Lemma 66 implies that
$$\psi_j'(0)\cdot(1+\theta_0) = \psi_j'(0)\cdot\left((1+\theta_0) - 0\right) \le \nu_j \quad\text{and}\quad -\psi_j'(1)\cdot(1+\theta_0) = \psi_j'(1)\cdot\left((-\theta_0) - 1\right) \le \nu_j.$$
Consequently, if $w_j^{(1)} \le w_j^{(0)}$ for all $j$, we have
$$\min\left\{\min_{j\in[m]}w_j^{(0)},\ \min_{j\in[m]}w_j^{(1)}\right\} \le w_i^{(1)} \le \theta_0\sum_{j\in[m]}\left|w_j^{(0)} - w_j^{(1)}\right|\cdot\nu_j,$$
and if $w_j^{(1)} \ge w_j^{(0)}$ for all $j$, we have
$$\min\left\{\min_{j\in[m]}w_j^{(0)},\ \min_{j\in[m]}w_j^{(1)}\right\} \le w_i^{(0)} \le \theta_0\sum_{j\in[m]}\left|w_j^{(0)} - w_j^{(1)}\right|\cdot\nu_j.$$
Therefore, in either case $\theta_0 \ge \theta$ and the result in (F.1) holds.

Finally, we consider the case of arbitrary $w^{(0)}, w^{(1)}\in\mathbb{R}^m_{>0}$ (i.e. where it is not necessarily the case that $w^{(0)}\ge w^{(1)}$ or $w^{(1)}\ge w^{(0)}$). In this case, we let $v_j = \max\{w_j^{(0)}, w_j^{(1)}\}$, where the maximum is applied entrywise. Note that $w^{(0)}\le v$ and $w^{(1)}\le v$ entrywise, and consequently we can apply the previous result, i.e. (F.1), to the pairs $(w^{(0)}, v)$ and $(v, w^{(1)})$ to show that
$$x^{(0)}_t \overset{\mathrm{def}}{=} x_{w^{(0)}} + t\left(x_v - x_{w^{(0)}}\right)\in\Omega \quad\text{and}\quad x^{(1)}_t \overset{\mathrm{def}}{=} x_v + t\left(x_{w^{(1)}} - x_v\right)\in\Omega$$
for all $t\in(-\theta, 1+\theta)$, where (as in the previous case)
$$\theta \ge \frac{\min\left\{\min_{j\in[m]}w_j^{(0)},\ \min_{j\in[m]}w_j^{(1)}\right\}}{\sum_{j\in[m]}\nu_j\left|w_j^{(0)} - w_j^{(1)}\right|}.$$
Consequently, since $\Omega$ is convex, considering $t = 1+\gamma$ for $\gamma\in[0,\theta)$ we have
$$\Omega \ni \left(\frac{\gamma}{1+2\gamma}\right)x^{(0)}_t + \left(\frac{1+\gamma}{1+2\gamma}\right)x^{(1)}_t =
\left(\frac{1}{1+2\gamma}\right)\left[-\gamma^2 x_{w^{(0)}} + (1+\gamma)^2 x_{w^{(1)}}\right] = x_{w^{(0)}} + \frac{(1+\gamma)^2}{1+2\gamma}\left(x_{w^{(1)}} - x_{w^{(0)}}\right) = x_{w^{(0)}} + \left(1 + \frac{\gamma^2}{1+2\gamma}\right)\left(x_{w^{(1)}} - x_{w^{(0)}}\right).$$
Further, considering $t = -\gamma$ for $\gamma\in[0,\theta)$ we have
$$\Omega \ni \left(\frac{1+\gamma}{1+2\gamma}\right)x^{(0)}_t + \left(\frac{\gamma}{1+2\gamma}\right)x^{(1)}_t =
\left(\frac{1}{1+2\gamma}\right)\left[(1+\gamma)^2 x_{w^{(0)}} - \gamma^2 x_{w^{(1)}}\right] = x_{w^{(0)}} - \frac{\gamma^2}{1+2\gamma}\left(x_{w^{(1)}} - x_{w^{(0)}}\right).$$
Consequently, $x_{w^{(0)}} + t(x_{w^{(1)}} - x_{w^{(0)}})\in\Omega$ for all $t\in(-\gamma^*, 1+\gamma^*)$ with $\gamma^* = \frac{\theta^2}{1+2\theta}$, as desired.

In the applications we consider in Section 6.1, we have $\arg\min_x f_t(x, w) = \arg\min_x f(x, w/t)$. Further, since $w$ is polynomially bounded above and below in terms of $m$, the ratio between the old weights $w^{(1)}/t^{(1)}$ and the new weights $w^{(2)}/t^{(2)}$ is bounded polynomially in terms of the ratio of $t^{(1)}$ to $t^{(2)}$ and $m$. Further, since our initial point starts away from the boundary of the polytope, Lemma 67 implies that the distance from $x$ to the boundary can always be bounded.

Corollary 68.
Using the notation and assumptions of either Theorem 1 or Theorem 43, we have $\phi_i''(x_i) \le O(\mathrm{poly}(mU/\epsilon))$ throughout the algorithm.

Proof. We prove the claim for Theorem 1, as Theorem 43 applies the same algorithm. Consider Algorithm 3. By the assumptions of Theorem 1, the initial point $x$ has distance at least $1/U$ from each of the $\ell_i \le x_i \le u_i$ constraints. Further, by Lemma 67 and since the weights $w$ and $w^{(\mathrm{new})}$ are bounded above and below polynomially in $m$, the point $x_t = \arg\min f_t(x, w^{(\mathrm{new})})$ has distance at least $1/\mathrm{poly}(mU)$ from each of the $\ell_i \le x_i \le u_i$ constraints. By Lemma 8, $\phi_i''((x_t)_i) \le O(\mathrm{poly}(mU/\epsilon))$, and by Lemma 40, we have that $\delta_t(x^{(\mathrm{new})}, w^{(\mathrm{new})}) = O(1/\log m)$. Hence, Lemma 42 and Lemma 8 show that $\phi_i''(x^{(\mathrm{new})}_i) \le O(\mathrm{poly}(mU/\epsilon))$. By the same argument, we also have $\phi_i''(x_i) \le O(\mathrm{poly}(mU/\epsilon))$ for $x = x^{(\mathrm{final})}$ and for all intermediate steps $x$.

Lemma 69.
Using the notation and assumptions of either Theorem 1 or Theorem 43, let $\mathbf{A}^\top\mathbf{D}_k\mathbf{A}$ be the $k$-th linear system used in the algorithm LPSolve. For all $k \ge 1$, we have the following:
  1. The condition number of $\mathbf{A}^\top\mathbf{D}_k\mathbf{A}$ relative to $\mathbf{A}^\top\mathbf{A}$ is bounded by $\mathrm{poly}(mU/\epsilon)$, i.e.
$$\mathrm{poly}(\epsilon/(mU))\,\mathbf{A}^\top\mathbf{A} \preceq \mathbf{A}^\top\mathbf{D}_k\mathbf{A} \preceq \mathrm{poly}(mU/\epsilon)\,\mathbf{A}^\top\mathbf{A}.$$
  2. $\|\log(\mathbf{D}_{k+1}) - \log(\mathbf{D}_k)\|_\infty \le 1/8$.
  3. $\|\log(\mathbf{D}_{k+1}) - \log(\mathbf{D}_k)\|_{w_p(\mathbf{D}_k^{1/2}\mathbf{A})} \le 1/8$.

Proof. We prove the claim for Theorem 1, as Theorem 43 applies the same algorithm. During the algorithm, the matrix we need to solve is of the form $\mathbf{A}^\top\mathbf{D}\mathbf{A}$ where $\mathbf{D} = \mathbf{W}^{-1}\mathbf{\Phi}''(x)^{-1}$. Lemma 27 shows that $\mathbf{A}^\top\mathbf{D}\mathbf{A} \approx_{\mathrm{poly}(m)} \mathbf{A}^\top\mathbf{\Phi}''(x)^{-1}\mathbf{A}$. Lemma 8 shows that $\phi_i''(x_i) \ge 1/\mathrm{poly}(U)$, and Corollary 68 shows that $\phi_i''(x_i)$ is bounded above by $\mathrm{poly}(mU/\epsilon)$. Thus, the condition number of $\mathbf{A}^\top\mathbf{D}_k\mathbf{A}$ relative to $\mathbf{A}^\top\mathbf{A}$ is bounded by $\mathrm{poly}(mU/\epsilon)$.

Now, we bound the change of $\mathbf{D}$ by bounding the changes of $\mathbf{\Phi}''(x)$ and of $\mathbf{W}$ separately. For the change of $\mathbf{\Phi}''(x)$, (3.5) shows that $\|\sqrt{\phi''(x)}\,h_t(x,w)\|_{w+\infty} \le \|\mathbf{P}_{x,w}\|_{w+\infty}\,\delta_t$. Since $\|\mathbf{P}_{x,w}\|_{w+\infty} \le 2$ and $\delta_t \le \frac{1}{64}$, we have
$$\left\|\sqrt{\phi''(x)}\left(x^{(\mathrm{new})} - x\right)\right\|_{w+\infty} = \left\|\sqrt{\phi''(x)}\,h_t(x,w)\right\|_{w+\infty} \le \frac{1}{32}.$$
Applying this with Lemma 8, we have
$$\left\|\log\left(\phi''(x^{(\mathrm{new})})\right) - \log\left(\phi''(x)\right)\right\|_{w+\infty} \le \left(1 - \left\|\sqrt{\phi''(x)}(x^{(\mathrm{new})} - x)\right\|_{w+\infty}\right)^{-2} - 1 \le \frac{1}{15}.$$
Since $w_i \ge w_p(\mathbf{D}^{1/2}\mathbf{A})_i$ for all $i$, we have
$$\left\|\log\left(\phi''(x^{(\mathrm{new})})\right) - \log\left(\phi''(x)\right)\right\|_{w_p(\mathbf{D}^{1/2}\mathbf{A})+\infty} \le \frac{1}{15}. \tag{F.3}$$
For the change of $\mathbf{W}$, we look at the description of centeringInexact. The algorithm ensures that the change of $\log(w)$ lies in $(1+\epsilon)U$, where $U = \{x\in\mathbb{R}^m \mid \|x\|_{w+\infty} \le (1 - c_k)\delta_t\}$ for the constant $c_k\in[0,1]$ used there. Since $\delta_t \le \frac{1}{64}$ and $w_i \ge w_p(\mathbf{D}^{1/2}\mathbf{A})_i$ for all $i$, we get that
$$\left\|\log\left(w^{(\mathrm{new})}\right) - \log\left(w\right)\right\|_{w_p(\mathbf{D}^{1/2}\mathbf{A})+\infty} \le \frac{1}{32}.$$
Combining these bounds yields claims 2 and 3, since $\frac{1}{15} + \frac{1}{32} \le \frac{1}{8}$.
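Property 2 is exactly what makes solver reuse effective: when $\|\log\mathbf{D}_{k+1} - \log\mathbf{D}_k\|_\infty$ is a small constant, a factorization computed for one system remains an $O(1)$-quality preconditioner for many subsequent ones. The sketch below is our illustration of this idea, not the paper's solver: it assumes a stream of pairs $(d_k, b_k)$ and uses a dense Cholesky factorization where the paper would use fast linear system solvers.

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

def solve_sequence(A, systems, tol=1e-8):
    """Solve A^T diag(d) A x = b for a slowly drifting sequence of d (Lemma 69).

    Refactor only when d has drifted by more than 0.5 in log-infinity norm;
    otherwise reuse the old factorization as a preconditioner in CG, where the
    preconditioned condition number is e^{O(drift)} = O(1).
    """
    d_ref, fac = None, None
    for d, b in systems:
        if d_ref is None or np.max(np.abs(np.log(d) - np.log(d_ref))) > 0.5:
            d_ref = d.copy()
            fac = cho_factor(A.T @ (d_ref[:, None] * A))   # rare refactorization
        M = A.T @ (d[:, None] * A)
        # Preconditioned conjugate gradient with the stale factorization.
        x = np.zeros(A.shape[1]); r = b.copy()
        z = cho_solve(fac, r); p = z; rz = r @ z
        while np.linalg.norm(r) > tol * np.linalg.norm(b):
            Mp = M @ p
            alpha = rz / (p @ Mp)
            x += alpha * p; r -= alpha * Mp
            z = cho_solve(fac, r); rz_new = r @ z
            p = z + (rz_new / rz) * p; rz = rz_new
        yield x
```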