An Improved Deterministic Rescaling for Linear Programming Algorithms
Rebecca Hoberg∗ and Thomas Rothvoss†
University of Washington, Seattle
December 15, 2016
Abstract
The perceptron algorithm for linear programming, arising from machine learning, has been around since the 1950s. While not a polynomial-time algorithm, it is useful in practice due to its simplicity and robustness. In 2004, Dunagan and Vempala showed that a randomized rescaling turns the perceptron method into a polynomial time algorithm, and later Peña and Soheili gave a deterministic rescaling. In this paper, we give a deterministic rescaling for the perceptron algorithm that improves upon the previous rescaling methods by making it possible to rescale much earlier. This results in a faster running time for the rescaled perceptron algorithm. We will also demonstrate that the same rescaling methods yield a polynomial time algorithm based on the multiplicative weights update method. This draws a connection to an area that has received a lot of recent attention in theoretical computer science.
One of the central algorithmic problems in theoretical computer science as well as in more practical areas like operations research is finding the solution to a linear program
max{ c^T x | Ax ≥ b }    (1)
where A ∈ R^{m×n}, c ∈ R^n and b ∈ R^m. On the theoretical side, linear programming relaxations are the backbone for many approximation algorithms [WS11, Vaz01]. On the practical side, many real-world problems can either be modeled as linear programs, or they can at least be modeled as integer linear programs; the latter ones are then solved using Branch & Bound or Branch & Cut methods. Both of these methods rely on repeatedly computing solutions to linear programs [CCZ14].
The first algorithm for solving linear programs was the simplex method due to Dantzig [Dan51]. While the method performs well in practice (and is still the method of choice today), for almost any popular pivoting rule one can construct instances where the algorithm takes exponential time [KM72]. In 1979, Khachiyan [Hač79, Sch86] developed the first polynomial-time algorithm.
(∗ Email: [email protected]. † Email: [email protected]. Supported by an Alfred P. Sloan Research Fellowship. Both authors supported by NSF grant 1420180 with title “Limitations of convex relaxations in combinatorial optimization”.)
However, despite the desirable theoretical properties, Khachiyan's ellipsoid method turned out to be too slow for practical applications. In the 1980s, interior point methods were developed which were efficient in theory and in practice. Karmarkar's algorithm has a running time of O(n^{3.5} L), where L is the number of bits in the input [Kar84]. Since then, there have been many further improvements in interior point methods. As recently as 2015, it was shown that there is an interior-point method using only Õ(√rank(A) · L) many iterations; this upper bound essentially matches known lower bound barriers [LS15].
A common way to find a polynomial-time linear programming algorithm is with a greedy type procedure along with periodic rescaling [DVZ16a]. One famous example of this is the perceptron algorithm [Agm54], which we will focus on in this paper. Instead of solving (1) directly, this method finds a feasible point in the open polyhedral cone
P = { x ∈ R^n | Ax > 0 }    (2)
where A ∈ R^{m×n}; using standard reductions one can interchange the representations (1) and (2) with at most a linear overhead. The classical perceptron algorithm starts at the origin and iteratively walks in the direction of any violated constraint. In the worst case this method is not polynomial time, but it is still useful due to its simplicity and robustness [Agm54]. In 2004, Dunagan and Vempala [DV06] showed that using a randomized rescaling procedure, the algorithm can be modified to find a point in (2) in polynomial time. Explicitly, their algorithm runs in time Õ(mn^4 log ρ), where ρ ≥ 1 is such that a ball of radius 1/ρ fits into the intersection of P with the unit ball B := B(0, 1). A deterministic rescaling procedure was provided by Peña and Soheili in [PS16]. Their algorithm uses an improved convergence of the perceptron algorithm based on Nesterov's smoothing technique [Nes05, PS12]. Overall, their algorithm takes time Õ(m²n^{2.5} log ρ).
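To make the classical method concrete, the following is a minimal sketch of the perceptron iteration for the conic feasibility problem (2), assuming for simplicity that the rows of A are scaled to unit norm (which does not change the cone). The function name, the iteration budget and the use of NumPy are our own illustration and not part of the paper.

    import numpy as np

    def perceptron_cone(A, max_iters=100000):
        """Try to find x with Ax > 0 by always stepping along a violated constraint.

        A is assumed to have unit-norm rows. Returns x, or None if the budget runs
        out; the number of iterations is not polynomial in the worst case.
        """
        m, n = A.shape
        x = np.zeros(n)
        for _ in range(max_iters):
            violated = np.flatnonzero(A @ x <= 0)
            if violated.size == 0:
                return x                  # Ax > 0, feasible point found
            i = violated[0]
            x = x + A[i]                  # walk in the direction of a violated constraint
        return None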
Another classical LP algorithm that we will discuss in this paper is based on a very general algorithmic framework called the multiplicative weights update (MWU) method. In its general form one imagines having m experts who each incur some cost in a sequence of iterations. In each iteration we have to select a convex combination of experts so that the expected cost is minimized, where we only have information on the past costs. The MWU method initially gives all experts the same weight, and in each iteration the weight of expert i is multiplied by exp(−ε · cost incurred by expert i), where ε is some parameter. Then on average, the convex combination given by the weights will be nearly as good as the cost incurred by the best expert. MWU is an online algorithm that does not need to know the costs in advance, and it has numerous applications in machine learning, economics and theoretical computer science. In fact, MWU has been reinvented many times under different names in the literature. Recent applications in theoretical computer science include fast approximations to maximum flows [CKM+11, Mad10], fractional packing and covering problems [PST95, GK07], and approximate semidefinite programming [AHK05]; see [AHK12] for a survey.
In our setting we consider the cone P = { x : Ax > 0 }, where the rows are normalized so that ||A_i||_2 = 1. (Notice that normalizing the rows does not affect the feasible region.) At iteration t, the cost associated with expert i will be ⟨A_i, p^{(t)}⟩ for some vector p^{(t)}. Therefore the weight of expert i at time T will be e^{−⟨A_i, x⟩} where x = Σ_{t=1}^{T} ε_t p^{(t)}. The analysis of MWU consists of bounding the sum of the weights, which in this case is given by the potential function Φ(x) = Σ_{i=1}^{m} e^{−⟨A_i, x⟩}. If we choose the update vector p^{(t)} to be a weighted sum of constraints at every iteration, notice that the resulting walk in R^n corresponds to gradient descent on Φ; in this case MWU terminates in Õ(ρ²) iterations. (The Õ-notation suppresses any polylog(m, n) terms.) However, ρ need not be polynomial in the input size, and in fact this method is not polynomial time in the worst case. For reference, the general form for the rescaled LP algorithms we will present in this paper is given in Algorithm 1.
Algorithm 1
FOR Õ(n log ρ) phases DO:
(1) Initial phase: Either find x ∈ P or provide a λ ∈ R^m_{≥0} with ||λ||_1 = 1 and ||λA||_2 ≤ ∆.
(2) Rescaling phase: Find an invertible linear transformation F so that vol(F(P) ∩ B) is a constant fraction larger than vol(P ∩ B). Replace P by F(P).
Our technical and conceptual contributions are as follows:
(1) Improved rescaling:
We design a rescaling method that applies for a parameter of ∆ = Θ(1/n), which improves over the threshold ∆ = Θ(1/(m√n)) required by [PS16]. This results in a smaller number of iterations that are needed per phase until one can rescale the system.
(2) Rescaling the MWU method:
We show that in Õ(1/∆²) iterations the MWU method can be made to implement the initial phase of Algorithm 1. The idea is that if gradient descent is making insufficient progress, then the gradient must have small norm, and from this we can extract an appropriate λ. In particular, combining this with our rescaling method, we obtain a polynomial time LP algorithm based on MWU.
(3) Faster gradient descent:
The standard gradient descent approach terminates in at most Õ(1/∆²) iterations, which matches the first approach in [PS16]. The more recent work of Peña and Soheili [PS12] uses Nesterov's smoothing technique to bring the number of iterations down to a linear term of Õ(1/∆). We prove that essentially the same speedup can be obtained without modifying the objective function, by projecting the gradient onto a significant eigenspace of the Hessian.
(4) Computing an approximate John ellipsoid:
For a general convex body K, computing a John ellipsoid is equivalent to finding a linear transformation F so that F(K) is well rounded. For our unbounded region P, our improved rescaling algorithm gives a linear transformation F so that F(P) ∩ B is well rounded.
In this section we fix an initial phase for Algorithm 1; in particular, the paper of Peña and Soheili gives a smooth variant of the perceptron algorithm that achieves the following guarantee:
Lemma 1 ([PS16]). In time Õ(mn/∆), either the smooth perceptron phase outputs x ∈ P or it gives λ ∈ R^m_{≥0} with ||λ||_1 = 1 and ||λA||_2 ≤ ∆.
[Figure 1: Visualization of the width of P in a direction c and of the rescaling operation mapping P ∩ B to F(P ∩ B).]
We then focus on the rescaling phase of the algorithm. Our main result is that we are able to rescale with ∆ = O(1/n).
Lemma 2.
Suppose λ ∈ R^m_{≥0} with ||λ||_1 = 1 and ||λA||_2 ≤ O(1/n). Then in time O(mn²) we can rescale P so that vol(P ∩ B) increases by a constant factor.
We introduce two new rescaling methods that achieve the guarantee of Lemma 2. First we show that we can extract a thin direction by sampling rows of A using a random hyperplane. The linear transformation that scales P in that direction, corresponding to a rank-1 update, will increase vol(P ∩ B) by a constant factor.
Next we give an alternate rescaling which is no longer a rank-1 update but which has the potential to increase vol(P ∩ B) by up to an exponential factor under certain conditions. In addition, if we take an alternate view where the cone P is left invariant and instead the underlying norm is updated, we see that this rescaling consists of adding a scalar multiple of a particular Hessian matrix to the matrix defining the norm. We also believe that this view is the right one to make potential use of the sparsity of the underlying matrix A, which would be a necessity for any practically relevant LP optimization method.
Combining Lemmas 1 and 2 gives us the following theorem:
Theorem 3.
There is an algorithm based on the perceptron algorithm that finds a point in P in time Õ(mn³ log ρ).
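For illustration only, the control flow of Algorithm 1 can be sketched as follows. The callables initial_phase and rescale (our own names) stand for the smooth perceptron phase of Lemma 1 and the rescaling step of Lemma 2 and are assumed to be supplied; this is a skeleton under those assumptions, not the paper's implementation.

    import numpy as np

    def rescaled_lp(A, initial_phase, rescale, num_phases):
        """Skeleton of Algorithm 1.

        initial_phase(A) returns ('x', x) with Ax > 0, or ('lambda', lam) with
        lam >= 0, sum(lam) == 1 and small ||lam @ A||_2 (the certificate of Lemma 1).
        rescale(A, lam) returns (A_new, pull_back), where pull_back maps a feasible
        point of the rescaled system back to the system it was derived from.
        """
        pull_backs = []
        for _ in range(num_phases):
            kind, value = initial_phase(A)
            if kind == 'x':
                x = value
                for pb in reversed(pull_backs):   # undo the rescalings, most recent first
                    x = pb(x)
                return x
            A, pb = rescale(A, value)
            pull_backs.append(pb)
        return None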
In this section we will show how we can rescale by finding a direction in which the cone is thin; see Figure 1 for a visualization. First we give the formal definition of width.
Definition 1.
Define the width of the cone P in the direction c ∈ R^n \ {0} as
WIDTH(P, c) := (1/||c||_2) · max_{x ∈ P∩B} |⟨c, x⟩|.
As described in [PS16], we will now show that stretching P in a thin enough direction increases the volume of P ∩ B by a constant factor. We reproduce the argument of [PS16] here for the sake of completeness:
Lemma 4 ([PS16]). Suppose that there is a direction c ∈ R^n \ {0} with WIDTH(P, c) ≤ 1/(3√n). Define F: R^n → R^n as the linear map with F(c) = 2c and F(x) = x for all x ⊥ c. Then vol(F(P) ∩ B) ≥ (3/2) · vol(P ∩ B).
Proof.
We may assume that ||c||_2 = 1. Since det(F) = 2, we know that vol(F(P ∩ B)) = 2 · vol(P ∩ B). Now suppose that x ∈ P ∩ B and write it as x = x' + ⟨c, x⟩ · c where x' ⊥ c. Then
||F(x)||_2² = ||x' + 2⟨c, x⟩ · c||_2² = ||x'||_2² + 4⟨c, x⟩² ≤ ||x||_2² + 3 · WIDTH(P, c)² ≤ 1 + 1/(3n),
and taking square roots gives ||F(x)||_2 ≤ √(1 + 1/(3n)) ≤ e^{1/(6n)}. In particular, we know that F(P ∩ B) ⊆ e^{1/(6n)} · (F(P) ∩ B), and so we have
vol(F(P) ∩ B) ≥ (e^{1/(6n)})^{−n} · vol(F(P ∩ B)) ≥ (3/4) · vol(F(P ∩ B)) = (3/2) · vol(P ∩ B).
Explicitly, assuming ||c||_2 = 1, Lemma 4 updates our constraint matrix to A(I − ½cc^T). In particular, we apply a rank-1 update to the constraint matrix. Given a solution x to these new constraints, a solution to the original problem can be easily recovered as (I − ½cc^T)x.
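In code, the rescaling of Lemma 4 is a one-line update of the constraint matrix. The sketch below is our own illustration: it applies F^{-1} = I − ½cc^T to the constraints and shows how a solution of the rescaled system is pulled back.

    import numpy as np

    def rescale_thin_direction(A, c):
        """Stretch the cone by a factor 2 along the thin unit direction c.

        P = {x : Ax > 0} is replaced by F(P) with F = I + c c^T, i.e. the constraint
        matrix becomes A F^{-1} = A (I - 0.5 c c^T), a rank-1 update of A.
        """
        c = c / np.linalg.norm(c)
        return A - 0.5 * np.outer(A @ c, c)

    def pull_back(x_new, c):
        """Map a feasible point of the rescaled system back to the original cone."""
        c = c / np.linalg.norm(c)
        return x_new - 0.5 * c * (c @ x_new)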
It remains to argue how one can extract a thin direction for P, given a convex combination λ so that ||λA||_2 is small. Here we will significantly improve over the bounds of [PS16], which require ||λA||_2 ≤ O(1/(m√n)). We begin with a new generic argument to obtain a thin direction:
Lemma 5. For any non-empty subset J ⊆ [m] of constraints one has
WIDTH(P, Σ_{i∈J} λ_i A_i) ≤ ||Σ_{i=1}^{m} λ_i A_i||_2 / ||Σ_{i∈J} λ_i A_i||_2.
Proof.
First, note that by the full-dimensionality of P we always have ||Σ_{i∈J} λ_i A_i||_2 > 0. By definition of width, we can write
WIDTH(P, Σ_{i∈J} λ_i A_i) = (1/||Σ_{i∈J} λ_i A_i||_2) · max_{x ∈ P∩B} ⟨Σ_{i∈J} λ_i A_i, x⟩.
Now, we know that ⟨A_i, x⟩ ≥ 0 for all x ∈ P, and so
max_{x ∈ P∩B} ⟨Σ_{i∈J} λ_i A_i, x⟩ ≤ max_{x ∈ B} ⟨Σ_{i=1}^{m} λ_i A_i, x⟩ = ||λA||_2,
and the claim is proven.
So in order to find a direction of small width, it suffices to find a subset J ⊆ [m] with ||Σ_{i∈J} λ_i A_i||_2 large. Implicitly, the choice that Peña and Soheili [PS16] make is to select J = {i} for the i ∈ [m] maximizing λ_i. This approach gives a bound of ||Σ_{i∈J} λ_i A_i||_2 ≥ 1/m. We will now prove the asymptotically optimal bound using a random hyperplane:
Lemma 6.
Let λ ∈ R^m_{≥0} be any convex combination and A ∈ R^{m×n} with ||A_i||_2 = 1 for all i. Take a random Gaussian g and set J := { i ∈ [m] : ⟨A_i, g⟩ ≥ 0 }. Then with constant probability, ||Σ_{i∈J} λ_i A_i||_2 ≥ 1/(2√(6πn)).
(This bound is asymptotically optimal: it suffices to consider the trivial example with λ_1 = ... = λ_n = 1/n and A_i = e_i the standard basis vectors; then ||Σ_{i∈J} λ_i A_i||_2 ≤ 1/√n for any subset J. The optimality of our rescaling can also be seen since the cone in the last iteration is Õ(n)-well rounded, which is optimal up to Õ-terms.)
Proof. We set v := g/||g||_2. Since v is a unit vector, we can lower bound ||Σ_{i∈J} λ_i A_i||_2 by measuring the projection on v and obtain ||Σ_{i∈J} λ_i A_i||_2 ≥ Σ_{i∈J} λ_i ⟨A_i, v⟩. By symmetry of the Gaussian (the quantity Σ_{i=1}^{m} λ_i |⟨A_i, v⟩| is invariant under replacing g by −g, and for each such pair at least one of g, −g satisfies Σ_{i∈J} λ_i ⟨A_i, v⟩ ≥ (1/2) Σ_{i=1}^{m} λ_i |⟨A_i, v⟩|), it then suffices to argue that with constant probability Σ_{i=1}^{m} λ_i |⟨A_i, v⟩| ≥ 1/√(6πn). First we will show that for an appropriate constant α ∈ (0, 1):
(1) Pr(||g||_2² ≥ 3n) ≤ 1/(2n), and
(2) Pr(Σ_{i=1}^{m} λ_i |⟨A_i, g⟩| < (1/2)√(2/π)) ≤ α.
Then, with probability at least γ := 1 − α − 1/(2n), we have Σ_{i=1}^{m} λ_i |⟨A_i, v⟩| ≥ (1/2)√(2/π) / √(3n) = 1/√(6πn).
For (1), notice that ||g||_2² follows the chi-squared distribution with n degrees of freedom, and so it has mean n and variance 2n. Therefore Chebyshev's inequality tells us that Pr[||g||_2² ≥ 3n] ≤ 1/(2n). Now, for all i, ⟨A_i, g⟩ is a normal random variable with mean 0 and variance 1, and so the expectation of its absolute value is √(2/π). Summing these up gives E[Σ_{i=1}^{m} λ_i |⟨A_i, g⟩|] = √(2/π). Moreover, Σ_{i=1}^{m} λ_i |⟨A_i, g⟩| is Lipschitz in g with Lipschitz constant 1 (recall that a function F: R^n → R is Lipschitz with constant 1 if |F(x) − F(y)| ≤ ||x − y||_2 for all x, y ∈ R^n), and so by a concentration inequality of Sudakov, Tsirelson and Borell,
Pr(Σ_{i=1}^{m} λ_i |⟨A_i, g⟩| < √(2/π) − t) ≤ e^{−2t²/π²}.
Letting t = (1/2)√(2/π) gives (2). By a union bound, the probability that either of these events fails is at most α + 1/(2n), and so with probability at least γ neither occurs, which gives us the claim.
While the proof is probabilistic, one can use the method of conditional expectations to derandomize the sampling [ASE92]. More concretely, consider the function F(g) := Σ_{i=1}^{m} λ_i |⟨A_i, g⟩| − ||g||_2/(2√n). The proof of Lemma 6 implies that the expectation of this function is at least Ω(1). Then we can find a desired vector g = (g_1, ..., g_n) by choosing the coordinates one after the other so that the conditional expectation does not decrease.
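A minimal sketch of the sampling step behind Lemma 6, assuming unit-norm rows and a convex combination λ; the acceptance threshold constant and the retry budget are our own illustrative choices, since only the order of magnitude Θ(1/√n) matters.

    import numpy as np

    def thin_direction(A, lam, const=0.1, max_tries=100):
        """Sample a random hyperplane and keep the constraints on its positive side.

        Returns a candidate thin direction c = sum_{i in J} lam_i A_i once its norm
        is at least const/sqrt(n); by Lemma 6 this succeeds with constant probability
        per try, so the expected number of tries is constant.
        """
        m, n = A.shape
        for _ in range(max_tries):
            g = np.random.randn(n)
            J = (A @ g) >= 0
            c = lam[J] @ A[J]             # = sum_{i in J} lam_i A_i
            if np.linalg.norm(c) >= const / np.sqrt(n):
                return c
        return None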
We are now ready to prove Lemma 2, which we restate here with explicit constants.
Lemma 7. Suppose λ ∈ R^m_{≥0} with ||λ||_1 = 1 and ||λA||_2 ≤ 1/(6√(6π) · n). Then in expected time O(mn) we can rescale P so that vol(P ∩ B) increases by a constant factor.
Proof. Computing a random Gaussian and checking whether it satisfies the conditions of Lemma 6 takes time O(mn). Since the conditions are satisfied with constant probability, the expected number of times we must do this is constant. Once the conditions are satisfied, finding a thin direction and rescaling can be done in time O(mn). Lemmas 4 and 5 guarantee that we get a constant increase in the volume.
We now introduce an alternate linear transformation we can use to rescale. This is no longer a rank-1 update, but it is inherently deterministic, along with other nice properties. For one thing, although we only guarantee a constant improvement in the volume, under certain circumstances the rescaling can improve the volume by an exponential factor. This transformation will also take a nice form when we change the view to consider rescaling the unit ball rather than the feasible region.
Lemma 8. Suppose λ ∈ R^m_{≥0} with ||λ||_1 = 1 and ||λA||_2 ≤ 1/(20n). Let M denote the matrix Σ_{i=1}^{m} λ_i A_i A_i^T and suppose 0 ≤ α ≤ 1/δ_max, where δ_max := ||M||_op denotes the maximal eigenvalue of M. Define F(x) := (I + αM)x. Then vol(F(P) ∩ B) ≥ e^{α/5} · vol(P ∩ B).
Proof. First notice that M is symmetric positive semi-definite with trace 1. Therefore the eigenvalues of I + αM take the form 1 + αδ_i, where 0 ≤ δ_i ≤ 1/α and Σ_{i=1}^{n} δ_i = 1. Note that since αδ_i ≤ 1, we can lower bound the eigenvalues by 1 + αδ_i ≥ e^{αδ_i/2}. Therefore
det(I + αM) ≥ Π_{i=1}^{n} e^{αδ_i/2} = exp((α/2) Σ_{i=1}^{n} δ_i) = e^{α/2}.
In particular, det(F) ≥ e^{α/4}.
So far we have shown that vol(F(P ∩ B)) is significantly larger than vol(P ∩ B). However, the desired bound is on vol(F(P) ∩ B), and so we need to ensure that we do not lose too much of the volume when we intersect with the unit ball. It turns out the bound on ||λA||_2 allows us to do precisely this. For any x ∈ P ∩ B we get
||F(x)||_2 ≤ ||x||_2 + α ||Mx||_2 ≤ ||x||_2 + α Σ_{i=1}^{m} λ_i ⟨A_i, x⟩ ≤ 1 + α ⟨λA, x⟩ ≤ 1 + α ||λA||_2 ≤ 1 + α/(20n),
using that ⟨A_i, x⟩ ≥ 0 and ||A_i||_2 = 1 for all i. The point is that every element of F(P ∩ B) has length at most 1 + α/(20n), and so intersecting with the unit ball will not lose more volume than shrinking by a factor of 1 + α/(20n). In particular, the volume decreases by at most (1 + α/(20n))^{−n} ≥ e^{−α/20}, and so we have
vol(F(P) ∩ B) ≥ e^{−α/20} · vol(F(P ∩ B)) ≥ e^{−α/20} · e^{α/4} · vol(P ∩ B) ≥ e^{α/5} · vol(P ∩ B).
Note that one always has δ_max ≤ 1, so we may choose α ≥ 1. Therefore if ||λA||_2 ≤ 1/(20n), we get a constant improvement in vol(P ∩ B). In fact, if the eigenvalues of M happen to be small, we could get up to an exponential improvement. This computation can be carried out in time O(mn²), and so Lemma 8 also proves Lemma 2 and hence Theorem 3.
Obviously, instead of applying a linear transformation to the cone P itself, there is an equivalent view where instead one applies a linear transformation to the unit ball. We will now switch the view in the sense that we fix the cone P, but we update the norm in each rescaling step so that the unit ball becomes more representative of P. Recall that a symmetric positive definite matrix H ∈ R^{n×n} induces a norm ||x||_H := √(x^T H x). Note that H^{−1} is also a symmetric positive definite matrix and ||·||_{H^{−1}} is the dual norm of ||·||_H. (An easy way to see this is to write H = Σ_{j=1}^{n} µ_j u_j u_j^T as the eigendecomposition of H. Then H^{−1} = Σ_{j=1}^{n} (1/µ_j) u_j u_j^T, which has the same eigenvectors as H and positive eigenvalues 1/µ_j.) In this view we assume the rows A_i of A are normalized so that ||A_i||_{H^{−1}} = 1.
1} be the unit ball for the norm k · k H . Note that B H is always an ellipsoid .We will measure progress in terms of the fraction of the ellipsoid B H that lies in the cone P , namely µ ( H ) : = vol( B H ∩ P )vol( B H ) . The goal of the rescaling step will then be to increase µ ( H ) by a constant factor. Notethat we initially have µ ( H ) = µ ( I ) ≥ ρ n , and at any time 0 ≤ µ ( H ) ≤
1, so we can rescale at most O ( n log ρ )times.In this view, Lemma 8 takes the following form: Lemma 9.
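As a small illustration of the norm view, the row normalization ||A_i||_{H^{-1}} = 1 can be carried out as follows; the helper name is ours and the computation simply evaluates the dual norm of each row.

    import numpy as np

    def normalize_rows_H(A, H):
        """Scale each row so that ||A_i||_{H^{-1}} = 1 (this does not change the cone P)."""
        Hinv_At = np.linalg.solve(H, A.T)                      # columns are H^{-1} A_i
        norms = np.sqrt(np.einsum('ij,ji->i', A, Hinv_At))     # ||A_i||_{H^{-1}} = sqrt(A_i^T H^{-1} A_i)
        return A / norms[:, None]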
In this view, Lemma 8 takes the following form:
Lemma 9. Let H ∈ R^{n×n} be symmetric with H ≻ 0. Suppose λ ∈ R^m_{≥0} with ||λ||_1 = 1 and ||λA||_{H^{−1}} ≤ 1/(20n), and let M := Σ_{i=1}^{m} λ_i A_i A_i^T. Let 0 ≤ α ≤ 1/δ_max, where δ_max := ||H^{−1}M||_op. Then for H̃ := H + αM one has µ(H̃) ≥ e^{α/5} · µ(H).
Algorithm 2 illustrates what the multi-rank rescaling looks like under the alternate view. Notice that the algorithm updates the norm matrix by adding a scalar multiple of the Hessian matrix of the MWU potential function discussed in Section 3. Moreover, throughout the algorithm our matrix H will have the form I + Σ_{i=1}^{m} h_i A_i A_i^T for some h_i ≥ 0.
Note that this allows a fairly compact representation, as we only need O(m) space to encode the coefficients h_i that define the norm matrix.
Algorithm 2
FOR Õ(n log ρ) phases DO:
(1) Initial phase: Either find x ∈ P or give λ ≥ 0 with ||λ||_1 = 1 and ||λA||_{H^{−1}} ≤ O(1/n).
(2) Rescaling phase: Update H := H + αM, where M = Σ_{i=1}^{m} λ_i A_i A_i^T.
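A sketch of the rescaling step of Algorithm 2 in the norm view, written as our own illustration: the norm matrix is updated by a multiple of M = Σ_i λ_i A_i A_i^T, with the step size α chosen as in Lemma 9.

    import numpy as np

    def rescale_norm(H, A, lam):
        """One multi-rank rescaling step: H <- H + alpha * M with alpha = 1/delta_max.

        M = sum_i lam_i A_i A_i^T, and delta_max is the largest eigenvalue of
        H^{-1/2} M H^{-1/2} (equivalently of H^{-1} M), so that alpha*M stays dominated by H.
        """
        M = A.T @ (lam[:, None] * A)
        L = np.linalg.cholesky(H)                   # H = L L^T
        Linv_M = np.linalg.solve(L, M)              # L^{-1} M
        N = np.linalg.solve(L, Linv_M.T).T          # N = L^{-1} M L^{-T}, symmetric
        delta_max = np.linalg.eigvalsh(N).max()
        alpha = 1.0 / delta_max
        return H + alpha * M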
In this section we show that the same rescaling methods can be used to make the MWU method into a polynomial time algorithm for linear programming. Recall that the MWU algorithm corresponds to gradient descent on a particular potential function. First we show how we can apply rescaling to the standard gradient descent approach. We then introduce a modified gradient descent which speeds up the MWU phase. Combining this with our rescaling step above gives us the following result:
Theorem 10. There is an algorithm based on the MWU method that finds a point in P in time Õ(mn^{ω+1} log ρ), where ω ≈ 2.37 is the exponent of matrix multiplication.
Consider the potential function Φ(x) := Σ_{i=1}^{m} e^{−⟨A_i, x⟩}, where ||A_i||_2 = 1 for each row A_i. Notice that Φ(0) = m and that if Φ(x) < 1, then ⟨A_i, x⟩ > 0 for all i, and hence x ∈ P. In this section we analyze standard gradient descent on Φ, starting at the origin. Notice that the gradient takes the form
∇Φ(x) = − Σ_{i=1}^{m} e^{−⟨A_i, x⟩} A_i.
If we let λ_i := e^{−⟨A_i, x⟩}/Φ(x), we see that ||λ||_1 = 1 and λA = −∇Φ(x)/Φ(x). In particular, if at any iteration this vector has small Euclidean norm, then we will be able to rescale. It remains to show, therefore, that if this vector has large Euclidean norm, then we get a sufficient decrease in the potential function.
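The standard gradient-descent phase just described can be sketched as follows; the step x ← x + y/2 is the one analyzed in Lemma 11 below, and the threshold parameter beta and the iteration budget are illustrative choices of ours rather than values fixed by the paper.

    import numpy as np

    def mwu_phase(A, beta=1.0, max_iters=100000):
        """Gradient descent on Phi(x) = sum_i exp(-<A_i, x>), assuming unit-norm rows.

        Returns ('x', x) once Phi(x) < 1 (then Ax > 0), or ('lambda', lam) once the
        vector y = sum_i lam_i A_i has ||y||_2 <= beta/n, the certificate for rescaling.
        """
        m, n = A.shape
        x = np.zeros(n)
        for _ in range(max_iters):
            w = np.exp(-(A @ x))          # expert weights
            phi = w.sum()
            if phi < 1.0:
                return ('x', x)
            lam = w / phi                 # convex combination, lam_i = w_i / Phi(x)
            y = lam @ A                   # y = -grad Phi(x) / Phi(x)
            if np.linalg.norm(y) <= beta / n:
                return ('lambda', lam)
            x = x + 0.5 * y               # the step analyzed in Lemma 11
        return None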
Lemma 11. Suppose x ∈ R^n and abbreviate y := −∇Φ(x)/Φ(x). Then Φ(x + y/2) ≤ Φ(x) · e^{−||y||_2²/4}.
Proof. First note that since ||λ||_1 = 1 and ||A_i||_2 = 1 for all i, we know that ||y||_2 ≤ 1 and hence |⟨A_i, y/2⟩| ≤ 1/2 for all i. In our analysis we will also use the fact that for any z ∈ R with |z| ≤ 1 one has e^z ≤ 1 + z + z². We obtain the following:
Φ(x + y/2) = Σ_{i=1}^{m} e^{−⟨A_i, x⟩} e^{−⟨A_i, y⟩/2} ≤ Σ_{i=1}^{m} e^{−⟨A_i, x⟩} (1 − ⟨A_i, y⟩/2 + ⟨A_i, y⟩²/4) = Φ(x) · Σ_{i=1}^{m} λ_i (1 − ⟨A_i, y⟩/2 + ⟨A_i, y⟩²/4) ≤ Φ(x) · (1 − ||y||_2²/2 + ||y||_2²/4) = Φ(x) · (1 − ||y||_2²/4) ≤ Φ(x) · e^{−||y||_2²/4},
where we used Σ_{i=1}^{m} λ_i ⟨A_i, y⟩ = ||y||_2² and ⟨A_i, y⟩² ≤ ||y||_2².
Thus as long as ||y||_2 ≥ Ω(1/n), gradient descent will decrease the potential function by a factor of e^{−Θ(1/n²)} in each iteration, and so in at most O(n² ln(m)) iterations we arrive at a point x with Φ(x) < 1. (Consider a single phase of the algorithm without rescaling: there exists x* ∈ P with ||x*||_2 = 1 and B(x*, 1/ρ) ⊆ P, where B(x*, r) := { x ∈ R^n | ||x* − x||_2 ≤ r }. Then ||λA||_2 = ||λA||_2 · ||x*||_2 ≥ ⟨λA, x*⟩ = Σ_{i=1}^{m} λ_i ⟨A_i, x*⟩ ≥ 1/ρ, since ⟨A_i, x*⟩ ≥ 1/ρ for all i. Therefore the algorithm is guaranteed to find a feasible point in O(ρ² ln(m)) iterations even without rescaling. This argument is closely related to the classical analysis of the perceptron.)
With ∆ = Θ(1/n), the standard gradient descent approach implements the initial phase of Algorithm 1 in Õ(n²) iterations. It turns out we can get the same guarantee in Õ(n) iterations by choosing a more sophisticated update direction. While we do not know how to guarantee an update direction that decreases Φ(x) by a factor of more than e^{−Θ(||y||²)}, we are able to find a direction so that the product of Φ(x) and ||∇Φ(x)|| decreases a lot faster. Note that in the following we will work with a general norm, so that the results can be applied directly to either Algorithm 1 (with H = I) or Algorithm 2. We assume now that ||A_i||_{H^{−1}} = 1 for all i.
Theorem 12.
Suppose H ≻ 0 and ||λA||_{H^{−1}} ≥ β/n, where β > 0 is an arbitrary constant. Then in time O(mn^{ω−1}) we can find ε > 0 and p ∈ R^n so that
||∇Φ(x + εp)||_{H^{−1}} · Φ(x + εp) ≤ ||∇Φ(x)||_{H^{−1}} · Φ(x) · e^{−Θ̃(1/n)}.
Before going through the proof, we note that the update step of Theorem 12 yields a MWU phase that runs in time Õ(mn^ω). In particular, this gives the running time guarantee of Theorem 10.
Lemma 13. Suppose H ≻ 0, and let β > 0 be an arbitrary constant. Then in time Õ(mn^ω) we can run a MWU phase which either finds x ∈ P or gives λ ∈ R^m_{≥0} with ||λ||_1 = 1 and ||λA||_{H^{−1}} ≤ β/n.
Proof. Let λ ≥ 0 with ||λ||_1 = 1 and λA = −∇Φ(x)/Φ(x). Then as long as ||λA||_{H^{−1}} ≥ β/n, Theorem 12 says that the quantity ||∇Φ(x)||_{H^{−1}} · Φ(x) decreases by a factor of e^{−Θ̃(1/n)} per iteration. Then in Õ(n) iterations we will have ||∇Φ(x)||_{H^{−1}} · Φ(x) ≤ β/n, which implies that either Φ(x) < 1 or ||λA||_{H^{−1}} ≤ β/n.
The remainder of this section will be devoted to the proof of Theorem 12. We begin by establishing some useful notation. For any symmetric positive definite matrix H ≻ 0 we write ⟨x, y⟩_H := x^T H y. Without any subscript, ⟨x, y⟩ = x^T y will continue to denote the canonical inner product. Given x ∈ R^n, define λ_i := e^{−⟨A_i, x⟩}/Φ(x), y := −∇Φ(x)/Φ(x) = Σ_{i=1}^{m} λ_i A_i and M := ∇²Φ(x)/Φ(x) = Σ_{i=1}^{m} λ_i A_i A_i^T. Even though all three depend on x, we will not denote that here to keep the notation clean. To prove Theorem 12, we first show how Φ(x) decreases as we take steps in an arbitrary direction p.
Lemma 14.
For any 0 < ε ≤ 1 and p ∈ R^n with ||p||_H ≤ 1, we have Φ(x + εp) ≤ Φ(x) · (1 − ε⟨y, p⟩ + ε² p^T M p).
Proof. Notice that since ||p||_H ≤ 1 and ||A_i||_{H^{−1}} = 1, we have |⟨A_i, εp⟩| ≤ 1 for all i. Hence
Φ(x + εp) = Σ_{i=1}^{m} e^{−⟨A_i, x⟩} e^{−ε⟨A_i, p⟩} ≤(*) Σ_{i=1}^{m} e^{−⟨A_i, x⟩} (1 − ε⟨A_i, p⟩ + ε²⟨A_i, p⟩²) = Φ(x) · Σ_{i=1}^{m} λ_i (1 − ε⟨A_i, p⟩ + ε² p^T A_i A_i^T p) = Φ(x) · (1 − ε⟨y, p⟩ + ε² p^T M p).
In (*) we use the estimate that for any z ∈ R with |z| ≤ 1, e^z ≤ 1 + z + z².
In a similar way, we bound ||∇Φ(x)||_{H^{−1}} after an update step in an arbitrary direction p.
Lemma 15.
Suppose p ∈ R^n with ||p||_H ≤ 1 and 0 < ε ≤ 1. Then we have
||∇Φ(x + εp)||_{H^{−1}} ≤ ||∇Φ(x)||_{H^{−1}} · ( 1 + (ε²/||y||_{H^{−1}}) p^T M p + (1/||y||_{H^{−1}}²)(−ε⟨y, Mp⟩_{H^{−1}} + ε²||Mp||_{H^{−1}}²) ).
Proof. For any z ∈ R with |z| ≤ 1 we have e^z = 1 + z + ηz² for some η ∈ R with |η| ≤ 1. In particular, since ||A_i||_{H^{−1}} = 1 and ||p||_H ≤ 1, we have |⟨A_i, εp⟩| ≤ 1, and hence e^{−ε⟨A_i, p⟩} = 1 − ε⟨A_i, p⟩ + η_i ε²⟨A_i, p⟩² with |η_i| ≤ 1 for each i. Therefore
(1/Φ(x)) ||∇Φ(x + εp)||_{H^{−1}} = || Σ_{i=1}^{m} (−A_i) λ_i (1 − ε⟨A_i, p⟩ + ε² η_i ⟨A_i, p⟩²) ||_{H^{−1}}
≤ || Σ_{i=1}^{m} (−A_i) λ_i (1 − ε⟨A_i, p⟩) ||_{H^{−1}} + ε² || Σ_{i=1}^{m} (−A_i) λ_i η_i ⟨A_i, p⟩² ||_{H^{−1}}
≤ || y − ε M p ||_{H^{−1}} + ε² Σ_{i=1}^{m} λ_i ⟨A_i, p⟩²    (using ||A_i||_{H^{−1}} = 1 and |η_i| ≤ 1)
= || y − ε M p ||_{H^{−1}} + ε² p^T M p.
Moreover, ||y − εMp||_{H^{−1}}² = ||y||_{H^{−1}}² − 2ε⟨y, Mp⟩_{H^{−1}} + ε²||Mp||_{H^{−1}}², and hence, using √(1 + u) ≤ 1 + u/2,
||y − εMp||_{H^{−1}} ≤ ||y||_{H^{−1}} · ( 1 + (1/||y||_{H^{−1}}²)(−ε⟨y, Mp⟩_{H^{−1}} + ε²||Mp||_{H^{−1}}²) ).
Recalling that ∇Φ(x) = −Φ(x) · y, and hence ||∇Φ(x)||_{H^{−1}} = Φ(x) · ||y||_{H^{−1}}, finishes the proof.
Using Lemmas 14 and 15, we can show a sufficient condition for p to satisfy Theorem 12.
Lemma 16.
Suppose p ∈ R^n with ||p||_H ≤ 1 and that a constant a > 0 is such that either
1. ⟨y, p⟩ ≥ ||y||_{H^{−1}}/(log n)^a and ⟨y, Mp⟩_{H^{−1}} ≥ ||Mp||_{H^{−1}} · ||y||_{H^{−1}}/(log n)^a, or
2. ⟨y, p⟩ ≥ ||y||_{H^{−1}}/(log n)^a and ||Mp||_{H^{−1}} ≤ O(1/poly(n)).
Then as long as ||y||_{H^{−1}} ≥ β/n, choosing ε := c · min{ ||y||_{H^{−1}}/((log n)^a ||Mp||_{H^{−1}}), 1/(log n)^a } for a sufficiently small constant c > 0 gives
||∇Φ(x + εp)||_{H^{−1}} · Φ(x + εp) ≤ ||∇Φ(x)||_{H^{−1}} · Φ(x) · e^{−Θ̃(1/n)}.
Proof. Let ε be as above. Then by Lemma 14, and using p^T M p ≤ ||p||_H ||Mp||_{H^{−1}} ≤ ||Mp||_{H^{−1}}, we have
Φ(x + εp) ≤ Φ(x) · (1 − ε⟨y, p⟩ + ε² p^T M p) ≤ Φ(x) · (1 − ε ||y||_{H^{−1}}/(log n)^a + ε² ||Mp||_{H^{−1}}).
Assume first that we are in Case 1. By Lemma 15, since ε ≤ 1/(log n)^a, we have
||∇Φ(x + εp)||_{H^{−1}} ≤ ||∇Φ(x)||_{H^{−1}} · ( 1 − ε ||Mp||_{H^{−1}}/(||y||_{H^{−1}} (log n)^a) + ε² ||Mp||_{H^{−1}}/||y||_{H^{−1}} + ε² ||Mp||_{H^{−1}}²/||y||_{H^{−1}}² ).
If ε = c/(log n)^a, then ||Mp||_{H^{−1}} ≤ ||y||_{H^{−1}}. Using this, Φ(x) decreases by a factor of e^{−Θ̃(||y||_{H^{−1}})}, and ||∇Φ(x)||_{H^{−1}} does not increase. On the other hand, if ε = c||y||_{H^{−1}}/((log n)^a ||Mp||_{H^{−1}}), then Φ(x) does not increase, and ||∇Φ(x)||_{H^{−1}} decreases by a factor of e^{−Θ̃(1)}. Together, since ||y||_{H^{−1}} ≤ 1, these show that in either case the product decreases by a factor of e^{−Θ̃(||y||_{H^{−1}})}.
If we are in Case 2, the only thing that might change is that when ε = c/(log n)^a we might have ||∇Φ(x)||_{H^{−1}} increase by a factor of up to e^{O(1/poly(n))}. Since ||y||_{H^{−1}} ≥ β/n, the decrease of Φ(x) is the dominating term, and so we still get the appropriate decrease of the product. In both cases, ||y||_{H^{−1}} ≥ β/n yields the claimed factor e^{−Θ̃(1/n)}.
Notice that the conditions of Lemma 16 essentially say that both p and Mp are close in angle to the vector y. In particular, if the gradient happened to be an eigenvector of the Hessian (and hence y an eigenvector of M), then Lemma 16 would be satisfied with p = H^{−1}y/||y||_{H^{−1}} (for H = I, simply p = y/||y||_2). With this in mind, the idea for computing such a direction p is to project y onto an appropriate eigenspace of M. To this end, we first prove the following general statement about computing approximate eigenvectors of matrices.
Lemma 17.
Suppose z ∈ R^n is a unit vector, N a positive semi-definite matrix with no eigenvalue bigger than 1, and K > 0 a given parameter. For k = 1, ..., K, define z_k := (I − N)^{2^k} z. Then ||z_k||_2 ≤ 1 for all k, and for an appropriate constant C > 0 at least one of the following must hold:
1. There exists k ≤ K with ⟨z, z_k⟩ ≥ C/K and ⟨z, N z_k⟩ ≥ C ||N z_k||_2 / K, or
2. For k = K, we have ⟨z, z_K⟩ ≥ C and ||N z_K||_2 ≤ 2^{−K}.
Proof. First note that we may assume that no eigenvalue of N is bigger than 1/2, since we can always replace N by N/2 and only lose a factor of 2 in the bounds. Suppose now that the eigenvectors of N are unit vectors v_1, ..., v_n with eigenvalues α_1, ..., α_n ∈ [0, 1/2]. We see that
z_k = Σ_{j=1}^{n} (1 − α_j)^{2^k} ⟨z, v_j⟩ v_j,    N z_k = Σ_{j=1}^{n} α_j (1 − α_j)^{2^k} ⟨z, v_j⟩ v_j,    ⟨z_k, z⟩ = Σ_{j=1}^{n} (1 − α_j)^{2^k} ⟨z, v_j⟩².
For any k, we can get the following bounds:
• For any α_j, we have α_j (1 − α_j)^{2^k} ≤ α_j e^{−α_j 2^k} ≤ 2^{−k}, since t·e^{−ts} ≤ 1/(e·s) for all t ≥ 0 and s > 0. Therefore we can conclude that ||N z_k||_2 ≤ 2^{−k}.
• Whenever α_j ≤ 2^{−k}, we have (1 − α_j)^{2^k} ≥ e^{−2α_j 2^k} ≥ e^{−2} (using 1 − t ≥ e^{−2t} for t ≤ 1/2), and all other coefficients are nonnegative. Therefore ⟨z, z_k⟩ ≥ e^{−2} Σ_{α_j ≤ 2^{−k}} ⟨z, v_j⟩².
Since Σ_{j=1}^{n} ⟨z, v_j⟩² = ||z||_2² = 1, either there exists k ≤ K so that Σ_{α_j ∈ (2^{−(k+1)}, 2^{−k}]} ⟨z, v_j⟩² ≥ 1/(2K), or else we must have Σ_{α_j ≤ 2^{−K}} ⟨z, v_j⟩² ≥ 1/2, which implies ⟨z, z_K⟩ ≥ e^{−2}/2. In the latter case we are done, since we already showed that ||N z_K||_2 ≤ 2^{−K}.
Otherwise, choose such a k, and notice that whenever α_j ∈ (2^{−(k+1)}, 2^{−k}], we have α_j (1 − α_j)^{2^k} ≥ α_j e^{−2α_j 2^k} ≥ e^{−2} 2^{−(k+1)}, and all other coefficients are nonnegative. Therefore ⟨z, N z_k⟩ = Σ_{j=1}^{n} α_j (1 − α_j)^{2^k} ⟨z, v_j⟩² ≥ e^{−2} 2^{−(k+1)}/(2K), and so in particular ⟨z, N z_k⟩ ≥ ||N z_k||_2/(4e²K), as desired.
Finally, using Lemma 17, we show that we can efficiently compute a vector p satisfying Lemma 16, and hence complete the proof of Theorem 12.
Lemma 18.
In time Õ(mn^{ω−1}) we can find p ∈ R^n satisfying the hypotheses of Lemma 16.
Proof. Set K := 10 log n, z := H^{−1/2} y/||y||_{H^{−1}} and N := H^{−1/2} M H^{−1/2} in Lemma 17 to obtain the output z_k. Note that z is a unit vector and that N is positive semi-definite with trace(N) = Σ_{i=1}^{m} λ_i ||A_i||_{H^{−1}}² = 1, so no eigenvalue of N exceeds 1. Let p := H^{−1/2} z_k and notice that
1. ||p||_H = ||z_k||_2,
2. ||Mp||_{H^{−1}} = ||N z_k||_2,
3. ⟨y/||y||_{H^{−1}}, p⟩ = ⟨z, z_k⟩,
4. ⟨y/||y||_{H^{−1}}, Mp⟩_{H^{−1}} = ⟨z, N z_k⟩.
In particular, rearranging the statement of Lemma 17 (with K = 10 log n and, say, a = 2), this p satisfies the hypotheses of Lemma 16. Finally, note that computing M takes time O(mn^{ω−1}), the powers (I − N)^{2^k} can be formed by repeated squaring, and all other matrix operations can be computed in time O(n^ω). Since we perform at most O(log n) squaring steps, the running time is Õ(mn^{ω−1}), as desired.
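A sketch of the direction computation of Lemma 18, written as our own illustration: the powers (I − N)^{2^k} are built by repeated squaring and the first index k passing the test of Lemma 17 is used. The constant C, the halving of N (which follows the proof of Lemma 17 and only affects constants), and the use of a dense eigendecomposition of H are illustrative choices.

    import numpy as np

    def update_direction(H, M, y, C=0.05):
        """Compute a projected-gradient direction p in the spirit of Lemma 18."""
        n = H.shape[0]
        K = max(1, int(np.ceil(10 * np.log2(n))))
        vals, vecs = np.linalg.eigh(H)                         # H symmetric positive definite
        H_inv_sqrt = vecs @ np.diag(1.0 / np.sqrt(vals)) @ vecs.T
        N = 0.5 * (H_inv_sqrt @ M @ H_inv_sqrt)                # halve so no eigenvalue exceeds 1/2
        z = H_inv_sqrt @ y
        z = z / np.linalg.norm(z)
        power = np.eye(n) - N                                  # (I - N)^(2^0); squared below
        best = None
        for k in range(1, K + 1):
            power = power @ power                              # now equals (I - N)^(2^k)
            z_k = power @ z
            Nz_k = N @ z_k
            if z @ z_k >= C / K and z @ Nz_k >= C * np.linalg.norm(Nz_k) / K:
                best = z_k                                     # case 1 of Lemma 17
                break
        if best is None:
            best = z_k                                         # case 2 of Lemma 17 (k = K)
        return H_inv_sqrt @ best                               # p = H^{-1/2} z_k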
It turns out that our algorithm implicitly computes an approximate John ellipsoid for the considered cone P, which gives us geometric insight into P. Recall that a classical theorem of John [Joh48] shows that for any closed, convex set Q ⊆ R^n there is an ellipsoid E and a center z so that z + E ⊆ Q ⊆ z + nE. The bound of n is tight in general (for example for a simplex), but it can be improved to √n for symmetric sets. This is equivalent to saying that for each convex body there is a linear transformation that makes it n-well rounded. Here, a body Q is α-well rounded if z + r·B ⊆ Q ⊆ z + α·r·B for some center z ∈ R^n and radius r > 0. See the excellent survey of Ball [Bal97] on this topic.
A summary of our full MWU algorithm with rescaling is given in Algorithm 3. We will prove here that, after a minor modification of the algorithm, the set P ∩ B will be well rounded when the algorithm terminates.
Algorithm 3
FOR O(n log ρ) phases DO:
• MWU phase:
(1) Normalize ||A_i||_2 = 1 for all i, and set x^{(0)} := 0.
(2) FOR t := 0, ..., T DO:
(3) Set λ_i^{(t)} := exp(−⟨A_i, x^{(t)}⟩)/Φ(x^{(t)}).
(4) IF Φ(x^{(t)}) < 1 THEN return x^{(t)} ∈ P.
(5) IF ||λ^{(t)} A||_2 ≤ β/n THEN GOTO Rescaling phase.
(6) Select an update vector p^{(t)} ∈ R^n with ||p^{(t)}||_2 ≤ 1 and a step size 0 < ε_t ≤ 1/2, and set x^{(t+1)} := x^{(t)} + ε_t p^{(t)}.
• Rescaling phase:
(1) Compute an invertible linear transformation F so that vol(F(P) ∩ B) is a constant factor larger than vol(P ∩ B). Replace P by F(P).
Lemma 19.
Consider Algorithm 3 with the modification that the MWU phase terminates in step (4) only if Φ(x) < e^{−1}. Then in the final iteration, P ∩ B is Õ(n)-well rounded.
Proof. Let us consider the last phase of the algorithm and let x^{(0)}, ..., x^{(T)} be the computed sequence of points with Φ(x^{(T)}) < e^{−1} and T ≤ Õ(n). Then e^{−⟨A_i, x^{(T)}⟩} < e^{−1} and hence ⟨A_i, x^{(T)}⟩ ≥ 1 for all i. The step size of the algorithm is always bounded by 1/2, hence ||x^{(T)}||_2 ≤ T. Now define z := x^{(T)}/(2T) as center. Then ||z||_2 ≤ 1/2 and ⟨A_i, z⟩ ≥ 1/(2T) for all i. Hence B(z, 1/(2T)) ⊆ P ∩ B ⊆ B(z, 2), which shows that P ∩ B is O(T)-well rounded.
Note that running the algorithm until Φ(x) < e^{−1} only increases the worst case running times by a constant factor. Alternatively, one can run the algorithm with standard gradient descent and a fixed step size of ε := Θ(1/n) and only terminate when Φ(x) < 1/m. This increases the running time by up to a factor of n, but the final set P ∩ B will be O(n)-well rounded, thus removing the logarithmic terms suppressed by the Õ notation. On the other hand, no linear transformation can make the conic hull of a simplex o(n)-well rounded, hence our obtained bound is asymptotically optimal. Note that to obtain the tight factor for well-roundedness it was crucial to have the optimal rescaling threshold of ∆ = Θ(1/n).
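The center-and-radius computation in the proof of Lemma 19 is elementary; the following sketch (our own, assuming unit-norm rows) extracts the certified center and reports inner and outer radii for P ∩ B.

    import numpy as np

    def roundedness_certificate(A, x_T, T):
        """Given the final iterate x_T with <A_i, x_T> >= 1 for all i and ||x_T|| <= T,
        return a center z and radii (r_in, r_out) with B(z, r_in) in P ∩ B in B(z, r_out)."""
        z = x_T / (2.0 * T)
        r_in = min((A @ z).min(), 1.0 - np.linalg.norm(z))   # distance to constraints and to the sphere
        r_out = 1.0 + np.linalg.norm(z)                      # P ∩ B ⊆ B(0,1) ⊆ B(z, 1 + ||z||)
        return z, r_in, r_out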
Independent publication. The multi-rank rescaling was also discovered in a parallel and independent work by Dadush, Végh and Zambelli [DVZ16b] (see their Algorithm 5).
References
[Agm54] S. Agmon. The relaxation method for linear inequalities. Canadian Journal of Mathematics, 6:382–392, 1954.
[AHK05] S. Arora, E. Hazan, and S. Kale. Fast algorithms for approximate semidefinite programming using the multiplicative weights update method. In Proceedings of the 46th Annual IEEE Symposium on Foundations of Computer Science (FOCS), pages 339–348, 2005.
[AHK12] S. Arora, E. Hazan, and S. Kale. The multiplicative weights update method: a meta-algorithm and applications. Theory Comput., 8:121–164, 2012.
[ASE92] N. Alon, J. H. Spencer, and P. Erdős. The Probabilistic Method. Wiley-Interscience Series in Discrete Mathematics and Optimization. J. Wiley & Sons, New York, Chichester, Brisbane, 1992.
[Bal97] K. Ball. An elementary introduction to modern convex geometry. In Flavors of Geometry, pages 1–58. Cambridge University Press, 1997.
[CCZ14] M. Conforti, G. Cornuéjols, and G. Zambelli. Integer Programming. Springer Publishing Company, Incorporated, 2014.
[CKM+11] P. Christiano, J. A. Kelner, A. Madry, D. A. Spielman, and S. Teng. Electrical flows, Laplacian systems, and faster approximation of maximum flow in undirected graphs. In Proceedings of the Forty-third Annual ACM Symposium on Theory of Computing, STOC '11, pages 273–282, New York, NY, USA, 2011. ACM.
[Dan51] G. B. Dantzig. Maximization of a linear function of variables subject to linear inequalities. In Activity Analysis of Production and Allocation, Cowles Commission Monograph No. 13, pages 339–347. John Wiley & Sons, Inc., New York, N.Y.; Chapman & Hall, Ltd., London, 1951.
[DV06] J. Dunagan and S. Vempala. A simple polynomial-time rescaling algorithm for solving linear programs. Math. Program., 114(1):101–114, 2006.
[DVZ16a] D. Dadush, L. A. Végh, and G. Zambelli. Rescaled coordinate descent methods for linear programming. In Proceedings of the 18th International Conference on Integer Programming and Combinatorial Optimization (IPCO 2016), volume 9682 of LNCS, pages 26–37. Springer, 2016.
[DVZ16b] D. Dadush, L. A. Végh, and G. Zambelli. Rescaling algorithms for linear programming – part I: conic feasibility. CoRR, abs/1611.06427, 2016.
[GK07] N. Garg and J. Könemann. Faster and simpler algorithms for multicommodity flow and other fractional packing problems. SIAM J. Comput., 37(2):630–652, 2007.
[Hač79] L. G. Hačijan. A polynomial algorithm in linear programming. Dokl. Akad. Nauk SSSR, 244(5):1093–1096, 1979.
[Joh48] F. John. Extremum problems with inequalities as subsidiary conditions. In Studies and Essays Presented to R. Courant on his 60th Birthday, pages 187–204. Interscience Publishers, 1948.
[Kar84] N. Karmarkar. A new polynomial-time algorithm for linear programming. Combinatorica, 4(4):373–395, 1984.
[KM72] V. Klee and G. Minty. How good is the simplex algorithm? In Inequalities, III (Proc. Third Sympos., Univ. California, Los Angeles, Calif., 1969; dedicated to the memory of Theodore S. Motzkin), pages 159–175. Academic Press, New York, 1972.
[LS15] Y. Lee and A. Sidford. Path finding methods for linear programming: Solving linear programs in Õ(√rank) iterations and faster algorithms for maximum flow. 2015. https://arxiv.org/abs/1312.6677.
[Mad10] A. Madry. Faster approximation schemes for fractional multicommodity flow problems via dynamic graph algorithms. In Proceedings of the Forty-second ACM Symposium on Theory of Computing, STOC '10, pages 121–130, New York, NY, USA, 2010. ACM.
[Nes05] Y. Nesterov. Excessive gap technique in nonsmooth convex minimization. SIAM Journal on Optimization, 16(1):235–249, 2005.
[PS12] J. Peña and N. Soheili. A smooth perceptron algorithm. SIAM J. Optim., 22(2):728–737, 2012.
[PS16] J. Peña and N. Soheili. A deterministic rescaled perceptron algorithm. Math. Program., 155(1-2):497–510, 2016.
[PST95] S. A. Plotkin, D. B. Shmoys, and E. Tardos. Fast approximation algorithms for fractional packing and covering problems. Math. Oper. Res., 20(2):257–301, 1995.
[Sch86] A. Schrijver. Theory of Linear and Integer Programming. Wiley-Interscience Series in Discrete Mathematics. John Wiley & Sons, Ltd., Chichester, 1986.
[Vaz01] V. Vazirani. Approximation Algorithms. Springer, 2001.
[WS11] D. P. Williamson and D. B. Shmoys. The Design of Approximation Algorithms. Cambridge University Press, 2011.