[PDF] Bipartite Matching in Nearly-linear Time on Moderately Dense Graphs

Abstract

We present an O ~ (m+ n 1.5 ) -time randomized algorithm for maximum cardinality bipartite matching and related problems (e.g. transshipment, negative-weight shortest paths, and optimal transport) on m -edge, n -node graphs. For maximum cardinality bipartite matching on moderately dense graphs, i.e. m=Ω( n 1.5 ) , our algorithm runs in time nearly linear in the input size and constitutes the first improvement over the classic O(m n − − √ ) -time [Dinic 1970; Hopcroft-Karp 1971; Karzanov 1973] and O ~ ( n ω ) -time algorithms [Ibarra-Moran 1981] (where currently ω≈2.373 ). On sparser graphs, i.e. when m= n 9/8+δ for any constant δ>0 , our result improves upon the recent advances of [Madry 2013] and [Liu-Sidford 2020b, 2020a] which achieve an O ~ ( m 4/3+o(1) ) runtime. We obtain these results by combining and advancing recent lines of research in interior point methods (IPMs) and dynamic graph algorithms. First, we simplify and improve the IPM of [v.d.Brand-Lee-Sidford-Song 2020], providing a general primal-dual IPM framework and new sampling-based techniques for handling infeasibility induced by approximate linear system solvers. Second, we provide a simple sublinear-time algorithm for detecting and sampling high-energy edges in electric flows on expanders and show that when combined with recent advances in dynamic expander decompositions, this yields efficient data structures for maintaining the iterates of both [v.d.Brand et al.] and our new IPMs. Combining this general machinery yields a simpler O ~ (n m − − √ ) time algorithm for matching based on the logarithmic barrier function, and our state-of-the-art O ~ (m+ n 1.5 ) time algorithm for matching based on the [Lee-Sidford 2014] barrier (as regularized in [v.d.Brand et al.]).

Full PDF

BBipartite Matching in Nearly-linear Timeon Moderately Dense Graphs

Jan van den Brand ∗ Yin-Tat Lee † Danupon Nanongkai ‡ Richard Peng § Thatchaphol Saranurak ¶ Aaron Sidford k Zhao Song ∗∗ Di Wang †† September 4, 2020

We present an e O ( m + n . )-time randomized algorithm for maximum cardinality bipartitematching and related problems (e.g. transshipment, negative-weight shortest paths, and op-timal transport) on m -edge, n -node graphs. For maximum cardinality bipartite matching onmoderately dense graphs, i.e. m = Ω( n . ), our algorithm runs in time nearly linear in theinput size and constitutes the ﬁrst improvement over the classic O ( m √ n )-time [Dinic 1970;Hopcroft-Karp 1971; Karzanov 1973] and e O ( n ω )-time algorithms [Ibarra-Moran 1981] (wherecurrently ω ≈ . m = n / δ for any constant δ >

0, ourresult improves upon the recent advances of [Madry 2013] and [Liu-Sidford 2020b, 2020a] whichachieve an e O ( m / o (1) ) runtime.We obtain these results by combining and advancing recent lines of research in interiorpoint methods (IPMs) and dynamic graph algorithms. First, we simplify and improve theIPM of [v.d.Brand-Lee-Sidford-Song 2020], providing a general primal-dual IPM framework andnew sampling-based techniques for handling infeasibility induced by approximate linear systemsolvers. Second, we provide a simple sublinear-time algorithm for detecting and sampling high-energy edges in electric ﬂows on expanders and show that when combined with recent advancesin dynamic expander decompositions, this yields eﬃcient data structures for maintaining theiterates of both [v.d.Brand et al.] and our new IPMs. Combining this general machinery yieldsa simpler e O ( n √ m ) time algorithm for matching based on the logarithmic barrier function, andour state-of-the-art e O ( m + n . ) time algorithm for matching based on the [Lee-Sidford 2014]barrier (as regularized in [v.d.Brand et al.]). ∗ [email protected] . KTH Royal Institute of Technology, Sweden. † [email protected] . University of Washington and Microsoft Research Redmond, USA. ‡ [email protected] . KTH Royal Institute of Technology, Sweden. § [email protected] . Georgia Institute of Technology, USA. ¶ [email protected] . Toyota Technological Institute at Chicago, USA k [email protected] . Stanford University, USA. ∗∗ [email protected] . Columbia University, Princeton University and Institute for Advanced Study, USA. †† [email protected] . Google Research, USA. i a r X i v : . [ c s . D S ] S e p ontents b -Matching Algorithms 77 e O ( n √ m )-Time Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 858.5 Nearly Linear Time Algorithm for Moderately Dense Graphs . . . . . . . . . . . 888.6 More Applications to Matching, Flow, and Shortest Paths . . . . . . . . . . . . . 92 A Initial Point and Solution Rounding 105

A.1 Finding an Initial Point (Proof of Lemmas 8.3 and 8.13) . . . . . . . . . . . . . . 105A.2 Switching the Cost Vector (Proof of Lemma 8.8) . . . . . . . . . . . . . . . . . . 107A.3 Rounding an Approximate to a Feasible Solution (Proof of Lemma 8.9) . . . . . 108A.4 Rounding to an Integral Solution (Proof of Lemma 8.10) . . . . . . . . . . . . . . 110

B Leverage Score Maintenance 111C Degeneracy of A 115 ii Introduction

The maximum-cardinality bipartite matching problem is to compute a matching of maximum sizein an m -edge n -vertex bipartite graph G = ( V, E ). This problem is one of the most fundamentaland well-studied problems in combinatorial optimization, theoretical computer science, andoperations research. It naturally encompasses a variety of practical assignment questions andis closely related to a wide range of prominent optimization problems, e.g. optimal transport,shortest path with negative edge-lengths, minimum mean-cycle, etc.Beyond these many applications, this problem has long served as a barrier towards eﬃ-cient optimization and a proving ground for new algorithmic techniques. Though numerouscombinatorial and continuous approaches have been proposed for the problem, improving uponthe classic time complexities of O ( m √ n ) [HK73, Din70, Kar73] and O ( n ω ) [IM81] has provento be notoriously diﬃcult. Since the early 80s, these complexities have only been improvedby polylogarithmic factors (see, e.g., [Sch03]) until a breakthrough result of Madry [Mad13]showed that faster algorithms could be achieved when the graph is moderately sparse . In par-ticular, Madry [Mad13] showed that the problem could be solved in e O ( m / ) and a line ofresearch [Mad16, LS19, CMSV17, LS20b, AMV20] led to the recent e O ( m / o (1) )-time algo-rithm [LS20a]. Nevertheless, for moderately dense graphs , i.e. m ≥ n . δ for any constant δ >

0, the O (min( m √ n, n ω )) runtime bound has remained unimproved for decades.The more general problem of minimum-cost perfect bipartite b -matching , where an edge canbe used multiple times and the goal is to minimize the total edge costs in order to match everynode v for b ( v ) times, for given non-negative integers b ( v ), has been even more resistant toprogress. An e O ( m √ n ) runtime for this problem with arbitrary polynomially bounded integercosts and b was achieved only somewhat recently by [LS14]. Improving this runtime by even asmall polynomial factor for moderately dense graphs, is a major open problem (see Table 2).The minimum-cost perfect bipartite b -matching problem can encode a host of problemsranging from transshipment, to negative-weight shortest paths, to maximum-weight bipartitematching. Even more recently, the problem has been popularized in machine learning throughits encapsulation of the optimal transport problem [BJKS18, Qua19, LHJ19, AWR17]. Therehas been progress on these problems in a variety of settings (see Section 1.1 and Section 8and the included Table 1 through Table 5), including recent improvements for sparse-graphs[CMSV17, AMV20], and nearly linear time algorithms for computing (1 + (cid:15) ) approximate so-lutions for maximum-cardinality/weight matching [HK73, Din70, Kar73, GT89, GT91, DP14]and undirected transshipment [She17, ASZ20, Li20]. However, obtaining nearly linear time algorithms for solving these problems to high-precision for any density regime has been elusive. In this paper, we show that maximum cardinality bipartite matching, and more broadly minimum-cost perfect bipartite b -matching, can be solved in e O ( m + n . ) time. Tables 1 and 2 comparethese results with previous ones. Compared to the state-of-the-art algorithms for maximum-cardinality matching our bound is the fastest whenever m = n / δ for any constant δ > m ≥ n . . This constitutes the ﬁrst near-optimal runtime in any density regime forthe bipartite matching problem. ω is the matrix multiplication exponent and currently ω ≈ .

373 [Gal14, Wil12] For simplicity, in the introduction we use e O ( · ) to hide polylog n and sometimes polylog( W ), where W typicallydenotes the largest absolute value used for specifying any value in the problem. In the rest of the paper, wedeﬁne e O (in Section 2) to hide polylog( n ) and poly log log W (but not polylog( W )). Further, throughout theintroduction when we state runtimes for an algorithm, the algorithm may be randomized and these runtimes mayonly old w.h.p. ear Authors References Time ( e O ( · ) )Sparse Dense m √ n n ω m / m / o (1) m / o (1) This paper m + n . Table 1: The summary of the results for the max-cardinality bipartite matching problem.For a more comprehensive list, see [DP14].

Year Authors References Time ( e O ( · ) ) mn m / m √ n This paper m + n . Table 2: The summary of the results for the min-cost perfect bipartite b -matching problem(equivalently, transshipment ) for polynomially bounded integer costs and b . For a morecomprehensive list, see Chapters 12 and 21 in [Sch03]. Note that there have been furtherruntime improvements to this problem (not included in the table) under the assumption that k b k = O ( m ), see [CMSV17]. Recently, a state-of-the-art runtime of e O ( m / o (1) ) was achievedby [AMV20] under this assumption.As a consequence, by careful application of standard reductions, we show that we can solvea host of problems within the same time complexity. These problems are those that can bedescribed as or reduced to the following transshipment problem. Given b ∈ R n , c ∈ R m , andmatrix A ∈ { , , − } m × n where each row of A consists of two nonzero entries, one being 1 andthe other being −

1, we want to ﬁnd x ∈ R m ≥ that achieves the following objective:min A > x = b, x ≥ c > x. (1)Viewed as a graph problem, we are given a directed graph G = ( V, E ), a demand function b : V → R and a cost function c : E → R . This problem is then to compute a transshipment f : E → R ≥ that minimizes P uv ∈ E f ( uv ) c ( uv ), where a transshipment is a ﬂow f : E → R ≥ such that for every node v , X uv ∈ E f ( uv ) − X vw ∈ E f ( vw ) = b ( v ) . (2)The main result of this paper is the following Theorem 1.1 providing our runtime for solvingthe transshipment problem. Theorem 1.1.

The transshipment problem can be solved to (cid:15) -additive accuracy in e O (( m + n . ) log ( k b k ∞ k c k ∞ /(cid:15) )) time. For the integral case, where all entries in b , c , and x are integers,the problem can be solved exactly in e O (( m + n . ) log ( k b k ∞ k c k ∞ )) time. Leveraging Theorem 1.1 we obtain the following results.1. A maximum-cardinality bipartite matching can be computed in e O ( m + n . ) time, where e O hides poly log( n ) factors. 2 ear Authors References Time ( e O ( · ) ) n n . n W /(cid:15) n . W . /(cid:15) n W/(cid:15)

This paper n Table 3: The summary of the results for the optimal transport problem.2. The minimum-cost perfect bipartite b -matching on graph G = ( V, E ), with integer edgecosts in [ − W, W ] and non-negative integer b ( v ) ≤ W for all v ∈ V , can be computed in e O (( m + n . ) log ( W )) time.3. The e O (( m + n . ) log ( W )) time complexity also holds for maximum-weight bipartitematching, negative-weight shortest paths, uncapaciated min-cost ﬂow, vertex-capacitatedmin-cost s - t ﬂow, minimum mean cost cycle, and deterministic Markov decision processes(here, W denotes the largest absolute value used for specifying any value in the problem).4. The optimal transport problem can be solved to (cid:15) -additive accuracy in e O ( n log ( W/(cid:15) ))time.We have already discussed the ﬁrst two results. Below we brieﬂy discuss some additionalresults. See Section 8.6 for the details of all results.

Single-source shortest paths with negative weights and minimum weight bipartiteperfect matching.

Due to Gabow and Tarjan’s algorithm from 1989 [GT89], this problemcan be solved in O ( m √ n log( nW )) time where W is the absolute maximum weight of an edgein the graph. For sparse graphs, this has been improved to e O ( m / log W ) [CMSV17] andrecently to e O ( m / o (1) log W ) [AMV20]. Here, our algorithm obtains a running time of e O (( m + n . ) log ( W )), which again is near-linear for dense graphs and is the lowest known when m = n / δ for any constant δ > W is polynomially bounded. Tables 4 and 5 compare ourresults with the previous ones. Optimal Transport.

Algorithms with (cid:15) -additive error received much attention from themachine learning community since the introduction of the algorithm of Altschuler, Weed,Rigollet [AWR17] (e.g. [BJKS18, Qua19, LHJ19]). The algorithm of [AWR17] runs in time e O ( n W /(cid:15) ), and [Qua19, LHJ19] runs in time e O ( n W/(cid:15) ). We improve these running times to e O ( n log ( W/(cid:15) )). (Note that the problem size is Ω( n ).) Table 3 summarizes previous results. Here we provide a brief high-level overview of our approach (see Section 3 for a much moredetailed and formal overview which links to the main theorems of the paper).Our results constitute a successful fusion and advancement of two distinct lines of researchon interior point methods (IPMs) for linear programming [Kar84, Ren88, Vai87, VA93, Ans96,NT97, LS14, CLS19, LSZ19, LS19, Bra20, BLSS20, JSWZ20] and dynamic graph algorithms[SW19, NSW17, CGL +

20, BBG +

20, NS17, Wul17]. This fusion was precipitated by a break-through result of Spielman and Teng [ST04] in 2004 that Laplacian systems could be solved innearly linear time. As IPMs for linear programming essentially reduce all the problems con-sidered in this paper to solving Laplacian systems in each iteration, one can hope for a fasteralgorithm via a combination of fast linear system solvers and interior point methods. Via thisapproach, Daitch and Spielman [DS08] showed in 2008 that minimum cost ﬂow and other prob-lems could be solved in e O ( m / ) time. Additionally, along this line the results of Madry and3thers [Mad13, Mad16, LS19, CMSV17, LS20b, AMV20] all showed that a variety of problemscould be solved faster. However, as discussed, none of these results lead to improved runtimesfor computing maximum cardinality bipartite matching in signiﬁcantly dense graphs.More recently, the result of v.d.Brand, Lee, Sidford and Song [BLSS20], which in turnwas built on [LS14, CLS19, Bra20], led to new possibilities. These methods provide a ro-bust IPM framework which allows one to solve many sub-problems required by each iteration approximately , instead of doing so exactly as required by the previous interior point frame-works. Combining this framework with sophisticated dynamic matrix data structures (e.g.,[Vai89, San04, BNS19, LS15, CLS19, LSZ19, AKPS19, Bra20]) has led to the linear program-ming algorithm of v.d.Brand et al. [BLSS20]. Unfortunately, this algorithm runs in time e O ( mn )for graph problems, and this running time seems inherent to the data structures used. More-over, solving sub-problems only approximately in each iteration leads to infeasible solutions,which were handled by techniques which somewhat complicated and in certain cases ineﬃcient(as this work shows).Correspondingly this paper makes two major advances. First we show that the data struc-tures (from sparse recovery literature [GLPS10, KNPW11, HIKP12, Pag13, LNNT16, Kap17,NS19, NSW19]) used by v.d.Brand et al. [BLSS20] can be replaced by more eﬃcient data struc-tures in the case of graph problems. These data structures are based on the dynamic expanderdecomposition data structure developed in the series of works in [NS17, Wul17, NSW17, SW19,CGL +

20, BBG + G undergoing edge insertions anddeletions given as input, this data structure maintains a partition of edges in G into expanders.This data structure was originally developed for the dynamic connectivity problem [NSW17,Wul17, NS17], but has recently found applications elsewhere (e.g. [BBG +

20, BGS20, GRST20]).We can use this data structure to detect entries in the solution that change signiﬁcantly be-tween consecutive iterations of the robust interior point methods. It was known that this taskis a key bottleneck in eﬃciently implementing prior IPMs methods. Our data structures solvethis problem near optimally. We therefore hope that they may serve in obtaining even fasteralgorithms in the future.The above data structures allow us to solve the problem needed for each iteration (in par-ticular, some linear system) approximately . It is still left open how to use this approximatesolution. The issue is that we might not get a feasible solution (we may get x such that A x = b when we try to solve the LP (1)). In [BLSS20], this was handled in a complicated way thatwould at best give an e O ( n ) time complexity for the graph problems we consider. In this paper,we simplify and further improve the method of [BLSS20] by sub-sampling entries of the afore-mentioned approximate solution (and we show that such sampling can be computed eﬃcientlyusing the aforementioned dynamic expander decompositions). Because of the sparsity of thesampled solution, we can eﬃciently measure the infeasibity (i.e. compute A x − b ) and then ﬁxit in a much simpler way than [BLSS20]. We actually provide a general framework and analysisfor these types of interior point methods that (i) when instantiated on the log barrier, with ourdata structures, yields a e O ( n √ m )-time algorithm (see Section 8.4) and (ii) when applied usingthe more advanced barriers of [LS14] gives our fastest running time (see Section 8.5).We believe that our result opens new doors for combining continuous and combinatorial tech-niques for graph related problems. The recent IPM advances for maximum ﬂow and bipartitematching problems, e.g. [DS08, Mad13, LS14, Mad16, CMSV17, LS20b, LS20a, AMV20] all useLaplacian system solvers [ST04] or more powerful smoothed- ‘ p solver [KPSW19, AKPS19, AS20] statically and ultimately spend almost linear work per iteration. In contrast, in addition to us-ing such solvers, we leverage dynamic data-structures for maintaining expanders to implementIPM iterations possibly in sublinear time. Our ultimate runtimes are then achieved by consid-ering the amortized cost of these data structures. We hope this proof of concept of intertwiningcontinuous and combinatorial techniques opens the door to new algorithmic advances.4 .3 Organization After the preliminaries and overview in Sections 2 and 3, we present our IPMs in Section 4.Our main new data structure called “

HeavyHitter ” is in Section 5. In Section 6 we showhow this

HeavyHitter data structure can be used to eﬃciently maintain an approximationof the slack of the dual solution. Maintaining an approximation of the primal solution isdescribed in Section 7. In Section 8, we put everything together to obtain our e O ( n √ m )-time and e O ( m + n . )-time algorithms for minimum cost perfect bipartite matching and other problemssuch as minimum cost ﬂow and shortest paths.Some additional tools, required by our algorithms, are in Appendix A (constructing theinitial point for the IPM), Appendix B (maintaining leverage scores eﬃciently), and Appendix C(handling the degeneracy of incidence matrices).5 Preliminaries

Basic notation and terminology used throughout this paper.

Misc

We write [ n ] for the interval { , , ..., n } . For a set I ⊂ [ n ] we also use I as 0 / I i = 1 when i ∈ I and I i = 0 otherwise. We write e i for the i -th standard unit vector.We use e O ( · ) notation to hide (log log W ) O (1) and (log n ) O (1) factors, where W typically denotesthe largest absolute value used for specifying any value in the problem (e.g. demands and edgeweights) and n denotes the number of nodes.When we write with high probability (or w.h.p), we mean with probability 1 − n c for anyconstant c > x ∈ R n , we use x i to denote the i -th coordinate of vector x if the symbol x is simple.If the symbol is complicated, we use ( x ) i or [ x ] i to denote the i -th coordinate of vector x (e.g.( δ s ) i ).We write condition for the indicator variable, which is 1 if the condition is true and 0otherwise. Diagonal Matrices

Given a vector v ∈ R d for some d , we write Diag ( v ) for the d × d diagonalmatrix with Diag ( v ) i,i = v i . For vectors x, s, s, x, x t , s t , w, w, w t , τ, g we let X def = Diag ( x ), S def = Diag ( s ), and deﬁne X , S , X t , S t , W , W , W t , T , G analogously. Matrix and Vector operations

Given vectors u, v ∈ R d for some d , we perform arithmeticoperations · , + , − , /, √· element-wise. For example ( u · v ) i = u i · v i or ( √ v ) i = √ v i . For the innerproduct we will write h u, v i and u > v instead. For a vector v ∈ R d and a scalar α ∈ R we have( αv ) i = αv i and we extend this notation to other arithmetic operations, e.g. ( v + α ) i = v i + α .For symmetric matrices A , B ∈ R n × n we write A (cid:22) B to indicate that x > A x ≤ x > B x for all x ∈ R n and deﬁne (cid:31) , ≺ , and (cid:23) analogously. We let S n × n> ⊆ R n × n denote the set of n × n symmetric positive deﬁnite matrices. We call any matrix (not necessarily symmetric) non-degenerate if its rows are all non-zero and it has full column rank.We use a ≈ (cid:15) b to denote that exp( − (cid:15) ) b ≤ a ≤ exp( (cid:15) ) b entrywise and A ≈ (cid:15) B to denote thatexp( − (cid:15) ) B (cid:22) A (cid:22) exp( (cid:15) ) B . Note that this notation implies a ≈ (cid:15) b ≈ δ c ⇒ a ≈ (cid:15) + δ c , and a ≈ (cid:15) b ⇒ a x ≈ (cid:15) · x b x for x ≥ A over reals, let nnz( A ) denote the number of non-zero entries in A . Leverage Scores and Projection Matrices

For any non-degenerate matrix A ∈ R m × n we let P ( A ) def = A ( A > A ) − A > denote the orthogonal projection matrix onto A ’s image.The deﬁnition extends to degenerate matrices via the Penrose-Pseudoinverse, i.e. P ( A ) = A ( A > A ) † A > . Further, we let σ ( A ) ∈ R m with σ ( A ) i def = P ( A ) i,i denote A ’s leverage scores and let Σ ( A ) def = Diag ( σ ( A )), and we let τ ( A ) def = σ ( A ) + nm ~ A ’s regularized leveragescores and T ( A ) def = Diag ( τ ( A )). Finally, we let P (2) ( A ) def = P ( A ) ◦ P ( A ) (where ◦ denotesentrywise product), and Λ ( A ) def = Σ ( A ) − P (2) ( A ). Norms

We write k · k p for the ‘ p -norm, i.e. k v k p := ( P i | v i | p ) /p , k v k ∞ = max i | v i | and k v k being the number of non-zero entries of v . For a positive deﬁnite matrix M we deﬁne k v k M = √ v > M v . For a vector τ we deﬁne k v k τ := ( P i τ i v i ) / and k v k τ + ∞ := k v k ∞ +40 log(4 m/n ) k v k τ , where m ≥ n are the dimensions of the constraint matrix of the linearprogram (we deﬁne k v k τ + ∞ again in Deﬁnition 4.14). Graph Matrices

Given a directed graph G = ( V, E ), we deﬁne the (edge-vertex) incidencematrix A ∈ {− , , } E × V via A e,u = − , A e,v = 1 for every edge e = ( u, v ) ∈ E . We typicallyrefer to the number of edges by m and the number of nodes by n , so the incidence matrix isan m × n matrix, which is why we also allow indices A i,j for i ∈ [ m ], j ∈ [ n ] by assuming someorder to the edges and nodes. 6or edge weights w ∈ R E ≥ we deﬁne the Laplacian matrix as L = A > WA . For an unweightedundirected simple graph the Laplacian matrix has L u,v = − { u, v } ∈ E and L v,v = deg( v ),which is the same as the previous deﬁnition when assigning arbitrary directions to each edge.Our algorithm must repeatedly solve Laplacian systems. These types of linear systems arewell studied [Vai91, ST03, ST04, KMP10, KMP11, KOSZ13, LS13, CKM +

14, KLP +

16, KS16]and we use the following result for solving Laplacian systems (see e.g. Theorem 1.2 of [KS16]):

Lemma 2.1.

There is a randomized procedure that given any n -vertex m -edge graph G withincidence matrix A , diagonal non-negative weight matrix W , and vector b ∈ R V such that thereexists an x ∈ R V with ( A > WA ) x = b computes x ∈ R V with k x − x k A > WA ≤ (cid:15) k x k A > WA in e O ( m log (cid:15) − ) w.h.p. Note that we can express the approximation error of Lemma 2.1 as some spectral approxi-mation, i.e. that there exists some H ≈ (cid:15) A > WA such that H x = b [BLSS20, Section 8]. Expanders

We call an undirected graph G = ( V, E ) a φ -expander, if φ ≤ min ∅6 = S (cid:40) V |{{ u, v } ∈ E | u ∈ S, v ∈ V \ S }| min { P v ∈ S deg( v ) , P v ∈ V \ S deg( v ) } . For an edge partition S ti =1 E i = E , consider the set of subgraphs G , ..., G t , where G i is inducedby E i (with isolated vertices removed). We call this edge partition and the corresponding setof subgraphs a φ -expander decomposition of G if each G i is a φ -expander.7 Overview

We start the overview by explaining how our new interior point method (IPM) works in Sec-tion 3.1. A graph-algorithmic view of this IPM can be found in Section 3.2 and the full detail canbe found in Section 4. This new IPM reduces solving linear programs to eﬃciently performinga number of computations approximately. To eﬃciently perform these computations for graphproblems and implement our IPM we provide new data structures, outlined in Section 3.3. Someof these data structures are easy to obtain via known tools, e.g. Laplacian solvers, and someconstitute new contributions. In Section 3.4 we outline our main data structure contributions.The details for these data structures are found in Sections 5 to 7.

Here we provide an overview of our new eﬃcient sampling-based primal-dual IPMs given in Sec-tion 4. Our method builds upon the recent IPM of v.d.Brand, Lee, Sidford, and Song [BLSS20]and a host of recent IPM advances [CLS19, LS19, LSZ19, Bra20]. As with many of these recentmethods, given a non-degenerate A ∈ R m × n and b ∈ R n and c ∈ R m , these IPMs are appliedto linear programs represented in the following primal ( P ) and dual ( D ) form:( P ) def = min x ∈ R m ≥ : A > x = b c > x and ( D ) def = max y ∈ R n ,s ∈ R m ≥ : A y + s = c b > y . (3)In the remainder of this subsection we explain, motivate, and compare our IPM for this generalformulation. For more information about how this IPM is applied in the case of matchingproblems, see the next subsections. Path following

As is typical for primal-dual IPMs, both our IPM and the IPMs in [CLS19,LSZ19, Bra20, BLSS20, JSWZ20] maintain primal x ( i ) ∈ R m ≥ and dual slack s ( i ) ∈ R m ≥ andproceed for iterations i = 0 , , . . . attempting to iteratively improve their quality. In eachiteration i , they attempt to compute ( x ( i ) , s ( i ) ) so that x ( i ) s ( i ) ≈ µ ( i ) τ ( x ( i ) , s ( i ) ) (4)for some path parameter µ ( i ) ∈ R ≥ and weight function τ ( x ( i ) , s ( i ) ) ∈ R m ≥ . (Recall fromSection 2 that x ( ‘ ) s ( ‘ ) is an element-wise multiplication.)The intuition behind this approach is that for many weight functions, e.g. any constantpositive vector, the set of primal-dual pairs ( x µ , s µ ) ∈ R m ≥ × R m ≥ , that are feasible , i.e. satisfy A > x µ = b and A y + s = c for some y ∈ R n , and are µ -centered , i.e. xs = µτ ( x, s ), forma continuous curve from solutions to (3), at lim µ → ( x µ , s µ ), to a type of center of the primaland dual polytopes (in the case they are bounded), at lim µ →∞ ( x µ , s µ ). This curve is known asthe central path and consequently these methods can be viewed as maintaining approximatelycentrality to approximately follow the central path.Our methods follow a standard step-by-step approach (similar to [BLSS20]) to reduce solvinga linear program to eﬃciently following the central path, i.e. maintaining (4) for changing µ :(i) modify the linear program to have trivial feasible centered initial points (ii) following thepath towards the interior (i.e. increase µ ) (iii) show that the resulting points are on the centralpath for the original linear program (iv) follow the path towards optimal points (i.e. decreasing µ ). (v) show that the resulting points are near optimal and can be rounded to exactly optimalsolutions (depending on the size of the weight function and the particular LP). See Appendix Afor how to perform steps (i), (iii), and (v).Where the IPMs of [CLS19, LSZ19, Bra20, BLSS20, JSWZ20] and ours all diﬀer is in whatweight function is used and how the central path is followed. There is a further complication insome of these methods in that in some cases feasibility of x is not always maintained exactly.8n some, linear systems can only be solved to high-precision, however this can be handled bynatural techniques, see e.g. [DS08, LS14]. Further, in [BLSS20], to allow for approximate linearsystem solves in the iterations and thereby improve the iteration costs, feasibility of x wasmaintained more crudely through complicated modiﬁcations to the steps. A key contributionsof this paper, is a simple sampling-based IPM that also maintains approximately feasible x tofurther decrease the iteration costs of [BLSS20]. Weight function

In this paper we provide a general IPM framework that we instantiatewith our sampling-based techniques on two diﬀerent weight functions τ ( x ( i ) , s ( i ) ). While thereare many possible weight functions we restrict our attention to τ log induced by the standardlogarithmic barrier and τ LS a regularized variant of the weights induced by the Lee-Sidfordbarrier function [LS19] (also used in [BLSS20]) deﬁned as follows: τ log ( x ( i ) , s ( i ) ) def = ~ τ LS ( x ( i ) , s ( i ) ) = σ ( x ( i ) , s ( i ) ) + nm~ σ ( x ( i ) , s ( i ) ) ∈ R m are leverage scores of A under a particular row re-weighting by x ( i ) , and s ( i ) as used in, e.g., [LS14, LS19, BLSS20] (see Deﬁnition 4.13 for formal deﬁnition). Roughly, σ ( x ( i ) , s ( i ) ) measures the importance of each row of A with respect to the current primal dualpair x ( i ) and s ( i ) in a way that the induced central path is still continuous and can be followedeﬃciently.On the one hand, τ log is perhaps the simplest weight function one could imagine. The centralpath it induces is the same as the one induced by penalizing approach the constraints of ( P )and ( D ) with a logarithmic barrier function, see (16). Starting with the seminal work of [Ren88]there have been multiple e O ( √ m ) iteration IPMs induced by τ log . On the other hand, τ LS isclosely related to the Lewis weight barrier or Lee-Sidford barrier given in [LS19] and its analysisis more complex. However, in [BLSS20] it was shown that this weight function induces a e O ( √ n )iteration IPM. (See [LS19, BLSS20] for further explanation and motivation of τ LS ):Though the bounds achieved by τ LS in this paper are never worse than those achieved by τ log (up to logarithmic factors), we consider both weight functions for multiple reasons. First,the analysis of τ log in this paper is simpler than that for τ log and yet is suﬃcient to still obtain e O ( n √ m ) = e O ( n ) time algorithms for the matching problems considered in this paper (andconsequently on a ﬁrst read of this paper one might want to focus on the use of τ log ). Second,the analysis of the two weight functions is very similar and leverage much common algorithmicand analytic machinery. Consequently, considering both barriers demonstrates the versatilityof our sampling-based IPM approach. Centrality potentials

To measure whether x ( i ) s ( i ) ≈ µ ( i ) τ ( x ( i ) , s ( i ) ) (for τ ∈ { τ log , τ LS } )and design our steps, as with previous IPM advances [LS19, CLS19, LSZ19, Bra20, BLSS20,JSWZ20] we use the softmax potential function Φ : R m → R deﬁned for all vectors v byΦ( v ) def = X i ∈ [ n ] φ ( v i ) where φ ( v i ) def = exp( λ ( v i − − λ ( v i − λ . We then deﬁne the centrality measures or potentials asΦ( x ( i ) , s ( i ) , µ ( i ) ) def = Φ x ( i ) s ( i ) µ ( i ) τ ( x ( i ) , s ( i ) ) ! where τ ∈ { τ log , τ LS } depending on which weight function is used. Φ intuitively measure howfar x ( i ) s ( i ) is from µ ( i ) τ ( x ( i ) , s ( i ) ), i.e. how far x ( i ) and s ( i ) are from being centered and having(4) hold. Observe that Φ( v ) is small when v = ~ x ( i ) s ( i ) = µ ( i ) τ ( x ( i ) , s ( i ) )) and increasesvery quickly as v deviates from ~ τ = τ log was used in [CLS19, LSZ19, Bra20, JSWZ20] and Φ with τ = τ LS was used in [BLSS20]. Where our method diﬀers from prior work is in how we designour steps for controlling the value of this potential function a discussed in the next section.9 mprovement Step (Short Step) Given the choice of weight function τ ∈ { τ log , τ LS } ourIPM follows the central path by taking improvement steps (called short steps ) deﬁned as follows: x ( i +1) = x ( i ) + η x δ ( i ) x , s ( i +1) = s ( i ) + η s δ ( i ) s , and µ ( i +1) = µ ( i ) + δ ( i ) µ (5)where η x , η s are constants depending on whether we use τ log or τ LS , and δ ( i +1) x , δ ( i +1) s , and δ ( i +1) µ are deﬁned next (See Algorithm 2 and Algorithm 3 in Section 4 for the complete pseudocodefor the short-step method for τ log and τ LS respectively.) Informally, these steps are deﬁned asapproximate projected Newton steps of Φ in the appropriate norm. Formally, δ ( i ) x and δ ( i ) s aregiven by the following equations δ ( i ) x = X ( i ) g ( i ) − R ( i ) [ X ( i ) ( S ( i ) ) − A ( H ( i ) ) − A > X ( i ) g ( i ) + δ ( i ) c ] , (6) δ ( i ) s = A ( H ( i ) ) − A > X ( i ) g ( i ) . (7)where the variables in (6) and (7) are described below.(I) x ( i ) , s ( i ) ∈ R m are any entry-wise, multiplicative approximations of x ( i ) and s ( i ) and X ( i ) = Diag ( x ( i ) ) and S ( i ) = Diag ( s ( i ) ). (See: Line 4 of Algorithm 2 and Line 4 of Algorithm 3.)(II) g ( i ) ∈ R m : This is an approximate steepest descent direction of Φ with respect to somenorm k·k . Formally, for v ( i ) ∈ R m which is an element-wise approximation of x ( i ) s ( i ) µ ( i ) τ ( x ( i ) ,s ( i ) ) ,we choose g ( i ) = argmax z ∈ R m : k z k≤ h∇ Φ( v ( i ) ) , z i (8)for some norm k · k that depends on whether we use τ log or τ LS . (See Lines 5-6 of Algo-rithm 2 and Lines 5-6 of Algorithm 3.)(III) H ( i ) ∈ R n × n is any matrix such that H ( i ) ≈ A > X ( i ) ( S ( i ) ) − A . (See Line 8 of Algorithm 2and Line 8 of Algorithm 3.)(IV) R ( i ) ∈ R m × m is a randomly selected PSD diagonal matrix chosen so that E [ R ( i ) ] = I , A > X ( i ) ( S ( i ) ) − R ( i ) A ≈ A > X ( i ) ( S ( i ) ) − A , and the second moments of δ x are bounded. The number of non-zero entries in R ( i ) is e O ( n + √ m ) when we use τ log and e O ( n + m/ √ n ) when we use τ LS . Intuitively R randomlysamples some rows of the matrix following it in (6) with overestimates of importancemeasures of the row. (See Line 11 of Algorithm 2 and Line 11 of Algorithm 3.)(V) δ ( i ) c ∈ R m : a “correction vector” which (as discussed more below), helps control theinfeasibility of x ( i +1) . For a parameter η c of value η c ≈ η c = 1 for τ log and η c = − /O (log n ) for τ LS ) this is deﬁned as δ ( i ) c def = η c X ( i ) ( S ( i ) ) − A ( H ( i ) ) − ( A > x ( i ) − b ) . (9) Flexibility of variables : Note that there is ﬂexibility in choosing variables of the form (cid:3) , i.e. x ( i ) , s ( i ) , g ( i ) and H ( i ) . Further, our algorithms have ﬂexibility in the choice of R ( i ) , we justneed too sample by overestimates. This ﬂexibility gives us freedom in how we implement thesteps of this method and thereby simpliﬁes the data-structure problem of maintaining them.As in [BLSS20], this ﬂexibility is key to our obtaining our runtimes. Setting δ ( i +1) µ : As we discuss more below, if R ( i ) = I and δ ( i ) c = 0 in (6), our short steps wouldbe almost the same as those in the IPMs in [CLS19, Bra20, BLSS20, JSWZ20]. For such IPMs,it was shown in [CLS19, Bra20, JSWZ20] (respectively in [BLSS20]) that δ ( i ) µ can be set to10e roughly e O (1 / √ m ) µ ( i ) if we use τ log (respectively e O (1 / √ n ) µ ( i ) if we use τ LS ), leading to amethod with e O ( √ m ) (respectively e O ( √ n )) iterations.In this paper, we can adjust the analyses in [CLS19, Bra20, BLSS20, JSWZ20] to show thatour IPMs require the same number of iterations. In particular, we provide a general frameworkfor IPMs of this type (Section 4.1) and show that by carefully choosing the distribution for R ( i ) (and restarting when necessary) we can preserve the typical convergence rates from [CLS19,Bra20, BLSS20, JSWZ20] for using τ log and τ LS while ensuring that the infeasibility of x is nevertoo large. Provided R ( i ) can be sampled eﬃciently, our new framework supports arbitrary crudepolylogarithmic multiplicative approximations of H ( i ) to A > X ( i ) ( S ( i ) ) − A , in contrast to thehigh precision approximations required by [CLS19, Bra20, JSWZ20] and a more complicatedapproximation required in [BLSS20] Motivations and comparisons to previous IPMs

The IPM in [BLSS20] and ours sharea common feature that they only approximately solve linear systems in each iteration, i.e. theyapply ( H ( i ) ) − to a vector for H ( i ) ≈ A > X ( i ) ( S ( i ) ) − A . While [BLSS20] carefully modiﬁed thesteps to make them feasible in expectation, here we provide a new technique of simply samplingfrom δ x to essentially sparsify the change in x so that we always know the infeasibility andtherefore can better control it. In particular, the short steps in [BLSS20] are almost the sameas ours with R ( i ) = I and δ ( i ) c = 0 in (6). This means that we modify the previous shortsteps in two ways. First, we sparsify the change in x ( i ) using a sparse random matrix R ( i ) deﬁned in (IV). Since E [ R ( i ) ] = I , in expectation the behavior of our IPM is similar to that in[BLSS20]. However, since R ( i ) has e O ( n + m/ √ n ) non-zero entries (and less for τ log ), we canquickly compute A > x ( i +1) − b from A > x ( i ) − b by looking at e O ( n + m/ √ n ) rows of A . Thisinformation is very useful in ﬁxing the feasibility of x ( i +1) so that A > x ( i +1) = b in the LP. Inparticular, while [BLSS20] requires a complicated process to keep x ( i +1) feasible, we only needour second modiﬁcation: a “correction vector” δ ( i ) c . The idea is that we choose δ ( i ) c so that x ( i +1) = x ( i ) + δ ( i ) x + δ ( i ) c would be feasible if we use H ( i ) = A > X ( i ) ( S ( i ) ) − A . Although we willstill have H ( i ) ≈ A > X ( i ) ( S ( i ) ) − A and so x ( i +1) will be infeasible, the addition of δ ( i ) c ﬁxes someof the previous induced infeasibility. This allows us to bypass the expensive infeasibility ﬁxingstep in [BLSS20] which takes e O ( mn + n ) time, and improve the running time to e O ( mn + n . ),and even less when A is an incidence matrix.To conclude, we advance the state-of-the-art for IPMs by providing new methods whichcan tolerate crude approximate linear system solvers and gracefully handle the resulting lossof infeasibility. Provided certain sampling can be performed eﬃciently, our methods improveand simplifying aspects of [BLSS20]. This new IPM framework, together with our new datastructures (discussed next), allow τ log to be used to obtain an e O ( n √ m )-time matching algorithmand τ LS to be used to obtain our e O ( m + n . )-time matching algorithm. We believe that thisframework is of independent interest and may ﬁnd further applications. Here we provide an overview of the IPM discussed in Section 3.1, specialized to the graphproblems we consider, such as matching and min-cost ﬂow. This subsection is intended toprovide further intuition on both our IPM and the data structures we develop for implementingthe IPM eﬃciently. For simplicity, we focus on our IPM with τ log in this subsection. In the caseof graph problems, typically the natural choice of A in the linear programming formulationsis the incidence matrix A ∈ {− , , } E × V of a graph (see Section 2). The structure of thismatrix ultimately enables our methods to have the graph interpretation given in this sectionand allows us to achieve more eﬃcient data structures (as compared to the case of generallinear programs). This interpretation is discussed here and the data structures are discussed inSections 3.3 and 3.4. 11ote that incidence matrices are degenerate; the all-ones vector is always in the kernel andtherefore A is not full column rank (and A > A is not invertible). Consequently, the algorithmsin Section 3.1 do not immediately apply. This can be ﬁxed by standard techniques (e.g. [DS08]).In this paper we ﬁx this issue by appending an identity block at the bottom of A (which can beinterpreted as adding self-loops to the input graph; see Appendix C). For simplicity, we ignorethis issue in this subsection. Min-cost ﬂow

We focus on the uncapacited min-cost ﬂow (a.k.a. transshipment) problem,where the goal is the ﬁnd the ﬂow satisfying nodes’ demands (Eq. (2)). Other graph problemscan be solved by reducing to this problem (see Figure 2). For simplicity, we focus on computing δ ( i ) x as in (6) and assume that η x = η s = 1. Below, entries of any n -dimensional (respectively m -dimensional) vectors are associated with vertices (respectively edges). After i iterations ofour IPM, we have • a ﬂow x ( i ) ∈ R m that is an approximation of a ﬂow x ( i ) (we do not explicitly maintain x ( i ) but it is useful for the analysis), • an approximated slack variable s ( i ) ∈ R m , and • A > x ( i ) − b ∈ R n called infeasibility (a reason for this will be clear later).We would like to improve the cost of x ( i ) by augmenting it with ﬂow x ( i ) g ( i ) ∈ R m , for some“gradient” vector g ( i ) . This corresponds to the ﬁrst term in (6) and gives us an intermediate m -dimensional ﬂow vector ˙ x ( i +1) def = x ( i ) + x ( i ) g ( i ) Let us oversimplify the situation by assuming that g ( i ) has e O ( n ) non-zero entries, so thatcomputing x ( i ) g ( i ) is not a bottleneck in our runtime. We will come back to this issue later. Infeasibility

The main problem of ˙ x ( i +1) is that it might be infeasible , i.e. A > ˙ x ( i +1) = b .The infeasibility A > ˙ x ( i +1) − b is due to (i) the infeasibility of x ( i ) (i.e. A > x ( i ) − b ), and (ii) theexcess ﬂow of x ( i ) g ( i ) , which is ( A > X ( i ) g ( i ) ) v = P uv ∈ E x ( i ) uv g ( i ) uv − P vu ∈ E x ( i ) vu g ( i ) vu on each vertex v . This infeasibility would be ﬁxed if we subtract ˙ x ( i +1) with some “correction” ﬂow f ( i ) c thatsatisﬁes, for every vertex v , the demand vector d ( i ) ∈ R n where d ( i ) def = A > ˙ x ( i +1) − b = A > X ( i ) g ( i ) + ( A > x ( i ) − b ) . (10)Note that given sparse g ( i ) (as assumed above) and A > x ( i ) − b , we can compute the demandvector d ( i ) in e O ( n ) time. Electrical ﬂow

A standard candidate for f ( i ) c is an electrical ﬂow on the input graph G ( i ) with resistance r ( i ) e = s ( i ) e /x ( i ) e on each edge e . In a close form, such electrical ﬂow is f ( i ) c = X ( i ) ( S ( i ) ) − A ( H ( i ) ) − d ( i ) , where H ( i ) is the Laplacian of G ( i ) . (Note that ( H ( i ) ) − does not exist. This issue can be easilyﬁxed (e.g., Appendix C), so we ignore it for now.) Observe that f ( i ) c is exactly the second termof (6) (also see (9)) with R ( i ) = I and H ( i ) = H ( i ) . Such f ( i ) c can be computed in e O ( m ) timein every iteration via fast Laplacian solvers (Lemma 2.1). Since known IPMs require Ω( √ n )iterations, this leads to e O ( m √ n ) total time at best. This is too slow for our purpose. The maincontribution of this paper is a combination of new IPM and data structures that reduces thetime per iteration to e O ( n ). We use a (1 + (cid:15) )-approximation Laplacian solver. Its runtime depends logarithmically on (cid:15) − , so we can treatit essentially as an exact algorithm. pectral sparsiﬁer A natural approach to avoid e O ( m ) time per iteration is to approximate f ( i ) c using a spectral approximation of H ( i ) , denoted by H ( i ) . In particular, consider a newintermediate ﬂow¨ x ( i +1) def = x ( i ) + x ( i ) g ( i ) − f ( i ) c , where f ( i ) c def = X ( i ) ( S ( i ) ) − A ( H ( i ) ) − d ( i ) , (11)Note that the deﬁnition of f ( i ) c is exactly the second term of (6) with R ( i ) = I , and it diﬀersfrom f ( i ) c only in H ( i ) . Given d ∈ R n , computing ( H ( i ) ) − d ∈ R n is straightforward: a spectralsparsiﬁer H ( i ) with (1+ (cid:15) )-approximation ratio and e O ( n/(cid:15) ) edges can be maintained in e O ( n/(cid:15) )time per iteration (under the change of resistances), either using the leverage scores [BLSS20]or the dynamic sparsiﬁer algorithm of [BBG + H ( i ) ) − d . This requires only e O ( n ) time per iteration. Diﬃculties

There are at least two diﬃculties in implementing the above idea:1.

Infeasibility : An approximate electrical ﬂow f ( i ) c might not satisfy the demand d ( i ) , thusdoes not ﬁx the infeasibility of x ( i ) .2. Time:

Computing f ( i ) c ∈ R m explicitly requires Ω( m ) time even just to output the result. Bounding infeasibility and random correction

For the ﬁrst issue, it turns out that whilewe cannot keep each x ( i ) feasible, we can prove that the infeasibility remains small throughout.As a result, we can bound the number of iterations as if every x ( i ) is feasible (e.g. e O ( √ m )iterations using τ log ). To get around the second issue, we apply the correction ﬂow f ( i +1) c onlyon e O ( n ) carefully sampled and rescaled edges ; i.e. our new (ﬁnal) ﬂow is x ( i +1) def = x ( i ) + x ( i ) g ( i ) − R ( i ) f ( i ) c , (12)for some random diagonal matrix R ( i ) ∈ R m × m with e O ( n ) non-zero entries; in other words, x ( i +1) e = x ( i ) e + x ( i ) e g ( i ) e − R ( i ) e,e ( f ( i ) c ) e for every edge e . Observe that (12) is equivalent to how wedeﬁne x ( i +1) in our IPM ((5) and (6)). Since R ( i ) has e O ( n ) non-zero entries , we can compute h ( i ) = R ( i ) f ( i ) c in e O ( n ) time. Our sampled edges basically form an enhanced spectral sparsiﬁer, A > R ( i ) A . For each edge e , let p ( i ) e be a probability that is proportional to the eﬀective resistance of e and ( f ( i ) c ) e .With probability p ( i ) e , we set R ( i ) e,e = 1 /p ( i ) e and zero otherwise. Without ( f ( i ) c ) e inﬂuencing theprobability, this graph would be a standard spectral sparsiﬁer. Our enhanced spectral sparsiﬁercan be constructed in e O ( n ) time using our new data structure based on the dynamic expanderdecomposition data structure, called heavy hitter (discussed in Sections 3.4 and 5). Comparedto a standard spectral sparsiﬁer, it provides some new properties (e.g. k R f c k ∞ is small in somesense and some moments are bounded) that allow us to bound the number of iterations to be thesame as when we do not have R ( i ) . In other words, introducing R ( i ) does not create additionalissues (though it does change the analysis and make the guarantees probabilistic), and helpsspeeding up the overall runtime. Computing x ( i +1) , s ( i +1) and A > x ( i +1) − b Above, we show how to compute x ( i +1) in e O ( n )time under an oversimplifying assumption that g ( i ) is sparse. In reality, g ( i ) may be dense and wecannot aﬀord to compute x ( i +1) explicitly. A more realistic assumption (although still simpliﬁed) If we use τ LS , the number of edges becomes e O ( m/ √ n ). Given d ( i ) ∈ R n , we can compute ( H ( i ) ) − d ( i ) ∈ R n using spectral sparsiﬁers and Laplacian solvers asdiscussed earlier. We can then compute ( R ( i ) f ( i ) c ) uv = R ( i ) uv,uv ( x ( i ) /s ( i ) )( h ( i ) v − h ( i ) u ) for every edge uv such that R uv,uv = 0.

13s that we can guarantee that the number of non-zero entries in g ( i ) − g ( i − is e O ( √ m ). Inthis case we cannot explicitly compute x ( i ) g ( i ) , and thus x ( i +1) . Instead, we explicitly maintain x ( i +1) such that for each edge e , x ( i +1) e is within a constant factor of x ( i +1) e . This means that,for any edge e , if ‘ e ( i ) is the last iteration before iteration i that we set x ( ‘ e ( i )) e = x ( ‘ e ( i )) e , and | P it = ‘ e ( i ) g ( t ) e | = Ω(1), then we have to set x ( i ) e = x ( i ) e . Using the fact that g ( i ) is a unit vector,we can show that we do not have to do this often; i.e. there are e O ( m ) pairs of ( i, e ) such that | P it = ‘ e ( i ) g ( t ) e | = Ω(1). By exploiting the fact that g ( i ) − g ( i − contains e O ( √ m ) non-zero entries,we can eﬃciently detect entries of x ( i ) that need to be changed from x ( i − . Also by the samefact, we can maintain d ( i ) , thus R ( i ) f ( i ) c , in e O ( n ) time per iteration. This implies that we cancomputed x ( i +1) in e O ( n + √ m ) = e O ( n ) amortized time per iteration.We are now left with computing s ( i +1) and A > x ( i +1) − b . Observe that δ ( i ) s (Eq. (7)) appearas part of δ ( i ) x in (6); so, intuitively, s ( i ) can be computed in a similar way to x ( i ) . Note thatalthough R ( i ) does not appear in (7), we can use our heavy hitter data structure (mentionedearlier and discussed in Sections 3.4 and 5) to also detect edges e where s ( i ) e is no longer agood approximation of s ( i ) e . That is, when j was the last iteration when we set s ( j ) e = s ( i ) e thenwe can use the heavy hitter data structure to detect when | s ( i ) e − s ( i ) e | = | s ( i ) e − s ( j ) e | grows toolarge, because the diﬀerence s ( i ) − s ( j ) can be interpreted as some ﬂow again. Finally, notethat A > x ( i +1) − b = ( A > x ( i ) − b ) + A > X ( i ) g ( i ) − A > R ( i ) f ( i ) c . The ﬁrst term is given to us.The last term can be computed quickly due to the sparsity of R ( i ) f ( i ) c . The middle term canbe maintained in e O ( √ m ) time by exploiting the fact that there are e O ( √ m ) non-zero entries in g ( i ) − g ( i − . As noted earlier, our IPMs are analyzed assuming that the constraint matrix A of the linearprogram is non-degenerate (i.e. the matrix ( A > A ) − exists). If A is an incidence matrix, thenthis is not satisﬁed. We ﬁx this by appending an identity block at the bottom of A . For provingand discussing the data structures we will, however, assume that A is just an incidence matrixwithout this appended identity block, as it results in a simpler analysis.Ultimately we would like to compute x ( ‘ ) in the ﬁnal iteration ‘ of the IPM. However, wedo not compute x ( i ) or s ( i ) in iterations i < ‘ because it would take to much time. Instead, weimplement eﬃcient data structures to maintain the following information about (6) and (7) inevery iteration.i Primal and Gradient Maintenance

Maintain vectors g ( i ) , A > X ( i ) g ( i ) and x ( i ) ∈ R m .ii Dual Vectors Maintenance:

Maintain vector s ( i ) ∈ R m .iii Row (edge) sampling:

Maintain R ( i ) .iv Inverse Maintenance:

Maintain ( implicitly ) ( H ( i ) ) − . Given w ∈ R n , return ( H ( i ) ) − w .v Leverage Scores Maintenance ( τ LS ( x ( i ) , s ( i ) ) ): When using the faster e O ( √ n )-iterationIPM with potential τ LS , we must maintain an approximation τ LS ( x ( i ) , s ( i ) ) of τ LS ( x ( i ) , s ( i ) ) = σ ( x ( i ) , s ( i ) ) + n/m , so we can maintain v ( i ) ≈ x ( i ) s ( i ) µ ( i ) ( τ LS x ( i ) ,s ( i ) ) which is needed for g ( i ) (in(8)).vi Infeasibility Maintenance:

Maintain the n -dimensional vector ( A > x ( i ) − b ). The actual situations are slightly more complicated. If we use τ log , we can guarantee that we know some t ( i ) ∈ R , for all i , such that P i k g ( i ) − t ( i ) g ( i − k = e O ( m ); i.e. we can obtain g ( i ) by rescaling g ( i − and changethe values of amortized e O ( √ m ) non-zero entries. We will stick with the simpliﬁed version in this subsection. Notefurther that if we use τ LS , we can guarantee that entries of each g ( i ) can be divided into polylog( n ) buckets whereentries in the same bucket are of the same value. For every i , we can describe the bucketing of g ( i ) by describingpolylog( n ) entries in the buckets of g ( i − that move to diﬀerent buckets in the bucketing of g ( i ) . Additionally,each bucket of g ( i ) may take diﬀerent values than its g ( i − counterpart. g ( i ) (i) and ( H ( i ) ) − (iv), are computed explicitly , meaning thattheir values are maintained in the working memory in every iteration. The vector g ( i ) is main-tained in an implicit form, and for ( H ( i ) ) − , we maintain a data structure that, given w ∈ R n ,can quickly return ( H ( i ) ) − w . Implementing the IPM via i-vi

Below, we repeat (6), (7) and (9) to summarize how weuse our data structures to maintain the information in these equations. δ ( i ) x = X ( i ) g ( i ) | {z } (i) − R ( i ) |{z} (iii) X ( i ) |{z} (i) ( S ( i ) ) − | {z } (ii) A ( H ( i ) ) − | {z } (iv) A > X ( i ) g ( i ) | {z } (i) + R ( i ) δ ( i ) c | {z } below , and (6 ) δ ( i ) s = A ( H ( i ) ) − | {z } (iv) A > X ( i ) g ( i ) | {z } (i) . (7 ) R ( i ) δ ( i ) c = η c R ( i ) |{z} (iii) X ( i ) |{z} (i) ( S ( i ) ) − | {z } (ii) A ( H ( i ) ) − | {z } (iv) ( A > x ( i ) − b ) | {z } (vi) . (9 )Here we see, that all information required to compute δ x and δ s is provided by the data structuresi-iv. Constructing the data structures

Next, we explain how to implement these data struc-tures eﬃciently. Our main contribution with respect to the data structures are for i, ii, and iii(primal, dual and gradient maintenance and row sampling). These data structures are outlinedin Section 3.4.When A is an incidence matrix, maintaining the inverse implicitly (iv) can be done bymaintaining a sparse spectral approximation H ( i ) of the Laplacian A > X ( i ) ( S ( i ) ) − A and thenrunning an existing approximate Laplacian system solver [Vai91, ST03, ST04, KMP10, KMP11,KOSZ13, LS13, CKM +

14, KLP +

16, KS16]. The spectral approximation H ( i ) can be maintainedusing existing tools such as the dynamic spectral sparsifer data structure from [BBG +

20] orsampling from the leverage scores upper bounds.Maintaining the leverage scores (v) is done via a data structure from [BLSS20] which reducesleverage scores maintenance to dual slack maintenance (ii) with some overhead. The resultingcomplexity for maintaining the leverage scores, when using our implementation of ii is analyzedin Appendix B.To maintain A > x ( i ) − b (vi), observe that A > x ( i ) − b = A > x ( i − − b + A > δ ( i − x . Here A > x ( i − − b is known from the previous iteration and A > δ ( i − x = A > X ( i − g ( i − − A > R ( i − [ X ( i − ( S ( i − ) − A ( H ( i − ) − A > X ( i − g ( i − + δ ( i − c ]can be computed eﬃciently because of the sparsity of R ( i − and the fact that we know A > X ( i − g ( i − from gradient maintenance (i). Time complexities

The total time to maintain the data structures i, ii, iv, and vi over ‘ iterations is e O ( m + n‘ ) when using the slower √ m -iteration IPM, and e O ( m + ( n + m/ √ n ) ‘ )when using the faster √ n -iteration IPM. The exception is for the leverage scores τ LS ( x ( i ) , s ( i ) )(v) which is only needed for the √ n -iteration IPM, where we need e O ( m + n‘ + ‘ m/n ) time.So, in total these data structures take e O ( n √ m ) time when we use τ log and e O ( m + n √ n ) timeusing τ LS . We ﬁrst describe how to maintain the approximation s ( i ) ≈ s ( i ) , i.e. the data structure of ii.Via a small modiﬁcation we then obtain a data structure for iii. Finally, we describe a datastructure for i which allows us to maintain the gradient g and the primal solution x .15 pproximation of s (See Sections 5 and 6) In order to maintain an approximation s ( i ) ≈ s ( i ) (i.e. a data structure for ii), we design a data structures for the following twoproblems:(D1): Maintain the exact vector s ( i ) ∈ R m implicitly, such that any entry can be queried in O (1)time.(D2): Detect all indices j ∈ [ m ] for which the current s ( i ) j is no longer a valid approximation of s ( i ) j .Task (D1) can be solved easily and we explain further below how to do it. Solving task (D2)eﬃciently is one of our main contributions and proven in Section 5, though we also given anoutline in this section further below. Once we solve both tasks (D1) and (D2), we can combinethese data structures to maintain a valid approximation of s ( i ) as follows (details in Section 6):Whenever some entry s ( i ) j changed a lot so that s ( i ) j is no longer a valid approximation, (whichis detected by (D2)) then we simply query the exact value via (D1) and update s ( i ) j ← s ( i ) j . Toconstruct these data structure, observe that by (7 ), we have s ( i +1) = s ( i ) + A ( H ( i ) ) − | {z } (iv) A > X ( i ) g ( i ) | {z } (i) = s ( i ) + A h ( i ) . (7 )Here the vector h ( i ) ∈ R n can be computed eﬃciently, thanks to iv and i. So we are left withthe problem of maintaining s ( i +1) ≈ s ( i +1) = s (init) + A i X k =1 h ( k ) . Here we can maintain P ik =1 h ( k ) in O ( n ) time per iteration by simply adding the new h ( i ) tothe sum in each iteration. For any j one can then compute s ( i +1) j in O (1) time so we have adata structure that solves (D1).To get some intuition for (D2), assume we have some s ( i ) with s ( i ) ≈ s ( i ) . Now if an entry( δ ( i ) s ) j is small enough, then we have s ( i ) j ≈ s ( i ) j + ( δ ( i ) s ) j = s ( i +1) j . This motivates why we wantto detect a set J ⊂ [ m ] containing all j where | ( δ ( i ) s ) j | is large, and then update s ( i ) to s ( i +1) bysetting s ( i +1) j ← s ( i +1) j for j ∈ J . So for simplicity we start with the simple case where we onlyneed to detect entries of s ( i +1) that changed by a lot within a single iteration of the IPM. Thatis, we want to ﬁnd every index j such that | ( δ ( i ) s ) j | = | s ( i +1) j − s ( i ) j | > (cid:15)s ( i ) j for some (cid:15) ∈ (0 , | ( A h ( i ) ) j | > (cid:15)s ( i ) j . (13)We assume that in each iteration we are given the vector h ( i ) . Since A is an incidence matrix,index j corresponds to some edge ( u, v ) and (13) is equivalent to | h ( i ) v − h ( i ) u | > (cid:15)s ( i )( u,v ) (14)where s ( i )( u,v ) := s ( i ) j . Assume by induction s ( i − j ≈ s ( i − j for all j , so then ﬁnding all j ’s where s j is changed by a lot in iteration ( i + 1) reduces to the problem of ﬁnding all edges ( u, v ) suchthat | ( h ( i ) v − h ( i ) u ) /s ( i )( u,v ) | > (cid:15) for some (cid:15) = Θ( (cid:15) ).To get the intuition of why we can eﬃciently ﬁnd all such edges, start with the simpliﬁed casewhen the edges have uniform weights (i.e. s ( i ) = ~ h , we can shift h by any constant vector c · ~ O ( n ) time to make h ⊥ d , where d is the vector of degrees of the nodes in the graph. For any edge j = ( u, v ) to have | h u − h v | ≥ (cid:15) , at least one of | h u | and | h v | has to be at least (cid:15) /

2. Thus, it suﬃces to check the16djacent edges of a node u only when | h u | is large, or equivalently h u (cid:15) is at least 1 /

4. Sincechecking the adjacent edges of any u takes time deg( u ), the time over all such nodes is boundedby O ( P u deg( u ) h u (cid:15) − ), which is O ( h > D h(cid:15) − ) where D is the diagonal degree matrix. If thegraph G has conductance at least φ (i.e., G is a φ -expander), we can exploit the classic spectralgraph theory result of Cheeger’s inequality to bound the running time by O ( h > L hφ − G (cid:15) − ),where L = A > A is the graph Laplacian (see Lemma 5.2). Here h > L h = k A h k will be smalldue to properties of our IPM, and thus this already gives us an eﬃcient implementation if ourgraph has large conductance φ = 1 / polylog( n ).To extend the above approach to the real setting where s is non-uniform and the graph is notan expander, we only need to partition the edges of G so that these two properties hold in eachinduced subgraph. For the non-uniform s part, we bucket the edges by their weights in s ( i ) intoedge sets E ( i ) k = { ( u, v ) | s ( i ) ( u, v ) ∈ [2 k , k +1 ) } , so edges in each bucket have roughly uniform s weights. To get the large conductance condition, we further partition each E ( i ) k into expandersubgraphs, i.e., E ( i ) k, , E ( i ) k, , . . . , each inducing an (1 / polylog( n ))-expander. Note that edges movebetween buckets over the iterations as s ( i ) changes, so we need to maintain the expander de-compositions in a dynamic setting. For this we employ the algorithm of [BBG +

20] (building ontools developed for dynamic minimum spanning tree [SW19, CGL +

20, NSW17, NS17, Wul17],especially [SW19]). Their dynamic algorithm can maintain our expander decomposition eﬃ-ciently (in polylog( n ) time per weight update). With the dynamic expander decomposition,we can essentially implement the method discussed above in each expander as follows. Foreach expander subgraph, we constrain the vector h to the nodes of the expander. Then, wetranslate h by the all-one vector so that h is orthogonal to the degree vector of the nodes inthe expander. To perform the translation on all expanders, we need the total size (in termsof nodes) of the induced expanders to be small for the computation to be eﬃcient. We indeedget this property as the dynamic expander decomposition algorithm in [BBG +

20] guarantees P q | V ( E ( i ) k,q ) | = O ( n log n ). In total this will bound the running time in the i th iteration of theIPM to be e O (( (cid:15) ) − k ( S ( i ) ) − A h ( i ) k + n log W ) , where W is a bound on the ratio of largest to smallest entry in s . By properties of the IPM,which bound the above norm, the total running time of our data structure over all e O ( √ n )iterations of the IPM becomes e O ( m + n . ), or e O ( √ mn ) when using the slower e O ( √ m ) iterationIPM.In the above we only consider detecting entries of s ( i ) undergoing large changes in a singleiteration. In order to maintain s ( i ) , we also need to detect entries of s ( i ) that change slowlyevery iteration, but accumulate enough change across multiple iterations so our approximationis no longer accurate enough. This can be handled via a reduction similar to the one performedin [BLSS20], where we employ lazy update and batched iteration tracking. In particular, forevery k = 0 , , . . . , d (log n ) / e , we use a copy of (D2) to check every 2 k iterations of the IPM,if some entry changes large enough over the past 2 k iterations. This reduction only incurs apolylog( n ) factor overhead in running time comparing to the method that only detects largesingle iteration changes, so the total running time is the same up to polylog factors. Row Sampling (Details in Section 5)

Another task is to solve data structure problemiii which is about constructing the random matrix R ( i ) . The desired distribution of R ( i ) is asfollows. For some large enough constant C > q ∈ R m with q j ≥ √ m (( δ ( i ) r ) j / k δ r k + 1 /m ) + C · σ (( X ( i ) ) / ( S ( i ) ) − / A ) j polylog n where δ ( i ) r = X ( i ) ( S ( i ) ) − A ( H ( i ) ) − AX ( i ) g ( i ) + δ ( i ) c then we have R ( i ) j,j = (min( q i , − with probability min( q i ,

1) and 0 otherwise.17his sampling task can be reduced to the two tasks of (i) sampling according to √ m (( δ r ) j / k δ r k and (ii) sampling according to C · σ (( X ( i ) ) / ( S ( i ) ) − / A ) polylog n . The latter can be imple-mented easily as we have approximate leverage scores via data structure v. The former isimplemented in a similar way as data structure (D2) of the previous paragraph. Instead of ﬁnd-ing large entries of some vector ( S ( i ) ) − A h ( i ) as in the previous paragraph (i.e. (13)), we nowwant to sample the entries proportional to X ( i ) ( S ( i ) ) − A h ( i ) where h ( i ) = ( H ( i ) ) − ( AX ( i ) g ( i ) + A > x ( i ) − b ).This sampling can be constructed via a simple modiﬁcation of the previous (D2) data struc-ture. Where (D2) tries to ﬁnd edges ( u, v ) with large | (( S ( i ) ) − A h ( i ) ) ( u,v ) | by looking for nodes v with large | h ( i ) v | , we now similarly sample edges ( u, v ) proportional to ( X ( i ) ( S ( i ) ) − A h ( i ) ) u,v ) by sampling for each node v incident edges proportional to ( h ( i ) v ) . Gradient Maintenance and Approximation of x (Section 7) For the primal solution x ,again we aim to maintain a good enough approximation x through our IPM algorithm. Considerthe update to x ( i ) in (6 ), x ( i +1) = x ( i ) + X ( i ) g ( i ) − R ( i ) X ( i ) ( S ( i ) ) − A ( H ( i ) ) − A > X ( i ) g ( i ) + R ( i ) δ ( i ) c the last two terms will be sparse due to the sparse diagonal sampling matrix R ( i ) , so we canaﬀord to compute that part of the updates explicitly. For the part of X ( i ) g ( i ) where g ( i ) = argmax z ∈ R m : k z k≤ h∇ Φ( v ( i ) ) , z i (see (8)) we will show that g admits a low dimensional representation. Here by low dimension-ality of g ∈ R m we mean that the m indices in the vector can be put into e O (1) buckets, whereindices j, j in the same bucket share the common value g j = g j . This allows us to representthe values of g as a e O (1) dimensional vector so we can eﬃciently represent and do computationswith g in a very compact way.For simplicity consider the case where we use k·k as the norm for the maximization problemthat deﬁnes g ( i ) (this norm is used by the √ m -iteration IPM, while the √ n -iteration IPM usesa slightly more complicated norm). In that case g ( i ) = ∇ Φ( v ( i ) ) / k∇ Φ( v ( i ) ) k and the way weconstruct the e O (1) dimensional approximation is fairly straightforward. We essentially discretize v ( i ) by rounding each entry down to the nearest multiple of some appropriate granularity tomake it low dimensional. Once v ( i ) is made to be e O (1) dimensional, it is simple to see fromthe deﬁnition of the potential function Φ( · ) that ∇ Φ( v ( i ) ) will also be in e O (1) dimension.For the faster √ n iteration IPM where a diﬀerent norm k · k is used, we can show that thelow dimensionality of ∇ Φ( v ( i ) ) also translates to the maximizer being in low dimensional (seeLemma 7.5).Once we compute the low dimensional updates, we still need to track accumulated changesof X ( i ) g ( i ) over multiple iterations. Because of properties of the IPM, we have that on averageany index j switches its bucket (of the low dimensional representation of g ) only some polylog( n )times. Likewise, the value of any entry X ( i ) j changes only some polylog( n ) number of times. Thusthe rate in which P ik =1 X ( k ) j g ( k ) j changes, stays the same for many iterations. This allows us to(A) predict when x ( i ) j is no longer a valid approximation of x ( i ) j , and (B) the low dimensionalityallows us to easily compute any x ( i ) j . In the same way as s ( i ) was maintained via (D1) and (D2),we can now combine (A) and (B) to maintain x ( i ) . In summary, we provide the IPMs in Section 4: the ﬁrst one in Section 4.2 requires e O ( √ m ) iter-ations, while the second IPM in Section 4.3 requires only e O ( √ n ) iterations. These IPMs require18arious approximations which we maintain eﬃciently via data structures. The approximationfor the slack of the dual solution is maintained via the data structure presented in Section 6and the approximation of the gradient and the primal solution is handles in Section 7. The e O ( √ n ) iteration IPM also requires approximate leverage scores which are maintained via thedata structure of Appendix B.As IPMs only improve an initial (non-optimal) solution, we also have to discuss how toconstruct such an initial solution. This is done in Appendix A.1. Further, the IPMs only yielda fractional almost optimal and almost feasible solution, which is why in Appendix A.3 we provehow to convert the solution to a truly feasible solution. Rounding this fractional almost optimalfeasible solution to a truly optimal integral solution is done in Appendix A.4.In Section 8 we combine all these tools and data structures to obtain algorithms for minimumweight perfect bipartite matching, where Section 8.4 obtains an e O ( n √ m )-time algorithm viathe e O ( √ m ) iteration IPM and Section 8.5 obtains an e O ( m + n . )-time algorithm via the faster e O ( √ n ) iteration IPM. At last, we show in Section 8.6 how this result can be extended toproblems such as uncapacitated minimum cost ﬂows (i.e. transshipment) and shortest paths.19 IPM

Throughout this section we let A ∈ R m × n denote a non-degenerate matrix, let b ∈ R n and c ∈ R m , and consider the problem of solving the following linear program ( P ) and its dual ( D )( P ) def = min x ∈ R m ≥ : A > x = b c > x and ( D ) def = max y ∈ R n : A y ≤ c b > y . (15)First, in Section 4.1 we provide our general IPM framework for solving (15). Then, in Section 4.2we show how to instantiate this framework with the logarithmic barrier function and therebyenable e O ( n √ m ) time minimum-cost perfect matching algorithms. Next, in Section 4.3 we showhow to instantiate this framework with the Lee-Sidford barrier [LS14] as used in [BLSS20] andobtain an improved e O ( m + n . )-time minimum-cost perfect matching algorithm. The analysisin both Section 4.2 and Section 4.3 hinge on technical properties of a potential function deferredto Section 4.4. Finally, additional properties of our IPM needed to obtain our ﬁnal runtimesare given in Section 4.5. Here we present our core routine for solving (15). This procedure, Algorithm 1, is a general metaalgorithm for progressing along the central path in IPMs. This general routine maintains primal-dual, approximately feasible points induced by a weight function τ ( x, s ) deﬁned as follows. Deﬁnition 4.1 ( (cid:15) -Centered Point) . For ﬁxed weight function τ : R m> × R m> → R m> , and γ ∈ [0 , we call ( x, s, µ ) ∈ R m> × R m> × R > , (cid:15) -centered if (cid:15) ∈ [0 , / and the following hold • (Approximate Centrality) w ≈ (cid:15) τ ( x, s ) where w def = w ( x, s, µ ) ∈ R m and [ w ( x, s, µ )] i def = x i s i µ . • (Dual Feasibility) There exists y ∈ R n such that A y + s = c . • (Approximate Primal Feasibility) k A > x − b k ( A > XS − A ) − ≤ (cid:15)γ √ µ . Under mild assumptions on the weight function τ , the set of 0-centered ( x, s, µ ) form acontinuous curve, known as the central path , from a center of the polytope (at µ = ∞ ) to thesolution to the linear program (at µ = 0). This path changes depending on the weight functionand in Section 4.2 and Section 4.3 we analyze two diﬀerent paths, the one induced by thestandard logarithmic barrier and the one induced by the LS-barrier [LS19] respectively. Theformer is simpler and yields e O ( n √ m ) time matching algorithms whereas the second is slightlymore complex and yields a e O ( m + n . ) time matching algorithm.Here, we provide a unifying meta algorithm for leveraging both of these paths. In particular,we provide Algorithm 1 which given centered points for one value of µ , computes centered pointsfor a target value of µ . Since we provide diﬀerent algorithms depending on the choice of weightfunction, we present our algorithm in a general form depending on how a certain ShortStep procedure is implemented. Algorithm 1 speciﬁes how to use

ShortStep and then we implementthis procedure in Section 4.2 for the logarithmic barrier and Section 4.3 for the LS-barrier.To analyze and understand this method we measure the quality of ( x, s, µ ) ∈ R m> × R m> × R > by a potential function we call the centrality potential . Deﬁnition 4.2 (Centrality Potential) . For all ( x, s, µ ) ∈ R m> × R m> × R > we deﬁne the centrality potential as follows Φ( x, s, µ ) def = Φ( w ( x, s, µ )) def = X i ∈ [ n ] φ (cid:18) [ w ( x, s, µ )] i [ τ ( x, s )] i (cid:19) where for all v ∈ R m we let φ ( v ) i def = φ ( v i ) def = exp( λ ( v i − − λ ( v i − for all i ∈ [ n ] and λ ≥ is a parameter we chose later. Further, we let φ ( w ) , φ ( w ) ∈ R n be the vectors with [ φ ( w )] i def = φ ( w i ) and [ φ ( w )] i def = φ ( w i ) for all i ∈ [ n ] . ShortStep in Section 4.2 and Section 4.3.In the remainder of this section we deﬁne a valid

ShortStep procedure (Deﬁnition 4.4),provide

PathFollowing (Algorithm 1) which uses it, and analyze this algorithm (Lemma 4.4).

Algorithm 1:

Path Following Meta Algorithm procedure PathFollowing ( x (init) ∈ R m> , s (init) ∈ R m> , µ (init) > , µ (end) > /* Assume ShortStep is a ( (cid:15), r, λ, γ, τ ) -short step procedure */ /* Assume x (init) s (init) ≈ (cid:15)/ µ (init) · τ ( x (init) , s (init) ) */ /* Assume k A x (init) − b k ( A > X (init) S (init) − A ) − ≤ (cid:15)γ q µ (init) */ x ← x (init) , s ← s (init) , µ ← µ (init) while µ = µ (end) do x (0) ← x, s (0) ← s, µ (0) ← µ for i = 1 , , · · · , (cid:15)r do µ (new) ← median( µ (end) , (1 − r ) µ, (1 + r ) µ ) ( x, s ) ← ShortStep ( x, s, µ, µ (new) ) µ ← µ (new) Compute p ≈ / Φ( x, s, µ ) , p ≈ / k A > x − b k ( A > XS − A ) − . if p > e − / exp( λ(cid:15) ) or p > (cid:15)γe − / √ µ then break ; if p > e − / exp( λ(cid:15) ) or p > (cid:15)γe − / √ µ then x ← x (0) , s ← s (0) , µ ← µ (0) return ( x, s ) Deﬁnition 4.3 (Short Step Procedure) . We call a

ShortStep ( x, s, µ, µ (new) ) a ( (cid:15), r, λ, γ, τ )-short step procedure if (cid:15) ∈ (0 , / , r ∈ (0 , / , λ ≥ (cid:15) − log(16 m/r ) , τ : R m> × R m> → R m> ,and γ ∈ [0 , and given input x, s ∈ R m> and µ, µ (new) ∈ R > such that ( x, s, µ ) is (cid:15) -centeredand | µ (new) − µ | ≤ r · µ the procedure outputs random x (new) , s (new) ∈ R m> such that1. A y (new) + s (new) = c for some y (new) ∈ R n ,2. E [Φ( x (new) , s (new) , µ (new) )] ≤ (1 − λr )Φ( x, s, µ ) + exp( λ(cid:15)/ (for Φ deﬁned by τ and λ ) P (cid:20) k A > x (new) − b k ( A > X (new) ( S (new) ) − A ) − ≤ ( (cid:15)/ γ q µ (new) (cid:21) ≥ − r. Lemma 4.4.

Let x (init) , s (init) ∈ R m> and µ (init) , µ (end) > satisfy A y (init) + s (init) = c for some y (init) ∈ R n and satisfy x (init) s (init) ≈ (cid:15)/ µ (init) · τ ( x (init) , s (init) ) and k A > x (init) − b k ( A > X (init) S (init) − A ) − ≤ (cid:15)γ q µ (init) . PathFollowing ( x (init) , s (init) , µ (init) , µ (end) ) (Algorithm 1) outputs ( x (end) , s (end) ) such that x (end) s (end) ≈ (cid:15)/ µ (end) · τ ( x (end) , s (end) ) and k A > x (end) − b k ( A > X (end) S (end) − A ) − ≤ (cid:15)γ q µ (end) in expected time O ( r T ) · | log( µ (end) /µ (init) ) | where O ( T ) expected time to compute p , p suchthat p ≈ / Φ( x, s, µ ) and p ≈ / k A > x − b k ( A > XS − A ) − and to invoke ShortStep once onaverage. Further, the ( x, s, µ ) at each Line 7 and the termination of the algorithm are (cid:15) -centered. roof. Note that the algorithm divides into phases of size (cid:15)r (Line 7 to Line 15). In each phase,if Φ or k A > x − b k ( A > XS − A ) − is too large, then Line 14 occurs and the phase is repeated. Tobound the runtime, we need to show that Line 14 does not happen too often. Notationwise, weuse x ( i ) , s ( i ) , µ ( i ) to denote the state of x, s, µ at the end of the i -th iteration in a phase, and x (0) , s (0) , µ (0) denote the vectors at the beginning of a phase which are stored in case we needto revert to those vectors.First, we claim that throughout the algorithm, Φ and k A > x − b k ( A > XS − A ) − are small atthe beginning of each phase, i.e. Line 7.Φ( x (0) , s (0) , µ (0) ) ≤ r λ(cid:15) ) and k A > x (0) − b k ( A > X (0) S (0) − A ) − ≤ (cid:15)γ q µ (0) . At the beginning of the algorithm, we have by assumption that xs ≈ (cid:15)/ µτ ( x, s ) and k A > x − b k ( A > XS − A ) − ≤ (cid:15)γ √ µ . Since (cid:15) ∈ [0 , / (cid:12)(cid:12)(cid:12) x i s i µ · τ ( x,s ) i − (cid:12)(cid:12)(cid:12) ≤ (cid:15) . Using λ ≥ (cid:15) − log(16 m/r ) and Lemma 4.35yields thatΦ( x, s, µ ) ≤ m · exp (cid:18) λ(cid:15) (cid:19) ≤ (cid:18) r

16 exp (cid:18) λ(cid:15) (cid:19)(cid:19) · exp (cid:18) λ(cid:15) (cid:19) ≤ r (cid:18) λ(cid:15) (cid:19) . Further, when we replace ( x (0) , s (0) , µ (0) ) (i.e., when the last phase is successful), we have p ≤ e − / exp( λ(cid:15) ) and p ≤ (cid:15)γe − / √ µ . By the deﬁnition of p and p , and that (1 / ≤ exp( − / k A > x − b k ( A > XS − A ) − and the claim holds.Similarly, as long as the phase continues (i.e. the conditions at the end of each iteration hold),we have that for all iterations i Φ( x ( i ) , s ( i ) , µ ( i ) ) ≤ exp (cid:18) λ(cid:15) (cid:19) and k A > x ( i ) − b k ( A > X ( i ) S ( i ) − A ) − ≤ (cid:15)γ √ µ. Note that Φ( x, s, µ ) ≤ exp( λ(cid:15) ) implies (cid:12)(cid:12)(cid:12) x j s j µ · τ ( x,s ) j − (cid:12)(cid:12)(cid:12) ≤ (cid:15) for all coordinates by Lemma 4.35.Using (cid:15) ∈ [0 , / x ( i ) s ( i ) ≈ (cid:15) µ ( i ) τ ( x ( i ) , s ( i ) ). Combined with property 1 of ashort step procedure, this implies that ( x ( i ) , s ( i ) , µ ( i ) ) is (cid:15) -centered for each iteration i .Now, we show that Line 14 does not happen too often. Note that there are three reasonsLine 14 happens, one is that it exceeds the threshold for k A > x ( i ) − b k ( A > X ( i ) S ( i ) − A ) − at someiteration, one is that it exceeds the threshold for Φ at some iteration, another is that it exceedsthe threshold for Φ at the end of the phase.For k A > x ( i ) − b k ( A > X ( i ) S ( i ) − A ) − : conditioned on the phase reaching iteration i , we can usethe guarantee of ShortStep to get k A > x ( i ) − b k ( A > X ( i ) S ( i ) − A ) − ≤ (cid:15)γ q µ ( i ) with probability 1 − r , in which case p ≤ ( (cid:15)/ e / γ q µ ( i ) ≤ (cid:15)e − / γ q µ ( i ) . Thus, the probabilitythat the phase fails at iteration i due to the check on p is at most r , and union bound over (cid:15)/r iterations gives that with probability at least 1 − (cid:15) , the algorithm passes per iteration thresholdon k A > x ( i ) − b k ( A > X ( i ) S ( i ) − A ) − for all iterations in a phase.For Φ: again conditioned on the phase reaching iteration i , the guarantee of ShortStep yields E [Φ( x ( i ) , s ( i ) , µ ( i ) )] ≤ (1 − λr )Φ( x ( i − , s ( i − , µ ( i − ) + exp( 16 λ(cid:15) ) ≤ (1 − λr )Φ( x ( i − , s ( i − , µ ( i − ) + r λ(cid:15) ) . m/(cid:15) ) term in the deﬁnition of λ . We know atthe beginning of the phase, Φ( x (0) , s (0) , µ (0) ) ≤ r exp( λ(cid:15) ) by the ﬁrst claim, and knowing thephase reaches iteration i won’t make E [Φ( x ( k ) , s ( k ) , µ ( k ) )] increase for any k ∈ [0 , i ]. Togetherwith the bound above, we know E [Φ( x ( i ) , s ( i ) , µ ( i ) )] ≤ r exp( λ(cid:15) ). By Markov’s inequalitywe have that Φ( x ( i ) , s ( i ) , µ ( i ) ) ≤ exp( λ(cid:15) ) with probability at least 1 − r , and in this case p ≤ e / exp( λ(cid:15) ) ≤ e − / exp( λ(cid:15) ). Thus, the (a priori) probability that the algorithm failsthe check at the end of iteration i is at most r , and union bound over the (cid:15)/r iterations yieldsthat with probability at least 1 − (cid:15) the check for Φ is passed in each iteration of a given phase.Furthermore, by expanding the guarantee of ShortStep over the (cid:15)/r iterations, we knowat the end of the phase E [Φ( x ( (cid:15)/r ) , s ( (cid:15)/r ) , µ ( (cid:15)/r ) )] ≤ (1 − λr ) (cid:15)r Φ( x (0) , s (0) , µ (0) ) + 1 − (1 − λr ) (cid:15)r λr exp (cid:18) λ(cid:15) (cid:19) ≤ e − λ(cid:15) r (cid:18) λ(cid:15) (cid:19) + 1 λr exp (cid:18) λ(cid:15) (cid:19) ≤

18 exp (cid:18) λ(cid:15) (cid:19) where we used that λ ≥ (cid:15) − log(16 m/r ). With probability at least 3 /

4, Φ( x, s, µ ) ≤ exp( λ(cid:15) )at the end of the phase, and hence p ≤ e / exp( λ(cid:15) ) ≤ e − / exp( λ(cid:15) ). Namely, it passes endof phase check for Φ.Union bounds over all three triggers, each phases success without reset with probability atleast . Hence, in expected O (1 /r ) iterations, µ increases or decreases by Θ(1). This yields theruntime O ( r T ) · | log( µ (end) /µ (init) ) | .Finally, we note that when the algorithm ends, we have p ≤ e − / exp( λ(cid:15) ) which impliesΦ ≤ exp( λ(cid:15) ) and x (end) s (end) ≈ (cid:15)/ µ (end) · τ ( x (end) , s (end) ). In this section we provide and analyze a short step when the weight function is constant, i.e. τ ( x, s ) def = ~ ∈ R m for all x, s ∈ R m> . The central path induced by this weight function is alsothe one induced by the logarithmic barrier function, i.e. ( x µ , s µ , µ ) is 0-centered if and only if x µ = argmin x ∈ R m ≥ : A > x = b c > x − µ X i ∈ [ m ] log( x i ) and s µ = argmax s ∈ R m ≥ ,y ∈ R n : A y + s = c b > y + µ X i ∈ [ m ] log( s i ) . (16)Our algorithm using this weight function is given as Algorithm 2. The algorithm simplycomputes an approximate gradient of Φ( x, s, µ ) for the input ( x, s, µ ) and then takes a naturalNewton step to decrease the value of this potential as µ changes and make x feasible. The keynovelty in our step over previous IPMs is that we sparsify the projection onto x by randomsampling. We show that steps on x can be made e O ( √ m + n )-sparse, ignoring the eﬀect of aknown additive term, by carefully picking a sampling distribution. Further, we show that theamount of infeasibility this sampling induces can be controlled and e O ( √ m ) iterations of themethod suﬃce. Ultimately, this leads to a e O ( n √ m ) time algorithm for matching.Our main result of this section is the following lemma which shows that Algorithm 2 is avalid ShortStep procedure for the meta-algorithm of Section 4.1. In this section we simplyprove this lemma. Later, we show how to eﬃciently implement Algorithm 2 to obtain eﬃcientlinear programming and matching algorithms and improve using the Lee-Sidford barrier.

Lemma 4.5 (Log Barrier Short Step) . Algorithm 2 is an ( (cid:15), r, λ, γ, τ ) ShortStep procedure(see Deﬁnition 4.3) for all (cid:15) ∈ (0 , / with weight function τ ( x, s ) def = ~ ∈ R m for all x, s ∈ R m> and parameters λ def = 36 (cid:15) − log(240 m/(cid:15) ) , let γ def = (cid:15)/ (100 λ ) , and r def = (cid:15)γ/ √ m , i.e. lgorithm 2: Short Step (Log Barrier Weight Function) procedure ShortStep ( x ∈ R m> , s ∈ R m> , µ ∈ R > , µ (new) ∈ R > ) /* Fix τ ( x, s ) def = ~ ∈ R m for all x ∈ R m> , s ∈ R m> in Definition 4.2 *//* Let λ def = 36 (cid:15) − log(240 m/(cid:15) ) , γ def = (cid:15)/ (100 λ ) , r def = (cid:15)γ/ √ m , (cid:15) ≤ / . */ /* Assume ( x, s, µ ) is (cid:15) -centered */ /* Assume δ µ def = µ (new) − µ satisfies | δ µ | ≤ rµ *//* Approximate the iterate and compute step direction */ Find x and s such that x ≈ (cid:15) x and s ≈ (cid:15) s Find v ∈ R m> such that k v − w k ∞ ≤ γ for w def = w ( x, s, µ ) g ← − γ ∇ Φ( v ) / k∇ Φ( v ) k /* Sample rows */ A def = X / S − / A , w def = w ( x, s, µ ) , X def = Diag ( x ) , S def = Diag ( s ) , W def = Diag ( w ) Find H such that H ≈ γ A > A Let δ p def = W − / AH − A > W / g , δ c def = µ − / · W − / AH − ( A > x − b ), and δ r def = δ p + δ c Find q ∈ R n> such that q i ≥ √ m · (cid:0) ( δ r ) i / k δ r k + 1 /m (cid:1) + C · σ i ( A ) log( m/ ( (cid:15)r )) γ − forsome large enough constant C . Sample a diagonal matrix R ∈ R n × n where each R ii = min { q i , } − independentlywith probability min { q i , } and R ii = 0 otherwise /* Compute projected step */ δ x ← X ( g − R δ r ) δ s ← S δ p x (new) ← x + δ x and s (new) ← s + δ s return ( x (new) , s (new) ) λ ≥ (cid:15) − log(16 m/r ) ,2. A y (new) + s (new) = c for some y (new) ∈ R n ,3. E [Φ( x (new) , s (new) , µ (new) )] ≤ (1 − λr )Φ( x, s, µ ) + exp( λ(cid:15)/ P (cid:20) k A > x (new) − b k ( A > X (new) ( S (new) ) − A ) − ≤ (cid:15) q µ (new) (cid:21) ≥ − r . We prove Lemma 4.5 at the end of this section in Section 4.2.3. There we prove claims 1and 2, which follow fairly immediately by straightforward calculation. The bulk of this sectionis dedicated to proving claim 3, i.e. Lemma 4.11 in Section 4.2.1, which essentially amountsto proving that the given method is an e O ( √ m )-iteration IPM. This proof leverages severalproperties of the centrality potential, Φ given in Section 4.4. Finally, claim item 4 of the lemmafollows by careful calculation and choice of the sampling distribution q and is given towards theend of this section as Lemma 4.12 in Section 4.2.2. Here we analyze the eﬀect of Algorithm 2 on the centrality potential Φ to prove claim 3 ofLemma 4.5. Our proof leverages Lemma 4.34 and Lemma 4.36 which together reduces bound-24ng the potential Φ( x (new) , s (new) , µ (new) ) to carefully analyzing bounding the ﬁrst and secondmoments of δ s , δ x , and δ µ (as well as there multiplicative stability).More precisely, Lemma 4.34 shows that provided x (new) , s (new) , and µ (new) do not changetoo much multiplicatively, i.e. k X − δ x k ∞ , k S − δ s k ∞ , and | δ µ /µ | are suﬃciently small, then E [Φ( x (new) , s (new) , µ (new) )] ≤ Φ( x, s, µ ) + φ ( w ) > W h ( X − E [ δ x ] + S − δ s − µ − δ µ ~ i + O (cid:16) E k X − δ x k φ ( w ) + k S − δ x k φ ( w ) + k µ − δ µ ~ k φ ( w ) (cid:17) . (17)(To see this, pick u (1) = x , δ (1) = δ x , c (1) = 1, u (2) = s , δ (2) = δ s , c (2) = 1, u (3) = µ~ δ (3) = δ µ ~

1, and c (3) = − φ .Now, (17) implies that to bound that the centrality potential Φ decreases in expectation, itsuﬃces to show the following: • First-order Expected Progress Bound:

The ﬁrst-order contribution of δ x and δ s inexpectation, i.e. W [ X − E [ δ x ] + S − δ s ] is suﬃciently well-aligned with − φ ( w ), i.e. g (upto scaling). (We show this in Lemma 4.7.) • Multiplicative Stability and Second Order s Bound: s does not change too muchmultiplicatively, i.e. k S − δ s k ∞ is small, and its second order contribution to (17) is small,i.e. k S − δ x k φ ( w ) is small. (We show this in Lemma 4.8.) • Multiplicative Stability and Second Order x Bound: x does not change too muchmultiplicatively, i.e. k X − δ x k ∞ is small with probability 1, and its expected second ordercontribution to (17) is small, i.e. E k X − δ x k φ ( w ) is small. (We show this in Lemma 4.9.)We provide these results one at a time, after ﬁrst providing some basic bounds on the componentsof a step, i.e. δ p and δ c , in Lemma 4.6. With these results in hand and some basic analysis of δ µ , in Lemma 4.10, we can show that E [Φ( x (new) , s (new) , µ (new) )] ≤ Φ( x, s, µ ) + φ ( w ) > g + O ( k φ ( w ) k ) + O ( k φ ( w ) k )for some suﬃciently small constants hidden in the O ( · ) notation above. We obtain the desiredLemma 4.11 by applying Lemma 4.36 which leverages properties of φ to show that progress ofthis forms yields to a suﬃciently large decrease to Φ whenever it is suﬃciently large. Lemma 4.6 (Basic Step Size Bounds) . In Algorithm 2 we have that k δ p k ≤ exp(4 (cid:15) ) γ and k δ c k ≤ (cid:15) exp(4 (cid:15) ) γ (18) and consequently, as (cid:15) < / k S − δ s k ≤ exp(5 (cid:15) ) γ and k X − E [ δ x ] k ≤ γ . Proof.

Recall that δ p = W − / AH − A > W / g and H ≈ γ A > A for γ ≤ (cid:15) . Since ( x, s, µ ) is (cid:15) -centered we have w ≈ (cid:15) ~ w ≈ (cid:15) w and w ≈ (cid:15) W − (cid:22) exp( − (cid:15) ) I and k δ p k = k AH − A > W / g k W − ≤ exp(3 (cid:15)/ k H − A > W / g k A > A . (19)Now, applying H ≈ γ A > A and γ ≤ (cid:15) twice we have k H − A > W / g k A > A ≤ exp( (cid:15)/ k A > W / g k H − ≤ exp( (cid:15) ) k W / g k A ( A > A ) − A > . (20)25urther, since A ( A > A ) − A > is a projection matrix we have (cid:22) A ( A > A ) − A > (cid:22) I and k W / g k A ( A > A ) − A > ≤ k g k w ≤ exp(3 (cid:15)/ k g k = exp(3 (cid:15)/ γ , (21)where in the second to last step we again used w ≈ (cid:15) ~ k g k = γ .Combining (19), (20), and (21) yields that k δ p k ≤ exp(4 (cid:15) ) γ as desired.Next, as x ≈ (cid:15) x and s ≈ (cid:15) s we have A > A ≈ (cid:15) A > XS − A . Since H ≈ γ A > A for γ ≤ (cid:15) thisimplies H ≈ (cid:15) A > XS − A . Since, δ c = µ − / W − / AH − ( A > x − b ) and again w ≈ (cid:15) ~ k δ c k = 1 √ µ k AH − [ A > x − b ] k W − ≤ exp(3 (cid:15)/ √ µ k H − [ A > x − b ] k A > A ≤ exp(2 (cid:15) ) √ µ k A > x − b k H − ≤ exp(7 (cid:15)/ √ µ k A > x − b k ( A > XS − A ) − ≤ (cid:15) exp(4 (cid:15) ) γ. where in the last step we used that k A > x − b k ( A > XS − A ) − ≤ (cid:15)γ √ µ since ( x, s, µ ) is (cid:15) -centered.Since, k S − δ s k ≤ exp( (cid:15) ) k S − δ s k = exp( (cid:15) ) k δ p k this yields the desired bound on k S − δ s k as well. The remainder of the result follows as by deﬁnition and triangle inequality we have k X − E [ δ x ] k = k g − δ p − δ c k ≤ k g k + k δ p k + k δ c k and we have (1 + (1 + (cid:15) ) exp(3 (cid:15) )) ≤ (cid:15) ∈ [0 , / W [ X − E [ δ x ] + S − δ s ] is suﬃciently well-alignedwith g . For that we will use the following Lemma 4.32 which will be proven in Section 4.4 andbounds additive ‘ -error with respect to multiplicative spectral-error. Lemma 4.32. If M ≈ (cid:15) N for symmetric PD M , N ∈ R n × n and (cid:15) ∈ [0 , / then k N − / ( M − N ) N − / k ≤ (cid:15) + (cid:15) . Consequently, if x ≈ (cid:15) y for x, y ∈ R > and (cid:15) ∈ [0 , / then k X − y − ~ k ∞ ≤ (cid:15) + (cid:15) . Lemma 4.7 (First-order Expected Progress Bound) . In Algorithm 2 we have k W ( X − E [ δ x ] + S − E [ δ s ]) − g k ≤ (cid:15)γ . Proof.

Note that X − E [ δ x ] + S − E [ δ s ] = X − E [ δ x ] + S − E [ δ s ] + ( X − X − I ) X − E [ δ x ] + ( S − S − I ) S − E [ δ s ] . Further, by the design of E [ δ x ] and E [ δ s ] we have X − E [ δ x ] + S − E [ δ s ] = g − δ c . Also, since X ≈ (cid:15) X and S ≈ (cid:15) S , Lemma 4.32 implies that k X − X − I k ≤ (cid:15) + (cid:15) ≤ (cid:15) exp( (cid:15) ) and similarly, k S − S − I k ≤ (cid:15) exp( (cid:15) ). Therefore Lemma 4.6 implies k W ( X − E [ δ x ] + S − E [ δ s ]) − W g k ≤ k δ c k W + k ( X − X − I ) X − E [ δ x ] k W + k ( S − S − I ) S − E [ δ s ] k W ≤ exp( (cid:15) ) (cid:16) k δ c k + (cid:15) exp( (cid:15) ) (cid:16) k X − E [ δ x ] k + k S − E [ δ s ] k (cid:17)(cid:17) ≤ exp( (cid:15) )( (cid:15) exp(4 (cid:15) ) + (cid:15) exp( (cid:15) )(3 + exp(5 (cid:15) )) γ = (cid:15)γ · (cid:16) exp( (cid:15) ) · (exp(4 (cid:15) ) + exp( (cid:15) )) · (3 + exp(5 (cid:15) )) (cid:17) . Further, as w ≈ (cid:15) I , Lemma 4.32 implies k W g − g k ≤ k W − I k k g k ≤ ( (cid:15) + (cid:15) ) γ = (cid:15)γ · (1 + (cid:15) ) . The result follows as exp( (cid:15) )(exp(4 (cid:15) ) + (exp( (cid:15) ))(3 + exp(5 (cid:15) ))) + (1 + (cid:15) ) ≤ (cid:15) ∈ [0 , / emma 4.8 (Multiplicative Stability and Second Order s Bound) . In Algorithm 2 we have k S − δ s k ∞ ≤ exp(5 (cid:15) ) γ and k S − δ s k φ ( v ) ≤ exp(10 (cid:15) ) γ k φ ( v ) k . Proof.

The result follows from Lemma 4.6, k S − δ s k ∞ ≤ k S − δ s k , and k S − δ s k φ ( v ) ≤ k S − δ s k ∞ · k S − δ s k · k φ ( v ) k . Lemma 4.9 (Multiplicative Stability and Second Order x Bound) . In Algorithm 2 we have P h k X − δ x k ∞ ≤ exp(6 (cid:15) ) γ i = 1 and E h k X − δ x k φ ( v ) i ≤ γ k φ ( v ) k . Proof.

Recall that δ x = X g − XR δ r and x ≈ (cid:15) x ∈ R m . Consequently, k X − δ x k ∞ ≤ exp( (cid:15) ) · ( k g k ∞ + k R δ r k ∞ ) . Now, R ii ≤ [min { , q i } ] − = max { , q − i } by deﬁnition and therefore k R δ r k ∞ ≤ max i ∈ [ m ] max ( | [ δ r ] i | , | [ δ r ] i |√ m (cid:0) [ δ r ] i / k δ r k + 1 /m (cid:1) ) ≤ max (cid:26) k δ r k ∞ , k δ r k (cid:27) ≤ k δ r k where we used that q i ≥ √ m (cid:0) [ δ r ] i / k δ r k + 1 /m (cid:1) by assumption and | [ δ r ] i | ≤ √ m k δ r k [ δ r ] i + k δ r k √ m = √ m k δ r k · ([ δ r ] i / k δ r k + 1 /m ) . (22)Since by deﬁnition and triangle inequality k δ r k ≤ k δ p k + k δ c k the desired bound on k X − δ x k ∞ follows from Lemma 4.6 and exp(5 (cid:15) )(1 + (cid:15) ) ≤ (cid:15) ∈ [0 , / x ≈ (cid:15) x the deﬁnition of δ x and k a + b k ≤ k a k + 2 k b k imply k X − δ x k φ ( v ) ≤ exp(2 (cid:15) ) k g − R δ r k φ ( v ) ≤ (cid:15) ) k g k φ ( v ) + 2 exp(2 (cid:15) ) k R δ r k φ ( v ) . Now, k g k φ ( v ) ≤ k φ ( v ) k k g k ∞ k g k ≤ k φ ( v ) k k g k = k φ ( v ) k γ . Further, the deﬁnition of R and that[min { , q i } ] − ≤ q i ≤ k δ r k √ m [ δ r ] i imply that E R h k R δ r k φ ( v ) i = X i ∈ [ m ] [ φ ( v )] i · [ δ r ] i min { q i , }≤ X i ∈ [ m ] [ φ ( v )] i · (cid:20) [ δ r ] i + 1 √ m k δ r k (cid:21) ≤ k φ ( v ) k k δ r k ∞ k δ r k + k φ ( v ) k k δ r k = 2 k φ ( v ) k k δ r k . The result follows again from k δ r k ≤ k δ p k + k δ c k , Lemma 4.6, and 2 exp(2 (cid:15) ) + 4 exp(2 (cid:15) )((1 + (cid:15) ) exp(4 (cid:15) )) ≤ (cid:15) ∈ [0 , / Lemma 4.34.

For all j ∈ [ k ] let vector u ( j ) ∈ R n> , exponent c j ∈ R and change δ ( j ) ∈ R n induce vectors v, v (new) ∈ R n deﬁned by v i def = Y j ∈ [ k ] ( u ( j ) i ) c j and v (new) i def = Y j ∈ [ k ] ( u ( j ) i + δ ( j ) i ) c j for all i ∈ [ n ] . Further, let V def = Diag ( v ) and U j def = Diag ( u ( j ) ) and suppose that k U − j δ ( j ) k ∞ ≤ λ k c k ) for all j ∈ [ k ] and v ≤ ~ . Then v (new) ≈ / (11 λ ) v and Φ( v (new) ) ≤ Φ( v ) + X j ∈ [ k ] c j φ ( v ) > VU − j δ ( j ) + 2(1 + k c k ) X j ∈ [ k ] | c j | · k U − δ ( j ) k φ ( v ) . We now have everything to bound the eﬀect of a step in terms of k φ ( w ) k and k φ ( w ) k . Lemma 4.10 (Expected Potential Decrease) . In Algorithm 2 we have E [Φ( x (new) , s (new) , µ (new) )] ≤ Φ( x, s, µ ) + φ ( w ) > g + 9 (cid:15)γ k φ ( w ) k + 80 γ k φ ( w ) k . Proof.

We apply Lemma 4.34 with u (1) = x , δ (1) = δ x , c (1) = 1, u (2) = s , δ (2) = δ s , c (2) = 1, u (3) = µ~ δ (3) = δ µ ~

1, and c (3) = −

1. Note that, v in the context of Lemma 4.34 is precisely w ( x, s, µ ) and v (new) in the context of Lemma 4.34 is precisely w ( x + δ x , s + δ s , µ + δ µ ).Now, by Lemma 4.8, the deﬁnition of γ , and (cid:15) ∈ [0 , /

80] we have k S − δ s k ∞ ≤ γ ≤ λ ≤ k c k λ ) . Similarly, Lemma 4.9 implies k X − δ x k ∞ ≤ γ ≤ λ ≤ k c k λ ) . Further, by deﬁnition of r we have µ − | δ µ | ≤ γ ≤ (12(1 + k c k λ )) − l. Thus, Lemma 4.34 implies E [Φ( x (new) , s (new) , µ (new) )] ≤ Φ( x, s, µ ) + φ ( w ) > W h ( X − E [ δ x ] + S − δ s − µ − δ µ ~ i + 8 (cid:16) k E X − δ x k φ ( w ) + k S − δ x k φ ( w ) + k µ − δ µ ~ k φ ( w ) (cid:17) . (23)Now, since w ≈ (cid:15) ~

1, the assumption on δ µ implies | ( δ µ /µ ) φ ( w ) > W ~ | ≤ | ( δ µ /µ ) | exp( (cid:15) ) k φ ( w ) k √ m ≤ (cid:15)γ k φ ( w ) k and k ( δ µ /µ ) ~ k φ ( w ) = | ( δ µ /µ ) | k φ ( w ) k ≤ ( (cid:15)γ/ √ m ) √ m k φ ( w ) k ≤ γ k φ ( w ) k . The result follows by applying Lemmas 4.7, 4.8, and 4.9 to the expectation of (23).To obtain our desired bound on E [Φ( x (new) , s (new) , µ (new) )] we will use the following generalLemma 4.36 to obtain bounds on k φ ( w ) k and k φ ( w ) k . Lemma 4.36 is proven in Section 4.4.28 emma 4.36. Let U ⊂ R m be an axis-symmetric convex set which is contained in the ‘ ∞ ball of radius and contains the ‘ ∞ ball of radius u ≤ , i.e. x ∈ U implies k x k ∞ ≤ andfor some k x k ∞ ≤ u implies x ∈ U . Let w, v ∈ R m such that k w − v k ∞ ≤ δ ≤ λ . Let g = − γ argmax z ∈ U h∇ Φ( v ) , z i and k h k U def = max z ∈ U h h, z i with γ ≥ and δ Φ def = φ ( w ) > g + c γ k φ ( w ) k U + c γ k φ ( w ) k U for some c , c ≥ satisfying λγ ≤ , λδ + c λγ ≤ . − c ) . Then, δ Φ ≤ − . − c ) λγu Φ( w ) + m Finally, we can prove the desired bound on E [Φ( x (new) , s (new) , µ (new) )] by combining thepreceding lemma with known bounds on k φ ( w ) k and k φ ( w ) k . We ﬁrst provide a verygeneral lemma for this which we can use in both the log barrier and LS barrier analysis andthen we use it to prove our desired bound. Lemma 4.11 (Centrality Improvement of Log Barrier Short-Step) . In Algorithm 2 we have E [Φ( x (new) , s (new) , µ (new) )] ≤ (1 − λr )Φ( x, s, µ ) + exp( λ(cid:15)/ . Proof.

Consider an invocation of Lemma 4.36 with δ = γ , c = 9 (cid:15) , c = 80, U set to be theunit ‘ ball. Note that the assumptions of Lemma 4.36 hold as γ = (cid:15)/ (100 λ ). Further, as U contains the ‘ ∞ ball of radius √ m , k v − w k ∞ ≤ γ , and g = − γ ∇ Φ( v ) / k∇ Φ( v ) k we see thatLemma 4.36 applied to the claim of Lemma 4.10 yields that E [Φ( x (new) , s (new) , µ (new) )] ≤ Φ( x, s, µ ) − (cid:18) − c (cid:19) λγ √ m Φ( x, s, µ ) + m ≤ (cid:18) − λγ √ m (cid:19) Φ( x, s, µ ) + m. The result follows from exp( λ(cid:15)/ ≥ m and r = (cid:15)γ √ m . Here we prove item 3 of lemma 4.5, analyzing the eﬀect of Algorithm 2 on approximate x feasibility. Our proof proceeds by ﬁrst showing that if instead of taking the step we take inAlgorithm 2, we instead took an idealized step where H was replaced with A > A and R wasreplaced with I , then the resulting x would have exactly met the linear feasibility constraints.We then bound the diﬀerence between linear feasbility of the actual step and this idealized step.Ultimately, the proof reveals that so long as A > RA approximates A > A , and H approximates A > A well enough and the step is not too large, then the infeasibility is not too large either. Lemma 4.12 (Feasibility Bound) . In Algorithm 2 if A > RA ≈ γ A > A then k A > x (new) − b k A > X (new) ( S (new) ) − A ≤ . (cid:15)γ ( µ (new) ) / (24) and consequently, this holds with probability at least − r .Proof. Recall that x (new) = x + δ x where δ x = X ( g − R δ r ) and δ r def = δ p + δ c . Further, recall that δ p = W − / AH − A > W / g and δ c = µ − / · W − / AH − ( A > x − b ) . A set U ⊂ R m is axis-symmetric if x ∈ U implies y ∈ U for all y ∈ R m with y i ∈ {− x i , x i } for all i ∈ [ m ]. δ r = W − / AH − d where d def = A > W / g + µ − / ( A > x − b ) . Now, consider the idealized step x ∗ where there is no matrix approximation error, i.e. H = A > A , and no sampling error, i.e. I = R . Formally, x ∗ def = x + δ ∗ x where δ ∗ x def = X ( g − δ ∗ r ), and δ ∗ r = W − / A ( A > A ) − d . Now since, A > X = √ µ A > W / we have A > x ∗ = A > x + √ µ A > (cid:16) W / g − A ( A > A ) − d (cid:17) = A > x − ( A > x − b ) = b . Thus, we see that x ∗ obeys the linear constraints for feasibility. Consequently, it suﬃces tobound the error induced by matrix approximation error and sampling error, i.e. A > x − b = A > ( x − x ∗ ) = − A > X ( R δ r − δ ∗ r ) = √ µ A > (cid:16) A ( A > A ) − − RAH − (cid:17) d = √ µ (cid:16) I − A > RAH − (cid:17) d (25)where we used again that A > X = √ µ A > W / and that diagonal matrices commute.Now, since A > RA ≈ γ A > A ≈ γ H we have k ( A > A ) − / ( I − A > RAH − ) H / k = k ( A > A ) − / ( H − A > RA ) H − / k ≤ exp( γ ) k H − / ( H − A > RA ) H − / k ≤ γ where we used Lemma 4.32 and (2 γ + 4 γ ) exp( γ ) ≤ γ for γ ≤ /

80. Consequently, combiningwith (25) yields that k A > x (new) − b k ( A > A ) − = √ µ (cid:13)(cid:13)(cid:13) ( A > A ) − / ( I − A > RAH − ) H / H − / d (cid:13)(cid:13)(cid:13) ≤ γ √ µ k d k H − ≤ γ √ µ (cid:16) k A > W − / g k H − + µ − / · k A > x − b k H − (cid:17) . Further, by design of g and that W ≈ (cid:15) I and 3 (cid:15) ≤ γ k A > W − / g k H − ≤ exp( γ/ k W − / g k A ( A > A ) − A > ≤ exp( γ/ k W − / g k ≤ exp(2 γ ) k g k ≤ γ and by the approximate primal feasibility of ( x, s, µ ) we have k A > x − b k H − ≤ exp( γ/ k A > x − b k A > XS − A ≤ (cid:15)γ √ µ . Combining yields that k A > x (new) − b k ( A > A ) − ≤ γ √ µ (2 γ + 2 (cid:15)γ ) ≤ γ √ µγ ≤ . (cid:15)γ √ µ . Now, since k S − δ s k ∞ ≤ γ and k X − δ x k ∞ ≤ γ by Lemma 4.8 and Lemma 4.9 and γ ≤ (cid:15)/

100 we have that A > X (new) ( S (new) ) − A ≈ (cid:15) A > XSA . Combining with the facts that A > XSA ≈ (cid:15) A > A , and µ (new) ≈ (cid:15)γ µ yields (24). The probability bound follows by the factthat q i ≥ Cσ i ( A ) log( n/ ( (cid:15)r )) γ − and Lemma 4 in [CLM + R correspondsto sampling each row of A with probability at least proportional to its leverage score. For further intuition regarding this proof, note that, ignoring the term k A > W − / g k H − and diﬀerencebetween the A > X (new) ( S (new) ) − A and A > A norm, the inequality below shows that the feasibility of the primalsolution is improved by a factor of 3 γ <

1. This improvement comes from the term δ c deﬁned in Line 9. The re-mainder of the proof after this step is therefor restricted to bounding the impact from the term k A > W − / g k H − which is the impact of δ p deﬁned in Line 9. .2.3 Proof of Lemma 4.5 We conclude by proving Lemma 4.5, the main claim of this section.

Proof of Lemma 4.5. (1) Note that r = (cid:15)γ √ m = (cid:15) λ √ m = (cid:15) β ) √ m for β = m(cid:15) . Thus,12 (cid:15) − log(16 m/r ) ≤ (cid:15) − log(16 β / · β )) ≤ (cid:15) − log((240) β ) ≤ λ where we used that √ x log x ≤ x for x ≥

1, (240) = 16 · · β ≥ δ s = µ / AH − A > W / g . Consequently, δ s ∈ im( A ) and this follows fromthe fact that A y + s = c for some y ∈ R n by deﬁnition of (cid:15) -centered.(3) and (4) these follow immediately from Lemma 4.11 and Lemma 4.12 respectively. In this section we show how to eﬃciently implement a short step when the weight function isregularized variant of the weights induced by Lee-Sidford barriers [LS19], i.e. it is deﬁned asfollows and for brevity, we refer to it as the Lee-Sidford (LS)-weight function.

Deﬁnition 4.13 (Lee-Sidford Weight Function) . We deﬁne the

Lee-Sidford (LS) weight func-tion for all x, s ∈ R m> as τ ( x, s ) def = σ ( x, s ) + nm ~ where σ ( x, s ) def = σ ( X / − α S − / − α A ) for α def =

14 log(4 m/n ) , and σ deﬁned as in Section 2. This weight function is analogous to the one used in [BLSS20] which in turn is inspired bythe weight functions and barriers used in [LS19] (see these works for further intuition).Our short-step algorithm for this barrier (Algorithm 2) is very similar to the one for the log-arithmic barrier (Algorithm 3). The primary diﬀerence is what g is set to in the two algorithms.In both algorithms g is chosen to approximate the negation of the gradient of the centralitypotential while being small in requisite norms for the method. However, by choosing a diﬀerentweight function both the gradient and the norm change. To explain the new step (and analyzeour method) throughout this section we make extensive use of the following mixed norm andits dual vector (variants of which were deﬁned in [LS19, BLSS20]). Deﬁnition 4.14 (Mixed Norm and Dual Vector) . We deﬁne the mixed norm k x k τ + ∞ def = k x k ∞ + C norm k x k τ with C norm = α and we deﬁne the dual vector of g as g [ ( w ) def = argmax k x k w + ∞ =1 g > x. (26) Further we deﬁne the dual norm k g k ∗ τ + ∞ := h g, g [ ( τ ) i , i.e. the maximum attained by (26) . With this notation in mind, recall that in the short-step algorithm for the logarithmicbarrier (Algorithm 3)) we chose g to approximate the direction which has best inner productwith −∇ Φ( w ) and has k · k at most γ . Formally, we picked v with k v − w k γ and let g = − γ ∇ (Φ( v )) / k∇ (Φ( v )) k . Note that g = − γg [ ( v ) if in (26) we replaced k x k w + ∞ with k x k .Since when using the Lee-Sidford weight function (Deﬁnition 4.13) we instead wish to decreaseΦ( w/τ ( x, s )) we instead pick v ∈ R m> such that k v − w/τ ( x, s ) k ∞ ≤ γ and let g def = − γ ∇ Φ( v ) [ ( τ ) where [ is as deﬁned in Deﬁnition 4.14.This setting of g is the primary diﬀerence between the algorithm in this section (Algorithm 3)and the one in previous section (Algorithm 2). The other diﬀerence are induced by this changeor done for technical or analysis reasons. We change the setting of r and the requirements for q to account for the fewer iterations we take and the diﬀerent norm in which we analyze ourerror. Further, we slightly change the step sizes for δ x and δ s , i.e. in Algorithm 3 we let δ x = X ((1 + 2 α ) g − R δ r ) and δ s = (1 − α ) S δ p δ x and δ s as above with α = 0). This change was also performedin [BLSS20] to account for the eﬀect of τ on the centrality potential and ensure suﬃcient progressis made. Algorithm 3:

Short Step (LS Weight Function) procedure ShortStep ( x ∈ R m> , s ∈ R m> , µ ∈ R > , µ (new) ∈ R > ) /* Fix τ ( x, s ) def = σ ( x, s ) + nm ~ ∈ R m for all x ∈ R m> , s ∈ R m> *//* Let λ =

36 log(400 m/(cid:15) ) (cid:15) , γ = (cid:15) λ , r = (cid:15)γ √ n , (cid:15) = α , α =

14 log( mn ) */ /* Assume ( x, s, µ ) is (cid:15) -centered */ /* Assume δ µ def = µ (new) − µ satisfies | δ µ | ≤ rµ *//* Approximate the iterate and compute step direction */ Find x, s, τ such that x ≈ (cid:15) x , s ≈ (cid:15) s , τ ≈ (cid:15) τ for τ def = τ ( x, s ) Find v ∈ R m> such that k v − w/τ k ∞ ≤ γ for w def = w ( x, s, µ ) g = − γ ∇ Φ( v ) [ ( τ ) where h [ ( w ) is deﬁned in Deﬁnition 4.14 /* Sample rows */ A def = X / S − / A , w def = w ( x, s, µ ) , X def = Diag ( x ) , S def = Diag ( s ) , W def = Diag ( w ) Find H such that H ≈ γ A > A Let δ p def = W − / AH − A > W / g , δ c def = µ − / · W − / AH − ( A > x − b ), and δ r def = (1 + 2 α ) δ p + δ c Find q ∈ R n> such that q i ≥ m √ n · (cid:0) ( δ r ) i / k δ r k + 1 /m (cid:1) + C · σ i ( A ) log( m/ ( (cid:15)r )) γ − forsome large enough constant C Sample a diagonal matrix R ∈ R m × m where each R i,i = [min { q i , } ] − independentlywith probability min { q i , } and R i,i = 0 otherwise /* Compute projected step */ δ x ← X ((1 + 2 α ) g − R δ r ) δ s ← (1 − α ) S δ p x (new) ← x + δ x and s (new) ← s + δ s return ( x (new) , s (new) )We analyze the performance of this short-step procedure for the Lee-Sidford barrier anal-ogously to how we analyzed the short step procedure for the log barrier in Section 4.2. Weagain apply Lemma 4.34 and Lemma 4.36 and together these lemmas reduce bounding thepotential Φ( x (new) , s (new) , µ (new) ) to carefully analyzing bounding the ﬁrst moments, secondmoments, and multiplicative stability induced by δ s , δ x , δ µ as well as the change from τ ( x, s )to τ ( x (new) , s (new) ), which we denote by δ τ def = τ ( x (new) , s (new) ) − τ ( x, s ).Key diﬀerences in the analysis of this section in the previous section are in how thesebounds are achieved and in analyzing δ τ . Whereas the analysis of the log barrier short-stepmethod primarily considered the ‘ norm, here we consider the mixed norm, Deﬁnition 4.14(variants of which were analyzed in [LS19, BLSS20]). For δ x and δ s this analysis is performedin Section 4.3.1. For analyzing δ τ new analysis, though related to calculations in [BLSS20], isprovided in Section 4.3.2. This analysis is then leveraged in Section 4.3.3 to provide the mainresult of this section, that Algorithm 3 is a valid short-step procedure.32 .3.1 Bounding δ x and δ s Here we provide basic bounds on δ x , δ s , and related quantities. First, we provide Lemma 4.15and Lemma 4.16 which are restatements of lemmas from [BLSS20] which immediately applyto our algorithm. Leveraging these lemmas, we state and prove Lemma 4.17 which boundsthe eﬀect of our feasibility correction step (which was not present in the same way [BLSS20]).Leveraging these lemmas we prove Lemma 4.18, which provides ﬁrst-order bounds and then weprove Lemma 4.19 which provides second-order and multiplicative bounds. Lemma 4.15 ([BLSS20, Lemma 18]) . In Algorithm 3, we have σ ( S − / − α X / − α A ) ≤ σ ( S − / X / A ) ≤ σ ( S − / − α X / − α A ) . Lemma 4.16 ([BLSS20, Lemma 19]) . In Algorithm 3, Q def = AH − A > satisﬁes Q (cid:22) e (cid:15) P ( A ) .Further, for all h ∈ R m , k W − / QW / h k τ ≤ e (cid:15) k h k τ and k W − / ( I − Q ) W / h k τ ≤ e (cid:15) k h k τ . Further, k W − / QW / h k ∞ ≤ k h k τ and therefore k W − / QW / k τ + ∞ ≤ e (cid:15) + 2 /C norm . Lemma 4.17.

In Algorithm 3, k W − / AH − g k τ ≤ e (cid:15) k g k H − and k W − / AH − g k ∞ ≤ k g k H − . Proof.

Let Q = AH − A > . Since H (cid:23) e − γ A > A (cid:23) e − (cid:15) A > A , we have k AH − / k = k AH − A > k ≤ e (cid:15) k A ( A > A ) − A > k ≤ e (cid:15) . Since γ ≤ (cid:15) we have w ≈ (cid:15) w and w ≈ (cid:15) τ . This implies w ≈ (cid:15) τ and k W − / AH − g k τ ≤ e . (cid:15) k AH − g k ≤ e . (cid:15) k AH − / k k H − / g k ≤ e (cid:15) k g k H − . Further, by Cauchy Schwarz, we have k W − / AH − g k ∞ = max i ∈ [ m ] h e > i W − / AH − g i ≤ max i ∈ [ m ] h W − / QW − / i i,i · h g > H − g i Q (cid:22) e (cid:15) P ( A ) (Lemma 4.16) and σ ( A ) i ≤ σ ( S − / − α X / − α A ) i ≤ τ i (Lemma 4.15), impliesmax i ∈ [ m ] h W − / QW − / i i,i ≤ max i ∈ [ m ] e (cid:15) σ ( A ) i w i ≤ e (cid:15) max i ∈ [ m ] σ ( A ) i τ i ≤ e (cid:15) . Lemma 4.18 (Basic Step Size Bounds) . In Algorithm 3, we have that k δ p k τ + ∞ ≤ γ and k δ c k τ + ∞ ≤ (cid:15)γα ≤ γ (27) and consequently k X − E [ | δ x | ] k τ + ∞ ≤ γ and k S − δ s k τ + ∞ ≤ γ. roof. We ﬁrst bound δ p . Let Q def = AH − A > and note that δ p = W − / QW / g. (28)By Lemma 4.16, we have k W − / QW / g k τ + ∞ ≤ ( e (cid:15) + 2 /C norm ) k g k τ + ∞ ≤ k g k τ + ∞ .Now, we bound δ c . Recall that δ c = µ − / · W − / AH − ( A > x − b ) . By Lemma 4.17, we have k µ − / · W − / AH − ( A > x − b ) k τ ≤ e (cid:15) √ µ k A > x − b k H − ≤ e (cid:15) √ µ k A > x − b k ( A > XS − A ) − . Hence, we have k δ c k τ ≤ e (cid:15) γ(cid:15) . For the sup norm, Lemma 4.17, shows that k µ − / · W − / AH − ( A > x − b ) k ∞ ≤ √ µ k A > x − b k H − ≤ e (cid:15) √ µ k A > x − b k ( A > XS − A ) − . Hence, we have k δ c k ∞ ≤ e (cid:15) γ(cid:15) by deﬁnition of (cid:15) -centered. In particular, as (cid:15) ≤ α/

80, we have k δ c k τ + ∞ ≤ e (cid:15) γ(cid:15) + C norm e (cid:15) γ(cid:15) = e (cid:15) + 10 e (cid:15) α ! γ(cid:15) ≤ γ(cid:15)α ≤ γ. For the bound of S − δ s , we note that δ s = (1 − α ) S δ p and hence k S − δ s k τ + ∞ ≤ k S − S δ p k τ + ∞ ≤ . · k δ p k τ + ∞ ≤ γ. For the bound of X − E [ | δ x | ], note that X − E [ | δ x | ] = X − X | (1 + 2 α )( g − δ p ) − δ c | and (28) yields g − δ p = W − / ( I − Q ) W / g. By Lemma 4.16, we have k g − δ p k τ ≤ e (cid:15) k g k τ and k g − δ p k ∞ ≤ k g k ∞ + k W − / QW / g k ∞ ≤k g k ∞ + 2 k g k τ . Hence, we have k g − δ p k τ + ∞ = k g − δ p k ∞ + C norm k g − δ p k τ ≤ k g k ∞ + 2 k g k τ + C norm e (cid:15) k g k τ ≤ (cid:18) e (cid:15) + 2 C norm (cid:19) k g k τ + ∞ ≤ . γ. Finally, we have k X − E [ | δ x | ] k τ + ∞ ≤ . k (1 + 2 α )( g − δ p ) − δ c k τ + ∞ ≤ (1 . (1 + 2 α ) γ + 27 γ ≤ (1 . · (1 + 2 / γ + 27 γ ≤ γ. where the third step follows α ∈ [0 , / Lemma 4.19 (Multiplicative Stability and Second-order Analysis of x and s ) . In Algorithm 3, • k X − δ x k ∞ ≤ γ and k S − δ s k ∞ ≤ γ with probability . • k E [( X − δ x ) ] k τ + ∞ ≤ γ and k ( S − δ s ) k τ + ∞ ≤ γ In particular, we have E h(cid:13)(cid:13) X − δ x (cid:13)(cid:13) φ ( v ) i ≤ γ k φ ( v ) k ∗ τ + ∞ . and (cid:13)(cid:13) S − δ s (cid:13)(cid:13) φ ( v ) ≤ γ k φ ( v ) k ∗ τ + ∞ . roof. For sup norm for s , it follows from k S − δ s k ∞ ≤ k S − δ s k τ + ∞ and Lemma 4.18.For sup norm for x , recall that X − δ x = X − X ((1 + 2 α ) g − R δ r ) . There are two cases.Case 1) i -th row is sampled with probability 1. We have( X − δ x ) i = E [( X − δ x ) i ]By Lemma 4.18, we have | ( X − δ x ) i | ≤ γ .Case 2) i -th row is sampled with probability at least m √ n · (( δ r ) i / k δ r k + m ) ≤ | [ δ r ] i | ≤ ( √ m/ k δ r k · ([ δ r ] i / k δ r k + 1 /m ) and therefore | ( R δ r ) i | ≤ | ( δ r ) i | q i ≤ √ nm | ( δ r ) i | ( δ r ) i / k δ r k + m ≤ √ nm · √ m k δ r k ≤ √ nm · √ m · r mn k δ r k τ = 12 k δ r k τ where we used τ ≥ nm in the last inequality. Using x ≈ (cid:15) x , we have k X − δ x k ∞ ≤ k g k ∞ + 23 k δ r k τ . (29)Finally, by δ r = (1 + 2 α ) δ p + δ c and Lemma 4.18, we have k δ r k τ + ∞ ≤ . k δ p k τ + ∞ + k δ c k τ + ∞ ≤ γ. (30)Now, using C norm ≥

50, we have k δ r k τ ≤ C norm k δ r k τ + ∞ ≤ k δ r k τ + ∞ ≤ γ . (31)This gives the bound k X − δ x k ∞ ≤ γ .For the second order term for s , leveraging the generalization of Cauchy Schwarz to arbitrarynorms, i.e. that for every norm k · k , its dual k · k ∗ , and all x, y we have | x > y | ≤ k x kk y k ∗ , yields (cid:13)(cid:13)(cid:13) S − δ s (cid:13)(cid:13)(cid:13) φ ( v ) = X i ∈ [ m ] φ ( v i )( S − δ s ) i ≤ k ( S − δ s ) k τ + ∞ k φ ( v ) k ∗ τ + ∞ ≤ k S − δ s k ∞ k S − δ s k τ + ∞ k φ ( v ) k ∗ τ + ∞ ≤ γ k φ ( v ) k ∗ τ + ∞ . For the second order term for x , we note that E h k X − δ x k φ ( v ) i = X i ∈ [ m ] φ ( v i ) E [( X − δ x ) i ] ≤ k E [( X − δ x ) ] k τ + ∞ k φ ( v ) k ∗ τ + ∞ . Hence, it suﬃces to bound k E [( X − δ x ) ] k τ + ∞ . Using that δ x = X ((1 + 2 α ) g − R δ r ), we have( X − δ x ) i ≤ α ) X − X g ) i + 2( X − XR δ r ) i Using x ≈ (cid:15) x and α ≤ , we have( X − δ x ) i ≤ g i + 2 . R δ r ) i . (32)35or the second term, we note that i -th row is sampled with probability at least min(1 , m √ n · ( δ r ) i / k δ r k ). Hence, we have E [( R δ r ) i ] ≤ ( δ r ) i + ( δ r ) im √ n · ( δ r ) i / k δ r k ≤ ( δ r ) i + √ nm · k δ r k ≤ ( δ r ) i + 1 √ n k δ r k τ Hence, we have k E [( R δ r ) ] k τ + ∞ ≤ k ( δ r ) k τ + ∞ + 1 √ n k δ r k τ · k k τ + ∞ . Since k k τ + ∞ = k k ∞ + C norm k k τ = 1 + √ nC norm , we have k E [( R δ r ) ] k τ + ∞ ≤ k δ r k τ + ∞ + 52 C norm k δ r k τ ≤ (cid:18) C norm (cid:19) k δ r k τ + ∞ ≤ (cid:18) (cid:19) · (cid:18) (cid:19) γ ≤ . γ where we used (30) at the third inequalities.Putting this into (32) gives k E [( X − δ x ) ] k τ + ∞ ≤ k g k τ + ∞ + (2 . . γ ≤ k g k τ + ∞ k g k ∞ + (2 . . γ ≤ γ . δ τ Here we bound the ﬁrst two moments of the change in the LS weight function. First weleverage Lemma 4.20, leverage score related derivative calculations from [LS19], to providegeneral bounds on how τ changes in Lemma 4.21. Then, applying a Lemma 4.22, a technicallemma about how the quadratic form of entrywise-squared projection matrices behave undersampling from [BLSS20], and Lemma 4.23, a new lemma regarding the stability P ( WA ) (2) under changes to W , in Lemma 4.24 we bound δ τ def = τ ( x (new) , s (new) ) − τ ( x, s ) in Algorithm 3. Lemma 4.20 ([LS19, Lemma 49]) . Given a full rank A ∈ R m × n and w ∈ R m> , we have D w P ( WA )[ h ] = ∆ P ( WA ) + P ( WA )∆ − P ( WA )∆ P ( WA ) where W = Diag ( w ) , ∆ = Diag ( h/w ) , and D w f [ h ] def = lim t → t [ f ( w + th ) − f ( w )] de-notes the directional derivative of f with respect to w in direction h . In particular, we have D w σ ( WA )[ h ] = 2 Λ ( WA ) W − h . Lemma 4.21.

Let w ∈ R m> be arbitrary and let w (new) be a random vector such that w (new) ≈ γ w with probability for some ≤ γ ≤ . Let τ = σ ( WA ) + η and τ (new) = σ ( W (new) A ) + η forsome η ∈ R m> . Let δ w = w (new) − w , δ τ = τ (new) − τ and W t = (1 − t ) W + t W (new) . Then, for K def = 10 k E w (new) [( W − δ w ) ] k σ ( WA )+ ∞ + 13 γ max t ∈ [0 , E w (new) [ k W − | δ w |k P (2) ( W t A )+ ∞ ] where k x k M + ∞ def = k x k ∞ + C k x k M for some ﬁxed constant C ≥ and any norm k · k M , thefollowing hold • k T − δ τ k ∞ ≤ γ , • k E w (new) [( T − δ τ ) ] k τ + ∞ ≤ K k E w (new) [ T − ( τ (new) − τ − Λ ( WA ) W − δ w )] k τ + ∞ ≤ K .Proof. Let w t = w + tδ w , σ t = σ ( W t A ), τ t = σ ( W t A ) + η . For simplicity, we let σ def = σ and r t def = W − t δ w .For the ﬁrst result, recall that σ t = Diag ( W t A ( A > W t A ) − A > W t ). Since, w t ≈ γ w , wehave that σ t ≈ γ σ and hence k T − δ τ k ∞ ≤ γ .For the second result, Jensen’s inequality shows that( T − δ τ ) i = (cid:18) T − Z (cid:20) dd t σ t (cid:21) d t (cid:19) i ≤ Z (cid:18) T − (cid:20) dd t σ t (cid:21)(cid:19) i dt (33)Using that dd t σ t = 2 Λ t r t (Lemma 4.20) and Λ t = Σ t − P (2) t , we have τ − i (cid:18)(cid:20) dd t σ t (cid:21) i (cid:19) = 4 τ − i ( Λ t r t ) i ≤ τ − i (cid:16) ( Σ t r t ) i + ( P (2) t r t ) i (cid:17) . (34)Substituting (34) in (33), we have k E ( T − δ τ ) k τ + ∞ ≤ Z (cid:13)(cid:13)(cid:13) T − E (cid:16) ( Σ t r t ) + ( P (2) t r t ) (cid:17)(cid:13)(cid:13)(cid:13) τ + ∞ dt ≤ t ∈ [0 , (cid:18)(cid:13)(cid:13)(cid:13) T − E ( Σ t r t ) (cid:13)(cid:13)(cid:13) τ + ∞ + (cid:13)(cid:13)(cid:13) T − E ( P (2) t r t ) (cid:13)(cid:13)(cid:13) τ + ∞ (cid:19) ≤

10 max t ∈ [0 , (cid:18)(cid:13)(cid:13)(cid:13) E r t (cid:13)(cid:13)(cid:13) σ + ∞ + (cid:13)(cid:13)(cid:13) E ( Σ − t P (2) t r t ) (cid:13)(cid:13)(cid:13) σ t + ∞ (cid:19) (35)where we used w t ≈ γ w , σ t ≈ γ σ and τ ≥ σ at the end.For the second term in (35), using k Σ − t P (2) t r t k ∞ ≤ k r t k ∞ ≤ γ (Lemma 4.25), Σ t (cid:23) P (2) t (Lemma 4.25) and that P (2) t is entry-wise non-negative, we have (cid:13)(cid:13)(cid:13) E ( Σ − t P (2) t r t ) (cid:13)(cid:13)(cid:13) σ t + ∞ ≤ γ E (cid:13)(cid:13)(cid:13) Σ − t P (2) t | r t | (cid:13)(cid:13)(cid:13) σ t + ∞ ≤ γ E k| r t |k P (2) t + ∞ ≤ . γ E k W − | δ w |k P (2) t + ∞ (36)Putting this into (35) gives the second result.For the third result, Taylor expansion shows that σ = σ + dd t σ t (cid:12)(cid:12)(cid:12) t =0 + Z (1 − t ) " d d t σ t d t. Using again that dd t σ t = 2 Λ t r t (Lemma 4.20) we have that dd t σ t (cid:12)(cid:12)(cid:12) t =0 = 2 ΛW − δ w and therefore,( ∗ ) def = k E w (new) [ T − ( τ − τ − ΛW − δ w )] k τ + ∞ ≤ (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)Z (1 − t ) E w (new) " T − d d t σ t ( t ) d t (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) τ + ∞ ≤ Z (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) E w (new) " T − d d t σ t ( t ) τ + ∞ d t. (37)Hence, it suﬃces to bound d dt σ t ( t ).Note that d d t σ t ( t ) = 2 (cid:20) dd t Λ t (cid:21) r t − Λ t r t . (38)37et ∆ t def = Diag ( r t ). Recall that Λ t = Σ t − ( W t A ( A > W t A ) − A > W t ) (2) , Lemma 4.20 showsdd t Λ t = dd t Σ t − P t ◦ dd t ( W t A ( A > W t A ) − A > W t )= 2 Diag ( Λ t r t ) − t P (2) t − P (2) t ∆ t + 4 P t ◦ ( P t ∆ t P t ) . Using this in (38) givesd d t σ t ( t ) = 4 Diag ( Λ t r t ) · r t − t P (2) t r t − P (2) t ∆ t r t + 8( P t ◦ ( P t ∆ t P t )) r t − Λ t r t = 4∆ t Λ t r t − t P (2) t r t − P (2) t r t + 8( P t ◦ ( P t ∆ t P t )) r t − Σ t r t + 2 P (2) t r t = 2 Σ t r t − t P (2) t r t − P (2) t r t + 8( P t ◦ ( P t ∆ t P t )) r t . Putting this into (37) and using w t ≈ γ w , σ t ≈ γ σ , and τ ≥ σ , we have( ∗ ) ≤ max t h k E [ T − Σ t r t k τ + ∞ ] + 4 k E [ T − ∆ t P (2) t r t ] k τ + ∞ i + max t h k E [ T − P (2) t r t ] k τ + ∞ + 4 k E [ T − ( P t ◦ ( P t ∆ t P t )) r t ] k τ + ∞ i ≤ max t h . k E [( W − δ w ) ] k σ + ∞ + 5 k r t k ∞ k E [ Σ − t P (2) t r t ] k σ t + ∞ i + max t h . k E [ Σ − t P (2) t r t ] k σ t + ∞ + 5 k E [ Σ − t ( P t ◦ ( P t ∆ t P t )) r t ] k σ t + ∞ i . (39)For the second term in (39), we follow a similar argument as (36) and get k E [ Σ − t P (2) t r t ] k σ t + ∞ ≤ E [ k Σ − t P (2) t r t k σ t + ∞ ] ≤ E h k| r t |k P (2) t Σ − t P (2) t + ∞ i ≤ E h k| r t |k P (2) t + ∞ i . Similarly, the third term in (39), we have k E [ Σ − t P (2) t r t ] k σ t + ∞ ≤ E h k r t k P (2) t + ∞ i . For the fourth term, note that e > i ( P t ◦ ( P t ∆ t P t )) r t = X j ∈ [ m ] ( P t ) i,j ( P t ∆ t P t ) i,j [ r t ] j = e > i P t ∆ t P t ∆ t P t e i ≤ e > i P t ∆ t P t e i = e > i P (2) t r t . Hence, we can bound the term again by E k r t k P (2) t . In summary, we have( ∗ ) ≤ . k E [( W − δ w ) ] k σ + ∞ + 5 k r t k ∞ E [ k| r t |k P (2) t + ∞ ] + 6 . E [ k r t k P (2) t + ∞ ] ≤ . k E [( W − δ w ) ] k σ + ∞ + 11 . k r t k ∞ E [ k| r t |k P (2) t + ∞ ] ≤ . k E [( W − δ w ) ] k σ + ∞ + 13 γ E [ k W − | δ w |k P (2) t + ∞ ]where we used that P (2) t is entry-wise non-negative and k r t k ∞ ≤ γ at the end. This gives thethird result. Lemma 4.22 ([BLSS20, Appendix A]) . Let A ∈ R m × n be non-degenerate, δ ∈ R m and p i ≥ min { , k · σ i ( A ) } . Further, let e δ ∈ R m be chosen randomly where for all i ∈ [ m ] independently e δ i = δ i /p i with probability p i and e δ i = 0 otherwise. Then E [ e δ ] = δ and E k e δ k P (2) ( A ) ≤ (1 + (1 /k )) · k δ k σ ( A ) . emma 4.23. for all non-degenerate A ∈ R m × n , w, w ∈ R m> with w ≈ γ w for γ > , andany non-negative vector v ∈ R m ≥ , we have k v k P ( W A ) (2) ≤ e γ k v k P ( WA ) (2) . Proof.

Let D def = W ( W ) − . Then, we have that0 (cid:22) P ( W A ) = DWA ( A > W A ) − A > WD (cid:22) e γ DWA ( A > W A ) − A > WD = e γ DP ( WA ) D . By Schur product theorem (that the entrywise product of two PSD matrices is PSD), we see thatfor any PSD M , N ∈ R m × m with M (cid:23) N it is the case that M (2) − N (2) = ( M − N ) ◦ ( M + N ) isPSD (where we recall that ◦ denotes entrywise product and use the shorthand M (2) def = M ◦ M ).Thus, P ( W A ) (cid:22) e γ DP ( WA ) D implies P ( W A ) (2) (cid:22) h e γ DP ( WA ) D i ◦ h e γ DP ( WA ) D i = e γ D P ( WA ) (2) D . Hence, we have k v k P ( W A ) (2) ≤ e γ · k D v k P ( WA ) (2) ≤ e γ · k v k P ( WA ) (2) where we used that v and P ( WA ) (2) are coordinate-wise non-negative in the last step.Now, we can bound the change of τ by bounding the change of w . Lemma 4.24.

In Algorithm 3 the vector δ τ def = τ ( x (new) , s (new) ) − τ ( x, s ) satisﬁes the following • k T − δ τ k ∞ ≤ γ , • δ τ = Λ ((1 − α ) X − δ x − (1 + 2 α ) S − δ s ) + η with k T − E [ η ] k τ + ∞ ≤ γ , and • k E [( T − δ τ ) ] k τ + ∞ ≤ γ . In particular, we have E h(cid:13)(cid:13) T − δ τ (cid:13)(cid:13) φ ( v ) i ≤ γ k φ ( v ) k ∗ τ + ∞ Proof.

Let e w def = x − α s − − α , e w (new) def = [ x (new) ] − α [ s (new) ] − − α , f W def = Diag ( e w ), and f W t def = Diag ((1 − t ) w + tw (new) )) for all t ∈ [0 , k X − δ x k ∞ ≤ γ , k S − δ s k ∞ ≤ γ (Lemma4.19) and α ∈ [0 , ], we have that w (new) ≈ γ w . Consequently, we can invoke Lemma 4.21 with w , w (new) , W t , and γ set to be e w , e w (new) , f W t , and 5 γ respectively. This gives the ﬁrst resultimmediately and for the second and third result, for δ e w def = e w (new) − e w it suﬃces to bound K def = 10 k E [( f W − δ e w ) ] k σ + ∞ + 65 γ max t ∈ [0 , E e w (new) [ k f W − | δ e w |k P (2) ( f W t A )+ ∞ ] . Now, for all t ∈ [0 ,

1] let x t def = x + tδ x and s t def = x = tδs . We have ddt ln (cid:18) x − αt s − − αt (cid:19) = X − + αt S + αt (cid:20)(cid:18) − α (cid:19) S − − αt X − α − t δ x + (cid:18) − − α (cid:19) X − αt S − − α − t δ s (cid:21) = (cid:18) − α (cid:19) X − t δ x − (cid:18)

12 + α (cid:19) S − t δ s and therefore we have k ln( e w (new) ) − ln( e w ) k ∞ ≤ Z (cid:13)(cid:13)(cid:13)(cid:13)(cid:18) − α (cid:19) X − t δ x − (cid:18)

12 + α (cid:19) S − t δ s (cid:13)(cid:13)(cid:13)(cid:13) dt (cid:18) − α (cid:19) Z (cid:13)(cid:13)(cid:13) X − t δ x (cid:13)(cid:13)(cid:13) ∞ dt + (cid:18)

12 + α (cid:19) Z (cid:13)(cid:13)(cid:13) S − t δ s (cid:13)(cid:13)(cid:13) ∞ dt ≤ − (1 / (cid:20)(cid:18) − α (cid:19) Z (cid:13)(cid:13)(cid:13) X − δ x (cid:13)(cid:13)(cid:13) ∞ dt + (cid:18)

12 + α (cid:19) Z (cid:13)(cid:13)(cid:13) S − δ s (cid:13)(cid:13)(cid:13) ∞ dt (cid:21) where in the last step we used that k X − δ x k ∞ ≤ and k S − δ s k ∞ ≤ . Since α ∈ [0 , ] thisimplies that k ln( e w (new) ) − ln( e w ) k ∞ ≤ /

20, i.e. w (new) ≈ / w . Consequently, Lemma 4.32yields k f W − δ e w k ∞ ≤ / − (1 / (cid:20)(cid:18) − α (cid:19) Z (cid:13)(cid:13)(cid:13) X − δ x (cid:13)(cid:13)(cid:13) ∞ dt + (cid:18)

12 + α (cid:19) Z (cid:13)(cid:13)(cid:13) S − δ s (cid:13)(cid:13)(cid:13) ∞ dt (cid:21) ≤ k X − δ x k ∞ + k S − δ s k ∞ . (40)Consequently, for the ﬁrst term in K , using k E [( X − δ x ) ] k τ + ∞ ≤ γ , k ( S − δ s ) k τ + ∞ ≤ γ (Lemma 4.19), τ ≥ σ , and (40) we have k E [( f W − δ e w ) ] k σ + ∞ ≤ k E ( X − δ x ) k σ + ∞ + 2 k ( S − δ s ) k σ + ∞ ≤ γ ) + 2(4 γ ) = 32 γ . For the second term in K , we have again by (40) that E [ k f W − | δ e w |k P (2) ( W t A )+ ∞ ] ≤ E [ k X − | δ x |k P (2) ( f W t A )+ ∞ + k S − | δ s |k P (2) ( f W t A )+ ∞ ] ≤ . E [ k X − | δ x |k P (2) ( f WA )+ ∞ + k S − | δ s |k P (2) ( f WA )+ ∞ ]where we used Lemma 4.23, f W t ≈ γ f W and γ < /

80 at the end.Since we sample coordinates i of δ x independently with probability at least 10 σ i ( f WA ),Lemma 4.22 shows that E [ k X − | δ x |k P (2) ( f WA ) ] ≤ E [ k X − | δ x |k P (2) ( f WA ) ] ≤ . · k E [ X − | δ x | ] k σ ( f WA ) Using this, we have E [ k f W − | δ e w |k P (2) ( f W t A )+ ∞ ] ≤ . k E [ X − | δ x | ] k σ + ∞ + 1 . k S − δ s k σ + ∞ ≤ . γ + 2 . γ = 5 γ where we used k X − E [ | δ x | ] k τ + ∞ ≤ γ , k S − δ s k τ + ∞ ≤ γ (Lemma 4.18).Hence, we have K ≤ γ ) + 65(5 γ )(24 γ ) ≤ γ . Here we combine the analysis from Section 4.3.1 and Section 4.3.2 to analyze our short-stepmethod for the LS weight function. We ﬁrst provide two technical lemmas (Lemma 4.25 andLemma 4.26 respectively) from from [LS19] and [BLSS20] that we use in our analysis and thenprovide a sequence of results analogous to those we provided for the short-step method for thelog barrier. In Lemma 4.27 we bound our ﬁrst order expected progress from a step (analogousto Lemma 4.7), in Lemma 4.28 we bound the expected decrease in the potential (analogous toLemma 4.10), and in Lemma 4.29 we give our main centrality improvement lemma (analogousto Lemma 4.11. Finally, in Lemma 4.30 we analyze the feasibility of a step (analagous toLemma 4.12) and put everything together to prove that Algorithm 3 is a valid short stepprocedure in Lemma 4.31 (analogous to Lemma 4.5).40 emma 4.25 (Lemma 47 in [LS19]) . For any non-degenerate A ∈ R m × n , we have k Σ ( A ) − P (2) ( A ) k ∞ ≤ and k Σ ( A ) − P (2) ( A ) k τ ( A ) ≤ . Consequently, k T ( A ) − P (2) ( A ) k ∞ ≤ and k T ( A ) − P (2) ( A ) k τ ( A ) ≤ . Lemma 4.26 (Lemma 26 in [BLSS20]) . In Algorithm 3, k Ψ k τ + ∞ ≤ − α where Ψ def = ((1 − α ) T − Λ − α I ) W − / ( I − Q ) W / , Λ def = Λ ( A ) and Q = AH − A > . Lemma 4.27 (First-order Expected Progress Bound) . In Algorithm 3 we have k X − E [ δ x ] + S − E [ δ s ] − g − T − E [ δ τ ] k τ + ∞ ≤ (1 − α + 36 (cid:15)/α ) γ. Proof.

Let Q def = AH − A > . Note that E [ X − δ x ] = (1 + 2 α )( g − W − / QW / g ) + η x − E [ δ c ]where η x = E [ X − δ x − X − δ x ]. Since X ≈ (cid:15) X , Lemma 4.32 and Lemma 4.24 show that k η x k τ + ∞ ≤ k ( I − X − X ) X − E δ x k τ + ∞ ≤ (cid:15) k X − E δ x k τ + ∞ ≤ (cid:15)γ Further, Lemma 4.18 shows that k δ c k τ + ∞ ≤ (cid:15)γ/α .Similarly, we have S − δ s = (1 − α ) W − / QW / g + η s with k η s k τ + ∞ ≤ (cid:15)γ . Lemma 4.24 shows that T − δ τ = T − Λ ((1 − α ) X − δ x − (1 + 2 α ) S − δ s ) + η τ with k E η τ k τ + ∞ ≤ γ . Combining the three inequalities above, we have E h T − δ τ − X − δ x − S − δ s i = T − Λ ((1 − α ) E [ X − δ x ] − (1 + 2 α ) S − δ s ) + η τ − E [ X − δ x ] − S − δ s = T − Λ ((1 − α )(1 + 2 α )( g − W − / QW / g ) − (1 + 2 α )(1 − α ) W − / QW / g ) − (1 + 2 α )( g − W − / QW / g ) − (1 − α ) W − / QW / g + η τ + (1 − α ) T − Λ η x − η x − (1 + 2 α ) T − Λ η s − η s =(Ψ − I ) g + η with Ψ = ((1 − α ) T − Λ − α ) W − / ( I − Q ) W / ,η = η τ + (1 − α ) T − Λ η x − η x − (1 + 2 α ) T − Λ η s − η s + E [ δ c ] . By Lemma 4.26, we have k Ψ k τ + ∞ ≤ − α . By the fact that k T − Λ k τ + ∞ ≤ k η k τ + ∞ ≤ γ + 24 (cid:15)γ + 11 (cid:15)γ/α ≤ (cid:15)γ/α where we used γ ≤ (cid:15) and α < /

4. Hence, we have k E [ X − δ x − S − δ s − T − δ τ ] − g k τ + ∞ ≤ k Ψ g k τ + ∞ + k η k τ + ∞ ≤ (1 − α + 36 (cid:15)/α ) γ.

41e now have everything to bound the eﬀect of a step in terms of k φ ( w ) k τ + ∞ and k φ ( w ) k τ + ∞ . Lemma 4.28 (Expected Potential Decrease) . In Algorithm 3 we have E [Φ( x (new) , s (new) , µ (new) )] ≤ Φ( x, s, µ ) + φ ( v ) > g + (1 − α + 20 (cid:15)/α ) γ k φ ( v ) k ∗ τ + ∞ + 6000 γ k φ ( v ) k ∗ τ + ∞ . Proof.

We apply Lemma 4.34 with u (1) = x , δ (1) = δ x , c (1) = 1, u (2) = s , δ (2) = δ s , c (2) = 1, u (3) = µ~ δ (3) = δ µ ~ c (3) = − u (4) = τ ( x, s ), δ (4) = δ τ , c (4) = −

1. Note that, v in the contextof Lemma 4.34 is precisely xsµτ and v (new) in the context of Lemma 4.34 is precisely x (new) s (new) µ (new) τ (new) .Now, by Lemma 4.19, the deﬁnition of γ , and (cid:15) ∈ [0 , /

80] we have k S − δ s k ∞ ≤ γ ≤ λ ≤ k c k λ ) . Similarly, Lemma 4.19 implies k X − δ x k ∞ ≤ γ ≤ λ ≤ k c k λ ) . Further, by deﬁnition of r we have µ − | δ µ | ≤ γ ≤ (12(1 + k c k λ ) − ). Thus, Lemma 4.34 implies E h Φ( x (new) , s (new) , µ (new) ) i ≤ Φ( x, s, µ ) + φ ( v ) > V h ( X − E [ δ x ] + S − δ s − µ − δ µ ~ − T − E [ δ τ ]) i + 10 (cid:16) E [ k X − δ x k φ ( v ) ] + k S − δ x k φ ( v ) + k µ − δ µ ~ k φ ( v ) + E h k T − δ τ ] k φ ( v ) i(cid:17) . (41)Now, since v ≈ (cid:15) ~

1, the assumption on δ µ implies | µ − δ µ φ ( v ) > V ~ | ≤ µ − | δ µ | exp( (cid:15) ) k φ ( v ) k ∗ τ + ∞ k ~ k τ + ∞ ≤ (cid:15)γ k φ ( v ) k ∗ τ + ∞ and k µ − δ µ ~ k φ ( v ) = | µ − δ µ | k φ ( v ) k ≤ (cid:0) (cid:15)γ/ √ n (cid:1) k φ ( v ) k ∗ τ + ∞ k k τ + ∞ ≤ γ k φ ( v ) k ∗ τ + ∞ . Applying Lemmas 4.27, 4.19, and 4.24 yields the result.Finally, we can prove the desired bound on E [Φ( x (new) , s (new) , µ (new) )] by combining thepreceding lemma with Lemma 4.36 and bounds on k · k τ + ∞ . Lemma 4.29 (Centrality Improvement of LS Barrier Short-Step) . In Algorithm 3 we have E [Φ( x (new) , s (new) , µ (new) )] ≤ (1 − λr )Φ( x, s, µ ) + exp( λ(cid:15)/ . Proof.

Note that k v − w/τ k ∞ ≤ γ . Consequently, we will apply Lemma 4.36 with δ = γ , c = 1 − α + 20 (cid:15)/α , c = 6000, and U = { x : k x k τ + ∞ ≤ } . Using γ = (cid:15)/ (100 λ ), λ =36 log(400 m/(cid:15) ) /(cid:15) , and (cid:15) = α , we have δ ≤ / (5 λ ), λγ ≤ /

8, and2 λδ + c λγ = 6004 λγ = 60 . (cid:15) ≤ . α ≤ α ≤ − c . Hence, the assumption in Lemma 4.36 is satisﬁed. Further, if k x k ∞ ≤ α √ n then k x k τ + ∞ = C norm k x k τ + k x k ∞ ≤ α √ n (cid:20) α · √ n + 1 (cid:21) ≤ U contains the k · k τ + ∞ ball of radius α/ (16 √ n ). Consequently, applyingLemma 4.36 and using that (1 − c ) ≥ α/ E [Φ( x (new) , s (new) , µ (new) )] ≤ Φ( x, s, µ ) − (1 − c ) λγu x, s, µ ) + m ≤ Φ( x, s, µ ) − α λγ √ n Φ( x, s, µ ) + m . The result follows from exp( λ(cid:15) ) ≥ m and r = (cid:15)γ √ n .Next, we bound the infeasibility after a step (Lemma 4.30). This lemma is analogous toLemma 4.12 and consequently the proof is abbreviated. (See Section 4.2.2 for further intuition.) Lemma 4.30 (Feasibility Bound) . In Algorithm 3 if A > RA ≈ γ A > A then k A > x (new) − b k A > X (new) ( S (new) ) − A ≤ . (cid:15)γ ( µ (new) ) / (42) and consequently, this holds with probability at least − r .Proof. Recall that x (new) = x + δ x where δ x = X ( g − R δ r ), δ r def = (1 + 2 α ) δ p + δ c , and δ p = W − / AH − A > W / g and δ c = µ − / · W − / AH − ( A > x − b ) . Consequently, we can rewrite the step as δ r = W − / AH − d where d def = (1 + 2 α ) A > W / g + µ − / ( A > x − b ) . Now, consider the idealized step x ∗ def = x + δ ∗ x where δ ∗ x def = X ( g − δ ∗ r ), and δ ∗ r = W − / A ( A > A ) − d .Since, A > X = √ µ A > W / we have A > x ∗ = A > x + √ µ A > (cid:16) W / g − A ( A > A ) − d (cid:17) = A > x − ( A > x − b ) = b . Consequently, A > x − b = A > ( x − x ∗ ) = − A > X ( R δ r − δ ∗ r ) = √ µ A > (cid:16) A ( A > A ) − − RAH − (cid:17) d = √ µ (cid:16) I − A > RAH − (cid:17) d . (43)Since A > RA ≈ γ A > A ≈ γ H we have k ( A > A ) − / ( I − A > RAH − ) H / k = k ( A > A ) − / ( H − A > RA ) H − / k ≤ exp( γ ) k H − / ( H − A > RA ) H − / k ≤ γ where we used Lemma 4.32 and (2 γ + 4 γ ) exp( γ ) ≤ γ for γ ≤ /

80. Consequently, combiningwith (25) yields that k A > x (new) − b k ( A > A ) − = √ µ (cid:13)(cid:13)(cid:13) ( A > A ) − / ( I − A > RAH − ) H / H − / d (cid:13)(cid:13)(cid:13) ≤ γ √ µ k d k H − ≤ γ √ µ (cid:16) (1 + 2 α ) k A > W − / g k H − + µ − / · k A > x − b k H − (cid:17) . Next, note that by design of g we have k g k τ ≤ ( γ/C norm ) for C norm = 10 /α . Further, we have W ≈ (cid:15) τ as w ≈ γ w , w ≈ (cid:15) τ , τ ≈ γ τ , and γ ≤ (cid:15) ). Since (cid:15) ≤ /

80 and α ≤ / k A > W − / g k H − ≤ exp( γ/ k W − / g k A ( A > A ) − A > ≤ exp( γ/ k W − / g k exp(5 (cid:15)/ k g k τ ≤ exp(5 (cid:15)/ αγ ≤ γ α . Further, by the approximate primal feasibility of ( x, s, µ ) we have k A > x − b k H − ≤ exp( γ/ k A > x − b k A > XS − A ≤ (cid:15)γ √ µ . Combining yields that k A > x (new) − b k ( A > A ) − ≤ γ √ µ (2 γ + 2 (cid:15)γ ) ≤ γ √ µγ ≤ . (cid:15)γ √ µ . Now, since k S − δ s k ∞ ≤ γ and k X − δ x k ∞ ≤ γ by Lemma 4.19 and γ ≤ (cid:15)/

100 wehave that A > X (new) ( S (new) ) − A ≈ (cid:15) A > XSA . Combining with the facts that A > XSA ≈ γ A > A , and µ (new) ≈ (cid:15)γ µ yields (42). The probability bound follows by the fact that q i ≥ Cσ i ( A ) log( n/ ( (cid:15)r )) γ − and Lemma 4 in [CLM + R corresponds tosampling each row of A with probability at least proportional to its leverage score.Next we prove the main result of this section, that Algorithm 3 is a valid ShortStep procedure (see Deﬁnition 4.3).

Lemma 4.31 (LS Short Step) . Algorithm 3 is a valid

ShortStep procedure (Deﬁnition 4.3)with weight function τ ( x, s ) def = σ ( X / − α S − / − α A ) + nm ~ ∈ R m for all x, s ∈ R m> and parame-ters λ def = 36 (cid:15) − log(400 m/(cid:15) ) , γ def = (cid:15) λ , r def = (cid:15)γ √ n , (cid:15) = α , and α =

14 log( mn ) , i.e.1. λ ≥

12 log(16 m/r ) (cid:15) ,2. A y (new) + s (new) = b for some y (new) ∈ R n ,3. E [Φ( x (new) , s (new) , µ (new) )] ≤ (1 − λr )Φ( x, s, µ ) + exp ( λ(cid:15)/ P (cid:20) k A > x (new) − b k ( A > X (new) ( S (new) ) − A ) − ≤ (cid:15) q µ (new) (cid:21) ≥ − r . Proof.

We provide the proofs of all the four parts as follows:(1) Note that r = (cid:15)γ √ n = (cid:15) λ √ n = (cid:15) β ) √ n for β = m(cid:15) . Thus,12 log(16 m/r ) · (cid:15) − ≤

12 log(16 β / · β )) · (cid:15) − ≤

12 log((400) β ) · (cid:15) − ≤ λ where we used that √ x log x ≤ x for x ≥

1, (400) ≥ · · β ≥ δ s = AH − A > W / g . Consequently, δ s ∈ im( A ) and this follows from thefact that A y + s = b for some y ∈ R n by deﬁnition of (cid:15) -centered.(3) This follows immediately from Lemma 4.29.(4) This follows immediately from Lemma 4.30. Here we give general mathematical facts and properties of the potential function we usedthroughout this section in analyzing our IPM. First, we give a simple, standard technical lemmaallowing us to relate diﬀerent notions of multiplicative approximation Lemma 4.32. Then, inthe remainder of this section we give facts about the potential function given in Deﬁnition 4.2.

Lemma 4.32. If M ≈ (cid:15) N for symmetric PD M , N ∈ R n × n and (cid:15) ∈ [0 , / then k N − / ( M − N ) N − / k ≤ (cid:15) + (cid:15) . Consequently, if x ≈ (cid:15) y for x, y ∈ R > and (cid:15) ∈ [0 , / then k X − y − ~ k ∞ ≤ (cid:15) + (cid:15) . roof. These follow from the fact that for all α ∈ R with | α | ≤ / α ≤ exp( α ) ≤ α + α and therefore | exp( α ) − | ≤ | α | + | α | . Lemma 4.33.

For all w ∈ R we have ≤ φ ( w ) ≤ λ · ( | φ ( w ) | + 2 λ ) and for all z , z with | z − z | ≤ δ we have φ ( z ) ≤ exp( λδ ) φ ( z ) .Proof. Direct calculation shows that φ ( w ) = λ · (exp( λ ( w − − exp( λ (1 − w ))) and φ ( w ) = λ · (exp( λ ( w − λ (1 − w ))) . and therefore, clearly φ ( w ) ≥ φ ( w ) follow from (cid:12)(cid:12) φ ( w ) (cid:12)(cid:12) = λ · (exp( λ | w − | ) − exp( − λ | w − | ))= 1 λ φ ( w ) − λ exp( − λ | w − | )) ≥ λ φ ( w ) − λ . The bound on φ ( z ) is then immediate from φ ( z ) = λ · (exp( λ ( z − )) + exp( λ (1 − z ))). Lemma 4.34.

For all t ∈ [0 ,

1] let z t ∈ R n be deﬁned for all i ∈ [ n ] by [ z t ] i def = Q j ∈ [ k ] ( u ( j ) i + t · δ ( j ) i ) c j and f ( t ) def = Φ( z t ). Note that z = v and z = v (new) and consequently f (1) = Φ( v (new) ) and f (0) def = Φ( v ). Further, by Taylor’s theorem, f (1) = f (0) + f (0) + f ( ζ ) for some ζ ∈ [0 , z t ≈ / (11 λ ) z , compute f (0) and upper bound f ( ζ ).First, by direct calculation, we know that for all i ∈ [ n ] and t ∈ [0 , ddt [ z t ] i = X j ∈ [ k ]  Y j ∈ [ k ] \ j ( u ( j ) i + t · δ ( j ) i ) c j  · (cid:18) ddt ( u ( j ) i + t · δ ( j ) i ) c j (cid:19) = X j ∈ [ k ] h(cid:16) [ z t ] i · ( u ( j ) i + t · δ ( j ) i ) − c j (cid:17) · c j ( u ( j ) i + t · δ ( j ) i ) c j − · δ ( j ) i i = [ z t ] i X j ∈ [ k ] c j · δ ( j ) i u ( j ) i + t · δ ( j ) i and d dt [ z t ] i = (cid:20) ddt [ z t ] i (cid:21) X j ∈ [ k ] c j · δ ( j ) i u ( j ) i + t · δ ( j ) i − [ z t ] i X j ∈ [ k ] c j ( δ ( j ) i ) ( u ( j ) i + t · δ ( j ) i ) . Thus, Z t def = Diag ( z t ), u ( j ) t def = u ( j ) + tδ ( j ) , U ( j ) t def = Diag ( u ( j ) t ), and e δ ( j ) t def = [ U ( j ) t ] − δ ( j ) satisfy ddt z t = Z t X j ∈ [ k ] c j e δ ( j ) t and d dt z t = Z t  X j ∈ [ k ] c j e δ ( j ) t  − X j ∈ [ k ] c j [ e δ ( j ) t ]  (44)45here for a ∈ R n we deﬁne a ∈ R n with [ a ] i def = [ a i ] .Applying chain rule and (44) yields that f ( t ) = X i ∈ [ n ] φ ([ z t ] i ) ddt [ z t ] i = φ ( z t ) > Z t X j ∈ [ k ] c j e δ ( j ) t . Since z = v , U ( j )0 = U j , and e δ ( j )0 = U − j δ ( j ) we have f (0) = P j ∈ [ k ] c j φ ( v ) > VU − δ ( j ) asdesired.Further application of chain rule and (44) yields that f ( t ) = X i ∈ [ n ] φ ([ z t ] i ) (cid:18) ddt [ z t ] i (cid:19) + X i ∈ [ n ] φ ([ z t ] i ) d dt z t = X i ∈ [ n ] φ ([ z t ] i ) ·  Z t X j ∈ [ k ] c j e δ ( j ) t  i + X i ∈ [ n ] φ ([ z t ] i ) ·  [ z t ] i  X j ∈ [ k ] c j e δ ( j ) t  i − [ z t ] i X j ∈ [ k ] [ e δ ( j ) t ] i  = (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) X j ∈ [ k ] c j e δ ( j ) t (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) Z t φ ( z t ) Z t + Z t φ ( z t ) − X j ∈ [ k ] c j (cid:13)(cid:13)(cid:13)e δ ( j ) t (cid:13)(cid:13)(cid:13) Z t φ ( z t ) . Now, since by Cauchy Schwarz for all i ∈ [ n ] we have X j ∈ [ k ] h c j e δ ( j ) t i i ≤  X j ∈ [ k ] | c j |  ·  X j ∈ [ k ] | c j | · he δ ( j ) t i j  we have that f ( t ) ≤ k c k X j ∈ [ k ] | c j | (cid:13)(cid:13)(cid:13)e δ ( j ) t (cid:13)(cid:13)(cid:13) Z t φ ( z t ) Z t + Z t | φ ( z t ) | − X j ∈ [ k ] c j (cid:13)(cid:13)(cid:13)e δ ( j ) t (cid:13)(cid:13)(cid:13) Z t φ ( z t ) ≤ k c k X j ∈ [ k ] | c j | (cid:13)(cid:13)(cid:13)e δ ( j ) t (cid:13)(cid:13)(cid:13) Z t φ ( z t ) Z t + ( k c k + 1) X j ∈ [ k ] | c j | · (cid:13)(cid:13)(cid:13)e δ ( j ) t (cid:13)(cid:13)(cid:13) Z t | φ ( z t ) | ≤ ( k c k + 1) X j ∈ [ k ] | c j | · k e δ ( j ) t k Z t + Z t ) φ ( z t ) (45)where in the last line we used that φ ( z t ) ≥ λ | φ ( z t, ) | ≥ | φ ( z t ) | by Lemma 4.33 and λ ≥ v , v (new) ,u ( j ) , and u ( j ) t are all entrywisepositive. Since k U − j δ ( j ) k ∞ ≤ λ k c k ) for all j ∈ [ k ] we have that for all t ∈ [0 ,

1] and i ∈ [ n ][ u ( j ) t ] i ≤ (cid:18) − λ k c k ) (cid:19) −| c j | [ u ( j ) t ] ≤ exp (cid:18) | c j | λ k c k ) (cid:19) [ u ( j ) t ] where we used that 1 − x ≤ exp( − x ) for all x yields and for all x ∈ (0 ,

1) and c ≥ − x ) − c ≥ max { (1 + x ) c , (1 + x ) − c , (1 + x ) − c } . Consequently, U ( j ) t (cid:22) exp( λ ) U and forall t ∈ [0 ,

1] and i ∈ [ n ] we have[ z t ] i ≤ Y j ∈ [ k ] [ u ( j ) t ] i ≤ Y j ∈ [ k ] exp (cid:18) | c j | λ k c k ) (cid:19) [ u ( j ) ] i ≤ exp( 111 λ )[ z ] i . Thus, Z t (cid:22) exp( λ ) V and by analogous calculation U ( j ) t (cid:23) exp( − λ ) U ( j ) and Z t (cid:23) exp( − λ ) V .By the preceding paragraph, u ( j ) t ≈ / (11 λ ) u ( j ) and z t ≈ / (11 λ ) v . Further, v ≤ andLemma 4.32 implies k z t − v k ∞ = k ( V − Z t − ~ V k ≤ k V − Z t − ~ k k V k ≤ λ + (cid:18) λ (cid:19) ! ≤ . λ . φ ( z t ) ≤ exp(0 . φ ( v ). Combining these factswith (45) then yields that f ( t ) ≤ ( k c k + 1) X j ∈ [ k ] | c j | · k U − δ ( j ) k φ ( v ) · (cid:20) exp (cid:18) λ (cid:19) (cid:21) + (cid:20) exp (cid:18) λ (cid:19) (cid:21)! exp (cid:18) λ (cid:19) exp(0 . λ ≥ f (1) = f (0) + f (0) + f ( ζ ) for some ζ ∈ [0 , Lemma 4.35 (Lemma 54 of [LS19]) . For all x ∈ R m , we have e λ k x k ∞ ≤ Φ( x ) ≤ me λ k x k ∞ and λ Φ( x ) − λm ≤ k∇ Φ( x ) k (46) Furthermore, for any symmetric convex set U ⊆ R m and any x ∈ R m , let x [ def = argmax y ∈ U h x, y i and k x k U def = max y ∈ U h x, y i . Then for all x, y ∈ R m with k x − y k ∞ ≤ δ ≤ λ we have e − λδ k∇ Φ( y ) k U − λ k∇ Φ( y ) [ k ≤ D ∇ Φ( x ) , ∇ Φ( y ) [ E ≤ e λδ k∇ Φ( y ) k U + λe λδ k∇ Φ( y ) [ k . (47) If additionally U is contained in a ‘ ∞ ball of radius R then e − λδ k∇ Φ( y ) k U − λmR ≤ k∇ Φ( x ) k U ≤ e λδ k∇ Φ( y ) k U + λe λδ mR. (48) Lemma 4.36.

Let U ⊂ R m be an axis-symmetric convex set which is contained in the ‘ ∞ ball of radius and contains the ‘ ∞ ball of radius u ≤ , i.e. x ∈ U implies k x k ∞ ≤ andfor some k x k ∞ ≤ u implies x ∈ U . Let w, v ∈ R m such that k w − v k ∞ ≤ δ ≤ λ . Let g = − γ argmax z ∈ U h∇ Φ( v ) , z i and k h k U def = max z ∈ U h h, z i with γ ≥ and δ Φ def = φ ( w ) > g + c γ k φ ( w ) k U + c γ k φ ( w ) k U for some c , c ≥ satisfying λγ ≤ , λδ + c λγ ≤ . − c ) . Then, δ Φ ≤ − . − c ) λγu Φ( w ) + m Proof.

Since entrywise | φ ( w ) | ≤ λ · ( | φ ( w ) | + 2 λ ) by Lemma 4.33, we have that k φ ( w ) k U ≤ λ k φ ( w ) k U + 2 λ k k U ≤ λ k φ ( w ) k U + 2 λ m where in the ﬁrst step we used triangle inequality and the axis-symmetry of U and in the secondstep we used that U is contained in a ‘ ∞ ball of radius 1.Now, we bound the term φ ( w ) > g . Applying (47) and (48) in Lemma 4.35, we have − φ ( w ) > gγ ≥ e − λδ k φ ( v ) k U − λ k g k ((47) of Lemma 4.35) ≥ e − λδ k φ ( w ) k U − e − λδ λm − λ k g k ((48) Lemma 4.35) ≥ e − λδ k φ ( w ) k U − λm. ( U is contained in a ball of radius 1)Hence, as e − λδ ≥ − λ , we have δ Φ ≤ − (1 − λδ − c − c λγ ) γ k φ ( w ) k U + 2 λmγ + 2 c γ λ m ≤ − . − c ) γ k φ ( w ) k U + m/ λγ ≤ / λδ + c λγ ≤ − c . Since U contains the ‘ ∞ ball of radius u , wehave k φ ( w ) k U ≥ u k φ ( w ) k . Using this and (46) in Lemma 4.35 yields δ Φ ≤ − . − c ) γu k φ ( w ) k + 0 . m ≤ − . − c ) λγu Φ( w ) + λγmu + 0 . m and the result follows from the facts that λγ ≤ / u ≤ A set U ⊂ R m is axis-symmetric if x ∈ U implies y ∈ U for all y ∈ R m with y i ∈ {− x i , x i } for all i ∈ [ m ]. .5 Additional Properties of our IPMs Here we provide additional properties of our IPMs. First, we show that although the x iteratesof Algorithm 3 may change more than in [BLSS20] in a single step due to sampling, they donot drift too far away from a more stable sequence of points. We use this lemma to eﬃcientlymaintain approximations to regularized leverage scores, which are critical for this method. Lemma 4.37.

Suppose that q i ≥ mn Tβ log( mT ) · (( δ r ) i / k δ r k + 1 /m ) for all i ∈ [ m ] where β ∈ (0 , . and T ≥ √ n . Let ( x ( k ) , s ( k ) ) be the sequence of points found by Algorithm 3. Withprobability − m − , there is a sequence of points b x ( k ) from ≤ k ≤ T such that1. b x ( k ) ≈ β x ( k ) ,2. k ( b X ( k ) ) − ( b x ( k +1) − b x ( k ) ) k τ ( b x ( k ) ,s ( k ) ) ≤ γ ,3. k T ( b x ( k ) , s ( k ) ) − ( τ ( b x ( k +1) , s ( k +1) ) − τ ( b x ( k ) , s ( k ) )) k τ ( b x ( k ) ,s ( k ) ) ≤ γ .Proof. We deﬁne the sequence b x ( k ) recursively by b x (1) = x (1) andln b x ( k ) def = ln b x ( k − + E k [ln x ( k ) ] − ln x ( k − where E k [ln x ( k ) ] def = E R k [ln x ( k ) | x ( k − ] where R k is the random choice of R at the k -th iteration.Let ∆ ( k ) def = ln b x ( k ) − ln x ( k ) and note that this implies that∆ ( k ) = ∆ ( k − + E k [ln x ( k ) ] − ln x ( k ) . (49)Consequently, is a martingale, i.e. E k [∆ ( k ) ] = ∆ ( k − . Since b x (1) = x (1) , we have ∆ (1) = 0 andconsequently to bound ∆ ( k ) , it suﬃces to show that ∆ ( k ) − ∆ ( k − is small in the worst caseand that the variance of ∆ ( k ) − ∆ ( k − is small.For the worst case bound, note that by (49) and Jensen’s inequality (cid:12)(cid:12)(cid:12) ∆ ( k ) i − ∆ ( k − i (cid:12)(cid:12)(cid:12) = (cid:12)(cid:12)(cid:12) ln x ( k ) i − E k ln x ( k ) i (cid:12)(cid:12)(cid:12) ≤ sup x ( k ) ,y ( k ) (cid:12)(cid:12)(cid:12) ln x ( k ) i − ln y ( k ) i (cid:12)(cid:12)(cid:12) (50)where x ( k ) i and y ( k ) i are two independent samples using two diﬀerent independent R k to take astep from x ( k − . Lemma 4.19 shows that x ( k ) ≈ . x ( k − and y ( k ) ≈ . x ( k − . Consequently, x ( k ) ≈ . y ( k ) . Therefore, we have (cid:12)(cid:12)(cid:12) ∆ ( k ) i − ∆ ( k − i (cid:12)(cid:12)(cid:12) ≤ | x ( k ) i − y ( k ) i | x ( k − i . Using that x ( k ) − y ( k ) = XR x,k δ r − XR y,k δ r (where R x,k and R y,k are two diﬀerent independent draws of R in iteration k ) and x ≈ . x , we can simplify (50) to (cid:12)(cid:12)(cid:12) ∆ ( k ) i − ∆ ( k − i (cid:12)(cid:12)(cid:12) ≤ | ( R δ r ) i − ( R δ r ) i | . Recall that the i -th row of δ r is sampled with probability e q i def = min( q i , i -th rowsampled with probability 1, then the diﬀerence is 0. Hence, we can only consider the case e q i = q i ≥ mn Tβ log( mT ) · (( δ r ) i / k δ r k + 1 /m ). In this case, we have (cid:12)(cid:12)(cid:12) ∆ ( k ) i − ∆ ( k − i (cid:12)(cid:12)(cid:12) ≤ q i | δ r | i ≤ q i | δ r | i = 2 | δ r | imn Tβ log( mT ) · (( δ r ) i / k δ r k + 1 /m ) . One can check the worst case is given by | δ r | i = k δ r k / √ m and we get (cid:12)(cid:12)(cid:12) ∆ ( k ) i − ∆ ( k − i (cid:12)(cid:12)(cid:12) ≤ q i | δ r | i ≤ q i | δ r | i = β n √ mT log( mT ) k δ r k M def = √ nγβ T log( mT ) . where we used τ ( x ( k ) , s ( k ) ) ≥ nm and k δ r k τ ( x ( k ) ,s ( k ) ) ≤ γ (31) at the last sentence.For the variance bound, note that variance of a random variable X can be written as E ( X − E X ) = E ( X − Y ) where Y is an independent sample from X . Hence, we have E k h [∆ ( k ) − ∆ ( k − ] i i = E k h [ln x ( k ) − E k ln x ( k ) ] i i = 12 E k h [ln x ( k ) − ln y ( k ) ] i i ≤ E k " ( x ( k ) i − y ( k ) i ) ( x ( k − i ) ≤ E k h ( R x,k δ r − R y,k δ r ) i i where we used that x ( k ) ≈ . y ( k ) , x ( k ) − y ( k ) = XR x,k δ r − XR y,k δ r and x ≈ . x as before atthe last line. Again, if the i -th row sampled with probability 1, then the diﬀerence is 0. Hence,the last line is bounded by E k h [∆ ( k ) − ∆ ( k − ] i i ≤ q i [ δ r ] i ≤ n k δ r k mT log( mT ) ≤ σ def = γ β T log( mT )where we used q i ≥ mn Tβ log( mT ) · ( δ r ) i / k δ r k , τ ( x ( k ) , s ( k ) ) ≥ nm and k δ r k τ ( x ( k ) ,s ( k ) ) ≤ γ (31) atthe end.Now, we apply Bernstein inequality and get that P ( | ∆ ( k ) i | ≤ β for any k ∈ [ T ]) ≤ T exp( − β / σ T + M β )= 2 T exp( − β γ β T log( mT ) T + · √ nγβ T log( mT ) · β ) ≤ T exp( −

25 log( mT )) ≤ m . This completes the proof of b x ( k ) ≈ β x ( k ) for all k ≤ T .For the Property 2, we note that b x ( k +1) ≈ . b x ( k ) ≈ . x ( k ) and hence k ( b X ( k ) ) − ( b x ( k +1) − b x ( k ) ) k τ ( b x ( k ) ,s ( k ) ) ≤ k ln b x ( k +1) − ln b x ( k ) k τ ( x ( k ) ,s ( k ) ) = 2 k E ln x ( k +1) − ln x ( k ) k τ ( x ( k ) ,s ( k ) ) . Using that x ( k +1) ≈ . x ( k ) , we haveln x ( k +1) i = ln x ( k ) i + (( X ( k ) ) − ( x ( k +1) − x ( k ) )) i ± (( X ( k ) ) − ( x ( k +1) − x ( k ) )) i for all i and hence k ( b X ( k ) ) − ( b x ( k +1) − b x ( k ) ) k τ ( b x ( k ) ,s ( k ) ) ≤ k ( X ( k ) ) − ( E x ( k +1) − x ( k ) ) k τ ( x ( k ) ,s ( k ) ) + 2 k E (( X ( k ) ) − ( x ( k +1) − x ( k ) )) k τ ( x ( k ) ,s ( k ) ) . Lemma 4.18 shows that the ﬁrst term is bounded by 2 γ and Lemma 4.19 shows that the secondterm bounded by 12 γ . Hence, the whole term is bounded by 5 γ .For the Property 3, we let δ τ = τ ( b x ( k +1) , s ( k +1) ) − τ ( b x ( k ) , s ( k ) ). By Lemma 4.21, we have δ τ = Λ (1 − α ) b X − δ b x + 6 η where Λ = Λ ( b X / − α S − / − α A ) and | η i | ≤ σ i ( e > i Σ − P (2) b X − δ b x ) + e > i ( Σ + P (2) )( b X − δ b x ) . k T − P (2) k τ ≤ k T − Λ k τ ≤ k T − P (2) k ∞ ≤ k T − δ τ k τ ≤ k T − Λ b X − δ b x k τ + 6 k T − Σ − ( P (2) b X − δ b x ) k τ + 6 k T − ( Σ + P (2) )( b X − δ b x ) k τ ≤ k b X − δ b x k τ + 6 k ( T − P (2) b X − δ b x ) k τ + 12 k ( b X − δ b x ) k τ ≤ k b X − δ b x k τ + 6 k T − P (2) b X − δ b x k ∞ · k T − P (2) b X − δ b x k τ ≤ k b X − δ b x k τ + 6 k b X − δ b x k ∞ · k b X − δ b x k τ ≤ · γ + 6 · · γ where we used k b X − δ b x k τ ≤ γ (property 2) and k b X − δ b x k ∞ ≤ (since x ( k ) ≈ . x ( k − ).Our IPM maintains a point x that is nearly feasible, that is a point with small norm k A > x − b k A > XS − A ) − . We now show that there is a truly feasible x with x i ≈ x i for all i . Lemma 4.38.

Let ( x, s ) be a primal dual pair with xs ≈ τ ( x, s ) · µ for µ > , where x is notnecessarily feasible. Let H − ≈ δ ( A > XS − A ) − for some δ ∈ [0 , / . Then the point x := x + XS − AH − ( b − A > x ) satisﬁes both k A > x − b k A > XS − A ) − ≤ δ · k A > x − b k A > XS − A ) − , and (51) k X − ( x − x ) k ∞ ≤ µ k A x − b k A > XS − A ) − . (52) Proof.

Let δ x = XS − A ( b − A > x ). We start by proving (51). k A > x − b k A > XS − A ) − = k A > ( x + δ x ) − b k A > XS − A ) − = k ( I − A > XS − AH − )( A > x − b ) k A > XS − A ) − By using H − ≈ δ ( AXS − A ) − we can bound( I − H − A > XS − A )( A > XS − A ) − ( I − A > XS − AH − )= ( A > XS − A ) − − H − + H − A > XS − AH − ≺ (1 − − δ ) + exp(2 δ )) · ( A > XS − A ) − ≺ δ ( A > XS − A ) − . This then implies k A > x − b k A > XS − A ) − ≤ δ k A > x − b k A > XS − A ) − , so we have proven (51). Next, we prove (52). k X − δ x k ∞ = max i ∈ [ m ] k e > i S − AH − ( b − A > x ) k ∞ ≤ max i ∈ [ m ] k e > i S − AH − ( b − A > x ) k ≤ max i ∈ [ m ] k e > i S − AH − / k k H − / ( b − A > x ) k ≤ exp(2 δ ) · max i ∈ [ m ] k e > i S − A ( A > XS − A ) − / k k b − A > x k ( A > XS − A ) − , H − . The ﬁrst norm of above term can be bounded by k e > i S − A ( A > XS − A ) − / k = e > i XS ) / X / S / A ( A > XS − A ) − A > X / S / XS ) / e i = σ ( X / S / ) i x i s i ≤ σ ( X / − α S / α ) i x i s i ≤ τ ( x, s ) x i s i ≤ µ , where the last step uses xs ≈ µ · τ ( x, s ) and σ ( x, s ) ≤ τ ( x, s ) for both possible choices of τ (i.e.the IPM of Section 4.2 uses τ ( x, s ) = 1 and Section 4.3 uses τ ( x, s ) = σ ( x, s ) + n/m ). By using δ ≤ / k X − δ x k ∞ ≤ µ · k A x − b k A > XS − A ) − . In the remaining of this section, we bound how much the iterates of our IPMs move from theinitial point. Our proof relies on bounds on how fast the center of a weighted self-concordantbarrier moves. In particular we use the following lemma from [LS19] specialized to the weightedlog barrier. Lemma 4.39 (Special Case of [LS19] Lemma 67) . For arbitrary A ∈ R n × m , b ∈ R m , A ∈ R n × k , b ∈ R k , b ∈ R n and all w ∈ R k> , let Ω = { x ∈ R n : A > x = b , A > x > b } let x w def = arg min x ∈ Ω b > x − X i ∈ [ k ] w i log([ A > x − b ] i ) . For all w (0) , w (1) ∈ R k> , it holds that x w (0) + t ( x w (1) − x w (0) ) ∈ Ω for all t ∈ ( − γ, γ ) where γ = θ θ and θ = min i ∈ [ k ] min { w (0) i , w (1) i } P i ∈ [ k ] | w (0) i − w (1) i | . Using the lemma above, we can bound how the fast primal-dual central path moves byapplying the lemma on both primal and dual separately.

Lemma 4.40.

Fix arbitrary non-degenerate A ∈ R m × n , b ∈ R n , and c ∈ R m . For all w ∈ R m> ,we deﬁne ( x w , s w ) ∈ R m> be the unique vector such that X w s w = w , A > x w = b , A y + s w = c for some y . For any w (0) , w (1) ∈ R m> , we have that entrywise κ x w (1) ≤ x w (0) ≤ κx w (1) and κ s w (1) ≤ s w (0) , ≤ κs w (1) where κ = 4( W /W min ) for W def = k w (0) k + k w (1) k and W min def = min i ∈ [ k ] min { w (0) i , w (1) i } .Proof. Apply Lemma 4.39 with A = A , b = b , A = I , b = 0 and b = c so that x w is asdeﬁned in Lemma 4.39. Lemma 4.39 implies that x w (0) + t ( x w (1) − x w (0) ) > t ∈ ( − γ, γ )with γ and θ deﬁned as in Lemma 4.39. Consequently, γγ x w (1) ≥ x w ( ≥ γ γ x w (1) . Now, θ ≥ W min W as | w (0) i − w (1) i | ≤ | w (0) i | + | w (1) i | . Since θ / (1 + 2 θ ) increases monotonically for positive θ and W min W ≤ γ ≥ ( W min W ) . Similarly, since γ/ (1 + γ ) increases monotonically forpositive γ and ( W min W ) ≤ we have γ γ ≥ ( W min W ) ≥ κ . Thus, κ x w (0) ≤ x w (1) ≤ κx w (0) . This following Lemma 4.39 follows as a special case of Lemma 67 of [LS19] by choosing the φ i ( x ) in that lemmato be φ i def = − ln([ A > x − b ] i ) noting that each φ i is a 1-self-concordant barrier for the set { x ∈ R n | [ A > x ] i ≥ [ b ] i } . s w (0) and s w (1) we apply Lemma 4.39 with A = 0, b = 0, A = A > , b = c and b = b so that s w is as deﬁned in Lemma 4.39. Lemma 4.39 implies that s w (0) + t ( s w (1) − s w (0) ) > t ∈ ( − γ, γ ). Since this is the same statement that wasshown for x w (0) and x w (1) in the preceding paragraph we also have κ s w (0) ≤ s w (1) ≤ κs w (0) .This allows us to bound the number of bits of precision required to appropriately representall x , s as they occur throughout Algorithm 1, our main path following routine. Lemma 4.41.

Let x (init) , s (init) , µ (init) , µ (end) be the initial parameters for Algorithm 1 as inLemma 4.4. Let W be a bound on the ratio of largest to smallest entry in both x (init) and s (init) .Let W be a bound on the ratio of largest to smallest entry for all x and s at each Line 7 and thealgorithm’s termination. Then log W = e O (log W + | log µ (init) /µ (end) | ) for τ = τ LS and τ = τ log .Proof. Let ( x, s, µ ) be the state of the algorithm at either Line 7 or the algorithm’s termination.Lemma 4.4 implies that ( x, s, µ ) is (cid:15) -centered. Consequently, √ µ k A > x − b k ( AXS − A ) − = O (1).By Lemma 4.38 this implies there is a feasible x with x ≈ O (1) x where for w def = x s and x w , s w as deﬁned in Lemma 4.40 we have x w = x and s w = w . Further, the deﬁnition of (cid:15) -centeredimplies that w = x s ≈ O (1) xs ≈ O (1) µτ ( x, s ). Now, for τ ( x, s ) = τ log ( x, s ) = ~ µ · τ ( x, s ) ≥ min { µ (init) , µ (end) } and k µ · τ ( x, s ) k ≤ m · max { µ (init) , µ (end) } . and for τ ( x, s ) = τ LS ( x, s ) = σ ( x, s ) + nm ~ µ · τ ( x, s ) ≥ nm · min { µ (init) , µ (end) } and k µ · τ ( x, s ) k ≤ n · max { µ (init) , µ (end) } . In either case, applying Lemma 4.40 with κ = O  m · max { µ (init) , µ (end) } min { µ (init) , µ (end) } !  = O  m · max ( µ (init) µ (end) , µ (end) µ (init) )!  yields that 1 κ x w (init) ≤ x w ≤ κx w (init) and 1 κ s w (init) ≤ s w ≤ κs w (init) . Since the largest ratio of the entries in x (init) and s (init) is bounded by W this implies that W with log W = e O (log W + | log µ (init) /µ (end) | ) is an upper bound on the largest ratio of entries in x and s as desired. 52 Heavy Hitters

This section concerns maintaining information about a matrix-vector product A h for an in-cidence matrix A ∈ {− , , } m × n undergoing row scaling . Formally, let g ∈ R m . Note that Diag ( g ) A is the matrix obtained by scaling the i th row of A by a factor of g i . We want to beable to update entries of g and compute certain information about Diag ( g ) A h for query vector h ∈ R n . The data structure constructed in this section will be used in Section 6 to create adata structure for eﬃciently maintaining the slack of the dual solution inside our IPM method.If A is an incidence matrix representing some graph G , then Diag ( g ) A h ∈ R m can be viewedas a vector of values of edges, deﬁned as follows. View h ∈ R n as a vector of potentials on nodesand view g ∈ R m ≥ as a vector of edge weights (node v ∈ [ n ] has potential h v and edge e ∈ [ m ]has weight g e ). Deﬁne the value of each edge e = { u, v } as | ( Diag ( g ) A h ) e | = g e | h ( v ) − h ( u ) | .We would like to maintain certain information about the edge values when the potential andedge weights change.The data structure of this section (Lemma 5.1) can report edges with large absolute values,i.e. large | ( Diag ( g ) A h ) e | . It supports updating g e for some e ∈ [ m ] ( Scale operation). More-over, given h ∈ R n and (cid:15) ≥

0, it reports all edges (equivalently all entries in

Diag ( g ) A h ) withvalues at least (cid:15) ( HeavyQuery operation). Additionally, it can sample each edge with proba-bility roughly proportional to its value square ( Sample operation). Thus, each edge e = { u, v } is selected with probability roughly ( g e ( h u − h v )) k Diag ( g ) A h k . Our data structure requires near-linear preprocessing time and polylogarithmic time per

Scale and

Sample operation. It takes roughly O ( k Diag ( g ) A h k (cid:15) − + n ) time to answer aquery. (To make sense of this time complexity, note that reading the input vector h takes O ( n )time, and there can be at most and as many as k Diag ( g ) A h k (cid:15) − edges reported: For any x , x , . . . , x m , there can be at most and as many as ( P i x i ) /(cid:15) indices i such that x i ≥ (cid:15) .)Note that the data structure is randomized but holds against an adaptive adversary. It canbe made deterministic at the cost of m o (1) preprocessing time and n o (1) time per Scale (asopposed to polylog( n )), using [CGL + Diag ( g ) A h and to sample entries of the vector proportional to the squared values. Formally,our data structure has the following guarantees. Lemma 5.1 ( HeavyHitter ) . There exists a data structure

ProjectionHeavyHitter thatsupports the following operations. • Initialize( A ∈ {− , , } m × n , g ∈ R m ≥ ) : The data structure is given the edge incidencematrix A of a graph G and a scaling vector g . It initializes in O ( m log n ) time. • Scale ( i ∈ [ m ] , s ≥ : Updates g i ← s in O (log n ) amortized time. • HeavyQuery( h ∈ R n , (cid:15) ∈ R > ) : With high probability, the data structure returns all e ∈ [ m ] such that | ( Diag ( g ) A h ) e | ≥ (cid:15) in running time O ( k Diag ( g ) A h k (cid:15) − log n + n log n log W ) , where W is the ratio of the largest to the smallest non-zero entries in g . • Sample ( h ∈ R n , K ∈ R > ) : With high probability, in O ( K log n + n log n log W ) time thedata-structure returns independently sampled indices of Diag ( g ) A h ∈ R m (i.e., edges ingraph G ), where each edge e = ( u, v ) is sampled with some probability q e which is at least min ( K · ( g e ( h u − h v )) k Diag ( g ) A h k log n , ) , and with high probability there are at most O ( K log n ) entries returned. Probability ( I ⊂ [ m ] , h ∈ R n , K ∈ R > ) : Given a subset of edges I , this procedurereturns for every e ∈ I the probabilities q e that e would be sampled in the procedure Sample ( h, K ) . The running time is O ( | I | + n log n log W ) . • LeverageScoreSample ( K ∈ R > ) : With high probability, in O ( K n log n log W ) time LeverageScoreSample returns a set of sampled edges where every edge e is in-cluded independently with probability p e ≥ K · σ ( Diag ( g ) A ) e , and there are at most O ( K n log n log W ) entries returned. • LeverageScoreBound ( K , I ⊂ [ m ]) : Given a subset of edges I , this procedure re-turns for every e ∈ I the probabilities p e that e would be sampled in the procedure LeverageScoreSample ( K ) . The running time is O ( | I | ) . Our data structure exploits a classic spectral property of a graph, which is captured by thefollowing simple known variant of Cheeger’s inequality (see, e.g. [CGP +

18] for its generalization).

Lemma 5.2.

Let A ∈ R m × n be an incidence matrix of an unweighted φ -expander and let D ∈ R n × n the degree matrix. Let L = A > A be the corresponding Laplacian matrix, then forany y ∈ R n such that y ⊥ D ~ n , we have y > L y ≥ . · y > D y · φ . Proof.

Cheeger’s inequality says that φ ≤ λ ( D − / LD − / ) (see e.g. [ST11]). Since D − / LD − / is PSD and D / ~ φ ≤ λ ( D − / LD − / ) = min x ⊥ D / ~ x > D − / LD − / xx > x = min y ⊥ D ~ x > L xx > D x . To see the intuition of why we can eﬃciently implement the

HeavyQuery and

Sample operations, consider for the simple case when there is no scaling (i.e. g = ~

1) and the graph G is a φ G -expander. Since ( A h ) e = h u − h v for e = ( u, v ), we can shift h by any constantvector without changing A h , so we can shift h to make h ⊥ D ~ n . For any edge e = ( u, v ) to have | h u − h v | ≥ (cid:15) , at least one of | h u | and | h v | has to be at least (cid:15)/

2. Thus, it suﬃces to check theadjacent edges of a node u only when | h u | is large, or equivalently h u (cid:15) − is at least 1 /

4. Sincechecking the adjacent edges of any u takes time deg( u ), the time over all such nodes is boundedby P u deg( u ) h u (cid:15) − , which is O ( h > D h(cid:15) − ). By Lemma 5.2 this is O ( h > L hφ − G (cid:15) − ), and note h > L h = k A h k . The intuition behind Sample is similar. Since we can approximate k A h k (inthe denominator of edge probabilities) within a factor of φ G by using h > D h , and ( h u − h v ) (inthe numerator) is at most 2( h u + h v ), this allows us to work mostly in the n -dimensional nodespace instead of the m -dimensional edge space.The above discussion would give us the desired result when g is a constant vector and φ G is large, i.e. Ω(log − c n ) for some small constant c . To extend the argument to the general casewith scaling vector g and graph of smaller conductance, we simply partition the edges of G todivide it into subgraphs, where each induced subgraph has roughly the same g values on itsedges and large conductance. For the ﬁrst part (i.e., roughly constant g ), we can bucket theedges by their g values and move edges between buckets when their g values get updated. Toget subgraphs with large conductance, we utilize a dynamic expander decomposition algorithmstated in Lemma 5.3 as a blackbox to further partition the edges of each bucket into inducedexpanders and maintain these expanders dynamically as edges move between buckets. We usethe dynamic expander decomposition described in [BBG +

20] (building on tools developed fordynamic minimum spanning tree [SW19, CGL +

20, NSW17, NS17, Wul17], especially [SW19]).54 emma 5.3 (Dynamic expander decomposition, [BBG + . For any φ = O (1 / log n ) thereexists a dynamic algorithm against an adaptive adversary, that preprocesses an unweighted graph G with m edges in O ( φ − m log n ) time (or O (1) time, if the graph is empty). The algorithm canthen maintain a decomposition of G into φ -expanders G , ..., G t , supporting both edge deletionsand insertions to G in O ( φ − log n ) amortized time. The subgraphs ( G i ) ≤ i ≤ t partition theedges of G , and we have P ti =1 | V ( G i ) | = O ( n log n ) . The algorithm is randomized Monte-Carlo, i.e. with high probability every G i is an expander. Now we formally prove the guarantees of our data structure.

Proof of Lemma 5.1.

We describe each of the operation independently.

Initialization.

Let G = ( V, E ) be the graph corresponding to the given incidence matrix A with weight g e of each edge e . Partition the edges into subgraphs denoted as G i = ( V, E i ),where E i = { e | g e ∈ [2 i , i +1 ) } (53)i.e. each G i is an unweighted subgraph of G consisting of edges of roughly the same g e values.Let G −∞ be the subgraph induced by all zero-weight edges. Next, we choose φ = 1 / log n andinitialize the expander decomposition algorithm in Lemma 5.3 on each of the G i , which resultsin φ -expanders G i, , ..., G i,t i for each i . Note this can be implemented so the running time doesnot depend on the ratio W of largest and smallest non-zero entries in g if we only spend timeon the non-empty G i ’s (and if a later update of g creates a new non-empty subgraph G i , weattribute the time to initialize the expander decomposition of G i to the Scale operation. Intotal we spend O ( m log n ) time for the initialization since the average time for each edge is O (log n ). Scale.

Changing g e ← s means the edge weight of e is changed to s . Thus we may need todelete the given edge e from its current graph G i and insert it into some other G i , so that(53) is maintained. By Lemma 5.3 it takes O (log n ) amortized time to update the expanderdecomposition of G i and G i . HeavyQuery.

For every i = −∞ we iterate over each of the φ -expander G i,j and do thefollowing. Let m and n denote the number of edges and nodes in G i,j respectively. Let B bethe incidence matrix of G i,j ; thus rows and columns of B correspond to edges and nodes in G i,j ,respectively. To simplify notations, we pretend rows of B are in R n instead of R n (i.e. keep aghost column even for nodes in G but not in G i,j ). Note that G i,j is unweighted and each rowof B corresponds to an edge in G i,j and also appears as a row in A ∈ R m × n ; thus B ∈ R m × n for some m ≤ m . All m -row/column (respectively n -row/column) matrices here have eachrow/column correspond to an edge (respectively node) in G i,j , so we index them using edges e and nodes v in G i,j ; for example, we refer to a row of B as B e and an entry of h ∈ R n as h v .To answer the query (ﬁnding all e such that | ( Diag ( g ) A h ) e | ≥ (cid:15) ), it suﬃces to ﬁnd all edges e in G i,j such that | ( Diag ( g ) B h ) e | ≥ (cid:15) (54)because rows of B is a subset of rows of A . Finding e satisfying (54) is done as follows. Step 1:

Shift h so it is orthogonal to the degree vector of G i,j : h ← h − ~ n · ( ~ > n D h ) / ( ~ > n D ~ n ) , (55) The algorithm lists the number of changes (i.e. which edges are added/removed to/from any of the expanders).The update time is thus also an amortized bound on the number of changes to the decomposition. Without this assumption, B might not contain all column in A and we need to deﬁne vector b h which contains | V ( G i,j ) | entries of h corresponding to nodes in G i,j , which introduce unnecessary notations. However, it is crucialto actually work with n -dimensional vectors for eﬃciency. D ∈ R n × n is the diagonal degree matrix with D v,v = deg G i,j ( v ). Step 2:

Let δ = (cid:15)/ i +1 , and ﬁnd all e with | ( B h ) e | ≥ δ. (56)as follows. For every node v with | h v | ≥ . δ , ﬁnd all edges e incident to v that satisﬁes (56).Among these edges, return those satisfying (54). Correctness:

To see we correctly return all edges satisfying (54), let e be any such edge, thenwe have | ( B h ) e | = | ( B h ) e | (since B ~ n = ~ m ) ≥ (cid:15)/g e (since | ( Diag ( g ) B h ) e | = g e | ( B h ) e | ) ≥ (cid:15)/ i +1 = δ (since g e ≤ i +1 for every edge e in G i,j ).Thus, e must satisfy (56). So if our algorithm discovers all edges satisfying (56), we will ﬁndall edges satisfying (54) as desired.It is left to show that our algorithm actually discovers all edges satisfying (56). Note thatan edge e = ( u, v ) satisﬁes (56) only if | h ( u ) | ≥ . δ or | h ( v ) | ≥ . δ : if | h ( u ) | < . δ and | h ( v ) | < . δ , then | ( B h ) e | = | h ( v ) − h ( u ) | ≤ | h ( v ) | + | h ( u ) | < δ . Since in step 2 weconsider edges incident to every node u such that | h ( u ) | > . δ , the algorithm discovers alledges satisfying (56). Time Complexity:

Step 1 (computing h ) can be easily implemented to take O ( | V ( G i,j ) | ) (seeFootnote 13). For step 2 (ﬁnding e satisfying (56)), let b V = { v ∈ V ( G i,j ) | | h v | ≥ . δ } , thenthe time we spend in step 2 is bounded by O ( X v ∈ b V deg G i,j ( v )) = O ( X v ∈ b V deg G i,j ( v )( h v ) /δ ) ≤ O  X v ∈ V ( G i,j ) deg G i,j ( v )( h v ) /δ  = O (( h ) > D h /δ ) . (57)Above, the ﬁrst equality is because ( h v ) /δ ≥ .

25 for every v ∈ b V . To bound this timecomplexity further, observe that h ⊥ D ~ n and k B h k = k B h k , (58)where the latter is because each row of the incidence matrix B sums to zero, so B h remains thesame under constant shift. By Lemma 5.2, k B h k = k B h k = ( B h ) > ( B h ) = ( h ) > L h Lem 5.2 ≥ ( h ) > D h φ , where here L = B > B is the Laplacian matrix of G i,j . Thus, the time complexity in (57) can bebounded by O ( k B h k / ( φ δ )) = O ( k B h k φ − i /(cid:15) ) . Summing the time complexity over all maintained expanders G i,j with B i,j being the relevantincidence matrix B , the total time for answering a query is O X i,j  | V ( G i,j ) | | {z } step 1 + k B i,j h k φ − i /(cid:15) | {z } step 2  Without the assumption discussed in Footnote 13, we also need to deﬁne h i,j to be the corresponding b h hereas well. Again, this does not change our calculation below. O X i,j | V ( G i,j ) |  + X i X ( u,v ) ∈ E i ( h ( u ) − h ( v )) i φ − (cid:15) −  = O X i,j | V ( G i,j ) |  + X i X ( u,v ) ∈ E i g u,v ) ( h ( u ) − h ( v )) φ − (cid:15) −  = O (cid:16) n log n log W + k Diag ( g ) A h k (cid:15) − log n (cid:17) . The second equality is because g ( u,v ) ≥ i for all ( u, v ) ∈ E i (by (53)). The third is because P j | V ( G i,j ) | = O ( n log n ) for every i (by Lemma 5.3), and also because the E i ’s partition theedges in the original graph. Sample.

Again, let G i,j be an expander of the maintained decomposition and denote D ( i,j ) itsdiagonal degree matrix. Then we deﬁne h ( i,j ) as in (55) for each G i,j which satisﬁes ( h v − h u ) =( h ( i,j ) v − h ( i,j ) u ) and h ( i,j ) ⊥ D ( i,j ) ~ Q = K P i,j i P v ∈ V ( G i,j ) ( h ( i,j ) v ) deg G i,j ( v )then we perform the following procedure for each expander G i,j of our decomposition. For eachnode v in G i,j , we sample each edge incident to v with probabilitymin { Q · i ( h ( i,j ) v ) , } . If an edge ( u, v ) is sampled twice (i.e. once for u and once for v ), then it is included only oncein the output.Computing all h ( i,j ) ’s and Q takes O ( n log n log W ) time and the sampling of edges canbe implemented such that the complexity is bounded by the number of included edges. Forexample by ﬁrst sampling a binomial for each node and then picking the corresponding numberof incident edges uniformly at random.The expected number of included edges can be bounded by X i,j X ( u,v ) ∈ E ( G i,j ) Q · i (( h ( i,j ) v ) + ( h ( i,j ) u ) ) = 2 Q X i,j i X u deg G i,j ( u )( h ( i,j ) u ) = 2 K Thus the expected runtime is O ( K + n log n log W ), and for a w.h.p bound this increases to O ( K log n + n log n log W ).We are left with proving the claim, i.e., each edge ( u, v ) is sampled with probability at leastmin ( K · g u,v ) ( h u − h v ) k Diag ( g ) A h k log n , ) . If Q ≥ i ( h ( i,j ) v ) − or Q ≥ i ( h ( i,j ) u ) − , then this is clear since then the sampling probability isjust 1. So we consider the case of Q ≤ min(2 i ( h ( i,j ) v ) − , i ( h ( i,j ) u ) − ). Then the probability is q ( u,v ) = Q · i (cid:16) ( h ( i,j ) v ) + ( h ( i,j ) u ) (cid:17) − Q i ( h ( i,j ) v ) ( h ( i,j ) u ) ≥ Q · i (cid:16) ( h ( i,j ) v ) + ( h ( i,j ) u ) − | ( h ( i,j ) v )( h ( i,j ) u ) | (cid:17) ≥ . Q · i (cid:16) ( h ( i,j ) v ) + ( h ( i,j ) u ) (cid:17) ≥ . Q · i ( h ( i,j ) v − h ( i,j ) u ) ≥ . Q · g u,v ( h v − h u ) Q ≤ i ( h ( i,j ) v ) − and Q ≤ i ( h ( i,j ) u ) − .Further, we can bound Q as Q = K P i,j i P v ∈ V ( G i,j ) ( h ( i,j ) v ) deg G i,j ( u ) ≥ Kφ k Diag ( g ) A h k , by applying the Cheeger-inequality to each P v ∈ V ( G i,j ) ( h ( i,j ) v ) deg G i,j ( u ). In summary we obtainthat any edge ( u, v ) is included with probability at least K g u,v ( h v − h u ) k Diag ( g ) A h k log n . Probability.

As discussed in the analysis of

Sample , we can compute Q and all h ( i,j ) ’s in O ( n log n log W ) time, then for each edge e ∈ I , we can look up the subgraph G i,j it belongsto, and compute its probability q e used in Sample in O (1) time, which takes O ( | I | ) time intotal for all edges in I . LeverageScoreSample.

We will sample the edges in each expander G i,j separately. Let L ( i,j ) denote the Laplacian of the unweighted subgraph G i,j , and we consider it as in the originaldimension n × n . Since g e ∈ [2 i , i +1 ] if e is an edge in G i,j , we have X ( i,j ) i L ( i,j ) (cid:22) b L (cid:22) X ( i,j ) i +2 L ( i,j ) where b L = A > Diag ( g ) A is the weighted Laplacian of the entire graph G with weights g e on each edge e . Since L ( i,j ) (cid:23) i, j , we know 2 i L ( i,j ) (cid:22) b L for all i, j , which gives b L † (cid:22) − i ( L ( i,j ) ) † since the null space of L ( i,j ) contains the null space of b L . As a result, for anedge e = ( u, v ) in the expander G i,j , we know σ ( Diag ( g ) A ) e = g e χ > e b L † χ e ≤ g e − i χ > e ( L ( i,j ) ) † χ > e ≤ χ > e ( L ( i,j ) ) † χ e , where χ e is the row of A corresponding to e , so it has only two non-zero entries, 1 at u and − v . Thus, it suﬃces to have p e ≥ K χ > e ( L ( i,j ) ) † χ e . As G i,j is a φ -expander for φ = 1 / log n , using Cheeger’s inequality (i.e. λ ( L ( i,j ) ) ≥ φ / y ⊥ ~ n , y > ( L ( i,j ) ) † y ≤ φ − y > ( D ( i,j ) ) − y , where D ( i,j ) is the diagonaldegree matrix of G i,j . Thus, we know χ > e ( L ( i,j ) ) † χ e ≤ φ − ( i,j ) u + 1deg ( i,j ) v ! . Similar to how we implement

Sample , we can go through each node v (with non-zero degree)in G i,j and sample each edge incident to v with probability p v = min ( K φ − deg ( i,j ) u , ) , and only include an edge once if it is sampled twice at both endpoints. Again this can beimplemented so that the complexity is bounded by the number of included edges, and theexpected number of sampled edges can be bounded by X i,j X ( u,v ) ∈ E ( G i,j ) K φ − deg ( i,j ) u + 16 K φ − deg ( i,j ) v = X i,j X u deg ( i,j ) u · K φ − deg ( i,j ) u

58 16 K φ − X i,j | V ( G i,j ) | ≤ O ( K n log n log W ) , and we can include additional log n factor to make the bound hold with high probability. Tosee the probability that an edge is sampled satisﬁes the requirement, note p e ≥ max p u , p v ≥ p u + p v ≥ min ( K φ − ( 1deg ( i,j ) u + 1deg ( i,j ) v ) , ) ≥ K χ e ( L ( i,j ) ) † χ e ≥ σ ( Diag ( g ) A ) e LeverageScoreBound.

We can get the exact probability an edge e is sampled in the previousimplementation of LeverageScoreSample . Suppose e = ( u, v ) is in subgraph G i,j , we cancompute in constant time p u = min ( K φ − deg ( i,j ) u , ) , p v = min ( K φ − deg ( i,j ) v , ) , and p e = p u + p v − p u p v . Going through all edges in I takes O ( | I | ) time.59 Dual Solution Maintenance

In this section we discuss how we eﬃciently maintain an approximation of the dual slack (i.e. s ) in our IPM. Recall that our dual slack vector starts with an initial value s (init) , and in eachiteration t accumulates an update with the generic form of A δ ( t ) where A ∈ R m × n is the edgeincidence matrix. The actual computation δ ( t ) relies on other data structures such as gradientmaintenance, which will be discussed in later sections. In this section, we focus on maintainingan approximation of the dual slack assuming δ ( t ) is given to us. Theorem 6.1 ( VectorMaintainer ) . There exists a data-structure that supports the followingoperations • Initialize( A ∈ R m × n , v (init) ∈ R m , (cid:15) > ) The data-structure stores the given edge inci-dence matrix A ∈ R m × n , the vector v (init) ∈ R m and accuracy parameter < (cid:15) ≤ in e O ( m ) time. • Add( h ∈ R n ) : Suppose this is the t -th time the Add operations is called, and let h ( k ) bethe vector h given when the Add operation is called for the k th time. Deﬁne v ( t ) ∈ R m tobe the vector v ( t ) = v (init) + A t X k =1 h ( k ) . Then the data structure returns a vector v ( t ) ∈ R m such that v ( t ) ≈ (cid:15) v ( t ) . The output willbe in a compact representation to reduce the size. In particular, the data-structure returnsa pointer to v and a set I ⊂ [ m ] of indices i where v ( t ) i is changed compared to v ( t − i , i.e.,the result of the previous call to Add . The total time after T calls to Add is e O T n log W + T (cid:15) − · T X t =1 k ( v ( t ) − v ( t − ) /v ( t ) k ! . • ComputeExact() : Returns v ( t ) ∈ R m in O ( m ) time, where t is the number of times Add is called so far (i.e., v ( t ) is the state of the exact vector v after the most recent callto Add ). From the graph-algorithmic perspective, our data structure maintains a ﬂow v ∈ R E onan unweighted oriented graph G = ( V, E ) (initially v = v (init) ). This ﬂow is changed by the Add ( h ) operation, whose input is the vertex potential h ∈ R E inducing a new electrical ﬂow tobe augmented; in particular, the new ﬂow on each edge ( i, j ) is v ( i,j ) ← v ( i,j ) + h j − h i . The Add operation then returns necessary information so that the user can maintain an approximationof ﬂow v . The user can query for the exact value of v by calling ComputeExact() . Note thatthe graph G never changes; though, our data structured can be easily modiﬁed to allow G tohave edge resistances that may change over time.To make sense of the running time for the Add operations, note that reading the input over T Add calls needs

T n time, and in total there can be as many as

T (cid:15) − · P Tt =1 k ( v ( t ) − v ( t − ) /v ( t ) k changes to the output vector we need to perform to guarantee the approximation bound. Thus,ignoring the log W factor, the complexity in our theorem is optimal as it is just the bound oninput and output size.Throughout this section we denote h ( t ) the input vector h of the t -th call to Add (orequivalently referred to as the t -th iteration), and let v ( t ) = v (init) + A P tk =1 h ( k ) be the state ofthe exact solution v (as deﬁned in Theorem 6.1) for the t -th call to Add .In our algorithm (see Algorithm 4) we maintain a vector b f which is the sum of all past inputvectors h , so we can retrieve the exact value of v ( t ) i = v (init) i + ( A b f ) i for any i eﬃciently. This Here, a ﬂow refers any assignment of a real value to each edge. lgorithm 4: Algorithm for Theorem 6.1 members D, b f , v ∈ R m , t ∈ N D j , f ( j ) ∈ R n and F j ⊂ [ m ] for 0 ≤ j ≤ log n procedure Initialize ( A , v (init) , (cid:15) ) v ← v (init) , b f ← ~ n , t ← for j = 0 , ..., log n do D j . Initialize ( A , /v, . (cid:15)/ log n ) ( HeavyHitter , Lemma 5.1) f ( j ) ← ~ n , F j ← ∅ procedure FindIndices ( h ∈ R n ) I ← ∅ for j = log n, ..., do f ( j ) ← f ( j ) + h // When j | t , then f ( j ) = P tk = t − j +1 h ( k ) if j | t then I ← I ∪ D j . HeavyQuery ( f ( j ) ) f ( j ) ← ~ n return I procedure VerifyIndex ( i ) if | v i − ( v (init) + A b f ) i | ≥ . (cid:15)v i / log n then v i ← ( v (init) + A b f ) i for j = 0 , ..., log n do F j ← F j ∪ { i } , D j . Scale ( i, // Notify other D j ’s to stop tracking i . return True return False procedure Add ( h ∈ R n ) t ← t + 1, b f ← b f + h , I ← FindIndices ( h ) I ← { i | i ∈ I and VerifyIndex ( i ) = T rue } for j : 2 j | t do I ← I ∪ { i | i ∈ F j and VerifyIndex ( i ) = T rue } ; for j : 2 j | t do for i ∈ I ∪ F j do D j . Scale ( i, g i /v i ) F j ← ∅ return I , v procedure ComputeExact () return v (init) + A g v i whenever the approximation v that we maintain no longersatisﬁes the error guarantee for some coordinate i . As to how we detect when this may happen,we know the diﬀerence between v ( t ) and the state of v at an earlier t -th Add call is v ( t ) − v ( t ) = A  t X k = t +1 h ( t )  , and thus we can detect all coordinates i that changes above certain threshold from t to t -th Add call using the

HeavyHitter data structure of Lemma 5.1 (by querying it with P tk = t +1 h ( t ) as the parameter h ). Note since the error guarantee we want is multiplicative in v (i.e., v ( t ) i ∈ ± (cid:15)v ( t ) i for all i ), while the threshold (cid:15) in Lemma 5.1 is absolute and uniform, we give 1 /v asthe scaling vector to HeavyHitter to accommodate this.Since the most recent updates on v i for diﬀerent indices i ’s happen at diﬀerent iterations,we need to track accumulated changes to v i ’s over diﬀerent intervals to detect the next time anupdate is necessary for each i . Thus, it is not suﬃcient to just have one copy of HeavyHit-ter . On the other hand, keeping one individual copy of

HeavyHitter for each 0 ≤ t < t will be too costly in terms of running time. We handle this by instantiating log n copies ofthe HeavyHitter data structure D j for j = 0 , ..., log n , each taking charge of batches withincreasing number of iterations. In particular, the purpose of D j is to detect all coordinates i in v with large accumulated change over batches of 2 j iterations (see how we update andreset f ( j ) in FindIndices in Algorithm 4). Each D j has its local copy of a scaling vector,which is initialized to be 1 /v , we refer to it as b g ( j ) , and the cost to query D j is proportional to k Diag ( b g ( j ) ) A f ( j ) k . Note f ( j ) accumulates updates over 2 j iterations, and k P j k =1 h ( k ) k canbe as large as 2 j P j k =1 k h ( k ) k . Since we want to bound the cost of our data structure by thesum of the squares of updates (which can in turn be bounded by our IPM method) instead ofthe square of the sum of updates, querying D j incurs an additional 2 j factor overhead. Thus foreﬃciency purposes, if v i would take much less than 2 j iterations to accumulate a large enoughchange, we can safely let D j stop tracking i during its current batch since v i ’s change wouldhave been detected by a D j of appropriate (and much smaller) j so that v i would have beenupdated to be the exact value (see implementation of VerifyIndex ). Technically, we keep aset F j to store all indices i that D j stops tracking for its current batch of iterations and set b g ( j ) i to 0 so we don’t pay for coordinate i when we query D j . At the start of a new batch of 2 j iterations for D j , we add back all indices in F j to D j (Line 30) and reset F j . As a result, onlythose i ’s that indeed would take (close to) 2 j iterations to accumulate a large enough changeare necessary to be tracked by D j , so we can query D j less often for large j to oﬀset its largecost. In particular, we query each D j every 2 j iterations (see Line 14).We start our formal analysis with the following lemma. Lemma 6.2.

Suppose we perform the t -th call to Add , and let ‘ be the largest integer with ‘ | t (i.e. ‘ divides t ). Then the call to FindIndices in Line 25 returns a set I ⊂ [ m ] containing all i ∈ [ m ] such that there exists some ≤ j ≤ ‘ satisfying both i / ∈ F j and | v ( t − j ) i − v ( t ) i | ≥ . | v | i (cid:15)/ log n .Proof. Pick any j ∈ [0 , ‘ ], if | v ( t − j ) i − v ( t ) i | ≥ . | v i | (cid:15)/ log n , then (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) e > i A t X k = t − j +1 h ( k ) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ≥ . | v i | (cid:15)/ log n. We will argue why

FindIndices detect all i ’s satisfying this condition.62ote that we have f ( j ) = P tk = t − j +1 h ( k ) , and thus by guarantee of Lemma 5.1 when we call D j . HeavyQuery ( f ( j ) ) (in Line 14), we obtain for every j ∈ [0 , ‘ ] all i ∈ [ m ] with (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)b g ( j ) i e > i A t X k = t − j +1 h ( k ) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ≥ . (cid:15)/ log n. Here b g ( j ) i = 0 if i ∈ F j by Line 21, which happens whenever v i is changed in Line 19. Thus byLine 30 we have b g ( j ) i = 1 / | v i | for all i / ∈ F j . Equivalently, we obtain all indices i / ∈ F j satisfyingthe following condition, which proves the lemma. (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) e > i A  t X k = t − j +1 h ( k )  (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ≥ . | v i | (cid:15)/ log n To guarantee that the approximation v we maintain is within the required (cid:15) error bound ofthe exact vector v , we need to argue that the D j ’s altogether are suﬃcient to detect all potentialevents that would cause v i to become outside of (1 ± (cid:15) ) v . It is easy to see that if an index i isincluded in the returned set I of FindIndices (Line 25), then our algorithm will follow up witha call to

VerifyIndex ( i ), which will guarantee that v i is close to the exact value v i (or v i will beupdated to be v i ). Thus, if we are in iteration t , and t is the most recent time VerifyIndex ( i )is called, we know v i ≈ v ( t ) i , the value of v i remains the same since iteration t , and the index i is not in the result of FindIndices for any of the iterations after t . We will demonstrate thelast condition is suﬃcient to show v ( t ) i ≈ v ( t ) i , which in turn will prove v i ≈ v ( t ) i . To start, weﬁrst need to argue that for any two iterations t < t , the interval can be partitioned into a smallnumber of batches such that each batch is exactly one of the batches tracked by some D j . Lemma 6.3.

Given any t < t , there exists a sequence t = t > t > ... > t k = t such that k ≤ t and t z +1 = t z − ‘ z where ‘ z satisﬁes ‘ z | t z for all z = 0 , . . . , k − .Proof. We can ﬁnd such a sequence ( t z ) z of length at most 2 log n as follows. Write both t, t as log t -bit binary, and start from the most signiﬁcant bit. t and t will have some commonpreﬁx, and the ﬁrst bit where they diﬀer must be a 1 for t and 0 for t since t > t . Supposethe bit they diﬀer corresponds to 2 ‘ , then let t = b t/ ‘ c ‘ , i.e. t is t but with everything tothe right of 2 ‘ zeroed out. We can create the sequence in two halves. First from t to t byiteratively subtracting the current least signiﬁcant 1-bit. To go from t down to t , consider thesequence backwards, we start from t , and iteratively add a number equal to the current leastsigniﬁcant 1-bit. This will keep pushing the least signiﬁcant 1 to the left, and eventually arriveat t . Clearly this gives a sequence satisfying the condition that t z +1 = t z − ‘ z and 2 ‘ z | t z forall t z created in our sequence. The length of the sequence is at most 2 log t , since each half haslength at most log t . For an example see Figure 1.Now we can argue v stays in the desired approximation range around v . Lemma 6.4 (Correctness of Theorem 6.1) . Assume we perform the t -th call to Add , thereturned vector v satisﬁes | v ( t ) i − v i | ≤ (cid:15) | v ( t ) i | for all i ∈ [ m ] , and I contains all indices that havechanged since the ( t − -th Add call. k binary representation of t k l k t = t l = 0 t l = 2 t l = 3 t l = 5 t k binary representation of t k l k t l = 2 t l = 1 t l = 0 t = t t to t in 2 log t steps. t in the proof of Lemma 6.3 is t . Here2 ‘ k | t k and t k +1 = t k − ‘ k for all k , and the bit (i.e. 2 ‘ k ) is highlighted in each row. Proof. By b f = P tj =1 h ( j ) we have v (init) + A b f = v ( t ) in Line 18. So after a call to VerifyIndex ( i )we know that | v i − v ( t ) i | < . (cid:15)v j / log n , either because the comparison | v i − ( v (init) + A b f ) i | ≥ . (cid:15)v i / log n in Line 18 returned false, or because we set v i ← ( v (init) + A b f ) i in Line 19. Note thisis also the only place we may change v i . So consider some time t ≤ t when VerifyIndex ( i ) wascalled for the last time (alternatively t = 0). Then v i has not changed during the past t − t callsto Add , and we know | v i − v ( t ) i | ≤ . (cid:15)v i / log n . We now want to argue that | v ( t ) − v ( t ) | ≤ . (cid:15)v i ,which via triangle inequality would then imply v i ≈ (cid:15) v ( t ) i .For t = t this is obvious, so consider t < t . We know from Lemma 6.3 the existence of asequence t = t > t > ... > t k = t with 2 ‘ z | t z and t z +1 = t z − ‘ z . In particular, this means that the interval between iteration t z +1 and t z correspond to exactly a batch tracked by D l z . Thus, at iteration t z when FindIndices iscalled, D l z . HeavyQuery is executed in Line 14. This gives us | v ( t z ) i − v ( t z +1 ) i | < . (cid:15) | v i | / log n for all z , because by Lemma 6.2 the set I ∪ ( S j F j ) contains all indices i which might havechanged by 0 . | v i | (cid:15)/ log n over the past 2 ‘ iterations for any 2 ‘ | t , and because VerifyIndex ( i )is called for all i ∈ I ∪ ( S j F j ) in Line 26 and Line 27.Note we can assume log t ≤ log n by resetting the data-structure after n iterations, and thisbounds the length of the sequence k ≤ n . This then yields the bound | v ( t ) i − v ( t ) i | = | v ( t k ) i − v ( t ) i | ≤ k X z =1 | v ( t z ) i − v ( t z − ) i | ≤ k · . (cid:15) | v i | / log n ≤ . (cid:15) | v i | Thus we have | v i − v ( t ) i | ≤ (0 . (cid:15) + 0 . (cid:15)/ log n ) | v i | , which implies v i ≈ (cid:15) v i . It is also straightfor-ward to check that when we return the set I at the end of Add , I contains all the i ’s where VerifyIndex ( i ) is called and returned true in this iteration, which are exactly all the i ’s where v i ’s are changed in Line 19.Now we proceed to the complexity of our data structure. We start with the cost of Find-Indices , which is mainly on the cost of querying D j ’s. As we discussed at the beginning, therecan be a large overhead for large j , but this is compensated by querying large j less frequently. Lemma 6.5.

After T calls to Add , the total time spent in

FindIndices and

VerifyIndex is bounded by e O T (cid:15) − T X t =1 k ( v ( t ) − v ( t − ) /v ( t ) k + T n log W ! Proof.

We start with the cost of

FindIndices . Every call to

Add invokes a call to

FindIndices ,so we denote the t -th call to FindIndices as the one associated with the t -th Add . Fix any j and consider the cost for D j . We update f ( j ) once in each call, which takes O ( n ). Every 2 j calls would incur the cost to D j . HeavyQuery ( f ( j ) ). Without loss of generality we considerthe cost of the ﬁrst time this happens (at iteration 2 j ) since the other batches follow the same64alculation. We denote b g ( j ) as the scaling vector in D j when the query happens. We know b g ( j ) i = 0 if i ∈ F j , and b g ( j ) i = 1 / | v i | otherwise. Note here we can skip the superscript indicatingthe iteration number, since if i / ∈ F j it must be that v i has not changed over the 2 j iterations.The cost to query D j in Line 14 can then be bounded by. e O ( k Diag ( b g ( j ) A f ( j ) k k (cid:15) − + n log W )) ≤ e O ( k j X t =1 Diag (1 /v ( t ) ) A h ( t ) k (cid:15) − + n log W ) ≤ e O (  j X t =1 k Diag (1 /v ( t ) ) A h ( t ) k  (cid:15) − + n log W ) ≤ e O (2 j · j X t =1 k Diag (1 /v ( t ) ) A h ( t ) k (cid:15) − + n log W )The ﬁrst line is by Lemma 5.1, and note we use 0 . (cid:15)/ log n as the error parameter in the call.The ﬁrst inequality is by looking at b g ( j ) i for each index i separately. The value is either 0 soreplacing it by 1 /v ( t ) i only increase the norm, or we know i / ∈ F j , so b g ( j ) i is within a constantfactor of 1 /v ( t ) i since v ( t ) i ≈ v (1) i ≈ v i for all t ∈ [1 , j ]. The second inequality uses the triangleinequality, and the third inequality uses Cauchy-Schwarz. The cost of all subsequent queries to D j follows similar calculation, and as this query is only performed once every 2 j iterations, thetotal time after T iterations is e O T (cid:15) − T X t =1 k Diag (1 /v ( t ) ) A h ( t ) k + T − j n log W ! = e O T (cid:15) − T X t =1 k ( v ( t ) − v ( t − ) /v ( t ) k + T − j n log W ! Note that the equality follows the deﬁnition of v ( t ) in Theorem 6.1. We can then sum over thetotal cost for all D j ’s for j = 1 , . . . , log n to get the ﬁnal running time bound in the lemmastatement.As to the cost of VerifyIndex , each call computes ( v (init) + A b f ) i which takes O (1) time aseach row of A has only two non-zero entries. Further, the updates to F j ’s and calls to D j . Scale take e O (1) time. Now we need to bound the total number of times we call VerifyIndex , whichcan only happen in two cases. The ﬁrst case (Line 26) is when i is returned by FindIndices inLine 25, and the number of times we call

VerifyIndex is bounded by the size of I , which is inturn bounded by the running time of FindIndices . So the total cost over T iterations can bebounded by the total cost of FindIndices . The second case (Line 27) is when i is in some F j because v i was updated due to v i changing by more than 0 . (cid:15)/ log n , and the number of timesthis occurs can be bounded by e O T (cid:15) − T X t =1 k ( v ( t ) − v ( t − ) /v ( t ) k ! . Adding up the total cost of

VerifyIndex and

FindIndices proves the lemma.We proceed to prove the complexity bounds in Theorem 6.1.

Initialize.

The main work is to initialize the data structures D j for j = 1 , ..., log n , whichtakes e O ( m ) time in total by Lemma 5.1. 65 dd. The cost not associated with any

FindIndices and

VerifyIndex is O ( n ). Togetherwith Lemma 6.5 gives the bound of total time of T calls to Add e O T (cid:15) − T X t =1 k ( v ( t ) − v ( t − ) /v ( t ) k + T n log W ! ComputeExact.

This just takes O (nnz( A )) to compute the matrix-vector product, which is O ( m ) since A is an edge incidence matrix. 66 Gradient and Primal Solution Maintenance

In Section 4.2 and Section 4.3 we provided two IPMs. These IPMs try to decrease some potentialfunction Φ( v ) for v ∈ R m , so they require the direction of steepest descent, which is typicallygiven by the gradient ∇ Φ( v ). However, the two IPMs allows for approximations and works withdiﬀerent norms, so they actually require the maximizer g := argmax w ∈ R m : k w k≤ h∇ Φ( v ) , w i , (59)which gives the direction of steepest ascent with respect to some norm k · k and some v ≈ v .The IPM of Section 4.2 uses the ‘ -norm (in which case the solution of above problem is just g = ∇ Φ( v ) / k∇ Φ( v ) k ) whereas the IPM of Section 4.3 uses the norm k · k τ + ∞ .In this section we describe a framework that is able to maintain the maximizer of (59)eﬃciently. Note that the maximizer is an m -dimensional vector, so writing down the entirevector in each iteration of the IPM would be too slow. Luckily, the IPM only requires thevector A > X g , where x ≈ x is an approximation of the current primal solution and A ∈ R m × n is the constraint matrix of the linear program. This vector is only n dimensional, so we canaﬀord to write it down explicitly. Hence our task is to create a data structure that can eﬃcientlymaintain A > X g . This will be done in Section 7.1.Note that the vector g is closely related to the primal solution x . In each iteration of ourIPM, the primal solution x changes by (see (5) and (6) in Section 3.1) x (new) ← x + η X g − R h, for some h ∈ R n , η ∈ R , random sparse diagonal matrix R ∈ R m × m , X = Diag ( x ) for x ≈ x ,and g is as in (59). Because of the sparsity of R , we can thus say that during iteration t theprimal solution x ( t ) is of the form x ( t ) = x (init) + t X k =1 (cid:16) h ( k ) + η X ( k ) g ( k ) (cid:17) , (60)where h ( k ) is the vector R h during iteration number k , and g ( k ) , x ( k ) are the vectors g and x during iteration number k .Thus in summary, the primal solution x is just the sum of (scaled) gradient vectors g ( k ) andsome sparse vectors h ( k ) . The main result of this section will be data structures (Theorems 7.1and 7.7) that maintain both A > X g and an approximation x of the primal solution x .We state the result with respect to the harder case, when using the k · k τ + ∞ -norm for (59).At the end of Section 7.2, when we ﬁnish this result, we also state what the variant would looklike for the easier case of using the ‘ -norm, where the vector g is given by the simple expression ∇ Φ( v ) / k∇ Φ( v ) k .Recall that for v ∈ R m we deﬁned Φ( v ) := P mi =1 exp( λ ( v i − − λ ( v i − λ of value polylog n . Our main result of this section is the followingTheorem 7.1. When using Theorem 7.1 in our IPM, we will use g = ηx for some scalar η ∈ R ,so that QuerySum returns the desired approximation of (60).

Theorem 7.1.

There exists a deterministic data-structure that supports the following operations • Initialize ( A ∈ R m × n , x (init) ∈ R m , g ∈ R m , e τ ∈ R m , z ∈ R m , (cid:15) > : The data-structurepreprocesses the given matrix A ∈ R m × n , vectors x (init) , g, e τ , z ∈ R m , and accuracy pa-rameter (cid:15) > in e O (nnz( A )) time. We denote G the diagonal matrix Diag ( g ) . Thedata-structure assumes . ≤ z ≤ and n/m ≤ e τ ≤ . • Update ( i ∈ [ m ] , a ∈ R , b ∈ R , c ∈ R ) : Sets g i ← a , e τ i ← b and z i ← c in O ( k e > i A k ) time. The data-structure assumes . ≤ z ≤ and n/m ≤ e τ ≤ . QueryProduct () : Returns A > G ∇ Φ( z ) [ ( τ ) ∈ R n for some τ ∈ R m , z ∈ R m with τ ≈ (cid:15) e τ and k z − z k ∞ ≤ (cid:15) , where x [ ( τ ) := argmax k w k τ + ∞ h x, w i . Every call to

QueryProduct must be followed by a call to

QuerySum , and we boundtheir complexity together (see

QuerySum ). • QuerySum ( h ∈ R m ) : Let v ( ‘ ) be the vector G ∇ Φ( z ) [ ( τ ) used for the result of the ‘ -th callto QueryProduct . Let h ( ‘ ) be the input vector h given to the ‘ -th call to QuerySum .We deﬁne x ( t ) := x (init) + t X ‘ =1 v ( ‘ ) + h ( ‘ ) . Then the t -th call to QuerySum returns a vector x ∈ R m with x ≈ (cid:15) x ( t ) .Assuming the input vector h in a sparse representation (e.g. a list of non-zero entries),then after T calls to QuerySum and

QueryProduct the total time for all calls togetheris bounded by O T n(cid:15) − log n + log n · T X ‘ =0 k h ( ‘ ) k + T log n · T X ‘ =1 k v ( ‘ ) /x ( ‘ − k /(cid:15) ! . The output x ∈ R m is returned in a compact representation to reduce the size. In particu-lar, the data-structure returns a pointer to x and a set J ⊂ [ m ] of indices which speciﬁeswhich entries of x have changed between the current and previous call to QuerySum . • ComputeExactSum () : Returns the exact x ( t ) in O ( m log n ) time. • Potential () : Returns Φ( z ) in O (1) time for some z such that k z − z k ∞ ≤ (cid:15) . The main idea is that, although the exact gradient ∇ Φ( z ) [ ( e τ ) ∈ R m can be indeed m -dimensional and thus costly to maintain, if we perturb e τ and z slightly we can show that ∇ Φ( z ) [ ( τ ) will be in O ( (cid:15) − log n ) dimension. Here we say dimension in the sense that the m entries can be put into O ( (cid:15) − log n ) buckets, and entries in the same bucket share a commonvalue. The proof consists of two steps, and we employ two sub data structures. First, weconstruct the O ( (cid:15) − log n ) dimensional approximation ∇ Φ( z ) [ ( τ ) to the exact gradient. Usingits low dimensional representation we are able to eﬃciently maintain AG ∇ Φ( z ) [ ( τ ) . The datastructure that maintains this low dimensional representation and AG ∇ Φ( z ) [ ( τ ) is discussed inSection 7.1. The subsequent Section 7.2 presents a data structure that takes the low dimensionalrepresentation of the gradient as input and maintains the desired sum for QuerySum . In this section we discuss the computation of the O ( (cid:15) − log n ) dimensional representation of thegradient ∇ Φ( z ) [ ( τ ) , which in turn provides a good enough approximation to the real gradient ∇ Φ( z ) [ ( e τ ) ∈ R m . This representation is then used to eﬃciently maintain A > G ∇ Φ( z ) [ ( τ ) . Theformal result is as follows: Lemma 7.2.

There exists a deterministic data-structure that supports the following operations • Initialize ( A ∈ R m × n , g ∈ R m , e τ ∈ R m , z ∈ R m , (cid:15) > : The data-structure preprocessesthe given matrix A ∈ R m × n , vectors g, e τ , z ∈ R m , and accuracy parameter (cid:15) > in O (nnz( A )) time. The data-structure assumes . ≤ z ≤ and n/m ≤ e τ ≤ . The outputis a partition S Kk =1 I k = [ m ] with K = O ( (cid:15) − log n ) . Update ( i ∈ [ m ] , a ∈ R , b ∈ R , c ∈ R ) : Sets g i = a , e τ i = b and z i = c in O ( k e > i A ) time.The data-structure assumes . ≤ z ≤ and n/m ≤ e τ ≤ . The index i might be movedto a diﬀerent set, so the data-structure returns k such that i ∈ I k . • Query () : Returns A > G ∇ Φ( z ) [ ( τ ) ∈ R n for some τ ∈ R m , z ∈ R m with τ ≈ (cid:15) e τ and k z − z k ∞ ≤ (cid:15) , where x [ ( τ ) := argmax k w k τ + ∞ ≤ h x, w i . The data-structure further returnsthe low dimensional representation s ∈ R K such that K X k =1 s k i ∈ I k = (cid:16) ∇ Φ( z ) [ ( τ ) (cid:17) i for all i ∈ [ m ] , in O ( n(cid:15) − log n ) time. • Potential () Returns Φ( z ) in O (1) time. Before describing the low dimensional representation of Φ( z )) [ ( τ ) , we ﬁrst state the followingresult which we will use to compute x [ ( τ ) . Note this is just the steepest ascent direction of x with respect to a custom norm, and when w, v are two vectors of the same dimension we referto wv as the entry-wise product vector. Lemma 7.3 ([LS19, Algorithm 8]) . Given x ∈ R n , v ∈ R n , we can compute in O ( n log n ) time u = argmax k w k + k wv − k ∞ ≤ h x, w i . We have the following simple corollary of Lemma 7.3.

Corollary 7.4.

Given x ∈ R n , v ∈ R n , we can compute in O ( n log n ) time u = argmax k vw k + k w k ∞ ≤ h x, w i . Proof.

We have max k vw k + k w k ∞ ≤ h x, w i = max k wv k + k w k ∞ h xv − , wv i = max k w k + k w v − k ∞ h xv − , w i by substituting w = wv . Thus for the maximizer we haveargmax k vw k + k w k ∞ ≤ h x, w i = v − argmax k w k + k w v − k ∞ h xv − , w i , where the maximizer of on the right hand side can be computed in O ( n log n ) via Lemma 7.3and entry-wise multiplying by v − takes only O ( n ) additional time.Recall x [ ( τ ) := argmax k w k τ + ∞ ≤ h x, w i and k w k τ + ∞ := k w k ∞ + C k w √ τ k for some constant C . Corollary 7.4 allows us to compute x [ ( τ ) (using the corollary with v = C √ τ ) for any x in itsoriginal dimension. However, if x is an m dimensional vector such as ∇ Φ( z ), then Corollary 7.4is too slow to be used in every iteration of the IPM.To address this, we ﬁrst discretize z and e τ by rounding the entries to values determinedby the error parameter (cid:15) to get z, τ which are in low dimensional representation (i.e. put theindices into buckets). In particular, z will be the vector z with each entry rounded down to thenearest value of form 0 . ‘(cid:15)/ ‘ , and τ rounds down each entry of e τ tothe nearest power of (1 − (cid:15) ). It is fairly straightforward to see from deﬁnition this in turn alsogives low dimensional representation of ∇ Φ( z ). In the next lemma we show when x and τ areboth low dimensional, x [ ( τ ) also admits a low dimensional representation that can be computedeﬃciently, which let us to compute ∇ Φ( z ) [ ( τ ) . In the following lemma, when we use a set (i.e.a bucket) of indices as a vector, we mean it’s indicator vector with 1 at indices belong to theset and 0 elsewhere. 69 lgorithm 5: Algorithm for reducing the dimension of ∇ Φ( z ) [ and maintaining A > G ∇ Φ( z ) [ (Lemma 7.2) members I ( k,‘ ) partition of [ m ]. w ( k,‘ ) ∈ R n ; // Maintained to be A > G1 i ∈ I ( k,‘ ) g, z ∈ R m , p ∈ R ; // p is maintained to be Φ( z ) procedure Initialize ( A ∈ R m × n , g ∈ R m , e τ ∈ R m , z ∈ R m , (cid:15) > A ← A , g ← g , z ← z , p ← Φ( z ) I ( k,‘ ) ← ∅ , w ( k,‘ ) ← ~ ∀ k = 1 , . . . , log − − (cid:15) ) ( n/m ) and ‘ = 0 , ..., (1 . /(cid:15) ) for i ∈ [1 , m ] do Find k, ‘ such that 0 . ‘(cid:15)/ ≤ z i < . ‘ + 1) (cid:15)/ − (cid:15) ) k +1 ≤ e τ i ≤ (1 − (cid:15) ) k . Add i to I ( k,‘ ) and set τ i ← (1 − (cid:15) ) k +1 w ( k,‘ ) ← w ( k,‘ ) + A > g i e i return ( I ( k,‘ ) ) k,‘ ≥ procedure Update ( i ∈ [ m ] , a ∈ R , b ∈ R , c ∈ R ) p ← exp( λc ) + exp( − λc ) − (exp( λz i ) + exp( − λz i )) z i ← c Find k, ‘ such that i ∈ I ( k,‘ ) , then remove i from I ( k,‘ ) . w ( k,‘ ) ← w ( k,‘ ) − A > g i e i Find k, ‘ such that 0 . ‘(cid:15)/ ≤ c < . ‘ + 1) (cid:15)/ − (cid:15) ) k +1 ≤ b ≤ (1 − (cid:15) ) k ,then insert i into I ( k,‘ ) . w ( k,‘ ) ← w ( k,‘ ) + A > ae i g i ← a return k, ‘ procedure Query () // Construct scaled low dimensional representation of ∇ Φ( z ) Let x k,‘ = | I ( k,‘ ) | ( λ exp( λ (0 . ‘(cid:15)/ − − λ exp( − λ (0 . ‘(cid:15)/ − Interpret x as an O ( (cid:15) − log n ) dimensional vector. // Construct scaled low dimensional representation of τ , here C is theconstant when define k · k τ + ∞ Let v be the vector with v k,‘ = q | I ( k,‘ ) | (1 − (cid:15) ) k +1 /C . s ← argmax y : k vy k + k y k ∞ ≤ h x, y i via Corollary 7.4. return s and P k,l s k,‘ w ( k,‘ ) procedure Potential () return p Lemma 7.5.

Given x, v, g ∈ R k , x, v ∈ R n , and a partition S ki =1 I ( i ) = [ n ] , such that x = P ki =1 x i I ( i ) , v = P ki =1 v i I ( i ) and g i = q | I ( i ) | , let u ∈ R k be the maximizer of u := argmax w ∈ R k : k ( gv ) w k + k w k ∞ ≤ h x g , w i . We can deﬁne u ∈ R n via u = P ki =1 u i I ( i ) , then h v, u i = max w ∈ R n : k vw k + k w k ∞ ≤ h x, w i . Proof.

Consider w ∈ R k and w ∈ R n with w = P ki =1 w i I ( i ) . Then k w k ∞ = k w k ∞ and k wv k = X i ( w i v i ) = X j X i ∈ I ( j ) ( w j v j ) = X j | I ( j ) | ( w j v j ) = k ( gv ) w k . w satisﬁes k ( gv ) w k + k w k ∞ ≤ w satisﬁes k vw k + k w k ∞ ≤

1. Further, h x g , w i = k X i =1 | I ( i ) | x i w i = n X i =1 x i w i = h x, w i . Thus max w ∈ R k : k ( gv ) w k + k w k ∞ ≤ h x g , w i ≤ max w ∈ R n : k vw k + k w k ∞ ≤ h x, w i which in turn means that u ∈ R k as deﬁned in the lemma satisﬁes h x, u i = h x g , u i ≤ max w ∈ R n : k vw k + k w k ∞ ≤ h x, w i . Let S := { w ∈ R n : k vw k + k w k ∞ ≤ w i = w j for all i, j ∈ I ( ‘ ) } , i.e. the set of n dimensional vectors w that has a low dimensional representation via some w ∈ R k and w = P ki =1 w i I ( i ) . Then h x, u i = max w ∈ S h x, w i . by deﬁnition of u . We claim thatmax w ∈ S h x, w i = max w ∈ R n : k vw k + k w k ∞ ≤ h x, w i , which would then conclude the proof of the lemma.So assume there exists some maximizer w ∈ R n with k vw k + k w k ∞ ≤ i, j, ‘ with i, j ∈ I ( ‘ ) and w i = w j , i.e. the vector w / ∈ S . Then deﬁne w ∈ R n with w i = w j = ( w i + w j ) /

2. For this vector we have k w k ∞ ≤ k w k ∞ and k vw k = X k / ∈{ i,j } v k w k + v i ( w i + w j ) / ≤ X k / ∈{ i,j } v k w k + v i w i + v j w j = k vw k , where the ﬁrst equality is because v i = v j for i, j in same I ‘ . So we know w also satisﬁes k vw k + k w k ∞ ≤

1. On the other hand, h x, w i = X k / ∈{ i,j } x k w k + x i w i + x i w j = X k / ∈{ i,j } x k w k + x i ( w i + w j ) / x i ( w i + w j ) / h x, w i . Thus w must also be a maximizer, and if we repeat this transformation we eventually have w ∈ S . This concludes the proof.We now have all tools to prove Lemma 7.2. Proof of Lemma 7.2.

The data-structure is given by Algorithm 5. It is straightforward to checkthat

Initialize and

Update guarantees that the data-structure maintains the following invari-ants: • S k,‘ I ( k,‘ ) = [ m ] is a partition of [ m ]. • i ∈ I ( k,‘ ) if and only if 0 . ‘(cid:15)/ ≤ z i < . ‘ + 1) (cid:15)/ − (cid:15) ) k +1 ≤ e τ i ≤ (1 − (cid:15) ) k . • w ( k,‘ ) = A > G · I ( k,‘ ) ∈ R n . • p = P i exp( λz i ) + exp( − λz i ) = Φ( z ). 71oreover, our rounding of z, e τ gives O ( (cid:15) − log − − (cid:15) ( n/m )) dimensional representations becauseof the assumption 0 . ≤ z ≤ n/m ≤ e τ ≤

2, and is simply O ( (cid:15) − log n ).Next, let us analyze the function Query . We can interpret the decomposition S k,‘ I ( k,‘ ) =[ m ] as a decomposition of the form S t I ( t ) = [ m ] by renaming the indices. Recall z, τ arelow dimensional approximations of z, e τ with each entry rounded. It is straightforward to seethe vector x constructed in the ﬁrst line of Query is just the low dimensional representationof ∇ Φ( z ) where each entry is scaled appropriately by the | I ( k,‘ ) | ’s, so the x we constructedcorresponds to xg in Lemma 7.5. Similarly, the v vector constructed in Query is the lowdimensional representation of √ τ scaled appropriately to serve as the gv in Lemma 7.5. Thusby Lemma 7.5 the vector s obtained in Query is the vector such that s := P t s ( t ) I ( t ) ∈ R m satisﬁes h s , ∇ Φ( z ) i = max C k√ τy k + k y k ∞ ≤ h∇ Φ( z ) , y i , That is, s = ∇ Φ( z ) [ ( τ ) , and the returned s is a low dimensional representation of it. Moreover, P k,l s k,‘ w ( k,‘ ) is exactly A > G ∇ Φ( z ) [ ( τ ) because we maintained w ( k,‘ ) = A > G · I ( k,‘ ) ∈ R n . Complexity

The complexity of

Initialize is dominated by the initialization of w ( k,‘ ) ’s, whichtakes O (nnz( A )) since we just partition the (scaled) rows of A into buckets according toour rounding and sum up the vectors in each bucket. The complexity of Update ( i, · , · , · ) is O ( k e > i A k ), i.e. the cost of maintaining w ( k,‘ ) = A > G · I ( k,‘ ) ∈ R n . The time for Potential is O (1) since it just returns the value p .As to the complexity of Query , as we work with O ( (cid:15) − log n ) dimensional vectors in Query ,the time complexity according to Lemma 7.5 is O ( (cid:15) − log n log( (cid:15) − log n )) for computing themaximizer s . Constructing A > G ( ∇ Φ( z )) [ ( τ ) = P k,‘ w ( k,‘ ) s ( k,‘ ) takes O ( n(cid:15) − log n ) time, whichsubsumes the ﬁrst complexity, assuming (cid:15) is polynomial in 1 /n . Thus Query takes O ( n(cid:15) − log n )time. In this section we discuss the data structure that eﬀectively maintains an approximation of thesum of all gradients computed so far, which is captured by Lemma 7.6. The input to the datastructure is the O ( (cid:15) − log n ) dimensional representations of the gradients of each iteration, whichis computed by the data structure of the previous subsection. At the end of this subsection wewill combine Lemma 7.2 and the following Lemma 7.6 to obtain the main result Theorem 7.1,which allows us to eﬃciently maintain an approximation x of the primal solution. Lemma 7.6.

There exists a deterministic data-structure that supports the following operations • Initialize ( x (init) ∈ R m , g ∈ R m , ( I k ) ≤ k ≤ K , (cid:15) > : The data-structure initialized on thegiven vectors x (init) , g ∈ R m , the partition S Kk =1 I k = [ m ] where K = O ( (cid:15) − log n ) , and theaccuracy parameter (cid:15) > in O ( m ) time. • Scale ( i ∈ [ m ] , a ∈ R ) : Sets g i ← a in O (log n ) amortized time. • Move ( i ∈ [ m ] , k ∈ [1 , K ]) : Moves index i to set I k in O (log n ) amortized time. • Query ( s ∈ R K , h ∈ R m ) : Let g ( ‘ ) be the state of vector g during the ‘ -th call to Query and let s ( ‘ ) and h ( ‘ ) be the input arguments of the respective call. The vector h will alwaysbe provided as a sparse vector so that we know where are the non-zeros in the vector.Deﬁne y ( ‘ ) = G ( ‘ ) P Kk =1 I ( ‘ ) k s ( ‘ ) k and x ( t ) = x (init) + P t‘ =1 h ( ‘ ) + y ( ‘ ) , then the t -th call to uery returns a vector x ≈ (cid:15) x ( t ) . After T calls to Query , the total time of all T callsis bounded by O T K + log n · T X ‘ =0 k h ( ‘ ) k + T log n · T X ‘ =1 k y ( ‘ ) /x ( ‘ − k /(cid:15) ! . The vector x ∈ R m is returned as a pointer and additionally a set J ⊂ [ m ] is returned thatcontains the indices where x changed compared to the result of the previous Query call. • ComputeExactSum () : Returns the current exact vector x ( t ) in O ( m log n ) time. The implementation of the data structure is given in Algorithm 6. At a high level, eachiteration t (i.e. Query call) the vector x accumulates an update of h ( t ) + y ( t ) . For the h component, we always include the update to our approximation x as soon as we see h . This isbecause the h provided by our IPM will be sparse, and we can aﬀord to spend e O ( k h k ) eachiteration to carry out the update. For the y component in the update, we use a "‘lazy"’ updateidea. In particular, we maintain a vector b ‘ ∈ R m where b ‘ i stores the most recent iteration t (i.e. Query call) when x i is updated to be the exact value x ( t ) i for each index i . We update x i to be the exact value (in ComputeX ) whenever h i is non-zero, g i is scaled, or i is moved to adiﬀerent set I k . If none of these events happen, the accumulated update to x i since iteration b ‘ i will have a very simple form, so we just store the accumulated updates up to each iteration t ina vector f ( t ) . To detect when the accumulated update on x i has gone out of the approximationbound (cid:15)x i since the last time x i is updated, we use ∆ ( high ) i and ∆ ( low ) i to store the upper andlower boundary of the approximation range around x i , and update x i once the accumulatedupdates goes beyond the range. We proceed with the proof of our result. Proof of Lemma 7.6.

We start by analyzing the correctness.

Invariant

Let s ( t ) , h ( t ) , g ( t ) , I ( t ) k be the state of s, h, g, I k during the t -th call to Query andby deﬁnition of x ( t ) we have for any index i that x ( t ) i = x (init) i + t X ‘ =1 g ( ‘ ) i K X k =1 s ( ‘ ) k i ∈ I ( ‘ ) k ! + h ( ‘ ) i . It is easy to check that b ‘ i always store the most recent iteration when x i is updated by Com-puteX ( i, h i ). We ﬁrst prove by induction that this update is always calculated correct, that is,the data-structure maintains the invariant x i = x ( b ‘ i ) i .We see from Line 27 that the data-structure maintains f ( t ) = P tk =1 s ( k ) . Further note that ComputeX ( i, h i ) is called whenever h i is non-zero (Line 29), g i is changed, or i is moved to adiﬀerent I k . Thus if b ‘ i < t , we know none of these events happened during iteration ‘ ∈ ( b ‘ i , t ]and the only moving part is the s ( ‘ ) ’s over these iterations, which is exactly f ( t ) − f ( b ‘ i ) . Thus,if k is the set I k where i belongs to over iterations ( b ‘ i , t ], the execution of Line 11 gives g i · ( f ( t ) k − f ( b ‘ i ) k ) + h ( t ) i = g i t X ‘ = b ‘ i +1 s ( ‘ ) k + h ( t ) i = h ( t ) i + t X ‘ = b ‘ i +1 g ( ‘ ) i s ( ‘ ) k = h ( t ) i + t X ‘ = b ‘ i +1 g ( ‘ ) i K X k =1 s ( ‘ ) k i ∈ I ( ‘ ) k ! = t X ‘ = b ‘ i +1 g ( ‘ ) i K X k =1 s ( ‘ ) k i ∈ I ( ‘ ) k ! + h ( ‘ ) i where the ﬁrst equality uses f ( t ) = P t‘ =1 s ( ‘ ) and the second equality uses g ( ‘ ) i = g ( t ) i for all b ‘ i < ‘ ≤ t , because ComputeX ( i, h i ) is called whenever g i is changed. The third equality isbecause ComputeX ( i, h i ) is called whenever i is moved to a diﬀerent set, so i ∈ I ( ‘ ) k for the same73 lgorithm 6: Algorithm for accumulating G ∇ Φ( v ) [ (Lemma 7.6) members I , ..., I K ; // Partition S k I k = [ m ] t ∈ N , x ∈ R m ; // Query counter and approximation of x ( t ) b ‘ ∈ N m ; // b ‘ i is value of t when we last update x i ← x i f ( t ) ∈ R K ; // Maintain f ( t ) = P tk =1 s ( k ) ∆ ( high ) , ∆ ( low ) ∈ R m ; // Maintain ∆ i = f ( b ‘ i ) k ± | (cid:15)x i / (10 g i ) | if i ∈ I k procedure Initialize ( x (init) ∈ R m , g ∈ R m , ( I k ) ≤ k ≤ K ) x ← x (init) , ( I k ) ≤ k ≤ K ← ( I k ) ≤ k ≤ K , t ← f (0) ← ~ K , w ← ~ m , g ← g private procedure ComputeX ( i, h i ) Let k be such that i ∈ I k x i ← x i + g i · ( f ( t ) k − f ( b ‘ i ) k ) + h i b ‘ i ← t private procedure UpdateDelta ( i ) Let k be such that i ∈ I k . ∆ ( high ) i ← f ( b ‘ i ) k + | (cid:15)x i / (10 g i ) | ∆ ( low ) i ← f ( b ‘ i ) k − | (cid:15)x i / (10 g i ) | procedure Move ( i ∈ [ m ] , k ) ComputeX ( i, Move index i to set I k UpdateDelta ( i ) private procedure Scale ( i, a ) ComputeX ( i, g i ← a UpdateDelta ( i ) procedure Query ( s ∈ R K , h ∈ R m ) t ← t + 1, J ← ∅ f ( t ) ← f ( t − + s for i such that h i = 0 do ComputeX ( i, h i ), UpdateDelta ( i ), J ← J ∪ { i } for k = 1 , . . . , K do for i ∈ I k with f ( t ) k > ∆ ( high ) i or f ( t ) k < ∆ ( low ) i do ComputeX ( i, UpdateDelta ( i ), J ← J ∪ { i } return x , J procedure ComputeExactSum () for i ∈ [ m ] and b ‘ i < t do ComputeX ( i, UpdateDelta ( i ) ; return xk for all b ‘ i < ‘ ≤ t . The last equality uses h ( ‘ ) i = 0 for b ‘ i < ‘ < t , because ComputeX ( i, h i ) iscalled whenever h i is non-zero. Thus by induction over the number of calls to ComputeX ( i, h i ),when b ‘ i is increased to t we have x i = x ( t ) i = x (init) i + t X ‘ =1 g ( ‘ ) i K X k =1 s ( ‘ ) k i ∈ I ( ‘ ) k ! + h ( ‘ ) i = x ( t ) i , so the invariant is always maintained. 74 orrectness of Query. We claim that the function

Query returns a vector x such that forall i x i ≈ (cid:15) x ( t ) i := x (init) i + t X ‘ =1 g ( ‘ ) i K X k =1 s ( ‘ ) k i ∈ I ( ‘ ) k ! + h ( ‘ ) i . Given the invariant discussed above, we only need to guarantee

ComputeExact ( i, h i ) is calledwhenever the approximation guarantee is violated for some i . Moreover, same as when weproved the invariant above, we only need to guarantee this in the case that since iteration b ‘ i , h i is always 0, g i remains constant and i remains in the same I k for some k . Thus, the task isequivalent to detect whenever (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) t X ‘ = b ‘ i +1 g ( ‘ ) i s ( ‘ ) k (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) = | g i · ( f ( t ) k − f ( b ‘ i ) k ) | > (cid:15) | x i | . which is the same as f ( t ) k / ∈ [ f ( b ‘ i ) k − | ( (cid:15)x i ) / (10 g i ) | , f ( b ‘ i ) k + | ( (cid:15)x i ) / (10 g i ) | ]Note that the lower and upper limits in the above range are exactly ∆ ( inc ) i and ∆ ( dec ) i asmaintained by UpdateDelta ( i ), which will be called whenever any of the terms involved inthe calculation of these limits changes. Thus Line 32 makes sure that we indeed maintain x i ≈ (cid:15) x ( t ) i ∀ i. Also it is easy to check the returned set J contains all i ’s such that x i changed since last Query . Complexity

The call to

Initialize takes O ( m + K ) as we initialize a constant number of K and m dimensional vectors, and this reduces to O ( m ) since there can be at most m non-empty I k ’s. A call to ComputeX takes O (1) time.To implement Line 32 eﬃciently without enumerating all m indices, we maintain for each k ∈ [ K ] two sorted lists of the i ’s in I k , sorted by ∆ ( high ) i and ∆ ( low ) i respectively. Maintaining thesesorted lists results in O (log n ) time per call for UpdateDelta . Hence

Move and

Scale alsorun in O (log n ) time. To implement the loop for Line 32 we can go through the two sorted lists inorder, but stop as soon as the check condition no longer holds. This bounds the cost of the loopby O ( K ) plus O (log n ) times the number of indices i satisfying f ( t ) k > ∆ ( high ) i or f ( t ) < ∆ ( low ) i , i.e. | f ( t ) k − f ( b ‘ i ) k | > Θ( (cid:15)x ( b ‘ i ) i ). Note if a ComputeX and

UpdateDelta is triggered by this conditionfor any i , h i must be 0 during ( b ‘ i , t ] iterations. Thus, let z ( t ) := x (init) + P t‘ =1 G ( ‘ ) P k I ( ‘ ) k s ( ‘ ) k ,we can rewrite that condition as | z ( t ) i − z ( b ‘ i ) i | > Θ( | (cid:15)x ( b ‘ i ) i | ). Throughout T calls to Query , wecan bound the total number of times where i satisﬁes | z ( t ) i − z ( b ‘ i ) i | > Θ( | (cid:15)x ( b ‘ i ) i | ) by O T T X ‘ =1 k G ( ‘ ) ( X k I ( ‘ ) k s ( ‘ ) k ) /x ( ‘ − k /(cid:15) ! . The number of times

ComputeX and

UpdateDelta are triggered due to h ( t ) i = 0 is k h ( t ) k each iteration, and updating f ( t ) takes O ( K ) time. So the total time for T calls to Query canbe bounded by O ( T K + log n · T X ‘ =0 k h ( ‘ ) k + log n · T T X ‘ =1 k G ( ‘ ) ( X k s ( ‘ ) I ( ‘ ) k k ) /x ( ‘ − k /(cid:15) ) . The time for

ComputeExactSum is O ( m log n ) since it just calls ComputeX and

Updat-eDelta on all m indices. 75 roof of Theorem 7.1. The data-structure for Theorem 7.1 follows directly by combining Lemma 7.2and Lemma 7.6. The result for

QueryProduct is obtained from Lemma 7.2, and the resultfor

QuerySum is obtained from Lemma 7.6 using the vector s ∈ R K returned by Lemma 7.2as input to Lemma 7.6.Note that for our IPM variant using the steepest descent direction in ‘ norm, we use asimple normalized ∇ Φ( z ) / k∇ Φ( z ) k instead of ( ∇ Φ( z )) [ ( e τ ) . Here ∇ Φ( z ) / k∇ Φ( z ) k = argmax k w k ≤ h∇ Φ( z ) , w i is analogous to ( ∇ Φ( z )) [ ( τ ) = argmax k w k τ + ∞ ≤ h∇ Φ( z ) , w i , as these are both steepest ascent directions of ∇ Φ( z ) but in diﬀerent norms. It is straightforwardto see how to simplify and extend the data structure of Theorem 7.1 to work for this setting.We state without proof the following variant. Note in this case we only need to round z so thelow dimensional representation is in O ( (cid:15) − ) instead of O ( (cid:15) − log n ) dimension. Theorem 7.7.

There exists a deterministic data-structure that supports the following operations • Initialize ( A ∈ R m × n , x (init) ∈ R m , g ∈ R m , z ∈ R m , (cid:15) > : The data-structure pre-processes the given matrix A ∈ R m × n , vectors x (init) , g, z ∈ R m , and accuracy parameter (cid:15) > in e O (nnz( A )) time. The data-structure assumes . ≤ z ≤ . • Update ( i ∈ [ m ] , a ∈ R , c ∈ R ) : Sets g i ← a and z i ← c in O ( k e > i A k ) time. Thedata-structure assumes . ≤ z ≤ . • QueryProduct () : Returns A > G ∇ Φ( z ) / k∇ Φ( z ) k ∈ R n for some z ∈ R m with k z − z k ∞ ≤ (cid:15) . Every call to QueryProduct must be followed by a call to

QuerySum , andwe bound their complexity together (see

QuerySum ). • QuerySum ( h ∈ R m ) : Let v ( ‘ ) be the vector G ∇ Φ( z ) / k∇ Φ( z ) k used for the result ofthe ‘ -th call to QueryProduct . Let h ( ‘ ) be the input vector h given to the ‘ -th call to QuerySum . We deﬁne x ( t ) := x (init) + t X ‘ =1 v ( ‘ ) + h ( ‘ ) . Then the t -th call to QuerySum returns a vector x ∈ R m with x ≈ (cid:15) x ( t ) .Assuming the input vector h in a sparse representation (e.g. a list of non-zero entries),after T calls to QuerySum and

QueryProduct the total time for all calls together isbounded by O T n(cid:15) − log(1 /(cid:15) ) + log n · T X ‘ =0 k h ( ‘ ) k + T log n · T X ‘ =1 k v ( ‘ ) /x ( ‘ − k /(cid:15) ! . The output x ∈ R m is returned in a compact representation to reduce the size. In particu-lar, the data-structure returns a pointer to x and a set J ⊂ [ m ] of indices which speciﬁeswhich entries of x have changed between the current and previous call to QuerySum . • ComputeExactSum () Returns the exact x ( t ) in O ( m log n ) time. • Potential () Returns Φ( z ) in O (1) time. Minimum Weight Perfect Bipartite b -Matching Algorithms In this section we prove the main result: a nearly linear time algorithm for minimum weightperfect bipartite b matching. The exact result proven in this section is the following theorem. Theorem 8.1.

For b, c ∈ Z m , we can solve minimum weight perfect bipartite b -matching forcost vector c in e O (( m + n . ) log ( k c k ∞ k b k ∞ )) time. We show how to combine the tools from Sections 4 to 7 to obtain our fast b -matchingalgorithm. In Section 4.2 and Section 4.3 we present two algorithms that each encapsulate avariant of the IPM. These algorithms only specify which computations must be performed, butthey do not specify how they must be implemented. So our task is to use the data structuresfrom Sections 5 to 7 to implement these IPMs.This section is split into ﬁve subsections: We ﬁrst consider the simpler, but slower IPMsfrom Section 4.2, based on the log-barrier method. This method is formalized via Algorithm 2.Line 11 of Algorithm 2, requires random sampling to sparsify the solution of some linearsystem. Consequently, our ﬁrst task (Section 8.1) is to use the data structure from Section 5 toimplement this sampling procedure.The next task is to implement the IPM of Algorithm 2, which is done in Section 8.2.Note that Algorithm 2 only formalized a single step of the IPM, and the complete methodis formalized by combining Algorithm 2 with Algorithm 1. So the next task (Section 8.3) is toimplement Algorithm 1 eﬃciently using our data structures.We then use our implementation of the IPM in Section 8.4 to obtain a e O ( n √ m ) timealgorithm. At last, in Section 8.5, we show which small modiﬁcation must be performed toreplace the slower log-barrier based IPM with the faster leverage score based IPM of Section 4.3.This is also where we prove Theorem 8.1. Our ﬁrst task is to implement the random sampling of Line 11 of Algorithm 2. Let A ∈ R m × n , g, g ∈ R m ≥ , and h ∈ R n , then for the random sampling we want to sample each i ∈ [ m ] withsome probability q i such that q i ≥ min { , √ m (( GA h ) i / k GA h k + 1 /m ) + C · σ ( G A ) i log( m/ ( (cid:15)r )) γ − } , where C is some large constant and γ = 1 / polylog n . As this sampling is performed in eachiteration of the IPM, it would be prohibitively expensive to explicitly compute the leveragescores σ ( G A ) or the vector GA h , as that would require Ω( m ) time. Consequently, instead wewill use the data structure of Lemma 5.1 to eﬃciently perform this sampling. Lemma 8.2.

Consider a call to

SamplePrimal ( K ∈ R > , D (sample ,σ ) , D (sample) , h ∈ R n ) (Al-gorithm 7). Let GA be the matrix represented by D (sample) (Lemma 5.1), and G A be the matrixrepresented by D (sample ,σ ) (Lemma 5.1), then each R i,i is /q i with some probability q i and zerootherwise, where q i ≥ min { , K √ m (( GA h ) i / k GA h k + 1 /m ) + C · σ ( G A ) i log( m/ ( (cid:15)r )) γ − } for some large constant C > . Further, w.h.p., there are at most e O ( n log( m/ ( (cid:15)r )) log W ) non-zero entries in R and the complexity of SamplePrimal can be bounded by e O ( n log( m/ ( (cid:15)r )) log W ) , where W is the ratio of largest to smallest (non-zero) entry in G and in G .Proof. We start by analyzing the vector u, v, w . Afterwards we analyze the distribution of thematrix R . At last we bound the number of non-zero elements in R and the complexity of thealgorithm. 77 lgorithm 7: Sampling Algorithm for the IPM procedure SamplePrimal ( K ∈ R > , D (sample ,σ ) , D (sample) , h ∈ R n ) /* D (sample ,σ ) , D (sample) instances of Lemma 5.1 */ I u ← D (sample ,σ ) . LeverageScoreSample (3 C log( m/ ( (cid:15)r )) γ − ) I v ← D (sample) . Sample ( h, K √ m · (16 log ( n ))) I w ⊂ [ m ], where P [ i ∈ I w ] = 3 K/ √ m independently for each i . I ← I u ∪ I v ∪ I w Let u, v, w be ~ m . /* Set v i = P [ i ∈ I v ] , u i = P [ i ∈ I u ] , w i = P [ i ∈ I w ] for i ∈ I */ v I ← D (sample) . Probability ( I, h, K √ m · (16 log ( n ))) u I ← D (sample ,σ ) . LeverageScoreBound (3 C log( m/ ( (cid:15)r )) γ − , I ) w i ← K/ √ m for i ∈ I R ← m × m for i ∈ I do R i,i ← / min { , u i + v i + w i } with probability min { ,u i + v i + w i } − (1 − u i )(1 − v i )(1 − w i ) return RDistribution of vectors u, v, w By guarantee of Lemma 5.1 the methods D (sample) . Probability and D (sample ,σ ) . LeverageScoreBound return the sampling probabilities used by D (sample) . Sample and D (sample ,σ ) . LeverageScoreSample . Thus the vectors u, v satisfy u i = P [ i ∈ I u ] with probability P [ i ∈ I ] and 0 otherwise. v i = P [ i ∈ I v ] with probability P [ i ∈ I ] and 0 otherwise.Further by Lemma 5.1 we have P [ i ∈ I u ] ≥ C · σ ( G A ) log( m/ ( (cid:15)r )) γ − and P [ i ∈ I v ] ≥ K √ m ( GA h ) i / k GA h k . For the vector w note that we have w i = 3 K/ √ m for each i ∈ I , and w i = 0 otherwise.Here the set I is distributed as follows P [ i ∈ I ] = P [ i ∈ I u ∪ Iv ∪ I w ]= 1 − (1 − P [ i ∈ I u ])(1 − P [ i ∈ I v ])(1 − P [ i ∈ I w ])= 1 − (1 − u i )(1 − v i )(1 − w i ) . Distribution of matrix R

Note that we have P [ i ∈ I ] ≥ min { , max { P [ i ∈ I u ] , P [ i ∈ I v ] , P [ i ∈ I w ] }}≥ min { , ( P [ i ∈ I u ] + P [ i ∈ I v ] + P [ i ∈ I w ]) / } = min { , ( u i + v i + w i ) / } , thus the probability used in Line 12 is well deﬁned (i.e. at most 1). Hence Line 12 ensures that P [ R i,i = 0] = P [ i ∈ I ] · min { , ( u i + v i + w i ) / } − (1 − u i )(1 − v i )(1 − w i )= min { , ( u i + v i + w i ) / } Further, the non-zero R i,i are set to exactly the inverse of that probability and we havemin { , ( u i + v i + w i ) / }≥ min n , Cσ ( G A ) i log( m/ ( (cid:15)r )) γ − + K/ √ m + K √ m ( GA h ) i / k GA h k o so R has the desired distribution. 78 parsity of R and complexity bound The number of nonzero entries of R can w.h.p. bebounded by e O ( n log( m/ ( (cid:15)r )) γ − log W ) + O ( K √ m log n ) + O ( K √ m log n )= e O (cid:0) n log ( m/ ( (cid:15)r )) log W + K √ m (cid:1) , via Lemma 5.1 and C = O (1).The complexity is dominated by the leverage score sample via Lemma 5.1 for a cost of e O ( n log ( m/ ( (cid:15)r )) log W ) . Here we show how to eﬃciently implement the log barrier based IPM, formalized by Algorithm 2.Note that, Algorithm 2 does not specify how to perform the computations, rather it simplyspeciﬁes a suﬃcient set of conditions for the method to perform as desired. Consequently, inthis section we show how to implement these computations using our data structures.To be more precise, consider Line 4 of Algorithm 2, which speciﬁes to ﬁnd some s ≈ (cid:15) s .The algorithm does not explain how to ﬁnd such an approximation, so we must implement thisourself. Note that in Algorithm 2, s is deﬁned incrementally via s (new) ← s + γ AH − A > X ∇ Φ( v ) / k∇ Φ( v ) k for any H ≈ A > XS − A and v ≈ xs/µ . However, here h := γ A > X ∇ Φ( v ) / k∇ Φ( v ) k is exactlythe return value of QueryProduct , promised by the data structure of Theorem 7.7. So wecan eﬃciently compute h in each iteration of the IPM by using Theorem 7.7.Next, we must compute h := H − h . This can be done by ﬁrst sparsifying the weightedLaplacian matrix L := ( A > XS − A ) via leverage score sampling . This sampling can be per-formed eﬃciently via the LeverageScoreSample method of the data structure of Lemma 5.1.After sparsifying the Laplacian L we can apply a Laplacian solver (Lemma 2.1) to compute h eﬃciently.At last, we must compute s + A h . Note however, that Algorithm 2 never requires the exactvector s , and it always suﬃces to just know some element-wise approximation s ≈ s . Hencethe problem we actually need to solve is to maintain an element-wise approximation of the sum s = s (init) + P Tt =1 A h ( t ) (this represents the vector s after T steps of the IPM), for a sequence ofvectors h (1) , ..., h ( T ) ∈ R n . That is exactly the guarantee of the data structure of Theorem 6.1.The only thing left to do is to maintain a similar approximation x ≈ x of the primal solutionvia Theorem 7.7. Then we have an eﬃcient implementation of Algorithm 2.A formal description of this implementation can be found in Algorithm 8 and we are leftwith proving that this algorithm does indeed implement all steps of Algorithm 2. It is diﬃcultto analyze the complexity per call to Algorithm 8, as many complexity bounds of the useddata structures only bound the overall time over several iterations. Consequently, we deferthe complexity analysis to the next section, where we analyze the overall complexity of severalconsecutive calls to Algorithm 8. For now we only prove the correctness of Algorithm 8 inLemma 8.3. Lemma 8.3.

Algorithm 8 implements Algorithm 2, assuming the data structures are initializedas stated in Algorithm 8.Proof.

As Algorithm 8 uses many diﬀerent data structures that share similar variable names(e.g variable g occurs in Lemma 5.1 and Theorems 6.1 and 7.7), we will use the notation D.var to refer to variable var of data structure D . For example D (sample) .g refers to variable g of datastructure D (sample) (which is an instance of Lemma 5.1). From a graph perspective this might also be known as sampling by the eﬀective resistances. lgorithm 8: Implementation of Algorithm 2 global variables /* Imported from Algorithm 9 */ D ( x, ∇ ) instance of Theorem 7.7, initialized on( A , x (init) , γx (init) , x (init) s (init) /µ (init) , γ/ D ( s ) instance of Theorem 6.1, initialized on ( A , s (init) , γ/ D (sample) instance of Lemma 5.1, initialized on ( A , /s (init) ) D (sample ,σ ) instance of Lemma 5.1, initialized on ( A , q x (init) /s (init) ) ∆ ∈ R n maintained to be A > x − b µ, µ ∈ R progress parameter µ of the IPM and its approximation µ ≈ γ/ µ x, s ∈ R m element-wise approximations x ≈ γ/ x and s ≈ γ/ s γ accuracy parameter procedure ShortStep ( µ (new) > /* Update µ and data structures that depend on it. */ if µ γ/ µ (new) then µ ← µ (new) for i ∈ [ m ] do D ( x, ∇ ) . Update ( i, γx i , ( x i s i ) /µ ); /* Leverage Score Sampling to sparsify ( A > XS − A ) with γ/ spectralapproximation */ I v ← D (sample ,σ ) . LeverageScoreSample ( cγ − log n ) for some large enoughconstant c > v ← D (sample ,σ ) . LeverageScoreBound ( cγ − log n, I v ) v i ← /v i for i with v i = 0 /* Perform ShortStep (Algorithm 2) */ h ← D ( x, ∇ ) . QueryProduct() // h = γ A > X ∇ Φ( v ) / k∇ Φ( v ) k h ← solve ( A > VA ) − ( h + ∆) with γ/ // h = H − ( h + ( A > x − b )) , δ r = S − A h R ← SamplePrimal (1 , D (sample) , D (sample ,σ ) , h ) x (tmp) , I x ← D ( x, ∇ ) . QuerySum ( − RXS − A h ) s (tmp) , I s ← D ( s ) . Add (( A > VA ) − h , γ/ ∆ ← ∆ + h − A > RXS − A h // Maintain ∆ = A > x − b /* Update x, s and data structures that depend on them. */ for i ∈ I x ∪ I s do if x (tmp) i γ/ x i or s (tmp) i γ/ s i then x i ← x (tmp) i , s i ← s (tmp) i D ( x, ∇ ) . Update ( i, γx i , ( x i s i ) /µ ) D (sample) . Scale ( i, /s i ) D (sample ,σ ) . Scale ( i, p x i /s i ) Assumptions

By Line 2 to Line 8 of Algorithm 8, we assume that the following assumptionshold true at the start of the ﬁrst call to

ShortStep (Algorithm 8). x ≈ γ/ x, s ≈ γ/ s, µ ≈ γ/ µ (61)∆ = A > x − b (62) D ( x, ∇ ) .g = γx, D ( x, ∇ ) .z = γxs/ ( µ ) , D ( x, ∇ ) .(cid:15) = γ/

16 (63) D ( sample,σ ) .g = √ xs − (64)80 ( sample ) .g = s − (65)We prove further below that these assumptions also hold true for all further calls to ShortStep (Algorithm 8). However, ﬁrst we prove that, if these assumptions are satisﬁed, then

ShortStep (Algorithm 8) performs the computations required by the IPM of Algorithm 2.

Algorithm 8 implements Algorithm 2

We argue the correctness line by line of Algo-rithm 2. Line 4 (Algorithm 2) is satisﬁed by assumption (61). Line 5 and Line 6 (Algorithm 2)asks us to compute g ← − γ ∇ Φ( v ) / k∇ Φ( v ) k for v ≈ γ xs/µ . While we do not compute g , our implementation does compute AX g as follows:Line 17 of Algorithm 8 computes h = γ A > X ∇ Φ( v ) / k∇ Φ( v ) k = − A > X g for some v ≈ γ/ D ( x, ∇ ) .z = xs/ ( µ ) ≈ γ/ xs/µ by the guarantees of Theorem 7.7 and assumptions(63).By Line 7 and Line 8 of Algorithm 2 we must obtain a matrix H ≈ γ A > XS − A . In ourimplementation Algorithm 8 this is done in Line 16, where a vector v is constructed with v i = 1 /p i with probability p i ≥ min { , c · σ ( X / S / A ) i log n } and v i = 0 otherwise.by the guarantee of Lemma 5.1 and assumption (64). Thus with high probability ( A > VA ) ≈ γ/ ( AXS − A ), so by applying a γ/ H − ≈ γ ( A > XS − A ) − For A := X / S / A , W := µ − XS , Line 9 of Algorithm 2 wants us to compute δ r = W − / AH − A > W / g + µ − / W − / AH − ( A > x − b )This is done implicitly in Line 18 of our implementation Algorithm 8. By assumption (62) wehave ∆ = A > x − b , thus Line 18 computes h with h = H − ( h + A > x − b ) , so S − A h = S − AH − ( h + A > x − b )= S − AH − ( γ AX ∇ Φ( v ) / k∇ Φ( v ) k + A > x − b )= ( XS /µ ) − / ( X / S − / A ) H − ( X / S − / A ) > ( XS /µ ) / ∇ Φ( v ) / k∇ Φ( v ) k + µ − / ( XS /µ ) − / ( X / S − / A ) H − ( A > x − b )= W − / AH − A > W / ∇ Φ( v ) / k∇ Φ( v ) k + µ − / W − / AH − ( A > x − b )=: δ r . Thus we have access to δ r via S − A h . 81ine 11 of Algorithm 2 wants us to construct a random diagonal matrix R where R i,i = 1 /q i with probability q i and R i,i = 0 otherwise, where q i ≥ min { , √ m (( δ r ) i / k δ r k + 1 /m ) + C · σ i ( X / S − / A ) log( m/ ( (cid:15)r )) γ − } By assumption (65) and (64) we have D (sample) .g = s − and D (sample ,σ ) .g = √ xs − , so byLine 19 Line 19 of Algorithm 8 returns R with the desired properties.Line 12 and Line 14 of Algorithm 2 wants us to compute x (new) ← x + X ( g − R δ r ) . By Theorem 7.7 and assumption (63) Line 20 of our implementation Algorithm 8 computes x (tmp) with x (tmp) ≈ γ/ x + X g − RXS − A h = x + X ( g − R δ r ) . Line 13 and Line 14 of Algorithm 2 asks us to compute s (new) ← s + S δ p δ p := W − / AH − A > W / g By Theorem 6.1 and assumption (61) Line 21 of our implementation Algorithm 8 computes s (tmp) with s (tmp) ≈ γ/ s + AH − h = s + AH − A > X g = s + S ( XS /µ ) − / ( X / S − / A ) H − ( X / S − / A ) > ( XS /µ ) / g = s + S W − / AH − A > W / g = s + S δ p Assumption on x , s , µ : Previously we assumed x ≈ γ/ x , s ≈ γ/ s , and µ ≈ γ/ . Here weshow that this assumption is true by induction, that is, if it was true before calling ShortStep ,then it is also true after executing

ShortStep .We previously argued that x (tmp) ≈ γ/ x (new) and s (tmp) ≈ γ/ s (new) . So by the FOR-loopof Line 23 we have x ≈ γ/ x (tmp) , and thus x ≈ γ/ x (new) . Likewise s ≈ γ/ s (new) .The assumption µ ≈ γ/ is true by Line 11. Assumption on D (sample ,σ ) , D (sample) , and D ( x, ∇ ) : All data structures that depend on x and s are updated, whenever an entry of x or s changes (see Line 23). So the assumptionthat D (sample ,σ ) .g = p x/s , D (sample) .g = 1 /s , D ( x, ∇ ) .z = xz/µ and D ( x, ∇ ) .g = γx , are all stillsatisﬁed after the execution of ShortStep .Likewise, whenever µ changes we set D ( x, ∇ ) .z = xz/µ . Assumption ∆ = A > x − b : If ∆ = A > x − b initially, then after Line 22 we have ∆ = A > x (new) − b for x (new) = x + X g − RXS − A h . Thus we always maintain A > x − b , whenever x changes. Here we implement the path following procedure Algorithm 1 using our data structures. Notethat Algorithm 1 consists of essentially a single FOR-loop, which calls

ShortStep (Algo-rithm 8), so the main task for the implementation is just the initialization of the data struc-tures used in Algorithm 8. Lemma 8.4 then analyzes the complexity of Algorithm 9, i.e. ourimplementation of Algorithm 1. 82 lgorithm 9:

Implementation of Algorithm 1 global variables D ( x, ∇ ) instance of Theorem 7.7 D ( s ) instance of Theorem 6.1 D (sample) instance of Lemma 5.1 D (sample ,σ ) instance of Theorem B.1 A ∈ R m × n , b ∈ R n linear constraints A > x = b ∆ ∈ R n maintained to be A > x − b µ, µ ∈ R progress parameter µ of the IPM and its approximation µ x, s ∈ R m element-wise approximations of x and s (cid:15), λ, γ, r ∈ R + accuracy parameters procedure PathFollowing ( x (init) , s (init) , µ (init) , µ (end) ) (cid:15) ← log n , λ ←

36 log(240 m/(cid:15) ) (cid:15) , γ ← (cid:15) λ , r ← (cid:15)γ √ m ∆ = A > x (init) − b µ ← µ (init) , µ ← µ (init) , x ← x (init) , s ← s (init) , τ ← τ /* Initialize data structures */ D ( x, ∇ ) . Initialize ( A , x (init) , x (init) , x (init) s (init) /µ (init) , γ/ D ( s ) . Initialize ( A , s (init) , γ/ D (sample) . Initialize ( A , /s (init) ) D (sample ,σ ) . Initialize ( A , q x (init) /s (init) ) /* Follow the Central Path */ while µ = µ (end) do for i = 1 , , ..., (cid:15)r do µ ← median( µ (end) , (1 − r ) µ, (1 + r ) µ ) ShortStep ( µ ) p ← D ( x, ∇ ) . Potential () p ← ∆ > ( A > GA ) − ∆, where g is a leverage score sample from D (sample ,σ ) . if p > exp( λ(cid:15) − ) or p > γ(cid:15) √ µ exp( − / then break ; if p > exp( λ(cid:15) − ) or p > γ(cid:15) √ µ exp( − / then Revert iterations of previous for loop /* Compute and return exact result */ x ← D ( x, ∇ ) . ComputeExactSum (), s ← D ( s ) . ComputeExact () return x, s Lemma 8.4.

The time complexity of Algorithm 9 is e O (cid:16) n √ m · | log µ (end) /µ (init) | · (log W + | log µ (end) /µ (init) | ) (cid:17) , where W is an upper bound on the ratio of largest to smallest entry in both x (init) and s (init) .Proof. We analyze the overall complexity by ﬁrst bounding the time for the initialization. Thenwe bound the total time spend on the most inner FOR-loop, which also includes all calls to

ShortStep (Algorithm 8).

Initialization

Algorithm 9 starts by computing ∆ = A > x − b , which takes e O ( m ) time. Thesubsequent initializations of the data structures of Theorems 6.1 and 7.1 and lemma 5.1, require e O ( m ) as well. Last line

In the last line of Algorithm 9, we compute the exact x and s values, which byTheorem 7.7 and Theorem 6.1 require e O ( m ) time.83 OR-loop

The most inner loop (Line 20) is repeated a total of K := O ( r − | log µ (end) /µ (init) | ) = e O ( √ m | log µ (end) /µ (init) | )times by Lemma 4.4. We now analyze the cost per iteration.The cost for D ( x, ∇ ) . Potential () is bounded by O (1). Computing ∆ > ( A > GA ) − ∆ canbe implemented via Laplacian Solver (Lemma 2.1), so the complexity is bounded by obtainingthe leverage score sample g , which also bound the number of nonzero entries. By Lemma 5.1the complexity is bounded by e O ( n log W ), where W is the ratio of largest to smallest entry in p x/s . By Lemma 4.41 we can bound log W by e O (log W + | log µ (end) /µ (init) | ). Thus computing∆ > ( A > VA ) − ∆ requires e O ( n (log W + | log µ (end) /µ (init) | )) time.We are left with analyzing the cost of ShortStep (Algorithm 8).

Updating the data structures

Performing D ( x, ∇ ) . Update for all i ∈ [ m ] in Line 11 of Algo-rithm 8 takes O ( m ) time, but it happens only once every O ( r/γ ) = e O ( √ m ) calls to ShortStep ,because µ changes by an (1 ± r ) factor between any two calls to ShortStep . The total costof this call to D ( x, ∇ ) . Update over all calls to

ShortStep together can thus be bounded by e O ( √ mK ).Next consider the calls to Update and

Scale of our various data structures in Line 24 ofAlgorithm 8. Note that throughout K calls to ShortStep , the entries of s can change at most O ( K /γ ) times by a factor of (1 ± γ/ k S − δ k < / x can change at most e O ( Kn (log( W ) + | log( µ (init) /µ (end) ) | ) + K /γ ) timesby a factor of (1 ± γ/ x changes by γ X ∇ Φ( v ) / k∇ Φ( v ) k − RXS − A h ,where for the ﬁrst term we have k X − γ X ∇ Φ( v ) / k∇ Φ( v ) k k < γ and the second term has atmost e O ( n (log( W ) + | log( µ (init) /µ (end) ) | )) non-zero entries.So the total number of times the condition in Line 24 is satisﬁed can be bounded by e O ( Kn (log( W )+ | log( µ (init) /µ (end) ) | )+ K /γ ). Thus the cost for all calls to Scale and

Update of the data structures is bounded by e O ( K + Kn (log( W ) + | log( µ (init) /µ (end) ) | )), as each callto Scale and

Update of any of the used data structures costs e O (1), and γ = e Ω(1). D (sample) and D (sample ,σ ) data structure As argued before, the ratio of largest to smallestweight of x and s can be bounded via Lemma 4.41. Hence performing the leverage scoresample to obtain v in Line 14 to Line 16 of Algorithm 8 requires at most e O ( n (log( W ) + | log( µ (init) /µ (end) ) | )) time by Lemma 5.1. Likewise, each call to SamplePrimal in Line 19of Algorithm 8 requires at most e O ( n (log( W ) + | log( µ (init) /µ (end) ) | )) time by Lemma 8.2, whichis also a bound on the number of non-zero entries in matrix R . This results in a total of e O ( nK (log( W ) + | log( µ (init) /µ (end) ) | )) time after all K calls to ShortStep . D ( x, ∇ ) data structure The total cost of D ( x. ∇ ) . QueryProduct and

QuerySum is boundedby e O ( Kn + Kn (log( W ) + | log( µ (init) /µ (end) ) | ) + K )by Theorem 7.7. This is because (i) the number of non-zero entries in − RXS − A h is boundedby e O ( Kn (log( W ) + | log( µ (init) /µ (end) ) | )), and (ii) the vector v ( ‘ ) and x ( ‘ − in Theorem 7.7 areexactly the vectors γ X ∇ Φ( v ) / k∇ Φ( v ) k and x in Algorithm 8 during the ‘ -th call to Short-Step . By x ≈ γ/ x we have k X − γ X ∇ Φ( v ) / k∇ Φ( v ) k k = O ( γ ) which yields the complexitybound above. D ( s ) data structure The total of all calls to D ( s ) . Add require at most e O ( K k S − δ s k + Kn (log W + | log( µ (init) /µ (end) ) | ))= e O ( K + Kn (log W + | log( µ (init) /µ (end) ) | ))by Theorem 6.1 and k S − δ s k = O (1) by Lemma 4.6.84 lgorithm 10: Minimum cost perfect b -matching algorithm global variables procedure MinimumCostPerfectBMatching ( G = ( U, V, E ) , b ∈ Z U ∪ V , c ∈ Z E ) /* Given minimum cost perfect b -matching instance on bipartite graph G with cost vector c . */ Construct G = ( U ∪ V ∪ { z } , E ), x , s , c , d as in Lemma 8.7. Let c ∈ R E with c u,v = c u,v and c u,z = c z,v = k b k k c k ∞ for all u ∈ U , v ∈ V . Modify cost vector c as in Lemma 8.10 to create unique optimum. Let A be the incidence matrix of G and rename d to b . /* Solve the linear program min c x for A > x = b and x ≥ . */ x, s ← PathFollowing ( x, s, , k c k ∞ k b k ∞ ) s ← s + c − c x, s ← PathFollowing ( x, s, k c k ∞ k b k ∞ , n · m ( k b k ∞ k c k ∞ ) ) x ← feasible solution via Lemma 8.9. /* x is now a feasible flow with c x ≤ OPT + m ( k b k ∞ k c k ∞ ) . */ Round entries of x to the nearest integer. return x Updating ∆ Updating ∆ in Line 22 takes O ( n + k R k ) = e O ( n (log( W )+ | log( µ (init) /µ (end) ) | ))time, where we bound k R k via Lemma 8.2. The same bound holds for computing RXS − A h in Line 20. After all calls to ShortStep this contributes a total cost of e O ( Kn (log( W ) + | log( µ (init) /µ (end) ) | )). Total complexity

In summary, the time for all calls to

ShortStep together is bounded by e O ( K + Kn (log W + | log( µ (init) /µ (end) ) | )= e O ( n √ m · | log( µ (end) /µ (init) ) | · (log W + | log( µ (end) /µ (init) ) | )) . This also subsumes the initialization cost. e O ( n √ m ) -Time Algorithm In this section we use the implemented IPM (Algorithm 9) to obtain an algorithm that solvesminimum weight bipartite b -matching in e O (( n √ m ) log W ) time. More accurately, we will provethe following theorem. Theorem 8.5.

Algorithm 10 ﬁnds a minimum weight perfect bipartite b -matching. On a graphwith n nodes, m edges, and edge cost vector c ∈ Z m , the algorithm runs in e O ( n √ m log ( k c k ∞ k b k ∞ )) time. Outline of Theorem 8.5

Algorithm 10 can be outlined as follows: ﬁrst we reduce theproblem to uncapacitated minimum cost ﬂows, so that we can represent the problem as a linearprogram of the form min A > x = b,x ≥ c > x, where A ∈ R m × n is an incidence matrix and b ∈ R n are is a demand vector.We must then construct an initial feasible ﬂow x ∈ R m and a feasible slack s ∈ R m of adual solution, such that x i s i ≈ µ for all i ∈ [ m ] and some µ ∈ R . Unfortunately, we are notable to construct such an initial pair ( x, s ) for the given edge cost vector c ∈ R m . So instead,we construct an initial pairs for some other cost vector c ∈ R m (Lemma 8.7).85e then use the IPM (Algorithm 9) to move this initial solution away from the optimalsolution towards the center of the polytope. Once our intermediate solution pair ( x, s ) is suf-ﬁciently close to the center, then it does no longer matter which cost vector was used. So weare then able to switch the cost vector c with the actual cost vector c (Lemma 8.8). Afterswitching the cost vector, we then use the IPM (Algorithm 9) again to move our solution pair( x, s ) towards the new optimal solution induced by c .Note that the obtained primal solution x is not feasible. Lemma 4.4 only guarantees that x is close to a feasible solution. So we then use Lemma 8.9 to turn x into a nearby feasiblesolution x .The obtained solution x is an almost optimal fractional solution. Via the Isolation-Lemma8.10, we can simply round each entry of x to the nearest integer to obtain an optimal integralsolution. Reduction to uncapacitated minimum ﬂow

We start the formal proof of Theorem 8.5by proving the reduction of minimum-cost bipartite b -matching to uncapacitated minimum costﬂows. Given a bipartite graph G = ( U, V, E ) and vector b ∈ N V , we construct the following ﬂowinstance. Deﬁnition 8.6.

Let G = ( U, V, E ) be a bipartite graph and b ∈ Z U ∪ V a vector with P u ∈ U b u − P v ∈ V b v = 0 (i.e. a b -matching instance).We deﬁne the corresponding starred ﬂow graph as G = ( V , E ) with nodes V = V ∪ { z } ,and directed edges E = E ∪ { ( u, z ) , ( z, v ) | u ∈ U, v ∈ V } and demand vector d ∈ R V with d v = b v for v ∈ V , d u = − b u for u ∈ U and d z = 0 . The reduction for minimum cost b -matching to uncapacitated minimum cost ﬂow is nowtrivially given. Given a cost vector c ∈ R E for the bipartite b -matching instance, we can deﬁnea cost vector c ∈ R E for the starred ﬂow graph G , where c u,v = c u,v , c u,z = c z,v = k b k k c k ∞ for all u ∈ U and v ∈ V . Then any integral minimum cost ﬂow on G with cost less than k b k k c k ∞ for cost vector c results in a minimum weight perfect b matching on G . On the otherhand, if no such ﬂow exists, then G has no perfect b -matching. Initial point

Next, we use the following Lemma 8.7 to construct an initial point ( x, s ) forthe IPM Algorithm 9 with x i s i = µ for some µ ∈ R and all i ∈ [ m ]. Lemma 8.7 is proven inAppendix A.1. Lemma 8.7.

We are given a minimum weight perfect b -matching instance on graph G =( U, V, E ) with cost vector c ∈ R E .Let G = ( V , E ) with demand vector d be the corresponding starred ﬂow graph (see Deﬁ-nition 8.6) and let n be the number of nodes of G and m the number of edges.For τ ( x, s ) := 1 , we can construct in O ( m ) time a cost vector c ∈ R E for G and a feasibleprimal dual solution pair ( x, s ) for the minimum cost ﬂow problem on G with cost vector c ,where the solution satisﬁes xs = τ ( x, s ) . The cost vector c satisﬁes k b k − ∞ ≤ c ≤ n. Switching the cost vector

Lemma 8.7 returns some cost vector c that diﬀers from c . Asoutlined before and as can be seen in Line 8 of Algorithm 10, we will switch the cost vectorsonce our point ( x, s ) is far enough away from the optimum. Lemma 8.8, which formally showsthat this switching is possible, is proven in Appendix A.2. Lemma 8.8.

Consider the function τ ( x, s ) and a uncapacitated min-cost-ﬂow instance as con-structed in Lemma 8.15 or Lemma 8.7 and let A be the incidence matrix of the underlying graph G . Let c be the cost vector constructed in Lemma 8.15 and let c be any other cost vector with k c k ∞ ≥ . ssume we have a primal dual solution pair ( x, s ) for cost vector c and demand d with k A x − d k ( AXS − A ) − ≤ √ µ/ and xs ≈ (cid:15)/ µτ ( x, s ) for any (cid:15) ≤ and µ ≥ k c k ∞ k d k ∞ (cid:15) .If we replace the cost vector c with cost vector c then ( x, s + c − c ) is a primal dual solutionpair for the new min-cost-ﬂow instance with cost vector c and we have x ( s + c − c ) ≈ (cid:15)/ µτ ( x, s + c − c ) . Rounding to an optimal integral solution

After running the IPM for the cost vector c (i.e. Line 9 of Algorithm 10) we obtain a near optimal and near feasible fractional solution x . The following Lemma 8.9 shows that we can convert this near feasible solution to a trulyfeasible solution that is still near optimal. The proof is deferred to Appendix A.3. Lemma 8.9.

Consider any (cid:15) > and an uncapacitated min-cost ﬂow instance on a starredﬂow graph (Deﬁnition 8.6) with cost vector c ∈ R E , demand vector d ∈ R V and the propertythat any feasible ﬂow f satisﬁes f ≤ k d k ∞ .Assume we are given a primal dual solution pair ( x, s ) , with xs ≈ / µτ ( x, s ) and µ · k A x − d k A > XS − A ) − ≤ for µ ≤ (cid:15) n .Let δ = (cid:15) k c k ∞ k d k ∞ m ) , then in e O ( m log δ − ) time we can construct a feasible ﬂow f with c > f ≤ OPT + (cid:15) , where

OPT is the optimal value of the min-cost ﬂow instance.

At last, we convert the feasible near optimal fractional solution x to a truly optimal integralsolution via the Isolation-Lemma of [DS08]. Lemma 8.10 below shows that, if the ﬂow instanceis integral and the set of optimal ﬂows has congestion at most W (where the congestion of a ﬂowrefers to the maximum ﬂow value over all edges), then we can obtain an optimal feasible ﬂowby (i) slightly perturbing the problem, (ii) solving the perturbed problem approximately with1 / poly( mW ) additive error, and (iii) rounding the ﬂow on each edge to nearest integer. Weshow in Appendix A.4 how Lemma 8.10 is obtained via the Isolation-Lemma of [DS08, KS01]. Lemma 8.10.

Let

Π = (

G, b, c ) be an instance for minimum-cost ﬂow problem where G isa directed graph with m edges, the demand vector b ∈ {− W, . . . , W } V and the cost vector c ∈ {− W, . . . , W } E . Further assume that all optimal ﬂows have congestion at most W .Let the perturbed instance e Π = (

G, b, e c ) be such that e c e = c e + z e where z e is a randomnumber from the set n m W , . . . , mW m W o . Let e f be a feasible ﬂow for e Π whose cost is at most OPT( e Π) + m W where OPT( e Π) is the optimal cost for problem e Π . Let f be obtained byrounding the ﬂow e f on each edge to the nearest integer. Then, with probability at least / , f is an optimal feasible ﬂow for Π . Proof of Theorem 8.5

We start by proving the correctness of Algorithm 10 in Lemma 8.11 byusing the previously stated Lemmas 8.7 to 8.10. Then we analyze the complexity of Algorithm 10in Lemma 8.12. Lemma 8.11 and Lemma 8.12 together form the proof of Algorithm 10.

Lemma 8.11.

With high probability Algorithm 10 returns an optimal minimum weight perfectbipartite b -matching.Proof. Given the bipartite graph G and vector b , Lemma 8.7 constructs the correspondingstarred ﬂow graph with demand vector d ∈ R n , a cost vector c ∈ R m , and a feasible primaldual pair ( x, s ) with xs = 1. Thus PathFollowing in Line 7 returns x, s with xs ≈ (cid:15)/ k c k ∞ k b k ∞ =: µ , and µ − / k A > x − b k ( AXS − A ) − ≤ (cid:15)γ < / c with c , suchthat xs ≈ (cid:15)/ τ ( x, s ). Then PathFollowing in Line 9 returns x, s with xs ≈ (cid:15)/ n · m ( k b k ∞ k c k ∞ ) =: µ , and µ / k A > x − b k ( AXS − A ) − ≤ (cid:15)γ < / x into a feasible ﬂow on graph G ,with c x ≤ OPT + 112 m ( k b k ∞ k c k ∞ ) . Further we know that every optimal ﬂow on that graph can have congestion at most k b k ∞ , asthe congestion is bounded by the demand of the incident nodes. Thus by Lemma 8.10, we canround the entries of x to the nearest integer and obtain an optimal solution for the minimumcost ﬂow with cost vector c . Since c u,z = c z,v = n k c k ∞ , the optimal solution will not use anyof the edges ( u, z ) or ( z, v ) for any u ∈ U or v ∈ V . Thus the obtained ﬂow x can be interpretedas a perfect b -matching, because the demands d u = − b u and d v = b v match the given b -vector.Note that we only obtain the optimal solution with probability at least 1 /

2, but we can repeatthe algorithm O (log n ) times and return the minimum result, which is the optimal solution withhigh probability. Lemma 8.12.

Algorithm 10 runs in e O ( n √ m log ( k c k ∞ k b k ∞ )) time.Proof. Constructing the initial point via Lemma 8.15 takes O ( m ) time. Next, moving awayfrom the optimal solution, switching the cost vector, and then moving towards the optimalsolution takes O ( n √ m log ( k c k ∞ k b k ∞ )) time by Lemma 8.4, as both | log µ (end) /µ (init) | andlog W are bounded by e O (log( k c k ∞ k b k ∞ )). At last, we construct a feasible x via Lemma 8.9 in e O ( m log( k c k ∞ k b k ∞ )) time and rounding each entry of x to the nearest integer takes only O ( m )time. In this section we improve the previous e O ( n √ m )-time algorithm of Theorem 8.5 to run in e O ( m + n . )-time instead. For that we will modify the ShortStep algorithm (Algorithm 8) touse the faster improved IPM of Section 4.3 (Algorithm 3).Note that Algorithm 3 (leverage score based IPM) is very similar to Algorithm 2 (log barrierbased IPM). Hence when implementing Algorithm 3, we can start with the implemention ofAlgorithm 2 from Section 8.2 and perform a few modiﬁcations. The main diﬀerence is thatAlgorithm 3 uses ( ∇ Φ( v )) [ ( τ ) for v ≈ xs/ ( µτ ( x, s )) where τ ( x, s ) = σ ( √ XS − A ) + n/m , whileAlgorithm 2 uses ∇ Φ( v ) / k∇ Φ( v ) k for v ≈ xs . Previously, Algorithm 8 (the implementationof Algorithm 2) used Theorem 7.7 to maintain ∇ Φ( v ) / k∇ Φ( v ) k . We now replace this datastructure by Theorem 7.1 to maintain ( ∇ Φ( v )) [ ( τ ) instead.The next diﬀerence in implementation is, that we must have an accurate approximation τ ≈ τ ( x, s ). Note that, while Lemma 5.1 does provide upper bounds on the leverage scores, itdoes not yield a good approximation. So we must use a diﬀerent data structure for maintaining τ ≈ τ ( x, s ). Such a data structure was previously given by [BLSS20], and we state these resultsin Appendix B. This data structure requires that x, s are very accurate approximations of x, s ,so another diﬀerence is that we use much smaller accuracy parameters than before. However,the accuracy is smaller by only some polylog( n )-factor, so it will not aﬀect the complexitybesides an additional e O (1)-factor. 88 lgorithm 11: Implementation of Algorithm 3 global variables Same variables as in Algorithm 8, and additionally D ( x, ∇ ) instance of Theorem 7.1 using γ/ (10 log n ) accuracy D ( s ) instance of Theorem 6.1 using γ/ (10 log n ) accuracy D ( τ ) instance of Theorem B.1 with accuracy γ/ τ ∈ R m element-wise approximations of τ ( x, s ) T ≥ √ n upper bound on how often ShortStep is called α =

14 log( mn ) procedure ShortStep ( µ (new) > /* Update µ and data structures that depend on it. */ if µ γ/ µ (new) then µ ← µ (new) for i ∈ [ m ] do D ( x, ∇ ) . Update ( i, γx i , ( x i s i ) / ( µτ i )); /* Leverage Score Sampling to sparsify ( AX / SA ) with γ/ spectralapproximation */ I v ← D (sample ,σ ) . LeverageScoreSample ( cγ − log n ) for some large enoughconstant c > v ← D (sample ,σ ) . LeverageScoreBound ( cγ − log n, I v ) v i ← /v i for i with v i = 0 /* Perform ShortStep (Algorithm 3) */ h ← D ( x, ∇ ) . QueryProduct() // h = γ AX ∇ Φ( v ) / k∇ Φ( v ) k h ← solve ( A > VA ) − ( h + ∆) with γ/ // h = H − ( h + ( A > x − b )) , δ r = S − A h R ← SamplePrimal ( c ( √ m/n ) · T γ log n log( mT ) , D (sample) , D (sample ,σ ) , h ) forsome large enough constant c > x ≈ γ/ (10 log n ) b x via Lemma 4.37 x (tmp) , I x ← D ( x, ∇ ) . QuerySum ( − RXS − A h ) s (tmp) , I s ← D ( s ) . Add (( A > VA ) − h , γ/ ∆ ← ∆ + h − A > RXS − A h // Maintain ∆ = A > x − b /* Update x, s and data structures that depend on them. */ for i ∈ I x ∪ I s do if x (tmp) i γ/ (10 log n ) x i or s (tmp) i γ/ (10 log n ) s i then x i ← x (tmp) i , s i ← s (tmp) i D ( x, ∇ ) . Update ( i, γx i , ( x i s i ) / ( µτ i )) D (sample) . Scale ( i, /s i ) D (sample ,σ ) . Scale ( i, p x i /s i ) D ( τ ) . Scale ( i, ( x i /s i ) − α ) // So τ ≈ σ ( X / − α S − / − α ) + n/m /* Update τ and data structures that depend on it. */ I τ , τ ← D ( τ ) . Query () for i ∈ I τ do D ( x, ∇ ) . Update ( i, γx i , τ i , ( x i s i ) / ( µτ i ))Algorithm 11 gives a formal description of our implementation of Algorithm 3 and Lemma 8.13shows that this implementation is indeed correct.Afterward, we show in Lemma 8.14 that the complexity of Algorithm 9 improves when usingthis new implementation of the faster IPM. 89 emma 8.13. Algorithm 11 implements Algorithm 3.Proof.

Note that the only diﬀerence between Algorithm 2 and Algorithm 3 is the gradient.While in Algorithm 2 we use ∇ Φ( v ) / k∇ Φ( v ) k for v ≈ γ xs/µ , the faster Algorithm 3 uses( ∇ Φ( v )) [ ( e τ ) for v ≈ γ xs/ ( µτ ( x, s )) and e τ ≈ γ τ ( x, s ).Likewise Algorithm 11 is almost the same as Algorithm 8. The main diﬀerence being thatwe replaced the data structure of Theorem 7.7 for maintaining ∇ Φ( v ) / k∇ Φ( v ) k by the datastructure of Theorem 7.1 for maintaining ( ∇ Φ( v )) [ ( τ ) . The next diﬀerence is that we now useTheorem B.1 to maintain the leverage scores.Given that the rest of the implementation (Algorithm 11) of the LS-barrier method is iden-tical to the implementation (Algorithm 8) of the log-barrier method, we only need to verify thatthese modiﬁcations maintain the desired values with suﬃcient accuracy. Leverage Scores

Note that Line 22 makes sure that Theorem B.1 always uses g = ( x/s ) − α .As D ( x, ∇ ) and D ( s ) return γ/ (10 log n ) approximations and we change x, s whenever these ap-proximations change by some γ/ (10 log n ) factor, we have ( x/s ) − α ≈ γ/ (10 log n ) ( x/s ) − α .This then implies τ ≈ γ/ σ ( X / − α S − / − α A ) ≈ γ/ (10 log n ) σ ( X / − α S − / − α A ) =: τ ( x, s ) , assuming D ( τ ) (Theorem B.1) is initialized with accuracy γ/

4. Hence Theorem B.1 maintains τ ≈ γ/ τ ( x, s ). Gradient

In Line 10, Line 22 and Line 30 we make sure that Theorem 7.1 uses z = xs/ ( µτ ).Hence the data structure maintains ( ∇ φ ( v )) [ ( e τ ) for v ≈ γ/ z = xs/ ( µτ ) ≈ γ/ xs/τ ( x, s ), so v ≈ γ xs/ ( τ ( x, s )). Further, e τ ≈ γ/ (10 log n ) τ ≈ γ/ τ ( x, s ), so e τ ≈ γ τ ( x, s ).We now show that the complexity of Algorithm 9 improves when using Algorithm 11 for the ShortStep function.

Lemma 8.14.

We can improve Algorithm 9 to run in e O (cid:16) ( m + n . ) · | log µ (end) /µ (init) | · (log W + | log µ (end) /µ (init) | ) (cid:17) time, where log W is a bound on the ratio of largest to smallest entry in x (init) and s (init) .Proof. The main diﬀerence is that we run Algorithm 11 (the LS-barrier) for

ShortStep in-stead of Algorithm 8 (the log-barrier). For that we must ﬁrst initialize Theorem B.1 via D ( τ ) . Initialize (( x (init) /s (init) ) / − α , γ/ e O ( m ) time. Wealso run Theorem 7.1 instead of Theorem 7.7 to maintain ( ∇ Φ( v )) [ ( e τ ) instead of ∇ Φ( v ) / k∇ Φ( v ) k ,but the complexities of these two data structures are the same. Note that we must initialize D ( x, ∇ ) and D ( s ) for a higher accuracy ( γ/ (10 log n )) than in the previous log-barrier imple-mentation (which had accuracy γ/ e O (1)factor. Complexity of Algorithm 11

The main diﬀerence to the previous analysis from Lemma 8.4is that now

ShortStep is called only K = e O ( √ n | log( µ (init) /µ (end) ) | ) times. We will now listall the diﬀerences compared to the previous analysis of Lemma 8.4. Note that Algorithm 11assumes a global variable T ≥ √ n , which is an upper bound on how often ShortStep is called.So we modify Algorithm 9 to set T = K = e O ( √ n | log( µ (init) /µ (end) ) | ). Further note that asbefore, we can bound the ratio W of largest to smallest entry in x and s throughout all callsto ShortStep by log W = e O (log W + | log( µ (init) /µ (end) ) | ) via Lemma 4.41.90 ampling the primal We scale the sampling probability for

SamplePrimal by e O ( √ mK/n ).The complexity per call thus increases to e O ( n (log W + | log( µ (init) /µ (end) ) | ) + Km/n )by Lemma 8.2. Further note that by Lemma 4.37 the sampling probability is large enough forour sequence of primal solutions x (1) , x (2) , ... to be close to some sequence b x (1) , b x (2) , ... , such that x ( t ) ≈ γ/ (10 log n b x ( t ) and k ( b X ( t − ) − ( b x ( t ) − b x ( t − ) k τ for all t . D ( x, ∇ ) data structure The number of times µ changes is now bounded by e O ( K/ √ n ), sothe total cost of updating D ( x, ∇ ) in Line 10 is now bounded by e O ( Km/ √ n ). The cost of D ( x, ∇ ) . QueryProduct and

QuerySum is bounded by e O ( Kn + Kn (log W + | log( µ (init) /µ (end) ) | ) + K · k X ( ∇ Φ( v )) [ ( τ ) /x k )= e O ( Kn + Kn (log W + | log( µ (init) /µ (end) ) | ) + K · m/n ) , by Theorem 7.1, where we use that k X ( ∇ Φ( v )) [ ( τ ) /x k = O (1) · k∇ Φ( v )) [ ( τ ) k = O ( m/n ) · k∇ Φ( v )) [ ( τ ) k τ = e O ( m/n ) · k∇ Φ( v )) [ ( τ ) k τ + ∞ = e O ( m/n ) D ( s ) data structure The complexity of calls to D ( s ) . Add (Theorem 6.1) is now bounded by e O ( K · m/n + Kn (log W + | log( µ (init) /µ (end) ) | )), because k S − δ s k ≤ O ( m/n ) k S − δ s k τ ( x,s ) = O ( m/n ). D ( τ ) data structure Any change to τ causes e O (1) cost in Line 30. The number of times anyentry of τ changes is bounded by the Query complexity of Theorem B.1. This complexity canbe bounded by e O ( K m/n + Kn (log W + | log( µ (init) /µ (end) ) | )), when applying Lemma 4.41 andthe fact that we scale the sampling probability for SampleProbability by e O ( √ mK/n ). Notethat this complexity bound only holds when for the given input sequence g ( t ) := ( x ( t ) /s ( t ) ) − α ,there exists sequence b g ( t ) ≈ γ/ (2 · ·

144 log n ) g ( t ) which satisﬁes (70) and (71). This is satis-ﬁed because of the accuracy of x ≈ γ/ (10 log n ) x , s ≈ γ/ (10 log n ) s , the stability property k S − δ s k τ ≤ γ (Lemma 4.19), and the stable sequence b x ( t ) ≈ γ/ (10 log n ) x ( t ) that exists accord-ing to Lemma 4.37. So the sequence b g ( t ) := ( b x ( t ) /s ( t ) ) − α satisﬁes b g ( t ) ≈ γ/ (2 · ·

144 log n ) g ( t ) =( x ( t ) /s ( t ) ) − α , (70) and (71). Updating the data structure

The number of times that any entry of s changes now boundedby e O ( K m/n ), because k S − δ s k = O ( m/n ). Likewise the number of times that any entry of s changes is bounded by e O ( Kn (log W + | log( µ (init) /µ (end) ) | + K m/n ) because x changes by X ( ∇ Φ( v )) [ ( τ ) − RXS − A h , where for the ﬁrst term we have k X ( ∇ Φ( v )) [ ( τ ) /x k = O ( m/n )and the second term as only at most e O ( n (log W + | log( µ (init) /µ (end) ) | ) + Km/n ) non-zeroentries. So updating all the data structures via calls to

Update and

Scale in Line 22 nowtakes e O ( Kn (log W + | log( µ (init) /µ (end) ) | ) + K · m/n ) time. Total Complexity

The total complexity is thus e O ( m + K m/n + Kn (log W + | log( µ (init) /µ (end) ) | ))= e O (( m + n . ) | log( µ (init) /µ (end) ) | (log W + | log( µ (init) /µ (end) ) | )) . Note that Lemma 8.7, as used in our e O ( n √ m )-time algorithm, only provides an initial point( x, s ) with xs = µ for some µ ∈ R . For the faster IPM, however, we require xs ≈ µτ ( x, s ) with τ ( x, s ) := σ ( x, s ) + n/m as deﬁned in Deﬁnition 4.13. The following Lemma 8.15 shows thatsuch a point can be constructed, and it is proven in Appendix A.1.91 emma 8.15. We are given a minimum weight perfect b -matching instance on graph G =( U, V, E ) with cost vector c ∈ R E .Let G = ( V , E ) with demand vector d be the corresponding starred ﬂow graph (see Deﬁ-nition 8.6) and let n be the number of nodes of G and m the number of edges.For τ ( x, s ) := σ ( x, s ) + m/n and for any (cid:15) > we can construct in e O ( m poly( (cid:15) − )) time acost vector c ∈ R E for G and a feasible primal dual solution pair ( x, s ) for the minimum costﬂow problem on G with cost vector c , where the solution satisﬁes xs ≈ (cid:15) τ ( x, s ) . (66) The cost vector c satisﬁes n m (1 + k b k ∞ ) ≤ c ≤ n. We now obtain Theorem 8.1 by replacing the slow √ m -iteration IPM, used in Algorithm 10,with the faster √ n -iteration IPM. Theorem 8.1.

For b, c ∈ Z m , we can solve minimum weight perfect bipartite b -matching forcost vector c in e O (( m + n . ) log ( k c k ∞ k b k ∞ )) time.Proof. This follows directly from Algorithm 10 (Theorem 8.5), when we use the faster variantof Algorithm 9 (Lemma 8.14 instead of Lemma 8.4). To construct the initial point we now useLemma 8.15 instead of Lemma 8.7.

In this section, we apply known reductions and list applications of our main results (see Fig-ure 2). In Section 8.6.1, we show that we can obtain exact algorithms for many fundamentalproblems when all the numbers from the problem instance are integral. For a history overviewfor some of these problems see Tables 4 and 5. Then, in Section 8.6.2, we show that even whenthe numbers are not integral, we can still (1 + (cid:15) )-approximately solve all the problems withlog (1 /(cid:15) ) dependency in the running time. Vertex-capacitated min-cost 𝑠 - 𝑡 -flowTransshipment Bipartite perfectmin-cost 𝑏 -matching Bipartite max-weight 𝑏 -matchingNegative-weight shortest paths Figure 2: Reductions between problems

In this section, we deﬁne several problems such that the numbers from the problem instanceare integral, and show how to solve them exactly. We will later allow fractions in the probleminstance in Section 8.6.2, and show how to solve those problems approximately.92 ear Author Reference Complexity Notes O ( n )1956 Ford [Net56] O ( W n m )1958 Bellman, Moore [Bel58] [Moo59] O ( nm ) *1983 Gabow [Gab85] O ( n / m log W )1989 Gabow and Tarjan [GT89] O ( √ nm log( nW ))1993 Goldberg [Gol93] O ( √ nm log( W )) *2005 Sankowski; Yuster and Zwick [San05], [YZ05] e O ( W n ω )2017 Cohen, Madry, Sankowski, Vladu [CMSV17] e O ( m / log W )2020 Axiotis, Madry, Vladu [AMV20] e O ( m / o (1) log W ) *2020 This paper e O (( m + n . ) log W ) *Table 4: The complexity results for the SSSP problem with negative weights (* indicatesasymptotically the best bound for some range of parameters). The table is extended from[CMSV17]

Year Author Reference Complexity Notes O ( W n m )1955 Khun and Munkers [Kuh55],[Mun57] O ( W n m )1960 Iri [Iri60] O ( n m )1969 Dinic and Kronrod [DK69] O ( n )1970 Edmonds and Karp [EK72] O ( nm + n log n ) *1983 Gabow [Gab85] O ( n m log W )1989 Gabow and Tarjan [GT89] O ( √ nm log( nW )) *1999 Kao, Lam, Sung and Ting [KLST99] O ( W √ nm )2006 Sankowski [San06] O ( W n ω )2017 Cohen, Madry, Sankowski, Vladu [CMSV17] e O ( m / log W )2020 Axiotis, Madry, Vladu [AMV20] e O ( m / o (1) log W ) *2020 This paper e O (( m + n . ) log W ) *Table 5: The complexity results for the minimum-weight bipartite perfect matching problem (* indicates asymptotically the best bound for some range of parameters). The tableis extended from [CMSV17]. Deﬁnition 8.16 ( b -matching) . For any graph G = ( V, E ) , a demand vector b ∈ Z V ≥ , and a costvector c ∈ Z E , a b -matching is an assignment x ∈ Z E ≥ where P e v x e ≤ b v for all v ∈ V . Wesay that x is perfect if P e v x e = b v for all v ∈ V . The cost/weight of x is c > x = P e ∈ E c e x e .When b is an all-one vector, the b -matching is referred to simply as a matching . We say that x is fractional if we allow x ∈ R E ≥ . Similarly, b is fractional if b ∈ R V ≥ . In the maximum weight bipartite b -matching problem, given a bipartite graph G , a demandvector b , and a cost vector c , we need to ﬁnd a b -matching with maximum weight. In the minimum/maximum weight bipartite perfect b -matching problem, we need to ﬁnd a perfect b -matching with minimum/maximum weight or report that no perfect b -matching exists. Below,we show a relation between these problems. We let W denote the largest absolute value usedfor specifying any value in the problem (i.e. b v ≤ W for all v ∈ V and − W ≤ c u,v ≤ W for all( u, v ) ∈ E ). Equivalence of variants of perfect b -matching. Most variants of the problems for perfect b -matching are equivalent. First, we can assume that the cost is non-negative by working withthe cost vector c where c ( e ) = c ( e ) + W for all e ∈ E . This is because the cost of every perfect93 -matching is increased by exactly W · k b k /

2. Second, the maximization variant is equivalentwith the minimization variant by working with the cost c ( e ) = − c ( e ) because the cost of everyperfect b -matching is negated. Maximum weight non-perfect b -matching → Maximum weight perfect b -matching. We can reduce the non-perfect variant to the perfect one. Given a problem instance (

G, b, c )of maximum weight bipartite b -matching where G = ( U, V, E ) is a bipartite graph, b is ademand vector and c is a cost vector, we construct an instance ( G , b , c ) of maximum weightbipartite perfect b -matching as follows. First, note that we can assume that all entries in c are non-negative since we never add edges of negative cost to the optimal maximum weightnon-perfect b -matching. Next, we add new vertices u into U and v into V . Set b u = P v ∈ V b v , b v = P u ∈ U b u , and b x = b x for all x ∈ U ∪ V . We add edges connecting u to all vertices in V (including v ) and connecting v to all vertices in U . The cost of all these edges incidentto u or v is 0. It is easy to see that, given a perfect b -matching x in G , we can obtaina b -matching x with the same cost in G simply by deleting all edges incident to u and v .Also, given any b -matching x in G , there is a perfect b -matching x in G with the same costthat is obtained from x by “matching the residual demand” on vertices in U and V to v and u respectively. More formally, we set x e = x e ∀ e ∈ E , x ( u ,v ) = b v − P ( u,v ) ∈ E x ( u,v ) ∀ v ∈ V , x ( u,v ) = b u − P ( u,v ) ∈ E x ( u,v ) ∀ u ∈ U , and x ( u ,v ) = P ( u,v ) ∈ E x ( u,v ) . It is easy to verify that x is a perfect b -matching with the same cost as x .From the above discussion, we conclude: Proposition 8.17.

Consider the following problems when the input graph has n vertices and m edges, and let W denote the largest number in the problem instance: • minimum weight bipartite perfect b -matching (with positive cost vector), • minimum weight bipartite perfect b -matching, and • maximum weight bipartite perfect b -matching.If one of the above problems can be solved in T ( m, n, W ) time, then we can solve the otherproblems in T ( m, n, W ) time. Additionally, we can solve the maximum weight bipartite b -matching problem in T ( m + O ( n ) , n, O ( W n )) time. Deﬁnition 8.18 (Flow) . For any directed graph G = ( V, E ) , a demand vector b ∈ Z V where P v ∈ V b v = 0 , and a cost vector c ∈ Z E , a ﬂow (w.r.t. b ) is an assignment f ∈ Z E ≥ such that,for all v ∈ V , P ( u,v ) ∈ E f ( u,v ) − P ( v,u ) ∈ E f ( v,u ) = b v . The cost of f is c > f = P e ∈ E c e f e . We saythat f is fractional if we allow f ∈ R E ≥ .An s - t ﬂow f is a ﬂow such that b v = 0 for all v ∈ V \ { s, t } . So b s = k = − b t for some k ∈ R . We say that k is the value of the s - t ﬂow f . Let ~ cap ∈ Z V \{ s,t } > denote vertex capacities.We say that f respects the vertex capacity if P ( u,v ) ∈ E f ( u,v ) ≤ ~ cap v for all v ∈ V \ { s, t } . In the uncapacitated minimum-cost ﬂow problem (a.k.a. the transshipment problem), givena directed graph G = ( V, E ), a demand vector b , and a cost vector c , we need either (1) reportthat no ﬂow w.r.t. b exists, (2) report that the optimal cost is −∞ (this can happen if there isa negative cycle w.r.t. c ) or (3) ﬁnd such ﬂow with minimum cost. We will need the followingobservation: Claim 8.19.

If there is a feasible solution for the transshipment problem and the optimal costis not −∞ , then the optimal cost is between [ − W n , W n ] .Proof. If the optimal cost is not −∞ , there is no negative cycle. So an optimal ﬂow f can bedecomposed into a collection of paths. The total ﬂow value summing overall ﬂow paths is exactly P v ∈ V || b || . So the ﬂow value on every edge is at most P v ∈ V || b || . Therefore, the cost of f is at94east P v ∈ V || b || × min e ∈ E c e × | E | ≥ − W n and at most P v ∈ V || b || × max e ∈ E c e × | E | ≤ W n . In the vertex-capacitated minimum-cost s - t ﬂow problem, given a directed G and a pair ofvertices ( s, t ), a target value k , a vertex capacity vector ~ cap, and a cost vector c , we eitherreport that no such s - t ﬂow with value k respecting vertex capacities ~ cap exists, or ﬁnd suchﬂow with minimum cost.Below, we show the equivalence between these two problems and the minimum weight perfect b -matching problem. Minimum weight perfect b -matching → Transshipment.

Observe that minimumweight bipartite perfect b -matching is a special case of the transshipment problem when weonly allow edges that direct from vertices with positive demands (for the transshipment prob-lem) to vertices with negative demands. Transshipment → Vertex-capacitated minimum-cost s - t ﬂow. Given a transshipmentinstance ( G = ( V, E ) , b, c ), we can construct an instance ( G , k, ~ cap , c ) for vertex-capacitated s - t ﬂow as follows. First, we add a special source s and t into the graph. For each vertex v withpositive demand b v >

0, we add a dummy vertex s v and two edges ( s, s v ) and ( s v , v ). For eachvertex v with positive demand b v <

0, we add a dummy vertex t v and two edges ( v, t v ) and ( t v , t ).Let G denote the resulting graph. The vertex capacity of each s v is ~ cap s v = b v . The vertexcapacity of each t v is ~ cap t v = − b v . For original vertices v ∈ V , we set ~ cap v = 3 n W + n || b || / c e = c e for all original edges e ∈ E , otherwise c e = 0 for all new edges e . Lastly,we set the target value k = P v : b v > b v = || b || / s - t ﬂow in G with value k if and only if there is a ﬂow in G satisfying the demand b . To see this, consider any s - t ﬂow f with value k respecting vertexcapacities in G . Observe that the vertex capacities of all s v must be exactly saturated, becausethere are k units of ﬂow going out from s , but the total capacities of all s v are at most k .This similarly holds for all t v . Therefore, for all original vertices v ∈ V , P ( u,v ) ∈ E f ( u,v ) − P ( v,u ) ∈ E f ( v,u ) = b v . Observe that, by deleting all non-original vertices in G , f corresponds toa ﬂow f (w.r.t. b ) in G with the same cost as f .Therefore, if the algorithm for solving vertex-capacitated s - t ﬂow reports that there is no s - t ﬂow in G with value k , then we just report that there is no ﬂow in G satisfying the demand b . Otherwise, we obtain an s - t ﬂow f in G with value k .We claim that the cost of f is less than − W n if and only if there is a negative cycle w.r.t. c in G . If the cost of f is less than − W n , then there is a ﬂow in f satisfying the demand b with the same cost by the above argument. So, by Claim 8.19, the optimal cost of ﬂow in G must be −∞ and, hence, there is a negative cycle in G . In the other direction, suppose thereis a negative cycle C in G , then it suﬃces to construct a s - t ﬂow f in G with value k withcost less than − W n . Start with an arbitrary s - t ﬂow f in G with value k such that f canbe decomposed into a collection of paths only (no cycle). We know that f exists because f exists. Observe that the ﬂow value in f is at most k on every edge and at most kn on everyvertex. So the cost of f is at most k × max e ∈ E c e × | E | ≤ n W . Now, f is obtained from f byaugmenting a ﬂow along the negative cycle C with value 3 n W . Note that this augmentationis possible because the capacity of every original vertex v ∈ V is nk + 3 n W . So the cost of f is at most n W − n W ≤ − n W .By the above argument, we simply check the cost of f . If it is less than − W n , wereport that the optimal cost for the transshipment problem is −∞ . Otherwise, we return thecorresponding ﬂow f from the discussion above. Vertex-capacitated minimum-cost s - t ﬂow → Minimum weight perfect b -matching. We show a reduction that directly generalizes the well-known reduction from minimum-cost95ertex-disjoint s - t paths to minimum weight perfect matching (see e.g. Section 3.3 of [San09]and Chapter 16.7c of [Sch03]).Suppose we are given an instance ( G = ( V, E ) , k, ~ cap , c ) for the vertex-capacitated minimum-cost s - t ﬂow problem (possibly with negative costs). First, we assume that s has no incomingedges and t has no outgoing edges. To assume this, we can add vertices s and t , and edges( s , s ) and ( t, t ) both with zero cost. Set ~ cap s = ~ cap t = P v ∈ V \{ s,t } ~ cap v . By treating s and t as the new source and sink, our assumption is justiﬁed.Next, we construct a bipartite graph G = ( L, R, E ), a cost vector c ∈ Z E , a demand vector b ∈ Z L ∪ R as follows. For each v ∈ V \ { t } , we create a copy v l ∈ L . For each v ∈ V \ { s } , wecreate a copy v r ∈ R . For v ∈ V \ { s, t } , add an edge ( v l , v r ) ∈ E with cost c ( v l ,v r ) = 0. For each e = ( u, v ) ∈ E , add ( u l , v r ) ∈ E with cost c e = c e . (Note that here we exploit the assumptionthat s / t has no incoming/outgoing edges.) We assign the demand vector b as follows: b s l = k , b t r = k , and, for every v ∈ V \ { s, t } , b v l = b v r = ~ cap v .Observe that every perfect b -matching x in G corresponds to a set C of cycles in G ∪ { ( t, s ) } where each vertex v is part of at most ~ cap v cycles and ( t, s ) is part of exactly k cycles. Hence, C in turn corresponds to an s - t ﬂow f in G of value k that respect the vertex capacities ~ cap.Moreover, the cost of x is exactly the same as the cost of f .From the above three reductions, we conclude: Proposition 8.20.

Consider the following problems when the input graph has n vertices and m edges, and W denotes the largest number in the problem instance: • minimum weight bipartite perfect b -matching, • uncapacitated minimum-cost ﬂow (a.k.a. transshipment), and • vertex-capacitated minimum-cost s - t ﬂow.If one of the above problems can be solved in T ( m, n, W ) time, then we can solve the otherproblems in T ( m + O ( n ) , O ( n ) , O ( n W )) time. Negative-weight single-source shortest paths → Transshipment / Dual of minimumweight perfect b -matching. We also consider the negative-weight single-source shortest path problem. In this problem, we are given a graph G = ( V, E ), a source s ∈ V , and a cost vector c ∈ Z E , the goal is to compute a shortest path tree rooted at s w.r.t. the cost c , or detect anegative cycle.The problem can be reduced to the transshipment problem as follows. First, by performingdepth ﬁrst search, we assume that the source s can reach all other vertices. (The unreachablevertices have distance ∞ and can be removed from the graph.) Next, we make sure that theshortest path is unique by randomly perturbing the weight of each edge (exploiting the Isolationlemma as in, e.g., [CCE13, Section 6]). Then, we set the demand vector b s = n − b v = − v = s . As s can reach all other vertices, we know that there exists a ﬂow satisfying thedemand b . Finally, we invoke the algorithm for solving the transshipment instance G with thisdemand b . If the algorithm reports that the optimal cost is −∞ , then we report that there is anegative cycle. Otherwise, we obtain a ﬂow f that can be decomposed into s - v paths P v for all v ∈ V \ { s } . Observe that P v must be a shortest s - v path. Since the shortest paths are unique,we have that the union of P v over all v ∈ V \ { s } , which is exactly the support of f , form ashortest path tree. Therefore, we simply return the support of f as the answer. Minimum mean cycle, deterministic MDP → Negative-weight single-source shortestpath.

Given a cost function c on edges, the mean cost of a directed cycle σ is c ( σ ) = | σ | · P e ∈ σ c ( e ). The minimum mean cycle problem is to ﬁnd a cycle of minimum mean cost. The deterministic markov decision process (MDP) problem [Mad02] is the same problem except that96e want to ﬁnd a cycle with maximum mean cost. These two problems are clearly equivalentby negating all the edge cost.Given a subroutine for detecting a negative cycle, Lawler [Law72] shows that we can alsosolve the minimum mean cycle problem on a graph of size n by calling the negative cyclesubroutine O (log( nW )) times on graphs with the same order. Plugging Algorithms to Reductions.

Fianlly, by plugging the main algorithm from The-orem 8.1 into the above reductions, we immediately obtain several applications.

Corollary 8.21 (Exact Algorithms for Integral Problems) . There are algorithms with runningtime e O (( m + n . ) log W ) for solving the following problems exactly : • minimum weight bipartite perfect b -matching, • maximum weight bipartite perfect b -matching, • maximum weight bipartite b -matching, • uncapacitated minimum-cost ﬂow (a.k.a. transshipment), • vertex-capacitated minimum-cost s - t ﬂow, and • negative-weight single-source shortest path, • minimum mean cycle, and • deterministic MDPwhen the input graph has n vertices and m edges, and W denotes the largest absolute value usedfor specifying any value in the problem. Here, we assume that all numbers used for specifyingthe input are integral. Generally, for each the problem considered in the previous section, we can consider the sameproblem where the values that describe the input might not be integral. We say that analgorithm solves a problem with additive approximation factor of (cid:15) if it always returns a solutionwhose cost is at most OPT + (cid:15) where OPT is the cost of the optimal solution.

Corollary 8.22 (Approximation Algorithms for Fractional Problems) . There are algorithmswith running time e O (( m + n . ) log W(cid:15) ) for solving the following problems with additive approx-imation factor of (cid:15) : • minimum weight bipartite perfect b -matching (a.k.a. optimal transport), • maximum weight bipartite perfect b -matching, • maximum weight bipartite b -matching, • uncapacitated minimum-cost ﬂow (a.k.a. transshipment), • vertex-capacitated minimum-cost s - t ﬂow, and • negative-weight single-source shortest path, • minimum mean cycle, and • deterministic MDPwhen the input graph has n vertices and m edges, and W denotes the ratio of the largest non-zero absolute value to the smallest non-zero absolute value used for specifying any value in theproblem. We note that the optimal transport problem (a.k.a. the transportation problem) is the sameas the minimum weight bipartite perfect b -matching problem except that we allow all vectorsto be fractional, including the demand vector b , the cost vector c , and the solution vector x .Several papers in the literature assume that the input graph is a complete bipartite graph. Wedo not assume this. 97he idea of the proof above is to round up all the values from the description of the input to(integral) multiple of 1 / poly( nW/(cid:15) ) which takes O (log( nW/(cid:15) )) bits. Then, we solve the roundedproblem instance exactly using the algorithm Corollary 8.21 for integral problems. Consider anoptimal solution S with cost OPT before rounding. Observe that after rounding, the roundedcost of this optimal solution S can increase by at most n × max e ∈ E c ( e ) × nW/(cid:15) ) ≤ (cid:15) . Sothe optimal cost in the rounded instance has cost at most OPT + (cid:15) . Acknowledgments

We thank Yang Liu for helpful conversations, feedback on earlier drafts of the paper, andsuggesting Lemma 4.23 and its application. This project has received funding from the EuropeanResearch Council (ERC) under the European Unions Horizon 2020 research and innovationprogramme under grant agreement No 715672. Danupon Nanongkai is also partially supportedby the Swedish Research Council (Reg. No. 2015-04659 and 2019-05622). Aaron Sidfordis supported by NSF CAREER Award CCF-1844855 and PayPal research gift. Yin Tat Leeis supported by NSF awards CCF-1749609, CCF-1740551, DMS-1839116, Microsoft ResearchFaculty Fellowship, a Sloan Research Fellowship. Di Wang did part of this work while at GeorigaTech, and was partially supported by NSF grant CCF-1718533. Richard Peng was partiallysupported by NSF grants CCF-1718533 and CCF-1846218. Zhao Song was partially supportedby Ma Huateng Foundation, Schmidt Foundation, Simons Foundation, NSF, DARPA/SRC,Google and Amazon.

References [AKPS19] Deeksha Adil, Rasmus Kyng, Richard Peng, and Sushant Sachdeva. Itera-tive reﬁnement for ‘ p -norm regression. In Proceedings of the Thirtieth AnnualACM-SIAM Symposium on Discrete Algorithms (SODA) , pages 1405–1424. SIAM, https://arxiv.org/pdf/1901.06764.pdf , 2019.[AMV20] Kyriakos Axiotis, Aleksander Madry, and Adrian Vladu. Circulation control forfaster minimum cost ﬂow in unit-capacity graphs. In arXiv preprint . https//arxiv.org/pdf/2003.04863.pdf , 2020.[Ans96] Kurt M. Anstreicher. Volumetric path following algorithms for linear programming. Math. Program. , 76:245–263, 1996.[AS20] Deeksha Adil and Sushant Sachdeva. Faster p -norm minimizing ﬂows, via smoothed q -norm problems. In SODA , pages 892–910. SIAM, 2020.[ASZ20] Alexandr Andoni, Cliﬀord Stein, and Peilin Zhong. Parallel approximate undi-rected shortest paths via low hop emulators.

STOC , https://arxiv.org/pdf/1911.01956.pdf , 2020.[AWR17] Jason Altschuler, Jonathan Weed, and Philippe Rigollet. Near-linear time approx-imation algorithms for optimal transport via sinkhorn iteration. In NIPS , pages1964–1974, 2017.[BBG +

20] Aaron Bernstein, Jan van den Brand, Maximilian Probst Gutenberg, DanuponNanongkai, Thatchaphol Saranurak, Aaron Sidford, and He Sun. Fully-dynamicgraph sparsiﬁers against an adaptive adversary.

CoRR , abs/2004.08432, 2020.[Bel58] R. Bellman. On a Routing Problem.

Quarterly of Applied Mathematics , 16(1):87–90, 1958. 98BGS20] Aaron Bernstein, Maximilian Probst Gutenberg, and Thatchaphol Saranurak. De-terministic decremental reachability, scc, and shortest paths via directed expandersand congestion balancing. 2020. To appear at FOCS’20.[BJKS18] Jose H. Blanchet, Arun Jambulapati, Carson Kent, and Aaron Sidford. Towardsoptimal running times for optimal transport.

CoRR , abs/1810.07717, 2018.[BLSS20] Jan van den Brand, Yin Tat Lee, Aaron Sidford, and Zhao Song. Solving tall denselinear programs in nearly linear time. In

STOC . https://arxiv.org/pdf/2002.02304.pdf , 2020.[BNS19] Jan van den Brand, Danupon Nanongkai, and Thatchaphol Saranurak. Dynamicmatrix inverse: Improved algorithms and matching conditional lower bounds. In FOCS , pages 456–480. IEEE Computer Society, 2019.[Bra20] Jan van den Brand. A deterministic linear program solver in current matrix multi-plication time. In

SODA , pages 259–278. SIAM, 2020.[CCE13] Sergio Cabello, Erin W Chambers, and Jeﬀ Erickson. Multiple-source shortest pathsin embedded graphs.

SIAM Journal on Computing , 42(4):1542–1571, 2013.[CGL +

20] Julia Chuzhoy, Yu Gao, Jason Li, Danupon Nanongkai, Richard Peng, andThatchaphol Saranurak. A deterministic algorithm for balanced cut with appli-cations to dynamic connectivity, ﬂows, and beyond. In

FOCS , 2020. https://arxiv.org/pdf/1910.08025.pdf .[CGP +

18] Timothy Chu, Yu Gao, Richard Peng, Sushant Sachdeva, Saurabh Sawlani, andJunxing Wang. Graph sparsiﬁcation, spectral sketches, and faster resistance com-putation, via short cycle decompositions. In

FOCS , pages 361–372. IEEE ComputerSociety, 2018.[CKM +

14] Michael B. Cohen, Rasmus Kyng, Gary L. Miller, Jakub W. Pachocki, RichardPeng, Anup B. Rao, and Shen Chen Xu. Solving sdd linear systems in nearly m log / n time. In Proceedings of the 46th Annual ACM Symposium on Theory ofComputing (STOC) , pages 343–352, 2014.[CLM +

15] Michael B Cohen, Yin Tat Lee, Cameron Musco, Christopher Musco, Richard Peng,and Aaron Sidford. Uniform sampling for matrix approximation. In

Proceedingsof the 2015 Conference on Innovations in Theoretical Computer Science (ITCS) ,pages 181–190, 2015.[CLS19] Michael B Cohen, Yin Tat Lee, and Zhao Song. Solving linear programs in thecurrent matrix multiplication time. In

STOC , 2019. https://arxiv.org/pdf/1810.07896 .[CMSV17] Michael B Cohen, Aleksander Madry, Piotr Sankowski, and Adrian Vladu.Negative-weight shortest paths and unit capacity minimum cost ﬂow in O ( m / log W ) time. In Proceedings of the Twenty-Eighth Annual ACM-SIAMSymposium on Discrete Algorithms (SODA) , pages 752–771. SIAM, 2017.[CP15] Michael B. Cohen and Richard Peng. ‘ p row sampling by lewis weights. In STOC ,pages 183–192. ACM, 2015.[Din70] Eﬁm A Dinic. Algorithm for solution of a problem of maximum ﬂow in networkswith power estimation. In

Soviet Math. Doklady , volume 11, pages 1277–1280, 1970.99DK69] E. A. Dinic and M. A. Kronrod. An Algorithm for the Solution of the AssignmentProblem.

Soviet Math. Dokl , 10:1324–1326, 1969.[DP14] Ran Duan and Seth Pettie. Linear-time approximation for maximum weight match-ing.

J. ACM , 61(1):1:1–1:23, 2014.[DS08] Samuel I Daitch and Daniel A Spielman. Faster approximate lossy generalized ﬂowvia interior point algorithms. In

Proceedings of the fortieth annual ACM symposiumon Theory of computing (STOC) , pages 451–460, 2008.[Ege31] E. Egerváry. Matrixok kombinatorius tulajdonságairól (hungarian) on combinatorialproperties of matrices.

Matematikai és Fizikai Lapok , 38:16–28, 1931.[EK72] Jack Edmonds and Richard M. Karp. Theoretical improvements in algorithmiceﬃciency for network ﬂow problems.

J. ACM , 19(2):248–264, 1972.[Gab85] Harold N. Gabow. Scaling algorithms for network problems.

J. Comput. Syst. Sci. ,31(2):148–168, 1985. announced at FOCS’83.[Gal14] François Le Gall. Powers of tensors and fast matrix multiplication. In

ISSAC , pages296–303. ACM, 2014.[GLPS10] Anna C Gilbert, Yi Li, Ely Porat, and Martin J Strauss. Approximate sparserecovery: optimizing time and measurements.

SIAM Journal on Computing 2012(A preliminary version of this paper appears in STOC 2010) , 41(2):436–453, 2010.[Gol93] Andrew V. Goldberg. Scaling algorithms for the shortest paths problem. In

SODA’93: Proceedings of the fourth annual ACM-SIAM Symposium on Discrete algo-rithms , pages 222–231. Society for Industrial and Applied Mathematics, 1993.[GRST20] Gramoz Goranci, Harald Räcke, Thatchaphol Saranurak, and Zihan Tan. Theexpander hierarchy and its applications to dynamic graph algorithms.

CoRR ,abs/2005.02369, 2020.[GT89] Harold N. Gabow and Robert Endre Tarjan. Faster scaling algorithms for networkproblems.

SIAM J. Comput. , 18(5):1013–1036, 1989.[GT91] Harold N. Gabow and Robert Endre Tarjan. Faster scaling algorithms for generalgraph-matching problems.

J. ACM , 38(4):815–853, 1991.[HIKP12] Haitham Hassanieh, Piotr Indyk, Dina Katabi, and Eric Price. Nearly optimalsparse Fourier transform. In

Proceedings of the forty-fourth annual ACM symposiumon Theory of computing (STOC) , pages 563–578. ACM, https://arxiv.org/pdf/1201.2501.pdf , 2012.[HK73] John E. Hopcroft and Richard M. Karp. An n / algorithm for maximum matchingsin bipartite graphs. SIAM J. Comput. , 2(4):225–231, 1973. Announced at FOCS’71.[IM81] Oscar H. Ibarra and Shlomo Moran. Deterministic and probabilistic algorithmsfor maximum bipartite matching via fast matrix multiplication.

Inf. Process. Lett. ,13(1):12–15, 1981.[Iri60] M. Iri. A new method for solving transportation-network problems.

Journal of theOperations Research Society of Japan , 3:27–87, 1960.[JL84] William B Johnson and Joram Lindenstrauss. Extensions of lipschitz mappings intoa hilbert space.

Contemporary mathematics , 26(189-206):1, 1984.100JSWZ20] Shunhua Jiang, Zhao Song, Omri Weinstein, and Hengjie Zhang. Faster dynamicmatrix inverse for faster lps.

CoRR , abs/2004.07470, 2020.[Kap17] Michael Kapralov. Sample eﬃcient estimation and recovery in sparse FFT viaisolation on average. In . https://arxiv.org/pdf/1708.04544 , 2017.[Kar73] Alexander V Karzanov. On ﬁnding maximum ﬂows in networks with special struc-ture and some applications. Matematicheskie Voprosy Upravleniya Proizvodstvom ,5:81–94, 1973.[Kar84] Narendra Karmarkar. A new polynomial-time algorithm for linear programming.

Combinatorica , 4(4):373–396, 1984. Announced at STOC’84.[KLP +

16] Rasmus Kyng, Yin Tat Lee, Richard Peng, Sushant Sachdeva, and Daniel A.Spielman. Sparsiﬁed cholesky and multigrid solvers for connection laplacians. InSTOC’16: Proceedings of the 48th Annual ACM Symposium on Theory of Com-puting, 2016.[KLST99] M.-Y. Kao, T. W. Lam, W.-K. Sung, and H.-F. Ting. A decomposition theorem formaximum weight bipartite matchings with applications to evolutionary trees. InProceedings of the 7th Annual European Symposium on Algorithms, pages 438–449,1999.[KMP10] Ioannis Koutis, Gary L. Miller, and Richard Peng. Approaching optimality forsolving SDD systems. In

Proceedings of the 51st Annual IEEE Symposium onFoundations of Computer Science (FOCS) , pages 235–244, 2010.[KMP11] Ioannis Koutis, Gary L. Miller, and Richard Peng. A nearly m log n -time solverfor SDD linear systems. In Proceedings of the 52nd Annual IEEE Symposium onFoundations of Computer Science (FOCS) , pages 590–598, 2011.[KNPW11] Daniel M Kane, Jelani Nelson, Ely Porat, and David P Woodruﬀ. Fast momentestimation in data streams in optimal space. In

Proceedings of the forty-third annualACM symposium on Theory of computing (STOC) , pages 745–754, 2011.[KOSZ13] Jonathan A. Kelner, Lorenzo Orecchia, Aaron Sidford, and Zeyuan Allen Zhu. Asimple. In

STOC’13: Proceedings of the 45th Annual ACM Symposium on theTheory of Computing , pages 911–920. combinatorial algorithm for solving SDDsystems in nearly-linear time. In, 2013.[KPSW19] Rasmus Kyng, Richard Peng, Sushant Sachdeva, and Di Wang. Flows in almostlinear time via adaptive preconditioning. In

Proceedings of the 51st Annual ACMSIGACT Symposium on Theory of Computing (STOC) , pages 902–913. https://arxiv.org/pdf/1906.10340.pdf , 2019.[KS01] Adam R Klivans and Daniel Spielman. Randomness eﬃcient identity testing ofmultivariate polynomials. In

Proceedings of the thirty-third annual ACM symposiumon Theory of computing (STOC) , pages 216–223, 2001.[KS16] Rasmus Kyng and Sushant Sachdeva. Approximate gaussian elimination for lapla-cians - fast, sparse, and simple. In

FOCS , pages 573–582. IEEE Computer Society,2016.[Kuh55] H. W Kuhn. The hungarian method for the assignment problem.

Naval ResearchLogistics Quarterly , 2:83–97, 1955.101Law72] Eugene L. Lawler.

Optimal Cycles in Graphs and the Minimal Cost-To-Time RatioProblem , pages 37–60. Springer Vienna, Vienna, 1972.[LHJ19] Tianyi Lin, Nhat Ho, and Michael I. Jordan. On eﬃcient optimal transport: Ananalysis of greedy and accelerated mirror descent algorithms. In

ICML , volume 97of

Proceedings of Machine Learning Research , pages 3982–3991. PMLR, 2019.[Li20] Jason Li. Faster parallel algorithm for approximate shortest path. In

STOC . https://arxiv.org/pdf/1911.01626.pdf , 2020.[LN17] Kasper Green Larsen and Jelani Nelson. Optimality of the johnson-lindenstrausslemma. In , pages 633–638. IEEE, 2017.[LNNT16] Kasper Green Larsen, Jelani Nelson, Huy L Nguyen, and Mikkel Thorup. Heavyhitters via cluster-preserving clustering. In , pages 61–70. IEEE, https://arxiv.org/pdf/1604.01357 , 2016.[LS13] Yin Tat Lee and Aaron Sidford. Eﬃcient accelerated coordinate descent meth-ods and faster algorithms for solving linear systems. In , pages 147–156. IEEE, 2013.[LS14] Yin Tat Lee and Aaron Sidford. Path ﬁnding methods for linear programming:Solving linear programs in O ( √ rank ) iterations and faster algorithms for maxi-mum ﬂow. In , pages 424–433. https://arxiv.org/pdf/1312.6677.pdf , https://arxiv.org/pdf/1312.6713.pdf , 2014.[LS15] Yin Tat Lee and Aaron Sidford. Eﬃcient inverse maintenance and faster algorithmsfor linear programming. In FOCS , pages 230–249. IEEE Computer Society, 2015.[LS19] Yin Tat Lee and Aaron Sidford. Solving linear programs with √ rank linear systemsolves. In arXiv preprint . https://arxiv.org/pdf/1910.08033.pdf , 2019.[LS20a] Yang P Liu and Aaron Sidford. Faster divergence maximization for faster maximumﬂow. In arXiv preprint . https://arxiv.org/pdf/2003.08929.pdf , 2020.[LS20b] Yang P Liu and Aaron Sidford. Faster energy maximization for faster maximumﬂow. In STOC . https://arxiv.org/pdf/1910.14276.pdf , 2020.[LSZ19] Yin Tat Lee, Zhao Song, and Qiuyi Zhang. Solving empirical risk minimization inthe current matrix multiplication time. In COLT . https://arxiv.org/pdf/1905.04447 , 2019.[Mad02] Omid Madani. Polynomial value iteration algorithms for detrerminstic mdps. In UAI , pages 311–318, 2002.[Mad13] Aleksander Madry. Navigating central path with electrical ﬂows: From ﬂows tomatchings, and back. In

FOCS , pages 253–262. IEEE Computer Society, 2013.[Mad16] Aleksander Madry. Computing maximum ﬂow with augmenting electrical ﬂows. In ,pages 593–602. IEEE, 2016.[Moo59] E. F. Moore. The Shortest Path Through a Maze. In Proceedings of the Interna-tional Symposium on the Theory of Switching, pages 285–292, 1959.102Mun57] J. Munkres. Algorithms for the Assignment and Transportation Problems.

Journalof SIAM , 5(1):32–38, 1957.[Net56] R. Ford Network Flow Theory.

Paper P-923 . The RAND Corperation, SantaMoncia, California, 1956.[NS17] Danupon Nanongkai and Thatchaphol Saranurak. Dynamic spanning forest withworst-case update time: adaptive, las vegas, and O ( n / − (cid:15) )-time. In Proceedingsof the 49th Annual ACM SIGACT Symposium on Theory of Computing (STOC) ,pages 1122–1129, 2017.[NS19] Vasileios Nakos and Zhao Song. Stronger l2/l2 compressed sensing; without iterat-ing. In

STOC . https://arxiv.org/pdf/1903.02742 , 2019.[NSW17] Danupon Nanongkai, Thatchaphol Saranurak, and Christian Wulﬀ-Nilsen. Dy-namic minimum spanning forest with subpolynomial worst-case update time. In FOCS , pages 950–961. IEEE Computer Society, 2017.[NSW19] Vasileios Nakos, Zhao Song, and Zhengyu Wang. (Nearly) Sample-optimal sparseFourier transform in any dimension; RIPless and Filterless. In

FOCS . https://arxiv.org/pdf/1909.11123.pdf , 2019.[NT97] Yurii E. Nesterov and Michael J. Todd. Self-scaled barriers and interior-point meth-ods for convex programming. Math. Oper. Res. , 22(1):1–42, 1997.[Pag13] Rasmus Pagh. Compressed matrix multiplication.

ACM Transactions on Compu-tation Theory (TOCT) , 5(3):1–17, 2013.[Qua19] Kent Quanrud. Approximating optimal transport with linear programs. In

SOSA ,2019.[Ren88] James Renegar. A polynomial-time algorithm, based on newton’s method, for linearprogramming.

Math. Program. , 40(1-3):59–93, 1988.[San04] Piotr Sankowski. Dynamic transitive closure via dynamic matrix inverse (extendedabstract). In

FOCS , pages 509–517. IEEE Computer Society, 2004.[San05] Piotr Sankowski.

Algorithms – ESA 2005: 13th Annual European Symposium,Palma de Mallorca, Spain, October 3-6, 2005. Proceedings . chapter Shortest Pathsin Matrix Multiplication Time, pages 770–778. Springer Berlin Heidelberg, Berlin,Heidelberg, 2005.[San06] Piotr Sankowski.

Automata, Languages and Programming: 33rd International Col-loquium, ICALP 2006, Venice, Italy, July 10-14, 2006, Proceedings, Part I . chap-ter Weighted Bipartite Matching in Matrix Multiplication Time, pages 274–285.Springer Berlin Heidelberg, Berlin, Heidelberg, 2006.[San09] Piotr Sankowski. Maximum weight bipartite matching in matrix multiplicationtime.

Theor. Comput. Sci. , 410(44):4480–4488, 2009.[Sch03] Alexander Schrijver.

Combinatorial optimization: polyhedra and eﬃciency , vol-ume 24. Springer Science & Business Media, 2003.[She17] Jonah Sherman. Generalized preconditioning and undirected minimum-cost ﬂow.In

SODA , pages 772–780. SIAM, 2017.103Shi55] A. Shimbel. Structure in Communication Nets. In

In Proceedings of the Symposiumon Information Networks , pages 199–203, Brooklyn, 1955. Polytechnic Press of thePolytechnic Institute of Brooklyn.[ST03] Daniel A. Spielman and Shang-Hua Teng. Solving sparse. In

FOCS’03: Proceedingsof the 44th Annual IEEE Symposium on Foundations of Computer Science , pages416–427, diagonally-dominant linear systems in time O ( m . ). In, 2003. symmetric.[ST04] Daniel A. Spielman and Shang-Hua Teng. Nearly-linear time algorithms for graphpartitioning, graph sparsiﬁcation, and solving linear systems. In STOC’04: Pro-ceedings of the 36th Annual ACM Symposium on the Theory of Computing , pages81–90. ACM, 2004.[ST11] Daniel A. Spielman and Shang-Hua Teng. Spectral sparsiﬁcation of graphs.

SIAMJ. Comput. , 40(4):981–1025, 2011.[SW19] Thatchaphol Saranurak and Di Wang. Expander decomposition and pruning:Faster, stronger, and simpler. In

SODA , pages 2616–2635. SIAM, 2019.[VA93] Pravin M Vaidya and David S Atkinson. A technique for bounding the number ofiterations in path following algorithms. In

Complexity in Numerical Optimization ,pages 462–489. World Scientiﬁc, 1993.[Vai87] Pravin M. Vaidya. An algorithm for linear programming which requires O ((( m + n ) n + ( m + n ) . n ) L ) arithmetic operations. In STOC , pages 29–38. ACM, 1987.[Vai89] Pravin M. Vaidya. Speeding-up linear programming using fast matrix multiplication(extended abstract). In

FOCS , pages 332–337. IEEE Computer Society, 1989.[Vai91] Pravin M. Vaidya. Solving linear equations with symmetric diagonally dominantmatrices by constructing good preconditioners. Technical report, Unpublishedmanuscript, UIUC 1990. A talk based on the manuscript was presented at the IMAWorkshop on Graph Theory and Sparse Matrix Computation Mineapolis, October1991.[Wil12] Virginia Vassilevska Williams. Multiplying matrices faster than coppersmith-winograd. In

STOC , pages 887–898. ACM, 2012.[Wul17] Christian Wulﬀ-Nilsen. Fully-dynamic minimum spanning forest with improvedworst-case update time. In

Proceedings of the 49th Annual ACM SIGACT Sympo-sium on Theory of Computing (STOC) , pages 1130–1143, 2017.[YZ05] Raphael Yuster and Uri Zwick. Answering distance queries in directed graphs usingfast matrix multiplication. pages 389–396, 2005.104 ppendixA Initial Point and Solution Rounding

In this section we discuss how to construct the initial point for our IPM and how to convertfractional to integral solutions.In Appendix A.1 we show how to construct a feasible primal solution x and a feasible dualslack s , such that xs ≈ τ ( x, s ), as required by the IPM. Unfortunately, for a given minimum cost b -matching instance we are not able to construct such an initial point. Instead, we replace thegiven cost vector c by some alternative cost vector c . The idea is to then run the IPM in reverse,so that our solution pairs ( x, s ) becomes less tailored to c and further away from the constraints,i.e. the points move closer towards the center of the polytope. Once we are suﬃciently closeto the center it does not matter if we use cost vector c or c , so we can simply switch the costvector. That switching the cost vector is indeed possible is proven in Appendix A.2. Afterswitching the cost vector, we then run the IPM again as to move the point ( x, s ) closer towardsthe optimal solution.Our IPM is not able to maintain a truly feasible primal solution x , instead the solution isonly approximately feasible. Thus we must discuss how to turn such an approximately feasiblesolution into a truly feasible solution, which is done in Appendix A.3. Further, the solutionwe obtain this way might be fractional, so at last in Appendix A.4 we show how to obtain anintegral solution. A.1 Finding an Initial Point (Proof of Lemmas 8.3 and 8.13)

We want to solve minimum cost b -matching instances. In order to construct an initial point,we reduce this problem to uncapacitated minimum cost ﬂows. For the reduction we ﬁrst con-struct the underlying graph for the ﬂow instance from a given bipartite b -matching instance asdescribed in Deﬁnition 8.6. We are left with constructing a feasible primal solution x , a costvector c and a feasible slack s of the dual w.r.t. the cost vector c , such that xs ≈ τ ( x, s ) asrequired by our IPM.In Section 4.2 and Section 4.3 we present two diﬀerent IPMs, where the ﬁrst uses τ ( x, s ) = 1and the second uses τ ( x, s ) = σ ( x, s ) + n/m where σ ( x, s ) := σ ( X / − α S − / − α A ) for α =

14 log(4 m/n ) . Hence we will also present two lemmas for construct the initial point, one for eachoption of τ ( x, s ). Lemma 8.7.

We ﬁrst construct the primal solution x , then we construct the dual slack s , and at lastwe construct a cost vector c for which there is a dual solution y with slack s . Primal solution (feasible ﬂow)

Let n, m be the number of nodes and edges in G . For ourprimal solution x ∈ R E be deﬁne the following ﬂow x u,v = 1 n for u ∈ U , v ∈ V ,( u, v ) ∈ E ,105 u,z = b u − deg G ( u ) n for u ∈ U , x z,v = b v − deg G ( v ) n for v ∈ V .Here we have x > b v ≥ G ( v ) < n for all v ∈ V ∪ U . The ﬂow also satisﬁes thedemand d , because for every u ∈ U there is a total amount of P ( u,v ) ∈ E /n + b u − deg G ( u ) /n = b u = d u leaving u , and likewise a total ﬂow of b v reaching v ∈ V . Further, the ﬂow reachingand leaving z is 0, because X u ∈ U ( b u − deg G ( u ) /n ) = ( X u ∈ U b u ) + | E | /n = ( X v ∈ V b v ) + | E | /n = X v ∈ V ( b v − deg G ( v ) /n ) , where the ﬁrst term i the ﬂow reaching z and the last term is the ﬂow leaving z . Thus insummary, the ﬂow x is a feasible ﬂow. Dual slack and cost vector

Now, we deﬁne the slack s and the cost vector c . The slack ofthe dual are of the form s u,v = c u,v − y u + y v for ( u, v ) ∈ E . We simply set y = ~ n and s u,v = c u,v = 1 /x u,v for every ( u, v ) ∈ E , so then xs = 1. As 1 /n ≤ x u,v ≤ b u − deg G ( u ) /n ≤ k b k ∞ ,we get k b k − ∞ ≤ c ≤ n . Lemma A.1 ([CP15], Theorem 12 of [BLSS20]) . Given p ∈ (0 , , η > and incidence matrix A ∈ R m × n w.h.p. in n , we can compute w ∈ R m> with w ≈ (cid:15) σ ( W − p A ) + η~ in e O ( m poly(1 /(cid:15) )) time. We remark that the complexity stated in [BLSS20] is e O ((nnz( A ) + n ω ) poly(1 /(cid:15) )), becauseit is stated for general matrices. However, when using a Laplacian Solver (Lemma 2.1) the n ω term becomes e O ( m ). Lemma 8.15.

We are given a minimum weight perfect b -matching instance on graph G =( U, V, E ) with cost vector c ∈ R E .Let G = ( V , E ) with demand vector d be the corresponding starred ﬂow graph (see Deﬁ-nition 8.6) and let n be the number of nodes of G and m the number of edges.For τ ( x, s ) := σ ( x, s ) + m/n and for any (cid:15) > we can construct in e O ( m poly( (cid:15) − )) time acost vector c ∈ R E for G and a feasible primal dual solution pair ( x, s ) for the minimum costﬂow problem on G with cost vector c , where the solution satisﬁes xs ≈ (cid:15) τ ( x, s ) . (66) The cost vector c satisﬁes n m (1 + k b k ∞ ) ≤ c ≤ n. Proof.

The construction of the primal solution x is the same asthe one constructed in the proof of Lemma 8.7, and its feasibility follows the same argument.106 ual slack Let A be the incidence matrix of graph G and deﬁne A := XA . We constructa vector s with the property s ≈ (cid:15) σ ( S / − α A ) + nm , This construction is done via Lemma A.1.Next, deﬁne s := s /x , so the vectors x and s satisfy xs = s ≈ (cid:15) σ ( S / − α A ) + nm = σ ( S / − α XA ) + nm = σ (( XS ) − / − α XA ) + nm = σ ( X / − α S − / − α A ) + nm = τ ( x, s )Thus the vector s satisﬁes xs ≈ (cid:15) τ ( x, s ) and we are only left with making sure that s is indeeda slack vector of the dual. Constructing cost vector

We deﬁne the cost vector c := s and consider the dual solution y = ~ n . The dual problem of the uncapacitated min-cost-ﬂow is c u,v ≥ y u − y v for ( u, v ) ∈ E . So by setting y = ~ n the dual slack is exactly c = s , i.e. the vector s is indeed a slack of thedual problem. As c := s , we can bound c i ≤ e (cid:15) τ ( x, s ) /x i ≤ n and c i ≥ e − (cid:15) τ ( x, s ) /x i ≥ n m k d k ∞ , where x ≤ k d k ∞ comes from the fact that any edge incident to u ∈ U or v ∈ V can carry atmost | d u | or | d v | units of ﬂow, because all edges are directed away from u and towards v . A.2 Switching the Cost Vector (Proof of Lemma 8.8)

The construction of our initial point replaces the given cost vector by some other cost vector.We now show that it is possible to revert this replacement, provided that our current point( x, s ) is far enough away from the optimal solution. Below we restate Lemma 8.8.

Lemma 8.8.

Consider the function τ ( x, s ) and a uncapacitated min-cost-ﬂow instance as con-structed in Lemma 8.15 or Lemma 8.7 and let A be the incidence matrix of the underlying graph G . Let c be the cost vector constructed in Lemma 8.15 and let c be any other cost vector with k c k ∞ ≥ .Assume we have a primal dual solution pair ( x, s ) for cost vector c and demand d with k A x − d k ( AXS − A ) − ≤ √ µ/ and xs ≈ (cid:15)/ µτ ( x, s ) for any (cid:15) ≤ and µ ≥ k c k ∞ k d k ∞ (cid:15) .If we replace the cost vector c with cost vector c then ( x, s + c − c ) is a primal dual solutionpair for the new min-cost-ﬂow instance with cost vector c and we have x ( s + c − c ) ≈ (cid:15)/ µτ ( x, s + c − c ) .Proof. Let y be the dual solution with A y + s = c , and let s := s + c − c be the slack vectorwhen replacing cost vector c by c . Then A y + s = A y + s + c − c = c , so if s ≥

0, then it isindeed a valid slack vector.In order to show that s ≥ xs ≈ (cid:15)/ µτ ( x, s ), we will bound the relative diﬀerencebetween s and s , i.e. we want an upper bound on s − ss = c − cs . s is large. So we want to lower bound s . By xs ≈ / µτ ( x, s ) we have s i ≥ µτ ( x, s ) i x i ≥ µn mx i for all i, (67)where we used that τ ( x, s ) ≥ m/n . In order to bound this term, we must ﬁrst ﬁnd an upperbound on x .By Lemma 4.38 and the assumption k A x − d k ( AXS − A ) − ≤ √ µ/ x with k x − ( x − x ) k ∞ ≤ /

2. For this feasible x we know that x ≤ k d k ∞ and thus x ≤ . k d k ∞ .By (67) we have s i ≥ µn mx i ≥ µn . k d k ∞ which in turn implies k s − ss k ∞ ≤ ( k c k ∞ + k c k ∞ ) 1 . k d k ∞ µn ≤ ( k c k ∞ + 3 n ) 1 . k d k ∞ µn ≤ . k c k ∞ k d k ∞ µ Thus for µ ≥ k c k ∞ k d k ∞ (cid:15) we can bound (cid:13)(cid:13)(cid:13)(cid:13) s − ss (cid:13)(cid:13)(cid:13)(cid:13) ∞ ≤ (cid:15) . This then implies xs ≈ (cid:15)/ µτ ( x, s ). A.3 Rounding an Approximate to a Feasible Solution (Proof of Lemma 8.9)

Our IPM is not able to maintain a feasible primal solution x , instead we can only guaranteethat µ k A > x − b k A > XS − A ) − is small. We now show that this is enough to construct a feasiblesolution that does not diﬀer too much from the provided approximately feasible solution. Wefurther show how small the progress parameter µ of the IPM must be, in order to obtain a smalladditive (cid:15) error compared to the optimal solution. Below we restate Lemma 8.9. Lemma 8.9.

OPT is the optimal value of the min-cost ﬂow instance.

In order to prove Lemma 8.9, we will ﬁrst prove a helpful lemma that bounds k A > x − b k with respect to µ k A > x − b k A > XS − A ) − . Lemma A.2.

Let G be a uncapacitated ﬂow instance for demand d and cost c where for everyfeasible ﬂow f we have f ≤ k d k ∞ . Let A be the incidence matrix of G . Given a primal dualpair ( x, s ) with infeasible x and xs ≈ τ ( x, s ) , deﬁne δ := 1 µ · k A > x − d k A > XS − A ) − . Then k A > x − d k ≤ m k d k ∞ δ roof. By Lemma 4.38 we know that there is a feasible x with k x − ( x − x ) k ∞ ≤ δ . k A > x − d k ≤ λ min (( A > XS − A ) − ) k A > x − d k A > XS − A ) − = λ max ( A > XS − A ) µδ ≤ max i { x i /s i } λ max ( A > A ) µδ ≤ n max i { x i /s i } µδ Because of xs ≈ µτ ( x, s ) and x ≤ k d k ∞ we have s i ≥ µτ ( x, s )3 x i ≥ nµ m k d k ∞ which implies x i s i ≤ m k d k ∞ nµ and thus k A > x − d k ≤ m k d k ∞ δ. We can now prove Lemma 8.9.

Proof of Lemma 8.9.

The algorithm works as follows. First we compute x = x + XS − AH − ( d − A > x )for some H − ≈ δ ( A > XS − A ) − . This can be done in e O ( m log δ − ) time via a Laplacian solver.Then for every u ∈ U with ( A > x ) u > d u we reduce the ﬂow on some edges incident to u such that the demand of u is satisﬁed. We also do the same for every v ∈ V with ( A > x ) v < d v .Let x be the resulting ﬂow. After that we route ( A > x ) u − d u ﬂow from u ∈ U to z andlikewise d v − ( A > x ) v ﬂow from z to v ∈ V . Let f be the resulting ﬂow, then f is a feasibleﬂow as all demands are satisﬁed. Constructing f from x this way takes e O ( m ) time. Optimality

We are left with proving claim c > f ≤ OPT + (cid:15) . By Lemma 4.38 the vector x satisﬁes k A > x − d k AXS − A ) − ≤ δ · k A > x − d k AXS − A ) − ≤ δµ k X − ( x − x ) k ∞ ≤ µ · k A x − b k A > XS − A ) − ≤

12 (69)So by Lemma A.2 we have k A > x − d k ≤ m k d k ∞ k A > x − d k AXS − A ) − µ ≤ m k d k ∞ δ. As this bounds the maximum error we have for the demand of each node, this also bounds howmuch we might change the ﬂow on some edge, i.e. we obtain the bound k x − f k ≤ k d k ∞ √ mδ. c > f ≤ c > x + k c k k x − f k ≤ c > x + m k c k ∞ k d k ∞ √ δ. We are left with bounding c > x . For this note that the duality gap of ( x , s ) can be bounded by X i ∈ [ m ] x i s i ≤ X i ∈ [ m ] x i s i ≤ µ X i ∈ [ m ] τ ( x, s ) i ≤ µn, where we use (69), xs ≈ µτ ( x, s ), and P i ∈ [ m ] τ ( x, s ) i = 2 n . As the duality gap is an upperbound on the diﬀerence to the optimal value, we have c > x ≤ OPT + 12 µn so for µ ≤ (cid:15) n and δ ≤ (cid:15) k c k ∞ k d k ∞ m ) we have c > f ≤ OPT + 12 µn + m k c k ∞ k d k ∞ √ δ ≤ OPT + (cid:15).

A.4 Rounding to an Integral Solution (Proof of Lemma 8.10)

In Appendix A.3 we showed how to obtain a feasible nearly optimal fractional solution. Wenow show that we can also obtain a feasible truly optimal integral solution by rounding eachentry of the solution vector x to the nearest integer. This only works, if the optimal solutionis unique. In order to make sure that the optimal solution is unique, we use a variant of theIsolation-lemma. In high-level, this lemma shows that by adding a small random cost to eachedge, the minimum-cost ﬂow becomes unique. Below we restate Lemma 8.10. Lemma 8.10.

Let

Π = (

Lemma A.3 ([KS01, Lemma 4]) . Let C be any collection of distinct linear forms in variables z , ..., z ‘ with coeﬃcients in the range { , ..., K } . If z , ..., z ‘ are independently chosen uniformlyat random in { , ..., K‘/(cid:15) } , then, with probability greater than − (cid:15) , there is a unique form ofminimal value at z , ..., z ‘ . Here we prove Lemma 8.10 via a similar proof to the one in [DS08].

Proof of Lemma 8.10.

We start by proving that e Π has a unique optimal solution. We then showthat rounding a nearly optimal fractional matching results in the optimal integral matching.

Unique Optimal Solution

Let C ⊂ R E be the set of optimal integral ﬂows. Then we caninterpret each f ∈ C as a linear form z

7→ h f, z i in the variables z , ..., z m . Further we have0 ≤ f e ≤ W for all f ∈ C and e ∈ E .Now consider z ∈ R m where each z i is an independent uniformly sampled element from { , ..., K‘/(cid:15) } . By Lemma 8.10 the minimizer argmin f ∈ C h f, z i is unique with probability at least1 /

2. As h f, c i = h f , c i for all f, f ∈ C , we have that for c := c + z/ (4 m W ) there also existsa unique f ∈ C that minimizes h f, c i . Further note that h f, c i < h f, c i + 1 so this f is also theunique optimal solution to e Π. 110 ounding to the optimal integral matching

Since we added random multiples of m W to each edge cost, the second best integral matching f has cost at least OPT( e Π) + m W .Now assume we have some feasible fractional matching e f . As every feasible fractionalmatching is a convex combination of feasible integral matchings, we can write tf = λf +(1 − λ ) g ,where λ ∈ [0 , f is the optimal (integral) matching, and g is a feasible fractional matchingwith cost at least OPT( e Π) + m W (i.e. g is a convex combination of non-optimal feasibleintegral matchings).If e f has cost at most OPT( e Π) + m W , then λ ≥ − W and thus k e f − f k ∞ ≤ /

3. So byrounding the entries of e f to the nearest integer we obtain the optimal ﬂow f . B Leverage Score Maintenance

In this section we explain how to obtain the following data structure for maintaining approximateleverage scores via results from [BLSS20].

Theorem B.1.

There exists a Monte-Carlo data-structure, which works against an adaptiveadversary, that supports the following procedures: • Initialize ( A ∈ R m × n , g ∈ R m , (cid:15) ∈ (0 , : Given incidence matrix A ∈ R m × n , scaling g ∈ R m and accuracy (cid:15) > , the data-structure initializes in e O ( m(cid:15) − ) time. • Scale ( I ⊂ [ m ] , u ∈ R I ) : Sets g I = u in e O ( | I | (cid:15) − ) time. • Query () : Let g ( t ) ∈ R m be the vector g ∈ R m during t -th call to Query and assume g ( t ) ≈ / g ( t − . W.h.p. in n the data-structure outputs a vector e τ ∈ R m such that e τ ≈ (cid:15) σ ( √ G ( t ) A ) + nm ~ . Further, with probability at least the total cost over K steps is e O (cid:16) K · (cid:15) − · m/n + K ( n(cid:15) − + n(cid:15) − log W ) (cid:17) , where W is a bound on the ratio of largest to smallest entry of any g ( t ) .Let g ( k ) be the state of g during the k -th call to Query and deﬁne τ ( g ) := σ ( √ GA ) + nm ~ .Then the complexity bound on Query holds, if there exists a sequence e g (1) , ..., e g ( K ) ∈ R m wherefor all k ∈ [ K ] g ( k ) i ∈ (cid:18) ± · (cid:15) log n (cid:19) e g ( k ) i for all i ∈ [ m ] , and (70)1 (cid:15) k G ( k ) − ( g ( k +1) − g ( k ) ) k τ ( g ( k ) ) + k ( T ( g ( k ) ) − ( τ ( g ( k +1) ) − τ ( g ( k ) )) k τ ( g ( k ) ) = O (1) . (71)We use the following Lemma B.2 from [BLSS20] which describes a data structure for main-taining approximate leverage scores. Note that the complexity scales with the Frobenius normof some matrix. We obtain Theorem B.1 by combining Lemma B.2 with another data structurethat guarantees that this norm bound is small. Lemma B.2 ([BLSS20, Theorem 8, Algorithm 7, when specialized to incidence matrices]) . There exists a Monte-Carlo data-structure, which works against an adaptive adversary, thatsupports the following procedures: • Initialize ( A ∈ R m × n , g ∈ R m , (cid:15) ∈ (0 , : Given incidence matrix A ∈ R m × n , scaling g ∈ R m and accuracy (cid:15) > , the data-structure initializes in e O ( m(cid:15) − ) time. • Scale ( i ∈ [ m ] , u ∈ R ) : Given i ∈ [ m ] and u ∈ R sets g i = u in e O ( (cid:15) − ) time. Query (Ψ ( t ) ∈ R n × n , Ψ ( t )(safe) ∈ R n × n ) : Let g ( t ) ∈ R m be the vector g ∈ R m during t -th callto Query , assume g ( t ) ≈ / g ( t − and deﬁne H ( t ) = A > G ( t ) A ∈ R n × n . Given randominput-matrices Ψ ( t ) ∈ R n × n and Ψ ( t )(safe) ∈ R n × n such that Ψ ( t ) ≈ (cid:15)/ (24 log n ) ( H ( t ) ) − , Ψ ( t )(safe) ≈ (cid:15)/ (24 log n ) ( H ( t ) ) − and any randomness used to generate Ψ ( t )(safe) is independent of the randomness used togenerate Ψ ( t ) , w.h.p. in n the data-structure outputs a vector e τ ∈ R m independent of Ψ (1) , ..., Ψ ( t ) such that e τ ≈ (cid:15) σ ( √ G ( t ) A ) + nm ~ . Further, the total cost over K steps is e O (cid:16)(cid:16) X t ∈ [ K ] k p G ( t ) A Ψ ( t ) A > p G ( t ) − p G ( t − A Ψ ( t − A > p G ( t − k F (cid:17) · (cid:15) − m/n + K (cid:16) T Ψ (safe) + T Ψ + n(cid:15) − + n(cid:15) − log W (cid:17) (cid:17) where T Ψ , T Ψ (safe) is the time required to multiply a vector with Ψ ( t ) ∈ R n × n and Ψ ( t )(safe) ∈ R n × n respectively (i.e. in case the matrices are given implicitly via data structures). The initialization, scale and query complexity of [BLSS20, Theorem 8, Algorithm 7] arelarger by a factor of n compared to the complexities stated in Lemma B.2. This diﬀerence incomplexity is because the data structure of [BLSS20] was proven for general matrices A whichmay have up to n entries per row, while in Lemma B.2 we consider only incidence matrices withonly 2 entries per row. The proof of [BLSS20, Theorem 8, Algorithm 7] works by reducing theproblem of maintaining leverage scores to the problem of detecting large entries of the product Diag ( g ) A h , for some vectors g ∈ R m ≥ , h ∈ R n . This is exactly what our data structure ofLemma 5.1 does. By plugging our data structure of Lemma 5.1 into the algorithm of [BLSS20]and observing the sparsity of A , we obtain the faster complexities as stated in Lemma B.2 forincidence matrices A .We combine Lemma B.2 with the following Algorithm 12. The data structure of Algo-rithm 12 guarantees that the Frobenius norm (as used in the complexity statement of Lemma B.2)stays small. The properties of Algorithm 12 are stated in Lemma B.3 and Lemma B.4, wherethe latter bounds the change of the Frobenius norm. Lemma B.3 ([BLSS20, Theorem 10, Algorithm 8]) . There exists a randomized data-structurethat supports the following operations (Algorithm 12) • Initialize ( g ∈ R m , τ ∈ R m , (cid:15) ) : Preprocess vector g ∈ R m , τ ∈ R m and accuracy parameter (cid:15) > in O ( m ) time. Returns two vector e g, v ∈ R m . • Update ( I ⊂ [ m ] , s ∈ R I , t ∈ R I ) Sets g I ← s , τ I ← t in e O ( | I | ) time. Returns set J ⊂ [ m ] and two vectors e g, v ∈ R m . Here e g is returned via pointer and J lists the indices i where e g i or v i changed compared to the previous output.If the update does not depend on any of the previous outputs, and if there is some matrix A ∈ R m × n with τ ≈ . τ ( G / A ) , then e g, v satisfy following two properties: (i) with highprobability k v k = O ( n(cid:15) log n ) . (ii) e g ≈ (cid:15) g , and with high probability A > VA ≈ (cid:15) A > GA . Lemma B.4 ([BLSS20, Lemma 11]) . Let g ( k ) , τ ( k ) ∈ R m , e g ( k ) , v ( k ) ∈ R m be the vectors g , τ , e g , v right before the k -th call to Update of Lemma B.3. Assume there exists a sequence g (1) , ..., g ( K ) ∈ R m and matrix A ∈ R m × n with g ( k ) i ∈ (1 ± / (16 (cid:15) log n )) g ( k ) i , τ ( k ) i ∈ (1 ± / (16 log n )) τ ( G ( k ) A ) i , ∀ i ∈ [ m ]112 lgorithm 12: Pseudocode for Lemma B.3, based on [BLSS20, Algorithm 8]. members e g , g , e τ , τ , (cid:15) , v , y , γ := c log n for some large constant c . procedure Initialize ( g, τ , (cid:15) ) g ← g , e g ← g , τ ← τ , e τ ← τ , (cid:15) ← (cid:15) , y ← ~ m v i ← e g i / min { , γ(cid:15) − e τ i } independently for each i ∈ [ m ] with probabilitymin { , γ(cid:15) − e τ i } and v i ← procedure Update ( I ⊂ [ m ] , s ∈ R I , t ∈ R I ) g I ← s , τ I ← t y i ← (cid:15) ( g i / e g i −

1) and y i + n ← τ i / e τ i −

1) for i ∈ I Let π : [2 n ] → [2 n ] be a sorting permutation such that | y π ( i ) | ≥ | y π ( i +1) | For each integer ‘ , we deﬁne i ‘ be the smallest integer such that P j ∈ [ i ‘ ] τ π ( j ) ≥ ‘ . Let k be the smallest integer such that | y π ( i k ) | ≤ − k d log n e J ← ∅ for each coordinate j ∈ [ i k ] do Set i = π ( j ) if π ( j ) ≤ n and set i = π ( j ) − n otherwise e g i ← g i , e τ i ← τ i , y j ← J ← J ∪ { i } v i ← (e g i / min { , γ(cid:15) − · e τ i } with probability min { , γ(cid:15) − · e τ i } return J , e g , v and (cid:15) − k ( G ( k ) ) − ( g ( k +1) − g ( k ) ) k τ ( g ( k ) ) + k ( T ( k ) ) − ( τ ( k +1) − τ ( k ) ) k τ ( g ( k ) ) = O (1) for all k ∈ [ K − , then E  X k ∈ [ K − (cid:13)(cid:13)(cid:13)(cid:13)q e G ( k +1) A ( A > V ( k +1) A ) − A > q e G ( k +1) − q e G ( k ) A ( A > V ( k ) A ) − A > q e G ( k ) (cid:13)(cid:13)(cid:13)(cid:13) F  is at most O ( K log / n ) . The complexity of

Update in [BLSS20, Lemma 11] is a bit higher since they consider densematrices. Here we show that for sparse matrices the amortized update time is nearly linear inthe input size.

Lemma B.5.

The function

Update can be implemented to run in amortized e O ( | I | ) time.Proof. We maintain a balanced binary search tree to sort the y . This takes e O ( | I | ) amortizedtime. Then use that binary search to tree as a preﬁx data-structure, i.e. each node is marked bythe sum of it’s children to the left. Then the preﬁx of ﬁrst i elements is obtained by going fromthe root down to the i th element and adding the value of a node whenever we take the right edge.When removing/inserting a node, updating these partial sums takes only O ( depth ) = e O (1) time.Hence the amortized time for maintaining this structure is e O ( | I | ).The rest of Update simply uses that data-structure, and can ﬁnd the right value for i ‘ and k via binary-search. Note that every index i for which the loop of Update is executed, hasbeen part of some input set I before and will not be included in that loop again, unless it ispart of another future input set I . Hence we can charge the cost of the loop to previous callsto Update . So we obtain O ( | I | ) amortized cost for the loop per call to Update .By combining Lemma B.2 with Lemma B.3 to get a good bound on the Frobenius norm viaLemma B.4 we then obtain Theorem B.1. 113 lgorithm 13:

Pseudocode for the data structure of Theorem B.1 members D (stable) , D ( τ ) , e g ∈ R m , v ∈ R m procedure Initialize ( A ∈ R m × n , g ∈ R m , (cid:15) ∈ (0 , τ ← compute σ ( √ GA ) + n/m . e g, v ← D (stable) . Initialize ( g, τ , (cid:15)/ (144 log n )) (Lemma B.3) D ( τ ) . Initialize ( A , e g, (cid:15)/

10) (Lemma B.2) procedure Scale ( I ⊂ [ m ] , u ∈ R | I | ) J, e g, v ← D (stable) . Update ( i, u ) D ( τ ) . Scale ( i, e g i ) for all i ∈ J . procedure Query () Ψ ← Laplacian Solver for ( A > VA ) − Ψ (safe) ← Laplacian Solver for ( A > WA ) − where w ∈ R m is a leverage score samplefrom τ . I, τ ← D ( τ ) . Query (Ψ , Ψ (safe) ) return I , τ Proof of Theorem B.1.

The algorithm is given by Algorithm 13. We will ﬁrst prove correctness,and then bound the complexity.

Correctness

Note that by Line 6 and Line 9, the data structure D ( τ ) always maintains theleverage score of √ e g A , where e g is the vector maintained by D ( sample ) (Lemma B.3). We willargue that D ( τ ) maintains τ ≈ (cid:15)/ τ ( p e GA ). By e g ≈ (cid:15)/ g this then implies τ ≈ (cid:15) τ ( GA ).For Lemma B.3 to return a good approximation, we require that (i) Ψ ≈ / (48 log n ) A e GA , (ii)Ψ (safe) ≈ / (48 log n ) A e GA , and (iii) the randomness of Ψ (safe) does not depend on the randomnessin Ψ.By Lemma B.3 the vector v satisﬁes A > VA ≈ (cid:15)/ (144 log n ) AGA , and e g ≈ (cid:15)/ (144 log n ) g . Sofor a (cid:15)/ (144 log n ) accurate Laplacian Solver (Lemma 2.1) we have Ψ ≈ (cid:15)/ (48 log n ) A e GA . Thuscondition (i) is satisﬁed.We assume that the result of the last call to D ( τ ) . Query satisﬁed τ ≈ τ ( √ e g A ). Then bythe assumption g ( t ) ≈ / g ( t − (i.e. the vector g does not change too much between two callsto Query ) we have that any entry of τ ( √ GA ) can change by at most a constant factor betweenany two calls to Query . As e g ≈ g , this means that this old τ from the previous iteration is stilla constant factor approximation of the new τ ( √ e g A ). So we can use the old τ of the previousiteration for the leverage score sampling in Line 12 to obtain a good approximation of A e GA .Note that the approximation can be arbitrarily (i.e. δ >

0) close by sampling O ( nδ − log n )entries. Hence we can have a Laplacian solver with Ψ (safe) ≈ (cid:15)/ (48 log n ) A e GA . Also note that therandomness of Ψ (safe) depends only on τ , which by Lemma B.2 does not depend on Ψ. Thus(iii) is also satisﬁed. Complexity

The complexities for

Initialization and

Scale come directly from Lemma B.2and Lemma B.3. So let us consider

Query instead.By assumption (71) and (70) the requirements for Lemma B.4 are satisﬁed and we have E  X k ∈ [ K − (cid:13)(cid:13)(cid:13)p G ( k +1) A ( A > V ( k +1) A ) − A > p G ( k +1) − p G ( k ) A ( A > V ( k ) A ) − A > p G ( k ) (cid:13)(cid:13)(cid:13) F  is at most O ( K log / n ). 114ote that the Laplacian Solver for ( A > V ( k ) A ) − can be made δ > O (log δ − ) (Lemma 2.1), so by choosing some δ = O ( (cid:15)/ poly( n )), we can bound E  X k ∈ [ K − (cid:13)(cid:13)(cid:13)p G ( k +1) A Ψ ( k +1) A > p G ( k +1) − p G ( k ) A Ψ ( k ) A > p G ( k ) (cid:13)(cid:13)(cid:13) F  ≤ O ( K log / n ) . By Markov inequality we thus have with probability at least 1 / e O ( K · (cid:15) − · m/n + K ( n(cid:15) − + n(cid:15) − log W )) , where we used that both Ψand Ψ (safe) can be applied in e O ( n(cid:15) − ) time, because of the sparsity of v and w . C Degeneracy of A

For all our data structures and the ﬁnal algorithm from Section 8, for simplicity we alwaysassumed that the constraint matrix A ∈ R m × n of the linear program is the incidence matrix ofsome directed graph. However, this comes with some other problem: Note that the rank of sucha matrix is at most n −

1, so A is a so called degenerate matrix, and the Laplacian L = ( A > A )is singular, i.e. L − does not exist. However, the IPM presented in Section 4 assumes that theconstraint matrix is non-degenerate and that ( A > A ) − exists. So technically it is not clear ifthe IPM can indeed be used to solve the perfect b -matching problem as we did in Section 8. Weoutline here how we can assume that A is indeed non-degenerate and what small modiﬁcationsmust be performed to our data structures to handle the new matrix A .Assume A is the incidence matrix of some minimum weight perfect bipartite b -matchinginstance, and the corresponding formulation as a linear program is min A > x = b,x ≥ c > x . Insteadof using the incidence matrix A and cost vector c , we use A = " AI n × n , and c = " c k b k k c k ∞ · ~ n . That is, we add an n × n identity block to the bottom of A and add a large cost at the bottomof c to make sure the optimal solution of min A x = b,x ≥ c x has x k = 0 for all m < k ≤ m + n ,so that the ﬁrst m coordinates of x are an optimal solution for the original linear program.After this modiﬁcation, the matrix A is full rank and the IPM from Section 4 can beapplied. However, the matrix A is no longer an incidence matrix, so we have to modify ourdata structures that assume A is an incidence matrix. Lemma 5.1:

We can detect the large entries of A h by using Lemma 5.1 for the ﬁrst m entriesand simply computing the bottom n entries of A h explicitly. This additional cost of O ( n ) issubsumed by Lemma 5.1. We can handle the sampling in Sample the same way by computingthe impact of the bottom n entries of A h to the norm by computing the bottom n entriesexplicitly. For the upper bound on the leverage scores and the leverage score sample, note thatadding more rows to A only decreases the leverage score of the original rows. Hence we canuse the previous upper bound for the top m rows and simply use 1 as an upper bound on theleverage scores of the bottom n rows. Laplacian Solvers:

We use Laplacian solvers in several algorithms. While A GA is nolonger a Laplacian matrix, it is still an SDDM (symmetric diagonally dominant matrix) andsolving systems in such matrices also has fast nearly linear time solvers. Other data structures

All other data structures do not require any special structure to thematrix A besides of very sparse rows. While results such as Theorem 6.1 state that A must bean incidence matrix, this is only because it uses Lemma 5.1 internally, which has already beencovered. Likewise, Lemma B.2 is obtained via a reduction from [BLSS20] to the data structureLemma 5.1. 115 nitial Points Finding initial points is even easier now, as we no longer have to add a starto the bipartite graph. To construct a feasible x , we again route 1 /n ﬂow on each edge of thebipartite graph, and then set the bottom n coordinates of x such that we satisfy A x = bb