Almost-linear-time Weighted \ell_p-norm Solvers in Slightly Dense Graphs via Sparsification
DEEKSHA ADIL, BRIAN BULLINS, RASMUS KYNG, AND SUSHANT SACHDEVA

Abstract.
We give almost-linear-time algorithms for constructing sparsifiers with n poly(log n) edges that approximately preserve weighted (ℓ_2 + ℓ_p^p) flow or voltage objectives on graphs. For flow objectives, this is the first sparsifier construction for such mixed objectives beyond unit ℓ_p weights, and is based on expander decompositions. For voltage objectives, we give the first sparsifier construction for these objectives, which we build using graph spanners and leverage score sampling. Together with the iterative refinement framework of [Adil et al., SODA 2019], and a new multiplicative-weights based constant-approximation algorithm for mixed-objective flows or voltages, we show how to find (1 + 2^{−poly(log n)})-approximations for weighted ℓ_p-norm minimizing flows or voltages in p(m^{1+o(1)} + n^{4/3+o(1)}) time for p = ω(1), which is almost-linear for graphs that are slightly dense (m ≥ n^{4/3}).

University of Toronto, TTI Chicago, ETH Zurich.
E-mail addresses: {deeksha|sachdeva}@cs.toronto.edu, [email protected], [email protected].

1. Introduction
Network flow problems are some of the most extensively studied problems in optimization (e.g. see [AMO93; Sch02; GT14]). A general network flow problem on a graph G(V, E) with n vertices and m edges can be formulated as

min_{B^⊤ f = d} cost(f),

where f ∈ R^E is a flow vector on edges satisfying net vertex demands d ∈ R^V, B ∈ R^{E×V} is the signed edge-vertex incidence matrix of the graph, and cost(f) is a cost measure on flows. The weighted ℓ_∞-minimizing flow problem, i.e., cost(f) = ‖S^{−1} f‖_∞, captures the celebrated maximum-flow problem with capacities S; the weighted ℓ_1-minimizing flow problem, cost(f) = ‖S f‖_1, captures the transshipment problem generalizing shortest paths with lengths S; and cost(f) = f^⊤ R f = ‖R^{1/2} f‖_2^2 captures the electrical flow problem [ST04]. Dual to flow problems are voltage problems, which can be formulated as

min_{d^⊤ v = 1} cost′(B v).

Analogous to the flow problems, picking cost′(B v) = ‖S B v‖_1 captures the capacitated min-cut problem, cost′(B v) = ‖S^{−1} B v‖_∞ captures vertex-labeling [Kyn+15], and cost′(B v) = (
B v)^⊤ R^{−1} B v = ‖R^{−1/2} B v‖_2^2 captures the electrical voltages problem.

The seminal work of Spielman and Teng [ST04] gave the first nearly-linear-time algorithm for computing (1 + 1/poly(n))-approximate solutions to electrical (weighted ℓ_2-minimizing) flow/voltage problems. This work spurred the "Laplacian Paradigm" for designing faster algorithms for several classic graph optimization problems including maximum flow [Chr+11; She13; Kel+14], multi-commodity flow [Kel+14], bipartite matching [Mad13], transshipment [She17], and graph partitioning [OSV12]; culminating in almost-linear-time or nearly-linear-time low-accuracy algorithms (i.e. 1 + ε approximations with poly(1/ε) running time dependence) for many of these problems. Progress on high-accuracy algorithms (i.e. algorithms that return (1 + 1/poly(n))-approximate solutions with only a poly(log n) factor overhead in time) for solving these problems has been harder to come by, and for many flow problems has been based on interior point methods [DS08]. E.g. the best running time for maximum flow stands at Õ(min(m√n, n^ω + n^{2+1/6})) [LS14; CLS19] and Õ(m^{4/3+o(1)}) for unit-capacity graphs [Mad13; LS20b; LS20a]. Other results making progress in this direction include works on shortest paths with small-range negative weights [Coh+17b], and matrix scaling [Coh+17a; All+17]. Recently, there has been progress on the dense case. In [van+20], the authors developed an algorithm for weighted bipartite matching and transshipment running in Õ(m + n^{1.5}) time. This is a nearly-linear-time algorithm in moderately dense graphs.

Bubeck et al. [Bub+18] restarted the study of faster high-accuracy algorithms for the weighted ℓ_p-norm objective, cost(f) = ‖S f‖_p, a natural intermediate objective between ℓ_1 and ℓ_∞. This result improved the running time significantly over classical interior point methods [NN94] for p close to 2. Adil et al.
[Adi+19] gave a high-accuracy algorithm for computing ℓ_p-norm minimizing flows in time min{m^{4/3+o(1)}, n^ω} for p ∈ (2, √(log n)]. Building on their work, Kyng et al. [Kyn+19] gave an almost-linear-time high-accuracy algorithm for unit-weight ℓ_p-norm minimizing flows, cost(f) = ‖f‖_p^p, for large p ∈ (ω(1), √(log n)]. More generally, they give an almost-linear-time high-accuracy algorithm for mixed ℓ_2 + ℓ_p^p objectives as long as the ℓ_p^p term is unit-weight, i.e., cost(f) = ‖R^{1/2} f‖_2^2 + ‖f‖_p^p. Their algorithm for (ℓ_2 + ℓ_p^p)-minimizing flows was subsequently used as a key ingredient in recent results improving the running time for high-accuracy/exact maximum flow on unit-capacity graphs to m^{4/3+o(1)} [LS20b; LS20a].

In this paper, we obtain a nearly-linear running time for weighted ℓ_2 + ℓ_p^p flow/voltage problems on graphs. Our algorithm requires p(m^{1+o(1)} + n^{4/3+o(1)}) time for p = ω(1), which is almost-linear-time for p ≤ m^{o(1)} in slightly dense graphs (m ≥ n^{4/3}). Our running time p(m^{1+o(1)} + n^{4/3+o(1)}) is even better than the Õ(m + n^{1.5}) time obtained for bipartite matching in [van+20]. Our result beats the Ω(n^{1.5}) barrier that arises in [van+20] from the use of interior point methods that maintain a vertex dual solution using dense updates across √n iterations. The progress on bipartite matching relies on highly technical graph-based inverse maintenance techniques that are tightly interwoven with the interior point method analysis. In contrast, our sparsification methods provide a clean interface to iterative refinement, which makes our analysis much simpler and more compact. Graph Sparsification.
Various notions of graph sparsification – replacing a dense graph with a sparse one, while approximately preserving some key properties of the dense graph – have been key ingredients in faster low-accuracy algorithms. Benczúr and Karger [BK96] defined cut sparsifiers that approximately preserve all cuts, and used them to give faster low-accuracy approximation algorithms for maximum flow. Since then, several notions of sparsification have been studied extensively and utilized for designing faster algorithms [PS89; ST11; Rac08; SS11; Mad10; She13; Kel+14; RST14; CP15; Kyn+16; Dur+17; Chu+18].

Sparsification has had a smaller direct impact on the design of faster high-accuracy algorithms for graph problems, limited mostly to the design of linear system solvers [ST04; KMP11; PS14; Kyn+16]. Kyng et al. [Kyn+19] constructed sparsifiers for weighted ℓ_2 + unweighted ℓ_p^p-norm objectives for flows. In this paper, we develop almost-linear-time algorithms for building sparsifiers for weighted ℓ_2 + ℓ_p^p norm objectives for flows and voltages,

cost(f) = ‖R^{1/2} f‖_2^2 + ‖S f‖_p^p, and cost′(B v) = ‖W^{1/2} B v‖_2^2 + ‖U B v‖_p^p,

and utilize them as key ingredients in our faster high-accuracy algorithms for optimizing such objectives on graphs. Our construction of sparsifiers for flow objectives builds on the machinery from [Kyn+19], and our construction of sparsifiers for voltage objectives builds on graph spanners [Alt+93; PS89; BS07].

2. Our Results
Our main results concern flow and voltage problems with mixed (ℓ_2 + ℓ_p^p)-objectives for p ≥ 2; we restrict our attention to p = ω(1) in this overview. Section 3 provides detailed running times for all p ≥
2. We emphasize that by setting the quadratic term to zero in our mixed (ℓ_2 + ℓ_p^p)-objectives, we get new state-of-the-art algorithms for ℓ_p-norm minimizing flows and voltages.

Mixed ℓ_2-ℓ_p-norm minimizing flow. Consider a graph G = (V, E) along with non-negative diagonal matrices R, S ∈ R^{E×E}, a gradient vector g ∈ R^E, and demands d ∈ R^V. We refer to the diagonal entries of R and S as ℓ_2-weights and ℓ_p-weights respectively. Let B denote the signed edge-vertex incidence matrix of G (see Appendix A.1). We wish to solve the following minimization problem with the objective E(f) = g^⊤ f + ‖R^{1/2} f‖_2^2 + ‖S f‖_p^p:

(1) min_{B^⊤ f = d} E(f)

We require g ⊥ {ker(R) ∩ ker(S) ∩ ker(B)} so that the problem has bounded minimum value, and d ⊥ 1 so a feasible solution exists. These conditions can be checked in linear time and have a simple combinatorial interpretation. Note that the choice of graph edge directions in B matters for the value of g^⊤ f. The flow on an edge is allowed to be both positive and negative.

Mixed ℓ_2-ℓ_p-norm minimizing voltages. Consider a graph G = (V, E) along with non-negative diagonal matrices W ∈ R^{E×E} and U ∈ R^{E×E}, and demands d ∈ R^V. We refer to the diagonal entries of W and U as ℓ_2-conductances and ℓ_p-conductances respectively. In this case, we want to minimize the objective E(v) = d^⊤ v + ‖W^{1/2} B v‖_2^2 + ‖U B v‖_p^p in the minimization problem

(2) min_v E(v)

In the voltage setting, we only require d ⊥ 1 so the problem has bounded minimum value. Obtaining good solutions.
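Setting the gradient and the ℓ_p term to zero in Problem (1) recovers the classical electrical flow problem, which a single Laplacian solve handles exactly. A minimal numerical sketch of that ℓ_2 special case follows; the dense pseudoinverse stands in for a fast Laplacian solver, and the triangle graph is purely illustrative.

```python
import numpy as np

# The l2 special case of Problem (1): with g = 0 and S = 0, the minimizer of
# ||R^{1/2} f||_2^2 subject to B^T f = d is the electrical flow
#     f = R^{-1} B (B^T R^{-1} B)^+ d,
# computable by one Laplacian solve.  Dense pseudoinverse and a tiny graph
# stand in for a fast Laplacian solver here.
edges = [(0, 1), (1, 2), (0, 2)]               # oriented edges of a triangle
n, m = 3, len(edges)
B = np.zeros((m, n))                           # signed edge-vertex incidence matrix
for i, (u, v) in enumerate(edges):
    B[i, u], B[i, v] = 1.0, -1.0
r = np.array([1.0, 2.0, 4.0])                  # l2-weights (resistances)
L = B.T @ np.diag(1.0 / r) @ B                 # weighted Laplacian B^T R^{-1} B
d = np.array([1.0, 0.0, -1.0])                 # demands, orthogonal to all-ones

v = np.linalg.pinv(L) @ d                      # electrical voltages
f = np.diag(1.0 / r) @ B @ v                   # electrical flow (Ohm's law)
assert np.allclose(B.T @ f, d)                 # the demands are routed exactly

# Optimality: adding any circulation c (B^T c = 0) cannot decrease the energy.
rng = np.random.default_rng(0)
energy = lambda g: float(g @ (r * g))
proj = np.eye(m) - B @ np.linalg.pinv(B.T @ B) @ B.T   # projector onto circulations
for _ in range(100):
    c = proj @ rng.standard_normal(m)
    assert energy(f) <= energy(f + c) + 1e-9
```

The optimality check reflects the standard fact that Rf lies in the image of B at the optimum, so perturbing by circulations only adds energy.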
For both these problems, we study high-accuracy approximation algorithms that provide feasible solutions x (a flow or a voltage respectively) that approximately minimize the objective function from some starting point x^{(0)}, i.e., for some small ε > 0, we have

E(x) − E(x⋆) ≤ ε (E(x^{(0)}) − E(x⋆)),

where x⋆ denotes an optimal feasible solution. Our algorithms apply to problems with quasipolynomially bounded parameters, including quasipolynomial bounds on the non-zero singular values of the matrices we work with. Below we state our main algorithmic results.

Theorem 2.1 (Flow Algorithmic Result). Consider a graph G with n vertices and m edges, equipped with non-negative ℓ_2 and ℓ_p-weights, as well as a gradient and demands, all with quasipolynomially bounded entries. For p = ω(1), in p(m^{1+o(1)} + n^{4/3+o(1)}) log(1/ε) time we can compute an ε-approximately optimal flow solution to Problem (1) with high probability.

This improves upon [Adi+19; APS19; AS20], which culminated in a pm^{4/3+o(1)} log(1/ε) time algorithm.

Theorem 2.2 (Voltage Algorithmic Result). Consider a graph G with n vertices and m edges, equipped with non-negative ℓ_2 and ℓ_p-conductances, as well as demands, all with quasipolynomially bounded entries. For p = ω(1), in p(m^{1+o(1)} + n^{4/3+o(1)}) log(1/ε) time we can compute an ε-approximately optimal voltage solution to Problem (2) with high probability.

Background: Iterative Refinement for Mixed ℓ_2-ℓ_p-norm Flow Objectives. Adil et al. [Adi+19] developed a notion of iterative refinement for mixed (ℓ_2 + ℓ_p^p)-objectives which, in the flow setting, i.e. Problem (1), corresponds to approximating E′(δ) = E(f + δ) using another (ℓ_2 + ℓ_p^p)-objective that roughly corresponds to the 2nd degree Taylor series approximation of E′(δ) combined with an ℓ_p-norm term ‖S δ‖_p^p, while ensuring feasibility of f + δ through a constraint B^⊤ δ = 0. We call the resulting problem a residual problem. Adil et al.
[Adi+19] showed that obtaining a constant-factor approximate solution to the residual problem in δ is sufficient to ensure that E(f + δ) is closer to the optimal solution by a multiplicative factor depending only on p. In [APS19], this result was sharpened to show that such an approximate solution for the residual problem can be used to make (1 − Ω(1/p)) multiplicative progress towards the optimum, so that O(p log(m/ε)) iterations suffice to produce an ε-accurate solution. In order to solve the residual problem to a constant approximation, Adil et al. [Adi+19] developed an accelerated multiplicative weights method for (ℓ_2 + ℓ_p^p)-flow objectives, or more generally, for mixed (ℓ_2 + ℓ_p^p)-regression in an underconstrained setting.
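In its simplest form, iterative refinement of this kind can be caricatured in a few lines of code. The instance, the plain weighted least-squares inner step, and the backtracking line search below are illustrative stand-ins, not the accelerated solver or the exact constants of [Adi+19; APS19].

```python
import numpy as np

# Caricature of iterative refinement for min_x f(x) = ||N x - y||_p^p, p >= 2.
# Each step solves a weighted least-squares problem built from the current
# residuals (the quadratic part of a residual problem); a line search stands in
# for the formal guarantee that an approximate residual solution still makes
# multiplicative progress.  Purely illustrative, not the paper's algorithm.
rng = np.random.default_rng(1)
p = 4
N = rng.standard_normal((40, 5))
y = rng.standard_normal(40)

def f(x):
    return np.sum(np.abs(N @ x - y) ** p)

def grad(x):
    r = N @ x - y
    return N.T @ (p * np.abs(r) ** (p - 2) * r)

x = np.zeros(5)
for _ in range(200):
    r = N @ x - y
    w = np.abs(r) ** (p - 2) + 1e-12                 # weights from the quadratic term
    # Weighted least-squares inner step ("solve the residual problem crudely"):
    delta = np.linalg.solve(N.T @ (w[:, None] * N), -grad(x) / (p * (p - 1)))
    t = 1.0                                          # backtracking line search
    while f(x + t * delta) > f(x) - 1e-12:
        t *= 0.5
        if t < 1e-12:
            break
    x = x + t * delta

assert np.linalg.norm(grad(x)) < 1e-4 * np.linalg.norm(grad(np.zeros(5)))
```

The point of the sketch is structural: a crude quadratic-plus-line-search inner step already drives the gradient to (near) zero, mirroring how a constant-approximate residual solution yields geometric convergence in the formal framework.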
Sparsification results. Our central technical results in this paper concern sparsification of residual flow and voltage problems, in the sense outlined in the previous paragraph. Concretely, in nearly-linear time, we can take a residual problem on a dense graph and produce a residual problem on a sparse graph with Õ(n) edges, with the property that constant factor solutions to the sparse residual problem still make (1 − Ω(m^{−2/(p−1)}/p)) multiplicative progress on the original problem. This leads to an iterative refinement that converges in O(p m^{2/(p−1)} log(m/ε)) steps. However, the accelerated multiplicative weights algorithm that we use for each residual problem now only requires Õ(n^{4/3}) time to compute a crude solution.

Flow residual problem sparsification.
In the flow setting, we show the following:
Theorem 2.3 (Informal Flow Sparsification Result). Consider a graph G with n vertices and m edges, equipped with non-negative ℓ_2 and ℓ_p-weights, as well as a gradient. In Õ(m) time, we can compute a graph H with n vertices and Õ(n) edges, equipped with non-negative ℓ_2 and ℓ_p-weights, as well as a gradient, such that a constant factor approximation to the flow residual problem on H, when scaled by m^{−2/(p−1)}, results in an Õ(m^{2/(p−1)})-approximate solution to the flow residual problem on G. The algorithm works for all p ≥ 2 and succeeds with high probability.

Our sparsification techniques build on [Kyn+19], require a new bucketing scheme to deal with non-uniform ℓ_p-weights, as well as a preprocessing step to handle cycles with zero ℓ_2-weight and ℓ_p-weight. This preprocessing scheme in turn necessitates a more careful analysis of the additive errors introduced by gradient rounding, and we provide a more powerful framework for this than [Kyn+19].

Voltage residual problem sparsification.
In the voltage setting, we show the following.
Theorem 2.4 (Voltage Sparsification Result (Informal)). Consider a graph G with n vertices and m edges, equipped with non-negative ℓ_2 and ℓ_p-conductances. In Õ(m) time, we can compute a graph H with n vertices and Õ(n) edges, equipped with non-negative ℓ_2 and ℓ_p-conductances, such that a constant factor approximation to the voltage residual problem on H, when scaled by m^{−1/(p−1)}, results in an Õ(m^{1/(p−1)})-approximate solution to the voltage residual problem on G. The algorithm works for all p ≥ 2 and succeeds with high probability.

Note that our voltage sparsification is slightly stronger than our flow sparsification, as the former loses only a factor Õ(m^{1/(p−1)}) in the approximation while the latter loses a factor Õ(m^{2/(p−1)}). Our voltage sparsification uses a few key observations: In voltage space, surprisingly, we can treat the ℓ_2 and ℓ_p costs separately. This behavior is very different from the flow case, and arises because in voltage space, every edge provides an "obstacle", i.e. adding an edge increases cost, whereas in flow space, every edge provides an "opportunity", i.e. adding an edge decreases cost. This means that in voltage space, we can separately account for the energy costs created by our ℓ_2 and ℓ_p terms, whereas in flow space, the ℓ_2 and ℓ_p weights must be highly correlated in a sparsifier. Armed with this decoupling observation, we preserve ℓ_2 cost using standard tools for spectral graph sparsification, and we preserve ℓ_p cost approximately by a reduction to graph distance preservation, which we in turn achieve using weighted undirected graph spanners.

Voltage space accelerated multiplicative weights solver.
The algorithm from [Adi+19] for constant-approximate solutions to the residual problem works in the flow setting. Using iterative refinement, the algorithm can be used to compute high-accuracy solutions. Because we can use high-accuracy flow solutions to extract high-accuracy solutions to the dual voltage problem, [Adi+19] were also able to produce solutions to ℓ_q-norm minimizing voltage problems (where ℓ_q for q = p/(p −
1) is the dual norm to ℓ_p). Hence, by solving ℓ_p-flow problems for all p ∈ (2, ∞), [Adi+19] were able to solve ℓ_q-norm minimizing voltage problems for all q ∈ (1, 2), but this duality does not yield voltage solvers for q ≥ 2. Thus, in order to solve for q-norm minimizing voltages for q > 2, we require a solver that works directly in voltage space for mixed (ℓ_2 + ℓ_p^p)-voltage objectives.

We develop an accelerated multiplicative weights algorithm along the lines of [Chr+11; Chi+13; Adi+19] that works directly in voltage space for mixed (ℓ_2 + ℓ_p^p)-objectives, or more generally for overconstrained mixed (ℓ_2 + ℓ_p^p)-objective regression. Concretely, this directly gives an algorithm for computing crude solutions to the residual problems that arise from applying [Adi+19] iterative refinement to Problem (2). Our solver produces an improved O(1)-approximation to the residual problem rather than the p^{O(p)}-approximation from [Adi+19]. This gives an Õ(m^{4/3}) high-accuracy algorithm for mixed (ℓ_2 + ℓ_p^p)-objective voltage problems for p >
2, unlike [Adi+19], which could only solve pure ℓ_q-norm voltage problems for q < 2. Finally, we obtain our p(m^{1+o(1)} + n^{4/3+o(1)}) time algorithm for p = ω(1) by developing a sparsification procedure that applies directly to mixed (ℓ_2 + ℓ_p^p)-voltage problems for p > 2.

Mixed ℓ_2-ℓ_p-norm regression. Our framework can also be applied outside of a graph setting, where our new accelerated multiplicative weights algorithm for overconstrained mixed (ℓ_2 + ℓ_p^p)-regression gives new state-of-the-art results in some regimes when combined with new sparsification results. In this setting we develop sparsification techniques based on the Lewis weight sampling from the work of Cohen and Peng [CP15]. We focus on the case 2 < p <
4, where [CP15] provided fast algorithms for Lewis weight sampling.
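The [CP15] Lewis-weight computation itself is short. The sketch below uses the fixed-point form of the iteration with dense linear algebra and an untuned iteration count, so it is illustrative only; at the fixed point the weights are the leverage scores of W^{1/2−1/p} A and hence sum to rank(A).

```python
import numpy as np

# l_p Lewis weights via the Cohen--Peng [CP15] fixed-point iteration
#     w_i <- ( a_i^T (A^T W^{1-2/p} A)^{-1} a_i )^{p/2},
# which converges for p < 4.  Sampling rows with probability proportional to
# w_i (suitably oversampled) yields row sparsifiers of the kind used in
# Theorem 2.5.  Dense linear algebra; illustration only.
rng = np.random.default_rng(2)
p = 3.0
A = rng.standard_normal((200, 10))

w = np.ones(200)
for _ in range(100):
    scale = w ** (1 - 2 / p)                       # diagonal of W^{1-2/p}
    Ginv = np.linalg.inv(A.T @ (scale[:, None] * A))
    tau = np.einsum('ij,jk,ik->i', A, Ginv, A)     # a_i^T Ginv a_i
    w = tau ** (p / 2)

assert abs(w.sum() - 10) < 1e-6                    # Lewis weights sum to rank(A)
```

The final assertion is the standard certificate: at the fixed point, w_i equals the leverage score of row i of W^{1/2−1/p} A, and leverage scores sum to the rank.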
Theorem 2.5 (General Matrices Sparsification Result). Let p ∈ [2, 4), let M ∈ R^{m_1×n}, N ∈ R^{m_2×n} be matrices with m_1, m_2 ≥ n, and let LSS(B) denote the time to solve a linear system in B^⊤B. Then, we may compute M̃, Ñ ∈ R^{O(n^{p/2} log(n))×n} such that with probability at least 1 − n^{−Ω(1)}, for all Δ ∈ R^n,

‖M̃ Δ‖_2^2 + ‖Ñ Δ‖_p^p ≈_{O(1)} ‖M Δ‖_2^2 + ‖N Δ‖_p^p,

in time Õ(nnz(M) + nnz(N) + LSS(M̂) + LSS(N̂)), for some M̂ and N̂ each containing O(n log(n)) rescaled rows of M and N, respectively.

Theorem 2.6 (General Matrices Algorithmic Result). For p ∈ [2, 4), with high probability we can find an ε-approximate solution to (3) in time

Õ( ( nnz(M) + nnz(N) + (LSS(M̃) + LSS(Ñ)) n^{p(p−2)/(2(3p−2))} ) log(1/ε) ),

for some M̃ and Ñ each containing O(n^{p/2} log(n)) rescaled rows of M and N, respectively, where LSS(A) is the time required to solve a linear equation in A^⊤A to quasipolynomial accuracy.

Note that for all p ∈ (2, 4), the exponent p(p−2)/(2(3p−2)) is at most 0.4.

Remark 2.7.
By [Coh+15], a linear equation in A^⊤A, where A ∈ R^{m×n}, can be solved to quasipolynomial accuracy in time Õ(nnz(A) + n^ω). Using the above result for solving the required linear systems, we get a running time of Õ(nnz(M) + nnz(N) + (n^{p/2} + n^ω) n^{p(p−2)/(2(3p−2))}), matching an earlier input sparsity result by Bubeck et al. [Bub+18] that achieves Õ((nnz(M) + nnz(N))(1 + n m^{−2/p}) + m^{1−2/p} n^2 + n^ω), where M ∈ R^{m_1×n}, N ∈ R^{m_2×n} and m = max{m_1, m_2}.

3. Main Algorithm
In this section, we prove Theorems 2.1, 2.2, and 2.6. We first design an algorithm to solve the following general problem:
Definition 3.1.
For matrices M ∈ R^{m_1×n}, N ∈ R^{m_2×n} and A ∈ R^{d×n}, with m_1, m_2 ≥ n and d ≤ n, and vectors b ⊥ {ker(M) ∩ ker(N) ∩ ker(A)} and c ∈ im(A), we want to solve

(3) min_x b^⊤ x + ‖M x‖_2^2 + ‖N x‖_p^p  s.t.  A x = c.

In order to solve the above problem, we use the iterative refinement framework from [APS19] to obtain a residual problem, which is defined as follows. Definition 3.2.
For any p ≥ 2, we define the residual problem res(Δ) for (3) at a feasible x as

max_{A Δ = 0}  res(Δ) := g^⊤ Δ − Δ^⊤ R Δ − ‖N Δ‖_p^p,

where g = (1/p) b + (2/p) M^⊤ M x + N^⊤ Diag(|N x|^{p−2}) N x and R = (2/p) M^⊤ M + 2 N^⊤ Diag(|N x|^{p−2}) N.

This residual problem can further be reduced by moving the term linear in Δ to the constraints via a binary search. This leaves us with a problem of the form

min_Δ  Δ^⊤ R Δ + ‖N Δ‖_p^p  s.t.  g^⊤ Δ = a,  A Δ = 0,

for some constant a. In order to solve the above problem with an ℓ_2 + ℓ_p^p objective, we reduce the instance size via a sparsification routine, and then solve the smaller problem by a multiplicative weights algorithm. We adapt the multiplicative-weights algorithm from [Adi+19] to work in the voltage space while improving the p dependence of the runtime from p^{O(p)} to p, and the approximation quality from p^{O(p)} to O(1). The precise sparsification routines are described in later sections.

For large p, i.e., p > log m, in order to get a linear dependence of the running time on p, we need to reduce the residual problem in ℓ_p-norm to a residual problem in log m-norm by using the framework from [AS20].

The entire meta-algorithm is described formally in Algorithm 1, and its guarantees are described by the next theorem. Most proof details are deferred to Appendix D. Theorem 3.3.
For an instance of Problem (3), suppose we are given a starting solution x^{(0)} that satisfies A x^{(0)} = c and is a κ_0-approximate solution to the optimum. Consider an iteration of the while loop of Algorithm 1 for the ℓ_p-norm residual problem at x^{(t)}. We can define µ_1 and κ_1 such that if Δ̄ is a β-approximate solution to a corresponding p′-norm residual problem, then µ_1 Δ̄ is a κ_1 β-approximate solution to the p-residual problem. Further, suppose we have the following procedures:

(1) Sparsify: Runs in time K; takes as input any matrices R, N and vector g, and returns R̃, Ñ, g̃, with the matrices having sizes at most ñ × n, such that if Δ̃ is a β-approximate solution to

max_{A Δ = 0}  g̃^⊤ Δ − ‖R̃^{1/2} Δ‖_2^2 − ‖Ñ Δ‖_{p′}^{p′},

for any p′ ≥ 2, then µ_2 Δ̃, for a computable µ_2, is a κ_2 β-approximate solution for

max_{A Δ = 0}  res(Δ) := g^⊤ Δ − ‖R^{1/2} Δ‖_2^2 − ‖N Δ‖_{p′}^{p′}.

(2) Solver: Approximately solves (4), returning Δ̄ such that ‖R̃^{1/2} Δ̄‖_2^2 ≤ κ_3 ν and ‖Ñ Δ̄‖_p^p ≤ κ_3 ν, in time K̃(ñ) for instances of size at most ñ.

Then Algorithm 1 finds an ε-approximate solution for Problem (3) in time

Õ( p κ_3^{1/(p−1)} κ_1 κ_2 κ_3 (K + K̃(ñ)) log(κ_0 p / ε) ).

Algorithm 1 Meta-Algorithm for ℓ_p Flows and Voltages

procedure Sparsified-p-Problems(A, M, N, c, b, p)
    x ← x^{(0)}, such that f(x^{(0)}) ≤ κ_0 · Opt
    T ← Õ(p κ_1 κ_2 κ_3 log(κ_0/ε))
    for t = 0 to T do
        At x^{(t)} define g, R, N and res(Δ), the residual problem (Definition 3.2)
        a ← 1, b ← 1, µ_1 ← 1, κ_1 ← 1
        ν ← f(x^{(0)})
        while ν ≥ ε f(x^{(0)}) / (κ_0 p) do
            if p > log m then                  ⊲ Convert ℓ_p-norm residual to log m-norm residual
                p′ ← log m
                N′ ← 2^{1/p′} (ν/m)^{1/p′ − 1/p} N
                a ← 1, b ← O(1) m^{o(1)}
                µ_1 ← m^{−o(1)}, κ_1 ← m^{o(1)}    ⊲ Lose κ_1 in approximation when scaled by µ_1
                (g̃, R̃, Ñ) ← Sparsify(g, R, N′)    ⊲ Lose κ_2 in approximation when scaled by µ_2
            else
                (g̃, R̃, Ñ) ← Sparsify(g, R, N)     ⊲ Lose κ_2 in approximation when scaled by µ_2
                p′ ← p
            Use
Solver to compute a (κ_3, κ_3)-approximate solution to

(4)  Δ̃(ν) ← argmin_Δ ‖R̃^{1/2} Δ‖_2^2 + ‖Ñ Δ‖_{p′}^{p′}  s.t.  g̃^⊤ Δ = a ν,  A Δ = 0.

            Δ̄(ν) ← (a / (b κ_3^{1/(p′−1)})) µ_1 µ_2 Δ̃(ν)
            ν ← ν/2
        Δ ← argmin_{Δ̄(ν)} f(x − Δ̄(ν)/p)
        x ← x − Δ/p
    return x

Algorithms for ℓ_p-norm Problems. The problems discussed in Section 2 are special cases of Problem (3), which means we can use Algorithm 1. To prove our results, we will utilize Theorem 3.3 with the respective sparsification procedures and the following multiplicative-weights based algorithm for solving problems of the form

(5)  min_Δ  Δ^⊤ M^⊤ M Δ + ‖N Δ‖_p^p  s.t.  A Δ = c.

We describe our solver formally and prove the following theorem about its guarantees in Appendix C.
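As a quick sanity check on the residual data of Definition 3.2: with the normalization used there, p·g is exactly the gradient of f(x) = b^⊤x + ‖Mx‖_2^2 + ‖Nx‖_p^p, which we can confirm against finite differences. The dense random instance and its sizes below are arbitrary, for illustration only.

```python
import numpy as np

# Finite-difference check that p * g from Definition 3.2 is the gradient of
#     f(x) = b^T x + ||M x||_2^2 + ||N x||_p^p.
rng = np.random.default_rng(3)
p, n = 4, 6
M = rng.standard_normal((8, n))
N = rng.standard_normal((9, n))
b = rng.standard_normal(n)
x = rng.standard_normal(n)

def f(z):
    return b @ z + np.sum((M @ z) ** 2) + np.sum(np.abs(N @ z) ** p)

Nx = N @ x
g = b / p + (2.0 / p) * (M.T @ (M @ x)) + N.T @ (np.abs(Nx) ** (p - 2) * Nx)

eps = 1e-5
fd = np.array([(f(x + eps * e) - f(x - eps * e)) / (2 * eps) for e in np.eye(n)])
assert np.allclose(p * g, fd, rtol=1e-4, atol=1e-4)
```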
Theorem 3.4.
Let p ≥ 2. Consider an instance of Problem (5) described by matrices A ∈ R^{d×n}, N ∈ R^{m_1×n}, M ∈ R^{m_2×n}, with d ≤ n ≤ m_1, m_2, and vector c ∈ R^d. If the optimum of this problem is at most ν, Procedure Residual-Solver (Algorithm 3) returns an x such that A x = c, x^⊤ M^⊤ M x ≤ O(1) ν and ‖N x‖_p^p ≤ O(3^p) ν. The algorithm makes O(p m^{(p−2)/(3p−2)}) calls to a linear system solver.

We utilize Procedure
Residual-Solver as the Procedure
Solver in Algorithm
Sparsified-p-Problems. The algorithm uses the procedure only for solving problem instances with p ≤ log m. Thus, its running time is K̃(ñ) = Õ(ñ^{(p−2)/(3p−2)} · LSS(ñ)) ≤ Õ(ñ^{1/3} · LSS(ñ)), where LSS(ñ) denotes the time required to solve a linear system in matrices of size ñ. We also have κ_3 = O(1) and κ_3^{1/(p−1)} = O(1). We next estimate the values of κ_1 and µ_1. If p ≤ log m, we have µ_1 = 1 and κ_1 = 1. Otherwise, µ_1 = Õ(1) and κ_1 = O(m^{o(1)}) (refer to Lemma D.4 in Appendix D).

In order to obtain an initial solution, we usually solve an ℓ_2-norm problem. This gives an m^{p/2}-approximate initial solution, which results in an extra factor of p in the running time. To avoid this, we can do a homotopy on p similar to [AS20], i.e., start with an ℓ_2 solution and solve the ℓ_4 problem to a constant approximation, followed by ℓ_8, ..., ℓ_p. We note that a constant-approximate solution to the ℓ_{p/2}-norm problem gives an O(m) approximation to the ℓ_p problem, and thus we can solve log p problems where we can assume κ_0 = O(m).

We now complete the proof of our various algorithmic results by utilizing sparsification procedures specific to each problem. ℓ_p Flows.
We will prove Theorem 2.1 (Flow Algorithmic Result), with explicit p dependencies. Proof.
From Theorem 2.3, we obtain a sparse graph in K = Õ(m) time with ñ = Õ(n) edges. A constant factor approximation to the flow residual problem on this sparse graph, when scaled by µ_2 = m^{−2/(p−1)}, gives a κ_2 = Õ(m^{2/(p−1)})-approximate solution to the flow residual problem on the original graph. We can solve linear systems on the sparse graph in Õ(ñ) = Õ(n) time using fast Laplacian solvers. Using all these values in Theorem 3.3, we get the final runtime to be p m^{2/(p−1)+o(1)} (m + n^{1+(p−2)/(3p−2)}) log(pm/ε), as claimed. We prove Theorem 2.3 in Section A. □

ℓ_p Voltages.
We will prove Theorem 2.2 (Voltage Algorithmic Result), with explicit p dependencies. Proof.
From Theorem 2.4, we obtain a sparse graph in K = Õ(m) time with ñ = Õ(n) edges. A constant factor approximation to the voltage residual problem on this sparse graph, when scaled by µ_2 = m^{−1/(p−1)}, gives a κ_2 = Õ(m^{1/(p−1)})-approximate solution to the voltage residual problem on the original graph. We can solve linear systems on the sparse graph in Õ(ñ) = Õ(n) time using fast Laplacian solvers. Using these values in Theorem 3.3, we get the final runtime to be p m^{1/(p−1)+o(1)} (m + n^{1+(p−2)/(3p−2)}) log(pm/ε), as claimed. We prove Theorem 2.4 in Section 4. □

General Matrices.
We will now prove Theorem 2.6.
Proof.
We assume Theorem 2.5, which we prove in Appendix B. From the theorem, we have κ_2 = O(1) and µ_2 = O(1). Note that K = LSS(M̂) + LSS(N̂) for some M̂, N̂ ∈ R^{O(n log(n))×n}, which is the time required to solve linear systems in M̂^⊤M̂ and N̂^⊤N̂, respectively. Since, by Theorem 2.5, the size of M̃ and Ñ is ñ = O(n^{p/2} log(n)), the cost from the solver in Theorem 3.4 is Õ( p (LSS(M̃) + LSS(Ñ)) n^{p(p−2)/(2(3p−2))} ). □

4. Construction of Sparsifiers for ℓ_2 + ℓ_p^p Voltages
In this section, we prove a formal version of the voltage sparsification result (Theorem 2.4):
Theorem 4.1.
Consider a graph G = (V, E) with non-negative ℓ_2-weights w ∈ R^E and non-negative ℓ_p-weights s ∈ R^E, with m edges and n vertices. We can produce a graph H = (V, F) with edges F ⊆ E, ℓ_2-weights u ∈ R^F, and ℓ_p-weights t ∈ R^F, such that with probability at least 1 − δ the graph H has O(n log(n/δ)) edges and

(6)  (1/1.1) ‖W^{1/2} B_G x‖_2^2 ≤ ‖U^{1/2} B_H x‖_2^2 ≤ 1.1 ‖W^{1/2} B_G x‖_2^2,

and for any p ∈ [1, ∞],

(7)  (1/(m^{1/p} log(n))) ‖S B_G x‖_p ≤ ‖T B_H x‖_p ≤ ‖S B_G x‖_p,

where W = Diag(w), U = Diag(u), S = Diag(s), T = Diag(t). We denote the routine computing H and u, t by SpannerSparsify, so that (H, u, t) = SpannerSparsify(G, w, s). This algorithm runs in Õ(m log(1/δ)) time.

We will first define some terms required for our result. Given an undirected graph G = (V, E) with edge lengths l ∈ R^E, and u, v ∈ V, we let d_{G,l}(u, v) denote the shortest path distance in G w.r.t. l, so that if P is the shortest path w.r.t. l then

d_{G,l}(u, v) = Σ_{e ∈ P} l(e).

Definition 4.2.
Given an undirected graph G = (V, E) with edge lengths l ∈ R^E, a K-spanner is a subgraph H of G with the same edge lengths s.t. d_H(u, v) ≤ K d_G(u, v) for all u, v ∈ V.

Baswana and Sen showed the following result on spanners [BS07]. Theorem 4.3.
Given an undirected graph G = (V, E, l) with m edges and n vertices, and an integer k > 1, we can compute a (2k − 1)-spanner H of G with O(k n^{1+1/k}) edges in expected time O(km).

Lemma 4.4.
Given an undirected graph G = (V, E) with positive edge lengths l ∈ R^E, and a K-spanner H = (V, F) of G, for all x ∈ R^V we have

max_{(u,v)∈F} |x(u) − x(v)| / l(u, v) ≤ max_{(u,v)∈E} |x(u) − x(v)| / l(u, v) ≤ K max_{(u,v)∈F} |x(u) − x(v)| / l(u, v).

Proof.
The inequality max_{(u,v)∈F} |x(u) − x(v)| / l(u, v) ≤ max_{(u,v)∈E} |x(u) − x(v)| / l(u, v) is immediate from F ⊆ E.

To prove the second inequality, we note that if (u, v) ∈ E has shortest path P in H, then Σ_{(z,y)∈P} l(z, y) = d_H(u, v) ≤ K l(u, v), and hence

|x(u) − x(v)| / l(u, v) ≤ K |Σ_{(z,y)∈P} (x(z) − x(y))| / Σ_{(z,y)∈P} l(z, y) ≤ K max_{(z,y)∈P} |x(z) − x(y)| / l(z, y). □

Definition 4.5.
Given an undirected graph G = (V, E) with m edges and n vertices with positive edge ℓ_2-weights w ∈ R^E, a spectral ε-approximation of G is a graph H = (V, F) with F ⊆ E with positive edge ℓ_2-weights u ∈ R^F s.t.

(1/(1 + ε)) ‖W^{1/2} B_G x‖_2^2 ≤ ‖U^{1/2} B_H x‖_2^2 ≤ (1 + ε) ‖W^{1/2} B_G x‖_2^2,

where W = Diag(w) and U = Diag(u).

The following result on spectral sparsifiers was shown by Spielman and Srivastava [SS11] (see also [Spi15]). Theorem 4.6.
Given a graph G = (V, E) with positive ℓ_2-weights w ∈ R^E, with m edges and n vertices, for any ε ∈ (0, 1/2], we can produce a graph H = (V, F) with edges F ⊆ E and ℓ_2-weights u ∈ R^F such that H has O(n ε^{−2} log(n/δ)) edges and, with probability at least 1 − δ, (H, u) is a spectral ε-approximation of (G, w). We denote the routine computing H and u by SpectralSparsify, so that (H, u) = SpectralSparsify(G, w, ε, δ). This algorithm runs in Õ(m) time. Furthermore, if the weights w are quasipolynomially bounded, then so are the weights u.
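SpectralSparsify can be sketched by leverage-score (effective-resistance) sampling in the style of [SS11]. The dense pseudoinverse and the oversampling constant below are illustrative, not the nearly-linear-time implementation.

```python
import numpy as np

# Sketch of spectral sparsification a la [SS11]: sample each edge with
# probability proportional to (weight x effective resistance), i.e. its
# leverage score, and reweight so the sampled Laplacian is unbiased.
rng = np.random.default_rng(5)
n = 40
edges = [(u, v) for u in range(n) for v in range(u + 1, n)]   # complete graph
m = len(edges)
w = rng.uniform(1, 2, size=m)                                  # l2-weights

B = np.zeros((m, n))
for i, (u, v) in enumerate(edges):
    B[i, u], B[i, v] = 1.0, -1.0
L = B.T @ (w[:, None] * B)                                     # weighted Laplacian
Reff = np.einsum('ij,jk,ik->i', B, np.linalg.pinv(L), B)       # effective resistances

eps = 0.5
q = int(20 * n * np.log(n) / eps**2)                           # generous oversampling
prob = w * Reff / (w * Reff).sum()                             # leverage-score distribution
counts = rng.multinomial(q, prob)
u_new = counts * w / (q * prob)                                # reweighted sampled edges
H = B.T @ (u_new[:, None] * B)                                 # sparsified Laplacian

# Spectral closeness in the sense of Definition 4.5 on random test vectors:
for _ in range(20):
    x = rng.standard_normal(n)
    a, b = x @ L @ x, x @ H @ x
    assert b <= (1 + eps) * a and a <= (1 + eps) * b
```

Note that Σ_e w(e)·Reff(e) = n − 1 on a connected graph, which is why O(n log n / ε²) samples suffice; the check above tests random quadratic forms rather than the full spectral guarantee.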
Proof of Theorem 4.1.
We consider a graph G = (V, E) with m edges and n vertices, with non-negative ℓ_2-weights r ∈ R^E and non-negative ℓ_p-weights s ∈ R^E. We define Ê ⊆ E to be the edges s.t. s(e) >
0, and then let l ∈ R^Ê be given by l(e) = 1/s(e), and Ĝ = (V, Ê). We then apply Theorem 4.3 to Ĝ with l as edge lengths, and with k = (log(n) + 1)/2. We turn the algorithm of Theorem 4.3 into one with running time Õ(m log(1/δ)), instead of expected time Õ(m), by applying the standard Las Vegas to Monte Carlo reduction. With probability 1 − δ/
2, this gives us a log(n)-spanner H_2 of Ĝ, and we define t by restricting s to the edges of H_2. By Lemma 4.4, we then have

‖T B_{H_2} x‖_∞ ≤ ‖S B_G x‖_∞ ≤ log(n) ‖T B_{H_2} x‖_∞.
Because T B_{H_2} x is a restriction of S B_G x to a subset of the coordinates, we always have, for any p ≥ 1, ‖T B_{H_2} x‖_p ≤ ‖S B_G x‖_p. At the same time, we also have

‖S B_G x‖_p ≤ m^{1/p} ‖S B_G x‖_∞ ≤ m^{1/p} log(n) ‖T B_{H_2} x‖_∞ ≤ m^{1/p} log(n) ‖T B_{H_2} x‖_p.

We define Ẽ ⊆ E to be the edges s.t. r(e) >
0, and the let ˜ G = ( V, ˜ E ). Now, appealing toTheorem 4.6, we let ( H , u ) = SpectralSparsify ( ˜ G, r , / , ε/ H by taking the union of the edge sets of H and H and extending u and t to the new edge set by adding zero entries as needed. By a union bound, the approximationguarantees of Equations (6) and (7) simultaneously hold with probability at least 1 − δ .The edge set remains bounded in size by O ( n log n ). (cid:3) To see Theorem 2.4, note that from Theorem 4.1, we get, m − p − (cid:18) m − p − k W B G x k + m − k S B G x k pp (cid:19) ≤ m − p − (cid:16) k U B H x k + k T B H x k pp (cid:17) The other direction is easy to see.5.
Extensions of Our Results and Open Problems
Solving dual problems: q-norm minimizing flows and voltages for q < 2. When the mixed (ℓ_2 + ℓ_p^p)-objective flow problem (Problem (1)) is restricted to the case g = 0 and R = 0, it becomes a pure ℓ_p-norm minimizing flow problem, and its dual problem can be slightly rearranged to give

(8) min_v d^⊤v + ‖S^{-1} B v‖_q^q,

where q = p/(p − 1) = 1 + 1/(p − 1), and we refer to the entries of S^{-1} as ℓ_q-conductances. Because we can solve Problem (1) to high accuracy in near-linear time for p = ω(1), this allows us to solve Problem (8), the dual voltage ℓ_q-norm minimization, in time p(m^{1+o(1)} + n^{4/3+o(1)}) log(1/ε) (see [Adi+19, Section 7] for the reduction). We summarize this in the theorem below.

Theorem 5.1 (Voltage Algorithmic Result, q < 2). Consider a graph G with n vertices and m edges, equipped with positive ℓ_q-conductances, as well as a demand vector. For 1 < q < 2, when q = 1 + o(1), in poly(1/(q − 1)) · (m^{1+o(1)} + n^{4/3+o(1)}) log(1/ε) time, we can compute an ε-approximately optimal voltage solution to Problem (8) with high probability.

Similarly, we can solve ℓ_q-norm minimizing flows for q < 2 via the pure ℓ_p-voltage problem, a special case of the mixed (ℓ_2 + ℓ_p^p)-voltage problem. Picking W = 0 in Problem (2), we obtain a pure ℓ_p-norm minimizing voltage problem, and its dual problem can be slightly rearranged to give

(9) min_{B^⊤ f = d} ‖U^{-1} f‖_q^q,

where q = p/(p − 1) = 1 + 1/(p − 1), and we refer to the entries of U^{-1} as q-weights. Again, because we can solve Problem (2) to high accuracy in near-linear time for p = ω(1), this allows us to solve Problem (9), the dual flow ℓ_q-norm minimization, in time p(m^{1+o(1)} + n^{4/3+o(1)}) log(1/ε).

Theorem 5.2 (Flow Algorithmic Result, q < 2). Consider a graph G with n vertices and m edges, equipped with positive q-weights, as well as a demand vector. For 1 < q < 2, when q = 1 + o(1), in poly(1/(q − 1)) · (m^{1+o(1)} + n^{4/3+o(1)}) log(1/ε) time, we can compute an ε-approximately optimal flow solution to Problem (9) with high probability.

Open Questions. Mixed ℓ_2, ℓ_q problems for small q < 2. In this work, we provided new state-of-the-art algorithms for weighted mixed ℓ_2, ℓ_p-norm minimizing flow and voltage problems for p ≫ 2, as well as for ℓ_q-norm minimizing flow and voltage problems for q near 1. A reasonable definition of mixed ℓ_2, ℓ_q-norm problems for q < 2 is the mixed ℓ_2, ℓ_q-gamma function problem, in analogy with the mixed ℓ_2, ℓ_p-gamma function problem for p > 2; obtaining fast solvers for this regime remains open.

Directly sparsifying mixed ℓ_2, ℓ_q problems for q < 2. A second approach to developing a fast mixed ℓ_2, ℓ_q-gamma function solver for q < 2 would be to sparsify such objectives directly.

Removing the m^{O(1)/(p−1)} loss in sparsification. Our current approaches to graph mixed ℓ_2, ℓ_p-sparsification lose a factor m^{O(1)/(p−1)} in their quality of approximation, which leads to an m^{O(1)/(p−1)} factor slowdown in running time, and makes our algorithms less useful for small p. We believe a more sophisticated graph sparsification routine could remove this loss and result in significantly faster algorithms for p close to 2.

Using mixed ℓ_2, ℓ_p-objectives as oracles for ℓ_∞ regression. The current state-of-the-art algorithm for computing maximum flow in unit capacity graphs runs in Õ(m^{4/3}) time [LS20a], and uses the almost-linear-time algorithm from [Kyn+19] for solving unweighted ℓ_2 + ℓ_p^p instances as a key ingredient.

References

[Adi+19] D. Adil, R. Kyng, R. Peng, and S. Sachdeva. "Iterative refinement for ℓ_p-norm regression".
In: Proceedings of the Thirtieth Annual ACM-SIAM Symposium on Discrete Algorithms. SIAM. 2019, pp. 1405–1424 (cit. on pp. 1, 3, 4, 5, 6, 10, 11, 30, 32).
[All+17] Z. Allen-Zhu, Y. Li, R. Oliveira, and A. Wigderson. "Much faster algorithms for matrix scaling". In: 58th Annual Symposium on Foundations of Computer Science (FOCS). IEEE. 2017, pp. 890–901 (cit. on p. 1).
[Alt+93] I. Althöfer, G. Das, D. Dobkin, D. Joseph, and J. Soares. "On sparse spanners of weighted graphs". In: Discrete & Computational Geometry
[AMO93] R. K. Ahuja, T. L. Magnanti, and J. B. Orlin. Network flows - theory, algorithms and applications. Prentice Hall, 1993 (cit. on p. 1).
[APS19] D. Adil, R. Peng, and S. Sachdeva. "Fast, provably convergent IRLS algorithm for p-norm linear regression". In: Advances in Neural Information Processing Systems. 2019, pp. 14189–14200 (cit. on pp. 3, 6, 34, 39, 40, 42).
[AS20] D. Adil and S. Sachdeva. "Faster p-norm minimizing flows, via smoothed q-norm problems". In: Proceedings of the Fourteenth Annual ACM-SIAM Symposium on Discrete Algorithms. SIAM. 2020, pp. 892–910 (cit. on pp. 3, 6, 8, 41).
[BK96] A. A. Benczúr and D. R. Karger. "Approximating s-t minimum cuts in Õ(n²) time". In: Proceedings of the Twenty-Eighth Annual ACM Symposium on Theory of Computing. STOC '96. Philadelphia, Pennsylvania, USA: ACM, 1996, pp. 47–55. ISBN: 0-89791-785-5 (cit. on p. 2).
[BLM89] J. Bourgain, J. Lindenstrauss, and V. Milman. "Approximation of zonoids by zonotopes". In:
Acta mathematica
Random Struct. Algorithms. ISSN: 1042-9832 (cit. on pp. 2, 9).
[Bub+18] S. Bubeck, M. B. Cohen, Y. T. Lee, and Y. Li. "An Homotopy Method for Lp Regression Provably Beyond Self-concordance and in Input-sparsity Time". In: Proceedings of the 50th Annual ACM SIGACT Symposium on Theory of Computing. STOC 2018. Los Angeles, CA, USA: ACM, 2018, pp. 1130–1137. ISBN: 978-1-4503-5559-9 (cit. on pp. 1, 5, 11).
[Chi+13] H. H. Chin, A. Madry, G. L. Miller, and R. Peng. "Runtime guarantees for regression problems". In: Proceedings of the 4th Conference on Innovations in Theoretical Computer Science. ITCS '13. Berkeley, California, USA: ACM, 2013, pp. 269–282. ISBN: 978-1-4503-1859-4 (cit. on p. 4).
[Chr+11] P. Christiano, J. A. Kelner, A. Madry, D. A. Spielman, and S.-H. Teng. "Electrical flows, laplacian systems, and faster approximation of maximum flow in undirected graphs". In: Proceedings of the 43rd Annual ACM Symposium on Theory of Computing. STOC '11. San Jose, California, USA: ACM, 2011, pp. 273–282. ISBN: 978-1-4503-0691-1 (cit. on pp. 1, 4).
[Chu+18] T. Chu, Y. Gao, R. Peng, S. Sachdeva, S. Sawlani, and J. Wang. "Graph Sparsification, Spectral Sketches, and Faster Resistance Computation, via Short Cycle Decompositions". In: 59th IEEE Annual Symposium on Foundations of Computer Science (FOCS). 2018, pp. 361–372 (cit. on p. 2).
[CLS19] M. B. Cohen, Y. T. Lee, and Z. Song. "Solving Linear Programs in the Current Matrix Multiplication Time". In: Proceedings of the 51st Annual ACM SIGACT Symposium on Theory of Computing. STOC 2019. Phoenix, AZ, USA: Association for Computing Machinery, 2019, pp. 938–942. ISBN: 9781450367059 (cit. on p. 1).
[Coh+15] M. B. Cohen, Y. T. Lee, C. Musco, C. Musco, R. Peng, and A. Sidford. "Uniform Sampling for Matrix Approximation". In: Proceedings of the 2015 Conference on Innovations in Theoretical Computer Science. ITCS '15. Rehovot, Israel: Association for Computing Machinery, 2015, pp. 181–190. ISBN: 9781450333337 (cit. on pp. 5, 27).
[Coh+17a] M. B. Cohen, A. Madry, D. Tsipras, and A. Vladu. "Matrix Scaling and Balancing via Box Constrained Newton's Method and Interior Point Methods". In: 58th Annual Symposium on Foundations of Computer Science (FOCS). 2017, pp. 902–913 (cit. on p. 1).
[Coh+17b] M. B. Cohen, A. Madry, P. Sankowski, and A. Vladu. "Negative-Weight Shortest Paths and Unit Capacity Minimum Cost Flow in Õ(m^{10/7} log W) Time (Extended Abstract)". In: Proceedings of the Twenty-Eighth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2017, Barcelona, Spain, Hotel Porta Fira, January 16-19. 2017, pp. 752–771 (cit. on p. 1).
[CP15] M. B. Cohen and R. Peng. "ℓ_p Row Sampling by Lewis Weights". In: Proceedings of the Forty-Seventh Annual ACM Symposium on Theory of Computing. STOC '15. Portland, Oregon, USA: Association for Computing Machinery, 2015, pp. 183–192. ISBN: 9781450335362 (cit. on pp. 2, 5, 27).
[DS08] S. I. Daitch and D. A. Spielman. "Faster approximate lossy generalized flow via interior point algorithms". In: Proceedings of the 40th Annual ACM Symposium on Theory of Computing. STOC '08. Available at http://arxiv.org/abs/0803.0988. Victoria, British Columbia, Canada: ACM, 2008, pp. 451–460. ISBN: 978-1-60558-047-0 (cit. on p. 1).
[Dur+17] D. Durfee, J. Peebles, R. Peng, and A. B. Rao. "Determinant-Preserving Sparsification of SDDM Matrices with Applications to Counting and Sampling Spanning Trees". In: FOCS. IEEE Computer Society, 2017, pp. 926–937 (cit. on p. 2).
[Fos+53] F. G. Foster et al. "On the stochastic matrices associated with certain queuing processes". In:
The Annals of Mathematical Statistics
[GT14] A. V. Goldberg and R. E. Tarjan. "Efficient maximum flow algorithms". In: Commun. ACM (2014) (cit. on p. 1).
[Kel+14] J. A. Kelner, Y. T. Lee, L. Orecchia, and A. Sidford. "An Almost-Linear-Time Algorithm for Approximate Max Flow in Undirected Graphs, and its Multicommodity Generalizations". In: Proceedings of the Twenty-Fifth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2014, Portland, Oregon, USA, January 5-7, 2014. 2014, pp. 217–226 (cit. on pp. 1, 2).
[KMP11] I. Koutis, G. L. Miller, and R. Peng. "A Nearly-m log n Time Solver for SDD Linear Systems". In:
Proceedings of the 2011 IEEE 52nd Annual Symposium on Foundations of Computer Science. FOCS '11. Washington, DC, USA: IEEE Computer Society, 2011, pp. 590–598. ISBN: 978-0-7695-4571-4 (cit. on p. 2).
[Kyn+15] R. Kyng, A. Rao, S. Sachdeva, and D. A. Spielman. "Algorithms for Lipschitz Learning on Graphs". In: Proceedings of The 28th Conference on Learning Theory. Ed. by P. Grünwald, E. Hazan, and S. Kale. Vol. 40. Proceedings of Machine Learning Research. Paris, France: PMLR, 2015, pp. 1190–1223 (cit. on p. 1).
[Kyn+16] R. Kyng, Y. T. Lee, R. Peng, S. Sachdeva, and D. A. Spielman. "Sparsified Cholesky and multigrid solvers for connection laplacians". In: Proceedings of the 48th Annual ACM SIGACT Symposium on Theory of Computing. ACM. 2016, pp. 842–850 (cit. on p. 2).
[Kyn+19] R. Kyng, R. Peng, S. Sachdeva, and D. Wang. "Flows in Almost Linear Time via Adaptive Preconditioning". In: Proceedings of the 51st Annual ACM SIGACT Symposium on Theory of Computing. STOC 2019. Phoenix, AZ, USA: ACM, 2019, pp. 902–913. ISBN: 978-1-4503-6705-9 (cit. on pp. 1, 2, 4, 11, 18, 19, 22).
[Lew78] D. Lewis. "Finite dimensional subspaces of L_p". In: Studia Mathematica (1978).
[LS14] Y. T. Lee and A. Sidford. "Path Finding Methods for Linear Programming: Solving Linear Programs in Õ(√rank) Iterations and Faster Algorithms for Maximum Flow". In: FOCS. 2014 (cit. on p. 1).
[LS20a] Y. P. Liu and A. Sidford. "Faster Divergence Maximization for Faster Maximum Flow". In:
FOCS (2020) (cit. on pp. 1, 11).
[LS20b] Y. P. Liu and A. Sidford. "Faster energy maximization for faster maximum flow". In:
Proceedings of the 52nd Annual ACM SIGACT Symposium on Theory of Computing, STOC 2020, Chicago, IL, USA, June 22-26, 2020. ACM, 2020, pp. 803–814 (cit. on p. 1).
[Mad10] A. Madry. "Fast approximation algorithms for cut-based problems in undirected graphs". In: Foundations of Computer Science (FOCS), 2010 51st Annual IEEE Symposium on. IEEE. 2010, pp. 245–254 (cit. on p. 2).
[Mad13] A. Madry. "Navigating Central Path with Electrical Flows: From Flows to Matchings, and Back". In: FOCS. 2013 (cit. on p. 1).
[NN94] Y. Nesterov and A. Nemirovskii. Interior-Point Polynomial Algorithms in Convex Programming. Society for Industrial and Applied Mathematics, 1994 (cit. on p. 1).
[OSV12] L. Orecchia, S. Sachdeva, and N. K. Vishnoi. "Approximating the Exponential, the Lanczos Method and an Õ(m)-time Spectral Algorithm for Balanced Separator". In: Proceedings of the Forty-fourth Annual ACM Symposium on Theory of Computing. STOC '12. New York, New York, USA: ACM, 2012, pp. 1141–1160. ISBN: 978-1-4503-1245-5 (cit. on p. 1).
[PS14] R. Peng and D. A. Spielman. "An Efficient Parallel Solver for SDD Linear Systems". In: Proceedings of the 46th Annual ACM Symposium on Theory of Computing. STOC '14. New York, New York: ACM, 2014, pp. 333–342. ISBN: 978-1-4503-2710-7 (cit. on p. 2).
[PS89] D. Peleg and A. A. Schäffer. "Graph spanners". In: Journal of Graph Theory (1989).
[Rac08] H. Räcke. "Optimal hierarchical decompositions for congestion minimization in networks". In: Proceedings of the 40th Annual ACM Symposium on Theory of Computing. STOC '08. Victoria, British Columbia, Canada: ACM, 2008, pp. 255–264. ISBN: 978-1-60558-047-0 (cit. on p. 2).
[RST14] H. Racke, C. Shah, and H. Taubig. "Computing Cut-Based Hierarchical Decompositions in Almost Linear Time". In: Proceedings of the 25th Annual ACM-SIAM Symposium on Discrete Algorithms. SODA '14. 2014, pp. 227–238 (cit. on p. 2).
[Sch02] A. Schrijver. "On the history of the transportation and maximum flow problems". In:
Math. Program. (2002) (cit. on p. 1).
[She13] J. Sherman. "Nearly Maximum Flows in Nearly Linear Time". In: FOCS. 2013, pp. 263–269 (cit. on pp. 1, 2).
[She17] J. Sherman. "Generalized Preconditioning and Undirected Minimum-cost Flow". In: Proceedings of the Twenty-Eighth Annual ACM-SIAM Symposium on Discrete Algorithms. SODA '17. Barcelona, Spain, 2017, pp. 772–780 (cit. on p. 1).
[Spi15] D. A. Spielman. Spectral Graph Theory Lectures: Sparsification by Effective Resistance Sampling. 2015 (cit. on p. 10).
[SS11] D. Spielman and N. Srivastava. "Graph Sparsification by Effective Resistances". In: SIAM Journal on Computing (2011).
[ST04] D. A. Spielman and S.-H. Teng. "Nearly-linear time algorithms for graph partitioning, graph sparsification, and solving linear systems". In: STOC. 2004 (cit. on pp. 1, 2).
[ST11] D. Spielman and S. Teng. "Spectral Sparsification of Graphs". In: SIAM Journal on Computing (2011).
[…] "…graphs". In: IEEE, 2020, pp. 919–930 (cit. on pp. 1, 2).

Appendix A. Construction of Sparsifiers for (ℓ_2 + ℓ_p^p) Flows
In this section we will prove the following formal version of Theorem 2.3.
Theorem A.1.
Consider an instance G = (V_G, E_G, r_G, s_G, g_G) with n vertices and m edges. Suppose we want to solve

max_{(B_G)^⊤ f = 0} E_G(f).

We can compute in time Õ(m) an instance H = (V_H, E_H, r_H, s_H, g_H) such that, with probability 1 − ε, H has n vertices and m_H = n · polylog(n/(εδ)) edges, and for every f_H we can compute a corresponding f_G in Õ(m) time such that

H ⪯^cycle_{κ,δ} G and G ⪯^cycle_{κ,δ} H,

where κ = m^{1/(p−1)} polylog(n/(εδ)).

A.1.
Preliminaries.
Smoothed ℓ_p-norm functions. We consider p-norms smoothed by the addition of a quadratic term. First we define such a smoothed p-th power on R.

Definition A.2 (Smoothed p-th power). Given r, s, x ∈ R with r, s ≥ 0, define the r-smoothed s-weighted p-th power of x to be h_p(r, s, x) = r x² + s |x|^p.

This definition can be naturally extended to vectors to obtain smoothed ℓ_p-norms.

Definition A.3 (Smoothed ℓ_p-norm). Given vectors x ∈ R^m and r, s ∈ R^m_{≥0}, define the r-smoothed s-weighted p-norm of x to be
h_p(r, s, x) = Σ_{i=1}^m h_p(r_i, s_i, x_i) = Σ_{i=1}^m (r_i x_i² + s_i |x_i|^p).

Flow Problems and Approximation.
We will consider problems where we seek to find flows minimizing smoothed p-norms. We first define these problem instances.

Definition A.4 (Smoothed p-norm instance). A smoothed p-norm instance is a tuple G, G := (V_G, E_G, g_G, r_G, s_G), where V_G is a set of vertices, E_G is a set of undirected edges on V_G, the edges are accompanied by a gradient, specified by g_G ∈ R^{E_G}, the edges have ℓ_2-resistances given by r_G ∈ R^{E_G}_{≥0}, and s_G ∈ R^{E_G}_{≥0} gives the p-norm scaling.

Definition A.5 (Flows, residues, and circulations). Given a smoothed p-norm instance G, a vector f ∈ R^{E_G} is said to be a flow on G. A flow vector f satisfies residues b ∈ R^{V_G} if (B_G)^⊤ f = b, where B_G ∈ R^{E_G × V_G} is the edge-vertex incidence matrix of the graph (V_G, E_G), i.e., the row of B_G corresponding to edge (u, v) is (1_u − 1_v)^⊤. A flow f with residue 0 is called a circulation on G.

Note that our underlying instance and the edges are undirected. However, for every undirected edge e = (u, v) ∈ E, we assign an arbitrary fixed direction to the edge, say u → v, and interpret f_e ≥ 0 as flow from u to v, and f_e < 0 as flow from v to u. Thus, for (u, v) ∈ E, we have f_{(u,v)} = −f_{(v,u)}.

Definition A.6 (Objective, E_G). Given a smoothed p-norm instance G and a flow f on G, the associated objective function, or the energy, of f is given by E_G(f) = (g_G)^⊤ f − h_p(r, s, f).

Definition A.7 (Smoothed p-norm flow / circulation problem). Given a smoothed p-norm instance G and a residue vector b ∈ R^{V_G}, the smoothed p-norm flow problem (G, b) asks to find a flow f ∈ R^{E_G} with residues b that maximizes E_G(f), i.e., max_{f : (B_G)^⊤ f = b} E_G(f). If b = 0, we call it a smoothed p-norm circulation problem.

Note that the optimal objective of a smoothed p-norm circulation problem is always non-negative, whereas for a smoothed p-norm flow problem, it could be negative.

Approximating Smoothed p-norm Instances.
Since we work with objective functions that are non-standard (and not even homogeneous), we need to carefully define a new notion of approximation for these instances.
Definition A.8 (H ⪯_{κ,δ} G). For two smoothed p-norm instances G, H, we write H ⪯_{κ,δ} G if there is a linear map M_{H→G} : R^{E_H} → R^{E_G} such that for every flow f_H on H, the vector f_G = M_{H→G}(f_H) is a flow on G such that
(1) f_G has the same residues as f_H, i.e., (B_G)^⊤ f_G = (B_H)^⊤ f_H, and
(2) its energy is bounded by:
(1/κ)(E_H(f_H) − δ ‖f_H‖_1) ≤ E_G(f_G / κ).

For some of our transformations on graphs, we will be able to prove approximation guarantees only for circulations. Thus, we define the following notion restricted to circulations.
Definition A.9 (H ⪯^cycle_{κ,δ} G). For two smoothed p-norm instances G, H, we write H ⪯^cycle_{κ,δ} G if there is a linear map M_{H→G} : R^{E_H} → R^{E_G} such that for any circulation f_H on H, i.e., (B_H)^⊤ f_H = 0, the flow f_G = M_{H→G}(f_H) is a circulation, i.e., (B_G)^⊤ f_G = 0, and satisfies
(1/κ)(E_H(f_H) − δ ‖f_H‖_1) ≤ E_G(f_G / κ).
Observe that
H ⪯_{κ,δ} G implies H ⪯^cycle_{κ,δ} G. We define the usual induced matrix 1-to-1 norm as
‖M‖_{1→1} = max_{f ∈ R^E} ‖M f‖_1 / ‖f‖_1.
We define a special matrix 1-to-1 norm over circulations by
‖M‖^cycle_{1→1} = max_{B^⊤ f = 0} ‖M f‖_1 / ‖f‖_1.

Definition A.10 (H ⪯_κ G and H ⪯^cycle_κ G). We abbreviate
H ⪯_{κ,0} G as H ⪯_κ G, and H ⪯^cycle_{κ,0} G as H ⪯^cycle_κ G.

Definition A.11.
In the context of a problem with n vertices and m edges, we say a real number x is quasi-polynomially bounded if 2^{−polylog(n)} ≤ |x| ≤ 2^{polylog(n)}.

Definition A.12. Consider a smoothed p-norm flow problem (G, b) where G = (V, E, r, s, g) with n vertices and m edges. We say that the instance is quasipolynomially bounded if the entries of r and s are quasipolynomially bounded and
max_{f : (B_G)^⊤ f = b} E_G(f) ≤ 2^{polylog(n)}.

Definition A.13 (Touched and untouched cycles). We say a cycle of edges in E is touched if it contains an edge e such that r(e) ≠ 0 or s(e) ≠ 0. Otherwise, we say the cycle is untouched.

Definition A.14 (Cycle-touching instance). We say an instance G = (V, E, r, s, g) is cycle-touching if every cycle of edges in E is touched.

A.2. Additional Properties of Flow Problems and Approximation.
The definitions in Section A.1 satisfy most of the properties we want from comparisons. The following lemma slightly extends a similar statement in [Kyn+19].
Lemma A.15 (Reflexivity). For every smoothed p-norm instance G, and every κ ≥ 1, δ ≥ 0, we have G ⪯_{κ,δ} G with the identity map.

Proof. Consider the map M_{G→G} such that for every flow f_G on G, we have M_{G→G}(f_G) = f_G. Thus,
E_G(κ^{-1} M_{G→G}(f_G)) = E_G(κ^{-1} f_G) = (g_G)^⊤ (κ^{-1} f_G) − h_p(r, s, κ^{-1} f_G)
≥ κ^{-1} (g_G)^⊤ f_G − κ^{-1} h_p(r, s, f_G)
= κ^{-1} E_G(f_G) ≥ κ^{-1} (E_G(f_G) − δ ‖f_G‖_1).
Moreover, (B_G)^⊤ M_{G→G}(f_G) = (B_G)^⊤ f_G. Thus, the claims follow. □
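As a numeric sanity check of the energy objective (Definitions A.3 and A.6) and of the scaling step used in this proof, a minimal sketch with arbitrary test data (the function name and vector layout are our own, not the paper's):

```python
import numpy as np

def energy(g, r, s, f, p):
    # E_G(f) = g^T f - sum_i (r_i f_i^2 + s_i |f_i|^p)
    return g @ f - np.sum(r * f**2 + s * np.abs(f)**p)

# For kappa >= 1, E_G(f / kappa) >= E_G(f) / kappa: the gradient term scales
# exactly by 1/kappa, while the penalty h_p shrinks by kappa^2 or kappa^p,
# both at least as fast as 1/kappa.
```

For example, with g = (1, 0), r = (1, 1), s = (0, 1), f = (0.5, 1) and p = 4, the energy is 0.5 − (0.25 + 1 + 1) = −1.75, and dividing f by any κ ≥ 1 can only raise the value of E_G(f/κ) relative to E_G(f)/κ.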
The notion of approximation also behaves well under composition.
Lemma A.16 (Composition). Given three smoothed p-norm instances G_1, G_2, G_3 such that G_2 ⪯_{κ_1,δ_1} G_1 with the map M_{G_2→G_1} and G_3 ⪯_{κ_2,δ_2} G_2 with the map M_{G_3→G_2}, then G_3 ⪯_{κ,δ} G_1 with the map M_{G_3→G_1} = M_{G_2→G_1} ∘ M_{G_3→G_2}, where κ = κ_1 κ_2 and δ = δ_2 + δ_1 ‖M_{G_3→G_2}‖_{1→1}.

Similarly, given three smoothed p-norm instances G_1, G_2, G_3 such that G_2 ⪯^cycle_{κ_1,δ_1} G_1 with the map M_{G_2→G_1} and G_3 ⪯^cycle_{κ_2,δ_2} G_2 with the map M_{G_3→G_2}, then G_3 ⪯^cycle_{κ,δ} G_1 with the map M_{G_3→G_1} = M_{G_2→G_1} ∘ M_{G_3→G_2}, where κ = κ_1 κ_2 and δ = δ_2 + δ_1 ‖M_{G_3→G_2}‖^cycle_{1→1}.

Proof. Writing f_{G_2} = M_{G_3→G_2}(f_{G_3}) and f_{G_1} = M_{G_2→G_1}(f_{G_2}), we can simply chain together the guarantees to see that:
E_{G_3}(f_{G_3}) ≤ κ_2 E_{G_2}(f_{G_2}/κ_2) + δ_2 ‖f_{G_3}‖_1
≤ κ_2 (κ_1 E_{G_1}(f_{G_1}/(κ_1 κ_2)) + δ_1 ‖f_{G_2}/κ_2‖_1) + δ_2 ‖f_{G_3}‖_1
≤ κ_1 κ_2 E_{G_1}(f_{G_1}/(κ_1 κ_2)) + δ_1 ‖M_{G_3→G_2} f_{G_3}‖_1 + δ_2 ‖f_{G_3}‖_1
≤ κ_1 κ_2 E_{G_1}(f_{G_1}/(κ_1 κ_2)) + (δ_1 ‖M_{G_3→G_2}‖_{1→1} + δ_2) ‖f_{G_3}‖_1.
A similar calculation gives the cycle composition guarantee, but now allows us to bound norms using the ‖M_{G_3→G_2}‖^cycle_{1→1} norm. In all cases, our maps preserve that flows route the correct demands. □

The most important property of this notion of approximation is that it is additive, i.e., it works well with graph decompositions.
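Concretely, with instances stored as a dict of parallel arrays (an assumed layout, not the paper's), the disjoint-union operation formalized next (Definition A.17) is just concatenation:

```python
import numpy as np

def union_instances(inst1, inst2):
    """Disjoint union of two smoothed p-norm instances on a common vertex
    set: edge lists are concatenated (possibly creating multi-edges), and
    the gradient / resistance / scaling vectors are stacked accordingly."""
    assert inst1['n'] == inst2['n']          # same vertex set V
    return {
        'n': inst1['n'],
        'edges': inst1['edges'] + inst2['edges'],
        'g': np.concatenate([inst1['g'], inst2['g']]),
        'r': np.concatenate([inst1['r'], inst2['r']]),
        's': np.concatenate([inst1['s'], inst2['s']]),
    }
```

A flow on the union decomposes as (f_{H_1}, f_{H_2}) by the same index split, which is exactly the map used in the union lemma below.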
Definition A.17 (Union of two instances). Consider smoothed p-norm instances G_1, G_2 with the same set of vertices, i.e., V_{G_1} = V_{G_2}. Define G = G_1 ∪ G_2 as the instance on the same set of vertices obtained by taking a disjoint union of the edges (potentially resulting in multi-edges). Formally,
G = (V_{G_1}, E_{G_1} ∪ E_{G_2}, (g_{G_1}, g_{G_2}), (r_{G_1}, r_{G_2}), (s_{G_1}, s_{G_2})).

We prove the following lemma, which closely follows an analogous statement in [Kyn+19].
Lemma A.18 (Union of instances). Consider four smoothed p-norm instances G_1, G_2, H_1, H_2 on the same set of vertices, i.e., V_{G_1} = V_{G_2} = V_{H_1} = V_{H_2}, such that for i = 1, 2 we have H_i ⪯_{κ,δ} G_i with the map M_{H_i→G_i}. Let G := G_1 ∪ G_2 and H := H_1 ∪ H_2. Then H ⪯_{κ,δ} G with the map
M_{H→G}(f_H = (f_{H_1}, f_{H_2})) := (M_{H_1→G_1}(f_{H_1}), M_{H_2→G_2}(f_{H_2})),
where (f_{H_1}, f_{H_2}) is the decomposition of f_H onto the supports of H_1 and H_2.

Proof. Let f_H be a flow on H. We write f_H = (f_{H_1}, f_{H_2}). Let f_G := M_{H→G}(f_H). If f_{G_i} denotes M_{H_i→G_i}(f_{H_i}) for i = 1, 2, then we know that f_G = (f_{G_1}, f_{G_2}). Thus, the objectives satisfy
E_G(κ^{-1} f_G) = E_{G_1}(κ^{-1} f_{G_1}) + E_{G_2}(κ^{-1} f_{G_2})
≥ κ^{-1} (E_{H_1}(f_{H_1}) − δ ‖f_{H_1}‖_1) + κ^{-1} (E_{H_2}(f_{H_2}) − δ ‖f_{H_2}‖_1)
= κ^{-1} (E_H(f_H) − δ ‖f_H‖_1).
For the residues, we have
(B_G)^⊤ f_G = (B_{G_1})^⊤ f_{G_1} + (B_{G_2})^⊤ f_{G_2} = (B_{H_1})^⊤ f_{H_1} + (B_{H_2})^⊤ f_{H_2} = (B_H)^⊤ f_H.
Thus, H ⪯_{κ,δ} G. □

This notion of approximation also behaves nicely with scaling of ℓ_2 and ℓ_p resistances.

Lemma A.19.
For all κ ≥ 1, and for all pairs of smoothed p-norm instances G, H on the same underlying graphs, i.e., (V_G, E_G) = (V_H, E_H), such that:
(1) the gradients are identical, g_G = g_H,
(2) the ℓ_2-resistances are off by at most κ, i.e., r_G(e) ≤ κ r_H(e) for all edges e, and
(3) the p-norm scaling is off by at most κ^{p−1}, i.e., s_G ≤ κ^{p−1} s_H,
then H ⪯_κ G with the identity map.

Proof. Follows from Lemma 2.13 in [Kyn+19]. □

A.3. Orthogonal Decompositions of Flows.
At the core of our graph decomposition and sparsification procedures is a decomposition of the gradient g of G into its cycle space and potential flow space. We denote such a splitting by
(10) g_G = ĝ_G + B_G ψ_G, s.t. (B_G)^⊤ ĝ_G = 0.
Here ĝ is a circulation, while Bψ gives a potential-induced edge value. We will omit the superscripts when the context is clear.

The following minimization-based formulation of this splitting of g is critical to our method of bounding the overall progress of our algorithm.

Fact A.20.
The projection of g onto the cycle space is obtained by minimizing the Euclidean norm of g plus a potential flow. Specifically, ‖ĝ‖_2 = min_x ‖g + B x‖_2.

Lemma A.21.
Given a graph/gradient instance G, consider H formed from a subset of its edges. The projections of g_G and g_H onto their respective cycle spaces, ĝ_G and ĝ_H, satisfy:
‖ĝ_H‖_2 ≤ ‖ĝ_G‖_2 ≤ ‖g_G‖_2.

A.4.
Numerics, Conditioning, Inexact Laplacian Solvers.
Because we allow instances G = (V, E, r, s, g) with r(e) = 0 and s(e) = 0 for some edges e ∈ E (and this is important for our sparsification procedures), we need to be somewhat careful about disallowing instances with a cycle that has zero r and s values and non-zero gradient: in that case, our "energy" can diverge to +∞.

Definition A.22 (Unbounded and constant cycles). An untouched cycle C ⊆ E is unbounded if the sum of gradient terms around the cycle is non-zero, i.e., Σ_{e ∈ C} g(e) ≠ 0. If the sum is zero, we call the untouched cycle a constant cycle.

Lemma A.23.
Consider a smoothed p-norm flow problem (G, b), i.e., max_{f : (B_G)^⊤ f = b} E_G(f). The problem is unbounded, i.e., has objective value E_G → ∞, if and only if G contains an unbounded cycle.

Proof. The problem is unbounded exactly when the gradient has non-zero inner product with some element of the subspace ker(B^⊤) ∩ ker(R) ∩ ker(S). The first condition, membership in ker(B^⊤), tells us the element must be in the cycle space, while the latter two tell us it must be supported on edges with zero entries in R and S. Writing the element as a linear combination of simple cycles, the gradient must have a non-zero inner product with one untouched cycle. □

Lemma A.24.
The algorithm detectUnbounded takes as input a smoothed p-norm problem (G, b), and if G contains an unbounded cycle, the algorithm detects this and outputs an unbounded cycle (an arbitrary one if there are multiple).

Proof. We run a DFS using only the edges outside the union of the supports of r and s (we delete the other edges first), and assign voltages. Upon encountering an edge into an already visited vertex, we check whether the new voltage agrees; if not, we output the cycle on the stack: it is an unbounded cycle. At the end, consistent voltages certify that the gradient on these edges is given by potential differences, and hence the objective is bounded. The overall time is linear, as each edge is processed only once. □

Lemma A.25.
The algorithm constantCycleContraction takes as input a smoothed p-norm problem (G, b) which has bounded objective value, and:
(1) returns an equivalent instance (G′, b′) where G′ is cycle-touching; G′ equals G with all constant cycles contracted into single vertices;
(2) the problems (G, b) and (G′, b′) are equivalent in that:
(a) any feasible solution f to the first problem can be trivially mapped to a feasible solution to the second, and vice versa;
(b) mapping a solution from (G, b) to (G′, b′) always changes the objective value by the same scalar (say c) for all solutions, and similarly mapping the other way changes the objective by −c;
(c) the maps can be applied in O(m) time;
(d) if the entries of (G, b) are quasi-polynomially bounded, then so are the entries of (G′, b′).
Furthermore, if b = 0 then b′ = 0 and the solution maps preserve the objective value exactly, i.e., c = 0, and hence G ⪯^cycle_1 G′ and G′ ⪯^cycle_1 G; for zero-demand flows the maps M_{G→G′} and M_{G′→G} satisfy ‖M_{G→G′}‖^cycle_{1→1} ≤ 1 and ‖M_{G′→G}‖^cycle_{1→1} ≤ m.

Proof. Once we know the problem is bounded, and hence contains no unbounded cycles, we can look at the connected components consisting of untouched edges and repeatedly contract their cycles. In the case where the demands are zero initially, we produce a smaller instance where the demands are again zero. To lift a solution to the larger instance, we can simply put no flow on the contracted edges. Because adding a cycle flow on these edges does not increase the objective, our mapping guarantees hold. The overall time is linear, as each edge is processed only once. □
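A minimal sketch of the two routines above (Lemmas A.24 and A.25), assuming instances are given as an edge list with parallel arrays r, s, g. The function names are our own, and the contraction variant shown collapses whole zero-cost components at once, a coarser step than the cycle-by-cycle contraction in the proof:

```python
def detect_unbounded(n, edges, r, s, g, tol=1e-9):
    """detectUnbounded sketch: on the subgraph of edges with r(e) = s(e) = 0,
    assign voltages by DFS; a voltage mismatch on an edge into a visited
    vertex witnesses an untouched cycle with non-zero gradient sum."""
    adj = [[] for _ in range(n)]
    for i, (u, v) in enumerate(edges):
        if r[i] == 0 and s[i] == 0:
            adj[u].append((v, g[i]))    # traversing u -> v adds g_e
            adj[v].append((u, -g[i]))   # reverse traversal negates g_e
    phi = [None] * n
    for root in range(n):
        if phi[root] is not None:
            continue
        phi[root], stack = 0.0, [root]
        while stack:
            u = stack.pop()
            for v, ge in adj[u]:
                if phi[v] is None:
                    phi[v] = phi[u] + ge
                    stack.append(v)
                elif abs(phi[v] - (phi[u] + ge)) > tol:
                    return True         # unbounded cycle found
    return False                        # consistent voltages: bounded

def contract_zero_components(n, edges, r, s, g):
    """constantCycleContraction sketch (run only once detect_unbounded is
    False): contract each connected component of zero-cost edges into a
    single vertex and drop the edges inside it."""
    parent = list(range(n))
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    for i, (u, v) in enumerate(edges):
        if r[i] == 0 and s[i] == 0:
            parent[find(u)] = find(v)
    label = {}
    for v in range(n):
        label.setdefault(find(v), len(label))
    kept = [(label[find(u)], label[find(v)], r[i], s[i], g[i])
            for i, (u, v) in enumerate(edges)
            if label[find(u)] != label[find(v)]]
    return len(label), kept
```

On a triangle of zero-cost edges, a gradient summing to zero around the cycle is accepted while a non-zero sum is flagged, matching Definition A.22.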
Lemma A.26.
Consider a smoothed p-norm circulation problem (G, 0), where G = (V, E, r, s, g). Suppose the entries of r and s are quasipolynomially bounded, G is cycle-touching, and ‖g‖_∞ ≤ polylog(n). Then (G, 0) is quasi-polynomially bounded.

Proof. Since every cycle in which a non-zero-gradient edge appears contains a non-zero entry of r or s, increasing flow along any such cycle will eventually lead to a decrease in the objective. □

Remark A.27.
We are analyzing our algorithm in the Real RAM model, but by applying the tools from this section, it can also be analyzed in fixed-precision arithmetic with polylogarithmic bit complexity per number: in this model, our detectUnbounded and constantCycleContraction procedures still work and return a cycle-touching instance. Once an instance is cycle-touching and the non-zero vectors it returns are not too small or too big, it is possible to manage the errors from fixed-point arithmetic. This also allows us to work with inexact Laplacian solvers with quasipolynomial errors, which we can compute in this model in nearly-linear time.

A.5.
Main Sparsification Theorem for Flows.

Theorem A.28 (Instance Sparsification). Consider an instance G = (V_G, E_G, r_G, s_G, g_G) with n vertices and m edges, with r_G and s_G quasipolynomially bounded and ‖g_G‖_∞ ≤ polylog(n), and suppose the instance is cycle-touching. We can compute an instance H = (V_H, E_H, r_H, s_H, g_H) with n vertices and m_H = n · polylog(n/(εδ)) edges, again with r_H and s_H quasipolynomially bounded and ‖g_H‖_∞ ≤ polylog(n), in time Õ(m), such that with probability 1 − ε the maps M_{G→H} and M_{H→G} certify
H ⪯_{κ,δ} G and G ⪯_{κ,δ} H, where κ = m^{1/(p−1)} polylog(n/(εδ)). Furthermore, these maps can be applied in time Õ(m).

Definition A.29. A graph G is an α-uniform φ-expander (or uniform expander, when the parameters are not spelled out explicitly) if:
(1) r is the same on all edges,
(2) s is the same on all edges,
(3) G has conductance at least φ, and
(4) the projection of g onto the cycle space of G, ĝ_G = (I − B L^† B^⊤) g, is α-uniform (see the next definition), where B is the edge-vertex incidence matrix of G, and L = B^⊤ B is the Laplacian.

Definition A.30.
A vector y ∈ R^m is said to be α-uniform if ‖y‖_∞^2 ≤ (α/m) ‖y‖_2^2. We abuse the notation to also let the all-zero vector be 1-uniform.

The next two theorems are from [Kyn+19].

Theorem A.31 (Decomposition into Uniform Expanders). Given any graph/gradient/resistance instance G with n vertices, m edges, ℓ_2-resistances all equal to r, p-weights all equal to s, and gradient g_G, along with a parameter δ, Decompose(G, δ) returns disjoint vertex subsets V_1, V_2, ... in O(m log n log²(n/δ)) time such that, if we let G_1, G_2, ... be the instances obtained by restricting G to the induced graphs on the V_i sets, then at least m/2 edges are contained in these subgraphs, and each G_i satisfies (for some absolute constant c_partition):
(1) The graph (V_{G_i}, E_{G_i}) has conductance at least φ = 1/(c_partition · log n · log²(n/δ)), and degrees at least φ · m/n.
(2) The projection of its gradient g_{G_i} into the cycle space of G_i, ĝ_{G_i}, satisfies one of:
(a) ĝ_{G_i} is O(log n log²(n/δ))-uniform, i.e.,
(ĝ_{G_i}(e))² ≤ (O(log n log²(n/δ)) / m_i) ‖ĝ_{G_i}‖_2² for all e ∈ E(G_i),
where m_i is the number of edges in G_i.
(b) The ℓ_2 norm of ĝ_{G_i} is smaller by a factor of δ than that of the unprojected gradient:
‖ĝ_{G_i}‖_2 ≤ δ · ‖g_G‖_2.

Theorem A.32 (Sampling Uniform Expanders). Given an α-uniform φ-expander G = (V_G, E_G, r_G, s_G, g_G) with m edges and vertex degrees at least d_min, for any sampling probability τ satisfying
τ ≥ c_sample log(n/ε) · (α/m + 1/(φ d_min)),

(Footnotes: We use an instance and its underlying graph interchangeably in our discussion. Here r is uniform on all edges, so conductance is defined as in unweighted graphs.)
We use the standard definition of conductance.For graph G = ( V, E ), the conductance of any ∅ 6 = S ( V is φ ( S ) = δ ( S )min ( vol ( S ) ,vol ( V \ S ) ) where δ ( S ) is the numberof edges on the cut ( S, V \ S ) and vol ( S ) is the sum of the degree of nodes in S . The conductance of a graph is φ G = min S = ∅ ,V φ ( S ). here c sample is some absolute constant, SampleAndFixGradient ( G , τ ) with probability at least − ε returns a partial instance H = ( H, r H , s H , g H ) and maps M G→H and M H→G . The graph H has the same vertex set as G , and H has at most τ m edges. Furthermore, r H = τ · r G and s H = τ p · s G . The maps M G→H and M H→G certify
H (cid:22) κ G and G (cid:22) κ H , where κ = m / ( p − φ − log n This map can be applied in time e O ( m ) . Definition A.33 (2-Rounded instance) . We call an instance G = ( V G , E G , r G , s G , g G ) , -rounded ,if every non-zero entry of s and r has absolute value equal to a power of two (which can benegative).Given an instance G = ( V G , E G , r G , s G , g G ) , we compute a 2-rounded instance G ′ = ( V G , E G , r G ′ , s G ′ , g G ) , .We round each r G e of edges e ∈ E G down to the nearest power of 2 (can be less than 1). Similarly,we round each s G e of edges e ∈ E G down to the nearest power of 2 (can be less than 1). We denotethis rounding procedure by instanceRound , s.t. G ′ = instanceRound ( G ). We will only need toapply this procedure to quasipolynomially bounded numbers, which ensures it can be implementedin logarithmic time in the Real RAM with comparisons. Remark A.34.
Because it is applied to quasipolynomially bounded entries, the rounding can be implemented using a polylogarithmic number of bit operations in fixed-point arithmetic.
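As a concrete illustration, the rounding procedure can be sketched as follows (a minimal sketch; the helper names are ours, not the paper's, and only the power-of-two rounding of Definition A.33 is modeled):

```python
import math

def round_down_pow2(x):
    # Round a positive value down to the nearest power of two; the
    # exponent may be negative (e.g. 0.3 -> 0.25), matching the note
    # that rounded values "can be less than 1".
    return 2.0 ** math.floor(math.log2(x))

def instance_round(r, s):
    # 2-round an instance: every nonzero 2-resistance r_e and p-weight
    # s_e is rounded down to a power of two, so each entry changes by
    # at most a factor of 2 (the 2-approximation of Lemma A.35).
    r2 = [round_down_pow2(v) if v > 0 else 0.0 for v in r]
    s2 = [round_down_pow2(v) if v > 0 else 0.0 for v in s]
    return r2, s2
```

Since each entry shrinks by strictly less than a factor of 2, the identity map certifies the 2-approximation claimed in Lemma A.35.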
Lemma A.35 (2-rounding) . Consider an instance G = ( V, E, r , s , g ) . Let G ′ = instanceRound ( G ) .Then the identity map between the instances certifies G (cid:22) G ′ and G ′ (cid:22) G . and G ′ is -rounded.Proof of Theorem A.28. We are given an instance G = ( V G , E G , r G , s G , g G ) with n vertices and m edges, with r G and s G quasipolynomially bounded, and (cid:13)(cid:13)(cid:13) g G (cid:13)(cid:13)(cid:13) ∞ ≤ polylog( n ) , and cycle-touching.We then compute G ′ = instanceRound ( G ), to get the 2-rounded, cycle-touching instance G ′ =( V G ′ , E G ′ , r G ′ , s G ′ , g G ′ ) with r G ′ and s G ′ quasipolynomially bounded, and (cid:13)(cid:13)(cid:13) g G ′ (cid:13)(cid:13)(cid:13) ∞ ≤ polylog( n ) withusing the identity map between the instances(11) G (cid:22) G ′ and G ′ (cid:22) G . Because G ′ is 2-rounded and r G ′ and s G ′ quasipolynomially bounded, the entries of these twovectors only take on polylog( n ) different values. Thus we can divide the edges E G ′ into polylog( n )buckets such that in every bucket i , all edges have the same 2-resistance value r e = r ( i ) and thesame p -weight value s ( e ) = s ( i ) .Let G ( i ) = ( V G ( i ) , E G ( i ) , r G ( i ) , s G ( i ) , g G ( i ) ) be the instance arising from restricting G ′ to the edgesof bucket i , while letting V G ( i ) to be the set of vertices incident on edges of E G ( i ) . There will beexactly one bucket containing all the edges e with r ( e ) = s ( e ) = 0. We let this bucket have index i = 0. This bucket cannot contain a cycle of edges, because if it did then G ′′ would contain anuntouched cycle, contradiction that it is cycle-touching. Hence G (0) contains at most n − i = 0, which we do not spar-sify. For i >
0, we define $\mathcal{G}^{(i,0)} = \mathcal{G}^{(i)}$, and let $j \leftarrow$
0. As long as G ( i,j ) contains more than n log n edges we then repeat the following: Appealing to Theorem A.31, we now call Decompose ( G ( i,j ) , ˜ δ )with ˜ δ = 2 − log c n δ for some universal constant c large enough that (cid:13)(cid:13)(cid:13) g G (cid:13)(cid:13)(cid:13) ∞ ˜ δ ≤ δ . This producesa partition of V ( i,j ) into disjoint V ( i,j )1 , . . . , V ( i,j ) k . The Decompose algorithm defines G ( i,j ) l to be he instance given by restricting G ( i,j ) to the induced graph on V ( i,j ) l and let G ( i,j +1) be the in-stance arising from restricting G ( i,j ) to the graph consisting of edges crossing between the vertexpartitions, and the vertices incident on these edges. We then let j ← j + 1 and repeat the decom-position if necessary. By Theorem A.31, G ( i,j +1) contains at most half of the edges of G ( i,j +1) , wecall Decompose at most log n times as log ( m/n ) ≤ log( n ). For each bucket i , we let j i denote thelast instance produced, i.e. G ( i,j i )) is the this final instance, which is not included in any expander.For every instance G ( i,j ) , we let m ( i,j ) = (cid:12)(cid:12)(cid:12) E G ( i,j ) (cid:12)(cid:12)(cid:12) denote the number of edges of the graph, and n ( i,j ) = (cid:12)(cid:12)(cid:12) V G ( i,j ) (cid:12)(cid:12)(cid:12) the number of vertices. Note that the edges of G ( i ) are partitioned between the G ( i,j )) l instances and the G ( i,j i )) instances, i.e. each edges of G ( i ) is contained in exactly instance,either a G ( i,j )) l or a G ( i,j i )) . Thus the union of all these in the sense of Definition A.17 is exactly G ( i ) , and the union G ( i ) is G ′ .We will not sparsify the final G ( i,j i ) , which for each bucket i have at most n log n edges. 
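The control flow of this repeated decomposition can be sketched abstractly (the Decompose routine is only a stand-in here; what is taken from the proof is the loop structure and the termination bound, which follows because each round leaves at most half the edges in the crossing part):

```python
import math

def sparsify_bucket(num_edges, n, decompose):
    # Peel off expander pieces until the leftover instance has at most
    # n*log(n) edges. `decompose` models Theorem A.31: it returns
    # (expander_pieces, leftover_edge_count), where the leftover
    # (crossing) edges number at most half the input edges, so the
    # loop runs at most log2(m/n) <= log2(n) times.
    pieces, rounds = [], 0
    while num_edges > n * math.log(n):
        expanders, num_edges = decompose(num_edges)
        pieces.extend(expanders)
        rounds += 1
    return pieces, num_edges, rounds
```

With a mock decomposition that halves the edge count each round, starting from $m = 1024$ edges on $n = 16$ vertices the loop stops after five rounds.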
Again,for every instance G ( i,j ) l , we let m ( i,j ) l = (cid:12)(cid:12)(cid:12)(cid:12) E G ( i,j ) l (cid:12)(cid:12)(cid:12)(cid:12) denote the number of edges of the graph.By Theorem A.31, • The graph associated with each G ( i,j ) l has conductance at least φ ≥ C · log n · log (cid:16) n/ ˜ δ (cid:17) , forsome universal constant C . • For each G ( i,j ) l either – (“Uniform case”) the instance is α -uniform with α ≤ C log n log ( n/ ˜ δ ) for someuniversal constant C . – (“Small case”) The ℓ norm of b g G ( i,j ) l , the gradient projected to the cycle space, issmaller by a factor of ˜ δ than the unprojected gradient of the original graph (cid:13)(cid:13)(cid:13)(cid:13)b g G ( i,j ) l (cid:13)(cid:13)(cid:13)(cid:13) ≤ ˜ δ · (cid:13)(cid:13)(cid:13)b g G ( i,j ) (cid:13)(cid:13)(cid:13) ≤ ˜ δ · (cid:13)(cid:13)(cid:13)b g G (cid:13)(cid:13)(cid:13) . • The minimum degree of each G ( i,j ) l graph is at least φ · m ( i,j ) n ( i,j ) .We let ˜ ε = ε/m . For instances G ( i,j ) l in the “Uniform case”, we let τ ( i,j ) l = c sample log( n/ ˜ ε ) αm ( i,j ) l + 1 φ d min ≤ C log( n/ ˜ ε ) log ( n/ ˜ δ ) m ( i,j ) l + n ( i,j ) m ( i,j ) for some universal constant C , and we call SampleAndFixGradient ( G ( i,j ) l , τ ( i,j ) l ), which returnsinstance H ( i,j ) l and maps M G ( i,j ) l →H ( i,j ) l and M H ( i,j ) l →G ( i,j ) l .For instances G ( i,j ) l in the “small case”, we define G ′ ( i,j ) l to be the instance G ( i,j ) l with all gradiententries set to zero. We let τ ( i,j ) l = c sample log( n/ ˜ ε ) 1 φ d min ≤ C log( n/ ˜ ε ) log ( n/ ˜ δ ) n ( i,j ) m ( i,j ) And call
SampleAndFixGradient ( G ′ ( i,j ) l , τ ( i,j ) l ) which again returns an instance H ( i,j ) l and maps M G ( i,j ) l →H ( i,j ) l and M H ( i,j ) l →G ( i,j ) l .The instance H ( i,j ) l has the same vertex set as G ( i,j ) l and at most τ ( i,j ) l m ( i,j ) l edges.In the “uniform case”, with probability at least 1 − ˜ ε , the maps certify H ( i,j ) l (cid:22) κ G ( i,j ) l and G ( i,j ) l (cid:22) κ H ( i,j ) l , here κ ≤ m / ( p − log (cid:16) n/ ˜ δ (cid:17) , directly by Theorem A.32. In the “small case”, with probabilityat least 1 − ˜ ε , the maps certify with the same κ that H ( i,j ) l (cid:22) κ,δ G ( i,j ) l and G ( i,j ) l (cid:22) κ,δ H ( i,j ) l , (12)because by Theorem A.32, H ( i,j ) l (cid:22) κ G ′ ( i,j ) l and G ′ ( i,j ) l (cid:22) κ H ( i,j ) l , and our rounding of the gradients implies (certified by the identity map) that G ( i,j ) l (cid:22) , ˜ δ k g k ∞ G ′ ( i,j ) l and G ′ ( i,j ) l (cid:22) , ˜ δ k g k ∞ G ( i,j ) l , with finally our choice of ˜ δ = 2 log c ( n ) δ for some large enough universal constant c ensures ˜ δ k g k ∞ ≤ δ , and by Lemma A.16, the guarantees compose to give Equation (12). By Lemma A.16, theguarantees compose and we get that the total edge count in the sampled instances H ( i,j ) l associatedwith G ( i,j ) is bounded by X l τ ( i,j ) l m ( i,j ) l ≤ X l C log( n/ ˜ ε ) log ( n/ ˜ δ ) n ( i,j ) m ( i,j ) l m ( i,j ) ≤ C log( n/ ˜ ε ) log ( n/ ˜ δ ) n ( i,j ) ≤ C log( n/ ˜ ε ) log ( n/ ˜ δ ) n. Since the number of buckets (indexed by i ) is bounded by polylog( n ) and the number Decompose calls for each bucket (indexed by j ) is bounded by log( n ), we get that the total number of edgessummed across all the H ( i,j ) l for all i, j, l is bounded by n log(1 / ˜ ε ) log (1 / ˜ δ ) polylog( n )Summed over all i , the total number of edges in the G ( i,j i ) instances is n polylog( n ) . 
We return a sparsifier instance $\mathcal{H}$ consisting of the union over all $i$ and $j$ of the $\mathcal{H}^{(i,j)}_l$, the $\mathcal{G}^{(i,j_i)}$, and $\mathcal{G}^{(0)}$ (the bucket of edges whose 2-weights and $p$-weights are both zero), with a total number of edges bounded by
$$n \log(1/\tilde{\varepsilon}) \log^2(1/\tilde{\delta}) \operatorname{polylog}(n) \leq n \operatorname{polylog}(n/(\varepsilon\delta)).$$
Because the union of the original instances $\mathcal{G}^{(i,j)}_l$, $\mathcal{G}^{(i,j_i)}$, and $\mathcal{G}^{(0)}$ gives us $\mathcal{G}'$, by Lemma A.18,
$$\mathcal{G}' \preceq_{\kappa,\delta} \mathcal{H} \quad \text{and} \quad \mathcal{H} \preceq_{\kappa,\delta} \mathcal{G}',$$
where again $\kappa \leq m^{1/(p-1)} \log\big(n/\tilde{\delta}\big) \leq m^{1/(p-1)} \operatorname{polylog}(n/\delta)$. Then, because Equation (11) holds using the identity map between $\mathcal{G}$ and $\mathcal{G}'$, we have $\mathcal{G} \preceq_{\kappa,\delta} \mathcal{H}$ and $\mathcal{H} \preceq_{\kappa,\delta} \mathcal{G}$.
We sparsified at most $m$ different expanders, since each contains a distinct edge of $\mathcal{G}$, and each sparsification fails with probability at most $\tilde{\varepsilon} = \varepsilon/m$, so by a union bound, the probability that none of the sparsifications fail is at least $1 - \varepsilon$. That the weights of $\mathcal{H}$ are quasipolynomially bounded follows from the explicit weights given in Theorem A.32. The overall time bound to apply the union map also follows immediately from Theorem A.32. $\square$

Proof of Theorem A.1.
We will use Theorem A.28. This requires that the instance is cycle-touching, so we first convert our instance $\mathcal{G}$ to $\mathcal{G}'$ using Lemma A.25. We thus have maps $\mathcal{M}_{\mathcal{G}\to\mathcal{G}'}$ and $\mathcal{M}_{\mathcal{G}'\to\mathcal{G}}$ that can be applied in $O(m)$ time, where $\|\mathcal{M}_{\mathcal{G}\to\mathcal{G}'}\|_{1\to1} \leq 1$, $\mathcal{G}' \preceq^{\mathrm{cycle}}_{1,0} \mathcal{G}$, and $\mathcal{G} \preceq^{\mathrm{cycle}}_{1,0} \mathcal{G}'$ (since we are solving the residual problem, our demand vector is 0). We can apply Theorem A.28 to $\mathcal{G}'$ to get an instance $\mathcal{H}$ with at most $m_{\mathcal{H}} = n \operatorname{polylog}(n/(\varepsilon\delta))$ edges, maps $\mathcal{M}'$ that can be computed in $\widetilde{O}(m)$ time, and $\mathcal{H} \preceq^{\mathrm{cycle}}_{\kappa,\delta} \mathcal{G}'$ and $\mathcal{G}' \preceq^{\mathrm{cycle}}_{\kappa,\delta} \mathcal{H}$, where $\kappa = m^{1/(p-1)} \operatorname{polylog}(n/(\varepsilon\delta))$. Now, to go from $\mathcal{G}$ to $\mathcal{H}$, we compose these two approximations, and we thus have from Lemma A.16, $\mathcal{G} \preceq^{\mathrm{cycle}}_{\kappa,\,\delta\|\mathcal{M}_{\mathcal{G}\to\mathcal{G}'}\|_{1\to1}} \mathcal{H}$ and $\mathcal{H} \preceq^{\mathrm{cycle}}_{\kappa,\delta} \mathcal{G}$. Finally, as $\|\mathcal{M}_{\mathcal{G}\to\mathcal{G}'}\|_{1\to1} \leq 1$, this completes our proof. We remark that any quasipolynomial blow-up in this error would also be acceptable. $\square$
Appendix B. Sparsification for General $\ell_2^2 + \ell_p^p$ Objectives Using Lewis Weights
We will prove Theorem 2.5.

B.1. Leverage Scores and Lewis Weights.
For $\alpha \geq 1$, and $x, y > 0$, we say $x \approx_\alpha y$ if $\alpha^{-1} x \leq y \leq \alpha x$. The statistical leverage score of a row $\boldsymbol{a}_i$ of a matrix $\boldsymbol{A}$ is defined as
$$\tau_{2,i}(\boldsymbol{A}) \stackrel{\text{def}}{=} \boldsymbol{a}_i^{\top} (\boldsymbol{A}^{\top}\boldsymbol{A})^{-1} \boldsymbol{a}_i = \left\| (\boldsymbol{A}^{\top}\boldsymbol{A})^{-1/2} \boldsymbol{a}_i \right\|_2^2.$$
The generalization of statistical leverage scores to $\ell_p$-norms is given by $\ell_p$ Lewis weights [Lew78], which are defined as follows:
Definition B.1.
For a matrix $\boldsymbol{A}$ and for $p \geq 1$, we define the $\ell_p$ Lewis weights $\{\tau_{p,i}\}_i$ to be the unique weights such that
$$\tau_{p,i} = \tau_{2,i}\big(\mathbf{Diag}(\tau_{p,i})^{1/2 - 1/p} \boldsymbol{A}\big).$$
Equivalently,
$$\boldsymbol{a}_i^{\top} \big( \boldsymbol{A}^{\top} \mathbf{Diag}(\tau_{p,i})^{1 - 2/p} \boldsymbol{A} \big)^{-1} \boldsymbol{a}_i = \tau_{p,i}^{2/p}.$$
When the matrix $\boldsymbol{A}$ is not obvious from the context, we will denote the Lewis weights by $\tau_{p,i}(\boldsymbol{A})$. We use $\widetilde{\tau}_{p,i}$ to denote $\beta$-approximate Lewis weights, i.e., $\widetilde{\tau}_{p,i} \approx_\beta \tau_{p,i}$.

Lemma B.2 (Foster's Theorem [Fos+53]). For any matrix $\boldsymbol{A} \in \mathbb{R}^{m \times n}$, $m \geq n$, we have $\sum_i \tau_{2,i}(\boldsymbol{A}) = \mathrm{rank}(\boldsymbol{A}) \leq n$.

As a simple corollary, we get that the $\ell_p$ Lewis weights also sum to at most $n$.

Corollary B.3.
For any matrix $\boldsymbol{A} \in \mathbb{R}^{m \times n}$, $m \geq n$, and any $p$, we have $\sum_i \tau_{p,i}(\boldsymbol{A}) \leq n$.

Proof.
By definition and existence of the Lewis weights, X i τ p,i ( A ) = X i τ ,i ( Diag (cid:0) τ p,i (cid:1) / − /p A ) = rank ( Diag (cid:0) τ p,i (cid:1) / − /p A ) ≤ n, where the second equality follows from Lemma B.2. (cid:3) s we will see, having access to τ ,i would allow us to determine a spectral approximation to A , though with many fewer rows. Unfortunately, the naive approach to calculating them requirescomputing ( A ⊤ A ) + , which would defeat the purpose of finding a smaller spectral approximation inthe first place. Thus, the key insight of work by [Coh+15] is that a certain uniform sampling-based approach is sufficient to determine approximate leverage scores, as established by the followingimportant lemma. Lemma B.4 (Lemma 7, [Coh+15]) . Given matrix A , p ∈ [2 , , θ < , and a matrix B containing O ( n log( n )) rescaled rows of A , there is an algorithm that, w.h.p. in n , computes n θ -approximate τ ,i for A in time O (( LSS ( B ) + nnz( A )) θ − ) , where LSS ( B ) is the time required to solve a linearequation in B ⊤ B to quasipolynomial accuracy. While the previous lemma provides approximations to τ ,i , it was later shown by [CP15] that infact we may use such a routine as a black-box for determining approximations τ p,i ( A ), for p ∈ [2 , Lemma B.5 (Lemma 2.4, [CP15]) . For any fixed p < , given a routine ApproxLeverageScores for computing, with high probability in n , β -approximate statistical leverage scores of rows of ma-trices of the form W A for β = n θ −| p/ − | p , there is an algorithm ApproxLewisWeights n θ -approximate ℓ p Lewis weights for A with O (cid:16) log( θ − )1 −| p/ − | (cid:17) calls to ApproxLeverageScores . Combining these two lemmas, we arrive at the following overall computational cost for finding e τ p,i . Theorem B.6.
Given matrix A , p ∈ [2 , , θ < , there is an algorithm that computes n θ -approximate τ p,i for A in time O (cid:18) p (1 − | p/ − | ) ( LSS ( B ) + nnz( A )) θ − log( θ − ) (cid:19) , where LSS ( B ) is the time required to solve a linear equation in B ⊤ B to quasipolynomial accuracyfor some matrix B containing O ( n log( n )) rescaled rows of A .Proof. The theorem follows immediately from Lemmas B.4 and B.5. (cid:3)
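For intuition, the fixed point of Definition B.1 can be computed on small dense matrices by the contractive iteration of [CP15] (a numerical sketch using exact leverage-score computations via a pseudoinverse, not the fast approximate routine of Theorem B.6):

```python
import numpy as np

def lewis_weights(A, p, iters=80):
    # Cohen-Peng iteration for p in [2, 4): repeatedly set
    #   w_i <- ( a_i^T (A^T W^{1-2/p} A)^{-1} a_i )^{p/2},
    # whose fixed point satisfies tau_{2,i}(W^{1/2-1/p} A) = w_i,
    # i.e. the defining equation of the ell_p Lewis weights.
    m, _ = A.shape
    w = np.ones(m)
    for _ in range(iters):
        W = w ** (1.0 - 2.0 / p)
        G = np.linalg.pinv(A.T @ (W[:, None] * A))
        q = np.einsum('ij,jk,ik->i', A, G, A)  # row quadratic forms
        w = q ** (p / 2.0)
    return w
```

By Corollary B.3 the weights sum to at most $n$; for a full-rank $\boldsymbol{A}$ with all weights positive the sum is exactly $\mathrm{rank}(\boldsymbol{A})$, and taking $p = 2$ recovers the ordinary leverage scores.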
Lemma B.7 ( ℓ Matrix Concentration Bound (Lemma 4, [Coh+15])) . There exists an absoluteconstant C such that for any matrix A ∈ R m × n , and any set of sampling values ν i satisfying ν i ≥ τ ,i ( A ) · C ε − log n, if we generate a matrix S with N = P i ν i rows, each chosen independently as the i th standard basisvector times √ ν i , with probability ν i N , then with probability at least − n Ω(1) we have ∀ x ∈ R n , k S Ax k ≈ ε k Ax k . Lemma B.8 ([BLM89], [CP15](Lemma 7.1)) . For p ≥ , there exists an absolute constant C p suchthat for any matrix A ∈ R m × n , and any set of sampling values ν i satisfying ν i ≥ τ p,i ( A ) · C p n p − ε − log n log / ε , if we generate a matrix S with N = P i ν i rows, each chosen independently as the i th standard basisvector times ν /pi , with probability ν i N , then with probability at least − n Ω(1) we have ∀ x ∈ R n , k S Ax k p ≈ ε k Ax k p . Lemma B.9.
For p ∈ [2 , , given matrices C , D ∈ R m × n , there exist ν i > , i ∈ [ m ] with N = P i ν i ≤ O (1) n p / log n such that, if we generate a matrix S with N rows, each chosen independently s the i th standard basis vector times ν /pi with probability ν i N , then we can compute a diagonal matrix R ∈ R N × N ≥ such that with probability at least − n Ω(1) , ∀ x ∈ R n , k RS C x k ≈ k C x k and k S D x k p ≈ k D x k p . Proof.
Let e τ ,i ( C ) be 2-approximate leverage scores of C and e τ p,i ( D ) be 2-approximate ℓ p Lewisweights for D . Define ν i = C ,p max ne τ ,i ( C ) · log n, e τ p,i ( D ) · n p − log n o , where C ,p is a large enough absolute constant we specify later. Since P i e τ ,i ( C ) ≤ P i τ ,i ( C ) ≤ n and P i e τ p,i ( D ) ≤ P i τ p,i ( D ) ≤ n from Corollary B.3, we get N = P i ν i ≤ O ( C ,p ) n p − log n. Let S be as defined in the lemma statement, i.e. S ab = ν /pb if b th basis vector is chosen for row a, . Let us assume for row a , we have chosen the b th basis vector. Now define the diagonal matrix R as R aa = ν p − b . Note that e S = RS is a matrix with N rows, each chosen independently as the i th standard basisvector times ν / i with probability ν i N . We can pick C ,p large enough so that ν i ≥ τ ,i ( C ) · C log n, and we can apply Lemma B.7 for some constant ε < − n Ω(1) , we have(13) ∀ x ∈ R n , k RS C x k = k e S C x k ≈ k C x k . Similarly, we can pick C ,p large enough so that we have ν i ≥ τ p,i ( D ) · n p − log n . Thus, usingLemma B.8, we get that with probability at least 1 − n Ω(1) , we have(14) ∀ x ∈ R n , k S D x k p ≈ k D x k p . Combining the above two claims, and applying a union bound, we obtain our lemma. (cid:3)
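The sampling matrix $\boldsymbol{S}$ and rescaling matrix $\boldsymbol{R}$ from this proof can be constructed mechanically. A sketch (the vector $\nu$ is assumed precomputed from approximate leverage scores and Lewis weights; the normalization is our reading of Lemmas B.7 and B.8):

```python
import numpy as np

def sampling_matrices(nu, p, rng):
    # N = ceil(sum(nu)) rows; row a picks basis vector b with
    # probability nu_b / sum(nu) and carries scale nu_b^{-1/p}, the
    # unbiased ell_p sampler (Lemma B.8). The diagonal R rescales row
    # a by nu_b^{1/p - 1/2}, so R @ S has nonzero entries nu_b^{-1/2},
    # the unbiased ell_2 sampler (Lemma B.7).
    nu = np.asarray(nu, dtype=float)
    N = int(np.ceil(nu.sum()))
    idx = rng.choice(len(nu), size=N, p=nu / nu.sum())
    S = np.zeros((N, len(nu)))
    S[np.arange(N), idx] = nu[idx] ** (-1.0 / p)
    R = np.diag(nu[idx] ** (1.0 / p - 0.5))
    return S, R
```

The same random row choices thus serve both norms at once, which is exactly what the lemma needs.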
Lemma B.10.
Let p ∈ [2 , , let M , N , A be matrices such that M ∈ R m × n , N ∈ R m × n , m , m ≥ n , and A ∈ R d × n , d ≤ n , and consider the problem min ∆ ∆ ⊤ M ⊤ M ∆ + k N ∆ k pp (15) s.t. A ∆ = c , with optimum at ∆ ∗ . Then, with high probability we may compute f M , f N ∈ R O ( n p/ log( n )) × n suchthat, for a κ -approximate solution e ∆ to the problem min ∆ ∆ ⊤ f M ⊤ f M ∆ + k f N ∆ k pp (16) s.t. A ∆ = c with optimum at e ∆ ∗ , e ∆ is a O ( κ ) -approximate solution to (15) .Proof. Let f M = RS M and f N = S N be as provided by Lemma B.9. It follows that e ∆ ⊤ M ⊤ M e ∆ + k N e ∆ k pp ≤ p (cid:18) e ∆ ⊤ f M ⊤ f M e ∆ + k f N e ∆ k pp (cid:19) ≤ p κ (cid:18) e ∆ ∗⊤ f M ⊤ f M e ∆ ∗ + k f N e ∆ ∗ k pp (cid:19) ≤ p κ (cid:18) ∆ ∗⊤ f M ⊤ f M ∆ ∗ + k f N ∆ ∗ k pp (cid:19) p κ (cid:16) ∆ ∗⊤ M ⊤ M ∆ ∗ + k N ∆ ∗ k pp (cid:17) ≤ κ (cid:16) ∆ ∗⊤ M ⊤ M ∆ ∗ + k N ∆ ∗ k pp (cid:17) , where the last inequality follows from our bound on p . (cid:3) We now recall the main Lewis weights-based sparsification result, Theorem 2.6 which was provenin Section 3.1.This result gives us the following corollaries which distinguish the general problem from the morestructured graph problem, whereby the latter may take advantage of fast Laplacian solvers. Notethat, for Theorem 2.6 and its corollaries, we use ˜ O p ( · ) to suppress a (1 − | p/ − | ) − term, whichwill become large as p approaches 4. Proof of Theorem 2.5.
Follows from Theorem B.6 and Lemma B.10. $\square$
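As a small end-to-end illustration of the $\ell_2$ half of this sparsification (exact leverage scores and a generous illustrative oversampling constant; the paper's algorithm instead uses the fast approximations of Theorem B.6 combined with $\ell_p$ Lewis-weight sampling):

```python
import numpy as np

rng = np.random.default_rng(7)
m, n = 4000, 4
M = rng.standard_normal((m, n))

# Exact leverage scores tau_{2,i} = a_i^T (M^T M)^{-1} a_i.
G = np.linalg.pinv(M.T @ M)
tau = np.einsum('ij,jk,ik->i', M, G, M)

# Keep row i independently with probability q_i, rescaled by
# q_i^{-1/2} so that E[Ms^T Ms] = M^T M (a Bernoulli variant of
# Lemma B.7; the factor 50 is an illustrative oversampling constant).
q = np.minimum(1.0, 50.0 * np.log(m) * tau)
keep = rng.random(m) < q
Ms = M[keep] / np.sqrt(q[keep])[:, None]
```

With this much oversampling the sampled matrix preserves $\|\boldsymbol{M}x\|_2$ up to a small constant factor while using far fewer rows.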
Corollary B.11 (General matrix setting) . Consider (3) for arbitrary M ∈ R m × n , N ∈ R m × n .Then, for p ∈ [2 , , with high probability, we can find an ε -approximate solution in time ˜ O p (cid:18) nnz( M ) + nnz( N ) + (cid:16) nnz( f M ) + nnz( f N ) + n ω (cid:17) n p ( p − p − (cid:19) log (1 /ε ) ! , for some f M and f N each containing O ( n p/ log( n )) rescaled rows of M and N , respectively. Corollary B.12 (Graph setting) . Consider (3) for M ∈ R m × n , N ∈ R m × n , m , m ≥ n , givenas the edge-vertex incidence matrices for some graphs. Then, for p ∈ [2 , , with high probability,we can find an ε -approximate solution to (3) in time ˜ O p (cid:18) m + m + n p (cid:16) p − p − (cid:17) (cid:19) log (1 /ε ) ! . Appendix C. Width-Reduced Approximate Solver for ℓ + ℓ pp Problems
We will solve problems of the form,min ∆ ∆ ⊤ M ⊤ M ∆ + k N ∆ k pp (17) s.t. A ∆ = c , which have an optimum at most ν . We first scale the problem down to a new problem withoptimum at most 1. Note that there exists ∆ ⋆ such that A ∆ ⋆ = c and ∆ ⋆ ⊤ M ⊤ M ∆ ⋆ + k N ∆ ⋆ k pp ≤ ν . Let ˜ M = ν − p − M and e ∆ = ν − /p ∆ ⋆ . The following problem has optimum at most 1 since e ∆is a feasible solution. min ∆ ∆ ⊤ ˜ M ⊤ ˜ M ∆ + k N ∆ k pp (18) s.t. A ∆ = ν − /p c , Now, let ¯∆ denote a feasible solution such that ¯∆ ⊤ ˜ M ⊤ ˜ M ¯∆ ≤ α and k N ∆ k pp ≤ β . Note that∆ = ν /p ¯∆ satisfies the constraints of (17) and,∆ ⊤ M ⊤ M ∆ ≤ αν, and , k N ∆ k pp ≤ βν. It is thus sufficient to solve Problem (18) to an α, β approximation. We thus have the followingresult which follows from Theorem C.1, heorem 3.4. Let p ≥ . Consider an instance of Problem (5) described by matrices A ∈ R d × n , N ∈ R m × n , M ∈ R m × n , d ≤ n ≤ m , m , and vector c ∈ R d . If the optimum of thisproblem is at most ν , Procedure Residual-Solver (Algorithm 3) returns an x such that Ax = c , and x ⊤ M ⊤ M x ≤ O (1) ν and k N x k pp ≤ O (3 p ) ν . The algorithm makes O pm p − p − ! calls to alinear system solver. C.1.
Solving Scaled Problem.
We will show that we can solve problems of the formmin ∆ ∆ ⊤ M ⊤ M ∆ + k N ∆ k pp (19) s.t. A ∆ = c , which have an optimum value at most 1. We will use the following oracle in our algorithm. Algorithm 2
Oracle procedure Oracle ( A , M , N , c , w ) r e ← w p − e Compute, ∆ = arg min A ∆ ′ = c m p − p ∆ ′⊤ M ⊤ M ∆ ′ + 13 p − X e r e (cid:16) N ∆ ′ (cid:17) e return ∆We can use now use Algorithm 4 from [Adi+19].Notation. We will use ∆ ⋆ to denote the optimum of (19) and e ∆ to denote the solution returned bythe oracle (Algorithm 2). We thus have, • ∆ ⋆ ⊤ M ⊤ M ∆ ⋆ ≤ • k N ∆ ∗ k p ≤ • r e ≥ , ∀ e .We will prove the following main theorem: Theorem C.1.
Let p ≥ . Given matrices A ∈ R d × n , N ∈ R m × n , M ∈ R m × n , m , m ≥ n , d ≤ n , and vector c , Algorithm 3 uses O pm p − p − ! , calls to the oracle (Algorithm 2) and returnsa vector x such that Ax = c , x ⊤ M ⊤ M x ≤ O (1) and k N x k pp = O (3 p ) . Analysis of Algorithm 3.
Similar to [Adi+19] we will track two potentials Φ and Ψ which wedefine as, Φ (cid:16) w ( i ) (cid:17) def = k w k pp Ψ( r ) def = min ∆: A ∆= c m p − p ∆ ⊤ M ⊤ M ∆ + 13 p − X e r e ( N ∆) e . Note that these potentials have a similar idea as [Adi+19] but are defined differently. Our proofwill follow the following structure,(1) Provided the total number of width reduction steps, K , is not too big, Φ( · ) is small. Thisin turn helps upper bound the value of the solution returned by the algorithm.(2) Showing that K cannot be too big, because each width reduction step cause large growthin Ψ( · ), while we can bound the total growth in Ψ( · ) by relating it to Φ( · ). lgorithm 3 Algorithm for the Scaled down Problem procedure Residual-Solver ( A , M , N , c ) w (0 , e ← x ← ρ ← Θ m ( p − p +2) p (3 p − ! ⊲ width parameter β ← Θ (cid:18) m p − p − (cid:19) ⊲ resistance threshold α ← Θ p − m − p − p +2 p (3 p − ! ⊲ step size τ ← Θ m ( p − p − p − ! ⊲ ℓ p energy threshold T ← α − m /p = Θ (cid:18) pm p − p − (cid:19) i ← , k ← while i < T do ∆ = Oracle ( A , M , N , c , w ) if k N ∆ k pp ≤ τ then ⊲ flow step w ( i +1 ,k ) ← w ( i,k ) + α | N ∆ | x ← x + α ∆ i ← i + 1 else ⊲ width reduction step For all edges e with | N ∆ | e ≥ ρ and r e ≤ β w ( i,k +1) e ← p − w e k ← k + 1 return m − p x We start by proving some results that we need to prove our final result, Theorem C.1. Theproofs of all lemmas are in Section C.2.
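The control flow of Algorithm 3 can be summarized in a short sketch. Here the `oracle` argument stands in for Algorithm 2's regularized quadratic solve and returns both the update $\Delta$ and its congestion $\boldsymbol{N}\Delta$, and the parameter values are passed in rather than fixed to the paper's exact exponents:

```python
import numpy as np

def residual_solver(oracle, m, p, T, alpha, tau, rho, beta, n):
    # Width-reduced multiplicative-weights loop: flow steps advance x
    # and grow the weights w; width-reduction steps boost the weight
    # (hence the resistance r_e = w_e^{p-2}) of high-congestion edges.
    w = np.ones(m)
    x = np.zeros(n)
    i = 0
    while i < T:
        delta, cong = oracle(w)   # cong = N @ delta, edge congestion
        if np.sum(np.abs(cong) ** p) <= tau:    # flow step
            w += alpha * np.abs(cong)
            x += alpha * delta
            i += 1
        else:                                   # width reduction step
            r = w ** (p - 2)
            heavy = (np.abs(cong) >= rho) & (r <= beta)
            w[heavy] *= 2.0 ** (1.0 / (p - 2))  # doubles r_e
    return m ** (-1.0 / p) * x                  # final rescaling
```

The analysis below bounds the number of width-reduction iterations, so the loop terminates after roughly $T$ oracle calls.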
Lemma C.2.
Let $p \geq 2$. For any $w$, let $\widetilde{\Delta}$ be the solution returned by Algorithm 2. Then,
$$\sum_e (\boldsymbol{N}\widetilde{\Delta})_e^2 \leq \sum_e r_e (\boldsymbol{N}\widetilde{\Delta})_e^2 \leq \|w\|_p^{p-2}.$$
We next show through the following lemma that the $\Phi$ potential does not increase too rapidly. The proof is through induction and can be found in the Appendix.
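The second inequality in Lemma C.2 rests on Hölder's inequality with exponents $p/2$ and $p/(p-2)$, which can be sanity-checked numerically (random data; `v` stands in for $\boldsymbol{N}\Delta^{\star}$, normalized so that $\|v\|_p = 1$ as in the lemma's hypothesis):

```python
import numpy as np

rng = np.random.default_rng(3)
p = 4.0
w = rng.uniform(0.1, 2.0, size=50)
v = rng.uniform(-1.0, 1.0, size=50)          # stands in for N @ Delta*
v /= np.sum(np.abs(v) ** p) ** (1.0 / p)     # normalize so ||v||_p = 1

# Holder: sum w^{p-2} v^2 <= ||v||_p^2 * ||w||_p^{p-2}.
lhs = np.sum(w ** (p - 2) * v ** 2)
rhs = np.sum(np.abs(v) ** p) ** (2.0 / p) * np.sum(w ** p) ** ((p - 2) / p)
```

With $\|v\|_p = 1$ the right-hand side collapses to $\|w\|_p^{p-2}$, which is exactly the bound used in the proof.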
Lemma C.3.
After i flow steps, and k width-reduction steps, provided(1) p p α p τ ≤ pαm p − p , (controls Φ growth in flow-steps)(2) k ≤ − pp − ρ m /p β − p − ,(acceptable number of width-reduction steps)the potential Φ is bounded as follows: Φ (cid:16) w ( i,k ) (cid:17) ≤ (cid:16) αi + m / p (cid:17) p exp pp − kρ m /p β − p − . We next show how the potential Ψ changes with a change in resistances. The proof is in theAppendix. emma C.4. Let e ∆ = arg min A ∆= c m p − p ∆ ⊤ M ⊤ M ∆ + p − P e r e ( N ∆) e . Then one has for any r ′ and r such that r ′ ≥ r , Ψ( r ′ ) ≥ Ψ( r ) + X e (cid:18) − r e r ′ e (cid:19) r e ( N e ∆) e . The next lemma gives a lower bound on the energy in the beginning and an upper bound on theenergy at each step.
Lemma C.5.
Initially, we have, Ψ (cid:16) r (0 , (cid:17) ≥ k M + N k k c k k A k = L, where k M + N k min = min Ax = c k M x k + k N x k and k A k is the operator norm of A . Moreover,at any step ( i, k ) we have, Ψ (cid:16) r ( i,k ) (cid:17) ≤ m p − p + 13 p − Φ( i, k ) p − p . We next bound the change in energy in a flow step and a width reduction step. This lemma isdirectly from [Adi+19] and the proof is also very similar. We include it here for completeness.
Lemma C.6.
Suppose at step ( i, k ) we have (cid:13)(cid:13)(cid:13) N e ∆ (cid:13)(cid:13)(cid:13) pp > τ so that we perform a width reductionstep (line 17). If(1) τ /p ≥ Ω(1) Ψ( r ) β , and(2) τ ≥ Ω(1)Ψ( r ) ρ p − .Then Ψ (cid:16) r ( i,k +1) (cid:17) ≥ Ψ (cid:16) r ( i,k ) (cid:17) + Ω(1) τ /p . Furthermore, if at ( i, k ) we have (cid:13)(cid:13)(cid:13) N e ∆ (cid:13)(cid:13)(cid:13) pp ≤ τ so that we perform a flow step, then Ψ (cid:16) r ( i,k +1) (cid:17) ≥ Ψ (cid:16) r ( i,k ) (cid:17) . Proof of Theorem C.1.
Proof.
We begin by setting all our parameter values. • α ← Θ p − m − p − p +2 p (3 p − ! • τ ← Θ m ( p − p − p − ! • β = Θ (cid:18) m p − p − (cid:19) • ρ = Θ m ( p − p +2) p (3 p − ! Note that the above values satisfy the relations p p α p τ = pαm p − p .Let m − /p x be the solution returned by Algorithm 3. Note that this satisfies the linear constraintrequired. We will now bound the values of m − /p x ⊤ M ⊤ M x and m − k N x k pp . If the algorithm erminates in T = α − m /p flow steps and K ≤ − pp − ρ m /p β − p − width reduction steps, thenfrom Lemma C.3, Φ (cid:16) w ( T,K ) (cid:17) ≤ O (3 p ) m e = O (3 p m )Note that throughout the algorithm w ≥ | N x | . This means that the algorithm returns m − p x with m − k N x k pp ≤ m (cid:13)(cid:13)(cid:13) w ( T,K ) (cid:13)(cid:13)(cid:13) pp = 1 m Φ (cid:16) w ( T,K ) (cid:17) ≤ O (3 p ) . To bound the other term, let e ∆ ( t ) denote the solution returned by the oracle in iteration t . Notethat, since Φ ≤ O (3 p ) m for all iterations, we always have Ψ( r ) ≤ O (1) m p − p . We claim that (cid:16) e ∆ ( t ) (cid:17) ⊤ M ⊤ M e ∆ ( t ) ≤ O (1) for all t . To see this, note that from Lemma C.5, m p − p (cid:16) e ∆ ( t ) (cid:17) ⊤ M ⊤ M e ∆ ( t ) ≤ Ψ( r ) ≤ O (1) m p − p . We also know that x = P t αp e ∆ ( t ) . Combining this and the convexity of k x k , we get m − /p k M x k ≤ α m − /p T X t k M e ∆ ( t ) k ≤ α m − /p T O (1) ≤ O (1) . This concludes the first part of the proof that if the number of width reduction steps are bounded,then we return a solution with the required values. We will now show that we cannot have morewidth reduction steps.Suppose to the contrary, the algorithm takes a width reduction step starting from step ( i, k ) where i < T and k = 2 − pp − ρ m /p β − p − . Since the conditions for Lemma C.3 hold for all preceding steps,we must have Φ (cid:16) w ( i,k ) (cid:17) ≤ O (3 p ) m . 
We note that our parameter values satisfy τ /p ≥ Ω(1) Ψ β and τ ≥ Ω(1) ρ p − Ψ since Ψ ≤ O (1) m p − p .This means that at every step ( j, l ) preceding the current step, the conditions of Lemma C.6 aresatisfied, so we can prove by a simple induction thatΨ (cid:16) r ( i,k +1) (cid:17) ≥ Ψ (cid:16) r (0 , (cid:17) + Ω(1) τ /p k. Since our parameter choices ensure τ /p k > Θ( m ),Ψ (cid:16) r ( i,k +1) (cid:17) − Ψ (cid:16) r (0 , (cid:17) > Ω( m ) . Since Φ (cid:16) w ( i,k ) (cid:17) ≤ O (3 p ) m , Ψ (cid:16) r ( i,k +1) (cid:17) − Ψ (cid:16) r (0 , (cid:17) ≤ O (cid:18) m p − p (cid:19) , which is a contradiction. We can thus conclude that we can never have more than K = 2 − pp − ρ m /p β − p − width reduction steps, thus concluding the correctness of the returned solution. We next boundthe number of oracle calls required. The total number of iterations is at most, T + K ≤ α − m /p + 2 − p/ ( p − ρ m /p β − p − ≤ O (cid:18) pm p − p − (cid:19) . (cid:3) .2. Missing Proofs.Lemma C.2.
Let $p \geq 2$. For any $w$, let $\widetilde{\Delta}$ be the solution returned by Algorithm 2. Then,
$$\sum_e (\boldsymbol{N}\widetilde{\Delta})_e^2 \leq \sum_e r_e (\boldsymbol{N}\widetilde{\Delta})_e^2 \leq \|w\|_p^{p-2}.$$
Since e ∆ is the solution returned by Algorithm 2, and ∆ ⋆ satisfies the constraints of theoracle, we have, X e r e ( N e ∆) e ≤ X e r e ( N ∆ ∗ ) e = X e w p − e ( N ∆ ∗ ) e ≤ k w k p − p . In the last inequality we use, X e w e ( N ∆ ⋆ ) e ≤ X e ( N ∆ ⋆ ) · p e ! /p X e | w e | ( p − · pp − ! ( p − /p = k N ∆ ⋆ k p k w k ( p − /pp ≤ k w k ( p − /pp , since (cid:13)(cid:13) N ∆ ∗ (cid:13)(cid:13) p ≤ . Finally, using r e ≥ , we have P e ( N ∆) e ≤ P e r e ( N ∆) e , concluding the proof. (cid:3) Lemma C.3.
After i flow steps, and k width-reduction steps, provided(1) p p α p τ ≤ pαm p − p , (controls Φ growth in flow-steps)(2) k ≤ − pp − ρ m /p β − p − ,(acceptable number of width-reduction steps)the potential Φ is bounded as follows: Φ (cid:16) w ( i,k ) (cid:17) ≤ (cid:16) αi + m / p (cid:17) p exp pp − kρ m /p β − p − . Proof.
We prove this claim by induction. Initially, i = k = 0 , and Φ (cid:16) w (0 , (cid:17) = m , and thus,the claim holds trivially. Assume that the claim holds for some i, k ≥ . We will use Φ as anabbreviated notation for Φ (cid:16) w ( i,k ) (cid:17) below.Flow Step. For brevity, we use w to denote w ( i,k ) . If the next step is a flow step,Φ (cid:16) w ( i +1 ,k ) (cid:17) = (cid:13)(cid:13)(cid:13)(cid:13) w ( i,k ) + α (cid:12)(cid:12)(cid:12) N e ∆ (cid:12)(cid:12)(cid:12)(cid:13)(cid:13)(cid:13)(cid:13) pp ≤k w k pp + α (cid:12)(cid:12)(cid:12) ( N e ∆) (cid:12)(cid:12)(cid:12) ⊤ (cid:12)(cid:12)(cid:12) ∇k w k pp (cid:12)(cid:12)(cid:12) + 2 p α X e | w e | p − (cid:12)(cid:12)(cid:12) N e ∆ (cid:12)(cid:12)(cid:12) e + α p p p k N e ∆ k pp by Lemma B.1 of [APS19]We next bound (cid:12)(cid:12)(cid:12) ( N e ∆) (cid:12)(cid:12)(cid:12) ⊤ (cid:12)(cid:12) ∇k w k pp (cid:12)(cid:12) as, P e (cid:12)(cid:12)(cid:12) ( N e ∆) e (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ∇ e k w k pp (cid:12)(cid:12) ≤ p k w k p − p . Using Cauchy Schwarz’s inequality, X e (cid:12)(cid:12)(cid:12) N e ∆ (cid:12)(cid:12)(cid:12) e (cid:12)(cid:12)(cid:12) ∇ e k w k pp (cid:12)(cid:12)(cid:12)! = p X e (cid:12)(cid:12)(cid:12) N e ∆ (cid:12)(cid:12)(cid:12) e | w e | p − | w e | ! p X e | w e | p − w e ! X e | w e | p − ( N e ∆) e ! = p k w k pp X e r e ( N e ∆) e ≤ p k w k p − p We thus have, X e (cid:12)(cid:12)(cid:12) N e ∆ (cid:12)(cid:12)(cid:12) e (cid:12)(cid:12)(cid:12) ∇ e k w k pp (cid:12)(cid:12)(cid:12) ≤ p k w k p − p . Using the above bound, we now have,Φ (cid:16) w ( i +1 ,k ) (cid:17) ≤k w k pp + pα k w k p − p + 2 p α k w k p − p + p p α p k N e ∆ k pp ≤k w k pp + pα k w k p − p + 2 p α k w k p − p + pαm p − p , by Assumption 1 of this LemmaRecall k w k pp = Φ( w ) . Since Φ ≥ m , we have, ≤ Φ( w ) + pα Φ( w ) p − p + 2 p α Φ( w ) p − p + pα Φ( w ) p − p ≤ (Φ( w ) /p + 2 α ) p . 
From the inductive assumption, we haveΦ( w ) ≤ (cid:16) αi + m / p (cid:17) p exp O p (1) kρ m /p β − p − . Thus, Φ( i + 1 , k ) ≤ (Φ( w ) /p + 2 α ) p ≤ (cid:16) α ( i + 1) + m / p (cid:17) p exp O p (1) kρ m /p β − p − proving the inductive claim.Width Reduction Step. We have the following: X e ∈ H r e ≤ ρ − X e ∈ H r e ( N ∆) e ≤ ρ − X e r e ( N ∆) e ≤ ρ − k w k p − p ≤ ρ − Φ p − p , and Φ( i, k + 1) ≤ Φ + X e ∈ H (cid:12)(cid:12)(cid:12) w k +1 e (cid:12)(cid:12)(cid:12) p ≤ Φ + 2 pp − X e ∈ H | w e | p ≤ Φ + 2 pp − X e r pp − e ≤ Φ + 2 pp − X e ∈ H r e (cid:18) max e ∈ H r e (cid:19) pp − − Φ + 2 pp − ρ − Φ p − p β p − . Again, since Φ( w ) ≥ m ,Φ( i, k + 1) ≤ Φ (cid:18) pp − ρ − m − p β p − (cid:19) ≤ (cid:16) αi + m / p (cid:17) p exp pp − k + 1 ρ m /p β − p − proving the inductive claim. (cid:3) Lemma C.4.
Let $\widetilde\Delta = \arg\min_{A\Delta = c}\; m^{\frac{p-2}{p}} \Delta^\top M^\top M \Delta + \frac{1}{3^{p-2}} \sum_e r_e (N\Delta)_e^2$. Then, for any $r'$ and $r$ such that $r' \ge r$ entrywise, one has
\[
\Psi(r') \ge \Psi(r) + \sum_e \left( 1 - \frac{r_e}{r'_e} \right) r_e \left( N\widetilde\Delta \right)_e^2.
\]
\emph{Proof.} Absorbing the scalar factors $m^{\frac{p-2}{p}}$ and $\frac{1}{3^{p-2}}$ into $M$ and $R$ respectively, write
\[
\Psi(r) = \min_{Ax = c}\; x^\top M^\top M x + x^\top N^\top R N x.
\]
Constructing the Lagrangian and noting that strong duality holds,
\begin{align*}
\Psi(r) &= \min_x \max_y\; x^\top M^\top M x + x^\top N^\top R N x + 2 y^\top (c - Ax) \\
&= \max_y \min_x\; x^\top M^\top M x + x^\top N^\top R N x + 2 y^\top (c - Ax).
\end{align*}
Optimality conditions with respect to $x$ give us
\[
2 M^\top M x^\star + 2 N^\top R N x^\star = 2 A^\top y.
\]
Substituting this in $\Psi$ gives us
\[
\Psi(r) = \max_y\; 2 y^\top c - y^\top A \left( M^\top M + N^\top R N \right)^{-1} A^\top y.
\]
Optimality conditions with respect to $y$ now give us
\[
2c = 2 A \left( M^\top M + N^\top R N \right)^{-1} A^\top y^\star,
\]
which upon re-substitution gives
\[
\Psi(r) = c^\top \left( A \left( M^\top M + N^\top R N \right)^{-1} A^\top \right)^{-1} c.
\]
We also note that
\[
x^\star = \left( M^\top M + N^\top R N \right)^{-1} A^\top \left( A \left( M^\top M + N^\top R N \right)^{-1} A^\top \right)^{-1} c. \tag{20}
\]
We now want to see what happens when we change $r$. Let $R$ denote the diagonal matrix with entries $r$ and let $R' = R + S$, where $S$ is the diagonal matrix of the (nonnegative) changes in the resistances. We will use the following version of the Sherman--Morrison--Woodbury formula multiple times:
\[
(X + UCV)^{-1} = X^{-1} - X^{-1} U \left( C^{-1} + V X^{-1} U \right)^{-1} V X^{-1}.
\]
We begin by applying the above formula for $X = M^\top M + N^\top R N$, $C = I$, $U = N^\top S^{1/2}$ and $V = S^{1/2} N$. We thus get
\begin{multline}
\left( M^\top M + N^\top R' N \right)^{-1} = \left( M^\top M + N^\top R N \right)^{-1} \\
- \left( M^\top M + N^\top R N \right)^{-1} N^\top S^{1/2} \left( I + S^{1/2} N \left( M^\top M + N^\top R N \right)^{-1} N^\top S^{1/2} \right)^{-1} S^{1/2} N \left( M^\top M + N^\top R N \right)^{-1}. \tag{21}
\end{multline}
We next claim that
\[
I + S^{1/2} N \left( M^\top M + N^\top R N \right)^{-1} N^\top S^{1/2} \preceq I + S^{1/2} R^{-1} S^{1/2},
\]
which gives us
\begin{multline}
\left( M^\top M + N^\top R' N \right)^{-1} \preceq \left( M^\top M + N^\top R N \right)^{-1} \\
- \left( M^\top M + N^\top R N \right)^{-1} N^\top S^{1/2} \left( I + S^{1/2} R^{-1} S^{1/2} \right)^{-1} S^{1/2} N \left( M^\top M + N^\top R N \right)^{-1}. \tag{22}
\end{multline}
This further implies
\begin{multline}
A \left( M^\top M + N^\top R' N \right)^{-1} A^\top \preceq A \left( M^\top M + N^\top R N \right)^{-1} A^\top \\
- A \left( M^\top M + N^\top R N \right)^{-1} N^\top S^{1/2} \left( I + S^{1/2} R^{-1} S^{1/2} \right)^{-1} S^{1/2} N \left( M^\top M + N^\top R N \right)^{-1} A^\top. \tag{23}
\end{multline}
We apply the Sherman--Morrison--Woodbury formula again for
\[
X = A \left( M^\top M + N^\top R N \right)^{-1} A^\top, \qquad C = -\left( I + S^{1/2} R^{-1} S^{1/2} \right)^{-1},
\]
\[
U = A \left( M^\top M + N^\top R N \right)^{-1} N^\top S^{1/2} \qquad \text{and} \qquad V = S^{1/2} N \left( M^\top M + N^\top R N \right)^{-1} A^\top.
\]
Let us look at the term $C^{-1} + V X^{-1} U$:
\[
-\left( C^{-1} + V X^{-1} U \right)^{-1} = \left( I + S^{1/2} R^{-1} S^{1/2} - V X^{-1} U \right)^{-1} \succeq \left( I + S^{1/2} R^{-1} S^{1/2} \right)^{-1}.
\]
Using this, we get
\[
\left( A \left( M^\top M + N^\top R' N \right)^{-1} A^\top \right)^{-1} \succeq X^{-1} + X^{-1} U \left( I + S^{1/2} R^{-1} S^{1/2} \right)^{-1} V X^{-1},
\]
which on multiplying by $c^\top$ and $c$ gives
\[
\Psi(r') \ge \Psi(r) + c^\top X^{-1} U \left( I + S^{1/2} R^{-1} S^{1/2} \right)^{-1} V X^{-1} c.
\]
We note from Equation (20) that $x^\star = \left( M^\top M + N^\top R N \right)^{-1} A^\top X^{-1} c$. We thus have
\[
\Psi(r') \ge \Psi(r) + \left( x^\star \right)^\top N^\top S^{1/2} \left( I + S^{1/2} R^{-1} S^{1/2} \right)^{-1} S^{1/2} N x^\star = \Psi(r) + \sum_e \left( \frac{r'_e - r_e}{r'_e} \right) r_e \left( N x^\star \right)_e^2. \qquad\qed
\]

\textbf{Lemma C.5.}
Initially, we have
\[
\Psi\left( r^{(0,0)} \right) \ge \frac{1}{3^{p-2}} \cdot \frac{ \|M+N\|_{\min}^2\, \|c\|_2^2 }{ \|A\|_2^2 } = L,
\]
where $\|M+N\|_{\min}^2 = \min_{\|x\|_2 = 1} \|Mx\|_2^2 + \|Nx\|_2^2$ and $\|A\|_2$ is the operator norm of $A$. Moreover, at any step $(i,k)$ we have
\[
\Psi\left( r^{(i,k)} \right) \le m^{\frac{p-2}{p}} + \frac{1}{3^{p-2}}\, \Phi(i,k)^{\frac{p-2}{p}}.
\]
\emph{Proof.} For the lower bound in the initial state, since $r^{(0,0)}_e \ge 1$, for the solution $\widetilde\Delta$ we have
\[
\Psi\left( r^{(0,0)} \right) \ge m^{\frac{p-2}{p}}\, \widetilde\Delta^\top M^\top M \widetilde\Delta + \frac{1}{3^{p-2}} \sum_e r^{(0,0)}_e \left( N\widetilde\Delta \right)_e^2 \ge \left\| M\widetilde\Delta \right\|_2^2 + \frac{1}{3^{p-2}} \left\| N\widetilde\Delta \right\|_2^2 \ge \frac{1}{3^{p-2}} \|M+N\|_{\min}^2 \left\| \widetilde\Delta \right\|_2^2.
\]
Note that if $\|M+N\|_{\min} = 0$ then the oracle has returned the optimum in the first iteration. On the other hand, because
\[
\|c\|_2 = \left\| A\widetilde\Delta \right\|_2 \le \|A\|_2 \left\| \widetilde\Delta \right\|_2,
\]
where $\|A\|_2$ is the operator norm of $A$, we get $\left\| \widetilde\Delta \right\|_2 \ge \frac{\|c\|_2}{\|A\|_2}$, upon which squaring gives the lower bound on $\Psi\left( r^{(0,0)} \right)$.

For the upper bound, Lemma C.2 implies that
\begin{align*}
\Psi\left( r^{(i,k)} \right) &= m^{\frac{p-2}{p}}\, \widetilde\Delta^\top M^\top M \widetilde\Delta + \frac{1}{3^{p-2}} \sum_e r_e \left( N\widetilde\Delta \right)_e^2 \le m^{\frac{p-2}{p}}\, \Delta^{\star\top} M^\top M \Delta^\star + \frac{1}{3^{p-2}} \sum_e r_e \left( N\Delta^\star \right)_e^2 \\
&\le m^{\frac{p-2}{p}} + \frac{1}{3^{p-2}} \|w\|_p^{p-2} \le m^{\frac{p-2}{p}} + \frac{1}{3^{p-2}}\, \Phi(i,k)^{\frac{p-2}{p}}. \qquad\qed
\end{align*}

\textbf{Lemma C.6.}
Suppose at step $(i,k)$ we have $\left\| N\widetilde\Delta \right\|_p^p > \tau$, so that we perform a width reduction step (line 17). If
\begin{enumerate}
\item $\tau^{2/p} \ge \Omega(1)\, \dfrac{\Psi(r)}{\beta}$, and
\item $\tau \ge \Omega(1)\, \Psi(r)\, \rho^{p-2}$,
\end{enumerate}
then
\[
\Psi\left( r^{(i,k+1)} \right) \ge \Psi\left( r^{(i,k)} \right) + \Omega(1)\, \tau^{2/p}.
\]
Furthermore, if at $(i,k)$ we have $\left\| N\widetilde\Delta \right\|_p^p \le \tau$, so that we perform a flow step, then $\Psi\left( r^{(i,k+1)} \right) \ge \Psi\left( r^{(i,k)} \right)$.

\emph{Proof.} It will be helpful for our analysis to split the index set into three disjoint parts:
\begin{itemize}
\item $S = \left\{ e : |N\Delta_e| \le \rho \right\}$,
\item $H = \left\{ e : |N\Delta_e| > \rho \text{ and } r_e \le \beta \right\}$,
\item $B = \left\{ e : |N\Delta_e| > \rho \text{ and } r_e > \beta \right\}$.
\end{itemize}
Firstly, we note
\[
\sum_{e \in S} |N\Delta|_e^p \le \rho^{p-2} \sum_{e \in S} |N\Delta|_e^2 \le \rho^{p-2} \sum_{e \in S} r_e |N\Delta|_e^2 \le \rho^{p-2}\, \Psi(r).
\]
Hence, using Assumption (2),
\[
\sum_{e \in H \cup B} |N\Delta|_e^p \ge \sum_e |N\Delta|_e^p - \sum_{e \in S} |N\Delta|_e^p \ge \tau - \rho^{p-2}\, \Psi(r) \ge \Omega(1)\, \tau.
\]
This means
\[
\sum_{e \in H \cup B} (N\Delta)_e^2 \ge \left( \sum_{e \in H \cup B} |N\Delta|_e^p \right)^{2/p} \ge \Omega(1)\, \tau^{2/p}.
\]
Secondly, we note that
\[
\sum_{e \in B} (N\Delta)_e^2 \le \beta^{-1} \sum_{e \in B} r_e (N\Delta)_e^2 \le \beta^{-1}\, \Psi(r).
\]
So then, using Assumption (1),
\[
\sum_{e \in H} (N\Delta)_e^2 = \sum_{e \in H \cup B} (N\Delta)_e^2 - \sum_{e \in B} (N\Delta)_e^2 \ge \Omega(1)\, \tau^{2/p} - \beta^{-1}\, \Psi(r) \ge \Omega(1)\, \tau^{2/p}.
\]
As $r_e \ge 1$, this implies $\sum_{e \in H} r_e (N\Delta)_e^2 \ge \Omega(1)\, \tau^{2/p}$. We note that in a width reduction step, the resistances of the edges in $H$ change by a factor of $2$. Thus, combining our last two observations and applying Lemma C.4, we get
\[
\Psi\left( r^{(i,k+1)} \right) \ge \Psi\left( r^{(i,k)} \right) + \Omega(1)\, \tau^{2/p}.
\]
Finally, for the ``flow step'' case, we use the trivial bound from Lemma C.4, ignoring the second term: $\Psi\left( r^{(i,k+1)} \right) \ge \Psi\left( r^{(i,k)} \right)$. $\qed$

\textbf{Appendix D.} $\ell_p$-Regression

\textbf{Definition D.1} ($\kappa$-approximate solution)\textbf{.} Let $\kappa \ge$
$1$. A $\kappa$-approximate solution for the residual problem is $\widetilde\Delta$ such that $A\widetilde\Delta = 0$ and $res(\widetilde\Delta) \ge \frac{1}{\kappa}\, res(\Delta^\star)$, where $\Delta^\star = \operatorname{argmax}_{A\Delta = 0} res(\Delta)$.

\textbf{Lemma D.2} (Iterative Refinement [APS19])\textbf{.} Let $p \ge 2$ and $\kappa \ge 1$. Starting from an initial feasible solution $x^{(0)}$, and iterating as $x^{(t+1)} = x^{(t)} - \frac{\Delta}{p}$, where $\Delta$ is a $\kappa$-approximate solution to the residual problem (Definition 3.2), we get an $\varepsilon$-approximate solution to (3) in at most
\[
O\!\left( p\kappa \log\left( \frac{f\left(x^{(0)}\right) - \mathrm{Opt}}{\varepsilon\, \mathrm{Opt}} \right) \right)
\]
calls to a $\kappa$-approximate solver for the residual problem.

\emph{Proof.} Let $f(x) = b^\top x + \|Mx\|_2^2 + \|Nx\|_p^p$ and $res(\Delta) = g^\top \Delta - \Delta^\top R \Delta - \|N\Delta\|_p^p$. Observe that
\[
b^\top (x + \Delta) = b^\top x + b^\top \Delta, \qquad \text{and} \qquad \|M(x+\Delta)\|_2^2 = \|Mx\|_2^2 + 2\Delta^\top M^\top M x + \|M\Delta\|_2^2.
\]
Using Lemma B.1 from [APS19] we have
\[
\|N(x+\Delta)\|_p^p \le \|Nx\|_p^p + p\, (N\Delta)^\top |Nx|^{p-2} Nx + 2p^2\, \Delta^\top N^\top \mathrm{Diag}\left( |Nx|^{p-2} \right) N \Delta + p^p \|N\Delta\|_p^p,
\]
and
\[
\|N(x+\Delta)\|_p^p \ge \|Nx\|_p^p + p\, (N\Delta)^\top |Nx|^{p-2} Nx + \frac{p^2}{8}\, \Delta^\top N^\top \mathrm{Diag}\left( |Nx|^{p-2} \right) N \Delta + \frac{1}{2^{p+1}} \|N\Delta\|_p^p.
\]
Using these relations, we have
\[
f(x + \Delta) \le f(x) + p\, g^\top \Delta + p^2\, \Delta^\top R \Delta + p^p \|N\Delta\|_p^p, \qquad \text{or} \qquad f\!\left( x - \frac{\Delta}{p} \right) \le f(x) - res(\Delta).
\]
The lower bound looks like
\[
f(x + \Delta) \ge f(x) + p\, g^\top \Delta + \frac{p^2}{16}\, \Delta^\top R \Delta + \frac{1}{2^{p+1}} \|N\Delta\|_p^p.
\]
For $\lambda = 16p$,
\[
f(x) - f\!\left( x - \frac{\lambda\Delta}{p} \right) \le \lambda\, g^\top \Delta - \frac{\lambda^2}{16}\, \Delta^\top R \Delta - \frac{\lambda^p}{2^{p+1} p^p} \|N\Delta\|_p^p \le \lambda \left( g^\top \Delta - \Delta^\top R \Delta - \|N\Delta\|_p^p \right) = \lambda\, res(\Delta).
\]
These relations are the same as Lemma B.2 of [APS19]. We can follow the proof further from [APS19] to get our result. $\qed$
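The refinement step of Lemma D.2 can be sanity-checked numerically. The sketch below is our own illustration (not code from [APS19]): on a small unconstrained instance $f(x) = b^\top x + \|Mx\|_2^2 + \|Nx\|_p^p$ with $p = 4$, it builds $g$ and $R$ from the smoothness bounds above, takes a small step $\Delta$ along $g$ (which has positive residual away from optimality), and checks the key inequality $f(x - \Delta/p) \le f(x) - res(\Delta)$. The matrix sizes and the step size $10^{-5}$ are arbitrary choices, and numpy is assumed to be available.

```python
import numpy as np

# Toy unconstrained instance f(x) = b^T x + ||Mx||_2^2 + ||Nx||_p^p with p = 4.
rng = np.random.default_rng(0)
p, n = 4, 5
M = rng.standard_normal((6, n))
N = rng.standard_normal((7, n))
b = rng.standard_normal(n)

def f(x):
    return b @ x + np.sum((M @ x) ** 2) + np.sum(np.abs(N @ x) ** p)

def g_and_R(x):
    # From the smoothness bounds: p*g = b + 2 M^T M x + p N^T (|Nx|^{p-2} . Nx),
    # p^2*R = M^T M + 2 p^2 N^T Diag(|Nx|^{p-2}) N.
    Nx = N @ x
    d = np.abs(Nx) ** (p - 2)
    g = (b + 2 * M.T @ (M @ x)) / p + N.T @ (d * Nx)
    R = M.T @ M / p**2 + 2 * N.T @ (d[:, None] * N)
    return g, R

def res(x, D):
    g, R = g_and_R(x)
    return g @ D - D @ R @ D - np.sum(np.abs(N @ D) ** p)

x = rng.standard_normal(n)
g, _ = g_and_R(x)
D = 1e-5 * g                 # small step along g: res(D) > 0 away from optimality
r = res(x, D)
x_new = x - D / p            # the update of Lemma D.2
assert r > 0
assert f(x_new) <= f(x) - r + 1e-9   # f(x - D/p) <= f(x) - res(D)
```

Note that the displayed inequality holds for any feasible $\Delta$; a $\kappa$-approximate maximizer of $res$ only makes the guaranteed decrease as large as possible, which is what drives the $O(p\kappa \log(\cdot))$ iteration bound.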
Solving the Residual Problem.
Lemma D.3.
Let $\Delta^\star$ denote the optimum of the residual problem at $x^{(t)}$ and $\mathrm{Opt}$ denote the optimum of Problem (3). We have that $res(\Delta^\star) \in (\nu/2, \nu]$ for some
\[
\nu \in \left[ \frac{\varepsilon\, \mathrm{Opt}}{16p},\; f\!\left( x^{(0)} \right) - \mathrm{Opt} \right].
\]
\emph{Proof.} From the above proof, we note that for any $x$, if $\Delta^\star$ is the optimum of the residual problem at $x$, then
\[
res(\Delta^\star) \le f(x) - f\!\left( x - \frac{\Delta^\star}{p} \right) \le f\!\left( x^{(0)} \right) - \mathrm{Opt}.
\]
Let $\Delta$ be the step we take to reach the optimum from $x$, i.e., $x - \frac{\lambda\Delta}{p}$ is the minimizer of $f$. Then
\[
res(\Delta^\star) \ge res(\Delta) \ge \frac{f(x) - \mathrm{Opt}}{\lambda} \ge \frac{\varepsilon\, \mathrm{Opt}}{\lambda},
\]
where the last inequality follows since otherwise $x$ is already an $\varepsilon$-approximate solution. Recalling $\lambda = 16p$, the lemma thus follows. $\qed$

\textbf{Lemma D.4.}
Let $p \ge p'$ and $\nu$ be such that $res_p(\Delta^\star) \in (\nu/2, \nu]$, where $\Delta^\star$ is the optimum of the residual problem for the $p$-norm (Definition 3.2). The following problem has optimum between $\left( \frac{\nu}{32},\; O(1)\, m^{\frac{p-p'}{p(p'-1)}}\, \nu \right] \stackrel{\mathrm{def}}{=} (a\nu, b\nu]$:
\[
\max_{A\Delta = 0}\; g^\top \Delta - \Delta^\top R \Delta - \left( \frac{\nu}{m} \right)^{1 - \frac{p'}{p}} \|N\Delta\|_{p'}^{p'}. \tag{24}
\]
For $\beta \ge 1$, if $\widetilde\Delta$ is a $\beta$-approximate solution to the above problem, then $\alpha\widetilde\Delta$ gives a $\frac{b\beta}{a}\, m^{\frac{p}{p-1}\left( \frac{1}{p'} - \frac{1}{p} \right)} \beta^{\frac{p-p'}{p'(p-1)}}$-approximate solution to the residual problem, where $\alpha = \frac{a}{b\beta}\, m^{-\frac{p}{p-1}\left( \frac{1}{p'} - \frac{1}{p} \right)} \beta^{-\frac{p-p'}{p'(p-1)}}$.

\emph{Proof.} We will first show that the optimum of Problem (24) is at most $O(1)\, m^{\frac{p-p'}{p(p'-1)}}\, \nu$. Suppose the optimum $\Delta^\star$ of (24) gives an objective value of $\beta\nu$. Using a scaling argument as in the above proof, we can conclude that
\[
g^\top \Delta^\star - \Delta^{\star\top} R \Delta^\star - \left( \frac{\nu}{m} \right)^{1-\frac{p'}{p}} \left\| N\Delta^\star \right\|_{p'}^{p'} = \Delta^{\star\top} R \Delta^\star + (p'-1) \left( \frac{\nu}{m} \right)^{1-\frac{p'}{p}} \left\| N\Delta^\star \right\|_{p'}^{p'} = \beta\nu.
\]
This implies that $g^\top \Delta^\star \ge \beta\nu$, $\Delta^{\star\top} R \Delta^\star \le \beta\nu$ and $\left\| N\Delta^\star \right\|_{p'}^{p'} \le 2\beta\, \nu^{p'/p}\, m^{1 - p'/p}$. Since $\|v\|_p \le \|v\|_{p'}$ for $p \ge p'$, we thus have
\[
\left\| N\Delta^\star \right\|_p^p \le \left( \left\| N\Delta^\star \right\|_{p'}^{p'} \right)^{p/p'} \le (2\beta)^{p/p'}\, m^{p/p' - 1}\, \nu.
\]
Let $\alpha = m^{-\frac{p}{p-1}\left( \frac{1}{p'} - \frac{1}{p} \right)}$ be some scaling factor, so that $\alpha^{p-1} = m^{-\left( \frac{p}{p'} - 1 \right)}$. Now, $\alpha^2\, \Delta^{\star\top} R \Delta^\star \le \alpha^2 \beta\nu$, and
\[
\alpha^p \left\| N\Delta^\star \right\|_p^p = \alpha \cdot \alpha^{p-1} \left\| N\Delta^\star \right\|_p^p \le \alpha\, m^{-\left( \frac{p}{p'} - 1 \right)} (2\beta)^{p/p'}\, m^{p/p' - 1}\, \nu = \alpha\, (2\beta)^{p/p'}\, \nu. \tag{25}
\]
Consider
\[
res_p\!\left( \beta^{-\frac{p-p'}{p'(p-1)}}\, \alpha\, \Delta^\star \right) \ge \beta^{-\frac{p-p'}{p'(p-1)}}\, \alpha \left( g^\top \Delta^\star - \beta\nu - \beta\nu \right) \ge \Omega(1)\, \beta^{\frac{p(p'-1)}{p'(p-1)}}\, \alpha\, \nu,
\]
where we used $\left( \beta^{-\frac{p-p'}{p'(p-1)}} \right)^{p-1} (2\beta)^{p/p'} = O(1)\, \beta$. Since $res_p(\cdot)$ has optimum at most $\nu$, we must have $\beta^{\frac{p(p'-1)}{p'(p-1)}}\, \alpha \le O(1)$, which gives $\beta \le O(1)\, m^{\frac{p-p'}{p(p'-1)}}$. Thus Problem (24) has an optimum at most $O(1)\, m^{\frac{p-p'}{p(p'-1)}}\, \nu = b\nu$.

To obtain a lower bound, consider the $\widetilde\Delta$ obtained in Lemma 3.1 of [AS20]. We will evaluate the objective at $\widetilde\Delta$:
\[
g^\top \widetilde\Delta - \widetilde\Delta^\top R \widetilde\Delta - \left( \frac{\nu}{m} \right)^{1-\frac{p'}{p}} \left\| N\widetilde\Delta \right\|_{p'}^{p'} \ge \frac{\nu}{16} - \frac{\nu}{32} = \frac{\nu}{32} = a\nu.
\]
Therefore, the optimum of Problem (24) must be at least $a\nu$.

We next look at how a $\beta$-approximate solution of (24) translates to an approximate solution of the residual problem for $p$. Let $\widetilde\Delta$ be a $\beta$-approximate solution to (24) and let $\Delta^\star$ denote its optimum. Denote the objective of (24) at $\Delta$ by $res_{p'}(\Delta)$. We know that $res_{p'}(\Delta^\star) \ge res_{p'}(\widetilde\Delta) \ge \frac{1}{\beta}\, res_{p'}(\Delta^\star)$. Note that $g^\top \widetilde\Delta$ has to be between $\frac{a\nu}{\beta}$ and $b\nu$. This ensures that
\[
\widetilde\Delta^\top R \widetilde\Delta + \frac{1}{2} \left( \frac{\nu}{m} \right)^{1-\frac{p'}{p}} \left\| N\widetilde\Delta \right\|_{p'}^{p'} \le g^\top \widetilde\Delta \le b\nu.
\]
Let $\alpha = \frac{a}{b\beta}\, m^{-\frac{p}{p-1}\left( \frac{1}{p'} - \frac{1}{p} \right)}\, \beta^{-\frac{p-p'}{p'(p-1)}}$. Following the calculations above, we have
\[
\alpha^2\, \widetilde\Delta^\top R \widetilde\Delta \le \alpha\, \frac{a\nu}{\beta}, \qquad \text{and, by the same calculation as (25),} \qquad \alpha^p \left\| N\widetilde\Delta \right\|_p^p \le \alpha\, \frac{a\nu}{\beta}.
\]
Therefore,
\[
res_p\!\left( \alpha\widetilde\Delta \right) \ge \frac{a\alpha}{\beta}\, \nu. \qquad\qed
\]

\textbf{Lemma D.5.} At $x^{(t)}$, let $\nu$ be such that $res(\Delta^\star) \in (a\nu, b\nu]$ for some values $a$ and $b$. The following problem has optimum at most $b\nu$:
\[
\min_\Delta\; \Delta^\top R \Delta + \|N\Delta\|_p^p \qquad \text{s.t.} \qquad g^\top \Delta = a\nu, \quad A\Delta = 0. \tag{26}
\]
Further, if $\widetilde\Delta$ is a feasible solution to Problem (26) such that $\widetilde\Delta^\top R \widetilde\Delta \le \alpha b\nu$ and $\left\| N\widetilde\Delta \right\|_p^p \le \beta b\nu$, then we can pick a scalar $\mu = \frac{a}{4 b\, \alpha \beta^{1/(p-1)}}$ such that $\mu\widetilde\Delta$ is an $\frac{8 b^2\, \alpha \beta^{1/(p-1)}}{a^2}$-approximate solution to the residual problem. (Adapted from the proof of Lemma B.4 of [APS19].)
Proof.
The assumption on the residual is
\[
res(\Delta^\star) = g^\top \Delta^\star - \Delta^{\star\top} R \Delta^\star - \left\| N\Delta^\star \right\|_p^p \in (a\nu, b\nu].
\]
Since the last two terms are non-positive, we must have $g^\top \Delta^\star > a\nu$. Since $\Delta^\star$ is the optimum and satisfies $A\Delta^\star = 0$,
\[
\frac{d}{d\lambda} \left( g^\top (\lambda\Delta^\star) - \lambda^2\, \Delta^{\star\top} R \Delta^\star - \lambda^p \left\| N\Delta^\star \right\|_p^p \right) \Big|_{\lambda = 1} = 0.
\]
Thus,
\[
g^\top \Delta^\star - \Delta^{\star\top} R \Delta^\star - \left\| N\Delta^\star \right\|_p^p = \Delta^{\star\top} R \Delta^\star + (p-1) \left\| N\Delta^\star \right\|_p^p.
\]
Since $p \ge 2$, we get the following:
\[
\Delta^{\star\top} R \Delta^\star + \left\| N\Delta^\star \right\|_p^p \le g^\top \Delta^\star - \Delta^{\star\top} R \Delta^\star - \left\| N\Delta^\star \right\|_p^p \le b\nu.
\]
For notational convenience, let the function $h_p(R, \Delta) = \Delta^\top R \Delta + \|N\Delta\|_p^p$. Now, we know that $g^\top \Delta^\star \ge a\nu$ and $g^\top \Delta^\star - h_p(R, \Delta^\star) \le b\nu$. This gives
\[
a\nu \le g^\top \Delta^\star \le h_p(R, \Delta^\star) + b\nu \le 2b\nu.
\]
Let $\Delta = \delta\Delta^\star$, where $\delta = \frac{a\nu}{g^\top \Delta^\star}$. Note that $\delta \in \left[ \frac{a}{2b}, 1 \right]$, $g^\top \Delta = a\nu$, and
\[
h_p(R, \Delta) \le \max\{\delta^2, \delta^p\}\, h_p(R, \Delta^\star) \le b\nu.
\]
Note that this $\Delta$ satisfies the constraints of program (26) and has objective at most $b\nu$, so the optimum of the program must have objective at most $b\nu$. Now let $\widetilde\Delta$ be an $(\alpha, \beta)$-approximate solution to (26), i.e.,
\[
\widetilde\Delta^\top R \widetilde\Delta \le \alpha b\nu \qquad \text{and} \qquad \left\| N\widetilde\Delta \right\|_p^p \le \beta b\nu.
\]
Let $\mu = \frac{a}{4\, \alpha \beta^{1/(p-1)}\, b}$. We have
\begin{align*}
g^\top \left( \mu\widetilde\Delta \right) - h_p\left( R, \mu\widetilde\Delta \right) &\ge \mu \left( a\nu - \mu\, \widetilde\Delta^\top R \widetilde\Delta - \mu^{p-1} \left\| N\widetilde\Delta \right\|_p^p \right) \ge \mu \left( a\nu - \frac{a\nu}{4 \beta^{1/(p-1)}} - \frac{a\nu}{4\alpha} \right) \\
&\ge \mu \left( a\nu - \frac{a\nu}{4} - \frac{a\nu}{4} \right) = \frac{a\mu}{2b}\, b\nu \ge \frac{a^2}{8 b^2\, \alpha \beta^{1/(p-1)}}\, res(\Delta^\star). \qquad\qed
\end{align*}

\textbf{Theorem 3.3.} For an instance of Problem (3), suppose we are given a starting solution $x^{(0)}$ that satisfies $Ax^{(0)} = c$ and is a $\kappa$-approximate solution to the optimum. Consider an iteration of the while loop, line 8 of Algorithm 1, for the $\ell_p$-norm residual problem at $x^{(t)}$. We can define $\mu_1$ and $\kappa_1$ such that if $\bar\Delta$ is a $\beta$-approximate solution to a corresponding $p'$-norm residual problem, then $\mu_1\bar\Delta$ is a $\kappa_1\beta$-approximate solution to the $p$-residual problem.
Further, suppose we have the following procedures:
\begin{enumerate}
\item \textsc{Sparsify}: runs in time $K$; takes as input any matrices $R$, $N$ and vector $g$, and returns $\widetilde R$, $\widetilde N$, $\widetilde g$, the matrices having sizes at most $\widetilde n \times n$, such that if $\widetilde\Delta$ is a $\beta$-approximate solution to
\[
\max_{A\Delta = 0}\; \widetilde g^\top \Delta - \left\| \widetilde R \Delta \right\|_2^2 - \left\| \widetilde N \Delta \right\|_{p'}^{p'}
\]
for any $p' \ge 2$, then $\mu_2\widetilde\Delta$, for a computable $\mu_2$, is a $\kappa_2\beta$-approximate solution for
\[
\max_{A\Delta = 0}\; res(\Delta) \stackrel{\mathrm{def}}{=} g^\top \Delta - \left\| R^{1/2} \Delta \right\|_2^2 - \left\| N \Delta \right\|_{p'}^{p'}.
\]
\item \textsc{Solver}: approximately solves (4), returning $\bar\Delta$ such that $\left\| \widetilde R \bar\Delta \right\|_2^2 \le \kappa_3 \nu$ and $\left\| \widetilde N \bar\Delta \right\|_{p'}^{p'} \le \kappa_4 \nu$, in time $\widetilde K(\widetilde n)$ for instances of size at most $\widetilde n$.
\end{enumerate}
Then Algorithm 1 finds an $\varepsilon$-approximate solution for Problem (3) in time
\[
\widetilde O\!\left( p\, \kappa_1 \kappa_2 \kappa_3 \kappa_4^{1/(p-1)} \left( K + \widetilde K(\widetilde n) \right) \log\left( \frac{\kappa p}{\varepsilon} \right) \right).
\]
\emph{Proof.}
From Lemma D.2 we know that, given an instance of Problem (3), at every $x^{(t)}$ we can define a residual problem, and if $\bar\Delta$ is a $\beta$-approximate solution of the residual problem, then by updating $x^{(t)}$ to $x^{(t)} - \frac{\bar\Delta}{p}$ we can find the required $\varepsilon$-approximate solution in
\[
O\!\left( p\beta \log \frac{f\left(x^{(0)}\right) - \mathrm{Opt}}{\varepsilon\, \mathrm{Opt}} \right) \le O\!\left( p\beta \log\left( \frac{\kappa}{\varepsilon} \right) \right)
\]
iterations. The last inequality follows since $\frac{f(x^{(0)}) - \mathrm{Opt}}{\varepsilon\, \mathrm{Opt}} \le \frac{\kappa\, \mathrm{Opt}}{\varepsilon\, \mathrm{Opt}} = \frac{\kappa}{\varepsilon}$. It is thus sufficient to solve a residual problem at $x^{(t)}$ of the form
\[
\max_{A\Delta = 0}\; g^\top \Delta - \Delta^\top R \Delta - \|N\Delta\|_p^p.
\]
Here $g$ and $R$ depend on $x^{(t)}$, $M$, $N$. Now, suppose we have $res(\Delta^\star) \in (\nu/2, \nu]$. We will consider the following cases:
\begin{enumerate}
\item $p \le \log m$: We apply \textsc{Sparsify} to $g$, $R$, $N$ to get $\widetilde g$, $\widetilde R$, $\widetilde N$. Now if $\Delta$ is a $\beta$-approximate solution to
\[
\max_{A\Delta = 0}\; \widetilde g^\top \Delta - \Delta^\top \widetilde R \Delta - \left\| \widetilde N \Delta \right\|_p^p, \tag{27}
\]
then $\mu_2\Delta$ is a $\kappa_2\beta$-approximate solution to the residual problem. We will now solve the above problem; note that its size is at most $\widetilde n$. Let $\widetilde\Delta$ be a $(\kappa_3, \kappa_4)$-approximate solution to Problem (4). From Lemma D.5, $\frac{a}{4b\,\kappa_3\kappa_4^{1/(p-1)}}\widetilde\Delta$ is an $\frac{8b^2\kappa_3\kappa_4^{1/(p-1)}}{a^2}$-approximation to (27). Going back, $\frac{a}{4b\,\kappa_3\kappa_4^{1/(p-1)}}\mu_2\widetilde\Delta$ is an $\frac{8b^2\kappa_2\kappa_3\kappa_4^{1/(p-1)}}{a^2}$-approximate solution to the residual problem. In this case there is no norm conversion, so $\mu_1 = \kappa_1 = 1$; thus, if $res(\Delta^\star) \in (\nu/2, \nu]$, then $\frac{a}{4b\,\kappa_3\kappa_4^{1/(p-1)}}\mu_2\widetilde\Delta$ is an $\frac{8b^2\kappa_2\kappa_3\kappa_4^{1/(p-1)}}{a^2}$-approximate solution to the residual problem.
\item $p > \log m$: In this case, there is an additional step (Lemma D.4) that converts the residual problem to an instance of the previous case with $p' = \log m$. We apply \textsc{Sparsify} to $g$, $R$, $N$ and set $p' = \log m$. If $\bar\Delta$ is the solution returned by the previous case for $p' = \log m$, then $\mu_1\bar\Delta$ loses a further factor $\kappa_1$ in approximation for the residual problem. Thus $\frac{a}{4b\,\kappa_3\kappa_4^{1/(p-1)}}\mu_1\mu_2\widetilde\Delta$ is an $\frac{8b^2\kappa_1\kappa_2\kappa_3\kappa_4^{1/(p-1)}}{a^2}$-approximate solution to the residual problem.
\end{enumerate}
From the above discussion we conclude that if $res(\Delta^\star) \in (\nu/2, \nu]$, then we get a solution $\bar\Delta$ such that
\[
res(\bar\Delta) \ge \frac{a^2}{8 b^2\, \kappa_1\kappa_2\kappa_3\kappa_4^{1/(p-1)}}\, res(\Delta^\star).
\]
From Lemma D.3, we know that $res(\Delta^\star) \in (\nu/2, \nu]$ for some
\[
\nu \in \left[ \frac{\varepsilon\, \mathrm{Opt}}{16p},\; f\!\left( x^{(0)} \right) - \mathrm{Opt} \right].
\]
Since $\mathrm{Opt} \ge \frac{f(x^{(0)})}{\kappa}$ and $\mathrm{Opt} \ge 0$, it is sufficient to let $\nu$ range over
\[
\left[ \frac{\varepsilon\, f\left(x^{(0)}\right)}{16 \kappa p},\; f\!\left( x^{(0)} \right) \right].
\]
We finally look at the running time. We start with a residual problem. We require time $K$ to apply the procedure \textsc{Sparsify}, and then time $\widetilde K(\widetilde n)$ to solve (4). For the correct value of $\nu$, this gives us an $\frac{8b^2\kappa_1\kappa_2\kappa_3\kappa_4^{1/(p-1)}}{a^2}$-approximate solution to the residual problem. For every residual problem we repeat the above process at most $\log \frac{16\kappa p}{\varepsilon}$ times (corresponding to the number of values of $\nu$). We use the fact that $a^{-1}, b \le m^{o(1)}$. Thus the total running time is
\[
O\!\left( p\, \kappa_1\kappa_2\kappa_3\kappa_4^{1/(p-1)}\, \frac{b^2}{a^2} \left( K + \widetilde K(\widetilde n) \right) \log\left( \frac{\kappa}{\varepsilon} \right) \log\left( \frac{\kappa p}{\varepsilon} \right) \right) \le \widetilde O\!\left( p\, \kappa_1\kappa_2\kappa_3\kappa_4^{1/(p-1)} \left( K + \widetilde K(\widetilde n) \right) \log\left( \frac{\kappa p}{\varepsilon} \right) \right). \qquad\qed
\]
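The only property the sweep over $\nu$ above needs is that a geometric grid of $O(\log(\kappa p/\varepsilon))$ values brackets the unknown $res(\Delta^\star)$ as $(\nu/2, \nu]$. The following self-contained sketch is our own illustration; the endpoint values below are made-up stand-ins for $\varepsilon f(x^{(0)})/(16\kappa p)$ and $f(x^{(0)})$, not quantities computed by the algorithm.

```python
import math

# Geometric grid over nu, halving from hi down past lo, as in Theorem 3.3.
# lo and hi are hypothetical stand-ins for eps*f(x0)/(16*kappa*p) and f(x0).
lo, hi = 0.013, 5280.0
steps = math.ceil(math.log2(hi / lo)) + 1     # O(log(kappa*p/eps)) values
grid = [hi / 2 ** j for j in range(steps)]

def bracketed(t):
    # some grid value nu satisfies t in (nu/2, nu]
    return any(nu / 2 < t <= nu for nu in grid)

assert all(bracketed(t) for t in [lo, 0.1, 1.0, 37.5, hi])
assert grid[0] == hi and grid[-1] <= lo
```

Each grid value costs one Sparsify plus Solver call per refinement step, which is where the log(kappa*p/eps) factor in the total running time comes from.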