Almost-linear-time Weighted \ell_p-norm Solvers in Slightly Dense Graphs via Sparsification
DEEKSHA ADIL, BRIAN BULLINS, RASMUS KYNG, AND SUSHANT SACHDEVA

Abstract.
We give almost-linear-time algorithms for constructing sparsifiers with n poly(log n) edges that approximately preserve weighted (ℓ_2 + ℓ_p^p) flow or voltage objectives on graphs. For flow objectives, this is the first sparsifier construction for such mixed objectives beyond unit ℓ_p weights, and is based on expander decompositions. For voltage objectives, we give the first sparsifier construction for these objectives, which we build using graph spanners and leverage score sampling. Together with the iterative refinement framework of [Adil et al., SODA 2019], and a new multiplicative-weights based constant-approximation algorithm for mixed-objective flows or voltages, we show how to find (1 + 2^{−poly(log n)})-approximations for weighted ℓ_p-norm minimizing flows or voltages in p(m^{1+o(1)} + n^{4/3+o(1)}) time for p = ω(1), which is almost-linear for graphs that are slightly dense (m ≥ n^{4/3}).

University of Toronto, TTI Chicago, ETH Zurich.
E-mail addresses: {deeksha|sachdeva}@cs.toronto.edu, [email protected], [email protected].

1. Introduction
Network flow problems are some of the most extensively studied problems in optimization (e.g. see [AMO93; Sch02; GT14]). A general network flow problem on a graph G(V, E) with n vertices and m edges can be formulated as

min_{B^⊤ f = d} cost(f),

where f ∈ R^E is a flow vector on edges satisfying net vertex demands d ∈ R^V, B ∈ R^{E×V} is the signed edge-vertex incidence matrix of the graph, and cost(f) is a cost measure on flows. The weighted ℓ_∞-minimizing flow problem, i.e., cost(f) = ‖S^{−1} f‖_∞, captures the celebrated maximum-flow problem with capacities S; the weighted ℓ_1-minimizing flow problem, cost(f) = ‖S f‖_1, captures the transshipment problem generalizing shortest paths with lengths S; and cost(f) = f^⊤ R f = ‖R^{1/2} f‖_2^2 captures the electrical flow problem [ST04]. Dual to flow problems are voltage problems, which can be formulated as

min_{d^⊤ v = 1} cost′(B v).

Analogous to the flow problems, picking cost′(B v) = ‖S B v‖_1 captures the capacitated min-cut problem, cost′(B v) = ‖S^{−1} B v‖_∞ captures vertex-labeling [Kyn+15], and cost′(B v) = (
B v)^⊤ R^{−1} B v = ‖R^{−1/2} B v‖_2^2 captures the electrical voltages problem.

The seminal work of Spielman and Teng [ST04] gave the first nearly-linear-time algorithm for computing (1 + 1/poly(n))-approximate solutions to electrical (weighted ℓ_2-minimizing) flow/voltage problems. This work spurred the "Laplacian Paradigm" for designing faster algorithms for several classic graph optimization problems including maximum flow [Chr+11; She13; Kel+14], multi-commodity flow [Kel+14], bipartite matching [Mad13], transshipment [She17], and graph partitioning [OSV12]; culminating in almost-linear-time or nearly-linear-time low-accuracy algorithms (i.e. 1 + ε approximations with poly(1/ε) running time dependence) for many of these problems. Progress on high-accuracy algorithms (i.e. algorithms that return (1 + 1/poly(n))-approximate solutions with only a poly(log n) factor overhead in time) for solving these problems has been harder to come by, and for many flow problems has been based on interior point methods [DS08]. E.g. the best running time for maximum flow stands at Õ(min(m√n, n^ω + n^{2+1/6})) [LS14; CLS19] and Õ(m^{4/3+o(1)}) for unit-capacity graphs [Mad13; LS20b; LS20a]. Other results making progress in this direction include works on shortest paths with small-range negative weights [Coh+17b], and matrix scaling [Coh+17a; All+17]. Recently, there has been progress on the dense case. In [van+20], the authors developed an algorithm for weighted bipartite matching and transshipment running in Õ(m + n^{1.5}) time. This is a nearly-linear-time algorithm in moderately dense graphs.

Bubeck et al. [Bub+18] restarted the study of faster high-accuracy algorithms for the weighted ℓ_p-norm objective, cost(f) = ‖S f‖_p, a natural intermediate objective between ℓ_1 and ℓ_∞. This result improved the running time significantly over classical interior point methods [NN94] for p close to 2. Adil et al.
[Adi+19] gave a high-accuracy algorithm for computing ℓ_p-norm minimizing flows in time min{m^{4/3+o(1)}, n^ω} for p ∈ (2, √(log n)]. Building on their work, Kyng et al. [Kyn+19] gave an almost-linear-time high-accuracy algorithm for unit-weight ℓ_p-norm minimizing flows, cost(f) = ‖f‖_p^p, for large p ∈ (ω(1), √(log n)]. More generally, they give an almost-linear-time high-accuracy algorithm for mixed ℓ_2 + ℓ_p^p objectives as long as the ℓ_p^p term is unit-weight, i.e., cost(f) = ‖R^{1/2} f‖_2^2 + ‖f‖_p^p. Their algorithm for (ℓ_2 + ℓ_p^p)-minimizing flows was subsequently used as a key ingredient in recent results improving the running time for high-accuracy/exact maximum flow on unit-capacity graphs to m^{4/3+o(1)} [LS20b; LS20a].

In this paper, we obtain a nearly-linear running time for weighted ℓ_2 + ℓ_p^p flow/voltage problems on graphs. Our algorithm requires p(m^{1+o(1)} + n^{4/3+o(1)}) time for p = ω(1), which is almost-linear-time for p ≤ m^{o(1)} in slightly dense graphs (m ≥ n^{4/3}). Our running time p(m^{1+o(1)} + n^{4/3+o(1)}) is even better than the Õ(m + n^{1.5}) time obtained for bipartite matching in [van+20]. Our result beats the Ω(n^{1.5}) barrier that arises in [van+20] from the use of interior point methods that maintain a vertex dual solution using dense updates across √n iterations. The progress on bipartite matching relies on highly technical graph-based inverse maintenance techniques that are tightly interwoven with the interior point method analysis. In contrast, our sparsification methods provide a clean interface to iterative refinement, which makes our analysis much simpler and more compact. Graph Sparsification.
Various notions of graph sparsification – replacing a dense graph with a sparse one, while approximately preserving some key properties of the dense graph – have been key ingredients in faster low-accuracy algorithms. Benczúr and Karger [BK96] defined cut sparsifiers that approximately preserve all cuts, and used them to give faster low-accuracy approximation algorithms for maximum flow. Since then, several notions of sparsification have been studied extensively and utilized for designing faster algorithms [PS89; ST11; Rac08; SS11; Mad10; She13; Kel+14; RST14; CP15; Kyn+16; Dur+17; Chu+18].

Sparsification has had a smaller direct impact on the design of faster high-accuracy algorithms for graph problems, limited mostly to the design of linear system solvers [ST04; KMP11; PS14; Kyn+16]. Kyng et al. [Kyn+19] constructed sparsifiers for weighted ℓ_2 + unweighted ℓ_p^p-norm objectives for flows. In this paper, we develop almost-linear-time algorithms for building sparsifiers for weighted ℓ_2 + ℓ_p^p norm objectives for flows and voltages,

cost(f) = ‖R^{1/2} f‖_2^2 + ‖S f‖_p^p, and cost′(B v) = ‖W^{1/2} B v‖_2^2 + ‖U B v‖_p^p,

and utilize them as key ingredients in our faster high-accuracy algorithms for optimizing such objectives on graphs. Our construction of sparsifiers for flow objectives builds on the machinery from [Kyn+19], and our construction of sparsifiers for voltage objectives builds on graph spanners [Alt+93; PS89; BS07].

2. Our Results
Our main results concern flow and voltage problems with mixed (ℓ_2 + ℓ_p^p)-objectives for p ≥ 2; we restrict our attention to p = ω(1) in this overview. Section 3 provides detailed running times for all p ≥
2. We emphasize that by setting the quadratic term to zero in our mixed (ℓ_2 + ℓ_p^p)-objectives, we get new state-of-the-art algorithms for ℓ_p-norm minimizing flows and voltages.

Mixed ℓ_2-ℓ_p-norm minimizing flow. Consider a graph G = (V, E) along with non-negative diagonal matrices R, S ∈ R^{E×E}, a gradient vector g ∈ R^E, and demands d ∈ R^V. We refer to the diagonal entries of R and S as ℓ_2-weights and ℓ_p-weights respectively. Let B denote the signed edge-vertex incidence matrix of G (see Appendix A.1). We wish to solve the following minimization problem with the objective E(f) = g^⊤ f + ‖R^{1/2} f‖_2^2 + ‖S f‖_p^p:

(1) min_{B^⊤ f = d} E(f)

We require g ⊥ {ker(R) ∩ ker(S) ∩ ker(B)} so that the problem has bounded minimum value, and d ⊥ 1 so a feasible solution exists. These conditions can be checked in linear time and have a simple combinatorial interpretation. Note that the choice of graph edge directions in B matters for the value of g^⊤ f. The flow on an edge is allowed to be both positive and negative.

Mixed ℓ_2-ℓ_p-norm minimizing voltages. Consider a graph G = (V, E) along with non-negative diagonal matrices W ∈ R^{E×E} and U ∈ R^{E×E}, and demands d ∈ R^V. We refer to the diagonal entries of W and U as ℓ_2-conductances and ℓ_p-conductances respectively. In this case, we want to minimize the objective E(v) = d^⊤ v + ‖W^{1/2} B v‖_2^2 + ‖U B v‖_p^p in the minimization problem

(2) min_v E(v)

In the voltage setting, we only require d ⊥ 1 so the problem has bounded minimum value. Obtaining good solutions.
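Setting the gradient and the ℓ_p term to zero in Problem (1) recovers the classical electrical flow problem, which a single Laplacian solve handles exactly. A minimal numerical sketch of that ℓ_2 special case follows; the dense pseudoinverse stands in for a fast Laplacian solver, and the triangle graph is purely illustrative.

```python
import numpy as np

# The l2 special case of Problem (1): with g = 0 and S = 0, the minimizer of
# ||R^{1/2} f||_2^2 subject to B^T f = d is the electrical flow
#     f = R^{-1} B (B^T R^{-1} B)^+ d,
# computable by one Laplacian solve.  Dense pseudoinverse and a tiny graph
# stand in for a fast Laplacian solver here.
edges = [(0, 1), (1, 2), (0, 2)]               # oriented edges of a triangle
n, m = 3, len(edges)
B = np.zeros((m, n))                           # signed edge-vertex incidence matrix
for i, (u, v) in enumerate(edges):
    B[i, u], B[i, v] = 1.0, -1.0
r = np.array([1.0, 2.0, 4.0])                  # l2-weights (resistances)
L = B.T @ np.diag(1.0 / r) @ B                 # weighted Laplacian B^T R^{-1} B
d = np.array([1.0, 0.0, -1.0])                 # demands, orthogonal to all-ones

v = np.linalg.pinv(L) @ d                      # electrical voltages
f = np.diag(1.0 / r) @ B @ v                   # electrical flow (Ohm's law)
assert np.allclose(B.T @ f, d)                 # the demands are routed exactly

# Optimality: adding any circulation c (B^T c = 0) cannot decrease the energy.
rng = np.random.default_rng(0)
energy = lambda g: float(g @ (r * g))
proj = np.eye(m) - B @ np.linalg.pinv(B.T @ B) @ B.T   # projector onto circulations
for _ in range(100):
    c = proj @ rng.standard_normal(m)
    assert energy(f) <= energy(f + c) + 1e-9
```

The optimality check reflects the standard fact that Rf lies in the image of B at the optimum, so perturbing by circulations only adds energy.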
For both these problems, we study high-accuracy approximation algorithms that provide feasible solutions x (a flow or a voltage respectively) that approximately minimize the objective function from some starting point x^{(0)}, i.e., for some small ε > 0, we have

E(x) − E(x⋆) ≤ ε (E(x^{(0)}) − E(x⋆)),

where x⋆ denotes an optimal feasible solution. Our algorithms apply to problems with quasipolynomially bounded parameters, including quasipolynomial bounds on the non-zero singular values of the matrices we work with. Below we state our main algorithmic results.

Theorem 2.1 (Flow Algorithmic Result). Consider a graph G with n vertices and m edges, equipped with non-negative ℓ_2 and ℓ_p-weights, as well as a gradient and demands, all with quasipolynomially bounded entries. For p = ω(1), in p(m^{1+o(1)} + n^{4/3+o(1)}) log(1/ε) time we can compute an ε-approximately optimal flow solution to Problem (1) with high probability.

This improves upon [Adi+19; APS19; AS20], which culminated in a pm^{4/3+o(1)} log(1/ε) time algorithm.

Theorem 2.2 (Voltage Algorithmic Result). Consider a graph G with n vertices and m edges, equipped with non-negative ℓ_2 and ℓ_p-conductances, as well as demands, all with quasipolynomially bounded entries. For p = ω(1), in p(m^{1+o(1)} + n^{4/3+o(1)}) log(1/ε) time we can compute an ε-approximately optimal voltage solution to Problem (2) with high probability.

Background: Iterative Refinement for Mixed ℓ_2-ℓ_p-norm Flow Objectives. Adil et al. [Adi+19] developed a notion of iterative refinement for mixed (ℓ_2 + ℓ_p^p)-objectives which, in the flow setting, i.e. Problem (1), corresponds to approximating E′(δ) = E(f + δ) using another (ℓ_2 + ℓ_p^p)-objective that roughly corresponds to the 2nd degree Taylor series approximation of E′(δ) combined with an ℓ_p-norm term ‖S δ‖_p^p, while ensuring feasibility of f + δ through a constraint B^⊤ δ = 0. We call the resulting problem a residual problem. Adil et al.
[Adi+19] showed that obtaining a constant-factor approximate solution to the residual problem in δ is sufficient to ensure that E(f + δ) is closer to the optimal solution by a multiplicative factor depending only on p. In [APS19], this result was sharpened to show that such an approximate solution for the residual problem can be used to make (1 − Ω(1/p)) multiplicative progress towards the optimum, so that O(p log(m/ε)) iterations suffice to produce an ε-accurate solution. In order to solve the residual problem to a constant approximation, Adil et al. [Adi+19] developed an accelerated multiplicative weights method for (ℓ_2 + ℓ_p^p)-flow objectives, or more generally, for mixed (ℓ_2 + ℓ_p^p)-regression in an underconstrained setting.
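In its simplest form, iterative refinement of this kind can be caricatured in a few lines of code. The instance, the plain weighted least-squares inner step, and the backtracking line search below are illustrative stand-ins, not the accelerated solver or the exact constants of [Adi+19; APS19].

```python
import numpy as np

# Caricature of iterative refinement for min_x f(x) = ||N x - y||_p^p, p >= 2.
# Each step solves a weighted least-squares problem built from the current
# residuals (the quadratic part of a residual problem); a line search stands in
# for the formal guarantee that an approximate residual solution still makes
# multiplicative progress.  Purely illustrative, not the paper's algorithm.
rng = np.random.default_rng(1)
p = 4
N = rng.standard_normal((40, 5))
y = rng.standard_normal(40)

def f(x):
    return np.sum(np.abs(N @ x - y) ** p)

def grad(x):
    r = N @ x - y
    return N.T @ (p * np.abs(r) ** (p - 2) * r)

x = np.zeros(5)
for _ in range(200):
    r = N @ x - y
    w = np.abs(r) ** (p - 2) + 1e-12                 # weights from the quadratic term
    # Weighted least-squares inner step ("solve the residual problem crudely"):
    delta = np.linalg.solve(N.T @ (w[:, None] * N), -grad(x) / (p * (p - 1)))
    t = 1.0                                          # backtracking line search
    while f(x + t * delta) > f(x) - 1e-12:
        t *= 0.5
        if t < 1e-12:
            break
    x = x + t * delta

assert np.linalg.norm(grad(x)) < 1e-4 * np.linalg.norm(grad(np.zeros(5)))
```

The point of the sketch is structural: a crude quadratic-plus-line-search inner step already drives the gradient to (near) zero, mirroring how a constant-approximate residual solution yields geometric convergence in the formal framework.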
Sparsification results. Our central technical results in this paper concern sparsification of residual flow and voltage problems, in the sense outlined in the previous paragraph. Concretely, in nearly-linear time, we can take a residual problem on a dense graph and produce a residual problem on a sparse graph with Õ(n) edges, with the property that constant factor solutions to the sparse residual problem still make (1 − Ω(m^{−2/(p−1)}/p)) multiplicative progress on the original problem. This leads to an iterative refinement that converges in O(p m^{2/(p−1)} log(m/ε)) steps. However, the accelerated multiplicative weights algorithm that we use for each residual problem now only requires Õ(n^{4/3}) time to compute a crude solution.

Flow residual problem sparsification.
In the flow setting, we show the following:
Theorem 2.3 (Informal Flow Sparsification Result). Consider a graph G with n vertices and m edges, equipped with non-negative ℓ_2 and ℓ_p-weights, as well as a gradient. In Õ(m) time, we can compute a graph H with n vertices and Õ(n) edges, equipped with non-negative ℓ_2 and ℓ_p-weights, as well as a gradient, such that a constant factor approximation to the flow residual problem on H, when scaled by m^{−2/(p−1)}, results in an Õ(m^{2/(p−1)})-approximate solution to the flow residual problem on G. The algorithm works for all p ≥ 2 and succeeds with high probability.

Our sparsification techniques build on [Kyn+19], require a new bucketing scheme to deal with non-uniform ℓ_p-weights, as well as a preprocessing step to handle cycles with zero ℓ_2-weight and ℓ_p-weight. This preprocessing scheme in turn necessitates a more careful analysis of the additive errors introduced by gradient rounding, and we provide a more powerful framework for this than [Kyn+19].

Voltage residual problem sparsification.
In the voltage setting, we show the following.
Theorem 2.4 (Voltage Sparsification Result (Informal)). Consider a graph G with n vertices and m edges, equipped with non-negative ℓ_2 and ℓ_p-conductances. In Õ(m) time, we can compute a graph H with n vertices and Õ(n) edges, equipped with non-negative ℓ_2 and ℓ_p-conductances, such that a constant factor approximation to the voltage residual problem on H, when scaled by m^{−1/(p−1)}, results in an Õ(m^{1/(p−1)})-approximate solution to the voltage residual problem on G. The algorithm works for all p ≥ 2 and succeeds with high probability.

Note that our voltage sparsification is slightly stronger than our flow sparsification, as the former loses only a factor Õ(m^{1/(p−1)}) in the approximation while the latter loses a factor Õ(m^{2/(p−1)}). Our voltage sparsification uses a few key observations: In voltage space, surprisingly, we can treat the ℓ_2 and ℓ_p costs separately. This behavior is very different from the flow case, and arises because in voltage space, every edge provides an "obstacle", i.e. adding an edge increases cost, whereas in flow space, every edge provides an "opportunity", i.e. adding an edge decreases cost. This means that in voltage space, we can separately account for the energy costs created by our ℓ_2 and ℓ_p terms, whereas in flow space, the ℓ_2 and ℓ_p weights must be highly correlated in a sparsifier. Armed with this decoupling observation, we preserve ℓ_2 cost using standard tools for spectral graph sparsification, and we preserve ℓ_p cost approximately by a reduction to graph distance preservation, which we in turn achieve using weighted undirected graph spanners.

Voltage space accelerated multiplicative weights solver.
The algorithm from [Adi+19] for constant-approximate solutions to the residual problem works in the flow setting. Using iterative refinement, the algorithm can be used to compute high-accuracy solutions. Because we can use high-accuracy flow solutions to extract high-accuracy solutions to the dual voltage problem, [Adi+19] were also able to produce solutions to ℓ_q-norm minimizing voltage problems (where ℓ_q for q = p/(p −
1) is the dual norm to ℓ_p). Hence, by solving ℓ_p-flow problems for all p ∈ (2, ∞), [Adi+19] were able to solve ℓ_q-norm minimizing voltage problems for all q ∈ (1, 2), but this duality does not yield voltage solvers for q ≥ 2. Thus, in order to solve for q-norm minimizing voltages for q > 2, we require a solver that works directly in voltage space for mixed (ℓ_2 + ℓ_p^p)-voltage objectives.

We develop an accelerated multiplicative weights algorithm along the lines of [Chr+11; Chi+13; Adi+19] that works directly in voltage space for mixed (ℓ_2 + ℓ_p^p)-objectives, or more generally for overconstrained mixed (ℓ_2 + ℓ_p^p)-objective regression. Concretely, this directly gives an algorithm for computing crude solutions to the residual problems that arise from applying [Adi+19] iterative refinement to Problem (2). Our solver produces an improved O(1)-approximation to the residual problem rather than the p^{O(p)}-approximation from [Adi+19]. This gives an Õ(m^{4/3}) high-accuracy algorithm for mixed (ℓ_2 + ℓ_p^p)-objective voltage problems for p >
2, unlike [Adi+19], which could only solve pure ℓ_q-norm voltage problems for q < 2. Finally, we obtain our p(m^{1+o(1)} + n^{4/3+o(1)}) time algorithm for p = ω(1) by developing a sparsification procedure that applies directly to mixed (ℓ_2 + ℓ_p^p)-voltage problems for p > 2.

Mixed ℓ_2-ℓ_p-norm regression. Our framework can also be applied outside of a graph setting, where our new accelerated multiplicative weights algorithm for overconstrained mixed (ℓ_2 + ℓ_p^p)-regression gives new state-of-the-art results in some regimes when combined with new sparsification results. In this setting we develop sparsification techniques based on the Lewis weight sampling from the work of Cohen and Peng [CP15]. We focus on the case 2 < p <
4, where [CP15] provided fast algorithms for Lewis weight sampling.
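The [CP15] Lewis-weight computation itself is short. The sketch below uses the fixed-point form of the iteration with dense linear algebra and an untuned iteration count, so it is illustrative only; at the fixed point the weights are the leverage scores of W^{1/2−1/p} A and hence sum to rank(A).

```python
import numpy as np

# l_p Lewis weights via the Cohen--Peng [CP15] fixed-point iteration
#     w_i <- ( a_i^T (A^T W^{1-2/p} A)^{-1} a_i )^{p/2},
# which converges for p < 4.  Sampling rows with probability proportional to
# w_i (suitably oversampled) yields row sparsifiers of the kind used in
# Theorem 2.5.  Dense linear algebra; illustration only.
rng = np.random.default_rng(2)
p = 3.0
A = rng.standard_normal((200, 10))

w = np.ones(200)
for _ in range(100):
    scale = w ** (1 - 2 / p)                       # diagonal of W^{1-2/p}
    Ginv = np.linalg.inv(A.T @ (scale[:, None] * A))
    tau = np.einsum('ij,jk,ik->i', A, Ginv, A)     # a_i^T Ginv a_i
    w = tau ** (p / 2)

assert abs(w.sum() - 10) < 1e-6                    # Lewis weights sum to rank(A)
```

The final assertion is the standard certificate: at the fixed point, w_i equals the leverage score of row i of W^{1/2−1/p} A, and leverage scores sum to the rank.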
Theorem 2.5 (General Matrices Sparsification Result). Let p ∈ [2, 4), let M ∈ R^{m_1×n}, N ∈ R^{m_2×n} be matrices with m_1, m_2 ≥ n, and let LSS(B) denote the time to solve a linear system in B^⊤B. Then, we may compute M̃, Ñ ∈ R^{O(n^{p/2} log(n))×n} such that with probability at least 1 − n^{−Ω(1)}, for all Δ ∈ R^n,

‖M̃ Δ‖_2^2 + ‖Ñ Δ‖_p^p ≈_{O(1)} ‖M Δ‖_2^2 + ‖N Δ‖_p^p,

in time Õ(nnz(M) + nnz(N) + LSS(M̂) + LSS(N̂)), for some M̂ and N̂ each containing O(n log(n)) rescaled rows of M and N, respectively.

Theorem 2.6 (General Matrices Algorithmic Result). For p ∈ [2, 4), with high probability we can find an ε-approximate solution to (3) in time

Õ( ( nnz(M) + nnz(N) + (LSS(M̃) + LSS(Ñ)) n^{p(p−2)/(2(3p−2))} ) log(1/ε) ),

for some M̃ and Ñ each containing O(n^{p/2} log(n)) rescaled rows of M and N, respectively, where LSS(A) is the time required to solve a linear equation in A^⊤A to quasipolynomial accuracy.

Note that for all p ∈ (2, 4), the exponent p(p−2)/(2(3p−2)) is at most 0.4.

Remark 2.7.
By [Coh+15], a linear equation in A^⊤A, where A ∈ R^{m×n}, can be solved to quasipolynomial accuracy in time Õ(nnz(A) + n^ω). Using the above result for solving the required linear systems, we get a running time of Õ(nnz(M) + nnz(N) + (n^{p/2} + n^ω) n^{p(p−2)/(2(3p−2))}), matching an earlier input sparsity result by Bubeck et al. [Bub+18] that achieves Õ((nnz(M) + nnz(N))(1 + n m^{−2/p}) + m^{1−2/p} n^2 + n^ω), where M ∈ R^{m_1×n}, N ∈ R^{m_2×n} and m = max{m_1, m_2}.

3. Main Algorithm
In this section, we prove Theorems 2.1, 2.2, and 2.6. We first design an algorithm to solve the following general problem:
Definition 3.1.
For matrices M ∈ R^{m_1×n}, N ∈ R^{m_2×n} and A ∈ R^{d×n}, with m_1, m_2 ≥ n and d ≤ n, and vectors b ⊥ {ker(M) ∩ ker(N) ∩ ker(A)} and c ∈ im(A), we want to solve

(3) min_x b^⊤ x + ‖M x‖_2^2 + ‖N x‖_p^p  s.t.  A x = c.

In order to solve the above problem, we use the iterative refinement framework from [APS19] to obtain a residual problem, which is defined as follows. Definition 3.2.
For any p ≥ 2, we define the residual problem res(Δ) for (3) at a feasible x as

max_{A Δ = 0}  res(Δ) := g^⊤ Δ − Δ^⊤ R Δ − ‖N Δ‖_p^p,

where g = (1/p) b + (2/p) M^⊤ M x + N^⊤ Diag(|N x|^{p−2}) N x and R = (2/p) M^⊤ M + 2 N^⊤ Diag(|N x|^{p−2}) N.

This residual problem can further be reduced by moving the term linear in Δ to the constraints via a binary search. This leaves us with a problem of the form

min_Δ  Δ^⊤ R Δ + ‖N Δ‖_p^p  s.t.  g^⊤ Δ = a,  A Δ = 0,

for some constant a. In order to solve the above problem with an ℓ_2 + ℓ_p^p objective, we reduce the instance size via a sparsification routine, and then solve the smaller problem by a multiplicative weights algorithm. We adapt the multiplicative-weights algorithm from [Adi+19] to work in the voltage space while improving the p dependence of the runtime from p^{O(p)} to p, and the approximation quality from p^{O(p)} to O(1). The precise sparsification routines are described in later sections.

For large p, i.e., p > log m, in order to get a linear dependence of the running time on p, we need to reduce the residual problem in ℓ_p-norm to a residual problem in log m-norm by using the framework from [AS20].

The entire meta-algorithm is described formally in Algorithm 1, and its guarantees are described by the next theorem. Most proof details are deferred to Appendix D. Theorem 3.3.
For an instance of Problem (3), suppose we are given a starting solution x^{(0)} that satisfies A x^{(0)} = c and is a κ_0-approximate solution to the optimum. Consider an iteration of the while loop of Algorithm 1 for the ℓ_p-norm residual problem at x^{(t)}. We can define µ_1 and κ_1 such that if Δ̄ is a β-approximate solution to a corresponding p′-norm residual problem, then µ_1 Δ̄ is a κ_1 β-approximate solution to the p-residual problem. Further, suppose we have the following procedures:

(1) Sparsify: Runs in time K; takes as input any matrices R, N and vector g, and returns R̃, Ñ, g̃, with the matrices having sizes at most ñ × n, such that if Δ̃ is a β-approximate solution to

max_{A Δ = 0}  g̃^⊤ Δ − ‖R̃^{1/2} Δ‖_2^2 − ‖Ñ Δ‖_{p′}^{p′},

for any p′ ≥ 2, then µ_2 Δ̃, for a computable µ_2, is a κ_2 β-approximate solution for

max_{A Δ = 0}  res(Δ) := g^⊤ Δ − ‖R^{1/2} Δ‖_2^2 − ‖N Δ‖_{p′}^{p′}.

(2) Solver: Approximately solves (4), returning Δ̄ such that ‖R̃^{1/2} Δ̄‖_2^2 ≤ κ_3 ν and ‖Ñ Δ̄‖_p^p ≤ κ_3 ν, in time K̃(ñ) for instances of size at most ñ.

Then Algorithm 1 finds an ε-approximate solution for Problem (3) in time

Õ( p κ_3^{1/(p−1)} κ_1 κ_2 κ_3 (K + K̃(ñ)) log(κ_0 p / ε) ).

Algorithm 1 Meta-Algorithm for ℓ_p Flows and Voltages

procedure Sparsified-p-Problems(A, M, N, c, b, p)
    x ← x^{(0)}, such that f(x^{(0)}) ≤ κ_0 · Opt
    T ← Õ(p κ_1 κ_2 κ_3 log(κ_0/ε))
    for t = 0 to T do
        At x^{(t)} define g, R, N and res(Δ), the residual problem (Definition 3.2)
        a ← 1, b ← 1, µ_1 ← 1, κ_1 ← 1
        ν ← f(x^{(0)})
        while ν ≥ ε f(x^{(0)}) / (κ_0 p) do
            if p > log m then                  ⊲ Convert ℓ_p-norm residual to log m-norm residual
                p′ ← log m
                N′ ← 2^{1/p′} (ν/m)^{1/p′ − 1/p} N
                a ← 1, b ← O(1) m^{o(1)}
                µ_1 ← m^{−o(1)}, κ_1 ← m^{o(1)}    ⊲ Lose κ_1 in approximation when scaled by µ_1
                (g̃, R̃, Ñ) ← Sparsify(g, R, N′)    ⊲ Lose κ_2 in approximation when scaled by µ_2
            else
                (g̃, R̃, Ñ) ← Sparsify(g, R, N)     ⊲ Lose κ_2 in approximation when scaled by µ_2
                p′ ← p
            Use
Solver to compute a (κ_3, κ_3)-approximate solution to

(4)  Δ̃(ν) ← argmin_Δ ‖R̃^{1/2} Δ‖_2^2 + ‖Ñ Δ‖_{p′}^{p′}  s.t.  g̃^⊤ Δ = a ν,  A Δ = 0.

            Δ̄(ν) ← (a / (b κ_3^{1/(p′−1)})) µ_1 µ_2 Δ̃(ν)
            ν ← ν/2
        Δ ← argmin_{Δ̄(ν)} f(x − Δ̄(ν)/p)
        x ← x − Δ/p
    return x

Algorithms for ℓ_p-norm Problems. The problems discussed in Section 2 are special cases of Problem (3), which means we can use Algorithm 1. To prove our results, we will utilize Theorem 3.3 with the respective sparsification procedures and the following multiplicative-weights based algorithm for solving problems of the form

(5)  min_Δ  Δ^⊤ M^⊤ M Δ + ‖N Δ‖_p^p  s.t.  A Δ = c.

We describe our solver formally and prove the following theorem about its guarantees in Appendix C.
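As a quick sanity check on the residual data of Definition 3.2: with the normalization used there, p·g is exactly the gradient of f(x) = b^⊤x + ‖Mx‖_2^2 + ‖Nx‖_p^p, which we can confirm against finite differences. The dense random instance and its sizes below are arbitrary, for illustration only.

```python
import numpy as np

# Finite-difference check that p * g from Definition 3.2 is the gradient of
#     f(x) = b^T x + ||M x||_2^2 + ||N x||_p^p.
rng = np.random.default_rng(3)
p, n = 4, 6
M = rng.standard_normal((8, n))
N = rng.standard_normal((9, n))
b = rng.standard_normal(n)
x = rng.standard_normal(n)

def f(z):
    return b @ z + np.sum((M @ z) ** 2) + np.sum(np.abs(N @ z) ** p)

Nx = N @ x
g = b / p + (2.0 / p) * (M.T @ (M @ x)) + N.T @ (np.abs(Nx) ** (p - 2) * Nx)

eps = 1e-5
fd = np.array([(f(x + eps * e) - f(x - eps * e)) / (2 * eps) for e in np.eye(n)])
assert np.allclose(p * g, fd, rtol=1e-4, atol=1e-4)
```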
Theorem 3.4.
Let p ≥ 2. Consider an instance of Problem (5) described by matrices A ∈ R^{d×n}, N ∈ R^{m_1×n}, M ∈ R^{m_2×n}, with d ≤ n ≤ m_1, m_2, and vector c ∈ R^d. If the optimum of this problem is at most ν, Procedure Residual-Solver (Algorithm 3) returns an x such that A x = c, x^⊤ M^⊤ M x ≤ O(1) ν and ‖N x‖_p^p ≤ O(3^p) ν. The algorithm makes O(p m^{(p−2)/(3p−2)}) calls to a linear system solver.

We utilize Procedure
Residual-Solver as the Procedure
Solver in Algorithm
Sparsified-p-Problems. The algorithm uses the procedure only for solving problem instances with p ≤ log m. Thus, its running time is K̃(ñ) = Õ(ñ^{(p−2)/(3p−2)} · LSS(ñ)) ≤ Õ(ñ^{1/3} · LSS(ñ)), where LSS(ñ) denotes the time required to solve a linear system in matrices of size ñ. We also have κ_3 = O(1) and κ_3^{1/(p−1)} = O(1). We next estimate the values of κ_1 and µ_1. If p ≤ log m, we have µ_1 = 1 and κ_1 = 1. Otherwise, µ_1 = Õ(1) and κ_1 = O(m^{o(1)}) (refer to Lemma D.4 in Appendix D).

In order to obtain an initial solution, we usually solve an ℓ_2-norm problem. This gives an m^{p/2}-approximate initial solution, which results in an extra factor of p in the running time. To avoid this, we can do a homotopy on p similar to [AS20], i.e., start with an ℓ_2 solution and solve the ℓ_4 problem to a constant approximation, followed by ℓ_8, ..., ℓ_p. We note that a constant-approximate solution to the ℓ_{p/2}-norm problem gives an O(m) approximation to the ℓ_p problem, and thus we can solve log p problems where we can assume κ_0 = O(m).

We now complete the proof of our various algorithmic results by utilizing sparsification procedures specific to each problem. ℓ_p Flows.
We will prove Theorem 2.1 (Flow Algorithmic Result), with explicit p dependencies. Proof.
From Theorem 2.3, we obtain a sparse graph in K = Õ(m) time with ñ = Õ(n) edges. A constant factor approximation to the flow residual problem on this sparse graph, when scaled by µ_2 = m^{−2/(p−1)}, gives a κ_2 = Õ(m^{2/(p−1)})-approximate solution to the flow residual problem on the original graph. We can solve linear systems on the sparse graph in Õ(ñ) = Õ(n) time using fast Laplacian solvers. Using all these values in Theorem 3.3, we get the final runtime to be p m^{2/(p−1)+o(1)} (m + n^{1+(p−2)/(3p−2)}) log(pm/ε), as claimed. We prove Theorem 2.3 in Section A. □

ℓ_p Voltages.
We will prove Theorem 2.2 (Voltage Algorithmic Result), with explicit p dependencies. Proof.
From Theorem 2.4, we obtain a sparse graph in K = Õ(m) time with ñ = Õ(n) edges. A constant factor approximation to the voltage residual problem on this sparse graph, when scaled by µ_2 = m^{−1/(p−1)}, gives a κ_2 = Õ(m^{1/(p−1)})-approximate solution to the voltage residual problem on the original graph. We can solve linear systems on the sparse graph in Õ(ñ) = Õ(n) time using fast Laplacian solvers. Using these values in Theorem 3.3, we get the final runtime to be p m^{1/(p−1)+o(1)} (m + n^{1+(p−2)/(3p−2)}) log(pm/ε), as claimed. We prove Theorem 2.4 in Section 4. □

General Matrices.
We will now prove Theorem 2.6.
Proof.
We assume Theorem 2.5, which we prove in Appendix B. From the theorem, we have κ_2 = O(1) and µ_2 = O(1). Note that K = LSS(M̂) + LSS(N̂) for some M̂, N̂ ∈ R^{O(n log(n))×n}, which is the time required to solve linear systems in M̂^⊤M̂ and N̂^⊤N̂, respectively. Since, by Theorem 2.5, the size of M̃ and Ñ is ñ = O(n^{p/2} log(n)), the cost from the solver in Theorem 3.4 is Õ( p (LSS(M̃) + LSS(Ñ)) n^{p(p−2)/(2(3p−2))} ). □

4. Construction of Sparsifiers for ℓ_2 + ℓ_p^p Voltages
In this section, we prove a formal version of the voltage sparsification result (Theorem 2.4):
Theorem 4.1.
Consider a graph G = (V, E) with non-negative ℓ_2-weights w ∈ R^E and non-negative ℓ_p-weights s ∈ R^E, with m edges and n vertices. We can produce a graph H = (V, F) with edges F ⊆ E, ℓ_2-weights u ∈ R^F, and ℓ_p-weights t ∈ R^F, such that with probability at least 1 − δ the graph H has O(n log(n/δ)) edges and

(6)  (1/1.1) ‖W^{1/2} B_G x‖_2^2 ≤ ‖U^{1/2} B_H x‖_2^2 ≤ 1.1 ‖W^{1/2} B_G x‖_2^2,

and for any p ∈ [1, ∞],

(7)  (1/(m^{1/p} log(n))) ‖S B_G x‖_p ≤ ‖T B_H x‖_p ≤ ‖S B_G x‖_p,

where W = Diag(w), U = Diag(u), S = Diag(s), T = Diag(t). We denote the routine computing H and u, t by SpannerSparsify, so that (H, u, t) = SpannerSparsify(G, w, s). This algorithm runs in Õ(m log(1/δ)) time.

We will first define some terms required for our result. Given an undirected graph G = (V, E) with edge lengths l ∈ R^E, and u, v ∈ V, we let d_{G,l}(u, v) denote the shortest path distance in G w.r.t. l, so that if P is the shortest path w.r.t. l then

d_{G,l}(u, v) = Σ_{e ∈ P} l(e).

Definition 4.2.
Given an undirected graph G = (V, E) with edge lengths l ∈ R^E, a K-spanner is a subgraph H of G with the same edge lengths s.t. d_H(u, v) ≤ K d_G(u, v) for all u, v ∈ V.

Baswana and Sen showed the following result on spanners [BS07]. Theorem 4.3.
Given an undirected graph G = (V, E, l) with m edges and n vertices, and an integer k > 1, we can compute a (2k − 1)-spanner H of G with O(k n^{1+1/k}) edges in expected time O(km).

Lemma 4.4.
Given an undirected graph G = (V, E) with positive edge lengths l ∈ R^E, and a K-spanner H = (V, F) of G, for all x ∈ R^V we have

max_{(u,v)∈F} |x(u) − x(v)| / l(u, v) ≤ max_{(u,v)∈E} |x(u) − x(v)| / l(u, v) ≤ K max_{(u,v)∈F} |x(u) − x(v)| / l(u, v).

Proof.
The inequality max_{(u,v)∈F} |x(u) − x(v)| / l(u, v) ≤ max_{(u,v)∈E} |x(u) − x(v)| / l(u, v) is immediate from F ⊆ E.

To prove the second inequality, we note that if (u, v) ∈ E has shortest path P in H, then Σ_{(z,y)∈P} l(z, y) = d_H(u, v) ≤ K l(u, v), and hence

|x(u) − x(v)| / l(u, v) ≤ K |Σ_{(z,y)∈P} (x(z) − x(y))| / Σ_{(z,y)∈P} l(z, y) ≤ K max_{(z,y)∈P} |x(z) − x(y)| / l(z, y). □

Definition 4.5.
Given an undirected graph G = (V, E) with m edges and n vertices with positive edge ℓ_2-weights w ∈ R^E, a spectral ε-approximation of G is a graph H = (V, F) with F ⊆ E with positive edge ℓ_2-weights u ∈ R^F s.t.

(1/(1 + ε)) ‖W^{1/2} B_G x‖_2^2 ≤ ‖U^{1/2} B_H x‖_2^2 ≤ (1 + ε) ‖W^{1/2} B_G x‖_2^2,

where W = Diag(w) and U = Diag(u).

The following result on spectral sparsifiers was shown by Spielman and Srivastava [SS11] (see also [Spi15]). Theorem 4.6.
Given a graph G = (V, E) with positive ℓ_2-weights w ∈ R^E, with m edges and n vertices, for any ε ∈ (0, 1/2], we can produce a graph H = (V, F) with edges F ⊆ E and ℓ_2-weights u ∈ R^F such that H has O(n ε^{−2} log(n/δ)) edges and, with probability at least 1 − δ, (H, u) is a spectral ε-approximation of (G, w). We denote the routine computing H and u by SpectralSparsify, so that (H, u) = SpectralSparsify(G, w, ε, δ). This algorithm runs in Õ(m) time. Furthermore, if the weights w are quasipolynomially bounded, then so are the weights u.
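SpectralSparsify can be sketched by leverage-score (effective-resistance) sampling in the style of [SS11]. The dense pseudoinverse and the oversampling constant below are illustrative, not the nearly-linear-time implementation.

```python
import numpy as np

# Sketch of spectral sparsification a la [SS11]: sample each edge with
# probability proportional to (weight x effective resistance), i.e. its
# leverage score, and reweight so the sampled Laplacian is unbiased.
rng = np.random.default_rng(5)
n = 40
edges = [(u, v) for u in range(n) for v in range(u + 1, n)]   # complete graph
m = len(edges)
w = rng.uniform(1, 2, size=m)                                  # l2-weights

B = np.zeros((m, n))
for i, (u, v) in enumerate(edges):
    B[i, u], B[i, v] = 1.0, -1.0
L = B.T @ (w[:, None] * B)                                     # weighted Laplacian
Reff = np.einsum('ij,jk,ik->i', B, np.linalg.pinv(L), B)       # effective resistances

eps = 0.5
q = int(20 * n * np.log(n) / eps**2)                           # generous oversampling
prob = w * Reff / (w * Reff).sum()                             # leverage-score distribution
counts = rng.multinomial(q, prob)
u_new = counts * w / (q * prob)                                # reweighted sampled edges
H = B.T @ (u_new[:, None] * B)                                 # sparsified Laplacian

# Spectral closeness in the sense of Definition 4.5 on random test vectors:
for _ in range(20):
    x = rng.standard_normal(n)
    a, b = x @ L @ x, x @ H @ x
    assert b <= (1 + eps) * a and a <= (1 + eps) * b
```

Note that Σ_e w(e)·Reff(e) = n − 1 on a connected graph, which is why O(n log n / ε²) samples suffice; the check above tests random quadratic forms rather than the full spectral guarantee.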
Proof of Theorem 4.1.
We consider a graph G = (V, E) with m edges and n vertices, with non-negative ℓ_2-weights r ∈ R^E and non-negative ℓ_p-weights s ∈ R^E. We define Ê ⊆ E to be the edges s.t. s(e) >
0, and then let l ∈ R^Ê be given by l(e) = 1/s(e), and Ĝ = (V, Ê). We then apply Theorem 4.3 to Ĝ with l as edge lengths, and with k = (log(n) + 1)/2. We turn the algorithm of Theorem 4.3 into one with running time Õ(m log(1/δ)), instead of expected time Õ(m), by applying the standard Las Vegas to Monte Carlo reduction. With probability 1 − δ/
2, this gives us a log(n)-spanner H_2 of Ĝ, and we define t by restricting s to the edges of H_2. By Lemma 4.4, we then have

‖T B_{H_2} x‖_∞ ≤ ‖S B_G x‖_∞ ≤ log(n) ‖T B_{H_2} x‖_∞.
Because T B_{H_2} x is a restriction of S B_G x to a subset of the coordinates, we always have, for any p ≥ 1, ‖T B_{H_2} x‖_p ≤ ‖S B_G x‖_p. At the same time, we also have

‖S B_G x‖_p ≤ m^{1/p} ‖S B_G x‖_∞ ≤ m^{1/p} log(n) ‖T B_{H_2} x‖_∞ ≤ m^{1/p} log(n) ‖T B_{H_2} x‖_p.

We define Ẽ ⊆ E to be the edges s.t. r(e) >
0, and the let ˜ G = ( V, ˜ E ). Now, appealing toTheorem 4.6, we let ( H , u ) = SpectralSparsify ( ˜ G, r , / , ε/ H by taking the union of the edge sets of H and H and extending u and t to the new edge set by adding zero entries as needed. By a union bound, the approximationguarantees of Equations (6) and (7) simultaneously hold with probability at least 1 − δ .The edge set remains bounded in size by O ( n log n ). (cid:3) To see Theorem 2.4, note that from Theorem 4.1, we get, m − p − (cid:18) m − p − k W B G x k + m − k S B G x k pp (cid:19) ≤ m − p − (cid:16) k U B H x k + k T B H x k pp (cid:17) The other direction is easy to see.5.
Extensions of Our Results and Open Problems
Solving dual problems: q-norm minimizing flows and voltages for q < 2. When the mixed (ℓ_2 + ℓ_p^p)-objective flow problem (Problem (1)) is restricted to the case g = 0 and R = 0, it becomes a pure ℓ_p-norm minimizing flow problem, and its dual problem can be slightly rearranged to give

(8) min_v d^⊤v + ‖S^{-1} B v‖_q^q,

where q = p/(p − 1) = 1 + 1/(p − 1), and we refer to the entries of S^{-1} as ℓ_q-conductances. Because we can solve Problem (1) to high accuracy in near-linear time for p = ω(1), this allows us to solve Problem (8), the dual voltage ℓ_q-norm minimization, in time p(m^{1+o(1)} + n^{4/3+o(1)}) log(1/ε) (see [Adi+19, Section 7] for the reduction). We summarize this in the theorem below.

Theorem 5.1 (Voltage Algorithmic Result, q < 2). Consider a graph G with n vertices and m edges, equipped with positive ℓ_q-conductances, as well as a demand vector. For 1 < q < 2, when q = 1 + o(1), in poly(1/(q − 1)) · (m^{1+o(1)} + n^{4/3+o(1)}) log(1/ε) time, we can compute an ε-approximately optimal voltage solution to Problem (8) with high probability.

Similarly, we can solve ℓ_q-norm minimizing flows for q < 2 via the pure ℓ_p-voltage problem, a special case of the mixed (ℓ_2 + ℓ_p^p)-voltage problem. Picking W = 0 in Problem (2), we obtain a pure ℓ_p-norm minimizing voltage problem, and its dual problem can be slightly rearranged to give

(9) min_{B^⊤ f = d} ‖U^{-1} f‖_q^q,

where q = p/(p − 1) = 1 + 1/(p − 1), and we refer to the entries of U^{-1} as q-weights. Again, because we can solve Problem (2) to high accuracy in near-linear time for p = ω(1), this allows us to solve Problem (9), the dual flow ℓ_q-norm minimization, in time p(m^{1+o(1)} + n^{4/3+o(1)}) log(1/ε).

Theorem 5.2 (Flow Algorithmic Result, q < 2). Consider a graph G with n vertices and m edges, equipped with positive q-weights, as well as a demand vector. For 1 < q < 2, when q = 1 + o(1), in poly(1/(q − 1)) · (m^{1+o(1)} + n^{4/3+o(1)}) log(1/ε) time, we can compute an ε-approximately optimal flow solution to Problem (9) with high probability.

Open Questions. Mixed ℓ_2, ℓ_q problems for small q < 2. In this work, we provided new state-of-the-art algorithms for weighted mixed ℓ_2, ℓ_p-norm minimizing flow and voltage problems for p ≫ 2, as well as for ℓ_q-norm minimizing flow and voltage problems for q near 1. A reasonable definition of mixed ℓ_2, ℓ_q-norm problems for q < 2 is the mixed ℓ_2, ℓ_q-gamma function problem, in analogy with the mixed ℓ_2, ℓ_p-gamma function problem for p > 2; obtaining fast solvers for this regime remains open.

Directly sparsifying mixed ℓ_2, ℓ_q problems for q < 2. A second approach to developing a fast mixed ℓ_2, ℓ_q-gamma function solver for q < 2 would be to sparsify such objectives directly.

Removing the m^{O(1)/(p−1)} loss in sparsification. Our current approaches to graph mixed ℓ_2, ℓ_p-sparsification lose a factor m^{O(1)/(p−1)} in their quality of approximation, which leads to an m^{O(1)/(p−1)} factor slowdown in running time, and makes our algorithms less useful for small p. We believe a more sophisticated graph sparsification routine could remove this loss and result in significantly faster algorithms for p close to 2.

Using mixed ℓ_2, ℓ_p-objectives as oracles for ℓ_∞ regression. The current state-of-the-art algorithm for computing maximum flow in unit capacity graphs runs in Õ(m^{4/3}) time [LS20a], and uses the almost-linear-time algorithm from [Kyn+19] for solving unweighted ℓ_2 + ℓ_p^p instances as a key ingredient.

References

[Adi+19] D. Adil, R. Kyng, R. Peng, and S. Sachdeva. "Iterative refinement for ℓ_p-norm regression".
In: Proceedings of the Thirtieth Annual ACM-SIAM Symposium on Discrete Algorithms. SIAM. 2019, pp. 1405–1424 (cit. on pp. 1, 3, 4, 5, 6, 10, 11, 30, 32).
[All+17] Z. Allen-Zhu, Y. Li, R. Oliveira, and A. Wigderson. "Much faster algorithms for matrix scaling". In: 58th Annual Symposium on Foundations of Computer Science (FOCS). IEEE. 2017, pp. 890–901 (cit. on p. 1).
[Alt+93] I. Althöfer, G. Das, D. Dobkin, D. Joseph, and J. Soares. "On sparse spanners of weighted graphs". In: Discrete & Computational Geometry
[AMO93] R. K. Ahuja, T. L. Magnanti, and J. B. Orlin. Network flows - theory, algorithms and applications. Prentice Hall, 1993 (cit. on p. 1).
[APS19] D. Adil, R. Peng, and S. Sachdeva. "Fast, provably convergent IRLS algorithm for p-norm linear regression". In: Advances in Neural Information Processing Systems. 2019, pp. 14189–14200 (cit. on pp. 3, 6, 34, 39, 40, 42).
[AS20] D. Adil and S. Sachdeva. "Faster p-norm minimizing flows, via smoothed q-norm problems". In: Proceedings of the Fourteenth Annual ACM-SIAM Symposium on Discrete Algorithms. SIAM. 2020, pp. 892–910 (cit. on pp. 3, 6, 8, 41).
[BK96] A. A. Benczúr and D. R. Karger. "Approximating s-t minimum cuts in Õ(n²) time". In: Proceedings of the Twenty-Eighth Annual ACM Symposium on Theory of Computing. STOC '96. Philadelphia, Pennsylvania, USA: ACM, 1996, pp. 47–55. ISBN: 0-89791-785-5 (cit. on p. 2).
[BLM89] J. Bourgain, J. Lindenstrauss, and V. Milman. "Approximation of zonoids by zonotopes". In:
Acta mathematica
Random Struct. Algorithms. ISSN: 1042-9832 (cit. on pp. 2, 9).
[Bub+18] S. Bubeck, M. B. Cohen, Y. T. Lee, and Y. Li. "An Homotopy Method for Lp Regression Provably Beyond Self-concordance and in Input-sparsity Time". In: Proceedings of the 50th Annual ACM SIGACT Symposium on Theory of Computing. STOC 2018. Los Angeles, CA, USA: ACM, 2018, pp. 1130–1137. ISBN: 978-1-4503-5559-9 (cit. on pp. 1, 5, 11).
[Chi+13] H. H. Chin, A. Madry, G. L. Miller, and R. Peng. "Runtime guarantees for regression problems". In: Proceedings of the 4th Conference on Innovations in Theoretical Computer Science. ITCS '13. Berkeley, California, USA: ACM, 2013, pp. 269–282. ISBN: 978-1-4503-1859-4 (cit. on p. 4).
[Chr+11] P. Christiano, J. A. Kelner, A. Madry, D. A. Spielman, and S.-H. Teng. "Electrical flows, laplacian systems, and faster approximation of maximum flow in undirected graphs". In: Proceedings of the 43rd Annual ACM Symposium on Theory of Computing. STOC '11. San Jose, California, USA: ACM, 2011, pp. 273–282. ISBN: 978-1-4503-0691-1 (cit. on pp. 1, 4).
[Chu+18] T. Chu, Y. Gao, R. Peng, S. Sachdeva, S. Sawlani, and J. Wang. "Graph Sparsification, Spectral Sketches, and Faster Resistance Computation, via Short Cycle Decompositions". In: 59th IEEE Annual Symposium on Foundations of Computer Science (FOCS). 2018, pp. 361–372 (cit. on p. 2).
[CLS19] M. B. Cohen, Y. T. Lee, and Z. Song. "Solving Linear Programs in the Current Matrix Multiplication Time". In: Proceedings of the 51st Annual ACM SIGACT Symposium on Theory of Computing. STOC 2019. Phoenix, AZ, USA: Association for Computing Machinery, 2019, pp. 938–942. ISBN: 9781450367059 (cit. on p. 1).
[Coh+15] M. B. Cohen, Y. T. Lee, C. Musco, C. Musco, R. Peng, and A. Sidford. "Uniform Sampling for Matrix Approximation". In: Proceedings of the 2015 Conference on Innovations in Theoretical Computer Science. ITCS '15. Rehovot, Israel: Association for Computing Machinery, 2015, pp. 181–190. ISBN: 9781450333337 (cit. on pp. 5, 27).
[Coh+17a] M. B. Cohen, A. Madry, D. Tsipras, and A. Vladu. "Matrix Scaling and Balancing via Box Constrained Newton's Method and Interior Point Methods". In: 58th Annual Symposium on Foundations of Computer Science (FOCS). 2017, pp. 902–913 (cit. on p. 1).
[Coh+17b] M. B. Cohen, A. Madry, P. Sankowski, and A. Vladu. "Negative-Weight Shortest Paths and Unit Capacity Minimum Cost Flow in Õ(m^{10/7} log W) Time (Extended Abstract)". In: Proceedings of the Twenty-Eighth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2017, Barcelona, Spain, Hotel Porta Fira, January 16-19. 2017, pp. 752–771 (cit. on p. 1).
[CP15] M. B. Cohen and R. Peng. "ℓ_p Row Sampling by Lewis Weights". In: Proceedings of the Forty-Seventh Annual ACM Symposium on Theory of Computing. STOC '15. Portland, Oregon, USA: Association for Computing Machinery, 2015, pp. 183–192. ISBN: 9781450335362 (cit. on pp. 2, 5, 27).
[DS08] S. I. Daitch and D. A. Spielman. "Faster approximate lossy generalized flow via interior point algorithms". In: Proceedings of the 40th Annual ACM Symposium on Theory of Computing. STOC '08. Available at http://arxiv.org/abs/0803.0988. Victoria, British Columbia, Canada: ACM, 2008, pp. 451–460. ISBN: 978-1-60558-047-0 (cit. on p. 1).
[Dur+17] D. Durfee, J. Peebles, R. Peng, and A. B. Rao. "Determinant-Preserving Sparsification of SDDM Matrices with Applications to Counting and Sampling Spanning Trees". In: FOCS. IEEE Computer Society, 2017, pp. 926–937 (cit. on p. 2).
[Fos+53] F. G. Foster et al. "On the stochastic matrices associated with certain queuing processes". In:
The Annals of Mathematical Statistics
[GT14] A. V. Goldberg and R. E. Tarjan. "Efficient maximum flow algorithms". In: Commun. ACM (2014) (cit. on p. 1).
[Kel+14] J. A. Kelner, Y. T. Lee, L. Orecchia, and A. Sidford. "An Almost-Linear-Time Algorithm for Approximate Max Flow in Undirected Graphs, and its Multicommodity Generalizations". In: Proceedings of the Twenty-Fifth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2014, Portland, Oregon, USA, January 5-7, 2014. 2014, pp. 217–226 (cit. on pp. 1, 2).
[KMP11] I. Koutis, G. L. Miller, and R. Peng. "A Nearly-m log n Time Solver for SDD Linear Systems". In:
Proceedings of the 2011 IEEE 52nd Annual Symposium on Foundations of Computer Science. FOCS '11. Washington, DC, USA: IEEE Computer Society, 2011, pp. 590–598. ISBN: 978-0-7695-4571-4 (cit. on p. 2).
[Kyn+15] R. Kyng, A. Rao, S. Sachdeva, and D. A. Spielman. "Algorithms for Lipschitz Learning on Graphs". In: Proceedings of The 28th Conference on Learning Theory. Ed. by P. Grünwald, E. Hazan, and S. Kale. Vol. 40. Proceedings of Machine Learning Research. Paris, France: PMLR, 2015, pp. 1190–1223 (cit. on p. 1).
[Kyn+16] R. Kyng, Y. T. Lee, R. Peng, S. Sachdeva, and D. A. Spielman. "Sparsified Cholesky and multigrid solvers for connection laplacians". In: Proceedings of the 48th Annual ACM SIGACT Symposium on Theory of Computing. ACM. 2016, pp. 842–850 (cit. on p. 2).
[Kyn+19] R. Kyng, R. Peng, S. Sachdeva, and D. Wang. "Flows in Almost Linear Time via Adaptive Preconditioning". In: Proceedings of the 51st Annual ACM SIGACT Symposium on Theory of Computing. STOC 2019. Phoenix, AZ, USA: ACM, 2019, pp. 902–913. ISBN: 978-1-4503-6705-9 (cit. on pp. 1, 2, 4, 11, 18, 19, 22).
[Lew78] D. Lewis. "Finite dimensional subspaces of L_p". In: Studia Mathematica (1978).
[LS14] Y. T. Lee and A. Sidford. "Path Finding Methods for Linear Programming: Solving Linear Programs in Õ(√rank) Iterations and Faster Algorithms for Maximum Flow". In: FOCS. 2014 (cit. on p. 1).
[LS20a] Y. P. Liu and A. Sidford. "Faster Divergence Maximization for Faster Maximum Flow". In:
FOCS (2020) (cit. on pp. 1, 11).
[LS20b] Y. P. Liu and A. Sidford. "Faster energy maximization for faster maximum flow". In:
Proceedings of the 52nd Annual ACM SIGACT Symposium on Theory of Computing, STOC 2020, Chicago, IL, USA, June 22-26, 2020. ACM, 2020, pp. 803–814 (cit. on p. 1).
[Mad10] A. Madry. "Fast approximation algorithms for cut-based problems in undirected graphs". In: Foundations of Computer Science (FOCS), 2010 51st Annual IEEE Symposium on. IEEE. 2010, pp. 245–254 (cit. on p. 2).
[Mad13] A. Madry. "Navigating Central Path with Electrical Flows: From Flows to Matchings, and Back". In: FOCS. 2013 (cit. on p. 1).
[NN94] Y. Nesterov and A. Nemirovskii. Interior-Point Polynomial Algorithms in Convex Programming. Society for Industrial and Applied Mathematics, 1994 (cit. on p. 1).
[OSV12] L. Orecchia, S. Sachdeva, and N. K. Vishnoi. "Approximating the Exponential, the Lanczos Method and an Õ(m)-time Spectral Algorithm for Balanced Separator". In: Proceedings of the Forty-fourth Annual ACM Symposium on Theory of Computing. STOC '12. New York, New York, USA: ACM, 2012, pp. 1141–1160. ISBN: 978-1-4503-1245-5 (cit. on p. 1).
[PS14] R. Peng and D. A. Spielman. "An Efficient Parallel Solver for SDD Linear Systems". In: Proceedings of the 46th Annual ACM Symposium on Theory of Computing. STOC '14. New York, New York: ACM, 2014, pp. 333–342. ISBN: 978-1-4503-2710-7 (cit. on p. 2).
[PS89] D. Peleg and A. A. Schäffer. "Graph spanners". In: Journal of Graph Theory (1989).
[Rac08] H. Räcke. "Optimal hierarchical decompositions for congestion minimization in networks". In: Proceedings of the 40th Annual ACM Symposium on Theory of Computing. STOC '08. Victoria, British Columbia, Canada: ACM, 2008, pp. 255–264. ISBN: 978-1-60558-047-0 (cit. on p. 2).
[RST14] H. Racke, C. Shah, and H. Taubig. "Computing Cut-Based Hierarchical Decompositions in Almost Linear Time". In: Proceedings of the 25th Annual ACM-SIAM Symposium on Discrete Algorithms. SODA '14. 2014, pp. 227–238 (cit. on p. 2).
[Sch02] A. Schrijver. "On the history of the transportation and maximum flow problems". In:
Math. Program. (2002) (cit. on p. 1).
[She13] J. Sherman. "Nearly Maximum Flows in Nearly Linear Time". In: FOCS. 2013, pp. 263–269 (cit. on pp. 1, 2).
[She17] J. Sherman. "Generalized Preconditioning and Undirected Minimum-cost Flow". In: Proceedings of the Twenty-Eighth Annual ACM-SIAM Symposium on Discrete Algorithms. SODA '17. Barcelona, Spain, 2017, pp. 772–780 (cit. on p. 1).
[Spi15] D. A. Spielman. Spectral Graph Theory Lectures: Sparsification by Effective Resistance Sampling. 2015 (cit. on p. 10).
[SS11] D. Spielman and N. Srivastava. "Graph Sparsification by Effective Resistances". In: SIAM Journal on Computing (2011).
[ST04] D. A. Spielman and S.-H. Teng. "Nearly-linear time algorithms for graph partitioning, graph sparsification, and solving linear systems". In: STOC. 2004 (cit. on pp. 1, 2).
[ST11] D. Spielman and S. Teng. "Spectral Sparsification of Graphs". In: SIAM Journal on Computing (2011).
[…] "…graphs". In: IEEE, 2020, pp. 919–930 (cit. on pp. 1, 2).

Appendix A. Construction of Sparsifiers for (ℓ_2 + ℓ_p^p) Flows
In this section we will prove the following formal version of Theorem 2.3.
Theorem A.1.
Consider an instance G = (V_G, E_G, r_G, s_G, g_G) with n vertices and m edges. Suppose we want to solve

max_{(B_G)^⊤ f = 0} E_G(f).

We can compute in time Õ(m) an instance H = (V_H, E_H, r_H, s_H, g_H) such that, with probability 1 − ε, H has n vertices and m_H = n · polylog(n/(εδ)) edges, and for every f_H we can compute a corresponding f_G in Õ(m) time such that

H ⪯^cycle_{κ,δ} G and G ⪯^cycle_{κ,δ} H,

where κ = m^{1/(p−1)} polylog(n/(εδ)).

A.1.
Preliminaries.
Smoothed ℓ_p-norm functions. We consider p-norms smoothed by the addition of a quadratic term. First we define such a smoothed p-th power on R.

Definition A.2 (Smoothed p-th power). Given r, s, x ∈ R with r, s ≥ 0, define the r-smoothed s-weighted p-th power of x to be h_p(r, s, x) = r x² + s |x|^p.

This definition can be naturally extended to vectors to obtain smoothed ℓ_p-norms.

Definition A.3 (Smoothed ℓ_p-norm). Given vectors x ∈ R^m and r, s ∈ R^m_{≥0}, define the r-smoothed s-weighted p-norm of x to be
h_p(r, s, x) = Σ_{i=1}^m h_p(r_i, s_i, x_i) = Σ_{i=1}^m (r_i x_i² + s_i |x_i|^p).

Flow Problems and Approximation.
We will consider problems where we seek to find flows minimizing smoothed p-norms. We first define these problem instances.

Definition A.4 (Smoothed p-norm instance). A smoothed p-norm instance is a tuple G, G := (V_G, E_G, g_G, r_G, s_G), where V_G is a set of vertices, E_G is a set of undirected edges on V_G, the edges are accompanied by a gradient, specified by g_G ∈ R^{E_G}, the edges have ℓ_2-resistances given by r_G ∈ R^{E_G}_{≥0}, and s_G ∈ R^{E_G}_{≥0} gives the p-norm scaling.

Definition A.5 (Flows, residues, and circulations). Given a smoothed p-norm instance G, a vector f ∈ R^{E_G} is said to be a flow on G. A flow vector f satisfies residues b ∈ R^{V_G} if (B_G)^⊤ f = b, where B_G ∈ R^{E_G × V_G} is the edge-vertex incidence matrix of the graph (V_G, E_G), i.e., the row of B_G corresponding to edge (u, v) is (1_u − 1_v)^⊤. A flow f with residue 0 is called a circulation on G.

Note that our underlying instance and the edges are undirected. However, for every undirected edge e = (u, v) ∈ E, we assign an arbitrary fixed direction to the edge, say u → v, and interpret f_e ≥ 0 as flow from u to v, and f_e < 0 as flow from v to u. Thus, for (u, v) ∈ E, we have f_{(u,v)} = −f_{(v,u)}.

Definition A.6 (Objective, E_G). Given a smoothed p-norm instance G and a flow f on G, the associated objective function, or the energy, of f is given by E_G(f) = (g_G)^⊤ f − h_p(r, s, f).

Definition A.7 (Smoothed p-norm flow / circulation problem). Given a smoothed p-norm instance G and a residue vector b ∈ R^{V_G}, the smoothed p-norm flow problem (G, b) asks to find a flow f ∈ R^{E_G} with residues b that maximizes E_G(f), i.e., max_{f : (B_G)^⊤ f = b} E_G(f). If b = 0, we call it a smoothed p-norm circulation problem.

Note that the optimal objective of a smoothed p-norm circulation problem is always non-negative, whereas for a smoothed p-norm flow problem, it could be negative.

Approximating Smoothed p-norm Instances.
Since we work with objective functions that are non-standard (and not even homogeneous), we need to carefully define a new notion of approximation for these instances.
Definition A.8 (H ⪯_{κ,δ} G). For two smoothed p-norm instances G, H, we write H ⪯_{κ,δ} G if there is a linear map M_{H→G} : R^{E_H} → R^{E_G} such that for every flow f_H on H, the vector f_G = M_{H→G}(f_H) is a flow on G such that
(1) f_G has the same residues as f_H, i.e., (B_G)^⊤ f_G = (B_H)^⊤ f_H, and
(2) its energy is bounded by:
(1/κ)(E_H(f_H) − δ ‖f_H‖_1) ≤ E_G(f_G / κ).

For some of our transformations on graphs, we will be able to prove approximation guarantees only for circulations. Thus, we define the following notion restricted to circulations.
Definition A.9 (H ⪯^cycle_{κ,δ} G). For two smoothed p-norm instances G, H, we write H ⪯^cycle_{κ,δ} G if there is a linear map M_{H→G} : R^{E_H} → R^{E_G} such that for any circulation f_H on H, i.e., (B_H)^⊤ f_H = 0, the flow f_G = M_{H→G}(f_H) is a circulation, i.e., (B_G)^⊤ f_G = 0, and satisfies
(1/κ)(E_H(f_H) − δ ‖f_H‖_1) ≤ E_G(f_G / κ).
Observe that
H ⪯_{κ,δ} G implies H ⪯^cycle_{κ,δ} G. We define the usual induced matrix 1-to-1 norm as
‖M‖_{1→1} = max_{f ∈ R^E} ‖M f‖_1 / ‖f‖_1.
We define a special matrix 1-to-1 norm over circulations by
‖M‖^cycle_{1→1} = max_{B^⊤ f = 0} ‖M f‖_1 / ‖f‖_1.

Definition A.10 (H ⪯_κ G and H ⪯^cycle_κ G). We abbreviate
H ⪯_{κ,0} G as H ⪯_κ G, and H ⪯^cycle_{κ,0} G as H ⪯^cycle_κ G.

Definition A.11.
In the context of a problem with n vertices and m edges, we say a real number x is quasi-polynomially bounded if 2^{−polylog(n)} ≤ |x| ≤ 2^{polylog(n)}.

Definition A.12. Consider a smoothed p-norm flow problem (G, b) where G = (V, E, r, s, g) with n vertices and m edges. We say that the instance is quasipolynomially bounded if the entries of r and s are quasipolynomially bounded and
max_{f : (B_G)^⊤ f = b} E_G(f) ≤ 2^{polylog(n)}.

Definition A.13 (Touched and untouched cycles). We say a cycle of edges in E is touched if it contains an edge e such that r(e) ≠ 0 or s(e) ≠ 0. Otherwise, we say the cycle is untouched.

Definition A.14 (Cycle-touching instance). We say an instance G = (V, E, r, s, g) is cycle-touching if every cycle of edges in E is touched.

A.2. Additional Properties of Flow Problems and Approximation.
The definitions in Section A.1 satisfy most of the properties we want from comparisons. The following lemma slightly extends a similar statement in [Kyn+19].
Lemma A.15 (Reflexivity). For every smoothed p-norm instance G, and every κ ≥ 1, δ ≥ 0, we have G ⪯_{κ,δ} G with the identity map.

Proof. Consider the map M_{G→G} such that for every flow f_G on G, we have M_{G→G}(f_G) = f_G. Thus,
E_G(κ^{-1} M_{G→G}(f_G)) = E_G(κ^{-1} f_G) = (g_G)^⊤ (κ^{-1} f_G) − h_p(r, s, κ^{-1} f_G)
≥ κ^{-1} (g_G)^⊤ f_G − κ^{-1} h_p(r, s, f_G)
= κ^{-1} E_G(f_G) ≥ κ^{-1} (E_G(f_G) − δ ‖f_G‖_1).
Moreover, (B_G)^⊤ M_{G→G}(f_G) = (B_G)^⊤ f_G. Thus, the claims follow. □
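As a numeric sanity check of the energy objective (Definitions A.3 and A.6) and of the scaling step used in this proof, a minimal sketch with arbitrary test data (the function name and vector layout are our own, not the paper's):

```python
import numpy as np

def energy(g, r, s, f, p):
    # E_G(f) = g^T f - sum_i (r_i f_i^2 + s_i |f_i|^p)
    return g @ f - np.sum(r * f**2 + s * np.abs(f)**p)

# For kappa >= 1, E_G(f / kappa) >= E_G(f) / kappa: the gradient term scales
# exactly by 1/kappa, while the penalty h_p shrinks by kappa^2 or kappa^p,
# both at least as fast as 1/kappa.
```

For example, with g = (1, 0), r = (1, 1), s = (0, 1), f = (0.5, 1) and p = 4, the energy is 0.5 − (0.25 + 1 + 1) = −1.75, and dividing f by any κ ≥ 1 can only raise the value of E_G(f/κ) relative to E_G(f)/κ.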
The notion of approximation also behaves well under composition.
Lemma A.16 (Composition). Given three smoothed p-norm instances G_1, G_2, G_3 such that G_2 ⪯_{κ_1,δ_1} G_1 with the map M_{G_2→G_1} and G_3 ⪯_{κ_2,δ_2} G_2 with the map M_{G_3→G_2}, then G_3 ⪯_{κ,δ} G_1 with the map M_{G_3→G_1} = M_{G_2→G_1} ∘ M_{G_3→G_2}, where κ = κ_1 κ_2 and δ = δ_2 + δ_1 ‖M_{G_3→G_2}‖_{1→1}.

Similarly, given three smoothed p-norm instances G_1, G_2, G_3 such that G_2 ⪯^cycle_{κ_1,δ_1} G_1 with the map M_{G_2→G_1} and G_3 ⪯^cycle_{κ_2,δ_2} G_2 with the map M_{G_3→G_2}, then G_3 ⪯^cycle_{κ,δ} G_1 with the map M_{G_3→G_1} = M_{G_2→G_1} ∘ M_{G_3→G_2}, where κ = κ_1 κ_2 and δ = δ_2 + δ_1 ‖M_{G_3→G_2}‖^cycle_{1→1}.

Proof. Writing f_{G_2} = M_{G_3→G_2}(f_{G_3}) and f_{G_1} = M_{G_2→G_1}(f_{G_2}), we can simply chain together the guarantees to see that:
E_{G_3}(f_{G_3}) ≤ κ_2 E_{G_2}(f_{G_2}/κ_2) + δ_2 ‖f_{G_3}‖_1
≤ κ_2 (κ_1 E_{G_1}(f_{G_1}/(κ_1 κ_2)) + δ_1 ‖f_{G_2}/κ_2‖_1) + δ_2 ‖f_{G_3}‖_1
≤ κ_1 κ_2 E_{G_1}(f_{G_1}/(κ_1 κ_2)) + δ_1 ‖M_{G_3→G_2} f_{G_3}‖_1 + δ_2 ‖f_{G_3}‖_1
≤ κ_1 κ_2 E_{G_1}(f_{G_1}/(κ_1 κ_2)) + (δ_1 ‖M_{G_3→G_2}‖_{1→1} + δ_2) ‖f_{G_3}‖_1.
A similar calculation gives the cycle composition guarantee, but now allows us to bound norms using the ‖M_{G_3→G_2}‖^cycle_{1→1} norm. In all cases, our maps preserve that flows route the correct demands. □

The most important property of this notion of approximation is that it is additive, i.e., it works well with graph decompositions.
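Concretely, with instances stored as a dict of parallel arrays (an assumed layout, not the paper's), the disjoint-union operation formalized next (Definition A.17) is just concatenation:

```python
import numpy as np

def union_instances(inst1, inst2):
    """Disjoint union of two smoothed p-norm instances on a common vertex
    set: edge lists are concatenated (possibly creating multi-edges), and
    the gradient / resistance / scaling vectors are stacked accordingly."""
    assert inst1['n'] == inst2['n']          # same vertex set V
    return {
        'n': inst1['n'],
        'edges': inst1['edges'] + inst2['edges'],
        'g': np.concatenate([inst1['g'], inst2['g']]),
        'r': np.concatenate([inst1['r'], inst2['r']]),
        's': np.concatenate([inst1['s'], inst2['s']]),
    }
```

A flow on the union decomposes as (f_{H_1}, f_{H_2}) by the same index split, which is exactly the map used in the union lemma below.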
Definition A.17 (Union of two instances). Consider smoothed p-norm instances G_1, G_2 with the same set of vertices, i.e., V_{G_1} = V_{G_2}. Define G = G_1 ∪ G_2 as the instance on the same set of vertices obtained by taking a disjoint union of the edges (potentially resulting in multi-edges). Formally,
G = (V_{G_1}, E_{G_1} ∪ E_{G_2}, (g_{G_1}, g_{G_2}), (r_{G_1}, r_{G_2}), (s_{G_1}, s_{G_2})).

We prove the following lemma, which closely follows an analogous statement in [Kyn+19].
Lemma A.18 (Union of instances). Consider four smoothed p-norm instances G_1, G_2, H_1, H_2 on the same set of vertices, i.e., V_{G_1} = V_{G_2} = V_{H_1} = V_{H_2}, such that for i = 1, 2 we have H_i ⪯_{κ,δ} G_i with the map M_{H_i→G_i}. Let G := G_1 ∪ G_2 and H := H_1 ∪ H_2. Then H ⪯_{κ,δ} G with the map
M_{H→G}(f_H = (f_{H_1}, f_{H_2})) := (M_{H_1→G_1}(f_{H_1}), M_{H_2→G_2}(f_{H_2})),
where (f_{H_1}, f_{H_2}) is the decomposition of f_H onto the supports of H_1 and H_2.

Proof. Let f_H be a flow on H. We write f_H = (f_{H_1}, f_{H_2}). Let f_G := M_{H→G}(f_H). If f_{G_i} denotes M_{H_i→G_i}(f_{H_i}) for i = 1, 2, then we know that f_G = (f_{G_1}, f_{G_2}). Thus, the objectives satisfy
E_G(κ^{-1} f_G) = E_{G_1}(κ^{-1} f_{G_1}) + E_{G_2}(κ^{-1} f_{G_2})
≥ κ^{-1} (E_{H_1}(f_{H_1}) − δ ‖f_{H_1}‖_1) + κ^{-1} (E_{H_2}(f_{H_2}) − δ ‖f_{H_2}‖_1)
= κ^{-1} (E_H(f_H) − δ ‖f_H‖_1).
For the residues, we have
(B_G)^⊤ f_G = (B_{G_1})^⊤ f_{G_1} + (B_{G_2})^⊤ f_{G_2} = (B_{H_1})^⊤ f_{H_1} + (B_{H_2})^⊤ f_{H_2} = (B_H)^⊤ f_H.
Thus, H ⪯_{κ,δ} G. □

This notion of approximation also behaves nicely with scaling of ℓ_2 and ℓ_p resistances.

Lemma A.19.
For all κ ≥ 1, and for all pairs of smoothed p-norm instances G, H on the same underlying graphs, i.e., (V_G, E_G) = (V_H, E_H), such that:
(1) the gradients are identical, g_G = g_H,
(2) the ℓ_2-resistances are off by at most κ, i.e., r_G(e) ≤ κ r_H(e) for all edges e, and
(3) the p-norm scaling is off by at most κ^{p−1}, i.e., s_G ≤ κ^{p−1} s_H,
then H ⪯_κ G with the identity map.

Proof. Follows from Lemma 2.13 in [Kyn+19]. □

A.3. Orthogonal Decompositions of Flows.
At the core of our graph decomposition and sparsification procedures is a decomposition of the gradient g of G into its cycle space and potential flow space. We denote such a splitting by
(10) g_G = ĝ_G + B_G ψ_G, s.t. (B_G)^⊤ ĝ_G = 0.
Here ĝ is a circulation, while Bψ gives a potential-induced edge value. We will omit the superscripts when the context is clear.

The following minimization-based formulation of this splitting of g is critical to our method of bounding the overall progress of our algorithm.

Fact A.20.
The projection of g onto the cycle space is obtained by minimizing the Euclidean norm of g plus a potential flow. Specifically, ‖ĝ‖_2 = min_x ‖g + B x‖_2.

Lemma A.21.
Given a graph/gradient instance G, consider H formed from a subset of its edges. The projections of g_G and g_H onto their respective cycle spaces, ĝ_G and ĝ_H, satisfy:
‖ĝ_H‖_2 ≤ ‖ĝ_G‖_2 ≤ ‖g_G‖_2.

A.4.
Numerics, Conditioning, Inexact Laplacian Solvers.
Because we allow instances G = (V, E, r, s, g) with r(e) = 0 and s(e) = 0 for some edges e ∈ E (and this is important for our sparsification procedures), we need to be somewhat careful about disallowing instances with a cycle that has zero r and s values and non-zero gradient: in that case, our "energy" can diverge to +∞.

Definition A.22 (Unbounded and constant cycles). An untouched cycle C ⊆ E is unbounded if the sum of gradient terms around the cycle is non-zero, i.e., Σ_{e ∈ C} g(e) ≠ 0. If the sum is zero, we call the untouched cycle a constant cycle.

Lemma A.23.
Consider a smoothed p-norm flow problem (G, b), i.e., max_{f : (B_G)^⊤ f = b} E_G(f). The problem is unbounded, i.e., has objective value E_G → ∞, if and only if G contains an unbounded cycle.

Proof. The problem is unbounded exactly when the gradient has non-zero inner product with some element of the subspace ker(B^⊤) ∩ ker(R) ∩ ker(S). The first condition, membership in ker(B^⊤), tells us the element must be in the cycle space, while the latter two tell us it must be supported on edges with zero entries in R and S. Writing the element as a linear combination of simple cycles, the gradient must have a non-zero inner product with one untouched cycle. □

Lemma A.24.
The algorithm detectUnbounded takes as input a smoothed p-norm problem (G, b), and if G contains an unbounded cycle, the algorithm detects this and outputs an unbounded cycle (an arbitrary one if there are multiple).

Proof. We run a DFS using only the edges outside the union of the supports of r and s (we delete the other edges first), and assign voltages. Upon encountering an edge into an already visited vertex, we check whether the new voltage agrees; if not, we output the cycle on the stack: it is an unbounded cycle. At the end, consistent voltages certify that the gradient on these edges is given by potential differences, and hence the objective is bounded. The overall time is linear, as each edge is processed only once. □

Lemma A.25.
The algorithm constantCycleContraction takes as input a smoothed p-norm problem (G, b) which has bounded objective value, and:
(1) returns an equivalent instance (G′, b′) where G′ is cycle-touching; G′ equals G with all constant cycles contracted into single vertices;
(2) the problems (G, b) and (G′, b′) are equivalent in that:
(a) any feasible solution f to the first problem can be trivially mapped to a feasible solution to the second, and vice versa;
(b) mapping a solution from (G, b) to (G′, b′) always changes the objective value by the same scalar (say c) for all solutions, and similarly mapping the other way changes the objective by −c;
(c) the maps can be applied in O(m) time;
(d) if the entries of (G, b) are quasi-polynomially bounded, then so are the entries of (G′, b′).
Furthermore, if b = 0 then b′ = 0 and the solution maps preserve the objective value exactly, i.e., c = 0, and hence G ⪯^cycle_1 G′ and G′ ⪯^cycle_1 G; for zero-demand flows the maps M_{G→G′} and M_{G′→G} satisfy ‖M_{G→G′}‖^cycle_{1→1} ≤ 1 and ‖M_{G′→G}‖^cycle_{1→1} ≤ m.

Proof. Once we know the problem is bounded, and hence contains no unbounded cycles, we can look at the connected components consisting of untouched edges and repeatedly contract their cycles. In the case where the demands are zero initially, we produce a smaller instance where the demands are again zero. To lift a solution to the larger instance, we can simply put no flow on the contracted edges. Because adding a cycle flow on these edges does not increase the objective, our mapping guarantees hold. The overall time is linear, as each edge is processed only once. □
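A minimal sketch of the two routines above (Lemmas A.24 and A.25), assuming instances are given as an edge list with parallel arrays r, s, g. The function names are our own, and the contraction variant shown collapses whole zero-cost components at once, a coarser step than the cycle-by-cycle contraction in the proof:

```python
def detect_unbounded(n, edges, r, s, g, tol=1e-9):
    """detectUnbounded sketch: on the subgraph of edges with r(e) = s(e) = 0,
    assign voltages by DFS; a voltage mismatch on an edge into a visited
    vertex witnesses an untouched cycle with non-zero gradient sum."""
    adj = [[] for _ in range(n)]
    for i, (u, v) in enumerate(edges):
        if r[i] == 0 and s[i] == 0:
            adj[u].append((v, g[i]))    # traversing u -> v adds g_e
            adj[v].append((u, -g[i]))   # reverse traversal negates g_e
    phi = [None] * n
    for root in range(n):
        if phi[root] is not None:
            continue
        phi[root], stack = 0.0, [root]
        while stack:
            u = stack.pop()
            for v, ge in adj[u]:
                if phi[v] is None:
                    phi[v] = phi[u] + ge
                    stack.append(v)
                elif abs(phi[v] - (phi[u] + ge)) > tol:
                    return True         # unbounded cycle found
    return False                        # consistent voltages: bounded

def contract_zero_components(n, edges, r, s, g):
    """constantCycleContraction sketch (run only once detect_unbounded is
    False): contract each connected component of zero-cost edges into a
    single vertex and drop the edges inside it."""
    parent = list(range(n))
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    for i, (u, v) in enumerate(edges):
        if r[i] == 0 and s[i] == 0:
            parent[find(u)] = find(v)
    label = {}
    for v in range(n):
        label.setdefault(find(v), len(label))
    kept = [(label[find(u)], label[find(v)], r[i], s[i], g[i])
            for i, (u, v) in enumerate(edges)
            if label[find(u)] != label[find(v)]]
    return len(label), kept
```

On a triangle of zero-cost edges, a gradient summing to zero around the cycle is accepted while a non-zero sum is flagged, matching Definition A.22.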
Lemma A.26.
Consider a smoothed p-norm circulation problem (G, 0), where G = (V, E, r, s, g). Suppose the entries of r and s are quasipolynomially bounded, G is cycle-touching, and ‖g‖_∞ ≤ polylog(n). Then (G, 0) is quasi-polynomially bounded.

Proof. Since every cycle in which a non-zero-gradient edge appears contains a non-zero entry of r or s, increasing flow along any such cycle will eventually lead to a decrease in the objective. □

Remark A.27.
We are analyzing our algorithm in the Real RAM model, but by applying the tools from this section, it can also be analyzed in fixed-precision arithmetic with polylogarithmic bit complexity per number: in this model, our detectUnbounded and constantCycleContraction procedures still work and return a cycle-touching instance. Once an instance is cycle-touching and the non-zero vectors it returns are not too small or too big, it is possible to manage the errors from fixed-point arithmetic. This also allows us to work with inexact Laplacian solvers with quasipolynomial errors, which we can compute in this model in nearly-linear time.

A.5.
Main Sparsification Theorem for Flows.

Theorem A.28 (Instance Sparsification). Consider an instance G = (V_G, E_G, r_G, s_G, g_G) with n vertices and m edges, with r_G and s_G quasipolynomially bounded and ‖g_G‖_∞ ≤ polylog(n), and suppose the instance is cycle-touching. We can compute an instance H = (V_H, E_H, r_H, s_H, g_H) with n vertices and m_H = n · polylog(n/(εδ)) edges, again with r_H and s_H quasipolynomially bounded and ‖g_H‖_∞ ≤ polylog(n), in time Õ(m), such that with probability 1 − ε the maps M_{G→H} and M_{H→G} certify
H ⪯_{κ,δ} G and G ⪯_{κ,δ} H, where κ = m^{1/(p−1)} polylog(n/(εδ)). Furthermore, these maps can be applied in time Õ(m).

Definition A.29. A graph G is an α-uniform φ-expander (or uniform expander, when the parameters are not spelled out explicitly) if:
(1) r is the same on all edges,
(2) s is the same on all edges,
(3) G has conductance at least φ, and
(4) the projection of g onto the cycle space of G, ĝ_G = (I − B L^† B^⊤) g, is α-uniform (see the next definition), where B is the edge-vertex incidence matrix of G, and L = B^⊤ B is the Laplacian.

Definition A.30.
A vector y ∈ R^m is said to be α-uniform if ‖y‖_∞^2 ≤ (α/m) ‖y‖_2^2. We abuse the notation to also let the all-zero vector be 1-uniform.

The next two theorems are from [Kyn+19].

Theorem A.31 (Decomposition into Uniform Expanders). Given any graph/gradient/resistance instance G with n vertices, m edges, ℓ_2-resistances all equal to r, p-weights all equal to s, and gradient g_G, along with a parameter δ, Decompose(G, δ) returns disjoint vertex subsets V_1, V_2, ... in O(m log n log²(n/δ)) time such that, if we let G_1, G_2, ... be the instances obtained by restricting G to the induced graphs on the V_i sets, then at least m/2 edges are contained in these subgraphs, and each G_i satisfies (for some absolute constant c_partition):
(1) The graph (V_{G_i}, E_{G_i}) has conductance at least φ = 1/(c_partition · log n · log²(n/δ)), and degrees at least φ · m/n.
(2) The projection of its gradient g_{G_i} into the cycle space of G_i, ĝ_{G_i}, satisfies one of:
(a) ĝ_{G_i} is O(log n log²(n/δ))-uniform, i.e.,
(ĝ_{G_i}(e))² ≤ (O(log n log²(n/δ)) / m_i) ‖ĝ_{G_i}‖_2² for all e ∈ E(G_i),
where m_i is the number of edges in G_i.
(b) The ℓ_2 norm of ĝ_{G_i} is smaller by a factor of δ than that of the unprojected gradient:
‖ĝ_{G_i}‖_2 ≤ δ · ‖g_G‖_2.

Theorem A.32 (Sampling Uniform Expanders). Given an α-uniform φ-expander G = (V_G, E_G, r_G, s_G, g_G) with m edges and vertex degrees at least d_min, for any sampling probability τ satisfying
τ ≥ c_sample log(n/ε) · (α/m + 1/(φ d_min)),

(Footnotes: We use an instance and its underlying graph interchangeably in our discussion. Here r is uniform on all edges, so conductance is defined as in unweighted graphs.)
We use the standard definition of conductance.For graph G = ( V, E ), the conductance of any ∅ 6 = S ( V is φ ( S ) = δ ( S )min ( vol ( S ) ,vol ( V \ S ) ) where δ ( S ) is the numberof edges on the cut ( S, V \ S ) and vol ( S ) is the sum of the degree of nodes in S . The conductance of a graph is φ G = min S = ∅ ,V φ ( S ). here c sample is some absolute constant, SampleAndFixGradient ( G , τ ) with probability at least − ε returns a partial instance H = ( H, r H , s H , g H ) and maps M G→H and M H→G . The graph H has the same vertex set as G , and H has at most τ m edges. Furthermore, r H = τ · r G and s H = τ p · s G . The maps M G→H and M H→G certify
H (cid:22) κ G and G (cid:22) κ H , where κ = m / ( p − φ − log n This map can be applied in time e O ( m ) . Definition A.33 (2-Rounded instance) . We call an instance G = ( V G , E G , r G , s G , g G ) , -rounded ,if every non-zero entry of s and r has absolute value equal to a power of two (which can benegative).Given an instance G = ( V G , E G , r G , s G , g G ) , we compute a 2-rounded instance G ′ = ( V G , E G , r G ′ , s G ′ , g G ) , .We round each r G e of edges e ∈ E G down to the nearest power of 2 (can be less than 1). Similarly,we round each s G e of edges e ∈ E G down to the nearest power of 2 (can be less than 1). We denotethis rounding procedure by instanceRound , s.t. G ′ = instanceRound ( G ). We will only need toapply this procedure to quasipolynomially bounded numbers, which ensures it can be implementedin logarithmic time in the Real RAM with comparisons. Remark A.34.
Because it is applied to quasipolynomially bounded entries, the rounding can be implemented using a polylogarithmic number of bit operations in fixed-point arithmetic.
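As a concrete illustration, the rounding procedure can be sketched as follows (a minimal sketch; the helper names are ours, not the paper's, and only the power-of-two rounding of Definition A.33 is modeled):

```python
import math

def round_down_pow2(x):
    # Round a positive value down to the nearest power of two; the
    # exponent may be negative (e.g. 0.3 -> 0.25), matching the note
    # that rounded values "can be less than 1".
    return 2.0 ** math.floor(math.log2(x))

def instance_round(r, s):
    # 2-round an instance: every nonzero 2-resistance r_e and p-weight
    # s_e is rounded down to a power of two, so each entry changes by
    # at most a factor of 2 (the 2-approximation of Lemma A.35).
    r2 = [round_down_pow2(v) if v > 0 else 0.0 for v in r]
    s2 = [round_down_pow2(v) if v > 0 else 0.0 for v in s]
    return r2, s2
```

Since each entry shrinks by strictly less than a factor of 2, the identity map certifies the 2-approximation claimed in Lemma A.35.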
Lemma A.35 (2-rounding) . Consider an instance G = ( V, E, r , s , g ) . Let G ′ = instanceRound ( G ) .Then the identity map between the instances certifies G (cid:22) G ′ and G ′ (cid:22) G . and G ′ is -rounded.Proof of Theorem A.28. We are given an instance G = ( V G , E G , r G , s G , g G ) with n vertices and m edges, with r G and s G quasipolynomially bounded, and (cid:13)(cid:13)(cid:13) g G (cid:13)(cid:13)(cid:13) ∞ ≤ polylog( n ) , and cycle-touching.We then compute G ′ = instanceRound ( G ), to get the 2-rounded, cycle-touching instance G ′ =( V G ′ , E G ′ , r G ′ , s G ′ , g G ′ ) with r G ′ and s G ′ quasipolynomially bounded, and (cid:13)(cid:13)(cid:13) g G ′ (cid:13)(cid:13)(cid:13) ∞ ≤ polylog( n ) withusing the identity map between the instances(11) G (cid:22) G ′ and G ′ (cid:22) G . Because G ′ is 2-rounded and r G ′ and s G ′ quasipolynomially bounded, the entries of these twovectors only take on polylog( n ) different values. Thus we can divide the edges E G ′ into polylog( n )buckets such that in every bucket i , all edges have the same 2-resistance value r e = r ( i ) and thesame p -weight value s ( e ) = s ( i ) .Let G ( i ) = ( V G ( i ) , E G ( i ) , r G ( i ) , s G ( i ) , g G ( i ) ) be the instance arising from restricting G ′ to the edgesof bucket i , while letting V G ( i ) to be the set of vertices incident on edges of E G ( i ) . There will beexactly one bucket containing all the edges e with r ( e ) = s ( e ) = 0. We let this bucket have index i = 0. This bucket cannot contain a cycle of edges, because if it did then G ′′ would contain anuntouched cycle, contradiction that it is cycle-touching. Hence G (0) contains at most n − i = 0, which we do not spar-sify. For i >
0, we define $\mathcal{G}^{(i,0)} = \mathcal{G}^{(i)}$, and let $j \leftarrow$
0. As long as G ( i,j ) contains more than n log n edges we then repeat the following: Appealing to Theorem A.31, we now call Decompose ( G ( i,j ) , ˜ δ )with ˜ δ = 2 − log c n δ for some universal constant c large enough that (cid:13)(cid:13)(cid:13) g G (cid:13)(cid:13)(cid:13) ∞ ˜ δ ≤ δ . This producesa partition of V ( i,j ) into disjoint V ( i,j )1 , . . . , V ( i,j ) k . The Decompose algorithm defines G ( i,j ) l to be he instance given by restricting G ( i,j ) to the induced graph on V ( i,j ) l and let G ( i,j +1) be the in-stance arising from restricting G ( i,j ) to the graph consisting of edges crossing between the vertexpartitions, and the vertices incident on these edges. We then let j ← j + 1 and repeat the decom-position if necessary. By Theorem A.31, G ( i,j +1) contains at most half of the edges of G ( i,j +1) , wecall Decompose at most log n times as log ( m/n ) ≤ log( n ). For each bucket i , we let j i denote thelast instance produced, i.e. G ( i,j i )) is the this final instance, which is not included in any expander.For every instance G ( i,j ) , we let m ( i,j ) = (cid:12)(cid:12)(cid:12) E G ( i,j ) (cid:12)(cid:12)(cid:12) denote the number of edges of the graph, and n ( i,j ) = (cid:12)(cid:12)(cid:12) V G ( i,j ) (cid:12)(cid:12)(cid:12) the number of vertices. Note that the edges of G ( i ) are partitioned between the G ( i,j )) l instances and the G ( i,j i )) instances, i.e. each edges of G ( i ) is contained in exactly instance,either a G ( i,j )) l or a G ( i,j i )) . Thus the union of all these in the sense of Definition A.17 is exactly G ( i ) , and the union G ( i ) is G ′ .We will not sparsify the final G ( i,j i ) , which for each bucket i have at most n log n edges. 
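The control flow of this repeated decomposition can be sketched abstractly (the Decompose routine is only a stand-in here; what is taken from the proof is the loop structure and the termination bound, which follows because each round leaves at most half the edges in the crossing part):

```python
import math

def sparsify_bucket(num_edges, n, decompose):
    # Peel off expander pieces until the leftover instance has at most
    # n*log(n) edges. `decompose` models Theorem A.31: it returns
    # (expander_pieces, leftover_edge_count), where the leftover
    # (crossing) edges number at most half the input edges, so the
    # loop runs at most log2(m/n) <= log2(n) times.
    pieces, rounds = [], 0
    while num_edges > n * math.log(n):
        expanders, num_edges = decompose(num_edges)
        pieces.extend(expanders)
        rounds += 1
    return pieces, num_edges, rounds
```

With a mock decomposition that halves the edge count each round, starting from $m = 1024$ edges on $n = 16$ vertices the loop stops after five rounds.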
Again,for every instance G ( i,j ) l , we let m ( i,j ) l = (cid:12)(cid:12)(cid:12)(cid:12) E G ( i,j ) l (cid:12)(cid:12)(cid:12)(cid:12) denote the number of edges of the graph.By Theorem A.31, • The graph associated with each G ( i,j ) l has conductance at least φ ≥ C · log n · log (cid:16) n/ ˜ δ (cid:17) , forsome universal constant C . • For each G ( i,j ) l either – (“Uniform case”) the instance is α -uniform with α ≤ C log n log ( n/ ˜ δ ) for someuniversal constant C . – (“Small case”) The ℓ norm of b g G ( i,j ) l , the gradient projected to the cycle space, issmaller by a factor of ˜ δ than the unprojected gradient of the original graph (cid:13)(cid:13)(cid:13)(cid:13)b g G ( i,j ) l (cid:13)(cid:13)(cid:13)(cid:13) ≤ ˜ δ · (cid:13)(cid:13)(cid:13)b g G ( i,j ) (cid:13)(cid:13)(cid:13) ≤ ˜ δ · (cid:13)(cid:13)(cid:13)b g G (cid:13)(cid:13)(cid:13) . • The minimum degree of each G ( i,j ) l graph is at least φ · m ( i,j ) n ( i,j ) .We let ˜ ε = ε/m . For instances G ( i,j ) l in the “Uniform case”, we let τ ( i,j ) l = c sample log( n/ ˜ ε ) αm ( i,j ) l + 1 φ d min ≤ C log( n/ ˜ ε ) log ( n/ ˜ δ ) m ( i,j ) l + n ( i,j ) m ( i,j ) for some universal constant C , and we call SampleAndFixGradient ( G ( i,j ) l , τ ( i,j ) l ), which returnsinstance H ( i,j ) l and maps M G ( i,j ) l →H ( i,j ) l and M H ( i,j ) l →G ( i,j ) l .For instances G ( i,j ) l in the “small case”, we define G ′ ( i,j ) l to be the instance G ( i,j ) l with all gradiententries set to zero. We let τ ( i,j ) l = c sample log( n/ ˜ ε ) 1 φ d min ≤ C log( n/ ˜ ε ) log ( n/ ˜ δ ) n ( i,j ) m ( i,j ) And call
SampleAndFixGradient ( G ′ ( i,j ) l , τ ( i,j ) l ) which again returns an instance H ( i,j ) l and maps M G ( i,j ) l →H ( i,j ) l and M H ( i,j ) l →G ( i,j ) l .The instance H ( i,j ) l has the same vertex set as G ( i,j ) l and at most τ ( i,j ) l m ( i,j ) l edges.In the “uniform case”, with probability at least 1 − ˜ ε , the maps certify H ( i,j ) l (cid:22) κ G ( i,j ) l and G ( i,j ) l (cid:22) κ H ( i,j ) l , here κ ≤ m / ( p − log (cid:16) n/ ˜ δ (cid:17) , directly by Theorem A.32. In the “small case”, with probabilityat least 1 − ˜ ε , the maps certify with the same κ that H ( i,j ) l (cid:22) κ,δ G ( i,j ) l and G ( i,j ) l (cid:22) κ,δ H ( i,j ) l , (12)because by Theorem A.32, H ( i,j ) l (cid:22) κ G ′ ( i,j ) l and G ′ ( i,j ) l (cid:22) κ H ( i,j ) l , and our rounding of the gradients implies (certified by the identity map) that G ( i,j ) l (cid:22) , ˜ δ k g k ∞ G ′ ( i,j ) l and G ′ ( i,j ) l (cid:22) , ˜ δ k g k ∞ G ( i,j ) l , with finally our choice of ˜ δ = 2 log c ( n ) δ for some large enough universal constant c ensures ˜ δ k g k ∞ ≤ δ , and by Lemma A.16, the guarantees compose to give Equation (12). By Lemma A.16, theguarantees compose and we get that the total edge count in the sampled instances H ( i,j ) l associatedwith G ( i,j ) is bounded by X l τ ( i,j ) l m ( i,j ) l ≤ X l C log( n/ ˜ ε ) log ( n/ ˜ δ ) n ( i,j ) m ( i,j ) l m ( i,j ) ≤ C log( n/ ˜ ε ) log ( n/ ˜ δ ) n ( i,j ) ≤ C log( n/ ˜ ε ) log ( n/ ˜ δ ) n. Since the number of buckets (indexed by i ) is bounded by polylog( n ) and the number Decompose calls for each bucket (indexed by j ) is bounded by log( n ), we get that the total number of edgessummed across all the H ( i,j ) l for all i, j, l is bounded by n log(1 / ˜ ε ) log (1 / ˜ δ ) polylog( n )Summed over all i , the total number of edges in the G ( i,j i ) instances is n polylog( n ) . 
We return a sparsifier instance $\mathcal{H}$ consisting of the union over all $i$ and $j$ of the $\mathcal{H}^{(i,j)}_l$, the $\mathcal{G}^{(i,j_i)}$, and $\mathcal{G}^{(0)}$ (the bucket of edges whose 2-weights and $p$-weights are both zero), with a total number of edges bounded by
$$n \log(1/\tilde{\varepsilon}) \log^2(1/\tilde{\delta}) \operatorname{polylog}(n) \leq n \operatorname{polylog}(n/(\varepsilon\delta)).$$
Because the union of the original instances $\mathcal{G}^{(i,j)}_l$, $\mathcal{G}^{(i,j_i)}$, and $\mathcal{G}^{(0)}$ gives us $\mathcal{G}'$, by Lemma A.18,
$$\mathcal{G}' \preceq_{\kappa,\delta} \mathcal{H} \quad \text{and} \quad \mathcal{H} \preceq_{\kappa,\delta} \mathcal{G}',$$
where again $\kappa \leq m^{1/(p-1)} \log\big(n/\tilde{\delta}\big) \leq m^{1/(p-1)} \operatorname{polylog}(n/\delta)$. Then, because Equation (11) holds using the identity map between $\mathcal{G}$ and $\mathcal{G}'$, we have $\mathcal{G} \preceq_{\kappa,\delta} \mathcal{H}$ and $\mathcal{H} \preceq_{\kappa,\delta} \mathcal{G}$.
We sparsified at most $m$ different expanders, since each contains a distinct edge of $\mathcal{G}$, and each sparsification fails with probability at most $\tilde{\varepsilon} = \varepsilon/m$, so by a union bound, the probability that none of the sparsifications fail is at least $1 - \varepsilon$. That the weights of $\mathcal{H}$ are quasipolynomially bounded follows from the explicit weights given in Theorem A.32. The overall time bound to apply the union map also follows immediately from Theorem A.32. $\square$

Proof of Theorem A.1.
We will use Theorem A.28. This requires that the instance is cycle-touching, so we first convert our instance $\mathcal{G}$ to $\mathcal{G}'$ using Lemma A.25. We thus have maps $\mathcal{M}_{\mathcal{G}\to\mathcal{G}'}$ and $\mathcal{M}_{\mathcal{G}'\to\mathcal{G}}$ that can be applied in $O(m)$ time, where $\|\mathcal{M}_{\mathcal{G}\to\mathcal{G}'}\|_{1\to1} \leq 1$, $\mathcal{G}' \preceq^{\mathrm{cycle}}_{1,0} \mathcal{G}$, and $\mathcal{G} \preceq^{\mathrm{cycle}}_{1,0} \mathcal{G}'$ (since we are solving the residual problem, our demand vector is 0). We can apply Theorem A.28 to $\mathcal{G}'$ to get an instance $\mathcal{H}$ with at most $m_{\mathcal{H}} = n \operatorname{polylog}(n/(\varepsilon\delta))$ edges, maps $\mathcal{M}'$ that can be computed in $\widetilde{O}(m)$ time, and $\mathcal{H} \preceq^{\mathrm{cycle}}_{\kappa,\delta} \mathcal{G}'$ and $\mathcal{G}' \preceq^{\mathrm{cycle}}_{\kappa,\delta} \mathcal{H}$, where $\kappa = m^{1/(p-1)} \operatorname{polylog}(n/(\varepsilon\delta))$. Now, to go from $\mathcal{G}$ to $\mathcal{H}$, we compose these two approximations, and we thus have from Lemma A.16, $\mathcal{G} \preceq^{\mathrm{cycle}}_{\kappa,\,\delta\|\mathcal{M}_{\mathcal{G}\to\mathcal{G}'}\|_{1\to1}} \mathcal{H}$ and $\mathcal{H} \preceq^{\mathrm{cycle}}_{\kappa,\delta} \mathcal{G}$. Finally, as $\|\mathcal{M}_{\mathcal{G}\to\mathcal{G}'}\|_{1\to1} \leq 1$, this completes our proof. We remark that any quasipolynomial blow-up in this error would also be acceptable. $\square$
Appendix B. Sparsification for General $\ell_2^2 + \ell_p^p$ Objectives Using Lewis Weights
We will prove Theorem 2.5.

B.1. Leverage Scores and Lewis Weights.
For $\alpha \geq 1$, and $x, y > 0$, we say $x \approx_\alpha y$ if $\alpha^{-1} x \leq y \leq \alpha x$. The statistical leverage score of a row $\boldsymbol{a}_i$ of a matrix $\boldsymbol{A}$ is defined as
$$\tau_{2,i}(\boldsymbol{A}) \stackrel{\text{def}}{=} \boldsymbol{a}_i^{\top} (\boldsymbol{A}^{\top}\boldsymbol{A})^{-1} \boldsymbol{a}_i = \left\| (\boldsymbol{A}^{\top}\boldsymbol{A})^{-1/2} \boldsymbol{a}_i \right\|_2^2.$$
The generalization of statistical leverage scores to $\ell_p$-norms is given by $\ell_p$ Lewis weights [Lew78], which are defined as follows:
Definition B.1.
For a matrix $\boldsymbol{A}$ and for $p \geq 1$, we define the $\ell_p$ Lewis weights $\{\tau_{p,i}\}_i$ to be the unique weights such that
$$\tau_{p,i} = \tau_{2,i}\big(\mathbf{Diag}(\tau_{p,i})^{1/2 - 1/p} \boldsymbol{A}\big).$$
Equivalently,
$$\boldsymbol{a}_i^{\top} \big( \boldsymbol{A}^{\top} \mathbf{Diag}(\tau_{p,i})^{1 - 2/p} \boldsymbol{A} \big)^{-1} \boldsymbol{a}_i = \tau_{p,i}^{2/p}.$$
When the matrix $\boldsymbol{A}$ is not obvious from the context, we will denote the Lewis weights by $\tau_{p,i}(\boldsymbol{A})$. We use $\widetilde{\tau}_{p,i}$ to denote $\beta$-approximate Lewis weights, i.e., $\widetilde{\tau}_{p,i} \approx_\beta \tau_{p,i}$.

Lemma B.2 (Foster's Theorem [Fos+53]). For any matrix $\boldsymbol{A} \in \mathbb{R}^{m \times n}$, $m \geq n$, we have $\sum_i \tau_{2,i}(\boldsymbol{A}) = \mathrm{rank}(\boldsymbol{A}) \leq n$.

As a simple corollary, we get that the $\ell_p$ Lewis weights also sum to at most $n$.

Corollary B.3.
For any matrix $\boldsymbol{A} \in \mathbb{R}^{m \times n}$, $m \geq n$, and any $p$, we have $\sum_i \tau_{p,i}(\boldsymbol{A}) \leq n$.

Proof.
By definition and existence of the Lewis weights, X i τ p,i ( A ) = X i τ ,i ( Diag (cid:0) τ p,i (cid:1) / − /p A ) = rank ( Diag (cid:0) τ p,i (cid:1) / − /p A ) ≤ n, where the second equality follows from Lemma B.2. (cid:3) s we will see, having access to τ ,i would allow us to determine a spectral approximation to A , though with many fewer rows. Unfortunately, the naive approach to calculating them requirescomputing ( A ⊤ A ) + , which would defeat the purpose of finding a smaller spectral approximation inthe first place. Thus, the key insight of work by [Coh+15] is that a certain uniform sampling-based approach is sufficient to determine approximate leverage scores, as established by the followingimportant lemma. Lemma B.4 (Lemma 7, [Coh+15]) . Given matrix A , p ∈ [2 , , θ < , and a matrix B containing O ( n log( n )) rescaled rows of A , there is an algorithm that, w.h.p. in n , computes n θ -approximate τ ,i for A in time O (( LSS ( B ) + nnz( A )) θ − ) , where LSS ( B ) is the time required to solve a linearequation in B ⊤ B to quasipolynomial accuracy. While the previous lemma provides approximations to τ ,i , it was later shown by [CP15] that infact we may use such a routine as a black-box for determining approximations τ p,i ( A ), for p ∈ [2 , Lemma B.5 (Lemma 2.4, [CP15]) . For any fixed p < , given a routine ApproxLeverageScores for computing, with high probability in n , β -approximate statistical leverage scores of rows of ma-trices of the form W A for β = n θ −| p/ − | p , there is an algorithm ApproxLewisWeights n θ -approximate ℓ p Lewis weights for A with O (cid:16) log( θ − )1 −| p/ − | (cid:17) calls to ApproxLeverageScores . Combining these two lemmas, we arrive at the following overall computational cost for finding e τ p,i . Theorem B.6.
Given matrix A , p ∈ [2 , , θ < , there is an algorithm that computes n θ -approximate τ p,i for A in time O (cid:18) p (1 − | p/ − | ) ( LSS ( B ) + nnz( A )) θ − log( θ − ) (cid:19) , where LSS ( B ) is the time required to solve a linear equation in B ⊤ B to quasipolynomial accuracyfor some matrix B containing O ( n log( n )) rescaled rows of A .Proof. The theorem follows immediately from Lemmas B.4 and B.5. (cid:3)
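For intuition, the fixed point of Definition B.1 can be computed on small dense matrices by the contractive iteration of [CP15] (a numerical sketch using exact leverage-score computations via a pseudoinverse, not the fast approximate routine of Theorem B.6):

```python
import numpy as np

def lewis_weights(A, p, iters=80):
    # Cohen-Peng iteration for p in [2, 4): repeatedly set
    #   w_i <- ( a_i^T (A^T W^{1-2/p} A)^{-1} a_i )^{p/2},
    # whose fixed point satisfies tau_{2,i}(W^{1/2-1/p} A) = w_i,
    # i.e. the defining equation of the ell_p Lewis weights.
    m, _ = A.shape
    w = np.ones(m)
    for _ in range(iters):
        W = w ** (1.0 - 2.0 / p)
        G = np.linalg.pinv(A.T @ (W[:, None] * A))
        q = np.einsum('ij,jk,ik->i', A, G, A)  # row quadratic forms
        w = q ** (p / 2.0)
    return w
```

By Corollary B.3 the weights sum to at most $n$; for a full-rank $\boldsymbol{A}$ with all weights positive the sum is exactly $\mathrm{rank}(\boldsymbol{A})$, and taking $p = 2$ recovers the ordinary leverage scores.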
Lemma B.7 ( ℓ Matrix Concentration Bound (Lemma 4, [Coh+15])) . There exists an absoluteconstant C such that for any matrix A ∈ R m × n , and any set of sampling values ν i satisfying ν i ≥ τ ,i ( A ) · C ε − log n, if we generate a matrix S with N = P i ν i rows, each chosen independently as the i th standard basisvector times √ ν i , with probability ν i N , then with probability at least − n Ω(1) we have ∀ x ∈ R n , k S Ax k ≈ ε k Ax k . Lemma B.8 ([BLM89], [CP15](Lemma 7.1)) . For p ≥ , there exists an absolute constant C p suchthat for any matrix A ∈ R m × n , and any set of sampling values ν i satisfying ν i ≥ τ p,i ( A ) · C p n p − ε − log n log / ε , if we generate a matrix S with N = P i ν i rows, each chosen independently as the i th standard basisvector times ν /pi , with probability ν i N , then with probability at least − n Ω(1) we have ∀ x ∈ R n , k S Ax k p ≈ ε k Ax k p . Lemma B.9.
For p ∈ [2 , , given matrices C , D ∈ R m × n , there exist ν i > , i ∈ [ m ] with N = P i ν i ≤ O (1) n p / log n such that, if we generate a matrix S with N rows, each chosen independently s the i th standard basis vector times ν /pi with probability ν i N , then we can compute a diagonal matrix R ∈ R N × N ≥ such that with probability at least − n Ω(1) , ∀ x ∈ R n , k RS C x k ≈ k C x k and k S D x k p ≈ k D x k p . Proof.
Let e τ ,i ( C ) be 2-approximate leverage scores of C and e τ p,i ( D ) be 2-approximate ℓ p Lewisweights for D . Define ν i = C ,p max ne τ ,i ( C ) · log n, e τ p,i ( D ) · n p − log n o , where C ,p is a large enough absolute constant we specify later. Since P i e τ ,i ( C ) ≤ P i τ ,i ( C ) ≤ n and P i e τ p,i ( D ) ≤ P i τ p,i ( D ) ≤ n from Corollary B.3, we get N = P i ν i ≤ O ( C ,p ) n p − log n. Let S be as defined in the lemma statement, i.e. S ab = ν /pb if b th basis vector is chosen for row a, . Let us assume for row a , we have chosen the b th basis vector. Now define the diagonal matrix R as R aa = ν p − b . Note that e S = RS is a matrix with N rows, each chosen independently as the i th standard basisvector times ν / i with probability ν i N . We can pick C ,p large enough so that ν i ≥ τ ,i ( C ) · C log n, and we can apply Lemma B.7 for some constant ε < − n Ω(1) , we have(13) ∀ x ∈ R n , k RS C x k = k e S C x k ≈ k C x k . Similarly, we can pick C ,p large enough so that we have ν i ≥ τ p,i ( D ) · n p − log n . Thus, usingLemma B.8, we get that with probability at least 1 − n Ω(1) , we have(14) ∀ x ∈ R n , k S D x k p ≈ k D x k p . Combining the above two claims, and applying a union bound, we obtain our lemma. (cid:3)
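The sampling matrix $\boldsymbol{S}$ and rescaling matrix $\boldsymbol{R}$ from this proof can be constructed mechanically. A sketch (the vector $\nu$ is assumed precomputed from approximate leverage scores and Lewis weights; the normalization is our reading of Lemmas B.7 and B.8):

```python
import numpy as np

def sampling_matrices(nu, p, rng):
    # N = ceil(sum(nu)) rows; row a picks basis vector b with
    # probability nu_b / sum(nu) and carries scale nu_b^{-1/p}, the
    # unbiased ell_p sampler (Lemma B.8). The diagonal R rescales row
    # a by nu_b^{1/p - 1/2}, so R @ S has nonzero entries nu_b^{-1/2},
    # the unbiased ell_2 sampler (Lemma B.7).
    nu = np.asarray(nu, dtype=float)
    N = int(np.ceil(nu.sum()))
    idx = rng.choice(len(nu), size=N, p=nu / nu.sum())
    S = np.zeros((N, len(nu)))
    S[np.arange(N), idx] = nu[idx] ** (-1.0 / p)
    R = np.diag(nu[idx] ** (1.0 / p - 0.5))
    return S, R
```

The same random row choices thus serve both norms at once, which is exactly what the lemma needs.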
Lemma B.10.
Let p ∈ [2 , , let M , N , A be matrices such that M ∈ R m × n , N ∈ R m × n , m , m ≥ n , and A ∈ R d × n , d ≤ n , and consider the problem min ∆ ∆ ⊤ M ⊤ M ∆ + k N ∆ k pp (15) s.t. A ∆ = c , with optimum at ∆ ∗ . Then, with high probability we may compute f M , f N ∈ R O ( n p/ log( n )) × n suchthat, for a κ -approximate solution e ∆ to the problem min ∆ ∆ ⊤ f M ⊤ f M ∆ + k f N ∆ k pp (16) s.t. A ∆ = c with optimum at e ∆ ∗ , e ∆ is a O ( κ ) -approximate solution to (15) .Proof. Let f M = RS M and f N = S N be as provided by Lemma B.9. It follows that e ∆ ⊤ M ⊤ M e ∆ + k N e ∆ k pp ≤ p (cid:18) e ∆ ⊤ f M ⊤ f M e ∆ + k f N e ∆ k pp (cid:19) ≤ p κ (cid:18) e ∆ ∗⊤ f M ⊤ f M e ∆ ∗ + k f N e ∆ ∗ k pp (cid:19) ≤ p κ (cid:18) ∆ ∗⊤ f M ⊤ f M ∆ ∗ + k f N ∆ ∗ k pp (cid:19) p κ (cid:16) ∆ ∗⊤ M ⊤ M ∆ ∗ + k N ∆ ∗ k pp (cid:17) ≤ κ (cid:16) ∆ ∗⊤ M ⊤ M ∆ ∗ + k N ∆ ∗ k pp (cid:17) , where the last inequality follows from our bound on p . (cid:3) We now recall the main Lewis weights-based sparsification result, Theorem 2.6 which was provenin Section 3.1.This result gives us the following corollaries which distinguish the general problem from the morestructured graph problem, whereby the latter may take advantage of fast Laplacian solvers. Notethat, for Theorem 2.6 and its corollaries, we use ˜ O p ( · ) to suppress a (1 − | p/ − | ) − term, whichwill become large as p approaches 4. Proof of Theorem 2.5.
Follows from Theorem B.6 and Lemma B.10. $\square$
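As a small end-to-end illustration of the $\ell_2$ half of this sparsification (exact leverage scores and a generous illustrative oversampling constant; the paper's algorithm instead uses the fast approximations of Theorem B.6 combined with $\ell_p$ Lewis-weight sampling):

```python
import numpy as np

rng = np.random.default_rng(7)
m, n = 4000, 4
M = rng.standard_normal((m, n))

# Exact leverage scores tau_{2,i} = a_i^T (M^T M)^{-1} a_i.
G = np.linalg.pinv(M.T @ M)
tau = np.einsum('ij,jk,ik->i', M, G, M)

# Keep row i independently with probability q_i, rescaled by
# q_i^{-1/2} so that E[Ms^T Ms] = M^T M (a Bernoulli variant of
# Lemma B.7; the factor 50 is an illustrative oversampling constant).
q = np.minimum(1.0, 50.0 * np.log(m) * tau)
keep = rng.random(m) < q
Ms = M[keep] / np.sqrt(q[keep])[:, None]
```

With this much oversampling the sampled matrix preserves $\|\boldsymbol{M}x\|_2$ up to a small constant factor while using far fewer rows.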
Corollary B.11 (General matrix setting) . Consider (3) for arbitrary M ∈ R m × n , N ∈ R m × n .Then, for p ∈ [2 , , with high probability, we can find an ε -approximate solution in time ˜ O p (cid:18) nnz( M ) + nnz( N ) + (cid:16) nnz( f M ) + nnz( f N ) + n ω (cid:17) n p ( p − p − (cid:19) log (1 /ε ) ! , for some f M and f N each containing O ( n p/ log( n )) rescaled rows of M and N , respectively. Corollary B.12 (Graph setting) . Consider (3) for M ∈ R m × n , N ∈ R m × n , m , m ≥ n , givenas the edge-vertex incidence matrices for some graphs. Then, for p ∈ [2 , , with high probability,we can find an ε -approximate solution to (3) in time ˜ O p (cid:18) m + m + n p (cid:16) p − p − (cid:17) (cid:19) log (1 /ε ) ! . Appendix C. Width-Reduced Approximate Solver for ℓ + ℓ pp Problems
We will solve problems of the form,min ∆ ∆ ⊤ M ⊤ M ∆ + k N ∆ k pp (17) s.t. A ∆ = c , which have an optimum at most ν . We first scale the problem down to a new problem withoptimum at most 1. Note that there exists ∆ ⋆ such that A ∆ ⋆ = c and ∆ ⋆ ⊤ M ⊤ M ∆ ⋆ + k N ∆ ⋆ k pp ≤ ν . Let ˜ M = ν − p − M and e ∆ = ν − /p ∆ ⋆ . The following problem has optimum at most 1 since e ∆is a feasible solution. min ∆ ∆ ⊤ ˜ M ⊤ ˜ M ∆ + k N ∆ k pp (18) s.t. A ∆ = ν − /p c , Now, let ¯∆ denote a feasible solution such that ¯∆ ⊤ ˜ M ⊤ ˜ M ¯∆ ≤ α and k N ∆ k pp ≤ β . Note that∆ = ν /p ¯∆ satisfies the constraints of (17) and,∆ ⊤ M ⊤ M ∆ ≤ αν, and , k N ∆ k pp ≤ βν. It is thus sufficient to solve Problem (18) to an α, β approximation. We thus have the followingresult which follows from Theorem C.1, heorem 3.4. Let p ≥ . Consider an instance of Problem (5) described by matrices A ∈ R d × n , N ∈ R m × n , M ∈ R m × n , d ≤ n ≤ m , m , and vector c ∈ R d . If the optimum of thisproblem is at most ν , Procedure Residual-Solver (Algorithm 3) returns an x such that Ax = c , and x ⊤ M ⊤ M x ≤ O (1) ν and k N x k pp ≤ O (3 p ) ν . The algorithm makes O pm p − p − ! calls to alinear system solver. C.1.
Solving Scaled Problem.
We will show that we can solve problems of the formmin ∆ ∆ ⊤ M ⊤ M ∆ + k N ∆ k pp (19) s.t. A ∆ = c , which have an optimum value at most 1. We will use the following oracle in our algorithm. Algorithm 2
Oracle procedure Oracle ( A , M , N , c , w ) r e ← w p − e Compute, ∆ = arg min A ∆ ′ = c m p − p ∆ ′⊤ M ⊤ M ∆ ′ + 13 p − X e r e (cid:16) N ∆ ′ (cid:17) e return ∆We can use now use Algorithm 4 from [Adi+19].Notation. We will use ∆ ⋆ to denote the optimum of (19) and e ∆ to denote the solution returned bythe oracle (Algorithm 2). We thus have, • ∆ ⋆ ⊤ M ⊤ M ∆ ⋆ ≤ • k N ∆ ∗ k p ≤ • r e ≥ , ∀ e .We will prove the following main theorem: Theorem C.1.
Let p ≥ . Given matrices A ∈ R d × n , N ∈ R m × n , M ∈ R m × n , m , m ≥ n , d ≤ n , and vector c , Algorithm 3 uses O pm p − p − ! , calls to the oracle (Algorithm 2) and returnsa vector x such that Ax = c , x ⊤ M ⊤ M x ≤ O (1) and k N x k pp = O (3 p ) . Analysis of Algorithm 3.
Similar to [Adi+19] we will track two potentials Φ and Ψ which wedefine as, Φ (cid:16) w ( i ) (cid:17) def = k w k pp Ψ( r ) def = min ∆: A ∆= c m p − p ∆ ⊤ M ⊤ M ∆ + 13 p − X e r e ( N ∆) e . Note that these potentials have a similar idea as [Adi+19] but are defined differently. Our proofwill follow the following structure,(1) Provided the total number of width reduction steps, K , is not too big, Φ( · ) is small. Thisin turn helps upper bound the value of the solution returned by the algorithm.(2) Showing that K cannot be too big, because each width reduction step cause large growthin Ψ( · ), while we can bound the total growth in Ψ( · ) by relating it to Φ( · ). lgorithm 3 Algorithm for the Scaled down Problem procedure Residual-Solver ( A , M , N , c ) w (0 , e ← x ← ρ ← Θ m ( p − p +2) p (3 p − ! ⊲ width parameter β ← Θ (cid:18) m p − p − (cid:19) ⊲ resistance threshold α ← Θ p − m − p − p +2 p (3 p − ! ⊲ step size τ ← Θ m ( p − p − p − ! ⊲ ℓ p energy threshold T ← α − m /p = Θ (cid:18) pm p − p − (cid:19) i ← , k ← while i < T do ∆ = Oracle ( A , M , N , c , w ) if k N ∆ k pp ≤ τ then ⊲ flow step w ( i +1 ,k ) ← w ( i,k ) + α | N ∆ | x ← x + α ∆ i ← i + 1 else ⊲ width reduction step For all edges e with | N ∆ | e ≥ ρ and r e ≤ β w ( i,k +1) e ← p − w e k ← k + 1 return m − p x We start by proving some results that we need to prove our final result, Theorem C.1. Theproofs of all lemmas are in Section C.2.
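The control flow of Algorithm 3 can be summarized in a short sketch. Here the `oracle` argument stands in for Algorithm 2's regularized quadratic solve and returns both the update $\Delta$ and its congestion $\boldsymbol{N}\Delta$, and the parameter values are passed in rather than fixed to the paper's exact exponents:

```python
import numpy as np

def residual_solver(oracle, m, p, T, alpha, tau, rho, beta, n):
    # Width-reduced multiplicative-weights loop: flow steps advance x
    # and grow the weights w; width-reduction steps boost the weight
    # (hence the resistance r_e = w_e^{p-2}) of high-congestion edges.
    w = np.ones(m)
    x = np.zeros(n)
    i = 0
    while i < T:
        delta, cong = oracle(w)   # cong = N @ delta, edge congestion
        if np.sum(np.abs(cong) ** p) <= tau:    # flow step
            w += alpha * np.abs(cong)
            x += alpha * delta
            i += 1
        else:                                   # width reduction step
            r = w ** (p - 2)
            heavy = (np.abs(cong) >= rho) & (r <= beta)
            w[heavy] *= 2.0 ** (1.0 / (p - 2))  # doubles r_e
    return m ** (-1.0 / p) * x                  # final rescaling
```

The analysis below bounds the number of width-reduction iterations, so the loop terminates after roughly $T$ oracle calls.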
Lemma C.2.
Let $p \geq 2$. For any $w$, let $\widetilde{\Delta}$ be the solution returned by Algorithm 2. Then,
$$\sum_e (\boldsymbol{N}\widetilde{\Delta})_e^2 \leq \sum_e r_e (\boldsymbol{N}\widetilde{\Delta})_e^2 \leq \|w\|_p^{p-2}.$$
We next show through the following lemma that the $\Phi$ potential does not increase too rapidly. The proof is through induction and can be found in the Appendix.
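The second inequality in Lemma C.2 rests on Hölder's inequality with exponents $p/2$ and $p/(p-2)$, which can be sanity-checked numerically (random data; `v` stands in for $\boldsymbol{N}\Delta^{\star}$, normalized so that $\|v\|_p = 1$ as in the lemma's hypothesis):

```python
import numpy as np

rng = np.random.default_rng(3)
p = 4.0
w = rng.uniform(0.1, 2.0, size=50)
v = rng.uniform(-1.0, 1.0, size=50)          # stands in for N @ Delta*
v /= np.sum(np.abs(v) ** p) ** (1.0 / p)     # normalize so ||v||_p = 1

# Holder: sum w^{p-2} v^2 <= ||v||_p^2 * ||w||_p^{p-2}.
lhs = np.sum(w ** (p - 2) * v ** 2)
rhs = np.sum(np.abs(v) ** p) ** (2.0 / p) * np.sum(w ** p) ** ((p - 2) / p)
```

With $\|v\|_p = 1$ the right-hand side collapses to $\|w\|_p^{p-2}$, which is exactly the bound used in the proof.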
Lemma C.3.
After i flow steps, and k width-reduction steps, provided(1) p p α p τ ≤ pαm p − p , (controls Φ growth in flow-steps)(2) k ≤ − pp − ρ m /p β − p − ,(acceptable number of width-reduction steps)the potential Φ is bounded as follows: Φ (cid:16) w ( i,k ) (cid:17) ≤ (cid:16) αi + m / p (cid:17) p exp pp − kρ m /p β − p − . We next show how the potential Ψ changes with a change in resistances. The proof is in theAppendix. emma C.4. Let e ∆ = arg min A ∆= c m p − p ∆ ⊤ M ⊤ M ∆ + p − P e r e ( N ∆) e . Then one has for any r ′ and r such that r ′ ≥ r , Ψ( r ′ ) ≥ Ψ( r ) + X e (cid:18) − r e r ′ e (cid:19) r e ( N e ∆) e . The next lemma gives a lower bound on the energy in the beginning and an upper bound on theenergy at each step.
Lemma C.5.
Initially, we have, Ψ (cid:16) r (0 , (cid:17) ≥ k M + N k k c k k A k = L, where k M + N k min = min Ax = c k M x k + k N x k and k A k is the operator norm of A . Moreover,at any step ( i, k ) we have, Ψ (cid:16) r ( i,k ) (cid:17) ≤ m p − p + 13 p − Φ( i, k ) p − p . We next bound the change in energy in a flow step and a width reduction step. This lemma isdirectly from [Adi+19] and the proof is also very similar. We include it here for completeness.
Lemma C.6.
Suppose at step ( i, k ) we have (cid:13)(cid:13)(cid:13) N e ∆ (cid:13)(cid:13)(cid:13) pp > τ so that we perform a width reductionstep (line 17). If(1) τ /p ≥ Ω(1) Ψ( r ) β , and(2) τ ≥ Ω(1)Ψ( r ) ρ p − .Then Ψ (cid:16) r ( i,k +1) (cid:17) ≥ Ψ (cid:16) r ( i,k ) (cid:17) + Ω(1) τ /p . Furthermore, if at ( i, k ) we have (cid:13)(cid:13)(cid:13) N e ∆ (cid:13)(cid:13)(cid:13) pp ≤ τ so that we perform a flow step, then Ψ (cid:16) r ( i,k +1) (cid:17) ≥ Ψ (cid:16) r ( i,k ) (cid:17) . Proof of Theorem C.1.
Proof.
We begin by setting all our parameter values. • α ← Θ p − m − p − p +2 p (3 p − ! • τ ← Θ m ( p − p − p − ! • β = Θ (cid:18) m p − p − (cid:19) • ρ = Θ m ( p − p +2) p (3 p − ! Note that the above values satisfy the relations p p α p τ = pαm p − p .Let m − /p x be the solution returned by Algorithm 3. Note that this satisfies the linear constraintrequired. We will now bound the values of m − /p x ⊤ M ⊤ M x and m − k N x k pp . If the algorithm erminates in T = α − m /p flow steps and K ≤ − pp − ρ m /p β − p − width reduction steps, thenfrom Lemma C.3, Φ (cid:16) w ( T,K ) (cid:17) ≤ O (3 p ) m e = O (3 p m )Note that throughout the algorithm w ≥ | N x | . This means that the algorithm returns m − p x with m − k N x k pp ≤ m (cid:13)(cid:13)(cid:13) w ( T,K ) (cid:13)(cid:13)(cid:13) pp = 1 m Φ (cid:16) w ( T,K ) (cid:17) ≤ O (3 p ) . To bound the other term, let e ∆ ( t ) denote the solution returned by the oracle in iteration t . Notethat, since Φ ≤ O (3 p ) m for all iterations, we always have Ψ( r ) ≤ O (1) m p − p . We claim that (cid:16) e ∆ ( t ) (cid:17) ⊤ M ⊤ M e ∆ ( t ) ≤ O (1) for all t . To see this, note that from Lemma C.5, m p − p (cid:16) e ∆ ( t ) (cid:17) ⊤ M ⊤ M e ∆ ( t ) ≤ Ψ( r ) ≤ O (1) m p − p . We also know that x = P t αp e ∆ ( t ) . Combining this and the convexity of k x k , we get m − /p k M x k ≤ α m − /p T X t k M e ∆ ( t ) k ≤ α m − /p T O (1) ≤ O (1) . This concludes the first part of the proof that if the number of width reduction steps are bounded,then we return a solution with the required values. We will now show that we cannot have morewidth reduction steps.Suppose to the contrary, the algorithm takes a width reduction step starting from step ( i, k ) where i < T and k = 2 − pp − ρ m /p β − p − . Since the conditions for Lemma C.3 hold for all preceding steps,we must have Φ (cid:16) w ( i,k ) (cid:17) ≤ O (3 p ) m . 
We note that our parameter values satisfy τ /p ≥ Ω(1) Ψ β and τ ≥ Ω(1) ρ p − Ψ since Ψ ≤ O (1) m p − p .This means that at every step ( j, l ) preceding the current step, the conditions of Lemma C.6 aresatisfied, so we can prove by a simple induction thatΨ (cid:16) r ( i,k +1) (cid:17) ≥ Ψ (cid:16) r (0 , (cid:17) + Ω(1) τ /p k. Since our parameter choices ensure τ /p k > Θ( m ),Ψ (cid:16) r ( i,k +1) (cid:17) − Ψ (cid:16) r (0 , (cid:17) > Ω( m ) . Since Φ (cid:16) w ( i,k ) (cid:17) ≤ O (3 p ) m , Ψ (cid:16) r ( i,k +1) (cid:17) − Ψ (cid:16) r (0 , (cid:17) ≤ O (cid:18) m p − p (cid:19) , which is a contradiction. We can thus conclude that we can never have more than K = 2 − pp − ρ m /p β − p − width reduction steps, thus concluding the correctness of the returned solution. We next boundthe number of oracle calls required. The total number of iterations is at most, T + K ≤ α − m /p + 2 − p/ ( p − ρ m /p β − p − ≤ O (cid:18) pm p − p − (cid:19) . (cid:3) .2. Missing Proofs.Lemma C.2.
Let $p \geq 2$. For any $w$, let $\widetilde{\Delta}$ be the solution returned by Algorithm 2. Then,
$$\sum_e (\boldsymbol{N}\widetilde{\Delta})_e^2 \leq \sum_e r_e (\boldsymbol{N}\widetilde{\Delta})_e^2 \leq \|w\|_p^{p-2}.$$
Since e ∆ is the solution returned by Algorithm 2, and ∆ ⋆ satisfies the constraints of theoracle, we have, X e r e ( N e ∆) e ≤ X e r e ( N ∆ ∗ ) e = X e w p − e ( N ∆ ∗ ) e ≤ k w k p − p . In the last inequality we use, X e w e ( N ∆ ⋆ ) e ≤ X e ( N ∆ ⋆ ) · p e ! /p X e | w e | ( p − · pp − ! ( p − /p = k N ∆ ⋆ k p k w k ( p − /pp ≤ k w k ( p − /pp , since (cid:13)(cid:13) N ∆ ∗ (cid:13)(cid:13) p ≤ . Finally, using r e ≥ , we have P e ( N ∆) e ≤ P e r e ( N ∆) e , concluding the proof. (cid:3) Lemma C.3.
After i flow steps, and k width-reduction steps, provided(1) p p α p τ ≤ pαm p − p , (controls Φ growth in flow-steps)(2) k ≤ − pp − ρ m /p β − p − ,(acceptable number of width-reduction steps)the potential Φ is bounded as follows: Φ (cid:16) w ( i,k ) (cid:17) ≤ (cid:16) αi + m / p (cid:17) p exp pp − kρ m /p β − p − . Proof.
We prove this claim by induction. Initially, i = k = 0 , and Φ (cid:16) w (0 , (cid:17) = m , and thus,the claim holds trivially. Assume that the claim holds for some i, k ≥ . We will use Φ as anabbreviated notation for Φ (cid:16) w ( i,k ) (cid:17) below.Flow Step. For brevity, we use w to denote w ( i,k ) . If the next step is a flow step,Φ (cid:16) w ( i +1 ,k ) (cid:17) = (cid:13)(cid:13)(cid:13)(cid:13) w ( i,k ) + α (cid:12)(cid:12)(cid:12) N e ∆ (cid:12)(cid:12)(cid:12)(cid:13)(cid:13)(cid:13)(cid:13) pp ≤k w k pp + α (cid:12)(cid:12)(cid:12) ( N e ∆) (cid:12)(cid:12)(cid:12) ⊤ (cid:12)(cid:12)(cid:12) ∇k w k pp (cid:12)(cid:12)(cid:12) + 2 p α X e | w e | p − (cid:12)(cid:12)(cid:12) N e ∆ (cid:12)(cid:12)(cid:12) e + α p p p k N e ∆ k pp by Lemma B.1 of [APS19]We next bound (cid:12)(cid:12)(cid:12) ( N e ∆) (cid:12)(cid:12)(cid:12) ⊤ (cid:12)(cid:12) ∇k w k pp (cid:12)(cid:12) as, P e (cid:12)(cid:12)(cid:12) ( N e ∆) e (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ∇ e k w k pp (cid:12)(cid:12) ≤ p k w k p − p . Using Cauchy Schwarz’s inequality, X e (cid:12)(cid:12)(cid:12) N e ∆ (cid:12)(cid:12)(cid:12) e (cid:12)(cid:12)(cid:12) ∇ e k w k pp (cid:12)(cid:12)(cid:12)! = p X e (cid:12)(cid:12)(cid:12) N e ∆ (cid:12)(cid:12)(cid:12) e | w e | p − | w e | ! p X e | w e | p − w e ! X e | w e | p − ( N e ∆) e ! = p k w k pp X e r e ( N e ∆) e ≤ p k w k p − p We thus have, X e (cid:12)(cid:12)(cid:12) N e ∆ (cid:12)(cid:12)(cid:12) e (cid:12)(cid:12)(cid:12) ∇ e k w k pp (cid:12)(cid:12)(cid:12) ≤ p k w k p − p . Using the above bound, we now have,Φ (cid:16) w ( i +1 ,k ) (cid:17) ≤k w k pp + pα k w k p − p + 2 p α k w k p − p + p p α p k N e ∆ k pp ≤k w k pp + pα k w k p − p + 2 p α k w k p − p + pαm p − p , by Assumption 1 of this LemmaRecall k w k pp = Φ( w ) . Since Φ ≥ m , we have, ≤ Φ( w ) + pα Φ( w ) p − p + 2 p α Φ( w ) p − p + pα Φ( w ) p − p ≤ (Φ( w ) /p + 2 α ) p . 
From the inductive assumption, we haveΦ( w ) ≤ (cid:16) αi + m / p (cid:17) p exp O p (1) kρ m /p β − p − . Thus, Φ( i + 1 , k ) ≤ (Φ( w ) /p + 2 α ) p ≤ (cid:16) α ( i + 1) + m / p (cid:17) p exp O p (1) kρ m /p β − p − proving the inductive claim.Width Reduction Step. We have the following: X e ∈ H r e ≤ ρ − X e ∈ H r e ( N ∆) e ≤ ρ − X e r e ( N ∆) e ≤ ρ − k w k p − p ≤ ρ − Φ p − p , and Φ( i, k + 1) ≤ Φ + X e ∈ H (cid:12)(cid:12)(cid:12) w k +1 e (cid:12)(cid:12)(cid:12) p ≤ Φ + 2 pp − X e ∈ H | w e | p ≤ Φ + 2 pp − X e r pp − e ≤ Φ + 2 pp − X e ∈ H r e (cid:18) max e ∈ H r e (cid:19) pp − − Φ + 2 pp − ρ − Φ p − p β p − . Again, since Φ( w ) ≥ m ,Φ( i, k + 1) ≤ Φ (cid:18) pp − ρ − m − p β p − (cid:19) ≤ (cid:16) αi + m / p (cid:17) p exp pp − k + 1 ρ m /p β − p − proving the inductive claim. (cid:3) Lemma C.4.
Let $\widetilde\Delta = \arg\min_{A\Delta = c}\; m^{\frac{p-2}{p}} \Delta^\top M^\top M \Delta + \frac{1}{3^{p-2}} \sum_e r_e (N\Delta)_e^2$. Then, for any $r'$ and $r$ such that $r' \ge r$ entrywise, one has
\[
\Psi(r') \ge \Psi(r) + \sum_e \left( 1 - \frac{r_e}{r'_e} \right) r_e \left( N\widetilde\Delta \right)_e^2.
\]
\emph{Proof.} Absorbing the scalar factors $m^{\frac{p-2}{p}}$ and $\frac{1}{3^{p-2}}$ into $M$ and $R$ respectively, write
\[
\Psi(r) = \min_{Ax = c}\; x^\top M^\top M x + x^\top N^\top R N x.
\]
Constructing the Lagrangian and noting that strong duality holds,
\begin{align*}
\Psi(r) &= \min_x \max_y\; x^\top M^\top M x + x^\top N^\top R N x + 2 y^\top (c - Ax) \\
&= \max_y \min_x\; x^\top M^\top M x + x^\top N^\top R N x + 2 y^\top (c - Ax).
\end{align*}
Optimality conditions with respect to $x$ give us
\[
2 M^\top M x^\star + 2 N^\top R N x^\star = 2 A^\top y.
\]
Substituting this in $\Psi$ gives us
\[
\Psi(r) = \max_y\; 2 y^\top c - y^\top A \left( M^\top M + N^\top R N \right)^{-1} A^\top y.
\]
Optimality conditions with respect to $y$ now give us
\[
2c = 2 A \left( M^\top M + N^\top R N \right)^{-1} A^\top y^\star,
\]
which upon re-substitution gives
\[
\Psi(r) = c^\top \left( A \left( M^\top M + N^\top R N \right)^{-1} A^\top \right)^{-1} c.
\]
We also note that
\[
x^\star = \left( M^\top M + N^\top R N \right)^{-1} A^\top \left( A \left( M^\top M + N^\top R N \right)^{-1} A^\top \right)^{-1} c. \tag{20}
\]
We now want to see what happens when we change $r$. Let $R$ denote the diagonal matrix with entries $r$ and let $R' = R + S$, where $S$ is the diagonal matrix of the (nonnegative) changes in the resistances. We will use the following version of the Sherman--Morrison--Woodbury formula multiple times:
\[
(X + UCV)^{-1} = X^{-1} - X^{-1} U \left( C^{-1} + V X^{-1} U \right)^{-1} V X^{-1}.
\]
We begin by applying the above formula for $X = M^\top M + N^\top R N$, $C = I$, $U = N^\top S^{1/2}$ and $V = S^{1/2} N$. We thus get
\begin{multline}
\left( M^\top M + N^\top R' N \right)^{-1} = \left( M^\top M + N^\top R N \right)^{-1} \\
- \left( M^\top M + N^\top R N \right)^{-1} N^\top S^{1/2} \left( I + S^{1/2} N \left( M^\top M + N^\top R N \right)^{-1} N^\top S^{1/2} \right)^{-1} S^{1/2} N \left( M^\top M + N^\top R N \right)^{-1}. \tag{21}
\end{multline}
We next claim that
\[
I + S^{1/2} N \left( M^\top M + N^\top R N \right)^{-1} N^\top S^{1/2} \preceq I + S^{1/2} R^{-1} S^{1/2},
\]
which gives us
\begin{multline}
\left( M^\top M + N^\top R' N \right)^{-1} \preceq \left( M^\top M + N^\top R N \right)^{-1} \\
- \left( M^\top M + N^\top R N \right)^{-1} N^\top S^{1/2} \left( I + S^{1/2} R^{-1} S^{1/2} \right)^{-1} S^{1/2} N \left( M^\top M + N^\top R N \right)^{-1}. \tag{22}
\end{multline}
This further implies
\begin{multline}
A \left( M^\top M + N^\top R' N \right)^{-1} A^\top \preceq A \left( M^\top M + N^\top R N \right)^{-1} A^\top \\
- A \left( M^\top M + N^\top R N \right)^{-1} N^\top S^{1/2} \left( I + S^{1/2} R^{-1} S^{1/2} \right)^{-1} S^{1/2} N \left( M^\top M + N^\top R N \right)^{-1} A^\top. \tag{23}
\end{multline}
We apply the Sherman--Morrison--Woodbury formula again for
\[
X = A \left( M^\top M + N^\top R N \right)^{-1} A^\top, \qquad C = -\left( I + S^{1/2} R^{-1} S^{1/2} \right)^{-1},
\]
\[
U = A \left( M^\top M + N^\top R N \right)^{-1} N^\top S^{1/2} \qquad \text{and} \qquad V = S^{1/2} N \left( M^\top M + N^\top R N \right)^{-1} A^\top.
\]
Let us look at the term $C^{-1} + V X^{-1} U$:
\[
-\left( C^{-1} + V X^{-1} U \right)^{-1} = \left( I + S^{1/2} R^{-1} S^{1/2} - V X^{-1} U \right)^{-1} \succeq \left( I + S^{1/2} R^{-1} S^{1/2} \right)^{-1}.
\]
Using this, we get
\[
\left( A \left( M^\top M + N^\top R' N \right)^{-1} A^\top \right)^{-1} \succeq X^{-1} + X^{-1} U \left( I + S^{1/2} R^{-1} S^{1/2} \right)^{-1} V X^{-1},
\]
which on multiplying by $c^\top$ and $c$ gives
\[
\Psi(r') \ge \Psi(r) + c^\top X^{-1} U \left( I + S^{1/2} R^{-1} S^{1/2} \right)^{-1} V X^{-1} c.
\]
We note from Equation (20) that $x^\star = \left( M^\top M + N^\top R N \right)^{-1} A^\top X^{-1} c$. We thus have
\[
\Psi(r') \ge \Psi(r) + \left( x^\star \right)^\top N^\top S^{1/2} \left( I + S^{1/2} R^{-1} S^{1/2} \right)^{-1} S^{1/2} N x^\star = \Psi(r) + \sum_e \left( \frac{r'_e - r_e}{r'_e} \right) r_e \left( N x^\star \right)_e^2. \qquad\qed
\]

\textbf{Lemma C.5.}
Initially, we have
\[
\Psi\left( r^{(0,0)} \right) \ge \frac{1}{3^{p-2}} \cdot \frac{ \|M+N\|_{\min}^2\, \|c\|_2^2 }{ \|A\|_2^2 } = L,
\]
where $\|M+N\|_{\min}^2 = \min_{\|x\|_2 = 1} \|Mx\|_2^2 + \|Nx\|_2^2$ and $\|A\|_2$ is the operator norm of $A$. Moreover, at any step $(i,k)$ we have
\[
\Psi\left( r^{(i,k)} \right) \le m^{\frac{p-2}{p}} + \frac{1}{3^{p-2}}\, \Phi(i,k)^{\frac{p-2}{p}}.
\]
\emph{Proof.} For the lower bound in the initial state, since $r^{(0,0)}_e \ge 1$, for the solution $\widetilde\Delta$ we have
\[
\Psi\left( r^{(0,0)} \right) \ge m^{\frac{p-2}{p}}\, \widetilde\Delta^\top M^\top M \widetilde\Delta + \frac{1}{3^{p-2}} \sum_e r^{(0,0)}_e \left( N\widetilde\Delta \right)_e^2 \ge \left\| M\widetilde\Delta \right\|_2^2 + \frac{1}{3^{p-2}} \left\| N\widetilde\Delta \right\|_2^2 \ge \frac{1}{3^{p-2}} \|M+N\|_{\min}^2 \left\| \widetilde\Delta \right\|_2^2.
\]
Note that if $\|M+N\|_{\min} = 0$ then the oracle has returned the optimum in the first iteration. On the other hand, because
\[
\|c\|_2 = \left\| A\widetilde\Delta \right\|_2 \le \|A\|_2 \left\| \widetilde\Delta \right\|_2,
\]
where $\|A\|_2$ is the operator norm of $A$, we get $\left\| \widetilde\Delta \right\|_2 \ge \frac{\|c\|_2}{\|A\|_2}$, upon which squaring gives the lower bound on $\Psi\left( r^{(0,0)} \right)$.

For the upper bound, Lemma C.2 implies that
\begin{align*}
\Psi\left( r^{(i,k)} \right) &= m^{\frac{p-2}{p}}\, \widetilde\Delta^\top M^\top M \widetilde\Delta + \frac{1}{3^{p-2}} \sum_e r_e \left( N\widetilde\Delta \right)_e^2 \le m^{\frac{p-2}{p}}\, \Delta^{\star\top} M^\top M \Delta^\star + \frac{1}{3^{p-2}} \sum_e r_e \left( N\Delta^\star \right)_e^2 \\
&\le m^{\frac{p-2}{p}} + \frac{1}{3^{p-2}} \|w\|_p^{p-2} \le m^{\frac{p-2}{p}} + \frac{1}{3^{p-2}}\, \Phi(i,k)^{\frac{p-2}{p}}. \qquad\qed
\end{align*}

\textbf{Lemma C.6.}
Suppose at step $(i,k)$ we have $\left\| N\widetilde\Delta \right\|_p^p > \tau$, so that we perform a width reduction step (line 17). If
\begin{enumerate}
\item $\tau^{2/p} \ge \Omega(1)\, \dfrac{\Psi(r)}{\beta}$, and
\item $\tau \ge \Omega(1)\, \Psi(r)\, \rho^{p-2}$,
\end{enumerate}
then
\[
\Psi\left( r^{(i,k+1)} \right) \ge \Psi\left( r^{(i,k)} \right) + \Omega(1)\, \tau^{2/p}.
\]
Furthermore, if at $(i,k)$ we have $\left\| N\widetilde\Delta \right\|_p^p \le \tau$, so that we perform a flow step, then $\Psi\left( r^{(i,k+1)} \right) \ge \Psi\left( r^{(i,k)} \right)$.

\emph{Proof.} It will be helpful for our analysis to split the index set into three disjoint parts:
\begin{itemize}
\item $S = \left\{ e : |N\Delta_e| \le \rho \right\}$,
\item $H = \left\{ e : |N\Delta_e| > \rho \text{ and } r_e \le \beta \right\}$,
\item $B = \left\{ e : |N\Delta_e| > \rho \text{ and } r_e > \beta \right\}$.
\end{itemize}
Firstly, we note
\[
\sum_{e \in S} |N\Delta|_e^p \le \rho^{p-2} \sum_{e \in S} |N\Delta|_e^2 \le \rho^{p-2} \sum_{e \in S} r_e |N\Delta|_e^2 \le \rho^{p-2}\, \Psi(r).
\]
Hence, using Assumption (2),
\[
\sum_{e \in H \cup B} |N\Delta|_e^p \ge \sum_e |N\Delta|_e^p - \sum_{e \in S} |N\Delta|_e^p \ge \tau - \rho^{p-2}\, \Psi(r) \ge \Omega(1)\, \tau.
\]
This means
\[
\sum_{e \in H \cup B} (N\Delta)_e^2 \ge \left( \sum_{e \in H \cup B} |N\Delta|_e^p \right)^{2/p} \ge \Omega(1)\, \tau^{2/p}.
\]
Secondly, we note that
\[
\sum_{e \in B} (N\Delta)_e^2 \le \beta^{-1} \sum_{e \in B} r_e (N\Delta)_e^2 \le \beta^{-1}\, \Psi(r).
\]
So then, using Assumption (1),
\[
\sum_{e \in H} (N\Delta)_e^2 = \sum_{e \in H \cup B} (N\Delta)_e^2 - \sum_{e \in B} (N\Delta)_e^2 \ge \Omega(1)\, \tau^{2/p} - \beta^{-1}\, \Psi(r) \ge \Omega(1)\, \tau^{2/p}.
\]
As $r_e \ge 1$, this implies $\sum_{e \in H} r_e (N\Delta)_e^2 \ge \Omega(1)\, \tau^{2/p}$. We note that in a width reduction step, the resistances of the edges in $H$ change by a factor of $2$. Thus, combining our last two observations and applying Lemma C.4, we get
\[
\Psi\left( r^{(i,k+1)} \right) \ge \Psi\left( r^{(i,k)} \right) + \Omega(1)\, \tau^{2/p}.
\]
Finally, for the ``flow step'' case, we use the trivial bound from Lemma C.4, ignoring the second term: $\Psi\left( r^{(i,k+1)} \right) \ge \Psi\left( r^{(i,k)} \right)$. $\qed$

\textbf{Appendix D.} $\ell_p$-Regression

\textbf{Definition D.1} ($\kappa$-approximate solution)\textbf{.} Let $\kappa \ge$
$1$. A $\kappa$-approximate solution for the residual problem is $\widetilde\Delta$ such that $A\widetilde\Delta = 0$ and $res(\widetilde\Delta) \ge \frac{1}{\kappa}\, res(\Delta^\star)$, where $\Delta^\star = \operatorname{argmax}_{A\Delta = 0} res(\Delta)$.

\textbf{Lemma D.2} (Iterative Refinement [APS19])\textbf{.} Let $p \ge 2$ and $\kappa \ge 1$. Starting from an initial feasible solution $x^{(0)}$, and iterating as $x^{(t+1)} = x^{(t)} - \frac{\Delta}{p}$, where $\Delta$ is a $\kappa$-approximate solution to the residual problem (Definition 3.2), we get an $\varepsilon$-approximate solution to (3) in at most
\[
O\!\left( p\kappa \log\left( \frac{f\left(x^{(0)}\right) - \mathrm{Opt}}{\varepsilon\, \mathrm{Opt}} \right) \right)
\]
calls to a $\kappa$-approximate solver for the residual problem.

\emph{Proof.} Let $f(x) = b^\top x + \|Mx\|_2^2 + \|Nx\|_p^p$ and $res(\Delta) = g^\top \Delta - \Delta^\top R \Delta - \|N\Delta\|_p^p$. Observe that
\[
b^\top (x + \Delta) = b^\top x + b^\top \Delta, \qquad \text{and} \qquad \|M(x+\Delta)\|_2^2 = \|Mx\|_2^2 + 2\Delta^\top M^\top M x + \|M\Delta\|_2^2.
\]
Using Lemma B.1 from [APS19] we have
\[
\|N(x+\Delta)\|_p^p \le \|Nx\|_p^p + p\, (N\Delta)^\top |Nx|^{p-2} Nx + 2p^2\, \Delta^\top N^\top \mathrm{Diag}\left( |Nx|^{p-2} \right) N \Delta + p^p \|N\Delta\|_p^p,
\]
and
\[
\|N(x+\Delta)\|_p^p \ge \|Nx\|_p^p + p\, (N\Delta)^\top |Nx|^{p-2} Nx + \frac{p^2}{8}\, \Delta^\top N^\top \mathrm{Diag}\left( |Nx|^{p-2} \right) N \Delta + \frac{1}{2^{p+1}} \|N\Delta\|_p^p.
\]
Using these relations, we have
\[
f(x + \Delta) \le f(x) + p\, g^\top \Delta + p^2\, \Delta^\top R \Delta + p^p \|N\Delta\|_p^p, \qquad \text{or} \qquad f\!\left( x - \frac{\Delta}{p} \right) \le f(x) - res(\Delta).
\]
The lower bound looks like
\[
f(x + \Delta) \ge f(x) + p\, g^\top \Delta + \frac{p^2}{16}\, \Delta^\top R \Delta + \frac{1}{2^{p+1}} \|N\Delta\|_p^p.
\]
For $\lambda = 16p$,
\[
f(x) - f\!\left( x - \frac{\lambda\Delta}{p} \right) \le \lambda\, g^\top \Delta - \frac{\lambda^2}{16}\, \Delta^\top R \Delta - \frac{\lambda^p}{2^{p+1} p^p} \|N\Delta\|_p^p \le \lambda \left( g^\top \Delta - \Delta^\top R \Delta - \|N\Delta\|_p^p \right) = \lambda\, res(\Delta).
\]
These relations are the same as Lemma B.2 of [APS19]. We can follow the proof further from [APS19] to get our result. $\qed$
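The refinement step of Lemma D.2 can be sanity-checked numerically. The sketch below is our own illustration (not code from [APS19]): on a small unconstrained instance $f(x) = b^\top x + \|Mx\|_2^2 + \|Nx\|_p^p$ with $p = 4$, it builds $g$ and $R$ from the smoothness bounds above, takes a small step $\Delta$ along $g$ (which has positive residual away from optimality), and checks the key inequality $f(x - \Delta/p) \le f(x) - res(\Delta)$. The matrix sizes and the step size $10^{-5}$ are arbitrary choices, and numpy is assumed to be available.

```python
import numpy as np

# Toy unconstrained instance f(x) = b^T x + ||Mx||_2^2 + ||Nx||_p^p with p = 4.
rng = np.random.default_rng(0)
p, n = 4, 5
M = rng.standard_normal((6, n))
N = rng.standard_normal((7, n))
b = rng.standard_normal(n)

def f(x):
    return b @ x + np.sum((M @ x) ** 2) + np.sum(np.abs(N @ x) ** p)

def g_and_R(x):
    # From the smoothness bounds: p*g = b + 2 M^T M x + p N^T (|Nx|^{p-2} . Nx),
    # p^2*R = M^T M + 2 p^2 N^T Diag(|Nx|^{p-2}) N.
    Nx = N @ x
    d = np.abs(Nx) ** (p - 2)
    g = (b + 2 * M.T @ (M @ x)) / p + N.T @ (d * Nx)
    R = M.T @ M / p**2 + 2 * N.T @ (d[:, None] * N)
    return g, R

def res(x, D):
    g, R = g_and_R(x)
    return g @ D - D @ R @ D - np.sum(np.abs(N @ D) ** p)

x = rng.standard_normal(n)
g, _ = g_and_R(x)
D = 1e-5 * g                 # small step along g: res(D) > 0 away from optimality
r = res(x, D)
x_new = x - D / p            # the update of Lemma D.2
assert r > 0
assert f(x_new) <= f(x) - r + 1e-9   # f(x - D/p) <= f(x) - res(D)
```

Note that the displayed inequality holds for any feasible $\Delta$; a $\kappa$-approximate maximizer of $res$ only makes the guaranteed decrease as large as possible, which is what drives the $O(p\kappa \log(\cdot))$ iteration bound.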
Solving the Residual Problem.
Lemma D.3.
Let $\Delta^\star$ denote the optimum of the residual problem at $x^{(t)}$ and $\mathrm{Opt}$ denote the optimum of Problem (3). We have that $res(\Delta^\star) \in (\nu/2, \nu]$ for some
\[
\nu \in \left[ \frac{\varepsilon\, \mathrm{Opt}}{16p},\; f\!\left( x^{(0)} \right) - \mathrm{Opt} \right].
\]
\emph{Proof.} From the above proof, we note that for any $x$, if $\Delta^\star$ is the optimum of the residual problem at $x$, then
\[
res(\Delta^\star) \le f(x) - f\!\left( x - \frac{\Delta^\star}{p} \right) \le f\!\left( x^{(0)} \right) - \mathrm{Opt}.
\]
Let $\Delta$ be the step we take to reach the optimum from $x$, i.e., $x - \frac{\lambda\Delta}{p}$ is the minimizer of $f$. Then
\[
res(\Delta^\star) \ge res(\Delta) \ge \frac{f(x) - \mathrm{Opt}}{\lambda} \ge \frac{\varepsilon\, \mathrm{Opt}}{\lambda},
\]
where the last inequality follows since otherwise $x$ is already an $\varepsilon$-approximate solution. Recalling $\lambda = 16p$, the lemma thus follows. $\qed$

\textbf{Lemma D.4.}
Let $p \ge p'$ and $\nu$ be such that $res_p(\Delta^\star) \in (\nu/2, \nu]$, where $\Delta^\star$ is the optimum of the residual problem for the $p$-norm (Definition 3.2). The following problem has optimum between $\left( \frac{\nu}{32},\; O(1)\, m^{\frac{p-p'}{p(p'-1)}}\, \nu \right] \stackrel{\mathrm{def}}{=} (a\nu, b\nu]$:
\[
\max_{A\Delta = 0}\; g^\top \Delta - \Delta^\top R \Delta - \left( \frac{\nu}{m} \right)^{1 - \frac{p'}{p}} \|N\Delta\|_{p'}^{p'}. \tag{24}
\]
For $\beta \ge 1$, if $\widetilde\Delta$ is a $\beta$-approximate solution to the above problem, then $\alpha\widetilde\Delta$ gives a $\frac{b\beta}{a}\, m^{\frac{p}{p-1}\left( \frac{1}{p'} - \frac{1}{p} \right)} \beta^{\frac{p-p'}{p'(p-1)}}$-approximate solution to the residual problem, where $\alpha = \frac{a}{b\beta}\, m^{-\frac{p}{p-1}\left( \frac{1}{p'} - \frac{1}{p} \right)} \beta^{-\frac{p-p'}{p'(p-1)}}$.

\emph{Proof.} We will first show that the optimum of Problem (24) is at most $O(1)\, m^{\frac{p-p'}{p(p'-1)}}\, \nu$. Suppose the optimum $\Delta^\star$ of (24) gives an objective value of $\beta\nu$. Using a scaling argument as in the above proof, we can conclude that
\[
g^\top \Delta^\star - \Delta^{\star\top} R \Delta^\star - \left( \frac{\nu}{m} \right)^{1-\frac{p'}{p}} \left\| N\Delta^\star \right\|_{p'}^{p'} = \Delta^{\star\top} R \Delta^\star + (p'-1) \left( \frac{\nu}{m} \right)^{1-\frac{p'}{p}} \left\| N\Delta^\star \right\|_{p'}^{p'} = \beta\nu.
\]
This implies that $g^\top \Delta^\star \ge \beta\nu$, $\Delta^{\star\top} R \Delta^\star \le \beta\nu$ and $\left\| N\Delta^\star \right\|_{p'}^{p'} \le 2\beta\, \nu^{p'/p}\, m^{1 - p'/p}$. Since $\|v\|_p \le \|v\|_{p'}$ for $p \ge p'$, we thus have
\[
\left\| N\Delta^\star \right\|_p^p \le \left( \left\| N\Delta^\star \right\|_{p'}^{p'} \right)^{p/p'} \le (2\beta)^{p/p'}\, m^{p/p' - 1}\, \nu.
\]
Let $\alpha = m^{-\frac{p}{p-1}\left( \frac{1}{p'} - \frac{1}{p} \right)}$ be some scaling factor, so that $\alpha^{p-1} = m^{-\left( \frac{p}{p'} - 1 \right)}$. Now, $\alpha^2\, \Delta^{\star\top} R \Delta^\star \le \alpha^2 \beta\nu$, and
\[
\alpha^p \left\| N\Delta^\star \right\|_p^p = \alpha \cdot \alpha^{p-1} \left\| N\Delta^\star \right\|_p^p \le \alpha\, m^{-\left( \frac{p}{p'} - 1 \right)} (2\beta)^{p/p'}\, m^{p/p' - 1}\, \nu = \alpha\, (2\beta)^{p/p'}\, \nu. \tag{25}
\]
Consider
\[
res_p\!\left( \beta^{-\frac{p-p'}{p'(p-1)}}\, \alpha\, \Delta^\star \right) \ge \beta^{-\frac{p-p'}{p'(p-1)}}\, \alpha \left( g^\top \Delta^\star - \beta\nu - \beta\nu \right) \ge \Omega(1)\, \beta^{\frac{p(p'-1)}{p'(p-1)}}\, \alpha\, \nu,
\]
where we used $\left( \beta^{-\frac{p-p'}{p'(p-1)}} \right)^{p-1} (2\beta)^{p/p'} = O(1)\, \beta$. Since $res_p(\cdot)$ has optimum at most $\nu$, we must have $\beta^{\frac{p(p'-1)}{p'(p-1)}}\, \alpha \le O(1)$, which gives $\beta \le O(1)\, m^{\frac{p-p'}{p(p'-1)}}$. Thus Problem (24) has an optimum at most $O(1)\, m^{\frac{p-p'}{p(p'-1)}}\, \nu = b\nu$.

To obtain a lower bound, consider the $\widetilde\Delta$ obtained in Lemma 3.1 of [AS20]. We will evaluate the objective at $\widetilde\Delta$:
\[
g^\top \widetilde\Delta - \widetilde\Delta^\top R \widetilde\Delta - \left( \frac{\nu}{m} \right)^{1-\frac{p'}{p}} \left\| N\widetilde\Delta \right\|_{p'}^{p'} \ge \frac{\nu}{16} - \frac{\nu}{32} = \frac{\nu}{32} = a\nu.
\]
Therefore, the optimum of Problem (24) must be at least $a\nu$.

We next look at how a $\beta$-approximate solution of (24) translates to an approximate solution of the residual problem for $p$. Let $\widetilde\Delta$ be a $\beta$-approximate solution to (24) and let $\Delta^\star$ denote its optimum. Denote the objective of (24) at $\Delta$ by $res_{p'}(\Delta)$. We know that $res_{p'}(\Delta^\star) \ge res_{p'}(\widetilde\Delta) \ge \frac{1}{\beta}\, res_{p'}(\Delta^\star)$. Note that $g^\top \widetilde\Delta$ has to be between $\frac{a\nu}{\beta}$ and $b\nu$. This ensures that
\[
\widetilde\Delta^\top R \widetilde\Delta + \frac{1}{2} \left( \frac{\nu}{m} \right)^{1-\frac{p'}{p}} \left\| N\widetilde\Delta \right\|_{p'}^{p'} \le g^\top \widetilde\Delta \le b\nu.
\]
Let $\alpha = \frac{a}{b\beta}\, m^{-\frac{p}{p-1}\left( \frac{1}{p'} - \frac{1}{p} \right)}\, \beta^{-\frac{p-p'}{p'(p-1)}}$. Following the calculations above, we have
\[
\alpha^2\, \widetilde\Delta^\top R \widetilde\Delta \le \alpha\, \frac{a\nu}{\beta}, \qquad \text{and, by the same calculation as (25),} \qquad \alpha^p \left\| N\widetilde\Delta \right\|_p^p \le \alpha\, \frac{a\nu}{\beta}.
\]
Therefore,
\[
res_p\!\left( \alpha\widetilde\Delta \right) \ge \frac{a\alpha}{\beta}\, \nu. \qquad\qed
\]

\textbf{Lemma D.5.} At $x^{(t)}$, let $\nu$ be such that $res(\Delta^\star) \in (a\nu, b\nu]$ for some values $a$ and $b$. The following problem has optimum at most $b\nu$:
\[
\min_\Delta\; \Delta^\top R \Delta + \|N\Delta\|_p^p \qquad \text{s.t.} \qquad g^\top \Delta = a\nu, \quad A\Delta = 0. \tag{26}
\]
Further, if $\widetilde\Delta$ is a feasible solution to Problem (26) such that $\widetilde\Delta^\top R \widetilde\Delta \le \alpha b\nu$ and $\left\| N\widetilde\Delta \right\|_p^p \le \beta b\nu$, then we can pick a scalar $\mu = \frac{a}{4 b\, \alpha \beta^{1/(p-1)}}$ such that $\mu\widetilde\Delta$ is an $\frac{8 b^2\, \alpha \beta^{1/(p-1)}}{a^2}$-approximate solution to the residual problem. (Adapted from the proof of Lemma B.4 of [APS19].)
Proof.
The assumption on the residual is
\[
res(\Delta^\star) = g^\top \Delta^\star - \Delta^{\star\top} R \Delta^\star - \left\| N\Delta^\star \right\|_p^p \in (a\nu, b\nu].
\]
Since the last two terms are non-positive, we must have $g^\top \Delta^\star > a\nu$. Since $\Delta^\star$ is the optimum and satisfies $A\Delta^\star = 0$,
\[
\frac{d}{d\lambda} \left( g^\top (\lambda\Delta^\star) - \lambda^2\, \Delta^{\star\top} R \Delta^\star - \lambda^p \left\| N\Delta^\star \right\|_p^p \right) \Big|_{\lambda = 1} = 0.
\]
Thus,
\[
g^\top \Delta^\star - \Delta^{\star\top} R \Delta^\star - \left\| N\Delta^\star \right\|_p^p = \Delta^{\star\top} R \Delta^\star + (p-1) \left\| N\Delta^\star \right\|_p^p.
\]
Since $p \ge 2$, we get the following:
\[
\Delta^{\star\top} R \Delta^\star + \left\| N\Delta^\star \right\|_p^p \le g^\top \Delta^\star - \Delta^{\star\top} R \Delta^\star - \left\| N\Delta^\star \right\|_p^p \le b\nu.
\]
For notational convenience, let the function $h_p(R, \Delta) = \Delta^\top R \Delta + \|N\Delta\|_p^p$. Now, we know that $g^\top \Delta^\star \ge a\nu$ and $g^\top \Delta^\star - h_p(R, \Delta^\star) \le b\nu$. This gives
\[
a\nu \le g^\top \Delta^\star \le h_p(R, \Delta^\star) + b\nu \le 2b\nu.
\]
Let $\Delta = \delta\Delta^\star$, where $\delta = \frac{a\nu}{g^\top \Delta^\star}$. Note that $\delta \in \left[ \frac{a}{2b}, 1 \right]$, $g^\top \Delta = a\nu$, and
\[
h_p(R, \Delta) \le \max\{\delta^2, \delta^p\}\, h_p(R, \Delta^\star) \le b\nu.
\]
Note that this $\Delta$ satisfies the constraints of program (26) and has objective at most $b\nu$, so the optimum of the program must have objective at most $b\nu$. Now let $\widetilde\Delta$ be an $(\alpha, \beta)$-approximate solution to (26), i.e.,
\[
\widetilde\Delta^\top R \widetilde\Delta \le \alpha b\nu \qquad \text{and} \qquad \left\| N\widetilde\Delta \right\|_p^p \le \beta b\nu.
\]
Let $\mu = \frac{a}{4\, \alpha \beta^{1/(p-1)}\, b}$. We have
\begin{align*}
g^\top \left( \mu\widetilde\Delta \right) - h_p\left( R, \mu\widetilde\Delta \right) &\ge \mu \left( a\nu - \mu\, \widetilde\Delta^\top R \widetilde\Delta - \mu^{p-1} \left\| N\widetilde\Delta \right\|_p^p \right) \ge \mu \left( a\nu - \frac{a\nu}{4 \beta^{1/(p-1)}} - \frac{a\nu}{4\alpha} \right) \\
&\ge \mu \left( a\nu - \frac{a\nu}{4} - \frac{a\nu}{4} \right) = \frac{a\mu}{2b}\, b\nu \ge \frac{a^2}{8 b^2\, \alpha \beta^{1/(p-1)}}\, res(\Delta^\star). \qquad\qed
\end{align*}

\textbf{Theorem 3.3.} For an instance of Problem (3), suppose we are given a starting solution $x^{(0)}$ that satisfies $Ax^{(0)} = c$ and is a $\kappa$-approximate solution to the optimum. Consider an iteration of the while loop, line 8 of Algorithm 1, for the $\ell_p$-norm residual problem at $x^{(t)}$. We can define $\mu_1$ and $\kappa_1$ such that if $\bar\Delta$ is a $\beta$-approximate solution to a corresponding $p'$-norm residual problem, then $\mu_1\bar\Delta$ is a $\kappa_1\beta$-approximate solution to the $p$-residual problem.
Further, suppose we have the following procedures:
\begin{enumerate}
\item \textsc{Sparsify}: runs in time $K$; takes as input any matrices $R$, $N$ and vector $g$, and returns $\widetilde R$, $\widetilde N$, $\widetilde g$, the matrices having sizes at most $\widetilde n \times n$, such that if $\widetilde\Delta$ is a $\beta$-approximate solution to
\[
\max_{A\Delta = 0}\; \widetilde g^\top \Delta - \left\| \widetilde R \Delta \right\|_2^2 - \left\| \widetilde N \Delta \right\|_{p'}^{p'}
\]
for any $p' \ge 2$, then $\mu_2\widetilde\Delta$, for a computable $\mu_2$, is a $\kappa_2\beta$-approximate solution for
\[
\max_{A\Delta = 0}\; res(\Delta) \stackrel{\mathrm{def}}{=} g^\top \Delta - \left\| R^{1/2} \Delta \right\|_2^2 - \left\| N \Delta \right\|_{p'}^{p'}.
\]
\item \textsc{Solver}: approximately solves (4), returning $\bar\Delta$ such that $\left\| \widetilde R \bar\Delta \right\|_2^2 \le \kappa_3 \nu$ and $\left\| \widetilde N \bar\Delta \right\|_{p'}^{p'} \le \kappa_4 \nu$, in time $\widetilde K(\widetilde n)$ for instances of size at most $\widetilde n$.
\end{enumerate}
Then Algorithm 1 finds an $\varepsilon$-approximate solution for Problem (3) in time
\[
\widetilde O\!\left( p\, \kappa_1 \kappa_2 \kappa_3 \kappa_4^{1/(p-1)} \left( K + \widetilde K(\widetilde n) \right) \log\left( \frac{\kappa p}{\varepsilon} \right) \right).
\]
\emph{Proof.}
From Lemma D.2 we know that, given an instance of Problem (3), at every $x^{(t)}$ we can define a residual problem, and if $\bar\Delta$ is a $\beta$-approximate solution of the residual problem, then by updating $x^{(t)}$ to $x^{(t)} - \frac{\bar\Delta}{p}$ we can find the required $\varepsilon$-approximate solution in
\[
O\!\left( p\beta \log \frac{f\left(x^{(0)}\right) - \mathrm{Opt}}{\varepsilon\, \mathrm{Opt}} \right) \le O\!\left( p\beta \log\left( \frac{\kappa}{\varepsilon} \right) \right)
\]
iterations. The last inequality follows since $\frac{f(x^{(0)}) - \mathrm{Opt}}{\varepsilon\, \mathrm{Opt}} \le \frac{\kappa\, \mathrm{Opt}}{\varepsilon\, \mathrm{Opt}} = \frac{\kappa}{\varepsilon}$. It is thus sufficient to solve a residual problem at $x^{(t)}$ of the form
\[
\max_{A\Delta = 0}\; g^\top \Delta - \Delta^\top R \Delta - \|N\Delta\|_p^p.
\]
Here $g$ and $R$ depend on $x^{(t)}$, $M$, $N$. Now, suppose we have $res(\Delta^\star) \in (\nu/2, \nu]$. We will consider the following cases:
\begin{enumerate}
\item $p \le \log m$: We apply \textsc{Sparsify} to $g$, $R$, $N$ to get $\widetilde g$, $\widetilde R$, $\widetilde N$. Now if $\Delta$ is a $\beta$-approximate solution to
\[
\max_{A\Delta = 0}\; \widetilde g^\top \Delta - \Delta^\top \widetilde R \Delta - \left\| \widetilde N \Delta \right\|_p^p, \tag{27}
\]
then $\mu_2\Delta$ is a $\kappa_2\beta$-approximate solution to the residual problem. We will now solve the above problem; note that its size is at most $\widetilde n$. Let $\widetilde\Delta$ be a $(\kappa_3, \kappa_4)$-approximate solution to Problem (4). From Lemma D.5, $\frac{a}{4b\,\kappa_3\kappa_4^{1/(p-1)}}\widetilde\Delta$ is an $\frac{8b^2\kappa_3\kappa_4^{1/(p-1)}}{a^2}$-approximation to (27). Going back, $\frac{a}{4b\,\kappa_3\kappa_4^{1/(p-1)}}\mu_2\widetilde\Delta$ is an $\frac{8b^2\kappa_2\kappa_3\kappa_4^{1/(p-1)}}{a^2}$-approximate solution to the residual problem. In this case there is no norm conversion, so $\mu_1 = \kappa_1 = 1$; thus, if $res(\Delta^\star) \in (\nu/2, \nu]$, then $\frac{a}{4b\,\kappa_3\kappa_4^{1/(p-1)}}\mu_2\widetilde\Delta$ is an $\frac{8b^2\kappa_2\kappa_3\kappa_4^{1/(p-1)}}{a^2}$-approximate solution to the residual problem.
\item $p > \log m$: In this case, there is an additional step (Lemma D.4) that converts the residual problem to an instance of the previous case with $p' = \log m$. We apply \textsc{Sparsify} to $g$, $R$, $N$ and set $p' = \log m$. If $\bar\Delta$ is the solution returned by the previous case for $p' = \log m$, then $\mu_1\bar\Delta$ loses a further factor $\kappa_1$ in approximation for the residual problem. Thus $\frac{a}{4b\,\kappa_3\kappa_4^{1/(p-1)}}\mu_1\mu_2\widetilde\Delta$ is an $\frac{8b^2\kappa_1\kappa_2\kappa_3\kappa_4^{1/(p-1)}}{a^2}$-approximate solution to the residual problem.
\end{enumerate}
From the above discussion we conclude that if $res(\Delta^\star) \in (\nu/2, \nu]$, then we get a solution $\bar\Delta$ such that
\[
res(\bar\Delta) \ge \frac{a^2}{8 b^2\, \kappa_1\kappa_2\kappa_3\kappa_4^{1/(p-1)}}\, res(\Delta^\star).
\]
From Lemma D.3, we know that $res(\Delta^\star) \in (\nu/2, \nu]$ for some
\[
\nu \in \left[ \frac{\varepsilon\, \mathrm{Opt}}{16p},\; f\!\left( x^{(0)} \right) - \mathrm{Opt} \right].
\]
Since $\mathrm{Opt} \ge \frac{f(x^{(0)})}{\kappa}$ and $\mathrm{Opt} \ge 0$, it is sufficient to let $\nu$ range over
\[
\left[ \frac{\varepsilon\, f\left(x^{(0)}\right)}{16 \kappa p},\; f\!\left( x^{(0)} \right) \right].
\]
We finally look at the running time. We start with a residual problem. We require time $K$ to apply the procedure \textsc{Sparsify}, and then time $\widetilde K(\widetilde n)$ to solve (4). For the correct value of $\nu$, this gives us an $\frac{8b^2\kappa_1\kappa_2\kappa_3\kappa_4^{1/(p-1)}}{a^2}$-approximate solution to the residual problem. For every residual problem we repeat the above process at most $\log \frac{16\kappa p}{\varepsilon}$ times (corresponding to the number of values of $\nu$). We use the fact that $a^{-1}, b \le m^{o(1)}$. Thus the total running time is
\[
O\!\left( p\, \kappa_1\kappa_2\kappa_3\kappa_4^{1/(p-1)}\, \frac{b^2}{a^2} \left( K + \widetilde K(\widetilde n) \right) \log\left( \frac{\kappa}{\varepsilon} \right) \log\left( \frac{\kappa p}{\varepsilon} \right) \right) \le \widetilde O\!\left( p\, \kappa_1\kappa_2\kappa_3\kappa_4^{1/(p-1)} \left( K + \widetilde K(\widetilde n) \right) \log\left( \frac{\kappa p}{\varepsilon} \right) \right). \qquad\qed
\]
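The only property the sweep over $\nu$ above needs is that a geometric grid of $O(\log(\kappa p/\varepsilon))$ values brackets the unknown $res(\Delta^\star)$ as $(\nu/2, \nu]$. The following self-contained sketch is our own illustration; the endpoint values below are made-up stand-ins for $\varepsilon f(x^{(0)})/(16\kappa p)$ and $f(x^{(0)})$, not quantities computed by the algorithm.

```python
import math

# Geometric grid over nu, halving from hi down past lo, as in Theorem 3.3.
# lo and hi are hypothetical stand-ins for eps*f(x0)/(16*kappa*p) and f(x0).
lo, hi = 0.013, 5280.0
steps = math.ceil(math.log2(hi / lo)) + 1     # O(log(kappa*p/eps)) values
grid = [hi / 2 ** j for j in range(steps)]

def bracketed(t):
    # some grid value nu satisfies t in (nu/2, nu]
    return any(nu / 2 < t <= nu for nu in grid)

assert all(bracketed(t) for t in [lo, 0.1, 1.0, 37.5, hi])
assert grid[0] == hi and grid[-1] <= lo
```

Each grid value costs one Sparsify plus Solver call per refinement step, which is where the log(kappa*p/eps) factor in the total running time comes from.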