Strongly local p-norm-cut algorithms for semi-supervised learning and local graph clustering
Acknowledgements. This research was supported in part by NSF awards IIS-1546488, CCF-1909528, and the NSF Center for Science of Information STC, CCF-0939370, as well as DOE DE-SC0014543, NASA, and the Sloan Foundation.
Meng Liu and David F. Gleich, Purdue University
June 16, 2020
Graph-based semi-supervised learning is the problem of learning a labeling function for the graph nodes given a few example nodes, often called seeds, usually under the assumption that the graph's edges indicate similarity of labels. This is closely related to the local graph clustering or community detection problem of finding a cluster or community of nodes around a given seed. For this problem, we propose a novel generalization of random walk, diffusion, or smooth function methods in the literature to a convex p-norm cut function. The need for our p-norm methods is that, in our study of existing methods, we find those principled methods based on eigenvectors, spectral techniques, random walks, or linear systems often have difficulty capturing the correct boundary of a target label or target cluster. In contrast, 1-norm or maxflow-mincut based methods capture the boundary, but cannot grow from a small seed set; hybrid procedures that use both have many hard-to-set parameters. In this paper, we propose a generalization of the objective function behind these methods involving p-norms. To solve the p-norm cut problem we give a strongly local algorithm – one whose runtime depends on the size of the output rather than the size of the graph. Our method can be thought of as a nonlinear generalization of the Andersen-Chung-Lang push procedure to approximate a personalized PageRank vector efficiently. Our procedure is general and can solve other types of nonlinear objective functions, such as p-norm variants of Huber losses. We provide a theoretical analysis of finding planted target clusters with our method and show that the p-norm cut functions improve on the standard Cheeger inequalities for random walk and spectral methods. Finally, we demonstrate the speed and accuracy of our new method on synthetic and real world datasets. Our code is available at github.com/MengLiuPurdue/SLQ.

Many datasets important to machine learning either start as a graph or have a simple translation into graph data. For instance, relational network data naturally starts as a graph. Arbitrary data vectors become graphs via nearest-neighbor constructions, among other choices. Consequently, understanding graph-based learning algorithms – those that learn from graphs – is a recurring problem. This field has a rich history with methods based on linear systems [Zhou et al., 2003; Zhu et al., 2003], eigenvectors [Joachims, 2003; Hansen and Mahoney, 2012], graph cuts [Blum and Chawla, 2001], and network flows [Lang and Rao, 2004; Andersen and Lang, 2008; Veldt et al., 2016], although recent work in graph-based learning has often focused on embeddings [Perozzi et al., 2014; Grover and Leskovec, 2016] and graph neural networks [Yadati et al., 2019; Klicpera et al., 2019; Li et al., 2019]. Our research seeks to understand the possibilities enabled by a certain p-norm generalization of the standard techniques.

Perhaps the prototypical graph-based learning problems are semi-supervised learning and local clustering.

FIGURE 1 – A simple illustration of the benefits of our p-norm methods. (a) Seed node and target. (b) 2-norm problem. (c) 1.1-norm problem. In this problem, we generate a graph from an image with weighted neighbors as described in [Shi and Malik, 2000]. We intentionally make this graph consider large regions, so each pixel is connected to all neighbors within 40 pixels away.
The target in this problem is the cluster defined by the interior of the window, and we select a single pixel inside the window as the seed. The three colors (yellow, orange, red) show how the non-zero elements of the solution fill in as we decrease a sparsity penalty in our formulation (yellow is sparsest, red is densest). The 2-norm result exhibits a typical phenomenon of over-expansion, whereas the 1.1-norm result stays within the target region.

Full details. The image is a real-valued grey-scale image between 0 and 1. We use Malik and Shi's procedure [Shi and Malik, 2000] to convert the image into a weighted graph. In the graph, pixels represent nodes and pixels are connected within a 2-squared-norm distance of 40. The weight on an edge is
w(i,j) = exp(−|I(i) − I(j)|²/σ_I − |D(i,j)|²/σ_x) · Ind[D(i,j) ≤ r],
where I(i) is the intensity at pixel i, D(i,j) is the 2-norm distance in pixel locations, and Ind[·] is the indicator function. The values were r = 40, σ_I = 0.…, σ_d = 512/10. We ran our SLQ solver with γ = 0.…, κ = [0.…, 0.…, 0.…], ρ = 0.…, q = 1.… for …,000 steps, even though it had not fully converged. Running it longer (over one billion steps) shows that there are a few exceptionally small entries that bleed out of the target window. (Recall that we show any non-zero entry ever introduced by the algorithms.) These are illustrated in Figure 2.

FIGURE 2 – Running our SLQ solver for an extremely long time will cause a few entries to bleed out of the target window. Compare with Figure 1.

Other graph-based learning problems include role discovery and alignments. Semi-supervised learning involves learning a labeling function for the nodes of a graph based on a few examples, often called seeds. The most interesting scenarios are when most of the graph has unknown labels and there are only a few examples per label. This could be a constant number of examples per label, such as 10 or 50, or a small fraction of the total label size, such as 1%. Local clustering is the problem of finding a cluster or community of nodes around a given set of seeds. This is closely related to semi-supervised learning because that cluster is a natural suggestion for nodes that ought to share the same label, if there is a homophily property for edges in the network. If this homophily is not present, then there are transformations of the graph that can make these methods work better [Peel, 2017].

For both problems, a standard set of techniques is based on random walk diffusions and mincut constructions [Zhou et al., 2003; Zhu et al., 2003; Joachims, 2003; Gleich and Mahoney, 2015; Pan et al., 2004]. These reduce the problem to a linear system, eigenvector, random walk, or mincut-maxflow problem, which can often be further approximated. As a simple example, consider solving a seeded PageRank problem that is seeded on the nodes known to be labeled with a single label. The resulting PageRank vector indicates other nodes likely to share that same label. This propensity of PageRank to propagate labels has been used in many applications and it has many interpretations [Kloumann et al., 2016; Gleich, 2015; Orecchia and Mahoney, 2011; Pan et al., 2004; Lisewski and Lichtarge, 2010; Ghosh et al., 2014], including guilt-by-association [Koutra et al., 2011]. A related class of mincut-maxflow constructions uses similar reasoning [Blum and Chawla, 2001; Veldt et al., 2016, 2019a].

The link between these PageRank methods and the mincut-maxflow computations is that they correspond to 1-norm and 2-norm variations on a general objective function (see [Gleich and Mahoney, 2014] and Equation 1). In this paper, we replace the norm with a general p-norm. (For various reasons, we refer to it as a q-norm in the subsequent technical sections. We use p-norm here as this usage is more common.) The literature on 1- and 2-norms is well established and largely suggests that 1-norm (mincut) objectives are best used for refining large results from other methods – especially because they tend to sharpen boundaries – whereas 2-norm methods are best used for expanding small seed sets [Veldt et al., 2016]. There is a technical reason why mincut-maxflow formulations cannot expand small seed sets, unless they have uncommon properties, discussed in [Fountoulakis et al., 2020, Lemma 7.2]. The downside of 2-norm methods is that they tend to "expand" or "bleed out" over natural boundaries in the data. This is illustrated in Figure 1(b). The hypothesis motivating this work is that techniques that use a p-norm where 1 < p < 2 combine the advantages of both.
We are hardly the first to notice these effects or propose p-norms as a solution. For instance, the p-Laplacian [Amghibech, 2003] and related ideas [Alamgir and Luxburg, 2011] have been widely studied as a way to improve results in spectral clustering [Bühler and Hein, 2009] and semi-supervised learning [Brindle and Zhu, 2013]. This has recently been used to show the power of simple nonlinearities in diffusions for semi-supervised learning as well [Ibrahim and Gleich, 2019]. The major rationale for our paper is that our algorithmic techniques are closely related to those used for 2-norm optimization. It remains the case that spectral (2-norm) approaches are far more widely used in practice, partly because they are simpler to implement and use, whereas the other approaches involve more delicate computations. Our new formulations are amenable to similar computational techniques as used for 2-norm problems, which we hope will enable them to be widely used.

To forward the goal of making these techniques useful, we release all of our experimental code and the tools necessary to easily use the strongly-local p-norm cuts on github: github.com/MengLiuPurdue/SLQ. This includes related codes for similar purposes as well.

The remainder of this paper consists of a full demonstration of the potential of this idea. We first formally state the problem and review technical preliminaries in Section 2. As an optimization problem, the p-norm problem is strongly convex with a unique solution. Next, we provide a strongly local algorithm to approximate the solution (Section 3). A strongly local algorithm is one where the runtime depends on the sparsity of the output rather than the size of the input graph. This enables the methods to run efficiently even on large graphs because, simply put, we are able to bound the maximum output size and runtime independently of the graph size. A hallmark of the existing literature on these methods is a recovery guarantee called a Cheeger inequality. Roughly, this inequality shows that, if the methods are seeded nearby a good cluster, then the methods will return something that is not too far away from that good cluster. This is often quantified in terms of the conductance of the good cluster and the conductance of the returned cluster. There are a variety of tradeoffs possible here [Andersen et al., 2006; Zhu et al., 2013; Wang et al., 2017]. We prove such a relationship for our methods where the quality of the guarantee depends on the exponent 1/p, which reproduces the square-root Cheeger guarantees [Chung, 1992] for p = 2 but gives better results when p < 2. Finally, we empirically demonstrate a number of aspects of our methods in comparison with a number of other techniques in Section 5. The goal is to highlight places where our p-norm objectives differ.

At the end, we have a number of concluding discussions (Section 6), which highlight dimensions where our methods could be improved, as well as related literature. For instance, there are many ways to use personalized PageRank methods with graph convolutional networks and embedding techniques [Klicpera et al., 2019] – we conjecture that our p-norm methods will simply improve on these relationships. Also, and importantly, as we were completing this paper, we became aware of [Yang et al., 2020], which discusses p-norms for flow-based diffusions. Our two papers have many similar findings on the benefit of p-norms, although there are some meaningful differences in the approaches, which we discuss in Section 6. In particular, our algorithm is distinct and follows a simple generalization of the widely used and deployed push method for PageRank. Our hope is that both papers can highlight the benefits of this idea to improve the practice of graph-based learning.

2 GENERALIZED LOCAL GRAPH CUTS
We consider graphs that are undirected, connected, and weighted with positive edge weights lower-bounded by 1. Let G = (V, E, w) be such a graph, where n = |V| and m = |E|. The adjacency matrix A has non-zero entries w(i,j) for each edge (i,j); all other entries are zero. This is symmetric because the graph is undirected. The degree vector d is defined as the row sums of A and D is a diagonal matrix defined as diag(d). The incidence matrix B ∈ {0, −1, 1}^{m×n} measures the differences of adjacent nodes. The kth row of B represents the kth edge and each row has exactly two nonzero elements: 1 for the start node of the kth edge and −1 for the end node of the kth edge. For undirected graphs, either node can be the start node or end node and the order does not matter. We use vol(S) for the sum of weighted degrees of the nodes in S and φ(S) = cut(S)/min(vol(S), vol(S̄)) for conductance. We use i ∼ j to represent that node i and node j are adjacent.

For simplicity, we begin with PageRank, which has been used for all of these tasks in various guises [Zhou et al., 2003; Gleich and Mahoney, 2015; Andersen et al., 2006]. A PageRank vector [Gleich, 2015] is the solution of the linear system (I − αAD⁻¹) x = (1 − α) v, where α is a probability between 0 and 1 and v is a stochastic vector that gives the seed distribution. This can be easily reworked into the equivalent linear system (γD + L) y = γ v, where γ = (1 − α)/α, y = D⁻¹ x, and L is the graph Laplacian L = D − A. The starting point for our methods is a result shown in [Gleich and Mahoney, 2014], where we can further translate this into a 2-norm "cut" computation on a graph called the localized cut graph that is closely related to common constructions in maxflow-mincut computations for cluster improvement [Andersen and Lang, 2008; Fountoulakis et al., 2020].

The localized cut graph is created from the original graph, a set S, and a value γ. The construction adds an extra source node s and an extra sink node t, and edges from s that localize, or bias, a solution within the graph near the set S. Formally, given a graph G = (V, E) with adjacency matrix A, a seed set S ⊂ V, and a non-negative constant γ, the adjacency matrix of the localized cut graph is

A_S = [ 0       γ d_S^T    0
        γ d_S   A          γ d_S̄
        0       γ d_S̄^T    0 ],

where the first row and column correspond to s and the last to t. (A small illustration in the margin shows s attached to the nodes of S and t attached to the nodes of S̄.) Here S̄ is the complement set of S, d_S = D e_S, d_S̄ = D e_S̄, and e_S is an indicator vector for S.

Let B, w be the incidence matrix and weight vector for the localized cut graph. Then PageRank is equivalent to the following problem (see full details in [Gleich and Mahoney, 2014]):

minimize_x  w^T (Bx)² = Σ_{i,j} w_{i,j} (x_i − x_j)² = x^T B^T diag(w) B x
subject to  x_s = 1, x_t = 0,   (1)

where the square in (Bx)² is elementwise. We call this a cut problem because if we replace the squared term with an absolute value (i.e., Σ w_{i,j} |x_i − x_j|), then we have the standard s,t-mincut problem.
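To make the equivalence concrete, here is a small Julia sketch (our own illustration, not the paper's released code) that solves both PageRank formulations above and checks that they agree; it assumes a sparse adjacency matrix A and a stochastic seed vector v.

    using SparseArrays, LinearAlgebra

    # Seeded PageRank two ways: (I − αAD⁻¹)x = (1−α)v and (γD + L)y = γv.
    function seeded_pagerank(A::SparseMatrixCSC, v::Vector, α::Real)
        d = vec(sum(A, dims=2))                       # weighted degrees
        D = sparse(Diagonal(d))
        x = (I - α * A * sparse(Diagonal(1 ./ d))) \ ((1 - α) * v)
        γ = (1 - α) / α                               # the γ used in the cut form
        L = D - A                                     # graph Laplacian
        y = (γ * D + L) \ (γ * v)
        @assert x ≈ D * y                             # consistency: y = D⁻¹x
        return x
    end

Solving the 2-norm cut problem (1) on the localized cut graph with x_s = 1 and x_t = 0 recovers this same vector, which is the sense in which PageRank is a 2-norm cut computation.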
Our paper proceeds from changing this power of 2 into a more general loss function ℓ and also adding a sparsity penalty, which is often needed to produce strongly local solutions [Gleich and Mahoney, 2014]. We define this formally now.

DEFINITION 1 (Generalized local graph cut). Fix a set S of seeds and a value of γ. Let B, w be the incidence matrix and weight vector of the localized cut graph. Then the generalized local graph cut problem is

minimize_x  w^T ℓ(Bx) + κγ d^T x = Σ_{ij} w_{i,j} ℓ(x_i − x_j) + κγ Σ_i x_i d_i
subject to  x_s = 1, x_t = 0, x ≥ 0.   (2)

FIGURE 3 – A comparison of seeded cut-like and clustering objectives on a regular grid-graph with 4 axis-aligned neighbors. Panels: (a) PageRank (α = 0.85); (b) q = 2, γ = κ = 10^{−…}; (c) q = 5, γ = 10^{−…}, κ = 10^{−…}; (d) q = 1.…, γ = κ = 10^{−…}; (e) heat kernel t = 10, ε = 0.003; (f) CRD U = 60, h = 60, w = 5; (g) p = 1.…, h = 0.…, k = 35000; (h) p = 1.…, h = 0.…, k = 7500. The graph is 50-by-50, the seed is in the center. The diffusions localize before the boundary so we only show the relevant region and the quantile contours of the values. We selected the parameters to give similar-sized outputs. (Top row) At left (a), we have seeded PageRank; (b)–(d) show our q-norm objectives; (b) is a 2-norm which closely resembles PageRank; (c) is a 5-norm that has diamond-contours; and (d) is a 1.…-norm. (Bottom row) The alternatives ((g) a p-norm non-linearity in the diffusion or (h) a p-Laplacian) show that similar results are possible with existing methods, although they lack the simplicity of our optimization setup and often lack strongly local algorithms.

Here ℓ(x) is an element-wise function and κ ≥ 0 is a sparsity-promoting term. We compare using power functions ℓ(x) = (1/q)|x|^q to a variety of other techniques for semi-supervised learning and local clustering in Figure 3. If ℓ is convex, then the problem is convex and can be solved via general-purpose solvers such as CVX. An additional convex solver is SnapVX [Hallac et al., 2017], which studied a general combination of convex functions on nodes and edges of a graph, although neither of these approaches scales to the large graphs we study in subsequent portions of this paper (65 million edges). To produce a specialized, strongly local solver, we found it necessary to restrict the class of functions ℓ to have similar properties to the power function ℓ(x) = (1/q)|x|^q and its derivative ℓ′(x).
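Concretely, for the power function these pieces are one-liners; the following is a small sketch (the names are ours, not the released SLQ.jl interface):

    using LinearAlgebra

    # Power loss ℓ(x) = (1/q)|x|^q and its derivative ℓ'(x) = |x|^{q-1} sgn(x).
    ℓ(x, q)  = abs(x)^q / q
    dℓ(x, q) = abs(x)^(q - 1) * sign(x)

    # Objective of (2) for an incidence matrix B, edge weights w, degrees d.
    objective(B, w, d, x, q, κ, γ) = sum(w .* ℓ.(B * x, q)) + κ * γ * dot(d, x)

For q close to 1, the second derivative ℓ″(x) = (q − 1)|x|^{q−2} blows up near 0, which is one numerical motivation for the q-Huber variants introduced below.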
Reproduction notes for Figure 3. We release the exact code to reproduce this figure. For all methods, for all values above a threshold, we compute 4 quantile lines to give roughly equally spaced regions. (a) PageRank is mathematically non-zero at all nodes in a connected graph. Here, we threshold at 10^{−…} to focus on the circular contours. This is reproduced by (b) using q = 2. The "wiggles" around the edge are because we used CVX to solve this problem and there were minor tolerance issues around the edge. We also boosted the threshold to 5·10^{−…} because of the tolerance in CVX. (c) Same as (b). (d) We used our SLQ solver because CVXpy with either the ECOS or SCS solver reported an error while using q = 1.25. We set ρ = 0.99 to get an accurate solution (close to KKT). Here, we used the algorithmic non-zeros as the code introduces elements "sparsely". (e) This used mathematical non-zeros again because the algorithm from [Kloster and Gleich, 2014] uses the same sparse "push" mechanisms as our SLQ algorithm. (f) CRD returns a set, so we simply display that set. The parameters were chosen to make it look as close to a square as possible. (g and h) We used the forward Euler algorithm from [Ibrahim and Gleich, 2019] with non-zero truncation. k is the number of steps and h is the step-size. These were chosen to make the pictures look like diamonds and squares, respectively, to mirror our results. The entry thresholds were also 5 times the minimum element because the vectors are non-zero everywhere.

DEFINITION 2
On the [−1, 1] domain, the loss function ℓ(x) should satisfy: (1) ℓ(x) is convex; (2) ℓ′(x) is an increasing and anti-symmetric function; (3) for Δx > 0, ℓ′(x) should satisfy either of the following conditions with constants k > 0 and c > 0: (3a) ℓ′(x + Δx) ≤ ℓ′(x) + kℓ′(Δx) and ℓ″(x) > c, or (3b) ℓ′(x) is c-Lipschitz continuous and ℓ′(x + Δx) ≥ ℓ′(x) + kℓ′(Δx) when x ≥ 0.

REMARK 3. If ℓ′(x) is Lipschitz continuous with Lipschitz constant L and ℓ″(x) > c, then constraint 3(a) can be satisfied with k = L/c. However, ℓ′(x) can still satisfy 3(a) even if it is not Lipschitz continuous. A simple example is ℓ′(x) = |x|^{0.5} sgn(x), −1 ≤ x ≤ 1. In this case, k = 1, but it is not Lipschitz continuous at x = 0. On the other hand, when ℓ′(x) is Lipschitz continuous, it can satisfy constraint 3(b) even if ℓ″(x) = 0. An example is ℓ′(x) = |x|^{1.5} sgn(x), −1 < x < 1. In this case ℓ″(x) = 0 when x = 0, but ℓ′(x + Δx) ≥ ℓ′(x) + ℓ′(Δx) when x ≥ 0.

LEMMA 4
The power function ℓ(x) = (1/q)|x|^q, −1 < x < 1, satisfies Definition 2 for any q > 1. More specifically, when 1 < q < 2, ℓ(x) satisfies 3(a) with c = q − 1 and k = 2^{2−q}; when q ≥ 2, ℓ(x) satisfies 3(b) with c = q − 1 and k = 1.

Proof. First, we know ℓ′(x) = |x|^{q−1} sgn(x) and ℓ″(x) = (q − 1)|x|^{q−2}, and we define ℓ″(0) = ∞.

For 3(a), since −1 < x < 1 and 1 < q < 2, we have ℓ″(x) > (q − 1). Moreover,
(ℓ′(x + Δx) − ℓ′(x)) / ℓ′(Δx) = |x/Δx + 1|^{q−1} sgn(x/Δx + 1) − |x/Δx|^{q−1} sgn(x/Δx).
Define a new function f(x) = |1 + x|^{q−1} sgn(1 + x) − |x|^{q−1} sgn(x). Then f′(x) = (q − 1)(|1 + x|^{q−2} − |x|^{q−2}), so the maximum of f(x) is achieved at f(−0.5) = 2^{2−q}.

For 3(b), since −1 < x < 1 and q ≥ 2, we have ℓ″(x) ≤ (q − 1), so ℓ′ is (q − 1)-Lipschitz, and for x ≥ 0, (x + Δx)^{q−1} ≥ x^{q−1} + Δx^{q−1} is obvious. □

Note that ℓ(x) = |x| does not satisfy either choice for property (3). Consequently, our theory will not apply to mincut problems. In order to justify the generalized terms, we note that q-norm generalizations of the Huber and Berhu loss functions [Owen, 2007] do satisfy these definitions.

DEFINITION 5
Given 1 < q < 2 and 0 < δ < 1, the "q-Huber" and "Berq" functions are

q-Huber:  ℓ(x) = (δ^{q−2}/2) x²  if |x| ≤ δ,  and  ℓ(x) = (1/q)|x|^q + (1/2 − 1/q) δ^q  otherwise;

Berq:  ℓ(x) = (δ^{2−q}/q) |x|^q  if |x| ≤ δ,  and  ℓ(x) = (1/2) x² + (1/q − 1/2) δ²  otherwise.
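Definition 5 translates directly into code. Here is a sketch of both losses and their derivatives (the function names are ours); the additive constants make ℓ and ℓ′ continuous at |x| = δ:

    # q-Huber: quadratic within [-δ, δ], q-power outside.
    qhuber(x, q, δ)  = abs(x) ≤ δ ? δ^(q-2) * x^2 / 2 :
                       abs(x)^q / q + (1/2 - 1/q) * δ^q
    dqhuber(x, q, δ) = abs(x) ≤ δ ? δ^(q-2) * x : abs(x)^(q-1) * sign(x)

    # Berq: q-power within [-δ, δ], quadratic outside.
    berq(x, q, δ)  = abs(x) ≤ δ ? δ^(2-q) * abs(x)^q / q :
                     x^2 / 2 + (1/q - 1/2) * δ^2
    dberq(x, q, δ) = abs(x) ≤ δ ? δ^(2-q) * abs(x)^(q-1) * sign(x) : x

The q-Huber loss replaces the q-power near zero with a quadratic, so ℓ″ stays bounded there; Berq does the reverse, using the q-power near zero with a quadratic tail.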
LEMMA 6. When −1 ≤ x ≤ 1, both q-Huber and Berq satisfy Definition 2. The value of k for both is 2^{2−q}; the c for q-Huber is q − 1 while the c for Berq is 1.

Proof. Obviously, both conditions (1) and (2) are satisfied for q-Huber and Berq. Now we show 3(a) is also satisfied for q-Huber, based on the proof of Lemma 4; the proof for Berq is similar. When Δx > δ (the case Δx ≤ δ is similar),

k = (ℓ′(x + Δx) − ℓ′(x)) / Δx^{q−1} =
  |x/Δx + 1|^{q−1} sgn(x/Δx + 1) − |x/Δx|^{q−1} sgn(x/Δx)   if |x| > δ, |x + Δx| > δ
  (δ^{q−2}(x + Δx) − |x|^{q−1} sgn(x)) / Δx^{q−1}           if |x| > δ, |x + Δx| ≤ δ
  (|x + Δx|^{q−1} sgn(x + Δx) − δ^{q−2} x) / Δx^{q−1}       if |x| ≤ δ, |x + Δx| > δ
  δ^{q−2} Δx^{2−q}                                          if |x| ≤ δ, |x + Δx| ≤ δ.

Case 1: same as the proof of Lemma 4.

Case 2: in this case, x can only be negative, i.e. x < −δ. After some simplification,
k = (Δx/δ)^{2−q} − ( ((−x)/δ)^{2−q} − 1 ) ((−x)/Δx)^{q−1}.
Note that the right hand side is an increasing function of Δx and −δ − x ≤ Δx ≤ δ − x. Replacing Δx by −δ − x yields
k = ((−x)^{q−1} − δ^{q−1}) / (−x − δ)^{q−1} > 0.
Replacing Δx by δ − x yields
k = (δ^{q−1} + (−x)^{q−1}) / (δ − x)^{q−1} ≤ 2^{2−q}.
Here the last inequality is due to Jensen's inequality.

Case 3: its proof is very similar to Case 2.

Case 4: since δ < Δx ≤ 2δ in this case, 0 ≤ k ≤ 2^{2−q}. □

We now state uniqueness.
THEOREM 7
Fix a set S, γ > 0, κ > 0. For any loss function satisfying Definition 2, the solution x of (2) is unique. Moreover, define a residual function r(x) = −(1/γ) B^T diag(ℓ′(Bx)) w. A necessary and sufficient condition to satisfy the KKT conditions is to find x* where x* ≥ 0, r(x*) = [r_s, g^T, r_t]^T with g ≤ κd (where d reflects the original graph), k* = [0, (κd − g)^T, 0]^T, and g^T(κd − g) = 0.

Proof.
We first prove uniqueness. The Hessian of the objective in (2) has entries
H(i,i) = Σ_{j∼i} w_{i,j} ℓ″(x_i − x_j) + γ d_i ℓ″(x_i − (e_S)_i)  and  H(i,j) = −w_{i,j} ℓ″(x_i − x_j) for i ∼ j,
so that
x^T H x = γ Σ_{i∈V} d_i x_i² ℓ″(x_i − (e_S)_i) + Σ_{i∼j} w_{i,j} (x_i − x_j)² ℓ″(x_i − x_j).
If 3(a) is satisfied, we have ℓ″(x) > 0 and x^T H x > 0, so the objective of (2) is strictly convex and uniqueness is guaranteed. When 3(b) is satisfied, ℓ′(x + Δx) ≥ ℓ′(x) + kℓ′(Δx) guarantees that ℓ′(x) can only become zero in a range around zero, i.e. ℓ′(x) = ℓ″(x) = 0 when x ∈ [−ψ, ψ], where 0 ≤ ψ ≤ 1. Then x^T H x = 0 implies x_i ≥ 1 − ψ when i ∈ S, x_i ≤ ψ when i ∉ S, and −ψ ≤ x_i − x_j ≤ ψ whenever i ∼ j. In this case, uniqueness is implied by the κγ d^T x term in (2), i.e. each x_i will be the smallest feasible value.

Next, we show the KKT conditions of (2). If we translate problem (2) to add the constraint u = Bx, then the loss is ℓ(u). The Lagrangian is
L = w^T ℓ(u) + κγ d^T x − f^T (Bx − u) − λ_s (x_s − 1) − λ_t x_t − k^T x.
Standard optimality results give the KKT conditions of (2) as

∂L/∂x = κγ d − B^T f − λ_s e_s − λ_t e_t − k = 0
∂L/∂u = diag(ℓ′(u)) w + f = 0
k^T x = 0,  Bx = u,  k ≥ 0,  x_s = 1,  x_t = 0.   (4)

Thus, combining the first and second equations, γ r = B^T f. Since k ≥ 0, from the first equation we have g ≤ κd. And from k^T x = 0, we have g^T(κd − g) = 0. □

3 STRONGLY LOCAL ALGORITHMS
In this section, we provide a strongly local algorithm to approximately optimize equation (2) with ℓ(x) satisfying Definition 2. The simplest way to understand this algorithm is as a nonlinear generalization of the Andersen-Chung-Lang push procedure for PageRank [Andersen et al., 2006], which we call ACL. (The ACL procedure has strong relationships with Gauss-Seidel, coordinate solvers, and various other standard algorithms.) The overall algorithm is simple: find a vertex i where the KKT conditions from Theorem 7 are violated and increase x_i on that node until we approximately satisfy the KKT conditions. Update the residual, look for another violation, and repeat. The ACL algorithm targets the q = 2 case, which has a closed-form update; we simply need to replace this with a binary search.

    Algorithm nonlin-cut(γ, κ, ρ, ε) for set S and graph G, where 0 < ρ < 1 and ε determine accuracy
      1. Let x(i) = 0 except for x_s = 1, and set r = −(1/γ) B^T diag[ℓ′(Bx)] w
      2. While there is any vertex i where r_i > κ d_i, or stop if none exists (find a KKT violation)
      3.   Apply nonlin-push at vertex i, updating x and r
      4. Return x

    Algorithm nonlin-push(i, γ, κ, x, r, ρ, ε)
      1. Use binary search to find Δx_i such that the ith coordinate of the residual, after adding
         Δx_i to x_i, is r_i = ρκd_i; the binary search stops when the range of Δx is smaller
         than ε (satisfy KKT at i)
      2. Change the following entries in x and r to update the solution and residual
         (a) x_i ← x_i + Δx_i
         (b) for each neighbor j in the original graph G:
             r_j ← r_j + (1/γ) w_{i,j} (ℓ′(x_j − x_i) − ℓ′(x_j − x_i − Δx_i))

For ρ < 1, we only approximately satisfy the KKT conditions, as discussed further in Section 3.3. We have the following strongly local runtime guarantee when 3(a) in Definition 2 is satisfied; see Section 3.2 for the similar guarantee under 3(b). (This ignores the binary search, but that only scales the runtime by log(1/ε) because the values are in [0, 1].)
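To make the procedure concrete, here is a minimal Julia sketch of nonlin-cut and the push step. It is our own simplified rendering, not the released SLQ.jl implementation: it assumes a helper neighbors(G, i) yielding (j, w_ij) pairs, a degree vector d, and a loss derivative dℓ such as dℓ(x) = abs(x)^(q-1)*sign(x), and it maintains the residual densely for clarity (a strongly local implementation keeps x and r sparse).

    # Residual r_i(x) = -(1/γ) Σ_{j∼i} w_ij ℓ'(x_i - x_j) - d_i ℓ'(x_i - (e_S)_i).
    function residual(G, d, inS, x, i, γ, dℓ)
        acc = 0.0
        for (j, wij) in neighbors(G, i)
            acc += wij * dℓ(x[j] - x[i])      # = -ℓ'(x_i - x_j) by anti-symmetry
        end
        return acc / γ - d[i] * dℓ(x[i] - (inS[i] ? 1.0 : 0.0))
    end

    function nonlin_cut(G, d, S, γ, κ, ρ, ε, dℓ)
        n = length(d)
        inS = falses(n); inS[S] .= true
        x = zeros(n)
        r = [residual(G, d, inS, x, i, γ, dℓ) for i in 1:n]  # = d_i on S, 0 elsewhere
        queue = collect(S)                     # only seeds violate KKT at the start
        while !isempty(queue)
            i = popfirst!(queue)
            r[i] > κ * d[i] || continue        # skip stale queue entries
            lo, hi = 0.0, 1.0                  # Δx_i lives in [0, 1]
            while hi - lo > ε                  # binary search for r_i(Δ) = ρκd_i
                Δ = (lo + hi) / 2
                x[i] += Δ
                ri = residual(G, d, inS, x, i, γ, dℓ)
                x[i] -= Δ
                ri > ρ * κ * d[i] ? (lo = Δ) : (hi = Δ)
            end
            Δ = lo
            for (j, wij) in neighbors(G, i)    # update neighbor residuals (old x_i)
                r[j] += (wij / γ) * (dℓ(x[j] - x[i]) - dℓ(x[j] - x[i] - Δ))
                r[j] > κ * d[j] && push!(queue, j)
            end
            x[i] += Δ
            r[i] = residual(G, d, inS, x, i, γ, dℓ)  # ≈ ρκd_i by construction
            r[i] > κ * d[i] && push!(queue, i)       # re-queue if ε-truncation left a violation
        end
        return x
    end

The binary search is the only structural change relative to ACL: when q = 2 the equation r_i(Δ) = ρκd_i is linear in Δ and has a closed-form solution, which recovers the ACL push step.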
THEOREM 8. Let γ > 0, κ > 0 be fixed and let k and c be the parameters from Definition 2 for ℓ(x). For 0 < ρ < 1, suppose nonlin-cut stops after T iterations, and let d_i be the degree of the node updated at the ith iteration; then T must satisfy

Σ_{i=1}^T d_i ≤ vol(S) / [ c (ℓ′)^{−1}( γ(1 − ρ)κ / (k(1 + γ)) ) ] = O(vol(S)).

The notation (ℓ′)^{−1} refers to the inverse function of ℓ′(x); this function must be invertible under the definition in 3(a). The runtime bound when 3(b) holds is slightly different; see below. Note that if κ = 0, γ = 0, or ρ = 1, then this bound goes to ∞ and we lose our guarantee. However, if these are not the case, then the bound on T shows that the algorithm will terminate in time that is independent of the size of the graph. This is the type of guarantee provided by strongly local graph algorithms and it has been extremely useful for scalable network analysis methods [Leskovec et al., 2009; Jeub et al., 2015; Yin et al., 2017; Veldt et al., 2016; Kloster and Gleich, 2014].
LEMMA 9
During Algorithm 1, for any i ∈ V \ {s, t}, g_i will stay nonnegative and 0 ≤ x_i ≤ 1.

Proof. We can show this by induction. At the initial step, for node i ∈ S, g_i = d_i, and for node i ∈ S̄, g_i = 0. And after a nonlin-push step, every g_i will stay nonnegative. To prove 0 ≤ x_i ≤ 1, by expanding g_i we have
g_i = −(1/γ) Σ_{j∼i} w_{i,j} ℓ′(x_i − x_j) − d_i ℓ′(x_i − (e_S)_i).
Each x_i starts at 0 and only increases, so x_i ≥ 0. If x_i is the largest element of x and x_i > 1, then we have ℓ′(x_i − x_j) ≥ 0 for all j ∼ i and ℓ′(x_i − (e_S)_i) > 0. Then g_i < 0, which is a contradiction. □
LEMMA 10
When 3(a) is satisfied, after calling nonlin-push on node i, the decrease of ‖g‖₁ will be strictly larger than
c d_i (ℓ′)^{−1}( γ(1 − ρ)κ / (k(1 + γ)) ).

Proof. We use g′ to denote g after calling nonlin-push on node i. At any intermediate step of the nonlin-cut procedure,
‖g‖₁ = Σ g_i = −Σ_{i∈S} d_i ℓ′(x_i − 1) − Σ_{i∈S̄} d_i ℓ′(x_i).
This is because for any edge (i,j) ∈ E, g_i has a term −(1/γ) w(i,j) ℓ′(x_i − x_j) while g_j has a term −(1/γ) w(j,i) ℓ′(x_j − x_i). Since our graph is undirected, w(i,j) = w(j,i), so these two terms cancel out. What remains are the terms corresponding to the edges connecting to s or t. So after calling nonlin-push on node i,
‖g‖₁ − ‖g′‖₁ = d_i ℓ′(x_i + Δx_i − (e_S)_i) − d_i ℓ′(x_i − (e_S)_i) ≥ d_i min{ℓ″(x_i + Δx_i − (e_S)_i), ℓ″(x_i − (e_S)_i)} Δx_i ≥ c d_i Δx_i.
On the other hand, we need to choose Δx_i such that g′_i = ρκd_i. We know
g′_i = −(1/γ) Σ_{j∼i} w(i,j) ℓ′(x_i + Δx_i − x_j) − d_i ℓ′(x_i + Δx_i − (e_S)_i)
is a decreasing function of Δx_i. When Δx_i = 0, g′_i = g_i > κd_i > ρκd_i; when Δx_i = 1, g′_i < 0 < ρκd_i; since ℓ′(x) is a strictly increasing function, there exists a unique Δx_i such that g′_i = ρκd_i. Moreover, we can lower bound Δx_i. To see that,
g′_i = ρκd_i = −(1/γ) Σ_{j∼i} w(i,j) ℓ′(x_i + Δx_i − x_j) − d_i ℓ′(x_i + Δx_i − (e_S)_i)
 ≥ −(1/γ) Σ_{j∼i} w(i,j) ℓ′(x_i − x_j) − d_i ℓ′(x_i − (e_S)_i) − (k(1 + γ)/γ) d_i ℓ′(Δx_i)
 = g_i − (k(1 + γ)/γ) d_i ℓ′(Δx_i).
Thus, we have
Δx_i ≥ (ℓ′)^{−1}( γ(g_i − ρκd_i) / (k(1 + γ) d_i) ) > (ℓ′)^{−1}( γ(1 − ρ)κ / (k(1 + γ)) ),
which means ‖g‖₁ − ‖g′‖₁ > c d_i (ℓ′)^{−1}( γ(1 − ρ)κ / (k(1 + γ)) ). □

The only step left to prove Theorem 8 is that at the beginning we have ‖g‖₁ = vol(S). Then the theorem follows from Lemma 10.

3.2 RUNNING TIME ANALYSIS WHEN 3(B) IS SATISFIED

For the following results, we add an extra strictly-increasing condition so that ℓ′(γ(1 − ρ)κ/(c(1 + γ))) is positive. When ℓ′ is not strictly increasing, i.e. ℓ′(x) = 0 in a small range around 0, it is our conjecture that the algorithm will still finish in strongly local time, although we have not yet proven that. Note that this strictly-increasing criterion holds for all the losses used in the experiments.

LEMMA 11
When 3(b) is satisfied and ℓ′(x) is strictly increasing, after calling nonlin-push on node i, the decrease of ‖g‖₁ will be strictly larger than
k d_i ℓ′( γ(1 − ρ)κ / (c(1 + γ)) ).

Proof. Similarly to the proof of Lemma 10, after calling nonlin-push on node i,
‖g‖₁ − ‖g′‖₁ = d_i ℓ′(x_i + Δx_i − (e_S)_i) − d_i ℓ′(x_i − (e_S)_i) ≥ k d_i ℓ′(Δx_i).
On the other hand,
g′_i = ρκd_i = −(1/γ) Σ_{j∼i} w(i,j) ℓ′(x_i + Δx_i − x_j) − d_i ℓ′(x_i + Δx_i − (e_S)_i)
 ≥ −(1/γ) Σ_{j∼i} w(i,j) ℓ′(x_i − x_j) − d_i ℓ′(x_i − (e_S)_i) − (c(1 + γ)/γ) d_i Δx_i
 = g_i − (c(1 + γ)/γ) d_i Δx_i.
Thus, we have Δx_i ≥ γ(g_i − ρκd_i)/(c(1 + γ) d_i) > γ(1 − ρ)κ/(c(1 + γ)), which means
‖g‖₁ − ‖g′‖₁ > k d_i ℓ′( γ(1 − ρ)κ / (c(1 + γ)) ). □

Lemma 11, along with the same type of analysis as before, gives the following result when 3(b) is satisfied.
THEOREM 12
Let γ > 0, κ > 0 be fixed and let k and c be the parameters from Definition 2 for ℓ(x) when 3(b) is satisfied with a strict increase. For 0 < ρ < 1, suppose nonlin-cut stops after T iterations, and let d_i be the degree of the node updated at the ith iteration; then T must satisfy

Σ_{i=1}^T d_i ≤ vol(S) / [ k ℓ′( γ(1 − ρ)κ / (c(1 + γ)) ) ] = O(vol(S)).

When ρ < 1, we only approximately satisfy the KKT conditions. Here, we do some quick analysis of the difference between the idealized slackness condition k^T x = 0 and what we get from our solver. Note that by choosing ρ close to 1, we do produce a fairly accurate solution when 3(a) is satisfied.

LEMMA 13
When Algorithm 1 returns, if ℓ(x) satisfies 3(a) we have
k^T x ≤ κ k ℓ′(1) (1 − ρ) vol(S) / c.

Proof. We know k = [0, (κd − g)^T, 0]^T. Every time Algorithm 2 is called at node i, it will set g_i = ρκd_i. In the following iterations, g_i can only increase until Algorithm 2 is called at node i again. This means k ≤ (1 − ρ)κd. On the other hand, when 3(a) is satisfied, ℓ′(1 − x_i) ≤ −ℓ′(x_i) + kℓ′(1), so
‖g‖₁ = −Σ_{i∉S} d_i ℓ′(x_i) − Σ_{i∈S} d_i ℓ′(x_i − 1) ≤ −Σ_{i∈V} d_i ℓ′(x_i) + kℓ′(1) vol(S) ≤ −c d^T x + kℓ′(1) vol(S).
Thus d^T x ≤ (kℓ′(1)/c) vol(S). Combining the two inequalities gives this lemma. □

When 3(b) is satisfied, it is easy to see that k^T x ≤ (1 − ρ)κ d^T x; however, there is no closed-form upper bound on k^T x in terms of vol(S).

A common use for the results of these localized cut solutions is as localized Fiedler vectors of a graph to induce a cluster [Andersen et al., 2006; Leskovec et al., 2009; Mahoney et al., 2012; Zhu et al., 2013; Orecchia and Zhu, 2014]. This was the original motivation of the ACL procedure [Andersen et al., 2006], for which the goal was a small conductance cluster. One of the most common (and theoretically justified!) ways to convert a real-valued "clustering hint" vector x into clusters is to use a sweep cut process. This involves sorting x in decreasing order and evaluating the conductance of each prefix set S_j = {x_1, x_2, ..., x_j} for each j ∈ [n]. The set with the smallest conductance is returned. This computation is a key piece of Cheeger inequalities [Chung, 1992; Mihail, 1989].

In the following, we seek a slightly different type of guarantee. We posit the existence of a target cluster T and show that if T has useful clustering properties (small conductance, no good internal clusters), then a sweep cut over a q-norm or q-Huber localized cut vector seeded inside of T will accurately recover T. The key piece is understanding how the computation plays out with respect to T inside the graph and T as a graph by itself.

The following two observations are not directly related to the main result, but we still find them useful in understanding the problem in general.
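As an aside, the sweep cut described above is only a few lines of code. The following sketch (our own, assuming a symmetric sparse adjacency matrix with no self-loops) evaluates the conductance of every prefix of the sorted solution vector and returns the best prefix set; for a strongly local method it suffices to sweep only over the support of x.

    using SparseArrays

    function sweepcut(A::SparseMatrixCSC, x::Vector)
        d = vec(sum(A, dims=2)); volG = sum(d)
        order = sortperm(x, rev=true)          # sort x in decreasing order
        inset = falses(length(x))
        volS, cutS = 0.0, 0.0
        bestφ, bestk = Inf, 0
        for (k, v) in enumerate(order)
            x[v] > 0 || break                  # only sweep over the support of x
            volS += d[v]
            rows, vals = findnz(A[:, v])
            for (u, w) in zip(rows, vals)      # edges to swept nodes leave the cut
                cutS += inset[u] ? -w : w
            end
            inset[v] = true
            φ = cutS / min(volS, volG - volS)
            if φ < bestφ
                bestφ, bestk = φ, k
            end
        end
        return order[1:bestk], bestφ
    end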
LEMMA 14
For two seed sets S₁ and S₂, denote by x₁ and x₂ the solutions of the ℓq-norm cut problem using S₁ and S₂ correspondingly. If S₁ ⊆ S₂, then x₁ ≤ x₂.

Proof. Consider two nonlin-cut processes P₁, P₂ using S₁ or S₂ as input correspondingly. Suppose we set the initial vector of P₂ to be the solution of P₁, i.e. x₁; then for nodes i ∉ S₂ \ S₁ the residual stays zero, while for nodes i ∈ S₂ \ S₁ the residual becomes positive. This means P₂ needs more iterations to converge, and each iteration can only add nonnegative values to x₁. Thus, x₁ ≤ x₂. □

LEMMA 15
Suppose that κ = 0. We can compute the exact solution of problem (2) under the two extreme cases γ → ∞ and γ → 0:
· When γ → ∞, x_i = 1 for i ∈ S and x_i = 0 for i ∈ S̄.
· When γ → 0, x_i ≥ (vol(S))^{1/(q−1)} / (vol(V))^{1/(q−1)} for any i ∈ V.

Proof. When κ = 0, the objective function of (2) becomes
Σ_{i∼j} w(i,j) ℓ(x_i − x_j) + γ Σ_{i∈V} d_i ℓ(x_i − (e_S)_i).
When γ → ∞, the first term vanishes relative to the second, and the second term achieves its smallest value when x_i = 1 for i ∈ S and x_i = 0 for i ∈ S̄. When γ → 0, the second term vanishes, and the first term is minimal with objective zero when every x_i converges to a fixed constant. Moreover, the KKT condition now becomes
(1/γ) Σ_{j∼i} w(i,j) ℓ′(x_i − x_j) + d_i ℓ′(x_i − (e_S)_i) = 0.
Summing the KKT condition over all nodes yields
Σ_{i∈V} d_i ℓ′(x_i − (e_S)_i) = 0,
so we can compute the constant that x_i converges to by setting x_i = c, which gives
c = (vol(S))^{1/(q−1)} / [ (vol(V) − vol(S))^{1/(q−1)} + (vol(S))^{1/(q−1)} ] ≥ (vol(S))^{1/(q−1)} / (vol(V))^{1/(q−1)}. □

As we mentioned before, the key piece is understanding how the computation plays out with respect to T inside the graph and T as a graph by itself. We use vol_T(S) for the volume of the seed set S in the subgraph induced by T, and ∂T ⊂ T for the boundary set of T, i.e. nodes in ∂T have at least one edge connecting to T̄. Quantities with tildes, e.g., d̃, reflect quantities in the subgraph induced by T. We assume κ = 0, ρ = 1, and:

ASSUMPTION 16
The seed set S satisfies S ⊆ T, S ∩ ∂T = ∅, and Σ_{i∈∂T} (d_i − d̃_i) x_i^{q−1} ≤ φ(T) vol(S). □

We call this the leaking assumption, which roughly states that the solution with the set S stays mostly within the set T. As some quick justification for this assumption, we note that when q = 2, [Zhu et al., 2013] shows by a Markov bound that there exists T_g with vol(T_g) ≥ vol(T)/2 such that any node i ∈ T_g satisfies Σ_{i∈∂T} (d_i − d̃_i) x_i ≤ φ(T) d_i. So in that case, any seed set S ⊆ T_g meets our assumption. For 1 < q < 2, it is trivial to see any set S with vol(S) ≥ vol(T) satisfies this assumption since the left hand side is always smaller than cut(T, T̄). However, such a strong assumption is not necessary for our approach. The above guarantee allows for a small vol(S) and we simply require that Assumption 16 holds. We currently lack a detailed analysis of how many such seed sets there will be.

Our second assumption regards the behavior within only the set T compared with the entire graph. To state it, we wish to be precise. Consider the localized cut graph associated with the hidden target set T on the entire graph and let B, w be the incidence matrix and weights for this graph. We wish to understand how the solution x of

minimize_x  w^T ℓ(Bx)   subject to  x_s = 1, x_t = 0, x ≥ 0   (5)

compares with the corresponding solution on the subgraph induced by T. Let B̃, w̃ be the incidence matrix and weight vector of the localized cut graph on the vertex-induced subgraph corresponding to T and seeded on T (so the tilde-problem is seeded on all of its nodes). So formally, we wish to understand how x̃ in

minimize_x̃  w̃^T ℓ(B̃x̃)   subject to  x̃_s = 1, x̃_t = 0, x̃ ≥ 0   (6)

compares with x. For these comparisons, we assume we are looking at values other than x_s, x_t and x̃_s, x̃_t.

ASSUMPTION 17
A relatively small γ should be chosen such that the solution of the localized q-norm cut problem in the subgraph induced by the target cluster T satisfies min(x̃) ≥ (0.5 vol_T(S))^{1/(q−1)} / (vol_T(T))^{1/(q−1)} = M. □

We will call Assumption 17 a "mixing-well" guarantee. To better understand this assumption, note that when ℓ(x) = (1/q)|x|^q and q = 2, a solution of the nonlin-cut process (Algorithm 1) is equivalent to a Markov process. In this case, one can lower bound min(x̃) by the well-known infinity-norm mixing time of a Markov chain. In fact, as shown in the proof of Lemma 3.2 of [Zhu et al., 2013], when γ ≤ O(φ(T) · Gap), we have min(x̃_T) ≥ 0.5 vol_T(S)/vol_T(T). Here Gap is defined as the ratio of internal connectivity and external connectivity and is often assumed to be Ω(1). We refer to [Zhu et al., 2013] for a detailed definition.

The proof of Lemma 3.2 in [Zhu et al., 2013] shows that the teleportation probability β = 1 − α needs to be smaller than O(φ(T) · Gap). When q = 2, as shown in [Gleich and Mahoney, 2014], β = γ/(1 + γ), which means γ = β/(1 − β). Since we assume γ < 1, we have β ≤ γ < 2β. In other words, γ and β only differ by a constant factor.

For 1 < q < 2, nonlin-cut is no longer equivalent to the solution of a Markov process and thus it is more difficult to derive a closed-form expression for how small γ needs to be so that Assumption 17 is satisfied. However, Lemma 18 (below) shows that for graphs with small diameters, it is easier (i.e., γ can be larger) for the solution of (6) to satisfy Assumption 17. This is reasonable because we expect good clusters and good communities to have small diameters.

LEMMA 18
Assume the subgraph induced by the target cluster T has diameter O(log|T|) and that when we uniformly randomly sample points from T as seed sets, the expected largest distance of any node in S̄ to S is O(log(|T|)/|S|). Also define γ₀ to be the largest γ such that Assumption 17 is satisfied at q = 2, and assume γ₀ < 1. If we set γ = γ₀^{q−1} for 1 < q < 2, and

vol_T(S)/vol_T(T) ≤ 2 [ (γ₀/(1 + γ₀)) (l^{1/(q−1)} + 1)^{−log(|T|)/|S|} ]^{q−1},

where l ≤ (1 + γ) max(d̃_i), then the solution of (6) can satisfy Assumption 17.

Proof. Given a seed set S, we can partition S̄ into disjoint subsets L₁ ∪ L₂ ∪ ... ∪ L_n, where L_k contains nodes that are distance k away from S. For any node i ∈ L_k, we denote
d_i^{out} = Σ_{j∼i, j∈L_k ∪ L_{k+1}} w(i,j)
and d_i^{in} = d̃_i − d_i^{out}. Also define l = max_i (d_i^{out} + γ d̃_i)/d_i^{in} ≤ (1 + γ) max(d̃_i). Suppose x̃_i ≥ c for any node i with distance at most k − 1; then we can show for a node i ∈ L_k that x̃_i ≥ c/(l^{1/(q−1)} + 1). To see this, if x̃_i < c, then by the KKT condition,
d_i^{in} (c − x̃_i)^{q−1} ≤ d_i^{out} x̃_i^{q−1} + γ d̃_i x̃_i^{q−1},
where for j ∼ i we lower bound x̃_j by c if j is closer to S, and by 0 otherwise. This means
x̃_i ≥ c (d_i^{in})^{1/(q−1)} / [ (d_i^{out} + γ d̃_i)^{1/(q−1)} + (d_i^{in})^{1/(q−1)} ] ≥ c / (l^{1/(q−1)} + 1).
Also, for a node i ∈ S, the first iteration of the q-norm process adds at least γ^{1/(q−1)}/(1 + γ^{1/(q−1)}) to x̃_i (this follows from unrolling the first loop of our algorithm and checking that this satisfies the binary search criteria), which means x̃_i ≥ γ^{1/(q−1)}/(1 + γ^{1/(q−1)}). Thus, for a node i ∈ L_k,
x̃_i ≥ (γ^{1/(q−1)}/(1 + γ^{1/(q−1)})) (1/(l^{1/(q−1)} + 1))^k = (γ₀/(1 + γ₀)) (1/(l^{1/(q−1)} + 1))^k.
Since the subgraph induced by the target cluster T has diameter O(log(|T|)) and, when we uniformly randomly sample points from T as seed sets, the expected largest distance r of any node in S̄ to S is O(log(|T|)/|S|), we have
min(x̃) ≥ (γ₀/(1 + γ₀)) (1/(l^{1/(q−1)} + 1))^{log(|T|)/|S|}.
Assumption 17 requires min(x̃) ≥ (0.5 vol_T(S))^{1/(q−1)}/(vol_T(T))^{1/(q−1)}, so we just need
vol_T(S)/vol_T(T) ≤ 2 [ (γ₀/(1 + γ₀)) (l^{1/(q−1)} + 1)^{−log(|T|)/|S|} ]^{q−1},
which was the final assumption. □

LEMMA 19
Under the previous assumptions, define a sweep cut set S_c as
{ i ∈ V | x_i ≥ c (0.5 vol(S))^{1/(q−1)} / (vol(T))^{1/(q−1)} };
then for any 0 < c ≤ 1,
vol(S_c \ T) = O( φ(T) / (γ c^{q−1}) ) vol(T)  and  vol(T \ S_c) = O( φ(T) / γ ) vol(T).

Proof. The proof is mostly a generalization of the proof of Lemma 3.4 in [Zhu et al., 2013]. For any i ∈ T̄, by the KKT condition and Assumption 16,
0 = r_i(x) = −(1/γ) Σ_{j∼i} w(i,j) ℓ′(x_i − x_j) − d_i x_i^{q−1}
 = −(1/γ) Σ_{j∼i, j∈T̄} w(i,j) ℓ′(x_i − x_j) − (1/γ) Σ_{j∼i, j∈T} w(i,j) ℓ′(x_i − x_j) − d_i x_i^{q−1}
 = −(1/γ) Σ_{j∼i, j∈T̄} w(i,j) ℓ′(x_i − x_j) + (1/γ) Σ_{j∼i, j∈T} w(i,j) ℓ′(x_j − x_i) − d_i x_i^{q−1}
 < −(1/γ) Σ_{j∼i, j∈T̄} w(i,j) ℓ′(x_i − x_j) + (1/γ) Σ_{j∼i, j∈T} w(i,j) ℓ′(x_j) − d_i x_i^{q−1}.
By summing the inequality above over all nodes in T̄, the first terms all cancel out, yielding
Σ_{i∈T̄} d_i x_i^{q−1} < (1/γ) Σ_{i∈∂T} (d_i − d̃_i) x_i^{q−1} ≤ φ(T) vol(S) / γ.
For i ∈ S_c \ T, x_i^{q−1} ≥ c^{q−1} · 0.5 · vol(S)/vol(T), thus
( c^{q−1} vol(S) / (2 vol(T)) ) vol(S_c \ T) ≤ Σ_{i∈S_c\T} d_i x_i^{q−1} ≤ φ(T) vol(S) / γ,
which means vol(S_c \ T) = O( φ(T) / (γ c^{q−1}) ) vol(T).

In the following, we define x_i = x̃_i + v_i and ℓ′(x_i − (e_S)_i) = ℓ′(x̃_i − (e_S)_i) + k_i ℓ′(v_i). For any node i ∈ T, by the KKT condition,
0 = r_i(x) = −(1/γ) Σ_{j∼i} w(i,j) ℓ′(x_i − x_j) − d_i ℓ′(x_i − (e_S)_i)
 = −(1/γ) Σ_{j∼i, j∈T} w(i,j) ℓ′(x_i − x_j) − (1/γ) Σ_{j∼i, j∈T̄} w(i,j) ℓ′(x_i − x_j) − d_i ℓ′(x_i − (e_S)_i)
 > −(1/γ) Σ_{j∼i, j∈T} w(i,j) ℓ′(x_i − x_j) − (1/γ) Σ_{j∼i, j∈T̄} w(i,j) ℓ′(x_i) − d̃_i ℓ′(x_i − (e_S)_i) − (d_i − d̃_i) ℓ′(x_i)
 = −(1/γ) Σ_{j∼i, j∈T} w(i,j) ℓ′(x_i − x_j) − d̃_i ℓ′(x̃_i − (e_S)_i) − k_i d_i ℓ′(v_i) − (1 + 1/γ)(d_i − d̃_i) ℓ′(x_i)
 = −(1/γ) Σ_{j∼i, j∈T} w(i,j) ℓ′(x_i − x_j) + (1/γ) Σ_{j∼i, j∈T} w(i,j) ℓ′(x̃_i − x̃_j) − k_i d_i ℓ′(v_i) − (1 + 1/γ)(d_i − d̃_i) ℓ′(x_i),
where the last equality uses the KKT condition of the tilde-problem. By summing the inequality above over all nodes in T, the first and second terms cancel out, so it yields
Σ_{i∈T} k_i d_i ℓ′(v_i) > −((1 + γ)/γ) φ(T) vol(S).
For nodes i ∈ T \ S_c, x_i < c x̃_i, which means v_i < (c − 1) x̃_i. And
ℓ′(v_i) = −(−v_i)^{q−1} < −(1 − c)^{q−1} · 0.5 · vol_T(S)/vol_T(T) ≤ −(1 − c)^{q−1} · 0.5 · vol(S)/vol(T).
(Here we use the facts that vol_T(T) ≤ vol(T) and S ∩ ∂T = ∅.) From the proof of Lemma 18, we know that S will be included in S_c. When i ∉ S,
k_i = (−x̃_i/v_i + 1)^{q−1} − (−x̃_i/v_i)^{q−1} > ( (2 − c)^{q−1} − 1 ) / (1 − c)^{q−1}.
Thus, we have vol(T \ S_c) = O( φ(T) / γ ) vol(T). □

LEMMA 20
Under the same assumptions as Lemma 19, among the sweep cut sets S_c ∈ {S_c | 1/8 ≤ c ≤ 1/4}, there exists one R such that
φ(R) = O( φ(T)^{1/q} / Gap^{(q−1)/2} ).

Proof. Our proof is mostly a generalization of the proof of Lemma 4.1 in [Zhu et al., 2013]. If cut(S_c, S̄_c) ≥ E holds for all 1/8 ≤ c ≤ 1/4, then we just need to upper bound E.

We introduce values k(i,j) that allow us to break ℓ′(x_i − x_j) into ℓ′(x_i) − k(i,j) ℓ′(x_j); the specific choice k(i,j) > 0 depends on x_i and x_j. For any node i ∈ S_c, by the KKT condition,
0 = (1/γ) Σ_{j∼i} w(i,j) ℓ′(x_i − x_j) + d_i ℓ′(x_i − (e_S)_i)
 = (1/γ) Σ_{j∼i} ( w(i,j) ℓ′(x_i) − w(i,j) k(i,j) ℓ′(x_j) ) + d_i ℓ′(x_i) − k_i d_i (e_S)_i.
Define K to be the matrix induced by k(i,j). Rearranging the equation above yields
(K ∘ A x^{q−1})_i = (1 + γ) d_i x_i^{q−1} − γ k_i d_i (e_S)_i.
Also, for two adjacent nodes i, j that are both in S_c, we have
k(i,j) ℓ′(x_j) + k(j,i) ℓ′(x_i) = ℓ′(x_i) + ℓ′(x_j),
because ℓ′(x_i − x_j) + ℓ′(x_j − x_i) = 0. And for two adjacent nodes i, j such that i ∈ S_c and j ∉ S_c, x_i > x_j implies k(i,j) < 1. Define a Lovasz-Simonovits curve y over the values d_i x_i^{q−1}; then we have
Σ_{i∈S_c} (K ∘ A x^{q−1})_i + Σ_{i∈S_c} d_i x_i^{q−1}
 = 2 Σ_{i∈S_c} Σ_{j∼i, j∈S_c} w(i,j) x_j^{q−1} + Σ_{i∈S_c} Σ_{j∼i, j∉S_c} k(i,j) w(i,j) x_j^{q−1}
 < 2 Σ_{i∈S_c} Σ_{j∼i, j∈S_c} w(i,j) x_j^{q−1} + Σ_{i∈S_c} Σ_{j∼i, j∉S_c} w(i,j) x_j^{q−1}
 ≤ y[vol(S_c) − cut(S_c, S̄_c)] + y[vol(S_c) + cut(S_c, S̄_c)]
 ≤ y[vol(S_c) − E] + y[vol(S_c) + E],
where the second inequality is due to the definition of the Lovasz-Simonovits curve and the third inequality is due to the concavity of y(x). This means
y[vol(S_c) − E] + y[vol(S_c) + E] ≥ Σ_{i∈S_c} (K ∘ A x^{q−1})_i + Σ_{i∈S_c} d_i x_i^{q−1}
 ≥ (2 + γ) Σ_{i∈S_c} d_i x_i^{q−1} − γ Σ_{i∈S_c} k_i d_i (e_S)_i
 ≥ (2 + γ) Σ_{i∈S_c} d_i x_i^{q−1} − γ Σ_{i∈S} k_i d_i
 = (2 + γ) Σ_{i∈S_c} d_i x_i^{q−1} − γ Σ_{i∈V} d_i x_i^{q−1}
 = 2 Σ_{i∈S_c} d_i x_i^{q−1} − γ Σ_{i∉S_c} d_i x_i^{q−1}
 ≥ 2 y[vol(S_c)] − O( φ(T) vol(S) ).
Thus,
y[vol(S_c)] − y[vol(S_c) − E] ≤ y[vol(S_c) + E] − y[vol(S_c)] + O( φ(T) vol(S) ).
Chaining this between the sweep sets at c = 1/4 and c = 1/8 gives
0.5 E vol(S) / (4^{q−1} vol(T)) ≤ y[vol(S_{1/4})] − y[vol(S_{1/4}) − E]
 ≤ (vol(S_{1/8} \ S_{1/4}) / E) O( φ(T) vol(S) ) + y[vol(S_{1/8})] − y[vol(S_{1/8}) − E]
 ≤ ((vol(S_{1/8} \ T) + vol(T \ S_{1/4})) / E) O( φ(T) vol(S) ) + 0.5 E vol(S) / (8^{q−1} vol(T)).
Hence,
E ≤ O( φ(T) / √γ ) vol(T).
And from Lemma 19, we know vol(S_c) = (1 ± O(φ(T)/γ)) vol(T); since we choose γ = (γ₀)^{q−1} and γ₀ = Θ(φ(T) · Gap), we have vol(S_c) = Θ(vol(T)). So there exists R such that
φ(R) = O( φ(T)/√γ ) = O( φ(T)^{(3−q)/2} / Gap^{(q−1)/2} ) ≤ O( φ(T)^{1/q} / Gap^{(q−1)/2} ).
Here the last inequality uses the fact that (3 − q)/2 ≥ 1/q when 1 < q < 2. □

By combining all these lemmas, we get the following theorem.
THEOREM 21
Assume the subgraph induced by the target cluster T has diameter O(log(|T|)) and that when we uniformly randomly sample points from T as seed sets, the expected largest distance of any node in S̄ to S is O(log(|T|)/|S|). Assume
vol_T(S)/vol_T(T) ≤ 2 [ (γ₀/(1 + γ₀)) (l^{1/(q−1)} + 1)^{−log(|T|)/|S|} ]^{q−1},
where l ≤ (1 + γ) max(d̃_i); then we can set γ = γ₀^{q−1} to satisfy Assumption 17 for 1 < q < 2. Then a sweep cut over x will find a cluster R where φ(R) = O( φ(T)^{1/q} / Gap^{(q−1)/2} ).

We perform three experiments that are designed to compare our method to others designed for similar problems. We call ours SLQ (strongly local q-norm) for ℓ(x) = (1/q)|x|^q with parameters γ for localization and κ for the sparsity. We call it SLQδ with the q-Huber loss. Existing solvers are (i) ACL [Andersen et al., 2006], which computes a personalized PageRank vector approximately, adapted with the same parameters [Gleich and Mahoney, 2014]; (ii) CRD [Wang et al., 2017], which is a hybrid of flow and spectral ideas; (iii) FS, which is FlowSeed [Veldt et al., 2019a], a 1-norm based method; (iv) HK, the push-based heat kernel [Kloster and Gleich, 2014]; (v) NLD, a recent nonlinear diffusion [Ibrahim and Gleich, 2019]; (vi) GCN, a graph convolutional network [Kipf and Welling, 2016]. Parameters are chosen based on defaults or with slight variations designed to enhance the performance within a reasonable running time. We provide a full Julia implementation of SLQ in Section 5.5. We evaluate the routines in terms of their recovery performance for planted sets and clusters. The bands in the figures reflect randomizing the seed choices in the target cluster.

The first experiment uses the LFR benchmark [Lancichinetti et al., 2008]. We vary the mixing parameter µ (where larger µ is more difficult) and provide 1% of a cluster as a seed, then we check how much of the cluster we recover after a conductance-based sweep cut over the solutions from various methods. Here, we use the F1 score.
FIGURE 4 – Methods: SLQ (q=1.2), SLQ (q=1.4), SLQ (q=1.6), CRD (h=3), CRD (h=5), ACL, heat kernel. The left figure shows the median running time for the methods as we scale the graph size, keeping the cluster sizes roughly the same. As we vary cluster mixing µ for a graph with 10,000 nodes, the middle figure shows the median F1 score (higher is better) along with the 20–80% quantiles; the right figure shows the conductance values (lower is better). These results show SLQ is better than ACL and competitive with CRD while running much faster.
Reproduction details.
When creating the LFR graphs, we set the power law exponent for the degree distribution to be 2, the power law exponent for the community size distribution to be 2, the desired average degree to be 10, the maximum degree to be 50, the minimum community size to be 200, and the maximum community size to be 500. We create 40 random graphs for each µ. For SLQ, we set δ = 0, γ = 0.…, ρ = 0.…, ε = 10^{−…}. For ACL, we set γ = 0.1. For both SLQ and ACL, κ is automatically chosen from 0.005 and 0.002 based on which gives a cluster with smaller conductance. For HK, we use four different pairs of (ε, t), including (0.…, 40). For CRD, the key parameter is h, which is the maximum flow that each edge can handle; we provide results using h = 3 and h = 5. For methods that use multiple choices of parameters, we report the total running time.

The second experiment uses the class-year metadata on Facebook [Traud et al., 2012], which is known to have good conductance structure for at least class year 2009 [Veldt et al., 2019b] that should be identifiable with many methods. Other class years are harder to detect with conductance. Here, we use F1 scores again.

Reproduction details.
In this experiment, for SLQ, we set q = 1.…, γ = 0.…, κ = 0.…, ε = 10^{−…}, ρ = 0.…, δ = 0. For SLQδ, the parameters are the same as SLQ except we set δ = 10^{−…}. For ACL, we set γ = 0.05 and κ = 0.…5. For NLD, we set the power to be 1.5, the step size to be 0.002, and the number of iterations to be 5000. For GCN, we use 5 hidden layers and negative log likelihood loss. We set the dropout ratio to be 0.5, the learning rate to be 0.01, the weight decay to be 0.0005, and the number of iterations to be 200. The feature vector is the 6 different metadata fields as described in [Traud et al., 2012]. For each true set, we randomly choose 1% of the true set as seeds, 50 times.
The final experiment evaluates a finding from [Kloumann and Kleinberg, 2014] on the recall of seed-based community detection methods. For a group of communities with roughly the same size, we evaluate the recall of the largest k entries in a diffusion vector. They found PageRank (ACL) outperformed many different methods. In Figure 6, we see the same general result and found that SLQ with q > 2 does even better.

TABLE 1 – Cluster recovery results from a set of 7 Facebook networks [Traud et al., 2012]. Students with a specific graduation class year are used as the target cluster. We use a random set of 1% of the nodes identified with that class year as seeds. The class year 2009 is the set of incoming students, which form better conductance groups because the students had not yet mixed with the other classes. Class year 2008 is already mixed and so the methods do not do as well there. The values are median F1 scores.

Method          SLQ   SLQδ  CRD-3  CRD-5  ACL  FS    HK   NLD    GCN
Time (seconds)  123   80    3049   9378   12   1593  106  10375  16534
TABLE 2 – Total running time of methods in this experiment.
Finally, we describe an experiment where we study how the performance of different methods changes as we vary the size of the seed set. The dataset we use is the same MIT Facebook dataset and the target cluster is class year 2008. This choice is one where most of the methods in Table 1 did poorly, but ACL did better in some trials. We repeat 50 times for each seed size level. From the previous experiments, we can see that none of the methods works well at finding this cluster. In this experiment, we only report results from SLQ, ACL, FS, CRD-3 and HK, as they are all strongly local methods and they perform better than global methods, as we have seen from previous experiments. Also, we did not add CRD-5 because CRD-3 performed better than CRD-5 on this particular cluster, as shown in Table 1. The result of this experiment is in Figure 5. When the seed size is smaller than 15 nodes, the F1 score of all methods improves as we increase the seed size. After 15 nodes, only the F1 scores of SLQ and ACL continue to improve as the seed size becomes larger, while the performance of the other methods stays the same or even becomes slightly worse.
FIGURE 5 – This figure shows the performance change (F1 score) of different methods (SLQ, ACL, CRD-3, FS, HK) as we vary the size of the seed set. The dataset is MIT Facebook with the true cluster being class year 2008. The envelope represents the 20%–80% quantile.
Reproduction details.
For HK and CRD-3, we use the same parameters as the previous Facebook experiment. For ACL and SLQ, we use a coarse binary search (initial region between 0.001 and 0.1, smallest feasible region 0.001) to find a good sparsity level such that the total number of nonzero entries is 20% of the total number of nodes. The other parameters are the same as the previous Facebook experiment. We also use a similar coarse binary search (initial region between 0.4 and 5.0, smallest feasible region 0.1) to choose ε for FS. We did not implement this procedure for CRD and HK because CRD does not have a standalone parameter to control the sparsity of the solution, and HK has already been set up to choose the best cluster from a list of parameters. One thing we would like to mention is that in Table 1 we use 1% of the nodes of the true cluster as seeds, which is roughly 32 nodes in this case. So we can see that the performance of both ACL and SLQ improves with this extra layer of binary search (i.e., the median F1 score increases to 0.6), while the performance of FS remains the same.
FIGURE 6 – A replication of an experiment from [Kloumann and Kleinberg, 2014] with SLQ on (a) DBLP [Backstrom et al., 2006; Yang and Leskovec, 2012] (with 1M edges) and (b) LiveJournal [Mislove et al., 2007] (with 65M edges). Each panel compares SLQ and SLQ-DN (q = 1.5, 4.0, 8.0) with ACL and ACL-DN. The plot shows median recall over 600 groups of roughly the same size as we look at the top k entries in the solution vector (x axis). The envelope represents 2 standard errors. This shows SLQ with q > 2 can improve on ACL.

Our full implementation is available in the
SLQ.jl package on github: github.com/MengLiuPurdue/SLQ, and the experiment codes are available too. We verified that this Julia implementation of ACL is as efficient as ACL implemented in C++, so there is no appreciable overhead to using Julia compared with C or C++ for this computation.

First, we want to mention that in our experiments, we find that we can speed up SLQ by using a slightly modified binary search procedure. The logic is that when q is close to 1 and vol(S) is small, the Δx_i after each step of the "push" procedure is also small, so it does not make sense to set the initial range of the binary search to be [0, 1]. Instead we use [2^{−k} t, 2^{k} t], where t is chosen from either the last Δx_i or (vol(S)/vol(V))^{1/(q−1)}. (Note this is just the lower bound of x_i when γ → 0.) We find a suitable k by checking k = 1, 2, ... until the residual becomes negative. This strategy is implemented in our code.

The most strongly related work was posted to arXiv [Yang et al., 2020] contemporaneously as we were finalizing our results. This research applies a p-norm function to the flow dual of the mincut problem with a similar motivation. This bears a resemblance to our procedures, but does differ in that we include the localizing set S in our nonlinear penalty. Also, our solver uses the cut values instead of the flow dual on the edges, and we include details that enable Huber and Berhu functions for faster computation. In the future, we plan to compare the approaches more concretely.

There also remain ample opportunities to further optimize our procedures. As we were developing these ideas, we drew inspiration from algorithms for p-norm regression [Adil et al., 2019]. Also, there are faster converging (in theory) solvers using different optimization procedures [Fountoulakis et al., 2017] for 2-norm problems, as well as parallelization strategies [Shun et al., 2016].

Our work further contributes to the ongoing research into p-Laplacians [Amghibech, 2003; Bühler and Hein, 2009; Alamgir and Luxburg, 2011; Brindle and Zhu, 2013; Li and Milenkovic, 2018] by giving a related problem that can be solved in a strongly local fashion. We note that our ideas can be easily adapted to the growing space of hypergraph and higher-order graph analysis literature [Benson et al., 2016; Yin et al., 2017; Li and Milenkovic, 2018], where the strategy is to derive a useful hypergraph from graph data to support deeper analysis. We are also excited by the opportunities to combine with generalized Laplacian perspectives on diffusions [Ghosh et al., 2014]. Moreover, our work contributes to the general idea of using simple nonlinearities on existing successful methods. A recent report shows that a simple nonlinearity on a Laplacian pseudoinverse is competitive with complex embedding procedures [Chanpuriya and Musco, 2020].

Finally, we note that there are more general constructions possible. For instance, differential penalties for S and S̄ in the localized cut graph can be used for a variety of effects [Orecchia and Zhu, 2014; Veldt et al., 2019b]. For 1-norm objectives, optimal parameters for γ and κ can also be chosen to model desirable clusters [Veldt et al., 2019b] – similar ideas may be possible for these p-norm generalizations. We view the structured flexibility of these ideas as a key advantage because ideas are easy to compose.
The most strongly related work was posted to arXiv [Yang et al., 2020] contemporaneously as we were finalizing our results. That research applies a p-norm function to the flow dual of the mincut problem with a similar motivation. It bears a resemblance to our procedures, but differs in that we include the localizing set S in our nonlinear penalty. Also, our solver uses the cut values instead of the flow dual on the edges, and we include details that enable Huber and Berhu functions for faster computation. In the future, we plan to compare the approaches more concretely.

There also remain ample opportunities to further optimize our procedures. As we were developing these ideas, we drew inspiration from algorithms for p-norm regression [Adil et al., 2019]. There are also solvers for the 2-norm problem that converge faster in theory using different optimization procedures [Fountoulakis et al., 2017], as well as parallelization strategies [Shun et al., 2016].

Our work further contributes to ongoing research on p-Laplacians [Amghibech, 2003; Bühler and Hein, 2009; Alamgir and Luxburg, 2011; Brindle and Zhu, 2013; Li and Milenkovic, 2018] by giving a related problem that can be solved in a strongly local fashion. We note that our ideas can be easily adapted to the growing space of hypergraph and higher-order graph analysis [Benson et al., 2016; Yin et al., 2017; Li and Milenkovic, 2018], where the strategy is to derive a useful hypergraph from graph data to support deeper analysis. We are also excited by the opportunities to combine our methods with generalized Laplacian perspectives on diffusions [Ghosh et al., 2014]. Moreover, our work contributes to the general idea of applying simple nonlinearities to existing successful methods; a recent report shows that a simple nonlinearity on a Laplacian pseudoinverse is competitive with complex embedding procedures [Chanpuriya and Musco, 2020].

Finally, we note that more general constructions are possible. For instance, differential penalties for S and its complement in the localized cut graph can be used for a variety of effects [Orecchia and Zhu, 2014; Veldt et al., 2019b]. For 1-norm objectives, optimal parameters for γ and κ can also be chosen to model desirable clusters [Veldt et al., 2019b]; similar ideas may be possible for these p-norm generalizations. We view the structured flexibility of these ideas as a key advantage because the ideas are easy to compose. This composability, for example, contributed to using personalized PageRank to make graph convolutional networks faster [Klicpera et al., 2019].

In conclusion, given the strong similarities to the popular ACL, and the improved performance in practice, we are excited about the possibilities for localized p-norm-cuts in graph-based learning.

REFERENCES
[Adil et al., 2019] D. Adil, R. Kyng, R. Peng, and S. Sachdeva. Iterative refinement for ℓp-norm regression. In Proceedings of the Thirtieth Annual ACM-SIAM Symposium on Discrete Algorithms, pp. 1405–1424. 2019. Cited on page 20.
[Alamgir and Luxburg, 2011] M. Alamgir and U. V. Luxburg. Phase transition in the family of p-resistances. In Advances in Neural Information Processing Systems 24, pp. 379–387. Curran Associates, Inc., 2011. Cited on pages 3 and 20.
[Amghibech, 2003] S. Amghibech. Eigenvalues of the discrete p-laplacian for graphs. Ars Comb., 67, 2003. Cited on pages 3 and 20.
[Andersen et al., 2006] R. Andersen, F. Chung, and K. Lang. Local graph partitioning using pagerank vectors. In 47th Annual IEEE Symposium on Foundations of Computer Science (FOCS 2006), pp. 475–486. 2006. Cited on pages 3, 4, 8, 11, and 17.
[Andersen and Lang, 2008] R. Andersen and K. J. Lang. An algorithm for improving graph partitions. In Proceedings of the nineteenth annual ACM-SIAM symposium on Discrete algorithms, pp. 651–660. 2008. Cited on pages 1 and 4.
[Backstrom et al., 2006] L. Backstrom, D. Huttenlocher, J. Kleinberg, and X. Lan. Group formation in large social networks: membership, growth, and evolution. In Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 44–54. 2006. doi:10.1145/1150402.1150412. Cited on page 20.
[Benson et al., 2016] A. Benson, D. F. Gleich, and J. Leskovec. Higher-order organization of complex networks. Science, 353 (6295), pp. 163–166, 2016. doi:10.1126/science.aad9029. Cited on page 20.
[Blum and Chawla, 2001] A. Blum and S. Chawla. Learning from labeled and unlabeled data using graph mincuts. In Proceedings of the Eighteenth International Conference on Machine Learning, pp. 19–26. 2001. Cited on pages 1 and 2.
[Brindle and Zhu, 2013] N. Brindle and X. Zhu. p-voltages: Laplacian regularization for semi-supervised learning on high-dimensional data. Workshop on Mining and Learning with Graphs (MLG2013), 2013. Cited on pages 3 and 20.
[Bühler and Hein, 2009] T. Bühler and M. Hein. Spectral clustering based on the graph p-laplacian. In Proceedings of the 26th Annual International Conference on Machine Learning, pp. 81–88. 2009. Cited on pages 3 and 20.
[Chanpuriya and Musco, 2020] S. Chanpuriya and C. Musco. InfiniteWalk: Deep network embeddings as laplacian embeddings with a nonlinearity. arXiv:2006.00094, 2020. Cited on page 21.
[Chung, 2007] F. Chung. The heat kernel as the PageRank of a graph. Proceedings of the National Academy of Sciences, 104 (50), pp. 19735–19740, 2007. doi:10.1073/pnas.0708838104. Cited on page 5.
[Chung, 1992] F. R. L. Chung. Spectral Graph Theory. American Mathematical Society, 1992. Cited on pages 3 and 11.
[Fountoulakis et al., 2020] K. Fountoulakis, M. Liu, D. F. Gleich, and M. W. Mahoney. Flow-based algorithms for improving clusters: A unifying framework, software, and performance. arXiv, cs.LG, 2004.09608, 2020. Cited on pages 2 and 4.
[Fountoulakis et al., 2017] K. Fountoulakis, F. Roosta-Khorasani, J. Shun, X. Cheng, and M. W. Mahoney. Variational perspective on local graph clustering. Mathematical Programming, 2017. doi:10.1007/s10107-017-1214-8. Cited on page 20.
[Ghosh et al., 2014] R. Ghosh, S.-h. Teng, K. Lerman, and X. Yan. The interplay between dynamics and networks: centrality, communities, and cheeger inequality. In Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 1406–1415. 2014. Cited on pages 2 and 21.
[Gleich and Mahoney, 2014] D. Gleich and M. Mahoney. Anti-differentiating approximation algorithms: A case study with min-cuts, spectral, and flow. In International Conference on Machine Learning, pp. 1018–1025. 2014. Cited on pages 2, 4, 13, and 17.
[Gleich, 2015] D. F. Gleich. PageRank beyond the web. SIAM Review, 57 (3), pp. 321–363, 2015. doi:10.1137/140976649. Cited on pages 2 and 4.
[Gleich and Mahoney, 2015] D. F. Gleich and M. W. Mahoney. Using local spectral methods to robustify graph-based learning algorithms. In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 359–368. 2015. doi:10.1145/2783258.2783376. Cited on pages 2 and 4.
[Grover and Leskovec, 2016] A. Grover and J. Leskovec. Node2vec: Scalable feature learning for networks. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 855–864. 2016. doi:10.1145/2939672.2939754. Cited on page 1.
[Hallac et al., 2017] D. Hallac, C. Wong, S. Diamond, A. Sharang, R. Sosic, S. Boyd, and J. Leskovec. SnapVX: A network-based convex optimization solver. The Journal of Machine Learning Research, 18 (1), pp. 110–114, 2017. Cited on page 5.
[Hansen and Mahoney, 2012] T. J. Hansen and M. W. Mahoney. Semi-supervised eigenvectors for locally-biased learning. In Advances in Neural Information Processing Systems 25, pp. 2528–2536. 2012. Cited on page 1.
[Ibrahim and Gleich, 2019] R. Ibrahim and D. F. Gleich. Nonlinear diffusion for community detection and semi-supervised learning. In The World Wide Web Conference, pp. 739–750. 2019. doi:10.1145/3308558.3313483. Cited on pages 3, 5, and 17.
[Jeub et al., 2015] L. G. S. Jeub, P. Balachandran, M. A. Porter, P. J. Mucha, and M. W. Mahoney. Think locally, act locally: Detection of small, medium-sized, and large communities in large networks. Phys. Rev. E, 91, p. 012821, 2015. doi:10.1103/PhysRevE.91.012821. Cited on page 8.
[Joachims, 2003] T. Joachims. Transductive learning via spectral graph partitioning. In ICML, pp. 290–297. 2003. Cited on pages 1 and 2.
[Kipf and Welling, 2016] T. N. Kipf and M. Welling. Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907, 2016. Cited on page 17.
[Klicpera et al., 2019] J. Klicpera, A. Bojchevski, and S. Günnemann. Predict then propagate: Graph neural networks meet personalized pagerank. In International Conference on Learning Representations (ICLR). 2019. Cited on pages 1, 3, and 21.
[Kloster and Gleich, 2014] K. Kloster and D. F. Gleich. Heat kernel based community detection. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1386–1395. 2014. doi:10.1145/2623330.2623706. Cited on pages 5, 8, and 17.
[Kloumann and Kleinberg, 2014] I. M. Kloumann and J. M. Kleinberg. Community membership identification from small seed sets. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1366–1375. 2014. doi:10.1145/2623330.2623621. Cited on pages 18 and 20.
[Kloumann et al., 2016] I. M. Kloumann, J. Ugander, and J. Kleinberg. Block models and personalized PageRank. Proceedings of the National Academy of Sciences, 114 (1), pp. 33–38, 2016. doi:10.1073/pnas.1611275114. Cited on page 2.
[Koutra et al., 2011] D. Koutra, T.-Y. Ke, U. Kang, D. H. Chau, H.-K. K. Pao, and C. Faloutsos. Unifying guilt-by-association approaches: Theorems and fast algorithms. In ECML/PKDD, pp. 245–260. 2011. doi:10.1007/978-3-642-23783-6_16. Cited on page 2.
[Lancichinetti et al., 2008] A. Lancichinetti, S. Fortunato, and F. Radicchi. Benchmark graphs for testing community detection algorithms. Phys. Rev. E, 78, p. 046110, 2008. doi:10.1103/PhysRevE.78.046110. Cited on page 17.
[Lang and Rao, 2004] K. Lang and S. Rao. A flow-based method for improving the expansion or conductance of graph cuts. In IPCO 2004: Integer Programming and Combinatorial Optimization, pp. 325–337. 2004. Cited on page 1.
[Leskovec et al., 2009] J. Leskovec, K. J. Lang, A. Dasgupta, and M. W. Mahoney. Community structure in large networks: Natural cluster sizes and the absence of large well-defined clusters. Internet Mathematics, 6 (1), pp. 29–123, 2009. doi:10.1080/15427951.2009.10129177. Cited on pages 8 and 11.
[Li and Milenkovic, 2018] P. Li and O. Milenkovic. Submodular hypergraphs: p-laplacians, Cheeger inequalities and spectral clustering. In Proceedings of the 35th International Conference on Machine Learning, pp. 3014–3023. 2018. Cited on page 20.
[Li et al., 2019] Q. Li, X.-M. Wu, H. Liu, X. Zhang, and Z. Guan. Label efficient semi-supervised learning via graph filtering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 9582–9591. 2019. Cited on page 1.
[Lisewski and Lichtarge, 2010] A. M. Lisewski and O. Lichtarge. Untangling complex networks: Risk minimization in financial markets through accessible spin glass ground states. Physica A: Statistical Mechanics and its Applications, 389 (16), pp. 3250–3253, 2010. doi:10.1016/j.physa.2010.04.005. Cited on page 2.
[Mahoney et al., 2012] M. W. Mahoney, L. Orecchia, and N. K. Vishnoi. A local spectral method for graphs: With applications to improving graph partitions and exploring data graphs locally. Journal of Machine Learning Research, 13, pp. 2339–2365, 2012. Cited on page 11.
[Mihail, 1989] M. Mihail. Conductance and convergence of markov chains: a combinatorial treatment of expanders. In 30th Annual Symposium on Foundations of Computer Science, pp. 526–531. 1989. doi:10.1109/SFCS.1989.63529. Cited on page 11.
[Mislove et al., 2007] A. Mislove, M. Marcon, K. P. Gummadi, P. Druschel, and B. Bhattacharjee. Measurement and analysis of online social networks. In Proceedings of the 7th ACM SIGCOMM Conference on Internet Measurement, pp. 29–42. 2007. doi:10.1145/1298306.1298311. Cited on page 20.
[Orecchia and Mahoney, 2011] L. Orecchia and M. W. Mahoney. Implementing regularization implicitly via approximate eigenvector computation. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pp. 121–128. 2011. Cited on page 2.
[Orecchia and Zhu, 2014] L. Orecchia and Z. A. Zhu. Flow-based algorithms for local graph clustering. In Proceedings of the twenty-fifth annual ACM-SIAM symposium on Discrete algorithms, pp. 1267–1286. 2014. Cited on pages 11 and 21.
[Owen, 2007] A. B. Owen. A robust hybrid of lasso and ridge regression. Contemporary Mathematics, 443 (7), pp. 59–72, 2007. Cited on page 6.
[Pan et al., 2004] J.-Y. Pan, H.-J. Yang, C. Faloutsos, and P. Duygulu. Automatic multimedia cross-modal correlation discovery. In KDD '04: Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 653–658. 2004. doi:10.1145/1014052.1014135. Cited on page 2.
[Peel, 2017] L. Peel. Graph-based semi-supervised learning for relational networks. In Proceedings of the 2017 SIAM International Conference on Data Mining, pp. 435–443. 2017. doi:10.1137/1.9781611974973.49. Cited on page 2.
[Perozzi et al., 2014] B. Perozzi, R. Al-Rfou, and S. Skiena. DeepWalk: Online learning of social representations. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 701–710. 2014. doi:10.1145/2623330.2623732. Cited on page 1.
[Shi and Malik, 2000] J. Shi and J. Malik. Normalized cuts and image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22 (8), pp. 888–905, 2000. doi:10.1109/34.868688. Cited on page 2.
[Shun et al., 2016] J. Shun, F. Roosta-Khorasani, K. Fountoulakis, and M. W. Mahoney. Parallel local graph clustering. Proceedings of the VLDB Endowment, 9 (12), pp. 1041–1052, 2016. Cited on page 20.
[Traud et al., 2012] A. L. Traud, P. J. Mucha, and M. A. Porter. Social structure of facebook networks. Physica A: Statistical Mechanics and its Applications, 391 (16), pp. 4165–4180, 2012. doi:10.1016/j.physa.2011.12.021. Cited on pages 18 and 19.
[Veldt et al., 2016] L. N. Veldt, D. F. Gleich, and M. W. Mahoney. A simple and strongly-local flow-based method for cut improvement. In International Conference on Machine Learning, pp. 1938–1947. 2016. Cited on pages 1, 2, and 8.
[Veldt et al., 2019a] N. Veldt, C. Klymko, and D. F. Gleich. Flow-based local graph clustering with better seed set inclusion. In Proceedings of the SIAM International Conference on Data Mining, pp. 378–386. 2019a. doi:10.1137/1.9781611975673.43. Cited on pages 2 and 17.
[Veldt et al., 2019b] N. Veldt, A. Wirth, and D. F. Gleich. Learning resolution parameters for graph clustering. In The World Wide Web Conference, pp. 1909–1919. 2019b. doi:10.1145/3308558.3313471. Cited on pages 18 and 21.
[Wang et al., 2017] D. Wang, K. Fountoulakis, M. Henzinger, M. W. Mahoney, and S. Rao. Capacity releasing diffusion for speed and locality. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, pp. 3598–3607. 2017. Cited on pages 3, 5, and 17.
[Yadati et al., 2019] N. Yadati, M. R. Nimishakavi, P. Yadav, V. Nitin, A. Louis, and P. Talukdar. HyperGCN: A new method for training graph convolutional networks on hypergraphs. In NeurIPS. 2019. Cited on page 1.
[Yang and Leskovec, 2012] J. Yang and J. Leskovec. Defining and evaluating network communities based on ground-truth. In 2012 IEEE 12th International Conference on Data Mining (ICDM), pp. 745–754. 2012. doi:10.1109/ICDM.2012.138. Cited on page 20.
[Yang et al., 2020] S. Yang, D. Wang, and K. Fountoulakis. p-norm flow diffusion for local graph clustering. arXiv, cs.LG, 2005.09810, 2020. Cited on pages 3 and 20.
[Yin et al., 2017] H. Yin, A. R. Benson, J. Leskovec, and D. F. Gleich. Local higher-order graph clustering. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 555–564. 2017. doi:10.1145/3097983.3098069. Cited on pages 8 and 20.
[Zhou et al., 2003] D. Zhou, O. Bousquet, T. N. Lal, J. Weston, and B. Schölkopf. Learning with local and global consistency. In NIPS. 2003. Cited on pages 1, 2, and 4.
[Zhu et al., 2003] X. Zhu, Z. Ghahramani, and J. Lafferty. Semi-supervised learning using gaussian fields and harmonic functions. In ICML, pp. 912–919. 2003. Cited on pages 1 and 2.
[Zhu et al., 2013] Z. A. Zhu, S. Lattanzi, and V. S. Mirrokni. A local algorithm for finding well-connected clusters. In ICML (3), pp. 396–404. 2013. Cited on pages 3, 11, 12, 13, 14, 15, and 17.