Strongly local p-norm-cut algorithms for semi-supervised learning and local graph clustering
Acknowledgements. This research was supported in part by NSF awards IIS-1546488, CCF-1909528, and the NSF Center for Science of Information STC, CCF-0939370, as well as DOE DE-SC0014543, NASA, and the Sloan Foundation.
Meng Liu and David F. Gleich, Purdue University
June 16, 2020
Graph-based semi-supervised learning is the problem of learning a labeling function for the graph nodes given a few example nodes, often called seeds, usually under the assumption that the graph's edges indicate similarity of labels. This is closely related to the local graph clustering or community detection problem of finding a cluster or community of nodes around a given seed. For this problem, we propose a novel generalization of random walk, diffusion, or smooth function methods in the literature to a convex p-norm cut function. The need for our p-norm methods is that, in our study of existing methods, we find those principled methods based on eigenvectors, spectral techniques, random walks, or linear systems often have difficulty capturing the correct boundary of a target label or target cluster. In contrast, 1-norm or maxflow-mincut based methods capture the boundary, but cannot grow from a small seed set; hybrid procedures that use both have many hard-to-set parameters. In this paper, we propose a generalization of the objective function behind these methods involving p-norms. To solve the p-norm cut problem we give a strongly local algorithm – one whose runtime depends on the size of the output rather than the size of the graph. Our method can be thought of as a nonlinear generalization of the Andersen-Chung-Lang push procedure to approximate a personalized PageRank vector efficiently. Our procedure is general and can solve other types of nonlinear objective functions, such as p-norm variants of Huber losses. We provide a theoretical analysis of finding planted target clusters with our method and show that the p-norm cut functions improve on the standard Cheeger inequalities for random walk and spectral methods. Finally, we demonstrate the speed and accuracy of our new method on synthetic and real world datasets. Our code is available at github.com/MengLiuPurdue/SLQ.

Many datasets important to machine learning either start as a graph or have a simple translation into graph data. For instance, relational network data naturally starts as a graph. Arbitrary data vectors become graphs via nearest-neighbor constructions, among other choices. Consequently, understanding graph-based learning algorithms – those that learn from graphs – is a recurring problem. This field has a rich history with methods based on linear systems [Zhou et al., 2003; Zhu et al., 2003], eigenvectors [Joachims, 2003; Hansen and Mahoney, 2012], graph cuts [Blum and Chawla, 2001], and network flows [Lang and Rao, 2004; Andersen and Lang, 2008; Veldt et al., 2016], although recent work in graph-based learning has often focused on embeddings [Perozzi et al., 2014; Grover and Leskovec, 2016] and graph neural networks [Yadati et al., 2019; Klicpera et al., 2019; Li et al., 2019]. Our research seeks to understand the possibilities enabled by a certain p-norm generalization of the standard techniques.

Perhaps the prototypical graph-based learning problems are semi-supervised learning and local clustering.

FIGURE 1 – A simple illustration of the benefits of our p-norm methods. (a) Seed node and target. (b) 2-norm problem. (c) 1.1-norm problem. In this problem, we generate a graph from an image with weighted neighbors as described in [Shi and Malik, 2000]. We intentionally make this graph consider large regions, so each pixel is connected to all neighbors within 40 pixels away.
The target in this problem is the cluster defined by the interior of the window, and we select a single pixel inside the window as the seed. The three colors (yellow, orange, red) show how the non-zero elements of the solution fill in as we decrease a sparsity penalty in our formulation (yellow is sparsest, red is densest). The 2-norm result exhibits a typical phenomenon of over-expansion, whereas the 1.1-norm result stays within the target region.

Full details. The image is a real-valued grey-scale image between 0 and 1. We use Malik and Shi's procedure [Shi and Malik, 2000] to convert the image into a weighted graph. In the graph, pixels represent nodes and pixels are connected within a 2-squared-norm distance of 40. The weight on an edge is
w(i,j) = exp(−|I(i) − I(j)|²/σ_I − |D(i,j)|²/σ_x) · Ind[D(i,j) ≤ r],
where I(i) is the intensity at pixel i, D(i,j) is the 2-norm distance in pixel locations, and Ind[·] is the indicator function. The values were r = 40, σ_I = 0.…, σ_d = 512/10. We ran our SLQ solver with γ = 0.…, κ = [0.…, 0.…, 0.…], ρ = 0.…, q = 1.… for …,000 steps, even though it had not fully converged. Running it longer (over one billion steps) shows that there are a few exceptionally small entries that bleed out of the target window. (Recall that we show any non-zero entry ever introduced by the algorithms.) These are illustrated in Figure 2.

FIGURE 2 – Running our SLQ solver for an extremely long time will cause a few entries to bleed out of the target window. Compare with Figure 1.

Other graph-based learning problems include role discovery and alignments. Semi-supervised learning involves learning a labeling function for the nodes of a graph based on a few examples, often called seeds. The most interesting scenarios are when most of the graph has unknown labels and there are only a few examples per label. This could be a constant number of examples per label, such as 10 or 50, or a small fraction of the total label size, such as 1%. Local clustering is the problem of finding a cluster or community of nodes around a given set of seeds. This is closely related to semi-supervised learning because that cluster is a natural suggestion for nodes that ought to share the same label, if there is a homophily property for edges in the network. If this homophily is not present, then there are transformations of the graph that can make these methods work better [Peel, 2017].

For both problems, a standard set of techniques is based on random walk diffusions and mincut constructions [Zhou et al., 2003; Zhu et al., 2003; Joachims, 2003; Gleich and Mahoney, 2015; Pan et al., 2004]. These reduce the problem to a linear system, eigenvector, random walk, or mincut-maxflow problem, which can often be further approximated. As a simple example, consider solving a seeded PageRank problem that is seeded on the nodes known to be labeled with a single label. The resulting PageRank vector indicates other nodes likely to share that same label. This propensity of PageRank to propagate labels has been used in many applications and it has many interpretations [Kloumann et al., 2016; Gleich, 2015; Orecchia and Mahoney, 2011; Pan et al., 2004; Lisewski and Lichtarge, 2010; Ghosh et al., 2014], including guilt-by-association [Koutra et al., 2011]. A related class of mincut-maxflow constructions uses similar reasoning [Blum and Chawla, 2001; Veldt et al., 2016, 2019a].

The link between these PageRank methods and the mincut-maxflow computations is that they correspond to 1-norm and 2-norm variations on a general objective function (see [Gleich and Mahoney, 2014] and Equation 1). In this paper, we replace the norm with a general p-norm. (For various reasons, we refer to it as a q-norm in the subsequent technical sections. We use p-norm here as this usage is more common.) The literature on 1- and 2-norms is well established and largely suggests that 1-norm (mincut) objectives are best used for refining large results from other methods – especially because they tend to sharpen boundaries – whereas 2-norm methods are best used for expanding small seed sets [Veldt et al., 2016]. There is a technical reason why mincut-maxflow formulations cannot expand small seed sets, unless they have uncommon properties, discussed in [Fountoulakis et al., 2020, Lemma 7.2]. The downside of 2-norm methods is that they tend to "expand" or "bleed out" over natural boundaries in the data. This is illustrated in Figure 1(b). The hypothesis motivating this work is that techniques that use a p-norm where 1 < p < 2 combine the advantages of both.
We are hardly the first to notice these effects or propose p-norms as a solution. For instance, the p-Laplacian [Amghibech, 2003] and related ideas [Alamgir and Luxburg, 2011] have been widely studied as a way to improve results in spectral clustering [Bühler and Hein, 2009] and semi-supervised learning [Brindle and Zhu, 2013]. This has recently been used to show the power of simple nonlinearities in diffusions for semi-supervised learning as well [Ibrahim and Gleich, 2019]. The major rationale for our paper is that our algorithmic techniques are closely related to those used for 2-norm optimization. It remains the case that spectral (2-norm) approaches are far more widely used in practice, partly because they are simpler to implement and use, whereas the other approaches involve more delicate computations. Our new formulations are amenable to similar computational techniques as used for 2-norm problems, which we hope will enable them to be widely used.

To forward the goal of making these techniques useful, we release all of our experimental code and the tools necessary to easily use the strongly-local p-norm cuts on github: github.com/MengLiuPurdue/SLQ. This includes related codes for similar purposes as well.

The remainder of this paper consists of a full demonstration of the potential of this idea. We first formally state the problem and review technical preliminaries in Section 2. As an optimization problem, the p-norm problem is strongly convex with a unique solution. Next, we provide a strongly local algorithm to approximate the solution (Section 3). A strongly local algorithm is one where the runtime depends on the sparsity of the output rather than the size of the input graph. This enables the methods to run efficiently even on large graphs because, simply put, we are able to bound the maximum output size and runtime independently of the graph size. A hallmark of the existing literature on these methods is a recovery guarantee called a Cheeger inequality. Roughly, this inequality shows that, if the methods are seeded nearby a good cluster, then the methods will return something that is not too far away from that good cluster. This is often quantified in terms of the conductance of the good cluster and the conductance of the returned cluster. There are a variety of tradeoffs possible here [Andersen et al., 2006; Zhu et al., 2013; Wang et al., 2017]. We prove such a relationship for our methods where the quality of the guarantee depends on the exponent 1/p, which reproduces the square-root Cheeger guarantees [Chung, 1992] for p = 2 but gives better results when p < 2. Finally, we empirically demonstrate a number of aspects of our methods in comparison with a number of other techniques in Section 5. The goal is to highlight places where our p-norm objectives differ.

At the end, we have a number of concluding discussions (Section 6), which highlight dimensions where our methods could be improved, as well as related literature. For instance, there are many ways to use personalized PageRank methods with graph convolutional networks and embedding techniques [Klicpera et al., 2019] – we conjecture that our p-norm methods will simply improve on these relationships. Also, and importantly, as we were completing this paper, we became aware of [Yang et al., 2020], which discusses p-norms for flow-based diffusions. Our two papers have many similar findings on the benefit of p-norms, although there are some meaningful differences in the approaches, which we discuss in Section 6. In particular, our algorithm is distinct and follows a simple generalization of the widely used and deployed push method for PageRank. Our hope is that both papers can highlight the benefits of this idea to improve the practice of graph-based learning.

2 GENERALIZED LOCAL GRAPH CUTS
We consider graphs that are undirected, connected, and weighted with positive edge weights lower-bounded by 1. Let G = (V, E, w) be such a graph, where n = |V| and m = |E|. The adjacency matrix A has non-zero entries w(i,j) for each edge (i,j); all other entries are zero. This is symmetric because the graph is undirected. The degree vector d is defined as the row sums of A and D is a diagonal matrix defined as diag(d). The incidence matrix B ∈ {0, −1, 1}^{m×n} measures the differences of adjacent nodes. The kth row of B represents the kth edge and each row has exactly two nonzero elements: 1 for the start node of the kth edge and −1 for the end node of the kth edge. For undirected graphs, either node can be the start node or end node and the order does not matter. We use vol(S) for the sum of weighted degrees of the nodes in S and φ(S) = cut(S)/min(vol(S), vol(S̄)) for conductance. We use i ∼ j to represent that node i and node j are adjacent.

For simplicity, we begin with PageRank, which has been used for all of these tasks in various guises [Zhou et al., 2003; Gleich and Mahoney, 2015; Andersen et al., 2006]. A PageRank vector [Gleich, 2015] is the solution of the linear system (I − αAD⁻¹) x = (1 − α) v, where α is a probability between 0 and 1 and v is a stochastic vector that gives the seed distribution. This can be easily reworked into the equivalent linear system (γD + L) y = γ v, where γ = (1 − α)/α, y = D⁻¹ x, and L is the graph Laplacian L = D − A. The starting point for our methods is a result shown in [Gleich and Mahoney, 2014], where we can further translate this into a 2-norm "cut" computation on a graph called the localized cut graph that is closely related to common constructions in maxflow-mincut computations for cluster improvement [Andersen and Lang, 2008; Fountoulakis et al., 2020].

The localized cut graph is created from the original graph, a set S, and a value γ. The construction adds an extra source node s and an extra sink node t, and edges from s that localize, or bias, a solution within the graph near the set S. Formally, given a graph G = (V, E) with adjacency matrix A, a seed set S ⊂ V, and a non-negative constant γ, the adjacency matrix of the localized cut graph is

A_S = [ 0       γ d_S^T    0
        γ d_S   A          γ d_S̄
        0       γ d_S̄^T    0 ],

where the first row and column correspond to s and the last to t. (A small illustration in the margin shows s attached to the nodes of S and t attached to the nodes of S̄.) Here S̄ is the complement set of S, d_S = D e_S, d_S̄ = D e_S̄, and e_S is an indicator vector for S.

Let B, w be the incidence matrix and weight vector for the localized cut graph. Then PageRank is equivalent to the following problem (see full details in [Gleich and Mahoney, 2014]):

minimize_x  w^T (Bx)² = Σ_{i,j} w_{i,j} (x_i − x_j)² = x^T B^T diag(w) B x
subject to  x_s = 1, x_t = 0,   (1)

where the square in (Bx)² is elementwise. We call this a cut problem because if we replace the squared term with an absolute value (i.e., Σ w_{i,j} |x_i − x_j|), then we have the standard s,t-mincut problem.
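To make the equivalence concrete, here is a small Julia sketch (our own illustration, not the paper's released code) that solves both PageRank formulations above and checks that they agree; it assumes a sparse adjacency matrix A and a stochastic seed vector v.

    using SparseArrays, LinearAlgebra

    # Seeded PageRank two ways: (I − αAD⁻¹)x = (1−α)v and (γD + L)y = γv.
    function seeded_pagerank(A::SparseMatrixCSC, v::Vector, α::Real)
        d = vec(sum(A, dims=2))                       # weighted degrees
        D = sparse(Diagonal(d))
        x = (I - α * A * sparse(Diagonal(1 ./ d))) \ ((1 - α) * v)
        γ = (1 - α) / α                               # the γ used in the cut form
        L = D - A                                     # graph Laplacian
        y = (γ * D + L) \ (γ * v)
        @assert x ≈ D * y                             # consistency: y = D⁻¹x
        return x
    end

Solving the 2-norm cut problem (1) on the localized cut graph with x_s = 1 and x_t = 0 recovers this same vector, which is the sense in which PageRank is a 2-norm cut computation.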
Our paper proceeds from changing this power of 2 into a more general loss function ℓ and also adding a sparsity penalty, which is often needed to produce strongly local solutions [Gleich and Mahoney, 2014]. We define this formally now.

DEFINITION 1 (Generalized local graph cut). Fix a set S of seeds and a value of γ. Let B, w be the incidence matrix and weight vector of the localized cut graph. Then the generalized local graph cut problem is

minimize_x  w^T ℓ(Bx) + κγ d^T x = Σ_{ij} w_{i,j} ℓ(x_i − x_j) + κγ Σ_i x_i d_i
subject to  x_s = 1, x_t = 0, x ≥ 0.   (2)

FIGURE 3 – A comparison of seeded cut-like and clustering objectives on a regular grid-graph with 4 axis-aligned neighbors. Panels: (a) PageRank (α = 0.85); (b) q = 2, γ = κ = 10^{−…}; (c) q = 5, γ = 10^{−…}, κ = 10^{−…}; (d) q = 1.…, γ = κ = 10^{−…}; (e) heat kernel t = 10, ε = 0.003; (f) CRD U = 60, h = 60, w = 5; (g) p = 1.…, h = 0.…, k = 35000; (h) p = 1.…, h = 0.…, k = 7500. The graph is 50-by-50, the seed is in the center. The diffusions localize before the boundary so we only show the relevant region and the quantile contours of the values. We selected the parameters to give similar-sized outputs. (Top row) At left (a), we have seeded PageRank; (b)–(d) show our q-norm objectives; (b) is a 2-norm which closely resembles PageRank; (c) is a 5-norm that has diamond-contours; and (d) is a 1.…-norm. (Bottom row) The alternatives ((g) a p-norm non-linearity in the diffusion or (h) a p-Laplacian) show that similar results are possible with existing methods, although they lack the simplicity of our optimization setup and often lack strongly local algorithms.

Here ℓ(x) is an element-wise function and κ ≥ 0 is a sparsity-promoting term. We compare using power functions ℓ(x) = (1/q)|x|^q to a variety of other techniques for semi-supervised learning and local clustering in Figure 3. If ℓ is convex, then the problem is convex and can be solved via general-purpose solvers such as CVX. An additional convex solver is SnapVX [Hallac et al., 2017], which studied a general combination of convex functions on nodes and edges of a graph, although neither of these approaches scales to the large graphs we study in subsequent portions of this paper (65 million edges). To produce a specialized, strongly local solver, we found it necessary to restrict the class of functions ℓ to have similar properties to the power function ℓ(x) = (1/q)|x|^q and its derivative ℓ′(x).
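Concretely, for the power function these pieces are one-liners; the following is a small sketch (the names are ours, not the released SLQ.jl interface):

    using LinearAlgebra

    # Power loss ℓ(x) = (1/q)|x|^q and its derivative ℓ'(x) = |x|^{q-1} sgn(x).
    ℓ(x, q)  = abs(x)^q / q
    dℓ(x, q) = abs(x)^(q - 1) * sign(x)

    # Objective of (2) for an incidence matrix B, edge weights w, degrees d.
    objective(B, w, d, x, q, κ, γ) = sum(w .* ℓ.(B * x, q)) + κ * γ * dot(d, x)

For q close to 1, the second derivative ℓ″(x) = (q − 1)|x|^{q−2} blows up near 0, which is one numerical motivation for the q-Huber variants introduced below.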
Reproduction notes for Figure 3. We release the exact code to reproduce this figure. For all methods, for all values above a threshold, we compute 4 quantile lines to give roughly equally spaced regions. (a) PageRank is mathematically non-zero at all nodes in a connected graph. Here, we threshold at 10^{−…} to focus on the circular contours. This is reproduced by (b) using q = 2. The "wiggles" around the edge are because we used CVX to solve this problem and there were minor tolerance issues around the edge. We also boosted the threshold to 5·10^{−…} because of the tolerance in CVX. (c) Same as (b). (d) We used our SLQ solver because CVXpy with either the ECOS or SCS solver reported an error while using q = 1.25. We set ρ = 0.99 to get an accurate solution (close to KKT). Here, we used the algorithmic non-zeros as the code introduces elements "sparsely". (e) This used mathematical non-zeros again because the algorithm from [Kloster and Gleich, 2014] uses the same sparse "push" mechanisms as our SLQ algorithm. (f) CRD returns a set, so we simply display that set. The parameters were chosen to make it look as close to a square as possible. (g and h) We used the forward Euler algorithm from [Ibrahim and Gleich, 2019] with non-zero truncation. k is the number of steps and h is the step-size. These were chosen to make the pictures look like diamonds and squares, respectively, to mirror our results. The entry thresholds were also 5 times the minimum element because the vectors are non-zero everywhere.

DEFINITION 2
On the [−1, 1] domain, the loss function ℓ(x) should satisfy: (1) ℓ(x) is convex; (2) ℓ′(x) is an increasing and anti-symmetric function; (3) for Δx > 0, ℓ′(x) should satisfy either of the following conditions with constants k > 0 and c > 0: (3a) ℓ′(x + Δx) ≤ ℓ′(x) + kℓ′(Δx) and ℓ″(x) > c, or (3b) ℓ′(x) is c-Lipschitz continuous and ℓ′(x + Δx) ≥ ℓ′(x) + kℓ′(Δx) when x ≥ 0.

REMARK 3. If ℓ′(x) is Lipschitz continuous with Lipschitz constant L and ℓ″(x) > c, then constraint 3(a) can be satisfied with k = L/c. However, ℓ′(x) can still satisfy 3(a) even if it is not Lipschitz continuous. A simple example is ℓ′(x) = |x|^{0.5} sgn(x), −1 ≤ x ≤ 1. In this case, k = 1, but it is not Lipschitz continuous at x = 0. On the other hand, when ℓ′(x) is Lipschitz continuous, it can satisfy constraint 3(b) even if ℓ″(x) = 0. An example is ℓ′(x) = |x|^{1.5} sgn(x), −1 < x < 1. In this case ℓ″(x) = 0 when x = 0, but ℓ′(x + Δx) ≥ ℓ′(x) + ℓ′(Δx) when x ≥ 0.

LEMMA 4
The power function ℓ(x) = (1/q)|x|^q, −1 < x < 1, satisfies Definition 2 for any q > 1. More specifically, when 1 < q < 2, ℓ(x) satisfies 3(a) with c = q − 1 and k = 2^{2−q}; when q ≥ 2, ℓ(x) satisfies 3(b) with c = q − 1 and k = 1.

Proof. First, we know ℓ′(x) = |x|^{q−1} sgn(x) and ℓ″(x) = (q − 1)|x|^{q−2}, and we define ℓ″(0) = ∞.

For 3(a), since −1 < x < 1 and 1 < q < 2, we have ℓ″(x) > (q − 1). Moreover,
(ℓ′(x + Δx) − ℓ′(x)) / ℓ′(Δx) = |x/Δx + 1|^{q−1} sgn(x/Δx + 1) − |x/Δx|^{q−1} sgn(x/Δx).
Define a new function f(x) = |1 + x|^{q−1} sgn(1 + x) − |x|^{q−1} sgn(x). Then f′(x) = (q − 1)(|1 + x|^{q−2} − |x|^{q−2}), so the maximum of f(x) is achieved at f(−0.5) = 2^{2−q}.

For 3(b), since −1 < x < 1 and q ≥ 2, we have ℓ″(x) ≤ (q − 1), so ℓ′ is (q − 1)-Lipschitz, and for x ≥ 0, (x + Δx)^{q−1} ≥ x^{q−1} + Δx^{q−1} is obvious. □

Note that ℓ(x) = |x| does not satisfy either choice for property (3). Consequently, our theory will not apply to mincut problems. In order to justify the generalized terms, we note that q-norm generalizations of the Huber and Berhu loss functions [Owen, 2007] do satisfy these definitions.

DEFINITION 5
Given 1 < q < 2 and 0 < δ < 1, the "q-Huber" and "Berq" functions are

q-Huber:  ℓ(x) = (δ^{q−2}/2) x²  if |x| ≤ δ,  and  ℓ(x) = (1/q)|x|^q + (1/2 − 1/q) δ^q  otherwise;

Berq:  ℓ(x) = (δ^{2−q}/q) |x|^q  if |x| ≤ δ,  and  ℓ(x) = (1/2) x² + (1/q − 1/2) δ²  otherwise.
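Definition 5 translates directly into code. Here is a sketch of both losses and their derivatives (the function names are ours); the additive constants make ℓ and ℓ′ continuous at |x| = δ:

    # q-Huber: quadratic within [-δ, δ], q-power outside.
    qhuber(x, q, δ)  = abs(x) ≤ δ ? δ^(q-2) * x^2 / 2 :
                       abs(x)^q / q + (1/2 - 1/q) * δ^q
    dqhuber(x, q, δ) = abs(x) ≤ δ ? δ^(q-2) * x : abs(x)^(q-1) * sign(x)

    # Berq: q-power within [-δ, δ], quadratic outside.
    berq(x, q, δ)  = abs(x) ≤ δ ? δ^(2-q) * abs(x)^q / q :
                     x^2 / 2 + (1/q - 1/2) * δ^2
    dberq(x, q, δ) = abs(x) ≤ δ ? δ^(2-q) * abs(x)^(q-1) * sign(x) : x

The q-Huber loss replaces the q-power near zero with a quadratic, so ℓ″ stays bounded there; Berq does the reverse, using the q-power near zero with a quadratic tail.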
LEMMA 6. When −1 ≤ x ≤ 1, both q-Huber and Berq satisfy Definition 2. The value of k for both is 2^{2−q}; the c for q-Huber is q − 1 while the c for Berq is 1.

Proof. Obviously, both conditions (1) and (2) are satisfied for q-Huber and Berq. Now we show 3(a) is also satisfied for q-Huber, based on the proof of Lemma 4; the proof for Berq is similar. When Δx > δ (the case Δx ≤ δ is similar),

k = (ℓ′(x + Δx) − ℓ′(x)) / Δx^{q−1} =
  |x/Δx + 1|^{q−1} sgn(x/Δx + 1) − |x/Δx|^{q−1} sgn(x/Δx)   if |x| > δ, |x + Δx| > δ
  (δ^{q−2}(x + Δx) − |x|^{q−1} sgn(x)) / Δx^{q−1}           if |x| > δ, |x + Δx| ≤ δ
  (|x + Δx|^{q−1} sgn(x + Δx) − δ^{q−2} x) / Δx^{q−1}       if |x| ≤ δ, |x + Δx| > δ
  δ^{q−2} Δx^{2−q}                                          if |x| ≤ δ, |x + Δx| ≤ δ.

Case 1: same as the proof of Lemma 4.

Case 2: in this case, x can only be negative, i.e. x < −δ. After some simplification,
k = (Δx/δ)^{2−q} − ( ((−x)/δ)^{2−q} − 1 ) ((−x)/Δx)^{q−1}.
Note that the right hand side is an increasing function of Δx and −δ − x ≤ Δx ≤ δ − x. Replacing Δx by −δ − x yields
k = ((−x)^{q−1} − δ^{q−1}) / (−x − δ)^{q−1} > 0.
Replacing Δx by δ − x yields
k = (δ^{q−1} + (−x)^{q−1}) / (δ − x)^{q−1} ≤ 2^{2−q}.
Here the last inequality is due to Jensen's inequality.

Case 3: its proof is very similar to Case 2.

Case 4: since δ < Δx ≤ 2δ in this case, 0 ≤ k ≤ 2^{2−q}. □

We now state uniqueness.
THEOREM 7
Fix a set S, γ > 0, κ > 0. For any loss function satisfying Definition 2, the solution x of (2) is unique. Moreover, define a residual function r(x) = −(1/γ) B^T diag(ℓ′(Bx)) w. A necessary and sufficient condition to satisfy the KKT conditions is to find x* where x* ≥ 0, r(x*) = [r_s, g^T, r_t]^T with g ≤ κd (where d reflects the original graph), k* = [0, (κd − g)^T, 0]^T, and g^T(κd − g) = 0.

Proof.
We first prove uniqueness. The Hessian of the objective in (2) has entries
H(i,i) = Σ_{j∼i} w_{i,j} ℓ″(x_i − x_j) + γ d_i ℓ″(x_i − (e_S)_i)  and  H(i,j) = −w_{i,j} ℓ″(x_i − x_j) for i ∼ j,
so that
x^T H x = γ Σ_{i∈V} d_i x_i² ℓ″(x_i − (e_S)_i) + Σ_{i∼j} w_{i,j} (x_i − x_j)² ℓ″(x_i − x_j).
If 3(a) is satisfied, we have ℓ″(x) > 0 and x^T H x > 0, so the objective of (2) is strictly convex and uniqueness is guaranteed. When 3(b) is satisfied, ℓ′(x + Δx) ≥ ℓ′(x) + kℓ′(Δx) guarantees that ℓ′(x) can only become zero in a range around zero, i.e. ℓ′(x) = ℓ″(x) = 0 when x ∈ [−ψ, ψ], where 0 ≤ ψ ≤ 1. Then x^T H x = 0 implies x_i ≥ 1 − ψ when i ∈ S, x_i ≤ ψ when i ∉ S, and −ψ ≤ x_i − x_j ≤ ψ whenever i ∼ j. In this case, uniqueness is implied by the κγ d^T x term in (2), i.e. each x_i will be the smallest feasible value.

Next, we show the KKT conditions of (2). If we translate problem (2) to add the constraint u = Bx, then the loss is ℓ(u). The Lagrangian is
L = w^T ℓ(u) + κγ d^T x − f^T (Bx − u) − λ_s (x_s − 1) − λ_t x_t − k^T x.
Standard optimality results give the KKT conditions of (2) as

∂L/∂x = κγ d − B^T f − λ_s e_s − λ_t e_t − k = 0
∂L/∂u = diag(ℓ′(u)) w + f = 0
k^T x = 0,  Bx = u,  k ≥ 0,  x_s = 1,  x_t = 0.   (4)

Thus, combining the first and second equations, γ r = B^T f. Since k ≥ 0, from the first equation we have g ≤ κd. And from k^T x = 0, we have g^T(κd − g) = 0. □

3 STRONGLY LOCAL ALGORITHMS
In this section, we provide a strongly local algorithm to approximately optimize equation (2) with ℓ(x) satisfying Definition 2. The simplest way to understand this algorithm is as a nonlinear generalization of the Andersen-Chung-Lang push procedure for PageRank [Andersen et al., 2006], which we call ACL. (The ACL procedure has strong relationships with Gauss-Seidel, coordinate solvers, and various other standard algorithms.) The overall algorithm is simple: find a vertex i where the KKT conditions from Theorem 7 are violated and increase x_i on that node until we approximately satisfy the KKT conditions. Update the residual, look for another violation, and repeat. The ACL algorithm targets the q = 2 case, which has a closed-form update; we simply need to replace this with a binary search.

    Algorithm nonlin-cut(γ, κ, ρ, ε) for set S and graph G, where 0 < ρ < 1 and ε determine accuracy
      1. Let x(i) = 0 except for x_s = 1, and set r = −(1/γ) B^T diag[ℓ′(Bx)] w
      2. While there is any vertex i where r_i > κ d_i, or stop if none exists (find a KKT violation)
      3.   Apply nonlin-push at vertex i, updating x and r
      4. Return x

    Algorithm nonlin-push(i, γ, κ, x, r, ρ, ε)
      1. Use binary search to find Δx_i such that the ith coordinate of the residual, after adding
         Δx_i to x_i, is r_i = ρκd_i; the binary search stops when the range of Δx is smaller
         than ε (satisfy KKT at i)
      2. Change the following entries in x and r to update the solution and residual
         (a) x_i ← x_i + Δx_i
         (b) for each neighbor j in the original graph G:
             r_j ← r_j + (1/γ) w_{i,j} (ℓ′(x_j − x_i) − ℓ′(x_j − x_i − Δx_i))

For ρ < 1, we only approximately satisfy the KKT conditions, as discussed further in Section 3.3. We have the following strongly local runtime guarantee when 3(a) in Definition 2 is satisfied; see Section 3.2 for the similar guarantee under 3(b). (This ignores the binary search, but that only scales the runtime by log(1/ε) because the values are in [0, 1].)
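To make the procedure concrete, here is a minimal Julia sketch of nonlin-cut and the push step. It is our own simplified rendering, not the released SLQ.jl implementation: it assumes a helper neighbors(G, i) yielding (j, w_ij) pairs, a degree vector d, and a loss derivative dℓ such as dℓ(x) = abs(x)^(q-1)*sign(x), and it maintains the residual densely for clarity (a strongly local implementation keeps x and r sparse).

    # Residual r_i(x) = -(1/γ) Σ_{j∼i} w_ij ℓ'(x_i - x_j) - d_i ℓ'(x_i - (e_S)_i).
    function residual(G, d, inS, x, i, γ, dℓ)
        acc = 0.0
        for (j, wij) in neighbors(G, i)
            acc += wij * dℓ(x[j] - x[i])      # = -ℓ'(x_i - x_j) by anti-symmetry
        end
        return acc / γ - d[i] * dℓ(x[i] - (inS[i] ? 1.0 : 0.0))
    end

    function nonlin_cut(G, d, S, γ, κ, ρ, ε, dℓ)
        n = length(d)
        inS = falses(n); inS[S] .= true
        x = zeros(n)
        r = [residual(G, d, inS, x, i, γ, dℓ) for i in 1:n]  # = d_i on S, 0 elsewhere
        queue = collect(S)                     # only seeds violate KKT at the start
        while !isempty(queue)
            i = popfirst!(queue)
            r[i] > κ * d[i] || continue        # skip stale queue entries
            lo, hi = 0.0, 1.0                  # Δx_i lives in [0, 1]
            while hi - lo > ε                  # binary search for r_i(Δ) = ρκd_i
                Δ = (lo + hi) / 2
                x[i] += Δ
                ri = residual(G, d, inS, x, i, γ, dℓ)
                x[i] -= Δ
                ri > ρ * κ * d[i] ? (lo = Δ) : (hi = Δ)
            end
            Δ = lo
            for (j, wij) in neighbors(G, i)    # update neighbor residuals (old x_i)
                r[j] += (wij / γ) * (dℓ(x[j] - x[i]) - dℓ(x[j] - x[i] - Δ))
                r[j] > κ * d[j] && push!(queue, j)
            end
            x[i] += Δ
            r[i] = residual(G, d, inS, x, i, γ, dℓ)  # ≈ ρκd_i by construction
            r[i] > κ * d[i] && push!(queue, i)       # re-queue if ε-truncation left a violation
        end
        return x
    end

The binary search is the only structural change relative to ACL: when q = 2 the equation r_i(Δ) = ρκd_i is linear in Δ and has a closed-form solution, which recovers the ACL push step.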
THEOREM 8. Let γ > 0, κ > 0 be fixed and let k and c be the parameters from Definition 2 for ℓ(x). For 0 < ρ < 1, suppose nonlin-cut stops after T iterations, and let d_i be the degree of the node updated at the ith iteration; then T must satisfy

Σ_{i=1}^T d_i ≤ vol(S) / [ c (ℓ′)^{−1}( γ(1 − ρ)κ / (k(1 + γ)) ) ] = O(vol(S)).

The notation (ℓ′)^{−1} refers to the inverse function of ℓ′(x); this function must be invertible under the definition in 3(a). The runtime bound when 3(b) holds is slightly different; see below. Note that if κ = 0, γ = 0, or ρ = 1, then this bound goes to ∞ and we lose our guarantee. However, if these are not the case, then the bound on T shows that the algorithm will terminate in time that is independent of the size of the graph. This is the type of guarantee provided by strongly local graph algorithms and it has been extremely useful for scalable network analysis methods [Leskovec et al., 2009; Jeub et al., 2015; Yin et al., 2017; Veldt et al., 2016; Kloster and Gleich, 2014].
LEMMA 9
During Algorithm 1, for any i ∈ V \ {s, t}, g_i will stay nonnegative and 0 ≤ x_i ≤ 1.

Proof. We can show this by induction. At the initial step, for node i ∈ S, g_i = d_i, and for node i ∈ S̄, g_i = 0. And after a nonlin-push step, every g_i will stay nonnegative. To prove 0 ≤ x_i ≤ 1, by expanding g_i we have
g_i = −(1/γ) Σ_{j∼i} w_{i,j} ℓ′(x_i − x_j) − d_i ℓ′(x_i − (e_S)_i).
Each x_i starts at 0 and only increases, so x_i ≥ 0. If x_i is the largest element of x and x_i > 1, then we have ℓ′(x_i − x_j) ≥ 0 for all j ∼ i and ℓ′(x_i − (e_S)_i) > 0. Then g_i < 0, which is a contradiction. □
LEMMA 10
When 3(a) is satisfied, after calling nonlin-push on node i, the decrease of ‖g‖₁ will be strictly larger than
c d_i (ℓ′)^{−1}( γ(1 − ρ)κ / (k(1 + γ)) ).

Proof. We use g′ to denote g after calling nonlin-push on node i. At any intermediate step of the nonlin-cut procedure,
‖g‖₁ = Σ g_i = −Σ_{i∈S} d_i ℓ′(x_i − 1) − Σ_{i∈S̄} d_i ℓ′(x_i).
This is because for any edge (i,j) ∈ E, g_i has a term −(1/γ) w(i,j) ℓ′(x_i − x_j) while g_j has a term −(1/γ) w(j,i) ℓ′(x_j − x_i). Since our graph is undirected, w(i,j) = w(j,i), so these two terms cancel out. What remains are the terms corresponding to the edges connecting to s or t. So after calling nonlin-push on node i,
‖g‖₁ − ‖g′‖₁ = d_i ℓ′(x_i + Δx_i − (e_S)_i) − d_i ℓ′(x_i − (e_S)_i) ≥ d_i min{ℓ″(x_i + Δx_i − (e_S)_i), ℓ″(x_i − (e_S)_i)} Δx_i ≥ c d_i Δx_i.
On the other hand, we need to choose Δx_i such that g′_i = ρκd_i. We know
g′_i = −(1/γ) Σ_{j∼i} w(i,j) ℓ′(x_i + Δx_i − x_j) − d_i ℓ′(x_i + Δx_i − (e_S)_i)
is a decreasing function of Δx_i. When Δx_i = 0, g′_i = g_i > κd_i > ρκd_i; when Δx_i = 1, g′_i < 0 < ρκd_i; since ℓ′(x) is a strictly increasing function, there exists a unique Δx_i such that g′_i = ρκd_i. Moreover, we can lower bound Δx_i. To see that,
g′_i = ρκd_i = −(1/γ) Σ_{j∼i} w(i,j) ℓ′(x_i + Δx_i − x_j) − d_i ℓ′(x_i + Δx_i − (e_S)_i)
 ≥ −(1/γ) Σ_{j∼i} w(i,j) ℓ′(x_i − x_j) − d_i ℓ′(x_i − (e_S)_i) − (k(1 + γ)/γ) d_i ℓ′(Δx_i)
 = g_i − (k(1 + γ)/γ) d_i ℓ′(Δx_i).
Thus, we have
Δx_i ≥ (ℓ′)^{−1}( γ(g_i − ρκd_i) / (k(1 + γ) d_i) ) > (ℓ′)^{−1}( γ(1 − ρ)κ / (k(1 + γ)) ),
which means ‖g‖₁ − ‖g′‖₁ > c d_i (ℓ′)^{−1}( γ(1 − ρ)κ / (k(1 + γ)) ). □

The only step left to prove Theorem 8 is that at the beginning we have ‖g‖₁ = vol(S). Then the theorem follows from Lemma 10.

3.2 RUNNING TIME ANALYSIS WHEN 3(B) IS SATISFIED

For the following results, we add an extra strictly-increasing condition so that ℓ′(γ(1 − ρ)κ/(c(1 + γ))) is positive. When ℓ′ is not strictly increasing, i.e. ℓ′(x) = 0 in a small range around 0, it is our conjecture that the algorithm will still finish in strongly local time, although we have not yet proven that. Note that this strictly-increasing criterion holds for all the losses used in the experiments.

LEMMA 11
When 3(b) is satisfied and ℓ′(x) is strictly increasing, after calling nonlin-push on node i, the decrease of ‖g‖₁ will be strictly larger than
k d_i ℓ′( γ(1 − ρ)κ / (c(1 + γ)) ).

Proof. Similarly to the proof of Lemma 10, after calling nonlin-push on node i,
‖g‖₁ − ‖g′‖₁ = d_i ℓ′(x_i + Δx_i − (e_S)_i) − d_i ℓ′(x_i − (e_S)_i) ≥ k d_i ℓ′(Δx_i).
On the other hand,
g′_i = ρκd_i = −(1/γ) Σ_{j∼i} w(i,j) ℓ′(x_i + Δx_i − x_j) − d_i ℓ′(x_i + Δx_i − (e_S)_i)
 ≥ −(1/γ) Σ_{j∼i} w(i,j) ℓ′(x_i − x_j) − d_i ℓ′(x_i − (e_S)_i) − (c(1 + γ)/γ) d_i Δx_i
 = g_i − (c(1 + γ)/γ) d_i Δx_i.
Thus, we have Δx_i ≥ γ(g_i − ρκd_i)/(c(1 + γ) d_i) > γ(1 − ρ)κ/(c(1 + γ)), which means
‖g‖₁ − ‖g′‖₁ > k d_i ℓ′( γ(1 − ρ)κ / (c(1 + γ)) ). □

Lemma 11, along with the same type of analysis as before, gives the following result when 3(b) is satisfied.
THEOREM 12
Let γ > 0, κ > 0 be fixed and let k and c be the parameters from Definition 2 for ℓ(x) when 3(b) is satisfied with a strict increase. For 0 < ρ < 1, suppose nonlin-cut stops after T iterations, and let d_i be the degree of the node updated at the ith iteration; then T must satisfy

Σ_{i=1}^T d_i ≤ vol(S) / [ k ℓ′( γ(1 − ρ)κ / (c(1 + γ)) ) ] = O(vol(S)).

When ρ < 1, we only approximately satisfy the KKT conditions. Here, we do some quick analysis of the difference between the idealized slackness condition k^T x = 0 and what we get from our solver. Note that by choosing ρ close to 1, we do produce a fairly accurate solution when 3(a) is satisfied.

LEMMA 13
When Algorithm 1 returns, if ℓ(x) satisfies 3(a) we have
k^T x ≤ κ k ℓ′(1) (1 − ρ) vol(S) / c.

Proof. We know k = [0, (κd − g)^T, 0]^T. Every time Algorithm 2 is called at node i, it will set g_i = ρκd_i. In the following iterations, g_i can only increase until Algorithm 2 is called at node i again. This means k ≤ (1 − ρ)κd. On the other hand, when 3(a) is satisfied, ℓ′(1 − x_i) ≤ −ℓ′(x_i) + kℓ′(1), so
‖g‖₁ = −Σ_{i∉S} d_i ℓ′(x_i) − Σ_{i∈S} d_i ℓ′(x_i − 1) ≤ −Σ_{i∈V} d_i ℓ′(x_i) + kℓ′(1) vol(S) ≤ −c d^T x + kℓ′(1) vol(S).
Thus d^T x ≤ (kℓ′(1)/c) vol(S). Combining the two inequalities gives this lemma. □

When 3(b) is satisfied, it is easy to see that k^T x ≤ (1 − ρ)κ d^T x; however, there is no closed-form upper bound on k^T x in terms of vol(S).

A common use for the results of these localized cut solutions is as localized Fiedler vectors of a graph to induce a cluster [Andersen et al., 2006; Leskovec et al., 2009; Mahoney et al., 2012; Zhu et al., 2013; Orecchia and Zhu, 2014]. This was the original motivation of the ACL procedure [Andersen et al., 2006], for which the goal was a small conductance cluster. One of the most common (and theoretically justified!) ways to convert a real-valued "clustering hint" vector x into clusters is to use a sweep cut process. This involves sorting x in decreasing order and evaluating the conductance of each prefix set S_j = {x_1, x_2, ..., x_j} for each j ∈ [n]. The set with the smallest conductance is returned. This computation is a key piece of Cheeger inequalities [Chung, 1992; Mihail, 1989].

In the following, we seek a slightly different type of guarantee. We posit the existence of a target cluster T and show that if T has useful clustering properties (small conductance, no good internal clusters), then a sweep cut over a q-norm or q-Huber localized cut vector seeded inside of T will accurately recover T. The key piece is understanding how the computation plays out with respect to T inside the graph and T as a graph by itself.

The following two observations are not directly related to the main result, but we still find them useful in understanding the problem in general.
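As an aside, the sweep cut described above is only a few lines of code. The following sketch (our own, assuming a symmetric sparse adjacency matrix with no self-loops) evaluates the conductance of every prefix of the sorted solution vector and returns the best prefix set; for a strongly local method it suffices to sweep only over the support of x.

    using SparseArrays

    function sweepcut(A::SparseMatrixCSC, x::Vector)
        d = vec(sum(A, dims=2)); volG = sum(d)
        order = sortperm(x, rev=true)          # sort x in decreasing order
        inset = falses(length(x))
        volS, cutS = 0.0, 0.0
        bestφ, bestk = Inf, 0
        for (k, v) in enumerate(order)
            x[v] > 0 || break                  # only sweep over the support of x
            volS += d[v]
            rows, vals = findnz(A[:, v])
            for (u, w) in zip(rows, vals)      # edges to swept nodes leave the cut
                cutS += inset[u] ? -w : w
            end
            inset[v] = true
            φ = cutS / min(volS, volG - volS)
            if φ < bestφ
                bestφ, bestk = φ, k
            end
        end
        return order[1:bestk], bestφ
    end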
LEMMA 14
For two seed sets S₁ and S₂, denote by x₁ and x₂ the solutions of the ℓq-norm cut problem using S₁ and S₂ correspondingly. If S₁ ⊆ S₂, then x₁ ≤ x₂.

Proof. Consider two nonlin-cut processes P₁, P₂ using S₁ or S₂ as input correspondingly. Suppose we set the initial vector of P₂ to be the solution of P₁, i.e. x₁; then for nodes i ∉ S₂ \ S₁ the residual stays zero, while for nodes i ∈ S₂ \ S₁ the residual becomes positive. This means P₂ needs more iterations to converge, and each iteration can only add nonnegative values to x₁. Thus, x₁ ≤ x₂. □

LEMMA 15
Suppose that κ = 0. We can compute the exact solution of problem (2) under the two extreme cases γ → ∞ and γ → 0:
· When γ → ∞, x_i = 1 for i ∈ S and x_i = 0 for i ∈ S̄.
· When γ → 0, x_i ≥ (vol(S))^{1/(q−1)} / (vol(V))^{1/(q−1)} for any i ∈ V.

Proof. When κ = 0, the objective function of (2) becomes
Σ_{i∼j} w(i,j) ℓ(x_i − x_j) + γ Σ_{i∈V} d_i ℓ(x_i − (e_S)_i).
When γ → ∞, the first term vanishes relative to the second, and the second term achieves its smallest value when x_i = 1 for i ∈ S and x_i = 0 for i ∈ S̄. When γ → 0, the second term vanishes, and the first term is minimal with objective zero when every x_i converges to a fixed constant. Moreover, the KKT condition now becomes
(1/γ) Σ_{j∼i} w(i,j) ℓ′(x_i − x_j) + d_i ℓ′(x_i − (e_S)_i) = 0.
Summing the KKT condition over all nodes yields
Σ_{i∈V} d_i ℓ′(x_i − (e_S)_i) = 0,
so we can compute the constant that x_i converges to by setting x_i = c, which gives
c = (vol(S))^{1/(q−1)} / [ (vol(V) − vol(S))^{1/(q−1)} + (vol(S))^{1/(q−1)} ] ≥ (vol(S))^{1/(q−1)} / (vol(V))^{1/(q−1)}. □

As we mentioned before, the key piece is understanding how the computation plays out with respect to T inside the graph and T as a graph by itself. We use vol_T(S) for the volume of the seed set S in the subgraph induced by T, and ∂T ⊂ T for the boundary set of T, i.e. nodes in ∂T have at least one edge connecting to T̄. Quantities with tildes, e.g., d̃, reflect quantities in the subgraph induced by T. We assume κ = 0, ρ = 1, and:

ASSUMPTION 16
The seed set S satisfies S ⊆ T, S ∩ ∂T = ∅, and Σ_{i∈∂T} (d_i − d̃_i) x_i^{q−1} ≤ φ(T) vol(S). □

We call this the leaking assumption, which roughly states that the solution with the set S stays mostly within the set T. As some quick justification for this assumption, we note that when q = 2, [Zhu et al., 2013] shows by a Markov bound that there exists T_g with vol(T_g) ≥ vol(T)/2 such that any node i ∈ T_g satisfies Σ_{i∈∂T} (d_i − d̃_i) x_i ≤ φ(T) d_i. So in that case, any seed set S ⊆ T_g meets our assumption. For 1 < q < 2, it is trivial to see any set S with vol(S) ≥ vol(T) satisfies this assumption since the left hand side is always smaller than cut(T, T̄). However, such a strong assumption is not necessary for our approach. The above guarantee allows for a small vol(S) and we simply require that Assumption 16 holds. We currently lack a detailed analysis of how many such seed sets there will be.

Our second assumption regards the behavior within only the set T compared with the entire graph. To state it, we wish to be precise. Consider the localized cut graph associated with the hidden target set T on the entire graph and let B, w be the incidence matrix and weights for this graph. We wish to understand how the solution x of

minimize_x  w^T ℓ(Bx)   subject to  x_s = 1, x_t = 0, x ≥ 0   (5)

compares with the corresponding solution on the subgraph induced by T. Let B̃, w̃ be the incidence matrix and weight vector of the localized cut graph on the vertex-induced subgraph corresponding to T and seeded on T (so the tilde-problem is seeded on all of its nodes). So formally, we wish to understand how x̃ in

minimize_x̃  w̃^T ℓ(B̃x̃)   subject to  x̃_s = 1, x̃_t = 0, x̃ ≥ 0   (6)

compares with x. For these comparisons, we assume we are looking at values other than x_s, x_t and x̃_s, x̃_t.

ASSUMPTION 17
A relatively small γ should be chosen such that the solution of the localized q-norm cut problem in the subgraph induced by the target cluster T satisfies min(x̃) ≥ (0.5 vol_T(S))^{1/(q−1)} / (vol_T(T))^{1/(q−1)} = M. □

We will call Assumption 17 a "mixing-well" guarantee. To better understand this assumption, note that when ℓ(x) = (1/q)|x|^q and q = 2, a solution of the nonlin-cut process (Algorithm 1) is equivalent to a Markov process. In this case, one can lower bound min(x̃) by the well-known infinity-norm mixing time of a Markov chain. In fact, as shown in the proof of Lemma 3.2 of [Zhu et al., 2013], when γ ≤ O(φ(T) · Gap), we have min(x̃_T) ≥ 0.5 vol_T(S)/vol_T(T). Here Gap is defined as the ratio of internal connectivity and external connectivity and is often assumed to be Ω(1). We refer to [Zhu et al., 2013] for a detailed definition.

The proof of Lemma 3.2 in [Zhu et al., 2013] shows that the teleportation probability β = 1 − α needs to be smaller than O(φ(T) · Gap). When q = 2, as shown in [Gleich and Mahoney, 2014], β = γ/(1 + γ), which means γ = β/(1 − β). Since we assume γ < 1, we have β ≤ γ < 2β. In other words, γ and β only differ by a constant factor.

For 1 < q < 2, nonlin-cut is no longer equivalent to the solution of a Markov process and thus it is more difficult to derive a closed-form expression for how small γ needs to be so that Assumption 17 is satisfied. However, Lemma 18 (below) shows that for graphs with small diameters, it is easier (i.e., γ can be larger) for the solution of (6) to satisfy Assumption 17. This is reasonable because we expect good clusters and good communities to have small diameters.

LEMMA 18
Assume the subgraph induced by the target cluster T has diameter O(log|T|) and that when we uniformly randomly sample points from T as seed sets, the expected largest distance of any node in S̄ to S is O(log(|T|)/|S|). Also define γ₀ to be the largest γ such that Assumption 17 is satisfied at q = 2, and assume γ₀ < 1. If we set γ = γ₀^{q−1} for 1 < q < 2, and

vol_T(S)/vol_T(T) ≤ 2 [ (γ₀/(1 + γ₀)) (l^{1/(q−1)} + 1)^{−log(|T|)/|S|} ]^{q−1},

where l ≤ (1 + γ) max(d̃_i), then the solution of (6) can satisfy Assumption 17.

Proof. Given a seed set S, we can partition S̄ into disjoint subsets L₁ ∪ L₂ ∪ ... ∪ L_n, where L_k contains nodes that are distance k away from S. For any node i ∈ L_k, we denote
d_i^{out} = Σ_{j∼i, j∈L_k ∪ L_{k+1}} w(i,j)
and d_i^{in} = d̃_i − d_i^{out}. Also define l = max_i (d_i^{out} + γ d̃_i)/d_i^{in} ≤ (1 + γ) max(d̃_i). Suppose x̃_i ≥ c for any node i with distance at most k − 1; then we can show for a node i ∈ L_k that x̃_i ≥ c/(l^{1/(q−1)} + 1). To see this, if x̃_i < c, then by the KKT condition,
d_i^{in} (c − x̃_i)^{q−1} ≤ d_i^{out} x̃_i^{q−1} + γ d̃_i x̃_i^{q−1},
where for j ∼ i we lower bound x̃_j by c if j is closer to S, and by 0 otherwise. This means
x̃_i ≥ c (d_i^{in})^{1/(q−1)} / [ (d_i^{out} + γ d̃_i)^{1/(q−1)} + (d_i^{in})^{1/(q−1)} ] ≥ c / (l^{1/(q−1)} + 1).
Also, for a node i ∈ S, the first iteration of the q-norm process adds at least γ^{1/(q−1)}/(1 + γ^{1/(q−1)}) to x̃_i (this follows from unrolling the first loop of our algorithm and checking that this satisfies the binary search criteria), which means x̃_i ≥ γ^{1/(q−1)}/(1 + γ^{1/(q−1)}). Thus, for a node i ∈ L_k,
x̃_i ≥ (γ^{1/(q−1)}/(1 + γ^{1/(q−1)})) (1/(l^{1/(q−1)} + 1))^k = (γ₀/(1 + γ₀)) (1/(l^{1/(q−1)} + 1))^k.
Since the subgraph induced by the target cluster T has diameter O(log(|T|)) and, when we uniformly randomly sample points from T as seed sets, the expected largest distance r of any node in S̄ to S is O(log(|T|)/|S|), we have
min(x̃) ≥ (γ₀/(1 + γ₀)) (1/(l^{1/(q−1)} + 1))^{log(|T|)/|S|}.
Assumption 17 requires min(x̃) ≥ (0.5 vol_T(S))^{1/(q−1)}/(vol_T(T))^{1/(q−1)}, so we just need
vol_T(S)/vol_T(T) ≤ 2 [ (γ₀/(1 + γ₀)) (l^{1/(q−1)} + 1)^{−log(|T|)/|S|} ]^{q−1},
which was the final assumption. □

LEMMA 19
Under the previous assumptions, define a sweep cut set S_c as
{ i ∈ V | x_i ≥ c (0.5 vol(S))^{1/(q−1)} / (vol(T))^{1/(q−1)} };
then for any 0 < c ≤ 1,
vol(S_c \ T) = O( φ(T) / (γ c^{q−1}) ) vol(T)  and  vol(T \ S_c) = O( φ(T) / γ ) vol(T).

Proof. The proof is mostly a generalization of the proof of Lemma 3.4 in [Zhu et al., 2013]. For any i ∈ T̄, by the KKT condition and Assumption 16,
0 = r_i(x) = −(1/γ) Σ_{j∼i} w(i,j) ℓ′(x_i − x_j) − d_i x_i^{q−1}
 = −(1/γ) Σ_{j∼i, j∈T̄} w(i,j) ℓ′(x_i − x_j) − (1/γ) Σ_{j∼i, j∈T} w(i,j) ℓ′(x_i − x_j) − d_i x_i^{q−1}
 = −(1/γ) Σ_{j∼i, j∈T̄} w(i,j) ℓ′(x_i − x_j) + (1/γ) Σ_{j∼i, j∈T} w(i,j) ℓ′(x_j − x_i) − d_i x_i^{q−1}
 < −(1/γ) Σ_{j∼i, j∈T̄} w(i,j) ℓ′(x_i − x_j) + (1/γ) Σ_{j∼i, j∈T} w(i,j) ℓ′(x_j) − d_i x_i^{q−1}.
By summing the inequality above over all nodes in T̄, the first terms all cancel out, yielding
Σ_{i∈T̄} d_i x_i^{q−1} < (1/γ) Σ_{i∈∂T} (d_i − d̃_i) x_i^{q−1} ≤ φ(T) vol(S) / γ.
For i ∈ S_c \ T, x_i^{q−1} ≥ c^{q−1} · 0.5 · vol(S)/vol(T), thus
( c^{q−1} vol(S) / (2 vol(T)) ) vol(S_c \ T) ≤ Σ_{i∈S_c\T} d_i x_i^{q−1} ≤ φ(T) vol(S) / γ,
which means vol(S_c \ T) = O( φ(T) / (γ c^{q−1}) ) vol(T).

In the following, we define x_i = x̃_i + v_i and ℓ′(x_i − (e_S)_i) = ℓ′(x̃_i − (e_S)_i) + k_i ℓ′(v_i). For any node i ∈ T, by the KKT condition,
0 = r_i(x) = −(1/γ) Σ_{j∼i} w(i,j) ℓ′(x_i − x_j) − d_i ℓ′(x_i − (e_S)_i)
 = −(1/γ) Σ_{j∼i, j∈T} w(i,j) ℓ′(x_i − x_j) − (1/γ) Σ_{j∼i, j∈T̄} w(i,j) ℓ′(x_i − x_j) − d_i ℓ′(x_i − (e_S)_i)
 > −(1/γ) Σ_{j∼i, j∈T} w(i,j) ℓ′(x_i − x_j) − (1/γ) Σ_{j∼i, j∈T̄} w(i,j) ℓ′(x_i) − d̃_i ℓ′(x_i − (e_S)_i) − (d_i − d̃_i) ℓ′(x_i)
 = −(1/γ) Σ_{j∼i, j∈T} w(i,j) ℓ′(x_i − x_j) − d̃_i ℓ′(x̃_i − (e_S)_i) − k_i d_i ℓ′(v_i) − (1 + 1/γ)(d_i − d̃_i) ℓ′(x_i)
 = −(1/γ) Σ_{j∼i, j∈T} w(i,j) ℓ′(x_i − x_j) + (1/γ) Σ_{j∼i, j∈T} w(i,j) ℓ′(x̃_i − x̃_j) − k_i d_i ℓ′(v_i) − (1 + 1/γ)(d_i − d̃_i) ℓ′(x_i),
where the last equality uses the KKT condition of the tilde-problem. By summing the inequality above over all nodes in T, the first and second terms cancel out, so it yields
Σ_{i∈T} k_i d_i ℓ′(v_i) > −((1 + γ)/γ) φ(T) vol(S).
For nodes i ∈ T \ S_c, x_i < c x̃_i, which means v_i < (c − 1) x̃_i. And
ℓ′(v_i) = −(−v_i)^{q−1} < −(1 − c)^{q−1} · 0.5 · vol_T(S)/vol_T(T) ≤ −(1 − c)^{q−1} · 0.5 · vol(S)/vol(T).
(Here we use the facts that vol_T(T) ≤ vol(T) and S ∩ ∂T = ∅.) From the proof of Lemma 18, we know that S will be included in S_c. When i ∉ S,
k_i = (−x̃_i/v_i + 1)^{q−1} − (−x̃_i/v_i)^{q−1} > ( (2 − c)^{q−1} − 1 ) / (1 − c)^{q−1}.
Thus, we have vol(T \ S_c) = O( φ(T) / γ ) vol(T). □

LEMMA 20
Under the same assumptions as Lemma 19, among the sweep cut sets S_c ∈ {S_c | 1/8 ≤ c ≤ 1/4}, there exists one R such that
φ(R) = O( φ(T)^{1/q} / Gap^{(q−1)/2} ).

Proof. Our proof is mostly a generalization of the proof of Lemma 4.1 in [Zhu et al., 2013]. If cut(S_c, S̄_c) ≥ E holds for all 1/8 ≤ c ≤ 1/4, then we just need to upper bound E.

We introduce values k(i,j) that allow us to break ℓ′(x_i − x_j) into ℓ′(x_i) − k(i,j) ℓ′(x_j); the specific choice k(i,j) > 0 depends on x_i and x_j. For any node i ∈ S_c, by the KKT condition,
0 = (1/γ) Σ_{j∼i} w(i,j) ℓ′(x_i − x_j) + d_i ℓ′(x_i − (e_S)_i)
 = (1/γ) Σ_{j∼i} ( w(i,j) ℓ′(x_i) − w(i,j) k(i,j) ℓ′(x_j) ) + d_i ℓ′(x_i) − k_i d_i (e_S)_i.
Define K to be the matrix induced by k(i,j). Rearranging the equation above yields
(K ∘ A x^{q−1})_i = (1 + γ) d_i x_i^{q−1} − γ k_i d_i (e_S)_i.
Also, for two adjacent nodes i, j that are both in S_c, we have
k(i,j) ℓ′(x_j) + k(j,i) ℓ′(x_i) = ℓ′(x_i) + ℓ′(x_j),
because ℓ′(x_i − x_j) + ℓ′(x_j − x_i) = 0. And for two adjacent nodes i, j such that i ∈ S_c and j ∉ S_c, x_i > x_j implies k(i,j) < 1. Define a Lovasz-Simonovits curve y over the values d_i x_i^{q−1}; then we have
Σ_{i∈S_c} (K ∘ A x^{q−1})_i + Σ_{i∈S_c} d_i x_i^{q−1}
 = 2 Σ_{i∈S_c} Σ_{j∼i, j∈S_c} w(i,j) x_j^{q−1} + Σ_{i∈S_c} Σ_{j∼i, j∉S_c} k(i,j) w(i,j) x_j^{q−1}
 < 2 Σ_{i∈S_c} Σ_{j∼i, j∈S_c} w(i,j) x_j^{q−1} + Σ_{i∈S_c} Σ_{j∼i, j∉S_c} w(i,j) x_j^{q−1}
 ≤ y[vol(S_c) − cut(S_c, S̄_c)] + y[vol(S_c) + cut(S_c, S̄_c)]
 ≤ y[vol(S_c) − E] + y[vol(S_c) + E],
where the second inequality is due to the definition of the Lovasz-Simonovits curve and the third inequality is due to the concavity of y(x). This means
y[vol(S_c) − E] + y[vol(S_c) + E] ≥ Σ_{i∈S_c} (K ∘ A x^{q−1})_i + Σ_{i∈S_c} d_i x_i^{q−1}
 ≥ (2 + γ) Σ_{i∈S_c} d_i x_i^{q−1} − γ Σ_{i∈S_c} k_i d_i (e_S)_i
 ≥ (2 + γ) Σ_{i∈S_c} d_i x_i^{q−1} − γ Σ_{i∈S} k_i d_i
 = (2 + γ) Σ_{i∈S_c} d_i x_i^{q−1} − γ Σ_{i∈V} d_i x_i^{q−1}
 = 2 Σ_{i∈S_c} d_i x_i^{q−1} − γ Σ_{i∉S_c} d_i x_i^{q−1}
 ≥ 2 y[vol(S_c)] − O( φ(T) vol(S) ).
Thus,
y[vol(S_c)] − y[vol(S_c) − E] ≤ y[vol(S_c) + E] − y[vol(S_c)] + O( φ(T) vol(S) ).
Chaining this between the sweep sets at c = 1/4 and c = 1/8 gives
0.5 E vol(S) / (4^{q−1} vol(T)) ≤ y[vol(S_{1/4})] − y[vol(S_{1/4}) − E]
 ≤ (vol(S_{1/8} \ S_{1/4}) / E) O( φ(T) vol(S) ) + y[vol(S_{1/8})] − y[vol(S_{1/8}) − E]
 ≤ ((vol(S_{1/8} \ T) + vol(T \ S_{1/4})) / E) O( φ(T) vol(S) ) + 0.5 E vol(S) / (8^{q−1} vol(T)).
Hence,
E ≤ O( φ(T) / √γ ) vol(T).
And from Lemma 19, we know vol(S_c) = (1 ± O(φ(T)/γ)) vol(T); since we choose γ = (γ₀)^{q−1} and γ₀ = Θ(φ(T) · Gap), we have vol(S_c) = Θ(vol(T)). So there exists R such that
φ(R) = O( φ(T)/√γ ) = O( φ(T)^{(3−q)/2} / Gap^{(q−1)/2} ) ≤ O( φ(T)^{1/q} / Gap^{(q−1)/2} ).
Here the last inequality uses the fact that (3 − q)/2 ≥ 1/q when 1 < q < 2. □

By combining all these lemmas, we get the following theorem.
THEOREM 21
Assume the subgraph induced by the target cluster T has diameter O(log(|T|)) and that when we uniformly randomly sample points from T as seed sets, the expected largest distance of any node in S̄ to S is O(log(|T|)/|S|). Assume
vol_T(S)/vol_T(T) ≤ 2 [ (γ₀/(1 + γ₀)) (l^{1/(q−1)} + 1)^{−log(|T|)/|S|} ]^{q−1},
where l ≤ (1 + γ) max(d̃_i); then we can set γ = γ₀^{q−1} to satisfy Assumption 17 for 1 < q < 2. Then a sweep cut over x will find a cluster R where φ(R) = O( φ(T)^{1/q} / Gap^{(q−1)/2} ).

We perform three experiments that are designed to compare our method to others designed for similar problems. We call ours SLQ (strongly local q-norm) for ℓ(x) = (1/q)|x|^q with parameters γ for localization and κ for the sparsity. We call it SLQδ with the q-Huber loss. Existing solvers are (i) ACL [Andersen et al., 2006], which computes a personalized PageRank vector approximately, adapted with the same parameters [Gleich and Mahoney, 2014]; (ii) CRD [Wang et al., 2017], which is a hybrid of flow and spectral ideas; (iii) FS, which is FlowSeed [Veldt et al., 2019a], a 1-norm based method; (iv) HK, the push-based heat kernel [Kloster and Gleich, 2014]; (v) NLD, a recent nonlinear diffusion [Ibrahim and Gleich, 2019]; (vi) GCN, a graph convolutional network [Kipf and Welling, 2016]. Parameters are chosen based on defaults or with slight variations designed to enhance the performance within a reasonable running time. We provide a full Julia implementation of SLQ in Section 5.5. We evaluate the routines in terms of their recovery performance for planted sets and clusters. The bands in the figures reflect randomizing the seed choices in the target cluster.

The first experiment uses the LFR benchmark [Lancichinetti et al., 2008]. We vary the mixing parameter µ (where larger µ is more difficult) and provide 1% of a cluster as a seed, then we check how much of the cluster we recover after a conductance-based sweep cut over the solutions from various methods. Here, we use the F1 score.
FIGURE 4 – Methods: SLQ (q=1.2), SLQ (q=1.4), SLQ (q=1.6), CRD (h=3), CRD (h=5), ACL, heat kernel. The left figure shows the median running time for the methods as we scale the graph size, keeping the cluster sizes roughly the same. As we vary cluster mixing µ for a graph with 10,000 nodes, the middle figure shows the median F1 score (higher is better) along with the 20–80% quantiles; the right figure shows the conductance values (lower is better). These results show SLQ is better than ACL and competitive with CRD while running much faster.
Reproduction details.
When creating the LFR graphs, we set the power law exponent for the degree distribution to be 2, the power law exponent for the community size distribution to be 2, the desired average degree to be 10, the maximum degree to be 50, the minimum community size to be 200, and the maximum community size to be 500. We create 40 random graphs for each µ. For SLQ, we set δ = 0, γ = 0.…, ρ = 0.…, ε = 10^{−…}. For ACL, we set γ = 0.1. For both SLQ and ACL, κ is automatically chosen from 0.005 and 0.002 based on which gives a cluster with smaller conductance. For HK, we use four different pairs of (ε, t), including (0.…, 40). For CRD, the key parameter is h, which is the maximum flow that each edge can handle; we provide results using h = 3 and h = 5. For methods that use multiple choices of parameters, we report the total running time.

The second experiment uses the class-year metadata on Facebook [Traud et al., 2012], which is known to have good conductance structure for at least class year 2009 [Veldt et al., 2019b] that should be identifiable with many methods. Other class years are harder to detect with conductance. Here, we use F1 scores again.

Reproduction details.
In this experiment, for SLQ, we set q = 1.…, γ = 0.…, κ = 0.…, ε = 10^{−…}, ρ = 0.…, δ = 0. For SLQδ, the parameters are the same as SLQ except we set δ = 10^{−…}. For ACL, we set γ = 0.05 and κ = 0.…5. For NLD, we set the power to be 1.5, the step size to be 0.002, and the number of iterations to be 5000. For GCN, we use 5 hidden layers and negative log likelihood loss. We set the dropout ratio to be 0.5, the learning rate to be 0.01, the weight decay to be 0.0005, and the number of iterations to be 200. The feature vector is the 6 different metadata fields as described in [Traud et al., 2012]. For each true set, we randomly choose 1% of the true set as seeds, 50 times.
The final experiment evaluates a finding from [Kloumann and Kleinberg, 2014] on the recall of seed-based community detection methods. For a group of communities with roughly the same size, we evaluate the recall of the largest k entries in a diffusion vector. They found PageRank (ACL) outperformed many different methods. In Figure 6, we see the same general result and found that SLQ with q > 2 does even better.

TABLE 1 – Cluster recovery results from a set of 7 Facebook networks [Traud et al., 2012]. Students with a specific graduation class year are used as the target cluster. We use a random set of 1% of the nodes identified with that class year as seeds. The class year 2009 is the set of incoming students, which form better conductance groups because the students had not yet mixed with the other classes. Class year 2008 is already mixed and so the methods do not do as well there. The values are median F1 scores.

Method          SLQ   SLQδ  CRD-3  CRD-5  ACL  FS    HK   NLD    GCN
Time (seconds)  123   80    3049   9378   12   1593  106  10375  16534
TABLE 2 – Total running time of methods in this experiment.
Finally, we describe an experiment where we study how the performance of different methods changes as we vary the size of the seed set. The dataset we use is the same MIT Facebook dataset and the target cluster is class year 2008. This choice is one where most of the methods in Table 1 did poorly, but ACL did better in some trials. We repeat 50 times for each seed size level. From the previous experiments, we can see that none of the methods works well at finding this cluster. In this experiment, we only report results from SLQ, ACL, FS, CRD-3 and HK, as they are all strongly local methods and they perform better than global methods, as we have seen from previous experiments. Also, we did not add CRD-5 because CRD-3 performed better than CRD-5 on this particular cluster, as shown in Table 1. The result of this experiment is in Figure 5. When the seed size is smaller than 15 nodes, the F1 score of all methods improves as we increase the seed size. After 15 nodes, only the F1 scores of SLQ and ACL continue to improve as the seed size becomes larger, while the performance of the other methods stays the same or even becomes slightly worse.
FIGURE 5 – This figure shows the performance change (F1 score) of different methods (SLQ, ACL, CRD-3, FS, HK) as we vary the size of the seed set. The dataset is MIT Facebook with the true cluster being class year 2008. The envelope represents the 20%–80% quantile.
Reproduction details.
For HK and CRD-3, we use the same parameters as the previous Facebook experiment. For ACL and SLQ, we use a coarse binary search (initial region between 0.001 and 0.1, smallest feasible region 0.001) to find a good sparsity level such that the total number of nonzero entries is 20% of the total number of nodes. The other parameters are the same as the previous Facebook experiment. We also use a similar coarse binary search (initial region between 0.4 and 5.0, smallest feasible region 0.1) to choose ε for FS. We did not implement this procedure for CRD and HK because CRD does not have a standalone parameter to control the sparsity of the solution, and HK has already been set up to choose the best cluster from a list of parameters. One thing we would like to mention is that in Table 1 we use 1% of the nodes of the true cluster as seeds, which is roughly 32 nodes in this case. So we can see that the performance of both ACL and SLQ improves with this extra layer of binary search (i.e., the median F1 score increases to 0.6), while the performance of FS remains the same.
FIGURE 6 – A replication of an experiment from [Kloumann and Kleinberg, 2014] with SLQ on (a) DBLP [Backstrom et al., 2006; Yang and Leskovec, 2012] (with 1M edges) and (b) LiveJournal [Mislove et al., 2007] (with 65M edges). Each panel compares SLQ and SLQ-DN (q = 1.5, 4.0, 8.0) with ACL and ACL-DN. The plot shows median recall over 600 groups of roughly the same size as we look at the top k entries in the solution vector (x axis). The envelope represents 2 standard errors. This shows SLQ with q > 2 can improve on ACL.

Our full implementation is available in the
SLQ.jl package on github: github.com/MengLiuPurdue/SLQ, and the experiment codes are available too. We verified that this Julia implementation of ACL is as efficient as ACL implemented in C++, so there is no appreciable overhead to using Julia compared with C or C++ for this computation.

First, we want to mention that in our experiments, we find that we can speed up SLQ by using a slightly modified binary search procedure. The logic is that when q is close to 1 and vol(S) is small, the Δx_i after each step of the "push" procedure is also small, so it does not make sense to set the initial range of the binary search to be [0, 1]. Instead we use [2^{−k} t, 2^{k} t], where t is chosen from either the last Δx_i or (vol(S)/vol(V))^{1/(q−1)}. (Note this is just the lower bound of x_i when γ → 0.) We find a suitable k by checking k = 1, 2, ... until the residual becomes negative. This strategy is implemented in our code.

The most strongly related work was posted to arXiv [Yang et al., 2020] contemporaneously as we were finalizing our results. This research applies a p-norm function to the flow dual of the mincut problem with a similar motivation. This bears a resemblance to our procedures, but does differ in that we include the localizing set S in our nonlinear penalty. Also, our solver uses the cut values instead of the flow dual on the edges, and we include details that enable Huber and Berhu functions for faster computation. In the future, we plan to compare the approaches more concretely.

There also remain ample opportunities to further optimize our procedures. As we were developing these ideas, we drew inspiration from algorithms for p-norm regression [Adil et al., 2019]. Also, there are faster converging (in theory) solvers using different optimization procedures [Fountoulakis et al., 2017] for 2-norm problems, as well as parallelization strategies [Shun et al., 2016].

Our work further contributes to the ongoing research into p-Laplacians [Amghibech, 2003; Bühler and Hein, 2009; Alamgir and Luxburg, 2011; Brindle and Zhu, 2013; Li and Milenkovic, 2018] by giving a related problem that can be solved in a strongly local fashion. We note that our ideas can be easily adapted to the growing space of hypergraph and higher-order graph analysis literature [Benson et al., 2016; Yin et al., 2017; Li and Milenkovic, 2018], where the strategy is to derive a useful hypergraph from graph data to support deeper analysis. We are also excited by the opportunities to combine with generalized Laplacian perspectives on diffusions [Ghosh et al., 2014]. Moreover, our work contributes to the general idea of using simple nonlinearities on existing successful methods. A recent report shows that a simple nonlinearity on a Laplacian pseudoinverse is competitive with complex embedding procedures [Chanpuriya and Musco, 2020].

Finally, we note that there are more general constructions possible. For instance, differential penalties for S and S̄ in the localized cut graph can be used for a variety of effects [Orecchia and Zhu, 2014; Veldt et al., 2019b]. For 1-norm objectives, optimal parameters for γ and κ can also be chosen to model desirable clusters [Veldt et al., 2019b] – similar ideas may be possible for these p-norm generalizations. We view the structured flexibility of these ideas as a key advantage because ideas are easy to compose.
The most strongly related work was posted to arXiv [Yang et al., 2020] contemporaneously as we were finalizing our results. That research applies a p-norm function to the flow dual of the mincut problem with a similar motivation. It bears a resemblance to our procedures, but differs in that we include the localizing set S in our nonlinear penalty. Also, our solver uses the cut values instead of the flow dual on the edges, and we include details that enable Huber and Berhu functions for faster computation. In the future, we plan to compare the approaches more concretely.

There also remain ample opportunities to further optimize our procedures. As we were developing these ideas, we drew inspiration from algorithms for p-norm regression [Adil et al., 2019]. There are also solvers for the 2-norm problem that converge faster in theory using different optimization procedures [Fountoulakis et al., 2017], as well as parallelization strategies [Shun et al., 2016].

Our work further contributes to ongoing research on p-Laplacians [Amghibech, 2003; Bühler and Hein, 2009; Alamgir and Luxburg, 2011; Brindle and Zhu, 2013; Li and Milenkovic, 2018] by giving a related problem that can be solved in a strongly local fashion. We note that our ideas can be easily adapted to the growing space of hypergraph and higher-order graph analysis [Benson et al., 2016; Yin et al., 2017; Li and Milenkovic, 2018], where the strategy is to derive a useful hypergraph from graph data to support deeper analysis. We are also excited by the opportunities to combine our methods with generalized Laplacian perspectives on diffusions [Ghosh et al., 2014]. Moreover, our work contributes to the general idea of applying simple nonlinearities to existing successful methods; a recent report shows that a simple nonlinearity on a Laplacian pseudoinverse is competitive with complex embedding procedures [Chanpuriya and Musco, 2020].

Finally, we note that more general constructions are possible. For instance, differential penalties for S and its complement in the localized cut graph can be used for a variety of effects [Orecchia and Zhu, 2014; Veldt et al., 2019b]. For 1-norm objectives, optimal parameters for γ and κ can also be chosen to model desirable clusters [Veldt et al., 2019b]; similar ideas may be possible for these p-norm generalizations. We view the structured flexibility of these ideas as a key advantage because the ideas are easy to compose. This composability, for example, contributed to using personalized PageRank to make graph convolutional networks faster [Klicpera et al., 2019].

In conclusion, given the strong similarities to the popular ACL, and the improved performance in practice, we are excited about the possibilities for localized p-norm-cuts in graph-based learning.

REFERENCES
[Adil et al., 2019] D. Adil, R. Kyng, R. Peng, and S. Sachdeva. Iterative refinement for ℓp-norm regression. In Proceedings of the Thirtieth Annual ACM-SIAM Symposium on Discrete Algorithms, pp. 1405–1424. 2019. Cited on page 20.
[Alamgir and Luxburg, 2011] M. Alamgir and U. V. Luxburg. Phase transition in the family of p-resistances. In Advances in Neural Information Processing Systems 24, pp. 379–387. Curran Associates, Inc., 2011. Cited on pages 3 and 20.
[Amghibech, 2003] S. Amghibech. Eigenvalues of the discrete p-laplacian for graphs. Ars Comb., 67, 2003. Cited on pages 3 and 20.
[Andersen et al., 2006] R. Andersen, F. Chung, and K. Lang. Local graph partitioning using pagerank vectors. In 47th Annual IEEE Symposium on Foundations of Computer Science (FOCS 2006), pp. 475–486. 2006. Cited on pages 3, 4, 8, 11, and 17.
[Andersen and Lang, 2008] R. Andersen and K. J. Lang. An algorithm for improving graph partitions. In Proceedings of the nineteenth annual ACM-SIAM symposium on Discrete algorithms, pp. 651–660. 2008. Cited on pages 1 and 4.
[Backstrom et al., 2006] L. Backstrom, D. Huttenlocher, J. Kleinberg, and X. Lan. Group formation in large social networks: membership, growth, and evolution. In Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 44–54. 2006. doi:10.1145/1150402.1150412. Cited on page 20.
[Benson et al., 2016] A. Benson, D. F. Gleich, and J. Leskovec. Higher-order organization of complex networks. Science, 353 (6295), pp. 163–166, 2016. doi:10.1126/science.aad9029. Cited on page 20.
[Blum and Chawla, 2001] A. Blum and S. Chawla. Learning from labeled and unlabeled data using graph mincuts. In Proceedings of the Eighteenth International Conference on Machine Learning, pp. 19–26. 2001. Cited on pages 1 and 2.
[Brindle and Zhu, 2013] N. Brindle and X. Zhu. p-voltages: Laplacian regularization for semi-supervised learning on high-dimensional data. Workshop on Mining and Learning with Graphs (MLG2013), 2013. Cited on pages 3 and 20.
[Bühler and Hein, 2009] T. Bühler and M. Hein. Spectral clustering based on the graph p-laplacian. In Proceedings of the 26th Annual International Conference on Machine Learning, pp. 81–88. 2009. Cited on pages 3 and 20.
[Chanpuriya and Musco, 2020] S. Chanpuriya and C. Musco. InfiniteWalk: Deep network embeddings as laplacian embeddings with a nonlinearity. arXiv:2006.00094, 2020. Cited on page 21.
[Chung, 2007] F. Chung. The heat kernel as the PageRank of a graph. Proceedings of the National Academy of Sciences, 104 (50), pp. 19735–19740, 2007. doi:10.1073/pnas.0708838104. Cited on page 5.
[Chung, 1992] F. R. L. Chung. Spectral Graph Theory. American Mathematical Society, 1992. Cited on pages 3 and 11.
[Fountoulakis et al., 2020] K. Fountoulakis, M. Liu, D. F. Gleich, and M. W. Mahoney. Flow-based algorithms for improving clusters: A unifying framework, software, and performance. arXiv, cs.LG, 2004.09608, 2020. Cited on pages 2 and 4.
[Fountoulakis et al., 2017] K. Fountoulakis, F. Roosta-Khorasani, J. Shun, X. Cheng, and M. W. Mahoney. Variational perspective on local graph clustering. Mathematical Programming, 2017. doi:10.1007/s10107-017-1214-8. Cited on page 20.
[Ghosh et al., 2014] R. Ghosh, S.-h. Teng, K. Lerman, and X. Yan. The interplay between dynamics and networks: centrality, communities, and cheeger inequality. In Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 1406–1415. 2014. Cited on pages 2 and 21.
[Gleich and Mahoney, 2014] D. Gleich and M. Mahoney. Anti-differentiating approximation algorithms: A case study with min-cuts, spectral, and flow. In International Conference on Machine Learning, pp. 1018–1025. 2014. Cited on pages 2, 4, 13, and 17.
[Gleich, 2015] D. F. Gleich. PageRank beyond the web. SIAM Review, 57 (3), pp. 321–363, 2015. doi:10.1137/140976649. Cited on pages 2 and 4.
[Gleich and Mahoney, 2015] D. F. Gleich and M. W. Mahoney. Using local spectral methods to robustify graph-based learning algorithms. In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 359–368. 2015. doi:10.1145/2783258.2783376. Cited on pages 2 and 4.
[Grover and Leskovec, 2016] A. Grover and J. Leskovec. Node2vec: Scalable feature learning for networks. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 855–864. 2016. doi:10.1145/2939672.2939754. Cited on page 1.
[Hallac et al., 2017] D. Hallac, C. Wong, S. Diamond, A. Sharang, R. Sosic, S. Boyd, and J. Leskovec. SnapVX: A network-based convex optimization solver. The Journal of Machine Learning Research, 18 (1), pp. 110–114, 2017. Cited on page 5.
[Hansen and Mahoney, 2012] T. J. Hansen and M. W. Mahoney. Semi-supervised eigenvectors for locally-biased learning. In Advances in Neural Information Processing Systems 25, pp. 2528–2536. 2012. Cited on page 1.
[Ibrahim and Gleich, 2019] R. Ibrahim and D. F. Gleich. Nonlinear diffusion for community detection and semi-supervised learning. In The World Wide Web Conference, pp. 739–750. 2019. doi:10.1145/3308558.3313483. Cited on pages 3, 5, and 17.
[Jeub et al., 2015] L. G. S. Jeub, P. Balachandran, M. A. Porter, P. J. Mucha, and M. W. Mahoney. Think locally, act locally: Detection of small, medium-sized, and large communities in large networks. Phys. Rev. E, 91, p. 012821, 2015. doi:10.1103/PhysRevE.91.012821. Cited on page 8.
[Joachims, 2003] T. Joachims. Transductive learning via spectral graph partitioning. In ICML, pp. 290–297. 2003. Cited on pages 1 and 2.
[Kipf and Welling, 2016] T. N. Kipf and M. Welling. Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907, 2016. Cited on page 17.
[Klicpera et al., 2019] J. Klicpera, A. Bojchevski, and S. Günnemann. Predict then propagate: Graph neural networks meet personalized pagerank. In International Conference on Learning Representations (ICLR). 2019. Cited on pages 1, 3, and 21.
[Kloster and Gleich, 2014] K. Kloster and D. F. Gleich. Heat kernel based community detection. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1386–1395. 2014. doi:10.1145/2623330.2623706. Cited on pages 5, 8, and 17.
[Kloumann and Kleinberg, 2014] I. M. Kloumann and J. M. Kleinberg. Community membership identification from small seed sets. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1366–1375. 2014. doi:10.1145/2623330.2623621. Cited on pages 18 and 20.
[Kloumann et al., 2016] I. M. Kloumann, J. Ugander, and J. Kleinberg. Block models and personalized PageRank. Proceedings of the National Academy of Sciences, 114 (1), pp. 33–38, 2016. doi:10.1073/pnas.1611275114. Cited on page 2.
[Koutra et al., 2011] D. Koutra, T.-Y. Ke, U. Kang, D. H. Chau, H.-K. K. Pao, and C. Faloutsos. Unifying guilt-by-association approaches: Theorems and fast algorithms. In ECML/PKDD, pp. 245–260. 2011. doi:10.1007/978-3-642-23783-6_16. Cited on page 2.
[Lancichinetti et al., 2008] A. Lancichinetti, S. Fortunato, and F. Radicchi. Benchmark graphs for testing community detection algorithms. Phys. Rev. E, 78, p. 046110, 2008. doi:10.1103/PhysRevE.78.046110. Cited on page 17.
[Lang and Rao, 2004] K. Lang and S. Rao. A flow-based method for improving the expansion or conductance of graph cuts. In IPCO 2004: Integer Programming and Combinatorial Optimization, pp. 325–337. 2004. Cited on page 1.
[Leskovec et al., 2009] J. Leskovec, K. J. Lang, A. Dasgupta, and M. W. Mahoney. Community structure in large networks: Natural cluster sizes and the absence of large well-defined clusters. Internet Mathematics, 6 (1), pp. 29–123, 2009. doi:10.1080/15427951.2009.10129177. Cited on pages 8 and 11.
[Li and Milenkovic, 2018] P. Li and O. Milenkovic. Submodular hypergraphs: p-laplacians, Cheeger inequalities and spectral clustering. In Proceedings of the 35th International Conference on Machine Learning, pp. 3014–3023. 2018. Cited on page 20.
[Li et al., 2019] Q. Li, X.-M. Wu, H. Liu, X. Zhang, and Z. Guan. Label efficient semi-supervised learning via graph filtering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 9582–9591. 2019. Cited on page 1.
[Lisewski and Lichtarge, 2010] A. M. Lisewski and O. Lichtarge. Untangling complex networks: Risk minimization in financial markets through accessible spin glass ground states. Physica A: Statistical Mechanics and its Applications, 389 (16), pp. 3250–3253, 2010. doi:10.1016/j.physa.2010.04.005. Cited on page 2.
[Mahoney et al., 2012] M. W. Mahoney, L. Orecchia, and N. K. Vishnoi. A local spectral method for graphs: With applications to improving graph partitions and exploring data graphs locally. Journal of Machine Learning Research, 13, pp. 2339–2365, 2012. Cited on page 11.
[Mihail, 1989] M. Mihail. Conductance and convergence of markov chains: a combinatorial treatment of expanders. In 30th Annual Symposium on Foundations of Computer Science, pp. 526–531. 1989. doi:10.1109/SFCS.1989.63529. Cited on page 11.
[Mislove et al., 2007] A. Mislove, M. Marcon, K. P. Gummadi, P. Druschel, and B. Bhattacharjee. Measurement and analysis of online social networks. In Proceedings of the 7th ACM SIGCOMM Conference on Internet Measurement, pp. 29–42. 2007. doi:10.1145/1298306.1298311. Cited on page 20.
[Orecchia and Mahoney, 2011] L. Orecchia and M. W. Mahoney. Implementing regularization implicitly via approximate eigenvector computation. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pp. 121–128. 2011. Cited on page 2.
[Orecchia and Zhu, 2014] L. Orecchia and Z. A. Zhu. Flow-based algorithms for local graph clustering. In Proceedings of the twenty-fifth annual ACM-SIAM symposium on Discrete algorithms, pp. 1267–1286. 2014. Cited on pages 11 and 21.
[Owen, 2007] A. B. Owen. A robust hybrid of lasso and ridge regression. Contemporary Mathematics, 443 (7), pp. 59–72, 2007. Cited on page 6.
[Pan et al., 2004] J.-Y. Pan, H.-J. Yang, C. Faloutsos, and P. Duygulu. Automatic multimedia cross-modal correlation discovery. In KDD '04: Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 653–658. 2004. doi:10.1145/1014052.1014135. Cited on page 2.
[Peel, 2017] L. Peel. Graph-based semi-supervised learning for relational networks. In Proceedings of the 2017 SIAM International Conference on Data Mining, pp. 435–443. 2017. doi:10.1137/1.9781611974973.49. Cited on page 2.
[Perozzi et al., 2014] B. Perozzi, R. Al-Rfou, and S. Skiena. DeepWalk: Online learning of social representations. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 701–710. 2014. doi:10.1145/2623330.2623732. Cited on page 1.
[Shi and Malik, 2000] J. Shi and J. Malik. Normalized cuts and image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22 (8), pp. 888–905, 2000. doi:10.1109/34.868688. Cited on page 2.
[Shun et al., 2016] J. Shun, F. Roosta-Khorasani, K. Fountoulakis, and M. W. Mahoney. Parallel local graph clustering. Proceedings of the VLDB Endowment, 9 (12), pp. 1041–1052, 2016. Cited on page 20.
[Traud et al., 2012] A. L. Traud, P. J. Mucha, and M. A. Porter. Social structure of facebook networks. Physica A: Statistical Mechanics and its Applications, 391 (16), pp. 4165–4180, 2012. doi:10.1016/j.physa.2011.12.021. Cited on pages 18 and 19.
[Veldt et al., 2016] L. N. Veldt, D. F. Gleich, and M. W. Mahoney. A simple and strongly-local flow-based method for cut improvement. In International Conference on Machine Learning, pp. 1938–1947. 2016. Cited on pages 1, 2, and 8.
[Veldt et al., 2019a] N. Veldt, C. Klymko, and D. F. Gleich. Flow-based local graph clustering with better seed set inclusion. In Proceedings of the SIAM International Conference on Data Mining, pp. 378–386. 2019a. doi:10.1137/1.9781611975673.43. Cited on pages 2 and 17.
[Veldt et al., 2019b] N. Veldt, A. Wirth, and D. F. Gleich. Learning resolution parameters for graph clustering. In The World Wide Web Conference, pp. 1909–1919. 2019b. doi:10.1145/3308558.3313471. Cited on pages 18 and 21.
[Wang et al., 2017] D. Wang, K. Fountoulakis, M. Henzinger, M. W. Mahoney, and S. Rao. Capacity releasing diffusion for speed and locality. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, pp. 3598–3607. 2017. Cited on pages 3, 5, and 17.
[Yadati et al., 2019] N. Yadati, M. R. Nimishakavi, P. Yadav, V. Nitin, A. Louis, and P. Talukdar. HyperGCN: A new method for training graph convolutional networks on hypergraphs. In NeurIPS. 2019. Cited on page 1.
[Yang and Leskovec, 2012] J. Yang and J. Leskovec. Defining and evaluating network communities based on ground-truth. In 2012 IEEE 12th International Conference on Data Mining (ICDM), pp. 745–754. 2012. doi:10.1109/ICDM.2012.138. Cited on page 20.
[Yang et al., 2020] S. Yang, D. Wang, and K. Fountoulakis. p-norm flow diffusion for local graph clustering. arXiv, cs.LG, 2005.09810, 2020. Cited on pages 3 and 20.
[Yin et al., 2017] H. Yin, A. R. Benson, J. Leskovec, and D. F. Gleich. Local higher-order graph clustering. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 555–564. 2017. doi:10.1145/3097983.3098069. Cited on pages 8 and 20.
[Zhou et al., 2003] D. Zhou, O. Bousquet, T. N. Lal, J. Weston, and B. Schölkopf. Learning with local and global consistency. In NIPS. 2003. Cited on pages 1, 2, and 4.
[Zhu et al., 2003] X. Zhu, Z. Ghahramani, and J. Lafferty. Semi-supervised learning using gaussian fields and harmonic functions. In ICML, pp. 912–919. 2003. Cited on pages 1 and 2.
[Zhu et al., 2013] Z. A. Zhu, S. Lattanzi, and V. S. Mirrokni. A local algorithm for finding well-connected clusters. In ICML (3), pp. 396–404. 2013. Cited on pages 3, 11, 12, 13, 14, 15, and 17.