Local Partition in Rich Graphs
Scott Freitas
Arizona State University, Tempe
[email protected]
Hanghang Tong
Arizona State University, Tempe
[email protected]
Nan Cao
Tongji University, Shanghai
[email protected]
Yinglong Xia
Huawei, Santa Clara
[email protected]
ABSTRACT
Local graph partitioning is a key graph mining tool that allows researchers to identify small groups of interrelated nodes (e.g., people) and their connective edges (e.g., interactions). Because local graph partitioning is primarily focused on the network structure of the graph (vertices and edges), it often fails to consider the additional information contained in the attributes. In this paper we propose: (i) a scalable algorithm to improve local graph partitioning by taking into account both the network structure of the graph and the attribute data, and (ii) an application of the proposed local graph partitioning algorithm (AttriPart) to predict the evolution of local communities (LocalForecasting). Experimental results show that our proposed AttriPart algorithm finds up to × denser local partitions, while running approximately × faster, than traditional local partitioning techniques (PageRank-Nibble [1]). In addition, our LocalForecasting algorithm shows a significant improvement in the number of nodes and edges correctly predicted over baseline methods.
ACM Reference Format:
Scott Freitas, Hanghang Tong, Nan Cao, and Yinglong Xia. 2018. Local Partition in Rich Graphs. In Proceedings of ACM KDD (KDD'18). ACM, New York, NY, USA, Article 4, 9 pages. https://doi.org/10.475/123_4
Motivation.
With the rise of the big data era, an exponential amount of network data is being generated at an unprecedented rate across many disciplines. One of the critical challenges before us is the translation of this large-scale network data into meaningful information. A key task in this translation is the identification of local communities with respect to a given seed node (we interchangeably refer to a local community as a local partition). In practical terms, the information discovered in these local communities can be utilized in a wide range of high-impact areas, from the micro (protein interaction networks [13][26]) to the macro (social [21][4] and transportation networks).
Problem Overview.
How can we quickly determine the local graph partition around a given seed node? This problem is traditionally solved using an algorithm like Nibble [19], which identifies a small cluster in time proportional to the size of the cluster, or PageRank-Nibble [1], which improves the running time and approximation ratio of Nibble with a smaller polylog time complexity. While both of these methods provide powerful techniques for the analysis of network structure, they fail to take into account the attribute information contained in many real-world graphs. Other techniques to find improved rank vectors, such as attributed PageRank [10], lack a generalized conductance metric for measuring cluster "goodness" that incorporates attribute information. In this paper, we propose a novel method that combines the network structure and attribute information contained in graphs to better identify local partitions using a generalized conductance metric.
Applications.
Local graph partitioning plays a central role in many application scenarios. For example, a common problem in recommender systems for social media networks is determining how a local community will evolve over time. The proposed LocalForecasting algorithm can be used to determine the evolution of local communities, which can then assist in user recommendations. Another example utilizing social media networks is ego-centric network identification, where the goal is to identify the locally important neighbors relative to a given person. To this end, we can use our AttriPart algorithm to identify better ego-centric networks using the graph's network structure and attribute information. Finally, newly arrived nodes (i.e., cold-start nodes) often have few connections to their surrounding neighbors, making it difficult to ascertain their membership in various communities. The proposed LocalForecasting algorithm mitigates this problem by introducing additional attribute edges (link prediction), which can assist in determining which local partitions the cold-start nodes will belong to in the future.
Contributions.
Our primary contributions are three-fold:
• The formulation of a graph model and generalized conductance metric that incorporates both attribute and network structure edges.
• The design and analysis of the local clustering algorithm AttriPart and the local community prediction algorithm LocalForecasting. Both algorithms utilize the proposed graph model, modified conductance metric and novel subgraph identification technique.
• The evaluation of the proposed algorithms on three real-world datasets, demonstrating the ability to rapidly identify denser local partitions compared to traditional techniques.
Deployment.
The local partitioning algorithm AttriPart is currently deployed on the PathFinder web platform, with performance nearly identical to the results presented in Section 4.
Figure 1: Close-up of the AttriPart algorithm on the PathFinder web platform.
This paper is organized as follows: Section 2 defines the problem of local partitioning in rich graphs; Section 3 introduces our proposed model and algorithms; Section 4 presents our experimental results on multiple real-world datasets; Section 5 reviews the related literature; and Section 6 concludes the paper.
In this paper we consider three graphs: (1) an undirected, unweighted structure graph G = (V, E), (2) an undirected, weighted attribute graph A = (V, E), and (3) a combined graph B = (V, E), consisting of both G and A, that is undirected and weighted. In each graph, V is the set of vertices, E is the set of edges, n is the number of vertices, and m is the number of edges (i.e., G, A and B contain the same number of vertices and edges by default). To denote degree centrality, we say δ(v) is the degree of vertex v. We use bold uppercase letters to denote matrices (e.g., G) and bold lowercase letters to denote vectors (e.g., v).
For ease of description, we define terms that are used interchangeably throughout the literature and this paper: (a) we refer to a network as a graph, (b) node is synonymous with vertex, (c) a local partition is referred to as a local cluster, (d) seed node is equivalent to query and start vertex, (e) the topological edges of the graph refer to the network structure of the graph, and (f) a rich graph is a graph with attributes on the nodes and/or edges.
Having outlined the notation, we define the problem of local partitioning in rich graphs as follows:
Problem 1. Local Partitioning in Rich Graphs
Given: (1) an undirected, unweighted graph G = (V, E), (2) a seed node q ∈ V, and (3) attribute information for each node v ∈ V in the form of a k-dimensional attribute vector x_i, with an attribute matrix X = [x_1, x_2, ..., x_n] ∈ R^{k×n} representing the attribute vectors of all nodes.
Output: a subset of vertices S ⊂ V such that S best represents the local partition around seed node q in graph B.
Table 1:
Symbols and Definitions
Symbol : Definition
G, A, B : network, attribute & combined graphs
n, m : number of nodes & edges in graphs G, A, B
m_e : number of edges in B after LocalForecasting
p, m_p : number of nodes & edges in T
s, q, φ_o : preference vector, seed node & target conductance
W : lazy random walk transition matrix
S : set of vertices representing local partition
ε, ε_t : rank truncation and iteration thresholds
t_last, n_s : rank vector iterations; number of vertices to sweep
α_n, α_r : AttriPart & LocalProximity teleport values
t_s, n_w : subgraph relevance threshold & number of walks
T; D, L : subgraph of B; walk count dictionary & list
μ(L), σ(L) : mean and standard deviation of L
t_e : edge addition threshold

This section first describes the preliminaries for our proposed algorithms, including the graph model and modified conductance metric. Next, we introduce each proposed algorithm: (1) LocalProximity, (2) AttriPart and (3) LocalForecasting. Finally, we provide an analysis of the proposed algorithms in terms of effectiveness and efficiency.
Graph Model.
Topological network G represents the network structure of the graph and is formally defined in Eq. (1). Attribute network A represents the attribute structure of the graph and is computed based on the similarity of every edge (u, v) ∈ E in G. To determine the similarity between two nodes, we use the Jaccard similarity J(u, v). A is formally defined in Eq. (2), where 0.05 is the default attribute similarity for an edge (u, v) ∈ E in G with J(x_u, x_v) = 0. In addition, t_e is the similarity threshold for the addition of edges not in G, where 0 < t_e ≤ 1. Combined network B represents the combined graph of G and A and is formally defined in Eq. (3). Figure 2 presents an illustrative example.

G(u, v) = { 1, if (u, v) ∈ E and u ≠ v; 0, otherwise }   (1)

A(u, v) = { J(u, v), if (u, v) ∈ E, u ≠ v and J(u, v) > 0; 0.05, if (u, v) ∈ E, u ≠ v and J(u, v) = 0; J(u, v), if (u, v) ∉ E, u ≠ v and J(u, v) > t_e; 0, otherwise }   (2)

B(u, v) = { 1 + A(u, v), if (u, v) ∈ E and (u, v) ∈ A; A(u, v), if (u, v) ∉ E and (u, v) ∈ A; 0, otherwise }   (3)

Figure 2: Example of the three graph models: (a) graph G is the network structure, with its nodes and their corresponding attribute sets given as input. (b) Graph A is the attribute network with the same set of edges as G, with each edge (u, v) assigned a positive similarity weight s_uv. (c) Graph B is a linear combination of each respective edge (u, v) from G and A.

Conductance. Conductance is a standard metric for determining how tight-knit a set of vertices is in a graph [12]. The traditional conductance metric is defined in Eq. (4), where S is the set of vertices representing the local partition. The lower the conductance value φ(S), where 0 ≤ φ(S) ≤ 1, the more likely S represents a good partition of the graph.

φ(S) = cut(S) / min(vol(S), vol(S̄))   (4)

where the cut is cut(S) = |{(u, v) ∈ E | u ∈ S, v ∉ S}| and the volume is vol(S) = Σ_{v∈S} δ(v). This definition of conductance will serve as the benchmark against which we compare the results of our parallel conductance metric.
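As a concrete reading of Eqs. (1)-(4), the sketch below builds the three edge-weight maps from a structure edge set and per-node attribute sets, then evaluates the traditional conductance. All function and variable names are illustrative, and the default for t_e is arbitrary.

```python
from itertools import combinations

def jaccard(a, b):
    """Jaccard similarity J(u, v) between two attribute sets."""
    return len(a & b) / len(a | b) if (a | b) else 0.0

def build_models(edges, attrs, t_e=0.5):
    """Build G, A, B of Eqs. (1)-(3) as dicts mapping an edge (u, v),
    u < v, to its weight. `edges` is the undirected structure edge set,
    `attrs` maps each node to its attribute set."""
    G, A, B = {}, {}, {}
    for u, v in combinations(sorted(attrs), 2):
        in_E = (u, v) in edges or (v, u) in edges
        sim = jaccard(attrs[u], attrs[v])
        if in_E:
            G[(u, v)] = 1.0                       # Eq. (1)
            A[(u, v)] = sim if sim > 0 else 0.05  # Eq. (2), first two cases
        elif sim > t_e:
            A[(u, v)] = sim                       # Eq. (2), attribute-only edge
        if (u, v) in A:
            B[(u, v)] = (1.0 if in_E else 0.0) + A[(u, v)]  # Eq. (3)
    return G, A, B

def conductance(edges, S, nodes):
    """Traditional conductance of Eq. (4) on the unweighted structure graph."""
    deg = {v: sum(1 for e in edges if v in e) for v in nodes}
    cut = sum(1 for u, v in edges if (u in S) != (v in S))
    vol_S = sum(deg[v] for v in S)
    vol_rest = sum(deg[v] for v in nodes - S)
    return cut / min(vol_S, vol_rest)
```

On a toy graph, an edge with zero attribute overlap receives the 0.05 default in A, while a highly similar non-adjacent pair gains an attribute-only edge in B.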
Parallel Conductance. We propose a parallel conductance metric which takes into account both the attribute and topological edges in the graph. Instead of simply adding the cut of each vertex v ∈ S, we want to determine whether v is more similar to the vertices in S or S̄. The new cut and conductance metrics are formally defined in Eq. (5) and Eq. (6), respectively. The key idea behind the parallel conductance metric is to determine whether each vertex in S is more similar to S or S̄, using the additional information provided by the attribute links.

parallel_cut(S) = Σ_{i∈S} [ Σ_{j∉S} B(i, j) / Σ_{j∈S} B(i, j) ] = Σ_{i∈S} [ Σ_{j∉S} (A(i, j) + G(i, j)) / Σ_{j∈S} (A(i, j) + G(i, j)) ]   (5)

By definition, B can be split into its representative components, G and A. We also note a few key properties of the parallel cut metric below:
(1) parallel_cut = |S| when the vertices in S have connections of equal weighting between S and S̄.
(2) parallel_cut < |S| when the vertices in S have only a few strong connections to S̄.
(3) parallel_cut > |S| when the vertices in S are more strongly connected to S̄ than S.

Eq. (6) uses the cut as defined in Eq. (5) and the volume as defined above, with the modification that δ(v) is the sum of its components in G and A.

φ(S) = parallel_cut(S) / vol(S)   (6)

We note that the parallel conductance metric has a different scale compared to the traditional conductance metric. For example, a conductance of 0.3 in the traditional definition does not have the same meaning as a conductance of 0.3 in the parallel definition. We also bound the volume of S to a fraction of vol(B), which allows us to reduce the min(vol(S), vol(S̄)) computation to vol(S).

Figure 3: A toy example calculating the parallel cut and conductance with a local partition S of four vertices. Per-vertex parallel cut values: 1.05/2.1 = 0.5, 0, 1.05/2.2 = 0.477, 0; total parallel cut = 0.5 + 0.477 = 0.977. Volume(S) = 12. Parallel conductance(S) = 0.977/12 = 0.0814.

We propose three algorithms in this subsection: (1) LocalProximity, (2) AttriPart and (3) LocalForecasting. First, we introduce the LocalProximity algorithm as a key building block for speeding up the AttriPart and LocalForecasting algorithms, by finding a subgraph containing only the nodes and edges relevant to the given seed node. Based on LocalProximity, we further propose the AttriPart algorithm to find a local partition around a seed node by minimizing the parallel conductance metric. Finally, we propose the LocalForecasting algorithm, which builds upon AttriPart, to predict a local community's evolution.
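Eqs. (5)-(6) translate directly into code. The edge-weight dict format for B is an assumption of this sketch, and the guard for a vertex with zero inside-weight is ours.

```python
def parallel_conductance(B, S):
    """Parallel cut (Eq. 5) and parallel conductance (Eq. 6) of a vertex
    set S on the combined graph B (dict mapping edge (u, v) -> weight)."""
    def w(i, j):
        return B.get((i, j), B.get((j, i), 0.0))
    nodes = {u for e in B for u in e}
    cut = 0.0
    for i in S:
        inside = sum(w(i, j) for j in S if j != i)       # weight of i's links kept in S
        outside = sum(w(i, j) for j in nodes - S)        # weight of i's links leaving S
        cut += outside / inside if inside else outside   # per-vertex ratio (guarded)
    vol = sum(w(i, j) for i in S for j in nodes if j != i)  # sum of delta(v) in B, v in S
    return cut, (cut / vol if vol else float("inf"))
```

A vertex whose inside and outside weights are equal contributes exactly 1 to the cut, which is what makes |S| the natural reference point in the properties above.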
LocalProximity.
There are two primary purposes for the LocalProximity algorithm: (i) the requisite computations for the LocalForecasting algorithm require a pairwise similarity calculation over all nodes, which is intractable for large graphs due to the quadratic run time. To make this computation feasible, we use the LocalProximity algorithm to determine a small subgraph of relevant vertices around a given seed node q. (ii) We experimentally found that the PageRank vector utilized in the AttriPart algorithm is significantly faster to compute after running the proposed LocalProximity algorithm.
Algorithm Details.
The goal is to find a subgraph T around seed node q, such that T contains only nodes and edges likely to be reached in n_w trials of random walk with restart. We base the importance of a vertex v ∈ V on the theory that random walks can measure the importance of nodes and edges in a graph [5][17]. This is done by defining node relevance proportional to the number of times a random walk with restart visits a vertex in n_w trials (nodes walked on more than once in a single walk still count as one). Instead of using a simple threshold parameter to determine node/edge relevance as in [5], we utilize the mean and standard deviation of the walk distribution so that the results remain insensitive to n_w, given that n_w is sufficiently large. In conjunction with the mean and standard deviation, we introduce t_s as a relevance threshold parameter that determines the size of the resulting subgraph T. See section 3.3 for more details.
Algorithm Description.
The LocalProximity algorithm takes a graph B, a seed node q ∈ B, a teleport value α_r, the number of walks to simulate n_w, and a relevance threshold t_s, and returns a subgraph T containing the relevant vertices in relation to q. This algorithm can be viewed in three major steps:
(1) Compute the walk distribution around seed node q in graph B using random walk with restart (line 2). We omit the RandomWalk algorithm due to space constraints; the technique is described above.
(2) Determine the number of vertices to include in the subgraph T, based on the relevance threshold parameter t_s, the mean of the walk distribution list μ(L), and the standard deviation of the walk distribution list σ(L) (lines 4-6).
(3) Create a subgraph based on the included vertices (line 8).
Algorithm 1:
LocalProximity
Input: Graph B, seed node q, teleport value α_r, number of walks to simulate n_w, relevance threshold t_s
Result: Subgraph T
1: subgraph_nodes = []
2: D = RandomWalk(q, α_r, n_w, B)
3: L = D.values
4: for vertex u in B do
5:   if D[u] > μ(L) + σ(L)/t_s then
6:     subgraph_nodes.append(u)
7: end
8: T = B.subgraph(subgraph_nodes)
9: return T
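A runnable sketch of Algorithm 1, under two stated assumptions: each walk is capped at a fixed length (the paper only restarts with probability α_r), and the population standard deviation is used for σ(L). The adjacency format and parameter defaults are illustrative.

```python
import random
from collections import defaultdict
from statistics import mean, pstdev

def local_proximity(adj, q, alpha_r=0.15, n_w=10_000, t_s=2.0, walk_len=20, seed=0):
    """Sketch of Algorithm 1: simulate n_w random walks with restart from q,
    count each node at most once per walk, and keep nodes whose count
    exceeds mu(L) + sigma(L)/t_s. `adj` maps node -> list of (neighbor, weight)."""
    rng = random.Random(seed)
    D = defaultdict(int)                       # walk-count dictionary
    for _ in range(n_w):
        v, visited = q, set()
        for _ in range(walk_len):              # fixed cap: our assumption
            visited.add(v)
            nbrs = adj.get(v, [])
            if not nbrs or rng.random() < alpha_r:
                break                          # restart ends this walk
            x = rng.random() * sum(w for _, w in nbrs)
            for u, w in nbrs:                  # weighted neighbor choice
                x -= w
                if x <= 0:
                    v = u
                    break
        for u in visited:                      # count each node once per walk
            D[u] += 1
    L = list(D.values())
    thresh = mean(L) + pstdev(L) / t_s         # relevance rule of Eq. (10)
    return {u for u, c in D.items() if c > thresh}
```

The returned vertex set induces the subgraph T; whether to use population or sample standard deviation is not specified in the text, so pstdev here is a design choice.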
AttriPart. Armed with the LocalProximity algorithm, we further propose AttriPart, which takes into account the network structure and attribute information contained in the graph to find denser local partitions than can be found using the network structure alone. The foundation of this algorithm is based on [19][1][30], with subtle modifications on lines 1, 4 and 9. These modifications incorporate the addition of a combined graph model, approximate PageRank computation using the LocalProximity algorithm, and the parallel cut and conductance metric. In addition, AttriPart does not depend on reaching a target conductance in order to return a local partition; instead, it returns the best local partition found within sweeping n_s vertices of the sorted PageRank vector.
Algorithm Description.
Given a graph B, seed node q ∈ V, target conductance φ_o, rank truncation threshold ε, the number of iterations to compute the rank vector t_last, teleport value α_n, rank iteration threshold ε_t, and number of nodes to sweep n_s, AttriPart will find a local partition S around q within n_s iterations of sweeping. This algorithm can be viewed in five steps:
(1) Set values for ε and t_last as seen in Eq. (7) and Eq. (9), respectively. We experimentally set b = 1 + log(m) and ε_t to 0.01. For additional detail on parameters ε, t_last and b, see [19]. For all other parameter values, see Section 4.
(2) Run LocalProximity around seed node q in order to reduce the run time of the PageRank computations (line 1).
(3) Compute the PageRank vector using a lazy random walk transition with personalized restart, with preference vector s placing all probability on seed node q. At each iteration, truncate a vertex's rank if its degree-normalized PageRank score is less than ε (lines 2-7).
(4) Divide each vertex in the PageRank vector by its corresponding degree centrality and order the rank vector in descending order (line 8).
(5) Sweep over the PageRank vector for the first n_s vertices, returning the best local partition S found (lines 9-10). The sweep works by taking the re-organized rank vector and creating a set of vertices S, iterating through the rank vector one vertex at a time, each time adding the next vertex to S and computing φ(S).

ε = 1/(c_4 (l + 2) t_last 2^b)   (7)
l = ⌈log₂(m/2)⌉   (8)
t_last = (l + 1)⌈(2/φ²) ln(c_1 (l + 2) √(m/2))⌉   (9)

(The constants c_1 and c_4 follow [19].)
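A minimal sketch of the parameter choices in Eqs. (7)-(9), assuming the Nibble-style constants of [19]; the values c1 = 200 and c4 = 140 below are placeholders (the exact constants are fixed in [19]), and the function name is illustrative.

```python
import math

def nibble_params(m, phi, b, c1=200, c4=140):
    """Compute l, t_last and the truncation threshold eps per Eqs. (7)-(9).
    c1 and c4 are placeholder constants; see [19] for the exact values."""
    l = math.ceil(math.log2(m / 2))                                          # Eq. (8)
    t_last = (l + 1) * math.ceil((2 / phi ** 2) *
                                 math.log(c1 * (l + 2) * math.sqrt(m / 2)))  # Eq. (9)
    eps = 1 / (c4 * (l + 2) * t_last * 2 ** b)                               # Eq. (7)
    return l, t_last, eps
```

Note how t_last grows only polylogarithmically in m but quadratically in 1/φ, which is what keeps the sweep cheap on large graphs.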
LocalForecasting. As a natural application of the AttriPart algorithm, we introduce a method to predict how local communities will evolve over time. This method is based on the AttriPart algorithm with two significant modifications: (i) required use of the LocalProximity algorithm to create a subgraph around the seed node, and (ii) the use of the ExpandedNeighborhood algorithm to predict links between nodes in the subgraph. The idea behind using the ExpandedNeighborhood algorithm is that nodes are often missing many connections they will make in the future, which in turn affects the grouping of nodes into communities. To aid in predicting future edge connections, we use Jaccard similarity [14] to predict the likelihood of each vertex connecting to the others, with edges added if the similarity between two nodes is greater than threshold t_e.

Algorithm 2: AttriPart
Input: Graph B, seed node q, target conductance φ_o, truncation threshold ε, iterations t_last, teleport value α_n, iteration threshold ε_t, vertices to sweep n_s
Result: Local partition S
1: T = Local_Proximity(B, q, α_r, n_w, t_s)
2: D_{i,i} = δ(v_i)
3: W = (I + D⁻¹T)/2
4: for t = 1 to t_last, while sum(q_t) - sum(q_{t-1}) < ε_t do
5:   q_t = (1 - α)q_{t-1}W + αs
6:   r_t(i) = q_t(i) if q_t(i)/d(i) > ε, else 0
7: end
8: Order i from large to small based on r_t(i)/d(i)
9: Sweep Parallel_Conductance φ(S_{i=1..j}) while i < n_s
10: If there is j: φ(S_j) < φ_o, return S
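A condensed, runnable sketch of Algorithm 2 on a small dense adjacency matrix. It omits the LocalProximity call (the whole matrix is used), and it breaks the sweep once vol(S) exceeds vol(B)/2, our assumption for keeping min(vol(S), vol(S̄)) = vol(S); all parameter defaults are illustrative.

```python
import numpy as np

def attripart_sketch(B, q, t_last=30, alpha=0.2, eps=1e-4, eps_t=1e-3, n_s=200):
    """Condensed sketch of Algorithm 2 on a symmetric numpy adjacency
    matrix B for the combined graph. Returns the best prefix of the
    degree-normalized rank sweep under the parallel conductance."""
    n = B.shape[0]
    d = B.sum(axis=1)
    W = 0.5 * (np.eye(n) + B / d[:, None])     # lazy random walk (line 3)
    s = np.zeros(n); s[q] = 1.0                # preference vector on q
    r = s.copy()
    for _ in range(t_last):                    # lines 4-7
        nxt = (1 - alpha) * (r @ W) + alpha * s
        nxt[nxt / d < eps] = 0.0               # rank truncation
        converged = abs(nxt.sum() - r.sum()) < eps_t
        r = nxt
        if converged:
            break
    order = np.argsort(-r / d)                 # line 8
    S, best_S, best_phi = set(), None, float("inf")
    for v in order[:n_s]:                      # sweep (lines 9-10)
        if r[v] == 0:
            break
        S.add(int(v))
        if d[list(S)].sum() > d.sum() / 2:     # volume bound: our assumption
            break
        cut = sum(B[i, sorted(set(range(n)) - S)].sum()
                  / max(B[i, sorted(S - {i})].sum(), 1e-12)
                  for i in S)                  # Eq. (5)
        phi = cut / d[list(S)].sum()           # Eq. (6)
        if phi < best_phi:
            best_phi, best_S = phi, set(S)
    return best_S, best_phi
```

On two triangles joined by a single bridge edge, seeding in one triangle recovers that triangle as the best partition.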
Algorithm Description. Given a graph B, a seed node q ∈ V, a target conductance φ_o, a rank truncation threshold ε, the number of iterations to compute the rank vector t_last, a teleport value α_n, a rank iteration threshold ε_t, a similarity threshold t_e, and the number of nodes to sweep n_s, this algorithm will find a predicted local partition around q within n_s iterations of sweeping. As the LocalForecasting algorithm is similar to AttriPart, we highlight the three primary steps:
(1) Determine the subgraph around the given seed node using the LocalProximity algorithm (line 1).
(2) Determine the pairwise similarity between all nodes in the subgraph using Jaccard similarity, adding edges that are above the given similarity threshold (line 2).
(3) Run the AttriPart algorithm to find the predicted local partition around the seed node (line 3).
Algorithm 3:
LocalForecasting
Input: Graph B, seed node q, target conductance φ_o, truncation threshold ε, iterations t_last, teleport value α_n, iteration threshold ε_t, similarity threshold t_e, vertices to sweep n_s
Result: Predicted local partition S
1: T = Local_Proximity(B, q)
2: T = Expanded_Neighborhood(T, t_e)
3: S = AttriPart(T, q, φ_o, ε, t_last, α_n, ε_t, n_s)
4: return S

Algorithm 4: ExpandedNeighborhood
Input: Subgraph T, edge addition threshold t_e
Result: Subgraph T with predicted edges
1: for u in T do
2:   for v in T and v not u do
3:     u_attr = T[u]; v_attr = T[v]
4:     similarity_score = JaccardSimilarity(u_attr, v_attr)
5:     if similarity_score > t_e and not T[u][v] then
6:       T[u][v] = similarity_score
7:   end
8: end
9: return T

Effectiveness. LocalProximity (Algorithm 1). The objective is to ensure that all relevant nodes in proximity to seed node q are included. We use the fact that many real-world graphs follow a scale-free distribution [3][6], with many nodes containing only a few links while a handful encompass the majority. In Figure 4, we found that after running n_w trials of random walk with restart, a scale-free like distribution formed, with the large majority of the nodes containing a small number of 'hits' while a few nodes constituted the bulk.

Figure 4: Random walk with restart: distribution of node walk counts. n_w = 10,000, α_r = 0.15; dataset: Wikipedia, start vertex: 'ewok', y-axis: right; dataset: Aminer, start vertex: 364298, y-axis: left. We omit nodes walked zero times in the graph; however, they are used in calculating μ(L) and σ(L).

As the number of random walks n_w is increased, the scale-free like distribution is maintained, since each node is proportionally walked with the same distribution. We therefore need only some minimum value for n_w, which we set to 10,000. We use this skewed scale-free like distribution in combination with Eq. (10) below to ensure the extraction of relevant nodes in relation to a query vertex. Mathematically, we define node relevance based on Eq. (10), where D is a dictionary containing the walk count of each vertex and D(v) represents the number of times vertex v is walked in n_w trials of the random walk with restart. L is a list of each node's walk count in the graph, μ(L) is the average number of times the nodes in the graph are walked, and σ(L) is the standard deviation of the number of times the nodes in the graph are walked. In section 4 we discuss values of t_s that have been shown to be empirically effective.
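Algorithm 4 above can be transcribed almost line for line. The dict-of-edge-weights representation of T and the separate `attrs` map are assumptions of this sketch (the pseudocode stores attributes on T itself).

```python
def jaccard(a, b):
    """Jaccard similarity between two attribute sets."""
    return len(a & b) / len(a | b) if (a | b) else 0.0

def expanded_neighborhood(T, attrs, t_e=0.5):
    """Sketch of Algorithm 4: for every node pair in subgraph T, add a
    predicted edge weighted by Jaccard similarity when the similarity
    exceeds t_e and no edge exists yet."""
    out = dict(T)
    nodes = sorted(attrs)
    for i, u in enumerate(nodes):          # nested loops: the O(p^2) step
        for v in nodes[i + 1:]:
            if (u, v) in out or (v, u) in out:
                continue                   # keep existing edges untouched
            s = jaccard(attrs[u], attrs[v])
            if s > t_e:
                out[(u, v)] = s            # predicted future edge
    return out
```

Because every pair is inspected, this is exactly the quadratic step that LocalProximity keeps tractable by shrinking the node set to p ≪ n.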
D(v) > μ(L) + σ(L)/t_s   (10)

After determining the relevant nodes, we create a subgraph T from a portion of the long-tail curve, as defined by threshold parameter t_s in conjunction with μ(L) and σ(L). We say that subgraph T contains p ≪ n nodes, with p increasing nearly independently of the graph size (depending on threshold t_s). As seen in Figure 4, the number of nodes with r walks converges independent of graph size.

Efficiency. All algorithms use the same data structure for storing the graph information. If a compressed sparse row (CSR) format is used, the space complexity is O(m + n + 1). Alternatively, we note that with minor modification to the algorithms above we can use an adjacency list format with O(n + m) space.

Lemma 3.1 (Time Complexity). LocalProximity has a time complexity of O(n + m_p + n_w), while AttriPart has a time complexity of O(p² + p·m_p + n + n_w) and LocalForecasting a time complexity of O(p² + p·m_e + n + n_w).

Proof. LocalProximity: There are three major components to this algorithm: (1) n_w random walks with walk length l, for a time complexity of O(n_w) (line 2). (2) Linear iteration through the number of nodes, taking O(n) (lines 4-7). (3) Subgraph T creation based on the number of included vertices p with node set V_t, requiring iteration through every edge of each node v ∈ V_t, for m_p total edges. Iterating through every edge is linear in the number of edges, for a time complexity of O(m_p) (line 8). This leads to a total time complexity of O(n + m_p + n_w).

AttriPart: There are six major steps to this algorithm: (1) Calling LocalProximity, which returns a subgraph T containing p nodes and m_p edges, for a time complexity of O(n + m_p + n_w) (line 1). (2) Creating a diagonal degree matrix by iterating through each node in T, with time complexity O(p) (line 2). (3) Creating the lazy random walk transition matrix W, which requires O(m_p) from multiplying the corresponding matrix entries (line 3). (4) In lines 4-7 we iterate for t_last iterations, with each iteration (i) updating the rank vector by multiplying the corresponding edges in the transition matrix W with the rank vector q, for a time complexity of O(m_p), and (ii) truncating every vertex with rank q_t(i)/d(i) ≤ ε, for a time complexity linear in the number of nodes in the rank vector, O(p). (5) Sorting the rank vector, which is upper bounded by O(p log p) (line 8). (6) Computing the parallel conductance, which takes O(p² + p·m_p) time (lines 9-10). Combining each step leads to a total time complexity of O(p² + p·m_p + n + n_w).

LocalForecasting: This algorithm has three major steps: (1) Run the LocalProximity algorithm, which has a time complexity of O(n + m_p + n_w). (2) Perform the ExpandedNeighborhood algorithm, which densifies T by adding predicted edges, for a total of m_e edges in T. This algorithm has a time complexity of O(p²) due to the nested for loops. (3) Run the AttriPart algorithm, which has a time complexity of O(p² + p·m_e + n + n_w), with the modification of m_p to m_e for the additional edges. This leads to an overall time complexity of O(p² + p·m_e + n + n_w). □

While AttriPart and LocalForecasting both scale quadratically with respect to p, we note that in practice these algorithms are very fast, since p ≪ n and p scales nearly independent of graph size, as shown in section 3.3.

In this section, we demonstrate the effectiveness and efficiency of the proposed algorithms on three real-world network datasets of varying scale.
Datasets.
We evaluate the performance of the proposed algorithms on three datasets: (1) the Aminer co-authorship network [27], (2) a Musician network mined from DBpedia, and (3) a subset of Wikipedia entries in DBpedia containing both abstracts and links. All three networks are undirected, with detailed information on each below:
• Aminer.
Each node represents an author, with each author containing a set of topic keywords, and each edge representing a co-authorship. To form the attribute network, we compute attribute edges based on the similarity between two authors for every network edge, using Jaccard similarity on the corresponding authors' topic sets.
• Musician.
Nodes represent musicians, with each musician containing a set of music genres, and an edge representing two musicians who have played in the same band. To form the attribute network, we compute attribute edges based on the similarity between two musicians for every network edge, using Jaccard similarity on the corresponding artists' music genre sets.
• Wikipedia.
Nodes represent an entity, place or concept from Wikipedia, which we jointly refer to as an item. Each item contains a set of defining key words, with edges representing a link between two items. The dataset originates from DBpedia as a directed graph with links between Wikipedia entries. We modify the graph to be undirected for use with our algorithms, which we believe to be reasonable, as each edge denotes a relationship between two items. In addition, this dataset uses only the portion of the Wikipedia entries found in DBpedia containing both abstracts and links to other Wikipedia pages. To form the attribute network, we compute attribute edges based on the similarity between two items for every network edge, using Jaccard similarity on the corresponding items' key word sets.

Dataset : Network : Nodes : Edges
Aminer : Co-Author : 1,560,640 : 4,258,946
Musician : Co-Musician : 6,006 : 8,690
Wikipedia : Link : 237,588 : 1,130,846
Table 2: Network Statistics
Metrics. (1) To benchmark the LocalProximity algorithm's effectiveness and efficiency, we compare (i) the difference between the local partitions created with and without the LocalProximity algorithm in AttriPart, and (ii) the run time and the difference between the top 20 PageRank vector entries with and without the LocalProximity algorithm. (2) To benchmark the AttriPart algorithm's effectiveness and efficiency, we compare the triangle count, node count, local partition density and run time to PageRank-Nibble. Normally, PageRank-Nibble does not return a local partition if the target conductance is not met; however, we modify it to return the best local partition found, even if the target conductance is not met. This modification allows for more comparable results to AttriPart. (3) To provide a baseline for the LocalForecasting algorithm's effectiveness, we compare the local partition results to AttriPart on two graphs missing 15% of their edges.
Repeatability.
All data and source code used in this research will be made publicly available. The Aminer co-authorship network can be found on the Aminer website (https://Aminer.org/data); the Musician and Wikipedia datasets used in the experiments will be released on the authors' website. All algorithms and experiments were conducted in a Windows environment using Python.
LocalProximity.
In Figure 5, parts (a)-(c), we can see that the proposed LocalProximity algorithm significantly reduces the computational run time while maintaining high levels of accuracy across both metrics. Parts (a)-(b) demonstrate to what extent the accuracy of the results depends on the parameter values. In particular, a low value of α_r (random walk alpha) and a high value of t_s (relevance threshold) are critical to providing high-accuracy results.
In Figure 5 part (a), we measure accuracy as the number of vertices that differ between the local partitions computed with and without the LocalProximity algorithm in AttriPart. A small partition difference indicates that the LocalProximity algorithm finds a relevant subgraph around the given seed node and that the full graph is unnecessary for accurate results. In part (b), we define the accuracy of the results as the difference between the sets of top entries in the PageRank vectors for the full graph and for the subgraph found using the LocalProximity algorithm. Overall, the results from part (b) correlate well with (a), showing that for low values of α_r (random walk alpha) and high values of t_s (relevance threshold), there is negligible difference between the results computed on the full graph and on the subgraph found using the LocalProximity algorithm.
AttriPart.
In Figure 6, we see that AttriPart finds significantly denser local partitions than PageRank-Nibble, with local partition densities approximately ×, × and × higher than PageRank-Nibble in the Aminer, Wikipedia and Musician datasets, respectively. Density is measured as 2m/(n(n−1)), where m is the number of edges and n is the number of nodes. In Figure 6, we also observe that the triangle count of the AttriPart algorithm is lower than that of PageRank-Nibble in the Musician and Aminer datasets. We attribute this to the fact that AttriPart finds smaller partitions (as measured by node count) and, therefore, there are fewer possible triangles. We also note that each triangle is counted three times, once for each node in the triangle. While no sweeps across algorithm parameters were performed, we believe that the gathered results provide an effective baseline for parameter selection. LocalForecasting.
In order to measure the effectiveness of the LocalForecasting algorithm, we set up the following experiment with three local partition calculations: (1) calculate the local partition using AttriPart, (2) calculate the local partition using AttriPart with 15% of the edges randomly removed from the graph and (3) calculate the local partition using the LocalForecasting algorithm with 15% of the edges randomly removed from the graph. We treat (1) as the baseline local community and test whether (3) finds better local partitions than (2). The idea behind randomly removing 15% of the edges is to simulate the evolution of the graph over time and to test whether the LocalForecasting algorithm can predict better local communities in the future. Ideally, we would have ground-truth local community data for a rich graph with time-series snapshots; in its absence, we use the above method. (Aminer data: https://Aminer.org/data)
Figure 5: Each data point averages 10 randomly sampled vertices in both the Aminer and Musician datasets. (a) Y-axis: the difference in vertices between the local partition calculated with and without the LocalProximity algorithm; (b) Y-axis: the difference between the top PageRank entries. Default parameters (unless swept across): α_n = 0.2, α_r = 0.15, ϕ_o = 0.2, t_s = 2, n_w = 10,000, n_s = 200. Parameter ranges: α_r, α_n and ϕ_o in [0.1-0.7] at 0.1 intervals; t_s in [1-5] at 0.5 intervals.
In Figure 7, each data point is generated in three steps: (i) taking the difference between the sets of vertices and edges in local partitions (1) and (3), (ii) taking the difference between the sets of vertices and edges in local partitions (1) and (2) and (iii) taking the difference between (ii) and (i).
Step (i) tells us how far the LocalForecasting algorithm is from the baseline, step (ii) tells us how far the local partition would be from the baseline if no prediction techniques were used and step (iii) tells us the difference between the local partitions with and without the LocalForecasting algorithm (which is what we see graphed in Figure 7).
Figure 6: (a) Scalability: each data point represents the Aminer dataset in 1/10th intervals, with each point averaged over 3 randomly sampled vertices. Parameters: α_n = 0.2, α_r = 0.15, ϕ_o = 0.2, t_s = 2, n_w = 10,000, n_s = 200. (b) Effectiveness: results are averaged over 20 and 100 randomly sampled vertices in the Aminer/Wikipedia and Musician datasets, respectively. Parameters: α_n = 0.2, α_r = 0.15, ϕ_o = 0.05, t_s = 2, n_w = 10,000, n_s = 200.
In Figure 7, we see that the local partition prediction accuracy, for both the edges and vertices, is above the baseline calculations in the Aminer dataset for a majority of edge similarity threshold values (t_e). The best results were obtained when t_e is 0.6, with an average of 1.4 vertices and 2.75 edges predicted over the baseline using the LocalForecasting algorithm. This number, while relatively small, is an average over 20 randomly sampled vertices, with one result reaching up to 14 vertices and 26 edges over the baseline. In addition, we can see that the Musician dataset does not perform as well as the Aminer dataset, with most of the prediction results performing worse than the baseline (as indicated by the negative difference). We believe that this result on the Musician dataset is due to the different nature of each dataset's network structure, with the Musician dataset being significantly more sparse (no giant connected component) than the Aminer dataset. For both the proposed and baseline algorithms, the efficiency results represent only the time taken to run the algorithm (e.g., not including the time to load data into memory).
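Under our reading of the protocol above, the edge-removal simulation and the step (i)-(iii) comparison could look like the following sketch; the function names and the use of Python's `random` module are our own assumptions, not the paper's code.

```python
import random

def remove_edges(edges, frac=0.15, seed=0):
    """Simulate graph evolution by randomly dropping `frac` of the edges."""
    rng = random.Random(seed)
    edges = list(edges)
    n_keep = len(edges) - int(frac * len(edges))
    keep = set(rng.sample(range(len(edges)), n_keep))
    return [e for i, e in enumerate(edges) if i in keep]

def forecast_improvement(baseline, forecast, plain):
    """Steps (i)-(iii): a positive value means the LocalForecasting
    partition is closer to the baseline than the no-prediction partition."""
    err_forecast = len(set(baseline) ^ set(forecast))  # step (i)
    err_plain = len(set(baseline) ^ set(plain))        # step (ii)
    return err_plain - err_forecast                    # step (iii)
```

The same improvement is computed separately for the vertex sets and the edge sets of the partitions, which is what Figure 7 plots.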
LocalProximity.
Across a majority of the parameters, the run time of the full-graph PageRank computation is approximately 450 seconds longer than computing the PageRank vector on the LocalProximity subgraph.
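This speed-up is consistent with the local nature of the computation: a push-style approximate personalized PageRank in the spirit of Andersen, Chung and Lang [1] only touches vertices near the seed. The sketch below is illustrative (the lazy-walk constant and parameter names are ours, not the paper's exact routine):

```python
from collections import defaultdict, deque

def approximate_pagerank(adj, seed, alpha=0.15, eps=1e-4):
    """Push-style approximate personalized PageRank: maintain an estimate p
    and a residual r, and repeatedly push mass from any vertex whose
    residual exceeds eps times its degree."""
    p, r = defaultdict(float), defaultdict(float)
    r[seed] = 1.0
    queue = deque([seed])
    while queue:
        u = queue.popleft()
        du = len(adj[u])
        if du == 0 or r[u] < eps * du:
            continue
        # Push at u: keep alpha of the residual, lazily spread the rest.
        p[u] += alpha * r[u]
        share = (1 - alpha) * r[u] / (2 * du)
        r[u] = (1 - alpha) * r[u] / 2
        for nb in adj[u]:
            r[nb] += share
            if r[nb] >= eps * len(adj[nb]):
                queue.append(nb)
        if r[u] >= eps * du:
            queue.append(u)
    return dict(p)
```

Because each push touches only one vertex and its neighbours, the total work is bounded independently of the graph size, so most of the cost of a full-graph solve is avoided.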
AttriPart.
In Figure 6, we see that the AttriPart algorithm finds local partitions × faster than PageRank-Nibble.
Figure 7: Each data point averages 20 randomly sampled vertices in the Aminer and Musician datasets. Default parameters (unless swept across): α_n = 0.2, α_r = 0.15, ϕ_o = 0.2, t_s = 5, t_e = 0.7, n_w = 10,000, n_s = 200. Parameter ranges: t_e in [0.1-0.9] at 0.1 intervals; ϕ_o in [0.1-0.6] at 0.1 intervals.
LocalForecasting. This algorithm has an expected run time nearly identical to AttriPart; we therefore refer the reader to Figure 6 for run time results.
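For reference, the effectiveness metrics reported in Figure 6 can be computed as in this sketch; the helper names are ours, and the triple-counting convention matches the text. The `u < v` de-duplication assumes orderable node identifiers.

```python
def partition_density(nodes, adj):
    """Density 2m / (n(n-1)) of the subgraph induced on `nodes`."""
    nodes = set(nodes)
    # Count each undirected edge once via the u < v convention.
    m = sum(1 for u in nodes for v in adj[u] if v in nodes and u < v)
    n = len(nodes)
    return 2.0 * m / (n * (n - 1)) if n > 1 else 0.0

def triangle_count(nodes, adj):
    """Triangles in the induced subgraph, counted once per member node
    (i.e. each triangle contributes 3 to the total, as in the text)."""
    nodes = set(nodes)
    total = 0
    for u in nodes:
        nbrs = [v for v in adj[u] if v in nodes]
        for i in range(len(nbrs)):
            for j in range(i + 1, len(nbrs)):
                if nbrs[j] in adj[nbrs[i]]:
                    total += 1
    return total
```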
We provide a high-level review of both local and global community detection methods, with a focus on the research that pertains to the algorithms we propose in this paper.
A - Local Community Detection.
Given an undirected graph, a start vertex and a target conductance, the goal of Nibble is to find a subset of vertices with conductance less than the target conductance [19]. This algorithm has strong theoretical properties, with a run time of O(2^b (log^6 m)/ϕ^4), where b is a user-defined constant, ϕ is the target conductance and m is the number of edges. PageRank-Nibble builds on the work of Nibble by introducing the use of personalized PageRank [9, 22], in addition to an algorithm for the computation of approximate PageRank vectors [1]. Since PageRank-Nibble and Nibble run on undirected graphs, they use truncated random walks in order to prevent the stationary distribution from becoming proportional to the degree centrality of each node [8]. There are also many alternative techniques for local community detection. To name a few, the paper by Bagrow and Bollt [2] introduces a method of local community identification that utilizes an l-shell spreading outward from a start vertex. However, their algorithm requires knowledge of the entire graph and is therefore not truly local. The research by J. Chen et al. [4] proposes a method for local community identification in social networks that avoids the use of hard-to-obtain parameters and improves the accuracy of identified communities by introducing a new metric. In addition, the work in [29] and [25] introduces two methods of local community identification that take into account higher-order network structure information. In [29], the authors provide mathematical guarantees of the optimality and scalability of their algorithm, in addition to generalizing it to various network types (e.g., signed and multi-partite networks). B - Global Community Detection.
The basic idea behind the Walktrap algorithm is that random walks on a graph tend to get "trapped" in densely connected parts that correspond to communities [18]. Utilizing the properties of random walks on graphs, the authors define a measure of structural similarity between vertices and between communities, creating a distance metric. The algorithm itself has an upper bound of O(mn^2). Another popular choice for global community detection is spectral analysis. In the paper by M. Newman [15] it is shown that the problems of community detection by modularity maximization, community detection by statistical inference and normalized-cut graph partitioning, when tackled using spectral methods, are in fact the same problem. The work by S. White et al. in [24] attempts to find communities in graphs using spectral clustering. They achieve this by taking an objective function for graph clustering [16] and reformulating it as a spectral relaxation problem, for which they propose two algorithms. A systematic introduction to spectral clustering techniques can be found in [23]. There also exist many alternative techniques for global community detection. Among others, two techniques relevant to this work are [11] and [20]. In [11], the authors propose a community detection algorithm that uses the information in both the network structure and the node attributes, while in [20] the authors use network feature extraction to predict the evolution of communities. A detailed review of various community detection algorithms can be found in [28]. This paper proposes new algorithms for attributed graphs, with the goal of (i) computing denser local graph partitions and (ii) predicting the evolution of local communities. We believe that the proposed algorithms will be of particular interest to data mining researchers given the computational speed-up and enhanced dense local partition identification.
The proposed local partitioning algorithm AttriPart has already been deployed.
REFERENCES
[1] R. Andersen, F. Chung, and K. Lang. 2006. Local Graph Partitioning using PageRank Vectors. In FOCS '06. 475–486. https://doi.org/10.1109/FOCS.2006.44
[2] James P. Bagrow and Erik M. Bollt. 2005. Local method for detecting communities.
Phys. Rev. E
72, 4 (Oct 2005), 046108. https://doi.org/10.1103/PhysRevE.72.046108
[3] Albert-László Barabási and Réka Albert. 1999. Emergence of Scaling in Random Networks.
Science 286, 5439 (1999), 509–512.
[4] Jiyang Chen, Osmar R. Zaïane, and Randy Goebel. 2009. Local Community Identification in Social Networks. In ASONAM '09. 237–242. https://doi.org/10.1109/ASONAM.2009.14
[5] Pierre Dupont, J. Callut, G. Dooms, J. N. Monette, and Yves Deville. 2017. Relevant subgraph extraction from random walks in a graph. (12 2017).
[6] Michalis Faloutsos, Petros Faloutsos, and Christos Faloutsos. 1999. On Power-law Relationships of the Internet Topology. In
Proceedings of the Conference on Applications, Technologies, Architectures, and Protocols for Computer Communication (SIGCOMM '99). ACM, New York, NY, USA, 251–262. https://doi.org/10.1145/316188.316229
[7] Scott Freitas, Hanghang Tong, Nan Cao, and Yinglong Xia. 2017. Rapid Analysis of Network Connectivity. In
Proceedings of the 2017 ACM on Conference on Information and Knowledge Management (CIKM '17). ACM, New York, NY, USA, 2463–2466. https://doi.org/10.1145/3132847.3133170
[8] Vince Grolmusz. 2015. A Note on the PageRank of Undirected Graphs.
Inf. Process. Lett.
[9] Taher H. Haveliwala. 2003. Topic-Sensitive PageRank: A Context-Sensitive Ranking Algorithm for Web Search. IEEE Transactions on Knowledge and Data Engineering 15, 4 (July 2003), 784–796. https://doi.org/10.1109/TKDE.2003.1208999
[10] Chin-Chi Hsu, Yi-An Lai, Wen-Hao Chen, Ming-Han Feng, and Shou-De Lin. 2017. Unsupervised Ranking Using Graph Structures and Node Attributes. In
Proceedings of the Tenth ACM International Conference on Web Search and Data Mining (WSDM '17). ACM, New York, NY, USA, 771–779. https://doi.org/10.1145/3018661.3018668
[11] Jaewon Yang, Julian McAuley, and Jure Leskovec. 2013. Community Detection in Networks with Node Attributes.
ICDM (2013).
[12] Ravi Kannan, Santosh Vempala, and Adrian Vetta. 2004. On Clusterings: Good, Bad and Spectral.
J. ACM
51, 3 (May 2004), 497–515. https://doi.org/10.1145/990308.990313
[13] Laura Bennett, Aristotelis Kittas, Songsong Liu, Lazaros G. Papageorgiou, and Sophia Tsoka. 2014. Community Structure Detection for Overlapping Modules through Mathematical Programming in Protein Interaction Networks.
PLOS ONE (2014). https://doi.org/10.1371/journal.pone.0112821
[14] David Liben-Nowell and Jon Kleinberg. 2007. The Link-prediction Problem for Social Networks.
J. Am. Soc. Inf. Sci. Technol.
58, 7 (May 2007), 1019–1031. https://doi.org/10.1002/asi.v58:7
[15] Mark E. J. Newman. 2013. Spectral methods for network community detection and graph partitioning.
CoRR abs/1307.7729 (2013).
[16] M. E. J. Newman and M. Girvan. 2004. Finding and evaluating community structure in networks. Phys. Rev. E 69, 026113 (2004).
[17] M. E. J. Newman. 2005. A measure of betweenness centrality based on random walks.
Social Networks
27, 1 (2005), 39–54. https://doi.org/10.1016/j.socnet.2004.11.009
[18] Pascal Pons and Matthieu Latapy. 2005. Computing Communities in Large Networks Using Random Walks. In
Proceedings of the 20th International Conference on Computer and Information Sciences (ISCIS '05). Springer-Verlag, Berlin, Heidelberg, 284–293. https://doi.org/10.1007/11569596_31
[19] Daniel A. Spielman and Shang-Hua Teng. 2013. A Local Clustering Algorithm for Massive Graphs and Its Application to Nearly Linear Time Graph Partitioning.
SIAM J. Comput.
42, 1 (2013), 1–26. https://doi.org/10.1137/080744888
[20] M. Takaffoli, R. Rabbany, and O. R. Zaïane. 2014. Community evolution prediction in dynamic social networks. In ASONAM '14. 9–16. https://doi.org/10.1109/ASONAM.2014.6921553
[21] Chayant Tantipathananandh, Tanya Berger-Wolf, and David Kempe. 2007. A Framework for Community Identification in Dynamic Social Networks. In
Proceedings of the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '07). ACM, New York, NY, USA, 717–726. https://doi.org/10.1145/1281192.1281269
[22] Hanghang Tong, Jingrui He, Mingjing Li, Wei-Ying Ma, Hong-Jiang Zhang, and Changshui Zhang. 2006. Manifold-Ranking-Based Keyword Propagation for Image Retrieval.
EURASIP Journal on Applied Signal Processing
[23] Ulrike von Luxburg. 2007. A Tutorial on Spectral Clustering. Statistics and Computing 17, 4 (Dec. 2007), 395–416. https://doi.org/10.1007/s11222-007-9033-z
[24] Scott White and Padhraic Smyth. 2005.
A Spectral Clustering Approach To Finding Communities in Graphs. In SDM '05. 274–285. https://doi.org/10.1137/1.9781611972757.25
[25] Hao Yin, Austin R. Benson, Jure Leskovec, and David F. Gleich. 2017. Local Higher-Order Graph Clustering. In
Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '17). ACM, New York, NY, USA, 555–564. https://doi.org/10.1145/3097983.3098069
[26] Yong-Yeol Ahn, James P. Bagrow, and Sune Lehmann. 2010. Link communities reveal multiscale complexity in networks.
Nature 466 (August 2010), 761–764. https://doi.org/10.1038/nature09182
[27] Jing Zhang, Jie Tang, Cong Ma, Hanghang Tong, Yu Jing, Juanzi Li, Walter Luyten, and Marie-Francine Moens. 2017. Fast and Flexible Top-k Similarity Search on Large Networks.
ACM Trans. Inf. Syst.
36, 2, Article 13 (Aug. 2017), 30 pages. https://doi.org/10.1145/3086695
[28] Zhao Yang, René Algesheimer, and Claudio J. Tessone. 2016. A Comparative Analysis of Community Detection Algorithms on Artificial Networks.
Scientific Reports 6, 30750 (2016). https://doi.org/10.1038/srep30750
[29] Dawei Zhou, Si Zhang, Mehmet Yigit Yildirim, Scott Alcorn, Hanghang Tong, Hasan Davulcu, and Jingrui He. 2017. A Local Algorithm for Structure-Preserving Graph Cut. In KDD '17.