Effective and Scalable Clustering on Massive Attributed Graphs (Technical Report)*

Renchi Yang (Nanyang Technological University), Jieming Shi† (Hong Kong Polytechnic University), Yin Yang (Hamad Bin Khalifa University), Keke Huang (National University of Singapore), Shiqi Zhang (National University of Singapore), Xiaokui Xiao (National University of Singapore)

*This is the full version of the paper appearing in TheWebConf 2021.
†Corresponding author.
ABSTRACT
Given a graph $G$ where each node is associated with a set of attributes, and a parameter $k$ specifying the number of output clusters, $k$-attributed graph clustering ($k$-AGC) groups nodes in $G$ into $k$ disjoint clusters, such that nodes within the same cluster share similar topological and attribute characteristics, while those in different clusters are dissimilar. This problem is challenging on massive graphs, e.g., with millions of nodes and billions of attribute values. For such graphs, existing solutions either incur prohibitively high costs, or produce clustering results with compromised quality.

In this paper, we propose ACMin, an efficient approach to $k$-AGC that yields high-quality clusters with cost linear to the size of the input graph $G$. The main contributions of ACMin are twofold: (i) a novel formulation of the $k$-AGC problem based on an attributed multi-hop conductance quality measure custom-made for this problem setting, which effectively captures cluster coherence in terms of both topological proximities and attribute similarities, and (ii) a linear-time optimization solver that obtains high-quality clusters iteratively, based on efficient matrix operations such as orthogonal iterations, an alternative optimization approach, as well as an initialization technique that significantly speeds up the convergence of ACMin in practice.

Extensive experiments, comparing 11 competitors on 6 real datasets, demonstrate that ACMin consistently outperforms all competitors in terms of result quality measured against ground-truth labels, while being up to orders of magnitude faster. In particular, on the Microsoft Academic Knowledge Graph dataset with 265.2 million edges and 1.1 billion attribute values, ACMin outputs high-quality results for 5-AGC within 1.68 hours using a single CPU core, while none of the 11 competitors finish within 3 days.
1 INTRODUCTION

Node clustering is a fundamental task in graph mining [28, 39, 44, 61], and finds important real-world applications, e.g., community detection in social networks [12], functional cartography of metabolic networks [15], and protein grouping in biological networks [49]. Traditionally, node clustering is done based on the graph topology, i.e., by grouping together well-connected nodes. This approach, however, is often insufficient to obtain high-quality clusters [13, 22], especially when the graph comes with attributes associated to nodes. In such attributed graphs, well-connected nodes tend to share similar attributes; meanwhile, nodes with similar attributes are also likely to be well-connected, as observed in [26, 27]. Therefore, to obtain high-quality node clusterings, it is important to consider both graph topology and node attributes. The resulting attributed graph clustering has use cases such as gene clustering in biological networks [18], group-oriented marketing in communication networks [54], and service/app recommendation and online advertising in social networks [23, 30].

This paper focuses on $k$-attributed graph clustering ($k$-AGC), which takes as input an attributed graph $G$ and a parameter $k$, and aims to partition $G$ into $k$ disjoint node clusters $C_1, C_2, \cdots, C_k$, such that the nodes within the same cluster $C_i$ are not only well-connected to each other, but also share similar attribute values, whereas the nodes in different clusters are distant from each other and share fewer attributes. It is highly challenging to devise a $k$-AGC algorithm that yields high-quality clusters, especially on massive graphs, e.g., with millions of nodes and billions of attribute values. Most existing solutions (e.g., [2, 7, 10, 29, 33, 36, 37, 42, 45, 52-54, 62, 64, 67, 68]) fail to scale to such large graphs, since they either incur prohibitive computational overhead, or produce clustering results with compromised quality. For instance, a common methodology [7, 10, 36, 67] relies on materializing the attribute similarity between every pair of nodes in the input graph $G$, and, thus, requires $O(n^2)$ space for $n$ nodes, which is infeasible for a graph with numerous nodes. Methods based on probabilistic models (e.g., [21, 40, 54, 62, 63]) generally incur immense costs on large graphs to estimate the likelihood parameters in their respective optimization programs. Among the faster solutions, some (e.g., [7, 33, 37, 42, 45]) reduce the problem to non-attributed graph clustering by re-weighting each edge $(u, v)$ in $G$ based on the attribute similarity between nodes $u$ and $v$. This approach, however, ignores attribute similarities between nodes that are not directly connected, and, consequently, suffers from severe result-quality degradation. Finally, $k$-AGC could be done by first applying attributed network embedding to the input graph (e.g., [17, 31, 34, 55-57, 60, 66]) to obtain an embedding vector for each node, and subsequently feeding the resulting embeddings to a non-graph method such as $k$-Means clustering [19, 41]. This two-stage pipeline leads to sub-optimal result quality, however, since the node embedding methods do not specifically target graph clustering, as demonstrated in our experiments.

Facing the challenge of $k$-AGC on massive attributed graphs, we propose ACMin (short for Attributed multi-hop Conductance Minimization), a novel solution that seamlessly incorporates both graph topology and node attributes to identify high-quality clusters, while being highly scalable and efficient on massive graphs with numerous nodes, edges and attributes.
Specifically, ACMin computes $k$-AGC by solving an optimization problem, in which the main objective is formulated based on a novel concept called average attributed multi-hop conductance, a non-trivial extension of conductance [6, 61], a classic measure of node cluster coherence. The main idea is to map both node relationships (i.e., connections via edges) and similarities (i.e., common attributes) to motions of a random walker. Then, we show that the corresponding concept of conductance in our setting, i.e., attributed multi-hop conductance, is equivalent to the probability that a random walker starting from a node in a cluster (say, $C$) terminates at any node outside the cluster $C$. Accordingly, our goal is to identify a node partitioning scheme that minimizes the average attributed multi-hop conductance among all $k$ clusters in the result.

Finding the exact solution to the above optimization problem turns out to be infeasible for large graphs, as we prove its NP-hardness. Hence, ACMin tackles the problem via an approximate solution with space and time costs linear to the size of the input graph. In particular, there are three key techniques in the ACMin algorithm. First, instead of actually sampling random walks, ACMin converts the optimization objective into its equivalent matrix form, and iteratively refines a solution via efficient matrix operations, i.e., orthogonal iterations [43]. Second, the ACMin solver applies an alternative optimization approach and randomized SVD [16] to efficiently generate and refine clustering results. Third, ACMin includes an effective greedy initialization technique that significantly speeds up the convergence of the iterative process in practice.

We formally analyze the asymptotic time and space complexities of ACMin, and evaluate its performance thoroughly by comparing against 11 existing solutions on 6 real datasets. The quality of a clustering method's outputs is evaluated by both (i) comparing them with ground-truth labels, and (ii) measuring their attributed multi-hop conductance, which turns out to agree with (i) on all datasets in the experiments. The evaluation results demonstrate that ACMin consistently outperforms its competitors in terms of clustering quality, at a fraction of their costs. In particular, on the Flickr dataset, the performance gap between ACMin and the best competitor is as large as 28.6 percentage points, measured as accuracy with respect to ground truth. On the Microsoft Academic Knowledge Graph (MAG) dataset with 265.2 million edges and 1.1 billion attribute values, ACMin terminates in 1.68 hours for a 5-AGC task, while none of the 11 competitors finish within 3 days.

The rest of this paper is organized as follows. Section 2 presents our formulation of the $k$-AGC problem, based on two novel concepts: attributed random walks and attributed multi-hop conductance. Section 3 overviews the proposed solution ACMin and provides the intuitions behind the algorithm. Section 4 describes the complete ACMin algorithm and analyzes its asymptotic complexity. Section 5 contains an extensive set of experimental evaluations. Section 6 reviews related work, and Section 7 concludes the paper with future directions.
Table 1: Frequently used notations.
$G = (V, E_V, R, E_R)$ : A graph $G$ with node set $V$, edge set $E_V$, attribute set $R$, and node-attribute association set $E_R$.
$n, d$ : The number of nodes (i.e., $|V|$) and the number of attributes (i.e., $|R|$) in $G$, respectively.
$k$ : The number of clusters.
$A, D, R$ : The adjacency, out-degree and attribute matrices of $G$.
$P_V, P_A$ : The topological transition and attributed transition matrices of $G$, respectively.
$\alpha, \beta$ : Stopping and attributed branching probabilities.
$S$ : The attributed random walk probability matrix (see Eq. (2)).
$F$ : The top-$k$ eigenvectors of $S$.
$Y, \Psi(Y)$ : A $k \times n$ node-cluster indicator (i.e., NCI) and the average attributed multi-hop conductance (i.e., AAMC) of $Y$ (see Eq. (8)).

2 PROBLEM FORMULATION

Section 2.1 provides necessary background and defines common notations. Section 2.2 describes a random walk model that incorporates both topological proximity and attribute similarity information. Section 2.3 defines the novel concept of attributed multi-hop conductance, which forms the basis of the objective function in our $k$-AGC problem formulation, presented in Section 2.4.

2.1 Preliminaries

Let $G = (V, E_V, R, E_R)$ be an attributed graph consisting of a node set $V$ with cardinality $n$, a set of edges $E_V$, each connecting two nodes in $V$, a set of attributes $R$ with cardinality $d$, and a set of node-attribute associations $E_R$, where each element is a tuple $(v_i, r_j, w_{i,j})$ signifying that node $v_i \in V$ is directly associated with attribute $r_j \in R$ with weight $w_{i,j}$.¹ Without loss of generality, we assume that each edge $(v_i, v_j) \in E_V$ is directed; an undirected edge $(v_i, v_j)$ is simply converted to a pair of directed edges with opposing directions $(v_i, v_j)$ and $(v_j, v_i)$. A high-level definition of the $k$-AGC problem is as follows.

Definition 2.1 ($k$-Attributed Graph Clustering ($k$-AGC) [67]). Given an attributed graph $G$ and the number $k$ of clusters, $k$-AGC aims to partition the node set $V$ of $G$ into disjoint subsets $C_1, C_2, \cdots, C_k$, such that (i) nodes within the same cluster $C_i$ are close to each other, while nodes in any two clusters $C_i, C_j$ are distant from each other; and (ii) nodes within the same cluster $C_i$ have homogeneous attribute values, while the nodes in different clusters may have diverse attribute values.

Note that the above definition does not include a concrete optimization objective that quantifies node proximity and attribute homogeneity. As explained in Sections 2.2-2.4, the design of effective cluster quality measures is non-trivial, and is a main contribution of this paper. The problem formulation is completed in Section 2.4 with a novel objective function.

Regarding notation, we denote matrices in bold uppercase, e.g., $M$. We use $M[i]$ to denote the $i$-th row vector of $M$, and $M[:, j]$ to denote the $j$-th column vector of $M$. In addition, we use $M[i, j]$ to denote the element at the $i$-th row and $j$-th column of $M$. Given an index set $I$, we let $M[I]$ (resp. $M[:, I]$) be the matrix block of $M$ that contains the row (resp. column) vectors of the indices in $I$. Table 1 lists the notations frequently used throughout the paper.

¹Following common practice in the literature [54, 60], we assume that the attributes have already been pre-processed, e.g., categorical attributes such as marital status are one-hot encoded into binary ones.

Figure 1: Example attributed graph and clustering schemes. (a) Clusters $C_1, C_2$; (b) Clusters $C_1', C_2'$. [Figure omitted.]
Let $A$ be the adjacency matrix of the input graph $G$, i.e., $A[v_i, v_j] = 1$ if $(v_i, v_j) \in E_V$, and $A[v_i, v_j] = 0$ otherwise. Let $D$ be the diagonal out-degree matrix of $G$, i.e., $D[v_i, v_i] = \sum_{v_j \in V} A[v_i, v_j]$. We define the topological transition matrix of $G$ as $P_V = D^{-1} A$. Furthermore, we define an attribute matrix $R \in \mathbb{R}^{n \times d}$, such that $R[v_i, r_j] = w_{i,j}$ is the weight associated with the tuple $(v_i, r_j, w_{i,j}) \in E_R$. We refer to $R[v_i]$ as node $v_i$'s attribute vector. Also, let $d_{out}(v_i)$ and $d_{in}(v_i)$ denote the out-degree and in-degree of node $v_i$ in $G$, respectively.

2.2 Attributed Random Walk Model

Random walks are an effective model for capturing multi-hop relationships between nodes in a graph [32]. Common definitions of random walks, e.g., random walk with restart (RWR) [25, 47], consider only graph topology but not node attributes. Hence, we devise a new attributed random walk model that seamlessly integrates topological proximity and attribute similarity between nodes in a coherent framework, which plays a key role in our formulation of the $k$-AGC problem, elaborated later.

Given an attributed graph $G$, we first define the attributed transition probability and the topological transition probability between a pair of nodes $v_i$ and $v_j$ in $G$. We say that $v_i$ and $v_j$ are connected via attribute $r_x$ iff both $v_i$ and $v_j$ are associated with $r_x$. For example, in Figure 1, two of the nodes are connected via three common attributes (shown in blue dashed lines). The attributed transition probability from $v_i$ to $v_j$ via $r_x$ is defined as $\frac{R[v_i, r_x] \cdot R[v_j, r_x]}{\sum_{v_l \in V} \sum_{r_y \in R} R[v_i, r_y] \cdot R[v_l, r_y]}$, which corresponds to the motion of a random walker that hops from $v_i$ to $v_j$ through the "bridge" $r_x$. Accordingly, we define the attributed transition probability matrix $P_A$ of $G$ as:

$$P_A[v_i, v_j] = \frac{R[v_i] \cdot R[v_j]^\top}{\sum_{v_l \in V} R[v_i] \cdot R[v_l]^\top}. \quad (1)$$

Intuitively, $P_A[v_i, v_j]$ models the attributed transition probability from $v_i$ to $v_j$ via any attribute in $R$.

Meanwhile, following conventional random walk definitions, for any two nodes $v_i$ and $v_j$ that are directly connected by an edge in $G$, i.e., $(v_i, v_j) \in E_V$, the topological transition probability $P_V[v_i, v_j]$ from $v_i$ to $v_j$ is $\frac{1}{d_{out}(v_i)}$, where $d_{out}(v_i)$ is the out-degree of node $v_i$. The topological transition matrix $P_V$ can then be obtained as $P_V = D^{-1} A$, where $D$ and $A$ are the out-degree and adjacency matrices of $G$, respectively.
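To make the two transition matrices concrete, the following is a minimal sketch (not the authors' code) that builds $P_V$ and $P_A$ for a tiny made-up attributed graph; the adjacency and attribute weights are illustrative only.

```python
import numpy as np

# Toy inputs (hypothetical): 4 nodes, 3 attributes.
A = np.array([[0, 1, 1, 0],          # adjacency matrix A of G
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
R = np.array([[1, 1, 0],             # attribute matrix R (node-attribute weights)
              [1, 0, 0],
              [0, 1, 1],
              [0, 0, 1]], dtype=float)

P_V = A / A.sum(axis=1, keepdims=True)      # topological transitions: D^{-1} A

sim = R @ R.T                               # R[v_i] . R[v_j] for all pairs
P_A = sim / sim.sum(axis=1, keepdims=True)  # attributed transitions, Eq. (1)

# Both matrices are row-stochastic, as required for a random walk.
assert np.allclose(P_V.sum(axis=1), 1) and np.allclose(P_A.sum(axis=1), 1)
```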
Based on the above concepts, we formally define the attributed random walk as follows.

Definition 2.2 (Attributed Random Walk). Given an attributed graph $G$, a stopping probability $\alpha \in (0, 1)$, and an attributed branching probability $\beta \in (0, 1)$, an attributed random walk starting from node $v_i$ in $G$ performs one of the following actions at each step:
(1) with probability $\alpha$, stop at the current node (denoted as $v_j$);
(2) with probability $1 - \alpha$, jump to another node $v_l$ as follows:
  (a) (attributed transition) with probability $\beta$, jump to a node $v_l$ via any attribute with probability $P_A[v_j, v_l]$;
  (b) (topological transition) with probability $1 - \beta$, jump to an out-neighbor $v_l$ of $v_j$ with probability $P_V[v_j, v_l]$.

Based on Definition 2.2, the following lemma shows how to directly compute the probability $S[v_i, v_j]$ that an attributed random walk starting from node $v_i$ stops at node $v_j$.²

Lemma 2.3. Given an attributed graph $G$, the probability that an attributed random walk starting from node $v_i$ stops at node $v_j$ is

$$S[v_i, v_j] = \alpha \sum_{\ell=0}^{\infty} (1 - \alpha)^\ell \cdot ((1 - \beta) \cdot P_V + \beta \cdot P_A)^\ell [v_i, v_j]. \quad (2)$$

Note that computing $S$ directly via Eq. (2) is inefficient, as is sampling numerous attributed random walks. Instead, the proposed solution ACMin, presented later, computes the probabilities in $S$ based on an alternative matrix representation, without simulating any attributed random walk.

²All proofs appear in Appendix A.

2.3 Attributed Multi-Hop Conductance

Conductance is widely used to evaluate the quality of a node cluster in a graph [6, 61]. A smaller conductance indicates a more coherent cluster, and vice versa. Specifically, given a cluster $C$ of graph $G$, the conductance of $C$, denoted as $\hat{\Phi}(C)$, is defined as follows:

$$\hat{\Phi}(C) = \frac{|cut(C)|}{\min\{vol(C), vol(V \setminus C)\}}, \quad (3)$$

where $vol(C) = \sum_{v_i \in C} d_{out}(v_i)$, i.e., the sum of the out-degrees of all nodes in $C$, and $cut(C) = \{(v_i, v_j) \mid v_i \in C, v_j \in V \setminus C\}$, i.e., the set of outgoing edges with one endpoint in $C$ and the other in $V \setminus C$. Intuitively, $\hat{\Phi}(C)$ is smaller when $C$ has fewer outgoing edges linking to nodes outside the cluster (i.e., lower inter-cluster connectivity), and more edges with both endpoints within $C$ (i.e., higher intra-cluster connectivity).

In our setting, the classic definition of conductance $\hat{\Phi}(C)$ is inadequate, since it captures neither attribute information nor multi-hop relationships between nodes. Figure 1 illustrates an example in which $\hat{\Phi}(C)$ leads to counter-intuitive cluster quality measurements. The example contains seven nodes and a handful of attributes; suppose that we aim to partition $G$ into two clusters. As shown in Figure 1b, one of the nodes is mutually connected to two nodes of one group and also shares many attributes and neighbors with a node in that group, while it is mutually connected to only one node of the other group and shares no common attributes with that group. Imagine that this is a social media setting where each node represents a user and each edge indicates a follow relationship; then, the node in question is clearly closer to the former group, due to its stronger connections and shared attributes. However, the conductance definition in Eq. (3) leads to the counter-intuitive conclusion that favors the clustering scheme $C_1, C_2$ in Figure 1a over $C_1', C_2'$ in Figure 1b, since $\hat{\Phi}(C_1) = \hat{\Phi}(C_2) \le \hat{\Phi}(C_1') = \hat{\Phi}(C_2')$.
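For reference, the classic conductance of Eq. (3) is easy to compute directly; the following sketch (with a made-up cluster on the toy graph from the earlier snippet) only makes the cut/volume bookkeeping explicit.

```python
import numpy as np

def conductance(A, C):
    """Classic conductance, Eq. (3): |cut(C)| / min(vol(C), vol(V \\ C))."""
    n = A.shape[0]
    C = np.asarray(sorted(C))
    Cbar = np.setdiff1d(np.arange(n), C)
    cut = A[np.ix_(C, Cbar)].sum()     # outgoing edges from C to V \ C
    vol = A.sum(axis=1)                # out-degree of each node
    return cut / min(vol[C].sum(), vol[Cbar].sum())

A = np.array([[0, 1, 1, 0], [1, 0, 1, 0], [0, 1, 0, 1], [0, 0, 1, 0]], float)
print(conductance(A, {0, 1}))          # cut = 2, min volume = 3 -> 0.667
```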
To address the above issue, we propose a new measure of cluster quality dubbed attributed multi-hop conductance, which can be viewed as an adaptation of conductance to the setting of $k$-AGC. Specifically, given a cluster $C$ of an attributed graph $G$, suppose that we perform $n_r$ attributed random walks from each node $v_i$ in $C$. Let $w(v_i, v_j)$ be the number of walks from $v_i$ stopping at $v_j$. Then, we can use the following quantity instead of Eq. (3) as a measure of cluster coherence:

$$E\left[ \frac{\sum_{v_i \in C, v_j \in V \setminus C} w(v_i, v_j)}{n_r \cdot |C|} \right] = \frac{\sum_{v_i \in C, v_j \in V \setminus C} E\left[\frac{w(v_i, v_j)}{n_r}\right]}{|C|}.$$

Intuitively, the above value quantifies the expected fraction of the attributed random walks escaping from $C$, i.e., stopping at any outside node $v_j \in V \setminus C$. Hence, the fewer the escaped walks, the higher the cluster coherence. Further, observe that $E[\frac{w(v_i, v_j)}{n_r}]$ is exactly the probability that an attributed random walk starting from $v_i$ terminates at $v_j$, i.e., $S[v_i, v_j]$ in Eq. (2). Accordingly, we arrive at the following definition of attributed multi-hop conductance $\Phi(C)$.

Definition 2.4 (Attributed Multi-Hop Conductance). Given a cluster $C$ of an attributed graph $G$, the attributed multi-hop conductance $\Phi(C)$ of the cluster $C$ is defined as

$$\Phi(C) = \frac{\sum_{v_i \in C, v_j \in V \setminus C} S[v_i, v_j]}{|C|}. \quad (4)$$

2.4 Objective Function

Given an input attributed graph $G$, we aim to partition all nodes into $k$ disjoint clusters $C_1, C_2, \cdots, C_k$ such that the average attributed multi-hop conductance (AAMC) of the $k$ clusters is minimized:

$$\phi^* = \min_{C_1, C_2, \cdots, C_k} \frac{\sum_{i=1}^{k} \Phi(C_i)}{k}. \quad (5)$$

The above objective, in combination with Definition 2.1, completes our formulation of the $k$-AGC problem. As an example, in Figure 1, under the $\alpha$ and $\beta$ settings used in the figure, we have $\Phi(C_2) = 0.125$ for the clusters in Figure 1a and $\Phi(C_2') = 0.185$ for those in Figure 1b, and yet the AAMC values of the two clustering schemes satisfy $\frac{\Phi(C_1) + \Phi(C_2)}{2} > \frac{\Phi(C_1') + \Phi(C_2')}{2}$. In other words, $C_1', C_2'$ are a better clustering of $G$, which agrees with our intuition explained in Section 2.3.
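The definition of $\Phi(C)$ can be sanity-checked by simulation. The sketch below (with illustrative parameter values and a made-up cluster, not those of Figure 1) estimates $\Phi(C)$ by running attributed random walks per Definition 2.2 and counting the fraction that escape the cluster.

```python
import numpy as np

rng = np.random.default_rng(7)

# Toy transition matrices, as in the earlier sketch.
A = np.array([[0, 1, 1, 0], [1, 0, 1, 0], [0, 1, 0, 1], [0, 0, 1, 0]], float)
R = np.array([[1, 1, 0], [1, 0, 0], [0, 1, 1], [0, 0, 1]], float)
P_V = A / A.sum(1, keepdims=True)
P_A = (R @ R.T) / (R @ R.T).sum(1, keepdims=True)

def walk_stop(v, alpha=0.2, beta=0.5):
    """One attributed random walk (Definition 2.2); returns the stop node."""
    while rng.random() >= alpha:                  # continue w.p. 1 - alpha
        P = P_A if rng.random() < beta else P_V   # attributed vs. topological hop
        v = rng.choice(P.shape[1], p=P[v])
    return v

def estimate_phi(C, walks=5000):
    """Monte-Carlo estimate of Eq. (4): expected escape fraction."""
    esc = sum(walk_stop(v) not in C for v in C for _ in range(walks))
    return esc / (walks * len(C))

print(estimate_phi({0, 1}))
```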
3 SOLUTION OVERVIEW

This section provides a high-level overview of the proposed solution ACMin for $k$-AGC computation, and explains the intuitions behind the algorithm design. The complete ACMin method is elaborated in Section 4.

First, we transform the optimization objective in Eq. (5) into an equivalent form that is easier to analyze. For this purpose, we introduce the following binary node-cluster indicator (NCI) $Y \in \{0,1\}^{k \times n}$ to represent a clustering result:

$$Y[C_i, v_j] = \begin{cases} 1 & v_j \in C_i, \\ 0 & v_j \in V \setminus C_i, \end{cases} \quad (6)$$

where $C_i$ is the $i$-th cluster and $v_j$ is the $j$-th node in the node set $V$ of the input graph $G$. Based on the NCI $Y$, the following lemma presents an equivalent form of the AAMC objective function in Eq. (5).

Lemma 3.1. Given a clustering result $C_1, C_2, \cdots, C_k$, represented by NCI $Y$, the AAMC of $C_1, C_2, \cdots, C_k$ can be obtained by:

$$\frac{\sum_{i=1}^{k} \Phi(C_i)}{k} = \frac{1}{k} \cdot trace\!\left( ((YY^\top)^{-\frac{1}{2}} Y) \cdot (I - S) \cdot ((YY^\top)^{-\frac{1}{2}} Y)^\top \right). \quad (7)$$

Then, our optimization objective for $k$-AGC is transformed into:

$$\phi^* = \min_{Y \in \{0,1\}^{k \times n}} \Psi(Y), \quad (8)$$

where $\Psi(Y) = \frac{1}{k} \cdot trace(((YY^\top)^{-\frac{1}{2}} Y) \cdot (I - S) \cdot ((YY^\top)^{-\frac{1}{2}} Y)^\top)$.

Note that Eq. (8) is equivalent to Eq. (5), and yet the former is more amenable to analysis. In particular, we have the following negative result.

Lemma 3.2. The optimization problem of finding the optimal value of $Y$ for the objective function in Eq. (8) is NP-hard.

Accordingly, to devise a solution for $k$-AGC on massive graphs, we focus on approximate techniques for optimizing our objective. Observe that the NP-hardness of our objective function in Eq. (8) stems from the requirement that the elements of the NCI be binary, i.e., $Y \in \{0,1\}^{k \times n}$. Thus, we apply a common trick that relaxes the NCI elements from binary to fractional, i.e., $Y \in \mathbb{R}^{k \times n}$. The following lemma gives a sufficient condition for finding the optimal fractional NCI $Y$: the row vectors of $(YY^\top)^{-\frac{1}{2}} Y$ should be the top-$k$ eigenvectors of matrix $S$ (defined in Lemma 2.3), i.e., the $k$ eigenvectors corresponding to the $k$ largest eigenvalues of $S$.

Lemma 3.3. Assume that we relax the requirement $Y \in \{0,1\}^{k \times n}$ to $Y \in \mathbb{R}^{k \times n}$. Let $F \in \mathbb{R}^{k \times n}$ denote the matrix whose rows are the top-$k$ eigenvectors of $S$. Then, the optimal value of $Y$ for the objective $\min_{Y \in \mathbb{R}^{k \times n}} \Psi(Y)$ is obtained when $(YY^\top)^{-\frac{1}{2}} Y = F$, which leads to a value of $\Psi(Y)$ no larger than the optimum $\phi^*$ of the original objective in Eq. (8).

The optimal value of the fractional $Y$, however, does not directly correspond to a clustering solution, which requires the NCI to be binary. The following lemma points to a way to obtain a good approximation of the optimal binary $Y \in \{0,1\}^{k \times n}$.

Lemma 3.4. Given the top-$k$ eigenvectors $F$ of $S$, if we obtain a binary NCI $Y$ that satisfies

$$\min \| XF - (YY^\top)^{-\frac{1}{2}} Y \|_F \quad s.t. \quad Y \in \{0,1\}^{k \times n}, \; X^\top X = I, \quad (9)$$

then $\Psi(Y) \approx \phi^*$ in Eq. (8).

Based on Lemmata 3.3 and 3.4, to approximate the optimal binary NCI $Y$, we can first compute the top-$k$ eigenvectors $F$ of $S$, and then solve for the best NCI $Y$ that optimizes Eq. (9). The proposed algorithm ACMin follows this two-step approach.

There remain two major challenges in realizing the above idea:
• How to compute $F$ for large graphs. Note that it is prohibitively expensive to compute $F$ directly by performing eigen-decomposition on a materialized matrix $S$ (defined in Eq. (2)), which would consume $\Omega(n^2)$ space and $\Omega(n^2 k)$ time.
• Given $F$, how to efficiently compute $Y \in \{0,1\}^{k \times n}$ based on Eq. (9), which in itself is a non-trivial optimization problem.

To address these challenges, the proposed method ACMin contains three key techniques. First, to compute the top-$k$ eigenvectors $F$ of $S$, ACMin employs a scalable, iterative process based on orthogonal iterations [43], which does not need to materialize $S$. Second, to find the best NCI $Y$, ACMin applies an alternative optimization approach and randomized SVD [16] to efficiently optimize Eq. (9). Third, to accelerate the above iterative processes, ACMin includes an effective greedy algorithm to compute a high-quality initial value of $Y$, which significantly speeds up convergence in practice. Overall, ACMin only requires space and time linear to the size of the input graph $G$. The next section presents the detailed ACMin algorithm and complexity analysis.
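To illustrate Lemma 3.1, the sketch below numerically checks that the trace form of Eq. (7) matches the AAMC of Eq. (5) computed directly from $S$. The graph, $\alpha$, $\beta$, and clustering are all made up, and $S$ is obtained in closed form as $\alpha (I - (1-\alpha)T)^{-1}$ with $T = (1-\beta)P_V + \beta P_A$, which is feasible only at toy scale.

```python
import numpy as np

A = np.array([[0, 1, 1, 0], [1, 0, 1, 0], [0, 1, 0, 1], [0, 0, 1, 0]], float)
R = np.array([[1, 1, 0], [1, 0, 0], [0, 1, 1], [0, 0, 1]], float)
P_V = A / A.sum(1, keepdims=True)
P_A = (R @ R.T) / (R @ R.T).sum(1, keepdims=True)

alpha, beta, n, k = 0.2, 0.5, 4, 2
T = (1 - beta) * P_V + beta * P_A
S = alpha * np.linalg.inv(np.eye(n) - (1 - alpha) * T)   # Eq. (2), closed form

Y = np.array([[1, 1, 0, 0],       # NCI: C_1 = {v_0, v_1}, C_2 = {v_2, v_3}
              [0, 0, 1, 1]], float)

def phi(S, C):                     # Eq. (4), directly from S
    out = [j for j in range(n) if j not in C]
    return S[np.ix_(sorted(C), out)].sum() / len(C)

aamc = (phi(S, {0, 1}) + phi(S, {2, 3})) / k              # Eq. (5)

Nrm = np.diag(1 / np.sqrt(Y.sum(1))) @ Y                  # (YY^T)^{-1/2} Y
psi = np.trace(Nrm @ (np.eye(n) - S) @ Nrm.T) / k         # Eq. (7)
assert np.isclose(aamc, psi)
```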
4 ACMin ALGORITHM

This section presents the detailed ACMin algorithm, shown in Algorithm 1. In the following, Sections 4.1-4.3 detail the three most important components of ACMin: the computation of the top-$k$ eigenvectors $F$, the computation of the binary NCI $Y$, and the greedy initialization of $Y$, respectively. Section 4.4 summarizes the complete ACMin algorithm and analyzes its complexity.

Algorithm 1: ACMin
Input: $G$, $k$, $\alpha$, $\beta$.
Output: $Y$.
1. compute $\hat{R}$ by Eq. (10);
2. $Y_0 \leftarrow$ InitNCI($P_V$, $\alpha$, $k$);
3. $F_0 \leftarrow (Y_0 Y_0^\top)^{-\frac{1}{2}} Y_0$;
4. $Y \leftarrow Y_0$;
5. $\phi \leftarrow$ ApproxAAMC($P_V$, $\hat{R}$, $R$, $\alpha$, $\beta$, $Y_0$);
6. for $\ell \leftarrow 1$ to $t_e$ do
7.   $Z_\ell \leftarrow (1 - \beta) \cdot P_V F_{\ell-1}^\top + \beta \cdot \hat{R}(R^\top F_{\ell-1}^\top)$;
8.   $F_\ell \leftarrow QR(Z_\ell)$;
9.   if $F_\ell = F_{\ell-1}$ then break;
10.  $Y_\ell \leftarrow$ GenNCI($F_\ell$);
11.  $\phi_\ell \leftarrow$ ApproxAAMC($P_V$, $\hat{R}$, $R$, $\alpha$, $\beta$, $Y_\ell$);
12.  if $\phi_\ell < \phi$ then $\phi \leftarrow \phi_\ell$, $Y \leftarrow Y_\ell$;
13. return $Y$;

4.1 Computing the Top-$k$ Eigenvectors $F$
Recall from Section 3 that ACMin follows a two-step strategy that first computes $F$, the top-$k$ eigenvectors of $S$ (Eq. (2)). Since materializing $S$ is infeasible on large graphs, this subsection presents our iterative procedure for computing $F$ without materializing $S$, which corresponds to Lines 6-9 of Algorithm 1.

First of all, the following lemma reduces the problem of computing $F$ to computing the top-$k$ eigenvectors of $(1 - \beta) \cdot P_V + \beta \cdot P_A$.

Lemma 4.1. Let $F$ be the top-$k$ eigenvectors of $(1 - \beta) \cdot P_V + \beta \cdot P_A$. Then, $F$ is also the top-$k$ eigenvectors of $S$.

Computing the exact top-$k$ eigenvectors of $(1 - \beta) \cdot P_V + \beta \cdot P_A$ is still rather challenging, however, since materializing $P_A$ also requires $\Omega(n^2)$ space. To tackle this issue, ACMin applies orthogonal iterations [43], as follows. First, ACMin computes a normalized attribute vector $\hat{R}[v_i]$ for each node $v_i$ in the graph using the following equation, leading to matrix $\hat{R}$ (Line 1 of Algorithm 1):

$$\hat{R}[v_i] = \frac{R[v_i]}{R[v_i] \cdot r^\top} \;\; \forall v_i \in V, \quad \text{where } r = \sum_{v_l \in V} R[v_l]. \quad (10)$$

Comparing the above equation with Eq. (1), it follows that $P_A = \hat{R} R^\top$. Hence, $(1 - \beta) \cdot P_V + \beta \cdot P_A$ in Lemma 4.1 can be rewritten as $(1 - \beta) \cdot P_V + \beta \cdot \hat{R} R^\top$, eliminating the need to materialize $P_A$.
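As a quick sketch of Eq. (10) (again on made-up data), the normalization below yields $\hat{R}$ such that $\hat{R} R^\top$ reproduces the dense $P_A$ of Eq. (1), so a product $P_A x$ can be evaluated as $\hat{R}(R^\top x)$ in time proportional to the number of nonzeros of $R$.

```python
import numpy as np

R = np.array([[1, 1, 0], [1, 0, 0], [0, 1, 1], [0, 0, 1]], float)

r = R.sum(axis=0)                          # r = sum_{v_l} R[v_l]
R_hat = R / (R @ r)[:, None]               # Eq. (10): R_hat[v] = R[v] / (R[v] . r)

P_A_dense = (R @ R.T) / (R @ R.T).sum(1, keepdims=True)   # Eq. (1), reference
assert np.allclose(R_hat @ R.T, P_A_dense)                # P_A = R_hat R^T

x = np.ones(4) / 4
assert np.allclose(R_hat @ (R.T @ x), P_A_dense @ x)      # matrix-free P_A x
```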
Next, suppose that we are at the start of the $\ell$-th iteration (Line 6 of Algorithm 1), with $F_{\ell-1}$ obtained in the previous iteration; in the first iteration, $F_0$ is computed from an initial value $Y_0$ of $Y$, elaborated in Section 4.3. ACMin computes $Z_\ell = ((1 - \beta) \cdot P_V + \beta \cdot \hat{R} R^\top) F_{\ell-1}^\top = (1 - \beta) \cdot P_V F_{\ell-1}^\top + \beta \cdot \hat{R} \cdot (R^\top F_{\ell-1}^\top)$ (Line 7 of the algorithm), which can be done in $O(k \cdot (|E_V| + |E_R|))$ time. Then, ACMin employs QR decomposition [9] (Line 8) to decompose $Z_\ell$ into two matrices $F_\ell$ and $\Lambda_\ell$ such that $Z_\ell = F_\ell^\top \cdot \Lambda_\ell$, where $\Lambda_\ell$ is an upper-triangular matrix and $F_\ell$ is orthogonal (i.e., $F_\ell F_\ell^\top = I$). The QR decomposition step can be done in $O(nk^2)$ time, leading to $O(k \cdot (|E_V| + |E_R|) + nk^2)$ total time for one iteration in the computation of $F$.

Suppose that $F_\ell$ converges in some iteration $\ell$, i.e., $F_\ell$ is the same as $F_{\ell-1}$ (Line 9). Then, we have $Z_\ell = ((1 - \beta) \cdot P_V + \beta \cdot P_A) F_\ell^\top = F_\ell^\top \cdot \Lambda_\ell$. Considering that $\Lambda_\ell$ is upper-triangular and $F_\ell$ is orthogonal, according to [43] we conclude that the rows of $F_\ell$ are the top-$k$ eigenvectors of $(1 - \beta) \cdot P_V + \beta \cdot P_A$, and the diagonal elements of $\Lambda_\ell$ are the top-$k$ eigenvalues. By Lemma 4.1, the row vectors of $F_\ell$ are then also the top-$k$ eigenvectors of $S$.

Note that throughout the process for computing $F$, neither $S$ nor $P_A$ is materialized, which avoids the corresponding quadratic space requirement. Meanwhile, with a constant $k$, each iteration takes time linear to the size of the input graph $G$, which is far more scalable than decomposing $S$ directly. In practice, the number of required iterations can be significantly reduced through a good initialization, detailed in Section 4.3.
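The orthogonal-iteration loop (Lines 6-9 of Algorithm 1) then reduces to repeated multiply-and-orthonormalize steps. Below is a compact sketch that uses a random start instead of the InitNCI-based $F_0$ used by ACMin, and a simple subspace-change test in place of the exact equality check of Line 9.

```python
import numpy as np

def top_k_eigenvectors(P_V, R_hat, R, k, beta=0.5, t_e=200, tol=1e-9):
    """Orthogonal iterations for the top-k eigenvectors of
    (1-beta) P_V + beta * R_hat R^T, without materializing P_A."""
    n = P_V.shape[0]
    rng = np.random.default_rng(0)
    Ft = np.linalg.qr(rng.standard_normal((n, k)))[0]     # F^T, n x k
    for _ in range(t_e):
        Z = (1 - beta) * (P_V @ Ft) + beta * (R_hat @ (R.T @ Ft))  # Line 7
        Q, _ = np.linalg.qr(Z)                            # Line 8: Z = Q Lambda
        # Line 9 (relaxed): stop once the spanned subspace stops changing.
        if np.linalg.norm(Q @ (Q.T @ Ft) - Ft) < tol:
            Ft = Q
            break
        Ft = Q
    return Ft.T                                           # rows are eigenvectors
```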
4.2 Computing the Binary NCI $Y$

As described in Section 3, after obtaining the top-$k$ eigenvectors $F$ of $S$, ACMin proceeds to compute the binary NCI $Y$ by solving the optimization problem in Eq. (9). In Algorithm 1, this is done in Lines 10-12. Note that in ACMin, the computation of $Y$ is performed once in every iteration for computing $F$, rather than only once after the final value of $F$ is obtained. This is because our algorithm is approximate, and, thus, the final value of $F$ does not necessarily lead to the best clustering quality, measured by AAMC (Section 2.4). Hence, ACMin computes $Y_\ell$ and the corresponding AAMC $\phi_\ell$ in each iteration $\ell$, and updates the current best result $Y$ and $\phi$ whenever a better result is found (Lines 11-12 in Algorithm 1).

Algorithm 2: GenNCI
Input: $F$.
Output: $Y$.
1. $X' \leftarrow I$, $X \leftarrow I$, $Y \leftarrow 0$;
2. for $\ell \leftarrow 1$ to $t_a$ do
3.   for $i \leftarrow 1$ to $k$ do compute $\gamma_i$ by Eq. (15);
4.   for $v_j \in V$ do
5.     pick $c_j$ by Eq. (14);
6.     $Y[:, v_j] \leftarrow 0$, $Y[c_j, v_j] \leftarrow 1$;
7.   $U, \Sigma, V \leftarrow SVD((YY^\top)^{-\frac{1}{2}} Y F^\top)$;
8.   $X' \leftarrow X$, $X \leftarrow U \cdot V^\top$;
9.   if $X = X'$ then break;
10. return $Y$;

Next we clarify the GenNCI function, shown in Algorithm 2, which computes the binary NCI $Y_\ell \in \{0,1\}^{k \times n}$ from $F_\ell$ in the current iteration $\ell$. First, based on properties of the matrix trace, we transform the optimization objective in Eq. (9) as follows:

$$\| XF - (YY^\top)^{-\frac{1}{2}} Y \|_F^2 = 2k - 2 \cdot trace((YY^\top)^{-\frac{1}{2}} Y F^\top X^\top). \quad (11)$$

GenNCI applies an alternative optimization approach to minimize Eq. (11). Specifically, the algorithm updates the two variables $X$ and $Y$ in an alternating fashion, each time fixing one of them and updating the other, according to the following rules.

Updating $Y$ with $X$ fixed. According to Eq. (11), with $X$ fixed, the function to optimize becomes:

$$\max_{Y \in \{0,1\}^{k \times n}} trace((YY^\top)^{-\frac{1}{2}} Y F^\top X^\top). \quad (12)$$

Let $M = F^\top X^\top$. Eq. (12) is equivalent to

$$\max_{Y \in \{0,1\}^{k \times n}} \sum_{v_j \in V} \sum_{i=1}^{k} \frac{Y[C_i, v_j] \cdot M[v_j, C_i]}{\sqrt{\sum_{v_l \in V} Y[C_i, v_l]}}. \quad (13)$$

Since $Y \in \{0,1\}^{k \times n}$, for each column $Y[:, v_j]$ ($v_j \in V$), we set the entry at position $c_j$ of $Y[:, v_j]$ (i.e., $Y[c_j, v_j]$) to 1, and 0 everywhere else, where $c_j$ is picked greedily as follows:

$$c_j = \arg\max_{1 \le i \le k} \left[ \frac{(1 - Y[C_i, v_j]) \cdot M[v_j, C_i]}{\sqrt{\gamma_i^2 + 1}} + \frac{Y[C_i, v_j] \cdot M[v_j, C_i]}{\gamma_i} \right], \quad (14)$$

where

$$\gamma_i = \sqrt{\sum_{v_z \in V} Y[C_i, v_z]}, \quad (15)$$

meaning that we always update each column $Y[:, v_j]$ ($v_j \in V$) such that the objective function in Eq. (12) is maximized. Since both $M = F^\top X^\top$ and the $\gamma_i$ can be precomputed at the beginning of each iteration, which takes $O(nk^2)$ time, it takes $O(nk)$ time to update the whole $Y$ in each iteration.

Updating $X$ with $Y$ fixed. According to Eq. (11), with $Y$ fixed, the function to optimize becomes:

$$\max_{X^\top X = I} trace((YY^\top)^{-\frac{1}{2}} Y F^\top X^\top). \quad (16)$$

The following lemma shows that the optimal $X$ in Eq. (16) can be obtained via the singular value decomposition (SVD) of the matrix $(YY^\top)^{-\frac{1}{2}} Y F^\top$.

Lemma 4.2. The optimal solution to the objective function in Eq. (16) is $X = UV^\top$, where $U$ and $V$ are the left and right singular vectors of $(YY^\top)^{-\frac{1}{2}} Y F^\top$, respectively.

To compute the SVD of $(YY^\top)^{-\frac{1}{2}} Y F^\top \in \mathbb{R}^{k \times k}$, GenNCI employs the randomized SVD algorithm [16], which finishes in $O(k^3)$ time.

With the above update rules for $X$ and $Y$, GenNCI (Algorithm 2) iteratively updates $X$ and $Y$ for a maximum of $t_a$ iterations (Lines 2-9). In our experiments, we found that setting $t_a$ to 50 usually leads to satisfactory performance. Note that the iterations may converge in fewer than $t_a$ rounds (Line 9). Since updating $Y$ and $X$ takes $O(nk^2)$ and $O(k^3)$ time respectively, GenNCI terminates within $O(t_a \cdot (nk^2 + k^3))$ time.
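A condensed sketch of GenNCI's two update rules follows. Unlike Algorithm 2, it recomputes the $\gamma_i$ after every column update (a slightly more eager variant), uses NumPy's exact SVD in place of randomized SVD, and adds small guards against empty clusters.

```python
import numpy as np

def gen_nci(F, t_a=50):
    """Alternating optimization of Eq. (9): greedy Y update (Eq. (14)),
    then rotation update X = U V^T (Lemma 4.2)."""
    k, n = F.shape
    X = np.eye(k)
    Y = np.zeros((k, n))
    for _ in range(t_a):
        M = F.T @ X.T                                 # n x k, with X fixed
        for j in range(n):
            gamma = np.sqrt(Y.sum(axis=1))            # Eq. (15)
            add = M[j] / np.sqrt(gamma ** 2 + 1)      # v_j would join cluster i
            stay = M[j] / np.maximum(gamma, 1e-12)    # v_j already in cluster i
            gain = np.where(Y[:, j] > 0, stay, add)   # Eq. (14)
            Y[:, j] = 0
            Y[np.argmax(gain), j] = 1
        Nrm = Y / np.maximum(np.sqrt(Y.sum(1, keepdims=True)), 1e-12)
        U, _, Vt = np.linalg.svd(Nrm @ F.T)           # SVD of (YY^T)^{-1/2} Y F^T
        X_new = U @ Vt                                # Lemma 4.2
        if np.allclose(X_new, X):                     # converged
            break
        X = X_new
    return Y
```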
4.3 Greedy Initialization of $Y$

Next we clarify the computation of the initial value $Y_0$ of the NCI (Line 2 of Algorithm 1). If we simply assigned random values to the elements of $Y_0$, the iterative process in Lines 6-12 of ACMin would converge slowly. To address this issue, we propose an effective greedy initialization technique, InitNCI, which usually leads to fast convergence of ACMin in practice, as demonstrated in our experiments in Section 5.4.
Algorithm 3: InitNCI
Input: $P_V$, $\alpha$, $k$.
Output: $Y_0$.
1. $Y_0 \leftarrow 0$, $V_c \leftarrow \emptyset$;
2. $V_c' \leftarrow \{v_{c_1}, v_{c_2}, \cdots, v_{c_{5k}}\}$, where $v_{c_i}$ is the node in $V$ with the $i$-th largest in-degree;
3. $\Pi_0 \leftarrow I[:, V_c']$, $t \leftarrow \frac{1}{\alpha}$;
4. for $\ell \leftarrow 1$ to $t$ do
5.   $\Pi_\ell \leftarrow (1 - \alpha) \cdot P_V \Pi_{\ell-1} + \Pi_0$;
6. $\Pi_t \leftarrow \alpha \cdot \Pi_t$;
7. for $v_j \in V_c'$ do compute $\sum_{v_i \in V} \Pi_t[v_i, v_j]$;
8. select the top-$k$ nodes $v_j \in V_c'$ with the largest $\sum_{v_i \in V} \Pi_t[v_i, v_j]$ into $V_c$ as the $k$ center nodes;
9. for $v_i \in V$ do select $v_{c_j} \in V_c$ with the largest $\Pi_t[v_i, v_{c_j}]$, and set $Y_0[j, v_i] \leftarrow 1$;
10. return $Y_0$;

Given a cluster $C$, recall that its attributed multi-hop conductance $\Phi(C)$ (Eq. (4)) is defined based on the intuition that $\Phi(C)$ is lower when an attributed random walk from any node in $C$ is more likely to stop at a node within $C$. Further, we observe that in practice, a high-quality cluster $C$ tends to have high intra-cluster connectivity via certain center nodes within $C$, and such a center node usually has a high in-degree (i.e., many in-neighbors). In other words, the nodes belonging to the same cluster tend to have many paths to the center node of the cluster, and consequently, a random walk with restart (RWR) [25, 47] within a cluster is more likely to stop at the center node [46]. Based on these intuitions, we propose to leverage the graph topology (i.e., $V$ and $E_V$ of the input attributed graph $G$) as well as RWR to quickly identify $k$ likely cluster center nodes, $V_c = \{v_{c_1}, v_{c_2}, \cdots, v_{c_k}\} \subseteq V$, and greedily initialize the NCI $Y_0$ by assigning each node in $V$ to a center node according to their topological relationships.

Algorithm 3 presents the pseudo-code of InitNCI. After initializing $Y_0$ to a $k \times n$ zero matrix and $V_c$ to an empty set at Line 1, the method first selects from $V$ a candidate set $V_c'$ of size $5k$ (Line 2), consisting of the top-$5k$ nodes with the largest in-degrees. The nodes in $V_c'$ serve as the candidates for the $k$ center nodes to be detected. Then we compute the $t$-hop RWR value $\Pi_t[v_i, v_j]$ from every node $v_i \in V$ to every node $v_j \in V_c'$ in Lines 3-6, according to the following equation [59]:

$$\Pi_t = \sum_{\ell=0}^{t} \alpha (1 - \alpha)^\ell P_V^\ell \cdot I[:, V_c']. \quad (17)$$

In particular, we set $t = \frac{1}{\alpha}$ at Line 3, which is the expected length of an RWR and is usually sufficient for our purpose. If $\Pi_t[v_i, v_j]$ is large, the random walks starting from $v_i$ are likely to stop at $v_j$, which matches our aforementioned intuition about likely cluster center nodes.

Then, at Line 7, for each candidate center node $v_j \in V_c'$, we compute the sum of $\Pi_t[v_i, v_j]$ over all nodes $v_i \in V$. If $v_j$ has a larger $\sum_{v_i \in V} \Pi_t[v_i, v_j]$, the random walks starting from the nodes in $V$ are more likely to stop at $v_j$. Therefore, at Line 8, we select the top-$k$ nodes $v_j \in V_c'$ with the largest $\sum_{v_i \in V} \Pi_t[v_i, v_j]$ as the $k$ center nodes in $V_c$. At Line 9, for each node $v_i \in V$, we select the center node $v_{c_j} \in V_c$ with the largest $\Pi_t[v_i, v_{c_j}]$ and greedily assign $v_i$ to the $j$-th cluster by setting $Y_0[j, v_i]$ to 1, completing the computation of $Y_0$.
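A sketch of InitNCI on a dense $P_V$ follows; it mirrors Algorithm 3, using the sparsity pattern of $P_V$ as a stand-in for in-degrees (same pattern as $A$), and the candidate slice simply saturates on tiny graphs with fewer than $5k$ nodes.

```python
import numpy as np

def init_nci(P_V, k, alpha=0.2):
    """Greedy NCI initialization (Algorithm 3): RWR toward high in-degree
    candidates, keeping the k most 'absorbing' candidates as centers."""
    n = P_V.shape[0]
    indeg = (P_V > 0).sum(axis=0)                     # in-degrees (Line 2)
    cand = np.argsort(-indeg)[:5 * k]                 # candidate set V'_c
    Pi0 = np.zeros((n, len(cand)))
    Pi0[cand, np.arange(len(cand))] = 1               # I[:, V'_c] (Line 3)
    Pi = Pi0.copy()
    for _ in range(int(round(1 / alpha))):            # t = 1/alpha hops
        Pi = (1 - alpha) * (P_V @ Pi) + Pi0           # Lines 4-5
    Pi *= alpha                                       # Eq. (17), Line 6
    order = np.argsort(-Pi.sum(axis=0))[:k]           # k center columns (L7-8)
    Y0 = np.zeros((k, n))
    Y0[np.argmax(Pi[:, order], axis=1), np.arange(n)] = 1   # Line 9
    return Y0
```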
Note that Line 2 of Algorithm 3 takes $O(n + 5k \log n)$ time, and the computation of $\Pi_t$ requires $O(\frac{5k}{\alpha} \cdot |E_V|)$ time. Therefore, InitNCI runs in $O(\frac{k}{\alpha} \cdot |E_V|)$ time.

4.4 ACMin Algorithm and Analysis

Algorithm 4: ApproxAAMC
Input: $P_V$, $\hat{R}$, $R$, $\alpha$, $\beta$, $Y$.
Output: $\phi$.
1. $H_0 \leftarrow ((YY^\top)^{-\frac{1}{2}} Y)^\top$, $t \leftarrow \frac{1}{\alpha}$;
2. for $\ell = 1$ to $t$ do
3.   $H_\ell \leftarrow (1 - \alpha) \cdot ((1 - \beta) \cdot P_V H_{\ell-1} + \beta \cdot \hat{R}(R^\top H_{\ell-1})) + H_0$;
4. $\phi \leftarrow \frac{1}{k} \cdot \sum_{i=1}^{k} H_0[:, i]^\top \cdot (H_0[:, i] - \alpha \cdot H_t[:, i])$;
5. return $\phi$;
Algorithm 1 summarizes the pseudo-code of ACMin, which takes as input an attributed graph $G$, the number of clusters $k$, the random walk stopping probability $\alpha$, and the attributed branching probability $\beta$ (defined in Definition 2.2). Initially (Line 1), ACMin computes the matrix $\hat{R}$, explained in Section 4.1. Then (Line 2), ACMin computes an initial value $Y_0$ for $Y$ via InitNCI (Algorithm 3), and derives the corresponding value $F_0$ for $F$ according to Lemma 3.3 in Line 3. Next (Line 5), we invoke ApproxAAMC (Algorithm 4), which uses $Y_0$ to compute $\phi$, the best AAMC obtained so far. Note that the exact AAMC $\phi = \Psi(Y)$ in Eq. (8) is hard to evaluate, since $S$ in Eq. (2) is the sum of an infinite series. Instead, ApproxAAMC truncates the series in Eq. (2) after a finite number $t = \frac{1}{\alpha}$ of terms to obtain an approximate AAMC, since the expected length of an attributed random walk is $\frac{1}{\alpha}$. Specifically, given $P_V$, $\hat{R}$, $R$, $\alpha$, $\beta$ and $Y$ as inputs, ApproxAAMC first initializes $H_0$ as $((YY^\top)^{-\frac{1}{2}} Y)^\top$ and the number of iterations $t$ as $\frac{1}{\alpha}$ (Line 1 of Algorithm 4). Then, it computes the intermediate result $H_t$ by $t$ iterations in Lines 2-3. Lastly, ApproxAAMC computes $\phi$ from $H_t$ and $H_0$ at Line 4. Algorithm 4 takes $O(\frac{k}{\alpha} \cdot (|E_V| + |E_R|))$ time with the precomputed $\hat{R}$.
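ApproxAAMC admits an equally short sketch: truncate the series at $t = 1/\alpha$ hops, propagate $H$ through the matrix-free operator, and read off $\phi$. The version below assumes $Y$ has no empty cluster.

```python
import numpy as np

def approx_aamc(P_V, R_hat, R, Y, alpha=0.2, beta=0.5):
    """Approximate Psi(Y) (Eq. (8)) per Algorithm 4, truncating Eq. (2)."""
    k = Y.shape[0]
    H0 = (Y / np.sqrt(Y.sum(axis=1, keepdims=True))).T   # ((YY^T)^{-1/2} Y)^T
    H = H0.copy()
    for _ in range(int(round(1 / alpha))):               # t = 1/alpha iterations
        H = (1 - alpha) * ((1 - beta) * (P_V @ H)
                           + beta * (R_hat @ (R.T @ H))) + H0
    # phi = (1/k) * sum_i H0[:, i]^T (H0[:, i] - alpha * H[:, i])
    return float(np.sum(H0 * (H0 - alpha * H)) / k)
```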
Utilizing GenNCI and ApproxAAMC, ACMin obtains the binary NCI $Y_\ell$ and its corresponding quality measure $\phi_\ell$ in each iteration $\ell$, after obtaining $F_\ell$. ACMin may terminate upon convergence, or upon reaching a preset maximum number of iterations $t_e$. In our experiments, we found that $t_e = 200$ is usually sufficiently large for convergence.

Next we analyze the total time and space complexities of ACMin. The computation of $\hat{R}$ at Line 1 of Algorithm 1 takes $O(|E_R|)$ time. Algorithm 4 requires $O(nk + \frac{k}{\alpha} \cdot (|E_V| + |E_R|))$ time. In each iteration (Lines 6-12), Line 7 takes $O(k \cdot (|E_V| + |E_R|))$ time and the QR decomposition of $Z_\ell$ takes $O(nk^2)$ time. According to Sections 4.2 and 4.3, GenNCI and InitNCI run in $O(t_a \cdot (nk^2 + k^3))$ and $O(\frac{k}{\alpha} \cdot |E_V|)$ time, respectively. Thus, when $k \ll n$, the total time complexity of ACMin is $O\!\left(\frac{k\, t_e}{\alpha} \cdot (|E_V| + |E_R|) + nk^2 t_a t_e + \frac{k}{\alpha} \cdot |E_V|\right)$, which equals $O(|E_V| + |E_R|)$ when $t_a$, $t_e$ and $k$ are regarded as constants. The space overhead incurred by ACMin is determined by the storage of $P_V$, $\hat{R}$, $R$, $Z_\ell$, $F_\ell$ and $H_\ell$, which is bounded by $O(|E_V| + |E_R| + nk)$.

Table 2: Datasets (K = 10^3, M = 10^6, B = 10^9).

Name                            |V|     |E_V|    |R|     |E_R|    |C|
Cora [29, 52, 53, 60, 64]       2.7K    5.4K     1.4K    49.2K    7
Citeseer [29, 52, 53, 60, 64]   3.3K    4.7K     3.7K    105.2K   6
Pubmed [52, 60, 64, 66]         19.7K   44.3K    0.5K    988K     3
Flickr [24, 29, 34, 58, 60]     7.6K    479.5K   12.1K   182.5K   9
TWeibo [60]                     2.3M    50.7M    1.7K    16.8M    8
MAG-Scholar-C [3]               10.5M   265.2M   2.78M   1.1B     8
5 EXPERIMENTS

We experimentally evaluate ACMin against 11 competitors in terms of both clustering quality and efficiency on 6 real-world datasets. All experiments are conducted on a Linux machine powered by an Intel Xeon(R) Gold CPU @2.60GHz and 377GB RAM. The source codes of all competitors were obtained from their respective authors.

5.1 Experimental Settings
Datasets. Table 2 shows the statistics of the 6 real-world directed attributed graphs used in our experiments. $|V|$ and $|E_V|$ denote the numbers of nodes and edges, while $|R|$ and $|E_R|$ represent the numbers of attributes and node-attribute associations, respectively. $|C|$ is the number of ground-truth clusters. In particular, Cora, Citeseer, Pubmed and MAG-Scholar-C are citation graphs, in which each node represents a paper and each edge denotes a citation relationship. Flickr and TWeibo are social networks, in which each node represents a user and each directed edge represents a following relationship. All 6 datasets come with ground-truth cluster labels, and the number of ground-truth clusters $|C|$ is included in Table 2.
Competitors. We compare ACMin with 11 competitors, including 7 $k$-AGC algorithms (CSM [36], SA-Cluster [67], BAGC [54], MGAE [53], CDE [29], AGCC [64], USC [50]) and 4 recent attributed network embedding algorithms (TADW [55], PANE [60], LQANR [56], PRRE [66]). The network embedding competitors are used together with $k$-Means to produce clustering results. USC [50], in particular, is the classic unnormalized spectral clustering method, which works directly on $S$ to extract clusters by materializing $S$, computing the top-$k$ eigenvectors of $S$, and then applying $k$-Means to the eigenvectors.
Parameter settings. We adopt the default parameter settings of all competitors as suggested in their corresponding papers. For the attributed network embedding competitors, we set the embedding dimensionality to 128. For ACMin, we set $t_e = 200$, $t_a = 50$, $\alpha = 0.2$, and $\beta = 0.35$. The competitor USC shares the same settings of $\alpha$, $\beta$, and $t_e$ with ACMin.
Evaluation criteria. For efficiency evaluation, we vary the number of clusters $k$ over five settings, and report the running time (seconds) of each method on each dataset in Section 5.2. The reported running time does not include the time for loading datasets. We terminate a method if it fails to return results within 3 days. In terms of clustering quality, we report in Section 5.3 the proposed AAMC measure (i.e., average attributed multi-hop conductance), modularity [38], CA (clustering accuracy with respect to ground-truth labels), and NMI (normalized mutual information) [1]. Note that AAMC considers both graph topology and node attributes to measure clustering quality, while modularity only considers graph topology. Also note that CA and NMI rely on ground-truth clusters, while AAMC and modularity do not. Therefore, when evaluating by CA and NMI, we set $k$ to $|C|$ in Table 2 for each dataset; when evaluating by modularity and AAMC, we vary $k$ over the same five settings.

Dataset and code sources: http://linqs.soe.ucsc.edu/data (accessed October, 2020); https://figshare.com/articles/dataset/mag_scholar/12696653 (accessed October, 2020); https://github.com/xhuang31/LANE (accessed October, 2020).
Figure 2: Running time with varying $k$ on (a) Cora, (b) Citeseer, (c) Pubmed, (d) Flickr, (e) TWeibo, (f) MAG-Scholar-C; y-axes report running time in seconds (best viewed in color). [Figure omitted.]

Table 3: CA, NMI and AAMC with respect to ground truth (large CA and NMI, and small AAMC, indicate high clustering quality), reported per dataset for TADW, LQANR, PRRE, PANE, CSM, SA-Cluster, BAGC, MGAE, CDE, AGCC, USC, and ACMin. Only the ground-truth reference row is reproduced below:

              Cora      Citeseer  Pubmed    Flickr    TWeibo    MAG-Scholar-C
CA / NMI      1.0/1.0   1.0/1.0   1.0/1.0   1.0/1.0   1.0/1.0   1.0/1.0
AAMC          0.546     0.531     0.505     0.691     0.719     0.63
ACMin CA (clustering accuracy with respect to ground truth labels) and NMI (normalized mutual information) [1] to measure the cluster-ing quality in Section 5.3. Note that AAMC considers both graphtopology and node attributes to measure clustering quality, whilemodularity only considers graph topology. Also, note that CA andNMI rely on ground-truth clusters, while AAMC and modularitydo not. Therefore, when evaluating by CA and NMI, we set π to be | πΆ | as in Table 2 for each dataset; when evaluating by modularityand AAMC, we vary π in { , , , , } . Figure 2 presents the running time of all methods on all datasetswhen varying the number of clusters π in { , , , , } . The π¦ -axis is the running time (seconds) in log-scale. As shown in Figure 2, ACMin is consistently faster than all competitors on all datasets, of-ten by up to orders of magnitude.
ACMin is highly efficient on largeattributed graphs, e.g. , TWeibo and
MAG-Scholar-C in Figures 2e and2f, while most of the 11 competitors fail to return results withinthree days. For instance, in Figure 2e, when π = ACMin needs 630seconds to finish, which is 7 . Γ faster than AGCC (4634 seconds)and 71 Γ faster than PANE (44658 seconds), respectively. Further,
ACMin is the only method able to finish on
MAG-Scholar-C datasetthat has 265.2 million edges and 1.1 billion attribute values. Specifi-cally,
ACMin only needs 1.68 hours when π =
5. The high efficiencyof
ACMin on massive real datasets is due to the its high scalablealgorithmic components, whose total cost is linear to the size of theinput graph as analyzed in Section 4.4. On small/moderate-sized at-tributed graphs in Figures 2a-2d,
ACMin is also significantly fasterthan the competitors, especially when π is small. For instance, when π =
10, on
Flickr in Figure 2d,
ACMin takes 4 seconds, while thefastest competitor
PANE needs 381 seconds. Note that the total run-ning time of
ACMin increases linearly with π , which is consistent with our time complexity analysis in Section 4.4 when the numberof edges | πΈ π | + | πΈ π | far exceeds the number of nodes π . The runningtime results for the 4 attributed network embedding competitors( i.e. , TADW , LQANR , PRRE , and
PANE ) are not sensitive to π , sincetheir cost is dominated by the node embedding computation ratherthan π - Means . Even on settings with a large π , ACMin is still fasterthan all these competitors.
5.3 Quality Evaluation

CA, NMI and AAMC against ground truth. Table 3 reports the CA, NMI, and AAMC scores of all methods with respect to the ground-truth clusters of each dataset in Table 2. We also report the AAMC values of the ground-truth labels, which are lower than the AAMC obtained by all methods except ACMin, which explicitly aims to minimize AAMC. Meanwhile, we observe that the relative performance of all methods measured by AAMC generally agrees with CA. These results demonstrate that AAMC effectively reflects clustering quality.

ACMin clearly and consistently achieves the best CA, NMI, and AAMC on all datasets. On the small attributed graphs, i.e., Cora, Citeseer and Pubmed, ACMin improves CA over the best competitors (underlined in Table 3) by clear margins. The results of ACMin are also significantly better than the competitors on the moderate-sized and large attributed graphs (i.e., Flickr, TWeibo, and MAG-Scholar-C). For instance, on Flickr, ACMin attains a CA of about 75%, which is 28.6 percentage points higher than that of the best competitor AGCC (about 47%). On TWeibo, ACMin is slightly better than AGCC; note that on this dataset, ACMin is orders of magnitude faster than AGCC, as shown in Figure 2e. Hence, ACMin is overall preferable to AGCC in these settings. Finally, ACMin is the only $k$-AGC method capable of handling MAG-Scholar-C, achieving a CA of about 65% and an NMI of about 49%. The superior performance of ACMin demonstrates the effectiveness of the proposed AAMC optimization objective in Section 2, as well as our approximate solution to this optimization program described in Section 4.
Figure 3: AAMC with varying $k$ on (a) Cora, (b) Citeseer, (c) Pubmed, (d) Flickr, (e) TWeibo, (f) MAG-Scholar-C (best viewed in color). [Figure omitted.]

Figure 4: Modularity with varying $k$ on the same six datasets (best viewed in color). [Figure omitted.]

Figure 5: AAMC of ACMin versus ACMin-RI over iterations, with varying $t_e$, on the same six datasets (best viewed in color). [Figure omitted.]
AAMC with varying $k$. Figure 3 reports the AAMC achieved by ACMin and all competitors on all datasets when varying the number $k$ of clusters. Observe that ACMin consistently produces the smallest AAMC under all $k$ settings on all datasets (smaller AAMC indicates better results), which confirms that the proposed ACMin algorithm (Algorithm 1) effectively minimizes the AAMC objective defined in Section 2.4. In particular, as shown in Figure 3, ACMin achieves AAMC better than the best competitor by a visible margin on Cora, Citeseer, Pubmed and Flickr, and by 2.6% on TWeibo. Figure 3f reports the AAMC achieved by ACMin on MAG-Scholar-C, where it is the only method able to return results. Further, considering that the relative performance of all methods measured by CA and NMI generally agrees with that measured by AAMC (Table 3), and the fact that ACMin is far more efficient and scalable than its competitors (Section 5.2), we conclude that ACMin is the method of choice for $k$-AGC on massive graphs in practice.

Modularity with varying $k$. Figure 4 reports the modularity of all methods on all datasets when varying $k$. Again, observe that for all settings of $k$ and all datasets except TWeibo, ACMin has the highest modularity, with improvements of up to 5% over the best competitor on Cora, and clear gains on Citeseer, Pubmed and Flickr as well. Note that modularity only considers graph topology and ignores node attributes, so it may not fully reflect the clustering quality of attributed graphs. This may explain why on TWeibo the modularity of ACMin is slightly lower than that of some competitors. Even so, ACMin still achieves high modularity in most cases, meaning that the proposed attributed random walk model preserves graph topological features for clustering, in addition to node attributes.
5.4 Convergence of ACMin

In this section, we evaluate the convergence properties of ACMin, focusing on the effect of the greedy initialization technique InitNCI (Section 4.3) on convergence speed. In particular, we compare ACMin with an ablated version, ACMin-RI, that replaces InitNCI at Line 2 of Algorithm 1 with a random initialization of $Y_0$. The number $k$ of clusters to be detected is set to $|C|$ in Table 2 for each dataset. Figure 5 reports the AAMC (i.e., $\Psi(Y)$ in Eq. (8)) produced by ACMin and ACMin-RI per iteration (Lines 6-12 in Algorithm 1), with $t_e$ set to 200. Observe that the AAMC produced by ACMin decreases significantly faster than that of ACMin-RI in the early iterations, and also converges sooner. For instance, on Citeseer (Figure 5b), ACMin requires about 80 iterations to reach a plateaued AAMC, while ACMin-RI needs 140 iterations. Moreover, InitNCI helps ACMin achieve a lower AAMC at convergence, as shown in Figure 5. This experimental evaluation demonstrates the efficiency and effectiveness of the greedy initialization technique proposed in Section 4.3.

6 RELATED WORK

Attributed graph clustering has been extensively studied in the literature, as surveyed in [4, 5, 11]. In the following, we review the existing methods that are most relevant to this work.
Edge-weight-based clustering. A classic methodology is to convert the input attributed graph into a weighted graph by assigning each edge a weight based on the attribute and topological similarity between its two endpoints; then, traditional weighted graph clustering algorithms are applied directly [7, 33, 37, 42, 45]. For instance, Neville et al. [37] assign a weight to each edge $(u, v)$ of the input attributed graph $G$ based on the number of attribute values that $u$ and $v$ have in common, and construct a weighted graph $G'$. They then apply classic spectral clustering [50] over $G'$ to produce clusters. However, these methods only consider the attributes of directly connected node pairs and use hand-crafted weights to represent attributes, and thus yield inferior clustering quality.
Distance-based clustering. Existing distance-based clustering solutions construct a distance matrix $M$ by combining the topological and attribute similarities between nodes, and then apply classic distance-based clustering methods, such as $k$-Means [19] and $k$-Medoids [41], on $M$ to generate clusters. For instance, SA-Cluster [67] extends the original input attributed graph $G$ to an attribute-augmented graph $G'$ by treating each attribute as a node, and then samples random walks over $G'$ to compute the distance between nodes in $G'$, in order to construct $M$, which is fed into a $k$-Centroids method to generate clusters. Further, DCom [7] applies hierarchical agglomerative clustering on a constructed distance matrix. CSM [36] computes the distance matrix $M$ based on a shortest-path strategy that considers both structural and attribute relevance among nodes, and applies $k$-Medoids over $M$ to generate clusters. ANCA [10] applies $k$-Means to the sum of the eigenvectors of the distance and similarity matrices to generate clusters. Distance-based clustering methods suffer from severe efficiency issues, since they must compute the distance of every node pair, resulting in $O(n^2)$ time and space overheads, which is prohibitive in practice. For instance, as shown in our experiments, both SA-Cluster and CSM suffer from costly running time and poor clustering quality.
Probabilistic-model-based clustering. Based on the assumption that the structure, attributes, and clusters of attributed graphs are generated according to certain parametric distributions, a collection of probabilistic-model-based clustering methods statistically infer a probabilistic model for attributed graph clustering, in order to generate clustering results. In particular, PCL-DC [62] combines a conditional model of node popularity and a discriminative model that reduces the impact of irrelevant attributes into a unified model, and then finds the clustering result that optimizes the model. CohsMix [63] formulates the clustering problem with the MixNet model [40] and then utilizes a variant of the EM algorithm to optimize it, in order to generate clustering results. BAGC [54] designs a generative Bayesian model [8] of the graph based on the adjacency matrix $A$ and attribute matrix $X$, and aims to find a clustering result $C$ maximizing the posterior probability $P(C \mid A, X)$. Note that the optimization process for estimating the likelihood parameters in these probabilistic-model-based clustering methods often incurs substantial time overheads, as validated in our experiments (Section 5.2).
Embedding-based methods. In recent years, a plethora of network embedding techniques have been proposed for attributed graphs. The objective of network embedding is to learn an embedding vector for each node such that the graph topology and attribute information surrounding the node are preserved; traditional clustering methods (e.g., $k$-Means [19], $k$-Medoids [41]) can then be applied directly to the embedding vectors to generate clusters. AA-Cluster [2] builds a weighted graph based on graph topology and node attributes, and then applies network embedding on the weighted graph to generate embeddings. MGAE [53] proposes a marginalized graph convolutional network to learn embeddings. CDE [29] learns node embeddings by optimizing a non-negative matrix factorization problem based on community structure embeddings and node attributes. DAEGC [52] fuses graph topology and node attributes via an attention-based autoencoder [48] to obtain embeddings, and then generates soft labels to guide a self-training graph clustering procedure. AGCC [64] utilizes an adaptive graph convolution method to learn embeddings, and then applies spectral clustering to the similarity matrix computed from the learnt embeddings to obtain clusters. The above methods either incur immense overheads in learning embeddings or suffer from unsatisfactory clustering quality. Many more attributed network embedding methods exist, e.g., [17, 31, 34, 55-57, 60, 66]; however, most of them are not specifically designed for clustering, leading to suboptimal clustering quality, as demonstrated in our experiments comparing against TADW, LQANR, PRRE and PANE.
PANE . This paper presents
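For reference, the decoupled two-stage pipeline criticized above amounts to the following sketch, where the random matrix Z is merely a placeholder for embeddings produced by any of the cited methods (the embedding step itself is omitted):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
n, d, k = 1000, 64, 5

# Placeholder embeddings standing in for the output of an attributed
# network embedding method (e.g., one of [17, 31, 34, 55-57, 60, 66]).
Z = rng.normal(size=(n, d))

# Stage 2: generic k-Means, receiving no clustering-specific signal from stage 1.
labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(Z)
print(np.bincount(labels))
```

The design weakness is visible in the structure itself: the embedding objective and the clustering objective are optimized independently, so nothing steers the learned representation toward well-separated clusters.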
CONCLUSION
This paper presents ACMin, an effective and scalable solution for $k$-AGC computation. ACMin achieves high scalability and effectiveness through a novel problem formulation based on the proposed attributed multi-hop conductance measure of cluster quality, a carefully designed iterative optimization framework, and an effective greedy clustering initialization method. Extensive experiments demonstrate that ACMin achieves substantial performance gains over the previous state of the art in terms of both efficiency and clustering quality. Regarding future work, we plan to study parallelized versions of ACMin, running on multi-core CPUs and GPUs, as well as in a distributed setting with multiple servers, in order to handle even larger datasets. Meanwhile, we intend to extend ACMin to handle attributed heterogeneous graphs with different types of nodes and edges.
ACKNOWLEDGMENTS
This work is supported by the National University of Singapore under SUG grant R-252-000-686-133, and by NPRP grant

A PROOFS
Proof of Lemma 2.3. Let $\pi_\ell(v_i, v_j)$ be the probability that an attributed random walk starting from $v_i$ stops at $v_j$ at the $\ell$-th hop. We first prove that
$$\pi_\ell(v_i, v_j) = \alpha(1-\alpha)^{\ell} \cdot \big((1-\beta)\cdot \mathbf{P}_V + \beta\cdot \mathbf{P}_R\big)^{\ell}[v_i, v_j]. \tag{18}$$
Note that if Eq. (18) holds, the overall probability that an attributed random walk from $v_i$ terminates at $v_j$ is $\sum_{\ell=0}^{\infty} \pi_\ell(v_i, v_j) = \mathbf{S}[v_i, v_j]$, which establishes the equivalence in Eq. (2). To this end, we prove Eq. (18) by induction. First, consider the initial case, in which the attributed random walk terminates at the source node $v_i$ with probability $\alpha$: we have $\pi_0(v_i, v_j) = \alpha$ if $v_j = v_i$ and $\pi_0(v_i, v_j) = 0$ otherwise, which is identical to the r.h.s. of Eq. (18) when $\ell = 0$. Therefore, Eq. (18) holds when $\ell = 0$. Assume that Eq. (18) holds at the $\ell'$-th hop. Then the probability that an attributed random walk from $v_i$ visits any node $v_k \in V$ at the $\ell'$-th hop is $(1-\alpha)^{\ell'} \cdot ((1-\beta)\cdot \mathbf{P}_V + \beta\cdot \mathbf{P}_R)^{\ell'}[v_i, v_k]$. Based on this assumption, for the case $\ell = \ell' + 1$, with probability $1-\alpha$ the walk navigates from $v_k$ to node $v_j$ with probability $(1-\beta)\cdot \mathbf{P}_V[v_k, v_j] + \beta\cdot \mathbf{P}_R[v_k, v_j]$, and finally stops at $v_j$ with probability $\alpha$. Thus,
$$\pi_{\ell'+1}(v_i, v_j) = \sum_{v_k \in V} (1-\alpha)^{\ell'} \cdot \big((1-\beta)\cdot \mathbf{P}_V + \beta\cdot \mathbf{P}_R\big)^{\ell'}[v_i, v_k] \cdot (1-\alpha)\,\alpha\,\big((1-\beta)\cdot \mathbf{P}_V[v_k, v_j] + \beta\cdot \mathbf{P}_R[v_k, v_j]\big) = \alpha(1-\alpha)^{\ell'+1} \cdot \big((1-\beta)\cdot \mathbf{P}_V + \beta\cdot \mathbf{P}_R\big)^{\ell'+1}[v_i, v_j],$$
which completes the proof. □
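Summing Eq. (18) over all hops gives the closed form $\mathbf{S} = \sum_{\ell=0}^{\infty} \alpha(1-\alpha)^{\ell}\,\mathbf{P}^{\ell} = \alpha(\mathbf{I} - (1-\alpha)\mathbf{P})^{-1}$ with $\mathbf{P} = (1-\beta)\mathbf{P}_V + \beta\mathbf{P}_R$. The following numeric sketch, with random row-stochastic matrices standing in for $\mathbf{P}_V$ and $\mathbf{P}_R$, checks a truncated series against this closed form:

```python
import numpy as np

rng = np.random.default_rng(0)
n, alpha, beta = 6, 0.2, 0.35

def random_stochastic(n):
    """Random row-stochastic matrix standing in for a transition matrix."""
    M = rng.random((n, n))
    return M / M.sum(axis=1, keepdims=True)

P = (1 - beta) * random_stochastic(n) + beta * random_stochastic(n)

# Truncated series: sum_{l=0..t} alpha (1-alpha)^l P^l (Eq. (18) summed over hops).
S_series, term = np.zeros((n, n)), alpha * np.eye(n)
for _ in range(200):
    S_series += term
    term = (1 - alpha) * term @ P

# Closed form of the geometric series.
S_closed = alpha * np.linalg.inv(np.eye(n) - (1 - alpha) * P)
print(np.max(np.abs(S_series - S_closed)))  # ~1e-16: the two agree
```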
Proof of Lemma 3.1. By Eq. (6), for cluster $C_i$, we have the vector $((\mathbf{Y}\mathbf{Y}^{\top})^{-\frac{1}{2}}\mathbf{Y})[i]$, where each entry $((\mathbf{Y}\mathbf{Y}^{\top})^{-\frac{1}{2}}\mathbf{Y})[i, v_j] = 1/\sqrt{|C_i|}$ if $v_j \in C_i$ and $((\mathbf{Y}\mathbf{Y}^{\top})^{-\frac{1}{2}}\mathbf{Y})[i, v_j] = 0$ otherwise. Note that
$$2 \cdot ((\mathbf{Y}\mathbf{Y}^{\top})^{-\frac{1}{2}}\mathbf{Y})[i] \cdot (\mathbf{I}-\mathbf{S}) \cdot ((\mathbf{Y}\mathbf{Y}^{\top})^{-\frac{1}{2}}\mathbf{Y})[i]^{\top} = \sum_{v_j, v_l \in V} \mathbf{S}[v_j, v_l] \cdot \Big(((\mathbf{Y}\mathbf{Y}^{\top})^{-\frac{1}{2}}\mathbf{Y})[i, v_j] - ((\mathbf{Y}\mathbf{Y}^{\top})^{-\frac{1}{2}}\mathbf{Y})[i, v_l]\Big)^{2} = 2\sum_{v_j \in C_i,\, v_l \in V \setminus C_i} \mathbf{S}[v_j, v_l] \cdot ((\mathbf{Y}\mathbf{Y}^{\top})^{-\frac{1}{2}}\mathbf{Y})[i, v_j]^{2} = 2\,\Phi(C_i).$$
Then we have
$$\frac{1}{k}\sum_{i=1}^{k} \Phi(C_i) = \frac{1}{k}\sum_{i=1}^{k} ((\mathbf{Y}\mathbf{Y}^{\top})^{-\frac{1}{2}}\mathbf{Y})[i] \cdot (\mathbf{I}-\mathbf{S}) \cdot ((\mathbf{Y}\mathbf{Y}^{\top})^{-\frac{1}{2}}\mathbf{Y})[i]^{\top} = \frac{1}{k} \cdot \mathrm{trace}\Big(((\mathbf{Y}\mathbf{Y}^{\top})^{-\frac{1}{2}}\mathbf{Y}) \cdot (\mathbf{I}-\mathbf{S}) \cdot ((\mathbf{Y}\mathbf{Y}^{\top})^{-\frac{1}{2}}\mathbf{Y})^{\top}\Big),$$
which completes our proof. □
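The quadratic-form identity used above can be sanity-checked numerically. The sketch below is a hypothetical setup: it uses a synthetic symmetric matrix standing in for $\mathbf{S}$ together with its row-sum degree matrix $\mathbf{D}$ (with unit row sums, as in the proof, $\mathbf{D}$ reduces to $\mathbf{I}$), plus a random hard clustering:

```python
import numpy as np

rng = np.random.default_rng(1)
n, k = 8, 3

# Synthetic symmetric affinity standing in for S; D holds its row sums.
A = rng.random((n, n)); S = (A + A.T) / 2
D = np.diag(S.sum(axis=1))

# Hard clustering with non-empty clusters, its indicator Y (k x n),
# and the normalized indicator (YY^T)^(-1/2) Y.
labels = rng.permutation(np.arange(n) % k)
Y = np.eye(k)[labels].T
Yn = np.diag(1.0 / np.sqrt(Y.sum(axis=1))) @ Y

# Left side: 2 * y_i (D - S) y_i^T per cluster row y_i of the normalized
# indicator; right side: the pairwise squared-difference sum weighted by S.
lhs = 2 * np.einsum('ij,ij->i', Yn @ (D - S), Yn)
rhs = np.array([sum(S[j, l] * (Yn[c, j] - Yn[c, l]) ** 2
                    for j in range(n) for l in range(n)) for c in range(k)])
print(np.allclose(lhs, rhs))  # True
```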
Proof of Lemma 3.2. First, we construct a weighted graph $\mathcal{G} = (\mathcal{V}, \mathcal{E})$ based on the input graph $G = (V, E_V, R, E_R)$ by letting $\mathcal{V} = V$ and $\mathcal{E} = \{(v_i, v_j, \mathbf{S}[v_i, v_j]) \mid v_i, v_j \in V \text{ and } \mathbf{S}[v_i, v_j] > 0\}$, where $\mathbf{S}[v_i, v_j]$ signifies the weight of edge $(v_i, v_j)$. Thus, Eq. (8) can be reduced to the objective function of the min-cut problem on $\mathcal{G}$, which is proven to be NP-hard in [14, 51]. □
Proof of Lemma 3.3. Let $\lambda_i(\mathbf{M})$ be the $i$-th smallest eigenvalue of matrix $\mathbf{M}$. Note that for every $\mathbf{Y} \in \mathbb{R}^{k \times n}$, $((\mathbf{Y}\mathbf{Y}^{\top})^{-\frac{1}{2}}\mathbf{Y}) \cdot ((\mathbf{Y}\mathbf{Y}^{\top})^{-\frac{1}{2}}\mathbf{Y})^{\top} = \mathbf{I}$, meaning that $\Pi(\mathbf{Y}) = ((\mathbf{Y}\mathbf{Y}^{\top})^{-\frac{1}{2}}\mathbf{Y})^{\top} \cdot ((\mathbf{Y}\mathbf{Y}^{\top})^{-\frac{1}{2}}\mathbf{Y})$ is a projection matrix of rank $k$. Therefore, $\lambda_i(\Pi(\mathbf{Y})) = 0$ for all $i < n-k+1$, and $\lambda_i(\Pi(\mathbf{Y})) = 1$ for all $i \ge n-k+1$. By Von Neumann's trace inequality [35] and the properties of the matrix trace, for any $\mathbf{Y} \in \mathbb{R}^{k \times n}$ we have the following inequality:
$$\Psi(\mathbf{Y}) = \tfrac{1}{k} \cdot \mathrm{trace}\Big(((\mathbf{Y}\mathbf{Y}^{\top})^{-\frac{1}{2}}\mathbf{Y}) \cdot (\mathbf{I}-\mathbf{S}) \cdot ((\mathbf{Y}\mathbf{Y}^{\top})^{-\frac{1}{2}}\mathbf{Y})^{\top}\Big) = \tfrac{1}{k} \cdot \mathrm{trace}\big((\mathbf{I}-\mathbf{S}) \cdot \Pi(\mathbf{Y})\big) \ge \tfrac{1}{k} \sum_{i=1}^{n} \lambda_i(\mathbf{I}-\mathbf{S}) \cdot \lambda_{n-i+1}(\Pi(\mathbf{Y})) = \tfrac{1}{k} \sum_{i=1}^{k} \lambda_i(\mathbf{I}-\mathbf{S}) = \tfrac{1}{k} \sum_{i=1}^{k} \big(1 - \lambda_{n-i+1}(\mathbf{S})\big). \tag{19}$$
Recall that $\mathbf{F}$ consists of the top-$k$ eigenvectors of $\mathbf{S}$, implying $\mathbf{F}\mathbf{F}^{\top} = \mathbf{I}$ and $(\mathbf{I}-\mathbf{S}) \cdot \mathbf{F}[i]^{\top} = (1 - \lambda_{n-i+1}(\mathbf{S})) \cdot \mathbf{F}[i]^{\top}$ for $1 \le i \le k$. Hence,
$$\tfrac{1}{k} \cdot \mathrm{trace}\big(\mathbf{F}(\mathbf{I}-\mathbf{S})\mathbf{F}^{\top}\big) = \tfrac{1}{k} \sum_{i=1}^{k} \big(1 - \lambda_{n-i+1}(\mathbf{S})\big), \tag{20}$$
which implies that $\Psi(\mathbf{Y})$ is minimized when $(\mathbf{Y}\mathbf{Y}^{\top})^{-\frac{1}{2}}\mathbf{Y} = \mathbf{F}$. Suppose that $\mathbf{Y}^* \in \{0,1\}^{k \times n}$ is the optimal solution to Eq. (8). Therefore, with Eq. (19) and Eq. (20), the following inequality holds:
$$\phi^* = \Psi(\mathbf{Y}^*) = \tfrac{1}{k} \cdot \mathrm{trace}\Big(((\mathbf{Y}^*\mathbf{Y}^{*\top})^{-\frac{1}{2}}\mathbf{Y}^*)(\mathbf{I}-\mathbf{S})((\mathbf{Y}^*\mathbf{Y}^{*\top})^{-\frac{1}{2}}\mathbf{Y}^*)^{\top}\Big) \ge \tfrac{1}{k} \sum_{i=1}^{k} \big(1 - \lambda_{n-i+1}(\mathbf{S})\big) = \tfrac{1}{k} \cdot \mathrm{trace}\big(\mathbf{F}(\mathbf{I}-\mathbf{S})\mathbf{F}^{\top}\big),$$
which finishes our proof. □
Proof of Lemma 3.4. Eq. (9) implies that $\mathbf{X}\mathbf{F} \approx (\mathbf{Y}\mathbf{Y}^{\top})^{-\frac{1}{2}}\mathbf{Y}$, where $\mathbf{X}^{\top}\mathbf{X} = \mathbf{I}$. By the properties of the matrix trace, we have
$$\Psi(\mathbf{Y}) = \tfrac{1}{k} \cdot \mathrm{trace}\Big((\mathbf{Y}\mathbf{Y}^{\top})^{-\frac{1}{2}}\mathbf{Y} \cdot (\mathbf{I}-\mathbf{S}) \cdot ((\mathbf{Y}\mathbf{Y}^{\top})^{-\frac{1}{2}}\mathbf{Y})^{\top}\Big) \approx \tfrac{1}{k} \cdot \mathrm{trace}\big(\mathbf{X}\mathbf{F}(\mathbf{I}-\mathbf{S})\mathbf{F}^{\top}\mathbf{X}^{\top}\big) = \tfrac{1}{k} \cdot \mathrm{trace}\big(\mathbf{X}^{\top}\mathbf{X}\mathbf{F}(\mathbf{I}-\mathbf{S})\mathbf{F}^{\top}\big) = \tfrac{1}{k} \cdot \mathrm{trace}\big(\mathbf{F}(\mathbf{I}-\mathbf{S})\mathbf{F}^{\top}\big). \tag{21}$$
By Lemma 3.3, we have $\Psi(\mathbf{Y}) \approx \phi^*$, completing our proof. □
Proof of Lemma 4.1. We need the following lemmas for the proof.

Lemma A.1 ([65]). If $[\lambda, \mathbf{x}]$ is an eigen-pair of matrix $\mathbf{M} \in \mathbb{R}^{n \times n}$, then $[\sum_{\ell=0}^{t} w_\ell \lambda^{\ell}, \mathbf{x}]$ is an eigen-pair of matrix $\sum_{\ell=0}^{t} w_\ell \mathbf{M}^{\ell}$.

Lemma A.2 ([20]). Given an $\mathbf{M} \in \mathbb{R}^{n \times n}$ satisfying $\sum_{j=1}^{n} \mathbf{M}[i, j] = 1$ for all $1 \le i \le n$ and $\mathbf{M}[i, j] \ge 0$ for all $1 \le i, j \le n$, the largest eigenvalue $\lambda$ of $\mathbf{M}$ is 1.

Suppose that $[\lambda_i, \mathbf{x}_i]$ is an eigen-pair of $(1-\beta)\cdot \mathbf{P}_V + \beta\cdot \mathbf{P}_R$ and $\lambda_i$ is its $i$-th largest eigenvalue. Note that each row sum of $(1-\beta)\cdot \mathbf{P}_V + \beta\cdot \mathbf{P}_R$ equals 1 and each of its entries is non-negative. Then, by Lemma A.2, we have $\lambda_i \in [-1, 1]$ for $1 \le i \le n$. Let $f(\lambda_i) = \sum_{\ell=0}^{\infty} \alpha(1-\alpha)^{\ell} \lambda_i^{\ell}$. Lemma A.1 implies that, for every $i \in [1, n]$, $[f(\lambda_i), \mathbf{x}_i]$ is an eigen-pair of $\mathbf{S}$. By the sum of the geometric sequence, we have
$$f(\lambda_i) = \lim_{t \to \infty} \alpha \cdot \frac{1 - (1-\alpha)^{t+1}\lambda_i^{t+1}}{1 - (1-\alpha)\lambda_i} = \frac{\alpha}{1 - (1-\alpha)\lambda_i},$$
which is monotonically decreasing in $i$ for $1 \le i \le n$, since $\lambda_i$ decreases as $i$ grows. Hence, for $1 \le i \le k$, $f(\lambda_i)$ and $\mathbf{x}_i$ are the $i$-th largest eigenvalue of $\mathbf{S}$ and its corresponding eigenvector. Recall that $\mathbf{F}$ consists of the top-$k$ eigenvectors of $(1-\beta)\cdot \mathbf{P}_V + \beta\cdot \mathbf{P}_R$. Therefore, $\mathbf{F}$ also consists of the top-$k$ eigenvectors of $\mathbf{S}$. The lemma is proved. □
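The eigenvalue map $f(\lambda) = \alpha/(1-(1-\alpha)\lambda)$ in this proof can be verified numerically. The sketch below is a synthetic setup: it builds a symmetric matrix with the same spectrum as a random walk matrix, as a stand-in for $(1-\beta)\mathbf{P}_V + \beta\mathbf{P}_R$, and checks both the eigen-pair transfer and the order preservation:

```python
import numpy as np

rng = np.random.default_rng(2)
n, alpha = 10, 0.2

# Symmetric matrix with the same spectrum as the walk matrix D^{-1}A:
# P = D^{-1/2} A D^{-1/2}, so its eigenvalues lie in [-1, 1] with largest 1.
A = rng.random((n, n)); A = (A + A.T) / 2
d = A.sum(axis=1)
P = A / np.sqrt(np.outer(d, d))

S = alpha * np.linalg.inv(np.eye(n) - (1 - alpha) * P)

lam, X = np.linalg.eigh(P)              # eigen-pairs of P, eigenvalues ascending
f = alpha / (1 - (1 - alpha) * lam)     # predicted eigenvalues of S

print(np.allclose(S @ X, X * f))        # each eigenvector of P is one of S
print(np.all(np.diff(f) > 0))           # f preserves the eigenvalue ordering
```

Because $f$ is order-preserving, the top-$k$ eigenvectors of the combined transition matrix and of $\mathbf{S}$ coincide, which is exactly what the lemma exploits.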
Proof of Lemma 4.2. Let $\mathbf{Z} = \mathbf{V}^{\top}\mathbf{X}^{\top}\mathbf{U}$. Since $\mathbf{U}$ and $\mathbf{V}$ are the left and right singular vectors, we have $\mathbf{U}\mathbf{U}^{\top} = \mathbf{I}$ and $\mathbf{V}\mathbf{V}^{\top} = \mathbf{I}$. Note that $\mathbf{Z}\mathbf{Z}^{\top} = \mathbf{I}$, which implies that each diagonal entry satisfies $-1 \le \mathbf{Z}[i, i] \le 1$; moreover, each $\boldsymbol{\Sigma}[i, i]$ is a singular value and thus $\boldsymbol{\Sigma}[i, i] > 0$. Then,
$$\mathrm{trace}\Big((\mathbf{Y}\mathbf{Y}^{\top})^{-\frac{1}{2}}\mathbf{Y}\mathbf{F}^{\top}\mathbf{X}^{\top}\Big) = \mathrm{trace}\big(\mathbf{U}\boldsymbol{\Sigma}\mathbf{V}^{\top}\mathbf{X}^{\top}\big) = \mathrm{trace}\big(\boldsymbol{\Sigma}\mathbf{V}^{\top}\mathbf{X}^{\top}\mathbf{U}\big) = \sum_{i=1}^{k} \boldsymbol{\Sigma}[i, i] \cdot \mathbf{Z}[i, i] \le \sum_{i=1}^{k} \boldsymbol{\Sigma}[i, i].$$
Therefore, $\mathrm{trace}((\mathbf{Y}\mathbf{Y}^{\top})^{-\frac{1}{2}}\mathbf{Y}\mathbf{F}^{\top}\mathbf{X}^{\top})$ is maximized when $\mathbf{Z} = \mathbf{I}$, which implies $\mathbf{X} = \mathbf{U}\mathbf{V}^{\top}$. The lemma is proved. □
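This is the classic orthogonal Procrustes argument. The sketch below, with a random $k \times k$ matrix standing in for $(\mathbf{Y}\mathbf{Y}^{\top})^{-\frac{1}{2}}\mathbf{Y}\mathbf{F}^{\top}$, confirms that $\mathbf{X} = \mathbf{U}\mathbf{V}^{\top}$ attains the bound $\sum_i \boldsymbol{\Sigma}[i,i]$ while other orthonormal choices do not exceed it:

```python
import numpy as np

rng = np.random.default_rng(3)
k = 5

# Random stand-in for the k x k matrix (YY^T)^(-1/2) Y F^T, with SVD U Sigma V^T.
B = rng.normal(size=(k, k))
U, sig, Vt = np.linalg.svd(B)

X = U @ Vt                                 # the claimed maximizer X = U V^T
best = np.trace(B @ X.T)                   # trace value attained at X = U V^T
print(np.isclose(best, sig.sum()))         # True: the bound sum_i Sigma[i,i] is hit

# Any other orthonormal X stays within the bound.
Q, _ = np.linalg.qr(rng.normal(size=(k, k)))
print(np.trace(B @ Q.T) <= sig.sum() + 1e-9)  # True
```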
B COMPARISON WITH SPECTRAL CLUSTERING

It is straightforward to apply classic spectral clustering [50] (dubbed USC) on $\mathbf{S}$ (see Eq. (2)) to generate clusters. Here, we theoretically analyse its major difference from ACMin.

Given an attributed input graph $G$, the number $k$ of clusters, and the random walk factors $\alpha, \beta$ as inputs, USC runs in three phases: (i) computing the attributed random walk probability matrix $\mathbf{S}$; (ii) finding the top-$k$ eigenvectors $\mathbf{F}$ of $\mathbf{S}$; and (iii) generating an NCI $\mathbf{Y}$ via $k$-Means over $\mathbf{F}$. Specifically, given the top-$k$ eigenvectors $\mathbf{F} \in \mathbb{R}^{k \times n}$, $k$-Means finds $k$ disjoint clusters $C_1, C_2, \cdots, C_k$ of $G$ such that
$$\min_{C_1, C_2, \cdots, C_k} \sum_{i=1}^{k} \sum_{v_j \in C_i} \Big\lVert \mathbf{F}[:, v_j] - \tfrac{1}{|C_i|} \sum_{v_l \in C_i} \mathbf{F}[:, v_l] \Big\rVert^{2},$$
which can be rewritten in an NCI-based matrix form, i.e.,
$$\min_{\mathbf{Y} \in \{0,1\}^{k \times n}} \big\lVert \mathbf{F}^{\top} - \mathbf{Y}^{\top}\mathbf{X} \big\rVert_F^{2}, \tag{22}$$
where
$$\mathbf{X} = (\mathbf{Y}\mathbf{Y}^{\top})^{-1}\mathbf{Y}\mathbf{F}^{\top}. \tag{23}$$
$k$-Means solves Eq. (22) using the expectation-maximization (EM) algorithm. Initially, it sets $\mathbf{Y}$ to a random indicator matrix (i.e., in $\{0,1\}^{k \times n}$) and computes $\mathbf{X}$ via Eq. (23) accordingly. Then, with $\mathbf{X}$ fixed, each entry $\mathbf{Y}[i, v_j]$ for cluster $C_i$ and node $v_j$ is updated as
$$\mathbf{Y}[i, v_j] = \begin{cases} 1, & i = \arg\min_{1 \le l \le k} \lVert \mathbf{F}[:, v_j] - \mathbf{X}[l] \rVert, \\ 0, & \text{otherwise}. \end{cases}$$
After that, with $\mathbf{Y}$ fixed, it recomputes $\mathbf{X}$ by Eq. (23) accordingly. $k$-Means repeats this optimization process in an alternating and iterative manner until convergence. Suppose that we find the optimal $\mathbf{Y}$ for Eq. (22). Then we have $\mathbf{F} \approx \mathbf{F}\mathbf{Y}^{\top}(\mathbf{Y}\mathbf{Y}^{\top})^{-1}\mathbf{Y}$, which implies that USC actually finds an NCI $\mathbf{Y}$ that minimizes the objective function
$$\tfrac{1}{k} \cdot \mathrm{trace}\Big(\mathbf{F}\mathbf{Y}^{\top}(\mathbf{Y}\mathbf{Y}^{\top})^{-1}\mathbf{Y} \cdot (\mathbf{I}-\mathbf{S}) \cdot \big(\mathbf{F}\mathbf{Y}^{\top}(\mathbf{Y}\mathbf{Y}^{\top})^{-1}\mathbf{Y}\big)^{\top}\Big),$$
where $\mathbf{F}$ consists of the top-$k$ eigenvectors of $\mathbf{S}$. This objective is fundamentally different from our objective function in Eq. (8). In a nutshell, the NCI $\mathbf{Y}$ returned by USC fails to provide an attributed multi-hop conductance that approaches the optimal attributed multi-hop conductance $\phi^*$ of $G$.
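For concreteness, here is a compact sketch of the USC baseline described above, assuming $\mathbf{S}$ has already been materialized and symmetrized (itself feasible only for small graphs, which is part of the scalability argument): take the top-$k$ eigenvectors of $\mathbf{S}$ and run $k$-Means on them.

```python
import numpy as np
from sklearn.cluster import KMeans

def usc(S, k, seed=0):
    """USC baseline: top-k eigenvectors of the (symmetrized) attributed
    random walk matrix S, followed by plain k-Means [50]."""
    vals, vecs = np.linalg.eigh((S + S.T) / 2)   # ascending eigenvalues
    F = vecs[:, -k:]                             # top-k eigenvectors, one row per node
    return KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(F)

# Toy usage on a synthetic S with two planted blocks.
n = 60
S = np.full((n, n), 0.01)
S[:n // 2, :n // 2] = S[n // 2:, n // 2:] = 0.2
labels = usc(S, k=2)
print(np.bincount(labels))  # two balanced clusters
```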
REFERENCES
[1] C. C. Aggarwal and C. K. Reddy. Data Clustering: Algorithms and Applications. CRC Press, 2014.
[2] E. Akbas and P. Zhao. Attributed graph clustering: An attribute-aware graph embedding approach. In ASONAM, 2017.
[3] A. Bojchevski, J. Klicpera, B. Perozzi, A. Kapoor, M. Blais, B. Rózemberczki, M. Lukasik, and S. Günnemann. Scaling graph neural networks with approximate PageRank. In SIGKDD, 2020.
[4] C. Bothorel, J. D. Cruz, M. Magnani, and B. Micenkova. Clustering attributed graphs: models, measures and methods. Network Science, 2015.
[5] P. Chunaev. Community detection in node-attributed social networks: a survey. arXiv preprint arXiv:1912.09816, 2019.
[6] F. R. Chung and F. C. Graham. Spectral Graph Theory. 1997.
[7] D. Combe, C. Largeron, E. Egyed-Zsigmond, and M. Géry. Combining relations and text in scientific network clustering. In ASONAM, 2012.
[8] P. Congdon. Bayesian Statistical Modelling. 2007.
[9] J. W. Demmel. Applied Numerical Linear Algebra. SIAM, 1997.
[10] I. Falih, N. Grozavu, R. Kanawati, and Y. Bennani. ANCA: Attributed network clustering algorithm. In Complex Networks, 2017.
[11] I. Falih, N. Grozavu, R. Kanawati, and Y. Bennani. Community detection in attributed network. In WWW, 2018.
[12] S. Fortunato. Community detection in graphs. Physics Reports, 2010.
[13] L. C. Freeman. Cliques, Galois lattices, and the structure of human social groups. Social Networks, 1996.
[14] O. Goldschmidt and D. S. Hochbaum. Polynomial algorithm for the k-cut problem. In FOCS, 1988.
[15] R. Guimera and L. A. N. Amaral. Functional cartography of complex metabolic networks. Nature, 2005.
[16] N. Halko, P.-G. Martinsson, and J. A. Tropp. Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions. SIAM Review, 2011.
[17] W. Hamilton, Z. Ying, and J. Leskovec. Inductive representation learning on large graphs. In NeurIPS, 2017.
[18] D. Hanisch, A. Zien, R. Zimmer, and T. Lengauer. Co-clustering of biological networks and gene expression data. Bioinformatics, 2002.
[19] J. A. Hartigan and M. A. Wong. Algorithm AS 136: A k-means clustering algorithm. J R Stat Soc Ser C, 1979.
[20] T. Haveliwala and S. Kamvar. The second eigenvalue of the Google matrix. Technical report, Stanford, 2003.
[21] D. He, Z. Feng, D. Jin, X. Wang, and W. Zhang. Joint identification of network communities and semantics via integrative modeling of network topologies and node contents. In AAAI, 2017.
[22] D. Hric, R. K. Darst, and S. Fortunato. Community detection in networks: Structural communities versus ground truth. Physical Review E, 2014.
[23] H. Huang, H. Shen, and Z. Meng. Community-based influence maximization in attributed networks. Applied Intelligence, 2020.
[24] X. Huang, J. Li, and X. Hu. Label informed attributed network embedding. In WSDM, 2017.
[25] G. Jeh and J. Widom. Scaling personalized web search. In WWW, 2003.
[26] G. Kossinets and D. J. Watts. Empirical analysis of an evolving social network. Science, 2006.
[27] T. La Fond and J. Neville. Randomization tests for distinguishing social influence and homophily effects. In WWW, 2010.
[28] A. Lancichinetti and S. Fortunato. Community detection algorithms: a comparative analysis. Physical Review E, 2009.
[29] Y. Li, C. Sha, X. Huang, and Y. Zhang. Community detection in attributed graphs: An embedding approach. In AAAI, 2018.
[30] U. Liji, Y. Chai, and J. Chen. Improved personalized recommendation based on user attributes clustering and score matrix filling. CSI, 2018.
[31] J. Liu, Z. He, L. Wei, and Y. Huang. Content to node: Self-translation network embedding. In SIGKDD, 2018.
[32] L. Lovász et al. Random walks on graphs: A survey. Combinatorics, Paul Erdős is Eighty, 1993.
[33] F. Meng, X. Rui, Z. Wang, Y. Xing, and L. Cao. Coupled node similarity learning for community detection in attributed networks. Entropy, 2018.
[34] Z. Meng, S. Liang, H. Bao, and X. Zhang. Co-embedding attributed networks. In WSDM, 2019.
[35] L. Mirsky. A trace inequality of John von Neumann. Monatshefte für Mathematik, 1975.
[36] W. Nawaz, K.-U. Khan, Y.-K. Lee, and S. Lee. Intra graph clustering using collaborative similarity measure. DAPD, 2015.
[37] J. Neville, M. Adler, and D. Jensen. Clustering relational data using attribute and link information. In IJCAI, 2003.
[38] M. E. Newman and M. Girvan. Finding and evaluating community structure in networks. Physical Review E, 2004.
[39] A. Y. Ng, M. I. Jordan, and Y. Weiss. On spectral clustering: Analysis and an algorithm. In NeurIPS, 2002.
[40] K. Nowicki and T. A. B. Snijders. Estimation and prediction for stochastic blockstructures. J Am Stat Assoc, 2001.
[41] H.-S. Park and C.-H. Jun. A simple and fast algorithm for k-medoids clustering. Expert Systems with Applications, 2009.
[42] Y. Ruan, D. Fuhry, and S. Parthasarathy. Efficient community detection in large networks using content and links. In WWW, 2013.
[43] H. Rutishauser. Computational aspects of F. L. Bauer's simultaneous iteration method. Numerische Mathematik, 1969.
[44] S. E. Schaeffer. Graph clustering. Computer Science Review, 2007.
[45] K. Steinhaeuser and N. V. Chawla. Community detection in a large real-world social network. In SBP, 2008.
[46] S. A. Tabrizi, A. Shakery, M. Asadpour, M. Abbasi, and M. A. Tavallaie. Personalized PageRank clustering: A graph clustering algorithm based on random walks. Physica A, 2013.
[47] H. Tong, C. Faloutsos, and J.-Y. Pan. Fast random walk with restart and its applications. In ICDM, 2006.
[48] P. Veličković, G. Cucurull, A. Casanova, A. Romero, P. Lio, and Y. Bengio. Graph attention networks. In ICLR, 2018.
[49] K. Voevodski, S.-H. Teng, and Y. Xia. Finding local communities in protein networks. BMC Bioinformatics, 2009.
[50] U. von Luxburg. A tutorial on spectral clustering. Statistics and Computing, 2007.
[51] D. Wagner and F. Wagner. Between min cut and graph bisection. In MFCS, 1993.
[52] C. Wang, S. Pan, R. Hu, G. Long, J. Jiang, and C. Zhang. Attributed graph clustering: A deep attentional embedding approach. In IJCAI, 2019.
[53] C. Wang, S. Pan, G. Long, X. Zhu, and J. Jiang. MGAE: Marginalized graph autoencoder for graph clustering. In CIKM, 2017.
[54] Z. Xu, Y. Ke, Y. Wang, H. Cheng, and J. Cheng. A model-based approach to attributed graph clustering. In SIGMOD, 2012.
[55] C. Yang, Z. Liu, D. Zhao, M. Sun, and E. Chang. Network representation learning with rich text information. In AAAI, 2015.
[56] H. Yang, S. Pan, L. Chen, C. Zhou, and P. Zhang. Low-bit quantization for attributed network representation learning. In IJCAI, 2019.
[57] H. Yang, S. Pan, P. Zhang, L. Chen, D. Lian, and C. Zhang. Binarized attributed network embedding. In ICDM, 2018.
[58] J. Yang, J. McAuley, and J. Leskovec. Community detection in networks with node attributes. In ICDM, 2013.
[59] R. Yang, J. Shi, X. Xiao, Y. Yang, and S. S. Bhowmick. Homogeneous network embedding for massive graphs via reweighted personalized PageRank. PVLDB, 2020.
[60] R. Yang, J. Shi, X. Xiao, Y. Yang, J. Liu, and S. S. Bhowmick. Scaling attributed network embedding to massive graphs. PVLDB, 2021.
[61] R. Yang, X. Xiao, Z. Wei, S. S. Bhowmick, J. Zhao, and R.-H. Li. Efficient estimation of heat kernel PageRank for local clustering. In SIGMOD, 2019.
[62] T. Yang, R. Jin, Y. Chi, and S. Zhu. Combining link and content for community detection: a discriminative approach. In SIGKDD, 2009.
[63] H. Zanghi, S. Volant, and C. Ambroise. Clustering based on random graph model embedding vertex features. Pattern Recognition Letters, 2010.
[64] X. Zhang, H. Liu, Q. Li, and X.-M. Wu. Attributed graph clustering via adaptive graph convolution. In IJCAI, 2019.
[65] Z. Zhang, P. Cui, X. Wang, J. Pei, X. Yao, and W. Zhu. Arbitrary-order proximity preserved network embedding. In SIGKDD, 2018.
[66] S. Zhou, H. Yang, X. Wang, J. Bu, M. Ester, P. Yu, J. Zhang, and C. Wang. PRRE: Personalized relation ranking embedding for attributed networks. In CIKM, 2018.
[67] Y. Zhou, H. Cheng, and J. X. Yu. Graph clustering based on structural/attribute similarities. PVLDB, 2009.
[68] Y. Zhou, H. Cheng, and J. X. Yu. Clustering large attributed graphs: An efficient incremental approach. In