Distributed Community Detection for Large Scale Networks Using Stochastic Block Model
Shihao Wu, Zhe Li, and Xuening Zhu
School of Data Science, Fudan University, Shanghai, China
Abstract
With rapid developments of information technology, large scale network data are ubiquitous. In this work we develop a distributed spectral clustering algorithm for community detection in large scale networks. To handle the problem, we distribute l pilot network nodes on the master server and the others on worker servers. A spectral clustering algorithm is first conducted on the master to select pseudo centers. The indexes of the pseudo centers are then broadcasted to workers to complete the distributed community detection task using an SVD type algorithm. The proposed distributed algorithm has three merits. First, the communication cost is low, since only the indexes of the pseudo centers are communicated. Second, no further iterative algorithm is needed on the workers, and hence it does not suffer from problems such as initialization and non-robustness. Third, both the computational complexity and the storage requirements are much lower compared to using the whole adjacency matrix. A Python package DCD
KEY WORDS:
Large scale network; Community detection; Distributed spectral clustering; Stochastic block model; Distributed system.

∗ Shihao Wu and Zhe Li are joint first authors. Xuening Zhu is the corresponding author ([email protected]). Xuening Zhu is supported by the National Natural Science Foundation of China (nos. 11901105, 71991472, U1811461), the Shanghai Sailing Program for Youth Science and Technology Excellence (19YF1402700), and the Fudan-Xinzailing Joint Research Centre for Big Data, School of Data Science, Fudan University.

1. INTRODUCTION

Large scale networks have become more and more popular in today's world. Recently, network data analysis has received great attention in a wide range of applications, which include but are not limited to social network analysis (Sojourner, 2013; Liu et al., 2017; Zhu et al., 2020), biological study (Marbach et al., 2010, 2012), financial risk management (Härdle et al., 2016; Zou et al., 2017) and many others.

Among the existing literature on large scale network data, the stochastic block model (SBM) is widely used due to its simple form and great usefulness (Holland et al., 1983). In an SBM, the network nodes are partitioned into K communities according to their connections. Within the same community, nodes are more likely to form edges with each other. On the other hand, nodes from different communities are less likely to form connections. Understanding the community structure is vital in a variety of fields. For instance, in social network analysis, users from the same community are likely to share similar social interests. As a consequence, particular marketing strategies can be applied based on their community memberships.

Statistically, the communities in the SBM are latent and hence need to be detected. One of the most fundamental problems in the SBM is to recover community memberships from the observed network relationships.
To address this issue, researchers have proposed various estimation methods. For instance, Zhao et al. (2012), Amini et al. (2013) and Bickel and Chen (2009) adopted likelihood based methods and proved their asymptotic properties. Other approaches include convex optimization (Chen et al., 2012), methods of moments (Anandkumar et al., 2014), spectral clustering (Lei and Rinaldo, 2015; Jin et al., 2015; Lei et al., 2020) and many others. Among these approaches, spectral clustering (Von Luxburg, 2007; Balakrishnan et al., 2011; Rohe et al., 2011; Lei and Rinaldo, 2015; Jin et al., 2015; Sarkar et al., 2015; Lei et al., 2020) is one of the most widely used methods for community detection. Particularly, it first performs an eigen-decomposition of the adjacency matrix or the graph Laplacian matrix. Then the community memberships are estimated by applying a k-means algorithm to the first several leading eigenvectors. Theoretically, both Rohe et al. (2011) and Lei and Rinaldo (2015) have studied the consistency of spectral clustering under stochastic block models.

Despite the usefulness of spectral clustering for the community detection problem, the procedure is computationally demanding, especially when the network is of large scale. Meanwhile, with rapid developments of information technology, large scale network data are ubiquitous. On one hand, handling such enormous datasets requires great computational power and storage capacity. Hence, it is nearly impossible to complete statistical modelling tasks on a single central server. On the other hand, concerns over privacy and ownership require the datasets to be distributed across different data centers. Due to the distributed storage of the datasets, constraints on communication budgets also pose great challenges for statistical modelling tasks.
Therefore, developing distributed statistical modelling methods that are efficient, with low computation and communication cost, is important.

In the recent literature, a surge of research has emerged to solve distributed statistical modelling problems. For instance, to conduct distributed regression analysis, both one-shot and iterative distributed algorithms have been designed and studied (Zhang et al., 2013; Liu and Ihler, 2014; Chang et al., 2017a,b). Furthermore, high-dimensional sparse learning problems have been investigated and the corresponding asymptotic properties established (Lee et al., 2015; Battey et al., 2015; Jordan et al., 2018; Zhu et al., 2019). Other than supervised learning tasks, distributed semi-supervised and unsupervised learning methods have also been studied (Chang et al., 2017a; Fan et al., 2017). However, to the best of our knowledge, none of the above literature can tackle distributed community detection problems for large scale networks.

In this work, we propose a distributed community detection (DCD) algorithm. The distributed system typically consists of a master server and multiple worker servers. In each round of computation, the master server is responsible for broadcasting tasks to workers; the workers then conduct computational tasks using local datasets and communicate the results to the master. More specifically, we distribute the network nodes together with their network relationships on both the master and the workers. On the master server we distribute l network nodes, which are referred to as pilot nodes. The network relationships among the pilot nodes are stored on the master server. On the m-th worker, we distribute n_m network nodes together with the l pilot nodes. The network relationships between the n_m network nodes and the pilot nodes are recorded. Compared to storing the whole set of network relationships, we resort to storing only a partial network, which leads to much lower storage requirements.
Figure 1: Illustration of the distributed community detection algorithm. Step 1: conduct spectral clustering on the master server to identify pseudo centers. Step 2: broadcast the pseudo centers to workers and complete the distributed community detection task using an SVD type algorithm. One-time communication is required between master and workers.

A spectral clustering algorithm is first conducted on the master server using the l pilot nodes. During this step, K pseudo centers are identified. The pseudo centers are identified as the pilot nodes closest to the clustering centers. Next, we broadcast the indexes of the pseudo centers to the workers to further complete the community detection task by using an SVD type algorithm. The basic steps of the algorithm are summarized in Figure 1. The algorithm has the following three merits. First, the communication cost is low, since only the indexes of the pseudo centers are communicated and only one round of communication is used. Second, no further iterative algorithm is needed on the workers, and hence it does not suffer from problems such as initialization and non-robustness. Third, the total computational complexity is of order O(M l^3 + Σ_{m=1}^M n_m l^2), where M is the number of workers. Therefore the computational cost is low as long as the number of pilot nodes is well controlled. We would like to remark that the proposed algorithm can be applied not only on distributed systems, but also on a single computer with memory constraints. Theoretically, we establish upper bounds on (a) the singular vector estimation error and (b) the number of mis-clustered nodes. An extensive numerical study is presented to illustrate the computational power of the proposed methodology.

The article is organized as follows. In Section 2, we introduce the stochastic block model and our distributed community detection algorithm. In Section 3, we develop the theoretical properties of the estimation accuracy of the community detection task. In Sections 4 and 5, we study the performance of our algorithm via simulation and real data analysis. Section 6 concludes the article with a discussion. All proofs and technical lemmas are relegated to the Appendix.
2. DISTRIBUTED SPECTRAL CLUSTERING FOR STOCHASTIC BLOCK MODEL

2.1. Stochastic Block Model and Spectral Clustering

Consider a large scale network with N nodes, which can be clustered into K communities. For each node i, let g_i ∈ {1, · · · , K} be its community label. A stochastic block model is parameterized by a membership matrix Θ = (Θ_1, · · · , Θ_N)^⊤ ∈ R^{N×K} and a connectivity matrix B ∈ R^{K×K} (with full rank). For the i-th row of Θ, only the g_i-th element takes 1 and the others are 0. In addition, the connectivity matrix B characterizes the connection probabilities between communities. Specifically, the connection probability between the k-th and l-th communities is B_{kl}. The edge A_{ij} between nodes i and j is generated independently from a Bernoulli(B_{g_i g_j}) distribution. The adjacency matrix is then defined as A = (A_{ij}). Using the adjacency matrix, the Laplacian matrix L can be defined as L = D^{−1/2} A D^{−1/2}, where D is a diagonal matrix with the i-th diagonal element being D_{ii} = Σ_j A_{ij}.

Define 𝒜 = E(A) and 𝒟 = E(D) as the population level counterparts of A and D. Accordingly, let ℒ = 𝒟^{−1/2} 𝒜 𝒟^{−1/2}. For a matrix X ∈ R^{m×n}, denote X_i ∈ R^n as the i-th row of X. The following lemma shows the connection between the membership matrix and the eigenvector matrix of ℒ.

Lemma 1.
The eigen-decomposition of ℒ takes the form ℒ = U Λ U^⊤, where U = (U_1, · · · , U_N)^⊤ ∈ R^{N×K} collects the eigenvectors and Λ ∈ R^{K×K} is a diagonal matrix. Further, we have U = Θμ, where μ is a K × K orthogonal matrix, and Θ_i = Θ_j if and only if U_i = U_j.

The proof of Lemma 1 is given by Rohe et al. (2011). By Lemma 1, it can be concluded that U has only K distinct rows, and the i-th row is equal to the j-th row if the corresponding two nodes belong to the same community. Accordingly, let Û ∈ R^{N×K} denote the K eigenvectors of L with the top K absolute eigenvalues. Under mild conditions, one can show that Û is a slightly perturbed version of U and thus has roughly K distinct rows as well. Applying a k-means clustering algorithm to Û, we are then able to estimate the membership matrix. The spectral clustering algorithm is summarized in Algorithm 1.

Algorithm 1
Spectral Clustering for SBM

Input: Adjacency matrix A; number of communities K; approximation error ε.
Output: Membership matrix Θ̂.
Step 1. Compute the Laplacian matrix L based on A.
Step 2. Conduct the eigen-decomposition of L and extract the top K eigenvectors (i.e., Û).
Step 3. Conduct the k-means algorithm using Û and output the estimated membership matrix Θ̂.

Despite its usefulness, the classical spectral clustering method for the SBM is computationally intensive, with computational complexity of order O(N^3). Hence it is hard to apply to large scale networks. In the following we aim to develop a distributed spectral clustering algorithm for the SBM. Specifically, we first introduce a pilot network spectral clustering algorithm on the master server in Section 2.2. Then we elaborate the communication mechanism and the computation on workers for the distributed community detection task in Sections 2.3 and 2.4.

For the distributed community detection task, we first conduct a pilot-based spectral clustering on the master server. Suppose we have l network nodes on the master, which are referred to as pilot nodes. In addition, we distribute the pilot nodes on both the master and the workers. In the distributed system, the adjacency matrix is distributed as in Figure 2. As a result, compared to storing the whole set of network relationships, only a sub-adjacency matrix (i.e., a partial network) is stored. This leads to a much lower storage requirement.
Distributed adjacency matrix A in the distributed system. On the master server, the l pilot nodes are stored. On the m-th worker, the network relationships between the n_m + l network nodes and the pilot nodes are stored.

Let m_k = Σ_{i=1}^N I(g_i = k) be the number of nodes in the k-th community. In addition, define n_k as the number of pilot nodes in the k-th community. Without loss of generality, we assume n_k/m_k = r for k = 1, · · · , K. Consequently, the relative size of each community (i.e., the distribution of memberships) is the same for the pilot nodes. Subsequently, define the adjacency matrix among the pilot nodes as A_1 ∈ R^{l×l}. Denote the corresponding Laplacian matrix L_1 as L_1 = D_1^{−1/2} A_1 D_1^{−1/2}, where D_1 = diag{D_{1,11}, · · · , D_{1,ll}} with D_{1,ii} = Σ_j A_{1,ij}. Accordingly, define 𝒟_1 = E(D_1), 𝒜_1 = E(A_1), and ℒ_1 = 𝒟_1^{−1/2} 𝒜_1 𝒟_1^{−1/2}.

By Lemma 3.1 in Rohe et al. (2011), the eigen-decomposition of ℒ_1 takes the form ℒ_1 = U_0 Λ_0 U_0^⊤, where Λ_0 ∈ R^{K×K} and U_0 ∈ R^{l×K} has K distinct rows. We collect the K distinct rows in the matrix U_0^{(K)} ∈ R^{K×K}. In addition, let U^{(K)} collect the K distinct rows of U. The following proposition establishes the relationship between U_0^{(K)} and U^{(K)}.

Proposition 1.
Under the assumption that n_k/m_k = r for k = 1, · · · , K, we have U_0^{(K)} = r^{−1/2} U^{(K)}.

The proof of Proposition 1 is given in Appendix B.1. By Proposition 1, it can be concluded that U_0^{(K)} is equivalent to U^{(K)} up to the ratio r^{−1/2}. Empirically, by conducting a spectral clustering algorithm on the master, we are able to cluster the pilot nodes correctly with high probability (Rohe et al., 2011).

After clustering the pilot nodes on the master, we broadcast the clustering results to the workers to further complete community detection there. To conduct this task, K pseudo centers are selected for broadcasting, as in Step 2 of Algorithm 2. To be more specific, define the clustering centers of Û (the eigenvector matrix of L_1) after the k-means algorithm as Ĉ = (Ĉ_k : 1 ≤ k ≤ K)^⊤. Then the index of the k-th pseudo center is defined by i_k = argmin_i ‖Û_i − Ĉ_k‖, which is the node closest to the center of the k-th cluster. The pseudo centers are "pseudo" in the sense that they are not exactly the clustering centers but the nodes closest to the centers. As a result, they can be treated as the most representative nodes of each community. The indexes of the pseudo centers are recorded as C = {i_1, · · · , i_K}.
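The pseudo-center selection i_k = argmin_i ‖Û_i − Ĉ_k‖ can be sketched as follows. `pseudo_center_indexes` is a hypothetical helper name (not from the paper's package); it takes the estimated pilot eigenvector matrix Û and the k-means centers Ĉ and returns the K pilot indexes that will be broadcast.

```python
import numpy as np

def pseudo_center_indexes(U_hat, centers):
    """For each k-means center C_k, return the index i_k of the pilot node
    whose eigenvector row U_hat[i] is closest to C_k."""
    # (l, K) matrix of distances ||U_hat[i] - centers[k]||
    dists = np.linalg.norm(U_hat[:, None, :] - centers[None, :, :], axis=2)
    return dists.argmin(axis=0)   # one pilot index per community
```

Note that only these K integers, rather than the K × K real-valued center matrix Ĉ, need to be communicated to the workers.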
Note that in the communication step, we only broadcast the indexes of the pseudo centers instead of the clustering centers Ĉ. There are two advantages to doing so. First, the communication cost is low compared to broadcasting Ĉ: only K integers need to be communicated. Second, even if the clustering center matrix Ĉ were broadcasted, we would still need to know a rotation matrix for further clustering on the workers. Instead, by broadcasting the pseudo centers, we no longer need to estimate the rotations, but only use the pseudo center indexes on the workers. The detailed procedure is presented in the next section.

Suppose we distribute n_m network nodes as well as the pilot nodes on the m-th worker. Let P collect the indexes of the pilot nodes and M_m collect the indexes of the n_m network nodes on the m-th worker. Denote S_m = P ∪ M_m with |S_m| = ñ_m, where ñ_m = l + n_m. Particularly, on the m-th worker we store the network relationships between the nodes in S_m and the pilot nodes in P. Denote the corresponding sub-adjacency matrix as A^{(S_m)} ∈ R^{ñ_m×l}. Without loss of generality, we permute the row indexes of A^{(S_m)} to ensure that A^{(S_m)} = (A_1^{(S_m)⊤}, A_2^{(S_m)⊤})^⊤ with A_1^{(S_m)} = A_1. As a result, the first l rows of A^{(S_m)} (i.e., A_1^{(S_m)}) store the adjacency matrix of the pilot nodes, and the rest (i.e., A_2^{(S_m)}) record the network relationships between the other n_m nodes (i.e., M_m) and the l pilot nodes.

Let D_{ii}^{(S_m)} = Σ_j A_{ij}^{(S_m)} and F_{jj}^{(S_m)} = Σ_i A_{ij}^{(S_m)} be the out- and in-degrees of nodes i and j in the subnetwork on worker m. Correspondingly, define D^{(S_m)} = diag{D_{ii}^{(S_m)} : 1 ≤ i ≤ ñ_m} ∈ R^{ñ_m×ñ_m} and F^{(S_m)} = diag{F_{jj}^{(S_m)} : 1 ≤ j ≤ l} ∈ R^{l×l}.
Then a Laplacian version of A^{(S_m)} is given by L^{(S_m)} = (D^{(S_m)})^{−1/2} A^{(S_m)} (F^{(S_m)})^{−1/2} ∈ R^{ñ_m×l}.

Given the Laplacian matrix, we further perform the clustering algorithm on the workers. First, we conduct a singular value decomposition (SVD) of L^{(S_m)}. Note that the SVD can be done very efficiently as follows. First, we conduct an eigenvalue decomposition of L^{(S_m)⊤} L^{(S_m)} ∈ R^{l×l}, with computational complexity of order O(l^3). This leads to L^{(S_m)⊤} L^{(S_m)} = Ṽ_m Λ̃_m Ṽ_m^⊤, where Ṽ_m ∈ R^{l×l} collects the right singular vectors of L^{(S_m)} and Λ̃_m ∈ R^{l×l} is a diagonal matrix. Then the left singular vectors can be efficiently computed as Ũ_m = L^{(S_m)} Ṽ_m Λ̃_m^{−1/2}, with computational complexity O(n_m l^2). Next, let Û^{(S_m)} ∈ R^{ñ_m×K} collect the top K left singular vectors in Ũ_m. We then assign each node to the cluster with the closest pseudo center. Specifically, recall that the indexes of the pseudo centers are collected in C = {i_1, · · · , i_K}. As a result, for the i-th (l + 1 ≤ i ≤ ñ_m) node in S_m, the cluster label g_i is estimated by

ĝ_i = argmin_{1≤k≤K, i_k∈C} ‖Û_i^{(S_m)} − Û_{i_k}^{(S_m)}‖. (2.1)

An obvious merit of (2.1) is that no further iterative algorithms (e.g., k-means) are needed for clustering. This makes the clustering results more stable and computationally efficient. The procedure for community detection on the workers is summarized in Step 3 of Algorithm 2.
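The worker-side computation above, i.e., the economical SVD through the small l × l matrix L^{(S_m)⊤} L^{(S_m)} followed by the nearest-pseudo-center rule (2.1), can be sketched as below. `worker_community_detection` is our own name, and the sketch assumes the sub-adjacency matrix has no zero row or column sums.

```python
import numpy as np

def worker_community_detection(A_sub, center_idx, K):
    """SVD of the rectangular Laplacian L = D^{-1/2} A F^{-1/2} via the
    l x l matrix L^T L, then nearest-pseudo-center assignment (2.1)."""
    d = A_sub.sum(axis=1)                         # out-degrees D_ii (assumed > 0)
    f = A_sub.sum(axis=0)                         # in-degrees  F_jj (assumed > 0)
    L = A_sub / np.sqrt(d)[:, None] / np.sqrt(f)[None, :]
    G = L.T @ L                                   # l x l: O(l^3) eigen-decomposition
    w, V = np.linalg.eigh(G)
    top = np.argsort(w)[::-1][:K]                 # top K right singular vectors
    sv = np.sqrt(np.clip(w[top], 1e-12, None))    # singular values of L
    U = (L @ V[:, top]) / sv                      # left singular vectors
    # rule (2.1): assign each node to its nearest pseudo center
    dists = np.linalg.norm(U[:, None, :] - U[center_idx][None, :, :], axis=2)
    return dists.argmin(axis=1)
```

Because the assignment compares rows of the same singular vector matrix, no rotation needs to be estimated on the worker, exactly as Remark 1 argues.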
3. THEORETICAL PROPERTIES
In this section, we discuss the accuracy of the clustering algorithm. We first establish the theoretical properties of the procedure at the population level. Next, the convergence of the singular vectors is given, which is the key to establishing the consistent clustering result. Lastly, we derive error bounds on the mis-clustering rates.

Algorithm 2
Distributed Community Detection (DCD) for SBM

Input: Adjacency matrix A_1; sub-adjacency matrices {A^{(S_m)}}_{m=1,...,M}; number of communities K; approximation error ε.
Output: Membership matrix Θ̂.

Step 1. Pilot-based network spectral clustering on the master server.
Step 1.1. Conduct the eigen-decomposition of L_1 and extract the top K eigenvectors (denoted by the matrix Û).
Step 1.2. Conduct the k-means algorithm and obtain the clustering centers Ĉ = (Ĉ_k : 1 ≤ k ≤ K)^⊤.

Step 2. Broadcast pseudo centers to the workers.
Step 2.1. Determine the index of the k-th pseudo center as i_k = argmin_i ‖Û_i − Ĉ_k‖.
Step 2.2. Broadcast the index set of pseudo centers C = {i_1, · · · , i_K} to the workers.

Step 3. Community detection on the workers.
Step 3.1. Perform the singular value decomposition of L^{(S_m)} and denote the top K left singular vector matrix by Û^{(S_m)}.
Step 3.2. Use (2.1) to obtain the estimated community labels.
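Putting the three steps of Algorithm 2 together, a self-contained single-machine sketch might look as follows. The full adjacency matrix is held in one array purely to slice out the pieces each server would store; in an actual deployment each server keeps only its own sub-matrix, and only the K pseudo-center indexes cross the network. The function `dcd` and its arguments are our own names, not the interface of the paper's DCD package, and the sketch assumes all degrees involved are positive.

```python
import numpy as np
from sklearn.cluster import KMeans

def dcd(A, K, l, worker_splits, random_state=0):
    """Algorithm 2 sketch: pilots are nodes 0..l-1; worker_splits lists the
    non-pilot node indexes held by each worker."""
    # --- Step 1: spectral clustering of the l x l pilot network on the master
    A1 = A[:l, :l]
    d = A1.sum(axis=1)
    L1 = A1 / np.sqrt(d)[:, None] / np.sqrt(d)[None, :]
    w, V = np.linalg.eigh(L1)
    U_hat = V[:, np.argsort(np.abs(w))[::-1][:K]]
    km = KMeans(n_clusters=K, n_init=10, random_state=random_state).fit(U_hat)
    # --- Step 2: pseudo centers = pilot nodes closest to the k-means centers;
    #     only these K integers are broadcast to the workers
    dists = np.linalg.norm(U_hat[:, None, :] - km.cluster_centers_[None, :, :], axis=2)
    center_idx = dists.argmin(axis=0)
    # --- Step 3: on each worker, SVD of the rectangular Laplacian + rule (2.1)
    labels = np.empty(A.shape[0], dtype=int)
    for nodes in worker_splits:
        S = np.concatenate([np.arange(l), nodes])      # pilots + local nodes
        A_sub = A[np.ix_(S, np.arange(l))]
        Ls = A_sub / np.sqrt(A_sub.sum(axis=1))[:, None] \
                   / np.sqrt(A_sub.sum(axis=0))[None, :]
        ww, VV = np.linalg.eigh(Ls.T @ Ls)
        top = np.argsort(ww)[::-1][:K]
        U = (Ls @ VV[:, top]) / np.sqrt(np.clip(ww[top], 1e-12, None))
        dd = np.linalg.norm(U[:, None, :] - U[center_idx][None, :, :], axis=2)
        lab = dd.argmin(axis=1)
        labels[:l] = lab[:l]            # pilot labels (identical on every worker)
        labels[nodes] = lab[l:]
    return labels
```

The worker loop is embarrassingly parallel: each iteration touches only the sub-matrix A^{(S_m)} and the broadcast index set.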
To motivate the study, we first discuss the theoretical properties at the population level. Define 𝒜^{(S_m)} = E(A^{(S_m)}), 𝒟^{(S_m)} = E(D^{(S_m)}) ∈ R^{ñ_m×ñ_m} and ℱ^{(S_m)} = E(F^{(S_m)}) ∈ R^{l×l}. In addition, the normalized population adjacency matrix is defined by ℒ^{(S_m)} = (𝒟^{(S_m)})^{−1/2} 𝒜^{(S_m)} (ℱ^{(S_m)})^{−1/2}. Suppose the singular value decomposition of ℒ^{(S_m)} is ℒ^{(S_m)} = U^{(S_m)} Λ^{(S_m)} (V^{(S_m)})^⊤, where U^{(S_m)} ∈ R^{ñ_m×K} and V^{(S_m)} ∈ R^{l×K} are the left and right singular vectors, respectively. In the following proposition we show that U^{(S_m)} has K distinct rows and can identify the memberships of the nodes uniquely.

Proposition 2.
Let Θ^{(S_m)} ∈ R^{ñ_m×K} be the membership matrix on the m-th worker. Then we have U^{(S_m)} = Θ^{(S_m)} μ, where μ ∈ R^{K×K} is a rotation matrix, and μ^⊤ Θ_i^{(S_m)} = μ^⊤ Θ_j^{(S_m)} ⇔ Θ_i^{(S_m)} = Θ_j^{(S_m)}.

The proof of Proposition 2 is given in Appendix B.2. Proposition 2 implies that the singular vectors can play the same role in community detection as the eigenvectors of the adjacency matrix.

We then build the connection between U^{(S_m)} and the eigenvector matrix U of ℒ, i.e., ℒ = U Λ U^⊤. Denote U_m = (U_i : i ∈ S_m)^⊤ ∈ R^{ñ_m×K} as the submatrix of U whose row indexes are in S_m. The connection can be built between U^{(S_m)} and U_m. Denote by n_{mk} the number of nodes on the m-th worker belonging to the k-th community. If the ratios n_{mk}/m_k are equal over 1 ≤ k ≤ K, then it can be easily verified, as in Proposition 1, that U^{(S_m)} = r_m^{−1/2} U_m, where r_m = ñ_m/(N + l). However, in practice, the nodes distributed on the workers are mostly unbalanced with respect to the whole population. For instance, a smaller sample of the k-th community may be distributed on the m-th worker compared to other workers. As a result, U^{(S_m)} will not simply be equal to r_m^{−1/2} U_m.

This unbalanced effect can be quantified in the theoretical analysis. Define the unbalanced effect as α^{(S_m)} = max_k |n_{mk}/ñ_m − m_k/N|. As a result, α^{(S_m)} will be large if the ratio of one community (e.g., the k-th community) on the m-th worker is far away from its population ratio m_k/N. In addition, let d_1 ≤ min_k n_k/l ≤ max_k n_k/l ≤ u_1 and d_m ≤ min_k n_{mk}/ñ_m ≤ max_k n_{mk}/ñ_m ≤ u_m. We establish an upper bound for the deviation of U^{(S_m)} from r_m^{−1/2} U_m.

Proposition 3.
Let b_min = min_{1≤i,j≤K} B_{ij}. It holds that

‖U^{(S_m)} − r_m^{−1/2} U_m Q_m‖_F ≤ √K u_m max{u_1^{1/2}, u_m^{1/2}} α^{(S_m)1/2} / {σ_min(B) b_min d_1 d_m (d_1 + d_m)} + α^{(S_m)}/d_1, (3.1)

where Q_m is a K × K orthogonal matrix.

The proof of Proposition 3 is given in Appendix B.3. The upper bound in (3.1) illustrates the relationship between the error bound and the unbalanced effect. In particular, the error bound is tighter when the community members are distributed more evenly on each worker. In the extreme case, when the unbalanced effect is 0 (i.e., α^{(S_m)} = 0), the upper bound in (3.1) is zero.

As we have shown previously, U^{(S_m)} has K distinct rows. As a result, if Û^{(S_m)} converges to U^{(S_m)} with high probability, we are able to achieve high clustering accuracy based on spectral clustering using Û^{(S_m)}. In the following theorem we establish the convergence of Û^{(S_m)} to U^{(S_m)}.

Theorem 1. (Singular Vector Convergence)
Let λ_{1,m} ≥ λ_{2,m} ≥ · · · ≥ λ_{K,m} > 0 be the top K singular values of ℒ^{(S_m)}. Define δ_m = min_i 𝒟_{ii}^{(S_m)}. Then for any ε_m > 0 and δ_m > 3 log(ñ_m + 2l) + 3 log(4/ε_m), with probability at least 1 − ε_m it holds that

‖Û^{(S_m)} − U^{(S_m)} Q^{(S_m)}‖_F ≤ (C/λ_{K,m}) √{K log(4(ñ_m + 2l)/ε_m)/δ_m}, (3.2)

where Q^{(S_m)} ∈ R^{K×K} is a K × K orthogonal matrix and C is an absolute constant.

The proof of Theorem 1 is given in Appendix C.1. To better understand the estimation error bound given in (3.2), we make the following comments. First, the error bound is related to λ_{K,m}. According to Rohe et al. (2011) and Lei and Rinaldo (2015), if λ_{K,m} is larger, the eigengap between the eigenvalues of interest and the rest will be larger. This enables us to detect communities at a higher accuracy level.

Second, the upper bound is lower if the minimum out-degree δ_m is higher. One can verify that 𝒟_{ii}^{(S_m)} = (Θ_i^{(S_m)})^⊤ B Θ_1^⊤ 1_l ≥ b_min Σ_k n_k = b_min l. Consequently, δ_m grows almost linearly in l if b_min is bounded below. If δ_m ≫ K log ñ_m and λ_{K,m} is bounded below by a positive constant, then we have ‖Û^{(S_m)} − U^{(S_m)} Q^{(S_m)}‖_F = o_p(1).

Lastly, the error bound is higher when the number of communities K and the subsample size ñ_m are larger. As a result, larger K and ñ_m will increase the difficulty of the community detection task.

In this section, we conduct clustering accuracy analysis for the DCD algorithm. To this end, we first present a sufficient condition which guarantees correct clustering for a single node. Let Ĉ^{(S_m)} = (Ĉ_1^{(S_m)}, · · · , Ĉ_K^{(S_m)})^⊤ ∈ R^{K×K} be the pseudo centers on worker m. Denote P_m = (2/D_m)^{1/2} − ζ_m with D_m = max_{1≤k≤K} n_{mk} and ζ_m = max_{k∈{1,··· ,K}} ‖Q^{(S_m)⊤} U_{i_k}^{(S_m)} − Ĉ_k^{(S_m)}‖.
Here ζ_m characterizes the distance of the pseudo centers to their population values on worker m. We then have the following proposition.
The node i will be correctly clustered (i.e., ĝ_i = g_i) as long as

‖Û_i^{(S_m)} − Ĉ_{g_i}^{(S_m)}‖ < P_m. (3.3)

The proof of Proposition 4 is given in Appendix B.4. It indicates that the clustering accuracy is closely related to P_m. If the pseudo centers are correctly clustered with high probability, then P_m will be higher. As a consequence, this yields a more accurate community detection result.

In the following we analyze the lower bound of P_m. If we can prove that P_m is positive with high probability, we are then able to show that the total number of mis-clustered nodes is well controlled. Specifically, define the pseudo centers on the master node as Û_c := (Û_i : i ∈ C)^⊤ ∈ R^{K×K}. Ideally, we could directly map Û_c to the column space of Û^{(S_m)} and then complete the community detection on the workers. To this end, a rotation Q_c should be applied to the pseudo centers of the master node (i.e., Û_c). According to Proposition 3 and Theorem 1, the rotation Q_c takes the form Q_c = r_m^{−1/2} r^{1/2} Q^⊤ Q_m Q^{(S_m)}. As a result, the pseudo centers on the m-th worker are defined as Ĉ^{(S_m)} = Û_c Q_c. To establish a lower bound for P_m, we first assume the following conditions.

(C1) (Eigenvalue and Eigengap on Master) Let δ_1 = min_i 𝒟_{1,ii}. Assume δ_1 > 3 log l + 3 log(4/ε_l) and ε_l → 0 as l → ∞.

(C2) (Pilot Nodes) Assume K log(l/ε_l)/(b_min λ_{K,1}^2) ≪ l with ε_l → 0 as l → ∞.

(C3) (Unbalanced Effect) Let d_1, d_m, u_1, u_m be finite constants and assume α^{(S_m)} = o(σ_min(B)/K).

Condition (C1) imposes the same condition as in Theorem 1 for the pilot nodes. Condition (C2) gives a lower bound on the number of pilot nodes. Specifically, it should be larger than both the number of communities K and (log l)/b_min, which is easy to satisfy in practice.
Condition (C3) restricts the unbalanced effect. First, it states that the relative ratios of the communities across all workers are stable by assuming d_1, d_m, u_1, u_m are constants. Next, the unbalanced effect α^{(S_m)} is assumed to converge to zero faster than O(1/K). As a result, as long as K is well controlled (for instance, of order log N) and the signal strength in B is strong enough, conditions (C2) and (C3) can be easily satisfied. We then have the following proposition.
Assume Conditions (C1)–(C3). Then with probability 1 − ε_l, we have P_m ≥ c/√ñ_m as min{l, n_m} → ∞ with the rotation Q_c, where c is a positive constant.

The proof of Proposition 5 is given in Appendix B.5. In practice, to save the effort of estimating the rotation matrix Q_c, we directly broadcast the pseudo center indexes C to the workers and let Ĉ^{(S_m)} = (Û_i^{(S_m)} : i ∈ C)^⊤. As a result, Ĉ^{(S_m)} is naturally embedded in the column space of Û^{(S_m)} and no further rotation is required. Given the results presented in Theorem 1 and Proposition 5, we are then able to obtain the mis-clustering rates for each worker as follows.
Assume the conditions in Theorem 1 and Proposition 5. Denote by R^{(S_m)} the ratio of mis-clustered nodes on worker m. Then we have

R^{(S_m)} = o( K log(l/ε_l)/(b_min l λ_{K,1}^2) + K log(4(ñ_m + 2l)/ε_m)/(λ_{K,m}^2 δ_m) + K α^{(S_m)}/(σ_min(B) b_min^2) ), (3.4)

with probability at least 1 − ε_l − ε_m.

The proof of Theorem 2 is given in Appendix C.2. Theorem 2 establishes an upper bound for the mis-clustering rate on worker m. With respect to this result, we have the following remark.

Remark 2. One can observe that three terms are included in the mis-clustering rate. The first and second terms are related to the convergence of the spectra on the master and the workers. Specifically, the first term is related to the convergence of the eigenvectors on the master. The second term is determined by the convergence of the singular vectors on the m-th worker. As we commented before, with a large sample size and strong signal strength, the mis-clustering rate can be well controlled. Next, the third term is mainly related to the unbalanced effect α^{(S_m)} among the workers, which is lower if the distribution of the communities is more balanced across the workers.

Compared with using the full adjacency matrix of S_m, the error bound in (3.4) is higher. That is straightforward to understand, since in our case we use a sub-adjacency matrix instead of the full one. According to Rohe et al. (2012), when the full adjacency matrix is used, the mis-clustering rate is bounded by R_all^{(S_m)} = O(K log(ñ_m/ε_m)/(ñ_m λ_{K,m}^2)) with high probability. In our case, we have R^{(S_m)} = O(R_all^{(S_m)} ñ_m/l). Hence if n_m ≪ l (i.e., ñ_m ≈ l), then the mis-clustering rate is asymptotically the same as when using the full adjacency matrix. Furthermore, we can obtain a mis-clustering error bound for all network nodes, as in the following corollary.
Assume the same conditions as in Theorem 2. In addition, assume n_1 = n_2 = · · · = n_M =: n and α^{(S_m)} = 0 for 1 ≤ m ≤ M. Denote by R_all the number of all mis-clustered nodes across all workers. Then with probability 1 − (M + 1)/l we have

R_all = O( K (log n + log l)/(l λ_K^2) ), (3.5)

where λ_K = min_m λ_{K,m}.

Corollary 1 is immediately obtained from Theorem 2 by setting ε_l = ε_m = 1/l for 1 ≤ m ≤ M. As indicated by (3.5), the mis-clustering rate is smaller when the number of pilot nodes l is larger. In particular, if l = rN with r ∈ (0, 1), the mis-clustering rate is of the same order as that obtained using the whole adjacency matrix A, while at the same time the computational time is roughly a factor r smaller than when using the whole adjacency matrix. As a consequence, the computational advantage is obvious.
4. SIMULATION STUDIES
In order to demonstrate the performance of our DCD algorithm, we conduct experiments on synthetic datasets under three scenarios. The main differences lie in the generating mechanisms of the networks. For simplicity, we consider a stochastic block model with K blocks, where each block contains s nodes. As a result, Ks = N. The connectivity matrix B is set as

B = ν{λ I_K + (1 − λ) 1_K 1_K^⊤}, (4.1)

where ν ∈ [0, 1] and λ ∈ [0, 1]. The overall connection intensity is controlled by ν, and the connection divergence is characterized by λ. The random experiments are repeated R = 500 times for a reliable evaluation.

To gauge the finite sample performance, we consider two accuracy measures. The first is the mis-clustering rate, i.e., R_all = Σ_{i=1}^N I(ĝ_i ≠ g_i)/N. The second is the estimation accuracy of the singular vectors (i.e., Û^{(S_m)}) for each worker, which is captured by the log-estimation error (LEE). Define LEE_m = log ‖Û^{(S_m)} − U^{(S_m)} Q^{(S_m)}‖_F for the m-th worker, where the rotation matrix Q^{(S_m)} is calculated according to Rohe et al. (2011). Subsequently, LEE = M^{−1} Σ_m LEE_m is calculated to quantify the average estimation error over all workers.

Scenario 1 (Pilot Nodes)
First, we investigate the role of pilot nodes on the numerical performance. Particularly, we let l = rN with N = 10000 and r varying from 0.01 to 0.2. The performances are evaluated for K = 3, 4, 5, 6, with λ = 0.5 and the connection intensity ν held fixed. In addition, the number of workers is given as M = 5. We calculate the mis-clustering rate R_all in the left panel of Figure 3. As shown in Figure 3, the mis-clustering rate converges to zero as l grows, which corroborates our theoretical findings in Corollary 1.

Figure 3: Left panel: the mis-clustering rates versus the pilot nodes ratio (i.e., l/N) under different community sizes K = 3, 4, 5, 6. Right panel: LEE versus the log-number of pilot nodes under sample sizes N = 8000, 10000, 12000, 14000; a reference line with slope −0.5 is displayed.

For each sample size N, as log(l) grows, the estimation error of the eigenvectors decreases, with the slope of LEE roughly parallel to −1/2. This corroborates the theoretical results in Theorem 1.
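The generating mechanism in (4.1) can be sketched in a few lines. Here `sbm_adjacency` is an illustrative helper (not part of the DCD package) that samples an undirected SBM with the stated connectivity matrix:

```python
import numpy as np

rng = np.random.default_rng(42)

def sbm_adjacency(K, s, nu, lam, rng):
    """Sample an SBM adjacency matrix with B = nu * (lam * I_K + (1 - lam) * 1 1^T),
    i.e., within-block edge probability nu and between-block probability nu * (1 - lam).
    The network has K blocks of s nodes each, so N = K * s."""
    N = K * s
    g = np.repeat(np.arange(K), s)                        # true community labels
    B = nu * (lam * np.eye(K) + (1 - lam) * np.ones((K, K)))
    P = B[np.ix_(g, g)]                                   # N x N edge-probability matrix
    A = np.triu(rng.random((N, N)) < P, 1).astype(int)    # sample upper triangle only
    return A + A.T, g                                     # symmetrize: undirected network

A, g = sbm_adjacency(K=3, s=100, nu=0.2, lam=0.5, rng=rng)
print(A.shape, A.mean())
```

With nu = 0.2 and lam = 0.5, edges form with probability 0.2 within a block and 0.1 across blocks, matching the intensity/divergence roles of ν and λ described above.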
Figure 4: The influence of ν (connection intensity) and λ (connection divergence) on the mis-clustering rates for K = 3, 4, 5, 6. As shown in the figure, a stronger connection intensity and a larger connection divergence lead to more accurate community detection results.

Scenario 2 (Signal Strength)

In this scenario, we observe how the mis-clustering rates change with respect to the signal strength. Accordingly, we fix l = 300 and N = 20000, and vary the number of communities as K = 3, 4, 5, 6. For ν = 0.2, we first change the connection divergence from λ = 0.05 to λ = 0.95. As λ increases, the connection intensity within the same community becomes higher relative to that between nodes from different communities. In the meanwhile, the eigengap is larger and the signal strength is higher. Next, we conduct the experiment by varying the connection intensity ν with λ = 0.5 fixed. Theoretically, as ν increases, b_min will increase accordingly, which results in a higher signal strength. According to Theorem 1 and Corollary 1, the mis-clustering rates will drop as the signal strength becomes higher. This phenomenon can be confirmed from the right panel of Figure 4.

Scenario 3 (Unbalanced Effect)
In this setting, we verify the unbalanced effect on the finite sample performance. First, we fix l = 500, N = 5000, λ = 0.5, and M = 3, with the connection intensity ν held fixed. Denote π_mk as the ratio of nodes in the k-th community on the m-th worker. We set π_mk as

π_mk = 1/K + ( k − (K+1)/2 ) sign( m − (M+1)/2 ) α / { K(K−1) }.

If π_m1 = π_m2 = · · · = π_mK = 1/K, then there is no unbalanced effect. As α increases, the unbalanced effect becomes larger. The mis-clustering rate is visualized in Figure 5. As α is increased from 0 to 0.95, we could observe that the mis-clustering rates increase accordingly, which verifies the result of Theorem 2.

Figure 5: The mis-clustering rates versus the unbalanced effect α for different numbers of communities K = 2, 3, 4. As the unbalanced effect increases, the mis-clustering rates also increase, which results in inferior performance of the distributed algorithm.
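The allocation scheme above can be sketched as follows; `community_ratios` is an illustrative helper that tabulates π_mk over all workers and communities:

```python
import numpy as np

def community_ratios(K, M, alpha):
    """Worker-level community proportions
    pi_mk = 1/K + (k - (K+1)/2) * sign(m - (M+1)/2) * alpha / (K * (K - 1))."""
    m = np.arange(1, M + 1)[:, None]   # workers m = 1, ..., M (column)
    k = np.arange(1, K + 1)[None, :]   # communities k = 1, ..., K (row)
    return 1.0 / K + (k - (K + 1) / 2) * np.sign(m - (M + 1) / 2) * alpha / (K * (K - 1))

pi = community_ratios(K=4, M=3, alpha=0.95)
print(pi.round(3))
```

Since the deviations (k − (K+1)/2) sum to zero over k, each row of the resulting M × K matrix sums to one, and alpha = 0 recovers the balanced case π_mk = 1/K.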
Lastly, we compare the performance of the proposed method with spectral clustering (SC) using the whole network data. Both the mis-clustering rates and the computational efficiency are compared. For a network with size N, we conduct spectral clustering by Algorithm 1 and record the clustering accuracy and computational time. For comparison, we conduct distributed spectral clustering using M workers by Algorithm 2.

Figure 6: The mis-clustering rates (left panel) and computational time (middle panel) with respect to varying pilot node ratios for different numbers of workers M = 5, 10, 15, 20. The computational times of the SC Algorithm 1 and the DCD Algorithm 2 are further compared as N grows (right panel).

For N = 10000 and K = 3, the average computational time of the spectral clustering Algorithm 1 (using the whole network adjacency matrix) is 48.84s and the mis-clustering rate is zero. Next, we set l = rN with r = 0.02, 0.04, ..., 0.18 for Algorithm 2. Both the mis-clustering rates and the computational time are compared, as shown in Figure 6. As we could observe, once l exceeds a small fraction of N, the mis-clustering rate of Algorithm 2 matches that of Algorithm 1 while requiring much less computational time. We further compare the computational times of the two algorithms as N grows. For each N, l is set at the point where the mis-clustering rate is the same as using the whole adjacency matrix. As we can observe from the right panel of Figure 6, as the network size grows, the computational time of Algorithm 1 increases drastically compared to Algorithm 2, which illustrates the computational advantage of the proposed approach.
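Algorithms 1 and 2 are not reproduced here, but the spectral step they share can be illustrated on a toy two-block network: take the leading eigenvectors of A and cluster their rows (for K = 2 simply by the sign of the second eigenvector, which avoids a k-means dependency; this is a sketch, not the paper's implementation):

```python
import numpy as np

rng = np.random.default_rng(7)

# Toy two-block SBM: 100 nodes per block, p_in = 0.3, p_out = 0.05.
g = np.repeat([0, 1], 100)
P = np.where(g[:, None] == g[None, :], 0.3, 0.05)
A = np.triu(rng.random((200, 200)) < P, 1).astype(float)
A = A + A.T

# Spectral step: the leading eigenvectors of A carry the block structure.
vals, vecs = np.linalg.eigh(A)          # eigenvalues in ascending order
u2 = vecs[:, -2]                        # eigenvector of the second largest eigenvalue
labels = (u2 > 0).astype(int)           # for K = 2, the sign splits the two blocks

# Mis-clustering rate, up to a global label flip.
err = min(np.mean(labels != g), np.mean(labels == g))
print(err)
```

With the strong signal here (eigengap well above the noise level), the sign split recovers the blocks almost perfectly; for general K one would run k-means on the rows of the top-K eigenvector matrix instead.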
5. EMPIRICAL STUDY
We evaluate the empirical performance of the proposed method using two network datasets. The estimation accuracy and computational time are evaluated for both the distributed community detection algorithm and the spectral clustering method. Particularly, the distributed community detection algorithm is implemented using our newly developed package
DCD on the Spark system. The system consists of 36 virtual cores and 128 GB of RAM. We set the number of workers as M = 2. Descriptions of the two network datasets and the corresponding experimental results are presented as follows.
The Pubmed dataset consists of 19,717 scientific publications from the PubMed database (Kipf and Welling, 2016). Each publication is assigned to one of three classes, i.e., Diabetes Mellitus Experimental, Diabetes Mellitus Type 1, and Diabetes Mellitus Type 2. The sizes of the three classes are 4,103, 7,875, and 7,739 respectively. In this case the community sizes are relatively unbalanced, since the second and third classes each have roughly twice as many members as the first class. The network links are defined by the citation relationships among the publications. Specifically, if the i-th publication cites the j-th one (or vice versa), then A_ij = 1; otherwise A_ij = 0. The resulting network is rather sparse. We vary the pilot node ratio r = l/N from 0.02 to 0.30. One could observe in Figure 7 that the mis-clustering rate of the DCD algorithm is comparable to that of the SC algorithm when r = 0.22, while the computational time is much lower.
Figure 7: Comparison between the SC and DCD algorithms on the Pubmed dataset in terms of both mis-clustering rate and computational time. The mis-clustering rate and computational time of the DCD (SC) algorithm are denoted by R_DCD (R_SC) and t_DCD (t_SC) respectively; here R_SC = 0.3303 and t_SC = 376.755s.

In this study, we consider a large scale social network, Pokec (Takac and Zabovsky, 2012). Pokec is the most popular online social network in Slovakia. The dataset was collected during May 25–27, 2012, and contains 50,000 active users. If the i-th user is a friend of the j-th user, then there is a connection between the two users, i.e., A_ij = 1. Since the true community memberships are not available for this dataset, we evaluate the detection results by the relative edge density (RED), defined as RED = Den_between / Den_within, where Den_between = Σ_{i,j} a_ij I(ĝ_i ≠ ĝ_j) / Σ_{i,j} I(ĝ_i ≠ ĝ_j) is the between-community density, and Den_within = Σ_{i,j} a_ij I(ĝ_i = ĝ_j) / Σ_{i,j} I(ĝ_i = ĝ_j) is the within-community density. A smaller RED indicates a clearer community structure. The RED is visualized in Figure 8. As one can observe, after l/N ≥ 0.28, the RED is stable, with a corresponding computational time of 642.524s. This further illustrates the computational advantage of the proposed DCD algorithm.
Figure 8: The relative edge density (RED) versus the pilot ratio. The RED decreases rapidly as the ratio increases; after r = l/N ≥ 0.28, it remains stable at RED = 0.23.
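The RED criterion defined above can be computed directly from the adjacency matrix and the estimated labels. `relative_edge_density` below is an illustrative helper (self-pairs are excluded from the within-community count), checked on a toy two-block network:

```python
import numpy as np

def relative_edge_density(A, g):
    """RED = Den_between / Den_within for adjacency matrix A and estimated labels g."""
    g = np.asarray(g)
    same = g[:, None] == g[None, :]        # indicator that i and j share a community
    off = ~np.eye(len(g), dtype=bool)      # exclude self pairs (i = j)
    den_within = A[same & off].mean()      # within-community edge density
    den_between = A[~same].mean()          # between-community edge density
    return den_between / den_within

# Toy check: dense within blocks, sparse across, so RED should be well below 1.
rng = np.random.default_rng(1)
g = np.repeat([0, 1], 50)
P = np.where(g[:, None] == g[None, :], 0.3, 0.05)
A = np.triu(rng.random((100, 100)) < P, 1).astype(int)
A = A + A.T
print(relative_edge_density(A, g))
```

A partition that separates the communities well keeps the between-community density far below the within-community density, hence a small RED, consistent with the stabilization seen in Figure 8.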
6. CONCLUDING REMARKS
In this work, we propose a distributed community detection (DCD) algorithm to tackle the community detection task in large scale networks. We distribute l pilot nodes on the master and a non-square adjacency matrix on the workers. The proposed DCD algorithm has three merits. First, the communication cost is low. Second, no further iterative algorithm is used on the workers, and therefore the algorithm is stable. Third, both the computational complexity and the storage requirements are much lower compared to using the whole adjacency matrix. The DCD algorithm is shown to have a clear computational advantage and competitive statistical performance on a variety of synthetic and empirical datasets.

To conclude the article, we provide several topics for future studies. First, better mechanisms can be designed to select pilot nodes on the master server. This would enable us to obtain more accurate estimates of the pseudo centers and yield better clustering results. Next, it is interesting to extend the proposed method to directed networks by considering sending and receiving clusters respectively (Rohe et al., 2012). The theoretical properties and computational complexity could be discussed accordingly. Third, in the community detection task, we only employ the network structure information and ignore potentially useful nodal covariates. As a result, it is important to extend the DCD algorithm to further incorporate various exogenous information.

References
Amini, A. A., Chen, A., Bickel, P. J., and Levina, E. (2013), "Pseudo-likelihood methods for community detection in large sparse networks," The Annals of Statistics, 41, 2097–2122.
Anandkumar, A., Ge, R., Hsu, D., and Kakade, S. M. (2014), "A tensor approach to learning mixed membership community models," The Journal of Machine Learning Research, 15, 2239–2312.
Balakrishnan, S., Xu, M., Krishnamurthy, A., and Singh, A. (2011), "Noise thresholds for spectral clustering," in Advances in Neural Information Processing Systems, pp. 954–962.
Battey, H., Fan, J., Liu, H., Lu, J., and Zhu, Z. (2015), "Distributed estimation and inference with statistical guarantees," arXiv preprint arXiv:1509.05457.
Bickel, P. J. and Chen, A. (2009), "A nonparametric view of network models and Newman–Girvan and other modularities," Proceedings of the National Academy of Sciences, 106, 21068–21073.
Chang, X., Lin, S.-B., and Wang, Y. (2017a), "Divide and conquer local average regression," Electronic Journal of Statistics, 11, 1326–1350.
Chang, X., Lin, S.-B., and Zhou, D.-X. (2017b), "Distributed semi-supervised learning with kernel ridge regression," The Journal of Machine Learning Research, 18, 1493–1514.
Chen, Y., Sanghavi, S., and Xu, H. (2012), "Clustering sparse graphs," in Advances in Neural Information Processing Systems, pp. 2204–2212.
Fan, J., Wang, D., Wang, K., and Zhu, Z. (2017), "Distributed estimation of principal eigenspaces," arXiv preprint arXiv:1702.06488.
Härdle, W. K., Wang, W., and Yu, L. (2016), "Tenet: Tail-event driven network risk," Journal of Econometrics, 192, 499–513.
Holland, P. W., Laskey, K. B., and Leinhardt, S. (1983), "Stochastic blockmodels: First steps," Social Networks, 5, 109–137.
Jin, J. (2015), "Fast community detection by SCORE," The Annals of Statistics, 43, 57–89.
Jordan, M. I., Lee, J. D., and Yang, Y. (2018), "Communication-efficient distributed statistical inference," Journal of the American Statistical Association, 1–14.
Kipf, T. N. and Welling, M. (2016), "Semi-supervised classification with graph convolutional networks," arXiv preprint arXiv:1609.02907.
Lee, J. D., Sun, Y., Liu, Q., and Taylor, J. E. (2015), "Communication-efficient sparse regression: a one-shot approach," arXiv preprint arXiv:1503.04337.
Lei, J. and Rinaldo, A. (2015), "Consistency of spectral clustering in stochastic block models," The Annals of Statistics, 43, 215–237.
Lei, L., Li, X., and Lou, X. (2020), "Consistency of spectral clustering on hierarchical stochastic block models," arXiv preprint arXiv:2004.14531.
Liu, Q. and Ihler, A. T. (2014), "Distributed estimation, information loss and exponential families," in Advances in Neural Information Processing Systems, pp. 1098–1106.
Liu, X., Patacchini, E., and Rainone, E. (2017), "Peer effects in bedtime decisions among adolescents: a social network model with sampled data," The Econometrics Journal, 20, S103–S125.
Marbach, D., Costello, J. C., Küffner, R., Vega, N. M., Prill, R. J., Camacho, D. M., Allison, K. R., Kellis, M., Collins, J. J., and Stolovitzky, G. (2012), "Wisdom of crowds for robust gene network inference," Nature Methods, 9, 796–804.
Marbach, D., Prill, R. J., Schaffter, T., Mattiussi, C., Floreano, D., and Stolovitzky, G. (2010), "Revealing strengths and weaknesses of methods for gene network inference," Proceedings of the National Academy of Sciences, 107, 6286–6291.
Rohe, K., Chatterjee, S., and Yu, B. (2011), "Spectral clustering and the high-dimensional stochastic blockmodel," The Annals of Statistics, 39, 1878–1915.
Rohe, K., Qin, T., and Yu, B. (2012), "Co-clustering for directed graphs: the stochastic co-blockmodel and spectral algorithm Di-Sim," arXiv preprint arXiv:1204.2296.
Sarkar, P. and Bickel, P. J. (2015), "Role of normalization in spectral clustering for stochastic blockmodels," The Annals of Statistics, 43, 962–990.
Sojourner, A. (2013), "Identification of peer effects with missing peer data: Evidence from Project STAR," The Economic Journal, 123, 574–605.
Takac, L. and Zabovsky, M. (2012), "Data analysis in public social networks," in International Scientific Conference and International Workshop Present Day Trends of Innovations, vol. 1.
Von Luxburg, U. (2007), "A tutorial on spectral clustering," Statistics and Computing, 17, 395–416.
Zhang, Y., Duchi, J. C., and Wainwright, M. J. (2013), "Communication-efficient algorithms for statistical optimization," The Journal of Machine Learning Research, 14, 3321–3363.
Zhao, Y., Levina, E., and Zhu, J. (2012), "Consistency of community detection in networks under degree-corrected stochastic block models," The Annals of Statistics, 40, 2266–2292.
Zhu, X., Huang, D., Pan, R., and Wang, H. (2020), "Multivariate spatial autoregressive model for large scale social networks," Journal of Econometrics, 215, 591–606.
Zhu, X., Li, F., and Wang, H. (2019), "Least squares approximation for a distributed system," arXiv preprint arXiv:1908.04904.
Zou, T., Lan, W., Wang, H., and Tsai, C.-L. (2017), "Covariance regression analysis," Journal of the American Statistical Association, 112, 266–281.