Distributed Community Detection for Large Scale Networks Using Stochastic Block Model
Shihao Wu, Zhe Li, and Xuening Zhu
School of Data Science, Fudan University, Shanghai, China
Abstract
With rapid developments of information technology, large scale network data are ubiquitous. In this work we develop a distributed spectral clustering algorithm for community detection in large scale networks. To handle the problem, we distribute l pilot network nodes on the master server and the others on worker servers. A spectral clustering algorithm is first conducted on the master to select pseudo centers. The indexes of the pseudo centers are then broadcasted to workers to complete the distributed community detection task using an SVD type algorithm. The proposed distributed algorithm has three merits. First, the communication cost is low, since only the indexes of the pseudo centers are communicated. Second, no further iterative algorithm is needed on the workers, and hence it does not suffer from problems such as initialization and non-robustness. Third, both the computational complexity and the storage requirements are much lower compared to using the whole adjacency matrix. A Python package DCD
KEY WORDS:
Large scale network; Community detection; Distributed spectral clustering; Stochastic block model; Distributed system.

∗ Shihao Wu and Zhe Li are joint first authors. Xuening Zhu is the corresponding author ([email protected]). Xuening Zhu is supported by the National Natural Science Foundation of China (nos. 11901105, 71991472, U1811461), the Shanghai Sailing Program for Youth Science and Technology Excellence (19YF1402700), and the Fudan-Xinzailing Joint Research Centre for Big Data, School of Data Science, Fudan University.

1. INTRODUCTION

Large scale networks have become more and more popular in today's world. Recently, network data analysis has received great attention in a wide range of applications, which include but are not limited to social network analysis (Sojourner, 2013; Liu et al., 2017; Zhu et al., 2020), biological study (Marbach et al., 2010, 2012), financial risk management (Härdle et al., 2016; Zou et al., 2017) and many others.

Among the existing literature on large scale network data, the stochastic block model (SBM) is widely used due to its simple form and great usefulness (Holland et al., 1983). In an SBM, the network nodes are partitioned into K communities according to their connections. Within the same community, nodes are more likely to form edges with each other. On the other hand, nodes from different communities are less likely to form connections. Understanding the community structure is vital in a variety of fields. For instance, in social network analysis, users from the same community are likely to share similar social interests. As a consequence, particular marketing strategies can be applied based on their community memberships.

Statistically, the communities in the SBM are latent and hence need to be detected. One of the most fundamental problems in the SBM is to recover community memberships from the observed network relationships.
To address this issue, researchers have proposed various estimation methods. For instance, Zhao et al. (2012), Amini et al. (2013) and Bickel and Chen (2009) adopted likelihood based methods and proved their asymptotic properties. Other approaches include convex optimization (Chen et al., 2012), methods of moments (Anandkumar et al., 2014), spectral clustering (Lei and Rinaldo, 2015; Jin et al., 2015; Lei et al., 2020) and many others. Among these approaches, spectral clustering (Von Luxburg, 2007; Balakrishnan et al., 2011; Rohe et al., 2011; Lei and Rinaldo, 2015; Jin et al., 2015; Sarkar et al., 2015; Lei et al., 2020) is one of the most widely used methods for community detection. Particularly, it first performs an eigen-decomposition of the adjacency matrix or the graph Laplacian matrix. Then the community memberships are estimated by applying a k-means algorithm to the first several leading eigenvectors. Theoretically, both Rohe et al. (2011) and Lei and Rinaldo (2015) have studied the consistency of spectral clustering under stochastic block models.

Despite the usefulness of spectral clustering for the community detection problem, the procedure is computationally demanding, especially when the network is of large scale. Meanwhile, with rapid developments of information technology, large scale network data are ubiquitous. On one hand, handling such enormous datasets requires great computational power and storage capacity. Hence, it is nearly impossible to complete statistical modelling tasks on a single central server. On the other hand, concerns over privacy and ownership require the datasets to be distributed across different data centers. Due to the distributed storage of the datasets, constraints on communication budgets also pose great challenges for statistical modelling tasks.
Therefore, developing distributed statistical modelling methods that are efficient, with low computation and communication cost, is important.

In the recent literature, a surge of research has emerged to solve distributed statistical modelling problems. For instance, to conduct distributed regression analysis, both one-shot and iterative distributed algorithms have been designed and studied (Zhang et al., 2013; Liu and Ihler, 2014; Chang et al., 2017a,b). Furthermore, high-dimensional sparse learning problems have been investigated and the corresponding asymptotic properties established (Lee et al., 2015; Battey et al., 2015; Jordan et al., 2018; Zhu et al., 2019). Other than supervised learning tasks, distributed semi-supervised and unsupervised learning methods have also been studied (Chang et al., 2017a; Fan et al., 2017). However, to the best of our knowledge, none of the above literature can tackle distributed community detection problems for large scale networks.

In this work, we propose a distributed community detection (DCD) algorithm. The distributed system typically consists of a master server and multiple worker servers. In each round of computation, the master server is responsible for broadcasting tasks to workers; the workers then conduct computational tasks using local datasets and communicate the results to the master. More specifically, we distribute the network nodes together with their network relationships on both the master and the workers. On the master server we distribute l network nodes, which are referred to as pilot nodes. The network relationships among the pilot nodes are stored on the master server. On the m-th worker, we distribute n_m network nodes together with the l pilot nodes. The network relationships between the n_m network nodes and the pilot nodes are recorded. Compared to storing the whole set of network relationships, we resort to storing only a partial network, which leads to much lower storage requirements.
Figure 1: Illustration of the distributed community detection algorithm. Step 1: conduct spectral clustering on the master server to identify pseudo centers. Step 2: broadcast the pseudo centers to workers and complete the distributed community detection task using an SVD type algorithm. One-time communication is required between master and workers.

A spectral clustering algorithm is first conducted on the master server using the l pilot nodes. During this step, K pseudo centers are identified. The pseudo centers are identified as the pilot nodes closest to the clustering centers. Next, we broadcast the indexes of the pseudo centers to the workers to further complete the community detection task by using an SVD type algorithm. The basic steps of the algorithm are summarized in Figure 1. The algorithm has the following three merits. First, the communication cost is low, since only the indexes of the pseudo centers are communicated and only one round of communication is used. Second, no further iterative algorithm is needed on the workers, and hence it does not suffer from problems such as initialization and non-robustness. Third, the total computational complexity is of order O(M l^3 + Σ_{m=1}^M n_m l^2), where M is the number of workers. Therefore the computational cost is low as long as the number of pilot nodes is well controlled. We would like to remark that the proposed algorithm can be applied not only on distributed systems, but also on a single computer with memory constraints. Theoretically, we establish upper bounds on (a) the singular vector estimation error and (b) the number of mis-clustered nodes. An extensive numerical study is presented to illustrate the computational power of the proposed methodology.

The article is organized as follows. In Section 2, we introduce the stochastic block model and our distributed community detection algorithm. In Section 3, we develop the theoretical properties of the estimation accuracy of the community detection task. In Sections 4 and 5, we study the performance of our algorithm via simulation and real data analysis. Section 6 concludes the article with a discussion. All proofs and technical lemmas are relegated to the Appendix.
2. DISTRIBUTED SPECTRAL CLUSTERING FOR STOCHASTIC BLOCK MODEL

2.1. Stochastic Block Model and Spectral Clustering

Consider a large scale network with N nodes, which can be clustered into K communities. For each node i, let g_i ∈ {1, · · · , K} be its community label. A stochastic block model is parameterized by a membership matrix Θ = (Θ_1, · · · , Θ_N)^⊤ ∈ R^{N×K} and a connectivity matrix B ∈ R^{K×K} (with full rank). For the i-th row of Θ, only the g_i-th element takes 1 and the others are 0. In addition, the connectivity matrix B characterizes the connection probabilities between communities. Specifically, the connection probability between the k-th and l-th communities is B_{kl}. The edge A_{ij} between nodes i and j is generated independently from a Bernoulli(B_{g_i g_j}) distribution. The adjacency matrix is then defined as A = (A_{ij}). Using the adjacency matrix, the Laplacian matrix L can be defined as L = D^{−1/2} A D^{−1/2}, where D is a diagonal matrix with the i-th diagonal element being D_{ii} = Σ_j A_{ij}.

Define 𝒜 = E(A) and 𝒟 = E(D) as the population level counterparts of A and D. Accordingly, let ℒ = 𝒟^{−1/2} 𝒜 𝒟^{−1/2}. For a matrix X ∈ R^{m×n}, denote X_i ∈ R^n as the i-th row of X. The following lemma shows the connection between the membership matrix and the eigenvector matrix of ℒ.

Lemma 1.
The eigen-decomposition of ℒ takes the form ℒ = U Λ U^⊤, where U = (U_1, · · · , U_N)^⊤ ∈ R^{N×K} collects the eigenvectors and Λ ∈ R^{K×K} is a diagonal matrix. Further, we have U = Θμ, where μ is a K × K orthogonal matrix, and Θ_i = Θ_j if and only if U_i = U_j.

The proof of Lemma 1 is given by Rohe et al. (2011). By Lemma 1, it can be concluded that U has only K distinct rows, and the i-th row is equal to the j-th row if the corresponding two nodes belong to the same community. Accordingly, let Û ∈ R^{N×K} denote the K eigenvectors of L with the top K absolute eigenvalues. Under mild conditions, one can show that Û is a slightly perturbed version of U and thus has roughly K distinct rows as well. Applying a k-means clustering algorithm to Û, we are then able to estimate the membership matrix. The spectral clustering algorithm is summarized in Algorithm 1.

Algorithm 1
Spectral Clustering for SBM

Input: Adjacency matrix A; number of communities K; approximation error ε.
Output: Membership matrix Θ̂.
Step 1. Compute the Laplacian matrix L based on A.
Step 2. Conduct the eigen-decomposition of L and extract the top K eigenvectors (i.e., Û).
Step 3. Conduct the k-means algorithm using Û and output the estimated membership matrix Θ̂.

Despite its usefulness, the classical spectral clustering method for the SBM is computationally intensive, with computational complexity of order O(N^3). Hence it is hard to apply to large scale networks. In the following we aim to develop a distributed spectral clustering algorithm for the SBM. Specifically, we first introduce a pilot network spectral clustering algorithm on the master server in Section 2.2. Then we elaborate the communication mechanism and the computation on workers for the distributed community detection task in Sections 2.3 and 2.4.

For the distributed community detection task, we first conduct a pilot-based spectral clustering on the master server. Suppose we have l network nodes on the master, which are referred to as pilot nodes. In addition, we distribute the pilot nodes on both the master and the workers. In the distributed system, the adjacency matrix is distributed as in Figure 2. As a result, compared to storing the whole set of network relationships, only a sub-adjacency matrix (i.e., a partial network) is stored. This leads to a much lower storage requirement.
Distributed adjacency matrix A in the distributed system. On the master server, the l pilot nodes are stored. On the m-th worker, the network relationships between the n_m + l network nodes and the pilot nodes are stored.

Let m_k = Σ_{i=1}^N I(g_i = k) be the number of nodes in the k-th community. In addition, define n_k as the number of pilot nodes in the k-th community. Without loss of generality, we assume n_k/m_k = r for k = 1, · · · , K. Consequently, the relative size of each community (i.e., the distribution of memberships) is the same for the pilot nodes. Subsequently, define the adjacency matrix among the pilot nodes as A_1 ∈ R^{l×l}. Denote the corresponding Laplacian matrix L_1 as L_1 = D_1^{−1/2} A_1 D_1^{−1/2}, where D_1 = diag{D_{1,11}, · · · , D_{1,ll}} with D_{1,ii} = Σ_j A_{1,ij}. Accordingly, define 𝒟_1 = E(D_1), 𝒜_1 = E(A_1), and ℒ_1 = 𝒟_1^{−1/2} 𝒜_1 𝒟_1^{−1/2}.

By Lemma 3.1 in Rohe et al. (2011), the eigen-decomposition of ℒ_1 takes the form ℒ_1 = U_0 Λ_0 U_0^⊤, where Λ_0 ∈ R^{K×K} and U_0 ∈ R^{l×K} has K distinct rows. We collect the K distinct rows in the matrix U_0^{(K)} ∈ R^{K×K}. In addition, let U^{(K)} collect the K distinct rows of U. The following proposition establishes the relationship between U_0^{(K)} and U^{(K)}.

Proposition 1.
Under the assumption that n_k/m_k = r for k = 1, · · · , K, we have U_0^{(K)} = r^{−1/2} U^{(K)}.

The proof of Proposition 1 is given in Appendix B.1. By Proposition 1, it can be concluded that U_0^{(K)} is equivalent to U^{(K)} up to the ratio r^{−1/2}. Empirically, by conducting a spectral clustering algorithm on the master, we are able to cluster the pilot nodes correctly with high probability (Rohe et al., 2011).

After clustering the pilot nodes on the master, we broadcast the clustering results to the workers to further complete community detection there. To conduct this task, K pseudo centers are selected for broadcasting, as in Step 2 of Algorithm 2. To be more specific, define the clustering centers of Û (the eigenvector matrix of L_1) after the k-means algorithm as Ĉ = (Ĉ_k : 1 ≤ k ≤ K)^⊤. Then the index of the k-th pseudo center is defined by i_k = argmin_i ‖Û_i − Ĉ_k‖, which is the node closest to the center of the k-th cluster. The pseudo centers are "pseudo" in the sense that they are not exactly the clustering centers but the nodes closest to the centers. As a result, they can be treated as the most representative nodes of each community. The indexes of the pseudo centers are recorded as C = {i_1, · · · , i_K}.
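The pseudo-center selection i_k = argmin_i ‖Û_i − Ĉ_k‖ can be sketched as follows. `pseudo_center_indexes` is a hypothetical helper name (not from the paper's package); it takes the estimated pilot eigenvector matrix Û and the k-means centers Ĉ and returns the K pilot indexes that will be broadcast.

```python
import numpy as np

def pseudo_center_indexes(U_hat, centers):
    """For each k-means center C_k, return the index i_k of the pilot node
    whose eigenvector row U_hat[i] is closest to C_k."""
    # (l, K) matrix of distances ||U_hat[i] - centers[k]||
    dists = np.linalg.norm(U_hat[:, None, :] - centers[None, :, :], axis=2)
    return dists.argmin(axis=0)   # one pilot index per community
```

Note that only these K integers, rather than the K × K real-valued center matrix Ĉ, need to be communicated to the workers.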
Note that in the communication step, we only broadcast the indexes of the pseudo centers instead of the clustering centers Ĉ. There are two advantages to doing so. First, the communication cost is low compared to broadcasting Ĉ: only K integers need to be communicated. Second, even if the clustering center matrix Ĉ were broadcasted, we would still need to know a rotation matrix for further clustering on the workers. Instead, by broadcasting the pseudo centers, we no longer need to estimate the rotations, but only use the pseudo center indexes on the workers. The detailed procedure is presented in the next section.

Suppose we distribute n_m network nodes as well as the pilot nodes on the m-th worker. Let P collect the indexes of the pilot nodes and M_m collect the indexes of the n_m network nodes on the m-th worker. Denote S_m = P ∪ M_m with |S_m| = ñ_m, where ñ_m = l + n_m. Particularly, on the m-th worker we store the network relationships between the nodes in S_m and the pilot nodes in P. Denote the corresponding sub-adjacency matrix as A^{(S_m)} ∈ R^{ñ_m×l}. Without loss of generality, we permute the row indexes of A^{(S_m)} to ensure that A^{(S_m)} = (A_1^{(S_m)⊤}, A_2^{(S_m)⊤})^⊤ with A_1^{(S_m)} = A_1. As a result, the first l rows of A^{(S_m)} (i.e., A_1^{(S_m)}) store the adjacency matrix of the pilot nodes, and the rest (i.e., A_2^{(S_m)}) record the network relationships between the other n_m nodes (i.e., M_m) and the l pilot nodes.

Let D_{ii}^{(S_m)} = Σ_j A_{ij}^{(S_m)} and F_{jj}^{(S_m)} = Σ_i A_{ij}^{(S_m)} be the out- and in-degrees of nodes i and j in the subnetwork on worker m. Correspondingly, define D^{(S_m)} = diag{D_{ii}^{(S_m)} : 1 ≤ i ≤ ñ_m} ∈ R^{ñ_m×ñ_m} and F^{(S_m)} = diag{F_{jj}^{(S_m)} : 1 ≤ j ≤ l} ∈ R^{l×l}.
Then a Laplacian version of A^{(S_m)} is given by L^{(S_m)} = (D^{(S_m)})^{−1/2} A^{(S_m)} (F^{(S_m)})^{−1/2} ∈ R^{ñ_m×l}.

Given the Laplacian matrix, we further perform the clustering algorithm on the workers. First, we conduct a singular value decomposition (SVD) of L^{(S_m)}. Note that the SVD can be done very efficiently as follows. First, we conduct an eigenvalue decomposition of L^{(S_m)⊤} L^{(S_m)} ∈ R^{l×l}, with computational complexity of order O(l^3). This leads to L^{(S_m)⊤} L^{(S_m)} = Ṽ_m Λ̃_m Ṽ_m^⊤, where Ṽ_m ∈ R^{l×l} collects the right singular vectors of L^{(S_m)} and Λ̃_m ∈ R^{l×l} is a diagonal matrix. Then the left singular vectors can be efficiently computed as Ũ_m = L^{(S_m)} Ṽ_m Λ̃_m^{−1/2}, with computational complexity O(n_m l^2). Next, let Û^{(S_m)} ∈ R^{ñ_m×K} collect the top K left singular vectors in Ũ_m. We then assign each node to the cluster with the closest pseudo center. Specifically, recall that the indexes of the pseudo centers are collected in C = {i_1, · · · , i_K}. As a result, for the i-th (l + 1 ≤ i ≤ ñ_m) node in S_m, the cluster label g_i is estimated by

ĝ_i = argmin_{1≤k≤K, i_k∈C} ‖Û_i^{(S_m)} − Û_{i_k}^{(S_m)}‖. (2.1)

An obvious merit of (2.1) is that no further iterative algorithms (e.g., k-means) are needed for clustering. This makes the clustering results more stable and computationally efficient. The procedure for community detection on the workers is summarized in Step 3 of Algorithm 2.
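The worker-side computation above, i.e., the economical SVD through the small l × l matrix L^{(S_m)⊤} L^{(S_m)} followed by the nearest-pseudo-center rule (2.1), can be sketched as below. `worker_community_detection` is our own name, and the sketch assumes the sub-adjacency matrix has no zero row or column sums.

```python
import numpy as np

def worker_community_detection(A_sub, center_idx, K):
    """SVD of the rectangular Laplacian L = D^{-1/2} A F^{-1/2} via the
    l x l matrix L^T L, then nearest-pseudo-center assignment (2.1)."""
    d = A_sub.sum(axis=1)                         # out-degrees D_ii (assumed > 0)
    f = A_sub.sum(axis=0)                         # in-degrees  F_jj (assumed > 0)
    L = A_sub / np.sqrt(d)[:, None] / np.sqrt(f)[None, :]
    G = L.T @ L                                   # l x l: O(l^3) eigen-decomposition
    w, V = np.linalg.eigh(G)
    top = np.argsort(w)[::-1][:K]                 # top K right singular vectors
    sv = np.sqrt(np.clip(w[top], 1e-12, None))    # singular values of L
    U = (L @ V[:, top]) / sv                      # left singular vectors
    # rule (2.1): assign each node to its nearest pseudo center
    dists = np.linalg.norm(U[:, None, :] - U[center_idx][None, :, :], axis=2)
    return dists.argmin(axis=1)
```

Because the assignment compares rows of the same singular vector matrix, no rotation needs to be estimated on the worker, exactly as Remark 1 argues.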
3. THEORETICAL PROPERTIES
In this section, we discuss the accuracy of the clustering algorithm. We first establish the theoretical properties of the procedure at the population level. Next, the convergence of the singular vectors is given, which is the key to establishing the consistent clustering result. Lastly, we derive error bounds on the mis-clustering rates.

Algorithm 2
Distributed Community Detection (DCD) for SBM

Input: Adjacency matrix A_1; sub-adjacency matrices {A^{(S_m)}}_{m=1,...,M}; number of communities K; approximation error ε.
Output: Membership matrix Θ̂.

Step 1. Pilot-based network spectral clustering on the master server.
Step 1.1. Conduct the eigen-decomposition of L_1 and extract the top K eigenvectors (denoted by the matrix Û).
Step 1.2. Conduct the k-means algorithm and obtain the clustering centers Ĉ = (Ĉ_k : 1 ≤ k ≤ K)^⊤.

Step 2. Broadcast pseudo centers to the workers.
Step 2.1. Determine the index of the k-th pseudo center as i_k = argmin_i ‖Û_i − Ĉ_k‖.
Step 2.2. Broadcast the index set of pseudo centers C = {i_1, · · · , i_K} to the workers.

Step 3. Community detection on the workers.
Step 3.1. Perform the singular value decomposition of L^{(S_m)} and denote the top K left singular vector matrix by Û^{(S_m)}.
Step 3.2. Use (2.1) to obtain the estimated community labels.
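Putting the three steps of Algorithm 2 together, a self-contained single-machine sketch might look as follows. The full adjacency matrix is held in one array purely to slice out the pieces each server would store; in an actual deployment each server keeps only its own sub-matrix, and only the K pseudo-center indexes cross the network. The function `dcd` and its arguments are our own names, not the interface of the paper's DCD package, and the sketch assumes all degrees involved are positive.

```python
import numpy as np
from sklearn.cluster import KMeans

def dcd(A, K, l, worker_splits, random_state=0):
    """Algorithm 2 sketch: pilots are nodes 0..l-1; worker_splits lists the
    non-pilot node indexes held by each worker."""
    # --- Step 1: spectral clustering of the l x l pilot network on the master
    A1 = A[:l, :l]
    d = A1.sum(axis=1)
    L1 = A1 / np.sqrt(d)[:, None] / np.sqrt(d)[None, :]
    w, V = np.linalg.eigh(L1)
    U_hat = V[:, np.argsort(np.abs(w))[::-1][:K]]
    km = KMeans(n_clusters=K, n_init=10, random_state=random_state).fit(U_hat)
    # --- Step 2: pseudo centers = pilot nodes closest to the k-means centers;
    #     only these K integers are broadcast to the workers
    dists = np.linalg.norm(U_hat[:, None, :] - km.cluster_centers_[None, :, :], axis=2)
    center_idx = dists.argmin(axis=0)
    # --- Step 3: on each worker, SVD of the rectangular Laplacian + rule (2.1)
    labels = np.empty(A.shape[0], dtype=int)
    for nodes in worker_splits:
        S = np.concatenate([np.arange(l), nodes])      # pilots + local nodes
        A_sub = A[np.ix_(S, np.arange(l))]
        Ls = A_sub / np.sqrt(A_sub.sum(axis=1))[:, None] \
                   / np.sqrt(A_sub.sum(axis=0))[None, :]
        ww, VV = np.linalg.eigh(Ls.T @ Ls)
        top = np.argsort(ww)[::-1][:K]
        U = (Ls @ VV[:, top]) / np.sqrt(np.clip(ww[top], 1e-12, None))
        dd = np.linalg.norm(U[:, None, :] - U[center_idx][None, :, :], axis=2)
        lab = dd.argmin(axis=1)
        labels[:l] = lab[:l]            # pilot labels (identical on every worker)
        labels[nodes] = lab[l:]
    return labels
```

The worker loop is embarrassingly parallel: each iteration touches only the sub-matrix A^{(S_m)} and the broadcast index set.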
To motivate the study, we first discuss the theoretical properties at the population level. Define 𝒜^{(S_m)} = E(A^{(S_m)}), 𝒟^{(S_m)} = E(D^{(S_m)}) ∈ R^{ñ_m×ñ_m} and ℱ^{(S_m)} = E(F^{(S_m)}) ∈ R^{l×l}. In addition, the normalized population adjacency matrix is defined by ℒ^{(S_m)} = (𝒟^{(S_m)})^{−1/2} 𝒜^{(S_m)} (ℱ^{(S_m)})^{−1/2}. Suppose the singular value decomposition of ℒ^{(S_m)} is ℒ^{(S_m)} = U^{(S_m)} Λ^{(S_m)} (V^{(S_m)})^⊤, where U^{(S_m)} ∈ R^{ñ_m×K} and V^{(S_m)} ∈ R^{l×K} are the left and right singular vectors, respectively. In the following proposition we show that U^{(S_m)} has K distinct rows and can identify the memberships of the nodes uniquely.

Proposition 2.
Let Θ^{(S_m)} ∈ R^{ñ_m×K} be the membership matrix on the m-th worker. Then we have U^{(S_m)} = Θ^{(S_m)} μ, where μ ∈ R^{K×K} is a rotation matrix, and μ^⊤ Θ_i^{(S_m)} = μ^⊤ Θ_j^{(S_m)} ⇔ Θ_i^{(S_m)} = Θ_j^{(S_m)}.

The proof of Proposition 2 is given in Appendix B.2. Proposition 2 implies that the singular vectors can play the same role in community detection as the eigenvectors of the adjacency matrix.

We then build the connection between U^{(S_m)} and the eigenvector matrix U of ℒ, i.e., ℒ = U Λ U^⊤. Denote U_m = (U_i : i ∈ S_m)^⊤ ∈ R^{ñ_m×K} as the submatrix of U whose row indexes are in S_m. The connection can be built between U^{(S_m)} and U_m. Denote by n_{mk} the number of nodes on the m-th worker belonging to the k-th community. If the ratios n_{mk}/m_k are equal over 1 ≤ k ≤ K, then it can be easily verified, as in Proposition 1, that U^{(S_m)} = r_m^{−1/2} U_m, where r_m = ñ_m/(N + l). However, in practice, the nodes distributed on the workers are mostly unbalanced with respect to the whole population. For instance, a smaller sample of the k-th community may be distributed on the m-th worker compared to other workers. As a result, U^{(S_m)} will not simply be equal to r_m^{−1/2} U_m.

This unbalanced effect can be quantified in the theoretical analysis. Define the unbalanced effect as α^{(S_m)} = max_k |n_{mk}/ñ_m − m_k/N|. As a result, α^{(S_m)} will be large if the ratio of one community (e.g., the k-th community) on the m-th worker is far away from its population ratio m_k/N. In addition, let d_1 ≤ min_k n_k/l ≤ max_k n_k/l ≤ u_1 and d_m ≤ min_k n_{mk}/ñ_m ≤ max_k n_{mk}/ñ_m ≤ u_m. We establish an upper bound for the deviation of U^{(S_m)} from r_m^{−1/2} U_m.

Proposition 3.
Let b_min = min_{1≤i,j≤K} B_{ij}. It holds that

‖U^{(S_m)} − r_m^{−1/2} U_m Q_m‖_F ≤ √K u_m max{u_1^{1/2}, u_m^{1/2}} α^{(S_m)1/2} / {σ_min(B) b_min d_1 d_m (d_1 + d_m)} + α^{(S_m)}/d_1, (3.1)

where Q_m is a K × K orthogonal matrix.

The proof of Proposition 3 is given in Appendix B.3. The upper bound in (3.1) illustrates the relationship between the error bound and the unbalanced effect. In particular, the error bound is tighter when the community members are distributed more evenly on each worker. In the extreme case, when the unbalanced effect is 0 (i.e., α^{(S_m)} = 0), the upper bound in (3.1) is zero.

As we have shown previously, U^{(S_m)} has K distinct rows. As a result, if Û^{(S_m)} converges to U^{(S_m)} with high probability, we are able to achieve high clustering accuracy based on spectral clustering using Û^{(S_m)}. In the following theorem we establish the convergence of Û^{(S_m)} to U^{(S_m)}.

Theorem 1. (Singular Vector Convergence)
Let λ_{1,m} ≥ λ_{2,m} ≥ · · · ≥ λ_{K,m} > 0 be the top K singular values of ℒ^{(S_m)}. Define δ_m = min_i 𝒟_{ii}^{(S_m)}. Then for any ε_m > 0 and δ_m > 3 log(ñ_m + 2l) + 3 log(4/ε_m), with probability at least 1 − ε_m it holds that

‖Û^{(S_m)} − U^{(S_m)} Q^{(S_m)}‖_F ≤ (C/λ_{K,m}) √{K log(4(ñ_m + 2l)/ε_m)/δ_m}, (3.2)

where Q^{(S_m)} ∈ R^{K×K} is a K × K orthogonal matrix and C is an absolute constant.

The proof of Theorem 1 is given in Appendix C.1. To better understand the estimation error bound given in (3.2), we make the following comments. First, the error bound is related to λ_{K,m}. According to Rohe et al. (2011) and Lei and Rinaldo (2015), if λ_{K,m} is larger, the eigengap between the eigenvalues of interest and the rest will be larger. This enables us to detect communities at a higher accuracy level.

Second, the upper bound is lower if the minimum out-degree δ_m is higher. One can verify that 𝒟_{ii}^{(S_m)} = (Θ_i^{(S_m)})^⊤ B Θ_1^⊤ 1_l ≥ b_min Σ_k n_k = b_min l. Consequently, δ_m grows almost linearly in l if b_min is bounded below. If δ_m ≫ K log ñ_m and λ_{K,m} is bounded below by a positive constant, then we have ‖Û^{(S_m)} − U^{(S_m)} Q^{(S_m)}‖_F = o_p(1).

Lastly, the error bound is higher when the number of communities K and the subsample size ñ_m are larger. As a result, larger K and ñ_m will increase the difficulty of the community detection task.

In this section, we conduct clustering accuracy analysis for the DCD algorithm. To this end, we first present a sufficient condition which guarantees correct clustering for a single node. Let Ĉ^{(S_m)} = (Ĉ_1^{(S_m)}, · · · , Ĉ_K^{(S_m)})^⊤ ∈ R^{K×K} be the pseudo centers on worker m. Denote P_m = (2/D_m)^{1/2} − ζ_m with D_m = max_{1≤k≤K} n_{mk} and ζ_m = max_{k∈{1,··· ,K}} ‖Q^{(S_m)⊤} U_{i_k}^{(S_m)} − Ĉ_k^{(S_m)}‖.
Here ζ_m characterizes the distance of the pseudo centers to their population values on worker m. We then have the following proposition.
The node i will be correctly clustered (i.e., ĝ_i = g_i) as long as

‖Û_i^{(S_m)} − Ĉ_{g_i}^{(S_m)}‖ < P_m. (3.3)

The proof of Proposition 4 is given in Appendix B.4. It indicates that the clustering accuracy is closely related to P_m. If the pseudo centers are correctly clustered with high probability, then P_m will be higher. As a consequence, this yields a more accurate community detection result.

In the following we analyze the lower bound of P_m. If we can prove that P_m is positive with high probability, we are then able to show that the total number of mis-clustered nodes is well controlled. Specifically, define the pseudo centers on the master node as Û_c := (Û_i : i ∈ C)^⊤ ∈ R^{K×K}. Ideally, we could directly map Û_c to the column space of Û^{(S_m)} and then complete the community detection on the workers. To this end, a rotation Q_c should be applied to the pseudo centers of the master node (i.e., Û_c). According to Proposition 3 and Theorem 1, the rotation Q_c takes the form Q_c = r_m^{−1/2} r^{1/2} Q^⊤ Q_m Q^{(S_m)}. As a result, the pseudo centers on the m-th worker are defined as Ĉ^{(S_m)} = Û_c Q_c. To establish a lower bound for P_m, we first assume the following conditions.

(C1) (Eigenvalue and Eigengap on Master) Let δ_1 = min_i 𝒟_{1,ii}. Assume δ_1 > 3 log l + 3 log(4/ε_l) and ε_l → 0 as l → ∞.

(C2) (Pilot Nodes) Assume K log(l/ε_l)/(b_min λ_{K,1}^2) ≪ l with ε_l → 0 as l → ∞.

(C3) (Unbalanced Effect) Let d_1, d_m, u_1, u_m be finite constants and assume α^{(S_m)} = o(σ_min(B)/K).

Condition (C1) imposes the same condition as in Theorem 1 for the pilot nodes. Condition (C2) gives a lower bound on the number of pilot nodes. Specifically, it should be larger than both the number of communities K and (log l)/b_min, which is easy to satisfy in practice.
Condition (C3) restricts the unbalanced effect. First, it states that the relative ratios of the communities across all workers are stable by assuming d_1, d_m, u_1, u_m are constants. Next, the unbalanced effect α^{(S_m)} is assumed to converge to zero faster than O(1/K). As a result, as long as K is well controlled (for instance, of order log N) and the signal strength in B is strong enough, conditions (C2) and (C3) can be easily satisfied. We then have the following proposition.
Assume Conditions (C1)–(C3). Then with probability 1 − ε_l, we have P_m ≥ c/√ñ_m as min{l, n_m} → ∞ with the rotation Q_c, where c is a positive constant.

The proof of Proposition 5 is given in Appendix B.5. In practice, to save the effort of estimating the rotation matrix Q_c, we directly broadcast the pseudo center indexes C to the workers and let Ĉ^{(S_m)} = (Û_i^{(S_m)} : i ∈ C)^⊤. As a result, Ĉ^{(S_m)} is naturally embedded in the column space of Û^{(S_m)} and no further rotation is required. Given the results presented in Theorem 1 and Proposition 5, we are then able to obtain the mis-clustering rates for each worker as follows.
Assume the conditions in Theorem 1 and Proposition 5. Denote by R^{(S_m)} the ratio of mis-clustered nodes on worker m. Then we have

R^{(S_m)} = o( K log(l/ε_l)/(b_min l λ_{K,1}^2) + K log(4(ñ_m + 2l)/ε_m)/(λ_{K,m}^2 δ_m) + K α^{(S_m)}/(σ_min(B) b_min^2) ), (3.4)

with probability at least 1 − ε_l − ε_m.

The proof of Theorem 2 is given in Appendix C.2. Theorem 2 establishes an upper bound for the mis-clustering rate on worker m. With respect to this result, we have the following remark.

Remark 2. One can observe that three terms are included in the mis-clustering rate. The first and second terms are related to the convergence of the spectra on the master and the workers. Specifically, the first term is related to the convergence of the eigenvectors on the master. The second term is determined by the convergence of the singular vectors on the m-th worker. As we commented before, with a large sample size and strong signal strength, the mis-clustering rate can be well controlled. Next, the third term is mainly related to the unbalanced effect α^{(S_m)} among the workers, which is lower if the distribution of the communities is more balanced across the workers.

Compared with using the full adjacency matrix of S_m, the error bound in (3.4) is higher. That is straightforward to understand, since in our case we use a sub-adjacency matrix instead of the full one. According to Rohe et al. (2012), when the full adjacency matrix is used, the mis-clustering rate is bounded by R_all^{(S_m)} = O(K log(ñ_m/ε_m)/(ñ_m λ_{K,m}^2)) with high probability. In our case, we have R^{(S_m)} = O(R_all^{(S_m)} ñ_m/l). Hence if n_m ≪ l (i.e., ñ_m ≈ l), then the mis-clustering rate is asymptotically the same as when using the full adjacency matrix. Furthermore, we can obtain a mis-clustering error bound for all network nodes, as in the following corollary.
Assume the same conditions as in Theorem 2. In addition, assume n_1 = n_2 = · · · = n_M =: n and α^{(S_m)} = 0 for 1 ≤ m ≤ M. Denote by R_all the number of all mis-clustered nodes across all workers. Then with probability 1 − (M + 1)/l we have

R_all = O( K (log n + log l)/(l λ_K^2) ), (3.5)

where λ_K = min_m λ_{K,m}.

Corollary 1 is immediately obtained from Theorem 2 by setting ε_l = ε_m = 1/l for 1 ≤ m ≤ M. As indicated by (3.5), the mis-clustering rate is smaller when the number of pilot nodes l is larger. In particular, if l = rN with r ∈ (0, 1), the mis-clustering rate is of the same order as that obtained using the whole adjacency matrix A, while at the same time the computational time is roughly a factor r smaller than when using the whole adjacency matrix. As a consequence, the computational advantage is obvious.
4. SIMULATION STUDIES
In order to demonstrate the performance of our DCD algorithm, we conduct experiments on synthetic datasets under three scenarios. The main differences lie in the generating mechanisms of the networks. For simplicity, we consider a stochastic block model with K blocks, where each block contains s nodes. As a result, Ks = N. The connectivity matrix B is set as

B = ν{λ I_K + (1 − λ) 1_K 1_K^⊤}, (4.1)

where ν ∈ [0, 1] and λ ∈ [0, 1]. The overall connection intensity is controlled by ν, and the connection divergence is characterized by λ. The random experiments are repeated R = 500 times for a reliable evaluation.

To gauge the finite sample performance, we consider two accuracy measures. The first is the mis-clustering rate, i.e., R_all = Σ_{i=1}^N I(ĝ_i ≠ g_i)/N. The second is the estimation accuracy of the singular vectors (i.e., Û^{(S_m)}) for each worker, which is captured by the log-estimation error (LEE). Define LEE_m = log ‖Û^{(S_m)} − U^{(S_m)} Q^{(S_m)}‖_F for the m-th worker, where the rotation matrix Q^{(S_m)} is calculated according to Rohe et al. (2011). Subsequently, LEE = M^{−1} Σ_m LEE_m is calculated to quantify the average estimation error over all workers.

Scenario 1 (Pilot Nodes)
First, we investigate the role of pilot nodes on the numerical performance. Particularly, we let l = rN with N = 10000 and r varying from 0.01 to 0.2. The performances are evaluated for K = 3, 4, 5, 6, with λ = 0.5 and the connection intensity ν held fixed. In addition, the number of workers is given as M = 5. We calculate the mis-clustering rate R_all in the left panel of Figure 3. As shown in Figure 3, the mis-clustering rate converges to zero as l grows, which corroborates our theoretical findings in Corollary 1.

Figure 3: Left panel: the mis-clustering rates versus the pilot nodes ratio (i.e., l/N) under different community sizes K = 3, 4, 5, 6. Right panel: LEE versus the log-number of pilot nodes under sample sizes N = 8000, 10000, 12000, 14000; a reference line with slope −0.5 is displayed.

For each sample size N, as log(l) grows, the estimation error of the eigenvectors decreases, with the slope of LEE roughly parallel to −1/2. This corroborates the theoretical results in Theorem 1.
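The generating mechanism in (4.1) can be sketched in a few lines. Here `sbm_adjacency` is an illustrative helper (not part of the DCD package) that samples an undirected SBM with the stated connectivity matrix:

```python
import numpy as np

rng = np.random.default_rng(42)

def sbm_adjacency(K, s, nu, lam, rng):
    """Sample an SBM adjacency matrix with B = nu * (lam * I_K + (1 - lam) * 1 1^T),
    i.e., within-block edge probability nu and between-block probability nu * (1 - lam).
    The network has K blocks of s nodes each, so N = K * s."""
    N = K * s
    g = np.repeat(np.arange(K), s)                        # true community labels
    B = nu * (lam * np.eye(K) + (1 - lam) * np.ones((K, K)))
    P = B[np.ix_(g, g)]                                   # N x N edge-probability matrix
    A = np.triu(rng.random((N, N)) < P, 1).astype(int)    # sample upper triangle only
    return A + A.T, g                                     # symmetrize: undirected network

A, g = sbm_adjacency(K=3, s=100, nu=0.2, lam=0.5, rng=rng)
print(A.shape, A.mean())
```

With nu = 0.2 and lam = 0.5, edges form with probability 0.2 within a block and 0.1 across blocks, matching the intensity/divergence roles of ν and λ described above.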
Figure 4: The influence of ν (connection intensity) and λ (connection divergence) on the mis-clustering rates for K = 3, 4, 5, 6. As shown in the figure, a stronger connection intensity and a larger connection divergence lead to more accurate community detection results.

Scenario 2 (Signal Strength)

In this scenario, we observe how the mis-clustering rates change with respect to the signal strength. Accordingly, we fix l = 300 and N = 20000, and vary the number of communities as K = 3, 4, 5, 6. For ν = 0.2, we first change the connection divergence from λ = 0.05 to λ = 0.95. As λ increases, the connection intensity within the same community becomes higher relative to that between nodes from different communities. In the meanwhile, the eigengap is larger and the signal strength is higher. Next, we conduct the experiment by varying the connection intensity ν with λ = 0.5 fixed. Theoretically, as ν increases, b_min will increase accordingly, which results in a higher signal strength. According to Theorem 1 and Corollary 1, the mis-clustering rates will drop as the signal strength becomes higher. This phenomenon can be confirmed from the right panel of Figure 4.

Scenario 3 (Unbalanced Effect)
In this setting, we verify the unbalanced effect on the finite sample performance. First, we fix l = 500, N = 5000, λ = 0.5, and M = 3, with the connection intensity ν held fixed. Denote π_mk as the ratio of nodes in the k-th community on the m-th worker. We set π_mk as

π_mk = 1/K + ( k − (K+1)/2 ) sign( m − (M+1)/2 ) α / { K(K−1) }.

If π_m1 = π_m2 = · · · = π_mK = 1/K, then there is no unbalanced effect. As α increases, the unbalanced effect becomes larger. The mis-clustering rate is visualized in Figure 5. As α is increased from 0 to 0.95, we could observe that the mis-clustering rates increase accordingly, which verifies the result of Theorem 2.

Figure 5: The mis-clustering rates versus the unbalanced effect α for different numbers of communities K = 2, 3, 4. As the unbalanced effect increases, the mis-clustering rates also increase, which results in inferior performance of the distributed algorithm.
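The allocation scheme above can be sketched as follows; `community_ratios` is an illustrative helper that tabulates π_mk over all workers and communities:

```python
import numpy as np

def community_ratios(K, M, alpha):
    """Worker-level community proportions
    pi_mk = 1/K + (k - (K+1)/2) * sign(m - (M+1)/2) * alpha / (K * (K - 1))."""
    m = np.arange(1, M + 1)[:, None]   # workers m = 1, ..., M (column)
    k = np.arange(1, K + 1)[None, :]   # communities k = 1, ..., K (row)
    return 1.0 / K + (k - (K + 1) / 2) * np.sign(m - (M + 1) / 2) * alpha / (K * (K - 1))

pi = community_ratios(K=4, M=3, alpha=0.95)
print(pi.round(3))
```

Since the deviations (k − (K+1)/2) sum to zero over k, each row of the resulting M × K matrix sums to one, and alpha = 0 recovers the balanced case π_mk = 1/K.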
Lastly, we compare the performance of the proposed method with spectral clustering (SC) using the whole network data. Both the mis-clustering rates and the computational efficiency are compared. For a network with size N, we conduct spectral clustering by Algorithm 1 and record the clustering accuracy and computational time. For comparison, we conduct distributed spectral clustering using M workers by Algorithm 2.

Figure 6: The mis-clustering rates (left panel) and computational time (middle panel) with respect to varying pilot node ratios for different numbers of workers M = 5, 10, 15, 20. The computational times of the SC Algorithm 1 and the DCD Algorithm 2 are further compared as N grows (right panel).

For N = 10000 and K = 3, the average computational time of the spectral clustering Algorithm 1 (using the whole network adjacency matrix) is 48.84s and the mis-clustering rate is zero. Next, we set l = rN with r = 0.02, 0.04, ..., 0.18 for Algorithm 2. Both the mis-clustering rates and the computational time are compared, as shown in Figure 6. As we could observe, once l exceeds a small fraction of N, the mis-clustering rate of Algorithm 2 matches that of Algorithm 1 while requiring much less computational time. We further compare the computational times of the two algorithms as N grows. For each N, l is set at the point where the mis-clustering rate is the same as using the whole adjacency matrix. As we can observe from the right panel of Figure 6, as the network size grows, the computational time of Algorithm 1 increases drastically compared to Algorithm 2, which illustrates the computational advantage of the proposed approach.
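Algorithms 1 and 2 are not reproduced here, but the spectral step they share can be illustrated on a toy two-block network: take the leading eigenvectors of A and cluster their rows (for K = 2 simply by the sign of the second eigenvector, which avoids a k-means dependency; this is a sketch, not the paper's implementation):

```python
import numpy as np

rng = np.random.default_rng(7)

# Toy two-block SBM: 100 nodes per block, p_in = 0.3, p_out = 0.05.
g = np.repeat([0, 1], 100)
P = np.where(g[:, None] == g[None, :], 0.3, 0.05)
A = np.triu(rng.random((200, 200)) < P, 1).astype(float)
A = A + A.T

# Spectral step: the leading eigenvectors of A carry the block structure.
vals, vecs = np.linalg.eigh(A)          # eigenvalues in ascending order
u2 = vecs[:, -2]                        # eigenvector of the second largest eigenvalue
labels = (u2 > 0).astype(int)           # for K = 2, the sign splits the two blocks

# Mis-clustering rate, up to a global label flip.
err = min(np.mean(labels != g), np.mean(labels == g))
print(err)
```

With the strong signal here (eigengap well above the noise level), the sign split recovers the blocks almost perfectly; for general K one would run k-means on the rows of the top-K eigenvector matrix instead.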
5. EMPIRICAL STUDY
We evaluate the empirical performance of the proposed method using two network datasets. The estimation accuracy and computational time are evaluated for both the distributed community detection algorithm and the spectral clustering method. Particularly, the distributed community detection algorithm is implemented using our newly developed package
DCD on the Spark system. The system consists of 36 virtual cores and 128 GB of RAM. We set the number of workers as M = 2. Descriptions of the two network datasets and the corresponding experimental results are presented as follows.
The Pubmed dataset consists of 19,717 scientific publications from the PubMed database (Kipf and Welling, 2016). Each publication is assigned to one of three classes, i.e., Diabetes Mellitus Experimental, Diabetes Mellitus Type 1, and Diabetes Mellitus Type 2. The sizes of the three classes are 4,103, 7,875, and 7,739 respectively. In this case the community sizes are relatively unbalanced, since the second and third classes each have roughly twice as many members as the first class. The network links are defined by the citation relationships among the publications. Specifically, if the i-th publication cites the j-th one (or vice versa), then A_ij = 1; otherwise A_ij = 0. The resulting network is rather sparse. We vary the pilot node ratio r = l/N from 0.02 to 0.30. One could observe in Figure 7 that the mis-clustering rate of the DCD algorithm is comparable to that of the SC algorithm when r = 0.22, while the computational time is much lower.
Figure 7: Comparison between the SC and DCD algorithms on the Pubmed dataset in terms of both mis-clustering rate and computational time. The mis-clustering rate and computational time of the DCD (SC) algorithm are denoted by R_DCD (R_SC) and t_DCD (t_SC) respectively; here R_SC = 0.3303 and t_SC = 376.755s.

In this study, we consider a large scale social network, Pokec (Takac and Zabovsky, 2012). Pokec is the most popular online social network in Slovakia. The dataset was collected during May 25–27, 2012, and contains 50,000 active users. If the i-th user is a friend of the j-th user, then there is a connection between the two users, i.e., A_ij = 1. Since the true community memberships are not available for this dataset, we evaluate the detection results by the relative edge density (RED), defined as RED = Den_between / Den_within, where Den_between = Σ_{i,j} a_ij I(ĝ_i ≠ ĝ_j) / Σ_{i,j} I(ĝ_i ≠ ĝ_j) is the between-community density, and Den_within = Σ_{i,j} a_ij I(ĝ_i = ĝ_j) / Σ_{i,j} I(ĝ_i = ĝ_j) is the within-community density. A smaller RED indicates a clearer community structure. The RED is visualized in Figure 8. As one can observe, after l/N ≥ 0.28, the RED is stable, with a corresponding computational time of 642.524s. This further illustrates the computational advantage of the proposed DCD algorithm.
Figure 8: The relative edge density (RED) versus the pilot ratio. The RED decreases rapidly as the ratio increases; after r = l/N ≥ 0.28, it remains stable at RED = 0.23.
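The RED criterion defined above can be computed directly from the adjacency matrix and the estimated labels. `relative_edge_density` below is an illustrative helper (self-pairs are excluded from the within-community count), checked on a toy two-block network:

```python
import numpy as np

def relative_edge_density(A, g):
    """RED = Den_between / Den_within for adjacency matrix A and estimated labels g."""
    g = np.asarray(g)
    same = g[:, None] == g[None, :]        # indicator that i and j share a community
    off = ~np.eye(len(g), dtype=bool)      # exclude self pairs (i = j)
    den_within = A[same & off].mean()      # within-community edge density
    den_between = A[~same].mean()          # between-community edge density
    return den_between / den_within

# Toy check: dense within blocks, sparse across, so RED should be well below 1.
rng = np.random.default_rng(1)
g = np.repeat([0, 1], 50)
P = np.where(g[:, None] == g[None, :], 0.3, 0.05)
A = np.triu(rng.random((100, 100)) < P, 1).astype(int)
A = A + A.T
print(relative_edge_density(A, g))
```

A partition that separates the communities well keeps the between-community density far below the within-community density, hence a small RED, consistent with the stabilization seen in Figure 8.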
6. CONCLUDING REMARKS
In this work, we propose a distributed community detection (DCD) algorithm to tackle the community detection task in large scale networks. We distribute l pilot nodes on the master and a non-square adjacency matrix on the workers. The proposed DCD algorithm has three merits. First, the communication cost is low. Second, no further iterative algorithm is used on the workers, and therefore the algorithm is stable. Third, both the computational complexity and the storage requirements are much lower compared to using the whole adjacency matrix. The DCD algorithm is shown to have a clear computational advantage and competitive statistical performance on a variety of synthetic and empirical datasets.

To conclude the article, we provide several topics for future studies. First, better mechanisms can be designed to select pilot nodes on the master server. This would enable us to obtain more accurate estimates of the pseudo centers and yield better clustering results. Next, it is interesting to extend the proposed method to directed networks by considering sending and receiving clusters respectively (Rohe et al., 2012). The theoretical properties and computational complexity could be discussed accordingly. Third, in the community detection task, we only employ the network structure information and ignore potentially useful nodal covariates. As a result, it is important to extend the DCD algorithm to further incorporate various exogenous information.

References
Amini, A. A., Chen, A., Bickel, P. J., and Levina, E. (2013), "Pseudo-likelihood methods for community detection in large sparse networks," The Annals of Statistics, 41, 2097–2122.
Anandkumar, A., Ge, R., Hsu, D., and Kakade, S. M. (2014), "A tensor approach to learning mixed membership community models," The Journal of Machine Learning Research, 15, 2239–2312.
Balakrishnan, S., Xu, M., Krishnamurthy, A., and Singh, A. (2011), "Noise thresholds for spectral clustering," in Advances in Neural Information Processing Systems, pp. 954–962.
Battey, H., Fan, J., Liu, H., Lu, J., and Zhu, Z. (2015), "Distributed estimation and inference with statistical guarantees," arXiv preprint arXiv:1509.05457.
Bickel, P. J. and Chen, A. (2009), "A nonparametric view of network models and Newman–Girvan and other modularities," Proceedings of the National Academy of Sciences, 106, 21068–21073.
Chang, X., Lin, S.-B., and Wang, Y. (2017a), "Divide and conquer local average regression," Electronic Journal of Statistics, 11, 1326–1350.
Chang, X., Lin, S.-B., and Zhou, D.-X. (2017b), "Distributed semi-supervised learning with kernel ridge regression," The Journal of Machine Learning Research, 18, 1493–1514.
Chen, Y., Sanghavi, S., and Xu, H. (2012), "Clustering sparse graphs," in Advances in Neural Information Processing Systems, pp. 2204–2212.
Fan, J., Wang, D., Wang, K., and Zhu, Z. (2017), "Distributed estimation of principal eigenspaces," arXiv preprint arXiv:1702.06488.
Härdle, W. K., Wang, W., and Yu, L. (2016), "Tenet: Tail-event driven network risk," Journal of Econometrics, 192, 499–513.
Holland, P. W., Laskey, K. B., and Leinhardt, S. (1983), "Stochastic blockmodels: First steps," Social Networks, 5, 109–137.
Jin, J. (2015), "Fast community detection by SCORE," The Annals of Statistics, 43, 57–89.
Jordan, M. I., Lee, J. D., and Yang, Y. (2018), "Communication-efficient distributed statistical inference," Journal of the American Statistical Association, 1–14.
Kipf, T. N. and Welling, M. (2016), "Semi-supervised classification with graph convolutional networks," arXiv preprint arXiv:1609.02907.
Lee, J. D., Sun, Y., Liu, Q., and Taylor, J. E. (2015), "Communication-efficient sparse regression: a one-shot approach," arXiv preprint arXiv:1503.04337.
Lei, J. and Rinaldo, A. (2015), "Consistency of spectral clustering in stochastic block models," The Annals of Statistics, 43, 215–237.
Lei, L., Li, X., and Lou, X. (2020), "Consistency of spectral clustering on hierarchical stochastic block models," arXiv preprint arXiv:2004.14531.
Liu, Q. and Ihler, A. T. (2014), "Distributed estimation, information loss and exponential families," in Advances in Neural Information Processing Systems, pp. 1098–1106.
Liu, X., Patacchini, E., and Rainone, E. (2017), "Peer effects in bedtime decisions among adolescents: a social network model with sampled data," The Econometrics Journal, 20, S103–S125.
Marbach, D., Costello, J. C., Küffner, R., Vega, N. M., Prill, R. J., Camacho, D. M., Allison, K. R., Kellis, M., Collins, J. J., and Stolovitzky, G. (2012), "Wisdom of crowds for robust gene network inference," Nature Methods, 9, 796–804.
Marbach, D., Prill, R. J., Schaffter, T., Mattiussi, C., Floreano, D., and Stolovitzky, G. (2010), "Revealing strengths and weaknesses of methods for gene network inference," Proceedings of the National Academy of Sciences, 107, 6286–6291.
Rohe, K., Chatterjee, S., and Yu, B. (2011), "Spectral clustering and the high-dimensional stochastic blockmodel," The Annals of Statistics, 39, 1878–1915.
Rohe, K., Qin, T., and Yu, B. (2012), "Co-clustering for directed graphs: the stochastic co-blockmodel and spectral algorithm Di-Sim," arXiv preprint arXiv:1204.2296.
Sarkar, P. and Bickel, P. J. (2015), "Role of normalization in spectral clustering for stochastic blockmodels," The Annals of Statistics, 43, 962–990.
Sojourner, A. (2013), "Identification of peer effects with missing peer data: Evidence from Project STAR," The Economic Journal, 123, 574–605.
Takac, L. and Zabovsky, M. (2012), "Data analysis in public social networks," in International Scientific Conference and International Workshop Present Day Trends of Innovations, vol. 1.
Von Luxburg, U. (2007), "A tutorial on spectral clustering," Statistics and Computing, 17, 395–416.
Zhang, Y., Duchi, J. C., and Wainwright, M. J. (2013), "Communication-efficient algorithms for statistical optimization," The Journal of Machine Learning Research, 14, 3321–3363.
Zhao, Y., Levina, E., and Zhu, J. (2012), "Consistency of community detection in networks under degree-corrected stochastic block models," The Annals of Statistics, 40, 2266–2292.
Zhu, X., Huang, D., Pan, R., and Wang, H. (2020), "Multivariate spatial autoregressive model for large scale social networks," Journal of Econometrics, 215, 591–606.
Zhu, X., Li, F., and Wang, H. (2019), "Least squares approximation for a distributed system," arXiv preprint arXiv:1908.04904.
Zou, T., Lan, W., Wang, H., and Tsai, C.-L. (2017), "Covariance regression analysis," Journal of the American Statistical Association, 112, 266–281.