Consistency of spectral clustering in stochastic block models
Institute of Mathematical Statistics, 2015
By Jing Lei and Alessandro Rinaldo
Carnegie Mellon University
We analyze the performance of spectral clustering for community extraction in stochastic block models. We show that, under mild conditions, spectral clustering applied to the adjacency matrix of the network can consistently recover hidden communities even when the order of the maximum expected degree is as small as $\log n$, with $n$ the number of nodes. This result applies to some popular polynomial time spectral clustering algorithms and is further extended to degree corrected stochastic block models using a spherical $k$-median spectral clustering method. A key component of our analysis is a combinatorial bound on the spectrum of binary random matrices, which is sharper than the conventional matrix Bernstein inequality and may be of independent interest.
1. Introduction.
Network analysis is concerned with describing and modeling the joint occurrence of random interactions among actors in a given population of interest. In its simplest form, a network dataset over $n$ actors is a simple undirected random graph on $n$ nodes, where the edges encode the realized binary interactions among the nodes. Examples include social networks (friendship between Facebook users, blog following, Twitter following, etc.), biological networks (gene networks, gene-protein networks), information networks (email networks, the World Wide Web) and many others. A review of modeling and inference on network data can be found in Kolaczyk (2009), Newman (2010) and Goldenberg et al. (2010).

Among the many existing statistical models for network data, the stochastic block model, henceforth SBM, of Holland, Laskey and Leinhardt (1983)

Received September 2014.
Supported by NSF Grant BCS-0941518, NSF Grant DMS-14-07771 and NIH Grant MH057881.
Supported by AFOSR and DARPA Grant FA9550-12-1-0392 and NSF CAREER Grant DMS-11-49677.
AMS 2000 subject classifications.
Key words and phrases.
Network data, stochastic block model, spectral clustering, sparsity.
This is an electronic reprint of the original article published by the Institute of Mathematical Statistics in The Annals of Statistics, 2015, Vol. 43, No. 1, 215–237. This reprint differs from the original in pagination and typographic detail.
stands out for both its simplicity and expressive power. In a SBM, the nodes are partitioned into $K < n$ disjoint groups, or communities, according to some latent random mechanism. Conditionally on the realized but unobservable community assignments, the edges then occur independently with probabilities depending only on the community membership of the nodes, so that nodes from the same community will have a higher average degree of connectivity among themselves than with the remaining nodes (see Section 2.1 for details). Because of its simple analytic form and its ability to capture the emergence of communities, a feature commonly observed in real network data, the SBM is certainly among the most popular models for network data.

Within the SBM framework, the most important inferential task is that of recovering the community membership of the nodes from a single observation of the network. To solve this problem, in recent years researchers have proposed a variety of procedures, which vary greatly in their degrees of statistical accuracy and computational complexity. See, in particular, modularity maximization [Newman and Girvan (2004)], likelihood methods [Bickel and Chen (2009), Choi, Wolfe and Airoldi (2012), Zhao, Levina and Zhu (2012), Amini et al. (2012), Celisse, Daudin and Pierre (2012)], method of moments [Anandkumar et al. (2013)], belief propagation [Decelle et al. (2011)], convex optimization [Chen, Sanghavi and Xu (2012)], spectral clustering [Rohe, Chatterjee and Yu (2011), Balakrishnan et al. (2011), Jin (2012), Fishkind et al. (2013), Sarkar and Bickel (2013)] and its variants [Coja-Oghlan (2010), Chaudhuri, Chung and Tsiatas (2012)], and spectral embeddings [Sussman et al. (2012), Lyzinski et al. (2013)].

Spectral clustering [see, e.g., von Luxburg (2007)] is arguably one of the most widely used methods for community recovery. Broadly speaking, this procedure first performs an eigen-decomposition of the adjacency matrix or the graph Laplacian.
Then the community membership is inferred by applying a clustering algorithm, typically $k$-means, to the (possibly normalized) rows of the matrix formed by the first few leading eigenvectors. Spectral clustering is easier to implement and computationally less demanding than many other methods, most of which amount to computationally intractable combinatorial searches. From a theoretical standpoint, spectral clustering has been shown to enjoy good theoretical properties in denser stochastic block models where the average degree grows faster than $\log n$; see, for example, Rohe, Chatterjee and Yu (2011), Jin (2012), Sarkar and Bickel (2013). In addition, spectral clustering has been empirically observed to yield good performance even in sparser regimes. For example, it is recommended as the initial solution for a search based procedure in Amini et al. (2012). In the computer science literature, spectral clustering is also a standard procedure for graph partitioning and for solving the planted partition model, a special case of the SBM [see, e.g., Ng et al. (2002)].

Despite its popularity and simplicity, the theoretical properties of spectral clustering are still not well understood in sparser SBM settings where the magnitude of the maximum expected node degree can be as small as $\log n$. This regime of sparsity is in fact not covered by existing analyses of the performance of spectral clustering for community recovery, which postulate a denser network. Indeed, Rohe, Chatterjee and Yu (2011), Fishkind et al. (2013) require the expected node degree to be almost linear in $n$, while Jin (2012) requires polynomial growth. Analogous conditions can be found elsewhere; see, for example, Sussman et al.
(2012) and Balakrishnan et al. (2011).

In this paper, we derive new error bounds for spectral clustering for the purpose of community recovery in moderately sparse stochastic block models and degree corrected stochastic block models [see, e.g., Karrer and Newman (2011)], where the maximum expected node degree is of order $\log n$ or higher. Our main contribution is to show that the most basic form of spectral clustering is successful in recovering the latent community memberships under conditions on the network sparsity that are weaker than the ones used in most of the literature. Our results yield some sharpening of existing analyses of spectral clustering for community recovery, and provide a theoretical justification for the effectiveness of this procedure in moderately sparse networks. We note that there are competing methods yielding consistent community recovery under even milder conditions on the rate of growth of the node degrees, but they either rely on combinatorial methods that are computationally demanding [Bickel and Chen (2009)] or are guaranteed to be successful provided that they are given good starting points [Amini et al. (2012)], which are typically unknown. Other computationally efficient procedures with strong theoretical guarantees, which include in particular the ones proposed and analyzed in McSherry (2001), Chen, Sanghavi and Xu (2012), Channarond, Daudin and Robin (2012), Sarkar and Bickel (2013), require instead the degrees to be of larger order than $\log n$. More detailed comparisons with some of these contributions will be given after the statement of the main results, as more technical background is introduced. Finally, it is also known that in the ultra-sparse case, where the maximum degree is of order $O(1)$, consistent community recovery is impossible and one can only hope to recover the communities up to a constant fraction [see Coja-Oghlan (2010), Decelle et al. (2011), Krzakala et al.
(2013), Massoulie (2013), Mossel, Neeman and Sly (2012, 2013)].

The contributions of this paper are as follows. We prove that the simplest form of spectral clustering, consisting of applying approximate $k$-means algorithms to the rows of the matrix formed by the leading eigenvectors of the adjacency matrix, allows one to recover the community memberships of all but a vanishing fraction of the nodes in stochastic block models with expected degree as small as $\log n$, with high probability. We also extend this result to degree corrected stochastic block models by analyzing an approximate spherical $k$-median spectral clustering algorithm. The algorithms we consider are among the most practical and computationally affordable procedures available. Yet the theoretical guarantees we provide hold under rather general assumptions of sparsity that are weaker than the ones used in algorithms of similar complexity. Our arguments extend those in Rohe, Chatterjee and Yu (2011) and Jin (2012) by combining a principal subspace perturbation analysis (Lemma 5.1), a deterministic performance guarantee for approximate $k$-means clustering (Lemma 5.3) and a sharp bound on the spectrum of binary random matrices (Theorem 5.2), which may be of independent interest. These techniques give sharper results under weaker conditions. In particular, the subspace perturbation analysis allows us to avoid the individual eigengap condition. On the other hand, the spectral bound gives a better large deviation result that cannot be obtained by the matrix Bernstein inequality [Chung and Radcliffe (2011), Tropp (2012)] and leads to a simple extension to the degree corrected stochastic block model.

The article is organized as follows. In Section 2 we give a formal introduction to the stochastic block model and spectral clustering. The main results are presented and compared to related works in Section 3 for regular SBMs and in Section 4 for degree corrected block models.
Section 5 presents the proofs of the main results, including a general, highly modular scheme for analyzing the performance of spectral clustering algorithms. Concluding remarks are given in Section 6.

Notation. For a matrix $M$ and index sets $I, J \subseteq [n]$, let $M_{I*}$ and $M_{*J}$ be the submatrices of $M$ consisting of the corresponding rows and columns. Let $\mathbb{M}_{n,K}$ be the collection of all $n \times K$ matrices where each row has exactly one 1 and $(K-1)$ 0's. For any $\Theta \in \mathbb{M}_{n,K}$, we call $\Theta$ a membership matrix, and the community membership of a node $i$ is denoted by $g_i \in \{1, \ldots, K\}$, which satisfies $\Theta_{i g_i} = 1$. Let $G_k = G_k(\Theta) = \{1 \le i \le n : g_i = k\}$ and $n_k = |G_k|$ for all $1 \le k \le K$. Let $n_{\min} = \min_{1 \le k \le K} n_k$, $n_{\max} = \max_{1 \le k \le K} n_k$, and let $n'_{\max}$ be the second largest community size. We use $\|\cdot\|$ to denote both the Euclidean norm of a vector and the spectral norm of a matrix. $\|M\|_F = (\mathrm{trace}(M^T M))^{1/2}$ denotes the Frobenius norm of a matrix $M$. The $\ell_0$ norm $\|M\|_0$ simply counts the number of nonzero entries in $M$. For any square matrix $M$, $\mathrm{diag}(M)$ denotes the matrix obtained by setting all off-diagonal entries of $M$ to 0. For two sequences of real numbers $\{x_n\}$ and $\{y_n\}$, we write $x_n = o(y_n)$ if $\lim_n x_n/y_n = 0$, $x_n = O(y_n)$ if $|x_n/y_n| \le C$ for all $n$ and some positive $C$, and $x_n = \Omega(y_n)$ if $|x_n/y_n| > C$ for all $n$ and some positive $C$.
2. Preliminaries.
2.1. Model setup.
A stochastic block model with $n$ nodes and $K$ communities is parameterized by a pair of matrices $(\Theta, B)$, where $\Theta \in \mathbb{M}_{n,K}$ is the membership matrix and $B \in \mathbb{R}^{K \times K}$ is a symmetric connectivity matrix. For each node $i$, let $g_i$ ($1 \le g_i \le K$) be its community label, so that the $i$th row of $\Theta$ is 1 in column $g_i$ and 0 elsewhere. The entry $B_{k\ell}$ of $B$ is the edge probability between any node in community $k$ and any node in community $\ell$. Given $(\Theta, B)$, the adjacency matrix $A = (a_{ij})_{1 \le i,j \le n}$ is generated as
$$a_{ij} = \begin{cases} \text{independent Bernoulli}(B_{g_i g_j}), & \text{if } i < j, \\ 0, & \text{if } i = j, \\ a_{ji}, & \text{if } i > j. \end{cases}$$
The goal of community recovery is to recover the membership matrix $\Theta$ up to column permutations. Throughout this article, we assume that the number of communities, $K$, is known. For an estimate $\widehat\Theta \in \mathbb{M}_{n,K}$ of the node memberships, we consider two measures of estimation error. The first one is the overall relative error
$$L(\widehat\Theta, \Theta) = n^{-1} \min_{J \in E_K} \|\widehat\Theta J - \Theta\|_0,$$
where $E_K$ is the set of all $K \times K$ permutation matrices. Because both $\widehat\Theta J$ and $\Theta$ are membership matrices, we have $\|\widehat\Theta J - \Theta\|_0 = \|\widehat\Theta J - \Theta\|_F^2$. This quantity measures the overall proportion of mis-clustered nodes. The other performance criterion measures the worst case relative error over all communities:
$$\widetilde L(\widehat\Theta, \Theta) = \min_{J \in E_K} \max_{1 \le k \le K} n_k^{-1} \|(\widehat\Theta J)_{G_k *} - \Theta_{G_k *}\|_0.$$
It is obvious that $0 \le L(\widehat\Theta, \Theta) \le \widetilde L(\widehat\Theta, \Theta) \le 2$. Thus, $\widetilde L$ is a stronger criterion than $L$ in that it requires the estimator to do well for all communities, while an estimator $\widehat\Theta$ with small $L(\widehat\Theta, \Theta)$ may have large relative errors for some small communities.
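As a concrete companion to these definitions, here is a small NumPy sketch (our own illustration; names such as `sample_sbm` are not from the paper) that samples an adjacency matrix from a SBM and evaluates the overall error $L(\widehat\Theta, \Theta)$ by brute-force minimization over the $K!$ column permutations, which is feasible for small $K$:

```python
import itertools
import numpy as np

def sample_sbm(theta, B, rng=None):
    """Sample a symmetric adjacency matrix with a_ij ~ Bernoulli(B_{g_i g_j})."""
    rng = np.random.default_rng(rng)
    P = theta @ B @ theta.T           # edge probabilities P_ij = B_{g_i g_j}
    upper = rng.random(P.shape) < P   # independent Bernoulli draws
    A = np.triu(upper, 1)             # keep i < j only; zero diagonal
    return (A + A.T).astype(int)

def misclustering_rate(theta_hat, theta):
    """L(theta_hat, theta): n^{-1} min over permutations J of ||theta_hat J - theta||_0."""
    n, K = theta.shape
    best = min(
        np.count_nonzero(theta_hat[:, perm] - theta)
        for perm in itertools.permutations(range(K))
    )
    return best / n
```

Each misclustered node contributes two nonzero entries to the difference, which is why $L$ is bounded by 2 rather than 1.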
2.2. Spectral clustering.
Spectral clustering is a simple method for community recovery [von Luxburg (2007), Rohe, Chatterjee and Yu (2011), Jin (2012)]. In a SBM, the heuristic of spectral clustering is to relate the eigenvectors of $A$ to those of $P := \Theta B \Theta^T$, using the fact that $E(A) = P - \mathrm{diag}(P)$. Let $P = U D U^T$ be the eigen-decomposition of $P$, with $U^T U = I_K$ and $D \in \mathbb{R}^{K \times K}$ diagonal. Then it is easy to see that $U$ has only $K$ distinct rows, since $P$ has only $K$ distinct rows. Under mild conditions, it is also the case that two nodes are in the same community if and only if their corresponding rows in $U$ are the same. This is formally stated in the following lemma.

Lemma 2.1 (Basic eigen-structure of SBMs).
Let the pair $(\Theta, B)$ parametrize a SBM with $K$ communities, where $B$ is full rank. Let $U D U^T$ be the eigen-decomposition of $P = \Theta B \Theta^T$. Then $U = \Theta X$, where $X \in \mathbb{R}^{K \times K}$ and $\|X_{k*} - X_{\ell *}\| = \sqrt{n_k^{-1} + n_\ell^{-1}}$ for all $1 \le k < \ell \le K$.

Proof.
Let $\Delta = \mathrm{diag}(\sqrt{n_1}, \ldots, \sqrt{n_K})$; then
$$P = \Theta B \Theta^T = \Theta \Delta^{-1} (\Delta B \Delta) (\Theta \Delta^{-1})^T. \quad (2.1)$$
It is straightforward to verify that $\Theta \Delta^{-1}$ is orthonormal. Let $Z D Z^T = \Delta B \Delta$ be the eigen-decomposition of $\Delta B \Delta$. Thus, we have $P = U D U^T$ where $U = \Theta \Delta^{-1} Z$. The claim follows by letting $X = \Delta^{-1} Z$ and realizing that the rows of $\Delta^{-1} Z$ are perpendicular to each other and the $k$th row has length $\|(\Delta^{-1} Z)_{k*}\| = \sqrt{1/n_k}$. □

Based on this observation, spectral clustering tries to estimate $U$ and its row clustering using a spectral decomposition of $A$. The intuition for the procedure is as follows. Consider the difference between $A$ and $P$:
$$A - P = (A - E(A)) - \mathrm{diag}(P),$$
which is a symmetric noise matrix plus a diagonal matrix. Intuitively, the eigenvectors of $A$ will be close to those of $P$ because the eigenvalues of $P$ scale linearly with $n$, while the noise matrix $A - E(A)$ has operator norm on the scale of $\sqrt{n}$ and $\mathrm{diag}(P)$ is like a constant. Therefore, letting $A = \widehat U \widehat D \widehat U^T$ be the $K$-dimensional eigen-decomposition of $A$ corresponding to the $K$ largest absolute eigenvalues, we can see that $\widehat U$ should have roughly $K$ distinct rows, because they are slightly perturbed versions of the rows in $U$. Then one should be able to obtain a good community partition by applying a clustering algorithm to the rows of $\widehat U$. In this paper we consider $k$-means clustering, defined as
$$(\widehat\Theta, \widehat X) = \arg\min_{\Theta \in \mathbb{M}_{n,K},\, X \in \mathbb{R}^{K \times K}} \|\Theta X - \widehat U\|_F^2. \quad (2.2)$$
It is known that finding a global minimizer for the $k$-means problem (2.2) is NP-hard [see, e.g., Aloise et al. (2009)]. However, efficient algorithms exist for finding an approximate solution whose value is within a constant fraction of the optimal value [Kumar, Sabharwal and Sen (2004)]. That is, there are polynomial time algorithms that find
$$(\widehat\Theta, \widehat X) \in \mathbb{M}_{n,K} \times \mathbb{R}^{K \times K} \quad (2.3)$$
$$\text{s.t. } \|\widehat\Theta \widehat X - \widehat U\|_F^2 \le (1 + \varepsilon) \min_{\Theta \in \mathbb{M}_{n,K},\, X \in \mathbb{R}^{K \times K}} \|\Theta X - \widehat U\|_F^2.$$
The spectral clustering algorithm we consider here is summarized in Algorithm 1.
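Before turning to the algorithm, Lemma 2.1 can be checked numerically on a toy example (a one-off NumPy sketch of our own; the sizes and connectivity values are arbitrary):

```python
import numpy as np

n_k = [3, 5]                      # community sizes n_1, n_2
K = len(n_k)
theta = np.zeros((sum(n_k), K))   # membership matrix Theta
start = 0
for k, nk in enumerate(n_k):
    theta[start:start + nk, k] = 1
    start += nk

B = np.array([[0.8, 0.2],
              [0.2, 0.6]])        # full-rank symmetric connectivity matrix
P = theta @ B @ theta.T

# Leading-K eigenvectors of P (P has rank K).
vals, vecs = np.linalg.eigh(P)
U = vecs[:, np.argsort(-np.abs(vals))[:K]]

# Lemma 2.1: rows of U coincide within a community, and representatives of
# communities k != l are exactly sqrt(1/n_k + 1/n_l) apart.
gap = np.linalg.norm(U[0] - U[-1])
print(abs(gap - np.sqrt(1 / n_k[0] + 1 / n_k[1])) < 1e-10)  # True
```

Row distances are invariant to the orthogonal rotation ambiguity of the eigenvectors, so the check does not depend on sign conventions of the eigen-solver.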
Algorithm 1: Spectral clustering with approximate $k$-means

Input:
Adjacency matrix $A$; number of communities $K$; approximation parameter $\varepsilon$.

Output:
Membership matrix $\widehat\Theta \in \mathbb{M}_{n,K}$.
1. Calculate $\widehat U \in \mathbb{R}^{n \times K}$, consisting of the leading $K$ eigenvectors (ordered by absolute eigenvalue) of $A$.
2. Let $(\widehat\Theta, \widehat X)$ be a $(1+\varepsilon)$-approximate solution to the $k$-means problem (2.3) with $K$ clusters and input matrix $\widehat U$.
3. Output $\widehat\Theta$.
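The following is a minimal, self-contained Python sketch of Algorithm 1 (our own illustration). A plain Lloyd iteration with farthest-first seeding stands in for the $(1+\varepsilon)$-approximate $k$-means solver of Kumar, Sabharwal and Sen (2004) that the theory assumes:

```python
import numpy as np

def lloyd_kmeans(X, K, n_iter=50):
    """Heuristic k-means: farthest-first seeding, then Lloyd updates."""
    centers = [X[0]]
    for _ in range(1, K):
        d = np.min([np.linalg.norm(X - c, axis=1) for c in centers], axis=0)
        centers.append(X[d.argmax()])  # seed with the point farthest from chosen centers
    centers = np.array(centers)
    labels = np.zeros(len(X), dtype=int)
    for _ in range(n_iter):
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        for k in range(K):
            if np.any(labels == k):
                centers[k] = X[labels == k].mean(axis=0)
    return labels

def spectral_clustering(A, K):
    """Algorithm 1 sketch: k-means on the K leading (by |eigenvalue|) eigenvectors of A."""
    vals, vecs = np.linalg.eigh(A)
    U_hat = vecs[:, np.argsort(-np.abs(vals))[:K]]
    labels = lloyd_kmeans(U_hat, K)
    theta_hat = np.zeros((A.shape[0], K), dtype=int)
    theta_hat[np.arange(A.shape[0]), labels] = 1
    return theta_hat
```

Running this on an adjacency matrix whose within-community probability is well above the between-community one recovers the partition up to a label permutation; no approximation guarantee of the form (2.3) is claimed for this heuristic.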
2.3. Sparsity scaling.
Real-world large scale networks are usually sparse, in the sense that the number of edges from a node (the node degree) is very small compared to the total number of nodes. Generally speaking, community recovery is hard when the data are sparse. As a result, an important criterion for evaluating a community recovery method is its performance under different levels of sparsity (usually measured by the error rate as a function of the average/maximum degree). The following prototypical example illustrates well the roles played by network sparsity, as well as other model parameters, in determining the hardness of community recovery.
Example 2.2.
Consider a SBM with $K$ communities parameterized by $(\Theta, B)$, where
$$B = \alpha_n B_0; \qquad B_0 = \lambda I_K + (1-\lambda) \mathbf{1}_K \mathbf{1}_K^T, \qquad 0 < \lambda < 1, \quad (2.4)$$
$I_K$ is the $K \times K$ identity matrix, and $\mathbf{1}_K$ is the $K \times 1$ vector of ones, so that the edge probability is $\alpha_n$ within communities and $\alpha_n(1-\lambda)$ between communities. The quantity $\lambda$ reflects the relative difference in connectivity between communities and within communities. The network sparsity is captured by $\alpha_n$, where $n\alpha_n$ provides an upper bound on the average (and, in this example, maximum) expected node degree. It can easily be seen that if $\alpha_n$ or $\lambda$ are close to 0, then it is hard to identify communities.

The hardness of community reconstruction also depends on the number of communities and the community size imbalance. For example, the famous planted clique problem concerns community recovery under a SBM with $K = 2$ and
$$B = \begin{pmatrix} 1 & 1/2 \\ 1/2 & 1/2 \end{pmatrix}. \quad (2.5)$$
In the planted clique problem, it is known that community recovery is easy if $n_{\min} \ge c\sqrt{n}$ for a constant $c$ [see Deshpande and Montanari (2013) and references therein], and, on the other hand, no polynomial time algorithms have been found to succeed when $n_{\min} = o(\sqrt{n})$.

Remark.
The primary concern of this paper is the effect of $\alpha_n$ on the performance of spectral clustering. Nevertheless, our results explicitly keep track of other quantities such as $K$, $\lambda$, $n_{\max}$ and $n_{\min}$, all of which are allowed to change with $n$ in a nontrivial manner. The dependence of the recovery error bound on some of these quantities, such as $K$ and $\lambda$, is considered by some authors, such as Chen, Sanghavi and Xu (2012), Chaudhuri, Chung and Tsiatas (2012), Anandkumar et al. (2013). For ease of readability, we do not always make this dependence on $n$ explicit in our notation.
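To make Example 2.2 concrete, here is a small NumPy sketch (the function name is ours, not from the paper) that builds $B = \alpha_n B_0$ and checks the two facts used later: the edge probabilities are $\alpha_n$ within and $\alpha_n(1-\lambda)$ between communities, and the smallest eigenvalue of $B_0$ equals $\lambda$:

```python
import numpy as np

def example_connectivity(K, lam, alpha_n):
    """B = alpha_n * B0 with B0 = lam * I_K + (1 - lam) * 1_K 1_K^T (Example 2.2)."""
    B0 = lam * np.eye(K) + (1 - lam) * np.ones((K, K))
    return alpha_n * B0

K, lam, alpha_n = 3, 0.4, 0.1
B = example_connectivity(K, lam, alpha_n)
print(B[0, 0])   # within-community probability: alpha_n
print(B[0, 1])   # between-community probability: alpha_n * (1 - lam)
# Eigenvalues of B0 are lam (with multiplicity K - 1) and lam + (1 - lam) * K,
# so the smallest eigenvalue is lam, the quantity appearing in Corollary 3.2.
print(np.linalg.eigvalsh(B / alpha_n).min())
```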
3. Stochastic block models.
Our main result provides an upper bound on the relative community reconstruction error of spectral clustering for a SBM $(\Theta, B)$ in terms of several model parameters.
Theorem 3.1.
Let $A$ be an adjacency matrix generated from a stochastic block model $(\Theta, B)$. Assume that $P = \Theta B \Theta^T$ is of rank $K$, with smallest absolute nonzero eigenvalue at least $\gamma_n$, and that $\max_{k,\ell} B_{k\ell} \le \alpha_n$ for some $\alpha_n \ge \log n / n$. Let $\widehat\Theta$ be the output of spectral clustering using $(1+\varepsilon)$-approximate $k$-means (Algorithm 1). There exists an absolute constant $c > 0$ such that, if
$$(2 + \varepsilon)\, \frac{K n \alpha_n}{\gamma_n^2} < c, \quad (3.1)$$
then, with probability at least $1 - n^{-1}$, there exist subsets $S_k \subset G_k$ for $k = 1, \ldots, K$, and a $K \times K$ permutation matrix $J$, such that $\widehat\Theta_{G*} J = \Theta_{G*}$, where $G = \bigcup_{k=1}^K (G_k \setminus S_k)$, and
$$\sum_{k=1}^K \frac{|S_k|}{n_k} \le c^{-1} (2 + \varepsilon)\, \frac{K n \alpha_n}{\gamma_n^2}. \quad (3.2)$$

The proof of Theorem 3.1, given in Section 5, is modular, and can be derived from several relatively independent lemmas. The sets $S_k$ ($1 \le k \le K$) consist of nodes in $G_k$ for which clustering correctness cannot be guaranteed. The permutation matrix $J$ in the above theorem leads to an upper bound on the reconstruction error $\widetilde L(\widehat\Theta, \Theta)$ [and hence on $L(\widehat\Theta, \Theta)$] through equation (3.2).

Condition (3.1) specifies the range of model parameters $(K, n, \gamma_n, \alpha_n)$ for which the result is applicable. It is included only for technical reasons, because it holds whenever the bound in (3.2) vanishes and, therefore, implies consistency. In particular, as discussed after Corollary 3.2, we have $K n \alpha_n / \gamma_n^2 = o(1)$ in many interesting cases. The constant $c$ in (3.1) can be written as $c = 1/(64 C^2)$, where $C$ is an absolute constant defined in Theorem 5.2 and can be explicitly tracked in the proof presented in the supplementary material [Lei and Rinaldo (2014)]. The assumption $\alpha_n \ge \log n/n$ can be changed to $\alpha_n \ge c_0 \log n/n$ for any $c_0 > 0$, and also the probability bound $1 - n^{-1}$ can be changed to $1 - n^{-r}$ for any $r > 0$, with a different constant $c = c(c_0, r)$ in (3.1) and (3.2).

While Theorem 3.1 provides a general error bound for spectral clustering, the quantities involved are not in the most transparent form. For example, the bound does not clearly reflect the intuition that the error should increase when $\alpha_n$ decreases. This is because the quantity $\gamma_n$ contains the parameter $\alpha_n$. Also the dependence on the community size imbalance, as well as on the community separation (which corresponds to the parameter $\lambda$ in Example 2.2), remains unclear. The next corollary illustrates the error bound in terms of these model parameters.

Corollary 3.2.
Let $A$ be an adjacency matrix from the SBM $(\Theta, B)$, where $B = \alpha_n B_0$ for some $\alpha_n \ge \log n / n$, and with $B_0$ having minimum absolute eigenvalue at least $\lambda > 0$ and $\max_{k\ell} B_0(k, \ell) = 1$. Let $\widehat\Theta$ be the output of spectral clustering using $(1+\varepsilon)$-approximate $k$-means (Algorithm 1). Then there exists an absolute constant $c$ such that if
$$(2 + \varepsilon)\, \frac{K n}{n_{\min}^2 \lambda^2 \alpha_n} < c, \quad (3.3)$$
then with probability at least $1 - n^{-1}$,
$$\widetilde L(\widehat\Theta, \Theta) \le c^{-1} (2 + \varepsilon)\, \frac{K n}{n_{\min}^2 \lambda^2 \alpha_n} \qquad \text{and} \qquad L(\widehat\Theta, \Theta) \le c^{-1} (2 + \varepsilon)\, \frac{K n'_{\max}}{n_{\min}^2 \lambda^2 \alpha_n}.$$

In the special case of balanced community sizes [i.e., $n_{\max}/n_{\min} = O(1)$] and constant $\lambda$, if $\alpha_n = \Omega(\log n / n)$, then $L(\widehat\Theta, \Theta) = O_P(K^2 (n\alpha_n)^{-1}) = O_P(K^2 / \log n)$. Thus $L(\widehat\Theta, \Theta) = o_P(1)$ if $K = o(\sqrt{\log n})$. This improves the results in Rohe, Chatterjee and Yu (2011), where $\alpha_n$ needs to be of order $1/\log n$ for a similar result.

In Example 2.2, the smallest nonzero eigenvalue of $B_0$ is $\lambda$. Recall that $\lambda$ is the relative difference of within- and between-community edge probabilities. Corollary 3.2 then implies that when this relative difference stays bounded away from zero, the communities can be consistently recovered by simple spectral clustering as long as the expected node degrees are no less than $\log n$. On the other hand, when $\alpha_n$ is constant and $\lambda = \lambda_n$ varies with $n$, spectral clustering can recover the communities when the relative edge probability gap grows faster than $1/\sqrt{n}$.

In the planted clique problem, $L(\widehat\Theta, \Theta)$ has limited meaning, because a trivial clustering putting all nodes in one cluster achieves $L(\widehat\Theta, \Theta) = 2 n_{\min}/n$, which is $o(1)$ in the most interesting regime. Therefore, it makes more sense to consider $\widetilde L(\widehat\Theta, \Theta)$. Now $B_0 = B$ is given by (2.5), with minimum eigenvalue $> 0.19$. Applying Corollary 3.2 with $K = 2$, $\lambda = 0.19$, $\alpha_n = 1$, and any fixed $\varepsilon > 0$, we have
$$\widetilde L(\widehat\Theta, \Theta) < c' \frac{n}{n_{\min}^2},$$
provided that $c' n/n_{\min}^2 < 1$, where $c'$ is a different absolute constant. Therefore, when $n_{\min} \ge \sqrt{an}$ for some $a > c'$, $\widehat\Theta$ recovers the hidden clique with a relative error no larger than $c'/a$. Thus, our result reaches the widely believed computational barrier [up to a constant factor; see Deshpande and Montanari (2013) and references therein] of the planted clique problem.

There are spectral methods other than spectral clustering that can provide consistent community recovery. One such well-known method is the procedure analyzed by McSherry (2001). The planted partition problem in that setting corresponds to the problem of recovering the community memberships in the SBM. To simplify the presentation and focus on the dependence on network sparsity, we consider the SBM in Example 2.2 with two equal-sized communities and a constant $\lambda \in (0, 1)$. In this setting the procedure is guaranteed to succeed with probability at least $1 - n^{-1}$ provided that, after some simplification,
$$\lambda^2 \alpha_n^2 n > c \sigma_n^2 \log n \qquad \text{and} \qquad \sigma_n^2 > (\log n)^2/n, \quad (3.4)$$
for some constant $c$, where $\sigma_n^2$ is an upper bound on the maximal variance of the edges. Therefore, condition (3.4) implies that $\alpha_n > \sqrt{c}\, \lambda^{-1} (\log n)^{1.5}/n$, which is stronger than the condition in our Corollary 3.2.
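The eigenvalue $0.19$ used in the planted clique discussion can be verified directly (a one-off NumPy check of our own):

```python
import numpy as np

# Connectivity matrix (2.5) of the planted clique problem.
B = np.array([[1.0, 0.5],
              [0.5, 0.5]])
# Its eigenvalues are (3 +- sqrt(5)) / 4; the smaller one is about 0.191.
print(np.linalg.eigvalsh(B))
```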
4. Degree corrected stochastic block models.
The degree corrected block model [DCBM, Karrer and Newman (2011)] extends the standard SBM by introducing node specific parameters to allow for varying degrees even within the same community. A DCBM is parameterized by a triplet $(\Theta, B, \psi)$, where, in addition to the membership matrix $\Theta$ and connectivity matrix $B$, the vector $\psi \in \mathbb{R}^n$ is included to model additional variability of the edge probabilities at the node level. Given $(\Theta, B, \psi)$, the edge probability between nodes $i$ and $j$ is $\psi_i \psi_j B_{g_i g_j}$ (recall that $g_i$ is the community label of node $i$). Similar to the SBM, the DCBM also assumes independent edge formation given $(\Theta, B, \psi)$. The inclusion of $\psi$ raises an issue of identifiability, so we assume that $\max_{i \in G_k} \psi_i = 1$ for all $k = 1, \ldots, K$. The SBM can be viewed as a special case of the DCBM with $\psi_i = 1$ for all $i$. The DCBM greatly enhances the flexibility of modeling degree heterogeneity and is able to fit network data with arbitrary degree distributions. Successful applications and theoretical developments can be found in Zhao, Levina and Zhu (2012) for likelihood methods, and in Chaudhuri, Chung and Tsiatas (2012), Jin (2012) for spectral methods.

Additional notation for the degree heterogeneity. Let $\phi_k$ be the $n \times 1$ vector that agrees with $\psi$ on $G_k$ and is zero otherwise. Define $\tilde\phi_k = \phi_k / \|\phi_k\|$ and $\tilde\psi = \sum_{k=1}^K \tilde\phi_k$. Let $\tilde\Theta$ be a normalized membership matrix such that $\tilde\Theta(i, k) = \tilde\psi_i$ if $i \in G_k$ and $\tilde\Theta(i, k) = 0$ otherwise. We also define the effective community size $\tilde n_k := \|\phi_k\|^2$. Let $\tilde n_{\min} = \min_k \tilde n_k$ and $\tilde n_{\max} = \max_k \tilde n_k$.

The spectral clustering heuristic can be extended to DCBMs by considering the eigen-decomposition $P = U D U^T$, where $P = \mathrm{diag}(\psi) \Theta B \Theta^T \mathrm{diag}(\psi)$. Now the matrix $U$ may have more than $K$ distinct rows due to the effect of $\psi$. However, the rows of $U$ point to at most $K$ distinct directions [Jin (2012)]. The following lemma is the analogue of Lemma 2.1 for DCBMs.

Lemma 4.1 (Spectral structure of mean matrix in DCBM).
Let
$U D U^T$ be the eigen-decomposition of $P = \mathrm{diag}(\psi) \Theta B \Theta^T \mathrm{diag}(\psi)$ in a DCBM parameterized by $(\Theta, B, \psi)$. Then there exists a $K \times K$ orthogonal matrix $H$ such that $U_{i*} = \tilde\psi_i H_{k*}$ for all $1 \le k \le K$, $i \in G_k$.

Proof.
First, realize that $\mathrm{diag}(\psi)\Theta = \tilde\Theta \Psi$, where $\Psi = \mathrm{diag}(\|\phi_1\|, \ldots, \|\phi_K\|)$. Then
$$P = \mathrm{diag}(\psi) \Theta B \Theta^T \mathrm{diag}(\psi) = \tilde\Theta \Psi B \Psi \tilde\Theta^T = \tilde\Theta H D (\tilde\Theta H)^T, \quad (4.1)$$
where $\Psi B \Psi = H D H^T$ is the eigen-decomposition of $\Psi B \Psi$. Note that $\tilde\Theta^T \tilde\Theta = I_K$, so $\tilde\Theta H D (\tilde\Theta H)^T$ is an eigen-decomposition of $P$. □

As a result, finding the true community partition corresponds to clustering the directions of the row vectors in $U$, where some form of normalization must be employed in order to filter out the nuisance parameter $\psi$. In particular, we consider spherical clustering, which looks for a cluster structure among the rows of a normalized matrix $U'$ with $U'_{i*} = U_{i*}/\|U_{i*}\|$.

In addition to the overall sparsity, the difficulty of community recovery in a DCBM is also affected by small entries of $\psi$. Intuitively, if $\psi_i \approx$
0, then it is hard to identify the community membership of node $i$, because few edges are observed for this node. However, the interaction between small entries of $\psi$ and the overall network sparsity (the maximum/average degree) has not been well understood. In the analysis of profile likelihood methods, Zhao,
Algorithm 2: Spherical $k$-median spectral clustering

Input:
Adjacency matrix $A$; number of communities $K$; approximation parameter $\varepsilon$.

Output:
Membership matrix $\widehat\Theta \in \mathbb{M}_{n,K}$.
1. Calculate $\widehat U \in \mathbb{R}^{n \times K}$, consisting of the leading $K$ eigenvectors (ordered by absolute eigenvalue) of $A$.
2. Let $I_+ = \{i : \|\widehat U_{i*}\| > 0\}$ and $\widehat U_+ = \widehat U_{I_+ *}$.
3. Let $\widehat U'$ be the row-normalized version of $\widehat U_+$.
4. Let $(\widehat\Theta_+, \widehat X)$ be a $(1+\varepsilon)$-approximate solution to the $k$-median problem with $K$ clusters and input matrix $\widehat U'$.
5. Output $\widehat\Theta$, with $\widehat\Theta_{i*}$ being the corresponding row in $\widehat\Theta_+$ if $i \in I_+$, and $\widehat\Theta_{i*} = (1, 0, \ldots, 0)$ if $i \notin I_+$.

Levina and Zhu (2012) assume that the entries of $\psi$ are fixed constants. In spectral clustering, Jin (2012) allows milder conditions on $\psi$ but needs the average degree to be polynomial in $n$.

Our analysis uses the following quantity as a summarizing measure of node heterogeneity in each community $G_k$:
$$\nu_k := n_k^{-1} \sum_{i \in G_k} \psi_i^{-2}, \qquad k = 1, \ldots, K.$$
By definition $\nu_k \in [1, \infty)$, and a larger $\nu_k$ indicates stronger heterogeneity in the $k$th community. On the other hand, $\nu_k = 1$ indicates within-community homogeneity ($\psi_i = 1$ for all $i \in G_k$).

The argument developed for SBMs in the previous sections can be extended to cover very general degree corrected models. In particular, let $\widehat U \in \mathbb{R}^{n \times K}$ consist of the $K$ leading eigenvectors of $A$. We consider the following spherical $k$-median spectral clustering:
$$\text{minimize}_{\Theta \in \mathbb{M}_{n,K},\, X \in \mathbb{R}^{K \times K}} \|\Theta X - \widehat U'\|_{2,1}, \quad (4.2)$$
where $\widehat U'$ is the row-normalized version of $\widehat U$ and $\|M\|_{2,1} = \sum_i \|M_{i*}\|$ is the matrix $(2,1)$ norm. We consider a $(1+\varepsilon)$-approximate solution $(\widehat\Theta, \widehat X)$ to this $k$-median problem, which can be found in polynomial time for any fixed $\varepsilon > 0$; the resulting procedure takes as input $\widehat U$ and is described in detail in Algorithm 2.

4.1. Analysis of spherical $k$-median spectral clustering for DCBM. We have the following main theorem for spherical $k$-median spectral clustering in DCBMs. It is proved in Appendix A.3.

Theorem 4.2 (Main result for DCBM).
Consider a DCBM $(\Theta, B, \psi)$ with $K$ communities, where $P = \mathrm{diag}(\psi) \Theta B \Theta^T \mathrm{diag}(\psi)$ has rank $K$, smallest nonzero absolute eigenvalue at least $\gamma_n$, and maximum entry bounded from above by $\alpha_n$, for some $\alpha_n \ge \log n/n$. There exists an absolute constant $c > 0$ such that if
$$(2.5 + \varepsilon)\, \frac{\sqrt{K n \alpha_n}}{\gamma_n} < \frac{c\, n_{\min}}{\sqrt{\sum_{k=1}^K n_k \nu_k}}, \quad (4.3)$$
then, with probability at least $1 - n^{-1}$,
$$L(\widehat\Theta, \Theta) \le c^{-1} (2.5 + \varepsilon)\, \sqrt{\sum_{k=1}^K n_k \nu_k}\; \frac{\sqrt{K n \alpha_n}}{\gamma_n \sqrt{n}}. \quad (4.4)$$

Remark.
The constant $c$ equals $1/(8C)$, where $C$ is the universal constant in Theorem 5.2. The condition on $\alpha_n$ and the probability guarantee can also be changed to $\alpha_n \ge c_0 \log n/n$ and $1 - n^{-r}$, respectively, with a different constant $c = c(c_0, r)$ in equations (4.3) and (4.4).

Theorem 4.2 immediately implies a counterpart of Corollary 3.2 under more explicit scaling of the model parameters.

Corollary 4.3.
Let $A$ be an adjacency matrix from the DCBM $(\Theta, B, \psi)$, such that $B = \alpha_n B_0$ for some $\alpha_n \ge \log n/n$, where $B_0$ has minimum absolute eigenvalue $\lambda > 0$ and $\max_{k\ell} B_0(k, \ell) = 1$. Let $(\widehat\Theta, \widehat X)$ be a $(1+\varepsilon)$-approximate solution to the spherical $k$-median algorithm (Algorithm 2). There exists an absolute constant $c$ such that if
$$(2.5 + \varepsilon)\, \frac{\sqrt{K n}}{\tilde n_{\min} \lambda \sqrt{\alpha_n}} < \frac{c\, n_{\min}}{\sqrt{\sum_{k=1}^K n_k \nu_k}},$$
then, with probability at least $1 - n^{-1}$,
$$L(\widehat\Theta, \Theta) \le c^{-1} (2.5 + \varepsilon)\, \frac{\sqrt{K n}}{\tilde n_{\min} \lambda \sqrt{n \alpha_n}}\, \sqrt{\sum_{k=1}^K n_k \nu_k}.$$

Comparing with Theorem 3.1 and Corollary 3.2, the results for the DCBM differ in two major aspects. First, the DCBM condition (4.3) involves the additional factor $n_{\min}/\sqrt{\sum_{k=1}^K n_k \nu_k}$ on the right-hand side, which makes (4.3) more stringent than (3.1). The upper bound on $L(\widehat\Theta, \Theta)$ differs in the same manner. Furthermore, the argument used to prove Theorem 4.2 is not likely to provide a sharp upper bound on $\widetilde L(\widehat\Theta, \Theta)$. We believe this has to do with the additional normalization step used in the spherical $k$-median algorithm, as well as the specific strategy used in our proof.

To better understand this result, consider Example 2.2 with balanced community sizes: $n_{\max}/n_{\min} = O(1)$. To work with a DCBM, assume in addition that the node degree vector $\psi$ has comparable degree heterogeneity across communities: $c_1 \nu \le \nu_k \le c_2 \nu$ for constants $c_1, c_2$. Then Corollary 4.3 implies an overall relative error rate
$$L(\widehat\Theta, \Theta) = O_P\!\left(\frac{\sqrt{\nu}\, n}{\tilde n_{\min} \lambda \sqrt{n \alpha_n}}\right). \quad (4.5)$$
Several observations are worth mentioning. First, the error rate depends on $\nu$, the degree heterogeneity measure, in a simple manner. Second, the community size $n_{\min}$ that appears in Corollary 3.2 is replaced by $\tilde n_{\min} = \min_k \|\phi_k\|^2$, the minimum effective sample size.
Roughly speaking, ñ_min ≍ n_min as long as a constant fraction of the nodes have their ψ_i bounded away from zero (while the remaining ψ_i should not be too small, in order to keep ν small). Third, if there is no degree heterogeneity (ν_k ≡ n_k and ñ_min = n_min), then the rate in (4.5) is the square root of that given by Corollary 3.2. This is due to the additional normalization step involved in spherical k-median (which is unnecessary when ν = 1) and to the different argument used to analyze the spherical k-median algorithm. Moreover, the relative error can still be o_P(1) even when α_n is as small as log n/n, provided that 1/ν, ñ_min/n, and λ₀ stay bounded away from zero or approach zero sufficiently slowly.

Comparisons with existing work.
There are relatively few results for community recovery in degree corrected block models that allow the maximum node degree to be of order o(n). Chaudhuri, Chung and Tsiatas (2012) extended the method of McSherry (2001) to degree corrected block models. In the setting of Example 2.2 with equal community sizes, their main result (Theorems 2 and 3 in their paper) requires α_n to be at least of order 1/√n. A similar requirement of polynomial growth of the expected average degree is implicitly imposed in Jin (2012), who first studied the performance of normalized k-means spectral clustering in degree corrected block models.
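The adjacency spectral clustering procedure analyzed in this paper (compute the K leading eigenvectors of A, then cluster their rows) can be illustrated with a short simulation. The sketch below is not the authors' code: the model parameters are arbitrary illustrative choices, and a plain Lloyd k-means with deterministic farthest-first initialization stands in for a (1 + ε)-approximate k-means solver.

```python
import numpy as np
from itertools import permutations

rng = np.random.default_rng(0)

# --- simulate a small SBM (illustrative parameters, not from the paper) ---
n, K = 300, 3
z = np.repeat(np.arange(K), n // K)       # ground-truth communities
Theta = np.eye(K)[z]                      # n x K membership matrix
B = 0.05 + 0.25 * np.eye(K)               # within-block 0.30, between-block 0.05
P = Theta @ B @ Theta.T
upper = np.triu(rng.random((n, n)) < P, 1)
A = (upper + upper.T).astype(float)       # symmetric adjacency, zero diagonal

# --- K leading eigenvectors of A (largest eigenvalues in absolute value) ---
vals, vecs = np.linalg.eigh(A)
U_hat = vecs[:, np.argsort(np.abs(vals))[-K:]]

# --- Lloyd k-means with farthest-first initialization; a crude stand-in
#     for the (1+eps)-approximate k-means solver assumed in the paper ---
def kmeans(rows, K, n_iter=50):
    centers = rows[[0]]
    for _ in range(K - 1):
        d = np.min(np.linalg.norm(rows[:, None, :] - centers[None], axis=2), axis=1)
        centers = np.vstack([centers, rows[np.argmax(d)]])
    for _ in range(n_iter):
        labels = np.argmin(np.linalg.norm(rows[:, None, :] - centers[None], axis=2), axis=1)
        centers = np.vstack([rows[labels == k].mean(0) for k in range(K)])
    return labels

labels = kmeans(U_hat, K)

# fraction of correctly clustered nodes, maximized over label permutations
acc = max(np.mean(np.array(perm)[labels] == z) for perm in permutations(range(K)))
```

On such an easy, well-separated instance the procedure typically recovers almost all community labels; the point is only to make the pipeline of Sections 3 and 5 concrete.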
5. Proof of the main results.
In this section, we present a general scheme for proving error bounds for spectral clustering. It contains the SBM as a special case and can be easily extended to the degree corrected block model. Our argument consists of three parts: (1) control the perturbation of principal subspaces for general symmetric matrices; (2) bound the spectrum of random binary matrices; and (3) bound the error of k-means and spherical k-median clustering.

5.1. Principal subspace perturbation.
The first ingredient of our proof isto bound the difference between the eigenvectors of A and those of P , where A can be viewed as a noisy version of P . Lemma 5.1 (Principal subspace perturbation).
Assume that P ∈ R^{n×n} is a rank K symmetric matrix with smallest nonzero singular value γ_n. Let A be any symmetric matrix and Û, U ∈ R^{n×K} be the K leading eigenvectors of A and P, respectively. Then there exists a K×K orthogonal matrix Q such that

\[
\|\hat U - UQ\|_F \le \frac{2\sqrt{2K}}{\gamma_n}\,\|A - P\|.
\]

Lemma 5.1 is proved in Appendix A.1, based on an application of the Davis–Kahan sin Θ theorem [Theorem VII.3.1 of Bhatia (1997)]. The presence of the K×K orthogonal matrix Q in the statement of Lemma 5.1 takes care of the situation where some leading eigenvalues have multiplicity larger than one. In this case, the eigenvectors are determined only up to rotation.

5.2. Spectral bound of binary symmetric random matrices.
The next the-orem provides a sharp probabilistic upper bound on k A − P k when A is arandom adjacency matrix with E ( a ij ) = p ij . Theorem 5.2 (Spectral bound of binary symmetric random matrices).
Let A be the adjacency matrix of a random graph on n nodes in which edges occur independently. Set E[A] = P = (p_ij)_{i,j=1,...,n} and assume that n max_{ij} p_ij ≤ d for d ≥ c log n and c > 0. Then, for any r > 0 there exists a constant C = C(r, c) such that

\[
\|A - P\| \le C\sqrt d
\]

with probability at least 1 − n^{−r}.

This result does not follow from conventional matrix concentration inequalities such as the matrix Bernstein inequality (which would only give √(d log n)). Lu and Peng (2012) use a path counting technique from random matrix theory to prove a bound of the same order, but require a maximal degree d ≥ c(log n)⁴. The proof of Theorem 5.2 is technically involved, as it uses combinatorial arguments in order to derive spectral bounds for sparse random matrices. Our proof is based on techniques developed by Feige and Ofek (2005) for bounding the second largest eigenvalue of an Erdős–Rényi random graph with edge probability d/n. The full proof is provided in Lei and Rinaldo (2014). Here we give a brief outline of the three major steps.
Step 1: Discretization. We first reduce the problem of controlling ‖A − P‖ to that of bounding the supremum of |xᵀ(A − P)y| over all pairs of vectors x, y in a finite set of grid points. For any given pair (x, y) in the grid, the quantity xᵀ(A − P)y is decomposed into the sum of two parts. The first part corresponds to the small entries of both x and y, called the light pairs; the other part corresponds to the larger entries of x or y, the heavy pairs.

Step 2: Bounding the light pairs. The next step is to use Bernstein's inequality and a union bound to control the contribution of the light pairs, uniformly over the points in the grid.
Step 3: Bounding the heavy pairs. In the final step, the contribution of the heavy pairs, which cannot simply be bounded by the conventional Bernstein inequality, is bounded using a combinatorial argument on the event that the edge counts in a collection of subgraphs do not deviate much from their expectations. A sharp large deviation bound for sums of independent Bernoulli random variables [Corollary A.1.10 of Alon and Spencer (2004)] is used to achieve a better rate than standard Bernstein's inequality.
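The content of Theorem 5.2 can also be probed numerically. In the sketch below (illustrative only, with our own choice of d = 2 log n), the ratio ‖A − P‖/√d stays bounded by a modest constant as n grows, whereas a matrix-Bernstein-type bound would only control ‖A − P‖ at the larger scale √(d log n).

```python
import numpy as np

rng = np.random.default_rng(1)

def spectral_dev_ratio(n, d, rng):
    """Sample an Erdos-Renyi graph G(n, d/n) and return ||A - P|| / sqrt(d)."""
    p = d / n
    P = np.full((n, n), p)
    upper = np.triu(rng.random((n, n)) < p, 1)
    A = (upper + upper.T).astype(float)
    return np.linalg.norm(A - P, 2) / np.sqrt(d)   # spectral-norm deviation

# with d of order log n, the ratio stays bounded by a modest constant
ratios = [spectral_dev_ratio(n, 2 * np.log(n), rng) for n in (200, 400, 800)]
```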
5.3. Error bound of k-means/k-median on perturbed eigenvectors.

Spectral clustering (or spherical spectral clustering) applies a clustering algorithm to a matrix consisting of the eigenvectors of A, which is close (in view of Lemma 5.1 and Theorem 5.2) to a matrix whose rows can be perfectly clustered. We would like to bound the clustering error in terms of the closeness between the actual input matrix Û and the ideal input matrix U. The next lemma generalizes an argument used in Jin (2012) and provides an error bound for any (1 + ε)-approximate k-means solution.

Lemma 5.3 (Approximate k-means error bound). For ε > 0 and any two matrices Û, U ∈ R^{n×K} such that U = ΘX with Θ ∈ M_{n,K}, X ∈ R^{K×K}, let (Θ̂, X̂) be a (1 + ε)-approximate solution to the k-means problem in equation (2.2) and Ū = Θ̂X̂. For any δ_k ≤ min_{ℓ≠k} ‖X_{ℓ*} − X_{k*}‖, define S_k = {i ∈ G_k(Θ): ‖Ū_{i*} − U_{i*}‖ ≥ δ_k/2}. Then

\[
\sum_{k=1}^K |S_k|\,\delta_k^2 \le 4(4+2\varepsilon)\,\|\hat U - U\|_F^2. \tag{5.1}
\]

Moreover, if

\[
(16+8\varepsilon)\,\|\hat U - U\|_F^2/\delta_k^2 < n_k \quad\text{for all } k, \tag{5.2}
\]

then there exists a K×K permutation matrix J such that Θ̂_{G*} = Θ_{G*}J, where G = ∪_{k=1}^K (G_k \ S_k).

Lemma 5.3 provides a performance guarantee for approximate k-means clustering under a deterministic Frobenius norm condition on the input matrix. As suggested by a referee, the proof of Lemma 5.3 shares some similarities with the proof of Theorem 3.1 in Awasthi and Sheffet (2012) [see also Kumar and Kannan (2010)], though our assumptions are slightly different. For completeness we provide a short and self-contained proof of Lemma 5.3 in Appendix A.2, giving explicit constant factors in the result.

5.4. Proof of main results for SBM.
We first prove Theorem 3.1.
Proof of Theorem 3.1.
Combining Lemma 5.1 and Theorem 5.2, we obtain that, for some K×K orthogonal matrix Q,

\[
\|\hat U - UQ\|_F \le \frac{2\sqrt{2K}}{\gamma_n}\,\|A - P\| \le \frac{2\sqrt{2K}}{\gamma_n}\,C\sqrt{n\alpha_n}, \tag{5.3}
\]

with probability at least 1 − n^{−1}, where C is the absolute constant involved in Theorem 5.2. (Notice that the term d in Theorem 5.2 becomes nα_n in the current setting.)

The main strategy for the rest of the proof is to apply Lemma 5.3 to Û and UQ. To that end, Lemma 2.1 implies that
UQ = ΘXQ = ΘX′, where

\[
\|X'_{k*} - X'_{\ell*}\| = \sqrt{n_k^{-1} + n_\ell^{-1}}.
\]

As a result, we can choose

\[
\delta_k = \sqrt{\frac{1}{n_k} + \frac{1}{\max\{n_\ell: \ell\neq k\}}}
\]

in Lemma 5.3, and hence n_kδ_k² ≥ 1. Using (5.3), a sufficient condition for (5.2) to hold is

\[
(16+8\varepsilon)\,8C^2\,\frac{Kn\alpha_n}{\gamma_n^2} \le 1 \le \min_{1\le k\le K} n_k\delta_k^2, \tag{5.4}
\]

so that (3.1) indeed implies (5.2) by setting c₀ = 1/(64C²). In detail, this choice of δ_k together with (5.1) yields

\[
\sum_{k=1}^K |S_k|\left(\frac{1}{n_k} + \frac{1}{\max\{n_\ell: \ell\neq k\}}\right) = \sum_{k=1}^K |S_k|\,\delta_k^2 \le (16+8\varepsilon)\,\|\hat U - UQ\|_F^2,
\]

which, combined with (5.3), gives (3.2):

\[
\sum_{k=1}^K \frac{|S_k|}{n_k} \le (16+8\varepsilon)\,8C^2\,\frac{Kn\alpha_n}{\gamma_n^2} = c_0^{-1}(2+\varepsilon)\,\frac{Kn\alpha_n}{\gamma_n^2}.
\]

Since Lemma 5.3 ensures that the memberships are correctly recovered outside of ∪_{1≤k≤K} S_k, the claim follows. □

Proof of Corollary 3.2.
It is easy to see, for example from (2.1), that in this specific stochastic block model setting, γ_n ≥ n_min α_n λ₀. The proof of Theorem 3.1 then applies with γ_n replaced by n_min α_n λ₀ and gives

\[
\sum_{k=1}^K |S_k|\left(\frac{1}{n_k} + \frac{1}{\max\{n_\ell: \ell\neq k\}}\right) \le c_0^{-1}(2+\varepsilon)\,\frac{Kn}{n_{\min}^2\lambda_0^2\alpha_n},
\]

which implies that

\[
\tilde L(\hat\Theta,\Theta) \le \max_{1\le k\le K}\frac{|S_k|}{n_k} \le \sum_{1\le k\le K}\frac{|S_k|}{n_k} \le c_0^{-1}(2+\varepsilon)\,\frac{Kn}{n_{\min}^2\lambda_0^2\alpha_n},
\]

and, recalling that n′_max is the second largest community size,

\[
L(\hat\Theta,\Theta) \le \frac{1}{n}\sum_{k=1}^K |S_k| \le c_0^{-1}(2+\varepsilon)\,\frac{Kn'_{\max}}{n_{\min}^2\lambda_0^2\alpha_n}. \qquad\square
\]
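The subspace perturbation inequality used in these proofs can also be checked numerically. The following sketch simulates a small SBM (the parameters are our own illustrative choices) and verifies a Davis–Kahan-type bound of the form ‖Û − UQ‖_F ≤ (2√(2K)/γ_n)‖A − P‖, with the aligning orthogonal matrix Q computed via orthogonal Procrustes.

```python
import numpy as np

rng = np.random.default_rng(2)

# --- a small SBM (illustrative parameters) ---
n, K = 300, 3
Theta = np.eye(K)[np.repeat(np.arange(K), n // K)]
B = 0.05 + 0.25 * np.eye(K)
P = Theta @ B @ Theta.T
upper = np.triu(rng.random((n, n)) < P, 1)
A = (upper + upper.T).astype(float)

def top_eigvecs(M, K):
    """K leading eigenvectors, by absolute eigenvalue."""
    vals, vecs = np.linalg.eigh(M)
    return vecs[:, np.argsort(np.abs(vals))[-K:]]

U_hat, U = top_eigvecs(A, K), top_eigvecs(P, K)
gamma = np.sort(np.abs(np.linalg.eigvalsh(P)))[-K]   # smallest nonzero |eigenvalue|

# best orthogonal alignment of U with U_hat (orthogonal Procrustes)
W, _, Vt = np.linalg.svd(U.T @ U_hat)
Q = W @ Vt

lhs = np.linalg.norm(U_hat - U @ Q)                  # Frobenius norm (default)
rhs = 2 * np.sqrt(2 * K) / gamma * np.linalg.norm(A - P, 2)
```

Since the optimal Procrustes rotation can only do better than the Q whose existence the lemma asserts, `lhs <= rhs` must hold on every draw.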
6. Concluding remarks.
The analysis in this paper applies directly to the eigenvectors of the adjacency matrix, combining tools from subspace perturbation and spectral bounds for binary random graphs. In the literature, spectral clustering using the graph Laplacian or its variants is very popular and can sometimes lead to better empirical performance [von Luxburg (2007), Rohe, Chatterjee and Yu (2011), Sarkar and Bickel (2013)]. An important direction for future work is to extend some of the results and techniques in this paper to spectral clustering using the graph Laplacian. The graph Laplacian normalizes the adjacency matrix by the node degrees, which can introduce extra noise if the network is sparse and many node degrees are small. In several recent works, Chaudhuri, Chung and Tsiatas (2012) and Qin and Rohe (2013) studied graph Laplacian based spectral clustering with regularization, where a small constant is added to all node degrees prior to normalization. Further understanding of the resulting bias–variance trade-off would be both important and interesting.

For degree corrected block models, regularization methods may also lead to error bounds with better dependence on the small entries of ψ. The intuition is that ν_k can be very large even when only a single ψ_i is close to zero. In this case, one should be able to simply discard such nodes and work with those whose degrees are large enough. Finding the right regularization to diminish the effect of small-degree nodes and analyzing the resulting algorithm will be pursued in future work.

This paper aims at understanding the performance of spectral clustering in stochastic block models. While our main focus is the performance of spectral clustering as the network sparsity changes, the resulting error bounds explicitly keep track of five independent model parameters (K, α_n, λ₀, n_min, n_max).
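For concreteness, the degree regularization mentioned above has a simple form: add a constant τ to every node degree before symmetric normalization. This is a hedged sketch, not a prescription from this paper; the default choice of τ equal to the average degree is one common heuristic from the regularization literature.

```python
import numpy as np

def regularized_laplacian(A, tau=None):
    """Symmetrically normalized adjacency with regularized degrees:
    L_tau = D_tau^{-1/2} A D_tau^{-1/2}, where D_tau = D + tau * I."""
    deg = A.sum(axis=1)
    if tau is None:
        tau = deg.mean()              # heuristic default, not from this paper
    d_inv_sqrt = 1.0 / np.sqrt(deg + tau)
    return (A * d_inv_sqrt[:, None]) * d_inv_sqrt[None, :]

# tiny example: a single edge with tau = 1 gives off-diagonal entries 1/2
A_demo = np.array([[0.0, 1.0], [1.0, 0.0]])
L_demo = regularized_laplacian(A_demo, tau=1.0)
```

Spectral clustering then proceeds on the K leading eigenvectors of L_tau instead of those of A.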
Existing results usually develop error bounds depending on a subset of these parameters, keeping the others constant [see, e.g., Bickel and Chen (2009), Chen, Sanghavi and Xu (2012), Zhao, Levina and Zhu (2012)]. In the planted clique model, our result implies that spectral clustering can find the hidden clique when its size is at least c√n for a large enough constant c. Our result also provides good insight into the impact of the number of clusters and the separation between communities. For instance, in Example 2.2, let α_n ≡ 1 and n_max = n_min = n/K. Then Corollary 3.2 implies that spectral clustering is consistent if K³/(nλ₀²) → 0. More generally, the guarantees of Corollary 3.2 compare favorably against most existing results as summarized in Chen, Sanghavi and Xu (2012), in terms of allowable cluster size, density gap and overall sparsity. It would be interesting to develop a unified theoretical framework (e.g., minimax theory) in which all methods and model parameters can be studied and compared together.

APPENDIX: TECHNICAL PROOFS

For any two matrices A and B of the same dimension, we use the notation ⟨A, B⟩ = trace(AᵀB) for the standard matrix inner product.

A.1. Proof of Lemma 5.1.
By Proposition 2.2 of Vu and Lei (2013), there exists a K×K orthogonal matrix Q such that

\[
\frac{1}{\sqrt 2}\,\|\hat U - UQ\|_F \le \|(I - \hat U\hat U^T)UU^T\|_F \le \sqrt K\,\|(I - \hat U\hat U^T)UU^T\|.
\]

Next, we establish that

\[
\|(I - \hat U\hat U^T)UU^T\| \le \frac{2\|A - P\|}{\gamma_n}.
\]

If ‖A − P‖ ≤ γ_n/2, then the Davis–Kahan sin Θ theorem gives

\[
\|(I - \hat U\hat U^T)UU^T\| \le \frac{\|A - P\|}{\gamma_n - \|A - P\|} \le \frac{2\|A - P\|}{\gamma_n}.
\]

If ‖A − P‖ > γ_n/2, then

\[
\|(I - \hat U\hat U^T)UU^T\| \le 1 \le \frac{2\|A - P\|}{\gamma_n}. \qquad\square
\]

A.2. Proof of Lemma 5.3.
First, by the definition of Ū and the fact that U is feasible for the k-means problem (2.2), we have ‖Ū − Û‖²_F ≤ (1+ε)‖Û − U‖²_F, and hence

\[
\|\bar U - U\|_F^2 \le \bigl(\|\bar U - \hat U\|_F + \|\hat U - U\|_F\bigr)^2 \le (4+2\varepsilon)\,\|\hat U - U\|_F^2.
\]

Then

\[
\sum_{k=1}^K |S_k|\,\delta_k^2/4 \le \|\bar U - U\|_F^2 \le (4+2\varepsilon)\,\|\hat U - U\|_F^2, \tag{A.1}
\]

which establishes the first claim of the lemma. Under the assumption described in the second part of the lemma, equation (A.1) further implies that

\[
|S_k| \le (16+8\varepsilon)\,\|\hat U - U\|_F^2/\delta_k^2 < n_k \quad\text{for all } k.
\]
Therefore, T_k ≡ G_k \ S_k ≠ ∅ for each k. If i ∈ T_k and j ∈ T_ℓ with k ≠ ℓ, then Ū_{i*} ≠ Ū_{j*}, because otherwise

\[
\max(\delta_k, \delta_\ell) \le \|U_{i*} - U_{j*}\| \le \|U_{i*} - \bar U_{i*}\| + \|U_{j*} - \bar U_{j*}\| < \delta_k/2 + \delta_\ell/2,
\]

which is impossible. This further implies that Ū has exactly K distinct rows, because the number of distinct rows is no larger than K as part of the constraints of the optimization problem (2.2). On the other hand, if i and j are both in T_k for some k, then Ū_{i*} = Ū_{j*}, because otherwise there would be more than K distinct rows, since there are already at least K − 1 distinct rows corresponding to the sets T_ℓ with ℓ ≠ k.

As a result, Ū_{i*} = Ū_{j*} if i, j ∈ T_k for some k, and Ū_{i*} ≠ Ū_{j*} if i ∈ T_k, j ∈ T_ℓ with k ≠ ℓ. This gives a one-to-one correspondence between the clusters of the rows of Ū_{T*} and those of U_{T*}, where T = ∪_{k=1}^K T_k. □

A.3. Proofs for degree corrected block models.
The argument fits well within the general scheme developed in Section 5. Lemma 5.1 and Theorem 5.2 still apply, and

\[
\mathbb P\left[\|\hat U - UQ\|_F \le \frac{2\sqrt 2\,C\sqrt{Kn\alpha_n}}{\gamma_n}\ \text{for some } QQ^T = I_K\right] \ge 1 - n^{-1}, \tag{A.2}
\]

where C is the constant in Theorem 5.2. For simplicity of presentation, in the following argument we work with Q = I_K. The general case can be handled in the same manner with more complicated notation (simply substitute U by UQ).

To prove Theorem 4.2, we first give a bound on the number of zero rows in Û. Recall that I₊ = {i: Û_{i*} ≠ 0}. Define I = I₊ᶜ.

Lemma A.1 (Number of zero rows in Û). In a DCBM(Θ, B, ψ) satisfying the conditions of Theorem 4.2, let Û and U be the K leading eigenvectors of A and P, respectively. Then

\[
|I| \le \sqrt{\sum_{k=1}^K n_k\nu_k}\;\|\hat U - U\|_F.
\]

Proof.
Use the Cauchy–Schwarz inequality:

\[
\|\hat U - U\|_F^2 \ge \sum_{i=1}^n \mathbf 1(\hat U_{i*} = 0)\,\|U_{i*}\|^2 \ge \Bigl(\sum_{i=1}^n \mathbf 1(\hat U_{i*} = 0)\Bigr)^2\Bigl(\sum_{i=1}^n \|U_{i*}\|^{-2}\Bigr)^{-1} = \frac{|I|^2}{\sum_{k=1}^K n_k\nu_k}. \qquad\square
\]

We also need the following simple fact about the distance between normalized vectors.
Fact.
For two nonzero vectors v₁ and v₂ of the same dimension, we have

\[
\left\|\frac{v_1}{\|v_1\|} - \frac{v_2}{\|v_2\|}\right\| \le \frac{2\|v_1 - v_2\|}{\max(\|v_1\|, \|v_2\|)}.
\]

Proof.
Without loss of generality, assume ‖v₁‖ ≥ ‖v₂‖. Then

\[
\left\|\frac{v_1}{\|v_1\|} - \frac{v_2}{\|v_2\|}\right\| = \left\|\frac{v_1}{\|v_1\|} - \frac{v_2}{\|v_1\|} + \frac{v_2}{\|v_1\|} - \frac{v_2}{\|v_2\|}\right\| \le \frac{\|v_1 - v_2\|}{\|v_1\|} + \frac{\|v_2\|\,\bigl|\|v_1\| - \|v_2\|\bigr|}{\|v_1\|\,\|v_2\|} \le \frac{2\|v_1 - v_2\|}{\|v_1\|}. \qquad\square
\]

Proof of Theorem 4.2.
Recall that U′ is the row-normalized version of U. Let U′′ = U′_{I₊*} be the sub-matrix of U′ corresponding to the nonzero rows of Û. Then, by the Fact above and the Cauchy–Schwarz inequality,

\[
\|\hat U' - U''\|_{2,1} \le \sum_{i=1}^n \frac{2\|\hat U_{i*} - U_{i*}\|}{\|U_{i*}\|} \le 2\sqrt{\sum_{i=1}^n \|\hat U_{i*} - U_{i*}\|^2}\;\sqrt{\sum_{i=1}^n \|U_{i*}\|^{-2}} \le 2\|\hat U - U\|_F\sqrt{\sum_{k=1}^K n_k\nu_k}.
\]

Now we can bound the (2,1) distance between an approximate solution of the k-median problem (4.2) and the targeted solution U′′:

\[
\|\hat\Theta_{I_+*}\hat X - U''\|_{2,1} \le \|\hat\Theta_{I_+*}\hat X - \hat U'\|_{2,1} + \|\hat U' - U''\|_{2,1} \le (2+\varepsilon)\,\|\hat U' - U''\|_{2,1}.
\]

Let S = {i ∈ I₊: ‖Θ̂_{i*}X̂ − U′_{i*}‖ ≥ 1/√2}. The size of S can be bounded using a similar argument as in the proof of Lemma A.1:

\[
\frac{|S|}{\sqrt 2} \le \|\hat\Theta_{I_+*}\hat X - U''\|_{2,1} \le (2+\varepsilon)\,\|\hat U' - U''\|_{2,1} \le 2(2+\varepsilon)\,\|\hat U - U\|_F\sqrt{\sum_{k=1}^K n_k\nu_k},
\]

which implies

\[
|S| \le 2\sqrt 2\,(2+\varepsilon)\sqrt{\sum_{k=1}^K n_k\nu_k}\;\|\hat U - U\|_F. \tag{A.3}
\]

On the event in (A.2) (recall that we assume Q = I), (A.3) and Lemma A.1 imply

\[
|S| + |I| \le (2.5+\varepsilon)\,8C\,\frac{\sqrt{Kn\alpha_n}}{\gamma_n}\,\sqrt{\sum_{k=1}^K n_k\nu_k}. \tag{A.4}
\]
Combining this with condition (4.3) implies |S| + |I| < n_k for all k, and hence G_k ∩ (I₊ \ S) ≠ ∅ for each k. Therefore, for any two rows indexed by G := I₊ \ S, if they are in different clusters under Θ then they must be in different clusters under Θ̂: otherwise,

\[
\|U'_{i*} - U'_{j*}\| \le \|U'_{i*} - \hat\Theta_{i*}\hat X\| + \|\hat\Theta_{j*}\hat X - U'_{j*}\| < \sqrt 2,
\]

contradicting the fact that rows of U′ corresponding to different communities are orthogonal unit vectors and hence at distance √2. As a result, the mis-clustered nodes are contained in I ∪ S, and their number is bounded by the right-hand side of (A.4). The claimed result follows by choosing c₁ = 1/(8C). □
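The spherical k-median step of Algorithm 2 can be sketched as follows. This is an illustrative implementation, not the authors' code: it drops the zero rows, normalizes the remaining rows, and replaces an exact (1 + ε)-approximate k-median solver with farthest-first initialization plus Weiszfeld-style geometric-median updates (the tolerances and iteration counts are our own choices).

```python
import numpy as np

def spherical_k_median(U_hat, K, n_iter=30):
    """Sketch of spherical k-median: drop zero rows, normalize the rest,
    then alternate nearest-center assignment in Euclidean distance with
    approximate geometric-median updates (a few Weiszfeld iterations)."""
    norms = np.linalg.norm(U_hat, axis=1)
    keep = norms > 1e-12                       # I_+: indices of nonzero rows
    V = U_hat[keep] / norms[keep, None]        # row-normalized eigenvectors
    # deterministic farthest-first initialization
    centers = V[[0]]
    for _ in range(K - 1):
        d = np.min(np.linalg.norm(V[:, None, :] - centers[None], axis=2), axis=1)
        centers = np.vstack([centers, V[np.argmax(d)]])
    for _ in range(n_iter):
        d = np.linalg.norm(V[:, None, :] - centers[None], axis=2)
        labels = np.argmin(d, axis=1)
        for k in range(K):
            pts = V[labels == k]
            if len(pts) == 0:
                continue
            c = pts.mean(0)                    # start Weiszfeld from the mean
            for _ in range(10):
                w = 1.0 / np.maximum(np.linalg.norm(pts - c, axis=1), 1e-9)
                c = (pts * w[:, None]).sum(0) / w.sum()
            centers[k] = c
    return keep, labels, centers

# tiny demo: two well-separated directions plus one zero row (hypothetical input)
U_demo = np.zeros((9, 2))
U_demo[:4, 0] = [1.0, 2.0, 0.5, 3.0]
U_demo[4:8, 1] = [1.0, 1.5, 0.7, 2.0]
keep, labels, centers = spherical_k_median(U_demo, K=2)
```

In the notation of the proof, `keep` plays the role of I₊; the remaining rows would be assigned arbitrarily, mirroring the treatment of the set I in the error bound.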
The authors thank an anonymous reviewer for helpfulsuggestions that led in particular to a significant simplification of the proofof Lemma 5.1. SUPPLEMENTARY MATERIAL
Supplement to “Consistency of spectral clustering in sparse stochastic block models” (DOI: 10.1214/14-AOS1274SUPP; .pdf). The supplementary file contains a proof of Theorem 5.2.

REFERENCES
Aloise, D., Deshpande, A., Hansen, P. and Popat, P. (2009). NP-hardness of Euclidean sum-of-squares clustering. Machine Learning.
Alon, N. and Spencer, J. H. (2004). The Probabilistic Method, 2nd ed. Wiley, Hoboken.
Amini, A. A., Chen, A., Bickel, P. J. and Levina, E. (2012). Pseudo-likelihood methods for community detection in large sparse networks. Preprint. Available at arXiv:1207.2340.
Anandkumar, A., Ge, R., Hsu, D. and Kakade, S. M. (2013). A tensor spectral approach to learning mixed membership community models. Preprint. Available at arXiv:1302.2684.
Awasthi, P. and Sheffet, O. (2012). Improved spectral-norm bounds for clustering. In Approximation, Randomization, and Combinatorial Optimization. Lecture Notes in Computer Science.
Balakrishnan, S., Xu, M., Krishnamurthy, A. and Singh, A. (2011). Noise thresholds for spectral clustering. In Advances in Neural Information Processing Systems 24 (J. Shawe-Taylor, R. S. Zemel, P. L. Bartlett, F. Pereira and K. Q. Weinberger, eds.) 954–962. Curran Associates, Red Hook, NY.
Bhatia, R. (1997). Matrix Analysis. Graduate Texts in Mathematics. Springer, New York. MR1477662
Bickel, P. J. and Chen, A. (2009). A nonparametric view of network models and Newman–Girvan and other modularities. Proc. Natl. Acad. Sci. USA.
Celisse, A., Daudin, J.-J. and Pierre, L. (2012). Consistency of maximum-likelihood and variational estimators in the stochastic block model. Electron. J. Stat.
Channarond, A., Daudin, J.-J. and Robin, S. (2012). Classification and estimation in the stochastic blockmodel based on the empirical degrees. Electron. J. Stat.
Charikar, M., Guha, S., Tardos, É. and Shmoys, D. B. (1999). A constant-factor approximation algorithm for the k-median problem. In Proceedings of the Thirty-First Annual ACM Symposium on Theory of Computing.
Chaudhuri, K., Chung, F. and Tsiatas, A. (2012). Spectral clustering of graphs with general degrees in the extended planted partition model. JMLR: Workshop and Conference Proceedings.
Chen, Y., Sanghavi, S. and Xu, H. (2012). Clustering sparse graphs. In Advances in Neural Information Processing Systems 25 (F. Pereira, C. J. C. Burges, L. Bottou and K. Q. Weinberger, eds.) 2204–2212. Curran Associates, Red Hook, NY.
Choi, D. S., Wolfe, P. J. and Airoldi, E. M. (2012). Stochastic blockmodels with a growing number of classes. Biometrika.
Chung, F. and Radcliffe, M. (2011). On the spectra of general random graphs. Electron. J. Combin., Paper 215. MR2853072
Coja-Oghlan, A. (2010). Graph partitioning via adaptive spectral techniques. Combin. Probab. Comput.
Decelle, A., Krzakala, F., Moore, C. and Zdeborová, L. (2011). Asymptotic analysis of the stochastic block model for modular networks and its algorithmic applications. Phys. Rev. E (3).
Deshpande, Y. and Montanari, A. (2013). Finding hidden cliques of size √(N/e) in nearly linear time. Preprint. Available at arXiv:1304.7047.
Feige, U. and Ofek, E. (2005). Spectral techniques applied to sparse random graphs. Random Structures Algorithms.
Fishkind, D. E., Sussman, D. L., Tang, M., Vogelstein, J. T. and Priebe, C. E. (2013). Consistent adjacency-spectral partitioning for the stochastic block model when the model parameters are unknown. SIAM J. Matrix Anal. Appl.
Goldenberg, A., Zheng, A. X., Fienberg, S. E. and Airoldi, E. M. (2010). A survey of statistical network models. Foundations and Trends in Machine Learning.
Holland, P. W., Laskey, K. B. and Leinhardt, S. (1983). Stochastic blockmodels: First steps. Social Networks.
Jin, J. (2012). Fast community detection by SCORE. Preprint. Available at arXiv:1211.5803.
Karrer, B. and Newman, M. E. J. (2011). Stochastic blockmodels and community structure in networks. Phys. Rev. E (3).
Kolaczyk, E. D. (2009). Statistical Analysis of Network Data: Methods and Models. Springer, New York. MR2724362
Krzakala, F., Moore, C., Mossel, E., Neeman, J., Sly, A., Zdeborová, L. and Zhang, P. (2013). Spectral redemption in clustering sparse networks. Proc. Natl. Acad. Sci. USA.
Kumar, A. and Kannan, R. (2010). Clustering with spectral norm and the k-means algorithm. In Proceedings of the 2010 IEEE 51st Annual Symposium on Foundations of Computer Science (FOCS).
Kumar, A., Sabharwal, Y. and Sen, S. (2004). A simple linear time (1 + ε)-approximation algorithm for k-means clustering in any dimensions. In Proceedings of the 45th Annual IEEE Symposium on Foundations of Computer Science.
Lei, J. and Rinaldo, A. (2014). Supplement to “Consistency of spectral clustering in stochastic block models.” DOI:10.1214/14-AOS1274SUPP.
Li, S. and Svensson, O. (2013). Approximating k-median via pseudo-approximation. In Proceedings of the 45th Annual ACM Symposium on Theory of Computing.
Lu, L. and Peng, X. (2012). Spectra of edge-independent random graphs. Preprint. Available at arXiv:1204.6207.
Lyzinski, V., Sussman, D., Tang, M., Athreya, A. and Priebe, C. (2013). Perfect clustering for stochastic blockmodel graphs via adjacency spectral embedding. Preprint. Available at arXiv:1310.0532.
Massoulie, L. (2013). Community detection thresholds and the weak Ramanujan property. Preprint. Available at arXiv:1311.3085.
McSherry, F. (2001). Spectral partitioning of random graphs.
Mossel, E., Neeman, J. and Sly, A. (2012). Stochastic block models and reconstruction. Preprint. Available at arXiv:1202.1499.
Mossel, E., Neeman, J. and Sly, A. (2013). A proof of the block model threshold conjecture. Preprint. Available at arXiv:1311.4115.
Newman, M. E. J. (2010). Networks: An Introduction. Oxford Univ. Press, Oxford. MR2676073
Newman, M. E. J. and Girvan, M. (2004). Finding and evaluating community structure in networks. Phys. Rev. E (3).
Ng, A. Y., Jordan, M. I., Weiss, Y. et al. (2002). On spectral clustering: Analysis and an algorithm. Adv. Neural Inf. Process. Syst.
Qin, T. and Rohe, K. (2013). Regularized spectral clustering under the degree-corrected stochastic blockmodel. Preprint. Available at arXiv:1309.4111.
Rohe, K., Chatterjee, S. and Yu, B. (2011). Spectral clustering and the high-dimensional stochastic blockmodel. Ann. Statist.
Sarkar, P. and Bickel, P. (2013). Role of normalization in spectral clustering for stochastic blockmodels. Preprint. Available at arXiv:1310.1495.
Sussman, D. L., Tang, M., Fishkind, D. E. and Priebe, C. E. (2012). A consistent adjacency spectral embedding for stochastic blockmodel graphs. J. Amer. Statist. Assoc.
Tropp, J. A. (2012). User-friendly tail bounds for sums of random matrices. Found. Comput. Math.
von Luxburg, U. (2007). A tutorial on spectral clustering. Stat. Comput.
Vu, V. Q. and Lei, J. (2013). Minimax sparse principal subspace estimation in high dimensions. Ann. Statist.
Zhao, Y., Levina, E. and Zhu, J. (2012). Consistency of community detection in networks under degree-corrected stochastic block models.