Consistency of spectral clustering in stochastic block models
Institute of Mathematical Statistics, 2015
By Jing Lei and Alessandro Rinaldo
Carnegie Mellon University
We analyze the performance of spectral clustering for community extraction in stochastic block models. We show that, under mild conditions, spectral clustering applied to the adjacency matrix of the network can consistently recover hidden communities even when the order of the maximum expected degree is as small as $\log n$, with $n$ the number of nodes. This result applies to some popular polynomial time spectral clustering algorithms and is further extended to degree corrected stochastic block models using a spherical $k$-median spectral clustering method. A key component of our analysis is a combinatorial bound on the spectrum of binary random matrices, which is sharper than the conventional matrix Bernstein inequality and may be of independent interest.
1. Introduction.
Network analysis is concerned with describing and modeling the joint occurrence of random interactions among actors in a given population of interest. In its simplest form, a network dataset over $n$ actors is a simple undirected random graph on $n$ nodes, where the edges encode the realized binary interactions among the nodes. Examples include social networks (friendship between Facebook users, blog following, Twitter following, etc.), biological networks (gene networks, gene-protein networks), information networks (email networks, the World Wide Web) and many others. A review of modeling and inference on network data can be found in Kolaczyk (2009), Newman (2010) and Goldenberg et al. (2010).

Among the many existing statistical models for network data, the stochastic block model, henceforth SBM, of Holland, Laskey and Leinhardt (1983)

Received September 2014.
Supported by NSF Grant BCS-0941518, NSF Grant DMS-14-07771 and NIH Grant MH057881.
Supported by AFOSR and DARPA Grant FA9550-12-1-0392 and NSF CAREER Grant DMS-11-49677.
AMS 2000 subject classifications.
Key words and phrases.
Network data, stochastic block model, spectral clustering, sparsity.
This is an electronic reprint of the original article published by the Institute of Mathematical Statistics in The Annals of Statistics, 2015, Vol. 43, No. 1, 215–237. This reprint differs from the original in pagination and typographic detail.
stands out for both its simplicity and expressive power. In a SBM, the nodes are partitioned into $K < n$ disjoint groups, or communities, according to some latent random mechanism. Conditionally on the realized but unobservable community assignments, the edges then occur independently with probabilities depending only on the community membership of the nodes, so that nodes from the same community will have a higher average degree of connectivity among themselves than with the remaining nodes (see Section 2.1 for details). Because of its simple analytic form and its ability to capture the emergence of communities, a feature commonly observed in real network data, the SBM is certainly among the most popular models for network data.

Within the SBM framework, the most important inferential task is that of recovering the community membership of the nodes from a single observation of the network. To solve this problem, in recent years researchers have proposed a variety of procedures, which vary greatly in their degrees of statistical accuracy and computational complexity. See, in particular, modularity maximization [Newman and Girvan (2004)], likelihood methods [Bickel and Chen (2009), Choi, Wolfe and Airoldi (2012), Zhao, Levina and Zhu (2012), Amini et al. (2012), Celisse, Daudin and Pierre (2012)], method of moments [Anandkumar et al. (2013)], belief propagation [Decelle et al. (2011)], convex optimization [Chen, Sanghavi and Xu (2012)], spectral clustering [Rohe, Chatterjee and Yu (2011), Balakrishnan et al. (2011), Jin (2012), Fishkind et al. (2013), Sarkar and Bickel (2013)] and its variants [Coja-Oghlan (2010), Chaudhuri, Chung and Tsiatas (2012)], and spectral embeddings [Sussman et al. (2012), Lyzinski et al. (2013)].

Spectral clustering [see, e.g., von Luxburg (2007)] is arguably one of the most widely used methods for community recovery. Broadly speaking, this procedure first performs an eigen-decomposition of the adjacency matrix or the graph Laplacian.
Then the community membership is inferred by applying a clustering algorithm, typically $k$-means, to the (possibly normalized) rows of the matrix formed by the first few leading eigenvectors. Spectral clustering is easier to implement and computationally less demanding than many other methods, most of which amount to computationally intractable combinatorial searches. From a theoretical standpoint, spectral clustering has been shown to enjoy good theoretical properties in denser stochastic block models where the average degree grows faster than $\log n$; see, for example, Rohe, Chatterjee and Yu (2011), Jin (2012), Sarkar and Bickel (2013). In addition, spectral clustering has been empirically observed to yield good performance even in sparser regimes. For example, it is recommended as the initial solution for a search based procedure in Amini et al. (2012). In the computer science literature, spectral clustering is also a standard procedure for graph partitioning and for solving the planted partition model, a special case of the SBM [see, e.g., Ng et al. (2002)].

Despite its popularity and simplicity, the theoretical properties of spectral clustering are still not well understood in sparser SBM settings where the magnitude of the maximum expected node degree can be as small as $\log n$. This regime of sparsity is in fact not covered by existing analyses of the performance of spectral clustering for community recovery, which postulate a denser network. Indeed, Rohe, Chatterjee and Yu (2011), Fishkind et al. (2013) require the expected node degree to be almost linear in $n$, while Jin (2012) requires polynomial growth. Analogous conditions can be found elsewhere; see, for example, Sussman et al.
(2012) and Balakrishnan et al. (2011).

In this paper, we derive new error bounds for spectral clustering for the purpose of community recovery in moderately sparse stochastic block models and degree corrected stochastic block models [see, e.g., Karrer and Newman (2011)], where the maximum expected node degree is of order $\log n$ or higher. Our main contribution is to show that the most basic form of spectral clustering is successful in recovering the latent community memberships under conditions on the network sparsity that are weaker than the ones used in most of the literature. Our results yield some sharpening of existing analyses of spectral clustering for community recovery, and provide a theoretical justification for the effectiveness of this procedure in moderately sparse networks. We note that there are competing methods yielding consistent community recovery under even milder conditions on the rate of growth of the node degrees, but they either rely on combinatorial methods that are computationally demanding [Bickel and Chen (2009)] or are guaranteed to be successful provided that they are given good starting points [Amini et al. (2012)], which are typically unknown. Other computationally efficient procedures with strong theoretical guarantees, which include in particular the ones proposed and analyzed in McSherry (2001), Chen, Sanghavi and Xu (2012), Channarond, Daudin and Robin (2012), Sarkar and Bickel (2013), require instead the degrees to be of larger order than $\log n$. More detailed comparisons with some of these contributions will be given after the statement of the main results, as more technical background is introduced. Finally, it is also known that in the ultra-sparse case, where the maximum degree is of order $O(1)$, consistent community recovery is impossible and one can only hope to recover the communities up to a constant fraction [see Coja-Oghlan (2010), Decelle et al. (2011), Krzakala et al.
(2013), Massoulie (2013), Mossel, Neeman and Sly (2012, 2013)].

The contributions of this paper are as follows. We prove that the simplest form of spectral clustering, consisting of applying approximate $k$-means algorithms to the rows of the matrix formed by the leading eigenvectors of the adjacency matrix, allows one to recover the community memberships of all but a vanishing fraction of the nodes in stochastic block models with expected degree as small as $\log n$, with high probability. We also extend this result to degree corrected stochastic block models by analyzing an approximate spherical $k$-median spectral clustering algorithm. The algorithms we consider are among the most practical and computationally affordable procedures available. Yet the theoretical guarantees we provide hold under rather general assumptions of sparsity that are weaker than the ones used in algorithms of similar complexity. Our arguments extend those in Rohe, Chatterjee and Yu (2011) and Jin (2012) by combining a principal subspace perturbation analysis (Lemma 5.1), a deterministic performance guarantee for approximate $k$-means clustering (Lemma 5.3) and a sharp bound on the spectrum of binary random matrices (Theorem 5.2), which may be of independent interest. These techniques give sharper results under weaker conditions. In particular, the subspace perturbation analysis allows us to avoid the individual eigengap condition. On the other hand, the spectral bound gives a better large deviation result that cannot be obtained by the matrix Bernstein inequality [Chung and Radcliffe (2011), Tropp (2012)] and leads to a simple extension to the degree corrected stochastic block model.

The article is organized as follows. In Section 2 we give a formal introduction to the stochastic block model and spectral clustering. The main results are presented and compared to related works in Section 3 for regular SBMs and in Section 4 for degree corrected block models.
Section 5 presents the proofs of the main results, including a general, highly modular scheme for analyzing the performance of spectral clustering algorithms. Concluding remarks are given in Section 6.

Notation. For a matrix $M$ and index sets $I, J \subseteq [n]$, let $M_{I*}$ and $M_{*J}$ be the submatrices of $M$ consisting of the corresponding rows and columns. Let $\mathbb{M}_{n,K}$ be the collection of all $n \times K$ matrices where each row has exactly one 1 and $(K-1)$ 0's. For any $\Theta \in \mathbb{M}_{n,K}$, we call $\Theta$ a membership matrix, and the community membership of a node $i$ is denoted by $g_i \in \{1, \ldots, K\}$, which satisfies $\Theta_{i g_i} = 1$. Let $G_k = G_k(\Theta) = \{1 \le i \le n : g_i = k\}$ and $n_k = |G_k|$ for all $1 \le k \le K$. Let $n_{\min} = \min_{1 \le k \le K} n_k$, $n_{\max} = \max_{1 \le k \le K} n_k$, and let $n'_{\max}$ be the second largest community size. We use $\|\cdot\|$ to denote both the Euclidean norm of a vector and the spectral norm of a matrix. $\|M\|_F = (\mathrm{trace}(M^T M))^{1/2}$ denotes the Frobenius norm of a matrix $M$. The $\ell_0$ norm $\|M\|_0$ simply counts the number of nonzero entries in $M$. For any square matrix $M$, $\mathrm{diag}(M)$ denotes the matrix obtained by setting all off-diagonal entries of $M$ to 0. For two sequences of real numbers $\{x_n\}$ and $\{y_n\}$, we write $x_n = o(y_n)$ if $\lim_n x_n/y_n = 0$, $x_n = O(y_n)$ if $|x_n/y_n| \le C$ for all $n$ and some positive $C$, and $x_n = \Omega(y_n)$ if $|x_n/y_n| > C$ for all $n$ and some positive $C$.
2. Preliminaries.
2.1. Model setup.
A stochastic block model with $n$ nodes and $K$ communities is parameterized by a pair of matrices $(\Theta, B)$, where $\Theta \in \mathbb{M}_{n,K}$ is the membership matrix and $B \in \mathbb{R}^{K \times K}$ is a symmetric connectivity matrix. For each node $i$, let $g_i$ ($1 \le g_i \le K$) be its community label, so that the $i$th row of $\Theta$ is 1 in column $g_i$ and 0 elsewhere. The entry $B_{k\ell}$ of $B$ is the edge probability between any node in community $k$ and any node in community $\ell$. Given $(\Theta, B)$, the adjacency matrix $A = (a_{ij})_{1 \le i,j \le n}$ is generated as
$$a_{ij} = \begin{cases} \text{independent Bernoulli}(B_{g_i g_j}), & \text{if } i < j, \\ 0, & \text{if } i = j, \\ a_{ji}, & \text{if } i > j. \end{cases}$$
The goal of community recovery is to recover the membership matrix $\Theta$ up to column permutations. Throughout this article, we assume that the number of communities, $K$, is known. For an estimate $\widehat\Theta \in \mathbb{M}_{n,K}$ of the node memberships, we consider two measures of estimation error. The first one is the overall relative error
$$L(\widehat\Theta, \Theta) = n^{-1} \min_{J \in E_K} \|\widehat\Theta J - \Theta\|_0,$$
where $E_K$ is the set of all $K \times K$ permutation matrices. Because both $\widehat\Theta J$ and $\Theta$ are membership matrices, we have $\|\widehat\Theta J - \Theta\|_0 = \|\widehat\Theta J - \Theta\|_F^2$. This quantity measures the overall proportion of mis-clustered nodes. The other performance criterion measures the worst case relative error over all communities:
$$\widetilde L(\widehat\Theta, \Theta) = \min_{J \in E_K} \max_{1 \le k \le K} n_k^{-1} \|(\widehat\Theta J)_{G_k *} - \Theta_{G_k *}\|_0.$$
It is obvious that $0 \le L(\widehat\Theta, \Theta) \le \widetilde L(\widehat\Theta, \Theta) \le 2$. Thus, $\widetilde L$ is a stronger criterion than $L$ in that it requires the estimator to do well for all communities, while an estimator $\widehat\Theta$ with small $L(\widehat\Theta, \Theta)$ may have large relative errors for some small communities.
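As a concrete companion to these definitions, here is a small NumPy sketch (our own illustration; names such as `sample_sbm` are not from the paper) that samples an adjacency matrix from a SBM and evaluates the overall error $L(\widehat\Theta, \Theta)$ by brute-force minimization over the $K!$ column permutations, which is feasible for small $K$:

```python
import itertools
import numpy as np

def sample_sbm(theta, B, rng=None):
    """Sample a symmetric adjacency matrix with a_ij ~ Bernoulli(B_{g_i g_j})."""
    rng = np.random.default_rng(rng)
    P = theta @ B @ theta.T           # edge probabilities P_ij = B_{g_i g_j}
    upper = rng.random(P.shape) < P   # independent Bernoulli draws
    A = np.triu(upper, 1)             # keep i < j only; zero diagonal
    return (A + A.T).astype(int)

def misclustering_rate(theta_hat, theta):
    """L(theta_hat, theta): n^{-1} min over permutations J of ||theta_hat J - theta||_0."""
    n, K = theta.shape
    best = min(
        np.count_nonzero(theta_hat[:, perm] - theta)
        for perm in itertools.permutations(range(K))
    )
    return best / n
```

Each misclustered node contributes two nonzero entries to the difference, which is why $L$ is bounded by 2 rather than 1.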
2.2. Spectral clustering.
Spectral clustering is a simple method for community recovery [von Luxburg (2007), Rohe, Chatterjee and Yu (2011), Jin (2012)]. In a SBM, the heuristic of spectral clustering is to relate the eigenvectors of $A$ to those of $P := \Theta B \Theta^T$, using the fact that $E(A) = P - \mathrm{diag}(P)$. Let $P = U D U^T$ be the eigen-decomposition of $P$, with $U^T U = I_K$ and $D \in \mathbb{R}^{K \times K}$ diagonal. Then it is easy to see that $U$ has only $K$ distinct rows, since $P$ has only $K$ distinct rows. Under mild conditions, it is also the case that two nodes are in the same community if and only if their corresponding rows in $U$ are the same. This is formally stated in the following lemma.

Lemma 2.1 (Basic eigen-structure of SBMs).
Let the pair $(\Theta, B)$ parametrize a SBM with $K$ communities, where $B$ is full rank. Let $U D U^T$ be the eigen-decomposition of $P = \Theta B \Theta^T$. Then $U = \Theta X$, where $X \in \mathbb{R}^{K \times K}$ and $\|X_{k*} - X_{\ell *}\| = \sqrt{n_k^{-1} + n_\ell^{-1}}$ for all $1 \le k < \ell \le K$.

Proof.
Let $\Delta = \mathrm{diag}(\sqrt{n_1}, \ldots, \sqrt{n_K})$; then
$$P = \Theta B \Theta^T = \Theta \Delta^{-1} (\Delta B \Delta) (\Theta \Delta^{-1})^T. \quad (2.1)$$
It is straightforward to verify that $\Theta \Delta^{-1}$ is orthonormal. Let $Z D Z^T = \Delta B \Delta$ be the eigen-decomposition of $\Delta B \Delta$. Thus, we have $P = U D U^T$ where $U = \Theta \Delta^{-1} Z$. The claim follows by letting $X = \Delta^{-1} Z$ and realizing that the rows of $\Delta^{-1} Z$ are perpendicular to each other and the $k$th row has length $\|(\Delta^{-1} Z)_{k*}\| = \sqrt{1/n_k}$. □

Based on this observation, spectral clustering tries to estimate $U$ and its row clustering using a spectral decomposition of $A$. The intuition for the procedure is as follows. Consider the difference between $A$ and $P$:
$$A - P = (A - E(A)) - \mathrm{diag}(P),$$
which is a symmetric noise matrix plus a diagonal matrix. Intuitively, the eigenvectors of $A$ will be close to those of $P$ because the eigenvalues of $P$ scale linearly with $n$, while the noise matrix $A - E(A)$ has operator norm on the scale of $\sqrt{n}$ and $\mathrm{diag}(P)$ is like a constant. Therefore, letting $A = \widehat U \widehat D \widehat U^T$ be the $K$-dimensional eigen-decomposition of $A$ corresponding to the $K$ largest absolute eigenvalues, we can see that $\widehat U$ should have roughly $K$ distinct rows, because they are slightly perturbed versions of the rows in $U$. Then one should be able to obtain a good community partition by applying a clustering algorithm to the rows of $\widehat U$. In this paper we consider $k$-means clustering, defined as
$$(\widehat\Theta, \widehat X) = \arg\min_{\Theta \in \mathbb{M}_{n,K},\, X \in \mathbb{R}^{K \times K}} \|\Theta X - \widehat U\|_F^2. \quad (2.2)$$
It is known that finding a global minimizer for the $k$-means problem (2.2) is NP-hard [see, e.g., Aloise et al. (2009)]. However, efficient algorithms exist for finding an approximate solution whose value is within a constant fraction of the optimal value [Kumar, Sabharwal and Sen (2004)]. That is, there are polynomial time algorithms that find
$$(\widehat\Theta, \widehat X) \in \mathbb{M}_{n,K} \times \mathbb{R}^{K \times K} \quad (2.3)$$
$$\text{s.t. } \|\widehat\Theta \widehat X - \widehat U\|_F^2 \le (1 + \varepsilon) \min_{\Theta \in \mathbb{M}_{n,K},\, X \in \mathbb{R}^{K \times K}} \|\Theta X - \widehat U\|_F^2.$$
The spectral clustering algorithm we consider here is summarized in Algorithm 1.
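Before turning to the algorithm, Lemma 2.1 can be checked numerically on a toy example (a one-off NumPy sketch of our own; the sizes and connectivity values are arbitrary):

```python
import numpy as np

n_k = [3, 5]                      # community sizes n_1, n_2
K = len(n_k)
theta = np.zeros((sum(n_k), K))   # membership matrix Theta
start = 0
for k, nk in enumerate(n_k):
    theta[start:start + nk, k] = 1
    start += nk

B = np.array([[0.8, 0.2],
              [0.2, 0.6]])        # full-rank symmetric connectivity matrix
P = theta @ B @ theta.T

# Leading-K eigenvectors of P (P has rank K).
vals, vecs = np.linalg.eigh(P)
U = vecs[:, np.argsort(-np.abs(vals))[:K]]

# Lemma 2.1: rows of U coincide within a community, and representatives of
# communities k != l are exactly sqrt(1/n_k + 1/n_l) apart.
gap = np.linalg.norm(U[0] - U[-1])
print(abs(gap - np.sqrt(1 / n_k[0] + 1 / n_k[1])) < 1e-10)  # True
```

Row distances are invariant to the orthogonal rotation ambiguity of the eigenvectors, so the check does not depend on sign conventions of the eigen-solver.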
Algorithm 1: Spectral clustering with approximate $k$-means

Input:
Adjacency matrix $A$; number of communities $K$; approximation parameter $\varepsilon$.

Output:
Membership matrix $\widehat\Theta \in \mathbb{M}_{n,K}$.
1. Calculate $\widehat U \in \mathbb{R}^{n \times K}$, consisting of the leading $K$ eigenvectors (ordered by absolute eigenvalue) of $A$.
2. Let $(\widehat\Theta, \widehat X)$ be a $(1+\varepsilon)$-approximate solution to the $k$-means problem (2.3) with $K$ clusters and input matrix $\widehat U$.
3. Output $\widehat\Theta$.
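The following is a minimal, self-contained Python sketch of Algorithm 1 (our own illustration). A plain Lloyd iteration with farthest-first seeding stands in for the $(1+\varepsilon)$-approximate $k$-means solver of Kumar, Sabharwal and Sen (2004) that the theory assumes:

```python
import numpy as np

def lloyd_kmeans(X, K, n_iter=50):
    """Heuristic k-means: farthest-first seeding, then Lloyd updates."""
    centers = [X[0]]
    for _ in range(1, K):
        d = np.min([np.linalg.norm(X - c, axis=1) for c in centers], axis=0)
        centers.append(X[d.argmax()])  # seed with the point farthest from chosen centers
    centers = np.array(centers)
    labels = np.zeros(len(X), dtype=int)
    for _ in range(n_iter):
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        for k in range(K):
            if np.any(labels == k):
                centers[k] = X[labels == k].mean(axis=0)
    return labels

def spectral_clustering(A, K):
    """Algorithm 1 sketch: k-means on the K leading (by |eigenvalue|) eigenvectors of A."""
    vals, vecs = np.linalg.eigh(A)
    U_hat = vecs[:, np.argsort(-np.abs(vals))[:K]]
    labels = lloyd_kmeans(U_hat, K)
    theta_hat = np.zeros((A.shape[0], K), dtype=int)
    theta_hat[np.arange(A.shape[0]), labels] = 1
    return theta_hat
```

Running this on an adjacency matrix whose within-community probability is well above the between-community one recovers the partition up to a label permutation; no approximation guarantee of the form (2.3) is claimed for this heuristic.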
2.3. Sparsity scaling.
Real-world large scale networks are usually sparse, in the sense that the number of edges from a node (the node degree) is very small compared to the total number of nodes. Generally speaking, community recovery is hard when the data are sparse. As a result, an important criterion for evaluating a community recovery method is its performance under different levels of sparsity (usually measured by the error rate as a function of the average/maximum degree). The following prototypical example illustrates well the roles played by network sparsity, as well as other model parameters, in determining the hardness of community recovery.
Example 2.2.
Consider a SBM with $K$ communities parameterized by $(\Theta, B)$, where
$$B = \alpha_n B_0; \qquad B_0 = \lambda I_K + (1-\lambda) \mathbf{1}_K \mathbf{1}_K^T, \qquad 0 < \lambda < 1, \quad (2.4)$$
$I_K$ is the $K \times K$ identity matrix, and $\mathbf{1}_K$ is the $K \times 1$ vector of ones, so that the edge probability is $\alpha_n$ within communities and $\alpha_n(1-\lambda)$ between communities. The quantity $\lambda$ reflects the relative difference in connectivity between communities and within communities. The network sparsity is captured by $\alpha_n$, where $n\alpha_n$ provides an upper bound on the average (and, in this example, maximum) expected node degree. It can easily be seen that if $\alpha_n$ or $\lambda$ are close to 0, then it is hard to identify communities.

The hardness of community reconstruction also depends on the number of communities and the community size imbalance. For example, the famous planted clique problem concerns community recovery under a SBM with $K = 2$ and
$$B = \begin{pmatrix} 1 & 1/2 \\ 1/2 & 1/2 \end{pmatrix}. \quad (2.5)$$
In the planted clique problem, it is known that community recovery is easy if $n_{\min} \ge c\sqrt{n}$ for a constant $c$ [see Deshpande and Montanari (2013) and references therein], and, on the other hand, no polynomial time algorithms have been found to succeed when $n_{\min} = o(\sqrt{n})$.

Remark.
The primary concern of this paper is the effect of $\alpha_n$ on the performance of spectral clustering. Nevertheless, our results explicitly keep track of other quantities such as $K$, $\lambda$, $n_{\max}$ and $n_{\min}$, all of which are allowed to change with $n$ in a nontrivial manner. The dependence of the recovery error bound on some of these quantities, such as $K$ and $\lambda$, is considered by some authors, such as Chen, Sanghavi and Xu (2012), Chaudhuri, Chung and Tsiatas (2012), Anandkumar et al. (2013). For ease of readability, we do not always make this dependence on $n$ explicit in our notation.
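To make Example 2.2 concrete, here is a small NumPy sketch (the function name is ours, not from the paper) that builds $B = \alpha_n B_0$ and checks the two facts used later: the edge probabilities are $\alpha_n$ within and $\alpha_n(1-\lambda)$ between communities, and the smallest eigenvalue of $B_0$ equals $\lambda$:

```python
import numpy as np

def example_connectivity(K, lam, alpha_n):
    """B = alpha_n * B0 with B0 = lam * I_K + (1 - lam) * 1_K 1_K^T (Example 2.2)."""
    B0 = lam * np.eye(K) + (1 - lam) * np.ones((K, K))
    return alpha_n * B0

K, lam, alpha_n = 3, 0.4, 0.1
B = example_connectivity(K, lam, alpha_n)
print(B[0, 0])   # within-community probability: alpha_n
print(B[0, 1])   # between-community probability: alpha_n * (1 - lam)
# Eigenvalues of B0 are lam (with multiplicity K - 1) and lam + (1 - lam) * K,
# so the smallest eigenvalue is lam, the quantity appearing in Corollary 3.2.
print(np.linalg.eigvalsh(B / alpha_n).min())
```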
3. Stochastic block models.
Our main result provides an upper bound on the relative community reconstruction error of spectral clustering for a SBM $(\Theta, B)$ in terms of several model parameters.
Theorem 3.1.
Let $A$ be an adjacency matrix generated from a stochastic block model $(\Theta, B)$. Assume that $P = \Theta B \Theta^T$ is of rank $K$, with smallest absolute nonzero eigenvalue at least $\gamma_n$, and that $\max_{k,\ell} B_{k\ell} \le \alpha_n$ for some $\alpha_n \ge \log n / n$. Let $\widehat\Theta$ be the output of spectral clustering using $(1+\varepsilon)$-approximate $k$-means (Algorithm 1). There exists an absolute constant $c > 0$ such that, if
$$(2 + \varepsilon)\, \frac{K n \alpha_n}{\gamma_n^2} < c, \quad (3.1)$$
then, with probability at least $1 - n^{-1}$, there exist subsets $S_k \subset G_k$ for $k = 1, \ldots, K$, and a $K \times K$ permutation matrix $J$, such that $\widehat\Theta_{G*} J = \Theta_{G*}$, where $G = \bigcup_{k=1}^K (G_k \setminus S_k)$, and
$$\sum_{k=1}^K \frac{|S_k|}{n_k} \le c^{-1} (2 + \varepsilon)\, \frac{K n \alpha_n}{\gamma_n^2}. \quad (3.2)$$

The proof of Theorem 3.1, given in Section 5, is modular, and can be derived from several relatively independent lemmas. The sets $S_k$ ($1 \le k \le K$) consist of nodes in $G_k$ for which clustering correctness cannot be guaranteed. The permutation matrix $J$ in the above theorem leads to an upper bound on the reconstruction error $\widetilde L(\widehat\Theta, \Theta)$ [and hence on $L(\widehat\Theta, \Theta)$] through equation (3.2).

Condition (3.1) specifies the range of model parameters $(K, n, \gamma_n, \alpha_n)$ for which the result is applicable. It is included only for technical reasons, because it holds whenever the bound in (3.2) vanishes and, therefore, implies consistency. In particular, as discussed after Corollary 3.2, we have $K n \alpha_n / \gamma_n^2 = o(1)$ in many interesting cases. The constant $c$ in (3.1) can be written as $c = 1/(64 C^2)$, where $C$ is an absolute constant defined in Theorem 5.2 and can be explicitly tracked in the proof presented in the supplementary material [Lei and Rinaldo (2014)]. The assumption $\alpha_n \ge \log n/n$ can be changed to $\alpha_n \ge c_0 \log n/n$ for any $c_0 > 0$, and also the probability bound $1 - n^{-1}$ can be changed to $1 - n^{-r}$ for any $r > 0$, with a different constant $c = c(c_0, r)$ in (3.1) and (3.2).

While Theorem 3.1 provides a general error bound for spectral clustering, the quantities involved are not in the most transparent form. For example, the bound does not clearly reflect the intuition that the error should increase when $\alpha_n$ decreases. This is because the quantity $\gamma_n$ contains the parameter $\alpha_n$. Also the dependence on the community size imbalance, as well as on the community separation (which corresponds to the parameter $\lambda$ in Example 2.2), remains unclear. The next corollary illustrates the error bound in terms of these model parameters.

Corollary 3.2.
Let $A$ be an adjacency matrix from the SBM $(\Theta, B)$, where $B = \alpha_n B_0$ for some $\alpha_n \ge \log n / n$, and with $B_0$ having minimum absolute eigenvalue at least $\lambda > 0$ and $\max_{k\ell} B_0(k, \ell) = 1$. Let $\widehat\Theta$ be the output of spectral clustering using $(1+\varepsilon)$-approximate $k$-means (Algorithm 1). Then there exists an absolute constant $c$ such that if
$$(2 + \varepsilon)\, \frac{K n}{n_{\min}^2 \lambda^2 \alpha_n} < c, \quad (3.3)$$
then with probability at least $1 - n^{-1}$,
$$\widetilde L(\widehat\Theta, \Theta) \le c^{-1} (2 + \varepsilon)\, \frac{K n}{n_{\min}^2 \lambda^2 \alpha_n} \qquad \text{and} \qquad L(\widehat\Theta, \Theta) \le c^{-1} (2 + \varepsilon)\, \frac{K n'_{\max}}{n_{\min}^2 \lambda^2 \alpha_n}.$$

In the special case of balanced community sizes [i.e., $n_{\max}/n_{\min} = O(1)$] and constant $\lambda$, if $\alpha_n = \Omega(\log n / n)$, then $L(\widehat\Theta, \Theta) = O_P(K^2 (n\alpha_n)^{-1}) = O_P(K^2 / \log n)$. Thus $L(\widehat\Theta, \Theta) = o_P(1)$ if $K = o(\sqrt{\log n})$. This improves the results in Rohe, Chatterjee and Yu (2011), where $\alpha_n$ needs to be of order $1/\log n$ for a similar result.

In Example 2.2, the smallest nonzero eigenvalue of $B_0$ is $\lambda$. Recall that $\lambda$ is the relative difference of within- and between-community edge probabilities. Corollary 3.2 then implies that when this relative difference stays bounded away from zero, the communities can be consistently recovered by simple spectral clustering as long as the expected node degrees are no less than $\log n$. On the other hand, when $\alpha_n$ is constant and $\lambda = \lambda_n$ varies with $n$, spectral clustering can recover the communities when the relative edge probability gap grows faster than $1/\sqrt{n}$.

In the planted clique problem, $L(\widehat\Theta, \Theta)$ has limited meaning, because a trivial clustering putting all nodes in one cluster achieves $L(\widehat\Theta, \Theta) = 2 n_{\min}/n$, which is $o(1)$ in the most interesting regime. Therefore, it makes more sense to consider $\widetilde L(\widehat\Theta, \Theta)$. Now $B_0 = B$ is given by (2.5), with minimum eigenvalue $> 0.19$. Applying Corollary 3.2 with $K = 2$, $\lambda = 0.19$, $\alpha_n = 1$, and any fixed $\varepsilon > 0$, we have
$$\widetilde L(\widehat\Theta, \Theta) < c' \frac{n}{n_{\min}^2},$$
provided that $c' n/n_{\min}^2 < 1$, where $c'$ is a different absolute constant. Therefore, when $n_{\min} \ge \sqrt{an}$ for some $a > c'$, $\widehat\Theta$ recovers the hidden clique with a relative error no larger than $c'/a$. Thus, our result reaches the widely believed computational barrier [up to a constant factor; see Deshpande and Montanari (2013) and references therein] of the planted clique problem.

There are spectral methods other than spectral clustering that can provide consistent community recovery. One such well-known method is the procedure analyzed by McSherry (2001). The planted partition problem in that setting corresponds to the problem of recovering the community memberships in the SBM. To simplify the presentation and focus on the dependence on network sparsity, we consider the SBM in Example 2.2 with two equal-sized communities and a constant $\lambda \in (0, 1)$. In this setting the procedure is guaranteed to succeed with probability at least $1 - n^{-1}$ provided that, after some simplification,
$$\lambda^2 \alpha_n^2 n > c \sigma_n^2 \log n \qquad \text{and} \qquad \sigma_n^2 > (\log n)^2/n, \quad (3.4)$$
for some constant $c$, where $\sigma_n^2$ is an upper bound on the maximal variance of the edges. Therefore, condition (3.4) implies that $\alpha_n > \sqrt{c}\, \lambda^{-1} (\log n)^{1.5}/n$, which is stronger than the condition in our Corollary 3.2.
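The eigenvalue $0.19$ used in the planted clique discussion can be verified directly (a one-off NumPy check of our own):

```python
import numpy as np

# Connectivity matrix (2.5) of the planted clique problem.
B = np.array([[1.0, 0.5],
              [0.5, 0.5]])
# Its eigenvalues are (3 +- sqrt(5)) / 4; the smaller one is about 0.191.
print(np.linalg.eigvalsh(B))
```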
4. Degree corrected stochastic block models.
The degree corrected block model [DCBM, Karrer and Newman (2011)] extends the standard SBM by introducing node specific parameters to allow for varying degrees even within the same community. A DCBM is parameterized by a triplet $(\Theta, B, \psi)$, where, in addition to the membership matrix $\Theta$ and connectivity matrix $B$, the vector $\psi \in \mathbb{R}^n$ is included to model additional variability of the edge probabilities at the node level. Given $(\Theta, B, \psi)$, the edge probability between nodes $i$ and $j$ is $\psi_i \psi_j B_{g_i g_j}$ (recall that $g_i$ is the community label of node $i$). Similar to the SBM, the DCBM also assumes independent edge formation given $(\Theta, B, \psi)$. The inclusion of $\psi$ raises an issue of identifiability, so we assume that $\max_{i \in G_k} \psi_i = 1$ for all $k = 1, \ldots, K$. The SBM can be viewed as a special case of the DCBM with $\psi_i = 1$ for all $i$. The DCBM greatly enhances the flexibility of modeling degree heterogeneity and is able to fit network data with arbitrary degree distributions. Successful applications and theoretical developments can be found in Zhao, Levina and Zhu (2012) for likelihood methods, and in Chaudhuri, Chung and Tsiatas (2012), Jin (2012) for spectral methods.

Additional notation for the degree heterogeneity. Let $\phi_k$ be the $n \times 1$ vector that agrees with $\psi$ on $G_k$ and is zero otherwise. Define $\tilde\phi_k = \phi_k / \|\phi_k\|$ and $\tilde\psi = \sum_{k=1}^K \tilde\phi_k$. Let $\tilde\Theta$ be a normalized membership matrix such that $\tilde\Theta(i, k) = \tilde\psi_i$ if $i \in G_k$ and $\tilde\Theta(i, k) = 0$ otherwise. We also define the effective community size $\tilde n_k := \|\phi_k\|^2$. Let $\tilde n_{\min} = \min_k \tilde n_k$ and $\tilde n_{\max} = \max_k \tilde n_k$.

The spectral clustering heuristic can be extended to DCBMs by considering the eigen-decomposition $P = U D U^T$, where $P = \mathrm{diag}(\psi) \Theta B \Theta^T \mathrm{diag}(\psi)$. Now the matrix $U$ may have more than $K$ distinct rows due to the effect of $\psi$. However, the rows of $U$ point to at most $K$ distinct directions [Jin (2012)]. The following lemma is the analogue of Lemma 2.1 for DCBMs.

Lemma 4.1 (Spectral structure of mean matrix in DCBM).
Let
$U D U^T$ be the eigen-decomposition of $P = \mathrm{diag}(\psi) \Theta B \Theta^T \mathrm{diag}(\psi)$ in a DCBM parameterized by $(\Theta, B, \psi)$. Then there exists a $K \times K$ orthogonal matrix $H$ such that $U_{i*} = \tilde\psi_i H_{k*}$ for all $1 \le k \le K$, $i \in G_k$.

Proof.
First, realize that $\mathrm{diag}(\psi)\Theta = \tilde\Theta \Psi$, where $\Psi = \mathrm{diag}(\|\phi_1\|, \ldots, \|\phi_K\|)$. Then
$$P = \mathrm{diag}(\psi) \Theta B \Theta^T \mathrm{diag}(\psi) = \tilde\Theta \Psi B \Psi \tilde\Theta^T = \tilde\Theta H D (\tilde\Theta H)^T, \quad (4.1)$$
where $\Psi B \Psi = H D H^T$ is the eigen-decomposition of $\Psi B \Psi$. Note that $\tilde\Theta^T \tilde\Theta = I_K$, so $\tilde\Theta H D (\tilde\Theta H)^T$ is an eigen-decomposition of $P$. □

As a result, finding the true community partition corresponds to clustering the directions of the row vectors in $U$, where some form of normalization must be employed in order to filter out the nuisance parameter $\psi$. In particular, we consider spherical clustering, which looks for a cluster structure among the rows of a normalized matrix $U'$ with $U'_{i*} = U_{i*}/\|U_{i*}\|$.

In addition to the overall sparsity, the difficulty of community recovery in a DCBM is also affected by small entries of $\psi$. Intuitively, if $\psi_i \approx$
0, then it is hard to identify the community membership of node $i$, because few edges are observed for this node. However, the interaction between small entries of $\psi$ and the overall network sparsity (the maximum/average degree) has not been well understood. In the analysis of profile likelihood methods, Zhao,
Algorithm 2: Spherical $k$-median spectral clustering

Input:
Adjacency matrix $A$; number of communities $K$; approximation parameter $\varepsilon$.

Output:
Membership matrix $\widehat\Theta \in \mathbb{M}_{n,K}$.
1. Calculate $\widehat U \in \mathbb{R}^{n \times K}$, consisting of the leading $K$ eigenvectors (ordered by absolute eigenvalue) of $A$.
2. Let $I_+ = \{i : \|\widehat U_{i*}\| > 0\}$ and $\widehat U_+ = \widehat U_{I_+ *}$.
3. Let $\widehat U'$ be the row-normalized version of $\widehat U_+$.
4. Let $(\widehat\Theta_+, \widehat X)$ be a $(1+\varepsilon)$-approximate solution to the $k$-median problem with $K$ clusters and input matrix $\widehat U'$.
5. Output $\widehat\Theta$, with $\widehat\Theta_{i*}$ being the corresponding row in $\widehat\Theta_+$ if $i \in I_+$, and $\widehat\Theta_{i*} = (1, 0, \ldots, 0)$ if $i \notin I_+$.

Levina and Zhu (2012) assume that the entries of $\psi$ are fixed constants. In spectral clustering, Jin (2012) allows milder conditions on $\psi$ but needs the average degree to be polynomial in $n$.

Our analysis uses the following quantity as a summarizing measure of node heterogeneity in each community $G_k$:
$$\nu_k := n_k^{-1} \sum_{i \in G_k} \psi_i^{-2}, \qquad k = 1, \ldots, K.$$
By definition $\nu_k \in [1, \infty)$, and a larger $\nu_k$ indicates stronger heterogeneity in the $k$th community. On the other hand, $\nu_k = 1$ indicates within-community homogeneity ($\psi_i = 1$ for all $i \in G_k$).

The argument developed for SBMs in the previous sections can be extended to cover very general degree corrected models. In particular, let $\widehat U \in \mathbb{R}^{n \times K}$ consist of the $K$ leading eigenvectors of $A$. We consider the following spherical $k$-median spectral clustering:
$$\text{minimize}_{\Theta \in \mathbb{M}_{n,K},\, X \in \mathbb{R}^{K \times K}} \|\Theta X - \widehat U'\|_{2,1}, \quad (4.2)$$
where $\widehat U'$ is the row-normalized version of $\widehat U$ and $\|M\|_{2,1} = \sum_i \|M_{i*}\|$ is the matrix $(2,1)$ norm. We consider a $(1+\varepsilon)$-approximate solution $(\widehat\Theta, \widehat X)$ to this $k$-median problem, which can be found in polynomial time for any fixed $\varepsilon > 0$; the resulting procedure takes as input $\widehat U$ and is described in detail in Algorithm 2.

4.1. Analysis of spherical $k$-median spectral clustering for DCBM. We have the following main theorem for spherical $k$-median spectral clustering in DCBMs. It is proved in Appendix A.3.

Theorem 4.2 (Main result for DCBM).
Consider a DCBM $(\Theta, B, \psi)$ with $K$ communities, where $P = \mathrm{diag}(\psi) \Theta B \Theta^T \mathrm{diag}(\psi)$ has rank $K$, smallest nonzero absolute eigenvalue at least $\gamma_n$, and maximum entry bounded from above by $\alpha_n$, for some $\alpha_n \ge \log n/n$. There exists an absolute constant $c > 0$ such that if
$$(2.5 + \varepsilon)\, \frac{\sqrt{K n \alpha_n}}{\gamma_n} < \frac{c\, n_{\min}}{\sqrt{\sum_{k=1}^K n_k \nu_k}}, \quad (4.3)$$
then, with probability at least $1 - n^{-1}$,
$$L(\widehat\Theta, \Theta) \le c^{-1} (2.5 + \varepsilon)\, \sqrt{\sum_{k=1}^K n_k \nu_k}\; \frac{\sqrt{K n \alpha_n}}{\gamma_n \sqrt{n}}. \quad (4.4)$$

Remark.
The constant $c$ equals $1/(8C)$, where $C$ is the universal constant in Theorem 5.2. The condition on $\alpha_n$ and the probability guarantee can also be changed to $\alpha_n \ge c_0 \log n/n$ and $1 - n^{-r}$, respectively, with a different constant $c = c(c_0, r)$ in equations (4.3) and (4.4).

Theorem 4.2 immediately implies a counterpart of Corollary 3.2 under more explicit scaling of the model parameters.

Corollary 4.3.
Let $A$ be an adjacency matrix from the DCBM $(\Theta, B, \psi)$, such that $B = \alpha_n B_0$ for some $\alpha_n \ge \log n/n$, where $B_0$ has minimum absolute eigenvalue $\lambda > 0$ and $\max_{k\ell} B_0(k, \ell) = 1$. Let $(\widehat\Theta, \widehat X)$ be a $(1+\varepsilon)$-approximate solution to the spherical $k$-median algorithm (Algorithm 2). There exists an absolute constant $c$ such that if
$$(2.5 + \varepsilon)\, \frac{\sqrt{K n}}{\tilde n_{\min} \lambda \sqrt{\alpha_n}} < \frac{c\, n_{\min}}{\sqrt{\sum_{k=1}^K n_k \nu_k}},$$
then, with probability at least $1 - n^{-1}$,
$$L(\widehat\Theta, \Theta) \le c^{-1} (2.5 + \varepsilon)\, \frac{\sqrt{K n}}{\tilde n_{\min} \lambda \sqrt{n \alpha_n}}\, \sqrt{\sum_{k=1}^K n_k \nu_k}.$$

Comparing with Theorem 3.1 and Corollary 3.2, the results for the DCBM differ in two major aspects. First, the DCBM condition (4.3) involves the additional factor $n_{\min}/\sqrt{\sum_{k=1}^K n_k \nu_k}$ on the right-hand side, which makes (4.3) more stringent than (3.1). The upper bound on $L(\widehat\Theta, \Theta)$ differs in the same manner. Furthermore, the argument used to prove Theorem 4.2 is not likely to provide a sharp upper bound on $\widetilde L(\widehat\Theta, \Theta)$. We believe this has to do with the additional normalization step used in the spherical $k$-median algorithm, as well as the specific strategy used in our proof.

To better understand this result, consider Example 2.2 with balanced community sizes: $n_{\max}/n_{\min} = O(1)$. To work with a DCBM, assume in addition that the node degree vector $\psi$ has comparable degree heterogeneity across communities: $c_1 \nu \le \nu_k \le c_2 \nu$ for constants $c_1, c_2$. Then Corollary 4.3 implies an overall relative error rate
$$L(\widehat\Theta, \Theta) = O_P\!\left(\frac{\sqrt{\nu}\, n}{\tilde n_{\min} \lambda \sqrt{n \alpha_n}}\right). \quad (4.5)$$
Several observations are worth mentioning. First, the error rate depends on $\nu$, the degree heterogeneity measure, in a simple manner. Second, the community size $n_{\min}$ that appears in Corollary 3.2 is replaced by $\tilde n_{\min} = \min_k \|\phi_k\|^2$, the minimum effective sample size.
Roughly speaking, ñ_min ≍ n_min as long as a constant fraction of the nodes have their ψ_i bounded away from zero (while the remaining ψ_i should not be too small, in order to keep ν small). Third, if there is no degree heterogeneity (ν_k ≡ n_k and ñ_min = n_min), then the rate in (4.5) is the square root of that given by Corollary 3.2. This is due to the additional normalization step involved in spherical k-median (which is unnecessary when ν = 1) and to the different argument used to analyze the spherical k-median algorithm. Moreover, the relative error can still be o_P(1) even when α_n is as small as log n/n, provided that 1/ν, ñ_min/n, and λ₀ stay bounded away from zero or approach zero sufficiently slowly.

Comparisons with existing work.
There are relatively few results for community recovery in degree corrected block models that allow the maximum node degree to be of order o(n). Chaudhuri, Chung and Tsiatas (2012) extended the method of McSherry (2001) to degree corrected block models. In the setting of Example 2.2 with equal community sizes, their main result (Theorems 2 and 3 in their paper) requires α_n to be at least of order 1/√n. A similar requirement of polynomial growth of the expected average degree is implicitly imposed in Jin (2012), who first studied the performance of normalized k-means spectral clustering in degree corrected block models.
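The adjacency spectral clustering procedure analyzed in this paper (compute the K leading eigenvectors of A, then cluster their rows) can be illustrated with a short simulation. The sketch below is not the authors' code: the model parameters are arbitrary illustrative choices, and a plain Lloyd k-means with deterministic farthest-first initialization stands in for a (1 + ε)-approximate k-means solver.

```python
import numpy as np
from itertools import permutations

rng = np.random.default_rng(0)

# --- simulate a small SBM (illustrative parameters, not from the paper) ---
n, K = 300, 3
z = np.repeat(np.arange(K), n // K)       # ground-truth communities
Theta = np.eye(K)[z]                      # n x K membership matrix
B = 0.05 + 0.25 * np.eye(K)               # within-block 0.30, between-block 0.05
P = Theta @ B @ Theta.T
upper = np.triu(rng.random((n, n)) < P, 1)
A = (upper + upper.T).astype(float)       # symmetric adjacency, zero diagonal

# --- K leading eigenvectors of A (largest eigenvalues in absolute value) ---
vals, vecs = np.linalg.eigh(A)
U_hat = vecs[:, np.argsort(np.abs(vals))[-K:]]

# --- Lloyd k-means with farthest-first initialization; a crude stand-in
#     for the (1+eps)-approximate k-means solver assumed in the paper ---
def kmeans(rows, K, n_iter=50):
    centers = rows[[0]]
    for _ in range(K - 1):
        d = np.min(np.linalg.norm(rows[:, None, :] - centers[None], axis=2), axis=1)
        centers = np.vstack([centers, rows[np.argmax(d)]])
    for _ in range(n_iter):
        labels = np.argmin(np.linalg.norm(rows[:, None, :] - centers[None], axis=2), axis=1)
        centers = np.vstack([rows[labels == k].mean(0) for k in range(K)])
    return labels

labels = kmeans(U_hat, K)

# fraction of correctly clustered nodes, maximized over label permutations
acc = max(np.mean(np.array(perm)[labels] == z) for perm in permutations(range(K)))
```

On such an easy, well-separated instance the procedure typically recovers almost all community labels; the point is only to make the pipeline of Sections 3 and 5 concrete.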
5. Proof of the main results.
In this section, we present a general scheme for proving error bounds for spectral clustering. It contains the SBM as a special case and can be easily extended to the degree corrected block model. Our argument consists of three parts: (1) control the perturbation of principal subspaces for general symmetric matrices; (2) bound the spectrum of random binary matrices; and (3) bound the error of k-means and spherical k-median clustering.

5.1. Principal subspace perturbation.
The first ingredient of our proof isto bound the difference between the eigenvectors of A and those of P , where A can be viewed as a noisy version of P . Lemma 5.1 (Principal subspace perturbation).
Assume that P ∈ R^{n×n} is a rank K symmetric matrix with smallest nonzero singular value γ_n. Let A be any symmetric matrix and Û, U ∈ R^{n×K} be the K leading eigenvectors of A and P, respectively. Then there exists a K×K orthogonal matrix Q such that

\[
\|\hat U - UQ\|_F \le \frac{2\sqrt{2K}}{\gamma_n}\,\|A - P\|.
\]

Lemma 5.1 is proved in Appendix A.1, based on an application of the Davis–Kahan sin Θ theorem [Theorem VII.3.1 of Bhatia (1997)]. The presence of the K×K orthogonal matrix Q in the statement of Lemma 5.1 takes care of the situation where some leading eigenvalues have multiplicity larger than one. In this case, the eigenvectors are determined only up to rotation.

5.2. Spectral bound of binary symmetric random matrices.
The next the-orem provides a sharp probabilistic upper bound on k A − P k when A is arandom adjacency matrix with E ( a ij ) = p ij . Theorem 5.2 (Spectral bound of binary symmetric random matrices).
Let A be the adjacency matrix of a random graph on n nodes in which edges occur independently. Set E[A] = P = (p_ij)_{i,j=1,...,n} and assume that n max_{ij} p_ij ≤ d for d ≥ c log n and c > 0. Then, for any r > 0 there exists a constant C = C(r, c) such that

\[
\|A - P\| \le C\sqrt d
\]

with probability at least 1 − n^{−r}.

This result does not follow from conventional matrix concentration inequalities such as the matrix Bernstein inequality (which would only give √(d log n)). Lu and Peng (2012) use a path counting technique from random matrix theory to prove a bound of the same order, but require a maximal degree d ≥ c(log n)⁴. The proof of Theorem 5.2 is technically involved, as it uses combinatorial arguments in order to derive spectral bounds for sparse random matrices. Our proof is based on techniques developed by Feige and Ofek (2005) for bounding the second largest eigenvalue of an Erdős–Rényi random graph with edge probability d/n. The full proof is provided in Lei and Rinaldo (2014). Here we give a brief outline of the three major steps.
Step 1: Discretization. We first reduce the problem of controlling ‖A − P‖ to that of bounding the supremum of |xᵀ(A − P)y| over all pairs of vectors x, y in a finite set of grid points. For any given pair (x, y) in the grid, the quantity xᵀ(A − P)y is decomposed into the sum of two parts. The first part corresponds to the small entries of both x and y, called the light pairs; the other part corresponds to the larger entries of x or y, the heavy pairs.

Step 2: Bounding the light pairs. The next step is to use Bernstein's inequality and a union bound to control the contribution of the light pairs, uniformly over the points in the grid.
Step 3: Bounding the heavy pairs. In the final step, the contribution of the heavy pairs, which cannot simply be bounded by the conventional Bernstein inequality, is bounded using a combinatorial argument on the event that the edge counts in a collection of subgraphs do not deviate much from their expectations. A sharp large deviation bound for sums of independent Bernoulli random variables [Corollary A.1.10 of Alon and Spencer (2004)] is used to achieve a better rate than standard Bernstein's inequality.
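The content of Theorem 5.2 can also be probed numerically. In the sketch below (illustrative only, with our own choice of d = 2 log n), the ratio ‖A − P‖/√d stays bounded by a modest constant as n grows, whereas a matrix-Bernstein-type bound would only control ‖A − P‖ at the larger scale √(d log n).

```python
import numpy as np

rng = np.random.default_rng(1)

def spectral_dev_ratio(n, d, rng):
    """Sample an Erdos-Renyi graph G(n, d/n) and return ||A - P|| / sqrt(d)."""
    p = d / n
    P = np.full((n, n), p)
    upper = np.triu(rng.random((n, n)) < p, 1)
    A = (upper + upper.T).astype(float)
    return np.linalg.norm(A - P, 2) / np.sqrt(d)   # spectral-norm deviation

# with d of order log n, the ratio stays bounded by a modest constant
ratios = [spectral_dev_ratio(n, 2 * np.log(n), rng) for n in (200, 400, 800)]
```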
5.3. Error bound of k-means/k-median on perturbed eigenvectors.

Spectral clustering (or spherical spectral clustering) applies a clustering algorithm to a matrix consisting of the eigenvectors of A, which is close (in view of Lemma 5.1 and Theorem 5.2) to a matrix whose rows can be perfectly clustered. We would like to bound the clustering error in terms of the closeness between the actual input matrix Û and the ideal input matrix U. The next lemma generalizes an argument used in Jin (2012) and provides an error bound for any (1 + ε)-approximate k-means solution.

Lemma 5.3 (Approximate k-means error bound). For ε > 0 and any two matrices Û, U ∈ R^{n×K} such that U = ΘX with Θ ∈ M_{n,K}, X ∈ R^{K×K}, let (Θ̂, X̂) be a (1 + ε)-approximate solution to the k-means problem in equation (2.2) and Ū = Θ̂X̂. For any δ_k ≤ min_{ℓ≠k} ‖X_{ℓ*} − X_{k*}‖, define S_k = {i ∈ G_k(Θ): ‖Ū_{i*} − U_{i*}‖ ≥ δ_k/2}. Then

\[
\sum_{k=1}^K |S_k|\,\delta_k^2 \le 4(4+2\varepsilon)\,\|\hat U - U\|_F^2. \tag{5.1}
\]

Moreover, if

\[
(16+8\varepsilon)\,\|\hat U - U\|_F^2/\delta_k^2 < n_k \quad\text{for all } k, \tag{5.2}
\]

then there exists a K×K permutation matrix J such that Θ̂_{G*} = Θ_{G*}J, where G = ∪_{k=1}^K (G_k \ S_k).

Lemma 5.3 provides a performance guarantee for approximate k-means clustering under a deterministic Frobenius norm condition on the input matrix. As suggested by a referee, the proof of Lemma 5.3 shares some similarities with the proof of Theorem 3.1 in Awasthi and Sheffet (2012) [see also Kumar and Kannan (2010)], though our assumptions are slightly different. For completeness we provide a short and self-contained proof of Lemma 5.3 in Appendix A.2, giving explicit constant factors in the result.

5.4. Proof of main results for SBM.
We first prove Theorem 3.1.
Proof of Theorem 3.1.
Combining Lemma 5.1 and Theorem 5.2, we obtain that, for some K×K orthogonal matrix Q,

\[
\|\hat U - UQ\|_F \le \frac{2\sqrt{2K}}{\gamma_n}\,\|A - P\| \le \frac{2\sqrt{2K}}{\gamma_n}\,C\sqrt{n\alpha_n}, \tag{5.3}
\]

with probability at least 1 − n^{−1}, where C is the absolute constant involved in Theorem 5.2. (Notice that the term d in Theorem 5.2 becomes nα_n in the current setting.)

The main strategy for the rest of the proof is to apply Lemma 5.3 to Û and UQ. To that end, Lemma 2.1 implies that
UQ = ΘXQ = ΘX′, where

\[
\|X'_{k*} - X'_{\ell*}\| = \sqrt{n_k^{-1} + n_\ell^{-1}}.
\]

As a result, we can choose

\[
\delta_k = \sqrt{\frac{1}{n_k} + \frac{1}{\max\{n_\ell: \ell\neq k\}}}
\]

in Lemma 5.3, and hence n_kδ_k² ≥ 1. Using (5.3), a sufficient condition for (5.2) to hold is

\[
(16+8\varepsilon)\,8C^2\,\frac{Kn\alpha_n}{\gamma_n^2} \le 1 \le \min_{1\le k\le K} n_k\delta_k^2, \tag{5.4}
\]

so that (3.1) indeed implies (5.2) by setting c₀ = 1/(64C²). In detail, this choice of δ_k together with (5.1) yields

\[
\sum_{k=1}^K |S_k|\left(\frac{1}{n_k} + \frac{1}{\max\{n_\ell: \ell\neq k\}}\right) = \sum_{k=1}^K |S_k|\,\delta_k^2 \le (16+8\varepsilon)\,\|\hat U - UQ\|_F^2,
\]

which, combined with (5.3), gives (3.2):

\[
\sum_{k=1}^K \frac{|S_k|}{n_k} \le (16+8\varepsilon)\,8C^2\,\frac{Kn\alpha_n}{\gamma_n^2} = c_0^{-1}(2+\varepsilon)\,\frac{Kn\alpha_n}{\gamma_n^2}.
\]

Since Lemma 5.3 ensures that the memberships are correctly recovered outside of ∪_{1≤k≤K} S_k, the claim follows. □

Proof of Corollary 3.2.
It is easy to see, for example from (2.1), that in this specific stochastic block model setting, γ_n ≥ n_min α_n λ₀. The proof of Theorem 3.1 then applies with γ_n replaced by n_min α_n λ₀ and gives

\[
\sum_{k=1}^K |S_k|\left(\frac{1}{n_k} + \frac{1}{\max\{n_\ell: \ell\neq k\}}\right) \le c_0^{-1}(2+\varepsilon)\,\frac{Kn}{n_{\min}^2\lambda_0^2\alpha_n},
\]

which implies that

\[
\tilde L(\hat\Theta,\Theta) \le \max_{1\le k\le K}\frac{|S_k|}{n_k} \le \sum_{1\le k\le K}\frac{|S_k|}{n_k} \le c_0^{-1}(2+\varepsilon)\,\frac{Kn}{n_{\min}^2\lambda_0^2\alpha_n},
\]

and, recalling that n′_max is the second largest community size,

\[
L(\hat\Theta,\Theta) \le \frac{1}{n}\sum_{k=1}^K |S_k| \le c_0^{-1}(2+\varepsilon)\,\frac{Kn'_{\max}}{n_{\min}^2\lambda_0^2\alpha_n}. \qquad\square
\]
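The subspace perturbation inequality used in these proofs can also be checked numerically. The following sketch simulates a small SBM (the parameters are our own illustrative choices) and verifies a Davis–Kahan-type bound of the form ‖Û − UQ‖_F ≤ (2√(2K)/γ_n)‖A − P‖, with the aligning orthogonal matrix Q computed via orthogonal Procrustes.

```python
import numpy as np

rng = np.random.default_rng(2)

# --- a small SBM (illustrative parameters) ---
n, K = 300, 3
Theta = np.eye(K)[np.repeat(np.arange(K), n // K)]
B = 0.05 + 0.25 * np.eye(K)
P = Theta @ B @ Theta.T
upper = np.triu(rng.random((n, n)) < P, 1)
A = (upper + upper.T).astype(float)

def top_eigvecs(M, K):
    """K leading eigenvectors, by absolute eigenvalue."""
    vals, vecs = np.linalg.eigh(M)
    return vecs[:, np.argsort(np.abs(vals))[-K:]]

U_hat, U = top_eigvecs(A, K), top_eigvecs(P, K)
gamma = np.sort(np.abs(np.linalg.eigvalsh(P)))[-K]   # smallest nonzero |eigenvalue|

# best orthogonal alignment of U with U_hat (orthogonal Procrustes)
W, _, Vt = np.linalg.svd(U.T @ U_hat)
Q = W @ Vt

lhs = np.linalg.norm(U_hat - U @ Q)                  # Frobenius norm (default)
rhs = 2 * np.sqrt(2 * K) / gamma * np.linalg.norm(A - P, 2)
```

Since the optimal Procrustes rotation can only do better than the Q whose existence the lemma asserts, `lhs <= rhs` must hold on every draw.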
6. Concluding remarks.
The analysis in this paper applies directly to the eigenvectors of the adjacency matrix, combining tools from subspace perturbation and spectral bounds for binary random graphs. In the literature, spectral clustering using the graph Laplacian or its variants is very popular and can sometimes lead to better empirical performance [von Luxburg (2007), Rohe, Chatterjee and Yu (2011), Sarkar and Bickel (2013)]. An important direction for future work is to extend some of the results and techniques in this paper to spectral clustering using the graph Laplacian. The graph Laplacian normalizes the adjacency matrix by the node degrees, which can introduce extra noise if the network is sparse and many node degrees are small. In several recent works, Chaudhuri, Chung and Tsiatas (2012) and Qin and Rohe (2013) studied graph Laplacian based spectral clustering with regularization, where a small constant is added to all node degrees prior to normalization. Further understanding of the resulting bias–variance trade-off would be both important and interesting.

For degree corrected block models, regularization methods may also lead to error bounds with better dependence on the small entries of ψ. The intuition is that ν_k can be very large even when only a single ψ_i is close to zero. In this case, one should be able to simply discard such nodes and work with those whose degrees are large enough. Finding the right regularization to diminish the effect of small-degree nodes and analyzing the resulting algorithm will be pursued in future work.

This paper aims at understanding the performance of spectral clustering in stochastic block models. While our main focus is the performance of spectral clustering as the network sparsity changes, the resulting error bounds explicitly keep track of five independent model parameters (K, α_n, λ₀, n_min, n_max).
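For concreteness, the degree regularization mentioned above has a simple form: add a constant τ to every node degree before symmetric normalization. This is a hedged sketch, not a prescription from this paper; the default choice of τ equal to the average degree is one common heuristic from the regularization literature.

```python
import numpy as np

def regularized_laplacian(A, tau=None):
    """Symmetrically normalized adjacency with regularized degrees:
    L_tau = D_tau^{-1/2} A D_tau^{-1/2}, where D_tau = D + tau * I."""
    deg = A.sum(axis=1)
    if tau is None:
        tau = deg.mean()              # heuristic default, not from this paper
    d_inv_sqrt = 1.0 / np.sqrt(deg + tau)
    return (A * d_inv_sqrt[:, None]) * d_inv_sqrt[None, :]

# tiny example: a single edge with tau = 1 gives off-diagonal entries 1/2
A_demo = np.array([[0.0, 1.0], [1.0, 0.0]])
L_demo = regularized_laplacian(A_demo, tau=1.0)
```

Spectral clustering then proceeds on the K leading eigenvectors of L_tau instead of those of A.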
Existing results usually develop error bounds depending on a subset of these parameters, keeping the others constant [see, e.g., Bickel and Chen (2009), Chen, Sanghavi and Xu (2012), Zhao, Levina and Zhu (2012)]. In the planted clique model, our result implies that spectral clustering can find the hidden clique when its size is at least c√n for a large enough constant c. Our result also provides good insight into the impact of the number of clusters and the separation between communities. For instance, in Example 2.2, let α_n ≡ 1 and n_max = n_min = n/K. Then Corollary 3.2 implies that spectral clustering is consistent if K³/(nλ₀²) → 0. More generally, the guarantees of Corollary 3.2 compare favorably against most existing results as summarized in Chen, Sanghavi and Xu (2012), in terms of allowable cluster size, density gap and overall sparsity. It would be interesting to develop a unified theoretical framework (e.g., minimax theory) in which all methods and model parameters can be studied and compared together.

APPENDIX: TECHNICAL PROOFS

For any two matrices A and B of the same dimension, we use the notation ⟨A, B⟩ = trace(AᵀB) for the standard matrix inner product.

A.1. Proof of Lemma 5.1.
By Proposition 2.2 of Vu and Lei (2013), there exists a K×K orthogonal matrix Q such that

\[
\frac{1}{\sqrt 2}\,\|\hat U - UQ\|_F \le \|(I - \hat U\hat U^T)UU^T\|_F \le \sqrt K\,\|(I - \hat U\hat U^T)UU^T\|.
\]

Next, we establish that

\[
\|(I - \hat U\hat U^T)UU^T\| \le \frac{2\|A - P\|}{\gamma_n}.
\]

If ‖A − P‖ ≤ γ_n/2, then the Davis–Kahan sin Θ theorem gives

\[
\|(I - \hat U\hat U^T)UU^T\| \le \frac{\|A - P\|}{\gamma_n - \|A - P\|} \le \frac{2\|A - P\|}{\gamma_n}.
\]

If ‖A − P‖ > γ_n/2, then

\[
\|(I - \hat U\hat U^T)UU^T\| \le 1 \le \frac{2\|A - P\|}{\gamma_n}. \qquad\square
\]

A.2. Proof of Lemma 5.3.
First, by the definition of Ū and the fact that U is feasible for the k-means problem (2.2), we have ‖Ū − Û‖²_F ≤ (1+ε)‖Û − U‖²_F, and hence

\[
\|\bar U - U\|_F^2 \le \bigl(\|\bar U - \hat U\|_F + \|\hat U - U\|_F\bigr)^2 \le (4+2\varepsilon)\,\|\hat U - U\|_F^2.
\]

Then

\[
\sum_{k=1}^K |S_k|\,\delta_k^2/4 \le \|\bar U - U\|_F^2 \le (4+2\varepsilon)\,\|\hat U - U\|_F^2, \tag{A.1}
\]

which establishes the first claim of the lemma. Under the assumption described in the second part of the lemma, equation (A.1) further implies that

\[
|S_k| \le (16+8\varepsilon)\,\|\hat U - U\|_F^2/\delta_k^2 < n_k \quad\text{for all } k.
\]
Therefore, T_k ≡ G_k \ S_k ≠ ∅ for each k. If i ∈ T_k and j ∈ T_ℓ with k ≠ ℓ, then Ū_{i*} ≠ Ū_{j*}, because otherwise

\[
\max(\delta_k, \delta_\ell) \le \|U_{i*} - U_{j*}\| \le \|U_{i*} - \bar U_{i*}\| + \|U_{j*} - \bar U_{j*}\| < \delta_k/2 + \delta_\ell/2,
\]

which is impossible. This further implies that Ū has exactly K distinct rows, because the number of distinct rows is no larger than K as part of the constraints of the optimization problem (2.2). On the other hand, if i and j are both in T_k for some k, then Ū_{i*} = Ū_{j*}, because otherwise there would be more than K distinct rows, since there are already at least K − 1 distinct rows corresponding to the sets T_ℓ with ℓ ≠ k.

As a result, Ū_{i*} = Ū_{j*} if i, j ∈ T_k for some k, and Ū_{i*} ≠ Ū_{j*} if i ∈ T_k, j ∈ T_ℓ with k ≠ ℓ. This gives a one-to-one correspondence between the clusters of the rows of Ū_{T*} and those of U_{T*}, where T = ∪_{k=1}^K T_k. □

A.3. Proofs for degree corrected block models.
The argument fits well within the general scheme developed in Section 5. Lemma 5.1 and Theorem 5.2 still apply, and

\[
\mathbb P\left[\|\hat U - UQ\|_F \le \frac{2\sqrt 2\,C\sqrt{Kn\alpha_n}}{\gamma_n}\ \text{for some } QQ^T = I_K\right] \ge 1 - n^{-1}, \tag{A.2}
\]

where C is the constant in Theorem 5.2. For simplicity of presentation, in the following argument we work with Q = I_K. The general case can be handled in the same manner with more complicated notation (simply substitute U by UQ).

To prove Theorem 4.2, we first give a bound on the number of zero rows in Û. Recall that I₊ = {i: Û_{i*} ≠ 0}. Define I = I₊ᶜ.

Lemma A.1 (Number of zero rows in Û). In a DCBM(Θ, B, ψ) satisfying the conditions of Theorem 4.2, let Û and U be the K leading eigenvectors of A and P, respectively. Then

\[
|I| \le \sqrt{\sum_{k=1}^K n_k\nu_k}\;\|\hat U - U\|_F.
\]

Proof.
Use the Cauchy–Schwarz inequality:

\[
\|\hat U - U\|_F^2 \ge \sum_{i=1}^n \mathbf 1(\hat U_{i*} = 0)\,\|U_{i*}\|^2 \ge \Bigl(\sum_{i=1}^n \mathbf 1(\hat U_{i*} = 0)\Bigr)^2\Bigl(\sum_{i=1}^n \|U_{i*}\|^{-2}\Bigr)^{-1} = \frac{|I|^2}{\sum_{k=1}^K n_k\nu_k}. \qquad\square
\]

We also need the following simple fact about the distance between normalized vectors.
Fact.
For two nonzero vectors v₁ and v₂ of the same dimension, we have

\[
\left\|\frac{v_1}{\|v_1\|} - \frac{v_2}{\|v_2\|}\right\| \le \frac{2\|v_1 - v_2\|}{\max(\|v_1\|, \|v_2\|)}.
\]

Proof.
Without loss of generality, assume ‖v₁‖ ≥ ‖v₂‖. Then

\[
\left\|\frac{v_1}{\|v_1\|} - \frac{v_2}{\|v_2\|}\right\| = \left\|\frac{v_1}{\|v_1\|} - \frac{v_2}{\|v_1\|} + \frac{v_2}{\|v_1\|} - \frac{v_2}{\|v_2\|}\right\| \le \frac{\|v_1 - v_2\|}{\|v_1\|} + \frac{\|v_2\|\,\bigl|\|v_1\| - \|v_2\|\bigr|}{\|v_1\|\,\|v_2\|} \le \frac{2\|v_1 - v_2\|}{\|v_1\|}. \qquad\square
\]

Proof of Theorem 4.2.
Recall that U′ is the row-normalized version of U. Let U′′ = U′_{I₊*} be the sub-matrix of U′ corresponding to the nonzero rows of Û. Then, by the Fact above and the Cauchy–Schwarz inequality,

\[
\|\hat U' - U''\|_{2,1} \le \sum_{i=1}^n \frac{2\|\hat U_{i*} - U_{i*}\|}{\|U_{i*}\|} \le 2\sqrt{\sum_{i=1}^n \|\hat U_{i*} - U_{i*}\|^2}\;\sqrt{\sum_{i=1}^n \|U_{i*}\|^{-2}} \le 2\|\hat U - U\|_F\sqrt{\sum_{k=1}^K n_k\nu_k}.
\]

Now we can bound the (2,1) distance between an approximate solution of the k-median problem (4.2) and the targeted solution U′′:

\[
\|\hat\Theta_{I_+*}\hat X - U''\|_{2,1} \le \|\hat\Theta_{I_+*}\hat X - \hat U'\|_{2,1} + \|\hat U' - U''\|_{2,1} \le (2+\varepsilon)\,\|\hat U' - U''\|_{2,1}.
\]

Let S = {i ∈ I₊: ‖Θ̂_{i*}X̂ − U′_{i*}‖ ≥ 1/√2}. The size of S can be bounded using a similar argument as in the proof of Lemma A.1:

\[
\frac{|S|}{\sqrt 2} \le \|\hat\Theta_{I_+*}\hat X - U''\|_{2,1} \le (2+\varepsilon)\,\|\hat U' - U''\|_{2,1} \le 2(2+\varepsilon)\,\|\hat U - U\|_F\sqrt{\sum_{k=1}^K n_k\nu_k},
\]

which implies

\[
|S| \le 2\sqrt 2\,(2+\varepsilon)\sqrt{\sum_{k=1}^K n_k\nu_k}\;\|\hat U - U\|_F. \tag{A.3}
\]

On the event in (A.2) (recall that we assume Q = I), (A.3) and Lemma A.1 imply

\[
|S| + |I| \le (2.5+\varepsilon)\,8C\,\frac{\sqrt{Kn\alpha_n}}{\gamma_n}\,\sqrt{\sum_{k=1}^K n_k\nu_k}. \tag{A.4}
\]
Combining this with condition (4.3) implies |S| + |I| < n_k for all k, and hence G_k ∩ (I₊ \ S) ≠ ∅ for each k. Therefore, for any two rows indexed by G := I₊ \ S, if they are in different clusters under Θ then they must be in different clusters under Θ̂: otherwise,

\[
\|U'_{i*} - U'_{j*}\| \le \|U'_{i*} - \hat\Theta_{i*}\hat X\| + \|\hat\Theta_{j*}\hat X - U'_{j*}\| < \sqrt 2,
\]

contradicting the fact that rows of U′ corresponding to different communities are orthogonal unit vectors and hence at distance √2. As a result, the mis-clustered nodes are contained in I ∪ S, and their number is bounded by the right-hand side of (A.4). The claimed result follows by choosing c₁ = 1/(8C). □
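The spherical k-median step of Algorithm 2 can be sketched as follows. This is an illustrative implementation, not the authors' code: it drops the zero rows, normalizes the remaining rows, and replaces an exact (1 + ε)-approximate k-median solver with farthest-first initialization plus Weiszfeld-style geometric-median updates (the tolerances and iteration counts are our own choices).

```python
import numpy as np

def spherical_k_median(U_hat, K, n_iter=30):
    """Sketch of spherical k-median: drop zero rows, normalize the rest,
    then alternate nearest-center assignment in Euclidean distance with
    approximate geometric-median updates (a few Weiszfeld iterations)."""
    norms = np.linalg.norm(U_hat, axis=1)
    keep = norms > 1e-12                       # I_+: indices of nonzero rows
    V = U_hat[keep] / norms[keep, None]        # row-normalized eigenvectors
    # deterministic farthest-first initialization
    centers = V[[0]]
    for _ in range(K - 1):
        d = np.min(np.linalg.norm(V[:, None, :] - centers[None], axis=2), axis=1)
        centers = np.vstack([centers, V[np.argmax(d)]])
    for _ in range(n_iter):
        d = np.linalg.norm(V[:, None, :] - centers[None], axis=2)
        labels = np.argmin(d, axis=1)
        for k in range(K):
            pts = V[labels == k]
            if len(pts) == 0:
                continue
            c = pts.mean(0)                    # start Weiszfeld from the mean
            for _ in range(10):
                w = 1.0 / np.maximum(np.linalg.norm(pts - c, axis=1), 1e-9)
                c = (pts * w[:, None]).sum(0) / w.sum()
            centers[k] = c
    return keep, labels, centers

# tiny demo: two well-separated directions plus one zero row (hypothetical input)
U_demo = np.zeros((9, 2))
U_demo[:4, 0] = [1.0, 2.0, 0.5, 3.0]
U_demo[4:8, 1] = [1.0, 1.5, 0.7, 2.0]
keep, labels, centers = spherical_k_median(U_demo, K=2)
```

In the notation of the proof, `keep` plays the role of I₊; the remaining rows would be assigned arbitrarily, mirroring the treatment of the set I in the error bound.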
The authors thank an anonymous reviewer for helpfulsuggestions that led in particular to a significant simplification of the proofof Lemma 5.1. SUPPLEMENTARY MATERIAL
Supplement to “Consistency of spectral clustering in sparse stochastic block models” (DOI: 10.1214/14-AOS1274SUPP; .pdf). The supplementary file contains a proof of Theorem 5.2.

REFERENCES
Aloise, D., Deshpande, A., Hansen, P. and Popat, P. (2009). NP-hardness of Euclidean sum-of-squares clustering. Machine Learning.
Alon, N. and Spencer, J. H. (2004). The Probabilistic Method, 2nd ed. Wiley, Hoboken.
Amini, A. A., Chen, A., Bickel, P. J. and Levina, E. (2012). Pseudo-likelihood methods for community detection in large sparse networks. Preprint. Available at arXiv:1207.2340.
Anandkumar, A., Ge, R., Hsu, D. and Kakade, S. M. (2013). A tensor spectral approach to learning mixed membership community models. Preprint. Available at arXiv:1302.2684.
Awasthi, P. and Sheffet, O. (2012). Improved spectral-norm bounds for clustering. In Approximation, Randomization, and Combinatorial Optimization. Lecture Notes in Computer Science.
Balakrishnan, S., Xu, M., Krishnamurthy, A. and Singh, A. (2011). Noise thresholds for spectral clustering. In Advances in Neural Information Processing Systems 24 (J. Shawe-Taylor, R. S. Zemel, P. L. Bartlett, F. Pereira and K. Q. Weinberger, eds.) 954–962. Curran Associates, Red Hook, NY.
Bhatia, R. (1997). Matrix Analysis. Graduate Texts in Mathematics. Springer, New York. MR1477662
Bickel, P. J. and Chen, A. (2009). A nonparametric view of network models and Newman–Girvan and other modularities. Proc. Natl. Acad. Sci. USA.
Celisse, A., Daudin, J.-J. and Pierre, L. (2012). Consistency of maximum-likelihood and variational estimators in the stochastic block model. Electron. J. Stat.
Channarond, A., Daudin, J.-J. and Robin, S. (2012). Classification and estimation in the stochastic blockmodel based on the empirical degrees. Electron. J. Stat.
Charikar, M., Guha, S., Tardos, É. and Shmoys, D. B. (1999). A constant-factor approximation algorithm for the k-median problem. In Proceedings of the Thirty-First Annual ACM Symposium on Theory of Computing.
Chaudhuri, K., Chung, F. and Tsiatas, A. (2012). Spectral clustering of graphs with general degrees in the extended planted partition model. JMLR: Workshop and Conference Proceedings.
Chen, Y., Sanghavi, S. and Xu, H. (2012). Clustering sparse graphs. In Advances in Neural Information Processing Systems 25 (F. Pereira, C. J. C. Burges, L. Bottou and K. Q. Weinberger, eds.) 2204–2212. Curran Associates, Red Hook, NY.
Choi, D. S., Wolfe, P. J. and Airoldi, E. M. (2012). Stochastic blockmodels with a growing number of classes. Biometrika.
Chung, F. and Radcliffe, M. (2011). On the spectra of general random graphs. Electron. J. Combin., Paper 215. MR2853072
Coja-Oghlan, A. (2010). Graph partitioning via adaptive spectral techniques. Combin. Probab. Comput.
Decelle, A., Krzakala, F., Moore, C. and Zdeborová, L. (2011). Asymptotic analysis of the stochastic block model for modular networks and its algorithmic applications. Phys. Rev. E (3).
Deshpande, Y. and Montanari, A. (2013). Finding hidden cliques of size √(N/e) in nearly linear time. Preprint. Available at arXiv:1304.7047.
Feige, U. and Ofek, E. (2005). Spectral techniques applied to sparse random graphs. Random Structures Algorithms.
Fishkind, D. E., Sussman, D. L., Tang, M., Vogelstein, J. T. and Priebe, C. E. (2013). Consistent adjacency-spectral partitioning for the stochastic block model when the model parameters are unknown. SIAM J. Matrix Anal. Appl.
Goldenberg, A., Zheng, A. X., Fienberg, S. E. and Airoldi, E. M. (2010). A survey of statistical network models. Foundations and Trends in Machine Learning.
Holland, P. W., Laskey, K. B. and Leinhardt, S. (1983). Stochastic blockmodels: First steps. Social Networks.
Jin, J. (2012). Fast community detection by SCORE. Preprint. Available at arXiv:1211.5803.
Karrer, B. and Newman, M. E. J. (2011). Stochastic blockmodels and community structure in networks. Phys. Rev. E (3).
Kolaczyk, E. D. (2009). Statistical Analysis of Network Data: Methods and Models. Springer, New York. MR2724362
Krzakala, F., Moore, C., Mossel, E., Neeman, J., Sly, A., Zdeborová, L. and Zhang, P. (2013). Spectral redemption in clustering sparse networks. Proc. Natl. Acad. Sci. USA.
Kumar, A. and Kannan, R. (2010). Clustering with spectral norm and the k-means algorithm. In Proceedings of the 2010 IEEE 51st Annual Symposium on Foundations of Computer Science (FOCS).
Kumar, A., Sabharwal, Y. and Sen, S. (2004). A simple linear time (1 + ε)-approximation algorithm for k-means clustering in any dimensions. In Proceedings of the 45th Annual IEEE Symposium on Foundations of Computer Science.
Lei, J. and Rinaldo, A. (2014). Supplement to “Consistency of spectral clustering in stochastic block models.” DOI:10.1214/14-AOS1274SUPP.
Li, S. and Svensson, O. (2013). Approximating k-median via pseudo-approximation. In Proceedings of the 45th Annual ACM Symposium on Theory of Computing.
Lu, L. and Peng, X. (2012). Spectra of edge-independent random graphs. Preprint. Available at arXiv:1204.6207.
Lyzinski, V., Sussman, D., Tang, M., Athreya, A. and Priebe, C. (2013). Perfect clustering for stochastic blockmodel graphs via adjacency spectral embedding. Preprint. Available at arXiv:1310.0532.
Massoulie, L. (2013). Community detection thresholds and the weak Ramanujan property. Preprint. Available at arXiv:1311.3085.
McSherry, F. (2001). Spectral partitioning of random graphs.
Mossel, E., Neeman, J. and Sly, A. (2012). Stochastic block models and reconstruction. Preprint. Available at arXiv:1202.1499.
Mossel, E., Neeman, J. and Sly, A. (2013). A proof of the block model threshold conjecture. Preprint. Available at arXiv:1311.4115.
Newman, M. E. J. (2010). Networks: An Introduction. Oxford Univ. Press, Oxford. MR2676073
Newman, M. E. J. and Girvan, M. (2004). Finding and evaluating community structure in networks. Phys. Rev. E (3).
Ng, A. Y., Jordan, M. I., Weiss, Y. et al. (2002). On spectral clustering: Analysis and an algorithm. Adv. Neural Inf. Process. Syst.
Qin, T. and Rohe, K. (2013). Regularized spectral clustering under the degree-corrected stochastic blockmodel. Preprint. Available at arXiv:1309.4111.
Rohe, K., Chatterjee, S. and Yu, B. (2011). Spectral clustering and the high-dimensional stochastic blockmodel. Ann. Statist.
Sarkar, P. and Bickel, P. (2013). Role of normalization in spectral clustering for stochastic blockmodels. Preprint. Available at arXiv:1310.1495.
Sussman, D. L., Tang, M., Fishkind, D. E. and Priebe, C. E. (2012). A consistent adjacency spectral embedding for stochastic blockmodel graphs. J. Amer. Statist. Assoc.
Tropp, J. A. (2012). User-friendly tail bounds for sums of random matrices. Found. Comput. Math.
von Luxburg, U. (2007). A tutorial on spectral clustering. Stat. Comput.
Vu, V. Q. and Lei, J. (2013). Minimax sparse principal subspace estimation in high dimensions. Ann. Statist.
Zhao, Y., Levina, E. and Zhu, J. (2012). Consistency of community detection in networks under degree-corrected stochastic block models.