Estimating the number of communities by Stepwise Goodness-of-Fit
Jiashun Jin, Zheng Tracy Ke, Shengming Luo, and Minzhe Wang
Carnegie Mellon University, Harvard University, and University of Chicago
September 22, 2020
Abstract
Given a symmetric network with n nodes, how to estimate the number of communities K is a fundamental problem. We propose Stepwise Goodness-of-Fit (StGoF) as a new approach to estimating K. For m = 1, 2, . . ., StGoF alternately uses a community detection step (pretending m is the correct number of communities) and a goodness-of-fit step. We use SCORE [14] for community detection, and propose a new goodness-of-fit measure. Denote the goodness-of-fit statistic in step m by ψ_n^(m). We show that as n → ∞, ψ_n^(m) → N(0, 1) when m = K and ψ_n^(m) → ∞ in probability when m < K. Therefore, with a proper threshold, StGoF terminates at m = K as desired. We consider a broad setting where we allow severe degree heterogeneity, a wide range of sparsity, and especially weak signals. In particular, we propose a measure for the signal-to-noise ratio (SNR) and show that there is a phase transition: when SNR → 0 as n → ∞, consistent estimates for K do not exist, and when SNR → ∞, StGoF is consistent, uniformly for a broad class of settings. In this sense, StGoF achieves the optimal phase transition. Stepwise testing algorithms of a similar kind (e.g., [36, 29]) are known to face analytical challenges. We overcome the challenges by using a different design in the stepwise algorithm and by deriving sharp results in the under-fitting case (m < K) and the null case (m = K). The key to our analysis is to show that SCORE has the Non-Splitting Property (NSP). The NSP is non-obvious, so in addition to rigorous proofs, we also provide an intuitive explanation.
MSC subject classification . Primary: 62H12, 62H30. Secondary: 91D30.
Keywords. Clusters, community detection, DCBM, estimating K, goodness-of-fit, k-means, lower bound, Non-Splitting Property (NSP), over-fitting, phase transition, simplex, under-fitting.

In network analysis, how to estimate the number of communities K is a fundamental problem. In many recent approaches, K is assumed as known a priori (see for example [2, 7, 19, 30, 39, 37] on community detection, [15, 38] on mixed-membership estimation, and [26] on dynamic community detection). Unfortunately, K is rarely known in applications, so the performance of these approaches hinges on how well we can estimate K. The primary interest of this paper is how to estimate K. Given a symmetric and connected social network with n nodes and K communities, let A be the adjacency matrix:

    A_ij = 1 if node i and node j have an edge, and A_ij = 0 otherwise,  1 ≤ i ≠ j ≤ n.  (1.1)

As a convention, self-edges are not allowed, so all the diagonal entries of A are 0. Denote the K perceivable communities by N_1, N_2, . . . , N_K. We model the network by the widely-used degree-corrected block model (DCBM) [19]. For each 1 ≤ i ≤ n, we encode the community label of node i by a vector π_i ∈ R^K where

    i ∈ N_k  ⟺  π_i(k) = 1 and π_i(m) = 0 for m ≠ k.  (1.2)

Moreover, for a K × K symmetric nonnegative matrix P which models the community structure and positive parameters θ_1, θ_2, . . . , θ_n which model the degree heterogeneity, we assume the upper triangular entries of A are independent Bernoulli variables satisfying

    P(A_ij = 1) = θ_i θ_j · π_i′ P π_j ≡ Ω_ij,  1 ≤ i < j ≤ n,  (1.3)

where Ω denotes the matrix ΘΠPΠ′Θ, with Θ being the n × n diagonal matrix diag(θ_1, . . . , θ_n) and Π being the n × K matrix [π_1, π_2, . . . , π_n]′. For identifiability, we assume

    all diagonal entries of P are 1.  (1.4)

Write for short diag(Ω) = diag(Ω_11, Ω_22, . . . , Ω_nn), and let W be the matrix where for 1 ≤ i, j ≤ n, W_ij = A_ij − Ω_ij if i ≠ j and W_ij = 0 otherwise. In matrix form, we have

    A = Ω − diag(Ω) + W,  where we recall Ω = ΘΠPΠ′Θ.  (1.5)

In the special case of θ_1 = θ_2 = . . . = θ_n, the DCBM reduces to the stochastic block model (SBM) [8]. In this paper, we focus on the DCBM, but the idea is extendable to the degree-corrected mixed-membership (DCMM) model [38, 15], where mixed membership is allowed; see Remark 3 below. Real world networks have a few interesting features that we frequently observe.

• Severe degree heterogeneity. The distribution of the node degrees has a power-law tail, implying severe degree heterogeneity. Therefore, the sparsity level for individual nodes (measured by the number of edges) may vary significantly from one node to another.

• Network sparsity. The overall network sparsity may range significantly from one network to another.

• Weak signal. The community structure is masked by strong noise, and the signal-to-noise ratio (SNR) is usually relatively small.

For analysis, we let n be the driving asymptotic parameter, and allow (Θ, Π, P) to depend on n, so that the DCBM is broad enough to cover all interesting ranges of these metrics. Let θ = (θ_1, θ_2, . . . , θ_n)′, θ_max = max{θ_1, . . . , θ_n}, and θ_min = min{θ_1, . . . , θ_n}. Let λ_1, λ_2, . . . , λ_K be the K nonzero eigenvalues of Ω, arranged in descending order of magnitude. The following were suggested by existing literature (e.g., [17, 14]). First, a reasonable metric for network sparsity is ‖θ‖, and a reasonable metric for degree heterogeneity is θ_max/θ_min. Second, the range of interest for ‖θ‖ is

    C√log(n) ≤ ‖θ‖ ≤ C√n,  (1.6)

where C > 0 is a constant. Third, reasonable metrics for the signal and the noise are |λ_K| and ‖W‖, respectively.
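To make the model concrete, here is a minimal sketch (not from the paper; the function and parameter names are our own) that simulates an adjacency matrix from the DCBM in (1.3)-(1.5):

```python
import numpy as np

def simulate_dcbm(theta, labels, P, rng=None):
    """Draw A from the DCBM: P(A_ij = 1) = theta_i * theta_j * P[g_i, g_j], no self-edges."""
    rng = np.random.default_rng(rng)
    n = len(theta)
    # Omega = Theta * Pi * P * Pi' * Theta, written entry-wise
    Omega = np.outer(theta, theta) * P[np.ix_(labels, labels)]
    U = rng.random((n, n))
    upper = np.triu(U < Omega, k=1)        # independent Bernoulli upper triangle
    A = (upper | upper.T).astype(int)      # symmetrize; the diagonal stays 0
    return A, Omega

# toy example: n = 300 nodes, K = 3 balanced communities, unit-diagonal P
rng = np.random.default_rng(0)
n, K = 300, 3
labels = np.repeat(np.arange(K), n // K)
theta = rng.uniform(0.2, 0.8, size=n)                # degree heterogeneity
P = 0.3 * np.ones((K, K)) + 0.7 * np.eye(K)          # diagonal entries equal 1
A, Omega = simulate_dcbm(theta, labels, P, rng)
```

Since the diagonal of P is 1 and max θ_i < 1 here, all entries of Ω are valid Bernoulli probabilities.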
When θ_max ≤ Cθ_min and some mild conditions hold (e.g., ‖P‖ ≤ C), λ_1 ≍ ‖θ‖², and

    ‖W‖ = a multi-log(n) term · √λ_1 with high probability,  (1.7)

(examples of multi-log(n) terms are √log(n), log log(n), etc.), so a reasonable metric for the signal-to-noise ratio (SNR) is |λ_K|/√λ_1. When θ_max/θ_min → ∞, we need an adjusted SNR; see Section 2. We consider two extreme cases.

• Strong signal case. |λ_1|, |λ_2|, . . . , |λ_K| are of the same magnitude, and so SNR ≍ √λ_1.

• Weak signal case. |λ_K|/√λ_1 is much smaller than √λ_1 and grows to ∞ slowly as n → ∞ (in our range of interest, λ_1 may grow to ∞ rapidly as n → ∞, so for example, we may have SNR = log log(n) and λ_1 = √n).

Section 2.3 suggests that when SNR = o(1), a consistent estimate for K does not exist, so the weak signal case is a very challenging case. Motivated by the above observations, it is desirable to find a consistent estimate for K that satisfies the following requirements.

• (R1). Allow severe degree heterogeneity (i.e., θ_max/θ_min may tend to ∞).

• (R2). Optimally adaptive to network sparsity, where ‖θ‖ may be as small as O(√log(n)) or as large as O(√n).

• (R3). Attain the information lower bound. Consistent for both the strong signal case where SNR is large and the weak signal case where SNR may be as small as log log(n) (say).

Example 1. Recently, a frequently considered DCBM is to assume P = P_0 and θ_i ≍ √α_n for all 1 ≤ i ≤ n, where α_n > 0 and P_0 is a fixed matrix. It is seen that λ_1, . . . , λ_K are of the same order, so the model only considers the strong signal case.

Example 2 (A special DCBM). Let e_1, . . . , e_K be the standard basis vectors of R^K. Fixing a positive vector θ ∈ R^n and a scalar b_n ∈ (0, 1), suppose K is fixed, each community has n/K nodes, and P = (1 − b_n) I_K + b_n 1_K 1_K′.
In this model, (1 − b_n) measures the "dis-similarity" of different communities and is small in the more challenging case when different communities are similar. By basic algebra, λ_1 ≍ ‖θ‖², λ_2 = . . . = λ_K ≍ ‖θ‖²(1 − b_n), and SNR ≍ ‖θ‖(1 − b_n). In the very sparse case, ‖θ‖ = O(√log(n)). In the dense case, ‖θ‖ = O(√n). When b_n ≤ c for a constant c < 1, |λ_K| ≥ C|λ_1| and SNR ≍ ‖θ‖; we are in the strong signal case if ‖θ‖ ≥ n^a for a constant a > 0. When b_n = 1 − o(1) and ‖θ‖(1 − b_n) = log log(n) (say), SNR ≍ log log(n) and we are in the weak signal case.

In recent years, many interesting approaches for estimating K have been proposed, which can be roughly divided into the spectral approaches, the cross validation approaches, the penalization approaches, and the likelihood ratio approaches. Among the spectral approaches, Le and Levina [21] proposed to estimate K using the eigenvalues of the non-backtracking matrix or Bethe Hessian matrix. The approach uses ideas from mathematical graph theory, and is quite interesting for it is different from most statistical approaches. Unfortunately, the approach requires relatively strong conditions for consistency. For example, their Theorem 4.1 only considers the idealized SBM in the very sparse case, where θ_1 = θ_2 = . . . = θ_n = 1/√n and P = P_0 for a fixed matrix P_0. Liu et al. [28] proposed to estimate K by using the classical scree plot approach with careful theoretical justification, but the approach is known to be unsatisfactory in the presence of severe degree heterogeneity, for it is hard to derive a sharp bound for the spectral norm of the noise matrix W (e.g., [14]). Therefore, their approach requires the condition θ_max ≤ Cθ_min. The paper also imposed the condition ‖θ‖ = O(√n), so it did not address the settings of sparse networks (see (1.6) for the interesting range of ‖θ‖). Among the cross-validation approaches, we have [1, 25], and among the penalization approaches, we have [33, 3, 20], where K is estimated by the integer that optimizes some objective function. For example, Saldaña et al. [33] used a BIC-type objective function, and [3, 20] used an objective function of the Bayesian model selection flavor.
However, these methods did not provide explicit theoretical guarantees on consistency (though a partial result was established in [25], which stated that under the SBM, the proposed estimator K̂ is no greater than K with high probability).

For the likelihood ratio approaches, Wang and Bickel [36] proposed to estimate K by solving a BIC-type optimization problem, where the objective function is the sum of the log-likelihood and the model complexity. The major challenge here is that the likelihood is the sum of exponentially many terms and is hard to compute. In a remarkable paper, Ma et al. [29] extended the idea of [36] by proposing a new approach that is computationally more feasible. On a high level, we can recast their methods as a stepwise testing or sequential testing algorithm. Consider a stepwise testing scheme where for m = 1, 2, . . ., we construct a test statistic ℓ_n^(m) (e.g., the log-likelihood) assuming m is the correct number of communities. We estimate K as the smallest m such that the pairwise log-likelihood ratio (ℓ_n^(m+1) − ℓ_n^(m)) falls below a threshold. As mentioned in [36, 29], such an approach faces challenges. Call the cases m < K, m = K, and m > K the under-fitting, null, and over-fitting cases, respectively.

• We have to analyze ℓ_n^(m) for both the under-fitting case and the over-fitting case, but we do not have efficient technical tools to address either case.

• It is hard to derive sharp results on the limiting distribution of ℓ_n^(m+1) − ℓ_n^(m) in the null case, and so it is unclear how to pin down the threshold.

Ma et al. [29] (see also [36]) made interesting progress, but unfortunately the problems are not resolved satisfactorily. For example, they require hard-to-check strong conditions in both the under-fitting and over-fitting cases.
Also, in the over-fitting case, it is unclear whether their results are sharp, and in the under-fitting case, it is unclear how to standardize ℓ_n^(m+1) − ℓ_n^(m) as the variance term is unknown; as a result, how to pin down the threshold remains unclear. Most importantly, both papers focus on the setting in Example 1 (see above), where severe degree heterogeneity is not allowed and only the strong signal case is considered.

In this paper, we propose Stepwise Goodness-of-Fit (StGoF) as a new approach to estimating K. Our approach follows a different vein, so it is different not only in the particular procedures we use, but also in the design of the stepwise testing. In detail, for m = 1, 2, . . ., StGoF alternately uses two sub-steps, a community detection sub-step where we apply SCORE [14] assuming m is the correct number of communities, and a Goodness-of-Fit (GoF) sub-step. We propose a new GoF approach and let ψ_n^(m) be the GoF test statistic in step m. Assuming SNR → ∞, we show that

    ψ_n^(m) → N(0, 1), when m = K (null case),
    ψ_n^(m) → ∞ in probability, when 1 ≤ m < K (under-fitting case).  (1.8)

This gives rise to a consistent estimate for K. Note that we have derived N(0, 1) as the explicit limiting null distribution, which is crucial in our study. To prove (1.8), the key is to show that in the under-fitting case, SCORE has the so-called Non-Splitting Property (NSP), meaning that all nodes in each (true) community are always clustered together. See Section 1.3 for what the analytical challenges are and how the NSP helps overcome the challenges. In the over-fitting case, m > K. The NSP does not hold and so the analytical challenge remains, but the design of StGoF and the sharp results in (1.8) help avoid the analysis in this case. For the stepwise testing algorithms in [36, 29], analysis in the over-fitting case cannot be avoided, as we need to analyze ℓ_n^(m+1) − ℓ_n^(m) for m = 1, 2, . . . , K; see details therein.

To assess the optimality, we use the phase transition, a well-known optimality framework. It is related to the minimax framework but can frequently be more informative [5, 11, 31, 32]. We show that when SNR → ∞, (1.8) gives rise to an estimator that is consistent in a broad setting. We also obtain an information lower bound by showing that when SNR → 0, consistent estimates for K do not exist. This suggests that our consistency result is sharp in terms of the rate of SNR, so we say that StGoF achieves the optimal phase transition; see Section 2.3. As far as we know, such a phase transition result on estimating K is new. In order to achieve the optimal phase transition, a procedure needs to work well in the weak signal case. Since most existing methods have been focused on the strong signal case, it is unclear whether they achieve the optimal phase transition. Our contributions are as follows.

• We propose StGoF as a new approach to estimating K, where we use both a different design for stepwise testing and a new GoF test.

• We derive N(0, 1) as the explicit limiting null distribution, and use the NSP of SCORE to derive tight bounds in the under-fitting case. These sharp results and the design of StGoF allow us to avoid the analysis in the over-fitting case and so to overcome the technical challenges faced by stepwise testing of this kind. Such an analytical strategy is extendable to other settings (e.g., the study of directed or bipartite graphs).

• We show that StGoF achieves the optimal phase transition when θ_max ≤ Cθ_min and is consistent in broad settings (e.g., weak signals, severe degree heterogeneity, and a wide range of sparsity). In particular, StGoF satisfies all requirements (R1)-(R3) as desired.

Compared to [14], both papers study SCORE, but the goal of [14] is community detection where K is assumed as known, and the analysis was focused on the null case (m = K). Here, the goal is to estimate K: SCORE is only used as part of our stepwise algorithm, and the analysis of SCORE is focused on the under-fitting case (m < K), where the property of SCORE is largely unknown, and our results on the NSP of SCORE are new.

The proof of the NSP is non-trivial when m < K. It depends on the row-wise distances of the matrix Ξ consisting of the first m columns of [ξ_1, . . . , ξ_K]Γ, where ξ_k is the k-th eigenvector of Ω and Γ is an orthogonal matrix dictated by the Davis-Kahan sin(θ) theorem [4]. Γ is hard to track without a strong eigen-gap assumption, and when it ranges, the row-wise distances of Ξ are the same when m = K but may vary significantly when m < K. This is why the study of SCORE is much harder in the under-fitting case than in the null case. See Section 3.
Figure 1: The flow chart of StGoF.

StGoF is a stepwise algorithm where for m = 1, 2, . . ., we alternately use a community detection step and a Goodness-of-Fit (GoF) step. In principle, we can view StGoF as a general framework, and for both steps, we may use different algorithms. However, for most existing community detection algorithms (e.g., [2, 7, 38]), it is unclear whether they have the desired theoretical properties (especially the NSP), so we may face analytical challenges. For this reason, we choose to use SCORE [14], which we prove to have the NSP. For GoF, existing algorithms (e.g., [10, 22]; see Remark 2 for more discussion) do not apply to the current setting, so we propose a new GoF measure called the Refitted Quadrilateral (RQ).

In detail, fixing a tolerance parameter 0 < α < 1, let z_α be the α upper-quantile of N(0, 1). Input A and initialize m = 1.

• (a). Community detection. If m = 1, let Π̂^(m) be the n-dimensional vector of 1's. If m > 1, apply SCORE to A assuming m is the correct number of communities and obtain an n × m matrix Π̂^(m) for the estimated community labels.

• (b). Goodness-of-Fit. Assuming Π̂^(m) is the matrix of true community labels, we obtain an estimate Ω̂^(m) of Ω by refitting the DCBM, following (1.10)-(1.11) below. Obtain the Refitted Quadrilateral test score ψ_n^(m) as in (1.13)-(1.16).

• (c). Termination. If ψ_n^(m) ≥ z_α, repeat (a)-(b) with m = m + 1. Otherwise, output m as the estimate for K. Denote the final estimate by K̂*_α.

We recommend α = 1% or 5%. See Figure 1 for the flow chart of the algorithm.

We now fill in the details for steps (a)-(b). Consider (a) first. The case of m = 1 is trivial, so we only consider the case of m > 1. Let λ̂_k be the k-th largest (in magnitude) eigenvalue of A, and let ξ̂_k be the corresponding eigenvector. For each m > 1, we apply SCORE as follows. Input: A and m. Output: the estimated n × m matrix of community labels Π̂^(m).

• Obtain the first m eigenvectors ξ̂_1, ξ̂_2, . . . , ξ̂_m of A. Define the n × (m − 1) matrix of entry-wise ratios R̂^(m) by R̂^(m)(i, k) = ξ̂_{k+1}(i)/ξ̂_1(i), 1 ≤ i ≤ n, 1 ≤ k ≤ m − 1.

• Cluster the rows of R̂^(m) by the classical k-means assuming we have m clusters. Output Π̂^(m) = [π̂_1^(m), . . . , π̂_n^(m)]′ (π̂_i^(m)(k) = 1 if node i is clustered to cluster k and 0 otherwise).

Existing study of SCORE has been focused on the null case of m = K. Our interest here is in the under-fitting case (1 < m < K), where the property of SCORE is largely unknown.

Consider (b). The idea is to pretend that the SCORE estimate Π̂^(m) is accurate. We then use it to estimate Ω by re-fitting, and check how well the estimated Ω fits the adjacency matrix A. In detail, let d_i be the degree of node i, 1 ≤ i ≤ n, and let N̂_k^(m) be the set of nodes that SCORE assigns to group k, 1 ≤ k ≤ m. We decompose the vector of 1's, 1_n, as follows:

    1_n = Σ_{k=1}^m 1̂_k^(m),  where 1̂_k^(m)(j) = 1 if j ∈ N̂_k^(m) and 0 otherwise.  (1.9)

For most quantities that have a superscript (m), we may only include the superscript when introducing these quantities for the first time, and omit it later for notational simplicity when there is no confusion. Introduce a vector θ̂^(m) = (θ̂_1^(m), θ̂_2^(m), . . . , θ̂_n^(m))′ ∈ R^n and a matrix P̂^(m) ∈ R^{m,m} where for all 1 ≤ i ≤ n (with k being the cluster such that i ∈ N̂_k^(m)) and 1 ≤ k, ℓ ≤ m,

    θ̂_i^(m) = [d_i/(1̂_k′ A 1_n)] · √(1̂_k′ A 1̂_k),   P̂_kℓ^(m) = (1̂_k′ A 1̂_ℓ)/√((1̂_k′ A 1̂_k)(1̂_ℓ′ A 1̂_ℓ)).  (1.10)

Let Θ̂^(m) = diag(θ̂^(m)). We refit Ω by

    Ω̂^(m) = Θ̂^(m) Π̂^(m) P̂^(m) (Π̂^(m))′ Θ̂^(m).  (1.11)

Recall that Ω = ΘΠPΠ′Θ and P has unit diagonal entries.
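As a concrete illustration, the two sub-steps above (SCORE clustering, and the refit in (1.10)-(1.11)) can be sketched as follows. This is our own minimal rendering, not the authors' code; it assumes m ≥ 2 (the m = 1 case is handled separately in the algorithm) and omits practical refinements such as truncating extreme eigenvector ratios:

```python
import numpy as np
from scipy.sparse.linalg import eigsh
from scipy.cluster.vq import kmeans2

def score(A, m, seed=0):
    """SCORE sketch: cluster the entry-wise eigenvector ratios with k-means."""
    vals, vecs = eigsh(A.astype(float), k=m, which='LM')
    vecs = vecs[:, np.argsort(-np.abs(vals))]      # order by |eigenvalue|
    # ratios R(i, k) = xi_{k+1}(i) / xi_1(i); for a connected network,
    # xi_1 has entries of one sign by Perron's theorem
    R = vecs[:, 1:] / vecs[:, [0]]
    _, labels = kmeans2(R, m, minit='++', seed=seed)
    return labels                                  # values in {0, ..., m-1}

def refit_dcbm(A, labels, m):
    """Refit (theta, P, Omega) from estimated labels, as in (1.10)-(1.11)."""
    A = A.astype(float)
    Pi = np.eye(m)[labels]                         # n x m membership matrix Pi_hat
    d = A.sum(axis=1)                              # degrees d_i = (A 1_n)_i
    block = Pi.T @ A @ Pi                          # block[k, l] = 1_k' A 1_l
    comm_deg = Pi.T @ d                            # 1_k' A 1_n
    diag_block = np.diag(block)                    # 1_k' A 1_k
    theta_hat = d * np.sqrt(diag_block[labels]) / comm_deg[labels]
    P_hat = block / np.sqrt(np.outer(diag_block, diag_block))
    Omega_hat = np.outer(theta_hat, theta_hat) * P_hat[np.ix_(labels, labels)]
    return theta_hat, P_hat, Omega_hat
```

In the ideal case where A = Ω and the labels are the true ones, refit_dcbm returns (θ, P, Ω) exactly, matching the sanity check stated after (1.11).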
In the ideal case where m = K, Π̂^(m) = Π, and A = Ω, we can verify that (Θ̂^(m), P̂^(m), Ω̂^(m)) = (Θ, P, Ω). This suggests that the refitting in (1.11) is reasonable. The Refitted Quadrilateral (RQ) test statistic is then

    Q_n^(m) = Σ_{i1,i2,i3,i4 (dist)} (A_{i1i2} − Ω̂^(m)_{i1i2})(A_{i2i3} − Ω̂^(m)_{i2i3})(A_{i3i4} − Ω̂^(m)_{i3i4})(A_{i4i1} − Ω̂^(m)_{i4i1}),  (1.12)

("dist" means the indices are distinct). Without the refitted matrix Ω̂^(m), Q_n^(m) reduces to

    C_n = Σ_{i1,i2,i3,i4 (dist)} A_{i1i2} A_{i2i3} A_{i3i4} A_{i4i1},  (1.13)

which counts the quadrilaterals of the network; this is why we call Q_n^(m) the refitted quadrilaterals. (As the network is connected, ξ̂_1 is uniquely defined with all positive entries, by Perron's theorem [14].)

We now discuss the mean and variance of Q_n^(m) in the null case of m = K. In this case, first, it turns out that the variance can be well-approximated by 8C_n. Second, while E[Q_n^(K)] = 0 in the ideal case of Ω̂^(K) = Ω, in the real case, Ω̂^(K) ≠ Ω and E[Q_n^(K)] is comparable to the standard deviation of Q_n^(K). Therefore, the mean is not negligible in the null case, and we need a bias correction. Motivated by these, for any m ≥ 1, we introduce two vectors ĝ^(m), ĥ^(m) ∈ R^m where

    ĝ_k^(m) = (1̂_k′ θ̂)/‖θ̂‖_1,   ĥ_k^(m) = (1̂_k′ Θ̂² 1̂_k)/‖θ̂‖²,   1 ≤ k ≤ m.  (1.14)

Write for short V̂^(m) = diag(P̂ĝ) and Ĥ^(m) = diag(ĥ). We estimate the mean of Q_n^(m) by

    B_n^(m) = 2‖θ̂‖⁴ · [ĝ′ V̂^{-1} (P̂ĤP̂ ∘ P̂ĤP̂) V̂^{-1} ĝ],  (1.15)

where for matrices A and B, A ∘ B is their Hadamard product [9]. We show that in the null case, B_n^(m) is a good estimate of E[Q_n^(m)], and in the under-fitting case, it is much smaller than the leading term of Q_n^(m) and so is negligible. Finally, the StGoF statistic is defined by

    ψ_n^(m) = [Q_n^(m) − B_n^(m)]/√(8C_n).  (1.16)

The computational cost of the StGoF algorithm is determined by (i) the number of iterations, (ii) the cost of SCORE, and (iii) the cost of computing ψ_n^(m) in (1.16). For (i), we show in Section 2 that, under mild conditions, StGoF terminates in exactly K steps with high probability. For (ii), the costs are from implementing PCA and k-means [14]. PCA is manageable even for very large networks, and the complexity is O(n·d̄) for each m if we use the power method, where d̄ is the average degree. In practice, k-means is usually implemented with Lloyd's algorithm, which is fast (e.g., only a few seconds when n is a few thousand). In theory, the computational cost of k-means in our setting is polynomial-time, since the dimension of each row of R̂^(m) is (m − 1).

Lemma 1.1.
For each m = 1, 2, . . . , K, the complexity for computing ψ_n^(m) is O(n·d̄), where d̄ is the average degree of the network.

Remark 1. The RQ test has some connections to the SgnQ test in [17], but it is for a different problem and is more sophisticated. The RQ test is for goodness-of-fit. It depends on the matrix Ω̂^(m), refitted for each m using the community detection results of SCORE. The SgnQ test is for global testing, where the goal is to test K = 1 vs. K > 1. The SgnQ test is not stepwise, and does not depend on any results of community detection. In particular, to analyze RQ, we need new technical tools, where the NSP of SCORE plays a key role.
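To complement the definitions, the sums in (1.12)-(1.13) need not be computed by brute force over all O(n⁴) quadruples: for a symmetric matrix with zero diagonal, a standard trace identity evaluates them in matrix-multiplication time. A sketch (our own; the bias correction B_n^(m) of (1.15) is omitted):

```python
import numpy as np

def cycle4_sum(M):
    """Sum of M[i1,i2] M[i2,i3] M[i3,i4] M[i4,i1] over DISTINCT i1, i2, i3, i4,
    for symmetric M with zero diagonal: tr(M^4) minus the i1=i3 / i2=i4 coincidences."""
    M2 = M * M
    r = M2.sum(axis=1)                       # row sums of squared entries
    tr_M4 = np.sum((M @ M) ** 2)             # tr(M^4) = ||M^2||_F^2 for symmetric M
    return tr_M4 - 2.0 * np.sum(r ** 2) + np.sum(M2 * M2)

def rq_and_cn(A, Omega_hat):
    """Q_n^(m) of (1.12) and C_n of (1.13); distinct indices never touch the diagonal."""
    M = (A - Omega_hat).astype(float)
    np.fill_diagonal(M, 0.0)
    A0 = A.astype(float).copy()
    np.fill_diagonal(A0, 0.0)
    return cycle4_sum(M), cycle4_sum(A0)
```

For example, on the 4-cycle graph, cycle4_sum of its adjacency matrix returns 8: the single quadrilateral has 8 ordered traversals (4 starting points × 2 directions).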
Remark 2. Existing GoF algorithms include [10, 22], but they only address much narrower settings (e.g., dense networks with the stochastic block model and strong signals). As mentioned in [10], it remains unclear how to generalize these approaches to the DCBM setting here. In principle, a GoF approach only focuses on the null case, and cannot be used for estimating K without sharp results in the under-fitting case, or the over-fitting case, or both.

Remark 3. In this paper, we are primarily interested in the DCBM, but the idea can be extended to the broader DCMM [38, 15], where mixed memberships are allowed. To this end, we need to replace SCORE by Mixed-SCORE [15] (an adapted version of SCORE for networks with mixed memberships), and modify the refitting step accordingly. The analysis of the resultant procedure is much more challenging, so we leave it for the future.

1.3 Why StGoF works and how it overcomes the challenges
We briefly explain why StGoF achieves the optimal phase transition. Recall that a reasonable measure for the SNR is |λ_K|/√λ_1, where λ_k is the k-th largest (in magnitude) eigenvalue of Ω; see (1.7). In Section 4, we show that if |λ_K|/√λ_1 → ∞, then

    ψ_n^(m) → N(0, 1), if m = K,
    E[ψ_n^(m)] ≍ (Σ_{k=m+1}^K λ_k²)/λ_1 and so ψ_n^(m) → ∞ in probability, if 1 ≤ m < K,  (1.17)

where we note that (Σ_{k=m+1}^K λ_k²)/λ_1 ≥ (λ_K/√λ_1)² when m < K. Combining these with the definition of K̂*_α gives P(K̂*_α ≠ K) ≤ α + o(1). Hence, K̂*_α is consistent if we let α tend to 0 slowly enough. When SNR → 0, Section 2.3 shows that consistent estimation of K is impossible in the minimax sense. Therefore, StGoF achieves the optimal phase transition.

The main technical challenge is how to analyze ψ_n^(m) in the under-fitting case, where we not only need sharp row-wise large deviation bounds for the matrix R̂^(m), but also need to establish the NSP of SCORE. To see why the NSP is important, note that Q_n^(m) depends on Ω̂^(m) (see (1.12)), where Ω̂^(m) is obtained by refitting using the SCORE estimate Π̂^(m), and depends on A in a complicated way. The dependence poses challenges for analyzing Q_n^(m); to overcome them, a conventional approach is to use concentration. However, Π̂^(m) has exp(O(n)) possible realizations, and how to characterize the concentration of Π̂^(m) is known to be a challenging problem (e.g., [36, 29]).

However, if we are able to show that SCORE has the NSP (meaning that for each 1 ≤ m ≤ K, nodes in each (true) community are always clustered together), then Π̂^(m) only has (K choose m) possible realizations, because we only have K true communities. In fact, Π̂^(m) may have even fewer possible realizations if we impose some mild conditions. This means that for each 1 ≤ m ≤ K, Ω̂^(m) only concentrates on a few non-stochastic matrices. Using such a concentration result and the conventional union bound, we can therefore remove the technical hurdle for analyzing ψ_n^(m) in the under-fitting case; see Section 4 for details.

For the over-fitting case of m > K, SCORE produces m clusters but we only have K true communities. In this case, the NSP won't hold, and it is unclear how to derive sharp results for ψ_n^(m). For stepwise testing procedures of a similar kind, such a challenge was noted in [36, 29]. Our approach avoids the analysis of the over-fitting case by a different design in the stepwise testing and sharp results in the null and under-fitting cases.
In theory, a good approximation for the null distribution of ψ_n^(m) is N(0, 1) (see (1.17) and Theorem 2.1, where we show ψ_n^(m) → N(0, 1) in the null case). Such a result requires some model assumptions, which may be violated in real applications (e.g., outliers, artifacts). When this happens, a good approximation for the null distribution of ψ_n^(m) is no longer N(0, 1) (i.e., the theoretical null), but N(u, σ²) (i.e., the empirical null) for some (u, σ) ≠ (0, 1). It is well-known that the empirical null frequently works better for real data than the theoretical null. The problem is then how to estimate the parameters (u, σ) of the empirical null.

This explains why in StGoF we do not use the refitted triangle (RT) statistic T_n = Σ_{i1,i2,i3 (dist)} (A_{i1i2} − Ω̂_{i1i2})(A_{i2i3} − Ω̂_{i2i3})(A_{i3i1} − Ω̂_{i3i1}), which is comparably easier to analyze: the power of RT depends on (Σ_{k=m+1}^K λ_k³)/λ_1^{3/2}, where λ_{m+1}, . . . , λ_K may have different signs and so may cancel with each other.
1) into two clusters with the same size. It is seen that we have exp( O ( n )) possible clustering results.
8e propose a bootstrap approach to estimating ( u, σ ). Recall that ˆ λ k is the k -th largesteigenvalue of A and ˆ ξ k is the corresponding eigenvector. Fixing N > m >
1, letting (cid:99) M ( m ) = (cid:80) mk =1 ˆ λ k ˆ ξ k ˆ ξ (cid:48) k and let (cid:98) S ( m ) = A − (cid:99) M ( m ) . For b = 1 , , . . . , N , we simultaneously permutethe rows and columns of (cid:98) S ( m ) and denote the resultant matrix by (cid:98) S ( m,b ) . Truncating all entriesof ( (cid:99) M ( m ) + (cid:98) S ( m,b ) ) at 1 at the top and 0 at the bottom, and denote the resultant matrix by (cid:98) Ω ( b ) . Generate an adjacency matrix A ( b ) such that for all 1 ≤ i < j ≤ n , A ( b ) ij are independentBernoulli samples with parameters (cid:98) Ω ( b ) ij (we may need to repeat this step until the network isconnected). Apply StGoF to A ( b ) and denote the resultant statistic by Q ( b ) n . We estimate u and σ by the empirical mean and standard deviation of { Q ( b ) n } Nb =1 , respectively. Denote the estimatesby ˆ u ( m ) and ˆ σ ( m ) , respectively. The bootstrap StGoF statistic is then ψ ( m, ∗ ) n = [ Q ( m ) n − ˆ u ( m ) ] / ˆ σ ( m ) , m = 1 , , . . . , (1.18)where Q ( m ) n is the same as in (1.16). Similarly, we estimate K as the smallest integer m suchthat ψ ( m, ∗ ) n ≤ z α , for the same z α in StGoF. We recommend N = 25, as it usually gives stableestimates for ˆ u ( m ) and ˆ σ ( m ) . See Section 5 for details.The original StGoF works well for real data where the DCBM is reasonable, but for datasets where DCBM is significantly violated, bootstrap StGoF may help. For the 6 data setsconsidered in Section 5, two methods perform similarly for all but one data set. This particulardata set is suspected to have many outliers, and bootstrap StGoF performs significantly better.For theoretical analysis, we focus on the original StGoF statistics ψ ( m ) n as in (1.16). Sections 2-3 contain main theoretical results. In Section 2, we show that StGoF is consistentfor K uniformly in a broad class of settings. We also present the information lower bound andshow that StGoF achieves the optimal phase transition. 
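For concreteness, the bootstrap calibration of (1.18) described above can be sketched as follows. This is our own rendering; stat_fn stands for the statistic Q_n^(m) computed on a resampled network, and the connectivity re-check mentioned above is omitted:

```python
import numpy as np

def bootstrap_null(A, m, stat_fn, N=25, rng=None):
    """Estimate the empirical null (u, sigma): permute the residual of a rank-m fit,
    truncate to [0, 1], resample Bernoulli networks, and record stat_fn on each."""
    rng = np.random.default_rng(rng)
    n = A.shape[0]
    vals, vecs = np.linalg.eigh(A.astype(float))
    top = np.argsort(-np.abs(vals))[:m]                  # top m by |eigenvalue|
    M_hat = (vecs[:, top] * vals[top]) @ vecs[:, top].T  # sum_k lambda_k xi_k xi_k'
    S_hat = A - M_hat
    stats = []
    for _ in range(N):
        perm = rng.permutation(n)
        S_b = S_hat[np.ix_(perm, perm)]                  # permute rows & columns together
        Omega_b = np.clip(M_hat + S_b, 0.0, 1.0)         # truncate at 1 (top), 0 (bottom)
        U = rng.random((n, n))
        upper = np.triu(U < Omega_b, k=1)
        A_b = (upper | upper.T).astype(int)              # symmetric Bernoulli resample
        stats.append(stat_fn(A_b, m))
    stats = np.asarray(stats, dtype=float)
    return stats.mean(), stats.std(ddof=1)               # (u_hat, sigma_hat)
```

The returned pair (û, σ̂) is then plugged into (1.18) in place of the theoretical-null values (0, 1).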
In Section 3, we show that SCORE has the Non-Splitting Property (NSP) for 1 ≤ m ≤ K. We also shed light on why SCORE has the NSP and what the technical challenges are. In Section 4, we prove the main results. Section 5 presents numerical results with real and simulated data. The appendix contains the proofs of secondary theorems and lemmas.

In this paper, C > 0 denotes a generic constant that may vary from occurrence to occurrence. Recall that θ = (θ_1, . . . , θ_n)′, θ_max = max{θ_1, . . . , θ_n}, and θ_min = min{θ_1, . . . , θ_n}. For any vector θ = (θ_1, . . . , θ_n)′, both diag(θ) and diag(θ_1, . . . , θ_n) denote the n × n diagonal matrix with θ_i being the i-th diagonal entry, 1 ≤ i ≤ n. For any vector a ∈ R^n, ‖a‖_q denotes the ℓ_q-norm (we write ‖a‖ for short when q = 2). For any matrix P ∈ R^{n,n}, ‖P‖ denotes the matrix spectral norm, and ‖P‖_max denotes the entry-wise maximum norm. For two positive sequences {a_n} and {b_n}, we say a_n ∼ b_n if lim_{n→∞} a_n/b_n = 1, and a_n ≍ b_n if there are constants c_2 > c_1 > 0 such that c_1 a_n ≤ b_n ≤ c_2 a_n for sufficiently large n.

This section contains the first part of our main results, where we discuss the consistency and optimality of StGoF. Section 3 contains the second part of our main results, where we discuss the NSP of SCORE [14]. Consider a DCBM with K communities as in (1.5). We assume

    ‖P‖ ≤ C,  ‖θ‖ → ∞,  and  θ_max·√log(n) → 0.  (2.1)

The first one is a mild regularity condition on the K × K community structure matrix P. The other two are mild conditions on sparsity. See (1.6) for the interesting range of ‖θ‖. We exclude the case where θ_i = O(1) for all 1 ≤ i ≤ n for convenience, but our results continue to hold in this case provided that we make some small changes in our proofs.
Moreover, for $1 \le k \le K$, let $\mathcal{N}_k$ be the set of nodes belonging to community $k$, let $n_k$ be the cardinality of $\mathcal{N}_k$, and let $\theta^{(k)}$ be the $n$-dimensional vector where $\theta^{(k)}_i = \theta_i$ if $i \in \mathcal{N}_k$ and $\theta^{(k)}_i = 0$ otherwise. We assume the $K$ communities are balanced in the sense that
$$\min_{1 \le k \le K}\bigl\{n_k/n,\; \|\theta^{(k)}\|_1/\|\theta\|_1,\; \|\theta^{(k)}\|/\|\theta\|\bigr\} \ge C. \tag{2.2}$$
In the presence of severe degree heterogeneity, the valid SNR for SCORE is
$$s_n = a(\theta)\bigl(|\lambda_K|/\sqrt{\lambda_1}\bigr), \qquad \text{where } a(\theta) = (\theta_{\min}/\theta_{\max}) \cdot \bigl(\|\theta\|/\sqrt{\theta_{\max}\|\theta\|_1}\bigr) \le 1.$$
In the special case of $\theta_{\max} \le C\theta_{\min}$, it is true that $a(\theta) \asymp 1$ and $s_n \asymp |\lambda_K|/\sqrt{\lambda_1}$. In this case, $s_n$ is the SNR introduced in (1.7). We assume
$$s_n \ge C\sqrt{\log(n)}, \qquad \text{for a sufficiently large constant } C > 0. \tag{2.3}$$
In the special case of $\theta_{\max} \le C\theta_{\min}$, (2.3) is equivalent to $|\lambda_K|/\sqrt{\lambda_1} \ge C\sqrt{\log(n)}$, which is mild. See Remark 6 for more discussion. Define a $K \times K$ diagonal matrix $H$ by $H_{kk} = \|\theta^{(k)}\|/\|\theta\|$, $1 \le k \le K$. For the matrix $HPH$ and $1 \le k \le K$, let (largest means largest in magnitude) $\mu_k$ be the $k$-th largest eigenvalue and $\eta_k$ be the corresponding eigenvector. By Perron's theorem [9], if $P$ is irreducible, then the multiplicity of $\mu_1$ is 1, and all entries of $\eta_1$ are strictly positive. Note also that the size of the matrix $P$ is small. It is therefore only a mild condition to assume that for a constant $0 < c_0 < 1$,
$$\min_{2 \le k \le K}|\mu_1 - \mu_k| \ge c_0|\mu_1|, \qquad \text{and} \qquad \frac{\max_{1 \le k \le K}\{\eta_1(k)\}}{\min_{1 \le k \le K}\{\eta_1(k)\}} \le C. \tag{2.4}$$
In fact, (2.4) holds if all entries of $P$ are lower bounded by a positive constant, or if $P \to P_0$ for a fixed irreducible matrix $P_0$. We also note that the most challenging case for network analysis is when the matrix $P$ is close to the matrix of 1's (where it is hard to distinguish one community from another), and (2.4) always holds in such a case.
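For concreteness, the adjusted SNR $s_n$ above can be computed directly from the model parameters $(\theta, \Pi, P)$; below is a minimal sketch (the function name is ours).

```python
import numpy as np

def adjusted_snr(theta, Pi, P):
    """s_n = a(theta) * |lambda_K| / sqrt(lambda_1), where
    a(theta) = (theta_min/theta_max) * ||theta|| / sqrt(theta_max * ||theta||_1)
    and lambda_k is the k-th largest-in-magnitude eigenvalue of
    Omega = Theta Pi P Pi' Theta."""
    theta = np.asarray(theta, float)
    Omega = (theta[:, None] * theta[None, :]) * (Pi @ P @ Pi.T)
    lam = np.linalg.eigvalsh(Omega)
    lam = lam[np.argsort(-np.abs(lam))]   # sort by magnitude, descending
    K = Pi.shape[1]
    a = (theta.min() / theta.max()) * np.linalg.norm(theta) \
        / np.sqrt(theta.max() * np.abs(theta).sum())
    return a * np.abs(lam[K - 1]) / np.sqrt(lam[0])

# Balanced two-community example with constant theta, where a(theta) = 1
Pi = np.zeros((6, 2)); Pi[:3, 0] = 1; Pi[3:, 1] = 1
P = np.array([[1.0, 0.5], [0.5, 1.0]])
s = adjusted_snr(np.full(6, 0.2), Pi, P)   # equals |lambda_2| / sqrt(lambda_1) here
```

In the moderate-heterogeneity case $\theta_{\max} \le C\theta_{\min}$, the factor $a(\theta)$ is bounded away from 0, so this reduces (up to constants) to the plain SNR $|\lambda_K|/\sqrt{\lambda_1}$.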
In this paper, we implicitly assume $K$ is fixed. This is mostly for simplicity, as there is really no technical hurdle for the case of a diverging $K$. See Remark 5 for more discussion.

Consider first the null case, $m = K$. In this case, if we apply SCORE to the rows of $\widehat{R}^{(m)}$ assuming $m$ clusters, then we have perfect community recovery. As a result, StGoF provides a confidence lower bound for $K$. Theorem 2.1.
Fix $0 < \alpha < 1$. Suppose we apply StGoF to a DCBM model where (2.1)-(2.4) hold. As $n \to \infty$, up to a permutation of the columns of $\widehat{\Pi}^{(K)}$,
$$P\bigl(\widehat{\Pi}^{(K)} \ne \Pi\bigr) \le Cn^{-3}, \qquad \psi^{(K)}_n \to N(0,1) \text{ in law}, \qquad \text{and} \qquad P\bigl(\widehat{K}^*_\alpha \le K\bigr) \ge (1-\alpha) + o(1).$$
Theorem 2.1 is proved in Section 4. Theorem 2.1 allows for severe degree heterogeneity. If the degree heterogeneity is moderate, $s_n \asymp |\lambda_K|/\sqrt{\lambda_1}$, and we have the following corollary. Corollary 2.1.
Fix $0 < \alpha < 1$. Suppose we apply StGoF to a DCBM model where (2.1)-(2.2) and (2.4) hold. Suppose $\theta_{\max} \le C\theta_{\min}$ and $|\lambda_K|/\sqrt{\lambda_1} \ge C\sqrt{\log(n)}$ for a sufficiently large constant $C > 0$. As $n \to \infty$, up to a permutation of the columns of $\widehat{\Pi}^{(K)}$,
$$P\bigl(\widehat{\Pi}^{(K)} \ne \Pi\bigr) \le Cn^{-3}, \qquad \psi^{(K)}_n \to N(0,1) \text{ in law}, \qquad \text{and} \qquad P\bigl(\widehat{K}^*_\alpha \le K\bigr) \ge (1-\alpha) + o(1).$$
In other words, $\widehat{K}^*_\alpha$ provides a level-$(1-\alpha)$ confidence lower bound for $K$. If $\alpha$ depends on $n$ and tends to 0 slowly enough, these results continue to hold; in this case, $P(\widehat{K}^*_\alpha \le K) \ge 1 - o(1)$. In cases (e.g., when the SNR is slightly smaller than those above) where perfect community recovery is impossible but the fraction of misclassified nodes is small, the asymptotic normality continues to hold. The same comments apply to Theorem 2.3 and Corollary 2.2.

Consider next the under-fitting case, $m < K$. We focus on the case of $1 < m < K$, as the case of $m = 1$ is trivial. Suppose we apply SCORE to the rows of $\widehat{R}^{(m)}$ assuming $m$ is the correct number of communities, and let $\widehat{\Pi}^{(m)}$ be the matrix of estimated community labels as before. When $1 < m < K$, we cannot hope for perfect recovery, but we can hope that no true community is split across different estimated communities. This motivates the following definition.

Definition 2.1. Fix $K > 1$ and $m \le K$. We say that a realization of the $n \times m$ matrix of estimated labels $\widehat{\Pi}^{(m)}$ satisfies the NSP if for any pair of nodes in the same (true) community, the estimated community labels are the same. When this happens, we write $\Pi \preceq \widehat{\Pi}^{(m)}$, meaning the partition (into clusters) on the left is finer than that on the right.

Theorem 2.2. Consider a DCBM where (2.1)-(2.4) hold. With probability at least $1 - O(n^{-3})$, for each $1 < m \le K$, $\Pi \preceq \widehat{\Pi}^{(m)}$ up to a permutation of the columns.

Theorem 2.2 says that SCORE has the NSP and is proved in Section 3. The theorem is the key to our study of the upper bound below.
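Definition 2.1 is easy to check mechanically: the NSP holds if and only if every true community is contained in a single estimated cluster. A small sketch (the function name is ours; labels as integer arrays):

```python
import numpy as np

def satisfies_nsp(true_labels, est_labels):
    """Check Definition 2.1: the NSP holds iff any two nodes sharing a true
    community also share an estimated label, i.e., the true partition is
    finer than the estimated one."""
    true_labels = np.asarray(true_labels)
    est_labels = np.asarray(est_labels)
    return all(len(set(est_labels[true_labels == k])) == 1
               for k in np.unique(true_labels))

truth = [1, 1, 1, 2, 2, 3, 3]
print(satisfies_nsp(truth, [1, 1, 1, 2, 2, 2, 2]))  # True: merging communities is allowed
print(satisfies_nsp(truth, [1, 2, 1, 2, 2, 3, 3]))  # False: community 1 is split
```

Note that under-fitting ($m < K$) forces some true communities to be merged; the NSP only rules out splitting.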
In Section 3, we explain the main technical challenges we face in proving the theorem, and present the key theorem and lemmas required for the proof. Why SCORE has the NSP is non-obvious, so to shed light on this, we present an intuitive explanation in Section 3. The following theorem is proved in Section 4.

Theorem 2.3. Fix $0 < \alpha < 1$. Suppose we apply StGoF to a DCBM model where (2.1)-(2.4) hold. As $n \to \infty$, $\min_{1 \le m < K}\psi^{(m)}_n \to \infty$ in probability.

Corollary 2.2. Fix $0 < \alpha < 1$. Suppose we apply StGoF to a DCBM model where (2.1)-(2.2) and (2.4) hold. Suppose $\theta_{\max} \le C\theta_{\min}$ and $|\lambda_K|/\sqrt{\lambda_1} \ge C\sqrt{\log(n)}$ for a sufficiently large constant $C > 0$. As $n \to \infty$, $\min_{1 \le m < K}\psi^{(m)}_n \to \infty$ in probability.

Remark 5. In this paper, we assume $K$ is fixed. For a diverging $K$, the main idea of our paper continues to be valid, but we need to revise several things (e.g., the definitions of consistency and SNR, some regularity conditions, the phase transition) to reflect the role of $K$. The proof for the case of a diverging $K$ can be much more tedious, but aside from that, we do not see a major technical hurdle. In particular, the NSP of SCORE continues to hold for a diverging $K$. Then, with some mild conditions, we can show that $\widehat{\Pi}^{(m)}$ has very few realizations, so the analysis of StGoF is readily extendable. That we assume $K$ is fixed is not only for simplicity but also for practical relevance. For example, real networks may have a hierarchical tree structure, and in each layer, the number of leaves (i.e., clusters) is small (e.g., [12, 13, 23, 24]). Therefore, we have a small $K$ in each layer when we perform hierarchical network analysis. Also, the goal of real applications is to have interpretable results. For example, for community detection, results with a large $K$ are hard to interpret, so we may prefer a DCBM with a small $K$ to an SBM with a large $K$. In this sense, a small $K$ is practically more relevant.

Remark 6. Condition (2.3) is the main condition that ensures (a) SCORE yields exact community recovery when $m = K$, and (b) SCORE has the NSP when $1 \le m < K$.
The condition is much weaker than those in existing works (e.g., [36], [29]), and cannot be significantly improved in the case of $\theta_{\max} \le C\theta_{\min}$ (see the phase transition results in Section 2.3). The more difficult case where $\theta_{\max}/\theta_{\min}$ tends to $\infty$ rapidly has never been studied before, at least for estimating $K$, and it is unclear whether we can find an alternative algorithm that satisfies (a)-(b) under a significantly weaker condition than (2.3). On the other hand, we can view StGoF as a general framework for estimating $K$, where SCORE may be improved or replaced by other procedures satisfying (a)-(b) in the future as researchers continue to make advancements in this area, so whether (2.3) can be further improved does not affect our main contributions (see Section 1.1 for our contributions).

In Theorem 2.3 and Corollary 2.2, we require the SNR, $|\lambda_K|/\sqrt{\lambda_1}$, to tend to $\infty$ at a speed of at least $\sqrt{\log(n)}$. Such a condition cannot be significantly relaxed. For example, if SNR $\to 0$, then consistent estimates for $K$ do not exist. The exact meaning of this is described below.

We say two DCBM models are asymptotically indistinguishable if for any test that tries to decide which model is true, the sum of Type I and Type II errors is no smaller than $1 - o(1)$, as $n \to \infty$. Given a DCBM with $K$ communities, our idea is to construct a DCBM with $(K+m)$ communities for any $m \ge 1$, and show that the two DCBMs are asymptotically indistinguishable, provided that the SNR of the latter is $o(1)$.

In detail, fixing $K \ge 1$, consider a DCBM with $K$ communities that satisfies (1.1)-(1.4). Let $(\Theta, \widetilde{\Pi}, \widetilde{P})$ be the parameters of this DCBM, and let $\widetilde{\Omega} = \Theta\widetilde{\Pi}\widetilde{P}\widetilde{\Pi}'\Theta$. When $K > 1$, let $(\beta', 1)'$ be the last column of $\widetilde{P}$, and let $S$ be the sub-matrix of $\widetilde{P}$ excluding the last row and the last column. Given $m \ge 1$ and $b_n \in (0, 1)$, we construct a DCBM with $(K+m)$ communities as follows.
We define a $(K+m) \times (K+m)$ matrix $P$:
$$P = \begin{bmatrix} S & \beta\mathbf{1}_{m+1}' \\ \mathbf{1}_{m+1}\beta' & \frac{m+1}{1+mb_n}M \end{bmatrix}, \qquad \text{where } M = (1-b_n)I_{m+1} + b_n\mathbf{1}_{m+1}\mathbf{1}_{m+1}'. \tag{2.5}$$
When $K = 1$, we simply let $P = \frac{m+1}{1+mb_n}M$. Let $\widetilde{\ell}_i \in \{1, \ldots, K\}$ be the community label of node $i$ defined by $\widetilde{\Pi}$. We generate labels $\ell_i \in \{1, \ldots, K+m\}$ by
$$\ell_i = \begin{cases} \widetilde{\ell}_i, & \text{if } \widetilde{\ell}_i \in \{1, \ldots, K-1\}, \\ \text{uniformly drawn from } \{K, K+1, \ldots, K+m\}, & \text{if } \widetilde{\ell}_i = K. \end{cases} \tag{2.6}$$
Let $\Pi$ be the corresponding matrix of community labels. This gives rise to a DCBM model with $(K+m)$ communities, where $\Omega = \Theta\Pi P\Pi'\Theta$. Note that $P$ does not have unit diagonals, but we can re-parametrize so that it does. In detail, introduce a $(K+m) \times (K+m)$ diagonal matrix $D$ where $D_{kk} = \sqrt{P_{kk}}$, $1 \le k \le K+m$. Now, if we let $P^* = D^{-1}PD^{-1}$, $\theta^*_i = \theta_i\|D\pi_i\|$, and $\Theta^* = \mathrm{diag}(\theta^*_1, \ldots, \theta^*_n)$, then $P^*$ has unit diagonals and $\Omega = \Theta^*\Pi P^*\Pi'\Theta^*$. Here some rows of $\Pi$ are random (so we may call the corresponding model the random-label DCBM), but this is conventional in the study of lower bounds. Let $\lambda_k$ be the $k$-th largest eigenvalue (in magnitude) of $\Omega$. Since $\Omega$ is random, the $\lambda_k$'s are also random (but we can bound $|\lambda_K|/\sqrt{\lambda_1}$ conveniently). The following theorem is proved in the appendix.

Theorem 2.4. Fix $K \ge 1$ and consider a DCBM model with $n$ nodes and $K$ communities, whose parameters $(\theta, \widetilde{\Pi}, \widetilde{P})$ satisfy (2.1)-(2.2). Let $(\beta', 1)'$ be the last column of $\widetilde{P}$, and let $S$ be the sub-matrix of $\widetilde{P}$ excluding the last row and last column. We assume $|\beta'S^{-1}\beta - 1| \ge C$.

• Fix $m \ge 1$. Given any $b_n \in (0, 1)$, we can construct a random-label DCBM model with $K+m$ communities as in (2.5)-(2.6). Then, as $n \to \infty$, $|\lambda_{K+m}|/\sqrt{\lambda_1} \le C\|\theta\|(1-b_n)$ with probability $1 - o(n^{-3})$.
Moreover, if $(1-b_n)/|\lambda_{\min}(S)| = o(1)$, where $\lambda_{\min}(S)$ is the minimum eigenvalue (in magnitude) of $S$, then $|\lambda_{K+m}|/\sqrt{\lambda_1} \ge C^{-1}\|\theta\|(1-b_n)$ with probability $1 - o(n^{-3})$. Here $C > 1$ is a constant that does not depend on $b_n$.

• Fix $m_1, m_2 \ge 1$ with $m_1 \ne m_2$. As $n \to \infty$, if $\|\theta\|(1-b_n) \to 0$, then the two random-label DCBM models associated with $m_1$ and $m_2$ are asymptotically indistinguishable.

By Theorem 2.4, starting from a (fixed-label) DCBM with $K$ communities, we can construct a collection of random-label DCBMs, with $K+1, K+2, \ldots, K+m$ communities, respectively, where (a) for the model with $(K+m)$ communities, $|\lambda_{K+m}|/\sqrt{\lambda_1} \asymp \|\theta\|(1-b_n)$ with overwhelming probability, and (b) each pair of models are asymptotically indistinguishable if $\|\theta\|(1-b_n) = o(1)$. Therefore, for a broad class of DCBMs with unknown $K$ where SNR $= o(1)$ for some models, a consistent estimate for $K$ does not exist.

Fixing $m_0 > 1$ and a sequence $a_n > 0$, let $\mathcal{M}_n(m_0, a_n)$ be the collection of DCBMs for an $n$-node network with $K$ communities, where $1 \le K \le m_0$, $|\lambda_K|/\sqrt{\lambda_1} \ge a_n$, and (2.1)-(2.2) hold. In Section 2.2, we show that if $a_n \ge C\sqrt{\log(n)}$ for a sufficiently large constant $C$, then for each DCBM in $\mathcal{M}_n(m_0, a_n)$, StGoF provides a consistent estimate for $K$. The following theorem says that, if we allow $a_n \to 0$, then $\mathcal{M}_n(m_0, a_n)$ is too broad, and a consistent estimate for $K$ does not exist.

Theorem 2.5. Fix $m_0 > 1$ and let $\mathcal{M}_n(m_0, a_n)$ be the class of DCBMs as above. As $n \to \infty$, if $a_n \to 0$, then $\inf_{\widehat{K}}\bigl\{\sup_{\mathcal{M}_n(m_0, a_n)}P(\widehat{K} \ne K)\bigr\} \ge (1/2) - o(1)$, where the probability is evaluated at any given model in $\mathcal{M}_n(m_0, a_n)$ and the supremum is over all such models.

Combining Theorems 2.1, 2.5, and Corollary 2.2, we have a phase transition result.

• Impossibility.
If $a_n \to 0$, then $\mathcal{M}_n(m_0, a_n)$ defines a class of DCBMs that is too broad, where some pairs of models in the class are asymptotically indistinguishable. Therefore, no estimator can consistently estimate the number of communities for each model in the class. In this case, we say for short that "a consistent estimate for $K$ does not exist."

• Possibility. If $a_n \ge C\sqrt{\log(n)}$ for a sufficiently large $C$, then for every DCBM in $\mathcal{M}_n(m_0, a_n)$, StGoF provides a consistent estimate for the number of communities if the model only has moderate degree heterogeneity (i.e., $\theta_{\max} \le C\theta_{\min}$). StGoF continues to be consistent in the presence of severe degree heterogeneity if the adjusted SNR satisfies $s_n \ge C\sqrt{\log(n)}$ with a sufficiently large $C$.

The case of $C \le a_n < C\sqrt{\log(n)}$ is more delicate. Sharp results are possible if we consider more specific models (e.g., for a scaling parameter $\alpha_n > 0$, the $(\theta_i/\alpha_n)$ are iid from a fixed distribution $F$, and the off-diagonals of $P$ are the same). We leave this to the future.

This section contains the second part of our main theoretical results. We first present the main technical tools for proving Theorem 2.2 (i.e., the NSP of SCORE), and then prove Theorem 2.2. Why the NSP holds is non-obvious, so in Section 3.3, we also shed light on it by providing an intuitive explanation and several examples. The NSP may hold in many other unsupervised learning settings, and the insight gained in Section 3.3 may serve as a good starting point for studying the NSP in these settings.

Here, the primary focus of our study of SCORE is on the under-fitting case of $m < K$, while existing studies of SCORE (e.g., [14]) have focused on the null case of $m = K$. In the last two paragraphs of Section 1.1, we briefly explained why the study in the under-fitting case is much harder.
This section will further explain this in detail. Recall that in the SCORE step, for each $1 < m \le K$, we apply the $k$-means to the rows of an $n \times (m-1)$ matrix $\widehat{R}^{(m)}$, where $\widehat{R}^{(m)}(i,k) = \hat{\xi}_{k+1}(i)/\hat{\xi}_1(i)$, $1 \le i \le n$, $1 \le k \le m-1$, and $\hat{\xi}_k$ is the $k$-th eigenvector of the adjacency matrix $A$ (eigenvectors are arranged in descending order of the magnitudes of the corresponding eigenvalues). Viewing each row of $\widehat{R}^{(m)}$ as a point in $\mathbb{R}^{m-1}$, we will show that there is a polytope in $\mathbb{R}^{m-1}$ with vertices $v_1, v_2, \ldots, v_K$ such that with large probability, row $i$ of $\widehat{R}^{(m)}$ falls close to $v_k$ if node $i$ belongs to the true community $k$, for all $1 \le i \le n$. Therefore, the $n$ rows form $K$ clusters (but $K$ and the true cluster labels are unknown), each being a true community. To show that SCORE satisfies the NSP, the goal is to show that the $k$-means algorithm will not split any of these $K$ clusters. See Figure 2, where we illustrate the NSP with an example with $(K,m) = (4,3)$.

Figure 2: Illustration of what the NSP means ($(K,m) = (4,3)$). The rows of $\widehat{R}^{(m)}$ (blue crosses) form $K$ clusters (red: cluster centers), each of which is a true community ($K$ and the true cluster labels are unknown). SCORE aims to cluster all rows of $\widehat{R}^{(m)}$ into $m$ clusters. Left: Voronoi diagram of $k$-means when the NSP does not hold (which will not happen, according to our proof). Right: Voronoi diagram when the NSP holds.

Definition 3.1 (Bottom-up pruning and minimum pairwise distances). Fixing $K > 1$ and $1 < m \le K$, consider a $K \times (m-1)$ matrix $U = [u_1, u_2, \ldots, u_K]'$. First, let $d_K(U)$ be the minimum pairwise distance of all $K$ rows.
Second, let $u_k$ and $u_\ell$ ($k < \ell$) be the pair that satisfies $\|u_k - u_\ell\| = d_K(U)$ (if this holds for multiple pairs, pick the first pair in the lexicographical order). Remove row $\ell$ from the matrix $U$ and let $d_{K-1}(U)$ be the minimum pairwise distance of the remaining $(K-1)$ rows. Repeat this step and define $d_{K-2}(U), d_{K-3}(U), \ldots, d_2(U)$ recursively. Note that $d_K(U) \le d_{K-1}(U) \le \ldots \le d_2(U)$.

For example, if $(K,m) = (4,3)$ and the rows of $U$ are $(1,0)$, $(1,0)$, $(0,1)$, and $(1,1)$, then $d_4(U) = 0$, $d_3(U) = 1$, and $d_2(U) = \sqrt{2}$. The following theorem is the key to proving the NSP of SCORE, and is proved in the appendix.

Theorem 3.1. Fix $1 < m \le K$ and let $n$ be sufficiently large. Suppose $x_1, x_2, \ldots, x_n \in \mathbb{R}^{m-1}$ take $K$ distinct values $u_1, u_2, \ldots, u_K$. Letting $U = [u_1, u_2, \ldots, u_K]'$ and $F_k = \{1 \le i \le n : x_i = u_k\}$, for $1 \le k \le K$, suppose $\min_{1 \le k \le K}|F_k| \ge \alpha_0 n$ and $\max_{1 \le k \le K}\|u_k\| \le C_0 \cdot d_m(U)$, for constants $0 < \alpha_0 < 1$, $C_0 > 0$. Suppose we apply the $k$-means to a set of $n$ points $\hat{x}_1, \hat{x}_2, \ldots, \hat{x}_n$ assuming $m$ clusters. Let $\hat{S}_1, \hat{S}_2, \ldots, \hat{S}_m$ be the resultant clusters (which are not necessarily unique). There is a number $c_0 = c_0(\alpha_0, C_0, m) > 0$ such that if $\max_{1 \le i \le n}\|\hat{x}_i - x_i\| \le c_0 \cdot d_m(U)$, then $\#\{1 \le j \le m : \hat{S}_j \cap F_k \ne \emptyset\} = 1$, for each $1 \le k \le K$.

When we apply Theorem 3.1 to prove Theorem 2.2, all conditions required in Theorem 3.1 can be deduced from those in Theorem 2.2, so we do not need any additional conditions. See Lemma 3.3 and Section 3.2. Theorem 3.1 is a general result on the $k$-means and may be useful in many other unsupervised settings. The proof is non-trivial for the following reasons.

• The objective function of the $k$-means is complicated, and the $k$-means solution is not necessarily unique. See Example 3.
• Theorem 3.1 only requires that there are at least $m$ true cluster centers whose minimum pairwise distance is large. If we assume a stronger condition, say, that the minimum pairwise distance of all $K$ cluster centers is large (i.e., $\max_{1 \le k \le K}\|u_k\| \le C \cdot d_K(U)$), the proof is much easier; but unfortunately, such a condition does not always hold in our settings. See Example 4 below.

Example 3. Suppose $(K,m) = (4,3)$ and $F_1, F_2, F_3, F_4$ have equal sizes. We view $u_1, u_2, u_3, u_4$ as the vertices of a quadrilateral in $\mathbb{R}^2$. Suppose we apply the $k$-means to $x_1, x_2, \ldots, x_n$ and let $\mathcal{C}_1, \mathcal{C}_2, \mathcal{C}_3$ be the resultant clusters. Suppose that among the 6 different pairs of vertices, $(u_1, u_2)$ is the pair with the smallest distance. In this case, the three clusters are $\mathcal{C}_1 = F_1 \cup F_2$, $\mathcal{C}_2 = F_3$, and $\mathcal{C}_3 = F_4$, and the cluster centers are $(u_1+u_2)/2$, $u_3$, and $u_4$. If the quadrilateral is a square or a rectangle, then among the 6 pairs of indices, more than one pair has the smallest pairwise distance, so the $k$-means solutions are not unique.

Now, to prove Theorem 2.2, the idea is to apply Theorem 3.1 with $\hat{x}_i$ being row $i$ of $\widehat{R}^{(m)}$. To do this, we study the geometrical structure underlying $\widehat{R}^{(m)}$ in the under-fitting case, where the ideal polytope and tight row-wise large-deviation bounds for $\widehat{R}^{(m)}$ play a key role.

Fix $1 \le k \le K$. Let $\lambda_k$ be the $k$-th largest (in magnitude) eigenvalue of the $n \times n$ matrix $\Omega$ and let $\xi_k$ be the corresponding unit-$\ell^2$-norm eigenvector. By the Davis-Kahan $\sin(\theta)$ theorem [4], the two matrices $[\xi_1, \ldots, \xi_K]$ and $[\hat{\xi}_1, \ldots, \hat{\xi}_K]$ only match well with each other up to a rotation matrix $\Gamma$: $[\hat{\xi}_1, \ldots, \hat{\xi}_K] \approx [\xi_1, \ldots, \xi_K]\Gamma$. Let $\Xi$ be the matrix consisting of the first $m$ columns of $[\xi_1, \ldots, \xi_K]\Gamma$. The geometrical structure underlying $\Xi$ is the key to our study. In the null case of $m = K$, the geometric structure was studied in [14, 15].
For the under-fitting case of $1 < m < K$, the study is much harder. The reason is that $\Gamma$ is hard to track without a strong condition on the eigen-gaps of $\Omega$, and as $\Gamma$ ranges, the row-wise distances of $\Xi$ remain the same when $m = K$, but may vary significantly when $m < K$. To deal with this, we need relatively tedious notations and harder proofs, compared to those in [14, 15].

Recall that $\mu_k$ is the $k$-th largest (in magnitude) eigenvalue of the $K \times K$ matrix $HPH$, and $\eta_k$ is the corresponding unit-$\ell^2$-norm eigenvector. We now relate $(\mu_k, \eta_k)$ to $(\lambda_k, \xi_k)$ above. The following lemma is proved in the appendix.

Lemma 3.1. Consider a DCBM where (2.4) holds and let $\lambda_k, \mu_k, \eta_k, \xi_k$ be as above. We have the following claims. First, $\lambda_k = \|\theta\|^2\mu_k$ for $1 \le k \le K$. Second, the multiplicity of $\mu_1$ is 1 and all entries of $\eta_1$ have the same sign, and the same holds for $\lambda_1$ and $\xi_1$. Last, if $\eta_k$ is an eigenvector of $HPH$ corresponding to $\mu_k$, then $\|\theta\|^{-1}\Theta\Pi H^{-1}\eta_k$ is an eigenvector of $\Omega$ corresponding to $\lambda_k$; and conversely, if $\xi_k$ is an eigenvector of $\Omega$ corresponding to $\lambda_k$, then $\|\theta\|^{-1}H^{-1}\Pi'\Theta\xi_k$ is an eigenvector of $HPH$ corresponding to $\mu_k$.

Let $\eta_1$ be the unique unit-$\ell^2$-norm eigenvector of $HPH$ corresponding to $\mu_1$ that has all positive entries. Note that $\eta_2, \ldots, \eta_K$ may not be unique. Fix a particular candidate for $\eta_2, \ldots, \eta_K$, say, $\eta^*_2, \ldots, \eta^*_K$. Let
$$[\xi_1, \xi^*_2, \ldots, \xi^*_K] = \|\theta\|^{-1}\Theta\Pi H^{-1}[\eta_1, \eta^*_2, \ldots, \eta^*_K]. \tag{3.7}$$

Definition 3.2. Given any $(K-1) \times (K-1)$ orthogonal matrix $\Gamma$ and $2 \le k \le K$, let $\eta_k(\Gamma)$ be the $(k-1)$-th column of $[\eta^*_2, \eta^*_3, \ldots, \eta^*_K]\Gamma$, with $\eta_k(i,\Gamma)$ being the $i$-th entry, $1 \le i \le K$, and let $\xi_k(\Gamma)$ be the $(k-1)$-th column of $[\xi^*_2, \xi^*_3, \ldots, \xi^*_K]\Gamma$, with $\xi_k(j,\Gamma)$ being the $j$-th entry, $1 \le j \le n$.

Note that $(\eta_1, \xi_1)$ are uniquely defined (up to a factor of $\pm 1$), but $\{(\eta_k, \xi_k)\}_{2 \le k \le K}$ are not necessarily unique.
However, by Lemma 3.1 and basic linear algebra, there is a collection of $(K-1) \times (K-1)$ orthogonal matrices, denoted by $\mathcal{A}$, such that when $\Gamma$ ranges in $\mathcal{A}$, $\{\eta_2(\Gamma), \ldots, \eta_K(\Gamma)\}$ give all possible candidates of $\{\eta_2, \ldots, \eta_K\}$, and $\{\xi_2(\Gamma), \ldots, \xi_K(\Gamma)\}$ give all possible candidates of $\{\xi_2, \ldots, \xi_K\}$. In the special case where $\mu_2, \ldots, \mu_K$ are distinct, $\mathcal{A}$ is the set of all $(K-1) \times (K-1)$ diagonal orthogonal matrices (i.e., diagonal matrices with $\pm 1$ diagonal entries); in the other extreme case where $\mu_2 = \ldots = \mu_K$, $\mathcal{A}$ is the set of all $(K-1) \times (K-1)$ orthogonal matrices.

Fix $1 < m \le K$ and a $(K-1) \times (K-1)$ orthogonal matrix $\Gamma$ (which is not necessarily in $\mathcal{A}$). We define a $K \times (m-1)$ matrix $V^{(m)}(\Gamma)$ and an $n \times (m-1)$ matrix $R^{(m)}(\Gamma)$ by
$$V^{(m)}(k, \ell; \Gamma) = \eta_{\ell+1}(k; \Gamma)/\eta_1(k), \qquad 1 \le k \le K, \quad 1 \le \ell \le m-1, \tag{3.8}$$
and
$$R^{(m)}(i, \ell; \Gamma) = \xi_{\ell+1}(i; \Gamma)/\xi_1(i), \qquad 1 \le i \le n, \quad 1 \le \ell \le m-1. \tag{3.9}$$
We note that $V^{(m)}(\Gamma)$ is the sub-matrix of $V^{(K)}(\Gamma)$ consisting of the first $(m-1)$ columns; the same comment applies to $R^{(m)}(\Gamma)$. Write $V^{(m)}(\Gamma) = [v^{(m)}_1(\Gamma), \ldots, v^{(m)}_K(\Gamma)]'$ and $R^{(m)}(\Gamma) = [r^{(m)}_1(\Gamma), \ldots, r^{(m)}_n(\Gamma)]'$, so that $(v^{(m)}_k(\Gamma))'$ is the $k$-th row of $V^{(m)}(\Gamma)$ and $(r^{(m)}_i(\Gamma))'$ is the $i$-th row of $R^{(m)}(\Gamma)$, $1 \le k \le K$, $1 \le i \le n$. For notational simplicity, we may drop "$\Gamma$" when there is no confusion. Recall that for $1 \le k \le K$, $\mathcal{N}_k$ denotes the $k$-th true community. The following lemma is proved in the appendix.

Lemma 3.2 (The ideal polytope). Consider a DCBM model where (2.4) holds. For any $1 < m \le K$ and any $(K-1) \times (K-1)$ orthogonal matrix $\Gamma$, $r^{(m)}_i(\Gamma) = v^{(m)}_k(\Gamma)$ for each $i \in \mathcal{N}_k$ and $1 \le k \le K$.

Lemma 3.3. Consider a DCBM model where (2.2) and (2.4) hold. Fix $1 < m \le K$ and a $(K-1) \times (K-1)$ orthogonal matrix $\Gamma$. We have $d_m(V^{(m)}(\Gamma)) \ge \sqrt{2}$ when $m = K$, and $d_m(V^{(m)}(\Gamma)) \ge C$ when $1 < m < K$, where the constant $C > 0$ does not depend on $\Gamma$.

We should not expect that $d_K(V^{(m)}(\Gamma)) \ge C$ holds for all rotations $\Gamma$. We can only show the weaker claim $d_m(V^{(m)}(\Gamma)) \ge C$, as in Lemma 3.3.
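To make the distinction between $d_m(\cdot)$ and $d_K(\cdot)$ concrete, the sketch below implements the bottom-up pruning of Definition 3.1 and sweeps the rotation angle in the symmetric $(K,m) = (3,2)$ setting of Example 4 below; all helper names are ours.

```python
import numpy as np

def pruning_distances(U):
    """d_K(U), d_{K-1}(U), ..., d_2(U) from Definition 3.1: repeatedly find the
    closest pair of rows (ties: lexicographically first pair), record its
    distance, and delete the second row of the pair."""
    rows = [np.asarray(r, float) for r in U]
    out = []
    while len(rows) >= 2:
        d, _, l = min((np.linalg.norm(rows[k] - rows[l]), k, l)
                      for k in range(len(rows)) for l in range(k + 1, len(rows)))
        out.append(d)
        del rows[l]
    return out  # [d_K(U), ..., d_2(U)]

# Sanity check with the example after Definition 3.1:
print(pruning_distances([[1, 0], [1, 0], [0, 1], [1, 1]]))  # d_4=0, d_3=1, d_2=sqrt(2)

# (K, m) = (3, 2), balanced communities, P = (1-b_n) I_3 + b_n 1_3 1_3':
eta1 = np.ones(3) / np.sqrt(3)
eta2 = np.array([1.0, -1.0, 0.0]) / np.sqrt(2)
eta3 = np.array([1.0, 1.0, -2.0]) / np.sqrt(6)

def V2(t):
    # rows of V^(2)(Gamma(t)): first column of [eta2, eta3] Gamma(t), divided by eta1
    return ((np.cos(t) * eta2 - np.sin(t) * eta3) / eta1).reshape(-1, 1)

ts = np.linspace(0, np.pi, 2001)
d3 = np.array([pruning_distances(V2(t))[0] for t in ts])  # d_3(V^(2)(Gamma))
d2 = np.array([pruning_distances(V2(t))[1] for t in ts])  # d_2(V^(2)(Gamma))
print(d3.min(), d3.max(), d2.min())  # d_3 dips to ~0; d_2 stays >= sqrt(3)/sqrt(2)
```

The sweep shows exactly the behavior the lemma allows: $d_3(V^{(2)}(\Gamma))$ collapses to 0 at certain angles, while $d_2(V^{(2)}(\Gamma))$ stays bounded below.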
Below, we use a special example to illustrate how $\Gamma$ affects $d_K(V^{(m)}(\Gamma))$.

Example 4. Consider a special case of Example 2 where $P = (1-b_n)I_K + b_n\mathbf{1}_K\mathbf{1}_K'$, $0 < b_n < 1$, and $\|\theta^{(k)}\| = \|\theta\|/\sqrt{K}$, $1 \le k \le K$ (as a result, $HPH = (1/K)P$). Note that the eigenvectors of $HPH$, denoted by $\eta_1, \eta_2, \ldots, \eta_K$, do not depend on $b_n$. We take the case of $(K,m) = (3,2)$ for example. In this case, $\eta_1 = (1/\sqrt{3})(1,1,1)'$, a candidate for $\{\eta_2, \eta_3\}$ is $\eta^*_2 = (1/\sqrt{2})(1,-1,0)'$ and $\eta^*_3 = (1/\sqrt{6})(1,1,-2)'$, and all possible candidates for $\{\eta_2, \eta_3\}$ are given by
$$[\eta^*_2, \eta^*_3]\Gamma, \qquad \Gamma = \Gamma(\theta) = \begin{bmatrix} \cos(\theta) & \sin(\theta) \\ -\sin(\theta) & \cos(\theta) \end{bmatrix}, \qquad 0 \le \theta < 2\pi.$$
Now, $d_3(V^{(2)}(\Gamma))$ changes continuously in $\theta$ and takes values in $[0, \sqrt{3}/\sqrt{2}]$, where it equals 0 when $\theta \in \{\pi/6, \pi/2, 5\pi/6, 7\pi/6, 3\pi/2, 11\pi/6\}$. However, $d_2(V^{(2)}(\Gamma)) \ge \sqrt{3}/\sqrt{2}$ for all $\theta$.

Similarly, we write $\widehat{R}^{(m)} = [\hat{r}^{(m)}_1, \hat{r}^{(m)}_2, \ldots, \hat{r}^{(m)}_n]'$, so that $(\hat{r}^{(m)}_i)'$ is the $i$-th row of $\widehat{R}^{(m)}$. The following lemma provides a tight row-wise large-deviation bound for $\widehat{R}^{(m)}$ and is proved in the appendix.

Lemma 3.4. Consider a DCBM model where (2.1)-(2.4) hold. With probability $1 - O(n^{-3})$, there exists a $(K-1) \times (K-1)$ orthogonal matrix $\Gamma$ (which may depend on $n$ and $\widehat{R}^{(K)}$) such that as $n \to \infty$,
$$\|\hat{r}^{(m)}_i - r^{(m)}_i(\Gamma)\| \le \|\hat{r}^{(K)}_i - r^{(K)}_i(\Gamma)\| \le Cs_n^{-1}\sqrt{\log(n)}, \qquad \text{for all } 1 < m < K \text{ and } 1 \le i \le n.$$

For illustration, we assume $d_K(V^{(m)}) \ge C$ for all $1 < m \le K$ (we have dropped "$\Gamma$" to simplify notations), so the minimum pairwise distance of the $K$ rows of $V^{(m)}$ is no smaller than $C$. In this case, Lemmas 3.2-3.4 say that the $n$ rows of $R^{(m)}$ have $K$ distinct values, $(v^{(m)}_1)', (v^{(m)}_2)', \ldots, (v^{(m)}_K)'$, and partitioning the rows with respect to different values gives exactly $K$ true communities.
Note that we can view $v^{(m)}_1, v^{(m)}_2, \ldots, v^{(m)}_K$ as the vertices of a polytope in $\mathbb{R}^{m-1}$. See Figure 3 for an illustration with $K = 4$. In this case, $v^{(m)}_1, v^{(m)}_2, \ldots, v^{(m)}_K$ are the vertices of a tetrahedron when $m = 4$, the vertices of a quadrilateral when $m = 3$, and $K$ distinct scalars when $m = 2$. By Lemma 3.4 and condition (2.3), for all $1 \le i \le n$, $\|\hat{r}^{(m)}_i - r^{(m)}_i\|$ is much smaller than $d_K(V^{(m)})$. Therefore, the $n$ rows of $\widehat{R}^{(m)}$ also form $K$ clusters, each being a true community. If we apply the $k$-means assuming $K$ clusters, then we can fully recover the true communities. Unfortunately, $K$ is unknown. In the under-fitting case, $m < K$ and we under-estimate the number of clusters. However, Theorem 3.1 guarantees that, although we are not able to recover all true communities, the NSP holds.

Figure 3: An example ($K = 4$). From left to right: $m = 4, 3, 2$. Red dots: the 4 distinct rows of $R^{(m)}$, $v^{(m)}_1, v^{(m)}_2, v^{(m)}_3, v^{(m)}_4$. Blue crosses: rows of $\widehat{R}^{(m)}$. The red dots are the vertices of a tetrahedron when $m = 4$, vertices of a quadrilateral when $m = 3$, and scalars when $m = 2$. For each $m$, the $n$ rows of $\widehat{R}^{(m)}$ are seen to have $K$ clusters, each of which is a true community.

By Lemma 3.4, there is an event $E$, where $P(E^c) = O(n^{-3})$, and on this event there exists a $(K-1) \times (K-1)$ orthogonal matrix $\Gamma$ (which may depend on $n$ and $\widehat{R}^{(K)}$) such that
$$\max_{1 \le i \le n}\|\hat{r}^{(m)}_i - r^{(m)}_i(\Gamma)\| \le Cs_n^{-1}\sqrt{\log(n)}, \qquad \text{for all } 1 < m \le K.$$
Fix $1 < m \le K$. By Lemma 3.2, $r^{(m)}_i(\Gamma) = v^{(m)}_k(\Gamma)$ for each $i \in \mathcal{N}_k$ and $1 \le k \le K$. Suppose $v^{(m)}_1(\Gamma), \ldots, v^{(m)}_K(\Gamma)$ have $L$ distinct values, where $L$ may depend on $m$ and $\Gamma$, and $L \ge m$ by Lemma 3.3. Note that whenever two of these vectors (say) $v^{(m)}_1(\Gamma)$ and $v^{(m)}_2(\Gamma)$ are identical, we can always treat $\mathcal{N}_1$ and $\mathcal{N}_2$ as the same cluster before we apply Theorem 3.1.
Therefore, without loss of generality, we assume $L = K$, so $v^{(m)}_1(\Gamma), \ldots, v^{(m)}_K(\Gamma)$ are distinct. It suffices to show that, on the event $E$, none of $\mathcal{N}_1, \mathcal{N}_2, \ldots, \mathcal{N}_K$ is split by the $k$-means.

We now apply Theorem 3.1 with $\hat{x}_i = \hat{r}^{(m)}_i$, $x_i = r^{(m)}_i(\Gamma)$, $F_k = \mathcal{N}_k$, and $U = V^{(m)}(\Gamma)$. Note that by Lemma 3.3, $d_m(U) \ge C$. Also, in the proof of Lemma 3.3, we have shown that $\max_{1 \le k \le K}\|v^{(m)}_k(\Gamma)\| \le C$. It follows that the $\ell^2$-norm of each row of $U$ is bounded by $C \cdot d_m(U)$. Additionally, on the event $E$, $\max_{1 \le i \le n}\|\hat{x}_i - x_i\| \le Cs_n^{-1}\sqrt{\log(n)}$. As long as $s_n \ge C\sqrt{\log(n)}$ for a sufficiently large constant $C$, we have $\max_{1 \le i \le n}\|\hat{x}_i - x_i\| \le c_0 \cdot d_m(U)$ for a sufficiently small constant $c_0$. The claim now follows by applying Theorem 3.1.

Why the NSP holds is non-obvious, so we provide an intuitive explanation and some examples. The NSP may hold for many other unsupervised learning settings, and this section may be especially helpful if we wish to extend our ideas to other settings. Since the NSP in general settings is already proved above and the purpose here is to provide some insight, we consider settings where
$$d_K(V^{(m)}(\Gamma)) \ge C. \tag{3.10}$$
This condition is stronger than the condition $d_m(V^{(m)}(\Gamma)) \ge C$ needed in Theorem 3.1 (e.g., see Example 4). Also, for notational simplicity, we drop "$\Gamma$" below.

We start by introducing the minimum gap as a measure for the stability of the clustering results by the $k$-means. Fixing $1 < m \le K$, consider $n$ points $u_1, u_2, \ldots, u_n \in \mathbb{R}^{m-1}$ and let $U = [u_1, u_2, \ldots, u_n]'$. Suppose we cluster $u_1, u_2, \ldots, u_n$ into $m$ clusters using the $k$-means.

Definition 3.3. Let $c_1, c_2, \ldots, c_m$ be any possible cluster centers from the $k$-means (the set is not necessarily unique). Let $d_1(u_i; c_1, \ldots, c_m)$ and $d_2(u_i; c_1, \ldots, c_m)$ be the distance between $u_i$ and its closest cluster center and the distance between $u_i$ and its second closest cluster center, respectively. The minimum gap for the clustering results is defined by
$$g_m(U) = \min_{\{\text{all possible } c_1, c_2, \ldots, c_m\}}\; \min_{1 \le i \le n}\bigl\{d_2(u_i; c_1, \ldots, c_m) - d_1(u_i; c_1, \ldots, c_m)\bigr\}.$$

We now explain why the NSP holds in the under-fitting case. We start by considering the oracle case, where we apply the $k$-means to the $n$ rows of the non-stochastic matrix $R^{(m)}(\Gamma)$.

Theorem 3.2. Consider a DCBM model where (2.2) holds. Fix $1 < m < K$ and any $(K-1) \times (K-1)$ orthogonal matrix $\Gamma$. Let $V^{(m)}(\Gamma)$ and $R^{(m)}(\Gamma)$ be as in (3.8) and (3.9), respectively. If $d_K(V^{(m)}(\Gamma)) > 0$ and we apply the $k$-means to the rows of $R^{(m)}(\Gamma)$, then the NSP holds and $g_m(R^{(m)}(\Gamma)) \ge Cd_K(V^{(m)}(\Gamma))$, where $C$ only depends on the constant in (2.2).

Theorem 3.2 is proved in the appendix. In the oracle case, since $r^{(m)}_i = r^{(m)}_j$ when $i$ and $j$ are in the same community, the NSP must hold once we have $g_m(R^{(m)}) > 0$, and Theorem 3.2 says $g_m(R^{(m)}) \ge Cd_K(V^{(m)})$ holds. Below, we use two examples for further illustration. In these examples, we assume $K = 4$, and let $\mathcal{N}_1, \mathcal{N}_2, \mathcal{N}_3, \mathcal{N}_4$ be the true communities. We assume these communities have equal sizes. We consider the cases of $m = 2$ and $m = 3$ separately.

Example 5a. When $m = 3$, the four points $v^{(m)}_1, \ldots, v^{(m)}_4$ are the vertices of a quadrilateral in $\mathbb{R}^2$. Following Example 3, it is seen that $g_m(R^{(m)}) \ge (1/2)\|v^{(m)}_1 - v^{(m)}_2\| \equiv (1/2)d_K(V^{(m)})$.

Example 5b. When $m = 2$, $v^{(m)}_1, \ldots, v^{(m)}_4$ are scalars. Without loss of generality, we assume $v^{(m)}_1 < v^{(m)}_2 < v^{(m)}_3 < v^{(m)}_4$. In Section B.7, we show that $g_m(R^{(m)}) \ge [(3-\sqrt{5})/2] \cdot d_K(V^{(m)})$.

In the real case, we take an intuitive approach to explain why the NSP holds for the $k$-means (see Theorem 3.1 for a rigorous proof). Recall that $\mathcal{N}_1, \mathcal{N}_2, \ldots, \mathcal{N}_K$ are the true communities.
Suppose we apply the $k$-means to the rows of $\widehat{R}^{(m)}$ and obtain $m$ clusters with centers $\hat{c}_1, \hat{c}_2, \ldots, \hat{c}_m$. Suppose we also apply the $k$-means to the rows of $R^{(m)}$ and obtain $m$ clusters with centers $c_1, c_2, \ldots, c_m$. Under some regularity conditions, we expect to see that
$$\max_{1 \le k \le m}\|\hat{c}_k - c_k\| \le C\max_{1 \le i \le n}\|\hat{r}_i - r_i\|, \qquad \text{up to a permutation of } c_1, c_2, \ldots, c_m. \tag{3.11}$$
By Lemma 3.4, the right hand side is $\le Cs_n^{-1}\sqrt{\log(n)}$ with large probability. In the $k$-means on the rows of $R^{(m)}$, it follows from Theorem 3.2 that every row $i$ with $i \in \mathcal{N}_k$ is clustered into a cluster with center $c_j$, for some $1 \le j \le m$. By Definition 3.3,
$$\|r_i - c_j\| + g_m(R^{(m)}) \le \|r_i - c_\ell\|, \qquad \text{for any } \ell \ne j.$$
Combining this with (3.11), except for a small probability, for all $i \in \mathcal{N}_k$ and $\ell \ne j$,
$$\|\hat{r}_i - \hat{c}_j\| \le \|r_i - c_j\| + \|\hat{r}_i - r_i\| + \|\hat{c}_j - c_j\| \le \|r_i - c_j\| + Cs_n^{-1}\sqrt{\log(n)},$$
$$\|\hat{r}_i - \hat{c}_\ell\| \ge \|r_i - c_\ell\| - \|\hat{r}_i - r_i\| - \|\hat{c}_\ell - c_\ell\| \ge \|r_i - c_\ell\| - Cs_n^{-1}\sqrt{\log(n)}.$$
It follows that
$$\|\hat{r}_i - \hat{c}_j\| \le \|\hat{r}_i - \hat{c}_\ell\| + \bigl[2Cs_n^{-1}\sqrt{\log(n)} - g_m(R^{(m)})\bigr].$$
Therefore, as long as $2Cs_n^{-1}\sqrt{\log(n)} < g_m(R^{(m)})$, $\hat{c}_j$ is the closest cluster center to $\hat{r}_i$, for every $i \in \mathcal{N}_k$. This shows that, except for a small probability, the whole set $\mathcal{N}_k$ is assigned to the cluster with center $\hat{c}_j$; i.e., the NSP holds.

While the above explanation is intuitive and easy to understand, quite strong conditions are needed when we try to solidify each step.
For example, while (3.11) sounds correct intuitively, it may not hold in some cases where the k-means solutions are not unique. Condition (3.10) may not hold in some cases either, due to the rotation aforementioned. To show NSP in the general settings of our paper, we need Theorem 3.1 and Lemmas 3.2-3.4. On the other hand, the intuitive explanation here is easy to understand, and may provide a starting point for proving NSP in other unsupervised learning settings.

Remark 7. A simpler version of Theorem 3.1 was proved in [29], under the stronger conditions that (a) when we apply the k-means to {x_1, x_2, . . . , x_n}, the k-means solution is unique, and (b) d_K(U) ≥ C (with the same notations as in Theorem 3.1). Unfortunately, [29] only proved their claim for the special case of (K, m) = (3, 2) (for general (K, m), the proof is non-trivial due to complex combinatorics). Also, conditions (a)-(b) are hard to check, especially as we need them to hold for U = V^(m)(Γ) with all Γ and all m; see Examples 3-4. For example, as illustrated in Example 4, when Γ ranges continuously, (b) tends to fail for some m. To make sure (b) holds, [29] assumes a relatively strong condition (b1): P converges to a fixed matrix P_0 with distinct eigenvalues. This is a strong signal case where λ_1, λ_2, . . . , λ_K (the eigenvalues of Ω) are of the same magnitude, and the eigen-gaps are also of the same magnitude; see Example 1. In this case, the Γ in the Davis-Kahan sin(θ) theorem is uniquely determined, so (b) holds. However, our primary interest is in the more challenging weak signal case, where typically |λ_K|/λ_1 → 0. In this case, (b1) does not hold, because the only P_0 that can be the limit of P is the K × K matrix of all ones, whose K eigenvalues are not distinct.

In this section, we prove Theorems 2.1 and 2.3. Corollaries 2.1-2.2 follow directly from Theorems 2.1 and 2.3, respectively, so their proofs are omitted. All other theorems and lemmas are proved in the appendix.
4.1 Proof of Theorem 2.1 (the null case of m = K)

First, it is seen that the first item is a direct result of Theorem 2.2. Second, by definitions,

P(K̂*_α ≤ K) ≥ P(ψ_n^(K) ≤ z_α),

and so the last item follows once the second item is proved. Therefore, we only need to show the second item. Recall that when m = K,

ψ_n^(K) = [Q_n^(K) − B_n^(K)] / √C_n,

where Q_n^(K), B_n^(K), and C_n are defined in (1.13), (1.12) and (1.15), respectively, which we reiterate below:

Q_n^(K) = Σ_{i1,i2,i3,i4 (dist)} (A_{i1i2} − Ω̂^(K)_{i1i2})(A_{i2i3} − Ω̂^(K)_{i2i3})(A_{i3i4} − Ω̂^(K)_{i3i4})(A_{i4i1} − Ω̂^(K)_{i4i1}),
C_n = Σ_{i1,i2,i3,i4 (dist)} A_{i1i2} A_{i2i3} A_{i3i4} A_{i4i1},
B_n^(K) = 2‖θ̂‖^4 · [ĝ′ V̂^{−1} (P̂ Ĥ P̂ ∘ P̂ Ĥ P̂) V̂^{−1} ĝ].

In the first equation here, Ω̂^(K) depends on the estimated community label matrix Π̂^(K). To facilitate the analysis, it is desirable to replace Π̂^(K) by the true membership matrix Π. By the first claim of the current theorem, this replacement only has a negligible effect. Formally, we introduce Ω̂^(K,°) as the proxy of Ω̂^(K) with Π̂^(K) in its definition replaced by Π. Moreover, define Q_n^(K,°) as the proxy of Q_n^(K) with Ω̂^(K) replaced by Ω̂^(K,°) in its definition, and define the corresponding counterpart of ψ_n^(K) as

ψ_n^(K,°) = [Q_n^(K,°) − B_n^(K)] / √C_n.

Then, for any fixed number t ∈ R, we have

| P(ψ_n^(K) ≤ t) − P(ψ_n^(K,°) ≤ t) | ≤ P(Π̂^(K) ≠ Π) → 0, as n → ∞,

where the last step follows from the first claim of the current theorem.
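To make the index convention in Q_n^(K) and C_n concrete, both sums can be evaluated by brute force on a toy network. The sketch below is illustrative only: it is O(n^4), the network sizes and probabilities are arbitrary choices, and the true Ω simply stands in for the refitted Ω̂^(K); in practice, Q_n^(m) is computed through the identities of Theorem 1.1 of [17]:

```python
import itertools
import numpy as np

rng = np.random.default_rng(1)

# a tiny two-block network (sizes and probabilities are illustrative)
n = 10
comm = np.array([0] * 5 + [1] * 5)
Omega = np.where(comm[:, None] == comm[None, :], 0.6, 0.2)
U = rng.random((n, n))
A = np.triu((U < Omega).astype(float), 1)
A = A + A.T                                   # symmetric adjacency, zero diagonal
Omega_hat = Omega                             # stand-in for the refitted Omega-hat

Q = C = 0.0
for i1, i2, i3, i4 in itertools.permutations(range(n), 4):   # distinct indices
    edges = [(i1, i2), (i2, i3), (i3, i4), (i4, i1)]
    Q += np.prod([A[a, b] - Omega_hat[a, b] for a, b in edges])
    C += np.prod([A[a, b] for a, b in edges])

print(C)   # 8 times the number of undirected 4-cycles in A
```

Each undirected 4-cycle is visited 8 times (4 starting points, 2 orientations), which is why C is always a multiple of 8 here.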
Hence, by elementary probability, to prove ψ_n^(K) → N(0, 1) in law, it suffices to show ψ_n^(K,°) → N(0, 1) in law. Recall that if we neglect the difference in the main diagonal entries, then A − Ω = W. By definition, we expect that Ω̂^(K,°) ≈ Ω, and so (A − Ω̂^(K,°)) ≈ W. This motivates us to define

Q̃_n = Σ_{i1,i2,i3,i4 (dist)} W_{i1i2} W_{i2i3} W_{i3i4} W_{i4i1}. (4.12)

At the same time, for short, let b_n and c_n be the oracle counterparts of B_n^(K) and C_n:

c_n = Σ_{i1,i2,i3,i4 (dist)} Ω_{i1i2} Ω_{i2i3} Ω_{i3i4} Ω_{i4i1}, b_n = 2‖θ‖^4 · [g′ V^{−1} (PHP ∘ PHP) V^{−1} g]. (4.13)

Here, two vectors g, h ∈ R^K are defined as g_k = (1_k′θ)/‖θ‖_1 and h_k = (1_k′Θ^2 1_k)^{1/2}/‖θ‖, where 1_k is short for 1_k^(K), which is defined as 1_k^(K)(i) = 1 if i ∈ N_k and 0 otherwise. Moreover, V = diag(Pg), and H = diag(h). The following lemmas are proved in the appendix.

Lemma 4.1. Under the conditions of Theorem 2.1, we have E[C_n] = c_n ≍ ‖θ‖^8 and Var(C_n) = o(c_n^2), and so C_n/c_n → 1 in probability, for c_n defined in (4.13).

Lemma 4.2. Under the conditions of Theorem 2.1, Q̃_n/√c_n → N(0, 1) in law.

Lemma 4.3. Under the conditions of Theorem 2.1, E[(Q_n^(K,°) − Q̃_n − b_n)^2] = o(‖θ‖^8).

Lemma 4.4. Under the conditions of Theorem 2.1, we have b_n ≍ ‖θ‖^4 and B_n^(K)/b_n → 1 in probability, for b_n defined in (4.13).

Among these lemmas, the proof of Lemma 4.3 is the most complicated, as it requires computing the bias in Q_n^(m) caused by the refitting step; see Section C.8 in the appendix for details. We now prove Theorem 2.1.
Rewrite ψ_n^(K,°) as

√(c_n/C_n) · [ Q̃_n/√c_n + (Q_n^(K,°) − Q̃_n − b_n)/√c_n + (b_n − B_n^(K))/√c_n ] = √(c_n/C_n) · [(I) + (II) + (III)], (4.14)

where (I) = Q̃_n/√c_n, (II) = (Q_n^(K,°) − Q̃_n − b_n)/√c_n, and (III) = (b_n − B_n^(K))/√c_n. Now, first, by Lemmas 4.1-4.2,

c_n/C_n → 1 in probability, and (I) → N(0, 1) in law. (4.15)

Second, by Lemma 4.3,

E[(II)^2] = c_n^{−1} · E[(Q_n^(K,°) − Q̃_n − b_n)^2] ≤ c_n^{−1} · o(‖θ‖^8), (4.16)

where the right hand side is o(1) since c_n ≍ ‖θ‖^8 by Lemma 4.1. Last, by Lemmas 4.1-4.4, we have b_n ≍ √c_n ≍ ‖θ‖^4 and B_n^(K)/b_n → 1 in probability, and so

(III) = (b_n/√c_n) · (1 − B_n^(K)/b_n) → 0 in probability. (4.17)

Inserting (4.15)-(4.17) into (4.14) gives the claim and concludes the proof of Theorem 2.1.

4.2 Proof of Theorem 2.3 (the under-fitting case of m < K)

In the proof of Theorem 2.1, we started by replacing Π̂^(K) with the true community label matrix Π. However, when m < K, Π̂^(m) does not concentrate on one particular label matrix. Below, we introduce a collection of label matrices, G_m, consisting of all possible realizations of Π̂^(m) when NSP holds. We then study the GoF statistic on the event that Π̂^(m) = Π°, for a fixed Π° ∈ G_m. Recall that Π is the true community label matrix. Fix 1 ≤ m < K. Let G_m be the class of n × m matrices Π°, where each Π° is formed as follows: let {1, 2, . . . , K} = S_1 ∪ S_2 ∪ . . . ∪ S_m be a partition; column ℓ of Π° is the sum of all columns of Π with indices in S_ℓ, 1 ≤ ℓ ≤ m. Let L be the K × m matrix of 0's and 1's where

L(k, ℓ) = 1 if and only if k ∈ S_ℓ, 1 ≤ k ≤ K, 1 ≤ ℓ ≤ m. (4.18)

Therefore, for each Π° ∈ G_m, we can find an L such that Π° = ΠL.
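The collection G_m is easy to enumerate for small K. In the sketch below, `set_partitions` is a hypothetical helper written for this illustration (not from the paper), and the sizes are arbitrary:

```python
import numpy as np

def set_partitions(ks):
    # all set partitions of the list ks (Bell-number many)
    if len(ks) == 1:
        yield [ks]
        return
    first, rest = ks[0], ks[1:]
    for smaller in set_partitions(rest):
        for i in range(len(smaller)):
            yield smaller[:i] + [smaller[i] + [first]] + smaller[i + 1:]
        yield [[first]] + smaller

K, m, n = 3, 2, 12
rng = np.random.default_rng(2)
Pi = np.eye(K)[rng.integers(0, K, size=n)]     # true label matrix Pi, n x K

G_m = []
for blocks in set_partitions(list(range(K))):  # blocks = (S_1, ..., S_m)
    if len(blocks) != m:
        continue
    L = np.zeros((K, m))
    for ell, S in enumerate(blocks):
        L[S, ell] = 1.0                        # L(k, ell) = 1  iff  k in S_ell
    G_m.append(Pi @ L)                         # merged "pseudo-community" labels

print(len(G_m))   # Stirling number S(3, 2) = 3
```

Each element of G_m is again a valid label matrix: every row of ΠL is one-hot because the blocks S_1, . . . , S_m partition the K true labels.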
Note that each Π° is a community label matrix in which each community implied by it (i.e., each "pseudo-community") is formed by merging one or more (true) communities of the original network. Fix a Π° and let N_1^(m,°), N_2^(m,°), · · · , N_m^(m,°) be the m "pseudo-communities" associated with Π°. Recall that θ̂^(m), Θ̂^(m) and P̂^(m) are refitted quantities obtained by using the adjacency matrix A and Π̂^(m); see (1.9)-(1.10). Abusing notation a little, let θ̂^(m,°), Θ̂^(m,°) and P̂^(m,°) be the proxies of θ̂^(m), Θ̂^(m) and P̂^(m), respectively, constructed similarly by (1.9)-(1.10), but with Π̂^(m) replaced by Π°. Introduce

Ω̂^(m,°) = Θ̂^(m,°) Π° P̂^(m,°) (Π°)′ Θ̂^(m,°), (4.19)

Q_n^(m,°) = Σ_{i1,i2,i3,i4 (dist)} (A_{i1i2} − Ω̂^(m,°)_{i1i2})(A_{i2i3} − Ω̂^(m,°)_{i2i3})(A_{i3i4} − Ω̂^(m,°)_{i3i4})(A_{i4i1} − Ω̂^(m,°)_{i4i1}),

and

ψ_n^(m,°) = [Q_n^(m,°) − B_n^(m)] / √C_n. (4.20)

These are the proxies of Ω̂^(m), Q_n^(m), and ψ_n^(m), respectively, where Π̂^(m) is now frozen at a non-stochastic matrix Π°. In the under-fitting case, m < K, and we do not expect Ω̂^(m,°) to be close to Ω. We define a non-stochastic counterpart of Ω̂^(m,°) as follows. Let θ^(m,°), Θ^(m,°) and P^(m,°) be constructed similarly by (1.9)-(1.10), except that (A, Π̂^(m)) and the vector d = (d_1, d_2, . . . , d_n)′ are replaced with (Ω, Π°) and Ω1_n, respectively. Let

Ω^(m,°) = Θ^(m,°) Π° P^(m,°) (Π°)′ Θ^(m,°). (4.21)

The following lemma gives an equivalent expression of Ω^(m,°) and is proved in the appendix.

Lemma 4.5. Fix K > 1 and 1 ≤ m ≤ K. Let Π° = ΠL ∈ G_m and Ω^(m,°) be as above. Write D = Π′ΘΠ ∈ R^{K×K} and D° = (Π°)′ΘΠ ∈ R^{m×K}. Let P° be the K × K matrix given by

P° = diag(PD1_K) · L · diag(D°PD1_K)^{−1} (D°P(D°)′) diag(D°PD1_K)^{−1} · L′ · diag(PD1_K),

where the rank of P° is m.
Then, Ω^(m,°) = ΘΠP°Π′Θ.

This lemma says that Ω^(m,°) has a similar expression to Ω, with P replaced by the rank-m matrix P°. When m = K, G_m has only one element, Π; then (P°, Ω^(m,°)) reduces to (P, Ω). We expect Ω̂^(m,°) to concentrate at Ω^(m,°). This motivates the following proxy of Q_n^(m,°):

Q̃_n^(m,°) = Σ_{i1,i2,i3,i4 (dist)} (A_{i1i2} − Ω^(m,°)_{i1i2})(A_{i2i3} − Ω^(m,°)_{i2i3})(A_{i3i4} − Ω^(m,°)_{i3i4})(A_{i4i1} − Ω^(m,°)_{i4i1}). (4.22)

Introduce

Ω̃^(m,°) = Ω − Ω^(m,°). (4.23)

Recalling that A = (Ω − diag(Ω)) + W, we rewrite Q̃_n^(m,°) as

Q̃_n^(m,°) = Σ_{i1,i2,i3,i4 (dist)} (W_{i1i2} + Ω̃^(m,°)_{i1i2})(W_{i2i3} + Ω̃^(m,°)_{i2i3})(W_{i3i4} + Ω̃^(m,°)_{i3i4})(W_{i4i1} + Ω̃^(m,°)_{i4i1}). (4.24)

Note that when m = K and Π° = Π, the statistic Q̃_n^(m,°) reduces to Q̃_n defined in (4.12). The matrix Ω̃^(m,°) captures the signal strength in Q̃_n^(m,°). From now on, for notational simplicity, we write Ω̃^(m,°) = Ω̃ in the rest of the proof. Let λ̃_k be the k-th largest (in magnitude) eigenvalue of Ω̃, and recall that λ_k is the k-th largest (in magnitude) eigenvalue of Ω. In light of (4.23), we write Ω = Ω^(m,°) + Ω̃ and apply Weyl's theorem for singular values (see equation (7.3.13) of [9]). Note that Ω^(m,°) has rank m and Ω has rank K. By Weyl's theorem, for all 1 ≤ k ≤ K − m,

|λ_{m+k}| ≤ |λ_{m+1}(Ω^(m,°))| + |λ̃_k| = |λ̃_k|.

It follows that

tr(Ω̃^4) ≥ Σ_{k=1}^{K−m} |λ̃_k|^4 ≥ Σ_{k=m+1}^{K} |λ_k|^4.

As we will see in Lemma 4.7 below, tr(Ω̃^4) is the dominating term of E[Q̃_n^(m,°)]. Define

τ^(m,°) = |λ̃_1|/λ_1. (4.25)

For notational simplicity, we write τ^(m,°) = τ, but keep in mind that both Ω̃ and τ actually depend on m and Π° ∈ G_m. The following lemmas are proved in the appendix.

Lemma 4.6.
Under the conditions of Theorem 2.3, for each 1 ≤ m < K, let Ω̃ and τ be defined as in (4.23) and (4.25). The following statements are true:

• There exists a constant C > 0 such that |Ω̃_ij| ≤ Cτθ_iθ_j, for all 1 ≤ i, j ≤ n.
• c_n ≍ ‖θ‖^8, λ_1 ≍ ‖θ‖^2, and τ = O(1).
• tr(Ω̃^4) ≥ Cτ^4‖θ‖^8, and τ‖θ‖ → ∞.

Lemma 4.7. Under the conditions of Theorem 2.3, for 1 ≤ m < K,

E[Q̃_n^(m,°)] = tr(Ω̃^4) + o(‖θ‖^4), Var(Q̃_n^(m,°)) ≤ C(‖θ‖^8 + τ^6 ‖θ‖_3^6 ‖θ‖^8).

Lemma 4.8. Under the conditions of Theorem 2.3, for 1 ≤ m < K,

E[Q_n^(m,°) − Q̃_n^(m,°)] = o(τ^4‖θ‖^8), Var(Q_n^(m,°) − Q̃_n^(m,°)) ≤ o(‖θ‖^8) + Cτ^6 ‖θ‖_3^6 ‖θ‖^8.

Lemma 4.9. Under the conditions of Theorem 2.3, for 1 ≤ m < K, there exists a constant C > 0 such that P(B_n^(m) ≤ C‖θ‖^4) ≥ 1 − o(1).

We now prove Theorem 2.3. Note that by Theorem 2.1, the second item of Theorem 2.3 follows once the first item is proved. Therefore, we only consider the first item, where it is sufficient to show that for all 1 ≤ m < K,

ψ_n^(m) → ∞ in probability.

By the NSP of the solutions produced by SCORE, which is shown in Theorem 2.2, there exists an event A_n with P(A_n^c) = o(1) as n → ∞, such that on the event A_n we have Π̂^(m) ∈ G_m. This further implies that on the event A_n,

ψ_n^(m) ≥ min_{Π° ∈ G_m} ψ_n^(m,°), (4.26)

where ψ_n^(m,°) is defined in (4.20). The left hand side is hard to analyze, but the right hand side is relatively easy to analyze. Further notice that the cardinality of G_m satisfies |G_m| ≤ m^K, which is of constant order as long as K is constant. Therefore, to prove ψ_n^(m) → ∞ in probability, it suffices to show that for any fixed Π° ∈ G_m,

ψ_n^(m,°) → ∞ in probability. (4.27)

We now show (4.27).
Rewrite ψ_n^(m,°) as

√(c_n/C_n) · [ Q_n^(m,°)/√c_n − B_n^(m)/√c_n ] = √(c_n/C_n) · [(I) − (II)], (4.28)

where (I) = Q_n^(m,°)/√c_n, and (II) = B_n^(m)/√c_n. First, by Lemma 4.1 (since C_n and c_n do not depend on m, this lemma applies to both the null case and the under-fitting case),

c_n/C_n → 1 in probability. (4.29)

Second, by Lemma 4.6, c_n ≍ ‖θ‖^8. Combining it with Lemma 4.9 gives that there is a constant C > 0 such that

P((II) ≤ C) ≥ 1 − o(1). (4.30)

Last, by Lemmas 4.6-4.8,

E[(I)] ≥ Cτ^4‖θ‖^4 · [1 + o(1)] → ∞, Var((I)) ≤ C(1 + τ^6 ‖θ‖_3^6).

By Chebyshev's inequality, for any fixed M > 0,

P((I) < M) ≤ (E[(I)] − M)^{−2} · Var((I)) ≤ C · [ (1 + τ^6 ‖θ‖_3^6) / (τ^4‖θ‖^4 [1 + o(1)] − M)^2 ], (4.31)

where, in the denominator, τ^4‖θ‖^4 → ∞ by Lemma 4.6. Note that under our conditions, ‖θ‖_3^3 = o(‖θ‖^2) and ‖θ‖ → ∞. Combining these, the right hand side of (4.31) tends to 0 as n → ∞. Inserting (4.29)-(4.31) into (4.28) proves the claim, and concludes the proof of Theorem 2.3.

∼mejn/netdata/. We now discuss the true K. For the dolphin network, it was argued in [27] that both K = 2 and K = 4 are reasonable. For the UKfaculty network, we symmetrize the network by ignoring the directions of the edges. There are 4 school affiliations for the faculty members, so we take K = 4. For the football network, we take K = 11. The network was manually labelled as 12 groups, but the 12th group only consists of the 5 "independent" teams that do not belong to any conference and do not form a conference themselves.
For the polbooks network, Le and Levina [21] suggest K = 3, but it was argued in [15] that a more appropriate model for this network is a degree-corrected mixed-membership (DCMM) model with two communities, so K = 2 is also appropriate.

We compare StGoF and bootstrap StGoF (StGoF*) with the BIC approach by Wang and Bickel [36], the BH approach by Le and Levina [21], the ECV approach by Li et al. [25], and the NCV approach by Chen and Lei [1]. All of these methods are implemented using the R package "randnet". Note that among these approaches, ECV and NCV are cross-validation (CV) approaches, so their results vary from one repetition to another. Therefore, we run each method 25 times and report the mean and SD. StGoF* uses the bootstrap mean and standard deviation and is also random, but its SDs are 0 for five of the data sets. Most methods require a feasible range of K as a priori input. We take {1, 2, ..., 15} as the range in this section.

Name       n     true K   BIC   BH   ECV                   NCV                   StGoF   StGoF*
Dolphins   62    2, 4     2     2    3.08 (0.91) [2, 5]    2.20 (2.71) [1, 15]   2       3
Football   115   11       10    10   11.28 (0.61) [11, 13] 12.36 (1.15) [11, 15] 10      10
Karate     34    2        2     2    2.60 (1.00) [1, 6]    2.56 (0.58) [2, 4]    2       2
UKfaculty  81    4        4     3    5.56 (1.61) [3, 11]   2.40 (0.28) [2, 3]    4       4
Polblogs   1222  2        6     8    4.88 (1.13) [4, 8]    2 (0.00) [2, 2]       2*      2
Polbooks   105   2, 3     3     4    7.56 (2.66) [2, 15]   2.08 (0.71) [2, 5]    5       2.4 (0.25) [2, 3]

Table 1: Comparison of estimated K. Take ECV for Dolphins for example: over 25 independent repetitions, the estimated K has a mean of 3.08 and an SD of 0.91, ranging from 2 to 5 (the SDs of StGoF* are 0 for the first 5 data sets).

The polblogs network is suspected to have outliers, so most of the methods do not work well. For this particular network, the mean of StGoF is much larger than expected, so we choose to estimate K by the m that minimizes ψ_n^(m) for 1 ≤ m ≤ 15 (for this reason, we put a * next to the 2 in the table). Note that StGoF* correctly estimates K as 2.
The polbooks network is suspected to have a significant fraction of mixing nodes (e.g., [15]), which explains why StGoF overestimates K. Fortunately, for both data sets, StGoF* estimates K correctly, suggesting that the bootstrap means and standard deviations help standardize Q_n^(m).

5.2 Simulations

We now study StGoF with simulated data. We compare StGoF with BIC, ECV, and NCV via small-scale simulations (for StGoF, α = 0.05). Given (n, K), a scalar β_n > 0, a matrix P ∈ R^{K×K}, a distribution f(θ) on (0, ∞), and a distribution g(π) on the standard simplex of R^K, we generate the adjacency matrix A ∈ R^{n,n} as follows:

1. Generate θ̃_1, θ̃_2, ..., θ̃_n iid from f(θ). Let θ_i = β_n · θ̃_i/‖θ̃‖ and Θ = diag(θ_1, ..., θ_n).
2. Generate π_1, π_2, ..., π_n iid from g(π), and let Π = [π_1, π_2, ..., π_n]′.
3. Let Ω = ΘΠPΠ′Θ. For each experiment below, once Ω is generated, we keep it fixed, and use it to generate A according to the DCBM, 100 times independently.

For all algorithms, we measure the performance by the fraction of times the algorithm correctly estimates the true number of communities K (i.e., the accuracy). Note that ‖θ‖ = β_n, and SNR ≍ ‖θ‖(1 − b_n). For the experiments, we let β_n range so as to cover many different sparsity levels, but keep ‖θ‖(1 − b_n) at more or less the same level, so that the problem of estimating K is neither too difficult nor too easy; see the details below. We consider three experiments, and each experiment has several sub-experiments.

Experiment 1. In this experiment, we study how degree heterogeneity affects the results and comparisons. Fixing (n, K) = (600, 4), let P be the 4 × 4 matrix with P(k, ℓ) = 1 − [(1 − b_n)(|k − ℓ| + 1)]/K for 1 ≤ k, ℓ ≤ 4 and k ≠ ℓ. Such a matrix is called a Toeplitz matrix. Let g(π) be the uniform distribution over e_1, e_2, e_3, e_4 (the standard basis vectors of R^4). We consider three sub-experiments, Exp 1a-1c.
In these sub-experiments, we keep (1 − b_n)‖θ‖ fixed at 9, and let β_n range from 10 to 14 so as to cover both the sparser and the denser cases. Moreover, for the three sub-experiments, we take f(θ) to be U(2, 3) (the uniform distribution), Pareto(8, 0.375) (0.375 is the scale parameter), and a two-point mixture 0.5δ_a + 0.5δ_b (δ_a is a point mass at a), respectively. Note that from Exp 1a to Exp 1c, the degree heterogeneity is increasingly more severe on average. The estimation accuracy is presented in Figure 4, where StGoF is seen to consistently outperform the other approaches. Also, from Exp 1a to Exp 1c, the estimation accuracy of all algorithms gets consistently lower, suggesting that as the degree heterogeneity gets more severe, the problem of estimating K gets more challenging.

Experiment 2. In this experiment, we study how the relative sizes of the communities affect the results and comparisons. For b_n > 0, fix (n, K) = (1200, 3), take f(θ) as a Pareto distribution with shape parameter 10, and let P be the 3 × 3 matrix with P(k, ℓ) = 1 − |k − ℓ|(1 − b_n)/2, 1 ≤ k, ℓ ≤ 3. We let β_n range in {12, 13, ..., 17} and keep (1 − b_n)‖θ‖ fixed at 10, so that the SNRs are roughly at the same level. We take g(π) to be the distribution with weights a, b, and (1 − a − b) on the vectors e_1, e_2, e_3 (the standard basis vectors of R^3), respectively. Consider three sub-experiments, Exp 2a-2c, where (a, b) is chosen so that the community sizes become increasingly unbalanced from Exp 2a to Exp 2c; accordingly, estimating K gets increasingly harder. Last, the performances of ECV and NCV are relatively close to that of StGoF when the communities are relatively

Figure 4: Left to right: Experiments 1a, 1b, and 1c, where the degree heterogeneity is increasingly more severe (x-axis: sparsity; y-axis: accuracy). Results are based on 100 repetitions.
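Steps 1-3 of the data-generation scheme above can be sketched as follows; the concrete parameter values are illustrative and do not reproduce any particular sub-experiment:

```python
import numpy as np

rng = np.random.default_rng(4)

def simulate_dcbm(n, K, beta_n, P, draw_theta, draw_labels):
    # steps 1-3 above; the distributions f and g are supplied by the caller
    theta = draw_theta(n)
    theta = beta_n * theta / np.linalg.norm(theta)   # so that ||theta|| = beta_n
    Pi = np.eye(K)[draw_labels(n)]                   # one-hot membership matrix
    Omega = np.outer(theta, theta) * (Pi @ P @ Pi.T)
    U = rng.random((n, n))
    A = np.triu((U < Omega).astype(int), 1)
    return A + A.T                                   # symmetric, no self-edges

# illustrative parameters in the spirit of Experiment 1 (not the exact settings)
K, b_n = 4, 0.3
k = np.arange(K)
P = 1 - (1 - b_n) * (np.abs(k[:, None] - k[None, :]) + 1) / K
np.fill_diagonal(P, 1.0)                             # identifiability: P_kk = 1
A = simulate_dcbm(600, K, 12.0, P,
                  lambda n: rng.uniform(2, 3, n),
                  lambda n: rng.integers(0, K, n))
print(A.shape)
```

Because each θ̃ is rescaled to have norm β_n, the scale of f(θ) is irrelevant and β_n alone controls the sparsity level, which is why the experiments can vary β_n while holding (1 − b_n)‖θ‖ fixed.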
Figure 5: Left to right: Experiments 2a, 2b, and 2c (x-axis: ‖θ‖ (sparsity level); y-axis: estimation accuracy). Results are based on 100 repetitions.

balanced (e.g., Exp 2a), but are comparatively less satisfactory when the models are more unbalanced (e.g., Exp 2b-2c).

Experiment 3. We study how robust these algorithms are in cases of model misspecification. Fix (n, K) = (600, 4), let f(θ) be the uniform distribution U(2, 3), and let P be the 4 × 4 matrix with P(k, ℓ) = 1 − (1 − b_n)(|k − ℓ| + 1)/K for 1 ≤ k, ℓ ≤ 4 and k ≠ ℓ. We consider two sub-experiments, Exp 3a-3b. For sparsity, we let β_n range from 11 to 16 in Exp 3a and from 11 to 18 in Exp 3b. For each β_n, we choose b_n so that (1 − b_n)‖θ‖ is fixed at 10.5. Moreover, in Exp 3a, we allow mixed memberships. We take g(π) to be the mixing distribution which puts probability 0.2 on each of e_1, e_2, e_3, e_4 (the standard basis vectors of R^4), and, with the remaining probability of 0.2, lets π follow the symmetric K-dimensional Dirichlet distribution. Once we have θ_i, π_i, and P, we let Ω_ij = θ_iθ_jπ_i′Pπ_j, 1 ≤ i, j ≤ n, similar to that in the DCBM. In Exp 3b, we allow outliers. First, we let g(π) be the mixing distribution that puts mass 0.25 on each of e_1, e_2, e_3, e_4, and obtain Ω as in the DCBM. We then randomly select 10% of the nodes and re-define Ω_ij as ρ_n if either i or j is selected, where ρ_n = n^{−2} Σ_{1≤i,j≤n} Ω_ij.

Figure 6 presents the estimation accuracy. The two cross-validation methods (ECV and NCV) are not model-based algorithms and are expected to be less affected by model misspecification, so we can use their results as a benchmark to evaluate the performances of StGoF and the likelihood-based approach BIC.
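The outlier mechanism of Exp 3b amounts to overwriting the rows and columns of the selected nodes with the average edge probability ρ_n. A minimal sketch, in which θ, P, and all sizes are illustrative assumptions rather than the experiment's settings:

```python
import numpy as np

rng = np.random.default_rng(5)

# a DCBM-style Omega (illustrative values), then the Exp 3b outlier step
n, K = 600, 4
theta = 12.0 * np.ones(n) / np.sqrt(n)           # homogeneous degrees, ||theta|| = 12
Pi = np.eye(K)[rng.integers(0, K, size=n)]
P = 0.5 * np.ones((K, K))
np.fill_diagonal(P, 1.0)
Omega = np.outer(theta, theta) * (Pi @ P @ Pi.T)

sel = rng.choice(n, size=n // 10, replace=False)  # 10% of the nodes
rho_n = Omega.sum() / n**2                        # rho_n = n^{-2} * sum_ij Omega_ij
Omega_out = Omega.copy()
Omega_out[sel, :] = rho_n                         # rows of selected nodes
Omega_out[:, sel] = rho_n                         # columns of selected nodes
print(Omega_out.shape)
```

The modified Ω is still symmetric, but its rows for the selected nodes carry no community information, which is what makes these nodes behave as outliers for model-based estimators of K.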
Figure 6 shows that StGoF continues to perform well in all settings, suggesting that it is not sensitive to model misspecification. The performance of BIC, compared to that in Experiments 1-2, is less satisfactory, suggesting that this method is more sensitive to model misspecification.

Figure 6: Experiments 3a (left) and 3b (right) (x-axis: ‖θ‖ (sparsity level); y-axis: estimation accuracy). Results are based on 100 repetitions.

References

[1] Kehui Chen and Jing Lei. Network cross-validation for determining the number of communities in network data. J. Amer. Statist. Assoc., 113(521):241–251, 2018.
[2] Yudong Chen, Xiaodong Li, and Jiaming Xu. Convexified modularity maximization for degree-corrected stochastic block models. Ann. Statist., 46(4):1573–1602, 2018.
[3] J.-J. Daudin, Franck Picard, and Stéphane Robin. A mixture model for random graphs. Stat. Comput., 18(2):173–183, 2008.
[4] Chandler Davis and William Morton Kahan. The rotation of eigenvectors by a perturbation. III. SIAM J. Numer. Anal., 7(1):1–46, 1970.
[5] David Donoho and Jiashun Jin. Higher criticism for detecting sparse heterogeneous mixtures. Ann. Statist., 32:962–994, 2004.
[6] Bradley Efron. Large-scale simultaneous hypothesis testing: the choice of a null hypothesis. J. Amer. Statist. Assoc., 99(465):96–104, 2004.
[7] Chao Gao, Zongming Ma, Anderson Y. Zhang, and Harrison H. Zhou. Community detection in degree-corrected block models. Ann. Statist., 46(5):2153–2185, 2018.
[8] Paul W. Holland, Kathryn Blackmond Laskey, and Samuel Leinhardt. Stochastic blockmodels: First steps. Social Networks, 5(2):109–137, 1983.
[9] Roger Horn and Charles Johnson. Matrix Analysis. Cambridge University Press, 1985.
[10] Jianwei Hu, Jingfei Zhang, Hong Qin, Ting Yan, and Ji Zhu. Using maximum entry-wise deviation to test the goodness-of-fit for stochastic block models. J.
Amer. Statist. Assoc., to appear, 2020.
[11] Yuri I. Ingster, Alexandre B. Tsybakov, and Nicolas Verzelen. Detection boundary in sparse regression. Electron. J. Statist., 4:1476–1526, 2010.
[12] Pengsheng Ji and Jiashun Jin. Coauthorship and citation networks for statisticians (with discussion). Ann. Appl. Statist., 10:1779–1812, 2016.
[13] Pengsheng Ji, Jiashun Jin, Zheng Tracy Ke, and Wanshan Li. Statistics about statisticians. Manuscript, 2020.
[14] Jiashun Jin. Fast community detection by SCORE. Ann. Statist., 43(1):57–89, 2015.
[15] Jiashun Jin, Zheng Tracy Ke, and Shengming Luo. Estimating network memberships by simplex vertices hunting. arXiv:1708.07852, 2017.
[16] Jiashun Jin, Zheng Tracy Ke, and Shengming Luo. Network global testing by counting graphlets. In Proceedings of the 35th International Conference on Machine Learning, ICML 2018, Stockholm, Sweden, pages 2338–2346, 2018.
[17] Jiashun Jin, Zheng Tracy Ke, and Shengming Luo. Optimal adaptivity of signed-polygon statistics for network testing. arXiv:1904.09532, 2019.
[18] Jiashun Jin, Zheng Tracy Ke, and Shengming Luo. Improvements on SCORE, especially for weak signals. Manuscript, 2020.
[19] Brian Karrer and Mark Newman. Stochastic blockmodels and community structure in networks. Phys. Rev. E, 83(1):016107, 2011.
[20] Pierre Latouche, Etienne Birmele, and Christophe Ambroise. Variational Bayesian inference and complexity control for stochastic block models. Statistical Modelling, 12(1):93–115, 2012.
[21] Can M. Le and Elizaveta Levina. Estimating the number of communities in networks by spectral methods. arXiv:1507.00827, 2015.
[22] Jing Lei. A goodness-of-fit test for stochastic block models. Ann. Statist., 44(1):401–424, 2016.
[23] Lihua Lei, Xiaodong Li, and Xingmei Lou. Consistency of spectral clustering on hierarchical stochastic block models. arXiv:2004.14531, 2020.
[24] Tianxi Li, Lihua Lei, Sharmodeep Bhattacharyya, Purnamrita Sarkar, Peter J. Bickel, and Elizaveta Levina. Hierarchical community detection by recursive partitioning. arXiv:1810.01509, 2018.
[25] Tianxi Li, Elizaveta Levina, and Ji Zhu. Network cross-validation by edge sampling. Biometrika, 107(2):257–276, 2020.
[26] Fuchen Liu, David Choi, Lu Xie, and Kathryn Roeder. Global spectral clustering in dynamic networks. Proc. Natl. Acad. Sci., 115(5):927–932, January 2017.
[27] Wei Liu, Xingpeng Jiang, Matteo Pellegrini, and Xiaofan Wang. Discovering communities in complex networks by edge label propagation. Scientific Reports, 6:22470, 2016.
[28] Yan Liu, Zhiqiang Hou, Zhigang Yao, Zhidong Bai, Jiang Hu, and Shurong Zheng. Community detection based on the ℓ∞ convergence of eigenvectors in DCBM. arXiv:1906.06713, 2019.
[29] Shujie Ma, Liangjun Su, and Yichong Zhang. Determining the number of communities in degree-corrected stochastic block models. arXiv:1809.01028, 2018.
[30] Zhuang Ma, Zongming Ma, and Hongsong Yuan. Universal latent space model fitting for large networks with edge covariates. J. Mach. Learn. Res., 21(4):1–67, 2020.
[31] Zongming Ma and Yihong Wu. Computational barriers in minimax submatrix detection. Ann. Statist., 43(3):1089–1116, 2015.
[32] Debashis Paul. Asymptotics of sample eigenstructure for a large dimensional spiked covariance model. Stat. Sin., 17(4):1617–1642, 2007.
[33] D. Franco Saldaña, Yi Yu, and Yang Feng. How many communities are there? J. Comput. Graph. Stat., 26(1):171–181, 2017.
[34] Galen Shorack and Jon Wellner. Empirical Processes with Applications to Statistics. John Wiley & Sons, 1986.
[35] Alexandre B. Tsybakov. Introduction to Nonparametric Estimation. Springer Science & Business Media, 2008.
[36] Y. X. Rachel Wang and Peter J. Bickel. Likelihood-based model selection for stochastic block models. Ann. Statist., 45(2):500–528, 2017.
[37] Min Xu, Varun Jog, and Po-Ling Loh. Optimal rates for community estimation in the weighted stochastic block model. Ann. Statist., 48(1):183–204, 2020.
[38] Yuan Zhang, Elizaveta Levina, and Ji Zhu.
Detecting overlapping communities in networks with spectral methods. arXiv:1412.3432, 2014.
[39] Yunpeng Zhao, Elizaveta Levina, and Ji Zhu. Community extraction for social networks. Proc. Natl. Acad. Sci., 108(18):7321–7326, 2011.

Appendix for "Estimating the number of communities by Stepwise Goodness-of-fit"

The appendix contains the proofs of the theorems and lemmas in the main article. Section A proves Lemma 1.1 and Theorems 2.4-2.5. Section B proves Theorems 3.1-3.2 and Lemmas 3.1-3.4. Section C proves Lemmas 4.1-4.9. Section D proves the technical lemmas needed in Sections A-C.

A Proof of results in Sections 1-2

A.1 Proof of Lemma 1.1

The goodness-of-fit test contains the calculation of (a) Ω̂^(m) as the refitted Ω, (b) Q_n^(m) as the main term, (c) B_n^(m) as the bias-correction term, and (d) C_n as the variance estimator.

For (a), it requires the calculation of d_i for 1 ≤ i ≤ n, and of 1̂_k′A1̂_ℓ and 1̂_k′A1_n for 1 ≤ k, ℓ ≤ m with m ≤ K. Since d_i needs O(d_i) operations, it takes O(n·d̄) operations to calculate d_1, . . . , d_n. Similarly, it takes O(1̂_k′A1̂_ℓ) operations to calculate 1̂_k′A1̂_ℓ and O(1̂_k′A1_n) operations to calculate 1̂_k′A1_n, 1 ≤ k, ℓ ≤ m. The total complexity so far is then O(n·d̄). By (1.12), Ω̂^(m)(i, j) = θ̂^(m)(i)·θ̂^(m)(j)·(π̂_i^(m))′P̂^(m)π̂_j^(m), whose calculation takes O(m^2) operations per entry. Hence, the calculation of Ω̂^(m) needs O(m^2·n^2) operations. Combining the above, we conclude that step (a) costs O(m^2·n^2).

For (b), Q_n^(m) can be calculated using the same form as in Theorem 1.1 of [17]. As is shown there, this step requires O(n^2·d̄) operations.

For (c), given Ω̂^(m) and P̂^(m), the calculation of ĝ^(m), V̂^(m) and Ĥ^(m) only takes O(n) operations.
By (1.15), the calculation of B_n^(m) only involves computing ‖θ̂‖ and ĝ′V̂^{−1}(P̂ĤP̂ ∘ P̂ĤP̂)V̂^{−1}ĝ. The first part needs O(n) operations. The second part only involves vectors in R^m and matrices in R^{m×m}. Moreover, since m ≤ K and K is fixed, it takes at most o(n) operations. Combining the above, step (c) costs O(n).

For (d), the calculation follows from Proposition A.1 of [16]. It should be noted that C_n is denoted as Ĉ there, and it requires the calculation of (i) the trace of a matrix, (ii) A^k for the matrix A, and (iii) quadratic forms of A and A^2. For (i), it only takes O(n) operations. For (iii), it takes at most O(n^2). For (ii), we can compute A^k recursively from A^k = A^{k−1}A, so it suffices to consider the complexity of computing BA for an arbitrary n × n matrix B. The (i, j)-th entry of BA is Σ_{ℓ: A_{ℓj} ≠ 0} B_{iℓ}A_{ℓj}, where the total number of nonzero A_{ℓj} equals d_j, the degree of node j. Hence, the complexity of computing the (i, j)-th entry of BA is O(d_j). It follows that the complexity of computing BA is O(n^2·d̄). Combining the above, the goodness-of-fit test needs O(n^2·d̄) operations.

A.2 Proof of Theorem 2.4

First, we show the claims on |λ_K|/√λ_1. Define a diagonal matrix H by H_kk = ‖θ‖^{−1}·(Σ_{i: ℓ_i = k} θ_i^2)^{1/2}, for 1 ≤ k ≤ K + m. Note that H is also stochastic. By Lemma 3.1, the eigenvalues of Ω are equal to the eigenvalues of ‖θ‖^2·HPH, i.e.,

λ_k = ‖θ‖^2 · λ_k(HPH), 1 ≤ k ≤ K + m.

It follows that

|λ_K|/√λ_1 = ‖θ‖ · |λ_K(HPH)| / √λ_1(HPH). (A.32)

Below, we first study the matrix H and then show the claims. Consider the matrix H. Let Ñ_1, Ñ_2, . . . , Ñ_K be the (non-stochastic) communities of the DCBM with K communities.
For each 1 ≤ k ≤ K , let θ ( k ) ∈ R n be such that θ ( k ) i = θ i · { i ∈ (cid:101) N k } . By definition, H kk = (cid:107) θ (cid:107) − (cid:40) (cid:107) θ ( k ) (cid:107) , for 1 ≤ k ≤ K − , (cid:80) i ∈ (cid:101) N K θ i · { (cid:96) i = k } , for K ≤ k ≤ K + m. Since (2.2) is satisfied, (cid:107) θ (cid:107) ≥ (cid:107) θ ( k ) (cid:107) ≥ C (cid:107) θ (cid:107) , for 1 ≤ k ≤ K . It implies that C − ≤ H kk ≤ C, for 1 ≤ k ≤ K − . (A.33)Fix k ≥ K . The n indicators 1 { (cid:96) i = k } are iid Bernoulli variables with a success probability of m +1 . Therefore, E H kk = m +1 (cid:107) θ (cid:107) − (cid:107) θ ( K ) (cid:107) . Furthermore, by Hoeffding’s inequality, P (cid:16)(cid:12)(cid:12) (cid:107) θ (cid:107) ( H kk − E H kk ) (cid:12)(cid:12) > t (cid:17) ≤ (cid:16) − t (cid:80) i ∈ (cid:101) N K θ i (cid:17) . By (2.1), θ max (cid:112) log( n ) → 0. Hence, (cid:80) i ∈ (cid:101) N K θ i ≤ θ (cid:107) θ ( K ) (cid:107) (cid:28) (cid:107) θ (cid:107) / log( n ). Taking t = (cid:107) θ (cid:107) in the above equation yields (cid:12)(cid:12) H kk − E H kk (cid:12)(cid:12) ≤ (cid:107) θ (cid:107) − with probability 1 − o ( n − ). We have seenthat E H kk = m +1 (cid:107) θ (cid:107) − (cid:107) θ ( K ) (cid:107) , which is bounded above and below by constants. Additionally, (cid:107) θ (cid:107) − = o (1). Combining these results gives C − ≤ H kk ≤ C, with probability 1 − o ( n − ), for any k ≥ K . (A.34)It follows from (A.33) and (A.34) that (cid:107) H (cid:107) ≤ C, (cid:107) H − (cid:107) ≤ C, with probability 1 − o ( n − ) . (A.35)Consider the the upper bound for | λ K | / √ λ . It suffices to get an upper bound for | λ K ( HP H ) | and a lower bound for λ ( HP H ). Note that | λ K ( HP H ) | is the smallest singular value of HP H ,which can be different from the absolute value of the smallest eigenvalue. Therefore, we cannotuse Cauchy’s interlacing theorem [9] to relate | λ K ( HP H ) | to the smallest eigenvalue of M . Weneed a slightly longer proof. 
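The Hoeffding step above, i.e., the concentration of the mass Σ θ_i²·1{ℓ_i = k} around its mean when the labels ℓ_i are iid uniform over m + 1 values, is easy to illustrate numerically (our own sketch with hypothetical parameter values; the tolerance 0.1 is an arbitrary loose bound, not a constant from the proof):

```python
import numpy as np

# Split one community's theta mass into m+1 random pieces and check
# that each piece concentrates around its mean, as Hoeffding predicts.
rng = np.random.default_rng(1)
n, m = 20000, 2                           # hypothetical sizes
theta = rng.uniform(0.1, 0.3, size=n)     # bounded degree parameters
s = np.sum(theta ** 2)                    # total squared-theta mass
devs = []
for _ in range(200):
    labels = rng.integers(0, m + 1, size=n)   # iid uniform labels
    for k in range(m + 1):
        piece = np.sum(theta[labels == k] ** 2)
        devs.append(abs(piece - s / (m + 1)))
# Deviations are of order sqrt(sum theta_i^4), much smaller than s.
assert max(devs) < 0.1 * s
```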
Write P = (cid:20) S β (cid:48) m +1 m +1 β (cid:48) m +1 (cid:48) m +1 (cid:21) + (cid:34) ( K − × ( K − ( K − × × ( K − m +11+ mb n M − m +1 (cid:48) m +1 (cid:35) ≡ P ∗ + ∆ . The matrix P ∗ can be re-expressed as ( e K is the K th standard basis of R K ) P ∗ = (cid:20) I K m e (cid:48) K (cid:21) (cid:20) S ββ (cid:48) (cid:21) (cid:2) I K e K (cid:48) m (cid:3) . Therefore, the rank of P ∗ is only K . Then, HP ∗ H is also a rank- K matrix. Consequently, for K = K + m , λ K ( HP ∗ H ) = 0 . By Weyl’s inequality [9], | λ K ( HP H ) − λ K ( HP ∗ H ) | ≤ (cid:107) H ∆ H (cid:107) . Combining these results gives | λ K ( HP H ) | ≤ (cid:107) H ∆ H (cid:107) . (A.36)31ote that (cid:107) ∆ (cid:107) = (cid:107) m +11+ mb n M − m +1 (cid:48) m +1 (cid:107) . M is a matrix whose diagonals are 1 and off-diagonalsare equal to b n . As a result, ∆ is a matrix whose diagonals are equal to m (1 − b n )1+ mb n and off-diagonalsare equal to − (1 − b n )1+ mb n . It follows immediately that (cid:107) ∆ (cid:107) ≤ C (1 − b n ) . We plug it into (A.36) and apply (A.35). It yields that | λ K ( HP H ) | ≤ C (1 − b n ) . (A.37)Furthermore, λ ( P ) ≥ P = 1 and λ ( P ) ≤ (cid:107) H − (cid:107) λ ( HP H ). Combining it with (A.35) gives λ ( HP H ) ≥ C − . (A.38)Note that (A.37)-(A.38) hold with probability 1 − o ( n − ), because their derivation utilizes (A.35).We plug (A.37)-(A.38) into (A.32) to get | λ K | / √ λ ≤ C (cid:107) θ (cid:107) (1 − b n ), with probability 1 − o ( n − ).This proves the upper bound of | λ K | / √ λ .Consider the the lower bound for | λ K | / √ λ . Using (A.35), we have | λ K ( HP H ) | − = (cid:107) ( HP H ) − (cid:107) ≤ (cid:107) H − (cid:107) · (cid:107) P − (cid:107) ≤ C (cid:107) P − (cid:107) . (A.39)We then bound (cid:107) P − (cid:107) . Write P = A + B, where A = (cid:34) S m +11+ mb n M (cid:35) and B = (cid:20) β (cid:48) m +1 m +1 β (cid:48) (cid:21) . 
The matrix B is a rank-2 matrix, which can be re-expressed as B = XD − X (cid:48) , where X = (cid:20) β β m +1 − m +1 (cid:21) and D = (cid:20) − (cid:21) . We use the matrix inversion formula to get (cid:107) P − (cid:107) = (cid:107) ( A + XD − X (cid:48) ) − (cid:107) = (cid:107) A − − A − X ( D + X (cid:48) A − X ) − X (cid:48) A − (cid:107)≤ (cid:107) A − (cid:107) · (cid:0) (cid:107) X ( D + X (cid:48) A − X ) − X (cid:48) A − (cid:107) (cid:1) = (cid:107) A − (cid:107) · (cid:0) (cid:107) ( D + X (cid:48) A − X ) − ( X (cid:48) A − X ) (cid:107) (cid:1) . (A.40)By direct calculations, writing M = mb n m +1 M and = m +1 for short, we have X (cid:48) A − X = (cid:20) β (cid:48) S − β + (cid:48) M − β (cid:48) S − β − (cid:48) M − β (cid:48) S − β − (cid:48) M − β (cid:48) S − β + (cid:48) M − (cid:21) . Note that M = (1 + mb n ) . It implies that M − = mb n . As a result, (cid:48) M − = 1 + mb n m + 1 (cid:48) M − = 1 + mb n m + 1 (cid:48) (cid:16) 11 + mb n (cid:17) = 1 . Plugging it into the expression of X (cid:48) A − X gives X (cid:48) A − X = (cid:20) β (cid:48) S − β + 1 β (cid:48) S − β − β (cid:48) S − β − β (cid:48) S − β + 1 (cid:21) . It follows from direct calculations that( D + X (cid:48) A − X ) − ( X (cid:48) A − X ) = 12 (cid:34) − β (cid:48) S − β +1 β (cid:48) S − β − (cid:35) . (A.41)32nder the condition | β (cid:48) S − β − | ≥ C , the absolute value of β (cid:48) S − β +1 β (cid:48) S − β − is bounded by a constant.Therefore, the spectral norm of the matrix in (A.41) is bounded by a constant. We plug it into(A.40) to get (cid:107) P − (cid:107) ≤ C (cid:107) A − (cid:107) ≤ C max (cid:8) | λ min ( S ) | − , | λ min ( M ) | − (cid:9) . The minimum eigenvalue of M is (1 − b n ). Hence, under the condition of | λ min ( S ) | (cid:29) − b n , weimmediately have (cid:107) P − (cid:107) ≤ C (1 − b n ) − . We plug it into (A.39) to get | λ K ( HP H ) | ≥ C − (1 − b n ) . (A.42)Additionally, (cid:107) ˜ P (cid:107) ≤ C by (2.1). 
It follows from the connection between P and ˜ P in (2.5) that (cid:107) P (cid:107) ≤ C . Combining it with (A.35) gives (cid:107) HP H (cid:107) ≤ C , i.e., λ ( HP H ) ≤ C. (A.43)Here (A.42) and (A.43) are satisfied with probability 1 − o (1), because their derivation uses(A.35). We plug (A.42)-(A.43) into (A.32). It yields that | λ K | / √ λ ≥ C − (cid:107) θ (cid:107) (1 − b n ), withprobability 1 − o (1). This proves the lower bound of | λ K | / √ λ .Next, we show that, if (cid:107) θ (cid:107) (1 − b n ) → 0, the two random-label DCBM models associated with m and m are asymptotically indistinguishable. It is sufficient to show that each random-labelDCBM is asymptotically indistinguishable from the (fixed-label) DCBM with K communities.Fix m ≥ 1. Let f ( A ) and f ( A ) be the respective likelihood of the (fixed-label) DCBM andthe random-label DCBM. Write (cid:101) Ω = Θ (cid:101) Π (cid:101) P (cid:101) Π (cid:48) Θ and Ω = ΘΠ P Π (cid:48) Θ. It is seen that f ( A ) = (cid:89) ≤ i 1. In the paragraph below (A.36), we have seen that m + 11 + mb n M = m +1 (cid:48) m +1 + 1 − b n mb n m − · · · − − m . . . ...... . . . . . . − − · · · − m ≡ m +1 (cid:48) m +1 + G. G satisfies that G m +1 = and (cid:107) G (cid:107) ≤ C (1 − b n ). It follows thatΩ ij = θ i θ j · π (cid:48) i (cid:0) m +1 (cid:48) m +1 + G (cid:1) π j = θ i θ j + θ i θ j ( π (cid:48) i Gπ j )= θ i θ j + θ i θ j (cid:16) m + 1 m +1 + z i (cid:17) (cid:48) G (cid:16) m + 1 m +1 + z j (cid:17) = θ i θ j (1 + z (cid:48) i Gz j ) . (A.45)We plug it into (A.44) to get L ( A ) ≡ f ( A ) f ( A ) = E z (cid:89) i,j ∈ (cid:101) N K − i 1, we only need to show E A ∼ f [ L ( A )] ≤ o (1) . (A.47)We now show (A.47). Write L ( A ) = E z [ g ( A, z )], where g ( A, z ) is the term inside the expec-tation in (A.46). Let { ˜ z i } i ∈ (cid:101) N K be an independent copy of { z i } i ∈ (cid:101) N K . 
Then, E A ∼ f [ L ( A )] = E A ∼ f (cid:110) E z [ g ( A, z )] · E ˜ z [ g ( A, ˜ z )] (cid:111) = E z, ˜ z (cid:110) E A ∼ f [ g ( A, z ) g ( A, ˜ z )] (cid:111) . (A.48)Using the expression of g ( A, z ) in (A.46), we have g ( A, z ) g ( A, ˜ z ) = (cid:89) i,j ∈ (cid:101) N K i 0. It follows that z (cid:48) i Gz j = ( m +1)(1 − b n )1+ mb n ( z (cid:48) i z j ). As a result, Y = ( m + 1) (1 − b n ) (1 + mb n ) (cid:88) i 0. We apply Hoeffding’s inequality to get that, for all t > P (cid:16)(cid:12)(cid:12)(cid:12)(cid:88) i θ si σ i (cid:12)(cid:12)(cid:12) > t (cid:17) ≤ (cid:16) − t (cid:80) i θ si (cid:17) = 2 exp (cid:16) − t (cid:107) θ ∗ (cid:107) s s (cid:17) . (A.53)For any nonnegative variable X , using the formula of integration by part, we can derive that E [exp( aX )] = 1 + a (cid:82) ∞ exp( at ) P ( X > t ) dt . As a result, E σ (cid:2) exp (cid:0) X s (cid:1)(cid:3) ≤ E σ (cid:40) exp (cid:20) a (1 − b n ) (cid:107) θ (cid:107) (cid:107) θ ∗ (cid:107) s s (cid:16)(cid:88) i θ si σ i (cid:17) (cid:21)(cid:41) = 1 + a (cid:107) θ (cid:107) (1 − b n ) (cid:107) θ ∗ (cid:107) s s (cid:90) ∞ exp (cid:18) a (cid:107) θ (cid:107) (1 − b n ) (cid:107) θ ∗ (cid:107) s s t (cid:19) · P (cid:26)(cid:16)(cid:88) i θ mi σ i (cid:17) > t (cid:27) dt ≤ a (cid:107) θ (cid:107) (1 − b n ) (cid:107) θ ∗ (cid:107) s s (cid:90) ∞ exp (cid:18) a (cid:107) θ (cid:107) (1 − b n ) (cid:107) θ ∗ (cid:107) s s t (cid:19) · exp (cid:18) − t (cid:107) θ ∗ (cid:107) s s (cid:19) dt = 1 + a (cid:107) θ (cid:107) (1 − b n ) (cid:107) θ ∗ (cid:107) s s (cid:90) ∞ exp (cid:18) − − a (cid:107) θ (cid:107) (1 − b n ) (cid:107) θ ∗ (cid:107) s s t (cid:19) dt = 1 + a (cid:107) θ (cid:107) (1 − b n ) (cid:107) θ ∗ (cid:107) s s · (cid:107) θ ∗ (cid:107) s s − a (cid:107) θ (cid:107) (1 − b n ) = 1 + 2 a (cid:107) θ (cid:107) (1 − b n ) − a (cid:107) θ (cid:107) (1 − b n ) . 
The right hand side does not depend on s , so the same bound holds for max s ≥ { E σ [exp( X s )] } .When (cid:107) θ (cid:107) (1 − b ) → 0, this upper bound is 1 + o (1). Plugging it into (A.51) gives (A.50). Then,the second claim follows. A.3 Proof of Theorem 2.5 We show a slightly stronger argument. Given 1 ≤ K < K ≤ m , let M n ( K , K , a n ) be thesub-collection of M n ( m , a n ) corresponding to K ≤ K ≤ K . Note thatinf ˆ K (cid:8) sup M n ( m ,a n ) P ( ˆ K (cid:54) = K ) (cid:9) ≥ inf ˆ K (cid:8) sup M n ( K ,K ,a n ) P ( ˆ K (cid:54) = K ) (cid:9) . It suffices to lower bound the right hand side. 36ix an arbitrary DCBM model with ( K − 1) communities. For each 1 ≤ m ≤ K − K + 1,we use (2.5)-(2.6) to construct a random-label DCBM with ( K − m ) communities, where b n = 1 − c (cid:107) θ (cid:107) − a n , for a constant c to be decided. Let P k denote the probability measureassociated with the k -community random-label DCBM, for K ≤ k ≤ K . By Theorem 2.4,we can choose an appropriately small constant c such that | λ K | / √ λ ≥ a n with probability1 − o ( n − ), under each P k . Additionally, using a proof similar to that of (A.34), we can showthat (2.1)-(2.2) are satisfied with probability 1 − o ( n − ). Therefore, under each P k , the realizationof (Θ , Π , P ) belongs to M n ( K , K , a n ) with probability 1 − o ( n − ). Then, for any ˆ K ,sup M n ( K ,K ,a n ) P ( ˆ K (cid:54) = K ) ≥ max K ≤ k ≤ K P k ( ˆ K (cid:54) = K ) + o ( n − ) . (A.54)To bound the right hand side of (A.54), consider a multi-hypothesis testing problem: Given anadjacency matrix A , choose one out of the models { P k } K ≤ k ≤ K . For any test ψ , define¯ p ( ψ ) = 1 K − K + 1 K (cid:88) k = K P k ( ψ (cid:54) = k ) . We apply [35, Proposition 2.4]. It yields that1 K − K K (cid:88) k = K +1 χ ( P k , P K ) ≤ α ∗ = ⇒ inf ψ ¯ p ( ψ ) ≥ sup <τ< (cid:26) τ ( K − K )1 + τ ( K − K ) [1 − τ ( α ∗ + 1)] (cid:27) . We have shown in Theorem 2.4 that α ∗ = o (1). 
By letting τ = 1 / ψ ¯ p ( ψ ) (cid:38) K − K K − K ) (cid:16) − o (1)2 (cid:17) ≥ / o (1) . (A.55)Now, given any estimator ˆ K , it defines a test ψ ˆ K , where ψ ˆ K = ˆ K if K ≤ ˆ K ≤ K and ψ ˆ K = K otherwise. It is easy to see that ¯ p ( ψ ˆ K ) ≤ max K ≤ k ≤ K P k ( ˆ K (cid:54) = k ) . (A.56)Combining (A.55)-(A.56) gives that max K ≤ k ≤ K P k ( ˆ K (cid:54) = k ) ≥ / o (1). We plug it into(A.54) to get the claim. B Proof of results in Section 3 B.1 Proof of Lemma 3.1 By definition of H , we have Π ΘΠ = (cid:107) θ (cid:107) · H . As a result, the matrix U = (cid:107) θ (cid:107) − ΘΠ H − satisfies that U (cid:48) U = I K . We now writeΩ = ΘΠ P Π (cid:48) Θ = (cid:107) θ (cid:107) · U · ( HP H ) · U (cid:48) , where U (cid:48) U = I K . Since U contains orthonormal columns, the nonzero eigenvalues of Ω are the nonzero eigenvaluesof (cid:107) θ (cid:107) ( HP H ). This proves that λ k = (cid:107) θ (cid:107) µ k . Furthermore, there is a one-to-one correspon-dence between the eigenvectors of Ω and the eigenvectors of HP H through[ ξ , ξ , . . . , ξ k ] = U [ η , η , . . . , η K ] . It follows that ξ k = U η k = (cid:107) θ (cid:107) − ΘΠ H − η k . This proves the claim about ξ k . We can multiplyboth sides of the equation ξ k = U η k by (cid:107) θ (cid:107) − H − Π (cid:48) Θ from the left. It yields that (cid:107) θ (cid:107) − H − Π (cid:48) Θ ξ k = ( (cid:107) θ (cid:107) − H − Π (cid:48) Θ)( (cid:107) θ (cid:107) − ΘΠ H − η k )37 (cid:107) θ (cid:107) − H − (Π (cid:48) Θ Π) H − η k = η k . This proves the claim about η k . Last, the condition (2.4) ensures that the multiplicity of µ is1 and that µ is a strictly positive vector. It follows that λ has a multiplicity of 1. Note that ξ k = U η k implies ξ ( i ) = (cid:107) θ (cid:107) − θ i K (cid:88) k =1 H − kk π i ( k ) η ( k ) ≥ (cid:107) θ (cid:107) − θ i min ≤ k ≤ K (cid:8) H − kk η ( k ) (cid:9) . 
Since η₁ is a positive vector and H is a positive diagonal matrix, we conclude that all entries of ξ₁ are positive.

B.2 Proof of Lemma 3.2

We fix an arbitrary (K−1) × (K−1) orthogonal matrix Γ and drop "Γ" in the notations η_k, ξ_k, r_i, v_k. By Definition 3.2,

[η₁, η₂, ..., η_K] = [η₁, η₂*, ..., η_K*]·diag(1, Γ), [ξ₁, ξ₂, ..., ξ_K] = [ξ₁, ξ₂*, ..., ξ_K*]·diag(1, Γ).

Here, (η₁, η₂*, ..., η_K*) is a particular candidate for the eigenvectors of HPH, and (ξ₁, ξ₂*, ..., ξ_K*) is linked to (η₁, η₂*, ..., η_K*) through

[ξ₁, ξ₂*, ..., ξ_K*] = ‖θ‖⁻¹·ΘΠH⁻¹·[η₁, η₂*, ..., η_K*].

It follows immediately that

[ξ₁, ξ₂, ..., ξ_K] = ‖θ‖⁻¹·ΘΠH⁻¹·[η₁, η₂, ..., η_K]. (B.57)

As a result, for any true community N_k, ξ_ℓ(i) = [θ_i/(‖θ‖H_kk)]·η_ℓ(k), for all i ∈ N_k. We plug it into the definition of R^(m) to get that, for each i ∈ N_k and 1 ≤ ℓ ≤ m−1,

R^(m)(i, ℓ) = ξ_{ℓ+1}(i)/ξ₁(i) = ([θ_i/(‖θ‖H_kk)]·η_{ℓ+1}(k))/([θ_i/(‖θ‖H_kk)]·η₁(k)) = η_{ℓ+1}(k)/η₁(k) = V^(m)(k, ℓ).

It follows that r_i^(m) = v_k^(m) for each i ∈ N_k.

B.3 Proof of Lemma 3.3

The matrix V^(K)(Γ) was studied in [14, 15]. Since the pairwise distances between rows of V^(K)(Γ) are invariant to Γ, the quantity d_K(V^(K)(Γ)) does not change with Γ either. The claimed lower bound on d_K(V^(K)(Γ)) then follows immediately from Lemma B.3 of [14]. Below, we fix 1 < m < K and a (K−1) × (K−1) orthogonal matrix Γ, and study d_m(V^(m)(Γ)). For notational simplicity, we drop "Γ" when there is no confusion.

We apply a bottom-up pruning procedure (same as in Definition 3.1) to V^(m).
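The pruning procedure described next (replace one of the two closest distinct rows by the other, K − 1 times, recording the merge distances d_K, d_{K−1}, ..., d₂) can be sketched as follows (our own illustrative Python, not the authors' code):

```python
import numpy as np

def prune_sequence(V):
    """Repeatedly find the closest pair of distinct rows of V (ties:
    first pair in lexicographic order) and overwrite the later row by
    the earlier one.  Returns the merge distances [d_K, ..., d_2]."""
    V = np.array(V, dtype=float)
    K = V.shape[0]
    dists = []
    for _ in range(K - 1):
        best = None
        for k in range(K):
            for l in range(k + 1, K):
                d = np.linalg.norm(V[k] - V[l])
                if d > 0 and (best is None or d < best[0]):
                    best = (d, k, l)
        d, k, l = best
        dists.append(d)
        V[l] = V[k]          # the matrix now has one fewer distinct row
    return dists

# Three distinct rows: the merge distances are nondecreasing, since
# each merge removes a point and can only raise the minimum distance.
dists = prune_sequence([[0.0, 0.0], [0.0, 3.0], [4.0, 0.0]])
assert dists == [3.0, 4.0]
assert dists == sorted(dists)
```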
First, we find two rows v_k^(m) and v_ℓ^(m) that attain the minimum pairwise distance (if there is a tie, pick the first pair in the lexicographical order) and change the ℓ-th row to v_k^(m) (supposing k < ℓ). Denote the resulting matrix by V^(m,K−1). Next, we consider the rows of V^(m,K−1), similarly find two rows attaining the minimum pairwise distance, and replace one row by the other. Denote the resulting matrix by V^(m,K−2). We repeat these steps to get a sequence of matrices

V^(m,K), V^(m,K−1), V^(m,K−2), ..., V^(m,2), V^(m,1),

where V^(m,K) = V^(m) and, for each 1 ≤ k ≤ K, V^(m,k) has at most k distinct rows. Comparing this with the definition of d_k(V^(m)) (see Definition 3.1), we find that V^(m,k−1) differs from V^(m,k) in only one row, and the difference on this row is a vector whose Euclidean norm is exactly d_k(V^(m)). As a result,

‖V^(m,k) − V^(m,k−1)‖ = d_k(V^(m)), 2 ≤ k ≤ K. (B.58)

By the triangle inequality and the fact that d_k(V^(m)) ≤ d_{k−1}(V^(m)), we have

‖V^(m,K) − V^(m,m−1)‖ ≤ Σ_{k=m}^K d_k(V^(m)) ≤ (K − m + 1)·d_m(V^(m)).

To show the claim, it suffices to show that

‖V^(m,K) − V^(m,m−1)‖ ≥ C. (B.59)

We now show (B.59). Introduce two matrices

V*^(m,K) = [1_K, V^(m,K)], V*^(m,m−1) = [1_K, V^(m,m−1)],

where 1_K is the K-dimensional vector of 1s. Adding the vector 1_K as the first column changes neither the number of distinct rows nor the pairwise distances among rows. Additionally,

‖V^(m,K) − V^(m,m−1)‖ = ‖V*^(m,K) − V*^(m,m−1)‖. (B.60)

Let σ_m(U) denote the m-th singular value of a matrix U. Since V*^(m,m−1) has at most (m−1) distinct rows,

σ_m(V*^(m,m−1)) = 0. (B.61)

We then study σ_m(V*^(m,K)). Note that

V*^(m,K) = [1_K, V^(m)] = [the K × m matrix with rows (1, v_k^(m)), 1 ≤ k ≤ K] = [diag(η₁)]⁻¹·[η₁, η₂(Γ), ..., η_m(Γ)], (B.62)

where η₁, η₂(Γ), . . .
, η K (Γ) is one choice of eigenvectors of HP H indexed by Γ (see Definition 3.2)and diag( η ) is the diagonal matrix whose diagonal entries are from η . Write for short Q =[ η , η (Γ) , . . . , η m (Γ)]. We have( V ( m,K ) ∗ ) (cid:48) V ( m,K ) ∗ = Q (cid:48) [diag( η )] − Q. By the last item of (2.4) and that (cid:107) η (cid:107) = 1, we conclude that η ( k ) (cid:16) / √ K for all 1 ≤ k ≤ K .In particular, there exists a constant c > (cid:0) [diag( η )] − − cI K (cid:1) is a positive semi-definite matrix. It follows that (cid:0) Q (cid:48) [diag( η )] − Q − cQ (cid:48) Q (cid:1) is a positive semi-definite matrix.Therefore, λ m (cid:0) ( V ( m,K ) ∗ ) (cid:48) V ( m,K ) ∗ (cid:1) ≥ λ m ( cQ (cid:48) Q ) = c · λ m ( Q (cid:48) Q ) , (B.63)where λ m ( · ) denotes the m -th largest eigenvalue of a symmetric matrix. By (3.7), for somepre-specified choice of eigenvectors, η , η ∗ , . . . , η ∗ K , of HP H , Q is the first m columns of the matrix [ η , η ∗ , . . . , η ∗ K ] · diag(1 , Γ) . Note that [ η , η ∗ , . . . , η ∗ K ] and diag(1 , Γ) are both K × K orthogonal matrices. Then, theirproduct is also an orthogonal matrix, and the columns in Q are orthonormal. It follows that Q (cid:48) Q = I m . c . The left hand side of (B.63) is equalto σ m ( V ( m,K ) ∗ ). It follows that σ m ( V ( m,K ) ∗ ) ≥ C. (B.64)We now combine (B.61) and (B.64), and apply Weyl’s inequality for singular values [9, Corollary7.3.5]. It gives C ≤ σ m ( V ( m,K ) ∗ ) − σ m ( V ( m,m − ∗ ) ≤ (cid:107) V ( m,K ) ∗ − V ( m,m − ∗ (cid:107) . Combining it with (B.60) gives (B.59). The claim follows immediately. Remark . The proof of Theorem 2.2 uses max ≤ k ≤ K (cid:107) v ( m ) k (Γ) (cid:107) ≤ C , and we prove thisclaim here. Note that v ( m ) k (Γ) is a sub-vector of the k th row of V ( m,K ) ∗ . In light of (B.62), therow-wise (cid:96) -norms of V ( m,K ) ∗ are uniformly bounded by C (cid:107) diag − ( η ) (cid:107) . We have argued that η ( k ) (cid:16) / √ K ≤ C for all 1 ≤ k ≤ K . 
As a result, max_{1≤k≤K} ‖v_k^(m)(Γ)‖ ≤ C.

B.4 Proof of Lemma 3.4

Since ‖r̂_i^(m) − r_i^(m)(Γ)‖ ≤ ‖r̂_i^(K) − r_i^(K)(Γ)‖, we only need to show the claim for m = K. Write r_i^(K)(Γ) = r_i(Γ) for short. In the special case of Γ = I_{K−1} (i.e., η_k(Γ) = η_k* for 2 ≤ k ≤ K, by Definition 3.2), we further write r_i = r_i(I_{K−1}) for short. It is easy to see that r_i(Γ) = Γ′·r_i, for any orthogonal matrix Γ ∈ R^{(K−1)×(K−1)}. It suffices to show that, with probability 1 − O(n⁻³), there exists a (K−1) × (K−1) orthogonal matrix Γ, which may depend on n and R̂^(K), such that

max_{1≤i≤n} ‖r̂_i − Γ′·r_i‖ ≤ C·s_n⁻¹·√(log n).

Such a bound was given by Theorem 4.1 of [18] (see also Lemma 2.1 of [15] for a special case where λ₂, ..., λ_K are of the same order).

B.5 Proof of Theorem 3.2

The key to the proof is the following lemma, which characterizes the change of the k-means objective under a perturbation of the cluster assignment. Consider the problem of clustering points y₁, y₂, ..., y_n into two disjoint clusters A and B. The k-means objective is the residual sum of squares obtained by setting the two cluster centers to the within-cluster means. Now, we move a subset C from cluster A to cluster B. The new clusters are Ã = A\C and B̃ = B ∪ C, and the cluster centers are updated accordingly. There is an explicit formula for the change of the k-means objective:

Lemma B.1. For any y₁, y₂, ..., y_n ∈ R^d and subset M ⊂ {1, 2, ..., n}, define ȳ_M = |M|⁻¹·Σ_{i∈M} y_i. Let {1, 2, ..., n} = A ∪ B be a partition, and let C be a strict subset of A. Write Ã = A\C and B̃ = B ∪ C. Define

RSS = Σ_{i∈A} ‖y_i − ȳ_A‖² + Σ_{i∈B} ‖y_i − ȳ_B‖², R̃SS = Σ_{i∈Ã} ‖y_i − ȳ_Ã‖² + Σ_{i∈B̃} ‖y_i − ȳ_B̃‖².
Then,

R̃SS − RSS = (|B||C|/(|B| + |C|))·‖ȳ_C − ȳ_B‖² − (|A||C|/(|A| − |C|))·‖ȳ_C − ȳ_A‖².

This lemma is proved by an elementary calculation, which is relegated to Section D.1. It shows that the change of the k-means objective depends on the distances from ȳ_C to the two previous cluster centers.

We now apply Lemma B.1 to prove the claim. For notational simplicity, we drop "Γ" and omit the superscript m, i.e., we write r_i^(m)(Γ) = r_i and v_k^(m)(Γ) = v_k. By Lemma 3.2 and the condition (2.2):
• The n points r₁, r₂, ..., r_n take K distinct values, v₁, ..., v_K.
• The minimum pairwise distance among v₁, v₂, ..., v_K is d_K(V) > 0.
• For each v_k, there are at least a₀n points, corresponding to the nodes in community N_k, that are equal to v_k, where a₀ > 0.

First, we show that the k-means solution on {r₁, r₂, ..., r_n} satisfies the NSP. We prove it by contradiction. If this is not true, there must exist a community N_k and two clusters, say, S₁ and S₂, such that N_k ∩ S₁ ≠ ∅ and N_k ∩ S₂ ≠ ∅. Note that we have either S₁\N_k ≠ ∅ or S₂\N_k ≠ ∅ (if both S₁ and S₂ are contained in N_k, then we can combine these two clusters and construct another cluster assignment with a smaller residual sum of squares, which conflicts with the optimality of the solution). Without loss of generality, we assume S₂\N_k ≠ ∅. We now move an arbitrary r_i ∈ N_k ∩ S₁ to S₂ and update the cluster centers (i.e., the within-cluster means) accordingly. Let RSS and R̃SS be the respective k-means objectives before and after the change. We apply Lemma B.1 to get that

R̃SS − RSS = (|S₂|/(|S₂| + 1))·‖r_i − c₂‖² − (|S₁|/(|S₁| − 1))·‖r_i − c₁‖². (B.65)

Since i is clustered to S₁ in the optimal solution, it must be true that ‖r_i − c₁‖ ≤ ‖r_i − c₂‖, which further implies that ‖v_k − c₁‖ ≤ ‖v_k − c₂‖.
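Lemma B.1's exact formula for the change in the residual sum of squares is easy to verify numerically before it is used below (our own illustrative Python; the point set and the choice of A, B, C are arbitrary):

```python
import numpy as np

def rss(points, clusters):
    # residual sum of squares with within-cluster means as the centers
    return sum(np.sum((points[list(c)] - points[list(c)].mean(0)) ** 2)
               for c in clusters if c)

rng = np.random.default_rng(2)
y = rng.normal(size=(12, 3))
A = set(range(7)); B = set(range(7, 12)); C = {1, 4, 6}   # C strictly inside A
before = rss(y, [A, B])
after = rss(y, [A - C, B | C])        # move C from A to B
yA, yB, yC = (y[list(S)].mean(0) for S in (A, B, C))
predicted = (len(B) * len(C) / (len(B) + len(C)) * np.sum((yC - yB) ** 2)
             - len(A) * len(C) / (len(A) - len(C)) * np.sum((yC - yA) ** 2))
assert np.isclose(after - before, predicted)   # matches Lemma B.1
```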
At the same time, if we take any j ∈ N_k ∩ S₂, we can similarly derive that ‖v_k − c₂‖ ≤ ‖v_k − c₁‖. Combining the above gives ‖v_k − c₁‖ = ‖v_k − c₂‖. It follows that ‖r_i − c₁‖ = ‖r_i − c₂‖. We immediately see that

R̃SS − RSS = (|S₂|/(|S₂| + 1) − |S₁|/(|S₁| − 1))·‖r_i − c₁‖² = −((|S₁| + |S₂|)/((|S₂| + 1)(|S₁| − 1)))·‖r_i − c₁‖².

The optimality of k-means solutions ensures that R̃SS − RSS ≥ 0. Therefore, the above equality is possible only if ‖r_i − c₁‖ = 0. However, ‖r_i − c₁‖ = 0 implies c₁ = c₂, which is impossible.

Second, define g(r_i; c₁, c₂, ..., c_m) ≡ d₂(r_i; c₁, ..., c_m) − d₁(r_i; c₁, ..., c_m), the gap between the distances from r_i to the second closest and the closest cluster centers. We aim to show that g(r_i; c₁, c₂, ..., c_m) has a uniform lower bound for all 1 ≤ i ≤ n. Fix i. Without loss of generality, we assume c₁ and c₂ are the cluster centers closest and second closest to r_i. Then, i is clustered to S₁. Suppose i ∈ N_k. The NSP we proved above implies that N_k ⊂ S₁.

Again by the NSP, there are only two possible cases: (a) S₁ = N_k, and (b) S₁ is the union of N_k and some other true communities. In case (a), we immediately have c₁ = v_k. It follows that

‖r_i − c₁‖ = ‖v_k − c₁‖ = 0.

Furthermore, for any j ∈ S₂, r_j equals some v_ℓ that is distinct from v_k. Therefore,

‖r_i − c₂‖ = ‖v_k − c₂‖ ≥ min_{j∈S₂} ‖v_k − r_j‖ ≥ min_{ℓ≠k} ‖v_k − v_ℓ‖ = d_K(V).

As a result,

g(r_i; c₁, c₂, ..., c_m) = ‖r_i − c₂‖ − ‖r_i − c₁‖ = ‖r_i − c₂‖ ≥ d_K(V).
This proves the claim in case (a).

In case (b), we consider moving N_k from S₁ to S₂, and let RSS and R̃SS denote the respective k-means objectives before and after the change. Applying Lemma B.1, we obtain

R̃SS − RSS = (|S₂||N_k|/(|S₂| + |N_k|))·‖v_k − c₂‖² − (|S₁||N_k|/(|S₁| − |N_k|))·‖v_k − c₁‖². (B.66)

Let Δ = ‖v_k − c₂‖² − ‖v_k − c₁‖². By direct calculations,

R̃SS − RSS = (|S₂||N_k|/(|S₂| + |N_k|))·Δ − (|N_k|²(|S₁| + |S₂|)/((|S₂| + |N_k|)(|S₁| − |N_k|)))·‖v_k − c₁‖².

The optimality of k-means solutions implies that R̃SS ≥ RSS. It follows that

Δ ≥ (|N_k|(|S₁| + |S₂|)/(|S₂|(|S₁| − |N_k|)))·‖v_k − c₁‖².

Note that |N_k| ≥ a₀n, |S₁| − |N_k| ≤ n, and (|S₁| + |S₂|)/|S₂| ≥ 1. It is seen that |N_k|(|S₁| + |S₂|)/(|S₂|(|S₁| − |N_k|)) ≥ a₀. As a result, ‖v_k − c₂‖² − ‖v_k − c₁‖² = Δ ≥ a₀·‖v_k − c₁‖². It implies that ‖v_k − c₂‖² ≥ (1 + a₀)·‖v_k − c₁‖², i.e.,

‖v_k − c₂‖ − ‖v_k − c₁‖ ≥ (√(1 + a₀) − 1)·‖v_k − c₁‖. (B.67)

We then derive a lower bound on ‖v_k − c₁‖. Here, c₁ is the mean of the r_i's in S₁. For any j ∈ S₁\N_k, r_j equals some v_ℓ that is distinct from v_k. As a result, ‖v_k − r_j‖ ≥ min_{ℓ≠k} ‖v_k − v_ℓ‖ ≥ d_K(V), for all j ∈ S₁\N_k.
It follows that

‖v_k − c₁‖ = ‖v_k − ((|N_k|/|S₁|)·v_k + (1/|S₁|)·Σ_{j∈S₁\N_k} r_j)‖ = ‖(1/|S₁|)·Σ_{j∈S₁\N_k} (r_j − v_k)‖
= (|S₁\N_k|/|S₁|)·‖((1/|S₁\N_k|)·Σ_{j∈S₁\N_k} r_j) − v_k‖ ≥ (|S₁\N_k|/|S₁|)·min_{j∈S₁\N_k} ‖r_j − v_k‖ ≥ a₀·d_K(V), (B.68)

where in the last inequality we have used |S₁| ≤ n and |S₁\N_k| ≥ a₀n (because S₁ is the union of N_k and at least one other community). Combining (B.67) and (B.68) gives

g(r_i; c₁, c₂, ..., c_m) ≥ a₀·(√(1 + a₀) − 1)·d_K(V).

This proves the claim in case (b).

B.6 Proof of Theorem 3.1

Write for short d_m = d_m(U) and δ = max_{1≤i≤n} ‖x̂_i − x_i‖. Given any partition {1, 2, ..., n} = ∪_{k=1}^m B_k and vectors b₁, b₂, ..., b_m ∈ R^d, define

R(B₁, ..., B_m; b₁, ..., b_m) = n⁻¹·Σ_{k=1}^m Σ_{i∈B_k} ‖x_i − b_k‖². (B.69)

Fixing B₁, ..., B_m, the value of R(B₁, ..., B_m; b₁, ..., b_m) is minimized when b_k is the average of the x_i's within each B_k. When b₁, ..., b_m take these special values, we skip them in the notation. Namely, define

R(B₁, ..., B_m) = R(B₁, ..., B_m; x̄₁, ..., x̄_m), where x̄_k = |B_k|⁻¹·Σ_{i∈B_k} x_i. (B.70)

We define R̂(B₁, ..., B_m; b₁, ..., b_m) and R̂(B₁, ..., B_m) similarly, but with x_i replaced by x̂_i. We shall prove the claim by contradiction. Suppose there is 1 ≤ k ≤ K such that F_k intersects more than one Ŝ_j. By the pigeonhole principle, there exists j₁ such that |F_k ∩ Ŝ_{j₁}| ≥ m⁻¹|F_k|. Let Ŝ_{j₂} be another cluster that intersects F_k. We have

|F_k ∩ Ŝ_{j₁}| ≥ m⁻¹α₀n, F_k ∩ Ŝ_{j₂} ≠ ∅.

Below, we aim to show: there exists C₁ = C₁(α₀, C₀, m) such that

min_{S̃₁,...,S̃_m} R(S̃₁, . . .
, ˜ S m ) ≥ R ( ˆ S , . . . , ˆ S m ) − C δ · d m , (B.71)where the minimum on the left hand side is taken over possible partitions of { , , . . . , n } into m clusters. We also aim to show that there exists C = C ( α , C , m ) such that we can constructa clustering structure ˜ S , ˜ S , . . . , ˜ S m satisfying that R ( ˜ S , . . . , ˜ S m ) ≤ R ( ˆ S , . . . , ˆ S m ) − C · d m . (B.72)Combining (B.71)-(B.72) gives R ( ˆ S , . . . , ˆ S m ) − C δ · d m ≤ R ( ˆ S , . . . , ˆ S m ) − C · d m This is impossible if C δ · d m < C · d m . Hence, we can take c ( α , C , m ) < C /C . There is a contradiction between (B.71) and (B.72) whenever δ ≤ c · d m . The claim follows.It remains to prove (B.71) and (B.72). Consider (B.71). For an arbitrary cluster structure B , B , . . . , B m , let ˆ R ( B , . . . , B m ), R ( B , . . . , B m ), ˆ x k and x k be defined as in (B.70). By directcalculations, (ˆ x i − ˆ x k ) − ( x i − x k ) = | B k | − | B k | (ˆ x i − x i ) − | B k | (cid:88) j ∈ B k : j (cid:54) = i (ˆ x j − x j ) . Since (cid:107) ˆ x j − x j (cid:107) ≤ δ for all 1 ≤ j ≤ n , the above equality implies that (cid:107) (ˆ x i − ˆ x k ) − ( x i − x k ) (cid:107) ≤ δ .As a result, (cid:107) ˆ x i − ˆ x k (cid:107) ≤ (cid:107) x i − x k (cid:107) + 2 δ (cid:107) x i − x k (cid:107) + δ . It follows thatˆ R ( B , . . . , B m ) ≤ R ( B , . . . , B m ) + 2 δn − m (cid:88) k =1 (cid:88) i ∈ B k (cid:107) x i − x k (cid:107) + δ ≤ R ( B , . . . , B m ) + 2 δ (cid:112) R ( B , . . . , B m ) + δ (cid:0)(cid:112) R ( B , . . . , B m ) + δ (cid:1) , where the second line is from the Cauchy-Schwarz inequality. It follows that (cid:113) ˆ R ( B , . . . , B m ) ≤ (cid:112) R ( B , . . . , B m )+ δ . We can switch ˆ R ( B , . . . , B m ) and R ( B , . . . , B m ) to get a similar inequal-ity. Combining them gives (cid:112) R ( B , . . . , B m ) − δ ≤ (cid:113) ˆ R ( B , . . . , B m ) ≤ (cid:112) R ( B , . . . , B m ) + δ. 
(B.73)This inequality holds for an arbitrary partition ( B , B , . . . , B m ). We now apply it to ( ˆ S , . . . , ˆ S m ),which are the clusters obtained from applying k-means on ˆ x , ˆ x , . . . , ˆ x n . We also consider apply-ing k-means on x , x , . . . , x n and let S , S , . . . , S m denote the resultant clusters. By optimalityof the k-means solutions, ˆ R ( ˆ S , . . . , ˆ S m ) ≤ ˆ R ( S , . . . , S m ) . Combining it with (B.73) gives (cid:113) R ( ˆ S , . . . , ˆ S m ) ≤ (cid:113) ˆ R ( ˆ S , . . . , ˆ S m ) + δ ≤ (cid:113) ˆ R ( S , . . . , S m ) + δ ≤ (cid:112) R ( S , . . . , S m ) + 2 δ. (B.74)Since max ≤ i ≤ n (cid:107) x i (cid:107) ≤ C · d m , we can easily see that R ( S , . . . , S m ) ≤ C · d m . It follows that,as long as δ ≤ d m / R ( ˆ S , . . . , ˆ S m ) ≤ R ( S , . . . , S m ) + 4 δ (cid:112) R ( S , . . . , S m ) + 4 δ ≤ R ( S , . . . , S m ) + 4 C δ · d m + δ · d m ≤ R ( S , . . . , S m ) + (4 C + 1) δ · d m . As a result, min ˜ S ,..., ˜ S m R ( ˜ S , . . . , ˜ S m ) = R ( S , . . . , S m ) ≥ R ( ˆ S , . . . , ˆ S m ) − (4 C + 1) δ · d m . This proves (B.71) for C = 4( C + 1).Consider (B.72). Define w j = | ˆ S j | − (cid:88) i ∈ ˆ S j x i , for each 1 ≤ j ≤ m. (B.75)Using the notations in (B.69)-(B.70), we write R ( ˆ S , . . . , ˆ S m ) = R ( ˆ S , . . . , ˆ S m , w , . . . , w m ). Weaim to construct { ( ˜ S j , ˜ w j ) } ≤ j ≤ m such that R ( ˜ S , . . . , ˜ S m , ˜ w , . . . , ˜ w m ) ≤ R ( ˆ S , . . . , ˆ S m , w , . . . , w m ) − C · d m . (B.76)Since R ( ˜ S , . . . , ˜ S m ) = min b ,...,b m R ( ˜ S , . . . , ˜ S m , b , . . . , b m ), we immediately have R ( ˜ S , . . . , ˜ S m ) ≤ R ( ˜ S , . . . , ˜ S m , ˜ w , . . . , ˜ w m ) ≤ R ( ˆ S , . . . , ˆ S m , w , . . . , w m ) − C · d m . This proves (B.72).What remains is to construct { ( ˜ S j , ˜ w j ) } mj =1 so that (B.76) is satisfied. Let ˆ w j = | ˆ S j | − (cid:80) i ∈ ˆ S j ˆ x i ,for 1 ≤ j ≤ m . 
Then, { ( ˆ S j , ˆ w j ) } ≤ j ≤ m are the clusters and cluster centers obtained by applyingthe k-means algorithm on ˆ x , ˆ x , . . . , ˆ x n . The k-means solution guarantees to assign each pointto the closest center. Take i ∈ F k ∩ ˆ S j and i (cid:48) ∈ F k ∩ ˆ S j . It follows that (cid:107) ˆ x i − ˆ w j (cid:107) ≤ (cid:107) ˆ x i − ˆ w j (cid:107) , (cid:107) ˆ x i (cid:48) − ˆ w j (cid:107) ≤ (cid:107) ˆ x i (cid:48) − ˆ w j (cid:107) . x i = x i (cid:48) = u k and max {(cid:107) ˆ x i − x i (cid:107) , (cid:107) ˆ x i (cid:48) − x i (cid:48) (cid:107) , (cid:107) ˆ w j − w j (cid:107) , (cid:107) ˆ w j − w j (cid:107)} ≤ δ , we have (cid:107) u k − w j (cid:107) ≤ (cid:107) ˆ x i − ˆ w j (cid:107) + 2 δ ≤ (cid:107) ˆ x i − ˆ w j (cid:107) + 2 δ ≤ (cid:107) u k − w j (cid:107) + 4 δ. Similarly, we can derive that (cid:107) u k − w j (cid:107) ≤ (cid:107) u k − w j (cid:107) + 4 δ . Combining them gives |(cid:107) u k − w j (cid:107) − (cid:107) u k − w j (cid:107)| ≤ δ. (B.77)This inequality tells us that (cid:107) u k − w j (cid:107) and (cid:107) u k − w j (cid:107) are sufficiently close. Introduce C = m − α × C . Below, we consider two cases: (cid:107) u k − w j (cid:107) < C · d m and (cid:107) u k − w j (cid:107) ≥ C · d m .In the first case, (cid:107) u k − w j (cid:107) < C · d m . The definition of d m guarantees that there are m points from { u , u , . . . , u K } such that their minimum pairwise distance is d m . Without loss ofgenerality, we assume these m points are u , u , . . . , u m . If k ∈ { , , . . . , m } , then the distancefrom u k to any of the other ( m − 1) points is at least d m . If k / ∈ { , , . . . , m } , then u k cannotbe simultaneously within a distance of < d m / u , u , . . . , u m . In otherwords, there exists at least ( m − 1) points from u , u , . . . , u m whose distance to u k is at least ≥ d m / 2. Combining the above situations, we conclude that there exist ( m − 1) points from { u , u , . . . 
, u K } , which we assume to be u , u , . . . , u m − without loss of generality, such thatmin ≤ (cid:96) (cid:54) = s ≤ m − (cid:107) u (cid:96) − u s (cid:107) ≥ d m , min ≤ (cid:96) ≤ m − (cid:107) u (cid:96) − u k (cid:107) ≥ d m / . (B.78)We then consider two sub-cases. In the first sub-case, there exists (cid:96) ∈ { , , . . . , m − } such that | F (cid:96) ∩ ( ˆ S j ∪ ˆ S j ) | ≥ m − α n . Then, at least one of ˆ S j and ˆ S j contains more than( m − α / n nodes from F (cid:96) . We only study the situation of | F (cid:96) ∩ ˆ S j | ≥ ( m − α / n . The prooffor the situation of | F (cid:96) ∩ ˆ S j | ≥ ( m − α / n is similar and omitted. We modify the clusters andcluster centers { ( ˆ S j , w j ) } ≤ j ≤ m as follows:(i) Combine ˆ S j \ F (cid:96) and ˆ S j into one cluster and set the cluster center to be w j .(ii) Create a new cluster as ˆ S j ∩ F (cid:96) and set the cluster center to be u (cid:96) .The other clusters and cluster centers remain unchanged. Namely, we let˜ S j = ˆ S j ∪ ( ˆ S j \ F (cid:96) ) , if j = j , ˆ S j ∩ F (cid:96) , if j = j , ˆ S j , if j / ∈ { j , j } , ˜ w j = (cid:40) u (cid:96) , if j = j ,w j , otherwise . Recall that n · R ( B , . . . , B m , b , . . . , b m ) = (cid:80) mj =1 (cid:80) i ∈ B j (cid:107) x i − b j (cid:107) . By direct calculations,∆ ≡ n · R ( ˆ S , . . . , ˆ S m , w , . . . , w m ) − n · R ( ˜ S , . . . , ˜ S m , ˜ w , . . . , ˜ w m )= (cid:88) i ∈ ( ˆ S j ∩ F (cid:96) ) (cid:0) (cid:107) x i − w j (cid:107) − (cid:107) x i − u (cid:96) (cid:107) (cid:1) − (cid:88) i ∈ ( ˆ S j \ F (cid:96) ) (cid:0) (cid:107) x i − w j (cid:107) − (cid:107) x i − w j (cid:107) (cid:1) ≡ ∆ − ∆ . 
Here ∆ is the increase of the residual sum of squares (RSS) caused by the operation (i) and ∆ is the decrease of RSS caused by the operation (ii).∆ = (cid:88) i ∈ ( ˆ S j \ F (cid:96) ) ( (cid:107) x i − w j (cid:107) − (cid:107) x i − w j (cid:107) )( (cid:107) x i − w j (cid:107) + (cid:107) x i − w j (cid:107) )45 (cid:88) i ∈ ( ˆ S j \ F (cid:96) ) (cid:107) w j − w j (cid:107) · ( (cid:107) x i − w j (cid:107) + (cid:107) x i − w j (cid:107) ) ≤ | ˆ S j \ F (cid:96) | · (cid:107) w j − w j (cid:107) · C · d m , where the third line is from the triangle inequality and the last line is because max ≤ j ≤ m (cid:107) w j (cid:107) ≤ max ≤ i ≤ n (cid:107) x i (cid:107) ≤ C · d m . Note that (cid:107) w j − w j (cid:107) ≤ (cid:107) u k − w j (cid:107) + (cid:107) u k − w j (cid:107) . We have assumed (cid:107) u k − w j (cid:107) < C · d m in this case. Combing it with (B.77), as long as δ < ( C / · d m , (cid:107) w j − w j (cid:107) ≤ (cid:107) u k − w j (cid:107) + 4 δ ≤ C · d m . It follows that ∆ ≤ C C · nd m . (B.79)Since x i = u (cid:96) for i ∈ F (cid:96) , we immediately have∆ = | ˆ S j ∩ F (cid:96) | · (cid:107) u (cid:96) − w j (cid:107) . We have assumed (cid:107) u k − w j (cid:107) ≤ C · d m in this case. Combining it with (B.77) and (B.78) gives (cid:107) u (cid:96) − w j (cid:107) ≥ (cid:107) u (cid:96) − u k (cid:107) − (cid:107) u k − w j (cid:107)≥ (cid:107) u (cid:96) − u k (cid:107) − (cid:0) (cid:107) u k − w j (cid:107) + 4 δ (cid:1) ≥ d m / − ( C · d m + 4 δ ) . Recall that C = m − α × C < / 12. Then, as long as δ < (1 / d m , we have (cid:107) u (cid:96) − w j (cid:107) ≥ d m / ≥ ( m − α / n · ( d m / ≥ m − α · nd m . (B.80)As a result, ∆ = ∆ − ∆ ≥ (cid:16) m − α − C C (cid:17) · nd m . We plug in the expression of C , the right hand side is ( m − α / · nd m . It follows that R ( ˆ S , . . . , ˆ S m , w , . . . , w m ) − R ( ˜ S , . . . , ˜ S m , ˜ w , . . . , ˜ w m ) ≥ m − α · d m . 
(B.81)This gives (B.76) in the first sub-case.In the second sub-case, | F (cid:96) ∩ ( ˆ S j ∪ ˆ S j ) | < m − α n for all 1 ≤ (cid:96) ≤ m − 1. For each F (cid:96) , bypigeonhole principle, there exists at least one j ∈ { , , . . . , m } such that | F (cid:96) ∩ ˆ S j | ≥ m − | F (cid:96) | ≥ m − α n . Denote such a j by j ∗ (cid:96) ; if there are multiple indices satisfying the requirement, we pickone of them. This gives j ∗ , j ∗ , . . . , j ∗ m − ∈ { , , . . . , m }\{ j , j } . These ( m − 1) indices take at most ( m − 2) distinct values. By pigeonhole principle, there exist1 ≤ (cid:96) (cid:54) = (cid:96) ≤ m − j ∗ (cid:96) = j ∗ (cid:96) = j ∗ , for some j ∗ / ∈ { j , j } . Recalling (B.75), we let w j ∗ denote the average of x i ’s in ˆ S j ∗ . Since (cid:107) u (cid:96) − u (cid:96) (cid:107) ≥ d m , the point w j ∗ cannot be simultaneouslywithin a distance of d m / u (cid:96) and u (cid:96) . Without loss of generality, suppose (cid:107) u (cid:96) − w j ∗ (cid:107) ≥ d m / . We modify the clusters and cluster centers { ( ˆ S j , w j ) } ≤ j ≤ m as follows:(i) Combine ˆ S j and ˆ S j into one cluster and set the cluster center to be w j .46ii) Split ˆ S j ∗ into two clusters, where one is ( ˆ S j ∗ ∩ F (cid:96) ), and the other is ( ˆ S j ∗ \ F (cid:96) ); the twocluster centers are set as u (cid:96) and w j ∗ , respectively.The other clusters and cluster centers remain unchanged. Namely, we let˜ S j = ˆ S j ∪ ˆ S j , if j = j , ˆ S j ∗ ∩ F (cid:96) , if j = j , ˆ S j ∗ \ F (cid:96) , if j = j ∗ , ˆ S j , if j / ∈ { j , j , j ∗ } , ˜ w j = (cid:40) u (cid:96) , if j = j ,w j , otherwise . By direct calculations,∆ ≡ n · R ( ˆ S , . . . , ˆ S m , w , . . . , w m ) − n · R ( ˜ S , . . . , ˜ S m , ˜ w , . . . 
, ˜ w m )= (cid:88) i ∈ ( ˆ S j ∗ ∩ F (cid:96) ) (cid:0) (cid:107) x i − w j ∗ (cid:107) − (cid:107) x i − u (cid:96) (cid:107) (cid:1) − (cid:88) i ∈ ˆ S j (cid:0) (cid:107) x i − w j (cid:107) − (cid:107) x i − w j (cid:107) (cid:1) ≡ ∆ − ∆ , where ∆ is the increase of RSS caused by (i) and ∆ is the decrease of RSS caused by (ii). Wecan bound ∆ in a similar way as in the previous sub-case, and the details are omitted. It gives∆ ≤ C C · nd m . Since x i = u (cid:96) for all i ∈ F (cid:96) , we immediately have∆ = | ˆ S j ∗ ∩ F (cid:96) | · (cid:107) u (cid:96) − w j ∗ (cid:107) ≥ ( m − α n ) · ( d m / ≥ m − α · nd m . As a result, ∆ ≥ ( m − α − C C ) m − α · nd m . If we plug in the expression of C , it becomes ≥ ( m − α ) · nd m . This gives R ( ˆ S , . . . , ˆ S m , w , . . . , w m ) − R ( ˜ S , . . . , ˜ S m , ˜ w , . . . , ˜ w m ) ≥ m − α · d m . (B.82)This gives (B.76) in the second sub-case.In the second case, (cid:107) u k − w j (cid:107) ≥ C · d m . We recall that | F k ∩ ˆ S j | ≥ m − α n . Let E be asubset of F k ∩ ˆ S j such that | E | = (cid:100)| F k ∩ ˆ S j | / (cid:101) . Note that | ˆ S j \ E | ≤ n . We haveˆ S j \ E (cid:54) = ∅ , and | E || ˆ S j \ E | ≥ m − α / . We now modify the clusters and cluster centers { ( ˆ S j , w j ) } ≤ j ≤ m as follows: • Move the subset E from ˆ S j to ˆ S j , and update each cluster center to be the within clusteraverage of x i ’s.The other clusters and cluster centers are unchanged. Namely, we let˜ S j = ˆ S j \ E, if j = j , ˆ S j ∪ E, if j = j , ˆ S j , if j / ∈ { j , j } , ˜ w j = | ˜ S j | (cid:80) i ∈ ˜ S j x i , if j ∈ { j , j } ,w j , otherwise . We apply Lemma B.1 to A = ˆ S j , B = ˆ S j , and C = E , and note that x i = u k for all i ∈ E . Itfollows that∆ ≡ n · R ( ˆ S , . . . , ˆ S m , w , . . . , w m ) − n · R ( ˜ S , . . . , ˜ S m , ˜ w , . . . 
, ˜ w m )47 − (cid:32) | ˆ S j | · | E || ˆ S j | + | E | (cid:107) u k − w j (cid:107) − | ˆ S j | · | E || ˆ S j | − | E | (cid:107) u k − w j (cid:107) (cid:33) = | E | · ( | ˆ S j | + | ˆ S j | )( | ˆ S j | + | E | )( | ˆ S j | − | E | ) (cid:107) u k − w j (cid:107) + | ˆ S j | · | E || ˆ S j | + | E | (cid:0) (cid:107) u k − w j (cid:107) − (cid:107) u k − w j (cid:107) (cid:1) ≥ | E | | ˆ S j | − | E | (cid:107) u k − w j (cid:107) + | ˆ S j | · | E || ˆ S j | + | E | (cid:0) (cid:107) u k − w j (cid:107) − (cid:107) u k − w j (cid:107) (cid:1) . (B.83)By (B.77), (cid:107) u k − w j (cid:107) ≤ (cid:107) u k − w j (cid:107) + 4 δ . It follows that, as long as δ < ( C / · d m , (cid:107) u k − w j (cid:107) − (cid:107) u k − w j (cid:107) ≥ − δ · (cid:107) u k − w j (cid:107) − δ ≥ − δ · (cid:107) u k − w j (cid:107) , where the last line is because 16 δ ≤ C δ · d m ≤ δ · (cid:107) u k − w j (cid:107) . We plug it into (B.83) to get∆ ≥ | E | | ˆ S j \ E | (cid:107) u k − w j (cid:107) − | ˆ S j | · | E || ˆ S j | + | E | · δ · (cid:107) u k − w j (cid:107)≥ | E | · ( m − α / · (cid:107) u k − w j (cid:107) − | E | · δ · (cid:107) u k − w j (cid:107)≥ | E | · (cid:107) u k − w j (cid:107) · (cid:16) C m − α d m − δ (cid:17) , where the second line is because | E | ≥ ( m − α / · | ˆ S j \ E | and the last line is because we haveassumed (cid:107) u k − w j (cid:107) ≥ C · d m in the current case. As long as δ < C m − α · d m , the number inbrackets is ≥ C m − α d m . We also plug in | E | = (cid:100) m − α / (cid:101) n and (cid:107) u k − w j (cid:107) ≥ C · d m to get∆ ≥ m − α n · C d m · C m − α d m ≥ C m − α · nd m . It follows that R ( ˆ S , . . . , ˆ S m , w , . . . , w m ) − R ( ˜ S , . . . , ˜ S m , ˜ w , . . . , ˜ w m ) ≥ C m − α · d m . (B.84)This gives (B.76) in the second case. We combine (B.81), (B.82) and (B.84), and take theminimum of the right hand sides of three inequalities. Since m − α < C < / 3, wechoose C = (1 / C m − α . 
Then, (B.76) is satisfied for all cases. This completes the proof of (B.72).

We remark that the scalar c_0 = c_0(α_0, C_0, m) is not exactly C_2/C_1. In the derivation of (B.71) and (B.72), we have imposed other restrictions on δ, which can be expressed as δ ≤ C_3 · d_m, where C_3 is determined by (α_0, C_0, m) and (C_1, C_2). Since (C_1, C_2) only depend on (α_0, C_0, m), C_3 is a function of (α_0, C_0, m) only. We take c_0 = min{C_2/C_1, C_3}.

B.7 Proof of the claim in Example 4b of Section 3

In Example 4b of Section 3, we have the following claim.

Lemma B.2. Let R^(m) and V^(m) be as in (3.9) and (3.8), respectively. If (K, m) = (4, 2) and all communities have equal sizes, then g_m(R^(m)) ≥ [(3 − √3)/2] · d_K(V^(m)).

We now show the claim. For short, let x_k = v_k^(m) for all 1 ≤ k ≤ K = 4, and write d* = g_m(R^(m)). Without loss of generality, we assume x_1 = 0, x_2 = 1, x_3 = x, and x_4 = y, where y > x > 1. Let z = y − x. It is seen that d_K(V^(m)) = min{1, x − 1, z}. To show the claim, it is sufficient to show
d* ≥ [(3 − √3)/2] · min{1, x − 1, z}.   (B.85)
By definitions,
d* = min_{all possible (c_1, c_2)} min_{1≤i≤4} d_i(c_1, c_2),   (B.86)
where for 1 ≤ i ≤ 4, d_i(c_1, c_2) is the difference between the distance from x_i to the center of the cluster to which x_i does not belong and the distance from x_i to the center of the cluster to which x_i belongs. For simplicity, we write d_i = d_i(c_1, c_2) when there is no confusion. For the four points x_1, x_2, x_3, x_4, we have three possible candidates (a)-(c) for the clustering results (which of them is the actual clustering result depends on the values of (x, y)):
• (a). The left most point forms one cluster, the other three form the other cluster.
• (b). The left two points form one cluster, the other two points form the other cluster.
• (c). The left three points form one cluster, the right most point forms the other cluster.
Recall that for any n points x_1, x_2, . . .
, x_n, the RSS for the k-means solution with K clusters is
RSS = Σ_{k=1}^K Σ_{i ∈ cluster k} (x_i − c_k)^2,
where c_1, c_2, . . . , c_K are the cluster centers. For (a), the two cluster centers are c_1 = 0 and c_2 = (1 + x + y)/3. In this case, the RSS is S_1 = 1 + x^2 + y^2 − (1/3)(1 + x + y)^2. For (b), the two cluster centers are c_1 = 1/2 and c_2 = (x + y)/2, and the RSS is S_2 = (1/2) + (1/2)(x − y)^2. For (c), the two cluster centers are c_1 = (1 + x)/3 and c_2 = y, and the RSS is S_3 = 1 + x^2 − (1/3)(1 + x)^2. It is seen that the actual clustering result is as in (a) if and only if S_1 ≤ S_2 and S_1 ≤ S_3; similar for (b) and (c).

Recall that z = y − x. Consider the two-dimensional space with x and z being the two axes. As in Figure 7, we partition the region {(x, z) : x > 1, z > 0} into three sub-regions as follows.
• Region (I). {(x, z) : 2x + z < 2 + √3, z < 1}.
• Region (II). {(x, z) : z < (2x − 1)/√3, 2x + z > 2 + √3}.
• Region (III). {(x, z) : z > 1, z > (2x − 1)/√3}.

Figure 7: In the two-dimensional space with x and z being the two axes, the whole region {(x, z) : x > 1, z > 0} partitions into three sub-regions (I), (II), and (III), respectively.

Note that any point (x, z) in our range of interest either belongs to one of the three regions, or falls on one of the boundaries of these regions. We now show the claim by considering the three regions in Figure 7 separately. The discussions for the case where (x, z) falls on the boundaries of these regions are similar so are omitted.

Consider Region (I). In this region, by elementary algebra, we have S_1 < S_2 and S_1 < S_3. Therefore, case (a) is the final clustering result, where the two clusters are {x_1} and {x_2, x_3, x_4}, respectively, with cluster centers c_1 = 0 and c_2 = (x + y + 1)/3. By definitions, for (x, z) in Region (I), d_1 = |c_2 − 0| − |c_1 − 0| = (1 + x + y)/3, d_2 = |c_1 − 1| − |c_2 − 1| = (5 − x − y)/3, d_3 = |c_1 − x| − |c_2 − x| = (x + y + 1)/3 if 2x > y + 1 and d_3 = (5x − y − 1)/3 otherwise, and d_4 = |c_1 − y| − |c_2 − y| = (x + y + 1)/3.
By elementary algebra, it is seen that d_2 is the smallest among {d_1, d_2, d_3, d_4}. Combining this with (B.86) gives that for (x, z) in Region (I), d* = (5 − x − y)/3 = (5 − 2x − z)/3. Note that for (x, z) in Region (I), 2x + z < 2 + √3. It follows that 2(x − 1) + z < √3, and hence min{1, x − 1, z} ≤ √3/3. Combining these,
d*/min{1, x − 1, z} ≥ [(5 − (2 + √3))/3] / (√3/3) = (3 − √3)/√3 = √3 − 1.   (B.87)

Consider Region (II). In this region, by elementary algebra, S_2 ≤ S_1 and S_2 < S_3. Therefore, case (b) is the actual clustering result, so the two cluster centers are c_1 = 1/2 and c_2 = (x + y)/2. By definitions, d_1 = |c_2 − 0| − |c_1 − 0| = (x + y − 1)/2, d_2 = |c_2 − 1| − |c_1 − 1| = (x + y − 3)/2, d_3 = |c_1 − x| − |c_2 − x| = (3x − y − 1)/2, and d_4 = |c_1 − y| − |c_2 − y| = (x + y − 1)/2. Among {d_1, d_2, d_3, d_4}, d_2 is the smallest when z < 1 and d_3 is the smallest when z ≥ 1. Combining this with (B.86) gives that for (x, z) in Region (II),
d* = (x + y − 3)/2 = (2x + z − 3)/2, if z < 1;  d* = (3x − y − 1)/2 = (2x − z − 1)/2, if z ≥ 1.
Consider the case of z < 1. In this case, min{1, x − 1, z} = min{x − 1, z} > 0, and by elementary algebra the ratio (2x + z − 3)/(2 min{x − 1, z}) over this part of Region (II) is minimized on the boundary 2x + z = 2 + √3, at the point where x − 1 = z = √3/3, where it equals (3 − √3)/2. Consider the case of z ≥ 1. In this case, min{1, x − 1, z} = min{1, x − 1} > 0, and the ratio (2x − z − 1)/(2 min{1, x − 1}) is minimized on the boundary z = (2x − 1)/√3, at x = 2, where it also equals (3 − √3)/2. Combining the above, we have that in Region (II),
d* ≥ [(3 − √3)/2] · min{1, x − 1, z}.   (B.88)

Consider Region (III). By elementary algebra, it is seen that S_3 < S_1 and S_3 < S_2 in this case. Therefore, case (c) is the actual clustering result, so the two cluster centers are c_1 = (1 + x)/3 and c_2 = y, respectively.
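As a concrete check of which candidate partition wins, the three RSS values can be evaluated numerically; the point (x, z) = (1.2, 1.8) below is a hypothetical choice lying in Region (III), and `rss` is our helper, not notation from the paper:

```python
import numpy as np

def rss(points, clusters):
    # residual sum of squares when each cluster uses its own mean as center
    return sum(((np.array(c) - np.mean(c)) ** 2).sum() for c in clusters)

x, z = 1.2, 1.8          # a hypothetical point with z > 1: Region (III)
y = x + z
p = [0.0, 1.0, x, y]     # the four points x_1 = 0, x_2 = 1, x_3 = x, x_4 = y

S1 = rss(p, [[p[0]], p[1:]])     # (a): {0} vs {1, x, y}
S2 = rss(p, [p[:2], p[2:]])      # (b): {0, 1} vs {x, y}
S3 = rss(p, [p[:3], [p[3]]])     # (c): {0, 1, x} vs {y}

# closed forms of S1, S2, S3
assert np.isclose(S1, 1 + x**2 + y**2 - (1 + x + y)**2 / 3)
assert np.isclose(S2, 0.5 + 0.5 * (x - y)**2)
assert np.isclose(S3, 1 + x**2 - (1 + x)**2 / 3)
# at this Region (III) point, partition (c) attains the smallest RSS
assert S3 < S1 and S3 < S2
```

Evaluating the same three closed forms at points of Regions (I) and (II) confirms that (a) and (b), respectively, are optimal there.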
By definitions, d_1 = |c_2 − 0| − |c_1 − 0| = (3y − x − 1)/3, d_2 = |c_2 − 1| − |c_1 − 1| = (3y − x − 1)/3 if x > 2 and d_2 = (x + 3y − 5)/3 otherwise, d_3 = |c_2 − x| − |c_1 − x| = (1 − 5x + 3y)/3, and d_4 = |c_1 − y| − |c_2 − y| = (3y − x − 1)/3. By elementary algebra, d_3 is the smallest in {d_1, d_2, d_3, d_4}. Combining these with (B.86) gives that for (x, z) in Region (III), d* = (1 − 5x + 3y)/3 = (1 − 2x + 3z)/3. Also, min{1, x − 1, z} = min{1, x − 1} in this region. When x > 2, min{1, x − 1} = 1, and d*/min{1, x − 1, z} is at least √3 − 1, with this value approached near (x, z) = (2, √3). When x < 2, min{1, x − 1} = x − 1, and d*/min{1, x − 1, z} = [(3z − 2x + 1)/3]/(x − 1); by elementary algebra this ratio is also at least √3 − 1 in Region (III) (for instance, it equals 2/√3 at (x, z) = ((1 + √3)/2, 1)). Therefore, for (x, z) in Region (III),
d* ≥ (√3 − 1) · min{1, x − 1, z}.   (B.89)
Combining (B.87)-(B.89) gives the claim, since the smallest of the three constants is (3 − √3)/2.

C Proof of results in Section 4

C.1 Proof of Lemma 4.1

Consider the first two claims. It is easy to see that E[C_n] = c_n. In the proof of Theorem 3.1 of [16], it has been shown that c_n = tr(Ω^4) + O(‖θ‖_3^3 ‖θ‖^3) = tr(Ω^4) + o(‖θ‖^8). Moreover, λ_1^4 ≤ tr(Ω^4) ≤ K λ_1^4. In the proof of Theorem 2.4, we have seen that λ_1 = ‖θ‖^2 · λ_1(HPH′). Using the condition (2.2) and the fact that P has unit diagonals, we have λ_1(HPH′) ≥ Cλ_1(P) ≥ C. Similarly, since we have assumed ‖P‖ ≤ C in (2.1), λ_1(HPH′) ≤ Cλ_1(P) ≤ C. Here, C is a generic constant. We have proved that E[C_n] = c_n ≍ ‖θ‖^8. To compute the variance of C_n, write C_n = Q̃_n + ∆, where
Q̃_n = Σ_{i_1,i_2,i_3,i_4 (dist)} W_{i_1 i_2} W_{i_2 i_3} W_{i_3 i_4} W_{i_4 i_1}.
The variance of ∆ is computed in the proof of Lemma B.2 of [16]. Using the upper bound there on the variance of (Σ_{CC(I_n)} ∆^(k)_{i_1 i_2 i_3 i_4}) for k = 1, 2, 3, we get Var(∆) ≤ C‖θ‖_3^6 ‖θ‖^4. Furthermore, we show in the proof of Lemma 4.2 that Var(Q̃_n) = 8c_n · [1 + o(1)].
It follows that Var(Q̃_n) ≍ c_n ≍ ‖θ‖^8. Combining these results gives
Var(C_n) ≤ C‖θ‖^8 · [1 + ‖θ‖_3^6/‖θ‖^4].
Consider the last claim. For any ε > 0, using Chebyshev's inequality, we have
P(|C_n/c_n − 1| ≥ ε) ≤ (c_n ε)^{-2} Var(C_n) ≤ C(1 + ‖θ‖_3^6/‖θ‖^4) / (ε^2 ‖θ‖^8).
Here we have used the first two claims. Since ‖θ‖_3^3 ≤ θ_max ‖θ‖^2 = o(‖θ‖^2), the rightmost term is o(1) as n → ∞. This proves that C_n/c_n →p 1.

C.2 Proof of Lemma 4.2

In the proof of Theorem 3.2 of [16], it was shown that Q̃_n/√(Var(Q̃_n)) → N(0, 1) in law (in the proof there, Q̃_n/√(Var(Q̃_n)) is denoted as S_{n,n}). It remains to prove Var(Q̃_n) = 8c_n · [1 + o(1)]. Note that for each ordered quadruple (i, j, k, ℓ) with four distinct indices, there are 8 summands in the definition of Q̃_n whose values are exactly the same; these summands correspond to
(i_1, i_2, i_3, i_4) ∈ {(i, j, k, ℓ), (j, k, ℓ, i), (k, ℓ, i, j), (ℓ, i, j, k), (k, j, i, ℓ), (j, i, ℓ, k), (i, ℓ, k, j), (ℓ, k, j, i)}.
We treat these 8 summands as an equivalence class. Denote by CC the collection of all such equivalence classes. Then, for any doubly indexed sequence {x_ij}_{1≤i≠j≤n} such that x_ij = x_ji, it is true that Σ_{i_1,i_2,i_3,i_4 (dist)} x_{i_1 i_2} x_{i_2 i_3} x_{i_3 i_4} x_{i_4 i_1} = 8 Σ_{CC} x_{i_1 i_2} x_{i_2 i_3} x_{i_3 i_4} x_{i_4 i_1}. In particular,
Q̃_n = 8 Σ_{CC} W_{i_1 i_2} W_{i_2 i_3} W_{i_3 i_4} W_{i_4 i_1}.
The summands are independent of each other, and the variance of W_{i_1 i_2} W_{i_2 i_3} W_{i_3 i_4} W_{i_4 i_1} is equal to Ω*_{i_1 i_2} Ω*_{i_2 i_3} Ω*_{i_3 i_4} Ω*_{i_4 i_1}, where Ω*_ij = Ω_ij(1 − Ω_ij).
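The 8-fold symmetry used in this counting is easy to check directly; the snippet below (an illustrative check with simulated symmetric noise, not part of the proof) confirms that the 8 listed orderings of a quadruple give one and the same cycle summand, and hence that the sum over all ordered quadruples is 8 times the sum over canonical representatives:

```python
import numpy as np
from itertools import permutations

rng = np.random.default_rng(1)
n = 8
W = rng.normal(size=(n, n))
W = (W + W.T) / 2                 # symmetric, mimicking the noise matrix W
np.fill_diagonal(W, 0.0)

def summand(q):
    a, b, c, d = q
    return W[a, b] * W[b, c] * W[c, d] * W[d, a]

base = (0, 1, 2, 3)
rotations = [base[r:] + base[:r] for r in range(4)]
orderings = rotations + [tuple(reversed(o)) for o in rotations]
assert len(set(orderings)) == 8   # the 8 orderings listed in the text
assert all(np.isclose(summand(o), summand(base)) for o in orderings)

# consequently, the sum over all ordered distinct quadruples equals 8 times the
# sum over one canonical representative per equivalence class
sum_all = sum(summand(q) for q in permutations(range(n), 4))
sum_canon = sum(summand(q) for q in permutations(range(n), 4)
                if q[0] == min(q) and q[1] < q[3])
assert np.isclose(sum_all, 8 * sum_canon, atol=1e-8)
```

The canonical representative here starts the cycle at its smallest index and fixes a traversal direction, which picks exactly one of the 8 orderings per class.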
As a result,
Var(Q̃_n) = 64 Σ_{CC} Ω*_{i_1 i_2} Ω*_{i_2 i_3} Ω*_{i_3 i_4} Ω*_{i_4 i_1} = 8 Σ_{i_1,i_2,i_3,i_4 (dist)} Ω*_{i_1 i_2} Ω*_{i_2 i_3} Ω*_{i_3 i_4} Ω*_{i_4 i_1}.
Recall that c_n = Σ_{i_1,i_2,i_3,i_4 (dist)} Ω_{i_1 i_2} Ω_{i_2 i_3} Ω_{i_3 i_4} Ω_{i_4 i_1}. Then,
|Var(Q̃_n) − 8c_n| ≤ 8 Σ_{i_1,i_2,i_3,i_4 (dist)} |Ω_{i_1 i_2} Ω_{i_2 i_3} Ω_{i_3 i_4} Ω_{i_4 i_1} − Ω*_{i_1 i_2} Ω*_{i_2 i_3} Ω*_{i_3 i_4} Ω*_{i_4 i_1}| ≤ 8 Σ_{i_1,i_2,i_3,i_4 (dist)} Ω_{i_1 i_2} Ω_{i_2 i_3} Ω_{i_3 i_4} Ω_{i_4 i_1} · C‖Ω‖_max = 8c_n · O(θ_max^2).
Since θ_max = o(1) by the condition (2.1), we immediately have Var(Q̃_n) = 8c_n · [1 + o(1)].

C.3 Proof of Lemma 4.3

The proof is combined with the proof of Lemma 4.8; see below.

C.4 Proof of Lemma 4.4

Consider the first claim. Since b_n = 2‖θ‖^4 · [g′V^{-1}(PH^2P ∘ PH^2P)V^{-1}g] (see (4.13)), it suffices to show that g′V^{-1}(PH^2P ∘ PH^2P)V^{-1}g ≍ 1. The vectors g, h ∈ R^K are defined by g_k = (1′_k θ)/‖θ‖_1 and h_k = (1′_k Θ^2 1_k)^{1/2}/‖θ‖, where 1_k is short for 1^(K)_k. By condition (2.2), c_1 ≤ g_k ≤ 1 and c_1 ≤ h_k ≤ 1 for all 1 ≤ k ≤ K, and ‖P‖ ≤ c_2, for some constants c_1, c_2 ∈ (0, ∞). Since h_k ≤ 1 and ‖P‖ ≤ c_2, we have ‖(PH^2P) ∘ (PH^2P)‖ ≤ C. Since P has unit diagonals and g_k ≥ c_1, the diagonal elements of V = diag(Pg) are no less than c_1. Hence
g′V^{-1}(PH^2P ∘ PH^2P)V^{-1}g ≤ ‖g′V^{-1}‖^2 · ‖PH^2P ∘ PH^2P‖ ≤ C.   (C.90)
For the lower bound, since P has unit diagonals and h_k ≥ c_1, we can lower bound the diagonal elements of PH^2P ∘ PH^2P by c_1^4. Since g ∈ R^K is a non-negative vector with entries summing to 1, the diagonal elements of V = diag(Pg) are no more than max_{k,ℓ} P_{kℓ} ≤ ‖P‖ ≤ c_2. Therefore, each entry of the vector g′V^{-1} is at least c_1/c_2.
Since P H P ◦ P H P ∈ R ( K,K ) is non-negative matrixand g (cid:48) V − ∈ R ( K is non-negative vector, we can lower bound g (cid:48) V − ( P H P ◦ P H P ) V − g ≥ c (cid:107) g (cid:48) V − (cid:107) ≥ C, (C.91)Combining (C.90)-(C.91), we completes the proof of the first claim.Consider the second claim. Introduce the following event A n = (cid:8)(cid:98) Π ( K ) = Π , up to a permutation in the columns of (cid:98) Π ( K ) (cid:9) . (C.92)By Theorem 2.1, when m = K , SCORE exactly recovers Π with probability 1 − o ( n − ), i.e., P ( A cn ) ≤ Cn − = o (1) . This means if we replace every (cid:98) Π ( K ) in the definition of B ( K ) n with Π, and denote the resultingquantity as B ( K, n , the above inequality immediately implies that B ( K ) n /B ( K, n p → 1. So we onlyneed to prove B ( K, n /b n p → 1. Since we will never use the original definition of B ( K ) n in the restof the proof, without causing any confusion we will suspend the original definitions of B ( K ) n andthe quantities used to define B ( K ) n , including (ˆ θ, ˆ g, (cid:98) V , (cid:98) P , (cid:98) H ), and use them to actually denotethe correspondents with every (cid:98) Π ( K ) replaced by Π.Recall the formulas for B ( K ) n and b n in (1.15) and (4.13), we have B ( K ) n b n = (cid:107) ˆ θ (cid:107) (cid:107) θ (cid:107) · ˆ g (cid:48) (cid:98) V − ( (cid:98) P (cid:98) H (cid:98) P ◦ (cid:98) P (cid:98) H (cid:98) P ) (cid:98) V − ˆ gg (cid:48) V − ( P H P ◦ P H P ) V − g . (C.93)To show that B ( K ) n /b n → 1, we need the follow lemma, which is proved in Section D.2. Lemma C.1. Suppose the conditions of Theorem 2.1 hold. Let n ∈ R n be the vector of ’s,and let k ∈ R n be the vector such that k ( i ) = 1 { i ∈ N k } , for ≤ i ≤ n and ≤ k ≤ K . As n → ∞ , for all ≤ k ≤ K , (cid:48) n A n (cid:48) n Ω n p → , (cid:48) k A n (cid:48) k Ω n p → , (cid:48) k A k (cid:48) k Ω k p → . Moreover, let d i be the degree of node i and let d ∗ i = (Ω n ) i , for ≤ i ≤ n . 
Write D = diag( d ) ∈ R n,n and D ∗ = diag( d ∗ ) ∈ R n,n . As n → ∞ , for all ≤ k ≤ K , (cid:107) ˆ θ (cid:107) (cid:107) θ (cid:107) p → , (cid:107) ˆ θ (cid:107)(cid:107) θ (cid:107) p → , (cid:48) k D k (cid:48) k ( D ∗ ) k p → . First, by Lemma C.1, (cid:107) ˆ θ (cid:107) / (cid:107) θ (cid:107) p → 1. It follows from the continuous mapping theorem that (cid:107) ˆ θ (cid:107) / (cid:107) θ (cid:107) p → . (C.94)Second, recall that g k = ( (cid:48) k θ ) / (cid:107) θ (cid:107) and ˆ g k = ( (cid:48) k ˆ θ ) / (cid:107) ˆ θ (cid:107) , where by (1.10), we have the equality (cid:48) k ˆ θ = ( (cid:48) k d ) · (cid:112) (cid:48) k A k / ( (cid:48) k A n ). Here, keep in mind that we have replaced (cid:98) Π ( K ) with Π, whichimplies that ˆ k = k . The vector d is such that d = A n . It follows that (cid:48) k ˆ θ = (cid:112) (cid:48) k A k .Furthermore, (cid:48) k Ω k = ( (cid:48) k θ ) , because P has unit diagonals. Combining the above givesˆ g k g k = (cid:48) k ˆ θ (cid:48) k θ · (cid:107) θ (cid:107) (cid:107) ˆ θ (cid:107) = (cid:112) (cid:48) k A k (cid:112) (cid:48) k Ω k · (cid:107) θ (cid:107) (cid:107) ˆ θ (cid:107) p → , ≤ k ≤ K. (C.95)Third, note that by definition and basic algebra, both P and (cid:98) P have unit diagonals. We comparetheir off-diagonals. By (1.10), (cid:98) P k(cid:96) = (cid:48) k A (cid:96) / (cid:112) ( (cid:48) k A k )( (cid:48) (cid:96) A (cid:96) ). At the same time, it can beeasily verified that P k(cid:96) = (cid:48) k Ω (cid:96) / (cid:112) ( (cid:48) k Ω k )( (cid:48) (cid:96) Ω (cid:96) ). Introduce X = (cid:112) ( (cid:48) k Ω k )( (cid:48) (cid:96) Ω (cid:96) ) (cid:112) ( (cid:48) k A k )( (cid:48) (cid:96) A (cid:96) ) . 53y Lemma C.1, X p → 1. 
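The convergence of these ratio quantities can be illustrated by simulation. The sketch below is a hypothetical setup (the parameter values and the helper `plug_in` are ours, not the paper's): it forms the analogue of the plug-in quantity 1′_k A 1_ℓ / √((1′_k A 1_k)(1′_ℓ A 1_ℓ)) from a sampled DCBM adjacency matrix and checks that it is entrywise close to P:

```python
import numpy as np

rng = np.random.default_rng(3)
n, K = 4000, 3
theta = rng.uniform(0.05, 0.15, size=n)          # degree heterogeneity parameters
labels = rng.integers(0, K, size=n)
Pi = np.eye(K)[labels]                           # n x K membership matrix
P = np.full((K, K), 0.4)
np.fill_diagonal(P, 1.0)                         # unit diagonals, as in (1.4)
Omega = np.outer(theta, theta) * (Pi @ P @ Pi.T)

# sample a symmetric adjacency matrix with no self-edges
U = rng.uniform(size=(n, n))
A = np.triu((U < Omega).astype(float), 1)
A = A + A.T

def plug_in(M):
    # B_{kl} = 1_k' M 1_l, normalized by the square roots of its diagonal
    B = Pi.T @ M @ Pi
    d = np.sqrt(np.diag(B))
    return B / np.outer(d, d)

assert np.allclose(plug_in(Omega), P)            # the population version recovers P exactly
assert np.max(np.abs(plug_in(A) - P)) < 0.05     # the sampled version is close
```

Since 1′_k Ω 1_ℓ = P_{kℓ}(1′_k θ)(1′_ℓ θ), the population ratio equals P_{kℓ} exactly, and the sampled ratio concentrates around it as the community edge counts grow.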
We re-write (cid:98) P k(cid:96) − P k(cid:96) = (cid:48) k A (cid:96) − (cid:48) k Ω (cid:96) (cid:112) ( (cid:48) k A k )( (cid:48) (cid:96) A (cid:96) ) + P k(cid:96) ( X − 1) = (cid:48) k W (cid:96) ( (cid:48) k θ )( (cid:48) (cid:96) θ ) X + P k(cid:96) ( X − , where in the last inequality we have used the fact that (cid:48) k Ω k = ( (cid:48) k θ ) for all 1 ≤ k ≤ K . Notethat E [ (cid:48) k W (cid:96) ] = 0. Moreover, Var( W ij ) ≤ (cid:107) P (cid:107) max θ i θ j ≤ Cθ i θ j . It follows that Var( (cid:48) k W (cid:96) ) ≤ C ( (cid:48) k θ )( (cid:48) (cid:96) θ ). Therefore, E (cid:20) (cid:48) k W (cid:96) ( (cid:48) k θ )( (cid:48) (cid:96) θ ) (cid:21) ≤ C ( (cid:48) k θ )( (cid:48) (cid:96) θ ) = O ( (cid:107) θ (cid:107) − ) = o (1) . Hence, (cid:48) k W (cid:96) ( (cid:48) k θ )( (cid:48) (cid:96) θ ) p → 0. Combining the above results, we have (cid:98) P k(cid:96) − P k(cid:96) p → , ≤ k, (cid:96) ≤ K. (C.96)Fourth, since V = diag( P g ) and (cid:98) V = diag( (cid:98) P ˆ g ), it follows from (C.95) and (C.96) that (cid:98) V kk /V kk p → , ≤ k ≤ K. (C.97)Last, note that H , (cid:98) H ∈ R K,K are diagonal matrices, with k -th diagonal elements being h k andˆ h k , respectively. By (1.14), ˆ h k = ( (cid:48) k (cid:98) Θ k ) / (cid:107) ˆ θ (cid:107) . In addition, by (1.10), for any i ∈ N k , wehave ˆ θ i = d i ( (cid:48) k A k ) / ( (cid:48) k A n ) . We thus re-write (cid:98) H kk ≡ ˆ h k = ( (cid:48) k D k ) · ( (cid:48) k A k )( (cid:48) k A n ) · (cid:107) ˆ θ (cid:107) . Additionally, h k = ( (cid:48) k Θ k ) / (cid:107) θ (cid:107) , as defined in the paragraph below (4.13). By direct calcula-tions, ( (cid:48) k Ω n ) / (cid:112) (cid:48) k Ω k = (cid:2) ( (cid:48) k θ ) (cid:80) (cid:96) P k(cid:96) ( (cid:48) (cid:96) θ ) (cid:3) / ( (cid:48) k θ ) = (cid:80) (cid:96) P k(cid:96) ( (cid:48) (cid:96) θ ). Also, for any i ∈ N k , wehave d ∗ i = (Ω n ) i = θ i [ (cid:80) (cid:96) P k(cid:96) ( (cid:48) (cid:96) θ )]. 
It implies that (cid:48) k ( D ∗ ) k = ( (cid:48) k Θ k )[ (cid:80) (cid:96) P k(cid:96) ( (cid:48) (cid:96) θ )] . Wecan use these expressions to verify that H kk ≡ h k = [ (cid:48) k ( D ∗ ) k ] · ( (cid:48) k Ω k )( (cid:48) k Ω n ) · (cid:107) θ (cid:107) . We apply Lemma C.1 to obtain that (cid:98) H kk /H kk p → , ≤ k ≤ K. (C.98)We plug (C.94), (C.95), (C.96), (C.97) and (C.98) into (C.93). It follows from elementaryprobability that B ( K ) n /b n → 1. This gives the second claim. C.5 Proof of Lemma 4.5 Recall N ( m, , N ( m, , ..., N ( m, m are “fake” communities associated with Π , and we decomposethe vector n ∈ R n as follows n = m (cid:88) k =1 ( m, k , where ( m, k ( j ) = 1 if j ∈ N ( m, k and 0 otherwise. (C.99)Notice for Π ∈ G m defined in (4.18), there exists an K × m matrix L such that Π = Π L .By definitions, Ω ( m, = Θ ( m, Π P ( m, Π (cid:48) Θ ( m, . Here Θ ( m, and P ( m, are obtained byreplacing ( d i , ˆ k , A ) by ( d ∗ i , ( m, k , Ω) in the definition (1.10). It yields that, for 1 ≤ k, (cid:96) ≤ m and i ∈ N ( m, k , θ ( m, i = d ∗ i ( ( m, k ) (cid:48) Ω n · (cid:113) ( ( m, k ) (cid:48) Ω ( m, k , P ( m, k(cid:96) = ( ( m, k ) (cid:48) Ω( ( m, (cid:96) ) (cid:113) ( ( m, k ) (cid:48) Ω ( m, k (cid:113) ( ( m, (cid:96) ) (cid:48) Ω ( m, (cid:96) . 54s a result, for i ∈ N ( m, k and j ∈ N ( m, (cid:96) ,Ω ( m, ij = θ ( m, i θ ( m, j P ( m, k(cid:96) = d ∗ i d ∗ j · ( ( m, k ) (cid:48) Ω ( m, (cid:96) [( ( m, k ) (cid:48) Ω n ] · [( ( m, (cid:96) ) (cid:48) Ω n ] . (C.100)Note that ( ( m, k ) (cid:48) Ω ( m, (cid:96) = (Π (cid:48) ΩΠ ) k(cid:96) . Since Ω = ΘΠ P Π (cid:48) Θ and D = Π (cid:48) ΘΠ, we immediatelyhave Π (cid:48) ΩΠ = Π (cid:48) ΘΠ P Π (cid:48) Θ (cid:48) Π = D P D (cid:48) . It follows that( ( m, k ) (cid:48) Ω ( m, (cid:96) = ( D P D (cid:48) ) k(cid:96) , ≤ k, (cid:96) ≤ m. Similarly, ( ( m, k ) (cid:48) Ω n = ( e (cid:48) k Π (cid:48) )Ω(Π K ) = e (cid:48) k Π (cid:48) ΘΠ P Π (cid:48) ΘΠ K = e (cid:48) k D P D K . 
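The identity Π′_0 Ω Π_0 = D_0 P D′_0 with D_0 = Π′_0 Θ Π can be verified on a toy DCBM; the community sizes, θ's, and merging matrix L below are hypothetical choices for illustration, with the K = 4 true communities merged into m = 2 pseudo-communities:

```python
import numpy as np

rng = np.random.default_rng(2)
n, K, m = 30, 4, 2

theta = rng.uniform(0.2, 1.0, size=n)
labels = rng.integers(0, K, size=n)
Pi = np.eye(K)[labels]                      # n x K true membership matrix
P = np.full((K, K), 0.3)
np.fill_diagonal(P, 1.0)                    # unit diagonals, as in (1.4)
Omega = np.outer(theta, theta) * (Pi @ P @ Pi.T)

# merge communities {1, 2} and {3, 4} into m = 2 pseudo-communities
L = np.array([[1, 0], [1, 0], [0, 1], [0, 1]], dtype=float)
Pi0 = Pi @ L                                # n x m pseudo-membership, Pi0 = Pi L

D0 = Pi0.T @ np.diag(theta) @ Pi            # m x K, D0 = Pi0' Theta Pi
assert np.allclose(Pi0.T @ Omega @ Pi0, D0 @ P @ D0.T)
```

The identity is purely algebraic: Π′_0 Ω Π_0 = (Π′_0 Θ Π) P (Π′_0 Θ Π)′, so it holds for any merging matrix L, not just the one above.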
This gives( ( m, k ) (cid:48) Ω n = diag( D P D K ) kk , ≤ k, (cid:96) ≤ m. We plug the above equalities into (C.100). It follows that, for i ∈ N ( m, k and j ∈ N ( m, (cid:96) ,Ω ( m, ij = d ∗ i d ∗ j · (cid:2) (diag( D P D K )) − D P D (cid:48) (diag( D P D K )) − (cid:3) k(cid:96) . (C.101)Write for short M = [diag( D P D K )] − ( D P D (cid:48) )[diag( D P D K )] − . (C.102)Then, (C.101) can be written equivalently asΩ ( m, ij = d ∗ i d ∗ j · m (cid:88) k,(cid:96) =1 M k(cid:96) · { i ∈ N ( m, k } · { j ∈ N ( m, (cid:96) } . By definition, L ( u, k ) = 1 {N u ⊂ N ( m, k } , for 1 ≤ u ≤ K and 1 ≤ k ≤ m . Therefore, we have theequalities: 1 { i ∈ N ( m, k } = (cid:80) Ku =1 L ( u, k ) · { i ∈ N u } and 1 { j ∈ N ( m, (cid:96) } = (cid:80) Kv =1 L ( v, (cid:96) ) · { j ∈N v } . Combining them with the above equation givesΩ ( m, ij = d ∗ i d ∗ j · K (cid:88) u,v =1 { i ∈ N u } · { j ∈ N v } m (cid:88) k,(cid:96) =1 L ( u, k ) L ( v, (cid:96) ) M k(cid:96) = d ∗ i d ∗ j · K (cid:88) u,v =1 { i ∈ N u } · { j ∈ N v } · ( L M L (cid:48) ) uv . (C.103)By definition, d ∗ = Ω n = Ω(Π K ). Since Ω = ΘΠ P Π (cid:48) Θ, we immediately have d ∗ i = θ i · π (cid:48) i P Π (cid:48) ΘΠ K = θ i · π (cid:48) i P D K = θ i · K (cid:88) u =1 diag( P D K ) uu · { i ∈ N u } . Similarly, we have d ∗ j = θ i · (cid:80) Kv =1 diag( P D K ) vv · { j ∈ N v } . Plugging the expressions of ( d ∗ i , d ∗ j )into (C.103) givesΩ ( m, ij = θ i θ j K (cid:88) u,v =1 { i ∈ N u } { j ∈ N v } diag( P D K ) uu ( L M L (cid:48) ) uv diag( P D K ) vv = θ i θ j · π (cid:48) i (cid:2) diag( P D K ) L M L (cid:48) diag( P D K ) (cid:3) π j . (C.104)Combining it with the expression of M in (C.102) gives the claim.55 .6 Proof of Lemma 4.6 The claim of c n (cid:16) (cid:107) θ (cid:107) is proved in Lemma 4.1. 
To prove the claim of λ₁ ≍ ‖θ‖², we note that by Lemma 3.1, λ_k = ‖θ‖²·λ_k(HPH), where H is the diagonal matrix such that H_kk = ‖θ^(k)‖/‖θ‖. By the condition (2.2), all the diagonal entries of H are bounded between [c₀, 1] for a constant c₀ ∈ (0,1), so λ₁(HPH) ≍ λ₁(P). Since λ₁(P) ≥ P₁₁ = 1 and λ₁(P) ≤ ‖P‖ ≤ C, we have λ₁(P) ≍ 1, and so λ₁ ≍ ‖θ‖²λ₁(P) ≍ ‖θ‖².

We then prove the claims related to the matrix Ω̃. First, we show the upper bound of |Ω̃_ij| and the lower bound of tr(Ω̃⁴). Recall that Ω̃ = Ω − Ω^(m,0). By Lemma 4.5, Ω^(m,0) = ΘΠP^(m)Π′Θ for a rank-m matrix P^(m). It follows that

Ω̃ = ΘΠ(P − P^(m))Π′Θ. (C.105)

Let H be the same diagonal matrix as above. It can be easily verified that ‖θ‖²·H² = Π′Θ²Π. This means that the matrix U = ‖θ‖⁻¹ΘΠH⁻¹ satisfies the equality U′U = I_K. As a result, we can write Ω̃ = U·(‖θ‖²·H(P−P^(m))H)·U′. Since U contains orthonormal columns, the nonzero eigenvalues of Ω̃ are the same as the nonzero eigenvalues of ‖θ‖²·H(P−P^(m))H, i.e.,

λ̃_k = ‖θ‖²·λ_k(H(P−P^(m))H), 1 ≤ k ≤ K.

In particular, |λ̃₁| = ‖θ‖²·‖H(P−P^(m))H‖ ≍ ‖θ‖²·‖P−P^(m)‖ ≍ λ₁‖P−P^(m)‖, where we have used ‖H‖ ≍ ‖H⁻¹‖ ≍ 1 and λ₁ ≍ ‖θ‖². Combining it with the definition of τ gives

τ ≍ ‖P − P^(m)‖. (C.106)

Consider |Ω̃_ij|. By (C.105), |Ω̃_ij| = θ_iθ_j·|π′_i(P−P^(m))π_j| ≤ θ_iθ_j·C‖P−P^(m)‖. We plug in (C.106) to get |Ω̃_ij| ≤ Cτθ_iθ_j, for 1 ≤ i, j ≤ n.
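The rotation argument above (the nonzero spectrum of ΘΠΔΠ′Θ equals ‖θ‖² times the spectrum of HΔH) can be checked numerically. The sketch below uses an arbitrary symmetric K×K matrix Δ in place of P − P^(m); all parameter values are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)
n, K = 30, 3
theta = rng.uniform(0.2, 1.0, n)
labels = np.arange(n) % K
Pi = np.eye(K)[labels]
Theta = np.diag(theta)

Delta = rng.standard_normal((K, K))
Delta = (Delta + Delta.T) / 2                 # any symmetric K x K matrix (stands in for P - P^(m))
Omega_tilde = Theta @ Pi @ Delta @ Pi.T @ Theta

norm_theta = np.linalg.norm(theta)
# H_kk = ||theta restricted to community k|| / ||theta||
H = np.diag([np.linalg.norm(theta[labels == k]) for k in range(K)]) / norm_theta

# full spectrum of Omega_tilde = {||theta||^2 * eig(H Delta H)} plus n-K zeros
spec_big = np.sort(np.linalg.eigvalsh(Omega_tilde))
spec_small = np.sort(np.concatenate([norm_theta**2 * np.linalg.eigvalsh(H @ Delta @ H),
                                     np.zeros(n - K)]))
ok = np.allclose(spec_big, spec_small, atol=1e-8)
```

This works because U = ‖θ‖⁻¹ΘΠH⁻¹ has exactly orthonormal columns, so the similarity is exact, not asymptotic.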
Consider tr(Ω̃⁴). We have seen that |λ̃₁| ≍ ‖θ‖²·‖P−P^(m)‖ ≍ τ‖θ‖². As a result, tr(Ω̃⁴) ≥ λ̃₁⁴ ≥ Cτ⁴‖θ‖⁸.

Next, we study the order of τ. Note that Ω = Ω^(m,0) + Ω̃. We aim to apply Weyl's inequality. In our notation, λ_k(·) refers to the k-th largest eigenvalue (in magnitude) of a symmetric matrix. As a result, |λ_k(·)| is the k-th singular value. By Weyl's inequality for singular values (equation (7.3.13) of [9]), we have

|λ_{r+s−1}(Ω)| ≤ |λ_r(Ω^(m,0))| + |λ_s(Ω̃)|, for r, s ≥ 1 with r+s−1 ≤ n.

Since Ω^(m,0) only has m nonzero eigenvalues, by taking r = m+1 and s = k in the above, we immediately have

|λ_{m+k}(Ω)| ≤ |λ_k(Ω̃)| = |λ̃_k|, 1 ≤ k ≤ K−m. (C.107)

In particular, |λ̃₁| ≥ |λ_{m+1}| ≥ |λ_K|. At the same time, λ₁ ≍ ‖θ‖² and, by definition, τ = |λ̃₁|/λ₁. It follows that τ‖θ‖ ≥ (|λ_K|/λ₁)·‖θ‖ ≥ C(|λ_K|/√λ₁) → ∞. This gives τ‖θ‖ → ∞.

We then prove τ ≤ C. In light of (C.106), it suffices to show ‖P^(m)‖ ≤ C. Consider the expression of P^(m) in Lemma 4.5. It is easy to see that ‖L‖ ≤ C, ‖D₀PD₀′‖ ≤ C‖θ‖₁², and ‖diag(PD₁1_K)‖ ≤ C‖θ‖₁, where D₀ = (Π^(m))′ΘΠ and D₁ = Π′ΘΠ. As a result,

‖P^(m)‖ ≤ C‖θ‖₁⁴·‖[diag(D₀PD₁1_K)]⁻¹‖². (C.108)

Since D₀ = (Π^(m))′ΘΠ and D₁ = Π′ΘΠ, it is true that D₀PD₁1_K = (Π^(m))′ΘΠPΠ′ΘΠ1_K = (Π^(m))′ΘΠPΠ′Θ1_n = (Π^(m))′Ω1_n. Then, for each 1 ≤ k ≤ m,

[diag(D₀PD₁1_K)]_kk = ((Π^(m))′Ω1_n)_k = Σ_{i∈N_k^(m)} d*_i,

where d* = Ω1_n, and N_1^(m), N_2^(m), ..., N_m^(m) are the pseudo-communities defined by Π^(m). Suppose i ∈ N_ℓ for some true community N_ℓ.
Then, d*_i ≥ Σ_{j∈N_ℓ}θ_iθ_jP_ℓℓ = θ_i‖θ^(ℓ)‖₁ ≥ Cθ_i‖θ‖₁. Moreover, for any Π^(m) ∈ G_m, each pseudo-community N_k^(m) is the union of one or more true communities. It yields that Σ_{i∈N_k^(m)}θ_i ≥ min_{1≤ℓ≤K}{‖θ^(ℓ)‖₁} ≥ C‖θ‖₁. Combining these results gives Σ_{i∈N_k^(m)}d*_i ≥ C‖θ‖₁². This shows that each diagonal entry of diag(D₀PD₁1_K) is lower bounded by C‖θ‖₁². We immediately have

‖[diag(D₀PD₁1_K)]⁻¹‖ ≤ C‖θ‖₁⁻². (C.109)

Combining (C.108) and (C.109) gives ‖P^(m)‖ ≤ C. The claim τ ≤ C then follows from (C.106).

C.7 Proof of Lemma 4.7

Recall that W = A − Ω. Given any n×n symmetric matrix T, we can define a random variable as follows:

Q_W(T) = Σ_{i₁,i₂,i₃,i₄ (dist)} (W_{i₁i₂}+T_{i₁i₂})(W_{i₂i₃}+T_{i₂i₃})(W_{i₃i₄}+T_{i₃i₄})(W_{i₄i₁}+T_{i₄i₁}). (C.110)

Then, Q̃_n^(m) is a special case with T = Ω̃^(m,0), where Ω̃^(m,0) is defined in (4.23). We study the general form of Q_W(T). By an expansion of each summand, we can write Q_W(T) as the sum of 2⁴ = 16 post-expansion sums. Each post-expansion sum takes the form

X = Σ_{i₁,i₂,i₃,i₄ (dist)} a_{i₁i₂}b_{i₂i₃}c_{i₃i₄}d_{i₄i₁}, (C.111)

where each of a_ij, b_ij, c_ij, d_ij may take value in {W_ij, T_ij}. We divide the 16 post-expansion sums into 6 types and compute the mean and variance of each of them (see Table 2 for the special case of T = Ω̃^(m,0)). For example, the post-expansion sum Σ_{i₁,i₂,i₃,i₄ (dist)} T_{i₁i₂}T_{i₂i₃}T_{i₃i₄}T_{i₄i₁} is non-stochastic and has a zero variance. Its mean equals tr(T⁴) − ∆, where ∆ collects the terms T_{i₁i₂}T_{i₂i₃}T_{i₃i₄}T_{i₄i₁} in which some of the indices (i₁,i₂,i₃,i₄) are equal.
As another example, the post-expansion sum Σ_{i₁,i₂,i₃,i₄ (dist)} W_{i₁i₂}W_{i₂i₃}W_{i₃i₄}W_{i₄i₁} has a zero mean, and since the summands are mutually uncorrelated, its variance is of the order of Σ_{i₁,i₂,i₃,i₄ (dist)} Ω*_{i₁i₂}Ω*_{i₂i₃}Ω*_{i₃i₄}Ω*_{i₄i₁}, where Ω*_ij = Ω_ij(1 − Ω_ij).

Table 2: The 6 types of the 16 post-expansion sums of Q̃_n^(m). Notations: τ = λ̃₁^(m,0)/λ₁, ‖θ‖⁻¹ ≪ τ ≤ C, and ‖θ‖₃³ ≪ ‖θ‖² ≪ ‖θ‖₁.

Type I (N=1; (N_Ω̃, N_W)=(0,4)): X = Σ W_{i₁i₂}W_{i₂i₃}W_{i₃i₄}W_{i₄i₁}; Mean 0; Variance ≍ ‖θ‖⁸.
Type II (N=4; (1,3)): X = Σ Ω̃_{i₁i₂}W_{i₂i₃}W_{i₃i₄}W_{i₄i₁}; Mean 0; Variance ≤ Cτ²‖θ‖₃⁶‖θ‖⁴ = o(‖θ‖⁸).
Type IIIa (N=4; (2,2)): X = Σ Ω̃_{i₁i₂}Ω̃_{i₂i₃}W_{i₃i₄}W_{i₄i₁}; Mean 0; Variance ≤ Cτ⁴‖θ‖₃⁶‖θ‖⁶.
Type IIIb (N=2; (2,2)): X = Σ Ω̃_{i₁i₂}W_{i₂i₃}Ω̃_{i₃i₄}W_{i₄i₁}; Mean 0; Variance ≤ Cτ⁴‖θ‖₃¹² = o(‖θ‖⁸).
Type IV (N=4; (3,1)): X = Σ Ω̃_{i₁i₂}Ω̃_{i₂i₃}Ω̃_{i₃i₄}W_{i₄i₁}; Mean 0; Variance ≤ Cτ⁶‖θ‖₃⁶‖θ‖₄⁸.
Type V (N=1; (4,0)): X = Σ Ω̃_{i₁i₂}Ω̃_{i₂i₃}Ω̃_{i₃i₄}Ω̃_{i₄i₁}; Mean ∼ tr(Ω̃⁴); Variance 0.

Here we omit the calculation details, because similar calculations were done in [17]. In their Theorem 4.4, they analyzed Q_W(T) for T equal to a rank-1 matrix (denoted by Ω̃ there). However, their proof does not rely on the condition that Ω̃ is rank-1 and applies to any symmetric matrix. They actually proved the following lemma:

Lemma C.2. Consider a DCBM model where (2.1)-(2.2) and (2.4) hold. Let W = A − Ω and let Q_W(T) be the random variable defined in (C.110).
As n → ∞, suppose there is a constant C > 0 and a scalar α_n > 0 such that α_n ≤ C, α_n‖θ‖ → ∞, and |T_ij| ≤ Cα_nθ_iθ_j for all 1 ≤ i, j ≤ n. Then, E[Q_W(T)] = tr(T⁴) + o(‖θ‖⁴) and Var(Q_W(T)) ≤ C(‖θ‖⁸ + α_n⁴‖θ‖₃⁶‖θ‖⁶).

We now set T = Ω̃^(m,0) and verify the conditions of Lemma C.2. Recall that τ = λ̃₁/λ₁, where λ̃₁ and λ₁ are the respective largest (in magnitude) eigenvalues of Ω̃^(m,0) and Ω. By Lemma 4.6,

τ ≤ C,  τ‖θ‖ → ∞,  |Ω̃_ij^(m,0)| ≤ Cτθ_iθ_j, for all 1 ≤ i, j ≤ n.

Therefore, we can apply Lemma C.2 with α_n = τ. The claim follows immediately.

C.8 Proof of Lemma 4.8

Before proceeding, recall from (4.24) that

Q̃_n^(m) = Σ_{i₁,i₂,i₃,i₄ (dist)} (W_{i₁i₂}+Ω̃^(m,0)_{i₁i₂})(W_{i₂i₃}+Ω̃^(m,0)_{i₂i₃})(W_{i₃i₄}+Ω̃^(m,0)_{i₃i₄})(W_{i₄i₁}+Ω̃^(m,0)_{i₄i₁}).

Here Ω̃^(m,0) = Ω − Ω^(m,0) and Ω^(m,0) is as in (4.21). By Lemma 4.5, Ω^(m,0) = ΘΠP^(m)Π′Θ, for a rank-m matrix P^(m). If m = K and Π^(m) = Π, it can be verified that P^(m) = P. Therefore, Ω^(m,0) = Ω, and Ω̃^(m,0) reduces to a zero matrix. In this case, Q̃_n^(m) reduces to Q̃_n in (4.12). It means that we can treat Lemma 4.3 as a "special case" of Lemma 4.8, with Ω̃^(m,0) being a zero matrix. We thus combine the proofs of the two lemmas.

We now show the claim. First, we introduce two proxies of Q_n^(m). By definition,

Q_n^(m) = Σ_{i₁,i₂,i₃,i₄ (dist)} (A_{i₁i₂}−Ω̂^(m)_{i₁i₂})(A_{i₂i₃}−Ω̂^(m)_{i₂i₃})(A_{i₃i₄}−Ω̂^(m)_{i₃i₄})(A_{i₄i₁}−Ω̂^(m)_{i₄i₁}).

By (4.19), Ω̂^(m) is defined by θ̂, Π^(m), and P̂. For 1 ≤ k ≤ m, let N_k^(m) and 1_k^(m) be the same as in (C.99). Then, (θ̂, P̂) are obtained by replacing 1̂_k with 1_k^(m) in (1.10). For the rest of the proof, we write 1_k = 1_k^(m) for short.
It follows that, for 1 ≤ k, ℓ ≤ m and i ∈ N_k^(m),

θ̂_i = d_i·√(1′_kA1_k)/(1′_kA1_n),   P̂_kℓ = 1′_kA1_ℓ/√((1′_kA1_k)(1′_ℓA1_ℓ)), with 1_k = 1_k^(m) (for short).

We plug it into (4.19) and note that d = A1_n. It yields that, for i ∈ N_k^(m) and j ∈ N_ℓ^(m),

Ω̂_ij^(m) = d_i d_j·Û_kℓ^(m), where Û_kℓ^(m) = 1′_kA1_ℓ/[(1′_kd)(1′_ℓd)]. (C.112)

At the same time, in (C.100), we have seen that (recall: d* = Ω1_n)

Ω_ij^(m,0) = d*_i d*_j·U*_kℓ^(m), where U*_kℓ^(m) = 1′_kΩ1_ℓ/[(1′_kd*)(1′_ℓd*)]. (C.113)

Note that (Ω, d*) are approximately (E[A], E[d]) but there is a subtle difference. We thus introduce an intermediate quantity:

U_kℓ^(m) = 1′_kE[A]1_ℓ/[(1′_kE[d])(1′_ℓE[d])]. (C.114)

We now use (C.112)-(C.114) to decompose (A_ij − Ω̂_ij^(m)). Recall that Ω̃_ij^(m,0) = Ω_ij − Ω_ij^(m,0). We immediately have

A_ij − Ω̂_ij^(m) = W_ij + Ω̃_ij^(m,0) + (Ω_ij^(m,0) − Ω̂_ij^(m)). (C.115)

From now on, we omit the superscript "(m)" in Û_kℓ^(m), U*_kℓ^(m) and U_kℓ^(m), and rewrite them as Û_kℓ, U*_kℓ, and U_kℓ, respectively. By (C.112)-(C.114),

Ω_ij^(m,0) − Ω̂_ij^(m) = d*_i d*_j U*_kℓ − d_i d_j Û_kℓ = [d*_i d*_j U*_kℓ − (Ed_i)(Ed_j)U_kℓ] + U_kℓ[(Ed_i)(Ed_j) − d_i d_j] + (U_kℓ − Û_kℓ)d_i d_j.

It turns out that the term U_kℓ[(Ed_i)(Ed_j) − d_i d_j] is the "dominating" term. This term does not have an exactly zero mean, and so we introduce a proxy to this term:

δ_ij^(m) = U_kℓ[(Ed_i)(Ed_j − d_j) + (Ed_j)(Ed_i − d_i)]. (C.116)

Note that U_kℓ[(Ed_i)(Ed_j) − d_i d_j] = δ_ij^(m) − U_kℓ(d_i − Ed_i)(d_j − Ed_j). We then have

Ω_ij^(m,0) − Ω̂_ij^(m) = [d*_i d*_j U*_kℓ − (Ed_i)(Ed_j)U_kℓ] + [δ_ij^(m) − U_kℓ(d_i−Ed_i)(d_j−Ed_j)] + (U_kℓ−Û_kℓ)d_i d_j
= δ_ij^(m) + [d*_i d*_j U*_kℓ − (Ed_i)(Ed_j)U_kℓ] − U_kℓ(d_i−Ed_i)(d_j−Ed_j) + (U_kℓ−Û_kℓ)(Ed_i)(Ed_j) + (U_kℓ−Û_kℓ)[(Ed_i)(d_j−Ed_j)+(Ed_j)(d_i−Ed_i)] + (U_kℓ−Û_kℓ)(d_i−Ed_i)(d_j−Ed_j)
= δ_ij^(m) + r̃_ij^(m) + ε_ij^(m),

where

r̃_ij^(m) = −Û_kℓ(d_i−Ed_i)(d_j−Ed_j) (C.117)

and

ε_ij^(m) = [d*_i d*_j U*_kℓ − (Ed_i)(Ed_j)U_kℓ] + (U_kℓ−Û_kℓ)(Ed_i)(Ed_j) + (U_kℓ−Û_kℓ)[(Ed_i)(d_j−Ed_j)+(Ed_j)(d_i−Ed_i)]. (C.118)

We plug the above results into (C.115) to get

A_ij − Ω̂_ij^(m) = Ω̃_ij^(m,0) + W_ij + δ_ij^(m) + r̃_ij^(m) + ε_ij^(m). (C.119)

We then use (C.119) to define two proxies of Q_n^(m). For any 1 ≤ i ≠ j ≤ n, let

X_ij = Ω̃_ij^(m,0) + W_ij + δ_ij^(m) + r̃_ij^(m) + ε_ij^(m),   X̃*_ij = Ω̃_ij^(m,0) + W_ij + δ_ij^(m) + r̃_ij^(m),
X*_ij = Ω̃_ij^(m,0) + W_ij + δ_ij^(m),   X̃_ij = Ω̃_ij^(m,0) + W_ij. (C.120)

Correspondingly, we introduce

Q_n^(m) = Σ_{(dist)} X_{i₁i₂}X_{i₂i₃}X_{i₃i₄}X_{i₄i₁},   Q̃*_n^(m) = Σ_{(dist)} X̃*_{i₁i₂}X̃*_{i₂i₃}X̃*_{i₃i₄}X̃*_{i₄i₁},
Q*_n^(m) = Σ_{(dist)} X*_{i₁i₂}X*_{i₂i₃}X*_{i₃i₄}X*_{i₄i₁},   Q̃_n^(m) = Σ_{(dist)} X̃_{i₁i₂}X̃_{i₂i₃}X̃_{i₃i₄}X̃_{i₄i₁}.
(C.121)

By comparing it with (4.24), we can see that the above expression of Q̃_n^(m) is the same as before. Additionally, by (C.119), the above expression of Q_n^(m) is also equivalent to its definition. The other two quantities, Q*_n^(m) and Q̃*_n^(m), are the two proxies we introduce here.

Next, we decompose

Q_n^(m) − Q̃_n^(m) = (Q*_n^(m) − Q̃_n^(m)) + (Q̃*_n^(m) − Q*_n^(m)) + (Q_n^(m) − Q̃*_n^(m)).

For any random variables X, Y, Z, we know that E[X+Y+Z] = EX + EY + EZ and Var(X+Y+Z) ≤ 3Var(X) + 3Var(Y) + 3Var(Z). Therefore, to show the claim, we only need to study the mean and variance of each term in the above equation. The next three lemmas are proved in Sections D.3-D.5.

Lemma C.3. Let b_n = 2‖θ‖⁴·[g′V⁻¹(PHP ∘ PHP)V⁻¹g] be the same as in (4.13). Under conditions of Lemma 4.3, it is true that

E[Q*_n^(m) − Q̃_n^(m)] = b_n + o(‖θ‖⁴), and Var(Q*_n^(m) − Q̃_n^(m)) = o(‖θ‖⁸).

Let τ = λ̃₁/λ₁ be the same as in (4.25). Under conditions of Lemma 4.8, it is true that

E[Q*_n^(m) − Q̃_n^(m)] = o(τ⁴‖θ‖⁸), and Var(Q*_n^(m) − Q̃_n^(m)) ≤ Cτ⁴‖θ‖₃⁶‖θ‖⁶ + o(‖θ‖⁸).

Lemma C.4. Under conditions of Lemma 4.3, it is true that

E[Q̃*_n^(m) − Q*_n^(m)] = o(‖θ‖⁴), and Var(Q̃*_n^(m) − Q*_n^(m)) = o(‖θ‖⁸).

Under conditions of Lemma 4.8, it is true that

E[Q̃*_n^(m) − Q*_n^(m)] = o(‖θ‖⁴ + τ⁴‖θ‖⁸), and Var(Q̃*_n^(m) − Q*_n^(m)) = o(‖θ‖⁸ + τ⁴‖θ‖₃⁶‖θ‖⁶).

Lemma C.5.
Under conditions of Lemma 4.3, it is true that

E[Q_n^(m) − Q̃*_n^(m)] = o(‖θ‖⁴), and Var(Q_n^(m) − Q̃*_n^(m)) = o(‖θ‖⁸).

Under conditions of Lemma 4.8, it is true that

E[Q_n^(m) − Q̃*_n^(m)] = o(‖θ‖⁴ + τ⁴‖θ‖⁸), and Var(Q_n^(m) − Q̃*_n^(m)) = o(‖θ‖⁸ + τ⁴‖θ‖₃⁶‖θ‖⁶).

We now prove Lemma 4.3 and Lemma 4.8. By Lemmas C.3-C.5, under the conditions of Lemma 4.3,

E[Q_n^(m) − Q̃_n^(m)] = b_n + o(‖θ‖⁴), and Var(Q_n^(m) − Q̃_n^(m)) = o(‖θ‖⁸),

which implies E[(Q_n^(m) − Q̃_n^(m) − b_n)²] = o(‖θ‖⁸) and completes the proof of Lemma 4.3. Under the conditions of Lemma 4.8, it follows from Lemmas C.3-C.5 that

E[Q_n^(m) − Q̃_n^(m)] = o(τ⁴‖θ‖⁸) and Var(Q_n^(m) − Q̃_n^(m)) ≤ Cτ⁴‖θ‖₃⁶‖θ‖⁶ + o(‖θ‖⁸),

which completes the proof of Lemma 4.8.

C.9 Proof of Lemma 4.9

Let G_m be the class of n×m membership matrices that satisfy NSP (the definition of G_m is in Section 4.2). By Theorem 2.2, Π̂^(m) ∈ G_m with probability 1 − O(n⁻³). Given any Π* ∈ G_m, let B_n^(m)(Π*) be defined in the same way as in (1.15), except that (θ̂, ĝ, V̂, P̂, Ĥ) are defined based on Π* instead of Π̂^(m). Then, with probability 1 − O(n⁻³),

B_n^(m) ≤ max_{Π*∈G_m} B_n(Π*).

It follows from the probability union bound that

P(B_n^(m) > C‖θ‖⁴) ≤ Σ_{Π*∈G_m} P(B_n(Π*) > C‖θ‖⁴) + O(n⁻³).

Since m < K and K is finite, G_m has only a bounded number of elements.
Therefore, it suffices to show that

P(B_n(Π*) > C‖θ‖⁴) = o(1), for each Π* ∈ G_m. (C.122)

We now show (C.122). From now on, we fix Π* ∈ G_m and write B_n(Π*) = B_n for short. By (1.15) and direct calculations,

B_n = 2‖θ̂‖⁴·ĝ′V̂⁻¹(P̂ĤP̂ ∘ P̂ĤP̂)V̂⁻¹ĝ = 2‖θ̂‖⁴·Σ_{1≤k,ℓ≤m} ĝ_kĝ_ℓ[(P̂ĤP̂)_{kℓ}]² / [(P̂′_kĝ)·(P̂′_ℓĝ)],

where P̂_k denotes the k-th column of P̂. We have abused the notations (θ̂, ĝ, V̂, P̂, Ĥ), using them to refer to the counterparts of the original definitions with Π̂^(m) replaced by Π*. Denote by N_1^(m), N_2^(m), ..., N_m^(m) the pseudo-communities defined by Π*. Let 1_k^(m) ∈ R^n be such that 1_k^(m)(i) = 1{i ∈ N_k^(m)}. We write 1_k = 1_k^(m) when there is no confusion. By (1.14),

ĝ_k = (1′_kθ̂)/‖θ̂‖₁,   ĥ_k = (1′_kΘ̂²1_k)/‖θ̂‖², 1 ≤ k ≤ m.

Note that ĝ, ĥ and P̂ all have non-negative entries, and all entries of ĝ and ĥ are further bounded by 1. Moreover, the diagonal entries of P̂ are all equal to 1. It follows that, for all 1 ≤ k, ℓ ≤ m,

0 ≤ ĝ_k ≤ P̂′_kĝ, and 0 ≤ (P̂ĤP̂)_kℓ ≤ (P̂²)_kℓ.

As a result,

B_n ≤ 2‖θ̂‖⁴ Σ_{k,ℓ=1}^m [(P̂²)_kℓ]² ≤ 2‖θ̂‖⁴·m⁴‖P̂‖_max⁴, (C.123)

where ‖·‖_max is the element-wise maximum norm. Below, we study ‖P̂‖_max and ‖θ̂‖ separately. First, we bound ‖P̂‖_max.
By (1.10), P̂_kℓ = (1′_kA1_ℓ)/√((1′_kA1_k)(1′_ℓA1_ℓ)). Write 1′_kA1_ℓ = Σ_{i∈N_k^(m), j∈N_ℓ^(m)} A_ij, where E[A_ij] = Ω_ij, and Σ_{i∈N_k^(m), j∈N_ℓ^(m)} Var(A_ij) ≤ Σ_{i∈N_k^(m), j∈N_ℓ^(m)} Cθ_iθ_j ≤ C(1′_kθ)(1′_ℓθ). We apply Bernstein's inequality [34] to get

P(|1′_kA1_ℓ − 1′_kΩ1_ℓ| > t) ≤ 2exp(−(t²/2)/[C(1′_kθ)(1′_ℓθ) + t/3]), for all t > 0.

By NSP, each pseudo-community N_k^(m) contains at least one true community, say, N_{k*}. Combining it with the condition (2.2) gives 1′_kθ ≥ Σ_{i∈N_{k*}}θ_i ≥ C‖θ‖₁. At the same time, 1′_kθ ≤ ‖θ‖₁. We thus have 1′_kθ ≍ ‖θ‖₁ ≫ √(log(n)). Similarly, we can show that 1′_kΩ1_ℓ ≍ ‖θ‖₁². In the above equation, if we choose t = C‖θ‖₁√(log(n)) for a properly large constant C > 0, then the right hand side is O(n⁻³). In other words, with probability 1 − O(n⁻³),

|1′_kA1_ℓ − 1′_kΩ1_ℓ| ≤ C‖θ‖₁√(log(n)).

Since 1′_kΩ1_ℓ ≍ ‖θ‖₁² ≫ ‖θ‖₁√(log(n)), the above implies 1′_kA1_ℓ ≍ ‖θ‖₁². We combine this result with the probability union bound. It follows that there exists a constant C₀ > 1 such that, with probability 1 − O(n⁻³),

C₀⁻¹‖θ‖₁² ≤ min_{1≤k,ℓ≤m}{1′_kA1_ℓ} ≤ max_{1≤k,ℓ≤m}{1′_kA1_ℓ} ≤ C₀‖θ‖₁². (C.124)

We plug it into the expression of P̂_kℓ above and can easily see that

‖P̂‖_max ≤ C, with probability 1 − O(n⁻³). (C.125)

Second, we bound ‖θ̂‖².
By (1.10), θ̂_i = d_i√(1′_kA1_k)/(1′_kA1_n) for i ∈ N_k^(m). It follows that

‖θ̂‖² = Σ_{k=1}^m (1′_kD²1_k)(1′_kA1_k)/(1′_kA1_n)², where D = diag(d₁, d₂, ..., d_n).

Note that 1′_kA1_n = Σ_{ℓ=1}^m 1′_kA1_ℓ. It follows from (C.124) that 1′_kA1_k ≍ ‖θ‖₁² and 1′_kA1_n ≍ ‖θ‖₁². As a result, ‖θ̂‖² ≤ C‖θ‖₁⁻² Σ_{k=1}^m (1′_kD²1_k). Since Σ_{k=1}^m (1′_kD²1_k) = ‖d‖², we immediately have

‖θ̂‖² ≤ C‖θ‖₁⁻²‖d‖², with probability 1 − O(n⁻³). (C.126)

Recall that d_i = Σ_{j:j≠i}A_ij = Σ_{j:j≠i}(Ω_ij + W_ij). Then,

‖d‖² = Σ_{i=1}^n Σ_{j,s: j≠i, s≠i} (Ω_ij+W_ij)(Ω_is+W_is) = Σ_{i,j,s: j≠i,s≠i} Ω_ijΩ_is + 2Σ_{i≠j}(Σ_{s∉{i,j}}Ω_is)W_ij [≡ X₁] + Σ_{i≠j}W_ij² [≡ X₂] + Σ_{i,j,s (dist)}W_ijW_is [≡ X₃].

Since Σ_{s∉{i,j}}Ω_is ≤ Cθ_i‖θ‖₁, we have E[X₁²] ≤ Σ_{i≠j} Cθ_i²‖θ‖₁²·E[W_ij²] ≤ C‖θ‖₁³‖θ‖₃³. Moreover, X₂ ≥ 0 and E[X₂] = Σ_{i≠j}E[W_ij²] ≤ C‖θ‖₁². Last, E[X₃²] = 2Σ_{i,j,s (dist)} Var(W_ijW_is) ≤ C Σ_{i,j,s} θ_i²θ_jθ_s ≤ C‖θ‖²‖θ‖₁². By Markov's inequality, for any sequence ε_n → 0, with probability 1 − Cε_n,

|X₁| ≤ C√(ε_n⁻¹‖θ‖₁³‖θ‖₃³),  |X₂| ≤ Cε_n⁻¹‖θ‖₁²,  |X₃| ≤ C√(ε_n⁻¹‖θ‖²‖θ‖₁²).
It is not hard to see that we can choose a proper ε_n → 0 such that all three bounds above are o(‖θ‖₁²‖θ‖²). Then, with probability 1 − o(1),

‖d‖² = Σ_{i,j,s: j≠i,s≠i} Ω_ijΩ_is + o(‖θ‖₁²‖θ‖²) ≤ C‖θ‖₁²‖θ‖².

We plug it into (C.126) to get

‖θ̂‖² ≤ C‖θ‖², with probability 1 − o(1). (C.127)

Then, (C.122) follows from plugging (C.125) and (C.127) into (C.123). This proves the claim.

D Proof of secondary lemmas

D.1 Proof of Lemma B.1

Note that for any set M ⊂ {1, 2, ..., n} and z ∈ R^d,

Σ_{i∈M}‖y_i−z‖² = Σ_{i∈M}‖(y_i−ȳ_M)+(ȳ_M−z)‖² = Σ_{i∈M}‖y_i−ȳ_M‖² + 2(ȳ_M−z)′Σ_{i∈M}(y_i−ȳ_M) + |M|·‖ȳ_M−z‖² = Σ_{i∈M}‖y_i−ȳ_M‖² + |M|·‖ȳ_M−z‖².

The clusters associated with RSS are A = Ã ∪ C and B, and the clusters associated with RSS̃ are Ã and B̃ = C ∪ B.
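The decomposition identity above (the cross term vanishes because the y_i are centered at ȳ_M) can be verified numerically with arbitrary random data; the sketch below is illustrative only:

```python
import numpy as np

rng = np.random.default_rng(2)
M = rng.standard_normal((7, 3))   # points y_i in a set M, here in R^3
z = rng.standard_normal(3)        # an arbitrary center z
ybar = M.mean(axis=0)             # the within-set mean

lhs = np.sum(np.sum((M - z) ** 2, axis=1))
rhs = np.sum(np.sum((M - ybar) ** 2, axis=1)) + len(M) * np.sum((ybar - z) ** 2)
ok = np.isclose(lhs, rhs)
```

The same identity, applied to each cluster, is what drives the RSS comparison that follows.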
By direct calculations,

RSS = Σ_{i∈Ã}‖y_i−ȳ_A‖² + Σ_{i∈C}‖y_i−ȳ_A‖² + Σ_{i∈B}‖y_i−ȳ_B‖²
= (Σ_{i∈Ã}‖y_i−ȳ_Ã‖² + |Ã|‖ȳ_Ã−ȳ_A‖²) + (Σ_{i∈C}‖y_i−ȳ_C‖² + |C|‖ȳ_C−ȳ_A‖²) + Σ_{i∈B}‖y_i−ȳ_B‖²,

RSS̃ = Σ_{i∈Ã}‖y_i−ȳ_Ã‖² + Σ_{i∈C}‖y_i−ȳ_B̃‖² + Σ_{i∈B}‖y_i−ȳ_B̃‖²
= Σ_{i∈Ã}‖y_i−ȳ_Ã‖² + (Σ_{i∈C}‖y_i−ȳ_C‖² + |C|‖ȳ_C−ȳ_B̃‖²) + (Σ_{i∈B}‖y_i−ȳ_B‖² + |B|‖ȳ_B−ȳ_B̃‖²).

It follows that

RSS̃ − RSS = (|B|‖ȳ_B−ȳ_B̃‖² + |C|‖ȳ_C−ȳ_B̃‖²) − (|Ã|‖ȳ_Ã−ȳ_A‖² + |C|‖ȳ_C−ȳ_A‖²). (D.128)

By definition,

ȳ_A = [(|A|−|C|)/|A|]·ȳ_Ã + (|C|/|A|)·ȳ_C,   ȳ_B̃ = [|B|/(|B|+|C|)]·ȳ_B + [|C|/(|B|+|C|)]·ȳ_C.

Re-arranging the terms, we have

ȳ_Ã − ȳ_A = [|C|/(|A|−|C|)](ȳ_A−ȳ_C),  ȳ_B̃ − ȳ_B = [|C|/(|B|+|C|)](ȳ_C−ȳ_B),  ȳ_C − ȳ_B̃ = [|B|/(|B|+|C|)](ȳ_C−ȳ_B). (D.129)

We plug (D.129) into (D.128) to get

RSS̃ − RSS = [|B|·|C|²/(|B|+|C|)² + |C|·|B|²/(|B|+|C|)²]‖ȳ_C−ȳ_B‖² − [|Ã|·|C|²/(|A|−|C|)² + |C|]‖ȳ_C−ȳ_A‖²
= [|B||C|/(|B|+|C|)]‖ȳ_C−ȳ_B‖² − [|A||C|/(|A|−|C|)]‖ȳ_C−ȳ_A‖².

This proves the claim.

D.2 Proof of Lemma C.1

Recall that 1_k ∈ R^n is such that 1_k(i) = 1{i ∈ N_k}, D = diag(d₁, d₂, ..., d_n), and d* = Ω1_n. We re-state the claims as

(1′_nA1_n)/(1′_nΩ1_n) →p 1,  (1′_kA1_n)/(1′_kΩ1_n) →p 1,  (1′_kA1_k)/(1′_kΩ1_k) →p 1, (D.130)

and

‖θ̂‖₁/‖θ‖₁ →p 1,  ‖θ̂‖/‖θ‖ →p 1,  (1′_kD²1_k)/(1′_k(D*)²1_k) →p 1. (D.131)

We note that convergence in L²-norm implies convergence in probability. Hence, to show X_n/X →p 1, it is sufficient to show E[(X_n/X − 1)²] → 0. Using the equality E[(X_n/X − 1)²] = (E[X_n/X] − 1)² + Var(X_n/X), we only need to prove that E[X_n]/X → 1 and Var(X_n/X) → 0, for each ratio on the left hand sides of (D.130)-(D.131).

First, we prove the three claims in (D.130). Since the proofs are similar, we only show the proof of the first claim. Note that 1′_nΩ1_n = Σ_{k,ℓ}(1′_kθ)(1′_ℓθ)P_kℓ. Under the conditions (2.1)-(2.2), 1′_nΩ1_n ≍ ‖θ‖₁². Additionally, 1′_n diag(Ω)1_n = ‖θ‖². It follows that

|E[1′_nA1_n]/(1′_nΩ1_n) − 1| = (1′_n diag(Ω)1_n)/(1′_nΩ1_n) ≍ ‖θ‖²/‖θ‖₁² = o(1),

where we have used ‖θ‖² ≤ θ_max‖θ‖₁ ≤ C‖θ‖₁ and ‖θ‖₁ → ∞. Also, since the upper triangular entries of A are independent, Var(1′_nA1_n) = 4Var(Σ_{i<j}A_ij) ≤ C‖θ‖₁², so that Var(1′_nA1_n/1′_nΩ1_n) → 0. The other two claims are proved similarly.

Next, consider (D.131). Recall that d* = Ω1_n and D* = diag(d*). Then, for each k, Σ_{i∈N_k}(d*_i)² ≤ C Σ_{i∈N_k}(θ_i‖θ‖₁)² ≤ C‖θ‖₁²‖θ‖². At the same time, d*_i ≥ θ_iP_kk(1′_kθ) ≥ Cθ_i‖θ‖₁, where we have used the condition (2.2). As a result, Σ_{i∈N_k}(d*_i)² ≥ C‖θ‖₁² Σ_{i∈N_k}θ_i² ≥ C‖θ‖₁²‖θ‖², where we have used (2.2) again.
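The first claim of (D.130) can be illustrated with a small simulation; the parameter values below (n, θ, P) are arbitrary illustrative choices, not from the paper:

```python
import numpy as np

rng = np.random.default_rng(3)
n, K = 600, 2
theta = np.full(n, 0.3)                 # homogeneous degrees for simplicity
labels = np.arange(n) % K
Pi = np.eye(K)[labels]
P = np.array([[1.0, 0.3], [0.3, 1.0]])

Omega = np.outer(theta, theta) * (Pi @ P @ Pi.T)
np.fill_diagonal(Omega, 0.0)            # no self-edges

# symmetric adjacency matrix with independent upper-triangular Bernoulli entries
upper = np.triu((rng.random((n, n)) < Omega), 1)
A = (upper + upper.T).astype(float)

ratio = A.sum() / Omega.sum()           # (1' A 1) / (1' Omega 1), should be close to 1
```

With roughly 10^4 expected edges, the relative fluctuation of the ratio is on the order of one percent, consistent with the variance bound in the proof.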
Combining the above gives

1′_k(D*)²1_k ≍ ‖θ‖₁²‖θ‖². (D.132)

Note that 1′_kD²1_k = Σ_{t∈N_k} d_t² and d_t = Σ_{i:i≠t}A_it. We now write

1′_kD²1_k = Σ_{t∈N_k} Σ_{i:i≠t} A_it² + 2 Σ_{t∈N_k} Σ_{i<j: i,j≠t} A_itA_jt,

and bound the mean and variance of each part in the same way as above; similar calculations show that each ratio in (D.131) converges to 1 in probability. This proves that ‖θ̂‖₁/‖θ‖₁ →p 1. By the continuous mapping theorem again, ‖θ̂‖/‖θ‖ →p 1.

D.3 Proof of Lemma C.3

We introduce a notation M_{ijkℓ}(X) = X_{ij}X_{jk}X_{kℓ}X_{ℓi}, for any symmetric n×n matrix X and distinct indices (i, j, k, ℓ). Using the definition in (C.121), we can write

Q*_n^(m) − Q̃_n^(m) = Σ_{i₁,i₂,i₃,i₄ (dist)} [M_{i₁i₂i₃i₄}(X*) − M_{i₁i₂i₃i₄}(X̃)], where X*_ij = Ω̃_ij^(m,0) + W_ij + δ_ij^(m), X̃_ij = Ω̃_ij^(m,0) + W_ij.

For the rest of the proof, we omit the superscripts in Ω̃_ij^(m,0) and δ_ij^(m) to simplify notations. From the expressions of X*_ij and X̃_ij, we notice that [M_{i₁i₂i₃i₄}(X*) − M_{i₁i₂i₃i₄}(X̃)] expands into 3⁴ − 2⁴ = 65 terms. Consequently, there are 65 post-expansion sums in Q*_n^(m) − Q̃_n^(m), each of the form

Σ_{i₁,i₂,i₃,i₄ (dist)} a_{i₁i₂}b_{i₂i₃}c_{i₃i₄}d_{i₄i₁}, where a, b, c, d ∈ {Ω̃, W, δ}.

In the first 4 columns of Table 3, we group these post-expansion sums into 15 distinct terms, where the second column shows the count of each distinct term. For example, in the setting of Lemma 4.3, Ω̃ reduces to a zero matrix. Therefore, any post-expansion sum that involves Ω̃ is zero. Then, it follows from Table 3 that

Q*_n^(m) − Q̃_n^(m) = 4Y₁ + 4Z₁ + 2Z₂ + 4T₁ + F, (D.135)

where the expressions of (Y₁, Z₁, Z₂, T₁, F) are given in the fourth column of Table 3.
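The two counts used above, 3⁴ − 2⁴ = 65 post-expansion sums and 15 distinct terms, can be checked by brute force. A cyclic product Σ a_{i₁i₂}b_{i₂i₃}c_{i₃i₄}d_{i₄i₁} is unchanged by rotating or reversing the factor tuple (a, b, c, d), so "distinct terms" are orbits under this dihedral symmetry (the labels O, W, d below are just placeholders for Ω̃, W, δ):

```python
from itertools import product

symbols = ("O", "W", "d")                       # O ~ tilde-Omega, W ~ noise, d ~ delta
terms = [t for t in product(symbols, repeat=4) if "d" in t]
n_terms = len(terms)                            # 3^4 - 2^4 = 65 sums containing delta

def orbit_rep(t):
    # canonical representative under rotations and reflection of the 4-cycle
    rots = [t[i:] + t[:i] for i in range(4)]
    return min(rots + [r[::-1] for r in rots])

n_classes = len({orbit_rep(t) for t in terms})  # distinct post-expansion sums (Table 3)
n_classes2 = len({orbit_rep(t) for t in product(("O", "W"), repeat=4)})  # types in Table 2
```

Running this gives n_terms = 65, n_classes = 15, and n_classes2 = 6, matching the groupings of Table 3 and Table 2.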
Similarly, in the setting of Lemma 4.8, we have Q*_n^(m) − Q̃_n^(m) = 4Y₁ + 8Y₂ + 4Y₃ + ... + 4T₁ + F. These are elementary calculations.

To show the claim, we need to study the mean and variance of each post-expansion sum. We take Y₁ for example. Let N_1^(m), N_2^(m), ..., N_m^(m) be the pseudo-communities defined by Π*. For each 1 ≤ i ≤ n, let τ(i) ∈ {1, 2, ..., m} be the index of the pseudo-community that contains node i. By (C.116),

δ_{i₁i₂} = U_{τ(i₁)τ(i₂)}[(Ed_{i₁})(Ed_{i₂}−d_{i₂}) + (Ed_{i₂})(Ed_{i₁}−d_{i₁})] = U_{τ(i₁)τ(i₂)}·Ed_{i₁}·(−Σ_{j:j≠i₂}W_{ji₂}) + U_{τ(i₁)τ(i₂)}·Ed_{i₂}·(−Σ_{ℓ:ℓ≠i₁}W_{ℓi₁}). (D.136)

It follows that

Y₁ = −Σ_{i₁,i₂,i₃,i₄,j} U_{τ(i₁)τ(i₂)}·Ed_{i₁}·W_{ji₂}W_{i₂i₃}W_{i₃i₄}W_{i₄i₁} + (a symmetric term with the roles of i₁ and i₂ switched),

where we note that the indices {i₁, i₂, i₃, i₄, j} have to satisfy the constraint that i₁, i₂, i₃, i₄ are distinct and that j ≠ i₂. We can see that Y₁ is a weighted sum of monomials of the form W_{ji₂}W_{i₂i₃}W_{i₃i₄}W_{i₄i₁}, where the summands have zero mean and are mutually uncorrelated. The mean and variance of Y₁ can then be calculated easily. We will use the same strategy to analyze each term in Table 3: we use the expansion of δ_ij in (D.136) to write each post-expansion sum as a weighted sum of monomials of W, and then we calculate the mean and variance. The calculations can become very tedious for some terms (e.g., T₁, T₂ and F), because of the combinatorics involved. Fortunately, similar calculations were done in the proof of Theorem 4.4 in [17], where they analyzed a special case with U_kℓ ≡ 1/v for all 1 ≤ k, ℓ ≤ m. However, their proof does not rely on the U_kℓ's being equal; it only requires that the U_kℓ's have a uniform upper bound.
Essentially, they have proved the following lemma.

Table 3: The 10 types of post-expansion sums for (Q*^(m,0)_n − Q̃^(m,0)_n), with the count, name, and expression of each distinct term; the remaining columns of the table bound the mean and the variance of each term. Notation: same as in Table 2. All sums are over distinct indices i₁, i₂, i₃, i₄.

Type | Count | Term | Expression
Ia   | 4 | Y₁ | ∑ δ_{i₁i₂} W_{i₂i₃} W_{i₃i₄} W_{i₄i₁}
Ib   | 8 | Y₂ | ∑ δ_{i₁i₂} Ω̃_{i₂i₃} W_{i₃i₄} W_{i₄i₁}
Ib   | 4 | Y₃ | ∑ δ_{i₁i₂} W_{i₂i₃} Ω̃_{i₃i₄} W_{i₄i₁}
Ic   | 8 | Y₄ | ∑ δ_{i₁i₂} Ω̃_{i₂i₃} Ω̃_{i₃i₄} W_{i₄i₁}
Ic   | 4 | Y₅ | ∑ δ_{i₁i₂} Ω̃_{i₂i₃} W_{i₃i₄} Ω̃_{i₄i₁}
Id   | 4 | Y₆ | ∑ δ_{i₁i₂} Ω̃_{i₂i₃} Ω̃_{i₃i₄} Ω̃_{i₄i₁}
IIa  | 4 | Z₁ | ∑ δ_{i₁i₂} δ_{i₂i₃} W_{i₃i₄} W_{i₄i₁}
IIa  | 2 | Z₂ | ∑ δ_{i₁i₂} W_{i₂i₃} δ_{i₃i₄} W_{i₄i₁}
IIb  | 8 | Z₃ | ∑ δ_{i₁i₂} δ_{i₂i₃} Ω̃_{i₃i₄} W_{i₄i₁}
IIb  | 4 | Z₄ | ∑ δ_{i₁i₂} Ω̃_{i₂i₃} δ_{i₃i₄} W_{i₄i₁}
IIc  | 4 | Z₅ | ∑ δ_{i₁i₂} δ_{i₂i₃} Ω̃_{i₃i₄} Ω̃_{i₄i₁}
IIc  | 2 | Z₆ | ∑ δ_{i₁i₂} Ω̃_{i₂i₃} δ_{i₃i₄} Ω̃_{i₄i₁}
IIIa | 4 | T₁ | ∑ δ_{i₁i₂} δ_{i₂i₃} δ_{i₃i₄} W_{i₄i₁}
IIIb | 4 | T₂ | ∑ δ_{i₁i₂} δ_{i₂i₃} δ_{i₃i₄} Ω̃_{i₄i₁}
IV   | 1 | F  | ∑ δ_{i₁i₂} δ_{i₂i₃} δ_{i₃i₄} δ_{i₄i₁}

Lemma D.1. Consider a DCBM model where (2.1)-(2.2) and (2.4) hold. Let W = A − Ω and

Δ = ∑_{i₁,i₂,i₃,i₄ (dist)} [M_{i₁i₂i₃i₄}(Ω̃ + W + δ) − M_{i₁i₂i₃i₄}(Ω̃ + W)],

where Ω̃ is a non-stochastic symmetric matrix, δ_{ij} = v_{ij}·[(E d_i)(E d_j − d_j) + (E d_j)(E d_i − d_i)], {v_{ij}}_{1≤i≠j≤n} are non-stochastic scalars, d_i is the degree of node i, and M_{i₁i₂i₃i₄}(·) is as defined above. As n → ∞, suppose there is a constant C > 0 and a scalar α_n > 0 such that α_n ≤ C, α_n‖θ‖ → ∞, |Ω̃_{ij}| ≤ Cα_n θ_iθ_j, and |v_{ij}| ≤ C‖θ‖₁⁻² for all 1 ≤ i, j ≤ n.
Then, |E[Δ]| = o(α_n‖θ‖⁴) and Var(Δ) ≤ Cα_n‖θ‖‖θ‖ + o(‖θ‖⁸). Furthermore, if Ω̃ is a zero matrix, then |E[Δ]| ≤ C‖θ‖⁴ and Var(Δ) = o(‖θ‖⁸).

We check the conditions of Lemma D.1. By Lemma 4.6, τ ≤ C, τ‖θ‖ → ∞, and |Ω̃_{ij}| ≤ Cτθ_iθ_j. We now verify that U_{kℓ} has a uniform upper bound for all 1 ≤ k, ℓ ≤ m. By (C.114),

U_{kℓ} = (1′_k E[A] 1_ℓ)/[(1′_k E[d])(1′_ℓ E[d])],

where 1_k = 1^(m,0)_k is the same as in (C.99). Since E[A_{ij}] = Ω_{ij} ≤ Cθ_iθ_j, we have 0 ≤ 1′_k E[A] 1_ℓ ≤ C‖θ‖₁². At the same time, by the NSP of SCORE, for each 1 ≤ k ≤ m there is at least one true community N_{k*} such that N_{k*} ⊂ N^(m,0)_k. It follows that 1′_k E[d] = ∑_{i∈N^(m,0)_k} ∑_{j:j≠i} Ω_{ij} ≥ ∑_{i,j∈N_{k*}: i≠j} θ_iθ_j P_{k*k*} = ‖θ^(k*)‖₁²[1 + o(1)] ≥ C‖θ‖₁², where the last inequality follows from condition (2.2). We plug these results into U_{kℓ} to get

0 ≤ U_{kℓ} ≤ C‖θ‖₁⁻².   (D.137)

Then the conditions of Lemma D.1 are satisfied. We apply this lemma with α_n = τ and v_{ij} = U_{kℓ} for i ∈ N^(m,0)_k and j ∈ N^(m,0)_ℓ. It yields that, under the conditions of Lemma 4.8,

|E[Q*^(m,0)_n − Q̃^(m,0)_n]| = o(τ‖θ‖⁴),   Var(Q*^(m,0)_n − Q̃^(m,0)_n) ≤ Cτ‖θ‖‖θ‖ + o(‖θ‖⁸),

and, under the conditions of Lemma 4.3 (where Ω̃ is a zero matrix),

|E[Q*^(m,0)_n − Q̃^(m,0)_n]| ≤ C‖θ‖⁴,   Var(Q*^(m,0)_n − Q̃^(m,0)_n) = o(‖θ‖⁸).
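The scaling U_{kℓ} ≍ ‖θ‖₁⁻² in (D.137) can be sanity-checked numerically under a toy DCBM. All parameters below are of our own choosing, purely for illustration:

```python
import numpy as np

n, K = 300, 3
rng = np.random.default_rng(0)
theta = rng.uniform(0.05, 0.3, size=n)          # degree parameters
labels = np.arange(n) % K                        # balanced community labels
P = np.array([[1.0, 0.2, 0.3],
              [0.2, 1.0, 0.4],
              [0.3, 0.4, 1.0]])                  # unit diagonal, as in (1.4)
Pi = np.eye(K)[labels]
Omega = np.outer(theta, theta) * (Pi @ P @ Pi.T)
np.fill_diagonal(Omega, 0.0)                     # no self-edges
Ed = Omega.sum(axis=1)                           # E[d_i]

# U_kl = (1_k' E[A] 1_l) / [(1_k' E[d])(1_l' E[d])]; column k of Pi is 1_k
U = (Pi.T @ Omega @ Pi) / np.outer(Pi.T @ Ed, Pi.T @ Ed)
scaled = U * theta.sum() ** 2                    # U_kl * ||theta||_1^2
assert scaled.min() > 0.1 and scaled.max() < 50  # bounded above and below
```

The rescaled matrix stays within constant bounds as n grows, consistent with C⁻¹‖θ‖₁⁻² ≤ U_{kℓ} ≤ C‖θ‖₁⁻².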
This proves all of the desired claims except for the following one: under the conditions of Lemma 4.3,

E[Q*^(m,0)_n − Q̃^(m,0)_n] = b_n + o(‖θ‖⁴).   (D.138)

We now show (D.138). By (D.135), we only need to calculate the expectations of Y₁, Z₁, Z₂, T₁ and F. From Table 3, E[Y₁] = 0. We first study E[Z₁]. Recall that δ_{ij} = U_{τ(i)τ(j)}[(E d_i)(E d_j − d_j) + (E d_j)(E d_i − d_i)], where τ(i) is the index of the pseudo-community (defined by Π^(m,0)) that contains node i. We plug δ_{ij} into Z₁; by elementary calculations,

Z₁ = ∑_{(dist)} U_{τ(i₁)τ(i₂)}U_{τ(i₂)τ(i₃)} (E d_{i₁})(E d_{i₂} − d_{i₂})²(E d_{i₃}) W_{i₃i₄}W_{i₄i₁}
 + 2∑_{(dist)} U_{τ(i₁)τ(i₂)}U_{τ(i₂)τ(i₃)} (E d_{i₁})(E d_{i₂} − d_{i₂})(E d_{i₂})(E d_{i₃} − d_{i₃}) W_{i₃i₄}W_{i₄i₁}
 + ∑_{(dist)} U_{τ(i₁)τ(i₂)}U_{τ(i₂)τ(i₃)} (E d_{i₁} − d_{i₁})(E d_{i₂})²(E d_{i₃} − d_{i₃}) W_{i₃i₄}W_{i₄i₁}.

We write this as Z₁ = Z₁,₁ + 2Z₁,₂ + Z₁,₃. For each Z₁,ₖ, we can further replace E d_i − d_i by −∑_{j:j≠i} W_{ji} and write Z₁,ₖ as a weighted sum of monomials of W. Then E[Z₁,ₖ] ≠ 0 only if some of the monomials are of the form W²_{i₃i₄}W²_{i₄i₁}. This does not happen in Z₁,₁ and Z₁,₂, so only Z₁,₃ has a nonzero mean. It is seen that

E[Z₁,₃] = E[∑_{(dist)} U_{τ(i₁)τ(i₂)}U_{τ(i₂)τ(i₃)} (∑_{j:j≠i₁} W_{ji₁})(E d_{i₂})²(∑_{k:k≠i₃} W_{i₃k}) W_{i₃i₄}W_{i₄i₁}]
= E[∑_{(dist)} U_{τ(i₁)τ(i₂)}U_{τ(i₂)τ(i₃)} W_{i₄i₁}(E d_{i₂})² W_{i₃i₄} · W_{i₃i₄}W_{i₄i₁}]
= ∑_{(dist)} U_{τ(i₁)τ(i₂)}U_{τ(i₂)τ(i₃)} (E d_{i₂})² · E[W²_{i₃i₄}W²_{i₄i₁}]
= ∑_{k₁,k₂,k₃,k₄} ∑_{i_j∈N_{k_j}, 1≤j≤4} U_{k₁k₂}U_{k₂k₃} (E d_{i₂})² · E[W²_{i₃i₄}W²_{i₄i₁}].   (D.139)

Here, in the second line, we only keep (j, k) = (i₄, i₄), because all other choices of (j, k) contribute zero means.
Recall that we are considering the setting of Lemma 4.3, where m = K and Π^(m,0) = Π. In (C.113), we introduced a proxy of U_{kℓ}, namely U*_{kℓ} = (1′_k Ω 1_ℓ)/[(1′_k Ω 1_n)(1′_ℓ Ω 1_n)], for all 1 ≤ k, ℓ ≤ K. Note that Ω_{ij} = θ_iθ_j P_{kℓ} for i ∈ N_k and j ∈ N_ℓ. At the same time, by (4.13), g_k = (1′_kθ)/‖θ‖₁, and V_{kk} = (diag(Pg))_{kk} = [∑_ℓ P_{kℓ}(1′_ℓθ)]/‖θ‖₁. It follows that

U*_{kℓ} = P_{kℓ}(1′_kθ)(1′_ℓθ) / {(1′_kθ)[∑_{k′} P_{kk′}(1′_{k′}θ)] · (1′_ℓθ)[∑_{ℓ′} P_{ℓℓ′}(1′_{ℓ′}θ)]} = P_{kℓ} / (V_{kk}V_{ℓℓ}‖θ‖₁²).

Comparing U_{kℓ} with U*_{kℓ} (see (C.113)-(C.114)), the difference is negligible. (We can justify this rigorously by directly computing the difference caused by replacing U_{kℓ} with U*_{kℓ}, similarly as in the proof of c_n = tr(Ω̃⁴) + o(‖θ‖⁴) in Section C.1; see the details therein. Such calculations are elementary and so are omitted.) We thus have

U_{kℓ} = [1 + o(1)] · P_{kℓ} / (V_{kk}V_{ℓℓ}‖θ‖₁²).   (D.140)

Furthermore, for i ∈ N_k,

E[d_i] = [1 + o(1)] ∑_{j=1}^n Ω_{ij} = [1 + o(1)] · θ_i [∑_{ℓ=1}^K P_{kℓ}(1′_ℓθ)] = [1 + o(1)] · θ_i ‖θ‖₁ V_{kk}.   (D.141)

Also, E[W²_{ij}] = Ω_{ij}(1 − Ω_{ij}) = Ω_{ij}[1 + o(1)].
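The closed form for U*_{kℓ} is an exact algebraic identity (no asymptotics are needed for it), which a short computation confirms under toy parameters of our own choosing:

```python
import numpy as np

n, K = 200, 3
rng = np.random.default_rng(1)
theta = rng.uniform(0.05, 0.3, size=n)
labels = np.arange(n) % K
P = np.array([[1.0, 0.2, 0.3],
              [0.2, 1.0, 0.4],
              [0.3, 0.4, 1.0]])
Pi = np.eye(K)[labels]
Omega = np.outer(theta, theta) * (Pi @ P @ Pi.T)   # Omega keeps its diagonal here

l1 = theta.sum()                                   # ||theta||_1
g = (Pi.T @ theta) / l1                            # g_k = (1_k' theta)/||theta||_1
V = np.diag(P @ g)                                 # V_kk = [sum_l P_kl (1_l' theta)]/||theta||_1

# U*_{kl} = (1_k' Omega 1_l) / [(1_k' Omega 1_n)(1_l' Omega 1_n)]
num = Pi.T @ Omega @ Pi
row = Pi.T @ Omega @ np.ones(n)
U_star = num / np.outer(row, row)

# closed form P_kl / (V_kk V_ll ||theta||_1^2), exact by block structure
closed = P / (np.outer(np.diag(V), np.diag(V)) * l1 ** 2)
assert np.allclose(U_star, closed)
```

Here the two K×K matrices agree to machine precision, which is exactly the cancellation of the (1′_kθ)(1′_ℓθ) factors in the display above.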
We plug these results into (D.139) to get

E[Z₁,₃] = [1 + o(1)] ∑_{k₁,k₂,k₃,k₄} ∑_{i_j∈N_{k_j}, 1≤j≤4} [P_{k₁k₂}P_{k₂k₃}/(V_{k₁k₁}V²_{k₂k₂}V_{k₃k₃}‖θ‖₁⁴)] · (θ_{i₂}‖θ‖₁V_{k₂k₂})² · Ω_{i₃i₄}Ω_{i₄i₁}
= [1 + o(1)] ∑_{k₁,k₂,k₃,k₄} [P_{k₁k₂}P_{k₂k₃}P_{k₃k₄}P_{k₄k₁}/(V_{k₁k₁}V_{k₃k₃}‖θ‖₁²)] (∑_{i_j∈N_{k_j}, 1≤j≤4} θ_{i₁}θ²_{i₂}θ_{i₃}θ²_{i₄})
= [1 + o(1)] ∑_{k₁,k₂,k₃,k₄} [P_{k₁k₂}P_{k₂k₃}P_{k₃k₄}P_{k₄k₁}/(V_{k₁k₁}V_{k₃k₃}‖θ‖₁²)] (‖θ‖₁²‖θ‖⁴ · g_{k₁}g_{k₃}H²_{k₂k₂}H²_{k₄k₄})
= [1 + o(1)] ‖θ‖⁴ ∑_{k₁,k₃} (V⁻¹g)_{k₁} (PH²P)_{k₁k₃}(PH²P)_{k₃k₁} (V⁻¹g)_{k₃}
= [1 + o(1)] ‖θ‖⁴ · g′V⁻¹[(PH²P)∘(PH²P)]V⁻¹g
= [1 + o(1)] · b_n/2,

where in the third line we have used the definition of H, which gives H_{kk} = (1′_kΘ²1_k)^{1/2}/‖θ‖. It follows that

E[Z₁] = E[Z₁,₃] = [1 + o(1)] · b_n/2.   (D.142)

We then study E[Z₂]. Similarly, we first plug in δ_{ij} = U_{τ(i)τ(j)}[(E d_i)(E d_j − d_j) + (E d_j)(E d_i − d_i)] and then plug in d_i − E d_i = ∑_{j:j≠i} W_{ij}. This allows us to write Z₂ as a weighted sum of monomials of W. When calculating E[Z₂], we only need to keep monomials of the form W²_{i₂i₃}W²_{i₄i₁}.
It follows that

E[Z₂] = E[∑_{(dist)} U_{τ(i₁)τ(i₂)}(E d_{i₁})(E d_{i₂} − d_{i₂})W_{i₂i₃} · U_{τ(i₃)τ(i₄)}(E d_{i₃})(E d_{i₄} − d_{i₄})W_{i₄i₁}]
= 2∑_{k₁,k₂,k₃,k₄} ∑_{i_j∈N_{k_j}, 1≤j≤4} U_{k₁k₂}U_{k₃k₄}(E d_{i₁})(E d_{i₃}) E[W²_{i₂i₃}W²_{i₄i₁}]
= 2[1 + o(1)] ∑_{k₁,k₂,k₃,k₄} ∑_{i_j∈N_{k_j}, 1≤j≤4} [P_{k₁k₂}P_{k₃k₄}/(V_{k₁k₁}V_{k₂k₂}V_{k₃k₃}V_{k₄k₄}‖θ‖₁⁴)] (θ_{i₁}‖θ‖₁V_{k₁k₁})(θ_{i₃}‖θ‖₁V_{k₃k₃}) · Ω_{i₂i₃}Ω_{i₄i₁}
= 2[1 + o(1)] ∑_{k₁,k₂,k₃,k₄} [P_{k₁k₂}P_{k₂k₃}P_{k₃k₄}P_{k₄k₁}/(V_{k₂k₂}V_{k₄k₄}‖θ‖₁²)] (∑_{i_j∈N_{k_j}, 1≤j≤4} θ²_{i₁}θ_{i₂}θ²_{i₃}θ_{i₄})
= [1 + o(1)] · 2‖θ‖⁴ g′V⁻¹[(PH²P)∘(PH²P)]V⁻¹g.

Here, the first two lines come from discarding terms with zero mean, the third and fourth lines follow from (D.140)-(D.141), and the last line is obtained similarly as in the display above (D.142). Hence,

E[Z₂] = b_n·[1 + o(1)].   (D.143)

We then study E[T₁]. We plug in δ_{ij} = U_{τ(i)τ(j)}[(E d_i)(E d_j − d_j) + (E d_j)(E d_i − d_i)] to get

T₁ = 2∑_{(dist)} U_{τ(i₁)τ(i₂)}U_{τ(i₂)τ(i₃)}U_{τ(i₃)τ(i₄)} × (E d_{i₂})²(E d_{i₄})(E d_{i₁} − d_{i₁})(E d_{i₃} − d_{i₃})² W_{i₄i₁} + rem ≡ 2T̄₁ + rem.

We claim that

|E[rem]| = o(‖θ‖⁴).

The calculations here are similar to those in Equation (E.176) of [17], where the term T there (with a slightly different meaning) is decomposed into 2T_a + 2T_b + 2T_c + 2T_d. Here, T̄₁ is analogous to T_d, and the remainder term is analogous to 2T_a + 2T_b + 2T_c. In [17], it was shown that |E[T_a]| + |E[T_b]| + |E[T_c]| = o(‖θ‖⁴); see Equations (E.179)-(E.181) of [17]. We can adapt their proof to show |E[rem]| = o(‖θ‖⁴). Since the calculations are elementary, we omit the details to save space. We then compute E[T̄₁].
Since E d_i − d_i = −∑_{j:j≠i} W_{ji}, it follows that

E[T̄₁] = −E[∑_{k₁,k₂,k₃,k₄} ∑_{i_j∈N_{k_j}, 1≤j≤4} U_{k₁k₂}U_{k₂k₃}U_{k₃k₄} (E d_{i₂})² (∑_{i:i≠i₁} W_{ii₁}) (E d_{i₄}) (∑_{i:i≠i₃} W_{ii₃})² W_{i₄i₁}]
= −E[∑_{k₁,k₂,k₃,k₄} ∑_{i_j∈N_{k_j}, 1≤j≤4} U_{k₁k₂}U_{k₂k₃}U_{k₃k₄} (E d_{i₂})²(E d_{i₄}) W_{i₄i₁} (∑_{i:i≠i₃} W_{ii₃})² W_{i₄i₁}]
= −∑_{k₁,k₂,k₃,k₄} ∑_{i_j∈N_{k_j}, 1≤j≤4} U_{k₁k₂}U_{k₂k₃}U_{k₃k₄} (E d_{i₂})²(E d_{i₄}) E[W²_{i₄i₁}] (∑_{i:i≠i₃} E[W²_{ii₃}])
= −∑_{k₁,k₂,k₃,k₄} ∑_{i_j∈N_{k_j}, 1≤j≤4} U_{k₁k₂}U_{k₂k₃}U_{k₃k₄} (E d_{i₂})²(E d_{i₄}) E[W²_{i₄i₁}] · [1 + o(1)] (θ_{i₃}‖θ‖₁ ∑_k P_{k₃k}g_k ⏟ = V_{k₃k₃})
= −[1 + o(1)] ∑_{k₁,k₂,k₃,k₄} [P_{k₁k₂}P_{k₂k₃}P_{k₃k₄}P_{k₄k₁}/(V_{k₁k₁}V_{k₃k₃}‖θ‖₁²)] (∑_{i_j∈N_{k_j}, 1≤j≤4} θ_{i₁}θ²_{i₂}θ_{i₃}θ²_{i₄})
= −[1 + o(1)] · ‖θ‖⁴ g′V⁻¹[(PH²P)∘(PH²P)]V⁻¹g,

where we have plugged in (D.140)-(D.141) in the second-to-last line, and the last line can be derived similarly as in the display above (D.142). We have proved E[T̄₁] = −[1 + o(1)]·b_n/2. Then,

E[T₁] = 2E[T̄₁] + o(‖θ‖⁴) = −b_n·[1 + o(1)].   (D.144)

We then study E[F]. Similar to the analysis of T₁, after plugging in δ_{ij} = U_{τ(i)τ(j)}[(E d_i)(E d_j − d_j) + (E d_j)(E d_i − d_i)], we can obtain

F = rem + 2∑_{(dist)} U_{τ(i₁)τ(i₂)}U_{τ(i₂)τ(i₃)}U_{τ(i₃)τ(i₄)}U_{τ(i₄)τ(i₁)} × (E d_{i₁})²(E d_{i₂} − d_{i₂})²(E d_{i₃})²(E d_{i₄} − d_{i₄})² ≡ rem + 2F̄,

where |E[rem]| = o(‖θ‖⁴). The proof of |E[rem]| = o(‖θ‖⁴) is similar to the proofs of (E.188)-(E.189) in [17].
There, they analyzed a quantity F which bears some similarity to the F here, and decomposed it as F = 2F_a + 12F_b + 2F_c, where 2F_a + 12F_b is analogous to rem here. They proved that |E[F_a]| + |E[F_b]| = o(‖θ‖⁴). We can mimic their proof to show |E[rem]| = o(‖θ‖⁴). By direct calculations,

E[F̄] = E[∑_{k₁,k₂,k₃,k₄} ∑_{i_j∈N_{k_j}, 1≤j≤4} U_{k₁k₂}U_{k₂k₃}U_{k₃k₄}U_{k₄k₁} (E d_{i₁})²(E d_{i₃})² (E d_{i₂} − d_{i₂})²(E d_{i₄} − d_{i₄})²]
= E[∑_{k₁,k₂,k₃,k₄} ∑_{i_j∈N_{k_j}, 1≤j≤4} U_{k₁k₂}U_{k₂k₃}U_{k₃k₄}U_{k₄k₁} (E d_{i₁})²(E d_{i₃})² (∑_{i:i≠i₂} W_{ii₂})²(∑_{i:i≠i₄} W_{ii₄})²]
= [1 + o(1)] ∑_{k₁,k₂,k₃,k₄} ∑_{i_j∈N_{k_j}, 1≤j≤4} [P_{k₁k₂}P_{k₂k₃}P_{k₃k₄}P_{k₄k₁} θ²_{i₁}θ²_{i₃}/(V_{k₂k₂}V_{k₄k₄}‖θ‖₁⁴)] (θ_{i₂}‖θ‖₁ ∑_k P_{k₂k}g_k ⏟ = V_{k₂k₂})(θ_{i₄}‖θ‖₁ ∑_k P_{k₄k}g_k ⏟ = V_{k₄k₄})
= [1 + o(1)] ∑_{k₁,k₂,k₃,k₄} [P_{k₁k₂}P_{k₂k₃}P_{k₃k₄}P_{k₄k₁}/(V_{k₂k₂}V_{k₄k₄}‖θ‖₁²)] (∑_{i_j∈N_{k_j}, 1≤j≤4} θ²_{i₁}θ_{i₂}θ²_{i₃}θ_{i₄})
= [1 + o(1)] · ‖θ‖⁴ g′V⁻¹[(PH²P)∘(PH²P)]V⁻¹g,

where in the second line we discard terms with zero mean, in the third line we plug in (D.140)-(D.141), and in the last line we use elementary calculations similar to those in the display above (D.142). It follows that E[F̄] = [1 + o(1)]·b_n/2 and

E[F] = 2E[F̄] + o(‖θ‖⁴) = [1 + o(1)]·b_n.   (D.145)

We now plug (D.142), (D.143), (D.144), and (D.145) into (D.135) to get

E[Q*^(m,0)_n − Q̃^(m,0)_n] = 4E[Z₁] + 2E[Z₂] + 4E[T₁] + E[F]
= [1 + o(1)] · [4(b_n/2) + 2b_n − 4b_n + b_n]
= [1 + o(1)] · b_n.

Since b_n ≍ ‖θ‖⁴, (D.138) follows immediately.
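The two facts driving all of these expectation calculations — a monomial in W has nonzero mean only when its factors match in pairs, and E[W²_{ij}] = Ω_{ij}(1 − Ω_{ij}) — can be illustrated by a quick simulation. This is our own sketch, with an arbitrary Ω that is not from the paper:

```python
import numpy as np

n = 60
rng = np.random.default_rng(2)
theta = rng.uniform(0.1, 0.4, size=n)
Omega = 0.5 * np.outer(theta, theta)   # any valid edge-probability matrix
np.fill_diagonal(Omega, 0.0)

reps = 4000
acc_sq, acc_cross = 0.0, 0.0
for _ in range(reps):
    A = np.triu(rng.random((n, n)) < Omega, 1)   # upper-triangular Bernoulli draws
    A = (A + A.T).astype(float)                  # symmetric adjacency, zero diagonal
    W = A - Omega
    acc_sq += W[0, 1] * W[0, 1]       # estimates E[W_01^2] = Omega_01 (1 - Omega_01)
    acc_cross += W[0, 1] * W[2, 3]    # unmatched factors: mean should be ~0

assert abs(acc_sq / reps - Omega[0, 1] * (1 - Omega[0, 1])) < 0.02
assert abs(acc_cross / reps) < 0.01
```

The squared entry concentrates on Ω_{01}(1 − Ω_{01}), while the product over two disjoint pairs averages to zero — the same pairing rule used to isolate Z₁,₃, T̄₁ and F̄ above.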
D.4 Proof of Lemma C.4

Similar to the proof of Lemma C.3, we use the notation M_{ijkℓ}(X) = X_{ij}X_{jk}X_{kℓ}X_{ℓi}. By (C.121),

Q̃*^(m,0)_n − Q*^(m,0)_n = ∑_{i₁,i₂,i₃,i₄ (dist)} [M_{i₁i₂i₃i₄}(X̃*) − M_{i₁i₂i₃i₄}(X*)],   where X̃*_{ij} = Ω̃^(m,0)_{ij} + W_{ij} + δ^(m,0)_{ij} + r̃^(m,0)_{ij},  X*_{ij} = Ω̃^(m,0)_{ij} + W_{ij} + δ^(m,0)_{ij}.

For the rest of the proof, we omit the superscripts (m,0) in (Ω̃, δ, r̃). There are 4⁴ − 3⁴ = 175 post-expansion sums in Q̃*^(m,0)_n − Q*^(m,0)_n, each of the form

S ≡ ∑_{i₁,i₂,i₃,i₄ (dist)} a_{i₁i₂} b_{i₂i₃} c_{i₃i₄} d_{i₄i₁},   where a, b, c, d ∈ {Ω̃, W, δ, r̃}.   (D.146)

Here we use S as a generic notation for any post-expansion sum. To show the claim, it suffices to bound |E[S]| and Var(S) for each post-expansion sum S.

We now study S. Let N^(m,0)_1, N^(m,0)_2, …, N^(m,0)_m be the pseudo-communities defined by Π^(m,0). By (C.116) and (C.117), for i ∈ N^(m,0)_k and j ∈ N^(m,0)_ℓ,

δ_{ij} = U_{kℓ}[(E d_i)(E d_j − d_j) + (E d_j)(E d_i − d_i)],   r̃_{ij} = −Û_{kℓ}(d_i − E d_i)(d_j − E d_j).

Û_{kℓ} has a complicated correlation with each summand, so we want to "replace" it by U_{kℓ}. Introduce a proxy of r̃_{ij},

r_{ij} = −U_{kℓ}(d_i − E d_i)(d_j − E d_j).   (D.147)

We define a proxy of S as

T ≡ ∑_{i₁,i₂,i₃,i₄ (dist)} a_{i₁i₂} b_{i₂i₃} c_{i₃i₄} d_{i₄i₁},   where a, b, c, d ∈ {Ω̃, W, δ, r}.   (D.148)

We note that T is also a generic notation, and it is in one-to-one correspondence with S. For example, if S = ∑_{(dist)} δ_{i₁i₂}W_{i₂i₃}Ω̃_{i₃i₄}r̃_{i₄i₁}, then T = ∑_{(dist)} δ_{i₁i₂}W_{i₂i₃}Ω̃_{i₃i₄}r_{i₄i₁}; if S = ∑_{(dist)} δ_{i₁i₂}r̃_{i₂i₃}r̃_{i₃i₄}W_{i₄i₁}, then T = ∑_{(dist)} δ_{i₁i₂}r_{i₂i₃}r_{i₃i₄}W_{i₄i₁}.
Therefore, to bound the mean and variance of S, we only need to study T and S − T separately.

First, we study the mean and variance of T. Since d_i − E d_i = ∑_{j:j≠i} W_{ij}, we can write δ_{ij} as a linear form of W and r_{ij} as a quadratic form of W. We then plug these into the expression of T and write T as a weighted sum of monomials of W. Take T = ∑_{(dist)} r_{i₁i₂}W_{i₂i₃}W_{i₃i₄}W_{i₄i₁} for example. It can be re-written as (recall that τ(i) is the index of the pseudo-community that contains node i)

T = −∑_{(dist)} U_{τ(i₁)τ(i₂)} (∑_{j₁:j₁≠i₁} W_{i₁j₁})(∑_{j₂:j₂≠i₂} W_{i₂j₂}) W_{i₂i₃}W_{i₃i₄}W_{i₄i₁}
= −∑_{i₁,i₂,i₃,i₄ (dist); j₁≠i₁, j₂≠i₂} U_{τ(i₁)τ(i₂)} W_{i₁j₁}W_{i₂j₂} W_{i₂i₃}W_{i₃i₄}W_{i₄i₁}.

Then we can compute the mean and variance of T directly. We use the same strategy to analyze each of the 175 post-expansion sums of the form (D.148). Similar calculations were conducted in the proof of Lemma E.11 of [17]. The setting of Lemma E.11 is a special case where U_{kℓ} ≡ 1/v for a scalar v. However, their proof does not rely on the U_{kℓ}'s being equal to each other; it only requires a universal upper bound on the U_{kℓ}'s. In fact, they have proved the following lemma:

Lemma D.2. Consider a DCBM model where (2.1)-(2.2) and (2.4) hold. Let W = A − Ω and Δ = ∑_{i₁,i₂,i₃,i₄ (dist)} [M_{i₁i₂i₃i₄}(Ω̃ + W + δ + r) − M_{i₁i₂i₃i₄}(Ω̃ + W + δ)], where Ω̃ is a non-stochastic symmetric matrix, δ_{ij} = v_{ij}·[(E d_i)(E d_j − d_j) + (E d_j)(E d_i − d_i)], r_{ij} = −u_{ij}(d_i − E d_i)(d_j − E d_j), {v_{ij}, u_{ij}}_{1≤i≠j≤n} are non-stochastic scalars, d_i is the degree of node i, and M_{i₁i₂i₃i₄}(·) is as defined above.
As n → ∞, suppose there is a constant C > 0 and a scalar α_n > 0 such that α_n ≤ C, α_n‖θ‖ → ∞, |Ω̃_{ij}| ≤ Cα_nθ_iθ_j, |v_{ij}| ≤ C‖θ‖₁⁻², and |u_{ij}| ≤ C‖θ‖₁⁻² for 1 ≤ i, j ≤ n. Let T be an arbitrary post-expansion sum of Δ. Then, |E[T]| ≤ Cα_n‖θ‖⁴ + o(‖θ‖⁴) and Var(T) = o(α_n‖θ‖‖θ‖ + ‖θ‖⁸).

We apply Lemma D.2 with α_n = τ and v_{ij} = u_{ij} = U_{τ(i)τ(j)}. By Lemma 4.6, τ ≤ C, τ‖θ‖ → ∞, and |Ω̃_{ij}| ≤ Cτθ_iθ_j. In (D.137), we have seen that |U_{kℓ}| ≤ C‖θ‖₁⁻². The conditions of Lemma D.2 are thus satisfied, and we immediately have the following: under the conditions of Lemma 4.8 (note: τ‖θ‖ → ∞),

|E[T]| ≤ Cτ‖θ‖⁴ + o(‖θ‖⁴),   Var(T) = o(τ‖θ‖‖θ‖ + ‖θ‖⁸),   (D.149)

and under the conditions of Lemma 4.3 (i.e., Ω̃ is a zero matrix and τ = 0),

|E[T]| = o(‖θ‖⁴),   Var(T) = o(‖θ‖⁸).   (D.150)

Next, we study the variable (S − T). In (D.146) and (D.148), if we group the summands based on the pseudo-communities of (i₁, i₂, i₃, i₄), then we have

S = ∑_{1≤k₁,k₂,k₃,k₄≤m} S_{k₁k₂k₃k₄}   and   T = ∑_{1≤k₁,k₂,k₃,k₄≤m} T_{k₁k₂k₃k₄},

where S_{k₁k₂k₃k₄} contains all the summands such that i_s ∈ N^(m,0)_{k_s} for s = 1, 2, 3, 4.
By straightforward calculations and the definitions of (r_{ij}, r̃_{ij}), we have

S_{k₁k₂k₃k₄} = Û^{ℓ_a}_{k₁k₂} Û^{ℓ_b}_{k₂k₃} Û^{ℓ_c}_{k₃k₄} Û^{ℓ_d}_{k₄k₁} ∑_{i_s∈N^(m,0)_{k_s}, s=1,…,4} ã_{i₁i₂} b̃_{i₂i₃} c̃_{i₃i₄} d̃_{i₄i₁},
T_{k₁k₂k₃k₄} = U^{ℓ_a}_{k₁k₂} U^{ℓ_b}_{k₂k₃} U^{ℓ_c}_{k₃k₄} U^{ℓ_d}_{k₄k₁} ∑_{i_s∈N^(m,0)_{k_s}, s=1,…,4} ã_{i₁i₂} b̃_{i₂i₃} c̃_{i₃i₄} d̃_{i₄i₁},

where ã_{ij}, b̃_{ij}, c̃_{ij}, d̃_{ij} ∈ {Ω̃_{ij}, W_{ij}, δ_{ij}, −(d_i − E d_i)(d_j − E d_j)}. Here ℓ_a ∈ {0, 1} is an indicator of whether a takes the value r̃ in S, and (ℓ_b, ℓ_c, ℓ_d) are similar. For example, if S = ∑_{(dist)} δ_{i₁i₂}W_{i₂i₃}Ω̃_{i₃i₄}r̃_{i₄i₁}, then (ℓ_a, ℓ_b, ℓ_c, ℓ_d) = (0, 0, 0, 1); if S = ∑_{(dist)} δ_{i₁i₂}r̃_{i₂i₃}r̃_{i₃i₄}W_{i₄i₁}, then (ℓ_a, ℓ_b, ℓ_c, ℓ_d) = (0, 1, 1, 0). For any S considered here, 1 ≤ ℓ_a + ℓ_b + ℓ_c + ℓ_d ≤ 4. To study the difference between S_{k₁k₂k₃k₄} and T_{k₁k₂k₃k₄}, we introduce an intermediate term

R_{k₁k₂k₃k₄} = (1/‖θ‖₁²)^{ℓ_a+ℓ_b+ℓ_c+ℓ_d} ∑_{i_s∈N^(m,0)_{k_s}, s=1,…,4} ã_{i₁i₂} b̃_{i₂i₃} c̃_{i₃i₄} d̃_{i₄i₁}.

In fact, R_{k₁k₂k₃k₄} has the same form as T_{k₁k₂k₃k₄} except that the scalar U_{kℓ} in the definition of r_{ij} (see (D.147)) is replaced by 1/‖θ‖₁². We apply Lemma D.2 with u_{ij} ≡ 1/‖θ‖₁². It yields that, under the conditions of Lemma 4.3,

|E[R_{k₁k₂k₃k₄}]| = o(‖θ‖⁴),   Var(R_{k₁k₂k₃k₄}) = o(‖θ‖⁸),

and under the conditions of Lemma 4.8,

|E[R_{k₁k₂k₃k₄}]| ≤ Cτ‖θ‖⁴ + o(‖θ‖⁴),   Var(R_{k₁k₂k₃k₄}) = o(‖θ‖⁸ + τ‖θ‖‖θ‖).
In particular, since E[X²] = (E[X])² + Var(X) for any variable X, we have

‖θ‖⁻⁴·E[R²_{k₁k₂k₃k₄}] ≤ { o(‖θ‖⁴), for the setting of Lemma 4.3;  Cτ²‖θ‖⁴ + ‖θ‖⁻⁴·o(‖θ‖⁸ + τ‖θ‖‖θ‖), for the setting of Lemma 4.8 }
= { o(‖θ‖⁴), for the setting of Lemma 4.3;  C‖θ‖⁴, for the setting of Lemma 4.8 }.   (D.151)

Note that in deriving (D.151) we have used τ ≤ C and τ‖θ‖‖θ‖ ≤ τ‖θ‖·θ_max‖θ‖ ≤ C‖θ‖⁸.

We now investigate (S_{k₁k₂k₃k₄} − T_{k₁k₂k₃k₄}). By condition (2.1), √log(n) ≪ ‖θ‖₁/‖θ‖². Hence, we can take a sequence x_n such that √log(n) ≪ x_n ≪ ‖θ‖₁/‖θ‖², and define the event

E_n = { |U_{kℓ} − Û_{kℓ}| ≤ C₀x_n/‖θ‖₁³, for all 1 ≤ k, ℓ ≤ m },   (D.152)

where C₀ > 0 is a properly chosen constant. To bound P(E^c_n), we recall that (by the definitions in (C.112) and (C.114))

Û_{kℓ} = (1′_k A 1_ℓ)/[(1′_k d)(1′_ℓ d)]   and   U_{kℓ} = (1′_k E[A] 1_ℓ)/[(1′_k E[d])(1′_ℓ E[d])],

where 1_k is a shorthand notation for 1^(m,0)_k in (C.99). Using Bernstein's inequality and mimicking the argument from (E.299)-(E.300) of [17], we can easily show that there is a constant C₁ > 0 such that for all 1 ≤ k, ℓ ≤ m,

P(|1′_k A 1_ℓ − 1′_k E[A] 1_ℓ| > x_n‖θ‖₁) ≤ exp(−C₁x_n²).   (D.153)

By the probability union bound, with probability 1 − m² exp(−C₁x_n²),

max_{1≤k,ℓ≤m} |1′_k A 1_ℓ − 1′_k E[A] 1_ℓ| ≤ x_n‖θ‖₁.
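A small simulation is consistent with the concentration in (D.153): 1′_k A 1_ℓ fluctuates around 1′_k E[A] 1_ℓ on a scale far below the quantity itself. The DCBM parameters below are our own toy choices:

```python
import numpy as np

n, K = 300, 3
rng = np.random.default_rng(3)
theta = rng.uniform(0.05, 0.3, size=n)
labels = np.arange(n) % K
P = np.array([[1.0, 0.2, 0.3],
              [0.2, 1.0, 0.4],
              [0.3, 0.4, 1.0]])
Pi = np.eye(K)[labels]
Omega = np.outer(theta, theta) * (Pi @ P @ Pi.T)
np.fill_diagonal(Omega, 0.0)

target = Pi[:, 0] @ Omega @ Pi[:, 1]          # 1_1' E[A] 1_2
reps, vals = 200, []
for _ in range(reps):
    A = np.triu(rng.random((n, n)) < Omega, 1)
    A = (A + A.T).astype(float)
    vals.append(Pi[:, 0] @ A @ Pi[:, 1])      # 1_1' A 1_2
vals = np.asarray(vals)

# sample mean matches the target; typical deviations are much smaller
# than the target itself (which is of order ||theta||_1^2)
assert abs(vals.mean() - target) < 0.05 * target
assert vals.std() < 0.2 * target
```

The empirical standard deviation is of order √(1′_kΩ1_ℓ), i.e., of order ‖θ‖₁, matching the deviation scale x_n‖θ‖₁ in (D.153).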
Furthermore, 1′_k d − 1′_k E[d] = ∑_{ℓ=1}^m (1′_k A 1_ℓ − 1′_k E[A] 1_ℓ). So, with probability 1 − m² exp(−C₁x_n²),

max_{1≤k≤m} |1′_k d − 1′_k E[d]| ≤ m·x_n‖θ‖₁.

At the same time, we know that 1′_k E[A] 1_ℓ ≍ ‖θ‖₁² and 1′_k E[d] ≍ ‖θ‖₁². We plug the above results into the expressions of U_{kℓ} and Û_{kℓ} and can easily find that, with probability 1 − m² exp(−C₁x_n²),

max_{1≤k,ℓ≤m} |Û_{kℓ} − U_{kℓ}| ≤ C₀x_n/‖θ‖₁³,

for some constant C₀ > 0 (C₀ depends on m, but m is bounded here). We use this C₀ to define E_n. Then,

P(E^c_n) ≤ m² exp(−C₁x_n²) = o(n^{−L}),   for any fixed L > 0,   (D.154)

where the last equality is due to x_n² ≫ log(n). We aim to use (D.154) to bound E[(S_{k₁k₂k₃k₄} − T_{k₁k₂k₃k₄})²·I_{E^c_n}]. It is easy to see the trivial bounds |Û_{kℓ}| ≤ 1 and |U_{kℓ}| ≤ 1. Also, recall that ã_{ij} takes values in {Ω̃_{ij}, W_{ij}, δ_{ij}, −(d_i − E d_i)(d_j − E d_j)}, so |ã_{ij}| ≤ Cn²; the same bound holds for |b̃_{ij}|, |c̃_{ij}| and |d̃_{ij}|. This gives the trivial polynomial bound

(S_{k₁k₂k₃k₄} − T_{k₁k₂k₃k₄})² ≤ 2S²_{k₁k₂k₃k₄} + 2T²_{k₁k₂k₃k₄} ≤ Cn²⁴.

Combining it with (D.154), we have

E[(T_{k₁k₂k₃k₄} − S_{k₁k₂k₃k₄})²·I_{E^c_n}] ≤ Cn²⁴·m² exp(−C₁x_n²) = o(1).   (D.155)
At the same time, on the event E_n,

|S_{k₁k₂k₃k₄} − T_{k₁k₂k₃k₄}| = |Û^{ℓ_a}_{k₁k₂}Û^{ℓ_b}_{k₂k₃}Û^{ℓ_c}_{k₃k₄}Û^{ℓ_d}_{k₄k₁} − U^{ℓ_a}_{k₁k₂}U^{ℓ_b}_{k₂k₃}U^{ℓ_c}_{k₃k₄}U^{ℓ_d}_{k₄k₁}| · ‖θ‖₁^{2(ℓ_a+ℓ_b+ℓ_c+ℓ_d)} |R_{k₁k₂k₃k₄}|
≤ C (|U^{ℓ_a}_{k₁k₂}U^{ℓ_b}_{k₂k₃}U^{ℓ_c}_{k₃k₄}U^{ℓ_d}_{k₄k₁}| · max_{1≤k,ℓ≤m} |Û_{kℓ}/U_{kℓ} − 1|) · ‖θ‖₁^{2(ℓ_a+ℓ_b+ℓ_c+ℓ_d)} |R_{k₁k₂k₃k₄}|
≤ C‖θ‖₁² · max_{1≤k,ℓ≤m} |Û_{kℓ} − U_{kℓ}| · |R_{k₁k₂k₃k₄}|
≤ Cx_n‖θ‖₁⁻¹ · |R_{k₁k₂k₃k₄}| = o(‖θ‖⁻²) · |R_{k₁k₂k₃k₄}|,

where the third line is because C⁻¹‖θ‖₁⁻² ≤ |U_{kℓ}| ≤ C‖θ‖₁⁻² (e.g., see (D.137)), and the last step is because x_n ≪ ‖θ‖₁/‖θ‖². It follows that

E[(T_{k₁k₂k₃k₄} − S_{k₁k₂k₃k₄})²·I_{E_n}] = o(‖θ‖⁻⁴)·E[R²_{k₁k₂k₃k₄}].   (D.156)

We combine (D.155) and (D.156) and plug in (D.151). It follows that

E[(T_{k₁k₂k₃k₄} − S_{k₁k₂k₃k₄})²] = o(‖θ‖⁻⁴)·E[R²_{k₁k₂k₃k₄}] + o(1) = o(‖θ‖⁴),

under the conditions of either Lemma 4.3 or Lemma 4.8. Since m is bounded, we immediately know that

E[(S − T)²] = o(‖θ‖⁴),   under the conditions of either Lemma 4.3 or Lemma 4.8.   (D.157)

Last, we combine the results on T and the results on (S − T).
By (D.149)-(D.150) and (D.157),

|E[S]| ≤ |E[T]| + |E[S − T]| ≤ |E[T]| + √(E[(S − T)²]) = { o(‖θ‖⁴) + o(‖θ‖²) = o(‖θ‖⁴), for the setting of Lemma 4.3;  Cτ‖θ‖⁴ + o(‖θ‖⁴), for the setting of Lemma 4.8 }.

Additionally,

Var(S) ≤ 2Var(T) + 2Var(S − T) ≤ 2Var(T) + 2E[(S − T)²] ≤ { o(‖θ‖⁸) + o(‖θ‖⁴) = o(‖θ‖⁸), for the setting of Lemma 4.3;  o(‖θ‖⁸ + τ‖θ‖‖θ‖) + o(‖θ‖⁴) = o(‖θ‖⁸ + τ‖θ‖‖θ‖), for the setting of Lemma 4.8 }.

This gives the desired claim.

D.5 Proof of Lemma C.5

Similar to the proof of Lemma C.3, we use the notation M_{ijkℓ}(X) = X_{ij}X_{jk}X_{kℓ}X_{ℓi}. By (C.121),

Q^(m,0)_n − Q̃*^(m,0)_n = ∑_{i₁,i₂,i₃,i₄ (dist)} [M_{i₁i₂i₃i₄}(X) − M_{i₁i₂i₃i₄}(X̃*)],   where X_{ij} = Ω̃^(m,0)_{ij} + W_{ij} + δ^(m,0)_{ij} + r̃^(m,0)_{ij} + ε^(m,0)_{ij},  X̃*_{ij} = Ω̃^(m,0)_{ij} + W_{ij} + δ^(m,0)_{ij} + r̃^(m,0)_{ij}.

We shall omit the superscripts (m,0) in (Ω̃, δ, r̃, ε). Let N^(m,0)_1, N^(m,0)_2, …, N^(m,0)_m be the pseudo-communities defined by Π^(m,0). By (C.118), ε_{ij} = α̃_{ij} + β̃_{ij} + γ̃_{ij}, where for i ∈ N^(m,0)_k and j ∈ N^(m,0)_ℓ,

α̃_{ij} = d*_i d*_j U*_{kℓ} − (E d_i)(E d_j)U_{kℓ},   β̃_{ij} = (U_{kℓ} − Û_{kℓ})(E d_i)(E d_j),   γ̃_{ij} = (U_{kℓ} − Û_{kℓ})[(E d_i)(d_j − E d_j) + (E d_j)(d_i − E d_i)].   (D.158)

Therefore, we can write

Q^(m,0)_n − Q̃*^(m,0)_n = ∑_{i₁,i₂,i₃,i₄ (dist)} [M_{i₁i₂i₃i₄}(Ω̃ + W + δ + r̃ + α̃ + β̃ + γ̃) − M_{i₁i₂i₃i₄}(Ω̃ + W + δ + r̃)].

There are 7⁴ − 4⁴ = 2145 post-expansion sums.
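The two combining steps used just above are generic moment inequalities: |E[S]| ≤ |E[T]| + √(E[(S−T)²]) (by Cauchy-Schwarz applied to E[S−T]) and Var(S) ≤ 2Var(T) + 2Var(S−T) ≤ 2Var(T) + 2E[(S−T)²]. A quick numerical check of our own, on arbitrary samples standing in for T and S − T:

```python
import numpy as np

rng = np.random.default_rng(4)
T = rng.normal(1.0, 2.0, size=10_000)    # stands in for T
D = rng.normal(0.3, 1.5, size=10_000)    # stands in for S - T
S = T + D

# |E S| <= |E T| + sqrt(E[(S-T)^2])  (Cauchy-Schwarz on E[S-T])
assert abs(S.mean()) <= abs(T.mean()) + np.sqrt((D ** 2).mean()) + 1e-12
# Var(S) <= 2 Var(T) + 2 Var(S-T), since 2 Cov(T,D) <= Var(T) + Var(D)
assert S.var() <= 2 * T.var() + 2 * D.var() + 1e-12
# Var(S-T) <= E[(S-T)^2]
assert D.var() <= (D ** 2).mean() + 1e-12
```

These inequalities hold for any joint distribution (indeed for any sample), which is why no independence between T and S − T is needed in the argument.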
Let S be the generic notation for any such post-expansion sum. Similarly as in the proof of Lemma C.4, we group the summands according to which pseudo-communities (i₁, i₂, i₃, i₄) belong to, i.e., we write S = ∑_{1≤k₁,k₂,k₃,k₄≤m} S_{k₁k₂k₃k₄}, where

S_{k₁k₂k₃k₄} = ∑_{i_j∈N^(m,0)_{k_j}, j=1,…,4} a_{i₁i₂} b_{i₂i₃} c_{i₃i₄} d_{i₄i₁},   where a, b, c, d ∈ {Ω̃, W, δ, r̃, α̃, β̃, γ̃}.   (D.159)

It suffices to study the mean and variance of each S_{k₁k₂k₃k₄}.

Let τ and r_{ij} be the same as in (4.25) and (D.147). Define

α_{ij} = (τ‖θ‖²/θ_max)·[d*_i d*_j U*_{kℓ} − (E d_i)(E d_j)U_{kℓ}],
β_{ij} = τ U_{kℓ}(E d_i)(E d_j),
γ_{ij} = U_{kℓ}[(E d_i)(d_j − E d_j) + (E d_j)(d_i − E d_i)].   (D.160)

We introduce a proxy of S_{k₁k₂k₃k₄} as

S*_{k₁k₂k₃k₄} = ∑_{i_j∈N^(m,0)_{k_j}, j=1,…,4} a_{i₁i₂} b_{i₂i₃} c_{i₃i₄} d_{i₄i₁},   where a, b, c, d ∈ {Ω̃, W, δ, r, α, β, γ}.   (D.161)

Reviewing the expressions of (Ω̃_{ij}, W_{ij}, δ_{ij}, r_{ij}, α_{ij}, β_{ij}, γ_{ij}), we see that S*_{k₁k₂k₃k₄} can always be written as a weighted sum of monomials of W, so we can calculate the mean and variance of S*_{k₁k₂k₃k₄} (the straightforward calculations are still tedious, but later we introduce a simple trick for them). Comparing (D.160) with (D.158), and r_{ij} with r̃_{ij}, we observe that, for i ∈ N^(m,0)_k and j ∈ N^(m,0)_ℓ,

r̃_{ij} = (Û_{kℓ}/U_{kℓ}) r_{ij},   α̃_{ij} = (θ_max/(τ‖θ‖²)) α_{ij},   β̃_{ij} = [(U_{kℓ} − Û_{kℓ})/(τU_{kℓ})] β_{ij},   γ̃_{ij} = [(U_{kℓ} − Û_{kℓ})/U_{kℓ}] γ_{ij}.
We plug them into (D.159) to get
$$
S_{k_1k_2k_3k_4} = \Bigl(\frac{\widehat{U}_{k\ell}}{U_{k\ell}}\Bigr)^{N_{\tilde r}}
\Bigl(\frac{\theta_{\max}}{\tau\|\theta\|_1}\Bigr)^{N_{\tilde\alpha}}
\Bigl(\frac{U_{k\ell} - \widehat{U}_{k\ell}}{\tau U_{k\ell}}\Bigr)^{N_{\tilde\beta}}
\Bigl(\frac{U_{k\ell} - \widehat{U}_{k\ell}}{U_{k\ell}}\Bigr)^{N_{\tilde\gamma}} S^*_{k_1k_2k_3k_4},
$$ (D.162)
where $N_{\tilde r}$ is the count of $\{a,b,c,d\}$ in (D.159) taking the value $\tilde{r}$, and $(N_{\tilde\alpha}, N_{\tilde\beta}, N_{\tilde\gamma})$ are similar. For any post-expansion sum considered here, $1 \le N_{\tilde\alpha} + N_{\tilde\beta} + N_{\tilde\gamma} \le 4$. The notation $(\widehat{U}_{k\ell}/U_{k\ell})^{N_{\tilde r}}$ is interpreted in this way: for example, if in (D.159) only $a$ takes the value $\tilde{r}$, then $N_{\tilde r} = 1$ and $(\widehat{U}_{k\ell}/U_{k\ell})^{N_{\tilde r}} = \widehat{U}_{k_1k_2}/U_{k_1k_2}$; if $(a,b,c)$ take the value $\tilde{r}$, then $N_{\tilde r} = 3$ and $(\widehat{U}_{k\ell}/U_{k\ell})^{N_{\tilde r}} = (\widehat{U}_{k_1k_2}/U_{k_1k_2})(\widehat{U}_{k_2k_3}/U_{k_2k_3})(\widehat{U}_{k_3k_4}/U_{k_3k_4})$. In (D.162), $S^*_{k_1k_2k_3k_4}$ is a random variable whose mean and variance are relatively easy to calculate. The factor in front of $S^*_{k_1k_2k_3k_4}$ has a complicated correlation with the summands in $S^*_{k_1k_2k_3k_4}$, but fortunately we can apply a simple bound on this factor. Consider the event $E_n$ as in (D.152). We have shown in (D.154) that $\mathbb{P}(E^c_n) = o(n^{-L})$ for any fixed $L > 0$. Therefore, the event $E^c_n$ has a negligible effect on the mean and variance of $S_{k_1k_2k_3k_4}$, i.e., $\mathbb{E}[S_{k_1k_2k_3k_4}\cdot I_{E^c_n}] = o(1)$. On the event $E_n$, we have $\max_{k,\ell}\{|\widehat{U}_{k\ell} - U_{k\ell}|/U_{k\ell}\} \le C x_n/\|\theta\|_1$.
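The factorization in (D.162) works because each ratio (e.g., $\widehat{U}_{k\ell}/U_{k\ell}$) is constant on a fixed community pair, so it pulls out of the block sum, one ratio per reweighted factor. A minimal numeric illustration (not the paper's quantities; `R` plays the role of a generic pair-level ratio, and the base matrices are arbitrary stand-ins):

```python
import numpy as np

rng = np.random.default_rng(0)
m, sizes = 2, [3, 4]                    # two toy pseudo-communities
comm = np.repeat(np.arange(m), sizes)   # community label of each node
n = comm.size

# Base entries a*, b*, c*, d* (stand-ins for the factors of S*).
astar, bstar, cstar, dstar = (rng.normal(size=(n, n)) for _ in range(4))
R = rng.uniform(1.0, 2.0, size=(m, m))  # pair-level ratio, constant on each block

# Reweight two of the four factors entrywise: a_{ij} = R[comm_i, comm_j] * a*_{ij}.
a = R[comm[:, None], comm[None, :]] * astar
b = R[comm[:, None], comm[None, :]] * bstar
c, d = cstar, dstar                     # the other two factors stay unweighted

def block_sum(A, B, C, D, k):
    """Sum of A_{i1 i2} B_{i2 i3} C_{i3 i4} D_{i4 i1} over i_j in community k_j."""
    idx = [np.flatnonzero(comm == kj) for kj in k]
    return sum(A[i1, i2] * B[i2, i3] * C[i3, i4] * D[i4, i1]
               for i1 in idx[0] for i2 in idx[1]
               for i3 in idx[2] for i4 in idx[3])

k = (0, 1, 1, 0)
lhs = block_sum(a, b, c, d, k)
# One ratio per reweighted factor, attached to that factor's edge (k_1,k_2), (k_2,k_3):
rhs = R[k[0], k[1]] * R[k[1], k[2]] * block_sum(astar, bstar, cstar, dstar, k)
assert np.isclose(lhs, rhs)
```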
It follows that
$$
|S_{k_1k_2k_3k_4}| \le \Bigl(\max_{k,\ell}\frac{|\widehat{U}_{k\ell}|}{U_{k\ell}}\Bigr)^{N_{\tilde r}}
\Bigl(\frac{\theta_{\max}}{\tau\|\theta\|_1}\Bigr)^{N_{\tilde\alpha}}
\Bigl(\max_{k,\ell}\frac{|U_{k\ell} - \widehat{U}_{k\ell}|}{\tau U_{k\ell}}\Bigr)^{N_{\tilde\beta}}
\Bigl(\max_{k,\ell}\frac{|U_{k\ell} - \widehat{U}_{k\ell}|}{U_{k\ell}}\Bigr)^{N_{\tilde\gamma}}|S^*_{k_1k_2k_3k_4}|
\le C\Bigl(\frac{\theta_{\max}}{\tau\|\theta\|_1}\Bigr)^{N_{\tilde\alpha}}\Bigl(\frac{x_n}{\tau\|\theta\|_1}\Bigr)^{N_{\tilde\beta}}\Bigl(\frac{x_n}{\|\theta\|_1}\Bigr)^{N_{\tilde\gamma}}|S^*_{k_1k_2k_3k_4}|.
$$
Since $x_n \ll \|\theta\|_1/\|\theta\|^2$ and $\tau\|\theta\| \to \infty$, we immediately have $x_n/\|\theta\|_1 = o(\|\theta\|^{-2})$, $x_n/(\tau\|\theta\|_1) = o(1/(\tau\|\theta\|^2)) = o(\|\theta\|^{-1})$, and $\theta_{\max}/(\tau\|\theta\|_1) \le \theta^2_{\max}/(\tau\|\theta\|^2) = o(\|\theta\|^{-1})$. It follows that
$$
|S_{k_1k_2k_3k_4}| = o(1)\cdot\|\theta\|^{-(N_{\tilde\alpha} + N_{\tilde\beta} + 2N_{\tilde\gamma})}\cdot|S^*_{k_1k_2k_3k_4}|, \qquad\text{on the event } E_n.
$$
Therefore,
$$
\mathbb{E}[S^2_{k_1k_2k_3k_4}] = \mathbb{E}[S^2_{k_1k_2k_3k_4}\cdot I_{E_n}] + \mathbb{E}[S^2_{k_1k_2k_3k_4}\cdot I_{E^c_n}]
= o(1)\cdot\|\theta\|^{-(2N_{\tilde\alpha} + 2N_{\tilde\beta} + 4N_{\tilde\gamma})}\cdot\mathbb{E}\bigl[(S^*_{k_1k_2k_3k_4})^2\bigr] + o(1).
$$ (D.163)
It remains to bound $\mathbb{E}\bigl[(S^*_{k_1k_2k_3k_4})^2\bigr]$. As we mentioned, we can write $S^*_{k_1k_2k_3k_4}$ as a weighted sum of monomials of $W$ and calculate its mean and variance directly. However, given that there are 2145 types of $S^*_{k_1k_2k_3k_4}$, the calculation is still very tedious. We now use a simple trick to relate the $S^*_{k_1k_2k_3k_4}$ to the post-expansion sums we have analyzed in Lemmas C.3-C.4. We first bound $|\alpha_{ij}|$ in (D.160). Since $d^*_i = \mathbb{E}[d_i] + \Omega_{ii}$,
$$
|\alpha_{ij}| \le \frac{\tau\|\theta\|_1}{\theta_{\max}}\Bigl(\mathbb{E}[d_i]\,\mathbb{E}[d_j]\,|U^*_{k\ell} - U_{k\ell}| + (\Omega_{ii}\,\mathbb{E}[d_j] + \Omega_{jj}\,\mathbb{E}[d_i])\,U^*_{k\ell} + \Omega_{ii}\Omega_{jj}\,U^*_{k\ell}\Bigr).
$$
By basic algebra, $|(x_1 + x_2)/(y_1 + y_2) - x_1/y_1| \le |x_2|/(y_1 + y_2) + |x_1 y_2|/[(y_1 + y_2)y_1]$.
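This elementary inequality follows from the identity $(x_1+x_2)/(y_1+y_2) - x_1/y_1 = x_2/(y_1+y_2) - x_1y_2/[(y_1+y_2)y_1]$ plus the triangle inequality, valid whenever $y_1 > 0$ and $y_1 + y_2 > 0$. A randomized sanity check (illustrative only, with arbitrary ranges for the inputs):

```python
import random

random.seed(1)
for _ in range(10_000):
    x1, x2 = random.uniform(-5, 5), random.uniform(-5, 5)
    y1 = random.uniform(0.1, 5)
    y2 = random.uniform(-0.5 * y1, 5)   # keeps y1 + y2 >= 0.5 * y1 > 0
    lhs = abs((x1 + x2) / (y1 + y2) - x1 / y1)
    rhs = abs(x2) / (y1 + y2) + abs(x1 * y2) / ((y1 + y2) * y1)
    # Small tolerance guards against floating-point rounding only.
    assert lhs <= rhs + 1e-12
print("inequality holds on all sampled inputs")
```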
We apply it on (C.113)-(C.114) and note that $\mathbf{1}'_k(\Omega - \mathbb{E}[A])\mathbf{1}_\ell = \mathbf{1}'_k\,\mathrm{diag}(\Omega)\,\mathbf{1}_\ell = O(\|\theta\|^2)$ and $\mathbf{1}'_k(d^* - \mathbb{E}[d]) = \mathbf{1}'_k\,\mathrm{diag}(\Omega)\,\mathbf{1}_n = O(\|\theta\|^2)$. It yields
$$
|U^*_{k\ell} - U_{k\ell}| \le \frac{|\mathbf{1}'_k\Omega\mathbf{1}_\ell - \mathbf{1}'_k\mathbb{E}[A]\mathbf{1}_\ell|}{(\mathbf{1}'_k d^*)(\mathbf{1}'_\ell d^*)} + \frac{(\mathbf{1}'_k\mathbb{E}[A]\mathbf{1}_\ell)\,\bigl|(\mathbf{1}'_k d^*)(\mathbf{1}'_\ell d^*) - (\mathbf{1}'_k\mathbb{E}[d])(\mathbf{1}'_\ell\mathbb{E}[d])\bigr|}{(\mathbf{1}'_k d^*)(\mathbf{1}'_\ell d^*)(\mathbf{1}'_k\mathbb{E}[d])(\mathbf{1}'_\ell\mathbb{E}[d])}
\le C\|\theta\|_1^{-4}\cdot\mathbf{1}'_k\,\mathrm{diag}(\Omega)\,\mathbf{1}_n + C\|\theta\|_1^{-6}\cdot\bigl|(\mathbf{1}'_k d^*)(\mathbf{1}'_\ell d^*) - (\mathbf{1}'_k\mathbb{E}[d])(\mathbf{1}'_\ell\mathbb{E}[d])\bigr|
\le C\|\theta\|_1^{-4}\cdot\|\theta\|^2 + C\|\theta\|_1^{-6}\cdot\|\theta\|_1^2\|\theta\|^2 \le C\|\theta\|_1^{-3}\theta_{\max},
$$
where in the last line we have used $\|\theta\|^2 \le \theta_{\max}\|\theta\|_1$. Combining the above gives
$$
|\alpha_{ij}| \le C\frac{\tau\|\theta\|_1}{\theta_{\max}}\Bigl[\theta_i\theta_j\|\theta\|_1^2\cdot\|\theta\|_1^{-3}\theta_{\max} + (\theta_i^2\theta_j\|\theta\|_1 + \theta_j^2\theta_i\|\theta\|_1)\cdot\|\theta\|_1^{-2} + \theta_i^2\theta_j^2\|\theta\|_1^{-2}\Bigr]
\le C\frac{\tau\|\theta\|_1}{\theta_{\max}}\cdot\frac{\theta_i\theta_j\theta_{\max}}{\|\theta\|_1} \le C\tau\theta_i\theta_j.
$$
Additionally, in (D.160), we observe that $\gamma_{ij} = \delta_{ij}$. Since $|U_{k\ell}| \le C\|\theta\|_1^{-2}$ and $\mathbb{E}[d_i] \le C\theta_i\|\theta\|_1$, it is true that $|\beta_{ij}| \le C\tau\theta_i\theta_j$. We summarize the results as
$$
|\alpha_{ij}| \le C\tau\theta_i\theta_j,\qquad |\beta_{ij}| \le C\tau\theta_i\theta_j,\qquad \gamma_{ij} = \delta_{ij}.
$$ (D.164)
It says that $\gamma$ is the same as $\delta$, and $(\alpha, \beta)$ behave similarly to $\widetilde\Omega$.
Consequently, the calculation of the mean and variance of $S^*_{k_1k_2k_3k_4}$ in (D.161) can be carried out by replacing $(\alpha, \beta, \gamma)$ with $(\widetilde\Omega, \widetilde\Omega, \delta)$. In other words, we only need to study a sum like
$$
S^{**}_{k_1k_2k_3k_4} = \sum_{i_j\in\mathcal{N}^{(m,0)}_{k_j},\,1\le j\le 4} a_{i_1i_2}\,b_{i_2i_3}\,c_{i_3i_4}\,d_{i_4i_1},
\qquad\text{where } a,b,c,d \in \{\widetilde\Omega, W, \delta, r\}.
$$
Let $(N_{\widetilde\Omega}, N_W, N_\delta, N_r, N_\alpha, N_\beta, N_\gamma)$ be the counts of different terms in $\{a,b,c,d\}$ determined by $S^*_{k_1k_2k_3k_4}$, where these counts sum to 4. In $S^{**}_{k_1k_2k_3k_4}$, the counts become $N^*_{\widetilde\Omega} = N_{\widetilde\Omega} + N_\alpha + N_\beta$, $N^*_W = N_W$, $N^*_\delta = N_\delta + N_\gamma$, and $N^*_r = N_r$. Luckily, anything like $S^{**}_{k_1k_2k_3k_4}$