A Unified Framework for Tuning Hyperparameters in Clustering Problems
Xinjie Fan, Yuguang Yue, Purnamrita Sarkar, Y. X. Rachel Wang
Department of Statistics and Data Science, University of Texas at Austin
School of Mathematics and Statistics, University of Sydney
February 4, 2020
Abstract
Selecting hyperparameters for unsupervised learning problems is challenging in general due to the lack of ground truth for validation. Despite the prevalence of this issue in statistics and machine learning, especially in clustering problems, there are not many methods for tuning these hyperparameters with theoretical guarantees. In this paper, we provide a framework with provable guarantees for selecting hyperparameters in a number of distinct models. We consider both the subgaussian mixture model and network models to serve as examples of i.i.d. and non-i.i.d. data. We demonstrate that the same framework can be used to choose the Lagrange multipliers of penalty terms in semidefinite programming (SDP) relaxations for community detection, and the bandwidth parameter for constructing kernel similarity matrices for spectral clustering. By incorporating a cross-validation procedure, we show the framework can also do consistent model selection for network models. Using a variety of simulated and real data examples, we show that our framework outperforms other widely used tuning procedures in a broad range of parameter settings.
1 Introduction

A standard statistical model has parameters, which characterize the underlying data distribution; an inference algorithm used to learn these parameters typically involves hyperparameters (or tuning parameters). Popular examples include the penalty parameter in regularized regression models, the number of clusters in clustering analysis, and the bandwidth parameter in kernel based clustering, nonparametric density estimation or regression methods (Wasserman (2006); Tibshirani et al. (2015)), to name but a few. It is well known that selecting these hyperparameters may require repeated training to search through different combinations of plausible hyperparameter values, and often has to rely on good heuristics and domain knowledge from the user.

A classical method for automated hyperparameter tuning is the nonparametric procedure of Cross-Validation (CV) (Stone (1974); Zhang (1993)), which has been used extensively in machine learning and statistics (Hastie et al. (2005)). CV has been studied extensively in supervised learning settings, particularly in low dimensional linear models (Shao (1993); Yang et al. (2007)) and penalized regression in high dimension (Wasserman & Roeder (2009)). Other notable stability based methods for model selection in similar supervised settings include Breiman et al. (1996); Bach (2008); Meinshausen & Bühlmann (2010); Lim & Yu (2016). Finally, a large number of empirical methods exist in the machine learning literature for tuning hyperparameters in various training algorithms (Bergstra & Bengio (2012); Bengio (2000); Snoek et al. (2012); Bergstra et al. (2011)), most of which do not provide theoretical guarantees.

In contrast to the supervised setting with i.i.d. data used in many of the above methods, in this paper we consider unsupervised clustering problems with possible dependence structure in the datapoints. We propose an overarching framework for hyperparameter tuning and model selection for a variety of probabilistic clustering models. Here the challenge is two-fold. First, since labels are not available, choosing a criterion for evaluation, and in general a method for selecting hyperparameters, is not easy. One may consider splitting the data into different folds and selecting the model or hyperparameter with the most stable solution. However, over multiple splits of the data the inference algorithm may get stuck at the same local optima, and thus stability alone can lead to a suboptimal solution (Von Luxburg et al. (2010)). In Wang (2010); Fang & Wang (2012), the authors overcome this by redefining the number of clusters as the one that gives the most stable clustering for a given algorithm. In Meila (2018), a semidefinite program (SDP) maximizing an inner product criterion is solved for each clustering solution, and the value of the objective function is used to evaluate the stability of the clustering; the analysis is done without any model assumptions. The second difficulty arises if there is dependence structure in the datapoints, which necessitates careful splitting procedures in a CV-based approach.

To illustrate the generality of our framework, we focus on subgaussian mixtures and statistical network models, namely the Stochastic Blockmodel (SBM) and the Mixed Membership Stochastic Blockmodel (MMSB), as representative models for i.i.d. and non-i.i.d. data where clustering is a natural problem. We propose a unified framework with provable guarantees for hyperparameter tuning and model selection in these models.
More specifically, our contributions can be summarized as follows:
1. Our framework can provably tune the following hyperparameters:
(a) the Lagrange multiplier of the penalty term in a type of semidefinite relaxation for community detection problems in SBM;
(b) the bandwidth parameter used in kernel spectral clustering for subgaussian mixture models.
2. We obtain consistent model selection, i.e. determination of the number of clusters:
(a) when the model selection problem is embedded in the choice of the Lagrange multiplier in another type of SDP relaxation for community detection in SBM;
(b) for general model selection in the Mixed Membership Stochastic Blockmodel (MMSB), which includes the SBM as a sub-model.
We choose to focus on model selection for network-structured data because, for i.i.d. mixture models, there already is an extensive repertoire of empirical and provable methods, including the gap statistic (Tibshirani et al., 2001), the silhouette index (Rousseeuw, 1987), the slope criterion (Birgé & Massart, 2001), the eigen-gap (Von Luxburg, 2007), penalized maximum likelihood (Leroux, 1992), information theoretic approaches (AIC (Bozdogan, 1987), BIC (Keribin, 2000; Drton & Plummer, 2017), minimum message length (Figueiredo & Jain, 2002)), and spectral clustering and diffusion based methods (Maggioni & Murphy, 2018; Little et al., 2017). We discuss related work on the other models in the following subsection.

1.1 Related work

Hyperparameters and model selection in network models:
In network analysis, a number of methods exist for selecting the true number of communities (denoted by $r$) with consistency guarantees, including Lei et al. (2016); Wang & Bickel (2017); Le & Levina (2015); Bickel & Sarkar (2016) for SBM, and Fan et al. (2019) and Han et al. (2019) for more general models such as the degree-corrected mixed membership blockmodel, but these methods have not been generalized to other hyperparameter selection problems. For CV-based methods, existing strategies involve node splitting (Chen & Lei (2018)) or edge splitting (Li et al. (2016)). In the former, it is established that CV prevents underfitting for model selection in SBM. In the latter, a similar one-sided consistency result is shown for Random Dot Product Graph models (RDPG, Young & Scheinerman (2007)), which include the SBM as a special case. This method has also been empirically applied to tune other hyperparameters, though no provable guarantee was provided.

In terms of algorithms for community detection or clustering, SDP methods have gained a lot of attention (Abbe et al. (2015); Amini et al. (2018); Guédon & Vershynin (2016); Cai et al. (2015); Hajek et al. (2016)) due to their strong theoretical guarantees. Typically, SDP based methods can be divided into two broad categories. The first maximizes a penalized trace of the product of the adjacency matrix and an unnormalized clustering matrix (see definition in Section 2.2). Here the hyperparameter is the Lagrange multiplier of the penalty term (Amini et al. (2018); Cai et al. (2015); Chen & Lei (2018); Guédon & Vershynin (2016)). In this formulation, the optimization problem does not need to know the number of clusters; however, it is implicitly required in the final step which obtains the memberships from the clustering matrix. The other class of SDP methods uses a trace criterion with a normalized clustering matrix (definition in Section 2.2) (Peng & Wei, 2007; Yan & Sarkar, 2019; Mixon et al., 2017). Here the constraints directly use the number of clusters. Yan et al. (2017) use a penalized alternative of this SDP to do provable model selection for SBMs. However, most of these methods require appropriate tuning of the Lagrange multipliers, which are themselves hyperparameters. Usually the theoretical upper and lower bounds on these hyperparameters involve unknown model parameters, which are nontrivial to estimate. The proposed method in Abbe & Sandon (2015) is agnostic of model parameters, but it involves a highly-tuned and hard to implement spectral clustering step (also noted by Perry & Wein (2017)). In this paper, we use an SDP from the first class (SDP-1) to demonstrate our provable tuning procedure, and another SDP from the second class (SDP-2) to establish a consistency guarantee for our model selection method.

Spectral clustering with mixture models:
In the statistical machine learning literature, the analysis of spectral clustering is typically done in terms of the Laplacian matrix built from an appropriately constructed similarity matrix of the datapoints. There has been much work (Hein et al., 2005; Hein, 2006; von Luxburg, 2007; Belkin & Niyogi, 2003; Giné & Koltchinskii, 2006) on establishing different forms of asymptotic convergence of the Laplacian. Recently, Löffler et al. (2019) established error bounds for spectral clustering that uses the Gram matrix as the similarity matrix. In Srivastava et al. (2019), error bounds are obtained for a variant of spectral clustering with the Gaussian kernel in the presence of outliers. Most existing tuning procedures for the bandwidth parameter of the Gaussian kernel are heuristic and do not have provable guarantees. Notable methods include von Luxburg (2007), who chooses an analogous parameter, namely the radius $\epsilon$ of an $\epsilon$-neighborhood graph, "as the length of the longest edge in a minimal spanning tree of the fully connected graph on the data points." Other discussions on selecting the bandwidth can be found in Hein et al. (2005); Coifman et al. (2008) and Schiebinger et al. (2015). Shi et al. (2008) propose a data dependent way to set the bandwidth parameter by suitably normalizing a quantile of a vector containing quantiles of distances from each point.

We now present our problem setup in Section 2. Section 3 proposes and analyzes our hyperparameter tuning method MATR for networks and subgaussian mixtures. Next, in Section 4, we present MATR-CV and the related consistency guarantees for model selection for SBM and MMSB models. Finally, Section 5 contains detailed simulated and real data experiments, and we conclude the paper with a discussion in Section 6.

2 Problem setup

Let $(C_1, \ldots, C_r)$ denote a partition of $n$ data points into $r$ clusters, and let $m_i = |C_i|$ denote the size of $C_i$. Denote $\pi_{\min} = \min_i m_i/n$. The cluster membership of each node is represented by an $n \times r$ matrix $Z$, with $Z_{ij} = 1$ if data point $i$ belongs to cluster $j$, and $0$ otherwise. Since $r$ is the true number of clusters, $Z^T Z$ is full rank. Given $Z$, the corresponding unnormalized clustering matrix is $ZZ^T$, and the normalized clustering matrix is $Z(Z^TZ)^{-1}Z^T$. $X$ can be either a normalized or unnormalized clustering matrix, and which one is meant will be made clear from the context. We use $\tilde X$ to denote the matrix returned by SDP algorithms, which may not be a clustering matrix. Denote by $\mathcal{X}_r$ the set of all possible normalized clustering matrices with cluster number $r$. Let $Z_0$ and $X_0$ be the membership matrix and normalized clustering matrix from the ground truth. $\lambda$ is a general hyperparameter; with a slight abuse of notation, we also use $\lambda$ to denote the Lagrange multiplier in SDP methods. For any matrix $X \in \mathbb{R}^{n \times n}$, let $X_{C_k, C_\ell}$ be the matrix such that $X_{C_k, C_\ell}(i,j) = X(i,j)$ if $i \in C_k$, $j \in C_\ell$, and $0$ otherwise. $E_n$ is the $n \times n$ all-ones matrix. We write $\langle A, B \rangle = \mathrm{trace}(A^T B)$. The standard notations $o, O, o_P, O_P, \Theta, \Omega$ will be used. By "with high probability", we mean with probability tending to one.

We consider a general clustering setting where the data $\mathcal{D}$ gives rise to an $n \times n$ observed symmetric similarity matrix $\hat S$. Denote by $\mathcal{A}$ a clustering algorithm which operates on the data $\mathcal{D}$ with a hyperparameter $\lambda$ and outputs a clustering result in the form of $\hat Z$ or $\hat X$. Note that $\mathcal{A}$ may or may not perform clustering on $\hat S$, and $\mathcal{A}$, $\hat Z$ and $\hat X$ could all depend on $\lambda$.
In this paper we assume that $\hat S$ has the form $\hat S = S + R$, where $R$ is a matrix of arbitrary noise and $S$ is the "population similarity matrix". As we consider different clustering models for network-structured data and i.i.d. mixture data, it will be made clear what $\hat S$ and $S$ are in each context.
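To make the notation concrete, the short sketch below (an illustration added here, not part of the paper; the function names are ours) builds the membership matrix $Z$, the unnormalized and normalized clustering matrices, and evaluates the inner product $\langle A, B \rangle = \mathrm{trace}(A^T B)$ in NumPy.

```python
import numpy as np

def membership_matrix(labels, r):
    """n x r binary matrix Z with Z[i, j] = 1 iff point i is in cluster j."""
    n = len(labels)
    Z = np.zeros((n, r))
    Z[np.arange(n), labels] = 1
    return Z

def clustering_matrices(Z):
    """Return the unnormalized (Z Z^T) and normalized (Z (Z^T Z)^{-1} Z^T) clustering matrices."""
    unnormalized = Z @ Z.T
    normalized = Z @ np.linalg.inv(Z.T @ Z) @ Z.T
    return unnormalized, normalized

def trace_inner(A, B):
    """<A, B> = trace(A^T B) = sum of elementwise products for real matrices."""
    return np.sum(A * B)
```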
Assortativity (weak and strong): In some cases, we require weak assortativity on the similarity matrix $S$, defined as follows. Suppose for $i, j \in C_k$, $S_{ij} = a_{kk}$. Define the minimal difference between the diagonal term and the off-diagonal terms in the same row cluster as
$$p_{\text{gap}} = \min_k \Big( a_{kk} - \max_{i \in C_k,\, j \in C_\ell,\, \ell \neq k} S_{ij} \Big). \qquad (1)$$
Weak assortativity requires $p_{\text{gap}} > 0$. This condition is similar to weak assortativity defined for blockmodels (e.g. Amini et al. (2018)). It is mild compared to strong assortativity, which requires $\min_k a_{kk} - \max_{\ell \neq k,\, i \in C_k,\, j \in C_\ell} S_{ij} > 0$.
Stochastic Blockmodel (SBM): The SBM is a generative model for networks with community structure on $n$ nodes. The nodes are first partitioned into $r$ classes, which yields a membership matrix $Z$; the $n \times n$ binary adjacency matrix $A$ is then sampled from the probability matrix $P$ with $P_{ij} = Z_i B Z_j^T$ for $i \neq j$, where $Z_i$ and $Z_j$ are the $i$th and $j$th rows of $Z$, and $B$ is the $r \times r$ block probability matrix. The aim is to estimate the node memberships given $A$. We assume the elements of $B$ have order $\Theta(\rho)$ with $\rho \to 0$ at some rate.
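A minimal sampler for this model (our illustration; the paper does not prescribe an implementation) is sketched below.

```python
import numpy as np

def sample_sbm(Z, B, rng=None):
    """Sample a symmetric adjacency matrix A with P_ij = Z_i B Z_j^T for i != j
    and zero diagonal (no self-loops)."""
    rng = np.random.default_rng(rng)
    P = Z @ B @ Z.T
    upper = np.triu(rng.random(P.shape) < P, k=1)  # independent Bernoulli draws above the diagonal
    return (upper + upper.T).astype(int)
```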
Mixed Membership Stochastic Blockmodel (MMSB): The SBM can be restrictive when it comes to modeling real world networks, and various extensions have been proposed. The mixed membership stochastic blockmodel (MMSB, Airoldi et al. (2008)) relaxes the requirement that the membership vector $Z_i$ be binary and allows its entries to lie in $[0,1]^r$ such that they sum to 1 for every $i$. We denote this soft membership matrix by $\Theta$. Under the MMSB model, the $n \times n$ adjacency matrix $A$ is sampled from the probability matrix $P$ with $P_{ij} = \Theta_i B \Theta_j^T$ for $i \neq j$. We use an analogous definition for the normalized clustering matrix: $X = \Theta(\Theta^T\Theta)^{-1}\Theta^T$. Note that this reduces to the usual normalized clustering matrix when $\Theta$ is a binary cluster membership matrix.
Mixture of sub-gaussian random variables: Let $Y = [Y_1, \ldots, Y_n]^T$ be an $n \times d$ data matrix. We consider a setting where the $Y_i$ are generated from a mixture model with $r$ clusters,
$$Y_i = \mu_a + W_i, \quad E(W_i) = 0, \quad \mathrm{Cov}(W_i) = \sigma_a^2 I, \quad a = 1, \ldots, r, \qquad (2)$$
where the $W_i$'s are independent sub-gaussian vectors.
Trace criterion: Our framework is centered around the trace criterion $\langle \hat S, X_\lambda \rangle$, where $X_\lambda$ is the normalized clustering matrix associated with hyperparameter $\lambda$. This criterion is often used in relaxations of the k-means objective (Mixon et al., 2017; Peng & Wei, 2007; Yan et al., 2017) in the context of SDP methods. The idea is that the criterion is large when datapoints within the same cluster are more similar. This criterion is also used by Meila (2018) for evaluating the stability of a clustering solution, where the author uses an SDP to maximize the criterion for each clustering solution. Of course, this makes the implicit assumption that $\hat S$ (and $S$) is assortative, i.e. datapoints within the same cluster have high similarity according to $\hat S$. While this is reasonable for i.i.d. mixture models, not all community structures in network models are assortative if we use the adjacency matrix $A$ as $\hat S$. If all the communities in a network are dis-assortative, then one can simply use $-A$ as $\hat S$. However, for the SBM or MMSB models, one may have a mixture of assortative and dis-assortative structure. In what follows, we begin our discussion of hyperparameter tuning and model selection for SBM by assuming weak assortativity, both for ease of demonstration and because our algorithms of interest, SDP methods, operate on weakly assortative networks. For MMSB, which includes SBM as a sub-model, we show the same criterion still works without assortativity if we choose $\hat S$ to be $A$ with the diagonal removed.

3 Tuning hyperparameters with known r

In this section, we consider tuning hyperparameters when the true number of clusters $r$ is known. First, we provide two simulation studies to motivate this section. The detailed parameter settings for generating the data can be found in the Appendix Section C.

As mentioned in Section 1.1, SDP is an important class of methods for community detection in SBM, but its performance can depend on the choice of the Lagrange multiplier parameter. We first consider an SDP formulation (Li et al., 2018) which has been widely used with slight variations in the literature (Amini et al., 2018; Perry & Wein, 2017; Guédon & Vershynin, 2016; Cai et al., 2015; Chen & Lei, 2018):
$$\max_X \ \mathrm{trace}(AX) - \lambda\,\mathrm{trace}(X E_n) \quad \text{s.t.} \quad X \succeq 0,\ X \geq 0,\ X_{ii} = 1 \text{ for } 1 \leq i \leq n, \tag{SDP-1}$$
where $\lambda$ is a hyperparameter. Typically, one then performs spectral clustering (that is, $k$-means on the top $r$ eigenvectors) on the output of the SDP to get the clustering result. For Figure 1(a), we generate an adjacency matrix from the probability matrix described in Appendix Section C and use SDP-1 with tuning parameter $\lambda$ ranging from 0 to 1. The accuracy of the clustering result is measured by the normalized mutual information (NMI) and shown in Figure 1(a). We can see that different $\lambda$ values lead to widely varying clustering performance.

As a second example, we consider a four-component Gaussian mixture model generated as described in Appendix Section C. We perform spectral clustering ($k$-means on the top $r$ eigenvectors) of the widely used Gaussian kernel matrix (denoted $K$) with bandwidth parameter $\theta$. Figure 1(b) shows the clustering performance in NMI as $\theta$ varies; the flat region of suboptimal $\theta$ corresponds to cases where two adjacent clusters cannot be separated well.

Figure 1: Tuning parameters in SDP ((a) NMI vs. $\lambda$) and spectral clustering ((b) NMI vs. $\theta$); accuracy measured by normalized mutual information (NMI).
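To make the role of $\lambda$ concrete, here is a minimal sketch of SDP-1 using the generic convex solver CVXPY. This is our illustration of the formulation above, not the authors' code; a full $n \times n$ PSD variable only scales to modest $n$, and the solver choice is left to the reader.

```python
import cvxpy as cp

def sdp1(A, lam):
    """Solve SDP-1: max <A, X> - lam * <E_n, X>  s.t. X PSD, X >= 0, X_ii = 1."""
    n = A.shape[0]
    X = cp.Variable((n, n), PSD=True)
    # trace(X E_n) equals the sum of all entries of X
    objective = cp.Maximize(cp.trace(A @ X) - lam * cp.sum(X))
    constraints = [X >= 0, cp.diag(X) == 1]
    cp.Problem(objective, constraints).solve()
    return X.value
```

The clustering is then read off by running $k$-means on the top $r$ eigenvectors of the returned matrix.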
We show that in the case where the true cluster number $r$ is known, an ideal hyperparameter $\lambda$ can be chosen by simply maximizing the trace criterion introduced in Section 2.2. The tuning algorithm (MATR) is presented in Algorithm 1. It takes a general clustering algorithm $\mathcal{A}$, data $\mathcal{D}$ and similarity matrix $\hat S$ as inputs, and outputs a clustering result $\hat Z_{\lambda^*}$ with $\lambda^*$ chosen by maximizing the trace criterion.
Algorithm 1: MAx-TRace (MATR) based tuning algorithm for a known number of clusters.
Input: clustering algorithm $\mathcal{A}$, data $\mathcal{D}$, similarity matrix $\hat S$, a set of candidates $\{\lambda_1, \cdots, \lambda_T\}$, number of clusters $r$;
Procedure:
for $t = 1:T$ do
    run clustering on $\mathcal{D}$: $\hat Z_t = \mathcal{A}(\mathcal{D}, \lambda_t, r)$;
    compute the normalized clustering matrix: $\hat X_t = \hat Z_t(\hat Z_t^T \hat Z_t)^{-1}\hat Z_t^T$;
    compute the inner product: $l_t = \langle \hat S, \hat X_t \rangle$;
end for
$t^* = \arg\max(l_1, \ldots, l_T)$;
Output: $\hat Z_{t^*}$
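A direct implementation of Algorithm 1 takes only a few lines. The sketch below is ours, with a generic callable `algorithm(D, lam, r)` standing in for $\mathcal{A}$; it returns the label vector whose normalized clustering matrix maximizes the trace criterion.

```python
import numpy as np

def matr(algorithm, D, S_hat, lambdas, r):
    """MATR (Algorithm 1): pick the hyperparameter maximizing <S_hat, X_lambda>.

    algorithm(D, lam, r) is assumed to return a length-n label vector in {0, ..., r-1}."""
    best_score, best_labels = -np.inf, None
    for lam in lambdas:
        labels = np.asarray(algorithm(D, lam, r))
        Z = np.zeros((len(labels), r))
        Z[np.arange(len(labels)), labels] = 1
        X = Z @ np.linalg.pinv(Z.T @ Z) @ Z.T   # normalized clustering matrix
        score = np.sum(S_hat * X)               # <S_hat, X> = trace(S_hat^T X)
        if score > best_score:
            best_score, best_labels = score, labels
    return best_labels
```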
We have the following theoretical guarantee for Algorithm 1.

Theorem 1. Consider a clustering algorithm $\mathcal{A}$ with inputs $\mathcal{D}, \lambda, r$ and output $\hat Z_\lambda$. The similarity matrix $\hat S$ used in Algorithm 1 (MATR) can be written as $\hat S = S + R$. We further assume $S$ is weakly assortative with $p_{\text{gap}}$ defined in Eq (1), and $X_0$ is the normalized clustering matrix for the true binary membership matrix $Z_0$. Let $\pi_{\min}$ be the smallest cluster proportion, and $\tau := n\pi_{\min} p_{\text{gap}}$. As long as there exists $\lambda \in \{\lambda_1, \ldots, \lambda_T\}$ such that $\langle \hat X_\lambda, \hat S \rangle \geq \langle X_0, S \rangle - \epsilon$, Algorithm 1 will output a $\hat Z_{\lambda^*}$ such that
$$\big\| \hat X_{\lambda^*} - X_0 \big\|_F^2 \;\leq\; \frac{2}{\tau}\Big( \epsilon + \sup_{X \in \mathcal{X}_r} |\langle X, R \rangle| \Big),$$
where $\hat X_{\lambda^*}$ is the normalized clustering matrix associated with $\hat Z_{\lambda^*}$.

In other words, as long as the range of $\lambda$ we consider covers some optimal $\lambda$ value that leads to a sufficiently large trace criterion (compared with the true underlying $X_0$ and the population similarity matrix $S$), the theorem guarantees that Algorithm 1 will return a normalized clustering matrix with small error. The deviation $\epsilon$ depends both on the noise matrix $R$ and on how close the estimated $\hat X_\lambda$ is to the ground truth $X_0$, i.e. the performance of the algorithm. If both $\epsilon$ and $\sup_{X \in \mathcal{X}_r} |\langle X, R \rangle|$ are $o_P(\tau)$, then MATR yields a clustering matrix which is weakly consistent. The proof is in the Appendix Section A. In the following subsections, we apply MATR to more specific settings, namely to select the Lagrange multiplier parameter in SDP-1 for SBM and the bandwidth parameter in spectral clustering for sub-gaussian mixtures.

Tuning λ in SDP-1 for SBM

We consider the problem of choosing $\lambda$ in SDP-1 for community detection in SBM. Here, the inputs to Algorithm 1, the data $\mathcal{D}$ and the similarity matrix $\hat S$, are both the adjacency matrix $A$. A natural choice of a weakly assortative $S$ is the conditional expectation of $A$, i.e. $P$ up to diagonal entries: let $\tilde P_{ij} = P_{ij}$ for $i \neq j$ and $\tilde P_{ii} = B_{kk}$ for $i \in C_k$. Note that $\tilde P$ is blockwise constant, and the assortativity condition on $\tilde P$ translates naturally to the usual assortativity condition on $B$. As the output matrix $\tilde X$ of SDP-1 may not necessarily be a clustering matrix, we use spectral clustering on $\tilde X$ to get the membership matrix $\hat Z$ required in Algorithm 1. SDP-1 together with spectral clustering is used as $\mathcal{A}$.

In Proposition 13 of the Appendix, we show that SDP-1 is strongly consistent when applied to a general strongly assortative SBM with known $r$, as long as $\lambda$ satisfies
$$\max_{k \neq l} B_{kl} + \Omega\big(\sqrt{\rho\log n/(n\pi_{\min})}\big) \;\leq\; \lambda \;\leq\; \min_k B_{kk} + O\big(\sqrt{\rho\log n/(n\pi_{\min})}\big). \qquad (3)$$
An empirical way of choosing $\lambda$ was provided in Cai et al. (2015), which we compare with in Section 5. We first show a result complementary to Eq (3) under an SBM with weakly assortative $B$: for a specific region of $\lambda$, the normalized clustering matrix from SDP-1 will merge two clusters with high probability. This highlights the importance of selecting an appropriate $\lambda$, since different values can lead to drastically different clustering results. The detailed statement and proof can be found in Proposition 12 of the Appendix Section A.2. When we use Algorithm 1 to tune $\lambda$ for $\mathcal{A}$, we have the following theoretical guarantee.
Corollary 2. Consider $A \sim \mathrm{SBM}(B, Z_0)$ with weakly assortative $B$ and $r$ communities. Denote $\tau := n\pi_{\min} \min_k (B_{kk} - \max_{\ell \neq k} B_{k\ell})$. If $\epsilon = o_P(\tau)$, $r\sqrt{n\rho} = o(\tau)$ and $n\rho \geq c\log n$ for some constant $c > 0$, then as long as there exists $\lambda \in \{\lambda_1, \ldots, \lambda_T\}$ such that $\langle \hat X_\lambda, A \rangle \geq \langle X_0, P \rangle - \epsilon$, Algorithm 1 (MATR) with $\hat S = A$ will output a $\hat Z_{\lambda^*}$ such that $\| \hat X_{\lambda^*} - X_0 \|_F = o_P(1)$, where $\hat X_{\lambda^*}, X_0$ are the normalized clustering matrices for $\hat Z_{\lambda^*}, Z_0$ respectively.
Remark 3. 1. Since $\lambda \in [0, 1]$, to ensure that the range of $\lambda$ considered overlaps with the optimal range in (3), it suffices to consider $\lambda$ choices from $[0, 1]$. Then for $\lambda$ satisfying Eq (3), SDP-1 produces $\tilde X = X_0$ w.h.p. if $B$ is strongly assortative. Since $\langle X_0, R \rangle = O_P(r\sqrt{n\rho})$, we can take $\epsilon = O(r\sqrt{n\rho})$, and the conditions in this corollary reduce to $r\sqrt{n\rho}/(n\rho\pi_{\min}) \to 0$. Suppose all the communities are of comparable sizes, i.e. $\pi_{\min} = \Theta(1/r)$; then the condition becomes $r^2/\sqrt{n\rho} \to 0$, which is mild since $n\rho \to \infty$.
2. Since the proofs of Theorem 1 and Corollary 2 are general, the conclusion is not limited to SDP-1 and applies to more general community detection algorithms for SBM when $r$ is known. It is easy to see that a sufficient condition for the consistency of $\hat X_{\lambda^*}$ to hold is that there exists $\lambda$ in the range considered such that $|\langle \hat X_\lambda - X_0, P \rangle| = o_P(\tau)$.
3. We note that the specific application of Corollary 2 to SDP-1 yields weak consistency of $\hat X_{\lambda^*}$ instead of the strong consistency originally proved for SDP-1. This is partly due to the generality of the theorem (including the relaxation of strong assortativity on $B$ to weak assortativity) as discussed above, and the fact that we are estimating $\lambda$.

Tuning the bandwidth parameter for sub-gaussian mixtures

In this case, the data $\mathcal{D}$ is $Y$ defined in Eq (2), and the clustering algorithm $\mathcal{A}$ is spectral clustering (see the motivating example in Section 3) on the Gaussian kernel $K(i,j) = \exp\big(-\|Y_i - Y_j\|^2/\theta\big)$. Note that one could use the kernel itself as the similarity matrix; however, this makes the trace criterion a function of the hyperparameter we are trying to tune, which compounds the difficulty of the problem. For simplicity, we use the negative squared distance matrix as $\hat S$, i.e. $\hat S_{ij} = -\|Y_i - Y_j\|^2$. The natural choice for $S$ would be the conditional expectation of $\hat S$ given the cluster memberships, which is blockwise constant, as in the case of SBMs. However, in this case the convergence behavior is different from that of blockmodels, and this choice leads to a suboptimal error rate. Therefore we use a slightly corrected variant of the matrix as $S$ (see also Mixon et al. (2017)), called the reference matrix:
$$S_{ij} = \Big( -d_{ab}^2 - \max\Big\{0,\ \frac{2}{d_{ab}}(W_i - W_j)^T(\mu_a - \mu_b)\Big\} \Big)\, \mathbf{1}(i \in C_a,\ j \in C_b), \qquad (4)$$
where $d_{ab} := \|\mu_a - \mu_b\|$ and $W_i$ is defined in Eq (2). Note that for $i, j$ in the same cluster, $S_{ij} = 0$. Interestingly, this reference matrix is itself random, which is a deviation from the $S$ used for network models. For MATR applied to select $\theta$, we have the theoretical guarantee stated in Corollary 4 below; a code sketch of this setup is given first.
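The sketch below (our illustration, built on scikit-learn's SpectralClustering; the candidate grid in the usage comment is arbitrary) shows the pieces MATR needs here: the Gaussian kernel with bandwidth $\theta$ as the affinity, and the negative squared distance matrix as $\hat S$.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform
from sklearn.cluster import SpectralClustering

def spectral_labels(Y, theta, r):
    """Spectral clustering on the Gaussian kernel K_ij = exp(-||Y_i - Y_j||^2 / theta)."""
    D2 = squareform(pdist(Y, "sqeuclidean"))
    K = np.exp(-D2 / theta)
    return SpectralClustering(n_clusters=r, affinity="precomputed").fit_predict(K)

# MATR for the bandwidth: S_hat is the negative squared distance matrix,
# and matr() is the sketch from Section 3 above.
# S_hat = -squareform(pdist(Y, "sqeuclidean"))
# thetas = np.linspace(0.1, 10.0, 20)
# labels = matr(spectral_labels, Y, S_hat, thetas, r)
```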
Corollary 4. Let $\hat S$ be the negative squared distance matrix, and let $S$ be defined as in Eq (4). Let $\delta_{\text{sep}}$ denote the minimum distance between cluster centers, i.e. $\min_{k \neq \ell} \|\mu_k - \mu_\ell\|$. Denote $\tau := n\pi_{\min}\delta_{\text{sep}}^2/2$ and $\alpha = \pi_{\max}/\pi_{\min}$. As long as there exists $\theta \in \{\theta_1, \ldots, \theta_T\}$ such that $\langle \hat X_\theta, \hat S \rangle \geq \langle X_0, S \rangle - n\pi_{\min}\epsilon$, Algorithm 1 (MATR) will output a $\hat Z_{\theta^*}$ such that w.h.p.
$$\| \hat X_{\theta^*} - X_0 \|_F^2 \;\leq\; \frac{C\big(\epsilon + r\alpha\sigma_{\max}^2(\alpha + \min\{r, d\})\big)}{\delta_{\text{sep}}^2},$$
where $\sigma_{\max}^2$ is the largest operator norm of the covariance matrices of the mixture components, $\hat X_{\theta^*}$ is the normalized clustering matrix for $\hat Z_{\theta^*}$, and $C$ is a universal constant.
Remark 5. Note that, similar to SBMs, in this setting $\epsilon$ has to be much smaller than $\delta_{\text{sep}}^2$ in order to guarantee small error. This will happen if the spectral clustering algorithm is supplied with an appropriate bandwidth parameter that leads to small error in estimating $X_0$ (see for example Srivastava et al. (2019)). This is satisfied by the condition on $\theta \in \{\theta_1, \ldots, \theta_T\}$ in Corollary 4.

4 Model selection with unknown r

In this section, we adapt MATR to situations where the number of clusters is unknown, in order to perform model selection. Similar to Section 3, we first explain the general tuning algorithm and state a general theorem guaranteeing its performance; applications to specific models are discussed in the following subsections. Since the applications we focus on are network models, we present our algorithm with the data $\mathcal{D}$ being $A$ for clarity.

We show that MATR can be extended to model selection if we incorporate a cross-validation (CV) procedure. In Algorithm 2, we present the general MATR-CV algorithm, which takes a clustering algorithm $\mathcal{A}$, adjacency matrix $A$, and similarity matrix $\hat S$ as inputs. Compared with MATR, MATR-CV has two additional parts. The first part (Algorithm 3) splits the nodes into two subsets for training and testing. This in turn partitions the adjacency matrix $A$ into the submatrices $A_{11}$, $A_{22}$, $A_{21}$ and its transpose, and similarly for $\hat S$. MATR-CV makes use of all the submatrices: $A_{11}$ for training, $A_{22}$ for testing, and $A_{21}$ for estimating the clustering result for the nodes in $A_{22}$, as shown in Algorithm 4, which is the second additional part. Algorithm 4 clusters the testing nodes based on the training nodes' cluster memberships estimated from $A_{11}$ and the connections between training and testing nodes $A_{21}$.
Algorithm 2: MATR-CV
Input: clustering algorithm $\mathcal{A}$, adjacency matrix $A$, similarity matrix $\hat S$, candidates $\{r_1, \cdots, r_T\}$, number of repetitions $J$, training ratio $\gamma_{\text{train}}$, trace gap $\Delta$;
for $j = 1:J$ do
    $A_{11}, A_{21}, A_{22} \leftarrow$ NodeSplitting($A$, $n$, $\gamma_{\text{train}}$), and $\hat S_{11}, \hat S_{21}, \hat S_{22} \leftarrow$ the corresponding submatrices of $\hat S$ under the same node split;
    for $t = 1:T$ do
        $\hat Z_1 = \mathcal{A}(A_{11}, r_t)$;
        $\hat Z_2 =$ ClusterTest($A_{21}$, $\hat Z_1$);
        $\hat X_2 = \hat Z_2(\hat Z_2^T \hat Z_2)^{-1}\hat Z_2^T$;
        $l_{r_t, j} = \langle \hat S_{22}, \hat X_2 \rangle$;
    end for
    $r^*_j = \min\{ r_t : l_{r_t, j} \geq \max_t l_{r_t, j} - \Delta \}$;
end for
$\hat r = \mathrm{median}\{ r^*_j \}$;
Output: $\hat r$
Algorithm 3: NodeSplitting
Input: $A$, $n$, $\gamma_{\text{train}}$;
Randomly split $[n]$ into $Q_1, Q_2$ of sizes $n\gamma_{\text{train}}$ and $n(1 - \gamma_{\text{train}})$;
$A_{11} \leftarrow A_{Q_1, Q_1}$, $A_{21} \leftarrow A_{Q_2, Q_1}$, $A_{22} \leftarrow A_{Q_2, Q_2}$;
Output: $A_{11}, A_{21}, A_{22}$
Algorithm 4: ClusterTest
Input: $A_{21} \in \{0,1\}^{n_2 \times m}$, $\hat Z_1 \in \{0,1\}^{m \times k}$;
$M \leftarrow A_{21}\hat Z_1(\hat Z_1^T \hat Z_1)^{-1}$;
for $i = 1:n_2$ do
    $\hat Z_2(i, \arg\max M(i, :)) = 1$;
end for
Output: $\hat Z_2$

For each node in the testing set, using the estimated membership $\hat Z_1$, the corresponding row of $M$ counts the number of connections it has with training nodes belonging to each cluster and normalizes the counts by the cluster sizes. The estimated membership $\hat Z_2$ is then determined by a majority vote. For now we still assume $B$ is weakly assortative, so a majority vote is reasonable. When we later extend to more general network structures in Section 4.2, we will also show how Algorithm 4 can be generalized. Like other CV procedures, MATR-CV requires specifying a training ratio $\gamma_{\text{train}}$ and the number of repetitions $J$. Choosing any $\gamma_{\text{train}} = \Theta(1)$ does not affect our asymptotic results. Repetitions of splits are used empirically to enhance stability; theoretically, we show asymptotic consistency for any random split. The general theoretical guarantee and the role of the trace gap $\Delta$ are given in Theorem 6 below, after a short code sketch of the splitting and test-node assignment steps.
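The sketch below is our illustration of Algorithms 2-4 with a single repetition for brevity; `algorithm(A, r)` is a stand-in for $\mathcal{A}$ and is assumed to return a training membership matrix.

```python
import numpy as np

def node_split(n, gamma_train, rng):
    perm = rng.permutation(n)
    n1 = int(n * gamma_train)
    return perm[:n1], perm[n1:]                    # training / testing node indices

def cluster_test(A21, Z1):
    """Assign each test node to the training cluster it connects to most densely."""
    M = A21 @ Z1 @ np.linalg.pinv(Z1.T @ Z1)       # connection counts normalized by cluster size
    labels2 = M.argmax(axis=1)
    Z2 = np.zeros((A21.shape[0], Z1.shape[1]))
    Z2[np.arange(A21.shape[0]), labels2] = 1
    return Z2

def matr_cv(algorithm, A, S_hat, candidates, gamma_train=0.7, delta=0.0, seed=0):
    rng = np.random.default_rng(seed)
    train, test = node_split(A.shape[0], gamma_train, rng)
    scores = []
    for r in candidates:
        Z1 = algorithm(A[np.ix_(train, train)], r)          # m x r training memberships
        Z2 = cluster_test(A[np.ix_(test, train)], Z1)
        X2 = Z2 @ np.linalg.pinv(Z2.T @ Z2) @ Z2.T
        scores.append(np.sum(S_hat[np.ix_(test, test)] * X2))
    scores = np.array(scores)
    # smallest candidate whose trace is within delta of the maximum
    return min(r for r, s in zip(candidates, scores) if s >= scores.max() - delta)
```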
Theorem 6. Given a candidate set of cluster numbers $\{r_1, \ldots, r_T\}$ containing the true number of clusters $r$, let $\hat X_{r_t}$ be the normalized clustering matrix obtained from $r_t$ clusters, as described in MATR-CV, and let $X_2$ be the true normalized clustering matrix of the testing nodes. Assume the following hold:
(i) with probability at least $1 - \delta_{\text{under}}$, $\max_{r_t < r} \langle \hat S_{22}, \hat X_{r_t} \rangle \leq \langle \hat S_{22}, X_2 \rangle - \epsilon_{\text{under}}$;
(ii) with probability at least $1 - \delta_{\text{over}}$, $\max_{r_T \geq r_t > r} \langle \hat S_{22}, \hat X_{r_t} \rangle \leq \langle \hat S_{22}, X_2 \rangle + \epsilon_{\text{over}}$;
(iii) with probability at least $1 - \delta_{\text{est}}$, $\langle \hat S_{22}, \hat X_r \rangle \geq \langle \hat S_{22}, X_2 \rangle - \epsilon_{\text{est}}$.
Then, provided $\epsilon_{\text{est}} + \epsilon_{\text{over}} \leq \Delta < \epsilon_{\text{under}} - \epsilon_{\text{est}}$, MATR-CV returns $\hat r = r$ with probability at least $1 - \delta_{\text{under}} - \delta_{\text{over}} - \delta_{\text{est}}$.
Remark 7.
1. MATR-CV is also compatible with tuning multiple hyperparameters. For example, for SDP-1, if the number of clusters is unknown, then for each candidate $\hat r$ we can run MATR to find the best $\lambda$ for the given $\hat r$, followed by a second level of MATR-CV to find the best $\hat r$. As long as the conditions in Theorems 1 and 6 are met, $\hat r$ and the clustering matrix returned will be consistent.
2. As will be seen in the applications below, the derivations of $\epsilon_{\text{under}}$ and $\epsilon_{\text{over}}$ are general and only depend on the properties of $\hat S$. On the other hand, $\epsilon_{\text{est}}$ measures the estimation error associated with the algorithm of interest and depends on its performance.

In what follows, we demonstrate that MATR-CV can be applied to do the model selection inherent to an SDP method for SBM, and more general model selection for MMSB. While we still assume an assortative structure for the former model as required by the SDP method, this constraint is removed for MMSB. Furthermore, we use these two models to illustrate how MATR-CV works both when $\epsilon_{\text{est}}$ is zero (SBM) and nonzero (MMSB).
Model selection for SBM with SDP-2

We consider the SDP algorithm introduced in Peng & Wei (2007); Yan et al. (2017), shown in SDP-2-$\lambda$, for community detection in SBM. Here $X$ is a normalized clustering matrix, and in the case of exact recovery $\mathrm{trace}(X)$ equals the number of clusters. In this way, $r$ is implicitly chosen through $\lambda$, and hence most of the existing model selection methods with consistency guarantees do not apply directly. Yan et al. (2017) proposed to recover the clustering and $r$ simultaneously; however, $\lambda$ still needs to be empirically selected first. We provide a systematic way to do this.
$$\max_X \ \mathrm{trace}(AX) - \lambda\,\mathrm{trace}(X) \quad \text{s.t.} \quad X \succeq 0,\ X \geq 0,\ X\mathbf{1} = \mathbf{1}. \tag{SDP-2-$\lambda$}$$
We consider applying MATR-CV to an alternative form of SDP-2-$\lambda$, shown in SDP-2, where the cluster number $r'$ appears explicitly in the constraints and is part of the input. SDP-2 returns an estimated normalized clustering matrix, to which we apply spectral clustering to compute the cluster memberships. We name this algorithm $\mathcal{A}_{\text{SDP-2}}$. In this case, we use $A$ as $\hat S$, so $P$ is the population similarity matrix.
$$\max_X \ \mathrm{trace}(AX) \quad \text{s.t.} \quad X \succeq 0,\ X \geq 0,\ \mathrm{trace}(X) = r',\ X\mathbf{1} = \mathbf{1}. \tag{SDP-2}$$
We have the following result (Theorem 8, stated after a short solver sketch) ensuring that MATR-CV returns a consistent cluster number.
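For concreteness, a minimal CVXPY sketch of SDP-2 is given below; this is our illustration of the formulation above, not the authors' implementation, and it carries the same caveats on problem size as the earlier SDP-1 sketch.

```python
import cvxpy as cp

def sdp2(A, r_prime):
    """Solve SDP-2: max <A, X> s.t. X PSD, X >= 0, trace(X) = r', X 1 = 1."""
    n = A.shape[0]
    X = cp.Variable((n, n), PSD=True)
    constraints = [X >= 0, cp.trace(X) == r_prime, cp.sum(X, axis=1) == 1]
    cp.Problem(cp.Maximize(cp.trace(A @ X)), constraints).solve()
    return X.value
```

Spectral clustering on the returned matrix then yields the membership matrix used inside MATR-CV.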
Theorem 8. Suppose $A$ is generated from an SBM with $r$ clusters and a weakly assortative $B$. Assume $r$ is fixed, $\pi_{\min} \geq \delta > 0$ for some constant $\delta$, and $n\rho/\log n \to \infty$. Given a candidate set $\{r_1, \ldots, r_T\}$ containing the true cluster number $r$ with $r_T = \Theta(r)$, with high probability for $n$ large, MATR-CV returns the true number of clusters with $\Delta = (1 + B_{\max})\sqrt{r_{\max}\log n} + B_{\max} r_{\max}$, where $r_{\max} := \arg\max_{r_t} \langle A_{22}, \hat X_{r_t} \rangle$.

Proof sketch. We provide a sketch of the proof here; the details can be found in the Appendix Section B.2. We derive the three errors in Theorem 6. In this case, we show that w.h.p.
$$\epsilon_{\text{under}} = \Omega(n\,p_{\text{gap}}\,\pi_{\min}/r), \qquad \epsilon_{\text{over}} = (1 + B_{\max})\sqrt{r_T\log n} + B_{\max} r,$$
and MATR-CV achieves exact recovery when given the true $r$, that is, $\epsilon_{\text{est}} = 0$. Since $\epsilon_{\text{under}} \gg \epsilon_{\text{over}}$ under the conditions of the theorem, by Theorem 6, taking $\Delta = \epsilon_{\text{over}}$ MATR-CV returns the correct $r$ w.h.p. Furthermore, we can remove the dependence of $\Delta$ on the unknown $r$ by noting that $r_{\max} := \arg\max_{r_t} \langle A_{22}, \hat X_{r_t} \rangle \geq r$ w.h.p.; it then suffices to consider the candidate range $\{r_1, \ldots, r_{\max}\}$, and $r_T$ and $r$ in $\Delta$ can be replaced with $r_{\max}$.
Remark 9. 1. Although we have assumed fixed $r$, it is easy to see from the orders of $\epsilon_{\text{under}}$ and $\epsilon_{\text{over}}$ that the theorem also holds with growing $r$ satisfying $r^2/n \to 0$ and $r^{2.5}\sqrt{\log n}/(n\rho) \to 0$, if we let $\pi_{\min} = \Omega(1/r)$ for clarity. Many other existing works on SBM model selection assume fixed $r$. Lei et al. (2016) allowed $r$ to grow polynomially in $n$ at a slow rate. Hu et al. (2017) allowed $r$ to grow linearly up to a logarithmic factor, but at the cost of keeping $\rho$ fixed.
2. Asymptotically, $\Delta$ is equivalent to $\Delta_{\text{SDP-2}} := \sqrt{r_{\max}\log n}$. We will use $\Delta_{\text{SDP-2}}$ in practice when $r$ is fixed.

Model selection for MMSB

In this section, we consider model selection for the MMSB model introduced in Section 2.2 with a soft membership matrix $\Theta$, which is more general than the SBM. As an example of an estimation algorithm, we consider the SPACL algorithm proposed by Mao et al. (2017), which gives consistent parameter estimation when given the correct $r$. As mentioned in Section 2.2, a normalized clustering matrix in this case is defined analogously as $X = \Theta(\Theta^T\Theta)^{-1}\Theta^T$ for any $\Theta$. $X$ is still a projection matrix, and $X\mathbf{1}_n = \Theta(\Theta^T\Theta)^{-1}\Theta^T\mathbf{1}_n = \Theta(\Theta^T\Theta)^{-1}\Theta^T\Theta\mathbf{1}_r = \mathbf{1}_n$, since $\Theta\mathbf{1}_r = \mathbf{1}_n$. Following Mao et al. (2017), we consider a Bayesian setting for $\Theta$: each row $\Theta_i \sim \mathrm{Dirichlet}(\alpha)$, $\alpha \in \mathbb{R}^r_+$. We assume $r$ and $\alpha$ are fixed constants. Note that the Bayesian setting here is only for convenience and can be replaced with equivalent assumptions bounding the eigenvalues of $\Theta^T\Theta$. We also assume there is at least one pure node for each of the $r$ communities, for consistent estimation at the correct $r$.

MATR-CV can be applied to the MMSB model with a few modifications. (i) Replace all $\hat Z_1$ by $\hat\Theta_1$, the estimated soft memberships from the training graph. (ii) We take $\hat S = A - \mathrm{diag}(A)$, $S = P - \mathrm{diag}(P)$. This allows us to remove the assortativity requirement on $P$ and replace it with a full rank condition on $B$, which is commonly assumed in the MMSB literature. The fact that $P$ is always positive semi-definite will be used in the proof. The removal of $\mathrm{diag}(A)$ and $\mathrm{diag}(P)$ leads to better concentration, since $\mathrm{diag}(A)$ is centered around a different mean. (iii) We change Algorithm 4 to estimate $\hat\Theta_2$. Note that $P_{21} = \Theta_2 B \Theta_1^T$; thus we can view the estimation of $\Theta_2$ as a regression problem with plug-in estimators of $\Theta_1$ and $B$. In Algorithm 4, we use an estimate of the form $\hat\Theta_2 = A_{21}\hat\Theta_1(\hat\Theta_1^T\hat\Theta_1)^{-1}\hat B^{-1}$, where $\hat B$ and $\hat\Theta_1$ are estimated from $A_{11}$. We have the following consistency guarantee for $\hat r$ returned by MATR-CV (Theorem 10 below); the modified test-node step is sketched first.
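The modified test-node step is a single regression; below is a NumPy sketch of it (our illustration; `Theta1_hat` and `B_hat` are assumed to come from fitting the training block, e.g. with SPACL, and the clipping and row renormalization are practical additions of ours rather than part of the paper's analysis).

```python
import numpy as np

def mmsb_cluster_test(A21, Theta1_hat, B_hat):
    """Estimate soft memberships of test nodes via
    Theta2_hat = A21 Theta1_hat (Theta1_hat^T Theta1_hat)^{-1} B_hat^{-1}."""
    Theta2 = A21 @ Theta1_hat @ np.linalg.inv(Theta1_hat.T @ Theta1_hat) @ np.linalg.inv(B_hat)
    Theta2 = np.clip(Theta2, 0, None)                        # clip negatives caused by noise (our addition)
    Theta2 /= np.maximum(Theta2.sum(axis=1, keepdims=True), 1e-12)  # project rows back onto the simplex
    return Theta2
```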
Theorem 10. Let $A$ be generated from an MMSB model (see Section 2.2) satisfying $\lambda^*(B) = \Omega(\rho)$, where $\lambda^*(B)$ is the smallest singular value of $B$. We assume $\sqrt{n}\rho/(\log n)^{\xi} \to \infty$ for some arbitrarily small $\xi > 0$. Given a candidate set $\{r_1, \ldots, r_T\}$ containing $r$ with $r_T = \Theta(1)$, with high probability for large $n$, MATR-CV returns the true cluster number $r$ if $\Delta = O\big((n\rho)/(\log n)^{0.3}\big)$.

Proof sketch. We first show that w.h.p. the underfitting and overfitting errors in Theorem 6 satisfy $\epsilon_{\text{under}} = \Omega(n\rho)$ and $\epsilon_{\text{over}} = O(\sqrt{n\rho\log n})$. To obtain $\epsilon_{\text{est}}$, we show that, given the true cluster number, the convergence rate of the parameter estimates for the testing nodes obtained from the regression step is the same as the convergence rate for the training nodes; this yields an $\epsilon_{\text{est}}$ of order $(n\rho)/(\log n)^{0.3}$, which dominates $\epsilon_{\text{over}}$ and is of strictly smaller order than $\epsilon_{\text{under}}$. Choosing $\Delta$ as in the statement then satisfies the conditions of Theorem 6. For details, see Section B.3 of the supplement.
Remark 11. 1. Compared with Fan et al. (2019) and Han et al. (2019), which consider the more general degree-corrected MMSB model, our consistency result holds for $\rho \to 0$ at a faster rate.
2. A practical note: since the constant in the estimation error is tedious to determine, in this case we only know the asymptotic order of the gap $\Delta$. As has been observed for many other methods based on asymptotic properties (e.g. Bickel & Sarkar (2016); Lei et al. (2016); Wang & Bickel (2017); Hu et al. (2017)), performing an adjustment for finite samples often improves the empirical performance. In practice we find that if the constant factor in $\Delta$ is too large, then we tend to underfit. To guard against this, we note that at the correct $r$, the trace difference $\delta_{r, r-1} := \langle \hat S, \hat X_r \rangle - \langle \hat S, \hat X_{r-1} \rangle$ should be much larger than $\Delta$. We start with $\Delta = (n\rho)/(\log n)^{0.3}$ and find $\hat r$ by Algorithm 2; if $\delta_{\hat r, \hat r - 1}$ is smaller than $\Delta$, we reduce $\Delta$ by half and repeat the step of finding $r^*_j$ in Algorithm 2 until $\delta_{\hat r, \hat r - 1} > \Delta$. This adjustment is much more computationally efficient than bootstrap corrections and works well empirically.

5 Experiments

In this section, we present extensive numerical results on simulated and real data obtained by applying MATR and MATR-CV to the settings considered in Sections 3 and 4.
Tuning λ in SDP-1 for SBM

We apply MATR to tune $\lambda$ in SDP-1 when $r$ is known. Since $\lambda \in [0, 1]$ for SDP-1, we choose $\lambda$ from an evenly spaced grid of candidate values in $[0, 1]$ in all the examples. For comparison we choose two existing data driven methods. The first method (CL, Cai et al. (2015)) sets $\lambda$ as the mean connectivity density of a subgraph determined by nodes with "moderate" degrees. The second is ECV (Li et al. (2016)), which uses CV with edge sampling to select the $\lambda$ giving the smallest loss on the test edges from a model estimated on the training edges. We use a training ratio of 0.9 and the $L_2$ loss throughout.

Simulated data. We consider a strongly assortative SBM, as required by SDP-1, with both equal sized and unequal sized clusters. The details of the experimental setting can be found in the Appendix Section C, and standard deviations are calculated over repeated random runs of each parameter setting. We present NMI comparisons for the equal sized SBM ($n = 400$, $r = 4$) in Figure 2(A), and the unequal sized SBM (two clusters with 100 nodes and two with 50) in Figure 2(B). In both, MATR outperforms the others by a large margin as the degree grows.

Figure 2: Comparison of NMI for tuning $\lambda$ in SDP-1 for (A) equal and (B) unequal sized SBMs; comparison of NMI for tuning the bandwidth in spectral clustering for mixture models with (C) equal and (D) unequal mixing coefficients.

Real data.
We also compare MATR with ECV and CL on three real datasets: the football dataset (Girvan & Newman, 2002), the political books dataset, and the political blogs dataset (Adamic & Glance, 2005). All of them are binary networks. In the football dataset, the nodes represent teams and an edge is drawn between two teams if any regular-season games were played between them; there are 12 clusters, each representing a conference, and games are more frequent between teams in the same conference. In the political blogs dataset, the nodes are weblogs and the edges are hyperlinks between blogs; it has 2 clusters based on political inclination, "liberal" and "conservative". In the political books dataset, the nodes represent books and edges indicate co-purchasing on Amazon; the 3 clusters represent categories based on manual labeling of the content, "liberal", "neutral" and "conservative". The clustering performance of each method is evaluated by NMI and shown in Table 1a. MATR performs the best of the three methods on the football dataset and is tied with ECV on the political books dataset. MATR is not as good as CL on the political blogs dataset, but still outperforms ECV.

(a) NMI with tuned λ on SBM
                    MATR    ECV     CL
  Football          0.924   0.895   0.883
  Political blogs   0.258   0.142   0.423
  Political books   0.549   0.549   0.525

(b) Model selection with SBM
                    Truth   MATR-CV   ECV   BH
  Football          12      12        10    10
  Polblogs          2       6         1     8
  Polbooks          3       6         2     4
Table 1: Results obtained on real networks
Tuning the bandwidth parameter in spectral clustering

We use MATR to select the bandwidth parameter $\theta$ in spectral clustering applied to mixture data when given the correct number of clusters. In all the examples, our candidate set for $\theta$ is an evenly spaced grid of the form $\{t\alpha/m\}$, $t = 1, \cdots, m$, where $\alpha = \max_{i,j}\|Y_i - Y_j\|^2$. We compare MATR with three other well-known heuristic methods. The first one was proposed by Shi et al. (2008) (DS): for each data point $Y_i$, a fixed quantile of $\{\|Y_i - Y_j\|^2, j = 1, \ldots, n\}$ is denoted $q_i$, and $\theta$ is then set to a quantile of $\{q_1, \ldots, q_n\}$ normalized by the square root of the corresponding quantile of the $\chi^2_d$ distribution. We also compare with two other methods from Von Luxburg (2007): one based on $k$-nearest neighbors (KNN) and one based on the minimal spanning tree (MST). For KNN, $\theta$ is chosen to be of the order of the mean distance of a point to its $k$-th nearest neighbor, where $k \sim \log(n) + 1$. For MST, $\theta$ is set as the length of the longest edge in a minimal spanning tree of the fully connected graph on the data points.
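For reference, the sketch below (our own; the grid size is an illustrative choice, not the paper's exact setting) computes a candidate grid for $\theta$ together with the KNN- and MST-based heuristics.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform
from scipy.sparse.csgraph import minimum_spanning_tree

def bandwidth_candidates(Y, m=50):
    """Evenly spaced grid (t/m) * max_ij ||Y_i - Y_j||^2 for t = 1..m."""
    alpha = pdist(Y, "sqeuclidean").max()
    return alpha * np.arange(1, m + 1) / m

def knn_bandwidth(Y):
    """Mean distance to the k-th nearest neighbor, with k ~ log(n) + 1."""
    D = squareform(pdist(Y))
    k = int(np.log(len(Y)) + 1)
    return np.sort(D, axis=1)[:, k].mean()      # column 0 is the self-distance, so column k is the k-th neighbor

def mst_bandwidth(Y):
    """Length of the longest edge in a minimum spanning tree of the complete graph."""
    D = squareform(pdist(Y))
    return minimum_spanning_tree(D).data.max()
```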
Simulated data. We first conduct experiments on simulated data generated from a 3-component Gaussian mixture with $d = 20$. The means are multiplied by a separation constant which controls the clustering difficulty (the larger, the easier). Detailed descriptions of the parameter settings can be found in Section C of the Appendix. $n = 500$ datapoints are generated from each mixture model, and repeated random runs are used to calculate standard deviations for each parameter setting. In Figures 2(C) and (D) we plot NMI on the Y axis against the separation constant on the X axis for mixture models with equal and unequal mixing coefficients respectively. In all these settings, MATR performs as well as or better than the best among DS, KNN and MST.

Real data.
We also test MATR for tuning $\theta$ on a real dataset, the Optical Recognition of Handwritten Digits Data Set (https://archive.ics.uci.edu/ml/datasets/Optical+Recognition+of+Handwritten+Digits). We use a copy of the test set provided by scikit-learn (Pedregosa et al., 2011), which consists of 1797 instances from 10 classes. We standardize the dataset before clustering. With 10 clusters, comparing the NMI of the resulting clusterings, MATR performs similarly to KNN but outperforms DS and MST. We also visualize and compare the clustering results of the different methods in 2-D using tSNE (Maaten & Hinton, 2008); the visualizations can be found in Section C of the Appendix.

Model selection for SBM

We make comparisons among MATR-CV, the Bethe-Hessian estimator (BH) (Le & Levina, 2015) and ECV (Li et al., 2016). For ECV and MATR-CV, we consider $r \in \{1, \cdots, \sqrt{n}\}$, where $n$ is the number of nodes.

Simulated data.
We simulate networks from a strongly assortative SBM with equal and unequal sized blocks (detailed in Section C of the Appendix). In Figure 3, we show NMI on the Y axis against the average degree on the X axis: Figures 3(a) and (b) respectively consider the equal sized and unequal sized networks. In all cases, MATR-CV has the highest NMI. A table with the median number of clusters selected by each method can be found in Section C of the Appendix.

Real data.
The same set of methods is also compared on the three real datasets: the football dataset, the political blogs dataset and the political books dataset. The results are shown in Table 1b, where MATR-CV recovers the ground truth number of clusters for the football dataset.
Figure 3: Comparison of NMI for model selection in the (a) equal sized and (b) unequal sized cases.
Model selection for MMSB

We compare MATR-CV with Universal Singular Value Thresholding (USVT) (Chatterjee et al., 2015), ECV (Li et al., 2016) and SIMPLE (Fan et al., 2019) for model selection under the MMSB. For ECV and MATR-CV, we consider the candidate set $r \in \{1, 2, \cdots, \lfloor\hat\rho n\rfloor\}$, where $\hat\rho$ is the observed edge density.

Simulated data. We first apply all the methods to simulated data. We consider $B = \rho \times \{(p - q)I_r + qE_r\}$. Following Mao et al. (2018), we sample $\Theta_i \sim \mathrm{Dirichlet}(\alpha)$ with $\alpha = \mathbf{1}_r/r$. We generate networks with $n = 2000$ nodes for $r = 4$ and $r = 8$ respectively, setting $p = 1$ and a fixed $q < p$ in each case, over a range of $\rho$. In Tables 2a and 2b, we report the fraction of runs (out of 40) in which each method exactly recovers the true cluster number $r$, across different average degrees. We observe that in both the $r = 4$ and $r = 8$ cases, MATR-CV outperforms the other three methods by a large margin on sparse graphs. The method SIMPLE consistently underfits in our sparsity regime, which is understandable since its theoretical guarantees hold in a denser degree regime.

Table 2: Results of MMSB on synthetic data; (a) exact recovery fractions for r = 4 clusters, (b) exact recovery fractions for r = 8 clusters.

Real data. We also test MATR-CV with MMSB on a real network, the political books network, which contains 3 clusters. Here fitting an MMSB model is reasonable since each book can have mixed political inclinations, e.g. a "conservative" book may in fact be mixed between "neutral" and "conservative". With MATR-CV we recover 3 clusters, while USVT, ECV and SIMPLE all select fewer clusters.

6 Discussion

Clustering data, in both i.i.d. and network structured settings, has received a lot of attention from applied and theoretical communities. However, methods for tuning the hyperparameters involved in clustering problems are mostly heuristic. In this paper, we present MATR, a provable MAx-TRace based hyperparameter tuning framework for general clustering problems. We prove the effectiveness of this framework for tuning SDP relaxations for community detection under the blockmodel and for learning the bandwidth parameter of the Gaussian kernel in spectral clustering over a mixture of subgaussians. Our framework can also be used for model selection via a cross validation based extension (MATR-CV), which consistently estimates the number of clusters in blockmodels and mixed membership blockmodels. Using a variety of simulated and real experiments, we show the advantage of our method over existing heuristics.

The framework presented in this paper is general and can be applied to model selection or tuning for broader model classes such as degree corrected blockmodels (Karrer & Newman, 2011), since there are many exact recovery based algorithms for estimation in these settings (Chen et al., 2018). We believe our framework can be extended to the broader class of degree corrected mixed membership blockmodels (Jin et al., 2017), which includes the topic model (Mao et al., 2018); however, the derivation of the estimation error $\epsilon_{\text{est}}$ involves tedious derivations of parameter estimation error, which has not been done in existing works. Furthermore, even though our work uses node sampling, we believe the MATR-CV framework can be extended to obtain consistent model selection for other sampling procedures such as edge sampling (Li et al., 2016).
Appendix

This appendix contains detailed proofs of the theoretical results in the main paper "A Unified Framework for Tuning Hyperparameters in Clustering Problems", additional theoretical results, and detailed descriptions of the experimental parameter settings. We present proofs for MATR and MATR-CV in Sections A and B respectively. Section A.2 also contains additional theoretical results on the role of the hyperparameter in merging clusters in SDP-1 and SDP-2. Finally, Section C contains detailed parameter settings for the experimental results in the main paper.

A Additional theoretical results and proofs of results in Section 3

A.1 Proof of Theorem 1

Proof. If for tuning parameter $\lambda$ we have $\langle \hat S, \hat X_\lambda \rangle \geq \langle S, X_0 \rangle - \epsilon$, then
$$\langle S, \hat X_\lambda \rangle \geq \langle S, X_0 \rangle - |\langle \hat S - S, \hat X_\lambda \rangle| - \epsilon. \qquad (5)$$
First we prove that this immediately gives an upper bound on $\|\hat X_\lambda - X_0\|_F$. We remove the subscript $\lambda$ for ease of exposition. Denote $\omega_k = \langle X_0, \hat X_{C_k, C_k} \rangle$, let $\alpha_{ij} = \langle E_{i,j}, \hat X \rangle / (m_k(1 - \omega_k))$ when $\omega_k < 1$ and $0$ otherwise, and write the off-diagonal set for the $k$th cluster as $C_k^c = \{(i,j) \mid i \in C_k, j \notin C_k\}$. Then
$$\langle S, \hat X \rangle = \sum_{k=1}^r a_{kk}\langle E_{C_k, C_k}, \hat X \rangle + \sum_{k=1}^r \sum_{(i,j) \in C_k^c} a_{ij}\langle E_{i,j}, \hat X \rangle = \sum_{k=1}^r a_{kk} m_k \omega_k + \sum_{k=1}^r m_k(1 - \omega_k) \sum_{(i,j) \in C_k^c} a_{ij}\alpha_{ij}$$
$$= \sum_{k=1}^r m_k \omega_k \Big( a_{kk} - \sum_{(i,j) \in C_k^c} a_{ij}\alpha_{ij} \Big) + \sum_{k=1}^r m_k \sum_{(i,j) \in C_k^c} a_{ij}\alpha_{ij}. \qquad (6)$$
Since $\langle S, X_0 \rangle = \sum_k m_k a_{kk}$, by (5) we have $\langle S, \hat X \rangle \geq \sum_k m_k a_{kk} - |\langle R, \hat X \rangle| - \epsilon$, so
$$\sum_k m_k \omega_k \Big( a_{kk} - \sum_{(i,j) \in C_k^c} a_{ij}\alpha_{ij} \Big) + \sum_k m_k \sum_{(i,j) \in C_k^c} a_{ij}\alpha_{ij} \geq \sum_k m_k a_{kk} - |\langle R, \hat X \rangle| - \epsilon.$$
Note that, since $S$ is weakly assortative, $a_{kk} - \sum_{(i,j) \in C_k^c} a_{ij}\alpha_{ij}$ is always positive because $\sum_{(i,j) \in C_k^c} \alpha_{ij} \leq 1$. Denote $\epsilon' = |\langle R, \hat X \rangle| + \epsilon$ and $\beta_k = \frac{m_k(a_{kk} - \sum_{C_k^c} \alpha_{ij} a_{ij})}{\sum_k m_k(a_{kk} - \sum_{C_k^c} \alpha_{ij} a_{ij})}$. Then
$$\sum_k m_k \omega_k \Big( a_{kk} - \sum_{(i,j) \in C_k^c} a_{ij}\alpha_{ij} \Big) \geq \sum_k m_k \Big( a_{kk} - \sum_{(i,j) \in C_k^c} \alpha_{ij} a_{ij} \Big) - \epsilon',$$
$$\sum_k \beta_k \omega_k \geq 1 - \frac{\epsilon'}{\sum_k m_k(a_{kk} - \sum_{C_k^c} \alpha_{ij} a_{ij})}, \qquad \sum_k \beta_k(1 - \omega_k) \leq \frac{\epsilon'}{\sum_k m_k(a_{kk} - \sum_{C_k^c} \alpha_{ij} a_{ij})},$$
$$\sum_k (1 - \omega_k) \leq \sum_k \frac{\beta_k}{\beta_{\min}}(1 - \omega_k) \leq \frac{\epsilon'}{\beta_{\min}\sum_k m_k(a_{kk} - \sum_{C_k^c} \alpha_{ij} a_{ij})},$$
where $\beta_{\min} = \min_k \beta_k$. Since $\mathrm{trace}(\hat X) = \mathrm{trace}(X_0)$,
$$\|\hat X - X_0\|_F^2 = \mathrm{trace}\big((\hat X - X_0)^T(\hat X - X_0)\big) \leq 2\,\mathrm{trace}(X_0) - 2\sum_k \langle X_0, \hat X_{C_k, C_k} \rangle = 2\sum_k (1 - \omega_k) \leq \frac{2\epsilon'}{\min_k m_k(a_{kk} - \sum_{C_k^c} \alpha_{ij} a_{ij})} \leq \frac{2\epsilon'}{n\pi_{\min}\min_k(a_{kk} - \max_{C_k^c} a_{ij})} = \frac{2\epsilon'}{\tau}.$$
Now consider the $\lambda^*$ returned by MATR:
Then, following the above argument and from the condition from the theorem, (cid:107) X λ ∗ − X (cid:107) F ≤ (cid:15) (cid:48) nπ min min k ( a kk − max C ck a ij ) ≤ τ ( (cid:15) + sup X ∈X r |(cid:104) X, R (cid:105)| ) . A.2 Range of λ for merging clusters in SDP-1 Proposition 12. Let ˜ X be the optimal solution of SDP-1 for A ∼ SBM ( B, Z ) with λ satisfying max k (cid:54) = (cid:96) B ∗ k,(cid:96) + Ω( (cid:114) ρ log nnπ min ) ≤ λ ≤ min k B ∗ kk − max k,(cid:96) = r − ,r m (cid:96) n k ( B (cid:96),(cid:96) − B r,r − ) + O ( (cid:115) ρ log nnπ ) , then ˜ X = X ∗ with probability at least − n , where X ∗ is the unnormalized clustering matrix which mergesthe last two clusters, B ∗ is the corresponding ( r − × ( r − block probability matrix. Remark: The proposition implies if the first r − clusters are more connected within each cluster thanthe last two clusters and the connection between first r − clusters and last two clusters are weak, we can finda range for λ that leads to merging the last two clusters with high probability. The results can be generalizedto merging several clusters at one time. The result above highlights the importance of selecting λ as it affectsthe performance of SDP-1 significantly. Proof. We develop sufficient conditions with a contruction of the dual certificate which guarantees X ∗ to bethe optimal solution. The KKT conditions can be written as below:First order stationary: − A − Λ + λE n − diag ( β ) − Γ = 0 Primal feasibility: X (cid:23) , X ≥ , X ii = 1 ∀ i = 1 · · · , n Dual feasibility: Γ ≥ , Λ (cid:23) Complementary slackness (cid:104) Λ , X (cid:105) = 0 , Γ ◦ X = 0 . Consider the following construction: denote T k = C k , n k = m k , for k < r − , T r − = C r − (cid:83) C r , n r − = m r − + m r . X T k = E n k X T k T l = 0 , for k (cid:54) = l ≤ r − T k = − A T k + λE n k − λn k I n k + diag ( A T k n k )Λ T k T l = − A T k ,T l + 1 n l A T k ,T l E n l + 1 n k E n k A T k T l − n l n k E n k A T k ,T l E n l Γ T k = 0Γ T k ,T l = λE n k ,n l − n l A T k ,T l E n l − n k E n k A T k T l + 1 n l n k E n k A T k ,T l E n l β = diag ( − A − Λ + λE n − Γ) All the KKT conditions are satisfied by construction except for positive semidefiniteness of Λ andpositiveness of Γ . Now, we show it one by one. 15 ositive Semidefiniteness of Λ Since span (1 T k ) ⊂ ker (Λ) , it suffices to show that for any u ∈ span (1 T k ) ⊥ , u T Λ u ≥ . Consider u = (cid:80) k u T k , where u T k := u ◦ T k , then u T k ⊥ n k . 
u T Λ u = − (cid:88) k u TT k A T k u T k − λ (cid:88) k n k u TT k u T k + (cid:88) k u TT k diag ( A T k n k ) u T k − (cid:88) k (cid:54) = l u TT k A T k T l u T l = − u T ( A − P ) u T − u T P u − λ (cid:88) k n k u TT k u T k + (cid:88) k u TT k diag ( A T k n k ) u T k = − u T ( A − P ) u − u TT k − P T k − T k − u T k − − λ (cid:88) k n k u TT k u T k + (cid:88) k u TT k diag ( A T k n k ) u T k (7)For the first term, we know u T ( A − P ) u ≤ (cid:107) A − P (cid:107) (cid:107) u (cid:107) ≤ O ( √ nρ ) (cid:107) u (cid:107) with high probability.For the second term, and note that T r − = C r − (cid:83) C r , and P T r − T r − = (cid:20) B r − ,r − E m r − m r − , B r − ,r E m r − m r B r,r − E m r m r − , B r,r E m r m r (cid:21) Since u T r − ⊥ n r − , u TT r − (cid:20) B r − ,r E m r − m r − , B r − ,r E m r − m r B r,r − E m r m r − , B r,r − E m r m r (cid:21) u T r − = 0 , therefore u TT r − P T r − T r − u T r − = u TT r − (cid:20) ( B r − ,r − B r − ,r − ) E m r − m r − , , ( B r − ,r − B r,r ) E m r m r (cid:21) u T r − ≤ max { m r − ( B r − ,r − − B r − ,r ) , m r ( B r,r − B r,r − ) } (cid:107) u (cid:107) (8)Consider the last term (cid:80) k u TT k diag ( A T k n k ) u T k . Using Chernoff, we know || diag ( A T k n k ) || ≥ B ∗ k,k n k − (cid:112) ρn k log n k with high probability, where for k, l < r − , B ∗ kl = B kl ,B ∗ k,r − = m r − B k,r − + m r B k,r m r − + m r ,B ∗ r − ,r − = ( m r − B r − ,r − + 2 ∗ m r m r − B r − ,r + ( m r B r,r )( m r − + m r ) . Therefore, : − λ (cid:88) k n k u TT k u T k + (cid:88) k u TT k diag ( A T k n k ) u T k ≥ min k ( B ∗ k,k n k − Ω( (cid:112) ρn k log n ) − λn k ) (cid:107) u (cid:107) . So with equation 7, a sufficient condition for positive semidefiniteness of Λ is min k ( B ∗ k,k n k − Ω( (cid:112) ρn k log n ) − λn k ) ≥ O ( √ nρ ) + max { m r − ( B r − ,r − − B r − ,r ) , m r ( B r,r − B r,r − ) } which implies, λ ≤ min k B ∗ kk − max k max { m r − n k ( B r − ,r − − B r − ,r ) , m r n k ( B r,r − B r,r − ) } + O ( (cid:112) ρ log n/nπ ) ositiveness of Γ For i ∈ T k , j ∈ T l , we have Γ i,j = λ − (cid:80) m ∈ T l A i,m n l − (cid:80) m ∈ T k A m,j n k + 1 n k n l (cid:88) m ∈ T k ,o ∈ T l A mo . Therefore, block-wise mean of Γ will be E [Γ T k ,T l ] = ( λ − B ∗ k,l ) E n k ,n l , and the variance for each entry belonging to cluster k and l will be in order of O ( ρ/ ( n k n l )) .Using Chernoff bound, we have p ( | Γ i,j − ( λ − B ∗ k,l ) | > λ − B ∗ k,l ) ≤ (cid:20) − n k n l ρ ( λ − B ∗ k,l ) (cid:21) . Therefore, as long as λ ≥ max k (cid:54) = l B ∗ k,l + Ω( (cid:112) ρ log n/nπ min ) , we have p (Γ i,j < ≤ (cid:20) − nπ min log n (cid:21) We then applying the union bound and conclude that Γ T k T l > with a high probability when λ ≥ max k (cid:54) = l B ∗ k,l + Ω( (cid:112) ρ log n/nπ min ) . Proposition 13. As long as max k (cid:54) = l B k,l + Ω( (cid:112) ρ log n/nπ min ) ≤ λ ≤ min k B kk + O ( (cid:112) ρ log n/nπ ) ,SDP-1 exactly recovers X with high probability.Proof. We follow the same primal-dual construction as Proposition 12 without merging the last two clusters.Consider the following construction: denote T k = C k , n k = m k , for k = 1 , ..., r . We show the positivesemidefiniteness and Positiveness of Λ and Γ respectively. Positve Semidefiniteness of Λ Since span (1 T k ) ⊂ ker (Λ) , it suffices to show that for any u ∈ span (1 T k ) ⊥ , u T Λ u ≥ . 
Proposition 13. As long as
$$\max_{k \ne l} B_{k,l} + \Omega\big(\sqrt{\rho \log n/(n\pi_{\min})}\big) \;\le\; \lambda \;\le\; \min_k B_{kk} + O\big(\sqrt{\rho \log n/(n\pi_{\min})}\big),$$
SDP-1 exactly recovers $X_0$ with high probability.

Proof. We follow the same primal-dual construction as in Proposition 12, without merging the last two clusters. Denote $T_k = C_k$, $n_k = m_k$ for $k = 1, \dots, r$. We show the positive semidefiniteness of $\Lambda$ and the positivity of $\Gamma$ respectively.

Positive semidefiniteness of $\Lambda$. Since $\mathrm{span}(\{1_{T_k}\}_k) \subset \ker(\Lambda)$, it suffices to show that $u^T \Lambda u \ge 0$ for any $u \in \mathrm{span}(\{1_{T_k}\}_k)^{\perp}$. Consider $u = \sum_k u_{T_k}$, where $u_{T_k} := u \circ 1_{T_k}$ and $u_{T_k} \perp 1_{n_k}$. We have
$$u^T \Lambda u = -\sum_k u_{T_k}^T A_{T_k} u_{T_k} - \lambda \sum_k n_k \|u_{T_k}\|^2 + \sum_k u_{T_k}^T \mathrm{diag}(A_{T_k} 1_{n_k}) u_{T_k} - \sum_{k \ne l} u_{T_k}^T A_{T_k T_l} u_{T_l}$$
$$= -u^T(A-P)u - u^T P u - \lambda \sum_k n_k \|u_{T_k}\|^2 + \sum_k u_{T_k}^T \mathrm{diag}(A_{T_k} 1_{n_k}) u_{T_k}$$
$$= -u^T(A-P)u - \lambda \sum_k n_k \|u_{T_k}\|^2 + \sum_k u_{T_k}^T \mathrm{diag}(A_{T_k} 1_{n_k}) u_{T_k}. \tag{9}$$
For the first term, $u^T(A-P)u \le \|A-P\|\|u\|^2 \le O(\sqrt{n\rho})\|u\|^2$ with high probability, and using a Chernoff bound every diagonal entry of $\mathrm{diag}(A_{T_k} 1_{n_k})$ is at least $B_{k,k} n_k - \sqrt{\rho n_k \log n_k}$ with high probability. Therefore,
$$-\lambda \sum_k n_k \|u_{T_k}\|^2 + \sum_k u_{T_k}^T \mathrm{diag}(A_{T_k} 1_{n_k}) u_{T_k} \ge \min_k \big( B_{k,k} n_k - \Omega(\sqrt{\rho n_k \log n}) - \lambda n_k \big)\|u\|^2,$$
which implies that a sufficient condition for positive semidefiniteness of $\Lambda$ is $\lambda \le \min_k B_{kk} + O(\sqrt{\rho \log n/(n\pi_{\min})})$. The lower bound is obtained exactly as in Proposition 12: using a Chernoff bound, $\Gamma_{T_k T_l} > 0$ with high probability as long as $\lambda \ge \max_{k \ne l} B_{k,l} + \Omega(\sqrt{\rho \log n/(n\pi_{\min})})$.

A.3 Proof of Corollary 2

Proof. This result follows directly from Theorem 1. We have $S = \tilde{P}$ and $R = (A - P) + (P - \tilde{P})$. For the selected $\lambda$,
$$\langle \hat{X}_{\lambda}, A \rangle \ge \langle X_0, \tilde{P} \rangle - O(r\rho) - \epsilon,$$
where $r\rho = o(\tau)$ since $r\sqrt{n\rho} = o(\tau)$, and for any $\hat{X} \in \mathcal{X}_r$,
$$|\langle A - \tilde{P}, \hat{X} \rangle| \le \|A - P\|_{op}\, \mathrm{trace}(\hat{X}) + O(r\rho) = O_P(r\sqrt{n\rho}).$$
The last inequality follows from Lei & Rinaldo (2015) and $n\rho \ge c \log n$.

A.4 Proof of Corollary 4

Proof. First note that $p_{gap}$ as defined in Eq 1 equals $\delta_{sep}^2/4$, where $\delta_{sep}$ is the minimum Euclidean distance between two cluster centers. Using the argument in Mixon et al. (2017) and Theorem 1 we obtain
$$\|\hat{X}_{\theta^*} - X_0\|_F \le \frac{\epsilon + \sup_{X \in \mathcal{X}_r} |\langle X, \hat{S} - S\rangle|}{n\pi_{\min}\, \delta_{sep}^2} \;\overset{(i)}{\le}\; \frac{C\epsilon + r\alpha\sigma^2\big(\alpha + \min\{r, d\}\big)}{\delta_{sep}^2}.$$
Step $(i)$ holds with probability at least $1 - \eta$, as long as $n \ge \max\{c_1 d,\; c_2 \log(2/\eta),\; \log(c_3/\eta)\}$, using the argument from Theorem 2 in Mixon et al. (2017).

B Additional Theoretical Results and Proofs of Results in Section 4

B.1 Proof of Theorem 6

Proof. With probability greater than $1 - \delta_{est} - \delta_{over} - \delta_{under}$, the following inequalities hold. For $r_T \ge r_t > r$:
$$\langle \hat{S}, \hat{X}_r \rangle \ge \langle \hat{S}, X_0 \rangle - \epsilon_{est} \ge \max_{r_T \ge r_t > r} \langle \hat{S}, \hat{X}_{r_t} \rangle - \epsilon_{est} - \epsilon_{over} \ge \max_{r_T \ge r_t > r} \langle \hat{S}, \hat{X}_{r_t} \rangle - \Delta.$$
For $r_t < r$:
$$\langle \hat{S}, \hat{X}_r \rangle \ge \langle \hat{S}, X_0 \rangle - \epsilon_{est} \ge \max_{r_t < r} \langle \hat{S}, \hat{X}_{r_t} \rangle + \epsilon_{under} - \epsilon_{est} > \max_{r_t < r} \langle \hat{S}, \hat{X}_{r_t} \rangle + \Delta.$$
Hence every underfitting candidate scores more than $\Delta$ below $\langle \hat{S}, \hat{X}_r \rangle$, while $\langle \hat{S}, \hat{X}_r \rangle$ is within $\Delta$ of every overfitting candidate, so MATR-CV returns the true $r$.
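In code, the selection rule analyzed in Theorem 6 amounts to a few lines. The sketch below assumes the held-out scores $\langle \hat{S}, \hat{X}_{r_t}\rangle$ have already been computed for each candidate $r_t$; the function name and the representation of the scores are ours.

```python
def matr_cv_select(test_scores, delta):
    """Selection rule suggested by the proof of Theorem 6 (our paraphrase):
    return the smallest candidate cluster number whose held-out score
    <S_hat, X_hat_r> is within delta of the best candidate score.

    test_scores: dict mapping candidate r -> held-out score (assumed precomputed)
    delta: the slack threshold Delta used in the theorem
    """
    best = max(test_scores.values())
    return min(r for r, score in test_scores.items() if score >= best - delta)
```

Theorems 8 and 10 below supply concrete choices of $\Delta$ for the SBM and MMSB settings respectively.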
Lemma 14. Consider an adjacency matrix $A$ and its population version $P$. Let $X$ be a normalized clustering matrix independent of $A$. Then with probability at least $1 - O(n^{-1})$,
$$|\langle A - P, X \rangle| \le (1 + B_{\max}) \sqrt{\mathrm{trace}(X) \log n},$$
with $B_{\max} = \max_{i,j} B_{ij}$.

Proof. The result follows from Hoeffding's inequality and the fact that $X$ is a projection matrix. By independence between $A$ and $X$, we may condition on $X$ and apply Hoeffding's inequality to $\sum_{i < j} (A_{ij} - P_{ij}) X_{ij}$; since $X$ is a projection matrix, $\sum_{i,j} X_{ij}^2 = \mathrm{trace}(X)$, which yields the stated bound, with the factor $(1 + B_{\max})$ accounting for the diagonal terms.

Lemma 15 bounds the underfitting error for a sequence of underfitting normalized clustering matrices $\{\hat{X}_{r_t}\}_{r_t < r}$, and Lemma 16 bounds the overfitting error for a sequence of overfitting normalized clustering matrices $\{\hat{X}_{r_t}\}_{r_t > r}$; both are applied in the proof of Theorem 8 below. For the overfitting case, first note that for any $\hat{X}$, using weak assortativity of $B$,
$$\langle P, \hat{X} \rangle \le \sum_{i,j} \hat{X}_{i,j} B_{C(i),C(j)} \le \sum_i B_{C(i),C(i)} \sum_j \hat{X}_{i,j} \le \langle P, X_0 \rangle + B_{\max} r, \tag{15}$$
where $C(i)$ denotes the cluster node $i$ belongs to. By the same argument as in Eq (13), w.h.p. $\max_{r_t > r} |\langle A - P, \hat{X}_{r_t} \rangle| \le (1 + B_{\max})\sqrt{r_T \log n}$.

Lemma 17. With probability at least $1 - O(1/n)$, MATR-CV achieves exact recovery on the testing nodes given the true cluster number $r$, i.e. $\hat{X}_r = X_0$, provided $n\pi_{\min}\rho/\log n \to \infty$ and $\gamma_{train} = \Theta(1)$.

Proof. Denote by $m_k^{(1)}$ and $m_k^{(2)}$ the number of nodes belonging to cluster $C_k$ in the training graph and the testing graph respectively. First, by Theorem 2 in Yan et al. (2017) and Lemma 18, SDP-2 achieves exact recovery on the training graph with high probability. Now consider a node $s$ in the testing graph and assume it belongs to cluster $C_k$. The probability that it is assigned to cluster $k$ is
$$P\Big( \frac{\sum_{j \in C_k} A_{s,j}}{m_k^{(1)}} \ge \max_{l \ne k} \frac{\sum_{j \in C_l} A_{s,j}}{m_l^{(1)}} \Big).$$
Using the Chernoff bound, for some constants $c_0, c$,
$$P\Big( \frac{\sum_{j \in C_k} A_{s,j}}{m_k^{(1)}} \ge B_{k,k} - c_0\sqrt{B_{k,k}\log n / m_k^{(1)}} \Big) \ge 1 - n^{-c}; \qquad P\Big( \frac{\sum_{j \in C_l} A_{s,j}}{m_l^{(1)}} \le B_{l,k} + c_0\sqrt{B_{l,k}\log n / m_l^{(1)}} \Big) \ge 1 - n^{-c}. \tag{17}$$
Since the graph split is random, for each $k$, with probability at least $1 - n^{-c}$, $|m_k^{(1)} - \gamma_{train} m_k| \le c_1\sqrt{m_k \log n}$ for some constant $c_1$. By a union bound, this holds for all $k$ with probability at least $1 - rn^{-c}$. Under this event,
$$\sqrt{\frac{B_{l,k}\log n}{m_l^{(1)}}} \le c_2 \sqrt{\frac{B_{l,k}\log n}{n\pi_{\min}}}$$
for some $c_2$, since $n\pi_{\min}/\log n \to \infty$. Since $n\pi_{\min}\rho/\log n \to \infty$, by Eq (17), with probability at least $1 - O(rn^{-c})$,
$$B_{k,k} - c_0\sqrt{B_{k,k}\log n / m_k^{(1)}} > \max_{l \ne k} B_{l,k} + c_0\sqrt{B_{l,k}\log n / m_l^{(1)}},$$
and node $s$ is assigned correctly to cluster $k$. Taking a union bound over all $s$ in the testing set, MATR-CV achieves exact recovery on the testing graph given $r$ with high probability.

Lemma 18. If $m_k \ge \pi n$, then $m_k^{(1)} \ge \pi n \gamma_{train}$ and $m_k^{(2)} \ge \pi n (1 - \gamma_{train})$ with high probability. If $\max_{k,l} m_k/m_l \le \delta$, then $\max_{k,l} m_k^{(1)}/m_l^{(1)} \le \delta + o(1)$ with high probability.

Proof. The result follows from Skala (2013).

Proof for Theorem 8

Proof. First we note that, by a similar argument as in Lemma 17, $|m_k^{(2)} - (1 - \gamma_{train}) m_k| \le c\sqrt{m_k \log n}$ for all $k$ with probability at least $1 - rn^{-c'}$. Then the size of the smallest cluster of the test graph is of the same order as $n\pi_{\min}$, and the test graph has $\Theta(n)$ nodes. The test graph is also independent of any $\hat{X}_{r_t}$ estimated from the training graph. Thus in Theorem 6, applying Lemma 15 and Lemma 16 to the test graph shows
$$\epsilon_{under} = \Omega(n\rho\pi_{\min}/r), \qquad \epsilon_{over} = (1 + B_{\max})\sqrt{r_T \log n} + B_{\max} r,$$
and Lemma 17 shows $\epsilon_{est} = 0$, w.h.p. For fixed $r$ and $\pi_{\min}$, we have $\epsilon_{under} \gg \epsilon_{over}$. By Theorem 6, choosing $\Delta = (1 + B_{\max})\sqrt{r_T \log n} + B_{\max} r$ leads to MATR-CV returning the correct $r$. We can further refine $\Delta$ by noting that $r_{\max} := \arg\max_{r_t} \langle A, \hat{X}_{r_t} \rangle \ge r$ w.h.p.; it then suffices to consider candidate values up to $r_{\max}$, the same arguments hold for this range, and thus $r_T$ and $r$ in $\Delta$ can be replaced with $r_{\max}$.

B.3 Proof of Theorem 10

In the following, we show theoretical guarantees for using MATR-CV to do model selection on MMSB with the SPACL algorithm proposed by Mao et al.
(2017). We assume A has self-loops for clarify of exposition.Adding the diagonal terms introduces a term that is asymptotically negligible compared with other terms,thus does not change our results.First we have the following concentration lemma regarding the criterion (cid:104) A − diag ( A ) , X (cid:105) for a generalnormalized clustering matrix X , which will be used to derive the three errors in Theorem 6. Lemma 19. For any general normalized clustering matrix X satisfying X n = n , and an adjacency matrix A generated from its expectation matrix P independent of X , w.h.p. (cid:104) ˆ S − S, X (cid:105) = O ( nρ (cid:112) log n ) , (18) where ˆ S = A − diag ( A ) , S = P − diag ( P ) .Proof. (cid:104) ˆ S − S, X (cid:105) = (cid:88) j,i (cid:54) = k ( A ij A jk − P ij P jk ) X ik = (cid:88) i,j,k ( A ij A jk − E [ A ij A jk ]) X ik (cid:124) (cid:123)(cid:122) (cid:125) Part (i) − (cid:88) i,j ( A ij − E [ A ij ]) X ij (cid:124) (cid:123)(cid:122) (cid:125) Part (ii) (19)To bound Part (i), we will first bound f ( A ij , ≤ i ≤ j ≤ n ) = (cid:80) i,j,k A ij A jk X ik / . Let f uv := f ( A ij , ≤ i ≤ j ≤ n, A uv = 0) = f ( A ) − A uv (cid:80) k A vk X uk + A uv (cid:80) i A iu X iv , Clearly, ≤ f − f uv ≤ since X n = n , and (cid:88) u For a sequence of underfitting normalized clustering matrix { ˆ X r t } r t Proposition 21. For a sequence of overfitting normalized clustering matrix { ˆ X r t } r These bounds follow directly from Corollary 3.7 of Mao et al. (2017), with α , r, λ ∗ ( B/ρ ) all beingconstant.In what follows, we omit the permutation matrix Π to simplify notation. If Π is not the identity matrix,we can always redefine Θ as ΘΠ , and B as Π T B Π . This would not affect the results, since we want to provebounds on normalized clustering matrices where Π always cancels out, i.e., X = ΘΠ((ΘΠ) T ΘΠ) − (ΘΠ) T =Θ(Θ T Θ) − Θ T . We are interested in bounding the estimation error in ˆ H = ˆ B − ( ˆΘ T ˆΘ) − ˆΘ T . In order to build up thebound, we will make repeated use of the following two facts. Fact 23. For general matrices C, ˆ C, D, ˆ D , (cid:13)(cid:13)(cid:13) ˆ C ˆ D − CD (cid:13)(cid:13)(cid:13) F ≤ (cid:13)(cid:13)(cid:13) ( ˆ C − C )( ˆ D − D ) (cid:13)(cid:13)(cid:13) F + (cid:13)(cid:13)(cid:13) ( ˆ C − C ) D (cid:13)(cid:13)(cid:13) F + (cid:13)(cid:13)(cid:13) C ( ˆ D − D ) (cid:13)(cid:13)(cid:13) F Proof. The proof follows directly from expansion and the triangle inequality, ˆ C ˆ D − CD = ( ˆ C − C )( ˆ D − D ) + ( ˆ C − C ) D + C ( ˆ D − D ) . Fact 24. For a general matrices C , ˆ C , assume (cid:13)(cid:13)(cid:13) ( C − ˆ C ) C − (cid:13)(cid:13)(cid:13) F < , then (cid:13)(cid:13)(cid:13) ˆ C − − C − (cid:13)(cid:13)(cid:13) F ≤ (cid:13)(cid:13) C − (cid:13)(cid:13) F (cid:13)(cid:13)(cid:13) ( C − ˆ C ) C − (cid:13)(cid:13)(cid:13) F − (cid:13)(cid:13)(cid:13) ( C − ˆ C ) C − (cid:13)(cid:13)(cid:13) F Proof. First decompose ˆ C − − C − = ˆ C − CC − − C − = ( ˆ C − C − I ) C − = ˆ C − ( C − ˆ C ) C − . (27)Taking Frobenius norms, (cid:13)(cid:13)(cid:13) ˆ C − (cid:13)(cid:13)(cid:13) F ≤ (cid:13)(cid:13) C − (cid:13)(cid:13) F + (cid:13)(cid:13)(cid:13) ˆ C − ( C − ˆ C ) C − (cid:13)(cid:13)(cid:13) F ≤ (cid:13)(cid:13) C − (cid:13)(cid:13) F + (cid:13)(cid:13)(cid:13) ˆ C − (cid:13)(cid:13)(cid:13) F (cid:13)(cid:13)(cid:13) ( C − ˆ C ) C − (cid:13)(cid:13)(cid:13) F . Rearranging, (cid:13)(cid:13)(cid:13) ˆ C − (cid:13)(cid:13)(cid:13) F ≤ (cid:13)(cid:13) C − (cid:13)(cid:13) F − (cid:13)(cid:13)(cid:13) ( C − ˆ C ) C − (cid:13)(cid:13)(cid:13) F . 
Applying this to (27), (cid:13)(cid:13)(cid:13) ˆ C − − C − (cid:13)(cid:13)(cid:13) F ≤ (cid:13)(cid:13)(cid:13) ˆ C − (cid:13)(cid:13)(cid:13) F (cid:13)(cid:13)(cid:13) ( C − ˆ C ) C − (cid:13)(cid:13)(cid:13) F ≤ (cid:13)(cid:13) C − (cid:13)(cid:13) F (cid:13)(cid:13)(cid:13) ( C − ˆ C ) C − (cid:13)(cid:13)(cid:13) F − (cid:13)(cid:13)(cid:13) ( C − ˆ C ) C − (cid:13)(cid:13)(cid:13) F . Next we have a lemma bounding the error in estimating the quantity H = B − (Θ T Θ) − Θ T .24 emma 25. Let H = B − (Θ T Θ) − Θ T , ˆ H = ˆ B − ( ˆΘ T ˆΘ) − ˆΘ T , then w.h.p. (cid:13)(cid:13)(cid:13) H − ˆ H (cid:13)(cid:13)(cid:13) F = O (cid:18) (log n ) ξ nρ / (cid:19) (cid:107) H (cid:107) F = O (cid:18) √ nρ . (cid:19) (28) Proof. We build up the estimator of H step by step by repeatedly using Facts 23 and 24. Denote F = (cid:13)(cid:13) (Θ T Θ) − (cid:13)(cid:13) F , and F = (cid:13)(cid:13) B − (cid:13)(cid:13) F , by Lemma 3.6 in Mao et al. (2017), F ≤ √ r (cid:13)(cid:13) (Θ T Θ) − (cid:13)(cid:13) op = O P (1 /n ) , (29)and F ≤ √ r (cid:13)(cid:13) B − (cid:13)(cid:13) op = O (1 /ρ ) . (30)First, applying Fact 23, (cid:13)(cid:13)(cid:13) ˆΘ T ˆΘ − Θ T Θ (cid:13)(cid:13)(cid:13) F ≤ ∆ + 2 (cid:107) Θ (cid:107) F ∆ (cid:13)(cid:13) (Θ T Θ) − (cid:13)(cid:13) F (cid:13)(cid:13)(cid:13) ˆΘ T ˆΘ − Θ T Θ (cid:13)(cid:13)(cid:13) F ≤ (∆ + 2 (cid:107) Θ (cid:107) F ∆ ) F = O P (cid:18) (log n ) ξ nρ (cid:19) + O P (cid:18) (log n ) ξ √ nρ (cid:19) = O P (cid:18) (log n ) ξ √ nρ (cid:19) , using Lemma 22 and eq (29). Thus for large n , (cid:13)(cid:13) (Θ T Θ) − (cid:13)(cid:13) F (cid:13)(cid:13)(cid:13) ˆΘ T ˆΘ − Θ T Θ (cid:13)(cid:13)(cid:13) F < / . Then using Fact 24,we have (cid:13)(cid:13)(cid:13) ( ˆΘ T ˆΘ) − − (Θ T Θ) − (cid:13)(cid:13)(cid:13) F ≤ (cid:13)(cid:13) (Θ T Θ) − (cid:13)(cid:13) F (cid:13)(cid:13)(cid:13) ((Θ T Θ) − ( ˆΘ T ˆΘ))(Θ T Θ) − (cid:13)(cid:13)(cid:13) F − (cid:13)(cid:13)(cid:13) ((Θ T Θ) − ( ˆΘ T ˆΘ))(Θ T Θ) − (cid:13)(cid:13)(cid:13) F ≤ (cid:13)(cid:13) (Θ T Θ) − (cid:13)(cid:13) F (cid:13)(cid:13)(cid:13) ((Θ T Θ) − ( ˆΘ T ˆΘ)) (cid:13)(cid:13)(cid:13) F − (cid:13)(cid:13)(cid:13) ((Θ T Θ) − ( ˆΘ T ˆΘ))(Θ T Θ) − (cid:13)(cid:13)(cid:13) F ≤ (cid:13)(cid:13) (Θ T Θ) − (cid:13)(cid:13) F (cid:13)(cid:13)(cid:13) ((Θ T Θ) − ( ˆΘ T ˆΘ)) (cid:13)(cid:13)(cid:13) F ≤ (∆ + 2 (cid:107) Θ (cid:107) F ∆ ) F = O P (cid:18) (log n ) ξ n / ρ / . 
(cid:19) (31)Similarly using Lemma 22 and eq (30), by noting that (cid:13)(cid:13) B − (cid:13)(cid:13) F (cid:13)(cid:13)(cid:13) ( B − ˆ B ) (cid:13)(cid:13)(cid:13) F = ∆ F = O ( (log n ) ξ √ nρ ) < / for large n w.h.p., (cid:13)(cid:13)(cid:13) ˆ B − − B − (cid:13)(cid:13)(cid:13) F ≤ (cid:13)(cid:13) B − (cid:13)(cid:13) F (cid:13)(cid:13)(cid:13) ( B − ˆ B ) B − (cid:13)(cid:13)(cid:13) F − (cid:13)(cid:13)(cid:13) ( B − ˆ B ) B − (cid:13)(cid:13)(cid:13) F ≤ (cid:13)(cid:13) B − (cid:13)(cid:13) F (cid:13)(cid:13)(cid:13) ( B − ˆ B ) (cid:13)(cid:13)(cid:13) F ≤ F = O P (cid:18) (log n ) ξ n / ρ / (cid:19) (32)25sing Fact 24.Next applying Fact 23 to G := B − (Θ T Θ) − and its estimate ˆ G := ˆ B − ( ˆΘ T ˆΘ) − , (cid:13)(cid:13)(cid:13) G − ˆ G (cid:13)(cid:13)(cid:13) F ≤ (cid:13)(cid:13)(cid:13) ˆ B − − B − (cid:13)(cid:13)(cid:13) F F + ( (cid:13)(cid:13) B − (cid:13)(cid:13) F + (cid:13)(cid:13)(cid:13) B − − ˆ B − (cid:13)(cid:13)(cid:13) F ) (cid:13)(cid:13)(cid:13) ( ˆΘ T ˆΘ) − − (Θ T Θ) − (cid:13)(cid:13)(cid:13) F ≤ F F + (2∆ F + F )(∆ + 2 (cid:107) Θ (cid:107) F ∆ ) F = O P (cid:18) (log n ) ξ ( nρ ) / (cid:19) , using Eqs (29)-(32).Finally, since H = G Θ T , and ˆ H = ˆ G ˆΘ T , (cid:13)(cid:13)(cid:13) H − ˆ H (cid:13)(cid:13)(cid:13) F ≤ (cid:13)(cid:13)(cid:13) ˆ G − G (cid:13)(cid:13)(cid:13) F (cid:107) Θ (cid:107) F + ( (cid:107) G (cid:107) F + (cid:13)(cid:13)(cid:13) G − ˆ G (cid:13)(cid:13)(cid:13) F ) (cid:13)(cid:13)(cid:13) ˆΘ − Θ (cid:13)(cid:13)(cid:13) F ≤ √ n (cid:13)(cid:13)(cid:13) G − ˆ G (cid:13)(cid:13)(cid:13) F + ( F F + (cid:13)(cid:13)(cid:13) G − ˆ G (cid:13)(cid:13)(cid:13) F )∆ = O P (cid:18) (log n ) ξ nρ / (cid:19) , and (cid:107) G (cid:107) F = tr ((Θ T Θ) − (Θ T Θ) − ( BB T ) − ) ≤ (cid:13)(cid:13) (Θ T Θ) − (cid:13)(cid:13) op tr (( BB T ) − )= O P (1 /n ) F by Eq (29), (cid:107) H (cid:107) F = O P ( √ nρ ) follows. Lemma 26. Consider applying SPACL on the training graph A to obtain ( ˆΘ ) T and ˆ B , and use regressionin MATR-CV to estimate membership matrix, i.e., ( ˆΘ ) T = ˆ B − (( ˆΘ ) T ˆΘ ) − ( ˆΘ ) T A := ˆ HA , then (cid:13)(cid:13)(cid:13) ˆΘ − Θ (cid:13)(cid:13)(cid:13) F = O (cid:16) (log n ) ξ √ ρ (cid:17) w.h.p.Proof. Since (Θ ) T = H Θ B (Θ ) T , where H := B − ((Θ ) T Θ ) − (Θ ) T , ( ˆΘ ) T − (Θ ) T = ˆ HA − H Θ B (Θ ) T = ˆ HA − ˆ H Θ B (Θ ) T + ˆ H Θ B (Θ ) T − H Θ B (Θ ) T = ˆ H ( A − Θ B (Θ ) T ) + ( ˆ H − H )Θ B (Θ ) T = H ( A − P ) (cid:124) (cid:123)(cid:122) (cid:125) Q + ( ˆ H − H )( A − P ) (cid:124) (cid:123)(cid:122) (cid:125) Q + ( ˆ H − H ) P (cid:124) (cid:123)(cid:122) (cid:125) Q . For Q , (cid:107) Q (cid:107) F ≤ (cid:13)(cid:13) A − P (cid:13)(cid:13) op (cid:107) H (cid:107) F = O P ( √ nρ ) O P ( 1 √ nρ ) = O P (1 / √ ρ ) by Lemma 25. For Q , (cid:107) Q (cid:107) F ≤ (cid:13)(cid:13) A − P (cid:13)(cid:13) op (cid:13)(cid:13)(cid:13) ˆ H − H (cid:13)(cid:13)(cid:13) F = O P ( √ nρ ) O P (cid:18) (log n ) ξ nρ / (cid:19) = O P (cid:18) (log n ) ξ √ nρ (cid:19) , 26y Lemma 25. Finally for Q , (cid:107) Q (cid:107) F ≤ (cid:13)(cid:13) P (cid:13)(cid:13) F (cid:13)(cid:13)(cid:13) ˆ H − H (cid:13)(cid:13)(cid:13) F = O P ( nρ ) O P (cid:18) (log n ) ξ nρ / (cid:19) = O P (cid:18) (log n ) ξ √ ρ (cid:19) , The above arguments lead to (cid:13)(cid:13)(cid:13) ˆΘ − Θ (cid:13)(cid:13)(cid:13) F = O P (cid:18) (log n ) ξ √ ρ (cid:19) . Proposition 27. Given the correct number of clusters r , then with high probability, (cid:104) A , ˆ X r (cid:105) > (cid:104) A , X (cid:105)− O (( nρ ) / (log n ) ξ ) . 
Proof. First note that
$$|\langle \hat{S}, \hat{X}_r - X_0 \rangle| \le |\langle \hat{S} - S, \hat{X}_r - X_0 \rangle| + |\langle S, \hat{X}_r - X_0 \rangle|, \tag{33}$$
where $\hat{S} = A^2 - \mathrm{diag}(A^2)$ and $S = P^2 - \mathrm{diag}(P^2)$ are computed on the test graph. Consider the SVDs $\hat{\Theta} = \hat{U}\hat{D}\hat{V}^T$ and $\Theta = UDV^T$; then $\hat{X}_r = \hat{U}\hat{U}^T$ and $X_0 = UU^T$. For any orthogonal matrix $O$,
$$\|\hat{X}_r - X_0\|_F = \|\hat{U}\hat{U}^T - UU^T\|_F = \|\hat{U}\hat{U}^T - UO(UO)^T\|_F \le \|(\hat{U} - UO)(\hat{U} - UO)^T\|_F + 2\|(\hat{U} - UO)(UO)^T\|_F \le \|\hat{U} - UO\|_F^2 + 2\|\hat{U} - UO\|_F. \tag{34}$$
Using Theorem 2 in Yu et al. (2014), there exists an orthogonal matrix $O$ such that
$$\|\hat{U} - UO\|_F \le \frac{C\|\hat{\Theta} - \Theta\|_F}{\lambda_r(\Theta)},$$
where $\lambda_r(\Theta)$ is the $r$-th largest singular value of $\Theta$. Using Lemma 3.6 in Mao et al. (2017), w.h.p. $\lambda_r(\Theta) = \Omega(\sqrt{n})$. Now by Lemma 26 and Eq (34), w.h.p.
$$\|\hat{X}_r - X_0\|_F = O\Big(\frac{(\log n)^{\xi}}{\sqrt{n\rho}}\Big).$$
Now in Eq (33),
$$|\langle S, \hat{X}_r - X_0 \rangle| \le \|S\|_F \|\hat{X}_r - X_0\|_F = O_P\big((n\rho)^{3/2}(\log n)^{\xi}\big),$$
and $|\langle \hat{S} - S, \hat{X}_r - X_0 \rangle| = O_P(n\rho\sqrt{\log n})$ by Lemma 19.

Finally we prove Theorem 10.

Proof of Theorem 10. By Propositions 20, 21 and 27,
$$\epsilon_{under} = \Omega(n^2\rho^2), \qquad \epsilon_{est} = O\big((n\rho)^{3/2}(\log n)^{\xi}\big), \qquad \epsilon_{over} = O\big(n\rho\sqrt{\log n}\big).$$
The result then follows by setting $\Delta = O\big((n\rho)^{3/2}(\log n)^{\xi}\big)$.

C Detailed parameter settings in experiments and additional results

Motivating examples in Section 3 (Figure 1)

In Figure 1(a), we generate an adjacency matrix from an SBM with four communities, each having 50 nodes, and a fixed $4 \times 4$ block probability matrix $B$; the underlying probability matrix is visualized in Figure 4(a).

In Figure 1(b), we consider a four-component Gaussian mixture model, where the means $\mu_1, \dots, \mu_4$ are generated from Gaussian distributions with covariance $I$, centered so that the first two clusters are closer to each other than the rest. We then generate 1000 data points centered at these means with isotropic covariance, each point assigned to one of the four clusters independently with fixed probabilities. Finally, we introduce correlation between the two dimensions by multiplying each point by a fixed $2 \times 2$ matrix. A scatter plot of the datapoints is shown in Figure 4(b).

Figure 4: Datasets used for Figure 1. (a) SBM; (b) Gaussian mixture.

Tuning with SDP-1 (Figure 2 (a)-(b))

Figure 2 (a): We consider graphs generated from a hierarchical SBM with equal sized clusters, where $B = \rho B_0$ for a fixed $4 \times 4$ matrix $B_0$. Each cluster has 100 nodes and $\rho$ ranges over a grid of values.

Figure 2 (b): Next, we consider graphs generated from an SBM with the same $B$ matrix but with unequal cluster sizes. Clusters 1 and 3 have 100 nodes each, while clusters 2 and 4 have 50 nodes each. $\rho$ ranges over the same grid.
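For completeness, the SBM graphs in these settings can be sampled as in the sketch below. The entries of `B0` and the value of `rho` are illustrative placeholders, not the values used in the paper's experiments; the cluster sizes match the equal-sized setting above.

```python
import numpy as np

def sample_sbm(block_sizes, B0, rho, seed=0):
    """Sample a symmetric adjacency matrix (no self-loops) from an SBM
    with edge probability rho * B0[z_i, z_j]."""
    rng = np.random.default_rng(seed)
    z = np.repeat(np.arange(len(block_sizes)), block_sizes)    # block labels
    P = rho * B0[np.ix_(z, z)]                                 # n x n edge probabilities
    upper = np.triu((rng.random(P.shape) < P).astype(int), 1)  # sample upper triangle only
    return upper + upper.T, z                                  # symmetrize

# Placeholder hierarchical block matrix: two pairs of clusters with stronger
# within-pair connectivity (illustrative entries, not the paper's).
B0 = np.array([[0.6, 0.4, 0.1, 0.1],
               [0.4, 0.6, 0.1, 0.1],
               [0.1, 0.1, 0.6, 0.4],
               [0.1, 0.1, 0.4, 0.6]])
A, z = sample_sbm([100, 100, 100, 100], B0, rho=0.4)
```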
Tuning with spectral clustering (Figure 2(c)-(d))

We generate the means $\mu_a$, $a \in [3]$, from a $d = 20$ dimensional Gaussian distribution with isotropic covariance. To impose sparsity on each $\mu_a$, we set all but the first two dimensions to 0. To change the level of clustering difficulty, we multiply $\mu_a$ by a separation constant $c$; a larger $c$ leads to larger separation and easier clustering. We vary $c$ over a grid of values. We generate $n = 500$ samples from each mixture with a constant covariance matrix (an identity matrix) using Eq 2. For Figure 2 (c), the probabilities of cluster assignment are equal, while for Figure 2 (d), each point belongs to one of the three clusters with unequal probabilities. 2D projections of the datapoints for the two settings are shown in Figure 5.

Figure 5: 2D projections of the datapoints for the Gaussian mixtures. (a) Equal sized clusters; (b) Unequal sized clusters.

Additional figure for Section 5.2

Figure 6: Visualization of clustering results on the handwritten digits dataset. (a) True clustering; (b) Clustering by MATR; (c) Clustering by DS; (d) Clustering by KNN; (e) Clustering by MST.

Table 3: Comparison of model selection results as ρ varies for all algorithms.

(a) Median number of clusters selected, equal size case

ρ     MATR-CV   BH   ECV
0.2   2         2    2
0.3   2         2    2
0.4   4         2    2
0.5   4         2    2
0.6   4         4    2

(b) Median number of clusters selected, unequal size case

ρ     MATR-CV   BH   ECV
0.2   2         2    2
0.3   2         2    2
0.4   3         2    2
0.5   4         2    2
0.6   4         3    2

Tuning with SDP-2 (Figure 3)

Figure 3 (a): We first consider graphs generated from an SBM with equal sized clusters, where $B = \rho B_0$ for a fixed $4 \times 4$ matrix $B_0$. Each cluster has the same number of nodes, and the values of ρ are selected from 0.2 to 0.6 with even spacing.

Figure 3 (b): Here we consider graphs generated from an unequal-sized SBM, where the $B$ matrix is the same as above. The clusters have unequal numbers of nodes, and the same values of ρ as above are used.

Table 3 (a,b): We show the median number of clusters selected by each method as ρ changes. The ground truth is 4 clusters.

References

Abbe, Emmanuel, & Sandon, Colin. 2015. Recovering communities in the general stochastic block model without knowing the parameters. Pages 676–684 of: Advances in NIPS.
Abbe, Emmanuel, Bandeira, Afonso S, & Hall, Georgina. 2015. Exact recovery in the stochastic block model. IEEE Transactions on Information Theory, (1), 471–487.
Adamic, Lada A, & Glance, Natalie. 2005. The political blogosphere and the 2004 US election: divided they blog. Pages 36–43 of: Proceedings of the 3rd international workshop on Link discovery. ACM.
Airoldi, Edoardo M., Blei, David M., Fienberg, Stephen E., & Xing, Eric P. 2008. Mixed Membership Stochastic Blockmodels. J. Mach. Learn. Res., (June), 1981–2014.
Amini, Arash A, Levina, Elizaveta, et al. 2018. On semidefinite relaxations for the block model. Ann. Statist., (1), 149–179.
Bach, Francis R. 2008. Bolasso: model consistent lasso estimation through the bootstrap. Pages 33–40 of: Proceedings of the 25th international conference on Machine learning. ACM.
Belkin, Mikhail, & Niyogi, Partha. 2003. Laplacian Eigenmaps for Dimensionality Reduction and Data Representation. Neural Comput., (6), 1373–1396.
Bengio, Yoshua. 2000. Gradient-based optimization of hyperparameters. Neural Computation, (8), 1889–1900.
Bergstra, James, & Bengio, Yoshua. 2012. Random search for hyper-parameter optimization. Journal of Machine Learning Research, (Feb), 281–305.
Bergstra, James S, Bardenet, Rémi, Bengio, Yoshua, & Kégl, Balázs. 2011. Algorithms for hyper-parameter optimization. Pages 2546–2554 of: Advances in NIPS.
Bickel, Peter J, & Sarkar, Purnamrita. 2016. Hypothesis testing for automated community detection in networks. JRSSb, (1), 253–273.
Birgé, Lucien, & Massart, Pascal. 2001. Gaussian model selection. Journal of the European Mathematical Society, (3), 203–268.
Bozdogan, Hamparsum. 1987. Model selection and Akaike's Information Criterion (AIC): The general theory and its analytical extensions. Psychometrika, 345–370.
Breiman, Leo, et al. 1996. Heuristics of instability and stabilization in model selection. Ann. Statist., (6), 2350–2383.
Cai, T Tony, Li, Xiaodong, et al. 2015. Robust and computationally feasible community detection in the presence of arbitrary outlier nodes. Ann. Statist., (3), 1027–1059.
Chatterjee, Sourav, et al. 2015. Matrix estimation by universal singular value thresholding. Ann. Statist., (1), 177–214.
Chen, Kehui, & Lei, Jing. 2018. Network cross-validation for determining the number of communities in network data. JASA, (521), 241–251.
Chen, Yudong, Li, Xiaodong, & Xu, Jiaming. 2018. Convexified modularity maximization for degree-corrected stochastic block models. Ann. Statist., (4), 1573–1602.
Coifman, R. R., Shkolnisky, Y., Sigworth, F. J., & Singer, A. 2008. Graph Laplacian Tomography From Unknown Random Projections. Trans. Img. Proc., (10), 1891–1899.
Drton, Mathias, & Plummer, Martyn. 2017. A Bayesian information criterion for singular models. JRSSb, (2), 323–380.
Fan, Jianqing, Fan, Yingying, Han, Xiao, & Lv, Jinchi. 2019. SIMPLE: Statistical Inference on Membership Profiles in Large Networks. arXiv preprint arXiv:1910.01734.
Fang, Yixin, & Wang, Junhui. 2012. Selection of the number of clusters via the bootstrap method. Computational Statistics & Data Analysis, (3), 468–477.
Figueiredo, M. A. T., & Jain, A. K. 2002. Unsupervised learning of finite mixture models. IEEE Transactions on PAMI, (3), 381–396.
Giné, Evarist, & Koltchinskii, Vladimir. 2006. Empirical graph Laplacian approximation of Laplace–Beltrami operators: Large sample results. Lecture Notes–Monograph Series, vol. 51. Beachwood, Ohio, USA: Institute of Mathematical Statistics. Pages 238–259.
Girvan, Michelle, & Newman, Mark EJ. 2002. Community structure in social and biological networks. Proceedings of the National Academy of Sciences, (12), 7821–7826.
Guédon, Olivier, & Vershynin, Roman. 2016. Community detection in sparse networks via Grothendieck's inequality. Probability Theory and Related Fields, (3), 1025–1049.
Hajek, Bruce, Wu, Yihong, & Xu, Jiaming. 2016. Achieving exact cluster recovery threshold via semidefinite programming. IEEE Transactions on Information Theory, (5), 2788–2797.
Han, Xiao, Yang, Qing, & Fan, Yingying. 2019. Universal Rank Inference via Residual Subsampling with Application to Large Networks. arXiv preprint arXiv:1912.11583.
Hastie, Trevor, Tibshirani, Robert, Friedman, Jerome, & Franklin, James. 2005. The elements of statistical learning: data mining, inference and prediction. The Mathematical Intelligencer, (2), 83–85.
Hein, Matthias. 2006. Uniform Convergence of Adaptive Graph-Based Regularization. Pages 50–64 of: Proceedings of COLT. Berlin, Heidelberg: Springer-Verlag.
Hein, Matthias, Audibert, Jean-Yves, & von Luxburg, Ulrike. 2005. From Graphs to Manifolds – Weak and Strong Pointwise Consistency of Graph Laplacians. Pages 470–485 of: COLT.
Hu, Jianwei, Qin, Hong, Yan, Ting, Zhang, Jingfei, & Zhu, Ji. 2017. Using Maximum Entry-Wise Deviation to Test the Goodness-of-Fit for Stochastic Block Models. arXiv preprint arXiv:1703.06558.
Jin, Jiashun, Ke, Zheng Tracy, & Luo, Shengming. 2017. Estimating network memberships by simplex vertex hunting.
Karrer, Brian, & Newman, M. E. J. 2011. Stochastic blockmodels and community structure in networks. Phys. Rev. E, (Jan), 016107.
Keribin, Christine. 2000. Consistent Estimate of the Order of Mixture Models. Sankhyā, Series A, (01), 49–66.
Le, Can M, & Levina, Elizaveta. 2015. Estimating the number of communities in networks by spectral methods. arXiv preprint arXiv:1507.00827.
Lei, Jing, & Rinaldo, Alessandro. 2015. Consistency of spectral clustering in stochastic block models. Ann. Statist., (1), 215–237.
Lei, Jing, et al. 2016. A goodness-of-fit test for stochastic block models. Ann. Statist., (1), 401–424.
Leroux, Brian G. 1992. Consistent Estimation of a Mixing Distribution. Ann. Statist., (3), 1350–1360.
Li, Tianxi, Levina, Elizaveta, & Zhu, Ji. 2016. Network cross-validation by edge sampling. arXiv preprint arXiv:1612.04717.
Li, Xiaodong, Chen, Yudong, & Xu, Jiaming. 2018. Convex relaxation methods for community detection. arXiv preprint arXiv:1810.00315.
Lim, Chinghway, & Yu, Bin. 2016. Estimation stability with cross-validation (ESCV). Journal of Computational and Graphical Statistics, (2), 464–492.
Little, Anna, Maggioni, Mauro, & Murphy, James. 2017. Path-Based Spectral Clustering: Guarantees, Robustness to Outliers, and Fast Algorithms.
Löffler, Matthias, Zhang, Anderson Y., & Zhou, Harrison H. 2019. Optimality of Spectral Clustering for Gaussian Mixture Model.
Maaten, Laurens van der, & Hinton, Geoffrey. 2008. Visualizing data using t-SNE. Journal of Machine Learning Research, (Nov), 2579–2605.
Maggioni, Mauro, & Murphy, James M. 2018. Learning by Unsupervised Nonlinear Diffusion. ArXiv, abs/1810.06702.
Mao, Xueyu, Sarkar, Purnamrita, & Chakrabarti, Deepayan. 2017. Estimating Mixed Memberships with Sharp Eigenvector Deviations. ArXiv, abs/1709.00407.
Mao, Xueyu, Sarkar, Purnamrita, & Chakrabarti, Deepayan. 2018. Overlapping clustering models, and one (class) SVM to bind them all. Pages 2126–2136 of: Advances in NeurIPS.
Meila, Marina. 2018. How to tell when a clustering is (approximately) correct using convex relaxations. Pages 7407–7418 of: Advances in Neural Information Processing Systems.
Meinshausen, Nicolai, & Bühlmann, Peter. 2010. Stability selection. Journal of the Royal Statistical Society: Series B (Statistical Methodology), (4), 417–473.
Mixon, Dustin G, Villar, Soledad, & Ward, Rachel. 2017. Clustering subgaussian mixtures by semidefinite programming. Information and Inference: A Journal of the IMA, (4), 389–415.
Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., & Duchesnay, E. 2011. Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research, 2825–2830.
Peng, Jiming, & Wei, Yu. 2007. Approximating K-means-type Clustering via Semidefinite Programming. SIAM J. on Optimization, (1), 186–205.
Perry, Amelia, & Wein, Alexander S. 2017. A semidefinite program for unbalanced multisection in the stochastic block model. Pages 64–67 of: 2017 International Conference on Sampling Theory and Applications (SampTA). IEEE.
Rousseeuw, Peter J. 1987. Silhouettes: A graphical aid to the interpretation and validation of cluster analysis. Journal of Computational and Applied Mathematics, 53–65.
Schiebinger, Geoffrey, Wainwright, Martin J., & Yu, Bin. 2015. The geometry of kernelized spectral clustering. Ann. Statist., (2), 819–846.
Shao, Jun. 1993. Linear model selection by cross-validation. Journal of the American Statistical Association, (422), 486–494.
Shi, Tao, Belkin, Mikhail, & Yu, Bin. 2008. Data spectroscopy: Learning mixture models using eigenspaces of convolution operators. Pages 936–943 of: Proceedings of the 25th international conference on Machine learning. ACM.
Skala, Matthew. 2013. Hypergeometric tail inequalities: ending the insanity. arXiv preprint arXiv:1311.5939.
Snoek, Jasper, Larochelle, Hugo, & Adams, Ryan P. 2012. Practical Bayesian optimization of machine learning algorithms. Pages 2951–2959 of: Advances in Neural Information Processing Systems.
Srivastava, Prateek R., Sarkar, Purnamrita, & Hanasusanto, Grani A. 2019. A Robust Spectral Clustering Algorithm for Sub-Gaussian Mixture Models with Outliers.
Stone, Mervyn. 1974. Cross-validatory choice and assessment of statistical predictions. Journal of the Royal Statistical Society: Series B (Methodological), (2), 111–133.
Tibshirani, Robert, Walther, Guenther, & Hastie, Trevor. 2001. Estimating the number of clusters in a data set via the gap statistic. JRSSb, (2), 411–423.
Tibshirani, Robert, Wainwright, Martin, & Hastie, Trevor. 2015. Statistical learning with sparsity: the lasso and generalizations. Chapman and Hall/CRC.
Von Luxburg, Ulrike. 2007. A tutorial on spectral clustering. Statistics and Computing, (4), 395–416.
Von Luxburg, Ulrike, et al. 2010. Clustering stability: an overview. Foundations and Trends® in Machine Learning, (3), 235–274.
Wang, Junhui. 2010. Consistent selection of the number of clusters via crossvalidation. Biometrika, (4), 893–904.
Wang, Y. X. Rachel, & Bickel, Peter J. 2017. Likelihood-based model selection for stochastic block models. Ann. Statist., (2), 500–528.
Wasserman, Larry. 2006. All of nonparametric statistics. Springer Science & Business Media.
Wasserman, Larry, & Roeder, Kathryn. 2009. High dimensional variable selection. Annals of Statistics, (5A), 2178.
Yan, Bowei, & Sarkar, Purnamrita. 2019. Covariate Regularized Community Detection in Sparse Graphs. JASA theory and methods.
Yan, Bowei, Sarkar, Purnamrita, & Cheng, Xiuyuan. 2017. Provable estimation of the number of blocks in block models. arXiv preprint arXiv:1705.08580.
Yang, Yuhong, et al. 2007. Consistency of cross validation for comparing regression procedures. Ann. Statist., (6), 2450–2473.
Young, Stephen J, & Scheinerman, Edward R. 2007. Random dot product graph models for social networks. Pages 138–149 of: International Workshop on Algorithms and Models for the Web-Graph. Springer.
Yu, Y., Wang, T., & Samworth, R. J. 2014. A useful variant of the Davis–Kahan theorem for statisticians.