An $\ell_p$ theory of PCA and spectral clustering
Emmanuel Abbe∗   Jianqing Fan†   Kaizheng Wang‡

June 2020

∗ Institute of Mathematics, EPFL, Lausanne, CH-1015, Switzerland. E-mail: emmanuel.abbe@epfl.ch.
† Department of ORFE, Princeton University, Princeton, NJ 08544, USA. E-mail: [email protected].
‡ Department of IEOR, Columbia University, New York, NY 10027, USA. E-mail: [email protected].
Abstract
Principal Component Analysis (PCA) is a powerful tool in statistics and machine learning. While existing study of PCA focuses on the recovery of principal components and their associated eigenvalues, there are few precise characterizations of individual principal component scores that yield low-dimensional embedding of samples. That hinders the analysis of various spectral methods. In this paper, we first develop an $\ell_p$ perturbation theory for a hollowed version of PCA in Hilbert spaces which provably improves upon the vanilla PCA in the presence of heteroscedastic noises. Through a novel $\ell_p$ analysis of eigenvectors, we investigate entrywise behaviors of principal component score vectors and show that they can be approximated by linear functionals of the Gram matrix in $\ell_p$ norm, which includes $\ell_2$ and $\ell_\infty$ as special examples. For sub-Gaussian mixture models, the choice of $p$ giving optimal bounds depends on the signal-to-noise ratio, which further yields optimality guarantees for spectral clustering. For contextual community detection, the $\ell_p$ theory leads to a simple spectral algorithm that achieves the information threshold for exact recovery. These also provide optimal recovery results for Gaussian mixture and stochastic block models as special cases.

Keywords: principal component analysis, eigenvector perturbation, spectral clustering, mixture models, community detection, contextual network models, phase transitions.
Introduction

Modern technologies generate enormous volumes of data that present new statistical and computational challenges. High-throughput data come inevitably with a tremendous amount of noise, from which very faint signals are to be discovered. Moreover, the analytic procedures must be affordable in terms of computational costs. While likelihood-based approaches usually lead to non-convex optimization problems that are NP-hard in general, the method of moments provides viable solutions.

Principal Component Analysis (PCA) (Pearson, 1901) is arguably the most prominent tool of this type. It significantly reduces the dimension of data using eigenvalue decomposition of a second-order moment matrix. Unlike the classical settings where the dimension $d$ is much smaller than the sample size $n$, nowadays it could be the other way around in numerous applications (Ringnér, 2008; Novembre et al., 2008; Yeung and Ruzzo, 2001). Reliability of the low-dimensional embedding is of crucial importance, as all downstream tasks are based on that. Unfortunately, existing theories often fail to provide sharp guarantees when both the dimension and the noise level are high, especially in the absence of sparsity structures. The matter is further complicated by the use of nonlinear kernels for dimension reduction (Schölkopf et al., 1997), which is de facto PCA in some infinite-dimensional Hilbert space.

In this paper, we investigate the spectral embedding returned by a hollowed version of PCA. Consider the signal-plus-noise model
$$x_i = \bar{x}_i + z_i \in \mathbb{R}^d, \qquad i \in [n]. \tag{1.1}$$
Here $\{x_i\}_{i=1}^n$ are noisy observations of signals $\{\bar{x}_i\}_{i=1}^n$ contaminated by $\{z_i\}_{i=1}^n$. Define the data matrices $X = (x_1, \cdots, x_n)^\top \in \mathbb{R}^{n\times d}$ and $\bar{X} = (\bar{x}_1, \cdots, \bar{x}_n)^\top \in \mathbb{R}^{n\times d}$. Let $\bar{G} = \bar{X}\bar{X}^\top \in \mathbb{R}^{n\times n}$ be the Gram matrix of $\{\bar{x}_i\}_{i=1}^n$, and $G = \mathcal{H}(XX^\top)$ be the hollowed Gram matrix of $\{x_i\}_{i=1}^n$, where $\mathcal{H}(\cdot)$ is the hollowing operator that zeroes out all diagonal entries of a square matrix. Denote by $\{\lambda_j, u_j\}_{j=1}^n$ and $\{\bar\lambda_j, \bar{u}_j\}_{j=1}^n$ the eigen-pairs of $G$ and $\bar{G}$, respectively, where the eigenvalues are sorted in descending order. While PCA computes the embedding by eigenvalue decomposition of $XX^\top$, here we delete its diagonal to enhance concentration and handle heteroscedasticity (Koltchinskii and Giné, 2000). We seek an answer to the following fundamental question: how do the eigenvectors of $G$ relate to those of $\bar{G}$?

Roughly speaking, our main results state that
$$u_j = G u_j / \lambda_j \approx G \bar{u}_j / \bar\lambda_j, \tag{1.2}$$
where the approximation is measured in the $\ell_p$ norm for a proper choice of $p$. In words, the eigenvector $u_j$ is a nonlinear function of $G$ but can be well approximated by the linear function $G\bar{u}_j/\bar\lambda_j$ in the $\ell_p$ norm, where $p$ is given by the model signal-to-noise ratio (SNR). This linearization facilitates the analysis and allows us to quantify how the magnitude of the signal-to-noise ratio affects theoretical guarantees for signal recovery.
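To make this construction concrete, here is a minimal numerical sketch (ours, not from the paper; all parameter choices are hypothetical) that forms the hollowed Gram matrix $G = \mathcal{H}(XX^\top)$ on synthetic signal-plus-noise data and checks the linearization (1.2) for the top eigenvector.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 300, 1000                        # more features than samples
y = rng.choice([-1.0, 1.0], size=n)     # hidden labels (rank-one signal)
mu = 3.0 / np.sqrt(d) * np.ones(d)      # mean vector with ||mu||_2 = 3
X = np.outer(y, mu) + rng.standard_normal((n, d))   # signal-plus-noise data

G = X @ X.T
np.fill_diagonal(G, 0.0)                # hollowing operator H: zero out the diagonal

eigvals, eigvecs = np.linalg.eigh(G)    # eigenvalues in ascending order
lam1, u1 = eigvals[-1], eigvecs[:, -1]  # top eigen-pair of G

lam1_bar = n * np.dot(mu, mu)           # population eigenvalue n ||mu||^2
u1_bar = y / np.sqrt(n)                 # population eigenvector of G_bar
lin = G @ u1_bar / lam1_bar             # linearization (1.2)
s = np.sign(u1 @ lin)                   # resolve the global sign ambiguity
print("l2 distance to linearization:", np.linalg.norm(s * u1 - lin))
print("label accuracy of sgn(u1):",
      max(np.mean(np.sign(u1) == y), np.mean(np.sign(-u1) == y)))
```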
In many statistical problems such as mixture models, the vectors $\{\bar{x}_i\}_{i=1}^n$ live in a low-dimensional subspace of $\mathbb{R}^d$. Their latent coordinates reflect the geometry of the data, which can be decoded from eigenvalues and eigenvectors of $\bar{G}$. Our results show how well the spectral decomposition of $G$ reveals that of $\bar{G}$, characterizing the behavior of individual embedded samples. From there we easily derive the optimality of spectral clustering in two-component sub-Gaussian mixture models and contextual stochastic block models, in terms of both the misclassification rates and the exact recovery thresholds. In particular, the linearization of eigenvectors (1.2) helps develop a simple spectral method for contextual stochastic block models, efficiently combining the information from the network and the node attributes.

Our general results hold for Hilbert spaces. It is easily seen that the construction of the hollowed Gram matrix $G$ and the subsequent steps only depend on pairwise inner products $\{\langle x_i, x_j\rangle\}_{1\le i,j\le n}$. This makes the "kernel trick" applicable (Cristianini and Shawe-Taylor, 2000), and our analysis readily handles (a hollowed version of) kernel PCA.

We demonstrate the merits of the $\ell_p$ analysis using spectral clustering for a mixture of two Gaussians. Let $y \in \{\pm 1\}^n$ be a label vector with i.i.d. Rademacher entries and $\mu \in \mathbb{R}^d$ be a deterministic mean vector, both of which are unknown. Consider the model
$$x_i = y_i \mu + z_i, \qquad i \in [n], \tag{1.3}$$
where $\{z_i\}_{i=1}^n$ are i.i.d. $N(0, I_d)$ vectors. The goal is to estimate $y$ from $\{x_i\}_{i=1}^n$. The model (1.3) is a special case of the signal-plus-noise model (1.1) with $\bar{x}_i = y_i\mu$. Since $\mathbb{P}(y_i = 1) = \mathbb{P}(y_i = -1) = 1/2$, $\{x_i\}_{i=1}^n$ are i.i.d. samples from a mixture of two Gaussians $\frac{1}{2}N(\mu, I_d) + \frac{1}{2}N(-\mu, I_d)$. By construction, $\bar{X} = (\bar{x}_1, \cdots, \bar{x}_n)^\top = y\mu^\top$ and $\bar{G} = \|\mu\|_2^2 yy^\top$ with $\bar{u}_1 = y/\sqrt{n}$ and $\bar\lambda_1 = n\|\mu\|_2^2$. Hence, $\mathrm{sgn}(u_1)$ becomes a natural estimator for $y$, where $\mathrm{sgn}(\cdot)$ is the entrywise sign function. A fundamental question is whether the empirical eigenvector $u_1$ is informative enough to accurately recover the labels in competitive regimes. To formalize the discussion, we denote by
$$\mathrm{SNR} = \frac{\|\mu\|_2^4}{\|\mu\|_2^2 + d/n} \tag{1.4}$$
the signal-to-noise ratio of model (1.3). Consider the challenging asymptotic regime where $n \to \infty$ and $1 \ll \mathrm{SNR} \lesssim \log n$. (In Theorem 3.2 we also derive results for the exact recovery of the spectral estimator, i.e. $\mathbb{P}(\mathrm{sgn}(u_1) = \pm y) \to 1$, when $\mathrm{SNR} \gg \log n$. Here we omit that case and discuss error rates.) The dimension $d$ may or may not diverge. According to Theorem 3.2, the spectral estimator $\mathrm{sgn}(u_1)$ achieves the minimax optimal misclassification rate
$$e^{-\frac{\mathrm{SNR}}{2}(1 + o(1))}. \tag{1.5}$$
In order to get this result, we start from an $\ell_p$ analysis of $u_1$. Theorem 3.3 shows that
$$\mathbb{P}\Big(\min_{s=\pm 1}\|s u_1 - G\bar{u}_1/\bar\lambda_1\|_p < \varepsilon_n \|\bar{u}_1\|_p\Big) > 1 - Ce^{-p} \tag{1.6}$$
for $p = \mathrm{SNR}$, some constant $C > 0$ and some deterministic sequence $\{\varepsilon_n\}_{n=1}^\infty$ tending to zero. On the event $\|s u_1 - G\bar{u}_1/\bar\lambda_1\|_p < \varepsilon_n \|\bar{u}_1\|_p$, we apply a Markov-type inequality to the entries of $(s u_1 - G\bar{u}_1/\bar\lambda_1)$:
$$\frac{1}{n}\Big|\big\{i:\ |(s u_1 - G\bar{u}_1/\bar\lambda_1)_i| > \sqrt{\varepsilon_n/n}\big\}\Big| \le \frac{n^{-1}\sum_{i=1}^n |(s u_1 - G\bar{u}_1/\bar\lambda_1)_i|^p}{(\sqrt{\varepsilon_n/n})^p} \overset{\mathrm{(i)}}{=} \bigg(\frac{\|s u_1 - G\bar{u}_1/\bar\lambda_1\|_p}{\sqrt{\varepsilon_n}\,\|\bar{u}_1\|_p}\bigg)^p \le \varepsilon_n^{p/2}, \tag{1.7}$$
where (i) follows from $\bar{u}_1 = y/\sqrt{n}$ and $\|\bar{u}_1\|_p^p = n(1/\sqrt{n})^p$. Hence all but an $\varepsilon_n^{\mathrm{SNR}/2}$ fraction of $u_1$'s entries are well-approximated by those of $G\bar{u}_1/\bar\lambda_1$. On the other hand, since the misclassification error is always bounded by 1, the exceptional event in (1.6) may at most contribute a $Ce^{-\mathrm{SNR}}$ amount to the final error. Both $\varepsilon_n^{\mathrm{SNR}/2}$ and $Ce^{-\mathrm{SNR}}$ are negligible compared to the optimal rate $e^{-\mathrm{SNR}/2}$ in (1.5). This helps us show that the $\ell_p$ bound (1.6) ensures sufficient proximity between $u_1$ and $G\bar{u}_1/\bar\lambda_1$, and the analysis boils down to the latter term.

We now explain why $G\bar{u}_1/\bar\lambda_1$ is a good target to aim at. Observe that
$$(G\bar{u}_1)_i = [\mathcal{H}(XX^\top)\bar{u}_1]_i = \sum_{j\ne i} \langle x_i, x_j\rangle y_j/\sqrt{n} \propto \langle x_i, \hat\mu^{(-i)}\rangle, \tag{1.8}$$
where $\hat\mu^{(-i)} = \frac{1}{n-1}\sum_{j\ne i} x_j y_j$ is the leave-one-out sample mean. Consequently, the (unsupervised) spectral estimator $\mathrm{sgn}[(u_1)_i]$ for $y_i$ is approximated by $\mathrm{sgn}(\langle x_i, \hat\mu^{(-i)}\rangle)$, which coincides with the (supervised) linear discriminant analysis (Fisher, 1936) given additional labels $\{y_j\}_{j\ne i}$. This oracle estimator turns out to capture the difficulty of label recovery. That is, $\mathrm{sgn}(G\bar{u}_1/\bar\lambda_1)$ achieves the optimal misclassification rate in (1.5).

Above we provide high-level ideas about why the spectral estimator $\mathrm{sgn}(u_1)$ is optimal. Inequality (1.6) ties $u_1$ and its linearization $G\bar{u}_1/\bar\lambda_1$ together. The latter is connected to the genie-aided estimator through (1.8). As a side remark, the relation (1.8) hinges on the fact that $G$ is hollowed. Otherwise there would be a square term $\langle x_i, x_i\rangle$ making things entangled.
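The relation (1.8) is a finite-sample algebraic identity and can be checked directly. The snippet below (our illustration, on hypothetical synthetic data) confirms that each coordinate of $G\bar{u}_1$ equals $\frac{n-1}{\sqrt{n}}\langle x_i, \hat\mu^{(-i)}\rangle$ up to floating-point error.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 200, 500
y = rng.choice([-1.0, 1.0], size=n)
mu = 1.5 / np.sqrt(d) * np.ones(d)
X = np.outer(y, mu) + rng.standard_normal((n, d))

G = X @ X.T
np.fill_diagonal(G, 0.0)                # hollowed Gram matrix
u1_bar = y / np.sqrt(n)

lhs = G @ u1_bar                        # (G u1_bar)_i
# leave-one-out sample means: mu_hat^(-i) = (n-1)^{-1} sum_{j != i} x_j y_j
total = X.T @ y                         # sum_j x_j y_j
loo = (total[None, :] - X * y[:, None]) / (n - 1)
rhs = (n - 1) / np.sqrt(n) * np.einsum("ij,ij->i", X, loo)
print("max deviation:", np.max(np.abs(lhs - rhs)))   # ~1e-10: identity holds
```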
Related work

Early works on PCA focus on classical settings where the dimension $d$ is fixed and the sample size $n$ goes to infinity (Anderson, 1963). Motivated by modern applications, in the past two decades there has been a surge of interest in high-dimensional PCA. Most papers in this direction study the consistency of empirical eigenvalues (Johnstone, 2001; Baik et al., 2005) or Principal Component (PC) directions (Paul, 2007; Nadler, 2008; Jung and Marron, 2009; Benaych-Georges and Nadakuditi, 2012; Perry et al., 2016; Wang and Fan, 2017) under various spiked covariance models with $d$ growing with $n$. Similar results are also available for infinite-dimensional Hilbert spaces (Koltchinskii and Giné, 2000; Zwald and Blanchard, 2006; Koltchinskii and Lounici, 2016). The analysis of PCs amounts to showing how the leading eigenvectors of $X^\top X = \sum_{i=1}^n x_i x_i^\top \in \mathbb{R}^{d\times d}$ recover those of $\mathbb{E}(x_i x_i^\top)$. When it comes to dimension reduction, one projects the data onto these PCs and gets PC scores. This is directly linked to the leading eigenvectors of the Gram matrix $XX^\top \in \mathbb{R}^{n\times n}$. In high-dimensional problems, the $n$-dimensional PC scores may still consistently reveal meaningful structures even if the $d$-dimensional PCs fail to do so (Cai and Zhang, 2018).

Analysis of PC scores is crucial to the theoretical study of spectral methods. However, existing results (Blanchard et al., 2007; Amini and Razaee, 2019) in related areas cannot precisely characterize individual embedded samples under general conditions. This paper aims to bridge the gap by a novel analysis. In addition, our work is orthogonal to those with sparsity assumptions (Johnstone and Lu, 2009; Jin and Wang, 2016). Here we are concerned with (i) the non-sparse regime where most components contribute to the main variability and (ii) the infinite-dimensional regime in kernel PCA where the sparsity assumption is not appropriate.

There is a vast literature on perturbation theories of eigenvectors. Most classical bounds are deterministic and use the $\ell_2$ norm or other orthonormal-invariant norms as error metrics. This includes the celebrated Davis-Kahan theorem (Davis and Kahan, 1970) and its extensions (Wedin, 1972); see Stewart and Sun (1990) for a review. Improved $\ell_2$-type results are available for stochastic settings (O'Rourke et al., 2018). For many problems in statistics and machine learning, entrywise analysis is more desirable because that helps characterize the spectral embedding of individual samples. Fan et al. (2019), Eldridge et al. (2017), Cape et al. (2019) and Damle and Sun (2019) provide $\ell_\infty$ perturbation bounds in deterministic settings. Their bounds are often too conservative when the noise is stochastic. Recent papers (Koltchinskii and Xia, 2016; Abbe et al., 2017; Mao et al., 2017; Zhong and Boumal, 2018; Chen et al., 2019; Lei, 2019) take advantage of the randomness to obtain sharp $\ell_\infty$ results for challenging tasks. The random matrices considered therein are mostly Wigner-type, with independent entries or similar structures. On the contrary, our hollowed Gram matrix $G$ has a Wishart-type distribution, since its off-diagonal entries are inner products of samples and thus dependent. What is more, our $\ell_p$ bounds with $p$ determined by the signal strength are adaptive. If the signal is weak, existing $\ell_\infty$ analysis does not go through, as strong concentration is required for uniform control of all the entries. However, our $\ell_p$ analysis still manages to control a vast majority of the entries. If the signal is strong, our results imply $\ell_\infty$ bounds. The $\ell_p$ eigenvector analysis in this paper shares some features with the study of $\ell_p$-delocalization (Erdős et al., 2009), yet the settings are very different. It would be interesting to establish further connections.

The applications in this paper are canonical problems in clustering and have been extensively studied. For the sub-Gaussian mixture model, the settings and methods in Giraud and Verzelen (2018), Ndaoud (2018) and Löffler et al. (2019) are similar to ours. The contextual network problem concerns grouping the nodes based on their attributes and pairwise connections; see Binkiewicz et al. (2017), Deshpande et al. (2018) and Yan and Sarkar (2020) for more about the model. We defer detailed discussions on these to Sections 3 and 4.

Organization

We present the general setup and results for $\ell_p$ eigenvector analysis in Section 2; apply them to clustering under mixture models in Section 3 and contextual community detection in Section 4; show a sketch of main proofs in Section 5; and conclude the paper with possible future directions in Section 6.

Notation

We use $[n]$ to refer to $\{1, 2, \cdots, n\}$ for $n \in \mathbb{Z}_+$. Denote by $|\cdot|$ the absolute value of a real number or the cardinality of a set. For real numbers $a$ and $b$, let $a \wedge b = \min\{a, b\}$ and $a \vee b = \max\{a, b\}$. For nonnegative sequences $\{a_n\}_{n=1}^\infty$ and $\{b_n\}_{n=1}^\infty$, we write $a_n \ll b_n$ or $a_n = o(b_n)$ if $b_n > 0$ and $a_n/b_n \to 0$; $a_n \lesssim b_n$ or $a_n = O(b_n)$ if there exists a positive constant $C$ such that $a_n \le C b_n$; $a_n \gtrsim b_n$ or $a_n = \Omega(b_n)$ if $b_n \lesssim a_n$. In addition, we write $a_n \asymp b_n$ if $a_n \lesssim b_n$ and $b_n \lesssim a_n$. We let $\mathbf{1}_S$ be the binary indicator function of a set $S$.

Let $\{e_j\}_{j=1}^d$ be the canonical bases of $\mathbb{R}^d$, $\mathbb{S}^{d-1} = \{x \in \mathbb{R}^d:\ \|x\|_2 = 1\}$ and $B(x, r) = \{y \in \mathbb{R}^d:\ \|y - x\|_2 \le r\}$. For a vector $x = (x_1, \cdots, x_d)^\top \in \mathbb{R}^d$ and $p \ge 1$, define its $\ell_p$ norm $\|x\|_p = (\sum_{i=1}^d |x_i|^p)^{1/p}$. For $i \in [d]$, let $x_{-i}$ be the $(d-1)$-dimensional sub-vector of $x$ without the $i$-th entry. For a matrix $A \in \mathbb{R}^{n\times m}$, we define its spectral norm $\|A\|_2 = \sup_{\|x\|_2=1}\|Ax\|_2$ and Frobenius norm $\|A\|_F = (\sum_{i,j}A_{ij}^2)^{1/2}$. Unless otherwise specified, we use $A_i$ and $a_j$ to refer to the $i$-th row and the $j$-th column of $A$, respectively. For $1 \le p, q \le \infty$, we define an entrywise matrix norm
$$\|A\|_{q,p} = \bigg[\sum_{i=1}^n \bigg(\sum_{j=1}^m |a_{ij}|^q\bigg)^{p/q}\bigg]^{1/p}. \tag{1.9}$$
The notation is not to be confused with the $(q,p)$-induced norm, which is not used in the current paper. In words, we concatenate the $\ell_q$ norms of the row vectors of $A$ into an $n$-dimensional vector and then compute its $\ell_p$ norm. A special case is $\|A\|_{2,\infty} = \max_{i\in[n]}\|A_i\|_2$.

Define the sub-Gaussian norms $\|X\|_{\psi_2} = \sup_{p\ge 1} p^{-1/2}\mathbb{E}^{1/p}|X|^p$ for a random variable $X$ and $\|X\|_{\psi_2} = \sup_{\|u\|_2=1}\|\langle u, X\rangle\|_{\psi_2}$ for a random vector $X$. Denote by $\chi^2_n$ the $\chi^2$ distribution with $n$ degrees of freedom. $\overset{\mathbb{P}}{\to}$ represents convergence in probability.
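As a quick illustration of the entrywise norm (1.9), the small helper below (ours, written to match the row-first convention above) computes $\|A\|_{q,p}$ and checks two special cases.

```python
import numpy as np

def norm_qp(A, q, p):
    """Entrywise norm (1.9): the l_p norm of the vector of row-wise l_q norms."""
    row_norms = np.sum(np.abs(A) ** q, axis=1) ** (1.0 / q)
    if np.isinf(p):
        return row_norms.max()                 # ||A||_{q,infty}
    return np.sum(row_norms ** p) ** (1.0 / p)

A = np.arange(6, dtype=float).reshape(3, 2)
assert np.isclose(norm_qp(A, 2, 2), np.linalg.norm(A))            # Frobenius norm
assert np.isclose(norm_qp(A, 2, np.inf),
                  np.linalg.norm(A, axis=1).max())                # ||A||_{2,infty}
```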
In addition, we adopt the following convenient notations from Wang (2019) to make probabilistic statements compact. (In that reference, $O_{\mathbb{P}}(\cdot\,;\cdot)$ and $o_{\mathbb{P}}(\cdot\,;\cdot)$ appear as $\hat{O}_{\mathbb{P}}(\cdot\,;\cdot)$ and $\hat{o}_{\mathbb{P}}(\cdot\,;\cdot)$. For simplicity we drop their hats in this paper.)

Definition 1.1. Let $\{X_n\}_{n=1}^\infty$, $\{Y_n\}_{n=1}^\infty$ be two sequences of random variables and $\{r_n\}_{n=1}^\infty \subseteq (0, +\infty)$ be deterministic. We write $X_n = O_{\mathbb{P}}(Y_n; r_n)$ if there exists a constant $C_1 > 0$ such that $\forall C_2 > 0$, $\exists C' > 0$ and $N > 0$, s.t.
$$\mathbb{P}(|X_n| \ge C'|Y_n|) \le C_1 e^{-C_2 r_n}, \qquad \forall n \ge N.$$
We write $X_n = o_{\mathbb{P}}(Y_n; r_n)$ if $X_n = O_{\mathbb{P}}(w_n Y_n; r_n)$ holds for some deterministic sequence $\{w_n\}_{n=1}^\infty$ tending to zero.

Both the new notation $O_{\mathbb{P}}(\cdot\,;\cdot)$ and the conventional one $O_{\mathbb{P}}(\cdot)$ help avoid dealing with tons of unspecified constants in operations. Moreover, the former is more informative as it controls the convergence rate of exceptional probabilities. This is particularly useful when we take union bounds over a growing number of random variables. If $\{Y_n\}_{n=1}^\infty$ are positive and deterministic, then $X_n = O_{\mathbb{P}}(Y_n; 1)$ is equivalent to $X_n = O_{\mathbb{P}}(Y_n)$. Similar facts hold for $o_{\mathbb{P}}(\cdot\,;\cdot)$ as well.

Main results
Consider the signal-plus-noise model
$$x_i = \bar{x}_i + z_i \in \mathbb{R}^d, \qquad i \in [n]. \tag{2.1}$$
For simplicity, we assume that the signals $\{\bar{x}_i\}_{i=1}^n$ are deterministic and the noises $\{z_i\}_{i=1}^n$ are the only source of randomness. The results readily extend to the case where the signals are random but independent of the noises.

Define the hollowed Gram matrix $G \in \mathbb{R}^{n\times n}$ of the samples $\{x_i\}_{i=1}^n$ through $G_{ij} = \langle x_i, x_j\rangle\mathbf{1}_{\{i\ne j\}}$, and the Gram matrix $\bar{G} \in \mathbb{R}^{n\times n}$ of the signals $\{\bar{x}_i\}_{i=1}^n$ through $\bar{G}_{ij} = \langle\bar{x}_i, \bar{x}_j\rangle$. Denote the eigenvalues of $G$ by $\lambda_1 \ge \cdots \ge \lambda_n$ and their associated eigenvectors by $\{u_j\}_{j=1}^n$. Similarly, we define the eigenvalues $\bar\lambda_1 \ge \cdots \ge \bar\lambda_n$ and eigenvectors $\{\bar{u}_j\}_{j=1}^n$ of $\bar{G}$. Since $\bar{G} = \bar{X}\bar{X}^\top \succeq 0$, we have $\bar\lambda_j \ge 0$ for all $j \in [n]$. By convention, $\lambda_0 = \bar\lambda_0 = +\infty$ and $\lambda_{n+1} = \bar\lambda_{n+1} = -\infty$. Some groups of eigenvectors may only be defined up to orthonormal transforms as we allow for multiplicity of eigenvalues.

Let $s$ and $r$ be two integers in $[n]$ satisfying $0 \le s \le n - r$. Define $U = (u_{s+1}, \cdots, u_{s+r})$, $\bar{U} = (\bar{u}_{s+1}, \cdots, \bar{u}_{s+r})$, $\Lambda = \mathrm{diag}(\lambda_{s+1}, \cdots, \lambda_{s+r})$ and $\bar\Lambda = \mathrm{diag}(\bar\lambda_{s+1}, \cdots, \bar\lambda_{s+r})$. In order to study how $U$ relates to $\bar{U}$, we adopt the standard notion of eigen-gap (Davis and Kahan, 1970):
$$\bar\Delta = \min\{\bar\lambda_s - \bar\lambda_{s+1},\ \bar\lambda_{s+r} - \bar\lambda_{s+r+1}\}. \tag{2.2}$$
This is the separation between the set of target eigenvalues $\{\bar\lambda_{s+j}\}_{j=1}^r$ and the rest, reflecting the signal strength. Define $\kappa = \bar\lambda_1/\bar\Delta$, which plays the role of a condition number. Most importantly, we use a parameter $\gamma$ to characterize the signal-to-noise ratio, consider the asymptotic setting $n \to \infty$ and impose the following regularity assumptions.

Assumption 2.1 (Incoherence). As $n \to \infty$ we have $\kappa\mu\sqrt{r/n} \le \gamma \ll 1/(\kappa\mu)$, where
$$\mu = \max\bigg\{\frac{\|\bar{X}\|_{2,\infty}}{\|\bar{X}\|_2}\sqrt{\frac{n}{r}},\ 1\bigg\}.$$

Assumption 2.2 (Sub-Gaussianity). $\{z_i\}_{i=1}^n$ are independent, zero-mean random vectors in $\mathbb{R}^d$. There exist a constant $\alpha > 0$ and $\Sigma \succeq 0$ such that $\mathbb{E}e^{\langle u, z_i\rangle} \le e^{\alpha^2\langle\Sigma u, u\rangle/2}$ holds for all $u \in \mathbb{R}^d$ and $i \in [n]$.

Assumption 2.3 (Concentration). $\sqrt{n}\max\{(\kappa\|\Sigma\|_2/\bar\Delta)^{1/2},\ \|\Sigma\|_F/\bar\Delta\} \le \gamma$, where $\Sigma$ is as in Assumption 2.2.

By construction, $\bar{X} = (\bar{x}_1, \cdots, \bar{x}_n)^\top$ and $\|\bar{X}\|_{2,\infty} = \max_{i\in[n]}\|\bar{x}_i\|_2$. Assumption 2.1 regulates the magnitudes of $\{\|\bar{x}_i\|_2\}_{i=1}^n$, and it naturally holds under various mixture models. The incoherence parameter $\mu$ is similar to the usual definition (Candès and Recht, 2009) except for the facts that $\bar{X}$ does not have orthonormal columns and $r$ is not its rank. Assumption 2.2 is a standard one on sub-Gaussianity (Koltchinskii and Lounici, 2014). Here $\{z_i\}_{i=1}^n$ are independent but may not have identical distributions, which allows for heteroscedasticity. Assumption 2.3 governs the concentration of $G$ around its population version $\bar{G}$. To gain some intuition, we define $Z = (z_1, \cdots, z_n)^\top \in \mathbb{R}^{n\times d}$ and observe that
$$G = \mathcal{H}[(\bar{X} + Z)(\bar{X} + Z)^\top] = \mathcal{H}(\bar{X}\bar{X}^\top) + \mathcal{H}(\bar{X}Z^\top + Z\bar{X}^\top) + \mathcal{H}(ZZ^\top) = \bar{X}\bar{X}^\top + (\bar{X}Z^\top + Z\bar{X}^\top) + \mathcal{H}(ZZ^\top) - \bar{D},$$
where $\bar{D}$ is the diagonal part of $\bar{X}\bar{X}^\top + \bar{X}Z^\top + Z\bar{X}^\top$. Hence
$$\|G - \bar{G}\|_2 \le \|\bar{X}Z^\top + Z\bar{X}^\top\|_2 + \|\mathcal{H}(ZZ^\top)\|_2 + \max_{i\in[n]}|(\bar{X}\bar{X}^\top + \bar{X}Z^\top + Z\bar{X}^\top)_{ii}|.$$
The individual terms above are easy to work with. For instance, we may control $\|\mathcal{H}(ZZ^\top)\|_2$ using concentration bounds for random quadratic forms such as Hanson-Wright-type inequalities (Chen and Yang, 2018). The spectral norm and the Frobenius norm of $\Sigma$ collectively characterize the effective dimension of the noise distribution. That is the reason why Assumption 2.3 is formulated as it is. It turns out that Assumptions 2.1, 2.2 and 2.3 lead to a matrix concentration bound $\|G - \bar{G}\|_2 = O_{\mathbb{P}}(\gamma\bar\Delta; n)$, paving the way for the eigenvector analysis.

$\ell_{2,p}$ analysis of eigenspaces

Note that $\{u_{s+j}\}_{j=1}^r$ and $\{\bar{u}_{s+j}\}_{j=1}^r$ are only identifiable up to sign flips, and things become even more complicated if some eigenvalues are identical. To that end, we need to align $U$ with $\bar{U}$ using a certain orthonormal transform. Define $H = U^\top\bar{U} \in \mathbb{R}^{r\times r}$ and let $\tilde{U}\tilde\Lambda\tilde{V}^\top$ denote its singular value decomposition, where $\tilde{U}, \tilde{V} \in \mathbb{O}^{r\times r}$ and $\tilde\Lambda \in \mathbb{R}^{r\times r}$ is diagonal with nonnegative entries. The orthonormal matrix $\mathrm{sgn}(H) = \tilde{U}\tilde{V}^\top$ is the best rotation matrix that aligns $U$ with $\bar{U}$ and will play an important role throughout our analysis. Here $\mathrm{sgn}(\cdot)$ refers to the matrix sign function (Gross, 2011). In addition, define $Z = (z_1, \cdots, z_n)^\top \in \mathbb{R}^{n\times d}$ as the noise matrix. Recall that for $A \in \mathbb{R}^{n\times r}$ with row vectors $\{A_i\}_{i=1}^n$, the entrywise matrix norm is
$$\|A\|_{2,p} = \bigg(\sum_{i=1}^n \|A_i\|_2^p\bigg)^{1/p}.$$
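The matrix sign function used for alignment is a standard orthogonal-Procrustes step. A small sketch (ours) computes $\mathrm{sgn}(H)$ and verifies that it undoes an arbitrary rotation of the column space.

```python
import numpy as np

def matrix_sign(U, U_bar):
    """sgn(H) for H = U^T U_bar: the orthonormal factor of H aligning U with U_bar."""
    W, _, Vt = np.linalg.svd(U.T @ U_bar)
    return W @ Vt

# Example: U spans the same subspace as U_bar, with rotated/flipped columns.
rng = np.random.default_rng(2)
U_bar, _ = np.linalg.qr(rng.standard_normal((50, 3)))   # orthonormal columns
Q, _ = np.linalg.qr(rng.standard_normal((3, 3)))        # arbitrary orthogonal matrix
U = U_bar @ Q.T
S = matrix_sign(U, U_bar)
print(np.linalg.norm(U @ S - U_bar))                    # ~0: alignment succeeds
```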
Theorem 2.1. Suppose that Assumptions 2.1, 2.2 and 2.3 hold. As long as $1 \le p \lesssim (\mu\gamma)^{-2}$, we have
$$\|U\mathrm{sgn}(H) - G\bar{U}\bar\Lambda^{-1}\|_{2,p} = o_{\mathbb{P}}(\|\bar{U}\|_{2,p};\ p),$$
$$\|U\mathrm{sgn}(H) - [\bar{U} + \mathcal{H}(ZX^\top)\bar{U}\bar\Lambda^{-1}]\|_{2,p} = o_{\mathbb{P}}(\|\bar{U}\|_{2,p};\ p),$$
$$\|U\mathrm{sgn}(H)\|_{2,p} = O_{\mathbb{P}}(\|\bar{U}\|_{2,p};\ p).$$
In addition, if $\kappa^{3/2}\gamma \ll 1$, then
$$\|U\Lambda^{1/2}\mathrm{sgn}(H) - G\bar{U}\bar\Lambda^{-1/2}\|_{2,p} = o_{\mathbb{P}}(\|\bar{U}\|_{2,p}\|\bar\Lambda^{1/2}\|_2;\ p).$$

While $U$ is a highly nonlinear function of $G$, the first equation in Theorem 2.1 shows that it can be well-approximated by the linear form $G\bar{U}\bar\Lambda^{-1}$ up to an orthonormal transform. This can be understood from the hand-waving deduction:
$$U = GU\Lambda^{-1} \approx G\bar{U}\bar\Lambda^{-1}.$$
The second equation in Theorem 2.1 talks about the difference between $U$ and its population version $\bar{U}$. Ignoring the orthonormal transform $\mathrm{sgn}(H)$, we have that for a large fraction of $m \in [n]$, the following entrywise approximation holds:
$$U_m \approx [\bar{U} + \mathcal{H}(ZX^\top)\bar{U}\bar\Lambda^{-1}]_m = \bar{U}_m + \Big\langle z_m,\ \sum_{j\ne m} x_j\bar{U}_j\bar\Lambda^{-1}\Big\rangle. \tag{2.3}$$
If we keep $\{x_j\}_{j\ne m}$ fixed, then the spectral embedding $U_m$ of the $m$-th sample is roughly linear in $z_m$, or equivalently in $x_m$ itself. This relation is crucial for our analysis of spectral clustering algorithms. The third equation in Theorem 2.1 relates the delocalization property of $U$ to that of $\bar{U}$, showing that the mass of $U$ is spread out across its rows as long as $\bar{U}$ behaves in a similar way.

Many spectral methods use the rows of $U \in \mathbb{R}^{n\times r}$ to embed the samples $\{x_i\}_{i=1}^n \subseteq \mathbb{R}^d$ into $\mathbb{R}^r$ (Shi and Malik, 2000; Ng et al., 2002) and perform downstream tasks. By precisely characterizing the embedding, the first three equations in Theorem 2.1 facilitate their analysis under statistical models. We will see several examples in Section 3. In PCA, however, the embedding is defined by PC scores. Recall that the PCs are eigenvectors of the covariance matrix $\frac{1}{n}X^\top X \in \mathbb{R}^{d\times d}$ and PC scores are derived by projecting the data onto them. Therefore, the PC scores in our setting correspond to the rows of $U\Lambda^{1/2}$ rather than $U$. The last equation in Theorem 2.1 studies their behavior.

Theorem 2.1 is written so as to be easily applicable. It forms the basis of our applications in Section 3. General results under relaxed conditions are given by Theorem B.1.

Let us now gain some intuition about the $\ell_{2,p}$ error metric. For large $p$, $\|A\|_{2,p}$ is small if a vast majority of the rows have small $\ell_2$ norms, though there could be a few rows that are large. Roughly speaking, the number of those outliers is exponentially small in $p$. We illustrate this using a toy example with $r = 1$, i.e., $A = x \in \mathbb{R}^n$ is a vector and $\|\cdot\|_{2,p} = \|\cdot\|_p$. If $\|x\|_p \le \varepsilon\|\mathbf{1}_n\|_p$ for some $\varepsilon > 0$, then Markov's inequality yields
$$\frac{1}{n}|\{i:\ |x_i| > t\varepsilon\}| \le \frac{n^{-1}\|x\|_p^p}{(t\varepsilon)^p} \le \frac{n^{-1}\varepsilon^p\|\mathbf{1}_n\|_p^p}{(t\varepsilon)^p} = t^{-p}, \qquad \forall t > 0.$$
Thus, larger $p$ implies stronger bounds. In particular, the following easily-verified fact states that when $p \gtrsim \log n$, an upper bound in $\ell_{2,p}$ yields one in $\ell_{2,\infty}$, controlling all the row-wise errors simultaneously.

Fact 2.1. $\|x\|_\infty \le \|x\|_{c\log n} \le e^{1/c}\|x\|_\infty$ for any $n \in \mathbb{Z}_+$, $x \in \mathbb{R}^n$ and $c > 0$.
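For completeness, Fact 2.1 can be verified in one line: $\|x\|_\infty \le \|x\|_p$ holds for every $p \in [1, \infty)$, and
$$\|x\|_{c\log n}^{c\log n} = \sum_{i=1}^n |x_i|^{c\log n} \le n\|x\|_\infty^{c\log n} \quad\Longrightarrow\quad \|x\|_{c\log n} \le n^{\frac{1}{c\log n}}\|x\|_\infty = e^{1/c}\|x\|_\infty.$$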
As a corollary, we get $\ell_{2,\infty}$ approximation bounds for the eigenvectors when the signal is strong enough for us to take $p \gtrsim \log n$.

Corollary 2.1. Suppose that Assumptions 2.1, 2.2 and 2.3 hold. As long as $\mu\gamma \lesssim 1/\sqrt{\log n}$, we have
$$\|U\mathrm{sgn}(H) - G\bar{U}\bar\Lambda^{-1}\|_{2,\infty} = o_{\mathbb{P}}(\|\bar{U}\|_{2,\infty};\ \log n),$$
$$\|U\mathrm{sgn}(H) - [\bar{U} + \mathcal{H}(ZX^\top)\bar{U}\bar\Lambda^{-1}]\|_{2,\infty} = o_{\mathbb{P}}(\|\bar{U}\|_{2,\infty};\ \log n),$$
$$\|U\mathrm{sgn}(H)\|_{2,\infty} = O_{\mathbb{P}}(\|\bar{U}\|_{2,\infty};\ \infty).$$
In addition, if $\kappa^{3/2}\gamma \ll 1$, then
$$\|U\Lambda^{1/2}\mathrm{sgn}(H) - G\bar{U}\bar\Lambda^{-1/2}\|_{2,\infty} = o_{\mathbb{P}}(\|\bar{U}\|_{2,\infty}\|\bar\Lambda^{1/2}\|_2;\ \log n).$$

However, $p$ cannot be arbitrarily large in general. When the signal is weak, we may no longer be able to obtain uniform control of errors as above and should allow for exceptions. The $p$ in Theorem 2.1 may grow as fast as $(\mu\gamma)^{-2}$, which is a measure of the signal strength. This makes the results adaptive.

Kernel PCA and Hilbert spaces

Since $G \in \mathbb{R}^{n\times n}$ is constructed purely based on pairwise inner products of samples, the whole procedure can be extended to kernel settings. Here we briefly discuss kernel PCA (Schölkopf et al., 1997). Suppose that $\{x_i\}_{i=1}^n$ are samples from some space $\mathcal{X}$ and $K(\cdot,\cdot): \mathcal{X}\times\mathcal{X} \to \mathbb{R}$ is a symmetric and positive semi-definite kernel, i.e. for any $m \in \mathbb{Z}_+$ and $\{w_i\}_{i=1}^m \subseteq \mathcal{X}$, the matrix $M \in \mathbb{R}^{m\times m}$ with $M_{ij} = K(w_i, w_j)$ is always positive semi-definite. Kernel PCA is PCA based on a new Gram matrix $K \in \mathbb{R}^{n\times n}$ with $K_{ij} = K(x_i, x_j)$. PCA is a special case of kernel PCA with $\mathcal{X} = \mathbb{R}^d$ and $K(x, y) = x^\top y$. Commonly-used nonlinear kernels include the Gaussian kernel $K(x, y) = e^{-\eta\|x-y\|_2^2}$ with $\eta > 0$. They offer flexible nonlinear embedding techniques which have achieved great success in machine learning (Cristianini and Shawe-Taylor, 2000).

According to the Moore-Aronszajn Theorem (Aronszajn, 1950), there exists a reproducing kernel Hilbert space $\mathcal{H}$ with inner product $\langle\cdot,\cdot\rangle$ and a function $\phi: \mathcal{X} \to \mathcal{H}$ such that $K(x, y) = \langle\phi(x), \phi(y)\rangle$ for any $x, y \in \mathcal{X}$. Hence, kernel PCA of $\{x_i\}_{i=1}^n \subseteq \mathcal{X}$ is de facto PCA of the transformed data $\{\phi(x_i)\}_{i=1}^n \subseteq \mathcal{H}$. The transform $\phi$ can be rather complicated since $\mathcal{H}$ has infinite dimensions in general. Fortunately, the inner products $\{\langle\phi(x_i), \phi(x_j)\rangle\}$ in $\mathcal{H}$ can be conveniently computed in the original space $\mathcal{X}$.

Motivated by kernel PCA, we extend the basic setup to Hilbert spaces. Let $\mathcal{H}$ be a real separable Hilbert space with inner product $\langle\cdot,\cdot\rangle$, norm $\|\cdot\|$, and some orthonormal bases $\{h_j\}$.

Definition 2.1 (Basics of Hilbert spaces). A linear operator $A: \mathcal{H} \to \mathcal{H}$ is said to be bounded if its operator norm $\|A\|_{\mathrm{op}} = \sup_{\|u\|=1}\|Au\|$ is finite. Define $\mathcal{L}(\mathcal{H})$ as the collection of all bounded linear operators over $\mathcal{H}$. For any $A \in \mathcal{L}(\mathcal{H})$, we use $A^*$ to refer to its adjoint operator and let $\mathrm{Tr}(A) = \sum_j \langle Ah_j, h_j\rangle$. Define
$$\mathcal{S}_+(\mathcal{H}) = \{A \in \mathcal{L}(\mathcal{H}):\ A = A^*,\ \langle Ax, x\rangle \ge 0\ \forall x \in \mathcal{H}\ \text{and}\ \mathrm{Tr}(A) < \infty\}.$$
Any $A \in \mathcal{S}_+(\mathcal{H})$ is said to be positive semi-definite. We use $\|A\|_{\mathrm{HS}} = \sqrt{\mathrm{Tr}(A^*A)} = (\sum_j \|Ah_j\|^2)^{1/2}$ to refer to its Hilbert-Schmidt norm, and define $A^{1/2} \in \mathcal{S}_+(\mathcal{H})$ as the unique operator such that $A^{1/2}A^{1/2} = A$.

Remark 2.1. When $\mathcal{H} = \mathbb{R}^d$, we have $\mathcal{L}(\mathcal{H}) = \mathbb{R}^{d\times d}$, $\mathrm{Tr}(A) = \sum_{i=1}^d A_{ii}$, $\|\cdot\|_{\mathrm{op}} = \|\cdot\|_2$ and $\|\cdot\|_{\mathrm{HS}} = \|\cdot\|_F$. Further, $\mathcal{S}_+(\mathcal{H})$ consists of all $d\times d$ positive semi-definite matrices.

We now generalize model (2.1) to the following one in $\mathcal{H}$:
$$x_i = \bar{x}_i + z_i \in \mathcal{H}, \qquad i \in [n]. \tag{2.4}$$
When $\mathcal{H} = \mathbb{R}^d$, the data matrix $X = (x_1, \cdots, x_n)^\top \in \mathbb{R}^{n\times d}$ corresponds to a linear transform from $\mathbb{R}^d$ to $\mathbb{R}^n$. For a general $\mathcal{H}$, we can always define $X$ as a bounded linear operator from $\mathcal{H}$ to $\mathbb{R}^n$ through its action $h \mapsto (\langle x_1, h\rangle, \cdots, \langle x_n, h\rangle)^\top$. With slight abuse of notation, we formally write $X = (x_1, \cdots, x_n)^\top$, use $\|X\|_{\mathrm{op}}$ to refer to its norm, let $\|X\|_{2,\infty} = \max_{i\in[n]}\|x_i\|$, and do the same for $\bar{X}$ and $Z$. We generalize Assumptions 2.1, 2.2 and 2.3 accordingly.

Assumption 2.4 (Incoherence). As $n \to \infty$ we have $\kappa\mu\sqrt{r/n} \le \gamma \ll 1/(\kappa\mu)$, where
$$\mu = \max\bigg\{\frac{\|\bar{X}\|_{2,\infty}}{\|\bar{X}\|_{\mathrm{op}}}\sqrt{\frac{n}{r}},\ 1\bigg\}.$$

Assumption 2.5 (Sub-Gaussianity). $\{z_i\}_{i=1}^n$ are independent, zero-mean random vectors in $\mathcal{H}$. There exist a constant $\alpha > 0$ and an operator $\Sigma \in \mathcal{S}_+(\mathcal{H})$ such that $\mathbb{E}e^{\langle u, z_i\rangle} \le e^{\alpha^2\langle\Sigma u, u\rangle/2}$ holds for all $u \in \mathcal{H}$ and $i \in [n]$.

Assumption 2.6 (Concentration). $\sqrt{n}\max\{(\kappa\|\Sigma\|_{\mathrm{op}}/\bar\Delta)^{1/2},\ \|\Sigma\|_{\mathrm{HS}}/\bar\Delta\} \le \gamma$.

Again, Assumption 2.4 on incoherence holds for various mixture models. Assumption 2.5 appears frequently in the study of sub-Gaussianity in Hilbert spaces (Koltchinskii and Lounici, 2014). For kernel PCA, Assumption 2.5 automatically holds when the kernel is bounded, i.e. $K(x, x) \le C$ for some constant $C$. Assumption 2.6 naturally arises in the study of Gram matrices and quadratic forms in Hilbert spaces (Chen and Yang, 2018). The same results as in Theorem 2.1 continue to hold under Assumptions 2.4, 2.5 and 2.6. The proof is in Appendix C.1.
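Since only pairwise evaluations $K(x_i, x_j)$ are needed, a hollowed kernel PCA takes just a few lines. The sketch below (ours; Gaussian kernel on hypothetical synthetic mixture data) mirrors the construction of $G$ with the kernel matrix in place of $XX^\top$.

```python
import numpy as np

def gaussian_kernel(X, eta):
    """Kernel matrix K_ij = exp(-eta * ||x_i - x_j||^2) for the rows of X."""
    sq = np.sum(X ** 2, axis=1)
    dist2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    return np.exp(-eta * np.maximum(dist2, 0.0))

rng = np.random.default_rng(3)
n, d = 200, 20
y = rng.choice([-1.0, 1.0], size=n)
mu = 2.5 / np.sqrt(d) * np.ones(d)
X = np.outer(y, mu) + rng.standard_normal((n, d))

K = gaussian_kernel(X, eta=1.0 / d)
np.fill_diagonal(K, 0.0)                 # hollowing, exactly as for G = H(XX^T)
eigvals, eigvecs = np.linalg.eigh(K)
U = eigvecs[:, -2:]                      # top-2: roughly a near-constant direction
                                         # plus a direction aligned with the labels
acc = max(np.mean(np.sign(s * U[:, j]) == y)
          for j in range(2) for s in (-1.0, 1.0))
print("best label accuracy among top-2 eigenvectors:", acc)
```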
Clustering under mixture models

Sub-Gaussian and Gaussian mixture models serve as testbeds for clustering algorithms. Maximum likelihood estimation requires well-specified models and often involves non-convex or combinatorial optimization problems that are hard to solve. The recent years have seen a boom in the study of efficient approaches. Lloyd's algorithm (Lloyd, 1982) with good initialization and its variants are analyzed under certain separation conditions (Kumar and Kannan, 2010; Lu and Zhou, 2016; Ndaoud, 2018; Gao and Zhang, 2019). Semi-definite programming (SDP) yields reliable results in more general scenarios (Awasthi et al., 2015; Mixon et al., 2017; Royer, 2017; Fei and Chen, 2018; Giraud and Verzelen, 2018; Chen and Yang, 2018; Chen and Yang, 2020). Spectral methods are more efficient in terms of computation and have attracted much attention (Vempala and Wang, 2004; Cai and Zhang, 2018; Löffler et al., 2019; Srivastava et al., 2019). However, much less is known about spectral methods compared with SDP.

We apply the $\ell_p$ eigenvector analysis to spectral clustering under a popular sub-Gaussian mixture model in a Hilbert space $\mathcal{H}$. Suppose that
$$x_i = y_i\mu + z_i \in \mathcal{H}, \qquad i \in [n], \tag{3.1}$$
where $\{y_i\}_{i=1}^n \subseteq \{\pm 1\}$ are labels, $\mu \in \mathcal{H}$, and $\{z_i\}_{i=1}^n \subseteq \mathcal{H}$ are sub-Gaussian noise vectors satisfying Assumption 2.5. For simplicity, we assume that $\{y_i\}_{i=1}^n$ and $\mu$ are deterministic. Through a conditioning argument, the results extend to the case where they are independent of $\{z_i\}_{i=1}^n$. The model (3.1) is a natural model for a centered sub-Gaussian mixture with two equally-sized classes. Heteroscedasticity is allowed, as Assumption 2.5 only requires the covariance operators of $\{z_i\}_{i=1}^n$ to be uniformly bounded by some $\Sigma$; they are allowed to vary across $i$. The goal of clustering is to estimate $\{y_i\}_{i=1}^n$ based solely on $\{x_i\}_{i=1}^n$.

Under this model, we have $\bar{x}_i = y_i\mu$ and $\bar{X} = (\bar{x}_1, \cdots, \bar{x}_n)^\top = y\mu^\top$. For the population Gram matrix $\bar{G} = \bar{X}\bar{X}^\top = \|\mu\|^2 yy^\top$, the leading eigenvalue is $\bar\lambda_1 = n\|\mu\|^2$ and the leading eigenvector $\bar{u}_1 = y/\sqrt{n}$ perfectly reveals the class labels. As a result, $\mathrm{sgn}(u_1)$ is a natural estimator for the label vector $y$, where the sign function $\mathrm{sgn}$ is applied to all entries of $u_1$. This is a specific case of the spectral clustering algorithm.

To state our results for $\mathrm{sgn}(u_1)$ in a clean way, we define the signal-to-noise ratio
$$\mathrm{SNR} = \frac{\|\mu\|^2}{\|\Sigma\|_{\mathrm{op}}} \wedge \frac{n\|\mu\|^4}{\|\Sigma\|_{\mathrm{HS}}^2} \tag{3.2}$$
and the proportion of mismatch
$$\mathcal{M}(\hat{y}, y) = \min\bigg\{\frac{1}{n}\sum_{i=1}^n\mathbf{1}_{\{\hat{y}_i\ne y_i\}},\ \frac{1}{n}\sum_{i=1}^n\mathbf{1}_{\{-\hat{y}_i\ne y_i\}}\bigg\}, \qquad \forall\hat{y}, y \in \{\pm 1\}^n. \tag{3.3}$$

Theorem 3.1 (Error rate of spectral clustering). Under the model (3.1), there exist constants $C > 0$ and $c > 0$ such that the following hold:
1. If $\mathrm{SNR} > C\log n$, then $\lim_{n\to\infty}\mathbb{P}[\mathcal{M}(\mathrm{sgn}(u_1), y) = 0] = 1$;
2. If $1 \ll \mathrm{SNR} \le C\log n$, then $\limsup_{n\to\infty}\mathrm{SNR}^{-1}\log\mathbb{E}\mathcal{M}(\mathrm{sgn}(u_1), y) < -c$.

The proof is in Appendix D.2. Theorem 3.1 asserts that the spectral estimator $\mathrm{sgn}(u_1)$ exactly recovers all the labels with high probability when SNR exceeds some constant multiple of $\log n$. When SNR is not that large but still diverges, we have an exponential bound $e^{-\Omega(\mathrm{SNR})}$ for the misclassification rate.
It is worth mentioning that $\mathrm{SNR} \gg 1$ is necessary for achieving a vanishing misclassification rate in the isotropic Gaussian case (Cai and Zhang, 2018; Ndaoud, 2018). Hence Theorem 3.1 covers the whole regime that makes vanishing error possible. For the sub-Gaussian mixture model (3.1), these results are the best available in the literature and have only been established for SDP under sub-Gaussian mixture models in Euclidean spaces (Giraud and Verzelen, 2018) and Hilbert spaces (Chen and Yang, 2018). They are optimal up to the constants $C$ and $c$ in the isotropic case (Giraud and Verzelen, 2018; Ndaoud, 2018). While unspecified constants are inevitable due to the generality of sub-Gaussianity, we provide finer results for the Gaussian case in Section 3.2.

The characterization (3.2) of the signal-to-noise ratio was first proposed by Giraud and Verzelen (2018) and can be rewritten as
$$\mathrm{SNR} = \frac{\|\mu\|^2}{\|\Sigma\|_{\mathrm{op}}}\bigg[1 \wedge \bigg(\frac{\|\mu\|^2}{\|\Sigma\|_{\mathrm{op}}}\cdot\frac{n}{r(\Sigma)}\bigg)\bigg], \tag{3.4}$$
where $r(\Sigma) = \|\Sigma\|_{\mathrm{HS}}^2/\|\Sigma\|_{\mathrm{op}}^2$ captures the effective rank of $\Sigma$. In the isotropic case with $\Sigma = I$ and $\mathcal{H} = \mathbb{R}^d$, we have $r(\Sigma) = d$. The SNR differs from the classical notion of signal-to-noise ratio $\|\mu\|^2/\|\Sigma\|_{\mathrm{op}}$ frequently used to quantify the misclassification rates (Lu and Zhou, 2016; Fei and Chen, 2018; Löffler et al., 2019; Srivastava et al., 2019; Gao and Zhang, 2019). Those results hinge on an extra assumption
$$\frac{\|\mu\|^2}{\|\Sigma\|_{\mathrm{op}}} \gg \max\bigg\{1,\ \frac{r(\Sigma)}{n}\bigg\}, \tag{3.5}$$
or the one with $\gg$ replaced by $\gtrsim$. Under such an assumption, SNR in (3.4) is equivalent to $\|\mu\|^2/\|\Sigma\|_{\mathrm{op}}$. However, our assumption $\mathrm{SNR} \gg 1$ translates to
$$\frac{\|\mu\|^2}{\|\Sigma\|_{\mathrm{op}}} \gg \max\bigg\{1,\ \sqrt{\frac{r(\Sigma)}{n}}\bigg\}. \tag{3.6}$$
It is much weaker when the noise has high effective dimension $r(\Sigma) \gg n$. See Giraud and Verzelen (2018) for more discussions.

Gaussian mixture model

The symmetries and other structural properties of Gaussian mixture models allow for more precise characterizations compared to the above. While a main focus of interest is parameter estimation by likelihood-based methods (Dempster et al., 1977) and methods of moments (Pearson, 1894), the problem of clustering is less explored. Recently there has been a surge of interest in sharp statistical guarantees, mostly under the isotropic Gaussian mixture model (Lu and Zhou, 2016; Cai and Zhang, 2018; Ndaoud, 2018; Löffler et al., 2019; Chen and Yang, 2020). In another line of study, sparsity assumptions are adopted for high-dimensional Gaussian mixtures (Azizyan et al., 2013; Jin and Wang, 2016). In this subsection, we study the optimality of the spectral estimator $\mathrm{sgn}(u_1)$ under the following model.

Definition 3.1 (Gaussian mixture model). For $y \in \{\pm 1\}^n$ and $\mu \in \mathbb{R}^d$ with $n, d \ge 1$, we write $\{x_i\}_{i=1}^n \sim \mathrm{GMM}(\mu, y)$ if
$$x_i = y_i\mu + z_i \in \mathbb{R}^d, \qquad i \in [n],$$
and $\{z_i\}_{i=1}^n \subseteq \mathbb{R}^d$ are i.i.d. $N(0, I_d)$ vectors.

This is a special case of the sub-Gaussian mixture model (3.1). Taking $\Sigma = I_d$, we get $\|\Sigma\|_{\mathrm{op}} = 1$ and $\|\Sigma\|_{\mathrm{HS}} = \sqrt{d}$. The signal-to-noise ratio in (3.2) is then $\|\mu\|_2^2 \wedge (n\|\mu\|_2^4/d)$. Here we redefine it as
$$\mathrm{SNR} = \frac{\|\mu\|_2^4}{\|\mu\|_2^2 + d/n}. \tag{3.7}$$
It has the same order as the previous one and facilitates presentation. We keep using the mismatch proportion $\mathcal{M}$ in (3.3).
Theorem 3.2. Let $\{x_i\}_{i=1}^n \sim \mathrm{GMM}(\mu, y)$ and $n \to \infty$.
1. If $\mathrm{SNR} > (2 + \varepsilon)\log n$ for some constant $\varepsilon > 0$, then $\lim_{n\to\infty}\mathbb{P}[\mathcal{M}(\mathrm{sgn}(u_1), y) = 0] = 1$;
2. If $1 \ll \mathrm{SNR} \le 2\log n$, then $\limsup_{n\to\infty}\mathrm{SNR}^{-1}\log\mathbb{E}\mathcal{M}(\mathrm{sgn}(u_1), y) \le -1/2$.

Theorem 3.2 characterizes the spectral estimator with explicit constants. When SNR exceeds $2\log n$, $\mathrm{sgn}(u_1)$ exactly recovers all the labels (up to a global sign flip) with high probability. When $1 \ll \mathrm{SNR} \le 2\log n$, the misclassification rate is bounded from above by $e^{-\mathrm{SNR}/[2+o(1)]}$. According to Ndaoud (2018), both results are optimal in the minimax sense. The proof of Theorem 3.2 is in Appendix E.2.

Cai and Zhang (2018) prove that $\mathrm{SNR} \to \infty$ is necessary for any estimator to achieve a vanishingly small misclassification rate and derive an upper bound $\mathbb{E}\mathcal{M}(\mathrm{sgn}(\tilde{u}_1), y) \lesssim 1/\mathrm{SNR}$ for $\tilde{u}_1$ being the leading eigenvector of the unhollowed Gram matrix $XX^\top$. Ndaoud (2018) obtains exact recovery guarantees as well as an optimal exponential error bound for an iterative algorithm starting from $\mathrm{sgn}(u_1)$. Our analysis shows that the initial estimator is already good enough and no refinement is needed. Chen and Yang (2020) study the information threshold for exact recovery in the multi-class setting and use an SDP to achieve that.

The SNR in (3.7) precisely quantifies the signal-to-noise ratio for clustering and is always dominated by the classical one $\|\mu\|_2^2$. When $d \gg n$, the condition $\mathrm{SNR} \to \infty$ is equivalent to
$$\|\mu\|_2 \gg (d/n)^{1/4}. \tag{3.8}$$
This is weaker than the commonly-used assumption
$$\|\mu\|_2 \gg \sqrt{d/n} \tag{3.9}$$
for clustering (Lu and Zhou, 2016; Löffler et al., 2019), under which SNR is asymptotically equivalent to $\|\mu\|_2^2$. Their discrepancy reflects an interesting high-dimensional phenomenon. For the Gaussian mixture model in Definition 3.1, parameter estimation and clustering correspond to recovering $\mu \in \mathbb{R}^d$ and $y \in \{\pm 1\}^n$, respectively. A good estimate of $\mu$ yields one of $y$. Hence clustering should be easier than parameter estimation. The difference becomes more significant when $d \gg n$, as clustering targets fewer unknowns. To see this, we write $X = (x_1, \cdots, x_n)^\top \in \mathbb{R}^{n\times d}$ and observe that
$$X = y\mu^\top + Z,$$
where $Z = (z_1, \cdots, z_n)^\top \in \mathbb{R}^{n\times d}$ has i.i.d. $N(0,1)$ entries. Clustering and parameter estimation correspond to estimating the left and right singular vectors of the signal matrix $\mathbb{E}X$. According to the results of Cai and Zhang (2018) on singular subspace estimation, (3.8) and (3.9) are sharp conditions for consistent clustering and parameter estimation. They ensure concentration of the Gram matrix $XX^\top$ and the covariance matrix $\frac{1}{n}X^\top X$. When $(d/n)^{1/4} \ll \|\mu\|_2 \ll \sqrt{d/n}$, consistent clustering is possible even without consistent estimation of the model parameter $\mu$. Intuitively, there are many discriminative directions that can tell the classes apart, but they are not necessarily aligned with the direction of $\mu$.

Here we outline the proof of Theorem 3.2. When $\mathrm{SNR} \gg \log n$, the first part of Theorem 3.1 implies that $\mathbb{P}[\mathrm{sgn}(u_1) = \pm y] \to 1$. Hence it suffices to consider $1 \ll \mathrm{SNR} \lesssim \log n$. The following $\ell_p$ approximation result helps illustrate the main idea; its proof is deferred to Appendix E.3.
Theorem 3.3. Under the GMM in Definition 3.1 with $n \to \infty$ and $1 \ll \mathrm{SNR} \lesssim \log n$, there exist $\varepsilon_n \to 0$ and positive constants $C, N$ such that
$$\mathbb{P}\Big(\min_{s=\pm 1}\|s u_1 - G\bar{u}_1/\bar\lambda_1\|_{\mathrm{SNR}} < \varepsilon_n\|\bar{u}_1\|_{\mathrm{SNR}}\Big) > 1 - Ce^{-\mathrm{SNR}}, \qquad \forall n \ge N.$$

In a hand-waving way, the analysis right after (1.6) in the introduction suggests that the expected misclassification rate of $\mathrm{sgn}(u_1)$ differs from that of $\mathrm{sgn}(G\bar{u}_1/\bar\lambda_1)$ by at most $O(e^{-\mathrm{SNR}})$. Then, it boils down to studying $\mathrm{sgn}(G\bar{u}_1/\bar\lambda_1)$. Note that
$$(G\bar{u}_1/\bar\lambda_1)_i \propto (Gy)_i = \sum_{j=1}^n[\mathcal{H}(XX^\top)]_{ij}y_j = \sum_{j\ne i}\langle x_i, x_j\rangle y_j = (n-1)\langle x_i, \hat\mu^{(-i)}\rangle, \qquad \forall i \in [n].$$
Here $\hat\mu^{(-i)} = \frac{1}{n-1}\sum_{j\ne i}x_j y_j$ is an estimate of $\mu$ based on the samples $\{x_j\}_{j\ne i}$ and their labels $\{y_j\}_{j\ne i}$. It is straightforward to prove
$$\mathbb{E}\mathcal{M}(\mathrm{sgn}(G\bar{u}_1/\bar\lambda_1), y) = \frac{1}{n}\sum_{i=1}^n\mathbb{P}[\mathrm{sgn}(\langle x_i, \hat\mu^{(-i)}\rangle) \ne y_i] \le e^{-\mathrm{SNR}/[2+o(1)]}$$
and get the same bound for $\mathbb{E}\mathcal{M}(\mathrm{sgn}(u_1), y)$. When $\mathrm{SNR} > (2+\varepsilon)\log n$, this leads to an $n^{-(1+\varepsilon/2)}$ upper bound for the misclassification rate, which implies exact recovery with high probability, as any misclassified sample contributes $n^{-1}$ to the error rate. When $\mathrm{SNR} \le 2\log n$, we get the second part of Theorem 3.2. The proof is then finished.

The quantity $\mathrm{sgn}(\langle x_i, \hat\mu^{(-i)}\rangle)$ is the prediction of $y_i$ by linear discriminant analysis (LDA) given the features $\{x_i\}_{i=1}^n$ and the additional labels $\{y_j\}_{j\ne i}$. It resembles an oracle (or genie-aided) estimator that is usually linked to the fundamental limits of clustering (Abbe et al., 2016; Zhang and Zhou, 2016), and it plays an important role in our analysis as well. By connecting $u_1$ with $G\bar{u}_1/\bar\lambda_1$ and thus $\{\langle x_i, \hat\mu^{(-i)}\rangle\}_{i=1}^n$, Theorem 3.3 already hints at the optimality of $\mathrm{sgn}(u_1)$ for recovering $y$.

Perhaps surprisingly, both the (unsupervised) spectral clustering and the (supervised) LDA achieve the minimax optimal misclassification error $e^{-\mathrm{SNR}/[2+o(1)]}$. Here the missing labels do not hurt much. This phenomenon is also observed by Ndaoud (2018). On the other hand, the Bayes classifier $\mathrm{sgn}(\langle\mu, x\rangle)$ given the true parameter $\mu$ achieves the error rate $1 - \Phi(\|\mu\|_2)$, where $\Phi$ is the cumulative distribution function of $N(0,1)$. As $\|\mu\|_2 \to \infty$, this is $e^{-\|\mu\|_2^2/[2+o(1)]}$ and it is always superior to the minimax error without the knowledge of $\mu$. From there we get the following for spectral clustering and LDA.

- If $\|\mu\|_2 \gg \sqrt{d/n}$, then $\mathrm{SNR} = \|\mu\|_2^2[1 + o(1)]$ and both estimators achieve the Bayes error exponent;
- If $\|\mu\|_2 \le C\sqrt{d/n}$ for some constant $C > 0$, then $\mathrm{SNR} \le \|\mu\|_2^2/(1 + C^{-2})$ and both estimators achieve the minimax optimal exponent, which is worse than the Bayes error exponent.
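The proof strategy can be watched in action on synthetic data. The sketch below (ours; parameter values hypothetical) compares the fully unsupervised $\mathrm{sgn}(u_1)$ with the genie-aided LDA rule $\mathrm{sgn}(\langle x_i, \hat\mu^{(-i)}\rangle)$; their error rates are typically close, as Theorem 3.3 suggests.

```python
import numpy as np

rng = np.random.default_rng(4)
n, d, mu_norm = 400, 2000, 3.5          # SNR = ||mu||^4/(||mu||^2 + d/n) ~ 8.7
y = rng.choice([-1.0, 1.0], size=n)
mu = mu_norm / np.sqrt(d) * np.ones(d)
X = np.outer(y, mu) + rng.standard_normal((n, d))

# unsupervised: sign of the top eigenvector of the hollowed Gram matrix
G = X @ X.T
np.fill_diagonal(G, 0.0)
u1 = np.linalg.eigh(G)[1][:, -1]
err_spec = min(np.mean(np.sign(u1) != y), np.mean(np.sign(-u1) != y))

# genie-aided LDA: sgn(<x_i, mu_hat^(-i)>) with leave-one-out sample means;
# the positive factor 1/(n-1) does not affect the sign
total = X.T @ y                          # sum_j x_j y_j
scores = np.einsum("ij,ij->i", X, total[None, :] - X * y[:, None])
err_lda = np.mean(np.sign(scores) != y)

print(f"spectral clustering error: {err_spec:.3f}, genie-aided LDA error: {err_lda:.3f}")
```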
Contextual stochastic block model

Contextual network analysis concerns discovering interesting structures, such as communities, in a network with the help of node attributes. Large-scale applications call for computationally efficient procedures incorporating the information from both sources. For community detection in the contextual setting, various models and algorithms have been proposed and analyzed (Zhang et al., 2016; Weng and Feng, 2016; Binkiewicz et al., 2017; Ma and Ma, 2017; Deshpande et al., 2018; Mele et al., 2019; Yan and Sarkar, 2020). How to quantify the benefits of aggregation is a fundamental and challenging question. We study community detection under a canonical model for contextual network data and prove the optimality of a simple spectral method.

To begin with, we present a binary version of the stochastic block model (Holland et al., 1983) that plays a central role in statistical network analysis (Abbe, 2017). We use a label vector $y = (y_1, \cdots, y_n)^\top \in \{\pm 1\}^n$ to encode the block (community) memberships of the nodes. For any pair of nodes $i$ and $j$, we connect them with probability $\alpha$ if they are from the same block. Otherwise, the connection probability is $\beta$.

Definition 4.1 (Stochastic block model). For $n \in \mathbb{Z}_+$, $y \in \{\pm 1\}^n$ and $0 < \alpha, \beta < 1$, we write $A \sim \mathrm{SBM}(y, \alpha, \beta)$ if $A \in \{0,1\}^{n\times n}$ is symmetric, $A_{ii} = 0$ for all $i \in [n]$, and $\{A_{ij}\}_{1\le i<j\le n}$ are independent Bernoulli random variables with $\mathbb{P}(A_{ij} = 1) = \alpha$ if $y_i = y_j$ and $\mathbb{P}(A_{ij} = 1) = \beta$ if $y_i \ne y_j$.
The contextual stochastic block model couples a network drawn from the stochastic block model with Gaussian node attributes that share the same labels.

Definition 4.2 (Contextual stochastic block model). For $n, d \in \mathbb{Z}_+$, $0 < \alpha, \beta < 1$ and $R > 0$, we write $(y, \mu, A, \{x_i\}_{i=1}^n) \sim \mathrm{CSBM}(n, d, \alpha, \beta, R)$ if $y$ has i.i.d. Rademacher entries, $\mu \in \mathbb{R}^d$ satisfies $\|\mu\|_2^2 = R$, and conditionally on $(y, \mu)$ the network $A \sim \mathrm{SBM}(y, \alpha, \beta)$ and the attributes $\{x_i\}_{i=1}^n \sim \mathrm{GMM}(\mu, y)$ are independent.

Assumption 4.1 ($\log n$-regime). Let $a$, $b$ and $c$ be positive constants. $(y, \mu, A, \{x_i\}_{i=1}^n) \sim \mathrm{CSBM}(n, d, \alpha, \beta, R)$ with $q_n = \log n$, $\alpha = aq_n/n$, $\beta = bq_n/n$ and $R^2/(R + d/n) = cq_n$.

Assumption 4.2 (General regime). Let $a$, $b$ and $c$ be positive constants. $(y, \mu, A, \{x_i\}_{i=1}^n) \sim \mathrm{CSBM}(n, d, \alpha, \beta, R)$ with $1 \ll q_n \ll n$, $\alpha = aq_n/n$, $\beta = bq_n/n$ and $R^2/(R + d/n) = cq_n$.

On the one hand, Section 3.2 shows that the leading eigenvector $u_1$ of the hollowed Gram matrix $G = \mathcal{H}(XX^\top)$ is optimal for the Gaussian mixture model. From now on we rename it as $u_1(G)$ to avoid ambiguity. On the other hand, the second eigenvector $u_2(A)$ of $A$ estimates the labels under the stochastic block model (Abbe et al., 2017). To get some intuition, suppose that half of the entries of $\{y_i\}_{i=1}^n$ are $+1$'s and the others are $-1$'s, so that $\mathbf{1}_n^\top y = 0$. For such $y$, it is easy to see that
$$\mathbb{E}(A|y) = \frac{\alpha+\beta}{2}\mathbf{1}_n\mathbf{1}_n^\top + \frac{\alpha-\beta}{2}yy^\top \tag{4.1}$$
(up to a diagonal correction) and its second eigenvector $y/\sqrt{n}$ reveals the membership structure. Our estimator for the integrated problem is an aggregation of the two individual spectral estimators $u_2(A)$ and $u_1(G)$. Without loss of generality, we assume $\langle u_2(A), u_1(G)\rangle \ge 0$ to avoid cancellation.

We now begin the construction. The ideal 'estimator'
$$\hat{y}_i^{\mathrm{genie}} = \mathop{\mathrm{argmax}}_{y=\pm 1}\ \mathbb{P}(y_i = y\,|\,A, X, y_{-i})$$
is the best guess of $y_i$ given the network, the attributes, and the labels of all nodes except the $i$-th one. It is referred to as a genie-aided estimator or oracle estimator in the literature and is closely related to fundamental limits in clustering (Abbe et al., 2016; Zhang and Zhou, 2016); see Theorem F.3. To mimic $\hat{y}_i^{\mathrm{genie}}$, we first approximate its associated log odds ratio.
Lemma 4.1. Under Assumption 4.2, we have for each given $i$
$$\bigg|\log\bigg(\frac{\mathbb{P}(y_i = 1\,|\,A, X, y_{-i})}{\mathbb{P}(y_i = -1\,|\,A, X, y_{-i})}\bigg) - \bigg[\bigg(\log(a/b)A + \frac{2}{n + d/R}G\bigg)y\bigg]_i\bigg| = o_{\mathbb{P}}(q_n;\ q_n).$$

The $i$-th coordinate of $Ay$ corresponds to the log odds ratio $\log[\mathbb{P}(y_i = 1\,|\,A, y_{-i})/\mathbb{P}(y_i = -1\,|\,A, y_{-i})]$ for the stochastic block model (Abbe et al., 2016). From $A_{ii} = 0$ we see that $(Ay)_i = \sum_{j\ne i}A_{ij}y_j$ tries to predict the label $y_i$ via majority voting among the neighbors of node $i$. Similarly, $(Gy)_i$ relates to the log odds ratio $\log[\mathbb{P}(y_i = 1\,|\,X, y_{-i})/\mathbb{P}(y_i = -1\,|\,X, y_{-i})]$ for the Gaussian mixture model. The overall log odds ratio is linked to a linear combination of $Ay$ and $Gy$ thanks to the conditional independence between $A$ and $X$ in Definition 4.2. The proof of Lemma 4.1 can be found in Appendix F.2.

Intuitively, Lemma 4.1 reveals that
$$\mathrm{sgn}\bigg(\log(a/b)Ay + \frac{2}{n + d/R}Gy\bigg) \approx (\hat{y}_1^{\mathrm{genie}}, \cdots, \hat{y}_n^{\mathrm{genie}})^\top.$$
The left-hand side still involves the unknown parameters $a/b$, $R$ and $y$. Once these unknowns are consistently estimated, the substitution version of the left-hand side provides a valid estimator that mimics the genie-aided estimator well and hence is optimal. The heuristics of linear approximation in Theorem 3.3 above and in Abbe et al. (2017) suggest
$$u_2(A) \approx A\bar{u}/\bar\lambda_A \qquad\text{and}\qquad u_1(G) \approx G\bar{u}/\bar\lambda_G.$$
Here $\bar{u} = y/\sqrt{n}$; $\bar\lambda_A = n(\alpha-\beta)/2$ is the second largest (in absolute value) eigenvalue of $\mathbb{E}(A|y)$ when $\alpha \ne \beta$ and the two blocks are equally-sized; and $\bar\lambda_G = nR$ is the leading eigenvalue of $\bar{G} = \bar{X}\bar{X}^\top$. Hence
$$\log(a/b)Ay + \frac{2}{n + d/R}Gy \approx \log(a/b)\sqrt{n}\,\bar\lambda_A u_2(A) + \frac{2}{n + d/R}\sqrt{n}\,\bar\lambda_G u_1(G) \propto \frac{n(\alpha-\beta)}{2}\log\Big(\frac{\alpha}{\beta}\Big)u_2(A) + \frac{2R^2}{R + d/n}u_1(G), \tag{4.2}$$
which yields a linear combination of $u_2(A)$ and $u_1(G)$. The coefficient in front of $u_1(G)$ is twice the SNR in (3.2) for the Gaussian mixture model. Analogously, we may regard $\frac{n(\alpha-\beta)}{4}\log(\alpha/\beta)$ as a signal-to-noise ratio for the stochastic block model.

A legitimate estimator for $y$ is obtained by replacing the unknown parameters $\alpha$, $\beta$ and $R$ in (4.2) with their estimates. When the two classes are balanced, i.e. $y^\top\mathbf{1}_n = 0$, (4.1) yields $\lambda_1[\mathbb{E}(A|y)] = n(\alpha+\beta)/2$ and $\lambda_2[\mathbb{E}(A|y)] = n(\alpha-\beta)/2$. Here $\lambda_j(\cdot)$ denotes the $j$-th largest (in absolute value) eigenvalue of a real symmetric matrix. Hence,
$$\frac{n(\alpha-\beta)}{2}\log\Big(\frac{\alpha}{\beta}\Big) = \lambda_2[\mathbb{E}(A|y)]\log\bigg(\frac{\lambda_1[\mathbb{E}(A|y)] + \lambda_2[\mathbb{E}(A|y)]}{\lambda_1[\mathbb{E}(A|y)] - \lambda_2[\mathbb{E}(A|y)]}\bigg) \approx \lambda_2(A)\log\bigg(\frac{\lambda_1(A) + \lambda_2(A)}{\lambda_1(A) - \lambda_2(A)}\bigg).$$
It can be consistently estimated using the substitution principle. Similarly, using $\lambda_1(\bar{G}) = nR$, we have
$$\frac{2R^2}{R + d/n} = \frac{2[\lambda_1(\bar{G})/n]^2}{\lambda_1(\bar{G})/n + d/n} \approx \frac{2\lambda_1(G)^2}{n\lambda_1(G) + nd}.$$
These motivate our final estimator $\mathrm{sgn}(\hat{u})$ with
$$\hat{u} = \log\bigg(\frac{\lambda_1(A) + \lambda_2(A)}{\lambda_1(A) - \lambda_2(A)}\bigg)\lambda_2(A)u_2(A) + \frac{2\lambda_1(G)^2}{n\lambda_1(G) + nd}u_1(G). \tag{4.3}$$
Our estimator uses a weighted sum of the two individual estimators without any tuning parameter. Binkiewicz et al. (2017) propose a spectral method based on a weighted sum of the graph Laplacian matrix and $XX^\top$. Yan and Sarkar (2020) develop an SDP using a weighted sum of $A$ and a kernel matrix of $\{x_i\}_{i=1}^n$. Deshpande et al. (2018) study a belief propagation algorithm. Their settings and regimes are different from ours.
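A sketch of the full pipeline (ours; the sampled instance and parameter values are hypothetical, and the parameterization follows our reading of Assumption 4.1) simulates a CSBM in the $\log n$-regime and assembles $\hat{u}$ as in (4.3), with $\lambda_1(A), \lambda_2(A)$ taken largest in absolute value.

```python
import numpy as np

rng = np.random.default_rng(5)
n, d = 600, 1200
a, b, c_snr = 8.0, 2.0, 1.0             # I*(a, b, c) = ((sqrt(a)-sqrt(b))^2 + c)/2 = 1.5 > 1
q = np.log(n)
alpha, beta = a * q / n, b * q / n

# labels and attribute mean; R solves R^2/(R + d/n) = c_snr * q
y = rng.choice([-1.0, 1.0], size=n)
R = (c_snr * q + np.sqrt((c_snr * q) ** 2 + 4 * c_snr * q * d / n)) / 2
mu = np.sqrt(R / d) * np.ones(d)

# network A ~ SBM(y, alpha, beta); attributes x_i = y_i mu + z_i
P = np.where(np.outer(y, y) > 0, alpha, beta)
A = (rng.random((n, n)) < P).astype(float)
A = np.triu(A, 1); A = A + A.T                       # symmetric, zero diagonal
X = np.outer(y, mu) + rng.standard_normal((n, d))
G = X @ X.T; np.fill_diagonal(G, 0.0)

# leading eigen-pairs (ordered by absolute value for A)
wA, VA = np.linalg.eigh(A)
idx = np.argsort(-np.abs(wA))
lamA1, lamA2, u2A = wA[idx[0]], wA[idx[1]], VA[:, idx[1]]
wG, VG = np.linalg.eigh(G)
lamG1, u1G = wG[-1], VG[:, -1]
if u2A @ u1G < 0:                                    # sign alignment, as in the text
    u2A = -u2A

# aggregated estimator (4.3)
u_hat = (np.log((lamA1 + lamA2) / (lamA1 - lamA2)) * lamA2 * u2A
         + 2 * lamG1 ** 2 / (n * lamG1 + n * d) * u1G)
err = min(np.mean(np.sign(u_hat) != y), np.mean(np.sign(-u_hat) != y))
print("misclassification proportion:", err)
```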
There are very few theoretical results on the information gain from combining the network and the node attributes. Binkiewicz et al. (2017) and Yan and Sarkar (2020) derive upper bounds for the misclassification error that depend on both sources of information. However, those bounds are not tight and cannot rigorously justify the benefits. Deshpande et al. (2018) use techniques from statistical physics to derive an information threshold for obtaining an estimator that is better than random guessing in some regimes. The threshold is smaller than those for the stochastic block model and the Gaussian mixture model. Their calculation is done in the sparse regime where the maximum expected degree $n(\alpha+\beta)/2$ of the network remains bounded as $n$ goes to infinity. They obtain a formal proof by taking certain large-degree limits. To the best of our knowledge, the result below gives the first characterization of the information threshold for exact recovery and provides an efficient method achieving it by aggregating the two pieces of information.

We now investigate the aggregated spectral estimator (4.3) under the $\log n$-regime (Assumption 4.1). Our results show that $\mathrm{sgn}(\hat{u})$ achieves the information threshold for exact recovery as well as the optimal misclassification rate, both of which are better than those based on a single form of data in terms of the mismatch proportion $\mathcal{M}$ in (3.3). To state the results, define
$$I^*(a, b, c) = \frac{(\sqrt{a} - \sqrt{b})^2 + c}{2}. \tag{4.4}$$
Theorem 4.1. Let Assumption 4.1 hold and $a \ne b$. When $I^*(a, b, c) > 1$, we have
$$\lim_{n\to\infty}\mathbb{P}[\mathcal{M}(\mathrm{sgn}(\hat{u}), y) = 0] = 1.$$
When $I^*(a, b, c) < 1$, we have $\liminf_{n\to\infty}\mathbb{P}[\mathcal{M}(\hat{y}, y) > 0] > 0$ for any sequence of estimators $\hat{y} = \hat{y}_n(A, \{x_i\}_{i=1}^n)$.

Theorem 4.1 asserts that $I^*(a, b, c)$ quantifies the signal-to-noise ratio and that the phase transition of exact recovery takes place at $I^*(a, b, c) = 1$. When $c = 0$ (the node attributes are uninformative), we have $I^*(a, b, 0) = (\sqrt{a} - \sqrt{b})^2/2$; the threshold reduces to that for the stochastic block model ($\sqrt{a} - \sqrt{b} = \sqrt{2}$ by Abbe et al. (2016)). Similarly, when $a = b$ (the network is uninformative), we have $I^*(a, a, c) = c/2$; the threshold reduces to that for the Gaussian mixture model ($c = 2$ by Ndaoud (2018)). The relation (4.4) indicates that combining the two sources of information adds up the powers of each part. The proof of Theorem 4.1 is deferred to Appendix F.5.

Figure 1 presents numerical examples demonstrating the efficacy of our aggregated estimator $\mathrm{sgn}(\hat{u})$. The two experiments use $c = 0.$ and $c = 1.$, respectively. We fix $n = 500$, $d = 2000$ and vary $a$ ($y$-axis) and $b$ ($x$-axis) over a grid. For each parameter configuration $(a, b, c)$, we compute the frequency of exact recovery (i.e. $\mathrm{sgn}(\hat{u}) = \pm y$) over 100 independent runs. Light color represents a high chance of success. The red curves $(\sqrt{a} - \sqrt{b})^2 + c = 2$ correspond to the theoretical boundaries for phase transitions, which match the empirical results pretty well. Also, larger $c$ implies stronger signal in the node attributes and makes exact recovery easier.

Figure 1: Exact recovery for CSBM: $c = 0.$ (left) and $c = 1.$ (right).

When $I^*(a, b, c) < 1$, exact recovery of $y$ with high probability is no longer possible. In that case, we justify the benefits of aggregation using misclassification rates, by presenting an upper bound for $\mathrm{sgn}(\hat{u})$ as well as a matching lower bound for all possible estimators. Their proofs can be found in Appendices F.6 and F.7.
Theorem 4.2. Let Assumption 4.1 hold, $a \ne b$ and $I^*(a, b, c) \le 1$. We have
$$\limsup_{n\to\infty}\frac{\log\mathbb{E}\mathcal{M}(\mathrm{sgn}(\hat{u}), y)}{\log n} \le -I^*(a, b, c).$$
Theorem 4.3. Let Assumption 4.2 hold. For any sequence of estimators $\hat{y} = \hat{y}_n(A, \{x_i\}_{i=1}^n)$, we have
$$\liminf_{n\to\infty} q_n^{-1}\log\mathbb{E}\mathcal{M}(\hat{y}, y) \ge -I^*(a, b, c).$$

Theorems 4.2 and 4.3 imply that in the $\log n$-regime (Assumption 4.1), the aggregated spectral estimator $\mathrm{sgn}(\hat{u})$ achieves the optimal misclassification rate:
$$\mathbb{E}\mathcal{M}(\mathrm{sgn}(\hat{u}), y) = n^{-I^*(a,b,c)+o(1)}.$$
When $c = 0$, it reduces to the optimal rate $n^{-(\sqrt{a}-\sqrt{b})^2/2+o(1)}$ for the stochastic block model (Definition 4.1), and when $a = b$, it reduces to the optimal rate $n^{-c/2+o(1)}$ for the Gaussian mixture model (Definition 3.1). It is easy to show that these are achieved by $u_2(A)$ (Abbe et al., 2017) and $u_1(G)$ (Theorem 3.2), which are asymptotically equivalent to our aggregated estimator $\hat{u}$ in the extreme cases $c \to 0$ and $a \to b$, respectively. In other words, our result and procedure encompass those for the stochastic block model and the Gaussian mixture model as two specific examples.

While our lower bound $e^{-q_n[I^*(a,b,c)+o(1)]}$ for misclassification is proved under the general Assumption 4.2, our aggregated spectral estimator $\mathrm{sgn}(\hat{u})$ is only analyzed in the $\log n$-regime of Assumption 4.1. When the network becomes sparser ($q_n \ll \log n$), $A$ no longer concentrates (Feige and Ofek, 2005), the eigenvector analysis in Abbe et al. (2017) breaks down, and we do not have sharp characterizations of $u_2(A)$ anymore. However, the $\ell_p$ results for $u_1(G)$ in this paper continue to hold, and $\mathrm{sgn}[u_1(G)]$ faithfully recovers $y$. We conjecture that the estimator $\mathrm{sgn}(\tilde{u})$ with
$$\tilde{u} = \frac{1}{\sqrt{n}}\log\bigg(\frac{\lambda_1(A) + \lambda_2(A)}{\lambda_1(A) - \lambda_2(A)}\bigg)A\,\mathrm{sgn}[u_1(G)] + \frac{2\lambda_1(G)^2}{n\lambda_1(G) + nd}u_1(G) \tag{4.5}$$
achieves the lower bound $e^{-q_n[I^*(a,b,c)+o(1)]}$ for the misclassification rate even when $q_n \ll \log n$. The expression (4.5) is obtained by replacing $\lambda_2(A)u_2(A) = Au_2(A)$ in (4.3) with $A\,\mathrm{sgn}[u_1(G)]/\sqrt{n}$. Here $\mathrm{sgn}[u_1(G)]$ gives estimated labels based on $X$, and $A\,\mathrm{sgn}[u_1(G)]$ provides refined results using $A$.

Sketch of main proofs

To illustrate the key ideas behind the $\ell_p$ analysis in Theorem 2.1, we use a simple rank-1 model
$$x_i = \mu y_i + z_i \in \mathbb{R}^d, \qquad i \in [n], \tag{5.1}$$
where $y = (y_1, \cdots, y_n)^\top \in \{\pm 1\}^n$ and $\mu \in \mathbb{R}^d$ are deterministic; $\{z_i\}_{i=1}^n$ are independent and $z_i \sim N(0, \Sigma_i)$ for some $\Sigma_i \succ 0$. We further assume $\Sigma_i \preceq CI_d$ for all $i \in [n]$ and some constant $C > 0$.

Model (5.1) is a heteroscedastic version of the Gaussian mixture model in Definition 3.1. We have $\bar{x}_i = y_i\mu$, $\bar{X} = (\bar{x}_1, \cdots, \bar{x}_n)^\top = y\mu^\top$, $\bar{G} = \bar{X}\bar{X}^\top = \|\mu\|_2^2yy^\top$, $\bar\lambda_1 = n\|\mu\|_2^2$ and $\bar{u}_1 = y/\sqrt{n}$. For simplicity, we suppress the subscripts in $u_1$, $\bar{u}_1$, $\lambda_1$ and $\bar\lambda_1$. The goal is to show that for $p$ satisfying our technical condition,
$$\min_{c=\pm 1}\|cu - G\bar{u}/\bar\lambda\|_p = o_{\mathbb{P}}(\|\bar{u}\|_p;\ p). \tag{5.2}$$
For simplicity, we assume that $u$ is already aligned with $G\bar{u}/\bar\lambda$ so that the optimal $c$ above is $1$.

Benefits of hollowing

The hollowing procedure conducted on the Gram matrix has been commonly used in high-dimensional PCA and spectral methods (Koltchinskii and Giné, 2000; Montanari and Sun, 2018; Ndaoud, 2018; Cai et al., 2019). When the noises $\{z_i\}_{i=1}^n$ are strong and heteroscedastic, it drives $G$ closer to $\bar{G}$ and thus ensures a small angle between $u$ and $\bar{u}$. Such $\ell_2$ proximity is the starting point of our refined $\ell_p$ analysis.

Observe that
$$\langle x_i, x_j\rangle = \langle\bar{x}_i, \bar{x}_j\rangle + \langle\bar{x}_i, z_j\rangle + \langle z_i, \bar{x}_j\rangle + \langle z_i, z_j\rangle, \qquad \mathbb{E}\langle x_i, x_j\rangle = \langle\bar{x}_i, \bar{x}_j\rangle + \mathbb{E}\|z_i\|_2^2\,\mathbf{1}_{\{i=j\}}.$$
Hence the diagonal and off-diagonal entries of the Gram matrix behave differently. In the high-dimensional and heteroscedastic case, the difference in the noise levels $\{\mathbb{E}\|z_i\|_2^2\}_{i=1}^n$ could have a severe impact on the spectrum of the Gram matrix $XX^\top$. In particular, the following lemma shows that the leading eigenvector of $XX^\top$ could be asymptotically perpendicular to that of $\bar{X}\bar{X}^\top$, while $\mathcal{H}(XX^\top)$ is still faithful. The proof is in Appendix G.1.
Consider the model (5.1) with Σ = 2 I d and Σ = · · · = Σ n = I d . Let ˆ u and u be the leading eigenvectors of the Gram matrix XX (cid:62) and its hollowed version H ( XX (cid:62) ) . Suppose that n → ∞ and ( d/n ) / (cid:28) (cid:107) µ (cid:107) (cid:28) (cid:112) d/n . We have |(cid:104) ˆ u , ¯ u (cid:105)| P → and |(cid:104) u , ¯ u (cid:105)| P → . Figure 2 visualizes the entries of eigenvectors ¯ u (black), ˆ u (red) and u (blue) in a typicalrealization with n = 100 , d = 500 , (cid:107) µ (cid:107) = 3 and y = ( (cid:62) n/ , − (cid:62) n/ ) (cid:62) . The population eigen-vector ¯ u perfectly reveals class labels, and the eigenvector u of the hollowed Gram matrixis aligned with that. Without hollowing, the eigenvector ˆ u is localized due to heteroscedas-ticity and fails to recover the labels. The error rates of sgn( ˆ u ) and sgn( u ) are and ,respectively.With the help of hollowing, we obtain the following results on spectral concentration.See Appendix G.2 for the proof. Lemma 5.2.
Consider the model (5.1). When n → ∞ and (cid:107) µ (cid:107) (cid:29) max { , ( d/n ) / } , wehave (cid:107) G − ¯ G (cid:107) = o P (¯ λ ; n ) , | λ − ¯ λ | = o P (¯ λ ; n ) and min c = ± (cid:107) c u − ¯ u (cid:107) = o P (1; n ) . It is worth pointing out that hollowing inevitably creates bias as the diagonal informationof ¯ G is lost. Under incoherence conditions on the signals { ¯ x i } ni =1 (Assumption 2.1), this22ffect is under control. It becomes negligible when the noise is strong. While the simplehollowing already suffices for our need, general problems may benefit from more sophisticatedprocedures such as the heteroscedastic PCA in Zhang et al. (2018). p As hollowing has been shown to tackle heteroscedasticity, from now on we focus on thehomoscedastic case Σ = · · · = Σ n = I d to facilitate presentation. We want to approximate u with G ¯ u / ¯ λ . By definition, u = Gu /λ and (cid:107) u − G ¯ u / ¯ λ (cid:107) p = (cid:107) Gu /λ − G ¯ u / ¯ λ (cid:107) p ≤ (cid:107) G ( u − ¯ u ) (cid:107) p / | λ | + (cid:107) G ¯ u (cid:107) p | λ − − ¯ λ − | . The spectral concentration of G (Lemma 5.2) forces / | λ | = O P (¯ λ − ; n ) and | λ − − ¯ λ − | = o P (¯ λ − ; n ) . In order to get (5.2), it suffices to choose some p (cid:46) n such that (cid:107) G ( u − ¯ u ) (cid:107) p = o P (¯ λ (cid:107) ¯ u (cid:107) p ; p ) , (5.3) (cid:107) G ¯ u (cid:107) p = O P (¯ λ (cid:107) ¯ u (cid:107) p ; p ) . (5.4)The desired bound (5.4) sheds light on the choice of p . Let ¯ Z = ( z , · · · , z n ) (cid:62) andobserve that G = H ( XX (cid:62) ) = H [( ¯ X + Z )( ¯ X + Z ) (cid:62) ]= H ( ¯ X ¯ X (cid:62) ) + H ( ¯ XZ (cid:62) ) + H ( Z ¯ X (cid:62) ) + H ( ZZ (cid:62) ) . As an example, we show how to obtain (cid:107)H ( Z ¯ X (cid:62) ) ¯ u (cid:107) p = O P (¯ λ (cid:107) ¯ u (cid:107) p ; p ) . By Markov’sinequality, a convenient and sufficient condition is E /p (cid:107)H ( Z ¯ X (cid:62) ) ¯ u (cid:107) pp (cid:46) ¯ λ (cid:107) ¯ u (cid:107) p = n (cid:107) µ (cid:107) · n /p − / . (5.5)The facts [ H ( Z ¯ X (cid:62) )] ij = (cid:104) z i , y j µ (cid:105) { i (cid:54) = j } and ¯ u = y / √ n yield [ H ( Z ¯ X (cid:62) ) ¯ u ] i = (cid:88) j (cid:54) = i (cid:104) z i , y j µ (cid:105) y j / √ n = n − √ n (cid:104) z i , µ (cid:105) , ∀ i ∈ [ n ] . Note that { z i } ni =1 are i.i.d. N ( , I d ) random vectors, (cid:104) z i , µ (cid:105) ∼ N (0 , (cid:107) µ (cid:107) ) . By momentbounds for Gaussian distribution (Vershynin, 2010), sup q ≥ { q − / E /q |(cid:104) z i , µ (cid:105)| q } ≤ c (cid:107) µ (cid:107) for some constant c . Then E (cid:107)H ( Z ¯ X (cid:62) ) ¯ u (cid:107) pp = n (cid:88) i =1 E | [ H ( Z ¯ X (cid:62) ) ¯ u ] i | p ≤ n ( c (cid:107) µ (cid:107) √ np ) p . p (cid:46) (cid:107) µ (cid:107) . Hence p cannot be arbitrarily large. Momentbounds are used throughout the proof. The final choice of p depends on the most stringentcondition.Moments bounds are natural choices for obtaining (cid:96) p control and they can adapt to thesignal strength. As a comparison, the (cid:96) ∞ analysis in Abbe et al. (2017) targets quantitieslike (cid:107) G ¯ u (cid:107) ∞ by first applying concentration inequalities to each entry and then taking unionbounds. Such uniform control clearly requires stronger signal. Finally we come to (5.3). Let G i denote the i -th row of G . By definition, (cid:107) G ( u − ¯ u ) (cid:107) p = (cid:18) n (cid:88) i =1 | G i ( u − ¯ u ) | p (cid:19) /p . We need to study | G i ( u − ¯ u ) | for each individual i ∈ [ n ] . 
By Cauchy-Schwarz inequality,the upper bound | G i ( u − ¯ u ) | ≤ (cid:107) G i (cid:107) (cid:107) u − ¯ u (cid:107) always holds. Unfortunately, it is too large to be used. We should resort to probabilisticanalysis for tighter control.For any i ∈ [ n ] , we construct a new data matrix X ( i ) = ( x , · · · , x i − , , x i +1 , · · · , x n ) (cid:62) = ( I n − e i e (cid:62) i ) X by deleting the i -th sample. Then G i = ( (cid:104) x i , x (cid:105) , · · · , (cid:104) x i , x i − (cid:105) , , (cid:104) x i , x i +1 (cid:105) , · · · , (cid:104) x i , x n (cid:105) ) = x (cid:62) i X ( i ) (cid:62) , G i ( u − ¯ u ) = (cid:104) x i , X ( i ) (cid:62) ( u − ¯ u ) (cid:105) . Recall that u is the eigenvector of the whole matrix G constructed by n independent samples.It should not depend too much on any individual x i . Also, X ( i ) (cid:62) is independent of x i . Hencethe dependence between x i and X ( i ) (cid:62) ( u − ¯ u ) is weak. We would like to invoke sub-Gaussianconcentration inequalities to control their inner product.To decouple them in a rigorous way, we construct leave-one-out auxiliaries { G ( i ) } ni =1 ⊆ R n × n where G ( i ) = H ( X ( i ) X ( i ) (cid:62) ) = H [( I − e i e (cid:62) i ) XX (cid:62) ( I − e i e (cid:62) i )] is the hollowed Gram matrix of the dataset { x , · · · , x i − , , x i +1 , · · · , x n } with x i zeroedout. Equivalently, G ( i ) is obtained by zeroing out the i -th row and column of G . Let u ( i ) bethe leading eigenvector of G ( i ) . Then | G i ( u − ¯ u ) | = |(cid:104) x i , X ( i ) (cid:62) ( u − ¯ u ) (cid:105)| ≤ |(cid:104) x i , X ( i ) (cid:62) ( u ( i ) − ¯ u ) (cid:105)| (cid:124) (cid:123)(cid:122) (cid:125) ε + |(cid:104) x i , X ( i ) (cid:62) ( u − u ( i ) ) (cid:105)| (cid:124) (cid:123)(cid:122) (cid:125) ε .
24e have the luxury of convenient concentration inequalities for ε as x i and X ( i ) (cid:62) ( u ( i ) − ¯ u ) are completely independent. In addition, we can safely apply the Cauchy-Schwarz inequalityto ε because u ( i ) should be very similar to u .The leave-one-out technique is a powerful tool in random matrix theory (Erdős et al.,2009) and high-dimensional statistics (Javanmard and Montanari, 2018; El Karoui, 2018).Zhong and Boumal (2018), Abbe et al. (2017) and Chen et al. (2017) apply it to (cid:96) ∞ eigenvec-tors analysis of Wigner-type random matrices. Here we focus on (cid:96) p analysis of Wishart-typematrices with dependent entries. We conduct a novel (cid:96) p analysis of PCA and establish linear approximations of eigenvectors.The results yield optimality guarantees for spectral clustering in several challenging prob-lems. Meanwhile, this study leads to new research directions that are worth exploring. First,we hope to extend the analysis from Wishart-type matrices to more general random matri-ces. One example is the normalized Laplacian matrix frequently used in spectral clustering.Second, our general results hold for Hilbert spaces and they are potentially useful in thestudy of kernel PCA, such as quantifying the performances of different kernels. Third, thelinearization of eigenvectors provides tractable characterizations of spectral embedding thatserve as the starting point of statistical inference. Last but not least, while we focus onsymmetric and binary clustering applications for simplicity, it would be nice to generalizethe results to multi-class and imbalanced settings. That is of great practical importance.25 Useful facts
Here we list some elementary results about operations using the new notations O P ( · ; · ) and o P ( · ; · ) . Most of them can be found in Wang (2019). Fact A.1.
The following two statements hold.1. X n = O P ( Y n ; r n ) is equivalent to the following: there exist positive constants C , C and N , a non-decreasing function f : [ C , + ∞ ) → (0 , + ∞ ) satisfying lim x → + ∞ f ( x ) = + ∞ ,and a positive deterministic sequence { R n } ∞ n =1 tending to infinity such that P ( | X n | ≥ t | Y n | ) ≤ C e − r n f ( t ) , ∀ n ≥ N, C ≤ t ≤ R n .
2. When X n = o P ( Y n ; r n ) , we have lim n →∞ r − n log P ( | X n | ≥ c | Y n | ) = −∞ for any constant c > . Here we adopt the convention log 0 = −∞ . Fact A.2 (Truncation) . If X n {| Z n |≤| W n |} = O P ( Y n ; r n ) and Z n = o P ( W n ; r n ) , then X n = O P ( Y n ; r n ) . Fact A.2 directly follows from Fact A.1 above and Lemma 4 in Wang (2019).
Fact A.3. If E /r n | X n | r n (cid:46) Y n or E /r n | X n | r n (cid:28) Y n for deterministic Y n , then X n = O P ( Y n ; r n ) or X n = o P ( Y n ; r n ) , respectively. Fact A.4 (Lemma 2 in Wang (2019)) . If X n = O P ( Y n ; r n ) and W n = O P ( Z n ; s n ) , then X n + W n = O P ( | Y n | + | Z n | ; r n ∧ s n ) ,X n W n = O P ( Y n Z n ; r n ∧ s n ) . Fact A.5 (Lemma 3 in Wang (2019)) . We have the followings:1. if X n = O P ( Y n ; r n ) , then | X n | α = O P ( | Y n | α ; r n ) for any α > ;2. if X n = o P (1; r n ) , then f ( X n ) = o P (1; r n ) for any f : R → R that is continuous at . Definition A.1 (A uniform version of O P ( · , · ) ) . Let { Λ n } ∞ n =1 be a sequence of finite in-dex sets. For any n ≥ , { X nλ } λ ∈ Λ n , { Y nλ } λ ∈ Λ n are two collections of random variables; { r nλ } λ ∈ Λ n ⊆ (0 , + ∞ ) are deterministic. We write { X nλ } λ ∈ Λ n = O P ( { Y nλ } λ ∈ Λ n ; { r nλ } λ ∈ Λ n ) (A.1) if there exist positive constants C , C and N , a non-decreasing function f : [ C , + ∞ ) → (0 , + ∞ ) satisfying lim x → + ∞ f ( x ) = + ∞ , and a positive deterministic sequence { R n } ∞ n =1 tending to infinity such that P ( | X n | ≥ t | Y n | ) ≤ C e − r n f ( t ) , ∀ n ≥ N, C ≤ t ≤ R n . When Y nλ = Y n and/or r nλ = r n for all n and λ , we may replace { Y nλ } λ ∈ Λ n and/or { r nλ } λ ∈ Λ n in (A.1) by Y n and/or r n for simplicity. Fact A.6. If r n (cid:38) log | Λ n | , then { X nλ } λ ∈ Λ n = O P ( { Y nλ } λ ∈ Λ n ; r n ) implies that max λ ∈ Λ n | X nλ | = O P (max λ ∈ Λ n Y nλ ; r n ) . More on (cid:96) ,p analysis of eigenspaces In this section, we provide a generalized version of Theorem 2.1 and its proof. Instead ofAssumption 2.4, we use a weaker version of that (Assumption B.1) at the cost of a morenested regularity condition for p = p n (Assumption B.2). Assumptions 2.5 and 2.6 are stillin use. Assumption B.1 (Incoherence) . n → ∞ and (cid:107) ¯ G (cid:107) , ∞ / ¯∆ ≤ γ (cid:28) /κ . Assumption B.2 (Regularity of p = p n ) . √ np (cid:107) ¯ X Σ / (cid:107) ,p (cid:46) ¯∆ (cid:107) ¯ U (cid:107) ,p and n /p √ rp max {(cid:107) Σ (cid:107) HS , √ n (cid:107) Σ (cid:107) op } (cid:46) ¯∆ (cid:107) ¯ U (cid:107) ,p . Theorem B.1.
Let Assumptions 2.5, 2.6, B.1 and B.2 hold. We have (cid:107) U sgn( H ) (cid:107) ,p = O P (cid:0) (cid:107) ¯ U (cid:107) ,p + γ ¯∆ − (cid:107) ¯ G (cid:107) ,p ; p ∧ n (cid:1) , (cid:107) U sgn( H ) − G ¯ U ¯ Λ − (cid:107) ,p = O P (cid:0) κγ (cid:107) ¯ U (cid:107) ,p + γ ¯∆ − (cid:107) ¯ G (cid:107) ,p ; p ∧ n (cid:1) , (cid:107) U Λ / sgn( H ) − G ¯ U ¯ Λ − / (cid:107) ,p = O P ( κ / γ ¯∆ / (cid:107) ¯ U (cid:107) ,p + κ / γ ¯∆ − / (cid:107) ¯ G (cid:107) ,p ; p ∧ n ) . B.1 Proof of Theorem B.1
The following lemmas provide useful intermediate results, whose proofs can be found inSections B.2 and B.3.
Lemma B.1.
Let Assumptions 2.5, 2.6 and B.1 hold. We have (cid:107) G − ¯ G (cid:107) = O P ( γ ¯∆; n ) , (cid:107) Λ − ¯ Λ (cid:107) = O P ( γ ¯∆; n ) and (cid:107) U U (cid:62) − ¯ U ¯ U (cid:62) (cid:107) = O P ( γ ; n ) . Lemma B.2.
Let Assumptions 2.5, 2.6, B.1 and B.2 hold. We have (cid:107) G ¯ U − ¯ U ¯ Λ − H ( ZX (cid:62) ) ¯ U (cid:107) ,p = ( γ + (cid:112) r/n ) O P ( ¯∆ (cid:107) ¯ U (cid:107) ,p ; p ) , (cid:107) G ¯ U ¯ Λ − (cid:107) ,p = O P ( (cid:107) ¯ U (cid:107) ,p ; p ∧ n ) . We now prove Theorem B.1. Let ¯ γ = (cid:107) G − ¯ G (cid:107) / ¯∆ . It follows from Lemma 1 in Abbeet al. (2017) that when ¯ γ ≤ / , (cid:107) U H − G ¯ U ¯ Λ − (cid:107) ,p ≤ γ ¯∆ − (cid:107) G ¯ U (cid:107) ,p + 2 ¯∆ − (cid:107) G ( U H − ¯ U ) (cid:107) ,p . By Lemma B.1 and γ → in Assumption B.1, ¯ γ = O P ( γ ; n ) = o P (1; n ) . Lemma B.2 assertsthat (cid:107) G ¯ U (cid:107) ,p ≤ (cid:107) G ¯ U ¯ Λ − (cid:107) ,p (cid:107) ¯ Λ (cid:107) = O P ( κ ¯∆ (cid:107) ¯ U (cid:107) ,p ; p ∧ n ) , respectively. Hence (cid:107) U H − G ¯ U ¯ Λ − (cid:107) ,p = O P ( κγ (cid:107) ¯ U (cid:107) ,p ; p ∧ n ) + (cid:107) G ( U H − ¯ U ) (cid:107) ,p O P ( ¯∆ − ; n ) , (B.1) (cid:107) U H (cid:107) ,p ≤ (cid:107) G ¯ U ¯ Λ − (cid:107) ,p + (cid:107) U H − G ¯ U ¯ Λ − (cid:107) ,p = O P ( (cid:107) ¯ U (cid:107) ,p ; p ∧ n )+ (cid:107) G ( U H − ¯ U ) (cid:107) ,p O P ( ¯∆ − ; n ) . (B.2)We construct leave-one-out auxiliaries { G ( m ) } nm =1 ⊆ R n × n where G ( m ) is obtained byzeroing out the m -th row and column of G . Mathematically, we define a new data matrix X ( m ) = ( x , · · · , x m − , , x m +1 , · · · , x n ) (cid:62) = ( I n − e m e (cid:62) m ) X
27y deleting the m -th sample and G ( m ) = H ( X ( m ) X ( m ) (cid:62) ) = H [( I n − e m e (cid:62) m ) XX (cid:62) ( I n − e m e (cid:62) m )] . Let { u ( m ) j } nj =1 be the eigenvectors of G ( m ) , U ( m ) = ( u ( m ) s +1 , · · · , u ( m ) s + r ) ∈ R n × r and H ( m ) = U ( m ) (cid:62) ¯ U . The construction is also used by Abbe et al. (2017) in entrywise eigenvectoranalysis.By Minkowski’s inequality, (cid:107) G ( U H − ¯ U ) (cid:107) ,p ≤ (cid:18) n (cid:88) m =1 [ (cid:107) G m ( U H − U ( m ) H ( m ) ) (cid:107) + (cid:107) G m ( U ( m ) H ( m ) − ¯ U ) (cid:107) ] p (cid:19) /p ≤ (cid:18) n (cid:88) m =1 (cid:107) G m ( U H − U ( m ) H ( m ) ) (cid:107) p (cid:19) /p + (cid:18) n (cid:88) m =1 (cid:107) G m ( U ( m ) H ( m ) − ¯ U ) (cid:107) p (cid:19) /p . (B.3)The first term on the right hand side of (B.3) corresponds to leave-one-out perturbations.When max {(cid:107) ¯ G (cid:107) , ∞ , (cid:107) G − ¯ G (cid:107) } κ ≤ ¯∆ / , Lemma 3 in Abbe et al. (2017) forces (cid:107) U U (cid:62) − U ( m ) ( U ( m ) ) (cid:62) (cid:107) ≤ κ (cid:107) ( U H ) m (cid:107) , ∀ m ∈ [ n ] , max m ∈ [ n ] (cid:107) U ( m ) H ( m ) − ¯ U (cid:107) ≤ {(cid:107) ¯ G (cid:107) , ∞ , (cid:107) G − ¯ G (cid:107) } / ¯∆ . The fact (cid:107) ¯ G (cid:107) , ∞ ≤ γ ¯∆ , the result (cid:107) G − ¯ G (cid:107) = O P ( γ ¯∆; n ) in Lemma B.1, and AssumptionB.1 imply that (cid:107) G (cid:107) , ∞ ≤ (cid:107) ¯ G (cid:107) , ∞ + (cid:107) G − ¯ G (cid:107) = O P ( γ ¯∆; n ) , (cid:18) n (cid:88) m =1 (cid:107) U U (cid:62) − U ( m ) ( U ( m ) ) (cid:62) (cid:107) p (cid:19) /p = O P ( κ (cid:107) U H (cid:107) ,p ; n ) , max m ∈ [ n ] (cid:107) U ( m ) H ( m ) − ¯ U (cid:107) = O P ( γ ; n ) . (B.4)The definitions H = U (cid:62) ¯ U and H ( m ) = ( U ( m ) ) (cid:62) ¯ U yield (cid:107) U H − U ( m ) H ( m ) (cid:107) = (cid:107) ( U U (cid:62) − U ( m ) ( U ( m ) ) (cid:62) ) ¯ U (cid:107) ≤ (cid:107) U U (cid:62) − U ( m ) ( U ( m ) ) (cid:62) (cid:107) . Based on these estimates, (cid:18) n (cid:88) m =1 (cid:107) G m ( U H − U ( m ) H ( m ) ) (cid:107) p (cid:19) /p ≤ (cid:107) G (cid:107) , ∞ (cid:18) n (cid:88) m =1 (cid:107) U H − U ( m ) H ( m ) (cid:107) p (cid:19) /p ≤ (cid:107) G (cid:107) , ∞ (cid:18) n (cid:88) m =1 (cid:107) U U (cid:62) − U ( m ) ( U ( m ) ) (cid:62) (cid:107) p (cid:19) /p = O P ( κγ ¯∆ (cid:107) U H (cid:107) ,p ; n )= O P ( κγ ¯∆ (cid:107) ¯ U (cid:107) ,p ; p ∧ n ) + O P ( κγ (cid:107) G ( U H − ¯ U ) (cid:107) ,p ; n ) . (B.5)The last equality follows from (B.2). We use (B.3), (B.5) and κγ = o (1) from AssumptionB.1 to derive (cid:107) G ( U H − ¯ U ) (cid:107) ,p ≤ (cid:18) n (cid:88) m =1 (cid:107) G m ( U ( m ) H ( m ) − ¯ U ) (cid:107) p (cid:19) /p + O P ( κγ ¯∆ (cid:107) ¯ U (cid:107) ,p ; p ∧ n ) .
28y plugging this into (B.1) and (B.2) and using κγ = o (1) , we obtain that (cid:107) U H − G ¯ U ¯ Λ − (cid:107) ,p = O P ( κγ (cid:107) ¯ U (cid:107) ,p ; p ∧ n ) + (cid:18) n (cid:88) m =1 (cid:107) G m ( U ( m ) H ( m ) − ¯ U ) (cid:107) p (cid:19) /p O P ( ¯∆ − ; n ) , (B.6) (cid:107) U H (cid:107) ,p = O P ( (cid:107) ¯ U (cid:107) ,p ; p ∧ n ) + (cid:18) n (cid:88) m =1 (cid:107) G m ( U ( m ) H ( m ) − ¯ U ) (cid:107) p (cid:19) /p O P ( ¯∆ − ; n ) . (B.7)We now control the second term in (B.3). From the decompositions G = H [( ¯ X + Z )( ¯ X + Z ) (cid:62) ] = H ( ¯ X ¯ X (cid:62) + ¯ XZ (cid:62) + Z ¯ X (cid:62) ) + H ( ZZ (cid:62) ) , we have (cid:18) n (cid:88) m =1 (cid:107) G m ( U ( m ) H ( m ) − ¯ U ) (cid:107) p (cid:19) /p ≤ (cid:107)H ( ¯ X ¯ X (cid:62) + ¯ XZ (cid:62) + Z ¯ X (cid:62) ) (cid:107) ,p max m ∈ [ n ] (cid:107) U ( m ) H ( m ) − ¯ U (cid:107) + (cid:18) n (cid:88) m =1 (cid:107) [ H ( ZZ (cid:62) )] m ( U ( m ) H ( m ) − ¯ U ) (cid:107) p (cid:19) /p . (B.8)We now work on the first term on the right hand side of (B.8). Define M ∈ R n × n through M ij = (cid:107) ( ¯ XZ (cid:62) ) ij (cid:107) ψ . Then E M ij = 0 and M ij = (cid:107)(cid:104) ¯ x i , z j (cid:105)(cid:107) ψ (cid:46) (cid:107) Σ / ¯ x i (cid:107) , where (cid:46) onlyhides a universal constant. (cid:107) M (cid:107) ,p = (cid:20) n (cid:88) i =1 (cid:18) n (cid:88) j =1 | M ij | (cid:19) p/ (cid:21) /p (cid:46) (cid:20) n (cid:88) i =1 (cid:18) n (cid:88) j =1 (cid:107) Σ / ¯ x i (cid:107) (cid:19) p/ (cid:21) /p = √ n (cid:107) ¯ X Σ / (cid:107) ,p , (cid:107) M (cid:62) (cid:107) ,p = (cid:20) n (cid:88) j =1 (cid:18) n (cid:88) i =1 | M ij | (cid:19) p/ (cid:21) /p (cid:46) (cid:20) n (cid:88) j =1 (cid:18) n (cid:88) i =1 (cid:107) Σ / ¯ x i (cid:107) (cid:19) p/ (cid:21) /p = n /p (cid:107) ¯ X Σ / (cid:107) , ≤ √ n (cid:107) ¯ X Σ / (cid:107) ,p . By Lemma H.3 and p ≥ , (cid:107) ¯ XZ (cid:62) (cid:107) ,p = O P ( √ p (cid:107) M (cid:107) ,p ; p ) = O P ( √ np (cid:107) ¯ X Σ / (cid:107) ,p ; p ) , (cid:107) Z ¯ X (cid:62) (cid:107) ,p = O P ( √ p (cid:107) M (cid:62) (cid:107) ,p ; p ) = O P ( √ np (cid:107) ¯ X Σ / (cid:107) ,p ; p ) . These estimates and √ np (cid:107) ¯ X Σ / (cid:107) ,p (cid:46) ¯∆ (cid:107) ¯ U (cid:107) ,p in Assumption B.2 yield (cid:107)H ( ¯ XZ (cid:62) + Z ¯ X (cid:62) ) (cid:107) ,p ≤ (cid:107) ¯ XZ (cid:62) + Z ¯ X (cid:62) (cid:107) ,p = O P ( ¯∆ (cid:107) ¯ U (cid:107) ,p ; p ) . This and (B.4) lead to (cid:107)H ( ¯ X ¯ X (cid:62) + ¯ XZ (cid:62) + Z ¯ X (cid:62) ) (cid:107) ,p max m ∈ [ n ] (cid:107) U ( m ) H ( m ) − ¯ U (cid:107) O P ( γ ( (cid:107) ¯ X ¯ X (cid:62) (cid:107) ,p + ¯∆ (cid:107) ¯ U (cid:107) ,p ); p ∧ n ) . (B.9)We use (B.6), (B.8) and (B.9) to get (cid:107) U H − G ¯ U ¯ Λ − (cid:107) ,p = O P ( κγ (cid:107) ¯ U (cid:107) ,p ; p ∧ n ) + O P ( γ ¯∆ − (cid:107) ¯ X ¯ X (cid:62) (cid:107) ,p ; p ∧ n )+ (cid:18) n (cid:88) m =1 (cid:107) [ H ( ZZ (cid:62) )] m ( U ( m ) H ( m ) − ¯ U ) (cid:107) p (cid:19) /p O P ( ¯∆ − ; n ) . (B.10)By construction, U ( m ) H ( m ) − ¯ U ∈ R n × r is independent of z m . 
We invoke Lemma H.2 toget (cid:18) n (cid:88) m =1 (cid:107) [ H ( ZZ (cid:62) )] m ( U ( m ) H ( m ) − ¯ U ) (cid:107) p (cid:19) /p = (cid:18) n (cid:88) m =1 (cid:13)(cid:13)(cid:13)(cid:13) (cid:88) j (cid:54) = m (cid:104) z m , z j (cid:105) ( U ( m ) H ( m ) − ¯ U ) j (cid:13)(cid:13)(cid:13)(cid:13) p (cid:19) /p = n /p max m ∈ [ n ] (cid:107) U ( m ) H ( m ) − ¯ U (cid:107) O P (cid:0) √ rp max {(cid:107) Σ (cid:107) HS , √ n (cid:107) Σ (cid:107) op } ; p ∧ n (cid:1) = O P ( γ ¯∆ (cid:107) ¯ U (cid:107) ,p ; p ∧ n ) , (B.11)where we also used (B.4) and Assumption B.2.We use (B.10) and (B.11) to derive (cid:107) U H − G ¯ U ¯ Λ − (cid:107) ,p = O P ( κγ (cid:107) ¯ U (cid:107) ,p ; p ∧ n ) + O P ( γ ¯∆ − (cid:107) ¯ G (cid:107) ,p ; p ∧ n ) . (B.12)Consequently, Lemma B.2 yields (cid:107) U H (cid:107) ,p ≤ (cid:107) G ¯ U ¯ Λ − (cid:107) ,p + (cid:107) U H − G ¯ U ¯ Λ − (cid:107) ,p = O P ( (cid:107) ¯ U (cid:107) ,p ; p ∧ n ) + O P ( γ ¯∆ − (cid:107) ¯ G (cid:107) ,p ; p ∧ n ) . (B.13)Lemma 2 in Abbe et al. (2017) and the result (cid:107) G − ¯ G (cid:107) = O P ( γ ¯∆; n ) in Lemma B.1 implythat (cid:107) H − sgn( H ) (cid:107) = O P ( γ ; n ) . As sgn( H ) is orthonormal, we have (cid:107) H − (cid:107) = O P (1 , n ) and (cid:107) U sgn( H ) − U H (cid:107) ,p ≤ (cid:107) U HH − (sgn( H ) − H ) (cid:107) ,p ≤ (cid:107) U H (cid:107) ,p (cid:107) H − (cid:107) (cid:107) sgn( H ) − H (cid:107) = (cid:107) U H (cid:107) ,p O P ( γ ; n ) . (B.14)The tail bounds for (cid:107) U sgn( H ) (cid:107) ,p and (cid:107) U sgn( H ) − G ¯ U ¯ Λ − (cid:107) ,p in Theorem B.1 followfrom (B.12), (B.13) and (B.14).Finally we use the results above to control (cid:107) U Λ / sgn( H ) − G ¯ U ¯ Λ − / (cid:107) ,p . By LemmaB.1, (cid:107) Λ − ¯ Λ (cid:107) ≤ (cid:107) G − ¯ G (cid:107) = O P ( γ ¯∆; n ) = o P ( ¯∆; n ) . Hence n − log P ( (cid:107) Λ − ¯ Λ (cid:107) ≥ ¯∆ / →−∞ . When (cid:107) G − ¯ G (cid:107) < ¯∆ / , we have Λ (cid:31) ( ¯∆ / I , and Λ / is well-defined. It remains toshow that (cid:107) U Λ / ¯ H − G ¯ U ¯ Λ − / (cid:107) ,p {(cid:107) G − ¯ G (cid:107) < ¯∆ / } = O P ( κ / γ ¯∆ / (cid:107) ¯ U (cid:107) ,p + κ / γ ¯∆ − / (cid:107) ¯ G (cid:107) ,p ; p ∧ n ) . (B.15)Define ¯ H = sgn( H ) . When (cid:107) G − ¯ G (cid:107) < ¯∆ / happens, we use triangle’s inequality toderive (cid:107) U Λ / ¯ H − G ¯ U ¯ Λ − / (cid:107) ,p ≤ (cid:107) U ¯ H ( ¯ H (cid:62) Λ / ¯ H − ¯ Λ / ) (cid:107) ,p + (cid:107) ( U ¯ H − G ¯ U ¯ Λ − ) ¯ Λ / (cid:107) ,p (cid:107) U ¯ H (cid:107) ,p (cid:107) ¯ H (cid:62) Λ / ¯ H − ¯ Λ / (cid:107) + (cid:107) U ¯ H − G ¯ U ¯ Λ − (cid:107) ,p (cid:107) ¯ Λ (cid:107) / . It is easily seen from (cid:107) ¯ Λ (cid:107) ≤ κ ¯∆ that (cid:107) U ¯ H − G ¯ U ¯ Λ − (cid:107) ,p (cid:107) ¯ Λ (cid:107) / = O P ( κ / γ ¯∆ / (cid:107) ¯ U (cid:107) ,p + κ / γ ¯∆ − / (cid:107) ¯ G (cid:107) ,p ; p ∧ n ) . Hence (cid:107) U Λ / ¯ H − G ¯ U ¯ Λ − / (cid:107) ,p {(cid:107) G − ¯ G (cid:107) < ¯∆ / } = O P ( κ / γ ¯∆ / (cid:107) ¯ U (cid:107) ,p + κ / γ ¯∆ − / (cid:107) ¯ G (cid:107) ,p ; p ∧ n )+ O P ( (cid:107) ¯ U (cid:107) ,p + γ ¯∆ − (cid:107) ¯ G (cid:107) ,p ; p ∧ n ) · (cid:107) ¯ H (cid:62) Λ / ¯ H − ¯ Λ / (cid:107) {(cid:107) G − ¯ G (cid:107) < ¯∆ / } . (B.16)Note that ¯ H (cid:62) Λ / ¯ H = ( ¯ H (cid:62) Λ ¯ H ) / . 
In view of the perturbation bound for matrixsquare roots (Schmitt, 1992, Lemma 2.1), (cid:107) ¯ H (cid:62) Λ / ¯ H − ¯ Λ / (cid:107) ≤ (cid:107) ¯ H (cid:62) Λ ¯ H − ¯ Λ (cid:107) λ min ( ¯ H (cid:62) Λ / ¯ H ) + λ min ( ¯ Λ / ) ≤ (cid:107) Λ ¯ H − ¯ H ¯ Λ (cid:107) / (cid:46) ( (cid:107) Λ H − H ¯ Λ (cid:107) + (cid:107) Λ ( ¯ H − H ) (cid:107) + (cid:107) ( ¯ H − H ) ¯ Λ (cid:107) ) / ¯∆ / (cid:46) (cid:107) Λ H − H ¯ Λ (cid:107) / ¯∆ / + O P ( κγ ¯∆ / ; n ) as long as (cid:107) G − ¯ G (cid:107) < ¯∆ / . Here we used (cid:107) H − ¯ H (cid:107) = O P ( γ ; n ) according to LemmaB.1 as well as Lemma 2 in Abbe et al. (2017).From U (cid:62) G = Λ U (cid:62) and ¯ G ¯ U = ¯ U ¯ Λ we obtain that Λ H − H ¯ Λ = Λ U (cid:62) ¯ U − U (cid:62) ¯ U ¯ Λ = U (cid:62) G ¯ U − U (cid:62) ¯ G ¯ U = U (cid:62) ( G − ¯ G ) ¯ U and (cid:107) Λ H − H ¯ Λ (cid:107) ≤ (cid:107) G − ¯ G (cid:107) = O P ( γ ¯∆; n ) . As a result, (cid:107) ¯ H (cid:62) Λ / ¯ H − ¯ Λ / (cid:107) {(cid:107) G − ¯ G (cid:107) < ¯∆ / } = O P ( γ ¯∆ / ; n ) , where we also used κγ = o (1) in Assumption B.1. Plugging this into (B.16), we get thedesired bound (B.15) and thus complete the proof of Theorem B.1. B.2 Proof of Lemma B.1
Note that G = H [( ¯ X + Z )( ¯ X + Z ) (cid:62) ] = H ( ¯ X ¯ X (cid:62) ) + H ( ¯ XZ (cid:62) + Z ¯ X (cid:62) ) + H ( ZZ (cid:62) )= ¯ X ¯ X (cid:62) + ( ¯ XZ (cid:62) + Z ¯ X (cid:62) ) + H ( ZZ (cid:62) ) − ¯ D , (B.17)where ¯ D is the diagonal part of ¯ X ¯ X (cid:62) + ¯ XZ (cid:62) + Z ¯ X (cid:62) , with ¯ D ii = (cid:107) ¯ x i (cid:107) + 2 (cid:104) ¯ x i , z i (cid:105) . From (cid:107)(cid:104) ¯ x i , z i (cid:105)(cid:107) ψ (cid:46) (cid:107) Σ / ¯ x i (cid:107) we get {|(cid:104) ¯ x i , z i (cid:105)|} ni =1 = O P ( {(cid:107) Σ / ¯ x i (cid:107)√ n } ni =1 ; n ) . By Fact A.6, max i ∈ [ n ] |(cid:104) ¯ x i , z i (cid:105)| = O P (cid:16) max i ∈ [ n ] (cid:107) Σ / ¯ x i (cid:107)√ n ; n (cid:17) and (cid:107) ¯ D (cid:107) = max i ∈ [ n ] | ¯ D ii | = max i ∈ [ n ] (cid:107) ¯ x i (cid:107) + O P (cid:18) max i ∈ [ n ] (cid:107) Σ / ¯ x i (cid:107)√ n ; n (cid:19) (cid:107) ¯ X (cid:107) , ∞ + O P (cid:0) (cid:107) ¯ X (cid:107) , ∞ ( n (cid:107) Σ (cid:107) op ) / ; n (cid:1) ≤ (cid:107) ¯ X ¯ X (cid:62) (cid:107) , ∞ + O P (cid:16) (cid:107) ¯ X ¯ X (cid:62) (cid:107) / ( n (cid:107) Σ (cid:107) op ) / ; n (cid:17) = (cid:107) ¯ G (cid:107) , ∞ + O P (cid:0) ( nκ ¯∆ (cid:107) Σ (cid:107) op ) / ; n (cid:1) . (B.18)Note that (cid:107) Z ¯ X (cid:62) (cid:107) = sup u , v ∈ S n − u (cid:62) Z ¯ X (cid:62) v . Since { z (cid:62) i ¯ X (cid:62) v } ni =1 are zero-mean, inde-pendent and (cid:107) z (cid:62) i ¯ X (cid:62) v (cid:107) ψ (cid:46) (cid:107) Σ / ¯ X (cid:62) v (cid:107) ≤ (cid:107) ¯ X Σ / (cid:107) op ≤ ( (cid:107) ¯ G (cid:107) (cid:107) Σ (cid:107) op ) / = ( κ ¯∆ (cid:107) Σ (cid:107) op ) / , we have (cid:107) u (cid:62) Z ¯ X (cid:62) v (cid:107) ψ = (cid:13)(cid:13)(cid:13)(cid:13) n (cid:88) i =1 u i z (cid:62) i ¯ X (cid:62) v (cid:13)(cid:13)(cid:13)(cid:13) ψ (cid:46) (cid:18) n (cid:88) i =1 u i (cid:107) z (cid:62) i ¯ X (cid:62) v (cid:107) ψ (cid:19) / (cid:46) ( κ ¯∆ (cid:107) Σ (cid:107) op ) / . A standard covering argument (Vershynin, 2010, Section 5.2.2) yields (cid:107) Z ¯ X (cid:62) (cid:107) = O P (( nκ ¯∆ (cid:107) Σ (cid:107) op ) / ; n ) . The same tail bound also holds for (cid:107) ¯ XZ (cid:62) (cid:107) .From these estimates, (B.17), (B.18) and Lemma H.1 we obtain that (cid:107) G − ¯ X ¯ X (cid:62) (cid:107) = O P (cid:0) (cid:107) ¯ G (cid:107) , ∞ + ( nκ ¯∆ (cid:107) Σ (cid:107) op ) / + max {√ n (cid:107) Σ (cid:107) HS , n (cid:107) Σ (cid:107) op } ; n (cid:1) . By Assumptions B.1 and 2.6, we have nκ (cid:107) Σ (cid:107) op ≤ ¯∆ . Hence n (cid:107) Σ (cid:107) op ≤ ( nκ ¯∆ (cid:107) Σ (cid:107) op ) / and (cid:107) G − ¯ X ¯ X (cid:62) (cid:107) = O P ( γ ¯∆; n ) .Finally, Weyl’s inequality (Stewart and Sun, 1990) and Davis-Kahan theorem (Davis andKahan, 1970) assert that (cid:107) Λ − ¯ Λ (cid:107) ≤ (cid:107) G − ¯ G (cid:107) = O P ( γ ¯∆; n ) and (cid:107) U U (cid:62) − ¯ U ¯ U (cid:62) (cid:107) (cid:46) (cid:107) G − ¯ G (cid:107) / ¯∆ = O P ( γ ; n ) . B.3 Proof of Lemma B.2
Observe that G = H ( XX (cid:62) ) = H [( ¯ X + Z ) X (cid:62) ] = ¯ X ¯ X (cid:62) + [ H ( ¯ X ¯ X (cid:62) ) − ¯ X ¯ X (cid:62) ] + H ( ¯ XZ (cid:62) ) + H ( ZX (cid:62) ) . From ¯ X ¯ X (cid:62) ¯ U = ¯ G ¯ Λ = ¯ U ¯ Λ we get (cid:107) G ¯ U − ¯ U ¯ Λ − H ( ZX (cid:62) ) ¯ U (cid:107) ,p = (cid:107) G ¯ U − ¯ X ¯ X (cid:62) ¯ U − H ( ZX (cid:62) ) ¯ U (cid:107) ,p = (cid:107) [ H ( ¯ X ¯ X (cid:62) ) − ¯ X ¯ X (cid:62) + H ( ¯ XZ (cid:62) )] ¯ U (cid:107) ,p ≤ (cid:32) n (cid:88) m =1 ( (cid:107) ¯ x m (cid:107) (cid:107) ¯ U m (cid:107) ) p (cid:33) /p + (cid:107)H ( ¯ XZ (cid:62) ) ¯ U (cid:107) ,p . On the one hand, we have n (cid:88) m =1 ( (cid:107) ¯ x m (cid:107) (cid:107) ¯ U m (cid:107) ) p ≤ max m ∈ [ n ] (cid:107) ¯ x m (cid:107) p n (cid:88) m =1 (cid:107) ¯ U m (cid:107) p = (cid:107) ¯ X (cid:107) p , ∞ (cid:107) ¯ U (cid:107) p ,p ≤ ( γ ¯∆ (cid:107) ¯ U (cid:107) ,p ) p , (cid:107) ¯ X (cid:107) , ∞ ≤ (cid:107) ¯ X ¯ X (cid:62) (cid:107) , ∞ ≤ γ ¯∆ in Assumption B.1. On the other hand, { z j } j (cid:54) = m are independent, (cid:107)(cid:104) ¯ x m , z j (cid:105)(cid:107) ψ (cid:46) (cid:107) Σ / ¯ x m (cid:107) , ¯ U = ( ¯ u , · · · , ¯ u r ) and (cid:107) ¯ u j (cid:107) = 1 for j ∈ [ r ] .Then (cid:107) [ H ( ¯ XZ (cid:62) )] m ¯ u j (cid:107) ψ = (cid:107) ( ¯ XZ (cid:62) ) m ( I − e m e (cid:62) m ) ¯ u j (cid:107) ψ = (cid:13)(cid:13)(cid:13)(cid:13) (cid:88) k (cid:54) = m ¯ u jk (cid:104) ¯ x m , z j (cid:105) (cid:13)(cid:13)(cid:13)(cid:13) ψ (cid:46) (cid:107) Σ / ¯ x m (cid:107) , j ∈ [ r ] , m ∈ [ n ] . Lemma H.3 forces (cid:107)H ( ¯ XZ (cid:62) ) ¯ U (cid:107) ,p = O P ( √ p (cid:107) M (cid:107) ,p ; p ) , where M ij = (cid:107) Σ / ¯ x i (cid:107) . Hence (cid:107) M (cid:107) ,p = (cid:20) n (cid:88) i =1 (cid:18) r (cid:88) j =1 (cid:107) Σ / ¯ x i (cid:107) (cid:19) p/ (cid:21) /p = √ r (cid:107) ¯ X Σ / (cid:107) ,p , (cid:107)H ( ¯ XZ (cid:62) ) ¯ U (cid:107) ,p = O P ( √ rp (cid:107) ¯ X Σ / (cid:107) ,p ; p ) = O P ( (cid:112) r/n ¯∆ (cid:107) ¯ U (cid:107) ,p ; p ) , where the last equality follows from Assumption B.2. By combining the two parts we get (cid:107) G ¯ U − ¯ U ¯ Λ − H ( ZX (cid:62) ) ¯ U (cid:107) ,p = ( γ + (cid:112) r/n ) O P ( ¯∆ (cid:107) ¯ U (cid:107) ,p ; p ) , (cid:107) G ¯ U ¯ Λ − − H ( ZX (cid:62) ) ¯ U ¯ Λ − (cid:107) ,p ≤ (cid:107) G ¯ U − ¯ U ¯ Λ − H ( ZX (cid:62) ) ¯ U (cid:107) ,p (cid:107) ¯ Λ − (cid:107) + (cid:107) ¯ U (cid:107) ,p = O P ( (cid:107) ¯ U (cid:107) ,p ; p ) . (B.19)To study H ( ZX (cid:62) ) ¯ U , we decompose it into H ( Z ¯ X (cid:62) ) ¯ U + H ( ZZ (cid:62) ) ¯ U . Note that [ H ( Z ¯ X (cid:62) ) ¯ U ] mj = ( Z ¯ X (cid:62) ) m ( I − e m e (cid:62) m ) ¯ u j = (cid:104) z m , ¯ X (cid:62) ( I − e m e (cid:62) m ) ¯ u j (cid:105) , (cid:107) [ H ( Z ¯ X (cid:62) ) ¯ U ] mj (cid:107) ψ (cid:46) (cid:107) Σ / ¯ X (cid:62) ( I − e m e (cid:62) m ) ¯ u j (cid:107) . Lemma H.3 forces (cid:107)H ( Z ¯ X (cid:62) ) ¯ U (cid:107) ,p = O P ( √ p (cid:107) M (cid:107) ,p ; p ) , where M ij = (cid:107) Σ / ¯ X (cid:62) ( I − e m e (cid:62) m ) ¯ u j (cid:107) . 
From r (cid:88) j =1 (cid:107) Σ / ¯ X (cid:62) ( I − e m e (cid:62) m ) ¯ u j (cid:107) = (cid:28) ( I − e m e (cid:62) m ) ¯ X Σ ¯ X (cid:62) ( I − e m e (cid:62) m ) , r (cid:88) j =1 ¯ u j ¯ u (cid:62) j (cid:29) ≤ Tr( ¯ X Σ ¯ X (cid:62) ) = (cid:107) ¯ X Σ / (cid:107) , we get (cid:107) M (cid:107) ,p = (cid:20) n (cid:88) m =1 (cid:18) r (cid:88) j =1 (cid:107) Σ / ¯ X (cid:62) ( I − e m e (cid:62) m ) ¯ u j (cid:107) (cid:19) p/ (cid:21) /p = n /p (cid:107) ¯ X Σ / (cid:107) , ≤ n / (cid:107) ¯ X Σ / (cid:107) ,p , (cid:107)H ( Z ¯ X (cid:62) ) ¯ U (cid:107) ,p = O P ( √ np (cid:107) ¯ X Σ / (cid:107) ,p ; p ) = O P ( ¯∆ (cid:107) ¯ U (cid:107) ,p ; p ) , (B.20)where we used Assumption B.2 to get the last equality.Note that (cid:107) ¯ U (cid:107) = 1 and (cid:107) [ H ( ZZ (cid:62) ) ¯ U ] m (cid:107) = (cid:107) (cid:80) j (cid:54) = m (cid:104) z m , z j (cid:105) ¯ U j (cid:107) , ∀ m ∈ [ n ] . LemmaH.2 asserts that (cid:107)H ( ZZ (cid:62) ) ¯ U (cid:107) ,p = (cid:18) n (cid:88) m =1 (cid:13)(cid:13)(cid:13)(cid:13) (cid:88) j (cid:54) = m (cid:104) z m , z j (cid:105) ¯ U j (cid:13)(cid:13)(cid:13)(cid:13) p (cid:19) /p n /p (cid:107) ¯ U (cid:107) p O P (cid:0) √ rp max {(cid:107) Σ (cid:107) HS , √ n (cid:107) Σ (cid:107) op } ; p ∧ n (cid:1) = O P (cid:16) n /p √ rp max {(cid:107) Σ (cid:107) HS , √ n (cid:107) Σ (cid:107) op } ; p ∧ n (cid:17) = O P ( ¯∆ (cid:107) ¯ U (cid:107) ,p ; p ∧ n ) . (B.21)The last equality is due to Assumption B.2. Then we complete the proof using (B.19), (B.20)and (B.21). C Proofs of Section 2
C.1 Proof of Theorem 2.1
We will invoke Theorem B.1 to prove Theorem 2.1 in the Hilbert setting (under Assumptions2.4, 2.5 and 2.6). We claim that Assumption B.2 holds, p (cid:46) n and γ (cid:107) ¯ G (cid:107) , ∞ / ¯∆ (cid:28) (cid:112) r/n. (C.1)In that case, Theorem B.1 asserts that (cid:107) U sgn( H ) (cid:107) ,p = O P (cid:0) (cid:107) ¯ U (cid:107) ,p + γ ¯∆ − (cid:107) ¯ G (cid:107) ,p ; p (cid:1) , (C.2) (cid:107) U sgn( H ) − G ¯ U ¯ Λ − (cid:107) ,p = O P (cid:0) κγ (cid:107) ¯ U (cid:107) ,p + γ ¯∆ − (cid:107) ¯ G (cid:107) ,p ; p (cid:1) ., (C.3) (cid:107) U Λ / sgn( H ) − G ¯ U ¯ Λ − / (cid:107) ,p = O P ( κ / γ ¯∆ / (cid:107) ¯ U (cid:107) ,p + κ / γ ¯∆ − / (cid:107) ¯ G (cid:107) ,p ; p ∧ n ) . (C.4)When ≤ p < ∞ , we have n − / (cid:107) v (cid:107) ≤ n − /p (cid:107) v (cid:107) p ≤ (cid:107) v (cid:107) ∞ , ∀ v ∈ R n . This inequalityand (C.1) force that γ (cid:107) ¯ G (cid:107) ,p ≤ γn /p (cid:107) ¯ G (cid:107) , ∞ (cid:28) n /p ¯∆ (cid:112) r/n = n /p ¯∆ n − / (cid:107) ¯ U (cid:107) , ≤ ¯∆ (cid:107) ¯ U (cid:107) ,p . Hence γ ¯∆ − (cid:107) ¯ G (cid:107) ,p = o ( (cid:107) ¯ U (cid:107) ,p ) . The first and last equation in Theorem 2.1 directly followfrom (C.2), (C.3), (C.4) and κγ (cid:28) /µ (cid:46) in Assumption 2.3.To control (cid:107) U sgn( H ) − [ ¯ U + H ( ZX (cid:62) ) ¯ U ¯ Λ − ] (cid:107) ,p , we invoke Lemma B.2 to get (cid:107) G ¯ U − ¯ U ¯ Λ − H ( ZX (cid:62) ) ¯ U (cid:107) ,p = ( γ + (cid:112) r/n ) O P ( ¯∆ (cid:107) ¯ U (cid:107) ,p ; p ) = o P ( ¯∆ (cid:107) ¯ U (cid:107) ,p ; p ) . Then (cid:107) U sgn( H ) − [ ¯ U + H ( ZX (cid:62) ) ¯ U ¯ Λ − ] (cid:107) ,p ≤ (cid:107) U sgn( H ) − G ¯ U ¯ Λ − (cid:107) ,p + (cid:107) G ¯ U ¯ Λ − − [ ¯ U + H ( ZX (cid:62) ) ¯ U ¯ Λ − ] (cid:107) ,p ≤ (cid:107) U sgn( H ) − G ¯ U ¯ Λ − (cid:107) ,p + (cid:107) G ¯ U − [ ¯ U ¯ Λ + H ( ZX (cid:62) ) ¯ U ] (cid:107) ,p (cid:107) ¯ Λ − (cid:107) = o P ( (cid:107) ¯ U (cid:107) ,p ; p ∧ n ) . We get all the desired results in Theorem 2.1, provided that Assumption B.2, p (cid:46) n and(C.1) hold.The claim p (cid:46) n is easy to prove: p (i) (cid:46) ( µγ ) − (cid:46) γ − ≤ ( κµ (cid:112) r/n ) − = nrκ µ ≤ n, (i) the condition on p ; (ii) µ ≥ ; (iii) Assumption 2.1; (iv) r ≥ , κ ≥ and µ ≥ .To verify (C.1), we start from (cid:107) ¯ G (cid:107) , ∞ = (cid:107) ¯ X ¯ X (cid:62) (cid:107) , ∞ ≤ (cid:107) ¯ X (cid:107) , ∞ (cid:107) ¯ X (cid:107) = (cid:107) ¯ X (cid:107) , ∞ (cid:107) ¯ X (cid:107) · (cid:107) ¯ X (cid:107) ≤ ( µ (cid:112) r/n )( κ ¯∆) = κµ (cid:112) r/n · ¯∆ , (C.5)where (i) is due to µ ≥ ( (cid:107) ¯ X (cid:107) , ∞ / (cid:107) ¯ X (cid:107) ) (cid:112) n/r and (cid:107) ¯ X (cid:107) = (cid:107) ¯ G (cid:107) = κ ¯∆ . Assumption 2.1forces γ ≥ κµ (cid:112) r/n and (cid:107) ¯ G (cid:107) , ∞ / ¯∆ ≤ γ. (C.6)In addition, (C.5) and the condition γ (cid:28) ( κµ ) − in Assumption 2.1 imply (C.1)It remain to check Assumption B.2. 
To prove √ np (cid:107) ¯ X Σ / (cid:107) ,p (cid:46) ¯∆ (cid:107) ¯ U (cid:107) ,p , we first provean inequality in (cid:107) · (cid:107) , ∞ and then convert it to (cid:107) · (cid:107) ,p using n − / (cid:107) v (cid:107) ≤ n − /p (cid:107) v (cid:107) p ≤ (cid:107) v (cid:107) ∞ , ∀ v ∈ R n (C.7)By elementary calculation, ¯∆ √ rn (cid:107) ¯ X Σ / (cid:107) , ∞ (i) ≥ ¯∆ √ rn (cid:107) ¯ X (cid:107) , ∞ (cid:107) Σ (cid:107) / = (cid:18) ¯∆ κn (cid:107) Σ (cid:107) (cid:19) / (cid:112) κr ¯∆ /n (cid:107) ¯ X (cid:107) , ∞ (ii) = (cid:18) ¯∆ κn (cid:107) Σ (cid:107) (cid:19) / (cid:18) (cid:107) ¯ X (cid:107) (cid:107) ¯ X (cid:107) , ∞ (cid:114) rn (cid:19) (iii) ≥ (cid:18) ¯∆ κn (cid:107) Σ (cid:107) (cid:19) / µ (iv) ≥ µγ (v) (cid:38) √ p. where we used (i) (cid:107) ¯ X Σ / (cid:107) , ∞ ≤ (cid:107) ¯ X (cid:107) , ∞ (cid:107) Σ (cid:107) / ; (ii) κ ¯∆ = (cid:107) ¯ G (cid:107) = (cid:107) ¯ X (cid:107) ; (iii) µ ≥ (cid:107) ¯ X (cid:107) , ∞ (cid:107) ¯ X (cid:107) (cid:112) nr ; (iv) γ ≥ ( κn (cid:107) Σ (cid:107) / ¯∆) / in Assumption 2.3; (v) p (cid:46) ( µγ ) − . We use (C.7) to get √ np (cid:107) ¯ X Σ / (cid:107) ,p ≤ √ npn /p (cid:107) ¯ X Σ / (cid:107) , ∞ (cid:46) √ npn /p √ r ¯∆ /n √ p = ¯∆ n /p (cid:112) r/n = ¯∆ n /p n − / (cid:107) ¯ U (cid:107) , ≤ ¯∆ n /p n − /p (cid:107) ¯ U (cid:107) ,p = ¯∆ (cid:107) ¯ U (cid:107) ,p . We finally prove n /p √ rp max {(cid:107) Σ (cid:107) HS , √ n (cid:107) Σ (cid:107) op } (cid:46) ¯∆ (cid:107) ¯ U (cid:107) ,p . By Assumption 2.3, max { ( nκ (cid:107) Σ (cid:107) / ¯∆) / , √ n (cid:107) Σ (cid:107) F / ¯∆ } ≤ γ. Since γ (cid:28) according to Assumption 2.1, we have nκ (cid:107) Σ (cid:107) / ¯∆ (cid:28) ( nκ (cid:107) Σ (cid:107) / ¯∆) / ≤ γ (cid:28) .Hence √ p (cid:46) ( µγ ) − (cid:46) γ − ≤ { nκ (cid:107) Σ (cid:107) / ¯∆ , √ n (cid:107) Σ (cid:107) F / ¯∆ } ≤ ¯∆ /n max {(cid:107) Σ (cid:107) , n − / (cid:107) Σ (cid:107) F } . By the conversion (C.7), n /p √ rp max {(cid:107) Σ (cid:107) F , √ n (cid:107) Σ (cid:107) } = n /p +1 / √ rp max { n − / (cid:107) Σ (cid:107) F , (cid:107) Σ (cid:107) } (cid:46) n /p +1 / √ r ¯∆ /n = ¯∆ n /p − / (cid:107) ¯ U (cid:107) , ≤ ¯∆ (cid:107) ¯ U (cid:107) ,p . Proofs of Section 3.1
D.1 A useful lemma
We first prove a useful lemma bridging (cid:96) p approximation and misclassification rates. Lemma D.1.
Suppose that v = v n , w = w n and ¯ v = ¯ v n are random vectors in R n , min i ∈ [ n ] | ¯ v i | = δ n > , and p = p n → ∞ . If min s = ± (cid:107) s v − ¯ v − w (cid:107) p = o P ( n /p δ n ; p ) ,then lim sup n →∞ p − log (cid:18) n E min s = ± |{ i ∈ [ n ] : s sgn( v i ) (cid:54) = sgn(¯ v i ) }| (cid:19) ≤ lim sup ε → lim sup n →∞ p − log (cid:18) n n (cid:88) i =1 P ( − w i sgn(¯ v i ) ≥ (1 − ε ) | ¯ v i | ) (cid:19) . Proof of Lemma D.1.
Let S n = { i ∈ [ n ] : sgn( v i ) (cid:54) = sgn(¯ v i ) } and r = v − ¯ v − w . Fornotational simplicity, we will prove the upper bound for lim sup n →∞ p − log( E | S n | /n ) under astronger assumption (cid:107) r (cid:107) p = o P ( n /p δ n ; p ) . Otherwise we just redefine v as (argmin s = ± (cid:107) s v − ¯ v − w (cid:107) p ) v and go through the same proof.As a matter of fact, S n ⊆ { i ∈ [ n ] : − ( v i − ¯ v i ) sgn(¯ v i ) ≥ | ¯ v i |} = { i ∈ [ n ] : − ( w i + r i ) sgn(¯ v i ) ≥ | ¯ v i |} . For any ε ∈ (0 , , { i ∈ [ n ] : − r i sgn(¯ v i ) < ε | ¯ v i | and − w i sgn(¯ v i ) < (1 − ε ) | ¯ v i |}⊆ { i ∈ [ n ] : − ( w i + r i ) sgn(¯ v i ) < | ¯ v i |} . Hence S n ⊆ { i ∈ [ n ] : − r i sgn(¯ v i ) ≥ ε | ¯ v i | or − w i sgn(¯ v i ) ≥ (1 − ε ) | ¯ v i |}⊆ { i ∈ [ n ] : | r i | ≥ ε | ¯ v i |} ∪ { i ∈ [ n ] : − w i sgn(¯ v i ) ≥ (1 − ε ) | ¯ v i |} . Let q n ( ε ) = n (cid:80) ni =1 P ( − w i sgn(¯ v i ) ≥ (1 − ε ) | ¯ v i | ) . We have E | S n | ≤ E |{ i ∈ [ n ] : | r i | ≥ ε | ¯ v i |}| + nq n ( ε ) .To study { i ∈ [ n ] : | r i | ≥ ε | ¯ v i |} , we define E n = {(cid:107) r (cid:107) p < ε n /p δ n } . Since (cid:107) r (cid:107) p = o P ( n /p δ n ; p ) , there exist C , N ∈ Z + such that P ( E cn ) ≤ C e − p/ε , ∀ n ≥ N . When E n happens, |{ i ∈ [ n ] : | r i | ≥ ε | ¯ v i |}| ≤ |{ i ∈ [ n ] : | r i | ≥ εδ n }| ≤ (cid:107) r (cid:107) pp ( εδ n ) p ≤ ( ε n /p δ n ) p ( εδ n ) p = nε p . Then by log t = log(1 + t − ≤ t − < t for t ≥ , we have log(1 /ε ) ≤ /ε , n − E |{ i ∈ [ n ] : | r i | ≥ ε | ¯ v i |}| ≤ ε p P ( E n ) + 1 · P ( E cn ) ≤ e − p log(1 /ε ) + C e − p/ε ≤ ( C ∨ e − p log(1 /ε ) , n − E | S n | ≤ ( C ∨ e − p log(1 /ε ) + q n ( ε ) . As a result, log( E | S n | /n ) ≤ log(( C ∨ e − p log(1 /ε ) + q n ( ε )) ≤ log[2 max { ( C ∨ e − p log(1 /ε ) , q n ( ε ) } ] ≤ log 2 + max { log( C ∨ − p log(1 /ε ) , log q n ( ε ) } . The assumption p = p n → ∞ leads to lim sup n →∞ p − log( E | S n | /n ) ≤ max {− log(1 /ε ) , lim sup n →∞ p − log q n ( ε ) } , ∀ ε ∈ (0 , . By letting ε → we finish the proof. D.2 Proof of Theorem 3.1
We supress the subscripts of λ , ¯ λ , u and ¯ u . Note that ¯∆ = ¯ λ = n (cid:107) µ (cid:107) and κ = 1 .Assumption 2.4 holds for µ = 1 and / √ n ≤ γ (cid:28) . Assumption 2.6 holds when γ ≥ / √ SNR . Taking γ = 1 / √ SNR ∧ n ensures all the assumptions for Theorem 2.1 to hold.We first consider the case where (cid:28) SNR (cid:46) log n and take p = SNR . By Theorem 2.1, min c = ± (cid:107) c u − ¯ u − H ( ZX (cid:62) ) ¯ u / ¯ λ (cid:107) p = o P ( (cid:107) ¯ u (cid:107) p ; p ) . Since ¯ u = n − / y n , Lemma D.1 asserts that lim sup n →∞ p − log E M [sgn( u )] = lim sup n →∞ p − log (cid:18) n E min s = ± |{ i ∈ [ n ] : s sgn( u i ) (cid:54) = sgn(¯ u i ) }| (cid:19) ≤ lim sup ε → lim sup n →∞ p − log (cid:18) n n (cid:88) i =1 P (cid:0) − [ H ( ZX (cid:62) ) ¯ u / ¯ λ ] i sgn(¯ u i ) ≥ (1 − ε ) | ¯ u i | (cid:1) (cid:19) . From [ H ( ZX (cid:62) ) ¯ u ] i = (cid:80) j (cid:54) = i (cid:104) z i , x j (cid:105) ¯ u j and ¯ λ = n (cid:107) µ (cid:107) we obtain that P (cid:0) − [ H ( ZX (cid:62) ) ¯ u / ¯ λ ] i sgn(¯ u i ) ≥ (1 − ε ) | ¯ u i | (cid:1) ≤ P (cid:0) | [ H ( ZX (cid:62) ) ¯ u / ¯ λ ] i | ≥ (1 − ε ) / √ n (cid:1) ≤ P (cid:18)(cid:12)(cid:12)(cid:12)(cid:12)(cid:28) z i , (cid:88) j (cid:54) = i x j ¯ u j (cid:29)(cid:12)(cid:12)(cid:12)(cid:12) ≥ √ n (cid:107) µ (cid:107) / (cid:19) , ∀ ε ∈ (0 , / . The estimates above yields lim sup n →∞ p − log E M [sgn( u )] ≤ lim sup n →∞ p − log (cid:20) n n (cid:88) i =1 P (cid:18)(cid:12)(cid:12)(cid:12)(cid:12)(cid:28) z i , (cid:88) j (cid:54) = i x j ¯ u j (cid:29)(cid:12)(cid:12)(cid:12)(cid:12) ≥ √ n (cid:107) µ (cid:107) / (cid:19)(cid:21) . (D.1)Note that ¯ u j = y j / √ n and x j = y j µ + z j , we have (cid:88) j (cid:54) = i x j ¯ u j = (cid:88) j (cid:54) = i ( µ + y j z j ) = (cid:114) n − n ( √ n − µ + w i ) , w i = √ n − (cid:80) j (cid:54) = i y j z j . Hence P (cid:18)(cid:12)(cid:12)(cid:12)(cid:12)(cid:28) z i , (cid:88) j (cid:54) = i x j ¯ u j (cid:29)(cid:12)(cid:12)(cid:12)(cid:12) ≥ √ n (cid:107) µ (cid:107) (cid:19) = P (cid:18)(cid:12)(cid:12)(cid:12)(cid:12)(cid:28) z i , √ n − µ + w i (cid:107) Σ / ( √ n − µ + w i ) (cid:107) (cid:29)(cid:12)(cid:12)(cid:12)(cid:12) ≥ √ n (cid:107) µ (cid:107) (cid:107) Σ / ( √ n − µ + w i ) (cid:107) (cid:19) . (D.2)By the triangle’s inequality, (cid:107) Σ / ( √ n − µ + w i ) (cid:107) ≤ √ n (cid:107) Σ (cid:107) / (cid:107) µ (cid:107) + (cid:107) Σ / w i (cid:107) . Since Σ / w i satisties th Assumption 2.5 with Σ replaced by Σ , Lemma H.1 yields (cid:107) Σ / w i (cid:107) = O P (max { Tr( Σ ) , n (cid:107) Σ (cid:107) op } ; n ) = O P (max {(cid:107) Σ (cid:107) , n (cid:107) Σ (cid:107) } ; n ) There exist constants c , c > such that P ( (cid:107) Σ / w i (cid:107) > c max {(cid:107) Σ (cid:107) HS , √ n (cid:107) Σ (cid:107) op } ) < c e − n . The assumption
SNR (cid:29) yields (cid:107) µ (cid:107) (cid:29) (cid:107) Σ (cid:107) op } and thus P ( (cid:107) Σ / ( √ n − µ + w i ) (cid:107) > ( c + 1) max {(cid:107) Σ (cid:107) HS , √ n (cid:107) Σ (cid:107) / (cid:107) µ (cid:107)} ) < c e − n . By (D.2) and the definition of
SNR , P (cid:18)(cid:12)(cid:12)(cid:12)(cid:12)(cid:28) z i , (cid:88) j (cid:54) = i x j ¯ u j (cid:29)(cid:12)(cid:12)(cid:12)(cid:12) ≥ √ n (cid:107) µ (cid:107) / (cid:19) ≤ P (cid:18)(cid:12)(cid:12)(cid:12)(cid:12)(cid:28) z i , √ n − µ + w i (cid:107) Σ / ( √ n − µ + w i ) (cid:107) (cid:29)(cid:12)(cid:12)(cid:12)(cid:12) ≥ √ SNR2( c + 1) (cid:19) + c e − n . (D.3)The desired result lim sup n →∞ SNR − log E M [sgn( u )] < − c for some constant c > follows from (D.1) and (cid:13)(cid:13)(cid:13)(cid:13)(cid:28) z i , √ n − µ + w i (cid:107) Σ / ( √ n − µ + w i ) (cid:107) (cid:29)(cid:13)(cid:13)(cid:13)(cid:13) ψ (cid:46) . (D.4)Here we used the independence between z i and w i = √ n − (cid:80) j (cid:54) = i y j z j .From now on we consider the case where SNR ≥ C log n for some constant C > , andtake p = SNR . By Theorem 2.1 and Fact 2.1, min c = ± (cid:107) c u − ¯ u − H ( ZX (cid:62) ) ¯ u / ¯ λ (cid:107) ∞ = o P ( (cid:107) ¯ u (cid:107) ∞ ; log n ) . As a result, P (cid:16) min c = ± (cid:107) c u − ¯ u − H ( ZX (cid:62) ) ¯ u / ¯ λ (cid:107) ∞ > / (2 √ n ) (cid:17) (cid:46) /n. (D.5)38n the other hand, repeating the arguments from (D.2) to (D.3) yields P (cid:16) (cid:107)H ( ZX (cid:62) ) ¯ u / ¯ λ (cid:107) ∞ > / (2 √ n ) (cid:17) = P (cid:32) max i ∈ [ n ] (cid:12)(cid:12)(cid:12)(cid:12)(cid:28) z i , (cid:88) j (cid:54) = i x j ¯ u j (cid:29)(cid:12)(cid:12)(cid:12)(cid:12) > √ n (cid:107) µ (cid:107) (cid:33) ≤ P (cid:18)(cid:12)(cid:12)(cid:12)(cid:12)(cid:28) z i , √ n − µ + w i (cid:107) Σ / ( √ n − µ + w i ) (cid:107) (cid:29)(cid:12)(cid:12)(cid:12)(cid:12) ≥ √ C log n c + 1) (cid:19) + c e − n . (D.6)where w i = √ n − (cid:80) j (cid:54) = i y j z j is independent of z i . (D.6) and (D.4) imply that when C is largeenough, P (cid:16) (cid:107)H ( ZX (cid:62) ) ¯ u / ¯ λ (cid:107) ∞ > / (2 √ n ) (cid:17) ≤ /n + c e − n . (D.7)Finally, it follows from (D.5) and (D.7) that P [sgn( u ) (cid:54) = ± sgn( ¯ u )] ≤ P (cid:16) min c = ± (cid:107) c u − ¯ u (cid:107) ∞ > / √ n (cid:17) (cid:46) /n. E Proofs of Section 3.2
E.1 A technical lemma
The following technical lemma will be used in the analysis of misclassification rates.
Lemma E.1.
Consider the Gaussian mixture model in Definition 3.1 with d ≥ . Let R = (cid:107) µ (cid:107) and p = SNR = R / ( R + d/n ) . If n → ∞ and SNR → ∞ , then for any fixed i we have (cid:107) ˆ µ ( − i ) − µ (cid:107) = O P ( (cid:112) ( d ∨ p ) /n ; p ) , (cid:12)(cid:12)(cid:12) (cid:107) ˆ µ ( − i ) (cid:107) − (cid:112) R + d/ ( n − (cid:12)(cid:12)(cid:12) = O P ( (cid:112) p/n ; p ) , (cid:107) x i (cid:107) = O P ( R ∨ √ d ; p ) , (cid:104) ˆ µ ( − i ) − µ , x i (cid:105) = (cid:112) p/nO P ( R ∨ √ d ; p ) and (cid:104) ˆ µ ( − i ) , x i (cid:105) = O P ( R ; p ) . Proof of Lemma E.1.
Let w i = (cid:80) j (cid:54) = i z j y j and note that ( n −
1) ˆ µ ( − i ) = (cid:80) j (cid:54) = i x j y j = (cid:80) j (cid:54) = i ( µ y j + z j ) y j = ( n − µ + w i . From w i ∼ N ( , ( n − I d ) we get (cid:107) w i (cid:107) / ( n − ∼ χ d ,and Lemma H.4 leads to (cid:107) w i (cid:107) / ( n − − d = O P ( p ∨ √ pd ; p ) . Then (cid:107) ˆ µ ( − i ) − µ (cid:107) = ( n − − (cid:107) w i (cid:107) = d + O P ( p ∨ √ pd ; p ) n − O P (( d ∨ p ) /n ; p ) , and (cid:107) ˆ µ ( − i ) − µ (cid:107) = O P ( (cid:112) ( d ∨ p ) /n ; p ) . To study (cid:107) ˆ µ ( − i ) (cid:107) , we start from the decomposition (cid:107) ˆ µ ( − i ) (cid:107) = (cid:107) µ (cid:107) + 2( n − − (cid:104) µ , w i (cid:105) + ( n − − (cid:107) w i (cid:107) . Since (cid:104) µ , w i (cid:105) ∼ N (0 , ( n − R ) , Lemma H.3 yields (cid:104) µ , w i (cid:105) = O P ( R √ np ; p ) . We use theseand √ p ≤ R to derive (cid:107) ˆ µ ( − i ) (cid:107) = R + 2 · O P ( R √ np ; p ) n − d + O P ( p ∨ √ pd ; p ) n − R + dn − { R √ np, p, √ pd } n O P (1; p ) R + dn − { R √ np, √ pd } n O P (1; p )= R + dn − (cid:114) pn O P ( R ∨ (cid:112) d/n ; p ) . Based on this and (cid:112) R + d/ ( n − ≥ (cid:112) R + d/n (cid:16) R ∨ (cid:112) d/n , (cid:12)(cid:12)(cid:12) (cid:107) ˆ µ ( − i ) (cid:107) − (cid:112) R + d/ ( n − (cid:12)(cid:12)(cid:12) = (cid:12)(cid:12) (cid:107) ˆ µ ( − i ) (cid:107) − [ R + d/ ( n − (cid:12)(cid:12) (cid:107) ˆ µ ( − i ) (cid:107) + (cid:112) R + d/ ( n − ≤ (cid:112) p/nO P ( R ∨ (cid:112) d/n ; p ) (cid:112) R + d/ ( n −
1) = O P ( (cid:112) p/n ; p ) . From (cid:107) z i (cid:107) ∼ χ d and Lemma H.4 we get (cid:107) z i (cid:107) = d + O P ( √ pd ∨ p ; p ) = O P ( p ∨ d ; p ) .Hence (cid:107) x i (cid:107) ≤ (cid:107) µ (cid:107) + (cid:107) z i (cid:107) = R + O P ( √ p ∨ d ; p ) = O P ( R ∨ √ d ; p ) as R ≥ √ p .Now we study (cid:104) ˆ µ ( − i ) − µ , x i (cid:105) = (cid:104) ˆ µ ( − i ) − µ , µ (cid:105) y i + (cid:104) ˆ µ ( − i ) − µ , z i (cid:105) . On the one hand, (cid:104) ˆ µ ( − i ) − µ , µ (cid:105) = ( n − − (cid:104) w i , µ (cid:105) ∼ N (0 , R / ( n − and Lemma H.3 implies that (cid:104) ˆ µ ( − i ) − µ , µ (cid:105) = O P ( R (cid:112) p/n ; p ) . On the other hand, (cid:104) ˆ µ ( − i ) − µ , z i (cid:105) / (cid:107) ˆ µ ( − i ) − µ (cid:107) ∼ N (0 , leadsto (cid:104) ˆ µ ( − i ) − µ , z i (cid:105) / (cid:107) ˆ µ ( − i ) − µ (cid:107) = O P ( √ p ; p ) . Since (cid:107) ˆ µ ( − i ) − µ (cid:107) = O P ( (cid:112) ( d ∨ p ) /n ; p ) , wehave (cid:104) ˆ µ ( − i ) − µ , z i (cid:105) = (cid:112) p/nO P ( √ p ∨ d ; p ) . As a result, (cid:104) ˆ µ ( − i ) − µ , x i (cid:105) = (cid:112) p/nO P ( R ∨ √ d ; p ) . Note that |(cid:104) µ , x i (cid:105)| ≤ |(cid:107) µ (cid:107) y i + (cid:104) µ , z i (cid:105)| ≤ R + |(cid:104) µ , z i (cid:105)| . From (cid:104) µ , z i (cid:105) ∼ N (0 , R ) weobtain that (cid:104) µ , z i (cid:105) = O P ( R √ p ; p ) . The fact √ p ≤ R leads to (cid:104) µ , x i (cid:105) = O P ( R ; p ) and (cid:104) ˆ µ ( − i ) , x i (cid:105) = (cid:104) µ , x i (cid:105) + (cid:104) ˆ µ ( − i ) − µ , x i (cid:105) = O P ( R + (cid:112) p/n ( R ∨ √ d ); p ) = O P ( R ; p ) , where we also applied (cid:112) pd/n = R (cid:112) d/n/ (cid:112) R + d/n ≤ R . E.2 Proof of Theorem 3.2
We supress the subscripts of λ , ¯ λ , u and ¯ u . When SNR (cid:29) log n , the first part in Theorem3.1 implies that P [sgn( u ) = ± y ] → . From now on we assume that (cid:28) SNR (cid:46) log n andlet p = SNR . Repeating the derivation of (D.1) in the proof of Theorem 3.1 and using theexchangeability of { z i } ni =1 , we get lim sup n →∞ p − log E M (sgn( u ) , y ) ≤ lim sup ε → lim sup n →∞ p − log P (cid:18)(cid:12)(cid:12)(cid:12)(cid:12)(cid:28) z i , (cid:88) j (cid:54) = i x j ¯ u j (cid:29)(cid:12)(cid:12)(cid:12)(cid:12) ≥ (1 − ε ) √ n (cid:107) µ (cid:107) (cid:19) . (E.1)Since (cid:80) j (cid:54) = i x j ¯ u j = ( n −
1) ˆ µ − i / √ n , we get P (cid:18)(cid:12)(cid:12)(cid:12)(cid:12)(cid:28) z i , (cid:88) j (cid:54) = i x j ¯ u j (cid:29)(cid:12)(cid:12)(cid:12)(cid:12) ≥ (1 − ε ) √ n (cid:107) µ (cid:107) (cid:19) ≤ P (cid:18)(cid:12)(cid:12)(cid:12)(cid:12) (cid:104) z i , ˆ µ − i (cid:105)(cid:107) ˆ µ − i (cid:107) (cid:12)(cid:12)(cid:12)(cid:12) ≥ (1 − ε ) (cid:107) µ (cid:107) (cid:107) ˆ µ − i (cid:107) (cid:19) . (E.2)40et R = (cid:107) µ (cid:107) . Lemma E.1 yields (cid:12)(cid:12)(cid:12) (cid:107) ˆ µ ( − i ) (cid:107) − (cid:112) R + d/ ( n − (cid:12)(cid:12)(cid:12) = O P ( (cid:112) p/n ; p ) . Hencethere exist constants C , C and N such that P ( (cid:107) ˆ µ ( − i ) (cid:107) − (cid:112) R + d/ ( n − ≥ C (cid:112) p/n ) ≤ C e − p , ∀ n ≥ N. (E.3)On the one hand, (cid:112) R + d/ ( n −
1) = [1 + o (1)] (cid:112) R + d/n = [1 + o (1)] R / √ p . On the otherhand, R / √ p (cid:112) p/n = √ nR p = √ nR R / ( R + d/n ) = √ n ( R + d/n ) R ≥ √ n. As a result, (E.3) implies that for any constant δ > , there exists a constant N (cid:48) such that P ( (cid:107) ˆ µ ( − i ) (cid:107) ≥ (1 + δ ) R / √ p ) ≤ C e − p , ∀ n ≥ N (cid:48) . (E.4)By (E.2) and (E.4), P (cid:18)(cid:12)(cid:12)(cid:12)(cid:12)(cid:28) z i , (cid:88) j (cid:54) = i x j ¯ u j (cid:29)(cid:12)(cid:12)(cid:12)(cid:12) ≥ (1 − ε ) √ n (cid:107) µ (cid:107) (cid:19) ≤ P (cid:18)(cid:12)(cid:12)(cid:12)(cid:12) (cid:104) z i , ˆ µ − i (cid:105)(cid:107) ˆ µ − i (cid:107) (cid:12)(cid:12)(cid:12)(cid:12) ≥ (1 − ε ) (cid:107) µ (cid:107) (1 + δ ) R / √ p (cid:19) + C e − p = P (cid:18)(cid:12)(cid:12)(cid:12)(cid:12) (cid:104) z i , ˆ µ − i (cid:105)(cid:107) ˆ µ − i (cid:107) (cid:12)(cid:12)(cid:12)(cid:12) ≥ − ε δ √ p (cid:19) + C e − p , ∀ n ≥ N (cid:48) . (E.5)The independence between z i and ˆ µ − i yields (cid:104) z i , ˆ µ − i (cid:105) / (cid:107) ˆ µ − i (cid:107) ∼ N (0 , . Then we get lim sup n →∞ p − log E M [sgn( u )] ≤ − / . (E.6)by (E.2), (E.5), standard tail bounds for Gaussian random variable and the fact that ε , δ are arbitrary.When SNR > (2 + ε ) log n for some constant ε , (E.6) implies the existence of positiveconstants ε (cid:48) and N (cid:48)(cid:48) such that E M (sgn( u ) , y ) ≤ n − − ε (cid:48) , ∀ n ≥ N (cid:48)(cid:48) . Then we must have P [ M (sgn( u ) , y ) = 0] → as any misclassified sample contributes n − to M (sgn( u ) , y ) . E.3 Proof of Theorem 3.3
It is easily checked that Assumptions 2.4, 2.5 and 2.6 hold with Σ = I , κ = 1 , µ = 1 and γ (cid:16) SNR . Theorem 2.1 then yields the desired result.
F Proof of Section 4
Define I ( t, a, b, c ) = a − ( a/b ) t ] + b − ( b/a ) t ] − c ( t + t ) for ( t, a, b, c ) ∈ R × (0 , + ∞ ) . It is easily seen that both a ( a/b ) t + b ( b/a ) t and t + t areconvex and achieve their minima at − / . Then I ∗ ( a, b, c ) = I ( − / , a, b, c ) = sup t ∈ R I ( t, a, b, c ) . .1 Useful lemmas We present three useful lemmas. The first one finds an (cid:96) ∞ approximation of the aggregatedspectral estimator. The second one concerns large deviation probabilities. The third onerelates genie-aided estimators to fundamental limits of clustering. Lemma F.1.
Let ¯ u = y / √ n and w = log( a/b ) A ¯ u + 2 R nR + d G ¯ u . For ˆ u defined by (4.3) , there exist some ε n → and constant C > such that P (min c = ± (cid:107) c ˆ u − w (cid:107) ∞ < ε n n − / log n ) > − Cn − . Proof of Lemma F.1.
Define, as in (4.2), v = n ( α − β )2 log (cid:18) αβ (cid:19) u ( A ) + 2 nR nR + d u ( G ) . Then (cid:107) v − w (cid:107) ∞ ≤ log( a/b ) (cid:107) [ n ( α − β ) / u ( A ) − A ¯ u (cid:107) ∞ + 2 R nR + d (cid:107) ( nR ) u ( G ) − G ¯ u (cid:107) ∞ , (F.1) (cid:107) ˆ u − v (cid:107) ∞ ≤ (cid:12)(cid:12)(cid:12)(cid:12) λ ( A ) log (cid:18) λ ( A ) + λ ( A ) λ ( A ) − λ ( A ) (cid:19) − n ( α − β )2 log (cid:18) αβ (cid:19)(cid:12)(cid:12)(cid:12)(cid:12) (cid:107) u ( A ) (cid:107) ∞ + (cid:12)(cid:12)(cid:12)(cid:12) λ ( G ) nλ ( G ) + nd − nR nR + d (cid:12)(cid:12)(cid:12)(cid:12) (cid:107) u ( G ) (cid:107) ∞ . (F.2)For simplicity, suppose that (cid:104) u ( G ) , ¯ u (cid:105) ≥ and (cid:104) u ( A ) , ¯ u (cid:105) ≥ . By Lemma B.1 andTheorem 2.1, we have | λ ( G ) − nR | = o P (1; n ) , (cid:107) u ( G ) − G ¯ u / ( nR ) (cid:107) ∞ = o P ( n − / ; log n ) , (cid:107) u ( G ) (cid:107) ∞ = O P ( n − / ; log n ) . Hence there exists ε n → and a constant C such that P ( | λ ( G ) /nR − | < ε n , (cid:107) u ( G ) − G ¯ u / ( nR ) (cid:107) ∞ < ε n / √ n, (cid:107) u ( G ) (cid:107) ∞ < C / √ n ) > − n − . (F.3)By mimicking the proof of Corollary 3.1 in Abbe et al. (2017) and applying Lemma 6 therein,we get ε n → and a constant C such that P ( max {| λ ( A ) − n ( α + β ) / | , | λ ( A ) − n ( α − β ) / |} < ε n (cid:112) log n, (cid:107) u ( A ) − A ¯ u / [ n ( α − β ) / (cid:107) ∞ < ε n / √ n, (cid:107) u ( A ) (cid:107) ∞ < C / √ n ) > − n − . (F.4)Inequalities (F.1), (F.2), (F.3) and (F.4) yield some ε n → and constant C > such that P ( (cid:107) ˆ u − w (cid:107) ∞ < ε n n − / log n ) > − Cn − . emma F.2. Let Assumption 4.2 hold and define W ni = (cid:32) R nR + d (cid:88) j (cid:54) = i (cid:104) x i , x j (cid:105) y j + log( a/b ) (cid:88) j (cid:54) = i A ij y j (cid:33) y i . For any fixed i , lim n →∞ q − n log P ( W ni ≤ εq n ) = − sup t ∈ R { εt + I ( t, a, b, c ) } , ∀ ε < a − b a/b ) + 2 c. Proof of Lemma F.2.
We will invoke Lemma H.5 to prove Lemma F.2, starting from thecalculation of E e tW ni . Conditioned on y i , (cid:80) j (cid:54) = i (cid:104) x i , x j (cid:105) y j and (cid:80) j (cid:54) = i A ij y j are independent.Hence E ( e tW ni | y i ) = E (cid:20) exp (cid:18) t · R nR + d (cid:88) j (cid:54) = i (cid:104) x i , x j (cid:105) y j y i (cid:19)(cid:12)(cid:12)(cid:12)(cid:12) y i (cid:21) · E (cid:20) exp (cid:18) t log( a/b ) (cid:88) j (cid:54) = i A ij y j y i (cid:19)(cid:12)(cid:12)(cid:12)(cid:12) y i (cid:21) . We claim that for any fixed t ∈ R , there exists N > such that when n > N , log E (cid:20) exp (cid:18) t · R nR + d (cid:88) j (cid:54) = i (cid:104) x i , x j (cid:105) y j y i (cid:19)(cid:12)(cid:12)(cid:12)(cid:12) y i (cid:21) = log E (cid:20) exp (cid:18) t · R nR + d (cid:88) j (cid:54) = i (cid:104) x i , x j (cid:105) y j y i (cid:19)(cid:21) = 2 c ( t + t )[1 + o (1)] q n , (F.5) log E (cid:20) exp (cid:18) t log( a/b ) y i (cid:88) j (cid:54) = i A ij y j (cid:19)(cid:12)(cid:12)(cid:12)(cid:12) y i (cid:21) = log E (cid:20) exp (cid:18) t log( a/b ) y i (cid:88) j (cid:54) = i A ij y j (cid:19)(cid:21) = a [( a/b ) t −
1] + b [( b/a ) t − o (1)] q n . (F.6)If (F.5) and (F.6) hold, then E ( e tW ni | y i ) = E (cid:20) exp (cid:18) t · R nR + d (cid:88) j (cid:54) = i (cid:104) x i , x j (cid:105) y j y i (cid:19)(cid:21) · E exp (cid:18) t log( a/b ) y i (cid:88) j (cid:54) = i A ij y j (cid:19) does not depend on y i , and q − n log E e tW ni = q − n log E (cid:20) exp (cid:18) t · R nR + d (cid:88) j (cid:54) = i (cid:104) x i , x j (cid:105) y j y i (cid:19)(cid:21) + q − n log E (cid:20) exp (cid:18) t log( a/b ) y i (cid:88) j (cid:54) = i A ij y j (cid:19)(cid:21) = (cid:18) a a/b ) t −
1] + b b/a ) t −
1] + 2 c ( t + t ) (cid:19) [1 + o (1)]= − I ( t, a, b, c )[1 + o (1)] . Lemma H.5 implies that for ε < − ∂∂t I ( t, a, b, c ) | t =0 = a − b log( a/b ) + 2 c , lim n →∞ q − n log P ( W ni ≤ εq n ) = − sup t ∈ R { εt + I ( t, a, b, c ) } . x i = µ y i + z i we see that given y i , x i y i ∼ N ( µ , I d ) is independent of √ n − µ ( − i ) ∼ N ( √ n − µ , I d ) . Lemma H.4 asserts that log E ( e t (cid:104) x i , ˆ µ ( − i ) (cid:105) y i | y i ) = log E ( e ( t/ √ n − (cid:104) x i y i , √ n − µ ( − i ) (cid:105) | y i )= ( t √ n − ) − ( t √ n − ) ] ( (cid:107) µ (cid:107) + ( n − (cid:107) µ (cid:107) )+ t √ n − − ( t √ n − ) (cid:104) µ , √ n − µ (cid:105) − d (cid:20) − (cid:18) t √ n − (cid:19) (cid:21) = tR − t / ( n − (cid:18) nt n − (cid:19) − d (cid:18) − t n − (cid:19) , ∀ t ∈ ( −√ n − , √ n − . Since the right hand side does not depend on y i , log E e t (cid:104) x i , ˆ µ ( − i ) (cid:105) y i is also equal to it. Now wefix any t ∈ R and let s = 2 tp/R = 2 t/ [1 + d/ ( nR )] . Since | s | < | t | , we have | s | < √ n − for large n . In that case, we obtain from the equation above that log E (cid:20) exp (cid:18) t · (cid:104) ˆ µ ( − i ) , x i (cid:105) y i d/ ( nR ) (cid:19)(cid:21) = log E e s (cid:104) x i , ˆ µ ( − i ) (cid:105) y i = sR − s / ( n − (cid:18) ns n − (cid:19) − d (cid:18) − s n − (cid:19) . = [1 + o (1)] sR (1 + s/
2) + d · s n − o (1)] = (cid:20) tp (cid:18) tpR (cid:19) + d n · t p R (cid:21) [1 + o (1)]= 2 pt (cid:20) tpR (cid:18) dnR (cid:19)(cid:21) [1 + o (1)] = 2 pt (1 + t )[1 + o (1)] , where we used p = R / ( R + d/n ) . It then follows from the results above and the assumption p = cq n that log E (cid:20) exp (cid:18) t · (cid:104) ˆ µ ( − i ) , x i (cid:105) y i d/ ( nR ) (cid:19)(cid:21) = cq n p − log E (cid:20) exp (cid:18) t · (cid:104) ˆ µ ( − i ) , x i (cid:105) y i d/ ( nR ) (cid:19)(cid:21) = 2 c ( t + t )[1 + o (1)] q n , which leads to (F.5).On the other hand, E ( e tA ij y i y j | y i ) = 12 E ( e tA ij | y i y j = 1) + 12 E ( e − tA ij | y i y j = − ue t + (1 − u )] + 12 [ ve − t + (1 − v )] = 1 + u ( e t −
1) + v ( e − t − , ∀ t ∈ R . Conditioned on y i , { A ij y i y j } j (cid:54) = i are i.i.d. random variables. Hence E (cid:20) exp (cid:18) t log( a/b ) y i (cid:88) j (cid:54) = i A ij y j (cid:19)(cid:12)(cid:12)(cid:12)(cid:12) y i (cid:21) = (cid:18) u [( a/b ) t −
1] + v [( b/a ) t − (cid:19) n − . Again, the right hand side does not depend on y i . By substituting u = aq n /n and v = bq n /n , log E (cid:20) exp (cid:18) t log( a/b ) y i (cid:88) j (cid:54) = i A ij y j (cid:19)(cid:21) = ( n −
1) log (cid:18) u [( a/b ) t −
1] + v [( b/a ) t − (cid:19) ( n −
1) log (cid:18) aq n [( a/b ) t −
1] + bq n [( b/a ) t − n (cid:19) = a [( a/b ) t −
1] + b [( b/a ) t − · [1 + o (1)] q n . We get (F.6) and thus finish the proof.
Lemma F.3 (Fundamental limit via genie-aided approach) . Suppose that S is a Borel spaceand ( y , X ) is a random element in {± } n × S . Let F be a family of Borel mappings from S to {± } n . Define M ( u , v ) = min (cid:26) n n (cid:88) i =1 { u i (cid:54) = v i } , n n (cid:88) i =1 {− u i (cid:54) = v i } (cid:27) , ∀ u , v ∈ {± } n ,f ( ·| ˜ X , ˜ y − i ) = P ( y i = ·| X = ˜ X , y − i = ˜ y − i ) , ∀ i ∈ [ n ] , ˜ X ∈ S , ˜ y − i ∈ {± } n − . We have inf ˆ y ∈F E M ( ˆ y , y ) ≥ n − n − · n n (cid:88) i =1 P [ f ( y i | X , y − i ) < f ( − y i | X , y − i )] . Proof of Lemma F.3.
For u , v ∈ {± } m with some m ∈ Z + , define the sign s ( u , v ) = argmin c = ± (cid:107) c u − v (cid:107) with any tie-breaking rule. As a matter of fact, s ( u , v ) = sgn( (cid:104) u , v (cid:105) ) if (cid:104) u , v (cid:105) (cid:54) = 0 . When |(cid:104) u , v (cid:105)| > , we have s ( u − i , v − i ) = s ( u , v ) for all i . Hence for ˆ y ∈ F (we drop the dependenceof ˆ y on X ), E M ( ˆ y , y ) ≥ E (cid:32) n n (cid:88) i =1 { s ( ˆ y , y )ˆ y i (cid:54) = y i } {|(cid:104) ˆ y , y (cid:105)| > } (cid:33) = E (cid:32) n n (cid:88) i =1 { s ( ˆ y − i , y − i )ˆ y i (cid:54) = y i } {|(cid:104) ˆ y , y (cid:105)| > } (cid:33) = E (cid:32) n n (cid:88) i =1 { s ( ˆ y − i , y − i )ˆ y i (cid:54) = y i } (cid:33) − E (cid:32) n n (cid:88) i =1 { s ( ˆ y − i , y − i )ˆ y i (cid:54) = y i } {|(cid:104) ˆ y , y (cid:105)|≤ } (cid:33) ≥ n n (cid:88) i =1 P ( s ( ˆ y − i , y − i )ˆ y i (cid:54) = y i ) − P ( |(cid:104) ˆ y , y (cid:105)| ≤ . Define F ε = { ˆ y ∈ F : P ( |(cid:104) ˆ y , y (cid:105)| ≤ ≤ ε } for ε ∈ [0 , . If F ε (cid:54) = ∅ , then inf ˆ y ∈F ε E M ( ˆ y , y ) ≥ n n (cid:88) i =1 inf ˆ y ∈F P ( s ( ˆ y − i , y − i )ˆ y i (cid:54) = y i ) − ε. Define G be the family of Borel mappings from S × {± } n − → {± } . For any fixed ˆ y ∈ F ,the mapping ( X , y − i ) (cid:55)→ s ( ˆ y − i , y − i )ˆ y i belongs to G . Then inf ˆ y ∈F P ( s ( ˆ y − i , y − i )ˆ y i (cid:54) = y i ) ≥ inf ˆ (cid:96) ∈G P (cid:16) ˆ (cid:96) ( X , y − i ) (cid:54) = y i (cid:17) ≥ P [ f ( y i | X , y − i ) < f ( − y i | X , y − i )] , δ = n (cid:80) ni =1 P [ f ( y i | X , y − i ) < f ( − y i | X , y − i )] . We have inf ˆ y ∈F ε E M ( ˆ y , y ) ≥ δ − ε provided that F ε (cid:54) = ∅ .On the other hand, when |(cid:104) ˆ y , y (cid:105)| ≤ , we have M ( ˆ y , y ) = (4 n ) − min c = ± (cid:107) c ˆ y − y (cid:107) = (4 n ) − min c = ± {(cid:107) ˆ y (cid:107) − c (cid:104) ˆ y , y (cid:105) + (cid:107) y (cid:107) } ≥ n − n . Hence if
F \F ε (cid:54) = ∅ , inf ˆ y ∈F\F ε E M ( ˆ y , y ) ≥ n − n inf ˆ y ∈F\F ε P ( |(cid:104) ˆ y ( X ) , y (cid:105)| ≤ ≥ n − n · ε = ε (cid:18) − n (cid:19) . Based on the deduction above, we have the followings for all ε ∈ [0 , :1. If F ε (cid:54) = ∅ and F \F ε (cid:54) = ∅ , then inf ˆ y ∈F E M ( ˆ y , y ) ≥ min { δ − ε, ε (1 − n − ) / } ;2. If F ε = ∅ , then inf ˆ y ∈F E M ( ˆ y , y ) ≥ ε (1 − n − ) / .3. If F \F ε = ∅ , then inf ˆ y ∈F E M ( ˆ y , y ) ≥ δ − ε .As a result, inf ˆ y ∈F E M ( ˆ y , y ) ≥ sup ε ∈ [0 , min { δ − ε, ε (1 − n − ) / } = n − n − δ . F.2 Proof of Lemma 4.1
The proof directly follows the Lemmas F.4 and F.5, plus the conditional independence be-tween A and X as well as the Bayes formula. See Appendices F.3 and F.4 for proofs oflemmas. Lemma F.4.
Denote by p X ( ·| ˜ (cid:96) i , ˜ y − i ) the conditional density function of X given y i = ˜ (cid:96) i ∈{± } and y − i = ˜ y − i ∈ {± } n − . Under Assumption 4.2, (cid:12)(cid:12)(cid:12)(cid:12) y i log (cid:18) p X ( X | y i , y − i ) p X ( X | − y i , y − i ) (cid:19) − R nR + d (cid:88) j (cid:54) = i (cid:104) x i , x j (cid:105) y j (cid:12)(cid:12)(cid:12)(cid:12) = o P ( q n ; q n ) , ∀ i. Lemma F.5.
Denote by p A ( ·| ˜ y i , ˜ y − i ) the conditional probability mass function of A given y i = ˜ (cid:96) i and y − i = ˜ y − i . Under Assumption 4.2, (cid:12)(cid:12)(cid:12)(cid:12) y i log (cid:18) p A ( A | y i , y − i ) p A ( A | − y i , y − i ) (cid:19) − log (cid:18) ab (cid:19) (cid:88) j (cid:54) = i A ij y j (cid:12)(cid:12)(cid:12)(cid:12) = o P ( q n ; q n ) , ∀ i. F.3 Proof of Lemma F.4
Let p = p n = R / ( R + d/n ) . We have p n (cid:16) q n . First of all, from the data generating model,we have p X ( X | y ) ∝ E µ exp (cid:16) − n (cid:88) j =1 (cid:107) x j − y j µ (cid:107) (cid:17) ∝ E µ exp (cid:16)(cid:68) n (cid:88) j =1 x j y j , µ (cid:69)(cid:17) , ∝ hide quantities that do not depend on y . By defining I ( α ) = R d − (cid:90) S d − e R (cid:104) α , ˜ µ (cid:105) ρ (d ˜ µ ) , ∀ α ∈ R d , and using the uniform distribution of µ on the sphere with radius R , we get p X ( X | y i = s, y − i ) p X ( X | y i = − s, y − i ) = I (cid:0) ( n −
1) ˆ µ ( − i ) + x i s (cid:1) I (( n −
1) ˆ µ ( − i ) − x i s ) . (F.7)Let P ( t, s ) = (cid:82) π e t cos θ (sin θ ) s − d θ for t ≥ , s ≥ . Then, I ( α ) ∝ (cid:90) π e R (cid:107) α (cid:107) cos θ (sin θ ) d − d θ = P ( R (cid:107) α (cid:107) , d ) , where ∝ only hides some factor that does not depend on α . Hence by (F.7) and ˆ µ ( − i ) = n − (cid:80) j (cid:54) = i y j x j , log (cid:18) p X ( X | y i , y − i ) p X ( X | − y i , y − i ) (cid:19) = log P (cid:0) R (cid:107) ( n −
1) ˆ µ ( − i ) + x i y i (cid:107) , d (cid:1) − log P (cid:0) R (cid:107) ( n −
1) ˆ µ ( − i ) − x i y i (cid:107) , d (cid:1) . We will linearize the functional above, and invoke Lemma H.8 to control the approx-imation error. Take t = ( n − R (cid:112) R + d/ ( n − , t = R (cid:107) ( n −
1) ˆ µ ( − i ) + x i y i (cid:107) , t = R (cid:107) ( n −
1) ˆ µ ( − i ) − x i y i (cid:107) . We first claim that t = nR (cid:112) R + d/n [1 + o (1)] = [1 + o (1)] nR / √ p (cid:16) nR ( R ∨ (cid:112) d/n ) , (F.8) max { /t , d /t , | t − t | /t , | t − t | /t } = o P (1; p ) . (F.9)Equation (F.8) is obvious and it leads to /t = o (1) . From t (cid:38) R √ nd and the assumption R (cid:29) ∨ ( d/n ) / we get d t (cid:46) d ( R √ nd ) = (cid:18) d ( R nd ) (cid:19) / = (cid:18) dnR · n R (cid:19) / = o (1) . By the triangle’s inequality and (cid:107) x i (cid:107) = O P ( R ∨ √ d ; p ) in Lemma E.1, (cid:12)(cid:12) | t − t | − (cid:12)(cid:12) R (cid:107) ( n −
1) ˆ µ ( − i ) (cid:107) − t (cid:12)(cid:12)(cid:12)(cid:12) ≤ R (cid:107) x i y i (cid:107) ≤ R ( R ∨ √ d ) O P (1; p ) . By (cid:12)(cid:12)(cid:12) (cid:107) ˆ µ ( − i ) (cid:107) − (cid:112) R + d/ ( n − (cid:12)(cid:12)(cid:12) = O P ( (cid:112) p/n ; p ) in Lemma E.1, (cid:12)(cid:12) R (cid:107) ( n −
1) ˆ µ ( − i ) (cid:107) − t (cid:12)(cid:12) = O P ( R √ np ; p ) . Hence | t − t | /R = O P ( R ∨ √ d ∨ √ np ; p ) = O P ( √ nR ∨ √ d ; p ) as √ p ≤ R .Then t (cid:16) nR ( R ∨ (cid:112) d/n ) forces | t − t | /t = | t − t | /R | t | /R = O P ( √ nR ∨ √ d ; p ) nR ∨ √ nd = o P (1; p ) . | t − t | /t = o P (1; p ) .Now that (F.9) has been justified, Lemma H.8 and Fact A.5 assert that (cid:12)(cid:12)(cid:12)(cid:12) log p X ( X | y i , y − i ) − log p X ( X | − y i , y − i ) g ( t , d )( t − t ) − (cid:12)(cid:12)(cid:12)(cid:12) = (cid:12)(cid:12)(cid:12)(cid:12) log P ( t , d ) − log P ( t , d ) g ( t , d )( t − t ) − (cid:12)(cid:12)(cid:12)(cid:12) = o P (1; p ) , (F.10)where g ( t , d ) = (cid:112) [( d − /t ] + 4 − ( d − /t (cid:112) ( d − + 4 t − ( d − t . By (F.8), we have t = [1 + o (1)] nR / √ p and g ( t , d ) = √ p [1 + o (1)]2 nR [ (cid:112) ( d − + 4 n R /p − ( d − √ p [1 + o (1)]2 R · (cid:20)(cid:115)(cid:18) d − nR (cid:19) + 4 R p − d − nR (cid:21) . Since p = R / ( R + d/n ) and (cid:18) d − nR (cid:19) + 4 R p = (cid:18) d − nR (cid:19) + 4 (cid:18) dnR (cid:19) = (cid:18) d − nR + 2 (cid:19) + 8 nR , we have d − nR + 2 ≤ (cid:115)(cid:18) d − nR (cid:19) + 4 R p ≤ d − nR + 2 + (cid:114) nR and g ( t , d ) = [1 + o (1)] √ p/R .To further simplify (F.10), we first note that t − t = t − t t + t = 4 R ( n − (cid:104) ˆ µ ( − i ) , x i (cid:105) y i t + t = 4 R ( n − (cid:104) ˆ µ ( − i ) , x i (cid:105) y i [1 + o P (1; p )]2 t = 4 R ( n − (cid:104) ˆ µ ( − i ) , x i (cid:105) y i [1 + o P (1; p )]2 nR / √ p = (cid:18) √ pnR (cid:88) j (cid:54) = i (cid:104) x i , x j (cid:105) y j (cid:19) y i [1 + o P (1; p )] , where we used t = [1 + o (1)] nR / √ p in (F.8). Then g ( t , d )( t − t ) = (cid:18) pnR (cid:88) j (cid:54) = i (cid:104) x i , x j (cid:105) y j (cid:19) y i [1 + o P (1; p )] . By (cid:104) ˆ µ ( − i ) , x i (cid:105) = O P ( R ; p ) in Lemma E.1, g ( t , d )( t − t ) = O P ( p ; p ) . The proof is completedby plugging these estimates into (F.10). 48 .4 Proof of Lemma F.5 Define T i = { j ∈ [ n ] \{ i } : y i y j = 1 } and S i = { j ∈ [ n ] \{ i } : A ij = 1 } for i ∈ [ n ] . Bydefinition, p A ( A | y i , y − i ) ∝ α | T i ∩ S i | (1 − α ) | T i \ S i | β | S i \ T i | (1 − β ) [ n ] ∩{ i } c ∩ T ci ∩ S ci ,p A ( A | − y i , y − i ) ∝ α | S i \ T i | (1 − α ) [ n ] ∩{ i } c ∩ T ci ∩ S ci β | T i ∩ S i | (1 − β ) | T i \ S i | , where both ∝ ’s hide the same factor that does not involve { A ij } nj =1 or y i . Hence log (cid:18) p A ( A | y i , y − i ) p A ( A | − y i , y − i ) (cid:19) = ( | T i ∩ S i | − | S i \ T i | ) log( α/β )+ ( | T i \ S i | − | [ n ] ∩ { i } c ∩ T ci ∩ S ci | ) log (cid:18) − β − α (cid:19) . (F.11)The facts | T i |−| S i | ≤ | T i \ S i | ≤ | T i | and n − −| T i |−| S i | ≤ | [ n ] ∩{ i } c ∩ T ci ∩ S ci | ≤ n − −| T i | yield || T i \ S i | − | [ n ] ∩ { i } c ∩ T ci ∩ S ci || ≤ | | T i | − ( n − | + | S i | For any independent random variables { ξ i } ni =1 taking values in [ − , , Hoeffding’s inequality(Hoeffding, 1963) asserts P ( | (cid:80) ni ξ i − (cid:80) ni E ξ i | ≥ nt ) ≤ e − nt / , ∀ t ≥ . Hence | (cid:80) ni ξ i − (cid:80) ni E ξ i | = O P ( √ nq ; q ) . This elementary fact leads to | | T i | − ( n − | = O P ( √ nq ; q ) , | S i | ≤ | E S i | + | S i − E S i | ≤ O ( q ) + O P ( √ nq ; q ) = O P ( √ nq ; q ) . As a result, || T i \ S i | − | [ n ] ∩ { i } c ∩ T ci ∩ S ci || = O P ( √ nq ; q ) . 
This bound, combined with ≤ log (cid:18) − β − α (cid:19) = log (cid:18) α − β − α (cid:19) ≤ α − β − α = ( a − b ) q/n − aq/n (cid:46) qn , (F.11) and log( α/β ) = log( a/b ) , implies that (cid:12)(cid:12)(cid:12)(cid:12) log (cid:18) p A ( A | y i , y − i ) p A ( A | − y i , y − i ) (cid:19) − ( | T i ∩ S i | − | S i \ T i | ) log( a/b ) (cid:12)(cid:12)(cid:12)(cid:12) = O P ( √ nq · q/n ; q ) = o P ( q ; q ) . The proof is completed by | T i ∩ S i | − | S i \ T i | = (cid:88) j ∈ T i A ij − (cid:88) j ∈ [ n ] ∩{ i } c ∩ T ci A ij = y i (cid:88) j (cid:54) = i A ij y j . F.5 Proof of Theorem 4.1
Lemma F.1 asserts the existence of some ε n → and constant C > such that P (min c = ± (cid:107) c ˆ u − w (cid:107) ∞ < ε n n − / log n ) > − Cn − . (F.12)49et ˆ c = argmin c = ± (cid:107) c ˆ u − w (cid:107) ∞ and v = ˆ c ˆ u . Hence P [ M (sgn( ˆ u ) , y ) = 0] ≥ P (sgn( ˆ v ) = y ) ≥ P (min i ∈ [ n ] w i y i > ε n n − / log n, (cid:107) v − w (cid:107) ∞ < ε n n − / ) ≥ P (min i ∈ [ n ] w i y i > ε n n − / log n ) − P ( (cid:107) v − w (cid:107) ∞ < ε n n − / ) ≥ − n (cid:88) i =1 P ( w i y i ≤ ε n n − / log n ) − Cn − = 1 − n P ( w i y i ≤ ε n n − / log n ) − Cn − . (F.13)where we used (F.12), union bounds and symmetry.Take any < ε < a − b log( a/b ) + 2 c . Lemma F.2 asserts that lim n →∞ P ( w i y i ≤ εn − / log n )log n = − sup t ∈ R { εt + I ( t, a, b, c ) } . For any δ > , there exists a large N such that when n ≥ N , ε n < ε and P ( w i y i ≤ ε n n − / log n ) ≤ n − sup t ∈ R { εt + I ( t,a,b,c ) } + δ . This and (F.13) lead to P [ M (sgn( ˆ u ) , y ) = 0] ≥ − n − sup t ∈ R { εt + I ( t,a,b,c ) } + δ − Cn − , ∀ n ≥ N. When I ∗ ( a, b, c ) = sup t ∈ R I ( t, a, b, c ) > , by choosing small ε and δ we get P [ M (sgn( ˆ u ) , y ) =0] → .The converse result for I ∗ ( a, b, c ) = sup t ∈ R I ( t, a, b, c ) < follows from the large deviationLemma F.2 and the proof of Theorem 1 in Abbe et al. (2016). F.6 Proof of Theorem 4.2
Lemma F.1 asserts the existence of some ε n → and constant C > such that P (min c = ± (cid:107) c ˆ u − w (cid:107) ∞ < ε n n − / log n ) > − Cn − . (F.14)Let ˆ c = argmin c = ± (cid:107) c ˆ u − w (cid:107) ∞ and v = ˆ c ˆ u .By definition, E M (sgn( ˆ u ) , y ) ≤ n (cid:80) ni =1 P ( v i y i < . By union bounds and (F.14), P ( v i y i < ≤ P ( v i y i < , (cid:107) ˆ u − w (cid:107) ∞ < ε n n − / log n ) + P ( (cid:107) v − w (cid:107) ∞ ≥ ε n n − / log n ) ≤ P ( w i y i < ε n n − / log n ) + Cn − . (F.15)Take any < ε < a − b log( a/b ) + 2 c . Lemma F.2 asserts that lim n →∞ P ( w i y i ≤ εn − / log n )log n = − sup t ∈ R { εt + I ( t, a, b, c ) } . δ > , there exists a large N such that when n ≥ N , ε n < ε and P ( w i y i < ε n n − / log n ) ≤ n − sup t ∈ R { εt + I ( t,a,b,c ) } + δ . (F.16)From (F.15) and (F.16) we obtain that E M (sgn( ˆ u ) , y ) ≤ n − sup t ∈ R { εt + I ( t,a,b,c ) } + δ + Cn − , ∀ n ≥ N. The proof is completed using I ∗ ( a, b, c ) = sup t ∈ R I ( t, a, b, c ) ≤ and letting ε , δ go to zero. F.7 Proof of Theorem 4.3
Define f ( ·| ˜ A , ˜ X , ˜ y − i ) = P ( y i = ·| A = ˜ A , X = ˜ X , y − i = ˜ y − i ) . By Lemma F.3 and symme-tries, for any estimator ˆ y we have E M ( ˆ y , y ) ≥ n − n − P [ f ( y | A , X , y − ) < f ( − y | A , X , y − )] . Denote by A the event on the right hand side. Let B ε = (cid:26)(cid:12)(cid:12)(cid:12)(cid:12) log (cid:18) f ( y | A , X , y − ) f ( − y | A , X , y − ) (cid:19) − (cid:18) log( a/b )( Ay ) + 2 R nR + d ( Gy ) (cid:19) y (cid:12)(cid:12)(cid:12)(cid:12) < εq n (cid:27) C ε = (cid:26)(cid:18) log( a/b )( Ay ) + 2 R nR + d ( Gy ) (cid:19) y ≤ − εq n (cid:27) By the triangle’s inequality, C ε ∩ B ε ⊆ A . Hence E M ( ˆ y , y ) (cid:38) P ( A ) ≥ P ( C ε ∩ B ε ) ≥ P ( C ε ) − P ( B cε ) . (F.17)Since a − b log( a/b ) + 2 c > , Lemma F.2 asserts that lim n →∞ q − n log P ( C ε ) = − sup t ∈ R {− εt + I ( t, a, b, c ) } . By Lemma 4.1 and the property of o P ( · ; · ) , lim n →∞ q − n log P ( B cε ) = −∞ . These limits and (F.17) lead to lim inf n →∞ q − n log E M ( ˆ y , y ) ≥ − sup t ∈ R {− εt + I ( t, a, b, c ) } . Taking ε → finishes the proof. 51 Proofs of Section 5
G.1 Proof of Lemma 5.1
Note that s = 0 , r = 1 , ¯∆ = ¯ λ = n (cid:107) µ (cid:107) and κ = 1 . Assumption B.1 holds if / √ n ≤ γ (cid:28) .Assumption 2.5 holds with Σ = 2 I d and in that case, Assumption 2.6 holds with γ ≥ (cid:26) (cid:107) µ (cid:107) , (cid:112) d/n (cid:107) µ (cid:107) (cid:27) . The right hand side goes to zero as d/n → ∞ and ( n/d ) / (cid:107) µ (cid:107) → ∞ . Hence we can take γ = 2 max (cid:26) √ n , (cid:107) µ (cid:107) , (cid:112) d/n (cid:107) µ (cid:107) (cid:27) to satisfy all the assumptions above. Then Lemma B.1 yields |(cid:104) u , ¯ u (cid:105)| P → .To study ˆ u , we first define ˜ G = E ( XX (cid:62) ) = d I n + d e e (cid:62) . Hence its leading eigenvectorand the associated eigengap are ˜ u = e and ˜∆ = d . Observe that G = H ( XX (cid:62) ) and (cid:107) XX (cid:62) − ˜ G (cid:107) ≤ (cid:107)H ( XX (cid:62) − ˜ G ) (cid:107) + max i ∈ [ n ] (cid:12)(cid:12)(cid:12) ( XX (cid:62) − ˜ G ) ii (cid:12)(cid:12)(cid:12) ≤ (cid:107)H ( XX (cid:62) ) − ¯ G (cid:107) + (cid:107) ¯ G − H ( ˜ G ) (cid:107) + max i ∈ [ n ] (cid:12)(cid:12) (cid:107) x i (cid:107) − E (cid:107) x i (cid:107) (cid:12)(cid:12) (G.1)By Lemma B.1, (cid:107)H ( XX (cid:62) ) − ¯ G (cid:107) = o P ( ¯∆; n ) = o P ( n (cid:107) µ (cid:107) ; n ) . (G.2)When i (cid:54) = j , ˜ G ij = E (cid:104) x i , x j (cid:105) = E (cid:104) ¯ x i + z i , ¯ x j + z j (cid:105) = E (cid:104) ¯ x i , ¯ x j (cid:105) = ¯ G ij . Hence H ( ¯ G ) = H ( ˜ G ) , and (cid:107) ¯ G − H ( ˜ G ) (cid:107) = max i ∈ [ n ] | ¯ G ii | = max i ∈ [ n ] (cid:107) ¯ x i (cid:107) = (cid:107) µ (cid:107) . (G.3)For the last term in (5.1), we have (cid:107) x i (cid:107) − E (cid:107) x i (cid:107) = (cid:107) ¯ x i + z i (cid:107) − ( (cid:107) ¯ x i (cid:107) + E (cid:107) z i (cid:107) ) = 2 (cid:104) ¯ x i , z i (cid:105) + ( (cid:107) z i (cid:107) − E (cid:107) z i (cid:107) ) . From (cid:107)(cid:104) ¯ x i , z i (cid:105)(cid:107) ψ (cid:46) (cid:107) ¯ x i (cid:107) = (cid:107) µ (cid:107) , Fact 2.1 and Lemma H.3 we obtain that max i ∈ [ n ] |(cid:104) ¯ x i , z i (cid:105)| (cid:46) (cid:107) ( (cid:104) ¯ x , z (cid:105) , · · · , (cid:104) ¯ x n , z n (cid:105) ) (cid:107) log n = O P ( (cid:112) log n (cid:107) µ (cid:107) ; log n ) (G.4)For any i ≥ , (cid:107) x i (cid:107) ∼ χ d . Lemma H.4 forces P ( |(cid:107) x i (cid:107) − d | ≥ √ dt + 2 t ) ≤ e − t , ∀ t ≥ , i ≥ .
52y the χ -concentration above and union bounds, max ≤ i ≤ n |(cid:107) x i (cid:107) − E (cid:107) x i (cid:107) | = O P ( √ dn ∨ n ; n ) = O P ( √ dn ; n ) . Since (cid:107) x (cid:107) / ∼ χ d , we get max i ∈ [ n ] |(cid:107) x i (cid:107) − E (cid:107) x i (cid:107) | = O P ( √ dn ; n ) .Plugging this and (G.2), (G.3), (G.4) into (G.1), we get (cid:107) XX (cid:62) − ˜ G (cid:107) = O P ( n (cid:107) µ (cid:107) + (cid:107) µ (cid:107) + (cid:112) log n (cid:107) µ (cid:107) + √ dn ; log n ) = O P ( n (cid:107) µ (cid:107) ; log n ) . Here we used (cid:107) µ (cid:107) (cid:29) ( d/n ) / (cid:29) . The Davis-Kahan Theorem (Davis and Kahan, 1970)then yields min c = ± (cid:107) s ˆ u − ˜ u (cid:107) (cid:46) (cid:107) XX (cid:62) − ˜ G (cid:107) / ˜∆ = O P ( n (cid:107) µ (cid:107) ; log n ) /d = o P (1; log n ) , since (cid:107) µ (cid:107) (cid:28) (cid:112) d/n . From ˜ u = e and (cid:104) ˜ u , ¯ u (cid:105) = 1 / √ n → we get |(cid:104) ˆ u , ¯ u (cid:105)| P → . G.2 Proof of Lemma 5.2
Lemma 5.2 directly follows from Lemma B.1 and thus we omit its proof.
H Technical lemmas
H.1 Lemmas for probabilistic analysis
Lemma H.1.
Under Assumption 2.5, we have (cid:107)H ( ZZ (cid:62) ) (cid:107) = O P (cid:0) max {√ n (cid:107) Σ (cid:107) HS , n (cid:107) Σ (cid:107) op } ; n (cid:1) , max i ∈ [ n ] (cid:107) z i (cid:107) = O P (max { Tr( Σ ) , n (cid:107) Σ (cid:107) op } ; n ) , (cid:107) ZZ (cid:62) (cid:107) = O P (max { Tr( Σ ) , n (cid:107) Σ (cid:107) op } ; n ) . Proof of Lemma H.1.
By definition, (cid:107)H ( ZZ (cid:62) ) (cid:107) = sup u ∈ S n − | u (cid:62) H ( ZZ (cid:62) ) u | = sup u ∈ S n − (cid:12)(cid:12)(cid:12)(cid:12) (cid:88) i (cid:54) = j u i u j (cid:104) z i , z j (cid:105) (cid:12)(cid:12)(cid:12)(cid:12) . Fix u ∈ S n − , let A = uu (cid:62) and S = (cid:80) i (cid:54) = j u i u j (cid:104) z i , z j (cid:105) . By Proposition 2.5 in Chen andYang (2018), there exists an absolute constant C > such that P ( S ≥ t ) ≤ exp (cid:18) − C min (cid:26) t (cid:107) Σ (cid:107) , t (cid:107) Σ (cid:107) op (cid:27)(cid:19) , ∀ t > . When t = λ max {√ n (cid:107) Σ (cid:107) HS , n (cid:107) Σ (cid:107) op } for some λ ≥ , we have min { t / (cid:107) Σ (cid:107) , t/ (cid:107) Σ (cid:107) op } ≥ λn and P ( S ≥ t ) ≤ e − Cλn . Similarly, we get P ( S ≤ − t ) ≤ e − Cλn and thus P (cid:18)(cid:12)(cid:12)(cid:12)(cid:12) (cid:88) i (cid:54) = j u i u j (cid:104) z i , z j (cid:105) (cid:12)(cid:12)(cid:12)(cid:12) ≥ λ max {√ n (cid:107) Σ (cid:107) HS , n (cid:107) Σ (cid:107) op } (cid:19) ≤ e − Cλn , ∀ λ ≥ . (cid:107)H ( ZZ (cid:62) ) (cid:107) then follows from a standard covering argument (Vershynin,2010, Section 5.2.2).Theorem 2.6 in Chen and Yang (2018) with n = 1 and A = 1 implies the existence ofconstants C and C such that for any t ≥ , P ( (cid:107) z i (cid:107) ≥ C Tr( Σ ) + t ) ≤ exp (cid:18) − C min (cid:26) t (cid:107) Σ (cid:107) , t (cid:107) Σ (cid:107) op (cid:27)(cid:19) . When t = λ max {√ n (cid:107) Σ (cid:107) HS , n (cid:107) Σ (cid:107) op } for some λ ≥ , we have min { t / (cid:107) Σ (cid:107) , t/ (cid:107) Σ (cid:107) op } ≥ λn . Hence P ( (cid:107) z i (cid:107) ≥ C Tr( Σ ) + λ max {√ n (cid:107) Σ (cid:107) HS , n (cid:107) Σ (cid:107) op } ) ≤ e − C λn , ∀ λ ≥ . Union bounds force max i ∈ [ n ] (cid:107) z i (cid:107) = O P (cid:0) max { Tr( Σ ) , √ n (cid:107) Σ (cid:107) HS , n (cid:107) Σ (cid:107) op } ; n (cid:1) . We can neglect the term √ n (cid:107) Σ (cid:107) HS above, since √ n (cid:107) Σ (cid:107) F = (cid:113) n (cid:107) Σ (cid:107) ≤ (cid:113) ( n (cid:107) Σ (cid:107) op ) Tr( Σ ) ≤ max { Tr( Σ ) , n (cid:107) Σ (cid:107) op } . Finally, the bound on (cid:107) ZZ (cid:62) (cid:107) follows from (cid:107) ZZ (cid:62) (cid:107) ≤ (cid:107)H ( ZZ (cid:62) ) (cid:107) + max i ∈ [ n ] (cid:107) z i (cid:107) . Lemma H.2.
Let Assumption 2.5 hold, p ≥ and { V ( m ) } nm =1 ⊆ R n × K be random matricessuch that V ( m ) is independent of z m . Then, (cid:18) n (cid:88) m =1 (cid:13)(cid:13)(cid:13)(cid:13) (cid:88) j (cid:54) = m (cid:104) z m , z j (cid:105) V ( m ) j (cid:13)(cid:13)(cid:13)(cid:13) p (cid:19) /p = n /p (cid:112) Kp max m ∈ [ n ] (cid:107) V ( m ) (cid:107) O P (cid:0) max {(cid:107) Σ (cid:107) HS , √ n (cid:107) Σ (cid:107) op } ; p ∧ n (cid:1) . Proof of Lemma H.2.
By Minkowski’s inequality, (cid:13)(cid:13)(cid:13)(cid:13) (cid:88) j (cid:54) = m (cid:104) z m , z j (cid:105) V ( m ) j (cid:13)(cid:13)(cid:13)(cid:13) p = (cid:18) K (cid:88) k =1 (cid:12)(cid:12)(cid:12)(cid:12) (cid:88) j (cid:54) = m (cid:104) z m , z j (cid:105) V ( m ) jk (cid:12)(cid:12)(cid:12)(cid:12) (cid:19) p/ ≤ (cid:20)(cid:18) K (cid:88) k =1 (cid:12)(cid:12)(cid:12)(cid:12) (cid:88) j (cid:54) = m (cid:104) z m , z j (cid:105) V ( m ) jk (cid:12)(cid:12)(cid:12)(cid:12) p (cid:19) /p K − /p (cid:21) p/ = K p/ − K (cid:88) k =1 (cid:12)(cid:12)(cid:12)(cid:12) (cid:88) j (cid:54) = m (cid:104) z m , z j (cid:105) V ( m ) jk (cid:12)(cid:12)(cid:12)(cid:12) p = K p/ − K (cid:88) k =1 |(cid:104) z m , w ( m ) k (cid:105)| p , where we define w ( m ) k = (cid:80) j (cid:54) = m V ( m ) jk z j = Z (cid:62) ( I − e m e (cid:62) m ) v ( m ) k , ∀ m ∈ [ n ] , k ∈ [ K ] . Observethat (cid:107) Σ / w ( m ) k (cid:107) = ( v ( m ) k ) (cid:62) ( I − e m e (cid:62) m ) Z Σ Z (cid:62) ( I − e m e (cid:62) m ) v ( m ) k ≤ (cid:107) v ( m ) k (cid:107) (cid:107) Z Σ Z (cid:62) (cid:107) ≤ (cid:107) V ( m ) (cid:107) (cid:107) Z Σ Z (cid:62) (cid:107) .
54s a result, (cid:13)(cid:13)(cid:13)(cid:13) (cid:88) j (cid:54) = m (cid:104) z m , z j (cid:105) V ( m ) j (cid:13)(cid:13)(cid:13)(cid:13) p ≤ K p/ − (cid:18) K (cid:88) k =1 |(cid:104) z m , w ( m ) k / (cid:107) Σ / w ( m ) k (cid:107)(cid:105)| p (cid:19)(cid:16) max m ∈ [ n ] (cid:107) V ( m ) (cid:107) · (cid:107) Z Σ Z (cid:62) (cid:107) / (cid:17) p . and (cid:18) n (cid:88) m =1 (cid:13)(cid:13)(cid:13)(cid:13) (cid:88) j (cid:54) = m (cid:104) z m , z j (cid:105) V ( m ) j (cid:13)(cid:13)(cid:13)(cid:13) p (cid:19) /p ≤ (cid:112) K (cid:107) Z Σ Z (cid:62) (cid:107) max m ∈ [ n ] (cid:107) V ( m ) (cid:107) · (cid:18) K − n (cid:88) m =1 K (cid:88) k =1 |(cid:104) z m , w ( m ) k / (cid:107) Σ / w ( m ) k (cid:107)(cid:105)| p (cid:19) /p . (H.1)On the one hand, let ˜ z i = Σ / z i , ∀ i ∈ [ n ] and ˜ Z = ( ˜ z , · · · , ˜ z n ) (cid:62) . Note that { ˜ z i } ni =1 satisfy Assumption 2.5 with Σ replaced by Σ , because E e (cid:104) u , ˜ z i (cid:105) = E e (cid:104) Σ / u , z i (cid:105) ≤ e α (cid:104) ΣΣ / u , Σ / u (cid:105) = e α (cid:104) Σ u , u (cid:105) , ∀ u ∈ H , i ∈ [ n ] . It is easily seen from Σ ∈ T ( H ) that Σ ∈ T ( H ) . Then Lemma H.1 asserts that (cid:107) Z Σ Z (cid:62) (cid:107) = (cid:107) ˜ Z ˜ Z (cid:62) (cid:107) = O P (cid:0) max { Tr( Σ ) , n (cid:107) Σ (cid:107) op } ; n (cid:1) = O P (cid:0) max {(cid:107) Σ (cid:107) , n (cid:107) Σ (cid:107) } ; n (cid:1) . (H.2)On the other hand, note that z m and w ( m ) k are independent. According to Assumption2.5 on sub-Gaussianity of z m , we have E (cid:16) (cid:104) z m , w ( m ) k / (cid:107) Σ / w ( m ) k (cid:107)(cid:105) (cid:12)(cid:12)(cid:12) w ( m ) k (cid:17) = 0 ,p − / E /p (cid:16) |(cid:104) z m , w ( m ) k / (cid:107) Σ / w ( m ) k (cid:107)(cid:105)| p (cid:12)(cid:12)(cid:12) w ( m ) k (cid:17) ≤ C for some absolute constant C . Then E |(cid:104) z m , w ( m ) k / (cid:107) Σ / w ( m ) k (cid:107)(cid:105)| p ≤ ( C √ p ) p . We have n (cid:88) m =1 K (cid:88) k =1 E |(cid:104) z m , w ( m ) k / (cid:107) Σ / w ( m ) k (cid:107)(cid:105)| p ≤ nK ( C √ p ) p = ( n /p K /p C √ p ) p . By Fact A.3, (cid:18) n (cid:88) m =1 K (cid:88) k =1 |(cid:104) z m , w ( m ) k / (cid:107) Σ / w ( m ) k (cid:107)(cid:105)| p (cid:19) /p = O P (cid:0) n /p K /p C √ p ; p (cid:1) . (H.3)The final result follows from (H.1), (H.2) and (H.3). Lemma H.3.
Let X ∈ R n × m be a random matrix with sub-Gaussian entries, and define M ∈ R n × m through M ij = (cid:107) X ij (cid:107) ψ . For any p ≥ q ≥ , we have (cid:107) X (cid:107) q,p = O P ( √ p (cid:107) M (cid:107) q,p ; p ) . roof of Lemma H.3. By Minkowski’s inequality, E (cid:107) X (cid:107) pq,p = n (cid:88) i =1 E (cid:18) n (cid:88) j =1 | X ij | q (cid:19) p/q ≤ n (cid:88) i =1 (cid:18) n (cid:88) j =1 E q/p ( | X ij | q ) p/q (cid:19) p/q = n (cid:88) i =1 (cid:18) n (cid:88) j =1 [ E /p | X ij | p ] q (cid:19) p/q . Since p − / E /p | X ij | p ≤ (cid:107) X ij (cid:107) ψ = M ij , we have E (cid:107) X (cid:107) pq,p ≤ n (cid:88) i =1 (cid:18) n (cid:88) j =1 ( √ pM ij ) q (cid:19) p/q = p p/ n (cid:88) i =1 (cid:18) n (cid:88) j =1 M qij (cid:19) p/q = ( √ p (cid:107) M (cid:107) q,p ) p . By Fact A.3, (cid:107) X (cid:107) q,p = O P ( √ p (cid:107) M (cid:107) q,p ; p ) . Lemma H.4.
For independent random vectors X ∼ N ( µ , I d ) and Y ∼ N ( ν , I d ) , we havethe followings:1. If µ = , then P ( |(cid:107) X (cid:107) − d | ≥ √ dt + 2 t ) ≤ e − t , ∀ t ≥ , log E e α (cid:107) X (cid:107) + (cid:104) β , X (cid:105) = − d − α ) + (cid:107) β (cid:107) − α ) ∀ α < , β ∈ R d ;
2. For any t ∈ ( − , , log E e t (cid:104) X , Y (cid:105) = t − t ) ( (cid:107) µ (cid:107) + (cid:107) ν (cid:107) ) + t − t (cid:104) µ , ν (cid:105) − d − t ) . Proof of Lemma H.4.
When µ = , (cid:107) X (cid:107) ∼ χ d . The concentration inequality in theclaim is standard, see Remark 2.11 in Boucheron et al. (2013). Note that p ( x ) = (2 π ) − d/ e −(cid:107) x (cid:107) / is the probability density function of X . With a new variable y = √ − α x , we have α (cid:107) x (cid:107) + (cid:104) β , x (cid:105) − (cid:107) x (cid:107) = − (cid:107) y (cid:107) (cid:104) β / √ − α, y (cid:105) = − (cid:107) y − β / √ − α (cid:107) + (cid:107) β (cid:107) − α ) and E e α (cid:107) X (cid:107) + (cid:104) β , X (cid:105) = (2 π ) − d/ (cid:90) R d exp (cid:18) α (cid:107) x (cid:107) + (cid:104) β , x (cid:105) − (cid:107) x (cid:107) (cid:19) d x = (2 π ) − d/ (cid:90) R d exp (cid:18) − (cid:107) y − β / √ − α (cid:107) + (cid:107) β (cid:107) − α ) (cid:19) (1 − α ) − d/ d y = (1 − α ) − d/ exp (cid:18) (cid:107) β (cid:107) − α ) (cid:19) . Now we come to the second part. Given Y , (cid:104) X , Y (cid:105) ∼ N ( (cid:104) µ , Y (cid:105) , (cid:107) Y (cid:107) ) . Hence E ( e t (cid:104) X , Y (cid:105) | Y ) = e (cid:104) µ , Y (cid:105) t + (cid:107) Y (cid:107) t / . Define Z = Y − ν . From (cid:104) µ , Y (cid:105) = (cid:104) µ , ν (cid:105) + (cid:104) µ , Z (cid:105) and (cid:107) Y (cid:107) = (cid:107) ν (cid:107) + 2 (cid:104) ν , Z (cid:105) + (cid:107) Z (cid:107) we obtain that log E e t (cid:104) X , Y (cid:105) = log E [ E ( e t (cid:104) X , Y (cid:105) | Y )] log E exp (cid:2) ( (cid:104) µ , ν (cid:105) + (cid:104) µ , Z (cid:105) ) t + ( (cid:107) ν (cid:107) + 2 (cid:104) ν , Z (cid:105) + (cid:107) Z (cid:107) ) t / (cid:3) = (cid:104) µ , ν (cid:105) t + (cid:107) ν (cid:107) t / E exp (cid:0) (cid:104) t µ + t ν , Z (cid:105) + (cid:107) Z (cid:107) t / (cid:1) = (cid:104) µ , ν (cid:105) t + (cid:107) ν (cid:107) t / − d (cid:18) − · t (cid:19) + (cid:107) t µ + t ν (cid:107) − · t / (cid:104) µ , ν (cid:105) t + (cid:107) ν (cid:107) t − d − t ) + t (cid:107) µ + t ν (cid:107) − t )= t − t ) ( (cid:107) µ (cid:107) + (cid:107) ν (cid:107) ) + t − t (cid:104) µ , ν (cid:105) − d − t ) . Lemma H.5.
Let { S n } ∞ n =1 be random variables such that Λ n ( t ) = log E e tS n exists for all t ∈ [ − R n , R n ] , where { R n } ∞ n =1 is a positive sequence tending to infinity. Suppose there is aconvex function Λ : R → R and a positive sequence { a n } ∞ n =1 tending to infinity such that lim n →∞ Λ n ( t ) /a n = Λ( t ) for all t ∈ R . We have lim n →∞ a − n log P ( S n ≤ ca n ) = − sup t ∈ R { ct − Λ( t ) } , ∀ c < Λ (cid:48) (0) . Proof of Lemma H.5.
This result follows directly from the Gärtner-Ellis theorem (Gärt-ner, 1977; Ellis, 1984) for large deviation principles.
H.2 Other lemmas
Lemma H.6.
Let x ∈ (0 , π/ , ε ∈ (0 , and δ = επ ( π − x ) . We have max | y |≤ δ | cos( x + y )cos x − | ≤ ε . Moreover, if x > δ , then max | y |≤ δ/ | sin x sin ( x + y ) − | ≤ .Proof of Lemma H.6. Recall the elementary identity cos( x + y ) = cos x cos y − sin x sin y . If | y | ≤ δ , then | sin y | ≤ | y | ≤ δ = επ ( π − x ) ≤ tan( π − x ) and (cid:12)(cid:12)(cid:12)(cid:12) cos( x + y )cos x − cos y (cid:12)(cid:12)(cid:12)(cid:12) ≤ sin x | sin y | cos x = | sin y | tan( π − x ) ≤ επ ( π − x )tan( π − x ) ≤ ε π , ≤ − cos y ≤ y ≤ (2 δ ) ε (1 − x/π )] ≤ ε . The result on max | y |≤ δ | cos( x + y )cos x − | follows from the estimates above and ε π + ε = ε (1 /π + ε ) ≤ ε .The identity sin( x + y ) = sin x cos y + cos x sin y imply that if δ < x ≤ tan x and | y | ≤ δ/ , then (cid:12)(cid:12)(cid:12)(cid:12) sin( x + y )sin x − cos y (cid:12)(cid:12)(cid:12)(cid:12) ≤ cos x | sin y | sin x = | sin y | tan x ≤ δ/ x ≤ δ/ x ≤ , ≤ − cos y ≤ y ≤ ( δ/ ε (1 − x/π )] ≤ ε ≤ . Hence for | y | ≤ δ/ , we have | sin( x + y )sin x − | ≤ + = < . Direct calculation yields ≤ sin( x + y )sin x ≤ , ≤ sin x sin ( x + y ) ≤ and | sin x sin ( x + y ) − | ≤ .57 emma H.7. For t ≥ and s ≥ , define P ( t, s ) = (cid:82) π e t cos x (sin x ) s − d x and a = ( s − /t .There exists a constant c > and a continuous, non-decreasing function w : [0 , c ] (cid:55)→ [0 , with w (0) = 0 such that when max { /t, s /t } ≤ c , (cid:12)(cid:12)(cid:12)(cid:12) ∂∂t [log P ( t, s )]( √ a + 4 − a ) / − (cid:12)(cid:12)(cid:12)(cid:12) ≤ w (max { /t, s /t } ) . Proof of Lemma H.7.
It suffices to show that ∂∂t [log P ( t,s )]( √ a +4 − a ) / → as t → ∞ and t /s → ∞ .If s = 2 , then a = 0 , P ( t, s ) = (cid:82) π e t cos x d x and ∂∂t P ( t, s ) = (cid:82) π cos xe t cos x d x . A directapplication of Laplace’s method (Laplace, 1986) yields ∂∂t [log P ( t, s )] = [ ∂∂t P ( t, s )] /P ( t, s ) → as t → ∞ , proving the result. From now on we assume s > and thus a > . Underour general setting, the proof is quite involved and existing results in asymptotic analysis,including the generalization of Laplace’s method to two-parameter asymptotics (Fulks, 1951)cannot be directly applied.Define f ( x, a ) = e cos x sin a x for x ∈ [0 , π ] . Then P ( t, s ) = (cid:82) π f t ( x, a )d x and ∂∂t P ( t, s ) = (cid:82) π cos xf t ( x, a )d x . From log f ( x, a ) = cos x + a log sin x we get ∂∂x [log f ( x, a )] = − sin x + a cos x sin x and ∂ ∂x [log f ( x, a )] = − cos x − a sin x . (H.4)Let x ∗ be the solution to ∂∂x [log f ( x, a )] = 0 on (0 , π ) . We have x ∗ ∈ (0 , π/ , a = 1cos x ∗ − cos x ∗ , cos x ∗ = √ a + 4 − a and sin x ∗ = (cid:18) a ( √ a + 4 − a )2 (cid:19) / . (H.5)Moreover, f ( · , a ) is strictly increasing in [0 , x ∗ ) and strictly decreasing in ( x ∗ , π ] . Hence x ∗ is its unique maximizer in [0 , π ] .Fix any ε ∈ (0 , / and let δ = επ ( π − x ∗ ) . Define I = [ x ∗ − δ, x ∗ + 2 δ ] ∩ [0 , π ] , J = [ x ∗ , x ∗ + δ/ and r ( a ) = inf y ∈ J f ( y, a ) / sup y ∈ [0 ,π ] \ I f ( y, a ) . Then J ⊆ I ⊆ [0 , π/ and | J | = δ/ . We have (cid:12)(cid:12)(cid:12)(cid:12) P ( t, s ) (cid:82) I f t ( x, a )d x − (cid:12)(cid:12)(cid:12)(cid:12) = (cid:82) [0 ,π ] \ I f t ( x, a )d x (cid:82) I f t ( x, a )d x ≤ (cid:82) [0 ,π ] \ I f t ( x, a )d x (cid:82) J f t ( x, a )d x ≤ π [sup y ∈ [0 ,π ] \ I f ( y, a )] t ( δ/ y ∈ J f ( y, a )] t = 6 πδr t ( a ) and (cid:12)(cid:12)(cid:12)(cid:12) ∂∂t P ( t, s ) (cid:82) I cos xf t ( x, a )d x − (cid:12)(cid:12)(cid:12)(cid:12) ≤ (cid:82) [0 ,π ] \ I | cos x | f t ( x, a )d x (cid:82) I cos xf t ( x, a )d x ≤ (cid:82) [0 ,π ] \ I f t ( x, a )d x (cid:82) J cos xf t ( x, a )d x ≤ π [sup y ∈ [0 ,π ] \ I f ( y, α )] t cos( x ∗ + δ )( δ/ y ∈ J f ( y, α )] t = 6 π cos( x ∗ + δ ) δr t ( a ) ≤ π δ r t ( a ) , x ∗ + 2 δ < π/ and cos( x ∗ + δ ) ≥ cos( π/ − δ ) = sin δ ≥ δ/π . Consequently, max (cid:26)(cid:12)(cid:12)(cid:12)(cid:12) P ( t, s ) (cid:82) I f t ( x, a )d x − (cid:12)(cid:12)(cid:12)(cid:12) , (cid:12)(cid:12)(cid:12)(cid:12) ∂∂t P ( t, s ) (cid:82) I cos xf t ( x, a )d x − (cid:12)(cid:12)(cid:12)(cid:12)(cid:27) ≤ π δ r t ( a ) . Let h ( a, t ) denote the right hand side. If h ( a, t ) < , the estimate above yields − h ( a, t )1 + h ( a, t ) ≤ [ ∂∂t P ( t, s )] /P ( t, s ) (cid:82) I cos xf t ( x, a )d x/ (cid:82) I f t ( x, a )d x ≤ h ( a, t )1 − h ( a, t ) . According to Lemma H.6, | cos x/ cos x ∗ − | ≤ ε holds for all x ∈ I . Hence (1 − ε ) 1 − h ( a, t )1 + h ( a, t ) ≤ ∂∂t [log P ( t, s )]cos x ∗ ≤ (1 + ε ) 1 + h ( a, t )1 − h ( a, t ) . Note that our assumptions t → ∞ and t /s → ∞ imply that t/ ( a ∨ → ∞ . Belowwe will prove h ( a, t ) → as t/ ( a ∨ → ∞ for any fixed ε ∈ (0 , / . If that holds,then we get the desired result by letting ε → .The analysis of h ( a, t ) hinges on that of r ( a ) = inf y ∈ J f ( y, a ) / sup y ∈ [0 ,π ] \ I f ( y, a ) . 
Themonotonicity of f ( · , a ) in [0 , x ∗ ) and ( x ∗ , π ] yields inf y ∈ J f ( y, a ) = f ( x ∗ + δ/ , a ) , sup y ∈ [0 ,π ] \ I f ( y, a ) = max { f ( x ∗ − δ, a ) , f ( x ∗ + 2 δ, a ) }≤ max { f ( x ∗ − δ/ , a ) , f ( x ∗ + δ/ , a ) } , if x ∗ > δ, sup y ∈ [0 ,π ] \ I f ( y, a ) = f ( x ∗ + 2 δ, a ) , if x ∗ ≤ δ. The two cases x ∗ > δ and x ∗ ≤ δ require different treatments. If we define g ( x ) =1 / cos x − cos x for x ∈ (0 , π/ , then a = g ( x ∗ ) and δ = επ ( π − x ∗ ) yield the following simplefact. Fact H.1. If x ∗ > δ , then x ∗ > ε ε/π , a > g ( ε ε/π ) and δ < πε π +4 ε ; if x ∗ ≤ δ , then x ∗ ≤ ε ε/π , a ≤ g ( ε ε/π ) and δ ≥ πε π +4 ε . We first consider the case where x ∗ > δ , which is equivalent to a > g ( ε ε/π ) . Let I (cid:48) = [ x ∗ − δ/ , x ∗ + δ/ . For any y ∈ I (cid:48) , there exists ξ in the closed interval between x ∗ and y such that log f ( y, a ) = log f ( x ∗ , a ) + ∂∂x [log f ( x, a )] | x = x ∗ ( y − x ) + 12 ∂ ∂x [log f ( x, a )] | x = ξ ( y − x ) . By construction, ∂∂x [log f ( x, a )] | x = x ∗ = 0 . From equation (H.4) we get max y ∈ I (cid:48) (cid:12)(cid:12)(cid:12)(cid:12) ∂ ∂x [log f ( x, a )] | x = y∂ ∂x [log f ( x, a )] | x = x ∗ − (cid:12)(cid:12)(cid:12)(cid:12) ≤ max y ∈ I (cid:48) (cid:12)(cid:12)(cid:12)(cid:12) cos y cos x ∗ − (cid:12)(cid:12)(cid:12)(cid:12) + max y ∈ I (cid:48) (cid:12)(cid:12)(cid:12)(cid:12) sin x ∗ sin y − (cid:12)(cid:12)(cid:12)(cid:12) ≤ ε + 916 ≤
132 + 916 = 1932 , ε ≤ / . Therefore, inf y ∈ J log f ( y, a ) − log f ( x ∗ , a ) ∂ ∂x [log f ( x, a )] | x = x ∗ ≤ (cid:18) (cid:19)(cid:18) δ (cid:19) = 5132 · δ , sup y ∈ [0 ,π ] \ I (cid:48) log f ( y, a ) − log f ( x ∗ , a ) ∂ ∂x [log f ( x, a )] | x = x ∗ ≥ (cid:18) − (cid:19)(cid:18) δ (cid:19) = 1332 · δ · δ . Since ∂ ∂x [log f ( x, a )] | x = x ∗ = − cos x ∗ − a/ sin x ∗ < , log r ( a ) = inf y ∈ J log f ( y, a ) − sup y ∈ [0 ,π ] \ I log f ( y, a ) ≥ inf y ∈ J log f ( y, a ) − sup y ∈ [0 ,π ] \ I (cid:48) log f ( y, a ) ≥ ∂ ∂x [log f ( x, a )] | x = x ∗ (cid:18) · δ − · δ (cid:19) = (cos x ∗ + a/ sin x ∗ ) δ × × (cid:38) aδ / sin x ∗ . From this and h ( a, t ) = π δ r t ( a ) we get − log h ( a, t ) = − log(3 π ) + log δ + t log r ( a ) (cid:38) − δ + taδ / sin x ∗ . From (H.5) we see that lim a →∞ sin x ∗ = 1 , lim a →∞ a cos x ∗ = 1 and lim a →∞ a ( π − x ∗ ) = 1 .Since δ = επ ( π − x ∗ ) > , we have lim a →∞ aδ = επ . There exists C > determined by ε suchthat for any a > g ( ε ε/π ) , we have δ ≥ C /a and aδ / sin x ∗ ≥ C /a . As a result, for some C determined by ε , − log h ( a, t ) ≥ C ( − − log a + t/a ) ≥ C [ − − log( a ∨
1) + t/ ( a ∨ , ∀ a > g (cid:18) ε ε/π (cid:19) . (H.6) We move on to the case where x ∗ ≤ δ . Recall that for x ∈ ( x ∗ , x ∗ + 2 δ ) ⊆ ( x ∗ , π/ ,we have ∂∂x [log f ( x, a )] < and − ∂ ∂x [log( x, a )] = cos x + a sin x ≥ cos x ≥ cos( x ∗ + 2 δ ) ≥ cos(4 δ ) ≥ cos(1 / , where we used δ ≤ ε/ ≤ / . By Taylor expansion, there exists ξ ∈ [ x ∗ + δ/ , x ∗ + 2 δ ] such that log r ( a ) = inf y ∈ J log f ( y, a ) − sup y ∈ [0 ,π ] \ I log f ( y, a ) = log f ( x ∗ + δ/ , a ) − log f ( x ∗ + 2 δ, a )= − (cid:18) ∂∂x [log( x, a )] | x = x ∗ + δ/ (2 δ − δ/
6) + 12 ∂ ∂x [log( x, a )] | x = ξ (2 δ − δ/ (cid:19) >
12 inf x ∈ [ x ∗ ,x ∗ +2 δ ] (cid:18) − ∂ ∂x [log( x, a )] (cid:19) (2 δ − δ/ (cid:38) δ . Based on h ( a, t ) = π δ r t ( a ) and δ ≥ πε π +4 ε from Fact H.1, there exists some C > determinedby ε such that − log h ( a, t ) ≥ C ( − t ) ≥ C [ − t/ ( a ∨ holds when a ≤ g ( ε ε/π ) .60 his bound, (H.6) and log( a ∨ ≤ a ∨ imply that − log h ( a, t ) (cid:38) − − log( a ∨
1) + ta ∨ ≥ − − ( a ∨
1) + ta ∨ − a ∨ (cid:18) t ( a ∨ − (cid:19) . As t/ ( a ∨ → ∞ , we have − log h ( a, t ) → ∞ and h ( a, t ) → . Lemma H.8.
For t ≥ and s ≥ , define a = ( s − /t and g ( t, s ) = ( √ a + 4 − a ) / . There exist a constant c ∈ (0 , and a function w : [0 , c ] → [0 , such that when max { /t , d /t , | t − t | /t , | t − t | /t } ≤ c , (cid:12)(cid:12)(cid:12)(cid:12) log P ( t , s ) − log P ( t , s ) g ( t , s )( t − t ) − (cid:12)(cid:12)(cid:12)(cid:12) ≤ w (max { /t , s /t , | t − t | /t , | t − t | /t } ) . Proof of Lemma H.8.
Let h ( a ) = ( √ a + 4 − a ) / . Observe that ∂a∂t = − ( s − /t = − a/t and h (cid:48) ( a ) = ( a √ a +4 −
1) = − h ( a ) / √ a + 4 . By the chain rule, ∂∂t [log g ( t, s )] = dd a [log h ( a )] · ∂a∂t = h (cid:48) ( a ) h ( a ) · ∂a∂t = at √ a + 4 . Hence < ∂∂t [log g ( t, s )] ≤ /t . For any t ≥ t > there exists ξ ∈ [ t , t ] such that ≤ log (cid:18) g ( t , s ) g ( t , s ) (cid:19) = log g ( t , s ) − log g ( t , s ) = ∂∂t [log g ( t, s )] | t = ξ ( t − t ) ≤ t − t ξ ≤ t − t t . This leads to | g ( t , s ) /g ( t , s ) − | ≤ e | t − t | / ( t ∧ t ) − for any t , t > .Let c and w be those defined in the statement of Lemma H.7. Suppose that t > and s ≥ satisfies max { /t , s /t } < c/ . When t ≥ t / / , max { /t, s /t } ≤ { /t , s /t } < c . Lemma H.7 and the non-decreasing property of w force (cid:12)(cid:12)(cid:12)(cid:12) ∂∂t [log P ( t, s )] g ( t, s ) − (cid:12)(cid:12)(cid:12)(cid:12) ≤ w (max { / ( t / / ) , s / ( t / / ) } ) ≤ w (2 max { /t , s /t } ) , ∀ t ≥ t / / . When | t − t | ≤ t / , we have t ≥ . t ≥ t / / and | t − t | ≤ . t ≤ t / / . Then | t − t | / ( t ∧ t ) ≤ / and | g ( t, s ) /g ( t , s ) − | ≤ e | t − t | / ( t ∧ t ) − ≤ e / | t − t | t ∧ t ≤ √ e | t − t | t / / ≤ | t − t | t < . Hence when t ∈ [4 t / , t / , [1 − w (2 max { /t , s /t } )] (cid:18) − | t − t | t (cid:19) ≤ ∂∂t [log P ( t, s )] g ( t , s ) ≤ [1 + w (2 max { /t , s /t } )] (cid:18) | t − t | t (cid:19) .
61e can find a constant ˜ c ∈ (0 , and construct a new function ˜ w : [0 , ˜ c ] → [0 , suchthat for any distinct t , t ∈ [(1 − ˜ c ) t , (1 + ˜ c ) t ] , (cid:12)(cid:12)(cid:12)(cid:12) log P ( t , s ) − log P ( t , s ) g ( t , s )( t − t ) − (cid:12)(cid:12)(cid:12)(cid:12) ≤ ˜ w (max { /t , s /t , | t − t | /t , | t − t | /t } ) . The proof is completed by re-defining c and w as ˜ c and ˜ w , respectively. References
Abbe, E. (2017). Community detection and stochastic block models: recent developments.
The Journal of Machine Learning Research Abbe, E. , Bandeira, A. S. and
Hall, G. (2016). Exact recovery in the stochastic blockmodel.
IEEE Transactions on Information Theory Abbe, E. , Fan, J. , Wang, K. and
Zhong, Y. (2017). Entrywise eigenvector analysis ofrandom matrices with low expected rank. arXiv preprint arXiv:1709.09565 . Amini, A. A. and
Razaee, Z. S. (2019). Concentration of kernel matrices with applicationto kernel spectral clustering. arXiv preprint arXiv:1909.03347 . Anderson, T. W. (1963). Asymptotic theory for principal component analysis.
The Annalsof Mathematical Statistics Aronszajn, N. (1950). Theory of reproducing kernels.
Transactions of the Americanmathematical society Awasthi, P. , Bandeira, A. S. , Charikar, M. , Krishnaswamy, R. , Villar, S. and
Ward, R. (2015). Relax, no need to round: Integrality of clustering formulations. In
Proceedings of the 2015 Conference on Innovations in Theoretical Computer Science . Azizyan, M. , Singh, A. and
Wasserman, L. (2013). Minimax theory for high-dimensionalGaussian mixtures with sparse mean separation. In
Advances in Neural Information Pro-cessing Systems . Baik, J. , Arous, G. B. and
Péché, S. (2005). Phase transition of the largest eigenvaluefor nonnull complex sample covariance matrices.
The Annals of Probability Benaych-Georges, F. and
Nadakuditi, R. R. (2012). The singular values and vectorsof low rank perturbations of large rectangular random matrices.
Journal of MultivariateAnalysis
Binkiewicz, N. , Vogelstein, J. T. and
Rohe, K. (2017). Covariate-assisted spectralclustering.
Biometrika
Blanchard, G. , Bousquet, O. and
Zwald, L. (2007). Statistical properties of kernelprincipal component analysis.
Machine Learning oucheron, S. , Lugosi, G. and
Massart, P. (2013).
Concentration inequalities: Anonasymptotic theory of independence . Oxford university press.
Cai, C. , Li, G. , Chi, Y. , Poor, H. V. and
Chen, Y. (2019). Subspace estimation fromunbalanced and incomplete data matrices: (cid:96) , ∞ statistical guarantees. arXiv preprintarXiv:1910.04267 . Cai, T. T. and
Zhang, A. (2018). Rate-optimal perturbation bounds for singular subspaceswith applications to high-dimensional statistics.
The Annals of Statistics Candès, E. J. and
Recht, B. (2009). Exact matrix completion via convex optimization.
Foundations of Computational mathematics Cape, J. , Tang, M. and
Priebe, C. E. (2019). The two-to-infinity norm and singular sub-space geometry with applications to high-dimensional statistics.
The Annals of Statistics Chen, X. and
Yang, Y. (2018). Hanson-Wright inequality in Hilbert spaces with applica-tion to K -means clustering for non-Euclidean data. arXiv e-prints arXiv:1810.11180. Chen, X. and
Yang, Y. (2020). Cutoff for exact recovery of Gaussian mixture models. arXiv preprint arXiv:2001.01194 . Chen, Y. , Fan, J. , Ma, C. and
Wang, K. (2017). Spectral method and regularized MLEare both optimal for top- K ranking. arXiv preprint arXiv:1707.09971 . Chen, Y. , Fan, J. , Ma, C. and
Wang, K. (2019). Spectral method and regularized MLEare both optimal for top-K ranking.
Annals of statistics Cristianini, N. and
Shawe-Taylor, J. (2000).
An introduction to support vector ma-chines and other kernel-based learning methods . Cambridge university press.
Damle, A. and
Sun, Y. (2019). Uniform bounds for invariant subspace perturbations. arXiv preprint arXiv:1905.07865 . Davis, C. and
Kahan, W. M. (1970). The rotation of eigenvectors by a perturbation. III.
SIAM Journal on Numerical Analysis Dempster, A. P. , Laird, N. M. and
Rubin, D. B. (1977). Maximum likelihood fromincomplete data via the EM algorithm.
Journal of the Royal Statistical Society: Series B(Methodological) Deshpande, Y. , Sen, S. , Montanari, A. and
Mossel, E. (2018). Contextual stochasticblock models. In
Advances in Neural Information Processing Systems . El Karoui, N. (2018). On the impact of predictor geometry on the performance on high-dimensional ridge-regularized generalized robust regression estimators.
Probability Theoryand Related Fields ldridge, J. , Belkin, M. and
Wang, Y. (2017). Unperturbed: spectral analysis beyondDavis-Kahan. arXiv preprint arXiv:1706.06516 . Ellis, R. S. (1984). Large deviations for a general class of random vectors.
The Annals ofProbability Erdős, L. , Schlein, B. and
Yau, H.-T. (2009). Semicircle law on short scales and de-localization of eigenvectors for Wigner random matrices.
The Annals of Probability Fan, J. , Wang, W. and
Zhong, Y. (2019). An (cid:96) ∞ eigenvector perturbation bound and itsapplication to robust covariance estimation. Journal of Econometrics
Fei, Y. and
Chen, Y. (2018). Hidden integrality of SDP relaxations for sub-Gaussianmixture models. In
Conference On Learning Theory . Feige, U. and
Ofek, E. (2005). Spectral techniques applied to sparse random graphs.
Random Structures & Algorithms Fisher, R. A. (1936). The use of multiple measurements in taxonomic problems.
Annalsof eugenics Fulks, W. (1951). A generalization of Laplace’s method.
Proceedings of the AmericanMathematical Society Gao, C. and
Zhang, A. Y. (2019). Iterative algorithm for discrete structure recovery. arXiv preprint arXiv:1911.01018 . Gärtner, J. (1977). On large deviations from the invariant measure.
Theory of Probability& Its Applications Giraud, C. and
Verzelen, N. (2018). Partial recovery bounds for clustering with therelaxed k means. arXiv preprint arXiv:1807.07547 . Gross, D. (2011). Recovering low-rank matrices from few coefficients in any basis.
IEEETransactions on Information Theory Hoeffding, W. (1963). Probability inequalities for sums of bounded random variables.
Journal of the American statistical association Holland, P. W. , Laskey, K. B. and
Leinhardt, S. (1983). Stochastic blockmodels:First steps.
Social Networks Javanmard, A. and
Montanari, A. (2018). Debiasing the lasso: Optimal sample size forGaussian designs.
The Annals of Statistics Jin, J. and
Wang, W. (2016). Influential features PCA for high dimensional clustering.
The Annals of Statistics ohnstone, I. M. (2001). On the distribution of the largest eigenvalue in principal com-ponents analysis. Annals of statistics
Johnstone, I. M. and
Lu, A. Y. (2009). On consistency and sparsity for principal com-ponents analysis in high dimensions.
Journal of the American Statistical Association
Jung, S. and
Marron, J. S. (2009). PCA consistency in high dimension, low sample sizecontext.
The Annals of Statistics Koltchinskii, V. and
Giné, E. (2000). Random matrix approximation of spectra ofintegral operators.
Bernoulli Koltchinskii, V. and
Lounici, K. (2014). Concentration inequalities and moment boundsfor sample covariance operators. arXiv preprint arXiv:1405.2468 . Koltchinskii, V. and
Lounici, K. (2016). Asymptotics and concentration bounds forbilinear forms of spectral projectors of sample covariance. In
Annales de l’Institut HenriPoincaré, Probabilités et Statistiques , vol. 52. Institut Henri Poincaré.
Koltchinskii, V. and
Xia, D. (2016). Perturbation of linear forms of singular vectorsunder Gaussian noise. In
High Dimensional Probability VII . Springer, 397–423.
Kumar, A. and
Kannan, R. (2010). Clustering with spectral norm and the k-meansalgorithm. In .IEEE.
Laplace, P. S. (1986). Memoir on the probability of the causes of events.
Statistical Science Lei, L. (2019). Unified (cid:96) →∞ eigenspace perturbation theory for symmetric random matrices. arXiv preprint arXiv:1909.04798 . Lloyd, S. (1982). Least squares quantization in pcm.
IEEE transactions on informationtheory Löffler, M. , Zhang, A. Y. and
Zhou, H. H. (2019). Optimality of spectral clusteringfor Gaussian mixture model. arXiv preprint arXiv:1911.00538 . Lu, Y. and
Zhou, H. H. (2016). Statistical and computational guarantees of Lloyd’salgorithm and its variants. arXiv preprint arXiv:1612.02099 . Ma, Z. and
Ma, Z. (2017). Exploration of large networks with covariates via fast anduniversal latent space model fitting. arXiv preprint arXiv:1705.02372 . Mao, X. , Sarkar, P. and
Chakrabarti, D. (2017). Estimating mixed memberships withsharp eigenvector deviations. arXiv preprint arXiv:1709.00407 . Mele, A. , Hao, L. , Cape, J. and
Priebe, C. E. (2019). Spectral inference for largestochastic blockmodels with nodal covariates. arXiv preprint arXiv:1908.06438 .65 ixon, D. G. , Villar, S. and
Ward, R. (2017). Clustering subgaussian mixtures bysemidefinite programming.
Information and Inference: A Journal of the IMA Montanari, A. and
Sun, N. (2018). Spectral algorithms for tensor completion.
Commu-nications on Pure and Applied Mathematics Nadler, B. (2008). Finite sample approximation results for principal component analysis:A matrix perturbation approach.
The Annals of Statistics Ndaoud, M. (2018). Sharp optimal recovery in the two component Gaussian mixture model. arXiv preprint arXiv:1812.08078 . Neyman, J. and
Pearson, E. S. (1933). IX. On the problem of the most efficient tests ofstatistical hypotheses.
Philosophical Transactions of the Royal Society of London. SeriesA, Containing Papers of a Mathematical or Physical Character
Ng, A. Y. , Jordan, M. I. and
Weiss, Y. (2002). On spectral clustering: Analysis and analgorithm. In
Advances in Neural Information Processing Systems . Novembre, J. , Johnson, T. , Bryc, K. , Kutalik, Z. , Boyko, A. R. , Auton, A. , Indap, A. , King, K. S. , Bergmann, S. , Nelson, M. R. et al. (2008). Genes mirrorgeography within Europe.
Nature
O’Rourke, S. , Vu, V. and
Wang, K. (2018). Random perturbation of low rank matrices:Improving classical bounds.
Linear Algebra and its Applications
Paul, D. (2007). Asymptotics of sample eigenstructure for a large dimensional spikedcovariance model.
Statistica Sinica
Pearson, K. (1894). Contributions to the mathematical theory of evolution.
PhilosophicalTransactions of the Royal Society of London. A
Pearson, K. (1901). LIII. on lines and planes of closest fit to systems of points in space.
The London, Edinburgh, and Dublin Philosophical Magazine and Journal of Science Perry, A. , Wein, A. S. , Bandeira, A. S. and
Moitra, A. (2016). Optimality andsub-optimality of PCA for spiked random matrices and synchronization. arXiv preprintarXiv:1609.05573 . Ringnér, M. (2008). What is principal component analysis?
Nature biotechnology Royer, M. (2017). Adaptive clustering through semidefinite programming. In
Advances inNeural Information Processing Systems . Schmitt, B. A. (1992). Perturbation bounds for matrix square roots and Pythagoreansums.
Linear algebra and its applications chölkopf, B. , Smola, A. and
Müller, K.-R. (1997). Kernel principal componentanalysis. In
International conference on artificial neural networks . Springer.
Shi, J. and
Malik, J. (2000). Normalized cuts and image segmentation.
IEEE Transactionson pattern analysis and machine intelligence Srivastava, P. R. , Sarkar, P. and
Hanasusanto, G. A. (2019). A robust spec-tral clustering algorithm for sub-Gaussian mixture models with outliers. arXiv preprintarXiv:1912.07546 . Stewart, G. and
Sun, J. (1990).
Matrix Perturbation Theory . Computer Science andScientific Computing, ACADEMIC PressINC.URL https://books.google.com/books?id=bIYEogEACAAJ
Vempala, S. and
Wang, G. (2004). A spectral algorithm for learning mixture models.
Journal of Computer and System Sciences Vershynin, R. (2010). Introduction to the non-asymptotic analysis of random matrices. arXiv preprint arXiv:1011.3027 . Wang, K. (2019). Some compact notations for concentration inequalities and user-friendlyresults. arXiv preprint arXiv:1912.13463 . Wang, W. and
Fan, J. (2017). Asymptotics of empirical eigenstructure for high dimensionalspiked covariance.
Annals of statistics Wedin, P.-Å. (1972). Perturbation bounds in connection with singular value decomposition.
BIT Numerical Mathematics Weng, H. and
Feng, Y. (2016). Community detection with nodal information. arXivpreprint arXiv:1610.09735 . Yan, B. and
Sarkar, P. (2020). Covariate regularized community detection in sparsegraphs.
Journal of the American Statistical Association
Yeung, K. Y. and
Ruzzo, W. L. (2001). Principal component analysis for clustering geneexpression data.
Bioinformatics Zhang, A. , Cai, T. T. and
Wu, Y. (2018). Heteroskedastic pca: Algorithm, optimality,and applications. arXiv preprint arXiv:1810.08316 . Zhang, A. Y. and
Zhou, H. H. (2016). Minimax rates of community detection in stochasticblock models.
The Annals of Statistics Zhang, Y. , Levina, E. and
Zhu, J. (2016). Community detection in networks with nodefeatures.
Electronic Journal of Statistics Zhong, Y. and
Boumal, N. (2018). Near-optimal bounds for phase synchronization.
SIAMJournal on Optimization Zwald, L. and
Blanchard, G. (2006). On the convergence of eigenspaces in kernelprincipal component analysis. In