An $\ell_p$ theory of PCA and spectral clustering
Emmanuel Abbe∗   Jianqing Fan†   Kaizheng Wang‡

June 2020

∗ Institute of Mathematics, EPFL, Lausanne, CH-1015, Switzerland. E-mail: emmanuel.abbe@epfl.ch.
† Department of ORFE, Princeton University, Princeton, NJ 08544, USA. E-mail: [email protected].
‡ Department of IEOR, Columbia University, New York, NY 10027, USA. E-mail: [email protected].
Abstract
Principal Component Analysis (PCA) is a powerful tool in statistics and machine learning. While existing study of PCA focuses on the recovery of principal components and their associated eigenvalues, there are few precise characterizations of individual principal component scores that yield low-dimensional embedding of samples. That hinders the analysis of various spectral methods. In this paper, we first develop an $\ell_p$ perturbation theory for a hollowed version of PCA in Hilbert spaces which provably improves upon the vanilla PCA in the presence of heteroscedastic noises. Through a novel $\ell_p$ analysis of eigenvectors, we investigate entrywise behaviors of principal component score vectors and show that they can be approximated by linear functionals of the Gram matrix in $\ell_p$ norm, which includes $\ell_2$ and $\ell_\infty$ as special examples. For sub-Gaussian mixture models, the choice of $p$ giving optimal bounds depends on the signal-to-noise ratio, which further yields optimality guarantees for spectral clustering. For contextual community detection, the $\ell_p$ theory leads to a simple spectral algorithm that achieves the information threshold for exact recovery. These also provide optimal recovery results for Gaussian mixture and stochastic block models as special cases.

Keywords: principal component analysis, eigenvector perturbation, spectral clustering, mixture models, community detection, contextual network models, phase transitions.
Introduction

Modern technologies generate enormous volumes of data that present new statistical and computational challenges. High-throughput data come inevitably with a tremendous amount of noise, from which very faint signals are to be discovered. Moreover, the analytic procedures must be affordable in terms of computational costs. While likelihood-based approaches usually lead to non-convex optimization problems that are NP-hard in general, the method of moments provides viable solutions.

Principal Component Analysis (PCA) (Pearson, 1901) is arguably the most prominent tool of this type. It significantly reduces the dimension of data using eigenvalue decomposition of a second-order moment matrix. Unlike the classical settings where the dimension $d$ is much smaller than the sample size $n$, nowadays it could be the other way around in numerous applications (Ringnér, 2008; Novembre et al., 2008; Yeung and Ruzzo, 2001). Reliability of the low-dimensional embedding is of crucial importance, as all downstream tasks are based on that. Unfortunately, existing theories often fail to provide sharp guarantees when both the dimension and the noise level are high, especially in the absence of sparsity structures. The matter is further complicated by the use of nonlinear kernels for dimension reduction (Schölkopf et al., 1997), which is de facto PCA in some infinite-dimensional Hilbert space.

In this paper, we investigate the spectral embedding returned by a hollowed version of PCA. Consider the signal-plus-noise model
$$x_i = \bar{x}_i + z_i \in \mathbb{R}^d, \qquad i \in [n]. \tag{1.1}$$
Here $\{x_i\}_{i=1}^n$ are noisy observations of signals $\{\bar{x}_i\}_{i=1}^n$ contaminated by $\{z_i\}_{i=1}^n$. Define the data matrices $X = (x_1, \cdots, x_n)^\top \in \mathbb{R}^{n\times d}$ and $\bar{X} = (\bar{x}_1, \cdots, \bar{x}_n)^\top \in \mathbb{R}^{n\times d}$. Let $\bar{G} = \bar{X}\bar{X}^\top \in \mathbb{R}^{n\times n}$ be the Gram matrix of $\{\bar{x}_i\}_{i=1}^n$, and $G = \mathcal{H}(XX^\top)$ be the hollowed Gram matrix of $\{x_i\}_{i=1}^n$, where $\mathcal{H}(\cdot)$ is the hollowing operator that zeroes out all diagonal entries of a square matrix. Denote by $\{\lambda_j, u_j\}_{j=1}^n$ and $\{\bar\lambda_j, \bar{u}_j\}_{j=1}^n$ the eigen-pairs of $G$ and $\bar{G}$, respectively, where the eigenvalues are sorted in descending order. While PCA computes the embedding by eigenvalue decomposition of $XX^\top$, here we delete its diagonal to enhance concentration and handle heteroscedasticity (Koltchinskii and Giné, 2000). We seek an answer to the following fundamental question: how do the eigenvectors of $G$ relate to those of $\bar{G}$?

Roughly speaking, our main results state that
$$u_j = G u_j / \lambda_j \approx G \bar{u}_j / \bar\lambda_j, \tag{1.2}$$
where the approximation is measured in the $\ell_p$ norm for a proper choice of $p$. In words, the eigenvector $u_j$ is a nonlinear function of $G$ but can be well approximated by the linear function $G\bar{u}_j/\bar\lambda_j$ in the $\ell_p$ norm, where $p$ is given by the model signal-to-noise ratio (SNR). This linearization facilitates the analysis and allows us to quantify how the magnitude of the signal-to-noise ratio affects theoretical guarantees for signal recovery.
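To make this construction concrete, here is a minimal numerical sketch (ours, not from the paper; all parameter choices are hypothetical) that forms the hollowed Gram matrix $G = \mathcal{H}(XX^\top)$ on synthetic signal-plus-noise data and checks the linearization (1.2) for the top eigenvector.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 300, 1000                        # more features than samples
y = rng.choice([-1.0, 1.0], size=n)     # hidden labels (rank-one signal)
mu = 3.0 / np.sqrt(d) * np.ones(d)      # mean vector with ||mu||_2 = 3
X = np.outer(y, mu) + rng.standard_normal((n, d))   # signal-plus-noise data

G = X @ X.T
np.fill_diagonal(G, 0.0)                # hollowing operator H: zero out the diagonal

eigvals, eigvecs = np.linalg.eigh(G)    # eigenvalues in ascending order
lam1, u1 = eigvals[-1], eigvecs[:, -1]  # top eigen-pair of G

lam1_bar = n * np.dot(mu, mu)           # population eigenvalue n ||mu||^2
u1_bar = y / np.sqrt(n)                 # population eigenvector of G_bar
lin = G @ u1_bar / lam1_bar             # linearization (1.2)
s = np.sign(u1 @ lin)                   # resolve the global sign ambiguity
print("l2 distance to linearization:", np.linalg.norm(s * u1 - lin))
print("label accuracy of sgn(u1):",
      max(np.mean(np.sign(u1) == y), np.mean(np.sign(-u1) == y)))
```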
In many statistical problems such as mixture models, the vectors $\{\bar{x}_i\}_{i=1}^n$ live in a low-dimensional subspace of $\mathbb{R}^d$. Their latent coordinates reflect the geometry of the data, which can be decoded from eigenvalues and eigenvectors of $\bar{G}$. Our results show how well the spectral decomposition of $G$ reveals that of $\bar{G}$, characterizing the behavior of individual embedded samples. From there we easily derive the optimality of spectral clustering in two-component sub-Gaussian mixture models and contextual stochastic block models, in terms of both the misclassification rates and the exact recovery thresholds. In particular, the linearization of eigenvectors (1.2) helps develop a simple spectral method for contextual stochastic block models, efficiently combining the information from the network and the node attributes.

Our general results hold for Hilbert spaces. It is easily seen that the construction of the hollowed Gram matrix $G$ and the subsequent steps only depend on pairwise inner products $\{\langle x_i, x_j\rangle\}_{1\le i,j\le n}$. This makes the "kernel trick" applicable (Cristianini and Shawe-Taylor, 2000), and our analysis readily handles (a hollowed version of) kernel PCA.

We demonstrate the merits of the $\ell_p$ analysis using spectral clustering for a mixture of two Gaussians. Let $y \in \{\pm 1\}^n$ be a label vector with i.i.d. Rademacher entries and $\mu \in \mathbb{R}^d$ be a deterministic mean vector, both of which are unknown. Consider the model
$$x_i = y_i \mu + z_i, \qquad i \in [n], \tag{1.3}$$
where $\{z_i\}_{i=1}^n$ are i.i.d. $N(0, I_d)$ vectors. The goal is to estimate $y$ from $\{x_i\}_{i=1}^n$. The model (1.3) is a special case of the signal-plus-noise model (1.1) with $\bar{x}_i = y_i\mu$. Since $\mathbb{P}(y_i = 1) = \mathbb{P}(y_i = -1) = 1/2$, $\{x_i\}_{i=1}^n$ are i.i.d. samples from a mixture of two Gaussians $\frac{1}{2}N(\mu, I_d) + \frac{1}{2}N(-\mu, I_d)$. By construction, $\bar{X} = (\bar{x}_1, \cdots, \bar{x}_n)^\top = y\mu^\top$ and $\bar{G} = \|\mu\|_2^2 yy^\top$ with $\bar{u}_1 = y/\sqrt{n}$ and $\bar\lambda_1 = n\|\mu\|_2^2$. Hence, $\mathrm{sgn}(u_1)$ becomes a natural estimator for $y$, where $\mathrm{sgn}(\cdot)$ is the entrywise sign function. A fundamental question is whether the empirical eigenvector $u_1$ is informative enough to accurately recover the labels in competitive regimes. To formalize the discussion, we denote by
$$\mathrm{SNR} = \frac{\|\mu\|_2^4}{\|\mu\|_2^2 + d/n} \tag{1.4}$$
the signal-to-noise ratio of model (1.3). Consider the challenging asymptotic regime where $n \to \infty$ and $1 \ll \mathrm{SNR} \lesssim \log n$. (In Theorem 3.2 we also derive results for the exact recovery of the spectral estimator, i.e. $\mathbb{P}(\mathrm{sgn}(u_1) = \pm y) \to 1$, when $\mathrm{SNR} \gg \log n$. Here we omit that case and discuss error rates.) The dimension $d$ may or may not diverge. According to Theorem 3.2, the spectral estimator $\mathrm{sgn}(u_1)$ achieves the minimax optimal misclassification rate
$$e^{-\frac{\mathrm{SNR}}{2}(1 + o(1))}. \tag{1.5}$$
In order to get this result, we start from an $\ell_p$ analysis of $u_1$. Theorem 3.3 shows that
$$\mathbb{P}\Big(\min_{s=\pm 1}\|s u_1 - G\bar{u}_1/\bar\lambda_1\|_p < \varepsilon_n \|\bar{u}_1\|_p\Big) > 1 - Ce^{-p} \tag{1.6}$$
for $p = \mathrm{SNR}$, some constant $C > 0$ and some deterministic sequence $\{\varepsilon_n\}_{n=1}^\infty$ tending to zero. On the event $\|s u_1 - G\bar{u}_1/\bar\lambda_1\|_p < \varepsilon_n \|\bar{u}_1\|_p$, we apply a Markov-type inequality to the entries of $(s u_1 - G\bar{u}_1/\bar\lambda_1)$:
$$\frac{1}{n}\Big|\big\{i:\ |(s u_1 - G\bar{u}_1/\bar\lambda_1)_i| > \sqrt{\varepsilon_n/n}\big\}\Big| \le \frac{n^{-1}\sum_{i=1}^n |(s u_1 - G\bar{u}_1/\bar\lambda_1)_i|^p}{(\sqrt{\varepsilon_n/n})^p} \overset{\mathrm{(i)}}{=} \bigg(\frac{\|s u_1 - G\bar{u}_1/\bar\lambda_1\|_p}{\sqrt{\varepsilon_n}\,\|\bar{u}_1\|_p}\bigg)^p \le \varepsilon_n^{p/2}, \tag{1.7}$$
where (i) follows from $\bar{u}_1 = y/\sqrt{n}$ and $\|\bar{u}_1\|_p^p = n(1/\sqrt{n})^p$. Hence all but an $\varepsilon_n^{\mathrm{SNR}/2}$ fraction of $u_1$'s entries are well-approximated by those of $G\bar{u}_1/\bar\lambda_1$. On the other hand, since the misclassification error is always bounded by 1, the exceptional event in (1.6) may at most contribute a $Ce^{-\mathrm{SNR}}$ amount to the final error. Both $\varepsilon_n^{\mathrm{SNR}/2}$ and $Ce^{-\mathrm{SNR}}$ are negligible compared to the optimal rate $e^{-\mathrm{SNR}/2}$ in (1.5). This helps us show that the $\ell_p$ bound (1.6) ensures sufficient proximity between $u_1$ and $G\bar{u}_1/\bar\lambda_1$, and the analysis boils down to the latter term.

We now explain why $G\bar{u}_1/\bar\lambda_1$ is a good target to aim at. Observe that
$$(G\bar{u}_1)_i = [\mathcal{H}(XX^\top)\bar{u}_1]_i = \sum_{j\ne i} \langle x_i, x_j\rangle y_j/\sqrt{n} \propto \langle x_i, \hat\mu^{(-i)}\rangle, \tag{1.8}$$
where $\hat\mu^{(-i)} = \frac{1}{n-1}\sum_{j\ne i} x_j y_j$ is the leave-one-out sample mean. Consequently, the (unsupervised) spectral estimator $\mathrm{sgn}[(u_1)_i]$ for $y_i$ is approximated by $\mathrm{sgn}(\langle x_i, \hat\mu^{(-i)}\rangle)$, which coincides with the (supervised) linear discriminant analysis (Fisher, 1936) given additional labels $\{y_j\}_{j\ne i}$. This oracle estimator turns out to capture the difficulty of label recovery. That is, $\mathrm{sgn}(G\bar{u}_1/\bar\lambda_1)$ achieves the optimal misclassification rate in (1.5).

Above we provide high-level ideas about why the spectral estimator $\mathrm{sgn}(u_1)$ is optimal. Inequality (1.6) ties $u_1$ and its linearization $G\bar{u}_1/\bar\lambda_1$ together. The latter is connected to the genie-aided estimator through (1.8). As a side remark, the relation (1.8) hinges on the fact that $G$ is hollowed. Otherwise there would be a square term $\langle x_i, x_i\rangle$ making things entangled.
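The relation (1.8) is a finite-sample algebraic identity and can be checked directly. The snippet below (our illustration, on hypothetical synthetic data) confirms that each coordinate of $G\bar{u}_1$ equals $\frac{n-1}{\sqrt{n}}\langle x_i, \hat\mu^{(-i)}\rangle$ up to floating-point error.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 200, 500
y = rng.choice([-1.0, 1.0], size=n)
mu = 1.5 / np.sqrt(d) * np.ones(d)
X = np.outer(y, mu) + rng.standard_normal((n, d))

G = X @ X.T
np.fill_diagonal(G, 0.0)                # hollowed Gram matrix
u1_bar = y / np.sqrt(n)

lhs = G @ u1_bar                        # (G u1_bar)_i
# leave-one-out sample means: mu_hat^(-i) = (n-1)^{-1} sum_{j != i} x_j y_j
total = X.T @ y                         # sum_j x_j y_j
loo = (total[None, :] - X * y[:, None]) / (n - 1)
rhs = (n - 1) / np.sqrt(n) * np.einsum("ij,ij->i", X, loo)
print("max deviation:", np.max(np.abs(lhs - rhs)))   # ~1e-10: identity holds
```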
Related work

Early works on PCA focus on classical settings where the dimension $d$ is fixed and the sample size $n$ goes to infinity (Anderson, 1963). Motivated by modern applications, in the past two decades there has been a surge of interest in high-dimensional PCA. Most papers in this direction study the consistency of empirical eigenvalues (Johnstone, 2001; Baik et al., 2005) or Principal Component (PC) directions (Paul, 2007; Nadler, 2008; Jung and Marron, 2009; Benaych-Georges and Nadakuditi, 2012; Perry et al., 2016; Wang and Fan, 2017) under various spiked covariance models with $d$ growing with $n$. Similar results are also available for infinite-dimensional Hilbert spaces (Koltchinskii and Giné, 2000; Zwald and Blanchard, 2006; Koltchinskii and Lounici, 2016). The analysis of PCs amounts to showing how the leading eigenvectors of $X^\top X = \sum_{i=1}^n x_i x_i^\top \in \mathbb{R}^{d\times d}$ recover those of $\mathbb{E}(x_i x_i^\top)$. When it comes to dimension reduction, one projects the data onto these PCs and gets PC scores. This is directly linked to the leading eigenvectors of the Gram matrix $XX^\top \in \mathbb{R}^{n\times n}$. In high-dimensional problems, the $n$-dimensional PC scores may still consistently reveal meaningful structures even if the $d$-dimensional PCs fail to do so (Cai and Zhang, 2018).

Analysis of PC scores is crucial to the theoretical study of spectral methods. However, existing results (Blanchard et al., 2007; Amini and Razaee, 2019) in related areas cannot precisely characterize individual embedded samples under general conditions. This paper aims to bridge the gap by a novel analysis. In addition, our work is orthogonal to those with sparsity assumptions (Johnstone and Lu, 2009; Jin and Wang, 2016). Here we are concerned with (i) the non-sparse regime where most components contribute to the main variability and (ii) the infinite-dimensional regime in kernel PCA where the sparsity assumption is not appropriate.

There is a vast literature on perturbation theories of eigenvectors. Most classical bounds are deterministic and use the $\ell_2$ norm or other orthonormal-invariant norms as error metrics. This includes the celebrated Davis-Kahan theorem (Davis and Kahan, 1970) and its extensions (Wedin, 1972); see Stewart and Sun (1990) for a review. Improved $\ell_2$-type results are available for stochastic settings (O'Rourke et al., 2018). For many problems in statistics and machine learning, entrywise analysis is more desirable because that helps characterize the spectral embedding of individual samples. Fan et al. (2019), Eldridge et al. (2017), Cape et al. (2019) and Damle and Sun (2019) provide $\ell_\infty$ perturbation bounds in deterministic settings. Their bounds are often too conservative when the noise is stochastic. Recent papers (Koltchinskii and Xia, 2016; Abbe et al., 2017; Mao et al., 2017; Zhong and Boumal, 2018; Chen et al., 2019; Lei, 2019) take advantage of the randomness to obtain sharp $\ell_\infty$ results for challenging tasks. The random matrices considered therein are mostly Wigner-type, with independent entries or similar structures. On the contrary, our hollowed Gram matrix $G$ has a Wishart-type distribution, since its off-diagonal entries are inner products of samples and thus dependent. What is more, our $\ell_p$ bounds with $p$ determined by the signal strength are adaptive. If the signal is weak, existing $\ell_\infty$ analysis does not go through, as strong concentration is required for uniform control of all the entries. However, our $\ell_p$ analysis still manages to control a vast majority of the entries. If the signal is strong, our results imply $\ell_\infty$ bounds. The $\ell_p$ eigenvector analysis in this paper shares some features with the study of $\ell_p$-delocalization (Erdős et al., 2009), yet the settings are very different. It would be interesting to establish further connections.

The applications in this paper are canonical problems in clustering and have been extensively studied. For the sub-Gaussian mixture model, the settings and methods in Giraud and Verzelen (2018), Ndaoud (2018) and Löffler et al. (2019) are similar to ours. The contextual network problem concerns grouping the nodes based on their attributes and pairwise connections; see Binkiewicz et al. (2017), Deshpande et al. (2018) and Yan and Sarkar (2020) for more about the model. We defer detailed discussions on these to Sections 3 and 4.

Organization

We present the general setup and results for $\ell_p$ eigenvector analysis in Section 2; apply them to clustering under mixture models in Section 3 and contextual community detection in Section 4; show a sketch of main proofs in Section 5; and conclude the paper with possible future directions in Section 6.

Notation

We use $[n]$ to refer to $\{1, 2, \cdots, n\}$ for $n \in \mathbb{Z}_+$. Denote by $|\cdot|$ the absolute value of a real number or the cardinality of a set. For real numbers $a$ and $b$, let $a \wedge b = \min\{a, b\}$ and $a \vee b = \max\{a, b\}$. For nonnegative sequences $\{a_n\}_{n=1}^\infty$ and $\{b_n\}_{n=1}^\infty$, we write $a_n \ll b_n$ or $a_n = o(b_n)$ if $b_n > 0$ and $a_n/b_n \to 0$; $a_n \lesssim b_n$ or $a_n = O(b_n)$ if there exists a positive constant $C$ such that $a_n \le C b_n$; $a_n \gtrsim b_n$ or $a_n = \Omega(b_n)$ if $b_n \lesssim a_n$. In addition, we write $a_n \asymp b_n$ if $a_n \lesssim b_n$ and $b_n \lesssim a_n$. We let $\mathbf{1}_S$ be the binary indicator function of a set $S$.

Let $\{e_j\}_{j=1}^d$ be the canonical bases of $\mathbb{R}^d$, $\mathbb{S}^{d-1} = \{x \in \mathbb{R}^d:\ \|x\|_2 = 1\}$ and $B(x, r) = \{y \in \mathbb{R}^d:\ \|y - x\|_2 \le r\}$. For a vector $x = (x_1, \cdots, x_d)^\top \in \mathbb{R}^d$ and $p \ge 1$, define its $\ell_p$ norm $\|x\|_p = (\sum_{i=1}^d |x_i|^p)^{1/p}$. For $i \in [d]$, let $x_{-i}$ be the $(d-1)$-dimensional sub-vector of $x$ without the $i$-th entry. For a matrix $A \in \mathbb{R}^{n\times m}$, we define its spectral norm $\|A\|_2 = \sup_{\|x\|_2=1}\|Ax\|_2$ and Frobenius norm $\|A\|_F = (\sum_{i,j}A_{ij}^2)^{1/2}$. Unless otherwise specified, we use $A_i$ and $a_j$ to refer to the $i$-th row and the $j$-th column of $A$, respectively. For $1 \le p, q \le \infty$, we define an entrywise matrix norm
$$\|A\|_{q,p} = \bigg[\sum_{i=1}^n \bigg(\sum_{j=1}^m |a_{ij}|^q\bigg)^{p/q}\bigg]^{1/p}. \tag{1.9}$$
The notation is not to be confused with the $(q,p)$-induced norm, which is not used in the current paper. In words, we concatenate the $\ell_q$ norms of the row vectors of $A$ into an $n$-dimensional vector and then compute its $\ell_p$ norm. A special case is $\|A\|_{2,\infty} = \max_{i\in[n]}\|A_i\|_2$.

Define the sub-Gaussian norms $\|X\|_{\psi_2} = \sup_{p\ge 1} p^{-1/2}\mathbb{E}^{1/p}|X|^p$ for a random variable $X$ and $\|X\|_{\psi_2} = \sup_{\|u\|_2=1}\|\langle u, X\rangle\|_{\psi_2}$ for a random vector $X$. Denote by $\chi^2_n$ the $\chi^2$ distribution with $n$ degrees of freedom. $\overset{\mathbb{P}}{\to}$ represents convergence in probability.
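As a quick illustration of the entrywise norm (1.9), the small helper below (ours, written to match the row-first convention above) computes $\|A\|_{q,p}$ and checks two special cases.

```python
import numpy as np

def norm_qp(A, q, p):
    """Entrywise norm (1.9): the l_p norm of the vector of row-wise l_q norms."""
    row_norms = np.sum(np.abs(A) ** q, axis=1) ** (1.0 / q)
    if np.isinf(p):
        return row_norms.max()                 # ||A||_{q,infty}
    return np.sum(row_norms ** p) ** (1.0 / p)

A = np.arange(6, dtype=float).reshape(3, 2)
assert np.isclose(norm_qp(A, 2, 2), np.linalg.norm(A))            # Frobenius norm
assert np.isclose(norm_qp(A, 2, np.inf),
                  np.linalg.norm(A, axis=1).max())                # ||A||_{2,infty}
```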
In addition, we adopt the following convenient notations from Wang (2019) to make probabilistic statements compact. (In that reference, $O_{\mathbb{P}}(\cdot\,;\cdot)$ and $o_{\mathbb{P}}(\cdot\,;\cdot)$ appear as $\hat{O}_{\mathbb{P}}(\cdot\,;\cdot)$ and $\hat{o}_{\mathbb{P}}(\cdot\,;\cdot)$. For simplicity we drop their hats in this paper.)

Definition 1.1. Let $\{X_n\}_{n=1}^\infty$, $\{Y_n\}_{n=1}^\infty$ be two sequences of random variables and $\{r_n\}_{n=1}^\infty \subseteq (0, +\infty)$ be deterministic. We write $X_n = O_{\mathbb{P}}(Y_n; r_n)$ if there exists a constant $C_1 > 0$ such that $\forall C_2 > 0$, $\exists C' > 0$ and $N > 0$, s.t.
$$\mathbb{P}(|X_n| \ge C'|Y_n|) \le C_1 e^{-C_2 r_n}, \qquad \forall n \ge N.$$
We write $X_n = o_{\mathbb{P}}(Y_n; r_n)$ if $X_n = O_{\mathbb{P}}(w_n Y_n; r_n)$ holds for some deterministic sequence $\{w_n\}_{n=1}^\infty$ tending to zero.

Both the new notation $O_{\mathbb{P}}(\cdot\,;\cdot)$ and the conventional one $O_{\mathbb{P}}(\cdot)$ help avoid dealing with tons of unspecified constants in operations. Moreover, the former is more informative as it controls the convergence rate of exceptional probabilities. This is particularly useful when we take union bounds over a growing number of random variables. If $\{Y_n\}_{n=1}^\infty$ are positive and deterministic, then $X_n = O_{\mathbb{P}}(Y_n; 1)$ is equivalent to $X_n = O_{\mathbb{P}}(Y_n)$. Similar facts hold for $o_{\mathbb{P}}(\cdot\,;\cdot)$ as well.

Main results
Consider the signal-plus-noise model
$$x_i = \bar{x}_i + z_i \in \mathbb{R}^d, \qquad i \in [n]. \tag{2.1}$$
For simplicity, we assume that the signals $\{\bar{x}_i\}_{i=1}^n$ are deterministic and the noises $\{z_i\}_{i=1}^n$ are the only source of randomness. The results readily extend to the case where the signals are random but independent of the noises.

Define the hollowed Gram matrix $G \in \mathbb{R}^{n\times n}$ of the samples $\{x_i\}_{i=1}^n$ through $G_{ij} = \langle x_i, x_j\rangle\mathbf{1}_{\{i\ne j\}}$, and the Gram matrix $\bar{G} \in \mathbb{R}^{n\times n}$ of the signals $\{\bar{x}_i\}_{i=1}^n$ through $\bar{G}_{ij} = \langle\bar{x}_i, \bar{x}_j\rangle$. Denote the eigenvalues of $G$ by $\lambda_1 \ge \cdots \ge \lambda_n$ and their associated eigenvectors by $\{u_j\}_{j=1}^n$. Similarly, we define the eigenvalues $\bar\lambda_1 \ge \cdots \ge \bar\lambda_n$ and eigenvectors $\{\bar{u}_j\}_{j=1}^n$ of $\bar{G}$. Since $\bar{G} = \bar{X}\bar{X}^\top \succeq 0$, we have $\bar\lambda_j \ge 0$ for all $j \in [n]$. By convention, $\lambda_0 = \bar\lambda_0 = +\infty$ and $\lambda_{n+1} = \bar\lambda_{n+1} = -\infty$. Some groups of eigenvectors may only be defined up to orthonormal transforms as we allow for multiplicity of eigenvalues.

Let $s$ and $r$ be two integers in $[n]$ satisfying $0 \le s \le n - r$. Define $U = (u_{s+1}, \cdots, u_{s+r})$, $\bar{U} = (\bar{u}_{s+1}, \cdots, \bar{u}_{s+r})$, $\Lambda = \mathrm{diag}(\lambda_{s+1}, \cdots, \lambda_{s+r})$ and $\bar\Lambda = \mathrm{diag}(\bar\lambda_{s+1}, \cdots, \bar\lambda_{s+r})$. In order to study how $U$ relates to $\bar{U}$, we adopt the standard notion of eigen-gap (Davis and Kahan, 1970):
$$\bar\Delta = \min\{\bar\lambda_s - \bar\lambda_{s+1},\ \bar\lambda_{s+r} - \bar\lambda_{s+r+1}\}. \tag{2.2}$$
This is the separation between the set of target eigenvalues $\{\bar\lambda_{s+j}\}_{j=1}^r$ and the rest, reflecting the signal strength. Define $\kappa = \bar\lambda_1/\bar\Delta$, which plays the role of a condition number. Most importantly, we use a parameter $\gamma$ to characterize the signal-to-noise ratio, consider the asymptotic setting $n \to \infty$ and impose the following regularity assumptions.

Assumption 2.1 (Incoherence). As $n \to \infty$ we have $\kappa\mu\sqrt{r/n} \le \gamma \ll 1/(\kappa\mu)$, where
$$\mu = \max\bigg\{\frac{\|\bar{X}\|_{2,\infty}}{\|\bar{X}\|_2}\sqrt{\frac{n}{r}},\ 1\bigg\}.$$

Assumption 2.2 (Sub-Gaussianity). $\{z_i\}_{i=1}^n$ are independent, zero-mean random vectors in $\mathbb{R}^d$. There exist a constant $\alpha > 0$ and $\Sigma \succeq 0$ such that $\mathbb{E}e^{\langle u, z_i\rangle} \le e^{\alpha^2\langle\Sigma u, u\rangle/2}$ holds for all $u \in \mathbb{R}^d$ and $i \in [n]$.

Assumption 2.3 (Concentration). $\sqrt{n}\max\{(\kappa\|\Sigma\|_2/\bar\Delta)^{1/2},\ \|\Sigma\|_F/\bar\Delta\} \le \gamma$, where $\Sigma$ is as in Assumption 2.2.

By construction, $\bar{X} = (\bar{x}_1, \cdots, \bar{x}_n)^\top$ and $\|\bar{X}\|_{2,\infty} = \max_{i\in[n]}\|\bar{x}_i\|_2$. Assumption 2.1 regulates the magnitudes of $\{\|\bar{x}_i\|_2\}_{i=1}^n$, and it naturally holds under various mixture models. The incoherence parameter $\mu$ is similar to the usual definition (Candès and Recht, 2009) except for the facts that $\bar{X}$ does not have orthonormal columns and $r$ is not its rank. Assumption 2.2 is a standard one on sub-Gaussianity (Koltchinskii and Lounici, 2014). Here $\{z_i\}_{i=1}^n$ are independent but may not have identical distributions, which allows for heteroscedasticity. Assumption 2.3 governs the concentration of $G$ around its population version $\bar{G}$. To gain some intuition, we define $Z = (z_1, \cdots, z_n)^\top \in \mathbb{R}^{n\times d}$ and observe that
$$G = \mathcal{H}[(\bar{X} + Z)(\bar{X} + Z)^\top] = \mathcal{H}(\bar{X}\bar{X}^\top) + \mathcal{H}(\bar{X}Z^\top + Z\bar{X}^\top) + \mathcal{H}(ZZ^\top) = \bar{X}\bar{X}^\top + (\bar{X}Z^\top + Z\bar{X}^\top) + \mathcal{H}(ZZ^\top) - \bar{D},$$
where $\bar{D}$ is the diagonal part of $\bar{X}\bar{X}^\top + \bar{X}Z^\top + Z\bar{X}^\top$. Hence
$$\|G - \bar{G}\|_2 \le \|\bar{X}Z^\top + Z\bar{X}^\top\|_2 + \|\mathcal{H}(ZZ^\top)\|_2 + \max_{i\in[n]}|(\bar{X}\bar{X}^\top + \bar{X}Z^\top + Z\bar{X}^\top)_{ii}|.$$
The individual terms above are easy to work with. For instance, we may control $\|\mathcal{H}(ZZ^\top)\|_2$ using concentration bounds for random quadratic forms such as Hanson-Wright-type inequalities (Chen and Yang, 2018). The spectral norm and the Frobenius norm of $\Sigma$ collectively characterize the effective dimension of the noise distribution. That is the reason why Assumption 2.3 is formulated as it is. It turns out that Assumptions 2.1, 2.2 and 2.3 lead to a matrix concentration bound $\|G - \bar{G}\|_2 = O_{\mathbb{P}}(\gamma\bar\Delta; n)$, paving the way for the eigenvector analysis.

$\ell_{2,p}$ analysis of eigenspaces

Note that $\{u_{s+j}\}_{j=1}^r$ and $\{\bar{u}_{s+j}\}_{j=1}^r$ are only identifiable up to sign flips, and things become even more complicated if some eigenvalues are identical. To that end, we need to align $U$ with $\bar{U}$ using a certain orthonormal transform. Define $H = U^\top\bar{U} \in \mathbb{R}^{r\times r}$ and let $\tilde{U}\tilde\Lambda\tilde{V}^\top$ denote its singular value decomposition, where $\tilde{U}, \tilde{V} \in \mathbb{O}^{r\times r}$ and $\tilde\Lambda \in \mathbb{R}^{r\times r}$ is diagonal with nonnegative entries. The orthonormal matrix $\mathrm{sgn}(H) = \tilde{U}\tilde{V}^\top$ is the best rotation matrix that aligns $U$ with $\bar{U}$ and will play an important role throughout our analysis. Here $\mathrm{sgn}(\cdot)$ refers to the matrix sign function (Gross, 2011). In addition, define $Z = (z_1, \cdots, z_n)^\top \in \mathbb{R}^{n\times d}$ as the noise matrix. Recall that for $A \in \mathbb{R}^{n\times r}$ with row vectors $\{A_i\}_{i=1}^n$, the entrywise matrix norm is
$$\|A\|_{2,p} = \bigg(\sum_{i=1}^n \|A_i\|_2^p\bigg)^{1/p}.$$
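The matrix sign function used for alignment is a standard orthogonal-Procrustes step. A small sketch (ours) computes $\mathrm{sgn}(H)$ and verifies that it undoes an arbitrary rotation of the column space.

```python
import numpy as np

def matrix_sign(U, U_bar):
    """sgn(H) for H = U^T U_bar: the orthonormal factor of H aligning U with U_bar."""
    W, _, Vt = np.linalg.svd(U.T @ U_bar)
    return W @ Vt

# Example: U spans the same subspace as U_bar, with rotated/flipped columns.
rng = np.random.default_rng(2)
U_bar, _ = np.linalg.qr(rng.standard_normal((50, 3)))   # orthonormal columns
Q, _ = np.linalg.qr(rng.standard_normal((3, 3)))        # arbitrary orthogonal matrix
U = U_bar @ Q.T
S = matrix_sign(U, U_bar)
print(np.linalg.norm(U @ S - U_bar))                    # ~0: alignment succeeds
```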
Theorem 2.1. Suppose that Assumptions 2.1, 2.2 and 2.3 hold. As long as $1 \le p \lesssim (\mu\gamma)^{-2}$, we have
$$\|U\mathrm{sgn}(H) - G\bar{U}\bar\Lambda^{-1}\|_{2,p} = o_{\mathbb{P}}(\|\bar{U}\|_{2,p};\ p),$$
$$\|U\mathrm{sgn}(H) - [\bar{U} + \mathcal{H}(ZX^\top)\bar{U}\bar\Lambda^{-1}]\|_{2,p} = o_{\mathbb{P}}(\|\bar{U}\|_{2,p};\ p),$$
$$\|U\mathrm{sgn}(H)\|_{2,p} = O_{\mathbb{P}}(\|\bar{U}\|_{2,p};\ p).$$
In addition, if $\kappa^{3/2}\gamma \ll 1$, then
$$\|U\Lambda^{1/2}\mathrm{sgn}(H) - G\bar{U}\bar\Lambda^{-1/2}\|_{2,p} = o_{\mathbb{P}}(\|\bar{U}\|_{2,p}\|\bar\Lambda^{1/2}\|_2;\ p).$$

While $U$ is a highly nonlinear function of $G$, the first equation in Theorem 2.1 shows that it can be well-approximated by the linear form $G\bar{U}\bar\Lambda^{-1}$ up to an orthonormal transform. This can be understood from the hand-waving deduction:
$$U = GU\Lambda^{-1} \approx G\bar{U}\bar\Lambda^{-1}.$$
The second equation in Theorem 2.1 talks about the difference between $U$ and its population version $\bar{U}$. Ignoring the orthonormal transform $\mathrm{sgn}(H)$, we have that for a large fraction of $m \in [n]$, the following entrywise approximation holds:
$$U_m \approx [\bar{U} + \mathcal{H}(ZX^\top)\bar{U}\bar\Lambda^{-1}]_m = \bar{U}_m + \Big\langle z_m,\ \sum_{j\ne m} x_j\bar{U}_j\bar\Lambda^{-1}\Big\rangle. \tag{2.3}$$
If we keep $\{x_j\}_{j\ne m}$ fixed, then the spectral embedding $U_m$ of the $m$-th sample is roughly linear in $z_m$, or equivalently in $x_m$ itself. This relation is crucial for our analysis of spectral clustering algorithms. The third equation in Theorem 2.1 relates the delocalization property of $U$ to that of $\bar{U}$, showing that the mass of $U$ is spread out across its rows as long as $\bar{U}$ behaves in a similar way.

Many spectral methods use the rows of $U \in \mathbb{R}^{n\times r}$ to embed the samples $\{x_i\}_{i=1}^n \subseteq \mathbb{R}^d$ into $\mathbb{R}^r$ (Shi and Malik, 2000; Ng et al., 2002) and perform downstream tasks. By precisely characterizing the embedding, the first three equations in Theorem 2.1 facilitate their analysis under statistical models. We will see several examples in Section 3. In PCA, however, the embedding is defined by PC scores. Recall that the PCs are eigenvectors of the covariance matrix $\frac{1}{n}X^\top X \in \mathbb{R}^{d\times d}$ and PC scores are derived by projecting the data onto them. Therefore, the PC scores in our setting correspond to the rows of $U\Lambda^{1/2}$ rather than $U$. The last equation in Theorem 2.1 studies their behavior.

Theorem 2.1 is written so as to be easily applicable. It forms the basis of our applications in Section 3. General results under relaxed conditions are given by Theorem B.1.

Let us now gain some intuition about the $\ell_{2,p}$ error metric. For large $p$, $\|A\|_{2,p}$ is small if a vast majority of the rows have small $\ell_2$ norms, though there could be a few rows that are large. Roughly speaking, the number of those outliers is exponentially small in $p$. We illustrate this using a toy example with $r = 1$, i.e., $A = x \in \mathbb{R}^n$ is a vector and $\|\cdot\|_{2,p} = \|\cdot\|_p$. If $\|x\|_p \le \varepsilon\|\mathbf{1}_n\|_p$ for some $\varepsilon > 0$, then Markov's inequality yields
$$\frac{1}{n}|\{i:\ |x_i| > t\varepsilon\}| \le \frac{n^{-1}\|x\|_p^p}{(t\varepsilon)^p} \le \frac{n^{-1}\varepsilon^p\|\mathbf{1}_n\|_p^p}{(t\varepsilon)^p} = t^{-p}, \qquad \forall t > 0.$$
Thus, larger $p$ implies stronger bounds. In particular, the following easily-verified fact states that when $p \gtrsim \log n$, an upper bound in $\ell_{2,p}$ yields one in $\ell_{2,\infty}$, controlling all the row-wise errors simultaneously.

Fact 2.1. $\|x\|_\infty \le \|x\|_{c\log n} \le e^{1/c}\|x\|_\infty$ for any $n \in \mathbb{Z}_+$, $x \in \mathbb{R}^n$ and $c > 0$.
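For completeness, Fact 2.1 can be verified in one line: $\|x\|_\infty \le \|x\|_p$ holds for every $p \in [1, \infty)$, and
$$\|x\|_{c\log n}^{c\log n} = \sum_{i=1}^n |x_i|^{c\log n} \le n\|x\|_\infty^{c\log n} \quad\Longrightarrow\quad \|x\|_{c\log n} \le n^{\frac{1}{c\log n}}\|x\|_\infty = e^{1/c}\|x\|_\infty.$$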
As a corollary, we get $\ell_{2,\infty}$ approximation bounds for the eigenvectors when the signal is strong enough for us to take $p \gtrsim \log n$.

Corollary 2.1. Suppose that Assumptions 2.1, 2.2 and 2.3 hold. As long as $\mu\gamma \lesssim 1/\sqrt{\log n}$, we have
$$\|U\mathrm{sgn}(H) - G\bar{U}\bar\Lambda^{-1}\|_{2,\infty} = o_{\mathbb{P}}(\|\bar{U}\|_{2,\infty};\ \log n),$$
$$\|U\mathrm{sgn}(H) - [\bar{U} + \mathcal{H}(ZX^\top)\bar{U}\bar\Lambda^{-1}]\|_{2,\infty} = o_{\mathbb{P}}(\|\bar{U}\|_{2,\infty};\ \log n),$$
$$\|U\mathrm{sgn}(H)\|_{2,\infty} = O_{\mathbb{P}}(\|\bar{U}\|_{2,\infty};\ \infty).$$
In addition, if $\kappa^{3/2}\gamma \ll 1$, then
$$\|U\Lambda^{1/2}\mathrm{sgn}(H) - G\bar{U}\bar\Lambda^{-1/2}\|_{2,\infty} = o_{\mathbb{P}}(\|\bar{U}\|_{2,\infty}\|\bar\Lambda^{1/2}\|_2;\ \log n).$$

However, $p$ cannot be arbitrarily large in general. When the signal is weak, we may no longer be able to obtain uniform control of errors as above and should allow for exceptions. The $p$ in Theorem 2.1 may grow as fast as $(\mu\gamma)^{-2}$, which is a measure of the signal strength. This makes the results adaptive.

Kernel PCA and Hilbert spaces

Since $G \in \mathbb{R}^{n\times n}$ is constructed purely based on pairwise inner products of samples, the whole procedure can be extended to kernel settings. Here we briefly discuss kernel PCA (Schölkopf et al., 1997). Suppose that $\{x_i\}_{i=1}^n$ are samples from some space $\mathcal{X}$ and $K(\cdot,\cdot): \mathcal{X}\times\mathcal{X} \to \mathbb{R}$ is a symmetric and positive semi-definite kernel, i.e. for any $m \in \mathbb{Z}_+$ and $\{w_i\}_{i=1}^m \subseteq \mathcal{X}$, the matrix $M \in \mathbb{R}^{m\times m}$ with $M_{ij} = K(w_i, w_j)$ is always positive semi-definite. Kernel PCA is PCA based on a new Gram matrix $K \in \mathbb{R}^{n\times n}$ with $K_{ij} = K(x_i, x_j)$. PCA is a special case of kernel PCA with $\mathcal{X} = \mathbb{R}^d$ and $K(x, y) = x^\top y$. Commonly-used nonlinear kernels include the Gaussian kernel $K(x, y) = e^{-\eta\|x-y\|_2^2}$ with $\eta > 0$. They offer flexible nonlinear embedding techniques which have achieved great success in machine learning (Cristianini and Shawe-Taylor, 2000).

According to the Moore-Aronszajn Theorem (Aronszajn, 1950), there exists a reproducing kernel Hilbert space $\mathcal{H}$ with inner product $\langle\cdot,\cdot\rangle$ and a function $\phi: \mathcal{X} \to \mathcal{H}$ such that $K(x, y) = \langle\phi(x), \phi(y)\rangle$ for any $x, y \in \mathcal{X}$. Hence, kernel PCA of $\{x_i\}_{i=1}^n \subseteq \mathcal{X}$ is de facto PCA of the transformed data $\{\phi(x_i)\}_{i=1}^n \subseteq \mathcal{H}$. The transform $\phi$ can be rather complicated since $\mathcal{H}$ has infinite dimensions in general. Fortunately, the inner products $\{\langle\phi(x_i), \phi(x_j)\rangle\}$ in $\mathcal{H}$ can be conveniently computed in the original space $\mathcal{X}$.

Motivated by kernel PCA, we extend the basic setup to Hilbert spaces. Let $\mathcal{H}$ be a real separable Hilbert space with inner product $\langle\cdot,\cdot\rangle$, norm $\|\cdot\|$, and some orthonormal bases $\{h_j\}$.

Definition 2.1 (Basics of Hilbert spaces). A linear operator $A: \mathcal{H} \to \mathcal{H}$ is said to be bounded if its operator norm $\|A\|_{\mathrm{op}} = \sup_{\|u\|=1}\|Au\|$ is finite. Define $\mathcal{L}(\mathcal{H})$ as the collection of all bounded linear operators over $\mathcal{H}$. For any $A \in \mathcal{L}(\mathcal{H})$, we use $A^*$ to refer to its adjoint operator and let $\mathrm{Tr}(A) = \sum_j \langle Ah_j, h_j\rangle$. Define
$$\mathcal{S}_+(\mathcal{H}) = \{A \in \mathcal{L}(\mathcal{H}):\ A = A^*,\ \langle Ax, x\rangle \ge 0\ \forall x \in \mathcal{H}\ \text{and}\ \mathrm{Tr}(A) < \infty\}.$$
Any $A \in \mathcal{S}_+(\mathcal{H})$ is said to be positive semi-definite. We use $\|A\|_{\mathrm{HS}} = \sqrt{\mathrm{Tr}(A^*A)} = (\sum_j \|Ah_j\|^2)^{1/2}$ to refer to its Hilbert-Schmidt norm, and define $A^{1/2} \in \mathcal{S}_+(\mathcal{H})$ as the unique operator such that $A^{1/2}A^{1/2} = A$.

Remark 2.1. When $\mathcal{H} = \mathbb{R}^d$, we have $\mathcal{L}(\mathcal{H}) = \mathbb{R}^{d\times d}$, $\mathrm{Tr}(A) = \sum_{i=1}^d A_{ii}$, $\|\cdot\|_{\mathrm{op}} = \|\cdot\|_2$ and $\|\cdot\|_{\mathrm{HS}} = \|\cdot\|_F$. Further, $\mathcal{S}_+(\mathcal{H})$ consists of all $d\times d$ positive semi-definite matrices.

We now generalize model (2.1) to the following one in $\mathcal{H}$:
$$x_i = \bar{x}_i + z_i \in \mathcal{H}, \qquad i \in [n]. \tag{2.4}$$
When $\mathcal{H} = \mathbb{R}^d$, the data matrix $X = (x_1, \cdots, x_n)^\top \in \mathbb{R}^{n\times d}$ corresponds to a linear transform from $\mathbb{R}^d$ to $\mathbb{R}^n$. For a general $\mathcal{H}$, we can always define $X$ as a bounded linear operator from $\mathcal{H}$ to $\mathbb{R}^n$ through its action $h \mapsto (\langle x_1, h\rangle, \cdots, \langle x_n, h\rangle)^\top$. With slight abuse of notation, we formally write $X = (x_1, \cdots, x_n)^\top$, use $\|X\|_{\mathrm{op}}$ to refer to its norm, let $\|X\|_{2,\infty} = \max_{i\in[n]}\|x_i\|$, and do the same for $\bar{X}$ and $Z$. We generalize Assumptions 2.1, 2.2 and 2.3 accordingly.

Assumption 2.4 (Incoherence). As $n \to \infty$ we have $\kappa\mu\sqrt{r/n} \le \gamma \ll 1/(\kappa\mu)$, where
$$\mu = \max\bigg\{\frac{\|\bar{X}\|_{2,\infty}}{\|\bar{X}\|_{\mathrm{op}}}\sqrt{\frac{n}{r}},\ 1\bigg\}.$$

Assumption 2.5 (Sub-Gaussianity). $\{z_i\}_{i=1}^n$ are independent, zero-mean random vectors in $\mathcal{H}$. There exist a constant $\alpha > 0$ and an operator $\Sigma \in \mathcal{S}_+(\mathcal{H})$ such that $\mathbb{E}e^{\langle u, z_i\rangle} \le e^{\alpha^2\langle\Sigma u, u\rangle/2}$ holds for all $u \in \mathcal{H}$ and $i \in [n]$.

Assumption 2.6 (Concentration). $\sqrt{n}\max\{(\kappa\|\Sigma\|_{\mathrm{op}}/\bar\Delta)^{1/2},\ \|\Sigma\|_{\mathrm{HS}}/\bar\Delta\} \le \gamma$.

Again, Assumption 2.4 on incoherence holds for various mixture models. Assumption 2.5 appears frequently in the study of sub-Gaussianity in Hilbert spaces (Koltchinskii and Lounici, 2014). For kernel PCA, Assumption 2.5 automatically holds when the kernel is bounded, i.e. $K(x, x) \le C$ for some constant $C$. Assumption 2.6 naturally arises in the study of Gram matrices and quadratic forms in Hilbert spaces (Chen and Yang, 2018). The same results as in Theorem 2.1 continue to hold under Assumptions 2.4, 2.5 and 2.6. The proof is in Appendix C.1.
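Since only pairwise evaluations $K(x_i, x_j)$ are needed, a hollowed kernel PCA takes just a few lines. The sketch below (ours; Gaussian kernel on hypothetical synthetic mixture data) mirrors the construction of $G$ with the kernel matrix in place of $XX^\top$.

```python
import numpy as np

def gaussian_kernel(X, eta):
    """Kernel matrix K_ij = exp(-eta * ||x_i - x_j||^2) for the rows of X."""
    sq = np.sum(X ** 2, axis=1)
    dist2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    return np.exp(-eta * np.maximum(dist2, 0.0))

rng = np.random.default_rng(3)
n, d = 200, 20
y = rng.choice([-1.0, 1.0], size=n)
mu = 2.5 / np.sqrt(d) * np.ones(d)
X = np.outer(y, mu) + rng.standard_normal((n, d))

K = gaussian_kernel(X, eta=1.0 / d)
np.fill_diagonal(K, 0.0)                 # hollowing, exactly as for G = H(XX^T)
eigvals, eigvecs = np.linalg.eigh(K)
U = eigvecs[:, -2:]                      # top-2: roughly a near-constant direction
                                         # plus a direction aligned with the labels
acc = max(np.mean(np.sign(s * U[:, j]) == y)
          for j in range(2) for s in (-1.0, 1.0))
print("best label accuracy among top-2 eigenvectors:", acc)
```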
Clustering under mixture models

Sub-Gaussian and Gaussian mixture models serve as testbeds for clustering algorithms. Maximum likelihood estimation requires well-specified models and often involves non-convex or combinatorial optimization problems that are hard to solve. The recent years have seen a boom in the study of efficient approaches. Lloyd's algorithm (Lloyd, 1982) with good initialization and its variants are analyzed under certain separation conditions (Kumar and Kannan, 2010; Lu and Zhou, 2016; Ndaoud, 2018; Gao and Zhang, 2019). Semi-definite programming (SDP) yields reliable results in more general scenarios (Awasthi et al., 2015; Mixon et al., 2017; Royer, 2017; Fei and Chen, 2018; Giraud and Verzelen, 2018; Chen and Yang, 2018; Chen and Yang, 2020). Spectral methods are more efficient in terms of computation and have attracted much attention (Vempala and Wang, 2004; Cai and Zhang, 2018; Löffler et al., 2019; Srivastava et al., 2019). However, much less is known about spectral methods compared with SDP.

We apply the $\ell_p$ eigenvector analysis to spectral clustering under a popular sub-Gaussian mixture model in a Hilbert space $\mathcal{H}$. Suppose that
$$x_i = y_i\mu + z_i \in \mathcal{H}, \qquad i \in [n], \tag{3.1}$$
where $\{y_i\}_{i=1}^n \subseteq \{\pm 1\}$ are labels, $\mu \in \mathcal{H}$, and $\{z_i\}_{i=1}^n \subseteq \mathcal{H}$ are sub-Gaussian noise vectors satisfying Assumption 2.5. For simplicity, we assume that $\{y_i\}_{i=1}^n$ and $\mu$ are deterministic. Through a conditioning argument, the results extend to the case where they are independent of $\{z_i\}_{i=1}^n$. The model (3.1) is a natural model for a centered sub-Gaussian mixture with two equally-sized classes. Heteroscedasticity is allowed, as Assumption 2.5 only requires the covariance operators of $\{z_i\}_{i=1}^n$ to be uniformly bounded by some $\Sigma$; they are allowed to vary across $i$. The goal of clustering is to estimate $\{y_i\}_{i=1}^n$ based solely on $\{x_i\}_{i=1}^n$.

Under this model, we have $\bar{x}_i = y_i\mu$ and $\bar{X} = (\bar{x}_1, \cdots, \bar{x}_n)^\top = y\mu^\top$. For the population Gram matrix $\bar{G} = \bar{X}\bar{X}^\top = \|\mu\|^2 yy^\top$, the leading eigenvalue is $\bar\lambda_1 = n\|\mu\|^2$ and the leading eigenvector $\bar{u}_1 = y/\sqrt{n}$ perfectly reveals the class labels. As a result, $\mathrm{sgn}(u_1)$ is a natural estimator for the label vector $y$, where the sign function $\mathrm{sgn}$ is applied to all entries of $u_1$. This is a specific case of the spectral clustering algorithm.

To state our results for $\mathrm{sgn}(u_1)$ in a clean way, we define the signal-to-noise ratio
$$\mathrm{SNR} = \frac{\|\mu\|^2}{\|\Sigma\|_{\mathrm{op}}} \wedge \frac{n\|\mu\|^4}{\|\Sigma\|_{\mathrm{HS}}^2} \tag{3.2}$$
and the proportion of mismatch
$$\mathcal{M}(\hat{y}, y) = \min\bigg\{\frac{1}{n}\sum_{i=1}^n\mathbf{1}_{\{\hat{y}_i\ne y_i\}},\ \frac{1}{n}\sum_{i=1}^n\mathbf{1}_{\{-\hat{y}_i\ne y_i\}}\bigg\}, \qquad \forall\hat{y}, y \in \{\pm 1\}^n. \tag{3.3}$$

Theorem 3.1 (Error rate of spectral clustering). Under the model (3.1), there exist constants $C > 0$ and $c > 0$ such that the following hold:
1. If $\mathrm{SNR} > C\log n$, then $\lim_{n\to\infty}\mathbb{P}[\mathcal{M}(\mathrm{sgn}(u_1), y) = 0] = 1$;
2. If $1 \ll \mathrm{SNR} \le C\log n$, then $\limsup_{n\to\infty}\mathrm{SNR}^{-1}\log\mathbb{E}\mathcal{M}(\mathrm{sgn}(u_1), y) < -c$.

The proof is in Appendix D.2. Theorem 3.1 asserts that the spectral estimator $\mathrm{sgn}(u_1)$ exactly recovers all the labels with high probability when SNR exceeds some constant multiple of $\log n$. When SNR is not that large but still diverges, we have an exponential bound $e^{-\Omega(\mathrm{SNR})}$ for the misclassification rate.
It is worth mentioning that $\mathrm{SNR} \gg 1$ is necessary for achieving a vanishing misclassification rate in the isotropic Gaussian case (Cai and Zhang, 2018; Ndaoud, 2018). Hence Theorem 3.1 covers the whole regime that makes vanishing error possible. For the sub-Gaussian mixture model (3.1), these results are the best available in the literature and have only been established for SDP under sub-Gaussian mixture models in Euclidean spaces (Giraud and Verzelen, 2018) and Hilbert spaces (Chen and Yang, 2018). They are optimal up to the constants $C$ and $c$ in the isotropic case (Giraud and Verzelen, 2018; Ndaoud, 2018). While unspecified constants are inevitable due to the generality of sub-Gaussianity, we provide finer results for the Gaussian case in Section 3.2.

The characterization (3.2) of the signal-to-noise ratio was first proposed by Giraud and Verzelen (2018) and can be rewritten as
$$\mathrm{SNR} = \frac{\|\mu\|^2}{\|\Sigma\|_{\mathrm{op}}}\bigg[1 \wedge \bigg(\frac{\|\mu\|^2}{\|\Sigma\|_{\mathrm{op}}}\cdot\frac{n}{r(\Sigma)}\bigg)\bigg], \tag{3.4}$$
where $r(\Sigma) = \|\Sigma\|_{\mathrm{HS}}^2/\|\Sigma\|_{\mathrm{op}}^2$ captures the effective rank of $\Sigma$. In the isotropic case with $\Sigma = I$ and $\mathcal{H} = \mathbb{R}^d$, we have $r(\Sigma) = d$. The SNR differs from the classical notion of signal-to-noise ratio $\|\mu\|^2/\|\Sigma\|_{\mathrm{op}}$ frequently used to quantify the misclassification rates (Lu and Zhou, 2016; Fei and Chen, 2018; Löffler et al., 2019; Srivastava et al., 2019; Gao and Zhang, 2019). Those results hinge on an extra assumption
$$\frac{\|\mu\|^2}{\|\Sigma\|_{\mathrm{op}}} \gg \max\bigg\{1,\ \frac{r(\Sigma)}{n}\bigg\}, \tag{3.5}$$
or the one with $\gg$ replaced by $\gtrsim$. Under such an assumption, SNR in (3.4) is equivalent to $\|\mu\|^2/\|\Sigma\|_{\mathrm{op}}$. However, our assumption $\mathrm{SNR} \gg 1$ translates to
$$\frac{\|\mu\|^2}{\|\Sigma\|_{\mathrm{op}}} \gg \max\bigg\{1,\ \sqrt{\frac{r(\Sigma)}{n}}\bigg\}. \tag{3.6}$$
It is much weaker when the noise has high effective dimension $r(\Sigma) \gg n$. See Giraud and Verzelen (2018) for more discussions.

Gaussian mixture model

The symmetries and other structural properties of Gaussian mixture models allow for more precise characterizations compared to the above. While a main focus of interest is parameter estimation by likelihood-based methods (Dempster et al., 1977) and methods of moments (Pearson, 1894), the problem of clustering is less explored. Recently there has been a surge of interest in sharp statistical guarantees, mostly under the isotropic Gaussian mixture model (Lu and Zhou, 2016; Cai and Zhang, 2018; Ndaoud, 2018; Löffler et al., 2019; Chen and Yang, 2020). In another line of study, sparsity assumptions are adopted for high-dimensional Gaussian mixtures (Azizyan et al., 2013; Jin and Wang, 2016). In this subsection, we study the optimality of the spectral estimator $\mathrm{sgn}(u_1)$ under the following model.

Definition 3.1 (Gaussian mixture model). For $y \in \{\pm 1\}^n$ and $\mu \in \mathbb{R}^d$ with $n, d \ge 1$, we write $\{x_i\}_{i=1}^n \sim \mathrm{GMM}(\mu, y)$ if
$$x_i = y_i\mu + z_i \in \mathbb{R}^d, \qquad i \in [n],$$
and $\{z_i\}_{i=1}^n \subseteq \mathbb{R}^d$ are i.i.d. $N(0, I_d)$ vectors.

This is a special case of the sub-Gaussian mixture model (3.1). Taking $\Sigma = I_d$, we get $\|\Sigma\|_{\mathrm{op}} = 1$ and $\|\Sigma\|_{\mathrm{HS}} = \sqrt{d}$. The signal-to-noise ratio in (3.2) is then $\|\mu\|_2^2 \wedge (n\|\mu\|_2^4/d)$. Here we redefine it as
$$\mathrm{SNR} = \frac{\|\mu\|_2^4}{\|\mu\|_2^2 + d/n}. \tag{3.7}$$
It has the same order as the previous one and facilitates presentation. We keep using the mismatch proportion $\mathcal{M}$ in (3.3).
Theorem 3.2. Let $\{x_i\}_{i=1}^n \sim \mathrm{GMM}(\mu, y)$ and $n \to \infty$.
1. If $\mathrm{SNR} > (2 + \varepsilon)\log n$ for some constant $\varepsilon > 0$, then $\lim_{n\to\infty}\mathbb{P}[\mathcal{M}(\mathrm{sgn}(u_1), y) = 0] = 1$;
2. If $1 \ll \mathrm{SNR} \le 2\log n$, then $\limsup_{n\to\infty}\mathrm{SNR}^{-1}\log\mathbb{E}\mathcal{M}(\mathrm{sgn}(u_1), y) \le -1/2$.

Theorem 3.2 characterizes the spectral estimator with explicit constants. When SNR exceeds $2\log n$, $\mathrm{sgn}(u_1)$ exactly recovers all the labels (up to a global sign flip) with high probability. When $1 \ll \mathrm{SNR} \le 2\log n$, the misclassification rate is bounded from above by $e^{-\mathrm{SNR}/[2+o(1)]}$. According to Ndaoud (2018), both results are optimal in the minimax sense. The proof of Theorem 3.2 is in Appendix E.2.

Cai and Zhang (2018) prove that $\mathrm{SNR} \to \infty$ is necessary for any estimator to achieve a vanishingly small misclassification rate and derive an upper bound $\mathbb{E}\mathcal{M}(\mathrm{sgn}(\tilde{u}_1), y) \lesssim 1/\mathrm{SNR}$ for $\tilde{u}_1$ being the leading eigenvector of the unhollowed Gram matrix $XX^\top$. Ndaoud (2018) obtains exact recovery guarantees as well as an optimal exponential error bound for an iterative algorithm starting from $\mathrm{sgn}(u_1)$. Our analysis shows that the initial estimator is already good enough and no refinement is needed. Chen and Yang (2020) study the information threshold for exact recovery in the multi-class setting and use an SDP to achieve that.

The SNR in (3.7) precisely quantifies the signal-to-noise ratio for clustering and is always dominated by the classical one $\|\mu\|_2^2$. When $d \gg n$, the condition $\mathrm{SNR} \to \infty$ is equivalent to
$$\|\mu\|_2 \gg (d/n)^{1/4}. \tag{3.8}$$
This is weaker than the commonly-used assumption
$$\|\mu\|_2 \gg \sqrt{d/n} \tag{3.9}$$
for clustering (Lu and Zhou, 2016; Löffler et al., 2019), under which SNR is asymptotically equivalent to $\|\mu\|_2^2$. Their discrepancy reflects an interesting high-dimensional phenomenon. For the Gaussian mixture model in Definition 3.1, parameter estimation and clustering correspond to recovering $\mu \in \mathbb{R}^d$ and $y \in \{\pm 1\}^n$, respectively. A good estimate of $\mu$ yields one of $y$. Hence clustering should be easier than parameter estimation. The difference becomes more significant when $d \gg n$, as clustering targets fewer unknowns. To see this, we write $X = (x_1, \cdots, x_n)^\top \in \mathbb{R}^{n\times d}$ and observe that
$$X = y\mu^\top + Z,$$
where $Z = (z_1, \cdots, z_n)^\top \in \mathbb{R}^{n\times d}$ has i.i.d. $N(0,1)$ entries. Clustering and parameter estimation correspond to estimating the left and right singular vectors of the signal matrix $\mathbb{E}X$. According to the results of Cai and Zhang (2018) on singular subspace estimation, (3.8) and (3.9) are sharp conditions for consistent clustering and parameter estimation. They ensure concentration of the Gram matrix $XX^\top$ and the covariance matrix $\frac{1}{n}X^\top X$. When $(d/n)^{1/4} \ll \|\mu\|_2 \ll \sqrt{d/n}$, consistent clustering is possible even without consistent estimation of the model parameter $\mu$. Intuitively, there are many discriminative directions that can tell the classes apart, but they are not necessarily aligned with the direction of $\mu$.

Here we outline the proof of Theorem 3.2. When $\mathrm{SNR} \gg \log n$, the first part of Theorem 3.1 implies that $\mathbb{P}[\mathrm{sgn}(u_1) = \pm y] \to 1$. Hence it suffices to consider $1 \ll \mathrm{SNR} \lesssim \log n$. The following $\ell_p$ approximation result helps illustrate the main idea; its proof is deferred to Appendix E.3.
Theorem 3.3. Under the GMM in Definition 3.1 with $n \to \infty$ and $1 \ll \mathrm{SNR} \lesssim \log n$, there exist $\varepsilon_n \to 0$ and positive constants $C, N$ such that
$$\mathbb{P}\Big(\min_{s=\pm 1}\|s u_1 - G\bar{u}_1/\bar\lambda_1\|_{\mathrm{SNR}} < \varepsilon_n\|\bar{u}_1\|_{\mathrm{SNR}}\Big) > 1 - Ce^{-\mathrm{SNR}}, \qquad \forall n \ge N.$$

In a hand-waving way, the analysis right after (1.6) in the introduction suggests that the expected misclassification rate of $\mathrm{sgn}(u_1)$ differs from that of $\mathrm{sgn}(G\bar{u}_1/\bar\lambda_1)$ by at most $O(e^{-\mathrm{SNR}})$. Then, it boils down to studying $\mathrm{sgn}(G\bar{u}_1/\bar\lambda_1)$. Note that
$$(G\bar{u}_1/\bar\lambda_1)_i \propto (Gy)_i = \sum_{j=1}^n[\mathcal{H}(XX^\top)]_{ij}y_j = \sum_{j\ne i}\langle x_i, x_j\rangle y_j = (n-1)\langle x_i, \hat\mu^{(-i)}\rangle, \qquad \forall i \in [n].$$
Here $\hat\mu^{(-i)} = \frac{1}{n-1}\sum_{j\ne i}x_j y_j$ is an estimate of $\mu$ based on the samples $\{x_j\}_{j\ne i}$ and their labels $\{y_j\}_{j\ne i}$. It is straightforward to prove
$$\mathbb{E}\mathcal{M}(\mathrm{sgn}(G\bar{u}_1/\bar\lambda_1), y) = \frac{1}{n}\sum_{i=1}^n\mathbb{P}[\mathrm{sgn}(\langle x_i, \hat\mu^{(-i)}\rangle) \ne y_i] \le e^{-\mathrm{SNR}/[2+o(1)]}$$
and get the same bound for $\mathbb{E}\mathcal{M}(\mathrm{sgn}(u_1), y)$. When $\mathrm{SNR} > (2+\varepsilon)\log n$, this leads to an $n^{-(1+\varepsilon/2)}$ upper bound for the misclassification rate, which implies exact recovery with high probability, as any misclassified sample contributes $n^{-1}$ to the error rate. When $\mathrm{SNR} \le 2\log n$, we get the second part of Theorem 3.2. The proof is then finished.

The quantity $\mathrm{sgn}(\langle x_i, \hat\mu^{(-i)}\rangle)$ is the prediction of $y_i$ by linear discriminant analysis (LDA) given the features $\{x_i\}_{i=1}^n$ and the additional labels $\{y_j\}_{j\ne i}$. It resembles an oracle (or genie-aided) estimator that is usually linked to the fundamental limits of clustering (Abbe et al., 2016; Zhang and Zhou, 2016), and it plays an important role in our analysis as well. By connecting $u_1$ with $G\bar{u}_1/\bar\lambda_1$ and thus $\{\langle x_i, \hat\mu^{(-i)}\rangle\}_{i=1}^n$, Theorem 3.3 already hints at the optimality of $\mathrm{sgn}(u_1)$ for recovering $y$.

Perhaps surprisingly, both the (unsupervised) spectral clustering and the (supervised) LDA achieve the minimax optimal misclassification error $e^{-\mathrm{SNR}/[2+o(1)]}$. Here the missing labels do not hurt much. This phenomenon is also observed by Ndaoud (2018). On the other hand, the Bayes classifier $\mathrm{sgn}(\langle\mu, x\rangle)$ given the true parameter $\mu$ achieves the error rate $1 - \Phi(\|\mu\|_2)$, where $\Phi$ is the cumulative distribution function of $N(0,1)$. As $\|\mu\|_2 \to \infty$, this is $e^{-\|\mu\|_2^2/[2+o(1)]}$ and it is always superior to the minimax error without the knowledge of $\mu$. From there we get the following for spectral clustering and LDA.

- If $\|\mu\|_2 \gg \sqrt{d/n}$, then $\mathrm{SNR} = \|\mu\|_2^2[1 + o(1)]$ and both estimators achieve the Bayes error exponent;
- If $\|\mu\|_2 \le C\sqrt{d/n}$ for some constant $C > 0$, then $\mathrm{SNR} \le \|\mu\|_2^2/(1 + C^{-2})$ and both estimators achieve the minimax optimal exponent, which is worse than the Bayes error exponent.
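The proof strategy can be watched in action on synthetic data. The sketch below (ours; parameter values hypothetical) compares the fully unsupervised $\mathrm{sgn}(u_1)$ with the genie-aided LDA rule $\mathrm{sgn}(\langle x_i, \hat\mu^{(-i)}\rangle)$; their error rates are typically close, as Theorem 3.3 suggests.

```python
import numpy as np

rng = np.random.default_rng(4)
n, d, mu_norm = 400, 2000, 3.5          # SNR = ||mu||^4/(||mu||^2 + d/n) ~ 8.7
y = rng.choice([-1.0, 1.0], size=n)
mu = mu_norm / np.sqrt(d) * np.ones(d)
X = np.outer(y, mu) + rng.standard_normal((n, d))

# unsupervised: sign of the top eigenvector of the hollowed Gram matrix
G = X @ X.T
np.fill_diagonal(G, 0.0)
u1 = np.linalg.eigh(G)[1][:, -1]
err_spec = min(np.mean(np.sign(u1) != y), np.mean(np.sign(-u1) != y))

# genie-aided LDA: sgn(<x_i, mu_hat^(-i)>) with leave-one-out sample means;
# the positive factor 1/(n-1) does not affect the sign
total = X.T @ y                          # sum_j x_j y_j
scores = np.einsum("ij,ij->i", X, total[None, :] - X * y[:, None])
err_lda = np.mean(np.sign(scores) != y)

print(f"spectral clustering error: {err_spec:.3f}, genie-aided LDA error: {err_lda:.3f}")
```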
Contextual stochastic block model

Contextual network analysis concerns discovering interesting structures, such as communities, in a network with the help of node attributes. Large-scale applications call for computationally efficient procedures incorporating the information from both sources. For community detection in the contextual setting, various models and algorithms have been proposed and analyzed (Zhang et al., 2016; Weng and Feng, 2016; Binkiewicz et al., 2017; Ma and Ma, 2017; Deshpande et al., 2018; Mele et al., 2019; Yan and Sarkar, 2020). How to quantify the benefits of aggregation is a fundamental and challenging question. We study community detection under a canonical model for contextual network data and prove the optimality of a simple spectral method.

To begin with, we present a binary version of the stochastic block model (Holland et al., 1983) that plays a central role in statistical network analysis (Abbe, 2017). We use a label vector $y = (y_1, \cdots, y_n)^\top \in \{\pm 1\}^n$ to encode the block (community) memberships of the nodes. For any pair of nodes $i$ and $j$, we connect them with probability $\alpha$ if they are from the same block. Otherwise, the connection probability is $\beta$.

Definition 4.1 (Stochastic block model). For $n \in \mathbb{Z}_+$, $y \in \{\pm 1\}^n$ and $0 < \alpha, \beta < 1$, we write $A \sim \mathrm{SBM}(y, \alpha, \beta)$ if $A \in \{0,1\}^{n\times n}$ is symmetric, $A_{ii} = 0$ for all $i \in [n]$, and $\{A_{ij}\}_{1\le i<j\le n}$ are independent Bernoulli random variables with $\mathbb{P}(A_{ij} = 1) = \alpha$ if $y_i = y_j$ and $\mathbb{P}(A_{ij} = 1) = \beta$ if $y_i \ne y_j$.
The contextual stochastic block model couples a network drawn from the stochastic block model with Gaussian node attributes that share the same labels.

Definition 4.2 (Contextual stochastic block model). For $n, d \in \mathbb{Z}_+$, $0 < \alpha, \beta < 1$ and $R > 0$, we write $(y, \mu, A, \{x_i\}_{i=1}^n) \sim \mathrm{CSBM}(n, d, \alpha, \beta, R)$ if $y$ has i.i.d. Rademacher entries, $\mu \in \mathbb{R}^d$ satisfies $\|\mu\|_2^2 = R$, and conditionally on $(y, \mu)$ the network $A \sim \mathrm{SBM}(y, \alpha, \beta)$ and the attributes $\{x_i\}_{i=1}^n \sim \mathrm{GMM}(\mu, y)$ are independent.

Assumption 4.1 ($\log n$-regime). Let $a$, $b$ and $c$ be positive constants. $(y, \mu, A, \{x_i\}_{i=1}^n) \sim \mathrm{CSBM}(n, d, \alpha, \beta, R)$ with $q_n = \log n$, $\alpha = aq_n/n$, $\beta = bq_n/n$ and $R^2/(R + d/n) = cq_n$.

Assumption 4.2 (General regime). Let $a$, $b$ and $c$ be positive constants. $(y, \mu, A, \{x_i\}_{i=1}^n) \sim \mathrm{CSBM}(n, d, \alpha, \beta, R)$ with $1 \ll q_n \ll n$, $\alpha = aq_n/n$, $\beta = bq_n/n$ and $R^2/(R + d/n) = cq_n$.

On the one hand, Section 3.2 shows that the leading eigenvector $u_1$ of the hollowed Gram matrix $G = \mathcal{H}(XX^\top)$ is optimal for the Gaussian mixture model. From now on we rename it as $u_1(G)$ to avoid ambiguity. On the other hand, the second eigenvector $u_2(A)$ of $A$ estimates the labels under the stochastic block model (Abbe et al., 2017). To get some intuition, suppose that half of the entries of $\{y_i\}_{i=1}^n$ are $+1$'s and the others are $-1$'s, so that $\mathbf{1}_n^\top y = 0$. For such $y$, it is easy to see that
$$\mathbb{E}(A|y) = \frac{\alpha+\beta}{2}\mathbf{1}_n\mathbf{1}_n^\top + \frac{\alpha-\beta}{2}yy^\top \tag{4.1}$$
(up to a diagonal correction) and its second eigenvector $y/\sqrt{n}$ reveals the membership structure. Our estimator for the integrated problem is an aggregation of the two individual spectral estimators $u_2(A)$ and $u_1(G)$. Without loss of generality, we assume $\langle u_2(A), u_1(G)\rangle \ge 0$ to avoid cancellation.

We now begin the construction. The ideal 'estimator'
$$\hat{y}_i^{\mathrm{genie}} = \mathop{\mathrm{argmax}}_{y=\pm 1}\ \mathbb{P}(y_i = y\,|\,A, X, y_{-i})$$
is the best guess of $y_i$ given the network, the attributes, and the labels of all nodes except the $i$-th one. It is referred to as a genie-aided estimator or oracle estimator in the literature and is closely related to fundamental limits in clustering (Abbe et al., 2016; Zhang and Zhou, 2016); see Theorem F.3. To mimic $\hat{y}_i^{\mathrm{genie}}$, we first approximate its associated log odds ratio.
Lemma 4.1. Under Assumption 4.2, we have for each given $i$
$$\bigg|\log\bigg(\frac{\mathbb{P}(y_i = 1\,|\,A, X, y_{-i})}{\mathbb{P}(y_i = -1\,|\,A, X, y_{-i})}\bigg) - \bigg[\bigg(\log(a/b)A + \frac{2}{n + d/R}G\bigg)y\bigg]_i\bigg| = o_{\mathbb{P}}(q_n;\ q_n).$$

The $i$-th coordinate of $Ay$ corresponds to the log odds ratio $\log[\mathbb{P}(y_i = 1\,|\,A, y_{-i})/\mathbb{P}(y_i = -1\,|\,A, y_{-i})]$ for the stochastic block model (Abbe et al., 2016). From $A_{ii} = 0$ we see that $(Ay)_i = \sum_{j\ne i}A_{ij}y_j$ tries to predict the label $y_i$ via majority voting among the neighbors of node $i$. Similarly, $(Gy)_i$ relates to the log odds ratio $\log[\mathbb{P}(y_i = 1\,|\,X, y_{-i})/\mathbb{P}(y_i = -1\,|\,X, y_{-i})]$ for the Gaussian mixture model. The overall log odds ratio is linked to a linear combination of $Ay$ and $Gy$ thanks to the conditional independence between $A$ and $X$ in Definition 4.2. The proof of Lemma 4.1 can be found in Appendix F.2.

Intuitively, Lemma 4.1 reveals that
$$\mathrm{sgn}\bigg(\log(a/b)Ay + \frac{2}{n + d/R}Gy\bigg) \approx (\hat{y}_1^{\mathrm{genie}}, \cdots, \hat{y}_n^{\mathrm{genie}})^\top.$$
The left-hand side still involves the unknown parameters $a/b$, $R$ and $y$. Once these unknowns are consistently estimated, the substitution version of the left-hand side provides a valid estimator that mimics the genie-aided estimator well and hence is optimal. The heuristics of linear approximation in Theorem 3.3 above and in Abbe et al. (2017) suggest
$$u_2(A) \approx A\bar{u}/\bar\lambda_A \qquad\text{and}\qquad u_1(G) \approx G\bar{u}/\bar\lambda_G.$$
Here $\bar{u} = y/\sqrt{n}$; $\bar\lambda_A = n(\alpha-\beta)/2$ is the second largest (in absolute value) eigenvalue of $\mathbb{E}(A|y)$ when $\alpha \ne \beta$ and the two blocks are equally-sized; and $\bar\lambda_G = nR$ is the leading eigenvalue of $\bar{G} = \bar{X}\bar{X}^\top$. Hence
$$\log(a/b)Ay + \frac{2}{n + d/R}Gy \approx \log(a/b)\sqrt{n}\,\bar\lambda_A u_2(A) + \frac{2}{n + d/R}\sqrt{n}\,\bar\lambda_G u_1(G) \propto \frac{n(\alpha-\beta)}{2}\log\Big(\frac{\alpha}{\beta}\Big)u_2(A) + \frac{2R^2}{R + d/n}u_1(G), \tag{4.2}$$
which yields a linear combination of $u_2(A)$ and $u_1(G)$. The coefficient in front of $u_1(G)$ is twice the SNR in (3.2) for the Gaussian mixture model. Analogously, we may regard $\frac{n(\alpha-\beta)}{4}\log(\alpha/\beta)$ as a signal-to-noise ratio for the stochastic block model.

A legitimate estimator for $y$ is obtained by replacing the unknown parameters $\alpha$, $\beta$ and $R$ in (4.2) with their estimates. When the two classes are balanced, i.e. $y^\top\mathbf{1}_n = 0$, (4.1) yields $\lambda_1[\mathbb{E}(A|y)] = n(\alpha+\beta)/2$ and $\lambda_2[\mathbb{E}(A|y)] = n(\alpha-\beta)/2$. Here $\lambda_j(\cdot)$ denotes the $j$-th largest (in absolute value) eigenvalue of a real symmetric matrix. Hence,
$$\frac{n(\alpha-\beta)}{2}\log\Big(\frac{\alpha}{\beta}\Big) = \lambda_2[\mathbb{E}(A|y)]\log\bigg(\frac{\lambda_1[\mathbb{E}(A|y)] + \lambda_2[\mathbb{E}(A|y)]}{\lambda_1[\mathbb{E}(A|y)] - \lambda_2[\mathbb{E}(A|y)]}\bigg) \approx \lambda_2(A)\log\bigg(\frac{\lambda_1(A) + \lambda_2(A)}{\lambda_1(A) - \lambda_2(A)}\bigg).$$
It can be consistently estimated using the substitution principle. Similarly, using $\lambda_1(\bar{G}) = nR$, we have
$$\frac{2R^2}{R + d/n} = \frac{2[\lambda_1(\bar{G})/n]^2}{\lambda_1(\bar{G})/n + d/n} \approx \frac{2\lambda_1(G)^2}{n\lambda_1(G) + nd}.$$
These motivate our final estimator $\mathrm{sgn}(\hat{u})$ with
$$\hat{u} = \log\bigg(\frac{\lambda_1(A) + \lambda_2(A)}{\lambda_1(A) - \lambda_2(A)}\bigg)\lambda_2(A)u_2(A) + \frac{2\lambda_1(G)^2}{n\lambda_1(G) + nd}u_1(G). \tag{4.3}$$
Our estimator uses a weighted sum of the two individual estimators without any tuning parameter. Binkiewicz et al. (2017) propose a spectral method based on a weighted sum of the graph Laplacian matrix and $XX^\top$. Yan and Sarkar (2020) develop an SDP using a weighted sum of $A$ and a kernel matrix of $\{x_i\}_{i=1}^n$. Deshpande et al. (2018) study a belief propagation algorithm. Their settings and regimes are different from ours.
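A sketch of the full pipeline (ours; the sampled instance and parameter values are hypothetical, and the parameterization follows our reading of Assumption 4.1) simulates a CSBM in the $\log n$-regime and assembles $\hat{u}$ as in (4.3), with $\lambda_1(A), \lambda_2(A)$ taken largest in absolute value.

```python
import numpy as np

rng = np.random.default_rng(5)
n, d = 600, 1200
a, b, c_snr = 8.0, 2.0, 1.0             # I*(a, b, c) = ((sqrt(a)-sqrt(b))^2 + c)/2 = 1.5 > 1
q = np.log(n)
alpha, beta = a * q / n, b * q / n

# labels and attribute mean; R solves R^2/(R + d/n) = c_snr * q
y = rng.choice([-1.0, 1.0], size=n)
R = (c_snr * q + np.sqrt((c_snr * q) ** 2 + 4 * c_snr * q * d / n)) / 2
mu = np.sqrt(R / d) * np.ones(d)

# network A ~ SBM(y, alpha, beta); attributes x_i = y_i mu + z_i
P = np.where(np.outer(y, y) > 0, alpha, beta)
A = (rng.random((n, n)) < P).astype(float)
A = np.triu(A, 1); A = A + A.T                       # symmetric, zero diagonal
X = np.outer(y, mu) + rng.standard_normal((n, d))
G = X @ X.T; np.fill_diagonal(G, 0.0)

# leading eigen-pairs (ordered by absolute value for A)
wA, VA = np.linalg.eigh(A)
idx = np.argsort(-np.abs(wA))
lamA1, lamA2, u2A = wA[idx[0]], wA[idx[1]], VA[:, idx[1]]
wG, VG = np.linalg.eigh(G)
lamG1, u1G = wG[-1], VG[:, -1]
if u2A @ u1G < 0:                                    # sign alignment, as in the text
    u2A = -u2A

# aggregated estimator (4.3)
u_hat = (np.log((lamA1 + lamA2) / (lamA1 - lamA2)) * lamA2 * u2A
         + 2 * lamG1 ** 2 / (n * lamG1 + n * d) * u1G)
err = min(np.mean(np.sign(u_hat) != y), np.mean(np.sign(-u_hat) != y))
print("misclassification proportion:", err)
```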
There are very few theoretical results on the information gain from combining the network and the node attributes. Binkiewicz et al. (2017) and Yan and Sarkar (2020) derive upper bounds for the misclassification error that depend on both sources of information. However, those bounds are not tight and cannot rigorously justify the benefits. Deshpande et al. (2018) use techniques from statistical physics to derive an information threshold for obtaining an estimator that is better than random guessing in some regimes. The threshold is smaller than those for the stochastic block model and the Gaussian mixture model. Their calculation is done in the sparse regime where the maximum expected degree $n(\alpha+\beta)/2$ of the network remains bounded as $n$ goes to infinity. They obtain a formal proof by taking certain large-degree limits. To the best of our knowledge, the result below gives the first characterization of the information threshold for exact recovery and provides an efficient method achieving it by aggregating the two pieces of information.

We now investigate the aggregated spectral estimator (4.3) under the $\log n$-regime (Assumption 4.1). Our results show that $\mathrm{sgn}(\hat{u})$ achieves the information threshold for exact recovery as well as the optimal misclassification rate, both of which are better than those based on a single form of data in terms of the mismatch proportion $\mathcal{M}$ in (3.3). To state the results, define
$$I^*(a, b, c) = \frac{(\sqrt{a} - \sqrt{b})^2 + c}{2}. \tag{4.4}$$
Theorem 4.1. Let Assumption 4.1 hold and $a \ne b$. When $I^*(a, b, c) > 1$, we have
$$\lim_{n\to\infty}\mathbb{P}[\mathcal{M}(\mathrm{sgn}(\hat{u}), y) = 0] = 1.$$
When $I^*(a, b, c) < 1$, we have $\liminf_{n\to\infty}\mathbb{P}[\mathcal{M}(\hat{y}, y) > 0] > 0$ for any sequence of estimators $\hat{y} = \hat{y}_n(A, \{x_i\}_{i=1}^n)$.

Theorem 4.1 asserts that $I^*(a, b, c)$ quantifies the signal-to-noise ratio and that the phase transition of exact recovery takes place at $I^*(a, b, c) = 1$. When $c = 0$ (the node attributes are uninformative), we have $I^*(a, b, 0) = (\sqrt{a} - \sqrt{b})^2/2$; the threshold reduces to that for the stochastic block model ($\sqrt{a} - \sqrt{b} = \sqrt{2}$ by Abbe et al. (2016)). Similarly, when $a = b$ (the network is uninformative), we have $I^*(a, a, c) = c/2$; the threshold reduces to that for the Gaussian mixture model ($c = 2$ by Ndaoud (2018)). The relation (4.4) indicates that combining the two sources of information adds up the powers of each part. The proof of Theorem 4.1 is deferred to Appendix F.5.

Figure 1 presents numerical examples demonstrating the efficacy of our aggregated estimator $\mathrm{sgn}(\hat{u})$. The two experiments use $c = 0.$ and $c = 1.$, respectively. We fix $n = 500$, $d = 2000$ and vary $a$ ($y$-axis) and $b$ ($x$-axis) over a grid. For each parameter configuration $(a, b, c)$, we compute the frequency of exact recovery (i.e. $\mathrm{sgn}(\hat{u}) = \pm y$) over 100 independent runs. Light color represents a high chance of success. The red curves $(\sqrt{a} - \sqrt{b})^2 + c = 2$ correspond to the theoretical boundaries for phase transitions, which match the empirical results pretty well. Also, larger $c$ implies stronger signal in the node attributes and makes exact recovery easier.

Figure 1: Exact recovery for CSBM: $c = 0.$ (left) and $c = 1.$ (right).

When $I^*(a, b, c) < 1$, exact recovery of $y$ with high probability is no longer possible. In that case, we justify the benefits of aggregation using misclassification rates, by presenting an upper bound for $\mathrm{sgn}(\hat{u})$ as well as a matching lower bound for all possible estimators. Their proofs can be found in Appendices F.6 and F.7.
Theorem 4.2. Let Assumption 4.1 hold, $a \ne b$ and $I^*(a, b, c) \le 1$. We have
$$\limsup_{n\to\infty}\frac{\log\mathbb{E}\mathcal{M}(\mathrm{sgn}(\hat{u}), y)}{\log n} \le -I^*(a, b, c).$$
Theorem 4.3. Let Assumption 4.2 hold. For any sequence of estimators $\hat{y} = \hat{y}_n(A, \{x_i\}_{i=1}^n)$, we have
$$\liminf_{n\to\infty} q_n^{-1}\log\mathbb{E}\mathcal{M}(\hat{y}, y) \ge -I^*(a, b, c).$$

Theorems 4.2 and 4.3 imply that in the $\log n$-regime (Assumption 4.1), the aggregated spectral estimator $\mathrm{sgn}(\hat{u})$ achieves the optimal misclassification rate:
$$\mathbb{E}\mathcal{M}(\mathrm{sgn}(\hat{u}), y) = n^{-I^*(a,b,c)+o(1)}.$$
When $c = 0$, it reduces to the optimal rate $n^{-(\sqrt{a}-\sqrt{b})^2/2+o(1)}$ for the stochastic block model (Definition 4.1), and when $a = b$, it reduces to the optimal rate $n^{-c/2+o(1)}$ for the Gaussian mixture model (Definition 3.1). It is easy to show that these are achieved by $u_2(A)$ (Abbe et al., 2017) and $u_1(G)$ (Theorem 3.2), which are asymptotically equivalent to our aggregated estimator $\hat{u}$ in the extreme cases $c \to 0$ and $a \to b$, respectively. In other words, our result and procedure encompass those for the stochastic block model and the Gaussian mixture model as two specific examples.

While our lower bound $e^{-q_n[I^*(a,b,c)+o(1)]}$ for misclassification is proved under the general Assumption 4.2, our aggregated spectral estimator $\mathrm{sgn}(\hat{u})$ is only analyzed in the $\log n$-regime of Assumption 4.1. When the network becomes sparser ($q_n \ll \log n$), $A$ no longer concentrates (Feige and Ofek, 2005), the eigenvector analysis in Abbe et al. (2017) breaks down, and we do not have sharp characterizations of $u_2(A)$ anymore. However, the $\ell_p$ results for $u_1(G)$ in this paper continue to hold, and $\mathrm{sgn}[u_1(G)]$ faithfully recovers $y$. We conjecture that the estimator $\mathrm{sgn}(\tilde{u})$ with
$$\tilde{u} = \frac{1}{\sqrt{n}}\log\bigg(\frac{\lambda_1(A) + \lambda_2(A)}{\lambda_1(A) - \lambda_2(A)}\bigg)A\,\mathrm{sgn}[u_1(G)] + \frac{2\lambda_1(G)^2}{n\lambda_1(G) + nd}u_1(G) \tag{4.5}$$
achieves the lower bound $e^{-q_n[I^*(a,b,c)+o(1)]}$ for the misclassification rate even when $q_n \ll \log n$. The expression (4.5) is obtained by replacing $\lambda_2(A)u_2(A) = Au_2(A)$ in (4.3) with $A\,\mathrm{sgn}[u_1(G)]/\sqrt{n}$. Here $\mathrm{sgn}[u_1(G)]$ gives estimated labels based on $X$, and $A\,\mathrm{sgn}[u_1(G)]$ provides refined results using $A$.

Sketch of main proofs

To illustrate the key ideas behind the $\ell_p$ analysis in Theorem 2.1, we use a simple rank-1 model
$$x_i = \mu y_i + z_i \in \mathbb{R}^d, \qquad i \in [n], \tag{5.1}$$
where $y = (y_1, \cdots, y_n)^\top \in \{\pm 1\}^n$ and $\mu \in \mathbb{R}^d$ are deterministic; $\{z_i\}_{i=1}^n$ are independent and $z_i \sim N(0, \Sigma_i)$ for some $\Sigma_i \succ 0$. We further assume $\Sigma_i \preceq CI_d$ for all $i \in [n]$ and some constant $C > 0$.

Model (5.1) is a heteroscedastic version of the Gaussian mixture model in Definition 3.1. We have $\bar{x}_i = y_i\mu$, $\bar{X} = (\bar{x}_1, \cdots, \bar{x}_n)^\top = y\mu^\top$, $\bar{G} = \bar{X}\bar{X}^\top = \|\mu\|_2^2yy^\top$, $\bar\lambda_1 = n\|\mu\|_2^2$ and $\bar{u}_1 = y/\sqrt{n}$. For simplicity, we suppress the subscripts in $u_1$, $\bar{u}_1$, $\lambda_1$ and $\bar\lambda_1$. The goal is to show that for $p$ satisfying our technical condition,
$$\min_{c=\pm 1}\|cu - G\bar{u}/\bar\lambda\|_p = o_{\mathbb{P}}(\|\bar{u}\|_p;\ p). \tag{5.2}$$
For simplicity, we assume that $u$ is already aligned with $G\bar{u}/\bar\lambda$ so that the optimal $c$ above is $1$.

Benefits of hollowing

The hollowing procedure conducted on the Gram matrix has been commonly used in high-dimensional PCA and spectral methods (Koltchinskii and Giné, 2000; Montanari and Sun, 2018; Ndaoud, 2018; Cai et al., 2019). When the noises $\{z_i\}_{i=1}^n$ are strong and heteroscedastic, it drives $G$ closer to $\bar{G}$ and thus ensures a small angle between $u$ and $\bar{u}$. Such $\ell_2$ proximity is the starting point of our refined $\ell_p$ analysis.

Observe that
$$\langle x_i, x_j\rangle = \langle\bar{x}_i, \bar{x}_j\rangle + \langle\bar{x}_i, z_j\rangle + \langle z_i, \bar{x}_j\rangle + \langle z_i, z_j\rangle, \qquad \mathbb{E}\langle x_i, x_j\rangle = \langle\bar{x}_i, \bar{x}_j\rangle + \mathbb{E}\|z_i\|_2^2\,\mathbf{1}_{\{i=j\}}.$$
Hence the diagonal and off-diagonal entries of the Gram matrix behave differently. In the high-dimensional and heteroscedastic case, the difference in the noise levels $\{\mathbb{E}\|z_i\|_2^2\}_{i=1}^n$ could have a severe impact on the spectrum of the Gram matrix $XX^\top$. In particular, the following lemma shows that the leading eigenvector of $XX^\top$ could be asymptotically perpendicular to that of $\bar{X}\bar{X}^\top$, while $\mathcal{H}(XX^\top)$ is still faithful. The proof is in Appendix G.1.
Consider the model (5.1) with Σ = 2 I d and Σ = · · · = Σ n = I d . Let ˆ u and u be the leading eigenvectors of the Gram matrix XX (cid:62) and its hollowed version H ( XX (cid:62) ) . Suppose that n → ∞ and ( d/n ) / (cid:28) (cid:107) µ (cid:107) (cid:28) (cid:112) d/n . We have |(cid:104) ˆ u , ¯ u (cid:105)| P → and |(cid:104) u , ¯ u (cid:105)| P → . Figure 2 visualizes the entries of eigenvectors ¯ u (black), ˆ u (red) and u (blue) in a typicalrealization with n = 100 , d = 500 , (cid:107) µ (cid:107) = 3 and y = ( (cid:62) n/ , − (cid:62) n/ ) (cid:62) . The population eigen-vector ¯ u perfectly reveals class labels, and the eigenvector u of the hollowed Gram matrixis aligned with that. Without hollowing, the eigenvector ˆ u is localized due to heteroscedas-ticity and fails to recover the labels. The error rates of sgn( ˆ u ) and sgn( u ) are and ,respectively.With the help of hollowing, we obtain the following results on spectral concentration.See Appendix G.2 for the proof. Lemma 5.2.
Consider the model (5.1). When n → ∞ and (cid:107) µ (cid:107) (cid:29) max { , ( d/n ) / } , wehave (cid:107) G − ¯ G (cid:107) = o P (¯ λ ; n ) , | λ − ¯ λ | = o P (¯ λ ; n ) and min c = ± (cid:107) c u − ¯ u (cid:107) = o P (1; n ) . It is worth pointing out that hollowing inevitably creates bias as the diagonal informationof ¯ G is lost. Under incoherence conditions on the signals { ¯ x i } ni =1 (Assumption 2.1), this22ffect is under control. It becomes negligible when the noise is strong. While the simplehollowing already suffices for our need, general problems may benefit from more sophisticatedprocedures such as the heteroscedastic PCA in Zhang et al. (2018). p As hollowing has been shown to tackle heteroscedasticity, from now on we focus on thehomoscedastic case Σ = · · · = Σ n = I d to facilitate presentation. We want to approximate u with G ¯ u / ¯ λ . By definition, u = Gu /λ and (cid:107) u − G ¯ u / ¯ λ (cid:107) p = (cid:107) Gu /λ − G ¯ u / ¯ λ (cid:107) p ≤ (cid:107) G ( u − ¯ u ) (cid:107) p / | λ | + (cid:107) G ¯ u (cid:107) p | λ − − ¯ λ − | . The spectral concentration of G (Lemma 5.2) forces / | λ | = O P (¯ λ − ; n ) and | λ − − ¯ λ − | = o P (¯ λ − ; n ) . In order to get (5.2), it suffices to choose some p (cid:46) n such that (cid:107) G ( u − ¯ u ) (cid:107) p = o P (¯ λ (cid:107) ¯ u (cid:107) p ; p ) , (5.3) (cid:107) G ¯ u (cid:107) p = O P (¯ λ (cid:107) ¯ u (cid:107) p ; p ) . (5.4)The desired bound (5.4) sheds light on the choice of p . Let ¯ Z = ( z , · · · , z n ) (cid:62) andobserve that G = H ( XX (cid:62) ) = H [( ¯ X + Z )( ¯ X + Z ) (cid:62) ]= H ( ¯ X ¯ X (cid:62) ) + H ( ¯ XZ (cid:62) ) + H ( Z ¯ X (cid:62) ) + H ( ZZ (cid:62) ) . As an example, we show how to obtain (cid:107)H ( Z ¯ X (cid:62) ) ¯ u (cid:107) p = O P (¯ λ (cid:107) ¯ u (cid:107) p ; p ) . By Markov’sinequality, a convenient and sufficient condition is E /p (cid:107)H ( Z ¯ X (cid:62) ) ¯ u (cid:107) pp (cid:46) ¯ λ (cid:107) ¯ u (cid:107) p = n (cid:107) µ (cid:107) · n /p − / . (5.5)The facts [ H ( Z ¯ X (cid:62) )] ij = (cid:104) z i , y j µ (cid:105) { i (cid:54) = j } and ¯ u = y / √ n yield [ H ( Z ¯ X (cid:62) ) ¯ u ] i = (cid:88) j (cid:54) = i (cid:104) z i , y j µ (cid:105) y j / √ n = n − √ n (cid:104) z i , µ (cid:105) , ∀ i ∈ [ n ] . Note that { z i } ni =1 are i.i.d. N ( , I d ) random vectors, (cid:104) z i , µ (cid:105) ∼ N (0 , (cid:107) µ (cid:107) ) . By momentbounds for Gaussian distribution (Vershynin, 2010), sup q ≥ { q − / E /q |(cid:104) z i , µ (cid:105)| q } ≤ c (cid:107) µ (cid:107) for some constant c . Then E (cid:107)H ( Z ¯ X (cid:62) ) ¯ u (cid:107) pp = n (cid:88) i =1 E | [ H ( Z ¯ X (cid:62) ) ¯ u ] i | p ≤ n ( c (cid:107) µ (cid:107) √ np ) p . p (cid:46) (cid:107) µ (cid:107) . Hence p cannot be arbitrarily large. Momentbounds are used throughout the proof. The final choice of p depends on the most stringentcondition.Moments bounds are natural choices for obtaining (cid:96) p control and they can adapt to thesignal strength. As a comparison, the (cid:96) ∞ analysis in Abbe et al. (2017) targets quantitieslike (cid:107) G ¯ u (cid:107) ∞ by first applying concentration inequalities to each entry and then taking unionbounds. Such uniform control clearly requires stronger signal. Finally we come to (5.3). Let G i denote the i -th row of G . By definition, (cid:107) G ( u − ¯ u ) (cid:107) p = (cid:18) n (cid:88) i =1 | G i ( u − ¯ u ) | p (cid:19) /p . We need to study | G i ( u − ¯ u ) | for each individual i ∈ [ n ] . 
By Cauchy-Schwarz inequality,the upper bound | G i ( u − ¯ u ) | ≤ (cid:107) G i (cid:107) (cid:107) u − ¯ u (cid:107) always holds. Unfortunately, it is too large to be used. We should resort to probabilisticanalysis for tighter control.For any i ∈ [ n ] , we construct a new data matrix X ( i ) = ( x , · · · , x i − , , x i +1 , · · · , x n ) (cid:62) = ( I n − e i e (cid:62) i ) X by deleting the i -th sample. Then G i = ( (cid:104) x i , x (cid:105) , · · · , (cid:104) x i , x i − (cid:105) , , (cid:104) x i , x i +1 (cid:105) , · · · , (cid:104) x i , x n (cid:105) ) = x (cid:62) i X ( i ) (cid:62) , G i ( u − ¯ u ) = (cid:104) x i , X ( i ) (cid:62) ( u − ¯ u ) (cid:105) . Recall that u is the eigenvector of the whole matrix G constructed by n independent samples.It should not depend too much on any individual x i . Also, X ( i ) (cid:62) is independent of x i . Hencethe dependence between x i and X ( i ) (cid:62) ( u − ¯ u ) is weak. We would like to invoke sub-Gaussianconcentration inequalities to control their inner product.To decouple them in a rigorous way, we construct leave-one-out auxiliaries { G ( i ) } ni =1 ⊆ R n × n where G ( i ) = H ( X ( i ) X ( i ) (cid:62) ) = H [( I − e i e (cid:62) i ) XX (cid:62) ( I − e i e (cid:62) i )] is the hollowed Gram matrix of the dataset { x , · · · , x i − , , x i +1 , · · · , x n } with x i zeroedout. Equivalently, G ( i ) is obtained by zeroing out the i -th row and column of G . Let u ( i ) bethe leading eigenvector of G ( i ) . Then | G i ( u − ¯ u ) | = |(cid:104) x i , X ( i ) (cid:62) ( u − ¯ u ) (cid:105)| ≤ |(cid:104) x i , X ( i ) (cid:62) ( u ( i ) − ¯ u ) (cid:105)| (cid:124) (cid:123)(cid:122) (cid:125) ε + |(cid:104) x i , X ( i ) (cid:62) ( u − u ( i ) ) (cid:105)| (cid:124) (cid:123)(cid:122) (cid:125) ε .
24e have the luxury of convenient concentration inequalities for ε as x i and X ( i ) (cid:62) ( u ( i ) − ¯ u ) are completely independent. In addition, we can safely apply the Cauchy-Schwarz inequalityto ε because u ( i ) should be very similar to u .The leave-one-out technique is a powerful tool in random matrix theory (Erdős et al.,2009) and high-dimensional statistics (Javanmard and Montanari, 2018; El Karoui, 2018).Zhong and Boumal (2018), Abbe et al. (2017) and Chen et al. (2017) apply it to (cid:96) ∞ eigenvec-tors analysis of Wigner-type random matrices. Here we focus on (cid:96) p analysis of Wishart-typematrices with dependent entries. We conduct a novel (cid:96) p analysis of PCA and establish linear approximations of eigenvectors.The results yield optimality guarantees for spectral clustering in several challenging prob-lems. Meanwhile, this study leads to new research directions that are worth exploring. First,we hope to extend the analysis from Wishart-type matrices to more general random matri-ces. One example is the normalized Laplacian matrix frequently used in spectral clustering.Second, our general results hold for Hilbert spaces and they are potentially useful in thestudy of kernel PCA, such as quantifying the performances of different kernels. Third, thelinearization of eigenvectors provides tractable characterizations of spectral embedding thatserve as the starting point of statistical inference. Last but not least, while we focus onsymmetric and binary clustering applications for simplicity, it would be nice to generalizethe results to multi-class and imbalanced settings. That is of great practical importance.25 Useful facts
Here we list some elementary results about operations using the new notations O P ( · ; · ) and o P ( · ; · ) . Most of them can be found in Wang (2019). Fact A.1.
The following two statements hold.1. X n = O P ( Y n ; r n ) is equivalent to the following: there exist positive constants C , C and N , a non-decreasing function f : [ C , + ∞ ) → (0 , + ∞ ) satisfying lim x → + ∞ f ( x ) = + ∞ ,and a positive deterministic sequence { R n } ∞ n =1 tending to infinity such that P ( | X n | ≥ t | Y n | ) ≤ C e − r n f ( t ) , ∀ n ≥ N, C ≤ t ≤ R n .
2. When X n = o P ( Y n ; r n ) , we have lim n →∞ r − n log P ( | X n | ≥ c | Y n | ) = −∞ for any constant c > . Here we adopt the convention log 0 = −∞ . Fact A.2 (Truncation) . If X n {| Z n |≤| W n |} = O P ( Y n ; r n ) and Z n = o P ( W n ; r n ) , then X n = O P ( Y n ; r n ) . Fact A.2 directly follows from Fact A.1 above and Lemma 4 in Wang (2019).
Fact A.3. If E /r n | X n | r n (cid:46) Y n or E /r n | X n | r n (cid:28) Y n for deterministic Y n , then X n = O P ( Y n ; r n ) or X n = o P ( Y n ; r n ) , respectively. Fact A.4 (Lemma 2 in Wang (2019)) . If X n = O P ( Y n ; r n ) and W n = O P ( Z n ; s n ) , then X n + W n = O P ( | Y n | + | Z n | ; r n ∧ s n ) ,X n W n = O P ( Y n Z n ; r n ∧ s n ) . Fact A.5 (Lemma 3 in Wang (2019)) . We have the followings:1. if X n = O P ( Y n ; r n ) , then | X n | α = O P ( | Y n | α ; r n ) for any α > ;2. if X n = o P (1; r n ) , then f ( X n ) = o P (1; r n ) for any f : R → R that is continuous at . Definition A.1 (A uniform version of O P ( · , · ) ) . Let { Λ n } ∞ n =1 be a sequence of finite in-dex sets. For any n ≥ , { X nλ } λ ∈ Λ n , { Y nλ } λ ∈ Λ n are two collections of random variables; { r nλ } λ ∈ Λ n ⊆ (0 , + ∞ ) are deterministic. We write { X nλ } λ ∈ Λ n = O P ( { Y nλ } λ ∈ Λ n ; { r nλ } λ ∈ Λ n ) (A.1) if there exist positive constants C , C and N , a non-decreasing function f : [ C , + ∞ ) → (0 , + ∞ ) satisfying lim x → + ∞ f ( x ) = + ∞ , and a positive deterministic sequence { R n } ∞ n =1 tending to infinity such that P ( | X n | ≥ t | Y n | ) ≤ C e − r n f ( t ) , ∀ n ≥ N, C ≤ t ≤ R n . When Y nλ = Y n and/or r nλ = r n for all n and λ , we may replace { Y nλ } λ ∈ Λ n and/or { r nλ } λ ∈ Λ n in (A.1) by Y n and/or r n for simplicity. Fact A.6. If r n (cid:38) log | Λ n | , then { X nλ } λ ∈ Λ n = O P ( { Y nλ } λ ∈ Λ n ; r n ) implies that max λ ∈ Λ n | X nλ | = O P (max λ ∈ Λ n Y nλ ; r n ) . More on (cid:96) ,p analysis of eigenspaces In this section, we provide a generalized version of Theorem 2.1 and its proof. Instead ofAssumption 2.4, we use a weaker version of that (Assumption B.1) at the cost of a morenested regularity condition for p = p n (Assumption B.2). Assumptions 2.5 and 2.6 are stillin use. Assumption B.1 (Incoherence) . n → ∞ and (cid:107) ¯ G (cid:107) , ∞ / ¯∆ ≤ γ (cid:28) /κ . Assumption B.2 (Regularity of p = p n ) . √ np (cid:107) ¯ X Σ / (cid:107) ,p (cid:46) ¯∆ (cid:107) ¯ U (cid:107) ,p and n /p √ rp max {(cid:107) Σ (cid:107) HS , √ n (cid:107) Σ (cid:107) op } (cid:46) ¯∆ (cid:107) ¯ U (cid:107) ,p . Theorem B.1.
Let Assumptions 2.5, 2.6, B.1 and B.2 hold. We have (cid:107) U sgn( H ) (cid:107) ,p = O P (cid:0) (cid:107) ¯ U (cid:107) ,p + γ ¯∆ − (cid:107) ¯ G (cid:107) ,p ; p ∧ n (cid:1) , (cid:107) U sgn( H ) − G ¯ U ¯ Λ − (cid:107) ,p = O P (cid:0) κγ (cid:107) ¯ U (cid:107) ,p + γ ¯∆ − (cid:107) ¯ G (cid:107) ,p ; p ∧ n (cid:1) , (cid:107) U Λ / sgn( H ) − G ¯ U ¯ Λ − / (cid:107) ,p = O P ( κ / γ ¯∆ / (cid:107) ¯ U (cid:107) ,p + κ / γ ¯∆ − / (cid:107) ¯ G (cid:107) ,p ; p ∧ n ) . B.1 Proof of Theorem B.1
The following lemmas provide useful intermediate results, whose proofs can be found inSections B.2 and B.3.
Lemma B.1.
Let Assumptions 2.5, 2.6 and B.1 hold. We have (cid:107) G − ¯ G (cid:107) = O P ( γ ¯∆; n ) , (cid:107) Λ − ¯ Λ (cid:107) = O P ( γ ¯∆; n ) and (cid:107) U U (cid:62) − ¯ U ¯ U (cid:62) (cid:107) = O P ( γ ; n ) . Lemma B.2.
Let Assumptions 2.5, 2.6, B.1 and B.2 hold. We have (cid:107) G ¯ U − ¯ U ¯ Λ − H ( ZX (cid:62) ) ¯ U (cid:107) ,p = ( γ + (cid:112) r/n ) O P ( ¯∆ (cid:107) ¯ U (cid:107) ,p ; p ) , (cid:107) G ¯ U ¯ Λ − (cid:107) ,p = O P ( (cid:107) ¯ U (cid:107) ,p ; p ∧ n ) . We now prove Theorem B.1. Let ¯ γ = (cid:107) G − ¯ G (cid:107) / ¯∆ . It follows from Lemma 1 in Abbeet al. (2017) that when ¯ γ ≤ / , (cid:107) U H − G ¯ U ¯ Λ − (cid:107) ,p ≤ γ ¯∆ − (cid:107) G ¯ U (cid:107) ,p + 2 ¯∆ − (cid:107) G ( U H − ¯ U ) (cid:107) ,p . By Lemma B.1 and γ → in Assumption B.1, ¯ γ = O P ( γ ; n ) = o P (1; n ) . Lemma B.2 assertsthat (cid:107) G ¯ U (cid:107) ,p ≤ (cid:107) G ¯ U ¯ Λ − (cid:107) ,p (cid:107) ¯ Λ (cid:107) = O P ( κ ¯∆ (cid:107) ¯ U (cid:107) ,p ; p ∧ n ) , respectively. Hence (cid:107) U H − G ¯ U ¯ Λ − (cid:107) ,p = O P ( κγ (cid:107) ¯ U (cid:107) ,p ; p ∧ n ) + (cid:107) G ( U H − ¯ U ) (cid:107) ,p O P ( ¯∆ − ; n ) , (B.1) (cid:107) U H (cid:107) ,p ≤ (cid:107) G ¯ U ¯ Λ − (cid:107) ,p + (cid:107) U H − G ¯ U ¯ Λ − (cid:107) ,p = O P ( (cid:107) ¯ U (cid:107) ,p ; p ∧ n )+ (cid:107) G ( U H − ¯ U ) (cid:107) ,p O P ( ¯∆ − ; n ) . (B.2)We construct leave-one-out auxiliaries { G ( m ) } nm =1 ⊆ R n × n where G ( m ) is obtained byzeroing out the m -th row and column of G . Mathematically, we define a new data matrix X ( m ) = ( x , · · · , x m − , , x m +1 , · · · , x n ) (cid:62) = ( I n − e m e (cid:62) m ) X
27y deleting the m -th sample and G ( m ) = H ( X ( m ) X ( m ) (cid:62) ) = H [( I n − e m e (cid:62) m ) XX (cid:62) ( I n − e m e (cid:62) m )] . Let { u ( m ) j } nj =1 be the eigenvectors of G ( m ) , U ( m ) = ( u ( m ) s +1 , · · · , u ( m ) s + r ) ∈ R n × r and H ( m ) = U ( m ) (cid:62) ¯ U . The construction is also used by Abbe et al. (2017) in entrywise eigenvectoranalysis.By Minkowski’s inequality, (cid:107) G ( U H − ¯ U ) (cid:107) ,p ≤ (cid:18) n (cid:88) m =1 [ (cid:107) G m ( U H − U ( m ) H ( m ) ) (cid:107) + (cid:107) G m ( U ( m ) H ( m ) − ¯ U ) (cid:107) ] p (cid:19) /p ≤ (cid:18) n (cid:88) m =1 (cid:107) G m ( U H − U ( m ) H ( m ) ) (cid:107) p (cid:19) /p + (cid:18) n (cid:88) m =1 (cid:107) G m ( U ( m ) H ( m ) − ¯ U ) (cid:107) p (cid:19) /p . (B.3)The first term on the right hand side of (B.3) corresponds to leave-one-out perturbations.When max {(cid:107) ¯ G (cid:107) , ∞ , (cid:107) G − ¯ G (cid:107) } κ ≤ ¯∆ / , Lemma 3 in Abbe et al. (2017) forces (cid:107) U U (cid:62) − U ( m ) ( U ( m ) ) (cid:62) (cid:107) ≤ κ (cid:107) ( U H ) m (cid:107) , ∀ m ∈ [ n ] , max m ∈ [ n ] (cid:107) U ( m ) H ( m ) − ¯ U (cid:107) ≤ {(cid:107) ¯ G (cid:107) , ∞ , (cid:107) G − ¯ G (cid:107) } / ¯∆ . The fact (cid:107) ¯ G (cid:107) , ∞ ≤ γ ¯∆ , the result (cid:107) G − ¯ G (cid:107) = O P ( γ ¯∆; n ) in Lemma B.1, and AssumptionB.1 imply that (cid:107) G (cid:107) , ∞ ≤ (cid:107) ¯ G (cid:107) , ∞ + (cid:107) G − ¯ G (cid:107) = O P ( γ ¯∆; n ) , (cid:18) n (cid:88) m =1 (cid:107) U U (cid:62) − U ( m ) ( U ( m ) ) (cid:62) (cid:107) p (cid:19) /p = O P ( κ (cid:107) U H (cid:107) ,p ; n ) , max m ∈ [ n ] (cid:107) U ( m ) H ( m ) − ¯ U (cid:107) = O P ( γ ; n ) . (B.4)The definitions H = U (cid:62) ¯ U and H ( m ) = ( U ( m ) ) (cid:62) ¯ U yield (cid:107) U H − U ( m ) H ( m ) (cid:107) = (cid:107) ( U U (cid:62) − U ( m ) ( U ( m ) ) (cid:62) ) ¯ U (cid:107) ≤ (cid:107) U U (cid:62) − U ( m ) ( U ( m ) ) (cid:62) (cid:107) . Based on these estimates, (cid:18) n (cid:88) m =1 (cid:107) G m ( U H − U ( m ) H ( m ) ) (cid:107) p (cid:19) /p ≤ (cid:107) G (cid:107) , ∞ (cid:18) n (cid:88) m =1 (cid:107) U H − U ( m ) H ( m ) (cid:107) p (cid:19) /p ≤ (cid:107) G (cid:107) , ∞ (cid:18) n (cid:88) m =1 (cid:107) U U (cid:62) − U ( m ) ( U ( m ) ) (cid:62) (cid:107) p (cid:19) /p = O P ( κγ ¯∆ (cid:107) U H (cid:107) ,p ; n )= O P ( κγ ¯∆ (cid:107) ¯ U (cid:107) ,p ; p ∧ n ) + O P ( κγ (cid:107) G ( U H − ¯ U ) (cid:107) ,p ; n ) . (B.5)The last equality follows from (B.2). We use (B.3), (B.5) and κγ = o (1) from AssumptionB.1 to derive (cid:107) G ( U H − ¯ U ) (cid:107) ,p ≤ (cid:18) n (cid:88) m =1 (cid:107) G m ( U ( m ) H ( m ) − ¯ U ) (cid:107) p (cid:19) /p + O P ( κγ ¯∆ (cid:107) ¯ U (cid:107) ,p ; p ∧ n ) .
28y plugging this into (B.1) and (B.2) and using κγ = o (1) , we obtain that (cid:107) U H − G ¯ U ¯ Λ − (cid:107) ,p = O P ( κγ (cid:107) ¯ U (cid:107) ,p ; p ∧ n ) + (cid:18) n (cid:88) m =1 (cid:107) G m ( U ( m ) H ( m ) − ¯ U ) (cid:107) p (cid:19) /p O P ( ¯∆ − ; n ) , (B.6) (cid:107) U H (cid:107) ,p = O P ( (cid:107) ¯ U (cid:107) ,p ; p ∧ n ) + (cid:18) n (cid:88) m =1 (cid:107) G m ( U ( m ) H ( m ) − ¯ U ) (cid:107) p (cid:19) /p O P ( ¯∆ − ; n ) . (B.7)We now control the second term in (B.3). From the decompositions G = H [( ¯ X + Z )( ¯ X + Z ) (cid:62) ] = H ( ¯ X ¯ X (cid:62) + ¯ XZ (cid:62) + Z ¯ X (cid:62) ) + H ( ZZ (cid:62) ) , we have (cid:18) n (cid:88) m =1 (cid:107) G m ( U ( m ) H ( m ) − ¯ U ) (cid:107) p (cid:19) /p ≤ (cid:107)H ( ¯ X ¯ X (cid:62) + ¯ XZ (cid:62) + Z ¯ X (cid:62) ) (cid:107) ,p max m ∈ [ n ] (cid:107) U ( m ) H ( m ) − ¯ U (cid:107) + (cid:18) n (cid:88) m =1 (cid:107) [ H ( ZZ (cid:62) )] m ( U ( m ) H ( m ) − ¯ U ) (cid:107) p (cid:19) /p . (B.8)We now work on the first term on the right hand side of (B.8). Define M ∈ R n × n through M ij = (cid:107) ( ¯ XZ (cid:62) ) ij (cid:107) ψ . Then E M ij = 0 and M ij = (cid:107)(cid:104) ¯ x i , z j (cid:105)(cid:107) ψ (cid:46) (cid:107) Σ / ¯ x i (cid:107) , where (cid:46) onlyhides a universal constant. (cid:107) M (cid:107) ,p = (cid:20) n (cid:88) i =1 (cid:18) n (cid:88) j =1 | M ij | (cid:19) p/ (cid:21) /p (cid:46) (cid:20) n (cid:88) i =1 (cid:18) n (cid:88) j =1 (cid:107) Σ / ¯ x i (cid:107) (cid:19) p/ (cid:21) /p = √ n (cid:107) ¯ X Σ / (cid:107) ,p , (cid:107) M (cid:62) (cid:107) ,p = (cid:20) n (cid:88) j =1 (cid:18) n (cid:88) i =1 | M ij | (cid:19) p/ (cid:21) /p (cid:46) (cid:20) n (cid:88) j =1 (cid:18) n (cid:88) i =1 (cid:107) Σ / ¯ x i (cid:107) (cid:19) p/ (cid:21) /p = n /p (cid:107) ¯ X Σ / (cid:107) , ≤ √ n (cid:107) ¯ X Σ / (cid:107) ,p . By Lemma H.3 and p ≥ , (cid:107) ¯ XZ (cid:62) (cid:107) ,p = O P ( √ p (cid:107) M (cid:107) ,p ; p ) = O P ( √ np (cid:107) ¯ X Σ / (cid:107) ,p ; p ) , (cid:107) Z ¯ X (cid:62) (cid:107) ,p = O P ( √ p (cid:107) M (cid:62) (cid:107) ,p ; p ) = O P ( √ np (cid:107) ¯ X Σ / (cid:107) ,p ; p ) . These estimates and √ np (cid:107) ¯ X Σ / (cid:107) ,p (cid:46) ¯∆ (cid:107) ¯ U (cid:107) ,p in Assumption B.2 yield (cid:107)H ( ¯ XZ (cid:62) + Z ¯ X (cid:62) ) (cid:107) ,p ≤ (cid:107) ¯ XZ (cid:62) + Z ¯ X (cid:62) (cid:107) ,p = O P ( ¯∆ (cid:107) ¯ U (cid:107) ,p ; p ) . This and (B.4) lead to (cid:107)H ( ¯ X ¯ X (cid:62) + ¯ XZ (cid:62) + Z ¯ X (cid:62) ) (cid:107) ,p max m ∈ [ n ] (cid:107) U ( m ) H ( m ) − ¯ U (cid:107) O P ( γ ( (cid:107) ¯ X ¯ X (cid:62) (cid:107) ,p + ¯∆ (cid:107) ¯ U (cid:107) ,p ); p ∧ n ) . (B.9)We use (B.6), (B.8) and (B.9) to get (cid:107) U H − G ¯ U ¯ Λ − (cid:107) ,p = O P ( κγ (cid:107) ¯ U (cid:107) ,p ; p ∧ n ) + O P ( γ ¯∆ − (cid:107) ¯ X ¯ X (cid:62) (cid:107) ,p ; p ∧ n )+ (cid:18) n (cid:88) m =1 (cid:107) [ H ( ZZ (cid:62) )] m ( U ( m ) H ( m ) − ¯ U ) (cid:107) p (cid:19) /p O P ( ¯∆ − ; n ) . (B.10)By construction, U ( m ) H ( m ) − ¯ U ∈ R n × r is independent of z m . 
We invoke Lemma H.2 toget (cid:18) n (cid:88) m =1 (cid:107) [ H ( ZZ (cid:62) )] m ( U ( m ) H ( m ) − ¯ U ) (cid:107) p (cid:19) /p = (cid:18) n (cid:88) m =1 (cid:13)(cid:13)(cid:13)(cid:13) (cid:88) j (cid:54) = m (cid:104) z m , z j (cid:105) ( U ( m ) H ( m ) − ¯ U ) j (cid:13)(cid:13)(cid:13)(cid:13) p (cid:19) /p = n /p max m ∈ [ n ] (cid:107) U ( m ) H ( m ) − ¯ U (cid:107) O P (cid:0) √ rp max {(cid:107) Σ (cid:107) HS , √ n (cid:107) Σ (cid:107) op } ; p ∧ n (cid:1) = O P ( γ ¯∆ (cid:107) ¯ U (cid:107) ,p ; p ∧ n ) , (B.11)where we also used (B.4) and Assumption B.2.We use (B.10) and (B.11) to derive (cid:107) U H − G ¯ U ¯ Λ − (cid:107) ,p = O P ( κγ (cid:107) ¯ U (cid:107) ,p ; p ∧ n ) + O P ( γ ¯∆ − (cid:107) ¯ G (cid:107) ,p ; p ∧ n ) . (B.12)Consequently, Lemma B.2 yields (cid:107) U H (cid:107) ,p ≤ (cid:107) G ¯ U ¯ Λ − (cid:107) ,p + (cid:107) U H − G ¯ U ¯ Λ − (cid:107) ,p = O P ( (cid:107) ¯ U (cid:107) ,p ; p ∧ n ) + O P ( γ ¯∆ − (cid:107) ¯ G (cid:107) ,p ; p ∧ n ) . (B.13)Lemma 2 in Abbe et al. (2017) and the result (cid:107) G − ¯ G (cid:107) = O P ( γ ¯∆; n ) in Lemma B.1 implythat (cid:107) H − sgn( H ) (cid:107) = O P ( γ ; n ) . As sgn( H ) is orthonormal, we have (cid:107) H − (cid:107) = O P (1 , n ) and (cid:107) U sgn( H ) − U H (cid:107) ,p ≤ (cid:107) U HH − (sgn( H ) − H ) (cid:107) ,p ≤ (cid:107) U H (cid:107) ,p (cid:107) H − (cid:107) (cid:107) sgn( H ) − H (cid:107) = (cid:107) U H (cid:107) ,p O P ( γ ; n ) . (B.14)The tail bounds for (cid:107) U sgn( H ) (cid:107) ,p and (cid:107) U sgn( H ) − G ¯ U ¯ Λ − (cid:107) ,p in Theorem B.1 followfrom (B.12), (B.13) and (B.14).Finally we use the results above to control (cid:107) U Λ / sgn( H ) − G ¯ U ¯ Λ − / (cid:107) ,p . By LemmaB.1, (cid:107) Λ − ¯ Λ (cid:107) ≤ (cid:107) G − ¯ G (cid:107) = O P ( γ ¯∆; n ) = o P ( ¯∆; n ) . Hence n − log P ( (cid:107) Λ − ¯ Λ (cid:107) ≥ ¯∆ / →−∞ . When (cid:107) G − ¯ G (cid:107) < ¯∆ / , we have Λ (cid:31) ( ¯∆ / I , and Λ / is well-defined. It remains toshow that (cid:107) U Λ / ¯ H − G ¯ U ¯ Λ − / (cid:107) ,p {(cid:107) G − ¯ G (cid:107) < ¯∆ / } = O P ( κ / γ ¯∆ / (cid:107) ¯ U (cid:107) ,p + κ / γ ¯∆ − / (cid:107) ¯ G (cid:107) ,p ; p ∧ n ) . (B.15)Define ¯ H = sgn( H ) . When (cid:107) G − ¯ G (cid:107) < ¯∆ / happens, we use triangle’s inequality toderive (cid:107) U Λ / ¯ H − G ¯ U ¯ Λ − / (cid:107) ,p ≤ (cid:107) U ¯ H ( ¯ H (cid:62) Λ / ¯ H − ¯ Λ / ) (cid:107) ,p + (cid:107) ( U ¯ H − G ¯ U ¯ Λ − ) ¯ Λ / (cid:107) ,p (cid:107) U ¯ H (cid:107) ,p (cid:107) ¯ H (cid:62) Λ / ¯ H − ¯ Λ / (cid:107) + (cid:107) U ¯ H − G ¯ U ¯ Λ − (cid:107) ,p (cid:107) ¯ Λ (cid:107) / . It is easily seen from (cid:107) ¯ Λ (cid:107) ≤ κ ¯∆ that (cid:107) U ¯ H − G ¯ U ¯ Λ − (cid:107) ,p (cid:107) ¯ Λ (cid:107) / = O P ( κ / γ ¯∆ / (cid:107) ¯ U (cid:107) ,p + κ / γ ¯∆ − / (cid:107) ¯ G (cid:107) ,p ; p ∧ n ) . Hence (cid:107) U Λ / ¯ H − G ¯ U ¯ Λ − / (cid:107) ,p {(cid:107) G − ¯ G (cid:107) < ¯∆ / } = O P ( κ / γ ¯∆ / (cid:107) ¯ U (cid:107) ,p + κ / γ ¯∆ − / (cid:107) ¯ G (cid:107) ,p ; p ∧ n )+ O P ( (cid:107) ¯ U (cid:107) ,p + γ ¯∆ − (cid:107) ¯ G (cid:107) ,p ; p ∧ n ) · (cid:107) ¯ H (cid:62) Λ / ¯ H − ¯ Λ / (cid:107) {(cid:107) G − ¯ G (cid:107) < ¯∆ / } . (B.16)Note that ¯ H (cid:62) Λ / ¯ H = ( ¯ H (cid:62) Λ ¯ H ) / . 
In view of the perturbation bound for matrixsquare roots (Schmitt, 1992, Lemma 2.1), (cid:107) ¯ H (cid:62) Λ / ¯ H − ¯ Λ / (cid:107) ≤ (cid:107) ¯ H (cid:62) Λ ¯ H − ¯ Λ (cid:107) λ min ( ¯ H (cid:62) Λ / ¯ H ) + λ min ( ¯ Λ / ) ≤ (cid:107) Λ ¯ H − ¯ H ¯ Λ (cid:107) / (cid:46) ( (cid:107) Λ H − H ¯ Λ (cid:107) + (cid:107) Λ ( ¯ H − H ) (cid:107) + (cid:107) ( ¯ H − H ) ¯ Λ (cid:107) ) / ¯∆ / (cid:46) (cid:107) Λ H − H ¯ Λ (cid:107) / ¯∆ / + O P ( κγ ¯∆ / ; n ) as long as (cid:107) G − ¯ G (cid:107) < ¯∆ / . Here we used (cid:107) H − ¯ H (cid:107) = O P ( γ ; n ) according to LemmaB.1 as well as Lemma 2 in Abbe et al. (2017).From U (cid:62) G = Λ U (cid:62) and ¯ G ¯ U = ¯ U ¯ Λ we obtain that Λ H − H ¯ Λ = Λ U (cid:62) ¯ U − U (cid:62) ¯ U ¯ Λ = U (cid:62) G ¯ U − U (cid:62) ¯ G ¯ U = U (cid:62) ( G − ¯ G ) ¯ U and (cid:107) Λ H − H ¯ Λ (cid:107) ≤ (cid:107) G − ¯ G (cid:107) = O P ( γ ¯∆; n ) . As a result, (cid:107) ¯ H (cid:62) Λ / ¯ H − ¯ Λ / (cid:107) {(cid:107) G − ¯ G (cid:107) < ¯∆ / } = O P ( γ ¯∆ / ; n ) , where we also used κγ = o (1) in Assumption B.1. Plugging this into (B.16), we get thedesired bound (B.15) and thus complete the proof of Theorem B.1. B.2 Proof of Lemma B.1
Note that G = H [( ¯ X + Z )( ¯ X + Z ) (cid:62) ] = H ( ¯ X ¯ X (cid:62) ) + H ( ¯ XZ (cid:62) + Z ¯ X (cid:62) ) + H ( ZZ (cid:62) )= ¯ X ¯ X (cid:62) + ( ¯ XZ (cid:62) + Z ¯ X (cid:62) ) + H ( ZZ (cid:62) ) − ¯ D , (B.17)where ¯ D is the diagonal part of ¯ X ¯ X (cid:62) + ¯ XZ (cid:62) + Z ¯ X (cid:62) , with ¯ D ii = (cid:107) ¯ x i (cid:107) + 2 (cid:104) ¯ x i , z i (cid:105) . From (cid:107)(cid:104) ¯ x i , z i (cid:105)(cid:107) ψ (cid:46) (cid:107) Σ / ¯ x i (cid:107) we get {|(cid:104) ¯ x i , z i (cid:105)|} ni =1 = O P ( {(cid:107) Σ / ¯ x i (cid:107)√ n } ni =1 ; n ) . By Fact A.6, max i ∈ [ n ] |(cid:104) ¯ x i , z i (cid:105)| = O P (cid:16) max i ∈ [ n ] (cid:107) Σ / ¯ x i (cid:107)√ n ; n (cid:17) and (cid:107) ¯ D (cid:107) = max i ∈ [ n ] | ¯ D ii | = max i ∈ [ n ] (cid:107) ¯ x i (cid:107) + O P (cid:18) max i ∈ [ n ] (cid:107) Σ / ¯ x i (cid:107)√ n ; n (cid:19) (cid:107) ¯ X (cid:107) , ∞ + O P (cid:0) (cid:107) ¯ X (cid:107) , ∞ ( n (cid:107) Σ (cid:107) op ) / ; n (cid:1) ≤ (cid:107) ¯ X ¯ X (cid:62) (cid:107) , ∞ + O P (cid:16) (cid:107) ¯ X ¯ X (cid:62) (cid:107) / ( n (cid:107) Σ (cid:107) op ) / ; n (cid:17) = (cid:107) ¯ G (cid:107) , ∞ + O P (cid:0) ( nκ ¯∆ (cid:107) Σ (cid:107) op ) / ; n (cid:1) . (B.18)Note that (cid:107) Z ¯ X (cid:62) (cid:107) = sup u , v ∈ S n − u (cid:62) Z ¯ X (cid:62) v . Since { z (cid:62) i ¯ X (cid:62) v } ni =1 are zero-mean, inde-pendent and (cid:107) z (cid:62) i ¯ X (cid:62) v (cid:107) ψ (cid:46) (cid:107) Σ / ¯ X (cid:62) v (cid:107) ≤ (cid:107) ¯ X Σ / (cid:107) op ≤ ( (cid:107) ¯ G (cid:107) (cid:107) Σ (cid:107) op ) / = ( κ ¯∆ (cid:107) Σ (cid:107) op ) / , we have (cid:107) u (cid:62) Z ¯ X (cid:62) v (cid:107) ψ = (cid:13)(cid:13)(cid:13)(cid:13) n (cid:88) i =1 u i z (cid:62) i ¯ X (cid:62) v (cid:13)(cid:13)(cid:13)(cid:13) ψ (cid:46) (cid:18) n (cid:88) i =1 u i (cid:107) z (cid:62) i ¯ X (cid:62) v (cid:107) ψ (cid:19) / (cid:46) ( κ ¯∆ (cid:107) Σ (cid:107) op ) / . A standard covering argument (Vershynin, 2010, Section 5.2.2) yields (cid:107) Z ¯ X (cid:62) (cid:107) = O P (( nκ ¯∆ (cid:107) Σ (cid:107) op ) / ; n ) . The same tail bound also holds for (cid:107) ¯ XZ (cid:62) (cid:107) .From these estimates, (B.17), (B.18) and Lemma H.1 we obtain that (cid:107) G − ¯ X ¯ X (cid:62) (cid:107) = O P (cid:0) (cid:107) ¯ G (cid:107) , ∞ + ( nκ ¯∆ (cid:107) Σ (cid:107) op ) / + max {√ n (cid:107) Σ (cid:107) HS , n (cid:107) Σ (cid:107) op } ; n (cid:1) . By Assumptions B.1 and 2.6, we have nκ (cid:107) Σ (cid:107) op ≤ ¯∆ . Hence n (cid:107) Σ (cid:107) op ≤ ( nκ ¯∆ (cid:107) Σ (cid:107) op ) / and (cid:107) G − ¯ X ¯ X (cid:62) (cid:107) = O P ( γ ¯∆; n ) .Finally, Weyl’s inequality (Stewart and Sun, 1990) and Davis-Kahan theorem (Davis andKahan, 1970) assert that (cid:107) Λ − ¯ Λ (cid:107) ≤ (cid:107) G − ¯ G (cid:107) = O P ( γ ¯∆; n ) and (cid:107) U U (cid:62) − ¯ U ¯ U (cid:62) (cid:107) (cid:46) (cid:107) G − ¯ G (cid:107) / ¯∆ = O P ( γ ; n ) . B.3 Proof of Lemma B.2
Observe that G = H ( XX (cid:62) ) = H [( ¯ X + Z ) X (cid:62) ] = ¯ X ¯ X (cid:62) + [ H ( ¯ X ¯ X (cid:62) ) − ¯ X ¯ X (cid:62) ] + H ( ¯ XZ (cid:62) ) + H ( ZX (cid:62) ) . From ¯ X ¯ X (cid:62) ¯ U = ¯ G ¯ Λ = ¯ U ¯ Λ we get (cid:107) G ¯ U − ¯ U ¯ Λ − H ( ZX (cid:62) ) ¯ U (cid:107) ,p = (cid:107) G ¯ U − ¯ X ¯ X (cid:62) ¯ U − H ( ZX (cid:62) ) ¯ U (cid:107) ,p = (cid:107) [ H ( ¯ X ¯ X (cid:62) ) − ¯ X ¯ X (cid:62) + H ( ¯ XZ (cid:62) )] ¯ U (cid:107) ,p ≤ (cid:32) n (cid:88) m =1 ( (cid:107) ¯ x m (cid:107) (cid:107) ¯ U m (cid:107) ) p (cid:33) /p + (cid:107)H ( ¯ XZ (cid:62) ) ¯ U (cid:107) ,p . On the one hand, we have n (cid:88) m =1 ( (cid:107) ¯ x m (cid:107) (cid:107) ¯ U m (cid:107) ) p ≤ max m ∈ [ n ] (cid:107) ¯ x m (cid:107) p n (cid:88) m =1 (cid:107) ¯ U m (cid:107) p = (cid:107) ¯ X (cid:107) p , ∞ (cid:107) ¯ U (cid:107) p ,p ≤ ( γ ¯∆ (cid:107) ¯ U (cid:107) ,p ) p , (cid:107) ¯ X (cid:107) , ∞ ≤ (cid:107) ¯ X ¯ X (cid:62) (cid:107) , ∞ ≤ γ ¯∆ in Assumption B.1. On the other hand, { z j } j (cid:54) = m are independent, (cid:107)(cid:104) ¯ x m , z j (cid:105)(cid:107) ψ (cid:46) (cid:107) Σ / ¯ x m (cid:107) , ¯ U = ( ¯ u , · · · , ¯ u r ) and (cid:107) ¯ u j (cid:107) = 1 for j ∈ [ r ] .Then (cid:107) [ H ( ¯ XZ (cid:62) )] m ¯ u j (cid:107) ψ = (cid:107) ( ¯ XZ (cid:62) ) m ( I − e m e (cid:62) m ) ¯ u j (cid:107) ψ = (cid:13)(cid:13)(cid:13)(cid:13) (cid:88) k (cid:54) = m ¯ u jk (cid:104) ¯ x m , z j (cid:105) (cid:13)(cid:13)(cid:13)(cid:13) ψ (cid:46) (cid:107) Σ / ¯ x m (cid:107) , j ∈ [ r ] , m ∈ [ n ] . Lemma H.3 forces (cid:107)H ( ¯ XZ (cid:62) ) ¯ U (cid:107) ,p = O P ( √ p (cid:107) M (cid:107) ,p ; p ) , where M ij = (cid:107) Σ / ¯ x i (cid:107) . Hence (cid:107) M (cid:107) ,p = (cid:20) n (cid:88) i =1 (cid:18) r (cid:88) j =1 (cid:107) Σ / ¯ x i (cid:107) (cid:19) p/ (cid:21) /p = √ r (cid:107) ¯ X Σ / (cid:107) ,p , (cid:107)H ( ¯ XZ (cid:62) ) ¯ U (cid:107) ,p = O P ( √ rp (cid:107) ¯ X Σ / (cid:107) ,p ; p ) = O P ( (cid:112) r/n ¯∆ (cid:107) ¯ U (cid:107) ,p ; p ) , where the last equality follows from Assumption B.2. By combining the two parts we get (cid:107) G ¯ U − ¯ U ¯ Λ − H ( ZX (cid:62) ) ¯ U (cid:107) ,p = ( γ + (cid:112) r/n ) O P ( ¯∆ (cid:107) ¯ U (cid:107) ,p ; p ) , (cid:107) G ¯ U ¯ Λ − − H ( ZX (cid:62) ) ¯ U ¯ Λ − (cid:107) ,p ≤ (cid:107) G ¯ U − ¯ U ¯ Λ − H ( ZX (cid:62) ) ¯ U (cid:107) ,p (cid:107) ¯ Λ − (cid:107) + (cid:107) ¯ U (cid:107) ,p = O P ( (cid:107) ¯ U (cid:107) ,p ; p ) . (B.19)To study H ( ZX (cid:62) ) ¯ U , we decompose it into H ( Z ¯ X (cid:62) ) ¯ U + H ( ZZ (cid:62) ) ¯ U . Note that [ H ( Z ¯ X (cid:62) ) ¯ U ] mj = ( Z ¯ X (cid:62) ) m ( I − e m e (cid:62) m ) ¯ u j = (cid:104) z m , ¯ X (cid:62) ( I − e m e (cid:62) m ) ¯ u j (cid:105) , (cid:107) [ H ( Z ¯ X (cid:62) ) ¯ U ] mj (cid:107) ψ (cid:46) (cid:107) Σ / ¯ X (cid:62) ( I − e m e (cid:62) m ) ¯ u j (cid:107) . Lemma H.3 forces (cid:107)H ( Z ¯ X (cid:62) ) ¯ U (cid:107) ,p = O P ( √ p (cid:107) M (cid:107) ,p ; p ) , where M ij = (cid:107) Σ / ¯ X (cid:62) ( I − e m e (cid:62) m ) ¯ u j (cid:107) . 
From r (cid:88) j =1 (cid:107) Σ / ¯ X (cid:62) ( I − e m e (cid:62) m ) ¯ u j (cid:107) = (cid:28) ( I − e m e (cid:62) m ) ¯ X Σ ¯ X (cid:62) ( I − e m e (cid:62) m ) , r (cid:88) j =1 ¯ u j ¯ u (cid:62) j (cid:29) ≤ Tr( ¯ X Σ ¯ X (cid:62) ) = (cid:107) ¯ X Σ / (cid:107) , we get (cid:107) M (cid:107) ,p = (cid:20) n (cid:88) m =1 (cid:18) r (cid:88) j =1 (cid:107) Σ / ¯ X (cid:62) ( I − e m e (cid:62) m ) ¯ u j (cid:107) (cid:19) p/ (cid:21) /p = n /p (cid:107) ¯ X Σ / (cid:107) , ≤ n / (cid:107) ¯ X Σ / (cid:107) ,p , (cid:107)H ( Z ¯ X (cid:62) ) ¯ U (cid:107) ,p = O P ( √ np (cid:107) ¯ X Σ / (cid:107) ,p ; p ) = O P ( ¯∆ (cid:107) ¯ U (cid:107) ,p ; p ) , (B.20)where we used Assumption B.2 to get the last equality.Note that (cid:107) ¯ U (cid:107) = 1 and (cid:107) [ H ( ZZ (cid:62) ) ¯ U ] m (cid:107) = (cid:107) (cid:80) j (cid:54) = m (cid:104) z m , z j (cid:105) ¯ U j (cid:107) , ∀ m ∈ [ n ] . LemmaH.2 asserts that (cid:107)H ( ZZ (cid:62) ) ¯ U (cid:107) ,p = (cid:18) n (cid:88) m =1 (cid:13)(cid:13)(cid:13)(cid:13) (cid:88) j (cid:54) = m (cid:104) z m , z j (cid:105) ¯ U j (cid:13)(cid:13)(cid:13)(cid:13) p (cid:19) /p n /p (cid:107) ¯ U (cid:107) p O P (cid:0) √ rp max {(cid:107) Σ (cid:107) HS , √ n (cid:107) Σ (cid:107) op } ; p ∧ n (cid:1) = O P (cid:16) n /p √ rp max {(cid:107) Σ (cid:107) HS , √ n (cid:107) Σ (cid:107) op } ; p ∧ n (cid:17) = O P ( ¯∆ (cid:107) ¯ U (cid:107) ,p ; p ∧ n ) . (B.21)The last equality is due to Assumption B.2. Then we complete the proof using (B.19), (B.20)and (B.21). C Proofs of Section 2
C.1 Proof of Theorem 2.1
We will invoke Theorem B.1 to prove Theorem 2.1 in the Hilbert setting (under Assumptions2.4, 2.5 and 2.6). We claim that Assumption B.2 holds, p (cid:46) n and γ (cid:107) ¯ G (cid:107) , ∞ / ¯∆ (cid:28) (cid:112) r/n. (C.1)In that case, Theorem B.1 asserts that (cid:107) U sgn( H ) (cid:107) ,p = O P (cid:0) (cid:107) ¯ U (cid:107) ,p + γ ¯∆ − (cid:107) ¯ G (cid:107) ,p ; p (cid:1) , (C.2) (cid:107) U sgn( H ) − G ¯ U ¯ Λ − (cid:107) ,p = O P (cid:0) κγ (cid:107) ¯ U (cid:107) ,p + γ ¯∆ − (cid:107) ¯ G (cid:107) ,p ; p (cid:1) ., (C.3) (cid:107) U Λ / sgn( H ) − G ¯ U ¯ Λ − / (cid:107) ,p = O P ( κ / γ ¯∆ / (cid:107) ¯ U (cid:107) ,p + κ / γ ¯∆ − / (cid:107) ¯ G (cid:107) ,p ; p ∧ n ) . (C.4)When ≤ p < ∞ , we have n − / (cid:107) v (cid:107) ≤ n − /p (cid:107) v (cid:107) p ≤ (cid:107) v (cid:107) ∞ , ∀ v ∈ R n . This inequalityand (C.1) force that γ (cid:107) ¯ G (cid:107) ,p ≤ γn /p (cid:107) ¯ G (cid:107) , ∞ (cid:28) n /p ¯∆ (cid:112) r/n = n /p ¯∆ n − / (cid:107) ¯ U (cid:107) , ≤ ¯∆ (cid:107) ¯ U (cid:107) ,p . Hence γ ¯∆ − (cid:107) ¯ G (cid:107) ,p = o ( (cid:107) ¯ U (cid:107) ,p ) . The first and last equation in Theorem 2.1 directly followfrom (C.2), (C.3), (C.4) and κγ (cid:28) /µ (cid:46) in Assumption 2.3.To control (cid:107) U sgn( H ) − [ ¯ U + H ( ZX (cid:62) ) ¯ U ¯ Λ − ] (cid:107) ,p , we invoke Lemma B.2 to get (cid:107) G ¯ U − ¯ U ¯ Λ − H ( ZX (cid:62) ) ¯ U (cid:107) ,p = ( γ + (cid:112) r/n ) O P ( ¯∆ (cid:107) ¯ U (cid:107) ,p ; p ) = o P ( ¯∆ (cid:107) ¯ U (cid:107) ,p ; p ) . Then (cid:107) U sgn( H ) − [ ¯ U + H ( ZX (cid:62) ) ¯ U ¯ Λ − ] (cid:107) ,p ≤ (cid:107) U sgn( H ) − G ¯ U ¯ Λ − (cid:107) ,p + (cid:107) G ¯ U ¯ Λ − − [ ¯ U + H ( ZX (cid:62) ) ¯ U ¯ Λ − ] (cid:107) ,p ≤ (cid:107) U sgn( H ) − G ¯ U ¯ Λ − (cid:107) ,p + (cid:107) G ¯ U − [ ¯ U ¯ Λ + H ( ZX (cid:62) ) ¯ U ] (cid:107) ,p (cid:107) ¯ Λ − (cid:107) = o P ( (cid:107) ¯ U (cid:107) ,p ; p ∧ n ) . We get all the desired results in Theorem 2.1, provided that Assumption B.2, p (cid:46) n and(C.1) hold.The claim p (cid:46) n is easy to prove: p (i) (cid:46) ( µγ ) − (cid:46) γ − ≤ ( κµ (cid:112) r/n ) − = nrκ µ ≤ n, (i) the condition on p ; (ii) µ ≥ ; (iii) Assumption 2.1; (iv) r ≥ , κ ≥ and µ ≥ .To verify (C.1), we start from (cid:107) ¯ G (cid:107) , ∞ = (cid:107) ¯ X ¯ X (cid:62) (cid:107) , ∞ ≤ (cid:107) ¯ X (cid:107) , ∞ (cid:107) ¯ X (cid:107) = (cid:107) ¯ X (cid:107) , ∞ (cid:107) ¯ X (cid:107) · (cid:107) ¯ X (cid:107) ≤ ( µ (cid:112) r/n )( κ ¯∆) = κµ (cid:112) r/n · ¯∆ , (C.5)where (i) is due to µ ≥ ( (cid:107) ¯ X (cid:107) , ∞ / (cid:107) ¯ X (cid:107) ) (cid:112) n/r and (cid:107) ¯ X (cid:107) = (cid:107) ¯ G (cid:107) = κ ¯∆ . Assumption 2.1forces γ ≥ κµ (cid:112) r/n and (cid:107) ¯ G (cid:107) , ∞ / ¯∆ ≤ γ. (C.6)In addition, (C.5) and the condition γ (cid:28) ( κµ ) − in Assumption 2.1 imply (C.1)It remain to check Assumption B.2. 
To prove √ np (cid:107) ¯ X Σ / (cid:107) ,p (cid:46) ¯∆ (cid:107) ¯ U (cid:107) ,p , we first provean inequality in (cid:107) · (cid:107) , ∞ and then convert it to (cid:107) · (cid:107) ,p using n − / (cid:107) v (cid:107) ≤ n − /p (cid:107) v (cid:107) p ≤ (cid:107) v (cid:107) ∞ , ∀ v ∈ R n (C.7)By elementary calculation, ¯∆ √ rn (cid:107) ¯ X Σ / (cid:107) , ∞ (i) ≥ ¯∆ √ rn (cid:107) ¯ X (cid:107) , ∞ (cid:107) Σ (cid:107) / = (cid:18) ¯∆ κn (cid:107) Σ (cid:107) (cid:19) / (cid:112) κr ¯∆ /n (cid:107) ¯ X (cid:107) , ∞ (ii) = (cid:18) ¯∆ κn (cid:107) Σ (cid:107) (cid:19) / (cid:18) (cid:107) ¯ X (cid:107) (cid:107) ¯ X (cid:107) , ∞ (cid:114) rn (cid:19) (iii) ≥ (cid:18) ¯∆ κn (cid:107) Σ (cid:107) (cid:19) / µ (iv) ≥ µγ (v) (cid:38) √ p. where we used (i) (cid:107) ¯ X Σ / (cid:107) , ∞ ≤ (cid:107) ¯ X (cid:107) , ∞ (cid:107) Σ (cid:107) / ; (ii) κ ¯∆ = (cid:107) ¯ G (cid:107) = (cid:107) ¯ X (cid:107) ; (iii) µ ≥ (cid:107) ¯ X (cid:107) , ∞ (cid:107) ¯ X (cid:107) (cid:112) nr ; (iv) γ ≥ ( κn (cid:107) Σ (cid:107) / ¯∆) / in Assumption 2.3; (v) p (cid:46) ( µγ ) − . We use (C.7) to get √ np (cid:107) ¯ X Σ / (cid:107) ,p ≤ √ npn /p (cid:107) ¯ X Σ / (cid:107) , ∞ (cid:46) √ npn /p √ r ¯∆ /n √ p = ¯∆ n /p (cid:112) r/n = ¯∆ n /p n − / (cid:107) ¯ U (cid:107) , ≤ ¯∆ n /p n − /p (cid:107) ¯ U (cid:107) ,p = ¯∆ (cid:107) ¯ U (cid:107) ,p . We finally prove n /p √ rp max {(cid:107) Σ (cid:107) HS , √ n (cid:107) Σ (cid:107) op } (cid:46) ¯∆ (cid:107) ¯ U (cid:107) ,p . By Assumption 2.3, max { ( nκ (cid:107) Σ (cid:107) / ¯∆) / , √ n (cid:107) Σ (cid:107) F / ¯∆ } ≤ γ. Since γ (cid:28) according to Assumption 2.1, we have nκ (cid:107) Σ (cid:107) / ¯∆ (cid:28) ( nκ (cid:107) Σ (cid:107) / ¯∆) / ≤ γ (cid:28) .Hence √ p (cid:46) ( µγ ) − (cid:46) γ − ≤ { nκ (cid:107) Σ (cid:107) / ¯∆ , √ n (cid:107) Σ (cid:107) F / ¯∆ } ≤ ¯∆ /n max {(cid:107) Σ (cid:107) , n − / (cid:107) Σ (cid:107) F } . By the conversion (C.7), n /p √ rp max {(cid:107) Σ (cid:107) F , √ n (cid:107) Σ (cid:107) } = n /p +1 / √ rp max { n − / (cid:107) Σ (cid:107) F , (cid:107) Σ (cid:107) } (cid:46) n /p +1 / √ r ¯∆ /n = ¯∆ n /p − / (cid:107) ¯ U (cid:107) , ≤ ¯∆ (cid:107) ¯ U (cid:107) ,p . Proofs of Section 3.1
D.1 A useful lemma
We first prove a useful lemma bridging (cid:96) p approximation and misclassification rates. Lemma D.1.
Suppose that v = v n , w = w n and ¯ v = ¯ v n are random vectors in R n , min i ∈ [ n ] | ¯ v i | = δ n > , and p = p n → ∞ . If min s = ± (cid:107) s v − ¯ v − w (cid:107) p = o P ( n /p δ n ; p ) ,then lim sup n →∞ p − log (cid:18) n E min s = ± |{ i ∈ [ n ] : s sgn( v i ) (cid:54) = sgn(¯ v i ) }| (cid:19) ≤ lim sup ε → lim sup n →∞ p − log (cid:18) n n (cid:88) i =1 P ( − w i sgn(¯ v i ) ≥ (1 − ε ) | ¯ v i | ) (cid:19) . Proof of Lemma D.1.
Let S n = { i ∈ [ n ] : sgn( v i ) (cid:54) = sgn(¯ v i ) } and r = v − ¯ v − w . Fornotational simplicity, we will prove the upper bound for lim sup n →∞ p − log( E | S n | /n ) under astronger assumption (cid:107) r (cid:107) p = o P ( n /p δ n ; p ) . Otherwise we just redefine v as (argmin s = ± (cid:107) s v − ¯ v − w (cid:107) p ) v and go through the same proof.As a matter of fact, S n ⊆ { i ∈ [ n ] : − ( v i − ¯ v i ) sgn(¯ v i ) ≥ | ¯ v i |} = { i ∈ [ n ] : − ( w i + r i ) sgn(¯ v i ) ≥ | ¯ v i |} . For any ε ∈ (0 , , { i ∈ [ n ] : − r i sgn(¯ v i ) < ε | ¯ v i | and − w i sgn(¯ v i ) < (1 − ε ) | ¯ v i |}⊆ { i ∈ [ n ] : − ( w i + r i ) sgn(¯ v i ) < | ¯ v i |} . Hence S n ⊆ { i ∈ [ n ] : − r i sgn(¯ v i ) ≥ ε | ¯ v i | or − w i sgn(¯ v i ) ≥ (1 − ε ) | ¯ v i |}⊆ { i ∈ [ n ] : | r i | ≥ ε | ¯ v i |} ∪ { i ∈ [ n ] : − w i sgn(¯ v i ) ≥ (1 − ε ) | ¯ v i |} . Let q n ( ε ) = n (cid:80) ni =1 P ( − w i sgn(¯ v i ) ≥ (1 − ε ) | ¯ v i | ) . We have E | S n | ≤ E |{ i ∈ [ n ] : | r i | ≥ ε | ¯ v i |}| + nq n ( ε ) .To study { i ∈ [ n ] : | r i | ≥ ε | ¯ v i |} , we define E n = {(cid:107) r (cid:107) p < ε n /p δ n } . Since (cid:107) r (cid:107) p = o P ( n /p δ n ; p ) , there exist C , N ∈ Z + such that P ( E cn ) ≤ C e − p/ε , ∀ n ≥ N . When E n happens, |{ i ∈ [ n ] : | r i | ≥ ε | ¯ v i |}| ≤ |{ i ∈ [ n ] : | r i | ≥ εδ n }| ≤ (cid:107) r (cid:107) pp ( εδ n ) p ≤ ( ε n /p δ n ) p ( εδ n ) p = nε p . Then by log t = log(1 + t − ≤ t − < t for t ≥ , we have log(1 /ε ) ≤ /ε , n − E |{ i ∈ [ n ] : | r i | ≥ ε | ¯ v i |}| ≤ ε p P ( E n ) + 1 · P ( E cn ) ≤ e − p log(1 /ε ) + C e − p/ε ≤ ( C ∨ e − p log(1 /ε ) , n − E | S n | ≤ ( C ∨ e − p log(1 /ε ) + q n ( ε ) . As a result, log( E | S n | /n ) ≤ log(( C ∨ e − p log(1 /ε ) + q n ( ε )) ≤ log[2 max { ( C ∨ e − p log(1 /ε ) , q n ( ε ) } ] ≤ log 2 + max { log( C ∨ − p log(1 /ε ) , log q n ( ε ) } . The assumption p = p n → ∞ leads to lim sup n →∞ p − log( E | S n | /n ) ≤ max {− log(1 /ε ) , lim sup n →∞ p − log q n ( ε ) } , ∀ ε ∈ (0 , . By letting ε → we finish the proof. D.2 Proof of Theorem 3.1
We supress the subscripts of λ , ¯ λ , u and ¯ u . Note that ¯∆ = ¯ λ = n (cid:107) µ (cid:107) and κ = 1 .Assumption 2.4 holds for µ = 1 and / √ n ≤ γ (cid:28) . Assumption 2.6 holds when γ ≥ / √ SNR . Taking γ = 1 / √ SNR ∧ n ensures all the assumptions for Theorem 2.1 to hold.We first consider the case where (cid:28) SNR (cid:46) log n and take p = SNR . By Theorem 2.1, min c = ± (cid:107) c u − ¯ u − H ( ZX (cid:62) ) ¯ u / ¯ λ (cid:107) p = o P ( (cid:107) ¯ u (cid:107) p ; p ) . Since ¯ u = n − / y n , Lemma D.1 asserts that lim sup n →∞ p − log E M [sgn( u )] = lim sup n →∞ p − log (cid:18) n E min s = ± |{ i ∈ [ n ] : s sgn( u i ) (cid:54) = sgn(¯ u i ) }| (cid:19) ≤ lim sup ε → lim sup n →∞ p − log (cid:18) n n (cid:88) i =1 P (cid:0) − [ H ( ZX (cid:62) ) ¯ u / ¯ λ ] i sgn(¯ u i ) ≥ (1 − ε ) | ¯ u i | (cid:1) (cid:19) . From [ H ( ZX (cid:62) ) ¯ u ] i = (cid:80) j (cid:54) = i (cid:104) z i , x j (cid:105) ¯ u j and ¯ λ = n (cid:107) µ (cid:107) we obtain that P (cid:0) − [ H ( ZX (cid:62) ) ¯ u / ¯ λ ] i sgn(¯ u i ) ≥ (1 − ε ) | ¯ u i | (cid:1) ≤ P (cid:0) | [ H ( ZX (cid:62) ) ¯ u / ¯ λ ] i | ≥ (1 − ε ) / √ n (cid:1) ≤ P (cid:18)(cid:12)(cid:12)(cid:12)(cid:12)(cid:28) z i , (cid:88) j (cid:54) = i x j ¯ u j (cid:29)(cid:12)(cid:12)(cid:12)(cid:12) ≥ √ n (cid:107) µ (cid:107) / (cid:19) , ∀ ε ∈ (0 , / . The estimates above yields lim sup n →∞ p − log E M [sgn( u )] ≤ lim sup n →∞ p − log (cid:20) n n (cid:88) i =1 P (cid:18)(cid:12)(cid:12)(cid:12)(cid:12)(cid:28) z i , (cid:88) j (cid:54) = i x j ¯ u j (cid:29)(cid:12)(cid:12)(cid:12)(cid:12) ≥ √ n (cid:107) µ (cid:107) / (cid:19)(cid:21) . (D.1)Note that ¯ u j = y j / √ n and x j = y j µ + z j , we have (cid:88) j (cid:54) = i x j ¯ u j = (cid:88) j (cid:54) = i ( µ + y j z j ) = (cid:114) n − n ( √ n − µ + w i ) , w i = √ n − (cid:80) j (cid:54) = i y j z j . Hence P (cid:18)(cid:12)(cid:12)(cid:12)(cid:12)(cid:28) z i , (cid:88) j (cid:54) = i x j ¯ u j (cid:29)(cid:12)(cid:12)(cid:12)(cid:12) ≥ √ n (cid:107) µ (cid:107) (cid:19) = P (cid:18)(cid:12)(cid:12)(cid:12)(cid:12)(cid:28) z i , √ n − µ + w i (cid:107) Σ / ( √ n − µ + w i ) (cid:107) (cid:29)(cid:12)(cid:12)(cid:12)(cid:12) ≥ √ n (cid:107) µ (cid:107) (cid:107) Σ / ( √ n − µ + w i ) (cid:107) (cid:19) . (D.2)By the triangle’s inequality, (cid:107) Σ / ( √ n − µ + w i ) (cid:107) ≤ √ n (cid:107) Σ (cid:107) / (cid:107) µ (cid:107) + (cid:107) Σ / w i (cid:107) . Since Σ / w i satisties th Assumption 2.5 with Σ replaced by Σ , Lemma H.1 yields (cid:107) Σ / w i (cid:107) = O P (max { Tr( Σ ) , n (cid:107) Σ (cid:107) op } ; n ) = O P (max {(cid:107) Σ (cid:107) , n (cid:107) Σ (cid:107) } ; n ) There exist constants c , c > such that P ( (cid:107) Σ / w i (cid:107) > c max {(cid:107) Σ (cid:107) HS , √ n (cid:107) Σ (cid:107) op } ) < c e − n . The assumption
SNR (cid:29) yields (cid:107) µ (cid:107) (cid:29) (cid:107) Σ (cid:107) op } and thus P ( (cid:107) Σ / ( √ n − µ + w i ) (cid:107) > ( c + 1) max {(cid:107) Σ (cid:107) HS , √ n (cid:107) Σ (cid:107) / (cid:107) µ (cid:107)} ) < c e − n . By (D.2) and the definition of
SNR , P (cid:18)(cid:12)(cid:12)(cid:12)(cid:12)(cid:28) z i , (cid:88) j (cid:54) = i x j ¯ u j (cid:29)(cid:12)(cid:12)(cid:12)(cid:12) ≥ √ n (cid:107) µ (cid:107) / (cid:19) ≤ P (cid:18)(cid:12)(cid:12)(cid:12)(cid:12)(cid:28) z i , √ n − µ + w i (cid:107) Σ / ( √ n − µ + w i ) (cid:107) (cid:29)(cid:12)(cid:12)(cid:12)(cid:12) ≥ √ SNR2( c + 1) (cid:19) + c e − n . (D.3)The desired result lim sup n →∞ SNR − log E M [sgn( u )] < − c for some constant c > follows from (D.1) and (cid:13)(cid:13)(cid:13)(cid:13)(cid:28) z i , √ n − µ + w i (cid:107) Σ / ( √ n − µ + w i ) (cid:107) (cid:29)(cid:13)(cid:13)(cid:13)(cid:13) ψ (cid:46) . (D.4)Here we used the independence between z i and w i = √ n − (cid:80) j (cid:54) = i y j z j .From now on we consider the case where SNR ≥ C log n for some constant C > , andtake p = SNR . By Theorem 2.1 and Fact 2.1, min c = ± (cid:107) c u − ¯ u − H ( ZX (cid:62) ) ¯ u / ¯ λ (cid:107) ∞ = o P ( (cid:107) ¯ u (cid:107) ∞ ; log n ) . As a result, P (cid:16) min c = ± (cid:107) c u − ¯ u − H ( ZX (cid:62) ) ¯ u / ¯ λ (cid:107) ∞ > / (2 √ n ) (cid:17) (cid:46) /n. (D.5)38n the other hand, repeating the arguments from (D.2) to (D.3) yields P (cid:16) (cid:107)H ( ZX (cid:62) ) ¯ u / ¯ λ (cid:107) ∞ > / (2 √ n ) (cid:17) = P (cid:32) max i ∈ [ n ] (cid:12)(cid:12)(cid:12)(cid:12)(cid:28) z i , (cid:88) j (cid:54) = i x j ¯ u j (cid:29)(cid:12)(cid:12)(cid:12)(cid:12) > √ n (cid:107) µ (cid:107) (cid:33) ≤ P (cid:18)(cid:12)(cid:12)(cid:12)(cid:12)(cid:28) z i , √ n − µ + w i (cid:107) Σ / ( √ n − µ + w i ) (cid:107) (cid:29)(cid:12)(cid:12)(cid:12)(cid:12) ≥ √ C log n c + 1) (cid:19) + c e − n . (D.6)where w i = √ n − (cid:80) j (cid:54) = i y j z j is independent of z i . (D.6) and (D.4) imply that when C is largeenough, P (cid:16) (cid:107)H ( ZX (cid:62) ) ¯ u / ¯ λ (cid:107) ∞ > / (2 √ n ) (cid:17) ≤ /n + c e − n . (D.7)Finally, it follows from (D.5) and (D.7) that P [sgn( u ) (cid:54) = ± sgn( ¯ u )] ≤ P (cid:16) min c = ± (cid:107) c u − ¯ u (cid:107) ∞ > / √ n (cid:17) (cid:46) /n. E Proofs of Section 3.2
E.1 A technical lemma
The following technical lemma will be used in the analysis of misclassification rates.
Lemma E.1.
Consider the Gaussian mixture model in Definition 3.1 with d ≥ . Let R = (cid:107) µ (cid:107) and p = SNR = R / ( R + d/n ) . If n → ∞ and SNR → ∞ , then for any fixed i we have (cid:107) ˆ µ ( − i ) − µ (cid:107) = O P ( (cid:112) ( d ∨ p ) /n ; p ) , (cid:12)(cid:12)(cid:12) (cid:107) ˆ µ ( − i ) (cid:107) − (cid:112) R + d/ ( n − (cid:12)(cid:12)(cid:12) = O P ( (cid:112) p/n ; p ) , (cid:107) x i (cid:107) = O P ( R ∨ √ d ; p ) , (cid:104) ˆ µ ( − i ) − µ , x i (cid:105) = (cid:112) p/nO P ( R ∨ √ d ; p ) and (cid:104) ˆ µ ( − i ) , x i (cid:105) = O P ( R ; p ) . Proof of Lemma E.1.
Let w i = (cid:80) j (cid:54) = i z j y j and note that ( n −
1) ˆ µ ( − i ) = (cid:80) j (cid:54) = i x j y j = (cid:80) j (cid:54) = i ( µ y j + z j ) y j = ( n − µ + w i . From w i ∼ N ( , ( n − I d ) we get (cid:107) w i (cid:107) / ( n − ∼ χ d ,and Lemma H.4 leads to (cid:107) w i (cid:107) / ( n − − d = O P ( p ∨ √ pd ; p ) . Then (cid:107) ˆ µ ( − i ) − µ (cid:107) = ( n − − (cid:107) w i (cid:107) = d + O P ( p ∨ √ pd ; p ) n − O P (( d ∨ p ) /n ; p ) , and (cid:107) ˆ µ ( − i ) − µ (cid:107) = O P ( (cid:112) ( d ∨ p ) /n ; p ) . To study (cid:107) ˆ µ ( − i ) (cid:107) , we start from the decomposition (cid:107) ˆ µ ( − i ) (cid:107) = (cid:107) µ (cid:107) + 2( n − − (cid:104) µ , w i (cid:105) + ( n − − (cid:107) w i (cid:107) . Since (cid:104) µ , w i (cid:105) ∼ N (0 , ( n − R ) , Lemma H.3 yields (cid:104) µ , w i (cid:105) = O P ( R √ np ; p ) . We use theseand √ p ≤ R to derive (cid:107) ˆ µ ( − i ) (cid:107) = R + 2 · O P ( R √ np ; p ) n − d + O P ( p ∨ √ pd ; p ) n − R + dn − { R √ np, p, √ pd } n O P (1; p ) R + dn − { R √ np, √ pd } n O P (1; p )= R + dn − (cid:114) pn O P ( R ∨ (cid:112) d/n ; p ) . Based on this and (cid:112) R + d/ ( n − ≥ (cid:112) R + d/n (cid:16) R ∨ (cid:112) d/n , (cid:12)(cid:12)(cid:12) (cid:107) ˆ µ ( − i ) (cid:107) − (cid:112) R + d/ ( n − (cid:12)(cid:12)(cid:12) = (cid:12)(cid:12) (cid:107) ˆ µ ( − i ) (cid:107) − [ R + d/ ( n − (cid:12)(cid:12) (cid:107) ˆ µ ( − i ) (cid:107) + (cid:112) R + d/ ( n − ≤ (cid:112) p/nO P ( R ∨ (cid:112) d/n ; p ) (cid:112) R + d/ ( n −
1) = O P ( (cid:112) p/n ; p ) . From (cid:107) z i (cid:107) ∼ χ d and Lemma H.4 we get (cid:107) z i (cid:107) = d + O P ( √ pd ∨ p ; p ) = O P ( p ∨ d ; p ) .Hence (cid:107) x i (cid:107) ≤ (cid:107) µ (cid:107) + (cid:107) z i (cid:107) = R + O P ( √ p ∨ d ; p ) = O P ( R ∨ √ d ; p ) as R ≥ √ p .Now we study (cid:104) ˆ µ ( − i ) − µ , x i (cid:105) = (cid:104) ˆ µ ( − i ) − µ , µ (cid:105) y i + (cid:104) ˆ µ ( − i ) − µ , z i (cid:105) . On the one hand, (cid:104) ˆ µ ( − i ) − µ , µ (cid:105) = ( n − − (cid:104) w i , µ (cid:105) ∼ N (0 , R / ( n − and Lemma H.3 implies that (cid:104) ˆ µ ( − i ) − µ , µ (cid:105) = O P ( R (cid:112) p/n ; p ) . On the other hand, (cid:104) ˆ µ ( − i ) − µ , z i (cid:105) / (cid:107) ˆ µ ( − i ) − µ (cid:107) ∼ N (0 , leadsto (cid:104) ˆ µ ( − i ) − µ , z i (cid:105) / (cid:107) ˆ µ ( − i ) − µ (cid:107) = O P ( √ p ; p ) . Since (cid:107) ˆ µ ( − i ) − µ (cid:107) = O P ( (cid:112) ( d ∨ p ) /n ; p ) , wehave (cid:104) ˆ µ ( − i ) − µ , z i (cid:105) = (cid:112) p/nO P ( √ p ∨ d ; p ) . As a result, (cid:104) ˆ µ ( − i ) − µ , x i (cid:105) = (cid:112) p/nO P ( R ∨ √ d ; p ) . Note that |(cid:104) µ , x i (cid:105)| ≤ |(cid:107) µ (cid:107) y i + (cid:104) µ , z i (cid:105)| ≤ R + |(cid:104) µ , z i (cid:105)| . From (cid:104) µ , z i (cid:105) ∼ N (0 , R ) weobtain that (cid:104) µ , z i (cid:105) = O P ( R √ p ; p ) . The fact √ p ≤ R leads to (cid:104) µ , x i (cid:105) = O P ( R ; p ) and (cid:104) ˆ µ ( − i ) , x i (cid:105) = (cid:104) µ , x i (cid:105) + (cid:104) ˆ µ ( − i ) − µ , x i (cid:105) = O P ( R + (cid:112) p/n ( R ∨ √ d ); p ) = O P ( R ; p ) , where we also applied (cid:112) pd/n = R (cid:112) d/n/ (cid:112) R + d/n ≤ R . E.2 Proof of Theorem 3.2
We supress the subscripts of λ , ¯ λ , u and ¯ u . When SNR (cid:29) log n , the first part in Theorem3.1 implies that P [sgn( u ) = ± y ] → . From now on we assume that (cid:28) SNR (cid:46) log n andlet p = SNR . Repeating the derivation of (D.1) in the proof of Theorem 3.1 and using theexchangeability of { z i } ni =1 , we get lim sup n →∞ p − log E M (sgn( u ) , y ) ≤ lim sup ε → lim sup n →∞ p − log P (cid:18)(cid:12)(cid:12)(cid:12)(cid:12)(cid:28) z i , (cid:88) j (cid:54) = i x j ¯ u j (cid:29)(cid:12)(cid:12)(cid:12)(cid:12) ≥ (1 − ε ) √ n (cid:107) µ (cid:107) (cid:19) . (E.1)Since (cid:80) j (cid:54) = i x j ¯ u j = ( n −
1) ˆ µ − i / √ n , we get P (cid:18)(cid:12)(cid:12)(cid:12)(cid:12)(cid:28) z i , (cid:88) j (cid:54) = i x j ¯ u j (cid:29)(cid:12)(cid:12)(cid:12)(cid:12) ≥ (1 − ε ) √ n (cid:107) µ (cid:107) (cid:19) ≤ P (cid:18)(cid:12)(cid:12)(cid:12)(cid:12) (cid:104) z i , ˆ µ − i (cid:105)(cid:107) ˆ µ − i (cid:107) (cid:12)(cid:12)(cid:12)(cid:12) ≥ (1 − ε ) (cid:107) µ (cid:107) (cid:107) ˆ µ − i (cid:107) (cid:19) . (E.2)40et R = (cid:107) µ (cid:107) . Lemma E.1 yields (cid:12)(cid:12)(cid:12) (cid:107) ˆ µ ( − i ) (cid:107) − (cid:112) R + d/ ( n − (cid:12)(cid:12)(cid:12) = O P ( (cid:112) p/n ; p ) . Hencethere exist constants C , C and N such that P ( (cid:107) ˆ µ ( − i ) (cid:107) − (cid:112) R + d/ ( n − ≥ C (cid:112) p/n ) ≤ C e − p , ∀ n ≥ N. (E.3)On the one hand, (cid:112) R + d/ ( n −
1) = [1 + o (1)] (cid:112) R + d/n = [1 + o (1)] R / √ p . On the otherhand, R / √ p (cid:112) p/n = √ nR p = √ nR R / ( R + d/n ) = √ n ( R + d/n ) R ≥ √ n. As a result, (E.3) implies that for any constant δ > , there exists a constant N (cid:48) such that P ( (cid:107) ˆ µ ( − i ) (cid:107) ≥ (1 + δ ) R / √ p ) ≤ C e − p , ∀ n ≥ N (cid:48) . (E.4)By (E.2) and (E.4), P (cid:18)(cid:12)(cid:12)(cid:12)(cid:12)(cid:28) z i , (cid:88) j (cid:54) = i x j ¯ u j (cid:29)(cid:12)(cid:12)(cid:12)(cid:12) ≥ (1 − ε ) √ n (cid:107) µ (cid:107) (cid:19) ≤ P (cid:18)(cid:12)(cid:12)(cid:12)(cid:12) (cid:104) z i , ˆ µ − i (cid:105)(cid:107) ˆ µ − i (cid:107) (cid:12)(cid:12)(cid:12)(cid:12) ≥ (1 − ε ) (cid:107) µ (cid:107) (1 + δ ) R / √ p (cid:19) + C e − p = P (cid:18)(cid:12)(cid:12)(cid:12)(cid:12) (cid:104) z i , ˆ µ − i (cid:105)(cid:107) ˆ µ − i (cid:107) (cid:12)(cid:12)(cid:12)(cid:12) ≥ − ε δ √ p (cid:19) + C e − p , ∀ n ≥ N (cid:48) . (E.5)The independence between z i and ˆ µ − i yields (cid:104) z i , ˆ µ − i (cid:105) / (cid:107) ˆ µ − i (cid:107) ∼ N (0 , . Then we get lim sup n →∞ p − log E M [sgn( u )] ≤ − / . (E.6)by (E.2), (E.5), standard tail bounds for Gaussian random variable and the fact that ε , δ are arbitrary.When SNR > (2 + ε ) log n for some constant ε , (E.6) implies the existence of positiveconstants ε (cid:48) and N (cid:48)(cid:48) such that E M (sgn( u ) , y ) ≤ n − − ε (cid:48) , ∀ n ≥ N (cid:48)(cid:48) . Then we must have P [ M (sgn( u ) , y ) = 0] → as any misclassified sample contributes n − to M (sgn( u ) , y ) . E.3 Proof of Theorem 3.3
It is easily checked that Assumptions 2.4, 2.5 and 2.6 hold with Σ = I , κ = 1 , µ = 1 and γ (cid:16) SNR . Theorem 2.1 then yields the desired result.
F Proof of Section 4
Define I ( t, a, b, c ) = a − ( a/b ) t ] + b − ( b/a ) t ] − c ( t + t ) for ( t, a, b, c ) ∈ R × (0 , + ∞ ) . It is easily seen that both a ( a/b ) t + b ( b/a ) t and t + t areconvex and achieve their minima at − / . Then I ∗ ( a, b, c ) = I ( − / , a, b, c ) = sup t ∈ R I ( t, a, b, c ) . .1 Useful lemmas We present three useful lemmas. The first one finds an (cid:96) ∞ approximation of the aggregatedspectral estimator. The second one concerns large deviation probabilities. The third onerelates genie-aided estimators to fundamental limits of clustering. Lemma F.1.
Let ¯ u = y / √ n and w = log( a/b ) A ¯ u + 2 R nR + d G ¯ u . For ˆ u defined by (4.3) , there exist some ε n → and constant C > such that P (min c = ± (cid:107) c ˆ u − w (cid:107) ∞ < ε n n − / log n ) > − Cn − . Proof of Lemma F.1.
Define, as in (4.2), v = n ( α − β )2 log (cid:18) αβ (cid:19) u ( A ) + 2 nR nR + d u ( G ) . Then (cid:107) v − w (cid:107) ∞ ≤ log( a/b ) (cid:107) [ n ( α − β ) / u ( A ) − A ¯ u (cid:107) ∞ + 2 R nR + d (cid:107) ( nR ) u ( G ) − G ¯ u (cid:107) ∞ , (F.1) (cid:107) ˆ u − v (cid:107) ∞ ≤ (cid:12)(cid:12)(cid:12)(cid:12) λ ( A ) log (cid:18) λ ( A ) + λ ( A ) λ ( A ) − λ ( A ) (cid:19) − n ( α − β )2 log (cid:18) αβ (cid:19)(cid:12)(cid:12)(cid:12)(cid:12) (cid:107) u ( A ) (cid:107) ∞ + (cid:12)(cid:12)(cid:12)(cid:12) λ ( G ) nλ ( G ) + nd − nR nR + d (cid:12)(cid:12)(cid:12)(cid:12) (cid:107) u ( G ) (cid:107) ∞ . (F.2)For simplicity, suppose that (cid:104) u ( G ) , ¯ u (cid:105) ≥ and (cid:104) u ( A ) , ¯ u (cid:105) ≥ . By Lemma B.1 andTheorem 2.1, we have | λ ( G ) − nR | = o P (1; n ) , (cid:107) u ( G ) − G ¯ u / ( nR ) (cid:107) ∞ = o P ( n − / ; log n ) , (cid:107) u ( G ) (cid:107) ∞ = O P ( n − / ; log n ) . Hence there exists ε n → and a constant C such that P ( | λ ( G ) /nR − | < ε n , (cid:107) u ( G ) − G ¯ u / ( nR ) (cid:107) ∞ < ε n / √ n, (cid:107) u ( G ) (cid:107) ∞ < C / √ n ) > − n − . (F.3)By mimicking the proof of Corollary 3.1 in Abbe et al. (2017) and applying Lemma 6 therein,we get ε n → and a constant C such that P ( max {| λ ( A ) − n ( α + β ) / | , | λ ( A ) − n ( α − β ) / |} < ε n (cid:112) log n, (cid:107) u ( A ) − A ¯ u / [ n ( α − β ) / (cid:107) ∞ < ε n / √ n, (cid:107) u ( A ) (cid:107) ∞ < C / √ n ) > − n − . (F.4)Inequalities (F.1), (F.2), (F.3) and (F.4) yield some ε n → and constant C > such that P ( (cid:107) ˆ u − w (cid:107) ∞ < ε n n − / log n ) > − Cn − . emma F.2. Let Assumption 4.2 hold and define W ni = (cid:32) R nR + d (cid:88) j (cid:54) = i (cid:104) x i , x j (cid:105) y j + log( a/b ) (cid:88) j (cid:54) = i A ij y j (cid:33) y i . For any fixed i , lim n →∞ q − n log P ( W ni ≤ εq n ) = − sup t ∈ R { εt + I ( t, a, b, c ) } , ∀ ε < a − b a/b ) + 2 c. Proof of Lemma F.2.
We will invoke Lemma H.5 to prove Lemma F.2, starting from thecalculation of E e tW ni . Conditioned on y i , (cid:80) j (cid:54) = i (cid:104) x i , x j (cid:105) y j and (cid:80) j (cid:54) = i A ij y j are independent.Hence E ( e tW ni | y i ) = E (cid:20) exp (cid:18) t · R nR + d (cid:88) j (cid:54) = i (cid:104) x i , x j (cid:105) y j y i (cid:19)(cid:12)(cid:12)(cid:12)(cid:12) y i (cid:21) · E (cid:20) exp (cid:18) t log( a/b ) (cid:88) j (cid:54) = i A ij y j y i (cid:19)(cid:12)(cid:12)(cid:12)(cid:12) y i (cid:21) . We claim that for any fixed t ∈ R , there exists N > such that when n > N , log E (cid:20) exp (cid:18) t · R nR + d (cid:88) j (cid:54) = i (cid:104) x i , x j (cid:105) y j y i (cid:19)(cid:12)(cid:12)(cid:12)(cid:12) y i (cid:21) = log E (cid:20) exp (cid:18) t · R nR + d (cid:88) j (cid:54) = i (cid:104) x i , x j (cid:105) y j y i (cid:19)(cid:21) = 2 c ( t + t )[1 + o (1)] q n , (F.5) log E (cid:20) exp (cid:18) t log( a/b ) y i (cid:88) j (cid:54) = i A ij y j (cid:19)(cid:12)(cid:12)(cid:12)(cid:12) y i (cid:21) = log E (cid:20) exp (cid:18) t log( a/b ) y i (cid:88) j (cid:54) = i A ij y j (cid:19)(cid:21) = a [( a/b ) t −
1] + b [( b/a ) t − o (1)] q n . (F.6)If (F.5) and (F.6) hold, then E ( e tW ni | y i ) = E (cid:20) exp (cid:18) t · R nR + d (cid:88) j (cid:54) = i (cid:104) x i , x j (cid:105) y j y i (cid:19)(cid:21) · E exp (cid:18) t log( a/b ) y i (cid:88) j (cid:54) = i A ij y j (cid:19) does not depend on y i , and q − n log E e tW ni = q − n log E (cid:20) exp (cid:18) t · R nR + d (cid:88) j (cid:54) = i (cid:104) x i , x j (cid:105) y j y i (cid:19)(cid:21) + q − n log E (cid:20) exp (cid:18) t log( a/b ) y i (cid:88) j (cid:54) = i A ij y j (cid:19)(cid:21) = (cid:18) a a/b ) t −
1] + b b/a ) t −
1] + 2 c ( t + t ) (cid:19) [1 + o (1)]= − I ( t, a, b, c )[1 + o (1)] . Lemma H.5 implies that for ε < − ∂∂t I ( t, a, b, c ) | t =0 = a − b log( a/b ) + 2 c , lim n →∞ q − n log P ( W ni ≤ εq n ) = − sup t ∈ R { εt + I ( t, a, b, c ) } . x i = µ y i + z i we see that given y i , x i y i ∼ N ( µ , I d ) is independent of √ n − µ ( − i ) ∼ N ( √ n − µ , I d ) . Lemma H.4 asserts that log E ( e t (cid:104) x i , ˆ µ ( − i ) (cid:105) y i | y i ) = log E ( e ( t/ √ n − (cid:104) x i y i , √ n − µ ( − i ) (cid:105) | y i )= ( t √ n − ) − ( t √ n − ) ] ( (cid:107) µ (cid:107) + ( n − (cid:107) µ (cid:107) )+ t √ n − − ( t √ n − ) (cid:104) µ , √ n − µ (cid:105) − d (cid:20) − (cid:18) t √ n − (cid:19) (cid:21) = tR − t / ( n − (cid:18) nt n − (cid:19) − d (cid:18) − t n − (cid:19) , ∀ t ∈ ( −√ n − , √ n − . Since the right hand side does not depend on y i , log E e t (cid:104) x i , ˆ µ ( − i ) (cid:105) y i is also equal to it. Now wefix any t ∈ R and let s = 2 tp/R = 2 t/ [1 + d/ ( nR )] . Since | s | < | t | , we have | s | < √ n − for large n . In that case, we obtain from the equation above that log E (cid:20) exp (cid:18) t · (cid:104) ˆ µ ( − i ) , x i (cid:105) y i d/ ( nR ) (cid:19)(cid:21) = log E e s (cid:104) x i , ˆ µ ( − i ) (cid:105) y i = sR − s / ( n − (cid:18) ns n − (cid:19) − d (cid:18) − s n − (cid:19) . = [1 + o (1)] sR (1 + s/
2) + d · s n − o (1)] = (cid:20) tp (cid:18) tpR (cid:19) + d n · t p R (cid:21) [1 + o (1)]= 2 pt (cid:20) tpR (cid:18) dnR (cid:19)(cid:21) [1 + o (1)] = 2 pt (1 + t )[1 + o (1)] , where we used p = R / ( R + d/n ) . It then follows from the results above and the assumption p = cq n that log E (cid:20) exp (cid:18) t · (cid:104) ˆ µ ( − i ) , x i (cid:105) y i d/ ( nR ) (cid:19)(cid:21) = cq n p − log E (cid:20) exp (cid:18) t · (cid:104) ˆ µ ( − i ) , x i (cid:105) y i d/ ( nR ) (cid:19)(cid:21) = 2 c ( t + t )[1 + o (1)] q n , which leads to (F.5).On the other hand, E ( e tA ij y i y j | y i ) = 12 E ( e tA ij | y i y j = 1) + 12 E ( e − tA ij | y i y j = − ue t + (1 − u )] + 12 [ ve − t + (1 − v )] = 1 + u ( e t −
1) + v ( e − t − , ∀ t ∈ R . Conditioned on y i , { A ij y i y j } j (cid:54) = i are i.i.d. random variables. Hence E (cid:20) exp (cid:18) t log( a/b ) y i (cid:88) j (cid:54) = i A ij y j (cid:19)(cid:12)(cid:12)(cid:12)(cid:12) y i (cid:21) = (cid:18) u [( a/b ) t −
1] + v [( b/a ) t − (cid:19) n − . Again, the right hand side does not depend on y i . By substituting u = aq n /n and v = bq n /n , log E (cid:20) exp (cid:18) t log( a/b ) y i (cid:88) j (cid:54) = i A ij y j (cid:19)(cid:21) = ( n −
1) log (cid:18) u [( a/b ) t −
1] + v [( b/a ) t − (cid:19) ( n −
1) log (cid:18) aq n [( a/b ) t −
1] + bq n [( b/a ) t − n (cid:19) = a [( a/b ) t −
1] + b [( b/a ) t − · [1 + o (1)] q n . We get (F.6) and thus finish the proof.
Lemma F.3 (Fundamental limit via genie-aided approach) . Suppose that S is a Borel spaceand ( y , X ) is a random element in {± } n × S . Let F be a family of Borel mappings from S to {± } n . Define M ( u , v ) = min (cid:26) n n (cid:88) i =1 { u i (cid:54) = v i } , n n (cid:88) i =1 {− u i (cid:54) = v i } (cid:27) , ∀ u , v ∈ {± } n ,f ( ·| ˜ X , ˜ y − i ) = P ( y i = ·| X = ˜ X , y − i = ˜ y − i ) , ∀ i ∈ [ n ] , ˜ X ∈ S , ˜ y − i ∈ {± } n − . We have inf ˆ y ∈F E M ( ˆ y , y ) ≥ n − n − · n n (cid:88) i =1 P [ f ( y i | X , y − i ) < f ( − y i | X , y − i )] . Proof of Lemma F.3.
For u , v ∈ {± } m with some m ∈ Z + , define the sign s ( u , v ) = argmin c = ± (cid:107) c u − v (cid:107) with any tie-breaking rule. As a matter of fact, s ( u , v ) = sgn( (cid:104) u , v (cid:105) ) if (cid:104) u , v (cid:105) (cid:54) = 0 . When |(cid:104) u , v (cid:105)| > , we have s ( u − i , v − i ) = s ( u , v ) for all i . Hence for ˆ y ∈ F (we drop the dependenceof ˆ y on X ), E M ( ˆ y , y ) ≥ E (cid:32) n n (cid:88) i =1 { s ( ˆ y , y )ˆ y i (cid:54) = y i } {|(cid:104) ˆ y , y (cid:105)| > } (cid:33) = E (cid:32) n n (cid:88) i =1 { s ( ˆ y − i , y − i )ˆ y i (cid:54) = y i } {|(cid:104) ˆ y , y (cid:105)| > } (cid:33) = E (cid:32) n n (cid:88) i =1 { s ( ˆ y − i , y − i )ˆ y i (cid:54) = y i } (cid:33) − E (cid:32) n n (cid:88) i =1 { s ( ˆ y − i , y − i )ˆ y i (cid:54) = y i } {|(cid:104) ˆ y , y (cid:105)|≤ } (cid:33) ≥ n n (cid:88) i =1 P ( s ( ˆ y − i , y − i )ˆ y i (cid:54) = y i ) − P ( |(cid:104) ˆ y , y (cid:105)| ≤ . Define F ε = { ˆ y ∈ F : P ( |(cid:104) ˆ y , y (cid:105)| ≤ ≤ ε } for ε ∈ [0 , . If F ε (cid:54) = ∅ , then inf ˆ y ∈F ε E M ( ˆ y , y ) ≥ n n (cid:88) i =1 inf ˆ y ∈F P ( s ( ˆ y − i , y − i )ˆ y i (cid:54) = y i ) − ε. Define G be the family of Borel mappings from S × {± } n − → {± } . For any fixed ˆ y ∈ F ,the mapping ( X , y − i ) (cid:55)→ s ( ˆ y − i , y − i )ˆ y i belongs to G . Then inf ˆ y ∈F P ( s ( ˆ y − i , y − i )ˆ y i (cid:54) = y i ) ≥ inf ˆ (cid:96) ∈G P (cid:16) ˆ (cid:96) ( X , y − i ) (cid:54) = y i (cid:17) ≥ P [ f ( y i | X , y − i ) < f ( − y i | X , y − i )] , δ = n (cid:80) ni =1 P [ f ( y i | X , y − i ) < f ( − y i | X , y − i )] . We have inf ˆ y ∈F ε E M ( ˆ y , y ) ≥ δ − ε provided that F ε (cid:54) = ∅ .On the other hand, when |(cid:104) ˆ y , y (cid:105)| ≤ , we have M ( ˆ y , y ) = (4 n ) − min c = ± (cid:107) c ˆ y − y (cid:107) = (4 n ) − min c = ± {(cid:107) ˆ y (cid:107) − c (cid:104) ˆ y , y (cid:105) + (cid:107) y (cid:107) } ≥ n − n . Hence if
F \F ε (cid:54) = ∅ , inf ˆ y ∈F\F ε E M ( ˆ y , y ) ≥ n − n inf ˆ y ∈F\F ε P ( |(cid:104) ˆ y ( X ) , y (cid:105)| ≤ ≥ n − n · ε = ε (cid:18) − n (cid:19) . Based on the deduction above, we have the followings for all ε ∈ [0 , :1. If F ε (cid:54) = ∅ and F \F ε (cid:54) = ∅ , then inf ˆ y ∈F E M ( ˆ y , y ) ≥ min { δ − ε, ε (1 − n − ) / } ;2. If F ε = ∅ , then inf ˆ y ∈F E M ( ˆ y , y ) ≥ ε (1 − n − ) / .3. If F \F ε = ∅ , then inf ˆ y ∈F E M ( ˆ y , y ) ≥ δ − ε .As a result, inf ˆ y ∈F E M ( ˆ y , y ) ≥ sup ε ∈ [0 , min { δ − ε, ε (1 − n − ) / } = n − n − δ . F.2 Proof of Lemma 4.1
The proof directly follows the Lemmas F.4 and F.5, plus the conditional independence be-tween A and X as well as the Bayes formula. See Appendices F.3 and F.4 for proofs oflemmas. Lemma F.4.
Denote by p X ( ·| ˜ (cid:96) i , ˜ y − i ) the conditional density function of X given y i = ˜ (cid:96) i ∈{± } and y − i = ˜ y − i ∈ {± } n − . Under Assumption 4.2, (cid:12)(cid:12)(cid:12)(cid:12) y i log (cid:18) p X ( X | y i , y − i ) p X ( X | − y i , y − i ) (cid:19) − R nR + d (cid:88) j (cid:54) = i (cid:104) x i , x j (cid:105) y j (cid:12)(cid:12)(cid:12)(cid:12) = o P ( q n ; q n ) , ∀ i. Lemma F.5.
Denote by p A ( ·| ˜ y i , ˜ y − i ) the conditional probability mass function of A given y i = ˜ (cid:96) i and y − i = ˜ y − i . Under Assumption 4.2, (cid:12)(cid:12)(cid:12)(cid:12) y i log (cid:18) p A ( A | y i , y − i ) p A ( A | − y i , y − i ) (cid:19) − log (cid:18) ab (cid:19) (cid:88) j (cid:54) = i A ij y j (cid:12)(cid:12)(cid:12)(cid:12) = o P ( q n ; q n ) , ∀ i. F.3 Proof of Lemma F.4
Let p = p n = R / ( R + d/n ) . We have p n (cid:16) q n . First of all, from the data generating model,we have p X ( X | y ) ∝ E µ exp (cid:16) − n (cid:88) j =1 (cid:107) x j − y j µ (cid:107) (cid:17) ∝ E µ exp (cid:16)(cid:68) n (cid:88) j =1 x j y j , µ (cid:69)(cid:17) , ∝ hide quantities that do not depend on y . By defining I ( α ) = R d − (cid:90) S d − e R (cid:104) α , ˜ µ (cid:105) ρ (d ˜ µ ) , ∀ α ∈ R d , and using the uniform distribution of µ on the sphere with radius R , we get p X ( X | y i = s, y − i ) p X ( X | y i = − s, y − i ) = I (cid:0) ( n −
1) ˆ µ ( − i ) + x i s (cid:1) I (( n −
1) ˆ µ ( − i ) − x i s ) . (F.7)Let P ( t, s ) = (cid:82) π e t cos θ (sin θ ) s − d θ for t ≥ , s ≥ . Then, I ( α ) ∝ (cid:90) π e R (cid:107) α (cid:107) cos θ (sin θ ) d − d θ = P ( R (cid:107) α (cid:107) , d ) , where ∝ only hides some factor that does not depend on α . Hence by (F.7) and ˆ µ ( − i ) = n − (cid:80) j (cid:54) = i y j x j , log (cid:18) p X ( X | y i , y − i ) p X ( X | − y i , y − i ) (cid:19) = log P (cid:0) R (cid:107) ( n −
1) ˆ µ ( − i ) + x i y i (cid:107) , d (cid:1) − log P (cid:0) R (cid:107) ( n −
1) ˆ µ ( − i ) − x i y i (cid:107) , d (cid:1) . We will linearize the functional above, and invoke Lemma H.8 to control the approx-imation error. Take t = ( n − R (cid:112) R + d/ ( n − , t = R (cid:107) ( n −
1) ˆ µ ( − i ) + x i y i (cid:107) , t = R (cid:107) ( n −
1) ˆ µ ( − i ) − x i y i (cid:107) . We first claim that t = nR (cid:112) R + d/n [1 + o (1)] = [1 + o (1)] nR / √ p (cid:16) nR ( R ∨ (cid:112) d/n ) , (F.8) max { /t , d /t , | t − t | /t , | t − t | /t } = o P (1; p ) . (F.9)Equation (F.8) is obvious and it leads to /t = o (1) . From t (cid:38) R √ nd and the assumption R (cid:29) ∨ ( d/n ) / we get d t (cid:46) d ( R √ nd ) = (cid:18) d ( R nd ) (cid:19) / = (cid:18) dnR · n R (cid:19) / = o (1) . By the triangle’s inequality and (cid:107) x i (cid:107) = O P ( R ∨ √ d ; p ) in Lemma E.1, (cid:12)(cid:12) | t − t | − (cid:12)(cid:12) R (cid:107) ( n −
1) ˆ µ ( − i ) (cid:107) − t (cid:12)(cid:12)(cid:12)(cid:12) ≤ R (cid:107) x i y i (cid:107) ≤ R ( R ∨ √ d ) O P (1; p ) . By (cid:12)(cid:12)(cid:12) (cid:107) ˆ µ ( − i ) (cid:107) − (cid:112) R + d/ ( n − (cid:12)(cid:12)(cid:12) = O P ( (cid:112) p/n ; p ) in Lemma E.1, (cid:12)(cid:12) R (cid:107) ( n −
1) ˆ µ ( − i ) (cid:107) − t (cid:12)(cid:12) = O P ( R √ np ; p ) . Hence | t − t | /R = O P ( R ∨ √ d ∨ √ np ; p ) = O P ( √ nR ∨ √ d ; p ) as √ p ≤ R .Then t (cid:16) nR ( R ∨ (cid:112) d/n ) forces | t − t | /t = | t − t | /R | t | /R = O P ( √ nR ∨ √ d ; p ) nR ∨ √ nd = o P (1; p ) . | t − t | /t = o P (1; p ) .Now that (F.9) has been justified, Lemma H.8 and Fact A.5 assert that (cid:12)(cid:12)(cid:12)(cid:12) log p X ( X | y i , y − i ) − log p X ( X | − y i , y − i ) g ( t , d )( t − t ) − (cid:12)(cid:12)(cid:12)(cid:12) = (cid:12)(cid:12)(cid:12)(cid:12) log P ( t , d ) − log P ( t , d ) g ( t , d )( t − t ) − (cid:12)(cid:12)(cid:12)(cid:12) = o P (1; p ) , (F.10)where g ( t , d ) = (cid:112) [( d − /t ] + 4 − ( d − /t (cid:112) ( d − + 4 t − ( d − t . By (F.8), we have t = [1 + o (1)] nR / √ p and g ( t , d ) = √ p [1 + o (1)]2 nR [ (cid:112) ( d − + 4 n R /p − ( d − √ p [1 + o (1)]2 R · (cid:20)(cid:115)(cid:18) d − nR (cid:19) + 4 R p − d − nR (cid:21) . Since p = R / ( R + d/n ) and (cid:18) d − nR (cid:19) + 4 R p = (cid:18) d − nR (cid:19) + 4 (cid:18) dnR (cid:19) = (cid:18) d − nR + 2 (cid:19) + 8 nR , we have d − nR + 2 ≤ (cid:115)(cid:18) d − nR (cid:19) + 4 R p ≤ d − nR + 2 + (cid:114) nR and g ( t , d ) = [1 + o (1)] √ p/R .To further simplify (F.10), we first note that t − t = t − t t + t = 4 R ( n − (cid:104) ˆ µ ( − i ) , x i (cid:105) y i t + t = 4 R ( n − (cid:104) ˆ µ ( − i ) , x i (cid:105) y i [1 + o P (1; p )]2 t = 4 R ( n − (cid:104) ˆ µ ( − i ) , x i (cid:105) y i [1 + o P (1; p )]2 nR / √ p = (cid:18) √ pnR (cid:88) j (cid:54) = i (cid:104) x i , x j (cid:105) y j (cid:19) y i [1 + o P (1; p )] , where we used t = [1 + o (1)] nR / √ p in (F.8). Then g ( t , d )( t − t ) = (cid:18) pnR (cid:88) j (cid:54) = i (cid:104) x i , x j (cid:105) y j (cid:19) y i [1 + o P (1; p )] . By (cid:104) ˆ µ ( − i ) , x i (cid:105) = O P ( R ; p ) in Lemma E.1, g ( t , d )( t − t ) = O P ( p ; p ) . The proof is completedby plugging these estimates into (F.10). 48 .4 Proof of Lemma F.5 Define T i = { j ∈ [ n ] \{ i } : y i y j = 1 } and S i = { j ∈ [ n ] \{ i } : A ij = 1 } for i ∈ [ n ] . Bydefinition, p A ( A | y i , y − i ) ∝ α | T i ∩ S i | (1 − α ) | T i \ S i | β | S i \ T i | (1 − β ) [ n ] ∩{ i } c ∩ T ci ∩ S ci ,p A ( A | − y i , y − i ) ∝ α | S i \ T i | (1 − α ) [ n ] ∩{ i } c ∩ T ci ∩ S ci β | T i ∩ S i | (1 − β ) | T i \ S i | , where both ∝ ’s hide the same factor that does not involve { A ij } nj =1 or y i . Hence log (cid:18) p A ( A | y i , y − i ) p A ( A | − y i , y − i ) (cid:19) = ( | T i ∩ S i | − | S i \ T i | ) log( α/β )+ ( | T i \ S i | − | [ n ] ∩ { i } c ∩ T ci ∩ S ci | ) log (cid:18) − β − α (cid:19) . (F.11)The facts | T i |−| S i | ≤ | T i \ S i | ≤ | T i | and n − −| T i |−| S i | ≤ | [ n ] ∩{ i } c ∩ T ci ∩ S ci | ≤ n − −| T i | yield || T i \ S i | − | [ n ] ∩ { i } c ∩ T ci ∩ S ci || ≤ | | T i | − ( n − | + | S i | For any independent random variables { ξ i } ni =1 taking values in [ − , , Hoeffding’s inequality(Hoeffding, 1963) asserts P ( | (cid:80) ni ξ i − (cid:80) ni E ξ i | ≥ nt ) ≤ e − nt / , ∀ t ≥ . Hence | (cid:80) ni ξ i − (cid:80) ni E ξ i | = O P ( √ nq ; q ) . This elementary fact leads to | | T i | − ( n − | = O P ( √ nq ; q ) , | S i | ≤ | E S i | + | S i − E S i | ≤ O ( q ) + O P ( √ nq ; q ) = O P ( √ nq ; q ) . As a result, || T i \ S i | − | [ n ] ∩ { i } c ∩ T ci ∩ S ci || = O P ( √ nq ; q ) . 
This bound, combined with ≤ log (cid:18) − β − α (cid:19) = log (cid:18) α − β − α (cid:19) ≤ α − β − α = ( a − b ) q/n − aq/n (cid:46) qn , (F.11) and log( α/β ) = log( a/b ) , implies that (cid:12)(cid:12)(cid:12)(cid:12) log (cid:18) p A ( A | y i , y − i ) p A ( A | − y i , y − i ) (cid:19) − ( | T i ∩ S i | − | S i \ T i | ) log( a/b ) (cid:12)(cid:12)(cid:12)(cid:12) = O P ( √ nq · q/n ; q ) = o P ( q ; q ) . The proof is completed by | T i ∩ S i | − | S i \ T i | = (cid:88) j ∈ T i A ij − (cid:88) j ∈ [ n ] ∩{ i } c ∩ T ci A ij = y i (cid:88) j (cid:54) = i A ij y j . F.5 Proof of Theorem 4.1
Lemma F.1 asserts the existence of some ε n → and constant C > such that P (min c = ± (cid:107) c ˆ u − w (cid:107) ∞ < ε n n − / log n ) > − Cn − . (F.12)49et ˆ c = argmin c = ± (cid:107) c ˆ u − w (cid:107) ∞ and v = ˆ c ˆ u . Hence P [ M (sgn( ˆ u ) , y ) = 0] ≥ P (sgn( ˆ v ) = y ) ≥ P (min i ∈ [ n ] w i y i > ε n n − / log n, (cid:107) v − w (cid:107) ∞ < ε n n − / ) ≥ P (min i ∈ [ n ] w i y i > ε n n − / log n ) − P ( (cid:107) v − w (cid:107) ∞ < ε n n − / ) ≥ − n (cid:88) i =1 P ( w i y i ≤ ε n n − / log n ) − Cn − = 1 − n P ( w i y i ≤ ε n n − / log n ) − Cn − . (F.13)where we used (F.12), union bounds and symmetry.Take any < ε < a − b log( a/b ) + 2 c . Lemma F.2 asserts that lim n →∞ P ( w i y i ≤ εn − / log n )log n = − sup t ∈ R { εt + I ( t, a, b, c ) } . For any δ > , there exists a large N such that when n ≥ N , ε n < ε and P ( w i y i ≤ ε n n − / log n ) ≤ n − sup t ∈ R { εt + I ( t,a,b,c ) } + δ . This and (F.13) lead to P [ M (sgn( ˆ u ) , y ) = 0] ≥ − n − sup t ∈ R { εt + I ( t,a,b,c ) } + δ − Cn − , ∀ n ≥ N. When I ∗ ( a, b, c ) = sup t ∈ R I ( t, a, b, c ) > , by choosing small ε and δ we get P [ M (sgn( ˆ u ) , y ) =0] → .The converse result for I ∗ ( a, b, c ) = sup t ∈ R I ( t, a, b, c ) < follows from the large deviationLemma F.2 and the proof of Theorem 1 in Abbe et al. (2016). F.6 Proof of Theorem 4.2
Lemma F.1 asserts the existence of some ε n → and constant C > such that P (min c = ± (cid:107) c ˆ u − w (cid:107) ∞ < ε n n − / log n ) > − Cn − . (F.14)Let ˆ c = argmin c = ± (cid:107) c ˆ u − w (cid:107) ∞ and v = ˆ c ˆ u .By definition, E M (sgn( ˆ u ) , y ) ≤ n (cid:80) ni =1 P ( v i y i < . By union bounds and (F.14), P ( v i y i < ≤ P ( v i y i < , (cid:107) ˆ u − w (cid:107) ∞ < ε n n − / log n ) + P ( (cid:107) v − w (cid:107) ∞ ≥ ε n n − / log n ) ≤ P ( w i y i < ε n n − / log n ) + Cn − . (F.15)Take any < ε < a − b log( a/b ) + 2 c . Lemma F.2 asserts that lim n →∞ P ( w i y i ≤ εn − / log n )log n = − sup t ∈ R { εt + I ( t, a, b, c ) } . δ > , there exists a large N such that when n ≥ N , ε n < ε and P ( w i y i < ε n n − / log n ) ≤ n − sup t ∈ R { εt + I ( t,a,b,c ) } + δ . (F.16)From (F.15) and (F.16) we obtain that E M (sgn( ˆ u ) , y ) ≤ n − sup t ∈ R { εt + I ( t,a,b,c ) } + δ + Cn − , ∀ n ≥ N. The proof is completed using I ∗ ( a, b, c ) = sup t ∈ R I ( t, a, b, c ) ≤ and letting ε , δ go to zero. F.7 Proof of Theorem 4.3
Define f ( ·| ˜ A , ˜ X , ˜ y − i ) = P ( y i = ·| A = ˜ A , X = ˜ X , y − i = ˜ y − i ) . By Lemma F.3 and symme-tries, for any estimator ˆ y we have E M ( ˆ y , y ) ≥ n − n − P [ f ( y | A , X , y − ) < f ( − y | A , X , y − )] . Denote by A the event on the right hand side. Let B ε = (cid:26)(cid:12)(cid:12)(cid:12)(cid:12) log (cid:18) f ( y | A , X , y − ) f ( − y | A , X , y − ) (cid:19) − (cid:18) log( a/b )( Ay ) + 2 R nR + d ( Gy ) (cid:19) y (cid:12)(cid:12)(cid:12)(cid:12) < εq n (cid:27) C ε = (cid:26)(cid:18) log( a/b )( Ay ) + 2 R nR + d ( Gy ) (cid:19) y ≤ − εq n (cid:27) By the triangle’s inequality, C ε ∩ B ε ⊆ A . Hence E M ( ˆ y , y ) (cid:38) P ( A ) ≥ P ( C ε ∩ B ε ) ≥ P ( C ε ) − P ( B cε ) . (F.17)Since a − b log( a/b ) + 2 c > , Lemma F.2 asserts that lim n →∞ q − n log P ( C ε ) = − sup t ∈ R {− εt + I ( t, a, b, c ) } . By Lemma 4.1 and the property of o P ( · ; · ) , lim n →∞ q − n log P ( B cε ) = −∞ . These limits and (F.17) lead to lim inf n →∞ q − n log E M ( ˆ y , y ) ≥ − sup t ∈ R {− εt + I ( t, a, b, c ) } . Taking ε → finishes the proof. 51 Proofs of Section 5
G.1 Proof of Lemma 5.1
Note that s = 0 , r = 1 , ¯∆ = ¯ λ = n (cid:107) µ (cid:107) and κ = 1 . Assumption B.1 holds if / √ n ≤ γ (cid:28) .Assumption 2.5 holds with Σ = 2 I d and in that case, Assumption 2.6 holds with γ ≥ (cid:26) (cid:107) µ (cid:107) , (cid:112) d/n (cid:107) µ (cid:107) (cid:27) . The right hand side goes to zero as d/n → ∞ and ( n/d ) / (cid:107) µ (cid:107) → ∞ . Hence we can take γ = 2 max (cid:26) √ n , (cid:107) µ (cid:107) , (cid:112) d/n (cid:107) µ (cid:107) (cid:27) to satisfy all the assumptions above. Then Lemma B.1 yields |(cid:104) u , ¯ u (cid:105)| P → .To study ˆ u , we first define ˜ G = E ( XX (cid:62) ) = d I n + d e e (cid:62) . Hence its leading eigenvectorand the associated eigengap are ˜ u = e and ˜∆ = d . Observe that G = H ( XX (cid:62) ) and (cid:107) XX (cid:62) − ˜ G (cid:107) ≤ (cid:107)H ( XX (cid:62) − ˜ G ) (cid:107) + max i ∈ [ n ] (cid:12)(cid:12)(cid:12) ( XX (cid:62) − ˜ G ) ii (cid:12)(cid:12)(cid:12) ≤ (cid:107)H ( XX (cid:62) ) − ¯ G (cid:107) + (cid:107) ¯ G − H ( ˜ G ) (cid:107) + max i ∈ [ n ] (cid:12)(cid:12) (cid:107) x i (cid:107) − E (cid:107) x i (cid:107) (cid:12)(cid:12) (G.1)By Lemma B.1, (cid:107)H ( XX (cid:62) ) − ¯ G (cid:107) = o P ( ¯∆; n ) = o P ( n (cid:107) µ (cid:107) ; n ) . (G.2)When i (cid:54) = j , ˜ G ij = E (cid:104) x i , x j (cid:105) = E (cid:104) ¯ x i + z i , ¯ x j + z j (cid:105) = E (cid:104) ¯ x i , ¯ x j (cid:105) = ¯ G ij . Hence H ( ¯ G ) = H ( ˜ G ) , and (cid:107) ¯ G − H ( ˜ G ) (cid:107) = max i ∈ [ n ] | ¯ G ii | = max i ∈ [ n ] (cid:107) ¯ x i (cid:107) = (cid:107) µ (cid:107) . (G.3)For the last term in (5.1), we have (cid:107) x i (cid:107) − E (cid:107) x i (cid:107) = (cid:107) ¯ x i + z i (cid:107) − ( (cid:107) ¯ x i (cid:107) + E (cid:107) z i (cid:107) ) = 2 (cid:104) ¯ x i , z i (cid:105) + ( (cid:107) z i (cid:107) − E (cid:107) z i (cid:107) ) . From (cid:107)(cid:104) ¯ x i , z i (cid:105)(cid:107) ψ (cid:46) (cid:107) ¯ x i (cid:107) = (cid:107) µ (cid:107) , Fact 2.1 and Lemma H.3 we obtain that max i ∈ [ n ] |(cid:104) ¯ x i , z i (cid:105)| (cid:46) (cid:107) ( (cid:104) ¯ x , z (cid:105) , · · · , (cid:104) ¯ x n , z n (cid:105) ) (cid:107) log n = O P ( (cid:112) log n (cid:107) µ (cid:107) ; log n ) (G.4)For any i ≥ , (cid:107) x i (cid:107) ∼ χ d . Lemma H.4 forces P ( |(cid:107) x i (cid:107) − d | ≥ √ dt + 2 t ) ≤ e − t , ∀ t ≥ , i ≥ .
52y the χ -concentration above and union bounds, max ≤ i ≤ n |(cid:107) x i (cid:107) − E (cid:107) x i (cid:107) | = O P ( √ dn ∨ n ; n ) = O P ( √ dn ; n ) . Since (cid:107) x (cid:107) / ∼ χ d , we get max i ∈ [ n ] |(cid:107) x i (cid:107) − E (cid:107) x i (cid:107) | = O P ( √ dn ; n ) .Plugging this and (G.2), (G.3), (G.4) into (G.1), we get (cid:107) XX (cid:62) − ˜ G (cid:107) = O P ( n (cid:107) µ (cid:107) + (cid:107) µ (cid:107) + (cid:112) log n (cid:107) µ (cid:107) + √ dn ; log n ) = O P ( n (cid:107) µ (cid:107) ; log n ) . Here we used (cid:107) µ (cid:107) (cid:29) ( d/n ) / (cid:29) . The Davis-Kahan Theorem (Davis and Kahan, 1970)then yields min c = ± (cid:107) s ˆ u − ˜ u (cid:107) (cid:46) (cid:107) XX (cid:62) − ˜ G (cid:107) / ˜∆ = O P ( n (cid:107) µ (cid:107) ; log n ) /d = o P (1; log n ) , since (cid:107) µ (cid:107) (cid:28) (cid:112) d/n . From ˜ u = e and (cid:104) ˜ u , ¯ u (cid:105) = 1 / √ n → we get |(cid:104) ˆ u , ¯ u (cid:105)| P → . G.2 Proof of Lemma 5.2
Lemma 5.2 directly follows from Lemma B.1 and thus we omit its proof.
H Technical lemmas
H.1 Lemmas for probabilistic analysis
Lemma H.1.
Under Assumption 2.5, we have (cid:107)H ( ZZ (cid:62) ) (cid:107) = O P (cid:0) max {√ n (cid:107) Σ (cid:107) HS , n (cid:107) Σ (cid:107) op } ; n (cid:1) , max i ∈ [ n ] (cid:107) z i (cid:107) = O P (max { Tr( Σ ) , n (cid:107) Σ (cid:107) op } ; n ) , (cid:107) ZZ (cid:62) (cid:107) = O P (max { Tr( Σ ) , n (cid:107) Σ (cid:107) op } ; n ) . Proof of Lemma H.1.
By definition, (cid:107)H ( ZZ (cid:62) ) (cid:107) = sup u ∈ S n − | u (cid:62) H ( ZZ (cid:62) ) u | = sup u ∈ S n − (cid:12)(cid:12)(cid:12)(cid:12) (cid:88) i (cid:54) = j u i u j (cid:104) z i , z j (cid:105) (cid:12)(cid:12)(cid:12)(cid:12) . Fix u ∈ S n − , let A = uu (cid:62) and S = (cid:80) i (cid:54) = j u i u j (cid:104) z i , z j (cid:105) . By Proposition 2.5 in Chen andYang (2018), there exists an absolute constant C > such that P ( S ≥ t ) ≤ exp (cid:18) − C min (cid:26) t (cid:107) Σ (cid:107) , t (cid:107) Σ (cid:107) op (cid:27)(cid:19) , ∀ t > . When t = λ max {√ n (cid:107) Σ (cid:107) HS , n (cid:107) Σ (cid:107) op } for some λ ≥ , we have min { t / (cid:107) Σ (cid:107) , t/ (cid:107) Σ (cid:107) op } ≥ λn and P ( S ≥ t ) ≤ e − Cλn . Similarly, we get P ( S ≤ − t ) ≤ e − Cλn and thus P (cid:18)(cid:12)(cid:12)(cid:12)(cid:12) (cid:88) i (cid:54) = j u i u j (cid:104) z i , z j (cid:105) (cid:12)(cid:12)(cid:12)(cid:12) ≥ λ max {√ n (cid:107) Σ (cid:107) HS , n (cid:107) Σ (cid:107) op } (cid:19) ≤ e − Cλn , ∀ λ ≥ . (cid:107)H ( ZZ (cid:62) ) (cid:107) then follows from a standard covering argument (Vershynin,2010, Section 5.2.2).Theorem 2.6 in Chen and Yang (2018) with n = 1 and A = 1 implies the existence ofconstants C and C such that for any t ≥ , P ( (cid:107) z i (cid:107) ≥ C Tr( Σ ) + t ) ≤ exp (cid:18) − C min (cid:26) t (cid:107) Σ (cid:107) , t (cid:107) Σ (cid:107) op (cid:27)(cid:19) . When t = λ max {√ n (cid:107) Σ (cid:107) HS , n (cid:107) Σ (cid:107) op } for some λ ≥ , we have min { t / (cid:107) Σ (cid:107) , t/ (cid:107) Σ (cid:107) op } ≥ λn . Hence P ( (cid:107) z i (cid:107) ≥ C Tr( Σ ) + λ max {√ n (cid:107) Σ (cid:107) HS , n (cid:107) Σ (cid:107) op } ) ≤ e − C λn , ∀ λ ≥ . Union bounds force max i ∈ [ n ] (cid:107) z i (cid:107) = O P (cid:0) max { Tr( Σ ) , √ n (cid:107) Σ (cid:107) HS , n (cid:107) Σ (cid:107) op } ; n (cid:1) . We can neglect the term √ n (cid:107) Σ (cid:107) HS above, since √ n (cid:107) Σ (cid:107) F = (cid:113) n (cid:107) Σ (cid:107) ≤ (cid:113) ( n (cid:107) Σ (cid:107) op ) Tr( Σ ) ≤ max { Tr( Σ ) , n (cid:107) Σ (cid:107) op } . Finally, the bound on (cid:107) ZZ (cid:62) (cid:107) follows from (cid:107) ZZ (cid:62) (cid:107) ≤ (cid:107)H ( ZZ (cid:62) ) (cid:107) + max i ∈ [ n ] (cid:107) z i (cid:107) . Lemma H.2.
Let Assumption 2.5 hold, p ≥ and { V ( m ) } nm =1 ⊆ R n × K be random matricessuch that V ( m ) is independent of z m . Then, (cid:18) n (cid:88) m =1 (cid:13)(cid:13)(cid:13)(cid:13) (cid:88) j (cid:54) = m (cid:104) z m , z j (cid:105) V ( m ) j (cid:13)(cid:13)(cid:13)(cid:13) p (cid:19) /p = n /p (cid:112) Kp max m ∈ [ n ] (cid:107) V ( m ) (cid:107) O P (cid:0) max {(cid:107) Σ (cid:107) HS , √ n (cid:107) Σ (cid:107) op } ; p ∧ n (cid:1) . Proof of Lemma H.2.
By Minkowski’s inequality, (cid:13)(cid:13)(cid:13)(cid:13) (cid:88) j (cid:54) = m (cid:104) z m , z j (cid:105) V ( m ) j (cid:13)(cid:13)(cid:13)(cid:13) p = (cid:18) K (cid:88) k =1 (cid:12)(cid:12)(cid:12)(cid:12) (cid:88) j (cid:54) = m (cid:104) z m , z j (cid:105) V ( m ) jk (cid:12)(cid:12)(cid:12)(cid:12) (cid:19) p/ ≤ (cid:20)(cid:18) K (cid:88) k =1 (cid:12)(cid:12)(cid:12)(cid:12) (cid:88) j (cid:54) = m (cid:104) z m , z j (cid:105) V ( m ) jk (cid:12)(cid:12)(cid:12)(cid:12) p (cid:19) /p K − /p (cid:21) p/ = K p/ − K (cid:88) k =1 (cid:12)(cid:12)(cid:12)(cid:12) (cid:88) j (cid:54) = m (cid:104) z m , z j (cid:105) V ( m ) jk (cid:12)(cid:12)(cid:12)(cid:12) p = K p/ − K (cid:88) k =1 |(cid:104) z m , w ( m ) k (cid:105)| p , where we define w ( m ) k = (cid:80) j (cid:54) = m V ( m ) jk z j = Z (cid:62) ( I − e m e (cid:62) m ) v ( m ) k , ∀ m ∈ [ n ] , k ∈ [ K ] . Observethat (cid:107) Σ / w ( m ) k (cid:107) = ( v ( m ) k ) (cid:62) ( I − e m e (cid:62) m ) Z Σ Z (cid:62) ( I − e m e (cid:62) m ) v ( m ) k ≤ (cid:107) v ( m ) k (cid:107) (cid:107) Z Σ Z (cid:62) (cid:107) ≤ (cid:107) V ( m ) (cid:107) (cid:107) Z Σ Z (cid:62) (cid:107) .
54s a result, (cid:13)(cid:13)(cid:13)(cid:13) (cid:88) j (cid:54) = m (cid:104) z m , z j (cid:105) V ( m ) j (cid:13)(cid:13)(cid:13)(cid:13) p ≤ K p/ − (cid:18) K (cid:88) k =1 |(cid:104) z m , w ( m ) k / (cid:107) Σ / w ( m ) k (cid:107)(cid:105)| p (cid:19)(cid:16) max m ∈ [ n ] (cid:107) V ( m ) (cid:107) · (cid:107) Z Σ Z (cid:62) (cid:107) / (cid:17) p . and (cid:18) n (cid:88) m =1 (cid:13)(cid:13)(cid:13)(cid:13) (cid:88) j (cid:54) = m (cid:104) z m , z j (cid:105) V ( m ) j (cid:13)(cid:13)(cid:13)(cid:13) p (cid:19) /p ≤ (cid:112) K (cid:107) Z Σ Z (cid:62) (cid:107) max m ∈ [ n ] (cid:107) V ( m ) (cid:107) · (cid:18) K − n (cid:88) m =1 K (cid:88) k =1 |(cid:104) z m , w ( m ) k / (cid:107) Σ / w ( m ) k (cid:107)(cid:105)| p (cid:19) /p . (H.1)On the one hand, let ˜ z i = Σ / z i , ∀ i ∈ [ n ] and ˜ Z = ( ˜ z , · · · , ˜ z n ) (cid:62) . Note that { ˜ z i } ni =1 satisfy Assumption 2.5 with Σ replaced by Σ , because E e (cid:104) u , ˜ z i (cid:105) = E e (cid:104) Σ / u , z i (cid:105) ≤ e α (cid:104) ΣΣ / u , Σ / u (cid:105) = e α (cid:104) Σ u , u (cid:105) , ∀ u ∈ H , i ∈ [ n ] . It is easily seen from Σ ∈ T ( H ) that Σ ∈ T ( H ) . Then Lemma H.1 asserts that (cid:107) Z Σ Z (cid:62) (cid:107) = (cid:107) ˜ Z ˜ Z (cid:62) (cid:107) = O P (cid:0) max { Tr( Σ ) , n (cid:107) Σ (cid:107) op } ; n (cid:1) = O P (cid:0) max {(cid:107) Σ (cid:107) , n (cid:107) Σ (cid:107) } ; n (cid:1) . (H.2)On the other hand, note that z m and w ( m ) k are independent. According to Assumption2.5 on sub-Gaussianity of z m , we have E (cid:16) (cid:104) z m , w ( m ) k / (cid:107) Σ / w ( m ) k (cid:107)(cid:105) (cid:12)(cid:12)(cid:12) w ( m ) k (cid:17) = 0 ,p − / E /p (cid:16) |(cid:104) z m , w ( m ) k / (cid:107) Σ / w ( m ) k (cid:107)(cid:105)| p (cid:12)(cid:12)(cid:12) w ( m ) k (cid:17) ≤ C for some absolute constant C . Then E |(cid:104) z m , w ( m ) k / (cid:107) Σ / w ( m ) k (cid:107)(cid:105)| p ≤ ( C √ p ) p . We have n (cid:88) m =1 K (cid:88) k =1 E |(cid:104) z m , w ( m ) k / (cid:107) Σ / w ( m ) k (cid:107)(cid:105)| p ≤ nK ( C √ p ) p = ( n /p K /p C √ p ) p . By Fact A.3, (cid:18) n (cid:88) m =1 K (cid:88) k =1 |(cid:104) z m , w ( m ) k / (cid:107) Σ / w ( m ) k (cid:107)(cid:105)| p (cid:19) /p = O P (cid:0) n /p K /p C √ p ; p (cid:1) . (H.3)The final result follows from (H.1), (H.2) and (H.3). Lemma H.3.
Let X ∈ R n × m be a random matrix with sub-Gaussian entries, and define M ∈ R n × m through M ij = (cid:107) X ij (cid:107) ψ . For any p ≥ q ≥ , we have (cid:107) X (cid:107) q,p = O P ( √ p (cid:107) M (cid:107) q,p ; p ) . roof of Lemma H.3. By Minkowski’s inequality, E (cid:107) X (cid:107) pq,p = n (cid:88) i =1 E (cid:18) n (cid:88) j =1 | X ij | q (cid:19) p/q ≤ n (cid:88) i =1 (cid:18) n (cid:88) j =1 E q/p ( | X ij | q ) p/q (cid:19) p/q = n (cid:88) i =1 (cid:18) n (cid:88) j =1 [ E /p | X ij | p ] q (cid:19) p/q . Since p − / E /p | X ij | p ≤ (cid:107) X ij (cid:107) ψ = M ij , we have E (cid:107) X (cid:107) pq,p ≤ n (cid:88) i =1 (cid:18) n (cid:88) j =1 ( √ pM ij ) q (cid:19) p/q = p p/ n (cid:88) i =1 (cid:18) n (cid:88) j =1 M qij (cid:19) p/q = ( √ p (cid:107) M (cid:107) q,p ) p . By Fact A.3, (cid:107) X (cid:107) q,p = O P ( √ p (cid:107) M (cid:107) q,p ; p ) . Lemma H.4.
For independent random vectors X ∼ N ( µ , I d ) and Y ∼ N ( ν , I d ) , we havethe followings:1. If µ = , then P ( |(cid:107) X (cid:107) − d | ≥ √ dt + 2 t ) ≤ e − t , ∀ t ≥ , log E e α (cid:107) X (cid:107) + (cid:104) β , X (cid:105) = − d − α ) + (cid:107) β (cid:107) − α ) ∀ α < , β ∈ R d ;
2. For any t ∈ ( − , , log E e t (cid:104) X , Y (cid:105) = t − t ) ( (cid:107) µ (cid:107) + (cid:107) ν (cid:107) ) + t − t (cid:104) µ , ν (cid:105) − d − t ) . Proof of Lemma H.4.
When µ = , (cid:107) X (cid:107) ∼ χ d . The concentration inequality in theclaim is standard, see Remark 2.11 in Boucheron et al. (2013). Note that p ( x ) = (2 π ) − d/ e −(cid:107) x (cid:107) / is the probability density function of X . With a new variable y = √ − α x , we have α (cid:107) x (cid:107) + (cid:104) β , x (cid:105) − (cid:107) x (cid:107) = − (cid:107) y (cid:107) (cid:104) β / √ − α, y (cid:105) = − (cid:107) y − β / √ − α (cid:107) + (cid:107) β (cid:107) − α ) and E e α (cid:107) X (cid:107) + (cid:104) β , X (cid:105) = (2 π ) − d/ (cid:90) R d exp (cid:18) α (cid:107) x (cid:107) + (cid:104) β , x (cid:105) − (cid:107) x (cid:107) (cid:19) d x = (2 π ) − d/ (cid:90) R d exp (cid:18) − (cid:107) y − β / √ − α (cid:107) + (cid:107) β (cid:107) − α ) (cid:19) (1 − α ) − d/ d y = (1 − α ) − d/ exp (cid:18) (cid:107) β (cid:107) − α ) (cid:19) . Now we come to the second part. Given Y , (cid:104) X , Y (cid:105) ∼ N ( (cid:104) µ , Y (cid:105) , (cid:107) Y (cid:107) ) . Hence E ( e t (cid:104) X , Y (cid:105) | Y ) = e (cid:104) µ , Y (cid:105) t + (cid:107) Y (cid:107) t / . Define Z = Y − ν . From (cid:104) µ , Y (cid:105) = (cid:104) µ , ν (cid:105) + (cid:104) µ , Z (cid:105) and (cid:107) Y (cid:107) = (cid:107) ν (cid:107) + 2 (cid:104) ν , Z (cid:105) + (cid:107) Z (cid:107) we obtain that log E e t (cid:104) X , Y (cid:105) = log E [ E ( e t (cid:104) X , Y (cid:105) | Y )] log E exp (cid:2) ( (cid:104) µ , ν (cid:105) + (cid:104) µ , Z (cid:105) ) t + ( (cid:107) ν (cid:107) + 2 (cid:104) ν , Z (cid:105) + (cid:107) Z (cid:107) ) t / (cid:3) = (cid:104) µ , ν (cid:105) t + (cid:107) ν (cid:107) t / E exp (cid:0) (cid:104) t µ + t ν , Z (cid:105) + (cid:107) Z (cid:107) t / (cid:1) = (cid:104) µ , ν (cid:105) t + (cid:107) ν (cid:107) t / − d (cid:18) − · t (cid:19) + (cid:107) t µ + t ν (cid:107) − · t / (cid:104) µ , ν (cid:105) t + (cid:107) ν (cid:107) t − d − t ) + t (cid:107) µ + t ν (cid:107) − t )= t − t ) ( (cid:107) µ (cid:107) + (cid:107) ν (cid:107) ) + t − t (cid:104) µ , ν (cid:105) − d − t ) . Lemma H.5.
Let { S n } ∞ n =1 be random variables such that Λ n ( t ) = log E e tS n exists for all t ∈ [ − R n , R n ] , where { R n } ∞ n =1 is a positive sequence tending to infinity. Suppose there is aconvex function Λ : R → R and a positive sequence { a n } ∞ n =1 tending to infinity such that lim n →∞ Λ n ( t ) /a n = Λ( t ) for all t ∈ R . We have lim n →∞ a − n log P ( S n ≤ ca n ) = − sup t ∈ R { ct − Λ( t ) } , ∀ c < Λ (cid:48) (0) . Proof of Lemma H.5.
This result follows directly from the Gärtner-Ellis theorem (Gärt-ner, 1977; Ellis, 1984) for large deviation principles.
H.2 Other lemmas
Lemma H.6.
Let x ∈ (0 , π/ , ε ∈ (0 , and δ = επ ( π − x ) . We have max | y |≤ δ | cos( x + y )cos x − | ≤ ε . Moreover, if x > δ , then max | y |≤ δ/ | sin x sin ( x + y ) − | ≤ .Proof of Lemma H.6. Recall the elementary identity cos( x + y ) = cos x cos y − sin x sin y . If | y | ≤ δ , then | sin y | ≤ | y | ≤ δ = επ ( π − x ) ≤ tan( π − x ) and (cid:12)(cid:12)(cid:12)(cid:12) cos( x + y )cos x − cos y (cid:12)(cid:12)(cid:12)(cid:12) ≤ sin x | sin y | cos x = | sin y | tan( π − x ) ≤ επ ( π − x )tan( π − x ) ≤ ε π , ≤ − cos y ≤ y ≤ (2 δ ) ε (1 − x/π )] ≤ ε . The result on max | y |≤ δ | cos( x + y )cos x − | follows from the estimates above and ε π + ε = ε (1 /π + ε ) ≤ ε .The identity sin( x + y ) = sin x cos y + cos x sin y imply that if δ < x ≤ tan x and | y | ≤ δ/ , then (cid:12)(cid:12)(cid:12)(cid:12) sin( x + y )sin x − cos y (cid:12)(cid:12)(cid:12)(cid:12) ≤ cos x | sin y | sin x = | sin y | tan x ≤ δ/ x ≤ δ/ x ≤ , ≤ − cos y ≤ y ≤ ( δ/ ε (1 − x/π )] ≤ ε ≤ . Hence for | y | ≤ δ/ , we have | sin( x + y )sin x − | ≤ + = < . Direct calculation yields ≤ sin( x + y )sin x ≤ , ≤ sin x sin ( x + y ) ≤ and | sin x sin ( x + y ) − | ≤ .57 emma H.7. For t ≥ and s ≥ , define P ( t, s ) = (cid:82) π e t cos x (sin x ) s − d x and a = ( s − /t .There exists a constant c > and a continuous, non-decreasing function w : [0 , c ] (cid:55)→ [0 , with w (0) = 0 such that when max { /t, s /t } ≤ c , (cid:12)(cid:12)(cid:12)(cid:12) ∂∂t [log P ( t, s )]( √ a + 4 − a ) / − (cid:12)(cid:12)(cid:12)(cid:12) ≤ w (max { /t, s /t } ) . Proof of Lemma H.7.
It suffices to show that ∂∂t [log P ( t,s )]( √ a +4 − a ) / → as t → ∞ and t /s → ∞ .If s = 2 , then a = 0 , P ( t, s ) = (cid:82) π e t cos x d x and ∂∂t P ( t, s ) = (cid:82) π cos xe t cos x d x . A directapplication of Laplace’s method (Laplace, 1986) yields ∂∂t [log P ( t, s )] = [ ∂∂t P ( t, s )] /P ( t, s ) → as t → ∞ , proving the result. From now on we assume s > and thus a > . Underour general setting, the proof is quite involved and existing results in asymptotic analysis,including the generalization of Laplace’s method to two-parameter asymptotics (Fulks, 1951)cannot be directly applied.Define f ( x, a ) = e cos x sin a x for x ∈ [0 , π ] . Then P ( t, s ) = (cid:82) π f t ( x, a )d x and ∂∂t P ( t, s ) = (cid:82) π cos xf t ( x, a )d x . From log f ( x, a ) = cos x + a log sin x we get ∂∂x [log f ( x, a )] = − sin x + a cos x sin x and ∂ ∂x [log f ( x, a )] = − cos x − a sin x . (H.4)Let x ∗ be the solution to ∂∂x [log f ( x, a )] = 0 on (0 , π ) . We have x ∗ ∈ (0 , π/ , a = 1cos x ∗ − cos x ∗ , cos x ∗ = √ a + 4 − a and sin x ∗ = (cid:18) a ( √ a + 4 − a )2 (cid:19) / . (H.5)Moreover, f ( · , a ) is strictly increasing in [0 , x ∗ ) and strictly decreasing in ( x ∗ , π ] . Hence x ∗ is its unique maximizer in [0 , π ] .Fix any ε ∈ (0 , / and let δ = επ ( π − x ∗ ) . Define I = [ x ∗ − δ, x ∗ + 2 δ ] ∩ [0 , π ] , J = [ x ∗ , x ∗ + δ/ and r ( a ) = inf y ∈ J f ( y, a ) / sup y ∈ [0 ,π ] \ I f ( y, a ) . Then J ⊆ I ⊆ [0 , π/ and | J | = δ/ . We have (cid:12)(cid:12)(cid:12)(cid:12) P ( t, s ) (cid:82) I f t ( x, a )d x − (cid:12)(cid:12)(cid:12)(cid:12) = (cid:82) [0 ,π ] \ I f t ( x, a )d x (cid:82) I f t ( x, a )d x ≤ (cid:82) [0 ,π ] \ I f t ( x, a )d x (cid:82) J f t ( x, a )d x ≤ π [sup y ∈ [0 ,π ] \ I f ( y, a )] t ( δ/ y ∈ J f ( y, a )] t = 6 πδr t ( a ) and (cid:12)(cid:12)(cid:12)(cid:12) ∂∂t P ( t, s ) (cid:82) I cos xf t ( x, a )d x − (cid:12)(cid:12)(cid:12)(cid:12) ≤ (cid:82) [0 ,π ] \ I | cos x | f t ( x, a )d x (cid:82) I cos xf t ( x, a )d x ≤ (cid:82) [0 ,π ] \ I f t ( x, a )d x (cid:82) J cos xf t ( x, a )d x ≤ π [sup y ∈ [0 ,π ] \ I f ( y, α )] t cos( x ∗ + δ )( δ/ y ∈ J f ( y, α )] t = 6 π cos( x ∗ + δ ) δr t ( a ) ≤ π δ r t ( a ) , x ∗ + 2 δ < π/ and cos( x ∗ + δ ) ≥ cos( π/ − δ ) = sin δ ≥ δ/π . Consequently, max (cid:26)(cid:12)(cid:12)(cid:12)(cid:12) P ( t, s ) (cid:82) I f t ( x, a )d x − (cid:12)(cid:12)(cid:12)(cid:12) , (cid:12)(cid:12)(cid:12)(cid:12) ∂∂t P ( t, s ) (cid:82) I cos xf t ( x, a )d x − (cid:12)(cid:12)(cid:12)(cid:12)(cid:27) ≤ π δ r t ( a ) . Let h ( a, t ) denote the right hand side. If h ( a, t ) < , the estimate above yields − h ( a, t )1 + h ( a, t ) ≤ [ ∂∂t P ( t, s )] /P ( t, s ) (cid:82) I cos xf t ( x, a )d x/ (cid:82) I f t ( x, a )d x ≤ h ( a, t )1 − h ( a, t ) . According to Lemma H.6, | cos x/ cos x ∗ − | ≤ ε holds for all x ∈ I . Hence (1 − ε ) 1 − h ( a, t )1 + h ( a, t ) ≤ ∂∂t [log P ( t, s )]cos x ∗ ≤ (1 + ε ) 1 + h ( a, t )1 − h ( a, t ) . Note that our assumptions t → ∞ and t /s → ∞ imply that t/ ( a ∨ → ∞ . Belowwe will prove h ( a, t ) → as t/ ( a ∨ → ∞ for any fixed ε ∈ (0 , / . If that holds,then we get the desired result by letting ε → .The analysis of h ( a, t ) hinges on that of r ( a ) = inf y ∈ J f ( y, a ) / sup y ∈ [0 ,π ] \ I f ( y, a ) . 
Themonotonicity of f ( · , a ) in [0 , x ∗ ) and ( x ∗ , π ] yields inf y ∈ J f ( y, a ) = f ( x ∗ + δ/ , a ) , sup y ∈ [0 ,π ] \ I f ( y, a ) = max { f ( x ∗ − δ, a ) , f ( x ∗ + 2 δ, a ) }≤ max { f ( x ∗ − δ/ , a ) , f ( x ∗ + δ/ , a ) } , if x ∗ > δ, sup y ∈ [0 ,π ] \ I f ( y, a ) = f ( x ∗ + 2 δ, a ) , if x ∗ ≤ δ. The two cases x ∗ > δ and x ∗ ≤ δ require different treatments. If we define g ( x ) =1 / cos x − cos x for x ∈ (0 , π/ , then a = g ( x ∗ ) and δ = επ ( π − x ∗ ) yield the following simplefact. Fact H.1. If x ∗ > δ , then x ∗ > ε ε/π , a > g ( ε ε/π ) and δ < πε π +4 ε ; if x ∗ ≤ δ , then x ∗ ≤ ε ε/π , a ≤ g ( ε ε/π ) and δ ≥ πε π +4 ε . We first consider the case where x ∗ > δ , which is equivalent to a > g ( ε ε/π ) . Let I (cid:48) = [ x ∗ − δ/ , x ∗ + δ/ . For any y ∈ I (cid:48) , there exists ξ in the closed interval between x ∗ and y such that log f ( y, a ) = log f ( x ∗ , a ) + ∂∂x [log f ( x, a )] | x = x ∗ ( y − x ) + 12 ∂ ∂x [log f ( x, a )] | x = ξ ( y − x ) . By construction, ∂∂x [log f ( x, a )] | x = x ∗ = 0 . From equation (H.4) we get max y ∈ I (cid:48) (cid:12)(cid:12)(cid:12)(cid:12) ∂ ∂x [log f ( x, a )] | x = y∂ ∂x [log f ( x, a )] | x = x ∗ − (cid:12)(cid:12)(cid:12)(cid:12) ≤ max y ∈ I (cid:48) (cid:12)(cid:12)(cid:12)(cid:12) cos y cos x ∗ − (cid:12)(cid:12)(cid:12)(cid:12) + max y ∈ I (cid:48) (cid:12)(cid:12)(cid:12)(cid:12) sin x ∗ sin y − (cid:12)(cid:12)(cid:12)(cid:12) ≤ ε + 916 ≤
132 + 916 = 1932 , ε ≤ / . Therefore, inf y ∈ J log f ( y, a ) − log f ( x ∗ , a ) ∂ ∂x [log f ( x, a )] | x = x ∗ ≤ (cid:18) (cid:19)(cid:18) δ (cid:19) = 5132 · δ , sup y ∈ [0 ,π ] \ I (cid:48) log f ( y, a ) − log f ( x ∗ , a ) ∂ ∂x [log f ( x, a )] | x = x ∗ ≥ (cid:18) − (cid:19)(cid:18) δ (cid:19) = 1332 · δ · δ . Since ∂ ∂x [log f ( x, a )] | x = x ∗ = − cos x ∗ − a/ sin x ∗ < , log r ( a ) = inf y ∈ J log f ( y, a ) − sup y ∈ [0 ,π ] \ I log f ( y, a ) ≥ inf y ∈ J log f ( y, a ) − sup y ∈ [0 ,π ] \ I (cid:48) log f ( y, a ) ≥ ∂ ∂x [log f ( x, a )] | x = x ∗ (cid:18) · δ − · δ (cid:19) = (cos x ∗ + a/ sin x ∗ ) δ × × (cid:38) aδ / sin x ∗ . From this and h ( a, t ) = π δ r t ( a ) we get − log h ( a, t ) = − log(3 π ) + log δ + t log r ( a ) (cid:38) − δ + taδ / sin x ∗ . From (H.5) we see that lim a →∞ sin x ∗ = 1 , lim a →∞ a cos x ∗ = 1 and lim a →∞ a ( π − x ∗ ) = 1 .Since δ = επ ( π − x ∗ ) > , we have lim a →∞ aδ = επ . There exists C > determined by ε suchthat for any a > g ( ε ε/π ) , we have δ ≥ C /a and aδ / sin x ∗ ≥ C /a . As a result, for some C determined by ε , − log h ( a, t ) ≥ C ( − − log a + t/a ) ≥ C [ − − log( a ∨
1) + t/ ( a ∨ , ∀ a > g (cid:18) ε ε/π (cid:19) . (H.6) We move on to the case where x ∗ ≤ δ . Recall that for x ∈ ( x ∗ , x ∗ + 2 δ ) ⊆ ( x ∗ , π/ ,we have ∂∂x [log f ( x, a )] < and − ∂ ∂x [log( x, a )] = cos x + a sin x ≥ cos x ≥ cos( x ∗ + 2 δ ) ≥ cos(4 δ ) ≥ cos(1 / , where we used δ ≤ ε/ ≤ / . By Taylor expansion, there exists ξ ∈ [ x ∗ + δ/ , x ∗ + 2 δ ] such that log r ( a ) = inf y ∈ J log f ( y, a ) − sup y ∈ [0 ,π ] \ I log f ( y, a ) = log f ( x ∗ + δ/ , a ) − log f ( x ∗ + 2 δ, a )= − (cid:18) ∂∂x [log( x, a )] | x = x ∗ + δ/ (2 δ − δ/
6) + 12 ∂ ∂x [log( x, a )] | x = ξ (2 δ − δ/ (cid:19) >
12 inf x ∈ [ x ∗ ,x ∗ +2 δ ] (cid:18) − ∂ ∂x [log( x, a )] (cid:19) (2 δ − δ/ (cid:38) δ . Based on h ( a, t ) = π δ r t ( a ) and δ ≥ πε π +4 ε from Fact H.1, there exists some C > determinedby ε such that − log h ( a, t ) ≥ C ( − t ) ≥ C [ − t/ ( a ∨ holds when a ≤ g ( ε ε/π ) .60 his bound, (H.6) and log( a ∨ ≤ a ∨ imply that − log h ( a, t ) (cid:38) − − log( a ∨
1) + ta ∨ ≥ − − ( a ∨
1) + ta ∨ − a ∨ (cid:18) t ( a ∨ − (cid:19) . As t/ ( a ∨ → ∞ , we have − log h ( a, t ) → ∞ and h ( a, t ) → . Lemma H.8.
For t ≥ and s ≥ , define a = ( s − /t and g ( t, s ) = ( √ a + 4 − a ) / . There exist a constant c ∈ (0 , and a function w : [0 , c ] → [0 , such that when max { /t , d /t , | t − t | /t , | t − t | /t } ≤ c , (cid:12)(cid:12)(cid:12)(cid:12) log P ( t , s ) − log P ( t , s ) g ( t , s )( t − t ) − (cid:12)(cid:12)(cid:12)(cid:12) ≤ w (max { /t , s /t , | t − t | /t , | t − t | /t } ) . Proof of Lemma H.8.
Let h ( a ) = ( √ a + 4 − a ) / . Observe that ∂a∂t = − ( s − /t = − a/t and h (cid:48) ( a ) = ( a √ a +4 −
1) = − h ( a ) / √ a + 4 . By the chain rule, ∂∂t [log g ( t, s )] = dd a [log h ( a )] · ∂a∂t = h (cid:48) ( a ) h ( a ) · ∂a∂t = at √ a + 4 . Hence < ∂∂t [log g ( t, s )] ≤ /t . For any t ≥ t > there exists ξ ∈ [ t , t ] such that ≤ log (cid:18) g ( t , s ) g ( t , s ) (cid:19) = log g ( t , s ) − log g ( t , s ) = ∂∂t [log g ( t, s )] | t = ξ ( t − t ) ≤ t − t ξ ≤ t − t t . This leads to | g ( t , s ) /g ( t , s ) − | ≤ e | t − t | / ( t ∧ t ) − for any t , t > .Let c and w be those defined in the statement of Lemma H.7. Suppose that t > and s ≥ satisfies max { /t , s /t } < c/ . When t ≥ t / / , max { /t, s /t } ≤ { /t , s /t } < c . Lemma H.7 and the non-decreasing property of w force (cid:12)(cid:12)(cid:12)(cid:12) ∂∂t [log P ( t, s )] g ( t, s ) − (cid:12)(cid:12)(cid:12)(cid:12) ≤ w (max { / ( t / / ) , s / ( t / / ) } ) ≤ w (2 max { /t , s /t } ) , ∀ t ≥ t / / . When | t − t | ≤ t / , we have t ≥ . t ≥ t / / and | t − t | ≤ . t ≤ t / / . Then | t − t | / ( t ∧ t ) ≤ / and | g ( t, s ) /g ( t , s ) − | ≤ e | t − t | / ( t ∧ t ) − ≤ e / | t − t | t ∧ t ≤ √ e | t − t | t / / ≤ | t − t | t < . Hence when t ∈ [4 t / , t / , [1 − w (2 max { /t , s /t } )] (cid:18) − | t − t | t (cid:19) ≤ ∂∂t [log P ( t, s )] g ( t , s ) ≤ [1 + w (2 max { /t , s /t } )] (cid:18) | t − t | t (cid:19) .
61e can find a constant ˜ c ∈ (0 , and construct a new function ˜ w : [0 , ˜ c ] → [0 , suchthat for any distinct t , t ∈ [(1 − ˜ c ) t , (1 + ˜ c ) t ] , (cid:12)(cid:12)(cid:12)(cid:12) log P ( t , s ) − log P ( t , s ) g ( t , s )( t − t ) − (cid:12)(cid:12)(cid:12)(cid:12) ≤ ˜ w (max { /t , s /t , | t − t | /t , | t − t | /t } ) . The proof is completed by re-defining c and w as ˜ c and ˜ w , respectively. References
Abbe, E. (2017). Community detection and stochastic block models: recent developments.
The Journal of Machine Learning Research Abbe, E. , Bandeira, A. S. and
Hall, G. (2016). Exact recovery in the stochastic blockmodel.
IEEE Transactions on Information Theory Abbe, E. , Fan, J. , Wang, K. and
Zhong, Y. (2017). Entrywise eigenvector analysis ofrandom matrices with low expected rank. arXiv preprint arXiv:1709.09565 . Amini, A. A. and
Razaee, Z. S. (2019). Concentration of kernel matrices with applicationto kernel spectral clustering. arXiv preprint arXiv:1909.03347 . Anderson, T. W. (1963). Asymptotic theory for principal component analysis.
The Annalsof Mathematical Statistics Aronszajn, N. (1950). Theory of reproducing kernels.
Transactions of the Americanmathematical society Awasthi, P. , Bandeira, A. S. , Charikar, M. , Krishnaswamy, R. , Villar, S. and
Ward, R. (2015). Relax, no need to round: Integrality of clustering formulations. In
Proceedings of the 2015 Conference on Innovations in Theoretical Computer Science . Azizyan, M. , Singh, A. and
Wasserman, L. (2013). Minimax theory for high-dimensionalGaussian mixtures with sparse mean separation. In
Advances in Neural Information Pro-cessing Systems . Baik, J. , Arous, G. B. and
Péché, S. (2005). Phase transition of the largest eigenvaluefor nonnull complex sample covariance matrices.
The Annals of Probability Benaych-Georges, F. and
Nadakuditi, R. R. (2012). The singular values and vectorsof low rank perturbations of large rectangular random matrices.
Journal of MultivariateAnalysis
Binkiewicz, N. , Vogelstein, J. T. and
Rohe, K. (2017). Covariate-assisted spectralclustering.
Biometrika
Blanchard, G. , Bousquet, O. and
Zwald, L. (2007). Statistical properties of kernelprincipal component analysis.
Machine Learning oucheron, S. , Lugosi, G. and
Massart, P. (2013).
Concentration inequalities: Anonasymptotic theory of independence . Oxford university press.
Cai, C. , Li, G. , Chi, Y. , Poor, H. V. and
Chen, Y. (2019). Subspace estimation fromunbalanced and incomplete data matrices: (cid:96) , ∞ statistical guarantees. arXiv preprintarXiv:1910.04267 . Cai, T. T. and
Zhang, A. (2018). Rate-optimal perturbation bounds for singular subspaceswith applications to high-dimensional statistics.
The Annals of Statistics Candès, E. J. and
Recht, B. (2009). Exact matrix completion via convex optimization.
Foundations of Computational mathematics Cape, J. , Tang, M. and
Priebe, C. E. (2019). The two-to-infinity norm and singular sub-space geometry with applications to high-dimensional statistics.
The Annals of Statistics Chen, X. and
Yang, Y. (2018). Hanson-Wright inequality in Hilbert spaces with applica-tion to K -means clustering for non-Euclidean data. arXiv e-prints arXiv:1810.11180. Chen, X. and
Yang, Y. (2020). Cutoff for exact recovery of Gaussian mixture models. arXiv preprint arXiv:2001.01194 . Chen, Y. , Fan, J. , Ma, C. and
Wang, K. (2017). Spectral method and regularized MLEare both optimal for top- K ranking. arXiv preprint arXiv:1707.09971 . Chen, Y. , Fan, J. , Ma, C. and
Wang, K. (2019). Spectral method and regularized MLEare both optimal for top-K ranking.
Annals of statistics Cristianini, N. and
Shawe-Taylor, J. (2000).
An introduction to support vector ma-chines and other kernel-based learning methods . Cambridge university press.
Damle, A. and
Sun, Y. (2019). Uniform bounds for invariant subspace perturbations. arXiv preprint arXiv:1905.07865 . Davis, C. and
Kahan, W. M. (1970). The rotation of eigenvectors by a perturbation. III.
SIAM Journal on Numerical Analysis Dempster, A. P. , Laird, N. M. and
Rubin, D. B. (1977). Maximum likelihood fromincomplete data via the EM algorithm.
Journal of the Royal Statistical Society: Series B(Methodological) Deshpande, Y. , Sen, S. , Montanari, A. and
Mossel, E. (2018). Contextual stochasticblock models. In
Advances in Neural Information Processing Systems . El Karoui, N. (2018). On the impact of predictor geometry on the performance on high-dimensional ridge-regularized generalized robust regression estimators.
Probability Theoryand Related Fields ldridge, J. , Belkin, M. and
Wang, Y. (2017). Unperturbed: spectral analysis beyondDavis-Kahan. arXiv preprint arXiv:1706.06516 . Ellis, R. S. (1984). Large deviations for a general class of random vectors.
The Annals ofProbability Erdős, L. , Schlein, B. and
Yau, H.-T. (2009). Semicircle law on short scales and de-localization of eigenvectors for Wigner random matrices.
The Annals of Probability Fan, J. , Wang, W. and
Zhong, Y. (2019). An (cid:96) ∞ eigenvector perturbation bound and itsapplication to robust covariance estimation. Journal of Econometrics
Fei, Y. and
Chen, Y. (2018). Hidden integrality of SDP relaxations for sub-Gaussianmixture models. In
Conference On Learning Theory . Feige, U. and
Ofek, E. (2005). Spectral techniques applied to sparse random graphs.
Random Structures & Algorithms Fisher, R. A. (1936). The use of multiple measurements in taxonomic problems.
Annalsof eugenics Fulks, W. (1951). A generalization of Laplace’s method.
Proceedings of the AmericanMathematical Society Gao, C. and
Zhang, A. Y. (2019). Iterative algorithm for discrete structure recovery. arXiv preprint arXiv:1911.01018 . Gärtner, J. (1977). On large deviations from the invariant measure.
Theory of Probability& Its Applications Giraud, C. and
Verzelen, N. (2018). Partial recovery bounds for clustering with therelaxed k means. arXiv preprint arXiv:1807.07547 . Gross, D. (2011). Recovering low-rank matrices from few coefficients in any basis.
IEEETransactions on Information Theory Hoeffding, W. (1963). Probability inequalities for sums of bounded random variables.
Journal of the American statistical association Holland, P. W. , Laskey, K. B. and
Leinhardt, S. (1983). Stochastic blockmodels:First steps.
Social Networks Javanmard, A. and
Montanari, A. (2018). Debiasing the lasso: Optimal sample size forGaussian designs.
The Annals of Statistics Jin, J. and
Wang, W. (2016). Influential features PCA for high dimensional clustering.
The Annals of Statistics ohnstone, I. M. (2001). On the distribution of the largest eigenvalue in principal com-ponents analysis. Annals of statistics
Johnstone, I. M. and
Lu, A. Y. (2009). On consistency and sparsity for principal com-ponents analysis in high dimensions.
Journal of the American Statistical Association
Jung, S. and
Marron, J. S. (2009). PCA consistency in high dimension, low sample sizecontext.
The Annals of Statistics Koltchinskii, V. and
Giné, E. (2000). Random matrix approximation of spectra ofintegral operators.
Bernoulli Koltchinskii, V. and
Lounici, K. (2014). Concentration inequalities and moment boundsfor sample covariance operators. arXiv preprint arXiv:1405.2468 . Koltchinskii, V. and
Lounici, K. (2016). Asymptotics and concentration bounds forbilinear forms of spectral projectors of sample covariance. In
Annales de l’Institut HenriPoincaré, Probabilités et Statistiques , vol. 52. Institut Henri Poincaré.
Koltchinskii, V. and
Xia, D. (2016). Perturbation of linear forms of singular vectorsunder Gaussian noise. In
High Dimensional Probability VII . Springer, 397–423.
Kumar, A. and
Kannan, R. (2010). Clustering with spectral norm and the k-meansalgorithm. In .IEEE.
Laplace, P. S. (1986). Memoir on the probability of the causes of events.
Statistical Science Lei, L. (2019). Unified (cid:96) →∞ eigenspace perturbation theory for symmetric random matrices. arXiv preprint arXiv:1909.04798 . Lloyd, S. (1982). Least squares quantization in pcm.
IEEE transactions on informationtheory Löffler, M. , Zhang, A. Y. and
Zhou, H. H. (2019). Optimality of spectral clusteringfor Gaussian mixture model. arXiv preprint arXiv:1911.00538 . Lu, Y. and
Zhou, H. H. (2016). Statistical and computational guarantees of Lloyd’salgorithm and its variants. arXiv preprint arXiv:1612.02099 . Ma, Z. and
Ma, Z. (2017). Exploration of large networks with covariates via fast anduniversal latent space model fitting. arXiv preprint arXiv:1705.02372 . Mao, X. , Sarkar, P. and
Chakrabarti, D. (2017). Estimating mixed memberships withsharp eigenvector deviations. arXiv preprint arXiv:1709.00407 . Mele, A. , Hao, L. , Cape, J. and
Priebe, C. E. (2019). Spectral inference for largestochastic blockmodels with nodal covariates. arXiv preprint arXiv:1908.06438 .65 ixon, D. G. , Villar, S. and
Ward, R. (2017). Clustering subgaussian mixtures bysemidefinite programming.
Information and Inference: A Journal of the IMA Montanari, A. and
Sun, N. (2018). Spectral algorithms for tensor completion.
Commu-nications on Pure and Applied Mathematics Nadler, B. (2008). Finite sample approximation results for principal component analysis:A matrix perturbation approach.
The Annals of Statistics Ndaoud, M. (2018). Sharp optimal recovery in the two component Gaussian mixture model. arXiv preprint arXiv:1812.08078 . Neyman, J. and
Pearson, E. S. (1933). IX. On the problem of the most efficient tests ofstatistical hypotheses.
Philosophical Transactions of the Royal Society of London. SeriesA, Containing Papers of a Mathematical or Physical Character
Ng, A. Y. , Jordan, M. I. and
Weiss, Y. (2002). On spectral clustering: Analysis and analgorithm. In
Advances in Neural Information Processing Systems . Novembre, J. , Johnson, T. , Bryc, K. , Kutalik, Z. , Boyko, A. R. , Auton, A. , Indap, A. , King, K. S. , Bergmann, S. , Nelson, M. R. et al. (2008). Genes mirrorgeography within Europe.
Nature
O’Rourke, S. , Vu, V. and
Wang, K. (2018). Random perturbation of low rank matrices:Improving classical bounds.
Linear Algebra and its Applications
Paul, D. (2007). Asymptotics of sample eigenstructure for a large dimensional spikedcovariance model.
Statistica Sinica
Pearson, K. (1894). Contributions to the mathematical theory of evolution.
PhilosophicalTransactions of the Royal Society of London. A
Pearson, K. (1901). LIII. on lines and planes of closest fit to systems of points in space.
The London, Edinburgh, and Dublin Philosophical Magazine and Journal of Science Perry, A. , Wein, A. S. , Bandeira, A. S. and
Moitra, A. (2016). Optimality andsub-optimality of PCA for spiked random matrices and synchronization. arXiv preprintarXiv:1609.05573 . Ringnér, M. (2008). What is principal component analysis?
Nature biotechnology Royer, M. (2017). Adaptive clustering through semidefinite programming. In
Advances inNeural Information Processing Systems . Schmitt, B. A. (1992). Perturbation bounds for matrix square roots and Pythagoreansums.
Linear algebra and its applications chölkopf, B. , Smola, A. and
Müller, K.-R. (1997). Kernel principal componentanalysis. In
International conference on artificial neural networks . Springer.
Shi, J. and
Malik, J. (2000). Normalized cuts and image segmentation.
IEEE Transactionson pattern analysis and machine intelligence Srivastava, P. R. , Sarkar, P. and
Hanasusanto, G. A. (2019). A robust spec-tral clustering algorithm for sub-Gaussian mixture models with outliers. arXiv preprintarXiv:1912.07546 . Stewart, G. and
Sun, J. (1990).
Matrix Perturbation Theory . Computer Science andScientific Computing, ACADEMIC PressINC.URL https://books.google.com/books?id=bIYEogEACAAJ
Vempala, S. and
Wang, G. (2004). A spectral algorithm for learning mixture models.
Journal of Computer and System Sciences Vershynin, R. (2010). Introduction to the non-asymptotic analysis of random matrices. arXiv preprint arXiv:1011.3027 . Wang, K. (2019). Some compact notations for concentration inequalities and user-friendlyresults. arXiv preprint arXiv:1912.13463 . Wang, W. and
Fan, J. (2017). Asymptotics of empirical eigenstructure for high dimensionalspiked covariance.
Annals of statistics Wedin, P.-Å. (1972). Perturbation bounds in connection with singular value decomposition.
BIT Numerical Mathematics Weng, H. and
Feng, Y. (2016). Community detection with nodal information. arXivpreprint arXiv:1610.09735 . Yan, B. and
Sarkar, P. (2020). Covariate regularized community detection in sparsegraphs.
Journal of the American Statistical Association
Yeung, K. Y. and
Ruzzo, W. L. (2001). Principal component analysis for clustering geneexpression data.
Bioinformatics Zhang, A. , Cai, T. T. and
Wu, Y. (2018). Heteroskedastic pca: Algorithm, optimality,and applications. arXiv preprint arXiv:1810.08316 . Zhang, A. Y. and
Zhou, H. H. (2016). Minimax rates of community detection in stochasticblock models.
The Annals of Statistics Zhang, Y. , Levina, E. and
Zhu, J. (2016). Community detection in networks with nodefeatures.
Electronic Journal of Statistics Zhong, Y. and
Boumal, N. (2018). Near-optimal bounds for phase synchronization.
SIAMJournal on Optimization Zwald, L. and
Blanchard, G. (2006). On the convergence of eigenspaces in kernelprincipal component analysis. In