On the Non-Asymptotic Concentration of Heteroskedastic Wishart-type Matrices
T. Tony Cai, Rungang Han, and Anru R. Zhang

August 31, 2020

Abstract
This paper focuses on the non-asymptotic concentration of heteroskedastic Wishart-type matrices. Suppose $Z$ is a $p_1$-by-$p_2$ random matrix with $Z_{ij} \sim N(0, \sigma_{ij}^2)$ independently. We prove that
\[ \mathbb{E}\big\|ZZ^\top - \mathbb{E} ZZ^\top\big\| \le (1+\epsilon)\Big\{2\sigma_C\sigma_R + \sigma_C^2 + C\sigma_R\sigma_*\sqrt{\log(p_1 \wedge p_2)} + C\sigma_*^2\log(p_1 \wedge p_2)\Big\}, \]
where $\sigma_C^2 := \max_j \sum_{i=1}^{p_1}\sigma_{ij}^2$, $\sigma_R^2 := \max_i \sum_{j=1}^{p_2}\sigma_{ij}^2$, and $\sigma_*^2 := \max_{i,j}\sigma_{ij}^2$. A minimax lower bound is developed that matches this upper bound. We then derive concentration inequalities, moments, and tail bounds for the heteroskedastic Wishart-type matrix under more general distributions, such as sub-Gaussian and heavy-tailed distributions. Next, we consider the cases where $Z$ has homoskedastic columns or rows (i.e., $\sigma_{ij} \approx \sigma_i$ or $\sigma_{ij} \approx \sigma_j$) and derive rate-optimal Wishart-type concentration bounds. Finally, we apply the developed tools to identify the sharp signal-to-noise ratio threshold for consistent clustering in the heteroskedastic clustering problem.

Author affiliations: Department of Statistics, the Wharton School, University of Pennsylvania; Department of Statistics, University of Wisconsin-Madison; Department of Statistics, University of Wisconsin-Madison and Department of Biostatistics & Bioinformatics, Duke University.

1 Introduction

Random matrix theory is an important topic in its own right and has proven to be a powerful tool in a wide range of applications in statistics, high-energy physics, and number theory. Wigner matrices, symmetric matrices with mean-zero independent and identically distributed (i.i.d.) entries (subject to the symmetry constraint), have been a particular focus. Asymptotic and non-asymptotic properties of the spectrum of Wigner matrices have been widely studied in the literature. See, for example, [2, 28, 31] and the references therein.

Motivated by a range of applications, heteroskedastic Wigner-type matrices, i.e., random matrices with independent heteroskedastic entries, have attracted much recent attention. A central problem
of interest is the characterization of the dependence of the spectrum of a heteroskedastic Wigner-type matrix on the variances of its entries. To answer this question, Ajanki, Erdős, and Krüger [1] established the asymptotic behavior of the resolvent, a local law down to the smallest spectral resolution scale, and bulk universality for heteroskedastic Wigner-type matrices. Bandeira and van Handel [4] proved a non-asymptotic upper bound for the spectral norm. More specifically, let $Z = (Z_{ij})$ be a $p \times p$ heteroskedastic Wigner-type matrix with $\mathrm{Var}(Z_{ij}) := \sigma_{ij}^2$. Bandeira and van Handel [4] showed that $\mathbb{E}\|Z\| \lesssim \sigma + \sigma_*\sqrt{\log p}$, where $\sigma^2 = \max_i \sum_j \sigma_{ij}^2$ and $\sigma_* = \max_{ij}\sigma_{ij}$ are the column-sum-wise and entry-wise maximum (standard deviation) quantities, respectively. This bound was improved by van Handel [30] to $\mathbb{E}\|Z\| \lesssim \sigma + \max_{i,j\in[p]} \sigma^*_{ij}\log i$. Here, the matrix $\{\sigma^*_{ij}\}$ is obtained by permuting the rows and columns of the variance matrix $\{\sigma_{ij}\}$ such that $\max_j \sigma^*_{1j} \ge \max_j \sigma^*_{2j} \ge \cdots \ge \max_j \sigma^*_{pj}$. Later, Latała and van Handel [20] further improved it to a tight bound:
\[ \mathbb{E}\|Z\| \asymp \sigma + \max_{i,j\in[p]} \sigma^*_{ij}\sqrt{\log i}. \tag{1} \]

In addition to the Wigner-type matrix, the Wishart-type matrix $ZZ^\top - \mathbb{E} ZZ^\top$ also plays a crucial role in many high-dimensional statistical problems, including principal component analysis (PCA) and factor analysis [36], matrix denoising [25], and bipartite community detection [15]. Though there have been many results on the asymptotic and non-asymptotic properties of the homoskedastic Wishart-type matrix, where $Z$ has i.i.d. entries (see [8] for an introduction and the references therein), the properties of heteroskedastic Wishart-type matrices are much less understood.

Specifically, suppose $Z$ is a $p_1 \times p_2$ random matrix with independent and zero-mean entries. In this paper, we are interested in the Wishart-type concentration $\mathbb{E}\|ZZ^\top - \mathbb{E} ZZ^\top\|$.
Define $\sigma_C, \sigma_R, \sigma_*$ as the column-sum-wise, row-sum-wise, and entry-wise maximum (standard deviation) quantities:
\[ \sigma_C^2 = \max_j \sum_{i=1}^{p_1}\sigma_{ij}^2, \qquad \sigma_R^2 = \max_i \sum_{j=1}^{p_2}\sigma_{ij}^2, \qquad \sigma_*^2 = \max_{ij}\sigma_{ij}^2. \tag{2} \]
By the symmetrization scheme and the asymmetric Wigner-type concentration inequality in [4], it is not difficult to show that
\[ \mathbb{E}\big\|ZZ^\top - \mathbb{E} ZZ^\top\big\| \le \mathbb{E}\big\|ZZ^\top - Z'(Z')^\top\big\| \le 2\,\mathbb{E}\big\|ZZ^\top\big\| = 2\,\mathbb{E}\|Z\|^2 \lesssim \big(\sigma_C + \sigma_R + \sigma_*\sqrt{\log(p_1\wedge p_2)}\big)^2. \tag{3} \]
Since $ZZ^\top - \mathbb{E} ZZ^\top$ can be decomposed into a sum of independent random matrices,
\[ ZZ^\top - \mathbb{E} ZZ^\top = \sum_{j=1}^{p_2}\big(Z_{\cdot j}Z_{\cdot j}^\top - \mathbb{E} Z_{\cdot j}Z_{\cdot j}^\top\big), \]
one can apply the concentration inequality for sums of independent random matrices [29, Theorem 1] to show that
\[ \mathbb{E}\big\|ZZ^\top - \mathbb{E} ZZ^\top\big\| \lesssim \sigma_C\sigma_R\sqrt{\log p_1} + \sigma_C^2\log p_1. \tag{4} \]
However, as we will show later, these bounds are not tight.

In this paper, we establish non-asymptotic bounds for the Wishart-type concentration $\mathbb{E}\|ZZ^\top - \mathbb{E} ZZ^\top\|$. The main results include the following. We begin by focusing on the Gaussian case in Section 2.1 and prove that if all entries of $Z$ are independently Gaussian,
\[ \mathbb{E}\big\|ZZ^\top - \mathbb{E} ZZ^\top\big\| \le (1+\epsilon_1)\Big\{2\sigma_C\sigma_R + (1+\epsilon_2)\sigma_C^2 + C(\epsilon_1)\sigma_R\sigma_*\sqrt{\log(p_1\wedge p_2)} + C(\epsilon_1,\epsilon_2)\sigma_*^2\log(p_1\wedge p_2)\Big\} \tag{5} \]
for any $\epsilon_1, \epsilon_2 > 0$. Here, $C(\epsilon_1)$ and $C(\epsilon_1,\epsilon_2)$ are constants that depend only on $\epsilon_1, \epsilon_2$. We further justify that the constants in $2\sigma_C\sigma_R + \sigma_C^2$ are essential under the homoskedastic setting. The proof of (5) is based on a Wishart-type moment method developed in Section 2.2. In Section 2.3, we provide a lower bound to show that the upper bound (5) is minimax rate-optimal over a general class of heteroskedastic random matrices.

We then consider more general non-Gaussian settings, including sub-Gaussian, sub-exponential, heavy-tailed, and bounded distributions, in Section 3.1. In particular, we establish the following concentration bound when the entries have independent sub-Gaussian distributions:
\[ \mathbb{E}\big\|ZZ^\top - \mathbb{E} ZZ^\top\big\| \lesssim \big(\sigma_C + \sigma_R + \sigma_*\sqrt{\log(p_1\wedge p_2)}\big)^2 - \sigma_R^2. \tag{6} \]
Upper bounds for the moments and probability tails of $\|ZZ^\top - \mathbb{E} ZZ^\top\|$ are developed in Section 3.2.

In Sections 3.3 and 3.4, we consider two variance structures arising in statistical applications and develop tight Wishart-type concentration bounds. If the random matrix $Z$ has independent sub-Gaussian entries and homoskedastic rows, i.e., $\sigma_{ij} = \sigma_i$, we prove that
\[ \mathbb{E}\big\|ZZ^\top - \mathbb{E} ZZ^\top\big\| \asymp \sum_{i=1}^{p_1}\sigma_i^2 + \sqrt{p_2\sum_{i=1}^{p_1}\sigma_i^2\cdot \max_{i\in[p_1]}\sigma_i^2}. \]
If $Z$ has independent sub-Gaussian entries and homoskedastic column variances, i.e., $\sigma_{ij} \asymp \sigma_j$, we prove that
\[ \mathbb{E}\big\|ZZ^\top - \mathbb{E} ZZ^\top\big\| \asymp \sqrt{p_1\sum_{j=1}^{p_2}\sigma_j^4} + p_1\max_{j\in[p_2]}\sigma_j^2. \]
To illustrate the usefulness of the newly established tools, we apply them in Section 4 to solve a statistical problem in heteroskedastic clustering. Specifically, we obtain a sharp signal-to-noise ratio threshold that guarantees consistent clustering.
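As a quick numerical sanity check on these quantities (not part of the original analysis), the following sketch builds an arbitrary variance profile, computes $\sigma_C^2, \sigma_R^2, \sigma_*^2$, and compares a Monte Carlo estimate of $\mathbb{E}\|ZZ^\top - \mathbb{E} ZZ^\top\|$ with the rate in (5); the profile, dimensions, and the constant 3 are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
p1, p2 = 50, 80
# Assumed heteroskedastic variance profile sigma_ij^2 (illustrative only).
S2 = rng.uniform(0.1, 1.0, size=(p1, p2))

sigma_C2 = S2.sum(axis=0).max()  # sigma_C^2 = max_j sum_i sigma_ij^2
sigma_R2 = S2.sum(axis=1).max()  # sigma_R^2 = max_i sum_j sigma_ij^2
sigma_s2 = S2.max()              # sigma_*^2 = max_ij sigma_ij^2

# E[Z Z^T] is diagonal with entries sum_j sigma_ij^2.
EZZt = np.diag(S2.sum(axis=1))
norms = []
for _ in range(100):
    Z = rng.normal(size=(p1, p2)) * np.sqrt(S2)
    norms.append(np.linalg.norm(Z @ Z.T - EZZt, 2))
emp = float(np.mean(norms))

logp = np.log(min(p1, p2))
rate = (2 * np.sqrt(sigma_C2 * sigma_R2) + sigma_C2
        + np.sqrt(sigma_R2 * sigma_s2 * logp) + sigma_s2 * logp)
# The empirical expected norm should be of the same order as the rate.
assert 0 < emp < 3 * rate
```

The check only verifies orders of magnitude; the sharp constants are the subject of Theorem 1 and Proposition 1 below.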
1.1 Notation

We first introduce the notation to be used in the rest of the paper. Let $a\wedge b$ and $a\vee b$ be the minimum and maximum of real numbers $a$ and $b$, respectively. We use $[d]$ to denote the set $\{1,\dots,d\}$ for any positive integer $d$. For any vector $v$, let $\|v\|_q = (\sum_i |v_i|^q)^{1/q}$ be the vector $\ell_q$ norm; in particular, $\|v\|_\infty = \sup_i |v_i|$. For any sequences $\{a_n\}, \{b_n\}$, denote $a_n \lesssim b_n$ (or $b_n \gtrsim a_n$) if there exists a uniform constant $C>0$ such that $a_n \le C b_n$. If $a_n \lesssim b_n$ and $a_n \gtrsim b_n$ both hold, we say $a_n \asymp b_n$. For any $\alpha \ge 1$, the Orlicz $\psi_\alpha$ norm of a random variable $X$ is defined as $\|X\|_{\psi_\alpha} = \inf\{x \ge 0 : \mathbb{E}\exp((|X|/x)^\alpha) \le 2\}$. In the literature [31, 33], a random variable is often called sub-Gaussian, sub-exponential, or sub-Weibull with tail parameter $1/\alpha$ if $\|X\|_{\psi_2} \le C$, $\|X\|_{\psi_1} \le C$, or $\|X\|_{\psi_\alpha} \le C$, respectively. The matrix spectral norm is defined as $\|X\| = \sup_{u,v} \frac{u^\top X v}{\|u\|_2\|v\|_2}$. The capital letters $C, C_1, \tilde{C}$ and lowercase letters $c, c_1, c_2$ represent generic large and small constants, respectively, whose exact values may vary from place to place.

2 Wishart-type Concentration: the Gaussian Case

We begin by considering the Gaussian case, where the entries $Z_{ij} \sim N(0,\sigma_{ij}^2)$ independently. The following theorem provides an upper bound for the concentration and is one of the main results of the paper.

Theorem 1 (Wishart-type concentration for Gaussian random matrices). Suppose $Z$ is a $p_1$-by-$p_2$ random matrix and $Z_{ij}\sim N(0,\sigma_{ij}^2)$ independently. Then for any $\epsilon_1,\epsilon_2>0$,
\[ \mathbb{E}\big\|ZZ^\top-\mathbb{E} ZZ^\top\big\| \le (1+\epsilon_1)\Big\{2\sigma_C\sigma_R+(1+\epsilon_2)\sigma_C^2+C(\epsilon_1)\sigma_R\sigma_*\sqrt{\log(p_1\wedge p_2)}+C(\epsilon_1,\epsilon_2)\sigma_*^2\log(p_1\wedge p_2)\Big\}, \tag{7} \]
where $C(\epsilon_1)=10(1+\epsilon_1)\sqrt{\lceil 1/\log(1+\epsilon_1)\rceil}$ and $C(\epsilon_1,\epsilon_2)=(1+\epsilon_1)\lceil 1/\log(1+\epsilon_1)\rceil\big(\tfrac{25}{\epsilon_2}+24\big)$.

Remark 1 (Lower bound for the homoskedastic case). If $Z$ has independent and homoskedastic Gaussian entries, i.e., $Z_{ij}\overset{iid}{\sim} N(0,\sigma^2)$, then $\sigma_C=\sigma\sqrt{p_1}$, $\sigma_R=\sigma\sqrt{p_2}$, and Theorem 1 implies
\[ \mathbb{E}\big\|ZZ^\top-\mathbb{E} ZZ^\top\big\| \le (1+\epsilon)\big(\sigma_C^2+2\sigma_C\sigma_R\big)+C_\epsilon\,\sigma\sigma_R\sqrt{\log(p_1\wedge p_2)} \tag{8} \]
for any $\epsilon>0$ and some $C_\epsilon$ depending only on $\epsilon$. On the other hand, we have

Proposition 1. If $Z$ is a $p_1$-by-$p_2$ matrix with i.i.d. homoskedastic Gaussian entries $Z_{ij}\sim N(0,\sigma^2)$, then
\[ \liminf_{p_1,p_2\to\infty} \frac{\mathbb{E}\big\|ZZ^\top-\mathbb{E} ZZ^\top\big\|}{2\sigma_C\sigma_R+\sigma_C^2} \ge 1. \tag{9} \]
Proposition 1 and (8) together indicate that the constants in $(\sigma_C^2+2\sigma_C\sigma_R)$ in the upper bound of Theorem 1 are sharp in the homoskedastic case.
In Section 2.3, we establish a minimax lower bound to show that all four terms in the upper bound (7) are essential when $Z$ is a general heteroskedastic random matrix.

2.2 Proof of Theorem 1

The proof of Theorem 1 relies on a moment method and the following fact: for a $p$-by-$p$ symmetric matrix $A$ (in the context of Theorem 1, $A = ZZ^\top - \mathbb{E} ZZ^\top$) and an even number $q \asymp \log p$, we have $\|A\| \approx (\mathrm{tr}(A^q))^{1/q}$. We introduce two lemmas for the proof of Theorem 1. First, Lemma 1 builds a comparison between the $q$-th moment of the heteroskedastic Wishart-type matrix $ZZ^\top - \mathbb{E} ZZ^\top$ and that of a homoskedastic analogue $HH^\top - \mathbb{E} HH^\top$. The complete proof of Lemma 1 is postponed to Section 5.1.

Lemma 1 (Gaussian comparison). Suppose $Z\in\mathbb{R}^{p_1\times p_2}$ has independent Gaussian entries $Z_{ij}\sim N(0,\sigma_{ij}^2)$. Let $m_1=\lceil\sigma_C^2\rceil+2q-2$ and $m_2=\lceil\sigma_R^2\rceil+2q-2$. Suppose $H\in\mathbb{R}^{m_1\times m_2}$ has i.i.d. $N(0,1)$ entries. Then for any $q\ge 1$,
\[ \mathbb{E}\,\mathrm{tr}\big\{(ZZ^\top-\mathbb{E} ZZ^\top)^q\big\} \le \Big(\frac{p_1}{m_1}\wedge\frac{p_2}{m_2}\Big)\,\mathbb{E}\,\mathrm{tr}\big\{(HH^\top-\mathbb{E} HH^\top)^q\big\}. \tag{10} \]

Remark 2 (Proof sketch of Lemma 1). Previously, [4, Proposition 2.1] compared the moments of Wigner-type matrices (i.e., $Z$ is symmetric and thus $p_1=p_2=p$, $\sigma_C=\sigma_R=\sigma$) through the expansion $\mathbb{E}\,\mathrm{tr}(Z^{2q}) = \sum_{u_1,\dots,u_{2q}} \mathbb{E}(Z_{u_1u_2}Z_{u_2u_3}\cdots Z_{u_{2q}u_1})$ and by counting the cycles in a reduced unipartite graph:
\[ \mathbb{E}\,\mathrm{tr}(Z^{2q}) \le \frac{p}{\lceil\sigma^2\rceil+2q-2}\,\mathbb{E}\,\mathrm{tr}(H^{2q}). \tag{11} \]
Compared to the expansion for the Wigner-type random matrix, the expansion for the Wishart-type random matrix $\mathbb{E}\,\mathrm{tr}\{(ZZ^\top-\mathbb{E} ZZ^\top)^q\}$ is much more complicated:
\begin{align}
\mathbb{E}\,\mathrm{tr}\big\{(ZZ^\top-\mathbb{E} ZZ^\top)^q\big\} &= \sum_{\substack{u_{q+1}=u_1,\dots,u_q\in[p_1]\\ v_1,\dots,v_q\in[p_2]}} \mathbb{E}\prod_{k=1}^q\Big(Z_{u_k,v_k}Z_{u_{k+1},v_k}-\sigma_{u_k,v_k}^2\,\mathbb{1}\{u_k=u_{k+1}\}\Big) \\
&= \cdots = \sum_{c\in\mathcal{C}([p_1]\times[p_2],q)}\ \prod_{k=1}^q \sigma_{u_k,v_k}\sigma_{u_{k+1},v_k}\prod_{(i,j)\in[p_1]\times[p_2]}\mathbb{E}\,G_{ij}^{\alpha_{ij}(c)}\big(G_{ij}^2-1\big)^{\beta_{ij}(c)}, \tag{12}
\end{align}
where $\mathcal{C}([p_1]\times[p_2],q)$ is the set of all cycles of length $2q$ on the $p_1$-by-$p_2$ complete bipartite graph, $G_{ij}=Z_{ij}/\sigma_{ij}$ are i.i.d. standard normal random variables, and $\alpha_{ij}(c), \beta_{ij}(c)$ are graphical characteristics of the cycle $c$ to be defined later. By gathering the cycles with the same "shape" $s$, we can show
\[ \mathbb{E}\,\mathrm{tr}\big\{(ZZ^\top-\mathbb{E} ZZ^\top)^q\big\} \le \sum_s \prod_{\alpha,\beta\ge 0}\big\{\mathbb{E}\,G^\alpha(G^2-1)^\beta\big\}^{m_{\alpha,\beta}(s)}\cdot\Big\{p_1\,\sigma_C^{2(m_L(s)-1)}\sigma_R^{2m_R(s)}\Big\}\wedge\Big\{p_2\,\sigma_C^{2m_L(s)}\sigma_R^{2(m_R(s)-1)}\Big\}, \tag{13} \]
where $m_{\alpha,\beta}(s), m_L(s)$, and $m_R(s)$ are graphical properties of the cycles with shape $s$ to be defined later and $G\sim N(0,1)$. On the other hand,
\[ \mathbb{E}\,\mathrm{tr}\big\{(HH^\top-\mathbb{E} HH^\top)^q\big\} \ge \sum_s \prod_{\alpha,\beta\ge 0}\big\{\mathbb{E}\,G^\alpha(G^2-1)^\beta\big\}^{m_{\alpha,\beta}(s)}\cdot\Big\{m_1\,\sigma_C^{2(m_L(s)-1)}\sigma_R^{2m_R(s)}\Big\}\vee\Big\{m_2\,\sigma_C^{2m_L(s)}\sigma_R^{2(m_R(s)-1)}\Big\}. \tag{14} \]
Lemma 1 follows by combining (13) and (14).

Next, Lemma 2 gives an upper bound on the moments of the standard Wishart matrix. The complete proof is provided in Section 5.1.

Lemma 2.
Suppose $H\in\mathbb{R}^{m_1\times m_2}$ has i.i.d. standard Gaussian entries. Then for any integer $q\ge 1$,
\[ \big(\mathbb{E}\|HH^\top-\mathbb{E} HH^\top\|^q\big)^{1/q} \le 2\sqrt{m_1m_2}+m_1+4(\sqrt{m_1}+\sqrt{m_2})\sqrt{q}+2q, \]
\[ \big(\mathbb{E}\,\mathrm{tr}\{(HH^\top-\mathbb{E} HH^\top)^q\}\big)^{1/q} \le 2^{1/q}(m_1\wedge m_2)^{1/q}\cdot\big(2\sqrt{m_1m_2}+m_1+4(\sqrt{m_1}+\sqrt{m_2})\sqrt{q}+2q\big). \]

Remark 3 (Proof idea of Lemma 2). Let $\sigma_i(H)$ be the $i$-th singular value of $H$. The proof of Lemma 2 utilizes the fact
\[ \|HH^\top-\mathbb{E} HH^\top\| = \|HH^\top-m_2 I_{m_1}\| = \max\big\{\sigma_1^2(H)-m_2,\ m_2-\sigma_{m_1}^2(H)\big\} \]
and the concentration inequalities for the largest and smallest singular values of the Gaussian ensemble (e.g., [31]). See Section 5.1 for the complete proof.

Now we are in a position to finish the proof of Theorem 1.

Proof of Theorem 1. Without loss of generality, assume $\sigma_*=\max_{ij}\sigma_{ij}=1$. Let $m_1=\lceil\sigma_C^2\rceil+2q-2$ and $m_2=\lceil\sigma_R^2\rceil+2q-2$ with $q$ to be specified later, and let $H$ be an $m_1$-by-$m_2$ random matrix with i.i.d. standard Gaussian entries. Lemmas 1 and 2 imply
\begin{align}
\mathbb{E}\big\|ZZ^\top-\mathbb{E} ZZ^\top\big\| &\le \Big(\mathbb{E}\,\mathrm{tr}\big\{(ZZ^\top-\mathbb{E} ZZ^\top)^q\big\}\Big)^{1/q} \\
&\overset{\text{Lemma 1}}{\le} \Big\{\Big(\frac{p_1}{m_1}\wedge\frac{p_2}{m_2}\Big)\,\mathbb{E}\,\mathrm{tr}\big\{(HH^\top-\mathbb{E} HH^\top)^q\big\}\Big\}^{1/q} \\
&\overset{\text{Lemma 2}}{\le} 2^{1/q}\Big(\Big(\frac{p_1}{m_1}\wedge\frac{p_2}{m_2}\Big)(m_1\wedge m_2)\Big)^{1/q}\big(2\sqrt{m_1m_2}+m_1+4(\sqrt{m_1}+\sqrt{m_2})\sqrt q+2q\big) \\
&\le 2^{1/q}(p_1\wedge p_2)^{1/q}\big(2\sigma_C\sigma_R+\sigma_C^2+10\sigma_C\sqrt q+10\sigma_R\sqrt q+24q\big). \tag{15}
\end{align}
Let $q=K\lceil\log(p_1\wedge p_2)\rceil$ with $K=\lceil 1/\log(1+\epsilon_1)\rceil$. Then we have
\begin{align}
\mathbb{E}\big\|ZZ^\top-\mathbb{E} ZZ^\top\big\| &\le 2^{1/q}\big(e^{q/K}\big)^{1/q}\big(2\sigma_C\sigma_R+\sigma_C^2+10\sigma_C\sqrt q+10\sigma_R\sqrt q+24q\big) \\
&\le (2e)^{1/K}\Big(2\sigma_C\sigma_R+(1+\epsilon_2)\sigma_C^2+10\sqrt K\,\sigma_R\sqrt{\log(p_1\wedge p_2)}+\Big(\frac{25}{\epsilon_2}+24\Big)K\log(p_1\wedge p_2)\Big) \\
&\le (1+\epsilon_1)\Big\{2\sigma_C\sigma_R+(1+\epsilon_2)\sigma_C^2+C(\epsilon_1)\sigma_R\sqrt{\log(p_1\wedge p_2)}+C(\epsilon_1,\epsilon_2)\log(p_1\wedge p_2)\Big\}. \tag{16}
\end{align}
Here,
\[ C(\epsilon_1)=10(1+\epsilon_1)\sqrt{\lceil 1/\log(1+\epsilon_1)\rceil}, \qquad C(\epsilon_1,\epsilon_2)=(1+\epsilon_1)\lceil 1/\log(1+\epsilon_1)\rceil\Big(\frac{25}{\epsilon_2}+24\Big). \]

To show the tightness of the upper bound given above, we also develop the following minimax lower bound for the heteroskedastic Wishart-type concentration.
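The driving fact behind the moment method above, that for a symmetric matrix $A$ and even $q$ one has $\|A\|^q \le \mathrm{tr}(A^q) \le p\|A\|^q$, so $(\mathrm{tr}(A^q))^{1/q} \le p^{1/q}\|A\| \le e\|A\|$ once $q \ge \log p$, is easy to verify numerically. The sketch below uses an arbitrary Wigner-type test matrix (an illustrative assumption, not an object from the paper).

```python
import numpy as np

rng = np.random.default_rng(1)
p = 200
G = rng.normal(size=(p, p))
A = (G + G.T) / 2                    # symmetric Wigner-type test matrix

q = 2 * int(np.ceil(np.log(p)))      # even q of order log p
eigs = np.linalg.eigvalsh(A)
moment_proxy = np.sum(eigs ** q) ** (1.0 / q)   # (tr A^q)^(1/q) via the spectrum
op_norm = np.max(np.abs(eigs))

# ||A||^q <= tr(A^q) <= p * ||A||^q for even q, hence the sandwich below.
assert op_norm <= moment_proxy <= np.e * op_norm
```

This is exactly why $q \asymp \log(p_1\wedge p_2)$ is the right choice in (15)-(16): the $(p_1\wedge p_2)^{1/q}$ factor stays bounded by a constant.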
Theorem 2 (Lower bound for heteroskedastic Wishart-type concentration). Suppose $p_1,p_2\ge 2$ and write $p=p_1\wedge p_2$. Consider the following set of $p_1$-by-$p_2$ random matrices:
\[ \mathcal{F}_p(\sigma_*,\sigma_C,\sigma_R)=\Big\{Z\in\mathbb{R}^{p_1\times p_2}: Z_{ij}\overset{ind}{\sim}N(0,\sigma_{ij}^2),\ \max_{i,j}\sigma_{ij}\le\sigma_*,\ \max_i\textstyle\sum_{j=1}^{p_2}\sigma_{ij}^2\le\sigma_R^2,\ \max_j\textstyle\sum_{i=1}^{p_1}\sigma_{ij}^2\le\sigma_C^2\Big\}. \]
For any $(\sigma_*,\sigma_R,\sigma_C)$ tuple satisfying $\min\{\sigma_C,\sigma_R\}\ge\sigma_*\ge\max\{\sigma_C/\sqrt{p_1},\,\sigma_R/\sqrt{p_2}\}$, there exists a random Gaussian matrix $Z\in\mathcal{F}_p(\sigma_*,\sigma_R,\sigma_C)$ such that
\[ \mathbb{E}\big\|ZZ^\top-\mathbb{E} ZZ^\top\big\| \gtrsim \sigma_C^2+\sigma_C\sigma_R+\sigma_R\sigma_*\sqrt{\log p}+\sigma_*^2\log p. \tag{17} \]
The proof of Theorem 2 is given in Section 5.1.

Remark 4.
Theorems 1 and 2 together establish the minimax optimal rate of $\mathbb{E}\|ZZ^\top-\mathbb{E} ZZ^\top\|$ over the class $\mathcal{F}_p(\sigma_*,\sigma_C,\sigma_R)$. In other words, Theorem 2 shows that (7) yields the best upper bound for heteroskedastic Wishart-type concentration among all bounds characterized by $\sigma_C,\sigma_R,\sigma_*$. We shall point out that the upper bound of Theorem 1 may not be tight for some specific values of $\{\sigma_{ij}\}$. For example, in Sections 3.3 and 3.4, we develop sharper bounds via a more refined analysis when the Wishart matrix has near-homoskedastic rows or columns. Generally speaking, it remains an open problem to develop a heteroskedastic Wishart-type concentration inequality that is tight for all specific values of $\{\sigma_{ij}\}$. We leave this problem as future work.

We consider several extensions of Theorem 1 in this section.
In this section, we generalize the developed concentration inequalities to heteroskedastic Wishart matrices with more general entrywise distributions, such as sub-Gaussian, sub-exponential, heavy-tailed, and bounded distributions. We first introduce the following lemma as a sub-Gaussian analogue of Lemma 1.

Lemma 3 (Sub-Gaussian comparison). Suppose $Z\in\mathbb{R}^{p_1\times p_2}$ has independent mean-zero symmetric sub-Gaussian entries:
\[ \mathbb{E} Z_{ij}=0,\qquad \mathrm{Var}(Z_{ij})=\sigma_{ij}^2,\qquad \|Z_{ij}/\sigma_{ij}\|_{\psi_2}\le\kappa. \tag{18} \]
Suppose $H\in\mathbb{R}^{m_1\times m_2}$ has i.i.d. standard Gaussian entries. When $q\ge 1$, $m_1=\lceil\sigma_C^2\rceil+2q-2$, and $m_2=\lceil\sigma_R^2\rceil+2q-2$, we have
\[ \mathbb{E}\,\mathrm{tr}\big\{(ZZ^\top-\mathbb{E} ZZ^\top)^q\big\} \le (C\kappa)^q\Big(\frac{p_1}{m_1}\wedge\frac{p_2}{m_2}\Big)\,\mathbb{E}\,\mathrm{tr}\big\{(HH^\top-\mathbb{E} HH^\top)^q\big\}. \tag{19} \]
The proof of Lemma 3 is deferred to Section 5.2.

Remark 5 (Proof ideas of Lemma 3). Compared to the proof of Lemma 1, the proof of Lemma 3 requires a more delicate scheme to bound $\mathbb{E}\,G_{ij}^{\alpha_{ij}(c)}(G_{ij}^2-1)^{\beta_{ij}(c)}$ for the non-standard-Gaussian distributed $G_{ij}:=Z_{ij}/\sigma_{ij}$. To this end, we introduce Lemma 7 to bound $\mathbb{E}\,G_{ij}^{\alpha_{ij}(c)}(G_{ij}^2-1)^{\beta_{ij}(c)}$ by a Gaussian analogue:
\[ \mathbb{E}\,G_{ij}^{\alpha_{ij}(c)}\big(G_{ij}^2-1\big)^{\beta_{ij}(c)} \le (C\kappa)^q\,\mathbb{E}\,G^{\alpha_{ij}(c)}\big(G^2-1\big)^{\beta_{ij}(c)},\qquad G\sim N(0,1). \]
As a consequence of Lemma 3, we have the following Wishart-type concentration for sub-Gaussian random matrices.
Corollary 1 (Wishart-type concentration of sub-Gaussian random matrices). Suppose $Z\in\mathbb{R}^{p_1\times p_2}$ has independent mean-zero sub-Gaussian entries satisfying (18). Then
\[ \mathbb{E}\big\|ZZ^\top-\mathbb{E} ZZ^\top\big\| \lesssim \kappa^2\Big(\sigma_C\sigma_R+\sigma_C^2+\sigma_R\sigma_*\sqrt{\log(p_1\wedge p_2)}+\sigma_*^2\log(p_1\wedge p_2)\Big). \tag{20} \]

Proof of Corollary 1.
When all $Z_{ij}$'s are symmetrically distributed, Corollary 1 follows from the proof of Theorem 1 along with Lemmas 2 and 3. If the $Z_{ij}$'s are not all symmetric, let $Z'$ be an independent copy of $Z$; then each entry of $Z-Z'$ has an independent symmetric sub-Gaussian distribution. By Jensen's inequality, we have
\begin{align}
\mathbb{E}\big\|ZZ^\top-\mathbb{E} ZZ^\top\big\| &= \mathbb{E}\big\|ZZ^\top+\mathbb{E}'Z'(Z')^\top-Z(\mathbb{E}'Z')^\top-(\mathbb{E}'Z')Z^\top-\mathbb{E} ZZ^\top\big\| \\
&= \mathbb{E}\big\|\mathbb{E}\big\{ZZ^\top+Z'(Z')^\top-Z(Z')^\top-Z'Z^\top-\mathbb{E} ZZ^\top \,\big|\, Z\big\}\big\| \\
&\le \mathbb{E}\Big[\mathbb{E}\big\{\big\|ZZ^\top+Z'(Z')^\top-Z(Z')^\top-Z'Z^\top-\mathbb{E} ZZ^\top\big\| \,\big|\, Z\big\}\Big] \\
&= \mathbb{E}\Big[\mathbb{E}'\big\|(Z-Z')(Z-Z')^\top-\mathbb{E}(Z-Z')(Z-Z')^\top\big\|\Big] \\
&\lesssim \kappa^2\Big(\sigma_C\sigma_R+\sigma_C^2+\sigma_R\sigma_*\sqrt{\log(p_1\wedge p_2)}+\sigma_*^2\log(p_1\wedge p_2)\Big).
\end{align}

Next, we turn to the Wishart-type concentration for random matrices $Z$ with heavy-tailed entries.

Theorem 3 (Wishart-type concentration for heavy-tailed random matrices). Suppose $0<\alpha\le 2$, $Z\in\mathbb{R}^{p_1\times p_2}$ has independent entries, $\mathrm{Var}(Z_{ij})\le\sigma_{ij}^2$, and $\|Z_{ij}/\sigma_{ij}\|_{\psi_\alpha}\le\kappa$ for all $i,j$. Given $\sigma_C,\sigma_R$, and $\sigma_*$ defined in (2), we have
\[ \mathbb{E}\big\|ZZ^\top-\mathbb{E} ZZ^\top\big\| \lesssim \Big(\sigma_C+\sigma_R+\sigma_*(\log(p_1\wedge p_2))^{1/2}(\log(p_1\vee p_2))^{1/\alpha-1/2}\Big)^2-\sigma_R^2. \]

In a variety of applications, the observations and random perturbations are naturally bounded (e.g., the adjacency matrix in network analysis [24] and single-nucleotide polymorphism (SNP) data in genomics [27]). Thus, we provide a Wishart-type concentration result for entrywise uniformly bounded random matrices as follows.
Theorem 4 (Wishart-type concentration of bounded random matrices). Suppose $Z\in\mathbb{R}^{p_1\times p_2}$, $\mathbb{E} Z_{ij}=0$, $\mathrm{Var}(Z_{ij})=\sigma_{ij}^2$, and $|Z_{ij}|\le B$ almost surely. Then
\[ \mathbb{E}\big\|ZZ^\top-\mathbb{E} ZZ^\top\big\| \le (1+\epsilon_1)\Big\{2\sigma_C\sigma_R+(1+\epsilon_2)\sigma_C^2+C(\epsilon_1)B\sigma_R\sqrt{\log(p_1\wedge p_2)}+C(\epsilon_1,\epsilon_2)B^2\log(p_1\wedge p_2)\Big\}, \]
where $C(\epsilon_1)$ and $C(\epsilon_1,\epsilon_2)$ are defined as in Theorem 1. If we further have $\max_{i,j}\sigma_{ij}\le\sigma_*$ and $B(\log(p_1\wedge p_2)/p_2)^{1/2}\ll\sigma_*$ for some $\sigma_*$, then
\[ \mathbb{E}\big\|ZZ^\top-\mathbb{E} ZZ^\top\big\| \le (1+\epsilon)\,(2\sqrt{p_1p_2}+p_1)\,\sigma_*^2. \tag{21} \]
An immediate application of the previous theorem is the following Wishart-type concentration for independent Bernoulli random matrices.

Corollary 2 (Wishart-type concentration of Bernoulli random matrices). Suppose $A\in\mathbb{R}^{p_1\times p_2}$, $A_{ij}\overset{ind}{\sim}\mathrm{Bernoulli}(\theta_{ij})$ with $\theta_{ij}\le\theta^*$ and $\theta^*\ge C\log(p_1\wedge p_2)/p_2$, and let $\Theta=(\theta_{ij})$. Then
\[ \mathbb{E}\big\|(A-\Theta)(A-\Theta)^\top-\mathbb{E}(A-\Theta)(A-\Theta)^\top\big\| \lesssim (\sqrt{p_1p_2}+p_1)\,\theta^*. \tag{22} \]
To prove Theorems 3 and 4, we establish the corresponding comparison lemmas for random matrices with heavy-tailed/bounded distributions, which is technically more involved than the Gaussian/sub-Gaussian cases due to the essential differences among these distributions. The proofs of Theorems 3 and 4 are provided in Section 5.2.

Remark 6.
It is helpful to summarize the heteroskedastic Wishart-type concentration inequalities with Gaussian, sub-Gaussian, heavy-tailed, and bounded entries in a unified form:
\[ \mathbb{E}\big\|ZZ^\top-\mathbb{E} ZZ^\top\big\| \le C\Big\{(\sigma_C+\sigma_R+K)^2-\sigma_R^2\Big\}, \]
where $K=\sigma_*(\log(p_1\wedge p_2))^{1/2}$ and $C>0$ is a constant if the entries of $Z$ are sub-Gaussian; $K=\sigma_*(\log(p_1\wedge p_2))^{1/2}(\log(p_1\vee p_2))^{1/\alpha-1/2}$ and $C>0$ is a constant if the entries of $Z$ have bounded $\psi_\alpha$ norm; $K=CB\sqrt{\log(p_1\wedge p_2)}$ and $C=1+\epsilon$ if the entries of $Z$ are bounded; and $K=C\sigma_*(\log(p_1\wedge p_2))^{1/2}$ and $C=1+\epsilon$ if the entries of $Z$ are Gaussian.

3.2 Moments and tail bounds

We study the general $b$-th moment and the tail probability of the heteroskedastic Wishart-type matrix in the following theorem.

Theorem 5 (High-order moments and tail probability bounds). Suppose the conditions of Theorem 1 hold. For any $b>0$, we have
\[ \Big\{\mathbb{E}\big\|ZZ^\top-\mathbb{E} ZZ^\top\big\|^b\Big\}^{1/b} \lesssim \Big(\sigma_C+\sigma_R+\sigma_*\sqrt{b\vee\log(p_1\wedge p_2)}\Big)^2-\sigma_C^2. \tag{23} \]
There exists a uniform constant $C>0$ such that for any $x>0$,
\[ \mathbb{P}\Big\{\big\|ZZ^\top-\mathbb{E} ZZ^\top\big\| \ge C\Big(\big(\sigma_C+\sigma_R+\sigma_*\sqrt{\log(p_1\wedge p_2)+x}\big)^2-\sigma_C^2\Big)\Big\} \le \exp(-x). \tag{24} \]
Since neither $\|ZZ^\top-\mathbb{E} ZZ^\top\|$ nor $\|ZZ^\top-\mathbb{E} ZZ^\top\|^{1/2}$ is Lipschitz continuous in $Z$, the classic Talagrand concentration inequality [10, Theorem 6.10] does not directly apply to give the tail probability bound for $\|ZZ^\top-\mathbb{E} ZZ^\top\|$. We instead prove (24) via a more direct moment method. The complete proof is given in Section 5.3.

3.3 Wishart matrix with near-homoskedastic rows

In this section, we consider a special class of heteroskedastic matrices. Let $Z\in\mathbb{R}^{p_1\times p_2}$ be a random matrix with independent, sub-Gaussian, and zero-mean entries. Suppose all entries in the same row of $Z$ share similar variances (i.e., there exist $\sigma_i$ such that $\sigma_{ij}$ approximately equals $\sigma_i$ for all $i,j$). Then the $p_2$ columns of $Z$, i.e., $\{Z_{\cdot j}\}_{j=1}^{p_2}$, have approximately equal covariance matrix $\mathrm{diag}(\sigma_1^2,\dots,\sigma_{p_1}^2)$. In this case, with $n=p_2$, $\frac{1}{n}ZZ^\top=\frac{1}{n}\sum_{j=1}^n Z_{\cdot j}Z_{\cdot j}^\top$ is the sample covariance matrix. It is thus of great interest to analyze $\|ZZ^\top-\mathbb{E} ZZ^\top\|$, i.e., the concentration of the sample covariance matrix, in both probability and statistics [3, 12].

Note that Corollary 1 directly implies
\[ \mathbb{E}\big\|ZZ^\top-\mathbb{E} ZZ^\top\big\| \lesssim \sum_i\sigma_i^2+\sqrt{p_2\sum_i\sigma_i^2\cdot\max_i\sigma_i^2}+\sqrt{p_2\log(p_1\wedge p_2)}\,\max_i\sigma_i^2. \tag{25} \]
With a more careful analysis, we can derive a better concentration inequality than (25), without the logarithmic terms.

Theorem 6.
Suppose $Z$ is a $p_1$-by-$p_2$ random matrix with independent mean-zero sub-Gaussian entries. If there exist $\sigma_1,\dots,\sigma_{p_1}\ge 0$ such that $\|Z_{ij}/\sigma_i\|_{\psi_2}\le C_K$ for a constant $C_K>0$, then
\[ \mathbb{E}\big\|ZZ^\top-\mathbb{E} ZZ^\top\big\| \lesssim \sum_i\sigma_i^2+\sqrt{p_2\sum_i\sigma_i^2\cdot\max_i\sigma_i^2}. \tag{26} \]

Remark 7.
We also note that a result similar to Theorem 6 can be derived from Koltchinskii and Lounici [18]. Their result is based on a generic chaining argument under the assumption that all columns of $Z$ are i.i.d. Here, we assume independence and an upper bound on the Orlicz $\psi_2$ norm of each entry, while allowing the distributions to be non-identical.

The following theorem gives a lower bound on the concentration of Wishart matrices with homoskedastic rows.

Theorem 7. If $Z\in\mathbb{R}^{p_1\times p_2}$ and $Z_{ij}\overset{ind}{\sim}N(0,\sigma_i^2)$, we have
\[ \mathbb{E}\big\|ZZ^\top-\mathbb{E} ZZ^\top\big\| \gtrsim \sum_i\sigma_i^2+\sqrt{p_2\sum_i\sigma_i^2\cdot\max_i\sigma_i^2}. \]
The proof of Theorem 7 is deferred to Section 5.4. Theorems 6 and 7 render an exact rate of Wishart-type concentration for random matrices with homoskedastic rows:
\[ \mathbb{E}\big\|ZZ^\top-\mathbb{E} ZZ^\top\big\| \asymp \sum_i\sigma_i^2+\sqrt{p_2\sum_i\sigma_i^2\,\max_i\sigma_i^2}, \qquad\text{if } Z_{ij}\overset{ind}{\sim}N(0,\sigma_i^2). \]
The rest of this section is dedicated to the proof of Theorem 6. We only give the proof for Gaussian Wishart-type random matrices, since the sub-Gaussian case follows similarly. We first introduce a key tool to sequentially reduce the number of rows of $Z$. This tool, summarized in the following lemma, may be of independent interest.

Lemma 4 (Variance contraction inequality for Gaussian random matrices). Suppose $G\in\mathbb{R}^{p_1\times p_2}$ and $\tilde G\in\mathbb{R}^{(p_1-1)\times p_2}$ are two random matrices with independent Gaussian entries satisfying $\mathbb{E} G_{ij}=\mathbb{E}\tilde G_{ij}=0$, $\mathrm{Var}(G_{ij})=\sigma_{ij}^2$, and
\[ \mathrm{Var}(\tilde G_{ij})=\begin{cases}\sigma_{ij}^2, & 1\le i\le p_1-2,\\ \sigma_{p_1-1,j}^2+\sigma_{p_1,j}^2, & i=p_1-1.\end{cases} \]
In other words, $G$ and $\tilde G$ are identically distributed in their first $p_1-2$ rows, and the variance of the last row of $\tilde G$ is the sum of the variances of the last two rows of $G$. Then for any positive integer $q$,
\[ \mathbb{E}\,\mathrm{tr}\Big(\big(GG^\top-\mathbb{E} GG^\top\big)^q\Big) \le \mathbb{E}\,\mathrm{tr}\Big(\big(\tilde G\tilde G^\top-\mathbb{E}\tilde G\tilde G^\top\big)^q\Big). \]
The proof of Lemma 4 is provided in Section 5.4. Now we are ready to prove Theorem 6.
Proof of Theorem 6.
Denote $\sigma_C^2=\sum_i\sigma_i^2$ and $\sigma_*^2=\max_i\sigma_i^2$. Assume $\sigma_*=1$ without loss of generality. Set $q=2\lceil\sigma_C^2\rceil$. We use mathematical induction on $p_1$ to show the following upper bound: for some uniform constant $C>0$ (not depending on $p_1,p_2,\sigma_C$), we have
\[ \Big(\mathbb{E}\,\mathrm{tr}\big\{(ZZ^\top-\mathbb{E} ZZ^\top)^q\big\}\Big)^{1/q} \le C\big(\sigma_C^2+\sqrt{p_2}\,\sigma_C\big). \tag{27} \]

• If $p_1\le q$, Lemma 1 yields
\[ \mathbb{E}\,\mathrm{tr}\big\{(ZZ^\top-\mathbb{E} ZZ^\top)^q\big\} \le \Big(\frac{p_1}{m_1}\wedge\frac{p_2}{m_2}\Big)\,\mathbb{E}\,\mathrm{tr}\big\{(HH^\top-\mathbb{E} HH^\top)^q\big\}. \]
Here, $H$ is an $m_1$-by-$m_2$ matrix with i.i.d. standard Gaussian entries and
\[ m_1=\lceil\sigma_C^2\rceil+2q-2=5\lceil\sigma_C^2\rceil-2, \qquad m_2=\lceil\sigma_R^2\rceil+2q-2\le p_2+4\lceil\sigma_C^2\rceil-2. \tag{28} \]
Additionally, by Lemma 2,
\begin{align}
\Big(\mathbb{E}\,\mathrm{tr}\big\{(ZZ^\top-\mathbb{E} ZZ^\top)^q\big\}\Big)^{1/q} &\le \Big(\Big(\frac{p_1}{m_1}\wedge\frac{p_2}{m_2}\Big)\,\mathbb{E}\,\mathrm{tr}\big\{(HH^\top-\mathbb{E} HH^\top)^q\big\}\Big)^{1/q} \\
&\le \Big(\Big(\frac{p_1}{m_1}\wedge\frac{p_2}{m_2}\Big)(m_1\wedge m_2)\,\mathbb{E}\big\|HH^\top-\mathbb{E} HH^\top\big\|^q\Big)^{1/q} \\
&\le p_1^{1/q}\big(2\sqrt{m_1m_2}+m_1+4(\sqrt{m_1}+\sqrt{m_2})\sqrt q+2q\big) \\
&\overset{(28)}{\le} (2q)^{1/q}\cdot C\big(\sqrt{p_2}\,\sigma_C+\sigma_C^2\big) \le C\big(\sqrt{p_2}\,\sigma_C+\sigma_C^2\big),
\end{align}
which implies (27).

• Suppose statement (27) holds for matrices in $\mathbb{R}^{(p_1-1)\times p_2}$ for some $p_1>q$; we further consider the case $Z\in\mathbb{R}^{p_1\times p_2}$. Without loss of generality, assume $1=\sigma_*=\sigma_1\ge\sigma_2\ge\cdots\ge\sigma_{p_1}\ge 0$. By this ordering,
\[ \sigma_{p_1-1}^2+\sigma_{p_1}^2 \le \frac{2}{p_1}\sum_{i=1}^{p_1}\sigma_i^2 = \frac{2\sigma_C^2}{p_1} \le \frac{2\sigma_C^2}{q} = \frac{\sigma_C^2}{\lceil\sigma_C^2\rceil} \le 1 = \sigma_*^2. \tag{29} \]
By Lemma 4, we have
\[ \mathbb{E}\,\mathrm{tr}\big((ZZ^\top-\mathbb{E} ZZ^\top)^q\big) \le \mathbb{E}\,\mathrm{tr}\big((\tilde Z\tilde Z^\top-\mathbb{E}\tilde Z\tilde Z^\top)^q\big), \]
where $\tilde Z$ is a $(p_1-1)\times p_2$ random matrix with independent entries, $\mathbb{E}\tilde Z=0$, and
\[ \mathrm{Var}(\tilde Z_{ij})=\begin{cases}\sigma_i^2, & 1\le i\le p_1-2,\\ \sigma_{p_1-1}^2+\sigma_{p_1}^2, & i=p_1-1.\end{cases} \]
By (29), $\max_{i,j}\mathrm{Var}(\tilde Z_{ij})\le\sigma_*^2$. Meanwhile, $\sum_{i=1}^{p_1-1}\mathrm{Var}(\tilde Z_{ij})=\sum_{i=1}^{p_1}\sigma_i^2=\sigma_C^2$. Thus, the induction hypothesis (27) implies
\[ \Big(\mathbb{E}\,\mathrm{tr}\big\{(ZZ^\top-\mathbb{E} ZZ^\top)^q\big\}\Big)^{1/q} \le \Big(\mathbb{E}\,\mathrm{tr}\big\{(\tilde Z\tilde Z^\top-\mathbb{E}\tilde Z\tilde Z^\top)^q\big\}\Big)^{1/q} \le C\big(\sqrt{p_2}\,\sigma_C+\sigma_C^2\big).
\]
By induction, we have proved that (27) holds in general. Therefore,
\[ \mathbb{E}\big\|ZZ^\top-\mathbb{E} ZZ^\top\big\| \le \Big(\mathbb{E}\,\mathrm{tr}\big\{(ZZ^\top-\mathbb{E} ZZ^\top)^q\big\}\Big)^{1/q} \lesssim \sqrt{p_2}\,\sigma_C+\sigma_C^2. \]

3.4 Wishart matrix with near-homoskedastic columns

Let $Z\in\mathbb{R}^{p_1\times p_2}$ be a random matrix with independent entries. We consider another case of interest, in which all entries in each column of $Z$ have similar variances (i.e., there exist $\sigma_j$ such that $\sigma_{ij}\approx\sigma_j$ for all $i,i'\in[p_1]$ and $j\in[p_2]$). This model has been used to characterize heteroskedastic independent samples in statistical applications [17]. Applying Theorem 1, one obtains
\[ \mathbb{E}\big\|ZZ^\top-\mathbb{E} ZZ^\top\big\| \lesssim \sqrt{p_1\sum_j\sigma_j^2\,\max_j\sigma_j^2}+p_1\max_j\sigma_j^2. \tag{30} \]
As the direct upper bound (30) may be sub-optimal, we prove the following upper and lower bounds via a more careful analysis.

Theorem 8.
Suppose $Z\in\mathbb{R}^{p_1\times p_2}$ has independent, mean-zero, sub-Gaussian entries. Assume there exist $\sigma_1,\dots,\sigma_{p_2}\ge 0$ such that $\|Z_{ij}/\sigma_j\|_{\psi_2}\le C_K$ for a constant $C_K>0$. Then
\[ \mathbb{E}\big\|ZZ^\top-\mathbb{E} ZZ^\top\big\| \lesssim \sqrt{p_1\sum_j\sigma_j^4}+p_1\max_j\sigma_j^2. \tag{31} \]

Theorem 9. If $Z\in\mathbb{R}^{p_1\times p_2}$ and $Z_{ij}\overset{ind}{\sim}N(0,\sigma_j^2)$, we have
\[ \mathbb{E}\big\|ZZ^\top-\mathbb{E} ZZ^\top\big\| \gtrsim \sqrt{p_1\sum_j\sigma_j^4}+p_1\max_j\sigma_j^2. \]
The proof of Theorem 9 is deferred to Section 5.5. Now we consider the proof of Theorem 8. Since the Gaussian comparison lemma (Lemma 1) cannot give the desired term $\sum_{j=1}^{p_2}\sigma_j^4$, we instead study the expansion of $\mathbb{E}\,\mathrm{tr}\{(\Delta(ZZ^\top))^q\}$, where $\Delta(ZZ^\top)$ equals $ZZ^\top$ with all diagonal entries set to zero. The expansion of $\mathbb{E}\,\mathrm{tr}\{(\Delta(ZZ^\top))^q\}$ can be related to the cycles in a complete graph in which every edge is visited either zero times or at least twice. Based on this new idea, we introduce the following lemma.

Lemma 5.
Suppose $Z\in\mathbb{R}^{p_1\times p_2}$, $Z_{ij}\overset{ind}{\sim}N(0,\sigma_{ij}^2)$, and $\sigma_{ij}\le\sigma_j$. For a square matrix $A$, let $\Delta(A)$ be $A$ with all diagonal entries set to zero and $D(A)$ be $A$ with all off-diagonal entries set to zero. For any integer $q\ge 1$, suppose $H\in\mathbb{R}^{p_1\times m}$ has i.i.d. standard normal entries with $m=\lceil\sum_{j=1}^{p_2}\sigma_j^4\rceil+2q-2$. Then
\[ \mathbb{E}\,\mathrm{tr}\big\{(\Delta(ZZ^\top))^q\big\} \le \mathbb{E}\,\mathrm{tr}\big\{(\Delta(HH^\top))^q\big\}. \tag{32} \]
The proof of Lemma 5 is provided in Section 5.5. Next, we prove Theorem 8.

Proof of Theorem 8.
Denote $\sigma_*^2=\max_j\sigma_j^2$ and, without loss of generality, assume $\sigma_*=1$. Note that $\mathbb{E}\|ZZ^\top-\mathbb{E} ZZ^\top\| \le \mathbb{E}\|D(ZZ^\top)-\mathbb{E} ZZ^\top\|+\mathbb{E}\|\Delta(ZZ^\top)\|$, so it suffices to bound the two terms separately. Since $D(ZZ^\top)-\mathbb{E} ZZ^\top$ is a diagonal matrix with independent diagonal entries, we have
\[ \big\|D(ZZ^\top)-\mathbb{E} ZZ^\top\big\| = \max_{i\in[p_1]}\Big|\sum_{j=1}^{p_2}Z_{ij}^2-\mathbb{E}\sum_{j=1}^{p_2}Z_{ij}^2\Big|, \]
and by Bernstein's inequality together with a union bound,
\[ \mathbb{P}\Big(\max_{i\in[p_1]}\Big|\sum_{j=1}^{p_2}Z_{ij}^2-\mathbb{E}\sum_{j=1}^{p_2}Z_{ij}^2\Big|>t\Big) \le \exp\Big(\log p_1-c\Big(\frac{t^2}{\sum_{j=1}^{p_2}\sigma_j^4}\wedge\frac{t}{\sigma_*^2}\Big)\Big). \]
Integration over the tail further yields
\[ \mathbb{E}\max_{i\in[p_1]}\Big|\sum_{j=1}^{p_2}Z_{ij}^2-\mathbb{E}\sum_{j=1}^{p_2}Z_{ij}^2\Big| \lesssim \sqrt{\log p_1\sum_{j=1}^{p_2}\sigma_j^4}+\sigma_*^2\log p_1. \tag{33} \]

Next, we use the moment method to bound $\mathbb{E}\|\Delta(ZZ^\top)\|$. For any even positive integer $q$, by Lemma 5,
\[ \mathbb{E}\big\|\Delta(ZZ^\top)\big\| \le \Big(\mathbb{E}\,\mathrm{tr}\big\{(\Delta(ZZ^\top))^q\big\}\Big)^{1/q} \le \Big(\mathbb{E}\,\mathrm{tr}\big\{(\Delta(HH^\top))^q\big\}\Big)^{1/q}. \tag{34} \]
Here $H$ is a $p_1$-by-$m$ random matrix with i.i.d. $N(0,1)$ entries and $m=\lceil\sum_{j=1}^{p_2}\sigma_j^4\rceil+2q-2$. Thus it suffices to bound $\big(\mathbb{E}\,\mathrm{tr}\{(\Delta(HH^\top))^q\}\big)^{1/q}$.

On the one hand, by Lemma 2, for all $q\ge 1$,
\[ \Big(\mathbb{E}\big\|HH^\top-\mathbb{E} HH^\top\big\|^q\Big)^{1/q} \le 2\sqrt{p_1m}+p_1+4(\sqrt{p_1}+\sqrt m)\sqrt q+2q. \tag{35} \]
On the other hand, note that $\|D(HH^\top)-\mathbb{E} HH^\top\|=\max_{i\in[p_1]}|X_i|$, where the $X_i$ are independent centered $\chi^2_m$ random variables. By chi-square concentration and a union bound, we have
\[ \mathbb{P}\Big(\max_{i\in[p_1]}|X_i|>t\Big) \le \exp\Big(\log p_1-c\Big(\frac{t^2}{m}\wedge t\Big)\Big). \]
Integration gives
\[ \mathbb{E}\max_{i\in[p_1]}|X_i|^q \le C^q\Big(\log^q p_1+\big(\sqrt{m\log p_1}\big)^q\Big). \tag{36} \]
Then it follows that
\begin{align}
\Big(\mathbb{E}\,\mathrm{tr}\big\{(\Delta(HH^\top))^q\big\}\Big)^{1/q} &\le \Big(p_1\,\mathbb{E}\big\|\Delta(HH^\top)\big\|^q\Big)^{1/q} \\
&\le p_1^{1/q}\Big(\big(\mathbb{E}\|HH^\top-\mathbb{E} HH^\top\|^q\big)^{1/q}+\big(\mathbb{E}\|D(HH^\top-\mathbb{E} HH^\top)\|^q\big)^{1/q}\Big) \\
&\overset{(35),(36)}{\lesssim} p_1^{1/q}\big(2\sqrt{p_1m}+p_1+4(\sqrt{p_1}+\sqrt m)\sqrt q+2q\big). \tag{37}
\end{align}
Now we take $q=2p_1$ and get
\[ \mathbb{E}\big\|\Delta(ZZ^\top)\big\| \overset{(34)}{\le} \Big(\mathbb{E}\,\mathrm{tr}\big\{(\Delta(HH^\top))^q\big\}\Big)^{1/q} \lesssim \sqrt{p_1\sum_{j=1}^{p_2}\sigma_j^4}+p_1. \]
This together with (33) completes the proof of the theorem.
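The log-free rate for near-homoskedastic columns can be checked informally by simulation; the variance profile, dimensions, and the comparison constants below are illustrative assumptions, not part of the proof.

```python
import numpy as np

rng = np.random.default_rng(2)
p1, p2 = 60, 120
s2 = rng.uniform(0.2, 1.0, size=p2)      # assumed column variances sigma_j^2

EZZt = s2.sum() * np.eye(p1)             # E[Z Z^T] = (sum_j sigma_j^2) I_{p1}
norms = []
for _ in range(50):
    Z = rng.normal(size=(p1, p2)) * np.sqrt(s2)   # column j has variance sigma_j^2
    norms.append(np.linalg.norm(Z @ Z.T - EZZt, 2))
emp = float(np.mean(norms))

# Rate from the theorem: sqrt(p1 * sum_j sigma_j^4) + p1 * max_j sigma_j^2.
rate = np.sqrt(p1 * np.sum(s2 ** 2)) + p1 * s2.max()
assert rate / 10 < emp < 5 * rate        # same order, up to constants
```

Note that $\mathbb{E} ZZ^\top$ is a multiple of the identity here, which is what makes the diagonal/off-diagonal split in the proof natural.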
Applications
The concentration bounds established in the previous sections have a range of applications. In this section, we illustrate the usefulness of the heteroskedastic Wishart-type concentration through applications to low-rank matrix denoising and heteroskedastic clustering. Consider the following "signal + noise" model:
\[ Y = X + Z, \]
where $X \in \mathbb{R}^{p_1 \times p_2}$ is an (approximately) low-rank matrix of interest, $Z$ is the random noise with independent entries, and $Y$ is the observation. This model has attracted significant attention in probability and statistics [5, 7, 14, 26], and has also been the prototypical setting in various applications, such as the bipartite stochastic block model [15], exponential family PCA [22], and top-$k$ ranking from pairwise comparisons [23]. In these applications, the leading singular values/vectors of $X$ often contain the information of interest. A straightforward way to estimate the leading singular values/vectors of $X$ (which are also the square roots of the eigenvalues, and the eigenvectors, of $XX^\top$) is by evaluating the spectrum of $Y$ (or equivalently $YY^\top$). Suppose $\lambda_i(YY^\top), \lambda_i(XX^\top), v_i(YY^\top), v_i(XX^\top)$ are the $i$th eigenvalues and $i$th eigenvectors of $YY^\top$ and $XX^\top$, respectively. The classic perturbation theory (e.g., Weyl [34] and Davis-Kahan [13]) yields the following sharp bounds:
\[ \big| \lambda_i(YY^\top) - \lambda_i(XX^\top) \big| \le \big\| YY^\top - XX^\top \big\|, \qquad \min_{\pm} \big\| v_i(YY^\top) \pm v_i(XX^\top) \big\| \lesssim \frac{ \| YY^\top - XX^\top \| }{ \min_{j = i, i+1} \{ \lambda_{j-1}(XX^\top) - \lambda_j(XX^\top) \} }. \]
Then, a tight upper bound for the perturbation
$YY^\top - XX^\top$ is critical to quantify the estimation accuracy of $\lambda_i(YY^\top), v_i(YY^\top)$ for $\lambda_i(XX^\top), v_i(XX^\top)$. By expansion, the perturbation of $YY^\top - XX^\top$ can be written as
\[ YY^\top - XX^\top = XZ^\top + ZX^\top + \mathbb{E} ZZ^\top + \big( ZZ^\top - \mathbb{E} ZZ^\top \big). \tag{38} \]
Here, $\mathbb{E} ZZ^\top$ is a deterministic diagonal matrix; $\|XZ^\top\| = \|ZX^\top\|$ is the spectral norm of a random matrix multiplied by a deterministic matrix, which has been considered in [32]; the term $\| ZZ^\top - \mathbb{E} ZZ^\top \|$ is often the dominating and most complicated part of (38), and the heteroskedastic Wishart-type concentration inequality established in the present paper provides a powerful tool for analyzing it.
We further illustrate through a specific application to high-dimensional heteroskedastic clustering. Clustering is a ubiquitous task in statistics and machine learning [16]. Suppose we observe a two-component Gaussian mixture:
\[ Y_j = l_j \mu + \varepsilon_j, \quad \varepsilon_j = (\varepsilon_{1j}, \ldots, \varepsilon_{pj})^\top, \quad \varepsilon_{ij} \overset{\text{ind}}{\sim} N(0, \sigma_i^2), \qquad j = 1, \ldots, n. \tag{39} \]
Here, $\mu$ is an unknown deterministic vector in $\mathbb{R}^p$ and $l_j \in \{-1, 1\}$ are the unknown labels of the two classes. While most existing works focus on the homoskedastic setting, we consider a heteroskedastic setting where the noise variance $\sigma_i^2$ may vary across coordinates. The sample $\{Y_j\}_{j=1}^n$ can be written in matrix form, $Y = X + Z$, where
\[ Y = [Y_1, Y_2, \cdots, Y_n]^\top, \qquad X = [l_1, l_2, \cdots, l_n]^\top \mu^\top, \qquad Z = (\varepsilon_{ij})^\top \in \mathbb{R}^{n \times p}. \]
Our goal is to cluster $\{Y_j\}_{j=1}^n$ into two groups, or equivalently, to estimate the hidden labels $\{l_j\}_{j=1}^n$. Let $\hat v$ be the first eigenvector of $YY^\top$. As $\hat v$ is an estimate of $l$, it is straightforward to cluster by
\[ \hat l_j = \operatorname{sgn}(\hat v_j), \qquad j = 1, \ldots, n. \tag{40} \]
Applying Theorem 8 and the perturbation bound for $\|XZ^\top\|$ [36, Lemma 3] to (38), it can be shown that
\[ \mathbb{E} \big\| YY^\top - \mathbb{E} ZZ^\top - XX^\top \big\| \lesssim \sqrt{n} \| \mu \|_2 \sigma_* + \sqrt{n} \sigma_*^2 + \sqrt{n \sum_i \sigma_i^4}. \]
Combining this with the Davis-Kahan theorem [13], we obtain the following result.
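Before stating the result, the spectral clustering rule (40) is straightforward to simulate; in the sketch below the dimensions, mean vector, and variance profile are arbitrary illustrative choices placed well above the consistency threshold.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 200, 500
l = rng.choice([-1.0, 1.0], size=n)            # hidden labels l_j
mu = np.full(p, 0.3)                           # mean vector, ||mu||_2^2 = 45
sig = rng.uniform(0.5, 1.5, size=p)            # heteroskedastic noise levels
Z = rng.standard_normal((n, p)) * sig          # Z_ij ~ N(0, sigma_j^2), rows = samples
Y = np.outer(l, mu) + Z

# first eigenvector of YY^T and the sign rule (40)
vals, vecs = np.linalg.eigh(Y @ Y.T)
vhat = vecs[:, -1]                             # eigenvector of the largest eigenvalue
lhat = np.sign(vhat)

# misclassification rate (42), invariant to the global sign of vhat
M = min(np.mean(lhat != l), np.mean(lhat != -l))
assert M < 0.1                                 # consistent clustering at this SNR
```
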
Theorem 10.
Let $\sigma_* = \max_i \sigma_i$ and $\tilde\sigma = \big( \sum_i \sigma_i^4 \big)^{1/4}$. The estimator in (40) satisfies
\[ \mathbb{E} M(l, \hat l) \lesssim \frac{ \sqrt{n} \|\mu\|_2 \sigma_* + \sqrt{n} \sigma_*^2 + \sqrt{n} \tilde\sigma^2 }{ n \|\mu\|_2^2 } \wedge 1. \tag{41} \]
Here, $M(l, \hat l)$ is the misclassification rate defined as
\[ M(l, \hat l) = \frac{1}{n} \min\Big\{ \sum_{i=1}^n 1_{\{l_i \ne \hat l_i\}}, \ \sum_{i=1}^n 1_{\{l_i \ne -\hat l_i\}} \Big\}. \tag{42} \]
The complete proof of Theorem 10 is deferred to Section 5.6. By (41), the clustering is consistent (i.e., $\mathbb{E} M(l, \hat l) = o(1)$) as long as
\[ \|\mu\|_2^2 \gg \sigma_*^2 \vee \big( \tilde\sigma^2 / n^{1/2} \big). \tag{43} \]
The following lower bound shows that the signal-to-noise ratio condition (43) is necessary for consistent clustering. The proof is provided in Section 5.6. Theorem 11.
Suppose $\sigma_* \le \tilde\sigma \le p^{1/4} \sigma_*$. Consider the following class of distributions on $\mathbb{R}^{n \times p}$:
\[ \mathcal{P}_{l, \lambda}(\sigma_*, \tilde\sigma) = \Big\{ P_Y : Y = X + Z \in \mathbb{R}^{n \times p}, \ X = l \mu^\top, \ Z_{ij} \overset{\text{ind}}{\sim} N(0, \sigma_j^2), \ \|\mu\|_2 \ge \lambda, \ \max_j \sigma_j \le \sigma_*, \ \sum_{j=1}^p \sigma_j^4 \le \tilde\sigma^4 \Big\}. \]
There exists a universal constant $c > 0$ such that if $\lambda^2 < c \big( \sigma_*^2 \vee ( \tilde\sigma^2 / n^{1/2} ) \big)$, we have
\[ \inf_{\hat l} \sup_{\mathcal{P}_{l, \lambda}(\sigma_*, \tilde\sigma)} \mathbb{E} M(l, \hat l) \ge c_0 \]
for some universal constant $c_0 > 0$.
Additional Proofs
In this section, we collect the proofs of the upper and lower bound results in Section 2, including Lemma 1, Lemma 2, Proposition 1, and Theorem 2.
Proof of Lemma 1.
This proof shares similarities with, but also differs in several respects from, the Wigner-type argument of [4, Proposition 2.1]. We assume $\sigma_* = 1$ throughout the proof without loss of generality. We divide the proof into two steps, which target the two sides of the inequality, respectively.
Step 1. One can check that $\mathbb{E} ZZ^\top = \operatorname{diag}\big( \big\{ \sum_{j=1}^{p_2} \sigma_{ij}^2 \big\}_{i=1}^{p_1} \big)$. Consider the following expansion:
\[ \mathbb{E} \operatorname{tr}\big\{ (ZZ^\top - \mathbb{E} ZZ^\top)^q \big\} = \sum_{u_1, \ldots, u_q, u_{q+1} \in [p_1]} \mathbb{E} \prod_{k=1}^q \big( ZZ^\top - \mathbb{E} ZZ^\top \big)_{u_k, u_{k+1}} = \sum_{u_1, \ldots, u_q, u_{q+1} \in [p_1]} \mathbb{E} \prod_{k=1}^q \sum_{v_k \in [p_2]} \big( Z_{u_k, v_k} Z_{u_{k+1}, v_k} - 1_{\{u_k = u_{k+1}\}} \mathbb{E} Z_{u_k, v_k}^2 \big) = \sum_{\substack{u_1, \ldots, u_q, u_{q+1} \in [p_1] \\ v_1, \ldots, v_q \in [p_2]}} \mathbb{E} \prod_{k=1}^q \big( Z_{u_k, v_k} Z_{u_{k+1}, v_k} - \sigma_{u_k, v_k}^2 \cdot 1_{\{u_k = u_{k+1}\}} \big). \tag{44} \]
Here, the indices are interpreted modulo $q$, i.e., $u_1 = u_{q+1}$. Next, we consider the bipartite graph from $[p_1]$ to $[p_2]$ and the cycles of length $2q$, i.e., $c := (u_1 \to v_1 \to u_2 \to v_2 \to \ldots \to u_q \to v_q \to u_{q+1} = u_1)$. For any $(i, j) \in [p_1] \times [p_2]$, let
\[ \alpha_{ij}(c) = \operatorname{Card}\{ k : (u_k = i, v_k = j, u_{k+1} \ne i) \text{ or } (u_k \ne i, v_k = j, u_{k+1} = i) \}; \qquad \beta_{ij}(c) = \operatorname{Card}\{ k : u_k = u_{k+1} = i, v_k = j \}. \tag{45} \]
Then, $\alpha_{ij}(c)$ is the number of times that the edge $(i, j)$ is visited exactly once by a sub-path $u_k \to v_k \to u_{k+1}$; $\beta_{ij}(c)$ is the number of times that the edge $(i, j)$ is visited twice by a sub-path $u_k \to v_k \to u_{k+1}$ (back and forth). Since the $Z_{ij} / \sigma_{ij}$ are i.i.d. standard normal, we have
\[ \mathbb{E} \operatorname{tr}\big\{ (ZZ^\top - \mathbb{E} ZZ^\top)^q \big\} = \sum_{c \in ([p_1] \times [p_2])^q} \prod_{(i,j) \in [p_1] \times [p_2]} \mathbb{E} Z_{ij}^{\alpha_{ij}(c)} \big( Z_{ij}^2 - \sigma_{ij}^2 \big)^{\beta_{ij}(c)} = \sum_{c \in ([p_1] \times [p_2])^q} \prod_{(i,j)} \sigma_{ij}^{\alpha_{ij}(c) + 2\beta_{ij}(c)} \prod_{(i,j)} \mathbb{E} G^{\alpha_{ij}(c)} \big( G^2 - 1 \big)^{\beta_{ij}(c)} = \sum_{c \in ([p_1] \times [p_2])^q} \prod_{k=1}^q \sigma_{u_k, v_k} \sigma_{u_{k+1}, v_k} \prod_{(i,j) \in [p_1] \times [p_2]} \mathbb{E} G^{\alpha_{ij}(c)} \big( G^2 - 1 \big)^{\beta_{ij}(c)}. \tag{46} \]
Here $G$ denotes an $N(0, 1)$ random variable. Next, let $m_{\alpha, \beta}(c)$ be the number of edges which appear $\alpha$ times in sub-paths $(u_k \to v_k)$ or $(v_k \to u_{k+1})$ with $u_k \ne u_{k+1}$, and $\beta$ times in sub-paths $(u_k \to v_k \to u_{k+1})$ with $u_k = u_{k+1}$. More rigorously,
\[ m_{\alpha, \beta}(c) := \operatorname{Card}\big\{ (i, j) \in [p_1] \times [p_2] : \beta = |\{ k : u_k = u_{k+1} = i, v_k = j \}|, \ \alpha = |\{ k : \text{exactly one of } u_k, u_{k+1} \text{ equals } i, \ v_k = j \}| \big\}. \tag{47} \]
For any cycle $c$, we define its shape $s(c)$ by relabeling the vertices in the order of their appearance, left and right vertices separately, where $i$ denotes the $i$th distinct left vertex and $i'$ the $i$th distinct right vertex. For example, the cycle $7 \to 3' \to 2 \to 3' \to 7 \to 5' \to 7$ has shape $1 \to 1' \to 2 \to 1' \to 1 \to 2' \to 1$. It is easy to see that for any two cycles $c$ and $c'$ with the same shape, we must have $m_{\alpha, \beta}(c) = m_{\alpha, \beta}(c')$. Thus we can well define $m_{\alpha, \beta}(s(c)) := m_{\alpha, \beta}(c)$. Based on the previous discussion,
\[ \prod_{(i,j) \in [p_1] \times [p_2]} \mathbb{E} G^{\alpha_{ij}(c)} (G^2 - 1)^{\beta_{ij}(c)} = \prod_{\alpha, \beta \ge 0} \big\{ \mathbb{E} G^{\alpha} (G^2 - 1)^{\beta} \big\}^{m_{\alpha, \beta}(s(c))}. \tag{48} \]
A natural observation is that $\mathbb{E} G^{\alpha} (G^2 - 1)^{\beta} \ge 0$ for all $\alpha, \beta$, and $\mathbb{E} G^{\alpha} (G^2 - 1)^{\beta} = 0$ if and only if $\alpha$ is odd or $(\alpha, \beta) = (0, 1)$ (see Lemma 7 in Appendix A for details). We then define the even shape set $\mathcal{S}_{p_1, p_2}$ as
\[ \mathcal{S}_{p_1, p_2} = \big\{ s(c) : m_{\alpha, \beta}(s(c)) = 0 \text{ for all } \alpha, \beta \text{ s.t. } \alpha \text{ is odd or } (\alpha, \beta) = (0, 1) \big\}. \tag{49} \]
Then the right hand side of (48) is nonzero only for $s(c) \in \mathcal{S}_{p_1, p_2}$, and the expansion (46) can be further rewritten as
\[ \mathbb{E} \operatorname{tr}\big\{ (ZZ^\top - \mathbb{E} ZZ^\top)^q \big\} = \sum_{s \in \mathcal{S}_{p_1, p_2}} \sum_{c : s(c) = s} \prod_{k=1}^q \sigma_{u_k, v_k} \sigma_{u_{k+1}, v_k} \prod_{\alpha, \beta \ge 0} \big\{ \mathbb{E} G^{\alpha} (G^2 - 1)^{\beta} \big\}^{m_{\alpha, \beta}(s)} = \sum_{s \in \mathcal{S}_{p_1, p_2}} \prod_{\alpha, \beta \ge 0} \big\{ \mathbb{E} G^{\alpha} (G^2 - 1)^{\beta} \big\}^{m_{\alpha, \beta}(s)} \cdot \sum_{c : s(c) = s} \prod_{k=1}^q \sigma_{u_k, v_k} \sigma_{u_{k+1}, v_k}. \tag{50} \]
Now, denoting by $m_L(s)$ and $m_R(s)$ the numbers of distinct left and right nodes visited by cycles with shape $s$, we have the following lemma: Lemma 6.
Suppose $\sigma_* \le 1$. Then for any shape $s \in \mathcal{S}_{p_1, p_2}$,
\[ \sum_{c : s(c) = s} \prod_{k=1}^q \sigma_{u_k, v_k} \sigma_{u_{k+1}, v_k} \le \Big( p_1 \sigma_C^{2(m_L(s) - 1)} \sigma_R^{2 m_R(s)} \Big) \wedge \Big( p_2 \sigma_C^{2 m_L(s)} \sigma_R^{2(m_R(s) - 1)} \Big). \]
Proof.
The proof of Lemma 6 is an analogue of [4, Lemma 2.5]. We first show
\[ \sum_{c : s(c) = s} \prod_{k=1}^q \sigma_{u_k, v_k} \sigma_{u_{k+1}, v_k} \le p_1 \sigma_C^{2(m_L(s) - 1)} \sigma_R^{2 m_R(s)}. \tag{51} \]
Suppose $s = (s_1, s_1', \ldots, s_q, s_q')$, and let $l(k) = \min\{ j : s_j = k \}$, i.e., the first time in any cycle of shape $s$ at which its $k$th distinct left vertex is visited. Similarly we define $r(k) = \min\{ j : s_j' = k \}$. Now let $c = (u_1, v_1, \cdots, u_q, v_q)$ be a cycle with shape $s$. Then the following $m_L(s) - 1$ distinct edges from a right vertex to a left vertex appear in order: $v_{l(2)-1} \to u_{l(2)}, \ v_{l(3)-1} \to u_{l(3)}, \ \cdots, \ v_{l(m_L(s))-1} \to u_{l(m_L(s))}$. Similarly, we have $m_R(s)$ distinct edges from a left vertex to a right vertex: $u_{r(1)} \to v_{r(1)}, \ u_{r(2)} \to v_{r(2)}, \ \cdots, \ u_{r(m_R(s))} \to v_{r(m_R(s))}$. In addition, these $m_L(s) + m_R(s) - 1$ edges are distinct from each other by the definitions of $l(k)$ and $r(k)$. We claim that each of these $m_L(s) + m_R(s) - 1$ edges must be visited at least twice by $c$: otherwise $m_{1, 0}(s(c)) \ge 1$, which contradicts $s \in \mathcal{S}_{p_1, p_2}$. Now for a fixed starting vertex $u_1 = u \in [p_1]$, since each of these edges is visited at least twice and all remaining factors are bounded by $\sigma_* \le 1$, we can bound
\[ \sum_{\substack{c : u_1 = u \\ s(c) = s}} \prod_{k=1}^q \sigma_{u_k, v_k} \sigma_{u_{k+1}, v_k} \le \sum_{\substack{c : u_1 = u \\ s(c) = s}} \prod_{t=1}^{m_R(s)} \sigma_{u_{r(t)}, v_{r(t)}}^2 \cdot \prod_{t=2}^{m_L(s)} \sigma_{u_{l(t)}, v_{l(t)-1}}^2 \le \sigma_R^{2 m_R(s)} \sigma_C^{2(m_L(s) - 1)}, \]
where the last step sums over the free right labels $v_{r(t)}$ first (each sum contributing at most $\sigma_R^2$) and then over the free left labels $u_{l(t)}$, $t \ge 2$ (each contributing at most $\sigma_C^2$). Then (51) follows by summing over the $p_1$ choices of the initial vertex $u_1 = u \in [p_1]$. Similarly we can show
\[ \sum_{c : s(c) = s} \prod_{k=1}^q \sigma_{u_k, v_k} \sigma_{u_{k+1}, v_k} \le p_2 \sigma_C^{2 m_L(s)} \sigma_R^{2(m_R(s) - 1)}, \]
and the proof is complete.
Combining (50) and Lemma 6, we obtain
\[ \mathbb{E} \operatorname{tr}\big\{ (ZZ^\top - \mathbb{E} ZZ^\top)^q \big\} \le \sum_{s \in \mathcal{S}_{p_1, p_2}} \prod_{\alpha, \beta \ge 0} \big\{ \mathbb{E} G^\alpha (G^2 - 1)^\beta \big\}^{m_{\alpha, \beta}(s)} \cdot \Big\{ p_1 \sigma_C^{2(m_L(s) - 1)} \sigma_R^{2 m_R(s)} \Big\} \wedge \Big\{ p_2 \sigma_C^{2 m_L(s)} \sigma_R^{2(m_R(s) - 1)} \Big\}. \tag{52} \]
Step 2. Next, we consider the expansion for $\mathbb{E} \operatorname{tr}\big( (HH^\top - \mathbb{E} HH^\top)^q \big)$, where $H \in \mathbb{R}^{m_1 \times m_2}$ has i.i.d. standard Gaussian entries. Expanding similarly to Step 1, we obtain
\[ \mathbb{E} \operatorname{tr}\big( (HH^\top - m_2 I_{m_1})^q \big) = \sum_{s \in \mathcal{S}_{p_1, p_2}} \prod_{\alpha, \beta \ge 0} \big\{ \mathbb{E} G^\alpha (G^2 - 1)^\beta \big\}^{m_{\alpha, \beta}(s)} \cdot |\{ c : s(c) = s \}| = \sum_{s \in \mathcal{S}_{p_1, p_2}} \prod_{\alpha, \beta \ge 0} \big\{ \mathbb{E} G^\alpha (G^2 - 1)^\beta \big\}^{m_{\alpha, \beta}(s)} \cdot m_1 (m_1 - 1) \cdots (m_1 - m_L(s) + 1) \cdot m_2 (m_2 - 1) \cdots (m_2 - m_R(s) + 1). \]
Provided that $m_1 = \lceil \sigma_C^2 \rceil + q - 1$, $m_2 = \lceil \sigma_R^2 \rceil + q - 1$, and $m_L(s), m_R(s) \le q$, we have
\[ m_1 (m_1 - 1) \cdots (m_1 - m_L(s) + 1) \cdot m_2 (m_2 - 1) \cdots (m_2 - m_R(s) + 1) \ge m_1 \cdot (m_1 - m_L(s) + 1)^{m_L(s) - 1} \cdot (m_2 - m_R(s) + 1)^{m_R(s)} \ge m_1 \sigma_C^{2(m_L(s) - 1)} \cdot \sigma_R^{2 m_R(s)}. \]
Similarly,
\[ m_1 (m_1 - 1) \cdots (m_1 - m_L(s) + 1) \cdot m_2 (m_2 - 1) \cdots (m_2 - m_R(s) + 1) \ge \sigma_C^{2 m_L(s)} \cdot m_2 \sigma_R^{2(m_R(s) - 1)}. \]
These together imply
\[ \mathbb{E} \operatorname{tr}\big( (HH^\top - m_2 I_{m_1})^q \big) \ge \sum_{s \in \mathcal{S}_{p_1, p_2}} \prod_{\alpha, \beta \ge 0} \big\{ \mathbb{E} G^\alpha (G^2 - 1)^\beta \big\}^{m_{\alpha, \beta}(s)} \cdot \Big\{ m_1 \sigma_C^{2(m_L(s) - 1)} \sigma_R^{2 m_R(s)} \Big\} \vee \Big\{ m_2 \sigma_C^{2 m_L(s)} \sigma_R^{2(m_R(s) - 1)} \Big\}. \tag{53} \]
By comparing (52) and (53), we have finally proved that
\[ \mathbb{E} \operatorname{tr}\big\{ (ZZ^\top - \mathbb{E} ZZ^\top)^q \big\} \le \Big( \frac{p_1}{m_1} \wedge \frac{p_2}{m_2} \Big) \mathbb{E} \operatorname{tr}\big\{ \big( HH^\top - \mathbb{E} HH^\top \big)^q \big\}. \]
Proof of Lemma 2.
Let $W = \max\big\{ \sigma_{\max}(H) - \sqrt{m_1} - \sqrt{m_2}, \ \sqrt{m_1 \vee m_2} - \sqrt{m_1 \wedge m_2} - \sigma_{\min}(H), \ 0 \big\}$. By the tail bound for i.i.d. Gaussian matrices (c.f., [31, Corollary 5.35]), $\mathbb{P}(W \ge t) \le 2\exp(-t^2/2)$ for all $t \ge 0$. Thus for any $q \ge 2$,
\[ \mathbb{E} W^q = q \int_0^\infty t^{q-1} \mathbb{P}(W \ge t) \, dt \le 2q \int_0^\infty t^{q-1} \exp(-t^2/2) \, dt = 2^{q/2} q \, \Gamma(q/2). \]
Since
\[ \big\| HH^\top - \mathbb{E} HH^\top \big\| = \big\| HH^\top - m_2 I_{m_1} \big\| = \max\big\{ \sigma_{\max}^2(H) - m_2, \ m_2 - \sigma_{\min}^2(H) \big\} \le (W + \sqrt{m_1} + \sqrt{m_2})^2 - m_2 = 2\sqrt{m_1 m_2} + m_1 + W^2 + 2(\sqrt{m_1} + \sqrt{m_2}) W, \tag{54} \]
we have
\[ \Big( \mathbb{E} \big\| HH^\top - \mathbb{E} HH^\top \big\|^q \Big)^{1/q} \le 2\sqrt{m_1 m_2} + m_1 + \big( \mathbb{E} W^{2q} \big)^{1/q} + 2(\sqrt{m_1} + \sqrt{m_2}) \big( \mathbb{E} W^q \big)^{1/q} \le 2\sqrt{m_1 m_2} + m_1 + \big( 2^{q+1} q \, \Gamma(q) \big)^{1/q} + 2(\sqrt{m_1} + \sqrt{m_2}) \big( 2^{q/2} q \, \Gamma(q/2) \big)^{1/q}. \]
Next we claim
\[ \big( 2^{q+1} q \, \Gamma(q) \big)^{1/q} \le 2q, \qquad \big( 2^{q/2} q \, \Gamma(q/2) \big)^{1/q} \le 2 q^{1/2}. \tag{55} \]
One can verify (55) for $2 \le q \le 10$ by direct calculation. When $q \ge 11$, (55) can be verified by the Gamma function upper bound in [6]. In summary, we have
\[ \Big( \mathbb{E} \big\| HH^\top - \mathbb{E} HH^\top \big\|^q \Big)^{1/q} \le 2\sqrt{m_1 m_2} + m_1 + 4(\sqrt{m_1} + \sqrt{m_2}) \sqrt{q} + 2q, \]
which finishes the proof of the first part of this lemma.
For the second part, when $m_1 \le m_2$, since $HH^\top - \mathbb{E} HH^\top$ is an $m_1$-by-$m_1$ matrix, $\operatorname{tr}\big( (HH^\top - \mathbb{E} HH^\top)^q \big)$ is the sum of the $m_1$ eigenvalues of $(HH^\top - \mathbb{E} HH^\top)^q$, each of which is no more than $\| HH^\top - \mathbb{E} HH^\top \|^q$. Thus,
\[ \mathbb{E} \operatorname{tr}\big\{ (HH^\top - \mathbb{E} HH^\top)^q \big\} \le m_1 \mathbb{E} \big\| HH^\top - \mathbb{E} HH^\top \big\|^q \le (m_1 \wedge m_2) \cdot \big( 2\sqrt{m_1 m_2} + m_1 + 4(\sqrt{m_1} + \sqrt{m_2}) \sqrt{q} + 2q \big)^q. \]
When $m_1 > m_2$, we note that $\operatorname{rank}(HH^\top) \le m_2$ and $\mathbb{E} HH^\top = m_2 I_{m_1}$. Then
\[ \big( HH^\top - \mathbb{E} HH^\top \big)^q - (-1)^q m_2^q I_{m_1} = \sum_{k=1}^q (-m_2)^{q-k} \binom{q}{k} (HH^\top)^k, \]
which shares the eigenspace of $HH^\top$ and has rank no more than $m_2$. Thus,
\[ \mathbb{E} \operatorname{tr}\big\{ (HH^\top - \mathbb{E} HH^\top)^q \big\} = \mathbb{E} \operatorname{tr}\big\{ (HH^\top - \mathbb{E} HH^\top)^q - (-1)^q m_2^q I_{m_1} \big\} + \operatorname{tr}\big( (-1)^q m_2^q I_{m_1} \big) \le m_2 \mathbb{E} \big\| (HH^\top - \mathbb{E} HH^\top)^q - (-1)^q m_2^q I_{m_1} \big\| + m_1 m_2^q \le m_2 \big\{ \big( 2\sqrt{m_1 m_2} + m_1 + 4(\sqrt{m_1} + \sqrt{m_2}) \sqrt{q} + 2q \big)^q + m_2^q \big\} + m_1 m_2^q \le 2 m_2 \big( 2\sqrt{m_1 m_2} + m_1 + 4(\sqrt{m_1} + \sqrt{m_2}) \sqrt{q} + 2q \big)^q = 2 (m_1 \wedge m_2) \big( 2\sqrt{m_1 m_2} + m_1 + 4(\sqrt{m_1} + \sqrt{m_2}) \sqrt{q} + 2q \big)^q, \]
where the last inequality is due to $m_1 > m_2$.
Proof of Proposition 1.
Since $Z_{ij} \overset{\text{iid}}{\sim} N(0, 1)$, we have $\mathbb{E} ZZ^\top = p_2 I_{p_1}$ and
\[ \mathbb{E} \big\| ZZ^\top - \mathbb{E} ZZ^\top \big\| = \mathbb{E} \big\| ZZ^\top - p_2 I_{p_1} \big\| \ge \mathbb{E} \big( \| ZZ^\top \| - p_2 \big) = \mathbb{E} \| Z \|^2 - p_2. \]
Since $\| Z \| / (\sqrt{p_1} + \sqrt{p_2}) \to 1$ as $p_1, p_2$ tend to infinity [31, Theorem 5.31], and here $\sigma_C^2 = p_1$ and $\sigma_R^2 = p_2$, we obtain
\[ \liminf_{p_1, p_2 \to \infty} \frac{ \mathbb{E} \| ZZ^\top - \mathbb{E} ZZ^\top \| }{ \sigma_C \sigma_R + \sigma_C^2 } \ge \liminf_{p_1, p_2 \to \infty} \frac{ \mathbb{E} \| Z \|^2 - p_2 }{ \sqrt{p_1 p_2} + p_1 } \ge 1. \]
Proof of Theorem 2.
It suffices to prove the following separate lower bounds:
\[ \sup_{Z \in \mathcal{F}_{p_1, p_2}(\sigma_*, \sigma_C, \sigma_R)} \mathbb{E} \| ZZ^\top - \mathbb{E} ZZ^\top \| \gtrsim \sigma_C^2; \tag{56} \]
\[ \sup_{Z \in \mathcal{F}_{p_1, p_2}(\sigma_*, \sigma_C, \sigma_R)} \mathbb{E} \| ZZ^\top - \mathbb{E} ZZ^\top \| \gtrsim \sigma_C \sigma_R; \tag{57} \]
\[ \sup_{Z \in \mathcal{F}_{p_1, p_2}(\sigma_*, \sigma_C, \sigma_R)} \mathbb{E} \| ZZ^\top - \mathbb{E} ZZ^\top \| \gtrsim \sigma_R \sigma_* \sqrt{\log p_1} + \sigma_*^2 \log p_1. \tag{58} \]
1. We first set $\sigma_{i1} = \sigma_C / \sqrt{p_1}$ for all $i$ and $\sigma_{ij} = 0$ for $j \ge 2$. If $Z_{ij} \sim N(0, \sigma_{ij}^2)$ independently, it is easy to check that $Z \in \mathcal{F}_{p_1, p_2}(\sigma_*, \sigma_C, \sigma_R)$. Then $Z$ is zero except for its first column. Denoting the first column of $Z$ by $z$, we have $ZZ^\top - \mathbb{E} ZZ^\top = zz^\top - \frac{\sigma_C^2}{p_1} I_{p_1}$, and
\[ \mathbb{E} \big\| ZZ^\top - \mathbb{E} ZZ^\top \big\| = \mathbb{E} \big\| zz^\top - \tfrac{\sigma_C^2}{p_1} I_{p_1} \big\| \ge \mathbb{E} \| zz^\top \| - \sigma_C^2 / p_1 = \mathbb{E} \| z \|_2^2 - \sigma_C^2 / p_1 = \sigma_C^2 (1 - 1/p_1) \ge c \sigma_C^2, \]
which shows (56).
2. Let $k_1 = \lfloor \sigma_C^2 / \sigma_*^2 \rfloor$, $k_2 = \lfloor \sigma_R^2 / \sigma_*^2 \rfloor$. Construct
\[ \sigma_{ij} = \begin{cases} \sigma_*, & 1 \le i \le k_1, \ 1 \le j \le k_2; \\ 0, & \text{otherwise}. \end{cases} \]
By this construction, $Z_{ij} \sim N(0, \sigma_*^2)$ for $1 \le i \le k_1$, $1 \le j \le k_2$, and $Z_{ij} = 0$ otherwise. Thus,
\[ \mathbb{E} \big( Z_{\cdot j} Z_{\cdot j}^\top - \mathbb{E} Z_{\cdot j} Z_{\cdot j}^\top \big)^2 = \mathbb{E} \| Z_{\cdot j} \|_2^2 Z_{\cdot j} Z_{\cdot j}^\top - \big( \mathbb{E} Z_{\cdot j} Z_{\cdot j}^\top \big)^2 = \mathbb{E} \| Z_{\cdot j} \|_2^2 Z_{\cdot j} Z_{\cdot j}^\top - \sigma_*^4 I_{k_1} = (k_1 + 1) \sigma_*^4 I_{k_1}. \]
Here, the last equality is due to
\[ \Big( \mathbb{E} \| Z_{\cdot j} \|_2^2 Z_{\cdot j} Z_{\cdot j}^\top \Big)_{i, i'} = \mathbb{E} \| Z_{\cdot j} \|_2^2 Z_{ij} Z_{i'j} = \begin{cases} (k_1 + 2) \sigma_*^4, & 1 \le i = i' \le k_1; \\ 0, & 1 \le i \ne i' \le k_1. \end{cases} \]
Thus,
\[ \Big\| \sum_{j=1}^{k_2} \mathbb{E} \big\{ Z_{\cdot j} Z_{\cdot j}^\top - \mathbb{E} Z_{\cdot j} Z_{\cdot j}^\top \big\}^2 \Big\| = \big\| (k_1 + 1) k_2 \sigma_*^4 I_{k_1} \big\| = (k_1 + 1) k_2 \sigma_*^4. \]
Note that $ZZ^\top - \mathbb{E} ZZ^\top$ can be decomposed as a sum of independent random matrices,
\[ ZZ^\top - \mathbb{E} ZZ^\top = \sum_{j=1}^{k_2} \big\{ Z_{\cdot j} Z_{\cdot j}^\top - \mathbb{E} Z_{\cdot j} Z_{\cdot j}^\top \big\}. \]
We apply the lower bound for the expected norm of a sum of independent random matrices [29] and obtain
\[ \mathbb{E} \big\| ZZ^\top - \mathbb{E} ZZ^\top \big\| \gtrsim \sqrt{(k_1 + 1) k_2} \, \sigma_*^2 = \sqrt{ (\lfloor \sigma_C^2/\sigma_*^2 \rfloor + 1) \cdot \lfloor \sigma_R^2/\sigma_*^2 \rfloor } \, \sigma_*^2 \ge \sqrt{ (\sigma_C^2/\sigma_*^2) \cdot \sigma_R^2/(2\sigma_*^2) } \, \sigma_*^2 \quad (\text{since } \sigma_R \ge \sigma_*) \gtrsim \sigma_R \sigma_C. \]
We have thus shown (57).
3. Set $k_1 = \lfloor \sigma_C^2/\sigma_*^2 \rfloor$, $k_2 = \lfloor \sigma_R^2/\sigma_*^2 \rfloor$, and $m = \lfloor (p_1/k_1) \wedge (p_2/k_2) \rfloor$. If $k_2 \ge (\log p_1)^2$, then $\sigma_R \ge \sigma_* \log p_1$ and (58) is implied by (57). So we assume $k_2 \le (\log p_1)^2$, and thus
\[ k_1 m \ge k_1 \Big( \frac{p_1}{2 k_1} \wedge \frac{p_2}{2 k_2} \Big) \ge \frac{p_1 \wedge p_2}{2 (\log p_1)^2}, \quad \text{so that } \log(k_1 m) \ge c \log p_1. \]
Let the variance profile be block diagonal,
\[ (\sigma_{ij}^2) = \operatorname{diag}( \underbrace{B, B, \ldots, B}_{m}, 0 ) \in \mathbb{R}^{p_1 \times p_2}, \qquad B = \sigma_*^2 1_{k_1} 1_{k_2}^\top. \]
Then we can write $Z$ in row-wise form: the $t$th diagonal block (of size $k_1 \times k_2$) contains the rows $\beta_{(t-1) k_1 + 1}^\top, \ldots, \beta_{t k_1}^\top$, all remaining entries are zero, and
\[ \beta_1, \ldots, \beta_{k_1 m} \overset{\text{iid}}{\sim} N(0, \sigma_*^2 I_{k_2}). \]
From the expression of $ZZ^\top - \mathbb{E} ZZ^\top$, whose diagonal entries include $\beta_j^\top \beta_j - k_2 \sigma_*^2$, we know
\[ \big\| ZZ^\top - \mathbb{E} ZZ^\top \big\| \ge \max_{1 \le j \le k_1 m} \big| \beta_j^\top \beta_j - k_2 \sigma_*^2 \big|. \]
Note that $\beta_j^\top \beta_j / \sigma_*^2 \sim \chi^2_{k_2}$. By the lower bound on the right tail of the chi-square distribution (Corollary 3 in [37]), we have
\[ \mathbb{P}\big( \beta_j^\top \beta_j - k_2 \sigma_*^2 \ge \sigma_*^2 x \big) \ge c \exp\big( - C ( x \wedge x^2 / k_2 ) \big). \]
Since
\[ \mathbb{P}\Big( \max_j \beta_j^\top \beta_j - k_2 \sigma_*^2 > \sigma_*^2 x \Big) = 1 - \prod_{j=1}^{k_1 m} \Big( 1 - \mathbb{P}\big( \beta_j^\top \beta_j - k_2 \sigma_*^2 \ge \sigma_*^2 x \big) \Big) \ge 1 - \Big( 1 - c \exp\big( - C ( x \wedge x^2/k_2 ) \big) \Big)^{k_1 m}, \]
taking $x = c_1 \big( \sqrt{k_2 \log(k_1 m)} \vee \log(k_1 m) \big)$ for some small constant $c_1$ such that $C ( x \wedge x^2/k_2 ) \le \tfrac{1}{2} \log(k_1 m)$, we get
\[ \Big( 1 - c \exp\big( - C ( x \wedge x^2/k_2 ) \big) \Big)^{k_1 m} \le \Big( 1 - \frac{c'}{\sqrt{k_1 m}} \Big)^{k_1 m} \le e^{-c' \sqrt{k_1 m}} \le e^{-c'}. \]
Thus,
\[ \mathbb{E} \big\| ZZ^\top - \mathbb{E} ZZ^\top \big\| \ge \mathbb{E} \max_{1 \le j \le k_1 m} \big| \beta_j^\top \beta_j - k_2 \sigma_*^2 \big| \ge \sup_{x > 0} x \sigma_*^2 \cdot \mathbb{P}\Big( \max_j \big| \beta_j^\top \beta_j - k_2 \sigma_*^2 \big| > x \sigma_*^2 \Big) \ge c (1 - e^{-c'}) \sigma_*^2 \big( \sqrt{k_2 \log(k_1 m)} \vee \log(k_1 m) \big) \gtrsim \sigma_* \sigma_R \sqrt{\log p_1} + \sigma_*^2 \log p_1. \]
5.2 Proofs for non-Gaussian distributions
In this section, we collect the proofs of the concentration results for non-Gaussian Wishart-type matrices (Lemma 3, Theorem 3, and Theorem 4) in Section 3.1.
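As an aside on the lower-bound construction just used: its driving term is the maximum of $k_1 m$ independent $\chi^2_{k_2}$ variables, which exceeds its mean by about $\sqrt{k_2 \log(k_1 m)} \vee \log(k_1 m)$. A quick simulation sketch (the sizes are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(4)
k2, N = 20, 5000                     # N plays the role of k1 * m
chi = rng.chisquare(k2, size=N)      # beta_j^T beta_j / sigma_*^2
excess = np.max(chi) - k2            # deviation of the maximum above its mean
predicted = max(np.sqrt(k2 * np.log(N)), np.log(N))
assert excess > 0.5 * predicted      # the maximum grows at the predicted scale
```
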
Proof of Lemma 3.
Following the notation and proof idea of Lemma 1, we have the same expansion of $\mathbb{E} \operatorname{tr}\{ (ZZ^\top - \mathbb{E} ZZ^\top)^q \}$ as (46):
\[ \mathbb{E} \operatorname{tr}\big\{ (ZZ^\top - \mathbb{E} ZZ^\top)^q \big\} = \sum_{c \in ([p_1] \times [p_2])^q} \prod_{(i,j) \in [p_1] \times [p_2]} \mathbb{E} Z_{ij}^{\alpha_{ij}(c)} \big( Z_{ij}^2 - \sigma_{ij}^2 \big)^{\beta_{ij}(c)} = \sum_{c \in ([p_1] \times [p_2])^q} \prod_{k=1}^q \sigma_{u_k, v_k} \sigma_{u_{k+1}, v_k} \prod_{(i,j) \in [p_1] \times [p_2]} \mathbb{E} G_{ij}^{\alpha_{ij}(c)} \big( G_{ij}^2 - 1 \big)^{\beta_{ij}(c)}, \tag{59} \]
where $G_{ij} := Z_{ij} / \sigma_{ij}$. Different from (46), the $G_{ij}$ in (59) may not have the $N(0, 1)$ distribution. To overcome this difficulty, we introduce the following lemma to bound $\mathbb{E} G_{ij}^\alpha (G_{ij}^2 - 1)^\beta$ via a Gaussian analogue.
Lemma 7 (Gaussian moments). Suppose $G \sim N(0, 1)$ and $\alpha, \beta$ are non-negative integers. Then
\[ (\alpha + 2\beta - 1)!! \ \ge \ \mathbb{E} G^\alpha (G^2 - 1)^\beta \ \ge \ (\alpha + 2\beta - 3)!! \cdot (\alpha + \beta - 1), \quad \text{if } \alpha \text{ is even}; \qquad \mathbb{E} G^\alpha (G^2 - 1)^\beta = 0, \quad \text{if } \alpha \text{ is odd}. \tag{60} \]
Here, for odd $k$, $k!! = k (k - 2) \cdots 1$; in particular, $(-1)!! = 1$ and $(-3)!! = -1$. More generally, if $Z$ has a symmetric distribution and satisfies
\[ \operatorname{Var}(Z) = 1, \qquad \| Z \|_{\psi_2} = \sup_{q \ge 1} q^{-1/2} \big( \mathbb{E} |Z|^q \big)^{1/q} \le \kappa, \tag{61} \]
then for any integers $\alpha, \beta \ge 0$,
\[ \big| \mathbb{E} Z^\alpha (Z^2 - 1)^\beta \big| \le (C\kappa)^{\alpha + 2\beta} \, \mathbb{E} G^\alpha (G^2 - 1)^\beta \tag{62} \]
for some uniform constant $C > 0$.
Proof of Lemma 7. See Appendix.
Now, combining (59) and (62), we have
\[ \mathbb{E} \operatorname{tr}\big\{ (ZZ^\top - \mathbb{E} ZZ^\top)^q \big\} \le \sum_{c \in ([p_1] \times [p_2])^q} \prod_{k=1}^q \sigma_{u_k, v_k} \sigma_{u_{k+1}, v_k} \prod_{(i,j)} (C\kappa)^{\alpha_{ij}(c) + 2\beta_{ij}(c)} \, \mathbb{E} G^{\alpha_{ij}(c)} \big( G^2 - 1 \big)^{\beta_{ij}(c)} = (C\kappa)^{2q} \sum_{c \in ([p_1] \times [p_2])^q} \prod_{k=1}^q \sigma_{u_k, v_k} \sigma_{u_{k+1}, v_k} \prod_{(i,j)} \mathbb{E} G^{\alpha_{ij}(c)} \big( G^2 - 1 \big)^{\beta_{ij}(c)}, \]
where we used $\sum_{(i,j)} ( \alpha_{ij}(c) + 2 \beta_{ij}(c) ) = 2q$. The rest of the proof proceeds similarly to that of Lemma 1.
Proof of Theorem 3.
Let $b := 2/\alpha \ge 1$ and $E_{ij} := Z_{ij} / \sigma_{ij}$. By definition, we have $\sup_{q \ge 1} q^{-b/2} ( \mathbb{E} |E_{ij}|^q )^{1/q} \le \kappa$. Thus for any $\alpha, \beta \ge 0$,
\[ \big| \mathbb{E} E_{ij}^\alpha (E_{ij}^2 - 1)^\beta \big| \le \big| \mathbb{E} E_{ij}^\alpha (E_{ij}^2 - 1)^\beta 1_{\{|E_{ij}| \le 1\}} \big| + \big| \mathbb{E} E_{ij}^\alpha (E_{ij}^2 - 1)^\beta 1_{\{|E_{ij}| > 1\}} \big| \le 2 \mathbb{E} |E_{ij}|^{\alpha + 2\beta} \le (C\kappa)^{\alpha + 2\beta} (\alpha + 2\beta)^{\frac{b(\alpha + 2\beta)}{2}}. \tag{63} \]
We introduce the following technical lemma.
Lemma 8. Let $G, \tilde G$ be independent $N(0, 1)$ random variables and let the $F_{ij}$ be i.i.d. copies of $G |\tilde G|^{b - 1}$. Then
\[ \mathbb{E} E_{ij}^\alpha (E_{ij}^2 - 1)^\beta \le (C_b \kappa)^{\alpha + 2\beta} \, \mathbb{E} F_{ij}^\alpha (F_{ij}^2 - 1)^\beta. \tag{64} \]
Here $C_b$ is a constant which depends only on $b$.
Proof of Lemma 8. See Appendix.
Now let $G_{ij}, \tilde G_{ij}$ be i.i.d. $N(0, 1)$ and define $F_{ij} = G_{ij} |\tilde G_{ij}|^{b - 1}$. Let $\tilde Z$ be the random matrix with entries $\tilde Z_{ij} = \sigma_{ij} F_{ij}$. Then, by Lemma 8 and arguing as in the proof of Lemma 3, we have
\[ \mathbb{E} \operatorname{tr}\big\{ (ZZ^\top - \mathbb{E} ZZ^\top)^q \big\} \le (C_b \kappa)^{2q} \, \mathbb{E} \operatorname{tr}\big\{ (\tilde Z \tilde Z^\top - \mathbb{E} \tilde Z \tilde Z^\top)^q \big\}. \]
Thus,
\[ \mathbb{E} \big\| ZZ^\top - \mathbb{E} ZZ^\top \big\| \le \Big( \mathbb{E} \operatorname{tr}\big\{ (ZZ^\top - \mathbb{E} ZZ^\top)^q \big\} \Big)^{1/q} \le (C_b \kappa)^2 \Big( \mathbb{E} \operatorname{tr}\big\{ (\tilde Z \tilde Z^\top - \mathbb{E} \tilde Z \tilde Z^\top)^q \big\} \Big)^{1/q}. \tag{65} \]
Let $q = \lceil \log(p_1 \wedge p_2) \rceil$; it now suffices to upper bound $\big( \mathbb{E} \operatorname{tr}\{ (\tilde Z \tilde Z^\top - \mathbb{E} \tilde Z \tilde Z^\top)^q \} \big)^{1/q}$. We define
\[ \tilde\sigma_C^2 = \max_j \sum_{i=1}^{p_1} \sigma_{ij}^2 |\tilde G_{ij}|^{2(b-1)}, \qquad \tilde\sigma_R^2 = \max_i \sum_{j=1}^{p_2} \sigma_{ij}^2 |\tilde G_{ij}|^{2(b-1)}, \qquad \tilde\sigma_*^2 = \max_{i,j} \sigma_{ij}^2 |\tilde G_{ij}|^{2(b-1)}, \]
and apply Theorem 1 conditionally on $\tilde G$:
\[ \mathbb{E} \Big[ \operatorname{tr}\big\{ (\tilde Z \tilde Z^\top - \mathbb{E} \tilde Z \tilde Z^\top)^q \big\} \,\Big|\, \tilde G \Big] \le C^q \Big( \tilde\sigma_C^2 + \tilde\sigma_C \tilde\sigma_R + \tilde\sigma_R \tilde\sigma_* \sqrt{\log(p_1 \wedge p_2)} + \tilde\sigma_*^2 \log(p_1 \wedge p_2) \Big)^q. \]
Then,
\[ \Big( \mathbb{E} \operatorname{tr}\big\{ (\tilde Z \tilde Z^\top - \mathbb{E} \tilde Z \tilde Z^\top)^q \big\} \Big)^{1/q} \le C \Big( \big\| \tilde\sigma_C^2 \big\|_q + \big\| \tilde\sigma_C \tilde\sigma_R \big\|_q + \big\| \tilde\sigma_R \tilde\sigma_* \big\|_q \sqrt{\log(p_1 \wedge p_2)} + \big\| \tilde\sigma_*^2 \big\|_q \log(p_1 \wedge p_2) \Big). \tag{66} \]
Here $\| X \|_q := (\mathbb{E} |X|^q)^{1/q}$ is the $L_q$-norm of the random variable $X$. Now we bound $\| \tilde\sigma_*^2 \|_q$, $\| \tilde\sigma_R^2 \|_q$, and $\| \tilde\sigma_C^2 \|_q$ separately.
• $\| \tilde\sigma_*^2 \|_q$. For any $a > 0$, since
\[ \mathbb{P}\Big( \max_{i,j} |\tilde G_{ij}| > t \Big) \le 2 p_1 p_2 \exp(-t^2/2) \le 2 \exp(-t^2/4), \qquad \forall t > 2\sqrt{\log(p_1 p_2)}, \]
integration yields
\[ \mathbb{E} \max_{i,j} |\tilde G_{ij}|^a = \int_0^\infty \mathbb{P}\Big( \max_{i,j} |\tilde G_{ij}| > t^{1/a} \Big) dt \le \big( 2\sqrt{\log(p_1 p_2)} \big)^a + \int_0^\infty 2 e^{-t^{2/a}/4} dt = (4 \log(p_1 p_2))^{a/2} + a \, 4^{a/2} \, \Gamma(a/2). \]
Then it follows that
\[ \big\| \tilde\sigma_*^2 \big\|_q \le \sigma_*^2 \Big( \mathbb{E} \max_{i,j} |\tilde G_{ij}|^{2(b-1)q} \Big)^{1/q} \lesssim \sigma_*^2 \Big( \log^{b-1}(p_1 p_2) + q^{b-1} \Big) \lesssim \sigma_*^2 \log^{b-1}(p_1 \vee p_2). \tag{67} \]
• $\| \tilde\sigma_C^2 \|_q$ and $\| \tilde\sigma_R^2 \|_q$. By the moment bound for the supremum of an empirical process [9, Theorem 11],
\[ \big\| \tilde\sigma_C^2 \big\|_q = \Big\| \max_j \sum_{i=1}^{p_1} \sigma_{ij}^2 |\tilde G_{ij}|^{2(b-1)} \Big\|_q \lesssim \mathbb{E} \max_j \sum_{i=1}^{p_1} \sigma_{ij}^2 |\tilde G_{ij}|^{2(b-1)} + q \big\| \tilde\sigma_*^2 \big\|_q \le \mathbb{E} \max_j \sum_{i=1}^{p_1} \Big( \sigma_{ij}^2 |\tilde G_{ij}|^{2(b-1)} - \mathbb{E} \sigma_{ij}^2 |\tilde G_{ij}|^{2(b-1)} \Big) + C_b \sigma_C^2 + q \big\| \tilde\sigma_*^2 \big\|_q. \tag{68} \]
Denote $Y_j = \sum_{i=1}^{p_1} \big( \sigma_{ij}^2 |\tilde G_{ij}|^{2(b-1)} - \mathbb{E} \sigma_{ij}^2 |\tilde G_{ij}|^{2(b-1)} \big)$; it suffices to bound $\mathbb{E} \max_j Y_j$. To this end, we use the generalized Bernstein-Orlicz norm introduced in [19]. For a random variable $X$, let $\| X \|_{\Psi_{\alpha, L}} := \inf\{ \eta > 0 : \mathbb{E} [\Psi_{\alpha, L}(|X|/\eta)] \le 1 \}$ be the $\Psi_{\alpha, L}$-norm, where $\Psi_{\alpha, L}$ is defined via its inverse function
\[ \Psi_{\alpha, L}^{-1}(t) := \sqrt{\log(1 + t)} + L (\log(1 + t))^{1/\alpha}, \qquad \forall t \ge 0. \]
Now fix $j \in [p_2]$, let $\alpha = 1/(b - 1)$, and choose $L$ as in [19] (of order $\sigma_*^2 \big( \sum_{i=1}^{p_1} \sigma_{ij}^4 \big)^{-1/2}$ up to a factor depending only on $b$). By [19, Theorem 3.1],
\[ \| Y_j \|_{\Psi_{\alpha, L}} \le C \sqrt{\sum_{i=1}^{p_1} \sigma_{ij}^4}; \qquad \mathbb{P}\Big( |Y_j| \ge C \sqrt{\sum_{i=1}^{p_1} \sigma_{ij}^4} \, \big\{ \sqrt{t} + L t^{1/\alpha} \big\} \Big) \le 2 e^{-t}, \quad t \ge 0. \]
Consequently,
\[ \mathbb{P}\big( |Y_j| \ge C \big\{ \sigma_C \sigma_* \sqrt{t} + \sigma_*^2 t^{b-1} \big\} \big) \le 2 e^{-t}, \quad t \ge 0, \]
which can be rewritten as
\[ \mathbb{P}\big( |Y_j| \ge t \big) \le 2 \exp\Big( -c \Big( \frac{t^2}{\sigma_C^2 \sigma_*^2} \wedge \Big( \frac{t}{\sigma_*^2} \Big)^{1/(b-1)} \Big) \Big), \quad t \ge 0. \]
Applying a union bound, we get
\[ \mathbb{P}\Big( \max_j |Y_j| \ge t \Big) \le 2 p_2 \exp\Big( -c \Big( \frac{t^2}{\sigma_C^2 \sigma_*^2} \wedge \Big( \frac{t}{\sigma_*^2} \Big)^{1/(b-1)} \Big) \Big), \quad t \ge 0. \]
It now follows that
\[ \mathbb{E} \max_j Y_j \le \mathbb{E} \max_j |Y_j| = \int_0^\infty \mathbb{P}\Big( \max_j |Y_j| > t \Big) dt \le C \Big( \sigma_C \sigma_* \sqrt{\log p_2} + \sigma_*^2 \log^{b-1} p_2 \Big) + \int_{C ( \sigma_C \sigma_* \sqrt{\log p_2} + \sigma_*^2 \log^{b-1} p_2 )}^\infty \Big( \exp\Big( - \frac{c t^2}{\sigma_C^2 \sigma_*^2} \Big) + \exp\Big( - \frac{c t^{1/(b-1)}}{\sigma_*^{2/(b-1)}} \Big) \Big) dt \lesssim \sigma_C \sigma_* \sqrt{\log p_2} + \sigma_*^2 \log^{b-1} p_2 \lesssim \sigma_C^2 + \sigma_*^2 \log^{b-1} p_2. \]
Combining with (68), we obtain
\[ \big\| \tilde\sigma_C^2 \big\|_q \lesssim \sigma_C^2 + \sigma_*^2 \log^{b-1}(p_1 \vee p_2) \log(p_1 \wedge p_2). \tag{69} \]
Similarly, we can obtain
\[ \big\| \tilde\sigma_R^2 \big\|_q \lesssim \sigma_R^2 + \sigma_*^2 \log^{b-1}(p_1 \vee p_2) \log(p_1 \wedge p_2). \tag{70} \]
Combining (66), (67), (69), (70) and applying the Cauchy-Schwarz inequality, we obtain
\[ \mathbb{E} \big\| ZZ^\top - \mathbb{E} ZZ^\top \big\| \lesssim \sigma_C^2 + \sigma_R \sigma_C + \sigma_R \sigma_* \log^{(b-1)/2}(p_1 \vee p_2) \sqrt{\log(p_1 \wedge p_2)} + \sigma_*^2 \log^{b-1}(p_1 \vee p_2) \log(p_1 \wedge p_2). \]
This completes the proof.
Proof of Theorem 4.
We first prove the following comparison lemma.
Lemma 9.
Suppose $Z$ is a $p_1$-by-$p_2$ random matrix with independent entries satisfying $\mathbb{E} Z_{ij} = 0$, $\operatorname{Var}(Z_{ij}) = \sigma_{ij}^2$, and $|Z_{ij}| \le 1$. $H$ is an $m_1$-by-$m_2$ random matrix with i.i.d. standard Gaussian entries. When $q \ge 2$, $m_1 = \lceil \sigma_C^2 \rceil + q - 1$, and $m_2 = \lceil \sigma_R^2 \rceil + q - 1$, we have
\[ \mathbb{E} \operatorname{tr}\big\{ (ZZ^\top - \mathbb{E} ZZ^\top)^q \big\} \le \Big( \frac{p_1}{m_1} \wedge \frac{p_2}{m_2} \Big) \mathbb{E} \operatorname{tr}\big\{ (HH^\top - \mathbb{E} HH^\top)^q \big\}. \]
Proof. Recall $Z \in \mathbb{R}^{p_1 \times p_2}$, $|Z_{ij}| \le 1$, $\mathbb{E} Z_{ij} = 0$, $\operatorname{Var}(Z_{ij}) = \sigma_{ij}^2$. As in the proof of Lemma 1, let $c = (u_1, v_1, \ldots, u_q, v_q) \in ([p_1] \times [p_2])^q$ be a cycle of length $2q$ on the bipartite graph $[p_1] \to [p_2]$, and let $\alpha_{ij}(c)$ and $\beta_{ij}(c)$ be defined as in (45). We similarly have the expansion
\[ \mathbb{E} \operatorname{tr}\big\{ (ZZ^\top - \mathbb{E} ZZ^\top)^q \big\} = \sum_{u_1, \ldots, u_q \in [p_1]} \mathbb{E} \prod_{j=1}^q \big( ZZ^\top - \mathbb{E} ZZ^\top \big)_{u_j, u_{j+1}} = \sum_{\substack{u_1, \ldots, u_q \in [p_1] \\ v_1, \ldots, v_q \in [p_2]}} \mathbb{E} \prod_{j=1}^q \Big( Z_{u_j, v_j} Z_{u_{j+1}, v_j} - \sigma_{u_j, v_j}^2 1_{\{u_j = u_{j+1}\}} \Big) = \sum_{c \in ([p_1] \times [p_2])^q} \prod_{(i,j) \in [p_1] \times [p_2]} \mathbb{E} Z_{ij}^{\alpha_{ij}(c)} \big( Z_{ij}^2 - \sigma_{ij}^2 \big)^{\beta_{ij}(c)}. \]
Since $Z_{ij}$ is symmetrically distributed and $\mathbb{E} Z_{ij}^2 = \sigma_{ij}^2$, we have
\[ \mathbb{E} Z_{ij}^\alpha \big( Z_{ij}^2 - \sigma_{ij}^2 \big)^\beta = 0, \qquad \text{if } \alpha \text{ is odd or } (\alpha, \beta) = (0, 1). \]
For any $(i, j) \in [p_1] \times [p_2]$, we note that $0 \le Z_{ij}^2 \le 1$ and $|Z_{ij}^2 - \sigma_{ij}^2| \le 1$. If $\alpha \ge 2$ and $\alpha$ is even,
\[ \mathbb{E} \big| Z_{ij}^{\alpha_{ij}(c)} (Z_{ij}^2 - \sigma_{ij}^2)^{\beta_{ij}(c)} \big| = \mathbb{E} Z_{ij}^2 \cdot \big| Z_{ij}^{\alpha_{ij}(c) - 2} (Z_{ij}^2 - \sigma_{ij}^2)^{\beta_{ij}(c)} \big| \le \mathbb{E} Z_{ij}^2 = \sigma_{ij}^2; \]
if $\alpha = 0$ and $\beta \ge 2$, one has
\[ \mathbb{E} \big| Z_{ij}^{\alpha_{ij}(c)} (Z_{ij}^2 - \sigma_{ij}^2)^{\beta_{ij}(c)} \big| \le \mathbb{E} (Z_{ij}^2 - \sigma_{ij}^2)^2 \cdot \big| (Z_{ij}^2 - \sigma_{ij}^2)^{\beta_{ij}(c) - 2} \big| \le \mathbb{E} \big( Z_{ij}^2 - \sigma_{ij}^2 \big)^2 \le \mathbb{E} Z_{ij}^2 \cdot \| Z_{ij} \|_\infty^2 - \sigma_{ij}^4 = \sigma_{ij}^2 - \sigma_{ij}^4 \le \sigma_{ij}^2. \]
Therefore, for any $\alpha, \beta \ge 0$, we have
\[ \mathbb{E} Z_{ij}^\alpha \big( Z_{ij}^2 - \sigma_{ij}^2 \big)^\beta \begin{cases} \le \sigma_{ij}^2 \cdot \mathbb{E} G^\alpha \big( G^2 - 1 \big)^\beta, & \alpha \text{ is even and } (\alpha, \beta) \ne (0, 1), (0, 0); \\ = 1, & \alpha = 0, \ \beta = 0; \\ = 0, & \alpha \text{ is odd or } (\alpha, \beta) = (0, 1). \end{cases} \]
Here, $G \sim N(0, 1)$. Therefore,
\[ \mathbb{E} \operatorname{tr}\big\{ (ZZ^\top - \mathbb{E} ZZ^\top)^q \big\} \le \sum_{c \in ([p_1] \times [p_2])^q} \prod_{(i,j) \in [p_1] \times [p_2]} \sigma_{ij}^{2 \cdot 1_{\{(\alpha_{ij}(c), \beta_{ij}(c)) \ne (0, 0)\}}} \, \mathbb{E} G^{\alpha_{ij}(c)} \big( G^2 - 1 \big)^{\beta_{ij}(c)}. \]
Let $s$ be the shape of a cycle $c \in ([p_1] \times [p_2])^q$, let $m_L(s)$ and $m_R(s)$ be the numbers of distinct left and right nodes, respectively, visited by any $c$ with shape $s$, and let $m_{\alpha, \beta}(s) = m_{\alpha, \beta}(c)$ be defined as in (47). Then,
\[ \mathbb{E} \operatorname{tr}\big\{ (ZZ^\top - \mathbb{E} ZZ^\top)^q \big\} \le \sum_s \sum_{c : s(c) = s} \prod_{\substack{(i,j) : c \text{ passes } (i,j) \\ \text{a positive even number of times}}} \sigma_{ij}^2 \cdot \prod_{\substack{\alpha, \beta \ge 0 \\ \alpha \text{ even}}} \big\{ \mathbb{E} G^\alpha (G^2 - 1)^\beta \big\}^{m_{\alpha, \beta}(s)} \le \sum_s \Big( p_1 \sigma_C^{2(m_L(s) - 1)} \sigma_R^{2 m_R(s)} \wedge p_2 \sigma_C^{2 m_L(s)} \sigma_R^{2(m_R(s) - 1)} \Big) \cdot \prod_{\substack{\alpha, \beta \ge 0 \\ \alpha \text{ even}}} \big\{ \mathbb{E} G^\alpha (G^2 - 1)^\beta \big\}^{m_{\alpha, \beta}(s)}. \]
On the other hand, we have
\[ \mathbb{E} \operatorname{tr}\big( (HH^\top - \mathbb{E} HH^\top)^q \big) = \mathbb{E} \operatorname{tr}\big( (HH^\top - m_2 I_{m_1})^q \big) = \sum_s m_1 (m_1 - 1) \cdots (m_1 - m_L(s) + 1) \cdot m_2 (m_2 - 1) \cdots (m_2 - m_R(s) + 1) \cdot \prod_{\alpha, \beta \ge 0} \big\{ \mathbb{E} G^\alpha (G^2 - 1)^\beta \big\}^{m_{\alpha, \beta}(s)}. \]
Provided that $m_1 = \lceil \sigma_C^2 \vee 1 \rceil + q - 1$ and $m_2 = \lceil \sigma_R^2 \vee 1 \rceil + q - 1$, we have
\[ \sigma_C^{2(m_L(s) - 1)} \le \frac{m_1 (m_1 - 1) \cdots (m_1 - m_L(s) + 1)}{m_1}, \qquad \sigma_R^{2 m_R(s)} \le m_2 (m_2 - 1) \cdots (m_2 - m_R(s) + 1); \]
\[ \sigma_C^{2 m_L(s)} \le m_1 (m_1 - 1) \cdots (m_1 - m_L(s) + 1), \qquad \sigma_R^{2(m_R(s) - 1)} \le \frac{m_2 (m_2 - 1) \cdots (m_2 - m_R(s) + 1)}{m_2}. \]
Thus
\[ \mathbb{E} \operatorname{tr}\big\{ (ZZ^\top - \mathbb{E} ZZ^\top)^q \big\} \le \Big( \frac{p_1}{m_1} \wedge \frac{p_2}{m_2} \Big) \cdot \mathbb{E} \operatorname{tr}\big\{ (HH^\top - \mathbb{E} HH^\top)^q \big\}, \]
which finishes the proof of this lemma.
Assume $B = 1$ without loss of generality. With Lemma 9 and Lemma 2 in hand, the proof of Theorem 4 is the same as that of Theorem 1.
Proof of Theorem 5.
Without loss of generality, we assume $\sigma_* = 1$. Let $q \ge 2$, $m_1 = \lceil \sigma_C^2 \rceil + qb - 1$, $m_2 = \lceil \sigma_R^2 \rceil + qb - 1$. By Lemmas 1 and 2,
\[ \mathbb{E} \operatorname{tr}\big\{ (ZZ^\top - \mathbb{E} ZZ^\top)^{qb} \big\} \le \Big( \frac{p_1}{m_1} \wedge \frac{p_2}{m_2} \Big) \mathbb{E} \operatorname{tr}\big( (HH^\top - \mathbb{E} HH^\top)^{qb} \big) \le \Big( \frac{p_1}{m_1} \wedge \frac{p_2}{m_2} \Big) (m_1 \wedge m_2) \Big( 2\sqrt{m_1 m_2} + m_1 + 4(\sqrt{m_1} + \sqrt{m_2}) \sqrt{qb} + 2qb \Big)^{qb} \le (p_1 \wedge p_2) \Big( 2\sqrt{m_1 m_2} + m_1 + 4(\sqrt{m_1} + \sqrt{m_2}) \sqrt{qb} + 2qb \Big)^{qb}. \]
Thus,
\[ \mathbb{E} \big\| ZZ^\top - \mathbb{E} ZZ^\top \big\|^b \le \Big( \mathbb{E} \operatorname{tr}\big\{ (ZZ^\top - \mathbb{E} ZZ^\top)^{qb} \big\} \Big)^{1/q} \le (p_1 \wedge p_2)^{1/q} \Big( 2\sqrt{m_1 m_2} + m_1 + 4(\sqrt{m_1} + \sqrt{m_2}) \sqrt{qb} + 2qb \Big)^b = \Big\{ (p_1 \wedge p_2)^{1/(qb)} \Big( 2\sqrt{m_1 m_2} + m_1 + 4(\sqrt{m_1} + \sqrt{m_2}) \sqrt{qb} + 2qb \Big) \Big\}^b \le \Big\{ C (p_1 \wedge p_2)^{1/(qb)} \big( \sigma_R \sigma_C + \sigma_C^2 + (\sigma_R + \sigma_C) \sqrt{qb} + qb \big) \Big\}^b. \]
We set $q = 2\lceil \log(p_1 \wedge p_2)/b \rceil$ and consider the following two cases.
1. If $b \ge \log(p_1 \wedge p_2)$, we have $q = 2$ and
\[ \mathbb{E} \big\| ZZ^\top - \mathbb{E} ZZ^\top \big\|^b \le \Big\{ C (p_1 \wedge p_2)^{1/(2b)} \big( \sigma_R \sigma_C + \sigma_C^2 + (\sigma_R + \sigma_C) \sqrt{b} + b \big) \Big\}^b \le \Big\{ C \Big( \big( \sigma_C + \sigma_R + \sqrt{b} \big)^2 - \sigma_R^2 \Big) \Big\}^b. \]
2. If $b < \log(p_1 \wedge p_2)$, we have
\[ 2\log(p_1 \wedge p_2)/b \le q = 2\lceil \log(p_1 \wedge p_2)/b \rceil \le 2\big( \log(p_1 \wedge p_2)/b + 1 \big) \le 4\log(p_1 \wedge p_2)/b. \]
Then,
\[ \mathbb{E} \big\| ZZ^\top - \mathbb{E} ZZ^\top \big\|^b \le \Big\{ C (p_1 \wedge p_2)^{1/(2\log(p_1 \wedge p_2))} \Big( \sigma_R \sigma_C + \sigma_C^2 + (\sigma_R + \sigma_C) \sqrt{4\log(p_1 \wedge p_2)} + 4\log(p_1 \wedge p_2) \Big) \Big\}^b \le \Big\{ C \Big( \big( \sigma_C + \sigma_R + \sqrt{\log(p_1 \wedge p_2)} \big)^2 - \sigma_R^2 \Big) \Big\}^b. \]
In summary, there exists a uniform constant $C > 0$ such that
\[ \mathbb{E} \big\| ZZ^\top - \mathbb{E} ZZ^\top \big\|^b \le \Big\{ C \Big( \big( \sigma_C + \sigma_R + \sqrt{b \vee \log(p_1 \wedge p_2)} \big)^2 - \sigma_R^2 \Big) \Big\}^b. \]
In fact, the statement holds for all $b > 0$ by the monotonicity of the $L_b$-norms of $\| ZZ^\top - \mathbb{E} ZZ^\top \|$. Let $C_2$ be a to-be-specified constant. By Markov's inequality,
\[ \mathbb{P}\Big( \big\| ZZ^\top - \mathbb{E} ZZ^\top \big\| \ge C_2 \Big( \big( \sigma_C + \sigma_R + \sqrt{\log(p_1 \wedge p_2)} + x \big)^2 - \sigma_R^2 \Big) \Big) \le \frac{ \mathbb{E} \big\| ZZ^\top - \mathbb{E} ZZ^\top \big\|^b }{ \Big\{ C_2 \Big( \big( \sigma_C + \sigma_R + \sqrt{\log(p_1 \wedge p_2)} + x \big)^2 - \sigma_R^2 \Big) \Big\}^b } \le \Bigg( \frac{ C \Big( \big( \sigma_C + \sigma_R + \sqrt{b \vee \log(p_1 \wedge p_2)} \big)^2 - \sigma_R^2 \Big) }{ C_2 \Big( \big( \sigma_C + \sigma_R + \sqrt{\log(p_1 \wedge p_2)} + x \big)^2 - \sigma_R^2 \Big) } \Bigg)^b. \]
We set $b = x$ and $C_2 = e C$; since $\sqrt{b} = \sqrt{x} \le x$ for $x \ge 1$, we have
\[ \mathbb{P}\Big( \big\| ZZ^\top - \mathbb{E} ZZ^\top \big\| \ge C_2 \Big( \big( \sigma_C + \sigma_R + \sqrt{\log(p_1 \wedge p_2)} + x \big)^2 - \sigma_R^2 \Big) \Big) \le \Bigg( \frac{ C \Big( \big( \sigma_C + \sigma_R + \sqrt{\log(p_1 \wedge p_2)} + \sqrt{b} \big)^2 - \sigma_R^2 \Big) }{ e C \Big( \big( \sigma_C + \sigma_R + \sqrt{\log(p_1 \wedge p_2)} + x \big)^2 - \sigma_R^2 \Big) } \Bigg)^b \le e^{-b} = \exp(-x). \]
Therefore, we have finished the proof of this theorem.
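The last step converts the moment bound into a tail bound by taking $b = x$; the elementary inequality pattern it relies on, $\big( (A + \sqrt{b}) / (e (A + x)) \big)^b \le e^{-x}$ for $b = x \ge 1$ and $A \ge 0$, can be checked directly (the grid of values below is arbitrary):

```python
import math

# verify ((A + sqrt(b)) / (e * (A + x)))**b <= exp(-x) when b = x >= 1:
# since sqrt(x) <= x, the bracket is at most 1/e, so the power is at most e^{-x}
for A in (0.5, 1.0, 10.0, 200.0):
    for x in (1.0, 2.0, 7.5, 40.0):
        b = x
        ratio = ((A + math.sqrt(b)) / (math.e * (A + x))) ** b
        assert ratio <= math.exp(-x) * (1 + 1e-12)
```
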
Proof of Lemma 4.
The proof of this lemma relies on a more careful counting scheme for each cycle.For convenience, we define˜ σ ij = Var( ˜ G ij ) = ( σ ij , ≤ i ≤ p − , ≤ j ≤ p ; σ p − ,j + σ p ,j , i = p − , ≤ j ≤ p . ( G ) ij = G ij /σ ij , ≤ i ≤ p , ≤ j ≤ p ; ( ˜ G ) ij = ˜ G ij / ˜ σ ij , ≤ i ≤ p − , ≤ j ≤ p as the variances and standardizations of each entry of G and ˜ G . Since the proof is lengthy, wedivide into steps for a better presentation.Step 1 In this step, we consider the expansions for both E tr( GG ⊤ − E GG ⊤ ) q and E tr( ˜ G ˜ G ⊤ − E ˜ G ˜ G ⊤ ) q , E tr n ( GG ⊤ − E GG ⊤ ) q o = E X u ,...,u q ∈ [ p ] q Y k =1 (cid:16) GG ⊤ − E GG ⊤ (cid:17) u k ,u k +1 = X u ,...,u q ∈ [ p ] v ,...,v q ∈ [ p ] E q Y k =1 σ u k ,v k σ u k +1 ,v k (cid:0) ( G ) u k ,v k ( G ) u k +1 ,v k − { u k = u k +1 } (cid:1) = X Ω ⊆ [ q ] X u Ω c ∈ [ p − X v ,...,v q ∈ [ p ] · X u Ω ∈{ p − ,p } E q Y k =1 σ u k ,v k σ u k +1 ,v k (cid:0) ( G ) u k ,v k ( G ) u k +1 ,v k − { u k = u k +1 } (cid:1) . (71)Here u q +1 := u . Similarly, E tr n ( ˜ G ˜ G ⊤ − E ˜ G ˜ G ⊤ ) q o = E X u ,...,u q ∈ [ p − q Y k =1 (cid:16) ˜ G ˜ G ⊤ − E ˜ G ˜ G ⊤ (cid:17) u k ,u k +1 = X Ω ⊆ [ q ] X u Ω c ∈ [ p − X v ,...,v q ∈ [ p ] · X u Ω = p − E q Y k =1 ˜ σ u k ,v k ˜ σ u k +1 ,v k (cid:16) ( ˜ G ) u k ,v k ( ˜ G ) u k +1 ,v k − { u k = u k +1 } (cid:17) . v , . . . , v q ∈ [ p ], Ω ⊆ [ q ] , u Ω c ∈ [ p − X u Ω ∈{ p − ,p } E q Y k =1 σ u k ,v k σ u k +1 ,v k (cid:0) ( G ) u k ,v k ( G ) u k +1 ,v k − { u k = u k +1 } (cid:1) ≤ E q Y k =1 ˜ σ ˜ u k ,v k ˜ σ ˜ u k +1 ,v k (cid:16) ( ˜ G ) ˜ u k ,v k ( ˜ G ) ˜ u k +1 ,v k − { ˜ u k =˜ u k +1 } (cid:17) . (72)Here, ˜ u k = u k ∈ [ p − , if k ∈ Ω c ; ˜ u k = p − , if k ∈ Ω . (73)Step 2 To prove (72), we shall first recall that the definition of u , . . . , u q , v , . . . , v q are cyclic, i.e., u = u q +1 , we also denote v = v q . 
Thus,
$$\begin{aligned}
\sum_{u_\Omega\in\{p_1-1,p_1\}}\prod_{k=1}^q \sigma_{u_k,v_k}\sigma_{u_{k+1},v_k}
&= \sum_{u_\Omega\in\{p_1-1,p_1\}}\prod_{k=1}^q \sigma_{u_k,v_k}\sigma_{u_k,v_{k-1}}\\
&= \prod_{k\in\Omega^c}\sigma_{u_k,v_k}\sigma_{u_k,v_{k-1}}\cdot\Big(\prod_{k\in\Omega}\sigma_{p_1-1,v_k}\sigma_{p_1-1,v_{k-1}}+\prod_{k\in\Omega}\sigma_{p_1,v_k}\sigma_{p_1,v_{k-1}}\Big)\\
&\le \prod_{k\in\Omega^c}\sigma_{u_k,v_k}\sigma_{u_k,v_{k-1}}\cdot\prod_{k\in\Omega}\big(\sigma_{p_1-1,v_k}\sigma_{p_1-1,v_{k-1}}+\sigma_{p_1,v_k}\sigma_{p_1,v_{k-1}}\big)\\
&\overset{\text{Cauchy--Schwarz}}{\le} \prod_{k\in\Omega^c}\tilde\sigma_{u_k,v_k}\tilde\sigma_{u_k,v_{k-1}}\cdot\prod_{k\in\Omega}\Big(\big(\sigma_{p_1-1,v_k}^2+\sigma_{p_1,v_k}^2\big)\big(\sigma_{p_1-1,v_{k-1}}^2+\sigma_{p_1,v_{k-1}}^2\big)\Big)^{1/2}\\
&= \prod_{k\in\Omega^c}\tilde\sigma_{u_k,v_k}\tilde\sigma_{u_k,v_{k-1}}\cdot\prod_{k\in\Omega}\tilde\sigma_{p_1-1,v_k}\tilde\sigma_{p_1-1,v_{k-1}}
= \sum_{u_\Omega=p_1-1}\prod_{k=1}^q \tilde\sigma_{\tilde u_k,v_k}\tilde\sigma_{\tilde u_k,v_{k-1}}
= \sum_{u_\Omega=p_1-1}\prod_{k=1}^q \tilde\sigma_{\tilde u_k,v_k}\tilde\sigma_{\tilde u_{k+1},v_k}. \qquad (74)
\end{aligned}$$

Step 3. For any fixed $\Omega=\{k: u_k\in\{p_1-1,p_1\}\}$ and a cycle $c=(u_1\to v_1\to u_2\to v_2\to\cdots\to u_q\to v_q\to u_1)$ such that $u_\Omega\in\{p_1-1,p_1\}$ and $u_{\Omega^c}\in[p_1-2]$, where $\tilde u_k$ is defined as in (73), we aim to show in this step that
$$\mathbb{E}\prod_{k=1}^q\big((\bar G)_{u_k,v_k}(\bar G)_{u_{k+1},v_k}-\mathbb{1}_{\{u_k=u_{k+1}\}}\big)\le \mathbb{E}\prod_{k=1}^q\big((\bar{\tilde G})_{\tilde u_k,v_k}(\bar{\tilde G})_{\tilde u_{k+1},v_k}-\mathbb{1}_{\{\tilde u_k=\tilde u_{k+1}\}}\big). \qquad (75)$$
We can rearrange the left-hand side and the right-hand side of (75) as
$$\mathbb{E}\prod_{i=1}^{p_1}\prod_{j=1}^{p_2}(\bar G)_{ij}^{\alpha_{ij}}\big((\bar G)_{ij}^2-1\big)^{\beta_{ij}}
\qquad\text{and}\qquad
\mathbb{E}\prod_{i=1}^{p_1-1}\prod_{j=1}^{p_2}(\bar{\tilde G})_{ij}^{\tilde\alpha_{ij}}\big((\bar{\tilde G})_{ij}^2-1\big)^{\tilde\beta_{ij}},$$
respectively. Here, $\alpha_{ij},\beta_{ij},\tilde\alpha_{ij}$, and $\tilde\beta_{ij}$ are defined as
$$\begin{aligned}
\alpha_{ij}&=\big|\{k:(u_k=i,\,v_k=j,\,u_{k+1}\ne i)\ \text{or}\ (u_k\ne i,\,v_k=j,\,u_{k+1}=i)\}\big|,\\
\beta_{ij}&=\big|\{k: u_k=u_{k+1}=i,\ v_k=j\}\big|,\\
\tilde\alpha_{ij}&=\big|\{k:(\tilde u_k=i,\,v_k=j,\,\tilde u_{k+1}\ne i)\ \text{or}\ (\tilde u_k\ne i,\,v_k=j,\,\tilde u_{k+1}=i)\}\big|,\\
\tilde\beta_{ij}&=\big|\{k:\tilde u_k=\tilde u_{k+1}=i,\ v_k=j\}\big|.
\end{aligned}$$
In words, $\alpha_{ij}$ (or $\tilde\alpha_{ij}$) is the number of times that the edge $(i,j)$ is visited exactly once by a sub-path $u_k\to v_k\to u_{k+1}$ (or $\tilde u_k\to v_k\to \tilde u_{k+1}$); $\beta_{ij}$ (or $\tilde\beta_{ij}$) is the number of times that the edge $(i,j)$ is visited twice (back and forth) by a sub-path $u_k\to v_k\to u_{k+1}$ (or $\tilde u_k\to v_k\to\tilde u_{k+1}$). By comparing the orders of $(\bar{\tilde G})_{ij}$ and $(\bar G)_{ij}$ in the two monomials in (75), $\tilde\alpha_{ij},\tilde\beta_{ij},\alpha_{ij},\beta_{ij}$ are related as
$$\tilde\alpha_{ij}=\alpha_{ij},\qquad \tilde\beta_{ij}=\beta_{ij},\qquad \text{if } 1\le i\le p_1-2,\ 1\le j\le p_2.$$
The relationship among $\tilde\alpha_{p_1-1,j},\tilde\beta_{p_1-1,j},\alpha_{p_1-1,j},\alpha_{p_1,j},\beta_{p_1-1,j},\beta_{p_1,j}$ is more involved. To analyze them, for any fixed $1\le j\le p_2$ we define
$$\begin{aligned}
x_1^{(j)}&=\big|\{k:(u_k\to v_k\to u_{k+1})=\big((p_1-1)\to j\to \{p_1-1,p_1\}^c\big)\ \text{or}\ \big(\{p_1-1,p_1\}^c\to j\to (p_1-1)\big)\}\big|,\\
x_2^{(j)}&=\big|\{k:(u_k\to v_k\to u_{k+1})=\big(p_1\to j\to \{p_1-1,p_1\}^c\big)\ \text{or}\ \big(\{p_1-1,p_1\}^c\to j\to p_1\big)\}\big|,\\
x_3^{(j)}&=\big|\{k:(u_k\to v_k\to u_{k+1})=\big((p_1-1)\to j\to (p_1-1)\big)\}\big|,\\
x_4^{(j)}&=\big|\{k:(u_k\to v_k\to u_{k+1})=\big(p_1\to j\to p_1\big)\}\big|,\\
x_5^{(j)}&=\big|\{k:(u_k\to v_k\to u_{k+1})=\big((p_1-1)\to j\to p_1\big)\ \text{or}\ \big(p_1\to j\to (p_1-1)\big)\}\big|.
\end{aligned}$$
Then by the definitions, we have
$$\alpha_{p_1-1,j}=x_1^{(j)}+x_5^{(j)},\qquad \alpha_{p_1,j}=x_2^{(j)}+x_5^{(j)},\qquad \tilde\alpha_{p_1-1,j}=x_1^{(j)}+x_2^{(j)};$$
$$\beta_{p_1-1,j}=x_3^{(j)},\qquad \beta_{p_1,j}=x_4^{(j)},\qquad \tilde\beta_{p_1-1,j}=x_3^{(j)}+x_4^{(j)}+x_5^{(j)}.$$
We introduce the following lemma before we proceed.
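Before stating the lemma, the bookkeeping identities relating $\alpha,\beta$ to $x_1^{(j)},\dots,x_5^{(j)}$ can be verified by brute force on random cycles. This is an illustrative script (the sizes and helper names are ours, not the paper's): it counts edge visits directly and checks the six identities when the last two rows are merged.

```python
import random
from collections import Counter

random.seed(1)
p1, p2, q = 5, 4, 8   # rows p1-1 and p1 of G get merged in G-tilde

def alpha_beta(us, vs):
    """alpha: edges visited exactly once by a sub-path u_k -> v_k -> u_{k+1};
       beta: edges visited twice (back and forth) by a sub-path."""
    a, b = Counter(), Counter()
    for k in range(len(vs)):
        u, v, un = us[k], vs[k], us[(k + 1) % len(vs)]
        if u == un:
            b[(u, v)] += 1
        else:
            a[(u, v)] += 1
            a[(un, v)] += 1
    return a, b

checks = 0
for _ in range(200):
    us = [random.randint(1, p1) for _ in range(q)]
    vs = [random.randint(1, p2) for _ in range(q)]
    mus = [min(u, p1 - 1) for u in us]          # merged row indices u-tilde
    a, b = alpha_beta(us, vs)
    ta, tb = alpha_beta(mus, vs)
    for j in range(1, p2 + 1):
        x1 = x2 = x3 = x4 = x5 = 0
        for k in range(q):
            u, v, un = us[k], vs[k], us[(k + 1) % q]
            if v != j:
                continue
            ends = {u, un}
            if ends == {p1 - 1, p1}:   x5 += 1
            elif u == un == p1 - 1:    x3 += 1
            elif u == un == p1:        x4 += 1
            elif p1 - 1 in ends:       x1 += 1
            elif p1 in ends:           x2 += 1
        assert a[(p1 - 1, j)] == x1 + x5 and a[(p1, j)] == x2 + x5
        assert b[(p1 - 1, j)] == x3 and b[(p1, j)] == x4
        assert ta[(p1 - 1, j)] == x1 + x2 and tb[(p1 - 1, j)] == x3 + x4 + x5
        checks += 1
print(checks, "cases checked")
```
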
Lemma 10.
Suppose $Z_1, Z_2$ are independent and symmetrically distributed random variables with $\mathrm{Var}(Z_1)=\mathrm{Var}(Z_2)=1$ and $\|Z_1\|_{\psi_2},\|Z_2\|_{\psi_2}\le\kappa$, and $G$ is standard Gaussian distributed. For any non-negative integers $x_1,\dots,x_5$, we have
$$\big|\mathbb{E}\, Z_1^{x_1+x_5}Z_2^{x_2+x_5}(Z_1^2-1)^{x_3}(Z_2^2-1)^{x_4}\big|\le (C\kappa)^{x_1+x_2+2(x_3+x_4+x_5)}\,\mathbb{E}\, G^{x_1+x_2}(G^2-1)^{x_3+x_4+x_5}. \qquad (76)$$
Especially, when $Z_1, Z_2, G$ are all standard Gaussian,
$$\big|\mathbb{E}\, Z_1^{x_1+x_5}Z_2^{x_2+x_5}(Z_1^2-1)^{x_3}(Z_2^2-1)^{x_4}\big|\le \mathbb{E}\, G^{x_1+x_2}(G^2-1)^{x_3+x_4+x_5}. \qquad (77)$$

Proof.
See the Appendix. $\square$

By Lemma 10,
$$\begin{aligned}
&\Big|\mathbb{E}(\bar G)_{p_1-1,j}^{\alpha_{p_1-1,j}}\big((\bar G)_{p_1-1,j}^2-1\big)^{\beta_{p_1-1,j}}\Big|\cdot\Big|\mathbb{E}(\bar G)_{p_1,j}^{\alpha_{p_1,j}}\big((\bar G)_{p_1,j}^2-1\big)^{\beta_{p_1,j}}\Big|\\
&\qquad=\Big|\mathbb{E}(\bar G)_{p_1-1,j}^{x_1^{(j)}+x_5^{(j)}}\big((\bar G)_{p_1-1,j}^2-1\big)^{x_3^{(j)}}\Big|\cdot\Big|\mathbb{E}(\bar G)_{p_1,j}^{x_2^{(j)}+x_5^{(j)}}\big((\bar G)_{p_1,j}^2-1\big)^{x_4^{(j)}}\Big|\\
&\qquad\le \mathbb{E}(\bar{\tilde G})_{p_1-1,j}^{x_1^{(j)}+x_2^{(j)}}\big((\bar{\tilde G})_{p_1-1,j}^2-1\big)^{x_3^{(j)}+x_4^{(j)}+x_5^{(j)}}
=\mathbb{E}(\bar{\tilde G})_{p_1-1,j}^{\tilde\alpha_{p_1-1,j}}\big((\bar{\tilde G})_{p_1-1,j}^2-1\big)^{\tilde\beta_{p_1-1,j}}.
\end{aligned}$$
Consequently,
$$\mathbb{E}\prod_{i=1}^{p_1}\prod_{j=1}^{p_2}(\bar G)_{ij}^{\alpha_{ij}}\big((\bar G)_{ij}^2-1\big)^{\beta_{ij}}\le \mathbb{E}\prod_{i=1}^{p_1-1}\prod_{j=1}^{p_2}(\bar{\tilde G})_{ij}^{\tilde\alpha_{ij}}\big((\bar{\tilde G})_{ij}^2-1\big)^{\tilde\beta_{ij}}. \qquad (78)$$
This gives (75).

Step 4. Combining (74) and (75), we finally have
$$\begin{aligned}
\sum_{u_\Omega\in\{p_1-1,p_1\}}\mathbb{E}\prod_{k=1}^q \sigma_{u_k,v_k}\sigma_{u_{k+1},v_k}\big((\bar G)_{u_k,v_k}(\bar G)_{u_{k+1},v_k}-\mathbb{1}_{\{u_k=u_{k+1}\}}\big)
&=\sum_{u_\Omega\in\{p_1-1,p_1\}}\prod_{k=1}^q \sigma_{u_k,v_k}\sigma_{u_{k+1},v_k}\cdot\mathbb{E}\prod_{k=1}^q\big((\bar G)_{u_k,v_k}(\bar G)_{u_{k+1},v_k}-\mathbb{1}_{\{u_k=u_{k+1}\}}\big)\\
&\overset{(75)}{\le}\sum_{u_\Omega\in\{p_1-1,p_1\}}\prod_{k=1}^q \sigma_{u_k,v_k}\sigma_{u_{k+1},v_k}\cdot\mathbb{E}\prod_{k=1}^q\big((\bar{\tilde G})_{\tilde u_k,v_k}(\bar{\tilde G})_{\tilde u_{k+1},v_k}-\mathbb{1}_{\{\tilde u_k=\tilde u_{k+1}\}}\big)\\
&\overset{(74)}{\le}\mathbb{E}\prod_{k=1}^q \tilde\sigma_{\tilde u_k,v_k}\tilde\sigma_{\tilde u_{k+1},v_k}\big((\bar{\tilde G})_{\tilde u_k,v_k}(\bar{\tilde G})_{\tilde u_{k+1},v_k}-\mathbb{1}_{\{\tilde u_k=\tilde u_{k+1}\}}\big),
\end{aligned}$$
which yields (72) and finishes the proof of this lemma. $\square$

Proof of Theorem 7.
Denote $\sigma_C^2=\sum_i\sigma_i^2$, $\sigma_*^2=\max_i\sigma_i^2$, $Z=[Z_1,\dots,Z_{p_2}]$, and $S_k=Z_kZ_k^\top-\mathbb{E} Z_kZ_k^\top$. Then
$$\mathbb{E}\big\|ZZ^\top-\mathbb{E} ZZ^\top\big\|=\mathbb{E}\Big\|\sum_{k=1}^{p_2}S_k\Big\|.$$
By the lower bound for the expected norm of a sum of independent random matrices [29, Theorem I and Section 1.3],
$$\mathbb{E}\big\|ZZ^\top-\mathbb{E} ZZ^\top\big\|\gtrsim \Big\|\mathbb{E}\sum_{k=1}^{p_2}S_kS_k^\top\Big\|^{1/2}+\mathbb{E}\max_k\|S_k\|. \qquad (79)$$
Suppose $Z_{ij}\sim N(0,\sigma_i^2)$ for any $i\in[p_1]$, $j\in[p_2]$. Note that
$$\big(\mathbb{E}\, Z_kZ_k^\top Z_kZ_k^\top\big)_{ij}=\mathbb{E}\, Z_{ik}\Big(\sum_{l=1}^{p_1}Z_{lk}^2\Big)Z_{jk}=\begin{cases}3\sigma_i^4+\sigma_i^2\sum_{l\ne i}\sigma_l^2, & i=j;\\ 0, & i\ne j,\end{cases}$$
i.e., $\mathbb{E}\, Z_kZ_k^\top Z_kZ_k^\top=\mathrm{diag}\big(\{2\sigma_i^4+\sigma_i^2\sigma_C^2\}_{i=1}^{p_1}\big)$, while $\big(\mathbb{E}\, Z_kZ_k^\top\big)^2=\mathrm{diag}(\sigma_1^4,\dots,\sigma_{p_1}^4)$. Thus,
$$\Big\|\mathbb{E}\sum_{k=1}^{p_2}S_kS_k^\top\Big\|=\Big\|\sum_{k=1}^{p_2}\Big(\mathbb{E}\, Z_kZ_k^\top Z_kZ_k^\top-\big(\mathbb{E}\, Z_kZ_k^\top\big)^2\Big)\Big\|=p_2\,\big\|\mathrm{diag}\big(\{\sigma_i^2\sigma_C^2+\sigma_i^4\}_{i=1}^{p_1}\big)\big\|=p_2\big(\sigma_*^4+\sigma_*^2\sigma_C^2\big).$$
Choose $i^*\in[p_1]$ such that $\sigma_*=\sigma_{i^*}$. Then
$$\mathbb{E}\|S_k\|=\mathbb{E}\big\|Z_kZ_k^\top-\mathbb{E}\, Z_kZ_k^\top\big\|\ge \mathbb{E}\big\|Z_kZ_k^\top\big\|-\big\|\mathbb{E}\, Z_kZ_k^\top\big\|=\mathbb{E}\|Z_k\|_2^2-\sigma_*^2=\sigma_C^2-\sigma_*^2;$$
$$\mathbb{E}\|S_k\|\ge \mathbb{E}\big|(S_k)_{i^*i^*}\big|=\mathbb{E}\big|Z_{i^*k}^2-\mathbb{E}\, Z_{i^*k}^2\big|\ge c\sigma_*^2.$$
Combining the previous two inequalities, we have $\mathbb{E}\|S_k\|\ge c\sigma_C^2$. Consequently,
$$\mathbb{E}\big\|ZZ^\top-\mathbb{E}\, ZZ^\top\big\|\overset{(79)}{\gtrsim}\sigma_C^2+\sqrt{p_2}\,\sigma_*\sigma_C. \qquad\square$$

Proof of Lemma 5.
Since the diagonal of $\Delta(ZZ^\top)$ is zero, we have the following expansion:
$$\mathbb{E}\,\mathrm{tr}\Big\{\big(\Delta(ZZ^\top)\big)^q\Big\}=\sum_{u_1,\dots,u_q\in[p_1]}\mathbb{E}\prod_{k=1}^q\big(\Delta(ZZ^\top)\big)_{u_k,u_{k+1}}=\sum_{\substack{u_1,\dots,u_q\in[p_1]\\ v_1,\dots,v_q\in[p_2]}}\mathbb{E}\prod_{k=1}^q\big(\mathbb{1}_{\{u_k\ne u_{k+1}\}}Z_{u_k,v_k}Z_{u_{k+1},v_k}\big). \qquad (80)$$
Again, the indices on $u$ are cyclic modulo $q$, i.e., $u_1=u_{q+1}$. For a cycle $c:=(u_1\to v_1\to u_2\to v_2\to\cdots\to u_q\to v_q\to u_1)$, recall the definition of $\alpha_{ij}(c)$:
$$\alpha_{ij}(c)=\mathrm{Card}\{k:(u_k=i,\,v_k=j,\,u_{k+1}\ne i)\ \text{or}\ (u_k\ne i,\,v_k=j,\,u_{k+1}=i)\}$$
for any $i\in[p_1]$ and $j\in[p_2]$, which counts how many times the edge $i\to j$ or $j\to i$ is visited. Now the expansion in (80) can be further written as
$$\begin{aligned}
\mathbb{E}\,\mathrm{tr}\Big\{\big(\Delta(ZZ^\top)\big)^q\Big\}
&=\sum_{c\in([p_1]\times[p_2])^q}\Big(\prod_{k=1}^q\mathbb{1}_{\{u_k\ne u_{k+1}\}}\Big)\cdot\prod_{(i,j)\in[p_1]\times[p_2]}\mathbb{E}\, Z_{ij}^{\alpha_{ij}(c)}\\
&=\sum_{c\in([p_1]\times[p_2])^q}\Big(\prod_{k=1}^q\mathbb{1}_{\{u_k\ne u_{k+1}\}}\sigma_{u_k,v_k}\sigma_{u_{k+1},v_k}\Big)\cdot\prod_{(i,j)\in[p_1]\times[p_2]}\mathbb{E}\, G^{\alpha_{ij}(c)}\\
&=\sum_{c\in([p_1]\times[p_2])^q}\Big(\prod_{k=1}^q\mathbb{1}_{\{u_k\ne u_{k+1}\}}\sigma_{v_k}^2\Big)\cdot\prod_{(i,j)\in[p_1]\times[p_2]}\mathbb{E}\, G^{\alpha_{ij}(c)}. \qquad (81)
\end{aligned}$$
We define $m_\alpha(c)$ to be the number of edges which appear $\alpha$ times in the cycle $c$:
$$m_\alpha(c)=\mathrm{Card}\big\{(i,j)\in[p_1]\times[p_2]:\ |\{k:\ u_k=i \text{ or } u_{k+1}=i,\ v_k=j\}|=\alpha\big\}.$$
Let $s(c)$ be the shape of $c$; then
$$\prod_{(i,j)\in[p_1]\times[p_2]}\mathbb{E}\, G^{\alpha_{ij}(c)}=\prod_{\alpha\ge 2}\big(\mathbb{E}\, G^{\alpha}\big)^{m_\alpha(s(c))},\qquad G\sim N(0,1),$$
and define
$$\mathcal{S}_{p_1,p_2}:=\big\{s(c):\ m_\alpha(c)=0 \text{ for all odd } \alpha;\ \text{and } u_k\ne u_{k+1} \text{ for all } k=1,\dots,q\big\}.$$
Based on the notation above, one can check that the expansion in (81) can be further simplified to
$$\mathbb{E}\,\mathrm{tr}\Big\{\big(\Delta(ZZ^\top)\big)^q\Big\}=\sum_{s\in\mathcal{S}_{p_1,p_2}}\ \sum_{c:\ s(c)=s}\Big(\prod_{k=1}^q\sigma_{v_k}^2\Big)\prod_{\alpha\ge 2}\big(\mathbb{E}\, G^{\alpha}\big)^{m_\alpha(s)}=\sum_{s\in\mathcal{S}_{p_1,p_2}}\prod_{\alpha\ge 2}\big(\mathbb{E}\, G^{\alpha}\big)^{m_\alpha(s)}\sum_{c:\ s(c)=s}\Big(\prod_{k=1}^q\sigma_{v_k}^2\Big). \qquad (82)$$
For a fixed shape $s\in\mathcal{S}_{p_1,p_2}$, let $m_L(s)$ (resp. $m_R(s)$) be the number of distinct left (resp. right) vertices visited by cycles with shape $s$. Now we bound $\sum_{c:\ s(c)=s}\big(\prod_{k=1}^q\sigma_{v_k}^2\big)$ via $m_L(s)$ and $m_R(s)$.
To this end, we first present three facts about any cycle with shape $s$:
• Each visited edge must appear at least twice in the cycle;
• For each right vertex in the cycle, its predecessor and successor in the left vertex set must be different;
• The cycle is uniquely determined by specifying $m_L(s)$ left vertices and $m_R(s)$ right vertices; moreover, the summation term is free of the indices of the visited left vertices.
These three observations, together with the assumption $\sigma_*=1$, yield the following bound:
$$\sum_{c:\ s(c)=s}\Big(\prod_{k=1}^q\sigma_{v_k}^2\Big)\le p_1(p_1-1)\cdots(p_1-m_L(s)+1)\Big(\sum_{j=1}^{p_2}\sigma_j^2\Big)^{m_R(s)}. \qquad (83)$$
Next we compare $\mathbb{E}\,\mathrm{tr}\{(\Delta(ZZ^\top))^q\}$ with $\mathbb{E}\,\mathrm{tr}\{(\Delta(HH^\top))^q\}$, where $H$ is a $p_1$-by-$m$ random matrix with i.i.d. standard Gaussian entries. Similarly as above, we have
$$\mathbb{E}\,\mathrm{tr}\Big\{\big(\Delta(HH^\top)\big)^q\Big\}=\sum_{s\in\mathcal{S}_{p_1,p_2}}\prod_{\alpha\ge 2}\big(\mathbb{E}\, G^{\alpha}\big)^{m_\alpha(s)}\cdot\big|\{c:\ s(c)=s\}\big|.$$
Setting $m=\lceil\sum_{j=1}^{p_2}\sigma_j^2\rceil+q-$
1, we have
$$\begin{aligned}
\big|\{c:\ s(c)=s\}\big|&=p_1(p_1-1)\cdots(p_1-m_L(s)+1)\cdot m(m-1)\cdots(m-m_R(s)+1)\\
&\ge p_1(p_1-1)\cdots(p_1-m_L(s)+1)\big(m-m_R(s)+1\big)^{m_R(s)}\\
&\ge p_1(p_1-1)\cdots(p_1-m_L(s)+1)\Big(\sum_{j=1}^{p_2}\sigma_j^2\Big)^{m_R(s)}. \qquad (84)
\end{aligned}$$
Combining (83) and (84), we finish the proof. $\square$

Proof of Theorem 9. Denote $\sigma_R^2=\sum_j\sigma_j^2$ and $\sigma_*^2=\max_j\sigma_j^2$. We use the general lower bound for the expected norm of a sum of independent random matrices [29, Theorem I and Section 1.3], as in the proof of Theorem 7. Since $Z_{ij}\sim N(0,\sigma_j^2)$, for any $k\in[p_2]$,
$$\big(\mathbb{E}\, Z_kZ_k^\top Z_kZ_k^\top\big)_{ij}=\mathbb{E}\, Z_{ik}\Big(\sum_{l=1}^{p_1}Z_{lk}^2\Big)Z_{jk}=\begin{cases}(p_1+2)\sigma_k^4, & i=j;\\ 0, & i\ne j,\end{cases}$$
i.e., $\mathbb{E}\, Z_kZ_k^\top Z_kZ_k^\top=(p_1+2)\sigma_k^4 I_{p_1}$, while $\big(\mathbb{E}\, Z_kZ_k^\top\big)^2=\sigma_k^4 I_{p_1}$. Thus,
$$\Big\|\mathbb{E}\sum_{k=1}^{p_2}S_kS_k^\top\Big\|=\Big\|\sum_{k=1}^{p_2}\Big(\mathbb{E}\, Z_kZ_k^\top Z_kZ_k^\top-\big(\mathbb{E}\, Z_kZ_k^\top\big)^2\Big)\Big\|=\Big\|\Big((p_1+1)\sum_{k=1}^{p_2}\sigma_k^4\Big)I_{p_1}\Big\|\ge p_1\sum_{k=1}^{p_2}\sigma_k^4.$$
On the other hand,
$$\mathbb{E}\max_k\|S_k\|\ge\max_k\mathbb{E}\|S_k\|\ge\max_k\Big\{\mathbb{E}\big\|Z_kZ_k^\top\big\|-\big\|\mathbb{E}\, Z_kZ_k^\top\big\|\Big\}=(p_1-1)\sigma_*^2.$$
Combining the previous two inequalities and (79) in the proof of Theorem 7, we obtain
$$\mathbb{E}\big\|ZZ^\top-\mathbb{E}\, ZZ^\top\big\|\overset{(79)}{\gtrsim}\sqrt{p_1\sum_{k=1}^{p_2}\sigma_k^4}+p_1\sigma_*^2. \qquad\square$$

Proof of Theorem 10.
We first introduce the following three lemmas.
Lemma 11.
For any $x\in\{-1,+1\}^n$ and $z\in\mathbb{R}^n$ with $\|z\|_2=1$, we have
$$d\big(x,\mathrm{sgn}(z)\big)\le n\Big\|\frac{x}{\sqrt{n}}-z\Big\|_2^2.$$
Here $d$ represents the Hamming distance: $d(x,y)=\sum_{i=1}^n\mathbb{1}_{\{x_i\ne y_i\}}$.

Proof. See [21]. $\square$
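Lemma 11 is easy to check coordinate-wise (a mismatched sign costs at least $1/n$ in the squared distance), and also numerically; the sketch below (sizes and seed are ours) records the worst slack over random draws:

```python
import numpy as np

rng = np.random.default_rng(3)
n, slack = 50, -np.inf
for _ in range(1000):
    x = rng.choice([-1.0, 1.0], size=n)
    z = rng.standard_normal(n)
    z /= np.linalg.norm(z)                     # ||z||_2 = 1
    sgn_z = np.where(z >= 0, 1.0, -1.0)
    d = np.sum(x != sgn_z)                     # Hamming distance d(x, sgn(z))
    slack = max(slack, d - n * np.linalg.norm(x / np.sqrt(n) - z) ** 2)
print("max violation:", slack)                 # never positive if the lemma holds
```
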
Lemma 12.
Assume that $Z\in\mathbb{R}^{p_1\times p_2}$ has independent zero-mean sub-Gaussian entries with $\mathrm{Var}(Z_{ij})=\sigma_{ij}^2$, $\sigma_C^2=\max_j\sum_i\sigma_{ij}^2$, $\sigma_R^2=\max_i\sum_j\sigma_{ij}^2$, $\sigma_*^2=\max_{i,j}\sigma_{ij}^2$, and assume that $\|Z_{ij}/\sigma_{ij}\|_{\psi_2}\le\kappa$. Let $V\in\mathbb{O}_{p_2,r}$ be a fixed orthogonal matrix. Then
$$\mathbb{P}\big(\|ZV\|^2\ge\sigma_C^2+x\big)\le 2\exp\Big(Cr-c\min\Big\{\frac{x^2}{\kappa^4\sigma_*^2\sigma_C^2},\ \frac{x}{\kappa^2\sigma_*^2}\Big\}\Big),\qquad \mathbb{E}\|ZV\|^2\lesssim \sigma_C^2+\kappa^2 r^{1/2}\sigma_*\sigma_C+\kappa^2 r\sigma_*^2.$$

Proof.
See [36, Lemma 3]. $\square$

Lemma 13 (Davis--Kahan). Let $A$ be an $n$-by-$n$ symmetric matrix with eigenvalues $|\lambda_1|\ge|\lambda_2|\ge\cdots$ satisfying $|\lambda_k|-|\lambda_{k+1}|\ge\delta$. Let $B$ be a symmetric matrix such that $\|B\|<\delta$. Let $A_k$ and $(A+B)_k$ be (orthonormal bases of) the spaces spanned by the top $k$ eigenvectors of the respective matrices. Then
$$\big\|I_k-A_k^\top (A+B)_k\big\|\le \frac{C\|B\|}{\delta}.$$

Proof.
See [13]. $\square$

Now we are ready for the proof. Recall that $Y=X+Z$, so we can write
$$Y^\top Y=X^\top X+X^\top Z+Z^\top X+Z^\top Z=X^\top X+X^\top Z+Z^\top X+\big(Z^\top Z-\mathbb{E}\, Z^\top Z\big)+\mathbb{E}\, Z^\top Z. \qquad (85)$$
Since $\mathbb{E}\, Z^\top Z=\big(\sum_{i=1}^{p}\sigma_i^2\big)I$, the leading eigenvector of $Y^\top Y$ (i.e., $\hat v$) is the same as that of $X^\top X+X^\top Z+Z^\top X+(Z^\top Z-\mathbb{E}\, Z^\top Z)$. Since $l/\sqrt{n}$ is the leading eigenvector of $X^\top X$, it follows that
$$\mathbb{E}\, M(l,\hat l)\ \overset{\text{Lemma 11}}{\le}\ \mathbb{E}\min_{\pm}\Big\|\frac{l}{\sqrt n}\pm\hat v\Big\|
$_2^2\ \overset{\text{Lemma 13}}{\lesssim}\ \frac{\mathbb{E}\big\|X^\top Z+Z^\top X+Z^\top Z-\mathbb{E}\, Z^\top Z\big\|}{n\|\mu\|_2^2}\ \lesssim\ \frac{\mathbb{E}\|Z^\top X\|+\mathbb{E}\|Z^\top Z-\mathbb{E}\, Z^\top Z\|}{n\|\mu\|_2^2}\ \lesssim\ \frac{n\|\mu\|_2\sigma_*+\sqrt{n\sum_{i=1}^{p}\sigma_i^4}+n\sigma_*^2}{n\|\mu\|_2^2}. \qquad\square$

Proof of Theorem 11.
We only need to prove the lower bound in the following two situations:
• when $\lambda\le c_0\sigma_*$, there exist $\{\sigma_i\}_{i=1}^p$ such that $\max_i\sigma_i\le\sigma_*$, $\sum_i\sigma_i^2\le\tilde\sigma^2$, and the lower bound holds;
• when $\lambda\le c_0\tilde\sigma/(np)^{1/4}$, there exist $\{\sigma_i\}_{i=1}^p$ such that $\max_i\sigma_i\le\sigma_*$, $\sum_i\sigma_i^2\le\tilde\sigma^2$, and the lower bound holds.
We start with the first case. We specify $\sigma_1=\sigma_*$ and take $\sigma_2,\dots,\sigma_p$ to be arbitrary values that satisfy the constraint of $\mathcal{P}_{\lambda,l}(\sigma_*,\tilde\sigma)$. Consider the metric space $\{-1,1\}^n$ with the metric
$$M\big(l^{(1)},l^{(2)}\big)=\frac1n\min\Big\{\big|\{i:\ l^{(1)}_i\ne l^{(2)}_i\}\big|,\ \big|\{i:\ l^{(1)}_i\ne -l^{(2)}_i\}\big|\Big\}.$$
By [35, Lemma 4], when $n\ge$
6, we can find a constant $c_1$ such that there exists a subset $\{l^{(1)},\dots,l^{(N)}\}\subseteq\{-1,1\}^n$ satisfying
$$M\big(l^{(i_1)},l^{(i_2)}\big)\ge 1/4,\quad \forall\, 1\le i_1<i_2\le N,\qquad\text{and}\qquad N\ge \exp(c_1 n).$$
Let $Y^{(i)}=\mu\big(l^{(i)}\big)^\top+Z\in\mathbb{R}^{p\times n}$, where $Z_{ij}\overset{\text{ind}}{\sim}N(0,\sigma_i^2)$, and let $\mu=[\lambda,0,\cdots,0]^\top$. Then the KL-divergence between $Y^{(i_1)}$ and $Y^{(i_2)}$ for $i_1\ne i_2$ is
$$D_{\mathrm{KL}}\big(Y^{(i_1)}\,\big\|\,Y^{(i_2)}\big)=\frac12\sum_{j=1}^p\sigma_j^{-2}\mu_j^2\big\|l^{(i_1)}-l^{(i_2)}\big\|_2^2\le 2n\sum_{j=1}^p\sigma_j^{-2}\mu_j^2=2n\sigma_1^{-2}\lambda^2=2n\lambda^2/\sigma_*^2. \qquad (86)$$
By the generalized Fano's lemma, we have
$$\inf_{\hat l}\ \sup_{\mathcal{P}_{l,\lambda}(\sigma_*,\tilde\sigma)}\mathbb{E}\, M(l,\hat l)\ge\frac14\Big(1-\frac{2n\lambda^2/\sigma_*^2+\log 2}{c_1 n}\Big)\ge\frac18.$$
In the last inequality we use the assumption that $\lambda\le c_0\sigma_*$ for some sufficiently small constant $c_0$. Now we consider the second situation. We specify $\sigma_1^2=\sigma_2^2=\cdots=\sigma_p^2=\tilde\sigma^2/p$. When the variance structure reduces to a homoskedastic structure, we have the following lower bound result, which is already established.

Lemma 14.
Suppose $\sigma_1=\cdots=\sigma_p=1$. There exist constants $c, C$ such that if $n\ge C$,
$$\inf_{\hat l}\ \sup_{\|\mu\|_2\le c(p/n)^{1/4},\ l\in\{-1,1\}^n}\mathbb{E}\, M(\hat l,l)\ge 1/8.$$

Proof.
See [11, Theorem 6]. $\square$

Based on Lemma 14 and the homoskedasticity of $\mu$ and $\sigma$, if we set $\lambda< c\,\tilde\sigma p^{-1/2}\cdot(p/n)^{1/4}=c\,\tilde\sigma/(np)^{1/4}$ in our setting, we obtain
$$\inf_{\hat l}\ \sup_{\mathcal{P}_{l,\lambda}(\sigma_*,\tilde\sigma)}\mathbb{E}\, M(l,\hat l)\ge 1/8.$$
This finishes the proof. $\square$
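The KL computation in (86) is just the entrywise Gaussian KL formula summed over independent coordinates; the following small check (sizes and values here are arbitrary illustrations) confirms the closed form matches the entrywise sum:

```python
import numpy as np

rng = np.random.default_rng(5)
p, n = 6, 20
sig = rng.uniform(0.5, 2.0, p)            # row noise levels sigma_j
mu = np.zeros(p); mu[0] = 1.3             # mu = [lambda, 0, ..., 0]^T
l1 = rng.choice([-1.0, 1.0], n)
l2 = rng.choice([-1.0, 1.0], n)

# Entrywise KL between N(mu l1^T, diag(sigma^2)) and N(mu l2^T, diag(sigma^2)):
M1, M2 = np.outer(mu, l1), np.outer(mu, l2)
kl_entrywise = np.sum((M1 - M2) ** 2 / (2 * sig[:, None] ** 2))

# Closed form as in (86): (1/2) * sum_j sigma_j^{-2} mu_j^2 * ||l1 - l2||_2^2.
kl_formula = 0.5 * np.sum(mu ** 2 / sig ** 2) * np.sum((l1 - l2) ** 2)
print(kl_entrywise, kl_formula)
```
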
A Proofs of technical lemmas
We collect the proofs of Lemmas 7, 8, and 10 in this section.
Proof of Lemma 7.
We first consider the proof of (60). Note that if $G\sim N(0,1)$,
$$\mathbb{E}\, G^d=\begin{cases}(d-1)!!, & d\ge 0 \text{ and } d \text{ is even};\\ 0, & d\ge 1 \text{ and } d \text{ is odd}.\end{cases}$$
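Both this moment formula and the two-sided bounds of (60) as reconstructed here (nonnegativity of $\mathbb{E}\, G^\alpha(G^2-1)^\beta$ for even $\alpha$, and the upper bound $(\alpha+2\beta-1)!!$) can be checked exactly for small exponents by expanding $(G^2-1)^\beta$ binomially. A sketch (the grid of exponents is our own choice):

```python
from math import comb

def gaussian_moment(d):
    """E G^d for G ~ N(0,1): (d-1)!! when d is even, 0 when d is odd."""
    if d % 2 == 1:
        return 0
    out = 1
    for m in range(d - 1, 0, -2):
        out *= m
    return out

def moment_G_alpha_beta(alpha, beta):
    """Exact E[G^alpha (G^2 - 1)^beta] via the binomial expansion."""
    return sum((-1) ** j * comb(beta, j) * gaussian_moment(alpha + 2 * beta - 2 * j)
               for j in range(beta + 1))

for alpha in range(0, 9, 2):               # even alpha
    for beta in range(0, 7):
        val = moment_G_alpha_beta(alpha, beta)
        assert 0 <= val <= gaussian_moment(alpha + 2 * beta)   # (alpha+2beta-1)!!
print("bounds hold on the tested grid; E[G^2 (G^2-1)] =", moment_G_alpha_beta(2, 1))
```
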
In addition, $(-1)!!=1$ and $(-3)!!=-$
1. When α is odd, only odd moments of G appear in theexpansion of G α ( G − β , then clearly E G α ( G − β = 0. When α is even and α + 2 β ≥ E G α ( G − β = β X j =0 E G α +2 β − j ( − j (cid:18) βj (cid:19) = β X j =0 ( − j ( α + 2 β − j − · β !( β − j )! j ! ≥ X ≤ j ≤ βj is even (cid:26) ( α + 2 β − j − β !( β − j )! j ! − ( α + 2 β − j + 1) − β !( β − j − j + 1)! (cid:27) = X ≤ j ≤ βj is even ( α + 2 β − j − β !( β − j )!( j + 1)! · { ( α + 2 β − j − j + 1) − ( β − j ) }• If j = β , ( α + 2 β − j − β !( β − j )!( j + 1)! · { ( α + 2 β − j − j + 1) − ( β − j ) } = ( α + 2 β − j − β !( β − j )!( j + 1)! · ( α + 2 β − j − j + 1) ≥ • If β − ≥ j ≥ β − ,( α + 2 β − j − j + 1) ≥ ( α + 2 β − β − − (cid:18) β −
12 + 1 (cid:19) ≥ β + 12 ≥ β − j ; • if 0 ≤ j < β − , ( α + 2 β − j − j + 1) ≥ α + 2 β − ( β − − ≥ β − j. Thus, we always have( α + 2 β − j − β !( β − j )!( j + 1)! · { ( α + 2 β − j − j + 1) − ( β − j ) } ≥ , ∀ ≤ j ≤ β, j is even , and E G α ( G − β ≥ X j =0 ( α + 2 β − j − β !( β − j )!( j + 1)! · { ( α + 2 β − j − j + 1) − ( β − j ) } =( α + 2 β − · ( α + β − , which has finished the proof of (60).Next we consider the upper bound of E G α ( G − β . E G α ( G − β = β X j =0 E G α +2 β − j ( − j (cid:18) βj (cid:19) = β X j =0 ( − j ( α + 2 β − j − · β !( β − j )! j ! ≤ ( α + 2 β − − X ≤ j ≤ βj is odd (cid:26) ( α + 2 β − j − β !( β − j )! j ! − ( α + 2 β − j + 1) − β !( β − j − j + 1)! (cid:27) =( α + 2 β − − X ≤ j ≤ βj is odd ( α + 2 β − j − β !( β − j )!( j + 1)! · { ( α + 2 β − j − j + 1) − ( β − j ) } ≤ j ≤ β ,( α + 2 β − j − β !( β − j )!( j + 1)! · { ( α + 2 β − j − j + 1) − ( β − j ) } ≥ , thus, E G α ( G − β ≤ ( α + 2 β − . Then we consider the proof of sub-Gaussian case (62). When α is odd, the statement clearlyholds as Z α ( Z − β has symmetric distribution then E Z α ( Z − β = 0. When α is even, since Z ≥
0, we must have $|Z^2-1|=\max\{Z^2-1,\,1-Z^2\}\le Z^2\vee$
1, and thus
$$\big|\mathbb{E}\, Z^\alpha(Z^2-1)^\beta\big|\le\big|\mathbb{E}\, Z^\alpha(Z^2-1)^\beta\mathbb{1}_{\{|Z|\le 1\}}\big|+\big|\mathbb{E}\, Z^\alpha(Z^2-1)^\beta\mathbb{1}_{\{|Z|>1\}}\big|\le 1+\mathbb{E}\, Z^\alpha|Z|^{2\beta}\le \mathbb{E}|Z|^{\alpha+2\beta}+1.$$
Since $\mathbb{E}\, Z^2=1$, we have $\kappa\ge 1/\sqrt{2}$. Thus,
$$\mathbb{E}|Z|^{\alpha+2\beta}+1\le(\kappa+1)^{\alpha+2\beta}(\alpha+2\beta)^{(\alpha+2\beta)/2}\le(3\kappa)^{\alpha+2\beta}(\alpha+2\beta)^{(\alpha+2\beta)/2}.$$
It is easy to see that (62) holds when $\alpha+2\beta\le 3$. When $\alpha+2\beta\ge$
4, by the relationship between double factorial and Gamma function and thelower bound of Gamma function [6], we have( α + 2 β − α + β −
1) = 2 α/ β − √ π Γ (cid:18) α β − (cid:19) ( α + β − ≥ α/ β − ( α + β − √ π · √ πx x e − x (cid:0) x + x/ . (cid:1) / ≥ C − ( α +2 β ) · ( α + 2 β ) ( α +2 β ) / = ( Cκ ) α +2 β ( α + 2 β ) ( α +2 β ) / ≥ E Z α ( Z − β . Here, x = α +2 β − . Proof of Lemma 8.
Firstly we have E F αij ( F ij − β = β X j =0 E F α +2 β − jij ( − j (cid:18) βj (cid:19) = 1 √ π β X j =0 ( − j β !( β − j )! j ! ( x j − · ( b − xj Γ (cid:18) ( b − x j + 12 (cid:19) ≥ √ π X j ≤ βj is even (cid:26) β !( β − j )! j ! ( x j − · ( b − xj Γ (cid:18) ( b − x j + 12 (cid:19) − β !( β − j − j + 1)! ( x j − ( b − xj − Γ (cid:18) ( b − x j −
2) + 12 (cid:19)(cid:27) ≥ √ π X j ≤ βj is even β !( β − j )!( j + 1)! ( x j − ( b − xj − Γ (cid:18) ( b − x j −
2) + 12 (cid:19) · [( j + 1)( x j − − ( β − j )] , See https://en.wikipedia.org/wiki/Double_factorial x j := α + 2 β − j and the last inequality comes from the strictly increasing property ofGamma function. By the proof of Lemma 7, we know β !( β − j )!( j + 1)! ( x j − · (( j + 1)( x j − − ( β − j )) ≥ . Thus, E F αij ( F ij − β ≥ α + β − √ π ( α + 2 β − · ( b − α +2 β − Γ (cid:18) ( b − α + 2 β −
2) + 12 (cid:19) = 1 π b ( α +2 β − Γ (cid:18) α + 2 β − (cid:19) Γ (cid:18) ( b − α + 2 β −
2) + 12 (cid:19) . When α + 2 β ≥ (cid:16) b − (cid:17) ∨
5, by the lower bound of Gamma function [6], we further have E F αij ( F ij − β ≥ · b ( α +2 β − x x e − x (cid:16) x + x . (cid:17) / y y e − y (cid:16) y + y . (cid:17) / ≥ ( c b ) α +2 β · ( α + 2 β ) ( α +2 β ) / · (( b − α + 2 β )) ( b − α +2 β ) / ≥ ( c ′ b ) α +2 β · ( α + 2 β ) b ( α +2 β ) / , where x = α +2 β − , y = ( b − α +2 β − − and c b > b .When 2 ≤ α + 2 β < (cid:16) b − (cid:17) ∨
5, we can find another universal constant c ′′ b such that1 π b ( α +2 β − Γ (cid:18) α + 2 β − (cid:19) Γ (cid:18) ( b − α + 2 β −
2) + 12 (cid:19) ≥ ( c ′′ b ) α +2 β · ( α + 2 β ) b ( α +2 β ) / . In conclusion, we proved that E F αij ( F ij − β ≥ ( C b κ ) α +2 β · E E αij ( E ij − β for any α, β ≥
0. Thus (64) is proved.
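The double factorial/Gamma relationship invoked repeatedly above is $(2k-1)!!=2^k\,\Gamma(k+1/2)/\sqrt{\pi}$; a quick exact check (the range of $k$ is our own choice):

```python
from math import gamma, pi, sqrt

def double_factorial(m):
    """m!! for odd m, with the convention (-1)!! = 1 (only m >= -1 is used here)."""
    return 1 if m <= 0 else m * double_factorial(m - 2)

for k in range(15):
    lhs = double_factorial(2 * k - 1)
    rhs = 2 ** k * gamma(k + 0.5) / sqrt(pi)
    assert abs(lhs - rhs) <= 1e-9 * max(1.0, lhs)
print("(2k-1)!! = 2^k Gamma(k+1/2)/sqrt(pi) verified for k = 0..14")
```
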
Proof of Lemma 10.
If either ( x , x , x ) = (0 , ,
0) or ( x , x , x ) = (0 , , x + x or x + x is odd, the left hand side of (76) (77) are zero since Z and Z aresymmetric distributed and independent. Meanwhile, the right hand side of (76) is non-negative(Lemma 7), thus (76) holds if either x + x or x + x is odd. When ( x , x , x ) = (0 , ,
0) (or( x , x , x ) = (0 , , x + x and x + x are even, and x + x + x ≥ , and x + x + x ≥ . By Lemma 7, we have (cid:12)(cid:12) E Z x + x Z x + x ( Z − x ( Z − x (cid:12)(cid:12) = (cid:12)(cid:12)(cid:0) E Z x + x ( Z − x (cid:1) · (cid:0) E Z x + x ( Z − x (cid:1)(cid:12)(cid:12) ≤ ( Cκ ) x + x +2( x + x + x ) · (cid:12)(cid:12)(cid:0) E G x + x ( G − x (cid:1) · (cid:0) E G x + x ( G − x (cid:1)(cid:12)(cid:12) ≤ ( Cκ ) x + x +2( x + x + x ) · ( x + x + 2 x − · ( x + x + 2 x − . x, y , x !! · y !! ≤ ( x + y − x + x + 2 x − · ( x + x + 2 x − ≤ ( x + x + 2( x + x + x ) − (cid:12)(cid:12) E Z x + x Z x + x ( Z − x ( Z − x (cid:12)(cid:12) ≤ ( Cκ ) x + x +2( x + x + x ) · ( x + x + 2( x + x + x ) − · ( x + x + x + x + x ) Lemma 7 ≤ ( Cκ ) x + x +2( x + x + x ) · E ( G − x + x + x G x + x , which has finished the proof of (76).If Z , Z , G are standard Gaussian, by Lemma 7, (cid:12)(cid:12) E Z x Z x ( Z − x ( Z − x ( Z Z ) x (cid:12)(cid:12) = (cid:12)(cid:12)(cid:0) E Z x + x ( Z − x (cid:1)(cid:12)(cid:12) · (cid:12)(cid:12)(cid:0) E Z x + x ( Z − x (cid:1)(cid:12)(cid:12) ≤ ( x + x + 2 x − · ( x + x + 2 x − ≤ ( x + x + 2( x + x + x ) − · ( x + x + x + x + x ) Lemma 7 ≤ E G x + x ( G − x + x + x , which has finished the proof of (77). References [1] Oskari H Ajanki, L´aszl´o Erd˝os, and Torben Kr¨uger. Universality for general wigner-typematrices.
Probability Theory and Related Fields, 169(3-4):667–727, 2017.

[2] Greg W Anderson, Alice Guionnet, and Ofer Zeitouni. An Introduction to Random Matrices, volume 118. Cambridge University Press, 2010.

[3] Zhidong D Bai. Convergence rate of expected spectral distributions of large random matrices. Part II. Sample covariance matrices. The Annals of Probability, 21(2):649–672, 1993.

[4] Afonso S Bandeira and Ramon van Handel. Sharp nonasymptotic bounds on the norm of random matrices with independent entries. The Annals of Probability, 44(4):2479–2506, 2016.

[5] Zhigang Bao, Xiucai Ding, and Ke Wang. Singular vector and singular subspace distribution for the matrix denoising model. The Annals of Statistics, to appear, 2020.

[6] Necdet Batır. Bounds for the gamma function. Results in Mathematics, 72(1-2):865–874, 2017.

[7] Florent Benaych-Georges and Raj Rao Nadakuditi. The singular values and vectors of low rank perturbations of large rectangular random matrices. Journal of Multivariate Analysis, 111:120–135, 2012.

[8] Adrian N Bishop, Pierre Del Moral, and Angèle Niclas. An introduction to Wishart matrix moments. Foundations and Trends in Machine Learning, 11(2), 2018.

[9] Stéphane Boucheron, Olivier Bousquet, Gábor Lugosi, Pascal Massart, et al. Moment inequalities for functions of independent random variables. The Annals of Probability, 33(2):514–560, 2005.

[10] Stéphane Boucheron, Gábor Lugosi, and Pascal Massart. Concentration Inequalities: A Nonasymptotic Theory of Independence. Oxford University Press, 2013.

[11] T Tony Cai and Anru Zhang. Rate-optimal perturbation bounds for singular subspaces with applications to high-dimensional statistics. The Annals of Statistics, 46(1):60–89, 2018.

[12] T Tony Cai, Cun-Hui Zhang, and Harrison H Zhou. Optimal rates of convergence for covariance matrix estimation. The Annals of Statistics, 38(4):2118–2144, 2010.

[13] Chandler Davis and William Morton Kahan. The rotation of eigenvectors by a perturbation. III. SIAM Journal on Numerical Analysis, 7(1):1–46, 1970.

[14] David Donoho and Matan Gavish. Minimax risk of matrix denoising by singular value thresholding. The Annals of Statistics, 42(6):2413–2440, 2014.

[15] Laura Florescu and Will Perkins. Spectral thresholds in the bipartite stochastic block model. In Conference on Learning Theory, pages 943–959, 2016.

[16] Trevor Hastie, Robert Tibshirani, and Jerome Friedman. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer Science & Business Media, 2009.

[17] David Hong, Laura Balzano, and Jeffrey A Fessler. Asymptotic performance of PCA for high-dimensional heteroscedastic data. Journal of Multivariate Analysis, 167:435–452, 2018.

[18] Vladimir Koltchinskii and Karim Lounici. Concentration inequalities and moment bounds for sample covariance operators. Bernoulli, 23(1):110–133, 2017.

[19] Arun Kumar Kuchibhotla and Abhishek Chakrabortty. Moving beyond sub-Gaussianity in high-dimensional statistics: Applications in covariance estimation and linear regression. arXiv preprint arXiv:1804.02605, 2018.

[20] Rafał Latała, Ramon van Handel, and Pierre Youssef. The dimension-free structure of nonhomogeneous random matrices. arXiv preprint arXiv:1711.00807, 2017.

[21] Marc Lelarge, Laurent Massoulié, and Jiaming Xu. Reconstruction in the labelled stochastic block model. IEEE Transactions on Network Science and Engineering, 2(4):152–163, 2015.

[22] Lydia T Liu, Edgar Dobriban, and Amit Singer. ePCA: High dimensional exponential family PCA. arXiv preprint arXiv:1611.05550, 2016.

[23] Sahand Negahban, Sewoong Oh, and Devavrat Shah. Rank centrality: Ranking from pairwise comparisons. Operations Research, 65(1):266–287, 2017.

[24] Mark EJ Newman. Spectral methods for community detection and graph partitioning. Physical Review E, 88(4):042822, 2013.

[25] Joseph Salmon, Zachary Harmany, Charles-Alban Deledalle, and Rebecca Willett. Poisson noise reduction with non-local PCA. Journal of Mathematical Imaging and Vision, 48(2):279–294, 2014.

[26] Andrey A Shabalin and Andrew B Nobel. Reconstruction of a low-rank matrix in the presence of Gaussian noise. Journal of Multivariate Analysis, 118:67–76, 2013.

[27] Ann-Christine Syvänen. Accessing genetic variation: genotyping single nucleotide polymorphisms. Nature Reviews Genetics, 2(12):930–942, 2001.

[28] Terence Tao. Topics in Random Matrix Theory, volume 132. American Mathematical Society, 2012.

[29] Joel A Tropp. The expected norm of a sum of independent random matrices: An elementary approach. In High Dimensional Probability VII, pages 173–202. Springer, 2016.

[30] Ramon van Handel. On the spectral norm of Gaussian random matrices. Transactions of the American Mathematical Society, 369(11):8161–8178, 2017.

[31] Roman Vershynin. Introduction to the non-asymptotic analysis of random matrices. Compressed Sensing: Theory and Applications, pages 210–268, 2009.

[32] Roman Vershynin. Spectral norm of products of random and deterministic matrices. Probability Theory and Related Fields, 150(3-4):471–509, 2011.

[33] Mariia Vladimirova, Stéphane Girard, Hien Nguyen, and Julyan Arbel. Sub-Weibull distributions: generalizing sub-Gaussian and sub-exponential properties to heavier-tailed distributions. arXiv preprint arXiv:1905.04955, 2019.

[34] Hermann Weyl. Das asymptotische Verteilungsgesetz der Eigenwerte linearer partieller Differentialgleichungen (mit einer Anwendung auf die Theorie der Hohlraumstrahlung). Mathematische Annalen, 71(4):441–479, 1912.

[35] Bin Yu. Assouad, Fano, and Le Cam. In Festschrift for Lucien Le Cam, pages 423–435. Springer, 1997.

[36] Anru Zhang, T Tony Cai, and Yihong Wu. Heteroskedastic PCA: Algorithm, optimality, and applications. arXiv preprint arXiv:1810.08316, 2018.

[37] Anru Zhang and Yuchen Zhou. On the non-asymptotic and sharp lower tail bounds of random variables. arXiv preprint arXiv:1810.09006.