Optimal Clustering in Anisotropic Gaussian Mixture Models
OOptimal Clustering in Anisotropic Gaussian Mixture Models
Xin Chen and Anderson Y. Zhang University of Washington University of Pennsylvania
Abstract
We study the clustering task under anisotropic Gaussian Mixture Models where the covari-ance matrices from different clusters are unknown and are not necessarily the identical matrix.We characterize the dependence of signal-to-noise ratios on the cluster centers and covariancematrices and obtain the minimax lower bound for the clustering problem. In addition, we pro-pose a computationally feasible procedure and prove it achieves the optimal rate within a fewiterations. The proposed procedure is a hard EM type algorithm, and it can also be seen as avariant of the Lloyd’s algorithm that is adjusted to the anisotropic covariance matrices.
Clustering is a fundamentally important task in statistics and machine learning [7, 2]. The mostpopular and studied model for clustering is the Gaussian Mixture Model (GMM) [18, 20] whichcan be written as Y j = θ ∗ z ∗ j + (cid:15) j , where (cid:15) j ind ∼ N (0 , Σ ∗ z ∗ j ) , ∀ j ∈ [ n ] . Here Y = ( Y , . . . , Y n ) are the observations with n being the sample size. Let k be the numberof clusters that is assumed to be known. Denote { θ ∗ a } a ∈ [ k ] to be unknown centers and { Σ ∗ a } to beunknown covariance matrices for the k clusters. Let z ∗ ∈ [ k ] n be the cluster structure such thatfor each index j ∈ [ n ], the value of z ∗ j indicates which cluster the j th data point belongs to. Thegoal is to recover z ∗ from Y . For any estimator ˆ z , its clustering performance is measured by amisclustering error rate h (ˆ z, z ∗ ) which will be introduced later in (4).There have been increasing interests in the theoretical and algorithmic analysis of clusteringunder GMMs. When the GMM is isotropic (that is, all the covariance matrices { Σ ∗ a } a ∈ [ k ] are equalto the same identity matrix), [15] obtains the minimax rate for clustering which takes a form ofexp( − (1 + o (1))(min a (cid:54) = b (cid:107) θ ∗ a − θ ∗ b (cid:107) ) /
8) under the loss h (ˆ z, z ∗ ). Various methods have been studiedin the isotropic setting. k -means clustering [16] might be the most natural choice but it is NP-hard[4]. As a local approach to optimize the k -mean objects, Lloyd’s algorithm [13] is one of the mostpopular clustering algorithms and has achieved many successes in different disciplines [24]. [15, 8]establishes computational and statistical guarantees for the Lloyd’s algorithm by showing it achievesthe optimal rates after a few iterations provided with some decent initialization. Another popularapproach to clustering especially for high-dimensional data is spectral clustering [22, 19, 21], whichis an umbrella term for clustering after a dimension reduction through a spectral decomposition.[14, 17, 1] proves the spectral clustering also achieves the optimality under the isotropic GMM.Another line of work is to consider semidefinite programming (SDP) as a convex relaxation of the k -means objective. Its statistical properties have been studied in [6, 9].1 a r X i v : . [ m a t h . S T ] J a n n spite of all the exciting results, most of the existing literature focuses on isotropic GMMs,and clustering under the anisotropic case where the covariance matrices are not necessarily theidentity matrix is not well-understood. The results of some papers [15, 6] hold under sub-Gaussianmixture models, where the errors (cid:15) j are assumed to follow some sub-Gaussian distribution withvariance proxy σ . It seems that their result already covers the anisotropic case, as {N (0 , Σ ∗ a ) } a ∈ [ k ] are indeed sub-Gaussian distributions. However, from a minimax point of view, among all the sub-Gaussian distributions with variance proxy σ , the least favorable case (the case where clusteringis the most difficult) is when the errors are N (0 , σ ). Therefore, the minimax rates for clusteringunder the sub-Gaussian mixture model is essentially the one under isotropic GMMs, and methodssuch as the Lloyd’s algorithm that requires no covariance matrix information can be rate-optimal.As a result, the aforementioned results are all for isotropic GMMs.A few papers have explored the direction of clustering under anisotropic GMMs. [3] gives apolynomial-time clustering algorithm that provably works well when the Gaussian distributions arewell separated by hyperplanes. Their idea is further developed in [11] which allows the Gaussiansto be overlapped with each other but only for two-cluster cases. A recent paper [23] proposesanother method for clustering under a balanced mixture of two elliptical distributions. They give aprovable upper bound of their clustering performance with respect to an excess risk. Nevertheless,it remains unknown what is the fundamental limit of clustering under the anisotropic GMMs andwhether there is any polynomial-time procedure that achieves it.In this paper, we will investigate the optimal rates of the clustering task under two anisotropicGMMs. Model 1 is when the covariance matrices are all equal to each other (i.e., homogeneous)and are equal to some unknown matrix Σ ∗ . Model 2 is more flexible, where the covariance matricesare unknown and are not necessarily equal to each other (i.e., heterogeneous). The contribution ofthis paper is two-fold, summarized as follows.Our first contribution is on the fundamental limits. We obtain the minimax lower bound forclustering under the anisotropic GMMs with respect to the loss h (ˆ z, z ∗ ). 
We show it takes the forminf ˆ z sup z ∗ ∈ [ k ] n E h ( z, z ∗ ) ≥ exp (cid:18) − (1 + o (1)) (signal-to-noise ratio) (cid:19) , where the signal-to-noise ratio under Model 1 is equal to min a,b ∈ [ k ]: a (cid:54) = b (cid:107) ( θ ∗ a − θ ∗ b ) T Σ ∗− (cid:107) and the onefor Model 2 is more complicated. For both models, we can see the minimax rates depend not only onthe centers but also the covariance matrices. This is different from the isotropic case whose signal-to-noise ratio is min a (cid:54) = b (cid:107) θ ∗ a − θ ∗ b (cid:107) . Our results precisely capture the role the covariance matricesplay in the clustering problem. It shows covariance matrices impact the fundamental limits of theclustering problem through complicated interaction with the centers especially in Model 2. Theminimax lower bounds are obtained by establishing connections with Linear Discriminant Analysis(LDA) and Quadratic Discriminant Analysis (QDA).Our second and more important contribution is on the computational side. We propose acomputationally feasible and rate-optimal algorithm for the anisotropic GMM. Popular methodsincluding the Lloyd’s algorithm and the spectral clustering no longer work well as they are developedunder the isotropic case and only consider the distances among the centers [3]. We study an adjustedLloyd’s algorithm which estimates the covariance matrices in each iteration and recovers the clustersusing the covariance matrix information. It can also be seen as a hard EM algorithm [5]. As aniterative algorithm, we give a statistical and computational guarantee and guidance to practitionersby showing that it obtains the minimax lower bound within log n iterations. That is, let z ( t ) be the2utput of the algorithm after t iterations, we have with high probability, h ( z ( t ) , z ∗ ) ≤ exp (cid:18) − (1 + o (1)) (signal-to-noise ratio) (cid:19) , holds for all t ≥ log n . The algorithm can be initialized by popular methods such as the spectralclustering or the Lloyd’s algorithm. In numeric studies, we show the proposed algorithm improvesgreatly from the two aforementioned methods under anisotropic GMMs, and matches the optimalexponent given in the minimax lower bound. Paper Organization.
The remaining paper is organized as follows. In Section 2, we study Model1 where the covariance matrices are unknown but homogeneous. In Section 3, we consider Model 2where covariance matrices are unknown and heterogeneous. For both cases, we obtain the minimaxlower bound for clustering and study the adjusted Lloyd’s algorithm. In Section 4, we provide anumeric comparison with other popular methods. The proofs of theorems in Section 2 are givenin Section 5 and the proofs for Section 3 are included in Section 6. All the technical lemmas areincluded in Section 7.
Notation.
Let [ m ] = { , , . . . , m } for any positive integer m . For any set S , we denote | S | for itscardinality. For any matrix X ∈ R d × d , we denote λ ( X ) to be its smallest eigenvalue and λ d ( X ) tobe its largest eigenvalue. In addition, we denote (cid:107) X (cid:107) to be its operator norm. For any two vectors u, v of the same dimension, we denote (cid:104) u, v (cid:105) = u T v to be its inner product. For any positive integer d , we denote I d to be the d × d identity matrix. We denote N ( µ, Σ) to be the normal distributionwith mean µ and covariance matrix Σ. We denote I {·} to be the indicator function. Given twopositive sequences a n , b n , we denote a n = o ( b n ) if a n /b n = o (1) when n → ∞ . We write a n (cid:46) b n ifthere exists a constant C > n such that a n ≤ Cb n for all n . We first consider a GMM where covariance matrices of different clusters are unknown but areassumed to be equal to each other. The data generating progress can be displayed as follow:
Model 1: Y j = θ ∗ z ∗ j + (cid:15) j , where (cid:15) j ind ∼ N (0 , Σ ∗ ) , ∀ j ∈ [ n ] . (1)It is called Stretched Mixture Model in [23] as the density of Y j is elliptical. Throughout the paper,we call it Model 1 for simplicity and to distinguish it from a more complicated model that will beintroduced in Section 3. The goal is to recover the underlying cluster assignment vector z ∗ from Y . Signal-to-noise Ratio.
Define the signal-to-noise ratio
SNR = min a,b ∈ [ k ]: a (cid:54) = b (cid:107) ( θ ∗ a − θ ∗ b ) T Σ ∗− (cid:107) , (2)which is a function of all the centers { θ ∗ a } a ∈ [ k ] and the covariance matrix Σ ∗ . As we will show laterin Theorem 2.1, SNR captures the difficulty of the clustering problem and determines the minimaxrate. For the geometric interpretation of
SNR , we defer it after presenting Theorem 2.2.3 quantity closely related to
SNR is the minimum distance among the centers. Define ∆ as∆ = min a,b ∈ [ k ]: a (cid:54) = b (cid:107) θ ∗ a − θ ∗ b (cid:107) . (3)Then we can see SNR and ∆ are in the same order if all eigenvalues of the covariance matrix Σ ∗ is assumed to be constants. If Σ ∗ is further assumed to be an identical matrix, then we have SNR equal to ∆. As a result, in [15, 8, 14] where isotropic GMMs are studied, ∆ plays the role ofsignal-to-noise ratio and appears in the minimax rates.
Loss Function.
To measure the clustering performance, we consider the misclustering error ratedefined as follows. For any z, z ∗ ∈ [ k ] n , we define h ( z, z ∗ ) = min ψ ∈ Ψ n n (cid:88) j =1 I (cid:8) ψ ( z j ) (cid:54) = z ∗ j (cid:9) , (4)where Ψ = { ψ : ψ is a bijection from [ k ] to [ k ] } . Here the minimum is over all the permutationsover [ k ] due to the identifiability issue of the labels 1 , , . . . , k . Another loss that will be used is (cid:96) ( z, z ∗ ) defined as (cid:96) ( z, z ∗ ) = n (cid:88) j =1 (cid:13)(cid:13)(cid:13) θ ∗ z j − θ ∗ z ∗ j (cid:13)(cid:13)(cid:13) . (5)It also measures the clustering performance of z considering the distances among the true centers.It is related to h ( z, z ∗ ) as h ( z, z ∗ ) ≤ (cid:96) ( z, z ∗ ) / ( n ∆ ) and hence provides more information than h ( z, z ∗ ). We will mainly use (cid:96) ( z, z ∗ ) in the technical analysis but will eventually present the resultsusing h ( z, z ∗ ) which is more interpretable. We first establish the minimax lower bound for the clustering problem under Model 1.
Theorem 2.1.
Under the assumption
SNR √ log k → ∞ , we have inf ˆ z sup z ∗ ∈ [ k ] n E h ( z, z ∗ ) ≥ exp (cid:18) − (1 + o (1)) SNR (cid:19) . (6) If SNR = O (1) instead, we have inf ˆ z sup z ∗ ∈ [ k ] n E h ( z, z ∗ ) ≥ c for some constant c > . Theorem 2.1 allows the cluster numbers k to grow with n and shows that SNR → ∞ is anecessary condition to have a consistent clustering if k is a constant. Theorem 2.1 holds for anyarbitrary { θ ∗ a } a ∈ [ k ] and Σ ∗ , and the minimax lower bound depend on them through SNR . Theparameter space is only for z ∗ while { θ ∗ a } a ∈ [ k ] and Σ ∗ are fixed. Hence, (6) can be interpreted as apointwise result, and it captures precisely the explicit dependence of the minimaxity on { θ ∗ a } a ∈ [ k ] and Σ ∗ .Theorem 2.1 is closely related to the Linear Discriminant Analysis (LDA). If there are only twoclusters, and if the centers and the covariance matrix are known, then estimating each z ∗ j is exactlythe task of LDA: we want to figure out which normal distribution an observation Y j is generatedfrom, where the two normal distributions have different means but the same covariance matrix. Infact, this is also how Theorem 2.1 is proved: we will first reduce the estimation problem of z ∗ into4 two-point hypothesis testing problem for each individual z ∗ j , the error of which is given in Lemma2.1 by the analysis of LDA, and then aggregate all the testing errors together.In the following lemma, we give a sharp and explicit formula for the testing error of the LDA.Here we have two normal distributions N ( θ ∗ , Σ ∗ ) and N ( θ ∗ , Σ ∗ ) and an observation X that is gen-erated from one of them. We are interested in estimating which distribution it is from. By Neyman-Pearson lemma, it is known that the likelihood ratio test I (cid:8) θ ∗ − θ ∗ ) T (Σ ∗ ) − X ≥ θ ∗ T (Σ ∗ ) − θ ∗ − θ ∗ T (Σ ∗ ) − θ ∗ (cid:9) is the optimal testing procedure. Then by using the Gaussian tail probability, we are able to obtainthe optimal testing error, the lower bound of which is given in Lemma 2.1. Lemma 2.1 (Testing Error for Linear Discriminant Analysis) . Consider two hypotheses H : X ∼N ( θ ∗ , Σ ∗ ) and H : X ∼ N ( θ ∗ , Σ ∗ ) . Define a testing procedure φ = I (cid:8) θ ∗ − θ ∗ ) T (Σ ∗ ) − X ≥ θ ∗ T (Σ ∗ ) − θ ∗ − θ ∗ T (Σ ∗ ) − θ ∗ (cid:9) . Then we have inf ˆ φ ( P H ( ˆ φ = 1)+ P H ( ˆ φ = 0)) = P H ( φ = 1)+ P H ( φ = 0) . If (cid:107) ( θ ∗ − θ ∗ ) T (Σ ∗ ) − (cid:107) →∞ , we have inf ˆ φ ( P H ( ˆ φ = 1) + P H ( ˆ φ = 0)) ≥ exp (cid:32) − (1 + o (1)) (cid:107) ( θ ∗ − θ ∗ ) T (Σ ∗ ) − (cid:107) (cid:33) . Otherwise, inf ˆ φ ( P H ( ˆ φ = 1) + P H ( ˆ φ = 0)) ≥ c for some constant c > . −2 0 2 4 − − x y N ( q , S * ) N ( q , S * ) f = f = − − x y SNR/2 N (0, I d ) N (( S * ) −1/2 ( q - q ), I d ) f = f = Figure 1: A geometric interpretation of
SNR .With the help of Lemma 2.1, we have a geometric interpretation of
SNR . In the left panel ofFigure 1, we have two normal distributions N ( θ ∗ , Σ ∗ ) and N ( θ ∗ , Σ ∗ ) for X to be generated from.The black line represents the optimal testing procedure φ displayed in Lemma 2.1 that dividesthe space into two half-spaces. To calculate the testing error, we can make a transformation X (cid:48) =(Σ ∗ ) − ( X − θ ∗ ) so that the two normal distributions become isotropic: N (0 , I d ) and N ((Σ ∗ ) − ( θ ∗ − θ ∗ ) , I d ) as displayed in the right panel. Then the distance between the two centers are (cid:107) (Σ ∗ ) − ( θ ∗ − θ ∗ ) (cid:107) , and the distance between a center and the black curve is half of it. Then P H ( ˆ φ = 1) is theprobability of N (0 , I d ) in the grayed area, which is equal to exp( − (1 + o (1)) (cid:107) (Σ ∗ ) − ( θ ∗ − θ ∗ ) (cid:107) / (cid:107) (Σ ∗ ) − ( θ ∗ − θ ∗ ) (cid:107) is the effective distance between thetwo centers of N ( θ ∗ , Σ ∗ ) and N ( θ ∗ , Σ ∗ ) for the clustering problem, considering the geometry of thecovariance matrix. Since we have multiple clusters, SNR defined in (2) can be interpreted as theminimum effective distances among the centers { θ ∗ a } a ∈ [ k ] considering the anisotropic structure ofΣ ∗ and it captures the intrinsic difficulty of the clustering problem.5 .3 Rate-Optimal Adaptive Procedure In this section, we will propose a computationally feasible and rate-optimal procedure for clusteringunder Model 1. Summarized in Algorithm 1, the proposed algorithm is a variant of the Lloydalgorithm. Starting from some initialization, it updates the estimation of the centers { θ ∗ a } a ∈ [ k ] (in(7)), the covariance matrix Σ ∗ (in (8)), and the cluster assignment vector z ∗ (in (9)) iteratively.It differs from the Lloyd’s algorithm in the sense that Lloyd’s algorithm is for isotropic GMMswithout the covariance matrix update (8). In addition, in (9) it updates the estimation of z ∗ j byargmin a ∈ [ k ] ( Y j − θ ( t ) a ) T ( Y j − θ ( t ) a ) instead. To distinguish them from each other, we call the classicalLloyd’s algorithm as the vanilla Lloyd’s algorithm , and name Algorithm 1 as the adjusted Lloyd’salgorithm , as it is adjusted to the unknown and anisotropic covariance matrix.Algorithm 1 can also be interpreted as a hard EM algorithm. If we apply the ExpectationMaximization (EM) for Model 1, we will have an M step for estimating parameters { θ ∗ a } a ∈ [ k ] andΣ ∗ and an E step for estimating z ∗ . It turns out the updates on the parameters (7) - (8) areexactly the same as the updates of EM (M step). However, the update on z ∗ in Algorithm 1 isdifferent from that in the EM. Instead of taking a conditional expectation (E step), we also take amaximization in (9). As a result, Algorithm 1 consists solely of M steps for both the parametersand z ∗ , which is known as a hard EM algorithm. Algorithm 1:
Adjusted Lloyd’s Algorithm for Model 1 (1).
Input:
Data Y , number of clusters k , an initialization z (0) , number of iterations T Output: z ( T ) for t = 1 , . . . , T do Update the centers: θ ( t ) a = (cid:80) j ∈ [ n ] Y j I (cid:110) z ( t − j = a (cid:111)(cid:80) j ∈ [ n ] I (cid:110) z ( t − j = a (cid:111) , ∀ a ∈ [ k ] . (7) Update the covariance matrix:Σ ( t ) = (cid:80) a ∈ [ k ] (cid:80) j ∈ [ n ] ( Y j − θ ( t ) a )( Y j − θ ( t ) a ) T I (cid:110) z ( t − j = a (cid:111) n . (8) Update the cluster estimations: z ( t ) j = argmin a ∈ [ k ] ( Y j − θ ( t ) a ) T (Σ ( t ) ) − ( Y j − θ ( t ) a ) , j ∈ [ n ] . (9)In Theorem 2.2, we give a computational and statistical guarantee of the proposed Algorithm1. We show that starting from a decent initialization, within log n iterations, Algorithm 1 achievesan error rate exp (cid:0) − (1 + o (1)) SNR / (cid:1) which matches with the minimax lower bound given inTheorem 2.1. As a result, Algorithm 1 is a rate-optimal procedure. In addition, the algorithm isfully adaptive to the unknown { θ ∗ a } a ∈ [ k ] and Σ ∗ . The only information assumed to be known is k the number of clusters, which is commonly assumed to be known in clustering literature [15, 8, 14].The theorem also shows that the number of iterations to achieve the optimal rate is at most log n ,6hich provides implementation guidance to practitioners. Theorem 2.2.
Assume kd = O ( √ n ) and min a ∈ k (cid:80) nj =1 I { z ∗ j = a } ≥ αnk for some constant α > .Assume SNR k → ∞ and λ d (Σ ∗ ) /λ (Σ ∗ ) = O (1) . For Algorithm 1, suppose z (0) satisfies (cid:96) ( z (0) , z ∗ ) = o ( n/k ) with probability at least − η . Then with probability at least − η − n − − exp( − SNR ) , wehave h ( z ( t ) , z ∗ ) ≤ exp (cid:18) − (1 + o (1)) SNR (cid:19) , for all t ≥ log n. We have remarks on the assumptions of Theorem 2.2. We allow the number of clusters k togrow with n . When k is a constant, the assumption on SNR → ∞ is the necessary condition to havea consistent recovery of z ∗ according to the minimax lower bound presented in Theorem 2.1. Theassumption on Σ ∗ is to make sure the covariance matrix is well-conditioned. The dimensionality d is assumed to be at most O ( √ n ), an assumption that is stronger than that in [15, 8, 14] which onlyneeds d = O ( n ). This is due to that compared to these papers, we need to estimate the covariancematrix Σ ∗ and to have a control on the estimation error (cid:107) Σ ( t ) − Σ ∗ (cid:107) .The requirement for the initialization (cid:96) ( z (0) , z ∗ ) = o ( n/k ) can be fulfilled by simple procedures.A popular choice is the vanilla Lloyd’s algorithm the performance of which is studied in [15, 8].Since (cid:15) j are sub-Gaussian random variables with proxy variance λ max , [8] implies the vanilla Lloyd’salgorithm output ˆ z satisfies (cid:96) (ˆ z, z ∗ ) ≤ n exp( − (1 + o (1))∆ / (8 λ max )) with probability at least1 − exp( − ∆) − n − , under the assumption that SNR /k → ∞ . Note that [8] is for isotropic GMMs,but its results can be extended to sub-Gaussian mixture models with nearly identical proof. Thenwe have (cid:96) (ˆ z, z ∗ ) = o ( n/k ), as ∆ /λ max and SNR are both in the same order under the assumption SNR /k → ∞ . As a result, we immediately have the following corollary. Corollary 2.1.
Assume kd = O ( √ n ) and min a ∈ k (cid:80) nj =1 I { z ∗ j = a } ≥ αnk for some constant α > . Assume SNR k → ∞ and λ d (Σ ∗ ) /λ (Σ ∗ ) = O (1) . Using the vanilla Lloyd’s algorithm as theinitialization z (0) in Algorithm 1, we have with probability at least − n − − exp( − SNR ) − exp( − ∆) , h ( z ( t ) , z ∗ ) ≤ exp (cid:18) − (1 + o (1)) SNR (cid:19) , for all t ≥ log n. In this section, we study the GMM with covariance matrices from each cluster unknown and notnecessarily equal to each other. The data generation process can be displayed as follow,
Model 2: Y j = θ ∗ z ∗ j + (cid:15) j , where (cid:15) j ind ∼ N (0 , Σ ∗ z ∗ j ) , ∀ j ∈ [ n ] . (10)We call it Model 2 throughout the paper to distinguish it from Model 1 studied in Section 2. Thedifference between (10) and (1) is that we now have { Σ ∗ a } a ∈ [ k ] instead of a shared Σ ∗ . We considerthe same loss functions as in (4) and (5). Signal-to-noise Ratio.
The signal-to-noise ratio for Model 2 is defined as follows. We use thenotation
SNR (cid:48) to distinguish it from
SNR for Model 1. Compared to
SNR , SNR (cid:48) is much more7omplicated and does not have an explicit formula. We first define a space B a,b ∈ R d for any a, b ∈ [ k ] such that a (cid:54) = b : B a,b = (cid:40) x ∈ R d : x T Σ ∗ a Σ ∗− b ( θ ∗ a − θ ∗ b ) + 12 x T (cid:18) Σ ∗ a Σ ∗− b Σ ∗ a − I d (cid:19) x ≤ −
12 ( θ ∗ a − θ ∗ b ) T Σ ∗− b ( θ ∗ a − θ ∗ b ) + 12 log | Σ ∗ a | −
12 log | Σ ∗ b | (cid:41) . We then define
SNR (cid:48) a,b = 2 min x ∈B a,b (cid:107) x (cid:107) and SNR (cid:48) = min a,b ∈ [ n ]: a (cid:54) = b SNR (cid:48) a,b . (11)The from of SNR (cid:48) is closely connected to the testing error of the Quadratic Discriminant Analysis(QDA), which we will give in Lemma 3.1. For the interpretation of the
SNR (cid:48) (especially from ageometric point of view), we defer it after presenting Lemma 3.1. Here let us consider a few specialcases where we are able to simplify
SNR (cid:48) : (1) When Σ ∗ a = Σ ∗ for all a ∈ [ k ], by simple algebra, wehave SNR (cid:48) a,b = (cid:107) ( θ ∗ a − θ ∗ b ) T Σ ∗− (cid:107) for any a, b ∈ [ k ] such that a (cid:54) = b . Hence, SNR (cid:48) = SNR and Model2 is reduced to the Model 1. (2) When Σ ∗ = σ a I d for any a ∈ [ k ] where σ , . . . , σ k > SNR (cid:48) a,b , SNR (cid:48) b,a both close to 2 (cid:107) θ ∗ a − θ ∗ b (cid:107) / ( σ a + σ b ). From these examples, we cansee SNR (cid:48) is determined by both the centers { θ ∗ a } a ∈ [ k ] and the covariance matrices { Σ ∗ a } a ∈ [ k ] . We first establish the minimax lower bound for the clustering problem under Model 2.
Theorem 3.1.
Under the assumption
SNR (cid:48) √ log k → ∞ , we have inf ˆ z sup z ∗ ∈ [ k ] n E h ( z, z ∗ ) ≥ exp (cid:32) − (1 + o (1)) SNR (cid:48) (cid:33) . If SNR (cid:48) = O (1) instead, we have inf ˆ z sup z ∗ ∈ [ k ] n E h ( z, z ∗ ) ≥ c for some constant c > . Despite that the statement of Theorem 3.1 looks similar to that of Theorem 2.1, the two minimaxlower bounds are different from each other due to the discrepancy in the dependence of the centersand the covariance matrices in
SNR (cid:48) and
SNR . By the same argument as in Section 2.2, the minimaxlower bound established in Theorem 3.1 is closely related to the Quadratic Discriminant Analysis(QDA) between two normal distributions with different means and different covariance matrices.
Lemma 3.1 (Testing Error for Quadratic Discriminant Analysis) . Consider two hypotheses H : X ∼ N ( θ ∗ , Σ ∗ ) and H : X ∼ N ( θ ∗ , Σ ∗ ) . Define a testing procedure φ = I (cid:8) log | Σ ∗ | + ( x − θ ∗ ) T Σ ∗ ( x − θ ∗ ) ≥ log | Σ ∗ | + ( x − θ ∗ ) T Σ ∗ ( x − θ ∗ ) (cid:9) . Then we have inf ˆ φ ( P H ( ˆ φ = 1)+ P H ( ˆ φ = 0)) = P H ( φ = 1)+ P H ( φ = 0) . If min (cid:8) SNR (cid:48) , , SNR (cid:48) , (cid:9) →∞ , we have inf ˆ φ ( P H ( ˆ φ = 1) + P H ( ˆ φ = 0)) ≥ exp (cid:32) − (1 + o (1)) min (cid:8) SNR (cid:48) , , SNR (cid:48) , (cid:9) (cid:33) . Otherwise, inf ˆ φ ( P H ( ˆ φ = 1) + P H ( ˆ φ = 0)) ≥ c for some constant c > . − − x y N ( q , S ) N ( q , S ) f = f = − − x y SNR' N (0, I d ) N (( S ) −1/2 ( q - q ), ( S ) −1/2 S ( S ) −1/2 ) f = f = Figure 2: A geometric interpretation of
SNR (cid:48) .From Lemma 3.1, we can have a geometric interpretation of
SNR (cid:48) . In the left panel of Figure 2,we have two normal distributions N ( θ ∗ , Σ ∗ ) and N ( θ ∗ , Σ ∗ ) where X can be generated from. Theblack curve represents the optimal testing procedure φ displayed in Lemma 3.1. Since Σ ∗ is notnecessarily equal to Σ ∗ , the black curve is not necessarily a straight line. If H is true, then theprobability for X to be incorrectly classified is when X falls in the gray area, which is P H ( ˆ φ = 1).To calculate it, we can make a transformation X (cid:48) = (Σ ∗ ) − ( X − θ ∗ ). Then displayed in the rightpanel of Figure 2, the two distributions become N (0 , I d ) and N ((Σ ∗ ) − ( θ ∗ − θ ∗ ) , (Σ ∗ ) − Σ ∗ (Σ ∗ ) − ),and the optimal testing procedure I { X (cid:48) ∈ B , } . As a result, in the right panel of Figure 2, B , represents the space colored by gray, and the black curve is its boundary. Then P H ( ˆ φ = 1) is equalto P ( N (0 , I d ) ∈ B , ), which can be shown to be determined by the minimum distance between thecenter of N (0 , I d ) and the space B , . Denote the minimum distance by SNR (cid:48) , /
2, by Lemmas 7.10and Lemma 7.11, we can show P ( N (0 , I d ) ∈ B , ) = exp( − (1 + o (1)) SNR (cid:48) , / SNR (cid:48) can be interpreted as the minimum effective distance among the centers { θ ∗ a } a ∈ [ k ] consideringthe anisotropic and heterogeneous structure of { Σ ∗ a } a ∈ [ k ] and it captures the intrinsic difficulty ofthe clustering problem under Model 2. In this section, we will propose a computationally feasible and rate-optimal procedure for clusteringunder Model 2. Similar to Algorithm 1, the proposed Algorithm 2 can be seen as a variant of theLloyd’s algorithm that is adjusted to the unknown and heterogeneous covariance matrices. It canalso be interpreted as a hard EM algorithm under Model 2. Algorithm 2 differs from Algorithm 1in (13) and (14), as now there are k covariance matrices to be estimated.In Theorem 3.2, we give a computational and statistical guarantee of the proposed Algorithm2. We show that provided with some decent initialization, Algorithm 2 is able to achieve theminimax lower bound within log n iterations. The assumptions needed in Theorem 3.2 are similarto those in Theorem 3.2, except that we require stronger assumptions on k and the dimensionality d since now we have k (instead of one) covariance matrices to be estimated. In addition, bymax a,b ∈ [ k ] λ d (Σ ∗ a ) /λ (Σ ∗ b ) = O (1) we not only assume each of the k covariance matrices is well-conditioned, but also assume they are comparable to each other. Theorem 3.2.
Assume k, d = O (1) and min a ∈ k (cid:80) nj =1 I { z ∗ j = a } ≥ αnk for some constant α > .Assume SNR (cid:48) → ∞ and max a,b ∈ [ k ] λ d (Σ ∗ a ) /λ (Σ ∗ b ) = O (1) . For Algorithm 2 , suppose z (0) satisfies lgorithm 2: Adjusted Lloyd’s Algorithm for Model 2 (10).
Input:
Data Y , number of clusters k , an initialization z (0) , number of iterations T Output: z ( T ) for t = 1 , . . . , T do Update the centers: θ ( t ) a = (cid:80) j ∈ [ n ] Y j I (cid:110) z ( t − j = a (cid:111)(cid:80) j ∈ [ n ] I (cid:110) z ( t − j = a (cid:111) , ∀ a ∈ [ k ] . (12) Update the covariance matrices:Σ ( t ) a = (cid:80) j ∈ [ n ] ( Y j − θ ( t ) a )( Y j − θ ( t ) a ) T I (cid:110) z ( t − j = a (cid:111)(cid:80) j ∈ [ n ] I (cid:110) z ( t − j = a (cid:111) , ∀ a ∈ [ k ] . (13) Update the cluster estimations: z ( t ) j = argmin a ∈ [ k ] ( Y j − θ ( t ) a ) T (Σ ( t ) a ) − ( Y j − θ ( t ) a ) + log | Σ ( t ) a | , j ∈ [ n ] . (14) end (cid:96) ( z (0) , z ∗ ) = o ( n/k ) with probability at least − η . Then with probability at least − η − n − − exp( − SNR (cid:48) ) , we have h ( z ( t ) , z ∗ ) ≤ exp (cid:32) − (1 + o (1)) SNR (cid:48) (cid:33) , for all t ≥ log n. The vanilla Lloyd’s algorithm can be used as the initialization for Algorithm 2. Under theassumption that λ d (Σ ∗ ) /λ (Σ ∗ ) = O (1), Model 2 is also a sub-Gaussian mixture model. By thesame argument as in Section 2.3 we have the following corollary. Corollary 3.1.
Assume k, d = O (1) and min a ∈ k (cid:80) nj =1 I { z ∗ j = a } ≥ αnk for some constant α > . Assume SNR (cid:48) → ∞ and λ d (Σ ∗ ) /λ (Σ ∗ ) = O (1) . Using the vanilla Lloyd’s algorithm as theinitialization z (0) in Algorithm 2, we have with probability at least − n − − exp( − SNR ) , h ( z ( t ) , z ∗ ) ≤ exp (cid:32) − (1 + o (1)) SNR (cid:48) (cid:33) , for all t ≥ log n. In this section, we compare the performance of the proposed methods with other popular clusteringmethods on synthetic datasets under different settings.
Model 1.
The first simulation is designed for the GMM with unknown but homogeneous co-variance matrices (i.e., Model 1). We independently generate n = 1200 samples with dimension10 iteration l og ( e rr o r) method spectralvanilla Lloydspectral + Alg 1vanilla Lloyd + Alg 1 −7−6−5 2 4 6 iteration l og ( e rr o r) method spectralvanilla Lloydspectral + Alg 1vanilla Lloyd + Alg 1 Figure 3: Left: Performance of Algorithm 1 compared with other methods under Model 1. Right:Performance of Algorithm 2 compared with other methods under Model 2. d = 50 from k = 30 clusters. Each cluster has 40 samples. We set Σ ∗ = U T Λ U , where Λ is a50 ×
50 diagonal matrix with diagonal elements selected from 0.5 to 8 with equal space and U isa randomly generated orthogonal matrix. The centers { θ ∗ a } a ∈ [ n ] are orthogonal or each other with (cid:107) θ ∗ (cid:107) = . . . = (cid:107) θ ∗ (cid:107) = 9. We consider four popular clustering methods: (1) the spectral clusteringmethod [14] (denoted as “spectral”), (2) the vanilla Lloyd’s algorithm [15] (denoted as “vanillaLloyd”), (3) the proposed Algorithm 1 initialized by the spectral clustering (denoted as “spectral+ Alg 1”), and (4) Algorithm 1 initialized by the vanilla Lloyd (denoted as “vanilla Lloyd + Alg1”). The comparison is presented in the left panel of Figure 3.In the plot, the x -axis is the number of iterations and the y -axis is the logarithm of the mis-clustering error rate, i.e., log( h ). Each of the curves plotted is an average of 100 independent trials.We can see the proposed Algorithm 1 outperforms the spectral clustering and the vanilla Lloyd’salgorithm significantly. What is more, the dashed line represents the optimal exponent − SNR / Model 2.
We also compare the performances of four methods (spectral, vanilla Lloyd, spectral+ Alg 2, and vanilla Lloyd + Alg 2) for the GMM with unknown and heterogeneous covariancematrices (i.e., Model 2). In this case, we take n = 1200, k = 3 and d = 5. We set Σ ∗ = I ,Σ ∗ = Λ which is a 5 × ∗ = U T Λ U , where Λ is a diagonal matrix with elements selected uniformly from 0.5 to 2and U is a randomly generated orthogonal matrix. To simplify the calculation of SNR (cid:48) , we take θ ∗ as a randomly selected unit vector, θ ∗ = θ ∗ + 5 e with e denoting the vector with a 1 in the firstcoordinate and 0’s elsewhere and θ ∗ = θ ∗ + v with v randomly selected satisfying (cid:107) v (cid:107) = 10. Thecomparison is presented in the right panel of Figure 3 where each curve plotted is an average of100 independent trials.From the plot, we can clearly see the proposed Algorithm 2 improves greatly the spectral clus-tering and the vanilla Lloyd algorithm. The dashed line represents the optimal exponent − SNR (cid:48) / Proofs in Section 2
Proof of Lemma 2.1.
Note that φ is the likelihood ratio test. Hence by the Neyman-Pearson lemma,it is the optimal procedure. Let (cid:15) ∼ N (0 , I d ). By Gaussian tail probability, we have P H ( φ = 1) + P H ( φ = 0) = P (cid:0) θ ∗ − θ ∗ ) T (Σ ∗ ) − ( θ + (cid:15) ) ≥ θ ∗ T (Σ ∗ ) − θ ∗ − θ ∗ T (Σ ∗ ) − θ ∗ (cid:1) + P (cid:0) θ ∗ − θ ∗ ) T (Σ ∗ ) − ( θ + (cid:15) ) < θ ∗ T (Σ ∗ ) − θ ∗ − θ ∗ T (Σ ∗ ) − θ ∗ (cid:1) = 2 P (cid:0) θ ∗ − θ ∗ ) T (Σ ∗ ) − ( θ + (cid:15) ) ≥ θ ∗ T (Σ ∗ ) − θ ∗ − θ ∗ T (Σ ∗ ) − θ ∗ (cid:1) = 2 P (cid:18) (cid:15) > (cid:107) ( θ ∗ − θ ∗ ) T (Σ ∗ ) − (cid:107) (cid:19) ≥ C min (cid:40) , (cid:107) ( θ ∗ − θ ∗ ) T (Σ ∗ ) − (cid:107) exp (cid:32) − (cid:107) ( θ ∗ − θ ∗ ) T (Σ ∗ ) − (cid:107) (cid:33)(cid:41) , for some constant C >
0. The proof is complete.
Proof of Theorem 2.1.
We adopt the idea from [15]. Without loss of generality, assume the min-imum in (2) is achieved at a = 1 , b = 2 so that SNR = ( θ ∗ − θ ∗ ) T (Σ ∗ ) − ( θ ∗ − θ ∗ ). Consider anarbitrary ¯ z ∈ [ k ] n such that |{ i ∈ [ n ] : ¯ z i = a }| ≥ (cid:100) nk − n k (cid:101) for any a ∈ [ k ]. Then for each a ∈ [ k ], we can choose a subset of { i ∈ [ n ] : ¯ z i = a } with cardinality (cid:100) nk − n k (cid:101) , denoted by T a . Let T = ∪ a ∈ [ k ] T a . Then we can define a parameter space Z = { z ∈ [ k ] n : z i = ¯ z i for all i ∈ T and z i ∈ { , } if i ∈ T c } . Notice that for any z (cid:54) = ˜ z ∈ Z , we have n (cid:80) ni =1 I { z i (cid:54) = ˜ z i } ≤ kn n k = k and n (cid:80) ni =1 I { ψ ( z i ) (cid:54) =˜ z i } ≥ n ( n k − n k ) ≥ k for any permutation ψ on [ k ]. Thus we can conclude h ( z, ˜ z ) = 1 n n (cid:88) i =1 I { z i (cid:54) = ˜ z i } , for all z, ˜ z ∈ Z . We notice that inf ˆ z sup z ∗ ∈ [ k ] n E h (ˆ z, z ∗ ) ≥ inf ˆ z sup z ∗ ∈Z E h (ˆ z, z ∗ ) ≥ inf ˆ z |Z| (cid:88) z ∗ ∈Z E h (ˆ z, z ∗ ) ≥ n (cid:88) i ∈ T c inf ˆ z i |Z| (cid:88) z ∗ ∈Z P z ∗ (ˆ z i (cid:54) = z i ) . Now consider a fixed i ∈ T c . Define Z a = { z ∈ Z : z i = a } for a = 1 ,
2. Then we can see Z = Z ∪ Z and Z ∩ Z = ∅ . What is more, there exists a one-to-one mapping f ( · ) between Z and Z , such that for any z ∈ Z , we have f ( z ) ∈ Z with [ f ( z )] j = z j for any j (cid:54) = i and [ f ( z )] i = 2.Hence, we can reduce the problem into a two-point testing probe and then apply Lemma 2.1. We12rst consider the case that SNR → ∞ . We haveinf ˆ z i |Z| (cid:88) z ∗ ∈Z P z ∗ (ˆ z i (cid:54) = z i ) = inf ˆ z i |Z| (cid:88) z ∗ ∈Z (cid:0) P z ∗ (ˆ z i (cid:54) = 1) + P f ( z ∗ ) (ˆ z i (cid:54) = 2) (cid:1) ≥ |Z| (cid:88) z ∗ ∈Z inf ˆ z i (cid:0) P z ∗ (ˆ z i (cid:54) = 1) + P f ( z ∗ ) (ˆ z i (cid:54) = 2) (cid:1) ≥ |Z |Z exp (cid:18) − (1 + η ) SNR (cid:19) ≥
12 exp (cid:18) − (1 + η ) SNR (cid:19) , for some η = o (1). Here the second inequality is due to Lemma 2.1. Then,inf ˆ z sup z ∗ ∈ [ k ] n E h (ˆ z, z ∗ ) ≥ | T c | n exp (cid:18) − (1 + η ) SNR (cid:19) = 116 k exp (cid:18) − (1 + η ) SNR (cid:19) = exp (cid:18) − (1 + η (cid:48) ) SNR (cid:19) , for some other η (cid:48) = o (1), where we use SNR / log k → ∞ .The proof for the case SNR = O (1) is similar and hence is omitted here. In this section, we will prove Theorem 2.2 using the framework developed in [8] for analyzingiterative algorithms. The key idea to establish statistical guarantees of the proposed iterativealgorithm (i.e., Algorithm 1) is to perform a “one-step” analysis. That is, assume we have anestimation z for z ∗ . Then we can apply (7), (8), and (9) on z to obtain { ˆ θ a ( z ) } a ∈ [ k ] , ˆΣ( z ), andˆ z ( z ) sequentially, which all depend on z . Then ˆ z ( z ) can be seen as a refined estimation of z ∗ . Wewill first build the connection between (cid:96) ( z, z ∗ ) with (cid:96) (ˆ z ( z ) , z ∗ ) as in Lemma 5.1. To establish theconnection, we will decompose the loss (cid:96) (ˆ z ( z ) , z ∗ ) into several errors according to the differencein their behaviors. Then we will give conditions (Condition 5.2.1 - 5.2.3), under which we willshow these errors are either negligible or well controlled by (cid:96) ( z, z ∗ ). With Lemma 5.1 established,in Lemma 5.2 we will show the connection can be extended to multiple iterations, under twomore conditions (Condition 5.2.4 - 5.2.5). Last, we will show all these conditions hold with highprobability, and hence prove Theorem 2.2.In the statement of Theorem 2.2, the covariance matrix Σ ∗ is assumed to satisfy λ d (Σ ∗ ) /λ (Σ ∗ ) = O (1). Without loss of generality, we can replace it by assuming Σ ∗ satisfies λ min ≤ λ (Σ ∗ ) ≤ λ d (Σ ∗ ) ≤ λ max (15)where λ min , λ max > { Y j } be some dataset generated according to Model1 with parameters { θ ∗ a } a ∈ [ k ] , Σ ∗ , and z ∗ . The assumption λ d (Σ ∗ ) /λ (Σ ∗ ) = O (1) is equivalent toassume there exist some constants λ min , λ max > σ > n such that λ min σ ≤ λ (Σ ∗ ) ≤ λ d (Σ ∗ ) ≤ λ max σ . Then performing a scaling transformation weobtain another dataset Y (cid:48) j = Y j /σ . Note that: 1) { Y (cid:48) j } can be seen to be generated from Model 1with parameters { θ ∗ a /σ } a ∈ [ k ] , Σ ∗ /σ , and z ∗ , 2) clustering on { Y j } is equivalent to clustering on { Y (cid:48) j } ,3) by the definition in (2), the SNR s that are associated with the data generating processes of { Y (cid:48) j } { Y j } are exactly equal to each other, and 4) we have λ min ≤ λ (Σ ∗ /σ ) ≤ λ d (Σ ∗ /σ ) ≤ λ max .Hence, in this section, we will assume (15) holds and it will not lose any generality.In the proof, we will mainly use the loss (cid:96) ( · , · ) for convenience. Recall ∆ is defined as theminimum distance among centers in (3). We have h ( z, z ∗ ) ≤ (cid:96) ( z, z ∗ ) n ∆ . (16)The algorithmic guarantees Lemma 5.1 and Lemma 5.2 are established with respect to the (cid:96) ( · , · )loss. But eventually we will use (16) to convert it into a result with respect to h ( · , · ) in the proofof Theorem 2.2. Error Decomposition for the One-step Analysis:
Consider an arbitrary z ∈ [ k ] n . Apply(7), (8), and (9) on z to obtain { ˆ θ a ( z ) } a ∈ [ k ] , ˆΣ( z ), and ˆ z ( z ):ˆ θ a ( z ) = (cid:80) j ∈ [ n ] Y j I { z j = a } (cid:80) j ∈ [ n ] I { z j = a } , a ∈ [ k ]ˆΣ( z ) = (cid:80) a ∈ [ k ] (cid:80) j ∈ [ n ] ( Y j − ˆ θ a ( z ))( Y j − ˆ θ a ( z )) T I { z j = a } n , ˆ z j ( z ) = argmin a ∈ [ k ] ( Y j − ˆ θ a ( z )) T ( ˆΣ( z )) − ( Y j − ˆ θ a ) , j ∈ [ n ] . For simplicity we use ˆ z that is short for ˆ z ( z ). Let j ∈ [ n ] be an arbitrary index with z ∗ j = a .According to (9), z ∗ j will be incorrectly estimated after on iteration in ˆ z j if a (cid:54) = argmin a ∈ [ k ] ( Y j − ˆ θ a ( z )) T ( ˆΣ( z )) − ( Y j − ˆ θ a ( z )). That is, it is important to analyze the event (cid:104) Y j − ˆ θ b ( z ) , ( ˆΣ( z )) − ( Y j − ˆ θ b ( z )) (cid:105) ≤ (cid:104) Y j − ˆ θ a ( z ) , ( ˆΣ( z )) − ( Y j − ˆ θ a ( z )) (cid:105) , (17)for any b ∈ [ k ] \ { a } . Note that Y j = θ ∗ a + (cid:15) j . After some rearrangements, we can see (17) isequivalent to, (cid:104) (cid:15) j , ( ˆΣ( z ∗ )) − (ˆ θ a ( z ∗ ) − ˆ θ b ( z ∗ )) (cid:105)≤ − (cid:104) θ ∗ a − θ ∗ b , (Σ ∗ ) − ( θ ∗ a − θ ∗ b ) (cid:105) + F j ( a, b, z ) + G j ( a, b, z ) + H j ( a, b, z ) , where F j ( a, b, z ) = (cid:104) (cid:15) j , ( ˆΣ( z )) − (ˆ θ b ( z ) − ˆ θ b ( z ∗ )) (cid:105) − (cid:104) (cid:15) j , ( ˆΣ( z )) − (ˆ θ a ( z ) − ˆ θ a ( z ∗ )) (cid:105) + (cid:104) (cid:15) j , (( ˆΣ( z )) − − ( ˆΣ( z ∗ )) − )(ˆ θ b ( z ∗ ) − ˆ θ a ( z ∗ )) (cid:105) ,G j ( a, b, z ) = (cid:18) (cid:104) θ ∗ a − ˆ θ a ( z ) , ( ˆΣ( z )) − ( θ ∗ a − ˆ θ a ( z )) (cid:105) − (cid:104) θ ∗ a − ˆ θ a ( z ∗ ) , ( ˆΣ( z )) − ( θ ∗ a − ˆ θ a ( z ∗ )) (cid:105) (cid:19) + (cid:18) (cid:104) θ ∗ a − ˆ θ a ( z ∗ ) , ( ˆΣ( z )) − ( θ ∗ a − ˆ θ a ( z ∗ )) (cid:105) − (cid:104) θ ∗ a − ˆ θ a ( z ∗ ) , ( ˆΣ( z ∗ )) − ( θ ∗ a − ˆ θ a ( z ∗ )) (cid:105) (cid:19) + (cid:18) − (cid:104) θ ∗ a − ˆ θ b ( z ) , ( ˆΣ( z )) − ( θ ∗ a − ˆ θ b ( z )) (cid:105) + 12 (cid:104) θ ∗ a − ˆ θ b ( z ∗ ) , ( ˆΣ( z )) − ( θ ∗ a − ˆ θ b ( z ∗ )) (cid:105) (cid:19) + (cid:18) − (cid:104) θ ∗ a − ˆ θ b ( z ∗ ) , ( ˆΣ( z )) − ( θ ∗ a − ˆ θ b ( z ∗ )) (cid:105) + 12 (cid:104) θ ∗ a − ˆ θ b ( z ∗ ) , ( ˆΣ( z ∗ )) − ( θ ∗ a − ˆ θ b ( z ∗ )) (cid:105) (cid:19) , H j ( a, b, z ) = (cid:18) − (cid:104) θ ∗ a − ˆ θ b ( z ∗ ) , ( ˆΣ( z ∗ )) − ( θ ∗ a − ˆ θ b ( z ∗ )) (cid:105) + 12 (cid:104) θ ∗ a − θ ∗ b , ( ˆΣ( z ∗ )) − ( θ ∗ a − θ ∗ b ) (cid:105) (cid:19) + (cid:18) − (cid:104) θ ∗ a − θ ∗ b , ( ˆΣ( z ∗ )) − ( θ ∗ a − θ ∗ b ) (cid:105) + 12 (cid:104) θ ∗ a − θ ∗ b , (Σ ∗ ) − ( θ ∗ a − θ ∗ b ) (cid:105) (cid:19) + (cid:18) (cid:104) θ ∗ a − ˆ θ a ( z ∗ ) , ( ˆΣ( z ∗ )) − ( θ ∗ a − ˆ θ a ( z ∗ )) (cid:105) (cid:19) . Here (cid:104) (cid:15) j , ( ˆΣ( z ∗ )) − (ˆ θ a ( z ∗ ) − ˆ θ b ( z ∗ )) (cid:105) ≤ − (cid:104) θ ∗ a − θ ∗ b , (Σ ∗ ) − ( θ ∗ a − θ ∗ b ) (cid:105) is the main term that willlead to the optimal rate. Among all the remaining terms, F j ( a, b, z ) includes all terms involving (cid:15) j . G j ( a, b, z ) includes all terms related to z and H j ( a, b, z ) consists of terms that only involves z ∗ .Readers can refer [8] for more information about the decomposition. Conditions and Guarantees for One-step Analysis.
To establish the guarantee for the one-step analysis, we first give several conditions on the error terms F j ( a, b ; z ) , G j ( a, b ; z ) and H j ( a, b ; z ). Condition 5.2.1.
Assume that max { z : l ( z,z ∗ ) ≤ τ } max j ∈ [ n ] max b ∈ [ k ] \{ z ∗ j } | H j ( z ∗ j , b, z ) |(cid:104) θ ∗ z ∗ j − θ ∗ b , (Σ ∗ ) − ( θ ∗ z ∗ j − θ ∗ b ) (cid:105) ≤ δ holds with probability with at least − η for some τ, δ, η > . Condition 5.2.2.
Assume that max { z : l ( z,z ∗ ) ≤ τ } n (cid:88) j =1 max b ∈ [ k ] \{ z ∗ j } F j ( z ∗ j , b, z ) (cid:107) θ ∗ z ∗ j − θ ∗ b (cid:107) (cid:104) θ ∗ z ∗ j − θ ∗ b , (Σ ∗ ) − ( θ ∗ z ∗ j − θ ∗ b ) (cid:105) l ( z, z ∗ ) ≤ δ holds with probability with at least − η for some τ, δ, η > . Condition 5.2.3.
Assume that max { z : l ( z,z ∗ ) ≤ τ } max j ∈ [ n ] max b ∈ [ k ] \{ z ∗ j } | G j ( z ∗ j , b, z ) |(cid:104) θ ∗ z ∗ j − θ ∗ b , (Σ ∗ ) − ( θ ∗ z ∗ j − θ ∗ b ) (cid:105) ≤ δ holds with probability with at least − η for some τ, δ, η > . We next define a quantity that we refer it as the ideal error, ξ ideal ( δ ) = n (cid:88) j =1 (cid:88) b ∈ [ k ] \{ z ∗ j } (cid:107) θ ∗ z ∗ j − θ ∗ b (cid:107) I {(cid:104) (cid:15) j , ( ˆΣ( z ∗ )) − (ˆ θ a ( z ∗ ) − ˆ θ b ( z ∗ )) (cid:105) ≤ − − δ (cid:104) θ ∗ a − θ ∗ b , (Σ ∗ ) − ( θ ∗ a − θ ∗ b ) (cid:105)} . Lemma 5.1.
Assumes Conditions 5.2.1 - 5.2.3 hold for some τ, δ, η , η , η , > . We then have P (cid:18) (cid:96) (ˆ z, z ∗ ) ≤ ξ ideal ( δ ) + 14 (cid:96) ( z, z ∗ ) for any z ∈ [ k ] n such that (cid:96) ( z, z ∗ ) ≤ τ (cid:19) ≥ − η, where η = (cid:80) i =1 η i . roof. We notice that I { ˆ z j = b } ≤ I (cid:110) (cid:104) Y j − ˆ θ b ( z ) , ( ˆΣ( z )) − ( Y j − ˆ θ b ( z )) (cid:105) ≤ (cid:104) Y j − ˆ θ a ( z ) , ( ˆΣ( z )) − ( Y j − ˆ θ a ( z )) (cid:105) (cid:111) = I (cid:26) (cid:104) (cid:15) j , ( ˆΣ( z ∗ )) − (ˆ θ z ∗ j ( z ∗ ) − ˆ θ b ( z ∗ )) (cid:105)≤ − (cid:104) θ ∗ z ∗ j − θ ∗ b , (Σ ∗ ) − ( θ ∗ z ∗ j − θ ∗ b ) (cid:105) + F j ( z ∗ j , b, z ) + G j ( z ∗ j , b, z ) + H j ( z ∗ j , b, z ) (cid:27) ≤ I (cid:26) (cid:104) (cid:15) j , ( ˆΣ( z ∗ )) − (ˆ θ z ∗ j ( z ∗ ) − ˆ θ b ( z ∗ )) (cid:105) ≤ − − δ (cid:104) θ ∗ z ∗ j − θ ∗ b , (Σ ∗ ) − ( θ ∗ z ∗ j − θ ∗ b ) (cid:105) (cid:27) + I (cid:26) δ (cid:104) θ ∗ z ∗ j − θ ∗ b , (Σ ∗ ) − ( θ ∗ z ∗ j − θ ∗ b ) (cid:105) ≤ F j ( z ∗ j , b, z ) + G j ( z ∗ j , b, z ) + H j ( z ∗ j , b, z ) (cid:27) ≤ I (cid:26) (cid:104) (cid:15) j , ( ˆΣ( z ∗ )) − (ˆ θ z ∗ j ( z ∗ ) − ˆ θ b ( z ∗ )) (cid:105) ≤ − − δ (cid:104) θ ∗ z ∗ j − θ ∗ b , (Σ ∗ ) − ( θ ∗ z ∗ j − θ ∗ b ) (cid:105) (cid:27) + I (cid:26) δ (cid:104) θ ∗ z ∗ j − θ ∗ b , (Σ ∗ ) − ( θ ∗ z ∗ j − θ ∗ b ) (cid:105) ≤ F j ( z ∗ j , b, z ) (cid:27) ≤ I (cid:26) (cid:104) (cid:15) j , ( ˆΣ( z ∗ )) − (ˆ θ z ∗ j ( z ∗ ) − ˆ θ b ( z ∗ )) (cid:105) ≤ − − δ (cid:104) θ ∗ z ∗ j − θ ∗ b , (Σ ∗ ) − ( θ ∗ z ∗ j − θ ∗ b ) (cid:105) (cid:27) + 64 F j ( z ∗ j , b, z ) δ (cid:104) θ ∗ z ∗ j − θ ∗ b , (Σ ∗ ) − ( θ ∗ z ∗ j − θ ∗ b ) (cid:105) , where the third inequality comes from Condition 5.2.1. Thus, we have (cid:96) (ˆ z, z ∗ )= n (cid:88) j =1 (cid:88) b ∈ [ k ] \{ a } (cid:13)(cid:13)(cid:13) θ ∗ b − θ ∗ z ∗ j (cid:13)(cid:13)(cid:13) I { ˆ z j = b }≤ ξ ideal ( δ ) + n (cid:88) j =1 (cid:88) b ∈ [ k ] \{ a } (cid:13)(cid:13)(cid:13) θ ∗ b − θ ∗ z ∗ j (cid:13)(cid:13)(cid:13) I { ˆ z j = b } F j ( z ∗ j , b, z ) δ (cid:104) θ ∗ z ∗ j − θ ∗ b , (Σ ∗ ) − ( θ ∗ z ∗ j − θ ∗ b ) (cid:105) ≤ ξ ideal ( δ ) + (cid:96) ( z, z ∗ )4 , which implies Lemma 5.1. Here the last inequality uses Conditions 5.2.2 and 5.2.3. Conditions and Guarantees for Multiple Iterations.
In the above we establish a statisticalguarantee for the one-step analysis. Now we will extend the result to multiple iterations. Thatis, starting from some initialization z (0) , we will characterize how the losses (cid:96) ( z (0) , z ∗ ), (cid:96) ( z (1) , z ∗ ), (cid:96) ( z (2) , z ∗ ), . . . , decay. We impose a condition on ξ ideal ( δ ) and a condition for z (0) . Condition 5.2.4.
Assume that ξ ideal ( δ ) ≤ τ holds with probability with at least − η for some τ, δ, η > . Finally, we need a condition on the initialization.16 ondition 5.2.5.
Assume that (cid:96) ( z (0) , z ∗ ) ≤ τ holds with probability with at least − η for some τ, η > . With these conditions satisfied, we can give a lemma that shows the linear convergence guaranteefor our algorithm.
Lemma 5.2.
Assumes Conditions 5.2.1 - 5.2.5 hold for some τ, δ, η , η , η , η , η > . We thenhave (cid:96) ( z ( t ) , z ∗ ) ≤ ξ ideal ( δ ) + 14 (cid:96) ( z ( t − , z ∗ ) for all t ≥ , with probability at least − η , where η = (cid:80) i =1 η i .Proof. By Condition 5.2.4, 5.2.5, and a mathematical induction argument, we can easily know (cid:96) ( z ( t ) , z ∗ ) ≤ τ for any t ≥
0. Thus, Lemma 5.2 is a direct extension of Lemma 5.1.
With-high-probability Results for the Conditions and The Proof of The Main Theo-rem.
Recall the definition of ∆ in (3). Recall that in (15) we assume λ min ≤ λ (Σ ∗ ) ≤ λ d (Σ ∗ ) ≤ λ max for two constants λ min , λ max >
0. Hence we have ∆ is in the same order of
SNR . Specifically,we have 1 λ max ∆ ≤ SNR ≤ λ min ∆ . (18)Hence the assumption SNR /k → ∞ in the statement of Theorem 2.2 is equivalently ∆ /k → ∞ .Next, we give two lemmas to clarify the Conditions. The first lemma shows that δ can be taken forsome o (1) term and the second lemma shows for any δ = o (1), ξ ideal ( δ ) is upper bounded by thedesired minimax rate multiply by the sample size n . Lemma 5.3.
Under the same conditions as in Theorem 2.2, for any constant C (cid:48) > , there existssome constant C > only depending on α and C (cid:48) such that max { z : (cid:96) ( z,z ∗ ) ≤ τ } max j ∈ [ n ] max b ∈ [ k ] \{ z ∗ j } | H j ( z ∗ j , b, z ) |(cid:104) θ ∗ z ∗ j − θ ∗ b , (Σ ∗ ) − ( θ ∗ z ∗ j − θ ∗ b ) (cid:105) ≤ C (cid:114) k ( d + log n ) n (19)max { z : (cid:96) ( z,z ∗ ) ≤ τ } n (cid:88) j =1 max b ∈ [ k ] \{ z ∗ j } F j ( z ∗ j , b, z ) (cid:107) θ ∗ z ∗ j − θ ∗ b (cid:107) (cid:104) θ ∗ z ∗ j − θ ∗ b , (Σ ∗ ) − ( θ ∗ z ∗ j − θ ∗ b ) (cid:105) (cid:96) ( z, z ∗ ) ≤ Ck (cid:18) τn + 1∆ + d n ∆ (cid:19) (20)max { z : (cid:96) ( z,z ∗ ) ≤ τ } max j ∈ [ n ] max b ∈ [ k ] \{ z ∗ j } | G j ( z ∗ j , b, z ) |(cid:104) θ ∗ z ∗ j − θ ∗ b , (Σ ∗ ) − ( θ ∗ z ∗ j − θ ∗ b ) (cid:105) ≤ Ck (cid:18) τn + 1∆ (cid:114) τn + d √ τn ∆ (cid:19) (21) with probability at least − n − C (cid:48) .Proof. Under the conditions of Theorem 2.2, the inequalities (33)-(38) hold with probability atleast 1 − n − C (cid:48) . In the remaining proof, we will work on the event these inequalities hold. DenoteˆΣ a ( z ) = (cid:80) j ∈ [ n ] ( Y j − ˆ θ a ( z ))( Y j − ˆ θ a ( z )) T I { z j = a } (cid:80) j ∈ [ n ] I { z j = a } and Σ ∗ a = Σ ∗ for any a ∈ [ k ]. Then we have the equivalenceˆΣ( z ∗ ) − Σ ∗ = k (cid:88) a =1 (cid:80) nj =1 I { z ∗ j = a } n ( ˆΣ a ( z ∗ ) − Σ ∗ a ) . (cid:107) ˆΣ( z ∗ ) − Σ ∗ (cid:107) (cid:46) (cid:114) k ( d + log n ) n , and (cid:107) ˆΣ( z ) − ˆΣ( z ∗ ) (cid:107) = (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) k (cid:88) a =1 (cid:80) nj =1 I { z j = a } n ˆΣ a ( z ) − k (cid:88) a =1 (cid:80) nj =1 I { z ∗ j = a } n ˆΣ a ( z ∗ ) (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) (cid:46) (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) k (cid:88) a =1 (cid:80) nj =1 I { z j = a } n ( ˆΣ a ( z ) − ˆΣ( z ∗ )) (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) + (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) k (cid:88) a =1 (cid:80) nj =1 ( I { z j = a } − I { z ∗ j = a } ) n ˆΣ a ( z ∗ ) (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) (cid:46) k (cid:112) n(cid:96) ( z, z ∗ ) n ∆ + kn (cid:96) ( z, z ∗ ) + kdn ∆ (cid:112) (cid:96) ( z, z ∗ ) + kn ∆ (cid:96) ( z, z ∗ ) (cid:46) k (cid:112) n(cid:96) ( z, z ∗ ) n ∆ + kn (cid:96) ( z, z ∗ ) + kdn ∆ (cid:112) (cid:96) ( z, z ∗ ) . By the assumption that kd = O ( √ n ), ∆ k → ∞ and τ = o ( n/k ), we have (cid:107) ˆΣ( z ∗ ) − Σ ∗ (cid:107) , (cid:107) ˆΣ( z ) − ˆΣ( z ∗ ) (cid:107) = o (1), which implies (cid:107) ( ˆΣ( z ∗ )) − (cid:107) , (cid:107) ( ˆΣ( z )) − (cid:107) (cid:46)
1. Thus, we have (cid:107) ( ˆΣ( z ∗ )) − − (Σ ∗ ) − (cid:107) ≤ (cid:107) ( ˆΣ( z ∗ )) − (cid:107)(cid:107) ˆΣ( z ∗ ) − Σ ∗ (cid:107)(cid:107) (Σ ∗ ) − (cid:107) (cid:46) (cid:114) k ( d + log n ) n , (22)and similarly (cid:107) ( ˆΣ( z )) − − ( ˆΣ( z ∗ )) − (cid:107) (cid:46) kn (cid:96) ( z, z ∗ ) + k (cid:112) n(cid:96) ( z, z ∗ ) n ∆ + kdn ∆ (cid:112) (cid:96) ( z, z ∗ ) . (23)Now we start to prove (19)-(21). Let F j ( a, b, z ) = F (1) j ( a, b, z ) + F (2) j ( a, b, z ) + F (3) j ( a, b, z ) where F (1) j ( a, b, z ) := (cid:104) (cid:15) j , ( ˆΣ( z )) − (ˆ θ b ( z ) − ˆ θ b ( z ∗ )) (cid:105) − (cid:104) (cid:15) j , ( ˆΣ( z )) − (ˆ θ a ( z ) − ˆ θ a ( z ∗ )) (cid:105) ,F (2) j ( a, b, z ) := −(cid:104) (cid:15) j , (( ˆΣ( z )) − − ( ˆΣ( z ∗ )) − )( θ ∗ a − θ ∗ b ) (cid:105) ,F (3) j ( a, b, z ) := −(cid:104) (cid:15) j , (( ˆΣ( z )) − − ( ˆΣ( z ∗ )) − )( θ ∗ b − ˆ θ b ( z ∗ )) (cid:105) + (cid:104) (cid:15) j , (( ˆΣ( z )) − − ( ˆΣ( z ∗ )) − )( θ ∗ a − ˆ θ a ( z ∗ )) (cid:105) . Notice that n (cid:88) j =1 max b ∈ [ k ] \{ z ∗ j } F (2) j ( z ∗ j , b, z ) (cid:107) θ ∗ z ∗ j − θ ∗ b (cid:107) (cid:104) θ ∗ z ∗ j − θ ∗ b , (Σ ∗ ) − ( θ ∗ z ∗ j − θ ∗ b ) (cid:105) (cid:96) ( z, z ∗ ) (cid:46) n (cid:88) j =1 k (cid:88) b =1 (cid:12)(cid:12)(cid:12)(cid:12) (cid:104) (cid:15) j , (( ˆΣ( z )) − − ( ˆΣ( z ∗ )) − )( θ ∗ z ∗ j − θ ∗ b ) (cid:105) (cid:12)(cid:12)(cid:12)(cid:12) (cid:107) θ ∗ z ∗ j − θ ∗ b (cid:107) (cid:96) ( z, z ∗ ) ≤ k (cid:88) b =1 (cid:88) a ∈ [ k ] \{ b } n (cid:88) j =1 I { z ∗ j = a } (cid:12)(cid:12)(cid:12)(cid:12) (cid:104) (cid:15) j , (( ˆΣ( z )) − − ( ˆΣ( z ∗ )) − )( θ ∗ a − θ ∗ b ) (cid:105) (cid:12)(cid:12)(cid:12)(cid:12) (cid:107) θ ∗ a − θ ∗ b (cid:107) (cid:96) ( z, z ∗ ) ≤ k (cid:88) b =1 (cid:88) a ∈ [ k ] \{ b } (cid:107) (( ˆΣ( z )) − − ( ˆΣ( z ∗ )) − )( θ ∗ a − θ ∗ b ) (cid:107) (cid:107) θ ∗ a − θ ∗ b (cid:107) (cid:96) ( z, z ∗ ) (cid:13)(cid:13)(cid:13)(cid:13) n (cid:88) j =1 I { z ∗ j = a } (cid:15) j (cid:15) Tj (cid:13)(cid:13)(cid:13)(cid:13) (cid:46) k ( τn + 1∆ + d n ∆ ) , (cid:96) ( z, z ∗ ) ≤ τ and kd = O ( √ n ) for the last inequality.From (41) we have max a ∈ [ k ] (cid:107) θ ∗ a − ˆ θ a ( z ∗ ) (cid:107) = o (1) under the assumption kd = O ( √ n ). By the similaranalysis as in F (2) j ( a, b, z ), we have n (cid:88) j =1 max b ∈ [ k ] \{ z ∗ j } F (3) j ( z ∗ j , b, z ) (cid:107) θ ∗ z ∗ j − θ ∗ b (cid:107) (cid:104) θ ∗ z ∗ j − θ ∗ b , (Σ ∗ ) − ( θ ∗ z ∗ j − θ ∗ b ) (cid:105) (cid:96) ( z, z ∗ ) (cid:46) k ( τn + 1∆ + d n ∆ ) . Similarly, we have n (cid:88) j =1 max b ∈ [ k ] \{ z ∗ j } F (1) j ( z ∗ j , b, z ) (cid:107) θ ∗ z ∗ j − θ ∗ b (cid:107) (cid:104) θ ∗ z ∗ j − θ ∗ b , (Σ ∗ ) − ( θ ∗ z ∗ j − θ ∗ b ) (cid:105) (cid:96) ( z, z ∗ ) (cid:46) k (cid:88) b =1 (cid:88) a ∈ [ k ] \{ b } (cid:107) ( ˆΣ( z )) − (ˆ θ a ( z ) − ˆ θ a ( z ∗ )) (cid:107) (cid:107) θ ∗ a − θ ∗ b (cid:107) (cid:96) ( z, z ∗ ) (cid:13)(cid:13)(cid:13)(cid:13) n (cid:88) j =1 I { z ∗ j = a } (cid:15) j (cid:15) Tj (cid:13)(cid:13)(cid:13)(cid:13) (cid:46) k ∆ , where we use (42) and the fact that ( ˆΣ( z )) − has bounded operator norm. 
Combining these termstogether, we obtain (20).Next, for (19), by (41) we have | − (cid:104) θ ∗ a − ˆ θ b ( z ∗ ) , ( ˆΣ( z ∗ )) − ( θ ∗ a − ˆ θ b ( z ∗ )) (cid:105) + (cid:104) θ ∗ a − θ ∗ b , ( ˆΣ( z ∗ )) − ( θ ∗ a − θ ∗ b ) (cid:105)|≤ |(cid:104) θ ∗ b − ˆ θ b ( z ∗ ) , ( ˆΣ( z ∗ )) − ( θ ∗ b − ˆ θ b ( z ∗ )) (cid:105)| + 2 |(cid:104) θ ∗ b − ˆ θ b ( z ∗ ) , ( ˆΣ( z ∗ )) − ( θ ∗ a − θ ∗ b ) (cid:105)| (cid:46) k ( d + log n ) n + (cid:114) k ( d + log n ) n (cid:107) θ ∗ a − θ ∗ b (cid:107) , and (cid:104) θ ∗ a − ˆ θ a ( z ∗ ) , ( ˆΣ( z ∗ )) − ( θ ∗ a − ˆ θ a ( z ∗ )) (cid:105) (cid:46) k ( d + log n ) n . By (22) we have −(cid:104) θ ∗ a − θ ∗ b , ( ˆΣ( z ∗ )) − ( θ ∗ a − θ ∗ b ) (cid:105) + (cid:104) θ ∗ a − θ ∗ b , (Σ ∗ ) − ( θ ∗ a − θ ∗ b ) (cid:105) (cid:46) (cid:114) k ( d + log n ) n (cid:107) θ ∗ a − θ ∗ b (cid:107) . Using the results above we can get (19).Finally we are going to establish (21). Recall the definition of G j ( a, b, z ) which has four terms.For the third and fourth terms, we have − (cid:104) θ ∗ a − ˆ θ b ( z ) , ( ˆΣ( z )) − ( θ ∗ a − ˆ θ b ( z )) (cid:105) + (cid:104) θ ∗ a − ˆ θ b ( z ∗ ) , ( ˆΣ( z )) − ( θ ∗ a − ˆ θ b ( z ∗ )) (cid:105) (cid:46) (cid:107) ˆ θ b ( z ) − ˆ θ b ( z ∗ ) (cid:107) + (cid:107) ˆ θ b ( z ) − ˆ θ b ( z ∗ ) (cid:107)(cid:107) θ ∗ a − θ ∗ b (cid:107) , and − (cid:104) θ ∗ a − ˆ θ b ( z ∗ ) , ( ˆΣ( z )) − ( θ ∗ a − ˆ θ b ( z ∗ )) (cid:105) + (cid:104) θ ∗ a − ˆ θ b ( z ∗ ) , ( ˆΣ( z ∗ )) − ( θ ∗ a − ˆ θ b ( z ∗ )) (cid:105) (cid:46) (cid:107) θ ∗ a − θ ∗ b (cid:107) (cid:107) ( ˆΣ( z )) − − ( ˆΣ( z ∗ )) − (cid:107) .
19e can easily verify that the other two terms are smaller than the above two terms. Then, byusing (42) and (23), we have | G j ( z ∗ j , b, z ) |(cid:104) θ ∗ z ∗ j − θ ∗ b , (Σ ∗ ) − ( θ ∗ z ∗ j − θ ∗ b ) (cid:105) (cid:46) (cid:107) ˆ θ b ( z ) − ˆ θ b ( z ∗ ) (cid:107) + (cid:107) ˆ θ b ( z ) − ˆ θ b ( z ∗ ) (cid:107)(cid:107) θ ∗ z ∗ j − θ ∗ b (cid:107) + (cid:107) θ ∗ z ∗ j − θ ∗ b (cid:107) (cid:107) ( ˆΣ( z )) − − ( ˆΣ( z ∗ )) − (cid:107)(cid:107) θ ∗ z ∗ j − θ ∗ b (cid:107) (cid:46) kτn + k ∆ (cid:114) τn + kd √ τn ∆ . Lemma 5.4.
With the same conditions as Theorem 2.2, for any sequence δ n = o (1) , we have ξ ideal ( δ n ) ≤ n exp (cid:18) − (1 + o (1)) SNR (cid:19) . with probability at least − n − C (cid:48) − exp( − SNR ) .Proof. Under the conditions of Theorem 2.2, the inequalities (33)-(38) hold with probability atleast 1 − n − C (cid:48) . In the remaining proof, we will work on the event these inequalities hold. Recallthe definition of ξ ideal , we can write ξ ideal ( δ ) = n (cid:88) j =1 (cid:88) b ∈ [ k ] \{ z ∗ j } (cid:107) θ ∗ z ∗ j − θ ∗ b (cid:107) I (cid:26) (cid:104) (cid:15) j , ( ˆΣ( z ∗ )) − (ˆ θ z ∗ j ( z ∗ ) − ˆ θ b ( z ∗ )) (cid:105) ≤ − − δ (cid:104) θ ∗ z ∗ j − θ ∗ b , (Σ ∗ ) − ( θ ∗ z ∗ j − θ ∗ b ) (cid:105) (cid:27) ≤ n (cid:88) j =1 (cid:88) b ∈ [ k ] \{ z ∗ j } (cid:107) θ ∗ z ∗ j − θ ∗ b (cid:107) I (cid:26) (cid:104) (cid:15) j , (Σ ∗ ) − ( θ ∗ z ∗ j − θ ∗ b ) (cid:105) ≤ − − δ − ¯ δ (cid:104) θ ∗ z ∗ j − θ ∗ b , (Σ ∗ ) − ( θ ∗ z ∗ j − θ ∗ b ) (cid:105) (cid:27) + n (cid:88) j =1 (cid:88) b ∈ [ k ] \{ z ∗ j } (cid:107) θ ∗ z ∗ j − θ ∗ b (cid:107) I (cid:26) (cid:104) (cid:15) j , (( ˆΣ( z ∗ )) − − (Σ ∗ ) − )( θ ∗ z ∗ j − θ ∗ b ) (cid:105) ≤ − ¯ δ (cid:104) θ ∗ z ∗ j − θ ∗ b , (Σ ∗ ) − ( θ ∗ z ∗ j − θ ∗ b ) (cid:105) (cid:27) + n (cid:88) j =1 (cid:88) b ∈ [ k ] \{ z ∗ j } (cid:107) θ ∗ z ∗ j − θ ∗ b (cid:107) I (cid:26) (cid:104) (cid:15) j , ( ˆΣ( z ∗ )) − (ˆ θ z ∗ j ( z ∗ ) − θ ∗ z ∗ j ) (cid:105) ≤ − ¯ δ (cid:104) θ ∗ z ∗ j − θ ∗ b , (Σ ∗ ) − ( θ ∗ z ∗ j − θ ∗ b ) (cid:105) (cid:27) + n (cid:88) j =1 (cid:88) b ∈ [ k ] \{ z ∗ j } (cid:107) θ ∗ z ∗ j − θ ∗ b (cid:107) I (cid:26) −(cid:104) (cid:15) j , ( ˆΣ( z ∗ )) − (ˆ θ b ( z ∗ ) − θ ∗ b ) (cid:105) ≤ − ¯ δ (cid:104) θ ∗ z ∗ j − θ ∗ b , (Σ ∗ ) − ( θ ∗ z ∗ j − θ ∗ b ) (cid:105) (cid:27) =: M + M + M + M . where ¯ δ = ¯ δ n is some sequence to be chosen later. We bound the four terms respectively. Suppose20 j = (Σ ∗ ) / ω j , where w j iid ∼ N (0 , I d ). By (22), we know M ≤ n (cid:88) j =1 (cid:88) b ∈ [ k ] \{ z ∗ j } (cid:107) θ ∗ z ∗ j − θ ∗ b (cid:107) I (cid:26) ¯ δ λ max (cid:107) θ ∗ z ∗ j − θ ∗ b (cid:107) ≤ λ max (cid:107) w j (cid:107)(cid:107) ( ˆΣ( z ∗ )) − − (Σ ∗ ) − (cid:107)(cid:107) θ ∗ z ∗ j − θ ∗ b (cid:107) (cid:27) ≤ n (cid:88) j =1 (cid:88) b ∈ [ k ] \{ z ∗ j } (cid:107) θ ∗ z ∗ j − θ ∗ b (cid:107) I (cid:26) C ¯ δ (cid:107) θ ∗ z ∗ j − θ ∗ b (cid:107) (cid:114) nd + log n ≤ (cid:107) w j (cid:107) (cid:27) ≤ n (cid:88) j =1 (cid:88) b ∈ [ k ] \{ z ∗ j } (cid:107) θ ∗ z ∗ j − θ ∗ b (cid:107) I (cid:26) C ¯ δ (cid:107) θ ∗ z ∗ j − θ ∗ b (cid:107) nd + log n − d ≤ (cid:107) w j (cid:107) − d (cid:27) , where C is a constant which may differ line by line. Recall that kd = O ( √ n ), min a (cid:54) = b (cid:107) θ ∗ a − θ ∗ b (cid:107) → ∞ ,and ∆ /k → ∞ by assumption. Let n − = o (¯ δ ). Using the χ tail probability in Lemma 7.1, wehave for any a (cid:54) = b ∈ [ k ], E M ≤ n (cid:88) j =1 (cid:88) b ∈ [ k ] \{ z ∗ j } (cid:107) θ ∗ z ∗ j − θ ∗ b (cid:107) exp (cid:16) − C ¯ δ (cid:107) θ ∗ z ∗ j − θ ∗ b (cid:107) √ n (cid:17) ≤ n exp (cid:18) − (1 + o (1)) SNR (cid:19) . We can obtain similar bounds on M and M by using (41). 
For M , the Gaussian tail bound leadsto the inequality P (cid:26) (cid:104) (cid:15) j , (Σ ∗ ) − ( θ ∗ a − θ ∗ b ) (cid:105) ≤ − − δ − ¯ δ (cid:104) θ ∗ a − θ ∗ b , (Σ ∗ ) − ( θ ∗ a − θ ∗ b ) (cid:105) (cid:27) = P (cid:26) (cid:104) w j , (Σ ∗ ) − / ( θ ∗ a − θ ∗ b ) (cid:105) ≤ − − δ − ¯ δ (cid:104) θ ∗ a − θ ∗ b , (Σ ∗ ) − ( θ ∗ a − θ ∗ b ) (cid:105) (cid:27) ≤ exp (cid:18) − (1 − δ − ¯ δ ) (cid:104) θ ∗ a − θ ∗ b , (Σ ∗ ) − ( θ ∗ a − θ ∗ b ) (cid:105) (cid:19) . Thus, E M ≤ n (cid:88) j =1 (cid:88) b ∈ [ k ] \{ z ∗ j } (cid:107) θ ∗ z ∗ j − θ ∗ b (cid:107) exp (cid:18) − (1 − δ − ¯ δ ) (cid:104) θ ∗ a − θ ∗ b , (Σ ∗ ) − ( θ ∗ a − θ ∗ b ) (cid:105) (cid:19) ≤ n exp (cid:18) − (1 + o (1)) SNR (cid:19) . Overall, we have E ξ ideal (cid:46) n exp (cid:16) − (1 + o (1)) SNR (cid:17) . By the Markov’s inequality, we have P ( ξ ideal ( δ n ) ≥ E ξ ideal exp( SNR )) ≤ exp( − SNR ) . In other words, with probability at least 1 − exp( − SNR ), we have ξ ideal ( δ n ) ≤ E ξ ideal ( δ n ) exp( SNR ) ≤ n exp (cid:18) − (1 + o (1)) SNR (cid:19) . roof of Theorem 2.2. By Lemmas 5.2 - 5.4, we have Conditions 5.2.1 - 5.2.5 satisfied with proba-bility at least 1 − η − n − − exp( − SNR ). Then applying Lemma 5.2, we have (cid:96) ( z ( t ) , z ∗ ) ≤ n exp (cid:18) − (1 + o (1)) SNR (cid:19) + 14 (cid:96) ( z ( t − , z ∗ ) , for all t ≥ . By (16) and there exists a constant C such that ∆ ≤ C SNR , we can conclude h ( z ( t ) , z ∗ ) ≤ exp (cid:18) − (1 + o (1)) SNR (cid:19) + 4 − t , for all t ≥ . Notice that h ( z, z ∗ ) takes value in the set { j/n : j ∈ [ n ] ∪ { }} , the term 4 − t in the above inequalityshould be negligible as long as 4 − t = o ( n − ). Thus, we can claim h ( z ( t ) , z ∗ ) ≤ exp (cid:18) − (1 + o (1)) SNR (cid:19) , for all t ≥ log n. Proof of Lemma 3.1.
The Neyman-Pearson lemma tells us the likelihood ratio test φ is the optimalprocedure. Following the proof of Lemma 2.1, we have P H ( φ = 1) + P H ( φ = 0) = P ( (cid:15) ∈ B , ) + P ( (cid:15) ∈ B , ) ≥ exp (cid:18) − o (1)8 SNR (cid:48) , (cid:19) + exp (cid:18) − o (1)8 SNR (cid:48) , (cid:19) , where the last inequality is by Lemma 7.11. Proof of Theorem 3.1.
The proof is identical to the proof of Theorem 2.1 and is omitted here.
We adopt a similar proof idea as in Section 5.2. We first present an error decomposition for one-stepanalysis for Algorithm 2. In Lemma 6.1, we show the loss decays after a one-step iteration underConditions 6.2.1 - 6.2.6. Then in Lemma 6.2 we extend the result to multiple iterations, under twoextra Conditions 6.2.7 - 6.2.8. Last we show all the conditions are satisfied with high probabilityand thus prove Theorem 3.2.In the statement of Theorem 3.2, we assume max a,b ∈ [ k ] λ d (Σ ∗ a ) /λ (Σ ∗ b ) = O (1) for the covariancematrix { Σ ∗ a } a ∈ [ k ] . Without loss of generality, we can replace it by assuming Σ ∗ satisfies λ min ≤ min a ∈ [ k ] λ (Σ ∗ a ) ≤ max a ∈ [ k ] λ d (Σ ∗ a ) ≤ λ max (24)where λ min , λ max > rror Decomposition for the One-step Analysis: Consider an arbitrary z ∈ [ k ] n . Apply(12), (13), and (14) on z to obtain { ˆ θ a ( z ) } a ∈ [ k ] , { ˆΣ a ( z ) } a ∈ [ k ] , and ˆ z ( z ):ˆ θ a ( z ) = (cid:80) j ∈ [ n ] Y j I { z j = a } (cid:80) j ∈ [ n ] I { z j = a } , ˆΣ a ( z ) = (cid:80) j ∈ [ n ] ( Y j − ˆ θ a ( z ))( Y j − ˆ θ a ( z )) T I { z j = a } (cid:80) j ∈ [ n ] I { z j = a } , ˆ z j ( z ) = argmin a ∈ [ k ] ( Y j − ˆ θ a ( z )) T ( ˆΣ a ( a )) − ( Y j − ˆ θ a ( z )) + log | ˆΣ a ( z ) | , j ∈ [ n ] . For simplicity we denote ˆ z short for ˆ z ( z ). Let j ∈ [ n ] be an arbitrary index with z ∗ j = a . Ac-cording to (9), z ∗ j will be incorrectly estimated after on iteration in ˆ z if a (cid:54) = argmin a ∈ [ k ] ( Y j − ˆ θ a ( z )) T ( ˆΣ a ( a )) − ( Y j − ˆ θ a ( z )) + log | ˆΣ a ( z ) | , . That is, it is important to analyze the event (cid:104) Y j − ˆ θ b ( z ) , ( ˆΣ b ( z )) − ( Y j − ˆ θ b ( z )) (cid:105) + log | ˆΣ b ( z ) | ≤ (cid:104) Y j − ˆ θ a ( z ) , ( ˆΣ a ( z )) − ( Y j − ˆ θ a ( z )) (cid:105) + log | ˆΣ a ( z ) | , (25)for any b ∈ [ k ] \ { a } . After some rearrangements, we can see (25) is equivalent to, (cid:104) (cid:15) j , ( ˆΣ b ( z ∗ )) − ( θ ∗ a − ˆ θ b ( z ∗ )) (cid:105) − (cid:104) (cid:15) j , ( ˆΣ a ( z ∗ )) − ( θ ∗ a − ˆ θ a ( z ∗ )) (cid:105) + 12 (cid:104) (cid:15) j , (( ˆΣ b ( z ∗ )) − − ( ˆΣ a ( z ∗ )) − ) (cid:15) j (cid:105) −
12 log | Σ ∗ a | + 12 log | Σ ∗ b |≤ − (cid:104) θ ∗ a − θ ∗ b , (Σ ∗ b ) − ( θ ∗ a − θ ∗ b ) (cid:105) + F j ( a, b, z ) + Q j ( a, b, z ) + G j ( a, b, z ) + H j ( a, b, z ) + K j ( a, b, z ) + L j ( a, b, z ) , where F j ( a, b, z ) = (cid:104) (cid:15) j , ( ˆΣ b ( z )) − (ˆ θ b ( z ) − ˆ θ b ( z ∗ )) (cid:105) − (cid:104) (cid:15) j , ( ˆΣ a ( z )) − (ˆ θ a ( z ) − ˆ θ a ( z ∗ )) (cid:105)− (cid:104) (cid:15) j , (( ˆΣ b ( z )) − − ( ˆΣ b ( z ∗ )) − )( θ ∗ a − ˆ θ b ( z ∗ )) (cid:105) + (cid:104) (cid:15) j , (( ˆΣ a ( z )) − − ( ˆΣ a ( z ∗ )) − )( θ ∗ a − ˆ θ a ( z ∗ )) (cid:105) ,Q j ( a, b, z ) = − (cid:104) (cid:15) j , (( ˆΣ b ( z )) − − ( ˆΣ b ( z ∗ )) − ) (cid:15) j (cid:105) + 12 (cid:104) (cid:15) j , (( ˆΣ a ( z )) − − ( ˆΣ a ( z ∗ )) − ) (cid:15) j (cid:105) ,G j ( a, b, z ) = 12 (cid:104) θ ∗ a − ˆ θ a ( z ) , ( ˆΣ a ( z )) − ( θ ∗ a − ˆ θ a ( z )) (cid:105) − (cid:104) θ ∗ a − ˆ θ a ( z ∗ ) , ( ˆΣ a ( z )) − ( θ ∗ a − ˆ θ a ( z ∗ )) (cid:105) + 12 (cid:104) θ ∗ a − ˆ θ a ( z ∗ ) , ( ˆΣ a ( z )) − ( θ ∗ a − ˆ θ a ( z ∗ )) (cid:105) − (cid:104) θ ∗ a − ˆ θ a ( z ∗ ) , ( ˆΣ a ( z ∗ )) − ( θ ∗ a − ˆ θ a ( z ∗ )) (cid:105)− (cid:104) θ ∗ a − ˆ θ b ( z ) , ( ˆΣ b ( z )) − ( θ ∗ a − ˆ θ b ( z )) (cid:105) + 12 (cid:104) θ ∗ a − ˆ θ b ( z ∗ ) , ( ˆΣ b ( z )) − ( θ ∗ a − ˆ θ b ( z ∗ )) (cid:105)− (cid:104) θ ∗ a − ˆ θ b ( z ∗ ) , ( ˆΣ b ( z )) − ( θ ∗ a − ˆ θ b ( z ∗ )) (cid:105) + 12 (cid:104) θ ∗ a − ˆ θ b ( z ∗ ) , ( ˆΣ b ( z ∗ )) − ( θ ∗ a − ˆ θ b ( z ∗ )) (cid:105) ,H j ( a, b, z ) = − (cid:104) θ ∗ a − ˆ θ b ( z ∗ ) , ( ˆΣ b ( z ∗ )) − ( θ ∗ a − ˆ θ b ( z ∗ )) (cid:105) + 12 (cid:104) θ ∗ a − θ ∗ b , ( ˆΣ b ( z ∗ )) − ( θ ∗ a − θ ∗ b ) (cid:105)− (cid:104) θ ∗ a − θ ∗ b , ( ˆΣ b ( z ∗ )) − ( θ ∗ a − θ ∗ b ) (cid:105) + 12 (cid:104) θ ∗ a − θ ∗ b , (Σ ∗ b ) − ( θ ∗ a − θ ∗ b ) (cid:105) + 12 (cid:104) θ ∗ a − ˆ θ a ( z ∗ ) , ( ˆΣ a ( z ∗ )) − ( θ ∗ a − ˆ θ a ( z ∗ )) (cid:105) , j ( a, b, z ) := 12 (log | ˆΣ a ( z ) | − log | ˆΣ a ( z ∗ ) | ) −
12 (log | ˆΣ b ( z ) | − log | ˆΣ b ( z ∗ ) | ) , and L j ( a, b, z ) := 12 (log | ˆΣ a ( z ∗ ) | − log | Σ ∗ a | ) −
12 (log | ˆΣ b ( z ∗ ) | − log | Σ ∗ b | ) . Among these terms, F j , G j , H j are nearly identical to their counterparts in Section 5.2 with ˆΣ( z )replaced by ˆΣ a ( z ) or ˆΣ b ( z ). There are three extra terms not appearing in Section 5.2: Q j is aquadratic term of (cid:15) j and K j , L j are terms involving matrix determinants. Conditions and Guarantees for One-step Analysis.
To establish the guarantee for the one-step analysis, we first give several conditions on the error terms.
Condition 6.2.1.
Assume that max { z : (cid:96) ( z,z ∗ ) ≤ τ } max j ∈ [ n ] max b ∈ [ k ] \{ z ∗ j } | H j ( z ∗ j , b, z ) |(cid:104) θ ∗ z ∗ j − θ ∗ b , (Σ ∗ b ) − ( θ ∗ z ∗ j − θ ∗ b ) (cid:105) ≤ δ holds with probability with at least − η for some τ, δ, η > . Condition 6.2.2.
Assume that max { z : (cid:96) ( z,z ∗ ) ≤ τ } n (cid:88) j =1 max b ∈ [ k ] \{ z ∗ j } F j ( z ∗ j , b, z ) (cid:107) θ ∗ z ∗ j − θ ∗ b (cid:107) (cid:104) θ ∗ z ∗ j − θ ∗ b , (Σ ∗ b ) − ( θ ∗ z ∗ j − θ ∗ b ) (cid:105) (cid:96) ( z, z ∗ ) ≤ δ holds with probability with at least − η for some τ, δ, η > . Condition 6.2.3.
Assume that max { z : (cid:96) ( z,z ∗ ) ≤ τ } max j ∈ [ n ] max b ∈ [ k ] \{ z ∗ j } | G j ( z ∗ j , b, z ) |(cid:104) θ ∗ z ∗ j − θ ∗ b , (Σ ∗ b ) − ( θ ∗ z ∗ j − θ ∗ b ) (cid:105) ≤ δ holds with probability with at least − η for some τ, δ, η > . Condition 6.2.4.
Assume that max { z : (cid:96) ( z,z ∗ ) ≤ τ } n (cid:88) j =1 max b ∈ [ k ] \{ z ∗ j } Q j ( z ∗ j , b, z ) (cid:107) θ ∗ z ∗ j − θ ∗ b (cid:107) (cid:104) θ ∗ z ∗ j − θ ∗ b , (Σ ∗ b ) − ( θ ∗ z ∗ j − θ ∗ b ) (cid:105) (cid:96) ( z, z ∗ ) ≤ δ holds with probability with at least − η for some τ, δ, η > . Condition 6.2.5.
Assume that max { z : (cid:96) ( z,z ∗ ) ≤ τ } n (cid:88) j =1 max b ∈ [ k ] \{ z ∗ j } K j ( z ∗ j , b, z ) (cid:107) θ ∗ z ∗ j − θ ∗ b (cid:107) (cid:104) θ ∗ z ∗ j − θ ∗ b , (Σ ∗ b ) − ( θ ∗ z ∗ j − θ ∗ b ) (cid:105) (cid:96) ( z, z ∗ ) ≤ δ holds with probability with at least − η for some τ, δ, η > . Condition 6.2.6.
Assume that max { z : (cid:96) ( z,z ∗ ) ≤ τ } max j ∈ [ n ] max b ∈ [ k ] \{ z ∗ j } | L j ( z ∗ j , b, z ) |(cid:104) θ ∗ z ∗ j − θ ∗ b , (Σ ∗ b ) − ( θ ∗ z ∗ j − θ ∗ b ) (cid:105) ≤ δ holds with probability with at least − η for some τ, δ, η > .
24e next define a quantity that refers to as the ideal error, ξ ideal ( δ ) = p (cid:88) j =1 (cid:88) b ∈ [ k ] \{ z ∗ j } (cid:107) θ ∗ z ∗ j − θ ∗ b (cid:107) I {(cid:104) (cid:15) j , ( ˆΣ b ( z ∗ )) − ( θ ∗ a − ˆ θ b ( z ∗ )) (cid:105) − (cid:104) (cid:15) j , ( ˆΣ a ( z ∗ )) − ( θ ∗ a − ˆ θ a ( z ∗ )) (cid:105) + 12 (cid:104) (cid:15) j , (( ˆΣ b ( z ∗ )) − − ( ˆΣ a ( z ∗ )) − ) (cid:15) j (cid:105) −
12 log | Σ ∗ a | + 12 log | Σ ∗ b | ≤ − − δ (cid:104) θ ∗ a − θ ∗ b , (Σ ∗ b ) − ( θ ∗ a − θ ∗ b ) (cid:105)} . Lemma 6.1.
Assumes Conditions 6.2.1 - 6.2.6 hold for some τ, δ, η , . . . , η > . We then have P (cid:18) (cid:96) (ˆ z, z ∗ ) ≤ ξ ideal ( δ ) + 14 (cid:96) ( z, z ∗ ) for any z ∈ [ k ] n such that (cid:96) ( z, z ∗ ) ≤ τ (cid:19) ≥ − η, where η = (cid:80) i =1 η i .Proof. The proof of this lemma is quite similar to the proof of Lemma 5.1. The additional terms Q j and K j can be dealt with the same way as F j while L j can be dealt with the same way as H j .We omit the details here. Conditions and Guarantees for Multiple Iterations.
In the above we establish a statisticalguarantee for the one-step analysis. Now we will extend the result to multiple iterations. Thatis, starting from some initialization z (0) , we will characterize how the losses (cid:96) ( z (0) , z ∗ ), (cid:96) ( z (1) , z ∗ ), (cid:96) ( z (2) , z ∗ ), . . . , decay. We impose a condition on ξ ideal ( δ ) and a condition for z (0) . Condition 6.2.7.
Assume that ξ ideal ( δ ) ≤ τ holds with probability with at least − η for some τ, δ, η > . Finally, we need a condition on the initialization.
Condition 6.2.8.
Assume that (cid:96) ( z (0) , z ∗ ) ≤ τ holds with probability with at least − η for some τ, η > . With these conditions satisfied, we can give a lemma that shows the linear convergence guaranteefor our algorithm.
Lemma 6.2.
Assumes Conditions 6.2.1 - 6.2.8 hold for some τ, δ, η , · · · , η > . We then have (cid:96) ( z ( t ) , z ∗ ) ≤ ξ ideal ( δ ) + 12 (cid:96) ( z ( t − , z ∗ ) for all t ≥ , with probability at least − η , where η = (cid:80) i =1 η i .Proof. The proof of this lemma is the same as the proof of Lemma 5.2.25 ith-high-probability Results for the Conditions and The Proof of The Main Theo-rem.
Recall the definition of ∆ in (3). Lemma 6.3 shows
SNR (cid:48) is in the same order with ∆, whichwill play a similar role as (18) in Section 5.2. It immediately implies the assumption
SNR (cid:48) → ∞ in the statement of Theorem 3.2 is equivalently ∆ → ∞ . The proof of Lemma 6.3 is deferred toSection 7.
Lemma 6.3.
Assume
SNR (cid:48) → ∞ and d = O (1) . Further assume there exist constants λ min , λ max > such that λ min ≤ λ (Σ ∗ a ) ≤ λ d (Σ ∗ a ) ≤ λ max for any a ∈ [ k ] . Then, there exist constants C , C > only depending on λ min , λ max , d such that C (cid:107) θ ∗ a − θ ∗ b (cid:107) ≤ SNR (cid:48) a,b ≤ C (cid:107) θ ∗ a − θ ∗ b (cid:107) , for any a (cid:54) = b ∈ [ k ] . As a result, SNR (cid:48) is in the same order of ∆ . Lemma 6.4 and Lemma 6.5 are counterparts of Lemmas 5.3 and 5.4 in Section 5.2.
Lemma 6.4.
Under the same conditions as in Theorem 3.2, for any constant C (cid:48) > , there existssome constant C > only depending on α, C (cid:48) , λ min , λ max such that max { z : (cid:96) ( z,z ∗ ) ≤ τ } max j ∈ [ n ] max b ∈ [ k ] \{ z ∗ j } | H j ( z ∗ j , b, z ) |(cid:104) θ ∗ z ∗ j − θ ∗ b , (Σ ∗ b ) − ( θ ∗ z ∗ j − θ ∗ b ) (cid:105) ≤ C (cid:114) k ( d + log n ) n (26)max { z : (cid:96) ( z,z ∗ ) ≤ τ } n (cid:88) j =1 max b ∈ [ k ] \{ z ∗ j } F j ( z ∗ j , b, z ) (cid:107) θ ∗ z ∗ j − θ ∗ b (cid:107) (cid:104) θ ∗ z ∗ j − θ ∗ b , (Σ ∗ b ) − ( θ ∗ z ∗ j − θ ∗ b ) (cid:105) (cid:96) ( z, z ∗ ) ≤ Ck (cid:18) τn + 1∆ + d n ∆ (cid:19) (27)max { z : (cid:96) ( z,z ∗ ) ≤ τ } max j ∈ [ n ] max b ∈ [ k ] \{ z ∗ j } | G j ( z ∗ j , b, z ) |(cid:104) θ ∗ z ∗ j − θ ∗ b , (Σ ∗ b ) − ( θ ∗ z ∗ j − θ ∗ b ) (cid:105) ≤ Ck (cid:18) τn + 1∆ (cid:114) τn + d √ τn ∆ (cid:19) (28)max { z : (cid:96) ( z,z ∗ ) ≤ τ } n (cid:88) j =1 max b ∈ [ k ] \{ z ∗ j } Q j ( z ∗ j , b, z ) (cid:107) θ ∗ z ∗ j − θ ∗ b (cid:107) (cid:104) θ ∗ z ∗ j − θ ∗ b , (Σ ∗ b ) − ( θ ∗ z ∗ j − θ ∗ b ) (cid:105) (cid:96) ( z, z ∗ ) ≤ C k d ∆ (cid:18) τn + 1∆ + d n ∆ (cid:19) (29)max { z : (cid:96) ( z,z ∗ ) ≤ τ } n (cid:88) j =1 max b ∈ [ k ] \{ z ∗ j } K j ( z ∗ j , b, z ) (cid:107) θ ∗ z ∗ j − θ ∗ b (cid:107) (cid:104) θ ∗ z ∗ j − θ ∗ b , (Σ ∗ b ) − ( θ ∗ z ∗ j − θ ∗ b ) (cid:105) (cid:96) ( z, z ∗ ) ≤ C k d ∆ (cid:18) τn + 1∆ + d n ∆ (cid:19) (30)max { z : (cid:96) ( z,z ∗ ) ≤ τ } max j ∈ [ n ] max b ∈ [ k ] \{ z ∗ j } | L j ( z ∗ j , b, z ) |(cid:104) θ ∗ z ∗ j − θ ∗ b , (Σ ∗ b ) − ( θ ∗ z ∗ j − θ ∗ b ) (cid:105) ≤ C d ∆ (cid:114) k ( d + log n ) n (31) with probability at least − n − C (cid:48) − nd .Proof. Under the conditions of Theorem 3.2, the inequalities (33)-(38) hold with probability atleast 1 − n − C (cid:48) . In the remaining proof, we will work on the event these inequalities hold. Hence, wecan use the results from Lemma 7.7 and 7.8. Using the same arguments as in the proof of Lemma5.3, we can get (26), (27) and (28).As for (29), we first use Lemma 7.9 to have (cid:80) nj =1 (cid:107) (cid:15) j (cid:107) ≤ nd with probability at least 1 − / ( nd ).Then, we have n (cid:88) j =1 max b ∈ [ k ] \{ z ∗ j } Q j ( z ∗ j , b, z ) (cid:107) θ ∗ z ∗ j − θ ∗ b (cid:107) (cid:104) θ ∗ z ∗ j − θ ∗ b , (Σ ∗ b ) − ( θ ∗ z ∗ j − θ ∗ b ) (cid:105) (cid:96) ( z, z ∗ ) (cid:46) n (cid:88) j =1 k (cid:88) b =1 Q j ( z ∗ j , b, z ) ∆ (cid:96) ( z, z ∗ ) ≤ k n (cid:88) j =1 (cid:107) (cid:15) j (cid:107) max a ∈ [ k ] (cid:107) ( ˆΣ a ( z )) − − ( ˆΣ a ( z ∗ )) − (cid:107) ∆ (cid:96) ( z, z ∗ ) (cid:46) k d ∆ (cid:18) τn + 1∆ + d n ∆ (cid:19) , (cid:96) ( z, z ∗ ) ≤ τ .Next for (30), notice that by (43), (44), and SNR (cid:48) → ∞ , we have for any 1 ≤ i ≤ d , λ min ≤ λ i ( ˆΣ a ( z ∗ )) ≤ λ max and (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) log(1 + max a ∈ [ k ] (cid:107) ˆΣ a ( z ) − ˆΣ a ( z ∗ ) (cid:107) λ i ( ˆΣ a ( z ∗ )) ) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ≤ (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) log(1 − max a ∈ [ k ] (cid:107) ˆΣ a ( z ) − ˆΣ a ( z ∗ ) (cid:107) λ i ( ˆΣ a ( z ∗ )) ) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) . 
Thus by Lemma 7.6, we knowmax a ∈ [ k ] (cid:12)(cid:12)(cid:12) log | ˆΣ a ( z ) | − log | ˆΣ a ( z ∗ ) | (cid:12)(cid:12)(cid:12) = max a ∈ [ k ] (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) log | ˆΣ a ( z ) || ˆΣ a ( z ∗ ) | (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ≤ (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) d (cid:88) i =1 log(1 − max a ∈ [ k ] (cid:107) ˆΣ a ( z ) − ˆΣ a ( z ∗ ) (cid:107) λ i ( ˆΣ a ( z ∗ )) ) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ≤ d (cid:88) i =1 log a ∈ [ k ] (cid:107) ˆΣ a ( z ) − ˆΣ a ( z ∗ ) (cid:107) λ i ( ˆΣ a ( z ∗ )) + max a ∈ [ k ] (cid:107) ˆΣ a ( z ) − ˆΣ a ( z ∗ ) (cid:107) λ i (ˆΣ a ( z ∗ )) − max a ∈ [ k ] (cid:107) ˆΣ a ( z ) − ˆΣ a ( z ∗ ) (cid:107) λ i (ˆΣ a ( z ∗ )) (cid:46) d (cid:107) ˆΣ a ( z ) − ˆΣ a ( z ∗ ) (cid:107) , (32)where the last inequality is due to the fact that λ i ( ˆΣ a ( z ∗ )) is at the constant rate, (cid:107) ˆΣ a ( z ) − ˆΣ a ( z ∗ ) (cid:107) = o (1) and the inequality log(1 + x ) ≤ x for any x >
0. (32) yields to the inequality n (cid:88) j =1 max b ∈ [ k ] \{ z ∗ j } K j ( z ∗ j , b, z ) (cid:107) θ ∗ z ∗ j − θ ∗ b (cid:107) (cid:104) θ ∗ z ∗ j − θ ∗ b , (Σ ∗ b ) − ( θ ∗ z ∗ j − θ ∗ b ) (cid:105) (cid:96) ( z, z ∗ ) (cid:46) n (cid:88) j =1 d max a ∈ [ k ] (cid:107) ˆΣ a ( z ) − ˆΣ a ( z ∗ ) (cid:107) ∆ (cid:96) ( z, z ∗ ) (cid:46) k d ∆ (cid:18) τn + 1∆ + d n ∆ (cid:19) . Finally for (31), by (43) and the similar argument as (32), we can getmax a ∈ [ k ] (cid:12)(cid:12)(cid:12) log | ˆΣ a ( z ∗ ) | − log | Σ ∗ a | (cid:12)(cid:12)(cid:12) (cid:46) d (cid:114) k ( d + log n ) n which implies (31). We complete the proof. Lemma 6.5.
With the same conditions as Theorem 3.2, for any sequence δ n = o (1) , we have ξ ideal ( δ n ) ≤ n exp (cid:18) − (1 + o (1)) SNR (cid:48) (cid:19) . with probability at least − n − C (cid:48) − exp( − SNR (cid:48) ) .Proof. Under the conditions of Theorem 3.2, the inequalities (33)-(38) hold with probability atleast 1 − n − C (cid:48) . In the remaining proof, we will work on the event these inequalities hold. Similarto the proof of Lemma 5.4, we have a decomposition ξ ideal ≤ (cid:80) i =1 M i where M := n (cid:88) j =1 (cid:88) b ∈ [ k ] \{ z ∗ j } (cid:13)(cid:13)(cid:13) θ ∗ z ∗ j − θ ∗ b (cid:13)(cid:13)(cid:13) I (cid:26) (cid:104) (cid:15) j , (Σ ∗ b ) − ( θ ∗ z ∗ j − θ ∗ b ) (cid:105) + 12 (cid:104) (cid:15) j , ((Σ ∗ b ) − − (Σ ∗ z ∗ j ) − ) (cid:15) j (cid:105)−
12 log | Σ ∗ z ∗ j | + 12 log | Σ ∗ b | ≤ − − δ − ¯ δ (cid:104) θ ∗ z ∗ j − θ ∗ b , (Σ ∗ b ) − ( θ ∗ z ∗ j − θ ∗ b ) (cid:105) (cid:27)
27s the main term and M := n (cid:88) j =1 (cid:88) b ∈ [ k ] \{ z ∗ j } (cid:13)(cid:13)(cid:13) θ ∗ z ∗ j − θ ∗ b (cid:13)(cid:13)(cid:13) I (cid:26) (cid:104) (cid:15) j , (( ˆΣ b ( z ∗ )) − − (Σ ∗ b ) − )( θ ∗ z ∗ j − θ ∗ b ) (cid:105) ≤ − ¯ δ (cid:104) θ ∗ z ∗ j − θ ∗ b , (Σ ∗ b ) − ( θ ∗ z ∗ j − θ ∗ b ) (cid:105) (cid:27) M := n (cid:88) j =1 (cid:88) b ∈ [ k ] \{ z ∗ j } (cid:13)(cid:13)(cid:13) θ ∗ z ∗ j − θ ∗ b (cid:13)(cid:13)(cid:13) I (cid:26) −(cid:104) (cid:15) j , ( ˆΣ z ∗ j ( z ∗ )) − ( θ ∗ z ∗ j − ˆ θ z ∗ j ( z ∗ )) (cid:105) ≤ − ¯ δ (cid:104) θ ∗ z ∗ j − θ ∗ b , (Σ ∗ b ) − ( θ ∗ z ∗ j − θ ∗ b ) (cid:105) (cid:27) M := n (cid:88) j =1 (cid:88) b ∈ [ k ] \{ z ∗ j } (cid:13)(cid:13)(cid:13) θ ∗ z ∗ j − θ ∗ b (cid:13)(cid:13)(cid:13) I (cid:26) −(cid:104) (cid:15) j , ( ˆΣ b ( z ∗ )) − (ˆ θ b ( z ∗ ) − θ ∗ b ) (cid:105) ≤ − ¯ δ (cid:104) θ ∗ z ∗ j − θ ∗ b , (Σ ∗ b ) − ( θ ∗ z ∗ j − θ ∗ b ) (cid:105) (cid:27) M := n (cid:88) j =1 (cid:88) b ∈ [ k ] \{ z ∗ j } (cid:13)(cid:13)(cid:13) θ ∗ z ∗ j − θ ∗ b (cid:13)(cid:13)(cid:13) I (cid:26) (cid:104) (cid:15) j , (( ˆΣ b ( z ∗ )) − − (Σ ∗ b ) − ) (cid:15) j (cid:105) ≤ − ¯ δ (cid:104) θ ∗ z ∗ j − θ ∗ b , (Σ ∗ b ) − ( θ ∗ z ∗ j − θ ∗ b ) (cid:105) (cid:27) M := n (cid:88) j =1 (cid:88) b ∈ [ k ] \{ z ∗ j } (cid:13)(cid:13)(cid:13) θ ∗ z ∗ j − θ ∗ b (cid:13)(cid:13)(cid:13) I (cid:26) − (cid:104) (cid:15) j , (( ˆΣ z ∗ j ( z ∗ )) − − (Σ ∗ z ∗ j ) − ) (cid:15) j (cid:105) ≤ − ¯ δ (cid:104) θ ∗ z ∗ j − θ ∗ b , (Σ ∗ b ) − ( θ ∗ z ∗ j − θ ∗ b ) (cid:105) (cid:27) . Using the same arguments as the proof of Lemma 5.4, we can choose some ¯ δ = ¯ δ n = o (1) which isslowly diverging to zero satisfying E M i ≤ n exp (cid:32) − (1 + o (1)) SNR (cid:48) (cid:33) for i = 2 , , . As for M , by (43) we have M ≤ n (cid:88) j =1 (cid:88) b ∈ [ k ] \{ z ∗ j } (cid:13)(cid:13)(cid:13) θ ∗ z ∗ j − θ ∗ b (cid:13)(cid:13)(cid:13) I (cid:40) C ¯ δ (cid:13)(cid:13)(cid:13) θ ∗ z ∗ j − θ ∗ b (cid:13)(cid:13)(cid:13) ≤ (cid:107) w j (cid:107) (cid:114) log nn (cid:41) , where C is a constant and w j iid ∼ N (0 , I d ). Since there exists some constant C (cid:48) such that SNR (cid:48) ≤ C (cid:48) ∆, we can choose appropriate ¯ δ = o (1) such that E M ≤ n (cid:88) j =1 (cid:88) b ∈ [ k ] \{ z ∗ j } (cid:13)(cid:13)(cid:13) θ ∗ z ∗ j − θ ∗ b (cid:13)(cid:13)(cid:13) P (cid:26) C ¯ δ (cid:13)(cid:13)(cid:13) θ ∗ z ∗ j − θ ∗ b (cid:13)(cid:13)(cid:13) (cid:114) n log n ≤ (cid:107) w j (cid:107) (cid:27) ≤ n exp (cid:32) − (1 + o (1)) SNR (cid:48) (cid:33) . is essentially the same with M . Finally for M , using Lemma 7.10, we have P (cid:18) (cid:104) (cid:15) j , (Σ ∗ b ) − ( θ ∗ z ∗ j − θ ∗ b ) (cid:105) + 12 (cid:104) (cid:15) j , ((Σ ∗ b ) − − (Σ ∗ z ∗ j ) − ) (cid:15) j (cid:105)−
12 log | Σ ∗ z ∗ j | + 12 log | Σ ∗ b | ≤ − − δ − ¯ δ (cid:104) θ ∗ z ∗ j − θ ∗ b , (Σ ∗ b ) − ( θ ∗ z ∗ j − θ ∗ b ) (cid:105) (cid:19) = P (cid:18) (cid:104) w j , (Σ ∗ z ∗ j ) (Σ ∗ b ) − ( θ ∗ z ∗ j − θ ∗ b ) (cid:105) + 12 (cid:104) w j , ((Σ ∗ z ∗ j ) (Σ ∗ b ) − (Σ ∗ z ∗ j ) − I d ) w j (cid:105)−
12 log | Σ ∗ z ∗ j | + 12 log | Σ ∗ b | ≤ − − δ − ¯ δ (cid:104) θ ∗ z ∗ j − θ ∗ b , (Σ ∗ b ) − ( θ ∗ z ∗ j − θ ∗ b ) (cid:105) (cid:19) ≤ exp (cid:32) − (1 − o (1)) SNR (cid:48) z ∗ j ,b (cid:33) . Then we have E M ≤ n exp (cid:32) − (1 + o (1)) SNR (cid:48) (cid:33) . Using the Markov’s inequality we complete the proof of Lemma 6.5.
Proof of Theorem 3.2.
By Lemmas 6.2-6.5, we can obtain the result by arguments used in the proofof Theorem 2.2 and hence is omitted here.
Here are the technical lemmas.
Lemma 7.1.
For any x > , we have P ( χ d ≥ d + 2 √ dx + 2 x ) ≤ e − x , P ( χ d ≤ d − √ dx ) ≤ e − x . Proof.
These results are Lemma 1 of [12].
Lemma 7.2.
For any z ∗ ∈ [ k ] n and k ∈ [ n ] , consider independent vectors (cid:15) j ∼ N (0 , Σ ∗ z ∗ j ) for any j ∈ [ n ] . Assume there exists a constant λ max > such that (cid:107) Σ ∗ a (cid:107) ≤ λ max for any a ∈ [ k ] . Then, forany constant C (cid:48) > , there exists some constant C > only depending on C (cid:48) , λ max such that max a ∈ [ k ] (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) (cid:80) nj =1 I { z ∗ j = a } (cid:15) j (cid:113)(cid:80) nj =1 I { z ∗ j = a } (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) ≤ C (cid:112) d + log n, (33)max a ∈ [ k ] d + (cid:80) nj =1 I { z ∗ j = a } (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) n (cid:88) j =1 I { z ∗ j = a } (cid:15) j (cid:15) Tj (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) ≤ C, (34)max T ⊂ [ n ] (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) (cid:112) | T | (cid:88) j ∈ T (cid:15) j (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) ≤ C √ d + n, (35)max a ∈ [ k ] max T ⊂{ j : z ∗ j = a } (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) (cid:113) | T | ( d + (cid:80) nj =1 I { z ∗ j = a } ) (cid:88) j ∈ T (cid:15) j (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) ≤ C, (36)29 ith probability at least − n − C (cid:48) . We have used the convention that / .Proof. Note that (cid:15) j is sub-Gaussian with parameter λ max which is a constant. The inequalities(33) and (35) are respectively Lemmas A.4, A.1 in [15]. The inequality (34) is a slight extension ofLemma A.2 in [15]. This extension can be done by a standard union bound argument. The proofof (36) is identical to that of (35). Lemma 7.3.
Consider the same assumptions as in Lemma 7.2. Assume additionally min a ∈ k (cid:80) nj =1 I { z ∗ j = a } ≥ αnk for some constant α > and k ( d +log n ) n = o (1) . Then, for any constant C (cid:48) > , there existssome constant C > only depending on α, C (cid:48) , λ max such that max a ∈ [ k ] (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) (cid:80) nj =1 I { z ∗ j = a } n (cid:88) j =1 I { z ∗ j = a } (cid:15) j (cid:15) Tj − Σ ∗ a (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) ≤ C (cid:114) k ( d + log n ) n , (37) with probability at least − n − C (cid:48) .Proof. Note that we have (cid:15) j = Σ ∗ z ∗ j η j where η j iid ∼ N (0 , I d ) for any j ∈ [ n ]. Since max a (cid:107) Σ ∗ a (cid:107) ≤ λ max ,we havemax a ∈ [ k ] (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) (cid:80) nj =1 I { z ∗ j = a } n (cid:88) j =1 I { z ∗ j = a } (cid:15) j (cid:15) Tj − Σ ∗ a (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) ≤ λ max max a ∈ [ k ] (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) (cid:80) nj =1 I { z ∗ j = a } n (cid:88) j =1 I { z ∗ j = a } η j η Tj − I d (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) . Define Q a = 1 (cid:80) nj =1 I { z ∗ j = a } n (cid:88) j =1 I { z ∗ j = a } η j η Tj − I d . Take S d − = { y ∈ R d : (cid:107) y (cid:107) = 1 } and N (cid:15) = { v , · · · , v | N (cid:15) | } is an (cid:15) -covering of S d − . In particular,we pick (cid:15) < , then | N (cid:15) | ≤ d . By the definition of the (cid:15) -covering, we have (cid:107) Q a (cid:107) ≤ − (cid:15) max i =1 , ··· , | N (cid:15) | | v Ti Q a v i | ≤ i =1 , ··· , | N (cid:15) | | v Ti Q a v i | . For any v ∈ N (cid:15) , v T Q a v = 1 (cid:80) nj =1 I { z ∗ j = a } n (cid:88) j =1 I { z ∗ j = a } ( v T η j η Tj v − . Denote n a = (cid:80) nj =1 I { z ∗ j = a } . Then (cid:80) nj =1 I { z ∗ j = a } v T η j η Tj v ∼ χ n a . Using Lemma 7.1, we have P (max a ∈ [ k ] (cid:107) Q a (cid:107) ≥ t ) ≤ k (cid:88) a =1 P ( (cid:107) Q a (cid:107) ≥ t ) ≤ k (cid:88) a =1 | N (cid:15) | (cid:88) i =1 P ( | v Ti Q a v i | ≥ t/ ≤ k (cid:88) a =1 (cid:40) − n a { t, t } + d log 9 (cid:41) . Since k ( d +log n ) n = o (1) and n a ≥ αn/k where α is a constant, we can take t = C (cid:48)(cid:48) (cid:113) k ( d +log n ) n forsome large constant C (cid:48)(cid:48) and the proof is complete.30 emma 7.4. Consider the same assumptions as in Lemma 7.2. Then, for any s = o ( n ) and forany constant C (cid:48) > , there exists some constant C > only depending on C (cid:48) , λ max such that max T ⊂ [ n ]: | T |≤ s | T | log n | T | + min { , (cid:112) | T |} d (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:88) j ∈ T (cid:15) j (cid:15) Tj (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) ≤ C, (38) with probability at least − n − C (cid:48) . We have used the convention that / .Proof. Consider any a ∈ [ s ] and a fixed T ⊂ [ n ] such that | T | = a . Similar to the proof of Lemma7.3, we can take S d − = { y ∈ R d : (cid:107) y (cid:107) = 1 } and its (cid:15) -covering N (cid:15) with (cid:15) < and | N (cid:15) | ≤ d . Thenwe have (cid:107) (cid:88) j ∈ T (cid:15) j (cid:15) Tj (cid:107) = sup (cid:107) w (cid:107) =1 (cid:88) j ∈ T ( w T (cid:15) j ) ≤ w ∈ N (cid:15) (cid:88) j ∈ T ( w T (cid:15) j ) . Note that w T (cid:15) j / √ λ max is a sub-Gaussian random variable with parameter 1. By [10], for any fixed w ∈ N (cid:15) , we have P (cid:88) j ∈ T ( w T (cid:15) j ) ≥ λ max (cid:16) a + 2 √ at + 2 t (cid:17) ≤ exp ( − t ) . Since a = o ( n ), there exists a constant C such that 2 a ≤ C a log na . 
We can take t = ˜ C ( a log na + d )with ˜ C = C − C , then a + 2 √ at + 2 t ≤ C ( a log na + d ). Thus, P (cid:88) j ∈ T ( w T (cid:15) j ) ≥ C a log na + d ) ≤ exp (cid:18) − ˜ C ( a log na + d ) (cid:19) . Hence, we have P (cid:107) (cid:88) j ∈ T (cid:15) j (cid:15) Tj (cid:107) ≥ C a log na + d ) ≤ d exp (cid:18) − ˜ C ( a log na + d ) (cid:19) . As a result, P (cid:26) max T ⊂ [ n ] , ≤| T |≤ s | T | log n | T | + d (cid:107) (cid:88) j ∈ T (cid:15) j (cid:15) Tj (cid:107) ≥ C (cid:27) ≤ s (cid:88) a =1 P (cid:26) max | T | = a (cid:107) (cid:88) j ∈ T (cid:15) j (cid:15) Tj (cid:107) ≥ C ( a log na + d ) (cid:27) ≤ s (cid:88) a =1 (cid:18) na (cid:19) max | T | = a P (cid:26) (cid:107) (cid:88) j ∈ T (cid:15) j (cid:15) Tj (cid:107) ≥ C ( a log na + d ) (cid:27) ≤ s (cid:88) a =1 (cid:18) na (cid:19) d exp (cid:18) − ˜ C ( a log na + d ) (cid:19) . Since a log na is an increasing function when a ∈ [1 , s ] and a log na ≥ log n ≥ log s , a choice of˜ C = 3 + C (cid:48) , that is C = 16 C (cid:48) + 4 C + 48, can yield the desired result.Finally, to allow | T | = 0, we note that d ≤ min { , (cid:112) | T |} d . The proof is complete.31 emma 7.5. For any z ∗ ∈ [ k ] n and k ∈ [ n ] , assume min a ∈ k (cid:80) nj =1 I { z ∗ j = a } ≥ αnk and (cid:96) ( z, z ∗ ) = o ( n ∆ k ) , then max a ∈ [ k ] (cid:80) nj =1 I { z ∗ j = a } (cid:80) nj =1 I { z j = a } ≤ . (39) Proof.
For any z ∈ [ k ] n such that (cid:96) ( z, z ∗ ) = o ( n ) and any a ∈ [ k ], we have n (cid:88) j =1 I { z j = a } ≥ n (cid:88) j =1 I { z ∗ j = a } − n (cid:88) j =1 I { z j (cid:54) = z ∗ j }≥ n (cid:88) j =1 I { z ∗ j = a } − (cid:96) ( z, z ∗ )∆ ≥ αn k , (40)which implies (cid:80) nj =1 I { z ∗ j = a } (cid:80) nj =1 I { z j = a } ≤ (cid:80) nj =1 I { z j = a } + (cid:80) nj =1 I { z j (cid:54) = z ∗ j } (cid:80) nj =1 I { z j = a }≤ αn/ k (cid:80) nj =1 I { z j = a }≤ . Thus, we obtain (39).The next lemma is the famous Weyl’s Theorem and we omit the proof here.
Lemma 7.6 (Weyl’s Theorem) . Let A and B be any two d × d symmetric real matrix. Then forany ≤ i ≤ d , we have λ i ( A + B ) ≤ λ d ( A ) + λ i ( B ) . In the following lemma, we are going to analyze estimation errors of { Σ ∗ a } a ∈ [ k ] under theanisotropic GMMs. For any z ∈ [ k ] n and for any z ∈ [ k ], recall the definitionsˆ θ a ( z ) = (cid:80) j ∈ [ n ] Y j I { z j = a } (cid:80) j ∈ [ n ] I { z j = a } , ˆΣ a ( z ) = (cid:80) j ∈ [ n ] ( Y j − ˆ θ a ( z ))( Y j − ˆ θ a ( z )) T I { z j = a } (cid:80) j ∈ [ n ] I { z j = a } . Lemma 7.7.
For any z ∗ ∈ [ k ] n and k ∈ [ n ] , consider independent vectors Y j = θ ∗ z ∗ j + (cid:15) j where (cid:15) j ∼N (0 , Σ ∗ z ∗ j ) for any j ∈ [ n ] . Assume there exist constants λ min , λ max > such that λ min ≤ λ (Σ ∗ a ) ≤ λ d (Σ ∗ a ) ≤ λ max for any a ∈ [ k ] , and a constant α > such that min a ∈ k (cid:80) nj =1 I { z ∗ j = a } ≥ αnk .Assume k ( d +log n ) n = o (1) and ∆ k → ∞ . Assume (33) - (38) hold. Then for any τ = o ( n ) and for any onstant C (cid:48) > , there exists some constant C > only depending on α, λ max , C (cid:48) such that max a ∈ [ k ] (cid:13)(cid:13)(cid:13) ˆ θ a ( z ∗ ) − θ ∗ a (cid:13)(cid:13)(cid:13) ≤ C (cid:114) k ( d + log n ) n , (41)max a ∈ [ k ] (cid:13)(cid:13)(cid:13) ˆ θ a ( z ) − ˆ θ a ( z ∗ ) (cid:13)(cid:13)(cid:13) ≤ C (cid:18) kn ∆ (cid:96) ( z, z ∗ ) + k √ d + nn ∆ (cid:112) (cid:96) ( z, z ∗ ) (cid:19) , (42)max a ∈ [ k ] (cid:13)(cid:13)(cid:13) ˆΣ a ( z ∗ ) − Σ ∗ a (cid:13)(cid:13)(cid:13) ≤ C (cid:114) k ( d + log n ) n , (43)max a ∈ [ k ] (cid:13)(cid:13)(cid:13) ˆΣ a ( z ) − ˆΣ a ( z ∗ ) (cid:13)(cid:13)(cid:13) ≤ C (cid:32) kn (cid:96) ( z, z ∗ ) + k (cid:112) n(cid:96) ( z, z ∗ ) n ∆ + kdn ∆ (cid:112) (cid:96) ( z, z ∗ ) (cid:33) , (44) for all z such that (cid:96) ( z, z ∗ ) ≤ τ .Proof. Using (33) we obtain (41). By the same argument of (118) in [8], we can obtain (42). By(33) and (37) and (41), we can obtain (43). In the remaining of the proof, we will establish (53).Since k ( d +log n ) n = o (1), we have (cid:107) ˆΣ a ( z ∗ ) (cid:107) (cid:46) a ∈ [ k ]. The difference ˆΣ a ( z ) − ˆΣ a ( z ∗ )will be decomposed into several terms. We notice that (cid:13)(cid:13)(cid:13) ˆΣ a ( z ) − ˆΣ a ( z ∗ ) (cid:13)(cid:13)(cid:13) ≤ S + S , (45)where S := (cid:13)(cid:13)(cid:13)(cid:13) (cid:80) I { z j = a } n (cid:88) j =1 I { z j = a } (cid:18) ( Y j − ˆ θ a ( z ))( Y j − ˆ θ a ( z )) T − ( Y j − ˆ θ a ( z ∗ ))( Y j − ˆ θ a ( z ∗ )) T (cid:19)(cid:13)(cid:13)(cid:13)(cid:13) , and S := (cid:13)(cid:13)(cid:13)(cid:13) (cid:32) (cid:80) I { z j = a } − (cid:80) I { z ∗ j = a } (cid:33) n (cid:88) j =1 I { z ∗ j = a } ( Y j − ˆ θ a ( z ∗ ))( Y j − ˆ θ a ( z ∗ )) T (cid:13)(cid:13)(cid:13)(cid:13) . Also, we notice that S ≤ L + L + L , (46)where L := (cid:13)(cid:13)(cid:13)(cid:13) (cid:80) nj =1 I { z j = a } n (cid:88) j =1 I { z j = z ∗ j = a } (cid:18) ( Y j − ˆ θ a ( z ))( Y j − ˆ θ a ( z )) T − ( Y j − ˆ θ a ( z ∗ ))( Y j − ˆ θ a ( z ∗ )) T (cid:19)(cid:13)(cid:13)(cid:13)(cid:13) ,L := (cid:13)(cid:13)(cid:13)(cid:13) (cid:80) nj =1 I { z j = a } n (cid:88) j =1 I { z j = a, z ∗ j (cid:54) = a } ( Y j − ˆ θ a ( z ))( Y j − ˆ θ a ( z )) T (cid:13)(cid:13)(cid:13)(cid:13) ,L := (cid:13)(cid:13)(cid:13)(cid:13) (cid:80) nj =1 I { z j = a } n (cid:88) j =1 I { z j (cid:54) = a, z ∗ j = a } ( Y j − ˆ θ a ( z ∗ ))( Y j − ˆ θ a ( z ∗ )) T (cid:13)(cid:13)(cid:13)(cid:13) . 
L , we have L ≤ (cid:13)(cid:13)(cid:13)(cid:13) (cid:80) nj =1 I { z j = a } n (cid:88) j =1 I { z j = z ∗ j = a } (ˆ θ a ( z ) − ˆ θ a ( z ∗ ))(ˆ θ a ( z ) − ˆ θ a ( z ∗ )) T (cid:13)(cid:13)(cid:13)(cid:13) + 2 (cid:13)(cid:13)(cid:13)(cid:13) (cid:80) nj =1 I { z j = a } n (cid:88) j =1 I { z j = z ∗ j = a } ( Y j − ˆ θ a ( z ∗ ))(ˆ θ a ( z ) − ˆ θ a ( z ∗ )) T (cid:13)(cid:13)(cid:13)(cid:13) (cid:46) (cid:13)(cid:13)(cid:13) ˆ θ a ( z ) − ˆ θ a ( z ∗ ) (cid:13)(cid:13)(cid:13) (cid:80) nj =1 I { z ∗ j = a } (cid:80) nj =1 I { z j = a } + (cid:13)(cid:13)(cid:13) θ ∗ a − ˆ θ a ( z ∗ ) (cid:13)(cid:13)(cid:13) (cid:13)(cid:13)(cid:13) ˆ θ a ( z ) − ˆ θ a ( z ∗ ) (cid:13)(cid:13)(cid:13) (cid:80) nj =1 I { z ∗ j = a } (cid:80) nj =1 I { z j = a } + (cid:13)(cid:13)(cid:13) ˆ θ a ( z ) − ˆ θ a ( z ∗ ) (cid:13)(cid:13)(cid:13) (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) (cid:80) nj =1 I { z j = a } n (cid:88) j =1 I { z j = z ∗ j = a } (cid:15) j (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) . (47)By (36), (39), (40), we have uniformly for any a ∈ [ k ], (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) (cid:80) nj =1 I { z j = a } n (cid:88) j =1 I { z j = z ∗ j = a } (cid:15) j (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) (cid:46) (cid:113)(cid:80) nj =1 I { z j = z ∗ j = a } (cid:80) nj =1 I { z j = a } (cid:118)(cid:117)(cid:117)(cid:116) d + n (cid:88) j =1 I { z ∗ j = a } (cid:46) . (48)Since max a ∈ [ k ] (cid:13)(cid:13)(cid:13) ˆ θ a ( z ∗ ) − θ ∗ a (cid:13)(cid:13)(cid:13) = o (1), by (39), (42), (41), (47), and (48), we have uniformly for any a ∈ [ k ], L (cid:46) (cid:13)(cid:13)(cid:13) ˆ θ a ( z ) − ˆ θ a ( z ∗ ) (cid:13)(cid:13)(cid:13) (cid:46) kn ∆ (cid:96) ( z, z ∗ ) + k √ d + nn ∆ (cid:112) (cid:96) ( z, z ∗ ) . (49)To bound L , we first give the following simple fact. For any positive integer m and any { u j } j ∈ [ m ] , { v j } j ∈ [ m ] ∈ R d , we have (cid:107) (cid:80) j ∈ [ m ] ( u j + v j )( u j + v j ) T (cid:107) ≤ (cid:107) (cid:80) j ∈ [ m ] u j u Tj (cid:107) + 2 (cid:107) (cid:80) j ∈ [ m ] v j v Tj (cid:107) . Hence, for L , wehave the following decomposition L ≤ R + 2 R , (50)where R := (cid:13)(cid:13)(cid:13)(cid:13) (cid:80) nj =1 I { z j = a } n (cid:88) j =1 I { z j = a, z ∗ j (cid:54) = a } ( Y j − θ ∗ a )( Y j − θ ∗ a ) T (cid:13)(cid:13)(cid:13)(cid:13) ,R := (cid:13)(cid:13)(cid:13)(cid:13) (cid:80) nj =1 I { z j = a } n (cid:88) j =1 I { z j = a, z ∗ j (cid:54) = a } ( θ ∗ a − ˆ θ a ( z ))( θ ∗ a − ˆ θ a ( z )) T (cid:13)(cid:13)(cid:13)(cid:13) . Since max a ∈ [ k ] (cid:80) nj =1 I { z j = a, z ∗ j (cid:54) = a } ≤ (cid:96) ( z,z ∗ )∆ , we have R ≤ (cid:13)(cid:13)(cid:13) θ ∗ a − ˆ θ a ( z ) (cid:13)(cid:13)(cid:13) (cid:80) nj =1 I { z j = a, z ∗ j (cid:54) = a } (cid:80) nj =1 I { z j = a } (cid:46) (cid:18)(cid:13)(cid:13)(cid:13) ˆ θ a ( z ) − ˆ θ a ( z ∗ ) (cid:13)(cid:13)(cid:13) + (cid:13)(cid:13)(cid:13) ˆ θ a ( z ∗ ) − θ ∗ a (cid:13)(cid:13)(cid:13) (cid:19) k(cid:96) ( z, z ∗ ) n ∆ . 
(51)34y (38) and the fact that max a ∈ [ k ] (cid:80) nj =1 I { z i = a, z ∗ i (cid:54) = a } ≤ (cid:96) ( z,z ∗ )∆ , we also have R ≤ (cid:13)(cid:13)(cid:13)(cid:13) (cid:80) nj =1 I { z j = a } n (cid:88) j =1 I { z j = a, z ∗ j (cid:54) = a } ( θ ∗ z ∗ j − θ ∗ z j )( θ ∗ z ∗ j − θ ∗ z j ) T (cid:13)(cid:13)(cid:13)(cid:13) + 2 (cid:13)(cid:13)(cid:13)(cid:13) (cid:80) nj =1 I { z j = a } n (cid:88) j =1 I { z j = a, z ∗ j (cid:54) = a } (cid:15) j (cid:15) Tj (cid:13)(cid:13)(cid:13)(cid:13) ≤ (cid:80) nj =1 I { z j = a, z ∗ j (cid:54) = a }(cid:107) θ ∗ z ∗ j − θ ∗ z j (cid:107) (cid:80) nj =1 I { z j = a } + 2 (cid:13)(cid:13)(cid:13)(cid:13) (cid:80) nj =1 I { z j = a } n (cid:88) j =1 I { z j = a, z ∗ j (cid:54) = a } (cid:15) j (cid:15) Tj (cid:13)(cid:13)(cid:13)(cid:13) (cid:46) k(cid:96) ( z, z ∗ ) n + (cid:96) ( z,z ∗ )∆ log n ∆ (cid:96) ( z,z ∗ ) + d (cid:113) (cid:96) ( z,z ∗ )∆ n/k . We are going to simplify the above bounds for R , R . Under the assumption that k ( d +log n ) n = o (1), ∆ /k → ∞ , and (cid:96) ( z, z ∗ ) ≤ τ = o ( n ), we have max a ∈ [ k ] (cid:107) ˆ θ a ( z ) − ˆ θ a ( z ∗ ) (cid:107) = o (1), max a ∈ [ k ] (cid:107) ˆ θ a ( z ∗ ) − θ ∗ a (cid:107) = o (1), and k(cid:96) ( z,z ∗ ) n ∆ = o (1). Hence R (cid:46) k(cid:96) ( z,z ∗ ) n ∆ . Also we have k(cid:96) ( z, z ∗ ) n ∆ log n ∆ (cid:96) ( z, z ∗ ) = k (cid:112) (cid:96) ( z, z ∗ ) n ∆ (cid:115) (cid:96) ( z, z ∗ )∆ (cid:18) log n ∆ (cid:96) ( z, z ∗ ) (cid:19) ≤ k (cid:112) n(cid:96) ( z, z ∗ ) n ∆ . where in the last inequality, we use the fact that x (log( n/x )) is an increasing function of x when0 < x = o ( n ). Then, L (cid:46) k (cid:112) n(cid:96) ( z, z ∗ ) n ∆ + kn (cid:96) ( z, z ∗ ) + kdn ∆ (cid:112) (cid:96) ( z, z ∗ ) . Since L is similar to L , by (46) we have uniformly for any a ∈ [ k ] S (cid:46) k (cid:112) n(cid:96) ( z, z ∗ ) n ∆ + kn (cid:96) ( z, z ∗ ) + kdn ∆ (cid:112) (cid:96) ( z, z ∗ ) . (52)To bound S , by (70) in [8], we have uniformly for any a ∈ [ k ], S = (cid:12)(cid:12)(cid:12)(cid:80) nj =1 I { z ∗ j = a } − (cid:80) nj =1 I { z j = a } (cid:12)(cid:12)(cid:12)(cid:80) nj =1 I { z j = a } (cid:13)(cid:13)(cid:13) ˆΣ a ( z ∗ ) (cid:13)(cid:13)(cid:13) (cid:46) kn (cid:96) ( z, z ∗ )∆ , where we use (43). Since kn (cid:96) ( z,z ∗ )∆ (cid:46) k √ n(cid:96) ( z,z ∗ ) n ∆ , by (45) and the facts that (cid:96) ( z, z ∗ ) ≤ τ = o ( n ) wehave max a ∈ [ k ] (cid:13)(cid:13)(cid:13) ˆΣ a ( z ) − ˆΣ a ( z ∗ ) (cid:13)(cid:13)(cid:13) (cid:46) k (cid:112) n(cid:96) ( z, z ∗ ) n ∆ + kn (cid:96) ( z, z ∗ ) + kdn ∆ (cid:112) (cid:96) ( z, z ∗ ) . Lemma 7.8.
Under the same assumption as in Lemma 7.7, if additional we assume kd = O ( √ n ) and τ = o ( n/k ) , there exists some constant C > only depending on α, λ min , λ max , C (cid:48) such that max a ∈ [ k ] (cid:13)(cid:13)(cid:13) ( ˆΣ a ( z )) − − ( ˆΣ a ( z ∗ )) − (cid:13)(cid:13)(cid:13) ≤ C (cid:32) kn (cid:96) ( z, z ∗ ) + k (cid:112) n(cid:96) ( z, z ∗ ) n ∆ + kdn ∆ (cid:112) (cid:96) ( z, z ∗ ) (cid:33) . (53)35 roof. By (43) we have max a ∈ [ k ] (cid:107) ˆΣ a ( z ∗ ) (cid:107) , max a ∈ [ k ] (cid:107) ( ˆΣ a ( z ∗ )) − (cid:107) (cid:46)
1. By (44) we also havemax a ∈ [ k ] (cid:107) ˆΣ a ( z ) (cid:107) , max a ∈ [ k ] (cid:107) ( ˆΣ a ( z )) − (cid:107) (cid:46)
1. Hence,max a ∈ [ k ] (cid:13)(cid:13)(cid:13) ( ˆΣ a ( z )) − − ( ˆΣ a ( z ∗ )) − (cid:13)(cid:13)(cid:13) ≤ max a ∈ [ k ] (cid:13)(cid:13)(cid:13) ( ˆΣ a ( z ∗ )) − (cid:13)(cid:13)(cid:13) (cid:13)(cid:13)(cid:13) ˆΣ a ( z ) − ˆΣ a ( z ∗ ) (cid:13)(cid:13)(cid:13) (cid:13)(cid:13)(cid:13) ( ˆΣ a ( z )) − (cid:13)(cid:13)(cid:13) (cid:46) k (cid:112) n(cid:96) ( z, z ∗ ) n ∆ + kn (cid:96) ( z, z ∗ ) + kdn ∆ (cid:112) (cid:96) ( z, z ∗ ) . (54) Lemma 7.9.
Let W i iid ∼ χ d for any i ∈ [ n ] where n, d are positive integers. Then we have P (cid:32) n (cid:88) i =1 W i ≥ nd (cid:33) ≤ nd . Proof.
We have E (cid:80) ni =1 W i = nd ( d + 2) and E (cid:80) ni =1 W i = nd ( d + 2)( d + 4)( d + 6). Then wehave Var (cid:0)(cid:80) ni =1 W i (cid:1) = 8 nd ( d + 2)( d + 3). Then we obtain the desired result by Chebyshev’sinequality. Proof of Lemma 6.3.
Consider any a (cid:54) = b ∈ [ k ]. We are going to prove −√ λ max + (cid:113) λ max + λ min ( λ min + λ max )2 λ max λ min + λ max (cid:107) θ ∗ a − θ ∗ b (cid:107) ≤ SNR (cid:48) a,b ≤ λ − / (cid:107) θ ∗ a − θ ∗ b (cid:107) + (cid:114) d + (cid:114) d log λ max λ min . (55)We first prove the upper bound. Denote Ξ a,b = θ ∗ a − θ ∗ b . Since we have assumed SNR (cid:48) → ∞ , wehave that 0 / ∈ B a,b . Note that we have an equivalent expression of B a,b : B a,b = (cid:26) − (Σ ∗ a ) − Ξ a,b + y ∈ R d :2 y T (Σ ∗ a ) − Ξ a,b + y T (cid:18) Σ ∗ a Σ ∗− b Σ ∗ a − I (cid:19) y − log | Σ ∗ a | + log | Σ ∗ b | − Ξ Ta,b (Σ ∗ a ) − Ξ a,b ≤ (cid:27) . We consider the following scenarios. (1). If λ (cid:18) Σ ∗ a Σ ∗− b Σ ∗ a − I (cid:19) ≥
0, we have | Σ ∗ a | ≥ | Σ ∗ b | . Let y = 0, we can know − (Σ ∗ a ) − Ξ a,b ∈B a,b . This tells us SNR (cid:48) a,b ≤ (cid:13)(cid:13)(cid:13) − (Σ ∗ a ) − Ξ a,b (cid:13)(cid:13)(cid:13) ≤ λ − / (cid:107) Ξ a,b (cid:107) . (2). If λ (cid:18) Σ ∗ a Σ ∗− b Σ ∗ a − I (cid:19) ≤ −
1, let A := Σ ∗ a Σ ∗− b Σ ∗ a − I and assume U T AU = V ,where U is an orthogonal matrix and V := diag { v , · · · , v d } is a diagonal matrix with diagonalelements v ≤ v ≤ · · · ≤ v d and v ≤ −
1. We can rewrite y = U z with z = ( z , · · · , z d ) T and U T (Σ ∗ a ) − Ξ a,b = ( τ , · · · , τ d ) T . Since − log | Σ ∗ a | + log | Σ ∗ b | ≤ d log λ max λ min , we can take z = − sign ( τ ) (cid:113) d log λ max λ min and z i = 0 for i ≥
2. Then, we have2 y T (Σ ∗ a ) − Ξ a,b + y T (cid:18) Σ ∗ a Σ ∗− b Σ ∗ a − I (cid:19) y − log | Σ ∗ a | + log | Σ ∗ b | − Ξ Ta,b (Σ ∗ a ) − Ξ a,b = 2 z τ + v z − log | Σ ∗ a | + log | Σ ∗ b | − Ξ Ta,b (Σ ∗ a ) − Ξ a,b ≤ − (cid:114) d log λ max λ min | τ | − d log λ max λ min − log | Σ ∗ a | + log | Σ ∗ b | − Ξ Ta,b (Σ ∗ a ) − Ξ a,b ≤ .
36t means − (Σ ∗ a ) − Ξ a,b + y / ∈ B a,b and hence SNR (cid:48) a,b ≤ (cid:13)(cid:13)(cid:13) − (Σ ∗ a ) − Ξ a,b (cid:13)(cid:13)(cid:13) + (cid:107) y (cid:107) . Then Thus we have SNR (cid:48) a,b ≤ (cid:13)(cid:13)(cid:13) − (Σ ∗ a ) − Ξ a,b (cid:13)(cid:13)(cid:13) + (cid:113) d log λ max λ min ≤ λ − / (cid:107) Ξ a,b (cid:107) + (cid:113) d log λ max λ min . (3). If − < λ (cid:18) Σ ∗ a Σ ∗− b Σ ∗ a − I (cid:19) <
0, we still use the notations in scenario (2). Notice thatΣ ∗ a Σ ∗− b Σ ∗ a = A + I , we have log | Σ ∗ a || Σ ∗ b | = log(1 + v ) · · · · · (1 + v d ) ≥ d log(1 + v ) ≥ dv . Now we take z = − sign ( τ ) (cid:113) d and z i = 0 for i ≥
2, then we have2 y T (Σ ∗ a ) − Ξ a,b + y T (cid:18) Σ ∗ a Σ ∗− b Σ ∗ a − I (cid:19) y − log | Σ ∗ a | + log | Σ ∗ b | − Ξ Ta,b (Σ ∗ a ) − Ξ a,b = 2 z τ + v z − log | Σ ∗ a | + log | Σ ∗ b | − Ξ Ta,b (Σ ∗ a ) − Ξ a,b ≤ − | τ | (cid:114) d − Ξ Ta,b (Σ ∗ a ) − Ξ a,b ≤ . Thus we have
SNR (cid:48) a,b ≤ (cid:13)(cid:13)(cid:13) − (Σ ∗ a ) − Ξ a,b (cid:13)(cid:13)(cid:13) + (cid:113) d ≤ λ − / (cid:107) Ξ a,b (cid:107) + (cid:113) d .Overall, we have SNR (cid:48) a,b ≤ λ − / (cid:107) Ξ a,b (cid:107) + (cid:113) d + (cid:113) d log λ max λ min for all the three scenarios.To prove the lower bound, we have x T Σ ∗ a Σ ∗− b ( θ ∗ a − θ ∗ b ) + 12 x T (cid:18) Σ ∗ a Σ ∗− b Σ ∗ a − I (cid:19) x ≥ − (cid:13)(cid:13)(cid:13)(cid:13) Σ ∗ a Σ ∗− b (cid:13)(cid:13)(cid:13)(cid:13) (cid:107) x (cid:107) (cid:107) Ξ a,b (cid:107) − (cid:107) x (cid:107) (cid:13)(cid:13)(cid:13)(cid:13) Σ ∗ a Σ ∗− b Σ ∗ a − I (cid:13)(cid:13)(cid:13)(cid:13) ≥ − √ λ max λ min (cid:107) x (cid:107) (cid:107) Ξ a,b (cid:107) − (cid:18) λ max λ min + 1 (cid:19) (cid:107) x (cid:107) . By the upper bound we know for any a (cid:54) = b ∈ [ k ], (cid:107) Ξ a,b (cid:107) → ∞ when SNR (cid:48) → ∞ . Thus, we have √ λ max λ min (cid:107) x (cid:107) (cid:107) Ξ a,b (cid:107) + 12 (cid:18) λ max λ min + 1 (cid:19) (cid:107) x (cid:107) ≥
12 Ξ
Ta,b Σ ∗− b Ξ a,b − log | Σ ∗ a | + log | Σ ∗ b |≥ λ max (cid:107) Ξ a,b (cid:107) . Hence, (cid:107) x (cid:107) ≥ −√ λ max + (cid:113) λ max + λ min ( λ min + λ max )2 λ max λ min + λ max (cid:107) Ξ a,b (cid:107) .
37n the following lemmas, we are going to establish connections between testing errors and { SNR (cid:48) a,b } a (cid:54) = b . Consider any a, b ∈ [ k ] such that a (cid:54) = b . Let η ∼ N (0 , I d ), Ξ a,b = θ ∗ a − θ ∗ b , and∆ a,b = (cid:107) Ξ a,b (cid:107) . Define B a,b ( δ ) = (cid:40) x ∈ R d : x T Σ ∗ a (Σ ∗ b ) − Ξ a,b + 12 x T (cid:18) Σ ∗ a (Σ ∗ b ) − Σ ∗ a − I d (cid:19) x ≤ − − δ Ta,b (Σ ∗ b ) − Ξ a,b + 12 log | Σ ∗ a | −
12 log | Σ ∗ b | (cid:41) , for any δ ∈ R . In addition, we define SNR (cid:48) a,b ( δ ) = min x ∈B a,b ( δ ) (cid:107) x (cid:107) , and P a,b ( δ ) = P ( η ∈ B a,b ( δ )) . Recall the definitions of B a,b and SNR (cid:48) a,b in Section 3. Then they are a special case of B a,b ( δ ) and SNR (cid:48) a,b ( δ ) with δ = 0. That is, we have B a,b = B a,b (0) and SNR (cid:48) a,b = SNR (cid:48) a,b (0).
Lemma 7.10.
Assume d = O (1) and λ min ≤ λ (Σ ∗ a ) , λ (Σ ∗ b ) ≤ λ d (Σ ∗ a ) , λ d (Σ ∗ b ) ≤ λ max where λ min , λ max > are constants. Under the condition SNR (cid:48) a,b → ∞ , for any positive sequence δ = o (1) ,there exists a ˜ δ = o (1) that depends on δ, d, ∆ a,b , λ min , λ max such that P a,b ( δ ) ≤ exp (cid:32) − − ˜ δ SNR (cid:48) a,b (cid:33) Proof.
For convenience and conciseness, we will use the notation θ a , θ b , Σ a , Σ b instead of θ ∗ a , θ ∗ b , Σ ∗ a , Σ ∗ b throughout the proof. By Lemma 6.3, we have SNR (cid:48) a,b in the same order of ∆ a,b , which means∆ a,b → ∞ .Assume we had obtained
SNR (cid:48) a,b ( δ ) ≥ (1 − o (1)) SNR (cid:48) a,b . Then by Lemma 6.3, we have
SNR (cid:48) a,b ( δ )in the same order of ∆ a,b which is far bigger than d by assumption. Since (cid:107) η (cid:107) ∼ χ d , using Lemma7.1, we have P a,b ( δ ) ≤ P (cid:32) (cid:107) η (cid:107) ≥ SNR (cid:48) a,b ( δ )4 (cid:33) ≤ exp (cid:32) − (cid:32) − O (cid:32) d ∆ a,b (cid:33)(cid:33) SNR (cid:48) a,b ( δ )8 (cid:33) ≤ (cid:32) − O (cid:32) d ∆ a,b (cid:33) (1 − o (1)) SNR (cid:48) a,b (cid:33) which is the desired result. Hence, the proof of this lemma is all about establishing SNR (cid:48) a,b ( δ ) ≥ (1 − o (1)) SNR (cid:48) a,b .To prove it, we first simplify
SNR (cid:48) a,b ( δ ). In spite of some abuse of notation, denote λ ≤ . . . ≤ λ d to be the eigenvalues of Σ a (Σ b ) − Σ a − I d such that its eigen-decomposition can be written asΣ a (Σ b ) − Σ a − I d = (cid:80) di =1 λ i u i u Ti , where { u i } are orthogonal vectors. Denote U = ( u , . . . , u d ) and v = U T Σ a Σ − b Ξ a,b and B (cid:48) a,b ( δ ) = (cid:40) y ∈ R d : (cid:88) i y i v i + 12 (cid:88) i λ i y i ≤ − − δ Ta,b Σ − b Ξ a,b + 12 log | Σ a || Σ b | (cid:41) . B (cid:48) a,b ( δ ) can be seen a reflection-rotation of B a,b ( δ ) by the transformation y = U t x . Hence wehave SNR (cid:48) a,b ( δ ) = min y ∈B (cid:48) a,b ( δ ) (cid:107) y (cid:107) for any δ . What is more, let ¯ B (cid:48) a,b ( δ ) to be its boundary, i.e.,¯ B (cid:48) a,b ( δ ) = (cid:40) y ∈ R d : (cid:88) i y i v i + 12 (cid:88) i λ i y i = − − δ Ta,b Σ − b Ξ a,b + 12 log | Σ a || Σ b | (cid:41) . Since 0 / ∈ B (cid:48) a,b ( δ ), we have SNR (cid:48) a,b ( δ ) = 2 min y ∈ ¯ B (cid:48) a,b ( δ ) (cid:107) y (cid:107) As a result, we only need to work on¯ B (cid:48) a,b ( δ ) instead of B a,b ( δ ). Denote ¯ B (cid:48) a,b to be ¯ B (cid:48) a,b (0) for simplicity.We then give an equivalent expression of B (cid:48) a,b ( δ ). From (55), we have an upper bound of SNR (cid:48) a,b : SNR (cid:48) a,b ≤ λ − / ∆ a,b where we use ∆ a,b (cid:29) λ min , λ max , d . The same upper bound actually holds for B (cid:48) a,b ( δ ) for any δ = o (1) following the same proof. Define S = { y ∈ R d : (cid:107) y (cid:107) ≤ λ − / ∆ a,b } . Wethen have SNR (cid:48) a,b ( δ ) = 2 min y ∈ ¯ B (cid:48) a,b ( δ ) ∩ S (cid:107) y (cid:107) . We have the following inequality. Let g ( y ) : ¯ B a,b ( δ ) → ¯ B a,b (0) be any mapping. By the triangleinequality, we have (cid:107) y (cid:107) ≥ (cid:107) g ( y ) (cid:107) − (cid:107) y − g ( y ) (cid:107) . We have2 − SNR (cid:48) a,b ( δ ) = min y ∈ ¯ B (cid:48) a,b ( δ ) ∩ S (cid:107) y (cid:107)≥ min y ∈ ¯ B (cid:48) a,b ( δ ) ∩ S ( (cid:107) g ( y ) (cid:107) − (cid:107) y − g ( y ) (cid:107) ) ≥ min y ∈ ¯ B (cid:48) a,b ( δ ) ∩ S (cid:107) g ( y ) (cid:107) − max y ∈ ¯ B (cid:48) a,b ( δ ) ∩ S (cid:107) y − g ( y ) (cid:107)≥ min y ∈ ¯ B (cid:48) a,b (0) (cid:107) y (cid:107) − max y ∈ ¯ B (cid:48) a,b ( δ ) ∩ S (cid:107) y − g ( y ) (cid:107)≥ SNR (cid:48) a,b − max y ∈ ¯ B (cid:48) a,b ( δ ) ∩ S (cid:107) y − g ( y ) (cid:107) . (56)As a result, if we are able to find some g such that max y ∈ ¯ B (cid:48) a,b ( δ ) ∩ S (cid:107) y − g ( y ) (cid:107) = o (1) SNR (cid:48) a,b , we willimmediately have
SNR (cid:48) a,b ( δ ) ≥ (1 − o (1)) SNR (cid:48) a,b and the proof will be complete.Let w ∈ R d be some vector. Define g ( y ) = y + w argmin t ∈ R : y + tw ∈ ¯ B (cid:48) a,b | t | . If g ( y ) is a well-definedmapping, we have max y ∈ ¯ B (cid:48) a,b ( δ ) ∩ S (cid:107) y − g ( y ) (cid:107) = max y ∈ ¯ B (cid:48) a,b ( δ ) ∩ S min t ∈ R : y + tw ∈ ¯ B (cid:48) a,b (cid:107) w (cid:107) | t | , (57)which can be used to derive an upper bound. However, to make g ( y ) well-defined, we need for any y ∈ ¯ B (cid:48) a,b ( δ ) ∩ S , there exits some t ∈ R such that y + tw ∈ ¯ B (cid:48) a,b . This means we have the followingtwo equations: (cid:88) i y i v i + 12 (cid:88) i λ i y i = − − δ Ta,b Σ − b Ξ a,b + 12 log | Σ a || Σ b | , (58)and (cid:88) i ( y i + tw i ) v i + 12 (cid:88) i λ i ( y i + tw i ) = −
12 Ξ
Ta,b Σ − b Ξ a,b + 12 log | Σ a || Σ b | . It is equivalent to require t to satisfy t (cid:88) i ( w i v i + λ i y i w i ) + t (cid:88) i λ i w i = − δ Ta,b Σ − b Ξ a,b . (59)39ence, all we need is to find a decent vector w such that: for any y ∈ ¯ B (cid:48) a,b ( δ ) ∩ S there exists a t satisfying (59), and we can obtain the desired upper bound for (57).In the following, we will consider four different scenarios according to the spectral { λ i } . Foreach scenario, we will construct a w with decent bounds for (57). Denote δ (cid:48) = √ δ . Scenario 1: | λ | , | λ d | ≤ δ (cid:48) . We choose w = v/ (cid:107) v (cid:107) . Note that we have (cid:107) v (cid:107) in the same order of ∆ a,b and (cid:107) Ξ Ta,b Σ − b Ξ a,b (cid:107) in the same order of ∆ a,b . Note that we have t (cid:88) i ( w i v i + λ i y i w i ) + t (cid:88) i λ i w i ≤ t (cid:107) v (cid:107) + | t | (cid:107) y (cid:107) (cid:115)(cid:88) i λ i w i + t (cid:88) i λ i w i ≤ t (cid:107) v (cid:107) + | t | δ (cid:48) (cid:107) y (cid:107) + t δ (cid:48) ≤ t (cid:107) v (cid:107) + 2 | t | δ (cid:48) λ − / ∆ a,b + t δ (cid:48) , where in the last inequality we use y ∈ S . Define t = − δ / ∆ a,b . Then we have t (cid:88) i ( w i v i + λ i y i w i ) + t (cid:88) i λ i w i (cid:46) − δ ∆ a,b + δ δ (cid:48) ∆ a,b + δδ (cid:48) ∆ a,b (cid:28) − δ ∆ a,b (cid:46) − δ Ta,b Σ − b Ξ a,b . Hence for any y ∈ S there exists a t ∈ ( t ,
0) such that (59) is satisfied. Hence, | t | = δ / ∆ a,b isan upper bound for (57). Scenario 2: λ < − δ (cid:48) . We choose w = e which is the first standard basis of R d . Then, (59) canbe written as λ t + 2( v + λ y ) t + δ Ξ Ta,b Σ − b Ξ a,b = 0 . Since λ <
0, the above equation has two different solutions t , t ∈ R . Simple algebra leads tomin {| t | , | t |} ≤ (cid:115) δ Ξ Ta,b Σ − b Ξ a,b − λ ≤ (cid:115) δ Ξ Ta,b Σ − b Ξ a,b δ (cid:48) (cid:46) δ ∆ a,b . Hence, an upper bound for (57) is O ( δ ∆ a,b ). Scenario 3: λ ≥ − δ (cid:48) and there exists a j ∈ [ d ] such that λ j ≤ δ (cid:48) and | v j | ≥ √ δ (cid:48) ∆ a,b . We choose w = e j . Then (59) can be written as λ j t + 2( v j + λ j y j ) t + δ Ξ Ta,b Σ − b Ξ a,b = 0 . Note that for any y ∈ S , we have | v j + λ j y j | ≥ | v j |−| λ j y j | ≥ √ δ (cid:48) ∆ a,b − δ (cid:48) (2 λ − / ∆ a,b ) ≥ √ δ (cid:48) ∆ a,b / t = − sign( v j + λ j y j ) √ δ (cid:48) ∆ a,b . Then we have λ j t + 2( v j + λ j y j ) t + δ Ξ Ta,b Σ − b Ξ a,b = λ j δ (cid:48) ∆ a,b − | v j + λ j y j | √ δ (cid:48) ∆ a,b + δ Ξ Ta,b Σ − b Ξ a,b ≤ − (cid:18) δ (cid:48) − δ (cid:48) (cid:19) ∆ a,b + δO (∆ a,b ) ≤ .
40s a result, there exists some t ∈ ( t ,
0) satisfying (59). Hence, | t | = δ (cid:48) / ∆ a,b is an upper boundfor (57). Scenario 4: λ ≥ − δ (cid:48) and | v j | < √ δ (cid:48) ∆ a,b for all j ∈ [ d ] such that λ j ≤ δ (cid:48) . This scenario is slightlymore complicated as we need w to be dependent on y . Denote it as w ( y ). Then (56) still holds and(57) can be changed intomax y ∈ ¯ B (cid:48) a,b ( δ ) ∩ S (cid:107) y − g ( y ) (cid:107) = max y ∈ ¯ B (cid:48) a,b ( δ ) ∩ S min t ∈ R : y + tw ∈ ¯ B (cid:48) a,b (cid:107) w ( y ) (cid:107) | t | . (60)Denote m ∈ [ d ] to be the integer such that λ j ≤ δ (cid:48) for all j ≤ m and λ j > δ (cid:48) for all j > m . We canhave m < d otherwise this scenario can be reduced to Scenario 1. Define[ w ( y )] i = − (cid:18) y i + v i λ i (cid:19) I { i > m } . for any i ∈ [ d ]. Instead of using (59), we will analyze it slightly differently.For y ∈ ¯ B (cid:48) a,b ( δ ), (58) can be rewritten as (cid:88) i>m λ i (cid:18) y i + v i λ i (cid:19) = (cid:88) i>m v i λ i − (1 − δ )Ξ Ta,b Σ − b Ξ a,b + log | Σ a || Σ b | − (cid:88) i ≤ m y i v i + (cid:88) i ≤ m λ i y i . (61)On the other hand, if g ( y ) is well-defined, we need g ( y ) ∈ ¯ B (cid:48) a,b which means (cid:88) i>m λ i (cid:18) [ g ( y )] i + v i λ i (cid:19) = (cid:88) i>m v i λ i − Ξ Ta,b Σ − b Ξ a,b + log | Σ a || Σ b | − (cid:88) i ≤ m y i v i + (cid:88) i ≤ m λ i y i . Note that we have ( y i + v i /λ i )(1 − t ) = [ g ( y )] i + v i /λ i for i > m and [ g ( y )] i = y i for i ≤ m . Thenthe above display can be written as(1 − t ) (cid:88) i>m λ i (cid:18) y i + v i λ i (cid:19) = (cid:88) i>m v i λ i − Ξ Ta,b Σ − b Ξ a,b + log | Σ a || Σ b | − (cid:88) i ≤ m y i v i + (cid:88) i ≤ m λ i y i . Together with (61) multiplied, the above equation leads to(1 − t ) δ Ξ Ta,b Σ − b Ξ a,b = (2 t − t ) (cid:88) i>m v i λ i − Ξ Ta,b Σ − b Ξ a,b + log | Σ a || Σ b | − (cid:88) i ≤ m y i v i + (cid:88) i ≤ m λ i y i . (62)It is sufficient to find some 0 < t < − t ) t (2 − t ) δ Ξ Ta,b Σ − b Ξ a,b ≤ (cid:88) i>m v i λ i − Ξ Ta,b Σ − b Ξ a,b + log | Σ a || Σ b | − (cid:88) i ≤ m y i v i + (cid:88) i ≤ m λ i y i , (63)then there definitely exists some 0 < t < t satisfying (62).41e are going to give a lower bound for the right hand side of (63). Particularly, we need tolower bound (cid:80) i>m v i λ i − Ξ Ta,b Σ − b Ξ a,b . Denote ˜ y = U T ( − Σ − / a Ξ a,b ). Then using the definition of v and { λ i } , we have2 (cid:88) i ∈ [ k ] ˜ y i v i + (cid:88) i ∈ [ k ] λ i ˜ y i = 2( − Σ − a Ξ a,b ) T Σ a Σ − b Ξ a,b + ( − Σ − a Ξ a,b ) T (cid:18) Σ a Σ − b Σ a − I d (cid:19) ( − Σ − a Ξ a,b )= − Ξ Ta,b Σ − b Ξ a,b − Ξ Ta,b Σ − a Ξ a,b . Then we have (cid:88) i>m v i λ i − Ξ Ta,b Σ − b Ξ a,b = Ξ Ta,b Σ − a Ξ a,b + (cid:88) i>m λ i (cid:18) ˜ y i + v i λ i (cid:19) + (cid:88) i ≤ m ˜ y i v i + (cid:88) i ≤ m λ i ˜ y i ≥ Ξ Ta,b Σ − a Ξ a,b + (cid:88) i ≤ m ˜ y i v i + (cid:88) i ≤ m λ i ˜ y i . 
(64)Hence, the right hand side of (63) can be lower bounded by ≥ Ξ Ta,b Σ − a Ξ a,b + log | Σ a || Σ b | + (cid:88) i ≤ m ˜ y i v i + (cid:88) i ≤ m λ i ˜ y i − (cid:88) i ≤ m y i v i + (cid:88) i ≤ m λ i y i ≥ Ξ Ta,b Σ − a Ξ a,b + log | Σ a || Σ b | − (cid:18) √ δ (cid:48) √ dλ − min ∆ a,b + δ (cid:48) λ − ∆ a,b (cid:19) ≥ Ξ Ta,b Σ − a Ξ a,b + log | Σ a || Σ b | − √ δ (cid:48) dλ − min ∆ a,b , (65)where we use both ˜ y, y ∈ S and the assumption that | v i | ≤ √ δ (cid:48) ∆ a,b and | λ i | ≤ δ (cid:48) for any i ≤ m .Then a sufficient condition for (63) is t satisfies(1 − t ) t (2 − t ) δ Ξ Ta,b Σ − b Ξ a,b ≤ Ξ Ta,b Σ − a Ξ a,b + log | Σ a || Σ b | + 16 √ δ (cid:48) dλ − min ∆ a,b . Since Ξ
Ta,b Σ − b Ξ a,b is in the same order of ∆ a,b and log | Σ a || Σ b | (cid:46) d = O (1), it can be achieved by t = √ δ .As a result, from (60) we havemax y ∈ ¯ B (cid:48) a,b ( δ ) ∩ S (cid:107) y − g ( y ) (cid:107) ≤ max y ∈ ¯ B (cid:48) a,b ( δ ) ∩ S (cid:107) w ( y ) (cid:107) | t | ≤ | t | max y ∈ ¯ B (cid:48) a,b ( δ ) ∩ S (cid:115) (cid:107) y (cid:107) + (cid:107) v (cid:107) δ (cid:48) (cid:46) √ δ (cid:16) ∆ a,b + δ (cid:48)− ∆ a,b (cid:17) ≤ δ ∆ a,b . Combining the above four scenarios, we can see we all have max y ∈ ¯ B (cid:48) a,b ( δ ) ∩ S (cid:107) y − g ( y ) (cid:107) (cid:46) δ ∆ a,b which is o (1) SNR (cid:48) a,b . By the argument before the discussion of the four scenarios, we have
SNR (cid:48) a,b ( δ ) ≥ (1 − o (1)) SNR (cid:48) a,b and the proof is complete. 42 emma 7.11.
Assume d = O (1) and λ min ≤ λ (Σ ∗ a ) , λ (Σ ∗ b ) ≤ λ d (Σ ∗ a ) , λ d (Σ ∗ b ) ≤ λ max where λ min , λ max > are constants. Under the condition SNR (cid:48) a,b → ∞ , for any positive sequence δ = o (1) ,there exists a ˜ δ = o (1) that depends on d, ∆ a,b , λ min , λ max such that P , (0) ≥ exp (cid:32) − δ SNR (cid:48) a,b (cid:33) . Proof.
For convenience and conciseness, we will use the notation θ a , θ b , Σ a , Σ b instead of θ ∗ a , θ ∗ b , Σ ∗ a , Σ ∗ b throughout the proof. From Lemma 6.3, we know SNR (cid:48) a,b is in the same order of ∆ a,b , which means∆ a,b → ∞ . Similar to the proof of Lemma 7.10, denote λ ≤ . . . ≤ λ d to be the eigenvaluesof Σ a (Σ b ) − Σ a − I d such that its eigen-decomposition can be written as Σ a (Σ b ) − Σ a − I d = (cid:80) di =1 λ i u i u Ti , where { u i } are orthogonal vectors. Denote U = ( u , . . . , u d ), v = U T Σ a Σ − b Ξ a,b .Then denote B (cid:48) a,b = (cid:40) y ∈ R d : (cid:88) i y i v i + 12 (cid:88) i λ i y i ≤ −
12 Ξ
Ta,b Σ − b Ξ a,b + 12 log | Σ a || Σ b | (cid:41) , and its boundary¯ B (cid:48) a,b = (cid:40) y ∈ R d : (cid:88) i y i v i + 12 (cid:88) i λ i y i = −
12 Ξ
Ta,b Σ − b Ξ a,b + 12 log | Σ a || Σ b | (cid:41) . By the same argument as in the proof of Lemma 7.10, B (cid:48) a,b can be seen a reflection-rotation of B a,b by the transformation y = U t x . Hence we have SNR (cid:48) a,b = min y ∈B (cid:48) a,b (cid:107) y (cid:107) and we can work on B (cid:48) a,b instead of B a,b . Denote ¯ y ∈ B (cid:48) a,b to be the one such that 2 (cid:107) ¯ y (cid:107) = SNR (cid:48) a,b . From the proof of Lemma7.10 we also know ¯ y ∈ S which is defined as S = { y ∈ R d : (cid:107) y (cid:107) ≤ λ − / ∆ a,b } . In addition, weknow ¯ y ∈ ¯ B (cid:48) a,b .We first give the main idea of the remaining proof. Denote p ( y ) to be the density function of y ∼ N (0 , I d ) We will construct a set T ⊂ R d around ¯ y such that for any y ∈ T we have y ∈ B (cid:48) a,b and (cid:107) y − ¯ y (cid:107) = o (∆ a,b ). Then we have P , (0) ≥ | T | inf y ∈ T p ( y ) = | T | π ) d exp (cid:18) −
Scenario 1: $|\lambda_1|, |\lambda_d| \le \delta'$. Define $w = v/\|v\|$. We define $T$ as follows:
$$T = \big\{ y = \bar y + s : \|(I_d - ww^T)s\| \le \delta' |w^T s|,\ w^T s \in [-\delta\Delta_{a,b}, -\delta\Delta_{a,b}/2] \big\}.$$
Since $\bar y \in \bar B'_{a,b}$, we have
$$\sum_i \bar y_i v_i + \frac{1}{2}\sum_i \lambda_i \bar y_i^2 = -\frac{1}{2}\Xi_{a,b}^T \Sigma_b^{-1} \Xi_{a,b} + \frac{1}{2}\log\frac{|\Sigma_a|}{|\Sigma_b|}. \qquad (67)$$
It is obvious that $\max_{y \in T}\|y - \bar y\| \lesssim \delta\Delta_{a,b}$. Hence we only need to show that for any $y \in T$ we have $y \in B'_{a,b}$, i.e.,
$$\sum_i (\bar y_i + s_i) v_i + \frac{1}{2}\sum_i \lambda_i (\bar y_i + s_i)^2 \le -\frac{1}{2}\Xi_{a,b}^T \Sigma_b^{-1} \Xi_{a,b} + \frac{1}{2}\log\frac{|\Sigma_a|}{|\Sigma_b|}. \qquad (68)$$
In view of the two displays above, it suffices to show
$$2\sum_i s_i v_i + \sum_i \lambda_i s_i^2 + 2\sum_i \lambda_i \bar y_i s_i \le 0. \qquad (69)$$
Note that $s$ satisfies $\|s\| \le 2|w^T s|$. Then
$$\begin{aligned}
2\sum_i s_i v_i + \sum_i \lambda_i s_i^2 + 2\sum_i \lambda_i \bar y_i s_i
&\le 2\|v\| w^T s + \delta'\|s\|^2 + 2\delta'\|\bar y\|\|s\| \\
&\le 2\|v\| w^T s + 4\delta'|w^T s|^2 + 4\delta'\|\bar y\||w^T s| \\
&= 2|w^T s|\big({-\|v\|} + 2\delta'|w^T s| + 2\delta'\|\bar y\|\big) \le 0,
\end{aligned}$$
where the last inequality uses the fact that $\|v\|$ and $\|\bar y\|$ are of order $\Delta_{a,b}$, while $\delta'|w^T s| + \delta'\|\bar y\| = o(\Delta_{a,b})$. Hence, for any $y \in T$, we have shown $y \in B'_{a,b}$. From Lemma 7.12 (applied with $r = \delta'$ and $t = \delta\Delta_{a,b}/2$), we have $|T| \ge \exp\big(d\log(\delta\delta'\Delta_{a,b}/4) - d\log d\big)$. Since $d = O(1)$, $\Delta_{a,b} \to \infty$, and $\delta$ goes to $0$ slowly, (66) leads to
$$P_{\cdot,\cdot}(0) \ge \exp\Big(-(1+o(1))\frac{\mathrm{SNR}_{a,b}^{\prime 2}}{8}\Big).$$

Scenario 2: $\lambda_1 < -\delta'$. Denote by $e_1$ the first standard basis vector of $\mathbb R^d$. Define $T$ as
$$T = \big\{ y = \bar y + s : \|(I_d - e_1e_1^T)s\| \le \delta^2 |e_1^T s|,\ \mathrm{sign}(v_1 + \lambda_1\bar y_1)\, e_1^T s \in [-\delta\Delta_{a,b}, -\delta\Delta_{a,b}/2] \big\}.$$
Here for the sign function we define $\mathrm{sign}(0) = 1$. It is obvious that $\max_{y\in T}\|y - \bar y\| \lesssim \delta\Delta_{a,b}$. Hence we only need to establish (69) to show that any $y \in T$ belongs to $B'_{a,b}$. Note that
$$\begin{aligned}
2\sum_i s_i v_i + \sum_i \lambda_i s_i^2 + 2\sum_i \lambda_i \bar y_i s_i
&= 2s_1(v_1 + \lambda_1\bar y_1) + \lambda_1 s_1^2 + 2\sum_{i\ge2} s_i(v_i + \lambda_i\bar y_i) + \sum_{i\ge2}\lambda_i s_i^2 \\
&\le 2s_1(v_1 + \lambda_1\bar y_1) + \lambda_1 s_1^2 + 2\|(I_d - e_1e_1^T)s\|\Big(\|v\| + \max_j|\lambda_j|\,\|\bar y\|\Big) + \max_j|\lambda_j|\,\|(I_d - e_1e_1^T)s\|^2 \\
&\le \Big(\lambda_1 + \delta^4\max_j|\lambda_j|\Big)s_1^2 - 2|v_1 + \lambda_1\bar y_1||s_1| + 2\Big(\|v\| + \max_j|\lambda_j|\,\|\bar y\|\Big)\delta^2|s_1| \\
&\le -\frac{\delta'}{2}s_1^2 + O(\Delta_{a,b})\delta^2|s_1|,
\end{aligned}$$
where we use $\max_j|\lambda_j| = O(1)$ and that $\|v\|, \|\bar y\|$ are of order $\Delta_{a,b}$. It is easy to verify that the right hand side is negative when $|s_1| \ge \delta\Delta_{a,b}/2$, since $\delta'\delta \gg \delta^2$. From Lemma 7.12 (with $r = \delta^2$ and $t = \delta\Delta_{a,b}/2$), we have $|T| \ge \exp\big(d\log(\delta^3\Delta_{a,b}/4) - d\log d\big)$. Then (66) leads to the desired result.

Scenario 3: $\lambda_1 \ge -\delta'$ and there exists a $j \in [d]$ such that $\lambda_j \le \delta'$ and $|v_j| \ge \sqrt{\delta'}\Delta_{a,b}$. Denote by $e_j$ the $j$th standard basis vector of $\mathbb R^d$. Define $T$ as
$$T = \big\{ y = \bar y + s : \|(I_d - e_je_j^T)s\| \le \delta' |e_j^T s|,\ \mathrm{sign}(v_j + \lambda_j\bar y_j)\, e_j^T s \in [-\delta\Delta_{a,b}, -\delta\Delta_{a,b}/2] \big\}.$$
Again $\max_{y\in T}\|y - \bar y\| \lesssim \delta\Delta_{a,b}$. Now we are going to verify (69), i.e., to show $\lambda_j s_j^2 + 2s_j(v_j + \lambda_j\bar y_j) \le -\sum_{i\ne j}\lambda_i s_i^2 - 2\sum_{i\ne j}s_i(v_i + \lambda_i\bar y_i)$. On the one hand, we have
$$\lambda_j s_j^2 + 2s_j(v_j + \lambda_j\bar y_j) = \lambda_j s_j^2 - 2|s_j||v_j + \lambda_j\bar y_j| \le \delta'\delta\Delta_{a,b}|s_j| - 2\big(\sqrt{\delta'}\Delta_{a,b} - \delta' O(\Delta_{a,b})\big)|s_j| \le -\sqrt{\delta'}\Delta_{a,b}|s_j|.$$
On the other hand, we have
$$\begin{aligned}
-\sum_{i\ne j}\lambda_i s_i^2 - 2\sum_{i\ne j}s_i(v_i + \lambda_i\bar y_i)
&\ge -\max_j|\lambda_j|\,\|(I_d - e_je_j^T)s\|^2 - 2\|(I_d - e_je_j^T)s\|\Big(\|v\| + \max_j|\lambda_j|\,\|\bar y\|\Big) \\
&\ge -\max_j|\lambda_j|\,\delta'^2|s_j|^2 - 2\delta'|s_j|\Big(\|v\| + \max_j|\lambda_j|\,\|\bar y\|\Big) \\
&\ge -\delta'\Big(\delta\Delta_{a,b}\max_j|\lambda_j| + 2\|v\| + 2\max_j|\lambda_j|\,\|\bar y\|\Big)|s_j| \\
&\ge -\delta' O(\Delta_{a,b})|s_j| \ge -\sqrt{\delta'}\Delta_{a,b}|s_j|,
\end{aligned}$$
where we use $\max_j|\lambda_j| = O(1)$ and that $\|v\|, \|\bar y\|$ are of order $\Delta_{a,b}$. Hence (69) is established. From Lemma 7.12, we have $|T| \ge \exp\big(d\log(\delta\delta'\Delta_{a,b}/4) - d\log d\big)$. Then (66) leads to the desired result.

Scenario 4: $\lambda_1 \ge -\delta'$ and $|v_j| < \sqrt{\delta'}\Delta_{a,b}$ for all $j \in [d]$ such that $\lambda_j \le \delta'$. Let $m \in [d]$ be the integer such that $\lambda_j \le \delta'$ for all $j \le m$ and $\lambda_j > \delta'$ for all $j > m$. We must have $m < d$, since otherwise this scenario reduces to Scenario 1. Define $w \in \mathbb R^d$ to be the unit vector such that
$$w_i = \frac{\lambda_i\bar y_i + v_i}{\sqrt{\sum_{j>m}(\lambda_j\bar y_j + v_j)^2}} \text{ for all } i > m, \qquad w_i = 0 \text{ otherwise.}$$
Define
$$T = \big\{ y = \bar y + s : \|(I_d - ww^T)s\| \le \delta' |w^T s|,\ w^T s \in [-\delta\Delta_{a,b}, -\delta\Delta_{a,b}/2] \big\}.$$
Now we are going to verify (69), i.e., to show $2\sum_{i>m}s_i(v_i + \lambda_i\bar y_i) + \sum_i\lambda_i s_i^2 + 2\sum_{i\le m}s_i(v_i + \lambda_i\bar y_i) \le 0$. First,
$$2\sum_{i>m}s_i(v_i + \lambda_i\bar y_i) = 2\sqrt{\sum_{j>m}(v_j + \lambda_j\bar y_j)^2}\,\sum_{i>m}s_iw_i = -2\sqrt{\sum_{j>m}(v_j + \lambda_j\bar y_j)^2}\,|w^T s| \le -2\sqrt{\delta'}\sqrt{\sum_{j>m}\lambda_j\Big(\bar y_j + \frac{v_j}{\lambda_j}\Big)^2}\,|w^T s|,$$
where the last step uses $(v_j + \lambda_j\bar y_j)^2 = \lambda_j^2(\bar y_j + v_j/\lambda_j)^2 \ge \delta'\lambda_j(\bar y_j + v_j/\lambda_j)^2$ for $j > m$. We are going to give a lower bound for $\sum_{j>m}\lambda_j(\bar y_j + v_j/\lambda_j)^2$. Note that (67) can be written as
$$\sum_{i>m}\lambda_i\Big(\bar y_i + \frac{v_i}{\lambda_i}\Big)^2 = \sum_{i>m}\frac{v_i^2}{\lambda_i} - \Xi_{a,b}^T\Sigma_b^{-1}\Xi_{a,b} + \log\frac{|\Sigma_a|}{|\Sigma_b|} - \Big(2\sum_{i\le m}\bar y_iv_i + \sum_{i\le m}\lambda_i\bar y_i^2\Big).$$
Denote $\tilde y = U^T(-\Sigma_a^{-1/2}\Xi_{a,b})$. Using (64), we have
$$\begin{aligned}
\sum_{i>m}\lambda_i\Big(\bar y_i + \frac{v_i}{\lambda_i}\Big)^2
&\ge \Xi_{a,b}^T\Sigma_a^{-1}\Xi_{a,b} + \log\frac{|\Sigma_a|}{|\Sigma_b|} + 2\sum_{i\le m}\tilde y_iv_i + \sum_{i\le m}\lambda_i\tilde y_i^2 - \Big(2\sum_{i\le m}\bar y_iv_i + \sum_{i\le m}\lambda_i\bar y_i^2\Big) \\
&\ge \Xi_{a,b}^T\Sigma_a^{-1}\Xi_{a,b} + \log\frac{|\Sigma_a|}{|\Sigma_b|} - 16\sqrt{\delta'}d\lambda_{\min}^{-1}\Delta_{a,b}^2 \\
&\ge C\Delta_{a,b}^2,
\end{aligned}$$
for some constant $C > 0$. Here the second inequality follows by the same argument as (65), and the last inequality uses the fact that $\Xi_{a,b}^T\Sigma_a^{-1}\Xi_{a,b}$ is of order $\Delta_{a,b}^2$ and $d = O(1)$. Hence,
$$2\sum_{i>m}s_i(v_i + \lambda_i\bar y_i) \le -2\sqrt{C\delta'}\,\Delta_{a,b}|w^T s|.$$
On the other hand, we have
$$\begin{aligned}
\sum_i\lambda_i s_i^2 + 2\sum_{i\le m}s_i(v_i + \lambda_i\bar y_i)
&\le \max_j|\lambda_j|\,\|s\|^2 + 2\sqrt{\sum_{i\le m}s_i^2}\Big(\|v\| + \max_j|\lambda_j|\,\|\bar y\|\Big) \\
&\le 4\max_j|\lambda_j|\,|w^T s|^2 + 2\Big(\|v\| + \max_j|\lambda_j|\,\|\bar y\|\Big)\|(I_d - ww^T)s\| \\
&\le \Big(4\max_j|\lambda_j|\,\delta\Delta_{a,b} + 2\delta'\big(\|v\| + \max_j|\lambda_j|\,\|\bar y\|\big)\Big)|w^T s| \le O(\delta'\Delta_{a,b})|w^T s|,
\end{aligned}$$
where we use the defining properties of $s$ for $y \in T$, together with the fact that $w$ is supported on the coordinates $i > m$, so that $\sum_{i\le m}s_i^2 \le \|(I_d - ww^T)s\|^2$. Summing the above two displays and using $\delta' = o(\sqrt{\delta'})$, we obtain (69). From Lemma 7.12, we have $|T| \ge \exp\big(d\log(\delta\delta'\Delta_{a,b}/4) - d\log d\big)$. Then (66) leads to the desired result, completing the proof.
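The Scenario 1 algebra lends itself to a direct numerical check: sampling points $s$ from the tube $T$ and evaluating the left-hand side of (69) should never produce a positive value. The sketch below does this for placeholder choices of $d$, $\Delta_{a,b}$, $\delta$, $v$, $\bar y$, and $\lambda$ consistent with the Scenario 1 assumptions; these numbers are ours, not the paper's.

```python
# Hedged numerical check (ours) of the Scenario 1 inequality (69): for s with
#   ||(I - w w^T) s|| <= delta' |w^T s|,  w^T s in [-delta*Delta, -delta*Delta/2],
# |lam_i| <= delta', and ||v||, ||ybar|| of order Delta, we should have
#   2 <s, v> + sum_i lam_i s_i^2 + 2 sum_i lam_i ybar_i s_i <= 0.
import numpy as np

rng = np.random.default_rng(1)
d, Delta, delta = 5, 30.0, 0.01
dp = np.sqrt(delta)                        # delta' = sqrt(delta)

v = rng.standard_normal(d); v *= Delta / np.linalg.norm(v)       # ||v|| = Delta
ybar = rng.standard_normal(d); ybar *= Delta / np.linalg.norm(ybar)
lam = rng.uniform(-dp, dp, size=d)         # Scenario 1: |lam_i| <= delta'
w = v / np.linalg.norm(v)

worst = -np.inf
for _ in range(20_000):
    a = rng.uniform(-delta * Delta, -delta * Delta / 2)   # w^T s
    u = rng.standard_normal(d); u -= (u @ w) * w          # direction orthogonal to w
    u /= np.linalg.norm(u)
    s = a * w + rng.uniform(0, dp * abs(a)) * u           # a point in the tube T
    lhs = 2 * s @ v + np.sum(lam * s**2) + 2 * np.sum(lam * ybar * s)
    worst = max(worst, lhs)
print("max LHS of (69) over samples:", worst, "(should be <= 0)")
```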
Lemma 7.12. Consider any positive integer $d$ and any $0 < r < 1$, $t > 0$. Define the set
$$T = \Big\{ y \in \mathbb R^d : \Big(\sum_{i\ge2} y_i^2\Big)^{1/2} \le r|y_1|,\ y_1 \in [-2t, -t] \Big\}.$$
Then we have
$$|T| \ge \exp\Big(d\log\frac{rt}{2} - d\log d\Big).$$
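Since $T$ is a truncated cone, its volume has a closed form, which makes the lemma easy to verify numerically. The short script below (ours, not from the paper) compares the exact volume with the stated lower bound for a few values of $d$, $r$, $t$.

```python
# Exact check (ours) of the volume bound in Lemma 7.12. The set T is a
# truncated cone, so slicing along y_1 gives the closed form
#   |T| = V_{d-1} * r^(d-1) * (2^d - 1) * t^d / d,
# where V_{d-1} = pi^((d-1)/2) / Gamma((d-1)/2 + 1) is the unit-ball volume
# in dimension d-1. We compare |T| with exp(d*log(r*t/2) - d*log(d)).
import math

def vol_T(d, r, t):
    V = math.pi ** ((d - 1) / 2) / math.gamma((d - 1) / 2 + 1)
    return V * r ** (d - 1) * (2 ** d - 1) * t ** d / d

def bound(d, r, t):
    return math.exp(d * math.log(r * t / 2) - d * math.log(d))

for d in (1, 2, 5, 10):
    for (r, t) in ((0.5, 1.0), (0.1, 3.0)):
        assert vol_T(d, r, t) >= bound(d, r, t)
        print(f"d={d:2d} r={r} t={t}: |T|={vol_T(d, r, t):.3e} >= bound={bound(d, r, t):.3e}")
```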
Proof. Define the $d$-dimensional ball
$$B = \Big\{ y \in \mathbb R^d : (y_1 + 1.5t)^2 + \sum_{i\ge2} y_i^2 \le (rt/2)^2 \Big\}.$$
We can easily verify that $B \subset T$: for all $y \in B$ we have $y_1 \in [-2t, -t]$ since $r \in (0,1)$, and hence $\big(\sum_{i\ge2} y_i^2\big)^{1/2} \le rt/2 \le r|y_1|$. As a result, by the formula for the volume of a $d$-dimensional ball, we have
$$|T| \ge |B| = \frac{\pi^{d/2}}{\Gamma(d/2+1)}\Big(\frac{rt}{2}\Big)^d \ge d^{-d}\Big(\frac{rt}{2}\Big)^d = \exp\Big(d\log\frac{rt}{2} - d\log d\Big),$$
where $\Gamma(\cdot)$ is the Gamma function.

References

[1] E. Abbe, J. Fan, and K. Wang. An $\ell_p$-theory of PCA and spectral clustering. arXiv preprint, 2020.
[2] Christopher M. Bishop. Pattern Recognition and Machine Learning. Springer, 2006.
[3] S. Charles Brubaker and Santosh S. Vempala. Isotropic PCA and affine-invariant clustering. In Building Bridges, pages 241–281. Springer, 2008.
[4] Sanjoy Dasgupta. The hardness of k-means clustering. Technical report, Department of Computer Science and Engineering, University of California, San Diego, 2008.
[5] Arthur P. Dempster, Nan M. Laird, and Donald B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society: Series B (Methodological), 39(1):1–22, 1977.
[6] Yingjie Fei and Yudong Chen. Hidden integrality of SDP relaxation for sub-Gaussian mixture models. arXiv preprint arXiv:1803.06510, 2018.
[7] Jerome Friedman, Trevor Hastie, and Robert Tibshirani. The Elements of Statistical Learning, volume 1. Springer Series in Statistics, New York, 2001.
[8] Chao Gao and Anderson Y. Zhang. Iterative algorithm for discrete structure recovery. arXiv preprint arXiv:1911.01018, 2019.
[9] Christophe Giraud and Nicolas Verzelen. Partial recovery bounds for clustering with the relaxed k-means. arXiv preprint arXiv:1807.07547, 2018.
[10] Daniel Hsu, Sham Kakade, and Tong Zhang. A tail inequality for quadratic forms of sub-Gaussian random vectors. Electronic Communications in Probability, 17, 2012.
[11] Adam Tauman Kalai, Ankur Moitra, and Gregory Valiant. Efficiently learning mixtures of two Gaussians. In Proceedings of the Forty-Second ACM Symposium on Theory of Computing, pages 553–562, 2010.
[12] Beatrice Laurent and Pascal Massart. Adaptive estimation of a quadratic functional by model selection. Annals of Statistics, pages 1302–1338, 2000.
[13] Stuart Lloyd. Least squares quantization in PCM. IEEE Transactions on Information Theory, 28(2):129–137, 1982.
[14] Matthias Löffler, Anderson Y. Zhang, and Harrison H. Zhou. Optimality of spectral clustering for Gaussian mixture model. arXiv preprint arXiv:1911.00538, 2019.
[15] Yu Lu and Harrison H. Zhou. Statistical and computational guarantees of Lloyd's algorithm and its variants. arXiv preprint arXiv:1612.02099, 2016.
[16] James MacQueen. Some methods for classification and analysis of multivariate observations. In Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, volume 1, pages 281–297, 1967.
[17] M. Ndaoud. Sharp optimal recovery in the two component Gaussian mixture model. arXiv preprint, 2019.
[18] Karl Pearson. Contributions to the mathematical theory of evolution. Philosophical Transactions of the Royal Society of London. A, 185:71–110, 1894.
[19] Daniel A. Spielman and Shang-Hua Teng. Spectral partitioning works: Planar graphs and finite element meshes. In Proceedings of 37th Conference on Foundations of Computer Science, pages 96–105. IEEE, 1996.
[20] D. Michael Titterington, Adrian F. M. Smith, and Udi E. Makov. Statistical Analysis of Finite Mixture Distributions. Wiley, 1985.
[21] S. Vempala and G. Wang. A spectral algorithm for learning mixture models. Journal of Computer and System Sciences, 68(4):841–860, 2004.
[22] Ulrike von Luxburg. A tutorial on spectral clustering. Statistics and Computing, 17(4):395–416, 2007.
[23] Kaizheng Wang, Yuling Yan, and Mateo Diaz. Efficient clustering for stretched mixtures: Landscape and optimality. arXiv preprint arXiv:2003.09960, 2020.
[24] Xindong Wu, Vipin Kumar, J. Ross Quinlan, Joydeep Ghosh, Qiang Yang, Hiroshi Motoda, Geoffrey J. McLachlan, Angus Ng, Bing Liu, Philip S. Yu, et al. Top 10 algorithms in data mining. Knowledge and Information Systems, 14(1):1–37, 2008.