How close are the eigenvectors and eigenvalues of the sample and actual covariance matrices?
Andreas Loukas
École Polytechnique Fédérale de Lausanne, Switzerland
February 20, 2017
Abstract
How many samples are sufficient to guarantee that the eigenvectors and eigenvalues of the sample covariance matrix are close to those of the actual covariance matrix? For a wide family of distributions, including distributions with finite second moment and distributions supported in a centered Euclidean ball, we prove that the inner product between eigenvectors of the sample and actual covariance matrices decreases proportionally to the respective eigenvalue distance. Our findings imply non-asymptotic concentration bounds for eigenvectors, eigenspaces, and eigenvalues. They also provide conditions for distinguishing principal components based on a constant number of samples.
The covariance matrix $C$ of an $n$-dimensional distribution is an integral part of data analysis, with numerous occurrences in machine learning and signal processing. It is therefore crucial to understand how close it is to the sample covariance, i.e., the matrix $\tilde C$ estimated from a finite number of samples $m$. Following developments in the tools for the concentration of measure, Vershynin showed that a sample size of $m = O(n)$ is, up to iterated logarithmic factors, sufficient for all distributions with finite fourth moment supported in a centered Euclidean ball of radius $O(\sqrt{n})$ [26]. Similar results also hold for sub-exponential distributions [1] and distributions with finite second moment [20].

We take an alternative standpoint and ask if we can do better when only a subset of the spectrum is of interest. Concretely, our objective is to characterize how many samples are sufficient to guarantee that an eigenvector and/or eigenvalue of the sample and actual covariance matrices are, respectively, sufficiently close. Our approach is motivated by the observation that methods that utilize the covariance commonly prioritize the estimation of principal eigenspaces. For instance, in (local) principal component analysis we are usually interested in the first few eigenvectors [9, 18], whereas when reducing the dimension of a distribution one commonly projects it onto the span of the first few eigenvectors [16, 12].

Our finding is that the "spectral leaking" occurring in the eigenvector estimation is strongly localized w.r.t. the eigenvalue axis. In other words, the eigenvector $\tilde u_i$ of the sample covariance is less likely to lie in the span of an eigenvector $u_j$ of the actual covariance when the eigenvalue distance $|\lambda_i - \lambda_j|$ is large and the concentration of the distribution in the direction of $u_j$ is small. This phenomenon agrees with the intuition that principal components of high variance are easier to estimate, exactly because they are more likely to appear in the samples of the distribution. In addition, it suggests that it might be possible to obtain good estimates of well separated principal eigenspaces from fewer than $n$ samples.

We provide a mathematical argument confirming this phenomenon. Under fairly general conditions, we prove that
$$m = O\!\left(\frac{k_j^2}{(\lambda_i - \lambda_j)^2}\right) \quad\text{and}\quad m = O\!\left(\frac{k_i^2}{\lambda_i^2}\right) \qquad (1)$$
samples are asymptotically almost surely (a.a.s.) sufficient to guarantee that $|\langle\tilde u_i, u_j\rangle|$ and $|\delta\lambda_i|/\lambda_i$, respectively, are small for all distributions with finite second moment. Here, $k_j$ is a measure of the kurtosis of the distribution in the direction of $u_j$.

Figure 1: Inner products $\langle\tilde u_i, u_j\rangle$ are localized w.r.t. the eigenvalue axis. The phenomenon is shown for MNIST, for (a) $m = 10$, (b) $m = 100$, (c) $m = 500$, and (d) $m = 1000$. Much fewer than $n = 784$ samples are needed to estimate $u_1$ and $u_2$.
We also attain a high probability bound for distributions supported in a centered and scaled Euclidean ball, and show how our results can be used to characterize the sensitivity of principal component analysis to a limited sample set.

To the best of our knowledge, these are the first non-asymptotic results concerning the eigenvectors of the sample covariance of distributions with finite second moment. Previous studies have intensively investigated the limiting distribution of the eigenvalues of a sample covariance matrix [24, 4], such as the smallest and largest eigenvalues [6] and the eigenvalue support [7]. Eigenvectors and eigenprojections have attracted less attention; the main research thrust entails using tools from the theory of large-dimensional matrices to characterize limiting distributions [3, 13, 22, 5], and it has limited applicability in the non-asymptotic setting where the sample size $m$ is small and $n$ cannot be arbitrarily large. Differently, our arguments follow from a combination of techniques from perturbation analysis and non-asymptotic concentration of measure. Moreover, in contrast to standard perturbation bounds [11, 27] commonly used to reason about eigenspaces [14, 15], they can be used to characterize weighted linear combinations of $\langle\tilde u_i, u_j\rangle$ over $i$ and $j$, and they do not depend on the minimal eigenvalue gap separating two eigenspaces but rather on all eigenvalue differences. The latter renders them particularly amenable to situations where the eigenvalue gap is not significant but the eigenvalue magnitudes decrease sufficiently fast.

Our work also connects to subspace methods, where the signal and noise spaces are separated by an appropriate eigenspace projection. In their recent work, Shaghaghi and Vorobyov characterized the first two moments of the projection error, a result which implies sample estimates [23]. Their results are particularly tight, but are restricted to specific projectors and Normal distributions. Finally, we remark that there exist alternative estimators for the spectrum of the covariance with better asymptotic properties [2, 19]. Instead, we here focus on the standard estimates, i.e., the eigenvalues and eigenvectors of the sample covariance.

Let $x \in \mathbb{C}^n$ be a sample of a multivariate distribution and denote by $x_1, x_2, \ldots, x_m$ the $m$ independent samples used to form the sample covariance, defined as
$$\tilde C = \frac{1}{m}\sum_{p=1}^{m} (x_p - \bar x)(x_p - \bar x)^{*}, \qquad (2)$$
where $\bar x$ is the sample mean. Denote by $u_i$ the eigenvector of $C$ associated with eigenvalue $\lambda_i$, and correspondingly for the eigenvectors $\tilde u_i$ and eigenvalues $\tilde\lambda_i$ of $\tilde C$, such that $\lambda_1 \ge \lambda_2 \ge \ldots \ge \lambda_n$. We ask:

Problem 1.
How many samples are sufficient to guarantee that the inner product $|\langle\tilde u_i, u_j\rangle| = |\tilde u_i^{*} u_j|$ and the eigenvalue error $|\delta\lambda_i| = |\tilde\lambda_i - \lambda_i|$ are smaller than some constant $t$ with probability larger than $1 - \epsilon$?

Clearly, when asking that all eigenvectors and eigenvalues of the sample and actual covariance matrices are close, we will require at least as many samples as needed to ensure that $\|\tilde C - C\| \le t$ [26]. However, we might do better when only a subset of the spectrum is of interest. The reason is that the inner products $|\langle\tilde u_i, u_j\rangle|$ possess a strongly localized structure along the eigenvalue axis. To illustrate this phenomenon, let us consider the distribution constructed by the $n = 784$ pixel values of digit '1' in the MNIST database. Figure 1 compares the eigenvectors $u_j$ of the covariance computed from all 6742 images to the eigenvectors $\tilde u_i$ of the sample covariance matrices $\tilde C$ computed from a random subset of $m = 10$, 100, 500, and 1000 samples. For each $i = 1, 2, 3, 4$, we plot, as a function of $\lambda_j$, the average of $|\langle\tilde u_i, u_j\rangle|$ over 100 sampling draws. We observe that: (i) the magnitude of $\langle\tilde u_i, u_j\rangle$ is inversely proportional to the eigenvalue gap $|\lambda_i - \lambda_j|$; (ii) eigenvector $\tilde u_i$ mostly lies in the span of eigenvectors $u_j$ over which the distribution is concentrated. A minimal sketch of this experiment is given below.
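The sketch below is illustrative only and is not the authors' code: for self-containedness it uses a synthetic pool with a fast-decaying spectrum as a stand-in for the MNIST digit-'1' images (the real data could, for instance, be obtained with scikit-learn's fetch_openml('mnist_784')), and all names and parameters are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
n, N = 100, 20000                       # dimension and size of the full data pool
pool = rng.standard_normal((N, n)) * np.sqrt(0.85 ** np.arange(n))  # decaying spectrum

def eig_desc(M):
    """Eigendecomposition with eigenvalues sorted in decreasing order."""
    w, V = np.linalg.eigh(M)
    order = np.argsort(w)[::-1]
    return w[order], V[:, order]

lam, U = eig_desc(np.cov(pool, rowvar=False))       # "actual" covariance from all samples

for m in (10, 100, 500, 1000):
    sample = pool[rng.choice(N, size=m, replace=False)]
    lam_t, U_t = eig_desc(np.cov(sample, rowvar=False))
    overlaps = np.abs(U_t.T @ U)                    # overlaps[i, j] = |<u~_i, u_j>|
    # fraction of u~_1 and u~_2 captured by the five leading actual eigenvectors
    print(m, np.sum(overlaps[:2, :5] ** 2, axis=1).round(3))
```

As $m$ grows, the rows of `overlaps` concentrate around their diagonal entries, reproducing the localization visible in Figure 1.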
We formalize these statements in two steps. First, we work in the setting of Hermitian matrices and notice the following inequality:

Theorem 3.2.
For Hermitian matrices $C$ and $\tilde C = C + \delta C$, with eigenvectors $u_j$ and $\tilde u_i$ respectively, the inequality
$$|\langle\tilde u_i, u_j\rangle| \le \frac{2\,\|\delta C u_j\|}{|\lambda_i - \lambda_j|}$$
holds for $\mathrm{sgn}(\lambda_i - \lambda_j)\, 2\tilde\lambda_i > \mathrm{sgn}(\lambda_i - \lambda_j)\,(\lambda_i + \lambda_j)$ and $\lambda_i \ne \lambda_j$.

The above stands out from standard eigenspace perturbation results, such as the sin(Θ) theorem [11] and its variants [14, 15, 27], for three main reasons. First, Theorem 3.2 characterizes the angle between any pair of eigenvectors, allowing us to jointly bound any linear combination of inner products. Though this often proves handy (cf. Section 5), it was not possible using sin(Θ)-type arguments. Second, classical bounds are not appropriate for a probabilistic analysis as they feature ratios of dependent random variables (corresponding to perturbation terms). In the analysis of spectral clustering, this complication was dealt with by assuming that $|\lambda_i - \lambda_j| \le |\tilde\lambda_i - \lambda_j|$ [15]. We weaken this condition at the cost of a multiplicative factor of 2. In contrast to previous work, we also prove that the condition is met a.a.s. Third, previous bounds are expressed in terms of the minimal eigenvalue gap between eigenvectors lying in the interior and exterior of the subspace of interest. This is a limiting factor in practice as it renders the results only amenable to situations where there is a very large eigenvalue gap separating the subspaces. The proposed result improves upon this by considering every eigenvalue difference.

The second part of our analysis focuses on the covariance and has a statistical flavor. In particular, we give an answer to Problem 1 for various families of distributions. In the context of distributions with finite second moment, we prove in Section 4.1 that:
Theorem 4.1.
For any two eigenvectors $\tilde u_i$ and $u_j$ of the sample and actual covariance respectively, and for any real number $t > 0$:
$$P\big(|\langle\tilde u_i, u_j\rangle| \ge t\big) \le \frac{1}{m}\left(\frac{2\,k_j}{t\,|\lambda_i - \lambda_j|}\right)^2, \qquad (3)$$
subject to the same conditions as Theorem 3.2.

For eigenvalues, we provide the following guarantee:
Theorem 4.2.
For any eigenvalues $\lambda_i$ and $\tilde\lambda_i$ of $C$ and $\tilde C$, respectively, and for any $t > 0$, we have
$$P\!\left(\frac{|\tilde\lambda_i - \lambda_i|}{\lambda_i} \ge t\right) \le \frac{1}{m}\left(\frac{k_i}{\lambda_i\, t}\right)^2.$$

Term $k_j = \big(E[\|x x^{*} u_j\|^2] - \lambda_j^2\big)^{1/2}$ captures the tendency of the distribution to fall in the span of $u_j$: the smaller the tail in the direction of $u_j$, the less likely we are to confuse $\tilde u_i$ with $u_j$.

For normal distributions, we have that $k_j^2 = \lambda_j^2 + \lambda_j\,\mathrm{tr}(C)$, and the number of samples needed for $|\langle\tilde u_i, u_j\rangle|$ to be small is $m = O(\mathrm{tr}(C)/\lambda_i^2)$ when $\lambda_j = O(1)$ and $m = O(\lambda_i^{-2})$ when $\lambda_j = O(\mathrm{tr}(C)^{-1})$. Thus, for normal distributions, principal components $u_i$ and $u_j$ with $\min\{\lambda_i/\lambda_j,\, \lambda_i\} = \Omega(\mathrm{tr}(C)^{1/2})$ can be distinguished given a constant number of samples. On the other hand, estimating $\lambda_i$ with small relative error requires $m = O(\mathrm{tr}(C)/\lambda_i)$ samples and can thus be achieved from very few samples when $\lambda_i$ is large (though the same cannot be said for the absolute error $|\delta\lambda_i|$, which is smaller for small $\lambda_i$).

In Section 4.2, we also give a sharp bound for the family of distributions supported within a ball (i.e., $\|x\| \le r$ a.s.).

Theorem 4.3.
For sub-gaussian distributions supported within a centered Euclidean ball of radius $r$, there exists an absolute constant $c$, independent of the sample size, such that for any real number $t > 0$,
$$P\big(|\langle\tilde u_i, u_j\rangle| \ge t\big) \le \exp\!\left(-\,\frac{c\, m\, \Phi_{ij}(t)^2}{\lambda_j\,\|x\|_{\Psi_2}^2}\right), \qquad (4)$$
where $\Phi_{ij}(t) = \big(|\lambda_i - \lambda_j|\, t/2 - \lambda_j\big)\big(r^2/\lambda_j - 2\big)^{-1/2} - \|x\|_{\Psi_2}$, subject to the same conditions as Theorem 3.2.

Above, $\|x\|_{\Psi_2}$ is the sub-gaussian norm, for which we usually have $\|x\|_{\Psi_2} = O(1)$ [25]. As such, the theorem implies that, whenever $\lambda_i \gg \lambda_j = O(1)$, the sample requirement is with high probability $m = O(r^2/\lambda_i^2)$.

These theorems solidify our experimental findings shown in Figure 1. Moreover, combined with the analysis of principal component estimation given in Section 5, they provide a concrete characterization of the relation between the spectrum (eigenvectors and eigenvalues) of the sample and actual covariance matrix as a function of the number of samples, the eigenvalue gap, and the distribution properties.

Before focusing on the sample covariance matrix, it helps to study $\langle\tilde u_i, u_j\rangle$ in the setting of Hermitian matrices. The presentation of the results is split in three parts. Section 3.1 starts by studying some basic properties of inner products of the form $\langle\tilde u_i, u_j\rangle$, for any $i$ and $j$. The results are used in Section 3.2 to provide a first bound on the angle between two eigenvectors, and refined in Section 3.3.

We start by noticing an exact relation between the angle of a perturbed eigenvector and the actual eigenvectors of $C$.

Lemma 3.1.
For every $i$ and $j$ in $1, 2, \ldots, n$, the relation
$$(\tilde\lambda_i - \lambda_j)\,(\tilde u_i^{*} u_j) = \sum_{\ell=1}^{n} (\tilde u_i^{*} u_\ell)\,(u_j^{*}\, \delta C\, u_\ell)$$
holds.

Proof. The proof follows from a modification of a standard argument in perturbation theory. We start from the definition $\tilde C \tilde u_i = \tilde\lambda_i \tilde u_i$ and write
$$(C + \delta C)(u_i + \delta u_i) = (\lambda_i + \delta\lambda_i)(u_i + \delta u_i). \qquad (5)$$
Expanded, the above expression becomes
$$C\,\delta u_i + \delta C\, u_i + \delta C\, \delta u_i = \lambda_i\, \delta u_i + \delta\lambda_i\, u_i + \delta\lambda_i\, \delta u_i, \qquad (6)$$
where we used the fact that $C u_i = \lambda_i u_i$ to eliminate two terms. To proceed, we substitute $\delta u_i = \sum_{\ell=1}^{n} \beta_{i\ell}\, u_\ell$, where $\beta_{i\ell} = \delta u_i^{*} u_\ell$, into (6) and multiply from the left by $u_j^{*}$, resulting in
$$\sum_{\ell=1}^{n} \beta_{i\ell}\, u_j^{*} C u_\ell + u_j^{*} \delta C u_i + \sum_{\ell=1}^{n} \beta_{i\ell}\, u_j^{*} \delta C u_\ell = \lambda_i \sum_{\ell=1}^{n} \beta_{i\ell}\, u_j^{*} u_\ell + \delta\lambda_i\, u_j^{*} u_i + \delta\lambda_i \sum_{\ell=1}^{n} \beta_{i\ell}\, u_j^{*} u_\ell. \qquad (7)$$
Cancelling the unnecessary terms and rearranging, we have
$$\delta\lambda_i\, u_j^{*} u_i + (\lambda_i + \delta\lambda_i - \lambda_j)\, \beta_{ij} = u_j^{*} \delta C u_i + \sum_{\ell=1}^{n} \beta_{i\ell}\, u_j^{*} \delta C u_\ell. \qquad (8)$$
At this point, we note that $(\lambda_i + \delta\lambda_i - \lambda_j) = \tilde\lambda_i - \lambda_j$ and furthermore that $\beta_{ij} = \tilde u_i^{*} u_j - u_i^{*} u_j$. With this in place, equation (8) becomes
$$\delta\lambda_i\, u_j^{*} u_i + (\tilde\lambda_i - \lambda_j)\,(\tilde u_i^{*} u_j - u_i^{*} u_j) = u_j^{*} \delta C u_i + \sum_{\ell=1}^{n} (\tilde u_i^{*} u_\ell)\, u_j^{*} \delta C u_\ell - u_j^{*} \delta C u_i. \qquad (9)$$
The proof completes by noticing that, on the left-hand side, all terms other than $(\tilde\lambda_i - \lambda_j)\, \tilde u_i^{*} u_j$ fall off, either due to $u_i^{*} u_j = 0$ when $i \ne j$, or because $\delta\lambda_i = \tilde\lambda_i - \lambda_j$ otherwise.

As the expression reveals, $\langle\tilde u_i, u_j\rangle$ depends on the orientation of $\tilde u_i$ with respect to all other $u_\ell$. Moreover, the angles between eigenvectors depend not only on the minimal gap between the subspace of interest and its complement space (as in the sin(Θ) theorem), but on every difference $\tilde\lambda_i - \lambda_j$. This is a crucial ingredient of a tight bound that will be retained throughout our analysis.
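As a quick sanity check (not part of the original paper), the identity of Lemma 3.1 can be verified numerically; the sketch below assumes real symmetric matrices for simplicity, and all names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 8
A = rng.standard_normal((n, n)); C  = (A + A.T) / 2          # Hermitian C
B = rng.standard_normal((n, n)); dC = 0.1 * (B + B.T) / 2    # Hermitian perturbation

lam,   U  = np.linalg.eigh(C)
lam_t, Ut = np.linalg.eigh(C + dC)

i, j = 2, 5
lhs = (lam_t[i] - lam[j]) * (Ut[:, i] @ U[:, j])
rhs = sum((Ut[:, i] @ U[:, l]) * (U[:, j] @ dC @ U[:, l]) for l in range(n))
print(abs(lhs - rhs))        # agrees up to floating-point round-off (~1e-15)
```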
We proceed to decouple the inner products.

Theorem 3.1.
For any Hermitian matrices $C$ and $\tilde C = C + \delta C$, with eigenvectors $u_j$ and $\tilde u_i$ respectively, we have that
$$|\tilde\lambda_i - \lambda_j|\; |\langle\tilde u_i, u_j\rangle| \le \|\delta C\, u_j\|.$$

Proof. We rewrite Lemma 3.1 as
$$(\tilde\lambda_i - \lambda_j)\,(\tilde u_i^{*} u_j) = \sum_{\ell=1}^{n} (\tilde u_i^{*} u_\ell)\,(u_j^{*} \delta C u_\ell). \qquad (10)$$
We now use the Cauchy-Schwarz inequality:
$$\big((\tilde\lambda_i - \lambda_j)\,(\tilde u_i^{*} u_j)\big)^2 \le \sum_{\ell=1}^{n} (\tilde u_i^{*} u_\ell)^2 \sum_{\ell=1}^{n} (u_j^{*} \delta C u_\ell)^2 = \sum_{\ell=1}^{n} (u_j^{*} \delta C u_\ell)^2 = \|\delta C\, u_j\|^2, \qquad (11)$$
where we used that $\sum_{\ell=1}^{n} (\tilde u_i^{*} u_\ell)^2 = \|\tilde u_i\|^2 = 1$ and, in the last step, exploited Lemma 3.2. The proof concludes by taking a square root at both sides of the inequality.

Lemma 3.2. $\sum_{\ell=1}^{n} (u_j^{*} \delta C u_\ell)^2 = \|\delta C\, u_j\|^2$.

Proof. We first notice that $u_j^{*} \delta C u_\ell$ is a scalar and equal to its transpose. Moreover, $\delta C$ is Hermitian as the difference of two Hermitian matrices. We therefore have that
$$\sum_{\ell=1}^{n} (u_j^{*} \delta C u_\ell)^2 = \sum_{\ell=1}^{n} u_j^{*} \delta C u_\ell\, u_\ell^{*} \delta C u_j = u_j^{*} \delta C \Big(\sum_{\ell=1}^{n} u_\ell u_\ell^{*}\Big) \delta C u_j = u_j^{*} \delta C\, \delta C\, u_j = \|\delta C u_j\|^2,$$
matching our claim.

As a last step, we move all perturbation terms to the numerator, at the expense of a multiplicative constant factor.
Theorem 3.2.
For Hermitian matrices $C$ and $\tilde C = C + \delta C$, with eigenvectors $u_j$ and $\tilde u_i$ respectively, the inequality
$$|\langle\tilde u_i, u_j\rangle| \le \frac{2\,\|\delta C u_j\|}{|\lambda_i - \lambda_j|}$$
holds for $\mathrm{sgn}(\lambda_i - \lambda_j)\, 2\tilde\lambda_i > \mathrm{sgn}(\lambda_i - \lambda_j)\,(\lambda_i + \lambda_j)$ and $\lambda_i \ne \lambda_j$.

Proof. Adding and subtracting $\lambda_i$ from the left side of the expression in Lemma 3.1 gives
$$(\delta\lambda_i + \lambda_i - \lambda_j)\,(\tilde u_i^{*} u_j) = \sum_{\ell=1}^{n} (\tilde u_i^{*} u_\ell)\,(u_j^{*} \delta C u_\ell). \qquad (12)$$
For $\lambda_i \ne \lambda_j$, the above expression can be rewritten as
$$|\tilde u_i^{*} u_j| = \frac{\big|\sum_{\ell=1}^{n} (\tilde u_i^{*} u_\ell)(u_j^{*} \delta C u_\ell) - \delta\lambda_i\,(\tilde u_i^{*} u_j)\big|}{|\lambda_i - \lambda_j|} \le \frac{\big|\sum_{\ell=1}^{n} (\tilde u_i^{*} u_\ell)(u_j^{*} \delta C u_\ell)\big| + |\delta\lambda_i|\,|\tilde u_i^{*} u_j|}{|\lambda_i - \lambda_j|}. \qquad (13)$$
Let us examine the right-hand side inequality carefully. Obviously, when the condition $|\lambda_i - \lambda_j| \le 2\,|\delta\lambda_i|$ is not met, the right clause of (13) is irrelevant. Therefore, for $|\delta\lambda_i| < |\lambda_i - \lambda_j|/2$,
$$|\tilde u_i^{*} u_j| \le \frac{2\,\big|\sum_{\ell=1}^{n} (\tilde u_i^{*} u_\ell)(u_j^{*} \delta C u_\ell)\big|}{|\lambda_i - \lambda_j|}. \qquad (14)$$
Similar to the proof of Theorem 3.1, applying the Cauchy-Schwarz inequality we have that
$$|\tilde u_i^{*} u_j| \le \frac{2\,\sqrt{\sum_{\ell=1}^{n} (\tilde u_i^{*} u_\ell)^2}\,\sqrt{\sum_{\ell=1}^{n} (u_j^{*} \delta C u_\ell)^2}}{|\lambda_i - \lambda_j|} = \frac{2\,\|\delta C u_j\|}{|\lambda_i - \lambda_j|}, \qquad (15)$$
where in the last step we used Lemma 3.2. To finish the proof we notice that, due to Theorem 3.1, whenever $|\lambda_i - \lambda_j| \le |\tilde\lambda_i - \lambda_j|$, one has
$$|\tilde u_i^{*} u_j| \le \frac{\|\delta C u_j\|}{|\tilde\lambda_i - \lambda_j|} \le \frac{\|\delta C u_j\|}{|\lambda_i - \lambda_j|} < \frac{2\,\|\delta C u_j\|}{|\lambda_i - \lambda_j|}. \qquad (16)$$
Our bound therefore holds for the union of the intervals $|\delta\lambda_i| < |\lambda_i - \lambda_j|/2$ and $|\lambda_i - \lambda_j| \le |\tilde\lambda_i - \lambda_j|$, i.e., for $\tilde\lambda_i > (\lambda_i + \lambda_j)/2$ when $\lambda_i > \lambda_j$ and for $\tilde\lambda_i < (\lambda_i + \lambda_j)/2$ when $\lambda_i < \lambda_j$.
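The deterministic statement can likewise be probed numerically. The sketch below is an illustration of ours (not the paper's code): it checks that, whenever the sign condition of Theorem 3.2 holds, the right-hand side indeed dominates the inner product for random Hermitian test matrices.

```python
import numpy as np

rng = np.random.default_rng(5)
n, violations = 10, 0
for _ in range(1000):
    A = rng.standard_normal((n, n)); C  = (A + A.T) / 2
    B = rng.standard_normal((n, n)); dC = 0.2 * (B + B.T) / 2
    lam,   U  = np.linalg.eigh(C)
    lam_t, Ut = np.linalg.eigh(C + dC)
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            s = np.sign(lam[i] - lam[j])
            if s * 2 * lam_t[i] > s * (lam[i] + lam[j]):   # condition of Theorem 3.2
                lhs = abs(Ut[:, i] @ U[:, j])
                rhs = 2 * np.linalg.norm(dC @ U[:, j]) / abs(lam[i] - lam[j])
                violations += lhs > rhs + 1e-12
print(violations)            # expected output: 0
```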
This section builds on the perturbation results of Section 3 to characterize how far any inner product $\langle\tilde u_i, u_j\rangle$ and eigenvalue $\tilde\lambda_i$ are from the ideal estimates.

Before proceeding, we remark on some simplifications employed in the following. W.l.o.g., we will assume that the mean $E[x]$ is zero and the covariance full rank. Though the case of rank-deficient $C$ is easily handled by substituting the inverse with the Moore-Penrose pseudoinverse, we opt to make the exposition in the simpler setting. In addition, we will assume the perspective of Theorem 3.2, for which the inequality $\mathrm{sgn}(\lambda_i - \lambda_j)\, 2\tilde\lambda_i > \mathrm{sgn}(\lambda_i - \lambda_j)\,(\lambda_i + \lambda_j)$ holds. This event occurs a.a.s. when the gap and the sample size are sufficiently large (see Section 4.1.2), but it is convenient to assume that it happens almost surely. Removing this assumption is possible, but is not pursued here as it leads to less elegant and sharp estimates.

Our first flavor of results is based on a variant of the Tchebichef inequality and holds for any distribution with finite second moment.

4.1.1 Concentration of eigenvector angles

We start with the concentration of inner products $|\langle\tilde u_i, u_j\rangle|$.

Theorem 4.1.
For any two eigenvectors $\tilde u_i$ and $u_j$ of the sample and actual covariance respectively, with $\lambda_i \ne \lambda_j$, and for any real number $t > 0$, we have
$$P\big(|\langle\tilde u_i, u_j\rangle| \ge t\big) \le \frac{1}{m}\left(\frac{2\,k_j}{t\,|\lambda_i - \lambda_j|}\right)^2$$
for $\mathrm{sgn}(\lambda_i - \lambda_j)\, 2\tilde\lambda_i > \mathrm{sgn}(\lambda_i - \lambda_j)\,(\lambda_i + \lambda_j)$ and $k_j = \big(E[\|x x^{*} u_j\|^2] - \lambda_j^2\big)^{1/2}$.

Proof. According to a variant of Tchebichef's inequality [21], for any random variable $X$, any real number $t > 0$, and any $\alpha$:
$$P(|X - \alpha| \ge t) \le \frac{\mathrm{Var}[X] + (E[X] - \alpha)^2}{t^2}. \qquad (17)$$
Setting $X = \langle\tilde u_i, u_j\rangle$ and $\alpha = 0$, we have
$$P\big(|\langle\tilde u_i, u_j\rangle| \ge t\big) \le \frac{\mathrm{Var}[\langle\tilde u_i, u_j\rangle] + E[\langle\tilde u_i, u_j\rangle]^2}{t^2} = \frac{E\big[\langle\tilde u_i, u_j\rangle^2\big]}{t^2} \le \frac{4\, E\big[\|\delta C u_j\|^2\big]}{t^2\,(\lambda_i - \lambda_j)^2}, \qquad (18)$$
where the last inequality follows from Theorem 3.2. We continue by expanding $\delta C$ using the definition of the eigenvalue decomposition and substituting the expectation:
$$E\big[\|\delta C u_j\|^2\big] = E\big[\|\tilde C u_j - \lambda_j u_j\|^2\big] = E\big[u_j^{*}(\tilde C - \lambda_j)(\tilde C - \lambda_j) u_j\big] = E\big[u_j^{*} \tilde C^2 u_j\big] + \lambda_j^2 - 2\lambda_j\, u_j^{*} E[\tilde C]\, u_j = E\big[u_j^{*} \tilde C^2 u_j\big] - \lambda_j^2. \qquad (19)$$
In addition,
$$E\big[u_j^{*} \tilde C^2 u_j\big] = \sum_{p,q=1}^{m} \frac{u_j^{*}\, E[(x_p x_p^{*})(x_q x_q^{*})]\, u_j}{m^2} = \sum_{p \ne q} \frac{u_j^{*}\, E[x_p x_p^{*}]\, E[x_q x_q^{*}]\, u_j}{m^2} + \sum_{p=1}^{m} \frac{u_j^{*}\, E[x_p x_p^{*} x_p x_p^{*}]\, u_j}{m^2} = \frac{m(m-1)}{m^2}\,\lambda_j^2 + \frac{1}{m}\, u_j^{*} E[x x^{*} x x^{*}] u_j = \Big(1 - \frac{1}{m}\Big)\lambda_j^2 + \frac{1}{m}\, u_j^{*} E[x x^{*} x x^{*}] u_j \qquad (20)$$
and therefore
$$E\big[\|\delta C u_j\|^2\big] = \Big(1 - \frac{1}{m}\Big)\lambda_j^2 + \frac{1}{m}\, u_j^{*} E[x x^{*} x x^{*}] u_j - \lambda_j^2 = \frac{u_j^{*} E[x x^{*} x x^{*}] u_j - \lambda_j^2}{m} = \frac{E\big[\|x x^{*} u_j\|^2\big] - \lambda_j^2}{m}.$$
Putting everything together, the claim follows.
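The following Monte-Carlo sketch illustrates the theorem under assumed parameters (it is not the paper's experiment): for a diagonal Gaussian covariance with decaying spectrum, the empirical tail probability is compared against the bound, with $k_j$ estimated from a large sample; the conditioning event of the theorem holds with overwhelming probability for these parameters and is ignored.

```python
import numpy as np

rng = np.random.default_rng(2)
n, m, trials, t = 50, 500, 1000, 0.3
lam = 0.8 ** np.arange(n)          # C = diag(lam), so u_j = e_j
i, j = 0, 10

# k_j^2 = E[||x x* u_j||^2] - lam_j^2, estimated from a large zero-mean sample
big = rng.standard_normal((200_000, n)) * np.sqrt(lam)
k2 = np.mean(np.sum(big**2, axis=1) * big[:, j]**2) - lam[j]**2

hits = 0
for _ in range(trials):
    X = rng.standard_normal((m, n)) * np.sqrt(lam)          # m zero-mean samples
    w, V = np.linalg.eigh(X.T @ X / m)                      # sample covariance
    u_i = V[:, np.argsort(w)[::-1][i]]                      # i-th sample eigenvector
    hits += abs(u_i[j]) >= t                                # <u~_i, u_j> = u~_i[j]
empirical = hits / trials
bound = (2 * np.sqrt(k2) / (t * abs(lam[i] - lam[j]))) ** 2 / m
print(empirical, bound)            # the empirical tail stays below the bound
```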
The following corollary will be very useful when applying our results.

Corollary 4.1. For any weights $w_{ij} \ge 0$ and real $t > 0$:
$$P\Big(\sum_{i \ne j} w_{ij}\, \langle\tilde u_i, u_j\rangle^2 > t\Big) \le \sum_{i \ne j} \frac{4\, w_{ij}\, k_j^2}{m\, t\, (\lambda_i - \lambda_j)^2},$$
where $k_j = \big(E[\|x x^{*} u_j\|^2] - \lambda_j^2\big)^{1/2}$ and $w_{ij} \ne 0$ only when $\lambda_i \ne \lambda_j$ and $\mathrm{sgn}(\lambda_i - \lambda_j)\, 2\tilde\lambda_i > \mathrm{sgn}(\lambda_i - \lambda_j)\,(\lambda_i + \lambda_j)$.

Proof. We proceed as in the proof of Theorem 4.1:
$$P\Big(\sum_{i \ne j} w_{ij}\, \langle\tilde u_i, u_j\rangle^2 > t\Big) \le \frac{E\big[\sum_{i \ne j} w_{ij}\, \langle\tilde u_i, u_j\rangle^2\big]}{t} \le \frac{1}{t} \sum_{i \ne j} \frac{4\, w_{ij}\, E\big[\|\delta C u_j\|^2\big]}{(\lambda_i - \lambda_j)^2}. \qquad (21)$$
The claim follows by computing $E[\|\delta C u_j\|^2]$ as before.

A slight modification of the same argument can be used to characterize the eigenvalue relative difference.
Theorem 4.2.
For any eigenvalues $\lambda_i$ and $\tilde\lambda_i$ of $C$ and $\tilde C$, respectively, and for any $t > 0$, we have
$$P\!\left(\frac{|\tilde\lambda_i - \lambda_i|}{\lambda_i} \ge t\right) \le \frac{1}{m}\left(\frac{k_i}{\lambda_i\, t}\right)^2,$$
where $k_i = \big(E[\|x x^{*} u_i\|^2] - \lambda_i^2\big)^{1/2}$.

Proof. Directly from the Bauer-Fike theorem [8] one sees that
$$|\delta\lambda_i| \le \|\tilde C u_i - \lambda_i u_i\| = \|\delta C u_i\|. \qquad (22)$$
The proof is then identical to the one of Theorem 4.1.

As such, the probability that the main condition of our theorems holds is at least
$$P\Big(\mathrm{sgn}(\lambda_i - \lambda_j)\, 2\tilde\lambda_i > \mathrm{sgn}(\lambda_i - \lambda_j)\,(\lambda_i + \lambda_j)\Big) \ge P\Big(|\tilde\lambda_i - \lambda_i| < \tfrac{|\lambda_i - \lambda_j|}{2}\Big) > 1 - \frac{4\, k_i^2}{m\, |\lambda_i - \lambda_j|^2}. \qquad (23)$$

As seen by the straightforward inequality $E[\|x x^{*} u_j\|^2] \le E[\|x\|^4]$, $k_j$ connects to the kurtosis of the distribution. However, it also captures the tendency of the distribution to fall in the span of $u_j$. To see this, we will work with the uncorrelated random vectors $\varepsilon = C^{-1/2} x$, which have zero mean and identity covariance:
$$k_j^2 = E\big[u_j^{*} C^{1/2} \varepsilon\varepsilon^{*} C \varepsilon\varepsilon^{*} C^{1/2} u_j\big] - \lambda_j^2 = \lambda_j\, E\big[u_j^{*} \varepsilon\varepsilon^{*} C \varepsilon\varepsilon^{*} u_j\big] - \lambda_j^2 = \lambda_j\Big(E\big[\|\Lambda^{1/2} U^{*} \varepsilon\varepsilon^{*} u_j\|^2\big] - \lambda_j\Big). \qquad (24)$$
If we further set $\hat\varepsilon = U^{*}\varepsilon$, we have
$$k_j^2 = \lambda_j\Big(\sum_{\ell=1}^{n} \lambda_\ell\, E\big[\hat\varepsilon(\ell)^2\, \hat\varepsilon(j)^2\big] - \lambda_j\Big). \qquad (25)$$
It is therefore easier to untangle the spaces spanned by $\tilde u_i$ and $u_j$ when the variance of the distribution along the latter space is small (the expression is trivially minimized when $\lambda_j \to 0$) or when the variance is entirely contained along that space (the expression is also small when $\lambda_\ell = 0$ for all $\ell \ne j$).
In addition, it can be seen that distributions with fast decaying tails allow for better principal component identification ($E[\hat\varepsilon(j)^4]$ is a measure of kurtosis over the direction of $u_j$).

For the particular case of a Normal distribution, we provide a closed-form expression.

Corollary 4.2. For a Normal distribution, we have $k_j^2 = \lambda_j\,(\lambda_j + \mathrm{tr}(C))$.

Proof. For a centered and normal distribution with identity covariance, the choice of basis is arbitrary and the vector $\hat\varepsilon = U^{*}\varepsilon$ is also zero mean with identity covariance. Moreover, for every $\ell \ne j$ we can write $E[\hat\varepsilon(\ell)^2\, \hat\varepsilon(j)^2] = E[\hat\varepsilon(\ell)^2]\, E[\hat\varepsilon(j)^2] = 1$. This implies that
$$E\big[\|x x^{*} u_j\|^2\big] = \lambda_j^2\, E\big[\hat\varepsilon(j)^4\big] + \lambda_j \sum_{\ell \ne j} \lambda_\ell = \lambda_j^2\,(3 - 1) + \lambda_j\, \mathrm{tr}(C) = 2\lambda_j^2 + \lambda_j\, \mathrm{tr}(C) \qquad (26)$$
and, accordingly, $k_j^2 = \lambda_j\,(\lambda_j + \mathrm{tr}(C))$.
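A quick Monte-Carlo check of this closed form (illustrative only; parameters are assumptions of the example):

```python
import numpy as np

rng = np.random.default_rng(3)
lam = np.array([4.0, 2.0, 1.0, 0.5, 0.25])          # C = diag(lam), so u_j = e_j
j = 1
x = rng.standard_normal((2_000_000, lam.size)) * np.sqrt(lam)
k2_mc = np.mean(np.sum(x**2, axis=1) * x[:, j]**2) - lam[j]**2
print(k2_mc, lam[j] * (lam[j] + lam.sum()))         # both close to 19.5
```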
Our last result provides a sharper probability estimate for the family of sub-gaussian distributions supported in a centered Euclidean ball of radius $r$, with their $\Psi_2$-norm defined as
$$\|x\|_{\Psi_2} = \sup_{y \in S^{n-1}} \|\langle x, y\rangle\|_{\psi_2}, \qquad (27)$$
where $S^{n-1}$ is the unit sphere and with the $\psi_2$-norm of a random variable $X$ defined as
$$\|X\|_{\psi_2} = \sup_{p \ge 1}\; p^{-1/2}\, E\big[|X|^p\big]^{1/p}. \qquad (28)$$
Our setting is therefore similar to the one used to study covariance estimation [26]. Due to space constraints, we refer the reader to the excellent review article [25] for an introduction to sub-gaussian distributions as a tool for non-asymptotic analysis of random matrices.

Theorem 4.3.
For sub-gaussian distributions supported within a centered Euclidean ball of radius $r$, there exists an absolute constant $c$, independent of the sample size, such that for any real number $t > 0$,
$$P\big(|\langle\tilde u_i, u_j\rangle| \ge t\big) \le \exp\!\left(-\,\frac{c\, m\, \Phi_{ij}(t)^2}{\lambda_j\, \|x\|_{\Psi_2}^2}\right), \qquad (29)$$
where $\Phi_{ij}(t) = \big(|\lambda_i - \lambda_j|\, t/2 - \lambda_j\big)\big(r^2/\lambda_j - 2\big)^{-1/2} - \|x\|_{\Psi_2}$, $\lambda_i \ne \lambda_j$, and $\mathrm{sgn}(\lambda_i - \lambda_j)\, 2\tilde\lambda_i > \mathrm{sgn}(\lambda_i - \lambda_j)\,(\lambda_i + \lambda_j)$.

Proof. We start from the simple observation that, for every upper bound $B$ of $|\langle\tilde u_i, u_j\rangle|$, the relation $P(|\langle\tilde u_i, u_j\rangle| > t) \le P(B > t)$ holds. To proceed, therefore, we will construct a bound with a known tail. As we saw in Sections 3.3 and 4.1,
$$|\langle\tilde u_i, u_j\rangle| \le \frac{2\,\|\delta C u_j\|}{|\lambda_i - \lambda_j|} = \frac{2\,\big\|\frac{1}{m}\sum_{p=1}^{m} (x_p x_p^{*} u_j - \lambda_j u_j)\big\|}{|\lambda_i - \lambda_j|} \le \frac{2 \sum_{p=1}^{m} \big\|x_p x_p^{*} u_j - \lambda_j u_j\big\|}{m\, |\lambda_i - \lambda_j|} = \frac{2 \sum_{p=1}^{m} \sqrt{(u_j^{*} x_p)^2\, (x_p^{*} x_p) - 2\lambda_j\, (u_j^{*} x_p)^2 + \lambda_j^2}}{m\, |\lambda_i - \lambda_j|} = \frac{2 \sum_{p=1}^{m} \sqrt{(u_j^{*} x_p)^2\, \big(\|x_p\|^2 - 2\lambda_j\big) + \lambda_j^2}}{m\, |\lambda_i - \lambda_j|}. \qquad (30)$$
Assuming further that $\|x\| \le r$, and since the summands are increasing in $\|x_p\|^2$, we can write for every sample $x = C^{1/2}\varepsilon$:
$$\sqrt{(u_j^{*} x)^2\,\big(\|x\|^2 - 2\lambda_j\big) + \lambda_j^2} \le \sqrt{(u_j^{*} x)^2\,\big(r^2 - 2\lambda_j\big) + \lambda_j^2} = \sqrt{\lambda_j\, (u_j^{*} \varepsilon)^2\,\big(r^2 - 2\lambda_j\big) + \lambda_j^2} \le |u_j^{*}\varepsilon|\, \sqrt{\lambda_j r^2 - 2\lambda_j^2} + \lambda_j, \qquad (31)$$
which is a shifted and scaled version of the random variable $|\hat\varepsilon(j)| = |u_j^{*}\varepsilon|$. Setting $a = (\lambda_j r^2 - 2\lambda_j^2)^{1/2}$, we have
$$P\big(|\langle\tilde u_i, u_j\rangle| \ge t\big) \le P\!\left(\frac{2\sum_{p=1}^{m} \big(|\hat\varepsilon_p(j)|\, a + \lambda_j\big)}{m\, |\lambda_i - \lambda_j|} \ge t\right) = P\!\left(\sum_{p=1}^{m} \big(|\hat\varepsilon_p(j)|\, a + \lambda_j\big) \ge 0.5\, m\, t\, |\lambda_i - \lambda_j|\right) = P\!\left(\sum_{p=1}^{m} |\hat\varepsilon_p(j)| \ge \frac{m\,\big(0.5\, t\, |\lambda_i - \lambda_j| - \lambda_j\big)}{a}\right). \qquad (32)$$
By Lemma 4.1, however, the left-hand side is a sum of independent sub-gaussian variables. Since the summands are not centered, we expand each $|\hat\varepsilon_p(j)| = z_p + E[|\hat\varepsilon_p(j)|]$ in terms of a centered sub-gaussian $z_p$ with the same $\psi_2$-norm. Furthermore, by Jensen's inequality and Lemma 4.1,
$$E\big[|\hat\varepsilon_p(j)|\big] \le E\big[\hat\varepsilon_p(j)^2\big]^{1/2} \le \frac{\|x\|_{\Psi_2}}{\lambda_j^{1/2}}. \qquad (33)$$
Therefore, if we set $\Phi_{ij}(t) = \big(0.5\,|\lambda_i - \lambda_j|\, t - \lambda_j\big)\big(r^2/\lambda_j - 2\big)^{-1/2} - \|x\|_{\Psi_2}$,
$$P\big(|\langle\tilde u_i, u_j\rangle| \ge t\big) \le P\!\left(\sum_{p=1}^{m} z_p \ge \frac{m\, \Phi_{ij}(t)}{\lambda_j}\right). \qquad (34)$$
Moreover, by the rotation invariance principle, the left-hand side of the last inequality is a sub-gaussian with $\psi_2$-norm smaller than $\big(c_0 \sum_{p=1}^{m} \|z_p\|_{\psi_2}^2\big)^{1/2} = (c_0\, m)^{1/2}\, \|z\|_{\psi_2} \le (c_0\, m/\lambda_j)^{1/2}\, \|x\|_{\Psi_2}$, for some absolute constant $c_0$. As a consequence, there exists an absolute constant $c_1$ such that for each $\theta > 0$
$$P\!\left(\Big|\sum_{p=1}^{m} z_p\Big| \ge \theta\right) \le \exp\!\left(-\,\frac{c_1\, \theta^2\, \lambda_j}{m\, \|x\|_{\Psi_2}^2}\right). \qquad (35)$$
Substituting $\theta = m\, \Phi_{ij}(t)/\lambda_j$, we have
$$P\big(|\langle\tilde u_i, u_j\rangle| \ge t\big) \le \exp\!\left(-\,\frac{c_1\, m^2\, \Phi_{ij}(t)^2\, \lambda_j}{\lambda_j^2\, m\, \|x\|_{\Psi_2}^2}\right) = \exp\!\left(-\,\frac{c_1\, m\, \Phi_{ij}(t)^2}{\lambda_j\, \|x\|_{\Psi_2}^2}\right), \qquad (36)$$
which is the desired bound.
Lemma 4.1. If $x$ is a sub-gaussian random vector and $\varepsilon = C^{-1/2} x$, then for every $i$, the random variable $\hat\varepsilon(i) = u_i^{*}\varepsilon$ is also sub-gaussian, with $\|\hat\varepsilon(i)\|_{\psi_2} \le \|x\|_{\Psi_2}/\sqrt{\lambda_i}$.

Proof. The fact that $\hat\varepsilon(i)$ is sub-gaussian follows easily by the definition of a sub-gaussian random vector, according to which for every $y \in \mathbb{R}^n$ the random variable $\langle x, y\rangle$ is sub-gaussian. Setting $y = u_i$, the first part is proven. For the bound on the norm, notice that
$$\|x\|_{\Psi_2} = \sup_{y \in S^{n-1}} \|\langle x, y\rangle\|_{\psi_2} = \sup_{y \in S^{n-1}} \Big\|\sum_{j=1}^{n} \lambda_j^{1/2}\, (u_j^{*} y)\, (u_j^{*}\varepsilon)\Big\|_{\psi_2} \ge \Big\|\sum_{j=1}^{n} \lambda_j^{1/2}\, (u_j^{*} u_i)\, \hat\varepsilon(j)\Big\|_{\psi_2} = \lambda_i^{1/2}\, \|\hat\varepsilon(i)\|_{\psi_2}, \qquad (37)$$
where, for the last inequality, we set $y = u_i$.

The last step in our exposition entails moving from statements about eigenvector angles to statements about the eigenvectors themselves. In particular, we are going to focus on the eigenvectors of the sample covariance associated to the largest eigenvalues, commonly referred to as principal components.
Figure 2: Much less than $n = 784$ samples are needed to estimate principal components associated to large isolated eigenvalues (top, $\ell = 0$). The sample requirement reduces further if we are satisfied by a tight localization of eigenvectors, such that $\tilde u_i$ is almost entirely contained within the span of the $2\ell$ surrounding eigenvectors of $u_i$, shown here for $\ell = 2$ (middle) and $\ell = 4$ (bottom).

The phenomenon we will characterize is portrayed in Figure 2. The experiment in question concerns the $n = 784$ dimensional distribution constructed by the 6131 images featuring digit '3' found in the MNIST database. The top sub-figure shows the estimation accuracy for a set of five principal components averaged over 200 sampling draws. The trends suggest that the first three eigenvectors can be estimated up to a satisfactory accuracy by much less than $n$ samples. The sample requirements decrease further if we are satisfied by a tight localization of eigenvectors, such that $\tilde u_i$ is almost entirely contained within the span of the $2\ell$ surrounding eigenvectors of $u_i$. As suggested by the middle and bottom sub-figures, by slightly increasing $\ell$, we reduce the number of samples needed to estimate higher principal components. (A sketch of this type of experiment is given below.)

In the following, we show that our results verify these trends. We are interested to find out how many samples are sufficient to ensure that $\tilde u_i$ is almost entirely contained within $\mathrm{span}(u_{i-\ell}, \ldots, u_i, \ldots, u_{i+\ell})$ for some small non-negative integer $\ell$. Setting $S = \{\, j : (j < i - \ell) \cup (i + \ell < j \le r)\,\}$, where $r$ is the rank of $C$, we have as a consequence of Corollary 4.1 that, for distributions with finite second moment,
$$P\Big(\sum_{j=i-\ell}^{i+\ell} \langle\tilde u_i, u_j\rangle^2 < 1 - t\Big) \le \sum_{j \in S} \frac{4\, k_j^2}{m\, t\, (\lambda_i - \lambda_j)^2}. \qquad (38)$$
In accordance with the experimental results, equation (38) reveals that it is much easier to estimate the principal components of larger variance, and that, by introducing a small slack in terms of $\ell$, one mitigates the requirement for eigenvalue separation.

It might also be interesting to observe that, for covariance matrices that are (approximately) low-rank, we obtain estimates reminiscent of the compressed sensing setting [10], in the sense that the sample requirement becomes a function of the non-zero eigenvalues. Though intuitive, this dependency of the estimation accuracy on the rank was not transparent in known results for covariance estimation [20, 1, 26].
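Below is a minimal sketch of the localization experiment, using a synthetic pool with a decaying spectrum in place of the MNIST digit-'3' images; all names and parameters are illustrative assumptions, not the paper's code.

```python
import numpy as np

rng = np.random.default_rng(4)
n, N, m, draws = 100, 20000, 200, 100
pool = rng.standard_normal((N, n)) * np.sqrt(0.9 ** np.arange(n))
w, U = np.linalg.eigh(np.cov(pool, rowvar=False))
U = U[:, np.argsort(w)[::-1]]                       # actual eigenvectors, descending

def localization(i, l):
    """Average of sum_{j=i-l}^{i+l} <u~_i, u_j>^2 over random m-sample draws."""
    acc = 0.0
    for _ in range(draws):                          # draws recomputed per (i, l) for simplicity
        sample = pool[rng.choice(N, size=m, replace=False)]
        ws, Us = np.linalg.eigh(np.cov(sample, rowvar=False))
        u_i = Us[:, np.argsort(ws)[::-1][i]]
        lo, hi = max(i - l, 0), min(i + l, n - 1)
        acc += np.sum((u_i @ U[:, lo:hi + 1]) ** 2)
    return acc / draws

for l in (0, 2, 4):
    print(l, [round(localization(i, l), 2) for i in range(5)])
```

Increasing $\ell$ raises the captured energy for every component, mirroring the middle and bottom panels of Figure 2.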
The main contribution of this paper was the derivation of non-asymptotic bounds for the concentration of inner products $|\langle\tilde u_i, u_j\rangle|$ involving eigenvectors of the sample and actual covariance matrices. We also showed how these results can be extended to reason about eigenvectors, eigenspaces, and eigenvalues. This is relevant, for instance, when we wish to construct projectors onto specific principal eigenspaces and we have to ensure that the projection space estimated from the sample covariance closely approximates an eigenspace of the actual covariance.

We have identified two interesting directions for further research. The first has to do with obtaining tighter concentration estimates. Especially with regard to our perturbation arguments, we believe that our current bounds on inner products could be sharpened by at least a constant multiplicative factor. We also suspect that a joint analysis of angles could lead to a significant improvement over Corollary 4.1. The second direction involves using our results for the analysis of methods that utilize the eigenvectors of the covariance for dimensionality reduction. Examples include (fast) principal component projection [12] and regression [17].
References

[1] Radosław Adamczak, Alexander Litvak, Alain Pajor, and Nicole Tomczak-Jaegermann. Quantitative estimates of the convergence of the empirical covariance matrix in log-concave ensembles. Journal of the American Mathematical Society, 23(2):535–561, 2010.
[2] S. E. Ahmed. Large-sample estimation strategies for eigenvalues of a Wishart matrix. Metrika, 47(1):35–45, 1998.
[3] Theodore Wilbur Anderson. Asymptotic theory for principal component analysis. The Annals of Mathematical Statistics, 34(1):122–148, 1963.
[4] Z. D. Bai. Methodologies in spectral analysis of large dimensional random matrices, a review. Statistica Sinica, pages 611–662, 1999.
[5] Z. D. Bai, B. Q. Miao, G. M. Pan, et al. On asymptotics of eigenvectors of large sample covariance matrix. The Annals of Probability, 35(4):1532–1572, 2007.
[6] Z. D. Bai and Y. Q. Yin. Limit of the smallest eigenvalue of a large dimensional sample covariance matrix. The Annals of Probability, pages 1275–1294, 1993.
[7] Zhi-Dong Bai and Jack W. Silverstein. No eigenvalues outside the support of the limiting spectral distribution of large-dimensional sample covariance matrices. The Annals of Probability, pages 316–345, 1998.
[8] Friedrich L. Bauer and Charles T. Fike. Norms and exclusion theorems. Numerische Mathematik, 2(1):137–141, 1960.
[9] Jens Berkmann and Terry Caelli. Computation of surface geometry and segmentation using covariance techniques. IEEE Transactions on Pattern Analysis and Machine Intelligence, 16(11):1114–1116, 1994.
[10] Emmanuel J. Candès, Xiaodong Li, Yi Ma, and John Wright. Robust principal component analysis? Journal of the ACM, 58(3):11:1–11:37, June 2011.
[11] Chandler Davis and William Morton Kahan. The rotation of eigenvectors by a perturbation. III. SIAM Journal on Numerical Analysis, 7(1):1–46, 1970.
[12] Roy Frostig, Cameron Musco, Christopher Musco, and Aaron Sidford. Principal component projection without principal component analysis. In Proceedings of The 33rd International Conference on Machine Learning, pages 2349–2357, 2016.
[13] V. Girko. Strong law for the eigenvalues and eigenvectors of empirical covariance matrices. 1996.
[14] Ling Huang, Donghui Yan, Nina Taft, and Michael I. Jordan. Spectral clustering with perturbed data. In Advances in Neural Information Processing Systems, pages 705–712, 2009.
[15] Blake Hunter and Thomas Strohmer. Performance analysis of spectral clustering on compressed, incomplete and inaccurate measurements. arXiv preprint arXiv:1011.0997, 2010.
[16] Ian Jolliffe. Principal Component Analysis. Wiley Online Library, 2002.
[17] Ian T. Jolliffe. A note on the use of principal components in regression. Applied Statistics, pages 300–303, 1982.
[18] Nandakishore Kambhatla and Todd K. Leen. Dimension reduction by local principal component analysis. Neural Computation, 9(7):1493–1516, 1997.
[19] Xavier Mestre. Improved estimation of eigenvalues and eigenvectors of covariance matrices using their sample estimates. IEEE Transactions on Information Theory, 54(11), 2008.
[20] Mark Rudelson. Random vectors in the isotropic position. Journal of Functional Analysis, 164(1):60–72, 1999.
[21] Dilip Sarwate. Two-sided Chebyshev inequality for event not symmetric around the mean? Mathematics Stack Exchange, 2013. URL: http://math.stackexchange.com/q/144675 (version: 2012-05-13).
[22] James R. Schott. Asymptotics of eigenprojections of correlation matrices with some applications in principal components analysis. Biometrika, pages 327–337, 1997.
[23] Mahdi Shaghaghi and Sergiy A. Vorobyov. Subspace leakage analysis of sample data covariance matrix. In ICASSP, pages 3447–3451. IEEE, 2015.
[24] Jack W. Silverstein and Z. D. Bai. On the empirical distribution of eigenvalues of a class of large dimensional random matrices. Journal of Multivariate Analysis, 54(2):175–192, 1995.
[25] Roman Vershynin. Introduction to the non-asymptotic analysis of random matrices. arXiv:1011.3027, 2010.
[26] Roman Vershynin. How close is the sample covariance matrix to the actual covariance matrix? Journal of Theoretical Probability, 25(3):655–686, 2012.
[27] Yi Yu, Tengyao Wang, Richard J. Samworth, et al. A useful variant of the Davis–Kahan theorem for statisticians. Biometrika, 2015.