Testing for subsphericity when n and p are of different asymptotic order

Abstract

We extend a classical test of subsphericity, based on the first two moments of the eigenvalues of the sample covariance matrix, to the high-dimensional regime where the signal eigenvalues of the covariance matrix diverge to infinity and either p/n → 0 or p/n → ∞. In the latter case we further require that the divergence of the eigenvalues is suitably fast in a specific sense. Our work can be seen to complement that of Schott (2006) who established equivalent results in the case p/n → γ ∈ (0, ∞). As our second main contribution, we use the test to derive a consistent estimator for the latent dimension of the model. Simulations and a real data example are used to demonstrate the results, providing also evidence that the test might be further extendable to a wider asymptotic regime.

Testing for subsphericity when n and p are of different asymptotic order

Joni Virta
Department of Mathematics and Statistics
University of Turku, Finland

January 27, 2021

Abstract

In this short note, we extend a classical test of subsphericity, based on the first two moments of the eigenvalues of the sample covariance matrix, to the high-dimensional regime where the signal eigenvalues of the covariance matrix diverge to infinity and either p/n → 0 or p/n → ∞. In the latter case we further require that the divergence of the eigenvalues is suitably fast in a specific sense. Our work can be seen to complement that of Schott (2006) who established equivalent results in the regime p/n → γ ∈ (0, ∞). Simulations are used to demonstrate the results, providing also evidence that the test might be further extendable to a wider asymptotic setting.

Keywords: Dimension estimation; high-dimensional statistics; PCA; sample covariance matrix; Wishart distribution

The objective of principal component analysis (PCA), and dimension reduction in general, is to extract a low-dimensional signal from noise-corrupted observed data. The most basic statistical model for the problem is as follows. Assume that $S_n$ is the sample covariance matrix of a random sample from a $p$-variate normal distribution whose covariance matrix has the eigenvalues $\lambda_1 \geq \cdots \geq \lambda_d > \sigma^2, \ldots, \sigma^2$, exhibiting “spiked” structure. The data can thus be seen to be generated by contaminating a random sample residing in a $d$-dimensional subspace with independent normal noise having the covariance matrix $\sigma^2 I_p$. This signal subspace can be straightforwardly estimated with PCA as long as one knows its dimension $d$ which is, however, usually unknown in practice. Numerous procedures for determining the dimension have been proposed, see Jolliffe (2002) for a review and, e.g., Schott (2006); Nordhausen et al. (2016); Virta and Nordhausen (2019) for asymptotic tests and Beran and Srivastava (1985); Dray (2008); Luo and Li (2016) for bootstrap- and permutation-based techniques. Simplest of these methods is perhaps the test of subsphericity based on the test statistics,
\[
T_{n,j} = \frac{m_{2,p-j}(S_n)}{m_{1,p-j}(S_n)^2} - 1, \quad j = 0, \ldots, p - 1,
\]
where $m_{\ell,r}(A)$ denotes the $\ell$th sample moment of the last $r$ eigenvalues of the symmetric matrix $A$. Under the null hypothesis $H_{0k}: d = k$ that the signal dimension equals $k$, the limiting null distribution of $T_{n,k}$ is
\[
\tfrac{1}{2} n (p - k) T_{n,k} \rightsquigarrow \chi^2_{\frac{1}{2}(p-k)(p-k+1)-1}, \tag{1}
\]
as $n \to \infty$, see, e.g., Schott (2006). Hence, the dimension $d$ can in practice be determined by testing the sequence of null hypotheses $H_{00}, H_{01}, \ldots$ and taking the estimate of $d$ to be the smallest $k$ for which $H_{0k}$ is not rejected. By examining the power of the tests, Nordhausen et al. (2016) concluded that this procedure yields a consistent estimate of $d$ (with a suitable choice of test levels).

The previous test assumes a fixed dimension $p$ and, in the face of modern large and noisy data sets with great room for dimension reduction, it is desirable to extend the test to the high-dimensional regime where $p = p_n$ is a function of $n$ and we have $p_n \to \infty$ as $n \to \infty$. This is discussed in Section 2 where our main contribution, extending the test based on (1) to the high-dimensional regime where either the sample size or the dimension asymptotically dominates the other, is also presented. In Section 3 we demonstrate our results using simulations and explore the behavior of the test in a regime not covered by our results and, in Section 4, we finally conclude with some discussion.

The behaviour of most high-dimensional statistical procedures depends crucially on the interplay between $n$ and $p_n$ and the most common approach in the literature is to assume that their growth rates are proportional in the sense that $p_n/n \to \gamma \in (0, \infty)$ as $n \to \infty$, see, e.g., Yao et al. (2015). The limiting ratio $\gamma$ is also known as the concentration of the regime. In Schott (2006), the test of subsphericity discussed in Section 1 is extended to this asymptotic regime under the following two assumptions (note that in Assumption 2 the signal dimension $d$ is a constant not depending on $n$).

Assumption 1.

The observations $x_1, \ldots, x_n$ are a random sample from $N_{p_n}(\mu_n, \Sigma_n)$ for some $\mu_n \in \mathbb{R}^{p_n}$ and some positive-definite $\Sigma_n \in \mathbb{R}^{p_n \times p_n}$.

Assumption 2.

The eigenvalues of the matrix $\Sigma_n$ are $\lambda_{n1} \geq \cdots \geq \lambda_{nd} > \sigma^2 = \cdots = \sigma^2$ for some $\sigma^2 > 0$. Moreover, the eigenvalues $\lambda_{nk}$, $k = 1, \ldots, d$, satisfy $\lambda_{nk} \to \infty$.

In fact, Schott (2006) additionally required that the quantities $\lambda_{nk}/\mathrm{tr}(\Sigma_n)$ converge to positive constants summing to less than unity, but applying our Lemma 1 in the proof of their Theorem 4 reveals that this condition is unnecessary, see Appendix A for details. Hence, denoting by $S_n$ the sample covariance matrix of the observations, under Assumptions 1 and 2 and $\gamma \in (0, \infty) \setminus \{1\}$ (see Appendix A for more details on the exclusion of the case $\gamma = 1$), Theorem 4 in Schott (2006) establishes that the test statistic,
\[
T_{n,j} := \frac{m_{2,p_n-j}(S_n)}{m_{1,p_n-j}(S_n)^2} - 1,
\]
satisfies
\[
(n - d - 1) T_{n,d} - (p_n - d) \rightsquigarrow N(1, 4),
\]
where $d$ is the signal dimension, enabling the pin-pointing of the dimension with a chain of hypothesis tests similarly as in the low-dimensional case. As remarked by Schott (2006), this limiting result is consistent with its low-dimensional equivalent (1) in the sense that, as $p \to \infty$,
\[
\frac{2}{p - d} \chi^2_{\frac{1}{2}(p-d)(p-d+1)-1} - (p - d) \rightsquigarrow N(1, 4).
\]

A crucial condition that allows the above limiting result is the divergence of the spike eigenvalues $\lambda_{n1}, \ldots, \lambda_{nd}$ of the covariance matrix to infinity in Assumption 2. Indeed, usually the spikes are taken to be constant in the literature for high-dimensional PCA, see, e.g., Baik and Silverstein (2006); Johnstone and Paul (2018). However, requiring the spikes to diverge to infinity is rather natural and reflects the idea that only a few principal components are sufficient to recover a large proportion of the total variance even in high dimensions. See, for example, Yata et al. (2018), who use cross-data-matrices to detect spiked principal components with divergent variance, and the references therein.

As our contribution, we extend the result of Schott (2006) outside of the regime $p_n/n \to \gamma \in (0, \infty)$, to the extreme cases $\gamma \in \{0, \infty\}$. The latter have been less studied in the high-dimensional literature, but see, for example, Karoui (2003); Birke and Dette (2005); Yata and Aoshima (2009); Jung and Marron (2009), the last of which consider the extreme asymptotic scenario where the dimension diverges to infinity but the sample size remains fixed. In our treatment of the case $\gamma = \infty$, we further require the additional condition that $p_n/(n\sqrt{\lambda_{nd}}) \to 0$ as $n \to \infty$, i.e., the dimension must not diverge too fast compared to the sample size and the magnitude of the spike $\lambda_{nd}$ corresponding to the weakest signal. Assumptions of this form are rather common in high-dimensional PCA when the spikes are taken to diverge, see, e.g., Shen et al. (2016) who saw $n$, $\lambda_{nk}$ and $p_n$ as three competing forces affecting the consistency properties of PCA, $n$ and $\lambda_{nk}$ contributing information about the signals and $p_n$ decreasing the relative share of information in the sample by introducing more noise to the model. The condition $p_n/(n\sqrt{\lambda_{nd}}) \to 0$ […] $\gamma \in \{0, \infty\}$, to testing of subsphericity. In this sense, our work is to Birke and Dette (2005) what Schott (2006) is to Ledoit and Wolf (2002), who studied tests of sphericity in the case where $\gamma \in (0, \infty)$ and on whose work Schott (2006) based their proof.

Theorem 1.

Under Assumptions 1 and 2, if, as $n \to \infty$, either

i) $p_n/n \to 0$, or,
ii) $p_n/n \to \infty$ and $p_n/(n\sqrt{\lambda_{nd}}) \to 0$,

then,
\[
(n - d - 1) T_{n,d} - (p_n - d) \rightsquigarrow N(1, 4).
\]

We next demonstrate the result of Theorem 1 using simulated data. In addition, we briefly explore a case where $p_n/n \to \infty$ and $p_n/(n\sqrt{\lambda_{nd}}) \not\to 0$. In each setting, the data are a random sample of size $n$ from $N_{p_n}(0, \Sigma_n)$ where $\Sigma_n = \mathrm{diag}(\lambda_{n1}, \ldots, \lambda_{nd}, 1, \ldots, 1)$. This form of the normal distribution (zero location, unit noise variance and diagonal covariance) is without loss of generality as our test statistic is location, scale and rotation invariant. The settings are as follows:

1. $d = 3$, $n = 216$, $p_n = n^{2/3}$, λ n = 3 n , and λ n = λ n = n / ,
2. $d = 3$, $n = 216$, $p_n = n^{2/3}$, λ n = 3 n , and λ n = λ n = n ,
3. $d = 2$, $n = 36$, $p_n = n^{3/2}$, λ n = 2 n and λ n = n ,
4. $d = 2$, $n = 36$, $p_n = n^{3/2}$, λ n = 2 n and λ n = n / .

Settings 1 and 2 fall within the case $\gamma = 0$, and their only difference is in the growth rates of the spikes. Settings 3 and 4 explore the case $\gamma = \infty$, the former satisfying the conditions of Theorem 1 and the latter not (again the only difference between them is in the growth rates of the spikes). Note that the sample sizes and dimensions have been chosen such that the dimensions of the data matrix are in each case 216 and 36 (in either order). In each case, we compute 10000 replicates of the test statistic $(n - d - 1) T_{n,d} - (p_n - d)$ and plot the obtained histogram superimposed with the density of the limiting distribution $N(1, 4)$. […] $p_n/(n\sqrt{\lambda_{nd}}) \to 0$ […] $n$ is increased. Based on this, it seems possible that, even when $p_n/(n\sqrt{\lambda_{nd}}) \not\to 0$, the limiting distribution of $(n - d - 1) T_{n,d} - (p_n - d)$ could be made to equal $N(1, 4)$ with a suitable additive correction term $a_n$, which vanishes, $a_n \to 0$ as $n \to \infty$, when the conditions of Theorem 1 are satisfied.

Figure 1: The histograms of 10000 independent replicates of the test statistic $(n - d - 1) T_{n,d} - (p_n - d)$ under the four different settings, with the density of the limiting distribution $N(1, 4)$ overlaid. [Panels: Setting 1, Setting 2, Setting 3, Setting 4; horizontal axis: $(n - d - 1) T_{n,d} - (p_n - d)$; vertical axis: Density.]

Next, we demonstrate how the result of Theorem 1 can be used to estimate the signal dimension $d$ by using a chain of hypothesis tests for the null hypotheses $H_{0k}: d = k$. That is, we sequentially test the null hypotheses $H_{00}, H_{01}, \ldots$ using, respectively, the test statistics $T_{n,0}, T_{n,1}, \ldots$ and take our estimate of the dimension to be the smallest $k$ for which $H_{0k}$ is not rejected. For each test, we use the two-sided 95% critical regions of the limiting $N(1, 4)$-distribution. The observed rejection rates are given in Table 1 for $H_{0(d-1)}$, $H_{0d}$ and $H_{0(d+1)}$ only. In Settings 1–3, the test achieves rather accurately the nominal level at $H_{0d}$ and shows moreover extremely good power at $H_{0(d-1)}$, leading us to conclude that the sequence of tests indeed manages to detect the true dimension. Finally, as expected, the procedure does not work in Setting 4 where the conditions of Theorem 1 are not satisfied.

Table 1: The subtables give the observed rejection rates for $H_{0(d-1)}$, $H_{0d}$ and $H_{0(d+1)}$ over 10000 independent replicates under each of the four settings. Two different sample sizes are considered for each setting.

Setting   n     H_{0(d-1)}   H_{0d}   H_{0(d+1)}
1         216   1.000        0.055    0.105
1         512   1.000        0.052    0.133
2         216   1.000        0.056    0.104
2         512   1.000        0.051    0.127
3         36    1.000        0.052    0.041
3         64    1.000        0.053    0.052
4         36    0.083        0.091    0.079
4         64    0.068        0.093    0.104

In this short note, we showed that a classical test of subsphericity is valid also in the less often studied high-dimensional regimes where the concentration $\gamma$ is allowed to take the extreme values $0$ and $\infty$, as long as the spikes themselves diverge to infinity. The case $\gamma = \infty$ further requires the condition that $p_n/(n\sqrt{\lambda_{nd}}) \to 0$, which can be seen as the main limiting factor in applying the test in practice. And even though, by our simulation study, it seems plausible that the test could be extended outside of this condition, several key arguments of our proof of Theorem 1 hinge on it, meaning that any extensions should use a different technique of proof.

As a natural continuation to this work, we note that even though the test exhibited a good capability to detect the signal dimension in our simulation study, the full theoretical guarantee of its ability to estimate $d$ would require investigating the power of the test under the alternatives $H_{1,d-1}, H_{1,d-2}, \ldots$. Corresponding studies for tests of sphericity have been conducted by Wang and Yao (2013); Onatski et al. (2014) (for non-divergent spikes) and similar approaches could possibly be used in the present context as well.

A Discussion of Theorem 4 in Schott (2006)

For convenience, this section uses the notation of Schott (2006). We first show that the final part of Condition 2 in Schott (2006), assuming that $\lim_{k \to \infty} \lambda_{i,k}/\mathrm{tr}(\Sigma_k) = \rho_i \in (0, 1)$, $i = 1, \ldots, q$, and that $\sum_{i=1}^{q} \rho_i \in (0, 1)$, is not needed. In the proof of Schott (2006), this condition is used to guarantee that $m/\lambda_q = O(1)$. This, in conjunction with the observation that $\mathrm{tr}(W_{12} W_{12}') = o_p(m)$, then gives the relation,
\[
\frac{1}{\lambda_q} \mathrm{tr}(W_{12} W_{12}') = \frac{m}{\lambda_q} o_p(1) = o_p(1),
\]
used in bounding the moments. However, the same relation follows directly from the divergence of the spike eigenvalues $\lambda_j$ by first observing that, by the proof of our Lemma 1, we have $\mathrm{tr}(W_{12} W_{12}') = qc + o_p(1)$, where $c \in (0, \infty)$ is the limit of $p/n$. Note also that, to obtain the final bound in the equation right after (24) without assuming anything about the relative growth rates of the spikes, we use the bound $\| \Sigma_*^{-1} \| \leq \lambda_q^{-1} \lambda_q \mathrm{tr}(\Sigma_*^{-1}) \leq \lambda_q^{-1} q$ (which is valid simply by the ordering of the spike eigenvalues). Thus, the result of Theorem 4 can be obtained without the final part of Condition 2 in Schott (2006).

Additionally, we remark that, by what appears to be an oversight, the proof of Theorem 4 in Schott (2006) does not hold as such in the case where $p/n \to c = 1$. Namely, in equation (22) and in the equation right after (24), the upper bounds involve the term $\phi_r^{-1}(S_{22 \cdot 1})$, which converges in probability to $(1 - c^{1/2})^{-2}$, which fails to be finite when $c = 1$. It seems to us that introducing some additional (non-trivial) assumptions on the spike eigenvalues could possibly recover the proof for $c = 1$ as, indeed, the simulations in Schott (2006) suggest that the result of Theorem 4 holds in that case also.

B Proofs

Before the proof of Theorem 1 we establish an auxiliary lemma.

Lemma 1.

Let $W_n \sim \mathcal{W}_{p_n}(I_{p_n}/n, n)$ be partitioned as
\[
W_n = \begin{pmatrix} W_{n,11} & W_{n,12} \\ W_{n,21} & W_{n,22} \end{pmatrix},
\]
where the block $W_{n,11}$ has the size $d \times d$ and $\mathcal{W}_p(\Sigma, \nu)$ denotes the $(p \times p)$-dimensional Wishart distribution with the scale matrix $\Sigma$ and $\nu$ degrees of freedom. Then, as $n, p_n \to \infty$,

1. if $p_n/n \to 0$, we have $\mathrm{tr}(W_{n,12} W_{n,21}) = O_p(p_n/n)$,
2. if $p_n/n \to \infty$ and $p_n/(n\sqrt{\lambda_n}) \to 0$ for some sequence $\lambda_n \to \infty$, we have $\mathrm{tr}(W_{n,12} W_{n,21}) = o_p(\sqrt{\lambda_n})$.

Proof of Lemma 1.

The matrix $W_n$ has the same distribution as the (biased) non-centered sample covariance matrix of a random sample $z_1, \ldots, z_n$ from the $p_n$-variate standard normal distribution. Hence, letting $Y_n := W_{n,12} W_{n,21}$, we have, for arbitrary $j = 1, \ldots, d$, that
\[
y_{n,jj} = \sum_{k=d+1}^{p_n} \left( \frac{1}{n} \sum_{i=1}^{n} z_{ij} z_{ik} \right)^2.
\]
The expected value of $y_{n,jj}$ is
\[
E(y_{n,jj}) = \frac{1}{n^2} \sum_{k=d+1}^{p_n} \sum_{i=1}^{n} \sum_{\ell=1}^{n} E(z_{ij} z_{ik} z_{\ell j} z_{\ell k}) = \frac{p_n - d}{n},
\]
whereas its second moment is
\begin{align*}
E(y_{n,jj}^2) &= \frac{1}{n^4} \sum_{k=d+1}^{p_n} \sum_{k'=d+1}^{p_n} \sum_{i=1}^{n} \sum_{i'=1}^{n} \sum_{\ell=1}^{n} \sum_{\ell'=1}^{n} E(z_{ij} z_{ik} z_{\ell j} z_{\ell k} z_{i'j} z_{i'k'} z_{\ell' j} z_{\ell' k'}) \\
&= \frac{1}{n^4} \sum_{k k' i i' \ell \ell'} E(z_{ij} z_{\ell j} z_{i'j} z_{\ell' j}) E(z_{ik} z_{\ell k} z_{i'k'} z_{\ell' k'}) \\
&= \frac{1}{n^4} \sum_{k k' i i' \ell \ell'} (\delta_{i\ell} \delta_{i'\ell'} + \delta_{ii'} \delta_{\ell\ell'} + \delta_{i\ell'} \delta_{\ell i'}) (\delta_{i\ell} \delta_{i'\ell'} + \delta_{ii'} \delta_{kk'} \delta_{\ell\ell'} + \delta_{i\ell'} \delta_{kk'} \delta_{\ell i'}) \\
&= \frac{p_n - d}{n^3} \{ (p_n - d) n + 2 (p_n - d) + 2n + 4 \},
\end{align*}
where the second-to-last equality uses Isserlis' theorem. Consequently, the variance of $y_{n,jj}$ is
\[
\mathrm{Var}(y_{n,jj}) = \frac{2 (p_n - d)}{n^3} \{ (p_n - d) + n + 2 \}.
\]
Hence, the moments of $t_{n,jj} := (n/p_n) y_{n,jj}$ are $E(t_{n,jj}) = 1 - d/p_n = 1 + o(1)$ and
\[
\mathrm{Var}(t_{n,jj}) = 2 \left( 1 - \frac{d}{p_n} \right) \frac{(p_n - d) + n + 2}{n p_n} = o(1).
\]
The first claim now follows and the second one is straightforwardly verified to be true in a like manner.
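The moment formulas in the proof of Lemma 1, $E(y_{n,jj}) = (p_n - d)/n$ and $\mathrm{Var}(y_{n,jj}) = 2(p_n - d)\{(p_n - d) + n + 2\}/n^3$, can be cross-checked with exact arithmetic. The sketch below is an illustration only and not part of the original argument. It relies on an observation that is an assumption of the sketch (it follows by conditioning on the $j$th coordinates $z_{1j}, \ldots, z_{nj}$): $n^2 y_{n,jj}$ has the distribution of a product of independent $\chi^2_n$ and $\chi^2_{p_n - d}$ variables. All function names are hypothetical.

```python
from fractions import Fraction


def chi2_raw_moments(nu):
    """First two raw moments of a chi-squared variable with nu degrees of freedom."""
    return Fraction(nu), Fraction(nu * (nu + 2))


def y_moments_product(n, p, d):
    """Mean and variance of y = chi2_n * chi2_{p-d} / n^2 with independent factors."""
    a1, a2 = chi2_raw_moments(n)
    b1, b2 = chi2_raw_moments(p - d)
    mean = a1 * b1 / n**2
    second_moment = a2 * b2 / n**4
    return mean, second_moment - mean**2


def y_moments_closed_form(n, p, d):
    """Closed-form mean and variance of y as stated in the proof of Lemma 1."""
    r = p - d
    return Fraction(r, n), Fraction(2 * r * (r + n + 2), n**3)


def t_variance(n, p, d):
    """Variance of t = (n / p) * y."""
    _, var_y = y_moments_closed_form(n, p, d)
    return Fraction(n, p) ** 2 * var_y


# Sample sizes and dimensions in the style of the gamma = 0 settings (p = n^(2/3)).
for n, p, d in [(216, 36, 3), (512, 64, 3), (1000, 100, 3)]:
    assert y_moments_product(n, p, d) == y_moments_closed_form(n, p, d)
    print(n, p, float(t_variance(n, p, d)))
```

The printed variances of $t_{n,jj}$ shrink as $n$ and $p_n$ grow with $p_n/n \to 0$, in line with the $o(1)$ claim used for the first part of the lemma.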

Proof of Theorem 1.

Due to centering we may WLOG assume that $\mu_n = 0$ for all $n \in \mathbb{N}$. Moreover, as our main claim depends on $S_n$ only through its eigenvalues, we may, again WLOG, assume that $\Sigma_n = \mathrm{diag}(\lambda_{n1}, \ldots, \lambda_{nd}, \sigma^2, \ldots, \sigma^2)$. Finally, as the left-hand side of our main claim is invariant under scaling of the observations, we may WLOG assume that $\sigma^2 = 1$.

Denoting $n_0 := n - 1$, we have that $S_n = \Sigma_n^{1/2} W_n \Sigma_n^{1/2}$ where $W_n \sim \mathcal{W}_{p_n}\{n_0^{-1} I_{p_n}, n_0\}$ is the sample covariance matrix of a sample of size $n$ from the $p_n$-variate standard normal distribution. Denote then $\Lambda_n = \mathrm{diag}(\lambda_{n1}, \ldots, \lambda_{nd})$ and partition $S_n$ and $W_n$ as
\[
S_n = \begin{pmatrix} S_{n,11} & S_{n,12} \\ S_{n,21} & S_{n,22} \end{pmatrix} = \begin{pmatrix} \Lambda_n^{1/2} W_{n,11} \Lambda_n^{1/2} & \Lambda_n^{1/2} W_{n,12} \\ W_{n,21} \Lambda_n^{1/2} & W_{n,22} \end{pmatrix},
\]
where the matrices $S_{n,11}$ and $W_{n,11}$ are of the size $d \times d$. Then $W_{n,22} \sim \mathcal{W}_{r_n}\{n_0^{-1} I_{r_n}, n_0\}$, where $r_n := p_n - d$, and the Schur complement $S_{n,22 \cdot 1}$ satisfies
\[
S_{n,22 \cdot 1} := S_{n,22} - S_{n,21} S_{n,11}^{-1} S_{n,12} = W_{n,22 \cdot 1} \sim \mathcal{W}_{r_n}\{n_0^{-1} I_{r_n}, n_0 - d\},
\]
where the distribution of $W_{n,22 \cdot 1}$ follows from Theorem 3.4.6 in Mardia et al. (1995). Consequently, $G_n := \{n_0/(n_0 - d)\} S_{n,22 \cdot 1} \sim \mathcal{W}_{r_n}\{(n_0 - d)^{-1} I_{r_n}, n_0 - d\}$, implying that $m_{2,r_n}(S_{n,22 \cdot 1})/m_{1,r_n}(S_{n,22 \cdot 1})^2 = m_{2,r_n}(G_n)/m_{1,r_n}(G_n)^2$. Hence, by Theorem 3.7 in Birke and Dette (2005), we have
\[
(n - d - 1) \left\{ \frac{m_{2,r_n}(S_{n,22 \cdot 1})}{m_{1,r_n}(S_{n,22 \cdot 1})^2} - 1 \right\} - r_n \rightsquigarrow N(1, 4), \tag{2}
\]
regardless of which of the two asymptotic regimes we are in. Note also that, as $G_n$ is of the size $r_n \times r_n$, the notation $m_{k,r_n}(G_n)$ simply refers to the $k$th sample moment of its eigenvalues.

Consider next the regime where $p_n/n \to 0$ and assume that
\[
m_{k,r_n}(S_{n,22 \cdot 1}) = m_{k,r_n}(S_n) + o_p(1/n), \tag{3}
\]
for $k = 1, 2$. Then, the difference
\[
(n - d - 1) \left\{ \frac{m_{2,r_n}(S_n)}{m_{1,r_n}(S_n)^2} - \frac{m_{2,r_n}(S_{n,22 \cdot 1})}{m_{1,r_n}(S_{n,22 \cdot 1})^2} \right\} = (n - d - 1) \frac{m_{2,r_n}(S_n) m_{1,r_n}(S_{n,22 \cdot 1})^2 - m_{2,r_n}(S_{n,22 \cdot 1}) m_{1,r_n}(S_n)^2}{m_{1,r_n}(S_{n,22 \cdot 1})^2 m_{1,r_n}(S_n)^2}, \tag{4}
\]
is easily checked to be of the order $o_p(1)$ using (3) and the results following from Lemma 2.1 in Birke and Dette (2005) that $m_{1,r_n}(S_{n,22 \cdot 1}) \to_p 1$ and $m_{2,r_n}(S_{n,22 \cdot 1}) \to_p 1$. Hence, the first claim of the theorem follows from (2).

Similarly, in the regime that $p_n/n \to \infty$ and $p_n/(n\sqrt{\lambda_{nd}}) \to 0$, assume that
\[
m_{1,r_n}(S_{n,22 \cdot 1}) = m_{1,r_n}(S_n) + o_p(1/p_n), \qquad m_{2,r_n}(S_{n,22 \cdot 1}) = m_{2,r_n}(S_n) + o_p(1/n). \tag{5}
\]
Then, the difference (4) can similarly be shown to be of the order $o_p(1)$ (proving the second claim of the theorem). Note that in this case we require a faster convergence from the first moment since, by Lemma 2.1 of Birke and Dette (2005), we have again $m_{1,r_n}(S_{n,22 \cdot 1}) \to_p 1$ but $m_{2,r_n}(S_{n,22 \cdot 1}) - (n_0 - d)(r_n + 1)/n_0^2 \to_p 1$.

We next establish (3), beginning with the case $k = 1$. As $p_n/n \to 0$, we may without loss of generality assume $n > p_n$, implying that $S_n$ is almost surely positive definite. Now, we have for $S_{n,11 \cdot 2} := S_{n,11} - S_{n,12} S_{n,22}^{-1} S_{n,21}$ that,
\begin{align*}
\phi_d^{-1}(S_{n,11 \cdot 2}) &= \phi_1\{ (S_{n,11} - S_{n,12} S_{n,22}^{-1} S_{n,21})^{-1} \} \\
&= \phi_1( S_{n,11}^{-1} + S_{n,11}^{-1} S_{n,12} S_{n,22 \cdot 1}^{-1} S_{n,21} S_{n,11}^{-1} ) \\
&\leq \phi_1( \Lambda_n^{-1/2} W_{n,11}^{-1} \Lambda_n^{-1/2} ) + \phi_1( \Lambda_n^{-1/2} W_{n,11}^{-1} W_{n,12} W_{n,22 \cdot 1}^{-1} W_{n,21} W_{n,11}^{-1} \Lambda_n^{-1/2} ) \\
&\leq \phi_1(\Lambda_n^{-1/2})^2 \phi_1(W_{n,11}^{-1}) + \phi_1(\Lambda_n^{-1/2})^2 \phi_1(W_{n,11}^{-1})^2 \phi_1( W_{n,12} W_{n,22 \cdot 1}^{-1} W_{n,21} ), \tag{6}
\end{align*}
where the second equality follows from the Woodbury matrix identity, the first inequality uses Weyl's inequality and the second inequality follows from the sub-multiplicativity of the spectral norm. Now, Assumption 2 guarantees that $\phi_1(\Lambda_n^{-1/2})^2 = \lambda_{nd}^{-1} \to 0$ and, since $W_{n,11} \to_p I_d$, we further have, by the continuity of eigenvalues, that $\phi_1(W_{n,11}^{-1}) \to_p 1$. Write then,
\[
\phi_1( W_{n,12} W_{n,22 \cdot 1}^{-1} W_{n,21} ) = \| W_{n,12} W_{n,22 \cdot 1}^{-1} W_{n,21} \|_2 \leq \| W_{n,12} \|_2^2 \| W_{n,22 \cdot 1}^{-1} \|_2 \leq \mathrm{tr}(W_{n,12} W_{n,21}) \phi_{r_n}^{-1}(W_{n,22 \cdot 1}),
\]
where $\| \cdot \|_2$ denotes the spectral norm. Now, since $G_n = \{n_0/(n_0 - d)\} W_{n,22 \cdot 1} \sim \mathcal{W}_{r_n}\{(n_0 - d)^{-1} I_{r_n}, n_0 - d\}$, we have by the discussion after Theorem 1.1 in Rudelson and Vershynin (2009) that
\[
P \left\{ \phi_{r_n}(G_n) \leq \left( 1 - \sqrt{\frac{r_n}{n_0 - d}} - \frac{t}{\sqrt{n_0 - d}} \right)^2 \right\} \leq e^{-t^2/2},
\]
for all $t > 0$. Substituting $t = (1/2)\sqrt{n_0 - d} - \sqrt{r_n}$ (which is positive for a large enough $n$), gives,
\[
P\{ \phi_{r_n}(G_n) \leq 1/4 \} \leq \exp[ -\{ (1/2)\sqrt{n_0 - d} - \sqrt{r_n} \}^2 / 2 ] \to 0.
\]
Hence,
\[
\phi_{r_n}^{-1}(W_{n,22 \cdot 1}) = \frac{n_0}{n_0 - d} \cdot \frac{1}{\{ \phi_{r_n}(G_n) - 1/4 \} + 1/4} = O_p(1), \tag{7}
\]
where the final step follows as $\phi_{r_n}(G_n) - 1/4$ is positive with probability tending to one. By Lemma 1, $\mathrm{tr}(W_{n,12} W_{n,21}) = O_p(p_n/n) = o_p(1)$ and plugging all these in to (6), we obtain that $0 < \phi_d^{-1}(S_{n,11 \cdot 2}) \leq o_p(1)$ (where the first inequality holds a.s. by the positive-definiteness of the Schur complement). This, in conjunction with the fact that $\phi_1(S_{n,22 \cdot 1}) \to_p$

1, implied by Theorem 2 in Karoui (2003), lets us conclude that $P\{ \phi_d(S_{n,11 \cdot 2}) > \phi_1(S_{n,22 \cdot 1}) \} \to 1$ as $n \to \infty$, and, in the sequel, we restrict our attention to this event, allowing us to apply Theorem 3 in Schott (2006), equation (17) of which yields,
\begin{align*}
0 \leq m_{1,r_n}(S_{n,22 \cdot 1}) - m_{1,r_n}(S_n) &\leq \frac{\phi_1^2(S_{n,22 \cdot 1})}{r_n \{ \phi_1^{-1}(S_{n,22 \cdot 1}) - \phi_d^{-1}(S_{n,11 \cdot 2}) \}} \mathrm{tr}( S_{n,11}^{-1} S_{n,12} S_{n,22 \cdot 1}^{-2} S_{n,21} S_{n,11}^{-1} ) \\
&\leq r_n^{-1} \{ 1 + o_p(1) \} \| \Lambda_n^{-1/2} \|_2^2 \| W_{n,11}^{-1} \|_2^2 \| W_{n,12} W_{n,22 \cdot 1}^{-2} W_{n,21} \|_* \\
&= o_p(1/p_n) \| W_{n,12} W_{n,22 \cdot 1}^{-2} W_{n,21} \|_*, \tag{8}
\end{align*}
where $\| \cdot \|_*$ denotes the nuclear norm. Let the singular value decomposition of $W_{n,21}$ be $W_{n,21} = R_n D_n T_n'$. Then, with $\| \cdot \|_F$ denoting the Frobenius norm,
\begin{align*}
\| W_{n,12} W_{n,22 \cdot 1}^{-2} W_{n,21} \|_* &\leq \| D_n \|_F^2 \| R_n' W_{n,22 \cdot 1}^{-2} R_n \|_* \\
&= \{ \mathrm{tr}(W_{n,12} W_{n,21}) \} \| R_n' W_{n,22 \cdot 1}^{-2} R_n \|_* \\
&= O_p(p_n/n) \sum_{j=1}^{d} \phi_j( R_n' W_{n,22 \cdot 1}^{-2} R_n ) \\
&\leq O_p(p_n/n) \sum_{j=1}^{d} \phi_j( W_{n,22 \cdot 1}^{-2} ) \\
&\leq O_p(p_n/n) \, d \, \phi_1( W_{n,22 \cdot 1}^{-2} ) \\
&= O_p(p_n/n) \, \phi_{r_n}^{-2}(W_{n,22 \cdot 1}) \\
&= O_p(p_n/n),
\end{align*}
where the second equality follows from Lemma 1, the second inequality from the Poincaré separation theorem and the final equality from (7). Plugging this in to (8) then establishes (3) for $k = 1$.

To show the same for $k = 2$, we apply equation (18) from Theorem 3 in Schott (2006) to obtain an upper bound for $0 \leq m_{2,r_n}(S_{n,22 \cdot 1}) - m_{2,r_n}(S_n)$ which is, up to a factor depending only on $\phi_1(S_{n,22 \cdot 1})$ and $\phi_d^{-1}(S_{n,11 \cdot 2})$, of the same form as the bound in (8). Arguing as in the case $k = 1$ shows that the right-hand side is bounded by an $o_p(1/n)$-quantity, concluding the proof of the case where $p_n/n \to 0$.

Consider then the regime where $p_n/n \to \infty$ and $p_n/(n\sqrt{\lambda_{nd}}) \to 0$. We may now without loss of generality assume $p_n > n$, implying that the rank of $S_{n,22 \cdot 1}$ is almost surely $n_0 - d$. Denote then any of its eigendecompositions by $S_{n,22 \cdot 1} = Q_n \Delta_n Q_n'$ where $Q_n$ is an $r_n \times (n_0 - d)$ matrix with orthonormal columns and $\Delta_n$ contains the almost surely positive $n_0 - d$ eigenvalues. Our aim is to use Corollary 3 of Schott (2006) and, for that, we first show that $P\{ \phi_d(\tilde{S}_{n,11 \cdot 2}) > \phi_1(S_{n,22 \cdot 1}) \} \to 1$ as $n \to \infty$, where $\tilde{S}_{n,11 \cdot 2} := S_{n,11} - S_{n,12} Q_n (Q_n' S_{n,22} Q_n)^{-1} Q_n' S_{n,21}$. Now, the inverse of $\tilde{S}_{n,11 \cdot 2}$ is $S_{n,11}^{-1} + S_{n,11}^{-1} S_{n,12} Q_n \Delta_n^{-1} Q_n' S_{n,21} S_{n,11}^{-1}$ and, proceeding as in (6), we see that $\phi_d^{-1}(\tilde{S}_{n,11 \cdot 2})$ has the upper bound,
\[
\phi_1(\Lambda_n^{-1/2})^2 \phi_1(W_{n,11}^{-1}) + \phi_1(\Lambda_n^{-1/2})^2 \phi_1(W_{n,11}^{-1})^2 \phi_1( W_{n,12} Q_n \Delta_n^{-1} Q_n' W_{n,21} ),
\]
where the final leading eigenvalue has, by the Poincaré separation theorem, the upper bound $\mathrm{tr}(W_{n,12} W_{n,21}) \phi_{n_0 - d}^{-1}(\Delta_n)$. Now, as in the proof of the first claim, Rudelson and Vershynin (2009) can be used to show that $\phi_{n_0 - d}^{-1}(\Delta_n) = O_p(n/p_n)$. Furthermore, Lemma 1 shows that $\mathrm{tr}(W_{n,12} W_{n,21}) = o_p(\sqrt{\lambda_{nd}})$ under our assumptions, finally yielding that,
\[
\phi_d^{-1}(\tilde{S}_{n,11 \cdot 2}) \leq \frac{1}{\lambda_{nd}} \{ 1 + o_p(1) \} + o_p\{ n/(p_n \sqrt{\lambda_{nd}}) \}.
\]
This, in conjunction with the result that $(p_n/n) \phi_1^{-1}(S_{n,22 \cdot 1}) \to_p 1$, implied by Theorem 1 in Karoui (2003), guarantees now that
\[
(p_n/n) \{ \phi_1^{-1}(S_{n,22 \cdot 1}) - \phi_d^{-1}(\tilde{S}_{n,11 \cdot 2}) \} \geq 1 - \frac{p_n}{n \lambda_{nd}} \{ 1 + o_p(1) \} + o_p(1) = 1 + o_p(1),
\]
showing that $P\{ \phi_d(\tilde{S}_{n,11 \cdot 2}) > \phi_1(S_{n,22 \cdot 1}) \} \to 1$, as desired, and allowing us to restrict our attention to the corresponding set and to use Corollary 3 in Schott (2006). Its first part gives us
\begin{align*}
0 \leq m_{1,r_n}(S_{n,22 \cdot 1}) - m_{1,r_n}(S_n) &\leq \frac{\phi_1^2(S_{n,22 \cdot 1})}{r_n \{ \phi_1^{-1}(S_{n,22 \cdot 1}) - \phi_d^{-1}(\tilde{S}_{n,11 \cdot 2}) \}} \mathrm{tr}( S_{n,11}^{-1} S_{n,12} Q_n \Delta_n^{-2} Q_n' S_{n,21} S_{n,11}^{-1} ) \\
&\leq \frac{p_n^3}{n^3 r_n \lambda_{nd}} \lambda_{nd} \mathrm{tr}(\Lambda_n^{-1}) \{ d + o_p(1) \} \| W_{n,12} Q_n \Delta_n^{-2} Q_n' W_{n,21} \|_2, \tag{9}
\end{align*}
where $\lambda_{nd} \mathrm{tr}(\Lambda_n^{-1}) \leq d$. Reasoning similarly as with the first claim of the theorem, we further have
\[
\| W_{n,12} Q_n \Delta_n^{-2} Q_n' W_{n,21} \|_2 \leq \mathrm{tr}(W_{n,12} W_{n,21}) \| R_n' Q_n \Delta_n^{-2} Q_n' R_n \|_2 \leq o_p(\sqrt{\lambda_{nd}}) O_p(n^2/p_n^2),
\]
where $R_n$ again contains the left singular vectors of $W_{n,21}$. Plugging the obtained upper bound to (9) then finally gives the first claim of (5) and the second claim is obtained in exactly the same manner but by using the second inequality of Corollary 3 in Schott (2006) instead of the first.

References

Baik, J. and Silverstein, J. W. (2006). Eigenvalues of large sample covariance matrices of spiked population models.

Journal of Multivariate Analysis, 97(6):1382–1408.

Beran, R. and Srivastava, M. S. (1985). Bootstrap tests and confidence regions for functions of a covariance matrix. Annals of Statistics, 13(1):95–115.

Birke, M. and Dette, H. (2005). A note on testing the covariance matrix for large dimension. Statistics & Probability Letters, 74(3):281–289.

Dray, S. (2008). On the number of principal components: A test of dimensionality based on measurements of similarity between matrices. Computational Statistics & Data Analysis, 52(4):2228–2237.

Johnstone, I. M. and Paul, D. (2018). PCA in high dimensions: An orientation. Proceedings of the IEEE, 106(8):1277–1292.

Jolliffe, I. T. (2002). Principal Component Analysis. Springer. Second edition.

Jung, S. and Marron, J. S. (2009). PCA consistency in high dimension, low sample size context. Annals of Statistics, 37(6B):4104–4130.

Karoui, N. E. (2003). On the largest eigenvalue of Wishart matrices with identity covariance when n, p and p/n → ∞. arXiv preprint math/0309355.

Ledoit, O. and Wolf, M. (2002). Some hypothesis tests for the covariance matrix when the dimension is large compared to the sample size. Annals of Statistics, 30(4):1081–1102.

Luo, W. and Li, B. (2016). Combining eigenvalues and variation of eigenvectors for order determination. Biometrika, 103(4):875–887.

Mardia, K., Kent, J., and Bibby, J. (1995). Multivariate Analysis. Academic Press.

Nordhausen, K., Oja, H., and Tyler, D. E. (2016). Asymptotic and bootstrap tests for subspace dimension. arXiv preprint arXiv:1611.04908.

Onatski, A., Moreira, M. J., and Hallin, M. (2014). Signal detection in high dimension: The multispiked case. Annals of Statistics, 42(1):225–254.

Rudelson, M. and Vershynin, R. (2009). Smallest singular value of a random rectangular matrix. Communications on Pure and Applied Mathematics: A Journal Issued by the Courant Institute of Mathematical Sciences, 62(12):1707–1739.

Schott, J. R. (2006). A high-dimensional test for the equality of the smallest eigenvalues of a covariance matrix. Journal of Multivariate Analysis, 97(4):827–843.

Shen, D., Shen, H., and Marron, J. (2016). A general framework for consistency of principal component analysis. Journal of Machine Learning Research, 17(1):5218–5251.

Virta, J. and Nordhausen, K. (2019). Estimating the number of signals using principal component analysis. Stat, 8(1):e231.

Wang, Q. and Yao, J. (2013). On the sphericity test with large-dimensional observations. Electronic Journal of Statistics, 7:2164–2192.

Yao, J., Zheng, S., and Bai, Z. (2015). Large Sample Covariance Matrices and High-Dimensional Data Analysis. Cambridge University Press.

Yata, K. and Aoshima, M. (2009). PCA consistency for non-Gaussian data in high dimension, low sample size context. Communications in Statistics - Theory and Methods, 38(16-17):2634–2652.

Yata, K., Aoshima, M., and Nakayama, Y. (2018). A test of sphericity for high-dimensional data and its application for detection of divergently spiked noise. Sequential Analysis, 37(3):397–411.