High dimensional PCA: a new model selection criterion
Abhinav Chakraborty, Soumendu Sundar Mukherjee, Arijit Chakrabarti
Abhinav Chakraborty
Department of Statistics, The Wharton School, University of Pennsylvania, Philadelphia, PA 19104, USA. [email protected]
Soumendu Sundar Mukherjee
Interdisciplinary Statistical Research Unit, Applied Statistics Division, Indian Statistical Institute, Kolkata, WB-700108, India. [email protected]
Arijit Chakrabarti
Applied Statistics Unit, Applied Statistics Division, Indian Statistical Institute, Kolkata, WB-700108, India. [email protected]
November 10, 2020
Abstract
Suppose we have a random sample from a multivariate population consisting of many variables. Estimating the number of dominant/large eigenvalues of the population covariance matrix based on the sample information is an important question arising in Statistics, with wide applications in many areas. In the context of Principal Components Analysis (PCA), the linear combinations of the original variables having the largest amounts of variation are determined by this number. In this paper, we study the high dimensional asymptotic setting where the number of variables grows at the same rate as the number of observations. We work in the framework where the population covariance matrix is assumed to have the spiked structure proposed in Johnstone (2001). In this setup, the problem of interest becomes essentially one of model selection and has attracted a lot of interest from researchers. Our focus is on the Akaike Information Criterion (AIC), which is known to be strongly consistent from the work of Bai et al. (2018). The result of Bai et al. (2018) requires a certain “gap condition” ensuring that the dominant eigenvalues of the covariance matrix are all above a level which is strictly larger than a threshold discovered by Baik, Ben Arous and Péché (called the BBP threshold), both quantities depending on the limiting ratio of the number of variables and observations. It is well known in the literature that, below this threshold, a spiked covariance structure becomes indistinguishable from one with no spikes. Thus the strong consistency of AIC requires, in a sense, some extra “signal strength” beyond what the BBP threshold corresponds to.

In this paper, our aim is to investigate whether consistency continues to hold even if the “gap” is made smaller. In this regard, we make two novel theoretical contributions. Firstly, we show that strong consistency under an arbitrarily small gap is achievable if we alter the penalty term of AIC suitably, depending on the target gap. Inspired by this result, we are able to show that a further intuitive alteration of the penalty can indeed make the gap exactly zero, although we can only achieve weak consistency in this case. We compare the two newly proposed estimators with other existing estimators in the literature via extensive simulation studies, and show, by suitably calibrating our proposals, that a significant improvement in terms of mean-squared error is achievable.
AMS 2010 Mathematics Subject Classifications:
Keywords:
Spiked model; model selection; high dimensional PCA
1 Introduction

Suppose we have a sample of observations from a multivariate population with p variables. Estimating the number of dominant/significant eigenvalues of the population covariance matrix in such a scenario is an important question arising in Statistics, with wide applications in many areas. In the context of Principal Components Analysis (PCA), a very popular method of dimension reduction for multivariate data, the individual principal components having the largest variability are determined by this number. Let λ_1 ≥ λ_2 ≥ ⋯ ≥ λ_p be the eigenvalues of the population covariance matrix. We further assume that the covariance matrix has the spiked structure proposed by Johnstone (2001). In this framework, the number of dominant eigenvalues is denoted by k, and all the eigenvalues except the first k are assumed equal, i.e.

λ_1 ≥ λ_2 ≥ ⋯ ≥ λ_k > λ_{k+1} = λ_{k+2} = ⋯ = λ_p.

This k is called the true number of dominant/significant components in this framework.

The spiked covariance model finds wide applications in many scientific fields. In wireless communications, for example, a signal emitted by a source is modulated and received by several antennas, and the quality of reconstruction of the original signal is directly linked to the “inference” of spikes. The spiked model is also used in different areas of Artificial Intelligence, such as face, handwriting and speech recognition, and in statistical learning. See Johnstone and Paul (2018) for more applications.

The number of significant components k is usually unknown, and we need to estimate it, which in turn becomes essentially a problem of model selection in the spiked covariance framework. This will be explained clearly in the next section. Many estimators have been developed in the literature, mostly based on information theoretic criteria such as the minimum description length (MDL), the Bayesian Information Criterion (BIC) and the Akaike Information Criterion (AIC) (see, e.g., Wax and Kailath (1985)). However, these applications have focused on the large sample size, low dimensional regime, and arguments in support of these estimators may not carry over to the high dimensional setup. In the recent past, several works have appeared in the area of signal processing for high dimensional data, where techniques from random matrix theory (RMT) have been used (see, for example, Kritchman and Nadler (2009) and Nadler (2010)). More recent papers in the literature include Passemier and Yao (2014) and Bai et al. (2018).

Our work is inspired by Bai et al. (2018), where the authors consider the Akaike Information Criterion (AIC) (Akaike (1998)) and the Bayesian Information Criterion (BIC) (Schwarz (1978)) as their estimation criteria in a high dimensional setting. They studied the consistency of the estimators based on the AIC and BIC criteria under an asymptotic framework where p, n → ∞ such that p/n → c > 0, where n refers to the sample size. They showed that, unlike AIC, the BIC criterion is not consistent when the signal strength λ_k is bounded (Bai et al. (2018)); in other words, BIC requires much more signal strength for signal detection. Their main result shows consistency of AIC for estimating k when a certain “gap condition” is satisfied, i.e. when λ_k is above a certain level λ_c lying above the BBP threshold (Baik et al. (2005)) of 1 + √c (see Section 2.2). More precisely, they showed that AIC is consistent if and only if λ_k > λ_c > 1 + √c.
Figure 1 shows the “gap” between λ_c and 1 + √c. We want to highlight the fact that if λ_k ≤ 1 + √c, then there is no hope of estimating k (see, for example, Baik et al. (2005) and Section 2.1.1 for more details).

Figure 1: Gap between λ_c and the BBP threshold.

The primary aim of our work is to investigate whether we can improve upon the results of Bai et al. (2018) by suitably modifying the AIC criterion to give consistent estimates of k under a weakening of the “gap condition”. Towards this, our main contributions are as follows. We show that, given any δ > 0, we can develop an estimator (depending on δ) which is strongly consistent when the gap between λ_k and the BBP threshold is at least δ (see Section 3.1). We can also make the gap exactly zero by modifying our estimator, but then we have to part with strong consistency: we can only prove weak consistency of the modified estimator (see Section 3.2). We note that there is another weakly consistent estimator known in the literature, due to Passemier and Yao (2014), which also works under zero gap.

Inspection of the proof of strong consistency in Bai et al. (2018) reveals that, for any arbitrary δ > 0, a modification (based on δ and the ratio of dimension and sample size) of the penalty of AIC makes the asymptotic argument work when the gap between λ_k and the BBP threshold is above δ. A further modification of the penalty term, obtained by letting that δ go to zero at an appropriate rate depending on n, gives us an estimator that is weakly consistent under “zero gap”. This proof is obtained by employing novel arguments aided by some deep Random Matrix Theory results on the asymptotic behavior of sample covariance matrices.

After finishing this work, it came to our notice that a similar problem has been studied independently by Hu et al. (2020). However, their motivations, results and methods of proof are different from ours.

The rest of the paper is organised as follows. In Section 2, we discuss the problem setup, review some key results of Random Matrix Theory (RMT) and the literature on estimating the number of spikes. Section 3 introduces the main idea and results behind our modified AIC criterion. In Section 4 we compare our proposed estimators with other estimators available in the literature via extensive simulation studies. Conclusions then follow, and the Appendix collects all the proofs.
2 Problem setup and preliminaries

We first describe our setup and main problem in the first few paragraphs of this section. We follow the same notation as used in Bai et al. (2018). Suppose we have a random sample y_1, ..., y_n of size n from a population of dimension p, and let Y = (y_1, ..., y_n)^T denote the full (n × p) data matrix. Let the population mean be μ and the population covariance matrix be denoted by Σ. Our interest here is in the spiked covariance structure described before, which can be succinctly described through models M_k, where

M_k : λ_1 ≥ λ_2 ≥ ⋯ ≥ λ_k > λ_{k+1} = ⋯ = λ_p.

We consider the models M_0, M_1, ..., M_{p−1}, starting from a configuration of zero spikes (i.e. λ_1 = ⋯ = λ_p) to one with (p − 1) spikes. We want to estimate, using the available data, the true value of k. Our main problem is thus reduced to one of model selection from the pool of the above p candidate models.

Our focus in this paper will be the high dimensional asymptotic setting where p and n grow proportionately, i.e. we assume that

p/n → c > 0.    (C1)

To avoid notational clutter, we suppress the dependence of p on n, and of Σ, λ_1, λ_2, ... on p. For the estimation of k, we will be interested in the study (to be explained later in this section) of the distributional properties of the eigenvalues of the sample covariance matrix of the y_j's. We can assume without any loss of generality that μ = 0. Then the sample covariance matrix S_n is given by

S_n = (1/(n − 1)) ∑_{i=1}^n y_i y_i^T.

Denoting {1, ..., n} by [n], we may also assume, as in Bai et al. (2018), that y_j = Σ^{1/2} x_j for j ∈ [n], where x_j = (x_{1j}, ..., x_{pj})^T and {x_{ij}, i ∈ [p], j ∈ [n]} is a double array of i.i.d. random variables with mean 0 and variance 1. We further assume, as in Bai et al. (2018), that the x_{ij}'s have finite fourth moment.

For the rest of the paper we assume that the eigenvalues of the population covariance matrix Σ are 1, except the first k, which are (λ_i)_{1 ≤ i ≤ k}, and that Σ has the form

Σ = ( Σ_k   0
      0    I_{p−k} ),    (C2)

where Σ_k has k non-null and non-unit eigenvalues λ_1 ≥ λ_2 ≥ ⋯ ≥ λ_k > 1. We also assume, as in Bai et al. (2018), that k is unknown but a fixed finite number that does not change with n.

We will now summarize, in the next few subsections, the concepts, issues and results from the literature which are relevant to this study. We first touch upon some fundamental results on the properties of the eigenstructure of the sample covariance matrix and the high dimensional phenomenon. This is followed by a review of some specific works in the literature on the topic of our interest.
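Before proceeding, the following minimal sketch (ours, not the authors' code) makes the sampling setup concrete: it draws y_j = Σ^{1/2} x_j under (C2) with Gaussian entries and returns the eigenvalues of S_n. The dimensions and the spike values are illustrative assumptions.

```python
import numpy as np

def sample_eigenvalues(n, p, spikes, rng=None):
    """Eigenvalues of S_n = (1/(n-1)) sum_i y_i y_i^T, in decreasing order,
    for the spiked model (C2) with diagonal Sigma and mu = 0 (as assumed WLOG)."""
    rng = np.random.default_rng(rng)
    lam = np.ones(p)
    lam[:len(spikes)] = spikes            # population eigenvalues: spikes, then 1's
    X = rng.standard_normal((n, p))       # i.i.d. entries, mean 0, variance 1
    Y = X * np.sqrt(lam)                  # y_j = Sigma^{1/2} x_j for diagonal Sigma
    S = Y.T @ Y / (n - 1)                 # sample covariance matrix S_n
    return np.sort(np.linalg.eigvalsh(S))[::-1]

l = sample_eigenvalues(n=500, p=250, spikes=[5.0, 3.0])   # c = p/n = 0.5
print(l[:4])   # two spiked sample eigenvalues, then values near the bulk edge
```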
2.1 Principal Components Analysis

We start this section by briefly describing Principal Components Analysis (PCA), since the eigenstructure of the sample covariance matrix is intrinsically related to it. We recall that in PCA (see, e.g., Anderson (1958); Jolliffe (1986)) one sequentially finds orthogonal directions to produce uncorrelated (normalized) linear combinations of the original variables with maximum variance. In the standard approach, these are obtained from the eigen-decomposition of the (sample) covariance matrix. Consider the eigen-decomposition of Σ:

Σ = λ_1 u_1 u_1′ + ⋯ + λ_p u_p u_p′ = UΛU′,

where U is a p × p orthogonal matrix whose columns are the eigenvectors u_i, and Λ is a diagonal matrix whose entries λ_i are the eigenvalues of Σ. The sample analogue of this is

S_n = l_1 v_1 v_1′ + ⋯ + l_p v_p v_p′ = VLV′,

where now the orthogonal matrix V has as columns the sample eigenvectors v_i, and L is a diagonal matrix consisting of the eigenvalues l_i of S_n. If Z denotes a generic random observation from the distribution, then the vector of population principal components is given by ZU. The sample principal components are defined as Yv_i, for i = 1, ..., p. It may be noted that, for each i ∈ {1, ..., p}, the variance of Zu_i is λ_i. So the larger the λ_i, the larger is the variance of the random variable Zu_i. Thus the number of dominant/large eigenvalues of Σ corresponds to the number of principal components carrying the largest amount of information. In the traditional setting where p is fixed and n is large, the sample eigenvalues and eigenvectors converge to their population counterparts, i.e. as n → ∞ we have l_i → λ_i and v_i → u_i a.s., for each i = 1, ..., p (see, for example, Anderson (1958)). The situation is more subtle in the high dimensional case where p grows with n, and it is well known in the literature (Silverstein (1995)) that the consistency of the sample eigenvalues and eigenvectors does not carry over to this case. In the next part of this section we focus on the high dimensional aspect, with special emphasis on results from random matrix theory under the spiked covariance framework.

2.1.1 High dimensional phenomena and the BBP phase transition

Many of the results described below are classically well-known facts from Random Matrix Theory (RMT), while some are more recent discoveries. The notational convention is similar to Bai et al. (2018). Recall that the eigenvalues of S_n are denoted by l_1 ≥ l_2 ≥ ⋯ ≥ l_p ≥ 0 (again suppressing their dependence on n and/or p). Let us define the Empirical Spectral Distribution (ESD) of S_n by

F_n(x) = (1/p) ∑_{i=1}^p 1{l_i ≤ x}.

By a result of Silverstein (1995), under condition (C2) the ESD of S_n, i.e. F_n(x), converges to F_c(x) almost surely, where F_c(x) is the Marčenko–Pastur (MP) law/distribution. Here, for 0 < c ≤ 1, F_c is given by the density

F_c′(x) = f_c(x) = (1/(2πxc)) √((b − x)(x − a)).

The support of this distribution is [a, b], where a := (1 − √c)² and b := (1 + √c)². If c > 1, F_c has a point mass 1 − 1/c at the origin, i.e.

F_c(x) = 0 if x < 0;   1 − 1/c if 0 ≤ x < a;   1 − 1/c + ∫_a^x f_c(t) dt if a ≤ x ≤ b,

where a and b are the same as in the case 0 < c ≤ 1. The previous result characterizes the bulk behaviour of the sample eigenvalues. We now state results characterizing the convergence of individual eigenvalues of S_n. As in Bai et al.
(2018), we define λ_i to be a “distant spiked eigenvalue” if λ_i > 1 + √c. We also define, for x ≠ 1, the function

ψ_c(x) = x + cx/(x − 1).

The next result is the same as Lemma 2.1 in Bai et al. (2018).
Lemma 2.1.
Let l_i denote the i-th largest eigenvalue of the sample covariance matrix S_n in our setup. Suppose that E(x_{11}^4) < ∞, conditions (C1) and (C2) hold, and that λ_1 is bounded.

(i) If λ_i is distant-spiked, then l_i → ψ_c(λ_i) = ψ_i = λ_i + cλ_i/(λ_i − 1) a.s.

(ii) If λ_i is not distant-spiked and i/p → α, then l_i → μ_{1−α} a.s., where μ_α is the α-quantile of the MP distribution. In particular, if i = o(p), then l_i → μ_1 = b = (1 + √c)² a.s.

The results in the above lemma are examples of a general high dimensional “phase transition” phenomenon observed by many authors (e.g., Bai and Yao (2012), Baik et al. (2005)), often referred to as the BBP phase transition after Baik et al. (2005). In a nutshell, as summarized in Paul (2007), this refers to the fact that if the non-unit eigenvalues of a spiked model are close to one, then their sample counterparts asymptotically behave as if the true covariance matrix were the identity matrix. The asymptotics, however, change critically if the dominant eigenvalues are larger than the threshold 1 + √c. We now describe its details and the implications for our assumptions. For understanding its effect on distributional convergence, assume for simplicity that Σ is diagonal with a single spike (k = 1), so that Σ = diag{λ_1, 1, ..., 1}. When λ_1 = 1, the largest sample eigenvalue is located near the upper edge b of the MP distribution and fluctuates on the (small) scale n^{−2/3}, approximately according to the real-valued Tracy–Widom distribution:

n^{2/3} (l_1 − μ(c))/σ(c) → TW_1 in distribution,

where μ(c) = b, σ(c) = (1 + √c)^{4/3} c^{−1/6}, and TW_1 is a random variable following the real-valued Tracy–Widom distribution (see Johnstone (2001) for more details). For λ_1 ≤ 1 + √c, the largest sample eigenvalue has the same limiting Tracy–Widom distribution: the small spike in the top population eigenvalue has no limiting effect on the distribution of the top sample eigenvalue. Put another way, asymptotically the largest sample eigenvalue is of no use in detecting a subcritical spike in the largest population eigenvalue. A phase transition occurs at 1 + √c: for larger values of λ_1, the largest sample eigenvalue l_1 has a limiting Gaussian distribution (see Paul (2007)), with scale of the usual order n^{−1/2}. The mean of this Gaussian distribution (= ψ_c(λ_1)) shows a significant upward bias, being significantly larger than the true value of λ_1. We now come back briefly to the manifestation of the phase transition in the pointwise convergence of Lemma 2.1, where we consider a spiked model with k spikes. For the non-spiked eigenvalue λ_{k+1}, the sample counterpart l_{k+1} converges a.s. to (1 + √c)² (by Lemma 2.1 (ii)). If 1 < λ_k ≤ 1 + √c, then l_k also converges a.s. to (1 + √c)². So asymptotically it becomes difficult to distinguish l_k and l_{k+1} (i.e. models M_k and M_{k+1}, respectively). This is not the case if λ_k > 1 + √c, as seen in Lemma 2.1 (i). Keeping in mind this phase transition behaviour, we have assumed, for our results presented later, that the first k eigenvalues are “distant spiked eigenvalues”.
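The phase transition in Lemma 2.1 is easy to see numerically. The following sketch (ours, assuming Gaussian data and illustrative parameter values) compares the largest sample eigenvalue with ψ_c(λ_1) above the threshold and with the bulk edge b below it.

```python
import numpy as np

def top_eigenvalue(n, p, lam1, rng):
    """Largest eigenvalue of S_n under a single-spike model (k = 1)."""
    lam = np.ones(p); lam[0] = lam1
    Y = rng.standard_normal((n, p)) * np.sqrt(lam)
    return np.linalg.eigvalsh(Y.T @ Y / (n - 1))[-1]

rng = np.random.default_rng(0)
n, p = 2000, 1000
c = p / n                                  # c = 0.5, BBP threshold 1 + sqrt(c) ~ 1.707
psi = lambda x: x + c * x / (x - 1)        # psi_c from Lemma 2.1
b = (1 + np.sqrt(c)) ** 2                  # bulk edge of the MP law
for lam1 in [1.2, 1.7, 2.5, 4.0]:
    l1 = top_eigenvalue(n, p, lam1, rng)
    target = psi(lam1) if lam1 > 1 + np.sqrt(c) else b
    print(f"lambda_1 = {lam1}: l_1 = {l1:.3f}, predicted limit = {target:.3f}")
```

Subcritical spikes land essentially at b, so they are invisible to the top sample eigenvalue, in line with the discussion above.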
2.2 The AIC criterion of Bai et al. (2018)

This section is mainly based on the paper by Bai et al. (2018). They consider the spiked covariance structure mentioned before, expressed by the models M_k for varying k, depending on the value of the true number of spikes k. To estimate k, they propose using the traditional AIC criterion to select a model from among the pool of candidate models, thereby obtaining an estimator which is strongly consistent. We shall discuss the p ≤ n case first. Defining C_{p,n} = n log(((n − 1)/n)^p) + np(1 + log(2π)), it has been noted in Fujikoshi et al. (2011) that the criterion value under model M_j is given by

AIC_j = n log(l_1 ⋯ l_j) + n(p − j) log l̄_j + 2 d_j + C_{p,n},

where l_1 > ⋯ > l_p are the sample eigenvalues of S_n and, for 0 ≤ j ≤ p − 1, l̄_j is the arithmetic mean of l_{j+1}, ..., l_p, that is,

l̄_j = (1/(p − j)) ∑_{t=j+1}^p l_t.

Furthermore, d_j denotes the number of independent model parameters under model M_j and is given by

d_j = pj − j(j + 1)/2 + j + 1 + p = (j + 1)(p + 1 − j/2).

The expression for d_j is obtained by looking at the eigen-decomposition of the covariance matrix Σ, which can be written as

Σ = ∑_{i=1}^j λ_i u_i u_i^T + λ̄ (I − ∑_{i=1}^j u_i u_i^T),

where the u_i's are mutually orthogonal unit vectors. It is evident that p degrees of freedom are accounted for by μ, j + 1 by the λ_i's, i = 1, ..., j, together with λ̄, and pj − j(j + 1)/2 by the j orthonormal eigenvectors. The AIC criterion selects the model M_{k̂_A}, where

k̂_A = arg min_j AIC_j.

The estimator of k proposed by Bai et al. (2018) is k̂_A. When we are interested in only the first q models M_j, j = 0, 1, ..., q − 1, the criterion is defined by taking the minimum only over j = 0, 1, ..., q − 1. We call q the number of candidate models. Denoting A_j = (1/n)(AIC_j − AIC_{p−1}), the model selection rule of AIC can equivalently be written as k̂_A = arg min_j A_j, where A_j is given by

A_j = (p − j) log l̄_j − ∑_{i=j+1}^p log l_i − (p − j − 1)(p − j + 2)/n.
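Since C_{p,n} does not depend on j, the rule can be computed directly from the statistic A_j displayed above. A minimal sketch (ours, not the authors' code; the eigenvalue input and the number of candidate models q are assumptions):

```python
import numpy as np

def A_values(l, n, q):
    """A_j = (p-j) log(lbar_j) - sum_{i>j} log l_i - (p-j-1)(p-j+2)/n,
    for j = 0, ..., q-1; l must hold the sample eigenvalues in decreasing order."""
    p = len(l)
    A = np.empty(q)
    for j in range(q):
        tail = l[j:]                      # l_{j+1}, ..., l_p in 0-based indexing
        A[j] = ((p - j) * np.log(tail.mean())
                - np.sum(np.log(tail))
                - (p - j - 1) * (p - j + 2) / n)
    return A

# usage with the sample_eigenvalues helper sketched earlier (an assumption):
# l = sample_eigenvalues(n=500, p=250, spikes=[5.0, 3.0])
# k_hat = int(np.argmin(A_values(l, n=500, q=20)))
```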
We are now going to state the main results of Bai et al. (2018) regarding the consistency of k̂_A. A criterion k̂ for estimating k is said to be consistent (respectively, strongly consistent) if lim_{n→∞} P(k̂ = k) = 1 (respectively, P(lim_{n→∞} k̂ = k) = 1). We will first state the result for the case 0 < c ≤ 1. Before doing so, we state condition (C3), referred to as the “gap condition”. Recalling the function ψ_c(x) and the quantities ψ_i = ψ_c(λ_i) defined earlier, the “gap condition” is given by

ψ_k − 1 − log ψ_k − 2c > 0.    (C3)

The next result is proved as Theorem 3.1(i) of Bai et al. (2018).

Theorem 2.2.

Suppose the conditions (C1) with 0 < c ≤ 1, and (C2) hold, and that the number of candidate models, q, satisfies q = o(p). Suppose also that λ_1 is bounded. We have the following results on the consistency of the estimation criterion k̂_A based on AIC:

(i) If the gap condition (C3) does not hold, then k̂_A is not consistent.

(ii) If the gap condition (C3) holds, then k̂_A is strongly consistent.

Next we consider the case where p, n → ∞ such that p > n and p/n → c > 1. Clearly, in this setup the smallest p − (n − 1) eigenvalues of S_n are zero, that is, l_{n−1} > l_n = ⋯ = l_p = 0. Thus, as noted in Bai et al. (2018), it is impossible to infer about the smallest population eigenvalues λ_n, λ_{n+1}, ..., λ_p > 0 using the sample eigenvalues. Therefore, these authors additionally assume in this setup that (C4) holds, where

λ_{n−1} = λ_n = ⋯ = λ_p = 1.    (C4)

Under this new assumption, for j = 0, 1, ..., n − 2, it follows that

M̃_j : λ_j > λ_{j+1} = ⋯ = λ_{n−1}   ⇔   M_j : λ_j > λ_{j+1} = ⋯ = λ_p.

In order to describe the model selection criterion proposed in this setup in Bai et al. (2018), l̄_j first has to be redefined as

l̄_j = (1/(n − 1 − j)) ∑_{t=j+1}^{n−1} l_t,   j = 0, 1, ..., n − 2.

Using this modification, a new criterion Ã_j is then introduced as

Ã_j = (n − 1 − j) log l̄_j − ∑_{i=j+1}^{n−1} log l_i − (n − j − 2)(n − j + 1)/p,

obtained by replacing the p and n in A_j by n − 1 and p, respectively. Note that Ã_{n−2} = 0. The “quasi-AIC” rule, henceforth abbreviated as the qAIC rule, selects the model M_{k̂_Ã}, where

k̂_Ã = arg min_{j ≤ n−2} Ã_j.

The strong consistency of qAIC for the case c > 1 is proved under the modified “gap condition”

ψ_k/c − 1 − log(ψ_k/c) − 2/c > 0.    (C5)

We now state the result regarding the consistency of k̂_Ã, proved as Theorem 3.3(i) in Bai et al. (2018).

Theorem 2.3.
Suppose the conditions (C1) with c > 1, and (C4) hold, and that the number of candidate models, q, satisfies q = o(p). Suppose also that λ_1 is bounded. We have the following results on the consistency of the estimation criterion k̂_Ã based on qAIC:

(i) If the gap condition (C5) fails, then k̂_Ã is not consistent.

(ii) If the gap condition (C5) holds, then k̂_Ã is strongly consistent.

Next we discuss another type of estimator available in the literature, one which is weakly consistent, i.e. an estimator k̂ of k such that k̂ → k in probability as n → ∞.

2.3 The estimator of Passemier and Yao (2014)

This section is based on the work of Passemier and Yao (2014), who proposed a weakly consistent estimator of k under the “zero gap” condition. Suppose we have observed a random sample y_1, ..., y_n from a p-dimensional population. These authors additionally assumed that the y's can be expressed as y = EV^{1/2}x, where x ∈ ℝ^p is a zero-mean random vector of i.i.d. components, E is an orthogonal matrix and

V = cov(x) = ( Σ_k   0
               0    I_{p−k} ),

where Σ_k has k non-null and non-unit eigenvalues λ_1 ≥ λ_2 ≥ ⋯ ≥ λ_k > 1. The sample covariance matrix is taken to be

S_n = (1/n) ∑_{i=1}^n y_i y_i′.

The proposed estimator in Passemier and Yao (2014) is based on the differences between consecutive eigenvalues of S_n. The main idea behind it is explained next. Define

δ_j = l_j − l_{j+1},   for j = 1, ..., p − 1.

The authors then observe that, under certain assumptions, it can be shown that if j ≥ k then δ_j → 0 at a fast rate, whereas if j < k and the λ_j's are distinct then δ_j tends to a positive limit. Even if some of the λ_j's are equal, for j < k the convergence of δ_j to zero is slow. Thus a possible estimate of k is the index j at which δ_j becomes small for the first time. The estimator of k proposed in Passemier and Yao (2014) is denoted by k̂_P and is given by

k̂_P = min{ j ∈ {1, ..., s} : δ_{j+1} < d_n },

where s > k is a fixed number, big enough, and d_n is an appropriately chosen small number. In practice, the integer s should be thought of as a preliminary bound on the number of possible spikes. Before stating the main theorem on the weak consistency of this estimator, we state one of the main assumptions required to prove their theorem.

Assumption 1:
The entries x_i of the random vector x have a symmetric law and sub-exponential decay; that is, there exist positive constants C, C′ such that, for all t ≥ C′,

P(|x_i| ≥ t^C) ≤ e^{−t}.

Theorem 2.4.
Let (y_i), 1 ≤ i ≤ n, be n i.i.d. copies of y = EV^{1/2}x, where x ∈ ℝ^p is a zero-mean random vector of i.i.d. components satisfying Assumption 1. Assume that V is of the form described before, where Σ_k has k non-null, non-unit eigenvalues satisfying λ_1 ≥ ⋯ ≥ λ_k > 1 + √c. Assume further that (C1) holds. Let d_n be a real sequence such that d_n = o(n^{−1/2}) and n^{2/3} d_n → ∞. Then the estimator k̂_P is weakly consistent, i.e. P(k̂_P = k) → 1 as n → ∞.
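For concreteness, a minimal sketch (ours) of the rule k̂_P just described. The threshold d_n = n^{−0.6} is one admissible choice under Theorem 2.4 (it satisfies d_n = o(n^{−1/2}) and n^{2/3} d_n → ∞); in practice its constant would need calibration, as Passemier and Yao discuss.

```python
import numpy as np

def py_estimate(l, n, s):
    """Passemier-Yao rule: first j in {1, ..., s} with delta_{j+1} < d_n.
    l: sample eigenvalues in decreasing order; s: preliminary spike bound."""
    d_n = n ** (-0.6)                 # illustrative threshold sequence (assumption)
    delta = l[:-1] - l[1:]            # delta[j-1] = delta_j = l_j - l_{j+1} (1-based)
    for j in range(1, s + 1):         # j = 1, ..., s
        if delta[j] < d_n:            # delta[j] (0-based) is delta_{j+1} (1-based)
            return j
    return s                          # fall back to the preliminary bound
```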
3 Modified AIC criteria

In this section we motivate and describe the main results of this article. We first discuss our work on strong consistency, followed by that on weak consistency.

3.1 Strong consistency under an arbitrarily small gap

The idea for developing a modification of the AIC criterion comes from a careful study of the proof of strong consistency of the traditional AIC in Bai et al. (2018). The “gap condition” mentioned before is a crucial requirement in their proof. This makes AIC inadequate for consistent estimation of k when λ_k is close to the BBP threshold (1 + √c). We show in this paper that, by suitably modifying the penalty term in the AIC criterion, this problem can be taken care of, in that the modified criterion is strongly consistent even when λ_k is arbitrarily close to 1 + √c. In this section we informally sketch how to obtain this modified criterion, and also the proof of its strong consistency. Towards that, we recall the function ψ_c defined earlier and define

h(x) = x − 1 − log x,    ψ_c(x) = x + cx/(x − 1),    F_c(x) = h(ψ_c(x)).    (⋆)

The case 0 < c ≤ 1. Assume that λ_k > 1 + √c. Note that h is strictly increasing on [1, ∞), and ψ_c is strictly increasing on [1 + √c, ∞), with ψ_c(1 + √c) = (1 + √c)² =: b. Therefore F_c is also strictly increasing on [1 + √c, ∞). Next we break the proof of consistency by Bai et al. (2018) into two main steps. This gives us the important clue on how to modify the penalty of AIC.

(i) For showing k̂_A ≤ k a.s., Bai et al. (2018) needed the condition 1 − b + log b + 2c > 0, i.e. 2c > F_c(1 + √c), which is true for any c ∈ (0, 1] (as can be verified directly).

(ii) To ensure that k̂_A ≥ k a.s., Bai et al. (2018) needed the gap condition (C3):

ψ_k − 1 − log ψ_k − 2c > 0 ⇔ F_c(λ_k) > 2c ⇔ λ_k > F_c^{−1}(2c) = ψ_c^{−1}(h^{−1}(2c)) =: λ_c.

Clearly λ_c > 1 + √c. Let

u(c) := λ_c − (1 + √c)

denote the gap between the BBP threshold 1 + √c and the consistency threshold λ_c for AIC. A natural question is: how large is the gap as a function of c? It is displayed in Figure 2; numerical calculation shows that u(c) > c.

Figure 2: The size of the gap u(c).

Our aim here is to investigate whether we can develop a criterion that is strongly consistent even when λ_k is closer to 1 + √c than λ_c. The inequalities in (i) and (ii) above appear as the final asymptotic limits in the argument of Bai et al. (2018). Placed together, for every 0 < c ≤ 1, it is required that the inequalities

F_c(λ_k) > 2c > F_c(1 + √c)

hold, where the factor 2 appears unchanged from the per-parameter penalty 2 in AIC, and c from the limiting ratio of p and n. Our modified AIC criterion is based on the following crucial analytic observation. By the monotonicity of F_c(·), the second inequality above continues to hold if 2 is replaced by any α < 2 such that αc > F_c(1 + √c). Evidently, for any fixed c ∈ (0, 1], there is a whole interval of such choices of α and, for that matter, one can choose α to make αc arbitrarily close to F_c(1 + √c). For any such α, the correspondingly modified first inequality F_c(λ_k) > αc is satisfied by any λ_k larger than F_c^{−1}(αc), the latter being smaller than λ_c by monotonicity of F_c(·). Clearly F_c^{−1}(αc) − (1 + √c) < u(c), and the smaller the choice of α, the closer F_c^{−1}(αc) is to 1 + √c. We derive our modified AIC criterion by formally using this basic observation, as described below.

Suppose we replace the original per-parameter penalty 2 of AIC by α(p/n), for some continuous function α : ℝ → (0, ∞); then the same arguments as in Bai et al.
(2018) lead to the following conditions:

(iii) For showing k̂_α ≤ k a.s.: 1 − b + log b + α(c)c > 0, i.e. α(c)c > F_c(1 + √c).

(iv) For showing k̂_α ≥ k a.s.: F_c(λ_k) > α(c)c,

where k̂_α is the minimizer of the new criterion over the model space. Thus, if we want a modified criterion that is strongly consistent for any “gap” δ_c ∈ (0, u(c)), all we need is

α(c) = (1/c) F_c(1 + √c + δ_c).

The monotonicity of F_c(·) ensures that the inequalities in (iii) and (iv) are satisfied for this choice of α(c), which assures strong consistency of k̂_α as long as λ_k > 1 + √c + δ_c. Thus we can reach arbitrarily close to the BBP threshold 1 + √c and still achieve strong consistency. In fact, since u(c) > c, we may choose δ_c ∈ (0, c]. For example, if we choose δ_c much smaller than c, then with α(c) = (1/c) F_c(1 + √c + δ_c) we have consistency of k̂_α as long as λ_k > 1 + √c + δ_c. This is a better estimator than AIC, because it can consistently estimate k even when the gap condition (C3) fails to hold. In particular, if we want the gap between λ_k and 1 + √c to be δ, where δ > 0 is an arbitrarily small constant, we can choose α(·) such that

α(c) = (1/c) F_c(1 + √c + δ),

and we then have strong consistency of our estimator as long as λ_k > 1 + √c + δ.

We now describe in detail the new model selection criterion discussed above. Being a modification of AIC, it is called AIC*. To describe it, we first fix a constant δ > 0. We shall discuss the p < n case first. With C_{p,n} as defined before, the AIC* criterion value for the model M_j is defined as

AIC*_j = n log(l_1 ⋯ l_j) + n(p − j) log l̄_j + (1/(p/n)) F_{p/n}(1 + √(p/n) + δ) d_j + C_{p,n},

where, as before, l_1 > ⋯ > l_p > 0 are the sample eigenvalues of S_n and, for 0 ≤ j ≤ p − 1, l̄_j is the arithmetic mean of l_{j+1}, ..., l_p, that is,

l̄_j = (1/(p − j)) ∑_{t=j+1}^p l_t,

and d_j = (j + 1)(p + 1 − j/2) denotes the number of independent parameters under model M_j. Furthermore, the function F_c(·) is defined as in (⋆). Then AIC* selects k according to the rule

k̂*_A = arg min_j AIC*_j,

where the minimum is taken over the model space.
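A sketch (ours, not the authors' code) of the AIC* rule via the equivalent A_j-type statistic: the per-parameter penalty 2 of AIC is replaced by α(c) = F_c(1 + √c + δ)/c, with F_c = h ∘ ψ_c as in (⋆). The target gap δ passed in is the user's choice.

```python
import numpy as np

def F(c, x):
    """F_c(x) = h(psi_c(x)) as in (*), for x > 1 + sqrt(c)."""
    psi = x + c * x / (x - 1)            # psi_c(x)
    return psi - 1 - np.log(psi)         # h(psi_c(x))

def aic_star_values(l, n, q, delta):
    """Equivalent AIC* statistic for j = 0, ..., q-1 (constants cancel)."""
    p = len(l)
    c = p / n
    alpha = F(c, 1 + np.sqrt(c) + delta) / c   # modified per-parameter penalty
    A = np.empty(q)
    for j in range(q):
        tail = l[j:]
        A[j] = ((p - j) * np.log(tail.mean())
                - np.sum(np.log(tail))
                - alpha * (p - j - 1) * (p - j + 2) / (2 * n))
    return A

# k_hat_star = int(np.argmin(aic_star_values(l, n, q=20, delta=0.05)))
```

With alpha = 2 this reduces to the ordinary A_j statistic sketched earlier, which is the design point of the modification.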
We now state our main result regarding the consistency of the estimator k̂*_A; its proof can be found in Appendix Section 7.3.

Theorem 3.1.

Suppose the conditions (C1) with 0 < c ≤ 1, and (C2) hold, and that the number of candidate models, q, satisfies q = o(p). Assume that λ_1 is bounded. We have the following results on the consistency of the estimation criterion k̂*_A based on AIC*:

(i) If λ_k ≤ 1 + √c + δ, then k̂*_A is not consistent.

(ii) If λ_k > 1 + √c + δ, then k̂*_A is strongly consistent.

The case c > 1. Next we consider the case where c > 1. Let

Q_c(x) = c h(ψ_c(x)/c).    (⋆⋆)

Recall that h is strictly increasing on [1, ∞) and ψ_c is strictly increasing on [1 + √c, ∞); therefore ψ_c(x)/c is increasing in x. We have ψ_c(1 + √c)/c = (1 + √c)²/c > 1 for all c > 1. Therefore Q_c is also strictly increasing on [1 + √c, ∞).

(i) For showing k̂_Ã ≤ k a.s., Bai et al. (2018) needed the condition 1 − b/c + log(b/c) + 2/c > 0, i.e. 2 > Q_c(1 + √c), which is true for any c > 1 (as can be verified directly).

(ii) To ensure that k̂_Ã ≥ k a.s., Bai et al. (2018) need the gap condition (C5):

ψ_k/c − 1 − log(ψ_k/c) − 2/c > 0 ⇔ Q_c(λ_k) > 2 ⇔ λ_k > Q_c^{−1}(2) =: λ_c.

Note that λ_c > 1 + √c. Let

v(c) := λ_c − (1 + √c)

be the gap between the BBP threshold 1 + √c and the consistency threshold λ_c for qAIC. Our aim, as in the previous case, is to investigate whether a new criterion can be derived which is strongly consistent even when the “gap” is reduced as much as possible. So we first look at how large the gap is as a function of c; Figure 3 shows that v(c) stays bounded away from zero.

Figure 3: The size of the gap v(c).

Following intuition similar to the case 0 < c ≤ 1, if we penalize by replacing the 2 in qAIC with α(p/n) for some function α : ℝ → (0, ∞), then the same arguments as in Bai et al. (2018) lead to the following conditions:

(iii) For showing k̂_α ≤ k a.s.: 1 − b/c + log(b/c) + α(c)/c > 0, i.e. α(c) > Q_c(1 + √c).

(iv) For showing k̂_α ≥ k a.s.: Q_c(λ_k) > α(c).

Thus, for any δ_c ∈ (0, v(c)), if we take the function

α(c) = Q_c(1 + √c + δ_c),

then, using the monotonicity of Q_c(·), we have consistency of k̂_α as long as λ_k > 1 + √c + δ_c. Thus we can again reach arbitrarily close to the BBP threshold 1 + √c.

Our model selection criterion is accordingly a modification of the criterion proposed by Bai et al. (2018) as discussed in Section 2.2. We define our criterion using Ã*_j, which for model M_j is given by

Ã*_j = (n − 1 − j) log l̄_j − ∑_{i=j+1}^{n−1} log l_i − Q_{p/n}(1 + √(p/n) + δ) (n − j − 2)(n − j + 1)/(2p),

where l̄_j is defined as in Bai et al. (2018) for the p > n case, i.e.

l̄_j = (1/(n − 1 − j)) ∑_{t=j+1}^{n−1} l_t,

and Q_c(·) is as defined in (⋆⋆). The model selection rule Ã* estimates k by k̂*_Ã, defined as

k̂*_Ã = arg min_j Ã*_j,

the minimum, as before, being taken over the model space.
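The analogous sketch (again ours, an illustration rather than the authors' implementation) for the p > n case, with the qAIC penalty factor 2 replaced by Q_c(1 + √c + δ):

```python
import numpy as np

def Q(c, x):
    """Q_c(x) = c * h(psi_c(x)/c) as in (**), for c > 1 and x > 1 + sqrt(c)."""
    psi = x + c * x / (x - 1)
    r = psi / c
    return c * (r - 1 - np.log(r))

def qaic_star_values(l, n, p, q, delta):
    """Modified qAIC statistic; l holds the n-1 nonzero sample eigenvalues,
    in decreasing order, and q is the number of candidate models."""
    c = p / n
    alpha = Q(c, 1 + np.sqrt(c) + delta)
    m = n - 1
    A = np.empty(q)
    for j in range(q):
        tail = l[j:m]
        A[j] = ((m - j) * np.log(tail.mean())
                - np.sum(np.log(tail))
                - alpha * (n - j - 2) * (n - j + 1) / (2 * p))
    return A
```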
The following result shows the consistency of our estimator k̂*_Ã.

Theorem 3.2.

Suppose the conditions (C1) with c > 1 and (C4) hold, and that the number of candidate models, q, satisfies q = o(p). Suppose that λ_1 is bounded. We have the following results on the consistency of the estimation criterion k̂*_Ã:

(i) If λ_k ≤ 1 + √c + δ, then k̂*_Ã is not consistent.

(ii) If λ_k > 1 + √c + δ, then k̂*_Ã is strongly consistent.

3.2 Weak consistency under zero gap

We have seen in the previous section that, for any given δ > 0, however small, we can appropriately modify the penalty term (using δ, among other things) of the usual AIC to produce an estimator of k which is strongly consistent when the gap between λ_k and the BBP threshold is larger than δ. Ideally, however, we would aim to develop estimators of k which are consistent whenever λ_k > 1 + √c, i.e. where we can afford to have the “gap” equal to zero. A natural first step is to investigate what happens when we choose the gap to be δ_n > 0, where δ_n → 0 as n → ∞, in the definition of the modified AIC criterion described before.

We first discuss the case 0 < c ≤ 1. We modify the penalty term in the AIC criterion to be α(p, n) instead of 2, where

α(p, n) = (1/(p/n)) F_{p/n}(1 + √(p/n) + δ_n).

We define k̂_α as the index that minimizes this modified criterion. Observe that α(p, n) → (1/c) F_c(1 + √c) as n → ∞. Let us check whether k̂_α is strongly consistent. Following the arguments of Bai et al. (2018), we find the following.

(i) For showing k̂_α ≥ k a.s., we need F_c(λ_k) > F_c(1 + √c). This is true, since λ_k > 1 + √c.

(ii) For showing k̂_α ≤ k a.s., we need 1 − b + log b + F_c(1 + √c) > 0, i.e. F_c(1 + √c) > F_c(1 + √c), which is not true.

Clearly the arguments given by Bai et al. (2018) do not yield our desired result for this intuitive modification of our previous criterion. We have, however, been able to come up with a novel argument that does. The argument is described in detail in Section 7.2 of the Appendix, and it carefully exploits some deep results available from Random Matrix Theory regarding the rate of convergence of certain functionals of the sample covariance matrix. Ultimately it turns out that δ_n has to converge to zero at such a rate that the new penalty α(p, n) satisfies certain convergence properties. The only downside is that we cannot prove strong consistency of the new criterion using this method; we can, however, show weak consistency of the estimator under some assumptions (see Theorems 3.3 and 3.4).

As mentioned above, the basic idea for developing the estimator which is consistent under the “zero gap” condition is to use our previous modified AIC criterion with δ_n instead of δ, where δ_n → 0. We describe the model selection rule below. Being a modification of AIC*, we call it AIC**. We discuss the p < n case first. With C_{p,n} defined as before, the criterion value under model M_j is defined as

AIC**_j = n log(l_1 ⋯ l_j) + n(p − j) log l̄_j + (1/(p/n)) F_{p/n}(1 + √(p/n) + δ_n) d_j + C_{p,n}.

Furthermore, d_j is as defined before. Defining A**_j = (1/n)(AIC**_j − AIC**_{p−1}), we have

A**_j = (p − j) log l̄_j − ∑_{i=j+1}^p log l_i − (1/(p/n)) F_{p/n}(1 + √(p/n) + δ_n) (p − j − 1)(p − j + 2)/(2n).
Then the AIC** criterion selects the model with index k̂**_A, obtained as

k̂**_A = arg min_j AIC**_j,   or equivalently   k̂**_A = arg min_j A**_j.

The criterion above is defined by considering the minimum only with respect to j = 0, 1, ..., s − 1, where s > k is a fixed number, big enough. In practice, the integer s should be thought of as a preliminary bound on the number of possible spikes.
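Theorem 3.3 below requires n^{2/3}(F_{p/n}(1 + √(p/n) + δ_n) − F_c(1 + √c)) → ∞. Since ψ_c′(1 + √c) = 0, this increment is of order δ_n², so any δ_n → 0 with n^{2/3} δ_n² → ∞ works. The following quick check (ours) uses the illustrative choice δ_n = n^{−1/4}, which is an assumption and not a prescription from the theorem:

```python
import numpy as np

def F(c, x):
    psi = x + c * x / (x - 1)
    return psi - 1 - np.log(psi)

c = 0.5
edge = 1 + np.sqrt(c)
for n in [10**3, 10**4, 10**5, 10**6]:
    delta_n = n ** (-0.25)                          # one admissible choice
    increment = F(c, edge + delta_n) - F(c, edge)   # shrinks like delta_n^2
    print(n, n ** (2 / 3) * increment)              # grows, roughly like n^{1/6}
```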
We have the following theorem regarding consistency of the proposed estimator; its proof can be found in Appendix Section 7.2.

Theorem 3.3.

Suppose that condition (C1) holds with 0 < c ≤ 1, that (C2) holds, and that the number of candidate models is s, where s > k is a fixed number. Suppose that Assumption 1 holds, λ_1 is bounded and λ_k > 1 + √c. Let δ_n be a real sequence going to 0 such that

n^{2/3} ( F_{p/n}(1 + √(p/n) + δ_n) − F_c(1 + √c) ) → ∞,

where F_c is as defined in (⋆). Then the estimator k̂**_A is a weakly consistent estimator of k, i.e. P(k̂**_A = k) → 1 as n → ∞.

We have a similar result for the case p > n, i.e. c > 1. Our modified model selection criterion is defined using Ã**_j, which is given by

Ã**_j = (n − 1 − j) log l̄_j − ∑_{i=j+1}^{n−1} log l_i − Q_{p/n}(1 + √(p/n) + δ_n) (n − j − 2)(n − j + 1)/(2p).

The rule then selects the number of significant components according to

k̂**_Ã = arg min_j Ã**_j.

The criterion above is defined by considering the minimum only with respect to j = 0, 1, ..., s − 1, where s > k is a fixed number, big enough. In practice, the integer s should be thought of as a preliminary bound on the number of possible spikes. We have the following theorem regarding consistency of the proposed estimator; its proof can be found in Appendix Section 7.2.

Theorem 3.4.
Suppose the conditions (C1) with c > 1 and (C2) hold, and that the number of candidate models is s, where s > k is a fixed number. Suppose that Assumption 1 holds, λ_1 is bounded and λ_k > 1 + √c. We also assume that p/n = c + O(n^{−2/3}). Let δ_n be a real sequence going to 0 such that

n^{2/3} ( (1/(p/n)) Q_{p/n}(1 + √(p/n) + δ_n) − (1/c) Q_c(1 + √c) ) → ∞,

where Q_c is as defined in (⋆⋆). Then the estimator k̂**_Ã is a weakly consistent estimator of k, i.e. P(k̂**_Ã = k) → 1 as n → ∞.

3.3 Choice of δ_n

Consistency of our estimator holds for a wide variety of sequences δ_n. For instance, if δ_n → 0 is a sequence satisfying the condition of our theorem, then so does cδ_n for any real constant c > 0. In this section we provide an automatic calibration of this parameter. The main idea is to look at the null case (k = 0), i.e. X_i i.i.d. ∼ N(0, I_p). Given n and p, we propose to select δ_n as

δ_n = inf{ δ > 0 : RMSE(k̂_{δ,n}) ≤ ε },

for a small pre-specified tolerance ε, where k̂_{δ,n} is our strongly consistent estimator depending on δ, as proposed in Theorem 3.1 or 3.2, and the RMSE, or relative mean square error, is given by

RMSE(k̂_{δ,n}) = E( (k̂_{δ,n} − k)/k )².

As we do not know the precise expression of the RMSE for the null case, we calculate it approximately by simulation over independent replications and call the result the SRMSE (sample RMSE). Using the SRMSE in place of the RMSE above, we find our choice of δ_n. Table 1 lists the values of δ_n chosen according to our proposal for a few different values of (p, n).

(p, n)   (200,200)   (400,400)   (1000,1000)   (200,400)   (500,1000)   (400,500)
δ_n

Table 1: δ_n for different values of (p, n).

Let us now illustrate, with the help of two models, how our method of choosing δ_n compares with manually choosing δ_n (i.e. choosing the δ_n which gives the lowest RMSE). X_1, X_2, ..., X_n are generated from N_p(0, Σ), where Σ = diag{λ_1, ..., λ_k, 1, ..., 1}. After fixing a value of c, n and p are varied such that p/n ≈ c. We compute the RMSE (by replication) for our estimator under the two different choices of δ_n, the first by automatic calibration and the second by manual tuning (to get the best performance). We compare both methods on the following two models.

• Model 1: c = 1, k = 2, λ_1 = 4, λ_2 = 3. In this case our calibrated δ_n performs really close to the “ideal” δ_n, as can be seen from Figure 4-(a).

• Model 2: c = 1, k = 2, λ_1 = 3, λ_2 slightly above 2, so that the gap between the BBP threshold and λ_k is very small. Here the difference between the two choices of δ_n is more prominent: the RMSE corresponding to the “best” δ_n is noticeably lower than that of our calibrated δ_n (see Figure 4-(b)).

Figure 4: (a) Model 1; (b) Model 2.

The main takeaway here is that when the gap is not too small, our calibrated δ_n performs very close to the “best” δ_n; when the “gap” is very small, further study is needed to be able to adaptively choose δ_n based on the “gap”. Nevertheless, in either case (gap small or big), our modified AIC criterion (with calibrated δ_n) works better than the other estimators in the literature, as shown in the next two sections.
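A sketch (ours) of the automatic calibration just described: simulate the null case k = 0 and take the smallest δ on a grid for which the estimator errs rarely. We reuse aic_star_values from the earlier sketch; the grid, the replication count, the tolerance, and the use of the null misselection proportion as the simulated error criterion are our illustrative assumptions.

```python
import numpy as np

def null_error_rate(n, p, delta, reps, rng):
    """Proportion of null replications (k = 0) on which k_hat != 0."""
    errs = 0
    for _ in range(reps):
        Y = rng.standard_normal((n, p))                 # null: Sigma = I_p
        l = np.sort(np.linalg.eigvalsh(Y.T @ Y / (n - 1)))[::-1]
        errs += int(np.argmin(aic_star_values(l, n, q=15, delta=delta)) != 0)
    return errs / reps

def calibrate_delta(n, p, grid, reps=100, tol=0.05, rng=None):
    """Smallest delta on the grid whose null error is within tolerance."""
    rng = np.random.default_rng(rng)
    for delta in sorted(grid):
        if null_error_rate(n, p, delta, reps, rng) <= tol:
            return delta
    return max(grid)

# delta_n = calibrate_delta(400, 400, grid=np.linspace(0.01, 0.5, 15))
```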
4 Simulation comparisons

4.1 Comparison with the estimator of Passemier and Yao (2014)

In this section we compare our modified AIC estimator against the estimator of Passemier and Yao (2014) (abbreviated as PY); see Section 2.3 for details. X_1, X_2, ..., X_n are generated from a p-variate Gaussian distribution with mean 0 and a spiked variance-covariance matrix Σ = diag{λ_1, ..., λ_k, 1, ..., 1}. After fixing a value of c, n and p are varied such that p/n ≈ c. We compute the RMSE for our estimator (with calibrated δ_n) and for the PY estimator (as proposed in their paper, with automatic calibration) by conducting independent replications. We compare both methods on the following models.

• Model A: c = 1, k = 2, λ_1 > 3, λ_2 > 2. Our method performs uniformly better than PY (see Figure 5-(a)), though the advantage seems to decrease as n increases.

• Model B: c = 1, k = 2, λ_1 = 3, λ_2 slightly above 2. In this case the gap between the BBP threshold and λ_k is very small. Our method is still the better one (see Figure 5-(b)).

• Model C: c = 1, k = 2, λ_1 = 3, λ_2 = 3 (equal spikes). Our method is clearly better than PY's (see Figure 5-(c)), and the advantage gained is significant, especially in this case of equal spikes.

Figure 5: (a) Model A; (b) Model B; (c) Model C.

4.2 Comparison with AIC

In this section we compare our estimator against the ordinary AIC criterion as proposed by Bai et al. (2018); see Section 2.2 for details. X_1, X_2, ..., X_n are generated from a p-variate Gaussian distribution with mean 0 and spiked variance-covariance matrix Σ = diag{λ_1, ..., λ_k, 1, ..., 1}. After fixing a value of c, n and p are varied such that p/n ≈ c. We compute the RMSE for our estimator (with calibrated δ_n) and for the ordinary AIC criterion by conducting independent replications. We compare both methods on the following models.

• Model D: c = 1, k = 2, with λ_1 and λ_2 slightly above 3. Here λ_k = λ_2 is chosen such that it satisfies the gap condition (C3) proposed by Bai et al. (2018). Observe that our method performs uniformly better than theirs (see Figure 6-(a)), though the advantage seems to decrease as n increases.

• Model E: c = 1, k = 5, λ_1 = λ_2 = 4, and λ_3 = λ_4 = λ_5 slightly above 3. Here λ_k = λ_5 is chosen such that it satisfies the gap condition (C3). Observe that in this case too our method performs uniformly better than theirs (see Figure 6-(b)).

Figure 6: (a) Model D; (b) Model E.

4.3 The case δ = 0

Observe that we can reach arbitrarily close to 1 + √c by suitably choosing the penalty term α(c) in the AIC criterion. Motivated by our estimators, where we have chosen the gap to be a δ which can be arbitrarily small, or even δ_n → 0 (in the case of weak consistency), we now present an estimator where δ is exactly zero, i.e. we choose

α(c) = (1/c) F_c(1 + √c).

We now present a simulation study of this estimator.
• Simulation setup: a value of c ∈ (0, 1) is fixed and n is varied; X_1, X_2, ..., X_n are generated from a p-variate Gaussian distribution with mean 0 and a spiked variance-covariance matrix (i.e. a diagonal matrix whose first k entries equal 1 + √c + δ and whose remaining p − k diagonal entries are 1). Here we fixed a small δ > 0 and k = 10.

• Results:
We varied n from 100 to 3200, with repeated iterations for each value of n. Table 2 shows the accuracy (i.e. the proportion of times k̂ equals the true k) and the average value of k̂ over the iterations. The results suggest that the estimator is consistent. We leave the theoretical analysis of this case to future work.
n          100    200    400    800    1200   1500   2000   2500   3200
accuracy   0.02   0.04   0.04   0.26   0.48   0.76   0.94   0.96   0.94
avg k̂

Table 2: Accuracy and average value of k̂ for the estimator with δ = 0.

5 Conclusion

In this article, we have considered the AIC criterion for high dimensional model selection. Bai et al. (2018) have shown strong consistency of AIC under a “gap” condition, requiring more signal than what the BBP threshold demands. We have modified the penalty term in AIC suitably and proposed a criterion that achieves strong consistency under an arbitrarily small gap. We have also proposed another modification that achieves weak consistency under exactly zero gap. We have made a detailed empirical study of the performance of the proposed criteria. Furthermore, we have compared our second proposal with the estimator of Passemier and Yao (2014), which also achieves weak consistency under zero gap.
Acknowledgements

This article is based on the Master's dissertation Chakraborty (2020) of the first author, under the supervision of the second and third authors. The authors thank Professors Gopal K. Basak and Tapas Samanta for their insightful comments. The second author is supported by an INSPIRE Faculty Fellowship, Department of Science and Technology, Government of India.
References
Akaike, H. (1998). Information theory and an extension of the maximum likelihood principle. In Selected Papers of Hirotugu Akaike, pages 199–213. Springer.

Anderson, T. W. (1958). An Introduction to Multivariate Statistical Analysis. Wiley, New York.

Bai, Z., Choi, K. P., and Fujikoshi, Y. (2018). Consistency of AIC and BIC in estimating the number of significant components in high-dimensional principal component analysis. The Annals of Statistics, 46(3):1050–1076.

Bai, Z. and Yao, J. (2012). On sample eigenvalues in a generalized spiked population model. Journal of Multivariate Analysis, 106:167–177.

Baik, J., Ben Arous, G., and Péché, S. (2005). Phase transition of the largest eigenvalue for nonnull complex sample covariance matrices. The Annals of Probability, 33(5):1643–1697.

Benaych-Georges, F., Guionnet, A., and Maida, M. (2011). Fluctuations of the extreme eigenvalues of finite rank deformations of random matrices. Electronic Journal of Probability, 16:1621–1662.

Chakraborty, A. (2020). Some Contributions to High Dimensional Principal Component Analysis. Master's thesis, Indian Statistical Institute, Kolkata.

Fujikoshi, Y., Ulyanov, V. V., and Shimizu, R. (2011). Multivariate Statistics: High-Dimensional and Large-Sample Approximations, volume 760. John Wiley & Sons.

Hu, J., Zhang, J., and Zhu, J. (2020). Detection of the number of principal components by extended AIC-type method.

Johnstone, I. M. (2001). On the distribution of the largest eigenvalue in principal components analysis. The Annals of Statistics, 29(2):295–327.

Johnstone, I. M. and Paul, D. (2018). PCA in high dimensions: An orientation. Proceedings of the IEEE, 106(8):1277–1292.

Jolliffe, I. T. (1986). Principal components in regression analysis. In Principal Component Analysis, pages 129–155. Springer.

Kritchman, S. and Nadler, B. (2009). Non-parametric detection of the number of signals: Hypothesis testing and random matrix theory. IEEE Transactions on Signal Processing, 57(10):3930–3941.

Li, Z., Han, F., and Yao, J. (2019). Asymptotic joint distribution of extreme eigenvalues and trace of large sample covariance matrix in a generalized spiked population model. arXiv preprint arXiv:1906.09639.

Nadler, B. (2010). Nonparametric detection of signals by information theoretic criteria: performance analysis and an improved estimator. IEEE Transactions on Signal Processing, 58(5):2746–2756.

Passemier, D. and Yao, J. (2014). Estimation of the number of spikes, possibly equal, in the high-dimensional case. Journal of Multivariate Analysis, 127:173–183.

Paul, D. (2007). Asymptotics of sample eigenstructure for a large dimensional spiked covariance model. Statistica Sinica, 17:1617–1642.

Schwarz, G. (1978). Estimating the dimension of a model. The Annals of Statistics, 6(2):461–464.

Silverstein, J. W. (1995). Strong convergence of the empirical distribution of eigenvalues of large dimensional random matrices. Journal of Multivariate Analysis, 55(2):331–339.

Wax, M. and Kailath, T. (1985). Detection of signals by information theoretic criteria. IEEE Transactions on Acoustics, Speech, and Signal Processing, 33(2):387–392.

7 Appendix

7.1 Auxiliary lemmas

The proofs of Theorems 3.1, 3.3 and 3.4 rely on several additional lemmas, which are organised in this section. We first recall a lemma stated earlier in Section 2.1.1 as Lemma 2.1.
Lemma 7.1.
Let l_{ip} denote the i-th largest eigenvalue of the sample covariance matrix S_n in our setup. Suppose that E(x_{11}^4) < ∞, conditions (C1) and (C2) hold, and that λ_1 is bounded.

(i) If λ_i > 1 + √c, then l_{ip} → ψ_i = λ_i + cλ_i/(λ_i − 1) a.s.

(ii) If λ_i < 1 + √c and i/p → α, then l_{ip} → μ_{1−α} a.s. (where μ_α is the α-quantile of the MP distribution). In particular, if i is bounded, l_{ip} → μ_1 = b := (1 + √c)² a.s.

The next lemma is a direct consequence of Theorem 3.1 of Li et al. (2019).
Lemma 7.2.
Under the assumptions of Theorem 3.3 or 3.4, we have

trace(S_n) − trace(Σ) → N(0, σ²) in distribution,

where σ² is a constant not depending on n or p. The next lemma is a direct consequence of a proposition in Benaych-Georges et al. (2011).

Lemma 7.3.
Assume that the entries x_i of x have a symmetric law and sub-exponential decay, that is, there exist positive constants C, C′ such that, for all t > C′, we have P(|x_i| ≥ t^C) ≤ e^{−t}. Then, for any prefixed range L and all 1 ≤ i ≤ L,

n^{2/3}(l_{k+i} − b) = O_P(1).

7.2 Proofs of Theorems 3.3 and 3.4

In this section we prove Theorems 3.3 and 3.4.
Proof of Theorem 3.3.
We will show that P(k̂**_A = k) → 1 as n → ∞. We do so by breaking the problem into the two cases j < k and j > k; in both cases we show that P(A**_j > A**_k) → 1 as n → ∞.
When j < k,

A**_j − A**_k = ∑_{i=j+1}^k (A**_{i−1} − A**_i)
= ∑_{i=j+1}^k [ (p − i + 1) log{1 − (1/(p − i + 1))(1 − l_{ip}/l̄_{ip})} + log l̄_{ip} − log l_{ip} − (1/(p/n)) F_{p/n}(1 + √(p/n) + δ_n)(p − i + 1)/n ]
→ ∑_{i=j+1}^k [ ψ_i − 1 − log ψ_i − F_c(1 + √c) ] a.s.
= ∑_{i=j+1}^k [ F_c(λ_i) − F_c(1 + √c) ].

The only step that needs justification is the third one; specifically:
Claim: (p − i + 1) log{1 − (1/(p − i + 1))(1 − l_{ip}/l̄_{ip})} → ψ_i − 1 a.s.

Observe that

| (p − i + 1) log{1 − (1/(p − i + 1))(1 − l_{ip}/l̄_{ip})} − (ψ_i − 1) |
≤ | −(p − i + 1) log{1 − (1/(p − i + 1))(1 − l_{ip}/l̄_{ip})} − (1 − l_{ip}/l̄_{ip}) | + | l_{ip}/l̄_{ip} − ψ_i |.

Result (Taylor's theorem):
Let $f$ be twice differentiable on $(a,b)$ with $f$ and $f'$ continuous on $[a,b]$, and let $c \in [a,b]$. Then for every $x \in [a,b]$ with $x \neq c$, there exists a point $\xi \in (\min(x,c), \max(x,c))$ such that
\[ f(x) = f(c) + f'(c)(x-c) + \frac{f''(\xi)}{2}(x-c)^2. \]
In particular, under these conditions on $f$,
\[ f(x+h) = f(x) + f'(x)h + \frac{f''(\zeta)}{2}h^2, \qquad \zeta \in (\min(x,x+h), \max(x,x+h)). \]
Choose
\[ f(x) = \log(1-x), \qquad x = 0, \qquad h = \frac{1}{p-i+1}\left(1 - \frac{l_{ip}}{\bar{l}_{ip}}\right), \]
and observe that $f'(x) = -\frac{1}{1-x}$ and $f''(x) = -\frac{1}{(1-x)^2}$. We then have
\[ \log\left\{1 - \frac{1}{p-i+1}\left(1 - \frac{l_{ip}}{\bar{l}_{ip}}\right)\right\} = -\frac{1}{p-i+1}\left(1 - \frac{l_{ip}}{\bar{l}_{ip}}\right) - \frac{h^2}{2(1-\zeta)^2}, \]
where $\zeta \in (\min(0,h), \max(0,h))$. So
\[ -(p-i+1)\log\left\{1 - \frac{1}{p-i+1}\left(1 - \frac{l_{ip}}{\bar{l}_{ip}}\right)\right\} = \left(1 - \frac{l_{ip}}{\bar{l}_{ip}}\right) + (p-i+1)\frac{h^2}{2(1-\zeta)^2}. \]
Hence
\[ \left| -(p-i+1)\log\left\{1 - \frac{1}{p-i+1}\left(1 - \frac{l_{ip}}{\bar{l}_{ip}}\right)\right\} - \left(1 - \frac{l_{ip}}{\bar{l}_{ip}}\right) \right| = (p-i+1)\frac{h^2}{2(1-\zeta)^2} = \frac{1}{2(1-\zeta)^2}\,\frac{\big(1 - l_{ip}/\bar{l}_{ip}\big)^2}{p-i+1}. \]
Let us examine the right-hand side. From Lemma 2.1, $l_{ip} \xrightarrow{a.s.} \psi_i$, and by the Marchenko--Pastur (MP) law, $\bar{l}_{ip} \xrightarrow{a.s.} \mu_{MP} = 1$, where $\mu_{MP}$ is the mean of the MP distribution. Using these we infer that
\[ \left(1 - \frac{l_{ip}}{\bar{l}_{ip}}\right) \xrightarrow{a.s.} 1 - \psi_i \quad \text{and} \quad h = \frac{1}{p-i+1}\left(1 - \frac{l_{ip}}{\bar{l}_{ip}}\right) \xrightarrow{a.s.} 0. \]
Since $\zeta \in (\min(0,h), \max(0,h))$, the sandwich theorem gives $\zeta \xrightarrow{a.s.} 0$, so $\frac{1}{(1-\zeta)^2} \xrightarrow{a.s.} 1$, and therefore
\[ \frac{1}{2(1-\zeta)^2}\left(1 - \frac{l_{ip}}{\bar{l}_{ip}}\right)^2 \xrightarrow{a.s.} \frac{(1-\psi_i)^2}{2}. \]
As $\frac{1}{p-i+1} \to 0$, coupled with the previous fact, we conclude that
\[ \left| -(p-i+1)\log\left\{1 - \frac{1}{p-i+1}\left(1 - \frac{l_{ip}}{\bar{l}_{ip}}\right)\right\} - \left(1 - \frac{l_{ip}}{\bar{l}_{ip}}\right) \right| \xrightarrow{a.s.} 0. \]
Next, again using Lemma 2.1 and the MP law, $l_{ip} \xrightarrow{a.s.} \psi_i$ and $\bar{l}_{ip} \xrightarrow{a.s.} 1$, which implies
\[ \frac{l_{ip}}{\bar{l}_{ip}} \xrightarrow{a.s.} \psi_i \implies \left|\frac{l_{ip}}{\bar{l}_{ip}} - \psi_i\right| \xrightarrow{a.s.} 0. \]
Hence the claim is proved. Next, using the monotonicity of $F_c(\cdot)$ and of the $\lambda_i$'s, we have
\[ A^{**}_j - A^{**}_k \xrightarrow{a.s.} \sum_{i=j+1}^{k} \big[F_c(\lambda_i) - F_c(1+\sqrt{c})\big] \ge (k-j)\big[F_c(\lambda_k) - F_c(1+\sqrt{c})\big]. \]
Since $\lambda_k > 1+\sqrt{c}$, we have $F_c(\lambda_k) - F_c(1+\sqrt{c}) > 0$, hence $P\big(\lim_{n\to\infty}(A^{**}_j - A^{**}_k) > 0\big) = 1$ for all $j < k$. This implies that $P\big(\lim_{n\to\infty} \hat{k}^{**}_A \ge k\big) = 1$, and in particular the weaker statement $P(\hat{k}^{**}_A \ge k) \to 1$ as $n \to \infty$.
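The strict positivity used at the end of Case I is easy to check numerically. The following sketch is purely illustrative and not part of the proof; it assumes the standard explicit form of the BBP map, $\psi_c(\lambda) = \lambda\big(1 + \frac{c}{\lambda-1}\big)$ for $\lambda > 1+\sqrt{c}$ (consistent with $\psi_c(1+\sqrt{c}) = (1+\sqrt{c})^2$ used above), and evaluates $F_c(\lambda) = \psi_c(\lambda) - 1 - \log\psi_c(\lambda)$.
\begin{verbatim}
import numpy as np

def psi(lam, c):
    # Assumed explicit BBP map; satisfies psi(1 + sqrt(c), c) = (1 + sqrt(c))**2.
    return lam * (1.0 + c / (lam - 1.0))

def F(lam, c):
    # F_c(lambda) = psi_c(lambda) - 1 - log psi_c(lambda), as in the limit above.
    return psi(lam, c) - 1.0 - np.log(psi(lam, c))

c = 0.5
bbp = 1.0 + np.sqrt(c)   # BBP threshold

# Each summand F_c(lambda_i) - F_c(1 + sqrt(c)) in the Case I limit is
# strictly positive once lambda_i exceeds the threshold.
for lam in [bbp + 0.1, bbp + 0.5, bbp + 2.0]:
    print(lam, F(lam, c) - F(bbp, c))   # all strictly positive
\end{verbatim}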
Case II: When $j > k$ and $j$ is bounded, i.e. $j \le s$,
\[
A^{**}_j - A^{**}_k = \sum_{i=k+1}^{j} \big(A^{**}_{i} - A^{**}_{i-1}\big) = \sum_{i=k+1}^{j} \left[-(p-i+1)\log\left\{1 - \frac{1}{p-i+1}\left(1 - \frac{l_{ip}}{\bar{l}_{ip}}\right)\right\} - \log \bar{l}_{ip} + \log l_{ip} + \frac{1}{p/n}\,F_{p/n}\big(1+\sqrt{p/n}+\delta_n\big)\,\frac{p-i+1}{n}\right].
\]
Let us examine $h_n n^{2/3}\big(A^{**}_j - A^{**}_k\big)$, where $h_n \to 0$ as $n \to \infty$. Picking the $i$-th summand, we divide the expression into a sum of four parts, as follows:
1. $-h_n n^{2/3} \log \bar{l}_{ip}$;
2. $h_n n^{2/3} \big(\log l_{ip} - \log b\big)$;
3. $h_n n^{2/3} \left[-(p-i+1)\log\left\{1 - \frac{1}{p-i+1}\left(1 - \frac{l_{ip}}{\bar{l}_{ip}}\right)\right\} - (1-b)\right]$;
4. $h_n n^{2/3} \left(\frac{1}{p/n}\,F_{p/n}\big(1+\sqrt{p/n}+\delta_n\big)\frac{p-i+1}{n} + \log b + 1 - b\right)$.
Here $b := \psi_c(1+\sqrt{c}) = (1+\sqrt{c})^2$. Let us analyse each of these terms one by one.

1. $-h_n n^{2/3}\log \bar{l}_{ip}$: Using Lemma 7.2, $\operatorname{trace}(S_n) - \operatorname{trace}(\Sigma_p) \xrightarrow{d} N(0,\sigma^2)$, whence
\[
p\left(\frac{\sum_{j=1}^p l_{jp}}{p} - \frac{\sum_{j=1}^p \lambda_j}{p}\right) = O_p(1)
\implies p\left(\frac{\sum_{j=1}^i l_{jp} + \sum_{j=i+1}^p l_{jp}}{p} - \frac{\sum_{j=1}^i \lambda_j + (p-i)}{p}\right) = O_p(1)
\]
\[
\implies p\left(\frac{p-i}{p}\big[\bar{l}_{ip} - 1\big] + \frac{\sum_{j=1}^k l_{jp} - \sum_{j=1}^k \lambda_j}{p} + \frac{\sum_{j=k+1}^i (l_{jp} - \lambda_j)}{p}\right) = O_p(1)
\]
\[
\implies (p-i)\big[\bar{l}_{ip} - 1\big] + \sum_{j=1}^k (l_{jp} - \lambda_j) + \sum_{j=k+1}^i (l_{jp} - 1) = O_p(1),
\]
where we have used that $\lambda_j = 1$ for $j > k$. We know that $l_{jp} \xrightarrow{a.s.} \psi_j = \psi_c(\lambda_j)$ if $j \le k$, while $l_{jp} \xrightarrow{a.s.} b$ when $j > k$ and $j$ is finite. Therefore
\[ \sum_{j=1}^k (l_{jp} - \lambda_j) + \sum_{j=k+1}^i (l_{jp} - 1) \xrightarrow{a.s.} \sum_{j=1}^k (\psi_j - \lambda_j) + (i-k)(b-1). \]
This, along with the previous fact, implies that $(p-i)\big[\bar{l}_{ip} - 1\big] = O_p(1)$. So
\[ h_n n^{2/3}\big[\bar{l}_{ip} - 1\big] = \frac{h_n n^{2/3}}{p-i}\,(p-i)\big[\bar{l}_{ip} - 1\big], \]
which, coupled with the fact that $p/n \to c > 0$ and Slutsky's theorem, yields $h_n n^{2/3}\big[\bar{l}_{ip} - 1\big] \xrightarrow{P} 0$. Next, by the mean value theorem,
\[ \log x - \log 1 = \frac{1}{\zeta(x)}(x - 1), \qquad \min(1,x) < \zeta(x) < \max(1,x). \]
Using the above we can infer that
\[ h_n n^{2/3} \log \bar{l}_{ip} = \frac{h_n n^{2/3}}{\zeta(\bar{l}_{ip})}\big(\bar{l}_{ip} - 1\big). \]
As $\bar{l}_{ip} \xrightarrow{P} 1$, the sandwich theorem gives $\zeta(\bar{l}_{ip}) \xrightarrow{P} 1$. Therefore, using Slutsky's theorem,
\[ -h_n n^{2/3} \log \bar{l}_{ip} = -\frac{h_n n^{2/3}}{\zeta(\bar{l}_{ip})}\big(\bar{l}_{ip} - 1\big) \xrightarrow{P} 0. \]

2. $h_n n^{2/3}\big(\log l_{ip} - \log b\big)$: By the mean value theorem,
\[ \log x - \log b = \frac{1}{\zeta(x)}(x - b), \qquad \min(b,x) < \zeta(x) < \max(b,x), \]
so that $\log l_{ip} - \log b = \frac{1}{\zeta(l_{ip})}(l_{ip} - b)$. By Lemma 7.3, $n^{2/3}(l_{ip} - b) = O_p(1)$ for all $k < i \le s$ (a Monte Carlo illustration of this scale appears after the proof), hence $h_n n^{2/3}(l_{ip} - b) \xrightarrow{P} 0$. Also, since $l_{ip} \xrightarrow{a.s.} b$, the sandwich theorem gives $\zeta(l_{ip}) \xrightarrow{a.s.} b$. Therefore
\[ h_n n^{2/3}\big(\log l_{ip} - \log b\big) = h_n n^{2/3}(l_{ip} - b)\,\frac{1}{\zeta(l_{ip})} \xrightarrow{P} 0. \]

3. $h_n n^{2/3}\left[-(p-i+1)\log\left\{1 - \frac{1}{p-i+1}\left(1 - \frac{l_{ip}}{\bar{l}_{ip}}\right)\right\} - (1-b)\right]$: Observe that
\[ h_n n^{2/3}\left| -(p-i+1)\log\{\cdot\} - (1-b) \right| \le h_n n^{2/3}\left| -(p-i+1)\log\{\cdot\} - \left(1 - \frac{l_{ip}}{\bar{l}_{ip}}\right) \right| + h_n n^{2/3}\left| \frac{l_{ip}}{\bar{l}_{ip}} - b \right|. \]
Using Taylor's theorem as in Case I, with $f(x) = \log(1-x)$, $x = 0$ and $h = \frac{1}{p-i+1}\left(1 - \frac{l_{ip}}{\bar{l}_{ip}}\right)$, we get
\[ \log\left\{1 - \frac{1}{p-i+1}\left(1 - \frac{l_{ip}}{\bar{l}_{ip}}\right)\right\} = -\frac{1}{p-i+1}\left(1 - \frac{l_{ip}}{\bar{l}_{ip}}\right) - \frac{h^2}{2(1-\zeta)^2}, \qquad \zeta \in (\min(0,h), \max(0,h)), \]
so that
\[ \left| -(p-i+1)\log\{\cdot\} - \left(1 - \frac{l_{ip}}{\bar{l}_{ip}}\right) \right| = \frac{1}{2(1-\zeta)^2}\,\frac{\big(1 - l_{ip}/\bar{l}_{ip}\big)^2}{p-i+1}. \]
Looking at the right-hand side: we already know that $l_{ip} \xrightarrow{a.s.} b$ and $\bar{l}_{ip} \xrightarrow{a.s.} 1$, so
\[ \left(1 - \frac{l_{ip}}{\bar{l}_{ip}}\right) \xrightarrow{a.s.} 1 - b \quad \text{and} \quad h \xrightarrow{a.s.} 0. \]
As before, the sandwich theorem gives $\zeta \xrightarrow{a.s.} 0$, so $\frac{1}{2(1-\zeta)^2}\big(1 - l_{ip}/\bar{l}_{ip}\big)^2 \xrightarrow{a.s.} \frac{(1-b)^2}{2}$. Since $\frac{h_n n^{2/3}}{p-i+1} \to 0$, coupled with the previous fact, we conclude that
\[ h_n n^{2/3}\left| -(p-i+1)\log\left\{1 - \frac{1}{p-i+1}\left(1 - \frac{l_{ip}}{\bar{l}_{ip}}\right)\right\} - \left(1 - \frac{l_{ip}}{\bar{l}_{ip}}\right) \right| \xrightarrow{P} 0. \]
Next let us deal with
\[ h_n n^{2/3}\left|\frac{l_{ip}}{\bar{l}_{ip}} - b\right| = h_n n^{2/3}\left|\frac{l_{ip}}{\bar{l}_{ip}} - \frac{b}{\bar{l}_{ip}} + \frac{b}{\bar{l}_{ip}} - b\right| \le h_n n^{2/3}\left|\frac{l_{ip}}{\bar{l}_{ip}} - \frac{b}{\bar{l}_{ip}}\right| + h_n n^{2/3}\left|\frac{b}{\bar{l}_{ip}} - b\right| \le h_n n^{2/3}\,\frac{|l_{ip} - b|}{|\bar{l}_{ip}|} + h_n n^{2/3}\, b\,\frac{|\bar{l}_{ip} - 1|}{|\bar{l}_{ip}|}. \]
By Lemma 7.3, $n^{2/3}(l_{ip} - b) = O_p(1)$ for all $k < i \le s$, so $h_n n^{2/3}(l_{ip} - b) \xrightarrow{P} 0$; together with $\bar{l}_{ip} \xrightarrow{a.s.} 1$ this gives
\[ h_n n^{2/3}\,\frac{|l_{ip} - b|}{|\bar{l}_{ip}|} \xrightarrow{P} 0. \]
We have already observed that $h_n n^{2/3}\big[\bar{l}_{ip} - 1\big] \xrightarrow{P} 0$; together with $\bar{l}_{ip} \xrightarrow{a.s.} 1$ this gives
\[ b\, h_n n^{2/3}\,\frac{|\bar{l}_{ip} - 1|}{|\bar{l}_{ip}|} \xrightarrow{P} 0. \]
Hence $h_n n^{2/3}\left|\frac{l_{ip}}{\bar{l}_{ip}} - b\right| \xrightarrow{P} 0$, and so finally
\[ h_n n^{2/3}\left[-(p-i+1)\log\left\{1 - \frac{1}{p-i+1}\left(1 - \frac{l_{ip}}{\bar{l}_{ip}}\right)\right\} - (1-b)\right] \xrightarrow{P} 0. \]
4. Looking at the last term
\[ h_n n^{2/3}\left(\frac{1}{p/n}\,F_{p/n}\big(1+\sqrt{p/n}+\delta_n\big)\frac{p-i+1}{n} + \log b + 1 - b\right), \]
call $\alpha(p,n) = \frac{1}{p/n}\,F_{p/n}\big(1+\sqrt{p/n}+\delta_n\big)$ and $\alpha(c) = \frac{1}{c}\,F_c(1+\sqrt{c})$. The last term can then be written as
\[ h_n n^{2/3}\left(\alpha(p,n)\,\frac{p}{n} + \log b + 1 - b\right) - h_n n^{2/3}\,\alpha(p,n)\,\frac{i-1}{n}. \]
Observe that $h_n n^{2/3}\frac{i-1}{n} \to 0$ and $\alpha(p,n) \to \alpha(c)$, so that $h_n n^{2/3}\,\alpha(p,n)\,\frac{i-1}{n} \to 0$. Therefore only the first term matters asymptotically, and since $F_c(1+\sqrt{c}) = b - 1 - \log b$, it equals
\[ h_n n^{2/3}\left(F_{p/n}\big(1+\sqrt{p/n}+\delta_n\big) - F_c(1+\sqrt{c})\right). \]
Now $\delta_n$ is chosen precisely so that this term diverges to $\infty$. As a result,
\[ h_n n^{2/3}\big(A^{**}_j - A^{**}_k\big) \xrightarrow{P} \infty \implies P\big(A^{**}_j > A^{**}_k\big) \to 1, \qquad s \ge j > k. \]
Combining both cases, we have $P\big(A^{**}_j > A^{**}_k\big) \to 1$ for all $j \ne k$, $j \le s$. As there are finitely many $j$'s, this implies that $P\big(A^{**}_j > A^{**}_k\ \forall\, j \ne k,\ j \le s\big) \to 1$. Hence $P(\hat{k}^{**}_A = k) \to 1$, i.e. $\hat{k}^{**}_A \xrightarrow{P} k$.
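The Tracy--Widom scale $n^{2/3}(l_{ip} - b) = O_p(1)$ invoked from Lemma 7.3 above (and hence the fact that multiplying by any $h_n \to 0$ kills terms 2 and 3) can be visualised by simulation. Below is a minimal Monte Carlo sketch, assuming Gaussian data with $\Sigma_p = I_p$, the pure-noise analogue of the eigenvalues $l_{ip}$, $k < i \le s$; it is an illustration, not part of the argument.
\begin{verbatim}
import numpy as np

rng = np.random.default_rng(0)

def edge_fluctuations(n, p, reps):
    # n^{2/3}-scaled fluctuation of the largest eigenvalue of a white
    # Wishart matrix around b = (1 + sqrt(p/n))^2 (Tracy-Widom scale,
    # cf. Lemma 7.3).
    c = p / n
    b = (1.0 + np.sqrt(c)) ** 2
    vals = []
    for _ in range(reps):
        X = rng.standard_normal((n, p))
        S = X.T @ X / n                    # sample covariance, Sigma_p = I_p
        l1 = np.linalg.eigvalsh(S)[-1]     # largest sample eigenvalue
        vals.append(n ** (2.0 / 3.0) * (l1 - b))
    return np.array(vals)

# The spread stays of constant order as n grows, so h_n * n^{2/3} * (l1 - b)
# vanishes in probability for any h_n -> 0.
for n in [200, 400, 800]:
    f = edge_fluctuations(n, n // 2, reps=50)
    print(n, np.round(np.percentile(f, [10, 90]), 2))
\end{verbatim}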
Proof of Theorem 3.4: We will show that $P(\hat{k}^{**}_{\tilde{A}} = k) \to 1$ as $n \to \infty$. As before, we break the problem into the two cases $j < k$ and $j > k$, and in both cases we will show that $P(\tilde{A}^{**}_j > \tilde{A}^{**}_k) \to 1$ as $n \to \infty$. Case I:
When $j < k$,
\[
\begin{aligned}
\tilde{A}^{**}_j - \tilde{A}^{**}_k &= \sum_{i=j+1}^{k} \big(\tilde{A}^{**}_{i-1} - \tilde{A}^{**}_{i}\big)\\
&= \sum_{i=j+1}^{k} \left[(n-i)\log\left\{1 - \frac{1}{n-i}\left(1 - \frac{l_{ip}}{\bar{l}_{ip}}\right)\right\} + \log \bar{l}_{ip} - \log l_{ip} - Q_{p/n}\big(1+\sqrt{p/n}+\delta_n\big)\,\frac{n-i}{p}\right]\\
&\xrightarrow{a.s.} \sum_{i=j+1}^{k} \left[\frac{\psi_i}{c} - 1 - \log\frac{\psi_i}{c} - \frac{1}{c}\,Q_c(1+\sqrt{c})\right]\\
&= \sum_{i=j+1}^{k} \frac{1}{c}\left[Q_c(\lambda_i) - Q_c(1+\sqrt{c})\right].
\end{aligned}
\]
The only step that needs justification is the third step (the almost sure convergence); specifically:
Claim: $(n-i)\log\left\{1 - \frac{1}{n-i}\left(1 - \frac{l_{ip}}{\bar{l}_{ip}}\right)\right\} \xrightarrow{a.s.} \frac{\psi_i}{c} - 1$.

Observe that
\[ \left|(n-i)\log\left\{1 - \tfrac{1}{n-i}\left(1 - \tfrac{l_{ip}}{\bar{l}_{ip}}\right)\right\} - \left(\tfrac{\psi_i}{c} - 1\right)\right| \le \left|-(n-i)\log\left\{1 - \tfrac{1}{n-i}\left(1 - \tfrac{l_{ip}}{\bar{l}_{ip}}\right)\right\} - \left(1 - \tfrac{l_{ip}}{\bar{l}_{ip}}\right)\right| + \left|\tfrac{l_{ip}}{\bar{l}_{ip}} - \tfrac{\psi_i}{c}\right|. \]
Using Taylor's theorem with
\[ f(x) = \log(1-x), \qquad x = 0, \qquad h = \frac{1}{n-i}\left(1 - \frac{l_{ip}}{\bar{l}_{ip}}\right), \]
we have
\[ \log\left\{1 - \frac{1}{n-i}\left(1 - \frac{l_{ip}}{\bar{l}_{ip}}\right)\right\} = -\frac{1}{n-i}\left(1 - \frac{l_{ip}}{\bar{l}_{ip}}\right) - \frac{h^2}{2(1-\zeta)^2}, \qquad \zeta \in (\min(0,h), \max(0,h)). \]
So
\[ -(n-i)\log\left\{1 - \frac{1}{n-i}\left(1 - \frac{l_{ip}}{\bar{l}_{ip}}\right)\right\} = \left(1 - \frac{l_{ip}}{\bar{l}_{ip}}\right) + (n-i)\frac{h^2}{2(1-\zeta)^2}, \]
and hence
\[ \left|-(n-i)\log\{\cdot\} - \left(1 - \frac{l_{ip}}{\bar{l}_{ip}}\right)\right| = \frac{1}{2(1-\zeta)^2}\,\frac{\big(1 - l_{ip}/\bar{l}_{ip}\big)^2}{n-i}. \]
Let us examine the right-hand side. From Lemma 2.1, $l_{ip} \xrightarrow{a.s.} \psi_i$. Next we look at
\[ \bar{l}_{ip} = \frac{1}{n-1-i}\sum_{t=i+1}^{p} l_{tp}. \]
We already know by the MP law that
\[ \frac{1}{p-i}\sum_{t=i+1}^{p} l_{tp} = \frac{n-1-i}{p-i}\cdot\frac{1}{n-1-i}\sum_{t=i+1}^{p} l_{tp} = \frac{n-1-i}{p-i}\,\bar{l}_{ip} \xrightarrow{a.s.} \mu_{MP} = 1, \]
where $\mu_{MP}$ is the mean of the Marchenko--Pastur distribution. Since $\frac{p-i}{n-1-i} \to c$, it follows that $\bar{l}_{ip} \xrightarrow{a.s.} c$. Using these we infer that
\[ \left(1 - \frac{l_{ip}}{\bar{l}_{ip}}\right) \xrightarrow{a.s.} 1 - \frac{\psi_i}{c} \quad \text{and} \quad h = \frac{1}{n-i}\left(1 - \frac{l_{ip}}{\bar{l}_{ip}}\right) \xrightarrow{a.s.} 0. \]
By the sandwich theorem, $\zeta \xrightarrow{a.s.} 0$, so
\[ \frac{1}{2(1-\zeta)^2}\left(1 - \frac{l_{ip}}{\bar{l}_{ip}}\right)^2 \xrightarrow{a.s.} \frac{1}{2}\left(1 - \frac{\psi_i}{c}\right)^2. \]
As $\frac{1}{n-i} \to 0$, coupled with the previous fact, we conclude that
\[ \left|-(n-i)\log\left\{1 - \frac{1}{n-i}\left(1 - \frac{l_{ip}}{\bar{l}_{ip}}\right)\right\} - \left(1 - \frac{l_{ip}}{\bar{l}_{ip}}\right)\right| \xrightarrow{a.s.} 0. \]
As already observed, $l_{ip} \xrightarrow{a.s.} \psi_i$ and $\bar{l}_{ip} \xrightarrow{a.s.} c$, which implies
\[ \frac{l_{ip}}{\bar{l}_{ip}} \xrightarrow{a.s.} \frac{\psi_i}{c} \implies \left|\frac{l_{ip}}{\bar{l}_{ip}} - \frac{\psi_i}{c}\right| \xrightarrow{a.s.} 0. \]
Hence the claim is proved. Next, using the monotonicity of $Q_c(\cdot)$ and of the $\lambda_i$'s,
\[ \tilde{A}^{**}_j - \tilde{A}^{**}_k \xrightarrow{a.s.} \sum_{i=j+1}^{k} \frac{1}{c}\big(Q_c(\lambda_i) - Q_c(1+\sqrt{c})\big) \ge (k-j)\,\frac{1}{c}\big[Q_c(\lambda_k) - Q_c(1+\sqrt{c})\big]. \]
Since $\lambda_k > 1+\sqrt{c}$, we have $Q_c(\lambda_k) - Q_c(1+\sqrt{c}) > 0$, hence $P\big(\lim_{n\to\infty}(\tilde{A}^{**}_j - \tilde{A}^{**}_k) > 0\big) = 1$ for all $j < k$. This implies that $P\big(\lim_{n\to\infty} \hat{k}^{**}_{\tilde{A}} \ge k\big) = 1$, and in particular the weaker statement $P(\hat{k}^{**}_{\tilde{A}} \ge k) \to 1$ as $n \to \infty$.
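It may be instructive to compare the per-spike gaps appearing in the Case I limits of Theorems 3.3 and 3.4. The sketch below is illustrative only; it again assumes the explicit BBP map $\psi_c(\lambda) = \lambda\big(1 + \frac{c}{\lambda-1}\big)$, and evaluates $Q_c$ through the identity $\frac{1}{c}Q_c(\lambda) = \frac{\psi_c(\lambda)}{c} - 1 - \log\frac{\psi_c(\lambda)}{c}$ displayed above.
\begin{verbatim}
import numpy as np

def psi(lam, c):
    # Assumed explicit BBP map, consistent with psi_c(1+sqrt(c)) = (1+sqrt(c))^2.
    return lam * (1.0 + c / (lam - 1.0))

def F(lam, c):
    # Per-spike limit function of Theorem 3.3.
    return psi(lam, c) - 1.0 - np.log(psi(lam, c))

def Q_over_c(lam, c):
    # Q_c(lambda)/c = psi_c(lambda)/c - 1 - log(psi_c(lambda)/c), per Case I above.
    x = psi(lam, c) / c
    return x - 1.0 - np.log(x)

c = 2.0
bbp = 1.0 + np.sqrt(c)
for lam in [bbp + 0.2, bbp + 1.0, bbp + 3.0]:
    # Both gaps vanish exactly at the BBP threshold and are positive above it.
    print(lam,
          F(lam, c) - F(bbp, c),
          Q_over_c(lam, c) - Q_over_c(bbp, c))
\end{verbatim}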
Case II: When $j > k$ and $j$ is bounded, i.e. $j \le s$,
\[ \tilde{A}^{**}_j - \tilde{A}^{**}_k = \sum_{i=k+1}^{j} \big(\tilde{A}^{**}_{i} - \tilde{A}^{**}_{i-1}\big) = \sum_{i=k+1}^{j} \left[-(n-i)\log\left\{1 - \frac{1}{n-i}\left(1 - \frac{l_{ip}}{\bar{l}_{ip}}\right)\right\} - \log \bar{l}_{ip} + \log l_{ip} + Q_{p/n}\big(1+\sqrt{p/n}+\delta_n\big)\,\frac{n-i}{p}\right]. \]
Let us examine $h_n n^{2/3}\big(\tilde{A}^{**}_j - \tilde{A}^{**}_k\big)$, where $h_n \to 0$ as $n \to \infty$. Picking the $i$-th summand, we divide the expression into a sum of four parts, as follows:
1. $-h_n n^{2/3} \big(\log \bar{l}_{ip} - \log c\big)$;
2. $h_n n^{2/3} \big(\log l_{ip} - \log b\big)$;
3. $h_n n^{2/3} \left[-(n-i)\log\left\{1 - \frac{1}{n-i}\left(1 - \frac{l_{ip}}{\bar{l}_{ip}}\right)\right\} - \left(1 - \frac{b}{c}\right)\right]$;
4. $h_n n^{2/3} \left(Q_{p/n}\big(1+\sqrt{p/n}+\delta_n\big)\frac{n-i}{p} + \log\frac{b}{c} + 1 - \frac{b}{c}\right)$.
Here, as before, $b := \psi_c(1+\sqrt{c}) = (1+\sqrt{c})^2$. Let us analyse each of these terms one by one.

1. $-h_n n^{2/3}\big(\log \bar{l}_{ip} - \log c\big)$: In the proof of Theorem 3.3 we have already shown that
\[ h_n n^{2/3}\left[\frac{1}{p-i}\sum_{t=i+1}^{p} l_{tp} - 1\right] = o_p(1), \]
from which we infer that
\[ h_n n^{2/3}\left(\frac{n-1-i}{p-i}\,\bar{l}_{ip} - 1\right) = o_p(1). \]
As $\frac{p-i}{n-1-i} \to c$, we have
\[ \frac{p-i}{n-1-i}\; h_n n^{2/3}\left(\frac{n-1-i}{p-i}\,\bar{l}_{ip} - 1\right) = o_p(1), \]
which implies
\[ h_n n^{2/3}\left(\bar{l}_{ip} - \frac{p-i}{n-1-i}\right) = o_p(1), \]
i.e.
\[ h_n n^{2/3}\big(\bar{l}_{ip} - c\big) + h_n n^{2/3}\left(c - \frac{p-i}{n-1-i}\right) = o_p(1). \]
We have $p/n = c + O\big(n^{-2/3}\big)$, which implies $h_n n^{2/3}\left(c - \frac{p-i}{n-1-i}\right) = o(1)$. As a result,
\[ h_n n^{2/3}\big(\bar{l}_{ip} - c\big) = o_p(1). \]
Next, by the mean value theorem,
\[ \log x - \log c = \frac{1}{\zeta(x)}(x - c), \qquad \min(c,x) < \zeta(x) < \max(c,x). \]
Using the above we can infer that
\[ h_n n^{2/3}\big(\log \bar{l}_{ip} - \log c\big) = \frac{h_n n^{2/3}}{\zeta(\bar{l}_{ip})}\big(\bar{l}_{ip} - c\big). \]
As already observed, $\bar{l}_{ip} \xrightarrow{P} c$, and the sandwich theorem gives $\zeta(\bar{l}_{ip}) \xrightarrow{P} c$. Therefore, using Slutsky's theorem,
\[ -h_n n^{2/3}\big(\log \bar{l}_{ip} - \log c\big) \xrightarrow{P} 0. \]

2. $h_n n^{2/3}\big(\log l_{ip} - \log b\big)$: The proof of this part is exactly the same as in Theorem 3.3, where we showed that $h_n n^{2/3}\big(\log l_{ip} - \log b\big) \xrightarrow{P} 0$.

3. $h_n n^{2/3}\left[-(n-i)\log\left\{1 - \frac{1}{n-i}\left(1 - \frac{l_{ip}}{\bar{l}_{ip}}\right)\right\} - \left(1 - \frac{b}{c}\right)\right]$: Observe that
\[ h_n n^{2/3}\left|-(n-i)\log\{\cdot\} - \left(1 - \frac{b}{c}\right)\right| \le h_n n^{2/3}\left|-(n-i)\log\{\cdot\} - \left(1 - \frac{l_{ip}}{\bar{l}_{ip}}\right)\right| + h_n n^{2/3}\left|\frac{l_{ip}}{\bar{l}_{ip}} - \frac{b}{c}\right|. \]
Using Taylor's theorem with $f(x) = \log(1-x)$, $x = 0$ and $h = \frac{1}{n-i}\left(1 - \frac{l_{ip}}{\bar{l}_{ip}}\right)$, we get, as in Case I,
\[ \log\left\{1 - \frac{1}{n-i}\left(1 - \frac{l_{ip}}{\bar{l}_{ip}}\right)\right\} = -\frac{1}{n-i}\left(1 - \frac{l_{ip}}{\bar{l}_{ip}}\right) - \frac{h^2}{2(1-\zeta)^2}, \qquad \zeta \in (\min(0,h), \max(0,h)), \]
so that
\[ \left|-(n-i)\log\{\cdot\} - \left(1 - \frac{l_{ip}}{\bar{l}_{ip}}\right)\right| = \frac{1}{2(1-\zeta)^2}\,\frac{\big(1 - l_{ip}/\bar{l}_{ip}\big)^2}{n-i}. \]
Looking at the right-hand side: we already know that $l_{ip} \xrightarrow{a.s.} b$ and $\bar{l}_{ip} \xrightarrow{a.s.} c$. Using these we infer that
\[ \left(1 - \frac{l_{ip}}{\bar{l}_{ip}}\right) \xrightarrow{a.s.} 1 - \frac{b}{c} \quad \text{and} \quad h \xrightarrow{a.s.} 0. \]
The sandwich theorem again gives $\zeta \xrightarrow{a.s.} 0$, so
\[ \frac{1}{2(1-\zeta)^2}\left(1 - \frac{l_{ip}}{\bar{l}_{ip}}\right)^2 \xrightarrow{a.s.} \frac{1}{2}\left(1 - \frac{b}{c}\right)^2. \]
As $\frac{h_n n^{2/3}}{n-i} \to 0$, coupled with the previous fact, we conclude that
\[ h_n n^{2/3}\left|-(n-i)\log\left\{1 - \frac{1}{n-i}\left(1 - \frac{l_{ip}}{\bar{l}_{ip}}\right)\right\} - \left(1 - \frac{l_{ip}}{\bar{l}_{ip}}\right)\right| \xrightarrow{P} 0. \]
Next let us deal with
\[ h_n n^{2/3}\left|\frac{l_{ip}}{\bar{l}_{ip}} - \frac{b}{c}\right| = h_n n^{2/3}\left|\frac{l_{ip}}{\bar{l}_{ip}} - \frac{b}{\bar{l}_{ip}} + \frac{b}{\bar{l}_{ip}} - \frac{b}{c}\right| \le h_n n^{2/3}\,\frac{|l_{ip} - b|}{|\bar{l}_{ip}|} + h_n n^{2/3}\,\frac{b}{c}\,\frac{|\bar{l}_{ip} - c|}{|\bar{l}_{ip}|}. \]
By Lemma 7.3, $n^{2/3}(l_{ip} - b) = O_p(1)$ for all $k < i \le s$, so $h_n n^{2/3}(l_{ip} - b) \xrightarrow{P} 0$; together with $\bar{l}_{ip} \xrightarrow{a.s.} c$ this gives
\[ h_n n^{2/3}\,\frac{|l_{ip} - b|}{|\bar{l}_{ip}|} \xrightarrow{P} 0. \]
We have already observed that $h_n n^{2/3}\big[\bar{l}_{ip} - c\big] \xrightarrow{P} 0$; together with $\bar{l}_{ip} \xrightarrow{a.s.} c$ this gives
\[ h_n n^{2/3}\,\frac{b}{c}\,\frac{|\bar{l}_{ip} - c|}{|\bar{l}_{ip}|} \xrightarrow{P} 0. \]
Hence $h_n n^{2/3}\left|\frac{l_{ip}}{\bar{l}_{ip}} - \frac{b}{c}\right| \xrightarrow{P} 0$, and so finally
\[ h_n n^{2/3}\left[-(n-i)\log\left\{1 - \frac{1}{n-i}\left(1 - \frac{l_{ip}}{\bar{l}_{ip}}\right)\right\} - \left(1 - \frac{b}{c}\right)\right] \xrightarrow{P} 0. \]
4. Looking at the last term
\[ h_n n^{2/3}\left(Q_{p/n}\big(1+\sqrt{p/n}+\delta_n\big)\frac{n-i}{p} + \log\frac{b}{c} + 1 - \frac{b}{c}\right), \]
call $\beta(p,n) = Q_{p/n}\big(1+\sqrt{p/n}+\delta_n\big)$ and $\beta(c) = Q_c(1+\sqrt{c})$. The last term can then be written as
\[ h_n n^{2/3}\left(\beta(p,n)\,\frac{n}{p} + \log\frac{b}{c} + 1 - \frac{b}{c}\right) - h_n n^{2/3}\,\beta(p,n)\,\frac{i}{p}. \]
Observe that $h_n n^{2/3}\frac{i}{p} \to 0$ and $\beta(p,n) \to \beta(c)$, so that $h_n n^{2/3}\,\beta(p,n)\,\frac{i}{p} \to 0$. Therefore only the first term matters asymptotically, and since $\frac{1}{c}Q_c(1+\sqrt{c}) = \frac{b}{c} - 1 - \log\frac{b}{c}$, it equals
\[ h_n n^{2/3}\left(\frac{1}{p/n}\,Q_{p/n}\big(1+\sqrt{p/n}+\delta_n\big) - \frac{1}{c}\,Q_c(1+\sqrt{c})\right) \to \infty \quad \text{as } n \to \infty, \]
by the choice of $\delta_n$ (an illustrative choice of the pair $(h_n, \delta_n)$ is sketched after the proof). As a result,
\[ h_n n^{2/3}\big(\tilde{A}^{**}_j - \tilde{A}^{**}_k\big) \xrightarrow{P} \infty \implies P\big(\tilde{A}^{**}_j > \tilde{A}^{**}_k\big) \to 1, \qquad s \ge j > k. \]
Combining both cases, we have $P\big(\tilde{A}^{**}_j > \tilde{A}^{**}_k\big) \to 1$ for all $j \ne k$, $j \le s$. As there are finitely many $j$'s, this implies that $P\big(\tilde{A}^{**}_j > \tilde{A}^{**}_k\ \forall\, j \ne k,\ j \le s\big) \to 1$. Hence $P(\hat{k}^{**}_{\tilde{A}} = k) \to 1$, i.e. $\hat{k}^{**}_{\tilde{A}} \xrightarrow{P} k$.
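The only freedom in both weak-consistency proofs is the choice of the pair $(h_n, \delta_n)$: $h_n \to 0$, while $h_n n^{2/3}$ times the penalty gap diverges. One admissible choice is sketched below, purely for illustration: it holds $p/n = c$ exactly, again assumes the explicit BBP form of $\psi_c$, and the rates $h_n = 1/\log n$, $\delta_n = n^{-1/6}$ are our hypothetical picks, not prescribed by the paper.
\begin{verbatim}
import numpy as np

def psi(lam, c):
    return lam * (1.0 + c / (lam - 1.0))

def F(lam, c):
    return psi(lam, c) - 1.0 - np.log(psi(lam, c))

c = 0.5
bbp = 1.0 + np.sqrt(c)

# Hypothetical rates: h_n -> 0, yet
# h_n * n^{2/3} * [F_c(bbp + delta_n) - F_c(bbp)] still diverges,
# which is all that the proofs of Theorems 3.3 and 3.4 require.
for n in [10**3, 10**4, 10**5, 10**6]:
    h_n = 1.0 / np.log(n)
    delta_n = n ** (-1.0 / 6.0)
    print(n, h_n * n ** (2.0 / 3.0) * (F(bbp + delta_n, c) - F(bbp, c)))
\end{verbatim}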
Proof of Theorem 3.1 (Strong Consistency): Here again, as for weak consistency, we consider the two cases $j < k$ and $j > k$ and compare $A_j$ with $A_k$. The proof in Case I, i.e. $j < k$, is along the same lines as in Theorem 3.3 (the weak consistency proof), where we showed that $P\big(\lim_{n\to\infty} \hat{k}_A \ge k\big) = 1$.

Case I: When $j < k$,
\[
\begin{aligned}
A_j - A_k &= \sum_{i=j+1}^{k} \big(A_{i-1} - A_{i}\big)\\
&= \sum_{i=j+1}^{k} \left[(p-i+1)\log\left\{1 - \frac{1}{p-i+1}\left(1 - \frac{l_{ip}}{\bar{l}_{ip}}\right)\right\} + \log \bar{l}_{ip} - \log l_{ip} - \frac{1}{p/n}\,F_{p/n}\big(1+\sqrt{p/n}+\delta\big)\,\frac{p-i+1}{n}\right]\\
&\xrightarrow{a.s.} \sum_{i=j+1}^{k} \big[\psi_i - 1 - \log \psi_i - F_c(1+\sqrt{c}+\delta)\big]\\
&= \sum_{i=j+1}^{k} \big[F_c(\lambda_i) - F_c(1+\sqrt{c}+\delta)\big].
\end{aligned}
\]
Using the monotonicity of $F_c(\cdot)$ and of the $\lambda_i$'s,
\[ A_j - A_k \xrightarrow{a.s.} \sum_{i=j+1}^{k} \big[F_c(\lambda_i) - F_c(1+\sqrt{c}+\delta)\big] \ge (k-j)\big[F_c(\lambda_k) - F_c(1+\sqrt{c}+\delta)\big]. \]
Now suppose $\lambda_k < 1+\sqrt{c}+\delta$. Then $F_c(\lambda_k) - F_c(1+\sqrt{c}+\delta) < 0$, so that
\[ A_{k-1} - A_k \xrightarrow{a.s.} F_c(\lambda_k) - F_c(1+\sqrt{c}+\delta) < 0, \]
i.e. $A_{k-1} < A_k$ eventually, almost surely, implying that $\hat{k}_A$ is not consistent. On the other hand, if $\lambda_k > 1+\sqrt{c}+\delta$, then $F_c(\lambda_k) - F_c(1+\sqrt{c}+\delta) > 0$, hence $P\big(\lim_{n\to\infty}(A_j - A_k) > 0\big) = 1$ for all $j < k$, which implies that $P\big(\lim_{n\to\infty} \hat{k}_A \ge k\big) = 1$.
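The dichotomy just described, inconsistency for $\lambda_k$ below $1+\sqrt{c}+\delta$ and consistency above it, amounts to the sign of $F_c(\lambda_k) - F_c(1+\sqrt{c}+\delta)$, which can be inspected directly. The snippet below is illustrative only and assumes the explicit BBP form of $\psi_c$ as before.
\begin{verbatim}
import numpy as np

def psi(lam, c):
    return lam * (1.0 + c / (lam - 1.0))

def F(lam, c):
    return psi(lam, c) - 1.0 - np.log(psi(lam, c))

c, delta = 1.0, 0.3
bbp = 1.0 + np.sqrt(c)

# Sign of the a.s. limit of A_{k-1} - A_k: negative when the smallest spike
# lies in (bbp, bbp + delta) (spike k is asymptotically missed), positive
# when lambda_k clears the enlarged threshold bbp + delta.
for lam_k in [bbp + 0.1, bbp + delta + 0.1]:
    print(lam_k, F(lam_k, c) - F(bbp + delta, c))
\end{verbatim}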
Case II: When $j > k$,
\[
\begin{aligned}
A_j - A_k &= \sum_{i=k+1}^{j} \big(A_{i} - A_{i-1}\big)\\
&= \sum_{i=k+1}^{j} \left[-(p-i+1)\log\left\{1 - \frac{1}{p-i+1}\left(1 - \frac{l_{ip}}{\bar{l}_{ip}}\right)\right\} - \log \bar{l}_{ip} + \log l_{ip} + \frac{1}{p/n}\,F_{p/n}\big(1+\sqrt{p/n}+\delta\big)\,\frac{p-i+1}{n}\right]\\
&\sim \sum_{i=k+1}^{j} \left[\left(1 - \frac{l_{ip}}{\bar{l}_{ip}}\right) + \log\left(\frac{l_{ip}}{\bar{l}_{ip}}\right) + F_c(1+\sqrt{c}+\delta)\left(1 - \frac{i-1}{p}\right)\right].
\end{aligned}
\]
For $k < i \le j$, $l_{jp} \le l_{ip} \le l_{k+1,p}$. From Lemma 2.1, both $l_{k+1,p}$ and $l_{jp}$ converge almost surely to $b$ as $n \to \infty$, so $l_{ip} \xrightarrow{a.s.} b$. It follows that, almost surely,
\[ A_j - A_k \sim (j-k)\big(1 - b + \log b + F_c(1+\sqrt{c}+\delta)\big) = (j-k)\big(F_c(1+\sqrt{c}+\delta) - F_c(1+\sqrt{c})\big) > 0. \]
The last equality holds because $b - 1 - \log b = F_c(1+\sqrt{c})$, and the positivity because $F_c(\cdot)$ is a monotonically increasing function. Therefore
\[ P\big(\lim_{n\to\infty}(A_j - A_k) > 0\big) = 1 \quad \text{for all } j \text{ with } k < j \le q, \]
which implies that $P\big(\lim_{n\to\infty} \hat{k}_A \le k\big) = 1$. Combining Cases I and II, we have $\hat{k}_A \xrightarrow{a.s.} k$.
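Finally, the successive differences $A_{i-1} - A_i$ displayed in these proofs determine $A_j$ up to the additive constant $A_0$, which is enough to locate the minimizer. The following schematic sketch reconstructs the criterion from those differences on simulated data. The explicit BBP form of $\psi_c$ and the choice $\bar{l}_{ip} = \frac{1}{p-i}\sum_{t=i+1}^{p} l_{tp}$ are our assumptions for illustration, and the snippet is not a reference implementation of the paper's estimator.
\begin{verbatim}
import numpy as np

rng = np.random.default_rng(1)

def psi(lam, c):
    return lam * (1.0 + c / (lam - 1.0))

def F(lam, c):
    return psi(lam, c) - 1.0 - np.log(psi(lam, c))

def k_hat(eigs, n, delta, s):
    # Reconstruct A_j - A_0 from the displayed differences
    #   A_{i-1} - A_i = (p-i+1) log{1 - (1 - l_i/lbar_i)/(p-i+1)}
    #                   + log lbar_i - log l_i - pen * (p-i+1)/n,
    # then return the minimizer over 0 <= j <= s.
    p = len(eigs)
    cn = p / n
    pen = F(1.0 + np.sqrt(cn) + delta, cn) / cn
    l = np.sort(eigs)[::-1]
    A = np.zeros(s + 1)                      # holds A_j - A_0
    for i in range(1, s + 1):
        lbar = l[i:].mean()                  # assumed form of bar l_{ip}
        h = (1.0 - l[i - 1] / lbar) / (p - i + 1)
        diff = ((p - i + 1) * np.log1p(-h) + np.log(lbar)
                - np.log(l[i - 1]) - pen * (p - i + 1) / n)
        A[i] = A[i - 1] - diff               # A_i = A_{i-1} - (A_{i-1} - A_i)
    return int(np.argmin(A))

# Toy spiked model with k = 2 spikes well above the enlarged threshold.
n, p, k = 800, 400, 2
lam = np.ones(p)
lam[:k] = [8.0, 5.0]
X = rng.standard_normal((n, p)) * np.sqrt(lam)   # rows ~ N(0, diag(lam))
S = X.T @ X / n
print(k_hat(np.linalg.eigvalsh(S), n, delta=0.5, s=10))   # typically prints 2
\end{verbatim}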