Optimal spectral shrinkage and PCA with heteroscedastic noise
William Leeb∗ and Elad Romanov†

∗ School of Mathematics, University of Minnesota, Twin Cities, Minneapolis, MN, USA.
† School of Computer Science and Engineering, The Hebrew University, Jerusalem, Israel.

Abstract
This paper studies the related problems of prediction, covariance estimation, and principal component analysis for the spiked covariance model with heteroscedastic noise. We consider an estimator of the principal components based on whitening the noise, and we derive optimal singular value and eigenvalue shrinkers for use with these estimated principal components. Underlying these methods are new asymptotic results for the high-dimensional spiked model with heteroscedastic noise, and consistent estimators for the relevant population parameters. We extend previous analysis on out-of-sample prediction to the setting of predictors with whitening. We demonstrate certain advantages of noise whitening. Specifically, we show that in a certain asymptotic regime, optimal singular value shrinkage with whitening converges to the best linear predictor, whereas without whitening it converges to a suboptimal linear predictor. We prove that for generic signals, whitening improves estimation of the principal components, and increases a natural signal-to-noise ratio of the observations. We also show that for rank one signals, our estimated principal components achieve the asymptotic minimax rate.
1 Introduction

Singular value shrinkage and eigenvalue shrinkage are popular methods for denoising data matrices and covariance matrices. Singular value shrinkage is performed by computing a singular value decomposition of the observed matrix Y, adjusting the singular values, and reconstructing. The idea is that when Y = X + N, where X is a low-rank signal matrix we wish to estimate, the additive noise term N inflates the singular values of X; by shrinking them we can move the estimated matrix closer to X, even if the singular vectors remain inaccurate. Similarly, eigenvalue shrinkage for covariance estimation starts with the sample covariance of the data, and shrinks its eigenvalues. There has been significant recent activity on deriving optimal shrinkage methods [48, 25, 44, 23, 24, 21, 22], and applying them to various scientific problems [12, 2, 43, 17].

A standard setting for analyzing the performance of these methods is the spiked covariance model [31, 7, 46, 6, 21]. Here, the observation matrix is composed of iid columns Y_j in R^p, j = 1, ..., n, from some distribution consisting of signal vectors X_j lying on a low-dimensional subspace, plus independent noise vectors ε_j with some covariance matrix Σ_ε. The theory for prediction of X_1, ..., X_n in the spiked model with orthogonally invariant noise, i.e., when Σ_ε = νI_p, is very well-developed [23, 48, 25, 36]. Singular value shrinkage is known to be minimax optimal, and asymptotically optimal shrinkers have been derived for a wide variety of loss functions.

Many applications in signal processing, imaging, and related fields involve noise that is heteroscedastic [45, 40, 11, 12, 34, 1, 2]. This paper studies the effect of whitening the noise; that is, working in rescaled coordinates, in which the noise is white. We first estimate the noise covariance matrix Σ_ε. We then normalize, or whiten, the observations Y_j by applying Σ_ε^{-1/2}; the resulting vectors Y_j^w consist of a transformed signal component X_j^w = Σ_ε^{-1/2} X_j, plus isotropic noise G_j = Σ_ε^{-1/2} ε_j. Singular value shrinkage is then performed on this new, whitened observation matrix, after which the inverse transformation Σ_ε^{1/2} is applied. Similarly, we perform eigenvalue shrinkage on the sample covariance of the whitened data, and then apply the inverse transformation.

While this approach is restricted to cases when Σ_ε can be consistently estimated, when it does apply it has a number of advantages over competing methods. First, in the classical "large n" asymptotic limit, our method of singular value prediction with whitening, while non-linear in the observed data, converges to the best linear predictor of the data, an oracle method that requires knowledge of the population principal components. By contrast, singular value shrinkage without whitening (as in [44]) converges to a suboptimal linear filter. Further, we show that under certain modelling assumptions, whitening improves the estimation of the population singular vectors, and achieves the same rate of subspace estimation as the minimax optimal method derived in [58].
Next, because we compute the SVD of a matrix with isotropic noise, our method requires weaker assumptions on the principal components of the signal vectors than those in [44].

As the key step in our procedures is applying spectral shrinkage to the whitened data or covariance matrices, the question arises: what are the optimal singular values/eigenvalues? While whitening has been used with shrinkage in previous works (e.g. in [38, 19, 12]), it appears that the question of optimal shrinkage has not been fully addressed. This paper derives the precise choice of optimal singular values and eigenvalues, and shows, using new asymptotic results, how to consistently estimate them from the observed data.

We introduce a new method for predicting X from Y when the noise matrix N is heteroscedastic. We first apply a linear transformation to the observations to whiten the noise. The resulting vectors are still of the form "low rank plus noise", but the noise term has been transformed into an isotropic Gaussian, while the low-rank signal component has been rescaled along the principal components of the noise covariance. Next, we shrink the singular values of the transformed matrix. Intuitively, this step removes the effect of the noise from the spectrum of the observed matrix. Finally, we arrive at a predictor of the signal matrix X by applying the inverse change of variables, i.e., we unwhiten.

This three-step procedure – whiten, shrink, unwhiten – depends on the choice of singular values used in the middle shrinkage step. As it turns out, there are precise, optimal, and consistently estimable formulas for the optimal singular values. These are derived in Section 4.1, and the resulting method is summarized in Algorithm 1.

For covariance estimation, we introduce an analogous procedure in which eigenvalue shrinkage is applied to the sample covariance of the whitened observations. After shrinkage, we then apply the inverse whitening transformation. As with singular value shrinkage, this three-step procedure of whitening, shrinking the eigenvalues, and unwhitening depends crucially on the choice of eigenvalues for the middle step. In Section 4.2, we will explain the method in detail, including the derivation of consistent estimators for the optimal eigenvalues for a variety of loss functions. The method is summarized in Algorithm 2.

In Section 5, we show that in the classical regime (when p ≪ n), singular value shrinkage with whitening converges to the optimal linear predictor of the data, while shrinkage without whitening converges to a different, typically suboptimal, linear filter. In this sense, not only is shrinkage with whitening preferable to no whitening, but the whitening transform is an asymptotically optimal change of coordinates to apply to the data before shrinking in the classical setting.

In Section 6, we also derive the optimal coefficients for the out-of-sample prediction problem, described in [19]. In this problem, the PCs estimated from a set of in-sample data Y_1, ..., Y_n are used to denoise an independently drawn out-of-sample observation. We show that the AMSE for singular value shrinkage with whitening is identical to the asymptotic expected loss achieved by out-of-sample denoising, which extends the analogous result from [19]. The out-of-sample predictor is summarized in Algorithm 3.

The eigenspace of the estimated covariance Σ̂_x (equivalently, the left singular subspace of X̂) is not spanned by the singular vectors of the raw data matrix Y.
Rather, it is spanned by the vectors û_k obtained by applying the inverse whitening transformation to the top r singular vectors of the whitened observation matrix.

In Section 7, we will show that under a generic model for the signal PCs, the estimated PCs û_1, ..., û_r improve upon estimation of the population PCs u_1, ..., u_r, as compared to the left singular vectors of Y. We will show too that when r = 1, û_1 achieves the minimax rate of principal subspace estimation derived in [58]. That is, in a certain sense it is an optimal estimator of the signal principal subspace.

1.4 Spiked model asymptotics

The methods and analysis of this paper rely on precise descriptions of the asymptotic behavior of the singular values and singular vectors of the whitened matrix Y^w. While some of the necessary results are already found in the literature [46, 10], we have also needed to derive several new results, which may be found in Theorems 3.1 and 3.2 in Section 3. Whereas earlier work has characterized the angles between the singular vectors of X^w and Y^w, we have provided formulas for the cosines of the angles between the singular vectors after the inverse whitening transformation has been performed – that is, we characterize the change in angles resulting from unwhitening. These parameters are a key ingredient for deriving the optimal spectral shrinkers in Section 4.

The prediction method in this paper is a generalization of a standard method for predicting the matrix X from the observed matrix Y, known as singular value shrinkage. Briefly, it is performed by leaving fixed the singular vectors of Y, while adjusting its singular values, to mitigate the effects of noise on the spectrum. It is shown in [23] that when the noise matrix N is white Gaussian noise, or in other words Σ_ε = I_p, then singular value shrinkage is minimax optimal for predicting X from Y.

The paper [48] considers optimal singular value shrinkage for Frobenius loss and white noise. In [25], optimal singular value shrinkers are derived for isotropic noise, for a much broader family of loss functions; the special case of operator norm loss is considered in [36]. The effectiveness of these methods rests on the asymptotic spectral theory of the data matrix Y developed in [46, 10], among others.

In the paper [44], optimal singular value shrinkage (known as 'OptShrink') is derived under much more general conditions on the noise matrix N, by exploiting the general asymptotic spectral theory developed in [10] for non-isotropic noise. While OptShrink may be effectively applied when the noise is non-isotropic, it requires the signal principal components to be vectors with iid random entries (or orthonormalized versions thereof).

Covariance estimation is a well-studied problem in statistics and its applications. A standard method for estimating the population covariance Σ_x is eigenvalue shrinkage [51, 52, 21, 22]. Analogously to singular value shrinkage for predicting X, eigenvalue shrinkage leaves fixed the eigenvectors of the sample covariance Σ̂_y = Σ_{j=1}^n Y_j Y_j^⊤ / n = Y Y^⊤, or equivalently the left singular vectors of Y, and replaces the eigenvalues by estimated values to reduce the effect of the noise.

As we will discuss in Section 2.2, it is often natural to consider different loss functions for measuring the error in covariance estimation [22]. The paper [21] derives optimal eigenvalue shrinkers for a very large collection of loss functions.
Their method is restricted to white noise, i.e., where Σ_ε is a multiple of the identity matrix.

There have been a number of recent papers on the spiked model with heteroscedastic noise. The paper [58] devises an iterative algorithm for estimating the principal subspace of the X_j in this setting, and proves that their method achieves the optimal error rate. Our method uses a different estimator for the population PCs, which achieves an error that matches the optimal rate of [58] under an additional assumption (19) (which is vacuous when r = 1).

The papers [28, 26, 27] consider a different but related model, in which each observation Y_j has white noise but with noise strengths varying across the observations. In [27], they show that when the signal energy and noise energy are fixed, subspace estimation is optimal when the noise is white. The proof of our Theorem 7.2 builds on this result, by combining it with our analysis of the change in angles between the empirical and population PCs after whitening. The work [28] shows that an alternative choice of weighting is optimal for estimating the signal principal components. The aforementioned paper [44] designs optimal singular value shrinkers without whitening for a broad range of noise distributions, which include our noise model as a special case.

When working in the eigenbasis of the noise covariance, the whitening procedure we describe in this work is an example of what is called weighted PCA, in which weights are applied to individual variables before the principal components are computed [32, 30]. The inverse standard deviation of the noise is a standard choice of weights [54, 57, 55]; in that sense, the present work can be seen as providing a theoretical analysis of this already widely-used choice.

Previous works have proposed pairing the whitening transformation with spectral shrinkage, which we study in this work. The paper [38] proposes the use of whitening in conjunction with exponential family noise models for covariance estimation. The paper [19] proposes whitening in the context of transformed spiked models for data prediction. The papers [12, 2] use whitening and eigenvalue shrinkage for covariance estimation.

However, previous works on singular value shrinkage with whitening employed suboptimal shrinkers, developed from heuristic considerations. In this paper, we undertake a systematic study of this problem, and rigorously derive the optimal shrinkers, under Frobenius loss (in an asymptotic sense). For covariance estimation, [38] derives the optimal eigenvalue shrinker for the special case of operator norm loss, but their method does not apply to more general loss functions.
The rest of the paper is organized as follows. Section 2 contains a detailed description of the model and assumptions; statements of the prediction and estimation problems to be studied; and a review of known results on the spiked model and spectral shrinkage. Section 3 provides the asymptotic theory on the spiked model that will be used throughout the rest of the paper. Section 4 presents the optimal spectral shrinkers with whitening. Section 5 analyzes the behavior of weighted singular value shrinkage schemes in the classical (p ≪ n) setting, and shows the optimality of whitening in this regime. Section 6 describes and solves the out-of-sample prediction problem. Section 7 derives several results on the theoretical benefits of whitening for principal component analysis. Section 8 presents the results of numerical experiments illuminating the theoretical analysis and demonstrating the performance of the proposed methods. Finally, Section 9 provides a conclusion and suggestions for future research.

In this section, we will introduce the details of the spiked model with heteroscedastic noise, describe the problems we focus on in this paper, and review known results on the asymptotic spectral theory of the spiked model, singular value shrinkage, and eigenvalue shrinkage. This will also serve to introduce notation we will use throughout the text.
We now specify the precise model we will be studying in this paper. We observe iid vectors Y_1, ..., Y_n in R^p, of the form:

Y_j = X_j + ε_j. (1)

The random signal vectors X_j are assumed to be mean zero and to have a rank r covariance matrix Σ_x = Σ_{k=1}^r ℓ_k u_k u_k^⊤, where the vectors u_k are taken to be orthonormal, and are called the principal components (PCs) of the random vectors X_j. More precisely, and to distinguish them from estimated vectors we will introduce later, we will call them the population PCs. The numbers ℓ_k, which are the variances of the X_j along u_k, are positive; we will specify their ordering later, in equation (16) below. The random noise vectors ε_j are of the form

ε_j = Σ_ε^{1/2} G_j, (2)

where G_j ∈ R^p is a mean-zero Gaussian noise vector with covariance I_p, and Σ_ε is a full-rank positive definite covariance matrix, assumed to be known (though see Remark 3). The noise vectors G_j are drawn independently from the X_j. We can write

X_j = Σ_{k=1}^r ℓ_k^{1/2} z_{jk} u_k, (3)

where the z_{jk} are uncorrelated (though not necessarily independent) random variables, with E[z_{jk}] = 0 and Var(z_{jk}) = 1. We remark that the assumption that X_j has mean zero is not essential; all the results of this paper will go through almost without modification if we first estimate the mean of X by the sample mean and subtract it from each observation Y_j. We also note that in the terminology of factor analysis, the z_{jk} may be called the factor values; for background on factor analysis, see, for instance, [3, 4, 47, 18].

In addition to the original observations Y_j, we will also be working with the whitened (or homogenized [38]) observations Y_j^w, defined by

Y_j^w = W Y_j, where W = Σ_ε^{-1/2} (4)

is the whitening matrix. The vectors Y_j^w can be decomposed into a transformed signal X_j^w = W X_j plus white noise G_j. The whitened vectors X_j^w have rank r covariance

Σ_x^w = W Σ_x W, (5)

and lie in the r-dimensional subspace span{W u_1, ..., W u_r}. We will let u_1^w, ..., u_r^w be the orthonormal PCs of the X_j^w – that is, the leading r eigenvectors (up to sign) of Σ_x^w – and write

X_j^w = Σ_{k=1}^r (ℓ_k^w)^{1/2} z_{jk}^w u_k^w, (6)

where again E[z_{jk}^w] = 0 and Var(z_{jk}^w) = 1, the ℓ_k^w are strictly positive, and

ℓ_1^w > · · · > ℓ_r^w > 0. (7)

In general, there is not a simple relationship between the PCs u_1, ..., u_r of X_j and the PCs u_1^w, ..., u_r^w of X_j^w, or between the eigenvalues ℓ_1, ..., ℓ_r and the eigenvalues ℓ_1^w, ..., ℓ_r^w.

Table 1: Vectors used in this paper.
  X_j      Signal (3)
  ε_j      Heteroscedastic noise (2)
  Y_j      Observed (1)
  X_j^w    Whitened signal (5)
  G_j      Whitened noise (2)
  Y_j^w    Whitened observation (5)
  z_k      Signal factor values (3), (11)
  z_k^w    Whitened signal factor values (6), (11)
  u_k      PC of the X_j's (3)
  u_k^w    PC of the X_j^w's (6)
  ū_k      W^{-1} u_k^w / ‖W^{-1} u_k^w‖ (9)
  û_k^w    Left singular vector of Y^w (preceding (8))
  û_k      W^{-1} û_k^w / ‖W^{-1} û_k^w‖ (8)
  ū_k^w    W u_k / ‖W u_k‖ (10)
  v_k      Right singular vector of X (preceding (8))
  v_k^w    Right singular vector of X^w (preceding (8))
  v̂_k^w    Right singular vector of Y^w (preceding (8))

We introduce some additional notation. We will denote the normalized matrices by Y = [Y_1, ..., Y_n]/√n, Y^w = [Y_1^w, ..., Y_n^w]/√n, X = [X_1, ..., X_n]/√n, X^w = [X_1^w, ..., X_n^w]/√n, G = [G_1, ..., G_n]/√n, and N = [ε_1, ..., ε_n]/√n.
Note that Y = X + N and Y^w = X^w + G.

We will denote by v_1, ..., v_r the right singular vectors of the matrix X, and by v_1^w, ..., v_r^w the right singular vectors of the matrix X^w. We denote by û_1^w, ..., û_r^w and v̂_1^w, ..., v̂_r^w the top r left and right singular vectors of the matrix Y^w. We define, for 1 ≤ k ≤ r, the empirical vectors:

û_k = W^{-1} û_k^w / ‖W^{-1} û_k^w‖. (8)

We also define the population counterparts,

ū_k = W^{-1} u_k^w / ‖W^{-1} u_k^w‖. (9)

Similarly, for 1 ≤ k ≤ r we define

ū_k^w = W u_k / ‖W u_k‖. (10)

Note that span{ū_1, ..., ū_r} = span{u_1, ..., u_r}, and span{ū_1^w, ..., ū_r^w} = span{u_1^w, ..., u_r^w}. However, the vectors ū_1, ..., ū_r will not, in general, be pairwise orthogonal; and similarly for ū_1^w, ..., ū_r^w.

Finally, we define the factor vectors z_k and z_k^w by

z_k = (z_{1k}, ..., z_{nk})^⊤, z_k^w = (z_{1k}^w, ..., z_{nk}^w)^⊤. (11)

We formally consider a sequence of problems, where n and p = p_n both tend to ∞ with a limiting aspect ratio γ:

γ = lim_{n→∞} p_n / n, (12)

which is assumed to be finite and positive. The number of population components r and the variances ℓ_1, ..., ℓ_r are assumed to be fixed with n. Because p and n are increasing, all quantities that depend on p and n are elements of a sequence, which will be assumed to follow some conditions which we will outline below and summarize in Section 2.1.1. Though we might denote, for instance, the PC u_k by u_k^{(p)}, X by X^{(p,n)}, and so forth, to keep the notation to a minimum – and in keeping with standard practice in the literature on the spiked model – we will typically drop the explicit dependence on p and n.
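As a concrete illustration of the model (1)–(6) and of the whitening step (4), the following minimal sketch simulates a rank-r spiked sample with heteroscedastic noise and forms the whitened matrix Y^w. It is not code from the paper; all function and variable names, and the particular choices of spike strengths and noise variances, are illustrative assumptions.

```python
import numpy as np

def simulate_spiked(p=500, n=1000, r=2, ell=(8.0, 3.0), seed=0):
    """Draw Y = X + eps from the heteroscedastic spiked model (1)-(3).

    Returns the normalized matrices Y, X (columns scaled by 1/sqrt(n)),
    the population PCs u_k, and the noise covariance Sigma_eps.
    """
    rng = np.random.default_rng(seed)
    # Orthonormal population PCs u_1, ..., u_r (here: a random orthonormal frame).
    U = np.linalg.qr(rng.standard_normal((p, r)))[0]
    ell = np.asarray(ell[:r])
    # Factor values z_jk with mean 0, variance 1.
    Z = rng.standard_normal((n, r))
    X = U @ (np.sqrt(ell)[:, None] * Z.T)                  # p x n signal matrix
    # Heteroscedastic noise: diagonal Sigma_eps with variances bounded away from 0.
    noise_var = rng.uniform(0.5, 2.0, size=p)
    Sigma_eps = np.diag(noise_var)
    eps = np.sqrt(noise_var)[:, None] * rng.standard_normal((p, n))
    Y = X + eps
    return Y / np.sqrt(n), X / np.sqrt(n), U, Sigma_eps

def whiten(Y, Sigma_eps):
    """Apply W = Sigma_eps^{-1/2} to the columns of Y, as in (4)."""
    # For a general (non-diagonal) Sigma_eps, use its symmetric inverse square root.
    evals, evecs = np.linalg.eigh(Sigma_eps)
    W = evecs @ np.diag(evals ** -0.5) @ evecs.T
    return W @ Y, W

Y, X, U, Sigma_eps = simulate_spiked()
Yw, W = whiten(Y, Sigma_eps)
```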
Remark 1. Because r is fixed as p and n grow, the left singular vectors of the p-by-n population matrix X = [X_1, ..., X_n]/√n are asymptotically consistent estimators (up to sign) of the population PCs u_1, ..., u_r. More precisely, if ũ_1, ..., ũ_r are the left singular vectors of X, then almost surely

lim_{p→∞} |⟨u_k, ũ_k⟩| = 1. (13)

Similarly, if ũ_1^w, ..., ũ_r^w are the left singular vectors of X^w, then almost surely

lim_{p→∞} |⟨u_k^w, ũ_k^w⟩| = 1. (14)

The limits (13) and (14) may be easily derived from, for example, Corollary 5.50 in [53] (restated as Lemma B.2 in Appendix B), since the effective dimension of the X_j is r, not p. Because this paper is concerned only with first-order phenomena, we will not distinguish between u_k (respectively, u_k^w) and ũ_k (respectively, ũ_k^w).
Remark 2. The unnormalized vectors W^{-1} u_k^w are the generalized singular vectors of the matrix X, with respect to the weight matrix W [39]. In particular, they are orthonormal with respect to the weighted inner product defined by W. Similarly, the vectors W^{-1} û_k^w are generalized singular vectors of Y with respect to W.

We assume that the values ‖W^{-1} u_k^w‖, 1 ≤ k ≤ r, have well-defined limits as p → ∞, and we define the parameters τ_k, 1 ≤ k ≤ r, by

τ_k = lim_{p→∞} ‖W^{-1} u_k^w‖^{-2}. (15)

Note that the τ_k are not known a priori; we will show, however, how they may be consistently estimated from the observed data. With the τ_k's defined, we now specify the ordering of the principal components of X_j that will be used throughout:

ℓ_1 τ_1 > · · · > ℓ_r τ_r > 0. (16)

We will also assume that the spectrum of Σ_ε stays bounded between a_min > 0 and a_max < ∞. In order to have well-defined asymptotics in the large p, large n regime, we will assume that the normalized trace of Σ_ε has a well-defined limit, which we will denote by μ_ε:

μ_ε = lim_{p→∞} tr(Σ_ε)/p ∈ (0, ∞). (17)

For the convenience of the reader, Tables 1 and 2 summarize the notation for vectors and scalar parameters that will be used throughout this paper.

Table 2: Scalar parameters used in this paper.
  ℓ_k      Signal variances (3), (16)
  ℓ_k^w    Whitened signal variances (6), (7)
  γ        Aspect ratio (12)
  τ_k      lim_{p→∞} ‖W^{-1} u_k^w‖^{-2} (15)
  ℓ̄_k      ℓ_k^w / τ_k (62)
  μ_ε      Normalized trace of Σ_ε (17)
  σ_k^w    Singular value of Y^w (41)
  c_k^w    Cosine between u_k^w and û_k^w (39)
  c̃_k^w    Cosine between v_k^w and v̂_k^w (40)
  c_k      Cosine between u_k and û_k under (19) (47)
Remark 3. We will assume for most of the paper that the noise covariance Σ_ε is known a priori (though see Section 4.3). However, all of the theoretical results, and resulting algorithms, go through unchanged if the true Σ_ε is replaced by any estimator Σ̂_ε that is consistent in operator norm, i.e.,

lim_{p→∞} ‖Σ_ε − Σ̂_ε‖_op = 0. (18)

Examples of such estimators Σ̂_ε are discussed in Section 4.3.

We enumerate the assumptions we have made on the asymptotic model:

1. p, n → ∞ and the aspect ratio p/n converges to γ > 0.
2. The eigenvalues of Σ_ε lie between a_min > 0 and a_max < ∞.
3. The limit lim_{p→∞} tr(Σ_ε)/p is well-defined, finite, and non-zero.
4. The limits lim_{p→∞} ‖W^{-1} u_k^w‖ are well-defined, finite, and non-zero.

Assumptions 1–4 will be in effect throughout the entire paper. In addition, some of the results, namely Theorems 3.2 and 7.3, will require an additional assumption, which we refer to as weighted orthogonality of the PCs u_1, ..., u_r:

5. For j ≠ k, the vectors u_j and u_k are asymptotically orthogonal with respect to the inner product defined by W² = Σ_ε^{-1}:

lim_{p→∞} u_j^⊤ W² u_k = 0. (19)

The assumptions 1–4 listed above are conceptually very benign. In applications, the practitioner will be faced with a finite p and n, for which all the listed quantities exist and are finite. The asymptotic assumptions 1–4 allow us to precisely quantify the behavior when p and n are large. By contrast, assumption 5 is stronger than assumptions 1–4, in that it posits not only that certain limits exist, but also their precise values (namely, 0). Note that assumption 5 is trivially satisfied when r = 1.

At first glance, the weighted orthogonality condition 5, which will be used in Theorems 3.2 and 7.3, may seem quite strong. However, it is a considerably weaker assumption than what is often assumed by methods on the spiked model. For instance, the method of OptShrink in [44] assumes that the PCs u_1, ..., u_r are themselves random vectors with iid entries (or orthonormalized versions thereof). Under this model, the inner products u_j^⊤ W² u_k almost surely converge to 0; see Proposition 6.2 in [9].

In fact, we may introduce a more general model for random PCs, under which assumption 5 will hold. For each 1 ≤ k ≤ r, we assume there is a p-by-p symmetric matrix B_k with bounded operator norm (‖B_k‖_op ≤ C < ∞, where C does not depend on p), and tr(B_k²)/p = 1. We then take u_1, ..., u_r to be the output of Gram-Schmidt performed on the vectors B_k w_k, where the w_k are vectors with iid subgaussian entries with variance 1/p. Then u_j^⊤ W² u_k = w_j^⊤ B_j^⊤ W² B_k w_k, which converges to zero almost surely, again using [9] and the bounded operator norm of B_j^⊤ W² B_k.

Remark 4. Under the random model just described, the parameters τ_k are well-defined and equal to lim_{p→∞} tr(B_k^⊤ W² B_k)/p, so long as this limit exists. Indeed, it follows from (19) that u_k^w is asymptotically identical to W u_k / ‖W u_k‖ (see Theorem 3.2), and so lim_{p→∞} ‖W^{-1} u_k^w‖^{-2} = lim_{p→∞} ‖W u_k‖² = lim_{p→∞} tr(B_k^⊤ W² B_k)/p, where we have once again invoked [9].
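The weighted orthogonality condition (19) under the random-PC model above can be checked empirically. The sketch below is our own illustration, not part of the paper: each B_k is taken diagonal for simplicity, and the printed Gram matrix u_j^⊤ Σ_ε^{-1} u_k has off-diagonal entries that shrink as p grows.

```python
import numpy as np

def random_pcs(p, r, rng):
    """PCs u_1,...,u_r from the random model above: Gram-Schmidt applied to
    B_k w_k, with w_k having iid entries of variance 1/p. Each B_k here is
    diagonal and normalized so that tr(B_k^2)/p = 1 (an illustrative choice)."""
    U = np.zeros((p, r))
    for k in range(r):
        b = rng.uniform(0.5, 1.5, size=p)
        b /= np.sqrt(np.mean(b ** 2))                  # tr(B_k^2)/p = 1
        u = b * rng.standard_normal(p) / np.sqrt(p)    # B_k w_k
        for j in range(k):                             # Gram-Schmidt step
            u -= (U[:, j] @ u) * U[:, j]
        U[:, k] = u / np.linalg.norm(u)
    return U

rng = np.random.default_rng(1)
p, r = 4000, 3
noise_var = rng.uniform(0.5, 2.0, size=p)              # a diagonal Sigma_eps
U = random_pcs(p, r, rng)
# Empirical check of (19): u_j^T Sigma_eps^{-1} u_k should be near 0 for j != k.
gram = U.T @ (U / noise_var[:, None])
print(np.round(gram, 3))
```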
This paper considers three central tasks: denoising the observations Y_j to recover X_j – what we refer to as prediction, since the X_j's are themselves random – estimating the population covariance Σ_x, and estimating the principal subspace span{u_1, ..., u_r}.

For predicting the signal vectors X_j, or equivalently the normalized signal matrix X = [X_1, ..., X_n]/√n, we will use the asymptotic mean squared error to measure the accuracy of a predictor X̂:

AMSE = lim_{n→∞} E‖X̂ − X‖²_F = lim_{n→∞} (1/n) Σ_{j=1}^n E‖X̂_j − X_j‖². (20)

For covariance estimation, our goal is to estimate the covariance of the signal vectors, Σ_x = E[X_j X_j^⊤] (under the convention that the X_j are mean zero; otherwise, we subtract off the mean). While the Frobenius loss, or MSE, is natural for signal estimation, for covariance estimation it is useful to consider a wider range of loss functions depending on the statistical problem at hand; see [22] and the references within for an elucidation of this point.

We will denote our covariance estimator by Σ̂_x. Denote the loss function by L(Σ̂_x, Σ_x); for instance, Frobenius loss L(Σ̂_x, Σ_x) = ‖Σ̂_x − Σ_x‖²_F, or operator norm loss L(Σ̂_x, Σ_x) = ‖Σ̂_x − Σ_x‖_op. For a specified loss function L, we seek to minimize the asymptotic value of the loss for our estimator,

lim_{n→∞} E L(Σ̂_x, Σ_x). (21)

For both the data prediction and covariance estimation problems, it will be a consequence of our analysis that the limits of the errors are, in fact, well-defined quantities.

Finally, we are also concerned with principal component analysis (PCA), or estimating the principal subspace U = span{u_1, ..., u_r}, in which the signal vectors X_j lie. We measure the discrepancy between the estimated subspace Û and the true subspace U by the angle Θ(U, Û) between these subspaces, defined by

sin Θ(U, Û) = ‖Û_⊥^⊤ U‖_op, (22)

where Û_⊥ and U are matrices whose columns are orthonormal bases of Û^⊥ and U, respectively.

The spectral theory of the observed matrix Y has been thoroughly studied in the large p, large n regime, when p = p_n grows with n. We will offer a brief survey of the relevant results from the literature [46, 10, 19].

In the case of isotropic Gaussian noise (that is, when Σ_ε = I_p), the r largest singular values of the matrix Y converge to σ_k, defined by:

σ_k = sqrt((ℓ_k + 1)(1 + γ/ℓ_k)), if ℓ_k > √γ;  σ_k = 1 + √γ, if ℓ_k ≤ √γ. (23)

Furthermore, the top singular vectors û_k^y and v̂_k^y of Y make asymptotically deterministic angles with the singular vectors u_k and v_k of X. More precisely, the absolute cosines |⟨û_j^y, u_k⟩| converge to c_k = c_k(γ, ℓ_k), defined by

c_k² = (1 − γ/ℓ_k²)/(1 + γ/ℓ_k) if j = k and ℓ_k > √γ, and 0 otherwise, (24)

and the absolute cosines |⟨v̂_j^y, v_k⟩| converge to c̃_k = c̃_k(γ, ℓ_k), defined by

c̃_k² = (1 − γ/ℓ_k²)/(1 + 1/ℓ_k) if j = k and ℓ_k > √γ, and 0 otherwise. (25)

When ℓ_k > √γ, the population variance ℓ_k can be estimated consistently from the observed singular value σ_k. Since c_k and c̃_k are functions of ℓ_k and the aspect ratio γ, these quantities can then also be consistently estimated.
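The maps (23)–(25), and the inversion of (23) used to recover ℓ_k from an observed singular value, are simple enough to write out directly. The sketch below is our own helper code (function names are not from the paper) and assumes the white-noise spiked model.

```python
import numpy as np

def spike_forward(ell, gamma):
    """Asymptotic singular value and cosines (23)-(25) for the white-noise
    spiked model, given a population spike ell and aspect ratio gamma."""
    if ell <= np.sqrt(gamma):
        return 1.0 + np.sqrt(gamma), 0.0, 0.0
    sigma = np.sqrt((ell + 1.0) * (1.0 + gamma / ell))
    c = np.sqrt((1.0 - gamma / ell**2) / (1.0 + gamma / ell))      # left cosine, (24)
    c_tilde = np.sqrt((1.0 - gamma / ell**2) / (1.0 + 1.0 / ell))  # right cosine, (25)
    return sigma, c, c_tilde

def spike_inverse(sigma, gamma):
    """Invert (23): recover ell from an observed singular value sigma > 1 + sqrt(gamma)."""
    d = sigma**2 - 1.0 - gamma
    return 0.5 * (d + np.sqrt(d**2 - 4.0 * gamma))

sigma, c, ct = spike_forward(ell=5.0, gamma=0.5)
print(sigma, spike_inverse(sigma, 0.5))   # the second value recovers ell = 5.0
```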
Remark 5. Due to the orthogonal invariance of the noise matrix N = G when Σ_ε = I_p, formulas (23), (24) and (25) are valid for any rank r matrix X, so long as X's singular values do not change with p and n.

The paper [10] derives the asymptotics for more general noise matrices N, but with the additional assumption that the singular vectors of X are themselves random (see the discussion in Section 2.1.2). The formulas for the asymptotic singular values and cosines found in [10] are in terms of the Stieltjes transform [5] of the asymptotic distribution of singular values of Y, which can be estimated consistently using the observed singular values of Y.

We review the theory of shrinkage with respect to Frobenius loss; we briefly mention that the paper [25] extends these ideas to a much wider range of loss functions for the spiked model. We suppose that our predictor of X is a rank r matrix of the form

X̂ = Σ_{k=1}^r t_k û_k v̂_k^⊤, (26)

where û_k and v̂_k are estimated vectors. We will assume that the vectors v̂_k are orthogonal, and that their cosines with the population vectors v_k of X are asymptotically deterministic. More precisely, we assume that ⟨v_j, v̂_k⟩ → c̃_k when j = k, and converges to 0 when j ≠ k. Similarly, we will assume that ⟨u_k, û_k⟩ → c_k; however, we do not need to assume any orthogonality condition on the u_j's and û_j's for the purposes of this derivation.

Expanding the squared Frobenius loss between X̂ and X and using the orthogonality conditions on the v_j's and v̂_k's, we get:

‖X̂ − X‖²_F = ‖Σ_{k=1}^r (t_k û_k v̂_k^⊤ − ℓ_k^{1/2} u_k v_k^⊤)‖²_F
           = Σ_{k=1}^r ‖t_k û_k v̂_k^⊤ − ℓ_k^{1/2} u_k v_k^⊤‖²_F + Σ_{j≠k} ⟨t_j û_j v̂_j^⊤ − ℓ_j^{1/2} u_j v_j^⊤, t_k û_k v̂_k^⊤ − ℓ_k^{1/2} u_k v_k^⊤⟩_F
           ∼ Σ_{k=1}^r ‖t_k û_k v̂_k^⊤ − ℓ_k^{1/2} u_k v_k^⊤‖²_F, (27)

where ∼ denotes almost sure asymptotic equality as p, n → ∞.

Since the loss separates over the different components, we may consider each component separately. Using the asymptotic cosines, we have:

‖t_k û_k v̂_k^⊤ − ℓ_k^{1/2} u_k v_k^⊤‖²_F ∼ t_k² + ℓ_k − 2 ℓ_k^{1/2} c_k c̃_k t_k, (28)

which is minimized by taking

t_k = ℓ_k^{1/2} c_k c̃_k. (29)

These values of t_k, therefore, are the optimal ones for predicting X in Frobenius loss. Furthermore, we can also derive an estimable formula for the AMSE. Indeed, plugging t_k = ℓ_k^{1/2} c_k c̃_k into (28), we get:

AMSE = Σ_{k=1}^r ℓ_k (1 − c_k² c̃_k²). (30)

Note that this derivation of the optimal t_k and the AMSE does not require the vectors û_k and v̂_k to be the singular vectors of Y. Rather, we just require the asymptotic cosines to be well-defined, and the v_j's and v̂_j's to be orthogonal across different components. Implementing this procedure, however, requires consistent estimates of ℓ_k, c_k and c̃_k.
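As a worked example of (26)–(30) in the white-noise case, the sketch below reuses spike_forward and spike_inverse from the earlier sketch, estimates ℓ_k, c_k and c̃_k from the observed singular values, and applies the optimal shrinkage (29). The function name and interface are our own illustrative choices.

```python
import numpy as np

def optimal_svs_white(Y, r, gamma):
    """Optimal singular value shrinkage (29) for Y = X + G with white noise.
    Y is the normalized p x n observation matrix; returns the predictor and
    the estimated AMSE (30)."""
    U, S, Vt = np.linalg.svd(Y, full_matrices=False)
    t = np.zeros(r)
    amse = 0.0
    for k in range(r):
        if S[k] <= 1.0 + np.sqrt(gamma):
            continue                           # component indistinguishable from noise
        ell = spike_inverse(S[k], gamma)       # estimate of ell_k via (23)
        _, c, ct = spike_forward(ell, gamma)
        t[k] = np.sqrt(ell) * c * ct           # optimal singular value (29)
        amse += ell * (1.0 - c**2 * ct**2)     # contribution to (30)
    Xhat = (U[:, :r] * t) @ Vt[:r, :]
    return Xhat, amse
```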
2.3.3 Eigenvalue shrinkage for covariance estimation

Similar to the task of predicting the data matrix X is estimating the covariance matrix Σ_x = E[X_j X_j^⊤] = Σ_{k=1}^r ℓ_k u_k u_k^⊤. The procedure we consider in this setting is known as eigenvalue shrinkage. Given orthonormal vectors û_1, ..., û_r estimating the PCs u_1, ..., u_r, we consider estimators of the form

Σ̂_x = Σ_{k=1}^r t_k û_k û_k^⊤, (31)

where the t_k are estimated population eigenvalues, which it is our goal to determine.

In [21], a large family of loss functions is considered for estimating Σ_x in white noise. All these loss functions satisfy two conditions. First, they are orthogonally-invariant, meaning that if both the estimated and population PCs are rotated, the loss does not change. Second, they are block-decomposable, meaning that if both the estimated and population covariance matrices are in block-diagonal form, the loss can be written as a function of the losses between the individual blocks.

The method of [21] rests on an observation from linear algebra. If (asymptotically) ⟨û_k, u_k⟩ = c_k, and û_j ⊥ u_k for all 1 ≤ j ≠ k ≤ r, then there is an orthonormal basis of R^p with respect to which both Σ_x and any rank r covariance Σ̂_x are simultaneously block-diagonalizable, with r blocks of size 2-by-2. More precisely, there is a p-by-p orthogonal matrix O so that:

O Σ_x O^⊤ = ⊕_{k=1}^r A_k, (32)

and

O Σ̂_x O^⊤ = ⊕_{k=1}^r ℓ̂_k B_k, (33)

where

A_k = [ ℓ_k  0 ;  0  0 ], (34)

and

B_k = [ c_k²  c_k √(1 − c_k²) ;  c_k √(1 − c_k²)  1 − c_k² ]. (35)

If L(Σ̂, Σ) is a loss function that is orthogonally-invariant and block-decomposable, then the loss between Σ_x and Σ̂_x decomposes into the losses between each A_k and ℓ̂_k B_k, which depend only on the one parameter ℓ̂_k. Consequently,

ℓ̂_k = argmin_ℓ L(A_k, ℓ B_k). (36)

The paper [21] contains solutions for ℓ̂_k for a wide range of loss functions L. For example, with Frobenius loss, the optimal value is ℓ̂_k = ℓ_k c_k², whereas for operator norm loss the optimal value is ℓ̂_k = ℓ_k. Even when closed form solutions are unavailable, one may perform the minimization (36) numerically.

A precise understanding of the asymptotic behavior of the spiked model is crucial for deriving optimal spectral shrinkers, as we have seen in Sections 2.3.2 and 2.3.3. In this section, we provide expressions for the asymptotic cosines between the empirical PCs and the population PCs, as well as limiting values for other parameters. The formulas from Theorem 3.1 below will be employed in Section 4.1 for optimal singular value shrinkage with whitening; and the formulas from Theorem 3.2 below will be employed in Section 4.2 for optimal eigenvalue shrinkage with whitening.

The first result, Theorem 3.1, applies to the standard spiked model with white noise. It gives a characterization of the asymptotic angles of the population PCs and empirical PCs with respect to an inner product x^⊤ A y given by a symmetric positive-definite matrix A. Parts 1 and 4 are standard results on the spiked covariance model [46, 10]; we include them here for easy reference. A special case of part 2 appears in [38], in a somewhat different form; and part 3 appears to be new.

Theorem 3.1. Suppose Y_1^w, ..., Y_n^w are iid vectors in R^p from the spiked model with white noise, with Y_j^w = X_j^w + G_j, where X_j^w is of the form (6) and G_j ∼ N(0, I_p). Let A = A_p be an element of a sequence of symmetric, positive-definite p-by-p matrices with bounded operator norm (‖A_p‖_op ≤ C < ∞ for all p), whose asymptotic normalized trace is well-defined and finite:

μ_a = lim_{p→∞} tr(A_p)/p < ∞. (37)

Suppose too that for 1 ≤ k ≤ r, the following quantity τ_k^a is also well-defined and finite:

τ_k^a = lim_{p→∞} ‖A_p^{1/2} u_k^w‖^{-2} < ∞. (38)

Define c_k^w > 0 by:

(c_k^w)² = (1 − γ/(ℓ_k^w)²) / (1 + γ/ℓ_k^w), if ℓ_k^w > √γ, and 0 otherwise, (39)

and let s_k^w = √(1 − (c_k^w)²). Also define c̃_k^w > 0 by:

(c̃_k^w)² = (1 − γ/(ℓ_k^w)²) / (1 + 1/ℓ_k^w), if ℓ_k^w > √γ, and 0 otherwise, (40)

and s̃_k^w = √(1 − (c̃_k^w)²). Then for any 1 ≤ j, k ≤ r, we have, as n → ∞ and p/n → γ:

1. The k-th largest singular value of Y^w converges almost surely to

σ_k^w = sqrt((ℓ_k^w + 1)(1 + γ/ℓ_k^w)), if ℓ_k^w > √γ, and σ_k^w = 1 + √γ otherwise. (41)
2. The A-norm of û_k^w converges almost surely:

lim_{p→∞} ‖A_p^{1/2} û_k^w‖² = (c_k^w)²/τ_k^a + (s_k^w)² μ_a. (42)
3. The A-inner product between u_k^w and û_k^w converges almost surely:

lim_{p→∞} ⟨A_p u_k^w, û_k^w⟩ = c_k^w / τ_k^a, if ℓ_k^w > √γ, and 0 otherwise. (43)
4. The inner product between v_j^w and v̂_k^w converges almost surely:

lim_{n→∞} ⟨v_j^w, v̂_k^w⟩ = c̃_k^w, if j = k and ℓ_k^w > √γ, and 0 otherwise. (44)
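A quick Monte Carlo check of Theorem 3.1 can be run in a few lines. The script below is our own verification aid (not from the paper); it simulates a rank-one whitened spike, takes A to be a random diagonal positive-definite matrix, and compares the empirical quantities in (42) and (43) against their predicted limits.

```python
import numpy as np

rng = np.random.default_rng(0)
p, n, ell_w, gamma = 800, 1600, 6.0, 0.5
u_w = rng.standard_normal(p); u_w /= np.linalg.norm(u_w)        # population PC u^w
z = rng.standard_normal(n)
Xw = np.sqrt(ell_w) * np.outer(u_w, z) / np.sqrt(n)
Yw = Xw + rng.standard_normal((p, n)) / np.sqrt(n)
uhat_w = np.linalg.svd(Yw, full_matrices=False)[0][:, 0]        # empirical PC

a = rng.uniform(0.5, 2.0, size=p)                               # A = diag(a), SPD
mu_a = a.mean()                                                 # (37)
tau_a = 1.0 / (u_w @ (a * u_w))                                 # (38)
c2 = (1 - gamma / ell_w**2) / (1 + gamma / ell_w)               # (39)
s2 = 1 - c2

print("||A^{1/2} u_hat||^2:", uhat_w @ (a * uhat_w),
      "predicted:", c2 / tau_a + s2 * mu_a)                     # (42)
print("<A u, u_hat>:", abs(u_w @ (a * uhat_w)),
      "predicted:", np.sqrt(c2) / tau_a)                        # (43)
```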
Remark 6. In fact, as will be evident from its proof, Theorem 3.1 is applicable to any rank r matrix X^w, viewing u_k^w and v_k^w as the singular vectors of X^w. In particular, the columns of X^w need not be drawn iid from a mean zero distribution. All that is needed for Theorem 3.1 is that the singular values of X^w remain constant as p and n grow, and that the parameters τ_k are well-defined.

Theorem 3.1 is concerned only with the standard spiked model with white noise, Y_j^w = X_j^w + G_j. By contrast, the next result, Theorem 3.2, deals with the spiked model with colored noise, Y_j = X_j + ε_j, where ε_j ∼ N(0, Σ_ε). In Section 2.1, we defined the whitening matrix W = Σ_ε^{-1/2} that transforms Y_j into the standard white-noise model Y_j^w; that is, Y_j^w = W Y_j = W X_j + W ε_j = X_j^w + G_j. In stating and applying Theorem 3.2, we refer to the parameters for both models described in Section 2.1.
Theorem 3.2. Assume that the PCs u_1, ..., u_r satisfy the weighted orthogonality condition (19), i.e., for 1 ≤ j ≠ k ≤ r,

lim_{p→∞} u_j^⊤ W² u_k = 0. (45)

Order the principal components of X_j by decreasing value of ℓ_k τ_k, as in (16); that is, we assume Σ_x = Σ_{k=1}^r ℓ_k u_k u_k^⊤, with

ℓ_1 τ_1 > · · · > ℓ_r τ_r > 0, (46)

where τ_k = lim_{p→∞} ‖W^{-1} u_k^w‖^{-2} as in (15). Define c_k > 0, 1 ≤ k ≤ r, by:

c_k² ≡ (c_k^w)² / [(c_k^w)² + (s_k^w)² μ_ε τ_k], if ℓ_k^w > √γ, and 0 otherwise, (47)

where c_k^w is given by (39), ℓ_k^w is defined from (6) with X_j^w = W X_j, and μ_ε = lim_{p→∞} tr(Σ_ε)/p as in (17). Then for any 1 ≤ j, k ≤ r:

1. The vectors ū_k and u_k are almost surely asymptotically identical:

lim_{p→∞} ⟨u_k, ū_k⟩ = 1. (48)
2. The vectors v_k^w and v_k are almost surely asymptotically identical:

lim_{n→∞} ⟨v_k, v_k^w⟩ = 1. (49)
3. The inner product between u_j and û_k converges almost surely:

lim_{p→∞} ⟨u_j, û_k⟩ = c_k, if j = k and ℓ_k^w > √γ, and 0 otherwise, (50)

where c_k is defined in (47).

4. The vectors û_j and û_k are asymptotically orthogonal if j ≠ k:

lim_{p→∞} ⟨û_j, û_k⟩ = δ_{jk}. (51)
5. The parameter τ_k is almost surely asymptotically equal to ‖W u_k‖²:

lim_{p→∞} (τ_k − ‖W u_k‖²) = 0. (52)
6. The variance ℓ_k^w of X_j^w along u_k^w is almost surely asymptotically equal to ℓ_k τ_k:

lim_{p→∞} (ℓ_k^w − ℓ_k τ_k) = 0. (53)

The proofs of both Theorem 3.1 and Theorem 3.2 may be found in Appendix A.

In this section, we will derive the optimal spectral shrinkers for signal prediction and covariance estimation to be used in conjunction with whitening.
Given the noisy matrix Y = X + N, we consider a class of predictors of X defined as follows. First, we whiten the noise, replacing Y with Y^w = W Y. We then apply singular value shrinkage to the transformed matrix Y^w. That is, if û_1^w, ..., û_r^w and v̂_1^w, ..., v̂_r^w are the top left and right singular vectors of Y^w, we define the new matrix

X̂^w = Σ_{k=1}^r t_k û_k^w (v̂_k^w)^⊤, (54)

for some positive scalars t_k which we have yet to determine. Finally, we recolor the noise, to bring the data back to its original scaling. That is, we define our final predictor X̂ by

X̂ = W^{-1} X̂^w. (55)

In this section, we will show how to optimally choose the singular values t_1, ..., t_r in (54) to minimize the AMSE:

AMSE = lim_{n→∞} E‖X̂ − X‖²_F. (56)
Remark 7. Loss functions other than Frobenius loss (i.e., mean-squared error) may be considered as well. This will be done for the problem of covariance estimation in Section 4.2, where it is more natural [22]. For recovering the data matrix X itself, however, the MSE is the natural loss, and the optimal t_k can be derived for minimizing the AMSE without any additional assumptions on the model.

Once we have whitened the noise, our resulting matrix Y^w = X^w + G is from the standard spiked model and consequently satisfies the conditions of Theorem 3.1, since G is a Gaussian matrix with iid N(0, 1) entries.
We will apply the asymptotic results of Theorem 3.1, taking the matrix A = W^{-2}. Recalling the definitions of û_k and ū_k from (8) and (9), respectively, we obtain an immediate corollary to Theorem 3.1:
Corollary 4.1. For 1 ≤ k ≤ r, the cosine between the vectors ū_k and û_k converges almost surely:

lim_{p→∞} ⟨ū_k, û_k⟩ = c_k ≡ c_k^w / sqrt((c_k^w)² + (s_k^w)² μ_ε τ_k), if ℓ_k^w > √γ, and 0 otherwise. (57)

We derive the optimal t_k. We write:

X^w ∼ Σ_{k=1}^r (ℓ_k^w)^{1/2} u_k^w (v_k^w)^⊤, (58)

and so

X = W^{-1} X^w ∼ Σ_{k=1}^r (ℓ_k^w)^{1/2} W^{-1} u_k^w (v_k^w)^⊤ = Σ_{k=1}^r (ℓ_k^w / τ_k)^{1/2} ū_k (v_k^w)^⊤. (59)

Furthermore,

X̂^w = Σ_{k=1}^r t_k û_k^w (v̂_k^w)^⊤, (60)

and so

X̂ = W^{-1} X̂^w = Σ_{k=1}^r t_k W^{-1} û_k^w (v̂_k^w)^⊤ = Σ_{k=1}^r t_k ‖W^{-1} û_k^w‖ û_k (v̂_k^w)^⊤. (61)

It is convenient to reparametrize the problem in terms of

ℓ̄_k ≡ ℓ_k^w / τ_k, (62)

and

t̃_k ≡ t_k ‖W^{-1} û_k^w‖ ∼ t_k ((c_k^w)²/τ_k + (s_k^w)² μ_ε)^{1/2}, (63)

where we have used Theorem 3.1.

In this notation, we have X = Σ_{k=1}^r ℓ̄_k^{1/2} ū_k (v_k^w)^⊤ and X̂ = Σ_{k=1}^r t̃_k û_k (v̂_k^w)^⊤. From Theorem 3.1, the vectors v_j^w and v̂_k^w are orthogonal if j ≠ k, and the cosine between v_k^w and v̂_k^w is c̃_k ≡ c̃_k^w. The derivation from Section 2.3.2 shows that the optimal values t̃_k are then given by

t̃_k = ℓ̄_k^{1/2} c_k c̃_k. (64)

For this to define a valid estimator, we must show how to estimate the values ℓ̄_k, c_k and c̃_k from the observed data itself. To that end, from Theorem 3.1, ℓ_k^w can be estimated by

ℓ_k^w = [(σ_k^w)² − 1 − γ + sqrt(((σ_k^w)² − 1 − γ)² − 4γ)] / 2, (65)

where σ_k^w is the k-th singular value of Y^w. The cosines c_k^w and c̃_k^w can then be estimated by formulas (39) and (40).

Now, rearranging part 2 of Theorem 3.1, we can solve for τ_k in terms of the estimable quantities c_k^w, s_k^w, μ_ε and ‖Σ_ε^{1/2} û_k^w‖:

τ_k ∼ (c_k^w)² / (‖Σ_ε^{1/2} û_k^w‖² − (s_k^w)² μ_ε). (66)

Indeed, this quantity can be estimated consistently: c_k^w and s_k^w are estimable from (39), ‖Σ_ε^{1/2} û_k^w‖ is directly observed, and μ_ε ∼ tr(Σ_ε)/p.

Having estimated τ_k, we apply formula ℓ̄_k = ℓ_k^w / τ_k, and formula (50) for c_k. This completes the derivation of the optimal singular value shrinker. The entire procedure is described in Algorithm 1.

Figure 1: Optimal shrinker, naive shrinker, and population shrinker, for τ = 1 and γ = 0.5.

Figures 1 and 2 plot the optimal shrinker, i.e., the function that sends each top observed singular value σ_k^w of Y^w to the optimal t_k. For contrast, we also plot the "population" shrinker, which maps σ_k^w to the corresponding sqrt(ℓ_k^w); and the "naive" shrinker, which maps σ_k^w to sqrt(ℓ_k^w) c_k^w c̃_k^w. This latter shrinker is considered in the paper [19], and is naive in that it optimizes the Frobenius loss before the unwhitening step without accounting for the change in angles between singular vectors resulting from unwhitening. In Figure 1 we set γ = 0.5, while in Figure 2 we set γ = 2. We fix τ = 1 but consider different values of μ_ε (the behavior depends only on the ratio of μ_ε and τ).
Figure 2: Optimal shrinker, naive shrinker, and population shrinker, for τ = 1 and γ = 2.

Algorithm 1 Optimal singular value shrinkage with whitening
Input: observations Y_1, ..., Y_n; noise covariance Σ_ε; rank r
  Define Y = [Y_1, ..., Y_n]/√n; W = Σ_ε^{-1/2}; Y^w = W Y
  Compute the rank r SVD of Y^w: û_1^w, ..., û_r^w; v̂_1^w, ..., v̂_r^w; σ_1^w, ..., σ_r^w
  for all k = 1, ..., r do
    if σ_k^w > 1 + √γ then
      ℓ_k^w = [(σ_k^w)² − 1 − γ + sqrt(((σ_k^w)² − 1 − γ)² − 4γ)] / 2
      c_k^w = sqrt((1 − γ/(ℓ_k^w)²) / (1 + γ/ℓ_k^w))
      s_k^w = sqrt(1 − (c_k^w)²)
      c̃_k = sqrt((1 − γ/(ℓ_k^w)²) / (1 + 1/ℓ_k^w))
      μ_ε = tr(Σ_ε)/p
      τ_k = (c_k^w)² / [‖Σ_ε^{1/2} û_k^w‖² − (s_k^w)² μ_ε]
      t_k = (ℓ_k^w)^{1/2} c_k^w c̃_k / [(c_k^w)² + (s_k^w)² μ_ε τ_k]
    else if σ_k^w ≤ 1 + √γ then
      t_k = 0
Output: X̂ = W^{-1} Σ_{k=1}^r t_k û_k^w (v̂_k^w)^⊤
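A minimal NumPy sketch of Algorithm 1 follows, assuming Σ_ε is known, symmetric, and well-conditioned. It is our own illustrative implementation (function and variable names are not from the paper), and it returns the predicted signal columns rather than the normalized matrix X̂.

```python
import numpy as np

def optshrink_whitened(Y_cols, Sigma_eps, r):
    """Sketch of Algorithm 1: whiten, shrink singular values, unwhiten.

    Y_cols : (p, n) array of observations Y_1, ..., Y_n as columns.
    Returns the (p, n) matrix of predicted signal columns.
    """
    p, n = Y_cols.shape
    gamma = p / n
    mu_eps = np.trace(Sigma_eps) / p

    # W = Sigma_eps^{-1/2} and its inverse, via the symmetric eigendecomposition.
    evals, evecs = np.linalg.eigh(Sigma_eps)
    W = evecs @ np.diag(evals ** -0.5) @ evecs.T
    W_inv = evecs @ np.diag(evals ** 0.5) @ evecs.T

    Yw = W @ Y_cols / np.sqrt(n)
    U, S, Vt = np.linalg.svd(Yw, full_matrices=False)

    t = np.zeros(r)
    for k in range(r):
        sigma = S[k]
        if sigma <= 1.0 + np.sqrt(gamma):
            continue                                        # below the bulk edge: t_k = 0
        d = sigma**2 - 1.0 - gamma
        ell_w = 0.5 * (d + np.sqrt(d**2 - 4.0 * gamma))               # (65)
        c_w2 = (1.0 - gamma / ell_w**2) / (1.0 + gamma / ell_w)       # (39)
        s_w2 = 1.0 - c_w2
        ct = np.sqrt((1.0 - gamma / ell_w**2) / (1.0 + 1.0 / ell_w))  # (40)
        a_norm2 = U[:, k] @ (Sigma_eps @ U[:, k])           # ||Sigma_eps^{1/2} u_hat_k^w||^2
        tau = c_w2 / (a_norm2 - s_w2 * mu_eps)              # (66)
        t[k] = np.sqrt(ell_w) * np.sqrt(c_w2) * ct / (c_w2 + s_w2 * mu_eps * tau)

    Xw_hat = (U[:, :r] * t) @ Vt[:r, :]
    return W_inv @ Xw_hat * np.sqrt(n)      # undo the 1/sqrt(n) normalization
```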
Remark 8. In practice, the rank r may not be known a priori. In Section 4.4, we describe several methods for estimating r from the data.
Remark 9. Algorithm 1 may be applied to denoising any rank r matrix X from the observed matrix Y = X + N. As pointed out in Remark 6, the assumption that the columns of X are drawn iid from a mean zero distribution with covariance Σ_x is not needed for the parameter estimates used by Algorithm 1 to be applicable, so long as the singular values of the whitened matrix X^w stay fixed (or converge almost surely) as p and n grow, and the parameters τ_k are well-defined.

We turn now to the task of estimating the covariance Σ_x of the X_j. Throughout this section, we will assume the conditions of Theorem 3.2, namely condition (19).

Analogous to the procedure for singular value shrinkage with whitening, we consider the procedure of eigenvalue shrinkage with whitening. We first whiten the observations Y_j, producing new observations Y_j^w = W Y_j. We then form the sample covariance Σ̂_y^w of the Y_j^w. We apply eigenvalue shrinkage to Σ̂_y^w, forming a matrix of the form

Σ̂_x^w = Σ_{k=1}^r t_k û_k^w (û_k^w)^⊤, (67)

where û_1^w, ..., û_r^w are the top r eigenvectors of Σ̂_y^w, or equivalently the top r left singular vectors of the whitened data matrix Y^w, and the t_k are the parameters we will determine. Finally, we form our final estimator of Σ_x by unwhitening:

Σ̂_x = W^{-1} Σ̂_x^w W^{-1}. (68)

It remains to define the eigenvalues t_1, ..., t_r of the matrix Σ̂_x^w. We let L denote any of the loss functions considered in [21]. As a reminder, all these loss functions satisfy two conditions. First, they are orthogonally-invariant, meaning that if both the estimated and population PCs are rotated, the loss does not change. Second, they are block-decomposable, meaning that if both the estimated and population covariance matrices are in block-diagonal form, the loss can be written as a function of the losses between the individual blocks.

The estimated covariance matrix Σ̂_x = W^{-1} Σ̂_x^w W^{-1} can be written as:

Σ̂_x = W^{-1} Σ̂_x^w W^{-1} = Σ_{k=1}^r t_k W^{-1} û_k^w (W^{-1} û_k^w)^⊤ = Σ_{k=1}^r t_k ‖W^{-1} û_k^w‖² û_k û_k^⊤ = Σ_{k=1}^r t̃_k û_k û_k^⊤, (69)
00 0 (cid:19) , (73)and B k = (cid:18) c k c k (cid:112) − c k c k (cid:112) − c k − c k (cid:19) . (74)As noted in Section 2.3.3, [21] provides closed form solutions to this minimization problem for many lossfunctions L . For example, when operator norm loss is used the optimal ˜ t k is (cid:96) k , and when Frobeniusnorm loss is used, the optimal ˜ t k is (cid:96) k c k . When no such closed formula is known, the optimal values maybe obtained by numerical minimization of (72).Finally, the eigenvalues t k are obtained by inverting formula (70): t k = ˜ t k (cid:18) ( c w k ) τ k + ( s w k ) µ ε (cid:19) − . (75)We summarize the covariance estimation procedure in Algorithm 2. Remark 10.
Remark 10. As stated in Remark 8, in practice the rank r will likely not be known a priori. We refer to Section 4.4 for a description of data-driven methods that may be used to estimate r.

4.3 Estimating Σ_ε

Algorithms 1 and 2 require access to the whitening transformation W = Σ_ε^{-1/2}, or equivalently the noise covariance matrix Σ_ε. However, the same method and analysis goes through unscathed if Σ_ε is replaced with an estimate Σ̂_ε that is consistent in operator norm, i.e., where

lim_{p→∞} ‖Σ_ε − Σ̂_ε‖_op = 0 (76)

almost surely as p/n → γ. Indeed, the distribution of the top r singular values and singular vectors of Y^w will be asymptotically identical whether the true W = Σ_ε^{-1/2} is used to perform whitening or the estimated Ŵ = Σ̂_ε^{-1/2} is used instead.
Remark 11. Because we assume that the maximum eigenvalue of Σ_ε is bounded and the minimum eigenvalue is bounded away from 0, (76) is equivalent to consistent estimation of the whitening matrix W = Σ_ε^{-1/2} by Ŵ = Σ̂_ε^{-1/2}.

An estimator Σ̂_ε satisfying (76) may be obtained when we have access to an iid sequence of pure noise vectors ε_1, ..., ε_{n′} in addition to the n signal-plus-noise vectors Y_1, ..., Y_n. This is the setting considered in [45], where a number of applications are also discussed. Here, we assume that n′ = n′(n) grows faster than p = p(n), that is,

lim_{n→∞} p(n)/n′(n) = 0. (77)

Algorithm 2 Optimal eigenvalue shrinkage with whitening
Input: observations Y_1, ..., Y_n; noise covariance Σ_ε; rank r
  Define Y = [Y_1, ..., Y_n]/√n; W = Σ_ε^{-1/2}; Y^w = W Y
  Compute the top r left singular vectors/values of Y^w: û_1^w, ..., û_r^w; σ_1^w, ..., σ_r^w
  for all k = 1, ..., r do
    if σ_k^w > 1 + √γ then
      ℓ_k^w = [(σ_k^w)² − 1 − γ + sqrt(((σ_k^w)² − 1 − γ)² − 4γ)] / 2
      c_k^w = sqrt((1 − γ/(ℓ_k^w)²) / (1 + γ/ℓ_k^w))
      μ_ε = tr(Σ_ε)/p
      τ_k = (c_k^w)² / [‖Σ_ε^{1/2} û_k^w‖² − (1 − (c_k^w)²) μ_ε]
      ℓ̄_k = ℓ_k^w / τ_k
      c_k = c_k^w / sqrt((c_k^w)² + (1 − (c_k^w)²) μ_ε τ_k)
      A_k = [ ℓ̄_k  0 ;  0  0 ]
      B_k = [ c_k²  c_k √(1 − c_k²) ;  c_k √(1 − c_k²)  1 − c_k² ]
      t̃_k = argmin_ℓ L(A_k, ℓ B_k)
      t_k = t̃_k τ_k / [(c_k^w)² + (1 − (c_k^w)²) μ_ε τ_k]
    else if σ_k^w ≤ 1 + √γ then
      t_k = 0
Output: Σ̂_x = Σ_{k=1}^r t_k (W^{-1} û_k^w)(W^{-1} û_k^w)^⊤

In this case, we replace Σ_ε by the sample covariance:

Σ̂_ε = (1/n′) Σ_{j=1}^{n′} ε_j ε_j^⊤, (78)

which converges to Σ_ε in operator norm; that is, (76) holds. In Section 8.5, we will illustrate the use of this method in simulations.

Remark 12. If p/n′ does not converge to 0, then Σ̂_ε given by (78) is not a consistent estimator of Σ_ε in operator norm. Indeed, when Σ_ε = I_p the distribution of Σ̂_ε's eigenvalues converges to the Marchenko-Pastur law [42], and more generally converges to a distribution whose Stieltjes transform is implicitly defined by a fixed point equation [5, 50, 49].

Without access to an independent sequence of n′ ≫ p pure noise samples, estimating the noise covariance Σ_ε consistently (with respect to operator norm) is usually hard as p → ∞. However, it may still be practical when Σ_ε is structured. Examples include: when Σ_ε is sparse [13]; when Σ_ε^{-1} is sparse [56]; when Σ_ε is a circulant or Toeplitz matrix, corresponding to stationary noise [16]; and more generally, when the eigenbasis of Σ_ε is known a priori.

To elaborate on the last condition, let us suppose that the eigenbasis of Σ_ε is known, and without loss of generality that Σ_ε is diagonal; and suppose that the u_k's are delocalized in that ‖u_k‖_∞ → 0 as p → ∞. Write Σ_ε = diag(ν_1, ..., ν_p), for unknown ν_i. In this setting, the sample variance of each coordinate will converge almost surely to the variance of the noise in that coordinate; that is, for i = 1, ..., p, we have:

ν̂_i = (1/n) Σ_{j=1}^n Y_{ij}² = (1/n) Σ_{j=1}^n (Σ_{k=1}^r ℓ_k^{1/2} u_{ki} z_{jk})² + (1/n) Σ_{j=1}^n ε_{ij}² + (2/n) Σ_{j=1}^n ε_{ij} Σ_{k=1}^r ℓ_k^{1/2} u_{ki} z_{jk} → ν_i, (79)

where the limit is almost sure as p, n → ∞. We have made use of the strong law of large numbers and the limit ‖u_k‖_∞ → 0.

Let Σ̂_ε have i-th diagonal entry ν̂_i. Then Σ̂_ε − Σ_ε is a mean-zero diagonal matrix, with diagonal entries ν̂_i − ν_i; and the operator norm ‖Σ̂_ε − Σ_ε‖_op = max_{1≤i≤p} |ν̂_i − ν_i|, which is easily shown to go to 0 almost surely as p → ∞ using the subgaussianity of the observations.

4.4 Estimating the rank r

A challenging question in principal component analysis is selecting the number of components corresponding to signal, and separating these from the noise. In our model, this corresponds to estimating the rank r of the matrix X, which is an input to Algorithms 1 and 2. A simple and natural estimate r̂ of the rank is the following:

r̂ = #{k : σ_k^w > 1 + √γ + ϵ_n}. (80)

That is, we estimate the rank as the number of singular values of Y^w = X^w + G exceeding the largest singular value of the noise matrix G, plus a small finite-sample correction factor ϵ_n > 0.
Any singular value exceeding 1 + √γ + ϵ_n is attributable to signal, whereas any value below it is consistent with pure noise.

When ϵ_n ≡ ϵ for all n, it may be shown that in the large p, large n limit, r̂ converges almost surely to the number of singular values of X^w exceeding 1 + √γ + ϵ. For small enough ϵ, this will recover all singular values of X^w exceeding √γ, and is likely sufficient for many applications. Furthermore, the correction ϵ_n may be calibrated using the Tracy-Widom distribution of the operator norm of GG^⊤ by taking ϵ_n ∼ n^{-2/3}. Though a detailed discussion is beyond the scope of this paper, we refer to [35] for an approach along these lines.

An alternative procedure is similar to r̂, but uses the original matrix Y rather than the whitened matrix Y^w:

r̂′ = #{k : σ_k > b_+ + ϵ_n}, (81)

where b_+ is the asymptotic operator norm of the noise matrix N, and ϵ_n is a finite-sample correction factor. The value b_+ may be evaluated using, for example, the method from [37]. An estimator like this is proposed in [44]. In Section 8.8, we present numerical evidence that r̂ may outperform r̂′. More precisely, it appears that whitening can increase the gap between the smallest signal singular value and the bulk edge of the noise, making detection of the signal components more reliable.
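A direct implementation of the estimator (80) is only a few lines. The sketch below is our own; the choice ϵ_n = n^{-2/3} (with unit constant) is one reasonable calibration among several, not a prescription from the paper.

```python
import numpy as np

def estimate_rank_whitened(Y_cols, Sigma_eps, eps_n=None):
    """Rank estimate (80): count singular values of the whitened, normalized
    data matrix that exceed the bulk edge 1 + sqrt(gamma) plus a correction."""
    p, n = Y_cols.shape
    gamma = p / n
    if eps_n is None:
        eps_n = n ** (-2.0 / 3.0)        # Tracy-Widom scale correction (illustrative)
    evals, evecs = np.linalg.eigh(Sigma_eps)
    W = evecs @ np.diag(evals ** -0.5) @ evecs.T
    sv = np.linalg.svd(W @ Y_cols / np.sqrt(n), compute_uv=False)
    return int(np.sum(sv > 1.0 + np.sqrt(gamma) + eps_n))
```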
We also remark that a widely-used method for rank estimation in non-isotropic noise is known as parallel analysis [29, 15, 14], which has been the subject of recent investigation [18, 20]. Other methods have also been explored [33].
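For concreteness, the following is a minimal numerical sketch of the estimator (80); it is not code from this paper, and the function name, the synthetic model, and the default choice $\epsilon_n = n^{-2/3}$ are illustrative assumptions.

```python
import numpy as np

def estimate_rank_whitened(Y, Sigma_eps, eps_n=None):
    """Rank estimate in the spirit of (80): count singular values of the whitened,
    scaled matrix W Y / sqrt(n), with W = Sigma_eps^{-1/2}, exceeding the noise
    bulk edge 1 + sqrt(gamma) plus a small correction eps_n."""
    p, n = Y.shape
    gamma = p / n
    if eps_n is None:
        eps_n = n ** (-2.0 / 3.0)   # Tracy-Widom scale; an illustrative default
    evals, evecs = np.linalg.eigh(Sigma_eps)
    W = evecs @ np.diag(evals ** (-0.5)) @ evecs.T   # whitening matrix
    sv = np.linalg.svd(W @ Y / np.sqrt(n), compute_uv=False)
    return int(np.sum(sv > 1.0 + np.sqrt(gamma) + eps_n))

# Synthetic check on a rank-2 spiked model with heteroscedastic noise.
rng = np.random.default_rng(0)
p, n, r = 200, 800, 2
nu = np.linspace(0.5, 2.0, p)                        # noise variances
U = np.linalg.qr(rng.standard_normal((p, r)))[0]     # orthonormal population PCs
X = U @ np.diag(np.sqrt([8.0, 4.0])) @ rng.standard_normal((r, n))
Y = X + np.sqrt(nu)[:, None] * rng.standard_normal((p, n))
print(estimate_rank_whitened(Y, np.diag(nu)))        # typically prints 2
```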
In this section, we examine the relationship between singular value shrinkage and linear prediction. A linear predictor of $X_j$ from $Y_j$ is of the form $AY_j$, where $A$ is a fixed matrix. It is known (see, e.g., [41]) that to minimize the expected mean-squared error, the best linear predictor, also called the Wiener filter, takes $A = \Sigma_x(\Sigma_x + \Sigma_\varepsilon)^{-1}$, and hence is of the form:
$\hat X_j^{\mathrm{opt}} = \Sigma_x(\Sigma_x + \Sigma_\varepsilon)^{-1} Y_j$.  (82)
We will prove the following result, which shows that in the classical regime $\gamma \to 0$, optimal shrinkage with whitening converges to the Wiener filter.
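Before stating the result, note that the oracle predictor (82) is straightforward to compute whenever $\Sigma_x$ and $\Sigma_\varepsilon$ are both known; the sketch below, with an illustrative rank-one model and a hypothetical function name, is only meant to make the later comparisons concrete.

```python
import numpy as np

def best_linear_predictor(Y, Sigma_x, Sigma_eps):
    """Oracle Wiener filter (82), applied to every column of Y at once:
    X_hat_j = Sigma_x (Sigma_x + Sigma_eps)^{-1} Y_j."""
    return Sigma_x @ np.linalg.solve(Sigma_x + Sigma_eps, Y)

# Illustration with a rank-1 signal covariance Sigma_x = ell * u u^T.
rng = np.random.default_rng(1)
p, n, ell = 100, 400, 5.0
u = rng.standard_normal(p)
u /= np.linalg.norm(u)
Sigma_x = ell * np.outer(u, u)
nu = np.linspace(0.2, 1.5, p)                        # heteroscedastic noise variances
X = np.sqrt(ell) * np.outer(u, rng.standard_normal(n))
Y = X + np.sqrt(nu)[:, None] * rng.standard_normal((p, n))
X_hat = best_linear_predictor(Y, Sigma_x, np.diag(nu))
print(np.linalg.norm(X_hat - X) / np.linalg.norm(X))  # relative error of the oracle BLP
```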
Theorem 5.1.
Suppose Y , . . . , Y n are drawn from the spiked model with heteroscedastic noise, Y j = X j + ε j . Let ˆ X , . . . , ˆ X n be the predictors of X , . . . , X n obtained from singular value shrinkage withwhitening, as described in Section 4.1 and Algorithm 1. Then almost surely in the limit p/n → , lim n →∞ (cid:107) ˆ X opt − ˆ X (cid:107) = lim n →∞ n n (cid:88) j =1 (cid:107) ˆ X opt j − ˆ X j (cid:107) = 0 . (83) In other words, the predictor ˆ X j is asymptotically equivalent to the best linear predictor ˆ X opt j . Theorem 5.1 is a consequence of the following result. heorem 5.2. Suppose that the numbers s k , ≤ k ≤ r satisfy lim γ → s k σ w k = (cid:96) w k (cid:96) w k + 1 . (84) Then the predictor defined by ˆ X (cid:48) = r (cid:88) k =1 s k W − ˆ u w k (ˆ v w k ) (cid:62) (85) satisfies lim n →∞ (cid:107) ˆ X opt − ˆ X (cid:48) (cid:107) = 0 , (86) where the limit holds almost surely as p/n → . We will also show that in the context of shrinkage methods, whitening is an optimal weighting ofthe data. To make this precise, we consider the following class of weighted shrinkage methods, whichsubsumes both ordinary singular value shrinkage and singular value shrinkage with noise whitening. Fora fixed weight matrix Q , we multiply Y by Q , forming the matrix Y q = [ QY , . . . , QY n ] / √ n . We thenapply singular value shrinkage to Y q , with singular values s q , . . . , s qr , after which we apply the inverseweighting Q − . Clearly, ordinary shrinkage is the special case when Q = I p , whereas singular valueshrinkage with whitening is the case when Q = W = Σ − / ε .When the singular values s q , . . . , s qr are chosen optimally to minimize the AMSE, we will call theresulting predictor ˆ X Q , and denote by ˆ X Q,j the denoised vectors so that ˆ X Q = [ ˆ X Q, , . . . , ˆ X Q,n ] / √ n . Inthis notation, ˆ X = ˆ X W is optimal shrinkage with whitening, whereas ˆ X I is ordinary shrinkage withoutwhitening. The natural question is, what is the optimal matrix Q ?To answer this question, we introduce the linear predictors ˆ X lin Q,j , defined byˆ X lin Q,j = r (cid:88) k =1 η qk (cid:104) QY j , u qk (cid:105) Q − u qk , (87)where the u q , . . . , u qr are the eigenvectors of Q Σ x Q , and the η qk are chosen optimally to minimize theaverage AMSE across all n observations. We prove the following result, which is again concerned withthe classical γ → Theorem 5.3.
Let Q = Q p be an element of a sequence of symmetric, positive-definite p -by- p matriceswith bounded operator norm ( (cid:107) Q p (cid:107) op ≤ C < ∞ for all p ). Then in the limit p/n → , we have almostsurely: lim n →∞ (cid:107) ˆ X lin Q − ˆ X Q (cid:107) = lim n →∞ n n (cid:88) j =1 (cid:107) ˆ X lin Q,j − ˆ X Q,j (cid:107) = 0 . (88) In other words, the weighted shrinkage predictor ˆ X Q,j is asymptotically equal to the linear predictor ˆ X lin Q,j .Furthermore, Q = W minimizes the AMSE: W = arg min Q lim n →∞ E (cid:107) ˆ X Q − X (cid:107) . (89)The first part of Theorem 5.3, namely (88), states that any weighted shrinkage method convergesto a linear predictor when γ →
0. The second part of Theorem 5.3, specifically (89), states that of all weighted shrinkage schemes, whitening is optimal in the limit $\gamma \to 0$. Remark 14.
A special case of Theorem 5.2 is the suboptimal “naive” shrinker with whitening, whichuses singular values (cid:112) (cid:96) w k c w k ˜ c w k ; see Figures 1 and 2 and the accompanying text. It is easily shown thatTheorem 5.2 applies to this shrinker, and consequently that in the γ → .1 Columns of weighted singular value shrinkage In this section, we show how to write the predictor ˆ X Q in terms of the individual columns of Y q =[ QY , . . . , QY n ] / √ n . This observation will be used in the proofs of Theorems 5.1, 5.2 and 5.3, and alsomotivates the form of the out-of-sample predictor we will study in Section 6.Let m = min( p, n ). Consistent with our previous notation (when Q = W ), we will denote byˆ u q , . . . , ˆ u qm the left singular vectors of the matrix Y q , and we will denote by ˆ v q , . . . , ˆ v qm the right singularvectors and σ q , . . . , σ qm the corresponding singular values. Lemma 5.4.
Each column ˆ X Q,j of √ n · ˆ X Q is given by the formula ˆ X Q,j = Q − r (cid:88) k =1 η qk (cid:104) QY j , ˆ u qk (cid:105) ˆ u qk , (90) where η qk = s qk /σ qk is the ratio of the new and old singular values. To see this, observe that we can write the j th column of the matrix √ n · Y q as: QY j = m (cid:88) k =1 σ qk ˆ u qk ˆ v qjk , (91)and so by the orthogonality of ˆ u qk , ˆ v qjk = (cid:104) QY j , ˆ u qk (cid:105) /σ qk . Consequently, when ˆ X Q is obtained from Y q bysingular value shrinkage with singular values s q , . . . , s qr , followed by multiplication with Q − , we obtainformula (90). We now consider the problem of out-of-sample prediction. In Section 5.1, specifically Lemma 5.4, we sawthat when applying the method of shrinkage with whitening, as described in Algorithm 1, each denoisedvector ˆ X j can be written in the form:ˆ X j = r (cid:88) k =1 η k (cid:104) W Y j , ˆ u w k (cid:105) W − ˆ u w k , (92)where ˆ u w1 , . . . , ˆ u w r are the top r left singular vectors of Y w = W Y , and η k are deterministic coefficients.We observe that the expression (92) may be evaluated for any vector Y j , even when it is not one of theoriginal Y , . . . , Y n , so long as we have access to the singular vectors ˆ u w k .To formalize the problem, we suppose we have computed the sample vectors ˆ u w1 , . . . , ˆ u w r based on n observed vectors Y , . . . , Y n , which we will call the in-sample observations. That is, the ˆ u w k are the topleft singular vectors of the whitened matrix Y w = [ Y w1 , . . . , Y w n ] / √ n . We now receive a new observation Y = X + ε from the same distribution, which we will refer to as an out-of-sample observation, and ourgoal is to predict the signal X .We will consider predictors of the out-of-sample X of the same form as (92):ˆ X = r (cid:88) k =1 η o k (cid:104) W Y , ˆ u w k (cid:105) W − ˆ u w k . (93)We wish to choose the coefficients η o k to minimize the AMSE, lim n →∞ E (cid:107) ˆ X − X (cid:107) . Remark 15.
We emphasize the difference between the in-sample prediction (92) and the out-of-sample prediction (93), beyond the different coefficients $\eta_k$ and $\eta_k^o$. In (92), the vectors $\hat u_1^w,\dots,\hat u_r^w$ are dependent on the in-sample observations $Y_j$, $1 \le j \le n$, because they are the top $r$ left singular vectors of $Y^w$. However, in (93) they are independent of the out-of-sample observation $Y$, which is drawn independently of $Y_1,\dots,Y_n$. As we will see, it is this difference that necessitates the different choices of coefficients $\eta_k$ and $\eta_k^o$ for the two problems.

In this section, we prove the following result comparing optimal out-of-sample prediction and in-sample prediction. Specifically, we derive the explicit formulas for the optimal out-of-sample coefficients $\eta_k^o$ and the in-sample coefficients $\eta_k$; show that the coefficients are not equal; and show that the AMSEs for both problems are nevertheless identical. Throughout this section, we assume the conditions and notation of Theorem 3.1.

Theorem 6.1. Suppose $Y_1,\dots,Y_n$ are drawn iid from the spiked model, $Y_j = X_j + \varepsilon_j$, and $\hat u_1^w,\dots,\hat u_r^w$ are the top $r$ left singular vectors of $Y^w$. Suppose $Y = X + \varepsilon$ is another sample from the same spiked model, drawn independently of $Y_1,\dots,Y_n$. Then the following results hold:

1. The optimal in-sample coefficients $\eta_k$ are given by:
$\eta_k = \dfrac{(c_k^w)^2}{(c_k^w)^2 + (s_k^w)^2\,\mu_\varepsilon\,\tau_k}\cdot\dfrac{\ell_k^w}{\ell_k^w + 1}$.  (94)

2. The optimal out-of-sample coefficients $\eta_k^o$ are given by:
$\eta_k^o = \dfrac{(c_k^w)^2}{(c_k^w)^2 + (s_k^w)^2\,\mu_\varepsilon\,\tau_k}\cdot\dfrac{\ell_k^w}{\ell_k^w (c_k^w)^2 + 1}$.  (95)

3. The AMSEs for in-sample and out-of-sample prediction are identical, and equal to:
$\mathrm{AMSE} = \sum_{k=1}^r \Big(\dfrac{\ell_k^w}{\tau_k} - \dfrac{(\ell_k^w)^2 (c_k^w)^4}{\ell_k^w (c_k^w)^2 + 1}\cdot\dfrac{\alpha_k}{\tau_k}\Big)$,  (96)
where $\alpha_k = \big((c_k^w)^2 + (s_k^w)^2\mu_\varepsilon\tau_k\big)^{-1}$. Remark 16.
To be clear, denoising each in-sample observation $Y_1,\dots,Y_n$ by applying (92) with $\eta_k$ defined by (94) is identical to denoising $Y_1,\dots,Y_n$ by the singular value shrinkage with whitening described in Algorithm 1. We derive this alternate form only to show that the coefficients $\eta_k$ are different from the optimal out-of-sample coefficients $\eta_k^o$ to be used when $Y$ is independent of the $\hat u_k^w$. Remark 17. Theorem 6.1 extends the analogous result from [19], which was restricted to the standard spiked model with white noise. The proof of Theorem 6.1 may be found in Appendix C. In Algorithm 3, we summarize the optimal out-of-sample prediction method, with the optimal coefficients derived in Theorem 6.1.
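As a rough numerical illustration of (94) and (95), and of Algorithm 3 below, the following sketch computes the two coefficients from the whitened spike energy $\ell_k^w$, the squared cosine $(c_k^w)^2$, $\mu_\varepsilon$ and $\tau_k$; the function name and the sample inputs are assumptions of this sketch, not quantities prescribed by the paper.

```python
def prediction_coefficients(ell_w, c2, mu_eps, tau):
    """In-sample coefficient eta_k of (94) and out-of-sample coefficient eta_o_k of (95),
    given ell_w = ell_k^w, c2 = (c_k^w)^2, mu_eps = tr(Sigma_eps)/p and tau = tau_k."""
    s2 = 1.0 - c2
    alpha = 1.0 / (c2 + s2 * mu_eps * tau)
    eta_in = alpha * ell_w * c2 / (ell_w + 1.0)        # equation (94)
    eta_out = alpha * ell_w * c2 / (ell_w * c2 + 1.0)  # equation (95)
    return eta_in, eta_out

# A strong spike (c2 near 1) gives nearly equal coefficients; a weak spike does not.
print(prediction_coefficients(ell_w=10.0, c2=0.95, mu_eps=1.0, tau=1.0))
print(prediction_coefficients(ell_w=1.5, c2=0.55, mu_eps=1.0, tau=1.0))
```

Since $(c_k^w)^2 \le 1$, the out-of-sample coefficient is never smaller than the in-sample one, reflecting the fact that $\hat u_k^w$ is independent of the new observation.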
Algorithm 3
Optimal out-of-sample prediction
Input: $Y$; $\hat u_1^w,\dots,\hat u_r^w$; $\sigma_1^w,\dots,\sigma_r^w$
for all $k = 1,\dots,r$ do
  if $\sigma_k^w > 1 + \sqrt{\gamma}$ then
    $\ell_k^w = \big[(\sigma_k^w)^2 - 1 - \gamma + \sqrt{((\sigma_k^w)^2 - 1 - \gamma)^2 - 4\gamma}\big]\big/2$
    $c_k^w = \sqrt{\big(1 - \gamma/(\ell_k^w)^2\big)\big/\big(1 + \gamma/\ell_k^w\big)}$
    $s_k^w = \sqrt{1 - (c_k^w)^2}$
    $\mu_\varepsilon = \mathrm{tr}(\Sigma_\varepsilon)/p$
    $\tau_k = (c_k^w)^2\big/\big[\|\Sigma_\varepsilon^{1/2}\hat u_k^w\|^2 - (s_k^w)^2\mu_\varepsilon\big]$
    $\alpha_k = 1\big/\big((c_k^w)^2 + (s_k^w)^2\mu_\varepsilon\tau_k\big)$
    $\eta_k^o = \alpha_k\,\ell_k^w (c_k^w)^2\big/\big(\ell_k^w (c_k^w)^2 + 1\big)$
  else if $\sigma_k^w \le 1 + \sqrt{\gamma}$ then
    $\eta_k^o = 0$
Output: $\hat X = \sum_{k=1}^r \eta_k^o\,\langle W Y, \hat u_k^w\rangle\, W^{-1}\hat u_k^w$

In this section, we focus on the task of principal component analysis (PCA), or the estimation of the principal components $u_1,\dots,u_r$ of the signal $X_j$, and their span. Specifically, we assess the quality of the empirical PCs $\hat u_1,\dots,\hat u_r$ defined in (8). The reader may recall that these are constructed by whitening the observed vectors $Y_j$ to produce $Y_j^w$; computing the top $r$ left singular vectors of the $Y_j^w$; and unwhitening and normalizing.

We first observe that in the classical regime $\gamma \to 0$, the angle between the subspaces $\mathrm{span}\{\hat u_1,\dots,\hat u_r\}$ and $\mathrm{span}\{u_1,\dots,u_r\}$ converges to 0 almost surely; we recall that the sine of the angle between subspaces $\mathcal{A}$ and $\mathcal{B}$ of $\mathbb{R}^p$ is defined by
$\sin\Theta(\mathcal{A},\mathcal{B}) = \|A_\perp^\top B\|_{\mathrm{op}}$,  (97)
where $A_\perp$ and $B$ are matrices whose columns are orthonormal bases of $\mathcal{A}^\perp$ and $\mathcal{B}$, respectively. Proposition 7.1.
Suppose Y , . . . , Y n are drawn from the spiked model, Y j = X j + ε j . Let U = span { u , . . . , u r } be the span of the population PCs, and ˆ U = span { ˆ u , . . . , ˆ u r } be the span of the empiricalPCs. Then lim n → sin Θ( U , ˆ U ) = 0 , (98) where the limit holds almost surely as n → ∞ and p/n → . The proof of Proposition 7.1 may be found in Appendix D.Proposition 7.1 shows consistency of principal subspace estimation in the classical regime. We askwhat happens in the high-dimensional setting γ >
0, where we typically do not expect to be able to haveconsistent estimation of the principal subspace. Our task here is to show that whitening will still improveestimation. To that end, in Section 7.1, we will show that under a uniform prior on the population PCs u k , whitening improves estimation of the PCs. In Section 7.2, we will derive a bound on the error ofestimating the principal subspace span { u , . . . , u r } , under condition (19); we will show that the errorrate matches the optimal rate of the estimator in [58]. Finally, in Section 7.3 we will complement theseresults by showing that under the uniform prior, whitening improves a natural signal-to-noise ratio. In this section, we consider the effect of whitening on estimating the PCs u , . . . , u r . More precisely, wecontrast two estimators of the u k . On the one hand, we shall denote by ˆ u (cid:48) , . . . ˆ u (cid:48) r the left singular vectorsof the raw data matrix Y , without applying any weighting matrix. On the other hand, we considerthe vectors ˆ u , . . . , ˆ u r obtained by whitening, taking the top singular vectors of Y w , unwhitening, andnormalizing, as expressed by formula (8).We claim that “generically”, the vectors ˆ u , . . . , ˆ u r are superior estimators of u , . . . , u r . By “generi-cally”, we mean when we impose a uniform prior over the population PCs u , . . . , u r ; that is, we assumethe u k are themselves random, drawn uniformly from the sphere in R p and orthogonalized. This isprecisely the “orthonormalized model” considered in [10].We set τ = lim p →∞ tr(Σ − ε ) /p , assuming this limit exists; and let ϕ = τ · µ ε . By Jensen’s inequality, ϕ ≥
1, with strict inequality so long as Σ ε is not a multiple of the identity. Theorem 7.2.
Suppose Σ ε has a finite number of distinct eigenvalues, each occurring with a fixedproportion as p → ∞ . Suppose too that u , . . . , u r are uniformly random orthonormal vectors in R p . Let ˆ u (cid:48) , . . . , ˆ u (cid:48) r be the left singular vectors of Y , and ˆ u , . . . , ˆ u r be the empirical PCs defined by (8) . Thenwith probability approaching as n → ∞ and p/n → γ > , |(cid:104) ˆ u (cid:48) k , u k (cid:105)| ≤ R ( ϕ ) |(cid:104) ˆ u k , u k (cid:105)| , ≤ k ≤ r, (99) where R is decreasing, R (1) = 1 , and R ( ϕ ) < for ϕ > .Furthermore, if ˆ v (cid:48) , . . . , ˆ v (cid:48) r are the right singular vectors of Y , and ˆ v , . . . , ˆ v r are the left singularvectors of Y w , then |(cid:104) ˆ v (cid:48) k , z k (cid:105)| ≤ ˜ R ( ϕ ) |(cid:104) ˆ v k , z k (cid:105)| , ≤ k ≤ r, (100) with probability approaching as n → ∞ and p/n → γ > , where z k = ( z k , . . . , z nk ) (cid:62) / √ n , and where ˜ R is decreasing, ˜ R (1) = 1 , and ˜ R ( ϕ ) < for ϕ > . The proof of Theorem 7.2 may be found in Appendix D. It rests on a result from the recent paper[27], combined with the formula (47) for the asymptotic cosines between ˆ u k and u k . Remark 18.
The definition of τ = tr(Σ − ε ) /p is consistent with our definition of τ k = lim p →∞ (cid:107) W − u w k (cid:107) − from (15). Indeed, since Theorem 7.2 assumes that u , . . . , u r are uniformly random unit vectors, the PCs u w k of X w are asymptotically identical to W u k / (cid:107) W u k (cid:107) , since these vectors are almost surely orthogonalas p → ∞ . Consequently, for each 1 ≤ k ≤ r we have τ k = lim p →∞ (cid:107) W − u w k (cid:107) = lim p →∞ (cid:107) W u k (cid:107) ∼ p tr( W ) = 1 p tr(Σ − p ) ∼ τ. (101) .2 Minimax optimality of the empirical PCs In this section, we consider the question of whether the empirical PCs ˆ u , . . . , ˆ u r can be significantlyimproved upon. In the recent paper [58], an estimator ˆ U of the principal subspace U = span { u , . . . , u r } is proposed that achieves the following error rate: E [sin Θ( ˆ U , U )] ≤ min (cid:40) C √ γ (cid:32) µ / ε + ( r/p ) / (cid:107) Σ ε (cid:107) / min k (cid:96) / k + µ / ε (cid:107) Σ ε (cid:107) / min k (cid:96) k (cid:33) , (cid:41) , (102)where C is a constant dependent on the incoherence of u , . . . , u r , defined by I ( U ) = max ≤ j ≤ p (cid:107) e (cid:62) j U (cid:107) where U = [ u , . . . , u r ] ∈ R p × r . Furthermore, the error rate (102) is shown to be minimax optimal overthe class of models with PCs of bounded incoherence.In this section, we show that when (19) holds, then the empirical PCs ˆ u , . . . , ˆ u r achieve the sameerror rate (102) almost surely in the limit n → ∞ , p/n → γ . More precisely, we show the following: Theorem 7.3.
Assume that the weighted orthogonality condition (19) holds. Suppose that Σ ε is diagonal,and that there is a constant C so that | u jk | ≤ C √ p (103) for all k = 1 , . . . , r and j = 1 , . . . , p . Suppose Y , . . . , Y n are drawn iid from the spiked model. Let ˆ u , . . . , ˆ u r be the estimated PCs from equation (8) , and let ˆ U = span { ˆ u , . . . , ˆ u r } and U = span { u , . . . , u r } .Then almost surely in the limit p/n → γ sin Θ( ˆ U , U ) ≤ min (cid:26) Kγµ ε (cid:18) k (cid:96) k + (cid:107) Σ ε (cid:107) op min k (cid:96) k (cid:19) , (cid:27) , (104) where K is a constant depending only on C from (103) . Remark 19.
Theorem 7.3 shows that in the case $r = 1$, the estimate $\hat u_1$ obtained by whitening $Y$, computing the top left singular vector of $Y^w$, and then unwhitening and normalizing, is asymptotically minimax optimal. When $r > 1$, we require the extra condition (19), which does not appear in the minimax lower bound from [58]. The proof of Theorem 7.3 follows from the formula (47) for the cosines between $u_k$ and $\hat u_k$ from Theorem 3.2. The details are found in Appendix D.

In this section, we define a natural signal-to-noise ratio (SNR) for the spiked model, namely the ratio of operator norms between the signal and noise sample covariances. We show that under the generic model from Section 7.1 for the signal principal components $u_k$, the SNR increases after whitening. We define the SNR by:
$\mathrm{SNR} = \|\hat\Sigma_x\|_{\mathrm{op}} \big/ \|\hat\Sigma_\varepsilon\|_{\mathrm{op}}$,  (105)
where $\hat\Sigma_x = \frac1n\sum_{j=1}^n X_j X_j^\top$ and $\hat\Sigma_\varepsilon = \frac1n\sum_{j=1}^n \varepsilon_j\varepsilon_j^\top$ are the sample covariances of the signal and noise components, respectively (neither of which is observed). After whitening, the observations change into:
$Y_j^w = X_j^w + G_j$,  (106)
and we define the new SNR to be:
$\mathrm{SNR}^w = \|\hat\Sigma_x^w\|_{\mathrm{op}} \big/ \|\hat\Sigma_g\|_{\mathrm{op}}$,  (107)
where $\hat\Sigma_x^w = \frac1n\sum_{j=1}^n X_j^w (X_j^w)^\top$ and $\hat\Sigma_g = \frac1n\sum_{j=1}^n G_j G_j^\top$. As in Section 7.1, let $\tau = \lim_{p\to\infty}\mathrm{tr}(\Sigma_\varepsilon^{-1})/p$ (assuming the limit exists), and define $\varphi = \tau\cdot\mu_\varepsilon$. Note that by Jensen's inequality, $\varphi \ge 1$, with strict inequality unless $\Sigma_\varepsilon = \nu I_p$. We will prove the following:

Proposition 7.4. Suppose the population principal components $u_1,\dots,u_r$ are uniformly random orthonormal vectors in $\mathbb{R}^p$. Then in the limit $p/n \to \gamma > 0$,
$\mathrm{SNR}^w \ge \varphi\,\mathrm{SNR}$.  (108)
In other words, Proposition 7.4 states that for generic signals, whitening increases the operator norm SNR by a factor of at least $\varphi \ge 1$. The proof may be found in Appendix D.
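As a quick numerical illustration of (105)-(108) under the generic model (a sketch with illustrative dimensions and noise variances, not the experiment of Section 8):

```python
import numpy as np

rng = np.random.default_rng(2)
p, n, r = 300, 600, 2
ells = np.array([10.0, 5.0])
nu = np.linspace(0.1, 2.0, p)                            # heteroscedastic noise variances
U = np.linalg.qr(rng.standard_normal((p, r)))[0]         # generic (uniformly random) PCs
X = U @ np.diag(np.sqrt(ells)) @ rng.standard_normal((r, n))   # signal component
E = np.sqrt(nu)[:, None] * rng.standard_normal((p, n))         # noise component

def op_norm_cov(A):
    """Operator norm of the sample covariance (1/n) A A^T."""
    return np.linalg.norm(A @ A.T / A.shape[1], 2)

snr_raw = op_norm_cov(X) / op_norm_cov(E)                # SNR of (105)
W = np.diag(nu ** -0.5)                                  # whitening matrix
snr_whitened = op_norm_cov(W @ X) / op_norm_cov(W @ E)   # SNR^w of (107)
phi = np.mean(1.0 / nu) * np.mean(nu)                    # phi = tau * mu_eps
print(snr_raw, snr_whitened, phi)                        # typically snr_whitened >= phi * snr_raw
```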
Remark 20.
As explained in Remark 18, under the generic model assumed by Proposition 7.4, thenotation τ is consistent with the definition of τ k in (15). Remark 21.
Proposition 7.4 is similar in spirit to a result in [38], which essentially shows that the SNRdefined by the nuclear norms, rather than operator norms, increases after whitening. However, in the p → ∞ limit, defining the SNR using the ratio of nuclear norms is not as meaningful as using operatornorms, because the ratio of nuclear norms always converges to 0 in the high-dimensional limit. Indeed,we have: (cid:107) ˆΣ x (cid:107) ∗ → r (cid:88) k =1 (cid:96) k , (109)almost surely as p, n → ∞ . On the other hand,1 p (cid:107) ˆΣ ε (cid:107) ∗ → µ ε . (110)In particular, (cid:107) ˆΣ ε (cid:107) ∗ grows like p , whereas (cid:107) ˆΣ x (cid:107) ∗ is bounded with p . When p is large, therefore, the normof the noise swamps the norm of the signal. On the other hand, the operator norms of ˆΣ x and ˆΣ ε areboth bounded, and may therefore be comparable in size. In this section we report several numerical results that illustrate the performance of our predictor in thespiked model, as well as several beneficial properties of whitening. Code implementing the shrinkagewith whitening algorithms will be made available online.
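In the meantime, the following is a minimal, self-contained sketch of the in-sample predictor written in the column form (92) with the coefficients (94), using the same plug-in estimates of $\ell_k^w$, $c_k^w$, $\mu_\varepsilon$ and $\tau_k$ as Algorithm 3; the function name, the restriction to a diagonal noise covariance, and the synthetic example are simplifications assumed here, not the authors' released code.

```python
import numpy as np

def shrinkage_with_whitening(Y, Sigma_eps, r):
    """Whitened singular value shrinkage in the column form (92), with the
    in-sample coefficients (94) and the plug-in spectral estimates used in
    Algorithm 3. Assumes a diagonal noise covariance Sigma_eps."""
    p, n = Y.shape
    gamma = p / n
    d = np.diag(Sigma_eps)
    W, W_inv = np.diag(d ** -0.5), np.diag(d ** 0.5)
    WY = W @ Y
    U_w, sig, _ = np.linalg.svd(WY / np.sqrt(n), full_matrices=False)
    mu_eps = np.trace(Sigma_eps) / p
    X_hat = np.zeros_like(Y, dtype=float)
    for k in range(r):
        if sig[k] <= 1 + np.sqrt(gamma):
            continue                                  # below the noise bulk edge
        s2 = sig[k] ** 2
        ell = (s2 - 1 - gamma + np.sqrt((s2 - 1 - gamma) ** 2 - 4 * gamma)) / 2
        c2 = (1 - gamma / ell ** 2) / (1 + gamma / ell)
        u = U_w[:, k]
        tau = c2 / (u @ Sigma_eps @ u - (1 - c2) * mu_eps)
        eta = (ell * c2 / (ell + 1)) / (c2 + (1 - c2) * mu_eps * tau)   # coefficient (94)
        X_hat += eta * np.outer(W_inv @ u, WY.T @ u)
    return X_hat

# Example: denoise data from a rank-2 spiked model with heteroscedastic noise.
rng = np.random.default_rng(4)
p, n = 300, 900
nu = np.linspace(0.25, 2.5, p)
U = np.linalg.qr(rng.standard_normal((p, 2)))[0]
X = U @ np.diag([3.0, 2.0]) @ rng.standard_normal((2, n))
Y = X + np.sqrt(nu)[:, None] * rng.standard_normal((p, n))
X_hat = shrinkage_with_whitening(Y, np.diag(nu), r=2)
print(np.linalg.norm(X_hat - X) / np.linalg.norm(X))
```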
In this experiment, we compared our predictor to the best linear predictor (BLP), defined in equation(82). The BLP is an oracle method, as it requires knowledge of the population covariance Σ x , whichis not accessible to us. However, Theorem 5.1 predicts that as p/n →
0, the optimal shrinkage withwhitening predictor will behave identically to the BLP.In the same experiments, we also compare our method to OptShrink [44], the optimal singular valueshrinker without any transformation. Theorem 5.3 predicts that as p/n →
0, OptShrink will behaveidentically to a suboptimal linear filter.In these these tests, we fixed a dimension equal to p = 100, and let n grow. Each signal was rank 3,with PCs chosen so that the first PC was a completely random unit vector, the second PC was set to zeroon the first p/ p/ z jk were chosen to be Gaussian.The noise covariance matrix Σ ε was generated by taking equally spaced values between 1 and aspecified condition number κ >
1, and then normalizing the resulting vector of eigenvalues to be a unit vector. This normalization was done so that in each test, the total energy of the noise remained constant.

Figure 3 plots the average prediction errors as a function of $n$ for the three methods, for different condition numbers $\kappa$ of the noise covariance $\Sigma_\varepsilon$. The errors are averaged over 500 runs of the experiment, with different draws of signal and noise. As expected, the errors for optimal shrinkage with whitening converge to those of the oracle BLP, while the errors for OptShrink appear to converge to a larger value, namely the error of the limiting suboptimal linear filter. Remark 22. Unlike shrinkage with whitening, OptShrink does not make use of the noise covariance. Though access to the noise covariance would permit faster evaluation of the OptShrink algorithm using, for instance, the methods described in [37], we have found that this does not change the estimation accuracy of the method. Similarly, the BLP uses the true PCs of $X_j$, which are not used by either shrinkage method. The comparison between the methods must be understood in that context.

Figure 3: Prediction errors for the optimal whitened shrinker, the optimal unwhitened shrinker (OptShrink), and the best linear predictor (an oracle method).
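For reproducibility, the noise-covariance construction used above (equally spaced eigenvalues between 1 and $\kappa$, with the eigenvalue vector then normalized to unit norm) might be sketched as follows; the function name and the choice of Euclidean normalization are assumptions of this sketch.

```python
import numpy as np

def experiment_noise_covariance(p, kappa):
    """Diagonal noise covariance with eigenvalues equally spaced between 1 and kappa,
    rescaled so that the vector of eigenvalues has unit (Euclidean) norm, keeping the
    total noise energy comparable across different condition numbers."""
    nu = np.linspace(1.0, kappa, p)
    return np.diag(nu / np.linalg.norm(nu))

Sigma_eps = experiment_noise_covariance(p=100, kappa=10.0)
print(np.diag(Sigma_eps)[0], np.diag(Sigma_eps)[-1])   # smallest and largest noise variances
```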
We examine the performance of optimal shrinkage with whitening for different values of γ and differentcondition numbers of the noise covariance. We compare to OptShrink [44] and the naive shrinker withwhitening employed in [19], which uses singular values (cid:112) (cid:96) w k c w k ˜ c w k ; see Figures 1 and 2 and the associatedtext. This latter shrinker does not account for the change in angle between the singular vectors resultingfrom unwhitening.In each run of the experiment, we fix the dimension p = 1000. We use a diagonal noise covariance witha specified condition number κ , whose entries are linearly spaced between 1 /κ and 1, and increase withthe index. We generate the orthonormal basis of PCs u , u , u from the model described in Section2.1.2, as follows: u is a unifomly random unit vector; u has Gaussian entries with linearly-spacedvariances a , . . . , a p , where a p < a p − < · · · < a , (cid:80) pi =1 a i = 1, and a /a p = 10; and u has Gaussianentries with linearly-spaced variances b , . . . , b p , where b < b < · · · < b p , (cid:80) pi =1 b i = 1, and b p /b = 10.Gram-Schmidt is then performed on u , u , and u to ensure they are orthonormal. For aspect ratio γ ,the three signal singular values are γ / + i/ i = 1 , , n , and hence of γ , we generate 50 draws of the data and record the averagerelative errors for each of the three methods. The results are plotted in Figure 4. As is apparent from thefigures, both whitening methods typically outperform OptShrink. Furthermore, when n is large, bothoptimal shrinkage and naive shrinkage perform very similarly; this makes sense because both methodsconverge to the BLP as n → ∞ . By contrast, when γ is large, the benefits of using the optimal shrinkerover the naive shrinker are more apparent. Remark 23.
As noted in Remark 22, we emphasize that unlike both whitening methods, OptShrinkdoes not make use of the noise covariance, and the comparison between the methods must be understoodin that context. Figure 4: Comparison of whitening with optimal shrinkage; whitening with naive shrinkage; and OptShrink(no whitening), as a function of the noise covariance matrix’s condition number κ .26 .3 Performance of eigenvalue shrinkage We examine the performance of optimal eigenvalue shrinkage with whitening for different values of γ anddifferent condition numbers of the noise covariance. We use nuclear norm loss, for which the optimal ˜ t k in Algorithm 2 is given by the formula ˜ t k = max { (cid:96) k (2 c k − , } . (111)This formula is derived in [21].We compare to two other methods. We consider optimal eigenvalue shrinkage without whitening,where the population eigenvalues and cosines between observed and population eigenvectors are estimatedusing the methods from [44]. We also consider the whitening and eigenvalue shrinkage procedure from[38], which shrinks the eigenvalues to the population values (cid:96) k ; this is an optimal procedure for operatornorm loss [21], but suboptimal for nuclear norm loss.As in Section 8.2, in each run of the experiment, we fix the dimension p = 1000. We use a diagonalnoise covariance with a specified condition number κ , whose entries are linearly spaced between 1 /κ and1, and increase with the index. We generate the orthonormal basis of PCs u , u , u from the modeldescribed in Section 2.1.2, as follows: u is a unifomly random unit vector; u has Gaussian entries withlinearly-spaced variances a , . . . , a p , where a p < a p − < · · · < a , (cid:80) pi =1 a i = 1, and a /a p = 10; and u has Gaussian entries with linearly-spaced variances b , . . . , b p , where b < b < · · · < b p , (cid:80) pi =1 b i = 1,and b p /b = 10. Gram-Schmidt is then performed on u , u , and u to ensure they are orthonormal. Foraspect ratio γ , the three signal singular values are γ / + i , i = 1 , , n , and hence of γ , we generate 50 draws of the data and record the averagerelative errors (cid:107) ˆΣ x − Σ x (cid:107) ∗ / (cid:107) Σ x (cid:107) ∗ for each of the three methods. The results are plotted in Figure 5. Asis apparent from the figures, optimal shrinkage with whitening outperforms the other two methods. Forthe smaller values of γ , optimal shrinkage without whitening outperforms the population shrinker withwhitening when the condition number κ is small, since the benefits of whitening are not large; however,as κ grows, whitening with the suboptimal population shrinker begins to outperform. For larger γ , thecost of using the wrong shrinker outweigh the benefits of whitening, and the population shrinker withwhitening is inferior to both other methods. This illustrates the importance of using a shrinker designedfor the intended loss function. In this section, we numerically illustrate Theorem 7.2 by examining the angles between the spanningvectors ˆ u k (the empirical PCs) and ˆ v k of ˆ X and, respectively, the population vectors u k (the populationPCs) and v k . We show that these angles are smaller (or equivalently, their cosines are larger) than thecorresponding angles between the population u k and v k and the singular vectors of the unwhitened datamatrix Y .Figure 6 plots the cosines as a function of the condition number κ of the noise matrix Σ ε . In thisexperiment, we consider a rank 1 signal model for simplicity, with a uniformly random PC. 
We useddimension p = 500, and drew n = 1000 observations. For each condition number κ of Σ ε , we generateΣ ε as described in Section 8.1. For each test, we average the cosines over 50 runs of the experiment(drawing new signals and new noise each time). Both signal and noise are Gaussian. As we see, thecosines improve dramatically after whitening. As κ grows, i.e., the noise becomes more heteroscedastic,the improvement becomes more pronounced. In many applications, the true noise covariance may not be accessible. In this experiment, we considerthe effect of estimating the noise covariance by the sample covariance from n (cid:48) iid samples of pure noise, ε , . . . , ε n (cid:48) , as n (cid:48) grows.We fix the dimension p = 500 and number of signal-plus-noise observations n = 625, and r = 2signal singular values 3 and 5. We take the noise covariance to have condition number κ = 500, witheigenvalues equispaced between 1 /
100 and 1 /
5. The eigenvectors of the noise covariance are drawnuniformly at random.For increasing values of n (cid:48) ≥ p , we draw n (cid:48) iid realizations of the noise ε , . . . , ε n (cid:48) , and form the Figure 5: Comparison of whitening with optimal shrinkage; whitening with naive shrinkage; and OptShrink(no whitening), as a function of the noise covariance matrix’s condition number κ . Figure 6: Comparison of the cosines between the empirical and population singular vectors, for the raw dataand the whitened data, as a function of the noise covariance matrix’s condition number κ .28 Figure 7: Comparison of the errors when using the true noise covariance Σ ε and the sample noise covarianceˆΣ ε estimated from n (cid:48) samples. sample covariance: ˆΣ ε = 1 n (cid:48) n (cid:48) (cid:88) i =1 ε i ε (cid:62) i . (112)For each n (cid:48) , we perform Algorithm 1 using the sample covariance ˆΣ ε . The experiment is repeated 2000times for each value of n (cid:48) , and the errors averaged over these 2000 runs. Figure 7 plots the averageerror as a function of n (cid:48) . We also apply Algorithm 1 using the true noise covariance Σ ε , and plot theaverage error (which does not depend on n (cid:48) ) in Figure 7 as well. The error when using the estimatedcovariance converges to the error when using the true covariance, indicating that Algorithm 1 is robustto estimation of the covariance. In this experiment, we test the accuracy of the error formula (96). There are three distinct quantities thatwe define. The first is the oracle AMSE, which we define from the known population parameters. Thesecond is the estimated AMSE, which we will denote by (cid:92)
AMSE; this is estimated using the observations Y , . . . , Y n themselves. The third is the mean-squared error itself, (cid:107) ˆ X − X (cid:107) /n . Of the three quantities,only (cid:92) AMSE would be directly observed in practice. We define the discrepancy between AMSE and (cid:107) ˆ X − X (cid:107) /n as | AMSE − (cid:107) ˆ X − X (cid:107) /n | , and the discrepancy between (cid:92) AMSE and (cid:107) ˆ X − X (cid:107) as | (cid:92) AMSE −(cid:107) ˆ X − X (cid:107) /n | .Figure 8 plots the log discrepancies against log ( p ). We also include a table of the values themselves.In all experiments, we use the following parameters: the aspect ratio is γ = 0 .
8, the rank r = 2, thesignal singular values are 3 and 2, u is (cid:112) /p on entries 1 , . . . , p/ u is (cid:112) /p on entries p/ , . . . , p and 0 elsewhere, and the noise covariance is diagonal with variances linearly spaced from1 /
200 to 3 /
2, increasing with the coordinates.We make two observations. First, the slope of each plot is approximately 0 .
5, indicating that theerror formulas derived are accurate with error O ( n − / ). This is precisely the rate we expect from [8].Second, the discrepancies of AMSE and (cid:92) AMSE are very close, and in fact the discrepancy of (cid:92)
AMSE isslightly smaller than that of AMSE. This indicates that the observed (cid:92)
AMSE provides a viable estimatefor the actual error (cid:107) ˆ X − X (cid:107) /n . In this next experiment, we compare the performance of in-sample and out-of-sample prediction, asdescribed in Section 6. Optimal in-sample prediction is identical to performing optimal singular valueshrinkage with noise whitening to the in-sample data Y , . . . , Y n . For out-of-sample prediction, we usethe expression of the form (93) with the optimal coefficients η o k from Proposition 6.1. Figure 8: Logarithm of the discrepancies | AMSE − (cid:107) ˆ X − X (cid:107) /n | and | (cid:92) AMSE − (cid:107) ˆ X − X (cid:107) /n | , versus log ( p ).AMSE is the oracle value of the error, and (cid:92) AMSE is estimated from the data itself.log ( p ) Discrepancy, AMSE Discrepancy, (cid:92) AMSE7 1.49e-01 1.40e-018 1.04e-01 9.82e-029 7.31e-02 6.90e-0210 5.17e-02 4.89e-0211 3.62e-02 3.41e-0212 2.56e-02 2.42e-0213 1.84e-02 1.74e-02Table 3: Discrepancies | AMSE − (cid:107) ˆ X − X (cid:107) /n | and | (cid:92) AMSE − (cid:107) ˆ X − X (cid:107) /n | . AMSE is the oracle value ofthe error, and (cid:92) AMSE is estimated from the data itself.
We ran the following experiments. For a fixed dimension $p$, we generated a random value of $n > p$. We then chose three random PCs from the same model described in Section 8.1, and we generated pools of $n$ in-sample and out-of-sample observations. We performed optimal shrinkage with whitening on the in-sample observations, and applied the out-of-sample prediction to the out-of-sample data using the vectors $\hat u_k^w$ computed from the in-sample data. We then computed the MSEs for the in-sample and out-of-sample data matrices. This whole procedure was repeated 2000 times.

Figure 9 shows scatterplots of the in-sample and out-of-sample prediction errors for $p = 50$ and $p = 500$. In both plots, we see that there is not a substantial difference between the in-sample and out-of-sample prediction errors, validating the asymptotic prediction made by Theorem 6.1. Even for the low dimension $p = 50$ there is very close agreement between the performances, and for $p = 500$ they perform nearly identically.

In this experiment, we show that whitening improves signal detection. We generated data from a rank 1 model with a weak signal. We computed all the singular values of the original data matrix $Y$ and of the whitened matrix $Y^w$. Figure 10 plots the top 20 singular values for each matrix. It is apparent from the comparison of these figures that the top singular value of the whitened matrix pops out from the bulk of noise singular values, making detection of the signal component very easy in this case. By contrast, the top singular value of the raw, unwhitened matrix $Y$ does not stick out from the bulk. Proposition 7.4 would lead us to expect this type of behavior, since the signal matrix increases in strength relative to the noise matrix.

Figure 9: Comparison of in-sample and out-of-sample denoising for $p = 50$ and $p = 500$.

Figure 10: The top 20 empirical singular values of the raw data matrix $Y$ and the whitened data matrix $Y^w$, for a rank 1 signal.

The theory we have derived relies on the orthogonal invariance of the noise matrix $G$. In this experiment, we study the agreement between the theoretically predicted values of $c_k$ and $\tilde c_k$ and the observed values for finite $n$ and $p$ and non-Gaussian noise. For different values of $n$ we generated rank 1 signal matrices whose row dimension is a fixed fraction of $n$, with top PC $u$ having all entries equal to $1/\sqrt{p}$, the coefficients $z_j$ Gaussian, and signal energy $\ell = 1$. We generated a noise matrix whose entries have mean 0 and variance 1, drawn iid from a specified distribution. We then colored the noise matrix by multiplying it by $\Sigma_\varepsilon^{1/2} = \mathrm{diag}(\sqrt{\nu_1},\dots,\sqrt{\nu_p})$, where $\nu_1,\dots,\nu_p$ are linearly spaced with $\nu_p = 1$. We considered four different distributions for the entries of $G$: the Gaussian distribution; the Rademacher distribution; and the Student $t$ distributions with 10 and 3 degrees of freedom (normalized to have variance 1). For each distribution, we drew signal/noise pairs and computed the absolute value of the cosines between the topmost left and right singular vectors of the observed matrix and the left and right singular vectors of the signal matrix. We then computed the average absolute difference (the discrepancy) between the observed cosines and the theoretically predicted values $c_1$ and $\tilde c_1$ from Section 3.
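A condensed sketch of this comparison (for the whitened model, one spike, and only two of the four distributions) is given below; the constants that are not recoverable from the text (the aspect ratio and the variance range) and the function name are illustrative assumptions, not the exact values used in the experiment.

```python
import numpy as np

def whitened_cosine_discrepancy(n, noise_sampler, rng, ell=1.0):
    """Discrepancy between the observed cosine |<u^w, u_hat^w>| and its asymptotic
    value for a rank-1 spike in colored noise, after whitening by the known noise
    covariance. Illustrative constants: p = n // 2, variances linearly spaced up to 1."""
    p = n // 2
    gamma = p / n
    u = np.ones(p) / np.sqrt(p)                       # delocalized population PC
    nu = np.linspace(0.5, 1.0, p)                     # noise variances (illustrative)
    W = np.diag(nu ** -0.5)
    X = np.sqrt(ell) * np.outer(u, rng.standard_normal(n))
    E = np.sqrt(nu)[:, None] * noise_sampler((p, n))  # colored noise
    Yw = W @ (X + E) / np.sqrt(n)
    u_hat_w = np.linalg.svd(Yw, full_matrices=False)[0][:, 0]
    u_w = W @ u / np.linalg.norm(W @ u)
    ell_w = ell * np.linalg.norm(W @ u) ** 2          # whitened spike energy, cf. (129)
    c_w = np.sqrt((1 - gamma / ell_w ** 2) / (1 + gamma / ell_w))  # asymptotic cosine
    return abs(u_w @ u_hat_w) - c_w

rng = np.random.default_rng(3)
samplers = {"Gaussian": rng.standard_normal,
            "Rademacher": lambda size: 2.0 * rng.integers(0, 2, size) - 1.0}
for name, sampler in samplers.items():
    d = np.mean([abs(whitened_cosine_discrepancy(1000, sampler, rng)) for _ in range(20)])
    print(name, d)
```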
The errors are averaged over 20000 runs. Table 4 contains the average discrepancies for $c_1$, and Table 5 contains the average discrepancies for $\tilde c_1$, both for $n = 1000, 2000, 4000, 8000$.

n      Gaussian    Rademacher   t, df=10    t, df=3
1000   8.173e-03   8.009e-03    8.147e-03   2.584e-01
2000   5.742e-03   5.794e-03    5.750e-03   3.610e-01
4000   4.069e-03   4.073e-03    4.071e-03   4.730e-01
8000   2.896e-03   2.933e-03    2.897e-03   5.866e-01

Table 4: Average discrepancies between $c_1$ and $|\langle u, \hat u\rangle|$.

n      Gaussian    Rademacher   t, df=10    t, df=3
1000   3.627e-03   3.625e-03    3.650e-03   2.598e-01
2000   2.704e-03   2.707e-03    2.712e-03   3.708e-01
4000   1.951e-03   1.939e-03    1.952e-03   4.895e-01
8000   1.409e-03   1.388e-03    1.410e-03   6.112e-01

Table 5: Average discrepancies between $\tilde c_1$ and $|\langle v, \hat v\rangle|$.

For the Gaussian, Rademacher, and $t$ (10 degrees of freedom) distributions, the observed cosines approach the asymptotic values at a rate of roughly $O(n^{-1/2})$. By contrast, for the $t$ distribution with only 3 degrees of freedom, there is substantial discrepancy between the theoretical and observed cosines, and the discrepancies do not decrease with $n$ (in fact, they grow). These numerical results suggest that for noise distributions with sufficiently many finite moments, the observed cosines behave approximately as they do under Gaussian noise, which in turn suggests that the limiting cosine values we have derived for Gaussian noise may hold for more general distributions.

We have derived optimal spectral shrinkers for signal prediction and covariance estimation in the spiked model with heteroscedastic noise, where the data is whitened before shrinkage and unwhitened after shrinkage. We also showed that in the classical $\gamma \to 0$ regime, singular value shrinkage with whitening converges to the best linear predictor, and that the estimated PCs achieve the asymptotic minimax rate when $r = 1$. We showed that the operator norm SNR of the observations increases after whitening. We also extended the analysis of out-of-sample prediction found in [19] to the whitening procedure.

There are a number of interesting directions for future research. First, we plan to revisit previous works that have employed similar shrinkage-plus-whitening procedures, but with the optimal shrinkers we have derived. It is of interest to determine how much of an improvement is achieved with the more principled choice we have presented.

As our current analysis is restricted to the setting of Gaussian noise, in future work we will try to extend the analysis to more general noise matrices. This likely requires a deeper understanding of the distribution of the projection of the empirical singular vectors onto the orthogonal complement of the population signal vectors in the setting of non-Gaussian noise.

While we have shown that whitening can improve subspace estimation generically, and matches the error rate (up to a constant) of [58], it is not clear whether whitening is the optimal transformation for subspace estimation. In a different but closely related model to the one we have studied, where the noise variances differ across observations rather than across coordinates, it was found that certain weighting schemes can outperform whitening [28]. We note too that if the matrix $\Sigma_\varepsilon$ is ill-conditioned, numerical instabilities may result from the whitening and unwhitening operations.

Finally, it is also of interest to better understand the procedure when the noise covariance $\Sigma_\varepsilon$ is not known exactly, but must be estimated. This is a subject currently under investigation.

Acknowledgements
The authors would like to thank Edgar Dobriban, Matan Gavish, and Amit Singer for stimulating discussions related to this work. William Leeb acknowledges support from the Simons Foundation Collaboration on Algorithms and Geometry, the NSF BIGDATA program IIS 1837992, and BSF award 2018230. Elad Romanov acknowledges support from Israeli Science Foundation grant number 1523/16.

References

[1] Joakim Andén and Amit Singer. Factor analysis for spectral estimation. In
Sampling Theory andApplications (SampTA), 2017 International Conference on , pages 169–173. IEEE, 2017.[2] Joakim And´en and Amit Singer. Structural variability from noisy tomographic projections.
SIAMJournal on Imaging Sciences , 11(2):1441–1492, 2018.[3] Theodore Wilbur Anderson. Estimating linear statistical relationships.
Annals of Statistics , 12(1):1–45, 03 1984.[4] Theodore Wilbur Anderson.
An Introduction to Multivariate Statistical Analysis . Wiley Series inProbability and Statistics. Wiley, 2003.[5] Zhidong Bai and Jack W. Silverstein.
Spectral analysis of large dimensional random matrices .Springer Series in Statistics. Springer, 2009.[6] Zhidong Bai and Jian-feng Yao. Central limit theorems for eigenvalues in a spiked population model.
Annales de l’Institut Henri Poincar´e, Probabilit´es et Statistiques , 44(3):447–474, 2008.[7] Jinho Baik and Jack W. Silverstein. Eigenvalues of large sample covariance matrices of spikedpopulation models.
Journal of Multivariate Analysis , 97(6):1382–1408, 2006.[8] Zhigang Bao, Xiucai Ding, and Ke Wang. Singular vector and singular subspace distribution for thematrix denoising model. arXiv preprint arXiv:1809.10476 , 2018.[9] Florent Benaych-Georges, Alice Guionnet, and Myl´ene Maida. Fluctuations of the extreme eigenval-ues of finite rank deformations of random matrices.
Electronic Journal of Probability , 16:1621–1662,2011.[10] Florent Benaych-Georges and Raj Rao Nadakuditi. The singular values and vectors of low rankperturbations of large rectangular random matrices.
Journal of Multivariate Analysis , 111:120–135,2012.[11] Tamir Bendory, Alberto Bartesaghi, and Amit Singer. Single-particle cryo-electron microscopy:Mathematical theory, computational challenges, and opportunities.
IEEE Signal Processing Maga-zine , 37(2):58–76, 2020.[12] Tejal Bhamre, Teng Zhang, and Amit Singer. Denoising and covariance estimation of single particlecryo-EM images.
Journal of Structural Biology , 195(1):72–81, 2016.[13] Peter J Bickel, Elizaveta Levina, et al. Covariance regularization by thresholding.
The Annals ofStatistics , 36(6):2577–2604, 2008.[14] Timothy A. Brown.
Confirmatory factor analysis for applied research . Guilford Publications, 2014.[15] Andreas Buja and Nermin Eyuboglu. Remarks on parallel analysis.
Multivariate Behavioral Re-search , 27(4):509–540, 1992.[16] T. Tony Cai, Zhao Ren, and Harrison H. Zhou. Optimal rates of convergence for estimating Toeplitzcovariance matrices.
Probability Theory and Related Fields , 156(1-2):101–143, 2013.[17] Lucilio Cordero-Grande, Daan Christiaens, Jana Hutter, Anthony N. Price, and Jo V. Hajnal.Complex diffusion-weighted image estimation via matrix recovery under general noise models.
Neu-roImage , 200:391–404, 2019.[18] Edgar Dobriban. Permutation methods for factor analysis and PCA.
Annals of Statistics , to appear.[19] Edgar Dobriban, William Leeb, and Amit Singer. Optimal prediction in the linearly transformedspiked model.
Annals of Statistics , 48(1):491–513, 2020.[20] Edgar Dobriban and Art B. Owen. Deterministic parallel analysis: an improved method for selectingfactors and principal components.
Journal of the Royal Statistical Society: Series B (StatisticalMethodology) , 81(1):163–183, 2019.[21] David L. Donoho, Matan Gavish, and Iain M Johnstone. Optimal shrinkage of eigenvalues in thespiked covariance model.
Annals of Statistics , 46(6), 2018.[22] David L. Donoho and Behrooz Ghorbani. Optimal covariance estimation for condition number lossin the spiked model. arXiv preprint arXiv:1810.07403 , 2018.
[23] Matan Gavish and David L. Donoho. Minimax risk of matrix denoising by singular value thresholding.
The Annals of Statistics, 42(6):2413–2440, 2014. [24] Matan Gavish and David L. Donoho. The optimal hard threshold for singular values is 4/√3. IEEE Transactions on Information Theory, 60(8):5040–5053, 2014. [25] Matan Gavish and David L. Donoho. Optimal shrinkage of singular values.
IEEE Transactions onInformation Theory , 63(4):2137–2152, 2017.[26] David Hong, Laura Balzano, and Jeffrey A. Fessler. Towards a theoretical analysis of PCA for het-eroscedastic data. In ,pages 496–503. IEEE, 2016.[27] David Hong, Laura Balzano, and Jeffrey A. Fessler. Asymptotic performance of PCA for high-dimensional heteroscedastic data.
Journal of Multivariate Analysis , 2018.[28] David Hong, Jeffrey A. Fessler, and Laura Balzano. Optimally weighted PCA for high-dimensionalheteroscedastic data. arXiv preprint arXiv:1810.12862 , 2018.[29] John L. Horn. A rationale and test for the number of factors in factor analysis.
Psychometrika ,30(2):179–185, 1965.[30] J. Edward Jackson.
A User’s Guide to Principal Components , volume 587. John Wiley & Sons,2005.[31] Iain M Johnstone. On the distribution of the largest eigenvalue in principal components analysis.
Annals of Statistics , 29(2):295–327, 2001.[32] Ian Jolliffe.
Principal component analysis . Wiley Online Library, 2002.[33] Julie Josse and Fran¸cois Husson. Selecting the number of components in principal component anal-ysis using cross-validation approximations.
Computational Statistics & Data Analysis , 56(6):1869–1879, 2012.[34] Hamid Krim and Mats Viberg. Two decades of array signal processing research: the parametricapproach.
IEEE signal processing magazine , 13(4):67–94, 1996.[35] Shira Kritchman and Boaz Nadler. Determining the number of components in a factor model fromlimited noisy data.
Chemometrics and Intelligent Laboratory Systems , 94(1):19–32, 2008.[36] William Leeb. Optimal singular value shrinkage for operator norm loss. arXiv preprintarXiv:2005.11807 , 2020.[37] William Leeb. Rapid evaluation of the spectral signal detection threshold and Stieltjes transform. arXiv preprint arXiv:1904.11665 , 2020.[38] Lydia T. Liu, Edgar Dobriban, and Amit Singer. e PCA: High dimensional exponential family PCA.
The Annals of Applied Statistics , 12(4):2121–2150, 2018.[39] Charles F. Van Loan. Generalizing the singular value decomposition.
SIAM Journal on NumericalAnalysis , 13(1):76–83, 1976.[40] Torben E. Lund, Kristoffer H. Madsen, Karam Sidaros, Wen-Lin Luo, and Thomas E. Nichols.Non-white noise in fMRI: does modelling have an impact?
Neuroimage , 29(1):54–66, 2006.[41] D. J. C. MacKay. Deconvolution. In
Information Theory, Inference and Learning Algorithms , pages550–551. Cambridge University Press, Camridge, UK, 2004.[42] Vladimir Alexandrovich Marchenko and Leonid Andreevich Pastur. Distribution of eigenvalues forsome sets of random matrices.
Matematicheskii Sbornik , 114(4):507–536, 1967.[43] Brian E. Moore, Raj Rao Nadakuditi, and Jeffrey A. Fessler. Improved robust PCA using low-rankdenoising with optimal singular value shrinkage. In
Statistical Signal Processing (SSP), 2014 IEEEWorkshop on . IEEE, 2014.[44] Raj Rao Nadakuditi. OptShrink: An algorithm for improved low-rank signal matrix denoising by op-timal, data-driven singular value shrinkage.
IEEE Transactions on Information Theory , 60(5):3002–3018, 2014.[45] Raj Rao Nadakuditi and Jack W. Silverstein. Fundamental limit of sample generalized eigenvaluebased detection of signals in noise using relatively few signal-bearing and noise-only samples.
IEEEJournal of Selected Topics in Signal Processing , 4(3):468–480, 2010.[46] Debashis Paul. Asymptotics of sample eigenstructure for a large dimensional spiked covariancemodel.
Statistica Sinica , 17(4):1617–1642, 2007.
[47] Mark J. Schervish. A review of multivariate analysis.
Statistical Science , 2(4):396–413, 1987.[48] Andrey A. Shabalin and Andrew B. Nobel. Reconstruction of a low-rank matrix in the presence ofGaussian noise.
Journal of Multivariate Analysis , 118:67–76, 2013.[49] Jack W. Silverstein. Strong convergence of the empirical distribution of eigenvalues of large dimen-sional random matrices.
Journal of Multivariate Analysis , 55:331–339, 1995.[50] Jack W. Silverstein and Zhidong Bai. On the empirical distribution of eigenvalues of a class of largedimensional random matrices.
Journal of Multivariate Analysis , 54:175–192, 1995.[51] Charles M. Stein. Some problems in multivariate analysis. Technical report, Stanford UniversityStatistics Department, 1956.[52] Charles M. Stein. Lectures on the theory of estimation of many parameters.
Journal of SovietMathematics , 74(5), 1986.[53] Roman Vershynin. Introduction to the non-asymptotic analysis of random matrices. arXiv preprintarXiv:1011.3027 , 2010.[54] Svante Wold, Kim Esbensen, and Paul Geladi. Principal component analysis.
Chemometrics andIntelligent Laboratory Systems , 2(1–3):37–52, 1987.[55] Luc Wouters, Hinrich W. G¨ohlmann, Luc Bijnens, Stefan U. Kass, Geert Molenberghs, and Paul J.Lewi. Graphical exploration of gene expression data: a comparative study of three multivariatemethods.
Biometrics , 59(4):1131–1139, 2003.[56] Ming Yuan. High dimensional inverse covariance matrix estimation via linear programming.
TheJournal of Machine Learning Research , 11:2261–2286, 2010.[57] H. Henry Yue and Masayuki Tomoyasu. Weighted principal component analysis and its applicationsto improve FDC performance. In
Decision and Control, 43rd IEEE Conference on , volume 4, pages4262–4267. IEEE, 2004.[58] Anru Zhang, T. Tony Cai, and Yihong Wu. Heteroskedastic PCA: Algorithm, optimality, andapplications. arXiv preprint arXiv:1810.08316 , 2018.
A Proof from Section 3
A.1 Proof of Theorem 3.1
We begin by recalling the result that describes the asymptotics of the spiked model with white noise.This result can be found in [46, 10]. We immediately obtain parts 1 and 4 of Theorem 3.1.
Theorem A.1. If p/n → γ > as n → ∞ , the k th largest singular value of Y w converges almost surelyto σ w k = (cid:114) ( (cid:96) w k + 1) (cid:16) γ(cid:96) w k (cid:17) if (cid:96) w k > √ γ √ γ otherwise . (113) Furthermore, for ≤ j, k ≤ r : (cid:104) u w j , ˆ u w k (cid:105) → (cid:40) ( c w k ) , if j = k and (cid:96) w k > √ γ , otherwise (114) and (cid:104) v w j , ˆ v w k (cid:105) → (cid:40) (˜ c w k ) , if j = k and (cid:96) w k > √ γ , otherwise (115) where the limits hold almost surely as p, n → ∞ and p/n → γ . We now turn to proving parts 2 and 3. Let W = span { u w1 , . . . , u w r } be the r -dimensional subspacespanned by the whitened population PCs (the left singular vectors of X w ). For fixed n and p , writeˆ u w k = c w k w w k + s w k ˜ u w k , (116) here ( c kk ) + ( s kk ) = 1, and w w k ∈ W , and ˜ u w k ⊥ W are unit vectors. Because the whitened noise matrixis Gaussian, and hence orthogonally invariant, the vector ˜ u w k is uniformly distributed over the unit spherein W ⊥ . Since the dimension of W is fixed, it follows immediately from Proposition 6.2 in [9] that for anyunit vector x ∈ R p independent of ˜ u w k , the following limits hold almost surely:lim p →∞ (˜ u w k ) (cid:62) x = 0 , (117)and lim p →∞ (cid:110) (˜ u w k ) (cid:62) A ˜ u w k − µ a (cid:111) = lim p →∞ (cid:26) (˜ u w k ) (cid:62) A ˜ u w k − p tr( A ) (cid:27) = 0 . (118)From Theorem A.1, we know | ( w w k ) (cid:62) u w k | → w w k ) (cid:62) u w j → j (cid:54) = k ; and c w k → c w k almost surely. Consequently, we can writeˆ u w k = c w k u w k + s w k ˜ u w k + ψ (119)where (cid:107) ψ (cid:107) → p → ∞ . The inner product of ψ with any vectors of bounded norm willtherefore also converge to 0. As a short-hand, we will write:ˆ u w k ∼ c w k u w k + s w k ˜ u w k , (120)to indicate that the norm of the difference of the two sides converges to 0 almost surely as p → ∞ .From (120) we have: A / ˆ u w k ∼ c w k A / u w k + s w k A / ˜ u w k . (121)Taking the squared norm of each side of (121) and using (117) and (118), we obtain: (cid:107) A / ˆ u w k (cid:107) ∼ ( c w k ) (cid:107) A / u w k (cid:107) + ( s w k ) (cid:107) A / ˜ u w k (cid:107) ∼ ( c w k ) τ ak + ( s w k ) µ a , (122)This completes the proof of part 2.Part 3 is proved in the same fashion. Taking inner products with each side of (121), and using (117),we get (cid:104) Au w k , ˆ u w k (cid:105) = (cid:104) A / u w k , A / ˆ u w k (cid:105) ∼ c w k τ ak + s w k (( u w k ) (cid:62) A ˜ u w k ) ∼ c w k τ ak , (123)which is the desired result. A.2 Proof of Theorem 3.2
We can decompose X as: X = r (cid:88) k =1 (cid:96) / k u k z (cid:62) k / √ n. (124)Since z jk and z jk (cid:48) are uncorrelated when k (cid:54) = k (cid:48) , and both have variance 1, the vectors z k / √ n arealmost surely asymptotically orthonormal, i.e., lim n →∞ |(cid:104) z k , z k (cid:48) (cid:105)| /n = δ kk (cid:48) . It follows that the z k / √ n are asymptotically equivalent to the right singular vectors v k of X , that is,lim n →∞ (cid:104) v k , z k (cid:105) /n = 1 (125)almost surely; and the singular values of X are asymptotically equal to the (cid:96) / k . That is, we can write: X ∼ r (cid:88) k =1 (cid:96) / k u k v (cid:62) k , (126)where C ∼ D indicates (cid:107) C − D (cid:107) op → p, n → ∞ . Similarly, we can also write X w ∼ r (cid:88) k =1 ( (cid:96) w k ) / u w k ( v w k ) (cid:62) . (127) e can also decompose X w by applying W to X : X w = W X ∼ r (cid:88) k =1 (cid:96) / k W u k v (cid:62) k = r (cid:88) k =1 ( (cid:96) k (cid:107) W u k (cid:107) ) / u w k v (cid:62) k . (128)The condition (19) immediately implies that u w j and u w k are asymptotically orthogonal whenever j (cid:54) = k .Comparing (127) and (128) then shows that almost surely, (cid:96) w k ∼ (cid:96) k (cid:107) W u k (cid:107) , (129)lim p →∞ (cid:104) u w k , u w k (cid:105) = 1 , (130)and lim n →∞ (cid:104) v k , v w k (cid:105) = 1 . (131)From (130), (cid:104) u k , u k (cid:105) ∼ τ k , we use (130): τ k ∼ (cid:107) W − u w k (cid:107) − ∼ (cid:107) W − u w k (cid:107) − ∼ (cid:107) W − W u k (cid:107) − (cid:107) W u k (cid:107) = (cid:107) W u k (cid:107) . (132)To prove the formulas for the asymptotic cosine between u j and ˆ u k we take A = W − in Theorem3.1. When j (cid:54) = k , we have the formulaˆ u w k ∼ c w k u w k + s w k ˜ u w k ∼ c w k W u k √ τ k + s w k ˜ u w k (133)and consequently W − ˆ u w k ∼ c w k u k √ τ k + s w k W − ˜ u w k . (134)We take inner products of each side with u j . From the orthogonality of u k and u j , and using (117), wehave: (cid:104) u j , W − ˆ u w k (cid:105) ∼ , (135)and consequently (cid:104) u j , ˆ u k (cid:105) ∼
0. When $j = k$, the formula for $\langle u_j, \hat u_k\rangle$ follows from Theorem 3.1. Finally, we show that $\hat u_j$ and $\hat u_k$ are asymptotically orthogonal when $j \ne k$. We use the following lemma. Lemma A.2.
Suppose X = (cid:80) rk =1 (cid:96) / k w k v (cid:62) k is a p -by- n rank r matrix, and G is a matrix with iidGaussian entries g ij ∼ N (0 , /n ) . Let ˆ w , . . . , ˆ w m be the left singular vectors of Y = X + G , where m = min( p, n ) , and write ˆ w k ∼ c k w k + s k ˜ w k (136) where ˜ w k is orthogonal to w , . . . , w r . Then for any sequence of matrices A = A p with bounded operatornorms and any ≤ j (cid:54) = k ≤ r , lim p →∞ ˜ w (cid:62) j A ˜ w k = 0 (137) almost surely.Proof. First, we prove the cases where A = I p ; that is, we show ˜ w j and ˜ w k are asymptotically orthogonalwhenever 1 ≤ j (cid:54) = k ≤ r . Indeed, we have s j s k (cid:104) ˜ w j , ˜ w k (cid:105) ∼ (cid:104) ˆ w j , ˆ w k (cid:105) + c j c k (cid:104) w j , w k (cid:105) − c j (cid:104) w j , ˆ w k (cid:105) − c k (cid:104) w k , ˆ w j (cid:105) = − c j (cid:104) w j , ˆ w k (cid:105) − c k (cid:104) w k , ˆ w j (cid:105) . (138)Since ˜ w j and ˜ w k are uniformly distributed on the subspace orthogonal to w , . . . , w r , the inner products (cid:104) w j , ˆ w k (cid:105) and (cid:104) w k , ˆ w j (cid:105) both converge to 0 almost surely as p → ∞ , proving the claim.For general A , we note that the joint distribution of ˜ w j and ˜ w k is invariant to orthogonal transfor-mations which leave fixed the r -dimensional subspace span { w , . . . , w r } . The result then follows fromProposition 6.2 in [9], which implies that˜ w (cid:62) j A ˜ w (cid:62) k ∼ p tr( A ) ˜ w (cid:62) j ˜ w k ∼ , (139)where we have used the asymptotic orthogonality of ˜ w j and ˜ w k . ince u k ∼ u k and u j and u k are orthogonal, taking inner products of each side of (133) with W − ˆ u w j we get: (cid:104) W − ˆ u w j , W − ˆ u w k (cid:105) ∼ s w j s w k (cid:104) W − ˜ u w j , W − ˜ u w k (cid:105) = s w j s w k (˜ u w j ) (cid:62) Σ ε ˜ u w k . (140)The result now follows from Lemma A.2. B Proofs from Section 5
First, we establish the consistency of covariance estimation in the γ = 0 regime: Proposition B.1. If p n /n → as n → ∞ , and the subgaussian norm of QY j can be bounded by C independently of the dimension p , then the sample covariance matrix of QY , . . . , QY n converges to thepopulation covariance Q Σ y Q in operator norm.Proof. We first quote the following result, stated as Corollary 5.50 in [53]:
Lemma B.2.
Let $Y_1,\dots,Y_n$ be iid mean zero subgaussian random vectors in $\mathbb{R}^p$ with covariance matrix $\Sigma_y$, and let $\epsilon \in (0,1)$ and $t \ge 1$. Then with probability at least $1 - 2\exp(-t^2 p)$: if $n \ge C(t/\epsilon)^2 p$, then
$\|\hat\Sigma_y - \Sigma_y\| \le \epsilon$,  (141)
where $\hat\Sigma_y = \sum_{j=1}^n Y_j Y_j^\top/n$ is the sample covariance, and $C$ is a constant.
Lemma B.3.
Let $A_1, A_2, \dots$ be a sequence of random numbers, and let $\epsilon > 0$. Define:
\[ A_n(\epsilon) = \{ |A_n| > \epsilon \}. \]  (142)
If for every choice of $\epsilon > 0$ we have
\[ \sum_{n=1}^\infty P(A_n(\epsilon)) < \infty, \]  (143)
then $A_n \to 0$ almost surely.

Now take $t = \epsilon \sqrt{n/(Cp)}$; then $n = C(t/\epsilon)^2 p$, and $t \ge 1$ for $n$ sufficiently large. Consequently,
\[ P\big( \|\hat\Sigma_y - \Sigma_y\| > \epsilon \big) \le 2\exp(-t^2 p) = 2\exp(-n\epsilon^2/C), \]  (144)
and so the series $\sum_{n \ge 1} P( \|\hat\Sigma_y - \Sigma_y\| > \epsilon )$ converges, meaning $\|\hat\Sigma_y - \Sigma_y\| \to 0$ almost surely as $n \to \infty$.
We now need to check that the subgaussian norm of $Y_j = X_j + \varepsilon_j$ from the spiked model is bounded independently of the dimension $p$. But this is easy if the distribution of variances of $\varepsilon_j$ is bounded, using, for example, Lemma 5.24 of [53].
An immediate corollary of Proposition B.1 is that the sample eigenvectors of $\hat\Sigma_{qy} = Q \hat\Sigma_y Q$ are consistent estimators of the eigenvectors of $\Sigma_{qy} = Q \Sigma_y Q$.

Corollary B.4.
Let $\Sigma_{qy} = Q \Sigma_y Q$ be the population covariance of the random vector $Y_j^q = Q Y_j$, and let $\hat\Sigma_{qy} = Q \hat\Sigma_y Q$ be the sample covariance of $Y_1^q, \dots, Y_n^q$. Let $u_1^q, \dots, u_r^q$ denote the top $r$ eigenvectors of $\Sigma_{qy} = Q \Sigma_y Q$, and $\hat u_1^q, \dots, \hat u_r^q$ the top $r$ eigenvectors of $\hat\Sigma_{qy}$. Then for $1 \le k \le r$,
\[ \lim_{n\to\infty} |\langle \hat u_k^q, u_k^q \rangle| = 1, \]  (145)
where the limit holds almost surely as $n \to \infty$ and $p/n \to 0$.

We now turn to the proof of Theorem 5.2. First, we derive an expression for the BLP $\hat X_j^{\mathrm{opt}}$. We have:
\[ \hat X_j^{\mathrm{opt}} = \Sigma_x (\Sigma_x + \Sigma_\varepsilon)^{-1} Y_j = W^{-1} W \Sigma_x W (W \Sigma_x W + I)^{-1} W Y_j = W^{-1} \sum_{k=1}^r \frac{\ell_k^w}{\ell_k^w + 1} \langle W Y_j, u_k^w \rangle u_k^w = \sum_{k=1}^r \eta_k^{\mathrm{opt}} \langle W Y_j, u_k^w \rangle W^{-1} u_k^w, \]  (146)
where $W \Sigma_x W = \sum_{k=1}^r \ell_k^w u_k^w (u_k^w)^\top$, and $\eta_k^{\mathrm{opt}} = \ell_k^w / (\ell_k^w + 1)$.
Now, for any $s_1, \dots, s_r$ satisfying
\[ \lim_{\gamma \to 0} \frac{s_k}{\sigma_k^w} = \frac{\ell_k^w}{\ell_k^w + 1}, \]  (147)
we define the predictor $\hat X'$:
\[ \hat X' = \sum_{k=1}^r s_k W^{-1} \hat u_k^w (\hat v_k^w)^\top. \]  (148)
Following the same reasoning as in the proof of Lemma 5.4, we can write each column $\hat X'_j$ of $\sqrt{n}\, \hat X'$ as follows:
\[ \hat X'_j = \sum_{k=1}^r (s_k / \sigma_k^w) \langle Y_j^w, \hat u_k^w \rangle W^{-1} \hat u_k^w. \]  (149)
Theorem 5.2 now follows from condition (147), formula (146), and Corollary B.4. Theorem 5.1 follows immediately, after observing that $\hat X$ has the same form as $\hat X'$ with $s_k = t_k$, and
\[ \lim_{\gamma \to 0} \frac{t_k}{\sigma_k^w} = \lim_{\gamma \to 0} \frac{(\ell_k^w)^{1/2} c_k^w \tilde c_k}{\big( (c_k^w)^2 + (s_k^w)^2 \mu_\varepsilon \tau_k \big) \sqrt{(\ell_k^w + 1)(1 + \gamma/\ell_k^w)}} = \lim_{\gamma \to 0} \frac{(\ell_k^w)^{1/2} \tilde c_k}{\sqrt{\ell_k^w + 1}} = \frac{\ell_k^w}{\ell_k^w + 1}. \]  (150)
Finally, we prove Theorem 5.3. By definition,
\[ \hat Y_{Q,j} = \sum_{k=1}^r (s_k^q / \sigma_k^q) \langle Y_j^q, \hat u_k^q \rangle Q^{-1} \hat u_k^q \]  (151)
and
\[ \hat Y_{Q,j}^{\mathrm{lin}} = \sum_{k=1}^r \eta_k^q \langle Y_j^q, u_k^q \rangle Q^{-1} u_k^q. \]  (152)
The values $s_k^q$ and $\eta_k^q$ are each assumed to minimize the mean-squared error for their respective expressions. Consequently, since Corollary B.4 states that $\hat u_k^q \sim u_k^q$, we establish (88); (89) follows immediately from (146).
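The consistency statement of Corollary B.4, which the arguments above rely on, is easy to check numerically. The following sketch (ours, not from the paper) takes $Q = I$ and a single spike for simplicity, lets $p_n$ grow much more slowly than $n$, and reports the operator-norm error of the sample covariance together with the alignment of its top eigenvector with the population spike.
\begin{verbatim}
import numpy as np

rng = np.random.default_rng(1)
for n in [1_000, 10_000, 100_000]:
    p = int(n ** 0.4)                                 # p_n / n -> 0
    u = rng.standard_normal(p); u /= np.linalg.norm(u)
    nu = rng.uniform(0.5, 2.0, size=p)                # heteroscedastic noise variances
    Sigma_y = 4.0 * np.outer(u, u) + np.diag(nu)      # one spike of strength 4 plus noise
    Y = 2.0 * np.outer(rng.standard_normal(n), u) + rng.standard_normal((n, p)) * np.sqrt(nu)
    S = Y.T @ Y / n                                   # sample covariance (mean-zero data)
    err = np.linalg.norm(S - Sigma_y, 2)              # operator-norm error, -> 0
    u_hat = np.linalg.eigh(S)[1][:, -1]               # top sample eigenvector
    print(n, p, round(err, 3), round(abs(u_hat @ u), 4))   # alignment -> 1
\end{verbatim}

C Proof of Theorem 6.1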
C.1 The optimal coefficients for in-sample prediction
Before deriving the optimal out-of-sample coefficients $\eta_k^o$, we will first derive the optimal in-sample coefficients $\eta_k$. That is, we will rewrite the optimal shrinkage with noise whitening in the form (92). From Lemma 5.4, the in-sample coefficients $\eta_k$ are the ratios of the optimal singular values $t_k$ derived in Section 4.1 and the observed singular values of $Y^w$, denoted $\sigma_1^w, \dots, \sigma_r^w$. From Theorem A.1, we know that
\[ \sigma_k^w = \sqrt{(\ell_k^w + 1)\Big(1 + \frac{\gamma}{\ell_k^w}\Big)}, \]  (153)
and from Section 4.1 we know that
\[ t_k = \frac{(\ell_k^w)^{1/2} c_k^w \tilde c_k}{(c_k^w)^2 + (s_k^w)^2 \mu_\varepsilon \tau_k} = \alpha_k (\ell_k^w)^{1/2} c_k^w \tilde c_k, \]  (154)
where $\alpha_k = \big( (c_k^w)^2 + (s_k^w)^2 \mu_\varepsilon \tau_k \big)^{-1}$. Taking the ratio, and using formulas (39) and (40) for $c_k^w$ and $\tilde c_k$, we obtain:
\[ \eta_k = \frac{t_k}{\sigma_k^w} = \frac{\alpha_k (\ell_k^w)^{1/2} c_k^w \tilde c_k}{\sqrt{(\ell_k^w + 1)(1 + \gamma/\ell_k^w)}} = \frac{\alpha_k \ell_k^w (c_k^w)^2}{\sqrt{(\ell_k^w + 1)(\ell_k^w + \gamma)}} \sqrt{\frac{(\ell_k^w)^2 + \gamma \ell_k^w}{(\ell_k^w)^2 + \ell_k^w}} = \frac{\alpha_k \ell_k^w (c_k^w)^2}{\ell_k^w + 1}. \]  (155)
That is, we have found the optimal in-sample coefficients to be:
\[ \eta_k = \frac{1}{(c_k^w)^2 + (s_k^w)^2 \mu_\varepsilon \tau_k} \cdot \frac{\ell_k^w (c_k^w)^2}{\ell_k^w + 1}. \]  (156)
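As a sanity check on the algebra in (155), the two routes to $\eta_k$ (the ratio $t_k/\sigma_k^w$ and the closed form (156)) can be compared numerically. A minimal sketch (ours), assuming the standard whitened-model expressions $(c_k^w)^2 = (1-\gamma/(\ell_k^w)^2)/(1+\gamma/\ell_k^w)$ and $\tilde c_k^2 = (1-\gamma/(\ell_k^w)^2)/(1+1/\ell_k^w)$, which match the functions $g$ and $h$ used in Appendix D.2:
\begin{verbatim}
import numpy as np

def eta_in_sample(ell_w, gamma, mu_tau):
    """Return (t_k / sigma_k^w, closed form (156)); mu_tau plays the role of mu_eps * tau_k."""
    c2  = (1 - gamma / ell_w**2) / (1 + gamma / ell_w)   # (c_k^w)^2
    s2  = 1 - c2                                          # (s_k^w)^2
    ct2 = (1 - gamma / ell_w**2) / (1 + 1 / ell_w)        # (tilde c_k)^2
    alpha   = 1 / (c2 + s2 * mu_tau)
    sigma_w = np.sqrt((ell_w + 1) * (1 + gamma / ell_w))  # (153)
    t_k     = alpha * np.sqrt(ell_w * c2 * ct2)           # (154)
    return t_k / sigma_w, alpha * ell_w * c2 / (ell_w + 1)

print(eta_in_sample(5.0, 0.5, 2.0))    # the two values agree
print(eta_in_sample(5.0, 1e-4, 2.0))   # and approach ell_w/(ell_w+1) = 0.8333... as gamma -> 0
\end{verbatim}
The second line also illustrates the $\gamma \to 0$ limit (150) used in the proof of Theorem 5.1.

C.2 The optimal coefficients for out-of-sample prediction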
In this section, we will derive the optimal out-of-sample coefficients $\eta_k^o$. We have a predictor of the form
\[ \hat X_0 = \sum_{k=1}^r \eta_k^o \langle W Y_0, \hat u_k^w \rangle W^{-1} \hat u_k^w, \]  (157)
where $\hat u_k^w$ are the top left singular vectors of the in-sample observation matrix $Y^w = W[Y_1, \dots, Y_n]/\sqrt{n}$, and $Y_0$ is an out-of-sample observation with signal component $X_0 = \sum_{k=1}^r \ell_k^{1/2} z_{0k} u_k$. We wish to choose the coefficients $\eta_k^o$ that minimize the asymptotic mean squared error $E\|X_0 - \hat X_0\|^2$.
First, we can expand the MSE across the different principal components as follows:
\[ \|X_0 - \hat X_0\|^2 = \sum_{k=1}^r \big\| \ell_k^{1/2} z_{0k} u_k - \eta_k^o \langle W Y_0, \hat u_k^w \rangle W^{-1} \hat u_k^w \big\|^2 + \sum_{k \ne l} \big\langle \ell_k^{1/2} z_{0k} u_k - \eta_k^o \langle W Y_0, \hat u_k^w \rangle W^{-1} \hat u_k^w,\; \ell_l^{1/2} z_{0l} u_l - \eta_l^o \langle W Y_0, \hat u_l^w \rangle W^{-1} \hat u_l^w \big\rangle. \]  (158)
After taking expectations, the cross-terms vanish and we are left with:
\[ E\|X_0 - \hat X_0\|^2 = \sum_{k=1}^r E\big\| \ell_k^{1/2} z_{0k} u_k - \eta_k^o \langle W Y_0, \hat u_k^w \rangle W^{-1} \hat u_k^w \big\|^2. \]  (159)
Since the sum separates across the $\eta_k^o$, we can minimize each summand individually. We write:
\[ E\big\| \ell_k^{1/2} z_{0k} u_k - \eta_k^o \langle W Y_0, \hat u_k^w \rangle W^{-1} \hat u_k^w \big\|^2 = \ell_k + (\eta_k^o)^2 E\big[ \langle W Y_0, \hat u_k^w \rangle^2 \|W^{-1} \hat u_k^w\|^2 \big] - 2 \ell_k^{1/2} \eta_k^o E\big[ z_{0k} \langle W Y_0, \hat u_k^w \rangle \langle u_k, W^{-1} \hat u_k^w \rangle \big]. \]  (160)
We first deal with the quadratic coefficient in $\eta_k^o$:
\[ \langle W Y_0, \hat u_k^w \rangle^2 \|W^{-1} \hat u_k^w\|^2 = \langle W X_0 + W \varepsilon_0, \hat u_k^w \rangle^2 \|W^{-1} \hat u_k^w\|^2 = \big( \langle W X_0, \hat u_k^w \rangle^2 + \langle W \varepsilon_0, \hat u_k^w \rangle^2 + 2 \langle W X_0, \hat u_k^w \rangle \langle W \varepsilon_0, \hat u_k^w \rangle \big) \|W^{-1} \hat u_k^w\|^2, \]  (161)
and taking expectations, we get:
\[ E\big[ \langle W Y_0, \hat u_k^w \rangle^2 \|W^{-1} \hat u_k^w\|^2 \big] \sim \big( E\big[ \langle W X_0, \hat u_k^w \rangle^2 \big] + 1 \big) \|W^{-1} \hat u_k^w\|^2 \sim \big( \ell_k^w (c_k^w)^2 + 1 \big) \Big( \frac{(c_k^w)^2}{\tau_k} + (s_k^w)^2 \mu_\varepsilon \Big). \]  (162)
Now we turn to the linear coefficient in $\eta_k^o$:
\[ \ell_k^{1/2} E\big[ z_{0k} \langle W Y_0, \hat u_k^w \rangle \langle u_k, W^{-1} \hat u_k^w \rangle \big] = \ell_k^{1/2} E\Big[ z_{0k} \big( (\ell_k^w)^{1/2} z_{0k} c_k^w + \langle W \varepsilon_0, \hat u_k^w \rangle \big) \langle u_k, W^{-1} \hat u_k^w \rangle \Big] = \frac{\ell_k^w c_k^w E\big[ \langle u_k, W^{-1} \hat u_k^w \rangle \big]}{\|W u_k\|} \sim \frac{\ell_k^w (c_k^w)^2}{\tau_k}. \]  (163)
Minimizing the quadratic in $\eta_k^o$, we get:
\[ \eta_k^o = \frac{\ell_k^w (c_k^w)^2 / \tau_k}{\big( \ell_k^w (c_k^w)^2 + 1 \big) \Big( \dfrac{(c_k^w)^2}{\tau_k} + (s_k^w)^2 \mu_\varepsilon \Big)} = \frac{1}{(c_k^w)^2 + (s_k^w)^2 \mu_\varepsilon \tau_k} \cdot \frac{\ell_k^w (c_k^w)^2}{\ell_k^w (c_k^w)^2 + 1}. \]  (164)

C.3 Equality of the AMSEs
Evaluating the out-of-sample error at the optimal out-of-sample coefficients $\eta_k^o$, we find the optimal out-of-sample AMSE (where $\alpha_k = \big( (c_k^w)^2 + (s_k^w)^2 \mu_\varepsilon \tau_k \big)^{-1}$):
\[ \mathrm{AMSE} = \sum_{k=1}^r \Big( \ell_k - \frac{(\ell_k^w)^2 (c_k^w)^4}{\ell_k^w (c_k^w)^2 + 1} \frac{\alpha_k}{\tau_k} \Big) = \sum_{k=1}^r \Big( \frac{\ell_k^w}{\tau_k} - \frac{(\ell_k^w)^2 (c_k^w)^4}{\ell_k^w (c_k^w)^2 + 1} \frac{\alpha_k}{\tau_k} \Big). \]  (165)
The AMSE of the in-sample predictor is:
\[ \sum_{k=1}^r \ell_k \big( 1 - (c_k \tilde c_k)^2 \big) = \sum_{k=1}^r \frac{\ell_k^w}{\tau_k} \big( 1 - (c_k^w \tilde c_k^w)^2 \alpha_k \big) = \sum_{k=1}^r \Big( \frac{\ell_k^w}{\tau_k} - \frac{\ell_k^w (c_k^w \tilde c_k^w)^2 \alpha_k}{\tau_k} \Big). \]  (166)
To show equality, we therefore need to show:
\[ \ell_k^w (c_k^w \tilde c_k^w)^2 = \frac{(\ell_k^w)^2 (c_k^w)^4}{\ell_k^w (c_k^w)^2 + 1}. \]  (167)
But this follows from the equality of in-sample and out-of-sample AMSEs for the standard spiked model with isotropic noise, established in [19].
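Identity (167) involves only the whitened-model quantities, so it can also be confirmed symbolically. A short check (ours), using the same squared-cosine expressions assumed above:
\begin{verbatim}
import sympy as sp

ell, gamma = sp.symbols("ell gamma", positive=True)
c2  = (1 - gamma / ell**2) / (1 + gamma / ell)   # (c_k^w)^2 as a function of ell_k^w
ct2 = (1 - gamma / ell**2) / (1 + 1 / ell)       # (tilde c_k^w)^2
lhs = ell * c2 * ct2                              # left-hand side of (167)
rhs = ell**2 * c2**2 / (ell * c2 + 1)             # right-hand side of (167)
print(sp.simplify(lhs - rhs))                     # prints 0
\end{verbatim}

D Proofs from Section 7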
D.1 Proof of Proposition 7.1
From Corollary B.4, $\hat u_k^w \sim u_k^w$, $1 \le k \le r$, in the sense that the angle between the vectors converges to 0. Consequently
\[ \lim_{n\to\infty} \Theta(U^w, \hat U^w) = 0, \]  (168)
where $U^w = \mathrm{span}\{u_1^w, \dots, u_r^w\}$ and $\hat U^w = \mathrm{span}\{\hat u_1^w, \dots, \hat u_r^w\}$. Since $W^{-1}$ has bounded operator norm and $U = W^{-1} U^w$ and $\hat U = W^{-1} \hat U^w$, the result follows immediately.

D.2 Proof of Theorem 7.2
Since the inner products between random unit vectors in $\mathbb{R}^p$ vanish as $p \to \infty$, we may assume that the $u_k$ are drawn randomly with iid entries of variance $1/p$; the result will then follow for the orthonormalized vectors from the generic model. If $\Sigma_\varepsilon = \mathrm{diag}(\nu_1, \dots, \nu_p)$, then
\[ \tau_k = \|\Sigma_\varepsilon^{-1/2} u_k\|^2 \sim \frac{1}{p} \sum_{j=1}^p \nu_j^{-1} = \tau. \]  (169)
We now define the $n$-by-$p$ matrix $\tilde Y = Y^\top / \sqrt{\gamma}$, given by
\[ \tilde Y = \sum_{k=1}^r \tilde\ell_k^{1/2} z_k u_k^\top / \sqrt{n} + G^\top \Sigma_\varepsilon^{1/2} / \sqrt{p}, \]  (170)
where $\tilde\ell_k = \ell_k / \gamma$. Note that the noise matrix $G^\top \Sigma_\varepsilon^{1/2}$ has colored rows, not columns, and has been normalized by dividing by the square root of the number of its columns. Since the vectors $u_k$ spanning the right singular subspace of $\tilde Y$ are assumed to be drawn uniformly from the unit sphere in $\mathbb{R}^p$, we may apply Corollary 2 to Theorem 2 of [27] to the matrix $\tilde Y$. Defining $\tilde\gamma = 1/\gamma$ as the aspect ratio of $\tilde Y$, we have:
\[ |\langle \hat u'_k, u_k \rangle|^2 \le \frac{1 - \tilde\gamma / (\tilde\ell_k/\mu_\varepsilon)^2}{1 + 1/(\tilde\ell_k/\mu_\varepsilon)} = \frac{1 - \gamma / (\ell_k^w/\varphi)^2}{1 + \gamma/(\ell_k^w/\varphi)} \equiv g(\ell_k^w/\varphi), \]  (171)
where we have defined the function
\[ g(\ell) = \frac{1 - \gamma/\ell^2}{1 + \gamma/\ell}. \]  (172)
On the other hand, the squared cosine $c_k^2 = |\langle \hat u_k, u_k \rangle|^2$ is equal to
\[ c_k^2 = \frac{(c_k^w)^2}{(c_k^w)^2 + (s_k^w)^2 \varphi} = \frac{g(\ell_k^w)}{g(\ell_k^w) + \varphi \big(1 - g(\ell_k^w)\big)}. \]  (173)
Our goal is to show that for all $\ell_k^w > \sqrt{\gamma}$ and all $\varphi \ge 1$,
\[ g(\ell_k^w/\varphi) \le \frac{g(\ell_k^w)}{g(\ell_k^w) + \varphi \big(1 - g(\ell_k^w)\big)}; \]  (174)
equivalently, writing $\xi = \ell_k^w/\varphi$, we want to show that for all $\varphi \ge 1$ and all $\xi$ with $\xi\varphi > \sqrt{\gamma}$,
\[ g(\xi) \le \frac{g(\xi\varphi)}{g(\xi\varphi) + \varphi \big(1 - g(\xi\varphi)\big)}; \]  (175)
setting
\[ G(\varphi) = \frac{g(\xi\varphi)}{g(\xi\varphi) + \varphi \big(1 - g(\xi\varphi)\big)}, \]  (176)
this is equivalent to showing that $G(\varphi) \ge G(1)$ for all $\varphi \ge 1$. The derivative of $G$ is equal to
\[ \frac{d}{d\varphi} G(\varphi) = \frac{\gamma \xi^2 \varphi^2 + 2\gamma^2 \xi \varphi + \gamma^2}{\big( \xi^2 \varphi^2 - \gamma + (\gamma \xi \varphi + \gamma)\varphi \big)^2} > 0, \]  (177)
which completes the first statement of the theorem.
The second statement concerning $\hat v_k$ is proved similarly. Again applying Corollary 2 to Theorem 2 of [27] to $\tilde Y$, we know that
\[ |\langle \hat v'_k, z_k \rangle|^2 \le \frac{1 - \tilde\gamma / (\tilde\ell_k/\mu_\varepsilon)^2}{1 + \tilde\gamma/(\tilde\ell_k/\mu_\varepsilon)} = \frac{1 - \gamma / (\ell_k^w/\varphi)^2}{1 + 1/(\ell_k^w/\varphi)} \equiv h(\ell_k^w/\varphi), \]  (178)
where we have defined the function
\[ h(\ell) = \frac{1 - \gamma/\ell^2}{1 + 1/\ell}. \]  (179)
Since $h$ is an increasing function of $\ell$, and $|\langle \hat v_k, z_k \rangle|^2 = \tilde c_k^2 = h(\ell_k^w)$, the result follows.
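The positivity of (177) can also be eyeballed numerically; the following sketch (ours) evaluates $G$ on a grid for one supercritical choice of $\xi$ and checks monotonicity:
\begin{verbatim}
import numpy as np

gamma = 0.7

def G(phi, xi):
    """G(phi) from (176), with g as in (172)."""
    g = (1 - gamma / (xi * phi) ** 2) / (1 + gamma / (xi * phi))
    return g / (g + phi * (1 - g))

xi = 1.5 * np.sqrt(gamma)                 # a supercritical spike: xi > sqrt(gamma)
phis = np.linspace(1.0, 20.0, 200)
print(np.all(np.diff(G(phis, xi)) > 0))   # True: G is increasing, so G(phi) >= G(1)
\end{verbatim}

D.3 Proof of Theorem 7.3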
We begin the proof with some lemmas.
Lemma D.1.
Let $0 < B < 1$, and suppose $q$ is the number of entries of $u_k$ where $|u_{jk}| > B/\sqrt{p}$. Then
\[ q \ge p \cdot \frac{1 - B^2}{C^2 - B^2}, \]  (180)
where $C$ is the incoherence parameter from (103).

Proof. Let $S$ be the set of indices $j$ on which $|u_{jk}| > B/\sqrt{p}$, and let $S^c$ be the set of indices $j$ on which $|u_{jk}| \le B/\sqrt{p}$. Because $u_k$ is a unit vector, we then have
\[ 1 = \|u_k\|^2 = \sum_{j=1}^p u_{jk}^2 = \sum_{j \in S} u_{jk}^2 + \sum_{j \in S^c} u_{jk}^2 \le (q/p) C^2 + (1 - q/p) B^2. \]  (181)
Rearranging, we find
\[ \frac{q}{p} \ge \frac{1 - B^2}{C^2 - B^2}, \]  (182)
as claimed.

Lemma D.2. For each $1 \le k \le r$,
\[ \tau_k \ge \max\Big\{ \frac{\tilde K}{\mu_\varepsilon}, \frac{1}{\|\Sigma_\varepsilon\|_{\mathrm{op}}} \Big\}, \]  (183)
where $\tilde K$ is a constant depending only on $C$ from (103).

Proof. We will let $\nu_1, \dots, \nu_p$ denote the diagonal elements of $\Sigma_\varepsilon$. Take any number $0 < B < 1$, and let $q$ be the number of indices where $|u_{jk}| > B/\sqrt{p}$. From Lemma D.1, $q/p \ge K$, a constant. Using the Cauchy-Schwarz inequality, we have:
\[ \mu_\varepsilon \cdot \tau_k = \bigg( \frac{1}{p} \sum_{j=1}^p (\sqrt{\nu_j})^2 \bigg) \cdot \bigg( \sum_{j=1}^p \Big( \frac{u_{jk}}{\sqrt{\nu_j}} \Big)^2 \bigg) \ge \bigg( \frac{1}{\sqrt{p}} \sum_{j=1}^p |u_{jk}| \bigg)^2 \ge \bigg( \frac{1}{\sqrt{p}} (Kp) \frac{B}{\sqrt{p}} \bigg)^2 = K^2 B^2. \]  (184)
This proves that $\tau_k \ge \tilde K / \mu_\varepsilon$.
Next, we observe that because $\sum_{j=1}^p u_{jk}^2 = 1$, we have
\[ \tau_k = \sum_{j=1}^p \Big( \frac{u_{jk}}{\sqrt{\nu_j}} \Big)^2 \ge \min_{1 \le j \le p} \nu_j^{-1} = \Big( \max_{1 \le j \le p} \nu_j \Big)^{-1} = \frac{1}{\|\Sigma_\varepsilon\|_{\mathrm{op}}}, \]  (185)
completing the proof.

We now turn to the proof of Theorem 7.3. We have
\[ \|U_\perp^\top \hat U\|_{\mathrm{op}} = \|U_\perp U_\perp^\top \hat U\|_{\mathrm{op}} = \|\widetilde U\|_{\mathrm{op}} \]  (186)
where
\[ \widetilde U = [\tilde w_1, \dots, \tilde w_r] \]  (187)
is the matrix whose columns are the projections $\tilde w_k$ of $\hat u_k$ onto the orthogonal complement of $\mathrm{span}\{u_1, \dots, u_r\}$. Then from Lemma A.2, we know that asymptotically $\tilde w_j \perp \tilde w_k$ if $j \ne k$; consequently,
\[ \|\sin\Theta(\hat U, U)\|^2 = \max_{1 \le k \le r} \|\tilde w_k\|^2 = \max_{1 \le k \le r} \big( 1 - \langle \hat u_k, u_k \rangle^2 \big) = \max_{1 \le k \le r} (1 - c_k^2). \]  (188)
From Theorem 3.2, for each $1 \le k \le r$, the squared sine between $\hat u_k$ and $u_k$ is
\[ 1 - c_k^2 = 1 - \frac{(c_k^w)^2}{(c_k^w)^2 + (s_k^w)^2 \mu_\varepsilon \tau_k} = \frac{(s_k^w)^2 \mu_\varepsilon \tau_k}{(c_k^w)^2 + (s_k^w)^2 \mu_\varepsilon \tau_k}. \]  (189)
Since
\[ (c_k^w)^2 = \frac{1 - \gamma/(\ell_k^w)^2}{1 + \gamma/\ell_k^w} \]  (190)
and
\[ (s_k^w)^2 = \frac{\gamma/\ell_k^w + \gamma/(\ell_k^w)^2}{1 + \gamma/\ell_k^w}, \]  (191)
we can simplify the expression by multiplying numerator and denominator by $(\ell_k^w)^2 (1 + \gamma/\ell_k^w)$:
\[ 1 - c_k^2 = \frac{\gamma (\ell_k^w + 1) \mu_\varepsilon \tau_k}{(\ell_k^w)^2 - \gamma + \gamma (\ell_k^w + 1) \mu_\varepsilon \tau_k} = \frac{\gamma (\ell_k^w + 1) \mu_\varepsilon \tau_k}{(\ell_k^w)^2} \cdot \frac{(\ell_k^w)^2}{(\ell_k^w)^2 - \gamma + \gamma (\ell_k^w + 1) \mu_\varepsilon \tau_k}. \]  (192)
Now, using Lemma D.2, there is a constant $0 < \tilde K < 1$ such that $\tau_k \mu_\varepsilon \ge \tilde K$. Consequently, since $\gamma < (\ell_k^w)^2$, we have:
\[ \frac{(\ell_k^w)^2}{(\ell_k^w)^2 - \gamma + \gamma (\ell_k^w + 1) \mu_\varepsilon \tau_k} \le \frac{(\ell_k^w)^2}{(\ell_k^w)^2 - (1 - \tilde K)\gamma} \le \frac{(\ell_k^w)^2}{(\ell_k^w)^2 - (1 - \tilde K)(\ell_k^w)^2} = \frac{1}{\tilde K}. \]  (193)
Combining equation (192) and inequality (193), the fact that $\ell_k^w = \ell_k \cdot \tau_k$, and Lemma D.2, we obtain the bound:
\[ 1 - c_k^2 \le \frac{1}{\tilde K} \cdot \frac{\gamma (\ell_k^w + 1) \mu_\varepsilon \tau_k}{(\ell_k^w)^2} = \frac{1}{\tilde K} \Big( \frac{\gamma \ell_k^w \mu_\varepsilon \tau_k}{(\ell_k^w)^2} + \frac{\gamma \mu_\varepsilon \tau_k}{(\ell_k^w)^2} \Big) = \frac{1}{\tilde K} \Big( \frac{\gamma \mu_\varepsilon}{\ell_k} + \frac{\gamma \mu_\varepsilon}{\ell_k^2 \tau_k} \Big) \le \frac{1}{\tilde K} \Big( \frac{\gamma \mu_\varepsilon}{\ell_k} + \frac{\gamma \mu_\varepsilon \|\Sigma_\varepsilon\|_{\mathrm{op}}}{\ell_k^2} \Big). \]  (194)
Taking the maximum over $1 \le k \le r$ proves the desired result.

D.4 Proof of Proposition 7.4
As in the proof of Theorem 7.2, since the inner products between random unit vectors in $\mathbb{R}^p$ vanish as $p \to \infty$, we may assume that the $u_k$ are drawn randomly with iid entries of variance $1/p$; the result will then follow for the orthonormalized vectors from the generic model. We will use the fact that $\|\hat\Sigma_x\|_{\mathrm{op}} = \|X\|_{\mathrm{op}}^2$ and $\|\hat\Sigma_\varepsilon\|_{\mathrm{op}} = \|N\|_{\mathrm{op}}^2$. To show the increase in SNR after whitening, we will first derive a lower bound on the operator norm of the noise matrix $N$ alone. Recall that $N = \Sigma_\varepsilon^{1/2} G$, where the $g_{ij}$ are iid $N(0, 1/n)$.
Take unit vectors $c$ and $d$ so that $Gd = \|G\|_{\mathrm{op}} c$. Then we have
\[ \|N\|_{\mathrm{op}} \ge \|\Sigma_\varepsilon^{1/2} G d\| = \|G\|_{\mathrm{op}} \|\Sigma_\varepsilon^{1/2} c\|. \]  (195)
Since the distribution of $G$ is orthogonally invariant, the distribution of $c$ is uniform over the unit sphere in $\mathbb{R}^p$. Consequently, $\|\Sigma_\varepsilon^{1/2} c\|^2 \sim \mathrm{tr}(\Sigma_\varepsilon)/p \sim \mu_\varepsilon$. Therefore,
\[ \|N\|_{\mathrm{op}}^2 \gtrsim \mu_\varepsilon \|G\|_{\mathrm{op}}^2, \]  (196)
where "$\gtrsim$" indicates that the inequality holds almost surely in the large $p$, large $n$ limit.
Next, from the assumption that the $u_k$ are uniformly random, the parameters $\tau_k$ are all asymptotically given by:
\[ \tau_k \sim \|\Sigma_\varepsilon^{-1/2} u_k\|^2 \sim \frac{\mathrm{tr}(\Sigma_\varepsilon^{-1})}{p} \sim \tau. \]  (197)
With this, we can show the improvement in SNR after whitening. We have:
\[ \mathrm{SNR} \sim \frac{\ell_1}{\|N\|_{\mathrm{op}}^2} \lesssim \frac{\ell_1}{\mu_\varepsilon \|G\|_{\mathrm{op}}^2} = \frac{\ell_1 \tau}{\varphi \|G\|_{\mathrm{op}}^2} \sim \frac{\ell_1^w}{\varphi \|G\|_{\mathrm{op}}^2} \sim \frac{\mathrm{SNR}^w}{\varphi}. \]  (198)
This completes the proof.
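To see the effect in finite samples, here is a small simulation (ours, not from the paper) in the rank-one setting of Proposition 7.4: it compares the raw ratio $\|\hat\Sigma_x\|_{\mathrm{op}}/\|\hat\Sigma_\varepsilon\|_{\mathrm{op}}$ with its whitened counterpart and with the predicted gain factor $\varphi = \mu_\varepsilon \tau$.
\begin{verbatim}
import numpy as np

rng = np.random.default_rng(2)
p, n, ell1 = 500, 1000, 30.0
nu = rng.uniform(0.2, 5.0, size=p)                    # heteroscedastic noise variances
u = rng.standard_normal(p); u /= np.linalg.norm(u)    # generic (random) spike direction
z = rng.standard_normal(n)

X = np.sqrt(ell1) * np.outer(u, z) / np.sqrt(n)
G = rng.standard_normal((p, n)) / np.sqrt(n)
N = np.sqrt(nu)[:, None] * G                          # N = Sigma_eps^{1/2} G

mu_eps, tau = nu.mean(), np.mean(1.0 / nu)
phi = mu_eps * tau                                    # phi >= 1, with equality iff the noise is white

snr   = np.linalg.norm(X, 2) ** 2 / np.linalg.norm(N, 2) ** 2
snr_w = np.linalg.norm(X / np.sqrt(nu)[:, None], 2) ** 2 / np.linalg.norm(G, 2) ** 2
print(phi, snr_w / snr)   # the whitened SNR exceeds the raw SNR by at least roughly a factor phi
\end{verbatim}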