Factor analysis in high dimensional biological data with dependent observations
Chris McKennan
Department of Statistics, University of Pittsburgh, Pittsburgh, PA 15260
[email protected]
September 24, 2020
Abstract
Factor analysis is a critical component of high dimensional biological data analysis. However, modern biological data contain two key features that irrevocably corrupt existing methods. First, these data, which include longitudinal, multi-treatment and multi-tissue data, contain samples that break critical independence requirements necessary for the utilization of prevailing methods. Second, biological data contain factors with large, moderate and small signal strengths, and therefore violate the ubiquitous "pervasive factor" assumption essential to the performance of many methods. In this work, I develop a novel statistical framework to perform factor analysis and interpret its results in data with dependent observations and factors whose signal strengths span several orders of magnitude. I then prove that my methodology can be used to solve many important and previously unsolved problems that routinely arise when analyzing dependent biological data, including high dimensional covariance estimation, subspace recovery, latent factor interpretation and data denoising. Additionally, I show that my estimator for the number of factors overcomes both the notorious "eigenvalue shadowing" problem, as well as the biases due to the pervasive factor assumption that plague existing estimators. Simulated and real data demonstrate the superior performance of my methodology in practice.
Keywords: High dimensional factor analysis, Dependent data, Approximate factor model, Principal component analysis, High dimensional asymptotics
Factor analysis is an indispensable component of high throughput biological data analysis. However, existing methods rely on critical assumptions that are not satisfied in modern biological data. Suppose Y ∈ R^{p×n} contains the gene expression or DNA methylation of p genomic units measured in n samples, where p is typically orders of magnitude larger than n in genetic and epigenetic data. For latent factors C ∈ R^{n×K} and loadings L ∈ R^{p×K}, I consider the following general factor model:

Y = LC^T + E,  E(E_{g*}) = 0,  V(E_{g*}) = V_g,  g ∈ {1, . . . , p},  (1)

where E is a random matrix with gth row and ith column E_{g*} ∈ R^n and E_{*i} ∈ R^p. As discussed in Section 5, the rows of E may be dependent, provided the eigenvalues of V(E_{*i}) remain bounded. The goal is to estimate K, C, L and V_g so as to accomplish the following common objectives in biological data analysis:

(a) Characterize and prioritize the sources of variation contained in C [1–4].

(b) Understand the relationships between biological samples [5, 6] and genomic units [7].

(c) Denoise Y to empower inference in gene expression and DNA methylation quantitative trait loci (eQTL and meQTL) studies [4, 8].

Existing methods to perform factor analysis and their theoretical guarantees can, to a large extent, be partitioned into two groups based on their assumptions on V_g and the K latent factors' signal strengths, where signal strengths are quantified as the K non-zero eigenvalues λ_1 ≥ · · · ≥ λ_K > 0 of n{p^{-1}L(n^{-1}C^TC)L^T}. The first group relies on the standard assumption that the columns of E are independent and identically distributed [9–16], implying V_g = v_g I_n for some v_g > 0 and all g ∈ {1, . . . , p}. However, this critical assumption is violated by the cornucopia of biological data with dependent samples, which include longitudinal data [1, 17–20], multi-tissue data [21–23], multi-treatment data [4, 8], as well as data from related individuals [6, 24, 25].
Not only does the independence assumption made by the aforementioned articles imply their theoretical guarantees are not applicable to these dependent data, I show their estimators for K, C and L are irrevocably corrupted by the dependence between the columns of E in practice.

The second group of methods allow for dependence between the columns of E, but rely on the "pervasive factor" assumption in which λ_K ≍ n [26–32], or assume λ_K → ∞ as n, p → ∞ [33]. Intuitively speaking, this implies a scree plot of the eigenvalues of p^{-1}Y^TY should reveal an unambiguous gap between the Kth and (K+1)st eigenvalues. While this assumption simplifies the treatment of dependence in E, it is patently violated in nearly all biological data [12, 14, 34], where, for example, λ_1 ≍ n and λ_K ≲ 1. These methods are so reliant on the assumption that λ_K → ∞ that they consistently fail to recover moderate and weak factors [12], which I show biases estimators from, and under-powers inference using, downstream methods that rely on estimates for C and L.

The purpose of this work is to facilitate objectives (a), (b) and (c) by providing a novel framework, efficient estimators and the requisite theory to perform factor analysis and interpret the results in dependent biological data with nearly arbitrary eigenvalues λ_1, . . . , λ_K. First, I characterize the types of dependence typically observed in biological data in Section 2, and for K known, extend a recently proposed method to estimate λ_1, . . . , λ_K, C, L and V_g in Section 3. A critical component of my method is a novel eigenvalue bias correction for dependent data that ensures the estimates for λ_1, . . . , λ_K and the left and right singular vectors of LC^T are as efficient as those derived from data with independent samples. Accurate estimates for these quantities are crucial to analyzing biological data with dependent samples [1, 2, 4, 8], and I prove in Section 5 that my estimates for them enable objectives (a) and (b). In addition, I show that the estimates for C, L and λ_1, . . . , λ_K can be leveraged to derive estimates for and asymptotic distributions of the eigenvectors and eigenvalues of V(Y_{*i}) ∈ R^{p×p} when Y_{*1}, . . . , Y_{*n} are dependent. As far as I am aware, the abovementioned methodology provides the first provably accurate set of estimators that achieve objectives (a), (b) and (c) in dependent biological data.

Second, I extend the above methodology in Section 4 when K is unknown by framing the estimation of K as a model selection problem, and introduce the Oracle rank, K^{(o)}, as that which minimizes a weighted generalization error. This has the effect of excluding components of LC^T whose signal strengths are below the noise level of the data, and are therefore too weak to justify the added estimation uncertainty that results from their inclusion in the model for LC^T. As far as I am aware, my estimate for K^{(o)} is the first consistent estimator for the number of latent factors in dependent data with λ_K ≲ 1 and λ_1 ≲ n, and therefore circumvents the "eigenvalue shadowing" problem [13] and the biases that accompany the pervasive factor assumption. I lastly use simulated and real genetic data in Sections 6 and 7 to illustrate the power of my framework and estimators in practice. The proofs of all theoretical statements are given in the Supplementary Material, and an R package implementing my method is available from https://github.com/chrismckennan/CorrConf.
Let n > 1. I let 1_n ∈ R^n be the vector of all ones, I_n ∈ R^{n×n} be the identity matrix, [n] = {1, . . . , n} and x_i be the ith element of x ∈ R^n. For M ∈ R^{n×m}, I let M_{ij} ∈ R, M_{*j} ∈ R^n and M_{i*} ∈ R^m be the (i, j)th element, jth column and ith row of M, respectively, and define P_M and P⊥_M to be the orthogonal projection matrices onto im(M) = {Mv : v ∈ R^m} and ker(M^T) = {u ∈ R^n : M^Tu = 0}. If m = n, I let |M| and |M|_+ be the determinant and pseudo-determinant, respectively, and for M = M^T and s ∈ [n], let Λ_s(M) be the sth largest eigenvalue of M. For random vectors X, Y ∈ R^n, I let X ·∼ (µ, V) if E(X) = µ and V(X) = V, and X =_d Y if X, Y have the same distribution.

Let Y ∈ R^{p×n} be the observed data, where Y_{gi} is the observation at genomic unit g ∈ [p] in sample i ∈ [n]. I assume that Model (1) holds for some non-random latent loadings L ∈ R^{p×K} and random latent factors C ∈ R^{n×K}, where

V_g = Σ_{j=1}^b v_{g,j}B_j,  V̄ = p^{-1}Σ_{g=1}^p V_g = Σ_{j=1}^b v̄_j B_j,  g ∈ [p]  (2)

for some observed matrices B_1, . . . , B_b that parametrize the correlation across samples. This is a ubiquitous model for V_g in modern high throughput biological data, and can be used to model the correlation structure in multi-tissue data [22, 23], longitudinal data [19, 20], multi-treatment or multi-condition data [4, 8], data from related individuals [6, 24, 25], or a combination of these data types [1, 17, 18]. If E_{*1}, . . . , E_{*n} are independent and identically distributed, then b = 1 and B_1 = I_n. I assume the unknown variance multipliers v_g = (v_{g,1}, . . . , v_{g,b})^T lie in the convex set Θ = {x ∈ R^b : A_v x ≥ 0} for some known A_v ∈ R^{q×b}. The matrix A_v will typically be I_b, but can take other values depending on the parametrization of B_1, . . . , B_b.

I assume throughout that C is independent of E. Similar to previous work that assumes b = 1, B_1 = I_n and allows λ_K ≲ 1, the assumptions I place on the dependence between the rows of E will depend on whether or not an estimate for K is available [11–13, 15]. To avoid confusing technicalities, I save the details for Section 5.

The dependence between the entries of C_{*r} may depend on r. For example, columns corresponding to technical variables like batch number may have independent entries, and others representing biological factors like cell composition may have dependent entries. Therefore, I only assume E(n^{-1}C^TC) exists and is full rank, and, unless otherwise stated, place no assumptions on the dependence between the elements of C. Therefore, E(Y) = L{E(C)}^T and

Cov(Y_{gi}, Y_{hj}) = ℓ_g^T Cov(C_{i*}, C_{j*}) ℓ_h + Cov(E_{gi}, E_{hj}),  g, h ∈ [p]; i, j ∈ [n],

where ℓ_g = L_{g*}. Evidently, this dependence structure is far more general than that considered by previous authors, who typically only consider data where Cov(E_{gi}, E_{hj}) does not depend on i or j and Cov(C_{i*}, C_{j*}) = Ψ I(i = j) for some non-singular Ψ ∈ R^{K×K} [11–13, 15, 16, 28, 34].

A more general model would be Y = ΓZ^T + LC^T + E, where Z ∈ R^{n×r} contains observed nuisance covariates, like the intercept or treatment condition, that may not be of immediate interest. One can get back to Models (1) and (2) by multiplying Y on the right by a matrix Q_Z ∈ R^{n×(n−r)} whose columns form an orthonormal basis for ker(Z^T), where

YQ_Z = L(Q_Z^TC)^T + Ẽ,  Ẽ_{g*} = Q_Z^TE_{g*} ·∼ (0, Σ_{j=1}^b v_{g,j}Q_Z^TB_jQ_Z),  g ∈ [p].  (3)

I therefore work exclusively with Models (1) and (2) and assume any nuisance covariates have already been rotated out.

It is easy to see that conditional on C and provided dim{im(L)} = dim{im(C)} = K, LC^T, and therefore im(L) and im(C), are identifiable in Model (1). However, L and C are themselves not identifiable.
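The reduction (3) can be sketched as follows. The construction of Q_Z from the full SVD of Z is an implementation choice of mine, and the covariates in Z (intercept plus a two-arm treatment indicator) are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(1)
n, r, p = 40, 2, 100

# Hypothetical nuisance design Z: intercept + treatment indicator.
Z = np.column_stack([np.ones(n), np.tile([0.0, 1.0], n // 2)])
Y = rng.normal(size=(p, n))  # stand-in for data following Y = Gamma Z^T + L C^T + E

# Q_Z: orthonormal basis for ker(Z^T), i.e., the left singular vectors of Z
# corresponding to zero singular values.
U, s, _ = np.linalg.svd(Z, full_matrices=True)
rank = int(np.sum(s > 1e-10 * s[0]))
Q_Z = U[:, rank:]            # n x (n - rank); satisfies Z^T Q_Z = 0

Y_rot = Y @ Q_Z              # as in (3): nuisance covariates rotated out
```

After the rotation, Y_rot follows Models (1) and (2) with C replaced by Q_Z^T C and B_j replaced by Q_Z^T B_j Q_Z, which is why the paper works exclusively with the rotated data.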
To facilitate interpretation and make my methodology useful to biological practitioners, I use the IC3 identification conditions in Bai et al. [10] and define C^{(o)} and L^{(o)} to be

(C^{(o)}, L^{(o)}) ∈ {(C̄, L̄) ∈ R^{n×K} × R^{p×K} : L̄C̄^T = LC^T, n^{-1}C̄^TC̄ = I_K, n p^{-1}L̄^TL̄ = diag(λ_1, . . . , λ_K)}.  (4)

Provided λ_1, . . . , λ_K are non-degenerate, C^{(o)}_{*r} and L^{(o)}_{*r} are identifiable up to sign parity and are proportional to the rth right and left singular vectors of LC^T for all r ∈ [K]. As defined, C^{(o)}_{*1}, . . . , C^{(o)}_{*K} and L^{(o)}_{*1}, . . . , L^{(o)}_{*K} are empirically uncorrelated factors and loadings, where C^{(o)}_{*r} has the natural and intelligible interpretation as being the factor with the rth largest effect on expression or methylation. This identification condition is ubiquitous in the biological literature, and has proven to be quite efficacious when analyzing data with independent [3, 7] and dependent [1, 2, 4, 8] samples.

3 Estimation when K is known

Here I describe my method to estimate L^{(o)}, λ_1, . . . , λ_K, C^{(o)} and V_1, . . . , V_p assuming K is known, which extends the method to recover im(C) = im{C^{(o)}} and V_1, . . . , V_p proposed in McKennan et al. [35]. Unlike standard Principal Components Analysis (PCA) in data where E_{*1}, . . . , E_{*n} are independent and identically distributed, one must be careful to avoid including variation from E that is shared across samples in the estimate for im(C). Further, even if im(C) were known, estimating L^{(o)}, λ_1, . . . , λ_K and C^{(o)} is challenging because they can no longer be estimated using the singular value decomposition of Y.

To elaborate on both of these points, let S = p^{-1}Y^TY and note that PCA's estimate for im(C), which is simply the span of the first K eigenvectors of S, can be expressed as

P_Ĉ(PCA) = argmax_{H ∈ R^{n×n}, H^T = H, HH = H, Tr(H) = K} Tr(SH),  (5)

where there is a one-to-one correspondence between the estimators P_Ĉ(PCA) and im{Ĉ(PCA)}. Consider the simple case when E_{*1}, . . . , E_{*n} are independent and identically distributed. Then b = 1, B_1 = I_n, E(p^{-1}E^TE) = v̄_1 I_n and S, in expectation, can be expressed as

E(S | C) = C(p^{-1}L^TL)C^T + E(p^{-1}E^TE) = n^{-1}C^{(o)}diag(λ_1, . . . , λ_K){C^{(o)}}^T + v̄_1 I_n.

Since Tr{E(p^{-1}E^TE)H} = Tr{(v̄_1 I_n)H} = K v̄_1 does not depend on H, this implies variation in E has little influence on the objective in (5), and therefore on im{Ĉ(PCA)}. Further, since adding a multiple of the identity does not change a matrix's eigenvectors, the orthonormal columns of n^{-1/2}C^{(o)} are the first K eigenvectors of E(S | C). This suggests n^{-1/2}C^{(o)} can be accurately estimated as the first K eigenvectors of S, which form an ordered orthonormal basis for im{Ĉ(PCA)}. However, both of these lines of reasoning break down when E_{*1}, . . . , E_{*n} are dependent. In such cases, Tr{E(p^{-1}E^TE)H} will depend on H because V̄ = E(p^{-1}E^TE) will no longer be a multiple of the identity. Therefore, the solution to (5) will be driven by variation in E, thereby corrupting PCA's estimate for im(C). Further, since the eigenvectors of E(S | C) = n^{-1}C^{(o)}diag(λ_1, . . . , λ_K){C^{(o)}}^T + V̄ are no longer n^{-1/2}C^{(o)}, the eigenvectors of S should not be used to estimate C^{(o)}.

Besides illuminating issues when applying standard factor analysis techniques in data with dependent samples, the above discussion also suggests that accounting for V̄ may circumvent these issues. Suppose that V̄ ≻ 0 were known, and define

P_Ĉ = argmax_{H ∈ R^{n×n}, H^T = H, HH = H, Tr(H) = K} Tr{(V̄^{-1}SV̄^{-1})(HV̄^{-1}H)^†}.  (6)

Because Tr{V̄^{-1}(HV̄^{-1}H)^†} = K for all H, the objective function in (6) satisfies

E[Tr{(V̄^{-1}SV̄^{-1})(HV̄^{-1}H)^†} | C] = Tr{V̄^{-1}C(p^{-1}L^TL)C^TV̄^{-1}(HV̄^{-1}H)^†} + K.

Since the only term involving H in the above expression takes its maximum when H = P_C, this simple analysis argues that (6) properly accounts for V̄ when estimating im(C). One can then use im(Ĉ) to estimate λ_1, . . . , λ_K, and subsequently choose an appropriate ordered basis for im(Ĉ) to estimate L^{(o)} and C^{(o)}. These steps are presented below in Algorithm 1, which I call FALCO (factor analysis in correlated data), which estimates im(C), λ_1, . . . , λ_K, C^{(o)} and L^{(o)}, and uses the warm start technique detailed in [35] to estimate V̄.

Algorithm 1 (FALCO). Let Y ∈ R^{p×n} and M ∈ R^{b×b}, where M_{rs} = n^{-1}Tr(B_rB_s) for r, s ∈ [b]. Fix some α ∈ (0, 1) and integer K_max ∈ (0, n ∧ p], and let V(θ) = Σ_{j=1}^b θ_jB_j for any θ ∈ R^b.

(a) Initialize ˆv̄ = (ˆv̄_1, . . . , ˆv̄_b)^T as ˆv̄ = argmax_{θ∈Θ}(−log{|V(θ)|} − Tr[(p^{-1}Y^TY){V(θ)}^{-1}]). Set k = 1 and ˆV̄ = V(ˆv̄).

(b) (i) Define P_Ĉ, and therefore im(Ĉ), to be

P_Ĉ = argmax_{H ∈ R^{n×n}, H^T = H, HH = H, Tr(H) = k} Tr[{ˆV̄^{-1}(p^{-1}Y^TY)ˆV̄^{-1}}(HˆV̄^{-1}H)^†].  (7)

(ii) Let M̂ ∈ R^{b×b} be M̂_{rs} = n^{-1}Tr(P⊥_Ĉ B_r P⊥_Ĉ B_s).
If Λ_b(M̂) ≤ αΛ_b(M), go to Step (c).

(iii) Set ˆv̄ = argmax_{θ∈Θ}(−log(|P⊥_Ĉ V(θ)P⊥_Ĉ|_+) − Tr[(p^{-1}Y^TY){P⊥_Ĉ V(θ)P⊥_Ĉ}^†]) and ˆV̄ = V(ˆv̄).

(iv) Repeat Steps (i), (ii) and (iii) two times, and stop on Step (i) of the third iteration.

(c) Let C̃ ∈ R^{n×k} be any matrix such that im(C̃) = im(Ĉ). Define L̃ = YˆV̄^{-1}C̃(C̃^TˆV̄^{-1}C̃)^{-1}.

(d) Define λ̂_r to be the rth largest eigenvalue of n^{-1}C̃{n p^{-1}L̃^TL̃ − (n^{-1}C̃^TˆV̄^{-1}C̃)^{-1}}C̃^T for r ∈ [k]. If k = K, let λ̂_r be the estimate for λ_r for all r ∈ [K].

(e) Let Ũ ∈ R^{k×k} be a unitary matrix that satisfies

Ũ^T(n^{-1}C̃^TC̃)^{1/2}{n p^{-1}L̃^TL̃ − (n^{-1}C̃^TˆV̄^{-1}C̃)^{-1}}(n^{-1}C̃^TC̃)^{1/2}Ũ = diag(λ̂_1, . . . , λ̂_k),

and define L̂ = L̃(n^{-1}C̃^TC̃)^{1/2}Ũ and Ĉ = C̃(n^{-1}C̃^TC̃)^{-1/2}Ũ. If k = K, let L̂ and Ĉ be the estimates for L^{(o)} and C^{(o)}.

(f) If k < K_max, update k ← k + 1 and return to Step (b).

Remark 1. The estimator ˆv̄ in Step (b)(iii) is exactly the restricted maximum likelihood (REML) estimator for θ under the model Y ∼ MN_{p×n}(L̂Ĉ^T, I_p, V(θ)). We can estimate V_g = V(v_g) for any k with REML using the model Y_{g*} ∼ N(ĈL̂_{g*}, V(v_g)).

Remark 2. If b = 1, B_1 = I_n, then L̂_{*r} and Ĉ_{*r} are proportional to the rth left and right singular vectors of Y, and λ̂_r is the bias-corrected estimator proposed in McKennan et al. [14].

With the exception of (7) and Step (b)(ii), Steps (a) and (b) of Algorithm 1 resemble the iterative method proposed in McKennan et al. [35] to simultaneously estimate im(C) and V̄, where the estimate for V̄ obtained when dim{im(C)} is assumed to be k − 1 initializes the estimate when dim{im(C)} = k. This "warm start" technique helps ensure that variation attributable to V̄ is not mistakenly assigned to C. Step (b)(ii) flags estimates for P_C where the subsequent restricted log-likelihood function in (iii) may not identify v̄.
This step is necessary when k ≍ n ∧ p, and allows us to circumvent the common, but problematic, restriction that the maximum possible value of K, K_max, be bounded when estimating K in Section 4. I set α = .

Proposition 1. If ˆV̄ ≻ 0, P_Ĉ in (7) is exactly P_{ˆV̄^{1/2}W}, where the columns of W ∈ R^{n×k} are the first k right singular vectors of YˆV̄^{-1/2}.

As far as I am aware, the estimates for the eigenvalues in Step (d) and the estimates for L^{(o)} and C^{(o)} in Step (e) are the first estimators for these quantities that account for the correlation between samples, and therefore warrant some discussion. Since many of the eigenvalues λ_r will be moderate or small in biological data [14], one must account for eigenvalue inflation. This is a well-studied phenomenon in data with independent samples [11, 14], and occurs because small errors in the estimates L̃_{1*}, . . . , L̃_{p*} accumulate and inflate the estimator n p^{-1}L̃^TL̃ = n p^{-1}Σ_{g=1}^p L̃_{g*}L̃_{g*}^T. The term (n^{-1}C̃^TˆV̄^{-1}C̃)^{-1} in Steps (d) and (e) corrects that bias, which as hinted in Remark 2, reduces to the usual bias correction used to deflate estimates for λ_r when b = 1, B_1 = I_n [11, 14].

Perhaps the most unnatural element of Algorithm 1 is Step (e). To justify this step, suppose k = K. Since L̃C̃^T only depends on im(Ĉ) and not the choice of parametrization of C̃, I require L̂ = L̃R and Ĉ = C̃R^{-T} for any non-singular R ∈ R^{K×K} to ensure L̂Ĉ^T = L̃C̃^T. Since n^{-1}{C^{(o)}}^TC^{(o)} = I_K, I set R = (n^{-1}C̃^TC̃)^{1/2}Ũ for some unitary matrix Ũ ∈ R^{K×K}, which guarantees n^{-1}Ĉ^TĈ = I_K. I choose Ũ so that the inflation-corrected estimator for n p^{-1}{L^{(o)}}^TL^{(o)} = diag(λ_1, . . . , λ_K),

n p^{-1}L̂^TL̂ − (n^{-1}Ĉ^TˆV̄^{-1}Ĉ)^{-1} = Ũ^T(n^{-1}C̃^TC̃)^{1/2}{n p^{-1}L̃^TL̃ − (n^{-1}C̃^TˆV̄^{-1}C̃)^{-1}}(n^{-1}C̃^TC̃)^{1/2}Ũ,

is exactly diag(λ̂_1, . . . , λ̂_K), where λ̂_r is the inflation-corrected estimator of λ_r. Choosing such a Ũ when b = 1, B_1 = I_n is trivial, since one can easily find a C̃ such that n^{-1}C̃^TˆV̄^{-1}C̃ ∝ n^{-1}C̃^TC̃ = I_K and n p^{-1}L̃^TL̃ is diagonal. This is certainly not the case in data with more complex correlation structures, since n^{-1}C̃^TC̃ and n^{-1}C̃^TˆV̄^{-1}C̃ cannot both be multiples of I_K in general.

I lastly remark that L̂ will generally not have orthogonal columns. However, I show in Section 5.3 that, quite remarkably, all of the estimators from Algorithm 1 for arbitrary B_1, . . . , B_b are at least as efficient as those derived from standard PCA when b = 1, B_1 = I_n. And while it is not my primary goal, my proof techniques allow me to derive a central limit theorem for λ̂_r under far more general assumptions than those considered by other authors.

4 Defining and estimating the Oracle rank, factors and loadings

While Section 3 considers the case when K is known, K is typically unknown in real data. However, determining K, which is a notoriously challenging problem in data with independent samples, is particularly difficult in data with correlated samples. First, the true K may not be the most appropriate choice for K, since the added benefit of estimating factors with negligibly small effects is offset by the cost of additional statistical uncertainty. Second, given that the goal is to analyze real biological data, any estimator must be amenable to data with both large and small eigenvalues λ_r. Lastly, the estimator must avoid mistaking latent structure due to the dependence between the columns of E as arising from LC^T, which can lead to severe overestimates for K [15, 35].

To address these issues, I follow Owen et al. [12] and treat the estimation of K as a model selection problem. To define the optimal model, I set the Oracle rank, K^{(o)}, to be that which minimizes the following inverse-variance weighted generalization error:

K^{(o)} = argmin_{k∈{0,1,...,n∧p}} [|ˆV^{(k)}|^{1/n} E(‖[LC^T + Ẽ − L̂^{(k)}{Ĉ^{(k)}}^T]{ˆV^{(k)}}^{-1/2}‖_F^2 | C, E)].  (8)

Here, L̂^{(k)} ∈ R^{p×k}, Ĉ^{(k)} ∈ R^{n×k} and ˆV^{(k)} are the estimates L̂, Ĉ and ˆV̄ defined in Steps (b)(iii) and (e) at iteration k of Algorithm 1 when K_max = n ∧ p, Ẽ is independent of (C, E) and Ẽ =_d E. The term |ˆV^{(k)}|^{1/n} is identical to re-scaling ˆV^{(k)} such that |ˆV^{(k)}| = 1, and makes (8) scale-invariant. If b = 1, B_1 = I_n, K^{(o)} reduces to the Oracle rank defined in Owen et al. [12].

Assuming for simplicity that |ˆV^{(k)}| = 1, one can rewrite the generalization error in (8) as

p Tr[E(p^{-1}Ẽ^TẼ){ˆV^{(k)}}^{-1}] + ‖[LC^T − L̂^{(k)}{Ĉ^{(k)}}^T]{ˆV^{(k)}}^{-1/2}‖_F^2.  (9)

The first term evaluates the accuracy of the estimate for V̄ = E(p^{-1}Ẽ^TẼ), which, by Jensen's inequality, is minimized when ˆV^{(k)} is a scalar multiple of V̄. Therefore, weighting (8) by {ˆV^{(k)}}^{-1/2} ensures that ˆV^{(K^{(o)})} captures the variation across the columns of E. The second term in (9) measures the accuracy of L̂^{(k)}{Ĉ^{(k)}}^T as an estimator for LC^T, where weighting by {ˆV^{(k)}}^{-1/2} prioritizes components of LC^T not already explained by the estimated model for E. Note also that this term is not necessarily minimized at k = K. Instead, a factor is only included if its capacity to estimate LC^T outweighs its statistical uncertainty. I describe this precisely in Section 5.2.

Given K^{(o)}, the Oracle then must choose the best rank-K^{(o)} approximation to the rank-K latent signal matrix LC^T. For

S_k = {(C̄, L̄) ∈ R^{n×k} × R^{p×k} : n^{-1}C̄^TC̄ = I_k, L̄^TL̄ is diagonal with non-increasing elements},

I take inspiration from the generalized PCA loss considered in Allen et al. [36] and define the Oracle factors and loadings, C^{(o)} and L^{(o)}, to be

(C^{(o)}, L^{(o)}) = argmin_{(C̄,L̄)∈S_{K^{(o)}}} ‖(LC^T − L̄C̄^T)V̄^{-1/2}‖_F^2.  (10)

If K^{(o)} = K, then C^{(o)} and L^{(o)} are exactly as defined in (4). Otherwise, like (8), weighting by V̄^{-1/2} prioritizes variation in LC^T not captured by the true model for E. Note also that L^{(o)}{C^{(o)}}^T is the minimizer of (9) when ˆV^{(K^{(o)})} = V̄ is known and (9) is treated as a function of L̂^{(K^{(o)})}{Ĉ^{(K^{(o)})}}^T. Therefore, taken together with the definition of K^{(o)}, L^{(o)}{C^{(o)}}^T can be interpreted as the best approximation to LC^T whose components' signal strengths justify their estimation uncertainty. I show in Section 5 that by replacing K with K^{(o)} in Steps (d) and (e) of Algorithm 1, Algorithm 1 recovers both C^{(o)} and L^{(o)}.

I extend the procedure developed in [35] to estimate the Oracle rank in Algorithm 2 below. Implicit in Algorithm 2 is the assumption that the rows of E are independent, which is the standard assumption when estimating K in biological data [9, 12, 13, 15, 35]. Simulations in Section 6 show that Algorithm 2 is robust to dependencies commonly observed in biological data.

Algorithm 2 (CBCV+). Let K_max = ⌈η(n ∧ p)⌉ for η ∈ (0, 1) and Q ∈ R^{n×n} be sampled uniformly from the set of all n × n unitary matrices. Partition the rows of Y uniformly at random into F ≥ 2 folds.

(a) For f ∈ [F], arrange the rows of Y such that

Y = [Y^{(−f)} ; Y_f] = [L^{(−f)}C^T ; L_fC^T] + [E^{(−f)} ; E_f],

where the semicolon denotes row-wise stacking. Define Y^{(−f)} ∈ R^{p_{(−f)}×n} and Y_f ∈ R^{p_f×n} to be the training and test sets, respectively.

(b) For all k ∈ {0, 1, . . . , K_max}, obtain Ĉ ∈ R^{n×k} and ˆV_{(−f)} from Y^{(−f)} using Algorithm 1.

(c) For each k ∈ {0, 1, . . . , K_max}, let Ȳ_f = Y_f ˆV_{(−f)}^{-1/2}Q and ˆC̄ = Q^TˆV_{(−f)}^{-1/2}Ĉ. Define the loss for this fold, dimension pair as the leave-one-out cross validation loss:

L_f(k) = |ˆV_{(−f)}|^{1/n} Σ_{i=1}^n ‖Ȳ_{f,*i} − L̂_{f,(−i)}ˆC̄_{i*}‖_2^2.  (11)

Here, L̂_{f,(−i)} is the ordinary least squares regression coefficient from the regression of Ȳ_{f,(−i)} onto ˆC̄_{(−i)}, where Ȳ_{f,(−i)} and ˆC̄^T_{(−i)} are submatrices of Ȳ_f and ˆC̄^T with the ith columns removed.

(d) Repeat steps (a)–(c) for folds f = 1, . . . , F and define K̂ = argmin_{k∈{0,1,...,K_max}} {Σ_{f=1}^F L_f(k)}.
Provided the rows of E are independent, step (a) partitions Y into independent training and test sets, which are used, respectively, to determine Ĉ and ˆV_{(−f)} and to estimate the out-of-sample expected loss defined in (8). Besides ensuring that (11) approximates the expected loss in (8), re-scaling Y_f by ˆV_{(−f)}^{-1/2} in step (c) helps alleviate the deleterious effects of correlated data points in leave-one-out cross validation [37]. Further rotating the test data by Q uniformizes the leverage scores of both ˆV_{(−f)}^{-1/2}C and ˆV_{(−f)}^{-1/2}Ĉ ∈ R^{n×k}, which helps guarantee (11) is well behaved for k ≍ n. This latter point allows us to avoid the common requirement among existing estimators that K_max be bounded [26, 29, 30]. While subtle, this is quite important, as such estimators are typically sensitive to K_max [27, 31].

5 Theoretical guarantees
In all assumptions and theoretical results, I assume that Models (1) and (2) hold, where the symmetric matrices B_1, . . . , B_b ∈ R^{n×n} are observed and b, K = O(1) as n, p → ∞. I define A = p^{-1}E(LC^TCL^T), V̄ = p^{-1}Σ_{g=1}^p V_g and δ = |V̄|^{1/n} throughout, where δ = v̄_1 if b = 1, B_1 = I_n, and δ ≍ 1 is bounded above and below by constants that do not depend on n or p.

Assumption 1. Define M ∈ R^{b×b} to be M_{ij} = n^{-1}Tr(B_iB_j). Then:

(a) c^{-1}I_n ⪯ V_g, |v_{g,j}| ≤ c, c^{-1}I_b ⪯ M, ‖B_j‖_2 ≤ c for all j ∈ [b], and E{exp(t^TE_{g*})} ≤ exp(c‖t‖_2^2) for all g ∈ [p] and t ∈ R^n.

(b) C ∈ R^{n×K} is a random matrix that is independent of E, where Ψ_n = E{n^{-1}C^T(δ^{-1}V̄)^{-1}C}, c^{-1}I_K ⪯ Ψ_n ⪯ cI_K and Δ_{n,p} = ‖n^{-1}C^T(δ^{-1}V̄)^{-1}C − Ψ_n‖_2 = o_P(1) as n, p → ∞.

(c) L is a non-random matrix. The K non-zero eigenvalues of n p^{-1}LΨ_nL^T satisfy 0 < γ_K ≤ · · · ≤ γ_1 ≤ cn, where for each r ∈ [K], either limsup_{n,p→∞} γ_r < ∞ or lim_{n,p→∞} γ_r = ∞. Further, L_{g*}^TΨ_nL_{g*} ≤ c for all g ∈ [p].

The assumptions on B_j and M imply, respectively, that no one direction dominates the variation in E_{g*} and that v_g is identifiable for all g ∈ [p]. With the exception of Section 5.5, I place no assumptions on C besides what are stated in (b), where it can be shown that Δ_{n,p} = o_P(1) under general assumptions [35]. I place assumptions on γ_1, . . . , γ_K and not on the eigenvalues of A to facilitate statements regarding K^{(o)} in Section 5.2, but note that Λ_r(A) ≍ γ_r for all r ∈ [K]. Unlike previous work [10, 15, 32, 34, 35, 38], I only require γ_1 ≲ n and do not assume γ_1/γ_K is bounded as n, p → ∞. This allows one to analyze genetic data, where it is the norm rather than the exception for the data to contain both strong and weak factors [14, 34].
This assumption is more than a mere technical condition, since as I show in Section 6, many methods fail in practice when γ_1/γ_K is too large. The assumption that E_{g*} is sub-Gaussian in (a) is standard among authors who assume the entries of E_{g*} are independent and identically distributed and consider both strong and weak factors [11, 14]. The assumed dependence between the rows of E and the relationship between n and p will depend on whether or not K^{(o)} is known, which helps make my results as general as possible, and, as I show in Section 5.3, allows me to extend existing results that assume b = 1, B_1 = I_n and K is known. I lastly place an assumption on the estimates from Algorithm 1.

Assumption 2. Let α be as defined in the initialization of Algorithm 1. Then α ∈ [c^{-1}, 1 − c^{-1}] and the estimators ˆv̄ from Steps (a) and (b)(iii) in Algorithm 1 are such that

ˆv̄ ∈ Θ* = Θ ∩ {x ∈ R^b : ‖x‖_2 ≤ bc, Σ_{j=1}^b x_jB_j − (2c)^{-1}I_n ≻ 0}.

This technical condition makes the parameter space for v̄ compact, and is analogous to Assumption D in [10] and Assumption 2 in [34].

5.2 Properties of K^{(o)} and K̂

I first demonstrate the properties of K^{(o)}, as well as its estimator from Algorithm 2, K̂, in Theorem 1 below, where I let γ_0 = ∞ and γ_{K+1} = 0.

Theorem 1. Suppose Assumptions 1 and 2 hold such that Δ_{n,p} = O_P(d_{n,p}) for some non-random sequence d_{n,p} → 0 as n, p → ∞. Assume the following hold for F, η defined in Algorithm 2:

(i) The rows of E are independent, n/p → 0 as n, p → ∞ and F ∈ [2, c], η ∈ [c^{-1}, 1 − c^{-1}].

(ii) There exists an s ∈ {0} ∪ [K] such that γ_{s+1} + c^{-1} ≤ δ < γ_s.

Fix any ε > 0 and let a_{n,p} = max(n^{1/2}p^{-1/2}, n^{-1/2}, d_{n,p}). Then there exists a constant m_ε > 0 that depends on ε, but not n or p, such that if δ + a_{n,p}m_ε ≤ γ_s for all n, p suitably large,

liminf_{n,p→∞} P{K̂ = K^{(o)} = s} ≥ 1 − ε.
Remark The sequence a n , p → provides insight into how much larger γ r must be than the noiselevel δ (cid:16) to ensure both the Oracle and Algorithm 2 select the rth factor. If ∆ n , p = o P (1) and δ + c − ≤ γ s , then lim n , p →∞ P { ˆ K = K ( o ) = s } = . Theorem 1 shows Algorithm 2 tends to select the same number of factors as the Oracle, whereboth only include the factor r ∈ [ K ] if its signal strength γ r is greater than the noise level δ . Thisis congruent with the goals of the Oracle estimator established in Section 4.1, which is designedto only return factors whose signal strengths are large enough to outweigh their estimation uncer-tainty. This is contrary to parallel and analysis [13] and estimators proposed in Dobriban et al.[15], which, besides only being applicable when b = B = I n , ignore a factor’s estimationuncertainty when selecting K . This could be why the latter’s estimates for K were exceedinglylarge in their data application.The condition that n / p → n (cid:46) and 10 (cid:46) p (cid:46) . Independence between the rows of E is a standard assumption among methodswith b = B = I n and λ K (cid:46) γ K = o ( n ) [33].Theorem 1 is, as far as I am aware, the first result to establish the consistency of an estimate forthe number of latent factors in dependent data with γ (cid:46) n and γ K (cid:16)
1. This is more than a mere technical triumph. For example, several popular estimators, like parallel analysis [13], suffer from the problem of eigenvalue shadowing, in which factors with large eigenvalues prohibit the recovery of factors with moderate or small eigenvalues. Other methods, which do allow correlation between the entries of E [26, 27, 29–31], are so dependent on the assumption that γ_K ≍ n that they too consistently fail to recover factors with moderate to small eigenvalues.

Here I give theoretical results regarding the accuracy of the estimators from Algorithm 1 assuming K^(o) is known, along with theory that facilitates interpreting the latent factors C. For the remainder of Section 5, I let K_max and the iteration number k ∈ {0} ∪ [K_max] be as defined in Algorithm 1, and let λ_r^(o) = Λ_r[p^{-1} L^(o) {C^(o)}^T C^(o) {L^(o)}^T] for all r ∈ [K^(o)]. I first state an assumption that I will utilize for the remainder of Section 5.

Assumption 3. (a) p ≥ c^{-1} n, K^(o) ≥ 1 is known, γ_{K^(o)}/γ_{K^(o)+1} − 1 ≥ c^{-1}, γ_{K^(o)+1} ≤ c and n/(p γ_{K^(o)}) → 0 as n, p → ∞.

(b) One of the following holds:

(i) There exists a non-random A ∈ R^{pn×pn} with ‖A‖_2 ≤ c such that vec(E) =_d A vec(Ẽ), where the entries of Ẽ are independent with E{exp(t Ẽ_{gi})} ≤ exp(ct^2) for all t ∈ R, g ∈ [p] and i ∈ [n].

(ii) The rows of E can be partitioned into sets with at most c elements, such that E_{g∗} and E_{h∗} are independent if rows g and h are in different sets.

Assumption 3 is more general than the assumptions used to prove Theorem 1, where the assumptions on γ_{K^(o)} in (a) mirror those placed on γ_s in (ii) of Theorem 1. The typical dependence assumption E = R_1 Ẽ R_2^T for R_1 ∈ R^{p×p} and R_2 ∈ R^{n×n} corresponds to A = R_2 ⊗ R_1 [39, 40]. Condition (b)(i) is more general than that considered in Wang et al.
[11], which, besides assuming b = 1, B_1 = I_n and K was known, required Y = U D Ẽ for some unitary matrix U ∈ R^{p×p} and diagonal matrix D ∈ R^{p×p}. Condition (b)(ii) assumes genomic units can be partitioned into non-overlapping networks, and is common in DNA methylation data [41].

I first show that the bias-corrected estimates λ̂_r, defined in Algorithm 1, accurately estimate λ_r^(o).

Theorem 2. Suppose Assumptions 1, 2 and 3 hold and K_max ≥ K^(o). Then for k = K^(o),

λ̂_r/λ_r^(o) = 1 + O_P{ (γ_r p)^{-1/2} + n^{1/2}/(γ_r p) + (γ_r n)^{-1} },  r ∈ [K^(o)].  (12)

Remark 4. As far as I am aware, with the exception of the (γ_r n)^{-1} term, the rate of convergence in Theorem 2 is as fast as the best known rate for PCA when Y_{∗1}, . . . , Y_{∗n} are independent and identically distributed [14].

Remark 5. When k = K^(o), the estimator λ̂_r^(naive) = Λ_r(p^{-1} C̃ L̃^T L̃ C̃^T) that ignores the bias term (n^{-1} C̃^T V̄̂^{-1} C̃)^{-1} in Step (d) of Algorithm 1 is inflated and behaves as

λ̂_r^(naive)/λ_r^(o) ≥ 1 + c̃/λ_r^(o) + O_P{ (γ_r p)^{-1/2} + n^{1/2}/(γ_r p) + (γ_r n)^{-1} },  r ∈ [K^(o)]

for some constant c̃ > 0. If b = 1, B_1 = I_n, the inequality becomes an equality with c̃ = v̄_1 [14].

Theorem 2 and Remark 5 show that my bias-corrected estimator for λ_r^(o) corrects eigenvalue inflation. This is relevant whenever p ≫ n and λ_r^(o) is moderate or small, which is typically the case in genetic and epigenetic data. I next demonstrate the properties of L̂.

Theorem 3. Suppose the assumptions of Theorem 2 hold, fix any ε > 0, let r ∈ [K^(o)] and let F_r^(ε) be the event {λ_{r−1}^(o)/λ_r^(o), λ_r^(o)/λ_{r+1}^(o) ≥ 1 + ε}. Then for k = K^(o), a ∈ {−1, 1} and if {log(p)}/n → 0,

‖L̂_{∗r} − a L_{∗r}^(o)‖_∞ = O_P{ log(p) n^{-1/2} + n^{1/2} (γ_{K^(o)} p)^{-1/2} } on F_r^(ε).
(13)

Further, if K^(o) = K, n^{1/2}/{p γ_{K^(o)}} → 0 and the technical conditions in Section S3.1 in the Supplement hold,

[{ (Ĉ^T V̂_g^{-1} Ĉ)^{-1} }_{rr}]^{-1/2} { L̂_{gr}^(GLS) − a L_{gr}^(o) } =_d Z + o_P(1),  g ∈ [p]  (14)

as n, p → ∞, where V̂_g is the restricted maximum likelihood estimate for V_g described in Remark 1, L̂_{g∗}^(GLS) is the corresponding generalized least squares estimate for L_{g∗}^(o) using the design matrix Ĉ, and Z ∼ N(0, 1).

Remark. I show in Section S3.3 of the Supplement that P{F_r^(ε)} → 1 as n, p → ∞ under standard eigengap assumptions. The conditions that {log(p)}/n → 0 and n^{1/2}/{p γ_{K^(o)}} → 0 are standard in genetic and epigenetic data [14, 34].

Remark. I show in Section S4 of the Supplement that Theorem 2 can be leveraged to derive a central limit theorem for the eigenvalues of V(Y_{∗i}) ∈ R^{p×p} if a_{n,p} = n^{1/2}/(p γ_r) → 0, and that (13) holds with L_{∗r}^(o) replaced with a scalar multiple of the r-th eigenvector of V(Y_{∗i}). This significantly extends the eigenvalue and eigenvector convergence results in Wang et al. [11], which required a_{n,p} p^{1/2} n^{-1/2} → 0 and that U Y have independent sub-Gaussian entries for some unitary matrix U ∈ R^{p×p}. As far as I am aware, this is the first result proving the asymptotic normality of eigenvalue estimates in high dimensional data with dependent observations.

Both (13) and (14) are quite useful in practice and facilitate objective (b) from Section 1. The former implies a standard principal component plot of one column of L̂ against another mirrors the information contained in a plot of the corresponding columns of L^(o), and the latter justifies inference on the components of L^(o). This is quite important, as practitioners are often interested in determining the genomic units whose expression or methylation depends on C^(o) [1]. I lastly demonstrate the accuracy of my estimator for C^(o).

Theorem 4. Suppose the assumptions of Theorem 2 hold and let k = K^(o).
Then

‖P_{C^(o)} − P_Ĉ‖_F = O_P{ (γ_{K^(o)} p)^{-1/2} + n^{1/2}/(γ_{K^(o)} p) + (γ_{K^(o)} n)^{-1} },  |v̄̂_j − v̄_j| = O_P(n^{-1})  (15)

for all j ∈ [b]. Further, if r, ε and F_r^(ε) are as defined in Theorem 3,

|Ĉ_{∗r}^T C_{∗r}^(o)| / (‖Ĉ_{∗r}‖_2 ‖C_{∗r}^(o)‖_2) = 1 − O_P{ (γ_r p)^{-1/2} + n^{1/2}/(γ_r p) + (γ_r n)^{-1} } on F_r^(ε).  (16)

Theorem 4 shows that Algorithm 1 effectively recovers im{C^(o)}, and my novel bias-corrected estimator for C^(o) is just as accurate as the standard principal components estimator when b = 1, B_1 = I_n [14]. Like Theorem 3, this implies that a plot of one column of Ĉ against another mirrors the information contained in the plot of the corresponding columns of C^(o).

In this section, I provide the requisite theory to guarantee that one can perform accurate inference conditional on my estimate for C, which is often referred to as denoising the data matrix Y [15]. This is critical when inferring eQTLs and meQTLs, where accounting for C has been shown to reduce potential confounding and empower inference [42, 43]. It also has application in DNA methylation twin studies, in which one goal is to recover V_1, . . . , V_p to determine the latent cell type-independent heritability of DNA methylation [6, 18, 24]. Theorem 5 below is, as far as I am aware, the first result showing that denoising is possible in data with correlated samples.

Theorem 5. Suppose Assumptions 1, 2 and 3 hold with K = K^(o) and n^{1/2}/(p γ_K) → 0 as n, p → ∞. Fix a g ∈ [p] and suppose for some non-random vector s_g ∈ R^d, E_{g∗} = X_g s_g + R_g, where X_g and R_g satisfy the following:

(i) X_g and R_g are independent, mean 0 and independent of C, where X_g is observed and independent of all but at most c rows of E.
Further, d = O(1) and ‖n^{-1} X_g^T X_g − Σ_g‖_2 = o_P(1) for some non-random Σ_g ≻ 0 as n → ∞.

(ii) E{exp(t^T e)} ≤ exp(‖t‖_2^2 c) for e ∈ {X_g s_g, R_g}, V(X_g s_g) = V(τ_g) and V(R_g) = V(α_g) for some τ_g, α_g ∈ R^b, where α_g ∈ Θ∗.

Define

α̂_g = argmax_{θ ∈ Θ∗} [ −log{ |P⊥_{(Ĉ, X_g)} V(θ) P⊥_{(Ĉ, X_g)}|_+ } − Y_{g∗}^T { P⊥_{(Ĉ, X_g)} V(θ) P⊥_{(Ĉ, X_g)} }^† Y_{g∗} ]  (17)

ŝ_g = [ X_g^T { P⊥_Ĉ V(α̂_g) P⊥_Ĉ }^† X_g ]^{-1} X_g^T { P⊥_Ĉ V(α̂_g) P⊥_Ĉ }^† Y_{g∗}  (18)

to be the restricted maximum likelihood estimator for α_g and the denoised estimate for s_g. Then for ŝ_g^(known) the generalized least squares estimate for s_g from the regression of E_{g∗} onto X_g assuming α_g is known,

‖α̂_g − α_g‖_2 = o_P(1) and n^{1/2} ‖ŝ_g − ŝ_g^(known)‖_2 = o_P(1)  (19)

as n, p → ∞.

Remark. By replacing (Ĉ, X_g) in (17) with Ĉ, the proof of Theorem 5 shows that ‖V(α̂_g) − V_g‖_2 = o_P(1). This is useful in DNA methylation twin studies, where the goal is often to estimate the latent factor-adjusted heritability of DNA methylation [6, 18, 24].

Remark. In eQTL and meQTL studies, X_g is a function of the genotypes of the samples. I provide examples of how X_g is constructed in practice in Sections 6 and 7.

Equation (19) shows that inference with the denoised estimate for s_g is asymptotically equivalent to that when LC^T = 0, which is critically important in eQTL and meQTL studies. For example, I show in Section 7 that Algorithm 1 and the results of Theorem 5 can be used to perform inference to identify eQTLs that is far more powerful than existing methods.

Biologists routinely regress estimated latent factors onto observed technical and biological covariates to identify and characterize the most important sources of variation in Y.
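In the simplest possible setting (b = 1, V(α_g) = I_n, and the factor treated as known so that it stands in for Ĉ), the denoised estimator (18) reduces to least squares after projecting out the factor. The following sketch uses invented numbers and is only meant to show why the projection matters:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 150
s_true = 0.5                                    # eQTL effect to recover

C = rng.standard_normal((n, 1))                 # latent factor (stands in for C-hat)
X = rng.binomial(2, 0.3, (n, 1)).astype(float)  # genotype-like covariate X_g
y = s_true * X[:, 0] + 2.0 * C[:, 0] + 0.5 * rng.standard_normal(n)

# Project out the factor, as the P-perp terms do in (18), then run least squares.
P = np.eye(n) - C @ np.linalg.inv(C.T @ C) @ C.T
s_denoised = float(np.linalg.lstsq(P @ X, P @ y, rcond=None)[0][0])

# Ignoring the factor leaves its contribution in the residuals, inflating the error.
s_naive = float(np.linalg.lstsq(X, y, rcond=None)[0][0])
```

Because X is independent of C here, s_naive is unbiased but much noisier; the projection removes the factor's variance from the residuals, which is what drives the power gains reported in Sections 6 and 7.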
Such inference is used to perform quality control [8, 44], empower eQTL and meQTL detection algorithms [45] and make biological conclusions [1, 2]. Theorem 6 below provides the first model-based framework and set of statistical guarantees aimed at characterizing the variation in C in dependent data.

Theorem 6. Let X ∈ R^n be a random vector such that n^{-1} X^T X = σ_x + O_P(n^{-1/2}) for σ_x = E(n^{-1} X^T X). Suppose Assumptions 1, 2 and 3 hold, K^(o) = K, Λ_r(A)/Λ_{r+1}(A) ≥ 1 + c^{-1} for all r ∈ [K], E(n^{-1} C^T C) = I_K, (np)^{-1} L^T L is diagonal with decreasing diagonal elements, and the following assumptions on C hold:

(i) C = Xω^T + R, where ω ∈ R^K is non-random, R is independent of X and E(R) = 0.

(ii) For j ∈ [b], let Ψ_j ∈ R^{K×K} be a non-random, symmetric matrix such that ‖Ψ_j‖_2 ≤ c. Then V{vec(R)} = Σ_{j=1}^b Ψ_j ⊗ B_j ⪰ c^{-1} I_{nK}.

Let r ∈ [K] and let ω̂_r be the generalized least squares estimate for ω_r assuming the incorrect model Ĉ_{∗r} ∼ (Xω_r, V(θ)) for some θ ∈ R^b, where θ is estimated via restricted maximum likelihood (REML). If n^{1/2}/(p γ_r) → 0 as n, p → ∞ and the regularity conditions in Section S3.2 of the Supplement hold, the following are true:

(a) If X is dependent on at most c rows of E and the null hypothesis ω_r = 0 holds, then for θ̂ the REML estimate for θ and Z ∼ N(0, 1),

[X^T {V(θ̂)}^{-1} X]^{1/2} ω̂_r =_d Z + o_P(1) as n, p → ∞.

(b) If X is independent of E, then ω̂_r = a ω_r + O_P(n^{-1/2}) for a ∈ {−1, 1}.

Remark.
The assumption K^(o) = K is for simplicity of presentation. I state an equivalent version of Theorem 6 when K^(o) ≠ K in Section S3.4 of the Supplement. The assumptions that E(n^{-1} C^T C) = I_K and that (np)^{-1} L^T L is diagonal are without loss of generality in (a), and are used to identify ω in (b). Note that under these assumptions, (np)^{-1} L_{∗r}^T L_{∗s} = Λ_r(A) I(r = s) for all r, s ∈ [K].

Remark.
The model for V{vec(R)} assumes B_1, . . . , B_b parametrize the variance of linear combinations of the columns of C. This is natural, since B_1, . . . , B_b are constructed to parametrize the dependence between samples.

Item (b) shows Algorithm 1's estimators can be used to estimate the linear dependence between C and X, where the conditions on E(n^{-1} C^T C) and (np)^{-1} L^T L help identify the columns of C and order them from most important to least important. Item (a) has many applications, but is particularly useful in eQTL studies. There, practitioners often attempt to account for the genetic relatedness between individuals when estimating C, and subsequently test for associations between genotype X and latent factors C [4, 45]. Loci whose genotypes are correlated with C might be indicative of systematic trans-eQTLs, and modifying the genetic relatedness matrix to account for the genotypes of such SNPs has been shown to increase the power to detect eQTLs [45].

I simulated the eQTL-dependent expression of p genes measured in n samples collected from n/3 =
60 unrelated individuals to compare Algorithms 1 and 2 with other factor analysis procedures. To mirror the complexity of real data, I set K =
35 and generated 100 gene expression datasets according to Model (1), where E was simulated according to Theorem 5:

L_{gk} ∼ (1 − π_k) δ_0 + π_k N(0, τ_k),  g ∈ [p]; k ∈ [K],
C ∼ MN_{n×K}(0, I_n, I_K),
E_{g∗} = X_g s_g + R_g,  X_g = G_g ⊗ 1_3,  R_g ∼ N_n(0, I_{n/3} ⊗ M_g),  g ∈ [p],  (20)

and each s_g was drawn independently from a two-component mixture of δ_0 and a mean-zero normal distribution, where δ_0 is the point mass at 0. The vector G_g ∈ {0, 1, 2}^{n/3} contains the genotypes at a single nucleotide polymorphism (SNP) that acts as an eQTL for gene g if s_g ≠
0.

Figure 1: The average simulated λ_r/δ (top left) and γ_r/δ (top right) and the off-diagonal elements of the correlation matrices corresponding to M_1, . . . , M_p in one simulated dataset (bottom), where ρ_g^{(i,j)} = Corr(R_{gi}, R_{gj}) for i ≠ j ∈ [3] and g ∈ [p]. The dashed red line is the line y = 1.

The condition-specific intercepts Z = 1_{n/3} ⊗ I_3 were treated as observed nuisance covariates, and M_g ∈ R^{3×3} is the covariance, conditional on C and X_g, of the expression of gene g across treatment conditions, where |P⊥_Z { I_{n/3} ⊗ (p^{-1} Σ_{g=1}^p M_g) } P⊥_Z|_+ =
1. As described in (3), I redefined Y, C, X_g, R_g and E to be YQ_Z, Q_Z^T C, Q_Z^T X_g, Q_Z^T R_g and EQ_Z, respectively, prior to estimation and inference.

I chose τ_k and π_k ∈ (0, 1] so as to simulate data with strong, moderate and weak factors, where K^(o) =
30 in all simulations (Figure 1). I then used genome-wide SNP data and gene annotations from 15000 randomly selected genes from the data example in Section 7 to simulate G_g. In brief, I pruned SNPs for linkage disequilibrium, mapped SNPs to each gene's cis region, defined as a fixed window of base pairs around its transcription start site [21], and randomly chose one SNP within each cis region to act as a potential eQTL for the corresponding gene. Genotypes G_g had independent entries, were independent of R_g and were simulated assuming Hardy-Weinberg equilibrium with minor allele frequencies as estimated in Section 7, where G_g = G_h if genes g ≠ h had the same potential eQTL and G_g ⊥⊥ G_h otherwise. This implied that, on average, the expressions of 25% of all genes with eQTLs were correlated with the expression of at least one other gene. Further, since V(X_g) ∝ Q_Z^T { I_{n/3} ⊗ (1_3 1_3^T) } Q_Z, V(E_{g∗}) = Σ_{j=1}^6 v_{g,j} Q_Z^T (I_{n/3} ⊗ A_j) Q_Z = Σ_{j=1}^6 v_{g,j} B_j for some variance multipliers v_{g,j} for all g ∈ [p], where {A_1, . . . , A_6} ⊂ {0, 1}^{3×3} is a basis for the space of 3 × 3 symmetric matrices. Therefore, V_g = V(E_{g∗}) follows Model (2) with b = 6. The matrix M_g was simulated such that each condition had a different marginal variance and, as shown in Figure 1, the three pairs of conditions had different correlation coefficients. Given only the expression matrix Y, the first goal was to estimate C^(o) and λ_k^(o), which facilitate the characterization and prioritization of latent sources of variation and is a critical step in multi-condition studies [1, 2, 4, 8]. The second goal was to leverage these estimates to identify eQTLs by performing inference on s_g. Section S1 of the Supplement contains additional simulation details.

Figure 2: A comparison of Algorithm 1 (FALCO) and PCA assuming K^(o) =
30 was known, where λ̄_r^(O) and λ̄_r are the average simulated λ_r^(o) and λ_r. The standardized eigenvalue estimates for FALCO and PCA are λ̂_r/λ_r^(o) and Λ_r(p^{-1} Y^T Y)/λ_r. Eigenvalues for points in the middle figure satisfy the eigengap conditions λ_{r−1}^(o)/λ_r^(o), λ_r^(o)/λ_{r+1}^(o) ≥ 1 + ε and λ_{r−1}/λ_r, λ_r/λ_{r+1} ≥ 1 + ε for a fixed ε > 0, where λ_0^(o) = λ_0 = ∞ and λ_{31}^(o) = 0.

I first evaluated Algorithm 1's ability to recover λ_1^(o), . . . , λ_{K^(o)}^(o) and C^(o) assuming K^(o) was known by comparing it to the most commonly used method to perform factor analysis in dependent biological data, PCA [1, 2, 4, 8]. The results are given in Figure 2, where the empirical factor and subspace correlations are |Â_{∗r}^T A_{∗r}|/(‖Â_{∗r}‖_2 ‖A_{∗r}‖_2) and min_{v ∈ im(Â)\{0}} max_{u ∈ im(A)\{0}} |v^T u|/(‖v‖_2 ‖u‖_2), where A = C^(o), Â = Ĉ for FALCO, and Â, A are the first 30 right singular vectors of Y and LC^T, respectively, for PCA. These demonstrate the fidelity of Algorithm 1's bias-corrected estimates for λ_1^(o), . . . , λ_{K^(o)}^(o) and C^(o) and clearly indicate that Algorithm 1 outperforms standard PCA. As discussed in Section 3.1, PCA's poor performance can be attributed to the fact that the dependence between the columns of E precludes it from recovering factors with moderate to small eigenvalues.

Next, I assessed my method's capacity to denoise Y and discover eQTLs by evaluating its power to identify genes g with s_g ≠ 0 when K and K^(o) were unknown. I compared my method to that routinely used to denoise dependent biological data, namely using one of the methods proposed in Bai et al. [26] (BN), Ahn et al. [27] (AH), Onatski [33] (ED), Owen et al. [12] (BCV) or Dobriban [13] (PA) to estimate K, and subsequently estimating C with PCA.
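The factor and subspace correlations used in Figure 2 are easy to compute directly. The sketch below is generic (it uses plain PCA on a toy draw from Model (1), not Algorithm 1), with the subspace correlation computed as the cosine of the largest principal angle between the two column spaces:

```python
import numpy as np

rng = np.random.default_rng(2)
n, p, K = 90, 4000, 2

A = np.linalg.qr(rng.standard_normal((n, K)))[0] * np.sqrt(n)  # true factors
L = rng.standard_normal((p, K)) * np.array([1.0, 0.4])
Y = L @ A.T + rng.standard_normal((p, n))

A_hat = np.linalg.svd(Y, full_matrices=False)[2][:K].T  # top-K right singular vectors

# Factor correlation, column by column.
fac_corr = [abs(A_hat[:, r] @ A[:, r]) /
            (np.linalg.norm(A_hat[:, r]) * np.linalg.norm(A[:, r]))
            for r in range(K)]

# Subspace correlation: the smallest singular value of Q_hat^T Q, i.e. the cosine
# of the largest principal angle between im(A_hat) and im(A).
Q, Q_hat = np.linalg.qr(A)[0], np.linalg.qr(A_hat)[0]
sub_corr = np.linalg.svd(Q_hat.T @ Q)[1].min()
```

Both quantities lie in [0, 1], with values near 1 indicating accurate recovery; the paper's point is that with dependent noise PCA's factor correlations collapse for the weaker factors while Algorithm 1's do not.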
Results were nearly identical when I replaced PCA with methods that attempt to account for heterogeneity across genes, like maximum quasi-likelihood [10] or the algorithm proposed in Owen et al. [12]. To make computation tractable and to be consistent with current practice, I estimated V(R_g) via restricted maximum likelihood with each method's estimate for C, Ĉ, by assuming V(R_g) = σ_g Σ_{j=1}^6 φ_j B_j, Y_{g∗} ∼ N_n(ĈL_{g∗}, V(R_g)) and Y_{g∗} ⊥⊥ Y_{h∗} for g ≠ h ∈ [p]. I then estimated s_g with each method via generalized least squares using the design matrix [X_g Ĉ], computed P values with the normal approximation and used the Benjamini-Hochberg procedure [46] to control the false discovery rate.

Figure 3 contains the results. The fact that Algorithm 2 consistently estimates K^(o) suggests Algorithm 2 is robust to dependencies across genomic units commonly observed in genetic and epigenetic data. The gain in power using my proposed denoised estimate for s_g illustrates the importance of accounting for dependencies between E_{∗1}, . . . , E_{∗n} when estimating C. A brief discussion of each competing method is given below.

• BN, AH, ED: The theoretical arguments used in Bai et al. [26], Ahn et al. [27], and Onatski [33]
Figure 3: Estimates for K (left) and each method's power to identify non-zero s_g (right), where error bars give the first and third quartiles and the numbers above each bar denote each method's median estimate. "CBCV+" refers to Algorithm 2, and my method, "CBCV+, FALCO", uses Algorithm 2 to estimate K^(o) =
30 and Algorithm 1 to estimate C^(o). At a 5% false discovery rate, my method identified 25% more eQTLs than the next most powerful method.

to prove the consistency of their estimates for K and the subsequent fidelity of PCA's estimate for C allow for general dependence between the entries of E. However, they consistently underestimate K because their theoretical arguments and estimators rely on the assumption that λ_K ≍ n [26, 27] or V_1 = · · · = V_p and λ_K → ∞ [33]. BN_IC and BN_PC in Figure 3 refer to the IC and PC estimators defined in Bai et al. [26].

• BCV: This allows λ_1 ≍ n and λ_K ≲
1, but requires the entries of E be independent. When applied to the full data matrix Y, denoted as BCV_full in Figure 3, it severely overestimates K because it attributes dependencies between E_{∗1}, . . . , E_{∗n} as arising from LC^T. To circumvent this problem, I adopted a common strategy and let BCV_ind be the estimator that applies BCV to each of the three conditions separately [21, 47], where an accurate estimate for the total number of factors would now be roughly three times larger. However, this effectively reduces the sample size by 67%, which causes BCV_ind to underestimate K.

• PA (Parallel Analysis): This method and BCV rely on similar assumptions, except it requires λ_1 = o(n) and suffers from the "eigenvalue shadowing" problem, in which factors with large eigenvalues preclude it from recovering those with moderate to small eigenvalues [13]. This explains why the estimates for K in Figure 3 from PA_full and PA_ind, the analogues of BCV_full and BCV_ind, are smaller than those from BCV_full and BCV_ind.

I analyzed data from Knowles et al. [4] to illustrate the power of Algorithms 1 and 2 when applied to modern genetic data with dependent samples and large, moderate and small eigenvalues λ_1, . . . , λ_K. As shown in Figure 4(a), Knowles et al. measured the expression of p = 12317 genes in cardiomyocytes derived in vitro from 45 individuals, each treated with five dosages of the chemotherapeutic agent doxorubicin (n = 45 × 5 = 225). Genome-wide SNPs were also collected for each individual. This non-trivial experimental design, coupled with the fact that, as shown in Figure 4(b), λ_1, . . . , λ_K appear to span several orders of magnitude, suggests existing methods are not equipped to perform factor analysis on these data. One of the goals of this experiment was to identify eGenes, defined as genes whose expression under these conditions was regulated by at least one eQTL in the gene's cis region. To do so, Theorem 5 and the simulations in Section 6 suggest one can empower such inference by estimating C and denoising the expression matrix.
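The eGene analysis that follows ends with Benjamini-Hochberg FDR control over per-gene P values, as in Section 6. For concreteness, here is a standalone sketch of the step-up procedure (the P values are invented):

```python
import numpy as np

def benjamini_hochberg(pvals, q=0.05):
    """Boolean mask of hypotheses rejected at false discovery rate q."""
    p = np.asarray(pvals, dtype=float)
    m = p.size
    order = np.argsort(p)
    below = p[order] <= q * np.arange(1, m + 1) / m   # step-up comparison
    k = int(np.max(np.nonzero(below)[0]) + 1) if below.any() else 0
    reject = np.zeros(m, dtype=bool)
    reject[order[:k]] = True                          # reject the k smallest P values
    return reject

mask = benjamini_hochberg([0.001, 0.008, 0.039, 0.041, 0.27, 0.6])
print(mask.tolist())   # [True, True, False, False, False, False]
```

Equivalent functionality is available in statsmodels' multipletests with method="fdr_bh"; writing it out just makes the per-rank thresholds explicit.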
I therefore modeled Y as

Y_{gi} = Γ_{g∗}^T Z_{i∗} + L_{g∗}^T C_{i∗} + E_{gi},  E_{gi} = x_{g,m(i)} s_{g,d(i)} + R_{gi},  g ∈ [p]; i ∈ [n],  (21)

where Z ∈ R^{n×5} contains the dose level-specific intercepts, x_{g,m} ∈ {0, 1, 2} is individual m's genotype at gene g's potential eQTL, s_{g,d} ∈ R is the potential eQTL's effect for dose level d ∈ [5], and m(i) and d(i) are the individual and dose level for sample i. Since Γ was of little interest in Knowles et al. [4], Z was treated as a nuisance covariate. While individuals were sampled from a founder population, I found no relationship between Y and the known kinship matrix. Therefore, I assumed E_{gi} =_d E_{gi′} for d(i) = d(i′) and E_{gi} ⊥⊥ E_{gi′} for m(i) ≠ m(i′), meaning the covariance of e_{g,m′} = (E_{gi})_{i ∈ [n]: m(i) = m′} ∈ R^5 completely described V_g = V(E_{g∗}). Initial data exploration then revealed that a suitable model for V(e_{g,m′}) was V(e_{g,m′}) = α_g 1_5 1_5^T + Σ_{d=1}^5 φ_{g,d} a_d a_d^T for all g ∈ [p] and m′ ∈ [45], where a_d ∈ {0, 1}^5 is 1 in the d-th coordinate and 0 everywhere else, meaning V_g followed (2) with b = 6. With each method's estimate for K, I estimated C and, to investigate the latent variation explained by each method, the resulting mean marginal variance σ̄_d = p^{-1} Σ_{g=1}^p (α_g + φ_{g,d}) for each dose level d ∈ [5]. Figure 4(c) contains the results, where the methods AH, BCV_full and PA_full estimated K to be 2, 89 and 21, and were excluded because they were outperformed by ED, BCV_ind and PA_ind, respectively, in all comparisons. First, while CBCV+ is nominally a stochastic algorithm, there was no variation in its estimate. This is contrary to BCV, whose stochasticity gives rise to a highly variable estimator. Second, and perhaps most interestingly, my method's estimates for σ̄_d are the only estimates that are strictly increasing in administered doxorubicin dose. While not explored in Knowles et al.
[4], this is consistent with the observation that doxorubicin disrupts cardiomyocyte homeostasis in an individual- and dose-specific manner [48].

I next evaluated each method's ability to denoise Y and identify eGenes. I first pruned SNPs for linkage disequilibrium and mapped SNPs with minor allele frequencies ≥
5% to each gene's cis region. I used (21) to model expression for each gene-SNP pair, where, like Section 6.2, I estimated V(R_{g∗}) using restricted maximum likelihood with each method's estimate for C, Ĉ, assuming V(R_{g∗}) = σ_g (ᾱ B_α + Σ_{d=1}^5 φ̄_d B_d), Y_{g∗} ∼ N_n(ZΓ_{g∗} + ĈL_{g∗}, V(R_{g∗})) and Y_{g∗} ⊥⊥ Y_{h∗} for g ≠ h ∈ [p]. I computed P values for the null hypotheses H_0: s_{g,1} = · · · = s_{g,5} = 0.

Figure 4: (a): Experimental design from Knowles et al. [4] (n = 45 × 5 = 225 samples; p = 12317 genes; cardiomyocytes). (b): Algorithm 1-derived estimates for λ_r^(o). (c): Estimates for K or K^(o) and the resulting estimated mean marginal variances across doxorubicin doses of 0, 0.6, 1.25, 2.5 and 5 µM. Numbers in square brackets and the error bars give the interquartile range and first or third quartiles, respectively, for variable stochastic algorithms. (d): Number of significant eGenes identified using each method that are also (black) and are not (grey) eGenes in heart tissues in GTEx, where the number above each bar gives the fraction that are also eGenes in heart tissues in GTEx. BCV_ind was applied with K̂ =
60, its median estimate.

The eGenes identified by the less powerful methods were largely also identified using my method. Like the simulation results from Section 6.2, this suggests my method's denoised estimates are far more powerful than those from existing methods, and highlights the importance of recovering factors with ostensibly moderate or weak signal strengths. While my and Knowles et al.'s results are not directly comparable because the latter ignored the heterogeneity in dose-specific variances, it is worth noting that I identify over 20% more eGenes than Knowles et al., who chose K to maximize the number of detected eGenes.

In this work, I developed a novel framework and new, provably accurate methodology to perform factor analysis, interpret its results and utilize its estimates in modern high throughput biological data with non-trivial dependence structures and factors whose signal strengths span several orders of magnitude. I also showed that my estimate for K circumvents the ill-reputed "eigenvalue shadowing" problem, as well as the biases that accompany the "pervasive factor" assumption. I lastly used simulated and real genetic data to illustrate the power of my methodology in application.

My results and those from the existing literature suggest there is a trade-off between two critical assumptions in high dimensional factor analysis: either allow the columns of E to have unknown dependence structure but require λ_K → ∞, or allow λ_K ≲ 1 but assume knowledge of B_1, . . . , B_b. While it can be argued how relevant such prior knowledge is in other disciplines, biological practitioners have intimate knowledge of the experimental design and data collection process, and therefore will likely know B_1, . . . , B_b.
Given the results from Section 7, this suggests biological statisticians should worry less about developing methodology that satisfies the aesthetically pleasing assumption that the entries of E have arbitrary dependence, and focus more on methodology that can accommodate data with strong, moderate and weak factors.

Acknowledgements
I thank Carole Ober for providing the genetic data from Section 7, which motivated this research. I also thank Dan Nicolae for his useful comments and suggestions that have substantially improved this manuscript. This work is supported in part by NIH grant R01 HL129735.

References

[1] R. Jiang, M. J. Jones, F. Sava, M. S. Kobor, and C. Carlsten. "Short-term diesel exhaust inhalation in a controlled human crossover study is associated with changes in DNA methylation of circulating mononuclear cells in asthmatics". In:
Particle and Fibre Toxicology
Epigenetics & Chromatin. "Differential methylation between ethnic sub-groups reflects the effect of genetic ancestry and environmental exposures". In: eLife. eLife. Genome Biology.
PLOS Genetics
Proceedings of the National Academy of Sciences. ISSN: 0027-8424. [8] M. M. Soliai et al. "Multi-omics co-localization with genome-wide association studies reveals context-specific mechanisms of asthma risk variants". In: bioRxiv (Jan. 2019), p. 593558. [9] A. B. Owen and P. O. Perry. "Bi-cross-validation of the SVD and the nonnegative matrix factorization". In: The Annals of Applied Statistics. [10] J. Bai and K. Li. "Statistical analysis of factor models of high dimension". In: The Annals of Statistics
Ann. Statist.
Statistical Science
Permutation methods for factor analysis and PCA. 2017. eprint: arXiv:1710.00479. [14] C. McKennan and D. Nicolae. "Accounting for unobserved covariates with varying degrees of estimability in high-dimensional biological data". In: Biometrika. ISSN: 0006-3444. [15] E. Dobriban and A. B. Owen. "Deterministic parallel analysis: an improved method for selecting factors and principal components". In: Journal of the Royal Statistical Society: Series B (Statistical Methodology). [16] Estimating Number of Factors by Adjusted Eigenvalues Thresholding. 2019. eprint: arXiv:1909.10710. [17] D. Martino, Y. J. Loke, L. Gordon, M. Ollikainen, M. N. Cruickshank, R. Saffery, and J. M. Craig. "Longitudinal, genome-scale analysis of DNA methylation in twins from birth to 18 months of age reveals rapid epigenetic change in early life and pair-specific effects of discordance". In: Genome Biology
Epigenetics
Cell Systems bioRxiv (Apr. 2020). doi : https://doi.org/10.1101/339770 .[21] GTEx Consortium. “Genetic e ff ects on gene expression across human tissues”. In: Nature
550 (Oct. 2017).[22] C. C. Y. Wong et al. “Genome-wide DNA methylation profiling identifies convergent molec-ular signatures associated with idiopathic and syndromic autism in post-mortem humanbrain tissue”. In:
Human molecular genetics
Genome Research doi : .[24] Q. Tan, B. T. Heijmans, J. v. B. Hjelmborg, M. Soerensen, K. Christensen, and L. Chris-tiansen. “Epigenetic drift in the aging genome: a ten-year follow-up in an elderly twin co-hort”. In: International Journal of Epidemiology issn :0300-5771. doi : .[25] J. Tung, X. Zhou, S. C. Alberts, M. Stephens, Y. Gilad, and E. T. Dermitzakis. “The geneticarchitecture of gene expression levels in wild baboons”. In: eLife Econometrica
Econometrica
Journal of the Royal Statistical Society: Series B (Statistical Methodology). "…effects". In: Journal of Econometrics. "…effects and Multiple Structural Breaks". In: Journal of the American Statistical Association
Journal of the American Statistical Association
Biometrika (Feb. 2020). issn : 0006-3444. doi : .[33] A. Onatski. “Determining the number of factors from empirical distribution of eigenvalues”.In: The Review of Economics and Statistics issn : 00346535,15309142.[34] J. Wang, Q. Zhao, T. Hastie, and A. B. Owen. “Confounder adjustment in multiple hypoth-esis testing”. In:
The Annals of Statistics
Journal of the American Statistical Association (May2020), pp. 1–32.[36] G. I. Allen, L. Grosenick, and J. Taylor. “A Generalized Least-Square Matrix Decompo-sition”. In:
Journal of the American Statistical Association
Journal of Nonparametric Statistics. "…efficient adjustment". In: Biometrika. "…differential expression analyses for RNA-sequencing and microarray studies". In: Nucleic Acids Research. ISSN: 0305-1048. [40] A. Touloumis, J. C. Marioni, and S. Tavaré. "HDTD: analyzing multi-tissue gene expression data". In: Bioinformatics (Oxford, England)
American journal of human genetics
Proceedings of the National Academy of Sci-ences issn : 0027-8424. doi : .[43] B. L. Pierce et al. “Co-occurring expression and methylation QTLs allow detection of com-mon causal variants and shared biological mechanisms”. In: Nature Communications
JCI Insight
PLOS Computational Biology
Journal of the Royal Statistical Society: SeriesB issn : 00359246.[47] T. Flutre, X. Wen, J. Pritchard, and M. Stephens. “A Statistical Framework for Joint eQTLAnalysis in Multiple Tissues”. In:
PLOS Genetics
Nature medicine
Bioinformatics issn :1367-4803. doi : .[50] Y. Eldar and G. Kutyniok. Compressed Sensing: Theory and Applications . Cambridge Uni-versity Press, 2012.[51] K. Zajkowski.
Bounds on tail probabilities for quadratic forms in dependent sub-gaussianrandom variables . 2018. eprint: arXiv:1809.08569 .[52] F. Benaych-Georges and S. Péché. “Localization and delocalization for heavy tailed bandmatrices”. en. In:
Annales de l’I.H.P. Probabilités et statistiques
The Annalsof Mathematical Statistics upplementary material for “Factor analysis in high dimensionalbiological data with dependent observations”
S1 Additional simulation details
Here I provide the values for $\pi_k$ and $\tau_k$, defined in (20), that were used to simulate $L$.

[Table: values of $\pi_k$ and $\tau_k$ for $k = 1, \ldots, 35$.]

S2 Notation used for the remainder of the Supplementary Material
In addition to the notation used throughout the main text, I use the following notation throughout the remainder of the supplement. For any matrix $M \in \mathbb{R}^{n \times m}$, define $Q_M \in \mathbb{R}^{n \times \dim\{\ker(M^T)\}}$ to be a matrix whose columns form an orthonormal basis for $\ker(M^T)$. For any $x \in \mathbb{R}^b$, define $V(x) = \sum_{j=1}^b x_j B_j$. Unless otherwise stated, for any sequence $X_n \in \mathbb{R}^{r \times s}$, $n \geq 1$, I use the notation $X_n = O_P(a_n)$ and $X_n = o_P(a_n)$ if $\|X_n\|/a_n = O_P(1)$ and $\|X_n\|/a_n = o_P(1)$ as $n \to \infty$, respectively, where $\|X_n\|$ is the usual operator norm. We treat vectors $X_n \in \mathbb{R}^r$ as matrices with one column.

S3 Technical conditions for theory presented in the main text
S3.1 Theorem 3
Recall that (14) in Theorem 3 required additional technical assumptions. Assumption S1 below lists those conditions.

Assumption S1. Let $c > 0$ be a constant not dependent on $n$ or $p$, let $r \in [K^{(o)}]$ be as defined in the statement of Theorem 3 and let $g \in [p]$.
(a) $E_{g*}$ is dependent on at most $c$ rows of $E$, and is independent of all others.
(b) $P\{F_r^{(\epsilon)}\} \to 1$ as $n, p \to \infty$.
(c) Let $\hat{v}_g$ be the restricted maximum likelihood (REML) estimate for $v_g$ defined in Remark 1, where $\hat{V}_g = V(\hat{v}_g)$. Then the optimization to determine $\hat{v}_g$ is restricted to the parameter space $\Theta_*$.
(d) For $a_r \in \mathbb{R}^K$ the $r$th standard basis vector, the quantity
$$[\{(C^T V_g^{-1} C)^{-1}\}_{rr}]^{-1/2}\, a_r^T (C^T V_g^{-1} C)^{-1} C^T V_g^{-1} E_{g*}$$
is asymptotically $N(0, 1)$.

Asymptotic normality in (d) is satisfied in the following general scenario:
(1) $E_{g*} \overset{d}{=} A_g \tilde{e}_g$, where $A_g A_g^T = V_g$ and $\tilde{e}_g \in \mathbb{R}^n$ has independent entries with uniformly bounded sub-Gaussian norm.
(2) The entries of $A_g^{-1} C$ have uniformly bounded fourth moments.

If (d) does not hold but all other conditions do hold, (14) can be replaced with
$$[\{(\hat{C}^T \hat{V}_g^{-1} \hat{C})^{-1}\}_{rr}]^{-1/2}\{\hat{L}_{gr}^{(GLS)} - a L_{gr}^{(o)}\} \overset{d}{=} W + o_P(1),$$
where $W \overset{\cdot}{\sim} (0, 1)$. I give sufficient conditions to guarantee that (b) holds in Section S3.3 below.

S3.2 Theorem 6
The conditions referenced in the statement of Theorem 6 are given below.
Assumption S2. Let $r \in [K]$, let $X$ and $R$ be as defined in Theorem 6 and let $\tilde{c} > 0$ be a constant not dependent on $n$ or $p$.
(a) $\hat{\theta}$ is restricted to the convex set $\{\theta \in \mathbb{R}^b : (2c)^{-1} I_n \preceq V(\theta) \preceq bc\, I_n\}$, where $c$ is as defined in the first paragraph of Section 5.1.
(b) $\|n^{-1} C^T C - E(n^{-1} C^T C)\| = O_P(n^{-1/2})$.
(c) Let $a_r \in \mathbb{R}^K$ be the $r$th standard basis vector. Then for $\tilde{\theta} \in \mathbb{R}^b$ such that $\tilde{\theta}_j = a_r^T \Psi_j a_r$ for all $j \in [b]$,
$$[X^T \{V(\tilde{\theta})\}^{-1} X]^{-1/2} X^T \{V(\tilde{\theta})\}^{-1} R_{*r}$$
is asymptotically $N(0, 1)$.

It is straightforward to find general conditions under which item (b) is satisfied (see Remark S11, for example). Asymptotic normality in (c) holds under the following conditions:
(1) $\mathrm{vec}(R) \overset{d}{=} D\Xi$, where $D$ is a non-random matrix that satisfies $DD^T = \sum_{j=1}^b \Psi_j \otimes B_j$ and $\Xi \in \mathbb{R}^{nK}$ is a random vector with independent entries such that $E(\Xi) = 0$, $E(\Xi_i^2) = 1$ and $E(\Xi_i^4) < \tilde{c}$ for all $i \in [nK]$.
(2) Let $D = (D_{rs})_{r,s \in [K]}$, where $D_{rs} \in \mathbb{R}^{n \times n}$. Then the entries of $D_{rs} X$ have uniformly bounded fourth moments. Sufficient conditions for this to hold are:
(a) $X$ has mean $0$ and is sub-exponential with uniformly bounded sub-exponential norm (see Eldar et al. [50] for a definition of sub-exponential random vectors). This follows from the fact that the columns of $D$ have uniformly bounded 2-norm.
(b) $X \overset{d}{=} H\tilde{X}$, where $H \in \mathbb{R}^{n \times n}$ is a non-random matrix with $\|H\| \leq c$, and $\tilde{X}$ has mean $0$, independent entries and $E(\tilde{X}_i^4) < c$ for all $i \in [n]$.

If item (c) in Assumption S2 does not hold, $Z$ in (a) of Theorem 6 can be replaced with $W$ for $W \overset{\cdot}{\sim} (0, 1)$.

S3.3 Conditions that guarantee $P\{F_r^{(\epsilon)}\} \to 1$

Here I give conditions that ensure $P\{F_r^{(\epsilon)}\} \to 1$ as $n, p \to \infty$, where $F_r^{(\epsilon)}$ was defined in Theorem 3. I study this by considering two scenarios: $K^{(o)} = K$ and $K^{(o)} < K$.

Proposition S2. Suppose Assumptions 1 and 3 hold with $K = K^{(o)}$. Then for $\tau_r = \Lambda_r\{E(p^{-1} L C^T C L^T)\}$ and $0 = \tau_{K+1} < \tau_K \leq \cdots \leq \tau_1 < \tau_0 = \infty$,
$$\lim_{n,p \to \infty} P\{F_r^{(\epsilon)}\} = 1$$
if $\tau_{r-1}/\tau_r,\ \tau_r/\tau_{r+1} \geq 1 + \beta$ and $\|E(n^{-1} C^T C) - n^{-1} C^T C\| = o_P(1)$ as $n, p \to \infty$, where $\beta > 0$ is a constant that does not depend on $n$ or $p$.

Proof. This follows directly from Lemma S13. □
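The eigengap requirement $\tau_{r-1}/\tau_r,\ \tau_r/\tau_{r+1} \geq 1 + \beta$ asks only that the $r$th signal eigenvalue be separated from its neighbours by a constant factor. A minimal numerical sketch of this condition (the eigenvalues and the helper name `separated` are hypothetical, for illustration only):

```python
import numpy as np

def separated(tau, r, beta):
    """Check tau_{r-1}/tau_r >= 1 + beta and tau_r/tau_{r+1} >= 1 + beta,
    using the conventions tau_0 = inf and tau_{K+1} = 0 from Proposition S2
    (a ratio involving tau_0 or tau_{K+1} is treated as infinite)."""
    K = len(tau)
    left = np.inf if r == 1 else tau[r - 2] / tau[r - 1]
    right = np.inf if r == K else tau[r - 1] / tau[r]
    return bool(left >= 1 + beta and right >= 1 + beta)

# Hypothetical signal eigenvalues: factors 2 and 3 are nearly tied, so the
# eigengap condition fails for r = 2 and r = 3 but holds for r = 1 and r = 4.
tau = [100.0, 10.0, 9.0, 1.0]
print([separated(tau, r, beta=0.5) for r in (1, 2, 3, 4)])  # [True, False, False, True]
```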
I next state and prove an analogous proposition for the case $K^{(o)} < K$.

Proposition S3. Suppose Assumptions 1 and 3 hold with $K^{(o)} < K$, and, without loss of generality, assume $E(n^{-1} C^T C) = I_K$ and $D = np^{-1} L^T L = \mathrm{diag}(\tau_1, \ldots, \tau_K)$, where $\tau_1 \geq \cdots \geq \tau_K > 0$ are defined in Proposition S2. Define the non-random unitary matrix $W \in \mathbb{R}^{K \times K}$ to be such that
$$W^T D^{1/2} E\{n^{-1} C^T (\delta^{-1} \bar{V})^{-1} C\} D^{1/2} W = \mathrm{diag}(\gamma_1, \ldots, \gamma_K),$$
and for $W^{(K^{(o)})} \in \mathbb{R}^{K \times K^{(o)}}$ the first $K^{(o)}$ columns of $W$, let
$$d_r = \Lambda_r[D^{1/2} W^{(K^{(o)})} \{W^{(K^{(o)})}\}^T D^{1/2}] = \Lambda_r[\{W^{(K^{(o)})}\}^T D W^{(K^{(o)})}], \quad r \in [K^{(o)}].$$
Assume the following hold for some constant $c > 0$ that does not depend on $n$ or $p$:
(i) $C$ satisfies $\|n^{-1} C^T \Delta C - E(n^{-1} C^T \Delta C)\| = O_P(n^{-1/2})$ for any symmetric, positive definite $\Delta \in \mathbb{R}^{n \times n}$ such that $\|\Delta\| \leq c$.
(ii) $\gamma_k/\gamma_{k+1} \geq 1 + c^{-1}$ for all $k \in [K]$ and $d_r/d_{r+1} \geq 1 + c^{-1}$ for all $r \in [K^{(o)}]$, where $\gamma_{K+1} = d_{K^{(o)}+1} = 0$.
Then $\lim_{n,p \to \infty} P\{F_r^{(\epsilon)}\} = 1$ for all $r \in [K^{(o)}]$.

Remark
S1. I show in Lemma S23 that under these assumptions $\lambda_r^{(o)} = d_r\{1 + O_P(n^{-1/2})\}$ and $d_r \in [\tau_r - \tilde{c}\tau_{K^{(o)}+1}, \tau_r + \tilde{c}\tau_{K^{(o)}+1}]$ for some constant $\tilde{c} > 0$ that does not depend on $n$ or $p$. The latter implies that $d_r \approx \tau_r$ if $\tau_{K^{(o)}+1}$ is small (i.e. $\tau_{K^{(o)}+1}/\tau_r \ll 1$).

Remark S2. Remark S11 in Section S10 discusses general scenarios in which (i) holds.

Proof. This follows directly from Lemma S23, stated in Section S10. □

S3.4 Restating Theorem 6 when $K^{(o)} < K$

Theorem 6 can be extended to accommodate the case $K^{(o)} < K$. A restatement of the theorem that accommodates this scenario is given below.

Theorem S1 (Restatement of Theorem 6 when $K^{(o)} < K$). Let $X$ be as defined in the statement of Theorem 6. Suppose Assumptions 1, 2 and 3 hold, $K^{(o)} < K$, and let $W^{(K^{(o)})}, d_1, \ldots, d_{K^{(o)}}$ be as defined in Proposition S3. Assume the following conditions hold for some constant $c > 0$ that does not depend on $n$ or $p$:
(i) $C$ and $L$ satisfy the identifiability conditions from the statement of Proposition S3.
(ii) $C = X\omega^T + R$, where $X$ is independent of $R$, $\omega \in \mathbb{R}^K$ is a constant and $E(R) = 0$. For $j \in [b]$, let $\Psi_j \in \mathbb{R}^{K \times K}$ be a non-random, symmetric matrix such that $\|\Psi_j\| \leq c$. Then $V\{\mathrm{vec}(R)\} = \sum_{j=1}^b \Psi_j \otimes B_j \succeq c^{-1} I_{nK}$. Further, $\|E(X)\| \leq cn^{1/2}$ and $\Xi$ satisfies one of the following for each $\Xi \in \{X - E(X), R\}$:
(1) $\mathrm{vec}(\Xi) = G\Delta$, where $G$ is a non-random square matrix that satisfies $\|G\| \leq c$, and $\Delta$ has mean $0$ and $E(\Delta_i^4) < c$ for every entry $i$ of $\Delta$.
(2) $E[\exp\{\mathrm{vec}(\Xi)^T t\}] \leq \exp(c\|t\|^2)$ for all $t$.
(iii) Condition (a) from Assumption S2 holds.
(iv) $\gamma_1, \ldots, \gamma_K$ and $d_1, \ldots, d_{K^{(o)}}$ satisfy Condition (ii) from the statement of Proposition S3.
Let $U \in \mathbb{R}^{K \times K^{(o)}}$ be a non-random matrix with orthonormal columns whose columns are the eigenvectors of $D^{1/2} W^{(K^{(o)})} \{W^{(K^{(o)})}\}^T D^{1/2}$. Then for some constant $c > 0$ that does not depend on $n$ or $p$, the following hold for all $r \in [K^{(o)}]$:
(a) Suppose $X$ is dependent on at most $c$ rows of $E$ and $\omega = 0$. Then if $n^{1/2}/(p\gamma_r) \to 0$ and for $\theta \in \mathbb{R}^b$ such that $\theta_j = U_{*r}^T \Psi_j U_{*r}$,
$$n^{1/2}\left|[X^T\{V(\hat{\theta})\}^{-1}X]^{-1}X^T\{V(\hat{\theta})\}^{-1}\hat{C}_{*r} - [X^T\{V(\theta)\}^{-1}X]^{-1}X^T\{V(\theta)\}^{-1}C_{*r}^{(o)}\right| = o_P(1)$$
$$[X^T\{V(\hat{\theta})\}^{-1}X]^{-1/2}X^T\{V(\hat{\theta})\}^{-1}\hat{C}_{*r} \overset{d}{=} Z + o_P(1),$$
where $Z \overset{\cdot}{\sim} (0, 1)$.
(b) Suppose $X$ is independent of $E$. Then for $a \in \{-1, 1\}$ and if $n^{1/2}/(p\gamma_r) \to 0$,
$$[X^T\{V(\hat{\theta})\}^{-1}X]^{-1}X^T\{V(\hat{\theta})\}^{-1}\hat{C}_{*r} = a\omega^T U_{*r} + O_P(n^{-1/2}).$$
The proof of this theorem and of Theorem 6 are given in Section S7.
Remark S3. If $K^{(o)} = K$, then $U = I_K$ and we can replace Condition (iv) with $\tau_{r-1}/\tau_r,\ \tau_r/\tau_{r+1} \geq 1 + c^{-1}$ for all $r \in [K]$, where $\tau_r$ is as defined in Proposition S2. When $K^{(o)} < K$, I show in Lemma S23 that
$$U_{tr} = \begin{cases} O(\gamma_{K^{(o)}+1}/\gamma_t) & \text{if } t < r \\ O\{\gamma_{(K^{(o)}+1) \vee t}/\gamma_r\} & \text{if } t > r \end{cases}, \quad t \in [K] \setminus \{r\}.$$

Remark S4. Asymptotic normality in result (a) holds if $[X^T\{V(\theta)\}^{-1}X]^{-1/2}X^T\{V(\theta)\}^{-1}CU_{*r}$ is asymptotically normal.

S4 The eigenvalues and eigenvectors of $V(Y_{*i})$

Here I derive properties of the eigenvalues and eigenvectors of the population covariance matrices $V(Y_{*i})$. To do so, we need the following assumption:

Assumption S3.
Let $c > 0$ be a constant. Then $K \geq 1$ is known, $\gamma_K \geq c^{-1}$ and the following hold:
(a) $E(C) = 0$; $C_{1*}, \ldots, C_{n*}$ are identically distributed; $V(C_{1*}) = I_K$; and $\Lambda_r(LL^T)/\Lambda_{r+1}(LL^T) \geq 1 + c^{-1}$ for all $r \in [K]$.
(b) $n^{1/2}\,\mathrm{vec}(n^{-1}\sum_{i=1}^n C_{i*}C_{i*}^T - I_K) \overset{d}{=} W + o_P(1)$ as $n \to \infty$, where $W \sim N(0, G)$ for some non-singular $G \in \mathbb{R}^{K^2 \times K^2}$.

The assumption that the rows of $C$ are identically distributed will likely hold when the $n$ samples arise from a common population. Some examples include samples collected on related individuals (e.g. twin studies, samples related through a kinship matrix, etc.), data with repeated measurements and multi-tissue data collected from similar tissues, among others. Theorem S2 gives the asymptotic properties of the eigenvalues and eigenvectors of $V(Y_{*i})$.

Theorem S2. Suppose Assumptions 1, 2, 3 and S3 hold, and $K^{(o)} = K$. Then for $\eta_r^{(i)} = np^{-1}\Lambda_r\{V(Y_{*i})\}$ and $u_r^{(i)}$ the $r$th eigenvector of $V(Y_{*i})$,
$$n^{1/2}(\hat{\lambda}_r^{(o)}/\eta_r^{(i)} - [1 + O_P\{n/(\gamma_r p)\}]) \overset{d}{=} z_r + o_P(1), \quad r \in [K] \quad (S1)$$
$$\max_{i \in [n]} \|a\hat{L}_{*r} - \{(p/n)\,\eta_r^{(i)}\}^{1/2} u_r^{(i)}\|_{\infty} = O_P\{\log(p)n^{-1/2} + n^{1/2}(\gamma_{K^{(o)}} p)^{-1/2}\}, \quad r \in [K] \quad (S2)$$
as $n, p \to \infty$, where $(z_1, \ldots, z_K)^T \sim N(0, \tilde{G})$ does not depend on $i$, $\tilde{G}_{st} = G_{(s-1)K+s,(t-1)K+t}$ for all $s, t \in [K]$, $a \in \{-1, 1\}$ and the error term $o_P(1)$ is uniform across $i = 1, \ldots, n$.

Proof. Assumptions 1 and 3 imply $\|V(E_{*i})\| \leq c$ for some constant $c > 0$ that does not depend on $n$ or $p$. Weyl's theorem therefore implies, for $A = E(p^{-1}LC^TCL^T) = np^{-1}LL^T$,
$$\left|\Lambda_r(A) - \eta_r^{(i)}\right| \leq np^{-1}c,$$
and by the eigengap assumption on $A$ in Assumption S3,
$$\left\|\tilde{u}_r^{(i)} - u_r^{(i)}\right\| \leq c_2\, n/(p\gamma_r)$$
by Lemma S17, for $\tilde{u}_r^{(i)}$ the $r$th eigenvector of $A$ and some constant $c_2 > 0$ that does not depend on $n$ or $p$. The rest of the proof follows from Theorems 2 and 3, as well as the co-factor expansion argument utilized in Appendix A of Wang et al. [11]. □

Theorem S2 shows that Algorithm 1 recovers both the eigenvalues and eigenvectors of the gene-by-gene covariance matrix, where, like (13), (S2) implies that principal components plots of $\hat{L}$ mirror the information contained in the population eigenvectors. The result in (S1) shows that $\hat{\lambda}_r^{(o)}$ is consistent and asymptotically normal if $a_n = n^{1/2}/(\gamma_r p) \to 0$, which is the first result proving the existence of consistent and asymptotically normal estimators for population eigenvalues of nearly arbitrary size in data with correlated samples. Theorem S2 also significantly extends the results of Wang et al. [11], which, in order to show the asymptotic normality of sample eigenvalues, required (1) that $UY \in \mathbb{R}^{p \times n}$ have independent entries for some unitary matrix $U \in \mathbb{R}^{p \times p}$ and (2) that $(p^{1/2}n^{-1/2})a_n \to 0$.
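As a sanity check on the scaling in Theorem S2, the following sketch simulates a toy version of model (1) with independent noise (so $V(Y_{*i}) = LL^T + I_p$) and compares the top sample eigenvalues of $p^{-1}Y^TY$ with $\eta_r^{(i)} = np^{-1}\Lambda_r\{V(Y_{*i})\}$. The dimensions and signal strengths are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(2)
n, p, K = 50, 1000, 2

L = rng.normal(size=(p, K)) * np.array([3.0, 1.0])  # two signal strengths
C = rng.normal(size=(n, K))
E = rng.normal(size=(p, n))  # independent noise, so V(E_{*i}) = I_p
Y = L @ C.T + E

# eta_r = n p^{-1} Lambda_r{V(Y_{*i})} with V(Y_{*i}) = L L^T + I_p; the
# nonzero eigenvalues of L L^T equal those of the K x K matrix L^T L.
eta = n / p * (np.sort(np.linalg.eigvalsh(L.T @ L))[::-1] + 1.0)

# Sample counterpart: the top-K eigenvalues of p^{-1} Y^T Y.
lam_hat = np.sort(np.linalg.eigvalsh(Y.T @ Y / p))[::-1][:K]
print(lam_hat / eta)  # both ratios are near 1
```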
S5 Proof of Proposition 1

Here I prove Proposition 1.

Proof of Proposition 1. Let $S = p^{-1}Y^TY$. We can re-write the objective function in (7) as
$$\mathrm{Tr}\{(U^T\bar{V}^{-1}S\bar{V}^{-1}U)(U^T\bar{V}^{-1}U)^{-1}\} = \mathrm{Tr}(\tilde{U}^T\bar{V}^{-1/2}S\bar{V}^{-1/2}\tilde{U}),$$
where $U \in \mathbb{R}^{n \times k}$ is any matrix such that $\mathrm{im}(U) = \mathrm{im}(H)$ and $\tilde{U} = \bar{V}^{-1/2}U(U^T\bar{V}^{-1}U)^{-1/2}$. Since the latter has orthonormal columns, the objective achieves its maximum when the columns of $\tilde{U}$ are the first $k$ eigenvectors of $\bar{V}^{-1/2}S\bar{V}^{-1/2}$, which completes the proof. □

S6 Estimating the eigenvalues and eigenvectors of $C(p^{-1}L^TL)C^T$

S6.1 Preliminaries
Without loss of generality, we may assume $np^{-1}L^TL = \mathrm{diag}(\lambda_1, \ldots, \lambda_K)$ and $n^{-1}C^TC = I_K$. We utilize techniques similar to those developed in McKennan et al. [14]. For any estimate $\hat{V} = \sum_{j=1}^b(\bar{v}_j + \epsilon_j)B_j$ of $\bar{V} = E(p^{-1}E^TE)$, define
$$\epsilon_V = \|\bar{V} - \hat{V}\| \quad (S3a)$$
$$\tilde{C} = \hat{V}^{-1/2}C(C^T\hat{V}^{-1}C)^{-1/2}U \quad (S3b)$$
$$\tilde{L} = p^{-1/2}L(C^T\hat{V}^{-1}C)^{1/2}U \quad (S3c)$$
$$Q = Q_{\hat{V}^{-1/2}C} \quad (S3d)$$
where $U \in \mathbb{R}^{K \times K}$ is a rotation matrix such that
$$\tilde{L}^T\tilde{L} = \mathrm{diag}(\tau_1, \ldots, \tau_K), \quad 0 = \tau_{K+1} < \tau_K \leq \cdots \leq \tau_1 < \tau_0 = \infty. \quad (S4)$$
By Lemma S13 in Section S10 and Assumptions 1 and 2, this implies that $c^{-1}\lambda_k \leq \tau_k \leq c\lambda_k$ for some constant $c > 1$ that does not depend on $n$ or $p$. Therefore, for any constant $c_1 > 1$, there exists a $c_2 > 1$ that does not depend on $n$ or $p$ such that $\lambda_r/\lambda_{r+1} > c_2$ implies $\tau_r/\tau_{r+1} > c_1$, regardless of the choice of $\hat{V}$. We use this to inductively define the indices $k_1, \ldots, k_J \in [K]$ at which an eigengap occurs. First, for some arbitrary constants $c_2 > c_1 > 1$, define
$$k_1 = \min(\{r \in [K] : \lambda_r/\lambda_{r+1} > c_2\}).$$
If $k_1 = K$, we are done. Otherwise, define $k_j$ inductively as
$$k_j = \min(\{r \in \{k_{j-1}+1, \ldots, K\} : \lambda_r/\lambda_{r+1} > c_2\}).$$
We set $k_0 = 0$ and let $J \in [K]$ be such that $k_J = K$. We refer to these indices $k_1, k_2, \ldots, k_J$ throughout the supplement.

Following McKennan et al. [14], we define
$$E_1 = E\hat{V}^{-1/2}\tilde{C} \in \mathbb{R}^{p \times K}, \quad E_2 = E\hat{V}^{-1/2}Q \in \mathbb{R}^{p \times (n-K)} \quad (S5)$$
$$S = \begin{pmatrix}\tilde{C}^T \\ Q^T\end{pmatrix}(p^{-1}\hat{V}^{-1/2}Y^TY\hat{V}^{-1/2})(\tilde{C}\ \ Q) = \begin{pmatrix}(\tilde{L} + p^{-1/2}E_1)^T(\tilde{L} + p^{-1/2}E_1) & (\tilde{L} + p^{-1/2}E_1)^T(p^{-1/2}E_2) \\ (p^{-1/2}E_2)^T(\tilde{L} + p^{-1/2}E_1) & p^{-1}E_2^TE_2\end{pmatrix}. \quad (S6)$$
If $(\hat{v}^T\ \hat{z}^T)^T \in \mathbb{R}^{n \times K}$, for $\hat{v} \in \mathbb{R}^{K \times K}$ and $\hat{z} \in \mathbb{R}^{(n-K) \times K}$, are the first $K$ eigenvectors of $S$, then $\tilde{C}\hat{v} + Q\hat{z} \in \mathbb{R}^{n \times K}$ are the first $K$ right singular vectors of $Y$. Our first goal is to understand $\hat{v}$ and $\hat{z}$.

S6.2 The top-left $K \times K$ block of $S$

We first develop theory to understand the behavior of the upper-left block of $S$, defined as $(\tilde{L} + p^{-1/2}E_1)^T(\tilde{L} + p^{-1/2}E_1) \in \mathbb{R}^{K \times K}$.

Lemma
S1. Suppose Assumptions 1, 2 and 3 hold, and define $\tilde{N} = \tilde{L} + p^{-1/2}E_1$, $\phi_1 = p^{-1/2}(1 + n^{1/2}\epsilon_V)$ and $\phi_2 = \epsilon_V$. Then
$$\mu_s = \Lambda_s(\tilde{N}^T\tilde{N}) = \tau_s + \delta + O_P(\phi_1\lambda_s^{1/2} + \phi_2), \quad s \in [K]. \quad (S7)$$
Additionally, let
$$V = \begin{pmatrix} V_{11} & V_{12} & \cdots & V_{1J} \\ V_{21} & V_{22} & \cdots & V_{2J} \\ \vdots & \vdots & \ddots & \vdots \\ V_{J1} & V_{J2} & \cdots & V_{JJ} \end{pmatrix} \in \mathbb{R}^{K \times K}$$
be the right singular vectors of $\tilde{N}$, where $V_{rs} \in \mathbb{R}^{(k_r - k_{r-1}) \times (k_s - k_{s-1})}$ for $r, s \in [J]$ and $k_r, k_s$ are defined in Section S6.1. Then
$$\|I_{(k_r - k_{r-1})} - V_{rr}^T V_{rr}\|,\ \|I_{(k_r - k_{r-1})} - V_{rr} V_{rr}^T\| = O_P\{(\phi_1\lambda_{k_r}^{-1/2} + \phi_2\lambda_{k_r}^{-1})^2\}, \quad r \in [J] \quad (S8a)$$
$$\|V_{rs}\| = O_P(\phi_1\lambda_{k_{\min(r,s)}}^{-1/2} + \phi_2\lambda_{k_{\min(r,s)}}^{-1}), \quad r \neq s \in [J]. \quad (S8b)$$
Proof.
First,
$$\tilde{N}^T\tilde{N} = \mathrm{diag}(\tau_1, \ldots, \tau_K) + \underbrace{p^{-1/2}\tilde{L}^TE_1 + (p^{-1/2}\tilde{L}^TE_1)^T}_{A^{(1)}} + \underbrace{p^{-1}E_1^TE_1}_{A^{(2)}}.$$
We derive the properties of $A^{(1)}$ and $A^{(2)}$ below.
(1) Define $\bar{L} = n^{1/2}p^{-1/2}L(np^{-1}L^TL)^{-1/2}$, $\hat{M} = (n^{-1}C^T\hat{V}^{-1}C)^{1/2}U$ and $\Delta = \hat{V}^{-1} - V^{-1}$. By definition,
$$(np^{-1}L^TL)^{1/2}\hat{M} = \hat{W}\,\mathrm{diag}(\tau_1^{1/2}, \ldots, \tau_K^{1/2}), \quad \tilde{L} = \bar{L}\hat{W}\,\mathrm{diag}(\tau_1^{1/2}, \ldots, \tau_K^{1/2}),$$
where $\hat{W} \in \mathbb{R}^{K \times K}$ is a random unitary matrix. Then
$$p^{-1/2}\tilde{L}^TE_1 = p^{-1/2}\mathrm{diag}(\tau_1^{1/2}, \ldots, \tau_K^{1/2})\hat{W}^T\bar{L}^TEV^{-1}(n^{-1/2}C)(n^{-1}C^T\hat{V}^{-1}C)^{-1/2}U + p^{-1/2}\mathrm{diag}(\tau_1^{1/2}, \ldots, \tau_K^{1/2})\hat{W}^T\bar{L}^TE\Delta(n^{-1/2}C)(n^{-1}C^T\hat{V}^{-1}C)^{-1/2}U.$$
For some large constant $c > 0$ that does not depend on $n$ or $p$, we have
$$\|\Delta(n^{-1/2}C)(n^{-1}C^T\hat{V}^{-1}C)^{-1/2}U\| \leq c\epsilon_V$$
and
$$\|\hat{W}^T\bar{L}^TEV^{-1}(n^{-1/2}C)(n^{-1}C^T\hat{V}^{-1}C)^{-1/2}U\| \leq c\|\bar{L}^TEV^{-1}(n^{-1/2}C)\| = O_P(1),$$
where the last equality follows from Lemma S15. Therefore,
$$A^{(1)}_{rs} = O_P\{\phi_1(\lambda_r^{1/2} + \lambda_s^{1/2})\}, \quad r, s \in [K].$$
(2) Define $\hat{M}_C = (n^{-1}C^T\hat{V}^{-1}C)^{-1/2}U$. We see that
$$p^{-1}E_1^TE_1 = \hat{M}_C^T(n^{-1/2}C)^T\hat{V}^{-1}(p^{-1}E^TE)\hat{V}^{-1}(n^{-1/2}C)\hat{M}_C = \hat{M}_C^T(n^{-1/2}C)^TV^{-1}(p^{-1}E^TE)V^{-1}(n^{-1/2}C)\hat{M}_C + \hat{M}_C^T(n^{-1/2}C)^T\Delta(p^{-1}E^TE)V^{-1}(n^{-1/2}C)\hat{M}_C + \hat{M}_C^T(n^{-1/2}C)^TV^{-1}(p^{-1}E^TE)\Delta(n^{-1/2}C)\hat{M}_C + \hat{M}_C^T(n^{-1/2}C)^T\Delta(p^{-1}E^TE)\Delta(n^{-1/2}C)\hat{M}_C.$$
By Lemma S15, we then get that $\|I_K - A^{(2)}\| = O_P(p^{-1/2} + \phi_2)$. Therefore, for $M = \tilde{N}^T\tilde{N}$,
$$M_{rs} = \begin{cases} \tau_r + \delta + O_P(\phi_1\lambda_r^{1/2} + \phi_2) & \text{if } r = s \\ O_P(\phi_1\lambda_r^{1/2} + \phi_2) & \text{if } r \neq s \end{cases}, \quad r, s \in [K].$$
Let $\tilde{M}_j \in \mathbb{R}^{(k_j - k_{j-1}) \times (k_j - k_{j-1})}$ be a diagonal matrix containing the eigenvalues of the $j$th diagonal block of $M$. By Lemma S16,
$$\tilde{M}_{j,ss} = \tau_{k_{j-1}+s} + \delta + O_P(\phi_1\lambda_{k_{j-1}+s}^{1/2} + \phi_2).$$
For $v_{rs} \in \mathbb{R}^{(k_r - k_{r-1}) \times (k_s - k_{s-1})}$, we can write $M$ as
$$M = \begin{pmatrix}v_{11} \\ \vdots \\ v_{J1}\end{pmatrix}\tilde{M}_1\begin{pmatrix}v_{11} \\ \vdots \\ v_{J1}\end{pmatrix}^T + \begin{pmatrix}0_{k_1 \times (k_2-k_1)} \\ v_{22} \\ \vdots \\ v_{J2}\end{pmatrix}\tilde{M}_2\begin{pmatrix}0_{k_1 \times (k_2-k_1)} \\ v_{22} \\ \vdots \\ v_{J2}\end{pmatrix}^T + \cdots + \begin{pmatrix}0_{k_1 \times (k_J-k_{J-1})} \\ \vdots \\ 0_{(k_{J-1}-k_{J-2}) \times (k_J-k_{J-1})} \\ v_{JJ}\end{pmatrix}\tilde{M}_J\begin{pmatrix}0_{k_1 \times (k_J-k_{J-1})} \\ \vdots \\ 0_{(k_{J-1}-k_{J-2}) \times (k_J-k_{J-1})} \\ v_{JJ}\end{pmatrix}^T + \Delta_0,$$
where $v_{jj}$ is a unitary matrix for all $j = 1, \ldots, J$ and $\|v_{rs}\| = O_P(\phi_1\lambda_{k_s}^{-1/2} + \phi_2\lambda_{k_s}^{-1})$ for $r \neq s$. The symmetric matrix $\Delta_0 = (\Delta_{rs})_{r,s \in [J]}$, where $\Delta_{rs} \in \mathbb{R}^{(k_r - k_{r-1}) \times (k_s - k_{s-1})}$, is such that
$$\|\Delta_{rs}\| = O_P(K\phi_1^2) + O_P\Big(\phi_2^2\sum_{j=1}^{s-1}\lambda_j^{-1}\Big), \quad s \geq 2,\ s \leq r.$$
Let $V_j \in \mathbb{R}^{K \times (k_j - k_{j-1})}$ be the eigenvectors corresponding to eigenvalues $\mu_{k_{j-1}+1}, \ldots, \mu_{k_j}$. By Lemma S16 and Corollary S10,
$$\mu_s = \tau_s + \delta + O_P(\phi_1\lambda_{k_1}^{1/2} + \phi_2), \quad s \in [k_1]$$
$$\left\|\begin{pmatrix} I_{k_1 \times k_1} & 0_{k_1 \times (K-k_1)} \\ 0_{(K-k_1) \times k_1} & 0_{(K-k_1) \times (K-k_1)} \end{pmatrix} - P_{V_1}\right\|_F = O_P(\phi_1\lambda_{k_1}^{-1/2} + \phi_2\lambda_{k_1}^{-1}).$$
To understand $V_j$ for $j > 1$, define
$$\tilde{V}_j = \begin{pmatrix} 0_{k_1 \times (k_j - k_{j-1})} \\ \vdots \\ 0_{(k_{j-1}-k_{j-2}) \times (k_j - k_{j-1})} \\ v_{jj} \\ \vdots \\ v_{Jj} \end{pmatrix}.$$
Then
$$P^{\perp}_{(\tilde{V}_1 \cdots \tilde{V}_{j-1})}\tilde{V}_j = \tilde{V}_j - (\tilde{V}_1 \cdots \tilde{V}_{j-1})\begin{pmatrix} \tilde{V}_1^T\tilde{V}_1 & \cdots & \tilde{V}_1^T\tilde{V}_{j-1} \\ \vdots & \ddots & \vdots \\ \tilde{V}_{j-1}^T\tilde{V}_1 & \cdots & \tilde{V}_{j-1}^T\tilde{V}_{j-1} \end{pmatrix}^{-1}\begin{pmatrix} \tilde{V}_1^T\tilde{V}_j \\ \vdots \\ \tilde{V}_{j-1}^T\tilde{V}_j \end{pmatrix} = \tilde{V}_j - \Delta_j, \quad \|\Delta_j\| = O_P(\phi_1\lambda_{k_{j-1}}^{-1/2} + \phi_2\lambda_{k_{j-1}}^{-1}).$$
Let $R_j$ be a symmetric matrix such that $(\tilde{V}_j - \Delta_j)R_j$ has orthogonal columns. Therefore,
$$\lambda_{k_j}^{-1}M(\tilde{V}_j - \Delta_j)R_j = \lambda_{k_j}^{-1}(\tilde{V}_j - \Delta_j)R_j\tilde{M}_j + O_P(\phi_1\lambda_{k_j}^{-1/2} + \phi_2\lambda_{k_j}^{-1}),$$
$$\mu_s = \tau_s + \delta + O_P(\phi_1\lambda_{k_j}^{1/2} + \phi_2\lambda_{k_j}), \quad s \in \{k_{j-1}+1, \ldots, k_j\} \quad (S9)$$
$$\left\|0_{k_1 \times k_1} \oplus \cdots \oplus 0_{(k_{j-1}-k_{j-2}) \times (k_{j-1}-k_{j-2})} \oplus I_{(k_j - k_{j-1}) \times (k_j - k_{j-1})} \oplus 0_{(k_{j+1}-k_j) \times (k_{j+1}-k_j)} \oplus \cdots \oplus 0_{(k_J - k_{J-1}) \times (k_J - k_{J-1})} - P_{V_j}\right\|_F = O_P(\phi_1\lambda_{k_j}^{-1/2} + \phi_2\lambda_{k_j}^{-1}). \quad (S10)$$
Equation (S9) proves (S7).

To prove the remainder of the lemma, let $V_j = (V_{1j}^T \cdots V_{Jj}^T)^T$. Then by (S10) and for $r > j$,
$$\|I_{(k_j - k_{j-1})} - V_{jj}^TV_{jj}\| = O_P\{(\phi_1\lambda_{k_j}^{-1/2} + \phi_2\lambda_{k_j}^{-1})^2\}, \quad \|V_{rj}\| = O_P(\phi_1\lambda_{k_j}^{-1/2} + \phi_2\lambda_{k_j}^{-1}).$$
Lastly, for any $s < j$, we have
$$0 = V_{ss}^T(V_s^TV_j)_s = V_{ss}^TV_{ss}V_{sj} + O_P(\phi_1\lambda_{k_s}^{-1/2} + \phi_2\lambda_{k_s}^{-1}) = V_{sj} + O_P(\phi_1\lambda_{k_s}^{-1/2} + \phi_2\lambda_{k_s}^{-1}).$$
Therefore, $\|V_{sj}\| = O_P(\phi_1\lambda_{k_s}^{-1/2} + \phi_2\lambda_{k_s}^{-1})$ for $s < j$. This proves (S8) and completes the proof. □
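The proof above (and several that follow) leans on Weyl-type perturbation bounds: every eigenvalue of a symmetric matrix moves by at most the operator norm of a symmetric perturbation. A quick numerical illustration with hypothetical matrices:

```python
import numpy as np

rng = np.random.default_rng(4)
K = 5

# "Signal" matrix with well separated eigenvalues tau_1 > ... > tau_K.
tau = np.array([16.0, 8.0, 4.0, 2.0, 1.0])
M = np.diag(tau)

# Small symmetric perturbation, of the kind produced by the p^{-1/2} L~^T E_1 terms.
G = rng.normal(size=(K, K)) * 0.05
Delta = (G + G.T) / 2

mu = np.sort(np.linalg.eigvalsh(M + Delta))[::-1]
# Weyl's theorem: max_s |mu_s - tau_s| <= ||Delta||_2 (the operator norm).
print(np.abs(mu - tau).max(), np.linalg.norm(Delta, 2))
```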
Remark S5. Note that (S8) holds if we define the $k_j$ in terms of $\tau_1, \ldots, \tau_K$, defined in (S4), as follows: let $k_0 = 0$ and define $k_j$ inductively as
$$k_j = \min(\{r \in \{k_{j-1}+1, \ldots, K\} : \tau_r/\tau_{r+1} \geq 1 + \epsilon\}), \quad j \in [J],$$
where $k_J = K$ and $\epsilon > 0$ is an arbitrarily small constant.

S6.3 Understanding $\hat{v}$ and $\hat{z}$ given an estimate for $V$

We use the results of Lemma S1 to study the properties of $\hat{v} = (\hat{v}_1 \cdots \hat{v}_K) \in \mathbb{R}^{K \times K}$ and $\hat{z} = (\hat{z}_1 \cdots \hat{z}_K) \in \mathbb{R}^{(n-K) \times K}$, which were defined in Section S6.1.

Lemma
S2. Suppose Assumptions 1, 2 and 3 hold, and let $\hat{v}, \hat{z}$ be as defined above, $k_1, k_2, \ldots, k_J$ be as defined in Section S6.1, $\phi_1, \phi_2, \tilde{N}, \mu_1, \ldots, \mu_K$ be as defined in Lemma S1 and $S$ be as defined in (S6). Define $\hat{\mu}_s$ to be the $s$th eigenvalue of $S$, and for $V_{rs}$, $r, s \in [J]$, defined in the statement of Lemma S1, let $V_j = (V_{1j}^T \cdots V_{Jj}^T)^T$ for all $j \in [J]$. Set $\hat{V}_j = (\hat{v}_{k_{j-1}+1} \cdots \hat{v}_{k_j}) = (\hat{V}_{1j}^T \cdots \hat{V}_{Jj}^T)^T$ and $\hat{Z}_j = (\hat{z}_{k_{j-1}+1} \cdots \hat{z}_{k_j})$ for all $j \in [J]$, where $\hat{V}_{rj} \in \mathbb{R}^{(k_r - k_{r-1}) \times (k_j - k_{j-1})}$ for all $r \in [J]$, and let $f : [K] \to [J]$ be such that $s \in \{k_{f(s)-1}+1, \ldots, k_{f(s)}\}$. Then if $\phi_2/\lambda_{k_t} = o_P(1)$ for some $t \in [J]$, the following hold as $n, p \to \infty$:
$$\hat{\mu}_s = \mu_s + O_P(np^{-1} + \phi_2^2\lambda_s^{-1}), \quad s \in [k_t] \quad (S11a)$$
$$\|\hat{V}_{rr}\hat{V}_{rr}^T - I_{(k_r - k_{r-1})}\|_F = O_P(np^{-1}\lambda_{k_r}^{-1} + \phi_2^2\lambda_{k_r}^{-2}), \quad r \in [t] \quad (S11b)$$
$$\|\hat{V}_{rj}\|_F = \begin{cases} O_P(np^{-1}\lambda_{k_r}^{-1/2}\lambda_{k_j}^{-1/2} + \phi_1\lambda_{k_r}^{-1/2} + \phi_2\lambda_{k_r}^{-1}) & \text{if } r < j \text{ and } j \in [t] \\ O_P(np^{-1}\lambda_{k_j}^{-1} + \phi_1\lambda_{k_j}^{-1/2} + \phi_2\lambda_{k_j}^{-1}) & \text{if } r > j \text{ and } j \in [t] \end{cases} \quad (S11c)$$
$$\hat{z}_s = (\hat{\mu}_s - 1)^{-1}p^{-1/2}E_2^T\tilde{N}(V_{f(s)} \cdots V_J)(V_{f(s)} \cdots V_J)^T\hat{v}_s + (\hat{\mu}_s - 1)^{-1}p^{-1/2}R_sE_2^T\tilde{N}(V_{f(s)} \cdots V_J)(V_{f(s)} \cdots V_J)^T\hat{v}_s + O_P\left[\lambda_s^{-1}\left\{\left(\frac{n\lambda_s}{p}\right)^{1/2} + \phi_2\lambda_s^{-1}\right\}(np^{-1} + \phi_2^2\lambda_s^{-1})\right], \quad s \in [k_t] \quad (S11d)$$
$$\|\hat{z}_s\| = O_P(n^{1/2}p^{-1/2}\lambda_s^{-1/2} + \phi_2\lambda_s^{-1}), \quad s \in [k_t], \quad (S11e)$$
where
$$R_s = [I_{n-K} + (\hat{\mu}_s - 1)^{-1}\{I_{n-K} - p^{-1}E_2^TE_2 + O_P(np^{-1} + \phi_2^2\lambda_{k_{f(s)-1}}^{-1})I\{f(s) > 1\}\}]^{-1} - I_{n-K} = O_P\{\lambda_s^{-1}(n^{1/2}p^{-1/2} + \phi_2)\}, \quad s \in [k_t].$$
Proof.
Let M be as defined in Lemma S1. We first attempt to understand the components of S .First, for some constant c > n or p , (cid:13)(cid:13)(cid:13) p − / ˜ L T ∗ k E (cid:13)(cid:13)(cid:13) ≤ c (cid:13)(cid:13)(cid:13) p − / ˜ L T ∗ k E (cid:13)(cid:13)(cid:13) = O P (cid:16) n / p − / λ / k (cid:17) , k ∈ [ K ] , where the equality follows by the proof of Lemma S1. Next, for ∆ = ˆ V − − V − , p − E T E = p − (cid:16) Q TC ˆ V Q C (cid:17) − / Q TC E T E ˆ V − C (cid:16) C T ˆ V − C (cid:17) − / U = p − (cid:16) Q TC ˆ V Q C (cid:17) − / Q TC E T EV − C (cid:16) C T ˆ V − C (cid:17) − / U + p − (cid:16) Q TC ˆ V Q C (cid:17) − / Q TC E T E ∆ C (cid:16) C T ˆ V − C (cid:17) − / U , meaning (cid:13)(cid:13)(cid:13) p − E T E (cid:13)(cid:13)(cid:13) = O P (cid:16) n / p − + φ (cid:17) by Lemma S15. Lastly, for some constant c > n or p , (cid:13)(cid:13)(cid:13) p − E T E − I n − K (cid:13)(cid:13)(cid:13) ≤ c (cid:16)(cid:13)(cid:13)(cid:13) p − E T E − V (cid:13)(cid:13)(cid:13) + (cid:13)(cid:13)(cid:13) V − ˆ V (cid:13)(cid:13)(cid:13) (cid:17) = O P (cid:16) n / p − / + φ (cid:17) . Weyl’s Theorem then implies | ˆ µ k /µ k − | = o P (1) for all k ∈ [ k ], meaning by the definition of ˆ v k and ˆ z k , ˆ µ k ˆ v k = M ˆ v k + p − ˜ N T E (cid:16) ˆ µ k − p − E T E (cid:17) − E T ˜ N ˆ v k , k ∈ [ k ]ˆ z k = p − / (cid:16) ˆ µ k − p − E T E (cid:17) − E T ˜ N ˆ v k , k ∈ [ k ] = p − / ( ˆ µ k − − E T ˜ N ˆ v k + p − / ( ˆ µ k − − R k E T ˜ N ˆ v k , k ∈ [ k ] , R k = (cid:110) I n − K + ( ˆ µ k − − (cid:16) I n − K − p − E T E (cid:17)(cid:111) − − I n − K = O P (cid:110) λ − k (cid:16) n / p − / + (cid:15) V (cid:17)(cid:111) , k ∈ [ k ]and (cid:107) ˆ z k (cid:107) = O P (cid:16) n / p − / λ − / k + φ λ − k (cid:17) , k ∈ [ k ] . By Weyl’s theorem, ˆ µ k = µ k + O P (cid:16) np − + φ λ − k (cid:17) , k ∈ [ k ] . Define ˆ V = (cid:0) ˆ v · · · ˆ v k (cid:1) and ˆ Z = (cid:0) ˆ z · · · ˆ z k (cid:1) . By definition, ˆ V T ˆ V = I k × k − ˆ Z T ˆ Z . 
Define M ( k ) = M + p − ˜ N T E (cid:16) ˆ µ k − p − E T E (cid:17) − E T ˜ N , k ∈ [ k ] . Then ˆ v k ˆ µ k = M ( k ) ˆ v k = M ˆ v k + (cid:15) k , k ∈ [ k ] , meaning M ˆ V (cid:16) ˆ V T ˆ V (cid:17) − / = ˆ V (cid:16) ˆ V T ˆ V (cid:17) − / diag (cid:0) ˆ µ , . . . , ˆ µ k (cid:1) (cid:16) ˆ V T ˆ V (cid:17) − / − (cid:0) (cid:15) · · · (cid:15) k (cid:1) (cid:16) ˆ V T ˆ V (cid:17) − / + ˆ V (cid:26) I k − (cid:16) ˆ V T ˆ V (cid:17) − / (cid:27) diag (cid:0) ˆ µ , . . . , ˆ µ k (cid:1) (cid:16) ˆ V T ˆ V (cid:17) − / (cid:124) (cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32) (cid:123)(cid:122) (cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32) (cid:125) = O P (cid:18) np − + φ λ − k (cid:19) . By Lemma S17 and Corollary S10, this shows that (cid:13)(cid:13)(cid:13) ˆ V ˆ V T − V V T (cid:13)(cid:13)(cid:13) F = O P (cid:16) np − λ − k + φ λ − k (cid:17) and that (cid:13)(cid:13)(cid:13) P ⊥ V ˆ V ˆ V T (cid:13)(cid:13)(cid:13) F = O P (cid:16) np − λ − k + φ λ − k (cid:17) . 
Therefore, if we express the eigenvectors of M as (cid:16) V V ⊥ (cid:17) , V diag (cid:0) µ , . . . , µ k (cid:1) V T + O P (cid:16) np − + φ λ − k (cid:17) = V diag (cid:0) µ , . . . , µ k (cid:1) V T ˆ V ˆ V T + V ⊥ diag (cid:0) µ k + , . . . , µ K (cid:1) (cid:0) V ⊥ (cid:1) T ˆ V ˆ V T = M ˆ V ˆ V T = ˆ V diag (cid:0) ˆ µ , . . . , ˆ µ k (cid:1) ˆ V T + O P (cid:16) np − + φ λ − k (cid:17) , meaning ˆ V diag (cid:0) ˆ µ , . . . , ˆ µ k (cid:1) ˆ V T = V diag (cid:0) µ , . . . , µ k (cid:1) V T + O P (cid:16) np − + φ λ − k (cid:17) . S (1) = S − (cid:32) ˆ V ˆ Z (cid:33) diag (cid:0) ˆ µ , . . . , ˆ µ k (cid:1) (cid:32) ˆ V ˆ Z (cid:33) T = A (1) (cid:16) B (1) (cid:17) T B (1) D (1) , where A (1) = M − ˆ V diag (cid:0) ˆ µ , . . . , ˆ µ k (cid:1) ˆ V T = K (cid:88) k = k + µ k v k v Tk + O P (cid:16) np − + φ λ − k (cid:17) B (1) = p − / E T ˜ N − p − / k (cid:88) k = ˆ µ k ˆ µ k − E T ˜ N ˆ v k ˆ v Tk − p − / k (cid:88) k = ˆ µ k ˆ µ k − R k E T ˜ N ˆ v k ˆ v Tk = p − / E T ˜ N ( V · · · V J ) ( V · · · V J ) T − p − / k (cid:88) k = µ k − E T ˜ N ˆ v k ˆ v Tk (cid:124) (cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32) (cid:123)(cid:122) (cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32) (cid:125) = O P (cid:32) n λ k p (cid:33) / + φ λ − k − p − / k (cid:88) k = ˆ µ k ˆ µ k − R k E T ˜ N ˆ v k ˆ v Tk (cid:124) (cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32) (cid:123)(cid:122) 
(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32) (cid:125) = O P (cid:32) n λ k p (cid:33) / + φ λ − k ( n / p − / + φ ) + O P (cid:32) n λ k p (cid:33) / + φ λ − k (cid:16) np − + φ λ − k (cid:17) D (1) = p − E T E − k (cid:88) k = ˆ µ k ˆ z k ˆ z Tk (cid:124) (cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32) (cid:123)(cid:122) (cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32) (cid:125) = O P (cid:18) np − + φ λ − k (cid:19) . Lastly, for ˜ µ j = diag (cid:16) τ k j − + , . . . , τ k j (cid:17) and ¯ L defined in the proof of Lemma S1, (cid:13)(cid:13)(cid:13) p − / E T ˜ N V j (cid:13)(cid:13)(cid:13) ≤ (cid:13)(cid:13)(cid:13) p − / E T ¯ L (cid:13)(cid:13)(cid:13) (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) ˜ µ / V j ˜ µ / V j ... ˜ µ / J V J j (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) + (cid:13)(cid:13)(cid:13) p − E T E (cid:13)(cid:13)(cid:13) . (S12)Since (cid:13)(cid:13)(cid:13) ˜ µ / s V s j (cid:13)(cid:13)(cid:13) = O P (cid:16) φ + φ λ − / k (cid:17) for all s (cid:44) j , (cid:13)(cid:13)(cid:13) B (1) (cid:13)(cid:13)(cid:13) = O P (cid:16) λ / k n / p − / + φ (cid:17) .The k + , . . . , k eigenvalues and eigenvectors of S can be obtained using A (1) , B (1) and D (1) .By the exact techniques used to analyze the first set of eigenvalues (1 , . . . , k ), we get thatˆ µ k ˆ v k = A (1) ˆ v k + (cid:110) B (1) (cid:111) T (cid:110) ˆ µ k − D (1) (cid:111) − B (1) ˆ v k , k ∈ { k + , . . . , k } ˆ z k = (cid:110) ˆ µ k − D (1) (cid:111) − B (1) ˆ v k , k ∈ { k + , . . . , k } . 
For $\hat{V} = (\hat{v}_{k_1+1}\cdots\hat{v}_{k_2})$ and $V = (v_{k_1+1}\cdots v_{k_2})$, these same techniques can also be used to show the following:
\begin{align}
\|\hat{z}_k\| &= O_P\!\left(n^{1/2}p^{-1/2}\lambda_k^{-1/2} + \phi\lambda_k^{-1}\right), & k&\in\{k_1+1,\ldots,k_2\}\nonumber\\
\hat{\mu}_k &= \mu_k + O_P\!\left(np^{-1}+\phi^2\lambda_k^{-1}\right), & k&\in\{k_1+1,\ldots,k_2\}\nonumber\\
\left\|\hat{V}\hat{V}^T - VV^T\right\|_F,\; \left\|P^{\perp}_{V}\hat{V}\hat{V}^T\right\|_F &= O_P\!\left(np^{-1}\lambda_{k_2}^{-1} + \phi^2\lambda_{k_2}^{-2}\right) & &\tag{S13}\\
\left\|\hat{V}\operatorname{diag}(\hat{\mu}_{k_1+1},\ldots,\hat{\mu}_{k_2})\hat{V}^T - V\operatorname{diag}(\mu_{k_1+1},\ldots,\mu_{k_2})V^T\right\| &= O_P\!\left(np^{-1}+\phi^2\lambda_{k_2}^{-1}\right). & &\nonumber
\end{align}
For $k\in\{k_1+1,\ldots,k_2\}$ and because $\hat{v}_s^T\hat{v}_k = O_P(np^{-1}\lambda_s^{-1/2}\lambda_k^{-1/2} + \phi^2\lambda_s^{-1}\lambda_k^{-1})$ for $s\in[k_1]$,
\begin{align*}
B^{(1)}\hat{v}_k &= p^{-1/2}E^T\tilde{N}(V_1\cdots V_J)(V_1\cdots V_J)^T\hat{v}_k - p^{-1/2}\sum_{s=1}^{k_1}(\mu_s-1)^{-1}E^T\tilde{N}\hat{v}_s\hat{v}_s^T\hat{v}_k - p^{-1/2}\sum_{s=1}^{k_1}\hat{\mu}_s(\hat{\mu}_s-1)^{-1}R_sE^T\tilde{N}\hat{v}_s\hat{v}_s^T\hat{v}_k + O_P\left\{\left(\tfrac{n\lambda_{k_1}}{p}\right)^{1/2}+\phi\right\}\lambda_{k_1}^{-1}\left(np^{-1}+\phi^2\lambda_{k_1}^{-1}\right)\\
&= p^{-1/2}E^T\tilde{N}(V_1\cdots V_J)(V_1\cdots V_J)^T\hat{v}_k + O_P\left\{\left(\tfrac{n\lambda_{k_1}}{p}\right)^{1/2}+\phi\right\}\lambda_{k_1}^{-1}\left(np^{-1}+\phi^2\lambda_{k_1}^{-1}\right).
\end{align*}
We then have for $k\in\{k_1+1,\ldots,k_2\}$,
\begin{align*}
\hat{z}_k &= \{\hat{\mu}_k - D^{(1)}\}^{-1}B^{(1)}\hat{v}_k = (\hat{\mu}_k-1)^{-1}B^{(1)}\hat{v}_k + (\hat{\mu}_k-1)^{-1}R_kB^{(1)}\hat{v}_k\\
&= (\hat{\mu}_k-1)^{-1}p^{-1/2}E^T\tilde{N}(V_1\cdots V_J)(V_1\cdots V_J)^T\hat{v}_k + (\hat{\mu}_k-1)^{-1}p^{-1/2}R_kE^T\tilde{N}(V_1\cdots V_J)(V_1\cdots V_J)^T\hat{v}_k + O_P\!\left[\lambda_k^{-1}\left\{\left(\tfrac{n\lambda_{k_1}}{p}\right)^{1/2}+\phi\right\}\lambda_{k_1}^{-1}\left(np^{-1}+\phi^2\lambda_{k_1}^{-1}\right)\right]\\
R_k &= \left[I_{n-K} + (\hat{\mu}_k-1)^{-1}\left\{I_{n-K} - p^{-1}E^TE + O_P\!\left(np^{-1}+\phi^2\lambda_{k_1}^{-1}\right)\right\}\right]^{-1} - I_{n-K} = O_P\!\left\{\lambda_k^{-1}\left(n^{1/2}p^{-1/2}+\phi\right)\right\}.
\end{align*}
Define $\hat{Z} = (\hat{z}_{k_1+1}\cdots\hat{z}_{k_2})$ and let
\[
S^{(2)} = S - \begin{pmatrix}\hat{V}_1\\ \hat{Z}_1\end{pmatrix}\operatorname{diag}(\hat{\mu}_1,\ldots,\hat{\mu}_{k_1})\begin{pmatrix}\hat{V}_1\\ \hat{Z}_1\end{pmatrix}^T - \begin{pmatrix}\hat{V}_2\\ \hat{Z}_2\end{pmatrix}\operatorname{diag}(\hat{\mu}_{k_1+1},\ldots,\hat{\mu}_{k_2})\begin{pmatrix}\hat{V}_2\\ \hat{Z}_2\end{pmatrix}^T = \begin{pmatrix}A^{(2)} & (B^{(2)})^T\\ B^{(2)} & D^{(2)}\end{pmatrix},
\]
where
\begin{align*}
A^{(2)} &= A^{(1)} - \hat{V}\operatorname{diag}(\hat{\mu}_{k_1+1},\ldots,\hat{\mu}_{k_2})\hat{V}^T = \sum_{k=k_2+1}^{K}\mu_kv_kv_k^T + O_P\!\left(np^{-1}+\phi^2\lambda_{k_2}^{-1}\right)\\
B^{(2)} &= B^{(1)} - \sum_{k=k_1+1}^{k_2}\hat{\mu}_k\hat{z}_k\hat{v}_k^T\\
&= p^{-1/2}E^T\tilde{N}(V_1\cdots V_J)(V_1\cdots V_J)^T \underbrace{- p^{-1/2}\sum_{k=1}^{k_1}(\mu_k-1)^{-1}E^T\tilde{N}\hat{v}_k\hat{v}_k^T}_{=O_P\left[\left\{\left(\frac{n\lambda_{k_1}}{p}\right)^{1/2}+\phi\right\}\lambda_{k_1}^{-1}\right]} \underbrace{- p^{-1/2}\sum_{k=k_1+1}^{k_2}(\mu_k-1)^{-1}E^T\tilde{N}(V_1\cdots V_J)(V_1\cdots V_J)^T\hat{v}_k\hat{v}_k^T}_{=O_P\left[\left\{\left(\frac{n\lambda_{k_2}}{p}\right)^{1/2}+\phi\right\}\lambda_{k_2}^{-1}\right]}\\
&\quad \underbrace{- p^{-1/2}\sum_{k=1}^{k_1}\hat{\mu}_k(\hat{\mu}_k-1)^{-1}R_kE^T\tilde{N}\hat{v}_k\hat{v}_k^T}_{=O_P\left\{\left(\frac{n\lambda_{k_1}}{p}\right)^{1/2}+\phi\right\}\lambda_{k_1}^{-1}\left(n^{1/2}p^{-1/2}+\phi\right)} \underbrace{- p^{-1/2}\sum_{k=k_1+1}^{k_2}\hat{\mu}_k(\hat{\mu}_k-1)^{-1}R_kE^T\tilde{N}(V_1\cdots V_J)(V_1\cdots V_J)^T\hat{v}_k\hat{v}_k^T}_{=O_P\left\{\left(\frac{n\lambda_{k_2}}{p}\right)^{1/2}+\phi\right\}\lambda_{k_2}^{-1}\left(n^{1/2}p^{-1/2}+\phi\right)} + O_P\left\{\left(\tfrac{n\lambda_{k_2}}{p}\right)^{1/2}+\phi\right\}\lambda_{k_2}^{-1}\left(np^{-1}+\phi^2\lambda_{k_2}^{-1}\right)\\
D^{(2)} &= p^{-1}E^TE \underbrace{- \sum_{k=1}^{k_2}\hat{\mu}_k\hat{z}_k\hat{z}_k^T}_{=O_P\left(np^{-1}+\phi^2\lambda_{k_2}^{-1}\right)}.
\end{align*}
This can be carried out to understand the eigenstructure of the remaining groups $j = 3,\ldots,J$, which proves (S11a), (S11d) and (S11e). For the remaining equalities, we first see that (S11b) follows from (S10) and (S13). Next, for $r > j$ and by (S13),
\[
O_P\!\left(np^{-1}\lambda_{k_j}^{-1}+\phi^2\lambda_{k_j}^{-2}\right) = \hat{V}_{jj}^T\left(\hat{V}_{jj}\hat{V}_{rj}^T - V_{jj}V_{rj}^T\right) = \hat{V}_{rj}^T - \hat{V}_{jj}^TV_{jj}V_{rj}^T + O_P\!\left(np^{-1}\lambda_{k_j}^{-1}+\phi^2\lambda_{k_j}^{-2}\right).
\]
This implies $\|\hat{V}_{rj}\|_F$ follows (S11c) for $r > j$ by (S10). The remaining part of (S11c) follows from the fact that $\hat{V}_r^T\hat{V}_j = -\hat{Z}_r^T\hat{Z}_j$ for $r\neq j$. □

Corollary
S1.
Suppose the assumptions of Lemma S2 hold. Then
\[
S - \sum_{j=1}^{t}\begin{pmatrix}\hat{V}_j^T & \hat{Z}_j^T\end{pmatrix}^T\operatorname{diag}(\hat{\mu}_{k_{j-1}+1},\ldots,\hat{\mu}_{k_j})\begin{pmatrix}\hat{V}_j^T & \hat{Z}_j^T\end{pmatrix} = \begin{pmatrix}A^{(t)} & (B^{(t)})^T\\ B^{(t)} & p^{-1}E^TE - F^{(t)}\end{pmatrix},
\]
where $\dim[\operatorname{im}\{F^{(t)}\}], \dim[\operatorname{im}\{A^{(t)}\}], \dim[\operatorname{im}\{B^{(t)}\}]\le K$ and
\begin{align*}
A^{(t)} &= \begin{cases}\sum_{j=t+1}^{J}V_j\operatorname{diag}(\mu_{k_{j-1}+1},\ldots,\mu_{k_j})V_j^T + O_P\!\left(np^{-1}+\phi^2\lambda_{k_t}^{-1}\right) & \text{if } 1\le t<J\\ O_P\!\left(np^{-1}+\phi^2\lambda_K^{-1}\right) & \text{if } t=J\end{cases}\\
\left\|B^{(t)}\right\| &= \begin{cases}O_P\!\left\{\left(\tfrac{n\lambda_{k_t+1}}{p}\right)^{1/2}+\phi\right\} & \text{if } 1\le t<J\\ O_P\!\left[\left\{\left(\tfrac{n\lambda_K}{p}\right)^{1/2}+\phi\right\}\lambda_K^{-1}\right] & \text{if } t=J\end{cases}\\
\left\|F^{(t)}\right\| &= O_P\!\left\{np^{-1}+\phi^2\lambda_{k_{\min(t,J)}}^{-1}\right\}
\end{align*}
as $n,p\to\infty$.

Proof. This is a direct consequence of the proof of Lemma S2. □
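The block decomposition above, in which the estimated leading eigen-block is removed from $S$ so that the next group of factors becomes the leading structure of the deflated matrix, can be illustrated numerically. The following is a minimal sketch, not the paper's Algorithm 1; the simulation design, dimensions and signal strengths are illustrative choices of my own:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 5000
lams = np.array([400.0, 20.0])   # one strong and one moderate factor

# simulate Y = L C^T + E with orthonormal factors C and loadings scaled so the
# signal eigenvalues of p^{-1} Y^T Y are roughly lams (plus 1 from the noise)
C = np.linalg.qr(rng.standard_normal((n, 2)))[0]
L = rng.standard_normal((p, 2)) * np.sqrt(lams)
Y = L @ C.T + rng.standard_normal((p, n))

S = Y.T @ Y / p                  # n x n sample Gram matrix
mu, V = np.linalg.eigh(S)
mu, V = mu[::-1], V[:, ::-1]     # sort eigenpairs in descending order

# deflation: S1 plays the role of S^{(1)} = S - v-hat mu-hat v-hat^T
S1 = S - mu[0] * np.outer(V[:, 0], V[:, 0])
mu1 = np.linalg.eigvalsh(S1)[::-1]

print(round(mu[0]), round(mu1[0]))  # leading eigenvalue before/after deflation
```

After deflation the strong factor no longer shadows the moderate one: the leading eigenvalue of `S1` sits near $\lambda_2 + 1$ rather than $\lambda_1 + 1$, which is the mechanism the lemma exploits group by group.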
Remark
S6.
Corollary S1 is analogous to Corollary S3 in McKennan et al. [35]. In fact, Lemma S3 (see below) can be proved using the proof of Lemma S7 in McKennan et al. [35], where we simply replace the results of Corollary S3 in McKennan et al. [35] with Corollary S1 above.
Remark
S7.
The conclusions of Lemma S2 and Corollary S1 still hold if we re-define $k_1, k_2, \ldots, k_J$ to be those given in Remark S5 above.

Corollary
S2.
Suppose the assumptions of Lemma S2 hold. Define $\hat{\tilde{C}}^{(t)}\in\mathbb{R}^{n\times k_t}$ to be the first $k_t$ eigenvectors of $\hat{\bar{V}}^{-1/2}(p^{-1}Y^TY)\hat{\bar{V}}^{-1/2}$ and let $\tilde{C}^{(t)} = (\tilde{C}_{*1}\cdots\tilde{C}_{*k_t})$. Then
\[
\left\|P_{\hat{\tilde{C}}^{(t)}} - P_{\tilde{C}^{(t)}}\right\|_F = O_P\!\left(np^{-1}\lambda_{k_t}^{-1} + \phi^2\lambda_{k_t}^{-2}\right).
\]

Proof. By definition, $\hat{\tilde{C}}^{(t)} = \tilde{C}(\hat{V}_1\cdots\hat{V}_t) + Q(\hat{Z}_1\cdots\hat{Z}_t)$, meaning
\[
\{\tilde{C}^{(t)}\}^T\hat{\tilde{C}}^{(t)} = \begin{pmatrix}\hat{V}_{11}&\cdots&\hat{V}_{1t}\\ \vdots&\ddots&\vdots\\ \hat{V}_{t1}&\cdots&\hat{V}_{tt}\end{pmatrix}.
\]
The result then follows by (S11b) and (S11c). □
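Corollary S2's guarantee concerns the projection operator onto the estimated factor subspace rather than the eigenvectors themselves, which are only identified up to rotation. The following toy check, with simulation settings of my own choosing, compares the estimated and true projectors in Frobenius norm:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, K = 50, 5000, 2
lams = np.array([300.0, 30.0])

C = np.linalg.qr(rng.standard_normal((n, K)))[0]   # orthonormal true factors
L = rng.standard_normal((p, K)) * np.sqrt(lams)
Y = L @ C.T + rng.standard_normal((p, n))

# estimated subspace: top-K right singular vectors of Y
Chat = np.linalg.svd(Y, full_matrices=False)[2][:K].T

P_true = C @ C.T                       # projector onto span(C)
P_hat = Chat @ Chat.T                  # projector onto estimated span
err = np.linalg.norm(P_hat - P_true)   # Frobenius distance between projectors
print(err)
```

The distance is small even though the individual singular vectors need not match the columns of `C`; projector distances of exactly this kind are what (S13) and Corollary S2 control, with error that shrinks in $p$ and grows as the weakest signal strength $\lambda_{k_t}$ shrinks.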
Corollary
S3.
Suppose the assumptions of Lemma S2 hold. Then for any non-random $M\in\mathbb{R}^{n\times n}$ such that $\|M\|\le c$ for some constant $c>0$ that does not depend on $n$ or $p$,
\[
\left\|\tilde{C}^T\hat{\bar{V}}^{1/2}MQ_{\tilde{C}}\hat{z}_s\right\| = O_P\!\left(\phi\lambda_s^{-1/2} + \frac{n}{\lambda_s p} + \phi^2\lambda_s^{-1}\right),\quad s\in[k_t].
\]

Proof. By (S12) and Lemma S15, we first see that for any $s\in[k_t]$ and non-random unit vector $u\in\ker(C^T)$,
\begin{align*}
\lambda_s^{-1}p^{-1/2}\left\|u^TE^T\tilde{N}(V_{f(s)}\cdots V_J)(V_{f(s)}\cdots V_J)^T\right\| &= \lambda_s^{-1/2}p^{-1/2}O_P\!\left(\|u^TE^T\bar{L}\|\right) + \lambda_s^{-1}O_P\!\left(\left\|p^{-1}u^TE^TE\bar{V}^{-1}(n^{-1/2}C)\right\|\right) + O_P\!\left(\lambda_s^{-1}\phi^2\right)\\
&= O_P\!\left(\phi\lambda_s^{-1/2} + \lambda_s^{-1}\phi^2\right)
\end{align*}
as $n,p\to\infty$. Therefore, for all $s\in[k_t]$,
\begin{align*}
\left\|\tilde{C}^T\hat{\bar{V}}^{1/2}MQ_{\tilde{C}}\hat{z}_s\right\| &= O\!\left\{\left\|(n^{-1/2}C)^TM\hat{\bar{V}}Q_{\tilde{C}}(Q^T\hat{\bar{V}}Q)^{-1/2}\hat{z}_s\right\|\right\}\\
&= \lambda_s^{-1}p^{-1/2}O_P\!\left\{\left\|(n^{-1/2}C)^TM\hat{\bar{V}}^{1/2}Q_C(Q_C^T\hat{\bar{V}}Q_C)^{-1}Q_C^TE^T\tilde{N}(V_{f(s)}\cdots V_J)(V_{f(s)}\cdots V_J)^T\right\|\right\} + O_P\!\left(\lambda_s^{-1}n^{1/2}p^{-1} + \lambda_s^{-1}\phi^2\right)\\
&= \lambda_s^{-1}p^{-1/2}O_P\!\left\{\left\|(n^{-1/2}C)^TM\bar{V}^{1/2}Q_C(Q_C^T\bar{V}Q_C)^{-1}Q_C^TE^T\tilde{N}(V_{f(s)}\cdots V_J)(V_{f(s)}\cdots V_J)^T\right\|\right\} + O_P\!\left[\frac{n}{\lambda_s p} + \phi\left\{\lambda_s^{-1} + n^{1/2}(\lambda_s p)^{-1/2}\right\}\right]\\
&= O_P\!\left(\phi\lambda_s^{-1/2} + \frac{n}{\lambda_s p} + \phi^2\lambda_s^{-1}\right). \qquad\Box
\end{align*}

Corollary
S4.
Suppose the assumptions of Corollary S3 hold, and let $\hat{\tilde{C}}\in\mathbb{R}^{n\times K^{(o)}}$ be the first $K^{(o)}$ right singular vectors of $Y\hat{\bar{V}}^{-1/2}$, where $\phi = \|\hat{\bar{V}} - \bar{V}\| = O_P(1/n)$. Then for $\tilde{C}$ defined in (S3),
\begin{align*}
\left\|\hat{\tilde{C}}^T\hat{\bar{V}}^{1/2}M\hat{\bar{V}}^{1/2}\hat{\tilde{C}} - \hat{v}_{11}^TA^T\tilde{C}^T\hat{\bar{V}}^{1/2}M\hat{\bar{V}}^{1/2}\tilde{C}A\hat{v}_{11}\right\| &= O_P\!\left(\phi_1\gamma_{K^{(o)}}^{-1/2} + \phi_2\gamma_{K^{(o)}}^{-1}\right)\\
\left\|\hat{\tilde{C}}^T\hat{\bar{V}}^{1/2}M\hat{\bar{V}}^{1/2}\tilde{C}A - \hat{v}_{11}^TA^T\tilde{C}^T\hat{\bar{V}}^{1/2}M\hat{\bar{V}}^{1/2}\tilde{C}A\right\| &= O_P\!\left(\phi_1\gamma_{K^{(o)}}^{-1/2} + \phi_2\gamma_{K^{(o)}}^{-1}\right),
\end{align*}
where $A = (I_{K^{(o)}}\;\; 0)^T\in\mathbb{R}^{K\times K^{(o)}}$, $\phi_1 = p^{-1/2}$, $\phi_2 = n^{1/2}p^{-1} + n^{-1}$ and $\hat{v}_{11}$ is the upper $K^{(o)}\times K^{(o)}$ block of $\hat{v}$.

Proof. Let $\hat{Z} = (\hat{z}_{*1}\cdots\hat{z}_{*K^{(o)}})$ and $\hat{v} = \begin{pmatrix}\hat{v}_{11}&\hat{v}_{12}\\ \hat{v}_{21}&\hat{v}_{22}\end{pmatrix}\in\mathbb{R}^{K\times K}$, where $\hat{v}_{11}\in\mathbb{R}^{K^{(o)}\times K^{(o)}}$ and $\hat{v}_{21}\in\mathbb{R}^{(K-K^{(o)})\times K^{(o)}}$. By Lemma S2, $\|\hat{v}_{21}\| = O_P(\phi_1\gamma_{K^{(o)}}^{-1/2} + \phi_2\gamma_{K^{(o)}}^{-1})$. Therefore,
\[
\hat{\tilde{C}} = \tilde{C}A\hat{v}_{11} + Q_{\tilde{C}}\hat{Z} + O_P\!\left(\phi_1\gamma_{K^{(o)}}^{-1/2} + \phi_2\gamma_{K^{(o)}}^{-1}\right).
\]
Since $\|\hat{Z}\| = O_P(\phi_2\gamma_{K^{(o)}}^{-1})$, both results follow after applying Corollary S3. □

Corollary
S5.
Suppose the assumptions of Corollary S4 hold, and let $X\in\mathbb{R}^n$ be a random vector that is independent of $C$ but dependent on at most finitely many rows of $E$. Then for $\hat{\tilde{C}}$ defined in the statement of Corollary S4,
\[
\left\|X^T\hat{\bar{V}}^{1/2}\hat{\tilde{C}}_{*r} - X^T\hat{\bar{V}}^{1/2}\tilde{C}\hat{v}_{*r}\right\| = \{\|E(X)\| + \|X - E(X)\|\}\,O_P\!\left(\phi_1\gamma_r^{-1/2}+\phi_2\gamma_r^{-1}\right),\quad r\in[K^{(o)}].
\]

Proof. First, $X^T\hat{\bar{V}}^{1/2}\hat{\tilde{C}}_{*r} = X^T\hat{\bar{V}}^{1/2}\tilde{C}\hat{v}_{*r} + X^T\hat{\bar{V}}^{1/2}Q_{\tilde{C}}\hat{z}_{*r}$. Let $\mu_x = E(X)$ and $r\in[K^{(o)}]$. By Corollary S3, $\mu_x^T\hat{\bar{V}}^{1/2}Q_{\tilde{C}}\hat{z}_{*r} = \|\mu_x\|O_P(\phi_1\gamma_r^{-1/2}+\phi_2\gamma_r^{-1})$. Therefore, to complete the proof, it suffices to assume $E(X)=0$ and $\|X\|=1$. Let $\delta_r = \phi_1\gamma_r^{-1/2}+\phi_2\gamma_r^{-1}$ and $\mathcal{S}\subseteq[p]$ be such that $X$ is independent of $E_{g*}$ for all $g\in\mathcal{S}^c$. By the proof of Corollary S3,
\begin{align*}
\left\|X^T\hat{\bar{V}}^{1/2}Q_{\tilde{C}}\hat{z}_{*r}\right\| &\le \gamma_r^{-1}p^{-1/2}\left\|X^T\bar{V}^{1/2}Q_C(Q_C^T\bar{V}Q_C)^{-1}Q_C^TE^T\tilde{N}(V_{f(r)}\cdots V_J)(V_{f(r)}\cdots V_J)^T\tilde{v}_{*r}\right\| + O_P(\delta_r)\\
&\le \gamma_r^{-1}p^{-1/2}\left\|\tilde{X}^TE_{\mathcal{S}*}^T\tilde{L}_{\mathcal{S}*}\right\| + \gamma_r^{-1}\left\|\tilde{X}^T\left(p^{-1}E_{\mathcal{S}*}^TE_{\mathcal{S}*}\right)\right\| + \gamma_r^{-1}p^{-1/2}\left\|\tilde{X}^TE_{\mathcal{S}^c*}^T\tilde{L}_{\mathcal{S}^c*}\tilde{V}\right\| + \gamma_r^{-1}\left\|\tilde{X}^T\left(p^{-1}E_{\mathcal{S}^c*}^TE_{\mathcal{S}^c*}\right)\right\| + O_P(\delta_r),
\end{align*}
where $\tilde{X} = Q_C(Q_C^T\bar{V}Q_C)^{-1}Q_C^T\bar{V}^{1/2}X$ (note $\tilde{X}\in\ker(C^T)$), $\tilde{V} = (V_{f(r)}\cdots V_J)$ and $A_{\mathcal{S}*}$, $A_{\mathcal{S}^c*}$ are the sub-matrices of $A\in\mathbb{R}^{p\times m}$ restricted to the rows $g\in\mathcal{S}$ and $g\in\mathcal{S}^c$, respectively. Since $|\mathcal{S}|$ is at most finite, $\|\tilde{L}_{\mathcal{S}*}\|\le n^{1/2}p^{-1/2}c$ for some constant $c>0$, meaning
\[
\gamma_r^{-1}p^{-1/2}\left\|\tilde{X}^TE_{\mathcal{S}*}^T\tilde{L}_{\mathcal{S}*}\right\| = O_P\!\left(\gamma_r^{-1}p^{-1/2}n^{1/2}\right)\left\|\tilde{L}_{\mathcal{S}*}\right\| = O_P\{n/(\gamma_r p)\}.
\]
Similarly, $\gamma_r^{-1}\|\tilde{X}^T(p^{-1}E_{\mathcal{S}*}^TE_{\mathcal{S}*})\| = O_P\{n/(\gamma_r p)\}$. Let $\Gamma = \tilde{L}^T\tilde{L}$ and $n^{1/2}p^{-1/2}L(np^{-1}L^TL)^{-1/2} = \bar{L} = \tilde{L}\hat{W}\Gamma^{-1/2}$ for some unitary matrix $\hat{W}\in\mathbb{R}^{K\times K}$. Then by Lemma S1, $\|\Gamma^{1/2}\tilde{V}\| = O_P(\gamma_r^{1/2})$, meaning
\[
\gamma_r^{-1}p^{-1/2}\left\|\tilde{X}^TE_{\mathcal{S}^c*}^T\tilde{L}_{\mathcal{S}^c*}\tilde{V}\right\| = \gamma_r^{-1/2}p^{-1/2}\left\|\tilde{X}^TE_{\mathcal{S}^c*}^T\bar{L}_{\mathcal{S}^c*}\right\|O_P(1) = O_P(\delta_r).
\]
Lastly, $\gamma_r^{-1}\|\tilde{X}^T(p^{-1}E_{\mathcal{S}^c*}^TE_{\mathcal{S}^c*})\| = O_P(\delta_r)$ by the proof of Corollary S3 (since $|\mathcal{S}|$ is at most finite). This completes the proof. □

Corollary
S6.
Suppose the assumptions of Corollary S4 hold with $K = K^{(o)}$ and fix a $g\in[p]$. Assume that $E_{g*} = Xs + R_g$, where $X\in\mathbb{R}^{n\times d}$ is an observed, mean 0 random variable that is independent of $R_g$ and $C$, dependent on at most finitely many other rows of $E$ and $\|n^{-1}X^TX - \Sigma_x\| = o_P(1)$ for some non-random $\Sigma_x\succ0$. Assume that $V(R_g) = \mathcal{V}(\alpha_g)$ for some $\alpha_g\in\Theta_*$ and $X, R_g$ have uniformly bounded sub-Gaussian norm. Then if $\|\hat{\bar{V}}-\bar{V}\| = O_P(1/n)$, the estimator $\hat{\alpha}_g = \arg\max_{\theta\in\Theta_*}f(\theta)$,
\[
f(\theta) = -n^{-1}\log\left\{\left|P_M^{\perp}\mathcal{V}(\theta)P_M^{\perp}\right|_+\right\} - n^{-1}(P_M^{\perp}Y_{g*})^T\left\{P_M^{\perp}\mathcal{V}(\theta)P_M^{\perp}\right\}^{\dagger}(P_M^{\perp}Y_{g*}),
\]
satisfies $\|\hat{\alpha}_g - \alpha_g\| = O_P\{n^{-1/2} + n^{1/2}(p\gamma_K)^{-1/2}\}$, where $M = ((\hat{\bar{V}}^{1/2}\hat{\tilde{C}})\;\; X)$.

Proof. First,
\[
n^{-1}(P_M^{\perp}Y_{g*})^T\left\{P_M^{\perp}\mathcal{V}(\theta)P_M^{\perp}\right\}^{\dagger}(P_M^{\perp}Y_{g*}) = n^{-1}\operatorname{Tr}\left[(P_M^{\perp}Y_{g*})(P_M^{\perp}Y_{g*})^T\left\{P_M^{\perp}\mathcal{V}(\theta)P_M^{\perp}\right\}^{\dagger}\right].
\]
Since $\|\mathcal{V}(\theta)\|$ and $\|\{\mathcal{V}(\theta)\}^{-1}\|$ are uniformly bounded for all $\theta\in\Theta_*$, $\|B_j\|\le c$ for some constant $c>0$. Because $R_g$ is sub-Gaussian, we need only show that
\[
\left\|n^{-1}(P_M^{\perp}Y_{g*})(P_M^{\perp}Y_{g*})^T - n^{-1}R_gR_g^T\right\| = O_P\left\{n^{-1/2} + n^{1/2}(p\gamma_K)^{-1/2}\right\}
\]
to complete the proof. Let $\hat{C} = \hat{\bar{V}}^{1/2}\hat{\tilde{C}}$. Then for $\tilde{\ell}_g = p^{1/2}\tilde{L}_{g*}$ (where $\|\tilde{\ell}_g\|\le c$ for some constant $c>0$),
\[
Y_{g*} = n^{1/2}\hat{\bar{V}}^{1/2}\tilde{C}\tilde{\ell}_g + Xs + R_g,\qquad n^{-1/2}P_{\hat{C}}^{\perp}Y_{g*} = P_{\hat{C}}^{\perp}\hat{\bar{V}}^{1/2}\tilde{C}\tilde{\ell}_g + n^{-1/2}P_{\hat{C}}^{\perp}Xs + n^{-1/2}P_{\hat{C}}^{\perp}R_g.
\]
Corollary S4 shows that $\|P_{\hat{C}}^{\perp}\hat{\bar{V}}^{1/2}\tilde{C}\tilde{\ell}_g\| = O_P\{n^{1/2}(p\gamma_K)^{-1/2}\}$. Since $P_M^{\perp} = P^{\perp}_{P_{\hat{C}}^{\perp}X}P_{\hat{C}}^{\perp}$, we only have to understand how
\[
n^{-1/2}P_M^{\perp}R_g = n^{-1/2}P^{\perp}_{P_{\hat{C}}^{\perp}X}P_{\hat{C}}^{\perp}R_g = n^{-1/2}R_g - n^{-1/2}P_{\hat{C}}^{\perp}X\left(X^TP_{\hat{C}}^{\perp}X\right)^{-1}X^TP_{\hat{C}}^{\perp}R_g
\]
behaves. First,
\[
n^{-1}\left\|X^TX - X^TP_{\hat{C}}^{\perp}X\right\| = n^{-1}\left\|X^T\hat{C}(\hat{C}^T\hat{C})^{-1}\hat{C}^TX\right\| \ge cn^{-1}\left\|X^T\hat{C}\right\|^2
\]
for some constant $c>0$. By Corollary S5 and because $X$ has uniformly bounded sub-Gaussian norm, $n^{-1/2}\|X^T\hat{C}\| = O_P(n^{-1/2} + \phi_1\gamma_K^{-1/2} + \phi_2\gamma_K^{-1})$. Second, by the same technique as we used above,
\[
n^{-1}X^TP_{\hat{C}}^{\perp}R_g = n^{-1}X^TR_g - n^{-1}X^T\hat{C}(\hat{C}^T\hat{C})^{-1}\hat{C}^TR_g = O_P\left\{n^{-1/2} + \left(n^{-1/2}+\phi_1\gamma_K^{-1/2}+\phi_2\gamma_K^{-1}\right)^2\right\}.
\]
Putting all this together shows that $n^{-1/2}\|P_M^{\perp}Y_{g*} - R_g\| = O_P\{n^{-1/2} + n^{1/2}(p\gamma_K)^{-1/2}\}$ and completes the proof. □

S6.4 Estimating $\bar{V}$

Here we derive the asymptotic properties of the estimates for $\bar{V}$ from step (b)(iii) in Algorithm 1.

Lemma
S3.
Suppose Assumptions 1, 2 and 3 hold, and assume for some $t\in[J]$, the current estimate for $\bar{v}$, $\hat{\bar{v}}^{(0)}$, satisfies $\lambda_{k_t}^{-1}\|\mathcal{V}\{\hat{\bar{v}}^{(0)}\} - \bar{V}\| = o_P(1)$ as $n,p\to\infty$. Then for $P_{\hat{C}}$ defined in step (b)(i) of Algorithm 1 with $k = k_t$ and $\hat{\bar{V}} = \mathcal{V}\{\hat{\bar{v}}^{(0)}\}$, the resulting estimator $\hat{\bar{v}}$ in step (b)(iii) of Algorithm 1 satisfies
\[
\left\|\hat{\bar{v}} - \bar{v}\right\| = O_P\left\{\max\left(\lambda_{k_t+1}, 1\right)n^{-1}\right\} \tag{S14}
\]
as $n,p\to\infty$.

Proof. We first remark that by Lemma S18, the quasi log-likelihood
\[
f(\theta) = -n^{-1}\log\{|\mathcal{V}(\theta)|\} - (np)^{-1}\operatorname{Tr}\left[E^TE\{\mathcal{V}(\theta)\}^{-1}\right]
\]
is stochastically equicontinuous for $\theta\in\Theta_*$, where $|f(\bar{v}) - E\{f(\bar{v})\}| = o_P(1)$ as $n,p\to\infty$. As discussed in Remark S6, the remainder of the proof is exactly the same as the proof of Lemma S7 in McKennan et al. [35], except we replace Corollary S3 in McKennan et al. [35] with Corollary S1 stated above. The remaining details have been omitted. □

Corollary
S7.
Suppose the Assumptions of Lemma S3 hold, and let ˜ k = O (1) as n , p → ∞ . Then ifP ˆ C defined in step (b)(i) of Algorithm 1 is defined using k = ˜ k ≥ k t , then the resulting estimator ˆ¯ v in step (b)(iii) of Algorithm 1 satisfies (S14) as n , p → ∞ .Proof. The proof follows exactly from the reasoning presented in the proof of Lemma S7 inMcKennan et al. [35, page 40 of the Supplementary Material]. The details are omitted. (cid:3)
Corollary
S8.
Let $c > K$ be a large constant not dependent on $n$ or $p$ and let $\tilde{k}\in[K]$ be such that $\limsup_{n,p\to\infty}\gamma_{\tilde{k}+1} < \infty$, where $\gamma_{K+1} = 0$. Suppose Assumptions 1, 2 and 3 hold. Then the estimate for $\hat{\bar{v}}$ upon completion of step (b) of Algorithm 1 satisfies $\|\hat{\bar{v}}-\bar{v}\| = O_P(n^{-1})$ for all $k\in[\tilde{k}, \min(c, K_{\max})]$.

Proof. Let $s = \max\{k\in[K] : \limsup_{n,p\to\infty}\lambda_k = \infty\}$, where $s = 0$ if $\limsup_{n,p\to\infty}\lambda_1 < \infty$. If $s = 0$, then the proof of Lemma S3 shows that $\|\hat{\bar{v}}-\bar{v}\| = O_P(n^{-1})$ upon completion of step (a) of Algorithm 1. If $s \ge 1$, then by assumption, $\limsup_{n,p\to\infty}\lambda_{s+1} < \infty$, where $\lambda_{K+1} = 0$. Since $\phi = O_P(1)$, $\phi/\lambda_s = o_P(1)$ as $n,p\to\infty$. The result then follows by Corollary S7. □

Remark
S8.
This proves that $\|\hat{\bar{v}}-\bar{v}\| = O_P(n^{-1})$ in (15) of Theorem 4.

S6.5 Estimating $\lambda^{(o)}_1,\ldots,\lambda^{(o)}_{K^{(o)}}$

Let $M$ be any deterministic matrix such that $n^{-1}C^TMC$ is full rank with bounded minimum eigenvalue. In the main text $M = I_n$, but $M$ can be anything in general (i.e. maybe we only want to estimate the eigenvalues for a subset of samples). By Lemma S22 and (15) from Theorem 4 (whose proof is invariant to the choice of parametrization of $C$), it suffices to re-define $L^{(o)}, C^{(o)}$ to be
\[
(C^{(o)}, L^{(o)}) \in \arg\min_{(\bar{C},\bar{L})\in\mathcal{S}_{K^{(o)}}}\left\|(LC^T - \bar{L}\bar{C}^T)\hat{\bar{V}}^{-1/2}\right\|_F^2.
\]
Define
\[
A = \begin{cases} I_{K^{(o)}}\oplus 0_{(K-K^{(o)})\times(K-K^{(o)})} & \text{if } K^{(o)} < K\\ I_K & \text{if } K = K^{(o)}. \end{cases}
\]
Then $\lambda^{(o)}_1,\ldots,\lambda^{(o)}_{K^{(o)}}$ are exactly the eigenvalues of $A\hat{\Gamma}A\hat{F}A$, where
\begin{align}
\Gamma &= \operatorname{diag}(\tau_1,\ldots,\tau_K), & F &= \tilde{C}^T\hat{\bar{V}}^{1/2}M\hat{\bar{V}}^{1/2}\tilde{C} \tag{S15a}\\
\hat{\Gamma} &= \operatorname{diag}(\hat{\mu}_1,\ldots,\hat{\mu}_K) - I_K, & \hat{F} &= \hat{\tilde{C}}^T\hat{\bar{V}}^{1/2}M\hat{\bar{V}}^{1/2}\hat{\tilde{C}}. \tag{S15b}
\end{align}
The proof of the accuracy of these estimates is given below, which we use to prove Theorem 2.

Lemma
S4.
Suppose Assumptions 1, 2 and 3 hold. Then the estimate $\hat{\lambda}_s$ defined in step (d) of Algorithm 1 satisfies
\[
\hat{\lambda}^{(o)}_s/\lambda^{(o)}_s = 1 + O_P\!\left(p^{-1/2}\lambda_s^{-1/2} + \frac{n}{p\lambda_s} + \epsilon_{\mathcal{V}}\lambda_s^{-1}\right),\quad s\in[K^{(o)}].
\]

Proof.
By the assumption of an eigengap between $\gamma_{K^{(o)}}$ and $\gamma_{K^{(o)}+1}$ in Assumption 3, it suffices to assume $K^{(o)} = k_{j^*}$ for some $j^*\in[J]$. Suppose
\begin{align}
F &= \begin{pmatrix}F_{11}&F_{12}\\F_{12}^T&F_{22}\end{pmatrix}, & \hat{F} &= \begin{pmatrix}\hat{F}_{11}&\hat{F}_{12}\\\hat{F}_{12}^T&\hat{F}_{22}\end{pmatrix}\tag{S16a}\\
\Gamma_1 &= \operatorname{diag}(\tau_1,\ldots,\tau_{K^{(o)}}), & \hat{\Gamma}_1 &= \operatorname{diag}(\hat{\mu}_1,\ldots,\hat{\mu}_{K^{(o)}}) - I_{K^{(o)}}\tag{S16b}\\
\hat{v} &= \begin{pmatrix}\hat{v}_{11}&\hat{v}_{12}\\\hat{v}_{21}&\hat{v}_{22}\end{pmatrix}, & \hat{Z} &= (\hat{z}_{*1}\cdots\hat{z}_{*K^{(o)}})\tag{S16c}
\end{align}
where $F_{11}, \hat{F}_{11}, \hat{v}_{11}\in\mathbb{R}^{K^{(o)}\times K^{(o)}}$ and $\hat{v}_{21}\in\mathbb{R}^{(K-K^{(o)})\times K^{(o)}}$. We abuse notation when defining $\hat{v}_{11}, \hat{v}_{21}$ here. These are not the same as the vectors $\hat{v}_s$ defined in the proof of Lemma S2. Our goal is to estimate the eigenvalues of $\Gamma_1^{1/2}F_{11}\Gamma_1^{1/2}$. First,
\begin{align*}
\hat{\Gamma}_1^{1/2}\hat{F}_{11}\hat{\Gamma}_1^{1/2} &= \hat{\Gamma}_1^{1/2}(\hat{v}_{11}^T,\hat{v}_{21}^T)F(\hat{v}_{11}^T,\hat{v}_{21}^T)^T\hat{\Gamma}_1^{1/2} + \hat{\Gamma}_1^{1/2}(\hat{v}_{11}^T,\hat{v}_{21}^T)\tilde{C}^T\hat{\bar{V}}^{1/2}M\hat{\bar{V}}^{1/2}Q_{\tilde{C}}\hat{Z}\hat{\Gamma}_1^{1/2} + \left\{\hat{\Gamma}_1^{1/2}(\hat{v}_{11}^T,\hat{v}_{21}^T)\tilde{C}^T\hat{\bar{V}}^{1/2}M\hat{\bar{V}}^{1/2}Q_{\tilde{C}}\hat{Z}\hat{\Gamma}_1^{1/2}\right\}^T + O_P\!\left(np^{-1}+\phi^2\lambda_{K^{(o)}}^{-1}\right)\\
&= H^T\left(\hat{v}_{11}\hat{\Gamma}_1\hat{v}_{11}^T\right)^{1/2}\hat{\tilde{F}}\left(\hat{v}_{11}\hat{\Gamma}_1\hat{v}_{11}^T\right)^{1/2}H + O_P\!\left(np^{-1}+\phi^2\lambda_{K^{(o)}}^{-1}\right),
\end{align*}
where
\[
\hat{\tilde{F}} = (I_{K^{(o)}}, \hat{v}_{11}^{-T}\hat{v}_{21}^T)F(I_{K^{(o)}}, \hat{v}_{11}^{-T}\hat{v}_{21}^T)^T + (I_{K^{(o)}}, \hat{v}_{11}^{-T}\hat{v}_{21}^T)\tilde{C}^T\hat{\bar{V}}^{1/2}M\hat{\bar{V}}^{1/2}Q_{\tilde{C}}\hat{Z}\hat{v}_{11}^{-1} + \left\{(I_{K^{(o)}}, \hat{v}_{11}^{-T}\hat{v}_{21}^T)\tilde{C}^T\hat{\bar{V}}^{1/2}M\hat{\bar{V}}^{1/2}Q_{\tilde{C}}\hat{Z}\hat{v}_{11}^{-1}\right\}^T
\]
and $H\in\mathbb{R}^{K^{(o)}\times K^{(o)}}$ is a unitary matrix such that $H\hat{\Gamma}_1^{1/2}\hat{v}_{11}^T = (\hat{v}_{11}\hat{\Gamma}_1\hat{v}_{11}^T)^{1/2}$. By Lemma S2, we can write $\hat{v}_{11}$ as
\[
\hat{v}_{11} = \underbrace{\operatorname{diag}(U_1,\ldots,U_{j^*})}_{=U} + \epsilon,
\]
where $U_j\in\mathbb{R}^{(k_j-k_{j-1})\times(k_j-k_{j-1})}$ is a unitary matrix and
\[
\epsilon = \Big(\underbrace{O_P\!\left(\phi\lambda_1^{-1/2}+\phi^2\lambda_1^{-1}+np^{-1}\lambda_1^{-1}\right)}_{K^{(o)}\times1}\;\cdots\;\underbrace{O_P\!\left(\phi\lambda_{K^{(o)}}^{-1/2}+\phi^2\lambda_{K^{(o)}}^{-1}+np^{-1}\lambda_{K^{(o)}}^{-1}\right)}_{K^{(o)}\times1}\Big).
\]
Next,
\[
\hat{v}_{11}^{-1} = U^T + U^T\sum_{t=0}^{\infty}(\epsilon U)^t(\epsilon U),
\]
where
\[
\epsilon U = (\epsilon_1U_1\cdots\epsilon_{j^*}U_{j^*}) = \Big(\underbrace{O_P\!\left(\phi\lambda_{k_1}^{-1/2}+\phi^2\lambda_{k_1}^{-1}+np^{-1}\lambda_{k_1}^{-1}\right)}_{K^{(o)}\times k_1}\;\cdots\;\underbrace{O_P\!\left(\phi\lambda_{k_{j^*}}^{-1/2}+\phi^2\lambda_{k_{j^*}}^{-1}+np^{-1}\lambda_{k_{j^*}}^{-1}\right)}_{K^{(o)}\times(k_{j^*}-k_{j^*-1})}\Big).
\]
Therefore,
\[
\hat{v}_{11}^{-1} = U^T + \Big(\underbrace{O_P\!\left(\phi\lambda_1^{-1/2}+\phi^2\lambda_1^{-1}+np^{-1}\lambda_1^{-1}\right)}_{K^{(o)}\times1}\;\cdots\;\underbrace{O_P\!\left(\phi\lambda_{K^{(o)}}^{-1/2}+\phi^2\lambda_{K^{(o)}}^{-1}+np^{-1}\lambda_{K^{(o)}}^{-1}\right)}_{K^{(o)}\times1}\Big).
\]
By the proof of Lemma S2,
\[
\hat{v}_{11}\hat{\Gamma}_1\hat{v}_{11}^T = \sum_{j=1}^{j^*}\bar{V}_j\tilde{M}_j\bar{V}_j^T + O_P\!\left(\phi\lambda_{K^{(o)}}^{-1/2}+np^{-1}+\phi^2\lambda_{K^{(o)}}^{-1}\right) \tag{S17}
\]
where $\tilde{M}_j = \operatorname{diag}(\mu_{k_{j-1}+1}-1,\ldots,\mu_{k_j}-1)$ and $\bar{V}_j = (V_{1j}^T\cdots V_{j^*j}^T)^T$ for $V_{rs}$, $r,s\in[j^*]$, defined in the statement of Lemma S1. By (S8a) and (S8b),
\[
G_j = (\bar{V}_j^T\bar{V}_j)^{-1/2} = I_{(k_j-k_{j-1})} - O_P\!\left(\phi\lambda_{k_j}^{-1}+\phi^2\lambda_{k_j}^{-2}\right).
\]
Therefore,
\[
\hat{v}_{11}\hat{\Gamma}_1\hat{v}_{11}^T = \sum_{j=1}^{j^*}\tilde{V}_jG_j\tilde{M}_jG_j\tilde{V}_j^T + O_P\!\left(\phi\lambda_{K^{(o)}}^{-1/2}+np^{-1}+\phi^2\lambda_{K^{(o)}}^{-1}\right) = \tilde{V}\operatorname{diag}(\tilde{\mu}_1,\ldots,\tilde{\mu}_{K^{(o)}})\tilde{V}^T,
\]
where for $\phi_0 = \phi + n^{1/2}p^{-1/2}$ and by Lemmas S16 and S17,
\begin{align*}
\tilde{\mu}_s &= \tau_s\left\{1 + O_P\!\left(\phi_0\lambda_s^{-1/2}+\phi_0^2\lambda_s^{-1}\right)\right\},\quad s\in[K^{(o)}]\\
\left\|\tilde{V}_{*(k_{j-1}+1:k_j)} - \tilde{V}_jG_jW_j\right\|_F &= O_P\!\left\{\lambda_{k_j}^{-1}\left(\phi\lambda_{K^{(o)}}^{-1/2}+np^{-1}+\phi^2\lambda_{K^{(o)}}^{-1}\right)\right\},\quad j\in[j^*],
\end{align*}
where $W_j\in\mathbb{R}^{(k_j-k_{j-1})\times(k_j-k_{j-1})}$ is a unitary matrix. Next, let $A_{ij}, \tilde{V}_{ij}\in\mathbb{R}^{(k_i-k_{i-1})\times(k_j-k_{j-1})}$ be the sub-matrices of $(\hat{v}_{11}\hat{\Gamma}_1\hat{v}_{11}^T)^{1/2}$ and $\tilde{V}$, respectively, containing the $[k_{i-1}+1]$th through $k_i$th rows and $[k_{j-1}+1]$th through $k_j$th columns. Then
\[
A_{rs} = \sum_{j=1}^{j^*}\tilde{V}_{rj}\operatorname{diag}\!\left(\tilde{\mu}_{k_{j-1}+1}^{1/2},\ldots,\tilde{\mu}_{k_j}^{1/2}\right)\tilde{V}_{sj}^T = O_P\!\left(\phi_0+\phi_0^2\lambda_{k_{\min(r,s)}}^{-1/2}\right),\quad r\neq s\in[j^*].
\]
First,
\[
\left\|(\tilde{V}_{rr}\tilde{V}_{rr}^T)^{-1/2} - I_{(k_r-k_{r-1})}\right\|_F = O_P\!\left(\phi\lambda_{k_r}^{-1}+\phi^2\lambda_{k_r}^{-2}\right),\quad r\in[j^*]. \tag{S18}
\]
Therefore,
\begin{align*}
A_{rr} &= \tilde{V}_{rr}\operatorname{diag}\!\left(\tilde{\mu}_{k_{r-1}+1}^{1/2},\ldots,\tilde{\mu}_{k_r}^{1/2}\right)\tilde{V}_{rr}^T + \sum_{j\neq r}^{j^*}\tilde{V}_{rj}\operatorname{diag}\!\left(\tilde{\mu}_{k_{j-1}+1}^{1/2},\ldots,\tilde{\mu}_{k_j}^{1/2}\right)\tilde{V}_{rj}^T\\
&= \left\{\tilde{V}_{rr}\operatorname{diag}(\tilde{\mu}_{k_{r-1}+1},\ldots,\tilde{\mu}_{k_r})\tilde{V}_{rr}^T\right\}^{1/2} + O_P\!\left(\phi_0+\phi_0^2\lambda_{k_r}^{-1/2}\right)\\
&= \left\{V_{rr}\operatorname{diag}(\mu_{k_{r-1}+1}-1,\ldots,\mu_{k_r}-1)V_{rr}^T\right\}^{1/2} + O_P\!\left(\phi_0+\phi_0^2\lambda_{k_r}^{-1/2}\right)\\
&= \operatorname{diag}\!\left(\tau_{k_{r-1}+1}^{1/2},\ldots,\tau_{k_r}^{1/2}\right) + O_P\!\left(\phi_0+\phi_0^2\lambda_{k_r}^{-1/2}\right),\quad r\in[j^*],
\end{align*}
where the second equality follows by (S18) and (S17), the third equality follows from (S17) and the last equality follows from Lemma S20 and the fact that
\[
\left\|V_{rr}\operatorname{diag}(\mu_{k_{r-1}+1}-1,\ldots,\mu_{k_r}-1)V_{rr}^T - \operatorname{diag}(\tau_{k_{r-1}+1},\ldots,\tau_{k_r})\right\| = O_P\!\left(\phi\lambda_{k_r}^{1/2}+\phi^2\right)
\]
by the proof of Lemma S1. Define $R = (\hat{v}_{11}\hat{\Gamma}_1\hat{v}_{11}^T)^{1/2} - \Gamma_1^{1/2}$. First,
\[
(I_{K^{(o)}}, \hat{v}_{11}^{-T}\hat{v}_{21}^T)F(I_{K^{(o)}}, \hat{v}_{11}^{-T}\hat{v}_{21}^T)^T = F_{11} + F_{12}\hat{v}_{21}\hat{v}_{11}^{-1} + (F_{12}\hat{v}_{21}\hat{v}_{11}^{-1})^T + \hat{v}_{11}^{-T}\hat{v}_{21}^TF_{22}\hat{v}_{21}\hat{v}_{11}^{-1}.
\]
By (S11c), the expansion of $\hat{v}_{11}^{-1}$ above and Corollary S3,
\begin{align*}
\hat{v}_{11}^{-T}\hat{v}_{21}^TF_{22}\hat{v}_{21}\hat{v}_{11}^{-1},\; F_{12}\hat{v}_{21}\hat{v}_{11}^{-1} &= \Big(\underbrace{O_P\!\left(\phi\lambda_1^{-1/2}+\phi^2\lambda_1^{-1}\right)}_{K^{(o)}\times1}\;\cdots\;\underbrace{O_P\!\left(\phi\lambda_{K^{(o)}}^{-1/2}+\phi^2\lambda_{K^{(o)}}^{-1}\right)}_{K^{(o)}\times1}\Big)\\
(I_{K^{(o)}}, \hat{v}_{11}^{-T}\hat{v}_{21}^T)\tilde{C}^T\hat{\bar{V}}^{1/2}M\hat{\bar{V}}^{1/2}Q_{\tilde{C}}\hat{Z}\hat{v}_{11}^{-1} &= \Big(\underbrace{O_P\!\left(\phi\lambda_1^{-1/2}+\phi^2\lambda_1^{-1}\right)}_{K^{(o)}\times1}\;\cdots\;\underbrace{O_P\!\left(\phi\lambda_{K^{(o)}}^{-1/2}+\phi^2\lambda_{K^{(o)}}^{-1}\right)}_{K^{(o)}\times1}\Big).
\end{align*}
This shows that $\hat{\tilde{F}} = F_{11} + \Delta$,
\[
\Delta_{rs} = O_P\!\left\{\phi\left(\lambda_r^{-1/2}+\lambda_s^{-1/2}\right) + \phi^2\left(\lambda_r^{-1}+\lambda_s^{-1}\right)\right\},\quad r,s\in[K^{(o)}],
\]
and
\[
\left\{R(F_{11}+\Delta)\Gamma_1^{1/2} + \Gamma_1^{1/2}(F_{11}+\Delta)R + \Gamma_1^{1/2}\Delta\Gamma_1^{1/2}\right\}_{rs} = O_P\!\left\{\phi\left(\lambda_r^{1/2}+\lambda_s^{1/2}\right) + \phi^2\left(\lambda_r^{1/2}\lambda_s^{-1/2}+\lambda_s^{1/2}\lambda_r^{-1/2}\right)\right\},\quad r,s\in[K^{(o)}].
\]
This shows that
\[
\left\{(\hat{v}_{11}\hat{\Gamma}_1\hat{v}_{11}^T)^{1/2}\hat{\tilde{F}}(\hat{v}_{11}\hat{\Gamma}_1\hat{v}_{11}^T)^{1/2}\right\}_{rs} = \left(\Gamma_1^{1/2}F_{11}\Gamma_1^{1/2}\right)_{rs} + O_P\!\left\{\phi\left(\lambda_r^{1/2}+\lambda_s^{1/2}\right) + \phi^2\left(\lambda_r^{1/2}\lambda_s^{-1/2}+\lambda_s^{1/2}\lambda_r^{-1/2}\right)\right\},\quad r,s\in[K^{(o)}].
\]
Let $U\in\mathbb{R}^{K^{(o)}\times K^{(o)}}$ be the eigenvectors of $\Gamma_1^{1/2}F_{11}\Gamma_1^{1/2}$. By Lemma S21, $U_{rs} = O(\lambda_{r\vee s}^{1/2}\lambda_{r\wedge s}^{-1/2})$, $r,s\in[K^{(o)}]$, meaning for any matrix $\Delta\in\mathbb{R}^{K^{(o)}\times K^{(o)}}$ that satisfies
\[
\Delta_{rs} = O_P\!\left\{\phi\left(\lambda_r^{1/2}+\lambda_s^{1/2}\right) + \phi^2\left(\lambda_r^{1/2}\lambda_s^{-1/2}+\lambda_s^{1/2}\lambda_r^{-1/2}\right)\right\},\quad r,s\in[K^{(o)}],
\]
we have $U_{*r}^T\Delta U_{*s} = O_P\{\phi(\lambda_r^{1/2}+\lambda_s^{1/2}) + \phi^2(\lambda_r^{1/2}\lambda_s^{-1/2}+\lambda_s^{1/2}\lambda_r^{-1/2})\}$, $r,s\in[K^{(o)}]$. Putting this all together gives us
\[
\hat{G}_{rs} = U_{*r}^T(\hat{v}_{11}\hat{\Gamma}_1\hat{v}_{11}^T)^{1/2}\hat{\tilde{F}}(\hat{v}_{11}\hat{\Gamma}_1\hat{v}_{11}^T)^{1/2}U_{*s} = \lambda_r^{(o)}I(r=s) + O_P\!\left\{\phi\left(\lambda_r^{1/2}+\lambda_s^{1/2}\right) + \phi^2\left(\lambda_r^{1/2}\lambda_s^{-1/2}+\lambda_s^{1/2}\lambda_r^{-1/2}\right)\right\},\quad r,s\in[K^{(o)}].
\]
This can then be written as
\[
\hat{G} = \tilde{\eta}_1w_1w_1^T + \tilde{\eta}_2w_2w_2^T + \cdots + \tilde{\eta}_{K^{(o)}}w_{K^{(o)}}w_{K^{(o)}}^T + O_P\!\left\{\phi^2 + \left(\phi^2+\frac{n}{p}\right)\lambda_{K^{(o)}}^{-1}\right\},
\]
where $\tilde{\eta}_s = \lambda_s^{(o)}[1 + O_P\{\phi\lambda_s^{-1/2}+\phi^2\lambda_s^{-1}\}]$, $s\in[K^{(o)}]$, and
\[
w_1 = \begin{pmatrix}1\\ O_P\{\phi\lambda_2^{-1/2}+\phi^2(\lambda_1\lambda_2)^{-1/2}\}\\ \vdots\\ O_P\{\phi\lambda_{K^{(o)}}^{-1/2}+\phi^2(\lambda_1\lambda_{K^{(o)}})^{-1/2}\}\end{pmatrix},\quad w_2 = \begin{pmatrix}0\\ 1\\ \vdots\\ O_P\{\phi\lambda_{K^{(o)}}^{-1/2}+\phi^2(\lambda_2\lambda_{K^{(o)}})^{-1/2}\}\end{pmatrix},\;\cdots,\; w_{K^{(o)}} = \begin{pmatrix}0\\ \vdots\\ 1\end{pmatrix}.
\]
To estimate the first eigenvalue, we see that
\[
\|w_1\|^2 = 1 + O_P\!\left\{\phi^2\lambda_2^{-1}+\phi^4(\lambda_1\lambda_{K^{(o)}})^{-1}\right\},\qquad w_1^Tw_k = O_P\!\left\{\phi\lambda_k^{-1/2}+\phi^2(\lambda_1\lambda_k)^{-1/2}\right\},\quad k\in\{2,\ldots,K^{(o)}\}.
\]
Therefore,
\begin{align*}
(\tilde{\eta}_1\|w_1\|^2)^{-1}w_1^T\hat{G}w_1 &= 1 + O_P\!\left(\frac{\lambda_2}{\lambda_1}\frac{\phi^2}{\lambda_2}+\frac{\phi^2}{\lambda_1}\right)\\
(\tilde{\eta}_1\|w_1\|^2)^{-1}\hat{G}w_1 &= \|w_1\|^{-2}w_1 + \frac{\tilde{\eta}_2(w_2^Tw_1)}{\tilde{\eta}_1\|w_1\|^2}w_2 + \cdots + \frac{\tilde{\eta}_{K^{(o)}}(w_{K^{(o)}}^Tw_1)}{\tilde{\eta}_1\|w_1\|^2}w_{K^{(o)}} = \|w_1\|^{-2}w_1 + O_P\!\left(\phi\lambda_1^{-1/2}+\phi^2\lambda_1^{-1}\right).
\end{align*}
Therefore, $\hat{\lambda}_1^{(o)} = \lambda_1^{(o)}\{1+O_P(\phi\lambda_1^{-1/2}+\phi^2\lambda_1^{-1})\}$. For the remaining eigenvalues, we use a similar technique to that used in the proof of Lemma S1. I will only determine $\lambda_2^{(o)}$; the remaining eigenvalues can be derived by a trivial extension. First,
\[
P^{\perp}_{w_1}w_2 = w_2 - \|w_1\|^{-2}(w_1^Tw_2)w_1 = w_2 - \begin{pmatrix}O_P\{\phi\lambda_2^{-1/2}+\phi^2(\lambda_1\lambda_2)^{-1/2}\}\\ O_P\{\phi^2\lambda_2^{-1}+\phi^4(\lambda_1\lambda_2)^{-1}\}\\ \vdots\\ O_P\!\left\{\phi^2\lambda_2^{-1}+\frac{\phi^2\lambda_2}{(\lambda_1\lambda_{K^{(o)}})^{1/2}}+\frac{\phi^4}{\lambda_1(\lambda_2\lambda_{K^{(o)}})^{1/2}}\right\}\end{pmatrix} = w_2 - \Delta_2.
\]
Therefore,˜ η w w T = ˜ η (cid:16) P ⊥ w w (cid:17) (cid:16) P ⊥ w w (cid:17) T + ˜ η (cid:16) P ⊥ w w (cid:17) ∆ T + ˜ η ∆ (cid:16) P ⊥ w w (cid:17) T + ˜ η ∆ ∆ T where (cid:13)(cid:13)(cid:13) P ⊥ w w (cid:13)(cid:13)(cid:13) = + O P (cid:110) φ λ − + φ ( λ λ K ( o ) ) − (cid:111) and for k > (cid:16) P ⊥ w w (cid:17) T w k = w T w k − ∆ T w k = O P (cid:110) φ λ − / + φ ( λ λ k ) − / (cid:111) . A similar technique to that used above shows thatˆ λ ( o )2 = λ ( o )2 (cid:110) + O P (cid:16) φ λ − / + φ λ − (cid:17)(cid:111) . (cid:3) Remark
S9.
It is easy to see that if $(1+\epsilon)\lambda^{(o)}_{s+1} \leq \tilde\eta_s \leq (1-\epsilon)\lambda^{(o)}_{s-1}$ for some constant $\epsilon \in (0,1)$, then Corollary S10 shows that if $\tilde w_s$ is the $s$th eigenvector of $\hat G$, then $\|\tilde w_s - w_s\| = O_P(\phi_1\lambda_s^{-1/2} + \phi_2\lambda_s^{-1})$.

S6.6 Estimating $C^{(o)}$ and $L^{(o)}$

We use the above work to prove Theorem 4. We note that Corollary S8 shows that $|\hat{\bar v}_j - \bar v_j| = O_P(n^{-1})$ in (15) of Theorem 4.

Proof of the rest of (15) in Theorem 4.
Define δ = p − / λ − / K ( o ) + n / { p λ K ( o ) } + { n λ K ( o ) } − . By CorollaryS8, the estimate ˆ¯ V in Step (b)(iii) when k = K ( o ) satisfies φ = (cid:13)(cid:13)(cid:13)(cid:13) ˆ¯ V − ¯ V (cid:13)(cid:13)(cid:13)(cid:13) = O P (cid:16) n − (cid:17) . Letˆ˜ C ∈ R n × K ( o ) be the first K ( o ) eigenvectors of ˆ¯ V − / (cid:16) p − Y T Y (cid:17) ˆ¯ V − / . By Lemma S22, n − / C ( o ) = ˆ¯ V / (cid:16) ˜ C ∗ · · · ˜ C ∗ K ( o ) (cid:17) R + O P [ { λ K ( o ) n } − ] , where ˜ C is defined in (S3b) and R ∈ R K ( o ) × K ( o ) is an invertible matrix that satisfies (cid:107) R (cid:107) = O (1) as n , p → ∞ . Further, n − / ˆ C = ˆ¯ V / ˆ˜ C = ˆ¯ V / ˜ C ( ˆ v ∗ · · · ˆ v ∗ K ( o ) ) ˆ R + ˆ¯ V / Q ˜ C ( ˆ z ∗ · · · ˆ z ∗ K ( o ) ) ˆ R , where ˆ R ∈ R K ( o ) × K ( o ) is an invertible matrix that satisfies (cid:13)(cid:13)(cid:13) ˆ R (cid:13)(cid:13)(cid:13) = O (1) as n , p → ∞ . (S11c) inLemma S2 then shows that n − / ˆ C = n − / C ( o ) ˆ˜ R + ˆ¯ V / Q ˜ C ( ˆ z ∗ · · · ˆ z ∗ K ( o ) ) ˆ R + O P ( δ ) , where ˆ˜ R ∈ R K ( o ) × K ( o ) is an invertible matrix (with probability tending to 1 as n , p → ∞ ) and (cid:13)(cid:13)(cid:13)(cid:13) ˆ˜ R (cid:13)(cid:13)(cid:13)(cid:13) = O P (1) as n , p → ∞ . For notational convenience, I re-define C ← n − / C ( o ) and ˆ C ← n − / ˆ C forthe remainder of the proof.First, 2 − (cid:13)(cid:13)(cid:13) P ˆ C − P C (cid:13)(cid:13)(cid:13) F = K ( o ) − Tr (cid:26)(cid:0) C T C (cid:1) − C T ˆ C (cid:16) ˆ C T ˆ C (cid:17) − ˆ C T C (cid:27) , where by Corollary S3, C T ˆ C = C T C ˆ˜ R + O P ( δ )ˆ C T ˆ C = ˆ˜ R T C T C ˆ˜ R + O P ( δ ) . This completes the proof. (cid:3)
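The final step above reduces the subspace error to a trace, via the identity $2^{-1}\|P_{\hat C} - P_C\|_F^2 = K^{(o)} - \mathrm{Tr}\{(C^TC)^{-1}C^T\hat C(\hat C^T\hat C)^{-1}\hat C^TC\}$. The identity is purely algebraic and can be checked numerically; the sketch below uses arbitrary full-column-rank matrices (the dimensions and perturbation are illustrative, not the paper's estimators).

```python
import numpy as np

rng = np.random.default_rng(0)
n, K = 50, 3

# Two full-column-rank bases for K-dimensional subspaces of R^n.
C = rng.standard_normal((n, K))
C_hat = C + 0.1 * rng.standard_normal((n, K))  # a perturbed copy

def proj(A):
    """Orthogonal projection onto the column space of A."""
    return A @ np.linalg.solve(A.T @ A, A.T)

# Left-hand side: half the squared Frobenius distance between projections.
lhs = 0.5 * np.linalg.norm(proj(C_hat) - proj(C), "fro") ** 2

# Right-hand side: K - Tr{(C'C)^{-1} C'Chat (Chat'Chat)^{-1} Chat'C}.
M = np.linalg.solve(C.T @ C, C.T @ C_hat) @ \
    np.linalg.solve(C_hat.T @ C_hat, C_hat.T @ C)
rhs = K - np.trace(M)

assert abs(lhs - rhs) < 1e-10  # the two expressions agree exactly
```

The identity follows from $\|P_1 - P_2\|_F^2 = \mathrm{Tr}(P_1) + \mathrm{Tr}(P_2) - 2\,\mathrm{Tr}(P_1P_2)$ and the cyclic property of the trace.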
Proof of (13) in Theorem 3 and (16) in Theorem 4.
By Lemma S22 and because (cid:13)(cid:13)(cid:13)(cid:13) ˆ¯ V − ¯ V (cid:13)(cid:13)(cid:13)(cid:13) = O P (cid:16) n − (cid:17) when k = K ( o ) , it su ffi ces to re-define C ( o ) and L ( o ) to be { C ( o ) , L ( o ) } = arg min ( ¯ C , ¯ L ) ∈S K ( o ) (cid:13)(cid:13)(cid:13)(cid:13) ( LC T − ¯ L ¯ C T ) ˆ¯ V − / (cid:13)(cid:13)(cid:13)(cid:13) F . This implies for ˜ L , ˜ C defined in (S3) and F , Γ defined in (S15) and (S16) (with M = I n ), thereexists a unitary matrix W ∈ R K ( o ) × K ( o ) such that n − / C ( o ) = ˆ¯ V / ˜ CAF − / W (S19a) n / p − / L ( o ) = ˜ LAF / W (S19b) W T F / Γ F / W = diag (cid:110) λ ( o )1 , . . . , λ ( o ) K ( o ) (cid:111) , (S19c)51here A = ( I K ( o ) ) T ∈ R K × K ( o ) . Additionally, for ˆ F , ˆ Γ defined in (S15) and (S16) (with M = I n ), n − / ˆ C = ˆ¯ V / ˆ˜ C ˆ F − / ˆ W = ˆ¯ V / ˜ C (cid:32) ˆ v ˆ v (cid:33) ˆ F − / ˆ W + ˆ¯ V / Q ˜ C ˆ Z ˆ F − / ˆ W (S20a) n / p − / ˆ L = ˆ˜ L ˆ F / ˆ W (S20b)ˆ W T ˆ F / ˆ Γ ˆ F / ˆ W = diag (cid:110) ˆ λ ( o )1 , . . . , ˆ λ ( o ) K ( o ) (cid:111) , (S20c)where ˆ˜ C ∈ R n × K ( o ) are the first K ( o ) right singular vectors of Y ˆ¯ V / , ˆ v ∈ R K ( o ) × K ( o ) , ˆ v ∈ R ( K − K ( o ) ) × K ( o ) and ˆ Z ∈ R ( n − K ) × K ( o ) are defined in (S16) andˆ˜ L = p − / Y ˆ¯ V / ˆ˜ C = ˜ L (cid:32) ˆ v ˆ v (cid:33) + p − / E (cid:32) ˆ v ˆ v (cid:33) + p − / E ˆ Z . (S21)In the above equation, E ∈ R p × K , E ∈ R p × ( n − K ) are defined in (S5). Let D = diag (cid:110) λ ( o )1 , . . . , λ ( o ) K ( o ) (cid:111) and ˆ D = diag (cid:110) ˆ λ ( o )1 , . . . , ˆ λ ( o ) K ( o ) (cid:111) . 
Then for U , ˆ U ∈ R K ( o ) × K ( o ) the eigenvectors of Γ / F Γ / andˆ Γ / ˆ F ˆ Γ / , respectively, W T F / Γ F / W = D = U T Γ / F Γ / U ⇒ F − / W = F − Γ − / U D / ˆ W T ˆ F / ˆ Γ ˆ F / ˆ W = ˆ D = ˆ U T ˆ Γ / ˆ F ˆ Γ / ˆ U ⇒ ˆ F − / ˆ W = ˆ F − ˆ Γ − / ˆ U ˆ D / where ˆ F − / ˆ W = ˆ F − ˆ Γ − / ˆ U ˆ D / = ˆ Γ / (cid:16) ˆ Γ / ˆ F ˆ Γ / (cid:17) − (cid:124) (cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32) (cid:123)(cid:122) (cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32) (cid:125) = ˆ U ˆ D − ˆ U T ˆ U ˆ D / = ˆ Γ / ˆ U ˆ D − / . This implies for F defined in (S16), n − { C ( o ) } T ˆ C = W T F − / (cid:16) F ˆ v + F ˆ v + ˜ C T ˆ V Q ˜ C ˆ Z (cid:17) ˆ F − / ˆ W = D / U T Γ − / (cid:16) ˆ v + F − F ˆ v + F − ˜ C T ˆ V Q ˜ C ˆ Z (cid:17) ˆ F − / ˆ W = D / U T Γ − / (cid:0) ˆ v Γ ˆ v T (cid:1) / H ˆ U ˆ D − / + F − (cid:16) F ˆ v + ˜ C T ˆ V Q ˜ C ˆ Z (cid:17) ˆ Γ / ˆ U ˆ D − / , where H ∈ R K ( o ) × K ( o ) is the same unitary matrix defined in the proof of Lemma S4 and satisfiesˆ v ˆ Γ / = (cid:16) ˆ v Γ ˆ v T (cid:17) / H . By the proof of Lemma S4, ˆ U = H T U ˜ W for the unitary matrix ˜ W = ( ˜ w · · · ˜ w K ( o ) ) ∈ R K ( o ) × K ( o ) , where ˜ w s is defined in Remark S9. 
Next, (S11c) in Lemma S2 andCorollary S3 imply F − (cid:16) F ˆ v + ˜ C T ˆ V Q ˜ C ˆ Z (cid:17) = O P (cid:110) p − / γ − / + n / ( γ p ) + ( γ n ) − (cid:111)(cid:124) (cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32) (cid:123)(cid:122) (cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32) (cid:125) K ( o ) × · · · O P (cid:110) p − / γ − / K ( o ) + n / ( γ K ( o ) p ) + ( γ K ( o ) n ) − (cid:111)(cid:124) (cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32) (cid:123)(cid:122) (cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32) (cid:125) K ( o ) × . U rs = O P (cid:16) γ / r ∨ s γ − / r ∧ s (cid:17) by Lemma S21. 
Therefore, for any r ∈ [ K ( o ) ], n − { C ( o ) ∗ r } T ˆ C ∗ r = (cid:40) λ ( o ) r ˆ λ ( o ) r (cid:41) / U T ∗ r Γ − / (cid:0) ˆ v Γ ˆ v T (cid:1) / U ˜ w r + O P (cid:110) p − / γ − / r + n / ( γ r p ) + ( γ r n ) − (cid:111) . Therefore, we need only understand how U T ∗ r Γ − / (cid:16) ˆ v Γ ˆ v T (cid:17) / U ˜ w r behaves. As we did in theproof of Lemma S4, let R = (cid:16) ˆ v Γ ˆ v T (cid:17) / − Γ / . We showed in the proof of Lemma S4 that R rs = O P (cid:16) φ + φ γ − / r ∧ s (cid:17) for φ = p − / and φ = n / p + n − . Therefore, U T ∗ r Γ − / (cid:0) ˆ v Γ ˆ v T (cid:1) / U ˜ w r = ˜ w r r + U T ∗ r Γ − / RU ˜ w r U T ∗ r Γ − / RU ∗ s = O P (cid:16) φ γ − / r + φ γ − / r γ − / s (cid:17) , s ∈ [ K ( o ) ] . The proof of Lemma S4 and Remark S9 show that on the event F ( (cid:15) ) r , defined in the statement ofTheorem 3, (cid:12)(cid:12)(cid:12) ˜ w r r (cid:12)(cid:12)(cid:12) = − O P (cid:16) φ γ − / r + φ γ − r (cid:17) ˜ w r s = O P (cid:16) φ γ − / r + φ { γ r γ s } − / (cid:17) , r < s ∈ [ K ( o ) ] . This completes the proof of (16).It remains to prove (13). Using the expression of ˆ˜ L in (S21),ˆ L = p / n − / ˜ L ˆ v ˆ Γ − / ˆ U ˆ D / + p / n − / ˜ L ˆ v ˆ Γ − / ˆ U ˆ D / + n − / E (cid:32) ˆ v ˆ v (cid:33) + n − / E ˆ Z , where ˜ L = ( ˜ L ˜ L ). First, for any g ∈ [ p ] and some constant c > a n , p = O P (1 / n ) that does not depend on g , (cid:13)(cid:13)(cid:13) E g ∗ (cid:13)(cid:13)(cid:13) ≤ c (cid:13)(cid:13)(cid:13)(cid:13) n − / C T ˆ¯ V − / E g ∗ (cid:13)(cid:13)(cid:13)(cid:13) ≤ c (cid:13)(cid:13)(cid:13) n − / C T ¯ V − / E g ∗ (cid:13)(cid:13)(cid:13) + a n , p (cid:13)(cid:13)(cid:13) E g ∗ (cid:13)(cid:13)(cid:13) Since n − / C T ¯ V − / E g ∗ is sub-Gaussian with uniformly bounded sub-Gaussian norm,sup g ∈ [ p ] (cid:13)(cid:13)(cid:13) n − / C T ¯ V − / E g ∗ (cid:13)(cid:13)(cid:13) = O P (cid:8) log( p ) (cid:9) . 
Further, for some constant c > t ≥ g ∈ [ p ] (cid:13)(cid:13)(cid:13) E g ∗ (cid:13)(cid:13)(cid:13) ≤ cn + sup g ∈ [ p ] (cid:12)(cid:12)(cid:12)(cid:12) E T g ∗ E g ∗ − E (cid:16) E T g ∗ E g ∗ (cid:17)(cid:12)(cid:12)(cid:12)(cid:12) P (cid:26)(cid:12)(cid:12)(cid:12)(cid:12) E T g ∗ E g ∗ − E (cid:16) E T g ∗ E g ∗ (cid:17)(cid:12)(cid:12)(cid:12)(cid:12) ≥ tn / (cid:27) ≤ (cid:110) − min (cid:16) ct , ct (cid:17)(cid:111) , where the second equality follows from Lemma S15. Therefore, if log( p ) / n / → g ∈ [ p ] (cid:13)(cid:13)(cid:13) E g ∗ (cid:13)(cid:13)(cid:13) ≤ c / n / (cid:104) + O P (cid:110) log( p ) n − / (cid:111)(cid:105) . (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) n − / E (cid:32) ˆ v ˆ v (cid:33) ∗ r + n − / E ˆ Z ∗ r (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) ∞ = O P log( p ) n − / + (cid:32) n γ r p (cid:33) / + n − . Next, since (cid:13)(cid:13)(cid:13) ˆ Γ − / ˆ U ˆ D / (cid:13)(cid:13)(cid:13) = O P (1) and by Lemma S2,ˆ v (cid:13)(cid:13)(cid:13) ˆ Γ − / ˆ U ˆ D / (cid:13)(cid:13)(cid:13) = O P (cid:16) φ γ − / K ( o ) + φ γ − K ( o ) (cid:17) . Therefore, it su ffi ces to assume ˆ L = p / n − / ˜ L ˆ v ˆ Γ − / ˆ U ˆ D / . We then see that L ( o ) ∗ r − ˆ L ∗ r = p / n − / ˜ L { λ ( o ) r } / Γ − / U ∗ r − { ˆ λ ( o ) r } / ˆ v ˆ Γ − / H T U ˜ w r (cid:124) (cid:32)(cid:32)(cid:32)(cid:32) (cid:123)(cid:122) (cid:32)(cid:32)(cid:32)(cid:32) (cid:125) = ˆ U ∗ r . Since (cid:13)(cid:13)(cid:13) γ / r ˆ v ˆ Γ − / ˆ U ∗ r (cid:13)(cid:13)(cid:13) = O P (1), { λ ( o ) r } / Γ − / U ∗ r − { ˆ λ ( o ) r } / ˆ v ˆ Γ − / H T U ˜ w r = { λ ( o ) r } / (cid:16) Γ − / U ∗ r − ˆ v ˆ Γ − / H T U ˜ w r (cid:17) + O P (cid:16) φ γ − / r + φ γ − r (cid:17) . Next, since H T = ˆ Γ / ˆ v T (cid:16) ˆ v ˆ Γ ˆ v T (cid:17) − / and for a r ∈ R K ( o ) the r th standard basis vector,ˆ v ˆ Γ − / H T U ˜ w r = ˆ v ˆ v T (cid:16) ˆ v ˆ Γ ˆ v T (cid:17) − / U ∗ r + ˆ v ˆ v T (cid:16) ˆ v ˆ Γ ˆ v T (cid:17) − / U ( ˜ w r − a r ) . 
By the proof of Lemma S4 and Remark S9, (cid:107) ˜ w r − a r (cid:107) = O P (cid:16) φ γ − / r + φ γ − / r γ − / K ( o ) (cid:17) . Therefore, { λ ( o ) r } / (cid:13)(cid:13)(cid:13)(cid:13) ˆ v ˆ v T (cid:16) ˆ v ˆ Γ ˆ v T (cid:17) − / U ( ˜ w r − a r ) (cid:13)(cid:13)(cid:13)(cid:13) = O P (cid:16) φ γ − / K ( o ) + φ γ − K ( o ) (cid:17) . Next, for R defined above, (cid:16) ˆ v ˆ Γ ˆ v T (cid:17) − / = Γ − / I K ( o ) − ∞ (cid:88) t = ( − t (cid:16) R Γ − / (cid:17) t ( R Γ − / ) { λ ( o ) r } / γ − / K ( o ) R Γ − / U ∗ r = O P (cid:16) φ γ − / K ( o ) + φ γ − K ( o ) (cid:17) . Lastly, by Lemma S2, (cid:13)(cid:13)(cid:13) { λ ( o ) r } / (cid:0) I K ( o ) − ˆ v ˆ v T (cid:1) Γ − / U ∗ r (cid:13)(cid:13)(cid:13) = (cid:13)(cid:13)(cid:13) I K ( o ) − ˆ v ˆ v T (cid:13)(cid:13)(cid:13) O P (1) = O P (cid:16) φ γ − / K ( o ) + φ γ − K ( o ) (cid:17) . Putting this all together gives us (cid:13)(cid:13)(cid:13) { λ ( o ) r } / Γ − / U ∗ r − { ˆ λ ( o ) r } / ˆ v ˆ Γ − / H T U ˜ w r (cid:13)(cid:13)(cid:13) = O P (cid:16) φ γ − / K ( o ) + φ γ − K ( o ) (cid:17) . This shows that on the event F ( (cid:15) ) r , (cid:13)(cid:13)(cid:13) L ( o ) ∗ r − a ˆ L ∗ r (cid:13)(cid:13)(cid:13) ∞ = O P log( p ) n − / + (cid:32) n γ K ( o ) p (cid:33) / + n − for some a ∈ {− , } , which completes the proof. (cid:3)
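The uniform bound on $\sup_{g\in[p]}\|E_{g*}\|$ used above relies on the fact that the maximum of $p$ independent sub-Gaussian vector norms grows only logarithmically in $p$. A quick simulation of the standard Gaussian case (the dimension and distribution are illustrative, not the paper's error model) shows the maximum norm tracking $\sqrt{2\log p}$:

```python
import numpy as np

rng = np.random.default_rng(1)
K = 5  # dimension of each vector; illustrative

for p in (10**3, 10**4, 10**5):
    # p independent K-dimensional standard Gaussian (hence sub-Gaussian) vectors.
    X = rng.standard_normal((p, K))
    max_norm = np.linalg.norm(X, axis=1).max()
    # The ratio to sqrt(2 log p) stays bounded as p grows.
    print(p, round(max_norm / np.sqrt(2 * np.log(p)), 2))
```

This is why the error term in the proof is only polylogarithmic in $p$, despite the supremum being over all $p$ genomic units.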
We next prove (14) in Theorem 3.
Proof of (14) in Theorem 3.
Let ˆ V g = V ( ˆ v g ), where ˆ v g is the restricted maximum likelihood esti-mate for v g using the design matrix ˆ C ∈ R n × K (defined in Step (e) of Algorithm 1 when k = K ). ByCorollary S6, (cid:13)(cid:13)(cid:13) ˆ V g − V g (cid:13)(cid:13)(cid:13) = O P (cid:110) n − / + n / ( p γ K ) − / (cid:111) . Let ˆ˜ C ∈ R n × K be the first K right singularvectors of Y ˆ¯ V − / . The generalised least squares estimate for L ( o ) g ∗ is thenˆ L ( GLS ) g ∗ = (cid:16) ˆ C T ˆ V − g ˆ C (cid:17) − ˆ C T ˆ V − g Y g ∗ = ˆ W T ˆ F / (cid:18) ˆ˜ C T ˆ¯ V / ˆ V − g ˆ¯ V / ˆ˜ C (cid:19) − ˆ˜ C T ˆ¯ V / ˆ V − g ˆ¯ V / ˜ C ˜ (cid:96) g + n − / ˆ W T ˆ F / (cid:18) ˆ˜ C T ˆ¯ V / ˆ V − g ˆ¯ V / ˆ˜ C (cid:19) − ˆ˜ C T ˆ¯ V / ˆ V − g E g ∗ = M g + R g , where ˆ F is defined in (S16), ˆ W is as defined in (S20), ˜ C is given in (S3b) and ˜ (cid:96) g = p / n − / ˜ L g ∗ for ˜ L g ∗ defined in (S3c). Note that (cid:13)(cid:13)(cid:13) ˜ (cid:96) g (cid:13)(cid:13)(cid:13) ≤ c for some constant c > (cid:13)(cid:13)(cid:13)(cid:13) ˆ¯ V − ¯ V (cid:13)(cid:13)(cid:13)(cid:13) = O P (1 / n ).Let (cid:15) j , j ∈ [ b ], be such that ˆ V g − V g = (cid:80) bj = (cid:15) j B j . Then (cid:15) j = o P ( n − / ) by Corollary S6, meaningˆ˜ C T ˆ¯ V / ˆ V − g ˆ¯ V / ˜ C = ˆ˜ C T ˆ¯ V / V − g ˆ¯ V / ˜ C − b (cid:88) j = (cid:15) j (cid:18) ˆ˜ C T ˆ¯ V / V − g B j V − g ˆ¯ V / ˜ C (cid:19) + o P ( n − / ) = ˆ v T ˜ C T ˆ¯ V / V − g ˆ¯ V / ˜ C − b (cid:88) j = (cid:15) j (cid:16) ˆ v T ˜ C T ˆ¯ V / V − g ˆ¯ V / ˜ CV − g B j V − g ˆ¯ V / ˜ C (cid:17) + o P ( n − / ) = ˆ v T ˜ C T ˆ¯ V / V − g − V − g b (cid:88) j = (cid:15) j B j V − g ˆ¯ V / ˜ C + o P ( n − / ) = ˆ v T ˜ C T ˆ¯ V / ˆ V − g ˆ¯ V / ˜ C + o P ( n − / ) , where the second equality follows from Corollary S4. A similar technique shows thatˆ˜ C T ˆ¯ V / ˆ V − g ˆ¯ V / ˆ˜ C = ˆ v T ˜ C T ˆ¯ V / ˆ V − g ˆ¯ V / ˜ C ˆ v + o P ( n − / ) . Therefore, M g = ˆ W T ˆ F / ˆ v − ˜ (cid:96) g + o P ( n − / ) . 
Let a r ∈ R K be the r th standard basis vector. By Corollary S5 and the fact that (cid:13)(cid:13)(cid:13) n − / E g ∗ (cid:13)(cid:13)(cid:13) = O P (1), R g r = n − / a T r ˆ W T ˆ F / ˆ v − (cid:16) ˜ C T ˆ¯ V / V − g ˆ¯ V / ˜ C (cid:17) − ˜ C T ˆ V / V − g E g ∗ + o P ( n − / ) = O P ( (cid:13)(cid:13)(cid:13) ˆ v − T ˆ F / ˆ W a r − F / W a r (cid:13)(cid:13)(cid:13) ) + a T r [ { C ( o ) } T V − g C ( o ) ] − { C ( o ) } T V − g E g ∗ + o P ( n − / ) . To complete the proof, we need only show that (cid:13)(cid:13)(cid:13) ˆ v − T ˆ F / ˆ W a r − F / W a r (cid:13)(cid:13)(cid:13) = o P ( n − / )55n the event F ( (cid:15) ) r , where F is defined in (S16) and W is as defined in (S19). We note that if U ∈ R K × K contains the eigenvectors of Γ / F Γ / and ˜ w r is as defined in Remark S9,ˆ v − T ˆ F / ˆ W a r − F / W a r = { ˆ λ ( o ) r } / (cid:16) ˆ v ˆ Γ ˆ v T (cid:17) − / U ˜ w r − { λ ( o ) r } / Γ − / U ∗ r . However, we showed this was o P ( n − / ) in the proof of (13) in Theorem 3 and (16) in Theorem 4above. (cid:3) S7 Inference on C ( o )In this section, we prove Theorem 6. We first state and prove two useful lemmas regarding theREML estimate for the covariance of linear combinations of the columns of C . The first lemmashows that the correlation between X and ˆ C ∗ r mirrors that of X and C ( o ) ∗ r . Lemma
S5.
Let $X \in \mathbb{R}^n$ be as defined in the statement of Theorem 6, and suppose Assumptions 1, 2 and 3 hold. Let $r \in [K^{(o)}]$ and let the event $F^{(\epsilon)}_r$ be as defined in the statement of Theorem 3 for some small $\epsilon > 0$. Then if $X$ is dependent on at most $c$ rows of $E$ for some constant $c$,
$$\left|\frac{X^T C^{(o)}_{*r}}{\|X\|\,\|C^{(o)}_{*r}\|} - \frac{X^T \hat C_{*r}}{\|X\|\,\|\hat C_{*r}\|}\right| = O_P\left\{(p\gamma_r)^{-1/2} + (n^{1/2}/p + n^{-1})\gamma_r^{-1}\right\}. \tag{S22}$$

Proof of Lemma S5.
Since we are studying the empirical correlation, it su ffi ces to assume (cid:107) X (cid:107) = C to be ˆ C ← n − / ˆ C . Then using notation defined in the proof of (16) and (13)in Section S6.6, X T ˆ C ∗ r = { ˆ λ ( o ) r } − / X T ˆ¯ V / ˜ C (cid:32) ˆ v ˆ v (cid:33) ˆ Γ / ˆ U ∗ r + { ˆ λ ( o ) r } − / X T ˆ¯ V / Q ˜ C ˆ Z ˆ Γ / ˆ U ∗ r X T C ( o ) ∗ r = { λ ( o ) r } − / X T ˆ¯ V / ˜ C Γ / U r ∗ , where ˜ C = (cid:16) ˜ C ∗ · · · ˜ C ∗ K ( o ) (cid:17) and ˜ C = (cid:16) ˜ C ∗ ( K ( o ) + · · · ˜ C ∗ K (cid:17) . Next, { ˆ λ ( o ) r } − / X T ˆ¯ V / ˜ C (cid:32) ˆ v ˆ v (cid:33) ˆ Γ / ˆ U ∗ r = { ˆ λ ( o ) r } − / X T ˆ¯ V / ˜ C (cid:16) ˆ v ˆ Γ ˆ v T (cid:17) / U ˜ w r + { ˆ λ ( o ) r } − / X T ˆ¯ V / ˜ C ˆ v ˆ Γ / ˆ U ∗ r . By Lemmas S2 and S21, { ˆ λ ( o ) r } − / (cid:13)(cid:13)(cid:13) ˆ v ˆ Γ / ˆ U ∗ r (cid:13)(cid:13)(cid:13) = O P (cid:110) ( p γ r ) − / + (cid:16) n / p + n − (cid:17) γ − r (cid:111) . By the proof of Corollary S5, X T ˆ¯ V / Q ˜ C ˆ Z ∗ t = O P (cid:110) ( p γ t ) − / + (cid:16) n / p + n − (cid:17) γ − t (cid:111) , t ∈ [ K ( o ) ] , meaning { ˆ λ ( o ) r } − / X T ˆ¯ V / Q ˜ C ˆ Z ˆ Γ / ˆ U ∗ r = O P (cid:110) ( p γ t ) − / + (cid:16) n / p + n − (cid:17) γ − t (cid:111) . Therefore, we onlyneed to show that (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:40) λ ( o ) r ˆ λ ( o ) r (cid:41) / { λ ( o ) r } − / (cid:16) ˆ v ˆ Γ ˆ v T (cid:17) / U ˜ w r − { λ ( o ) r } − / Γ / U r ∗ (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) = O P (cid:110) ( p γ r ) − / + (cid:16) n / p + n − (cid:17) γ − r (cid:111)
56o complete the proof. Let δ r = ( p γ r ) − / + (cid:16) n / p + n − (cid:17) γ − r . Since (cid:13)(cid:13)(cid:13)(cid:13) { λ ( o ) r } − / (cid:16) ˆ v ˆ Γ ˆ v T (cid:17) / U ˜ w r (cid:13)(cid:13)(cid:13)(cid:13) = O P (1) and (cid:26) λ ( o ) r ˆ λ ( o ) r (cid:27) / = O P ( δ r ), this amounts to showing { λ ( o ) r } − / (cid:13)(cid:13)(cid:13)(cid:13)(cid:16) ˆ v ˆ Γ ˆ v T (cid:17) / U ˜ w r − Γ / U r ∗ (cid:13)(cid:13)(cid:13)(cid:13) = O P ( δ r ) . Define R = (cid:16) ˆ v ˆ Γ ˆ v T (cid:17) / − Γ / . By the proof of Lemma S4, R st = O P (cid:16) φ + φ γ − / s ∧ t (cid:17) , where φ = p − / and φ = n / p + / n , meaning ( RU ) st = O P (cid:16) φ + φ γ − / s ∧ t (cid:17) by Lemma S21. Since (cid:107) ˜ w r − a r (cid:107) = O P (cid:16) φ γ − / r + φ γ − / r γ − / K ( o ) (cid:17) for a r ∈ R K ( o ) the r th standard basis vector, { λ ( o ) r } − / (cid:107) RU ˜ w r (cid:107) = O P ( δ r ) . By the proof of Lemma S4 and the fact that ˜ w , . . . , ˜ w K ( o ) are orthogonal with unit norm, ˜ w r s = O P (cid:16) φ γ − / r ∧ s + φ γ − / r γ − / s (cid:17) for all s (cid:44) r ∈ [ K ( o ) ]. Therefore, (cid:13)(cid:13)(cid:13) { λ ( o ) r } − / Γ / U ( a r − ˜ w r ) (cid:13)(cid:13)(cid:13) = O P ( δ r ) , which completes the proof. (cid:3) We next prove REML estimates for the variance of C are consistent. Lemma
S6.
Suppose Assumptions 1, 2 and 3 hold and K ( o ) ∈ [ K ] is known. Further, assume thefollowing assumptions on C hold for some constant c > and r ∈ [ K ( o ) ] :(i) C = X Ω T + Ξ for some non-random Ω ∈ R K and random X ∈ R n that satisfies (cid:13)(cid:13)(cid:13) n − X T X − σ x (cid:13)(cid:13)(cid:13) = O P ( n − / ) for some constant σ x > . If Ω (cid:44) , then X is independent of E . If Ω = , then X is independent of all but at most c rows of E .(ii) The random matrix Ξ ∈ R n × K is independent of X , E ( Ξ ) = , V { vec Ξ } = (cid:80) bj = Ψ j ⊗ B j ,where (cid:13)(cid:13)(cid:13) Ψ j (cid:13)(cid:13)(cid:13) ≤ c and V { vec Ξ } (cid:23) c − I nK , and satisfies one of the following:(a) vec( Ξ ) = AR , where A ∈ R nK × nK is a non-random matrix that satisfies AA T = (cid:80) bj = Ψ j ⊗ B j . R ∈ R nK is a mean zero random matrix with independent entries suchthat E ( R i ) ≤ c for all i ∈ [ nK ] .(b) E [exp { vec( Ξ ) T t } ] ≤ exp( c (cid:107) t (cid:107) ) for all t ∈ R nK .(iii) E ( n − C T C ) = I K and P (cid:110) λ ( o ) r /λ ( o ) r − , λ ( o ) r + /λ ( o ) r ≥ + c − (cid:111) → as n , p → ∞ , where λ ( o )0 = ∞ .Let F = x ∈ R b : (2 c ) − I n (cid:22) b (cid:88) j = x j B j and (cid:107) x (cid:107) ≤ bc and for r ∈ [ K ( o ) ] , definef r , ( θ ) = − n − log (cid:110)(cid:12)(cid:12)(cid:12) P ⊥ X V ( θ ) P ⊥ X (cid:12)(cid:12)(cid:12) + (cid:111) − n − { C ( o ) ∗ r } T (cid:8) P ⊥ X V ( θ ) P ⊥ X (cid:9) † C ( o ) ∗ r ˆ θ = arg max θ ∈F f r , ( θ ) 57 nd f r , ( θ ) = − n − log (cid:110)(cid:12)(cid:12)(cid:12) P ⊥ X V ( θ ) P ⊥ X (cid:12)(cid:12)(cid:12) + (cid:111) − n − ˆ C T ∗ r (cid:8) P ⊥ X V ( θ ) P ⊥ X (cid:9) † ˆ C ∗ r ˆ θ = arg max θ ∈F f r , ( θ ) , and let ˆ u r ∈ R K be such that C ( o ) r = C ˆ u r . 
Then for $\theta^* \in \mathbb{R}^b$ such that $\theta^*_j = \hat u_r^T \Psi_j \hat u_r$,
$$\|\hat\theta_1 - \theta^*\| = O_P(n^{-1/2}), \qquad \|\hat\theta_2 - \theta^*\| = O_P\left[n^{-1/2} + \{n/(p\gamma_r)\}^{1/2}\right].$$

Proof.
We first note that ˆ u r = (cid:16) n − C T C (cid:17) − / v for some unit vector v ∈ R K , where n − C T C = I K + O P ( n − / ). Next, n − / ˆ C ∗ r = { ˆ λ ( o ) r } − / ˆ¯ V / (cid:16) ˜ C ˆ v + ˜ C ˆ v (cid:17) ˆ Γ / ˆ U ∗ r + { ˆ λ ( o ) r } − / ˆ¯ V / Q ˜ C ˆ Z ˆ Γ / ˆ U ∗ r n − / C ( o ) ∗ r = { λ ( o ) r } − / ˆ¯ V / ˜ C Γ / U ∗ r . By Lemma S2 and for some constant ˜ c > (cid:13)(cid:13)(cid:13)(cid:13) { ˆ λ ( o ) r } − / ˆ¯ V / Q ˜ C ˆ Z ˆ Γ / ˆ U ∗ r (cid:13)(cid:13)(cid:13)(cid:13) ≤ ˜ c (cid:13)(cid:13)(cid:13) { ˆ λ ( o ) r } − / ˆ Z ˆ Γ / ˆ U ∗ r (cid:13)(cid:13)(cid:13) = O P (cid:16) n / p − / γ − / r + n − γ − r (cid:17)(cid:13)(cid:13)(cid:13) { ˆ λ ( o ) r } − / ˜ C ˆ v ˆ Γ / ˆ U ∗ r (cid:13)(cid:13)(cid:13) = O P (cid:32) np γ r + p − / γ − / r + n − γ − r (cid:33) . Further, by the proof of Lemma S5, (cid:13)(cid:13)(cid:13)(cid:13) { ˆ λ ( o ) r } − / ˆ¯ V / ˜ C ˆ v ˆ Γ / ˆ U ∗ r − { λ ( o ) r } − / ˆ¯ V / ˜ C Γ / U ∗ r (cid:13)(cid:13)(cid:13)(cid:13) = O P (cid:32) np γ r + p − / γ − / r + n − γ − r (cid:33) . This shows that (cid:13)(cid:13)(cid:13) n − / ˆ C ∗ r − n − / C ( o ) ∗ r (cid:13)(cid:13)(cid:13) = O P (cid:16) n / p − / γ − / r + n − γ − r (cid:17) . Since (cid:107) V ( θ ) (cid:107) , (cid:13)(cid:13)(cid:13) { V ( θ ) } − (cid:13)(cid:13)(cid:13) is uniformly bounded from above for all θ ∈ F , we need only showthat (cid:13)(cid:13)(cid:13) ˆ θ − θ ∗ (cid:13)(cid:13)(cid:13) = O P (cid:16) n − / (cid:17) to complete the proof. 
We first see that f r , can be re-written as f r , ( θ ) = − n − log (cid:110)(cid:12)(cid:12)(cid:12) Q T X V ( θ ) Q X (cid:12)(cid:12)(cid:12)(cid:111) − n − (cid:0) Q T X Ξ ˆ u r (cid:1) T (cid:8) Q T X V ( θ ) Q X (cid:9) − (cid:0) Q T X Ξ ˆ u r (cid:1) It is clear that by the assumptions on Ξ , (cid:13)(cid:13)(cid:13)(cid:13) n − (cid:0) Q T X Ξ (cid:1) T (cid:8) Q T X V ( θ ) Q X (cid:9) − (cid:0) Q T X Ξ (cid:1) − n − E (cid:104)(cid:0) Q T X Ξ (cid:1) T (cid:8) Q T X V ( θ ) Q X (cid:9) − (cid:0) Q T X Ξ (cid:1) | X (cid:105)(cid:13)(cid:13)(cid:13)(cid:13) is stochastically equicontinuous, where n − E (cid:104)(cid:0) Q T X Ξ (cid:1) T (cid:8) Q T X V ( θ ) Q X (cid:9) − (cid:0) Q T X Ξ (cid:1) | X (cid:105) = n − b (cid:88) j = Tr (cid:104)(cid:8) Q T X V ( θ ) Q X (cid:9) − Q T X B j Q X (cid:105) Ψ j . Since (cid:107) ˆ u r (cid:107) = + O P ( n − / ), it is therefore easy to show that (cid:13)(cid:13)(cid:13) ˆ θ − θ ∗ (cid:13)(cid:13)(cid:13) = o P (1). The result thenfollows by a routine Taylor expansion argument. (cid:3)
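The argument above maximizes a restricted (REML) log-likelihood of the form $f_{r,1}$. In the simplest case $b = 1$ and $B_1 = I_n$, the maximizer has the closed form $\|Q_X^T\xi\|^2/(n-q)$, which makes the consistency mechanism easy to check numerically. A minimal sketch (the scalar variance model $V(\theta) = \theta I_n$ and the grid search are illustrative simplifications, not Algorithm 1):

```python
import numpy as np

rng = np.random.default_rng(2)
n, q = 200, 3
theta_star = 1.5  # true variance; here b = 1 and B_1 = I_n

X = rng.standard_normal((n, q))
xi = np.sqrt(theta_star) * rng.standard_normal(n)

# Q_X: orthonormal basis for the orthogonal complement of im(X).
Q_X = np.linalg.qr(X, mode="complete")[0][:, q:]
z = Q_X.T @ xi  # projected residuals, dimension n - q

def f(theta):
    # Restricted log-likelihood (up to constants), as in f_{r,1}:
    # -n^{-1} log|Q_X' V(theta) Q_X| - n^{-1} z' {Q_X' V(theta) Q_X}^{-1} z,
    # which for V(theta) = theta * I reduces to the expression below.
    return -(n - q) / n * np.log(theta) - (z @ z) / (n * theta)

grid = np.linspace(0.1, 5.0, 2000)
theta_hat = grid[np.argmax(f(grid))]

# The grid maximizer matches the closed-form REML estimate ||z||^2 / (n - q).
assert abs(theta_hat - (z @ z) / (n - q)) < 0.01
```

For general $b > 1$ there is no closed form, which is why the proof instead argues via stochastic equicontinuity and a Taylor expansion around $\theta^*$.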
We can now prove Theorem 6.
Proof of Theorem 6.
Let ˆ θ be as defined in the statement of Lemma S6 and let ˆ D = V ( ˆ θ ). Let C ( o ) ∗ r = C ˆ u r , where by Lemma S23, ˆ u r = U ˆ y r + ∆ r for U ∈ R K × K ( o ) , ∆ r ∈ R K as defined in thestatement of Lemma S23 andˆ y r t = + O P (cid:16) n − / (cid:17) if t = rO P (cid:16) n − / (cid:17) if t < rO P (cid:16) n − / γ t γ − r (cid:17) if t > r , t ∈ [ K ( o ) ] . (S23)Note that U is a non-random matrix. If K ( o ) = K , U = I K and ∆ r = . Lemma S6 thenimplies ˆ D = D + b (cid:80) j = (cid:15) j B j , where D = (cid:80) bj = U T ∗ r Ψ j U ∗ r B j is a non-random matrix and (cid:15) j = O P (cid:16) n − / + n / p − / γ − / r (cid:17) for all j ∈ [ b ]. Therefore, n − X T ˆ D − ˆ C ∗ r = n − X T D − ˆ C ∗ r + n − b (cid:88) j = (cid:15) j X T D − B j D − ˆ C ∗ r + o P ( n − / ) . The proof of Lemma S5 can be easily extended to show that (cid:12)(cid:12)(cid:12) n − X T D − ˆ C ∗ r − n − X T D − C ( o ) ∗ r (cid:12)(cid:12)(cid:12) = o P ( n − / ) (cid:12)(cid:12)(cid:12) n − X T D − B j D − ˆ C ∗ r − n − X T D − C ( o ) ∗ r (cid:12)(cid:12)(cid:12) = o P ( n − / ) , meaning n − X T ˆ D − ˆ C ∗ r = n − X T D − C ( o ) ∗ r + n − b (cid:88) j = (cid:15) j X T D − B j D − C ( o ) ∗ r + o P ( n − / ) = n − X T ˆ D − C ( o ) ∗ r + o P ( n − / ) . Therefore, (cid:16) X T ˆ D − X (cid:17) − X T ˆ D − ˆ C ∗ r = (cid:16) X T ˆ D − X (cid:17) − X T ˆ D − C ( o ) ∗ r + o P ( n − / ) = Ω T ˆ u r + (cid:16) X T ˆ D − X (cid:17) − X T ˆ D − Ξ ˆ u r + o P ( n − / ) . Lastly, n − X T ˆ D − Ξ U ∗ r = n − X T D − Ξ U ∗ r + o P ( n − / ) , where (cid:13)(cid:13)(cid:13) n − X T D − Ξ (cid:13)(cid:13)(cid:13) = O P ( n − / ) and an application of the Lindeberg-Feller central limit theo-rem shows that n − b (cid:88) j = X T B j XU T ∗ r Ψ j U ∗ r − / n − X T D − Ξ U ∗ r d → N (0 ,
1) (S24) as $n, p \to \infty$. Since $\|\hat u_r - U_{*r}\| = o_P(1)$, this completes the proof. □

S8 Denoising $Y$

Here we prove Theorem 5. The proof is given below, and utilizes Corollary S6 to derive the asymptotic properties of the REML estimates for $V(R_g)$.

Proof of Theorem 5.
For ˆ α g as defined in the statement of Theorem 5, Corollary S6 shows that (cid:13)(cid:13)(cid:13) ˆ α g − α g (cid:13)(cid:13)(cid:13) = O P (cid:16) n / p − / γ − / K + n − / (cid:17) . Let A g = V ( α g ) and ˆ A g = V ( ˆ α g ). Then ˆ A g = A g + (cid:80) bj = (cid:15) j B j , where (cid:15) j = O P (cid:16) n / p − / γ − / K + n − / (cid:17) for all j ∈ [ b ]. Let, ˆ M g = (cid:18) ˆ¯ V − / X g n / ˆ˜ C (cid:19) ,where ˆ˜ C ∈ R n × K are the first K right singular vectors of Y ˆ¯ V − / . Then for ˆ˜ A g = ˆ¯ V − / ˆ A g ˆ¯ V − / ,ˆ s g = (cid:40)(cid:18) n − ˆ M T g ˆ˜ A − g ˆ M g (cid:19) − (cid:18) n − ˆ M T g ˆ˜ A − g ˆ¯ V / Y g ∗ (cid:19)(cid:41) d , where for x ∈ R m , x d ∈ R d is the first d ≤ m components of x . We first see that n − ˆ M T g ˆ˜ A − g ˆ M g = n − X T g ˆ A − g X g n − / X T g ˆ A − g ˆ¯ V / ˆ˜ C n − / ˆ˜ C T ˆ¯ V / ˆ A − g X g ˆ˜ C T ˆ¯ V / ˆ A − g ˆ¯ V / ˆ˜ C . Since ˆ A − g = A − g + (cid:80) j = (cid:15) j A − g B j A − g + o P ( n / ) and (cid:13)(cid:13)(cid:13) A − g B j A − g (cid:13)(cid:13)(cid:13) = O (1), Corollaries S4 andS5 show that (cid:13)(cid:13)(cid:13)(cid:13) ˆ˜ C T ˆ¯ V / ˆ A − g ˆ¯ V / ˆ˜ C − ˆ v T ˜ C T ˆ¯ V / ˆ A − g ˆ¯ V / ˜ C ˆ v (cid:13)(cid:13)(cid:13)(cid:13) = o P ( n − / ) (cid:13)(cid:13)(cid:13)(cid:13) n − / X T g ˆ A − g ˆ¯ V / ˆ˜ C − n − / X T g ˆ A − g ˆ¯ V / ˜ C ˆ v (cid:13)(cid:13)(cid:13)(cid:13) = o P ( n − / ) , meaning n − ˆ M T g ˆ˜ A − g ˆ M g = (cid:32) I d ˆ v T (cid:33) n − X T g ˆ A − g X g n − / X T g ˆ A − g ˆ¯ V / ˜ C n − / ˜ C T ˆ¯ V / ˆ A − g X g ˜ C T ˆ¯ V / ˆ A − g ˆ¯ V / ˜ C (cid:32) I d ˆ v (cid:33) + o P ( n − / ) . Therefore,ˆ s g = n − X T g ˆ A − g X g n − / X T g ˆ A − g ˆ¯ V / ˜ C n − / ˜ C T ˆ¯ V / ˆ A − g X g ˜ C T ˆ¯ V / ˆ A − g ˆ¯ V / ˜ C − n − X T g ˆ A − g Y g ∗ n − / ˆ v − T ˆ˜ C T ˆ¯ V / ˆ A − g Y g ∗ d + o P ( n − / ) , where n − / ˆ v − T ˆ˜ C T ˆ¯ V / ˆ A − g Y g ∗ = n − / ˜ C T ˆ¯ V / ˆ A − g Y g ∗ + n − / ˆ v − T ˆ z T Q T ˜ C ˆ¯ V / ˆ A − g Y g ∗ . 
Next,
\[
n^{-1/2}\hat v^{-T}\hat z^TQ_{\tilde C}^T\hat{\bar V}^{1/2}\hat A_g^{-1}Y_{g*} = \hat v^{-T}\hat z^TQ_{\tilde C}^T\hat{\bar V}^{1/2}\hat A_g^{-1}\hat{\bar V}^{1/2}\tilde C\left(n^{-1/2}p^{1/2}\tilde L_{g*}\right)+n^{-1/2}\hat v^{-T}\hat z^TQ_{\tilde C}^T\hat{\bar V}^{1/2}\hat A_g^{-1}E_{g*},
\]
where $\left\|n^{-1/2}p^{1/2}\tilde L_{g*}\right\|_2 = O(1)$. The proof of Corollary S5 can therefore be used to show that
\[
n^{-1/2}\hat v^{-T}\hat z^TQ_{\tilde C}^T\hat{\bar V}^{1/2}\hat A_g^{-1}Y_{g*} = o_P\left(n^{-1/2}\right).
\]
Since $\operatorname{im}(\hat{\bar V}^{1/2}\tilde C) = \operatorname{im}(C)$,
\[
\hat s_g = \left\{\begin{pmatrix}n^{-1}X_g^T\hat A_g^{-1}X_g & n^{-1}X_g^T\hat A_g^{-1}C\\ n^{-1}C^T\hat A_g^{-1}X_g & n^{-1}C^T\hat A_g^{-1}C\end{pmatrix}^{-1}\begin{pmatrix}n^{-1}X_g^T\hat A_g^{-1}Y_{g*}\\ n^{-1}C^T\hat A_g^{-1}Y_{g*}\end{pmatrix}\right\}_d+o_P\left(n^{-1/2}\right).
\]
Therefore,
\[
\hat s_g = s_g+\left\{\begin{pmatrix}n^{-1}X_g^T\hat A_g^{-1}X_g & n^{-1}X_g^T\hat A_g^{-1}C\\ n^{-1}C^T\hat A_g^{-1}X_g & n^{-1}C^T\hat A_g^{-1}C\end{pmatrix}^{-1}\begin{pmatrix}n^{-1}X_g^T\hat A_g^{-1}R_{g*}\\ n^{-1}C^T\hat A_g^{-1}R_{g*}\end{pmatrix}\right\}_d+o_P\left(n^{-1/2}\right).
\]
Since $X_g$ is mean zero, sub-Gaussian and independent of $C$, $n^{-1}X_g^T\hat A_g^{-1}C = O_P\left(n^{-1/2}\right)$, meaning
\[
\left\{\begin{pmatrix}n^{-1}X_g^T\hat A_g^{-1}X_g & n^{-1}X_g^T\hat A_g^{-1}C\\ n^{-1}C^T\hat A_g^{-1}X_g & n^{-1}C^T\hat A_g^{-1}C\end{pmatrix}^{-1}\begin{pmatrix}n^{-1}X_g^T\hat A_g^{-1}R_{g*}\\ n^{-1}C^T\hat A_g^{-1}R_{g*}\end{pmatrix}\right\}_d = \left(X_g^T\hat A_g^{-1}X_g\right)^{-1}X_g^T\hat A_g^{-1}R_{g*}+o_P\left(n^{-1/2}\right).
\]
The result then follows because
\[
\left(X_g^T\hat A_g^{-1}X_g\right)^{-1}X_g^T\hat A_g^{-1}R_{g*} = \left(X_g^TA_g^{-1}X_g\right)^{-1}X_g^TA_g^{-1}R_{g*}+o_P\left(n^{-1/2}\right). \qquad\square
\]

S9 Properties of and estimating the oracle rank $K^{(o)}$

In this section, we refer to $F$ as the number of folds in Algorithm 2 and use $f\in[F]$ to denote a fold. We start by stating and proving three useful lemmas.

Lemma S7.
Suppose Assumptions 1 and 2 hold. Let $M^{(\pi)}\in\mathbb{R}^{p_f\times n}$ be a matrix whose rows are chosen uniformly at random, without replacement, from the rows of $M = E(Y\mid C)$. Then for any symmetric matrix $A\in\mathbb{R}^{n\times n}$ with $\|A\|_2\le d$ and $A\succeq d^{-1}I_n$ for some constant $d>0$,
\[
\Lambda_k\left[p_f^{-1}\left\{M^{(\pi)}\right\}^TAM^{(\pi)}\right] = \Lambda_k\left(p^{-1}M^TAM\right)\left[1+O_P\left\{\left(np_f^{-1}\right)^{1/2}\right\}\right]+O_P\left\{\left(np_f^{-1}\right)^{1/2}\right\},\quad k\in[K],
\]
as $n,p\to\infty$, where the randomness is due to the randomness in the rows sampled from the rows of $E(Y\mid C)$.

Proof. Suppose $E(Y\mid C) = LC^T$, where we assume $n^{-1}C^TAC = I_K$ and $np^{-1}L^TL = \operatorname{diag}(\gamma_1,\dots,\gamma_K)$ without loss of generality (I abuse notation here; $\gamma_k$ are not the same as those defined in Assumption 1). Therefore, we need only show that for $\pi:[p]\to[p]$ a permutation chosen uniformly at random from the set of all permutations that map $[p]$ onto itself,
\[
\Lambda_k\left\{np_f^{-1}\sum_{g=1}^{p_f}L_{\pi(g)*}L_{\pi(g)*}^T\right\} = \gamma_k\left[1+O_P\left\{\left(np_f^{-1}\right)^{1/2}\right\}\right]+O_P\left\{\left(np_f^{-1}\right)^{1/2}\right\},\quad k\in[K].
\]
First,
\[
E\left\{np_f^{-1}\sum_{g=1}^{p_f}L_{\pi(g)r}L_{\pi(g)s}\,\middle|\,C\right\} = p_f^{-1}\sum_{g=1}^{p_f}E\left(nL_{\pi(g)r}L_{\pi(g)s}\mid C\right) = \gamma_rI(r=s),\quad r,s\in[K].
\]
Let $c = \sup_{g\in[p]}\left\|L_{g*}\right\|_2^2$ and suppose $r\le s\in[K]$. Then
\[
V\left\{np_f^{-1}\sum_{g=1}^{p_f}L_{\pi(g)r}L_{\pi(g)s}\,\middle|\,C\right\}\le n^2p_f^{-2}\sum_{g=1}^{p_f}V\left\{L_{\pi(g)r}L_{\pi(g)s}\mid C\right\}\le np_f^{-1}c\,p_f^{-1}\sum_{g=1}^{p_f}E\left\{nL_{\pi(g)s}^2\right\} = cnp_f^{-1}\gamma_s,
\]
where the first inequality follows from the fact that the rows of $E(Y\mid C)$ are sampled without replacement, meaning
\[
\operatorname{Cov}\left(L_{\pi(g)r}L_{\pi(g)s},\,L_{\pi(h)r}L_{\pi(h)s}\mid C\right)\le 0,\quad g\ne h.
\]
By Assumption 1, $c<\tilde c+o_P(1)$ as $n,p\to\infty$, where $\tilde c>0$ does not depend on $n$ or $p$. Therefore,
\[
\hat\Gamma = np_f^{-1}\sum_{g=1}^{p_f}L_{\pi(g)*}L_{\pi(g)*}^T = \operatorname{diag}(\gamma_1,\dots,\gamma_K)+R,\qquad R_{rs} = O_P\left\{\left(\lambda_{r\vee s}\,np_f^{-1}\right)^{1/2}\right\},\quad r,s\in[K].
\]
If $\limsup_{n,p\to\infty}\lambda_K = \infty$, the result follows from Lemma S13. Otherwise, suppose there exists a $k\in[K]$ such that $\limsup_{n,p\to\infty}\lambda_k<\infty$ but $\limsup_{n,p\to\infty}\lambda_{k-1} = \infty$, where $\lambda_0 = \infty$. Then
\[
\hat\Gamma = \operatorname{diag}(\gamma_1,\dots,\gamma_K)+\tilde R+O_P\left\{(n/p)^{1/2}\right\},\qquad \tilde R_{rs} = O_P\left\{\left(\lambda_{r\vee s}\,np_f^{-1}\right)^{1/2}\right\}I(r\wedge s\ge k),\quad r,s\in[K].
\]
The result then follows by Weyl's Theorem and an application of Lemma S13. $\square$
Lemma S8 (Lemma S8 of McKennan et al. [35]). Let $Q$, $\hat V^{-1/2}_{(-f)}$, $\hat C$ and $K_{\max}$ be as defined in Theorem 1. Then for any fold $f$ and $k\le K_{\max}$, the loss function in (11) only depends on $Q$, $\hat V_{(-f)}$ and $\operatorname{im}(\hat C)$.

Remark S10.
Let $Y_{(-f)}$, $\hat C$, $\hat V^{-1/2}_{(-f)}$ be as defined in Algorithm 2, and let $Y_{(-f)}\hat V^{-1/2}_{(-f)} = U\Sigma W^T$, where $U\in\mathbb{R}^{p\times n}$, $\Sigma\in\mathbb{R}^{n\times n}$ is a diagonal matrix with non-decreasing and non-negative diagonal elements, $W\in\mathbb{R}^{n\times n}$ and $U^TU = W^TW = I_n$. Proposition 1 and Lemma S8 show that to prove Theorem 1, it suffices to let $\hat V^{-1/2}_{(-f)}\hat C = n^{1/2}\left(W_{*1}\cdots W_{*k}\right)$ for each $k\le K_{\max}$ and to assume $n^{-1}C^T\hat V^{-1}_{(-f)}C = I_K$ and $L^TL$ is diagonal with non-decreasing diagonal elements.

Lemma S9.
Suppose the assumptions of Theorem 1 hold and let $c_{\max}>K$ be a constant that does not depend on $n$ or $p$. Then
\[
\max_{k\in\{c_{\max}+1,\dots,K_{\max}\}}\left\|\hat V_{(-f)}-\bar V\right\|_2 = O_P\left(n^{1/2}p^{-1/2}+n^{-1}\right)
\]
as $n,p\to\infty$.

Proof. Let $\bar V_{(-f)} = (p-p_f)^{-1}\sum_{g\notin\text{fold }f}V_g$. Since $p_f,\,p-p_f\asymp p$ and $\left\|\bar V-\bar V_{(-f)}\right\|_2 = O_P\left(p^{-1/2}\right)$, it suffices to drop the subscript $(-f)$ and prove the lemma using the full data matrix $Y$. Let $k\in\{c_{\max}+1,\dots,K_{\max}\}$ and let $\hat C\in\mathbb{R}^{n\times k}$ be an estimate for $C$. Step (b)(iii) of Algorithm 1 then estimates $V$ as $\hat V = \sum_{j=1}^{b}\hat{\bar v}_jB_j$, where for $S = p^{-1}Y^TY$, $V(\theta) = \sum_{j=1}^{b}\theta_jB_j$ and
\[
\hat f(\theta) = -n^{-1}\log\left\{\left|Q_{\hat C}^TV(\theta)Q_{\hat C}\right|\right\}-n^{-1}\operatorname{Tr}\left[Q_{\hat C}^TSQ_{\hat C}\left\{Q_{\hat C}^TV(\theta)Q_{\hat C}\right\}^{-1}\right],
\]
$\hat{\bar v} = \arg\max_{\theta\in\Theta^*}\hat f(\theta)$. If $K = 0$ or $\limsup_{n,p\to\infty}\lambda_1<\infty$, then
\[
\hat f(\theta) = -n^{-1}\log\left\{\left|Q_{\hat C}^TV(\theta)Q_{\hat C}\right|\right\}-n^{-1}\operatorname{Tr}\left[Q_{\hat C}^T\left(p^{-1}E^TE\right)Q_{\hat C}\left\{Q_{\hat C}^TV(\theta)Q_{\hat C}\right\}^{-1}\right]+O_P\left(n^{-1}\right)
\]
uniformly for $\theta\in\Theta^*$. If $\limsup_{n,p\to\infty}\lambda_1 = \infty$, define $c_{\min}$ to be such that $\limsup_{n,p\to\infty}\lambda_{c_{\min}} = \infty$ but $\limsup_{n,p\to\infty}\lambda_{c_{\min}+1}<\infty$, where $\lambda_{K+1} = 0$. Then $Q_{\hat C} = Q_{\min}U$, where the columns of $Q_{\min}\in\mathbb{R}^{n\times(n-c_{\min})}$ form an orthonormal basis for $\ker\left\{\left(\hat C_{*1}\cdots\hat C_{*c_{\min}}\right)^T\right\}$ and $U\in\mathbb{R}^{(n-c_{\min})\times(n-k)}$ has orthonormal columns. By Corollary S1 and the proof of Lemma S7 in McKennan et al. [35],
\[
\hat f(\theta) = -n^{-1}\log\left\{\left|U^TQ_{\min}^TV(\theta)Q_{\min}U\right|\right\}-n^{-1}\operatorname{Tr}\left[U^TQ_{\min}^T\left(p^{-1}E^TE\right)Q_{\min}U\left\{U^TQ_{\min}^TV(\theta)Q_{\min}U\right\}^{-1}\right]+O_P\left(n^{-1}\right)
\]
uniformly for $\theta\in\Theta^*$ and $k\in\{c_{\max}+1,\dots,K_{\max}\}$. The latter follows from the fact that all rates in Lemma S2 and Corollary S1 only depend on $k$ through $\phi$, which is uniformly bounded from above and satisfies $\phi/\lambda_r = o_P(1)$ for $r\le c_{\min}$. An identical analysis can be used to show that we can replace $S$ in $\nabla_\theta\hat f(\theta)$ and $\nabla^2_\theta\hat f(\theta)$ with $p^{-1}E^TE$ at the cost of $O_P\left(n^{-1}\right)$. The result then follows by Lemma S19. $\square$

Proof of Theorem 1.
Fix some fold $f$ and define $\bar V_f$ to be the analogue of $\bar V$ for fold $f$, and let $\delta_f = \left|\bar V_f\right|^{1/n}$. Let $\pi:[p]\to[p]$ be a permutation sampled uniformly from the set of all permutations on $[p]$. All conditional expectations and variances calculated below are with reference to the sigma algebra $\sigma\left(Y_{(-f)},\pi,Q\right)$, where $Y_f = L_fC^T+E_f\in\mathbb{R}^{p_f\times n}$ is the test data used to estimate $L_f$ and evaluate the loss and $Y_{(-f)} = L_{(-f)}C^T+E_{(-f)}\in\mathbb{R}^{(p-p_f)\times n}$ is the training data used to estimate $C$ and $\bar V$. Assumption 1 implies $\left\|\bar V-\bar V_f\right\|_2,\left\|\bar V-\bar V_{(-f)}\right\|_2 = O_P\left(p^{-1/2}\right)$. Therefore, by Lemma S7,
\[
\Lambda_k\left\{\bar V_f^{-1/2}C\left(p_f^{-1}L_f^TL_f\right)C^T\bar V_f^{-1/2}\right\},\ \Lambda_k\left[\bar V_{(-f)}^{-1/2}C\left\{(p-p_f)^{-1}L_{(-f)}^TL_{(-f)}\right\}C^T\bar V_{(-f)}^{-1/2}\right] = \gamma_k\{1+o_P(1)\}+o_P(1)
\]
for all $k\in[K]$ as $n,p\to\infty$, where $\gamma_1,\dots,\gamma_K$ are as defined in Assumption 1. Therefore, the results of Lemmas S1, S2 and Corollary S8 hold when we substitute $Y$ with the training data $Y_{(-f)}$. Let $\hat{\bar C} = Q^T\hat V^{-1/2}_{(-f)}\hat C\in\mathbb{R}^{n\times k}$ be as defined in Algorithm 2. Lemma S8 and Remark S10 show that it suffices to assume the columns of $\hat{\bar C}$ are the first $k$ right singular vectors of $\bar Y_{(-f)}$, where $n^{-1}\hat{\bar C}^T\hat{\bar C} = I_k$, and that $n^{-1}C^T\hat V^{-1}_{(-f)}C = I_K$ and $L_{(-f)}^TL_{(-f)}$ is a diagonal matrix with non-decreasing entries. We will let $\hat{\bar h}_i$, $i\in[n]$, be the $i$th leverage score of $\hat{\bar C}$ throughout the proof. Note that $\hat{\bar h}_i$ is implicitly a function of $k\in[K_{\max}]$.
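The proof makes repeated use of leverage scores and rank-one leave-one-out updates of least-squares fits. As a sanity check (not part of the proof), the Sherman–Morrison form of the leave-one-out identity can be verified numerically with a generic full-rank matrix standing in for $\hat{\bar C}$:

```python
import numpy as np

rng = np.random.default_rng(1)
n, k = 30, 4
C = rng.normal(size=(n, k))        # generic stand-in for the n x k matrix C-hat-bar

G_inv = np.linalg.inv(C.T @ C)
i = 7                              # any sample index works
ci = C[i]
h_i = ci @ G_inv @ ci              # i-th leverage score, lies strictly in (0, 1)

C_minus = np.delete(C, i, axis=0)  # matrix with the i-th row removed
lhs = np.linalg.inv(C_minus.T @ C_minus) @ ci
rhs = G_inv @ ci / (1.0 - h_i)     # Sherman-Morrison form of the same vector

print(np.max(np.abs(lhs - rhs)))   # agreement up to floating point error
```

The second identity in the proof follows by left-multiplying both sides by $\hat{\bar C}$, so the same check covers it.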
Since $Q\in\mathbb{R}^{n\times n}$ is sampled uniformly from the space of all unitary matrices, setting $\epsilon = n^{-1/2}$ in Lemma S24 implies
\[
\max_{k\in[K_{\max}]}\left\{\max_{i\in[n]}\left|\hat{\bar h}_i-k/n\right|\right\} = o_P(1)
\]
as $n\to\infty$. Let $\hat{\bar C}_{(-i)}\in\mathbb{R}^{(n-1)\times k}$ be the sub-matrix of $\hat{\bar C}$ with the $i$th row removed and define $\hat{\bar H} = P_{\hat{\bar C}}$. Two useful equalities to be used throughout the proof are
\[
\left\{\hat{\bar C}_{(-i)}^T\hat{\bar C}_{(-i)}\right\}^{-1}\hat{\bar C}_{i*} = \frac{1}{1-\hat{\bar h}_i}\left(\hat{\bar C}^T\hat{\bar C}\right)^{-1}\hat{\bar C}_{i*},\quad i\in[n],
\]
\[
\hat{\bar C}\left\{\hat{\bar C}_{(-i)}^T\hat{\bar C}_{(-i)}\right\}^{-1}\hat{\bar C}_{i*} = \frac{1}{1-\hat{\bar h}_i}\hat{\bar H}_{*i},\quad i\in[n].
\]
Since $\hat{\bar C}$ is invariant to the scale of $\hat V_{(-f)}$ and we normalize the loss in (11) by $\left|\hat V_{(-f)}\right|^{1/n}$, it suffices to assume we have already scaled $\hat V_{(-f)}$ so that $\left|\hat V_{(-f)}\right| = 1$. Define $\bar C = Q^T\hat V^{-1/2}_{(-f)}C$. A scaled version of the loss in (11) for fold $f$ can then be expressed as
\[
g(k) = \frac{\delta_f}{p_f}L_f(k) = \frac{\delta_f}{p_f}\sum_{i=1}^n\left\|\bar Y_{f*i}-\hat L_{f,(-i)}\hat{\bar C}_{i*}\right\|_2^2
= \underbrace{\frac{\delta_f}{p_f}\sum_{i=1}^n\left\|L_f\bar C_{i*}-\hat L_{f,(-i)}\hat{\bar C}_{i*}\right\|_2^2}_{g_1(k)}
+\underbrace{\delta_f\operatorname{Tr}\left\{\hat V^{-1}_{(-f)}\left(p_f^{-1}E_f^TE_f\right)\right\}}_{g_2(k)}
+\underbrace{2\delta_f\sum_{i=1}^np_f^{-1}\left(L_f\bar C_{i*}-\hat L_{f,(-i)}\hat{\bar C}_{i*}\right)^T\bar E_f\hat{\bar V}^{-1/2}_{(-f)}a_i}_{g_3(k)},\tag{S25}
\]
where $\bar E_f = E_fQ$, $\hat{\bar V}_{(-f)} = Q^T\hat V_{(-f)}Q$ and $a_i\in\mathbb{R}^n$, $i\in[n]$, is the standard basis vector with 1 in the $i$th position and zeros everywhere else. For the remainder of the proof, define $A_{(-i)} = I_n-a_ia_i^T$ for $i\in[n]$. Let $\epsilon>0$ and define $k_j$ inductively as
\[
k_j = \min\left(\left\{r\in\left\{k_{j-1}+1,\dots,K\right\}:\gamma_r/\gamma_{r+1}\ge 1+\epsilon\right\}\right),\quad j\in[J],
\]
where $k_J = K$. By the assumptions of Theorem 1, there exists an $s\in\{0,1,\dots,K\}$ such that $k_{j_s} = s$ for some $j_s\in\{0,1,\dots,J\}$ and $\gamma_{s+1}\le\delta-\epsilon$. We derive the asymptotic properties of $g_1$, $g_2$ and $g_3$ in the three lemmas below.

Lemma S10.
If the assumptions in the statement of Theorem 1 hold, then there exists a large constant $c_{\max}>K$ and another unrelated constant $\sigma^2>0$ that do not depend on $n$ or $p$ such that
\[
g_1(k)\ \begin{cases}\ \ge\displaystyle\sum_{r=k+1}^{K}\gamma_r\left\{1+O_P\left(n^{1/2}p^{-1/2}+n^{-1/2}\right)\right\} & \text{if }k<s\\[4pt]
\ = O_P\left(np^{-1}+n^{-1/2}\right)+I(k<K)\displaystyle\sum_{r=k+1}^{K}\gamma_r\left\{1+O_P\left(n^{1/2}p^{-1/2}+n^{-1/2}\right)\right\} & \text{if }k\in\{s,\dots,c_{\max}\}\\[4pt]
\ \ge\sigma^2k(1+x_k) & \text{if }k\in\{c_{\max}+1,\dots,K_{\max}\}
\end{cases}
\]
with $\max_{k\in\{c_{\max}+1,\dots,K_{\max}\}}\left|k^{1/2}x_k\right| = O_P(1)$ as $n,p\to\infty$.

Proof. $g_1(k)$ can be expressed as
\[
g_1(k) = \frac{\delta_f}{p_f}\sum_{i=1}^n\left\|L_f\bar C_{i*}-\hat L_{f,(-i)}\hat{\bar C}_{i*}\right\|_2^2
= \frac{\delta_f}{p_f}\sum_{i=1}^n\left\|L_f\bar C_{i*}-\frac{L_f\bar C^T\hat{\bar C}\left(\hat{\bar C}^T\hat{\bar C}\right)^{-1}\hat{\bar C}_{i*}}{1-\hat{\bar h}_i}\right\|_2^2
+\sum_{i=1}^n\frac{1}{\left(1-\hat{\bar h}_i\right)^2}\hat{\bar H}^T_{*i}A_{(-i)}\hat{\bar V}^{-1/2}_{(-f)}\left(\delta_fp_f^{-1}\bar E_f^T\bar E_f\right)\hat{\bar V}^{-1/2}_{(-f)}A_{(-i)}\hat{\bar H}_{*i}
-\frac{2\delta_f}{n^{1/2}p_f^{1/2}}\sum_{i=1}^n\frac{1}{1-\hat{\bar h}_i}\hat{\bar H}^T_{*i}A_{(-i)}\hat{\bar V}^{-1/2}_{(-f)}\bar E_f^T\Delta_i,\tag{S26}
\]
where
\[
\Delta_i = \left(n^{1/2}p_f^{-1/2}L_f\right)\bar C_{i*}-\left(n^{1/2}p_f^{-1/2}L_f\right)\bar C^T\hat{\bar C}\left(\hat{\bar C}^T\hat{\bar C}\right)^{-1}\hat{\bar C}_{i*} = \tilde L_f\bar C_{i*}-\tilde L_f\bar C^T\hat{\bar C}\left(\hat{\bar C}^T\hat{\bar C}\right)^{-1}\hat{\bar C}_{i*}.\tag{S27}
\]
We derive the asymptotic properties of the three terms in (S26) in (a), (b) and (c) below.

(a) Let $\hat v_{(-f)r}$, $\hat z_{(-f)r}$, estimated using $Y_{(-f)}$, be the analogues of $\hat v_r$, $\hat z_r$, $r\in[K]$, defined in Lemma S2. Similarly, let $\hat V_{(-f)j}$, $\hat Z_{(-f)j}$ be the analogues of $\hat V_j$, $\hat Z_j$, $j\in[J]$, defined in Lemma S2.
Let
\[
g_{1,1}(k) = \frac{\delta_f}{p_f}\sum_{i=1}^n\left\|L_f\bar C_{i*}-\frac{L_f\bar C^T\hat{\bar C}\left(\hat{\bar C}^T\hat{\bar C}\right)^{-1}\hat{\bar C}_{i*}}{1-\hat{\bar h}_i}\right\|_2^2
\]
and define
\[
\alpha_- = \left\{k/n-\max_{k\in[K_{\max}]}\left(\max_{i\in[n]}\left|k/n-\hat{\bar h}_i\right|\right)\right\}\vee 0,\qquad
\alpha_+ = \left\{k/n+\max_{k\in[K_{\max}]}\left(\max_{i\in[n]}\left|k/n-\hat{\bar h}_i\right|\right)\right\}\wedge 1,
\]
where $\alpha_-,\alpha_+ = k/n+o_P\left(n^{-1/2+\epsilon}\right)$ for any constant $\epsilon>0$ as $n,p\to\infty$ by Lemma S24. Then
\[
\delta_f^{-1}(1-\alpha_-)^{-2}\operatorname{Tr}\left\{np_f^{-1}L_f^TL_f\left(n^{-1}\bar C^TP^\perp_{\hat{\bar C}}\bar C\right)\right\}\le g_{1,1}(k)\le\delta_f^{-1}(1-\alpha_+)^{-2}\operatorname{Tr}\left\{np_f^{-1}L_f^TL_f\left(n^{-1}\bar C^TP^\perp_{\hat{\bar C}}\bar C\right)\right\},
\]
where
\[
\operatorname{Tr}\left\{np_f^{-1}L_f^TL_f\left(n^{-1}\bar C^TP^\perp_{\hat{\bar C}}\bar C\right)\right\} = \operatorname{Tr}\left[np_f^{-1}L_f^TL_f\left(I_K-\sum_{r=1}^{k}\hat v_{(-f)r}\hat v_{(-f)r}^T\right)\right].
\]
Since the eigenvalues of $\sum_{r=1}^{k}\hat v_{(-f)r}\hat v_{(-f)r}^T$ are bounded between 0 and 1,
\[
g_{1,1}(k)\ge I(k<K)\,\delta_f^{-1}\sum_{r=k+1}^{K}\gamma_r\left\{1+O_p\left(n^{1/2}p^{-1/2}\right)\right\}
\]
by Lemma S7. Further, for any $j\in[J]$ such that $\limsup_{n,p\to\infty}\lambda_{k_j}<\infty$,
\[
\operatorname{Tr}\left[np_f^{-1}L_f^TL_f\left(I_K-\sum_{r=1}^{k_j}\hat v_{(-f)r}\hat v_{(-f)r}^T\right)\right] = O_P\left(np^{-1}+n^{-1}\right)+I(k_j<K)\left\{1+O_P\left(n^{1/2}p^{-1/2}\right)\right\}\sum_{r=k_j+1}^{K}\gamma_r
\]
by Lemmas S2 and S7. This also shows that for any $s\le k<\tilde k$,
\[
g_{1,1}(k)-g_{1,1}(\tilde k)\le\delta_f^{-1}\gamma_{s+1}\left\{1+O_P\left(n^{1/2}p^{-1/2}\right)\right\}I\left(\tilde k\le K\right).
\]
Putting this all together gives us
\[
\delta_fg_{1,1}(k)\ \begin{cases}\ \ge 0 & \text{if }k>K\\[2pt]
\ \ge\left(\displaystyle\sum_{r=k+1}^{K}\gamma_r\right)\left\{1+O_p\left(n^{1/2}p^{-1/2}\right)\right\} & \text{if }k\in\{0,1,\dots,K\}\setminus\{s\}\\[4pt]
\ = O_P\left(np^{-1}+n^{-1}\right)+I(s<K)\left(\displaystyle\sum_{r=s+1}^{K}\gamma_r\right)\left\{1+O_P\left(n^{1/2}p^{-1/2}\right)\right\} & \text{if }k = s
\end{cases}
\]
and
\[
\delta_f\left\{g_{1,1}(k)-g_{1,1}(k+1)\right\}\le\gamma_{s+1}+O_P\left(n^{1/2}p^{-1/2}\right),\quad k\in\{s,\dots,K\},
\]
where $\gamma_{K+1} = 0$.

(b) Assume without loss of generality that $\delta_f = \left|\bar V_f\right|^{1/n} = 1$, and let
\[
g_{1,2}(k) = \sum_{i=1}^n\frac{1}{\left(1-\hat{\bar h}_i\right)^2}\hat{\bar H}^T_{*i}A_{(-i)}\hat{\bar V}^{-1/2}_{(-f)}\left(p_f^{-1}\bar E_f^T\bar E_f\right)\hat{\bar V}^{-1/2}_{(-f)}A_{(-i)}\hat{\bar H}_{*i}.
\]
First,
\[
E\left\{g_{1,2}(k)\mid Y_{(-f)},\pi,Q\right\} = \sum_{i=1}^n\frac{1}{\left(1-\hat{\bar h}_i\right)^2}\hat{\bar H}^T_{*i}A_{(-i)}\hat{\bar V}^{-1/2}_{(-f)}\tilde V_f\hat{\bar V}^{-1/2}_{(-f)}A_{(-i)}\hat{\bar H}_{*i},
\]
where $\tilde V_f = Q^T\bar V_fQ$. Note that because $Q$ is a unitary matrix, $\left|\tilde V_f\right| = 1$. We then see that
\[
E\left\{g_{1,2}(k)\mid Y_{(-f)},\pi,Q\right\}\ge(1-\alpha_-)^{-1}\sum_{i=1}^n\left(1-\hat{\bar h}_i\right)^{-1}\hat{\bar H}^T_{*i}A_{(-i)}\hat{\bar V}^{-1/2}_{(-f)}\tilde V_f\hat{\bar V}^{-1/2}_{(-f)}A_{(-i)}\hat{\bar H}_{*i}
= (1-\alpha_-)^{-1}\operatorname{Tr}\left[\hat{\bar V}^{-1/2}_{(-f)}\tilde V_f\hat{\bar V}^{-1/2}_{(-f)}\sum_{i=1}^n\left(1-\hat{\bar h}_i\right)^{-1}A_{(-i)}\hat{\bar H}_{*i}\hat{\bar H}^T_{*i}A_{(-i)}\right],
\]
where $\sum_{i=1}^n\left(1-\hat{\bar h}_i\right)^{-1}A_{(-i)}\hat{\bar H}_{*i}\hat{\bar H}^T_{*i}A_{(-i)}$ is positive semi-definite with
\[
\operatorname{Tr}\left\{\sum_{i=1}^n\left(1-\hat{\bar h}_i\right)^{-1}A_{(-i)}\hat{\bar H}_{*i}\hat{\bar H}^T_{*i}A_{(-i)}\right\} = \sum_{i=1}^n\left(1-\hat{\bar h}_i\right)^{-1}\left(\hat{\bar H}_{*i}-\hat{\bar h}_ia_i\right)^T\left(\hat{\bar H}_{*i}-\hat{\bar h}_ia_i\right) = \sum_{i=1}^n\hat{\bar h}_i = k.
\]
We first note that the minimum eigenvalue of $\hat{\bar V}^{-1/2}_{(-f)}\tilde V_f\hat{\bar V}^{-1/2}_{(-f)}$ is uniformly bounded above $\sigma^2>0$, where $\sigma^2$ is a constant not dependent on $n$ or $p$. Therefore,
\[
E\left\{g_{1,2}(k)\mid Y_{(-f)},\pi,Q\right\}\ge(1-\alpha_-)^{-1}\sigma^2k,\quad k\in[K_{\max}].
\]
Let $c_{\min}\in\{0,1,\dots,K\}$ be such that $\limsup_{n,p\to\infty}\lambda_{c_{\min}} = \infty$ but $\limsup_{n,p\to\infty}\lambda_{c_{\min}+1}<\infty$, where $\lambda_0 = \infty$ and $\lambda_{K+1} = 0$. Let $c_{\max}>K$ be an arbitrarily large constant that does not depend on $n$ or $p$. Then Corollary S8 implies
\[
\left\|\hat{\bar V}^{-1/2}_{(-f)}\tilde V_f\hat{\bar V}^{-1/2}_{(-f)}-I_n\right\|_2 = O_P\left(n^{-1}+p^{-1/2}\right),\quad k\in\{c_{\min},\dots,c_{\max}\}.
\]
Since $\max_{i\in[n]}\left|\hat{\bar h}_i-k/n\right| = O_P\left(n^{-1/2}\right)$ for all $k\in\{c_{\min},c_{\min}+1,\dots,c_{\max}\}$, putting this all together gives us
\[
E\left\{g_{1,2}(k)\mid Y_{(-f)},\pi,Q\right\}\ \begin{cases}\ = k\left\{1+O_P\left(n^{-1/2}\right)\right\} & \text{if }k\in\{c_{\min},c_{\min}+1,\dots,c_{\max}\}\\ \ \ge\sigma^2k & \text{if }k>c_{\max}.\end{cases}
\]
Next, to calculate the conditional variance, we see that
\[
g_{1,2}(k) = p_f^{-1}\operatorname{Tr}\left\{\bar E_f\hat{\bar V}^{-1/2}_{(-f)}\hat M_0\hat{\bar V}^{-1/2}_{(-f)}\bar E_f^T\right\} = p_f^{-1}\sum_{g=1}^{p_f}\bar E_{fg*}^T\hat M\bar E_{fg*},\qquad
\hat M = \hat{\bar V}^{-1/2}_{(-f)}\left\{\sum_{i=1}^n\left(1-\hat{\bar h}_i\right)^{-2}A_{(-i)}\hat{\bar H}_{*i}\hat{\bar H}^T_{*i}A_{(-i)}\right\}\hat{\bar V}^{-1/2}_{(-f)},
\]
where for all $k\in[K_{\max}]$,
\[
\left\|\sum_{i=1}^n\left(1-\hat{\bar h}_i\right)^{-2}A_{(-i)}\hat{\bar H}_{*i}\hat{\bar H}^T_{*i}A_{(-i)}\right\|_F^2 = \sum_{i,j=1}^n\left(1-\hat{\bar h}_i\right)^{-2}\left(1-\hat{\bar h}_j\right)^{-2}\left\{\hat{\bar H}^T_{*i}A_{(-i)}A_{(-j)}\hat{\bar H}_{*j}\right\}^2\le(1-\alpha_+)^{-4}\sum_{i,j=1}^n\hat{\bar H}_{ij}^2 = k(1-\alpha_+)^{-4}.
\]
Therefore, since $\left\|\hat V^{-1}_{(-f)}\right\|_2$ is uniformly bounded above by a constant,
\[
\left\|\hat M\right\|_F = O\left\{k^{1/2}(1-\alpha_+)^{-2}\right\},\qquad\left\|\hat M\right\|_2 = O\left\{(1-\alpha_+)^{-2}\right\}.
\]
Since $\bar E_{f1*},\dots,\bar E_{fp_f*}$ are independent sub-Gaussian random vectors with uniformly bounded sub-Gaussian norm, there exists a constant $c>0$ that does not depend on $n$, $p$ or $k$ such that
\[
P\left[\left|g_{1,2}(k)-E\left\{g_{1,2}(k)\mid Y_{(-f)},\pi,Q\right\}\right|\ge tk^{1/2}\,\middle|\,Y_{(-f)},\pi,Q\right]\le 2\exp\left\{-c(1-\alpha_+)^4t^2p\right\},\quad 0\le t\le c.
\]
Therefore,
\[
g_{1,2}(k)\ \begin{cases}\ = k\left\{1+O_P\left(n^{-1/2}\right)\right\} & \text{if }k\in\{c_{\min},\dots,c_{\max}\}\\ \ \ge\sigma^2k\left(1+x_{n,k}\right) & \text{if }k\in\{c_{\max}+1,\dots,K_{\max}\}\end{cases}
\]
with $\max_{k\in\{c_{\max}+1,\dots,K_{\max}\}}\left|k^{1/2}x_{n,k}\right| = O_P(1)$ as $n,p\to\infty$.

(c) I again assume $\delta_f = 1$ and let
\[
g_{1,3}(k) = n^{-1/2}p_f^{-1/2}\sum_{i=1}^n\frac{1}{1-\hat{\bar h}_i}\hat{\bar H}^T_{*i}A_{(-i)}\hat{\bar V}^{-1/2}_{(-f)}\bar E_f^T\Delta_i,\quad k\in[K_{\max}],
\]
\[
\Delta_i = \bar L_f\left(I_K-\sum_{r=1}^{k}\hat v_{(-f)r}\hat v_{(-f)r}^T\right)\bar C_{i*}-n^{1/2}\bar L_f\sum_{r=1}^{k}\hat v_{(-f)r}\hat z_{(-f)r}^T\bar q_i,\quad k\in[K],
\]
\[
\Delta_i = \bar L_f\left(I_K-\sum_{r=1}^{K}\hat v_{(-f)r}\hat v_{(-f)r}^T\right)\bar C_{i*}-n^{1/2}\bar L_f\sum_{r=1}^{K}\hat v_{(-f)r}\hat z_{(-f)r}^T\bar q_i+\bar L_f\left\{n^{-1}\bar C^T\hat R_{(-f)}\right\}\hat R_{(-f)i*},\quad k\in\{K+1,\dots,K_{\max}\},
\]
where $\bar q_i$ is the $i$th row of $Q_{\bar C}$ and for $k>K$,
\[
\hat{\bar C} = \left(\bar C\hat v_{(-f)}+Q_{\bar C}\hat z_{(-f)}\quad\hat R_{(-f)}\right)\in\mathbb{R}^{n\times k}.
\]
It is clear that $E\left\{g_{1,3}(k)\mid Y_{(-f)},\pi,Q\right\} = 0$. To understand the variation around 0, note that
\[
g_{1,3}(k) = p_f^{-1/2}\sum_{g=1}^{p_f}\bar L_{fg*}^T\left\{M_1N_1+M_2N_2+I(k>K)M_3N_3\right\}\bar E_{fg*},
\]
\[
M_1 = I_K-\sum_{r=1}^{k\wedge K}\hat v_{(-f)r}\hat v_{(-f)r}^T,\qquad N_1 = n^{-1/2}\sum_{i=1}^n\left(1-\hat{\bar h}_i\right)^{-1}\bar C_{i*}\hat{\bar H}^T_{*i}A_{(-i)}\hat{\bar V}^{-1/2}_{(-f)},
\]
\[
M_2 = \sum_{r=1}^{k\wedge K}\hat v_{(-f)r}\hat z_{(-f)r}^T,\qquad N_2 = \sum_{i=1}^n\left(1-\hat{\bar h}_i\right)^{-1}\bar q_i\hat{\bar H}^T_{*i}A_{(-i)}\hat{\bar V}^{-1/2}_{(-f)},
\]
\[
M_3 = n^{-1}\bar C^T\hat R_{(-f)},\qquad N_3 = n^{-1/2}\sum_{i=1}^n\left(1-\hat{\bar h}_i\right)^{-1}\hat R_{(-f)i*}\hat{\bar H}^T_{*i}A_{(-i)}\hat{\bar V}^{-1/2}_{(-f)},
\]
where for some constant $c>0$, $\left\|N_t\right\|_2\le(1-\alpha_+)^{-1}c$, $t\in[3]$. By assumption,
\[
e_g = \bar L_{fg*}^T\left\{M_1N_1+M_2N_2+I(k>K)M_3N_3\right\}\bar E_{fg*},\quad g\in[p_f],
\]
is sub-Gaussian with
\[
\log\left[E\left\{\exp\left(te_g\right)\mid Y_{(-f)},\pi,Q\right\}\right]\le ct^2(1-\alpha_+)^{-2}\left\{\left\|M_1^T\bar L_{fg*}\right\|_2^2+\left\|M_2^T\bar L_{fg*}\right\|_2^2+I(k>K)\left\|M_3^T\bar L_{fg*}\right\|_2^2\right\},\quad g\in[p_f],
\]
where $c>0$ does not depend on $n$, $p$ or $k$. And because $e_1,\dots,e_{p_f}$ are independent and $\bar L_f$ is at most rank $K$,
\[
\log\left(E\left[\exp\left\{tg_{1,3}(k)\right\}\mid Y_{(-f)},\pi,Q\right]\right)\le Kct^2p^{-1}(1-\alpha_+)^{-2}\left\{\left\|\bar L_fM_1\right\|_2^2+\left\|\bar L_fM_2\right\|_2^2+I(k>K)\left\|\bar L_fM_3\right\|_2^2\right\},\quad k\in[K_{\max}].
\]
For notational simplicity, I will ignore the subscripts $f$ and $(-f)$ when deriving the asymptotic properties of $M_1$, $M_2$ and $M_3$. We first see that for $j = 1,2,3$ and some constant $c>0$ that does not depend on $n$, $p$ or $k$,
\[
\left\|\bar LM_j\right\|_2\le\left\|\left(\bar L_{*1}\cdots\bar L_{*c_{\min}}\right)\left(M_{j*1}\cdots M_{j*c_{\min}}\right)^T\right\|_2+\left\|\left(\bar L_{*(c_{\min}+1)}\cdots\bar L_{*K}\right)\left(M_{j*(c_{\min}+1)}\cdots M_{j*K}\right)^T\right\|_2
\le\left\|\left(\bar L_{*1}\cdots\bar L_{*c_{\min}}\right)\left(M_{j*1}\cdots M_{j*c_{\min}}\right)^T\right\|_2+c\left\|\left(\bar L_{*(c_{\min}+1)}\cdots\bar L_{*K}\right)\right\|_2,
\]
where for some constant $\tilde c>0$ that does not depend on $n$, $p$ or $k$,
\[
\left\|\left(\bar L_{*(c_{\min}+1)}\cdots\bar L_{*K}\right)\right\|_2\le\tilde c\lambda_{c_{\min}+1}^{1/2}\left\{1+O_P\left(n^{1/2}p^{-1/2}\right)\right\}
\]
uniformly over $k\in[K_{\max}]$ by Lemma S7. To understand the behavior of the remaining term in the above expression, we first note that $\phi = \left\|\bar V-\hat{\bar V}\right\|_2$, defined in Lemma S2, satisfies $\phi = O(1)$. Since the rates given in (S11) only depend on the choice of $\hat{\bar V}$ through $\phi$, the rates in (S11) hold uniformly over all $k\in[K_{\max}]$. Therefore, since $\lambda_r\to\infty$ for all $r\le c_{\min}$, (S11b), (S11c) and (S11e) imply
\[
\max_{k\in\{c_{\min},\dots,K_{\max}\}}\left\{\left\|\left(\bar L_{*1}\cdots\bar L_{*c_{\min}}\right)\left(M_{j*1}\cdots M_{j*c_{\min}}\right)^T\right\|_2\right\} = O_P(1),\quad j\in[2],
\]
\[
\left\|\left(\bar L_{*1}\cdots\bar L_{*c_{\min}}\right)\left(M_{j*1}\cdots M_{j*c_{\min}}\right)^T\right\|_2 = O_P\left(\lambda_{k+1}^{1/2}\right),\quad k\in[c_{\min}-1];\ j\in[2].
\]
For $M_3$, let $\bar C_{\min} = \left(\bar C_{*1}\cdots\bar C_{*c_{\min}}\right)$ and $\hat{\bar C}_{\min} = \left(\hat{\bar C}_{*1}\cdots\hat{\bar C}_{*c_{\min}}\right)$. Then (S11b), (S11c) and (S11e) and the fact that $\hat{\bar C}_{\min}^T\hat R = 0$ imply
\[
\left\|\left(\bar L_{*1}\cdots\bar L_{*c_{\min}}\right)\left(M_{3*1}\cdots M_{3*c_{\min}}\right)^T\right\|_2 = \left\|\left(\bar L_{*1}\cdots\bar L_{*c_{\min}}\right)\left(n^{-1}\bar C_{\min}^T\hat R\right)\right\|_2
\]
behaves exactly as $\left\|\left(\bar L_{*1}\cdots\bar L_{*c_{\min}}\right)\left(M_{j*1}\cdots M_{j*c_{\min}}\right)^T\right\|_2$ for $j = 1,2$. Putting this all together gives us that for any $\epsilon>0$,
\[
g_{1,3}(k) = O_P\left\{\left(\lambda_{k+1}\vee 1\right)^{1/2}p^{-1/2}\right\},\quad k\in[c_{\max}],\qquad
\max_{k\in\{c_{\max}+1,\dots,K_{\max}\}}\left|g_{1,3}(k)\right| = O_P\left(p^{-1/2+\epsilon}\right),
\]
where $\lambda_r = 0$ for $r>K$. This completes the proof. $\square$

Lemma S11.
Let $c_{\min}$ be such that $\limsup_{n,p\to\infty}\lambda_{c_{\min}} = \infty$ but $\limsup_{n,p\to\infty}\lambda_{c_{\min}+1}<\infty$, where $\lambda_0 = \infty$ and $\lambda_{K+1} = 0$. Then under the assumptions of Theorem 1 and for some arbitrarily large integer $c_{\max}>K$ that does not depend on $n$ or $p$, $g_2(k)$ satisfies
\[
g_2(k)\ \begin{cases}\ \ge n+O_P\left(n^{-1/2}p^{-1/2}\right) & \text{if }k\notin\{c_{\min},\dots,c_{\max}\}\\ \ = n+O_P\left(n^{-1}+n^{1/2}p^{-1/2}\right) & \text{if }k\in\{c_{\min},\dots,c_{\max}\}\end{cases}
\]
as $n,p\to\infty$.

Proof. This follows directly from item (i) of Lemma S15 and the proof of Theorem S5 in McKennan et al. [35]. $\square$
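Lemmas S10 and S11 together say the cross-validated loss trades a signal term, which falls until the working rank reaches the identifiable factors, against a noise floor that grows with the rank through the leave-one-out weighting. That qualitative behavior can be reproduced with a deliberately simplified simulation; this is a homoskedastic sketch with a row-holdout split and a PRESS-style loss, not the estimator of Algorithm 2, and all dimensions and factor strengths are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p, K = 40, 2000, 3

# Factors with distinct signal strengths plus unit-variance noise.
L = rng.normal(size=(p, K)) * np.array([6.0, 4.0, 3.0])
C = rng.normal(size=(n, K))
Y = L @ C.T + rng.normal(size=(p, n))

# Hold out a quarter of the rows; estimate the factors on the rest.
test = rng.choice(p, size=p // 4, replace=False)
mask = np.zeros(p, dtype=bool)
mask[test] = True
Ytr, Yte = Y[~mask], Y[mask]
Vt = np.linalg.svd(Ytr, full_matrices=False)[2]   # right singular vectors

def press_loss(k):
    """Leave-one-sample-out (PRESS) loss of rank-k factors on held-out rows."""
    Ck = Vt[:k].T                                 # n x k, orthonormal columns
    H = Ck @ Ck.T                                 # hat matrix of the factors
    R = Yte - Yte @ H                             # held-out-row residuals
    h = np.diag(H)                                # leverage scores, roughly k/n
    return np.mean((R / (1.0 - h)) ** 2)          # leave-one-out weighting

losses = [press_loss(k) for k in range(1, 8)]
print(int(np.argmin(losses)) + 1)                 # should recover K = 3
```

Without the $(1-\hat h_i)^{-1}$ weighting the loss decreases monotonically in $k$, which is exactly why the leave-one-out correction appears in the loss (11).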
Lemma S12. Under the assumptions of Theorem 1 and for some constant $c>0$ that does not depend on $n$ or $p$, $g_3(k)$ satisfies
\[
\left|g_3(k)\right|\ \begin{cases}\ = \gamma_{k+1}^{1/2}\,O_P\left(n^{1/2}p^{-1/2}\right) & \text{if }k<c_{\min}\\ \ = O_P\left(n^{1/2}p^{-1/2}+n^{-1}\right) & \text{if }c_{\min}\le k\le c_{\max}\end{cases}
\]
and
\[
\max_{k\in\{c_{\max}+1,\dots,K_{\max}\}}k^{-1}\left|g_3(k)\right| = O_P\left(n^{1/2}p^{-1/2}+n^{-1}\right),
\]
where $c_{\min}$ is such that $\limsup_{n,p\to\infty}\lambda_{c_{\min}} = \infty$ and $\limsup_{n,p\to\infty}\lambda_{c_{\min}+1}<\infty$ (for $\lambda_0 = \infty$ and $\lambda_{K+1} = 0$) and $c_{\max}>K$ is an arbitrarily large constant that does not depend on $n$ or $p$.

Proof.
I again assume, without loss of generality, that $\delta_f = \left|\bar V_f\right|^{1/n} = 1$. Then $g_3(k)$, up to a scalar constant, can be written as
\[
g_3(k) = n^{-1/2}p_f^{-1/2}\sum_{i=1}^n\left(1-\hat{\bar h}_i\right)^{-1}\Delta_i^T\bar E_f\hat{\bar V}^{-1/2}_{(-f)}a_i-\sum_{i=1}^n\left(1-\hat{\bar h}_i\right)^{-1}\hat{\bar H}^T_{*i}A_{(-i)}\hat{\bar V}^{-1/2}_{(-f)}\left(p_f^{-1}\bar E_f^T\bar E_f\right)\hat{\bar V}^{-1/2}_{(-f)}a_i = g_{3,1}(k)-g_{3,2}(k),
\]
where $\Delta_i$ is defined in (S27). Clearly $E\left\{g_{3,1}(k)\mid Y_{(-f)},\pi,Q\right\} = 0$. An analysis identical to that used to derive the finite sample properties of $g_{1,3}(k)$ in Lemma S10 can be used to show that for all $t\in\mathbb{R}$ and some constant $c>0$ that does not depend on $n$, $p$ or $k$,
\[
\log\left(E\left[\exp\left\{tg_{3,1}(k)\right\}\mid Y_{(-f)},\pi,Q\right]\right)\le ct^2np^{-1}(1-\alpha_+)^{-2}\hat M_0,\quad k\in[K_{\max}],
\]
\[
\hat M_0 = \left\|\bar L_fM_1\right\|_2^2+\left\|\bar L_fM_2\right\|_2^2+I(k>K)\left\|\bar L_fM_3\right\|_2^2,
\]
where $M_j$, $j = 1,2,3$, are as defined in Lemma S10. This shows that
\[
g_{3,1}(k) = \left(\gamma_{k+1}\vee 1\right)^{1/2}O_P\left(n^{1/2}p^{-1/2}\right),\quad k\in[c_{\max}],
\]
where $\gamma_r = 0$ for $r>K$. Further, a union bound shows that for all $t>0$ and some constant $\tilde c>0$ that does not depend on $n$, $p$ or $k$, letting $x_t = 2\exp\left\{-t^2\tilde cp(1-\alpha_+)^2/\left(n\hat M_0\right)\right\}$,
\[
P\left\{\left|g_{3,1}(k)\right|\ge tk^{1/2}\text{ for at least one }k\in\{c_{\max}+1,\dots,K_{\max}\}\,\middle|\,Y_{(-f)},\pi,Q\right\}\le\sum_{k=c_{\max}+1}^{K_{\max}}x_t^{\,k}\le\frac{x_t^{\,c_{\max}+1}}{1-x_t},
\]
which implies $\max_{k\in\{c_{\max}+1,\dots,K_{\max}\}}k^{-1/2}\left|g_{3,1}(k)\right| = O_P\left(n^{1/2}p^{-1/2}\right)$ as $n,p\to\infty$.

Define $\tilde V_f = Q^T\bar V_fQ$ and $R = \hat{\bar V}^{-1/2}_{(-f)}\tilde V_f\hat{\bar V}^{-1/2}_{(-f)}-I_n$. Then
\[
E\left\{g_{3,2}(k)\mid Y_{(-f)},\pi,Q\right\} = \sum_{i=1}^n\left(1-\hat{\bar h}_i\right)^{-1}\hat{\bar H}^T_{*i}A_{(-i)}\hat{\bar V}^{-1/2}_{(-f)}\tilde V_f\hat{\bar V}^{-1/2}_{(-f)}a_i = \sum_{i=1}^n\left(1-\hat{\bar h}_i\right)^{-1}\hat{\bar H}^T_{*i}A_{(-i)}Ra_i = \operatorname{Tr}\left(\tilde R\hat{\bar H}\right)-\sum_{i=1}^n\hat{\bar h}_i\tilde R_{ii},
\]
where $\tilde R = R\operatorname{diag}\left\{\left(1-\hat{\bar h}_1\right)^{-1},\dots,\left(1-\hat{\bar h}_n\right)^{-1}\right\}$. Therefore,
\[
\left|E\left\{g_{3,2}(k)\mid Y_{(-f)},\pi,Q\right\}\right|\le 2k(1-\alpha_+)^{-1}\left\|R\right\|_2,\quad k\in[K_{\max}],
\]
which by Corollary S8 and Lemma S9 implies for some constant $c>0$,
\[
E\left\{g_{3,2}(k)\mid Y_{(-f)},\pi,Q\right\}\ \begin{cases}\ \ge c\left\{1+O_P\left(n^{-1/2}\right)\right\} & \text{if }k<c_{\min}\\ \ = O_P\left(p^{-1/2}+n^{-1}\right) & \text{if }k\in\{c_{\min},\dots,c_{\max}\}\end{cases}
\]
and
\[
\max_{k\in\{c_{\max}+1,\dots,K_{\max}\}}k^{-1}\left|E\left\{g_{3,2}(k)\mid Y_{(-f)},\pi,Q\right\}\right| = O_P\left(n^{1/2}p^{-1/2}+n^{-1}\right).
\]
We lastly need to understand the variation of $g_{3,2}(k)$ around its conditional mean.
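The remaining step rests on the concentration of quadratic forms $\tilde E_{g*}^TM(k)\tilde E_{g*}$ around their expectation, which scales with $\|M(k)\|_F$. A quick simulation (not part of the proof; the rank-$k$ projection used for $M$ and all dimensions are illustrative) shows the averaged quadratic forms concentrating around the trace at the $p^{-1/2}$ rate:

```python
import numpy as np

rng = np.random.default_rng(3)
n, k = 40, 4

# A rank-k projection: operator norm 1, squared Frobenius norm k.
U = np.linalg.qr(rng.normal(size=(n, k)))[0]
M = U @ U.T

errs = []
for p in (100, 1000, 10000):
    E = rng.normal(size=(p, n))                  # independent sub-Gaussian rows
    q = np.einsum('gi,ij,gj->g', E, M, E)        # quadratic forms E_g^T M E_g
    errs.append(abs(q.mean() - np.trace(M)))     # deviation from E(q_g) = tr(M) = k
    print(p, errs[-1])
```

Each form is sub-exponential with norm of order $\|M\|_F$, so the average of $p$ independent forms deviates from $\operatorname{tr}(M)$ by roughly $\|M\|_F\,p^{-1/2}$, matching the Bernstein-type bound invoked below.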
We first note that
\[
g_{3,2}(k) = p_f^{-1}\sum_{g=1}^{p_f}\tilde E_{g*}^TM(k)\tilde E_{g*},\qquad\tilde E = \bar E_f\hat{\bar V}^{-1/2}_{(-f)},
\]
\[
M(k) = \sum_{i=1}^n\left(1-\hat{\bar h}_i\right)^{-1}a_i\hat{\bar H}^T_{*i}A_{(-i)} = \operatorname{diag}\left\{\left(1-\hat{\bar h}_1\right)^{-1},\dots,\left(1-\hat{\bar h}_n\right)^{-1}\right\}\left\{\hat{\bar H}-\operatorname{diag}\left(\hat{\bar h}_1,\dots,\hat{\bar h}_n\right)\right\}.
\]
We see that $\left\|M(k)\right\|_2\le 2(1-\alpha_+)^{-1}$ and
\[
\left\|M(k)\right\|_F^2 = \sum_{i=1}^n\frac{\hat{\bar h}_i-\hat{\bar h}_i^2}{\left(1-\hat{\bar h}_i\right)^2}\le k(1-\alpha_+)^{-1},\quad k\in[K_{\max}].
\]
Further, Assumption 1 and Proposition 2.7 and Remark 2.8 of Zajkowski [51] imply
\[
\left\|\tilde E_{g*}^TM(k)\tilde E_{g*}\right\|_{\Psi_1}\le c\left\|M(k)\right\|_F,\quad g\in[p];\ k\in[K_{\max}],\tag{S28}
\]
where $c>0$ does not depend on $n$, $p$ or $k$ and $\left\|\cdot\right\|_{\Psi_1}$ is the sub-exponential norm applied conditionally on $Y_{(-f)},\pi,Q$, defined as
\[
\left\|x\right\|_{\Psi_1} = \inf\left[t>0:E\left\{\exp\left(\left|x/t\right|\right)\mid Y_{(-f)},\pi,Q\right\}\le e\right].
\]
Since the rows of $\tilde E$ are independent conditional on $Y_{(-f)}$, $\pi$ and $Q$, Proposition 5.16 of Eldar et al. [50] implies that for all $t\ge 0$ and $\mu_k = E\left\{g_{3,2}(k)\mid Y_{(-f)},\pi,Q\right\}$,
\[
P\left\{\left|g_{3,2}(k)-\mu_k\right|\ge tck^{1/2}(1-\alpha_+)^{-1}\,\middle|\,Y_{(-f)},\pi,Q\right\}\le 2\exp\left\{-\tilde cp\min\left(t^2,t\right)\right\},\quad k\in[K_{\max}],
\]
where $c>0$ and $\tilde c>0$ do not depend on $n$, $p$ or $k$. This completes the proof. $\square$

Aggregating the results of Lemmas S10, S11 and S12 gives us
\[
g(k)-n\ \begin{cases}\ \ge O_P\left(\lambda_{k+1}\right) & \text{if }k<c_{\min}\\[2pt]
\ = k+I(k<K)\displaystyle\sum_{r=k+1}^{K}\delta^{-1}\gamma_r+O_P\left(n^{1/2}p^{-1/2}+n^{-1/2}\right) & \text{if }k\in\{c_{\min},\dots,c_{\max}\}\\[4pt]
\ \ge\sigma^2k\left\{1+x_k\right\} & \text{if }k\in\{c_{\max}+1,\dots,K_{\max}\}
\end{cases}
\]
with $\max_{k\in\{c_{\max}+1,\dots,K_{\max}\}}\left|x_k\right| = O_P\left(n^{1/2}p^{-1/2}+n^{-1/2}\right)$, where $c_{\min}\ge 0$ is such that $\limsup_{n,p\to\infty}\lambda_{c_{\min}} = \infty$ but $\limsup_{n,p\to\infty}\lambda_{c_{\min}+1}<\infty$ (where $\lambda_0 = \infty$ and $\lambda_{K+1} = 0$), $c_{\max}>K$ is an arbitrarily large integer and $\sigma^2>0$ does not depend on $n$, $p$ or $k$. This implies that for all $\epsilon>0$, there exists a constant $M_\epsilon$ that does not depend on $n$ or $p$ such that if $\left(\gamma_s-\delta\right)\left(n^{1/2}p^{-1/2}+n^{-1/2}\right)^{-1}\ge M_\epsilon$,
\[
\liminf_{n,p\to\infty}P\left(\hat K = s\right)\ge 1-\epsilon.
\]
An identical argument to that presented above can be used to show that under the same conditions,
\[
\liminf_{n,p\to\infty}P\left(K^{(o)} = s\right)\ge 1-\epsilon. \qquad\square
\]

S10 Other important results
Lemma S13. Let $D = \operatorname{diag}(d_1,\dots,d_K)$ with $d_1\ge\cdots\ge d_K\ge 0$ and let $M\in\mathbb{R}^{K\times K}$ be a symmetric matrix whose eigenvalues lie in the compact set $[c_1,c_2]$, where $c_1>0$. Then
\[
\Lambda_k\left(M^{1/2}DM^{1/2}\right)\in\left[d_kc_1,\ d_kc_2\right].
\]

Proof. Let $\mathcal{L}\subseteq\mathbb{R}^K$ be a vector space and $|\mathcal{L}|\le K$ be its dimension. Then
\[
\Lambda_k\left(M^{1/2}DM^{1/2}\right) = \max_{|\mathcal{L}| = k}\min_{u\in\mathcal{L}\setminus\{0\}}\frac{u^TM^{1/2}DM^{1/2}u}{u^Tu} = \max_{|\mathcal{L}| = k}\min_{u\in\mathcal{L}\setminus\{0\}}\frac{u^TDu}{u^TM^{-1}u} = \max_{|\mathcal{L}| = k}\min_{\substack{u\in\mathcal{L}\setminus\{0\}\\ u^Tu = 1}}\frac{u^TDu}{u^TM^{-1}u}.
\]
Consider the subspace $\mathcal{L}$ generated by the first $k\le K$ canonical basis vectors. Then
\[
\frac{u^TDu}{u^TM^{-1}u}\ge d_kc_1,
\]
which gives the lower bound. For the upper bound,
\[
\Lambda_k\left(M^{1/2}DM^{1/2}\right) = \min_{|\mathcal{L}| = K-k+1}\max_{\substack{u\in\mathcal{L}\setminus\{0\}\\ u^Tu = 1}}\frac{u^TDu}{u^TM^{-1}u}.
\]
Setting the subspace $\mathcal{L}$ to be the $K$th through $k$th canonical basis vectors gives us
\[
\frac{u^TDu}{u^TM^{-1}u}\le d_kc_2. \qquad\square
\]

Lemma
S14.
Suppose $E \in \mathbb R^{p\times n}$ is such that $\mathrm{vec}(E) = A\,\mathrm{vec}(\tilde E)$ for some $A \in \mathbb R^{np\times np}$, where the entries of $\tilde E \in \mathbb R^{p\times n}$ are independent with uniformly bounded sub-Gaussian norm, and $\|A\| = O(1)$ as $n,p \to \infty$. Let $V = E\big(p^{-1} E^T E\big)$ be a positive definite matrix with eigenvalues that are uniformly bounded above 0 and below $\infty$ as $n,p \to \infty$. Then
$$\big\|p^{-1} E^T E - V\big\| = O_P\big(n^{1/2} p^{-1/2}\big) \quad \text{as } n,p \to \infty.$$

Proof. Since $\|p^{-1} E^T E - V\| \le \|V\|\,\|p^{-1} V^{-1/2} E^T E V^{-1/2} - I_n\|$ and $\|V\| = O(1)$ as $n,p \to \infty$, it suffices to assume $V = I_n$. First, for any unit vector $v \in \mathbb R^n$,
$$p^{-1} v^T E^T E v - 1 = p^{-1} \mathrm{vec}(\tilde E)^T A^T (v \oplus \cdots \oplus v)(v \oplus \cdots \oplus v)^T A\,\mathrm{vec}(\tilde E) - 1,$$
where $\|v \oplus \cdots \oplus v\| = 1$, meaning $\|A^T (v \oplus \cdots \oplus v)(v \oplus \cdots \oplus v)^T A\| \le \|A\|^2$. Let $B_v = A A^T (v \oplus \cdots \oplus v)(v \oplus \cdots \oplus v)^T A A^T$ and define $B_v^{(g)} \in \mathbb R^{n\times n}$ to be the $g$th diagonal block of $B_v$, where $g = 1,\ldots,p$. Then we also have
$$\big\|A^T (v \oplus \cdots \oplus v)(v \oplus \cdots \oplus v)^T A\big\|_F^2 = \sum_{g=1}^p v^T B_v^{(g)} v \le p\,\|A\|^4.$$
By Remark 2.10 in Zajkowski [51], this implies that
$$P\big(\big|p^{-1} v^T E^T E v - 1\big| \ge t\big) \le 2\exp\big\{-p \min\big(\tilde c t^2, \tilde c t\big)\big\}$$
for some constant $\tilde c > 0$ that does not depend on $v$, $n$ or $p$. A standard covering argument (e.g. Theorem 5.39 in Eldar et al. [50]) then gives us the result. $\Box$

Lemma S15.
Let $c > 0$ be a large constant, and suppose $E \in \mathbb R^{p\times n}$, $\mathrm{vec}(E) \overset{d}{=} A\,\mathrm{vec}(\tilde E)$, $\|A\| \le c$ and the entries of $\tilde E \in \mathbb R^{p\times n}$ are independent with mean 0, variance 1 and sub-Gaussian norm bounded above by $c$. Then the following hold for any $u_1, u_2 \in S^{n-1}$, $\ell_1, \ell_2 \in S^{p-1}$ and positive semi-definite matrix $V \in \mathbb R^{n\times n}$ with $\|V\| \le c$:

(i) $P\big[\big|\mathrm{Tr}\big(E^T E V\big) - E\big\{\mathrm{Tr}\big(E^T E V\big)\big\}\big| \ge t (np)^{1/2}\big] \le 2\exp\big[-\min\big\{\tilde c t^2, \tilde c t (np)^{1/2}\big\}\big]$

(ii) $P\big\{\big|u_1^T E^T E u_2 - E\big(u_1^T E^T E u_2\big)\big| \ge t p^{1/2}\big\} \le 2\exp\big\{-\min\big(\tilde c t^2, \tilde c t p^{1/2}\big)\big\}$

(iii) For $\Sigma_i \in \mathbb R^{p\times p}$ the $i$th diagonal block of $A A^T$ and $a_g \in \mathbb R^p$ the $g$th standard basis vector, $E\big(E E^T\big) = \sum_{i=1}^n \Sigma_i$ and
$$P\big\{\big|\ell_1^T E E^T \ell_2 - E\big(\ell_1^T E E^T \ell_2\big)\big| \ge t n^{1/2}\big\} \le 2\exp\big\{-\min\big(\tilde c t^2, \tilde c t n^{1/2}\big)\big\}$$
$$\big|E\big(\ell_1^T E E^T a_g\big)\big| \le n \max_{h \in [p]} |\ell_{1h}|\ \Big\| n^{-1}\sum_{i=1}^n \Sigma_i \Big\|, \quad g \in [p]$$

(iv) $P\big\{\big|\ell_1^T E u_1\big| \ge t\big\} \le 2\exp\big\{-\tilde c t^2\big\}$ for all $t \ge 0$,

where $\tilde c > 0$ only depends on $c$.

Proof. The inequality in (iv) is trivial and follows because $\mathrm{vec}(\tilde E)$ has sub-Gaussian norm bounded by $c$, and (ii), (iii) follow by the proof of Lemma S14. To prove (i), we see that
$$\mathrm{Tr}\big(E^T E V\big) - E\big\{\mathrm{Tr}\big(E^T E V\big)\big\} \overset{d}{=} \mathrm{vec}(\tilde E)^T A^T \big(I_p \otimes V\big) A\,\mathrm{vec}(\tilde E) - E\big\{\mathrm{vec}(\tilde E)^T A^T \big(I_p \otimes V\big) A\,\mathrm{vec}(\tilde E)\big\},$$
where $\|A^T (I_p \otimes V) A\| \le c^3$ and $\|A^T (I_p \otimes V) A\|_F^2 \le n p c^6$. The result then follows by the proof of Lemma S14. $\Box$

Corollary
S9.
Under the conditions of Lemma S15,
$$\big\|\ell_1^T E\big\|_2, \quad p^{-1/2}\big\| u_1^T E^T E - E\big( u_1^T E^T E \big)\big\|_2 = O_P\big(n^{1/2}\big).$$
Proof. These follow by (iii) and (ii) in the statement of Lemma S15. $\Box$
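The rate in Lemma S14 and Corollary S9 is easy to see numerically. The sketch below is an illustration only, not part of the proofs: it takes the special case $A = I_{np}$, so that $E = \tilde E$ has i.i.d. standard normal entries and $V = I_n$, and checks that $\|p^{-1}E^TE - I_n\|$ behaves like $n^{1/2}p^{-1/2}$ (the dimensions are arbitrary choices):

```python
import numpy as np

# Numeric illustration of Lemma S14 in the special case A = I_{np}, so
# E = E_tilde has i.i.d. N(0, 1) entries and V = E(p^{-1} E^T E) = I_n.
rng = np.random.default_rng(0)
n = 20

norms = {}
for p in (2000, 8000):
    E = rng.standard_normal((p, n))
    S = E.T @ E / p                           # p^{-1} E^T E, an n x n matrix
    # Spectral norm of the symmetric matrix S - I_n via its eigenvalues.
    norms[p] = np.max(np.abs(np.linalg.eigvalsh(S - np.eye(n))))

for p, nrm in norms.items():
    # Lemma S14 predicts an error of order sqrt(n / p); 4x is a loose constant.
    assert nrm < 4 * np.sqrt(n / p), (p, nrm)

# Quadrupling p should roughly halve the error, consistent with the rate.
assert norms[8000] < norms[2000]
```

Since the error is driven by the extreme singular values of $E/\sqrt p$, the same check with a non-trivial $A$ of bounded operator norm gives qualitatively identical behavior.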
Lemma
S16 (Proposition 5.1 of [52]). Let $A \in \mathbb R^{K\times K}$ be a symmetric matrix and $v \in \mathbb R^K$ be a unit vector such that $v^T A v = \delta$. Then $A$ has an eigenvalue in the closed ball centered at $\delta$ with radius $\|A v - \delta v\|_2$.

Lemma
S17.
Let $H \in \mathbb R^{K\times K}$ be a symmetric matrix, $V \in \mathbb R^{K\times r}$ have orthonormal columns, $D = \mathrm{diag}(d_1,\ldots,d_r) \succ 0$, $\epsilon \in \mathbb R^{r\times r}$ and $W \in \mathbb R^{K\times r}$ have orthonormal columns, where $r < K$. Suppose $H V = V D + W \epsilon$. Then if $V_\epsilon \in \mathbb R^{K\times r}$ is any orthonormal matrix whose columns are eigenvectors of $H$ such that $\min_{k \in [K-r]} \big|d_j - \Lambda_k\big(P_{V_\epsilon}^\perp H P_{V_\epsilon}^\perp\big)\big| > 0$ for all $j \in [r]$,
$$\big\|P_{V_\epsilon} - P_V P_{V_\epsilon}\big\|_F^2 = \big\|P_V - P_{V_\epsilon} P_V\big\|_F^2 \le \sum_{j=1}^r \frac{\epsilon_{*j}^T \epsilon_{*j}}{\min_{k \in [K]} \big\{d_j - \Lambda_k\big(P_{V_\epsilon}^\perp H P_{V_\epsilon}^\perp\big)\big\}^2}.$$

Proof. Let $D_\epsilon \in \mathbb R^{r\times r}$ contain the eigenvalues of $H$ associated with the eigenvectors $V_\epsilon$ and define $R = P_{V_\epsilon}^\perp V$. By the statement of the theorem, we need only show that
$$\|R\|_F^2 \le \sum_{j=1}^r \frac{\epsilon_{*j}^T \epsilon_{*j}}{\min_{k \in [K]} \big\{d_j - \Lambda_k\big(P_{V_\epsilon}^\perp H P_{V_\epsilon}^\perp\big)\big\}^2}.$$
First,
$$P_{V_\epsilon} V D + R D + W \epsilon = H V = H P_{V_\epsilon} V + H R = V_\epsilon D_\epsilon V_\epsilon^T V + H R.$$
Next, $D_\epsilon V_\epsilon^T V = V_\epsilon^T H V = V_\epsilon^T V D + V_\epsilon^T W \epsilon$. Therefore,
$$R D - H R = V_\epsilon \big[D_\epsilon V_\epsilon^T V - V_\epsilon^T V D\big] - W \epsilon = -P_{V_\epsilon}^\perp W \epsilon,$$
which implies that
$$\big(H - d_j I_K\big) R_{*j} = P_{V_\epsilon}^\perp W \epsilon_{*j}, \quad j \in [r].$$
Since the columns of $R$ lie in the orthogonal complement of $V_\epsilon$, this completes the proof. $\Box$

Corollary
S10.
Under the conditions of Lemma S17,
$$\big\|P_{V_\epsilon} - P_V\big\|_F^2 \le 2\sum_{j=1}^r \frac{\epsilon_{*j}^T \epsilon_{*j}}{\min_{k \in [K-r]} \big\{d_j - \Lambda_k\big(P_{V_\epsilon}^\perp H P_{V_\epsilon}^\perp\big)\big\}^2}.$$
Proof.
$$\big\|P_{V_\epsilon} - P_V\big\|_F^2 = \big\|\big(P_{V_\epsilon} - P_V P_{V_\epsilon}\big) - P_V P_{V_\epsilon}^\perp\big\|_F^2 \le \big\|P_{V_\epsilon} - P_V P_{V_\epsilon}\big\|_F^2 + \big\|P_{V_\epsilon}^\perp P_V\big\|_F^2 = 2\big\|P_{V_\epsilon} - P_V P_{V_\epsilon}\big\|_F^2. \quad \Box$$

Lemma
S18.
For $V(\theta) = \sum_{j=1}^b \theta_j B_j$ and $E \in \mathbb R^{p\times n}$ a random matrix, let $S = p^{-1} E^T E$ and
$$f(\theta) = -n^{-1}\log\big\{|V(\theta)|\big\} - n^{-1}\mathrm{Tr}\big[S\{V(\theta)\}^{-1}\big].$$
If $B_1,\ldots,B_b$ and $E$ satisfy Assumptions 1 and 3 and $p \gtrsim n$, then $\big|f\big\{(\bar v_1,\ldots,\bar v_b)^T\big\} - E\{f(\bar v)\}\big| = o_P(1)$ and $f(\theta)$ is stochastically equicontinuous on $\Theta_*$ as $n \to \infty$, where $\Theta_*$ is defined in Assumption 2.

Proof. This follows from Lemma S15 and the fact that for any constant $\delta > 0$ and $\theta, \tilde\theta \in \Theta_*$ such that $\|\tilde\theta - \theta\|_2 \le \delta$, there exists a universal constant $c > 0$ that does not depend on $\delta$, $n$ or $p$ such that
$$\Big| n^{-1}\log\big\{|V(\theta)|\big\} - n^{-1}\log\big\{\big|V\big(\tilde\theta\big)\big|\big\} \Big|, \quad \Big\| \{V(\theta)\}^{-1} - \big\{V\big(\tilde\theta\big)\big\}^{-1} \Big\| \le c\,\delta. \quad \Box$$

Lemma
S19.
Fix some small constant $\epsilon > 0$. In addition to the assumptions of Lemma S18, suppose $n/p \to 0$ as $n,p \to \infty$. For any orthogonal projection matrix $Q \in \mathbb R^{n\times n}$, define $M(Q) \in \mathbb R^{b\times b}$ to be $M(Q)_{ij} = n^{-1}\mathrm{Tr}\big(Q B_i Q B_j\big)$ for $i,j \in [b]$. Lastly, define
$$f^{(Q)}(\theta) = -n^{-1}\log\big\{|Q V(\theta) Q|_+\big\} - n^{-1}\mathrm{Tr}\big[Q S Q\{Q V(\theta) Q\}^\dagger\big]$$
$$\mathcal S = \big\{H \in \mathbb R^{n\times n},\ H^T = H,\ H^2 = H : M(H) \succeq \epsilon I_b\big\},$$
where $V(\theta)$ is as defined in Lemma S18, and let $\hat\theta^{(Q)} = \arg\max_{\theta \in \Theta_*} f^{(Q)}(\theta)$. Then
$$\sup_{Q \in \mathcal S} \big\|\hat\theta^{(Q)} - \bar v\big\|_2 = O_P\big(n^{1/2} p^{-1/2}\big).$$

Proof. Fix a $Q \in \mathcal S$ and let $U \in \mathbb R^{n\times m}$ be such that $U^T U = I_m$ and $U U^T = Q$. Then
$$f^{(Q)}(\theta) = -n^{-1}\log\big\{\big|U^T V(\theta) U\big|\big\} - n^{-1}\mathrm{Tr}\big[U^T S U\big\{U^T V(\theta) U\big\}^{-1}\big].$$
Define
$$g^{(Q)}(\theta) = -n^{-1}\log\big\{\big|U^T V(\theta) U\big|\big\} - n^{-1}\mathrm{Tr}\big[U^T \bar V U\big\{U^T V(\theta) U\big\}^{-1}\big].$$
Then by Lemma S14,
$$f^{(Q)}(\theta) = g^{(Q)}(\theta) + O_P\big(n^{1/2} p^{-1/2}\big)$$
$$\big\{\nabla_\theta f^{(Q)}(\theta)\big\}_j = \big\{\nabla_\theta g^{(Q)}(\theta)\big\}_j + O_P\big(n^{1/2} p^{-1/2}\big), \quad j \in [b]$$
$$\big\{\nabla_\theta^2 f^{(Q)}(\theta)\big\}_{ij} = \big\{\nabla_\theta^2 g^{(Q)}(\theta)\big\}_{ij} + O_P\big(n^{1/2} p^{-1/2}\big), \quad i,j \in [b]$$
uniformly for $\theta \in \Theta_*$ and $Q \in \mathcal S$. Since $Q \in \mathcal S$, Lemma S4 in McKennan et al. [35] implies
$$\big|g^{(Q)}(\theta) - g^{(Q)}(\bar v)\big| \ge \delta_\epsilon \|\theta - \bar v\|_2^2$$
for some constant $\delta_\epsilon > 0$ that depends only on $\epsilon$. Therefore,
$$\sup_{Q \in \mathcal S} \big\|\hat\theta^{(Q)} - \bar v\big\|_2 = o_P(1)$$
as $n,p \to \infty$. Since $-\nabla_\theta^2 g^{(Q)}(\theta)\big|_{\theta = \bar v} \succeq \tilde\delta_\epsilon I_b$ for some constant $\tilde\delta_\epsilon > 0$ that depends only on $\epsilon$, the result follows by a routine Taylor expansion argument. $\Box$

Lemma
S20.
Suppose $D^{(n)} \in \mathbb R^{K\times K}$, $n \ge 1$, are diagonal matrices with diagonal elements $d_{kk}^{(n)}$, $k \in [K]$, uniformly bounded above 0 and below $\infty$. Let $M^{(n)} \in \mathbb R^{K\times K}$, $n \ge 1$, be symmetric matrices. Then if $\|M^{(n)}\| \to 0$ as $n \to \infty$,
$$\Big[\big\{D^{(n)} + M^{(n)}\big\}^{1/2}\Big]_{ij} - \big(d_{ii}^{(n)}\big)^{1/2} I(i = j) = \frac{M_{ij}^{(n)}}{\big(d_{ii}^{(n)}\big)^{1/2} + \big(d_{jj}^{(n)}\big)^{1/2}}\{1 + o(1)\}, \quad i,j \in [K]$$
as $n \to \infty$.

Proof. We suppress the superscript $(n)$ for notational convenience. Let $X = (D + M)^{1/2} - D^{1/2}$. Then by definition,
$$D^{1/2} X + X D^{1/2} + X^2 - M = 0.$$
For any symmetric positive semi-definite matrix $A$, the function that sends $A \to A^{1/2}$ is differentiable. Therefore, for some symmetric matrix $N \in \mathbb R^{K\times K}$,
$$(D + \epsilon M)^{1/2} - D^{1/2} = \epsilon N + o(\epsilon),$$
meaning $M = N D^{1/2} + D^{1/2} N + o(1)$. Since $N$ is symmetric, this completes the proof. $\Box$

Lemma
S21.
Let $D = \mathrm{diag}(d_1,\ldots,d_K)$ be such that $d_1 \ge \cdots \ge d_K \ge 0$ and $A \succ 0$ for $A \in \mathbb R^{K\times K}$, and define $U \in \mathbb R^{K\times K}$ to contain the eigenvectors of $DAD$. Then
$$|U_{rs}| \le \kappa^{1/2}\, d_{r\vee s}/d_{r\wedge s}, \quad (r,s) \in \big\{(k_1,k_2) \in [K]\times[K] : d_{k_1\wedge k_2} > 0\big\},$$
where $\kappa = \|A\|\,\|A^{-1}\|$ is the condition number of $A$.

Proof. Suppose $k \in [K-1]$ is such that $d_k > d_{k+1} = 0$. Then for $V = \big(I_k\ \ 0_{k\times(K-k)}\big)^T$, $\tilde D = V^T D V = \mathrm{diag}(d_1,\ldots,d_k)$ and $\tilde A = V^T A V$, $D A D = V \tilde D \tilde A \tilde D V^T$. If the columns of $\tilde U \in \mathbb R^{k\times k}$ contain the eigenvectors of $\tilde D \tilde A \tilde D$, then $U = \tilde U \oplus W$, where $W \in \mathbb R^{(K-k)\times(K-k)}$ is an arbitrary unitary matrix. Therefore, it suffices to assume $d_K > 0$. Suppose the eigenvalues of $A$ lie in $[c_1, c_2]$, $c_1 > 0$, and let $\eta_k$ be the $k$th eigenvalue of $DAD$. By Lemma S13, $\eta_k \in \big[d_k^2 c_1, d_k^2 c_2\big]$. Further,
$$d_k^2 c_2 \ge \eta_k = U_{*k}^T D A D U_{*k} \ge c_1 U_{rk}^2 d_r^2, \quad r \le k \in [K],$$
meaning $|U_{rk}| \le (c_2/c_1)^{1/2}\, d_k/d_r$ for $r \le k \in [K]$. Next, since we are assuming $D$ is invertible,
$$d_k^{-2} c_1^{-1} \ge \eta_k^{-1} = U_{*k}^T D^{-1} A^{-1} D^{-1} U_{*k} \ge c_2^{-1} U_{rk}^2 d_r^{-2}, \quad r \ge k \in [K],$$
which implies $|U_{rk}| \le (c_2/c_1)^{1/2}\, d_r/d_k$ for $r \ge k \in [K]$. This completes the proof. $\Box$

Lemma S22.
Let $\tilde L, \tilde C$ be as defined in (S3), $V = \sum_{j=1}^b \bar v_j B_j$ and $\hat V$ be an estimate for $V$. Define
$$\bar L = p^{-1/2} L \big(C^T V^{-1} C\big)^{1/2} \bar U, \qquad \bar C = C \big(C^T V^{-1} C\big)^{-1/2} \bar U,$$
where $\bar U \in \mathbb R^{K\times K}$ is a unitary matrix such that $\bar L^T \bar L = \mathrm{diag}(\bar\gamma_1,\ldots,\bar\gamma_K)$ for $0 < \bar\gamma_K \le \cdots \le \bar\gamma_1$. Suppose there exists an $s \in [K]$ such that for some large constant $c > 0$ not dependent on $n$ or $p$, $\bar\gamma_{s+1} \le c$ and $\bar\gamma_s/\bar\gamma_{s+1} \ge 1 + c^{-1}$, where $\bar\gamma_{K+1} = 0$. Lastly, let $A = I_s \oplus 0_{(K-s)\times(K-s)} \in \mathbb R^{K\times K}$ and suppose
$$\tilde L A \tilde C^T = \sum_{k=1}^s \tilde\mu_k^{1/2} \tilde w_k \tilde v_k^T, \qquad \bar L A \bar C^T = \sum_{k=1}^s \bar\mu_k^{1/2} \bar w_k \bar v_k^T,$$
where $\tilde W = (\tilde w_1 \cdots \tilde w_s), \bar W = (\bar w_1 \cdots \bar w_s) \in \mathbb R^{p\times s}$ and $\tilde V = (\tilde v_1 \cdots \tilde v_s), \bar V = (\bar v_1 \cdots \bar v_s) \in \mathbb R^{n\times s}$ have orthonormal columns, $\tilde\mu_1 \ge \cdots \ge \tilde\mu_s > 0$ and $\bar\mu_1 \ge \cdots \ge \bar\mu_s > 0$. Then if $\|V - \hat V\| = O_P(n^{-1})$ as $n,p \to \infty$ and Assumption 1 holds, the following hold for
$$\big(\tilde\ell_1 \cdots \tilde\ell_p\big)^T = p^{1/2} n^{-1/2}\, \tilde W \mathrm{diag}\big(\tilde\mu_1^{1/2},\ldots,\tilde\mu_s^{1/2}\big), \qquad \big(\bar\ell_1 \cdots \bar\ell_p\big)^T = p^{1/2} n^{-1/2}\, \bar W \mathrm{diag}\big(\bar\mu_1^{1/2},\ldots,\bar\mu_s^{1/2}\big)$$
and some unitary matrix $G \in \mathbb R^{s\times s}$ as $n,p \to \infty$:

(i) $\sup_{g \in [p]} \big\|G^T \tilde\ell_g - \bar\ell_g\big\|_2 = O_P(n^{-1})$

(ii) $\tilde\mu_k = \bar\mu_k\big[1 + O_P\big\{(\lambda_k n)^{-1}\big\}\big]$, $k \in [s]$

(iii) $\|P_{\tilde V} - P_{\bar V}\|_F = O_P\big\{(\lambda_s n)^{-1}\big\}$.

Further, if $t \in [s]$ is such that $\bar\mu_{t-1}/\bar\mu_t, \bar\mu_t/\bar\mu_{t+1} \ge 1 + c^{-1}$, then

(iv) $\sup_{g \in [p]} \big|r \tilde\ell_{gt} - \bar\ell_{gt}\big| = O_P(n^{-1})$

(v) $\|r \tilde v_t - \bar v_t\|_2 = O_P\big\{(\lambda_t n)^{-1}\big\}$

(vi) $\|r \tilde w_t - \bar w_t\|_2 = O_P\big(\lambda_t^{-1/2} n^{-1}\big)$

for some $r \in \{1, -1\}$.

Proof.
Items (i) and (iv) are straightforward. Further, all relationships are clearly true when $s = K$. Therefore, it suffices to assume $1 \le s < K$. Let $\hat M = \bar C^T \hat V^{-1} \bar C$. Then for some unitary matrix $H \in \mathbb R^{K\times K}$, $\tilde L = \bar L \hat M^{1/2} H$ and $\tilde C = \bar C \hat M^{-1/2} H$. Let $\bar\Gamma_1 = \mathrm{diag}(\bar\gamma_1,\ldots,\bar\gamma_s)$, $\bar\Gamma_2 = \mathrm{diag}(\bar\gamma_{s+1},\ldots,\bar\gamma_K)$ and $\bar\Gamma = \bar\Gamma_1 \oplus \bar\Gamma_2$. By definition,
$$\bar\Gamma^{1/2} \hat M^{1/2} H = U \Gamma^{1/2}, \qquad \Gamma = \tilde L^T \tilde L, \qquad H^T \hat M^{-1/2} = \Gamma^{-1/2} U^T \bar\Gamma^{1/2},$$
where the columns of $U \in \mathbb R^{K\times K}$ contain the eigenvectors of $\bar\Gamma^{1/2} \hat M \bar\Gamma^{1/2}$. I abuse notation and define
$$L = \bar L\big(\bar L^T \bar L\big)^{-1/2} = (L_1\ L_2), \qquad \bar C = (C_1\ C_2), \qquad U = \begin{pmatrix} U_{11} & U_{12} \\ U_{21} & U_{22} \end{pmatrix},$$
where $L_1 \in \mathbb R^{p\times s}$, $L_2 \in \mathbb R^{p\times(K-s)}$, $C_1 \in \mathbb R^{n\times s}$, $C_2 \in \mathbb R^{n\times(K-s)}$, $U_{11} \in \mathbb R^{s\times s}$ and $U_{21} \in \mathbb R^{(K-s)\times s}$. Therefore,
$$\bar L A \bar C^T = L_1 \bar\Gamma_1^{1/2} C_1^T$$
$$\tilde L A \tilde C^T = \bar L\big(\bar L^T \bar L\big)^{-1/2} \bar\Gamma^{1/2} \hat M^{1/2} H A H^T \hat M^{-1/2} \bar C^T = \bar L\big(\bar L^T \bar L\big)^{-1/2} U A U^T \bar\Gamma^{1/2} \bar C^T$$
$$= \big(L_1 U_{11} U_{11}^T + L_2 U_{21} U_{11}^T\big)\bar\Gamma_1^{1/2} C_1^T + \big(L_1 U_{11} U_{21}^T + L_2 U_{21} U_{21}^T\big)\bar\Gamma_2^{1/2} C_2^T$$
$$= L_1\big(U_{11} U_{11}^T \bar\Gamma_1^{1/2} C_1^T + U_{11} U_{21}^T \bar\Gamma_2^{1/2} C_2^T\big) + L_2\big(U_{21} U_{11}^T \bar\Gamma_1^{1/2} C_1^T + U_{21} U_{21}^T \bar\Gamma_2^{1/2} C_2^T\big)$$
and
$$\tilde C A \tilde L^T \tilde L A \tilde C^T = C_1 \bar\Gamma_1^{1/2}\big\{\big(U_{11} U_{11}^T\big)^2 + U_{11} U_{21}^T U_{21} U_{11}^T\big\}\bar\Gamma_1^{1/2} C_1^T + C_2 \bar\Gamma_2^{1/2}\big\{U_{21} U_{11}^T U_{11} U_{21}^T + \big(U_{21} U_{21}^T\big)^2\big\}\bar\Gamma_2^{1/2} C_2^T$$
$$+ C_1 \bar\Gamma_1^{1/2}\big(U_{11} U_{11}^T U_{11} U_{21}^T + U_{11} U_{21}^T U_{21} U_{21}^T\big)\bar\Gamma_2^{1/2} C_2^T + \big\{C_1 \bar\Gamma_1^{1/2}\big(U_{11} U_{11}^T U_{11} U_{21}^T + U_{11} U_{21}^T U_{21} U_{21}^T\big)\bar\Gamma_2^{1/2} C_2^T\big\}^T.$$
We therefore only have to understand how $U_{11}$ and $U_{21}$ behave. Using the exact same technique as used in the proof of Lemma S1, it is easy to see that
$$\big(U_{12}\big)_{kr},\ \big(U_{21}\big)_{rk} = O_P\big(n^{-1}\lambda_k^{-1/2}\big), \quad r \in [K-s],\ k \in [s].$$
Therefore,
$$\big(I_s - U_{11} U_{11}^T\big)_{rk} = \big(U_{12} U_{12}^T\big)_{rk} = O_P\big(n^{-2}\lambda_r^{-1/2}\lambda_k^{-1/2}\big), \quad r,k \in [s]$$
$$\big(I_s - U_{11}^T U_{11}\big)_{rk} = \big(U_{21}^T U_{21}\big)_{rk} = O_P\big(n^{-2}\lambda_r^{-1/2}\lambda_k^{-1/2}\big), \quad r,k \in [s],$$
meaning
$$\bar\Gamma_1^{1/2}\big(U_{11} U_{11}^T\big)^2\bar\Gamma_1^{1/2} - \bar\Gamma_1,\ \ \bar\Gamma_1^{1/2} U_{11} U_{21}^T U_{21} U_{11}^T \bar\Gamma_1^{1/2},\ \ \bar\Gamma_2^{1/2}\big(U_{21} U_{21}^T\big)^2\bar\Gamma_2^{1/2},\ \ \bar\Gamma_2^{1/2} U_{21} U_{11}^T U_{11} U_{21}^T \bar\Gamma_2^{1/2} = O_P\big(n^{-1}\big).$$
Further, by the proof of Lemma S1, $\big\|U_{21} U_{11}^T \bar\Gamma_1^{1/2}\big\| = O_P(n^{-1})$. Putting this all together,
$$\tilde C A \tilde L^T \tilde L A \tilde C^T = \bar C A \bar L^T \bar L A \bar C^T + O_P\big(n^{-1}\big).$$
Let $F_{ij} = C_i^T C_j$ for $i,j \in [2]$ and define $Q = (L_1 S\ \ L_2) \in \mathbb R^{p\times K}$, where $S \in \mathbb R^{s\times s}$ is a unitary matrix such that $\bar\Gamma_1^{1/2} F_{11} \bar\Gamma_1^{1/2} = S \bar M S^T$ for $\bar M = \mathrm{diag}(\bar\mu_1,\ldots,\bar\mu_s)$. Then for some unitary matrix $T \in \mathbb R^{s\times s}$,
$$\tilde B = Q^T \tilde L A \tilde C^T \tilde C A \tilde L^T Q = \begin{pmatrix} A_{11} & A_{12} \\ A_{12}^T & 0 \end{pmatrix} + O_P\big(n^{-1}\big)$$
$$A_{11} = \bar M + \bar M^{1/2} T^T F_{11}^{-1/2} F_{12} U_{21} U_{11}^T + \big(\bar M^{1/2} T^T F_{11}^{-1/2} F_{12} U_{21} U_{11}^T\big)^T, \qquad A_{12} = \bar M S^T U_{11} U_{21}^T,$$
where by Lemma S21, which gives the structure of $S$,
$$\big(A_{11}\big)_{rk} = \bar\mu_r I(r = k) + O_P\big(n^{-1}\lambda_{k\wedge r}^{1/2}\lambda_{k\vee r}^{-1/2}\big), \quad r,k \in [s]$$
$$\big(A_{12}\big)_{rk} = O_P\big(n^{-1}\lambda_r^{1/2}\big), \quad r \in [s];\ k \in [K-s].$$
By the proof of Lemma S4, the $k$th eigenvalue of $A_{11}$ is $\bar\mu_k\big\{1 + O_P\big(n^{-1}\lambda_k^{-1}\big)\big\}$ for $k \in [s]$. Weyl's Theorem then shows that the $k$th eigenvalue of $\tilde B$ is $\bar\mu_k\{1 + o_P(1)\}$ for $k \in [s]$. If $\big(\hat v_{1k}^T\ \hat v_{2k}^T\big)^T \in \mathbb R^K$, $\hat v_{1k} \in \mathbb R^s$, is the $k$th eigenvector, then
$$\hat v_{2k} = \tilde\mu_k^{-1} A_{12}^T \hat v_{1k}, \quad k \in [s]$$
$$\tilde\mu_k \hat v_{1k} = \big(A_{11} + \tilde\mu_k^{-1} A_{12} A_{12}^T\big)\hat v_{1k}, \quad k \in [s].$$
To prove (iv), let $t \in [s]$ be such that $\bar\mu_{t-1}/\bar\mu_t, \bar\mu_t/\bar\mu_{t+1} \ge 1 + c^{-1}$ for $c$ defined in the statement of the lemma. We first see that $\big\|A_{12} A_{12}^T\big\| = O_P(n^{-1})$. By the proof of Lemma S4,
$$\hat v_{1t,r} = O_P\big(n^{-1}\lambda_t^{-1/2}\lambda_{r\vee t}^{-1/2}\big), \quad r \in [s]\setminus\{t\}; \qquad \|\hat v_{2k}\| = O_P\big(\lambda_k^{-1} n^{-1/2}\big), \quad k \in [s].$$
Since
$$\hat v_{1k}^T \hat v_{1r} = -\hat v_{2k}^T \hat v_{2r} = O_P\big\{(n\lambda_k\lambda_r)^{-1}\big\}, \quad k \ne r \in [s],$$
$$\hat v_{1t,r} = O_P\big(n^{-1}\lambda_t^{-1/2}\lambda_r^{-1/2}\big), \quad r \in [s]\setminus\{t\}.$$
Therefore,
$$\|\hat v_{2t}\| = \tilde\mu_t^{-1}\big\|A_{12}^T \hat v_{1t}\big\| = \tilde\mu_t^{-1}\big\|U_{21} U_{11}^T S \bar M \hat v_{1t}\big\| = n^{-1}\tilde\mu_t^{-1} O_P\big(\big\|\bar M^{1/2}\hat v_{1t}\big\|\big) = O_P\big(n^{-1}\lambda_t^{-1/2}\big).$$
This shows that $1 - \big|\hat v_{1t,t}\big| = O_P\big\{\big(n^{-1}\lambda_t^{-1/2}\big)^2\big\}$ and completes the proof. $\Box$

Lemma S23.
Let $M \in \mathbb R^{n\times n}$ be a non-random symmetric positive definite matrix, $C \in \mathbb R^{n\times K}$ be a random matrix and $L \in \mathbb R^{p\times K}$ be a non-random matrix such that $n p^{-1} L^T L = D = \mathrm{diag}(\lambda_1,\ldots,\lambda_K)$, where $\lambda_1 \ge \cdots \ge \lambda_K > \lambda_{K+1} = 0$. Assume the following hold for some fixed constants $s \in [K]$ and $c > 0$:

(i) $C$ satisfies $\big\|n^{-1} C^T \Delta C - E\big(n^{-1} C^T \Delta C\big)\big\| = O_P\big(n^{-1/2}\big)$ for any symmetric, positive definite $\Delta \in \mathbb R^{n\times n}$ such that $\|\Delta\| \le c$.

(ii) $E\big(n^{-1} C^T C\big) = I_K$, $\lambda_1,\ldots,\lambda_s \in [c^{-1}, c n]$, $\limsup_{n,p\to\infty}\lambda_{s+1} \le c$ and $\|M\|, \|M^{-1}\| \le c$.

Let $W \in \mathbb R^{K\times K}$ be a non-random unitary matrix such that
$$W^T D^{1/2} E\big(n^{-1} C^T M C\big) D^{1/2} W = \Gamma = \mathrm{diag}(\gamma_1,\ldots,\gamma_K),$$
where $\gamma_1 > \cdots > \gamma_K > \gamma_{K+1} = 0$. For $s \in [K]$ defined above, define $W^{(s)} = (W_{*1} \cdots W_{*s})$, $d_r^{(s)} = \Lambda_r\big[D^{1/2} W^{(s)}\{W^{(s)}\}^T D^{1/2}\big]$ and $C^{(s)} \in \mathbb R^{n\times s}$ such that
$$\big\{C^{(s)}, L^{(s)}\big\} = \arg\min_{(\bar C, \bar L) \in \mathcal S_s} \big\|\big(L C^T - \bar L \bar C^T\big) M^{1/2}\big\|_F^2,$$
where $\mathcal S_s$ is as defined in Section 4.2. Assume the following also hold for some constant $c > 0$:

(iii) $\gamma_k/\gamma_{k+1} \ge 1 + c^{-1}$ for all $k \in [K]$.

(iv) $d_r^{(s)}/d_{r+1}^{(s)} \ge 1 + c^{-1}$ for all $r \in [s]$, where $d_{s+1}^{(s)} = 0$.

Then the following hold for some constant $\tilde c > 0$:
$$d_r^{(s)}/\lambda_r \in [\tilde c^{-1}, \tilde c] \quad \text{and} \quad \lambda_r - \tilde c\lambda_{s+1} \le d_r^{(s)} \le \lambda_r + \tilde c\lambda_{s+1}, \quad r \in [s] \tag{S29a}$$
$$\Lambda_r\big[p^{-1} L^{(s)}\{C^{(s)}\}^T C^{(s)}\{L^{(s)}\}^T\big] = d_r^{(s)}\big\{1 + O_P\big(n^{-1/2}\big)\big\}, \quad r \in [s]. \tag{S29b}$$
Lastly, let the non-random matrix $U \in \mathbb R^{K\times s}$ be such that $U^T U = I_s$ and whose columns are the first $s$ eigenvectors of $D^{1/2} W^{(s)}\{W^{(s)}\}^T D^{1/2}$. Then for all $r \in [s]$,
$$U_{tr} = \begin{cases} O(\lambda_{s+1}/\lambda_t) & \text{if } t < r \\ O\big\{\lambda_{(s+1)\vee t}/\lambda_r\big\} & \text{if } t > r \end{cases}, \quad t \in [K]\setminus\{r\} \tag{S30a}$$
$$C^{(s)}_{*r} = C\big(U \hat u_r + \Delta_r\big), \quad \|\Delta_r\|_2 = O_P\big(n^{-1/2}\lambda_{s+1}\lambda_r^{-1}\big), \quad \hat u_{rt} = \begin{cases} 1 + O_P\big(n^{-1/2}\big) & \text{if } t = r \\ O_P\big(n^{-1/2}\big) & \text{if } t < r \\ O_P\big(n^{-1/2}\lambda_t\lambda_r^{-1}\big) & \text{if } t > r \end{cases}, \quad t \in [s]. \tag{S30b}$$

Remark
S11.
Condition (i) is quite general, and is satisfied in the following scenarios: $C = E(C) + \sum_{i=1}^m \Xi_i$ for $m \le c$, where $\|E(C)\| \le c n^{1/2}$ and the $\Xi_i \in \mathbb R^{n\times K}$ are mean 0, $\Xi_i, \Xi_j$ are independent for $i \ne j$ and satisfy one of the following:

1. $\mathrm{vec}(\Xi_i) = A_i R_i$, where $A_i \in \mathbb R^{nK\times nK}$ is a non-random matrix such that $\|A_i\|, \|A_i^{-1}\| \le c$ and $R_i \in \mathbb R^{nK}$ is a mean 0 random vector with independent entries such that $E\big(R_{ij}^4\big) \le c$ for all $j \in [nK]$.

2. $E[\exp\{\mathrm{vec}(\Xi_i)^T t\}] \le \exp\big(c\|t\|^2\big)$.

This follows from Corollary 2.6 in Zajkowski [51] and standard properties of sub-Gaussian random vectors. Remark
S12.
This lemma enumerates the properties of $\lambda_1^{(o)},\ldots,\lambda_{K^{(o)}}^{(o)}$ and $C^{(o)}$ when $C$ is a random matrix. Using the notation that appears in the main text, $M$ corresponds to $\bar V$, $\gamma_1,\ldots,\gamma_K$ are the same as those defined in Assumption 1, the index $s$ corresponds to the index $s$ defined in the statement of Theorem 1, $d_r^{(s)}$ corresponds to $\lambda_r^{(o)}$ and $C^{(s)}$ corresponds to $C^{(o)}$. We treat $s$ as non-random because Theorem 1 shows that $\hat K = K^{(o)} = s$ with probability tending to 1 as $n,p \to \infty$. Remark
S13.
This lemma is used to prove Theorem 6.

Proof.
By Lemma S21 and because the eigenvalues of $E\big(n^{-1} C^T M C\big)$ are uniformly bounded above 0 and below $\infty$, $W_{kt} = O\big(\lambda_{k\vee t}^{1/2}\lambda_{k\wedge t}^{-1/2}\big)$ for all $k,t \in [K]$. Next, note that $d_r^{(s)} = \Lambda_r\big[\{W^{(s)}\}^T D W^{(s)}\big]$, where $\{W_{*k}^{(s)}\}^T D W_{*t}^{(s)} = O\big(\lambda_k^{1/2}\lambda_t^{1/2}\big)$ for all $k,t \in [s]$. (S29a) then follows by Lemma S13. Next, it is clear (S30a) holds for $s = K$. If $s < K$,
$$\big\|D - D^{1/2} W^{(s)}\{W^{(s)}\}^T D^{1/2}\big\| = \Big\| D^{1/2}\sum_{k=s+1}^K W_{*k} W_{*k}^T D^{1/2} \Big\| = O(\lambda_{s+1}).$$
By Weyl's Theorem and Lemma S17, this shows that $\|U_{*r} - a_r\| = O_P(\lambda_{s+1}/\lambda_r)$, where $a_r \in \mathbb R^K$ is the $r$th standard basis vector (we have assumed $U_{rr} \ge 0$ without loss of generality), and $U_{rt} = O_P(\lambda_{s+1}/\lambda_t)$ for $t < r$. Since $U_{*t}^T U_{*r} = 0$ for $t < r$, this implies $U_{tr} = O_P(\lambda_{s+1}/\lambda_t)$ for $t < r$. Next, since $\{W_{*k}^{(s)}\}^T D W_{*t}^{(s)} = O\big(\lambda_k^{1/2}\lambda_t^{1/2}\big)$ for all $k,t \in [s]$, this implies
$$U = D^{1/2} W^{(s)} V\,\mathrm{diag}\big\{d_1^{(s)},\ldots,d_s^{(s)}\big\}^{-1/2},$$
where $V \in \mathbb R^{s\times s}$ is a unitary matrix that satisfies $V_{kt} = O_P\big(\lambda_{k\vee t}^{1/2}\lambda_{k\wedge t}^{-1/2}\big)$. Therefore, for $t > s$,
$$U_{tr} = \Big\{\frac{\lambda_t}{d_r^{(s)}}\Big\}^{1/2} \underbrace{\{W_{t*}^{(s)}\}^T V_{*r}}_{= O\{\lambda_t^{1/2}\lambda_r^{-1/2}\}} = O(\lambda_t/\lambda_r),$$
which proves (S30a).

To prove the rest, we first observe that
$$\big\|n^{-1} C^T C - I_K\big\|, \quad \big\|n^{-1} C^T M C - E\big(n^{-1} C^T M C\big)\big\| = O_P\big(n^{-1/2}\big)$$
by the assumptions on $C$. Let $\hat W \in \mathbb R^{K\times K}$ be a unitary matrix such that
$$\hat W^T W^T D^{1/2}\big(n^{-1} C^T M C\big) D^{1/2} W \hat W = \hat\Gamma = \mathrm{diag}(\hat\gamma_1,\ldots,\hat\gamma_K),$$
where $\hat\gamma_1 \ge \cdots \ge \hat\gamma_K > 0$, and let $\hat W^{(s)} = \big(\hat W_{*1} \cdots \hat W_{*s}\big)$. We then get that
$$p^{-1} C^{(s)}\{L^{(s)}\}^T L^{(s)}\{C^{(s)}\}^T = \tilde C D^{1/2} W \hat W^{(s)}\{\hat W^{(s)}\}^T W^T D^{1/2} \tilde C^T, \qquad \tilde C = n^{-1/2} C.$$
It is therefore clear that (S29b) and (S30b) hold when $s = K$. When $s < K$, we only need to understand the behavior of $\hat W^{(s)}$ to complete the proof. First, the fact that $W_{kt} = O\big(\lambda_{k\vee t}^{1/2}\lambda_{k\wedge t}^{-1/2}\big)$ implies
$$G_{kt} = W_{*k}^T D^{1/2}\big(n^{-1} C^T M C\big) D^{1/2} W_{*t} = \gamma_k I(k = t) + n^{-1/2}\lambda_k^{1/2}\lambda_t^{1/2}\hat H_{kt}, \quad k,t \in [K],$$
where $\|\hat H\| = O_P(1)$. By Lemma S13,
$$\hat\gamma_t = \Lambda_t(G) = \gamma_t\big\{1 + O_P\big(n^{-1/2}\big)\big\}, \quad t \in [K].$$
Further, an application of the co-factor expansion argument developed in Section 7 of Anderson [53] (which was extended in Appendix A of Wang et al. [11] to allow $\lambda_1/\lambda_K \to \infty$) shows that $\hat W_{kt} = O_P\big(n^{-1/2}\lambda_{k\vee t}^{1/2}\lambda_{k\wedge t}^{-1/2}\big)$ for $k \ne t \in [K]$. Therefore, if $W = \big(W^{(s)}\ W_2\big)$ and $D = D_1 \oplus D_2$ for $D_1 \in \mathbb R^{s\times s}$,
$$\hat J = D^{1/2} W \hat W^{(s)}\{\hat W^{(s)}\}^T W^T D^{1/2} = D^{1/2} W\Big(I_K - \sum_{j=s+1}^K \hat W_{*j}\hat W_{*j}^T\Big) W^T D^{1/2}$$
$$= D^{1/2} W^{(s)}\{W^{(s)}\}^T D^{1/2} + \frac{\lambda_{s+1}}{n} D^{1/2} W^{(s)} D_1^{-1/2}\hat A_1 D_1^{-1/2}\{W^{(s)}\}^T D^{1/2} + n^{-1/2} D^{1/2} W_2 D_2^{1/2}\hat A_2 D_1^{-1/2}\{W^{(s)}\}^T D^{1/2}$$
$$+ \Big[n^{-1/2} D^{1/2} W_2 D_2^{1/2}\hat A_2 D_1^{-1/2}\{W^{(s)}\}^T D^{1/2}\Big]^T + \frac{\lambda_{s+1}}{\lambda_s n} D^{1/2} W_2 \hat A_3 W_2^T D^{1/2},$$
where $\|\hat A_j\| = O_P(1)$ for $j = 1, 2, 3$. We see that
$$\big\{D^{1/2} W^{(s)} D_1^{-1/2}\big\}_{tk} = O\{\min(1, \lambda_t/\lambda_k)\}, \quad t \in [K];\ k \in [s] \tag{S31}$$
$$\big\|D^{1/2} W_2 D_2^{1/2}\big\| = O(\lambda_{s+1}). \tag{S32}$$
Therefore,
$$p^{-1} C^{(s)}\{L^{(s)}\}^T L^{(s)}\{C^{(s)}\}^T = \tilde C\big[D^{1/2} W^{(s)}\{W^{(s)}\}^T D^{1/2} + O_P\big(\lambda_{s+1} n^{-1/2}\big)\big]\tilde C^T.$$
Note that $\hat J$ is rank $s$. Define $\tilde D = \mathrm{diag}\big\{d_1^{(s)},\ldots,d_s^{(s)}\big\}$ and let $U_2 \in \mathbb R^{K\times(K-s)}$ have orthonormal columns such that $U_2^T U = 0$. Then
$$(U\ U_2)^T \hat J (U\ U_2) = \tilde D \oplus 0 + O_P\big(\lambda_{s+1} n^{-1/2}\big)$$
and $\Lambda_t(\hat J) = d_t^{(s)} + O_P\big(\lambda_{s+1} n^{-1/2}\big)$ for all $t \in [s]$ by Weyl's Theorem. Since $\tilde C^T \tilde C = I_K + O_P\big(n^{-1/2}\big)$, (S29b) follows from Lemma S13. Let $\hat U \in \mathbb R^{K\times s}$ contain the first $s$ eigenvectors of $\hat J$, which are the eigenvectors corresponding to all non-zero eigenvalues. By Corollary S10 (and ignoring sign parity without loss of generality),
$$\hat U = U\hat\Delta_1 + U_2\hat\Delta_2$$
$$\big(\hat\Delta_1\big)_{kt} = \begin{cases} O_P\big(n^{-1/2}\lambda_{s+1}\lambda_{k\wedge t}^{-1}\big) & \text{if } t \ne k \\ 1 - O_P\big(\lambda_{s+1}^2 n^{-1}\lambda_k^{-2}\big) & \text{if } k = t \end{cases}, \quad t,k \in [s]$$
$$\big\|\big(\hat\Delta_2\big)_{*t}\big\| = O_P\big(n^{-1/2}\lambda_{s+1}\lambda_t^{-1}\big), \quad t \in [s].$$
Then $n^{-1/2} C^{(s)} = \tilde C\hat U\hat D^{1/2} V\hat\Sigma^{-1/2}$, where $\hat D = \mathrm{diag}\big\{\Lambda_1(\hat J),\ldots,\Lambda_s(\hat J)\big\}$,
$$\hat\Sigma = \mathrm{diag}\Big(\Lambda_1\big[p^{-1} L^{(s)}\{C^{(s)}\}^T C^{(s)}\{L^{(s)}\}^T\big],\ldots,\Lambda_s\big[p^{-1} L^{(s)}\{C^{(s)}\}^T C^{(s)}\{L^{(s)}\}^T\big]\Big)$$
and $V \in \mathbb R^{s\times s}$ is a unitary matrix containing the eigenvectors of $\hat D^{1/2}\hat U^T\tilde C^T\tilde C\hat U\hat D^{1/2}$. Since $\hat U^T\tilde C^T\tilde C\hat U = I_s + O_P\big(n^{-1/2}\big)$, $V_{kt} = O_P\big(n^{-1/2}\lambda_{k\vee t}^{1/2}\lambda_{k\wedge t}^{-1/2}\big)$ for all $k \ne t \in [s]$ by the co-factor expansion argument from Anderson [53]. Therefore (ignoring sign parity without loss of generality),
$$\big(\hat D^{1/2} V\hat\Sigma^{-1/2}\big)_{kt} = \begin{cases} 1 + O_P\big(n^{-1/2}\big) & \text{if } k = t \\ O_P\big(n^{-1/2}\big) & \text{if } t > k \\ O_P\big(n^{-1/2}\lambda_k/\lambda_t\big) & \text{if } t < k \end{cases}, \quad t,k \in [s],$$
meaning
$$n^{-1/2} C^{(s)}_{*r} = \tilde C U\big(\hat\Delta_1\hat D^{1/2} V\hat\Sigma^{-1/2}\big)_{*r} + \tilde C\,\underbrace{O_P\big(n^{-1/2}\lambda_{s+1}\lambda_r^{-1}\big)}_{K\times 1},$$
where
$$\big(\hat\Delta_1\hat D^{1/2} V\hat\Sigma^{-1/2}\big)_{tr} = \begin{cases} 1 + O_P\big(n^{-1/2}\big) & \text{if } t = r \\ O_P\big(n^{-1/2}\big) & \text{if } t < r \\ O_P\big(n^{-1/2}\lambda_t\lambda_r^{-1}\big) & \text{if } t > r \end{cases}, \quad t \in [s].$$
This completes the proof. $\Box$
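Two of the deterministic facts used repeatedly in the proofs above, the eigenvalue sandwich of Lemma S13 and the entrywise square-root expansion of Lemma S20, are easy to sanity-check numerically. The following is an illustrative sketch only; the dimensions, constants and random matrices are arbitrary choices, not objects from the proofs:

```python
import numpy as np

rng = np.random.default_rng(1)
K, c1, c2 = 6, 0.5, 2.0

# Lemma S13: for D = diag(d_1 >= ... >= d_K >= 0) and symmetric M with
# spectrum in [c1, c2], the k-th eigenvalue of M^{1/2} D M^{1/2} lies in
# the interval [d_k c1, d_k c2].
d = np.sort(rng.uniform(0.0, 5.0, K))[::-1]
Qm, _ = np.linalg.qr(rng.standard_normal((K, K)))
M = Qm @ np.diag(rng.uniform(c1, c2, K)) @ Qm.T

def sqrtm_sym(A):
    # Symmetric PSD square root via the eigendecomposition.
    w, P = np.linalg.eigh(A)
    return P @ np.diag(np.sqrt(w)) @ P.T

M_half = sqrtm_sym(M)
lam = np.sort(np.linalg.eigvalsh(M_half @ np.diag(d) @ M_half))[::-1]
assert np.all(lam >= d * c1 - 1e-9) and np.all(lam <= d * c2 + 1e-9)

# Lemma S20: for a small symmetric perturbation N of a diagonal matrix with
# entries bounded away from 0, [(D + N)^{1/2} - D^{1/2}]_{ij} is
# approximately N_ij / (d_i^{1/2} + d_j^{1/2}).
d2 = rng.uniform(0.5, 3.0, K)
N = rng.standard_normal((K, K))
N = 1e-4 * (N + N.T) / 2
X = sqrtm_sym(np.diag(d2) + N) - np.diag(np.sqrt(d2))
approx = N / np.add.outer(np.sqrt(d2), np.sqrt(d2))
assert np.max(np.abs(X - approx)) < 1e-6
```

The first check is exact (it is Ostrowski's inertia-type bound, which Lemma S13 re-derives via Courant-Fischer); the second degrades as $\|N\|$ grows, consistent with the $\{1 + o(1)\}$ factor in the lemma.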
Lemma
S24.
Let $B \in \mathbb R^{n\times m}$, $m \le n$, be any matrix with orthonormal columns, and suppose $Q \in \mathbb R^{n\times n}$ is sampled uniformly from the set of all unitary matrices in $\mathbb R^{n\times n}$. Then if $h_i$, $i \in [n]$, is the $i$th leverage score of $Q B$,
$$P\Big(\max_{i \in [n]} \Big| h_i - \frac{m}{n} \Big| \ge \epsilon\Big) \le c_1 n \exp\big(-c_2\epsilon n^{1/2}\big)$$
for all $\epsilon > 0$, where $c_1, c_2 > 0$ are constants that do not depend on $n$, $m$ or $B$.

Proof. Each $h_i$ has mean $m/n$ and is identically distributed to $W/(W + Z)$, where $W \sim n^{-1}\chi_m^2$, $Z \sim n^{-1}\chi_{n-m}^2$ and $W$ and $Z$ are independent. For any fixed $\delta \in (0, 1)$,
$$E\Big\{\Big(\frac{W}{Z + W}\Big)^p\Big\} \le E\big(\tilde W^p\big) + \exp\big(-c_\delta n\big), \qquad \tilde W = (1 - \delta)^{-1} W,$$
where $c_\delta > 0$ depends only on $\delta$. We next see that because $n\tilde W \sim (1 - \delta)^{-1}\chi_m^2$, there exist constants $a_\delta, c_\delta > 0$ that depend only on $\delta$ such that
$$E\big\{(n h_i)^p\big\} \le \big(a_\delta m^{1/2} p\big)^p + n^p\exp\big(-c_\delta n\big) = \big(a_\delta m^{1/2} p\big)^p + \big\{n\exp\big(-c_\delta n p^{-1}\big)\big\}^p \le \big(a_\delta m^{1/2} p\big)^p + \big\{(e c_\delta)^{-1} p\big\}^p.$$
This shows that $n h_i$ has sub-exponential norm bounded above by $c m^{1/2}$, where $c > 0$ does not depend on $n$, $m$ or $B$. Using a standard sub-exponential inequality argument, we get that for some constants $\tilde c, \bar c > 0$ that do not depend on $n$, $m$ or $B$,
$$P\Big(\max_{i \in [n]} \Big| h_i - \frac{m}{n} \Big| \ge \epsilon\Big) \le 2n \exp\Big(\frac{\bar c\lambda^2 m}{n^2} - \lambda\epsilon\Big), \quad 0 < \lambda \le \tilde c n m^{-1/2},$$
for all $\epsilon > 0$. The result then follows by setting $\lambda = \tilde c n m^{-1/2}$ and noting that $n m^{-1/2} \ge n^{1/2}$ because $m \le n$. $\Box$
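The concentration in Lemma S24 is visible even at moderate dimensions. The following is an illustration only (the dimensions $n = 500$, $m = 10$ are arbitrary), taking $B$ to be the first $m$ columns of $I_n$, in which case the leverage scores of $QB$ are the squared row norms of the first $m$ columns of $Q$:

```python
import numpy as np

# Leverage scores of QB for Q uniform orthogonal and B = first m columns of
# I_n: the i-th score is h_i = ||Q[i, :m]||^2.
rng = np.random.default_rng(2)
n, m = 500, 10

# QR of a Gaussian matrix gives a uniformly distributed orthogonal Q up to
# column signs, which do not affect the (squared) leverage scores.
Q, _ = np.linalg.qr(rng.standard_normal((n, n)))
h = np.sum(Q[:, :m] ** 2, axis=1)

# The scores sum to m exactly (trace of a rank-m projection), so their
# average is exactly m / n.
assert abs(h.sum() - m) < 1e-8
# Lemma S24: the maximal deviation from m / n is small with high probability.
assert np.max(np.abs(h - m / n)) < 0.1
```

Each $h_i$ here follows the $\mathrm{Beta}(m/2, (n-m)/2)$ distribution appearing in the proof, with mean $m/n = 0.02$ and standard deviation of order $m^{1/2}/n$, so the observed maximal deviation is far below the loose $0.1$ threshold.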