Cross-Validated Loss-Based Covariance Matrix Estimator Selection in High Dimensions
Philippe Boileau, Nima S. Hejazi, Mark J. van der Laan, Sandrine Dudoit
Philippe Boileau
Graduate Group in Biostatistics, University of California, Berkeley
[email protected]
Nima S. Hejazi
Graduate Group in Biostatistics, and Center for Computational Biology, University of California, Berkeley
[email protected]
Mark J. van der Laan
Division of Biostatistics, Department of Statistics, and Center for Computational Biology, University of California, Berkeley
[email protected]
Sandrine Dudoit
Department of Statistics, Division of Biostatistics, and Center for Computational Biology, University of California, Berkeley
[email protected]
February 22, 2021

Abstract
The covariance matrix plays a fundamental role in many modern exploratory and inferential statistical procedures, including dimensionality reduction, hypothesis testing, and regression. In low-dimensional regimes, where the number of observations far exceeds the number of variables, the optimality of the sample covariance matrix as an estimator of this parameter is well-established. High-dimensional regimes do not admit such a convenience, however. As such, a variety of estimators have been derived to overcome the shortcomings of the sample covariance matrix in these settings. Yet, the question of selecting an optimal estimator from among the plethora available remains largely unaddressed. Using the framework of cross-validated loss-based estimation, we develop the theoretical underpinnings of just such an estimator selection procedure. In particular, we propose a general class of loss functions for covariance matrix estimation and establish finite-sample risk bounds and conditions for the asymptotic optimality of the cross-validated estimator selector with respect to these loss functions. We evaluate our proposed approach via a comprehensive set of simulation experiments and demonstrate its practical benefits by application in the exploratory analysis of two single-cell transcriptome sequencing datasets. A free and open-source software implementation of the proposed methodology, the cvCovEst R package, is briefly introduced.
1 Introduction

The covariance matrix underlies numerous exploratory and inferential statistical procedures central to analyses regularly performed in fields as diverse as computational biology and financial economics. For instance, in statistical genomics, the covariance matrix serves as a key ingredient in many popular dimensionality reduction, clustering, and classification methods, which are regularly relied upon in quality control assessments, exploratory data analysis, and, recently, the discovery and characterization of novel types of cells. In financial economics, covariance matrices lie at the heart of techniques used to gauge the excess returns of financial assets over time, including in such applications as portfolio optimization and returns forecasting. Still other important applications in which the covariance matrix is critical include climate modeling and weather forecasting, imaging data processing and analysis, probabilistic graphical modeling, and text corpora compression and retrieval. Even more fundamentally, this statistical parameter plays a key role in assessing the strengths of linear relationships within multivariate data, in generating simultaneous confidence bands and regions, and in the construction and evaluation of hypothesis tests. Accurate estimation of the covariance matrix is therefore an essential step in the successful application of many of the most prominent tools of modern exploratory and inferential statistics.

When the number of observations in a dataset far exceeds the number of features, the estimator of choice for the covariance matrix is the sample covariance matrix: it is an efficient estimator under minimal regularity assumptions on the data-generating distribution [see, for example, reviews in Pourahmadi, 2013, Wainwright, 2019]. In high-dimensional regimes, however, this simple estimator leaves much to be desired. When the number of features outstrips the number of observations, the sample covariance matrix is singular. Even when the number of observations slightly exceeds the number of features, the sample covariance matrix is numerically unstable on account of an overly large condition number [Golub and Van Loan, 1996], thereby amplifying any error arising in its estimation. Further, in such cases, its eigenvalues are generally over-dispersed when compared to those of the population covariance matrix [Johnstone, 2001, Ledoit and Wolf, 2004], disqualifying the use of classical principal component analysis (PCA) and related procedures.

High-dimensional data have become increasingly widespread in myriad scientific domains, with many notable examples arising from challenges posed by cutting-edge biological sequencing technologies. Accordingly, researchers have turned to developing novel covariance matrix estimators to remediate the deficiencies of the sample covariance matrix. Under certain sparsity assumptions, Bickel and Levina [2008a,b], Rothman et al. [2009], Lam and Fan [2009], Cai et al. [2010], and Cai and Liu [2011], among others, have demonstrated that the true covariance matrix can be estimated consistently through the application of element-wise thresholding or tapering functions to the sample covariance matrix. Another thread of the literature, which includes notable contributions by Stock and Watson [2002], Bai [2003], Fan et al. [2008, 2013, 2016, 2019], and Onatski [2012], has championed methods employing factor models in covariance matrix estimation.
Other popular proposals include the families of estimators formulated by Schäfer and Strimmer [2005], and Ledoit and Wolf [2004, 2012, 2015, 2018], which use an approach inspired by the empirical Bayes framework [Robbins, 1964, Efron, 2012]. We briefly review and detail several of these estimator families in Section S1.

Despite the flexibility provided by an apparent wealth of candidate estimators, this variety poses many practical issues. Namely, identifying the most appropriate estimator from among a collection of candidates is itself a significant challenge. A partial answer to this problem has come in the form of data-adaptive approaches designed to select the optimal estimator within a particular class [e.g., Bickel and Levina, 2008a,b, Cai and Liu, 2011, Fang et al., 2016, Fan et al., 2013]; however, such approaches are inherently limited by their focus on relatively narrow families of covariance matrix estimators. Unfortunately, the successful application of such estimator selection frameworks requires, as a preliminary step, that the practitioner make a successful choice among estimator families, thus injecting a degree of subjectivity in their deployment. On the other hand, the broader question of selecting an optimal estimator from among a diverse library of candidates has remained unaddressed. In answer to this, we offer a general loss-based framework — building upon the seminal work of Dudoit and van der Laan [2003], van der Laan and Dudoit [2003], and van der Vaart et al. [2006] — for asymptotically optimal covariance matrix estimator selection based upon cross-validation.

In the cross-validated loss-based estimation framework, the parameter of interest is defined as the risk minimizer, with respect to the data-generating distribution, based on a loss function tailored to the problem at hand. Candidate estimators may be generated in a variety of manners, including as empirical risk minimizers with respect to an empirical distribution over parameter subspaces corresponding to models for the data-generating distribution. One would ideally select as optimal estimator that which minimizes the "true" risk with respect to the data-generating distribution. However, as this distribution is typically unknown, one turns to cross-validation for risk estimation. Dudoit and van der Laan [2003], van der Laan and Dudoit [2003], and van der Vaart et al. [2006] have shown that, under general conditions on the data-generating distribution and loss function, the cross-validated estimator selector is asymptotically optimal, in the sense that it performs asymptotically as well in terms of risk as an optimal oracle selector based on the true, unknown data-generating distribution.

Focusing specifically on the covariance matrix as the parameter of interest, we address the choice of loss function and candidate estimators, and derive new, high-dimensional asymptotic optimality results for multivariate cross-validated estimator selection procedures. Further, the proposed framework accommodates arbitrary families of covariance matrix estimators, enabling the objective selection of an optimal estimator while fully taking advantage of the plethora of available estimators. As such, it generalizes existing, but more limited, data-adaptive estimator selection frameworks in which the library of candidate estimators is narrowed based on available subject-matter knowledge or, as is more commonly the case, for convenience's sake.

The remainder of the present manuscript is organized as follows.
In Section 2, we formulate the problem within the general framework of cross-validated loss-based estimation. In Section 3, we propose a general class of loss functions for covariance matrix estimation and establish finite-sample risk bounds and conditions for the asymptotic optimality of the cross-validated estimator selector with respect to these loss functions. Section 4 provides a thorough interrogation of the proposed cross-validated estimator selector in a series of simulation experiments. The practical benefits of using the cross-validated estimator selection strategy are then demonstrated by its application in a dimensionality reduction pipeline for novel cell-type identification with two distinct single-cell RNA-seq datasets in Section 5. We conclude the manuscript with a brief discussion in Section 6.

2 Cross-Validated Estimator Selection

Consider a dataset $\mathbf{X}_{n \times J} = \{X_1, \ldots, X_n : X_i \in \mathbb{R}^J\}$, comprising $n$ independent and identically distributed (i.i.d.) random vectors, where $n \approx J$ or $n < J$. Let $X_i \sim P_0 \in \mathcal{M}$, where $P_0$ denotes the true data-generating distribution and $\mathcal{M}$ the statistical model, i.e., a collection of possible data-generating distributions $P$ for $X_i$. We assume a nonparametric statistical model $\mathcal{M}$ for $P_0$, thereby minimizing assumptions on the form of $P_0$. We denote by $P_n$ the empirical distribution of the $n$ random vectors forming $\mathbf{X}_{n \times J}$. Letting $\mathbb{E}[X_i] = 0$ without loss of generality and defining $\psi_0 := \mathrm{Var}[X_i]$, we take as our goal the estimation of the covariance matrix $\psi_0$. This is accomplished by identifying the "optimal" estimator of $\psi_0$ from among a collection of candidates, where, as detailed below, optimality is defined in terms of risk.

For any distribution $P \in \mathcal{M}$, define its covariance matrix as $\psi = \Psi(P)$, in terms of a mapping $\Psi$ from the model $\mathcal{M}$ to the space of $J \times J$ symmetric, positive semi-definite matrices. Given this mapping, define the parameter space as $\boldsymbol{\Psi} := \{\psi = \Psi(P) : P \in \mathcal{M}\}$, the set of all covariance matrices corresponding to the data-generating distributions $P$ belonging to the model $\mathcal{M}$. Furthermore, candidate estimators of the covariance matrix are defined as $\hat{\psi}_k := \hat{\Psi}_k(P_n)$, in terms of mappings $\hat{\Psi}_k$ from the empirical distribution $P_n$ to the parameter space $\boldsymbol{\Psi}$, $k = 1, \ldots, K_n$. The subscript for $K_n$ emphasizes that, as $n$ and $J$ grow, so too may the number of candidate estimators.

In order to assess the optimality of estimators in the set $\boldsymbol{\Psi}$, we introduce a generic loss function $L(X; \psi, \eta)$ characterizing a cost applicable to any $\psi \in \boldsymbol{\Psi}$ and $X \sim P \in \mathcal{M}$, where $\eta$ is a (possibly empty) nuisance parameter. Specific examples of loss functions for the covariance estimation setting are proposed in Sections 3.1 and S3. Define $H$ as the mapping from the model $\mathcal{M}$ to the nuisance parameter space $\mathcal{H} := \{\eta = H(P) : P \in \mathcal{M}\}$, and let $\hat{\eta} = \hat{H}(P_n)$ denote a generic nuisance parameter estimator, where $\hat{H}$ is a mapping from $P_n$ to $\mathcal{H}$. Given any $\eta \in \mathcal{H}$, the risk under $P \in \mathcal{M}$ for any $\psi \in \boldsymbol{\Psi}$ is defined as the expected value of $L(X; \psi, \eta)$ with respect to $P$:

$$\Theta(\psi, \eta, P) := \int L(x; \psi, \eta)\, dP(x) = \mathbb{E}_P[L(X; \psi, \eta)]. \tag{1}$$

Then, so long as a risk minimizer exists under the true data-generating distribution $P_0$, the minimizer is given by $\psi_0 := \operatorname{argmin}_{\psi \in \boldsymbol{\Psi}} \Theta(\psi, \eta_0, P_0)$, where $\eta_0 := H(P_0)$. The risk minimizer need not be unique.
The optimal risk under $P_0$ is then given by $\theta_0 := \min_{\psi \in \boldsymbol{\Psi}} \Theta(\psi, \eta_0, P_0)$, which is to say that the risk minimizer $\psi_0$ attains risk $\theta_0$.

For any given estimator $\hat{\psi}_k$ of $\psi_0$, its conditional risk given $P_n$ with respect to the true data-generating distribution $P_0$ is defined as

$$\tilde{\theta}_n(k) := \mathbb{E}_{P_0}[L(X; \hat{\Psi}_k(P_n), \eta_0) \mid P_n] = \int L(x; \hat{\Psi}_k(P_n), \eta_0)\, dP_0(x) = \Theta(\hat{\psi}_k, \eta_0, P_0), \tag{2}$$

and its risk difference relative to $\psi_0$ is represented as $d_n(\hat{\psi}_k, \psi_0) := \tilde{\theta}_n(k) - \theta_0$. The best-performing candidate, that is, the oracle selection over the full dataset, is then indexed by

$$\tilde{k}_n := \operatorname{argmin}_{k \in \{1, \ldots, K_n\}} d_n(\hat{\psi}_k, \psi_0).$$

The subscript $n$ emphasizes that the risk, risk difference, and optimal estimator index are conditional on the empirical distribution $P_n$. They are therefore random variables.

Given the high-dimensional nature of the data, it is generally most convenient to study the performance of estimators of $\psi_0$ using Kolmogorov asymptotics — that is, in the setting in which both $n \to \infty$ and $J \to \infty$ such that $J/n \to m < \infty$. Historically, estimators have been derived within this high-dimensional asymptotic regime to improve upon the finite-sample results of estimators brought about by traditional asymptotic arguments. After all, the sample covariance matrix retains its asymptotic optimality properties when $J$ is fixed, even though it is known to perform poorly in high-dimensional settings.

Naturally, it would be desirable for an estimator selection procedure to select the estimator indexed by $\tilde{k}_n$; however, this quantity depends on the true, unknown data-generating distribution $P_0$. As a substitute for the candidates' true conditional risks, we employ instead the cross-validated estimators of these same conditional risks.

Cross-validation (CV) consists of randomly, and possibly repeatedly, partitioning a dataset into a training set and a validation set. The observations in the training set are fed to the candidate estimators, and the observations in the validation set are used to evaluate the performance of these estimators [Breiman and Spector, 1992, Friedman et al., 2001]. A range of CV schemes have been proposed and investigated, both theoretically and computationally; Dudoit and van der Laan [2005] provide a thorough review of popular CV schemes and their properties. Among the variety, $V$-fold and Monte Carlo CV stand out as approaches that have gained traction on account of their relative computational feasibility and good performance. Any CV scheme can be expressed in terms of a binary random vector $B_n$, which assigns observations to either the training or the validation set. Observation $i$ is said to lie in the training set when $B_n(i) = 0$ and in the validation set otherwise. The training set therefore contains $\sum_i (1 - B_n(i)) = n(1 - p_n)$ observations and the validation set $\sum_i B_n(i) = n p_n$ observations, where $p_n$ is the fixed validation set proportion corresponding to the chosen CV procedure. The empirical distributions of the training and validation sets are denoted by $P_{n,B_n}^0$ and $P_{n,B_n}^1$, respectively, for any given realization of $B_n$.

Using this general definition of CV, we may now define the cross-validated estimator of a candidate $\hat{\Psi}_k$'s risk as

$$\hat{\theta}_{p_n,n}(k, \hat{H}(P_{n,B_n}^0)) := \mathbb{E}_{B_n}\left[\Theta(\hat{\Psi}_k(P_{n,B_n}^0), \hat{H}(P_{n,B_n}^0), P_{n,B_n}^1)\right] = \mathbb{E}_{B_n}\left[\frac{1}{n p_n} \sum_{i=1}^{n} I(B_n(i) = 1)\, L(X_i; \hat{\Psi}_k(P_{n,B_n}^0), \hat{H}(P_{n,B_n}^0))\right], \tag{3}$$

for some nuisance parameter estimator mapping $\hat{H}$. The corresponding cross-validated selector is then given by

$$\hat{k}_{p_n,n} := \operatorname{argmin}_{k \in \{1, \ldots, K_n\}} \hat{\theta}_{p_n,n}(k, \hat{H}(P_{n,B_n}^0)).$$
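To make the selection rule concrete, the following minimal R sketch computes the cross-validated risk of Equation (3) for each candidate under a V-fold scheme and returns the empirical minimizer. The helper names (`cv_select`, `candidates`, `loss`) are illustrative and not part of any package's interface; the loss function is supplied by the user.

```r
# Minimal sketch of the cross-validated selector, assuming a V-fold CV scheme.
# `candidates` is a named list of estimator mappings: each takes an n x J training
# matrix and returns a J x J covariance matrix estimate. `loss` takes a single
# validation observation x (a length-J vector) and an estimate psi_hat, and
# returns the realized loss L(x; psi_hat).
cv_select <- function(X, candidates, loss, V = 5) {
  n <- nrow(X)
  folds <- sample(rep(seq_len(V), length.out = n))  # random fold assignment

  # Cross-validated risk of each candidate: average validation-set loss over folds.
  cv_risks <- sapply(candidates, function(estimator) {
    fold_risks <- sapply(seq_len(V), function(v) {
      train <- X[folds != v, , drop = FALSE]   # training set (B_n(i) = 0)
      valid <- X[folds == v, , drop = FALSE]   # validation set (B_n(i) = 1)
      psi_hat <- estimator(train)              # candidate fit on the training set
      mean(apply(valid, 1, loss, psi_hat = psi_hat))
    })
    mean(fold_risks)                           # expectation over the CV splits B_n
  })

  # The cross-validated selection: the candidate minimizing the estimated risk.
  list(selection = names(candidates)[which.min(cv_risks)], cv_risks = cv_risks)
}
```

With a suitably rich candidate library and the Frobenius loss of Section 3.1, this mirrors the selection rule $\hat{k}_{p_n,n}$, up to the choice of CV scheme.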
As a benchmark, the unknown cross-validated conditional risk of the $k$th estimator is

$$\tilde{\theta}_{p_n,n}(k, \eta_0) := \mathbb{E}_{B_n}\left[\Theta(\hat{\Psi}_k(P_{n,B_n}^0), \eta_0, P_0)\right].$$

The cross-validated oracle selector is then given by

$$\tilde{k}_{p_n,n} := \operatorname{argmin}_{k \in \{1, \ldots, K_n\}} \tilde{\theta}_{p_n,n}(k, \eta_0).$$

As before, the $p_n$ and $n$ subscripts highlight the dependence of these objects on the CV procedure and the empirical distribution $P_n$, respectively (thus making them random variables).

Ideally, the cross-validated estimator selection procedure should identify a $\hat{k}_{p_n,n}$ that is asymptotically (in $n$, $J$, and possibly large $K_n$) equivalent in terms of risk to the oracle $\tilde{k}_{p_n,n}$, under a set of nonrestrictive assumptions based on the choice of loss function, target parameter space, and, if applicable, nuisance parameter space, in the sense that

$$\frac{\tilde{\theta}_{p_n,n}(\hat{k}_{p_n,n}, \eta_0) - \theta_0}{\tilde{\theta}_{p_n,n}(\tilde{k}_{p_n,n}, \eta_0) - \theta_0} \xrightarrow{P} 1 \quad \text{as } n, J \to \infty. \tag{4}$$

That is, the estimator selected via CV is equivalent in terms of risk to the CV scheme's oracle estimator chosen from among all candidates.

When Equation (4) holds, a further step may be taken by relating the performance of the cross-validated selector to that of the full-dataset oracle selector, $\tilde{k}_n$:

$$\frac{\tilde{\theta}_{n}(\hat{k}_{p_n,n}, \eta_0) - \theta_0}{\tilde{\theta}_{n}(\tilde{k}_{n}, \eta_0) - \theta_0} \xrightarrow{P} 1 \quad \text{as } n, J \to \infty. \tag{5}$$

When the cross-validated selection procedure's full-dataset conditional risk difference converges in probability to that of the full-dataset oracle's, we say that its chosen estimator is asymptotically optimal. In other words, the data-adaptively selected estimator performs asymptotically as well, with respect to the chosen loss, as the candidate that would be picked from the collection of estimators if the true data-generating distribution were known.

Next, we focus on identifying a loss function with which to perform this procedure in the covariance estimation setting, and on determining the conditions under which such a procedure generates optimal results.

3 Covariance Matrix Estimator Selection

3.1 The Scaled Frobenius Loss

The choice of loss function should reflect the goals of the estimation task. In this section, we argue in favor of the scaled Frobenius loss, providing heuristics as to how it proves an often reasonable choice. Its theoretical properties, which permit the derivation of finite-sample bounds for the cross-validated risk of the estimator selection procedure, are also detailed.

The scaled Frobenius loss is defined as

$$L(X; \psi, H(P)) := \|XX^\top - \psi\|^2_{F, H(P)} = \sum_{j=1}^{J} \sum_{l=1}^{J} H(P)^{(jl)} \left(X^{(j)} X^{(l)} - \psi^{(jl)}\right)^2, \tag{6}$$

where $X^{(j)}$ is the $j$th element of a random vector $X \sim P \in \mathcal{M}$, $\psi^{(jl)}$ is the entry in the $j$th row and $l$th column of an arbitrary covariance matrix $\psi \in \boldsymbol{\Psi}$, and $H(P)$ is a $J \times J$ matrix acting as a scaling factor (i.e., a potential nuisance parameter). For an estimator $\hat{H}(P_n)$ of $H(P)$, the cross-validated risk estimator of the $k$th candidate estimator $\hat{\Psi}_k$ under the scaled Frobenius loss is

$$\hat{\theta}_{p_n,n}(k, \hat{H}(P_{n,B_n}^0)) = \mathbb{E}_{B_n}\left[\Theta(\hat{\Psi}_k(P_{n,B_n}^0), \hat{H}(P_{n,B_n}^0), P_{n,B_n}^1)\right] = \mathbb{E}_{B_n}\left[\frac{1}{n p_n} \sum_{i=1}^{n} I(B_n(i) = 1)\, \|X_i X_i^\top - \hat{\Psi}_k(P_{n,B_n}^0)\|^2_{F, \hat{H}(P_{n,B_n}^0)}\right]. \tag{7}$$

A sought-after property in a loss function is that it not diverge as $J \to \infty$ for any reasonable choice of estimator. Otherwise, as the number of features diverges, so too would the estimated conditional risks.
As a result, the asymptotic optimality of the loss-based selection procedure could not be studied, since comparison of the candidate estimators' performances would be impossible. The loss function defined in Equation (6) clearly lacks this asymptotic quality when the scaling factor $H(P)$ is omitted or chosen poorly.

Ledoit and Wolf [2004] and Bickel and Levina [2008b], among others, have employed analogous scaled matrix-based Frobenius losses to prove various optimality results, defining $H(P)^{(jl)} = 1/J$ for all $j, l$. This particular choice of scaling factor is such that, whatever the value of $J$, $\|I_{J \times J}\|^2_{F, H(P)} = 1$. A similarly reasonable option for when the true covariance matrix is assumed to be dense is $H(P)^{(jl)} = 1/J^2$. This weighting scheme effectively computes the average loss across every entry of the covariance matrix.

However, these scaling functions are too restrictive for a CV selection method to be effective across a broad range of data-generating processes. Consider the latter weighting scheme, which produces paradoxical results in asymptopia when the number of non-zero entries in the true covariance matrix is not on the order of $J^2$: biased estimators that produce estimates with fewer non-zero entries than $\psi_0$ would in fact have asymptotically zero risk. A more flexible scaling factor is therefore needed.

A suitable option for when the features' variances are of similar magnitudes is $H(P)^{(jl)} = 1 / \sum_j \sum_l |\psi^{(jl)}|$. This choice of scaling factor makes the loss function of Equation (6) analogous to the squared error loss in high dimensions. Further, it is straightforward to verify that, under regularity conditions permitting the interchange of the integral with respect to $X$ and the partial differential operators with respect to $\psi$, we obtain $\psi_0 = \operatorname{argmin}_{\psi \in \boldsymbol{\Psi}} \Theta(\psi, \eta_0, P_0)$. For these reasons, we use this scaling in our definition of the scaled Frobenius loss in the remainder of the manuscript. Note, however, that when the scaling factor's entries are defined as identical constants, there is no impact in finite-sample settings: the scaling exists for the sole purpose of asymptotic analysis, as will be made clear in the sequel. Indeed, since the effect of the scaling factor on the empirical risk estimators is only relative, the scaling factor need not be estimated. Thus, it is not a nuisance parameter in the conventional sense. Accordingly, in a minor abuse of notation, we suppress dependence of the loss function on the scaling factor wherever possible throughout the remainder of the text.
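For concreteness, the following sketch implements the observation-level, non-scaled Frobenius loss of Equation (6) — the quantity averaged over the validation set in Equation (7). It could be passed as the `loss` argument of the `cv_select()` sketch above; both are illustrative helpers rather than part of any package API.

```r
# Realized (non-scaled) Frobenius loss for one validation observation x and a
# candidate covariance estimate psi_hat: || x x^T - psi_hat ||_F^2. Because a
# constant scaling factor enters every candidate's risk identically, it can be
# dropped when comparing estimators, as noted in the text.
frobenius_loss <- function(x, psi_hat) {
  resid <- tcrossprod(x) - psi_hat   # x x^T - psi_hat, a J x J matrix
  sum(resid^2)                       # squared Frobenius norm of the residual
}
```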
3.2 Optimality of the Cross-Validated Selector

Having defined a suitable loss function, we turn to a discussion of the theoretical properties of the cross-validated estimator selection procedure. Specifically, we present, in Theorem 1, sufficient conditions under which the method is asymptotically equivalent in terms of risk to the commensurate CV oracle selector (as per Equation (4)). This theorem extends the general framework of Dudoit and van der Laan [2003] for use in high-dimensional, multivariate estimator selection. Adapting their existing theory to this setting requires a judicious choice of loss function, new assumptions and conditions, and updated proofs reflecting the use of Kolmogorov asymptotics. Corollary 1 then builds on Theorem 1 and details conditions under which the procedure produces asymptotically optimal selections, in the sense of Equation (5).
Theorem 1.
Let $X_1, \ldots, X_n$ be a random sample of $n$ i.i.d. random vectors of dimension $J$, such that $X_i \sim P_0 \in \mathcal{M}$, $i = 1, \ldots, n$. Assume, without loss of generality, that $\mathbb{E}[X_i] = 0$, and define $\psi_0 := \mathrm{Var}[X_i]$. Additionally, let $\boldsymbol{\Psi} := \{\psi = \Psi(P) : P \in \mathcal{M}\}$ be the parameter space of all possible covariance matrices for the model $\mathcal{M}$. Define the set of $K_n$ candidate estimator mappings as $\{\hat{\Psi}_k(\cdot) : k = 1, \ldots, K_n\}$, and assume that the empirical distribution $P_n \in \mathcal{M}$. Next, for $\psi \in \boldsymbol{\Psi}$, define the scaled squared Frobenius loss function as $L(X; \psi) := \|XX^\top - \psi\|^2_{F, H(P_0)}$, with scaling factor $H(P_0) := \left(1 / \sum_j \sum_l |\psi_0^{(jl)}|\right) \cdot 1_{J \times J} = \eta_0 \cdot 1_{J \times J}$, where $\eta_0 := 1 / \sum_j \sum_l |\psi_0^{(jl)}|$. Finally, designate $p_n$ as the proportion of observations in any given validation set of the CV scheme. Consider the following assumptions.

Assumption 1.
For each $P \in \mathcal{M}$ and $X \sim P$, $\|X\| < \sqrt{M_1} < \infty$ almost surely (a.s.).

Assumption 2.
For each $\psi \in \boldsymbol{\Psi}$, $|\psi^{(jl)}| < M_2 < \infty$ for all $j, l = 1, \ldots, J$.

Finite-Sample Result.
Let $M(J) := 4\eta_0(M_1^2 + M_2^2)J^2$ and $c(\delta, M(J)) := 2(1 + \delta)^2 M(J)\left(\frac{1}{3} + \frac{1}{\delta}\right)$. Then, for any $\delta > 0$,

$$0 \le \mathbb{E}_{P_0}\big[\tilde{\theta}_{p_n,n}(\hat{k}_{p_n,n}) - \theta_0\big] \le (1 + 2\delta)\, \mathbb{E}_{P_0}\big[\tilde{\theta}_{p_n,n}(\tilde{k}_{p_n,n}) - \theta_0\big] + 2c(\delta, M(J))\, \frac{1 + \log(K_n)}{n p_n}. \tag{8}$$

High-Dimensional Asymptotic Result.
The finite-sample result in Equation (8) has the following asymptotic implications. If $c(\delta, M(J))(1 + \log(K_n)) / \big(n p_n\, \mathbb{E}_{P_0}[\tilde{\theta}_{p_n,n}(\tilde{k}_{p_n,n}) - \theta_0]\big) \to 0$ and $J/n \to m < \infty$ as $n, J \to \infty$, then

$$\frac{\mathbb{E}_{P_0}\big[\tilde{\theta}_{p_n,n}(\hat{k}_{p_n,n}) - \theta_0\big]}{\mathbb{E}_{P_0}\big[\tilde{\theta}_{p_n,n}(\tilde{k}_{p_n,n}) - \theta_0\big]} \to 1. \tag{9}$$

Further, if $c(\delta, M(J))(1 + \log(K_n)) / \big(n p_n\, (\tilde{\theta}_{p_n,n}(\tilde{k}_{p_n,n}) - \theta_0)\big) \xrightarrow{P} 0$ as $n, J \to \infty$, then

$$\frac{\tilde{\theta}_{p_n,n}(\hat{k}_{p_n,n}) - \theta_0}{\tilde{\theta}_{p_n,n}(\tilde{k}_{p_n,n}) - \theta_0} \xrightarrow{P} 1. \tag{10}$$

The proof is provided in Section S2. It relies on special properties of the random variable $Z_k := L(X; \hat{\Psi}_k(P_{n,B_n}^0)) - L(X; \psi_0)$ and on an application of Bernstein's inequality [Bernstein, 1946, Bennett, 1962]. Together, they are used to show that $c(\delta, M(J))(1 + \log(K_n)) / (n p_n)$ is a finite-sample bound for comparing the performance of the cross-validated selector, $\hat{k}_{p_n,n}$, against that of the oracle selector over the training sets, $\tilde{k}_{p_n,n}$. Once this bound is established, the high-dimensional asymptotic results follow immediately.

Now, Theorem 1's high-dimensional asymptotic results relate the performance of the cross-validated selector to that of the oracle selector for the CV scheme. As indicated by the expression in Equation (5), however, we would like our cross-validated selector to be asymptotically equivalent to the oracle selector over the entire dataset. The conditions to obtain this desired result are provided in Corollary 1, a minor adaptation of previous work by Dudoit and van der Laan [2003]. This extension accounts for increasing $J$, thereby permitting its use in high-dimensional asymptotics.

Corollary 1.
Building upon Assumptions 1 and 2 of Theorem 1, we introduce the additional assumptions that, as $n, J \to \infty$: $J/n \to m < \infty$, $p_n \to 0$, $\eta_0 \log(K_n)/p_n \to 0$, $\eta_0 \log(K_n) / \big(p_n (\tilde{\theta}_{p_n,n}(\tilde{k}_{p_n,n}) - \theta_0)\big) \xrightarrow{P} 0$, and

$$\frac{\tilde{\theta}_{p_n,n}(\tilde{k}_{p_n,n}) - \theta_0}{\tilde{\theta}_{n}(\tilde{k}_{n}) - \theta_0} \xrightarrow{P} 1. \tag{11}$$

Under these assumptions, it follows that

$$\frac{\tilde{\theta}_{p_n,n}(\hat{k}_{p_n,n}) - \theta_0}{\tilde{\theta}_{n}(\tilde{k}_{n}) - \theta_0} \xrightarrow{P} 1. \tag{12}$$

The proof is provided in Section S2; it follows immediately from the asymptotic results of Theorem 1. A sufficient condition for Equation (11) is that there exists a $\gamma > 0$ such that

$$\left(n^{\gamma}\big(\tilde{\theta}_{n}(\tilde{k}_{n}) - \theta_0\big),\; (n(1 - p_n))^{\gamma}\big(\tilde{\theta}_{p_n,n}(\tilde{k}_{p_n,n}) - \theta_0\big)\right) \xrightarrow{d} (Z, Z), \tag{13}$$

for a single random variable $Z$ with $P(Z > a) = 1$ for some $a > 0$. For single-split validation, where $P(B_n = b) = 1$ for some $b \in \{0, 1\}^n$, it suffices to assume that there exists a $\gamma > 0$ such that $n^{\gamma}(\tilde{\theta}_{n}(\tilde{k}_{n}) - \theta_0) \xrightarrow{d} Z$ for a random variable $Z$ with $P(Z > a) = 1$ for some $a > 0$.

We discuss next the assumptions and results of Theorem 1. Only a few sufficient conditions must be met to provide finite-sample bounds on the expected risk difference of the estimator selected via our CV procedure: first, that the random vector $X$ be bounded and, second, that the entries of all covariance matrices in the parameter space be bounded. Together, these assumptions allow for the definition of $M(J)$, the object permitting the extension of the loss-based estimation framework to the high-dimensional covariance matrix estimation problem. That is, $M(J)$ and its use in $c(\delta, M(J))$ allow for the finite-sample bound to depend on $J$. Then, studying the second term on the right-hand side of Equation (8), we can gain intuition about the procedure's asymptotic behavior as it relates to $n$ and $J$.

We find that, for any fixed $K_n$, $p_n$, and $\delta > 0$, the second term in the upper bound of Equation (8) diverges in $n$ and $J$ when $O(J) > O(1/\eta_0)$. This follows directly from the assumption that $J/n \to m < \infty$ as $J, n \to \infty$. However, $O(J) \le O(1/\eta_0)$ since, by definition, $\boldsymbol{\Psi}$ admits covariance matrices no sparser than a diagonal matrix. We therefore capture the desired asymptotic results of Equations (9), (10), and (12) if $O(J) < O(1/\eta_0)$, assuming their additional corresponding assumptions are satisfied. When $O(J) = O(1/\eta_0)$, $c(\delta, M(J))(1 + \log K_n)/(n p_n)$ converges to a constant as $n, J \to \infty$, resulting in a more pessimistic bound. Nevertheless, we provide empirical evidence in the following section that the desired optimality results are readily attained in this situation.

As we have alluded to previously, the scaling factor employed by the loss function serves only to characterize the asymptotic behavior of our selection procedure. Studying the ratios in Equations (9), (10), and (12), it is clear that the conditional risks' scaling factors cancel. These high-dimensional asymptotic results are therefore applicable, under the same conditions and assumptions, to a practicable, non-scaled version of the Frobenius loss.

We also highlight that these asymptotic properties are independent of the collection of candidate estimators considered. Hence, these results apply to bespoke estimation procedures that consider only one class of estimators indexed by hyperparameters.
In this sense, the proposed estimator selection method is a generalization of existing frameworks.

Having provided a formal description of the procedure's asymptotic behavior, what remains is to provide empirical evidence that these results may be approximately achieved under a variety of covariance matrix structures and estimators, and for practical values of $n$ and $J$. This is explored via simulation study in Section 4.

3.3 Software Implementation

The proposed cross-validated covariance matrix estimator selection procedure has been implemented in cvCovEst, a free and open-source package for the R language and environment for statistical computing [R Core Team, 2021]. Internally, the cvCovEst package relies upon the generalized CV framework of the origami R package [Coyle and Hejazi, 2018]. A non-scaled version of the Frobenius loss (as in Equation (6)) is used to compute the cross-validated risks, though a few other choices are provided as well. In particular, users may opt for the matrix-based Frobenius loss of Equation (S10), which has the benefit of carrying a much-reduced computational burden, to select the asymptotically optimal estimator. We discuss this loss function in more detail in Section S3, and show that its selection is also a minimizer of the (scaled) Frobenius loss's risk.

Results for the simulation experiments and real data examples, presented in Sections 4 and 5, respectively, were produced using version 0.1.3 of the R package. The code and data used to produce these analyses are publicly available in a separate GitHub repository at https://github.com/PhilBoileau/pub_cvCovEst. The cvCovEst package is available on the Comprehensive R Archive Network (CRAN) at https://CRAN.R-project.org/package=cvCovEst.

4 Simulation Study

4.1 Simulation Design

To verify the theoretical results of our cross-validated estimator selection procedure, we conducted a series of simulation experiments using prominent covariance models appearing commonly in the literature. These models are described below.
Model 1:
A dense covariance matrix in which all diagonal entries share a common value and all off-diagonal entries share a smaller, non-zero common value.

Model 2:
An AR(1) model, where $\psi^{(jl)} = \rho^{|j-l|}$ for a fixed autocorrelation $\rho \in (0, 1)$. This covariance matrix, corresponding to a common time-series model, is approximately sparse for large $J$, since the off-diagonal elements quickly shrink to zero.

Model 3:
An MA(1) model, where $\psi^{(jl)} = \rho^{|j-l|} \cdot I(|j - l| \le 1)$ for a fixed $\rho \in (0, 1)$. This covariance matrix, corresponding to another common time-series model, is truly sparse: only the diagonal, subdiagonal, and superdiagonal contain non-zero elements.

Model 4:
An MA(2) model, where the diagonal entries of $\psi$ share a common value, the entries with $|j - l| = 1$ and $|j - l| = 2$ take two further non-zero values, and $\psi^{(jl)} = 0$ otherwise. This time-series model is similar to Model 3, but slightly less sparse.
Model 5:
A random covariance matrix model. First, a $J \times J$ random matrix whose elements are i.i.d. Uniform$(0, 1)$ is generated. Next, its entries are discretized according to two cutoffs: entries below the lower cutoff are set to zero, entries between the two cutoffs are set to a common negative value, and the remaining entries are set to a common positive value. The square of this matrix is then computed and added to an $I_{J \times J}$ matrix. Finally, the corresponding correlation matrix is computed and used as the model's covariance matrix.

Model 6:
A Toeplitz covariance matrix whose diagonal entries share a common value and whose off-diagonal entries decay polynomially in $|j - l|$. Like the AR(1) model, this covariance matrix is approximately sparse for large $J$. However, the off-diagonal entries decay less quickly as their distance from the diagonal increases.

Model 7:
A Toeplitz covariance matrix with alternating signs, in which each off-diagonal entry equals $(-1)^{|j-l|}$ times the corresponding entry of Model 6. This model is almost identical to Model 6, though the signs of the covariance matrix's entries alternate.

Model 8:
A covariance matrix inspired by the latent variable model described in Equation (S3). Let $\beta_{J \times 3} = (\beta_1, \ldots, \beta_J)^\top$, where the $\beta_j$ were randomly generated from a $N(0, I_{3 \times 3})$ distribution for $j = 1, \ldots, J$. Then $\psi = \beta\beta^\top + I_{J \times J}$ is the covariance matrix of a model with three latent factors.

Each covariance model was used to generate datasets of i.i.d. multivariate Gaussian, mean-zero observations at four sample sizes $n$. For each model and sample size, five data-dimension ratios $J/n$ were considered, ranging from below one up to $J/n = 5$. Together, the eight covariance models, four sample sizes, and five dimensionality ratios result in 160 distinct simulation settings. For each such simulation setting, the performance of the cross-validated selector with respect to the various oracle selections and several well-established estimators is evaluated based on aggregation across the Monte Carlo repetitions. A brief sketch illustrating how a few of these covariance models may be generated is given below.

We applied our estimator selection procedure, which we refer to as cvCovEst, using a 5-fold CV scheme to each simulated dataset to compare the candidate estimators' performances. The library of candidate estimators is provided in Table 1, and includes details on these estimators' possible hyperparameters. The considered classes of covariance matrix estimators are briefly introduced in Section S1 of the supplement.
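As a complement to the descriptions above, the following sketch generates a few of the covariance models (the AR(1), MA(1), and latent-factor models). The function names are illustrative, and the decay parameter is left as an argument rather than fixed to the study's specific value.

```r
# Illustrative generators for a few of the covariance models above.
ar1_cov <- function(J, rho) {
  # Model 2: AR(1) covariance, psi_jl = rho^|j - l|
  rho^abs(outer(seq_len(J), seq_len(J), "-"))
}

ma1_cov <- function(J, rho) {
  # Model 3: MA(1) covariance, non-zero only on the main, sub-, and superdiagonals
  ar1_cov(J, rho) * (abs(outer(seq_len(J), seq_len(J), "-")) <= 1)
}

factor_cov <- function(J, n_factors = 3) {
  # Model 8: latent factor covariance, psi = beta beta^T + I
  beta <- matrix(rnorm(J * n_factors), nrow = J)
  tcrossprod(beta) + diag(J)
}

# Example: draw n mean-zero multivariate Gaussian observations from one model
# (assumes the MASS package is available).
# X <- MASS::mvrnorm(n = 100, mu = rep(0, 50), Sigma = ar1_cov(50, 0.7))
```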
Table 1: Families of candidate estimators used by cvCovEst in the simulation study.
Estimator: Hyperparameters
Sample covariance matrix: not applicable
Hard thresholding [Bickel and Levina, 2008b]: equally spaced grid of thresholds
SCAD thresholding [Fan and Li, 2001, Rothman et al., 2009]: equally spaced grid of thresholds
Adaptive LASSO [Rothman et al., 2009]: equally spaced grids of thresholds and exponential weights
Banding [Bickel and Levina, 2008a]: grid of band widths
Tapering [Cai et al., 2010]: grid of band widths
Linear shrinkage [Ledoit and Wolf, 2004]: not applicable
Dense linear shrinkage [Schäfer and Strimmer, 2005]: not applicable
Nonlinear shrinkage [Ledoit and Wolf, 2018]: not applicable
POET [Fan et al., 2013] using hard thresholding: grid of numbers of latent factors; grid of thresholds

4.2 Simulation Study Results

Figure 1: Comparison of the cross-validated and cross-validated oracle selections' mean cross-validated conditional risk differences. In all cases, the mean risk difference ratios rapidly converge to one, even when the true data-generating process possesses a dense covariance matrix. Note the differing y-axis scales for the different models.
To examine empirically the optimality results of Theorem 1, we computed analytically, for each replication, the cross-validated conditional risk differences of the cross-validated selection, $\hat{k}_{p_n,n}$, and the cross-validated oracle selection, $\tilde{k}_{p_n,n}$, using as candidates the estimators in Table 1. We note that no penalty is attributed to estimators generating rank-deficient estimates, like the sample covariance matrix when $J > n$, though this limitation is generally of practical importance. When the situation dictates that the resulting estimate must be positive-definite, the library of candidates should be assembled accordingly.

From the cross-validated conditional risk differences, the Monte Carlo expectations of the risk differences stratified by $n$, $J/n$, and the covariance model were computed. The ratios of the expected risk differences are presented in Figure 1. These results make clear that, for virtually all models considered, the estimator chosen by our proposed cross-validated selection procedure has a risk difference asymptotically identical on average to that of the cross-validated oracle.

Figure 2: Comparison of the cross-validated and oracle selections' full-dataset conditional risk differences. Asymptotic optimality is nearly achieved for small to moderate sample sizes and numbers of features in all considered models. Note the differing y-axis scales for the different models.

Figure 3: Comparison of competing, bespoke covariance matrix estimation procedures to our cross-validated selection approach in terms of the Monte Carlo mean Frobenius norm under a variety of data-generating processes. Note that the scales of the y-axis are tailored to the covariance matrix model.

A stronger result, corresponding to Equation (10) of Theorem 1, is presented in Figure S1. For all but Models 1 and 8, we find that our algorithm's selection is virtually equivalent to the cross-validated oracle selection for all but the smallest sample sizes and dimensionality ratios considered. Even for Model 8, in which the covariance matrices are more difficult to estimate due to their dense structures, we find that our selector identifies the optimal estimator with probability tending to one for the larger sample sizes and $J/n = 5$.

More impressive still are the results presented in Figure 2, which characterize the full-dataset conditional risk difference ratios. For all covariance matrix models considered, with the exception of Model 1, our procedure's selections attain near asymptotic optimality for moderate values of $n$ and $J/n$. This suggests that our loss-based estimator selection approach's theoretical guarantee, as outlined in Corollary 1, is achievable in many practical settings.

Although these simulation study results are promising, we emphasize that optimality in estimator selection should not be confused with consistency. Consistently estimating a covariance matrix becomes considerably more difficult in high-dimensional regimes, especially if no assumptions about its structure can be made. Intuitively, accurately estimating the $J(J+1)/2$ unique parameters of a dense covariance matrix using $n \times J$ data points is infeasible [Ledoit and Wolf, 2004]. However, if there is structure, the number of parameters to estimate relative to $J$ becomes more manageable asymptotically, especially if an estimator (or family of estimators) tailored to that structure is included in the collection of candidates.
Regardless, an optimal selection does not imply anything about the chosen estimator's other asymptotic properties – at least not without relying upon additional assumptions.

Now, in addition to verifying our method's asymptotic behavior, we compared its estimates against those of competing procedures. The candidate estimators were individually applied to the simulated datasets using our CV procedure to select hyperparameters where necessary, which, as stated previously, is equivalent to the family-specific estimation procedures. The hyperparameters considered are provided in Table S1. Where appropriate, the competing estimators' hyperparameters are more varied than those in Table 1 (used by cvCovEst), reflecting more aggressive estimation procedures that one might employ when only using a single family of estimators. We then computed the Frobenius norm of the difference between every estimate and the corresponding true covariance matrix. The mean norms over all simulations were then computed for each covariance matrix estimation procedure, again stratified by $n$, $J/n$, and the covariance matrix model (Figure 3).

Altogether, the results of these simulation experiments speak plainly — our proposed cross-validated loss-based covariance matrix estimator selection procedure should be employed when estimating high-dimensional covariance matrices. Significant gains are expected relative to even informed (but, at times, arbitrary) choices among popular or state-of-the-art estimators. In particular, an estimator selected by our procedure is guaranteed to perform no worse than the best estimator contained in the specified library of candidates, while avoiding the need for subjective decisions regarding estimator families or grids of hyperparameters.
5 Application to Single-Cell Transcriptome Sequencing Data

Single-cell transcriptome sequencing (scRNA-seq) has taken the biological sciences by storm, allowing researchers to study the gene expression profiles of individual cells. The fine-grained transcriptomic data that it provides have been used to identify rare cell populations and to elucidate the developmental relationships between diverse cellular states. These developments will likely lead to new and exciting clinical applications in the near future [Haque et al., 2017].

Given that a typical scRNA-seq dataset possesses tens of thousands of features (i.e., genes), most workflows include a dimensionality reduction step following data preprocessing. In addition to reducing the amount of computational resources needed to analyze the data, reducing dimensionality often has the effect of mitigating the noise corrupting the biological signal of interest. The lower-dimensional embedding may then subsequently be used for downstream analyses, such as novel cell-type discovery via clustering.

One of the most popular methods used to reduce the dimensionality of scRNA-seq data is uniform manifold approximation and projection (UMAP) [McInnes et al., 2018]. This method captures the most salient non-linear relationships among a high-dimensional dataset's observations and projects them into a reduced-dimensional space. Instead of applying UMAP directly, the scRNA-seq dataset's leading principal components (PCs) are often used as an initialization. This initial dimensionality reduction by PCA is believed to play a helpful role in denoising.

However, PCA typically relies on the sample covariance matrix, and so, when the dataset is high-dimensional, the resulting principal components are known to be poor estimates of those of the population [Johnstone and Lu, 2009]. We hence posit that our cross-validated estimator selection procedure could form a basis for an improved PCA. That is, we hope that the eigenvectors resulting from the eigendecomposition of our estimated covariance matrix could be used to generate a set of estimates closer to the true PCs in terms of risk. These PCs could then be used as initializations for UMAP, giving rise to low-dimensional embeddings that better reflect a dataset's biological signal. Indeed, additional simulation results provided in Section S6 suggest that cvCovEst estimates the leading eigenvalue at least as well as the sample covariance matrix does, in terms of the spectral norm.

To verify this conjecture, we applied our procedure to two scRNA-seq datasets for which the cell types are known a priori. These data were obtained from the scRNAseq
Bioconductor R package [Risso and Cole, 2020], and prepared for analysis using a workflow outlined in Amezquita et al. [2020]. A 5-fold CV scheme was used; the library of candidate estimators is provided in Table S2. We expect that cells of the same class will form tight, distinct clusters within the low-dimensional representations. The resulting embeddings, which we refer to as the cvCovEst-based embeddings, were then compared to those produced by UMAP using traditional PCA for initialization, which we refer to as the PCA-based embeddings. For each embedding, the 20 leading PCs were fed to UMAP. The first dataset is a collection of 285 mouse visual cortex cells [Tasic et al., 2016], and the second dataset consists of 2,816 mouse brain cells [Zeisel et al., 2015]. The ratios of features to observations, 3.51 and 0.36 for the Tasic et al. [2016] and Zeisel et al. [2015] datasets, respectively, fall closely in line with those considered in the simulation study presented in Section 4. The 1,000 most variable genes of each dataset were used to compute the PCs of both embeddings.

Figure 4: Comparison of the scRNA-seq datasets' UMAP embeddings based on vanilla PCA or PCA with the cross-validated selection's covariance matrix estimate, for the 1,000 most variable genes. The datasets consist of (A)
285 cells collected from the visual cortex of mice and (B) 2,816 cells collected from mouse brains.
The resulting UMAP plots are presented in Figure 4. Though the two embeddings generated for each dataset are qualitatively similar, the low-dimensional representations relying on our loss-based approach are refined. In Figure 4A, there are a number of cells that are erroneously clustered with the wrong cell types in the PCA-based embedding, but not in the cvCovEst-based embedding. This is further captured by the 41% increase in average silhouette width of our method relative to the traditional approach. The improvements are less pronounced in Figure 4B: some cells are more densely clustered in the cvCovEst-based embedding and there are fewer mis-clustered cells, but the average silhouette width is only marginally better.

That the relative improvement of the cvCovEst-based approach over the traditional procedure is more striking in the Tasic et al. [2016] dataset should not be surprising. The estimation problem is truly high-dimensional, whereas this is not the case for the Zeisel et al. [2015] dataset. As the ratio of the number of features to observations decreases, there is simply less to gain from the use of estimators other than the sample covariance matrix. Ideally, data-adaptive selection procedures should be cognizant of this. Indeed, cvCovEst, when applied to the Zeisel et al. [2015] dataset, selects an estimator whose cross-validated empirical risk is only slightly smaller than that of the sample covariance matrix, and whose leading PCs are virtually identical. The diagnostics of the datasets' corresponding covariance matrix estimation procedures are presented in Figures S2 and S3 to provide further intuition.

These results show that our loss-based covariance matrix estimation procedure can improve the resolution of scRNA-seq data analyses while promoting objectivity. We expect that similar outcomes are possible when including our method in other procedures used to analyze high-dimensional data generated by similar high-throughput biological sequencing experiments.
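A minimal sketch of the embedding pipeline used in this section follows, under the assumption that a covariance matrix estimate `psi_hat` has already been selected (for example, by the CV procedure described earlier). The UMAP call assumes the uwot R package and is shown for illustration only; it is not part of cvCovEst's documented interface.

```r
# Given centered expression data X (cells x genes) and a selected covariance
# estimate psi_hat, compute leading principal components by eigendecomposing
# psi_hat and projecting the data onto its leading eigenvectors; the resulting
# scores can then be fed to UMAP in place of those from the sample covariance.
covariance_pcs <- function(X, psi_hat, n_pcs = 20) {
  eig <- eigen(psi_hat, symmetric = TRUE)        # eigenvectors ordered by eigenvalue
  scale(X, center = TRUE, scale = FALSE) %*% eig$vectors[, seq_len(n_pcs)]
}

# pcs <- covariance_pcs(X, psi_hat, n_pcs = 20)
# embedding <- uwot::umap(pcs)                   # 2-D embedding of the leading PCs
```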
6 Discussion

This work extends Dudoit and van der Laan [2003]'s framework for asymptotically optimal, data-adaptive estimator selection to the problem of covariance matrix estimation in high-dimensional settings. We provided sufficient conditions under which our cross-validated procedure is asymptotically optimal in terms of risk and demonstrated its applicability to a wide range of data-generating processes and estimators, in both comprehensive simulation experiments and two real-world data analysis examples.

The simulation study provides evidence that near-optimal results are achieved for relatively modest dataset dimensions, even in certain situations where the underlying covariance matrix is dense. These results also demonstrate that our cross-validated procedure performs as well as the best bespoke estimation procedure in a variety of settings, providing for the first time the means to objectively estimate and evaluate high-dimensional covariance matrices. Practitioners need no longer rely upon intuition alone when deciding which candidate estimator is best from among a library of diverse estimators. Our scRNA-seq data examples further demonstrate the utility of this method in fields where high-dimensional data are collected routinely. We expect that a variety of analytical tasks relying upon the accurate estimation of the covariance matrix beyond the exploratory analyses considered here, like clustering, hypothesis testing, and latent variable estimation, stand to benefit from the application of the proposed framework.

Finally, the finite-sample bounds and high-dimensional asymptotic results outlined in Theorem 1 require but minor modifications to be employed in the estimation of other high-dimensional, multivariate parameters, such as the precision matrix. The groundwork for the derivation of finite-sample bounds for other loss functions has also been laid. These potential extensions of the proposed procedure offer salient avenues of future research.
Acknowledgements
We thank Brian Collica and Jamarcus Liu for significant contributions to the development of the cvCovEst R package.
References
R. A. Amezquita, A. T. L. Lun, E. Becht, V. J. Carey, L. N. Carpp, L. Geistlinger, F. Marini, K. Rue-Albrecht, D. Risso, C. Soneson, L. Waldron, H. Pagès, M. L. Smith, W. Huber, M. Morgan, R. Gottardo, and S. C. Hicks. Orchestrating single-cell analysis with Bioconductor. Nature Methods, 17(2):137–145, 2020. doi: 10.1038/s41592-019-0654-x.
J. Bai. Inferential theory for factor models of large dimensions. Econometrica, 71(1):135–171, 2003. doi: 10.1111/1468-0262.00392.
G. Bennett. Probability inequalities for the sum of independent random variables. Journal of the American Statistical Association, 57(297):33–45, 1962.
S. N. Bernstein. The Theory of Probabilities. 1946.
P. J. Bickel and E. Levina. Regularized estimation of large covariance matrices. Annals of Statistics, 36(1):199–227, 2008a.
P. J. Bickel and E. Levina. Covariance regularization by thresholding. Annals of Statistics, 36(6):2577–2604, 2008b.
L. Breiman and P. Spector. Submodel selection and evaluation in regression. The X-random case. International Statistical Review, pages 291–319, 1992.
T. Cai and W. Liu. Adaptive thresholding for sparse covariance matrix estimation. Journal of the American Statistical Association, 106(494):672–684, 2011. doi: 10.1198/jasa.2011.tm10560.
T. T. Cai, C.-H. Zhang, and H. H. Zhou. Optimal rates of convergence for covariance matrix estimation. Annals of Statistics, 38(4):2118–2144, 2010. doi: 10.1214/09-AOS752.
J. Coyle and N. Hejazi. origami: A generalized framework for cross-validation in R. Journal of Open Source Software, 3(21):512, 2018. doi: 10.21105/joss.00512.
S. Dudoit and M. J. van der Laan. Asymptotics of cross-validated risk estimation in estimator selection and performance assessment. Working Paper 126, University of California, Berkeley, 2003. URL https://biostats.bepress.com/ucbbiostat/paper126.
S. Dudoit and M. J. van der Laan. Asymptotics of cross-validated risk estimation in estimator selection and performance assessment. Statistical Methodology, 2(2):131–154, 2005.
B. Efron. Large-Scale Inference: Empirical Bayes Methods for Estimation, Testing, and Prediction. Cambridge University Press, 2012.
J. Fan and R. Li. Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association, 96(456):1348–1360, 2001. doi: 10.1198/016214501753382273.
J. Fan, Y. Fan, and J. Lv. High dimensional covariance matrix estimation using a factor model. Journal of Econometrics, 147(1):186–197, 2008. doi: 10.1016/j.jeconom.2008.09.017.
J. Fan, Y. Liao, and M. Mincheva. Large covariance estimation by thresholding principal orthogonal complements. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 75(4):603–680, 2013. doi: 10.1111/rssb.12016.
J. Fan, Y. Liao, and W. Wang. Projected principal component analysis in factor models. Annals of Statistics, 44(1):219–254, 2016. doi: 10.1214/15-AOS1364.
J. Fan, W. Wang, and Y. Zhong. Robust covariance estimation for approximate factor models. Journal of Econometrics, 208(1):5–22, 2019. doi: 10.1016/j.jeconom.2018.09.003.
Y. Fang, B. Wang, and Y. Feng. Tuning-parameter selection in regularized estimations of large covariance matrices. Journal of Statistical Computation and Simulation, 86(3):494–509, 2016. doi: 10.1080/00949655.2015.1017823.
J. Friedman, T. Hastie, and R. Tibshirani. The Elements of Statistical Learning, volume 1. Springer Series in Statistics, New York, 2001.
G. H. Golub and C. F. Van Loan. Matrix Computations. Johns Hopkins University Press, 3rd edition, 1996.
A. Haque, J. Engel, S. A. Teichmann, and T. Lönnberg. A practical guide to single-cell RNA-sequencing for biomedical research and clinical applications. Genome Medicine, 9(1):75, 2017. doi: 10.1186/s13073-017-0467-4.
I. M. Johnstone. On the distribution of the largest eigenvalue in principal components analysis. Annals of Statistics, pages 295–327, 2001.
I. M. Johnstone and A. Y. Lu. On consistency and sparsity for principal components analysis in high dimensions. Journal of the American Statistical Association, 104(486):682–693, 2009. doi: 10.1198/jasa.2009.0121.
C. Lam and J. Fan. Sparsistency and rates of convergence in large covariance matrix estimation. Annals of Statistics, 37(6B):4254–4278, 2009. doi: 10.1214/09-AOS720.
O. Ledoit and M. Wolf. A well-conditioned estimator for large-dimensional covariance matrices. Journal of Multivariate Analysis, 88(2):365–411, 2004. doi: 10.1016/S0047-259X(03)00096-4.
O. Ledoit and M. Wolf. Nonlinear shrinkage estimation of large-dimensional covariance matrices. Annals of Statistics, 40(2):1024–1060, 2012. doi: 10.1214/12-AOS989.
O. Ledoit and M. Wolf. Spectrum estimation: A unified framework for covariance matrix estimation and PCA in large dimensions. Journal of Multivariate Analysis, 139:360–384, 2015. doi: 10.1016/j.jmva.2015.04.006.
O. Ledoit and M. Wolf. Analytical nonlinear shrinkage of large-dimensional covariance matrices. ECON Working Paper 264, Department of Economics, University of Zurich, 2018. URL https://EconPapers.repec.org/RePEc:zur:econwp:264.
L. McInnes, J. Healy, and J. Melville. UMAP: Uniform manifold approximation and projection for dimension reduction. 2018. URL http://arxiv.org/abs/1802.03426.
A. Onatski. Asymptotics of the principal components estimator of large factor models with weakly influential factors. Journal of Econometrics, 168(2):244–258, 2012. doi: 10.1016/j.jeconom.2012.01.034.
M. Pourahmadi. High-Dimensional Covariance Estimation. Wiley Series in Probability and Statistics. Wiley, 2013.
R Core Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria, 2021.
D. Risso and M. Cole. scRNAseq: Collection of Public Single-Cell RNA-Seq Datasets, 2020. R package version 2.3.17.
H. Robbins. The empirical Bayes approach to statistical decision problems. Annals of Mathematical Statistics, 35(1):1–20, 1964.
A. J. Rothman, E. Levina, and J. Zhu. Generalized thresholding of large covariance matrices. Journal of the American Statistical Association, 104(485):177–186, 2009. doi: 10.1198/jasa.2009.0101.
J. Schäfer and K. Strimmer. A shrinkage approach to large-scale covariance matrix estimation and implications for functional genomics. Statistical Applications in Genetics and Molecular Biology, 4(1), 2005. doi: 10.2202/1544-6115.1175.
J. H. Stock and M. W. Watson. Forecasting using principal components from a large number of predictors. Journal of the American Statistical Association, 97(460):1167–1179, 2002.
B. Tasic, V. Menon, T. N. Nguyen, T. K. Kim, T. Jarsky, Z. Yao, B. Levi, L. T. Gray, S. A. Sorensen, T. Dolbeare, D. Bertagnolli, J. Goldy, N. Shapovalova, S. Parry, C. Lee, K. Smith, A. Bernard, L. Madisen, S. M. Sunkin, M. Hawrylycz, C. Koch, and H. Zeng. Adult mouse cortical cell taxonomy revealed by single cell transcriptomics. Nature Neuroscience, 19(2):335–346, 2016. doi: 10.1038/nn.4216.
M. J. van der Laan and S. Dudoit. Unified cross-validation methodology for selection among estimators and a general cross-validated adaptive epsilon-net estimator: Finite sample oracle inequalities and examples. Working Paper 130, University of California, Berkeley, 2003. URL https://biostats.bepress.com/ucbbiostat/paper130/.
A. W. van der Vaart, S. Dudoit, and M. J. van der Laan. Oracle inequalities for multi-fold cross validation. Statistics and Decisions, 24:351–371, 2006.
M. J. Wainwright. High-Dimensional Statistics: A Non-Asymptotic Viewpoint. Cambridge University Press, 2019.
A. Zeisel, A. B. Muñoz-Manchado, S. Codeluppi, P. Lönnerberg, G. La Manno, A. Juréus, S. Marques, H. Munguba, L. He, C. Betsholtz, C. Rolny, G. Castelo-Branco, J. Hjerling-Leffler, and S. Linnarsson. Cell types in the mouse cortex and hippocampus revealed by single-cell RNA-seq. Science, 347(6226):1138–1142, 2015. doi: 10.1126/science.aaa1934.

Supplementary Materials

S1 Candidate Covariance Matrix Estimators
Using the proposed cross-validated selection procedure effectively requires a large and diverse set of candidate covariance matrix estimators. In this spirit, we provide in the sequel a brief overview of select covariance matrix estimators that have proven to be effective in a variety of settings. Note, however, that the proposed selection framework need not be limited to those described here. Thorough reviews of estimators have been provided by Pourahmadi [2013], Fan et al. [2016], and Ledoit and Wolf [2020], to name only a few.
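In practice, a candidate library can be represented programmatically as a collection of functions, each mapping a data matrix to a covariance matrix estimate. The following R sketch illustrates one such organization; the wrapper names and fixed hyperparameter values are purely illustrative assumptions and do not correspond to the cvCovEst interface.

```r
# A minimal sketch (not the cvCovEst API): a candidate library is a named list
# of functions, each mapping an n x J data matrix to a J x J covariance estimate.
candidate_library <- list(
  sample      = function(x) cov(x),
  diagonal    = function(x) diag(diag(cov(x))),
  shrink_half = function(x) {
    # fixed-intensity shrinkage toward a scaled identity target
    s <- cov(x)
    0.5 * s + 0.5 * mean(diag(s)) * diag(ncol(x))
  }
)
```

Each list entry, and each distinct hyperparameter value within an estimator class, constitutes one candidate to be compared via the cross-validated risk.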
S1.1 Thresholding Estimators
An often natural, simplifying assumption about the true covariance matrix's structure is that it is sparse, that is, a non-negligible portion of its off-diagonal elements have a value near zero or equal to zero. This assumption is not altogether unreasonable: given a system of numerous variables, it seems unlikely in many settings that a majority of these variables would depend on one another.

The class of generalized thresholding estimators [Bickel and Levina, 2008b, Rothman et al., 2009, Cai and Liu, 2011] is one such collection of covariance matrix estimators based upon this structural assumption. Given the sample covariance matrix $S_n$, the generalized thresholding estimator is defined as
$$\hat{\Psi}_{\text{gt}}(P_n; s) := \left\{ s\left(S_n^{(jl)}\right) : j, l = 1, \ldots, J \right\}, \tag{S1}$$
where $s(\cdot)$ is a thresholding function that often requires one or more hyperparameters dictating the amount of regularization. The hard, soft, smoothly clipped absolute deviation (SCAD) [Fan and Li, 2001], and adaptive LASSO [Rothman et al., 2009] functions are among the collection of suitable thresholding operators. Cai and Liu [2011] also demonstrated that element-specific thresholding functions might be useful when the features' variances are highly variable. The hyperparameters for any specific thresholding function are often selected via CV. Regardless of the choice of $s$, these estimators preserve the symmetry of the sample covariance matrix and are invariant under permutations of the features' order.

Bickel and Levina [2008b] and Rothman et al. [2009] have shown that these estimators are consistent under the spectral norm, which is defined as a square matrix's largest absolute eigenvalue, assuming that $\log J / n \to 0$, that the observations' marginal distributions satisfy a tail condition, and that the true covariance matrix is a member of the class of matrices satisfying a particular notion of "approximate sparsity." Cai and Liu [2011] have derived similar results for their entry-specific thresholding estimator over an even broader parameter space of sparse covariance matrices.
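As a concrete illustration, the R sketch below applies hard and soft thresholding functions to a sample covariance matrix. The threshold value and data dimensions are illustrative assumptions only; in practice the threshold would be chosen by cross-validation.

```r
# A minimal sketch of generalized thresholding (Equation S1). The thresholding
# functions operate element-wise on the sample covariance matrix.
hard_threshold <- function(z, t) z * (abs(z) >= t)
soft_threshold <- function(z, t) sign(z) * pmax(abs(z) - t, 0)

threshold_cov <- function(x, t, s_fun = hard_threshold) {
  s <- cov(x)
  out <- s_fun(s, t)
  diag(out) <- diag(s)  # leave the variances unthresholded
  out
}

set.seed(1)
x <- matrix(rnorm(50 * 100), nrow = 50, ncol = 100)  # n = 50 < J = 100
psi_hat <- threshold_cov(x, t = 0.25)
```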
S1.2 Banding and Tapering Estimators

A family of estimators related to thresholding estimators are banding and tapering estimators [Bickel and Levina, 2008a, Cai et al., 2010]. Like thresholding estimators, these estimators rely on the assumption that the true covariance matrix is sparse; however, their structural assumption on $\psi_0$ is much more rigid than that of thresholding estimators. Specifically, such estimators assume that the true covariance matrix is a band matrix, that is, a sparse matrix whose non-zero entries are concentrated about the diagonal. These estimators therefore require a natural ordering of the observations' features, operating under the assumption that "distant" variables are uncorrelated. Such structure is often encountered in longitudinal and time-series data.

Given the sample covariance matrix $S_n$, the banding estimator of Bickel and Levina [2008a] is defined as
$$\hat{\Psi}_{\text{band}}(P_n; b) := \left\{ S_n^{(jl)}\, \mathbb{I}(|j - l| \leq b) : j, l = 1, \ldots, J \right\},$$
where $b$ is a hyperparameter that determines the number of bands to retain from the sample covariance matrix and is chosen via a CV procedure. For the class of "bandable" covariance matrices, i.e., the set of well-conditioned matrices whose elements not in the central bands of the matrix decay rapidly, this banding estimator has been shown to be uniformly consistent in the spectral norm so long as $\log(J)/n \to 0$.

The tapering estimator of Cai et al. [2010] is the smooth generalization of the banding estimator, gradually shrinking the off-diagonal bands of the sample covariance matrix towards zero. It is defined as
$$\hat{\Psi}_{\text{tap}}(P_n; b) := W_b \circ S_n,$$
where $W_b$ is a matrix of weights and "$\circ$" denotes the Hadamard (element-wise) matrix product. Clearly, letting $W_b^{(jl)} = \mathbb{I}(|j - l| \leq b)$ for some positive integer $b$ results in the banding estimator. A popular weighting scheme derived by Cai et al. [2010] is
$$W_b^{(jl)} := \begin{cases} 1, & \text{when } |j - l| \leq \frac{b}{2}, \\[2pt] 2 - \dfrac{|j - l|}{b/2}, & \text{when } \frac{b}{2} < |j - l| \leq b, \\[2pt] 0, & \text{otherwise}, \end{cases}$$
which we use in our simulation study presented in Section 4. Cai et al. [2010] also derived the optimal rates of convergence for this estimator under the Frobenius and spectral norms, considering a class of bandable covariance matrices that is more general than that considered by Bickel and Levina [2008a]: the smallest eigenvalue of the covariance matrices in this class can take on a value of zero. However, this estimator does not improve upon the bounds set by the banding estimator.
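A short R sketch of both estimators under an illustrative bandwidth follows; the weights mirror the Cai et al. [2010]-style scheme displayed above, and the bandwidth value is an assumption made purely for illustration.

```r
# A minimal sketch of the banding and tapering estimators; the bandwidth b is
# illustrative and would ordinarily be selected by cross-validation.
band_cov <- function(x, b) {
  s <- cov(x)
  s * (abs(row(s) - col(s)) <= b)
}

taper_cov <- function(x, b) {
  s <- cov(x)
  d <- abs(row(s) - col(s))
  w <- pmin(pmax(2 - d / (b / 2), 0), 1)  # 1 near the diagonal, linear decay, then 0
  s * w
}

set.seed(1)
x <- matrix(rnorm(50 * 100), nrow = 50, ncol = 100)
psi_band  <- band_cov(x, b = 4)
psi_taper <- taper_cov(x, b = 4)
```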
S1.3 Shrinkage Estimators

We next consider the linear and non-linear shrinkage estimators inspired by Stein's work on empirical Bayesian methods. These estimators are rotation-equivariant, shrinking the eigenvalues of the sample covariance matrix towards a set of target values whilst leaving its eigenvectors untouched. In doing so, the resultant estimators are better-conditioned than the sample covariance matrix, in a manner guaranteeing that the resultant covariance matrix estimator be non-singular. Further, these estimators do not rely on sparsity assumptions about the true covariance matrix, setting them apart from those previously discussed.

One of the first shrinkage estimators, the linear shrinkage estimator of the sample covariance matrix, was proposed by Ledoit and Wolf [2004]. This estimator is defined as the convex combination of the sample covariance matrix and the identity matrix. Hence, it represents a compromise between $S_n$, an unbiased but highly variable estimator of $\psi_0$ in high dimensions, and $I_{J \times J}$, a woefully biased but fixed estimator. Ledoit and Wolf [2004] found that, under mild conditions, the asymptotically optimal shrinkage intensity with respect to the scaled (by $J$) Frobenius norm can be estimated consistently in high dimensions. This estimator is defined as
$$\hat{\Psi}_{\text{ls}}(P_n) := \frac{b_n^2}{d_n^2}\, m_n I + \frac{a_n^2}{d_n^2}\, S_n, \tag{S2}$$
for $m_n = \operatorname{tr}(S_n)/J$, $d_n^2 = \|S_n - m_n I\|_{F}^2 / J$, $\bar{b}_n^2 = n^{-2}\sum_i \|X_i X_i^{\top} - S_n\|_{F}^2 / J$, $b_n^2 = \min(\bar{b}_n^2, d_n^2)$, and $a_n^2 = d_n^2 - b_n^2$.

Bespoke shrinkage targets may be used in place of the identity. For example, one might consider a dense matrix target whose diagonal elements are the average of the sample covariance matrix's diagonal elements and whose off-diagonal elements are equal to the average of all the sample covariance matrix's off-diagonal elements. For the sake of brevity, discussion of such estimators is omitted, but examples are provided by, among others, Ledoit and Wolf [2003] and Schäfer and Strimmer [2005], particularly for use in financial economics and statistical genomics applications, respectively.

When assumptions about the true covariance matrix's structure are unfounded, it can become impossible to select an appropriate linear shrinkage target. Instead, one might consider generalizing these estimators to shrink the eigenvalues of the sample covariance matrix in a non-linear fashion, that is, an estimator that shrinks the sample covariance matrix's eigenvalues not by a common shrinkage factor (as with linear shrinkage estimators) but with shrinkage factors tailored to each sample eigenvalue. Like the aforementioned linear shrinkage estimators, such non-linear shrinkage estimators are rotation-equivariant and produce positive-definite estimates so long as the shrunken sample eigenvalues are positive. These estimators belong to a class initially introduced by Stein [1986] and have since seen a resurgence in the work of Ledoit and Wolf [2012, 2015]. More recently, Ledoit and Wolf [2018] derived an analytical non-linear shrinkage estimator that is asymptotically optimal in high dimensions and more computationally efficient than their previously formulated estimators.
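The R sketch below computes the quantities in Equation (S2) directly, under the simplifying assumption of mean-zero data so that the sample covariance is $X^{\top}X/n$; it is a sketch of the estimator's form rather than a reference implementation.

```r
# A minimal sketch of the Ledoit-Wolf linear shrinkage estimator (Equation S2).
lw_linear_shrinkage <- function(x) {
  n <- nrow(x); J <- ncol(x)
  s_n <- crossprod(x) / n                      # sample covariance under a zero mean
  m_n <- sum(diag(s_n)) / J                    # grand mean of the eigenvalues
  d2 <- sum((s_n - m_n * diag(J))^2) / J       # dispersion of S_n about m_n * I
  bbar2 <- sum(apply(x, 1, function(xi) sum((tcrossprod(xi) - s_n)^2) / J)) / n^2
  b2 <- min(bbar2, d2)
  a2 <- d2 - b2
  (b2 / d2) * m_n * diag(J) + (a2 / d2) * s_n  # convex combination of target and S_n
}

set.seed(1)
x <- matrix(rnorm(40 * 80), nrow = 40, ncol = 80)
psi_hat <- lw_linear_shrinkage(x)
```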
S1.4 Estimators Based on Factor Models

Covariance matrix estimators based on factor models form another broad family of estimators that do not assume sparsity of the true covariance matrix. Often encountered in econometrics and finance, these estimators utilize the operating assumption that the dataset's observations are functions of a few common, often latent, factors. The factor model can be described as follows:
$$X = \mu + \beta F + \epsilon, \tag{S3}$$
where $X_{J \times 1}$ represents a random observation, $\mu_{J \times 1}$ a mean vector, $\beta_{J \times L}$ a matrix of factor coefficients, $F_{L \times 1}$ a random vector of $L$ common factors, and $\epsilon_{J \times 1}$ a random error vector. Assuming that $F$ and $\epsilon$ are uncorrelated, the covariance matrix of $X$ is given by
$$\psi = \beta \operatorname{Cov}(F) \beta^{\top} + \psi_{\epsilon}, \tag{S4}$$
where $\psi_{\epsilon}$ is the covariance matrix of the random error.

For a review on estimating the covariance matrix in the presence of observable factors, see Fan et al. [2016]. We now briefly discuss the estimation of $\psi$ when the factors are unobservable. Notice that when $\psi_{\epsilon}$ is not assumed to be diagonal, the decomposition of $\psi$ in Equation (S4) is not identifiable for fixed $n$ and $J$, since the random vector $X$ is the only observed component of the model. By letting $J \to \infty$, and assuming that the eigenvalues of $\psi_{\epsilon}$ are uniformly bounded or grow at a slow rate relative to $J$ and that the eigenvalues of $(1/J)\beta^{\top}\beta$ are uniformly bounded away from zero and infinity, it can be shown that $\beta \operatorname{Cov}(F) \beta^{\top}$ is asymptotically identifiable [Cai et al., 2010]. It follows from these assumptions that the signal in the factors increases as the number of features increases, while the noise contributed by the error term remains constant. The eigenvalues associated with $\beta \operatorname{Cov}(F) \beta^{\top}$ therefore become easy to differentiate from those of $\psi_{\epsilon}$.

Now, even with $\beta \operatorname{Cov}(F) \beta^{\top}$ being asymptotically identifiable, $\beta$ and $F$ cannot be distinguished. As a solution, the following constraint is imposed on $F$ and $\beta$: $\operatorname{Cov}(F) = I_{L \times L}$. It then follows that
$$\psi = \beta\beta^{\top} + \psi_{\epsilon}. \tag{S5}$$
Under the additional assumption that the columns of $\beta$ be orthogonal, Fan et al. [2013] found that the leading $L$ eigenvalues of $\psi$ are spiked, meaning that they are bounded below by some constant [Johnstone, 2001], and grow at rate $O(J)$ as the dimension of $\psi$ increases. The remaining $J - L$ eigenvalues are then either bounded above or grow slowly. This implies that the latent factors and their loadings can be approximated via the eigenvalues and eigenvectors of $\psi$.

Fan et al. [2013] therefore proposed the Principal Orthogonal complEment Thresholding (POET) estimator of $\psi$, which was motivated by the spectral decomposition of the sample covariance matrix
$$S_n = \sum_{j=1}^{J} \lambda_j V_{\cdot,j} V_{\cdot,j}^{\top} \approx \sum_{j=1}^{L} \beta_{\cdot,j}\beta_{\cdot,j}^{\top} + \sum_{j=L+1}^{J} \lambda_j V_{\cdot,j} V_{\cdot,j}^{\top},$$
where $\lambda_j$ and $V_{\cdot,j}$ are the $j$th eigenvalue and eigenvector of $S_n$, respectively, and $\beta_{\cdot,j}$ is the $j$th column of $\beta$. For ease of notation, we denote $\sum_{j=L+1}^{J} \lambda_j V_{\cdot,j} V_{\cdot,j}^{\top}$ by $S_{\epsilon}$ and refer to this matrix as the principal orthogonal complement.
The estimator for $L$ latent variables is then defined as
$$\hat{\Psi}_{\text{POET}}(P_n; L, s) := \sum_{j=1}^{L} \lambda_j V_{\cdot,j} V_{\cdot,j}^{\top} + T_{\epsilon,s}, \tag{S6}$$
where $T_{\epsilon,s}$ is the generalized thresholding matrix of $S_{\epsilon}$,
$$T_{\epsilon,s}^{(jl)} := \begin{cases} S_{\epsilon}^{(jj)}, & \text{when } j = l, \\ s\left(S_{\epsilon}^{(jl)}\right), & \text{otherwise}, \end{cases}$$
for some generalized thresholding function $s$.

Although this estimator is computationally efficient, the assumptions encoding the factor-based model under which it is derived are such that the latent features' eigenvalues grow in $J$. This results in a poor convergence rate under the spectral norm [Fan et al., 2016].
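To make the construction concrete, the R sketch below computes a POET-style estimate on data simulated from an approximate two-factor model. The number of factors and the hard-threshold level are illustrative assumptions rather than tuned values, and the simulation setup is not taken from the paper's experiments.

```r
# A minimal sketch of a POET-style estimator (Equation S6): retain the leading
# L principal components of the sample covariance and hard-threshold the
# off-diagonal entries of the principal orthogonal complement.
poet_cov <- function(x, L = 2, t = 0.1) {
  s <- cov(x)
  eig <- eigen(s, symmetric = TRUE)
  v_L <- eig$vectors[, seq_len(L), drop = FALSE]
  low_rank <- v_L %*% diag(eig$values[seq_len(L)], L) %*% t(v_L)
  s_eps <- s - low_rank                    # principal orthogonal complement
  t_eps <- s_eps * (abs(s_eps) >= t)       # hard thresholding
  diag(t_eps) <- diag(s_eps)               # diagonal is left untouched
  low_rank + t_eps
}

# Data simulated from an approximate two-factor model
set.seed(1)
n <- 100; J <- 150; L <- 2
beta <- matrix(rnorm(J * L), J, L)
f <- matrix(rnorm(n * L), n, L)
x <- f %*% t(beta) + matrix(rnorm(n * J, sd = 0.5), n, J)
psi_hat <- poet_cov(x, L = 2, t = 0.1)
```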
Lemma A1. Let $Z_k := L(X; \hat{\Psi}_k(P_{n,B_n}), \eta) - L(X; \psi_0, \eta)$. Then, assuming $\mathbb{E}_{P_0}[Z_k \mid P_{n,B_n}, B_n] > 0$,
$$\operatorname{Var}_{P_0}\left[Z_k \mid P_{n,B_n}, B_n\right] \leq M(J)\, \mathbb{E}_{P_0}\left[Z_k \mid P_{n,B_n}, B_n\right],$$
where $M(J) := 4\eta(M_1 + M_2)^2 J^2$.

Proof.
$$\begin{aligned}
\mathbb{E}_{P_0}\left[Z_k \mid P_{n,B_n}, B_n\right]
&= \mathbb{E}_{P_0}\left[ L(X; \hat{\Psi}_k(P_{n,B_n}), \eta) - L(X; \psi_0, \eta) \,\middle|\, P_{n,B_n}, B_n \right] \\
&= \eta\, \mathbb{E}_{P_0}\left[ \sum_{j=1}^{J}\sum_{l=1}^{J} \left(X^{(j)}X^{(l)} - \hat{\Psi}^{(jl)}_k(P_{n,B_n})\right)^2 - \left(X^{(j)}X^{(l)} - \psi^{(jl)}_0\right)^2 \,\middle|\, P_{n,B_n}, B_n \right] \\
&= \eta \sum_{j=1}^{J}\sum_{l=1}^{J} \mathbb{E}_{P_0}\left[ \left(\psi^{(jl)}_0 - \hat{\Psi}^{(jl)}_k(P_{n,B_n})\right)\left(2X^{(j)}X^{(l)} - \psi^{(jl)}_0 - \hat{\Psi}^{(jl)}_k(P_{n,B_n})\right) \,\middle|\, P_{n,B_n}, B_n \right] \\
&= \eta \sum_{j=1}^{J}\sum_{l=1}^{J} \left(\psi^{(jl)}_0 - \hat{\Psi}^{(jl)}_k(P_{n,B_n})\right) \mathbb{E}_{P_0}\left[ 2X^{(j)}X^{(l)} - \psi^{(jl)}_0 - \hat{\Psi}^{(jl)}_k(P_{n,B_n}) \,\middle|\, P_{n,B_n}, B_n \right] \\
&= \eta \sum_{j=1}^{J}\sum_{l=1}^{J} \left(\psi^{(jl)}_0 - \hat{\Psi}^{(jl)}_k(P_{n,B_n})\right)^2
= \eta \sum_{j=1}^{J}\sum_{l=1}^{J} \mathbb{E}_{P_0}\left[ \left(\psi^{(jl)}_0 - \hat{\Psi}^{(jl)}_k(P_{n,B_n})\right)^2 \,\middle|\, P_{n,B_n}, B_n \right],
\end{aligned}$$
where the second-to-last equality follows by noting that $\mathbb{E}_{P_0}[X^{(j)}X^{(l)} \mid P_{n,B_n}, B_n] = \mathbb{E}_{P_0}[X^{(j)}X^{(l)}]$ and that, by definition, $\mathbb{E}_{P_0}[X^{(j)}X^{(l)}] = \psi^{(jl)}_0$. Then,
$$\begin{aligned}
\operatorname{Var}_{P_0}\left[Z_k \mid P_{n,B_n}, B_n\right]
&\leq \mathbb{E}_{P_0}\left[Z_k^2 \mid P_{n,B_n}, B_n\right] \\
&= \mathbb{E}_{P_0}\left[\left(\eta \sum_{j=1}^{J}\sum_{l=1}^{J}\left(\psi^{(jl)}_0 - \hat{\Psi}^{(jl)}_k(P_{n,B_n})\right)\left(2X^{(j)}X^{(l)} - \psi^{(jl)}_0 - \hat{\Psi}^{(jl)}_k(P_{n,B_n})\right)\right)^2 \,\middle|\, P_{n,B_n}, B_n\right] \\
&= \eta^2 J^4\, \mathbb{E}_{P_0}\left[\left(\frac{1}{J^2}\sum_{j=1}^{J}\sum_{l=1}^{J}\left(\psi^{(jl)}_0 - \hat{\Psi}^{(jl)}_k(P_{n,B_n})\right)\left(2X^{(j)}X^{(l)} - \psi^{(jl)}_0 - \hat{\Psi}^{(jl)}_k(P_{n,B_n})\right)\right)^2 \,\middle|\, P_{n,B_n}, B_n\right] \\
&\leq \eta^2 J^2\, \mathbb{E}_{P_0}\left[\sum_{j=1}^{J}\sum_{l=1}^{J}\left(\psi^{(jl)}_0 - \hat{\Psi}^{(jl)}_k(P_{n,B_n})\right)^2\left(2X^{(j)}X^{(l)} - \psi^{(jl)}_0 - \hat{\Psi}^{(jl)}_k(P_{n,B_n})\right)^2 \,\middle|\, P_{n,B_n}, B_n\right] \\
&\leq 4\eta^2 J^2 (M_1 + M_2)^2\, \mathbb{E}_{P_0}\left[\sum_{j=1}^{J}\sum_{l=1}^{J}\left(\psi^{(jl)}_0 - \hat{\Psi}^{(jl)}_k(P_{n,B_n})\right)^2 \,\middle|\, P_{n,B_n}, B_n\right] \\
&= 4\eta(M_1 + M_2)^2 J^2\, \mathbb{E}_{P_0}\left[Z_k \mid P_{n,B_n}, B_n\right] = M(J)\, \mathbb{E}_{P_0}\left[Z_k \mid P_{n,B_n}, B_n\right].
\end{aligned}$$
Here, the second inequality holds from the application of Jensen's inequality to the square of the double sum, which is effectively an expectation of a discrete uniform random variable when scaled by $J^2$. The final inequality results from Assumptions 1 and 2, and concludes the proof.

The following lemma is a result taken directly from Dudoit and van der Laan [2003]. It is restated here for convenience.
Lemma A2. Let $Y_1, Y_2, \ldots$ be a sequence of random variables. If $\mathbb{E}[|Y_n|] = O(g(n))$ for some positive function $g(\cdot)$, then $Y_n = O_P(g(n))$.

Proof. We must show that, for each $\epsilon > 0$, there exist an $N$ and a $B > 0$ such that $P(|Y_n|/g(n) > B) < \epsilon$ for all $n > N$. By assumption, there exist an $N$ and a $C > 0$ such that $\mathbb{E}[|Y_n|]/g(n) < C$ for all $n > N$. By defining $C/B = \epsilon$ and making use of Markov's inequality, we find that
$$P\left(\frac{|Y_n|}{g(n)} > B\right) \leq \frac{\mathbb{E}[|Y_n|]}{g(n)B} \leq \frac{C}{B} = \epsilon.$$

Having derived Lemma A1, the remainder of the proof for Theorem 1 closely follows the proof of the first theorem in Dudoit and van der Laan [2005].
Proof. Theorem 1, Finite-Sample Result.
$$\begin{aligned}
0 &\leq \tilde{\theta}_{p_n,n}(\hat{k}_{p_n,n}) - \theta_0 \\
&= \mathbb{E}_{B_n}\Bigg[ \int \left(L(x; \hat{\Psi}_{\hat{k}_{p_n,n}}(P_{n,B_n}), \eta) - L(x; \psi_0, \eta)\right) dP_0(x) \\
&\qquad - (1+\delta)\int \left(L(x; \hat{\Psi}_{\hat{k}_{p_n,n}}(P_{n,B_n}), \eta) - L(x; \psi_0, \eta)\right) dP_{n,B_n}(x) \\
&\qquad + (1+\delta)\int \left(L(x; \hat{\Psi}_{\hat{k}_{p_n,n}}(P_{n,B_n}), \eta) - L(x; \psi_0, \eta)\right) dP_{n,B_n}(x) \Bigg] \\
&\leq \mathbb{E}_{B_n}\Bigg[ \int \left(L(x; \hat{\Psi}_{\hat{k}_{p_n,n}}(P_{n,B_n}), \eta) - L(x; \psi_0, \eta)\right) dP_0(x) \\
&\qquad - (1+\delta)\int \left(L(x; \hat{\Psi}_{\hat{k}_{p_n,n}}(P_{n,B_n}), \eta) - L(x; \psi_0, \eta)\right) dP_{n,B_n}(x) \\
&\qquad + (1+\delta)\int \left(L(x; \hat{\Psi}_{\tilde{k}_{p_n,n}}(P_{n,B_n}), \eta) - L(x; \psi_0, \eta)\right) dP_{n,B_n}(x) \Bigg] \\
&= \mathbb{E}_{B_n}\Bigg[ \int \left(L(x; \hat{\Psi}_{\hat{k}_{p_n,n}}(P_{n,B_n}), \eta) - L(x; \psi_0, \eta)\right) dP_0(x) \\
&\qquad - (1+\delta)\int \left(L(x; \hat{\Psi}_{\hat{k}_{p_n,n}}(P_{n,B_n}), \eta) - L(x; \psi_0, \eta)\right) dP_{n,B_n}(x) \\
&\qquad + (1+\delta)\int \left(L(x; \hat{\Psi}_{\tilde{k}_{p_n,n}}(P_{n,B_n}), \eta) - L(x; \psi_0, \eta)\right) dP_{n,B_n}(x) \\
&\qquad - (1+2\delta)\int \left(L(x; \hat{\Psi}_{\tilde{k}_{p_n,n}}(P_{n,B_n}), \eta) - L(x; \psi_0, \eta)\right) dP_0(x) \\
&\qquad + (1+2\delta)\int \left(L(x; \hat{\Psi}_{\tilde{k}_{p_n,n}}(P_{n,B_n}), \eta) - L(x; \psi_0, \eta)\right) dP_0(x) \Bigg].
\end{aligned} \tag{S7}$$
The first inequality is by assumption, and the second is by definition of $\hat{k}_{p_n,n}$: $\hat{\theta}_{p_n,n}(\hat{k}_{p_n,n}, \eta) \leq \hat{\theta}_{p_n,n}(k, \eta)$ for all $k$. For the remainder of the proof, in a slight abuse of notation, we simplify by replacing $\hat{k}_{p_n,n}$ and $\tilde{k}_{p_n,n}$ with $\hat{k}$ and $\tilde{k}$, respectively, and omit $\eta$ from the conditional risks.

Now, let the first two terms of the last expression in Equation (S7) be denoted by $R_{\hat{k},n}$, and the third and fourth terms by $T_{\tilde{k},n}$. The last term is the cross-validated oracle's risk difference: $(1+2\delta)(\tilde{\theta}_{p_n,n}(\tilde{k}) - \theta_0)$. Thus,
$$0 \leq \tilde{\theta}_{p_n,n}(\hat{k}) - \theta_0 \leq (1+2\delta)(\tilde{\theta}_{p_n,n}(\tilde{k}) - \theta_0) + R_{\hat{k},n} + T_{\tilde{k},n}. \tag{S8}$$
We next show that $\mathbb{E}_{P_0}[R_{\hat{k},n} + T_{\tilde{k},n}] \leq 2c(\delta, M(J))(1 + \log(K_n))/(np_n)$, where $c(\delta, M(J)) = 2(1+\delta)^2 M(J)(1/\delta + 1/3)$, for some $\delta > 0$. For convenience, let
$$\begin{aligned}
\hat{H}_k &:= \int \left(L(x; \hat{\Psi}_k(P_{n,B_n}), \eta) - L(x; \psi_0, \eta)\right) dP_{n,B_n}(x), \\
\tilde{H}_k &:= \int \left(L(x; \hat{\Psi}_k(P_{n,B_n}), \eta) - L(x; \psi_0, \eta)\right) dP_0(x), \\
R_{k,n}(B_n) &:= (1+\delta)(\tilde{H}_k - \hat{H}_k) - \delta\tilde{H}_k, \\
T_{k,n}(B_n) &:= (1+\delta)(\hat{H}_k - \tilde{H}_k) - \delta\tilde{H}_k,
\end{aligned}$$
so that $R_{k,n} = \mathbb{E}_{B_n}[R_{k,n}(B_n)]$ and $T_{k,n} = \mathbb{E}_{B_n}[T_{k,n}(B_n)]$.

Given $B_n$ and $P_{n,B_n}$, let $Z_{k,i}$, $1 \leq i \leq np_n$, denote the $np_n$ i.i.d. copies of $Z_k$ corresponding to the validation set, i.e., with $\{X_i : B_n(i) = 1\}$ (as defined in Lemma A1). Then, $\hat{H}_k = \sum_i Z_{k,i}/(np_n)$ and $\tilde{H}_k = \mathbb{E}_{P_0}[Z_{k,i} \mid P_{n,B_n}, B_n]$. Hence, $\tilde{H}_k - \hat{H}_k$ is an empirical mean of $np_n$ i.i.d. centered random variables. Further, by Assumptions 1 and 2, $|Z_{k,i}| < \eta(M_1 + M_2)^2 J^2$ almost surely. Next, we apply Bernstein's inequality to the centered empirical mean $\tilde{H}_k - \hat{H}_k$, using the property of the $Z_{k,i}$'s derived in Lemma A1, to obtain a bound for the tail probabilities of $R_{k,n}(B_n)$ and $T_{k,n}(B_n)$:
$$\sigma^2_k := \operatorname{Var}_{P_0}\left[Z_k \mid P_{n,B_n}, B_n\right] \leq M(J)\, \mathbb{E}_{P_0}\left[Z_k \mid P_{n,B_n}, B_n\right] = M(J)\tilde{H}_k.$$
Then, for $s > 0$, Bernstein's inequality yields
$$\begin{aligned}
P_{P_0}\left(R_{k,n}(B_n) > s \mid P_{n,B_n}, B_n\right)
&= P_{P_0}\left(\tilde{H}_k - \hat{H}_k > \frac{s + \delta\tilde{H}_k}{1+\delta} \,\middle|\, P_{n,B_n}, B_n\right) \\
&\leq P_{P_0}\left(\tilde{H}_k - \hat{H}_k > \frac{s + \delta\sigma^2_k/M(J)}{1+\delta} \,\middle|\, P_{n,B_n}, B_n\right) \\
&\leq \exp\left\{-\frac{np_n}{2(1+\delta)^2} \cdot \frac{\left(s + \delta\sigma^2_k/M(J)\right)^2}{\sigma^2_k + \frac{M(J)}{3(1+\delta)}\left(s + \delta\sigma^2_k/M(J)\right)}\right\}.
\end{aligned}$$
Note that
$$\frac{\left(s + \delta\sigma^2_k/M(J)\right)^2}{\sigma^2_k + \frac{M(J)}{3(1+\delta)}\left(s + \delta\sigma^2_k/M(J)\right)} = \frac{s + \delta\sigma^2_k/M(J)}{\frac{\sigma^2_k}{s + \delta\sigma^2_k/M(J)} + \frac{M(J)}{3(1+\delta)}} \geq \frac{s + \delta\sigma^2_k/M(J)}{M(J)\left(\frac{1}{\delta} + \frac{1}{3}\right)} \geq \frac{s}{M(J)\left(\frac{1}{\delta} + \frac{1}{3}\right)}.$$
And so, for $s > 0$,
$$P_{P_0}\left(R_{k,n}(B_n) > s \mid P_{n,B_n}, B_n\right) \leq \exp\left\{-\frac{np_n}{c(\delta, M(J))}\, s\right\} \leq K_n \exp\left\{-\frac{np_n}{c(\delta, M(J))}\, s\right\},$$
where $c(\delta, M(J)) = 2(1+\delta)^2 M(J)\left(\frac{1}{\delta} + \frac{1}{3}\right)$. The same bound applies to the marginal probabilities $P_{P_0}(R_{k,n}(B_n) > s)$ since it holds for arbitrary choices of $B_n$ and $P_{n,B_n}$. The second inequality follows from $K_n$ being larger than or equal to one by definition.

Finally, for any $u > 0$, we have by the properties of expectations and the previously derived result that
$$\begin{aligned}
\mathbb{E}_{P_0}[R_{\hat{k},n}] &= \int_0^{\infty} P_{P_0}(R_{\hat{k},n} > s)\, ds - \int_{-\infty}^{0} P_{P_0}(R_{\hat{k},n} \leq s)\, ds \\
&\leq \int_0^{\infty} P_{P_0}(R_{\hat{k},n} > s)\, ds \\
&\leq u + \int_u^{\infty} P_{P_0}(R_{\hat{k},n} > s)\, ds \\
&\leq u + \int_u^{\infty} K_n \exp\left\{-\frac{np_n}{c(\delta, M(J))}\, s\right\} ds.
\end{aligned} \tag{S9}$$
Since the expression on the right-hand side of the inequality above achieves its minimum value of $c(\delta, M(J))(1 + \log(K_n))/(np_n)$ at $u_n = c(\delta, M(J))\log(K_n)/(np_n)$, then
$$\mathbb{E}_{P_0}[R_{\hat{k},n}] \leq \frac{c(\delta, M(J))(1 + \log(K_n))}{np_n}.$$
The same bound applies to $\mathbb{E}_{P_0}[T_{\tilde{k},n}]$. Therefore, taking the expected values of the inequality in Equation (S8), we produce the desired finite-sample result in Equation (8).
Proof. Theorem 1, High-Dimensional Asymptotic Result.
The expected risk differences ratio's convergence follows directly from Equation (8) for some $\delta > 0$, so long as $c(\delta, M(J))(1 + \log(K_n))/\left(np_n\, \mathbb{E}_{P_0}[\tilde{\theta}_{p_n,n}(\tilde{k}_{p_n,n}) - \theta_0]\right) \to 0$ as $n, J \to \infty$. Given the assumption in Kolmogorov asymptotics that $J/n \to m < \infty$ as $J, n \to \infty$, an equivalent condition is that $m(M_1 + M_2)^2 \eta J (1 + \log(K_n))/\left(p_n\, \mathbb{E}_{P_0}[\tilde{\theta}_{p_n,n}(\tilde{k}_{p_n,n}) - \theta_0]\right) \to 0$ as $n, J \to \infty$. Convergence in probability then follows from Lemma A2.

Though there are minor adaptations to the assumptions to reflect the use of high-dimensional asymptotics, the proof of Corollary 1 follows that of Corollary 1 in Dudoit and van der Laan [2003].
Proof. Corollary 1.
Note that $\eta$ is omitted from the conditional risks in the following proof. The asymptotic statement of Equation (12) is an immediate result of Theorem 1's Equation (10):
$$\frac{\tilde{\theta}_{p_n,n}(\hat{k}_{p_n,n}) - \theta_0}{\tilde{\theta}_{p_n,n}(\tilde{k}_{p_n,n}) - \theta_0} \cdot \frac{\tilde{\theta}_{p_n,n}(\tilde{k}_{p_n,n}) - \theta_0}{\tilde{\theta}_n(\tilde{k}_n) - \theta_0} \xrightarrow{P} 1.$$
Letting $Z_{1,n} := (n(1-p_n))^{\gamma}(\tilde{\theta}_{p_n,n}(\tilde{k}_{p_n,n}) - \theta_0)$ and $Z_{2,n} := n^{\gamma}(\tilde{\theta}_n(\tilde{k}_n) - \theta_0)$, and assuming that the sufficient condition in Equation (13) holds, we find that $Z_{1,n}/Z_{2,n} \xrightarrow{d} 1$ by the Continuous Mapping Theorem. Then, notice that
$$\frac{Z_{1,n}}{Z_{2,n}} = (1 - p_n)^{\gamma}\, \frac{\tilde{\theta}_{p_n,n}(\tilde{k}_{p_n,n}) - \theta_0}{\tilde{\theta}_n(\tilde{k}_n) - \theta_0},$$
which yields the desired result of Equation (11) when $p_n \to 0$. In the case of a single-split validation, $Z_{1,n} \overset{d}{=} Z_{2,\,n(1-p_n)}$, and so $Z_{2,n} \xrightarrow{d} Z$ implies that $(Z_{1,n}, Z_{2,n}) \xrightarrow{d} (Z, Z)$.
S3 Other Risk and Loss Functions

It is worth noting that our choice of the scaled Frobenius loss function is closely related to matrix-based Frobenius risk functions often employed in the high-dimensional covariance matrix estimation literature, as seen in Ledoit and Wolf [2004], Bickel and Levina [2008b], and Rothman et al. [2009], to name but a few. This matrix-based risk is defined as
$$\hat{R}_n(\hat{\Psi}_k) := \mathbb{E}_{B_n}\left[\left\|S_n(P_{n,B_n}) - \hat{\Psi}_k(P_{n,B_n})\right\|^2_{F,1} \,\middle|\, P_n\right], \tag{S10}$$
where $S_n(P_{n,B_n})$ is the sample covariance matrix computed over the validation set.

In the context of Dudoit and van der Laan [2003]'s estimator selection theory, it is convenient for our loss function to operate individually over observational units; however, in practice, this can prove computationally costly, especially with growing $n$ and $J$. Thus, we show in the sequel that our observation-based Frobenius loss' selection rule is equivalent to that of the matrix-based Frobenius risk. That is, for any scaling factor matrix $H(P_0)$ whose entries are identical, positive constants, it holds that
$$\hat{k}_{p_n,n} = \operatorname*{argmin}_{k \in \{1,\ldots,K_n\}} \hat{\theta}_{p_n,n}(k, \eta) = \operatorname*{argmin}_{k \in \{1,\ldots,K_n\}} \hat{R}_n(\hat{\Psi}_k).$$

Proof. Assume without loss of generality that $\mathbb{E}_{P_0}[X] = 0$, and that the entries of $\eta$ are identically $\eta$. Then,
$$\begin{aligned}
\hat{\theta}_{p_n,n}(k, \eta)
&= \mathbb{E}_{B_n}\left[\frac{1}{np_n}\sum_{i=1}^{n} \mathbb{I}(B_n(i) = 1)\left\|X_i X_i^{\top} - \hat{\Psi}_k(P_{n,B_n})\right\|^2_{F,\eta}\right] \\
&= \mathbb{E}_{B_n}\left[\frac{\eta}{np_n}\sum_{i=1}^{n} \mathbb{I}(B_n(i) = 1) \sum_{j=1}^{J}\sum_{l=1}^{J}\left(X^{(j)}_i X^{(l)}_i - \hat{\Psi}_k(P_{n,B_n})^{(jl)}\right)^2\right] \\
&= \mathbb{E}_{B_n}\left[\frac{\eta}{np_n}\sum_{j=1}^{J}\sum_{l=1}^{J}\sum_{\{i : B_n(i) = 1\}}\left(\left(X^{(j)}_i X^{(l)}_i\right)^2 - 2 X^{(j)}_i X^{(l)}_i \hat{\Psi}_k(P_{n,B_n})^{(jl)} + \left(\hat{\Psi}_k(P_{n,B_n})^{(jl)}\right)^2\right)\right] \\
&= \mathbb{E}_{B_n}\left[\eta\sum_{j=1}^{J}\sum_{l=1}^{J}\left(\left(\hat{\Psi}_k(P_{n,B_n})^{(jl)}\right)^2 - 2 S(P_{n,B_n})^{(jl)}\hat{\Psi}_k(P_{n,B_n})^{(jl)} + \frac{1}{np_n}\sum_{\{i : B_n(i) = 1\}}\left(X^{(j)}_i X^{(l)}_i\right)^2\right)\right] \\
&= \eta\, \mathbb{E}_{B_n}\left[\sum_{j=1}^{J}\sum_{l=1}^{J}\left(\left(\hat{\Psi}_k(P_{n,B_n})^{(jl)}\right)^2 - 2 S(P_{n,B_n})^{(jl)}\hat{\Psi}_k(P_{n,B_n})^{(jl)}\right)\right] + C_1,
\end{aligned} \tag{S11}$$
where $C_1$ is constant with respect to $\hat{\Psi}_k(P_{n,B_n})$. Now, notice that
$$\begin{aligned}
\hat{R}_n(\hat{\Psi}_k)
&= \mathbb{E}_{B_n}\left[\left\|\hat{\Psi}_k(P_{n,B_n}) - S_n(P_{n,B_n})\right\|^2_{F,1}\right] \\
&= \mathbb{E}_{B_n}\left[\sum_{j=1}^{J}\sum_{l=1}^{J}\left(\left(\hat{\Psi}_k(P_{n,B_n})^{(jl)}\right)^2 - 2 S(P_{n,B_n})^{(jl)}\hat{\Psi}_k(P_{n,B_n})^{(jl)} + \left(S(P_{n,B_n})^{(jl)}\right)^2\right)\right] \\
&= \mathbb{E}_{B_n}\left[\sum_{j=1}^{J}\sum_{l=1}^{J}\left(\left(\hat{\Psi}_k(P_{n,B_n})^{(jl)}\right)^2 - 2 S(P_{n,B_n})^{(jl)}\hat{\Psi}_k(P_{n,B_n})^{(jl)}\right)\right] + C_2,
\end{aligned} \tag{S12}$$
where $C_2$ is constant with respect to $\hat{\Psi}_k(P_{n,B_n})$. Thus,
$$\hat{k}_{p_n,n} = \operatorname*{argmin}_{k \in \{1,\ldots,K_n\}} \hat{\theta}_{p_n,n}(k, \eta) = \operatorname*{argmin}_{k \in \{1,\ldots,K_n\}} \hat{R}_n(\hat{\Psi}_k).$$

The observation-based Frobenius loss is computationally taxing when the dimensionality of a dataset is large: the outer products of all the observations in the validation sets must be computed and stored in memory. The matrix-based Frobenius risk generates identical selections for fixed empirical distributions since they possess the same risk minimizer, and yet it is much more computationally efficient. When computing the cross-validated matrix-based risk, the outer products of observations need not be computed, which is a significant computational improvement.
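The computational point above is easy to see in code: with the matrix-based risk, each fold requires only the validation-set sample covariance rather than one outer product per validation observation. The following R sketch selects among a toy candidate library by the cross-validated matrix-based Frobenius risk; the candidates, fold count, and data dimensions are illustrative assumptions, and the sketch is not the cvCovEst implementation.

```r
# A minimal sketch of cross-validated selection via the matrix-based
# Frobenius risk (Equation S10).
cv_select <- function(x, candidates, v = 5) {
  n <- nrow(x)
  folds <- sample(rep(seq_len(v), length.out = n))
  risks <- sapply(names(candidates), function(name) {
    mean(sapply(seq_len(v), function(fold) {
      train <- x[folds != fold, , drop = FALSE]
      valid <- x[folds == fold, , drop = FALSE]
      est <- candidates[[name]](train)   # candidate fit on the training folds
      s_valid <- cov(valid)              # validation-set sample covariance
      sum((est - s_valid)^2)             # squared Frobenius norm
    }))
  })
  names(which.min(risks))
}

set.seed(1)
x <- matrix(rnorm(60 * 90), nrow = 60, ncol = 90)
candidates <- list(
  sample   = function(z) cov(z),
  diagonal = function(z) diag(diag(cov(z)))
)
cv_select(x, candidates, v = 5)
```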
We therefore recommend the matrix-based risk for most applications, with the caveat that all optimality results presented are with respect to the observation-based Frobenius loss.

We also highlight that our cross-validated estimator selection approach is restricted neither to the (scaled) Frobenius loss of Equation (6) nor to its matrix-based counterpart. For example, one might consider a weighted variant, especially appropriate when the random variables are of different magnitudes. In this situation, weighting the covariance matrix entries' losses by the appropriate entries along the population covariance matrix's diagonal would ensure a more appropriate evaluation of an estimator:
$$L_{\text{weighted}}(X; \psi, H(P_0)) := \sum_{j=1}^{J}\sum_{l=1}^{J} H(P_0)^{(jl)}\left(X^{(j)}X^{(l)} - \psi^{(jl)}\right)^2 = \frac{1}{\sum_{j=1}^{J}\sum_{l=1}^{J}\left|\psi^{(jl)}_0\right|}\sum_{j=1}^{J}\sum_{l=1}^{J}\frac{1}{\psi^{(jj)}_0\psi^{(ll)}_0}\left(X^{(j)}X^{(l)} - \psi^{(jl)}\right)^2. \tag{S13}$$
Thus, $H(P_0)$ in Equation (S13) is a genuine nuisance parameter, and a portion of it can be estimated via the diagonal entries of the sample covariance matrix. However, the optimality results of Theorem 1 are not directly applicable to this loss given the presence of a bona fide nuisance parameter. We omit further discussion of this loss, leaving it and a necessary generalization of Theorem 1 as an avenue for future investigation.

Figure S1: Comparison of the cross-validated ($\hat{k}_{p_n,n}$) and cross-validated oracle ($\tilde{k}_{p_n,n}$) selections' cross-validated conditional risk differences. The proposed cross-validated selection procedure achieves asymptotic equivalence in most settings for relatively small sample sizes and numbers of features.

Figure S2: Tasic dataset: Diagnostic plots and tables generated using the cvCovEst R package. (A)
The top-left plot presents the cross-validated Frobenius risk of the estimator selected by our method, which in this case is the POET estimator with k = 5 and lambda = 0.3, alongside the other hyperparameter combinations considered. Here, k represents the number of potential latent factors, and lambda the thresholding value used. The top-right panel contains a line plot of the selected estimator's eigenvalues. The bottom-left plot displays the estimated correlation matrix output by the cvCovEst selection, and the bottom-right table lists the best performing estimators from all classes of estimators considered. The table presented in (A) makes it clear that our method's selection substantially improves upon the sample covariance matrix in terms of the cross-validated Frobenius risk. The comparison in (B) suggests that the sample covariance matrix over-estimates the 20 leading eigenvalues, resulting in a poorer UMAP representation.

Figure S3: Zeisel dataset: Diagnostic plots and tables generated using the cvCovEst R package.
The components in (A) can be interpreted in the same manner as the previous figure, with the exception of the top-left panel. In this table, a five-number summary of the cross-validated Frobenius risk is given for each class of estimators considered across all possible combinations of hyperparameters, if any. It is clear from the tables in (A) that the cvCovEst selection is essentially equivalent in terms of cross-validated risk to the sample covariance matrix. The plots in (B) further highlight that the 20 leading eigenvalues of the cvCovEst estimate and the sample covariance matrix are indistinguishable.

Table S1: Families of candidate estimators compared to the cross-validated loss-based estimator selection procedure. Note that the library of candidate estimators used by the proposed method is provided in Table 1.
Estimator: Hyperparameters
Sample covariance matrix: NA
Hard thresholding: Thresholds = {…}
SCAD thresholding: Thresholds = {…}
Adaptive LASSO: Thresholds = {…}; exponential weights = {…}
Banding: Bands = {…}
Tapering: Bands = {…}
Linear shrinkage: NA
Dense linear shrinkage: NA
Nonlinear shrinkage: NA
POET using hard thresholding: Latent factors = {…}; thresholds = {…}

Table S2: Families of candidate estimators used in single-cell transcriptomic data analyses.
Estimator: Hyperparameters
Sample covariance matrix: NA
Hard thresholding: Thresholds = {…}
SCAD thresholding: Thresholds = {…}
Adaptive LASSO: Thresholds = {…}; exponential weights = {…}
Linear shrinkage: NA
Dense linear shrinkage: NA
POET using hard thresholding: Latent factors = {…}; thresholds = {…}
S6 Simulation Results: Spectral Norm

We performed an additional simulation analysis analogous to that presented in Figure 3, but using the spectral norm instead of the Frobenius norm to assess the accuracy of our estimation procedure with respect to the leading eigenvalue of the covariance matrix. Recall that the spectral norm of a square matrix is defined as its largest absolute eigenvalue. We applied the estimation procedures listed in Table S1 to all simulated datasets along with our data-adaptive selection procedure (still under the Frobenius loss) using the candidates given in Table 1. When necessary, our cross-validation framework was used to select hyperparameters within classes of estimators using the Frobenius loss. We then computed the mean spectral norm of the difference between each estimate and the true covariance matrix, stratified by n, J/n, and the covariance matrix model.
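For reference, the accuracy metric reported here reduces to the operator norm of an estimation error matrix, which can be computed as in the brief R sketch below; the inputs are illustrative assumptions rather than the simulation's data-generating processes.

```r
# A minimal sketch of the spectral-norm error metric: the largest singular
# value of the difference between an estimate and the true covariance matrix,
# which equals the largest absolute eigenvalue for symmetric matrices.
spectral_error <- function(estimate, truth) {
  norm(estimate - truth, type = "2")
}

set.seed(1)
truth <- diag(50)
x <- matrix(rnorm(40 * 50), nrow = 40, ncol = 50)
spectral_error(cov(x), truth)
```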
Figure S4: Comparison of competing, bespoke covariance matrix estimation procedures to our cross-validated selection approach in terms of the Monte Carlo mean spectral norm under a variety of data-generating processes. Note that the scales of the y-axis are tailored to the covariance matrix model.

The results, presented in Figure S4, demonstrate that our estimator selection procedure performs at least as well as the best alternative estimation strategy. This suggests that procedures dedicated to or relying upon the accurate estimation of leading eigenvalues and eigenvectors, like principal component analysis and latent variable estimation, might benefit from the integration of our cross-validated covariance matrix estimation framework.
References
P. J. Bickel and E. Levina. Regularized estimation of large covariance matrices. Annals of Statistics, 36(1):199–227, 2008a.
P. J. Bickel and E. Levina. Covariance regularization by thresholding. Annals of Statistics, 36(6):2577–2604, 2008b.
T. Cai and W. Liu. Adaptive thresholding for sparse covariance matrix estimation. Journal of the American Statistical Association, 106(494):672–684, 2011. doi: 10.1198/jasa.2011.tm10560.
T. T. Cai, C.-H. Zhang, and H. H. Zhou. Optimal rates of convergence for covariance matrix estimation. Annals of Statistics, 38(4):2118–2144, 2010. doi: 10.1214/09-AOS752.
S. Dudoit and M. J. van der Laan. Asymptotics of cross-validated risk estimation in estimator selection and performance assessment. Working Paper 126, University of California, Berkeley, 2003. URL https://biostats.bepress.com/ucbbiostat/paper126.
J. Fan and R. Li. Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association, 96(456):1348–1360, 2001. doi: 10.1198/016214501753382273.
J. Fan, Y. Liao, and M. Mincheva. Large covariance estimation by thresholding principal orthogonal complements. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 75(4):603–680, 2013. doi: 10.1111/rssb.12016.
J. Fan, Y. Liao, and H. Liu. An overview of the estimation of large covariance and precision matrices. The Econometrics Journal, 19(1):C1–C32, 2016. doi: 10.1111/ectj.12061.
I. M. Johnstone. On the distribution of the largest eigenvalue in principal components analysis. Annals of Statistics, pages 295–327, 2001.
O. Ledoit and M. Wolf. Improved estimation of the covariance matrix of stock returns with an application to portfolio selection. Journal of Empirical Finance, 10(5):603–621, 2003. ISSN 0927-5398. doi: 10.1016/S0927-5398(03)00007-0.
O. Ledoit and M. Wolf. A well-conditioned estimator for large-dimensional covariance matrices. Journal of Multivariate Analysis, 88(2):365–411, 2004. doi: 10.1016/S0047-259X(03)00096-4.
O. Ledoit and M. Wolf. Nonlinear shrinkage estimation of large-dimensional covariance matrices. Annals of Statistics, 40(2):1024–1060, 2012. doi: 10.1214/12-AOS989.
O. Ledoit and M. Wolf. Spectrum estimation: A unified framework for covariance matrix estimation and PCA in large dimensions. Journal of Multivariate Analysis, 139:360–384, 2015. ISSN 0047-259X. doi: 10.1016/j.jmva.2015.04.006.
O. Ledoit and M. Wolf. Analytical nonlinear shrinkage of large-dimensional covariance matrices. ECON Working Papers 264, Department of Economics, University of Zurich, 2018. URL https://EconPapers.repec.org/RePEc:zur:econwp:264.
O. Ledoit and M. Wolf. The Power of (Non-)Linear Shrinking: A Review and Guide to Covariance Matrix Estimation. Journal of Financial Econometrics, 2020. ISSN 1479-8409. doi: 10.1093/jjfinec/nbaa007.
M. Pourahmadi. High-dimensional covariance estimation. Wiley Series in Probability and Statistics. Wiley, 2013.
A. J. Rothman, E. Levina, and J. Zhu. Generalized thresholding of large covariance matrices. Journal of the American Statistical Association, 104(485):177–186, 2009. doi: 10.1198/jasa.2009.0101.
J. Schäfer and K. Strimmer. A shrinkage approach to large-scale covariance matrix estimation and implications for functional genomics. Statistical Applications in Genetics and Molecular Biology, 4(1), 2005. doi: 10.2202/1544-6115.1175.
C. Stein. Lectures on the theory of estimation of many parameters. Journal of Soviet Mathematics, 34(1):1373–1403, 1986. doi: 10.1007/BF01085007.