A Well-Conditioned and Sparse Estimation of Covariance and Inverse Covariance Matrices Using a Joint Penalty
Ashwini Maurya [email protected]
Department of Statistics and Probability, Michigan State University, East Lansing, MI 48824, USA
Abstract
We develop a method for estimating well-conditioned and sparse covariance and inverse covariance matrices from a sample of vectors drawn from a sub-gaussian distribution in a high dimensional setting. The proposed estimators are obtained by minimizing the quadratic loss function together with a joint penalty of the ℓ1 norm and the variance of the eigenvalues. In contrast to some of the existing methods of covariance and inverse covariance matrix estimation, where often the interest is to estimate a sparse matrix, the proposed method is flexible in estimating both a sparse and well-conditioned covariance matrix simultaneously. The proposed estimators are optimal in the sense that they achieve the minimax rate of estimation in operator norm for the underlying class of covariance and inverse covariance matrices. We give a very fast algorithm for computing these covariance and inverse covariance matrices which is easily scalable to large scale data analysis problems. A simulation study for varying sample sizes and numbers of variables shows that the proposed estimators perform better than several other estimators for various choices of structured covariance and inverse covariance matrices. We also use the proposed estimator for tumor tissue classification using gene expression data and compare its performance with some other classification methods.

Keywords: Sparsity, Eigenvalue Penalty, Penalized Estimation
1. Introduction
With the recent surge in data technology and storage capacity, today's statisticians often encounter data sets where the sample size n is small and the number of variables p is very large: often hundreds, thousands, or even a million or more. Examples include gene expression data and web search problems [Clarke et al. (2008), Pass et al. (2006)]. For many such high dimensional data problems, classical statistical methods become inappropriate for making valid inference. Recent developments in asymptotic theory deal with increasing p as long as both p and n tend to infinity at some rate depending upon the parameters of interest.

The estimation of covariance and inverse covariance matrices is a problem of primary interest in multivariate statistical analysis. Some of the applications include: (i) Principal component analysis (PCA) [Johnstone and Lu (2004), Zou et al. (2006)], where the goal is to project the data onto the "best" k-dimensional subspace, where best means that the projected data explain as much of the variation in the original data as possible without increasing k. (ii) Discriminant analysis [Mardia et al. (1979)], where the goal is to classify observations into different classes; here estimates of the covariance and inverse covariance matrices play an important role, as the classifier is often a function of these entities. (iii) Regression analysis: if interest focuses on estimation of regression coefficients with correlated (or longitudinal) data, a sandwich estimator of the covariance matrix may be used to provide standard errors for the estimated coefficients that are robust in the sense that they remain consistent under mis-specification of the covariance structure. (iv)
Gaussian graphical modeling [Meinshausen and Bühlmann (2006), Wainwright et al. (2006), Yuan and Lin (2007), Yuan (2009)], where the relationship structure among nodes can be inferred from the inverse covariance matrix: a zero entry in the inverse covariance matrix implies conditional independence between the corresponding nodes.

The estimation of a large dimensional covariance matrix based on few sample observations is a difficult problem, especially when n ≍ p (here a_n ≍ b_n means that there exist positive constants c and C such that c ≤ a_n/b_n ≤ C). In these situations the sample covariance matrix becomes unstable, which explodes the estimation error. It is well known that the eigenvalues of the sample covariance matrix are over-dispersed, which means that the eigen-spectrum of the sample covariance matrix is not a good estimator of its population counterpart [Marcenko and Pastur (1967), Karoui (2008a)]. To illustrate this point, consider Σ_p = I_p, so that all the population eigenvalues are 1. A result from Geman (1980) shows that if the entries of the X_i's are i.i.d. (let the X_i's have mean zero and variance 1) with a finite fourth moment and if p/n → θ < ∞, then the largest sample eigenvalue l₁ satisfies:

l₁ → (1 + √θ)²,  a.s.

This suggests that l₁ is not a consistent estimator of the largest eigenvalue σ₁ of the population covariance matrix. In particular, if n = p then l₁ tends to 4 whereas σ₁ is 1. This is also evident in the eigenvalue plot in Figure 2.1. The distribution of l₁ also depends on the underlying structure of the true covariance matrix. From Figure 2.1, it is evident that the smaller sample eigenvalues tend to underestimate the true eigenvalues for large p and small n. For more discussion on this topic, see Karoui (2008a).

To correct for this bias, a natural choice would be to shrink the sample eigenvalues towards some suitable constant to reduce the over-dispersion.
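The over-dispersion of the sample eigen-spectrum is easy to reproduce numerically. A minimal sketch (the sizes are hypothetical, chosen so that p/n = 1 and the limit (1 + √θ)² = 4 applies):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 200, 200                      # p/n -> theta = 1, so l1 -> (1 + 1)^2 = 4

X = rng.standard_normal((n, p))      # rows X_i with mean 0, variance 1
S = X.T @ X / n                      # sample covariance (mean known to be zero)

eigs = np.linalg.eigvalsh(S)         # eigenvalues in ascending order
# Every population eigenvalue equals 1, yet the sample eigenvalues spread out:
# the largest is near 4 and the smallest is near 0.
print(eigs[-1], eigs[0])
```

The mean of the sample eigenvalues still concentrates near 1 (since tr(S)/p is nearly unbiased); it is only the spread that is inflated.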
For instance, Stein (1975) proposed an estimator of the form Σ̃ = Ũ Λ(λ̃) Ũᵀ, where Λ(λ̃) is a diagonal matrix whose diagonal entries are transformed functions of the sample eigenvalues and Ũ is the matrix of eigenvectors. In another interesting paper, Ledoit and Wolf (2004) proposed an estimator that shrinks the sample covariance matrix towards the identity matrix. In another paper, Karoui (2008b) proposed a non-parametric estimator of the spectrum of eigenvalues and showed that it is consistent in the sense of weak convergence of distributions.

Covariance matrix estimates based on eigen-spectrum shrinkage are well-conditioned in the sense that their eigenvalues are well bounded away from zero. These estimates are based on shrinkage of the eigenvalues and are therefore invariant under the orthogonal group, i.e. the shrinkage estimators shrink the eigenvalues but leave the eigenvectors unchanged. In other words, the basis (the eigenvectors) in which the data are given is not taken advantage of, and therefore these methods rely on the premise that one will be able to find a good estimate in any basis. In particular, it is reasonable to believe that the basis generating the data is somewhat nice. Often this translates into the assumption that the covariance matrix has a particular structure that one should be able to take advantage of. In these situations, it becomes natural to perform some form of regularization directly on the entries of the sample covariance matrix.

Much of the recent literature focuses on two broad classes of regularized covariance matrix estimators: i) one class relies on a natural ordering among the variables, where one often assumes that variables far apart are weakly correlated, and ii) the other class makes no assumption on the ordering among the variables. The first class includes estimators based on banding and tapering [Bickel and Levina (2008b), Cai et al.
(2011)]. These estimators are appropriate for a number of applications with ordered data (time series, spectroscopy, climate data). However, for many applications including gene expression data, prior knowledge of any canonical ordering is not available, and searching over all permutations of possible orderings would not be feasible. In these situations, an ℓ1-penalized estimator, which yields a permutation-invariant estimate, becomes more appropriate.

To obtain a suitable estimate which is both well-conditioned and sparse, we introduce two regularization terms: i) an ℓ1 penalty on each of the off-diagonal elements of the matrix, and ii) a penalty proportional to the variance of the eigenvalues. ℓ1-minimization problems are well studied in the covariance and inverse covariance matrix estimation literature [Friedman et al. (2008), Banerjee et al. (2008), Ravikumar et al. (2011), Bien and Tibshirani (2011), Maurya (2014), etc.]. Rothman (2012) proposes an ℓ1-penalized log-likelihood estimator and shows that the estimator is consistent in Frobenius norm at the rate of O_P(√({(p + s) log p}/n)) as both p and n approach infinity, where s is the number of non-zero off-diagonal elements in the true covariance matrix. In another interesting paper, Bien and Tibshirani (2011) propose an estimator of the covariance matrix as a penalized maximum likelihood estimator with a weighted lasso-type penalty. In these optimization problems, the ℓ1 penalty results in a sparse and permutation-invariant estimator, in contrast to ℓq, q ≠ 1 penalties. Another advantage is that the ℓ1 norm is a convex function, which makes it suitable for large scale optimization problems. A number of fast algorithms exist in the literature for covariance and inverse covariance matrix estimation [Friedman et al. (2008), Rothman et al. (2008)].
The eigenvalues variance penalty overcomes the over-dispersion of the sample covariance matrix so that the estimator remains well-conditioned.

Ledoit and Wolf (2004) proposed an estimator of the covariance matrix as a linear combination of the sample covariance and identity matrices. Their estimator is well-conditioned but not sparse. Rothman et al. (2008) proposed an estimator of the covariance matrix based on the quadratic loss function and an ℓ1 penalty with a log-barrier on the determinant of the covariance matrix. The log-determinant barrier is a valid technique to achieve positive definiteness, but it is still unclear whether the iterative procedure proposed in Rothman et al. (2008) actually finds the right solution to the corresponding optimization problem. In another interesting paper, Xue et al. (2012) proposed an estimator of the covariance matrix as the minimizer of a penalized quadratic loss function over the set of positive definite matrices. In their paper, the authors solve a positive definite constrained optimization problem and establish the consistency of the estimator. The resulting estimator is sparse and positive definite, but whether it overcomes the over-dispersion of the eigen-spectrum of the sample covariance matrix is hard to justify. Maurya (2014) proposed a joint convex penalty as a function of the ℓ1 norm and the trace norm (defined as the sum of the singular values of a matrix) for inverse covariance matrix estimation based on a penalized likelihood approach.

In this paper, we propose the JPEN (Joint PENalty) estimators of the covariance and inverse covariance matrices and derive explicit rates of convergence in both the operator and Frobenius norms. The JPEN estimators achieve the minimax rate of convergence under the operator norm for the underlying class of sparse covariance and inverse covariance matrices and hence are optimal. For more details see Section
3. One of the major advantages of the proposed estimators is that the algorithm is very fast, efficient, and easily scalable to large scale data analysis problems.

The rest of the paper is organized as follows. The next section gives some background and the problem set-up for covariance and inverse covariance matrix estimation. In Section 3, we describe the proposed estimators and establish their theoretical consistency. In Section 4, we give an algorithm and compare its computational time with some other existing algorithms. Section 5 highlights the performance of the proposed estimators on simulated data, while an application of the proposed estimator to real life data is given in Section 6.
Notation:
For a matrix M, let ‖M‖₁ denote its ℓ1 norm, defined as the sum of the absolute values of the entries of M; ‖M‖_F its Frobenius norm, defined as the square root of the sum of the squared entries of M; and ‖M‖ its operator norm (also called the spectral norm), defined as its largest absolute eigenvalue. M⁻ denotes the matrix M with all diagonal elements set to zero, and M⁺ denotes the matrix M with all off-diagonal elements set to zero. σ_i(M) denotes the i-th largest eigenvalue of M, tr(M) its trace, det(M) its determinant, and σ_min(M) and σ_max(M) its minimum and maximum eigenvalues. |M| is the cardinality of M, and sign(M) is the matrix of signs of the elements of M. For any real x, sign(x) denotes the sign of x and |x| its absolute value.
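The pieces of notation used throughout can be made concrete with a few one-liners (a sketch; the function names are my own, not the paper's):

```python
import numpy as np

def l1_norm(M):
    """||M||_1 : sum of absolute values of all entries of M."""
    return np.abs(M).sum()

def frobenius(M):
    """||M||_F : square root of the sum of squared entries of M."""
    return np.sqrt((M ** 2).sum())

def operator_norm(M):
    """||M|| : largest absolute eigenvalue of a symmetric M."""
    return np.max(np.abs(np.linalg.eigvalsh(M)))

def off_diag(M):
    """M^- : M with its diagonal set to zero."""
    return M - np.diag(np.diag(M))

def diag_part(M):
    """M^+ : M with its off-diagonal entries set to zero."""
    return np.diag(np.diag(M))

M = np.array([[2.0, -1.0], [-1.0, 2.0]])
# l1_norm(M) = 6, frobenius(M) = sqrt(10), operator_norm(M) = 3
```

Note that M = M⁻ + M⁺ by construction, which is the decomposition the ℓ1 penalty below exploits (only M⁻ is penalized).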
2. Background and Problem Set-up
Let X = (X₁, X₂, ..., X_p) be a zero-mean p-dimensional random vector. The focus of this paper is the estimation of the covariance matrix Σ := E(XXᵀ) and its inverse Σ⁻¹ from a sample of independently and identically distributed data {X^(k)}, k = 1, ..., n. In this section we describe the background and the problem set-up more precisely.

The choice of loss function is crucial in any optimization problem: an optimal estimator for a particular loss function may not be optimal for another choice of loss function. Recent literature on covariance and inverse covariance matrix estimation mostly focuses on estimation based on either the likelihood function or the quadratic loss function [Friedman et al. (2008), Banerjee et al. (2008), Bickel and Levina (2008b), Ravikumar et al. (2011), Rothman et al. (2008), Maurya (2014)]. Maximum likelihood estimation requires a tractable probability distribution for the observations, whereas the quadratic loss function has no such requirement and is therefore fully non-parametric. The quadratic loss function is convex, and due to this analytical tractability it is a widely applicable choice for many data analysis problems.
Let S be the sample covariance matrix. Consider the following optimization problem:

Σ̂_{λ,γ} = arg min_{Σ=Σᵀ, tr(Σ)=tr(S)} [ ‖Σ − S‖²_F + λ‖Σ⁻‖₁ + γ Σ_{i=1}^p {σ_i(Σ) − σ̄_Σ}² ],   (2.1)

where σ̄_Σ is the mean of the eigenvalues of Σ, and λ and γ are positive constants. Note that through the penalty ‖Σ⁻‖₁ we only penalize the off-diagonal elements of Σ. The eigenvalues variance penalty term for eigen-spectrum shrinkage is chosen for the following reasons: i) it is easy to interpret, and ii) this choice of penalty function yields a very fast optimization algorithm. By the constraint tr(Σ) = tr(S), the total variation in Σ̂_{λ,γ} is the same as that in the sample covariance matrix S; however, the eigenvalues of Σ̂_{λ,γ} are better conditioned than those of S. From here onwards we suppress the dependence on λ, γ and denote Σ̂_{λ,γ} by Σ̂.

For γ = 0, the solution to (2.1) is the standard soft-thresholding estimator for the quadratic loss function, given by

Σ̂_ii = s_ii,   Σ̂_ij = sign(s_ij) max(|s_ij| − λ/2, 0),  i ≠ j.   (2.2)

It is clear from this expression that a sufficiently large value of λ will result in a sparse covariance matrix estimate. But the estimator Σ̂ of (2.2) is not necessarily positive definite [for more details see Xue et al. (2012)]. Moreover, it is hard to say whether it overcomes the over-dispersion of the sample eigenvalues. The eigenvalue plot in Figure 2.1 illustrates this phenomenon for a neighborhood-type covariance matrix (see Section 5) with sample size n = 50 and number of covariates p = 20. As is evident from Figure 2.1, the eigenvalues of the sample covariance matrix are over-dispersed, as most of them are either too large or close to zero. The eigenvalues of the proposed Joint Penalty (JPEN) estimator and of the PDSCE (Positive Definite Sparse Covariance matrix Estimator, Rothman (2012)) are well aligned with those of the true covariance matrix. Next, consider (2.1) with λ = 0.
For λ = 0, after some algebra, the minimizer Σ̂ of (2.1) is given by:

Σ̂ = (S + γ t I) / (1 + γ),   (2.3)

where I is the identity matrix and t = Σ_{i=1}^p S_ii / p. It follows that for any γ > 0,

σ_min(Σ̂) = σ_min(S + γ t I) / (1 + γ) ≥ γ t / (1 + γ) > 0.

Figure 2.1:
Comparison of eigenvalues of the true, sample, JPEN, and PDSCE covariance matrices.
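The two special cases (2.2) and (2.3) can be checked numerically. A minimal sketch (function names and sizes are my own; the λ/2 threshold comes from minimizing the quadratic loss entrywise):

```python
import numpy as np

def soft_threshold_cov(S, lam):
    """gamma = 0 case, eq. (2.2): soft-threshold the off-diagonal entries of S."""
    Sig = np.sign(S) * np.maximum(np.abs(S) - lam / 2.0, 0.0)
    np.fill_diagonal(Sig, np.diag(S).copy())   # diagonal is left untouched
    return Sig

def eigen_shrunk_cov(S, gamma):
    """lambda = 0 case, eq. (2.3): (S + gamma*t*I) / (1 + gamma), t = tr(S)/p."""
    p = S.shape[0]
    t = np.trace(S) / p
    return (S + gamma * t * np.eye(p)) / (1.0 + gamma)

rng = np.random.default_rng(1)
X = rng.standard_normal((20, 50))            # n = 20 << p = 50
S = X.T @ X / 20

Sig_sparse = soft_threshold_cov(S, lam=0.4)  # sparse, but maybe not PD
Sig_pd = eigen_shrunk_cov(S, gamma=0.5)      # PD: min eigenvalue >= gamma*t/(1+gamma)

print(np.linalg.eigvalsh(Sig_pd)[0] > 0.0)   # prints True: S is PSD and t > 0
```

The first estimator zeroes many off-diagonal entries but offers no eigenvalue control; the second is provably positive definite but leaves every entry non-zero.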
This means that the eigenvalues variance penalty improves S to a positive definite estimator Σ̂. However, although the estimator (2.3) is well-conditioned, it need not be sparse. Sparsity can be achieved by imposing an ℓ1 penalty on the entries of the covariance matrix. Simulations show that, in general, the minimizer of (2.1) is not positive definite for all values of λ > 0 and γ > 0. From here onwards we focus on correlation matrix estimation, and later generalize the method to covariance matrix estimation.

To obtain an estimator that is well-conditioned, sparse, and positive definite, we optimize the following objective function in R over a specific region of values of (λ, γ) which depends upon the sample correlation matrix K. Here the condition tr(Σ) = tr(S) reduces to tr(R) = p, and t = 1. Consider the following optimization problem:

R̂_K = arg min_{R=Rᵀ, tr(R)=p, (λ,γ) ∈ Ŝ_K} [ ‖R − K‖²_F + λ‖R⁻‖₁ + γ Σ_{i=1}^p {σ_i(R) − σ̄_R}² ],   (2.4)

where

Ŝ_K = { (λ, γ) : λ, γ > 0, λ ≍ γ ≍ √(log p / n), and for some ε > 0, σ_min{ (K + γI) − (λ/2) ∗ sign(K + γI) } > ε },

and σ̄_R is the mean of the eigenvalues of R. For instance, when K is a diagonal matrix, the set Ŝ_K is given by:

Ŝ_K = { (λ, γ) : λ, γ > 0, λ ≍ γ ≍ √(log p / n), and for some ε > 0, λ < 2(1 + γ − ε) }.

The minimization in (2.4) over R is for fixed (λ, γ) ∈ Ŝ_K. The proposed estimator of the covariance matrix (based on the regularized correlation matrix estimator R̂_K) is given by

Σ̂_K = (S⁺)^{1/2} R̂_K (S⁺)^{1/2},

where S⁺ is the diagonal matrix of the diagonal elements of S. Furthermore, Lemmas 3.1 and 3.2, respectively, show that the objective function in (2.4) is convex and that the estimator given by (2.4) is positive definite.

The main contributions are the following: i) The proposed estimators are both sparse and well-conditioned simultaneously. This approach allows one to take advantage of any prior structure known on the eigenvalues of the true covariance and inverse covariance matrices. ii)
We establish the theoretical consistency of the proposed estimators in both operator and Frobenius norm. The proposed JPEN estimators achieve the minimax rate of convergence in operator norm for the underlying class of sparse and well-conditioned covariance and inverse covariance matrices and are therefore optimal. iii)
The proposed algorithm is very fast, efficient, and easily scalable to large scale optimization problems.
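The rescaling step in the definition of Σ̂_K above is a simple congruence and can be made concrete; a minimal sketch with toy matrices (the R̂ here is a stand-in, not the actual JPEN estimate):

```python
import numpy as np

# Rescaling a regularized correlation matrix back to the covariance scale:
# Sigma_hat = (S+)^{1/2} R_hat (S+)^{1/2}, where S+ = diag of S.
S = np.array([[4.0, 1.2], [1.2, 9.0]])        # toy sample covariance
R_hat = np.array([[1.0, 0.1], [0.1, 1.0]])    # toy regularized correlation

D_half = np.diag(np.sqrt(np.diag(S)))         # (S+)^{1/2}
Sigma_hat = D_half @ R_hat @ D_half

# The congruence preserves positive definiteness of R_hat and restores the
# sample variances on the diagonal.
print(np.allclose(np.diag(Sigma_hat), np.diag(S)))   # prints True
```

This is why the paper can regularize on the correlation scale and still deliver a positive definite covariance estimate.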
3. Analysis of JPEN Method
Def:
A random vector X is said to have a sub-gaussian distribution if for each t ≥ 0 and every y ∈ ℝ^p with ‖y‖₂ = 1, there exists 0 < τ < ∞ such that

P{ |yᵀ(X − E(X))| > t } ≤ e^{−t²/2τ}.   (3.1)

Although the JPEN estimators exist for any finite 2 ≤ n < p < ∞, for theoretical consistency in operator norm we require s log p = o(n), and for Frobenius norm we require (p + s) log p = o(n), where s is the upper bound on the number of non-zero off-diagonal entries in the true covariance matrix. For more details, see the remark after Theorem 3.1. We make the following assumptions about the true covariance matrix Σ₀. A0.
Let X := (X₁, X₂, ..., X_p) be a mean-zero vector with covariance matrix Σ₀ such that each X_i/√(Σ₀,ii) has a sub-gaussian distribution with parameter τ as defined in (3.1). A1.
With E = {(i, j) : Σ₀,ij ≠ 0, i ≠ j}, we have |E| ≤ s for some positive integer s. A2.
There exists a finite positive real number k̄ > 0 such that 1/k̄ ≤ σ_min(Σ₀) ≤ σ_max(Σ₀) ≤ k̄.

Assumption A2 guarantees that the true covariance matrix Σ₀ is well-conditioned (i.e. all its eigenvalues are finite and positive). Well-conditioned means [Ledoit and Wolf (2004)] that inverting the matrix does not explode the estimation error. Assumption A1 is more of a definition, saying that the number of non-zero off-diagonal elements is bounded by some positive integer. Theorem 3.1 gives the rate of convergence of the proposed correlation-based covariance matrix estimator (2.4). The following lemmas show that the optimization problem in (2.4) is convex and that the proposed JPEN estimator (2.4) is positive definite. Lemma 1.
The optimization problem in (2.4) is convex.
Lemma 2.
The estimator given by (2.4) is positive definite for any 2 ≤ n < ∞ and p < ∞. Theorem 3.1.
Let (λ, γ) ∈ Ŝ_K and let Σ̂_K be as defined in (2.4). Under Assumptions A0, A1, and A2,

‖R̂_K − R₀‖_F = O_P(√(s log p / n))   and   ‖Σ̂_K − Σ₀‖ = O_P(√((s + 1) log p / n)),   (3.2)

where R₀ is the true correlation matrix. Remark: 1.
The JPEN estimator Σ̂_K is minimax optimal under the operator norm. In Cai et al. (2015), the authors obtain the minimax rate of convergence in the operator norm of their covariance matrix estimator for the particular construction of the parameter space

H(c_{n,p}) := { Σ : max_{1≤i≤p} Σ_{j=1}^p I{σ_ij ≠ 0} ≤ c_{n,p} }.

They show that this rate in operator norm is c_{n,p} √(log p / n), which is the same as that of Σ̂_K for 1 ≤ c_{n,p} = √s. Bickel and Levina (2008a) proved that under the assumption Σ_{j=1}^p |σ_ij|^q ≤ c(p) for some 0 ≤ q ≤
1, the hard thresholding estimator of the sample covariance matrix with tuning parameter λ ≍ √((log p)/n) is consistent in operator norm at a rate no worse than O_P(c(p) ((log p)/n)^{(1−q)/2}), where c(p) is the upper bound on the number of non-zero elements in each row. Here the truly sparse case corresponds to q = 0. The rate of convergence of Σ̂_K is the same as that of Bickel and Levina (2008a) except in the following cases: Case (i)
The covariance matrix has all off-diagonal elements zero except the last row, which has √p non-zero elements. Then c(p) = √p and √s = √(√p − 1). The operator norm rate of convergence for the JPEN estimator is O_P(√(√p (log p)/n)), whereas the rate for Bickel and Levina's estimator is O_P(√(p (log p)/n)). Case (ii)
When the true covariance matrix is tridiagonal, we have c(p) = 2 and s = 2p − 2. The JPEN estimator has rate √(p log p / n), whereas Bickel and Levina's estimator has rate √(log p / n). For the case √s ≍ c(p), JPEN has the same rate of convergence as Bickel and Levina's estimator.

The operator norm rate of convergence is much faster than the Frobenius norm rate. This is due to the fact that Frobenius norm convergence is in terms of all the eigenvalues of the covariance matrix, whereas the operator norm gives the convergence of the estimators in terms of the largest eigenvalue. The proposed estimator is applicable to estimate any non-negative definite covariance matrix.

Note that the estimator Σ̂_K is obtained by regularization of the sample correlation matrix in (2.4). In some applications it is desirable to directly regularize the sample covariance matrix. The JPEN estimator of the covariance matrix based on regularization of the sample covariance matrix is obtained by solving the following optimization problem:

Σ̂_S = arg min_{Σ=Σᵀ, tr(Σ)=tr(S), (λ,γ) ∈ Ŝ_S} [ ‖Σ − S‖²_F + λ‖Σ⁻‖₁ + γ Σ_{i=1}^p {σ_i(Σ) − σ̄_Σ}² ],   (3.3)

where

Ŝ_S = { (λ, γ) : λ, γ > 0, λ ≍ γ ≍ √(log p / n), and for some ε > 0, σ_min{ (S + γtI) − (λ/2) ∗ sign(S + γtI) } > ε },

and S is the sample covariance matrix. The minimization in (3.3) over Σ is for fixed (λ, γ) ∈ Ŝ_S. The estimator Σ̂_S is positive definite and well-conditioned. Theorem 3.2 gives its rate of convergence in Frobenius norm. Theorem 3.2.
Let (λ, γ) ∈ Ŝ_S, and let Σ̂_S be as defined in (3.3). Under Assumptions A0, A1, and A2,

‖Σ̂_S − Σ₀‖_F = O_P(√((s + p) log p / n)).   (3.4)

As noted in Rothman (2012), the worst part of the convergence here comes from estimating the diagonal entries.

A modification of the estimator R̂_K is obtained by adding positive weights to the terms (σ_i(R) − σ̄_R)². This leads to a weighted eigenvalues variance penalty, with larger weights amounting to greater shrinkage towards the center and vice versa. Note that the choice of weights allows one to use any prior structure on the eigenvalues (if known) in estimating the covariance matrix. The weighted JPEN correlation matrix estimator R̂_A is given by:

R̂_A = arg min_{R=Rᵀ, tr(R)=p, (λ,γ) ∈ Ŝ_{K,A}} [ ‖R − K‖²_F + λ‖R⁻‖₁ + γ Σ_{i=1}^p a_i {σ_i(R) − σ̄_R}² ],   (3.5)

where

Ŝ_{K,A} = { (λ, γ) : λ ≍ γ ≍ √(log p / n), λ ≤ (2 σ_min(K)) (1 + γ max_i(A_ii))⁻¹ / ((1 + γ min_i(A_ii))⁻¹ p + γ min_i(A_ii) p) },

and A = diag(A₁₁, A₂₂, ..., A_pp) with A_ii = a_i. The proposed covariance matrix estimator is Σ̂_{K,A} = (S⁺)^{1/2} R̂_A (S⁺)^{1/2}. The optimization problem in (3.5) is convex and yields a positive definite estimator for each (λ, γ) ∈ Ŝ_{K,A}. A simple exercise shows that the estimator Σ̂_{K,A} has the same rate of convergence as Σ̂_S.

We now extend the JPEN approach to estimate a well-conditioned and sparse inverse covariance matrix. As with covariance matrix estimation, we first propose an estimator of the inverse covariance matrix based on a regularized inverse correlation matrix and discuss its rate of convergence in Frobenius and operator norm.
Notation:
We shall use Z and Ω for the inverse correlation and inverse covariance matrices, respectively. Assumptions:
We make the following assumptions about the true inverse covariance matrix Ω₀. Let Σ₀ = Ω₀⁻¹. B0.
Same as assumption A0. B1.
With H = {(i, j) : Ω₀,ij ≠ 0, i ≠ j}, we have |H| ≤ s for some positive integer s. B2.
There exists 0 < k̄ < ∞ large enough such that (1/k̄) ≤ σ_min(Ω₀) ≤ σ_max(Ω₀) ≤ k̄.

Let R̂_K be the JPEN estimator of the true correlation matrix. By Lemma 3.2, R̂_K is positive definite. Define the JPEN estimator of the inverse correlation matrix as the solution to the following optimization problem:

Ẑ_K = arg min_{Z=Zᵀ, tr(Z)=tr(R̂_K⁻¹), (λ,γ) ∈ Ŝ_K} [ ‖Z − R̂_K⁻¹‖²_F + λ‖Z⁻‖₁ + γ Σ_{i=1}^p {σ_i(Z) − σ̄_Z}² ],   (3.6)

where

Ŝ_K = { (λ, γ) : λ, γ > 0, λ ≍ γ ≍ √(log p / n), and for some ε > 0, σ_min{ (R̂_K⁻¹ + γ t I) − (λ/2) ∗ sign(R̂_K⁻¹ + γ t I) } > ε },

and t is the average of the diagonal elements of R̂_K⁻¹. The minimization in (3.6) over Z is for fixed (λ, γ) ∈ Ŝ_K. The proposed JPEN estimator of the inverse covariance matrix (based on the regularized inverse correlation matrix estimator Ẑ_K) is given by Ω̂_K = (S⁺)^{−1/2} Ẑ_K (S⁺)^{−1/2}, where S⁺ is the diagonal matrix of the diagonal elements of S. Moreover, (3.6) is a convex optimization problem and Ẑ_K is positive definite.

Next we state the consistency of the estimators Ẑ_K and Ω̂_K. Theorem 3.3.
Under Assumptions B0, B1, B2 and for (λ, γ) ∈ Ŝ_K,

‖Ẑ_K − R₀⁻¹‖_F = O_P(√(s log p / n))   and   ‖Ω̂_K − Ω₀‖ = O_P(√((s + 1) log p / n)),   (3.7)

where R₀⁻¹ is the inverse of the true correlation matrix. Remark: 1.
Note that the JPEN estimator Ω̂_K achieves the minimax rate of convergence for the class of inverse covariance matrices satisfying assumptions B0, B1, and B2. 2. Similar to Σ̂_S, we define an estimator of the inverse covariance matrix based on regularizing the inverse of Σ̂_S. Consider the following optimization problem:

Ω̂_S = arg min_{Ω=Ωᵀ, tr(Ω)=tr(Σ̂_S⁻¹), (λ,γ) ∈ Ŝ_S} [ ‖Ω − Σ̂_S⁻¹‖²_F + λ‖Ω⁻‖₁ + γ Σ_{i=1}^p {σ_i(Ω) − σ̄_Ω}² ],   (3.8)

where

Ŝ_S = { (λ, γ) : λ, γ > 0, λ ≍ γ ≍ √(log p / n), and for some ε > 0, σ_min{ (Σ̂_S⁻¹ + γ t I) − (λ/2) ∗ sign(Σ̂_S⁻¹ + γ t I) } > ε },

and t is the average of the diagonal elements of Σ̂_S⁻¹. The minimization in (3.8) over Ω is for fixed (λ, γ) ∈ Ŝ_S. The estimator in (3.8) is positive definite and well-conditioned. The consistency of the estimator Ω̂_S is given in the following theorem. Theorem 3.4.
Let (λ, γ) ∈ Ŝ_S and let Ω̂_S be as defined in (3.8). Under Assumptions B0, B1, and B2,

‖Ω̂_S − Ω₀‖_F = O_P(√((s + p) log p / n)).   (3.9)

Similar to the weighted JPEN covariance matrix estimator Σ̂_{K,A}, a weighted JPEN estimator of the inverse covariance matrix is obtained by adding positive weights a_i to the terms (σ_i(Z) − σ̄_Z)² in (3.6). The weighted JPEN estimator is Ω̂_{K,A} := (S⁺)^{−1/2} Ẑ_A (S⁺)^{−1/2}, where

Ẑ_A = arg min_{Z=Zᵀ, tr(Z)=tr(R̂_K⁻¹), (λ,γ) ∈ Ŝ_{K,A}} [ ‖Z − R̂_K⁻¹‖²_F + λ‖Z⁻‖₁ + γ Σ_{i=1}^p a_i {σ_i(Z) − σ̄_Z}² ],   (3.10)

with

Ŝ_{K,A} = { (λ, γ) : λ ≍ γ ≍ √(log p / n), λ ≤ (2 σ_min(R̂_K⁻¹)) (1 + γ t max_i(A_ii))⁻¹ / ((1 + γ min_i(A_ii))⁻¹ p + γ min_i(A_ii) p) },

and A = diag(A₁₁, A₂₂, ..., A_pp) with A_ii = a_i. The optimization problem in (3.10) is convex and yields a positive definite estimator for (λ, γ) ∈ Ŝ_{K,A}. A simple exercise shows that the estimator Ẑ_A has a rate of convergence similar to that of Ẑ_K.
4. An Algorithm
The optimization problem (2.4) can be written as:

R̂_K = arg min_{R=Rᵀ, (λ,γ) ∈ Ŝ_K} f(R),   (4.1)

where f(R) = ‖R − K‖²_F + λ‖R⁻‖₁ + γ Σ_{i=1}^p {σ_i(R) − σ̄_R}². Note that Σ_{i=1}^p {σ_i(R) − σ̄_R}² = tr(R²) − 2 tr(R) + p, where we have used the constraint tr(R) = p (so that σ̄_R = 1). Therefore,

f(R) = ‖R − K‖²_F + λ‖R⁻‖₁ + γ tr(R²) − 2γ tr(R) + γp
     = (1 + γ) tr(R²) − 2 tr{R(K + γI)} + tr(KᵀK) + λ‖R⁻‖₁ + γp
     = (1 + γ) { tr(R²) − (2/(1 + γ)) tr{R(K + γI)} } + tr(KᵀK) + λ‖R⁻‖₁ + γp
     = (1 + γ) ‖R − (K + γI)/(1 + γ)‖²_F + λ‖R⁻‖₁ + C,

where C collects the terms that do not depend on R. The solution of (4.1) is the soft-thresholding estimator given by:

R̂_K = (1/(1 + γ)) sign(K) ∗ pmax{ abs(K + γI) − λ/2, 0 },   (4.2)

with (R̂_K)_ii = (K_ii + γ)/(1 + γ), where pmax(A, b)_ij := max(A_ij, b) is the elementwise maximum function applied to each entry of the matrix A, and ∗ denotes the elementwise product. Note that for each (λ, γ) ∈ Ŝ_K, R̂_K is positive definite.

Choice of λ and γ: For a given value of γ, we can find a value of λ satisfying

σ_min{ (K + γI) − (λ/2) ∗ sign(K + γI) } > 0,

which holds for

λ < σ_min(K + γI) / (C σ_max(sign(K))),   for some C ≥ 0.5.

For such a choice of (λ, γ) ∈ Ŝ_K, the estimator R̂_K is positive definite. Smaller values of C yield a solution which is more sparse but may not be positive definite.
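The closed form (4.2), together with the admissibility rule for λ, gives a direct implementation. A sketch (the function names and the conservative safety constant C = 2 are my own choices; the λ/2 threshold follows the derivation above):

```python
import numpy as np

def jpen_correlation(K, lam, gam):
    """Closed-form JPEN correlation estimator, eq. (4.2)."""
    p = K.shape[0]
    M = K + gam * np.eye(p)
    R = np.sign(M) * np.maximum(np.abs(M) - lam / 2.0, 0.0) / (1.0 + gam)
    # diagonal entries are set to (K_ii + gamma) / (1 + gamma), untouched by lambda
    np.fill_diagonal(R, (np.diag(K) + gam) / (1.0 + gam))
    return R

def choose_lambda(K, gam, C=2.0):
    """Sufficient upper bound on lambda for positive definiteness
    (the Section 4 rule; C >= 0.5, larger C is more conservative)."""
    smin = np.linalg.eigvalsh(K + gam * np.eye(K.shape[0]))[0]
    smax_sign = np.max(np.abs(np.linalg.eigvalsh(np.sign(K))))
    return smin / (C * smax_sign)

rng = np.random.default_rng(2)
X = rng.standard_normal((40, 60))               # n = 40, p = 60
S = X.T @ X / 40
d = np.sqrt(np.diag(S))
K = S / np.outer(d, d)                          # sample correlation matrix

gam = np.sqrt(np.log(60) / 40)                  # gamma ~ sqrt(log p / n)
lam = choose_lambda(K, gam)
R = jpen_correlation(K, lam, gam)
Sigma_hat = np.outer(d, d) * R                  # Sigma = (S+)^{1/2} R (S+)^{1/2}

print(np.linalg.eigvalsh(R)[0] > 0.0)           # positive definite estimate
```

There is no iteration anywhere: one eigen-decomposition for the λ check plus elementwise operations, which is what makes the method scale.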
For the optimization problem in (3.5), the weights are chosen in the following way. Let E be the vector of sorted diagonal elements of the sample covariance matrix S. i) Let k be the largest index of E such that the k-th element of E is less than 1. For i ≤ k, set a_i = E_i; for k < i ≤ p, set a_i = 1/E_i. ii) Set A = diag(a₁, a₂, ..., a_p), where each a_j is replaced by a_j / Σ_{i=1}^p a_i. Such a choice of weights allows more shrinkage of the extreme sample eigenvalues than of the ones in the center of the eigen-spectrum.
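The weight recipe above can be written directly (a sketch; `jpen_weights` is my own name):

```python
import numpy as np

def jpen_weights(S):
    """Weights a_i for the weighted penalty in (3.5): sort diag(S), keep
    entries below 1 as-is, invert entries above 1, then normalize so the
    weights sum to one."""
    E = np.sort(np.diag(S))
    a = np.where(E < 1.0, E, 1.0 / E)
    return a / a.sum()

# toy diagonal: variances 0.25 and 0.5 (below 1), 2 and 4 (above 1)
S_toy = np.diag([4.0, 0.25, 2.0, 0.5])
w = jpen_weights(S_toy)
print(w)
```

With these toy values the unnormalized weights are (0.25, 0.5, 0.5, 0.25), so after normalization the two extreme variances share the same weight, as do the two central ones.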
To get an expression for the inverse covariance matrix estimate, we replace K by R̂_K^{-1} in (4.2), where R̂_K is a JPEN estimator of the correlation matrix. We choose (λ, γ) ∈ Ŝ_K: for a given γ, we choose λ satisfying

σ_min{(R̂_K^{-1} + γI) − (λ/2) ∗ sign(R̂_K^{-1} + γI)} > 0, i.e., λ < σ_min(R̂_K^{-1} + γI) / (C σ_max(sign(R̂_K^{-1}))).

The JPEN estimator Σ̂_K has computational complexity of O(p^2), as there are at most 3p^2 multiplications for computing the estimator Σ̂_K. The other existing algorithms Glasso (Friedman et al. (2008)) and PDSCE (Rothman (2012)) have computational complexity of O(p^3). We compare the computational timing of our algorithm to these other existing algorithms, Glasso (Friedman et al. (2008)) and PDSCE (Rothman (2012)). The exact timing of these algorithms also depends upon the implementation, platform, etc. (we did our computations in R on an AMD 2.8GHz processor). Following the approach of Bickel and Levina (2008a), the optimal tuning parameters (λ, γ) were obtained by minimizing the 5-fold cross validation error

(1/5) Σ_{i=1}^5 ||Σ̂_i^v − Σ_{-i}^v||^2,

where Σ̂_i^v is the JPEN estimate of the covariance matrix based on the v = 4n/5 training observations, and Σ_{-i}^v is the sample covariance matrix using the remaining n/5 observations. Figure 4.1 illustrates the total computational time taken to estimate the covariance matrix by the Glasso, PDSCE, and JPEN algorithms for different values of p, for the Toeplitz type of covariance matrix of §5, on a log-log scale. For each (λ, γ) ∈ Ŝ_K, our algorithm is very fast and easily scalable to large scale data analysis problems.
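The cross-validation criterion above can be sketched as a plain grid search. Everything in this snippet (the fold splitting, the grid, and the single-pass soft-threshold stand-in for the JPEN fit) is our illustrative scaffolding rather than the paper's R implementation.

```python
import numpy as np

def jpen_like_fit(S, lam, gam):
    """Stand-in soft-threshold fit on the covariance scale
    (our simplification of the JPEN estimate)."""
    T = np.sign(S) * np.maximum(np.abs(S) - lam / 2.0, 0.0) / (1.0 + gam)
    np.fill_diagonal(T, (np.diag(S) + gam) / (1.0 + gam))
    return T

def cv_error(X, lam, gam, folds=5, seed=0):
    """5-fold criterion: fit on 4/5 of the rows, compare in squared
    Frobenius norm with the sample covariance of the held-out 1/5."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    err = 0.0
    for fold in np.array_split(idx, folds):
        train = np.setdiff1d(idx, fold)
        S_train = np.cov(X[train], rowvar=False)
        S_test = np.cov(X[fold], rowvar=False)
        err += np.sum((jpen_like_fit(S_train, lam, gam) - S_test) ** 2)
    return err / folds

def select_tuning(X, lams, gams):
    """Grid search for the (lam, gam) pair minimizing the CV error."""
    return min(((l, g) for l in lams for g in gams),
               key=lambda lg: cv_error(X, lg[0], lg[1]))
```

The closed-form fit makes each grid point cheap, which is why a dense (λ, γ) grid remains feasible even for large p.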
5. Simulation Results
We compare the performance of the proposed method to other existing methods on simulated data for five types of covariance and inverse covariance matrices.

(i) Hub Graph: Here the rows/columns of Σ are partitioned into J equally-sized disjoint groups {V_1 ∪ V_2 ∪ ... ∪ V_J} = {1, 2, ..., p}, and each group is associated with a pivotal row k. Let |V_1| = s. We set σ_{i,j} = σ_{j,i} = ρ for i ∈ V_k and σ_{i,j} = σ_{j,i} = 0 otherwise. In our experiment, J = [p/s], k = 1, s + 1, 2s + 1, ..., and we always take ρ = 1/(s + 1) with J = 20.

Figure 4.1: Timing comparison of JPEN, Glasso, and PDSCE.

(ii) Neighborhood Graph:
We first uniformly sample (y_1, y_2, ..., y_n) from a unit square. We then set σ_{i,j} = σ_{j,i} = ρ with probability (√(2π))^{-1} exp(−||y_i − y_j||^2). The remaining entries of Σ are set to be zero. The number of nonzero off-diagonal elements of each row or column is restricted to be smaller than [1/ρ], where ρ is set to be 0.245.

(iii) Toeplitz Matrix: We set σ_{i,j} = 2 for i = j; σ_{i,j} = |0.75|^{|i−j|} for |i − j| = 1, 2; and σ_{i,j} = 0 otherwise.

(iv) Block Diagonal Matrix: In this setting Σ is a block diagonal matrix with varying block size. For p = 500 the number of blocks is 4, and for p = 1000 the number of blocks is 6. Each block of the covariance matrix is taken to be a Toeplitz type matrix as in case (iii).

(v) Cov-I type Matrix: In this setting, we first simulate a random sample (y_1, y_2, ..., y_p) from the standard normal distribution. Let x_i = |y_i|^{3/2} ∗ (1 + (1/p) log(1 + 1/p)). Next we generate multivariate normal random vectors Z = (z_1, z_2, ..., z_p) with mean vector zero and identity covariance matrix. Let U be the matrix of eigenvectors of the sample covariance matrix of Z. We take Σ = U D U′, where D = diag(x_1, x_2, ..., x_p). This is not a sparse setting, but the covariance matrix has most of its eigenvalues close to zero, and hence it allows us to compare the performance of the various methods in a setting where most of the eigenvalues are close to zero and widely spread, as compared to the structured covariance matrices in (i)-(iv).

We chose a similar structure of Ω for the simulations. For all these choices of covariance and inverse covariance matrices, we generate random vectors from the multivariate normal distribution with varying n and p. We chose n = 50, 100 and p = 500, 1000. We compare the JPEN estimate Σ̂_K to the graphical lasso [Friedman et al. (2008)], the PDSC Estimate [Rothman (2012)], Bickel and Levina's thresholding estimator (BLThresh) [Bickel and Levina (2008a)], and the Ledoit-Wolf [Ledoit and Wolf (2004)] estimate of the covariance matrix. The JPEN estimate Σ̂_K was computed using R software (version 3.0.2). The graphical lasso estimate of the covariance matrix was computed using the R package "glasso" (http://statweb.stanford.edu/~tibs/glasso/). The Ledoit-Wolf estimate was obtained using code from http://econ.uzh.ch/faculty/wolf/publications.html.
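Two of these simulation covariances are easy to reproduce in a few lines; the sketch below builds the hub and banded Toeplitz matrices as described. The function names are ours, and the 0.75 off-diagonal decay value is an assumption.

```python
import numpy as np

def hub_cov(p, s):
    """Hub graph covariance: p // s disjoint groups of size s,
    each with pivotal row k = 0, s, 2s, ... (0-based indexing),
    and rho = 1/(s + 1) linking group members to the pivot."""
    rho = 1.0 / (s + 1)
    sigma = np.eye(p)
    for k in range(0, p - s + 1, s):      # pivotal rows
        for i in range(k + 1, k + s):     # other members of the group
            sigma[i, k] = sigma[k, i] = rho
    return sigma

def toeplitz_cov(p, decay=0.75, width=2):
    """Banded Toeplitz covariance: 2 on the diagonal, decay**|i-j|
    for 1 <= |i-j| <= width, and 0 elsewhere."""
    sigma = 2.0 * np.eye(p)
    for d in range(1, width + 1):
        band = (decay ** d) * np.ones(p - d)
        sigma += np.diag(band, k=d) + np.diag(band, k=-d)
    return sigma
```

Both matrices are positive definite for these parameter values, which can be verified with np.linalg.eigvalsh before sampling from the corresponding multivariate normal.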
Table 5.1: Covariance Matrix Estimation
Block type covariance matrix
                 n=50                          n=100
                 p=500          p=1000         p=500           p=1000
Ledoit-Wolf      1.54(0.102)    2.96(0.0903)   4.271(0.0394)   2.18(0.11)
Glasso           0.322(0.0235)  3.618(0.073)   0.227(0.098)    2.601(0.028)
PDSCE            3.622(0.231)   4.968(0.017)   1.806(0.21)     2.15(0.01)
BLThresh         2.747(0.093)   3.131(0.122)   0.887(0.04)     0.95(0.03)
JPEN             2.378(0.138)   3.203(0.144)   1.124(0.088)    2.879(0.011)

Hub type covariance matrix
                 n=50                          n=100
                 p=500          p=1000         p=500           p=1000
Ledoit-Wolf      2.13(0.103)    2.43(0.043)    1.07(0.165)     3.47(0.0477)
Glasso           0.511(0.047)   0.551(0.005)   0.325(0.053)    0.419(0.003)
PDSCE            0.735(0.106)   0.686(0.006)   0.36(0.035)     0.448(0.002)
BLThresh         1.782(0.047)   2.389(0.036)   0.875(0.102)    1.82(0.027)
JPEN             0.732(0.111)   0.688(0.006)   0.356(0.058)    0.38(0.007)

Neighborhood type covariance matrix
                 n=50                          n=100
                 p=500          p=1000         p=500           p=1000
Ledoit-Wolf      1.36(0.054)    2.89(0.028)    1.1(0.0331)     2.32(0.0262)
Glasso           0.608(0.054)   0.63(0.005)    0.428(0.047)    0.419(0.038)
PDSCE            0.373(0.085)   0.468(0.007)   0.11(0.056)     0.175(0.005)
BLThresh         1.526(0.074)   2.902(0.033)   0.870(0.028)    1.7(0.026)
JPEN             0.454(0.0423)  0.501(0.018)   0.086(0.045)    0.169(0.003)

Toeplitz type covariance matrix
                 n=50                          n=100
                 p=500          p=1000         p=500           p=1000
Ledoit-Wolf      1.526(0.074)   2.902(0.033)   1.967(0.041)    2.344(0.028)
Glasso           2.351(0.156)   3.58(0.079)    1.78(0.087)     2.626(0.019)
PDSCE            3.108(0.449)   5.027(0.016)   0.795(0.076)    2.019(0.01)
BLThresh         0.858(0.040)   1.206(0.059)   0.703(0.039)    1.293(0.018)
JPEN             2.517(0.214)   3.205(0.16)    1.182(0.084)    2.919(0.011)

Cov-I type covariance matrix
                 n=50                          n=100
                 p=500          p=1000         p=500           p=1000
Ledoit-Wolf      33.2(0.04)     36.7(0.03)     36.2(0.03)      48.0(0.03)
Glasso           15.4(0.25)     16.1(0.4)      14.0(0.03)      14.9(0.02)
PDSCE            16.5(0.05)     16.33(0.04)    16.9(0.03)      17.5(0.02)
BLThresh         15.7(0.04)     17.1(0.03)     13.4(0.02)      17.5(0.02)
JPEN             7.1(0.042)     11.5(0.07)     8.4(0.042)      7.8(0.034)

For the inverse covariance matrix, we include glasso, CLIME (Cai et al. (2011)), and PDSCE. For each covariance and inverse covariance matrix estimate, we calculate the Average Relative Error (ARE) based on 50 iterations using the following formula:
ARE(Σ, Σ̂) = |log f(S, Σ̂) − log f(S, Σ)| / |log f(S, Σ)|,

where f(S, ·) is the multivariate normal density given the sample covariance matrix S, Σ is the true covariance matrix, and Σ̂ is the estimate of Σ based on one of the methods under consideration. Other choices of performance criteria are the Kullback-Leibler loss used by Yuan and Lin (2007) and Bickel and Levina (2008a). The optimal values of the tuning parameters were obtained over a grid of values by minimizing the 5-fold cross-validation error as explained in §4. The average relative errors and their standard deviations (in percentage) for the covariance and inverse covariance matrix estimates are given in Table 5.1 and Table 5.2, respectively. The numbers in brackets are the standard errors of the relative error based on the estimates using the different methods. Among all the methods, JPEN and PDSCE perform similarly for most choices of n and p for all five types of covariance matrices. This is due to the fact that both PDSCE and JPEN minimize a quadratic loss function, with different penalty functions.

Table 5.2: Inverse Covariance Matrix Estimation

Block type covariance matrix
             n=50                          n=100
             p=500          p=1000         p=500           p=1000
Glasso       4.144(0.523)   1.202(0.042)   0.168(0.136)    1.524(0.028)
PDSCE        1.355(0.497)   1.201(0.044)   0.516(0.196)    0.558(0.032)
CLIME        4.24(0.23)     6.56(0.25)     6.88(0.802)     10.64(0.822)
JPEN         1.248(0.33)    1.106(0.029)   0.562(0.183)    0.607(0.03)

Hub type covariance matrix
             n=50                          n=100
             p=500          p=1000         p=500           p=1000
Glasso       1.122(0.082)   0.805(0.007)   0.07(0.038)     0.285(0.004)
PDSCE        0.717(0.108)   0.702(0.007)   0.358(0.046)    0.356(0.005)
CLIME        10.5(0.329)    10.6(0.219)    6.98(0.237)     10.8(0.243)
JPEN         0.684(0.051)   0.669(0.003)   0.34(0.024)     0.337(0.002)

Neighborhood type covariance matrix
             n=50                          n=100
             p=500          p=1000         p=500           p=1000
Glasso       1.597(0.109)   0.879(0.013)   1.29(0.847)     0.428(0.007)
PDSCE        0.587(0.13)    0.736(0.014)   0.094(0.058)    0.288(0.01)
CLIME        10.5(0.535)    11.5(0.233)    10.5(0.563)     11.5(0.245)
JPEN         0.551(0.075)   0.691(0.008)   0.066(0.042)    0.201(0.007)

Toeplitz type covariance matrix
             n=50                          n=100
             p=500          p=1000         p=500           p=1000
Glasso       2.862(0.475)   2.89(0.048)    2.028(0.267)    2.073(0.078)
PDSCE        1.223(0.5)     1.238(0.065)   0.49(0.269)     0.473(0.061)
CLIME        4.91(0.22)     7.597(0.34)    5.27(1.14)      8.154(1.168)
JPEN         1.151(0.333)   2.718(0.032)   0.607(0.196)    2.569(0.057)

Cov-I type covariance matrix
             n=50                          n=100
             p=500          p=1000         p=500           p=1000
Glasso       54.0(0.19)     190.0(5.91)    14.7(0.37)      49.9(0.08)
PDSCE        28.8(0.19)     45.8(0.32)     16.9(0.04)      26.9(0.08)
CLIME        59.8(0.82)     207.5(3.44)    15.4(0.03)      53.7(0.69)
JPEN         26.3(0.36)     7.0(0.07)      15.7(0.08)      23.5(0.3)
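Under a Gaussian working model, the ARE criterion reduces to comparing log-likelihoods evaluated at the two covariance matrices. The sketch below is our reading of the formula; the log-likelihood is computed up to additive constants, and the function names are ours.

```python
import numpy as np

def gauss_loglik(S, sigma, n):
    """Gaussian log-likelihood (up to additive constants) of a
    sample of size n with sample covariance S, under covariance
    matrix sigma: -n/2 * (log det(sigma) + tr(sigma^{-1} S))."""
    _, logdet = np.linalg.slogdet(sigma)
    return -0.5 * n * (logdet + np.trace(np.linalg.solve(sigma, S)))

def are(S, sigma_hat, sigma_true, n):
    """Average Relative Error on the log-likelihood scale:
    |l(S, sigma_hat) - l(S, sigma_true)| / |l(S, sigma_true)|."""
    l_hat = gauss_loglik(S, sigma_hat, n)
    l_true = gauss_loglik(S, sigma_true, n)
    return abs(l_hat - l_true) / abs(l_true)
```

In a simulation loop, this quantity would be averaged over the 50 replications for each method and each (n, p) combination.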
The behavior of Bickel and Levina's estimator is quite good in the Toeplitz case, where it performs better than the other methods. For this type of covariance matrix, the entries decay to zero away from the diagonal, and therefore soft-thresholding estimators like BLThresh perform better in this setting. However, for the neighborhood and hub type covariance matrices, which are not necessarily of banded type, the Bickel and Levina estimator is not a natural choice, as it would fail to recover the underlying sparsity pattern. The performance of the Ledoit-Wolf estimator is not very encouraging for the Cov-I type matrix. The Ledoit-Wolf estimator shrinks the sample covariance matrix towards the identity, and hence the eigenvalue estimates are heavily shrunk towards one. This is also visible in the eigenvalue plots in Figure 5.2 and Figure 5.3. For the Cov-I type covariance matrix, where most of the eigenvalues are close to zero and widely spread, the performance of the JPEN estimator is impressive. The eigenvalue plot in Figure 5.3 shows that, among all the methods, the eigenvalue estimates of the JPEN estimator are most consistent with the true eigenvalues. This clearly shows the advantage of the JPEN estimator of the covariance matrix when the true eigenvalues are dispersed or close to zero. The eigenvalue plot in Figure 5.2 shows that when the eigen-spectrum of the true covariance matrix is not highly dispersed, the JPEN and PDSCE estimates of the eigenvalues are almost the same. This phenomenon is also apparent in Figure 2.1. Also, the Ledoit-Wolf estimator heavily shrinks the eigenvalues towards the center and thus underestimates the true eigen-spectrum.

Figure 5.1: Heatmap of zeros identified in the covariance matrix out of 50 realizations. White color is 50/50 zeros identified, black color is 0/50 zeros identified. (Panels: hub type and neighborhood type covariance matrices.)

For the inverse covariance matrix, we compare the glasso, CLIME, and PDSCE estimates with the proposed JPEN estimator. The JPEN estimator Ω̂_K outperforms the other methods for most of the choices of n and p for all five types of inverse covariance matrices. Additional simulations (not included here) show that for n ≈ p, all the underlying methods perform similarly and the estimates of their eigenvalues are also well aligned with the true values. However, in the high dimensional setting, for large p and small n, their performance differs, as seen in the simulations of Table 5.1 and Table 5.2. Figure 5.1 shows the recovery of the non-zero and zero entries of the true covariance matrix based on the JPEN estimator Σ̂_K over 50 realizations. The estimator recovers the true zeros about 90% of the time for the hub and neighborhood type covariance matrices. It also reflects the recovery of the true structure of the non-zero entries and the actual pattern among the rows/columns of the covariance matrix. To see the effect of the eigenvalue shrinkage penalty as compared to other methods, we plot (Figure 5.2) the eigenvalues of the estimated covariance matrix for n = 100, p = 50 for the neighborhood type covariance matrix. The JPEN estimates of the eigen-spectrum are well aligned with the true ones, the closest being the PDSC estimates of the eigenvalues.

Figure 5.2: Eigenvalues plot for n = 100, p = 50 based on 50 realizations for the neighborhood type covariance matrix (eigenvalue index vs. eigenvalues; methods: True, Glasso, PDSCE, Ledoit-Wolf, JPEN).

Figure 5.3 shows the recovery of the eigenvalues based on the estimates from the different methods for the Cov-I type covariance matrix. For this particular simulation, the eigenvalues are chosen differently than as described in (v) of §5. The eigenvalues of the true covariance matrix are taken to be very diverse, with a maximum of about 10 and the smallest eigenvalue close to zero. For the Cov-I type of matrix, the JPEN estimates of the eigenvalues are better than those of the other methods.

Figure 5.3: Eigenvalues plot (log-log scale) for n = 100, p = 100 based on 50 realizations for the Cov-I type matrix (eigenvalue index vs. eigenvalues; methods: True, Glasso, PDSCE, Ledoit-Wolf, JPEN).
6. Colon Tumor Classification Example
In this section, we compare the performance of the JPEN estimator of the inverse covariance matrix for tumor classification using Linear Discriminant Analysis (LDA). The gene expression data (Alon et al. (1999)) consists of 40 tumorous and 22 non-tumorous adenocarcinoma tissue samples. After preprocessing, the data was reduced to a subset of 2,000 gene expression values with the largest minimal intensity over the 62 tissue samples (source: http://genomics-pubs.princeton.edu/oncology/affydata/index.html). In our analysis, we reduced the number of genes by selecting the p most significant genes based on logistic regression. We obtain estimates of the inverse covariance matrix for p = 50, 100, 200 and then use LDA to classify these tissues as either tumorous or non-tumorous (normal). We classify each test observation x to either class k = 0 or k = 1 using the LDA rule

δ_k(x) = arg max_k { x^T Ω̂ μ̂_k − (1/2) μ̂_k^T Ω̂ μ̂_k + log(π̂_k) },   (6.1)

where π̂_k is the proportion of class k observations in the training data, μ̂_k is the sample mean for class k on the training data, and Ω̂ := Σ̂^{-1} is an estimator of the inverse of the common covariance matrix computed on the training data. Tuning parameters λ and γ were chosen using 5-fold cross validation. To create training and test sets, we randomly split the data into a training and test set of sizes 42 and 20, respectively; following the approach used by Wang et al. (2007), the training set has 27 tumor samples and 15 non-tumor samples. We repeat the split at random 100 times and measure the average classification error. Since we do not have a separate validation set, we do the 5-fold cross validation on the training data. At each split, we divide the training data into 5 subsets (folds), where 4 subsets are used to estimate the covariance matrix and one subset is used to measure the classifier's performance. For each split, this procedure is repeated 5 times by taking one of the 5 subsets as the validation data. An optimal combination of λ and γ is obtained by minimizing the 5-fold cross validation error.

The average classification errors with standard errors over the 100 splits are presented in Table 6.1. Since the sample size is less than the number of genes, we omit the inverse sample covariance matrix, as it is not well defined, and instead include the naive Bayes and support vector machine classifiers. Naive Bayes has been shown to perform better than the sample covariance matrix in high-dimensional settings (Bickel and Levina (2004)). The support vector machine (SVM) is another popular tool for high dimensional classification. Among all the methods, the covariance matrix based LDA classifiers perform far better than naive Bayes, SVM, and logistic regression. For all other classifiers, the classification performance deteriorates with increasing p. For larger p, i.e., when more genes are added to the data set, the classification performance of the JPEN estimate based LDA classifier initially improves, but it deteriorates for large p. For p = 2000, the classifier based on the inverse covariance matrix has an accuracy of 30%. This is because, as the dimension of the covariance matrix increases, the estimator does not remain very informative.

Table 6.1: Averages and standard errors of classification errors over 100 replications, in %.

Method               p=50          p=100         p=200
Logistic Regression  21.0(0.84)    19.31(0.89)   21.5(0.85)
SVM                  16.70(0.85)   16.76(0.97)   18.18(0.96)
Naive Bayes          13.3(0.75)    14.33(0.85)   14.63(0.75)
Graphical Lasso      10.9(1.3)     9.4(0.89)     9.8(0.90)
Joint Penalty
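Rule (6.1) is a standard linear discriminant score. A minimal sketch follows, with a ridge-regularized inverse of the pooled sample covariance standing in for the JPEN estimate Ω̂; the helper names and the ridge value are our assumptions.

```python
import numpy as np

def lda_fit(X, y, ridge=0.1):
    """Estimate class means, priors, and a common inverse covariance
    (ridge-regularized pooled covariance as a stand-in for the JPEN
    inverse covariance estimator)."""
    classes = np.unique(y)
    mus = {k: X[y == k].mean(axis=0) for k in classes}
    pooled = sum(np.cov(X[y == k], rowvar=False) * (np.sum(y == k) - 1)
                 for k in classes) / (len(y) - len(classes))
    omega = np.linalg.inv(pooled + ridge * np.eye(X.shape[1]))
    priors = {k: np.mean(y == k) for k in classes}
    return classes, mus, priors, omega

def lda_predict(x, classes, mus, priors, omega):
    """Rule (6.1): delta_k(x) = x' Omega mu_k - 0.5 mu_k' Omega mu_k + log pi_k;
    assign x to the class with the largest score."""
    scores = [x @ omega @ mus[k] - 0.5 * mus[k] @ omega @ mus[k]
              + np.log(priors[k]) for k in classes]
    return classes[int(np.argmax(scores))]
```

In the experiment above, the stand-in omega would be replaced by the JPEN inverse covariance estimate with (λ, γ) selected by the 5-fold cross validation on the training folds.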
7. Summary
We have proposed and analyzed regularized estimation of large covariance and inverse covariance matrices using a joint penalty. The proposed JPEN estimators are optimal under the spectral norm for the underlying class of sparse and well-conditioned covariance and inverse covariance matrices. We also establish their theoretical consistency in the Frobenius norm. One of the biggest advantages is that the optimization carries no computational burden, and the resulting algorithm is very fast and easily scalable to large scale data analysis problems. The extensive simulations show that the proposed estimators perform well for a number of structured covariance and inverse covariance matrices. Also, when the eigenvalues of the underlying true covariance matrix are highly dispersed, JPEN outperforms the other methods (based on the simulation analysis). The JPEN estimator recovers the sparsity pattern of the true covariance matrix and provides a good approximation of the underlying eigen-spectrum, and hence we expect PCA to be one of the most important applications of the method. Although the proposed JPEN estimators of the covariance and inverse covariance matrices do not require any assumption on the structure of the true covariance and inverse covariance matrices, respectively, any prior knowledge of the structure of the true covariance matrix might be helpful for choosing a suitable weight matrix and hence improving estimation.
Acknowledgments
The author would like to express his deep gratitude to Professor Hira L. Koul for his valuable and constructive suggestions during the planning and development of this research work. The author would like to thank Dr. Adam Rothman for his valuable discussion and suggestions. The author would also like to thank the two anonymous referees and the action editor, Dr. Jie Peng, for insightful reviews that helped to improve the original manuscript substantially.
References
U. Alon, N. Barkai, D. Notterman, K. Gish, S. Ybarra, D. Mack, and A. Levine. Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays. Proceedings of the National Academy of Sciences USA, 96(12):6745–6750, 1999.

O. Banerjee, L. El Ghaoui, and A. d'Aspremont. Model selection through sparse maximum likelihood estimation for multivariate Gaussian or binary data. Journal of Machine Learning Research, 9:485–516, 2008.

J. Bien and R. Tibshirani. Sparse estimation of a covariance matrix. Biometrika, 98:807–820, 2011.

P. Bickel and E. Levina. Some theory for Fisher's linear discriminant function, "naive Bayes", and some alternatives when there are many more variables than observations. Bernoulli, 10:989–1010, 2004.

P. Bickel and E. Levina. Covariance regularization by thresholding. The Annals of Statistics, 36(6):2577–2604, 2008a.

P. Bickel and E. Levina. Regularized estimation of large covariance matrices. The Annals of Statistics, 36:199–227, 2008b.

T. Cai, W. Liu, and X. Luo. A constrained ℓ1 minimization approach to sparse precision matrix estimation. Journal of the American Statistical Association, 106:594–607, 2011.

T. Cai, Z. Ren, and H. Zhou. Estimating structured high-dimensional covariance and precision matrices: optimal rates and adaptive estimation. Electronic Journal of Statistics, 2015.

R. Clarke, H. Ressom, A. Wang, J. Xuan, M. Liu, E. Gehan, and Y. Wang. The properties of high-dimensional data spaces: implications for exploring gene and protein expression data. Nature Reviews Cancer, 8:37–49, 2008.

J. Friedman, T. Hastie, and R. Tibshirani. Sparse inverse covariance estimation with the graphical lasso. Biostatistics, 9(3):432–441, 2008.

S. Geman. A limit theorem for the norm of random matrices. The Annals of Statistics, 8(2):252–261, 1980.

I. Johnstone and Y. Lu. Sparse principal components analysis. Unpublished manuscript, 2004.

N. El Karoui. Operator norm consistent estimation of large dimensional sparse covariance matrices. The Annals of Statistics, 36:2717–2756, 2008a.

N. El Karoui. Spectrum estimation for large dimensional covariance matrices using random matrix theory. The Annals of Statistics, 36(6):2757–2790, 2008b.

O. Ledoit and M. Wolf. A well-conditioned estimator for large-dimensional covariance matrices. Journal of Multivariate Analysis, 88:365–411, 2004.

V. Marcenko and L. Pastur. Distributions of eigenvalues of some sets of random matrices. Math. USSR-Sb, 1:507–536, 1967.

K. Mardia, J. Kent, and J. Bibby. Multivariate Analysis, volume 1. Academic Press, New York, NY, 1979.

A. Maurya. A joint convex penalty for inverse covariance matrix estimation. Computational Statistics and Data Analysis, 75:15–27, 2014.

N. Meinshausen and P. Bühlmann. High dimensional graphs and variable selection with the lasso. Annals of Statistics, 34:1436–1462, 2006.

G. Pass, A. Chowdhury, and C. Torgeson. A picture of search. The First International Conference on Scalable Information Systems, 2006.

P. Ravikumar, M. Wainwright, G. Raskutti, and B. Yu. High-dimensional covariance estimation by minimizing ℓ1-penalized log-determinant divergence. Electronic Journal of Statistics, 5:935–980, 2011.

A. Rothman. Positive definite estimators of large covariance matrices. Biometrika, 99:733–740, 2012.

A. Rothman, P. J. Bickel, E. Levina, and J. Zhu. Sparse permutation invariant covariance estimation. Electronic Journal of Statistics, 2:494–515, 2008.

C. Stein. Estimation of a covariance matrix. Rietz Lecture, 39th Annual Meeting of the IMS, Atlanta, Georgia, 1975.

M. Wainwright, P. Ravikumar, and J. Lafferty. High-dimensional graphical model selection using ℓ1-regularized logistic regression. Proceedings of Advances in Neural Information Processing Systems, 2006.

L. Xue, S. Ma, and H. Zou. Positive-definite ℓ1-penalized estimation of large covariance matrices. Journal of the American Statistical Association, 107(500):983–990, 2012.

M. Yuan. Sparse inverse covariance matrix estimation via linear programming. Journal of Machine Learning Research, 11:2261–2286, 2009.

M. Yuan and Y. Lin. Model selection and estimation in the Gaussian graphical model. Biometrika, 94(1):19–35, 2007.

H. Zou, T. Hastie, and R. Tibshirani. Sparse principal components analysis. Journal of Computational and Graphical Statistics, 15:265–286, 2006.

Appendix A.
Proof of Lemma 3.1
Let

f(R) = ||R − K||_F^2 + λ ||R^-||_1 + γ Σ_{i=1}^p {σ_i(R) − σ̄_R}^2,   (.1)

where σ̄_R is the mean of the eigenvalues of R. Due to the constraint tr(R) = p, we have σ̄_R = 1. The third term of (.1) can be written as

Σ_{i=1}^p {σ_i(R) − σ̄_R}^2 = tr(R^2) − 2 tr(R) + p.

We obtain

f(R) = tr(R^2) − 2 tr(RK) + tr(K^T K) + λ ||R^-||_1 + γ {tr(R^2) − 2 tr(R) + p}
     = (1 + γ) tr(R^2) − 2 tr{R(K + γI)} + tr(K^T K) + λ ||R^-||_1 + γp
     = (1 + γ) ||R − (K + γI)/(1 + γ)||_F^2 + λ ||R^-||_1 + c,   (.2)

where c collects terms that do not depend on R. This is quadratic in R with an ℓ1 penalty on the off-diagonal entries of R, and is therefore a convex function of R.

Proof of Lemma 3.2
The solution to (.2) satisfies

2 (R − (K + γI)/(1 + γ)) (1 + γ) + λ ∂||R^-||_1/∂R = 0,   (.3)

where the subgradient ∂||R^-||_1/∂R is given entrywise by

(∂||R^-||_1/∂R)_ij = 1 if R_ij > 0; −1 if R_ij < 0; τ ∈ (−1, 1) if R_ij = 0.

Note that ||R^-||_1 has the same value irrespective of the signs of the entries of R; therefore the right hand side of (.2) is minimized if sign(R) = sign(K + γI) = sign(K). For any ε > 0, using (.3), σ_min{(K + γI) − (λ/2) sign(K)} > ε gives a (λ, γ) ∈ Ŝ_K, and such a choice of (λ, γ) guarantees the estimator to be positive definite.

Remark: Intuitively, a larger γ shrinks the eigenvalues towards their center, which is 1, so a larger γ results in a positive definite estimator, whereas a larger λ results in a sparser estimate. A suitable combination of (λ, γ) results in a sparse and well-conditioned estimator. In the particular case when K is a diagonal matrix, the condition on λ reduces to an explicit bound in terms of γ.

Proof of Theorem 3.1
Define the function Q(·) as follows:

Q(R) = f(R) − f(R_0),

where R_0 is the true correlation matrix and R is any other correlation matrix. Let R = U D U^T be the eigenvalue decomposition of R, where D is the diagonal matrix of eigenvalues and U is the matrix of eigenvectors. We have

Q(R) = ||R − K||_F^2 + λ ||R^-||_1 + γ {tr(D^2) − 2 tr(D) + p} − ||R_0 − K||_F^2 − λ ||R_0^-||_1 − γ {tr(D_0^2) − 2 tr(D_0) + p},   (.4)

where R_0 = U_0 D_0 U_0^T is the eigenvalue decomposition of R_0. Let Θ_n(M) := {Δ : Δ = Δ^T, ||Δ||_F = M r_n, 0 < M < ∞}. The estimate R̂ minimizes Q(R), or equivalently Δ̂ = R̂ − R_0 minimizes G(Δ) = Q(R_0 + Δ). Note that G(Δ) is convex, and if Δ̂ is its solution, then G(Δ̂) ≤ G(0) = 0. Therefore, if we can show that G(Δ) is non-negative for Δ ∈ Θ_n(M), this will imply that Δ̂ lies within the sphere of radius M r_n. We require r_n = o(√((p + s) log p / n)). We have

||R − K||_F^2 − ||R_0 − K||_F^2 = tr(R^T R − 2 R^T K) − tr(R_0^T R_0 − 2 R_0^T K)
= tr((R_0 + Δ)^T (R_0 + Δ) − R_0^T R_0) − 2 tr(Δ^T K)
= tr(Δ^T Δ) − 2 tr(Δ^T (K − R_0)).

Next, we bound the term involving K in the above expression:

|tr(Δ^T (R_0 − K))| ≤ Σ_{i≠j} |Δ_ij (R_0,ij − K_ij)| ≤ max_{i≠j} |R_0,ij − K_ij| ||Δ^-||_1 ≤ C_0 (1 + τ) √(log p / n) ||Δ^-||_1 = C_1 √(log p / n) ||Δ^-||_1,

which holds with high probability by a result (Lemma 1) of Ravikumar et al. (2011) on the tail inequality for the sample covariance matrix of sub-gaussian random vectors, where C_1 = C_0 (1 + τ), C_0 > 0. Next we obtain an upper bound on the terms involving γ in (.4). Since tr(D) = tr(D_0) = p, we have

{tr(D^2) − 2 tr(D)} − {tr(D_0^2) − 2 tr(D_0)} = tr(R^2) − tr(R_0^2) = tr((R_0 + Δ)^2) − tr(R_0^2) = 2 tr(R_0 Δ) + tr(Δ^T Δ) ≤ 2 √s ||Δ||_F + ||Δ||_F^2,

using the Cauchy-Schwarz inequality. To bound the term λ(||R^-||_1 − ||R_0^-||_1) = λ(||Δ^- + R_0^-||_1 − ||R_0^-||_1), let E be the index set as defined in Assumption A.2 of Theorem 3.2. Then, using the triangle inequality, we obtain

λ(||Δ^- + R_0^-||_1 − ||R_0^-||_1) ≥ λ(||Δ^-_Ē||_1 − ||Δ^-_E||_1).

Let λ = (C_1/ε) √(log p / n) and γ = (C_1/ε) √(log p / n), with (λ, γ) ∈ Ŝ_K. Collecting the bounds above, we obtain

G(Δ) ≥ ||Δ||_F^2 (1 + γ) + √(log p / n) {C_1 (1/ε − 2) ||Δ^-_Ē||_1 − C_1 (1/ε + 2) ||Δ^-_E||_1} − (2 C_1/ε) √(s log p / n) ||Δ||_F.

For ε ≤ 1/2, the coefficient of ||Δ^-_Ē||_1 is non-negative, and since ||Δ^-_E||_1 ≤ √s ||Δ||_F,

G(Δ) ≥ ||Δ||_F^2 − C_1 (3/ε + 2) √(s log p / n) ||Δ||_F ≥ ||Δ||_F^2 [1 − C_1 (3/ε + 2)/M] ≥ 0,

for all sufficiently large n and M, since ||Δ||_F = M r_n with r_n ≍ √(s log p / n). This proves the first part of the theorem. To prove the operator norm consistency, we have

||Σ̂_K − Σ_0|| = ||Ŵ R̂ Ŵ − W R_0 W|| ≤ ||R̂ − R_0|| ||Ŵ||^2 + ||Ŵ − W|| ||R_0|| (||Ŵ|| + ||W||),

using the sub-multiplicative norm property ||AB|| ≤ ||A|| ||B||. Since ||R_0|| = O(1) and ||R̂ − R_0||_F = O_P(√(s log p / n)), these together imply ||R̂|| = O_P(1). Also,

||Ŵ^2 − W^2|| = max_{||x||=1} Σ_{i=1}^p |ŵ_i^2 − w_i^2| x_i^2 ≤ max_{1≤i≤p} |ŵ_i^2 − w_i^2| = O_P(√(log p / n)),

which holds with high probability by a result (Lemma 1) of Ravikumar et al. (2011). Next we show that ||Ŵ − W|| ≍ ||Ŵ^2 − W^2|| (where A ≍ B means A = O_P(B) and B = O_P(A)). We have

||Ŵ − W|| = max_{1≤i≤p} |ŵ_i − w_i| = max_{1≤i≤p} |ŵ_i^2 − w_i^2| / (ŵ_i + w_i) ≍ max_{1≤i≤p} |ŵ_i^2 − w_i^2|,

where we have used the fact that the true standard deviations are bounded away from zero, i.e., there exists 0 < C_2 < ∞ such that 1/C_2 ≤ w_i ≤ C_2 for all i = 1, 2, ..., p, and the sample standard deviations are all positive, i.e., ŵ_i > 0 for all i = 1, 2, ..., p. Now since ||Ŵ − W|| ≍ ||Ŵ^2 − W^2||, it follows that ||Ŵ|| = O_P(1), and we have

||Σ̂_K − Σ_0|| = O_P(√(s log p / n) + √(log p / n)).

This completes the proof.

Proof of Theorem 3.2
Let

f(Σ) = ||Σ − S||_F^2 + λ ||Σ^-||_1 + γ Σ_{i=1}^p {σ_i(Σ) − σ̄_Σ}^2.

Similar to the proof of Theorem 3.1, define the function Q(·) as follows:

Q(Σ) = f(Σ) − f(Σ_0),

where Σ_0 is the true covariance matrix and Σ is any other covariance matrix. Let Σ = U D U^T be the eigenvalue decomposition of Σ, where D is the diagonal matrix of eigenvalues and U is the matrix of eigenvectors. We have

Q(Σ) = ||Σ − S||_F^2 + λ ||Σ^-||_1 + γ {tr(D^2) − (tr(D))^2/p} − ||Σ_0 − S||_F^2 − λ ||Σ_0^-||_1 − γ {tr(D_0^2) − (tr(D_0))^2/p},   (.5)

where Σ_0 = U_0 D_0 U_0^T is the eigenvalue decomposition of Σ_0. Write Δ = Σ − Σ_0, and let Θ_n(M) := {Δ : Δ = Δ^T, ||Δ||_F = M r_n, 0 < M < ∞}. The estimate Σ̂ minimizes Q(Σ), or equivalently Δ̂ = Σ̂ − Σ_0 minimizes G(Δ) = Q(Σ_0 + Δ). Note that G(Δ) is convex, and if Δ̂ is its solution, then G(Δ̂) ≤ G(0) = 0. Therefore, if we can show that G(Δ) is non-negative for Δ ∈ Θ_n(M), this will imply that Δ̂ lies within the sphere of radius M r_n. We require √((p + s) log p) = o(√n). We have

||Σ − S||_F^2 − ||Σ_0 − S||_F^2 = tr((Σ_0 + Δ)^T (Σ_0 + Δ) − Σ_0^T Σ_0) − 2 tr(Δ^T S) = tr(Δ^T Δ) − 2 tr(Δ^T (S − Σ_0)).

Next, we bound the term involving S in the above expression:

|tr(Δ(Σ_0 − S))| ≤ Σ_{i≠j} |Δ_ij (Σ_0,ij − S_ij)| + Σ_{i=1}^p |Δ_ii (Σ_0,ii − S_ii)|
≤ max_{i≠j} |Σ_0,ij − S_ij| ||Δ^-||_1 + √p max_i |Σ_0,ii − S_ii| √(Σ_{i=1}^p Δ_ii^2)
≤ C_0 (1 + τ) max_i(Σ_0,ii) {√(log p / n) ||Δ^-||_1 + √(p log p / n) ||Δ^+||_F},

which holds with high probability by a result (Lemma 1) of Ravikumar et al. (2011), where C_1 = C_0 (1 + τ) max_i(Σ_0,ii), C_0 > 0, and Δ^+ is the matrix Δ with all off-diagonal elements set to zero. Next we obtain an upper bound on the terms involving γ in (.5). We have

{tr(D^2) − (tr(D))^2/p} − {tr(D_0^2) − (tr(D_0))^2/p} = tr(Σ^2) − tr(Σ_0^2) − {(tr(Σ))^2 − (tr(Σ_0))^2}/p.

(i) tr(Σ^2) − tr(Σ_0^2) = tr((Σ_0 + Δ)^2) − tr(Σ_0^2) = tr(Δ^2) + 2 tr(Δ Σ_0) ≤ tr(Δ^2) + C √s ||Δ||_F.

(ii) (tr(Σ))^2 − (tr(Σ_0))^2 = (tr(Σ_0 + Δ))^2 − (tr(Σ_0))^2 ≤ (tr(Δ))^2 + 2 tr(Σ_0) tr(Δ) ≤ p ||Δ||_F^2 + 2 k̄ p √p ||Δ^+||_F.

Therefore the γ term can be bounded by 2 ||Δ||_F^2 + (C √s + 2 √p k̄) ||Δ||_F. We bound the term involving λ similarly to the proof of Theorem 3.1. For λ ≍ γ ≍ √(log p / n), the proof follows very similarly to that of Theorem 3.1.

Proof of Theorem 3.3.
To bound the cross product term involving Δ and R̂_K^{-1}, we have

|tr((R_0^{-1} − R̂_K^{-1}) Δ)| = |tr(R_0^{-1} (R̂_K − R_0) R̂_K^{-1} Δ)|
≤ σ_max(R_0^{-1}) |tr((R̂_K − R_0) R̂_K^{-1} Δ)|
≤ k̄ σ_max(R̂_K^{-1}) |tr((R̂_K − R_0) Δ)|
≤ k̄ k̄_1 |tr((R̂_K − R_0) Δ)|,

where σ_min(R̂_K) ≥ 1/k̄_1 > 0 is a positive lower bound on the eigenvalues of the JPEN estimate R̂_K of the correlation matrix R_0. Such a constant exists by Lemma 3.2. The rest of the proof closely follows that of Theorem 3.1.

Proof of Theorem 3.4.
We bound the term tr((Ω̂_S − Ω_0)Δ) similarly to that in the proof of Theorem 3.3. The rest of the proof closely follows that of Theorem 3.2.