Shrinking characteristics of precision matrix estimators
Aaron J. Molstad* and Adam J. Rothman
School of Statistics, University of Minnesota
April 18, 2017

*Corresponding author: [email protected]

Abstract
We propose a framework to shrink a user-specified characteristic of a precision matrix estimator that is needed to fit a predictive model. Estimators in our framework minimize the Gaussian negative log-likelihood plus an $L_1$ penalty on a linear or affine function evaluated at the optimization variable corresponding to the precision matrix. We establish convergence rate bounds for these estimators and we propose an alternating direction method of multipliers algorithm for their computation. Our simulation studies show that our estimators can perform better than competitors when they are used to fit predictive models. In particular, we illustrate cases where our precision matrix estimators perform worse at estimating the population precision matrix while performing better at prediction.

1 Introduction

Estimating precision matrices is required to fit many statistical models. Many papers written in the last decade have proposed shrinkage estimators of the precision matrix when $p$, the number of variables, is large. Pourahmadi (2013) and Fan et al. (2016) provide comprehensive reviews of large covariance and precision matrix estimation. The main strategy used in many of these papers is to minimize the Gaussian negative log-likelihood plus a penalty on the off-diagonal entries of the optimization variable corresponding to the precision matrix. For example, Yuan and Lin (2007) proposed the $L_1$-penalized Gaussian likelihood precision matrix estimator defined by
\[
\arg\min_{\Omega \in \mathbb{S}^p_+} \Big\{ \mathrm{tr}(S\Omega) - \log\det(\Omega) + \lambda \sum_{i \neq j} |\Omega_{ij}| \Big\}, \tag{1}
\]
where $S$ is the sample covariance matrix, $\lambda > 0$ is a tuning parameter, $\mathbb{S}^p_+$ is the set of $p \times p$ symmetric and positive definite matrices, and $\mathrm{tr}$ and $\det$ are the trace and determinant, respectively. Other authors have replaced the $L_1$ penalty in (1) with the squared Frobenius norm (Witten and Tibshirani, 2009; Rothman and Forzani, 2014) or non-convex penalties that also encourage zeros in the estimator (Lam and Fan, 2009; Fan et al., 2009).

To fit many predictive models, only a characteristic of the population precision matrix needs to be estimated. For example, in binary linear discriminant analysis, the population precision matrix is needed for prediction only through the product of the precision matrix and the difference between the two conditional distribution mean vectors. Many authors have proposed methods that directly estimate this characteristic (Cai and Liu, 2011; Fan et al., 2012; Mai et al., 2012).

We propose to estimate the precision matrix by shrinking the characteristic of the estimator that is needed for prediction. The characteristic we consider is a linear or affine function evaluated at the precision matrix. The goal is to improve prediction performance. Unlike methods that estimate the characteristic directly, our approach provides the practitioner with an estimate of the entire precision matrix, not just the characteristic. In our simulation studies and data example, we show that penalizing the characteristic needed for prediction can improve prediction performance over competing sparse precision matrix estimators like (1), even when the true precision matrix is very sparse. In addition, estimators in our framework can be used in applications other than linear discriminant analysis.
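To make the objective in (1) concrete, here is a minimal numpy sketch, written for this note rather than taken from any package, that evaluates the penalized negative log-likelihood at a candidate precision matrix; the function name and the toy data are our own illustrative choices.

```python
import numpy as np

def graphical_lasso_objective(omega, S, lam):
    """Evaluate the L1-penalized negative Gaussian log-likelihood in (1).

    omega : (p, p) symmetric positive definite candidate precision matrix
    S     : (p, p) sample covariance matrix
    lam   : non-negative tuning parameter
    """
    sign, logdet = np.linalg.slogdet(omega)
    if sign <= 0:
        return np.inf  # candidate is outside the positive definite cone
    off_diag_l1 = np.abs(omega).sum() - np.abs(np.diag(omega)).sum()
    return np.trace(S @ omega) - logdet + lam * off_diag_l1

# Toy usage with the identity matrix as the candidate estimate.
rng = np.random.default_rng(0)
X = rng.standard_normal((50, 10))
S = np.cov(X, rowvar=False, bias=True)
print(graphical_lasso_objective(np.eye(10), S, lam=0.1))
```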
2 The proposed method

2.1 The estimator

We propose to estimate the population precision matrix $\Omega_*$ with
\[
\hat\Omega = \arg\min_{\Omega \in \mathbb{S}^p_+} \left\{ \mathrm{tr}(S\Omega) - \log\det(\Omega) + \lambda |A\Omega B - C|_1 \right\}, \tag{2}
\]
where $A \in \mathbb{R}^{a \times p}$, $B \in \mathbb{R}^{p \times b}$, and $C \in \mathbb{R}^{a \times b}$ are user-specified matrices, and $|M|_1 = \sum_{i,j}|M_{ij}|$. Our estimator exploits the assumption that $A\Omega_* B - C$ is sparse. In cases where $A$, $B$, and $C$ need to be estimated, we replace them with their estimators.

An estimator defined by (2) with $C = 0$ was mentioned in an unpublished manuscript by Dalal and Rajaratnam (2014) available on arXiv. These authors proposed an alternating minimization algorithm for solving (1) and described how to apply it to solve (2). Dalal and Rajaratnam did not describe applications or theoretical properties of this estimator. Also, as written, their algorithm does not actually solve (2) when $A$ and $B$ are arbitrary. We propose an alternating direction method of multipliers algorithm to solve (2) and establish theoretical properties for this estimator.

2.2 Application to linear discriminant analysis

Fitting the discriminant analysis model requires the estimation of one or more precision matrices. In particular, the linear discriminant analysis model assumes that the data are independent copies of the random pair $(X, Y)$, where the support of $Y$ is $\{1, \ldots, J\}$ and
\[
X \mid Y = j \sim N_p\!\left(\mu_{*j}, \Omega_*^{-1}\right), \quad j = 1, \ldots, J, \tag{3}
\]
where $\mu_{*j} \in \mathbb{R}^p$ and $\Omega_*^{-1} \in \mathbb{S}^p_+$ are unknown. To discriminate between response categories $l$ and $m$, only the characteristic $\Omega_*(\mu_{*l} - \mu_{*m})$ is needed. Methods that estimate this characteristic directly have been proposed (Cai and Liu, 2011; Mai et al., 2012; Fan et al., 2012; Mai et al., 2015). These methods are useful in high dimensions because they perform variable selection. For the $j$th variable to be non-informative for discriminating between response categories $l$ and $m$, it must be that the $j$th element of $\Omega_*(\mu_{*l} - \mu_{*m})$ is zero. While these methods can perform well in classification and variable selection, they do not actually fit the model in (3).

Methods for fitting (3) specifically for linear discriminant analysis either assume $\Omega_*$ is diagonal (Bickel and Levina, 2004) or that both $\mu_{*l} - \mu_{*m}$ and $\Omega_*$ are sparse (Guo, 2010; Xu et al., 2015). A method for fitting (3) and performing variable selection was proposed by Witten and Tibshirani (2009). They suggest a two-step procedure where one first estimates $\Omega_*$, and then, with the estimate $\bar\Omega$ fixed, estimates each $\mu_{*j}$ by penalizing the characteristic $\bar\Omega \mu_j$, where $\mu_j$ is the optimization variable corresponding to $\mu_{*j}$.

To apply our method to the linear discriminant analysis problem, we use (2) with $A = I_p$, $C = 0$, and $B$ equal to the matrix whose columns are $\bar{x}_j - \bar{x}_k$ for all $1 \leq j < k \leq J$, where $\bar{x}_j$ is the unpenalized maximum likelihood estimator of $\mu_{*j}$. For large values of the tuning parameter, this leads to an estimator of $\Omega_*$ such that $\hat\Omega(\bar{x}_j - \bar{x}_k)$ is sparse. Thus our approach simultaneously fits (3) and performs variable selection, as illustrated in the sketch below.
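The construction of $A$, $B$, and $C$ for this application is simple enough to state in code. The following sketch (a hypothetical helper name of our own; it assumes the class labels are the distinct values appearing in y) builds $A = I_p$, $C = 0$, and $B$ with columns $\bar{x}_j - \bar{x}_k$ for $j < k$.

```python
import numpy as np
from itertools import combinations

def lda_characteristic_matrices(X, y):
    """Build (A, B, C) in (2) for the linear discriminant analysis application.

    X : (n, p) data matrix; y : length-n vector of class labels.
    Returns A = I_p, B whose columns are xbar_j - xbar_k for j < k, and C = 0.
    """
    classes = np.unique(y)
    means = {j: X[y == j].mean(axis=0) for j in classes}
    cols = [means[j] - means[k] for j, k in combinations(classes, 2)]
    B = np.column_stack(cols)          # p x J(J-1)/2 matrix of mean differences
    p = X.shape[1]
    A = np.eye(p)
    C = np.zeros((p, B.shape[1]))
    return A, B, C
```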
2.3 Portfolio allocation and regression

Precision and covariance matrix estimators are also needed for portfolio allocation. The optimal allocation based on the Markowitz (1952) minimum-variance portfolio is proportional to $\Omega_*\mu_*$, where $\mu_*$ is the vector of expected returns for the $p$ assets and $\Omega_*$ is the precision matrix for the returns. In practice, one would estimate $\Omega_*$ and $\mu_*$ with their usual sample estimators $\hat\Omega$ and $\hat\mu$. However, when $p$ is large, the usual sample estimator of $\Omega_*$ does not exist, so regularization is necessary. Moreover, Brodie et al. (2009) argue that sparse portfolios are often desirable when $p$ is large. While many have proposed using sparse or shrinkage estimators of $\Omega_*$ or $\Omega_*^{-1}$ plugged in to the Markowitz criterion, e.g., Xue et al. (2012), this would not necessarily lead to sparse estimators of $\Omega_*\mu_*$. Chen et al. (2016) proposed a method for estimating the characteristic $\Omega_*\mu_*$ directly, but like the direct linear discriminant methods, this approach does not lead to an estimate of $\Omega_*$. For the sparse portfolio allocation problem, we propose to estimate $\Omega_*$ using (2) with $A = I_p$, $C = 0$, and $B = \hat\mu$.

Another application is in linear regression where the response and predictor have a joint multivariate normal distribution. In this case, the regression coefficient matrix is $\Omega_*\Sigma_{*XY}$, where $\Omega_*$ is the marginal precision matrix for the predictors and $\Sigma_{*XY}$ is the cross-covariance matrix between the predictors and responses. We propose to estimate $\Omega_*$ using (2) with $A = I_p$, $C = 0$, and $B$ equal to the usual sample estimator of $\Sigma_{*XY}$. Similar to the procedure proposed by Witten and Tibshirani (2009), this approach provides an alternative method for estimating regression coefficients using shrinkage estimators of the marginal precision matrix for the predictors. The two choices of $B$ are sketched below.
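For the portfolio and regression applications only $B$ changes ($A = I_p$ and $C = 0$ as before); a minimal sketch with our own function names is:

```python
import numpy as np

def portfolio_B(returns):
    """Portfolio allocation: B is the sample mean vector of the p asset returns."""
    return returns.mean(axis=0).reshape(-1, 1)      # p x 1

def regression_B(X, Y):
    """Multivariate-normal regression: B is the sample cross-covariance between
    the p predictors (columns of X) and the responses (columns of Y)."""
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    return Xc.T @ Yc / X.shape[0]                   # p x q
```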
3 Computation

To solve the optimization in (2), we propose an alternating direction method of multipliers algorithm with a modification based on the majorize-minimize principle (Lange, 2016). Following the standard alternating direction method of multipliers approach (Boyd et al., 2011), we rewrite (2) as a constrained optimization problem:
\[
\arg\min_{(\Theta, \Omega) \in \mathbb{R}^{a \times b} \times \mathbb{S}^p_+} \left\{ \mathrm{tr}(S\Omega) - \log\det(\Omega) + \lambda|\Theta|_1 \right\} \quad \text{subject to} \quad A\Omega B - \Theta = C. \tag{4}
\]
The augmented Lagrangian for (4) is defined by
\[
F_\rho(\Omega, \Theta, \Gamma) = \mathrm{tr}(S\Omega) - \log\det(\Omega) + \lambda|\Theta|_1 - \mathrm{tr}\!\left\{\Gamma^{\mathrm{T}}(A\Omega B - \Theta - C)\right\} + \frac{\rho}{2}\|A\Omega B - \Theta - C\|_F^2,
\]
where $\rho > 0$ and $\Gamma \in \mathbb{R}^{a \times b}$ is the Lagrangian dual variable. Let the subscript $k$ denote the $k$th iterate. From Boyd et al. (2011), to solve (4), the alternating direction method of multipliers algorithm uses the following updating equations:
\begin{align}
\Omega_{k+1} &= \arg\min_{\Omega \in \mathbb{S}^p_+} F_\rho(\Omega, \Theta_k, \Gamma_k), \tag{5}\\
\Theta_{k+1} &= \arg\min_{\Theta \in \mathbb{R}^{a \times b}} F_\rho(\Omega_{k+1}, \Theta, \Gamma_k), \tag{6}\\
\Gamma_{k+1} &= \Gamma_k - \rho(A\Omega_{k+1} B - \Theta_{k+1} - C). \tag{7}
\end{align}
Instead of solving (5) exactly, we approximate its objective function with a majorizing function. Specifically, we replace (5) with
\[
\Omega_{k+1} = \arg\min_{\Omega \in \mathbb{S}^p_+} \left\{ F_\rho(\Omega, \Theta_k, \Gamma_k) + \frac{\rho}{2}\,\mathrm{vec}(\Omega - \Omega_k)^{\mathrm{T}} Q\, \mathrm{vec}(\Omega - \Omega_k) \right\}, \tag{8}
\]
where $Q = \tau I_{p^2} - (A^{\mathrm{T}}A \otimes BB^{\mathrm{T}})$, $\tau$ is selected so that $Q$ is positive definite, $\otimes$ is the Kronecker product, and $\mathrm{vec}$ forms a vector by stacking the columns of its matrix argument. Since
\[
\mathrm{vec}(\Omega - \Omega_k)^{\mathrm{T}}\!\left(A^{\mathrm{T}}A \otimes BB^{\mathrm{T}}\right)\mathrm{vec}(\Omega - \Omega_k) = \mathrm{tr}\!\left\{A^{\mathrm{T}}A(\Omega - \Omega_k)BB^{\mathrm{T}}(\Omega - \Omega_k)\right\},
\]
we can rewrite (8) as
\[
\Omega_{k+1} = \arg\min_{\Omega \in \mathbb{S}^p_+} \left[ F_\rho(\Omega, \Theta_k, \Gamma_k) + \frac{\rho\tau}{2}\|\Omega - \Omega_k\|_F^2 - \frac{\rho}{2}\,\mathrm{tr}\!\left\{A^{\mathrm{T}}A(\Omega - \Omega_k)BB^{\mathrm{T}}(\Omega - \Omega_k)\right\}\right],
\]
which is equivalent to
\[
\Omega_{k+1} = \arg\min_{\Omega \in \mathbb{S}^p_+} \left[ \mathrm{tr}\{(S + G_k)\Omega\} - \log\det(\Omega) + \frac{\rho\tau}{2}\|\Omega - \Omega_k\|_F^2\right], \tag{9}
\]
where $G_k = \rho A^{\mathrm{T}}(A\Omega_k B - \rho^{-1}\Gamma_k - \Theta_k - C)B^{\mathrm{T}}$. The zero gradient equation for (9) is
\[
S - \Omega_{k+1}^{-1} + \tfrac{1}{2}\left(G_k + G_k^{\mathrm{T}}\right) + \rho\tau(\Omega_{k+1} - \Omega_k) = 0, \tag{10}
\]
which is a quadratic equation that can be solved in closed form (Witten and Tibshirani, 2009; Price et al., 2015). The solution is
\[
\Omega_{k+1} = \frac{1}{2\rho\tau}\, U\!\left\{-\Psi + \left(\Psi^2 + 4\rho\tau I_p\right)^{1/2}\right\}U^{\mathrm{T}},
\]
where $U\Psi U^{\mathrm{T}}$ is the eigendecomposition of $S + 2^{-1}(G_k + G_k^{\mathrm{T}}) - \rho\tau\Omega_k$. Our majorize-minimize approach is a special case of the prox-linear alternating direction method of multipliers algorithm (Chen and Teboulle, 1994; Deng and Yin, 2016).

Conveniently, (6) also has a closed form solution:
\[
\Theta_{k+1} = \mathrm{soft}\!\left(A\Omega_{k+1}B - \rho^{-1}\Gamma_k - C,\; \rho^{-1}\lambda\right),
\]
where $\mathrm{soft}(x, \tau) = \max(|x| - \tau, 0)\,\mathrm{sign}(x)$ is applied elementwise.
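Both updates have simple numerical implementations. The sketch below (our own illustration, not the authors' code) computes the $\Omega$-update by solving the quadratic equation (10) through an eigendecomposition, and the elementwise soft-thresholding used in the $\Theta$-update.

```python
import numpy as np

def omega_update(S, G_k, Omega_k, rho, tau):
    """Solve (9): minimize tr{(S + G_k) Omega} - logdet(Omega) + (rho*tau/2)||Omega - Omega_k||_F^2."""
    M = S + 0.5 * (G_k + G_k.T) - rho * tau * Omega_k
    M = 0.5 * (M + M.T)                               # guard against round-off asymmetry
    psi, U = np.linalg.eigh(M)                        # M = U diag(psi) U^T
    new_eigs = (-psi + np.sqrt(psi ** 2 + 4.0 * rho * tau)) / (2.0 * rho * tau)
    return (U * new_eigs) @ U.T                       # positive definite by construction

def soft_threshold(M, t):
    """Elementwise soft-thresholding: soft(x, t) = max(|x| - t, 0) * sign(x)."""
    return np.sign(M) * np.maximum(np.abs(M) - t, 0.0)
```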
To summarize, we solve (2) with the following algorithm.

Algorithm 1. Alternating direction method of multipliers algorithm for (2).
Initialize $\Omega_0 \in \mathbb{S}^p_+$, $\Theta_0 \in \mathbb{R}^{a \times b}$, $\Gamma_0 \in \mathbb{R}^{a \times b}$, $\rho > 0$, and $\tau$ such that $Q$ is positive definite. Set $k = 0$. Repeat Steps 1--6 until convergence:

Step 1. Compute $G_k = \rho A^{\mathrm{T}}(A\Omega_k B - \rho^{-1}\Gamma_k - \Theta_k - C)B^{\mathrm{T}}$;
Step 2. Decompose $S + 2^{-1}(G_k + G_k^{\mathrm{T}}) - \rho\tau\Omega_k = U\Psi U^{\mathrm{T}}$, where $U$ is orthogonal and $\Psi$ is diagonal;
Step 3. Set $\Omega_{k+1} = (2\rho\tau)^{-1} U\{-\Psi + (\Psi^2 + 4\rho\tau I_p)^{1/2}\}U^{\mathrm{T}}$;
Step 4. Set $\Theta_{k+1} = \mathrm{soft}(A\Omega_{k+1}B - \rho^{-1}\Gamma_k - C, \rho^{-1}\lambda)$;
Step 5. Set $\Gamma_{k+1} = \Gamma_k - \rho(A\Omega_{k+1}B - \Theta_{k+1} - C)$;
Step 6. Replace $k$ with $k + 1$.

Using the same proof technique as in Deng and Yin (2016), one can show that the iterates from Algorithm 1 converge to their optimal values when a solution to (4) exists.

In our implementation, we set $\tau$ equal to $\varphi_1(A^{\mathrm{T}}A)\varphi_1(BB^{\mathrm{T}})$ plus a small positive constant, where $\varphi_1(\cdot)$ denotes the largest eigenvalue of its argument. This computation is only needed once at the initialization of our algorithm. We expect that in practice, the computational complexity of our algorithm will be dominated by the eigendecomposition in Step 2, which requires $O(p^3)$ flops.

To select the tuning parameter to use in practice, we recommend using some type of cross-validation procedure based on the application. For example, in the linear discriminant analysis case, one could select the tuning parameter that minimizes the validation misclassification rate or maximizes a validation likelihood. A compact sketch of the full algorithm is given below.
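Putting the pieces together, the following sketch of Algorithm 1 assumes the omega_update and soft_threshold helpers defined earlier are in scope; the stopping rule, the default $\rho$, and the small constant added to $\tau$ are simplifying choices of our own.

```python
import numpy as np

def admm_shrink_characteristic(S, A, B, C, lam, rho=1.0, max_iter=500, tol=1e-6):
    """ADMM sketch for (2): minimize tr(S Omega) - logdet(Omega) + lam * |A Omega B - C|_1."""
    p = S.shape[0]
    tau = (np.linalg.eigvalsh(A.T @ A).max() *
           np.linalg.eigvalsh(B @ B.T).max()) + 1e-4     # small slack (our choice) so Q is PD
    Omega = np.eye(p)
    Theta = np.zeros((A.shape[0], B.shape[1]))
    Gamma = np.zeros_like(Theta)
    for _ in range(max_iter):
        G = rho * A.T @ (A @ Omega @ B - Gamma / rho - Theta - C) @ B.T          # Step 1
        Omega_new = omega_update(S, G, Omega, rho, tau)                          # Steps 2-3
        Theta = soft_threshold(A @ Omega_new @ B - Gamma / rho - C, lam / rho)   # Step 4
        Gamma = Gamma - rho * (A @ Omega_new @ B - Theta - C)                    # Step 5
        if np.linalg.norm(Omega_new - Omega, 'fro') < tol:                       # simple stopping rule
            Omega = Omega_new
            break
        Omega = Omega_new                                                        # Step 6
    return Omega
```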
4 Statistical properties

In this section, we show that by using the penalty in (2), we can estimate $\Omega_*$ and $A\Omega_* B$ consistently in the Frobenius norm and $L_1$ norm, respectively. Our results rely on assuming that $A\Omega_* B$ is sparse. Define the set $\mathcal{G}$ as the indices of $A\Omega_* B$ that are nonzero, i.e.,
\[
\mathcal{G} = \left\{(i, j) \in \{1, \ldots, a\} \times \{1, \ldots, b\} : [A\Omega_* B]_{ij} \neq 0\right\}.
\]
Let $[A\Omega_* B]_{\mathcal{G}} \in \mathbb{R}^{a \times b}$ denote the matrix whose $(i,j)$th entry is equal to the $(i,j)$th entry of $A\Omega_* B$ if $(i,j) \in \mathcal{G}$ and is equal to zero if $(i,j) \notin \mathcal{G}$. We generalize our results to the case that $A$ and $B$ are unknown, and we use plug-in estimators of them in (2).

We first establish convergence rates for the case that $A$ and $B$ are known. Let $\sigma_j(\cdot)$ and $\varphi_j(\cdot)$ denote the $j$th largest singular value and eigenvalue of their arguments, respectively. Suppose that the sample covariance matrix used in (2) is $S_n = n^{-1}\sum_{i=1}^n X_i X_i^{\mathrm{T}}$, where $X_1, \ldots, X_n$ are independent and identically distributed $p_n$-dimensional random vectors with mean zero and covariance matrix $\Omega_*^{-1}$. We make the following assumptions:

Assumption 1. For all $n$, there exists a constant $k_1$ such that $0 < k_1^{-1} \leq \varphi_{p_n}(\Omega_*) \leq \varphi_1(\Omega_*) \leq k_1 < \infty$.

Assumption 2. For all $n$, there exists a constant $k_2$ such that $\min\{\sigma_{p_n}(A), \sigma_{p_n}(B)\} \geq k_2 > 0$.

Assumption 3. For all $n$, there exist positive constants $k_3$ and $k_4$ such that $\max_{j \in \{1, \ldots, p_n\}} E\{\exp(t X_{1j}^2)\} \leq k_4 < \infty$ for all $t \in (-k_3, k_3)$.

Assumptions 1 and 3 are common in the regularized precision matrix estimation literature, e.g., Assumption 1 was made by Bickel and Levina (2008), Rothman et al. (2008) and Lam and Fan (2009), and Assumption 3 holds if $X$ is multivariate normal. Assumption 2 requires that $A$ and $B$ are both rank $p_n$, which has the effect of shrinking every entry of $\hat\Omega$. The convergence rate bounds we establish also depend on the quantity
\[
\xi(p_n, \mathcal{G}) = \sup_{M \in \mathbb{S}^{p_n},\, M \neq 0} \frac{|[AM B]_{\mathcal{G}}|_1}{\|M\|_F},
\]
where $\mathbb{S}^{p_n}$ is the set of symmetric $p_n \times p_n$ matrices. Negahban et al. (2012) defined a similar and more general quantity and called it a compatibility constant.

Theorem 1. Suppose Assumptions 1--3 are true. If $\lambda_n = K(n^{-1}\log p_n)^{1/2}$, $K$ is sufficiently large, and $\xi(p_n, \mathcal{G})^2 \log p_n = o(n)$, then
(i) $\|\hat\Omega - \Omega_*\|_F = O_P\{\xi(p_n, \mathcal{G})(\log p_n/n)^{1/2}\}$ and
(ii) $|A\hat\Omega B - A\Omega_* B|_1 = O_P\{\xi(p_n, \mathcal{G})^2(\log p_n/n)^{1/2}\}$.

The quantity $\xi(p_n, \mathcal{G})$ can be used to recover known results for special cases of (2). For example, when $A$ and $B$ are identity matrices, $\xi(p_n, \mathcal{G}) = s_n^{1/2}$, where $s_n$ is the number of nonzero entries in $\Omega_*$. This special case was established by Rothman et al. (2008). We can simplify the results of Theorem 1 for the case that $A\Omega_* B$ has $g_n$ nonzero entries by introducing an additional assumption:

Assumption 4. For all $n$, there exists a constant $k_5$ such that $\sup_{M \in \mathbb{S}^{p_n},\, M \neq 0} \|[AMB]_{\mathcal{G}}\|_F / \|M\|_F \leq k_5 < \infty$.

Assumption 4 is not the same as bounding $\xi(p_n, \mathcal{G})$ because the numerator uses the Frobenius norm instead of the $L_1$ norm. This requires that for those entries of $A\Omega_* B$ which are nonzero, the corresponding rows and columns of $A$ and $B$, respectively, do not have magnitudes that are too large as $p_n$ grows.

Remark 1. Assume that the conditions of Theorem 1 are true. If Assumption 4 is true and $A\Omega_* B$ has $g_n$ nonzero entries, then $\|\hat\Omega - \Omega_*\|_F = O_P\{(g_n \log p_n/n)^{1/2}\}$ and $|A\hat\Omega B - A\Omega_* B|_1 = O_P\{g_n(\log p_n/n)^{1/2}\}$.

In practice, $A$ and $B$ are often unknown and must be estimated. Let $\hat A_n$ and $\hat B_n$ be estimators of $A$ and $B$. In this case, we estimate $A\Omega_* B$ with $\hat A_n \tilde\Omega \hat B_n$, where
\[
\tilde\Omega = \arg\min_{\Omega \in \mathbb{S}^p_+} \left\{ \mathrm{tr}(S_n\Omega) - \log\det(\Omega) + \lambda_n |\hat A_n \Omega \hat B_n|_1 \right\}. \tag{11}
\]
Suppose that there exist sequences $a_n$ and $b_n$ such that $|(A - \hat A_n)A^+|_1 = O_P(a_n)$ and $|B^+(B - \hat B_n)|_1 = O_P(b_n)$, where $A^+$ and $B^+$ are the Moore--Penrose pseudoinverses of $A$ and $B$, respectively; and $a_n = o(1)$ and $b_n = o(1)$. Let $M_n = \max\{a_n, b_n\}$ and $C_n = M_n |[A\Omega_* B]_{\mathcal{G}}|_1$.

Theorem 2. Suppose Assumptions 1--3 are true. If $\lambda_n = K(n^{-1}\log p_n)^{1/2}$, $K$ is sufficiently large, $M_n = o(1)$, $\xi(p_n, \mathcal{G})^2\log p_n = o(n)$, and $C_n^2 \log p_n = o(n)$, then
(i) $\|\tilde\Omega - \Omega_*\|_F = O_P\{\xi(p_n, \mathcal{G})(\log p_n/n)^{1/2} + C_n^{1/2}(\log p_n/n)^{1/4}\}$ and
(ii) $|\hat A_n \tilde\Omega \hat B_n - A\Omega_* B|_1 = O_P\{\xi(p_n, \mathcal{G})^2(\log p_n/n)^{1/2} + C_n^{1/2}\xi(p_n, \mathcal{G})(\log p_n/n)^{1/4} + C_n\}$.

The convergence rate bound in Theorem 2(i) is the sum of the statistical error rate from Theorem 1(i) plus an additional error that comes from estimating $A$ and $B$.
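The quantity $\xi(p_n, \mathcal{G})$ is a supremum over symmetric matrices and is generally not available in closed form. The following sketch (our own illustration) evaluates the ratio $|[AMB]_{\mathcal{G}}|_1/\|M\|_F$ at random symmetric directions $M$, which yields a Monte Carlo lower bound on $\xi$, and compares it with the exact value $\sqrt{g}$ in the identity-matrix special case discussed after Theorem 1.

```python
import numpy as np

def ratio(A, B, G_mask, M):
    """Compute |[A M B]_G|_1 / ||M||_F for a symmetric matrix M."""
    T = A @ M @ B
    return np.abs(T[G_mask]).sum() / np.linalg.norm(M, 'fro')

rng = np.random.default_rng(1)
p = 8
A = B = np.eye(p)                                   # identity special case
G_mask = np.abs(np.arange(p)[:, None] - np.arange(p)[None, :]) <= 1  # tridiagonal support
best = 0.0
for _ in range(2000):
    Z = rng.standard_normal((p, p))
    M = (Z + Z.T) / 2.0                             # random symmetric direction
    best = max(best, ratio(A, B, G_mask, M))
print(best, np.sqrt(G_mask.sum()))                  # Monte Carlo lower bound vs. exact sqrt(g)
```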
5 Simulations

5.1 Data generating models

We compare our precision matrix estimator to competing estimators when they are used to fit the linear discriminant analysis model. For 100 independent replications, we generated a realization of $n$ independent copies of $(X, Y)$ defined in (3), where $\mu_{*j} = \Omega_*^{-1}\beta_{*j}$ and $P(Y = j) = 1/J$ for $j = 1, \ldots, J$. Using this construction, if the $k$th element of $\beta_{*l} - \beta_{*m}$ is zero, i.e., the $k$th element of $\Omega_*(\mu_{*l} - \mu_{*m})$ is zero, then the $k$th variable is non-informative for discriminating between response categories $l$ and $m$. For each number of response categories $J$ that we considered, we partitioned our $n$ observations into a training set whose size increased with $J$, a validation set, and a test set of size 1000. We considered two models for $\Omega_*^{-1}$ and $\beta_{*j}$. Let $1(\cdot)$ be the indicator function.

Model 1. We set $\beta_{*j,k}$ equal to a fixed positive constant for $k \in \{4(j-1)+1, \ldots, 4j\}$ and zero otherwise, so that for any pair of response categories, only eight variables were informative for discrimination. We set the $(a,b)$th entry of $\Omega_*^{-1}$ to decay geometrically in $|a - b|$, so that $\Omega_*$ was tridiagonal.

Model 2. We set $\beta_{*j,k} = 2 \cdot 1[k \in \{5(j-1)+1, \ldots, 5j\}]$, so that for any pair of response categories, only ten variables were informative for discrimination. We set $\Omega_*^{-1}$ to be block diagonal: the block corresponding to the informative variables, i.e., the first $5J$ variables, had off-diagonal entries equal to 0.5 and diagonal entries equal to one. The block submatrix corresponding to the $p - 5J$ non-informative variables had $(a,b)$th entry decaying geometrically in $|a - b|$.

Figure 1: Misclassification rates and Frobenius norm error averaged over 100 replications with $p = 200$ for Models 1 and 2. The methods displayed are the estimator we proposed in Section 2.2, the $L_1$-penalized Gaussian likelihood estimator, the Ledoit--Wolf-type estimator from (12), the Bayes rule, the method proposed by Guo (2010), the method proposed by Mai et al. (2015), and the method proposed by Witten and Tibshirani (2011).

For both models, sparse estimators of $\Omega_*$ should perform well because the precision matrices are very sparse. The total number of informative variables is $4J$ and $5J$ in Models 1 and 2, respectively, so a method like that proposed by Mai et al. (2015), which selects variables that are informative for all pairwise comparisons, may perform poorly when $J$ is large.

5.2 Methods compared

We compared several methods in terms of classification accuracy on the test set. We fit (3) using the following methods: the sparse naive Bayes estimator proposed by Guo (2010) with tuning parameter chosen to minimize the misclassification rate on the validation set; and the Bayes rule, i.e., $\Omega_*$, $\mu_{*j}$, and $P(Y = j)$ known for $j = 1, \ldots, J$. We also fit (3) using the ordinary sample means and using the following precision matrix estimators: the estimator we proposed in Section 2.2 with tuning parameter chosen to minimize the misclassification rate on the validation set; the $L_1$-penalized Gaussian likelihood precision matrix estimator (Yuan and Lin, 2007; Rothman et al., 2008; Friedman et al., 2008) with the tuning parameter chosen to minimize the misclassification rate on the validation set; and a covariance matrix estimator similar to the estimator proposed by Ledoit and Wolf (2004), which is defined by
\[
\hat\Omega^{-1} = \alpha S + \gamma(1 - \alpha)I_p, \tag{12}
\]
where $(\alpha, \gamma) \in (0, 1) \times (0, \infty)$ were chosen to minimize the misclassification rate on the validation set.

Figure 2: True positive and true negative rates averaged over 100 replications with $p = 200$ for Model 1 in (a) and (c), and for Model 2 in (b) and (d). The methods displayed are the estimator we proposed in Section 2.2, the method proposed by Guo (2010), and the method proposed by Mai et al. (2015).
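As one concrete baseline, the estimator in (12) and the resulting linear discriminant scores can be computed as follows (a minimal sketch with hypothetical helper names; the validation-set search over $(\alpha, \gamma)$ is omitted).

```python
import numpy as np

def ledoit_wolf_type_precision(S, alpha, gamma):
    """Invert the shrunken covariance estimate in (12): alpha*S + gamma*(1-alpha)*I."""
    p = S.shape[0]
    return np.linalg.inv(alpha * S + gamma * (1.0 - alpha) * np.eye(p))

def lda_predict(x, Omega_hat, class_means, priors):
    """Assign x to the class maximizing the linear discriminant score."""
    scores = [x @ Omega_hat @ m - 0.5 * m @ Omega_hat @ m + np.log(pi)
              for m, pi in zip(class_means, priors)]
    return int(np.argmax(scores))
```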
The $L_1$-penalized Gaussian likelihood precision matrix estimator we used penalized the diagonal entries. With our data generating models, we found this performed better at classification than (1), which does not penalize the diagonal entries. We also tried two Fisher-linear-discriminant-based methods applicable to multi-category linear discriminant analysis: the sparse linear discriminant method proposed by Witten and Tibshirani (2011) with tuning parameter and dimension chosen to minimize the misclassification rate on the validation set; and the multi-category sparse linear discriminant method proposed by Mai et al. (2015) with tuning parameter chosen to minimize the misclassification rate on the validation set.

We could also have selected tuning parameters for the model-based methods by maximizing a validation likelihood or using an information criterion, but minimizing the misclassification rate on a validation set made it fairer to compare the model-based methods and the Fisher-linear-discriminant-based methods in terms of classification accuracy.

5.3 Performance measures and results

We measured classification accuracy on the test set for each replication for the methods described in Section 5.2. For the methods that produced a precision matrix estimator, we also measured this estimator's Frobenius norm error: $\|\bar\Omega - \Omega_*\|_F$, where $\bar\Omega$ is the estimator. To measure variable selection accuracy, we used both the true positive rate and the true negative rate, which are respectively defined by
\[
\frac{\mathrm{card}\left\{(m, k) : \hat\Delta_{m,k} \neq 0 \cap \Delta_{*m,k} \neq 0\right\}}{\mathrm{card}\left\{(m, k) : \Delta_{*m,k} \neq 0\right\}}, \qquad
\frac{\mathrm{card}\left\{(m, k) : \hat\Delta_{m,k} = 0 \cap \Delta_{*m,k} = 0\right\}}{\mathrm{card}\left\{(m, k) : \Delta_{*m,k} = 0\right\}},
\]
where $(m, k) \in \{2, \ldots, J\} \times \{1, \ldots, p\}$, $\Delta_{*m} = \beta_{*1} - \beta_{*m}$, $\hat\Delta_m$ is an estimator of $\Delta_{*m}$, and $\mathrm{card}$ denotes the cardinality of a set.

We display average misclassification rates and Frobenius norm error averages for both models with $p = 200$ in Figure 1, and display variable selection accuracy averages in Figure 2. For both models, our method outperformed all competitors in terms of classification accuracy for all $J$, except the Bayes rule, which uses population parameter values that are unknown in practice. In terms of precision matrix estimation, for Model 1, our method did better than the $L_1$-penalized Gaussian likelihood precision matrix estimator when the sample size was small, but became worse than the $L_1$-penalized Gaussian likelihood precision matrix estimator as the sample size increased. For Model 2, our method was worse than the $L_1$-penalized Gaussian likelihood precision matrix estimator in Frobenius norm error for precision matrix estimation, but was better in terms of classification accuracy.

In terms of variable selection, our method was competitive with the methods proposed by Guo (2010) and Mai et al. (2015). For Model 1, our method tended to have a higher average true negative rate than the method of Guo (2010) and a lower average true positive rate than the method of Mai et al. (2015). For Model 2, all methods tended to have relatively high average true positive rates, while our method had a higher average true negative rate than the method of Mai et al. (2015). Although the method proposed by Guo (2010) had a higher average true negative rate for Model 2 than our proposed method had, our method performed better in terms of classification accuracy.
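The true positive and true negative rates defined above are straightforward to compute; a small sketch with hypothetical argument names is given below, where the rows of Delta_hat and Delta_star hold the estimated and true vectors $\hat\Delta_m$ and $\Delta_{*m}$.

```python
import numpy as np

def selection_rates(Delta_hat, Delta_star, tol=0.0):
    """True positive and true negative rates over the entries of the Delta matrices.

    Delta_hat, Delta_star : arrays of matching shape (number of comparisons x p).
    """
    est_nonzero = np.abs(Delta_hat) > tol
    true_nonzero = Delta_star != 0
    tpr = np.sum(est_nonzero & true_nonzero) / np.sum(true_nonzero)
    tnr = np.sum(~est_nonzero & ~true_nonzero) / np.sum(~true_nonzero)
    return tpr, tnr
```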
6 Gene expression data example

We used our method to fit the linear discriminant analysis model in a real data application. The data are gene expression profiles consisting of $p = 22{,}283$ genes from 127 subjects, who either have Crohn's disease, ulcerative colitis, or neither. This dataset comes from Burczynski et al. (2006). The goal of our analysis was to fit a linear discriminant analysis model that could be used to identify which genes are informative for discriminating between each pair of the response categories. These data were also analyzed by Mai et al. (2015). One difference between our method and the method of Mai et al. (2015) is that the method of Mai et al. (2015) excludes variables from all pairwise response category comparisons, whereas our method allows a distinct set of informative variables to be estimated for each comparison.

Figure 3: Model sizes and misclassification rates from 100 random training/testing splits with $k = 100$ (dark grey), $k = 200$ (grey), and $k = 300$ (light grey). Guo is the method proposed by Guo (2010), Mai is the method proposed by Mai et al. (2015), Glasso is the $L_1$-penalized Gaussian likelihood precision matrix estimator, Ours is the estimator we propose in Section 2.2, and Witten is the method proposed by Witten and Tibshirani (2011).

To measure the classification accuracy of our method and its competitors, we randomly split the data into a training set of size 100 and a test set of size 27 for 100 independent replications. Within each replication, we first applied a screening rule to the training set, as in Rothman et al. (2009) and Mai et al. (2015), based on $F$-test statistics, and then restricted our discriminant analysis model to the genes with the $k$ largest $F$-test statistic values (a sketch of this step is given at the end of this section). We chose tuning parameters with 5-fold cross-validation to minimize the validation classification error rate.

Misclassification rates are shown in Figure 3, where we compared our method to the method proposed by Mai et al. (2015), the method proposed by Witten and Tibshirani (2011), the method proposed by Guo (2010), and the method that used the $L_1$-penalized Gaussian likelihood precision matrix estimator. We saw that our method was as or more accurate in terms of classification accuracy than the competing methods. The only method that performed nearly as well was that of Mai et al. (2015) when we used $k = 100$ screened genes. However, the best out-of-sample classification accuracy was achieved with $k = 300$, where our method was significantly better than the competitors.

In Figure 3, we also display model sizes, i.e., the total number of variables that were estimated to be informative for discriminating between response categories. To measure model size for the method proposed by Witten and Tibshirani (2011), we used the version of their method with two discriminant vectors. We saw that although the method of Mai et al. (2015) tended to estimate slightly smaller models, our method, which performed best in classification, selected only slightly more variables. Moreover, our method can be used to identify a distinct subset of genes that are informative specifically for discriminating between patients with Crohn's disease and ulcerative colitis. This was of interest in the study of Burczynski et al. (2006).
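The screening step described above amounts to computing a one-way ANOVA $F$ statistic for each gene and keeping the $k$ largest; a minimal numpy sketch (our own, not the authors' code) is:

```python
import numpy as np

def screen_by_f_statistic(X, y, k):
    """Keep the k columns of X with the largest one-way ANOVA F statistics across classes in y."""
    classes = np.unique(y)
    n, p = X.shape
    grand_mean = X.mean(axis=0)
    between = np.zeros(p)
    within = np.zeros(p)
    for c in classes:
        Xc = X[y == c]
        between += Xc.shape[0] * (Xc.mean(axis=0) - grand_mean) ** 2
        within += ((Xc - Xc.mean(axis=0)) ** 2).sum(axis=0)
    F = (between / (len(classes) - 1)) / (within / (n - len(classes)))
    keep = np.argsort(F)[::-1][:k]
    return np.sort(keep)            # indices of the retained genes
```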
Acknowledgements

A. J. Molstad's research is supported by the Doctoral Dissertation Fellowship from the University of Minnesota. A. J. Rothman's research is supported in part by the National Science Foundation.

Appendix
Notation
Define the following norms: $\|A\|_\infty = \max_{i,j}|A_{ij}|$, $|A|_1 = \sum_{i,j}|A_{ij}|$, $\|A\|_F^2 = \mathrm{tr}(A^{\mathrm{T}}A)$, and $\|A\| = \sigma_1(A)$. Let $\mathbb{S}^p$ denote the set of $p \times p$ symmetric matrices. To simplify notation, we omit the subscript $n$ from $S_n$, $\lambda_n$, and $p_n$ as defined in Section 4, and let $\kappa = k_1^{-1}$.

Proof of Theorem 1
To prove Theorem 1, we use a strategy similar to that employed by Rothman (2012).
Lemma 1. Suppose that Assumptions 1--3 hold, and $\lambda \leq \epsilon\kappa^2\{\xi(p, \mathcal{G})\tau\}^{-1}$ for some $\tau > 12$. Then for all positive and sufficiently small $\epsilon$, $\|B^+(S - \Omega_*^{-1})A^+\|_\infty \leq \lambda/2$ implies $\|\hat\Omega - \Omega_*\|_F \leq \epsilon$.

Proof. We follow the proof techniques used by Rothman et al. (2008), Negahban et al. (2012) and Rothman (2012). Define $\mathcal{B}_\epsilon = \{\Delta \in \mathbb{S}^p : \|\Delta\|_F = \epsilon\}$. Let $f$ be the objective function in (2). Because $f$ is convex and $\hat\Omega$ is its minimizer, $\inf\{f(\Omega_* + \Delta) : \Delta \in \mathcal{B}_\epsilon\} > f(\Omega_*)$ implies $\|\hat\Omega - \Omega_*\|_F \leq \epsilon$ (Rothman et al., 2008). Define $D(\Delta) = f(\Omega_* + \Delta) - f(\Omega_*)$. Then
\[
D(\Delta) = \mathrm{tr}(S\Delta) + \log\det(\Omega_*) - \log\det(\Omega_* + \Delta) + \lambda\{|A(\Omega_* + \Delta)B|_1 - |A\Omega_* B|_1\}.
\]
By the arguments used in Rothman et al. (2008), $\log\det(\Omega_*) - \log\det(\Omega_* + \Delta) \geq -\mathrm{tr}(\Delta\Omega_*^{-1}) + 8^{-1}\kappa^2\|\Delta\|_F^2$, so that
\[
D(\Delta) \geq \mathrm{tr}\{\Delta(S - \Omega_*^{-1})\} + \tfrac{1}{8}\kappa^2\|\Delta\|_F^2 + \lambda\{|A(\Omega_* + \Delta)B|_1 - |A\Omega_* B|_1\}. \tag{13}
\]
We now bound $|A(\Omega_* + \Delta)B|_1 - |A\Omega_* B|_1$ in (13). Recall that $\mathcal{G} = \{(i,j) \in \{1,\ldots,a\} \times \{1,\ldots,b\} : [A\Omega_* B]_{ij} \neq 0\}$ and $\mathcal{G}^c = \{1,\ldots,a\} \times \{1,\ldots,b\} \setminus \mathcal{G}$. Since $|A\Omega_* B|_1 = |[A\Omega_* B]_{\mathcal{G}}|_1$ and $|A(\Omega_* + \Delta)B|_1 = |[A\Omega_* B]_{\mathcal{G}} + [A\Delta B]_{\mathcal{G}}|_1 + |[A\Delta B]_{\mathcal{G}^c}|_1$, we can apply the reverse triangle inequality: $|A(\Omega_* + \Delta)B|_1 - |A\Omega_* B|_1 \geq |[A\Delta B]_{\mathcal{G}^c}|_1 - |[A\Delta B]_{\mathcal{G}}|_1$. Plugging this bound into (13),
\[
D(\Delta) \geq \mathrm{tr}\{(S - \Omega_*^{-1})\Delta\} + \tfrac{1}{8}\kappa^2\|\Delta\|_F^2 + \lambda\left(|[A\Delta B]_{\mathcal{G}^c}|_1 - |[A\Delta B]_{\mathcal{G}}|_1\right). \tag{14}
\]
We now bound $\mathrm{tr}\{(S - \Omega_*^{-1})\Delta\}$. Let $A^+ = (A^{\mathrm{T}}A)^{-1}A^{\mathrm{T}}$ and $B^+ = B^{\mathrm{T}}(BB^{\mathrm{T}})^{-1}$. Because $A$ and $B$ are both rank $p$ by Assumption 2, $A^+A = I_p$ and $BB^+ = I_p$. Thus
\begin{align}
\mathrm{tr}\{(S - \Omega_*^{-1})\Delta\} &\geq -|\mathrm{tr}\{(S - \Omega_*^{-1})\Delta\}| = -|\mathrm{tr}\{(S - \Omega_*^{-1})A^+A\Delta BB^+\}| \nonumber\\
&= -|\mathrm{tr}\{B^+(S - \Omega_*^{-1})A^+ A\Delta B\}| \geq -\|B^+(S - \Omega_*^{-1})A^+\|_\infty |A\Delta B|_1. \tag{15}
\end{align}
By assumption, $\|B^+(S - \Omega_*^{-1})A^+\|_\infty \leq \lambda/2$, so applying (15) to (14),
\begin{align}
D(\Delta) &\geq \tfrac{1}{8}\kappa^2\|\Delta\|_F^2 - \tfrac{1}{2}\lambda|A\Delta B|_1 + \lambda\left(|[A\Delta B]_{\mathcal{G}^c}|_1 - |[A\Delta B]_{\mathcal{G}}|_1\right) \tag{16}\\
&= \tfrac{1}{8}\kappa^2\|\Delta\|_F^2 - \tfrac{1}{2}\lambda\left(|[A\Delta B]_{\mathcal{G}}|_1 + |[A\Delta B]_{\mathcal{G}^c}|_1\right) + \lambda\left(|[A\Delta B]_{\mathcal{G}^c}|_1 - |[A\Delta B]_{\mathcal{G}}|_1\right)\nonumber\\
&= \tfrac{1}{8}\kappa^2\|\Delta\|_F^2 - \tfrac{3}{2}\lambda|[A\Delta B]_{\mathcal{G}}|_1 + \tfrac{1}{2}\lambda|[A\Delta B]_{\mathcal{G}^c}|_1 \nonumber\\
&\geq \tfrac{1}{8}\kappa^2\|\Delta\|_F^2 - \tfrac{3}{2}\lambda|[A\Delta B]_{\mathcal{G}}|_1. \tag{17}
\end{align}
We now bound the quantity $|[A\Delta B]_{\mathcal{G}}|_1$. Multiplying and dividing $\Delta$ by $\|\Delta\|_F$,
\[
\left|\left[A\frac{\|\Delta\|_F}{\|\Delta\|_F}\Delta B\right]_{\mathcal{G}}\right|_1 = \|\Delta\|_F \left|\left[A\frac{\Delta}{\|\Delta\|_F}B\right]_{\mathcal{G}}\right|_1 \leq \|\Delta\|_F \left(\sup_{M \in \mathbb{S}^p,\, M \neq 0} \frac{|[AMB]_{\mathcal{G}}|_1}{\|M\|_F}\right),
\]
so that $|[A\Delta B]_{\mathcal{G}}|_1 \leq \|\Delta\|_F\, \xi(p, \mathcal{G})$. Finally, since $\lambda \leq \epsilon\kappa^2\{\tau\xi(p, \mathcal{G})\}^{-1}$ with $\tau > 12$ and $\|\Delta\|_F = \epsilon$ for $\Delta \in \mathcal{B}_\epsilon$,
\[
D(\Delta) \geq \tfrac{1}{8}\kappa^2\|\Delta\|_F^2 - \tfrac{3}{2}\lambda\|\Delta\|_F\,\xi(p, \mathcal{G}) = \|\Delta\|_F^2\left\{\tfrac{1}{8}\kappa^2 - \frac{3\lambda\,\xi(p, \mathcal{G})}{2\|\Delta\|_F}\right\} \geq \epsilon^2\left(\tfrac{1}{8}\kappa^2 - \tfrac{3}{2\tau}\kappa^2\right) > 0,
\]
which establishes the desired result.

The following lemma follows from the proof of Lemma 1 of Negahban et al. (2012).

Lemma 2. If the conditions of Lemma 1 are true, then $\hat\Delta = \hat\Omega - \Omega_*$ belongs to the set $\{\Delta \in \mathbb{S}^p : |[A\Delta B]_{\mathcal{G}^c}|_1 \leq 3|[A\Delta B]_{\mathcal{G}}|_1\}$.

Lemma 3 follows from the proof of Lemma 2 from Lam and Fan (2009), Assumption 2, and Lemma A.3 of Bickel and Levina (2008).

Lemma 3. Suppose Assumptions 1--3 hold. Then there exist constants $C_1$ and $C_2$ such that
\[
P\left(\|B^+SA^+ - B^+\Omega_*^{-1}A^+\|_\infty \geq \nu\right) \leq C_1 p^2 \exp(-C_2 n\nu^2),
\]
for $|\nu| \leq \delta$, where $C_1$, $C_2$, and $\delta$ do not depend on $n$.

Proof of Theorem 1. Set $\epsilon = K\kappa^{-2}\xi(p, \mathcal{G})(n^{-1}\log p)^{1/2}$ and $\lambda = K\tau^{-1}(n^{-1}\log p)^{1/2}$ with $\tau > 12$. Applying Lemma 1 and Lemma 3, there exist constants $C_1$ and $C_2$ such that for sufficiently large $n$,
\[
P\left(\|\hat\Omega - \Omega_*\|_F \leq K\kappa^{-2}\xi(p, \mathcal{G})\sqrt{\frac{\log p}{n}}\right) \geq P\left(\|B^+(S - \Omega_*^{-1})A^+\|_\infty \leq \frac{K}{2\tau}\sqrt{\frac{\log p}{n}}\right) \geq 1 - C_1 p^{2 - C_2 K^2/(4\tau^2)},
\]
which establishes (i) because $1 - C_1 p^{2 - C_2K^2/(4\tau^2)} \to 1$ as $K \to \infty$. To establish (ii),
\begin{align}
|A(\hat\Omega - \Omega_*)B|_1 &= |[A(\hat\Omega - \Omega_*)B]_{\mathcal{G}}|_1 + |[A(\hat\Omega - \Omega_*)B]_{\mathcal{G}^c}|_1 \nonumber\\
&\leq 4|[A(\hat\Omega - \Omega_*)B]_{\mathcal{G}}|_1 \tag{18}\\
&\leq 4\,\xi(p, \mathcal{G})\,\|\hat\Omega - \Omega_*\|_F, \tag{19}
\end{align}
where (18) follows from Lemma 2 and (19) follows from the definition of $\xi(p, \mathcal{G})$.

Proof of Theorem 2
Lemma 4. Let $M_a$ and $M_b$ be constants. Let $a_n$ and $b_n$ be sequences such that $|(\hat A_n - A)A^+|_1 \leq M_a a_n$ and $|B^+(\hat B_n - B)|_1 \leq M_b b_n$ with probability at least $1 - f(M_a)$ and $1 - g(M_b)$, respectively. Let $\tilde M_n = M_a a_n + M_b b_n + M_a M_b a_n b_n$. Then
\[
|\hat A_n(\Omega_* + \Delta)\hat B_n|_1 - |\hat A_n \Omega_* \hat B_n|_1 \geq |A(\Omega_* + \Delta)B|_1 - |[A\Omega_* B]_{\mathcal{G}}|_1 - \tilde M_n\left(|A\Delta B|_1 + 2|[A\Omega_* B]_{\mathcal{G}}|_1\right),
\]
with probability at least $\min\{1 - f(M_a), 1 - g(M_b)\}$.

Proof. Let $|\hat A_n(\Omega_* + \Delta)\hat B_n|_1 - |\hat A_n \Omega_* \hat B_n|_1 \equiv V_1 - V_2$. First,
\[
V_1 = |\hat A_n(\Omega_* + \Delta)\hat B_n + A(\Omega_* + \Delta)B - A(\Omega_* + \Delta)B|_1 \geq |A(\Omega_* + \Delta)B|_1 - |A(\Omega_* + \Delta)B - \hat A_n(\Omega_* + \Delta)\hat B_n|_1, \tag{20}
\]
by the triangle inequality. Also,
\[
V_2 = |\hat A_n \Omega_* \hat B_n - A\Omega_* B + A\Omega_* B|_1 \leq |\hat A_n \Omega_* \hat B_n - A\Omega_* B|_1 + |[A\Omega_* B]_{\mathcal{G}}|_1, \tag{21}
\]
so that from (20) and (21),
\[
V_1 - V_2 \geq |A(\Omega_* + \Delta)B|_1 - |[A\Omega_* B]_{\mathcal{G}}|_1 - |A(\Omega_* + \Delta)B - \hat A_n(\Omega_* + \Delta)\hat B_n|_1 - |\hat A_n \Omega_* \hat B_n - A\Omega_* B|_1. \tag{22}
\]
Let $V_3 = -|A(\Omega_* + \Delta)B - \hat A_n(\Omega_* + \Delta)\hat B_n|_1 - |\hat A_n \Omega_* \hat B_n - A\Omega_* B|_1$. By a triangle inequality on the first term of $V_3$,
\[
V_3 \geq -2|\hat A_n \Omega_* \hat B_n - A\Omega_* B|_1 - |\hat A_n \Delta \hat B_n - A\Delta B|_1. \tag{23}
\]
To bound (23), we need to bound terms of the form $|AMB - \hat A_n M \hat B_n|_1$:
\begin{align}
|AMB - \hat A_n M \hat B_n|_1 &= |(A - \hat A_n)MB + AM(B - \hat B_n) + (A - \hat A_n)M(\hat B_n - B)|_1 \nonumber\\
&\leq |(A - \hat A_n)MB|_1 + |AM(B - \hat B_n)|_1 + |(A - \hat A_n)M(\hat B_n - B)|_1 \tag{24}\\
&= |(A - \hat A_n)A^+AMB|_1 + |AMBB^+(B - \hat B_n)|_1 + |(A - \hat A_n)A^+AMBB^+(\hat B_n - B)|_1 \tag{25}\\
&\leq |AMB|_1\left\{|(A - \hat A_n)A^+|_1 + |B^+(B - \hat B_n)|_1 + |(A - \hat A_n)A^+|_1\,|B^+(B - \hat B_n)|_1\right\} \tag{26}\\
&\leq |AMB|_1\left(M_a a_n + M_b b_n + M_a M_b a_n b_n\right), \tag{27}
\end{align}
where (24) follows from the triangle inequality; (25) follows from Assumption 2 and the definitions of $A^+$ and $B^+$; (26) follows from the sub-multiplicative property of the $|\cdot|_1$ norm; and (27) holds with probability at least $\min\{1 - f(M_a), 1 - g(M_b)\}$. Applying (27) to both terms in (23) gives
\[
V_3 \geq -2|\hat A_n \Omega_* \hat B_n - A\Omega_* B|_1 - |\hat A_n \Delta \hat B_n - A\Delta B|_1 \geq -\tilde M_n\left(2|[A\Omega_* B]_{\mathcal{G}}|_1 + |A\Delta B|_1\right),
\]
with probability at least $\min\{1 - f(M_a), 1 - g(M_b)\}$. Plugging this bound into (22) gives the result.
Lemma 5. Suppose Assumptions 1--3 are true, $\tilde M_n = o(1)$, the bound in Lemma 4 holds, and $\lambda \leq \epsilon\kappa^2(Q_\epsilon\tau)^{-1}$ for $\tau > 8$, where
\[
Q_\epsilon = \left(\tfrac{3}{2} + \tilde M_n\right)\xi(p, \mathcal{G}) + \frac{2\tilde M_n|[A\Omega_* B]_{\mathcal{G}}|_1}{\epsilon}.
\]
Then for all positive and sufficiently small $\epsilon$, $\|B^+(S - \Omega_*^{-1})A^+\|_\infty \leq \lambda/2$ implies $\|\tilde\Omega - \Omega_*\|_F \leq \epsilon$.

Proof. Let $\tilde f$ be the objective function from (11). Define $\tilde D(\Delta) = \tilde f(\Omega_* + \Delta) - \tilde f(\Omega_*)$, so that
\[
\tilde D(\Delta) = \mathrm{tr}(S\Delta) + \log\det(\Omega_*) - \log\det(\Omega_* + \Delta) + \lambda\left\{|\hat A_n(\Omega_* + \Delta)\hat B_n|_1 - |\hat A_n\Omega_*\hat B_n|_1\right\}.
\]
As in the proof of Lemma 1, we want to show that $\inf\{\tilde D(\Delta) : \|\Delta\|_F = \epsilon\} > 0$. Applying Lemma 4 to bound $|\hat A_n(\Omega_* + \Delta)\hat B_n|_1 - |\hat A_n\Omega_*\hat B_n|_1$, and applying the same arguments as in the proof of Lemma 1 used to obtain (16),
\begin{align}
\tilde D(\Delta) &\geq \tfrac{1}{8}\kappa^2\|\Delta\|_F^2 - \tfrac{1}{2}\lambda\left(|[A\Delta B]_{\mathcal{G}}|_1 + |[A\Delta B]_{\mathcal{G}^c}|_1\right) + \lambda\left(|[A\Delta B]_{\mathcal{G}^c}|_1 - |[A\Delta B]_{\mathcal{G}}|_1\right) - \tilde M_n\lambda\left(|A\Delta B|_1 + 2|[A\Omega_* B]_{\mathcal{G}}|_1\right)\nonumber\\
&= \tfrac{1}{8}\kappa^2\|\Delta\|_F^2 - \tfrac{3}{2}\lambda|[A\Delta B]_{\mathcal{G}}|_1 + \tfrac{1}{2}\lambda|[A\Delta B]_{\mathcal{G}^c}|_1 - \tilde M_n\lambda\left(|[A\Delta B]_{\mathcal{G}}|_1 + |[A\Delta B]_{\mathcal{G}^c}|_1 + 2|[A\Omega_* B]_{\mathcal{G}}|_1\right)\nonumber\\
&= \tfrac{1}{8}\kappa^2\|\Delta\|_F^2 - \left(\tfrac{3}{2} + \tilde M_n\right)\lambda|[A\Delta B]_{\mathcal{G}}|_1 + \left(\tfrac{1}{2} - \tilde M_n\right)\lambda|[A\Delta B]_{\mathcal{G}^c}|_1 - 2\tilde M_n\lambda|[A\Omega_* B]_{\mathcal{G}}|_1, \tag{28}
\end{align}
and because $\tilde M_n = o(1)$ by assumption, for sufficiently large $n$, (28) implies
\begin{align}
\tilde D(\Delta) &\geq \tfrac{1}{8}\kappa^2\|\Delta\|_F^2 - \left(\tfrac{3}{2} + \tilde M_n\right)\lambda|[A\Delta B]_{\mathcal{G}}|_1 - 2\tilde M_n\lambda|[A\Omega_* B]_{\mathcal{G}}|_1\nonumber\\
&\geq \|\Delta\|_F^2\left\{\tfrac{1}{8}\kappa^2 - \left(\tfrac{3}{2} + \tilde M_n\right)\frac{\lambda\,\xi(p,\mathcal{G})}{\|\Delta\|_F} - \frac{2\tilde M_n\lambda|[A\Omega_* B]_{\mathcal{G}}|_1}{\|\Delta\|_F^2}\right\}\nonumber\\
&= \|\Delta\|_F^2\left[\tfrac{1}{8}\kappa^2 - \frac{\lambda}{\|\Delta\|_F}\left\{\left(\tfrac{3}{2} + \tilde M_n\right)\xi(p,\mathcal{G}) + \frac{2\tilde M_n|[A\Omega_* B]_{\mathcal{G}}|_1}{\|\Delta\|_F}\right\}\right]. \tag{29}
\end{align}
Since $\|\Delta\|_F = \epsilon$ and $\lambda \leq \epsilon\kappa^2(\tau Q_\epsilon)^{-1}$ for $\tau > 8$, where $Q_\epsilon = (\tfrac{3}{2} + \tilde M_n)\xi(p,\mathcal{G}) + 2\tilde M_n|[A\Omega_* B]_{\mathcal{G}}|_1\epsilon^{-1}$, the inequality in (29) implies
\[
\tilde D(\Delta) \geq \epsilon^2\left[\tfrac{1}{8}\kappa^2 - \frac{\lambda}{\epsilon}\left\{\left(\tfrac{3}{2} + \tilde M_n\right)\xi(p,\mathcal{G}) + \frac{2\tilde M_n|[A\Omega_* B]_{\mathcal{G}}|_1}{\epsilon}\right\}\right] \geq \epsilon^2\left(\tfrac{1}{8}\kappa^2 - \tfrac{1}{\tau}\kappa^2\right) > 0,
\]
which establishes the desired result.
Lemma 6. If the conditions of Lemma 5 are true, then $\hat\Delta = \tilde\Omega - \Omega_*$ belongs to the set
\[
\left\{\Delta \in \mathbb{S}^{p_n} : |[A\Delta B]_{\mathcal{G}^c}|_1 \leq \frac{(3 + 2\tilde M_n)|[A\Delta B]_{\mathcal{G}}|_1 + 4\tilde M_n|[A\Omega_* B]_{\mathcal{G}}|_1}{1 - 2\tilde M_n}\right\}.
\]

Proof. Because $\tilde\Omega$ minimizes $\tilde f$, we have $\tilde D(\hat\Delta) \leq 0$. Using the same arguments as in the proof of Lemma 1 from Negahban et al. (2012) together with the bounds used to obtain (28), and dropping the non-negative quadratic term,
\[
0 \geq \tilde D(\hat\Delta) \geq \frac{\lambda}{2}\left\{-(3 + 2\tilde M_n)|[A\hat\Delta B]_{\mathcal{G}}|_1 + (1 - 2\tilde M_n)|[A\hat\Delta B]_{\mathcal{G}^c}|_1 - 4\tilde M_n|[A\Omega_* B]_{\mathcal{G}}|_1\right\},
\]
so that
\[
|[A\hat\Delta B]_{\mathcal{G}^c}|_1 \leq \frac{(3 + 2\tilde M_n)|[A\hat\Delta B]_{\mathcal{G}}|_1 + 4\tilde M_n|[A\Omega_* B]_{\mathcal{G}}|_1}{1 - 2\tilde M_n},
\]
which is the desired inequality.

Proof of Theorem 2.
Set $\lambda = K\tau^{-1}(n^{-1}\log p)^{1/2}$ and $\epsilon = \lambda Q_\epsilon\tau\kappa^{-2}$, so that $\lambda = \epsilon\kappa^2(Q_\epsilon\tau)^{-1}$ as required by Lemma 5. We can simplify the expression for $\epsilon$ by solving
\[
\frac{\epsilon\kappa^2}{K(n^{-1}\log p)^{1/2}} = \left(\tfrac{3}{2} + \tilde M_n\right)\xi(p, \mathcal{G}) + \frac{2\tilde M_n|[A\Omega_* B]_{\mathcal{G}}|_1}{\epsilon},
\]
or equivalently,
\[
\frac{\epsilon^2\kappa^2}{K(n^{-1}\log p)^{1/2}} - \epsilon\left(\tfrac{3}{2} + \tilde M_n\right)\xi(p, \mathcal{G}) - 2\tilde M_n|[A\Omega_* B]_{\mathcal{G}}|_1 = 0. \tag{30}
\]
Using the quadratic formula to solve (30) for $\epsilon$ and keeping the positive root,
\[
\epsilon = \frac{K}{2\kappa^2}\sqrt{\frac{\log p}{n}}\left[\left(\tfrac{3}{2} + \tilde M_n\right)\xi(p, \mathcal{G}) + \left\{\left(\tfrac{3}{2} + \tilde M_n\right)^2\xi(p, \mathcal{G})^2 + \frac{8\tilde M_n\kappa^2}{K}\sqrt{\frac{n}{\log p}}\,|[A\Omega_* B]_{\mathcal{G}}|_1\right\}^{1/2}\right]. \tag{31}
\]
To simplify the result, we find an $\tilde\epsilon$ such that $\epsilon \leq \tilde\epsilon$. Then $\|\tilde\Omega - \Omega_*\|_F \leq \epsilon$ implies $\|\tilde\Omega - \Omega_*\|_F \leq \tilde\epsilon$, so $\|B^+(S - \Omega_*^{-1})A^+\|_\infty \leq \lambda/2$ also implies $\|\tilde\Omega - \Omega_*\|_F \leq \tilde\epsilon$. Viewing the square root in (31) as the Euclidean norm of the sum of the square roots of its two terms, we use the triangle inequality to obtain
\[
\epsilon \leq \frac{K}{\kappa^2}\sqrt{\frac{\log p}{n}}\left(\tfrac{3}{2} + \tilde M_n\right)\xi(p, \mathcal{G}) + \sqrt{2}\left(\frac{K}{\kappa^2}\right)^{1/2}\left\{\tilde M_n|[A\Omega_* B]_{\mathcal{G}}|_1\right\}^{1/2}\left(\frac{\log p}{n}\right)^{1/4} \equiv \tilde\epsilon.
\]
Then, applying Lemma 5 and Lemma A.3 from Bickel and Levina (2008), there exist constants $C_1$ and $C_2$ such that for sufficiently large $n$,
\[
P\left(\|\tilde\Omega - \Omega_*\|_F \leq \tilde\epsilon\right) \geq P\left(\|B^+SA^+ - B^+\Omega_*^{-1}A^+\|_\infty \leq \frac{K}{2\tau}\sqrt{\frac{\log p}{n}}\right) \geq 1 - C_1 p^{2 - C_2K^2/(4\tau^2)},
\]
which establishes (i) because $1 - C_1 p^{2 - C_2K^2/(4\tau^2)} \to 1$ as $K \to \infty$.

To establish (ii), we bound $|\hat A_n\tilde\Omega\hat B_n - A\Omega_* B|_1$. By the triangle inequality,
\[
|\hat A_n\tilde\Omega\hat B_n - A\Omega_* B|_1 = |\hat A_n\tilde\Omega\hat B_n - A\tilde\Omega B + A\tilde\Omega B - A\Omega_* B|_1 \leq |\hat A_n\tilde\Omega\hat B_n - A\tilde\Omega B|_1 + |A\tilde\Omega B - A\Omega_* B|_1, \tag{32}
\]
and by the argument used to obtain the inequality in (27), $|\hat A_n\tilde\Omega\hat B_n - A\tilde\Omega B|_1 \leq \tilde M_n|A\tilde\Omega B|_1$. Using this bound on the first term in (32),
\[
|\hat A_n\tilde\Omega\hat B_n - A\Omega_* B|_1 \leq \tilde M_n|A\tilde\Omega B|_1 + |A\tilde\Omega B - A\Omega_* B|_1. \tag{33}
\]
Then, bounding the first term in (33), $|A\tilde\Omega B|_1 = |A\tilde\Omega B + A\Omega_* B - A\Omega_* B|_1 \leq |A\Omega_* B|_1 + |A\tilde\Omega B - A\Omega_* B|_1$, so that from (33),
\[
|\hat A_n\tilde\Omega\hat B_n - A\Omega_* B|_1 \leq \tilde M_n|A\Omega_* B|_1 + (\tilde M_n + 1)|A\tilde\Omega B - A\Omega_* B|_1. \tag{34}
\]
To bound the second term on the right-hand side of (34), we apply Lemma 6 to $\tilde\Delta = \tilde\Omega - \Omega_*$:
\begin{align}
|A\tilde\Omega B - A\Omega_* B|_1 &= |[A\tilde\Delta B]_{\mathcal{G}}|_1 + |[A\tilde\Delta B]_{\mathcal{G}^c}|_1\nonumber\\
&\leq |[A\tilde\Delta B]_{\mathcal{G}}|_1 + \frac{(3 + 2\tilde M_n)|[A\tilde\Delta B]_{\mathcal{G}}|_1 + 4\tilde M_n|[A\Omega_* B]_{\mathcal{G}}|_1}{1 - 2\tilde M_n}\nonumber\\
&= \frac{(1 - 2\tilde M_n)|[A\tilde\Delta B]_{\mathcal{G}}|_1 + (3 + 2\tilde M_n)|[A\tilde\Delta B]_{\mathcal{G}}|_1 + 4\tilde M_n|[A\Omega_* B]_{\mathcal{G}}|_1}{1 - 2\tilde M_n}\nonumber\\
&= \frac{4|[A\tilde\Delta B]_{\mathcal{G}}|_1 + 4\tilde M_n|[A\Omega_* B]_{\mathcal{G}}|_1}{1 - 2\tilde M_n}. \tag{35}
\end{align}
Because $\tilde M_n = o(1)$, there exist constants $C_3$ and $C_4$ such that for sufficiently large $n$, (35) implies
\[
|A\tilde\Omega B - A\Omega_* B|_1 \leq C_3\|\tilde\Omega - \Omega_*\|_F\,\xi(p, \mathcal{G}) + C_4\tilde M_n|[A\Omega_* B]_{\mathcal{G}}|_1.
\]
Combining this with (34) and $|A\Omega_* B|_1 = |[A\Omega_* B]_{\mathcal{G}}|_1$,
\begin{align}
|\hat A_n\tilde\Omega\hat B_n - A\Omega_* B|_1 &\leq (\tilde M_n + 1)\left\{C_3\|\tilde\Omega - \Omega_*\|_F\,\xi(p, \mathcal{G}) + C_4\tilde M_n|[A\Omega_* B]_{\mathcal{G}}|_1\right\} + \tilde M_n|[A\Omega_* B]_{\mathcal{G}}|_1\nonumber\\
&= C_3(\tilde M_n + 1)\|\tilde\Omega - \Omega_*\|_F\,\xi(p, \mathcal{G}) + C_4\left(\tilde M_n^2 + \tilde M_n + \tilde M_n C_4^{-1}\right)|[A\Omega_* B]_{\mathcal{G}}|_1,
\end{align}
and using that $\tilde M_n = o(1)$ together with the result from Theorem 2(i) for $\|\tilde\Omega - \Omega_*\|_F$, we obtain the result.

References
Bickel, P. J. and Levina, E. (2004). Some theory for Fisher's linear discriminant function, 'naive Bayes', and some alternatives when there are many more variables than observations. Bernoulli, 10(6):989-1010.

Bickel, P. J. and Levina, E. (2008). Regularized estimation of large covariance matrices. The Annals of Statistics, 36(1):199-227.

Boyd, S., Parikh, N., Chu, E., Peleato, B., and Eckstein, J. (2011). Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning, 3(1):1-122.

Brodie, J., Daubechies, I., De Mol, C., Giannone, D., and Loris, I. (2009). Sparse and stable Markowitz portfolios. Proceedings of the National Academy of Sciences, 106(30):12267-12272.

Burczynski, M. E., Peterson, R. L., Twine, N. C., Zuberek, K. A., Brodeur, B. J., Casciotti, L., Maganti, V., Reddy, P. S., Strahs, A., Immermann, F., et al. (2006). Molecular classification of Crohn's disease and ulcerative colitis patients using transcriptional profiles in peripheral blood mononuclear cells. The Journal of Molecular Diagnostics, 8(1):51-61.

Cai, T. and Liu, W. (2011). A direct estimation approach to sparse linear discriminant analysis. Journal of the American Statistical Association, 106(496):1566-1577.

Chen, G. and Teboulle, M. (1994). A proximal-based decomposition method for convex minimization problems. Mathematical Programming, 64(1-3):81-101.

Chen, X., Xu, M., and Wu, W. B. (2016). Regularized estimation of linear functionals of precision matrices for high-dimensional time series. IEEE Transactions on Signal Processing, 64(24):6459-6470.

Dalal, O. and Rajaratnam, B. (2014). G-AMA: sparse Gaussian graphical model estimation via alternating minimization. arXiv preprint arXiv:1405.3034.

Deng, W. and Yin, W. (2016). On the global and linear convergence of the generalized alternating direction method of multipliers. Journal of Scientific Computing, 66(3):889-916.

Fan, J., Feng, Y., and Tong, X. (2012). A road to classification in high dimensional space: the regularized optimal affine discriminant. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 74(4):745-771.

Fan, J., Feng, Y., and Wu, Y. (2009). Network exploration via the adaptive LASSO and SCAD penalties. The Annals of Applied Statistics, 3(2):521-541.

Fan, J., Liao, Y., and Liu, H. (2016). An overview of the estimation of large covariance and precision matrices. The Econometrics Journal, 19(1):C1-C32.

Friedman, J. H., Hastie, T. J., and Tibshirani, R. J. (2008). Sparse inverse covariance estimation with the graphical lasso. Biostatistics, 9:432-441.

Guo, J. (2010). Simultaneous variable selection and class fusion for high-dimensional linear discriminant analysis. Biostatistics, 11:599-608.

Lam, C. and Fan, J. (2009). Sparsistency and rates of convergence in large covariance matrix estimation. Annals of Statistics, 37(6B):4254-4278.

Lange, K. (2016). MM Optimization Algorithms. SIAM.

Ledoit, O. and Wolf, M. (2004). A well-conditioned estimator for large-dimensional covariance matrices. Journal of Multivariate Analysis, 88(2):365-411.

Mai, Q., Yang, Y., and Zou, H. (2015). Multiclass sparse discriminant analysis. arXiv preprint arXiv:1504.05845.

Mai, Q., Zou, H., and Yuan, M. (2012). A direct approach to sparse discriminant analysis in ultra-high dimensions. Biometrika, 99:29-42.

Markowitz, H. (1952). Portfolio selection. The Journal of Finance, 7(1):77-91.

Negahban, S. N., Yu, B., Wainwright, M. J., and Ravikumar, P. K. (2012). A unified framework for high-dimensional analysis of M-estimators with decomposable regularizers. Statistical Science, 27(4):538-557.

Pourahmadi, M. (2013). High-Dimensional Covariance Estimation: With High-Dimensional Data. John Wiley & Sons.

Price, B. S., Geyer, C. J., and Rothman, A. J. (2015). Ridge fusion in statistical learning. Journal of Computational and Graphical Statistics, 24(2):439-454.

Rothman, A. J. (2012). Positive definite estimators of large covariance matrices. Biometrika, 99:733-740.

Rothman, A. J., Bickel, P. J., Levina, E., and Zhu, J. (2008). Sparse permutation invariant covariance estimation. Electronic Journal of Statistics, 2:494-515.

Rothman, A. J. and Forzani, L. (2014). On the existence of the weighted bridge penalized Gaussian likelihood precision matrix estimator. Electronic Journal of Statistics, 8:2693-2700.

Rothman, A. J., Levina, E., and Zhu, J. (2009). Generalized thresholding of large covariance matrices. Journal of the American Statistical Association, 104(485):177-186.

Witten, D. M. and Tibshirani, R. (2011). Penalized classification using Fisher's linear discriminant. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 73(5):753-772.

Witten, D. M. and Tibshirani, R. J. (2009). Covariance-regularized regression and classification for high dimensional problems. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 71(3):615-636.

Xu, P., Zhu, J., Zhu, L., and Li, Y. (2015). Covariance-enhanced discriminant analysis. Biometrika, 102:33-45.

Xue, L., Ma, S., and Zou, H. (2012). Positive-definite L1-penalized estimation of large covariance matrices. Journal of the American Statistical Association, 107(500):1480-1491.

Yuan, M. and Lin, Y. (2007). Model selection and estimation in the Gaussian graphical model. Biometrika, 94(1):19-35.